## How to Debug Gradients

![Gradient debugging](https://upload-images.jianshu.io/upload_images/9140378-172f399bf0f7d540.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/440)

## $\frac{dJ}{d\theta} = \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$
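As a quick sanity check of the central-difference formula above, the following minimal sketch applies it to a one-dimensional function. The helper name `central_difference` and the test function $J(\theta) = \sin\theta$ are assumptions chosen for illustration; they do not appear in the original notebook.

```python
import math

def central_difference(f, x, epsilon=0.01):
    """Approximate f'(x) with the symmetric difference quotient."""
    return (f(x + epsilon) - f(x - epsilon)) / (2 * epsilon)

# Hypothetical test function: J(theta) = sin(theta), exact derivative cos(theta).
theta = 1.0
approx = central_difference(math.sin, theta)
exact = math.cos(theta)
error = abs(approx - exact)

print("numerical:", approx)
print("exact:    ", exact)
print("abs error:", error)
```

For a smooth function the error of this estimate shrinks roughly in proportion to $\epsilon^2$, so even a fairly large $\epsilon$ such as 0.01 gives a usable check.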

### For a multidimensional vector $\theta = (\theta_0,\theta_1,\theta_2,...,\theta_n)$, the gradient becomes $\frac{\partial J}{\partial \theta} = (\frac{\partial J}{\partial \theta_0},\frac{\partial J}{\partial \theta_1},\frac{\partial J}{\partial \theta_2},...,\frac{\partial J}{\partial \theta_n})$

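### Each partial derivative can be approximated with the one-dimensional approach, perturbing one coordinate at a time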

$\frac{\partial J}{\partial \theta_0} = \frac{J(\theta_0^{+}) - J(\theta_0^{-})}{2\epsilon} , \theta_0^{+} = (\theta_0 + \epsilon,\theta_1,\theta_2,...,\theta_n) , \theta_0^{-} = (\theta_0 - \epsilon,\theta_1,\theta_2,...,\theta_n)$

$\frac{\partial J}{\partial \theta_1} = \frac{J(\theta_1^{+}) - J(\theta_1^{-})}{2\epsilon} , \theta_1^{+} = (\theta_0 ,\theta_1 + \epsilon,\theta_2,...,\theta_n) , \theta_1^{-} = (\theta_0,\theta_1 - \epsilon,\theta_2,...,\theta_n)$

$\frac{\partial J}{\partial \theta_2} = \frac{J(\theta_2^{+}) - J(\theta_2^{-})}{2\epsilon} , \theta_2^{+} = (\theta_0 ,\theta_1,\theta_2 + \epsilon,...,\theta_n) , \theta_2^{-} = (\theta_0,\theta_1,\theta_2 - \epsilon,...,\theta_n)$

$\vdots$

$\frac{\partial J}{\partial \theta_n} = \frac{J(\theta_n^{+}) - J(\theta_n^{-})}{2\epsilon} , \theta_n^{+} = (\theta_0 ,\theta_1,\theta_2,...,\theta_n + \epsilon) , \theta_n^{-} = (\theta_0,\theta_1,\theta_2,...,\theta_n - \epsilon)$

In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [2]:
np.random.seed(666)
X = np.random.random(size=(1000,10))

In [3]:
true_theta = np.arange(1,12,dtype=float)

In [4]:
true_theta

array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.])

In [5]:
X.shape

(1000, 10)

In [6]:
X_b = np.hstack([np.ones((len(X),1)),X]) # prepend a column of ones for the intercept term
X_b

array([[1.        , 0.70043712, 0.84418664, ..., 0.04881279, 0.09992856,
        0.50806631],
       [1.        , 0.20024754, 0.74415417, ..., 0.11285765, 0.11095367,
        0.24766823],
       [1.        , 0.0232363 , 0.72732115, ..., 0.25913185, 0.58381262,
        0.32569065],
       ...,
       [1.        , 0.88593917, 0.49480495, ..., 0.50598273, 0.86447115,
        0.31128276],
       [1.        , 0.81051618, 0.87890841, ..., 0.37299025, 0.81523744,
        0.31074351],
       [1.        , 0.75052272, 0.98612317, ..., 0.26679141, 0.34224855,
        0.02366081]])

In [7]:
X_b.shape

(1000, 11)

In [8]:
y = X_b.dot(true_theta) + np.random.normal(size=1000) # add normally distributed noise

In [9]:
y.shape

(1000,)

In [10]:
X.shape

(1000, 10)

In [11]:
def J(theta,X_b,y): # loss function (mean squared error)
    try:
        return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
    except Exception: # treat any numerical failure as an infinite loss
        return float('inf')

In [12]:
def dJ_math(theta,X_b,y): # analytic gradient at theta
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)
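The expression returned by `dJ_math` follows from differentiating the loss `J` directly. A short derivation, writing $m$ for the number of samples (that is, `len(y)`; the symbol $m$ is an assumption introduced here for illustration):

$$J(\theta) = \frac{1}{m}\,(y - X_b\theta)^{T}(y - X_b\theta)$$

$$\frac{\partial J}{\partial \theta} = -\frac{2}{m}\,X_b^{T}(y - X_b\theta) = \frac{2}{m}\,X_b^{T}(X_b\theta - y)$$

The last form matches `X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)` term by term.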

Formula applied to compute each res[i]: $\frac{\partial J}{\partial \theta_i} = \frac{J(\theta_i^{+}) - J(\theta_i^{-})}{2\epsilon}$

In [13]:
def dJ_debug(theta,X_b,y,epsilon=0.01):
    res = np.empty(len(theta)) # result vector with the same length as theta
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[i] += epsilon # add epsilon to the i-th coordinate
        theta_2 = theta.copy()
        theta_2[i] -= epsilon # subtract epsilon from the i-th coordinate
        res[i] = (J(theta_1,X_b,y) - J(theta_2,X_b,y)) / (2 * epsilon)
    return res
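Before plugging either gradient into gradient descent, the two can be compared directly at a single point. The sketch below is one possible way to do that. It repeats the definitions of `J`, `dJ_math`, and `dJ_debug` so that it runs on its own, and it uses a small synthetic dataset, a seed, and a relative-error measure (`rel_error`) that are all assumptions made for this illustration rather than part of the original notebook.

```python
import numpy as np

def J(theta, X_b, y):
    # Mean squared error loss
    return np.sum((y - X_b.dot(theta)) ** 2) / len(y)

def dJ_math(theta, X_b, y):
    # Analytic gradient of the loss
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

def dJ_debug(theta, X_b, y, epsilon=0.01):
    # Numerical gradient: central difference along each coordinate
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)
    return res

# Hypothetical small dataset, only for this check
rng = np.random.default_rng(0)
X = rng.random((200, 3))
X_b = np.hstack([np.ones((len(X), 1)), X])
y = X_b.dot(np.array([1., 2., 3., 4.])) + rng.normal(size=200)

# Compare the two gradients at an arbitrary point
theta = rng.normal(size=X_b.shape[1])
numeric = dJ_debug(theta, X_b, y)
analytic = dJ_math(theta, X_b, y)

# Relative error: close to zero when the analytic gradient is correct
rel_error = np.linalg.norm(numeric - analytic) / (
    np.linalg.norm(numeric) + np.linalg.norm(analytic))
print("relative error:", rel_error)
```

Because this loss is quadratic in theta, the central difference should agree with the analytic gradient up to floating-point rounding. A relative error that is not tiny would point to a mistake in the analytic formula.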

In [14]:
def gradient_descent(dJ,X_b,y,initial_theta,eta,n_iters=1e4,epsilon=0.01):
    theta = initial_theta
    cur_iter = 0 # iteration counter, starting from 0

    while cur_iter < n_iters:
        gradient = dJ(theta,X_b,y)  # compute the gradient
        last_theta = theta
        theta = theta - eta * gradient
        if abs(J(theta,X_b,y) - J(last_theta,X_b,y)) < epsilon:  # stop once the loss changes by less than epsilon
            break
        cur_iter += 1
    return theta

In [15]:
X_b = np.hstack([np.ones((len(X),1)),X])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01

%time theta = gradient_descent(dJ_debug,X_b,y,initial_theta,eta)
theta

Wall time: 168 ms


array([8.26200061, 3.41547499, 3.50931419, 3.8359876 , 4.53870159,
       4.88207109, 5.10522801, 5.7360655 , 6.11828751, 6.60422379,
       7.06073814])

In [17]:
%time theta = gradient_descent(dJ_math,X_b,y,initial_theta,eta)
theta

Wall time: 34.9 ms


array([8.26200061, 3.41547499, 3.50931419, 3.8359876 , 4.53870159,
       4.88207109, 5.10522801, 5.7360655 , 6.11828751, 6.60422379,
       7.06073814])