### Suppose we have `n` features and `m` observations

| Index        | $X_{1}$       | $X_{2}$       | $X_{3}$       | .... | .... | $X_{n}$        | y        |
|--------------|---------------|---------------|---------------|------|------|----------------|----------|
| 1            | $x_{1}^{1} $  | $x_{2}^{1}$   | $x_{3}^{1}$   | ...  | ...  | $x_{n}^{1}$    | $y^{1}$  |
| 2            | $x_{1}^{2}$   | $x_{2}^{2}$   | $x_{3}^{2}$   | ...  | ...  | $x_{n}^{2}$    | $y^{2}$  |
| 3            | $x_{1}^{3}$   | $x_{2}^{3}$   | $x_{3}^{3}$   | ...  | ...  | $x_{n}^{3}$    | $y^{3}$  |
| .            | .             | .             | .             | ...  | ...  | .              |          |
| .            | .             | .             | .             | ...  | ...  | .              |          |
| .            | .             | .             | .             | ...  | ...  | .              |          |
| m            | $x_{1}^{m}$   | $x_{2}^{m}$   | $x_{3}^{m}$   | ...  | ...  | $x_{n}^{m}$    | $y^{m}$  |

Here the `subscript` denotes the feature and the `superscript` denotes the observation number.

Suppose the weights for the n features are denoted by a row vector $\beta$ of shape [1, n], with the bias (intercept) $\beta_{0}$ kept as a separate scalar:
<br/>
$$ \beta = \begin{bmatrix} \beta_{1} & \beta_{2} & \beta_{3} & .... &  \beta_{n} \end{bmatrix} $$

### We need to calculate the prediction for each observation

\begin{equation}
\hat{y}^{1} = \beta_{0} + \beta_{1}x_{1}^{1} + \beta_{2}x_{2}^{1} + \beta_{3}x_{3}^{1} + \dots + \beta_{n}x_{n}^{1}
\end{equation}

\begin{equation}
\hat{y}^{2} = \beta_{0} + \beta_{1}x_{1}^{2} + \beta_{2}x_{2}^{2} + \beta_{3}x_{3}^{2} + \dots + \beta_{n}x_{n}^{2}
\end{equation}


\begin{equation}
\hat{y}^{3} = \beta_{0} + \beta_{1}x_{1}^{3} + \beta_{2}x_{2}^{3} + \beta_{3}x_{3}^{3} + \dots + \beta_{n}x_{n}^{3}
\end{equation}

\begin{equation}
\vdots
\end{equation}

\begin{equation}
\hat{y}^{m} = \beta_{0} + \beta_{1}x_{1}^{m} + \beta_{2}x_{2}^{m} + \beta_{3}x_{3}^{m} + \dots + \beta_{n}x_{n}^{m}
\end{equation}

### In matrix form 

\begin{equation}
\begin{bmatrix} \hat{y}^{1} \\ \hat{y}^{2} \\ \hat{y}^{3} \\ .. \\.. \\  \hat{y}^{m} \end{bmatrix} = 
\begin{bmatrix} x_{1}^{1} & x_{2}^{1} & x_{3}^{1} & .... &  x_{n}^{1}
           \\ x_{1}^{2} & x_{2}^{2} & x_{3}^{2} & .... &  x_{n}^{2}
           \\ x_{1}^{3} & x_{2}^{3} & x_{3}^{3} & .... &  x_{n}^{3}
           \\.... &.... & ... &.... &....
           \\.... &.... & ... &.... &....
            \\ x_{1}^{m} & x_{2}^{m} & x_{3}^{m} & .... &  x_{n}^{m}
\end{bmatrix}
*
\begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \beta_{3} \\ .. \\.. \\  \beta_{n} \end{bmatrix}
+ \beta_{0}
\end{equation}

\begin{equation} \hat{y} = X\beta^{T} + \beta_{0} \end{equation}

#### Equivalent numpy implementation is:

 `y_hat = np.dot(X, B.T) + b`  
Where 
 $$ B = \beta $$
 $$ b = \beta_{0}$$
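
As a quick sanity check of the matrix form, the sketch below compares the vectorised prediction with the per-observation sums written out explicitly. The toy values of `X`, `B` and `b` are made up for illustration only.

```python
import numpy as np

# Hypothetical toy data: m = 3 observations, n = 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])       # shape [m, n]
B = np.array([[0.5, -1.0]])      # shape [1, n], row vector of weights
b = 2.0                          # scalar bias

# Vectorised prediction: [m, n] . [n, 1] -> [m, 1]; b is broadcast to every row
y_hat = np.dot(X, B.T) + b

# The same prediction written out one observation at a time
y_hat_loop = np.array([[b + sum(B[0, j] * X[i, j] for j in range(X.shape[1]))]
                       for i in range(X.shape[0])])

print(y_hat.shape)                      # (3, 1)
print(np.allclose(y_hat, y_hat_loop))   # True
```

Note that `b` is a scalar here, so NumPy broadcasting adds it to each of the m rows.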

### The Mean Squared Error cost function is:

$$ J(\beta, \beta_{0}) = \frac{1}{2m}\sum_{i=1}^{m}   (y^{i} - \hat{y}^{i})^{2} $$

#### Equivalent numpy implementation is:

`cost = (1/(2*m))*np.sum((y-y_hat)**2)`
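
The one-liner can be wrapped in a small helper to make the role of `m` explicit. This is only a sketch: the name `mse_cost` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def mse_cost(y, y_hat):
    """Half mean squared error: J = 1/(2m) * sum((y - y_hat)^2)."""
    m = y.shape[0]  # number of observations
    return (1 / (2 * m)) * np.sum((y - y_hat) ** 2)

# Illustrative column vectors of shape [m, 1]
y = np.array([[2.0], [4.0], [6.0]])
y_hat = np.array([[1.0], [2.0], [3.0]])

cost = mse_cost(y, y_hat)
print(cost)  # (1 + 4 + 9) / (2 * 3) = 14 / 6
```

A perfect prediction gives a cost of exactly zero, which is a convenient check when wiring the function into a training loop.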

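### Just for confirmation, let us take an example

| y | $\hat{y}$ | $(y - \hat{y})^{2}$ |
|---|---------|-----------------|
| 2 | 1       | 1               |
| 4 | 2       | 4               |
| 6 | 3       | 9               |
| Sum |       | 14              |

\begin{equation} 14/3 = 4.67 \end{equation}

Note that this is the plain mean of the squared errors. The cost $J$ defined above carries an extra factor of $\frac{1}{2}$, which gives $14/6 \approx 2.33$ for the same data.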

In [72]:
import numpy as np

y=np.array([[2],
           [4],
           [6]])
y_hat=np.array([[1],
           [2],
           [3]])

In [73]:
np.sum((y-y_hat)**2)/3

4.666666666666667

### Now let's calculate the gradients for gradient descent

### We have the loss function  defined as

$$ J(\beta, \beta_{0}) = \frac{1}{2m}\sum_{i=1}^{m}   (y^{i} - \hat{y}^{i})^{2}   $$

where: $$ \hat{y} = \beta_{0} + X\beta^{T} $$

So, by the chain rule (note that $\partial \hat{y}^{i} / \partial \beta_{0} = 1$):
$$ \frac {\partial J(\beta, \beta_{0})}{\partial \beta_{0}} = 
\frac{-1}{m}\sum_{i=1}^{m}(y^{i}-\hat{y}^{i}) $$

Again: $$ J(\beta, \beta_{0}) = \frac{1}{2m}\sum_{i=1}^{m}   (y^{i} - \hat{y}^{i})^{2}   $$

Writing $x^{i}$ for the i-th row of $X$, so that $\hat{y}^{i} = \beta_{0} + x^{i}\beta^{T}$:

$$ \frac {\partial J(\beta, \beta_{0})}{\partial \beta} = \frac{-1}{m}\sum_{i=1}^{m}   (y^{i} - \hat{y}^{i}) \, \frac{\partial (\beta_{0} + x^{i}\beta^{T})}{\partial \beta} $$

$$ \frac {\partial J(\beta, \beta_{0})}{\partial \beta} = 
\frac{-1}{m}\sum_{i=1}^{m}(y^{i}-\hat{y}^{i}) \, x^{i} \, \frac{\partial \beta^{T}}{\partial \beta} $$

where $$ \beta = \begin{bmatrix} \beta_{1} & \beta_{2} & \beta_{3} &.... & \beta_{n} \end{bmatrix}$$ 

and

$$ \beta^{T} = \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \beta_{3} \\.... \\ \beta_{n} \end{bmatrix}$$ 


Since: $$ \frac{\partial \beta^{T}}{\partial \beta} = I $$

### Just a side note on the calculation of $\frac{\partial \beta^{T}}{\partial \beta}$
where $$ \beta = \begin{bmatrix} \beta_{1} & \beta_{2} & \beta_{3} &.... & \beta_{n} \end{bmatrix}$$ 

and

$$ \beta^{T} = \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \beta_{3} \\.... \\ \beta_{n} \end{bmatrix}$$ 

$$ \frac{\partial \beta^{T}}{\partial \beta} = \begin{bmatrix} \frac{\partial \beta^{T}}{\partial \beta_{1}} & \frac{\partial \beta^{T}}{\partial \beta_{2}} & \frac{\partial \beta^{T}}{\partial \beta_{3}} & ....  & \frac{\partial \beta^{T}}{\partial \beta_{n}}\end{bmatrix} $$



$$ \frac{\partial \beta^{T}}{\partial \beta} = 
\begin{bmatrix} 
\frac{\partial \beta_{1}}{\partial \beta_{1}} & \frac{\partial \beta_{1}}{\partial \beta_{2}} & \frac{\partial \beta_{1}}{\partial \beta_{3}} & ..&..& \frac{\partial \beta_{1}}{\partial \beta_{n}}  
\\  
\frac{\partial \beta_{2}}{\partial \beta_{1}} & \frac{\partial \beta_{2}}{\partial \beta_{2}} & \frac{\partial \beta_{2}}{\partial \beta_{3}} &..&..& \frac{\partial \beta_{2}}{\partial \beta_{n}} 
\\ 
\frac{\partial \beta_{3}}{\partial \beta_{1}} & \frac{\partial \beta_{3}}{\partial \beta_{2}} & \frac{\partial \beta_{3}}{\partial \beta_{3}} &..&..& \frac{\partial \beta_{3}}{\partial \beta_{n}} 
\\ 
.. & .. & .. & ..  & .. &..
\\
.. & .. & .. & ..  & .. &..
\\
\frac{\partial \beta_{n}}{\partial \beta_{1}} & \frac{\partial \beta_{n}}{\partial \beta_{2}} & \frac{\partial \beta_{n}}{\partial \beta_{3}} & ..&..& \frac{\partial \beta_{n}}{\partial \beta_{n}}  
\end{bmatrix} $$


$$ \frac{\partial \beta^{T}}{\partial \beta} = 
\begin{bmatrix} 
1 & 0 & 0 & ..&..& 0  
\\  
0 & 1 & 0 &..&..& 0 
\\ 
0 & 0 & 1 &..&..& 0 
\\ 
.. & .. & .. & ..  & .. &..
\\
.. & .. & .. & ..  & .. &..
\\
0 & 0 & 0 & ..&..& 1  
\end{bmatrix} $$


i.e. $$  \frac{\partial \beta^{T}}{\partial \beta} = I_{n \times n} $$
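
The identity result can also be checked numerically. The sketch below builds the matrix of partial derivatives by finite differences, nudging one $\beta_{j}$ at a time; the values in `beta` and the step `eps` are arbitrary choices made for illustration.

```python
import numpy as np

n = 4
beta = np.array([0.3, -1.2, 2.0, 0.7])   # arbitrary illustrative values
eps = 1e-6

# Column j holds the change in every component of beta when beta_j is nudged
jacobian = np.zeros((n, n))
for j in range(n):
    nudged = beta.copy()
    nudged[j] += eps
    jacobian[:, j] = (nudged - beta) / eps

print(np.allclose(jacobian, np.eye(n)))  # True
```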

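Since: $$ X_{m \times n} \, I_{n \times n} = X $$ (so each row $x^{i}$ of $X$ is left unchanged),

So: $$ \frac {\partial J(\beta, \beta_{0})}{\partial \beta} = 
\frac{-1}{m}\sum_{i=1}^{m}(y^{i}-\hat{y}^{i}) \, x^{i} = \frac{-1}{m}(y-\hat{y})^{T}X $$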

### Shape of $(y-\hat{y})$ is `[m, 1]`, i.e. m rows and 1 column

### Shape of `X` is `[m, n]`, i.e. m observations and n features

### Required shape of `dB` is `[1, n]`, the same as `B`

### So we transpose the residual: [1, m] * [m, n] == [1, n]
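
The shape argument can be checked numerically. The sketch below, on random stand-in data, accumulates the residual of each observation times its row of `X`, then compares the result with the single matrix product of shape `[1, m] * [m, n]`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.normal(size=(m, n))        # [m, n]
y = rng.normal(size=(m, 1))        # [m, 1]
y_hat = rng.normal(size=(m, 1))    # [m, 1], stand-in predictions

residual = y - y_hat               # [m, 1]

# Sum form: add up residual_i * (i-th row of X), one observation at a time
dB_sum = np.zeros((1, n))
for i in range(m):
    dB_sum += residual[i, 0] * X[i:i + 1, :]
dB_sum *= -1 / m

# Matrix form: [1, m] . [m, n] -> [1, n]
dB_mat = (-1 / m) * np.dot(residual.T, X)

print(dB_mat.shape)                  # (1, 3)
print(np.allclose(dB_sum, dB_mat))   # True
```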

## Equivalent Numpy Implementation is:

 `dB = (-1/m)* np.dot((y-y_hat).T, X)`
 <br/>
  `db = (-1/m)*np.sum(y-y_hat)`
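
A common way to gain confidence in hand-derived gradients is a finite-difference check. The sketch below compares the analytic `dB` and `db` from the formulas above with central differences of the cost. The helper `cost_fn`, the random data and the step `eps` are assumptions made for this illustration.

```python
import numpy as np

def cost_fn(B, b, X, y):
    """Half mean squared error for weights B [1, n] and scalar bias b."""
    m = X.shape[0]
    y_hat = np.dot(X, B.T) + b
    return (1 / (2 * m)) * np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(1)
m, n = 20, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=(m, 1))
B = rng.normal(size=(1, n))
b = 0.5

# Analytic gradients
y_hat = np.dot(X, B.T) + b
dB = (-1 / m) * np.dot((y - y_hat).T, X)
db = (-1 / m) * np.sum(y - y_hat)

# Central finite differences, one parameter at a time
eps = 1e-6
dB_num = np.zeros_like(B)
for j in range(n):
    step = np.zeros_like(B)
    step[0, j] = eps
    dB_num[0, j] = (cost_fn(B + step, b, X, y) - cost_fn(B - step, b, X, y)) / (2 * eps)
db_num = (cost_fn(B, b + eps, X, y) - cost_fn(B, b - eps, X, y)) / (2 * eps)

print(np.allclose(dB, dB_num, atol=1e-6))   # True
print(abs(db - db_num) < 1e-6)              # True
```

Because the cost is quadratic in the parameters, central differences agree with the analytic gradient up to floating-point rounding.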

In [74]:
def propagate(B, b, X, Y):
    """
    params:
    B: weights of size [1, X.shape[1]]
    b: bias
    X: matrix of observations and features size [X.shape[0], X.shape[1]]
    Y: actual observations (pandas Series or array) with Y.shape[0] entries

    returns:
    grads: dict of gradients, dB of shape same as B and db (a scalar).
    cost: MSE cost (a scalar)
    """

    ## m is no of observations ie rows of X
    m = X.shape[0]

    # Calculate hypothesis, shape [m, 1]
    y_hat = np.dot(X, B.T) + b

    # Reshape the targets into a column vector of shape [m, 1]
    y = np.asarray(Y).reshape(-1, 1)

    # Compute cost
    cost = (1/(2*m))*np.sum((y-y_hat)**2)

    # BACKWARD PROPAGATION (TO FIND GRAD)
    # [1, m] . [m, n] -> [1, n], same shape as B
    dB = (-1/m)* np.dot((y-y_hat).T, X)

    db = -np.sum(y-y_hat)/m

    grads = {"dB": dB,
             "db": db}

    return grads, cost

In [None]:
def optimize(B, b, X, Y, num_iterations, learning_rate):
    """
    params:
    B: weights of size [1, X.shape[1]]
    b: bias
    X: matrix of observations and features size [X.shape[0], X.shape[1]]
    Y: matrix of actual observation size [Y.shape[0], 1]
    num_iterations: number of iterations
    learning_rate: learning rate
    returns:
    params: parameters B of shape [1, X.shape[1]] and bias
    grads: dict of gradients, dB of shape same as B and db
    costs:  MSE cost 
    """
    costs = []
    
    for i in range(num_iterations):
        
        
        # Cost and gradient calculation call function propagate
        grads, cost = propagate(B,b,X,Y)
        
        # Retrieve derivatives from grads
        dB = grads["dB"]
        db = grads["db"]
        
        # update parameters
        B = B - learning_rate * dB
        b = b - learning_rate * db
        
        costs.append(cost)
    
    params = {"B": B,
              "b": b}
    
    grads = {"dB": dB,
             "db": db}
    
    return params, grads, costs
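
To see the update rule at work, here is a minimal self-contained sketch on synthetic, noise-free data. It inlines the same cost and gradient formulas rather than calling the notebook's functions. The data, `true_B`, `true_b` and the hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 200, 2
X = rng.normal(size=(m, n))
true_B = np.array([[2.0, -3.0]])     # hypothetical "ground truth" weights
true_b = 1.0                         # hypothetical "ground truth" bias
y = np.dot(X, true_B.T) + true_b     # noise-free targets, shape [m, 1]

B = np.zeros((1, n))
b = 0.0
learning_rate = 0.1
costs = []

for _ in range(1000):
    y_hat = np.dot(X, B.T) + b
    costs.append((1 / (2 * m)) * np.sum((y - y_hat) ** 2))
    dB = (-1 / m) * np.dot((y - y_hat).T, X)
    db = (-1 / m) * np.sum(y - y_hat)
    # Step against the gradient
    B = B - learning_rate * dB
    b = b - learning_rate * db

print(B, b)                   # expected to be close to true_B and true_b
print(costs[0], costs[-1])    # the cost should have dropped sharply
```

With no noise in the targets, the recovered parameters should match the generating ones closely. On real data the cost would level off at a non-zero value instead.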

In [None]:
def predict(B, b, X):
    """
    params:
    B: weights
    b: bias
    X: matrix of observations and features

    returns:
    Y_prediction: predictions of shape [X.shape[0], 1]
    """
    # Compute predictions for X
    Y_prediction = np.dot(X, B.T) + b
    return Y_prediction

In [None]:
def model(X_train, Y_train, X_test, Y_test, num_iterations = 2000, learning_rate = 0.5):
    """
    params: 
    X_train: X_train
    Y_train: Y_train
    X_test: X_test
    Y_test: Y_test
    
    returns:
    d: dictionary
    """
    
    
    # initialize parameters with zeros 
    B = np.zeros(shape=(1, X_train.shape[1]))
    b = 0
    
    # Gradient descent
    parameters, grads, costs = optimize(B, b, X_train, Y_train, num_iterations, learning_rate)
    
    # Retrieve parameters B and b from dictionary "parameters"
    B = parameters["B"]
    b = parameters["b"]
    
    # Predict test/train set examples
    Y_prediction_test = predict(B, b, X_test)
    Y_prediction_train = predict(B, b, X_train)
    
    Y_train = Y_train.values.reshape(Y_train.shape[0], 1)
    Y_test = Y_test.values.reshape(Y_test.shape[0], 1)

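    # Print train/test scores. Note: despite the label, this is 100 minus the
    # mean absolute error (in %), not an accuracy in the classification sense.
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))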

    
    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test, 
         "Y_prediction_train" : Y_prediction_train, 
         "B" : B, 
         "b" : b,
         "learning_rate" : learning_rate,
         "num_iterations": num_iterations}
    
    return d
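
Since this is a regression problem, it may be more informative to report standard regression metrics alongside the printed score. The sketch below computes MAE, RMSE and $R^{2}$ with plain NumPy. The helper name `regression_metrics` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for targets and predictions of equal length."""
    y_true = np.asarray(y_true, dtype=float).reshape(-1, 1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1, 1)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_res = np.sum(errors ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)    # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

metrics = regression_metrics([2, 4, 6], [1, 2, 3])
print(metrics)
```

An $R^{2}$ below zero, as these toy values produce, means the predictions are worse than always predicting the mean of the targets.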

In [None]:
import pandas as pd
df = pd.read_csv('USA_Housing.csv')

In [None]:
df

In [None]:
df.drop(['Address'],axis=1,inplace=True)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

pre_process = preprocessing.StandardScaler()

# Mean-normalise every column (features and target) to a comparable range
df_norm = (df - df.mean()) / (df.max() - df.min())

# Putting feature variable to X
X = df_norm[['Avg. Area Income','Avg. Area House Age','Avg. Area Number of Rooms','Avg. Area Number of Bedrooms','Area Population']]

# Putting response variable to y
y = df_norm['Price']

# Standardise the features to zero mean and unit variance
X = pd.DataFrame(pre_process.fit_transform(X))

# random_state is the seed used by the random number generator, it can be any integer.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=2)
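
The cell above computes the scaling statistics on the full dataset before splitting. A common alternative is to compute them on the training rows only and reuse them for the test rows, so that no information from the test set leaks into preprocessing. The NumPy-only sketch below illustrates the idea on made-up data; it is not the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical raw features on very different scales
X = rng.normal(loc=[50.0, 5.0], scale=[10.0, 1.0], size=(100, 2))

# Simple 70/30 split by shuffled index
idx = rng.permutation(X.shape[0])
train_idx, test_idx = idx[:70], idx[70:]
X_train, X_test = X[train_idx], X[test_idx]

# Statistics come from the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# The same statistics are applied to both splits
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0))   # approximately 0 for each feature
print(X_train_scaled.std(axis=0))    # approximately 1 for each feature
```

The test split will generally not have exactly zero mean and unit variance after this transform, which is expected.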



In [None]:
df_norm.head()

In [None]:
X_train.shape

In [None]:
X_train.head()

In [None]:
# y_train.reshape(y_train.shape[0],1)

In [None]:
y_test.shape[0]

In [None]:
model1 = model(X_train=X_train, Y_train=y_train, X_test=X_test, Y_test=y_test, num_iterations = 500, learning_rate = 0.001)

model2 = model(X_train=X_train, Y_train=y_train, X_test=X_test, Y_test=y_test, num_iterations = 500, learning_rate = 0.01)

model3 = model(X_train=X_train, Y_train=y_train, X_test=X_test, Y_test=y_test, num_iterations = 500, learning_rate = 0.1)

In [None]:
import matplotlib.pyplot as plt

# Plot the cost per iteration for each learning rate
plt.plot(model1['costs'], label='alpha 0.001')
plt.plot(model2['costs'], label='alpha 0.01')
plt.plot(model3['costs'], label='alpha 0.1')

plt.xlabel('iteration')
plt.ylabel('cost')
plt.legend()
plt.show()
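
The effect of the learning rate can also be compared without a plot, by looking at the cost reached after a fixed number of iterations. The sketch below runs the same kind of batch gradient descent on synthetic data. The helper `run_gradient_descent` and the data are assumptions for illustration, and the numbers will differ from those on the housing data.

```python
import numpy as np

def run_gradient_descent(X, y, learning_rate, num_iterations):
    """Plain batch gradient descent; returns the cost recorded at every iteration."""
    m, n = X.shape
    B = np.zeros((1, n))
    b = 0.0
    costs = []
    for _ in range(num_iterations):
        y_hat = np.dot(X, B.T) + b
        costs.append((1 / (2 * m)) * np.sum((y - y_hat) ** 2))
        # Both gradients use the residual computed before either update
        B = B - learning_rate * (-1 / m) * np.dot((y - y_hat).T, X)
        b = b - learning_rate * (-1 / m) * np.sum(y - y_hat)
    return costs

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))   # synthetic, roughly standardised features
y = np.dot(X, np.array([[1.0], [-2.0], [0.5]])) + 0.3

final_costs = {lr: run_gradient_descent(X, y, lr, 500)[-1] for lr in (0.001, 0.01, 0.1)}
for lr, cost in final_costs.items():
    print(lr, cost)
```

On well-scaled features like these, the smallest learning rate is still far from converged after 500 iterations, while the largest of the three has essentially reached the minimum.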

In [None]:
# model4 = model(X_train=X_train, Y_train=y_train, X_test=X_test, Y_test=y_test, num_iterations = 10, learning_rate = 0.01)

# model5 = model(X_train=X_train, Y_train=y_train, X_test=X_test, Y_test=y_test, num_iterations = 100, learning_rate = 0.01)

# model6 = model(X_train=X_train, Y_train=y_train, X_test=X_test, Y_test=y_test, num_iterations = 1000, learning_rate = 0.01)