# Linear Regression from Scratch

| Sample | Egg price | Gold price | Oil price | GDP |
|---:|---:|---:|---:|---:|
| 1 | 3 | 100 | 4 | 21 |
| 2 | 4 | 500 | 7 | 43 |

### Notations and Definitions

In [1]:
import numpy as np

#sample 1  $x^1$
x1 = np.array([3, 100, 4])
y1 = np.array([21])

#what's the idea of prediction?  What is machine learning?
#- find the weights that can bring you from x1 to y1

#first sample
#3 * w1 + 100 * w2 + 4 * w3 = 21
#3 * 1  + 100 * 1  + 4 * 1  = 107
#3 * 7  + 100 * 1  + 4 * -25  = 21

#machine learning is trying to find the `best` weights

#2nd sample
#4 * w1 + 500 * w2 + 7 * w3   = 43
#4 * 7  + 500 * 1  + 7 * -25  = 353 

#machine learning is trying to find the `best` weights ACROSS all samples....
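
The hand calculations above can be checked in NumPy. This is a minimal sketch that reuses the two samples from the table; the names `samples`, `labels`, and `candidates` are illustrative and do not come from the original notebook.

```python
import numpy as np

# The two samples from the table: egg price, gold price, oil price
samples = np.array([[3, 100, 4],
                    [4, 500, 7]])
labels = np.array([21, 43])  # GDP for each sample

# The two weight vectors tried by hand above
candidates = {
    "all ones": np.array([1, 1, 1]),
    "fits sample 1": np.array([7, 1, -25]),
}

for name, w in candidates.items():
    preds = samples @ w       # one prediction per sample
    errors = preds - labels   # how far each prediction is from the label
    print(name, preds, errors)
```

The weights `[7, 1, -25]` give zero error on the first sample but a large error on the second, which is why the weights have to be judged across all samples at once.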


In [42]:
#Definition of terms and notations

#2 samples
#3 features - egg price, gold price, oil price
    #features are the variables used for predicting the label
    #factors, independent variables, predictors, X

#egg price - x_1 --> always a vector,  e.g., [3, 4]
#gold price - x_2 --> always a vector, e.g., [100, 500]
#oil price - x_3 --> always a vector, e.g., [4, 7]
#we call egg price + gold price + oil price - whole `feature matrix` --> \mathbf{X}
    
#1 label - gdp
    #label is the variable that we want to predict....
    #target, outcome, y
    #y_1 = y = a vector of labels, e.g., [21, 43]
    
#Tips: small and big
# small mean

Math notations:

- normal a -> scalar (one number)
- bold  $\mathbf{a}$  --> vector (a 1D numpy array)
- bold  $\mathbf{A}$  --> matrix (a 2D numpy array....)

- $\mathbf{x}_1^2$  --> feature 1, second sample
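
The notation maps onto NumPy arrays roughly as follows. This is a small sketch using the toy table; the variable names are illustrative, and note that NumPy indexes from 0 while the math notation counts from 1.

```python
import numpy as np

a = 3.0                                   # scalar: one number
x_1 = np.array([3, 4])                    # vector: feature 1 (egg price) for all samples
X = np.array([[3, 100, 4],
              [4, 500, 7]])               # matrix: rows are samples, columns are features

# feature 1, second sample -> row index 1, column index 0
feature1_sample2 = X[1, 0]

print(np.ndim(a), x_1.ndim, X.ndim)  # number of dimensions of each object
print(feature1_sample2)
```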

### How does the dot product work?

In [2]:
X = np.array([  [3, 100, 4] , [4, 500, 7]  ])
X.shape  #(2, 3) means 2 samples = m, 3 features = n

(2, 3)

In [3]:
#weights = theta = params
theta = np.array([7, 1, -25])
theta.shape  #weights must have the same size as X.shape[1]

(3,)

In [4]:
# X.dot(theta)
#to be able to dot, the two inner (adjacent) dimensions must be the same
#(2, 3)  @ (3, ) = (2, )
#(4, 6)  @ (6, 1) = (4, 1)
#(4, 6, 1) @ (1, 2) = (4, 6, 2)
X @ theta

array([ 21, 353])

In [5]:
X[0][0] * theta[0] + X[0][1] * theta[1] + X[0][2] * theta[2]

21
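
The shape rule can be verified directly. The sketch below uses arrays of ones (hypothetical placeholders, only the shapes matter) and also shows what happens when the inner dimensions do not match.

```python
import numpy as np

A = np.ones((2, 3))
v = np.ones(3)
print((A @ v).shape)      # inner dimensions 3 and 3 match

B = np.ones((4, 6))
C = np.ones((6, 1))
print((B @ C).shape)      # inner dimensions 6 and 6 match

D = np.ones((4, 6, 1))
E = np.ones((1, 2))
print((D @ E).shape)      # leading dimension 4 is treated as a batch

# Mismatched inner dimensions raise an error
try:
    A @ np.ones(2)
except ValueError as exc:
    print("shape mismatch:", exc)
```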

### Steps for linear regression / gradient descent

Step 1: Initialize your weights (for example, randomly)
  - weight.shape (n, )

Step 2: Use these initial weights to predict
  - you will get errors

Step 3: Find the derivative

$\mathbf{X}^\top (\mathbf{\hat{y}} - \mathbf{y})$

Step 4: Change the weight

$\mathbf{w} = \mathbf{w} - \alpha \, \mathbf{X}^\top (\mathbf{\hat{y}} - \mathbf{y})$

Step 5: Repeat Steps 2, 3, and 4 until you either (1) reach the maximum number of iterations, or (2) your validation loss stops decreasing
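
The derivative in Step 3 can be recovered in two lines. This is a sketch under the assumption that the loss is half the sum of squared errors, which these notes do not state explicitly; the symbol $J$ is introduced here only for illustration, and $\mathbf{\hat{y}} = \mathbf{X}\mathbf{w}$.

$$
J(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{m} (\hat{y}^i - y^i)^2 = \frac{1}{2} (\mathbf{X}\mathbf{w} - \mathbf{y})^\top (\mathbf{X}\mathbf{w} - \mathbf{y})
$$

$$
\nabla_{\mathbf{w}} J = \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{y}) = \mathbf{X}^\top (\mathbf{\hat{y}} - \mathbf{y})
$$

If the mean squared error is used as the loss instead, the gradient becomes $\frac{2}{m} \mathbf{X}^\top (\mathbf{\hat{y}} - \mathbf{y})$. The direction is the same, and the constant factor can be absorbed into $\alpha$.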

### Let's code

#### Step 1: Load some toy dataset

In [6]:
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

X = diabetes.data
y = diabetes.target

#print the shape of X and y
print(X.shape, y.shape)
assert X.ndim == 2
assert y.ndim == 1

#print one row of X, and maybe try to see what it is...
#print one row of y, and maybe try to see what it is....
# X[0]
# y[0]
# diabetes.feature_names
# label is a quantitative measure of disease progression one year after baseline

#please help me set m and n
m = X.shape[0]  #number of samples
n = X.shape[1]  #number of features

#write an assert to check that X and y have the same number of samples...
assert m == y.shape[0]

Note: We skip EDA and cleaning, because we are lazy; but actually this dataset is already clean...

#### Step 2: Train test split

In [7]:
from sklearn.model_selection import train_test_split

#split here
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.3, random_state = 9999
)

#assert that X_train and y_train have the same amount of samples
assert X_train.shape[0] == y_train.shape[0]

#assert that X_test and y_test have the same amount of samples
assert X_test.shape[0] == y_test.shape[0]
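
To see what a shuffled split does under the hood, here is a NumPy-only sketch. The function `simple_split` and the demo arrays are hypothetical and are not how `train_test_split` is implemented internally; the sketch only illustrates the idea of shuffling row indices once and applying the same indices to both `X` and `y`.

```python
import numpy as np

def simple_split(X, y, test_size=0.3, seed=0):
    """Shuffle the row indices once, then slice X and y with the same indices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    n_test = int(np.ceil(test_size * X.shape[0]))  # round the test set size up
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Demo data: row i of X_demo is [2i, 2i + 1] and its label is i
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.arange(8)

Xtr, Xte, ytr, yte = simple_split(X_demo, y_demo, test_size=0.25)
print(Xtr.shape, Xte.shape)
```

Because the same index arrays are used for `X` and `y`, every row keeps its own label after shuffling, which is what the two asserts above check for the real split.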

#### Step 3: Standardization

In [8]:
#import the StandardScaler
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

#standardize the training set
X_train = sc.fit_transform(X_train)

#standardize the test set
X_test = sc.transform(X_test)
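
The reason for `fit_transform` on the training set but only `transform` on the test set is easier to see when written out by hand. The sketch below uses synthetic data (hypothetical, for illustration only) and computes the mean and standard deviation from the training rows alone, then reuses them for the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Statistics come from the training set only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

train_std = (train - mu) / sigma
test_std = (test - mu) / sigma  # reuse training statistics, never refit on test data

print(train_std.mean(axis=0).round(6), train_std.std(axis=0).round(6))
print(test_std.mean(axis=0).round(3))
```

The standardized training columns have mean 0 and standard deviation 1 by construction. The test columns are only approximately so, which is expected: the test set must not influence the statistics used to scale it.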

#### Step 4: Add intercept to your X

In [9]:
# Example: if your X is        [  [3, 2, 4],    [2, 6, 8]  ]
# I want you to make it into   [  [1, 3, 2, 4], [1, 2, 6, 8]  ]
# Why 1?  because imagine you have another weight, which let's call w0
# this w0 is actually the intercept; so multiply with 1, will do nothing
# so we can still use X @ theta....

intercept = np.ones((X_train.shape[0], 1))
print(intercept.shape)

#hint: use np.concatenate with X_train on axis=1, to add these ones to X_train
X_train = np.concatenate((intercept, X_train), axis=1)

intercept = np.ones((X_test.shape[0], 1))
print(intercept.shape)

#hint: use np.concatenate with X_test on axis=1, to add these ones to X_test
X_test = np.concatenate((intercept, X_test), axis=1)


(309, 1)
(133, 1)
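
A quick check that the column of ones really does fold the intercept into the dot product. The values of `w0` and `w` below are made up for illustration.

```python
import numpy as np

X_small = np.array([[3.0, 2.0, 4.0],
                    [2.0, 6.0, 8.0]])
w0 = 10.0                          # intercept (hypothetical value)
w = np.array([1.0, 2.0, 3.0])      # feature weights (hypothetical values)

# Intercept added explicitly
explicit = w0 + X_small @ w

# Intercept folded into the matrix product via a leading column of ones
ones = np.ones((X_small.shape[0], 1))
X_aug = np.concatenate((ones, X_small), axis=1)
theta_aug = np.concatenate(([w0], w))
folded = X_aug @ theta_aug

print(explicit, folded)
```

Both computations give the same predictions, so a single `X @ theta` covers the weights and the intercept together.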


#### Step 5: Fitting!!! Gradient Descent

In [10]:
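#put everything into fit()

#1. initialize our theta
#create a theta of size (X_train.shape[1], ); here we start from all ones instead of random values
theta = np.ones(X_train.shape[1])
#why X_train.shape[1]? we need one weight per column of X_train (the features plus the intercept)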

#5. repeat 2, 3, 4
#please put a for loop for 2, 3, 4, for 1000 times
#set 1000 call it max_iter
#for _ in range(max_iter):
max_iter = 1000
alpha = 0.0001

def predict(X, theta):
    return X @ theta

def mean_squared_error(ytrue, ypred):
    return ((ypred - ytrue) ** 2).sum() / ytrue.shape[0]

def _grad(X, error):
    return X.T @ error

def fit(X_train, y_train, theta, max_iter, alpha):
    
    for i in range(max_iter):
        #2. predict
        yhat = predict(X_train, theta)  #put this into a function called predict(X_train, theta)

        #2.1 can you guys compute the squared error
        # squared_error = ((yhat - y_train) ** 2).sum()
        #print the mean squared error, we can see whether MSE goes down eventually...
        mse =  mean_squared_error(y_train, yhat)
        if i % 50 == 0:
            print(f"MSE: {mse}")  

        #3. get derivatives
        deriv = _grad(X_train, yhat - y_train)

        #4. update weight
        theta = theta - alpha * deriv
        
    return theta


In [11]:
theta = fit(X_train, y_train, theta, max_iter, alpha)

MSE: 28562.951917344537
MSE: 3897.336230339511
MSE: 2877.3736974835792
MSE: 2831.2558171055753
MSE: 2827.9119296024915
MSE: 2826.5530590194744
MSE: 2825.334848687645
MSE: 2824.1622448159806
MSE: 2823.0266260126527
MSE: 2821.925301403898
MSE: 2820.8564421902097
MSE: 2819.8185235878746
MSE: 2818.810206706694
MSE: 2817.8302933929094
MSE: 2816.877699352769
MSE: 2815.951434445826
MSE: 2815.0505871961686
MSE: 2814.1743123627202
MSE: 2813.3218208956955
MSE: 2812.492371795244
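
The loop above always runs for `max_iter` iterations. The other stopping rule listed earlier (stop once the validation loss no longer decreases) could be sketched as below. Everything here is an assumption for illustration: the function `fit_early_stop`, the tolerance `tol`, the zero initialization, and the synthetic data are not part of the original notebook.

```python
import numpy as np

def fit_early_stop(X_tr, y_tr, X_val, y_val, alpha=0.005, max_iter=5000, tol=1e-8):
    """Gradient descent that stops once validation MSE improves by less than tol."""
    theta = np.zeros(X_tr.shape[1])
    best_val = np.inf
    for i in range(max_iter):
        # same update rule as in fit(): theta <- theta - alpha * X^T (yhat - y)
        candidate = theta - alpha * X_tr.T @ (X_tr @ theta - y_tr)
        val_mse = np.mean((X_val @ candidate - y_val) ** 2)
        if best_val - val_mse < tol:
            return theta, i  # keep the last theta that still improved
        theta, best_val = candidate, val_mse
    return theta, max_iter

# Synthetic data with a known answer (hypothetical, only for illustration)
rng = np.random.default_rng(0)
true_theta = np.array([2.0, 1.0, -3.0, 0.5])
X_all = np.concatenate((np.ones((100, 1)), rng.normal(size=(100, 3))), axis=1)
y_all = X_all @ true_theta + rng.normal(scale=0.1, size=100)

# First 80 rows for training, last 20 rows for validation
theta_hat, n_iter = fit_early_stop(X_all[:80], y_all[:80], X_all[80:], y_all[80:])
print(n_iter, theta_hat.round(2))
```

On real data the validation set would be carved out of the training set, so the test set stays untouched until the final evaluation.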


#### Step 6: Testing

In [12]:
yhat = predict(X_test, theta)

mean_squared_error(y_test, yhat)

3079.2424482139854
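
A raw MSE is hard to interpret on its own. One common way to put it in context is to compare it against a baseline that always predicts the mean, or equivalently to compute $R^2$. The sketch below uses made-up numbers and hypothetical helper names (`mse`, `r2`); it does not reproduce the diabetes results.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

def r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up labels and predictions, for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 6.5, 9.5])

# Baseline: always predict the mean of the labels
baseline = np.full_like(y_true_demo, y_true_demo.mean())

print("model MSE:   ", mse(y_true_demo, y_pred_demo))
print("baseline MSE:", mse(y_true_demo, baseline))
print("R^2:         ", r2(y_true_demo, y_pred_demo))
```

A model is only useful if its MSE is clearly below the baseline MSE, which corresponds to an $R^2$ well above 0.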
