# Chapter 1: Linear Regression

<b>Linear regression</b> is often considered the foundation of machine learning. I vividly recall taking the Stats304 class at NU, which marked the beginning of my journey into machine learning and data science. Although the algorithm itself looks straightforward, it rests on a good deal of statistical theory. Linear regression is still widely used in finance, marketing, and many other industries, and on small or noisy datasets a simple model can outperform a larger, more complex one.

## Basic Assumptions of Linear Regression

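- Linearity: the response y is a linear function of the features X, plus noise
- Normality: the residuals (errors) are normally distributed; X itself does not need to be
- No multicollinearity: the features in X are not strongly correlated with one another
- Homoscedasticity: the residuals have constant variance across all values of X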
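
These assumptions can be checked roughly from the residuals of a fitted model. The sketch below uses synthetic data and plain NumPy, both of which are assumptions made for illustration and not the dataset or tooling used later in this chapter. The checks are informal heuristics, not formal statistical tests.

```python
import numpy as np

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Ordinary least squares fit with an explicit intercept column
A = np.column_stack([X, np.ones(n)])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
resid = y - fitted

# Linearity: residuals should show no leftover curvature in any feature,
# so their correlation with the squared features should be near zero
curvature = [np.corrcoef(resid, X[:, j] ** 2)[0, 1] for j in range(X.shape[1])]

# Normality: skewness and excess kurtosis of the residuals should be near zero
z = (resid - resid.mean()) / resid.std()
skewness = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3.0

# No multicollinearity: variance inflation factors (VIF) should be near 1.
# VIF_j is the j-th diagonal entry of the inverse feature correlation matrix.
vif = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

# Homoscedasticity: residual variance should be similar for low and high fitted values
order = np.argsort(fitted)
low, high = resid[order[: n // 2]], resid[order[n // 2:]]
variance_ratio = high.var() / low.var()

print("curvature correlations:", np.round(curvature, 3))
print("skewness:", round(skewness, 3), "excess kurtosis:", round(excess_kurtosis, 3))
print("VIF:", np.round(vif, 3))
print("variance ratio (high/low):", round(variance_ratio, 3))
```

A variance ratio far from 1, a large VIF, or strongly skewed residuals would each suggest that the corresponding assumption deserves a closer look.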

## Using the statsmodels Package

In [56]:
import statsmodels.api as sm
dataset = sm.datasets.spector.load()

# Add an intercept column; sm.OLS does not include one by default
x = sm.add_constant(dataset.exog, prepend=False)
y = dataset.endog

# Fit the regression by ordinary least squares
model = sm.OLS(y, x).fit()

# Print the statsmodels summary table
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  GRADE   R-squared:                       0.416
Model:                            OLS   Adj. R-squared:                  0.353
Method:                 Least Squares   F-statistic:                     6.646
Date:                Wed, 14 Jun 2023   Prob (F-statistic):            0.00157
Time:                        22:46:25   Log-Likelihood:                -12.978
No. Observations:                  32   AIC:                             33.96
Df Residuals:                      28   BIC:                             39.82
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
GPA            0.4639      0.162      2.864      0.0
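
The headline numbers in a summary like this one can be reproduced by hand from the least-squares solution. The sketch below does so with NumPy on synthetic data (the data, seed, and variable names are assumptions for illustration). It is a minimal outline of the standard OLS formulas, not statsmodels' internal code.

```python
import numpy as np

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))
y = 0.5 + X @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.5, size=n)

# Append the intercept column last, mirroring add_constant(..., prepend=False)
A = np.column_stack([X, np.ones(n)])

# Solve the normal equations (A^T A) beta = A^T y
beta = np.linalg.solve(A.T @ A, A.T @ y)

# R-squared: share of the variance in y explained by the fit
resid = y - A @ beta
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors p (intercept excluded)
p = X.shape[1]
adj_r_squared = 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)

# Standard errors and t statistics of the coefficients
sigma2 = ss_res / (n - p - 1)
std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
t_stats = beta / std_err

print("coefficients:", np.round(beta, 3))
print("R-squared:", round(r_squared, 3), "adjusted:", round(adj_r_squared, 3))
print("t statistics:", np.round(t_stats, 2))
```

Plugging the summary's own values into the adjusted formula (R-squared 0.416, 32 observations, 3 predictors) gives about 0.353, which matches the reported Adj. R-squared.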

## Using the scikit-learn Package

In [70]:
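from sklearn.linear_model import LinearRegression

# LinearRegression fits its own intercept by default (fit_intercept=True)
x = dataset.exog
model = LinearRegression().fit(x, y)
y_pred = model.predict(x)
# Fitted coefficients, one per column of x
model.coef_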

array([0.46385168, 0.01049512, 0.37855479, 0.        ])
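
A trailing zero in a coefficient array like this is what one would expect if the feature matrix still contained an explicit constant column while the model also fitted its own intercept. That is only one plausible reading of the output. The NumPy sketch below imitates the effect on synthetic data by centering the features before solving, which is an assumption about the mechanism and not scikit-learn's actual code.

```python
import numpy as np

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(2)
n = 50
X = rng.normal(size=(n, 3))
y = 0.3 + X @ np.array([0.5, 0.0, 0.4]) + rng.normal(scale=0.1, size=n)

# Feature matrix that still carries an explicit constant column (last position)
X_const = np.column_stack([X, np.ones(n)])

# Center features and target so the intercept can be recovered separately
X_centered = X_const - X_const.mean(axis=0)
y_centered = y - y.mean()

# The constant column is all zeros after centering, so the minimum-norm
# least-squares solution assigns it a coefficient of zero
coef, *_ = np.linalg.lstsq(X_centered, y_centered, rcond=None)
intercept = y.mean() - X_const.mean(axis=0) @ coef

print(np.round(coef, 4))
print(round(intercept, 4))
```

Under this reading the intercept is not lost. It is stored separately from the slope coefficients, which in scikit-learn is the `intercept_` attribute.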

## Standardize the Data

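When features have very different scales, standardizing them (subtracting each column's mean and dividing by its standard deviation) puts the coefficients on a common footing, so their magnitudes can be compared directly. For ordinary least squares this does not change the quality of the fit: the R-squared and t statistic reported below are the same as before, and only the coefficients are rescaled. Standardization matters more for gradient-based training, such as the from-scratch implementation later in this chapter, where features on similar scales help a single learning rate work for every weight.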

In [25]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
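
For reference, the transformation applied by `StandardScaler` is a per-column z-score using the population standard deviation. A minimal NumPy sketch on a small made-up array (`x_demo` and its values are hypothetical):

```python
import numpy as np

# Two hypothetical features on very different scales
x_demo = np.array([[2.5, 20.0],
                   [3.0, 25.0],
                   [3.5, 18.0],
                   [4.0, 29.0]])

mean = x_demo.mean(axis=0)
std = x_demo.std(axis=0)  # population standard deviation (ddof=0)
x_demo_scaled = (x_demo - mean) / std

# Each column now has mean 0 and standard deviation 1 (up to rounding error)
print(x_demo_scaled.mean(axis=0))
print(x_demo_scaled.std(axis=0))
```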

In [72]:
# Keep x_scaled free of the intercept column so it can be reused later
x_scaled_const = sm.add_constant(x_scaled, prepend=False)
model = sm.OLS(y, x_scaled_const).fit()
# Print the statsmodels summary table
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  GRADE   R-squared:                       0.416
Model:                            OLS   Adj. R-squared:                  0.353
Method:                 Least Squares   F-statistic:                     6.646
Date:                Wed, 14 Jun 2023   Prob (F-statistic):            0.00157
Time:                        22:52:18   Log-Likelihood:                -12.978
No. Observations:                  32   AIC:                             33.96
Df Residuals:                      28   BIC:                             39.82
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.2131      0.074      2.864      0.0
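
Comparing the two summaries, the first coefficient changes from 0.4639 to 0.2131 while its t statistic (2.864) and the R-squared (0.416) stay the same. The sketch below reproduces that pattern on synthetic data with NumPy. The data and the helper `ols` are assumptions for illustration.

```python
import numpy as np

# Synthetic features on different scales, assumed purely for illustration
rng = np.random.default_rng(3)
n = 80
X = rng.normal(loc=[3.0, 20.0], scale=[0.5, 4.0], size=(n, 2))
y = 0.2 + 0.4 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(scale=0.2, size=n)


def ols(features, target):
    """Least-squares fit with an intercept; returns (slopes, R-squared)."""
    A = np.column_stack([features, np.ones(len(target))])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ beta
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)
    return beta[:-1], r2


std = X.std(axis=0)
X_scaled = (X - X.mean(axis=0)) / std

slopes_raw, r2_raw = ols(X, y)
slopes_scaled, r2_scaled = ols(X_scaled, y)

# Each standardized slope equals the raw slope times that feature's std
print(np.round(slopes_scaled, 4), np.round(slopes_raw * std, 4))
print(round(r2_raw, 4), round(r2_scaled, 4))
```

A standardized coefficient therefore reads as the change in the response per one standard deviation of the feature, which is what makes the coefficients comparable across features.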

## From Scratch

In [36]:
import numpy as np


# Note: this class shares its name with scikit-learn's LinearRegression and
# shadows that import if both live in the same session.
class LinearRegression:
    """Linear regression trained with batch gradient descent."""

    def __init__(self, learning_rate, iterations):
        self.learning_rate = learning_rate
        self.iterations = iterations

    def fit(self, X, Y):
        """Train on features X (m samples, n features) and targets Y."""
        self.m, self.n = X.shape
        # Initialize the weights and bias to zero
        self.W = np.zeros(self.n)
        self.b = 0

        self.X = X
        self.Y = Y
        # Run a fixed number of gradient descent steps
        for _ in range(self.iterations):
            self.update_weights()
        return self

    def update_weights(self):
        """Take one gradient descent step."""
        Y_pred = self.predict(self.X)
        # Gradients of the mean squared error with respect to W and b
        dW = -(2 * self.X.T.dot(self.Y - Y_pred)) / self.m
        db = -2 * np.sum(self.Y - Y_pred) / self.m
        # Move the parameters against the gradient
        self.W = self.W - self.learning_rate * dW
        self.b = self.b - self.learning_rate * db
        return self

    def predict(self, X):
        """Return the predictions X.dot(W) + b."""
        return X.dot(self.W) + self.b
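
The gradient expressions in `update_weights` are consistent with minimizing the mean squared error. The chapter does not name the loss explicitly, so the derivation below is read off from the code, with $\alpha$ standing for `learning_rate` and $m$ for the number of samples.

$$
L(W, b) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2,
\qquad \hat{y}_i = x_i^{\top} W + b
$$

$$
\frac{\partial L}{\partial W} = -\frac{2}{m} X^{\top} \left( Y - \hat{Y} \right),
\qquad
\frac{\partial L}{\partial b} = -\frac{2}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)
$$

$$
W \leftarrow W - \alpha \frac{\partial L}{\partial W},
\qquad
b \leftarrow b - \alpha \frac{\partial L}{\partial b}
$$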

In [68]:
model = LinearRegression(iterations=3000, learning_rate=0.001)
model.fit(x_scaled, y)
# Predict on the training data (no separate test set is used here)
y_pred = model.predict(x_scaled)
print("Trained W:", round(model.W[0], 2), round(model.W[1], 2), round(model.W[2], 2))
print("Trained b:", round(model.b, 2))

Trained W: 0.21 0.04 0.19
Trained b: 0.34
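
One way to sanity-check a gradient descent implementation is to compare it with the closed-form least-squares solution, which it should approach as the number of iterations grows. The sketch below does this on synthetic standardized data. The data, learning rate, and iteration count are assumptions chosen for the demo, and the loop repeats the same update rule as the class above.

```python
import numpy as np

# Synthetic standardized features, assumed purely for illustration
rng = np.random.default_rng(4)
m = 200
X = rng.normal(size=(m, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 0.34 + X @ np.array([0.2, 0.05, 0.2]) + rng.normal(scale=0.1, size=m)

# Gradient descent on the mean squared error
learning_rate, iterations = 0.01, 5000  # hypothetical settings for this demo
W, b = np.zeros(X.shape[1]), 0.0
for _ in range(iterations):
    error = y - (X @ W + b)
    W -= learning_rate * (-2.0 * X.T @ error / m)
    b -= learning_rate * (-2.0 * error.sum() / m)

# Closed-form least-squares solution for comparison
A = np.column_stack([X, np.ones(m)])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print("gradient descent:", np.round(W, 4), round(b, 4))
print("closed form:     ", np.round(beta[:-1], 4), round(beta[-1], 4))
```

With centered features, the intercept's remaining error shrinks by a factor of (1 - 2 × learning rate) per step. At a learning rate of 0.001 and 3000 iterations that factor is roughly 0.0025, so the trained intercept should sit very close to the mean of the target.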
