# Predictive Analytics: Regression and Classification

## Instructor: Sourish

### Lecture 1:

#### Objective:

The objective of this Python hands-on is to understand how OLS estimation works for a linear regression model.

In [2]:
# Load pandas and numpy
import pandas as pd
import numpy as np
# Read CSV file into DataFrame df
df = pd.read_csv('mtcars.csv')

In [3]:
df.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [4]:
# Sample size
n = df.shape[0]
n

32

The **target variable** is **mpg**. We define it as a matrix of order $n\times 1$.

In [6]:
y = np.transpose(np.asmatrix(df['mpg']))
y[0:4]

matrix([[21. ],
        [21. ],
        [22.8],
        [21.4]])

We consider **hp**, **wt**, and **disp** as the predictors in our model. So the model we are fitting is


$$
\text{mpg} = \beta_0 + \beta_1 \text{hp} + \beta_2 \text{wt} + \beta_3 \text{disp} + \varepsilon
$$
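
In matrix notation, which the steps that follow rely on, the same model and its least-squares solution can be written compactly. This is a sketch of the standard derivation; the criterion $S(\beta)$ is a symbol introduced here for illustration, and the last step assumes that $X'X$ is invertible.

$$
y = X\beta + \varepsilon, \qquad y \in \mathbb{R}^{n\times 1},\quad X \in \mathbb{R}^{n\times 4},\quad \beta = (\beta_0, \beta_1, \beta_2, \beta_3)'
$$

$$
S(\beta) = (y - X\beta)'(y - X\beta), \qquad \frac{\partial S}{\partial \beta} = -2X'y + 2X'X\beta = 0 \;\Longrightarrow\; X'X\hat{\beta} = X'y \;\Longrightarrow\; \hat{\beta} = (X'X)^{-1}X'y
$$

The first column of $X$ is a column of ones, so that $\beta_0$ acts as the intercept.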

In [7]:
## First we define the intercept column of the model (a column of ones)
Intercept = pd.DataFrame(np.ones(n))
Intercept

Unnamed: 0,0
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0
5,1.0
6,1.0
7,1.0
8,1.0
9,1.0


In [8]:
## We use 3 variables, 'hp', 'wt', and 'disp', as predictors for 'mpg'
Xdf = df[['hp','wt','disp']]

## Add intercept to Xdf
Xdf = pd.concat([Intercept,Xdf],axis=1)
Xdf.head()

Unnamed: 0,0,hp,wt,disp
0,1.0,110,2.62,160.0
1,1.0,110,2.875,160.0
2,1.0,93,2.32,108.0
3,1.0,110,3.215,258.0
4,1.0,175,3.44,360.0


The design matrix $X$ is stored as the matrix `X`, which we build from `Xdf` (a pandas DataFrame).

In [9]:
X = np.asmatrix(Xdf)
X[0:5,:]

matrix([[  1.   , 110.   ,   2.62 , 160.   ],
        [  1.   , 110.   ,   2.875, 160.   ],
        [  1.   ,  93.   ,   2.32 , 108.   ],
        [  1.   , 110.   ,   3.215, 258.   ],
        [  1.   , 175.   ,   3.44 , 360.   ]])
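
The same kind of design matrix can also be built from plain NumPy arrays, which is worth knowing because NumPy's documentation no longer recommends the `matrix` class for new code. The sketch below is an illustration only: it uses just the first five rows shown in the `df.head()` output above, and the names `X_small` and `XtX_small` are hypothetical.

```python
import numpy as np

# First five rows of hp, wt, disp, copied from the df.head() output above
hp = np.array([110, 110, 93, 110, 175], dtype=float)
wt = np.array([2.62, 2.875, 2.32, 3.215, 3.44])
disp = np.array([160.0, 160.0, 108.0, 258.0, 360.0])

# Prepend a column of ones for the intercept
X_small = np.column_stack([np.ones(len(hp)), hp, wt, disp])

# With plain ndarrays, matrix multiplication uses @
# (for ndarrays, * would multiply element by element instead)
XtX_small = X_small.T @ X_small

print(X_small.shape)     # (5, 4)
print(XtX_small.shape)   # (4, 4)
```

With `np.matrix` objects, as used in this notebook, `*` already means matrix multiplication, which is why `Xt*X` works in the cells that follow.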

Define $X'$

In [10]:
Xt = np.transpose(X)

Calculate $X'X$

In [11]:
Xt*X

matrix([[3.20000000e+01, 4.69400000e+03, 1.02952000e+02, 7.38310000e+03],
        [4.69400000e+03, 8.34278000e+05, 1.64717440e+04, 1.29136440e+06],
        [1.02952000e+02, 1.64717440e+04, 3.60901070e+02, 2.70914888e+04],
        [7.38310000e+03, 1.29136440e+06, 2.70914888e+04, 2.17962747e+06]])

Calculate $(X'X)^{-1}$

In [13]:
XtxX_inv = np.linalg.inv(Xt*X)
XtxX_inv

matrix([[ 6.39800586e-01, -1.29284038e-03, -2.73552559e-01,
          1.99885645e-03],
        [-1.29284038e-03,  1.87791901e-05,  2.71270749e-04,
         -1.01185806e-05],
        [-2.73552559e-01,  2.71270749e-04,  1.63235239e-01,
         -1.26302737e-03],
        [ 1.99885645e-03, -1.01185806e-05, -1.26302737e-03,
          1.53816695e-05]])

Calculate $\hat{\beta}=(X'X)^{-1}X'y$

In [14]:
beta_hat = XtxX_inv*Xt*y
beta_hat

matrix([[ 3.71055053e+01],
        [-3.11565508e-02],
        [-3.80089058e+00],
        [-9.37009081e-04]])

In [15]:
np.round(beta_hat,4)

array([[ 3.71055e+01],
       [-3.12000e-02],
       [-3.80090e+00],
       [-9.00000e-04]])
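
Forming $(X'X)^{-1}$ explicitly mirrors the formula, but the same estimate can be obtained by solving the normal equations $X'X\hat{\beta} = X'y$ directly, which is usually preferred numerically. The sketch below compares the two approaches, together with NumPy's least-squares solver. It runs on synthetic data rather than `mtcars.csv`, and all names in it (`X_demo`, `y_demo`, `beta_true`, and so on) are hypothetical.

```python
import numpy as np

# Synthetic data for illustration only (not mtcars): intercept + 3 predictors
rng = np.random.default_rng(0)
n_obs = 50
X_demo = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, 3))])
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y_demo = X_demo @ beta_true + rng.normal(scale=0.1, size=n_obs)

# Explicit inverse, following the formula (X'X)^{-1} X'y
beta_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Solve the normal equations X'X b = X'y without forming the inverse
beta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Least-squares solver applied directly to X and y
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# All three should agree up to floating-point error
print(np.allclose(beta_inv, beta_solve), np.allclose(beta_inv, beta_lstsq))
```

On a small, well-conditioned problem such as this one the three results coincide; the difference matters mainly when the predictors are nearly collinear.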

Now we are going to implement the OLS estimation using the `statsmodels` package.

In [16]:
import statsmodels.api as sm

In [17]:
#fit linear regression model
model = sm.OLS(y, X).fit()

#view model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.827
Model:                            OLS   Adj. R-squared:                  0.808
Method:                 Least Squares   F-statistic:                     44.57
Date:                Wed, 19 Oct 2022   Prob (F-statistic):           8.65e-11
Time:                        09:10:36   Log-Likelihood:                -74.321
No. Observations:                  32   AIC:                             156.6
Df Residuals:                      28   BIC:                             162.5
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         37.1055      2.111     17.579      0.0
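
The `std err`, `t`, and `R-squared` entries of such a summary can be reproduced by hand from the matrices computed earlier. The sketch below uses the classical formulas $\hat{\sigma}^2 = \hat{\varepsilon}'\hat{\varepsilon}/(n-p)$ and $\widehat{\text{Var}}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}$; it is an assumption here that these are what the `nonrobust` covariance type reports. The data are synthetic (not mtcars) and all names are hypothetical.

```python
import numpy as np

# Synthetic data for illustration only (not mtcars)
rng = np.random.default_rng(1)
n_obs, p = 40, 4                      # p counts the intercept column
X_demo = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, p - 1))])
y_demo = X_demo @ np.array([1.0, 2.0, 0.0, -1.5]) + rng.normal(size=n_obs)

# OLS estimate via the explicit inverse
XtX_inv = np.linalg.inv(X_demo.T @ X_demo)
beta_hat = XtX_inv @ X_demo.T @ y_demo

# Residuals and the unbiased estimate of the error variance
resid = y_demo - X_demo @ beta_hat
df_resid = n_obs - p                  # corresponds to "Df Residuals"
sigma2_hat = resid @ resid / df_resid

# Standard errors: square roots of the diagonal of sigma^2 (X'X)^{-1}
std_err = np.sqrt(sigma2_hat * np.diag(XtX_inv))

# t statistics: each coefficient divided by its standard error
t_stat = beta_hat / std_err

# R-squared: 1 - (residual sum of squares) / (total sum of squares)
r_squared = 1.0 - (resid @ resid) / np.sum((y_demo - y_demo.mean()) ** 2)

print(np.round(std_err, 4))
print(np.round(t_stat, 4))
print(round(r_squared, 4))
```

In the mtcars fit above, the same bookkeeping gives `Df Residuals` = 32 - 4 = 28, matching the summary.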

## Self-Learning Exercise:

(**Not for grading**)

Implement the same OLS method using Python's `sklearn` package.