## Content

- Feature Scaling
- Adj. R-squared
- Intro to statsmodels

- Linearity Assumption

## How does feature scaling make model training easier?



<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/061/747/original/z.png?1705225934' width=800>

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/085/918/original/download_%281%29.png?1723533045" width=800>


<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/085/923/original/download_%2811%29.jpeg?1723534701' width=800>
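One way to see why scaling helps is to compare how well-conditioned the least-squares problem is before and after standardization. The sketch below is only an illustration: the two synthetic features (`age` and `km`) and their ranges are assumptions chosen to mimic columns on very different scales, not values taken from the lecture's dataset. A large condition number for the matrix $X^TX$ means the loss surface is a long, narrow valley, which is where gradient descent needs a tiny learning rate and many steps.

```python
import numpy as np

# Synthetic features on very different scales (illustrative assumption)
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(1, 15, n)            # roughly 1 to 15
km = rng.uniform(5_000, 200_000, n)    # roughly thousands to hundreds of thousands
X_raw = np.column_stack([age, km])

# Standardize each column: subtract the mean, divide by the standard deviation
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def gram_condition_number(X):
    """Condition number of X^T X / n, which is proportional to the Hessian
    of the mean-squared-error loss for a linear model without intercept."""
    return np.linalg.cond(X.T @ X / len(X))

cond_raw = gram_condition_number(X_raw)
cond_scaled = gram_condition_number(X_scaled)

print(f"condition number, raw features:    {cond_raw:.3e}")
print(f"condition number, scaled features: {cond_scaled:.3e}")
```

The scaled features should give a condition number close to 1, while the raw features give one that is many orders of magnitude larger.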


## Problems with R-squared, Adjusted R-squared

<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/061/748/original/z.png?1705225962' width=800>




<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/061/749/original/z.png?1705225995' width=800>



<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/061/750/original/z.png?1705226069' width=800>



<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/061/751/original/z.png?1705226101' width=800>

## Intro to statsmodels

Let's look at a library called `statsmodels`, which we will be using throughout this lecture.

<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/061/753/original/z.png?1705226176' width=800>

First we will download our data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

!gdown 1UpLnYA48Vy_lGUMMLG-uQE1gf_Je12Lh

Downloading...
From: https://drive.google.com/uc?id=1UpLnYA48Vy_lGUMMLG-uQE1gf_Je12Lh
To: /content/cars24-car-price-clean.csv
  0% 0.00/7.10M [00:00<?, ?B/s]100% 7.10M/7.10M [00:00<00:00, 153MB/s]


In [None]:
df = pd.read_csv('cars24-car-price-clean.csv')
df.head()

| | selling_price | year | km_driven | mileage | engine | max_power | age | make | model | Individual | Trustmark Dealer | Diesel | Electric | LPG | Petrol | Manual | 5 | >5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.111046 | -0.801317 | 1.195828 | 0.045745 | -1.310754 | -1.15778 | 0.801317 | -0.433854 | -1.125683 | 1.248892 | -0.098382 | -0.985275 | -0.020095 | -0.056917 | 1.024622 | 0.495818 | 0.444503 | -0.424728 |
| 1 | -0.223944 | 0.45003 | -0.737872 | -0.140402 | -0.537456 | -0.360203 | -0.45003 | -0.327501 | -0.333227 | 1.248892 | -0.098382 | -0.985275 | -0.020095 | -0.056917 | 1.024622 | 0.495818 | 0.444503 | -0.424728 |
| 2 | -0.915058 | -1.42699 | 0.035608 | -0.582501 | -0.537456 | -0.404885 | 1.42699 | -0.327501 | -0.789807 | 1.248892 | -0.098382 | -0.985275 | -0.020095 | -0.056917 | 1.024622 | 0.495818 | 0.444503 | -0.424728 |
| 3 | -0.892365 | -0.801317 | -0.409143 | 0.32962 | -0.921213 | -0.693085 | 0.801317 | -0.433854 | -0.905265 | 1.248892 | -0.098382 | -0.985275 | -0.020095 | -0.056917 | 1.024622 | 0.495818 | 0.444503 | -0.424728 |
| 4 | -0.182683 | 0.137194 | -0.544502 | 0.760085 | 0.042999 | 0.010435 | -0.137194 | -0.246579 | -0.013096 | -0.80071 | -0.098382 | 1.014945 | -0.020095 | -0.056917 | -0.97597 | 0.495818 | 0.444503 | -0.424728 |


In [None]:
# Target is the selling price; every other column is a feature
y = df[['selling_price']]
X = df.drop('selling_price', axis=1)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [None]:
y_train = np.array(y_train)

Now, let's train our model on the data.

In [None]:
X_sm = sm.add_constant(X_train)  # statsmodels' OLS does not fit an intercept by default; add a constant column to get one.

model = sm.OLS(y_train, X_sm)
results = model.fit()

# Print the summary statistics of the model
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.941
Model:                            OLS   Adj. R-squared:                  0.941
Method:                 Least Squares   F-statistic:                 1.588e+04
Date:                Wed, 09 Aug 2023   Prob (F-statistic):               0.00
Time:                        13:34:26   Log-Likelihood:                -7.3180
No. Observations:               15856   AIC:                             48.64
Df Residuals:                   15839   BIC:                             179.0
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const             7.664e-05      0.002  

Let's look at a few of the fields in this table:

- Dep. Variable: The name of the dependent variable being predicted in the regression. It shows `y` here because `y_train` was converted to a NumPy array, which drops the column name.

- Model: The type of model that was fitted, here "OLS" (Ordinary Least Squares).

- R-squared: The coefficient of determination, i.e. the fraction of the variance in the dependent variable that the model explains.

- Adj. R-squared: The adjusted R-squared value, which penalizes R-squared for the number of predictors in the model.
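To make the adjustment concrete, here is a minimal sketch using the standard formula $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors (excluding the intercept). The helper name `adjusted_r2` is hypothetical; the inputs are the rounded values reported in the summary above (`R-squared`, `No. Observations`, `Df Model`).

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept.

    n_predictors counts the features only (not the constant term).
    """
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read off the OLS summary above
adj = adjusted_r2(r2=0.941, n_obs=15856, n_predictors=16)
print(round(adj, 3))  # 0.941

# With far fewer observations, the same R-squared is penalized much more
adj_small_sample = adjusted_r2(r2=0.941, n_obs=30, n_predictors=16)
print(round(adj_small_sample, 3))
```

With almost 16,000 rows and only 16 predictors the penalty is negligible, which is why the summary reports the same value for both metrics.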

Prediction works the same way as in scikit-learn: call `predict` on the fitted results, passing data that includes the constant column.

In [None]:
results.predict(X_sm)

3443    -0.354511
16090   -0.476414
11070   -0.359932
19214   -0.121763
17843   -0.656579
           ...   
1099     2.335550
18898   -0.334020
11798    0.398398
6637     2.564373
2575    -0.076645
Length: 15856, dtype: float64

<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/061/754/original/z.png?1705226214' width=800>

We will see more uses of the `statsmodels` library as we continue with today's lecture.


## Assumptions of Linear Regression

<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/061/755/original/z.png?1705226251' width=800>

- We can arrive at the concept of linear regression in two ways:
  - Algebra & Optimization (Geometric): we covered this
  - Probability & Statistics
- We can prove that linear regression is a very good model if all of its statistical assumptions hold true.


## 1. Assumption of Linearity



<img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/061/756/original/z.png?1705226286' width=800>