In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

In [3]:
auto = pd.read_csv('auto_mpg.csv')
print(auto.head())
auto_disp = auto['displacement'].astype(float)
auto_mpg = auto['mpg'].astype(float)
mean_disp = np.mean(auto_disp)
min_disp = np.min(auto_disp)
max_disp = np.max(auto_disp)
print(mean_disp, max_disp, min_disp, max_disp-min_disp)
auto_disp = auto_disp / 100  # rescale displacement (vectorized)
print("-----Displacement Stats-----" )
print(auto_disp.describe())
auto_mpg = auto_mpg / 100  # rescale mpg (vectorized)
print("-----Mileage Stats-----" )
print(auto_mpg.describe())

    mpg  cylinder  displacement horse power  weight  acceleration  model year  \
0  18.0         8         307.0         130    3504          12.0          70   
1  15.0         8         350.0         165    3693          11.5          70   
2  18.0         8         318.0         150    3436          11.0          70   
3  16.0         8         304.0         150    3433          12.0          70   
4  17.0         8         302.0         140    3449          10.5          70   

   origin                   car name  
0       1  chevrolet chevelle malibu  
1       1          buick skylark 320  
2       1         plymouth satellite  
3       1              amc rebel sst  
4       1                ford torino  
193.42587939698493 455.0 68.0 387.0
-----Displacement Stats-----
count    398.000000
mean       1.934259
std        1.042698
min        0.680000
25%        1.042500
50%        1.485000
75%        2.620000
max        4.550000
Name: displacement, dtype: float64
-----Mileage Stats-

In [4]:
print(type(auto_disp))

<class 'pandas.core.series.Series'>


What is correlation? 

Correlation describes the relationship between two variables. 

The correlation coefficient is a value that describes the strength of the 
relationship between two variables. 

Correlation graph

<img src="correlation_graph.png" width="400" height="300">

Correlation coefficient formula

<img src="correlation_formula.png" width="400" height="300">

Values of $r$ range from -1 to 1. A value of -1 represents a perfect inverse (negative) correlation, 1 represents a perfect direct (positive) correlation, and 0 indicates no linear correlation. 

Reference - https://www.wallstreetmojo.com/correlation-coefficient-formula/
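
As a rough illustration of the coefficient, the sketch below computes $r$ with the standard Pearson formula (assumed here to be the one shown in the image) and checks it against `np.corrcoef`. The helper name `pearson_r` and the toy values are made up for this example.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Made-up toy values, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]    # rises with x
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # falls as x rises

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_numpy = np.corrcoef(x, y_up)[0, 1]  # NumPy's built-in equivalent
print(r_up, r_down, r_numpy)
```

A perfectly increasing linear relationship gives $r$ close to 1 and a perfectly decreasing one gives $r$ close to -1, up to floating-point rounding.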

Which features to select?

Choose features that are strongly correlated with the target but not strongly correlated with each other. 

In [None]:
# library
import seaborn as sns
import matplotlib.pyplot as plt
 
# Basic correlogram
sns.pairplot(auto)
plt.show()

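# Restrict to numeric columns; text columns such as 'car name' cannot be correlated
corr = auto.corr(numeric_only=True)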
corr.style.background_gradient(cmap='coolwarm')

In [5]:
reg = linear_model.LinearRegression()
print(reg)

LinearRegression()


In [6]:
x_train, x_test, y_train, y_test = train_test_split(auto_disp,
                                                    auto_mpg,
                                                    test_size=0.2,
                                                    random_state=4)

In [7]:
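print(y_train.shape)
# Convert each Series to a NumPy array before adding a column axis;
# indexing a Series with [:, None] is not supported in pandas 2.x.
y_train = y_train.to_numpy()[:, None]
print(y_train.shape)
x_train = x_train.to_numpy()[:, None]
print(x_train.shape)
x_test = x_test.to_numpy()[:, None]
y_test = y_test.to_numpy()[:, None]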

(318,)
(318, 1)
(318, 1)


In [8]:
reg.fit(x_train, y_train)

LinearRegression()

In [9]:
print(reg.coef_)
print(reg.intercept_)

[[-0.05885609]]
[0.34875617]
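
Because both displacement and mpg were divided by 100 before fitting, the printed slope and intercept describe the scaled variables. The sketch below, which copies the two printed values and uses a hypothetical helper `predict_mpg`, shows one way to convert them back to the original units: from $\frac{mpg}{100} = m \cdot \frac{disp}{100} + b$ it follows that $mpg = m \cdot disp + 100b$.

```python
# Values copied from the printed output above
slope_scaled = -0.05885609
intercept_scaled = 0.34875617
scale = 100.0  # both displacement and mpg were divided by 100

# The slope is unchanged because x and y share the same scale factor;
# only the intercept needs to be multiplied back by 100.
slope = slope_scaled
intercept = intercept_scaled * scale

def predict_mpg(displacement):
    """Predict mpg in original units (hypothetical helper)."""
    return slope * displacement + intercept

# Displacement of the first row printed by auto.head()
prediction = predict_mpg(307.0)
print(intercept, prediction)
```

For a displacement of 307.0 this gives roughly 16.8 mpg, compared with the recorded 18.0 for that row.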


#### Metrics for Linear Regression

Mean Squared Error 

For linear regression with one variable, $ y = mx + b $, the mean squared error is

$ MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - (mx_i + b))^2 $ 

$y_i$ is the actual value and $mx_i + b$ is the predicted value.

$N$ is the number of observations.
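
A small worked example may make the definition concrete. The values of `x`, `y_true`, `m` and `b` below are made up; the manual calculation is compared with scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up toy data and line parameters, for illustration only
x = np.array([1.0, 2.0, 3.0])
y_true = np.array([3.0, 5.0, 7.0])
m, b = 2.0, 0.5

y_pred = m * x + b  # predicted values m*x_i + b

# MSE written out exactly as in the definition
mse_manual = np.mean((y_true - y_pred) ** 2)

# The same quantity from scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)
```

Each prediction is off by 0.5, so every squared error is 0.25 and the mean is 0.25 as well.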

The loss function based on the MSE is 

$ L(m, b) = \frac{1}{N} \sum_{i=1}^{N} (y_i - (mx_i + b))^2 $ 

Our goal is to minimize $L$ with respect to $m$ and $b$.

The gradient of $L$ is

$L'(m, b) = \begin{bmatrix} \frac{dL}{dm} \\ \frac{dL}{db} \end{bmatrix} = \begin{bmatrix} \frac{1}{N} \sum_{i=1}^{N} -2x_i(y_i - (mx_i + b))  \\ \frac{1}{N} \sum_{i=1}^{N} -2(y_i - (mx_i + b))  \end{bmatrix}$ 

The update equations for $m$ and $b$ with learning rate $\epsilon$ are

$ m = m - \epsilon \frac{dL}{dm} $

$ b = b - \epsilon \frac{dL}{db} $

<img src="linear_loss.png" width="400" height="300">
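
The update equations can be sketched in a few lines of NumPy. This is a minimal illustration rather than what `LinearRegression` does internally; the function name `gradient_descent`, the learning rate, the step count and the toy data are all assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(x, y, lr=0.05, n_steps=2000):
    """Fit y = m*x + b by gradient descent on the MSE loss."""
    m, b = 0.0, 0.0
    for _ in range(n_steps):
        residual = y - (m * x + b)            # y_i - (m*x_i + b)
        dL_dm = np.mean(-2.0 * x * residual)  # dL/dm
        dL_db = np.mean(-2.0 * residual)      # dL/db
        m -= lr * dL_dm                       # m = m - eps * dL/dm
        b -= lr * dL_db                       # b = b - eps * dL/db
    return m, b

# Made-up noise-free data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

m, b = gradient_descent(x, y)
print(m, b)
```

On this toy data the estimates approach $m = 2$ and $b = 1$. A learning rate that is too large makes the updates diverge, so the value used here should be treated as specific to this example.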

#### Different Gradient Descents

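Gradient Descent - every data point contributes to the gradient used for each update. 

Batch Gradient Descent - the gradient is computed over the whole batch of training data, and then a single update is made. 
It is slow when the training data is large, because every update requires a full pass over the data. 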

Stochastic Gradient Descent - a single data point is chosen at random, and its loss is used to compute the update. 

Mini-batch Stochastic Gradient Descent - a mini-batch of randomly selected data points is considered and the average loss of the mini-batch is computed for the update. 
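
The variants above differ mainly in how many points feed each update. The sketch below is one possible mini-batch implementation for the single-variable model; the name `minibatch_sgd`, the hyperparameters and the toy data are assumptions made for illustration.

```python
import numpy as np

def minibatch_sgd(x, y, lr=0.02, batch_size=2, n_epochs=2000, seed=0):
    """Fit y = m*x + b using mini-batch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_epochs):
        order = rng.permutation(n)  # shuffle the data once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of one mini-batch
            residual = y[idx] - (m * x[idx] + b)
            # Gradients averaged over the mini-batch only
            m -= lr * np.mean(-2.0 * x[idx] * residual)
            b -= lr * np.mean(-2.0 * residual)
    return m, b

# Made-up noise-free data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

m, b = minibatch_sgd(x, y)
print(m, b)
```

In this sketch, setting `batch_size=1` gives stochastic gradient descent and setting `batch_size=len(x)` gives batch gradient descent.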

In [None]:
yhat = reg.predict(x_test)

In [None]:
print(yhat[0])

In [None]:
# mean squared error
np.mean((yhat-y_test)**2)

In [None]:
print(y_test[0])
# mean squared error relative to the mean target value
print(np.mean((yhat-y_test)**2)/np.mean(y_test))

In [None]:
"""
In-class activity: In the auto-mpg example, find the relationship between 
weight and mpg. Find the mean squared error. 
"""

Multilinear Regression - A multilinear regression uses more than one independent variable to find a linear relationship between the independent variables and the dependent variable. In the example below, two features, displacement and weight, are the inputs to the model, and the target is still mpg. 

In [None]:
auto_weight = auto['weight'].astype(float)

In [None]:
x = np.array([auto_disp, auto_weight]).T
print(x.shape)
y = np.array([auto_mpg]).T
print(y.shape)
x_train, x_test, y_train, y_test = train_test_split(x,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=4)

In [None]:
reg.fit(x_train, y_train)

In [None]:
print(reg.coef_)
print(reg.intercept_)
print(reg.score(x_train, y_train))
print(reg.score(x_test, y_test))

In [None]:
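# Recompute predictions with the two-feature model before measuring the error
yhat = reg.predict(x_test)
np.mean((yhat-y_test)**2)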

In [None]:
"""
In-class activity: In the auto-mpg example, find the relationship between 
horse power and weight with car mpg. Find the mean squared error. 
"""