# Linear Regression 
In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable. <a href="https://en.wikipedia.org/wiki/Linear_regression">read more</a><br><br>

In this notebook we train a linear regression model on our data. The overall process is:
1. Encode the categorical features with scikit-learn encoders.
2. Split our dataset into train and test parts.
3. Bring the data onto the same scale with scikit-learn feature scalers.
4. Fit our model on the train data and evaluate the goodness of fit on the test data.

Importing the required modules:

In [2]:
import numpy as np
import pandas as pd

Read the dataset with the pandas **read_csv** function and look at the first rows.

In [3]:
cars =  pd.read_csv("cleaned_data.csv")
cars.head()

Unnamed: 0,Name,style,Exterior color,interior color,Engine,drive type,Fuel Type,Transmission,Mileage,mpg city,mpg highway,price,Year,Engine V,Brand
0,Titan,Pickup Truck,Deep Blue Pearl,Black,V-8 Gas,4WD,Gas,Automatic,82230,15,21,35620,2018,5.6,Nissan
1,Civic,Hatchback,Sonic Gray Pearl,Unknown,Inline-4 Gas Turbocharged,FWD,Gas,Automatic,24282,31,40,24999,2020,1.5,Honda
2,Charger,Sedan,Indigo Blue,Brazen Gold/Black,V-8 Gas,RWD,Gas,Automatic,19468,16,25,41999,2018,5.7,Dodge
3,F-150,Pickup Truck,Shadow Black,Medium Earth Gray,V-6 Gas Turbocharged,4WD,Gas,Automatic,195205,18,23,20995,2018,2.7,Ford
4,Altima,Sedan,White,Black,Inline-4 Gas,FWD,Gas,Automatic,92366,27,38,10995,2015,2.5,Nissan


Listing the column names:

In [4]:
cars.columns

Index(['Name', 'style', 'Exterior color', 'interior color', 'Engine',
       'drive type', 'Fuel Type', 'Transmission', 'Mileage', 'mpg city',
       'mpg highway', 'price', 'Year', 'Engine V', 'Brand'],
      dtype='object')

Now we create our X and Y data so that we can carry out the steps listed at the start of the notebook.

In [10]:
X =  cars[['Name', 'style', 'Exterior color', 'interior color', 'Engine',
       'drive type', 'Fuel Type', 'Transmission', 'Mileage', 'mpg city',
       'mpg highway', 'Year', 'Engine V', 'Brand']]


Y = cars["price"].values

#### Encode categorical features as a one-hot numeric array.
The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter, which is named sparse_output in scikit-learn 1.2 and later).<br><br>

By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.<br><a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html">read full documentation</a>
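
To make the encoding scheme concrete, here is a minimal sketch on a tiny hand-made array. The values and the names `toy`, `enc`, `encoded` and `unseen` are made up for illustration and are not part of the cars dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical data (hypothetical values): column 0 is a drive type, column 1 a fuel type.
toy = np.array([["FWD", "Gas"], ["RWD", "Gas"], ["4WD", "Diesel"]], dtype=object)

enc = OneHotEncoder(categories="auto", handle_unknown="ignore")
encoded = enc.fit_transform(toy).toarray()

# One binary column per category: 3 drive types + 2 fuel types = 5 columns.
print(enc.categories_)
print(encoded.shape)

# With handle_unknown="ignore", a category that was not seen during fit
# is encoded as all zeros for that feature instead of raising an error.
unseen = enc.transform(np.array([["AWD", "Gas"]], dtype=object)).toarray()
print(unseen)
```

Each row of `encoded` contains exactly one 1 per original feature, which is why the encoded matrix in the notebook is mostly zeros.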

In [11]:
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder(categories="auto", handle_unknown="ignore")

categorical_features = onehot.fit_transform(X.iloc[:, [1,4,5,6,7,13]]).toarray()
print(categorical_features)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [7]:
print(categorical_features.shape)

(6532, 91)


In this part we delete the features we do not need (Name, Exterior color, interior color) together with the categorical features that we encoded in the previous part.

In [12]:
X = np.delete(X.values, [0,1,2,3,4,5,6,7,13], 1)
print(X.shape)

(6532, 5)


Now we combine the remaining features with the encoded array:

In [13]:
X = np.concatenate((X,categorical_features), axis=1)
X.shape

(6532, 96)
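
As an alternative to encoding, deleting and concatenating by hand, the same kind of result can probably be obtained in one step with scikit-learn's `ColumnTransformer`. The sketch below is only an illustration on a hypothetical three-row frame that reuses a few column names from the dataset. Note that the column order of the output differs from the manual approach: the encoded columns come first here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-row frame, used only to illustrate the idea.
toy = pd.DataFrame({
    "Name": ["Titan", "Civic", "Charger"],
    "drive type": ["4WD", "FWD", "RWD"],
    "Fuel Type": ["Gas", "Gas", "Gas"],
    "Mileage": [82230, 24282, 19468],
    "Year": [2018, 2020, 2018],
})

ct = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["drive type", "Fuel Type"]),
        ("numeric", "passthrough", ["Mileage", "Year"]),
    ],
    remainder="drop",      # unlisted columns such as "Name" are discarded
    sparse_threshold=0.0,  # always return a dense array
)

X_toy = ct.fit_transform(toy)
print(X_toy.shape)  # 3 drive types + 1 fuel type + 2 numeric columns
```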

#### Split arrays or matrices into train and test subsets.
Quick utility that wraps input validation, next(ShuffleSplit().split(X, y)), and application to the input data into a single call, so that data can be split (and optionally subsampled) in a one-liner.<br><a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">read full documentation</a>
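
A small sketch of what the split returns, on hypothetical toy arrays (`X_toy`, `y_toy` and the other names below are made up for this example). It uses the same `test_size`, `random_state` and `shuffle` arguments as the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with 3 features each.
X_toy = np.arange(60).reshape(20, 3)
y_toy = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=82, shuffle=True
)

print(X_tr.shape, X_te.shape)  # 18 training rows and 2 test rows

# Shuffling keeps every row paired with its target:
# row i of X_toy starts with the value 3 * i, and y_toy[i] is i.
print(np.array_equal(X_tr[:, 0] // 3, y_tr))
```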

In [14]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    X, Y,
    test_size=0.1,
    random_state=82,
    shuffle=True
)

#### Standardize features by removing the mean and scaling to unit variance.
The standard score of a sample x is calculated as:
<br><br>
z = (x - u) / s
<br><br>
where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
<br><br>
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.
<br><br>
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).<br><a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html">read full documentation</a>
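
The formula above can be checked by hand. The following sketch, on hypothetical toy numbers, computes z = (x - u) / s with NumPy and compares it with what `StandardScaler` returns. The statistics are taken from the training rows only and then reused for the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy split with two features on very different scales.
train = np.array([[10.0, 1000.0], [20.0, 3000.0], [30.0, 5000.0]])
test = np.array([[40.0, 2000.0]])

scaler = StandardScaler().fit(train)  # statistics come from the training set only

u = train.mean(axis=0)  # per-feature mean of the training samples
s = train.std(axis=0)   # per-feature (population) standard deviation
manual = (test - u) / s

print(scaler.transform(test))
print(manual)
```

Fitting the scaler on the training set only keeps information about the test set from leaking into the preprocessing.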

In [15]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
std_scaler.fit(x_train)

x_train_std = std_scaler.transform(x_train)
x_test_std  = std_scaler.transform(x_test)

### Linear Regression
LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.
<br><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">read full documentation</a>

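As a sanity check of what "minimizing the residual sum of squares" means, the sketch below fits `LinearRegression` on synthetic data and compares the result with an ordinary least-squares solve from NumPy. The data, the coefficients in `true_w` and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = X @ true_w + 4 + small noise (hypothetical ground truth).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_toy = X_toy @ true_w + 4.0 + rng.normal(scale=0.1, size=50)

model = LinearRegression(fit_intercept=True).fit(X_toy, y_toy)

# The same least-squares problem, with the intercept written as a column of ones.
X_aug = np.column_stack([np.ones(len(X_toy)), X_toy])
w_ls, *_ = np.linalg.lstsq(X_aug, y_toy, rcond=None)

print(model.intercept_, model.coef_)
print(w_ls)  # first entry is the intercept, the rest are the coefficients
```
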
In [16]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# the `normalize` argument was removed in scikit-learn 1.2, so we leave it out
lr = LinearRegression(fit_intercept=True)
lr.fit(x_train_std, y_train)

y_pred_lr = lr.predict(x_test_std)
print(r2_score(y_test, y_pred_lr))

0.8627732275821445


### Ridge : Linear least squares with l2 regularization.

Minimizes the objective function:
<br><br>
||y - Xw||^2_2 + alpha * ||w||^2_2
<br><br>
This model solves a regression model where the loss function is the linear least squares function and regularization is given by the l2-norm. Also known as Ridge Regression or Tikhonov regularization. This estimator has built-in support for multi-variate regression (i.e., when y is a 2d-array of shape (n_samples, n_targets)).<br>
> Technically the Ridge model optimizes the same kind of objective as the Elastic Net with l1_ratio=0 (no L1 penalty), up to a rescaling of alpha.

<br><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html">read full documentation</a>

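For the objective above there is a closed-form solution, w = (XᵀX + alpha·I)⁻¹ Xᵀy, when no intercept is fitted. The sketch below checks this against `Ridge` on synthetic data. Setting `fit_intercept=False` is an assumption made here to keep the formula simple; the notebook itself uses the default.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with hypothetical coefficients.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 4))
y_toy = X_toy @ np.array([1.0, 0.0, -2.0, 3.0]) + rng.normal(scale=0.1, size=40)

alpha = 1.0

# Closed-form minimizer of ||y - Xw||^2_2 + alpha * ||w||^2_2
w_closed = np.linalg.solve(X_toy.T @ X_toy + alpha * np.eye(4), X_toy.T @ y_toy)

ridge_toy = Ridge(alpha=alpha, fit_intercept=False).fit(X_toy, y_toy)

print(w_closed)
print(ridge_toy.coef_)
```
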
In [18]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1)
ridge.fit(x_train_std, y_train)

# keep the fitted model in `ridge`; only the predictions go into y_pred_ri
y_pred_ri = ridge.predict(x_test_std)
print(r2_score(y_test, y_pred_ri))

0.862553379277978


### Lasso :
Linear Model trained with L1 prior as regularizer.<br>
> Technically the Lasso model is optimizing the same objective function as the Elastic Net with l1_ratio=1.0 (no L2 penalty).

<br><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html">read full documentation</a>

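A practical effect of the L1 penalty is that it can set some coefficients exactly to zero. The sketch below illustrates this on synthetic data in which only the first two features carry signal. The data, `alpha=0.5` and the variable names are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only features 0 and 1 influence the target (hypothetical ground truth).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_toy, y_toy)
lasso_toy = Lasso(alpha=0.5).fit(X_toy, y_toy)

# Plain least squares gives small non-zero weights to the noise features,
# while Lasso shrinks all weights and drops the irrelevant ones.
print(np.round(ols.coef_, 3))
print(np.round(lasso_toy.coef_, 3))
```
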
In [21]:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1, max_iter=20000)
lasso.fit(x_train_std, y_train)

y_pred_ls = lasso.predict(x_test_std)
print(r2_score(y_test, y_pred_ls))

0.862544427139954


### ElasticNet :

Linear regression with combined L1 and L2 priors as regularizer.
<br><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html">read full documentation</a>

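In scikit-learn the Elastic Net objective is 1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2, so `l1_ratio` moves the penalty between pure L2 (0) and pure L1 (1). The sketch below, on synthetic data with made-up coefficients, shows that `l1_ratio=1.0` reproduces Lasso while `l1_ratio=0.5` gives a different solution.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data with hypothetical coefficients.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = X_toy @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

lasso_toy = Lasso(alpha=0.1).fit(X_toy, y_toy)
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_toy, y_toy)   # L1 penalty only
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_toy, y_toy)  # half L1, half L2

print(lasso_toy.coef_)
print(enet_l1.coef_)
print(enet_mix.coef_)
```
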
In [22]:
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=1, l1_ratio=0.5)
elastic.fit(x_train_std, y_train)

y_pred_el = elastic.predict(x_test_std)
print(r2_score(y_test, y_pred_el))

0.8014088417173381


So we have trained linear regression models with and without L1 and L2 penalties. None of the penalized models gives a better result than plain linear regression: Ridge and Lasso reach practically the same R² score on the test set (about 0.863), and ElasticNet scores lower (about 0.801), most likely because alpha=1 with l1_ratio=0.5 penalizes the coefficients too strongly and the model underfits. In the next parts we are going to use different methods and tune hyperparameters to see whether we can get a higher score. But before that, let's use a pipeline to simplify the code.

### Pipeline

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using memory argument.
<br><br>
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a '__'. A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to 'passthrough' or None.
<br><a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">read full documentation</a>

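A short sketch of the `<step name>__<parameter name>` convention on synthetic data. The data and the choice of `Ridge` as the final step are assumptions made for this example. `make_pipeline` names each step after its lowercased class name.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled data with hypothetical coefficients.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=60)

pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
print(list(pipe.named_steps))  # step names generated by make_pipeline

# Change a parameter of the "ridge" step through the pipeline.
pipe.set_params(ridge__alpha=10.0)

# The pipeline receives raw features and applies the scaler itself.
pipe.fit(X_toy, y_toy)
print(pipe.named_steps["ridge"].alpha)
print(pipe.score(X_toy, y_toy))  # R^2 of the final estimator
```
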
In [24]:
from sklearn.pipeline import make_pipeline

pip_lr = make_pipeline(
    StandardScaler(),
    LinearRegression(fit_intercept=True)
)
# the pipeline standardizes the data itself, so we pass the unscaled arrays
pip_lr.fit(x_train, y_train)
h_pred = pip_lr.predict(x_test)
# score() returns the R^2 of the final estimator on the test data
print("Test Accuracy : {:.3f}".format(pip_lr.score(x_test, y_test)))

Test Accuracy : 0.863


Sina Kazemi<br>
Github : <a href="https://github.com/sina96n/">sina96n</a>