# Comparing linear regression models

In this project I compare the accuracy of several linear regression models using the House Prices dataset from Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques.

### The goals of this project:
 * practice building linear regression, ridge regression, and lasso regression models using Scikit-learn;
 * compare these models on this particular task.

# Steps:

   * **Data Preprocessing**
   * **Linear Regression**
   * **Ridge**
   * **RidgeCV**
   * **LassoCV**
   * **Comparing models**

### Import libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, LinearRegression, Ridge, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

file_path = '/Users/elizaveta/Documents/datasets/cleaned_melb_data.csv'
melb_df = pd.read_csv(file_path, index_col=0)

melb_df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,2.0,1.0,1.0,202.0,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,2.0,1.0,0.0,156.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,0.0,134.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,1.0,94.0,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,3.0,1.0,2.0,120.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [2]:
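# Encoding categorical variables
# Collect the text (object-dtype) columns.
categorical = [col for col in melb_df.columns if melb_df[col].dtype == 'object']
# Drop categorical columns with more than 35 distinct values, so that
# one-hot encoding does not produce a very large number of features.
features_to_drop = [col for col in categorical if len(melb_df[col].unique()) > 35]

melb_df = melb_df.drop(features_to_drop, axis=1)

# One-hot encode the remaining categorical columns.
melb_df = pd.get_dummies(melb_df)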
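
`pd.get_dummies` creates one indicator column per category, so the indicators of each original column sum to 1 in every row. Together with the intercept this makes the features perfectly collinear, which can make an unregularized least-squares fit numerically unstable. The sketch below uses a small made-up frame (not the real data) to show the `drop_first=True` option, which removes one level per categorical column.

```python
import pandas as pd

# Toy frame standing in for melb_df; the values are made up for illustration.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Type": ["h", "u", "t", "h"],
})

full = pd.get_dummies(toy)                      # one indicator per category
reduced = pd.get_dummies(toy, drop_first=True)  # first level of each column dropped

print(list(full.columns))     # ['Rooms', 'Type_h', 'Type_t', 'Type_u']
print(list(reduced.columns))  # ['Rooms', 'Type_t', 'Type_u']

# In the full encoding the Type_* indicators always add up to 1.
row_sums = full[["Type_h", "Type_t", "Type_u"]].sum(axis=1)
print(row_sums.tolist())
```

Whether dropping a level changes the results on the real dataset is not something this notebook tests.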

### Normalization

In [3]:
cols = melb_df.columns

melb_df.head(3)

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,Lattitude,Longtitude,...,CouncilArea_Yarra,CouncilArea_Yarra Ranges,Regionname_Eastern Metropolitan,Regionname_Eastern Victoria,Regionname_Northern Metropolitan,Regionname_Northern Victoria,Regionname_South-Eastern Metropolitan,Regionname_Southern Metropolitan,Regionname_Western Metropolitan,Regionname_Western Victoria
0,2,1480000.0,2.5,3067.0,2.0,1.0,1.0,202.0,-37.7996,144.9984,...,1,0,0,0,1,0,0,0,0,0
1,2,1035000.0,2.5,3067.0,2.0,1.0,0.0,156.0,-37.8079,144.9934,...,1,0,0,0,1,0,0,0,0,0
2,3,1465000.0,2.5,3067.0,3.0,2.0,0.0,134.0,-37.8093,144.9944,...,1,0,0,0,1,0,0,0,0,0


In [4]:
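# Standardize every column to zero mean and unit variance.
# Note: the scaler is fit on all rows, before the train/validation split,
# and it also scales the target (Price), so the errors reported below are
# in standard deviations of Price rather than in price units.
scaler = StandardScaler()

df_norm = scaler.fit_transform(melb_df)

df = pd.DataFrame(df_norm, columns=melb_df.columns)

df.head()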

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,Lattitude,Longtitude,...,CouncilArea_Yarra,CouncilArea_Yarra Ranges,Regionname_Eastern Metropolitan,Regionname_Eastern Victoria,Regionname_Northern Metropolitan,Regionname_Northern Victoria,Regionname_South-Eastern Metropolitan,Regionname_Southern Metropolitan,Regionname_Western Metropolitan,Regionname_Western Victoria
0,-0.981463,0.632448,-1.301485,-0.422415,-0.947035,-0.772376,-0.635232,-0.089316,0.12116,0.03064,...,4.470926,-0.036431,-0.34854,-0.062595,1.578291,-0.05503,-0.185129,-0.726924,-0.52657,-0.0486
1,-0.981463,-0.06364,-1.301485,-0.422415,-0.947035,-0.772376,-1.676467,-0.100843,0.016437,-0.017478,...,4.470926,-0.036431,-0.34854,-0.062595,1.578291,-0.05503,-0.185129,-0.726924,-0.52657,-0.0486
2,0.064876,0.608984,-1.301485,-0.422415,0.088284,0.673367,-1.676467,-0.106356,-0.001227,-0.007855,...,4.470926,-0.036431,-0.34854,-0.062595,1.578291,-0.05503,-0.185129,-0.726924,-0.52657,-0.0486
3,0.064876,-0.353025,-1.301485,-0.422415,0.088284,0.673367,-0.635232,-0.11638,0.155226,0.016204,...,4.470926,-0.036431,-0.34854,-0.062595,1.578291,-0.05503,-0.185129,-0.726924,-0.52657,-0.0486
4,1.111216,0.820157,-1.301485,-0.422415,0.088284,-0.772376,0.406003,-0.109864,0.025269,-0.010742,...,4.470926,-0.036431,-0.34854,-0.062595,1.578291,-0.05503,-0.185129,-0.726924,-0.52657,-0.0486


In [5]:
y = df.Price

features = [col for col in df.columns if col != 'Price']
X = df[features]

X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7, random_state=0)

print(f'X_train.shape: {X_train.shape}')
print(f'X_val.shape: {X_val.shape}')

X_train.shape: (9506, 59)
X_val.shape: (4074, 59)
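
In the cells above the scaler is fit on the whole frame before the split. A common alternative is to split first and let a pipeline fit the scaler on the training rows only, leaving the target unscaled. The following is a minimal sketch of that alternative on synthetic data; `X_demo`, `y_demo`, and `pipe` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption: stand-in for the real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0]
y_demo = X_demo @ np.array([5.0, 0.5, 0.05]) + rng.normal(scale=0.1, size=200)

# Split first, so the validation rows never influence the scaler.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0
)

# The pipeline fits StandardScaler on X_tr only, then fits Ridge on the scaled X_tr.
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_tr, y_tr)

# The target was not scaled, so this MAE is in the original units of y_demo.
mae_demo = mean_absolute_error(y_va, pipe.predict(X_va))
print(f"Validation MAE: {mae_demo:.3f}")
```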


## LinearRegression 

In [26]:
linear_model = LinearRegression()

linear_model.fit(X_train, y_train)
pred_val = linear_model.predict(X_val)
# MAE is measured on the validation set; score() returns R^2, computed here on the training set.
print(f'Mean absolute error of this linear regression model is\n {mean_absolute_error(y_val, pred_val)}')
print(f'The coefficient of determination of this linear regression model is\n {linear_model.score(X_train, y_train)}')

Mean absolute error of this linear regression model is
 80050518.15376934
The coefficient of determination of this linear regression model is
 0.6254258836610043
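
The coefficient of determination above is computed on the training split, while the MAE is computed on the validation split, so the two numbers describe different data. The sketch below, on synthetic data with many uninformative features, illustrates how a training R^2 can look good while the validation R^2 is much lower. All names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: only the first of 30 features carries signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 30))
y_demo = X_demo[:, 0] + rng.normal(size=80)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.5, random_state=0
)

model = LinearRegression().fit(X_tr, y_tr)

r2_train = model.score(X_tr, y_tr)  # R^2 on the rows the model was fit on
r2_val = model.score(X_va, y_va)    # R^2 on held-out rows

print(f"train R^2: {r2_train:.2f}")
print(f"validation R^2: {r2_val:.2f}")
```

In the notebook's setting, the equivalent check would be `linear_model.score(X_val, y_val)`.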


## Ridge

In [27]:
# Ridge with the default regularization strength (alpha=1.0).
linear_model_ridge = Ridge()
linear_model_ridge.fit(X_train, y_train)

pred_ridge = linear_model_ridge.predict(X_val)
# MAE is measured on the validation set; R^2 is computed on the training set.
print(f'Mean absolute error of this linear regression model is\n {mean_absolute_error(y_val, pred_ridge)}')
print(f'The coefficient of determination of this linear regression model is\n {linear_model_ridge.score(X_train, y_train)}')

Mean absolute error of this linear regression model is
 0.4017977838296181
The coefficient of determination of this linear regression model is
 0.6254282170791652


## RidgeCV

In [28]:
# RidgeCV picks alpha by cross-validation from its default candidates (0.1, 1.0, 10.0).
linear_model_ridgeCV = RidgeCV()
linear_model_ridgeCV.fit(X_train, y_train)

pred_ridgeCV = linear_model_ridgeCV.predict(X_val)
# MAE is measured on the validation set; R^2 is computed on the training set.
print(f'Mean absolute error of this linear regression model is\n {mean_absolute_error(y_val, pred_ridgeCV)}')
print(f'The coefficient of determination of this linear regression model is\n {linear_model_ridgeCV.score(X_train, y_train)}')

Mean absolute error of this linear regression model is
 0.40173315692535894
The coefficient of determination of this linear regression model is
 0.6254269144271357


## LassoCV

In [29]:
# LassoCV picks alpha by cross-validation along an automatically generated path.
linear_model_lassoCV = LassoCV()
linear_model_lassoCV.fit(X_train, y_train)

pred_lassoCV = linear_model_lassoCV.predict(X_val)
# MAE is measured on the validation set; R^2 is computed on the training set.
print(f'Mean absolute error of this linear regression model is\n {mean_absolute_error(y_val, pred_lassoCV)}')
print(f'The coefficient of determination of this linear regression model is\n {linear_model_lassoCV.score(X_train, y_train)}')

Mean absolute error of this linear regression model is
 0.40144777119235514
The coefficient of determination of this linear regression model is
 0.6239188515602888


# Let's compare all these models now

In [32]:
lst_preds = [pred_val, pred_ridge, pred_ridgeCV, pred_lassoCV]
lst_models = [linear_model, linear_model_ridge, linear_model_ridgeCV, linear_model_lassoCV]
models = pd.Series(['linear_model', 'linear_model_ridge', 'linear_model_ridgeCV', 'linear_model_lassoCV'])

# Validation MAE and training R^2 for each model, rounded to two decimals.
mae = pd.Series([round(mean_absolute_error(y_val, pred), 2) for pred in lst_preds])
score = pd.Series([round(model.score(X_train, y_train), 2) for model in lst_models])

models_analysis = pd.concat([models, mae, score], axis=1, keys=['Models', 'Mean Absolute Error', 'Coefficient of determination'])

models_analysis

| | Models | Mean Absolute Error | Coefficient of determination |
|---|---|---|---|
| 0 | linear_model | 80050518.15 | 0.63 |
| 1 | linear_model_ridge | 0.4 | 0.63 |
| 2 | linear_model_ridgeCV | 0.4 | 0.63 |
| 3 | linear_model_lassoCV | 0.4 | 0.62 |
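
The comparison above pairs a validation MAE with a training R^2. One way to put both metrics on the same footing is to loop over the models and compute both on the held-out split. The following is a sketch on synthetic data, not a rerun of the notebook; `X_demo`, `y_demo`, and `summary` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, LinearRegression, Ridge, RidgeCV
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real features and target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "RidgeCV": RidgeCV(),
    "LassoCV": LassoCV(),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_va)
    # Both metrics are computed on the validation split.
    rows.append({
        "Model": name,
        "Validation MAE": mean_absolute_error(y_va, pred),
        "Validation R2": r2_score(y_va, pred),
    })

summary = pd.DataFrame(rows).round(2)
print(summary)
```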
