<a href="https://colab.research.google.com/github/rohitpaul23/Python_Assignment/blob/main/regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **House Price Prediction with Linear Regression**
In this assignment, we'll predict the price of a house from information such as its location, area, and number of rooms. We'll use the dataset from the House Prices - Advanced Regression Techniques competition on Kaggle and follow a step-by-step process to train our model:

1. Download and explore the data
2. Prepare the dataset for training
3. Train a linear regression model
4. Make predictions and evaluate the model

In [103]:
import numpy as np
import pandas as pd
import matplotlib

The data is available at the URL below. We start by downloading it.

In [104]:
dataset_url = 'https://github.com/JovianML/opendatasets/raw/master/data/house-prices-advanced-regression-techniques.zip'


In [105]:
from urllib.request import urlretrieve

In [106]:
urlretrieve(dataset_url, 'house-prices.zip')

('house-prices.zip', <http.client.HTTPMessage at 0x7f7db1450d90>)

The data comes as a zip file, so we extract it to the folder 'house-prices'.

In [107]:
from zipfile import ZipFile

In [108]:
with ZipFile('house-prices.zip') as f:
    f.extractall(path='house-prices')

In [109]:
import os

In [110]:
data_dir = 'house-prices'
os.listdir(data_dir)

['test.csv', 'sample_submission.csv', 'train.csv', 'data_description.txt']

Loading the training and test data from the folder

In [111]:
train_csv_path = data_dir + '/train.csv'
test_csv_path = data_dir + '/test.csv'
train_csv_path

'house-prices/train.csv'

In [112]:
train_price_df = pd.read_csv(train_csv_path)
test_price_df = pd.read_csv(test_csv_path)

In [113]:
train_price_df

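| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |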


In [114]:
test_price_df

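| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | 2915 | 160 | RM | 21.0 | 1936 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2006 | WD | Normal |
| 1455 | 2916 | 160 | RM | 21.0 | 1894 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2006 | WD | Abnorml |
| 1456 | 2917 | 20 | RL | 160.0 | 20000 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2006 | WD | Abnorml |
| 1457 | 2918 | 85 | RL | 62.0 | 10441 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | Shed | 700 | 7 | 2006 | WD | Normal |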


In [115]:
no_of_rows = train_price_df.shape[0]
no_of_cols = train_price_df.shape[1]
no_of_rows, no_of_cols

(1460, 81)

In [116]:
train_x = train_price_df.iloc[:, :no_of_cols - 1]
train_y = train_price_df.iloc[:, no_of_cols - 1]

In [117]:
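# The test set has no SalePrice column, so take its column count from its own shape
no_of_testrows, no_of_testcols = test_price_df.shape
test_x = test_price_df
no_of_testrows, no_of_testcols

(1459, 80)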

Understanding the data

In [118]:
train_x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

Numeric and Categorical columns

In [119]:
numeric_cols = train_x.select_dtypes(include=['int64', 'float64']).columns.tolist()
numeric_cols, len(numeric_cols)


(['Id',
  'MSSubClass',
  'LotFrontage',
  'LotArea',
  'OverallQual',
  'OverallCond',
  'YearBuilt',
  'YearRemodAdd',
  'MasVnrArea',
  'BsmtFinSF1',
  'BsmtFinSF2',
  'BsmtUnfSF',
  'TotalBsmtSF',
  '1stFlrSF',
  '2ndFlrSF',
  'LowQualFinSF',
  'GrLivArea',
  'BsmtFullBath',
  'BsmtHalfBath',
  'FullBath',
  'HalfBath',
  'BedroomAbvGr',
  'KitchenAbvGr',
  'TotRmsAbvGrd',
  'Fireplaces',
  'GarageYrBlt',
  'GarageCars',
  'GarageArea',
  'WoodDeckSF',
  'OpenPorchSF',
  'EnclosedPorch',
  '3SsnPorch',
  'ScreenPorch',
  'PoolArea',
  'MiscVal',
  'MoSold',
  'YrSold'],
 37)

In [120]:
categorical_cols = train_x.select_dtypes(include=['object']).columns.tolist()
categorical_cols, len(categorical_cols)

(['MSZoning',
  'Street',
  'Alley',
  'LotShape',
  'LandContour',
  'Utilities',
  'LotConfig',
  'LandSlope',
  'Neighborhood',
  'Condition1',
  'Condition2',
  'BldgType',
  'HouseStyle',
  'RoofStyle',
  'RoofMatl',
  'Exterior1st',
  'Exterior2nd',
  'MasVnrType',
  'ExterQual',
  'ExterCond',
  'Foundation',
  'BsmtQual',
  'BsmtCond',
  'BsmtExposure',
  'BsmtFinType1',
  'BsmtFinType2',
  'Heating',
  'HeatingQC',
  'CentralAir',
  'Electrical',
  'KitchenQual',
  'Functional',
  'FireplaceQu',
  'GarageType',
  'GarageFinish',
  'GarageQual',
  'GarageCond',
  'PavedDrive',
  'PoolQC',
  'Fence',
  'MiscFeature',
  'SaleType',
  'SaleCondition'],
 43)
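The cells above select columns by exact dtype (`int64`, `float64`, `object`). As a hedged alternative, `select_dtypes` also accepts the generic `'number'` selector, which catches any numeric dtype and lets everything else fall into the non-numeric group. The sketch below uses a tiny made-up frame; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Tiny made-up frame mixing integer, float, and string columns
df = pd.DataFrame({
    "LotArea": [8450, 9600],
    "LotFrontage": [65.0, 80.0],
    "Street": ["Pave", "Pave"],
})

# 'number' matches every numeric dtype (int32, int64, float32, float64, ...)
numeric = df.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical here
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

print(numeric)
print(non_numeric)
```

This split does not depend on whether the strings are stored as `object` or as a dedicated string dtype.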

**Missing value treatment**

Find the numerical columns that contain NA (missing) values and fill them with the column mean.

In [121]:
missing_counts = train_x[numeric_cols].isna().sum().sort_values(ascending=False)
col_with_na = missing_counts[missing_counts > 0].keys().tolist()
col_with_na

['LotFrontage', 'GarageYrBlt', 'MasVnrArea']

In [122]:
from sklearn.impute import SimpleImputer

In [123]:
imputer = SimpleImputer(strategy = 'mean')
imputer.fit(train_x[col_with_na])

SimpleImputer()

In [124]:
list(imputer.statistics_)

[70.04995836802665, 1978.5061638868744, 103.68526170798899]

In [125]:
train_x[col_with_na] = imputer.transform(train_x[col_with_na])

In [126]:
train_x[numeric_cols].isna().sum()

Id               0
MSSubClass       0
LotFrontage      0
LotArea          0
OverallQual      0
OverallCond      0
YearBuilt        0
YearRemodAdd     0
MasVnrArea       0
BsmtFinSF1       0
BsmtFinSF2       0
BsmtUnfSF        0
TotalBsmtSF      0
1stFlrSF         0
2ndFlrSF         0
LowQualFinSF     0
GrLivArea        0
BsmtFullBath     0
BsmtHalfBath     0
FullBath         0
HalfBath         0
BedroomAbvGr     0
KitchenAbvGr     0
TotRmsAbvGrd     0
Fireplaces       0
GarageYrBlt      0
GarageCars       0
GarageArea       0
WoodDeckSF       0
OpenPorchSF      0
EnclosedPorch    0
3SsnPorch        0
ScreenPorch      0
PoolArea         0
MiscVal          0
MoSold           0
YrSold           0
dtype: int64
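To make the mean-imputation step concrete, here is a minimal NumPy sketch of what a mean strategy computes: each missing entry is replaced by the mean of the observed values in its own column. The numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Toy feature matrix: two columns, NaN marks a missing value
X = np.array([
    [60.0, 2003.0],
    [np.nan, 1976.0],
    [80.0, np.nan],
    [70.0, 2001.0],
])

# Column means over the observed (non-NaN) entries only
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its own column
X_filled = np.where(np.isnan(X), col_means, X)

print(col_means)
print(X_filled)
```

The stored means play the same role as `imputer.statistics_` in the cells above.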

**Scaling Numerical Values**

Range of each column

In [127]:
train_x[numeric_cols].describe().loc[['min', 'max']]

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 1418.0 | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 |


In [128]:
from sklearn.preprocessing import MinMaxScaler

In [129]:
scaler = MinMaxScaler()
scaler.fit(train_x[numeric_cols])

MinMaxScaler()

In [130]:
train_x[numeric_cols] = scaler.transform(train_x[numeric_cols])

In [131]:
train_x[numeric_cols].describe().loc[['min', 'max']]

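| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |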
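Min-max scaling maps each value to `(x - min) / (max - min)`, so the column minimum becomes 0 and the maximum becomes 1. As a quick check, the sketch below applies that formula by hand to the first three LotArea values from the training table, using the LotArea minimum and maximum reported above (1300 and 215245).

```python
import numpy as np

# LotArea for the first three training rows, plus the column min and max
lot_area = np.array([8450.0, 9600.0, 11250.0])
col_min, col_max = 1300.0, 215245.0

# Min-max formula: shift by the minimum, divide by the range
scaled = (lot_area - col_min) / (col_max - col_min)

print(np.round(scaled, 6))
```

The results should agree, to six decimals, with the scaled LotArea values that appear in the processed training frame later in the notebook.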


**Encoding Categorical Columns**

Number of unique values in each column

In [132]:
train_x[categorical_cols].nunique()

MSZoning          5
Street            2
Alley             2
LotShape          4
LandContour       4
Utilities         2
LotConfig         5
LandSlope         3
Neighborhood     25
Condition1        9
Condition2        8
BldgType          5
HouseStyle        8
RoofStyle         6
RoofMatl          8
Exterior1st      15
Exterior2nd      16
MasVnrType        4
ExterQual         4
ExterCond         5
Foundation        6
BsmtQual          4
BsmtCond          4
BsmtExposure      4
BsmtFinType1      6
BsmtFinType2      6
Heating           6
HeatingQC         5
CentralAir        2
Electrical        5
KitchenQual       4
Functional        7
FireplaceQu       5
GarageType        6
GarageFinish      3
GarageQual        5
GarageCond        5
PavedDrive        3
PoolQC            3
Fence             4
MiscFeature       4
SaleType          9
SaleCondition     6
dtype: int64

In [133]:
from sklearn.preprocessing import OneHotEncoder

In [134]:
# Note: scikit-learn >= 1.2 renames this argument to sparse_output=False
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoder.fit(train_x[categorical_cols])

OneHotEncoder(handle_unknown='ignore', sparse=False)

In [135]:
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))  # get_feature_names in scikit-learn < 1.0

The encoder's get_feature_names_out method returns a unique name for each newly created column.

In [136]:
train_x[encoded_cols] = encoder.transform(train_x[categorical_cols])


In [137]:
train_x

| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.235294 | RL | 0.150685 | 0.033420 | Pave | NaN | Reg | Lvl | AllPub | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.000685 | 0.000000 | RL | 0.202055 | 0.038795 | Pave | NaN | Reg | Lvl | AllPub | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.001371 | 0.235294 | RL | 0.160959 | 0.046507 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.002056 | 0.294118 | RL | 0.133562 | 0.038561 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.002742 | 0.235294 | RL | 0.215753 | 0.060576 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 0.997258 | 0.235294 | RL | 0.140411 | 0.030929 | Pave | NaN | Reg | Lvl | AllPub | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1456 | 0.997944 | 0.000000 | RL | 0.219178 | 0.055505 | Pave | NaN | Reg | Lvl | AllPub | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1457 | 0.998629 | 0.294118 | RL | 0.154110 | 0.036187 | Pave | NaN | Reg | Lvl | AllPub | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1458 | 0.999315 | 0.000000 | RL | 0.160959 | 0.039342 | Pave | NaN | Reg | Lvl | AllPub | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


One-hot encoding adds 268 new columns, so the number of columns in train_x increases from 80 to 348.
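To show what one-hot encoding produces, here is a small hand-rolled sketch in plain Python and NumPy. It is not scikit-learn's implementation; the helper names `fit_categories` and `one_hot` are made up for illustration, and the category 'Dirt' is a hypothetical unseen value. It mimics the `handle_unknown='ignore'` behaviour, where a category that was not seen during fitting is encoded as a row of zeros.

```python
import numpy as np

def fit_categories(values):
    """Return the sorted unique categories seen during fitting."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode values as 0/1 rows; unseen categories become all-zero rows."""
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        if value in index:
            encoded[row, index[value]] = 1.0
    return encoded

train_street = ["Pave", "Grvl", "Pave"]
categories = fit_categories(train_street)
column_names = [f"Street_{c}" for c in categories]

train_encoded = one_hot(train_street, categories)
# 'Dirt' is a hypothetical category that never appeared during fitting
test_encoded = one_hot(["Pave", "Dirt"], categories)

print(column_names)
print(train_encoded)
print(test_encoded)
```

Each original column is replaced by one 0/1 column per category, which is why the column count grows so quickly.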

**Creating Training and Validation set from the processed dataframe**

In [138]:
from sklearn.model_selection import train_test_split

In [139]:
trainX, valX, trainY, valY = train_test_split(
    train_x[numeric_cols + encoded_cols],
    train_y,
    test_size=0.2,
    random_state=41,
)

In [140]:
trainX.shape, trainY.shape, valX.shape, valY.shape

((1168, 305), (1168,), (292, 305), (292,))
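The shapes above follow from the 80/20 split of 1460 rows. A minimal sketch of the same idea with NumPy: shuffle the row indices and hold out the first 20%. The seed and variable names are illustrative, and this does not reproduce the exact rows that `train_test_split` selects.

```python
import math
import numpy as np

n_rows = 1460
test_size = 0.2

# Seed chosen for illustration only; it does not match sklearn's shuffle
rng = np.random.default_rng(41)
shuffled = rng.permutation(n_rows)

# Size of the validation set, rounded up to a whole number of rows
n_val = math.ceil(n_rows * test_size)
val_idx, train_idx = shuffled[:n_val], shuffled[n_val:]

print(len(train_idx), len(val_idx))
```

Every row lands in exactly one of the two sets, so the validation score is computed on rows the model never saw during fitting.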

# Train a Linear Regression Model

In [141]:
from sklearn.linear_model import LinearRegression

In [142]:
model1 = LinearRegression()

In [143]:
model1.fit(trainX, trainY)

LinearRegression()

In [144]:
model1.score(trainX, trainY)

0.938957671257237

In [145]:
model1.score(valX, valY)

-2.763069284660115e+17

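The training R² is about 0.94, but the validation R² is hugely negative (about -2.8e17), so the model is overfitting severely: it fits the training rows well but produces wildly wrong predictions for some unseen rows. Regularization can fix this.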
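The `score` method of a scikit-learn regressor returns the coefficient of determination, R² = 1 - SS_res / SS_tot. It is 1 for perfect predictions, 0 for a model that always predicts the mean, and it has no lower bound, which is how a value like -2.8e17 can arise. The sketch below computes R² by hand on made-up numbers to show all three cases.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([100.0, 200.0, 300.0])  # made-up prices

# Perfect predictions
perfect = r2_score(y_true, y_true)
# Always predicting the mean
baseline = r2_score(y_true, np.full(3, y_true.mean()))
# One wildly wrong prediction dominates the residual sum of squares
wild = r2_score(y_true, np.array([100.0, 200.0, 1e9]))

print(perfect, baseline, wild)
```

A single exploding prediction is enough to push the score far below zero.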

Training using **Ridge regression** (adding a regularization term)

In [150]:
from sklearn.linear_model import Ridge

In [151]:
model2 = Ridge()

In [152]:
model2.fit(trainX, trainY)

Ridge()

In [153]:
model2.score(trainX, trainY)

0.9218195938022078

In [154]:
model2.score(valX, valY)

0.8742297942493717
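One plausible reason Ridge helps here is that one-hot columns are strongly collinear, which leaves ordinary least squares with no unique, stable solution. Ridge adds a penalty `alpha * ||w||^2`, and without an intercept its weights solve `(X^T X + alpha * I) w = X^T y`. The sketch below is a simplified closed-form version on synthetic data, not scikit-learn's implementation (which also fits an unpenalized intercept); `ridge_weights` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the third column duplicates the first, so the columns
# are perfectly collinear and X^T X is singular
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
X = np.column_stack([x1, x2, x1])
y = 3.0 * x1 - 2.0 * x2 + rng.normal(scale=0.1, size=50)

def ridge_weights(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rank = np.linalg.matrix_rank(X)       # fewer than 3: columns are dependent
w = ridge_weights(X, y, alpha=1.0)    # alpha=1.0 is also Ridge()'s default

print(rank)
print(w)
```

The penalty makes the system solvable and splits the weight evenly between the two duplicated columns instead of letting the coefficients grow without bound.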

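**Making predictions on the test set**

The test data goes through the same preprocessing steps: imputing missing values, scaling, and one-hot encoding. The imputer and scaler are refit on the test data here, while the encoder fitted on the training data is reused.

In [155]: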
missing_counts = test_x[numeric_cols].isna().sum().sort_values(ascending=False)
col_with_na = missing_counts[missing_counts > 0].keys().tolist()

In [147]:
imputer = SimpleImputer(strategy = 'mean')
imputer.fit(test_x[col_with_na])
test_x[col_with_na] = imputer.transform(test_x[col_with_na])

In [148]:
scaler = MinMaxScaler()
scaler.fit(test_x[numeric_cols])
test_x[numeric_cols] = scaler.transform(test_x[numeric_cols])

In [149]:
test_x[encoded_cols] = encoder.transform(test_x[categorical_cols])
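The cells above fit a new imputer and scaler on the test data, so the test features are filled and scaled with different statistics than the ones the model was trained on. A common alternative is to fit each transformer once on the training data and only call `transform` on the test data. The sketch below shows that pattern on made-up single-feature data. In this notebook it would also mean fitting the imputer on all of `numeric_cols`, in case the test set has missing values in columns that were complete in training.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data; the test range differs from the training range
train = np.array([[0.0], [50.0], [np.nan], [100.0]])
test = np.array([[25.0], [np.nan], [200.0]])

# Fit on the training data only
imputer = SimpleImputer(strategy='mean').fit(train)
scaler = MinMaxScaler().fit(imputer.transform(train))

# Reuse the fitted objects for the test data
test_ready = scaler.transform(imputer.transform(test))

print(test_ready.ravel())
```

Here 200 maps to 2.0 because it lies outside the training range. Refitting the scaler on the test set would instead map 25 to 0 and 200 to 1, which changes the meaning of the feature between training and prediction.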


In [161]:
predict_test_y = model2.predict(test_x[trainX.columns.tolist()])
predict_test_y

array([103099.03803305, 136863.11679585, 168517.22578795, ...,
       147843.90121146,  99469.54048185, 209196.83630726])