Importing necessary packages:

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from google.colab import drive

# Mount Google Drive so that the datasets uploaded there can be read.
drive.mount('/content/drive')

%matplotlib inline


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Loading testing and training data:

In [2]:
train_df = pd.read_csv('/content/drive/MyDrive/Datasets/House prices prediction dataset/train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/Datasets/House prices prediction dataset/test.csv')

Merging training and testing data:

In [3]:
Df = pd.concat([train_df,test_df])
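
Note that pd.concat keeps the row labels of each input frame, so the merged frame carries duplicate index values (the Df.info() output further down reports 2919 entries with labels running only up to 1458). The following is a minimal sketch, on toy frames, of one way to keep the index unique and the two sources distinguishable. The `source` column and the toy values are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (illustrative values only).
toy_train = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500.0, 181500.0, 223500.0],
})
toy_test = pd.DataFrame({"Id": [4, 5], "LotArea": [9550, 14260]})

# Tag each row with its origin (hypothetical "source" column), then
# concatenate with ignore_index=True so every row gets a unique label.
combined = pd.concat(
    [toy_train.assign(source="train"), toy_test.assign(source="test")],
    ignore_index=True,
)

# The two parts can later be recovered without relying on row order.
train_part = combined[combined["source"] == "train"]
test_part = combined[combined["source"] == "test"]
```

With a unique index, later row-wise operations (dropping rows, assigning columns back) align on exactly one row per label.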

Inspecting data, column types and rows:

In [4]:
Df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500.0
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500.0
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500.0
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000.0
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000.0


In [5]:
Df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2919 entries, 0 to 1458
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             2919 non-null   int64  
 1   MSSubClass     2919 non-null   int64  
 2   MSZoning       2915 non-null   object 
 3   LotFrontage    2433 non-null   float64
 4   LotArea        2919 non-null   int64  
 5   Street         2919 non-null   object 
 6   Alley          198 non-null    object 
 7   LotShape       2919 non-null   object 
 8   LandContour    2919 non-null   object 
 9   Utilities      2917 non-null   object 
 10  LotConfig      2919 non-null   object 
 11  LandSlope      2919 non-null   object 
 12  Neighborhood   2919 non-null   object 
 13  Condition1     2919 non-null   object 
 14  Condition2     2919 non-null   object 
 15  BldgType       2919 non-null   object 
 16  HouseStyle     2919 non-null   object 
 17  OverallQual    2919 non-null   int64  
 18  OverallCond  

So, we have a total of 81 columns. Note that the 'SalePrice' column is our target. That leaves 80 columns (including the 'Id' row identifier) to help us predict the target, i.e. the sale price of the house.

In [6]:
Df.shape

(2919, 81)

The dataset has 2919 rows in total.

In [12]:
numerical_features = Df.select_dtypes(exclude=['object']).columns
categorical_features = Df.select_dtypes(include=['object']).columns

In [13]:
numerical_features

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')

In [14]:
categorical_features

Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object')

There is a mix of categorical and numerical features to work with.
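
The numerical list above also contains 'Id' (a row identifier) and 'SalePrice' (the target), neither of which is a predictor. A small sketch, on a toy frame, of how the predictor columns could be separated out; `numeric_predictors` is a hypothetical name and the toy values are illustrative only.

```python
import pandas as pd

# Toy frame; the string column is given an explicit object dtype so that
# select_dtypes behaves as in the cells above.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "MSSubClass": [60, 20, 60],
    "LotArea": [8450, 9600, 11250],
    "MSZoning": pd.Series(["RL", "RL", "RM"], dtype="object"),
    "SalePrice": [208500.0, 181500.0, 223500.0],
})

numerical = toy.select_dtypes(exclude=["object"]).columns
categorical = toy.select_dtypes(include=["object"]).columns

# Remove the identifier and the target from the numeric predictors.
numeric_predictors = numerical.drop(["Id", "SalePrice"])
print(list(numeric_predictors))
```

Some integer columns, such as MSSubClass, are category codes rather than quantities, so a dtype-based split is only a first approximation.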

**Exploratory Data Analysis (EDA):**

In [10]:
Df[numerical_features].isnull().any()

Unnamed: 0,0
Id,False
MSSubClass,False
LotFrontage,True
LotArea,False
OverallQual,False
OverallCond,False
YearBuilt,False
YearRemodAdd,False
MasVnrArea,True
BsmtFinSF1,True


In [11]:
Df[numerical_features].isnull().sum()

Unnamed: 0,0
Id,0
MSSubClass,0
LotFrontage,486
LotArea,0
OverallQual,0
OverallCond,0
YearBuilt,0
YearRemodAdd,0
MasVnrArea,23
BsmtFinSF1,1


The 1459 null values in the SalePrice column most likely belong to the test rows, whose prices our model needs to predict.
The 'LotFrontage', 'GarageYrBlt' and 'MasVnrArea' columns have noticeably more null values than the other numerical columns. We need to decide what to do about them.

According to the data_description file packaged with the datasets from Kaggle, these columns are described as follows:

1. LotFrontage - Linear feet of street connected to property
2. GarageYrBlt - Year garage was built
3. MasVnrArea - Masonry veneer area in square feet
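
To see at a glance which columns need attention, the null counts can be ranked and shown next to their share of all rows. This is a minimal sketch on a toy frame; the values and the `summary` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (illustrative values only).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "MasVnrArea": [196.0, 0.0, np.nan, 350.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

null_counts = toy.isnull().sum()   # number of nulls per column
null_share = toy.isnull().mean()   # fraction of rows that are null

# Keep only columns that have nulls, most affected first.
summary = (
    pd.DataFrame({"nulls": null_counts, "share": null_share})
    .query("nulls > 0")
    .sort_values("nulls", ascending=False)
)
print(summary)
```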


In [22]:
Df['LotFrontage'] = Df['LotFrontage'].fillna(Df['LotFrontage'].mean())
Df['GarageYrBlt'] = Df['GarageYrBlt'].fillna(Df['GarageYrBlt'].mean())
Df['MasVnrArea'] = Df['MasVnrArea'].fillna(Df['MasVnrArea'].mean())

So, we have replaced the null values in each of these columns with the mean of that column.
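
Because the mean is taken over the merged frame, rows from the test file also influence the fill value. One possible alternative, sketched here on toy data, is to compute the statistic from the training rows only. Treating rows with a known SalePrice as the training rows is an assumption of this sketch, as are the toy values.

```python
import numpy as np
import pandas as pd

# Toy merged frame: the last two rows play the role of test rows,
# which have no SalePrice.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, np.nan, 100.0],
    "SalePrice": [140000.0, 181500.0, 223500.0, np.nan, np.nan],
})

# Rows with a known target are taken to be the training rows.
is_train = toy["SalePrice"].notna()

# Fill value computed from training rows only: (60 + 80) / 2.
fill_value = toy.loc[is_train, "LotFrontage"].mean()

# Apply the same fill value to every row, training and test alike.
toy["LotFrontage"] = toy["LotFrontage"].fillna(fill_value)
```

For a year column such as GarageYrBlt, the median may be a more natural choice than the mean, since the mean is generally not a whole year.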

In [23]:
Df[numerical_features].isnull().sum()

Unnamed: 0,0
Id,0
MSSubClass,0
LotFrontage,0
LotArea,0
OverallQual,0
OverallCond,0
YearBuilt,0
YearRemodAdd,0
MasVnrArea,0
BsmtFinSF1,1


We will drop the rows that contain the remaining missing values, since they are few and removing them should not cost us much data for accurate predictions.

In [24]:
Df[numerical_features] = Df[numerical_features].dropna(axis=0)
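
A caveat about the cell above: when a DataFrame is assigned back to a column selection, pandas aligns it on the row index, so the number of rows in Df does not change. With the duplicated row labels left by pd.concat, the alignment can also fill a row with values from a different row that shares its label. In addition, 'SalePrice' is part of numerical_features and is null for the rows to be predicted, so a dropna over that selection treats every such row as incomplete. The sketch below, on a toy frame with unique labels, contrasts the assignment with dropping rows on the frame itself. The `feature_cols` list and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with unique row labels; the last two rows have no SalePrice.
toy = pd.DataFrame({
    "BsmtFinSF1": [706.0, np.nan, 486.0, 216.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "SalePrice": [208500.0, 181500.0, np.nan, np.nan],
})
feature_cols = ["BsmtFinSF1", "LotArea"]  # target deliberately left out

# Assigning a dropna() result back to a column selection keeps every row,
# because the assigned values are re-aligned to the existing index.
attempt = toy.copy()
attempt[feature_cols] = attempt[feature_cols].dropna(axis=0)
rows_after_assignment = len(attempt)

# Dropping on the frame itself, restricted to the feature columns,
# actually removes the incomplete rows and keeps rows without a SalePrice.
cleaned = toy.dropna(subset=feature_cols)
```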

In [25]:
Df[numerical_features].isnull().sum()

Unnamed: 0,0
Id,0
MSSubClass,0
LotFrontage,0
LotArea,0
OverallQual,0
OverallCond,0
YearBuilt,0
YearRemodAdd,0
MasVnrArea,0
BsmtFinSF1,0


In [27]:
Df[categorical_features].isnull().sum()

Unnamed: 0,0
MSZoning,4
Street,0
Alley,2721
LotShape,0
LandContour,0
Utilities,2
LotConfig,0
LandSlope,0
Neighborhood,0
Condition1,0


The categorical features with null values are described as follows:

1. Alley: Type of alley access to property
2. MasVnrType: Masonry veneer type
3. BsmtQual: Evaluates the height of the basement
4. BsmtCond: Evaluates the general condition of the basement
5. BsmtExposure: Refers to walkout or garden level walls
6. BsmtFinType1: Rating of basement finished area
7. BsmtFinType2: Rating of basement finished area (if multiple types)
8. Electrical: Electrical system
9. KitchenQual: Kitchen quality
10. Functional: Home functionality
11. FireplaceQu: Fireplace quality
12. GarageType: Garage location
13. GarageFinish: Interior finish of the garage
14. GarageQual: Garage quality
15. GarageCond: Garage condition
16. PoolQC: Pool quality
17. Fence: Fence quality

We have a total of 2919 rows. Certain columns have a very large number of null values:

PoolQC: 2919, Fence: 2348, MiscFeature: 2814, Alley: 2721, MasVnrType: 1766.

Removing these columns entirely is a reasonable decision: with so few non-null values they carry little information, so dropping them should not affect our model substantially.

In [41]:
Df.drop(columns=['PoolQC', 'Fence', 'MiscFeature', 'Alley', 'MasVnrType'], inplace=True)
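
Instead of listing the sparse columns by hand, they can be selected by their share of null values. This is a minimal sketch on a toy frame; the 0.5 cut-off is a hypothetical choice for illustration and not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: two mostly-empty columns and two well-populated ones.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "Alley": [np.nan, "Grvl", np.nan, np.nan],
    "MSZoning": ["RL", "RL", np.nan, "RM"],
    "Street": ["Pave", "Pave", "Pave", "Pave"],
})

threshold = 0.5  # hypothetical cut-off on the fraction of null rows

# Fraction of null values per column.
null_share = toy.isnull().mean()

# Columns whose null share exceeds the cut-off are dropped.
sparse_cols = null_share[null_share > threshold].index.tolist()
reduced = toy.drop(columns=sparse_cols)
print(sparse_cols)
```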

In [46]:
categorical_features = [col for col in Df.columns if Df[col].dtype == 'object']
Df[categorical_features].isnull().sum()

Unnamed: 0,0
MSZoning,4
Street,0
LotShape,0
LandContour,0
Utilities,2
LotConfig,0
LandSlope,0
Neighborhood,0
Condition1,0
Condition2,0
