# Kaggle Dataset: Advanced Regression Techniques

## 1. Understanding Data

#### The dataset has a long description. Please see "description.txt" for the full details.

#### Brief of Columns:
- SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- BedroomAbvGr: Number of bedrooms above basement level
- KitchenAbvGr: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: $Value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale

## 2. Understanding Task
To predict SalePrice (the target variable) on the basis of the other columns (the independent variables).

## 3. Data Loading

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline 
#%matplotlib inline is an IPython magic command.
#With the 'inline' backend, matplotlib graphs are rendered inside the notebook, right below the code that produced them.

In [None]:
#reading dataset
train_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
#copying dataset
train_df = train_data.copy()
test_df = test_data.copy()

In [None]:
#in large dataset we use set_option to display maximum rows and columns.
pd.set_option("display.max_rows", None, "display.max_columns", None)

In [None]:
train_df.head()

In [None]:
test_df.head()

The one column missing from test_df is our dependent variable (or target variable), SalePrice.

## 4. Exploratory Data Analysis (EDA)

### 4.1. Understanding Variables and DataFrame

In [None]:
train_df.shape

In [None]:
test_df.shape

In [None]:
train_df.describe()

In [None]:
test_df.describe()

In [None]:
train_df.info()

In [None]:
test_df.info()

So, our train and test dataframes contain int, float and object columns. Both datasets also have missing values.

### 4.2. Data Pre-Processing or Data Cleaning

#### 4.2.1. Missing Values

In [None]:
#count of missing values in columns having any missing values
train_df[train_df.columns[train_df.isnull().any()]].isnull().sum()

In [None]:
#percentage of missing values in columns having any missing values
((train_df[train_df.columns[train_df.isnull().any()]].isnull().sum()* 100)/(len(train_df)))

So, a few of the columns have a lot of missing values. As a rule of thumb, we only impute columns with up to ~15-20% missing data, because the more values we fill in ourselves, the further the dataset drifts from what was actually observed. Columns above that threshold will be removed in the next steps. But first, let's check test_df.

In [None]:
#count of missing values in columns having any missing values
test_df[test_df.columns[test_df.isnull().any()]].isnull().sum()

In [None]:
#percentage of missing values in columns having any missing values
((test_df[test_df.columns[test_df.isnull().any()]].isnull().sum()* 100)/(len(test_df)))

As you can observe, train_df has fewer missing values than test_df. First, let's get rid of the columns with more than 20% missing values.

In [None]:
#dropping columns which have more than 20% missing values.
train_df.drop(['Alley','FireplaceQu','PoolQC','Fence','MiscFeature'], axis=1, inplace=True)

In [None]:
#dropping columns which have more than 20% missing values.
test_df.drop(['Alley','FireplaceQu','PoolQC','Fence','MiscFeature'], axis=1, inplace=True)

In [None]:
#replacing missing values with median or mode according to their datatypes.

train_missing = train_df.columns[train_df.isnull().any()]
missing_obj = []
missing_not_obj = []

for i in train_missing:
    if train_df[i].dtypes == object:
        missing_obj.append(i)
    else:
        missing_not_obj.append(i)
        
for i in missing_obj:
    train_df[i] = train_df[i].fillna(train_df[i].mode()[0])

for i in missing_not_obj:
    train_df[i] = train_df[i].fillna(train_df[i].median())

In [None]:
#replacing missing values with median or mode according to their datatypes.

test_missing = test_df.columns[test_df.isnull().any()]
missing_obj = []
missing_not_obj = []

for i in test_missing:
    if test_df[i].dtypes == object:
        missing_obj.append(i)
    else:
        missing_not_obj.append(i)
        
for i in missing_obj:
    test_df[i] = test_df[i].fillna(test_df[i].mode()[0])

for i in missing_not_obj:
    test_df[i] = test_df[i].fillna(test_df[i].median())

We have two datasets, one for training and one for testing. From here on, I'll analyse only the train dataset and apply the same changes to the test dataset without inspecting it.
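The cells above fill test_df with its own medians and modes. One alternative, sketched here as an assumption and not what this notebook does, is to learn the fill values from the train data once and reuse them on both dataframes, so that both are treated identically. The helper `fit_fill_values` and the toy dataframes are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train: pd.DataFrame) -> dict:
    """Return one fill value per column: median for numeric columns, mode otherwise."""
    fill = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fill[col] = train[col].median()
        else:
            fill[col] = train[col].mode()[0]
    return fill

# Toy frames standing in for train_df / test_df (hypothetical data).
toy_train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0],
                          "MSZoning": ["RL", "RL", "RM", None]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 50.0],
                         "MSZoning": [None, "RM"]})

fill_values = fit_fill_values(toy_train)       # learned from train only
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)  # same values reused on test
```

Passing a dict to `fillna` fills each column with its own value, which replaces the two explicit loops.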

#### 4.2.2. Dividing columns on the basis of datatypes.

In [None]:
train_df.nunique()

In [None]:
## Id is nothing but a serial number which will never affect our target variable.

train_df.drop(['Id'], axis=1, inplace=True)
test_df.drop(['Id'], axis=1, inplace=True)

In [None]:
year_cols = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']

In [None]:
cols_num = [] #numerical columns
cols_obj = [] #object columns

for i in train_df.columns:
    if i in year_cols:
        pass
    elif train_df[i].dtypes == object:
        cols_obj.append(i)
    else:
        cols_num.append(i)

In [None]:
cols_num_dis = [] # discrete numerical values
cols_num_con = [] # continuous numerical values

for i in cols_num:
    if train_df[i].nunique()>12:
        cols_num_con.append(i)
    else:
        cols_num_dis.append(i)

#### 4.2.2. (a) Handling Continuous Numerical Variables

In [None]:
train_df[cols_num_con].head()

#### Skewness, Kurtosis and Outliers

**Skewness**: Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
- If skewness is less than -1 or greater than 1, the distribution is highly skewed.
- If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
- If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.

**Kurtosis**: Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.
- A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0). Any distribution with kurtosis ≈3 (excess ≈0) is called mesokurtic.
- A distribution with kurtosis <3 (excess kurtosis <0) is called platykurtic. Compared to a normal distribution, its tails are shorter and thinner, and often its central peak is lower and broader.
- A distribution with kurtosis >3 (excess kurtosis >0) is called leptokurtic. Compared to a normal distribution, its tails are longer and fatter, and often its central peak is higher and sharper.

**Outliers**: Outliers are data records that differ dramatically from all the others in one or more characteristics. In other words, an outlier is a value that falls far outside the usual range and can (and probably will) distort the results obtained through algorithms and analytical systems.
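To make the three definitions concrete, here is a small sketch on made-up numbers. Note that pandas' `kurtosis()` reports excess kurtosis, so a normal distribution scores about 0, not 3. The IQR rule used to flag outliers is one common convention and is my assumption here; the notebook itself only inspects outliers visually.

```python
import pandas as pd

# Hypothetical right-skewed sample: mostly small values plus one extreme value.
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 50], dtype=float)

skewness = s.skew()             # > 0 means a longer right tail
excess_kurtosis = s.kurtosis()  # pandas reports excess kurtosis (normal = 0)

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(f"skewness={skewness:.2f}, excess kurtosis={excess_kurtosis:.2f}")
print("outliers:", outliers.tolist())
```

Here the single extreme value is what drives both the skewness and the kurtosis up.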

In [None]:
#checking for skewness and kurtosis values in dataset.
for i in cols_num_con:
    print(f'For {i} Skewness is {round(train_df[i].skew(),2)} and Kurtosis is {round(train_df[i].kurtosis(),2)}')

In [None]:
#plotting histplot for dataset to check skewness.
for i in cols_num_con:
    sns.histplot(train_df[i], kde=True)
    plt.show()

In [None]:
#plotting scatter-plot for a dataset to check outliers.
for i in cols_num_con:
    sns.scatterplot(data=train_df, x=train_df[i].index, y=i)
    plt.show()

As you can observe, almost all the columns are skewed and contain outliers. One solution is log transformation.

**Log transformation**: Log transformation is a data transformation method that replaces each value x with log(x). Its benefit is that it compresses large values, so it reduces right skew and the influence of large outliers at the same time, because that skew is largely caused by a few extreme values in the data.

Many of the columns contain 0 values, and log(0) is undefined: no real power of a positive base equals zero. So I'm going to apply log(x+1) instead of log(x).

This transformation is applied to every continuous numerical column.
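As a quick check of the log(x+1) idea, the sketch below uses NumPy's `np.log1p`, which computes log(1 + x), and its inverse `np.expm1`. The values are made up for illustration.

```python
import numpy as np

values = np.array([0.0, 9.0, 99.0, 999.0])  # hypothetical square-footage values

logged = np.log1p(values)     # same as np.log(values + 1), and defined at 0
restored = np.expm1(logged)   # inverse transform, back to the original scale

print(logged)
print(restored)
```

Because SalePrice is numeric with more than 12 unique values, it ends up in cols_num_con and is log-transformed as well. A model trained on it will predict on the log scale, so something like `np.expm1` would be needed to get back to dollars.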

In [None]:
#applying log(x+1) to each continuous column and re-plotting its distribution.
for i in cols_num_con:
    train_df[i] = np.log(train_df[i]+1)
    sns.histplot(train_df[i], kde=True)
    plt.show()

In [None]:
train_df[cols_num_con].head()

In [None]:
cols_num_com = ['MSSubClass','LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF', '2ndFlrSF',
 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch','MiscVal']
for i in cols_num_com:
    test_df[i] = np.log(test_df[i]+1)

Done with all continuous numerical columns.

#### 4.2.2. (b) Handling Discrete Numerical Variables

In [None]:
train_df[cols_num_dis].head()

In [None]:
for i in cols_num_dis:
    sns.catplot(x = i, y = 'SalePrice', data = train_df)
    plt.show()

In [None]:
Qual = train_df.groupby(['OverallQual']).SalePrice.agg([len, min, max])
Qual

In [None]:
train_df['OverallQual'] = np.where((train_df.OverallQual>8 ), 8, train_df.OverallQual)
train_df['OverallQual'] = np.where((train_df.OverallQual<4 ), 4, train_df.OverallQual)

test_df['OverallQual'] = np.where((test_df.OverallQual>8 ), 8, test_df.OverallQual)
test_df['OverallQual'] = np.where((test_df.OverallQual<4 ), 4, test_df.OverallQual)

In [None]:
Cond = train_df.groupby(['OverallCond']).SalePrice.agg([len, min, max])
Cond

In [None]:
train_df['OverallCond'] = np.where((train_df.OverallCond<3), 3, train_df.OverallCond)

test_df['OverallCond'] = np.where((test_df.OverallCond<3), 3, test_df.OverallCond)

In [None]:
BsmtFullBath = train_df.groupby(['BsmtFullBath']).SalePrice.agg([len, min, max])
BsmtFullBath

In [None]:
train_df['BsmtFullBath'] = np.where((train_df.BsmtFullBath!=0), 1, train_df.BsmtFullBath)

test_df['BsmtFullBath'] = np.where((test_df.BsmtFullBath!=0), 1, test_df.BsmtFullBath)

In [None]:
BsmtHalfBath = train_df.groupby(['BsmtHalfBath']).SalePrice.agg([len, min, max])
BsmtHalfBath

In [None]:
train_df['BsmtHalfBath'] = np.where((train_df.BsmtHalfBath!=0), 1, train_df.BsmtHalfBath)

test_df['BsmtHalfBath'] = np.where((test_df.BsmtHalfBath!=0), 1, test_df.BsmtHalfBath)

In [None]:
FullBath = train_df.groupby(['FullBath']).SalePrice.agg([len, min, max])
FullBath

In [None]:
train_df['FullBath'] = np.where((train_df.FullBath!=0 ), 1, train_df.FullBath)

test_df['FullBath'] = np.where((test_df.FullBath!=0 ), 1, test_df.FullBath)

In [None]:
HalfBath = train_df.groupby(['HalfBath']).SalePrice.agg([len, min, max])
HalfBath

In [None]:
train_df['HalfBath'] = np.where((train_df.HalfBath!=0 ), 1, train_df.HalfBath)

test_df['HalfBath'] = np.where((test_df.HalfBath!=0 ), 1, test_df.HalfBath)

In [None]:
BedroomAbvGr = train_df.groupby(['BedroomAbvGr']).SalePrice.agg([len, min, max])
BedroomAbvGr

In [None]:
train_df['BedroomAbvGr'] = np.where((train_df.BedroomAbvGr<2 ), 2, train_df.BedroomAbvGr)
train_df['BedroomAbvGr'] = np.where((train_df.BedroomAbvGr>4 ), 4, train_df.BedroomAbvGr)

test_df['BedroomAbvGr'] = np.where((test_df.BedroomAbvGr<2 ), 2, test_df.BedroomAbvGr)
test_df['BedroomAbvGr'] = np.where((test_df.BedroomAbvGr>4 ), 4, test_df.BedroomAbvGr)

In [None]:
KitchenAbvGr = train_df.groupby(['KitchenAbvGr']).SalePrice.agg([len, min, max])
KitchenAbvGr

In [None]:
train_df['KitchenAbvGr'] = np.where((train_df.KitchenAbvGr<1 ), 1, train_df.KitchenAbvGr)
train_df['KitchenAbvGr'] = np.where((train_df.KitchenAbvGr>2 ), 2, train_df.KitchenAbvGr)

test_df['KitchenAbvGr'] = np.where((test_df.KitchenAbvGr<1 ), 1, test_df.KitchenAbvGr)
test_df['KitchenAbvGr'] = np.where((test_df.KitchenAbvGr>2 ), 2, test_df.KitchenAbvGr)

In [None]:
TotRmsAbvGrd = train_df.groupby(['TotRmsAbvGrd']).SalePrice.agg([len, min, max])
TotRmsAbvGrd

In [None]:
train_df['TotRmsAbvGrd'] = np.where((train_df.TotRmsAbvGrd>10 ), 10, train_df.TotRmsAbvGrd)
train_df['TotRmsAbvGrd'] = np.where((train_df.TotRmsAbvGrd<3 ), 3, train_df.TotRmsAbvGrd)

test_df['TotRmsAbvGrd'] = np.where((test_df.TotRmsAbvGrd>10 ), 10, test_df.TotRmsAbvGrd)
test_df['TotRmsAbvGrd'] = np.where((test_df.TotRmsAbvGrd<3 ), 3, test_df.TotRmsAbvGrd)

In [None]:
Fireplaces = train_df.groupby(['Fireplaces']).SalePrice.agg([len, min, max])
Fireplaces

In [None]:
train_df['Fireplaces'] = np.where((train_df.Fireplaces!=0 ), 1, train_df.Fireplaces)

test_df['Fireplaces'] = np.where((test_df.Fireplaces!=0 ), 1, test_df.Fireplaces)

In [None]:
GarageCars = train_df.groupby(['GarageCars']).SalePrice.agg([len, min, max])
GarageCars

In [None]:
train_df['GarageCars'] = np.where((train_df.GarageCars==4 ), 3, train_df.GarageCars)

test_df['GarageCars'] = np.where((test_df.GarageCars==4 ), 3, test_df.GarageCars)

In [None]:
MoSold = train_df.groupby(['MoSold']).SalePrice.agg([len, min, max])
MoSold

Done with discrete numerical variables.
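The two-sided capping used above (for example OverallQual limited to the range 4 to 8) can also be written with `Series.clip`. A minimal sketch on made-up ratings, showing that both forms give the same result:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([1, 3, 4, 5, 8, 9, 10])  # hypothetical OverallQual values

# Two-step np.where capping, as in the cells above.
capped_where = np.where(ratings > 8, 8, ratings)
capped_where = np.where(capped_where < 4, 4, capped_where)

# Equivalent one-liner with Series.clip.
capped_clip = ratings.clip(lower=4, upper=8)

print(capped_clip.tolist())
```

`clip` keeps the result as a pandas Series, while `np.where` returns a NumPy array.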

#### 4.2.2. (c) Handling Year Variables

In [None]:
train_df[year_cols].head()

In [None]:
for i in year_cols:
    sns.catplot(x = i, y = 'SalePrice', data = train_df)
    plt.show()

The distributions look reasonable.
- Newer houses tend to have a higher SalePrice than older ones.
- Similarly, recently remodeled houses tend to have a higher SalePrice.
- The same holds for houses with newer garages (GarageYrBlt).

#### 4.2.2. (d) Handling Categorical Variables

In [None]:
train_df[cols_obj].head()

In [None]:
for i in cols_obj:
    sns.catplot(x = i, y = 'SalePrice', data = train_df)
    plt.show()

In [None]:
for i in cols_obj:
    print(train_df[i].value_counts(normalize=True)*100)
    print()

As you can observe, in some columns a single category accounts for more than 90% of the entries. Such columns need to be dropped.

A few other columns have one category that is moderately more frequent than the rest. These should be transformed into binary indicators.

The rest should be encoded.
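The 90% rule above can be checked programmatically instead of reading the printed percentages. This is only a sketch: `dominant_columns` is a hypothetical helper, the toy data is made up, and I have not verified that it reproduces exactly the list of columns dropped below.

```python
import pandas as pd

def dominant_columns(df: pd.DataFrame, threshold: float = 0.90) -> list:
    """Return columns whose most frequent value covers more than `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts in descending order, so the first entry is the top share.
        top_share = df[col].value_counts(normalize=True).iloc[0]
        if top_share > threshold:
            flagged.append(col)
    return flagged

toy = pd.DataFrame({
    "Street": ["Pave"] * 19 + ["Grvl"],       # 95% one value
    "LotShape": ["Reg"] * 12 + ["IR1"] * 8,   # 60% one value
})
flagged = dominant_columns(toy)
print(flagged)
```

On the real data this would be called as `dominant_columns(train_df[cols_obj])`.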

In [None]:
train_df.drop(['Street','Utilities','LandSlope','Condition2','RoofMatl','BsmtCond','Heating','CentralAir','Electrical','Functional','GarageQual','GarageCond','PavedDrive'],axis=1, inplace=True)
test_df.drop(['Street','Utilities','LandSlope','Condition2','RoofMatl','BsmtCond','Heating','CentralAir','Electrical','Functional','GarageQual','GarageCond','PavedDrive'],axis=1, inplace=True)

In [None]:
MSZoning = train_df.groupby(['MSZoning']).SalePrice.agg([len, min, max])
MSZoning

In [None]:
train_df['MSZoning'] = np.where((train_df.MSZoning=='RL' ), 1, 0)

test_df['MSZoning'] = np.where((test_df.MSZoning=='RL' ), 1, 0)

In [None]:
LotShape = train_df.groupby(['LotShape']).SalePrice.agg([len, min, max])
LotShape

In [None]:
train_df['LotShape'] = np.where((train_df.LotShape=='Reg' ), 1, 0)

test_df['LotShape'] = np.where((test_df.LotShape=='Reg' ), 1, 0)

In [None]:
LandContour = train_df.groupby(['LandContour']).SalePrice.agg([len, min, max])
LandContour

In [None]:
train_df['LandContour'] = np.where((train_df.LandContour=='Lvl' ), 1, 0)

test_df['LandContour'] = np.where((test_df.LandContour=='Lvl' ), 1, 0)

In [None]:
LotConfig = train_df.groupby(['LotConfig']).SalePrice.agg([len, min, max])
LotConfig

In [None]:
train_df['LotConfig'] = np.where((train_df.LotConfig=='Inside' ), 1, 0)

test_df['LotConfig'] = np.where((test_df.LotConfig=='Inside' ), 1, 0)

In [None]:
Condition1 = train_df.groupby(['Condition1']).SalePrice.agg([len, min, max])
Condition1

In [None]:
train_df['Condition1'] = np.where((train_df.Condition1=='Norm' ), 1, 0)

test_df['Condition1'] = np.where((test_df.Condition1=='Norm' ), 1, 0)

In [None]:
BldgType = train_df.groupby(['BldgType']).SalePrice.agg([len, min, max])
BldgType

In [None]:
train_df['BldgType'] = np.where((train_df.BldgType=='1Fam' ), 1, 0)

test_df['BldgType'] = np.where((test_df.BldgType=='1Fam' ), 1, 0)

In [None]:
RoofStyle = train_df.groupby(['RoofStyle']).SalePrice.agg([len, min, max])
RoofStyle

In [None]:
train_df['RoofStyle'] = np.where((train_df.RoofStyle=='Gable' ), 1, 0)

test_df['RoofStyle'] = np.where((test_df.RoofStyle=='Gable' ), 1, 0)

In [None]:
MasVnrType = train_df.groupby(['MasVnrType']).SalePrice.agg([len, min, max])
MasVnrType

In [None]:
train_df['MasVnrType'] = np.where((train_df.MasVnrType=='None' ), 0, 1)

test_df['MasVnrType'] = np.where((test_df.MasVnrType=='None' ), 0, 1)

In [None]:
ExterQual = train_df.groupby(['ExterQual']).SalePrice.agg([len, min, max])
ExterQual

In [None]:
train_df['ExterQual'] = np.where((train_df.ExterQual=='Ex' ), 'Gd', train_df.ExterQual)
train_df['ExterQual'] = np.where((train_df.ExterQual=='Fa' ), 'TA', train_df.ExterQual)
train_df['ExterQual'] = np.where((train_df.ExterQual=='TA' ), 1, 0)

test_df['ExterQual'] = np.where((test_df.ExterQual=='Ex' ), 'Gd', test_df.ExterQual)
test_df['ExterQual'] = np.where((test_df.ExterQual=='Fa' ), 'TA', test_df.ExterQual)
test_df['ExterQual'] = np.where((test_df.ExterQual=='TA' ), 1, 0)

In [None]:
ExterCond = train_df.groupby(['ExterCond']).SalePrice.agg([len, min, max])
ExterCond

In [None]:
train_df['ExterCond'] = np.where((train_df.ExterCond=='TA' ), 1, 0)

test_df['ExterCond'] = np.where((test_df.ExterCond=='TA' ), 1, 0)

In [None]:
Foundation = train_df.groupby(['Foundation']).SalePrice.agg([len, min, max])
Foundation

In [None]:
train_df['Foundation'] = np.where((train_df.Foundation=='PConc' ), 1, 0)

test_df['Foundation'] = np.where((test_df.Foundation=='PConc' ), 1, 0)

In [None]:
BsmtQual = train_df.groupby(['BsmtQual']).SalePrice.agg([len, min, max])
BsmtQual

In [None]:
train_df['BsmtQual'] = np.where((train_df.BsmtQual=='Ex' ), 'Gd', train_df.BsmtQual)
train_df['BsmtQual'] = np.where((train_df.BsmtQual=='Fa' ), 'TA', train_df.BsmtQual)
train_df['BsmtQual'] = np.where((train_df.BsmtQual=='TA' ), 1, 0)

test_df['BsmtQual'] = np.where((test_df.BsmtQual=='Ex' ), 'Gd', test_df.BsmtQual)
test_df['BsmtQual'] = np.where((test_df.BsmtQual=='Fa' ), 'TA', test_df.BsmtQual)
test_df['BsmtQual'] = np.where((test_df.BsmtQual=='TA' ), 1, 0)

In [None]:
BsmtExposure = train_df.groupby(['BsmtExposure']).SalePrice.agg([len, min, max])
BsmtExposure

In [None]:
train_df['BsmtExposure'] = np.where((train_df.BsmtExposure=='Mn' ), 'No', train_df.BsmtExposure)
train_df['BsmtExposure'] = np.where((train_df.BsmtExposure=='Gd' ), 'Av', train_df.BsmtExposure)
train_df['BsmtExposure'] = np.where((train_df.BsmtExposure=='No' ), 1, 0)

test_df['BsmtExposure'] = np.where((test_df.BsmtExposure=='Mn' ), 'No', test_df.BsmtExposure)
test_df['BsmtExposure'] = np.where((test_df.BsmtExposure=='Gd' ), 'Av', test_df.BsmtExposure)
test_df['BsmtExposure'] = np.where((test_df.BsmtExposure=='No' ), 1, 0)

In [None]:
GarageType = train_df.groupby(['GarageType']).SalePrice.agg([len, min, max])
GarageType

In [None]:
train_df['GarageType'] = np.where((train_df.GarageType=='Attchd' ), 1, 0)

test_df['GarageType'] = np.where((test_df.GarageType=='Attchd' ), 1, 0)

In [None]:
SaleType = train_df.groupby(['SaleType']).SalePrice.agg([len, min, max])
SaleType

In [None]:
train_df['SaleType'] = np.where((train_df.SaleType=='WD' ), 1, 0)

test_df['SaleType'] = np.where((test_df.SaleType=='WD' ), 1, 0)

In [None]:
SaleCondition = train_df.groupby(['SaleCondition']).SalePrice.agg([len, min, max])
SaleCondition

In [None]:
train_df['SaleCondition'] = np.where((train_df.SaleCondition=='Normal' ), 1, 0)

test_df['SaleCondition'] = np.where((test_df.SaleCondition=='Normal' ), 1, 0)

In [None]:
#GarageFinish, KitchenQual, HeatingQC, BsmtFinType2, BsmtFinType1, Exterior2nd, Exterior1st, HouseStyle, Neighborhood

In [None]:
ax = sns.catplot(x = 'Neighborhood', y = 'SalePrice', data = train_df,height=5, aspect=2)
ax.set_xticklabels(rotation=90)

In [None]:
#Based on the graph above, I decided to convert the entries of the Neighborhood column into three groups.
#[CollgCr, Veenker, Crawfor, Mitchel,  NWAmes, NAmes, SawyerW, Edwards, NPkVill] = 0
#[NoRidge, Somerst, NridgHt, Timber, Gilbert, StoneBr, ClearCr, Blmngtn] = 1
#[OldTown, BrkSide, Sawyer, IDOTRR, MeadowV, BrDale, SWISU, Blueste] = 2

In [None]:
temp1 = ['CollgCr', 'Veenker', 'Crawfor', 'Mitchel',  'NWAmes', 'NAmes', 'SawyerW', 'Edwards', 'NPkVill']
temp2 = ['NoRidge', 'Somerst' ,'NridgHt', 'Timber', 'Gilbert', 'StoneBr', 'ClearCr', 'Blmngtn']
temp3 = ['OldTown', 'BrkSide', 'Sawyer', 'IDOTRR', 'MeadowV', 'BrDale', 'SWISU', 'Blueste']

for i in temp1:
    train_df['Neighborhood'] = np.where((train_df.Neighborhood==i), 0, train_df.Neighborhood)
    
for j in temp2:
    train_df['Neighborhood'] = np.where((train_df.Neighborhood==j), 1, train_df.Neighborhood)
    
for k in temp3:
    train_df['Neighborhood'] = np.where((train_df.Neighborhood==k), 2, train_df.Neighborhood)
    

for i in temp1:
    test_df['Neighborhood'] = np.where((test_df.Neighborhood==i), 0, test_df.Neighborhood)
    
for j in temp2:
    test_df['Neighborhood'] = np.where((test_df.Neighborhood==j), 1, test_df.Neighborhood)
    
for k in temp3:
    test_df['Neighborhood'] = np.where((test_df.Neighborhood==k), 2, test_df.Neighborhood)
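
The grouping loops above can also be expressed as a single dictionary lookup with `Series.map`. The sketch below uses a shortened, hypothetical version of the three groups on a toy Series; it is an alternative formulation, not the code this notebook runs.

```python
import pandas as pd

# Hypothetical three-way grouping, abbreviated to two names per group.
groups = {
    0: ["CollgCr", "NAmes"],
    1: ["NoRidge", "StoneBr"],
    2: ["OldTown", "MeadowV"],
}
# Invert to a flat lookup: neighborhood name -> group code.
name_to_group = {name: code for code, names in groups.items() for name in names}

toy = pd.Series(["NAmes", "StoneBr", "MeadowV", "CollgCr"])
encoded = toy.map(name_to_group)

print(encoded.tolist())
```

A side effect of `map` is that any name missing from the lookup becomes NaN, which makes a misspelled neighborhood easy to spot.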

In [None]:
ax = sns.catplot(x = 'GarageFinish', y = 'SalePrice', data = train_df,height=5, aspect=2)

In [None]:
train_df['GarageFinish'] = np.where((train_df.GarageFinish=='Unf' ), 1, 0)

test_df['GarageFinish'] = np.where((test_df.GarageFinish=='Unf' ), 1, 0)

In [None]:
ax = sns.catplot(x = 'KitchenQual', y = 'SalePrice', data = train_df,height=5, aspect=2)

In [None]:
train_df['KitchenQual'] = np.where((train_df.KitchenQual=='Ex' ), 'Gd', train_df.KitchenQual)
train_df['KitchenQual'] = np.where((train_df.KitchenQual=='Fa' ), 'TA', train_df.KitchenQual)
train_df['KitchenQual'] = np.where((train_df.KitchenQual=='TA' ), 1, 0)

test_df['KitchenQual'] = np.where((test_df.KitchenQual=='Ex' ), 'Gd', test_df.KitchenQual)
test_df['KitchenQual'] = np.where((test_df.KitchenQual=='Fa' ), 'TA', test_df.KitchenQual)
test_df['KitchenQual'] = np.where((test_df.KitchenQual=='TA' ), 1, 0)

In [None]:
ax = sns.catplot(x = 'HeatingQC', y = 'SalePrice', data = train_df,height=5, aspect=2)

In [None]:
train_df['HeatingQC'] = np.where((train_df.HeatingQC=='Ex' ), 1, 0)

test_df['HeatingQC'] = np.where((test_df.HeatingQC=='Ex' ), 1, 0)

In [None]:
ax = sns.catplot(x = 'BsmtFinType1', y = 'SalePrice', data = train_df,height=5, aspect=2)

In [None]:
train_df['BsmtFinType1'] = np.where((train_df.BsmtFinType1=='Rec'), 'Unf', train_df.BsmtFinType1)
train_df['BsmtFinType1'] = np.where((train_df.BsmtFinType1=='BLQ'), 'Unf', train_df.BsmtFinType1)
train_df['BsmtFinType1'] = np.where((train_df.BsmtFinType1=='ALQ'), 'GLQ', train_df.BsmtFinType1)
train_df['BsmtFinType1'] = np.where((train_df.BsmtFinType1=='LwQ'), 'GLQ', train_df.BsmtFinType1)
train_df['BsmtFinType1'] = np.where((train_df.BsmtFinType1=='Unf' ), 1, 0)

test_df['BsmtFinType1'] = np.where((test_df.BsmtFinType1=='Rec'), 'Unf', test_df.BsmtFinType1)
test_df['BsmtFinType1'] = np.where((test_df.BsmtFinType1=='BLQ'), 'Unf', test_df.BsmtFinType1)
test_df['BsmtFinType1'] = np.where((test_df.BsmtFinType1=='ALQ'), 'GLQ', test_df.BsmtFinType1)
test_df['BsmtFinType1'] = np.where((test_df.BsmtFinType1=='LwQ'), 'GLQ', test_df.BsmtFinType1)
test_df['BsmtFinType1'] = np.where((test_df.BsmtFinType1=='Unf' ), 1, 0)

In [None]:
ax = sns.catplot(x = 'BsmtFinType2', y = 'SalePrice', data = train_df,height=5, aspect=2)

In [None]:
train_df['BsmtFinType2'] = np.where((train_df.BsmtFinType2=='Unf' ), 1, 0)

test_df['BsmtFinType2'] = np.where((test_df.BsmtFinType2=='Unf' ), 1, 0)

In [None]:
ax = sns.catplot(x = 'Exterior1st', y = 'SalePrice', data = train_df,height=5, aspect=2)
ax.set_xticklabels(rotation=90)

In [None]:
train_df['Exterior1st'] = np.where((train_df.Exterior1st=='AsbShng'), 'VinylSd', train_df.Exterior1st)
train_df['Exterior1st'] = np.where((train_df.Exterior1st=='BrkFace'), 'VinylSd', train_df.Exterior1st)
train_df['Exterior1st'] = np.where((train_df.Exterior1st=='Wd Sdng'), 'VinylSd', train_df.Exterior1st)
train_df['Exterior1st'] = np.where((train_df.Exterior1st=='VinylSd' ), 1, 0)

test_df['Exterior1st'] = np.where((test_df.Exterior1st=='AsbShng'), 'VinylSd', test_df.Exterior1st)
test_df['Exterior1st'] = np.where((test_df.Exterior1st=='BrkFace'), 'VinylSd', test_df.Exterior1st)
test_df['Exterior1st'] = np.where((test_df.Exterior1st=='Wd Sdng'), 'VinylSd', test_df.Exterior1st)
test_df['Exterior1st'] = np.where((test_df.Exterior1st=='VinylSd' ), 1, 0)

In [None]:
ax = sns.catplot(x = 'Exterior2nd', y = 'SalePrice', data = train_df,height=5, aspect=2)
ax.set_xticklabels(rotation=90)

In [None]:
train_df['Exterior2nd'] = np.where((train_df.Exterior2nd=='AsbShng'), 'VinylSd', train_df.Exterior2nd)
train_df['Exterior2nd'] = np.where((train_df.Exterior2nd=='BrkFace'), 'VinylSd', train_df.Exterior2nd)
train_df['Exterior2nd'] = np.where((train_df.Exterior2nd=='Wd Sdng'), 'VinylSd', train_df.Exterior2nd)
train_df['Exterior2nd'] = np.where((train_df.Exterior2nd=='VinylSd' ), 1, 0)

test_df['Exterior2nd'] = np.where((test_df.Exterior2nd=='AsbShng'), 'VinylSd', test_df.Exterior2nd)
test_df['Exterior2nd'] = np.where((test_df.Exterior2nd=='BrkFace'), 'VinylSd', test_df.Exterior2nd)
test_df['Exterior2nd'] = np.where((test_df.Exterior2nd=='Wd Sdng'), 'VinylSd', test_df.Exterior2nd)
test_df['Exterior2nd'] = np.where((test_df.Exterior2nd=='VinylSd' ), 1, 0)

In [None]:
ax = sns.catplot(x = 'HouseStyle', y = 'SalePrice', data = train_df,height=5, aspect=2)

In [None]:
train_df['HouseStyle'] = np.where((train_df.HouseStyle=='1.5Fin'), 'Other', train_df.HouseStyle)
train_df['HouseStyle'] = np.where((train_df.HouseStyle=='1.5Unf'), 'Other', train_df.HouseStyle)
train_df['HouseStyle'] = np.where((train_df.HouseStyle=='SFoyer'), 'Other', train_df.HouseStyle)
train_df['HouseStyle'] = np.where((train_df.HouseStyle=='SLvl'), 'Other', train_df.HouseStyle)
train_df['HouseStyle'] = np.where((train_df.HouseStyle=='2.5Unf'), 'Other', train_df.HouseStyle)
train_df['HouseStyle'] = np.where((train_df.HouseStyle=='2.5Fin'), 'Other', train_df.HouseStyle)
train_df['HouseStyle'] = np.where((train_df.HouseStyle=='Other' ), 0, train_df.HouseStyle)
train_df['HouseStyle'] = np.where((train_df.HouseStyle=='1Story' ), 1, train_df.HouseStyle)
train_df['HouseStyle'] = np.where((train_df.HouseStyle=='2Story' ), 2, train_df.HouseStyle)

test_df['HouseStyle'] = np.where((test_df.HouseStyle=='1.5Fin'), 'Other', test_df.HouseStyle)
test_df['HouseStyle'] = np.where((test_df.HouseStyle=='1.5Unf'), 'Other', test_df.HouseStyle)
test_df['HouseStyle'] = np.where((test_df.HouseStyle=='SFoyer'), 'Other', test_df.HouseStyle)
test_df['HouseStyle'] = np.where((test_df.HouseStyle=='SLvl'), 'Other', test_df.HouseStyle)
test_df['HouseStyle'] = np.where((test_df.HouseStyle=='2.5Unf'), 'Other', test_df.HouseStyle)
test_df['HouseStyle'] = np.where((test_df.HouseStyle=='2.5Fin'), 'Other', test_df.HouseStyle)
test_df['HouseStyle'] = np.where((test_df.HouseStyle=='Other' ), 0, test_df.HouseStyle)
test_df['HouseStyle'] = np.where((test_df.HouseStyle=='1Story' ), 1, test_df.HouseStyle)
test_df['HouseStyle'] = np.where((test_df.HouseStyle=='2Story' ), 2, test_df.HouseStyle)

That's it for categorical variables.

### 4.3. Duplicate Rows

In [None]:
#checking for duplicate rows
train_df[train_df.duplicated()]

In [None]:
test_df[test_df.duplicated()]

Neither dataframe has duplicate rows.

So, we are left with just two categorical columns, i.e. Neighborhood and HouseStyle. I'll perform One-Hot Encoding on these two columns.

### 4.4. One-Hot Encoding

In [None]:
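#get_dummies returns one column per category, so it cannot be assigned back to a single column.
#Encoding through the dataframe replaces each original column with its dummy columns.
#drop_first=True drops one dummy per feature to avoid perfectly collinear columns.
train_df = pd.get_dummies(train_df, columns=['HouseStyle', 'Neighborhood'], drop_first=True, dtype=int)

test_df = pd.get_dummies(test_df, columns=['HouseStyle', 'Neighborhood'], drop_first=True, dtype=int)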

In [None]:
train_df.head()

### 4.5. Feature Scaling
**Feature Scaling or Standardization**: It is a data pre-processing step applied to the independent variables (features) of the data. It brings the features onto a comparable scale. Sometimes, it also helps in speeding up the calculations in an algorithm.

Standardization replaces each value with its z-score, i.e. z = (x - mean) / standard deviation.
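The notebook does not show the scaling code itself, so the following is only a sketch of how z-score standardization could be done with pandas, under the assumption that the mean and standard deviation are learned from the train features and reused on the test features. The toy dataframes are hypothetical.

```python
import pandas as pd

toy_train = pd.DataFrame({"LotArea": [8.0, 9.0, 10.0], "GrLivArea": [6.0, 7.0, 8.0]})
toy_test = pd.DataFrame({"LotArea": [9.0, 11.0], "GrLivArea": [7.0, 9.0]})

# Learn mean and standard deviation from the training features only.
mean = toy_train.mean()
std = toy_train.std(ddof=0)  # population std, the convention scikit-learn's StandardScaler uses

train_scaled = (toy_train - mean) / std
test_scaled = (toy_test - mean) / std   # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

After scaling, every train column has mean 0 and standard deviation 1. The target column would normally be left out of this step.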

In [None]:
train_df.head()

### 4.6. Multicollinearity

**Multicollinearity**: Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model.

**Why not Multicollinearity?**: Multicollinearity can be a problem in a regression model because we would not be able to distinguish between the individual effects of the independent variables on the dependent variable.

**Detection of Multicollinearity**: Multicollinearity can be detected via various methods. One of the popular methods is VIF.

**VIF**: VIF stands for Variance Inflation Factor. VIF measures how strongly one independent variable is correlated with the others. It is computed by taking a variable, regressing it against every other independent variable, and using the resulting R² as VIF = 1 / (1 - R²).
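To show what the formula does, here is a minimal NumPy sketch that computes VIF directly from its definition on synthetic data. The function `vif` and the variables `a`, `b`, `c` are hypothetical and only for illustration; the notebook itself relies on statsmodels' `variance_inflation_factor`, which expects the constant column to be added beforehand (hence `add_constant`).

```python
import numpy as np

def vif(X: np.ndarray, i: int) -> float:
    """VIF of column i: regress it on the other columns plus an intercept, return 1 / (1 - R^2)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # ordinary least squares fit
    residuals = y - A @ coef
    r2 = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)                  # independent of a
c = a + 0.1 * rng.normal(size=200)        # nearly a copy of a
X = np.column_stack([a, b, c])

vifs = [vif(X, i) for i in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```

The two nearly identical columns get a very large VIF, while the independent column stays close to 1.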

In [None]:
X1 = train_df.drop(['SalePrice'], axis=1)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X_vif = add_constant(X1)

pd.Series([variance_inflation_factor(X_vif.values, i) 
               for i in range(X_vif.shape[1])], 
              index=X_vif.columns)

In [None]:
# these columns have high multicollinearity
temp_high = ['GrLivArea', '2ndFlrSF', '1stFlrSF', 'BsmtFinSF2', 'BsmtFinType2', 'MasVnrArea', 'MasVnrType']

The VIF of a few columns is very high. That means we have to drop some of these columns, because this is not good for our model.

But wait! How do we decide which columns to drop?

This is where significance comes in.

**4.6.1. Significance**: In statistics, statistical significance means that a result is unlikely to have been produced by chance alone.

(a) **Correlation**: Correlation is a statistic that measures the degree to which two variables move in relation to each other. We use it to measure the relationship between two continuous columns. The correlation coefficient takes values between -1 and 1:
- A value closer to 0 implies weaker correlation (exact 0 implying no correlation)
- A value closer to 1 implies stronger positive correlation
- A value closer to -1 implies stronger negative correlation

(b) **ANOVA**: ANOVA stands for Analysis of Variance. It tests whether the mean of a continuous variable differs across the groups of a categorical variable. ANOVA gives two measures as its result:
- F-statistic: the ratio of the variation between group means to the variation within the groups
- p-value: the probability of seeing an F-statistic at least this large if all group means were actually equal
- We use this technique to find the relation between continuous and categorical columns.
- In conclusion, a large F-statistic together with a small p-value indicates a strong relationship between the categorical variable and the other variable.
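The two measures can be sketched on synthetic data as follows. The frame `toy` and its columns `area`, `quality` and `price` are made up for illustration; the point is only to show `pearsonr` on two continuous columns and `f_oneway` on a continuous column split by the levels of a categorical one.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 300
area = rng.normal(1500, 300, size=n)
quality = rng.choice(["Low", "Mid", "High"], size=n)
shift = pd.Series(quality).map({"Low": -20000, "Mid": 0, "High": 20000}).to_numpy()
price = 100 * area + shift + rng.normal(0, 10000, size=n)
toy = pd.DataFrame({"area": area, "quality": quality, "price": price})

# (a) continuous vs continuous: Pearson correlation
corr, corr_p = stats.pearsonr(toy["area"], toy["price"])
print("correlation: %.3f" % corr)

# (b) continuous vs categorical: one sample of prices per category level
groups = [g.to_numpy() for _, g in toy.groupby("quality")["price"]]
f_stat, anova_p = stats.f_oneway(*groups)
print("F-statistic: %.2f, p-value: %.4g" % (f_stat, anova_p))
```

Note that `f_oneway` expects one array per group, so the continuous column is split by category before the call.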

In [None]:
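from scipy import stats
from scipy.stats import pearsonr

def anova_by_group(cat_col, value_col):
    # one-way ANOVA of value_col across the levels of cat_col
    groups = [g.values for _, g in train_df.groupby(cat_col)[value_col]]
    return stats.f_oneway(*groups)

for i in temp_high:
    for j in temp_high:
        if i == j:
            continue  # comparing a column with itself tells us nothing
        if i in cols_num_con:
            # continuous column: Pearson correlation
            corr, _ = pearsonr(train_df[i], train_df[j])
            print(i,'&',j ,'correlation: %.3f' % corr)
        else:
            # categorical column: ANOVA of j across the levels of i
            print(i,'&',j ,"ANOVA: ",anova_by_group(i, j))

    # relation of column i with the target
    if i in cols_num_con:
        corr, _ = pearsonr(train_df[i], train_df['SalePrice'])
        print(i, '& SalePrice', 'correlation: %.3f' % corr)
        print()
    else:
        print(i," & SalePrice ANOVA: ",anova_by_group(i, 'SalePrice'))
        print()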

So according to my observations:
- BsmtFinType2 is highly collinear with BsmtFinSF2 but less related to SalePrice.
- MasVnrType is highly collinear with MasVnrArea but less related to SalePrice.
- 2ndFlrSF and GrLivArea are highly collinear with each other but less so with SalePrice.

In [None]:
# based on the significance results, re-check multicollinearity after
# dropping a few columns
X_vif = X_vif.drop(['BsmtFinSF2','2ndFlrSF','MasVnrType'],axis = 1)
pd.Series([variance_inflation_factor(X_vif.values, i) 
               for i in range(X_vif.shape[1])], 
              index=X_vif.columns)

Done! Now let's drop these columns from train_df and test_df.

In [None]:
train_df.drop(['BsmtFinSF2','2ndFlrSF','MasVnrType'],axis = 1, inplace=True)
test_df.drop(['BsmtFinSF2','2ndFlrSF','MasVnrType'],axis = 1, inplace=True)

## 5. Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

X = train_df.drop(['SalePrice'], axis=1)
y = train_df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)

In [None]:
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.1, random_state=4)

## 6. Model Evaluation

### 6.1. Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
LR = LinearRegression()
LR.fit(X_train,y_train)
y_predicted = LR.predict(X_test)

print(round(LR.score(X_train, y_train)*100,2))
print(round(LR.score(X_test, y_test)*100,2))
# RMSE; np.sqrt is used because the 'squared' argument was removed from recent scikit-learn releases
np.sqrt(mean_squared_error(y_test, y_predicted))

### 6.2. Regularization:

#### 6.2. (a) Lasso Regression:
Lasso stands for Least Absolute Shrinkage and Selection Operator. It adds a penalty term to the cost function: the sum of the absolute values of the coefficients (the L1 norm). As the coefficients move away from 0 the penalty grows, so the model is pushed to shrink them in order to reduce the loss. The key difference from Ridge is that Lasso can shrink coefficients to exactly zero, whereas Ridge shrinks them towards zero without setting them exactly to zero.

**Limitation**: 
- Lasso struggles with some types of data. If the number of predictors (p) is greater than the number of observations (n), Lasso picks at most n predictors as non-zero, even if all predictors are relevant (or may be used in the test set).
- If two or more variables are highly collinear, Lasso tends to keep one of them more or less arbitrarily, which is not good for the interpretation of the data.

#### 6.2. (b) Ridge Regression:
In Ridge regression, the penalty term is the sum of the squared coefficients (the squared L2 norm). A coefficient lambda (called `alpha` in scikit-learn) controls the strength of that penalty. If lambda is zero, the equation reduces to basic OLS; if `lambda > 0`, it adds a constraint on the coefficients. As we increase lambda, this constraint pushes the coefficients towards zero. This lowers the variance of the model, because predictions depend less on any single variable, at the cost of a small increase in bias.

**Limitation**:
Ridge regression decreases the complexity of a model but does not reduce the number of variables, since it only shrinks coefficients and never sets them exactly to zero. Hence, it is not suited to feature reduction.
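The difference between the two penalties can be seen in a small sketch on synthetic data where only two of six features matter. The data and the `alpha` values are arbitrary choices for illustration, not tuned settings for this dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only the first two features influence y; the other four are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso:", lasso.coef_.round(3))
print("Ridge:", ridge.coef_.round(3))
# Lasso sets the noise coefficients to exactly zero; Ridge only shrinks them.
print("Lasso zeros:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zeros:", int(np.sum(ridge.coef_ == 0)))
```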

**Lasso Regression**

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# the 'normalize' option was removed in scikit-learn 1.2, so only alpha is searched
parameters = {'alpha':[0,0.1,0.5,1,5,10]}

LassoReg = Lasso()

Lasso_reg= GridSearchCV(LassoReg, parameters, scoring='neg_mean_squared_error',cv=20)
Lasso_reg.fit(X_train,y_train)

# best estimator
print(Lasso_reg.best_estimator_)

# best model (GridSearchCV has already refit it on X_train with the best parameters)
best_model = Lasso_reg.best_estimator_
y_predicted = best_model.predict(X_test)
print(best_model.score(X_train,y_train)*100)
print(np.sqrt(mean_squared_error(y_test, y_predicted)))  # RMSE

**Ridge**

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# the 'normalize' option was removed in scikit-learn 1.2, so only alpha is searched
parameters = {'alpha':[0.001,0.01,0.1,0.2,0.4, 0.5,0.7,0.9,1,5,10]}

RidgeReg = Ridge()

Ridge_reg= GridSearchCV(RidgeReg, parameters, scoring='neg_mean_squared_error',cv=20)
Ridge_reg.fit(X_train,y_train)

# best estimator
print(Ridge_reg.best_estimator_)

# best model (GridSearchCV has already refit it on X_train with the best parameters)
best_model = Ridge_reg.best_estimator_
y_predicted = best_model.predict(X_test)
print(best_model.score(X_train,y_train)*100)
print(np.sqrt(mean_squared_error(y_test, y_predicted)))  # RMSE

### 6.3. SVR

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

svr = make_pipeline(RobustScaler(), SVR(kernel ='rbf' ,C= 20))
svr.fit(X_train,y_train)
y_predicted = svr.predict(X_test)
print(svr.score(X_train,y_train)*100)
print(np.sqrt(mean_squared_error(y_test, y_predicted)))  # RMSE

### 6.4. Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

# criterion 'mse' was renamed to 'squared_error' in scikit-learn 1.0
Dt = DecisionTreeRegressor(criterion='squared_error',max_depth=15, min_samples_split=5, min_samples_leaf=5,
                           max_features=None, random_state=42)
Dt.fit(X_train,y_train)
y_predicted = Dt.predict(X_test)
print(Dt.score(X_train,y_train)*100)
print(np.sqrt(mean_squared_error(y_test, y_predicted)))  # RMSE

As we can observe, the SVR model has the lowest RMSE, so I'm going to use it to predict on test_df.

## 7. Submission CSV

In [None]:
test_df['SalePrice'] = np.exp(svr.predict(test_df))
test_df['Id'] = test_data['Id']
Predicted_outcome=  test_df[['Id','SalePrice']]
Predicted_outcome.to_csv("Predicted_outcome.csv", index=False)
Predicted_outcome.head()