# House Price Prediction - Advanced Regression Techniques

# ![](https://miro.medium.com/max/804/1*D6s2K1y7kjE14swcgITB1w.png)
### House prices increase every year, so there is a need for a system to predict house prices in the future. House price prediction can help the developer determine the selling price of a house and can help the customer to arrange the right time to purchase a house.


#### In this notebook, I use advanced regression techniques such as **Ridge, Lasso & Polynomial Regression** in the simplest possible manner, together with EDA.
#### The speciality of this notebook is the detailed, step-by-step **feature engineering** and its impact on model performance.

1. Exploratory Data Analysis
2. Feature Engineering
3. Model Building & Evaluation


# 1. Exploratory Data Analysis

In [None]:
#Import necessary Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
#Display all the columns and rows of a dataframe
pd.set_option('display.max_columns',500)
pd.set_option('display.max_rows',500)
pd.set_option('display.width', 500)

In [None]:
#read train.csv file
df = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

In [None]:
df

In [None]:
df.describe()

In [None]:
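# Check for correlation between numeric features
# (select numeric columns first: recent pandas versions raise an error on text columns)
plt.figure(figsize=(20, 10))
sb.heatmap(df.select_dtypes(include='number').corr(),yticklabels=True,cbar=True,cmap='viridis')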

In [None]:
#Let's check which feature has the maximum correlation with our dependent feature, SalePrice
df.select_dtypes(include='number').corr()["SalePrice"].sort_values(ascending = False)

- Here we can see that 'OverallQual' is the feature most correlated with 'SalePrice'.
- Let's explore the features that are highly correlated with 'SalePrice'.
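
As a small illustration of the ranking step above, here is a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not the competition data). It selects the numeric columns first and then sorts the correlations with the target.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (all values are made up).
rng = np.random.default_rng(0)
n = 200
quality = rng.integers(1, 11, size=n)
area = rng.normal(1500, 300, size=n)
toy = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "Noise": rng.normal(size=n),
    "Street": rng.choice(["Pave", "Grvl"], size=n),  # non-numeric column
    "SalePrice": 20000 * quality + 50 * area + rng.normal(0, 10000, size=n),
})

# Keep only numeric columns so that corr() never sees text data.
numeric = toy.select_dtypes(include="number")

# Correlation of every numeric feature with the target, strongest first.
ranking = numeric.corr()["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print(ranking)
```

Dropping the target's own entry keeps the trivial correlation of 1.0 out of the ranking.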

In [None]:
sb.scatterplot(data = df, x = "OverallQual", y = "SalePrice");

In [None]:
sb.scatterplot(data = df, x = "GrLivArea", y = "SalePrice");

In [None]:
sb.scatterplot(data = df, x = "GarageCars", y = "SalePrice");

In [None]:
sb.scatterplot(data = df, x = "GarageArea", y = "SalePrice");

- We can see that some houses with the highest (10/10) quality still have very low prices.
- Prices are higher for larger living areas, but there are some outliers. The same goes for garage area.
- Garage car capacity follows the same trend: the more cars a garage holds, the higher the sale price. (The highest value, 4, can be treated as an exception.)

### Check for Missing Values:

In [None]:
df.info()

In [None]:
def percent_missing_data(df):
    missing_count = df.isna().sum().sort_values(ascending = False)
    missing_percent = 100 * df.isna().sum().sort_values(ascending = False) / len(df)
    
    missing_count = pd.DataFrame(missing_count[missing_count > 0])
    missing_percent = pd.DataFrame(missing_percent[missing_percent > 0])
    
    missing_table = pd.concat([missing_count,missing_percent], axis = 1)
    missing_table.columns = ["missing_count", "missing_percent"]
    
    return missing_table

In [None]:
missing_values = percent_missing_data(df)
missing_values

In [None]:
df_object = df.select_dtypes(include = "object")
df_numeric = df.select_dtypes(exclude = "object")
df_object.shape , df_numeric.shape

Numerical variables are usually of 3 types:
- Continuous variables
- Discrete variables
- Temporal (date-time) variables

In [None]:
# list of variables that contain year information
year_feature = [feature for feature in df_numeric if 'Yr' in feature or 'Year' in feature]
print("Temporal feature Count : {}".format(len(year_feature)))
year_feature

In [None]:
## Visualising the Temporal Datetime Variables
## We will check whether there is a relation between year the house is sold and the sales price

df.groupby('YrSold')['SalePrice'].median().plot()
plt.xlabel('Year Sold')
plt.ylabel('Median House Price')
plt.title("House Price vs YearSold")

- We can see that the median sale price actually decreases over the years in which the houses were sold.

In [None]:
#Discrete Features
discrete_feature=[feature for feature in df_numeric if len(df[feature].unique())<25 and feature not in year_feature+['Id']]
print("Discrete Variables Count: {}".format(len(discrete_feature)))

In [None]:
#Continuous Features
continuous_feature=[feature for feature in df_numeric if feature not in discrete_feature+year_feature+['Id']]
print("Continuous feature Count: {}".format(len(continuous_feature)))

### Checking for Outliers

In [None]:
for feature in continuous_feature:
    # np.log is undefined at zero, so skip features that contain zeros
    if 0 in df[feature].unique():
        continue
    data=df.copy()
    data[feature]=np.log(data[feature])
    data.boxplot(column=feature)
    plt.ylabel(feature)
    plt.title(feature)
    plt.show()
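
The boxplots above show outliers visually. To count them, one common option is the 1.5 × IQR rule that boxplot whiskers are usually based on. The sketch below is an assumption about how that could look, with a hypothetical helper `iqr_outlier_mask` and made-up living areas.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up living areas: mostly typical values plus two extreme ones.
living_area = pd.Series([1200, 1350, 1400, 1500, 1550, 1600, 1700, 1800, 4700, 5600])

# Apply the rule on the log scale, matching the log-scaled boxplots.
mask = iqr_outlier_mask(np.log(living_area))
print(living_area[mask].tolist())
```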

# 2. Feature Engineering

### Handling Missing Values

In [None]:
missing_values = percent_missing_data(df)
missing_values

In [None]:
plt.figure(figsize=(10,4), dpi = 100)
sb.barplot(x=missing_values.index,y='missing_percent', data=missing_values)
plt.xticks(rotation=90)
plt.show()

- We can see that column 'PoolQC' has the most missing values. We can either drop it or keep it if the feature is important for our model.
- But first, we can deal with the features that have few missing values, i.e. less than 1%.


In [None]:
#Extracting Features that have less than 1% missing values
missing_values[missing_values['missing_percent']<1]

In [None]:
#Electrical has only 1 missing value. It can be filled with the mode
df['Electrical'].mode()

In [None]:
df['Electrical'] = df['Electrical'].fillna('SBrkr')
df['Electrical'].isna().sum()

- 'MasVnrArea' & 'MasVnrType' also have less than 1% missing values.


In [None]:
df['MasVnrArea'].value_counts(), df['MasVnrType'].value_counts()

- Going through the data, we can see that 'MasVnrArea' contains 0 values, so its missing values can be filled with 0.
- 'MasVnrType' has a 'None' category, so its missing values can be filled with 'None'.


In [None]:
df['MasVnrArea']= df['MasVnrArea'].fillna(0)
df['MasVnrType']= df['MasVnrType'].fillna('None')

In [None]:
missing_values = percent_missing_data(df)
missing_values

- All the basement-related features have about 2% missing values.
- Going through the data, we find that NaN actually means that the house does not have a basement. So we can replace NaN values with 'None', which means no basement.
- For basement-related numeric columns, we will replace NaN values with zero.

In [None]:
# basement string features ==> fill with none
bsmt_str_cols =  ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']
df[bsmt_str_cols] = df[bsmt_str_cols].fillna('None')

# basement numeric features ==> fill with 0
bsmt_num_cols = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath']
df[bsmt_num_cols] = df[bsmt_num_cols].fillna(0)


In [None]:
missing_values = percent_missing_data(df)
missing_values

- Here all the garage-related features have around 5% missing values.
- We can fill them using the mean and mode (mean for numerical features and mode for categorical features).
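
The mean/mode rule described above can also be computed from the data instead of being typed in by hand. This is a minimal sketch under that assumption. The helper `fill_mode_or_mean` and the tiny `toy` frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def fill_mode_or_mean(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Fill text columns with their mode and numeric columns with their mean."""
    filled = frame.copy()
    for col in columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].mean())
        else:
            # mode() returns a Series; its first entry is the most frequent value.
            filled[col] = filled[col].fillna(filled[col].mode()[0])
    return filled

# Made-up garage columns with one missing value each.
toy = pd.DataFrame({
    "GarageType": ["Attchd", "Detchd", "Attchd", np.nan],
    "GarageYrBlt": [2000.0, 1990.0, np.nan, 2010.0],
})
filled = fill_mode_or_mean(toy, ["GarageType", "GarageYrBlt"])
print(filled)
```

Computing the statistics at fill time keeps the code valid if the data changes, at the cost of hiding the actual fill values from the reader.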


In [None]:
# Garage string features ==> fill with Mode                                           
df['GarageType']= df['GarageType'].fillna('Attchd')    
df['GarageCond']= df['GarageCond'].fillna('TA') 
df['GarageFinish']= df['GarageFinish'].fillna('Unf') 
df['GarageQual']= df['GarageQual'].fillna('TA') 

# Garage numeric features ==> fill with Mean
df['GarageYrBlt']= df['GarageYrBlt'].fillna(df.GarageYrBlt.mean()) 


In [None]:
missing_values = percent_missing_data(df)
missing_values

- Here some of the columns have more than 80% missing values. The best option is to drop them.

In [None]:
# Dropping columns with more than 80% missing values.
df = df.drop(["PoolQC", "MiscFeature", "Alley", "Fence"], axis = 1)

In [None]:
missing_values = percent_missing_data(df)
missing_values

In [None]:
df['FireplaceQu'].value_counts() , df['LotFrontage'].value_counts()

- 'FireplaceQu' is a categorical column, so missing values can be filled with 'None'.
- But 'LotFrontage' is a numerical column that also has outliers, so it can be filled with the median.
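
To see why the median is preferred when outliers are present, here is a tiny sketch with made-up frontage values (not taken from the dataset). A single extreme value pulls the mean upward while the median stays close to the typical values.

```python
import pandas as pd

# Made-up lot frontages with one extreme value and one missing entry.
frontage = pd.Series([60.0, 65.0, 70.0, 75.0, 313.0, None])

mean_value = frontage.mean()      # pulled upward by the extreme value
median_value = frontage.median()  # stays near the typical values
print("mean  :", mean_value)
print("median:", median_value)

# Fill the missing entry with the median, as done for 'LotFrontage'.
filled = frontage.fillna(median_value)
```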

In [None]:
df['FireplaceQu']= df['FireplaceQu'].fillna('None')    
df['LotFrontage']= df['LotFrontage'].fillna(df.LotFrontage.median())    

In [None]:
missing_values = percent_missing_data(df)
missing_values

### Yay! There are no missing values now!

#### Numerical Features

In [None]:
df_numeric

- The numerical variables are skewed, so we can apply a log transformation, which reduces the skew and prevents negative price predictions.
- We will only apply the log transformation to columns that do not have any zero values, since the log of zero is undefined.
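
This notebook transforms only the columns without zeros. For columns that do contain zeros, one common alternative (not used here, shown purely as an assumption) is `np.log1p`, which computes log(1 + x), with `np.expm1` as its inverse. The values below are made up.

```python
import numpy as np

# Made-up areas; the zero stands for "feature absent" (e.g. no basement).
areas = np.array([0.0, 120.0, 850.0, 2400.0])

# log1p(x) = log(1 + x) is defined at zero, unlike log(x).
logged = np.log1p(areas)

# expm1 is the inverse of log1p, so values can be mapped back afterwards.
restored = np.expm1(logged)
print(logged)
print(restored)
```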

In [None]:
num_features=['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']

for feature in num_features:
    df[feature]=np.log(df[feature])

In [None]:
# df_numeric was created before the transformation, so inspect df directly
df[num_features]

#### Categorical Features:

- Let's Convert categorical features into numerical.
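
One way to do this conversion is to rank the categories by their mean target value. The sketch below is a hedged variant of that idea that learns the ranking from training rows only, so that test rows do not influence the encoding. The helper `fit_ordered_labels`, the toy frames, and the `-1` fallback for unseen categories are all assumptions for illustration.

```python
import pandas as pd

def fit_ordered_labels(categories: pd.Series, target: pd.Series) -> dict:
    """Rank categories by mean target value; the lowest mean gets label 0."""
    ordered = target.groupby(categories).mean().sort_values().index
    return {category: rank for rank, category in enumerate(ordered)}

# Made-up training and test rows.
train = pd.DataFrame({
    "Street": ["Grvl", "Pave", "Pave", "Grvl"],
    "SalePrice": [100.0, 200.0, 220.0, 120.0],
})
test = pd.DataFrame({"Street": ["Pave", "Grvl", "Dirt"]})  # "Dirt" is unseen

mapping = fit_ordered_labels(train["Street"], train["SalePrice"])

# Unseen categories map to NaN; here they are sent to -1 as a hypothetical fallback.
test_encoded = test["Street"].map(mapping).fillna(-1).astype(int)
print(mapping, test_encoded.tolist())
```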

In [None]:
df.head()

In [None]:
# Replace each category with its rank when categories are ordered by mean SalePrice
for feature in df.select_dtypes(include = "object"):
    labels_ordered=df.groupby([feature])['SalePrice'].mean().sort_values().index
    labels_ordered={k:i for i,k in enumerate(labels_ordered,0)}
    df[feature]=df[feature].map(labels_ordered)

In [None]:
df.head()

In [None]:
df.shape

## Feature Scaling

In [None]:
X = df.drop(['Id','SalePrice'],axis=1)
Y = df['SalePrice']

In [None]:
# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=101)

In [None]:
# Standard scaling our data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train) 
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
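
The fit-on-train, transform-on-test pattern above can also be bundled with the model. This is a minimal sketch using scikit-learn's `make_pipeline` on synthetic data. The data, the `alpha=1.0` value, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (made up for illustration).
rng = np.random.default_rng(1)
X_demo = np.column_stack([rng.normal(1500, 400, 300), rng.normal(5, 2, 300)])
y_demo = 0.001 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(0, 0.1, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics whenever predict() or score() is called.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)
print(round(r2, 3))
```

Keeping the scaler inside the pipeline means the test data cannot be scaled with the wrong statistics by accident.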

In [None]:
X_train.shape , X_test.shape

# 3. Model Building & Evaluation

## Ridge Regression

In [None]:
# Create the Ridge model
from sklearn.linear_model import Ridge
rid_reg = Ridge(alpha = 100)
rid_reg.fit(X_train, Y_train)

Y_pred = rid_reg.predict(X_test)

# testing the model
from sklearn.metrics import r2_score,mean_absolute_error
print("MAE : ",mean_absolute_error(Y_test, Y_pred))
print('R2 SCORE : ',r2_score(Y_test, Y_pred))


In [None]:
# Now, let's find the best value for alpha and train the model again
# (the error tracked here is the mean absolute error on the test split)

alpha_list = []
mae_list = []
for alpha_val in np.arange(0.01, 200):   # 0.01, 1.01, ..., 199.01
    ridge1 = Ridge(alpha = alpha_val)
    ridge1.fit(X_train, Y_train)
    alpha_list.append(alpha_val)
    
    # testing the model
    Y_predict = ridge1.predict(X_test)
    mae = mean_absolute_error(Y_test, Y_predict)
    mae_list.append(mae)
    
alpha_list = pd.DataFrame(alpha_list)
mae_list = pd.DataFrame(mae_list)
alpha_mae = pd.concat([alpha_list, mae_list], axis = 1)
alpha_mae.columns = ["alpha_list", "mae_list"]

alpha_mae[alpha_mae["mae_list"] == alpha_mae["mae_list"].min()]
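
The loop above scores every alpha on the test split, so the reported error is likely to be somewhat optimistic. A possible alternative, sketched here on synthetic data, is to let `RidgeCV` choose alpha by cross-validation inside the training set. The data, the grid, and the 5-fold setting are assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + rng.normal(0, 0.5, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

# Candidate alphas on a log-spaced grid; 5-fold CV runs inside the training set.
alphas = np.logspace(-2, 2, 20)
search = RidgeCV(alphas=alphas, cv=5, scoring="neg_mean_absolute_error")
search.fit(X_tr, y_tr)

# The test split is touched only once, for the final score.
test_r2 = search.score(X_te, y_te)
print("chosen alpha:", search.alpha_)
print("held-out R2 :", test_r2)
```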

In [None]:
# Create the Ridge model using best alpha value:
from sklearn.linear_model import Ridge
rid_reg = Ridge(alpha = 8.01)
rid_reg.fit(X_train, Y_train)

Y_pred_ridge = rid_reg.predict(X_test)

# testing the model
from sklearn.metrics import r2_score,mean_absolute_error
ridge_mae = mean_absolute_error(Y_test, Y_pred_ridge)
ridge_r2_score= r2_score(Y_test, Y_pred_ridge)

print("MAE for Ridge : ",ridge_mae)
print('R2 SCORE for Ridge: ',ridge_r2_score)


In [None]:
# Smallest prediction from the tuned Ridge model
Y_pred_ridge.min()

In [None]:
plt.figure(figsize=(10,8))
sb.regplot(x = Y_pred_ridge, y = Y_test);

## Lasso Regression

In [None]:
# Create Lasso model
from sklearn.linear_model import Lasso
ls = Lasso(alpha = 0.8)
ls.fit(X_train, Y_train)

Y_pred = ls.predict(X_test)

# testing the model
from sklearn.metrics import r2_score,mean_absolute_error
print("MAE : ",mean_absolute_error(Y_test, Y_pred))
print('R2 SCORE : ',r2_score(Y_test, Y_pred))


In [None]:
# Now, let's find the best value for alpha and train the model again
# (the error tracked here is the mean absolute error on the test split)

alpha_list = []
mae_list = []
for alpha_val in np.arange(0.01, 200):   # 0.01, 1.01, ..., 199.01
    ls1 = Lasso(alpha = alpha_val)
    ls1.fit(X_train, Y_train)
    alpha_list.append(alpha_val)
    
    # testing the model
    Y_predict = ls1.predict(X_test)
    mae = mean_absolute_error(Y_test, Y_predict)
    mae_list.append(mae)
    
alpha_list = pd.DataFrame(alpha_list)
mae_list = pd.DataFrame(mae_list)
alpha_mae = pd.concat([alpha_list, mae_list], axis = 1)
alpha_mae.columns = ["alpha_list", "mae_list"]

alpha_mae[alpha_mae["mae_list"] == alpha_mae["mae_list"].min()]

In [None]:
# Create the Lasso model using best alpha value:

ls = Lasso(alpha = 0.01)
ls.fit(X_train, Y_train)

Y_pred_lasso = ls.predict(X_test)

# testing the model
lasso_mae = mean_absolute_error(Y_test, Y_pred_lasso)
lasso_r2_score= r2_score(Y_test, Y_pred_lasso)

print("MAE for Lasso : ",lasso_mae)
print('R2 SCORE for Lasso : ',lasso_r2_score)


In [None]:
Y_pred_lasso.min()

In [None]:
plt.figure(figsize=(10,8))
sb.regplot(x = Y_pred_lasso, y = Y_test)

## Polynomial Regression

In [None]:
#Import the polynomial converter
from sklearn.preprocessing import PolynomialFeatures
polynomial_converter = PolynomialFeatures(degree=2,include_bias=False)

#convert X data (fit on the training set, then reuse the fitted converter on the test set)
poly_features_train = polynomial_converter.fit_transform(X_train)
poly_features_test = polynomial_converter.transform(X_test)
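
To make the degree-2 expansion concrete, here is a small sketch on a made-up 3-column array. It shows that the expanded matrix holds the original columns plus all squares and pairwise products, so the column count grows quadratically with the number of inputs.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Tiny made-up arrays standing in for the scaled train and test sets.
X_tr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
X_te = np.array([[7.0, 8.0, 9.0]])

converter = PolynomialFeatures(degree=2, include_bias=False)
poly_tr = converter.fit_transform(X_tr)  # learn the feature layout on training data
poly_te = converter.transform(X_te)      # reuse the same layout for test data

n = X_tr.shape[1]
expected = n + n * (n + 1) // 2  # originals + squares and pairwise products
print(poly_tr.shape, poly_te.shape, expected)
```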

In [None]:
#import elastic net (with l1_ratio = 1 the penalty is pure L1, i.e. a Lasso fit)
from sklearn.linear_model import ElasticNetCV
elastic_model = ElasticNetCV(l1_ratio= 1,tol=0.01)
elastic_model.fit(poly_features_train,Y_train)

In [None]:
Y_pred_poly = elastic_model.predict(poly_features_test)

In [None]:
#testing model
poly_mae = mean_absolute_error(Y_test, Y_pred_poly)
poly_r2_score = r2_score(Y_test, Y_pred_poly)
print("MAE for Polynomial: ",poly_mae)
print('R2 SCORE for Polynomial: ',poly_r2_score)

In [None]:
# Smallest prediction from the polynomial model
Y_pred_poly.min()

In [None]:
plt.figure(figsize=(10,8))
sb.regplot(x = Y_pred_poly, y = Y_test)

In [None]:
models = pd.DataFrame({
    'Regression Model': ['Ridge','Lasso','Polynomial'],
    'MAE Score': [
        ridge_mae, 
        lasso_mae,
        poly_mae ],
    'R2 Score': [
        ridge_r2_score, 
        lasso_r2_score,
        poly_r2_score   
    ]})
print("--- MODEL EVALUATION---")
models.sort_values(by='MAE Score', ascending=True)

- We can see that the **Ridge Regression model is the most suitable** here.
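
Because 'SalePrice' was log-transformed earlier, the MAE values in the table are in log units. A minimal sketch of converting such an error back to price units is shown below. The prices are made up and only illustrate the arithmetic.

```python
import numpy as np

# Made-up sale prices and predictions, both held on the log scale
# (as they are after the np.log transform of SalePrice).
y_true_log = np.log(np.array([150000.0, 200000.0, 320000.0]))
y_pred_log = np.log(np.array([160000.0, 190000.0, 300000.0]))

# Error in log units, comparable to the MAE scores in the table above.
mae_log = np.mean(np.abs(y_true_log - y_pred_log))

# np.exp undoes np.log, giving an error expressed in currency units.
mae_price = np.mean(np.abs(np.exp(y_true_log) - np.exp(y_pred_log)))
print(mae_log, mae_price)
```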

## Consider **UPVOTING** if you find it useful.....

### Please share your valuable feedback and suggestions in the comments.