# **House Price Prediction using Advanced Regression:**
* **[EDA](#intLink)**
* **[Pre-Processing](#intLink2)**
  * **[Handling Missing Data](#intLink3)**
  * **[Handling Outliers](#intLink4)**
* **[Linear Regression](#intLink5)**
* **[Polynomial Regression](#intLink6)**
* **[Regularization (Ridge - LASSO - ElasticNet)](#intLink7)**


In [None]:
#loading necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
#loading train and test from house-prices dataset
train_set = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_set = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
test_label = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')

In [None]:
test_set.shape

In [None]:
#adding SalePrice from sample_submission (placeholder values, not the true test labels)
test_set["SalePrice"] = test_label['SalePrice']

In [None]:
#combining the train and test set for cleaning
#df_final= pd.concat([train_set,test_set])
df_final = pd.concat([test_set.assign(ind="test_set"), train_set.assign(ind="train_set")])

# <div id="intLink"> EDA </div>

In [None]:
# relationship between 'SalePrice' and 'GrLivArea'
# we can spot outliers that we remove later on
sns.jointplot(data=train_set, x='SalePrice', y='GrLivArea')

In [None]:
#OverallQual is the feature most correlated with SalePrice; the relationship looks roughly linear with no obvious outliers
plt.figure(figsize=(30, 34))
sns.stripplot(data=train_set, x = 'OverallQual', y='SalePrice')

# <div id="intLink2"> Pre-Processing
1. Handling missing values
2. High/low correlation data
3. Categorical data
4. Numerical columns to categorical
5. Dealing with outliers
6. Creating dummy variables (one-hot encoding) </div>

#  <div id="intLink3">1. Handling **missing** values
* Drop `Id` and every column with more than 70% null values (this may not be the best choice for every problem or dataset).
* Handle the null values in the rest of the features.</div>
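
The column-dropping rule above can be sketched on a toy frame. The frame `toy`, its column names, and its values are hypothetical and only illustrate the idea; they are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'Pool' is 80% missing, 'Lot' is 20% missing, 'Area' is complete.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Pool": [np.nan, np.nan, np.nan, np.nan, "Gd"],
    "Lot": [60.0, np.nan, 75.0, 80.0, 65.0],
    "Area": [1500, 1200, 1800, 2000, 1600],
})

# Percentage of missing values per column.
missing_pct = toy.isnull().mean() * 100

# Columns above the 70% threshold, plus the uninformative Id column.
threshold = 70
to_drop = missing_pct[missing_pct > threshold].index.tolist() + ["Id"]
cleaned = toy.drop(columns=to_drop)

print(missing_pct.round(1).to_dict())
print(cleaned.columns.tolist())
```

Columns that stay below the threshold, such as `Lot` here, are kept and have to be imputed separately.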

In [None]:
#finding features with the most duplicate values?

In [None]:

def missing_percent(df):
    """Return the percentage of missing values per column, largest first."""
    nan_percent = 100 * (df.isnull().sum() / len(df))
    # Keep only columns that have missing values, sorted in descending order
    nan_percent = nan_percent[nan_percent > 0].sort_values(ascending=False).round(1)
    # The column label (with its original spelling) is reused later when selecting columns to drop
    mis_percent = pd.DataFrame(nan_percent).rename(columns={0: '% of Misiing Values'})
    return mis_percent


In [None]:
miss = missing_percent(train_set)
miss

In [None]:
#Removing the Id that has no value for our prediction
train_set= train_set.drop('Id', axis=1)

In [None]:
nan_percent = 100*(train_set.isnull().sum()/len(train_set))
nan_percent = nan_percent[nan_percent>0].sort_values()

In [None]:
# Every feature with missing data must be checked!
# We use a 1% threshold: the y-axis is capped at 1% so features with less than 1% missing values stand out

plt.figure(figsize=(12,6))
sns.barplot(x=nan_percent.index, y=nan_percent)
plt.xticks(rotation=90)

#Set 1% threshold:
plt.ylim(0,1)

In [None]:
train_set.shape

In [None]:
#drop features that have more than 70% missing value
#credit: https://www.kaggle.com/rushikeshdarge/handle-missing-values-only-notebook-you-need
threshold = 70
drop_cols = miss[miss['% of Misiing Values'] > threshold].index.tolist()
drop_cols


In [None]:
train_set= train_set.drop(drop_cols, axis=1)

In [None]:
nan_percent = 100*(train_set.isnull().sum()/len(train_set))
nan_percent = nan_percent[nan_percent>0].sort_values()

In [None]:
#Every feature with missing data must be checked!
#We use a 1% threshold: the y-axis is capped at 1% so features with less than 1% missing values stand out

plt.figure(figsize=(12,6))
sns.barplot(x=nan_percent.index, y=nan_percent)
plt.xticks(rotation=90)

#Set 1% threshold:
plt.ylim(0,1)

**FireplaceQu: Fireplace quality**
* According to the data documentation, an NA value in this feature means the house has no fireplace, so we fill the column with 'None'.

In [None]:
train_set['FireplaceQu']= train_set['FireplaceQu'].fillna('None')

In [None]:
#Filling null values with the most frequent value
#train_set['KitchenQual']= train_set['KitchenQual'].fillna('TA')

In [None]:
#df_final['SaleType']= df_final['SaleType'].fillna('Oth')

In [None]:
#df_final['Functional']= df_final['Functional'].fillna('Typ')

In [None]:
#df_final['Exterior1st']= df_final['Exterior1st'].fillna('Other')
#df_final.fillna({'Exterior1st':'Other', 'Exterior2nd':'Other', 'Utilities':'Other'}, inplace=True)

**Garage & Basement**
* Looking at the plot, we see that most features with missing values belong to the same categories.

In [None]:
#After checking the data documentation,
#the missing values (two rows) in the basement features occur because those houses have no basement
#Decision: fill numerical basement columns with 0 and descriptive string columns with 'None':

#Numerical Columns fill with 0:
bsmt_num_cols= ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF' ,'BsmtFullBath', 'BsmtHalfBath']
train_set[bsmt_num_cols]=train_set[bsmt_num_cols].fillna(0)

#String Columns fill with None:
bsmt_str_cols= ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']
train_set[bsmt_str_cols]= train_set[bsmt_str_cols].fillna('None')

**Mas Vnr Features:**

* Based on the dataset documentation, missing values for `MasVnrType` and `MasVnrArea` mean the house doesn't have any masonry veneer, so we fill the missing values as below:

In [None]:
train_set["MasVnrType"]= train_set["MasVnrType"].fillna("None")
train_set["MasVnrArea"]= train_set["MasVnrArea"].fillna(0)

**Garage Columns:**
* Based on the dataset documentation, NaN in Garage Columns seems to indicate no garage.

* Decision: Fill with 'None' or 0

In [None]:
train_set[['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond']]

In [None]:
#now we will extract all the numerical features from the dataset
#numerical_features= [feature for feature in train_set.columns if train_set[feature].dtypes !='O']

#print('Number of Numerical Features:',len(numerical_features))

#train_set[numerical_features].head(5)

In [None]:
#now we will extract datetime features from the dataset
#year_feature=[feature for feature in numerical_features if 'Year' in feature or 'Yr' in feature]

#print('Number of Yearly Features:',len(year_feature))
#train_set[year_feature].head(5)

In [None]:
#now we will analyze yearly features wrt SalePrice, which is our dependent (target) variable
#for feature in year_feature:
    
    
 #   train_set.groupby(feature)['SalePrice'].median().plot()
  #  plt.show()

In [None]:
#Filling the missing Value:
Gar_str_cols= ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']
train_set[Gar_str_cols]=train_set[Gar_str_cols].fillna('None')

train_set['GarageYrBlt']=train_set['GarageYrBlt'].fillna(0)

In [None]:
#Impute missing data based on other columns:

train_set.groupby('Neighborhood')['LotFrontage']

In [None]:
train_set.groupby('Neighborhood')['LotFrontage'].mean()

In [None]:
#Filling null values with the neighborhood mean (preview only, the result is not assigned here)
train_set.groupby('Neighborhood')['LotFrontage'].transform(lambda val: val.fillna(val.mean()))

In [None]:
train_set['LotFrontage']=train_set.groupby('Neighborhood')['LotFrontage'].transform(lambda val: val.fillna(val.mean()))
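
The group-wise imputation used for `LotFrontage` can be checked on a small example. The frame `toy` and its values below are hypothetical; only the `groupby(...).transform(...)` pattern mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two neighborhoods, one missing frontage in each.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, 80.0, np.nan, 30.0, np.nan, 50.0],
})

# Replace each missing value with the mean of its own neighborhood.
toy["LotFrontage"] = (
    toy.groupby("Neighborhood")["LotFrontage"]
       .transform(lambda val: val.fillna(val.mean()))
)
print(toy)
```

A neighborhood whose values are all missing has a NaN mean and therefore stays NaN, so a fallback fill may still be needed afterwards.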

In [None]:
#Filling null values with the most frequent value
#train_set['MSZoning'].value_counts()[train_set['MSZoning'].value_counts() == df_final['MSZoning'].value_counts().max()].index


In [None]:
#train_set['MSZoning']=train_set['MSZoning'].fillna('RL')

In [None]:
train_set['LotFrontage']=train_set['LotFrontage'].fillna(0)

In [None]:
train_set[train_set['GarageArea'].isnull()]

# Handling test_set null values
* Functional
* Exterior1st
* Exterior2nd
* KitchenQual
* SaleType
* Utilities
* MSZoning

In [None]:
#Fill any remaining LotFrontage nulls with the neighborhood maximum
train_set['LotFrontage']=train_set.groupby('Neighborhood')['LotFrontage'].transform(lambda val: val.fillna(val.max()))

In [None]:
#Filling null values with the most frequent value
#train_set['Functional'].value_counts()[train_set['Functional'].value_counts() == train_set['Functional'].value_counts().max()].index

In [None]:
nan_percent = 100*(train_set.isnull().sum()/len(train_set))
nan_percent = nan_percent[nan_percent>0].sort_values()

In [None]:
#plot the feature with missing indicating the percent of missing data
plt.figure(figsize=(12,6))
sns.barplot(x=nan_percent.index, y=nan_percent)
plt.xticks(rotation=90)

In [None]:
train_set= train_set.dropna(axis=0, subset=['Electrical', 'GarageArea'])

In [None]:
#Filling null values with the most frequent value
#df_final['MSZoning'].value_counts()[df_final['MSZoning'].value_counts() == df_final['MSZoning'].value_counts().max()].index


In [None]:
#df_final['MSZoning']= df_final['MSZoning'].transform(lambda val: val.fillna(val.max()))

In [None]:
df_final.isnull().sum()

**Finally, we check whether any null values remain in the dataset**

In [None]:
nan_percent= missing_percent(train_set)

In [None]:
nan_percent = 100*(train_set.isnull().sum()/len(train_set))
nan_percent = nan_percent[nan_percent>0].sort_values()

In [None]:
nan_percent

In [None]:
#df = pd.concat([test.assign(ind="test"), train.assign(ind="train")])

In [None]:
train_set.shape

# <div id="intLink4"> Handling Outliers</div>

In [None]:
from sklearn.neighbors import LocalOutlierFactor
#credit https://www.kaggle.com/hrshtporwal5/houseprice-prediction
def detect_outliers(x, y, top=5, plot=True):
    lof = LocalOutlierFactor(n_neighbors=40, contamination=0.1)
    x_ =np.array(x).reshape(-1,1)
    preds = lof.fit_predict(x_)
    lof_scr = lof.negative_outlier_factor_
    out_idx = pd.Series(lof_scr).sort_values()[:top].index
    if plot:
        f, ax = plt.subplots(figsize=(9, 6))
        plt.scatter(x=x, y=y, c=np.exp(lof_scr), cmap='RdBu')
    return out_idx
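
To see what `LocalOutlierFactor` returns, here is a minimal sketch on synthetic data. The array `x` and the two extreme values appended to it are assumptions made for illustration; the `n_neighbors` and `contamination` settings copy the function above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Hypothetical 1-D feature: 200 typical values plus two extreme ones appended at the end.
x = np.concatenate([rng.normal(1500, 300, size=200), [9000.0, 12000.0]])

lof = LocalOutlierFactor(n_neighbors=40, contamination=0.1)
labels = lof.fit_predict(x.reshape(-1, 1))   # -1 marks outliers, 1 marks inliers
scores = lof.negative_outlier_factor_        # more negative means more anomalous

# Positions of the two most anomalous points.
top2 = np.argsort(scores)[:2]
print(sorted(top2.tolist()))
```

Sorting `negative_outlier_factor_` in ascending order, as `detect_outliers` does, puts the most isolated points first. Note that `contamination=0.1` flags roughly 10% of all rows, which is why the function returns only the `top` few instead of every flagged row.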

In [None]:
#GrLivArea-SalePrice outlier detection
outs = detect_outliers(train_set['GrLivArea'], train_set['SalePrice'],top=5) 
outs
plt.axhline(y=200000, color='r')
plt.axvline(x=4000, color='r')
#credit https://www.kaggle.com/hrshtporwal5/houseprice-prediction

In [None]:
#train_set[(train_set['OverallQual']>8) &(train_set['SalePrice']<200000)][['SalePrice', 'OverallQual']]

In [None]:
#corr = train_set.corr()
#top_corr_features = corr.index[abs(corr["SalePrice"])>0.5].sort_values(ascending=True)


In [None]:
#Remove the outliers:
index_drop=train_set[(train_set['GrLivArea']>4000) & (train_set['SalePrice']<400000)].index
train_set=train_set.drop(index_drop, axis=0)

GrLivArea without outliers

In [None]:
#Remove the outliers:
index_drop=train_set[(train_set['GrLivArea']>4000) & (train_set['SalePrice']>400000)].index
train_set=train_set.drop(index_drop, axis=0)

In [None]:
sns.scatterplot(x='GrLivArea', y='SalePrice', data=train_set)
plt.axhline(y=200000, color='r')
plt.axvline(x=4000, color='r')

In [None]:
sns.scatterplot(x='OverallQual', y='SalePrice', data=train_set)
#no need to remove any data

In [None]:
sns.boxplot(x='GarageCars', y='SalePrice', data=train_set)
plt.axhline(y=680000,color='r')


In [None]:
sns.scatterplot(data=train_set, x='TotRmsAbvGrd', y='SalePrice')
plt.axhline(y=250000, color='r')
plt.axvline(x=12.8, color='r')

In [None]:
train_set[(train_set['TotRmsAbvGrd']>12.7) & (train_set['SalePrice']<250000)][['SalePrice', 'TotRmsAbvGrd']]

In [None]:
#Remove the outliers:
index_drop=train_set[(train_set['TotRmsAbvGrd']>12.7) & (train_set['SalePrice']<250000)].index
train_set=train_set.drop(index_drop, axis=0)

In [None]:
sns.scatterplot(data=train_set, x='TotRmsAbvGrd', y='SalePrice')

# Features that have high correlation (higher than 0.5)

In [None]:
# get correlations of each numeric feature in the dataset
# Plotting a heat map to visualise the correlation data better.
# Drawn only for features with a high correlation
# (>0.5) with the target variable
# numeric_only=True skips object columns (required on pandas 2.x, available from pandas 1.5)
corr = train_set.corr(numeric_only=True)
top_corr_features = corr.index[abs(corr["SalePrice"])>0.5]

plt.figure(figsize=(10,10))
#plot heat map
g=sns.heatmap(train_set[top_corr_features].corr(),annot=True,cmap="YlGnBu")
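
The 0.5 correlation filter can be illustrated on synthetic data. The columns `Area` and `Noise` and the way `SalePrice` is generated are hypothetical; the sketch only shows how the boolean mask on the target column selects features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
area = rng.normal(1500, 400, n)
noise_feature = rng.normal(0, 1, n)            # unrelated to the target by construction
price = 100 * area + rng.normal(0, 20000, n)   # target driven mainly by area

toy = pd.DataFrame({"Area": area, "Noise": noise_feature, "SalePrice": price})

corr = toy.corr()
# Keep features whose absolute correlation with the target exceeds 0.5.
top = corr.index[corr["SalePrice"].abs() > 0.5]
print(top.tolist())
```

The target itself always passes the filter because its correlation with itself is 1, so it appears in the selected index alongside the features.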

In [None]:
#this shows that the median SalePrice rises with OverallQual
train_set.groupby('OverallQual')['SalePrice'].median().plot()
plt.xlabel('OverallQual')
plt.ylabel('Median House Price')
plt.title("Median House Price vs OverallQual")

In [None]:
top_corr_features

#  Dealing with Categorical Data

In [None]:
#Convert to String:
train_set['MSSubClass']= train_set['MSSubClass'].apply(str)

**Creating Dummy Variables**

In [None]:
train_set.select_dtypes(include='object')

In [None]:
train_set_num= train_set.select_dtypes(exclude='object')
train_set_obj= train_set.select_dtypes(include='object')

In [None]:
# Converting:
train_set_obj= pd.get_dummies(train_set_obj, drop_first=True)

In [None]:
Final_df= pd.concat([train_set_num, train_set_obj], axis=1)
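
A small sketch of what `drop_first=True` does during one-hot encoding. The frame `toy` and its category values are hypothetical, and the `columns=` argument is used here instead of splitting by dtype, purely to keep the example short.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two categorical columns.
toy = pd.DataFrame({
    "Area": [1500, 1200, 1800],
    "Street": ["Pave", "Grvl", "Pave"],
    "Heating": ["GasA", "GasW", "Wall"],
})

# drop_first=True drops the first level of each categorical column,
# so k categories become k-1 indicator columns.
encoded = pd.get_dummies(toy, columns=["Street", "Heating"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per feature avoids perfectly collinear indicator columns, which matters for an unregularized linear regression.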

# <div id="intLink5">**Linear Regression**
* We start with simple linear regression and then try to improve the model.</div>

In [None]:
#Separate features and target from Final_df
X = Final_df.drop('SalePrice',axis = 1)
y = Final_df['SalePrice']

In [None]:
#X = X.apply(pd.to_numeric, errors='coerce')
#y = y.apply(pd.to_numeric, errors='coerce')

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [None]:
#Split the Dataset to Train & Test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

In [None]:
#train the model
from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(X_train, y_train)

In [None]:
#predicting test data
y_pred=model.predict(X_test)

In [None]:
#evaluating the model
from sklearn import metrics
MAE=metrics.mean_absolute_error(y_test,y_pred)
MSE=metrics.mean_squared_error(y_test,y_pred)
RMSE=np.sqrt(MSE)

In [None]:
#coefficient matrix
pd.DataFrame(model.coef_,X.columns,columns=["coefficient"])

In [None]:
pd.DataFrame(data=[MAE,MSE,RMSE],index=["MAE","MSE","RMSE"],columns=["LinearRegression"])
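
For reference, the three metrics in the table can be computed by hand. The arrays `y_true` and `y_hat` below are made-up values chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 2.0])
y_hat = np.array([2.0, 5.0, 4.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in the target's units
print(mae, mse, rmse)
```

MSE and RMSE penalize large errors more heavily than MAE, and only MAE and RMSE are expressed in the same units as `SalePrice`.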

# <div id="intLink6">Polynomial Regression to improve our model
* Polynomial regression adds squared and interaction terms of the existing features, so the linear model can capture non-linear relationships.</div>

In [None]:
from sklearn.preprocessing import PolynomialFeatures

polynomial_converter=PolynomialFeatures(degree=2, include_bias=False)

In [None]:
polynomial_converter.fit(X)

In [None]:
poly_features=polynomial_converter.transform(X)

# Poly_Features: 
(X1, X2, X3, X1^2, X2^2, X3^2, X1X2, X1X3, X2X3)
* Split the Data to Train & Test
* Train the Model

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

In [None]:
from sklearn.linear_model import LinearRegression
polymodel=LinearRegression()
polymodel.fit(X_train, y_train)

In [None]:
y_pred=polymodel.predict(X_test)

In [None]:
pd.DataFrame({'Y_Test': y_test,'Y_Pred':y_pred, 'Residuals':(y_test-y_pred) }).head(5)

In [None]:
from sklearn import metrics
MAE_Poly = metrics.mean_absolute_error(y_test,y_pred)
MSE_Poly = metrics.mean_squared_error(y_test,y_pred)
RMSE_Poly = np.sqrt(MSE_Poly)

pd.DataFrame([MAE_Poly, MSE_Poly, RMSE_Poly], index=['MAE', 'MSE', 'RMSE'], columns=['metrics'])

# Polynomial Regression vs Linear Regression
* **RMSE decreased significantly**

In [None]:
XS_train, XS_test, ys_train, ys_test = train_test_split(X, y, test_size=0.3, random_state=101)
simplemodel=LinearRegression()
simplemodel.fit(XS_train, ys_train)
ys_pred=simplemodel.predict(XS_test)

MAE_simple = metrics.mean_absolute_error(ys_test,ys_pred)
MSE_simple = metrics.mean_squared_error(ys_test,ys_pred)


In [None]:
RMSE_simple = np.sqrt(MSE_simple)

In [None]:
pd.DataFrame({'Poly Metrics': [MAE_Poly, MSE_Poly, RMSE_Poly], 'Simple Metrics':[MAE_simple, MSE_simple, RMSE_simple]}, index=['MAE', 'MSE', 'RMSE'])

# <div id="intLink7">Regularization</div>
* [Scaling the Data](#intLink8)
* [Ridge Regression (Cross-Validation)](#intLink9)


In [None]:
X = Final_df.drop('SalePrice',axis = 1)
y = Final_df['SalePrice']

In [None]:
from sklearn.preprocessing import PolynomialFeatures
polynomial_converter= PolynomialFeatures(degree=2, include_bias=False)
poly_features= polynomial_converter.fit_transform(X)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

<div id="intLink8">Scaling the Data</div>

In [None]:
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
scaler.fit(X_train)

In [None]:
X_train= scaler.transform(X_train)
X_test= scaler.transform(X_test)
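
The scaler is fitted on the training split only and then reused for the test split. A tiny sketch with made-up numbers shows the effect; the arrays `train` and `test` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature splits.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(train)                      # learns mean and std from the training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the training statistics

print(train_scaled.ravel(), test_scaled.ravel())
```

Fitting on the training rows alone keeps information about the test set out of the preprocessing step, so the test error remains an honest estimate.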

<div id="intLink9">Ridge Regression</div>


In [None]:
#Train the Model
from sklearn.linear_model import Ridge
ridge_model= Ridge(alpha=10)

In [None]:
ridge_model.fit(X_train, y_train)

In [None]:
#predict Test Data
y_pred= ridge_model.predict(X_test)

In [None]:
#Evaluating the Model
from sklearn.metrics import mean_absolute_error, mean_squared_error

MAE= mean_absolute_error(y_test, y_pred)
MSE= mean_squared_error(y_test, y_pred)
RMSE= np.sqrt(MSE)

In [None]:
pd.DataFrame([MAE, MSE, RMSE], index=['MAE', 'MSE', 'RMSE'], columns=['metrics'])

In [None]:
#Train the Model
from sklearn.linear_model import RidgeCV
ridge_cv_model=RidgeCV(alphas=(0.1, 1.0, 10.0), scoring='neg_mean_absolute_error')

In [None]:
ridge_cv_model.fit(X_train, y_train)

In [None]:
ridge_cv_model.alpha_

In [None]:
#Predicting Test Data
y_pred_ridge= ridge_cv_model.predict(X_test)

In [None]:
MAE_ridge= mean_absolute_error(y_test, y_pred_ridge)
MSE_ridge= mean_squared_error(y_test, y_pred_ridge)
RMSE_ridge= np.sqrt(MSE_ridge)

In [None]:
pd.DataFrame([MAE_ridge, MSE_ridge, RMSE_ridge], index=['MAE', 'MSE', 'RMSE'], columns=['Ridge Metrics'])

<div id="intLink11">Lasso Regression</div>


In [None]:
from sklearn.linear_model import LassoCV
lasso_cv_model= LassoCV(eps=0.01, n_alphas=100, cv=5)
#fit the model before reading alpha_ or predicting
lasso_cv_model.fit(X_train, y_train)

In [None]:
lasso_cv_model.alpha_

In [None]:
y_pred_lasso= lasso_cv_model.predict(X_test)

In [None]:
MAE_Lasso= mean_absolute_error(y_test, y_pred_lasso)
MSE_Lasso= mean_squared_error(y_test, y_pred_lasso)
RMSE_Lasso= np.sqrt(MSE_Lasso)

In [None]:
pd.DataFrame([MAE_Lasso, MSE_Lasso, RMSE_Lasso], index=['MAE', 'MSE', 'RMSE'], columns=['Lasso Metrics'])

<div id="intLink12">Elastic Net</div>
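
The notebook stops at this heading, so what follows is only a minimal sketch of how an Elastic Net step might look, following the same pattern as the Ridge and Lasso cells. It runs on synthetic stand-in data rather than the house-price features; the variable names (`X_demo`, `elastic_cv_model`, and so on) and the `l1_ratio` candidates are assumptions, not the author's choices.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: 200 rows, 10 features, only the first two matter.
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
# Scale with statistics learned from the training split only.
demo_scaler = StandardScaler().fit(Xd_train)
Xd_train = demo_scaler.transform(Xd_train)
Xd_test = demo_scaler.transform(Xd_test)

# l1_ratio mixes the L1 (Lasso) and L2 (Ridge) penalties; the candidates are an assumption.
elastic_cv_model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, max_iter=10000)
elastic_cv_model.fit(Xd_train, yd_train)

yd_pred = elastic_cv_model.predict(Xd_test)
MAE_elastic = mean_absolute_error(yd_test, yd_pred)
RMSE_elastic = np.sqrt(mean_squared_error(yd_test, yd_pred))
print(elastic_cv_model.l1_ratio_, elastic_cv_model.alpha_, MAE_elastic, RMSE_elastic)
```

On the real data, the same calls would presumably take the scaled `X_train` and `X_test` from the cells above. An `l1_ratio_` of 1.0 means cross-validation preferred a pure Lasso penalty, while values near 0 lean toward Ridge.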