# House Price Prediction
This notebook combines a detailed data analysis with several regression techniques, ranging from simple linear models to tree ensembles. I hope you enjoy the analysis.

The first few cells import the necessary libraries and explore the train and test datasets.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler

In [None]:
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
train.head()

In [None]:
test.shape

In [None]:
train.shape

In [None]:
test.head()

In [None]:
train.columns

In [None]:
test.columns

The features of the dataset that we will be working with are listed below.

1. SalePrice: The property's sale price in dollars. This is the target variable that we are trying to predict.
2. MSSubClass: The building class.
3. MSZoning: The general zoning classification.
4. LotFrontage: Linear feet of street connected to property.
5. LotArea: Lot size in square feet.
6. Street: Type of road access.
7. Alley: Type of alley access.
8. LotShape: General shape of property.
9. LandContour: Flatness of the property.
10. Utilities: Type of utilities available.
11. LotConfig: Lot configuration.
12. LandSlope: Slope of property.
13. Neighborhood: Physical locations within Ames city limits.
14. Condition1: Proximity to main road or railroad.
15. Condition2: Proximity to main road or railroad (if a second is present).
16. BldgType: Type of dwelling.
17. HouseStyle: Style of dwelling.
18. OverallQual: Overall material and finish quality.
19. OverallCond: Overall condition rating.
20. YearBuilt: Original construction date.
21. YearRemodAdd: Remodel date.
22. RoofStyle: Type of roof.
23. RoofMatl: Roof material.
24. Exterior1st: Exterior covering on house.
25. Exterior2nd: Exterior covering on house (if more than one material).
26. MasVnrType: Masonry veneer type.
27. MasVnrArea: Masonry veneer area in square feet.
28. ExterQual: Exterior material quality.
29. ExterCond: Present condition of the material on the exterior.
30. Foundation: Type of foundation.
31. BsmtQual: Height of the basement.
32. BsmtCond: General condition of the basement.
33. BsmtExposure: Walkout or garden level basement walls.
34. BsmtFinType1: Quality of basement finished area.
35. BsmtFinSF1: Type 1 finished square feet.
36. BsmtFinType2: Quality of second finished area (if present).
37. BsmtFinSF2: Type 2 finished square feet.
38. BsmtUnfSF: Unfinished square feet of basement area.
39. TotalBsmtSF: Total square feet of basement area.
40. Heating: Type of heating.
41. HeatingQC: Heating quality and condition.
42. CentralAir: Central air conditioning.
43. Electrical: Electrical system.
44. 1stFlrSF: First floor square feet.
45. 2ndFlrSF: Second floor square feet.
46. LowQualFinSF: Low quality finished square feet (all floors).
47. GrLivArea: Above grade (ground) living area square feet.
48. BsmtFullBath: Basement full bathrooms.
49. BsmtHalfBath: Basement half bathrooms.
50. FullBath: Full bathrooms above grade.
51. HalfBath: Half baths above grade.
52. BedroomAbvGr: Number of bedrooms above basement level.
53. KitchenAbvGr: Number of kitchens.
54. KitchenQual: Kitchen quality.
55. TotRmsAbvGrd: Total rooms above grade (does not include bathrooms).
56. Functional: Home functionality rating.
57. Fireplaces: Number of fireplaces.
58. FireplaceQu: Fireplace quality.
59. GarageType: Garage location.
60. GarageYrBlt: Year garage was built.
61. GarageFinish: Interior finish of the garage.
62. GarageCars: Size of garage in car capacity.
63. GarageArea: Size of garage in square feet.
64. GarageQual: Garage quality.
65. GarageCond: Garage condition.
66. PavedDrive: Paved driveway.
67. WoodDeckSF: Wood deck area in square feet.
68. OpenPorchSF: Open porch area in square feet.
69. EnclosedPorch: Enclosed porch area in square feet.
70. 3SsnPorch: Three season porch area in square feet.
71. ScreenPorch: Screen porch area in square feet.
72. PoolArea: Pool area in square feet.
73. PoolQC: Pool quality.
74. Fence: Fence quality.
75. MiscFeature: Miscellaneous feature not covered in other categories.
76. MiscVal: Value of miscellaneous feature.
77. MoSold: Month sold.
78. YrSold: Year sold.
79. SaleType: Type of sale.
80. SaleCondition: Condition of sale.

Our target variable is 'SalePrice', so we start by looking at its summary statistics and distribution.

In [None]:
train.SalePrice.describe()

In [None]:
plt.style.use('bmh')
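# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(train['SalePrice'], color='g', bins=100, kde=True, alpha=0.4)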

As we have seen, the dataset has both numerical and categorical columns. We therefore split them into separate dataframes and analyze each one on its own.

In [None]:
list(set(train.dtypes.tolist()))

In [None]:
train = train.drop(labels = ["Id"],axis = 1)

In [None]:
df_numerical=train.select_dtypes(include=['int64','float64'])
df_numerical.head()

In [None]:
corrmat = df_numerical.corr()
# Reuse the matrix computed above instead of recomputing it
g = sns.heatmap(corrmat)

Among all the numerical columns, we keep the most important ones: those whose absolute correlation with the target variable is greater than 0.55.

In [None]:
T_corr=corrmat.index[abs(corrmat['SalePrice'])>0.55]
g = sns.heatmap(df_numerical[T_corr].corr(),annot=True,cmap="RdYlGn")

So 'OverallQual', 'TotalBsmtSF', '1stFlrSF', 'GrLivArea', 'GarageCars' and 'GarageArea' are the numerical features that correlate most strongly with our target variable.
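
The threshold-based selection above can be sketched on a small synthetic frame. The column names `Quality` and `Noise`, the generated values, and the helper `select_by_correlation` are made up for illustration and do not come from the competition data; only `DataFrame.corr` and boolean indexing are used.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numerical training frame (hypothetical values).
rng = np.random.default_rng(0)
n = 200
quality = rng.integers(1, 11, size=n).astype(float)
toy = pd.DataFrame({
    "Quality": quality,                       # strongly related to the target
    "Noise": rng.normal(size=n),              # unrelated to the target
    "SalePrice": 20000 * quality + rng.normal(scale=10000, size=n),
})

def select_by_correlation(df, target, threshold):
    """Return columns whose absolute correlation with `target` exceeds `threshold`."""
    corr_with_target = df.corr()[target].abs()
    selected = corr_with_target[corr_with_target > threshold].index.tolist()
    # The target always correlates perfectly with itself, so leave it out.
    return [col for col in selected if col != target]

selected_features = select_by_correlation(toy, "SalePrice", 0.55)
print(selected_features)
```

Taking the absolute value matters because a strongly negative correlation is just as informative as a strongly positive one.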

In [None]:
df_cat=train.select_dtypes(include=['O'])
df_cat.head()

In [None]:
df_cat.columns

# Spotting and rejecting outliers

In [None]:
import plotly.offline as py
from plotly.offline import iplot, init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode(connected = True)

In [None]:
#Plotting scatter in plotly
def scatter_plot(x, y, title, xaxis, yaxis, size, c_scale):
    trace = go.Scatter(x = x,
                        y = y,
                        mode = 'markers',
                        marker = dict(color = y, size=size, showscale = True, colorscale = c_scale))
    layout = go.Layout(hovermode = 'closest', title = title, xaxis = dict(title = xaxis), yaxis = dict(title = yaxis))
    fig = go.Figure(data = [trace], layout = layout)
    return iplot(fig)

In [None]:
scatter_plot(train.GrLivArea, train.SalePrice, 'GrLivArea vs SalePrice', 'GrLivArea', 'SalePrice', 10, 'Rainbow')

In [None]:
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
scatter_plot(train.GrLivArea, train.SalePrice, 'GrLivArea vs SalePrice', 'GrLivArea', 'SalePrice', 10, 'Rainbow')

In [None]:
scatter_plot(train.TotalBsmtSF, train.SalePrice, 'TotalBsmtSF Vs SalePrice', 'TotalBsmtSF', 'SalePrice', 10, 'Cividis')

In [None]:
train.drop(train[train.TotalBsmtSF>3000].index, inplace = True)
train.reset_index(drop = True, inplace = True)
scatter_plot(train.TotalBsmtSF, train.SalePrice, 'TotalBsmtSF Vs SalePrice', 'TotalBsmtSF', 'SalePrice', 10, 'Cividis')

# Handling Missing Values

In [None]:
missing=train.isnull().sum().sort_values(ascending=False)
percent=(train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([missing,percent],axis=1,keys=['missing','percent'])
missing_data.head(20)

In [None]:
train.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature', 'FireplaceQu','LotFrontage'], inplace=True)

In [None]:
train.shape

In [None]:
missing1=test.isnull().sum().sort_values(ascending=False)
percent1=(test.isnull().sum()/test.isnull().count()).sort_values(ascending=False)
# Use the test-set percentages (percent1), not the train-set ones
missing_data1=pd.concat([missing1,percent1],axis=1,keys=['missing1','percent1'])
missing_data1.head(25)

We drop all the columns that have more than 100 missing values.

In [None]:
test.drop(columns=['Alley', 'PoolQC', 'Fence', 'MiscFeature', 'FireplaceQu','LotFrontage'], inplace=True)
test.shape
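
Rather than typing the column names by hand, the list of columns to drop could be derived from the missing-value counts. This is a minimal sketch on a tiny made-up frame; the helper name `columns_with_many_missing` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def columns_with_many_missing(df, max_missing):
    """Return the names of columns with more than `max_missing` null entries."""
    missing_counts = df.isnull().sum()
    return missing_counts[missing_counts > max_missing].index.tolist()

# Tiny hypothetical frame: 'Alley' is mostly empty, 'LotArea' is complete.
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, "Pave", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "Electrical": ["SBrkr", np.nan, "SBrkr", "SBrkr"],
})

to_drop = columns_with_many_missing(toy, max_missing=2)
reduced = toy.drop(columns=to_drop)
print(to_drop, reduced.shape)
```

Computing the list once on the training data and dropping the same names from the test data keeps the two frames aligned.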

In [None]:
# fillna(method='ffill') is deprecated in recent pandas versions; ffill() is the equivalent call
train.ffill(inplace=True)
test.ffill(inplace=True)

In [None]:
df_cat=train.select_dtypes(include=['O'])
df_cat.head()

In [None]:
df_cat.columns

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df1 = df_cat.apply(le.fit_transform) 
df1.head(2)

In [None]:
df_numerical=train.select_dtypes(include=['int64','float64'])
df_numerical.head()

In [None]:
data=pd.concat([df1, df_numerical], axis=1)
data.head()

In [None]:
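corrmat1 = data.corr()
# Select features from the new matrix (corrmat1), which also covers the encoded categorical columns
T1_corr=corrmat1.index[abs(corrmat1['SalePrice'])>0.5]
g1 = sns.heatmap(data[T1_corr].corr(),annot=True,cmap="RdYlGn")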

# Simple Linear Regression

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
# Select the target by name so the split does not depend on column order
x = data.drop(columns=['SalePrice'])
y = data['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=44, shuffle =True)

In [None]:
LinearRegressionModel = LinearRegression()
LinearRegressionModel.fit(X_train, y_train)
print('Linear Regression Train Score is : ' , LinearRegressionModel.score(X_train, y_train))
print('Linear Regression Test Score is : ' , LinearRegressionModel.score(X_test, y_test))
print('Linear Regression Coef is : ' , LinearRegressionModel.coef_)
print('Linear Regression intercept is : ' , LinearRegressionModel.intercept_)

In [None]:
y_pred_linear = LinearRegressionModel.predict(X_test)
print('Y predict: ',y_pred_linear[:5])
print('Y test: ', y_test[:5])
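
The `score` method of scikit-learn regressors returns the coefficient of determination, R². As a rough illustration of what the numbers printed above mean, here is a minimal sketch of that formula; the helper name `r_squared` and the toy values are made up for this example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical prices and predictions, each off by 10
y_true_toy = [100.0, 200.0, 300.0, 400.0]
y_pred_toy = [110.0, 190.0, 310.0, 390.0]
toy_score = r_squared(y_true_toy, y_pred_toy)
print(toy_score)
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and a test score far below the train score suggests overfitting.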

# Ridge Regression

In [None]:
from sklearn.linear_model import Ridge
RidgeRegressionModel = Ridge(alpha=10)
RidgeRegressionModel.fit(X_train, y_train)
print('Ridge Regression Train Score is : ' , RidgeRegressionModel.score(X_train, y_train))
print('Ridge Regression Test Score is : ' , RidgeRegressionModel.score(X_test, y_test))

# Gradient Boosting Regression

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
GBRModel = GradientBoostingRegressor(n_estimators=200,max_depth=4,learning_rate = 0.2 ,random_state=44)
GBRModel.fit(X_train, y_train)
print('GBRModel Train Score is : ' , GBRModel.score(X_train, y_train))
print('GBRModel Test Score is : ' , GBRModel.score(X_test, y_test))

# Random Forest Regression

In [None]:
from sklearn.ensemble import RandomForestRegressor
RandomForestRegressorModel = RandomForestRegressor(max_features='sqrt',bootstrap=False,n_estimators=100,max_depth=10,
                                                   criterion='squared_error',random_state=44)
#FutureWarning: Criterion 'mse' was deprecated in v1.0 and will be removed in version 1.2.
#Use `criterion='squared_error'` which is equivalent.
RandomForestRegressorModel.fit(X_train, y_train)
print('Random Forest Regressor Train Score is : ' , RandomForestRegressorModel.score(X_train, y_train))
print('Random Forest Regressor Test Score is : ' , RandomForestRegressorModel.score(X_test, y_test))

# Test data preprocessing

In [None]:
test_cat=test.select_dtypes(include=['O'])
test_num=test.select_dtypes(include=['int64','float64'])
le = LabelEncoder()
test1 = test_cat.apply(le.fit_transform)
test1.head()


In [None]:
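test=pd.concat([test1, test_num],axis=1)
test = test.drop(labels = ["Id"],axis = 1)
# Put the columns in the same order as the features the models were fitted on
test = test[X_train.columns]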
test.columns
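
One caveat with fitting a separate `LabelEncoder` on the test set is that the same category can receive a different integer than it did in the training set whenever the two sets contain different category values. A possible alternative, sketched here with pandas only, is to derive the codes from the training categories and reuse them for the test column. The helper name `encode_with_train_categories` and the toy values are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

def encode_with_train_categories(train_col, test_col):
    """Encode both columns with integer codes derived from the training categories."""
    categories = sorted(train_col.dropna().unique())
    train_codes = pd.Categorical(train_col, categories=categories).codes
    # Values unseen in training (or missing) receive the code -1.
    test_codes = pd.Categorical(test_col, categories=categories).codes
    return train_codes, test_codes, categories

# Hypothetical zoning values; 'C (all)' appears only in the toy test column
toy_train = pd.Series(["RL", "RM", "FV", "RL"])
toy_test = pd.Series(["RM", "RL", "C (all)"])
train_codes, test_codes, categories = encode_with_train_categories(toy_train, toy_test)
print(categories, list(train_codes), list(test_codes))
```

With this approach 'RL' maps to the same integer in both frames, and unseen values are flagged with -1 so they can be handled explicitly.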

In [None]:
xlr = test.iloc[:]
y_pred_lr = LinearRegressionModel.predict(xlr)

In [None]:
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
submission = test[["Id"]].copy()
submission["SalePrice"] = y_pred_lr
submission.to_csv('my_output_lr.csv', index=False)
submission