In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Data Wrangling

In [None]:
df = pd.read_csv("../input/housesalesprediction/kc_house_data.csv")
df.head()

In [None]:
df.describe()

Null values are checked first, because Multiple Regression cannot be fitted on data that contains missing values. This dataset turned out to have none. Two columns, ‘id’ and ‘date’, are then removed from the DataFrame: ‘id’ is only a record identifier, and ‘date’ (the sale date) is not used as a feature in this analysis.

In [None]:
df.isna().sum()

In [None]:
df.info()

In [None]:
df1 = df.drop(['id','date'], axis=1)
df1.head()

# Correlation:
The heatmap shows the correlation of each variable with ‘price’. Every variable except ‘zipcode’ is positively correlated with the target, although the strength varies. ‘zipcode’ has a negative correlation of -0.05, which is very close to 0. ‘sqft_living’, ‘grade’ and ‘bathrooms’ have a strong positive correlation with the target variable ‘price’.


In [None]:
corr = df1.corr()
plt.figure(figsize=(25,15))
sns.heatmap(corr, annot=True)
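
With this many columns the heatmap can be hard to read, so it may help to list the correlations with the target directly. The snippet below is a minimal sketch on a small synthetic frame; `toy_df` and its values are hypothetical stand-ins, and in this notebook the same calls would presumably be applied to `df1`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df1 (hypothetical values, not the King County data).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({"sqft_living": rng.normal(2000, 500, 200)})
toy_df["price"] = 250 * toy_df["sqft_living"] + rng.normal(0, 50_000, 200)
toy_df["zipcode"] = rng.integers(98001, 98200, 200)

# Correlation of every feature with the target, ordered by absolute strength.
price_corr = (
    toy_df.corr()["price"]
    .drop("price")
    .sort_values(key=np.abs, ascending=False)
)
print(price_corr)
```

Sorting by absolute value keeps strongly negative features near the top instead of pushing them to the bottom of the list.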

# Splitting data into train and test
We used train_test_split from the sklearn library to split the data into 75% for the train set and 25% for the test set, which gives x_train, x_test, y_train and y_test. The random state for the split is 3.


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x = df1.drop(['price'], axis=1)
y = df1['price']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=3)

# Visualization:

**1st Plot:** Shows the bedroom count. Most properties have 3 or 4 bedrooms.

In [None]:
plt.subplots(figsize=(7, 5))
sns.countplot(x=df1["bedrooms"])   # pass the column as x; newer seaborn versions do not accept it positionally
plt.show()

**2nd Plot:** Shows the bathroom count. The most common values are 2.5, 1 and 1.75 bathrooms.

In [None]:
plt.subplots(figsize=(15, 5))
sns.countplot(x=df1["bathrooms"])   # pass the column as x; newer seaborn versions do not accept it positionally
plt.show()

**3rd Plot:** Shows properties with and without a waterfront. The vast majority of houses do not have a waterfront, and only a few have this feature.

In [None]:
sns.countplot(x=df1["waterfront"])   # pass the column as x; newer seaborn versions do not accept it positionally
plt.show()

**4th Plot:** Shows the number of floors per property. Most properties have 1 or 2 floors.

In [None]:
sns.countplot(x=df1["floors"])   # pass the column as x; newer seaborn versions do not accept it positionally
plt.show()
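
The readings taken from the count plots can be checked numerically with `value_counts`. This is a small sketch on a made-up Series; the values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy bedroom counts (hypothetical values, not the real column).
bedrooms = pd.Series([3, 4, 3, 2, 3, 4, 5, 3, 4, 2], name="bedrooms")

counts = bedrooms.value_counts()                 # sorted by frequency, descending
shares = bedrooms.value_counts(normalize=True)   # same ordering, as fractions of all rows
print(counts.head(3))
print(shares.round(2).head(3))
```

In the notebook, the same two calls on a column such as `df1["bedrooms"]` would give the exact counts behind each bar.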

# **Machine Learning models:**

4 Machine Learning models are used:

## 1. Multiple Linear Regression:

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn import metrics

In [None]:
lm = LinearRegression()
lm.fit(x_train,y_train)            # Fitting model with x_train and y_train
lm_pred = lm.predict(x_test)       # Predicting the results
# mean_squared_error returns the MSE by default, so a single sqrt gives the RMSE
print('RMSE:', np.sqrt(mean_squared_error(y_test, lm_pred)))
print('r2 score: %.2f' % r2_score(y_test, lm_pred))
# score() of a regressor returns R^2, not a classification accuracy
print("Accuracy (R2 from score):",lm.score(x_test, y_test))

In [None]:
labels = {'True Labels': y_test, 'Predicted Labels': lm_pred}
df_lm = pd.DataFrame(data = labels)
sns.lmplot(x='True Labels', y= 'Predicted Labels', data = df_lm)

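Multiple Linear Regression was the first model applied to this dataset, and it produced an average result:
* RMSE: 444.30 (as printed; this is the square root of the RMSE, because `np.sqrt` was applied to a value that `squared=False` had already rooted, so the RMSE itself is about 444.30² ≈ 197,400)
* R2 Score: 0.71
* Accuracy: 70.78 % (the value returned by `score`, which for a regressor is the R2 score, expressed as a percentage)

The lmplot for this multiple linear regression model shows a straight line of predicted against true values, but the line is not very close to the 45-degree line that perfect predictions would follow.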
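
To make the two metrics concrete, the sketch below computes RMSE and R2 by hand with NumPy on four made-up prices (the numbers are hypothetical). It also shows how much smaller the value becomes if a square root is applied to an RMSE a second time.

```python
import numpy as np

# Toy true prices and predictions (hypothetical numbers).
y_true = np.array([300_000.0, 450_000.0, 520_000.0, 610_000.0])
y_hat = np.array([320_000.0, 430_000.0, 500_000.0, 650_000.0])

mse = np.mean((y_true - y_hat) ** 2)   # mean squared error, in price units squared
rmse = np.sqrt(mse)                    # root mean squared error, in price units
double_root = np.sqrt(rmse)            # what a second sqrt on top of an RMSE yields

# R^2 = 1 - SS_res / SS_tot; this is the quantity a regressor's score() returns.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:,.0f}  sqrt(RMSE): {double_root:.2f}  R2: {r2:.3f}")
```

An RMSE is in the same units as the target, so for house prices it should be of the same order of magnitude as typical prediction errors, not a few hundred.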


## 2. Decision Tree:

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV

### Unpruned Tree

In [None]:
dtree_up = DecisionTreeRegressor()
dtree_up.fit(x_train, y_train)               # Fitting model with x_train and y_train
dtree_pred_up = dtree_up.predict(x_test)     # Predicting the results
# mean_squared_error returns the MSE by default, so a single sqrt gives the RMSE
print('RMSE:', np.sqrt(mean_squared_error(y_test, dtree_pred_up)))
print('r2 score: %.2f' % r2_score(y_test, dtree_pred_up))
# score() of a regressor returns R^2, not a classification accuracy
print("Accuracy (R2 from score):",dtree_up.score(x_test, y_test))

### HyperParameter Tuned Decision Tree Regressor:

In [None]:
d = np.arange(1, 21, 1)

dtree = DecisionTreeRegressor(random_state=5)
hyperParam = [{'max_depth':d}]

gsv = GridSearchCV(dtree,hyperParam,cv=5,verbose=1)
best_model = gsv.fit(x_train, y_train)                          # Fitting the grid search with x_train and y_train
dtree_pred_mms = best_model.best_estimator_.predict(x_test)     # Predicting the results

print("Best HyperParameter: ",gsv.best_params_)

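# mean_squared_error returns the MSE by default, so a single sqrt gives the RMSE
print('RMSE:', np.sqrt(mean_squared_error(y_test, dtree_pred_mms)))
print('r2 score: %.2f' % r2_score(y_test, dtree_pred_mms))
# score() here is the R^2 of the refit best tree on the test set
print("Accuracy (R2 from score):",best_model.score(x_test, y_test))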

In [None]:
labels = {'True Labels': y_test, 'Predicted Labels': dtree_pred_mms}
df_lm = pd.DataFrame(data = labels)
sns.lmplot(x='True Labels', y= 'Predicted Labels', data = df_lm)

Next, we used a Decision Tree in two variants: a simple unpruned tree and a regressor tuned over several max_depth values. The RMSE figures below are as printed, that is, the square root of the RMSE (`np.sqrt` was applied to a value that `squared=False` had already rooted); squaring them gives the RMSE itself. Accuracy is the R2 returned by `score`, as a percentage. Results are:
1. Decision Tree (Unpruned):
* RMSE: 422.72 (RMSE ≈ 178,700)
* R2 Score: 0.76
* Accuracy: 76.05 % 
2. Decision Tree (Pruned): depth-limited, with max_depth searched over the range 1 to 20 using sklearn’s GridSearchCV.
* Max_depth: 11
* RMSE: 406.80 (RMSE ≈ 165,500)
* R2 Score: 0.79
* Accuracy: 79.46 %

The lmplot shows a straight line that is closer to 45 degrees than the one for the Multiple Linear Regression model, which indicates a better fit.
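
The depth search described above can be sketched in isolation as follows. The data comes from `make_regression` and is only a synthetic stand-in for the housing features (an assumption for illustration); the names `X_toy`, `y_toy` and `search` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features (assumption).
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=3)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=5),
    {"max_depth": np.arange(1, 11)},
    cv=5,
)
search.fit(X_tr, y_tr)

# With refit=True (the default), best_estimator_ is already retrained on all of X_tr.
best_depth = search.best_params_["max_depth"]
test_r2 = search.score(X_te, y_te)   # R^2 of the refit best tree on held-out data
print("best max_depth:", best_depth, "test R2:", round(test_r2, 3))
```

Because the best tree is refit automatically, calling `score` or `predict` on the fitted search object is enough; a separate retraining step is optional.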


## 3. Random Forest: 

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

### Simple Random Forest

In [None]:
rf = RandomForestRegressor()
rf.fit(x_train, y_train)             # Fitting model with x_train and y_train
rf_pred = rf.predict(x_test)         # Predicting the results
# mean_squared_error returns the MSE by default, so a single sqrt gives the RMSE
print('RMSE:', np.sqrt(mean_squared_error(y_test, rf_pred)))
print('r2 score: %.2f' % r2_score(y_test, rf_pred))
# score() of a regressor returns R^2, not a classification accuracy
print("Accuracy (R2 from score):",rf.score(x_test, y_test))

### HyperParameter Tuned Random Forest Regressor:

In [None]:
nEstimator = [140,160,180,200,220]
depth = [10,15,20,25,30]

RF = RandomForestRegressor()
hyperParam = [{'n_estimators':nEstimator,'max_depth': depth}]

gsv = GridSearchCV(RF,hyperParam,cv=5,verbose=1,scoring='r2',n_jobs=-1)
gsv.fit(x_train, y_train)

print("Best HyperParameter: ",gsv.best_params_)
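# Grid points are generated with parameter names sorted, so max_depth is the outer loop: rows are depths, columns are n_estimators
scores = gsv.cv_results_['mean_test_score'].reshape(len(depth),len(nEstimator))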
maxDepth=gsv.best_params_['max_depth']
nEstimators=gsv.best_params_['n_estimators']

model = RandomForestRegressor(n_estimators = nEstimators,max_depth=maxDepth)
model.fit(x_train, y_train)        # Fitting model with x_train and y_train

# Predicting the results:
rf_pred_tune = model.predict(x_test)
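# mean_squared_error returns the MSE by default, so a single sqrt gives the RMSE
print('RMSE:', np.sqrt(mean_squared_error(y_test, rf_pred_tune)))
print('r2 score: %.2f' % r2_score(y_test, rf_pred_tune))
# score() of a regressor returns R^2, not a classification accuracy
print("Accuracy (R2 from score):",model.score(x_test, y_test))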

In [None]:
labels = {'True Labels': y_test, 'Predicted Labels': rf_pred_tune}
df_lm = pd.DataFrame(data = labels)
sns.lmplot(x='True Labels', y= 'Predicted Labels', data = df_lm)

We used Random Forest in two variants: a Random Forest with default settings and a hyperparameter-tuned Random Forest. Tuning uses GridSearchCV from sklearn over the parameters ‘n_estimators’ and ‘max_depth’; every combination is evaluated and the best one is kept. The RMSE figures below are as printed, that is, the square root of the RMSE (`np.sqrt` was applied to a value that `squared=False` had already rooted); squaring them gives the RMSE itself. Accuracy is the R2 returned by `score`, as a percentage. Results are:
1. Random Forest (Simple):
* RMSE: 351.26 (RMSE ≈ 123,400)
* R2 Score: 0.89
* Accuracy: 88.58 % 
2. Random Forest (Tuned): n_estimators = [140,160,180,200,220] and max_depth = [10,15,20,25,30]
* Best n_estimators: 180
* Best max_depth: 30
* RMSE: 351.30 (RMSE ≈ 123,400)
* R2 Score: 0.89
* Accuracy: 88.58 %

The lmplot shows a straight line that is close to 45 degrees. Random Forest fits this dataset well, although the tuned model scores the same as the simple one, so the grid search brought no measurable gain here.
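
To see how each combination of ‘n_estimators’ and ‘max_depth’ scored, the cross-validation results can be laid out as a table. The sketch below uses a deliberately tiny grid and synthetic data so that it runs quickly; the grid values and the names `X_toy`, `y_toy` and `score_table` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid, chosen only to keep the example fast (assumption).
X_toy, y_toy = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)
param_grid = {"n_estimators": [10, 20], "max_depth": [2, 4, 6]}

search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3, scoring="r2")
search.fit(X_toy, y_toy)

# One row per max_depth, one column per n_estimators, cells hold the mean CV R^2.
results = pd.DataFrame(search.cv_results_)
score_table = results.pivot(
    index="param_max_depth", columns="param_n_estimators", values="mean_test_score"
)
print(score_table)
```

Pivoting on the parameter columns avoids having to know the order in which the grid points were generated, which a manual `reshape` depends on.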

## 4. StatsModel OLS:

In [None]:
import statsmodels.api as sm

In [None]:
x1 = sm.add_constant(x)
# results holds the fitted Ordinary Least Squares (OLS) model; fit() estimates the coefficients on the full dataset.
results = sm.OLS(y,x1).fit() 
results.summary()

In [None]:
print('R2: ', results.rsquared)

In [None]:
# Removing floors from the Independent Variables because P > 0.05
x2 = x.drop(['floors'], axis=1)

In [None]:
x3 = sm.add_constant(x2)
# Fit OLS on x3 (with the constant). Passing x2 here would drop the intercept and make statsmodels report an uncentered R2.
results1 = sm.OLS(y,x3).fit() 
results1.summary()

In [None]:
print('R2: ', results1.rsquared)

StatsModel OLS is the last model used to predict ‘price’. The basic model is fitted first, and from the summary (Fig. 23) we can observe the P values of all independent variables. Only ‘floors’ has P > 0.05 (0.063), so it is removed and the model is fitted again.
1. StatsModel OLS:
* Accuracy = 70 %
2. StatsModel OLS after removing ‘floors’ (P>0.05):
* Accuracy = 90.50 %

Both figures are R2 values computed on the full dataset, not on the held-out test set. The 90.50 % result came from a fit in which the constant term was not passed to the model. In that case statsmodels reports an uncentered R2, which is not comparable with the 70 % of the first model or with the scores of the other models. With the constant included, removing a variable cannot raise R2, so the jump should not be read as an improvement in fit. Also, the P value of the F-statistic, Prob (F-statistic), is very small and close to 0.
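
The effect of the constant term on the reported R2 can be reproduced with plain NumPy. This is a minimal sketch on synthetic data (all numbers are hypothetical): the features have non-zero means, as most housing features do, and the target has a large mean, like a price. The centered R2 is what an OLS summary reports when a constant is present; the uncentered R2 is what it reports when the constant is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Features with non-zero means and a target with a large mean (hypothetical numbers).
x_toy = rng.normal(loc=10.0, scale=1.0, size=(n, 2))
y_toy = 500.0 + 20.0 * x_toy[:, 0] + rng.normal(scale=60.0, size=n)

def fit_residuals(design, target):
    """Least-squares fit; returns the residual vector."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

# With an intercept column: centered R^2 = 1 - SS_res / sum((y - mean(y))^2).
with_const = np.column_stack([np.ones(n), x_toy])
res_c = fit_residuals(with_const, y_toy)
r2_centered = 1 - (res_c @ res_c) / np.sum((y_toy - y_toy.mean()) ** 2)

# Without an intercept: uncentered R^2 = 1 - SS_res / sum(y^2).
res_u = fit_residuals(x_toy, y_toy)
r2_uncentered = 1 - (res_u @ res_u) / np.sum(y_toy ** 2)

print(f"centered: {r2_centered:.3f}  uncentered (no constant): {r2_uncentered:.3f}")
```

The uncentered value is much higher even though the no-constant model fits the data no better, because its denominator measures variation around zero rather than around the mean.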

# **Conclusion:**
This analysis used the House Sales in King County, USA dataset to predict ‘price’. Two variables (‘id’ and ‘date’) were removed during data cleaning, and every remaining variable except ‘zipcode’ was positively correlated with the target. Four machine learning models were applied. Multiple Linear Regression produced an average result with an accuracy (R2) of 70.78 %, and the hyperparameter-tuned Decision Tree reached about 79.46 %. Random Forest worked well: for both the simple and the hyperparameter-tuned model, accuracy came out at 88.58 %. With StatsModel OLS, the ‘floors’ variable had a P value > 0.05, so it was removed, and the refitted model reported 90.50 %. That figure is an uncentered R2 from a fit without the constant term, computed on the full dataset, so it is not comparable with the test-set scores of the other models. On the held-out test set, Random Forest is the best-performing model for this dataset.