<hr/>
# **Predicting House Prices**
<span id="0"></span>
[**Chi Le**](https://www.kaggle.com/anphawolf)
<hr/>
<font color=green>
1. [Overview](#1)
1. [Importing Modules, Reading the Dataset](#2)
1. [Visualizing and Examining Data](#3)
1. [Data Preprocessing](#4)
1. [Neural Network Model](#6)
1. [Conclusion](#7)

# <span id="1"></span> Overview
**Feature Columns**
* id - Unique ID for each home sold
* date - Date of the home sale
* price - Price of each home sold
* bedrooms - Number of bedrooms
* bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
* sqft_living - Square footage of the home's interior living space
* sqft_lot - Square footage of the land space
* floors - Number of floors
* waterfront - A dummy variable indicating whether the home overlooks the waterfront
* view - An index from 0 to 4 of how good the view of the property was
* condition - An index from 1 to 5 on the condition of the home
* grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
* sqft_above - The square footage of the interior housing space that is above ground level
* sqft_basement - The square footage of the interior housing space that is below ground level
* yr_built - The year the house was initially built
* yr_renovated - The year of the house's last renovation
* zipcode - The ZIP code area the house is in
* lat - Latitude
* long - Longitude
* sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
* sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors

# <span id="2"></span> Importing Modules, Reading the Dataset
#### [Return Contents](#0)
<hr/>

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('../input/housesalesprediction/kc_house_data.csv')

In [None]:
df.head()
#df.describe()
#df.info()

# <span id="3"></span> Visualizing and Examining Data
#### [Return Contents](#0)
<hr/>

The dataset is not very large and has relatively few features, so we can plot most of them and draw some useful conclusions. Plotting and examining the data before fitting a model is good practice because it can reveal outliers or suggest that normalization is needed. It is not strictly required, but getting to know the data always helps. I start with histograms of the dataframe's columns.

In [None]:
df1 = df[['price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15']]
h = df1.hist(bins = 25,figsize = (16,16),xlabelsize = 10,ylabelsize = 10,xrot=-15)
sns.despine(left = True,bottom=True)
# Adjust the title size and tick placement on every histogram axis
for ax in h.ravel():
    ax.title.set_size(12)
    ax.yaxis.tick_left()

To compare price against bedrooms, floors, and bathrooms, I prefer boxplots because these features are numerical but discrete rather than continuous: 1, 2, ... bedrooms; 2.5, 3, ... floors (the 0.5 probably stands for a penthouse).

The charts below show that a few houses have feature values or prices far from the rest, such as 33 bedrooms or a price of around 7,000,000. However, determining their possible negative effect would be time-consuming, and real datasets will always contain some outliers, like the luxury house prices in this dataset. That is why I do not plan to remove outliers.
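For readers who want to quantify how many such extreme values exist before deciding to keep them, one common heuristic is the 1.5 × IQR rule that boxplots use for their whiskers. The sketch below is not part of the original analysis: the helper name `iqr_outlier_mask` and the toy series are assumptions made for illustration, and the toy data only mimics the 33-bedroom listing mentioned above.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for entries outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy stand-in for df['bedrooms']; the 33 mimics the extreme listing in the text.
toy_bedrooms = pd.Series([2, 3, 3, 3, 4, 4, 4, 5, 33], name="bedrooms")

mask = iqr_outlier_mask(toy_bedrooms)
flagged = toy_bedrooms[mask].tolist()
print(flagged)
```

On the real data, the same helper could be called as `iqr_outlier_mask(df['price']).sum()` to count flagged rows without dropping them.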

In [None]:
sns.set(style="whitegrid", font_scale=1)

In [None]:
f, axes = plt.subplots(1,2,figsize = (15,5))
sns.boxplot(x = 'bedrooms',y = 'price',data = df,ax = axes[0]);
sns.boxplot(x = 'floors', y = 'price',data = df,ax = axes[1]);
sns.despine(left=True, bottom=True)
axes[0].set(xlabel = 'Bedrooms',ylabel = 'Price')
axes[0].yaxis.tick_left()

axes[1].set(xlabel = 'Floors',ylabel = 'Price')
axes[1].yaxis.set_label_position('right')
axes[1].yaxis.tick_right()

f, axe = plt.subplots(1,1,figsize = (15,5))
sns.boxplot(x = 'bathrooms' , y = 'price',data = df,ax = axe);
axe.set(xlabel = 'Bathrooms', ylabel = 'Price');

Let's visualize more features. In the boxplots below, grade and waterfront visibly affect price. View seems to have a smaller effect, but it still affects price.

In [None]:
f, axes = plt.subplots(1, 2,figsize=(15,5))
sns.boxplot(x=df['waterfront'],y=df['price'], ax=axes[0])
sns.boxplot(x=df['view'],y=df['price'], ax=axes[1])
sns.despine(left=True, bottom=True)
axes[0].set(xlabel='Waterfront', ylabel='Price')
axes[0].yaxis.tick_left()
axes[1].yaxis.set_label_position("right")
axes[1].yaxis.tick_right()
axes[1].set(xlabel='View', ylabel='Price')

f, axe = plt.subplots(1, 1,figsize=(12.18,5))
sns.boxplot(x=df['grade'],y=df['price'], ax=axe)
sns.despine(left=True, bottom=True)
axe.yaxis.tick_left()
axe.set(xlabel='Grade', ylabel='Price');

## <span id="7"></span> Checking Out the Correlation Among Explanatory Variables

Having too many features in a model is not always a good thing, because it can cause overfitting and worse results when predicting values for a new dataset. Thus, if a feature does not improve the model much, leaving it out may be the better choice.
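One simple way to act on this idea is to keep only features whose absolute correlation with the target exceeds some cutoff. The following is a minimal sketch on synthetic data, not the method used in this notebook: the column `noise`, the generated values, and the `threshold` of 0.3 are all hypothetical choices for illustration. Note that linear correlation can miss non-linear relationships, so a filter like this is only a first pass.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: one informative feature and one pure-noise feature.
sqft = rng.normal(2000, 500, n)
noise = rng.normal(0, 1, n)
price = 200 * sqft + rng.normal(0, 20000, n)

toy = pd.DataFrame({"sqft_living": sqft, "noise": noise, "price": price})

# Correlation of every feature with the target, excluding the target itself.
corr_with_price = toy.corr()["price"].drop("price")

threshold = 0.3  # hypothetical cutoff, chosen only for this illustration
selected = corr_with_price[corr_with_price.abs() >= threshold].index.tolist()
print(selected)
```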

In [None]:
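# numeric_only=True skips non-numeric columns such as the raw 'date' strings (requires pandas >= 1.5)
df.corr(numeric_only=True)['price'].sort_values()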

In [None]:
plt.figure(figsize = (10,6))
sns.scatterplot(x = 'sqft_living',y = 'price', data = df);

# <span id="4"></span> Data Preprocessing
#### [Return Contents](#0)
<hr/>

In [None]:
df.head()

In [None]:
df = df.drop('id',axis = 1)

In [None]:
df['date'] = pd.to_datetime(df['date'])

In [None]:
# Extract the sale year and month with the vectorized .dt accessor
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

In [None]:
# Average sale price per month; select 'price' before aggregating
df.groupby('month')['price'].mean().plot();

In [None]:
df = df.drop('date',axis = 1)

In [None]:
#df = df.drop('zipcode',axis =1)

In [None]:
df['yr_renovated'].value_counts()

In [None]:
df['sqft_basement'].value_counts()

# <span id="6"></span> Neural Network Model
#### [Return Contents](#0)
<hr/>

In [None]:
X = df.drop('price', axis = 1).values
y = df['price'].values

In [None]:
X.shape,y.shape

In [None]:
# Hold out 20% of the rows for testing and train on the remaining 80%
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2, random_state = 42)

In [None]:
mm_scaler = MinMaxScaler()
X_train = mm_scaler.fit_transform(X_train)
X_test = mm_scaler.transform(X_test)
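The scaler above is fit on the training split only and then reused on the test split, which keeps test-set statistics from leaking into training. The toy sketch below (the arrays `toy_train` and `toy_test` are made up for illustration) shows what that means in practice: test values are scaled with the training minimum and maximum, so they can fall outside the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature splits standing in for a column such as sqft_living.
toy_train = np.array([[1000.0], [2000.0], [3000.0]])
toy_test = np.array([[1500.0], [4000.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(toy_train)  # learns min=1000, max=3000
test_scaled = scaler.transform(toy_test)        # reuses the training min/max

# The same transformation written out by hand: (x - min) / (max - min).
train_min = toy_train.min(axis=0)
train_max = toy_train.max(axis=0)
manual = (toy_test - train_min) / (train_max - train_min)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The second test value is larger than anything seen in training, so its scaled value exceeds 1. That is expected with the default settings and is not an error.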

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers

In [None]:
model = Sequential()

model.add(Dense(units = 6,activation = 'relu',input_dim = X.shape[1]))
model.add(Dense(units = 6,activation = 'relu'))
model.add(Dense(units = 6,activation = 'relu'))
model.add(Dense(units = 6,activation = 'relu'))

model.add(Dense(units = 1,activation = 'linear'))

adam = optimizers.Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(optimizer = adam,loss = 'mse')

In [None]:
model.fit(x = X_train, y = y_train,epochs = 1000,validation_data = (X_test,y_test), batch_size = 128,verbose = 1)

In [None]:
losses = pd.DataFrame(model.history.history)

In [None]:
losses.plot()

In [None]:
y_pred = model.predict(X_test)

In [None]:
np.sqrt(metrics.mean_squared_error(y_test,y_pred))

In [None]:
metrics.mean_absolute_error(y_test,y_pred)
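The two metrics above weigh errors differently: RMSE squares each error before averaging, so a single badly mispredicted luxury house inflates it much more than it inflates MAE. A small worked sketch with made-up prices (the arrays `y_true` and `y_hat` are illustrative and are not results from this model):

```python
import numpy as np

# Made-up prices: three small misses and one large miss on an expensive house.
y_true = np.array([300000.0, 450000.0, 500000.0, 1200000.0])
y_hat = np.array([310000.0, 440000.0, 520000.0, 900000.0])

errors = y_true - y_hat

rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error
mae = np.mean(np.abs(errors))         # mean absolute error

print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
```

A large gap between RMSE and MAE, as in this toy case, suggests that a few large errors dominate. Comparing both numbers with the mean of `df['price']` helps put their magnitude in context.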

In [None]:
df['price'].describe()

In [None]:
plt.figure(figsize = (12,8))
plt.scatter(y_test, y_pred)
plt.plot(y_test, y_test,'r')