# Prediction of house prices in King County

We will analyse the data and try to predict the prices of the houses with different regression models.

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

## Loading the data

In [None]:
df = pd.read_csv("../input/kc_house_data.csv")
df = df.sample(frac=1)
df.head()

In [None]:
df["price"].mean()

**The mean price is around 550 thousand dollars.**

Let's ignore the id and the zipcode for now.

In [None]:
df=df.drop(["id","zipcode"], axis = 1)

In [None]:
df.info()

**The dataset contains 21613 entries, 19 columns and no null values**

We don't need to apply any one-hot encoding to the data: the only categorical column is "waterfront", which is already encoded as 0/1.

Let's change the date into something more usable: a timestamp.

In [None]:
import time
import datetime

def transfo(date):
    # Keep the leading YYYYMMDD part of the raw string and ignore the rest
    parsed = datetime.datetime.strptime(date[:8], "%Y%m%d")
    # Seconds since the epoch, interpreted in the local timezone
    return time.mktime(parsed.timetuple())

In [None]:
df["date"]=df["date"].apply(transfo)

In [None]:
df["date"].head()
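As an alternative, pandas can parse these strings directly. The sketch below assumes the raw dates look like `20141013T000000` (the layout implied by the slicing in `transfo`); the two sample values are made up for illustration. Unlike `time.mktime`, which interprets the date in the local timezone, this version does not depend on the machine's timezone settings.

```python
import pandas as pd

# Hypothetical sample values in the assumed "YYYYMMDDTHHMMSS" layout
sample = pd.Series(["20141013T000000", "20150225T000000"])

# Parse the strings into timezone-naive datetimes
parsed = pd.to_datetime(sample, format="%Y%m%dT%H%M%S")

# Whole seconds elapsed since 1970-01-01
timestamps = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
print(timestamps.tolist())
```

Either representation works for the models below, since only the ordering and spacing of the dates matter.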

In [None]:
%matplotlib inline
df.hist(bins=20, figsize=(20,15))
plt.show()

## Creating the training and testing datasets

**It is important to create a training dataset representative of the complete dataset**

We will use stratified sampling so that every price category is represented in the same proportion in the training and test sets.

In [None]:
df["price"].hist(bins=100, figsize=(10,5))
plt.show()

In [None]:
df["price_category"] = pd.cut(df["price"],
                            bins=list(range(0,2000001,100000))+[np.inf],
                            labels=list(range(21)))

In [None]:
df["price_category"].hist(figsize=(10,5))
plt.show()

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
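for train_index, test_index in split.split(df, df["price_category"]):
    # split() yields row positions, and df was shuffled without resetting its index,
    # so rows must be selected by position (.iloc) rather than by label (.loc)
    train_set = df.iloc[train_index]
    test_set = df.iloc[test_index]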

In [None]:
test_set["price_category"].value_counts() / len(test_set)

In [None]:
df["price_category"].value_counts() / len(df)

The sets are well split with respect to the price of the houses.
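To make the idea concrete, here is a minimal sketch of a stratified split on synthetic data. The prices are randomly generated, and the bands come from `pd.qcut` (equal-sized quantile bands), which is a simplification of the fixed-width bins used above. `split()` returns row positions, so the sketch selects rows with `.iloc`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)

# Synthetic, right-skewed "prices" standing in for the real column
toy = pd.DataFrame({"price": rng.lognormal(mean=13, sigma=0.5, size=2000)})
# Shuffle so that index labels no longer match row positions
toy = toy.sample(frac=1, random_state=0)

# Coarse price bands used only for stratification
toy["band"] = pd.qcut(toy["price"], q=5, labels=False)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
train_pos, test_pos = next(splitter.split(toy, toy["band"]))

# Positional selection, because split() works with row positions
toy_train = toy.iloc[train_pos]
toy_test = toy.iloc[test_pos]

# Share of each band in the full data and in the test set
full_share = toy["band"].value_counts(normalize=True).sort_index()
test_share = toy_test["band"].value_counts(normalize=True).sort_index()
max_gap = (full_share - test_share).abs().max()
print(max_gap)
```

A small `max_gap` means the test set reproduces the band proportions of the full data.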

We can now remove the "price_category" column.

In [None]:
train_set=train_set.drop("price_category", axis = 1)
test_set=test_set.drop("price_category", axis = 1)

In [None]:
train_set.head()

## Geographical position of the data

In [None]:
prices = train_set.copy()

In [None]:
prices.plot(kind="scatter", x="long", y="lat", alpha = 0.1)

**To have an idea of the correlation between prices and geographical position, we need to calculate the average price by zone.**

In [None]:
# We round those values in order to identify zones
prices["new_lat"]=round(prices["lat"],2)
prices["new_long"]=round(prices["long"],2)

In [None]:
# This is the indicator of a zone, each value for this column corresponds to a certain geographical zone
prices["zone"]=prices["new_lat"]*1000+prices["new_long"]

In [None]:
# The only columns we need for this representation
# (.copy() avoids pandas' SettingWithCopyWarning when a column is added below)
prices=prices[["zone","price", "lat", "long"]].copy()
# This column will be used to count the number of houses in a zone
prices["number"]=1

In [None]:
# Will contain the average price for each zone
df=prices.groupby('zone').mean()
# Will contain the number of houses for each zone
df1=prices.groupby('zone').sum()
df["number"]=df1["number"]
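The same per-zone summary can also be computed in a single `groupby` call. The sketch below uses a made-up four-row sample and groups on the two rounded coordinates directly instead of building a combined key; the column names `lat_bin` and `long_bin` are hypothetical.

```python
import pandas as pd

# Hypothetical mini-sample: three sales close together, one further north
toy = pd.DataFrame({
    "lat":   [47.511, 47.512, 47.509, 47.721],
    "long":  [-122.257, -122.262, -122.258, -122.319],
    "price": [200000.0, 300000.0, 400000.0, 500000.0],
})

# Rounding to two decimals groups nearby houses into the same zone
toy["lat_bin"] = toy["lat"].round(2)
toy["long_bin"] = toy["long"].round(2)

# Named aggregation: mean price and position per zone, plus the number of sales
zones = toy.groupby(["lat_bin", "long_bin"]).agg(
    price=("price", "mean"),
    lat=("lat", "mean"),
    long=("long", "mean"),
    number=("price", "size"),
)
print(zones)
```

Using `size` for the count removes the need for a helper column filled with ones.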

In [None]:
df.plot(kind="scatter", x="long", y="lat", alpha=0.5,
        s=df["number"]*2, label="population",
        figsize=(15,10),c="price", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
plt.savefig("prices",format="png",dpi=300)

With this representation, the size of the bubble is proportional to the number of house sales in the zone.

**This representation is not very informative: a few very expensive zones stretch the colour scale, so most zones end up in nearly the same colour.**

Let's use a logarithmic scale.

In [None]:
# Vectorised natural logarithm of the average price per zone
df["log_price"]=np.log(df["price"])

In [None]:
df.plot(kind="scatter", x="long", y="lat", alpha=0.5,
        s=df["number"]*2, label="population",
        figsize=(15,10),c="log_price", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
plt.savefig("log_prices",format="png",dpi=300)

The colour bar no longer shows the prices themselves, only their logarithm, but the relationship between the position and the price is now easy to see.
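To illustrate why the logarithm helps, the sketch below places four made-up prices on a colour scale normalised to [0, 1], first linearly and then after taking the logarithm. The values are hypothetical and chosen only to include one extreme outlier.

```python
import numpy as np

# Hypothetical average prices, with one very expensive zone
sample_prices = np.array([150_000.0, 450_000.0, 600_000.0, 7_000_000.0])

# Position of each price on a linear colour scale from 0 to 1
linear_pos = (sample_prices - sample_prices.min()) / (sample_prices.max() - sample_prices.min())

# Position of each price on a logarithmic colour scale from 0 to 1
log_prices = np.log(sample_prices)
log_pos = (log_prices - log_prices.min()) / (log_prices.max() - log_prices.min())

print(np.round(linear_pos, 3))
print(np.round(log_pos, 3))
```

On the linear scale the three ordinary prices are squeezed into the bottom tenth of the range, while on the logarithmic scale they are spread over a much larger part of it.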

## Preprocessing

In [None]:
corr_matrix = train_set.corr()
corr_matrix["price"].sort_values(ascending=False)

We can see that the price is strongly correlated with the square footage of the home.
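Keep in mind that `DataFrame.corr` computes the Pearson coefficient by default, which only measures linear association. A minimal sketch on made-up data shows that a feature can determine the target completely and still have a correlation close to zero:

```python
import numpy as np
import pandas as pd

# Symmetric grid of x values, so the quadratic term has no linear trend
x = np.linspace(-1, 1, 201)
toy = pd.DataFrame({
    "x": x,
    "linear": 3 * x + 1,   # perfectly linear in x
    "quadratic": x ** 2,   # fully determined by x, but not linearly
})

corr_with_x = toy.corr()["x"]
print(corr_with_x)
```

A low coefficient in the table above therefore does not prove that a feature is useless, only that its linear relationship with the price is weak.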

In [None]:
houses_data = train_set.drop("price", axis=1) # drop labels for training set
houses_labels = train_set["price"].copy()

In [None]:
test_data = test_set.drop("price", axis=1)
test_labels = test_set["price"].copy()

In [None]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
houses_prepared = std_scaler.fit_transform(houses_data)

In [None]:
houses_prepared.shape
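As a quick illustration of what `StandardScaler` does, the sketch below scales two synthetic features that live on very different scales. The feature values are randomly generated; only the pattern (fit on the training data, reuse the same statistics on new data) reflects what is done here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on different scales (think square feet vs. room counts)
X_train = rng.normal(loc=[2000.0, 3.0], scale=[800.0, 1.0], size=(500, 2))
X_test = rng.normal(loc=[2000.0, 3.0], scale=[800.0, 1.0], size=(100, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the training data
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

The scaled training columns have mean 0 and standard deviation 1; the test columns are only approximately standardised, because they are transformed with the training statistics.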

**The preprocessing is complete, we can now train a model**

## Testing some regression models

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(houses_prepared, houses_labels)

In [None]:
from sklearn.metrics import mean_squared_error

houses_predictions = lin_reg.predict(houses_prepared)
lin_mse = mean_squared_error(houses_labels, houses_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

**The RMSE is quite high even on the training data, so the model underfits. As the correlations computed earlier suggest, the price is not linearly related to the other features strongly enough for a linear model to fit well.**
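For reference, the RMSE is the square root of the mean squared difference between the true and predicted prices, so it is expressed in dollars. The sketch below checks the definition against `mean_squared_error` on three made-up houses.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted prices, in dollars
y_true = np.array([300_000.0, 450_000.0, 600_000.0])
y_pred = np.array([320_000.0, 400_000.0, 650_000.0])

# RMSE written out from its definition
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity through scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, a few badly mispriced houses weigh heavily on this metric.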

### Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=1, max_depth = 12)
tree_reg.fit(houses_prepared, houses_labels)

In [None]:
houses_predictions = tree_reg.predict(houses_prepared)
tree_mse = mean_squared_error(houses_labels, houses_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

In [None]:
from sklearn.model_selection import cross_val_score

tree_scores = cross_val_score(tree_reg, houses_prepared, houses_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-tree_scores)

print("Scores : ",tree_rmse_scores)
print("Mean : ",tree_rmse_scores.mean())
print("Standard deviation : ",tree_rmse_scores.std())

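**The RMSE on the training data is much better, but the model tends to overfit the training set: compare it with the cross-validation RMSE above, which is measured on folds the tree has not seen during fitting.**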
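A minimal sketch of this effect on synthetic data: a tree with no depth limit reproduces its training set almost perfectly, yet its cross-validated error stays close to the noise level. The data and the sine-shaped target are invented for the illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Noisy synthetic target: a smooth signal plus Gaussian noise
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)

# No depth limit, so the tree can memorise every training point
tree = DecisionTreeRegressor(random_state=1)
tree.fit(X, y)
train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))

# Error measured on folds that were held out during fitting
scores = cross_val_score(tree, X, y, scoring="neg_mean_squared_error", cv=10)
cv_rmse = np.sqrt(-scores).mean()

print(train_rmse, cv_rmse)
```

A large gap between the two numbers is the usual sign of overfitting; limiting `max_depth`, as done above, is one way to reduce it.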

### SVM

In [None]:
from sklearn.svm import SVR

svm_reg = SVR(gamma='scale')
svm_reg.fit(houses_prepared, houses_labels)

In [None]:
svm_scores = cross_val_score(svm_reg, houses_prepared, houses_labels,
                             scoring="neg_mean_squared_error", cv=10)

svm_rmse_scores = np.sqrt(-svm_scores)

print("Scores : ",svm_rmse_scores)
print("Mean : ",svm_rmse_scores.mean())
print("Standard deviation : ",svm_rmse_scores.std())

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=50, random_state=1)
forest_reg.fit(houses_prepared, houses_labels)

In [None]:
houses_predictions = forest_reg.predict(houses_prepared)
forest_mse = mean_squared_error(houses_labels, houses_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
forest_scores = cross_val_score(forest_reg, houses_prepared, houses_labels,
                                scoring="neg_mean_squared_error", cv=10)

forest_rmse_scores = np.sqrt(-forest_scores)

print("Scores : ",forest_rmse_scores)
print("Mean : ",forest_rmse_scores.mean())
print("Standard deviation : ",forest_rmse_scores.std())

**The RMSE is much better with this model**

Let's search for the best hyperparameters for this model.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [{"max_depth" : [10,15,20,30],
               "n_estimators" : [10, 50, 100]}]

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(houses_prepared, houses_labels)

In [None]:
params = grid_search.best_params_
depth_param = params['max_depth']
estimator_param = params['n_estimators']
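Besides `best_params_`, the fitted search object exposes the score of every combination and a model already refit with the best one. The sketch below shows this on a small synthetic problem; the data and the reduced grid are made up to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Small synthetic regression problem
X = rng.uniform(0, 10, size=(200, 2))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=200)

# Deliberately tiny grid so the search runs quickly
toy_grid = {"max_depth": [2, 5], "n_estimators": [5, 20]}
search = GridSearchCV(RandomForestRegressor(random_state=1), toy_grid,
                      cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

# Square root of the mean cross-validated MSE for each combination
results = search.cv_results_
for combo, score in zip(results["params"], results["mean_test_score"]):
    print(combo, np.sqrt(-score))

# With the default refit=True, this model is already trained on all of X, y
best_model = search.best_estimator_
best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, best_rmse)
```

Looking at all the scores, not only the best one, helps to see whether the best value sits at the edge of the grid, in which case a wider grid may be worth trying.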

In [None]:
forest_reg = RandomForestRegressor(max_depth=depth_param, n_estimators=estimator_param, random_state=1)
forest_reg.fit(houses_prepared, houses_labels)

In [None]:
test_prepared = std_scaler.transform(test_data)
test_predict = forest_reg.predict(test_prepared)

In [None]:
forest_mse = mean_squared_error(test_labels, test_predict)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
print('The RMSE with this model is {} dollars on the test set.'.format(forest_rmse))