# Import data

This notebook walks through the data analysis and the whole process of building a model with several machine learning algorithms.
<br><br>
The first step imports the data and checks it for errors and missing values.<br>
Next, I display the feature correlation matrix to see how the features relate to each other.<br>
Then I deal with observations that are inconsistent with the rest. Many regression models are sensitive to such outliers.<br>
The last step before building a model is transforming the features into a form better suited to the algorithms.<br>
After that I fit several regression models and search for their best parameters. Finally, I try combining them to reach better performance.




In [None]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [None]:
dataset = pd.read_csv("../input/kc_house_data.csv", parse_dates = ['date'])
dataset.info()

The features id and date won't be useful for building a model, so I drop them.

In [None]:
dataset.drop(['id','date'],axis=1,inplace=True)

I check whether any column contains null values.

In [None]:
dataset.isnull().any()
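
`isnull().any()` only says whether a column has at least one missing value. As a minimal sketch, on a made-up frame (the toy values are illustrative and not taken from the King County data), per-column counts can be obtained like this:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "price": [221900.0, np.nan, 180000.0],
    "bedrooms": [3, 3, 2],
})

# Count missing values per column and show only the columns that have any.
missing_counts = toy.isnull().sum()
print(missing_counts[missing_counts > 0])
```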

#  Target variable

Now it's time to explore the data. I start with price, which is the target variable. <br>
The box plot shows the distribution of the target variable.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(11,5))
sns.boxplot(x = 'price', data = dataset, orient = 'h',  
                 fliersize = 3, showmeans=True, ax = ax)
plt.show()

# Features and outliers

My next step is to analyze the correlation between the variables in the dataset and the variable that I am going to predict.
This step shows which features are most strongly associated with the price of a house.

In [None]:
import numpy as np

corrmat = dataset.corr()
cols = corrmat.nlargest(30, 'price').index
cm = np.corrcoef(dataset[cols].values.T)
plt.subplots(figsize=(16,12))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', yticklabels=cols.values, xticklabels=cols.values)
plt.show()
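
A full heatmap can be hard to read when only one target matters. As a sketch, the correlations with `price` alone can be listed as a sorted column; the toy data below is made up for illustration.

```python
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "price": [100, 200, 300, 400, 500],
    "sqft_living": [800, 1500, 2100, 3000, 3900],
    "condition": [3, 4, 2, 5, 3],
})

# Pearson correlation of every feature with the target, strongest first.
price_corr = toy.corr()["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```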

Let's look more closely at how particular features correlate with price. I will look for observations that lie far from the others and remove them. <br>

In [None]:
sns.jointplot(x=dataset['sqft_living'], y = dataset['price'], kind='reg');

One point in the bottom-right doesn't fit the others. I remove this point manually.

In [None]:
dataset = dataset.drop(dataset[dataset['sqft_living']>12500].index).reset_index(drop=True)
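
The outliers in this notebook are picked by eye. An alternative that the notebook does not use is Tukey's 1.5 × IQR rule. The helper name `iqr_outlier_mask` and the toy values below are assumptions made for illustration.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for entries outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up living areas with one extreme value.
toy_sqft = pd.Series([1200, 1500, 1800, 2100, 2400, 13540])
mask = iqr_outlier_mask(toy_sqft)
print(toy_sqft[mask])
```

On a right-skewed variable this rule tends to flag many more rows than a manual inspection would, so it is better treated as a screening aid than as an automatic filter.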

In [None]:
f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x=dataset['grade'], y=dataset['price'])

The box plot looks as expected.

In [None]:
sns.jointplot(x=dataset['sqft_above'], y = dataset['price'], kind='reg');

In [None]:
f, ax = plt.subplots(figsize=(14, 6))
sns.boxplot(x=dataset['bathrooms'], y=dataset['price'])
locs, labels = plt.xticks()
plt.xticks(rotation=90);

Again, one record, with 7.5 bathrooms, doesn't fit the others.

In [None]:
dataset = dataset.drop(dataset[dataset['bathrooms']==7.5].index).reset_index(drop=True)

In [None]:
f, ax = plt.subplots(figsize=(7, 5))
sns.boxplot(x=dataset['view'], y=dataset['price']);

I didn't expect any anomaly here. A weak correlation is visible.

In [None]:
sns.jointplot(x=dataset['sqft_basement'], y = dataset['price'], kind='reg');

There are a lot of records without a basement. I will deal with that later.

In [None]:
f, ax = plt.subplots(figsize=(7, 5))
sns.boxplot(x=dataset['bedrooms'], y=dataset['price'])

A house with 33 bedrooms at this price indicates an anomaly.

In [None]:
dataset = dataset.drop(dataset[dataset['bedrooms']==33].index).reset_index(drop=True)

In [None]:
f, ax = plt.subplots(figsize=(5, 5))
sns.boxplot(x=dataset['waterfront'], y=dataset['price']);

In [None]:
f, ax = plt.subplots(figsize=(7, 5))
sns.boxplot(x=dataset['floors'], y=dataset['price']);

In [None]:
dataset.info()

In [None]:
sns.jointplot(x=dataset['sqft_lot'], y=dataset['price'], kind='reg');

In [None]:
plt.subplots(figsize=(16, 8))
sns.boxplot(x=dataset['yr_built'], y=dataset['price'])
locs, labels = plt.xticks()
plt.xticks(locs[0:115:3],labels[0:115:3],rotation=90);

In [None]:
plt.subplots(figsize=(6, 6))
sns.boxplot(x=dataset['condition'], y=dataset['price']);

The 3 features with the strongest association with price are sqft_living, grade and sqft_above. Area-related features are very important and<br> highly correlated with each other. Less obvious is that the number of bathrooms has an almost 2 times larger correlation coefficient<br> than the number of bedrooms, and that the correlation between the two is smaller than the correlation between bathrooms and the area-related features.<br>
The last 3 features have very low correlation coefficients, so they don't deserve much attention.

# Feature transformation

To start, I decided to create a new feature that indicates the age of the house.<br>
If the house was renovated, I count the age from the renovation year. 


In [None]:
# Use the renovation year when there is one (non-zero), otherwise the build year
data = dataset['yr_renovated'].where(dataset['yr_renovated'] != 0, dataset['yr_built'])
# Stored as years relative to 2015, so the values are zero or negative
dataset['age'] = -(2015-data)
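
To make the rule concrete, here is a small sketch on made-up rows (not taken from the dataset), using 2015 as the reference year and the same sign convention as the cell above.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "yr_built": [1955, 1951, 2014],
    "yr_renovated": [0, 1991, 0],
})

# Use the renovation year when there is one (non-zero), otherwise the build year.
effective_year = toy["yr_renovated"].where(toy["yr_renovated"] != 0, toy["yr_built"])

# Years relative to 2015, so the values are zero or negative.
toy["age"] = -(2015 - effective_year)
print(toy)
```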

Then I created a new feature, basement_existence.

In [None]:
# 1 if the house has any basement area, 0 otherwise
dataset['basement_existence'] = (dataset['sqft_basement'] > 0).astype(int)

A few features should be treated as categorical instead of numerical. Left as numbers, they could be misinterpreted and represent something they are not.


In [None]:
for i in ('waterfront','view','condition','grade','basement_existence','zipcode'):
    dataset[i] = dataset[i].astype(str)

dataset = pd.get_dummies(dataset)
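
As a minimal sketch of what this encoding does, here it is on a made-up frame (the values are illustrative): each distinct category becomes its own indicator column, which is why the column count grows.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "price": [300000, 450000, 520000],
    "zipcode": [98178, 98125, 98178],
})

# Casting to str makes get_dummies treat the column as categorical.
toy["zipcode"] = toy["zipcode"].astype(str)
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```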

After this step my dataset grew significantly.

In [None]:
dataset.info()

To reduce the computation time for the algorithms, I kept only the 30 columns most positively correlated with price (price itself included). After this the data became 6 times smaller.   

In [None]:
corrmat = dataset.corr()
cols = corrmat.nlargest(30, 'price').index

In [None]:
dataset = dataset[cols]
dataset.info()

# Building a model

I am going to use a few different regression models to complete the task. <br>
To train and test a model I need two separate sets, so I split the data into a training set and a hold-out test set.<br>
I evaluate performance with the root mean squared error (RMSE).
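
RMSE is the square root of the mean squared difference between true and predicted values, so it is expressed in the same unit as price. A minimal sketch with made-up numbers (the helper name `rmse` is an assumption for illustration):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)**2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices with errors of 10k, 20k and 20k.
print(rmse([300000, 450000, 520000], [310000, 430000, 540000]))
```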

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

Dividing the data into input variables and the target variable.

In [None]:
input_features = dataset.columns.tolist()
input_features.remove('price')

X = dataset[input_features]
y = dataset['price']

Splitting the data into training and test datasets.

In [None]:
X_train, X_test,y_train,y_test = train_test_split(dataset[input_features],dataset['price'], train_size = 0.75, random_state = 20)
print(X_train.shape, X_test.shape)

First I have to select the best parameters for each regression model. <br>
I created a function that takes a model, a parameter grid, and a boolean that determines whether to scale the data. <br>
The function prints the RMSE of the predictions on the test set and the parameters that gave the best cross-validated score. 

In [None]:
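def param_selector(estimator,params,scaler):
    """Grid-search `params` on the training set, then print the test RMSE and best parameters."""
    if scaler:
        # The step is named 'estimator', so grid keys need the 'estimator__' prefix
        estimator = Pipeline([('scaler',StandardScaler()),('estimator',estimator)])
    # Use the `params` argument instead of relying on a global variable
    cv = GridSearchCV(estimator,params,scoring="neg_mean_squared_error")
    cv.fit(X_train,y_train)
    score = np.sqrt(mean_squared_error(y_test,cv.predict(X_test)))
    print("RMSE error: {:.4f} ".format(score))
    print("Best parameters {}".format(cv.best_params_))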
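
When `scaler` is true the model is wrapped in a `Pipeline` whose last step is named `'estimator'`, so the grid keys follow the pattern `<step name>__<parameter name>`. A minimal sketch on synthetic data (the data and the grid values are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, for illustration only.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

pipe = Pipeline([("scaler", StandardScaler()), ("estimator", Ridge())])

# "estimator__alpha" addresses the alpha parameter of the step named "estimator".
grid = {"estimator__alpha": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, grid, scoring="neg_mean_squared_error", cv=3)
search.fit(X_toy, y_toy)
print(search.best_params_)
```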

#### Lasso regression

In [None]:
# Note: the `normalize` argument was removed from Lasso and Ridge in scikit-learn 1.2;
# on newer versions drop it and scale inside a Pipeline instead.
lasso = Lasso()
param = {'alpha': [0.05,0.1,0.5,1,5,10],'normalize': [True,False]}
param_selector(lasso,param,False)

In [None]:
lasso = Lasso(alpha=10,normalize=False)

#### Ridge regression

In [None]:
param = {'alpha': [0.001,0.05,0.1,0.5,1,5,10],'normalize':[True,False]}
ridge = Ridge()
param_selector(ridge,param,False)

In [None]:
ridge = Ridge(alpha=1,normalize=False)

#### Elastic Net Regression

In [None]:
param = {'estimator__alpha': [0.005,0.05,0.1,1,0.01,10],'estimator__l1_ratio':[.1, .2, .8,.9]}
Enet = ElasticNet()
param_selector(Enet,param,True)

In [None]:
Enet = make_pipeline(StandardScaler(),ElasticNet(alpha=0.1,l1_ratio=0.9))

#### k-Nearest Neighbors Regression

In [None]:
param = {'estimator__n_neighbors': list(range(14,16)),'estimator__weights':['distance']}
knn = KNeighborsRegressor()
param_selector(knn,param,True)

In [None]:
knn = make_pipeline(StandardScaler(),KNeighborsRegressor(n_neighbors=14,weights='distance'))

####  Gradient Boosting Regressor


In [None]:
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)

GBoost.fit(X_train,y_train)
score = np.sqrt(mean_squared_error(y_test,GBoost.predict(X_test)))
print("score: {:.4f} \n".format(score))

The Gradient Boosting Regressor reaches the best score. The remaining four models have almost equal scores.

My next approach to achieving a better score is averaging models. I built a class for this purpose.

In [None]:
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1) 
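
The core of `predict` is a column-wise stack followed by a row mean. A sketch with two made-up prediction vectors standing in for fitted models:

```python
import numpy as np

# Hypothetical predictions from two models for the same three houses.
pred_a = np.array([300000.0, 410000.0, 520000.0])
pred_b = np.array([320000.0, 390000.0, 500000.0])

# Shape (n_samples, n_models); averaging over axis=1 gives one value per house.
stacked = np.column_stack([pred_a, pred_b])
averaged = stacked.mean(axis=1)
print(averaged)
```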

I iterate through all possible subsets to check which combination of models gives the lowest RMSE.

In [None]:
import itertools

for j in range(1,5):
    for i in itertools.combinations([lasso,ridge,knn,Enet],j):
        averaged_models = AveragingModels(i)
        averaged_models.fit(X_train,y_train)
        score = np.sqrt(mean_squared_error(y_test,averaged_models.predict(X_test)))
        print("score: {:.4f} ".format(score))
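
The loop above prints scores without saying which combination produced them. As a sketch, enumerating plain labels in the same order shows how each printed score maps back to a subset; the label strings are placeholders, not fitted models.

```python
import itertools

names = ["lasso", "ridge", "knn", "enet"]  # labels only, not fitted models

# All non-empty subsets: sizes 1 to 4 give 4 + 6 + 4 + 1 = 15 combinations.
subsets = [combo for size in range(1, len(names) + 1)
           for combo in itertools.combinations(names, size)]
print(len(subsets))
for combo in subsets:
    print(" + ".join(combo))
```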

The combination of Ridge and k-Nearest Neighbors Regression gives a better result than any single model except the Gradient Boosting Regressor.

In [None]:
averaged_models = AveragingModels(models = (lasso,knn))
averaged_models.fit(X_train,y_train)
score = np.sqrt(mean_squared_error(y_test,averaged_models.predict(X_test)))
print("RMSE error: {:.4f}".format(score))

I take this model and combine it with the Gradient Boosting Regressor in different proportions. <br>
That approach still didn't give better performance than the Gradient Boosting Regressor alone.

In [None]:
# Weighted blend: 30% averaged model, 70% Gradient Boosting
blended = averaged_models.predict(X_test)*0.3+GBoost.predict(X_test)*0.7
score = np.sqrt(mean_squared_error(y_test,blended))
print("RMSE error: {:.4f}".format(score))
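
Trying several proportions can be written as a small scan over the blend weight. This is a sketch on made-up numbers; the helper name `blend_rmse` and all values are assumptions for illustration.

```python
import numpy as np

def blend_rmse(y_true, pred_a, pred_b, weight_a):
    """RMSE of weight_a * pred_a + (1 - weight_a) * pred_b."""
    blended = weight_a * pred_a + (1.0 - weight_a) * pred_b
    return float(np.sqrt(np.mean((y_true - blended) ** 2)))

# Made-up targets and predictions, for illustration only.
y_true = np.array([300000.0, 450000.0, 520000.0])
pred_a = np.array([330000.0, 420000.0, 500000.0])  # weaker model
pred_b = np.array([305000.0, 445000.0, 525000.0])  # stronger model

# Scan the weight given to the weaker model from 0 to 1.
for weight_a in np.linspace(0.0, 1.0, 6):
    print("{:.1f} -> {:.1f}".format(weight_a, blend_rmse(y_true, pred_a, pred_b, weight_a)))
```

Choosing the weight on the test set makes the reported score optimistic, so a separate validation split would be a safer place to tune it.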