# Machine Learning in Python - Project 1

Due Friday, March 6th by 5 pm.

## 1. Setup

### 1.1 Libraries

In [1]:
# Add any additional libraries or submodules below

# Display plots inline
%matplotlib inline

# Data libraries
import pandas as pd
import numpy as np

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting defaults
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 80

# sklearn modules
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, KBinsDiscretizer
from sklearn.compose import ColumnTransformer, make_column_transformer
import time

### 1.2 Data

In [2]:
sales = pd.read_csv("sales.csv")
sales_test = pd.read_csv("sales_test.csv")

## 2. Exploratory Data Analysis and Preprocessing

*Include a discussion of the data with a particular emphasis on the features of the data that are relevant for the subsequent modeling. Including visualizations of the data is strongly encouraged - all code and plots must also be described in the write up.*

*In this section you should also implement and describe any preprocessing / transformations of the features. Hint - you should not be modeling this data without transforming some of the features, e.g. modeling sale price directly is not a good idea.*

### 2.1 A glimpse of the dataset

Looking at a few rows in the dataset.

In [3]:
sales

| | sale_price | year_sold | year_built | lot_area | basement_area | living_area | full_bath | half_bath | bedroom | garage_cars | garage_area | ac | zoning | neighborhood | quality | condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 244000 | 2010 | 1968 | 11160 | 2110 | 2110 | 2 | 1 | 3 | 2 | 522 | Y | Residential_Low_Density | nb_07 | good | average |
| 1 | 189900 | 2010 | 1997 | 13830 | 928 | 1629 | 2 | 1 | 3 | 2 | 482 | Y | Residential_Low_Density | nb_22 | average | average |
| 2 | 191500 | 2010 | 1992 | 5005 | 1280 | 1280 | 2 | 0 | 2 | 2 | 506 | Y | Residential_Low_Density | nb_10 | good | average |
| 3 | 236500 | 2010 | 1995 | 5389 | 1595 | 1616 | 2 | 0 | 2 | 2 | 608 | Y | Residential_Low_Density | nb_10 | good | average |
| 4 | 189000 | 2010 | 1999 | 7500 | 994 | 1804 | 2 | 1 | 3 | 2 | 442 | Y | Residential_Low_Density | nb_22 | good | average |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1579 | 79500 | 2006 | 1970 | 1526 | 546 | 1092 | 1 | 1 | 3 | 0 | 0 | Y | Residential_Medium_Density | nb_02 | average | average |
| 1580 | 160000 | 2006 | 1977 | 17400 | 1126 | 1126 | 2 | 0 | 3 | 2 | 484 | Y | Residential_Low_Density | nb_20 | average | average |
| 1581 | 142500 | 2006 | 1984 | 7937 | 1003 | 1003 | 1 | 0 | 3 | 2 | 588 | Y | Residential_Low_Density | nb_20 | average | average |
| 1582 | 132000 | 2006 | 1992 | 10441 | 912 | 970 | 1 | 0 | 3 | 0 | 0 | Y | Residential_Low_Density | nb_20 | average | average |


Take a look at the datatypes of the columns.

In [4]:
sales.dtypes

sale_price        int64
year_sold         int64
year_built        int64
lot_area          int64
basement_area     int64
living_area       int64
full_bath         int64
half_bath         int64
bedroom           int64
garage_cars       int64
garage_area       int64
ac               object
zoning           object
neighborhood     object
quality          object
condition        object
dtype: object

The columns `ac`, `zoning`, `neighborhood`, `quality`, and `condition` are stored as strings (`object` dtype), so they are categorical variables that must be encoded before modeling. This is also mentioned in the `README.ipynb` file.
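As a quick check of which columns need encoding and which levels they contain, something like the following can be used. This is a minimal sketch on a made-up frame: `toy` and its values are assumptions for illustration, not the project data.

```python
import pandas as pd

# Toy frame standing in for `sales`; the values are illustrative only.
toy = pd.DataFrame({
    "sale_price": [244000, 189900, 79500],
    "ac": ["Y", "Y", "N"],
    "quality": ["good", "average", "average"],
})

# Every non-numeric column is a candidate for encoding.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Collect the distinct levels of each categorical column.
levels = {col: sorted(toy[col].unique()) for col in categorical_cols}
print(levels)
```

Running the same two lines on `sales` would list every level that the encoding rules have to cover.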

### 2.2 Transform categorical variables using ordinal encoding

This section describes how I transform the categorical variables of this dataset. The rules are collected in the `encoding` dictionary.

All the categorical features except `neighborhood` can be transformed directly. For `neighborhood`, I sort the neighborhoods by their mean `sale_price` in the training data and encode each one by its rank. This turns it into an ordinal feature that can be used in a linear model.

In [5]:
rank_neighbor = sales[["sale_price", "neighborhood"]].groupby(['neighborhood'], as_index=False).mean()
rank_neighbor.sort_values("sale_price", inplace=True)
ordering = {
    neighbor: i for i, neighbor in enumerate(rank_neighbor["neighborhood"])
}
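To illustrate what this ranking produces, here is a small worked example. The frame `toy` and its prices are made up for illustration; only the group-sort-enumerate pattern mirrors the cell above.

```python
import pandas as pd

# Made-up data: three neighborhoods with different price levels.
toy = pd.DataFrame({
    "neighborhood": ["nb_01", "nb_02", "nb_01", "nb_03", "nb_02"],
    "sale_price": [300000, 100000, 280000, 200000, 120000],
})

# Mean price per neighborhood, sorted from cheapest to most expensive.
mean_price = toy.groupby("neighborhood")["sale_price"].mean().sort_values()

# Rank 0 goes to the cheapest neighborhood, the largest rank to the priciest.
toy_ordering = {nb: rank for rank, nb in enumerate(mean_price.index)}
print(toy_ordering)
```

Here `nb_02` (mean 110000) gets rank 0, `nb_03` (mean 200000) rank 1, and `nb_01` (mean 290000) rank 2. The rank therefore depends on the training prices, not on the number in the neighborhood label.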

If there is air-conditioning (`ac` = `"Y"`), encode it as `1`; otherwise `0`.

Transform `zoning` into an ordinal variable, with `0` for `Residential_Low_Density`, `1` for `Residential_Medium_Density` and `2` for `Residential_High_Density`.

Transform `neighborhood` into its rank in the `ordering` dictionary, so that the neighborhood with the lowest mean sale price is encoded as `0` and the one with the highest mean sale price gets the largest value.

Transform the `quality` and `condition` columns into ordinals: `poor` as `0`, `fair` as `1`, `average` as `2`, `good` as `3`, `excellent` as `4`.

In [6]:
encoding = {
    "ac": {"N": 0, "Y": 1},
    "zoning": {"Residential_Low_Density": 0, "Residential_Medium_Density": 1, "Residential_High_Density": 2},
    "neighborhood": ordering,
    "quality": {"poor": 0, "fair": 1, "average": 2, "good": 3, "excellent": 4},
    "condition": {"poor": 0, "fair": 1, "average": 2, "good": 3, "excellent": 4}
}
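One caveat with a dictionary-based encoding is that `DataFrame.replace` leaves any value that is missing from the dictionary unchanged, so an unseen category would stay a string. A possible alternative, sketched below, is to apply each mapping with `Series.map`, which turns unmapped values into `NaN` so they can be counted. The names `toy_encoding` and `toy`, and the `"unknown"` level, are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the `encoding` dictionary.
toy_encoding = {
    "ac": {"N": 0, "Y": 1},
    "quality": {"poor": 0, "fair": 1, "average": 2, "good": 3, "excellent": 4},
}

# Made-up frame with one level ("unknown") that the dictionary does not cover.
toy = pd.DataFrame({
    "ac": ["Y", "N", "Y"],
    "quality": ["good", "average", "unknown"],
})

# Apply each per-column mapping; values without a rule become NaN.
encoded = toy.copy()
for col, mapping in toy_encoding.items():
    encoded[col] = toy[col].map(mapping)

# Count the values that had no encoding rule, per column.
unmapped = encoded.isna().sum()
print(unmapped)
```

A non-zero count would signal a category, for example a neighborhood that appears only in the test set, that needs its own rule.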

Transform the features as specified. I also log-transform `sale_price` and store the result in the `log_sale_price` column, so that the distribution of the target looks closer to a normal distribution.

In [7]:
sales.replace(encoding, inplace=True)
sales["log_sale_price"] = np.log(sales["sale_price"])

In [8]:
sales_test.replace(encoding, inplace=True)
y_test = sales_test["sale_price"]
log_y_test = np.log(sales_test["sale_price"])
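The effect of the log transform on a right-skewed variable can be illustrated with synthetic data. The sketch below draws made-up log-normal "prices" (the seed, location and scale are assumptions, not estimates from `sales`) and compares the sample skewness before and after taking logs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices"; NOT the project data.
toy_price = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=2000)))

# Sample skewness: clearly positive for the raw values, near zero after the log.
raw_skew = toy_price.skew()
log_skew = np.log(toy_price).skew()
print(raw_skew, log_skew)
```

The same comparison on `sales["sale_price"]` and `sales["log_sale_price"]` would show whether the transform helps for this dataset.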

Feature matrix `X` of the training data:

In [9]:
X = sales.loc[:, "year_sold": "condition"]

Feature matrix `X_test` of the testing data:

In [10]:
X_test = sales_test.loc[:, "year_sold": "condition"]

Target vector `y` and log-transformed target vector `log_y` of the training data:

In [11]:
log_y = sales["log_sale_price"]
y = sales["sale_price"]

## 3. Model Fitting and Tuning

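The model combines polynomial features for the continuous variables with one-hot encoded bins for the discrete and ordinal variables (step functions, similar in spirit to the piecewise-constant fit of a regression tree). All of these features feed into a single linear regression on the log sale price.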

In [12]:
degree_poly = 3

polytree_model  = make_pipeline(
    make_column_transformer(
        (KBinsDiscretizer(n_bins=5, strategy="uniform", encode="onehot-dense"), ["year_sold"]),
        (PolynomialFeatures(degree=degree_poly, include_bias=False), ["year_built"]),
        (PolynomialFeatures(degree=degree_poly, include_bias=False), ["lot_area"]),
        (PolynomialFeatures(degree=degree_poly, include_bias=False), ["basement_area"]),
        (PolynomialFeatures(degree=degree_poly, include_bias=False), ["living_area"]),
        (KBinsDiscretizer(n_bins=5, strategy="uniform", encode="onehot-dense"), ["full_bath"]), 
        (KBinsDiscretizer(n_bins=3, strategy="uniform", encode="onehot-dense"), ["half_bath"]), 
        (KBinsDiscretizer(n_bins=7, strategy="uniform", encode="onehot-dense"), ["bedroom"]), 
        (KBinsDiscretizer(n_bins=6, strategy="uniform", encode="onehot-dense"), ["garage_cars"]), 
        (PolynomialFeatures(degree=degree_poly, include_bias=False), ["garage_area"]),
        (KBinsDiscretizer(n_bins=2, strategy="uniform", encode="onehot-dense"), ["ac"]),
        (KBinsDiscretizer(n_bins=3, strategy="uniform", encode="onehot-dense"), ["zoning"]),
        (KBinsDiscretizer(n_bins=24, strategy="uniform", encode="onehot-dense"), ["neighborhood"]),
        (KBinsDiscretizer(n_bins=5, strategy="uniform", encode="onehot-dense"), ["quality"]),
        (KBinsDiscretizer(n_bins=5, strategy="uniform", encode="onehot-dense"), ["condition"]),
    ),
    LinearRegression(fit_intercept=True)
)

fit = polytree_model.fit(X, log_y)
polytree_model.named_steps["columntransformer"].named_transformers_

{'kbinsdiscretizer-1': KBinsDiscretizer(encode='onehot-dense', n_bins=5, strategy='uniform'),
 'polynomialfeatures-1': PolynomialFeatures(degree=3, include_bias=False, interaction_only=False,
                    order='C'),
 'polynomialfeatures-2': PolynomialFeatures(degree=3, include_bias=False, interaction_only=False,
                    order='C'),
 'polynomialfeatures-3': PolynomialFeatures(degree=3, include_bias=False, interaction_only=False,
                    order='C'),
 'polynomialfeatures-4': PolynomialFeatures(degree=3, include_bias=False, interaction_only=False,
                    order='C'),
 'kbinsdiscretizer-2': KBinsDiscretizer(encode='onehot-dense', n_bins=5, strategy='uniform'),
 'kbinsdiscretizer-3': KBinsDiscretizer(encode='onehot-dense', n_bins=3, strategy='uniform'),
 'kbinsdiscretizer-4': KBinsDiscretizer(encode='onehot-dense', n_bins=7, strategy='uniform'),
 'kbinsdiscretizer-5': KBinsDiscretizer(encode='onehot-dense', n_bins=6, strategy='uniform'),
 'polynomi
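To make the two kinds of transformer in the pipeline concrete, here is a minimal sketch on a made-up frame. The frame `toy`, its values, and the settings `n_bins=4` and `degree=2` are assumptions for illustration, not the project configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

# Made-up data: one discrete column and one continuous column.
toy = pd.DataFrame({
    "bedroom": [1, 2, 3, 4, 2, 3],
    "living_area": [800.0, 1100.0, 1500.0, 2100.0, 1200.0, 1600.0],
})

ct = make_column_transformer(
    # Four equal-width bins, one indicator column per bin.
    (KBinsDiscretizer(n_bins=4, strategy="uniform", encode="onehot-dense"), ["bedroom"]),
    # x and x^2 for the continuous column (no constant column).
    (PolynomialFeatures(degree=2, include_bias=False), ["living_area"]),
)

features = ct.fit_transform(toy)
print(features.shape)  # 4 bin indicators + 2 polynomial terms per row
```

Each row has exactly one active bin indicator, followed by `living_area` and its square. A linear regression on these columns fits a step function in `bedroom` and a quadratic in `living_area`.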

In [13]:
parameters = {
    'columntransformer__polynomialfeatures-1__degree': np.arange(1, degree_poly + 1),
    'columntransformer__polynomialfeatures-2__degree': np.arange(1, degree_poly + 1),
    'columntransformer__polynomialfeatures-3__degree': np.arange(1, degree_poly + 1),
    'columntransformer__polynomialfeatures-4__degree': np.arange(1, degree_poly + 1),
    'columntransformer__polynomialfeatures-5__degree': np.arange(1, degree_poly + 1),
}

start = time.time()
grid_search = GridSearchCV(polytree_model, parameters, 
                           cv=KFold(n_splits=5, shuffle=True, random_state=2020), 
                           scoring="neg_root_mean_squared_error", 
                           verbose=1).fit(X, log_y)
print((time.time() - start)/ 60, "minutes")
grid_search

Fitting 5 folds for each of 243 candidates, totalling 1215 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1215 out of 1215 | elapsed:  4.4min finished


4.36329962015152 minutes


GridSearchCV(cv=KFold(n_splits=5, random_state=2020, shuffle=True),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('columntransformer',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('kbinsdiscretizer-1',
                                                                         KBinsDiscretizer(encode='onehot-dense',
                                                                                          n_bins=5,
                                                                                          strategy='uniform'),
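The candidate count in the log follows from the grid: five polynomial transformers with three possible degrees each give 3^5 = 243 combinations, and five folds give 1215 fits. The sketch below checks this arithmetic with `ParameterGrid`; `toy_parameters` is a rebuilt copy of the grid, not the object used in the search.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

degree_poly = 3

# Rebuild the same five-parameter grid: degrees 1..3 for each transformer.
toy_parameters = {
    f"columntransformer__polynomialfeatures-{i}__degree": np.arange(1, degree_poly + 1)
    for i in range(1, 6)
}

# Number of parameter combinations, and total fits with 5-fold CV.
n_candidates = len(ParameterGrid(toy_parameters))
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

Because the count grows as 3^k in the number of tuned transformers, adding one more polynomial feature to the grid would triple the running time.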
                                                                  

In [14]:
print("best index: ", grid_search.best_index_)
print("best param: ", grid_search.best_params_)
print("best score: ", grid_search.best_score_)

best index:  132
best param:  {'columntransformer__polynomialfeatures-1__degree': 2, 'columntransformer__polynomialfeatures-2__degree': 2, 'columntransformer__polynomialfeatures-3__degree': 3, 'columntransformer__polynomialfeatures-4__degree': 3, 'columntransformer__polynomialfeatures-5__degree': 1}
best score:  -0.11014691279818682
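Since the model is fitted on the log scale, the score is an RMSE in log units. An error of size r in log price corresponds to a multiplicative factor of exp(r) in price, so the score can be read as an approximate relative error. The short calculation below is a rough interpretation aid, not an exact error bound.

```python
import math

# Magnitude of the best cross-validated score reported above (log units).
log_rmse = 0.11014691279818682

# Multiplicative error on the price scale implied by one log-RMSE.
relative_error = math.exp(log_rmse) - 1
print(f"{relative_error:.1%}")
```

This suggests that a prediction one RMSE above the true log price overshoots the sale price by a little under 12%.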


In [15]:
print(len(grid_search.best_estimator_.named_steps['linearregression'].coef_), "Best Coefficients", 
      grid_search.best_estimator_.named_steps['linearregression'].coef_)
print("Best Intercept", grid_search.best_estimator_.named_steps['linearregression'].intercept_)

76 Best Coefficients [ 3.35329616e-03 -5.14185922e-03  6.24168396e-03 -2.89013224e-03
 -1.56303918e-03 -7.58843566e-02  2.02642768e-05  6.55349069e-06
 -3.10862351e-11  2.38523803e-04 -6.69432500e-08  1.22833063e-11
  7.55491714e-04 -2.05819582e-07  2.71854952e-11 -1.08643569e-01
 -3.48334114e-02 -4.14808647e-02 -5.53758334e-03  1.90495429e-01
  5.44794415e-03  1.56013628e-02 -2.10493070e-02  1.84617597e-01
  7.17862009e-02  2.96024020e-02  1.82480473e-02 -2.44208039e-02
 -3.72700648e-02 -2.42563379e-01 -6.79106370e-02 -2.31354039e-03
  1.48417916e-02  4.71825620e-02  1.88384058e-02 -1.06385820e-02
  1.13811957e-04 -3.87501103e-02  3.87501103e-02  4.88471492e-02
 -5.67303611e-03 -4.31741131e-02 -1.85585046e-01 -9.62775121e-02
  1.27151250e-04 -3.37626543e-02  4.01465099e-02 -5.77159269e-02
 -1.26179869e-02 -2.23143758e-02 -3.53561093e-02 -3.66819243e-03
 -1.54275347e-02 -3.36802510e-02 -3.60936717e-02 -6.48609037e-03
 -2.20923422e-02 -3.25905960e-02  1.43308507e-01  5.94279442e-02
  2.

In [16]:
log_train_prediction = grid_search.best_estimator_.predict(X)
print("training log rmse", np.sqrt(mean_squared_error(log_y, log_train_prediction )))

training log rmse 0.1025951666361805


In [17]:
print("training rmse", np.sqrt(mean_squared_error(y, np.exp(log_train_prediction ))))

training rmse 18978.01543333264


In [18]:
print("log cross validation rmse", -1 * cross_val_score(grid_search.best_estimator_, X, log_y, 
                                                        cv=KFold(n_splits=5, shuffle=True, random_state=2020), 
                                                        scoring="neg_root_mean_squared_error"))

log cross validation rmse [0.11517973 0.10810453 0.11404037 0.11359755 0.09981239]
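The five fold scores can be summarised by their mean and spread. Because the same `KFold` split is used as in the grid search, the mean should match the magnitude of `best_score_`. The sketch below copies the printed values by hand, so it is accurate only to the printed precision.

```python
import numpy as np

# Fold-wise log RMSE values copied from the output above.
fold_log_rmse = np.array([0.11517973, 0.10810453, 0.11404037, 0.11359755, 0.09981239])

# Mean and sample standard deviation across the five folds.
mean_log_rmse = fold_log_rmse.mean()
std_log_rmse = fold_log_rmse.std(ddof=1)
print(mean_log_rmse, std_log_rmse)
```

The mean agrees with the best grid-search score of about 0.1101, and the spread across folds gives a rough sense of how much the estimate varies with the split.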


In [19]:
log_test_prediction = grid_search.best_estimator_.predict(X_test)
print("testing log rmse", np.sqrt(mean_squared_error(log_y_test, log_test_prediction)))

testing log rmse 0.11617928944609092


In [20]:
print("testing rmse", np.sqrt(mean_squared_error(y_test, np.exp(log_test_prediction))))

testing rmse 21960.640520199366
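The error figures above come in pairs: one RMSE on the log scale and one on the original price scale after applying `np.exp` to the predictions. A small helper makes this pattern explicit. The function `rmse` and the toy numbers are hypothetical and serve only to show the back-transformation.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices and log-scale predictions (two are off by 10000, one is exact).
toy_prices = np.array([100000.0, 200000.0, 300000.0])
toy_log_pred = np.log(np.array([110000.0, 190000.0, 300000.0]))

# Error in log units, as the model is trained.
log_rmse = rmse(np.log(toy_prices), toy_log_pred)

# Error in price units, after undoing the log transform.
price_rmse = rmse(toy_prices, np.exp(toy_log_pred))
print(log_rmse, price_rmse)
```

The price-scale RMSE is dominated by expensive houses, while the log-scale RMSE weights relative errors equally, which is why the two can rank models differently.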
