# Boston Housing Prices

Welcome, everyone! This is my first notebook, and I'm so happy to share this first real
attempt at machine learning. After several theoretical and practical courses, I thought it was time
to get seriously involved! As a first try, I'll present my work on Boston Housing Prices
predictions. All comments, suggestions and corrections are welcome; the goal is to keep
progressing!

I've split this notebook into 3 major parts:
* The first is dedicated to understanding our dataset, so that we are well prepared to explain and make the choices that lead to the best possible models.
* Next, I transform the data. Knowing which models I will use, and with a better understanding of the features, I can pick the most appropriate transformations, some of them specific to a single model.
* Finally, I train and compare the different models and pick the best one to make future predictions.

## Initial step: Importing the Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVR
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor

## 1. Importing & Describing the Dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of [Boston, MA](http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). The dataset columns are described below:

* CRIM - per capita crime rate by town
* ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS - proportion of non-retail business acres per town.
* CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
* NOX - nitric oxides concentration (parts per 10 million)
* RM - average number of rooms per dwelling
* AGE - proportion of owner-occupied units built prior to 1940
* DIS - weighted distances to five Boston employment centres
* RAD - index of accessibility to radial highways
* TAX - full-value property-tax rate per \$10,000
* PTRATIO - pupil-teacher ratio by town
* B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LSTAT - % lower status of the population
* MEDV - Median value of owner-occupied homes in \$1000's

The description above already gives us some information:
* LSTAT will take values between 0 and 100.
* CHAS is a binary variable.
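
These two expectations are easy to turn into quick sanity checks. Here is a minimal sketch on a tiny made-up frame (the `toy` name and its values are only for illustration, not the real data); the same two lines could be run on `dataset` once it is loaded.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "CHAS": [0, 1, 0, 0, 1],
    "LSTAT": [4.98, 9.14, 4.03, 2.94, 12.43],
})

# A dummy variable should only ever contain 0 and 1.
chas_is_binary = set(toy["CHAS"].unique()) <= {0, 1}

# A percentage should stay inside [0, 100].
lstat_in_range = bool(toy["LSTAT"].between(0, 100).all())

print("CHAS is binary:", chas_is_binary)
print("LSTAT within [0, 100]:", lstat_in_range)
```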

In [None]:
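# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run.
boston = load_boston()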
dataset = pd.DataFrame(boston.data)
dataset.columns = boston.feature_names
dataset['MEDV'] = boston.target

In [None]:
dataset.head()

In [None]:
dataset.shape

In [None]:
dataset.describe()

The function `describe()` shows us interesting things about our features. The first concerns the columns
**ZN** and **B**. These features seem to be conditional: for **ZN**, all the values are zero until you reach the third quartile, and for **B** the smallest values start near 0 before quickly rising to around the mean of 356.

We can also note that, for some of our features, the distribution seems rather asymmetric, judging by the gap between the median and the mean.

Let's confirm this with a quick look at boxplots and distribution plots of the features.

P.S.: There is no missing data!
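
Both observations (no missing data, and a mean that drifts away from the median for skewed columns) can be checked in a couple of lines. This is only a sketch on synthetic columns, `symmetric` and `right_skewed`, which I made up for the illustration; on the real data you would call the same methods on `dataset`.

```python
import numpy as np
import pandas as pd

# Made-up sample used only for illustration (not the Boston data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=10.0, scale=2.0, size=500),
    "right_skewed": rng.exponential(scale=3.0, size=500),
})

# Number of missing values per column; all zeros means nothing to impute.
missing_per_column = toy.isna().sum()

# A mean well above the median hints at a right-skewed column.
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})

print(missing_per_column)
print(summary)
```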

In [None]:
fig, axs = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
axs = axs.flatten()
for k,v in dataset.items():
    ax = sns.boxplot(y=k, data=dataset, ax=axs[index])
    ax.set_title(dataset.columns[index] + " boxplot")
    index += 1
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)

In [None]:
fig, axs = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
axs = axs.flatten()
for k,v in dataset.items():
    try:
        plot = sns.distplot(v, ax=axs[index])
        plot.set_title(dataset.columns[index] + " dist plot")
    except RuntimeError as re:
        if str(re).startswith("Selected KDE bandwidth is 0. Cannot estimate density."):
            plot = sns.distplot(v, kde_kws={'bw': 0.1}, ax=axs[index])
            plot.set_title(dataset.columns[index] + " dist plot")
        else:
            raise re

    index += 1
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)

As discussed earlier, the different plots confirm that **ZN**, **B** and **CHAS** behave the way we identified.

These plots also show that some features, like **CRIM**, **B** and **MEDV**, have a lot of outliers.
They confirm as well what we said about asymmetric data: the boxplots and distribution plots show how strongly skewed those features are.

Since we want to fit an ordinary least squares regression model, we won't need to treat our features for skewness. However, OLS relies on a normality assumption (strictly speaking on the residuals rather than on the output itself), so we will try to apply a transformation to our **MEDV** output to make it more Gaussian-distributed.

One last thing: **MEDV** also seems to be capped at a value of 50, which could affect how well the learning algorithms fit the data.
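
To put numbers on what the boxplots suggest, one option is to count the points beyond the usual 1.5 × IQR whiskers and to measure how many observations sit exactly at the column maximum. The helpers `count_iqr_outliers` and `share_at_maximum` below are names I made up for this sketch, and `toy_medv` is invented data with an artificial cap at 50.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, whisker: float = 1.5) -> int:
    """Count values outside [Q1 - whisker*IQR, Q3 + whisker*IQR] (boxplot rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return int(((series < lower) | (series > upper)).sum())

def share_at_maximum(series: pd.Series) -> float:
    """Fraction of observations equal to the maximum; a large share hints at a cap."""
    return float((series == series.max()).mean())

# Made-up prices with two values stuck at an artificial ceiling of 50.
toy_medv = pd.Series([15.0, 18.0, 21.5, 22.0, 23.1, 24.0, 25.3, 27.9, 50.0, 50.0])

n_outliers = count_iqr_outliers(toy_medv)
capped_share = share_at_maximum(toy_medv)
print("IQR outliers:", n_outliers)
print("share at maximum:", capped_share)
```

On this toy series, both capped values are flagged as outliers, which is the kind of pattern seen for **MEDV**.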

## 3. Feature selection & transformations

### 3.1 Selection

In [None]:
plt.figure(figsize=(20, 10))
sns.heatmap(dataset.corr().abs(), annot=True)

In [None]:
fig, axs = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
axs = axs.flatten()
for i, k in enumerate(dataset.columns[:-1]):
    sns.regplot(y=dataset['MEDV'], x=dataset[k], ax=axs[i], color=np.random.rand(3,))
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)

Looking at the correlation matrix, **LSTAT**, **RM**, **PTRATIO**, **INDUS**, **TAX** and **NOX** are the features most highly correlated with the output **MEDV**. More generally, none of the plots seems to show a non-linear correlation pattern.

Since we have not yet transformed our data to handle the fact that **MEDV** is capped at a maximum value of 50, the correlation values could still change. But trust me (I could show it to you later), it doesn't affect them much, and our 6 main features stay the same.
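
Rather than reading the heatmap by eye, the ranking can also be extracted directly from the correlation matrix. The sketch below uses synthetic columns (`strong`, `weak`, `unrelated`, `target`) that I invented for the example; with the real data, the column `"MEDV"` would play the role of `"target"`.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends strongly on one feature, weakly on another.
rng = np.random.default_rng(42)
n = 300
strong = rng.normal(size=n)
weak = rng.normal(size=n)
toy = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "unrelated": rng.normal(size=n),
    "target": 3.0 * strong - 1.5 * weak + rng.normal(size=n),
})

# Absolute Pearson correlation of every feature with the target, highest first.
ranking = (
    toy.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
top_features = ranking.head(2).index.tolist()

print(ranking)
print("selected:", top_features)
```

Running the same ranking before and after removing the capped rows is a simple way to check that the selected features stay the same.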

### 3.2 Transformations

Now let's start the feature transformations.

Without any particular motivation, we chose to try making predictions with 3 models: OLS (linear) regression, SVR (Support Vector Regression) and Random Forest regression.
Each of these models comes with particular conditions of use:
* OLS relies on a normality assumption, so we will transform our output to make it as Gaussian-distributed as possible.
* SVR is very sensitive to outliers and to differing scales between features, so we will need to apply feature scaling and selection to train this model well.
* Random Forest regression doesn't need any special treatment given the nature of our dataset, but it will be interesting to see how well it can fit our output.

However, the first step is to delete values greater than or equal to 50 from our output column, so that our models won't treat the cap as a genuine behavior of our case study.

In [None]:
dataset = dataset[~(dataset['MEDV'] >= 50.0)]

X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=4)

#### 3.2.1 Transformations for OLS regression (multiple linear regression)

As we said earlier, OLS regression relies on a normality assumption, which we approximate here by asking the output y to be Gaussian-distributed. Let's make this assumption a bit more true than it was before.

In [None]:
print('old skewness: ', y_train.skew())
pt_y = PowerTransformer(method='yeo-johnson', standardize=False)
y_train_ols = pt_y.fit_transform(y_train.values.reshape(len(y_train), 1))
y_test_ols = pt_y.transform(y_test.values.reshape(len(y_test), 1))
print('new skewness: ', stats.skew(y_train_ols)[0])

Skewness measures how asymmetric the data distribution is. A value greater than 1 or less than -1 is considered highly skewed. Our output is considered moderately skewed, with a value of 0.7.
Using a Yeo-Johnson transformation (whose process I won't try to describe here!), we were able to make our **MEDV** output a far more Gaussian-distributed series.
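
To see the effect in isolation, here is a small sketch on synthetic, right-skewed data (a lognormal sample I picked for the illustration, not the real **MEDV** values). It also checks that `inverse_transform` brings us back to the original scale, which is what lets us report errors in the original units later on.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed target, shaped (n_samples, 1) as scikit-learn expects.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=3.0, sigma=0.5, size=400).reshape(-1, 1)

# Fit the Yeo-Johnson transform; lambda is estimated by maximum likelihood.
pt = PowerTransformer(method="yeo-johnson", standardize=False)
y_transformed = pt.fit_transform(y)

skew_before = stats.skew(y.ravel())
skew_after = stats.skew(y_transformed.ravel())

# The transform is invertible, so predictions can be mapped back to the original scale.
y_restored = pt.inverse_transform(y_transformed)

print("estimated lambda:", pt.lambdas_[0])
print("skewness before:", skew_before)
print("skewness after: ", skew_after)
```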

#### 3.2.2 Transformations for SVR

For SVR, the problem lies elsewhere.

We mentioned it earlier too, but here is a brief reminder: the Support Vector Regression model works in part by exploiting kernel functions. These functions make it possible to determine how an observation relates to a group by computing distances between the observation and the centre of the kernel. As a result, the model is very sensitive to outliers and to scale differences between your features.

By selecting the most correlated features and those with the fewest outliers, combined with feature scaling, we should be able to fit this model well.
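
The scale sensitivity is easy to see on the RBF kernel, which is exp(-gamma × squared distance). In the sketch below, the observations, the `gamma` value and the helper `rbf_kernel_value` are all made up for the illustration; each row is a hypothetical `[rooms, tax]` pair.

```python
import numpy as np

def rbf_kernel_value(a: np.ndarray, b: np.ndarray, gamma: float) -> float:
    """RBF kernel: exp(-gamma * squared Euclidean distance between a and b)."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

# Made-up sample: column 0 is a small-scale feature, column 1 a large-scale one.
X = np.array([
    [5.0, 200.0],
    [6.0, 300.0],   # a
    [6.0, 320.0],   # b: same rooms as a, slightly different tax
    [8.0, 300.0],   # c: same tax as a, very different rooms
    [7.0, 700.0],
    [6.5, 400.0],
])
gamma = 0.5  # arbitrary value chosen for the example

# Standardize each column to zero mean and unit variance (what StandardScaler does).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

raw_ab = rbf_kernel_value(X[1], X[2], gamma)
raw_ac = rbf_kernel_value(X[1], X[3], gamma)
scaled_ab = rbf_kernel_value(X_scaled[1], X_scaled[2], gamma)
scaled_ac = rbf_kernel_value(X_scaled[1], X_scaled[3], gamma)

print("raw:    k(a, b) =", raw_ab, " k(a, c) =", raw_ac)
print("scaled: k(a, b) =", scaled_ab, " k(a, c) =", scaled_ac)
```

On the raw values, the small tax difference makes `a` and `b` look completely dissimilar, while after scaling they come out as the closest pair. The large-scale feature no longer dominates the distance.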

In [None]:
X_train_svr = X_train.loc[:, ['INDUS', 'NOX', 'RM', 'TAX', 'PTRATIO', 'LSTAT']]
X_test_svr = X_test.loc[:, ['INDUS', 'NOX', 'RM', 'TAX', 'PTRATIO', 'LSTAT']]

In [None]:
sc_X = StandardScaler()
sc_y = StandardScaler()
X_train_svr = sc_X.fit_transform(X_train_svr)
X_test_svr = sc_X.transform(X_test_svr)
y_train_svr = sc_y.fit_transform(y_train.values.reshape(len(y_train), 1))

## 4. Model training & evaluation



### 4.1 Multiple linear regression (Ordinary Least Squares regression)

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train_ols)

y_pred_ols = pt_y.inverse_transform(lin_reg.predict(X_test))

error_ols = mean_squared_error(pt_y.inverse_transform(y_test_ols), y_pred_ols, squared=False)
r2_ols = r2_score(pt_y.inverse_transform(y_test_ols), y_pred_ols)

print('RMSE: ', error_ols)
print('R2: ', r2_ols)

### 4.2 Support Vector Regression

In [None]:
svr = SVR(kernel='rbf')
svr.fit(X_train_svr, y_train_svr.ravel())

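# StandardScaler expects a 2D array, so reshape the 1D predictions before undoing the scaling
y_pred_svr = sc_y.inverse_transform(svr.predict(X_test_svr).reshape(-1, 1)).ravel()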

error_svr = mean_squared_error(y_test, y_pred_svr, squared=False)
r2_svr = r2_score(y_test, y_pred_svr)

print('RMSE: ', error_svr)
print('R2: ', r2_svr)

### 4.3 Random Forest Regression

In [None]:
tree_reg = RandomForestRegressor(n_estimators=20, random_state=0)
tree_reg.fit(X_train, y_train)

y_pred = tree_reg.predict(X_test)

error_tree = mean_squared_error(y_test, y_pred, squared=False)
r2_tree = r2_score(y_test, y_pred)

print('RMSE: ', error_tree)
print('R2: ', r2_tree)

Considering the different scores, and despite all the measures taken, it seems that Random Forest regression is the best model for this case study.

However, the SVR model comes very close behind it.

Overall, each of the models performed well. Given the small amount of data relative to the number of features, the nature of each feature and the objective sought, we feel that our models are quite reliable. Improving the accuracy would obviously require more observations in the first place. Beyond that, one could perhaps expect a benefit from more thorough data cleaning with an outlier treatment.

In [None]:
pd.DataFrame([[error_ols, error_svr, error_tree], [r2_ols, r2_svr, r2_tree]], index=['RMSE', 'R2'], columns=['OLS reg', 'SVR', 'Random Forest reg'])
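
For readers who want to see what the two scores in this table measure, here is a minimal NumPy sketch of RMSE and R2 written by hand. The helper names `rmse` and `r2` and the four sample values are made up for the example; the notebook itself relies on scikit-learn's `mean_squared_error` and `r2_score`.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error: typical size of an error, in the target's units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values: every prediction is off by exactly 1.
y_true_toy = [20.0, 22.0, 24.0, 26.0]
y_pred_toy = [21.0, 21.0, 25.0, 25.0]

rmse_toy = rmse(y_true_toy, y_pred_toy)
r2_toy = r2(y_true_toy, y_pred_toy)
print("RMSE:", rmse_toy)
print("R2:  ", r2_toy)
```

A lower RMSE and an R2 closer to 1 both indicate a better fit, which is how the three columns of the table can be compared.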

As a last few words, I would like to say a huge thank you to Shreayan Chaudhary and Prasad Perara for their own notebooks on the subject, which were a very inspiring starting point for my own thinking!