## Introduction

The goal of this notebook is to predict the variable MEDV, the median value of owner-occupied homes in thousands of dollars. The methods used are:
- Linear regression
- Support vector regression
- Gradient boosting regressor

In [None]:
# load libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error,mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
import warnings  


In [None]:
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)                        # show every column when printing
df = pd.read_csv("/kaggle/input/real-estate-dataset/data.csv")    # read the dataset

## First look at the data


In [None]:
print(df.head())
print(df.shape)
print(df.describe())
print(df.isnull().sum())
df[df.isnull().any(axis=1)]

The dataset has 14 variables and 511 observations. Five rows have a missing value for RM (average number of rooms per dwelling). Because these rows are so few (about 1% of the data), removing them should have little effect on the models, so they are dropped in the following steps. The summary statistics show that the variables are on very different scales, and for some of them the difference is large.

In [None]:
df2 = df.dropna().copy() # drop rows with NA; copy so later column assignments act on an independent frame
df2.shape

In [None]:
print(df2.columns[0:13])
x_names = df2.columns[0:13]           # plot every x variable against y
y_name = df2.columns[-1]
def pllot(x,y):
  plt.scatter(df2[x],df2[y])
  plt.xlabel(x)
  plt.ylabel(y)
  plt.title("Scatter plot of "+x+" and "+y)
  plt.show()
for i in x_names:
  pllot(i, y_name)

The plots above show the relation between the endogenous variable MEDV and every other variable. Almost all variables are skewed, and many have clear outliers. ZN, CHAS, RAD and TAX look categorical because they take only a few distinct values, but according to the dataset description only CHAS is truly categorical: it indicates whether the tract bounds the Charles River. RM and LSTAT show the clearest relation with MEDV in the plots. RM is the number of rooms per dwelling and is positively related to home price, which is not surprising, since bigger houses have more rooms and cost more. LSTAT is negatively related to price, which is also not surprising, since it is the percentage of lower-status population.

## Preparation of data

Before building the prediction models, the data has to be prepared. Outliers are handled first. Observations that lie more than 3 standard deviations from the variable's mean in either direction are replaced with the value of the closest observation inside that range, using the function shown below.

In [None]:
def outliers(x):                       # cap outliers at the nearest non-outlying value
  l_b = x.mean()-3*x.std() 
  u_b = x.mean()+3*x.std()
  x_u = x.index[x>u_b]
  x_l = x.index[x<l_b]
  x[x_u] = max(x.drop(x_u, axis=0))
  x[x_l] = min(x.drop(x_l, axis=0))
  return x
for i in x_names:
  if i != 'CHAS':
    df2[i] = outliers(df2[i])
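
The capping rule described above can also be written so that it returns a new Series instead of modifying its input. The following is only a sketch under assumptions: the name `cap_outliers`, the parameter `k` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def cap_outliers(s: pd.Series, k: float = 3.0) -> pd.Series:
    """Return a copy of s in which values beyond mean +/- k*std are
    replaced by the nearest value that lies inside those bounds."""
    lower = s.mean() - k * s.std()
    upper = s.mean() + k * s.std()
    inside = s[(s >= lower) & (s <= upper)]   # observations within the bounds
    capped = s.copy()                         # leave the input untouched
    capped[s > upper] = inside.max()
    capped[s < lower] = inside.min()
    return capped

# Toy data (made up): twenty ordinary values and one extreme value.
toy = pd.Series([10.0] * 10 + [12.0] * 10 + [100.0])
capped = cap_outliers(toy)
print(capped.iloc[-1], toy.iloc[-1])
```

In this toy case the extreme value is replaced by the largest ordinary value, while `toy` itself keeps the original number.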

In the next step, the correlation between all variables is calculated to check for potential collinearity problems and to see which exogenous variables have the strongest relation with MEDV.

In [None]:
cor_matrix = df2.corr().abs().round(2)
sns.set(rc={'figure.figsize':(12,6)})
sns.heatmap(data=cor_matrix , annot=True)
cor_matrix

The table and plot above show the strength of correlation in absolute values. As the earlier plots suggested, LSTAT and RM have the strongest correlation with MEDV, equal to 0.66 and 0.68. The other variables are less strongly correlated, but for many of them the correlation is about 0.4-0.45, which is not negligible. Among the x variables, the strongest relation is between RAD and TAX, with a correlation above 0.9, which is strong enough to be called collinearity. Because TAX is more highly correlated with the Y variable, it is kept for further analysis and RAD is removed.
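
A collinearity check of this kind can be automated by listing every pair of columns whose absolute correlation exceeds a threshold. This is a minimal sketch on synthetic data; the function name `correlated_pairs`, the 0.9 threshold and the columns `a`, `b`, `c` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9):
    """List (col_a, col_b, |r|) for column pairs with |r| above threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):     # upper triangle only
            r = corr.iloc[i, j]
            if r > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    return pairs

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # independent of both
})
pairs = correlated_pairs(toy)
print(pairs)
```

Applied to the notebook's data, the same idea would be expected to flag the RAD and TAX pair discussed above.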

In the next step, principal component analysis was implemented to check whether the number of variables could be reduced. For this purpose, all exogenous variables were standardized.

In [None]:
x_names = x_names.drop("RAD")
x_scaled = StandardScaler().fit_transform(df2[x_names])
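features = x_scaled.T
cov_matrix = np.cov(features)
# eigh suits a symmetric covariance matrix and returns real eigenvalues in ascending order
values, vectors = np.linalg.eigh(cov_matrix)
order = np.argsort(values)[::-1]                      # largest eigenvalue first
values, vectors = values[order], vectors[:, order]
explained_variances = values / np.sum(values)         # share of variance per component
cum_variances = np.cumsum(explained_variances)        # running total of those shares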
 
print(explained_variances)
print(cum_variances)
plt.plot(explained_variances, label = "explained variance")
plt.plot(cum_variances, label = "cumulative explained variance")
plt.legend(loc = "right")

PCA shows how much of the total variance of the standardized exogenous variables is retained in a space with fewer dimensions. For example, the first component explains about 47% of the variance, the first two together about 59%, and so on. If there were a point where the gain in explained variance dropped sharply while the cumulative variance was already high (80-90%), it would be reasonable to consider transforming the data. That is not the case for this dataset, so no changes were made for the further analysis.
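
The same quantities can be obtained from scikit-learn's `PCA`, which returns the explained-variance ratios already sorted in descending order. The sketch below runs on synthetic data and is only an illustration; the 90% cut-off and all variable names are assumptions, not values taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
# Three columns that share a common factor plus two unrelated columns.
shared = [base + rng.normal(scale=0.3, size=(300, 1)) for _ in range(3)]
X = np.hstack(shared + [rng.normal(size=(300, 2))])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

ratios = pca.explained_variance_ratio_      # share of variance per component
cumulative = np.cumsum(ratios)              # running total
# Smallest number of components whose cumulative share reaches 90%.
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(ratios.round(3), cumulative.round(3), n_components_90)
```

A small `n_components_90` relative to the number of columns would suggest that a reduced representation is worth considering.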

Before fitting any models, the dataset has to be split into training and test sets in a 70:30 proportion. Besides the unscaled dataset, a standardized version is also prepared.

In [None]:
x_train,x_test,y_train,y_test= train_test_split(df2[x_names],df2[y_name],test_size=0.3,random_state=1)
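scaler = StandardScaler().fit(x_train)        # learn mean and std from the training set only
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)      # reuse the training statistics on the test set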
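
One way to keep the scaling tied to the training data is to wrap the scaler and the model in a scikit-learn pipeline, so that the scaler is fitted only on whatever data the pipeline is fitted on. This is a sketch on synthetic data, assuming a linear-kernel SVR as the downstream model; the names `pipe`, `X_tr` and `X_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# The scaler inside the pipeline sees only the data passed to fit().
pipe = make_pipeline(StandardScaler(), SVR(kernel="linear"))
pipe.fit(X_tr, y_tr)
r2_test = pipe.score(X_te, y_te)            # test data is scaled with training statistics

scaler_step = pipe.named_steps["standardscaler"]
print(round(r2_test, 3), scaler_step.mean_.round(2))
```

A pipeline can also be passed to `cross_val_score` or `GridSearchCV`, in which case the scaler is refitted inside every fold.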

## Linear Regression

The first model is a classical linear regression.

In [None]:
model = LinearRegression()
model.fit(x_train,y_train)
#print(model.intercept_)
#print(model.coef_)
x_train2 = sm.add_constant(x_train)
est = sm.OLS(y_train, x_train2)
est2 = est.fit()
print(est2.summary())
y_pred_train1 = model.predict(x_train)
y_pred_test1 = model.predict(x_test)
print("MSE train", round(mean_squared_error(y_train,y_pred_train1),2))
print("MAE train", round(mean_absolute_error(y_train,y_pred_train1),2))
print("RMSE train", round(np.sqrt(mean_squared_error(y_train,y_pred_train1)),2))
print("MSE test", round(mean_squared_error(y_test,y_pred_test1),2))
print("MAE test", round(mean_absolute_error(y_test,y_pred_test1),2))
print("RMSE test", round(np.sqrt(mean_squared_error(y_test,y_pred_test1)),2))
score = cross_val_score(model,x_train,y_train, scoring ="r2" ,cv=RepeatedKFold(n_splits=10, n_repeats=3, random_state=1),n_jobs=-1)
print("Average value of r2 score for cross validation was equal to {}.".format(round(score.mean(),4)))
plt.scatter(y_pred_test1, y_test)
plt.plot(y_test, y_test, color = "red")
plt.title("Plot of real values vs predicted")
plt.xlabel('Predictions')
plt.ylabel('Real values')

First, it is worth looking at which variables are most significant and how they affect the model. Judging by the t statistics and the p-values for the hypothesis that a coefficient equals 0, the three most significant variables are:
 - RM - average number of rooms per dwelling
 - DIS - weighted distances to five Boston employment centres
 - LSTAT - % lower status of the population

They can be interpreted as follows:
 - Each additional room increases the price of a home by around 6053 dollars, ceteris paribus
 - An increase of 1 unit in the weighted distance to the Boston employment centres decreases the price of a home by around 1867 dollars, ceteris paribus
 - An increase of 1 percentage point in the lower-status population decreases the price of a home by around 284 dollars, ceteris paribus

Cross-validation used 10 splits and 3 repeats, with r2 as the scoring method. The average r2 was 64.28%, about 4.5 percentage points lower than the r2 for the whole training set (68.8%). The metrics chosen to measure prediction quality were MSE, MAE and RMSE. The differences between the training and test sets are very small; MAE is even lower on the test set.
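
The dollar figures follow from MEDV being measured in thousands of dollars, so each coefficient is multiplied by 1000. The short sketch below redoes that arithmetic; the coefficient values are back-computed from the dollar amounts quoted above and are therefore assumptions, not numbers read from the regression output.

```python
# Coefficients implied by the interpretation above (MEDV is in thousands of dollars).
coefs = {"RM": 6.053, "DIS": -1.867, "LSTAT": -0.284}

# Effect of a one-unit increase in each variable, expressed in dollars.
dollar_effects = {name: round(value * 1000) for name, value in coefs.items()}

# Gap between the training r2 and the cross-validated r2, in percentage points.
r2_train = 0.688
r2_cv = 0.6428
gap_points = round((r2_train - r2_cv) * 100, 2)

print(dollar_effects, gap_points)
```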

## Support vector regression

The next method is SVR. Because it is based on distances between observations, the model is trained on the standardized dataset.

In [None]:
model2 = SVR()
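kernel = ["linear","sigmoid","rbf","poly"]   # candidate kernels; not used in the grid below
tolerance = [1e-3, 1e-4, 1e-5, 1e-6]
C = [1, 1.5, 2, 2.5, 3, 4, 5]
# only the linear kernel is searched; coef_, used later for the importance plot, exists only for a linear kernel
grid = dict(kernel=["linear"], tol=tolerance, C=C)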
cvFold = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
gridSearch = GridSearchCV(estimator=model2, param_grid=grid, n_jobs=-1,
	cv=cvFold, scoring="neg_mean_squared_error")
searchResults = gridSearch.fit(x_train_scaled, y_train)
bestModel = searchResults.best_estimator_

SVR requires several hyperparameters, such as the kernel, the tolerance and the cost C. Here the kernel is fixed to linear, and the tolerance and cost are tuned by grid search. Based on cross-validation with 10 splits and 3 repeats, scored by MSE, the best combination of parameters was chosen and used to fit the model.
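
After a grid search it is often useful to look at which parameters won and what cross-validated error they achieved. The sketch below shows this on synthetic data with a smaller grid and fewer folds than in the notebook; the data, the grid values and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=120)

search = GridSearchCV(
    estimator=SVR(kernel="linear"),
    param_grid={"C": [1, 2.5, 5], "tol": [1e-3, 1e-5]},
    cv=RepeatedKFold(n_splits=5, n_repeats=2, random_state=1),
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_                     # winning C and tol
# best_score_ is the mean negative MSE, so flip the sign before taking the root.
best_cv_rmse = float(np.sqrt(-search.best_score_))
print(best_params, round(best_cv_rmse, 3))
```

Because the scoring is `neg_mean_squared_error`, higher (less negative) scores are better, which is why the sign has to be flipped to read the result as an error.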

In [None]:
y_pred_train2 = bestModel.predict(x_train_scaled)
y_pred_test2 = bestModel.predict(x_test_scaled)
print("MSE train", round(mean_squared_error(y_train,y_pred_train2),2))
print("MAE train", round(mean_absolute_error(y_train,y_pred_train2),2))
print("RMSE train", round(np.sqrt(mean_squared_error(y_train,y_pred_train2)),2))
print("MSE test", round(mean_squared_error(y_test,y_pred_test2),2))
print("MAE test", round(mean_absolute_error(y_test,y_pred_test2),2))
print("RMSE test", round(np.sqrt(mean_squared_error(y_test,y_pred_test2)),2))
print(bestModel)
plt.scatter(y_pred_test2, y_test)
plt.plot(y_pred_test2, y_pred_test2, color = "red")
plt.title("Plot of real values vs predicted")
plt.xlabel('Predictions')
plt.ylabel('Real values')
plt.show()
pd.Series(abs(bestModel.coef_[0]), index=x_names).nlargest(12).plot(kind='barh')
plt.title("Plot of variable importance for SVR")


The errors of this model were slightly higher on the test set than on the training set, and higher than for linear regression. The most important feature for the SVR model is the number of rooms per dwelling, followed by the distance to the employment centres and the % of lower-status population, so the picture is very similar to the linear regression case.

## Gradient boosting regressor

The last method is GBR, trained on the standardized dataset.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
GBR=GradientBoostingRegressor()
search_grid={'n_estimators':[25,50,100,200],'learning_rate':[0.15,0.1,0.05,0.01]}
search=GridSearchCV(estimator=GBR,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=cv)
search_fit = search.fit(x_train_scaled, y_train)
best_model_gbr = search_fit.best_estimator_

For GBR, cross-validation was used to tune two parameters: the number of estimators and the learning rate. The remaining parameters were left at their defaults. As for the previous models, cross-validation used 10 splits and 3 repeats with MSE for scoring, and the best set of parameters was chosen to fit the model.

In [None]:
y_pred_train3 = best_model_gbr.predict(x_train_scaled)   # use the GBR model, not the SVR bestModel
print("MSE train", round(mean_squared_error(y_train,y_pred_train3),2))
print("MAE train", round(mean_absolute_error(y_train,y_pred_train3),2))
print("RMSE train", round(np.sqrt(mean_squared_error(y_train,y_pred_train3)),2))
y_pred_test3 = best_model_gbr.predict(x_test_scaled)
print("MSE test", round(mean_squared_error(y_test,y_pred_test3),2))
print("MAE test", round(mean_absolute_error(y_test,y_pred_test3),2))
print("RMSE test", round(np.sqrt(mean_squared_error(y_test,y_pred_test3)),2))
print(best_model_gbr)

plt.scatter(y_pred_test3, y_test)
plt.plot(y_pred_test3, y_pred_test3, color = "red")
plt.title("Plot of real values vs predicted")
plt.xlabel('Predictions')
plt.ylabel('Real values')
plt.show()
pd.Series(abs(best_model_gbr.feature_importances_), index=x_names).nlargest(12).plot(kind='barh')
plt.title("Plot of variable importance for GBR")

The GBR model achieved the lowest test-set error of the three methods. Both the training and the test metrics come from `best_model_gbr`, so the gap between them shows how closely the model fits the training data. For GBR the most important variables were the % of lower-status population and the number of rooms per dwelling. Unlike in the previous methods, the other variables were far less important.

## Summary

Based on the chosen metrics, the best method was GBR, the second was linear regression and the last was SVR. The errors of all methods show that the predictions were not very precise, but the models at least indicated which factors have the greatest impact on home prices. The 3 most important features were:
- RM - average number of rooms per dwelling
- LSTAT - % lower status of the population
- DIS - weighted distances to five Boston employment centres

On the other hand, the 3 features with the lowest impact were:
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town
- CHAS - indicates whether the tract bounds the river
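
A ranking of methods like the one in this summary is easier to read when the metrics of all models sit in one table. The sketch below builds such a table from made-up predictions; the helper `metrics_row`, the labels `model_a` and `model_b` and all numbers are hypothetical and do not reproduce the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

def metrics_row(y_true, y_pred):
    """Return MSE, MAE and RMSE for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse,
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mse)),
    }

# Made-up target values and predictions from two hypothetical models.
y_true = np.array([20.0, 25.0, 30.0, 35.0])
preds = {
    "model_a": np.array([21.0, 24.0, 31.0, 34.0]),
    "model_b": np.array([22.0, 23.0, 32.0, 33.0]),
}

# One row per model, sorted so the lowest RMSE comes first.
table = pd.DataFrame({name: metrics_row(y_true, p) for name, p in preds.items()}).T
table = table.sort_values("RMSE")
print(table)
```

In the notebook the dictionary could instead hold the test-set predictions of the three fitted models.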


