# Understanding Over & Underfitting
## Predicting Boston Housing Prices

## Getting Started
In this project, you will use the Boston Housing Prices dataset to build several models to predict the prices of homes with particular qualities from the suburbs of Boston, MA.
We will build models with several different parameters, which will change the goodness of fit for each. 

---
## Data Exploration
Since we want to predict the value of houses, the **target variable**, `'MEDV'`, will be the variable we seek to predict.

### Import and explore the data. Clean the data for outliers and missing values. 

In [158]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [159]:
boston = pd.read_csv('../data/boston_data.csv')
boston

| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.15876 | 0.0 | 10.81 | 0.0 | 0.413 | 5.961 | 17.5 | 5.2873 | 4.0 | 305.0 | 19.2 | 376.94 | 9.88 | 21.7 |
| 1 | 0.10328 | 25.0 | 5.13 | 0.0 | 0.453 | 5.927 | 47.2 | 6.9320 | 8.0 | 284.0 | 19.7 | 396.90 | 9.22 | 19.6 |
| 2 | 0.34940 | 0.0 | 9.90 | 0.0 | 0.544 | 5.972 | 76.7 | 3.1025 | 4.0 | 304.0 | 18.4 | 396.24 | 9.97 | 20.3 |
| 3 | 2.73397 | 0.0 | 19.58 | 0.0 | 0.871 | 5.597 | 94.9 | 1.5257 | 5.0 | 403.0 | 14.7 | 351.85 | 21.45 | 15.4 |
| 4 | 0.04337 | 21.0 | 5.64 | 0.0 | 0.439 | 6.115 | 63.0 | 6.8147 | 4.0 | 243.0 | 16.8 | 393.97 | 9.43 | 20.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 399 | 9.32909 | 0.0 | 18.10 | 0.0 | 0.713 | 6.185 | 98.7 | 2.2616 | 24.0 | 666.0 | 20.2 | 396.90 | 18.13 | 14.1 |
| 400 | 51.13580 | 0.0 | 18.10 | 0.0 | 0.597 | 5.757 | 100.0 | 1.4130 | 24.0 | 666.0 | 20.2 | 2.60 | 10.11 | 15.0 |
| 401 | 0.01501 | 90.0 | 1.21 | 1.0 | 0.401 | 7.923 | 24.8 | 5.8850 | 1.0 | 198.0 | 13.6 | 395.52 | 3.16 | 50.0 |
| 402 | 0.02055 | 85.0 | 0.74 | 0.0 | 0.410 | 6.383 | 35.7 | 9.1876 | 2.0 | 313.0 | 17.3 | 396.90 | 5.77 | 24.7 |


In [162]:
boston.isnull().sum() # No missing values
boston.describe()

| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 |
| mean | 3.730912 | 10.509901 | 11.189901 | 0.069307 | 0.55671 | 6.30145 | 68.601733 | 3.799666 | 9.836634 | 411.688119 | 18.444554 | 355.068243 | 12.598936 | 22.312376 |
| std | 8.943922 | 22.053733 | 6.814909 | 0.25429 | 0.117321 | 0.67583 | 28.066143 | 2.109916 | 8.834741 | 171.073553 | 2.150295 | 94.489572 | 6.925173 | 8.837019 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.392 | 3.561 | 2.9 | 1.1691 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082382 | 0.0 | 5.19 | 0.0 | 0.453 | 5.90275 | 45.8 | 2.087875 | 4.0 | 281.0 | 17.375 | 374.71 | 7.135 | 17.1 |
| 50% | 0.253715 | 0.0 | 9.795 | 0.0 | 0.538 | 6.2305 | 76.6 | 3.20745 | 5.0 | 330.0 | 19.0 | 391.065 | 11.265 | 21.4 |
| 75% | 4.053158 | 12.5 | 18.1 | 0.0 | 0.631 | 6.62925 | 94.15 | 5.222125 | 24.0 | 666.0 | 20.2 | 396.0075 | 16.91 | 25.0 |
| max | 88.9762 | 95.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 34.37 | 50.0 |
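The task heading also asks for an outlier check, which the cells above do not perform. One common approach is the 1.5 × IQR rule. The following is a minimal sketch on a small made-up column. The helper name `iqr_outlier_mask` and the toy values are assumptions for illustration and are not part of the notebook. Whether flagged rows should actually be dropped is a judgment call that depends on the feature.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values standing in for a skewed column such as 'crim'.
toy = pd.DataFrame({"crim": [0.1, 0.2, 0.15, 0.3, 0.25, 88.9]})

mask = iqr_outlier_mask(toy["crim"])
flagged = toy[mask]      # rows the rule marks as outliers
cleaned = toy[~mask]     # rows that would be kept
print(flagged)
print(len(cleaned), "rows kept")
```

The same mask could be computed per column on `boston` and inspected before deciding what, if anything, to remove.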


### Next, we want to explore the data. Pick several variables you think will be most correlated with the prices of homes in Boston, and create plots that show the data dispersion as well as the regression line of best fit.

In [None]:
fig, axs = plt.subplots(2,2, figsize=(12,12))

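# Recent seaborn releases require x and y to be passed as keyword arguments.
# order=3 fits a cubic polynomial instead of a straight line.
sns.regplot(x='crim', y='medv', data=boston, ax=axs[0, 0], order=3)
sns.regplot(x='rm', y='medv', data=boston, ax=axs[0, 1])
sns.regplot(x='black', y='medv', data=boston, ax=axs[1, 0], order=3)
sns.regplot(x='lstat', y='medv', data=boston, ax=axs[1, 1], order=3)
plt.show()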

### What do these plots tell you about the relationships between these variables and the prices of homes in Boston? Are these the relationships you expected to see in these variables?

##### Number of rooms (`rm`) and the share of lower-status population (`lstat`) are clearly correlated with the housing price, as I expected. The relationship with the number of rooms looks fairly linear. The one with `lstat` does not: its impact seems to shrink as `lstat` increases, which is roughly what I would expect.

##### The crime rate seems to have an impact: the price decreases up to a point, but there are very few samples at higher rates, so that part of the curve is not representative.

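##### For the `black` variable I don't see what I expected, namely a concentration of lower prices where the proportion of Black residents is higher. The vast majority of points sit at the top of the scale, and prices there are spread over the whole range. Note that in the original dataset this column is the transformed value 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town, so the axis cannot be read directly as a percentage.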

### Make a heatmap of the remaining variables. Are there any variables that you did not consider that have very high correlations? What are they?

##### INDUS, NOX, AGE, TAX and PTRATIO also seem to have some negative correlation with the price. (I plotted some of them and don't see much of a relationship in terms of an x/y curve.)

In [None]:
fig, ax = plt.subplots(1,1, figsize = (12,12))
sns.heatmap(boston.corr(method='spearman'),annot = True, ax=ax)
plt.show()
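To read the target row of the heatmap more easily, the correlations with `medv` can be pulled out and sorted by absolute value. The sketch below does this on synthetic data. The column names are borrowed from the dataset, but the values are randomly generated assumptions, so the numbers say nothing about the real Boston data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real frame (assumed values, for illustration only).
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.3, 0.7, n)
lstat = rng.uniform(2, 35, n)
toy = pd.DataFrame({
    "rm": rm,
    "lstat": lstat,
    "noise": rng.normal(0, 1, n),  # unrelated to the target by construction
    "medv": 5 * rm - 0.8 * lstat + rng.normal(0, 2, n),
})

# Spearman correlation of every feature with the target, target itself removed.
target_corr = toy.corr(method="spearman")["medv"].drop("medv")

# Order features from strongest to weakest absolute correlation.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

On the notebook's data the same three lines would apply with `boston` in place of `toy`.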

### Calculate Statistics
Calculate descriptive statistics for housing price. Include the minimum, maximum, mean, median, and standard deviation. 

In [None]:
print(boston.medv.describe())
print('median:',boston.medv.median())

----

## Developing a Model

### Implementation: Define a Performance Metric
What is the performance metric with which you will determine the performance of your model? Create a function that calculates this performance metric and returns the score.

In [None]:
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between 
        true and predicted values based on the metric chosen. """
    return r2_score(y_true, y_predict)
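The metric chosen here, R², is defined as 1 - SS_res / SS_tot: a score of 1.0 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse. As a sanity check, the sketch below computes it by hand and compares it with `r2_score`. The true values are the first five `medv` entries shown earlier, and the predictions are made-up numbers for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([21.7, 19.6, 20.3, 15.4, 20.5])  # first five medv values
y_pred = np.array([22.0, 19.0, 21.0, 16.0, 20.0])  # hypothetical predictions

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

r2_library = r2_score(y_true, y_pred)
print(r2_manual, r2_library)  # the two values should agree
```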

### Implementation: Shuffle and Split Data
Split the data into training and testing datasets. Shuffle the data as well to remove any bias in selecting the training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

X = np.array(boston.iloc[:,:-1])
y = np.array(boston.iloc[:,-1])

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25, random_state=0)
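As a quick check of what this split produces, the sketch below applies the same call to placeholder arrays with the notebook's dimensions (404 rows, 13 features). The arrays `X_demo` and `y_demo` are assumptions used only to show the resulting shapes and the effect of `random_state`. `train_test_split` shuffles by default, so no separate shuffling step is needed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same dimensions as the notebook's X and y.
X_demo = np.zeros((404, 13))
y_demo = np.arange(404)  # row ids, so we can see which rows land where

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(X_tr.shape, X_te.shape)  # (303, 13) (101, 13)

# The same random_state reproduces exactly the same split.
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(np.array_equal(y_te, y_te_again))  # True
```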

----

## Analyzing Model Performance
Next, we are going to build a Random Forest Regressor, and test its performance with several different parameter settings.

### Learning Curves
Let's build the different models. Set the max_depth parameter to 2, 4, 6, 8, and 10 respectively.

In [None]:
from sklearn.ensemble import RandomForestRegressor

# One forest per candidate depth; the models are fitted in the next cell.
rfr_models = [RandomForestRegressor(max_depth=p) for p in [2, 4, 6, 8, 10]]

Now, plot the score for each model on the training set and on the testing set.

In [None]:
y_pred_train_scores = []
y_pred_test_scores = []

# Fit each model and record its R^2 score on the training and test sets.
for rfr in rfr_models:
    rfr.fit(X_train, y_train)
    y_pred_train = rfr.predict(X_train)
    y_pred_test = rfr.predict(X_test)
    y_pred_train_scores.append(performance_metric(y_train, y_pred_train))
    y_pred_test_scores.append(performance_metric(y_test, y_pred_test))

plt.plot([2,4,6,8,10],y_pred_train_scores,label='train')
plt.plot([2,4,6,8,10],y_pred_test_scores,label='test')
plt.xticks([2,4,6,8,10])
plt.xlabel('max_depth')
plt.ylabel('r2score')
plt.legend()
plt.show()

What do these results tell you about the effect of the depth of the trees on the performance of the model?

##### Increasing the depth improves the training score up to a point. From depth 8 to depth 10 the test score gets worse, which suggests the model is starting to overfit. In this case max_depth = 8 would be the better choice.
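Choosing the depth from test-set scores means the test set is no longer a fully independent check. A common alternative is to compare depths with cross-validation on the training data only. The sketch below shows one way to do that. It runs on synthetic data from `make_regression`, and `n_estimators=30` and the other generator settings are assumptions chosen to keep it fast, so the selected depth says nothing about the Boston data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the training data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, n_informative=5, noise=10.0, random_state=0
)

depths = [2, 4, 6, 8, 10]
cv_means = []
for depth in depths:
    model = RandomForestRegressor(
        n_estimators=30, max_depth=depth, random_state=0
    )
    # 5-fold cross-validated R^2, averaged over the folds.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_means.append(scores.mean())

best_depth = depths[int(np.argmax(cv_means))]
print(dict(zip(depths, np.round(cv_means, 3))), "best:", best_depth)
```

In the notebook, `X_train` and `y_train` would take the place of the synthetic arrays, leaving `X_test` untouched until the final evaluation.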

### Bias-Variance Tradeoff
When the model is trained with a maximum depth of 1, does the model suffer from high bias or from high variance? How about when the model is trained with a maximum depth of 10?

##### With max_depth = 1 the model suffers from high bias: it is not complex enough to capture the structure in the data, so it underfits. With max_depth = 10 the model has high variance: it is complex enough to fit noise in the training data, so it overfits and generalizes less well (test results are worse than for max_depth = 8).
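One rough way to see the two regimes is to compare training and test scores at both depths: low scores on both sets point to high bias, while a large gap between them points to high variance. The sketch below does this on synthetic data. The dataset, `n_estimators=50`, and the seeds are assumptions, so the exact numbers are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic noisy regression problem (not the Boston data).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=13, n_informative=5, noise=15.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

results = {}
for depth in (1, 10):
    model = RandomForestRegressor(
        n_estimators=50, max_depth=depth, random_state=0
    ).fit(X_tr, y_tr)
    train_r2 = model.score(X_tr, y_tr)  # .score() returns R^2 for regressors
    test_r2 = model.score(X_te, y_te)
    results[depth] = {"train": train_r2, "test": test_r2,
                      "gap": train_r2 - test_r2}

for depth, res in results.items():
    print(depth, {k: round(v, 3) for k, v in res.items()})
```

Because a random forest averages many trees, its variance is usually lower than that of a single deep tree, so the gap at depth 10 may be smaller than this kind of analysis would show for one decision tree.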

### Best-Guess Optimal Model
What is the max_depth parameter that you think would optimize the model? Run your model and explain its performance.

In [None]:
y_pred_train_scores = []
y_pred_test_scores = []

depths = np.arange(1,11,1)

for p in depths:
    rfr = RandomForestRegressor(max_depth=p)
    rfr.fit(X_train, y_train)
    y_pred_train=rfr.predict(X_train)
    y_pred_test = rfr.predict(X_test)
    y_pred_train_scores.append(performance_metric(y_train, y_pred_train))
    y_pred_test_scores.append(performance_metric(y_test, y_pred_test))

plt.plot(depths,y_pred_train_scores,label='train')
plt.plot(depths,y_pred_test_scores,label='test')
plt.xlabel('max_depth')
plt.xticks(depths)
plt.ylabel('r2score')
plt.legend()
plt.show()

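##### The results vary from run to run even though `train_test_split` uses a fixed `random_state`. That seed only fixes the split: `RandomForestRegressor` draws its own bootstrap samples, so it needs its own `random_state` to give repeatable scores. With the result above, the best choice is max_depth = 8, since it gives the highest test score.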
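The run-to-run variation can be checked directly: fitting the same forest twice with a fixed `random_state` gives identical predictions, while leaving it unset usually does not. This is a minimal sketch on synthetic data. The helper `fit_predict`, the dataset, and `n_estimators=20` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=0
)

def fit_predict(seed):
    """Hypothetical helper: fit a forest with the given seed, return predictions."""
    model = RandomForestRegressor(n_estimators=20, max_depth=8, random_state=seed)
    return model.fit(X_demo, y_demo).predict(X_demo)

seeded_a = fit_predict(0)
seeded_b = fit_predict(0)
unseeded_a = fit_predict(None)
unseeded_b = fit_predict(None)

print("seeded runs identical:  ", np.array_equal(seeded_a, seeded_b))
print("unseeded runs identical:", np.array_equal(unseeded_a, unseeded_b))
```

In the notebook, this would correspond to writing `RandomForestRegressor(max_depth=p, random_state=0)` inside the loop.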

### Applicability
*In a few sentences, discuss whether the constructed model should or should not be used in a real-world setting.*  
**Hint:** Some questions to consider:
- *How relevant today is data that was collected in 1978?*
- *Are the features present in the data sufficient to describe a home?*
- *Is the model robust enough to make consistent predictions?*
- *Would data collected in an urban city like Boston be applicable in a rural city?*

##### Data that is more than 40 years old should not be used to predict house values, which vary a lot over time due to speculation and inflation. There is some variation in the test results, but overall the r2_score stays consistently above 0.80, which suggests some robustness when max_depth > 3. So the model should give some degree of confidence if what we were trying to predict was the housing price in 1978. I would also prefer a dataset with more rows. The model would be reasonable for Boston, but I don't think it would apply to differently stratified towns or to rural cities.