# Understanding Over & Underfitting

In [None]:
# Imports

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

from sklearn.ensemble import RandomForestRegressor

## Predicting Boston Housing Prices

## Getting Started
In this project, you will use the Boston Housing Prices dataset to build several models to predict the prices of homes with particular qualities from the suburbs of Boston, MA.
We will build models with several different parameters, which will change the goodness of fit for each. 

---
## Data Exploration
Since we want to predict the value of houses, the **target variable**, `'MEDV'`, will be the variable we seek to predict.

### Import and explore the data. Clean the data for outliers and missing values. 

In [None]:
housing = pd.read_csv('../data/boston_data.csv')

housing.shape

In [None]:
housing.sample()

In [None]:
housing.describe()

In [None]:
housing.isna().sum()

In [None]:
# Getting rid of outliers

Q1 = housing.describe().loc['25%']
Q3 = housing.describe().loc['75%']

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Dropping all rows containing any values that are outside of the lower and upper bounds
housing = housing[~((housing < lower_bound) | (housing > upper_bound)).any(axis = 1)]
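
As a small illustration of the rule applied above, the sketch below runs the same 1.5 × IQR fences on a made-up DataFrame (the column names and values are hypothetical, not taken from the housing data). It uses `DataFrame.quantile` instead of `describe()`, which should give the same quartiles.

```python
import pandas as pd

# Hypothetical toy data: one obvious outlier in each column.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, -50.0, 14.0],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# A row is dropped if ANY of its values falls outside its column's fences.
outlier_rows = ((toy < lower) | (toy > upper)).any(axis=1)
cleaned = toy[~outlier_rows]

print(f"Dropped {outlier_rows.sum()} of {len(toy)} rows")
```

Because a single out-of-range value in any column removes the whole row, this filter can discard a large share of a dataset with many columns, so it is worth comparing `housing.shape` before and after.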

### Next, we want to explore the data. Pick several variables you think will be most correlated with the prices of homes in Boston, and create plots that show the data dispersion as well as the regression line of best fit.

In [None]:
corr = housing.corr()

mask = np.zeros_like(corr, dtype = bool)
mask[np.triu_indices_from(mask)] = True

sns.heatmap(corr, mask = mask, annot = np.round(corr, 1), annot_kws = {"size": 8})

plt.show()
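
A heatmap can be hard to read precisely, so it may help to also rank the features by the absolute value of their correlation with the target. The sketch below does this on synthetic data; the column names `x1`, `x2`, and `target` are placeholders, and in the notebook the same expression would be applied to `corr['medv']`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: x1 drives the target, x2 is unrelated noise.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
target = 2.0 * x1 + rng.normal(scale=0.5, size=n)

df = pd.DataFrame({"x1": x1, "x2": x2, "target": target})

# Absolute correlation of every feature with the target, strongest first.
ranking = (
    df.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```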

In [None]:
# Most correlated variables:
# INDUS (-0.5), NOX (-0.5), RM (0.6), AGE (-0.5), LSTAT (-0.7)

# One scatter plot with a fitted regression line per selected variable
for var in ['indus', 'nox', 'rm', 'age', 'lstat']:
    sns.regplot(x = var, y = 'medv', data = housing)
    plt.show()

### What do these plots tell you about the relationships between these variables and the prices of homes in Boston? Are these the relationships you expected to see in these variables?

In [None]:
# These relationships are what we would expect to see based on previous research and common sense: areas with higher non-retail business acreage, higher nitric oxides concentration, lower average number of rooms, higher proportion of older homes, and higher proportion of lower-income residents are generally less desirable and therefore have lower housing prices.

### Make a heatmap of the remaining variables. Are there any variables that you did not consider that have very high correlations? What are they?

In [None]:
remaining_vars = housing[[var for var in housing.columns if var not in ['indus', 'nox', 'rm', 'age', 'lstat']]]

remaining_corr = remaining_vars.corr()

mask = np.zeros_like(remaining_corr, dtype = bool)
mask[np.triu_indices_from(mask)] = True

sns.heatmap(remaining_corr, annot = True, mask = mask, fmt = '.1f')

plt.show()

# There are no other variables that have a very high correlation with MEDV.

### Calculate Statistics
Calculate descriptive statistics for housing price. Include the minimum, maximum, mean, median, and standard deviation. 

In [None]:
# After we've taken care of the outliers (also those in MEDV)

medv = housing['medv']

min_medv = medv.min()
max_medv = medv.max()
mean_medv = medv.mean()
median_medv = medv.median()
std_medv = medv.std()

print('Minimum housing price:', min_medv)
print('Maximum housing price:', max_medv)
print('Mean housing price:', mean_medv)
print('Median housing price:', median_medv)
print('Standard deviation of housing price:', std_medv)

----

## Developing a Model

### Implementation: Define a Performance Metric
Which performance metric will you use to evaluate your model? Create a function that calculates this metric and returns the score.

In [None]:
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between 
        true and predicted values based on the metric chosen. """
    return r2_score(y_true, y_predict)
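
For reference, R^2 compares the model's squared error with that of a baseline that always predicts the mean of the true values: `R^2 = 1 - SS_res / SS_tot`. The NumPy sketch below computes this by hand on made-up numbers; `manual_r2` is a hypothetical helper for illustration, not part of scikit-learn.

```python
import numpy as np

def manual_r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed directly from the definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline
    return 1.0 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

score = manual_r2(y_true, y_pred)
baseline = manual_r2(y_true, [np.mean(y_true)] * len(y_true))

print(score, baseline)
```

A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse.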

### Implementation: Shuffle and Split Data
Split the data into training and testing datasets. Shuffle the data as well to remove any bias in selecting the training and test sets.

In [None]:
X = housing.drop('medv', axis = 1)
y = housing['medv']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, train_size = 0.8)

X_train, y_train = shuffle(X_train, y_train, random_state = 0)
X_test, y_test = shuffle(X_test, y_test, random_state = 0)
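
Note that `train_test_split` shuffles by default (`shuffle=True`), so the extra `shuffle` calls above mainly reorder rows within each split. What shuffle-and-split does can be sketched in plain NumPy; the function name `shuffle_split_indices` is hypothetical, and this is not scikit-learn's actual implementation.

```python
import numpy as np

def shuffle_split_indices(n_samples, train_size=0.8, seed=0):
    """Return (train_idx, test_idx) after a seeded random permutation."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(n_samples)
    n_train = int(round(train_size * n_samples))
    return permuted[:n_train], permuted[n_train:]

train_idx, test_idx = shuffle_split_indices(10)

print("train:", train_idx)
print("test: ", test_idx)
```

Every row ends up in exactly one of the two sets, and fixing the seed makes the split repeatable.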

----

## Analyzing Model Performance
Next, we are going to build a Random Forest Regressor, and test its performance with several different parameter settings.

### Learning Curves
Let's build the different models. Set the `max_depth` parameter to 2, 4, 6, 8, and 10, respectively.

In [None]:
# One forest per depth; a fixed random_state keeps the results reproducible
depths = [2, 4, 6, 8, 10]
forests = [RandomForestRegressor(max_depth = depth, random_state = 0) for depth in depths]

for rf in forests:
    rf.fit(X_train, y_train)

Now, plot the score for each forest on the training set and on the testing set.

In [None]:
training_scores = [performance_metric(y_train, rf.predict(X_train)) for rf in forests]
testing_scores = [performance_metric(y_test, rf.predict(X_test)) for rf in forests]

plt.plot(training_scores, label = 'Training scores')
plt.plot(testing_scores, label = 'Testing scores')

plt.xticks([0, 1, 2, 3, 4], [2, 4, 6, 8, 10])
plt.xlabel('Depth of trees')
plt.ylabel('Score')

plt.legend()
plt.show()

What do these results tell you about the effect of the depth of the trees on the performance of the model?

In [None]:
# The performance on the testing data initially improves and then decreases after a certain point.
# This indicates that the model is overfitting to the training data when the depth of the trees becomes too large.
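
A single train/test split can give a noisy picture of this trend. One way to check it, sketched below under the assumption that cross-validation is acceptable here, is scikit-learn's `validation_curve`, which scores each `max_depth` on several folds. The data is synthetic (`make_regression`) and the forest is kept small for speed; in the notebook, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the housing data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=15.0, random_state=0
)

depths = [2, 4, 6, 8, 10]

train_scores, val_scores = validation_curve(
    RandomForestRegressor(n_estimators=30, random_state=0),
    X_demo,
    y_demo,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="r2",
)

# Each row holds the scores of one depth across the 5 folds.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for depth, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={depth:>2}  train R^2={tr:.3f}  validation R^2={va:.3f}")
```

A training score that keeps rising while the validation score flattens or falls is the usual sign of overfitting.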

### Bias-Variance Tradeoff
When the model is trained with a maximum depth of 1, does the model suffer from high bias or from high variance? How about when the model is trained with a maximum depth of 10? Check out this article before answering: https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

In [None]:
# When the model is trained with a maximum depth of 1, it suffers from high bias.
# When the model is trained with a maximum depth of 10, it suffers from high variance.
# To achieve good performance, the goal is to find a model that strikes a balance between these two characteristics.
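
The two regimes can be made concrete with a single decision tree on noisy synthetic data (a sine curve plus Gaussian noise, chosen only for illustration). A depth-1 tree cannot follow the curve, while an unrestricted tree memorizes the training noise.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data: y = sin(x) + noise.
X_demo = rng.uniform(0, 2 * np.pi, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

results = {}
for name, depth in [("depth 1", 1), ("unrestricted", None)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[name] = (
        r2_score(y_tr, tree.predict(X_tr)),  # training score
        r2_score(y_te, tree.predict(X_te)),  # testing score
    )
    print(f"{name:>12}: train R^2={results[name][0]:.3f}, test R^2={results[name][1]:.3f}")
```

High bias tends to show up as modest scores on both sets, and high variance as a near-perfect training score with a noticeably lower testing score.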

### Best-Guess Optimal Model
What is the max_depth parameter that you think would optimize the model? Run your model and explain its performance.

In [None]:
# The optimal value of the max_depth parameter is likely to be somewhere between 4 and 6.
# The training and testing scores are both relatively high and there is not a significant gap between the two.

rf_depth_5 = RandomForestRegressor(max_depth = 5)
rf_depth_5.fit(X_train, y_train)

y_pred = rf_depth_5.predict(X_test)

performance_metric(y_test, y_pred)

### Applicability
*In a few sentences, discuss whether the constructed model should or should not be used in a real-world setting.*  
**Hint:** Some questions to answer:
- *How relevant today is data that was collected in 1978?*
- *Are the features present in the data sufficient to describe a home?*
- *Is the model robust enough to make consistent predictions?*
- *Would data collected in an urban city like Boston be applicable in a rural area?*

In [None]:
# The model constructed from 1978 data may not be relevant today as real estate prices and factors affecting them have likely changed over time.
# The features used to describe a home in the model may not be sufficient to accurately represent a home in a real-world setting, as there may be other important factors to consider.
# The model's performance, as indicated by the R^2 score, may not be robust enough to make consistent predictions.
# Real estate prices in urban cities like Boston may not be applicable to a rural city, as the factors affecting real estate prices may differ greatly between the two locations.
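
One rough way to probe the consistency question is to refit the model on several different random splits and look at the spread of the test scores. The sketch below does this on synthetic data; the number of repeats and the forest settings are arbitrary choices for illustration, and in the notebook `X` and `y` would replace `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=15.0, random_state=0
)

scores = []
for seed in range(5):
    # A different split and a different forest seed on every repeat.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, train_size=0.8, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=30, max_depth=5, random_state=seed)
    model.fit(X_tr, y_tr)
    scores.append(r2_score(y_te, model.predict(X_te)))

scores = np.array(scores)
print(f"test R^2: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

A standard deviation that is large relative to the mean would suggest that the reported score depends heavily on which rows happened to land in the test set.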