#### Introduction to Statistical Learning, Exercise 2.3

__Please do yourself a favour and only look at the solutions after you have honestly tried to solve the exercises.__

# Explore the Boston Housing Data Set

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from islpy import datasets
%matplotlib inline

### A. Learn About the Data Set

Read the documentation of the `Boston` data set. Then load it into a variable called `boston` and look at a few rows.

How many rows (observations) and columns (variables) are in the data set? What do the rows and columns represent? (You do not need to describe every variable in detail.)

In [None]:
help(datasets.Boston)

In [None]:
boston = datasets.Boston()
boston.head()

In [None]:
boston.shape

There are 506 rows and 14 columns in the data set. Each row corresponds to a different Boston suburb. The columns describe various properties of a suburb, such as crime rate and air quality.

### B. A Quick Look at the Data Set

Make some pairwise scatter plots of variables in the data set. Describe your findings.

With fourteen columns in the data set, the full scatter plot matrix will not be very readable. We chose a few variables that might be interesting.

In [None]:
sel = boston[['tax', 'indus', 'nox', 'medv', 'rm']]
sns.pairplot(data=sel)
plt.show()

We can see some expected correlations: property value (`medv`) is strongly correlated with the number of rooms (`rm`), and the nitrogen oxides concentration (`nox`), a measure of air quality, rises with the presence of industrial installations (`indus`). It looks like `medv` goes down as `indus` (and therefore `nox`) goes up, which is somewhat expected. Other scatter plots have more complicated structures that are harder to interpret.
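
The visual impressions from a scatter plot matrix can be backed up with correlation coefficients. The following is a minimal sketch using pandas' `DataFrame.corr()`. The frame `toy` and its numbers are made up for illustration and are not taken from the Boston data.

```python
import pandas as pd

# Toy stand-in for a few Boston columns (made-up numbers, not the real data).
toy = pd.DataFrame({
    'rm':   [5.0, 5.5, 6.0, 6.5, 7.0, 7.5],
    'medv': [12.0, 17.0, 20.0, 24.0, 31.0, 40.0],
    'nox':  [0.70, 0.65, 0.55, 0.50, 0.45, 0.40],
})

# Pairwise Pearson correlation coefficients, one row and column per variable.
corr = toy.corr()
print(corr.round(2))
```

Applied to the selection plotted above, the equivalent call would be `sel.corr()`. Pearson coefficients only capture linear association, so they complement the scatter plots rather than replace them.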

### C. Crime Rate

Are there any predictors associated with the per capita crime rate (`crim`)? If so, explain the nature of the relationship. 

We suspect that the crime rate might be related to `medv`, `ptratio`, `lstat` and `dis`:

In [None]:
sel = boston[['crim', 'medv', 'ptratio', 'lstat', 'dis']]
sns.pairplot(data=sel)
plt.show()

The crime rate does indeed seem related to `medv`, `lstat` and `dis`. Higher property values are correlated with lower crime rates. The highest crime rates occur in suburbs with the lowest weighted distance to job centres. A larger share of lower-status population (`lstat`) also seems to correlate with higher crime rates.
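
When one variable is heavily skewed, a rank-based (Spearman) correlation can describe a monotonic relationship better than the linear (Pearson) coefficient. The sketch below compares the two on a made-up frame `toy`. The values are illustrative assumptions, not the real data.

```python
import pandas as pd

# Made-up values: crime rate spans several orders of magnitude,
# while medv and dis fall steadily as crime rises.
toy = pd.DataFrame({
    'crim': [0.01, 0.05, 0.2, 1.0, 9.0, 70.0],
    'medv': [35.0, 30.0, 24.0, 20.0, 13.0, 8.0],
    'dis':  [8.0, 6.5, 5.0, 3.0, 2.0, 1.2],
})

# Correlation of every other column with crim, by two methods.
pearson = toy.corr(method='pearson')['crim'].drop('crim')
spearman = toy.corr(method='spearman')['crim'].drop('crim')

print(pd.DataFrame({'pearson': pearson, 'spearman': spearman}).round(2))
```

In this toy example the relationship is perfectly monotonic, so the Spearman coefficient is at its extreme while the Pearson coefficient is pulled towards zero by the single very large crime rate.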

### D. Extreme Values & Ranges

Do any suburbs have exceptionally high crime rates? Tax rates? Pupil-teacher ratios? Use the `describe()` and `nlargest()` methods to find out.

In [None]:
sel = boston[['crim', 'tax', 'ptratio']]
sel.describe()

In [None]:
sel.nlargest(20, 'crim')

In [None]:
sns.distplot(sel['crim'])
plt.show()

A fairly large number of suburbs lie far above the mean crime rate of 3.6. This can be seen from the table of largest values and from the long right tail in the distribution plot above.
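
A long right tail can also be quantified. The sketch below, on a made-up series `crim` (illustrative values only), compares mean and median, computes the share of observations above the mean, and reports the sample skewness.

```python
import pandas as pd

# Made-up, right-skewed crime rates (not the real data).
crim = pd.Series([0.02, 0.05, 0.1, 0.3, 0.8, 4.0, 9.0, 25.0], name='crim')

mean = crim.mean()
median = crim.median()
share_above_mean = (crim > mean).mean()  # fraction of True values
skew = crim.skew()                       # positive for a right tail

print(f"mean={mean:.2f}, median={median:.2f}, "
      f"share above mean={share_above_mean:.2f}, skew={skew:.2f}")
```

A mean well above the median together with a positive skewness is the numerical counterpart of the tail seen in the plot.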

In [None]:
sel[sel.tax > 600]

In [None]:
sns.distplot(sel['tax'], kde=False, bins=100)
plt.show()

There are 137 suburbs with exceptionally high tax rates above 600.
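
A count like this can be read off directly rather than from the printed table. The following sketch uses a made-up series `tax` (illustrative values only) to count the rows above a threshold and to show which distinct values they take.

```python
import pandas as pd

# Made-up tax rates (not the real data).
tax = pd.Series([222, 296, 307, 307, 403, 666, 666, 666, 711], name='tax')

high = tax[tax > 600]          # keep only rows above the threshold
n_high = len(high)             # number of such rows
counts = high.value_counts()   # how often each distinct value occurs

print(n_high)
print(counts)
```

On the notebook's data the same idea would be `(sel.tax > 600).sum()` and `sel.tax[sel.tax > 600].value_counts()`.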

In [None]:
sel.nsmallest(20, 'ptratio')

In [None]:
sns.distplot(sel['ptratio'])
plt.show()

There are some very low values of pupil-teacher ratio. The distribution shows a tail on the left.

### E. Counting

How many suburbs in the data set bound the Charles river? 

In [None]:
boston['chas'].sum()

There are 35 suburbs bounding the Charles river.

### F. Pupil-teacher Ratio Median

What is the median of the pupil-teacher ratio among the suburbs in the data set?

In [None]:
boston['ptratio'].median()

The median of the pupil-teacher ratio (`ptratio`) is 19.05.

### G. Single Suburb Properties

Which suburb has the lowest median value (`medv`) of owner-occupied homes? What are the values of some of the other predictors for this suburb and how do they compare to the overall distributions? Use the `idxmin()` method to find the row with the lowest value of `medv` and the `iloc` attribute to access it. Comment on your findings.

First we prepare a series representing the row with minimal `medv` and drop the `medv` value. Then we create a data frame with the value, mean, median and standard deviation of all the variables. 

In [None]:
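# idxmin() returns an index label while iloc expects a position;
# the two only coincide when the frame has a default 0..n-1 index.
sel = boston.iloc[boston['medv'].idxmin()].drop('medv')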
others = boston.drop('medv', axis=1)

df = pd.DataFrame({'variable': sel.index,
                   'value': sel,
                   'mean': others.mean(),
                   'median': others.median(),
                   'std': others.std()})
df.set_index('variable')
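
The difference between labels and positions matters when combining `idxmin()` with row access. The sketch below uses a made-up frame `toy` with a deliberately non-default index to show that `idxmin()` pairs naturally with `loc`, while `iloc` needs a position such as the one NumPy's `argmin()` returns.

```python
import pandas as pd

# Made-up frame whose index labels are NOT 0, 1, 2.
toy = pd.DataFrame({'medv': [21.0, 5.0, 33.0], 'rm': [6.1, 5.4, 7.9]},
                   index=[10, 20, 30])

label = toy['medv'].idxmin()                # index label of the minimum
row_by_label = toy.loc[label]               # label-based access

position = toy['medv'].to_numpy().argmin()  # integer position of the minimum
row_by_pos = toy.iloc[position]             # position-based access

# toy.iloc[label] would raise an IndexError here, since 20 is out of bounds.
print(label, position)
print(row_by_label)
```

With a default `RangeIndex`, label and position are the same number, which is why both approaches give the same row in that case.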

The full distributions contain more information than these summary statistics, so we create a graphical overview.

In [None]:
cols = 4
rows = len(sel) // cols + 1
fig = plt.figure(figsize=(12,9))
for idx, (var, value) in enumerate(zip(sel.index, sel)):
    row = idx // cols
    col = idx % cols
    ax = plt.subplot2grid((rows,cols),(row, col), fig=fig)
    sns.distplot(others[var], kde=False, ax=ax)
    ax.axvline(value, color='C1', label=var)
    ax.legend()
plt.show()

The suburb with the minimum `medv` can be expected to be one of the poorest neighbourhoods. The graphical overview confirms several expectations we have for poor neighbourhoods. For example, the crime rate is rather high and the number of rooms per dwelling is low.
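
Statements such as "rather high" or "low" can be made more precise by asking what fraction of all suburbs lies below the selected suburb's value for each variable. This is a minimal sketch on a made-up frame. The names `others` and `sel` mirror the notebook, but the numbers are illustrative assumptions.

```python
import pandas as pd

# Made-up predictor values (not the real data).
others = pd.DataFrame({
    'crim': [0.1, 0.3, 0.5, 2.0, 40.0],
    'rm':   [6.5, 6.0, 7.0, 6.2, 5.0],
})

# Pretend the last row is the suburb of interest.
sel = others.iloc[4]

# For each column: fraction of rows with a strictly smaller value.
pct_below = others.lt(sel, axis='columns').mean()
print(pct_below)
```

A fraction near 1 means the suburb sits at the top of that variable's distribution, and a fraction near 0 means it sits at the bottom.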

### H. Conditional Row Selection

In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.  

We can sum over Boolean values to get the counts:

In [None]:
print((boston['rm'] > 7).sum(), (boston['rm'] > 8).sum())

We plot some distributions, using the `rm` condition for colour coding.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 9))
small = boston[boston.rm <= 8]
large = boston[boston.rm > 8]
# One panel per variable, overlaying both groups
for ax, var in zip(axes.flat, ['lstat', 'medv', 'indus', 'ptratio']):
    sns.distplot(small[var], label='rm <= 8', ax=ax)
    sns.distplot(large[var], label='rm > 8', ax=ax)
    ax.legend()
plt.show()

We expect the suburbs averaging more than eight rooms per dwelling to be rather rich neighbourhoods. This is confirmed by the `lstat` and `medv` distributions. Most of these suburbs have a low proportion of industrial land (`indus`). There is no clear conclusion to draw from the `ptratio` distribution.
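
The comparison between the two groups can also be summarised numerically. The sketch below groups a made-up frame `toy` (illustrative values only) by the `rm > 8` condition and compares group medians.

```python
import numpy as np
import pandas as pd

# Made-up values (not the real data).
toy = pd.DataFrame({
    'rm':    [5.9, 6.3, 6.8, 7.2, 8.1, 8.4],
    'medv':  [18.0, 21.0, 24.0, 30.0, 45.0, 50.0],
    'lstat': [18.0, 12.0, 10.0, 7.0, 4.0, 3.0],
})

# Label each row by the room condition, then take per-group medians.
group = np.where(toy['rm'] > 8, 'rm > 8', 'rm <= 8')
summary = toy.groupby(group)[['medv', 'lstat']].median()
print(summary)
```

With only a few suburbs in the `rm > 8` group, medians are a more robust summary than means, though any conclusion from such a small group remains tentative.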