## Applied - Question 9

This question uses the Boston dataset, a data frame with 506 observations and 14 variables.
The data were originally published in Harrison, D. and Rubinfeld, D.L.,
'Hedonic prices and the demand for clean air',
J. Environ. Economics & Management, vol. 5, pp. 81-102, 1978.

There are 14 attributes in each case of the dataset. They are:

  1. CRIM - per capita crime rate by town  
  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS - proportion of non-retail business acres per town.
  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX - nitric oxides concentration (parts per 10 million)
  6. RM - average number of rooms per dwelling
  7. AGE - proportion of owner-occupied units built prior to 1940
  8. DIS - weighted distances to five Boston employment centres
  9. RAD - index of accessibility to radial highways
  10. TAX - full-value property-tax rate per $10,000
  11. PTRATIO - pupil-teacher ratio by town
  12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT - % lower status of the population
  14. MEDV - Median value of owner-occupied homes in $1000's

We will use the bootstrap to estimate the standard errors of several statistics of medv: the mean, the median, and the 10th percentile.

Import block

In [37]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score, train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.utils import resample

import statsmodels.stats.api as sms
import statsmodels.api as sm

from scipy import stats

%matplotlib inline
plt.style.use('seaborn-white')

Getting dataset

In [26]:
data_path = 'D:\\PycharmProjects\\ISLR\\data\\'
df = pd.read_csv(f'{data_path}Boston.csv')
df.medv.describe()

count    506.000000
mean      22.532806
std        9.197104
min        5.000000
25%       17.025000
50%       21.200000
75%       25.000000
max       50.000000
Name: medv, dtype: float64

(a) The sample mean of medv is 22.5328, so $\hat{\mu} = 22.5328$.

(b) The standard error of $\hat{\mu}$, with $s$ the sample standard deviation and $n$ the number of observations, is
\begin{equation}
SE(\hat{\mu}) = \frac{s}{\sqrt{n}} = \frac{9.1971}{\sqrt{506}} = 0.4089
\end{equation}
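
As a quick arithmetic check of the formula above, the following sketch recomputes the standard error from the summary statistics printed by `describe()`. The names `sample_std`, `n`, and `se_mean` are illustrative and do not come from the original notebook. Note that pandas reports the sample standard deviation (`ddof=1`), which is what the formula expects.

```python
import math

# Values copied from the describe() output above
sample_std = 9.197104   # sample standard deviation of medv (ddof=1)
n = 506                 # number of observations

# Standard error of the sample mean: s / sqrt(n)
se_mean = sample_std / math.sqrt(n)
print(f"SE of the mean: {se_mean:.4f}")
```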

(c) Estimating $\hat{\mu}$ using bootstrap

In [32]:
boot_mean = []

# bootstrap 10,000 times: resample the rows with replacement
# and record the mean of medv for each resample
for _ in range(10000):
    df2 = resample(df, n_samples=506, replace=True)
    boot_mean.append(np.mean(df2.medv))
    
print(f'The estimated SE of mean using bootstrap: {np.std(boot_mean)}')

The estimated SE of mean using bootstrap: 0.4105226577681551


After 10,000 bootstrap resamples, we obtain an estimate of SE($\hat{\mu}$) = 0.4105. This is slightly larger
than the formula-based value of 0.4089; the two agree to two decimal places.
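
The resampling idea can also be sketched without a DataFrame. The example below is a minimal illustration on synthetic data (the Boston file is not loaded here); `values`, `n_boot`, and the seed are assumptions chosen for the sketch. It draws all resample indices at once with NumPy and compares the bootstrap SE with the $s/\sqrt{n}$ formula.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-in for medv (assumption: NOT the real Boston data)
values = rng.normal(loc=22.5, scale=9.2, size=506)
n = values.size
n_boot = 2000

# Each row holds the indices of one resample drawn with replacement
idx = rng.integers(0, n, size=(n_boot, n))
boot_means = values[idx].mean(axis=1)

se_boot = boot_means.std(ddof=1)               # spread of the bootstrap means
se_formula = values.std(ddof=1) / np.sqrt(n)   # formula-based standard error

print(f"bootstrap SE: {se_boot:.4f}, formula SE: {se_formula:.4f}")
```

On data like this the two numbers should be close, which mirrors the agreement seen for medv.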
 

(d) Calculating the 95% CI using bootstrap SE:  
\begin{align}
CI &= [\hat{\mu} - 2SE(\hat{\mu}),\hat{\mu} + 2SE(\hat{\mu})] \\
&= [22.5328 - 2 \times 0.4105,\ 22.5328 + 2 \times 0.4105] \\
&= [21.7118, 23.3538]
\end{align}

In [38]:
# Using the statsmodels package (t-based confidence interval for the mean)
sms.DescrStatsW(df.medv).tconfint_mean()

(21.72952801457859, 23.336084633642756)

The two intervals differ only slightly: their endpoints agree to within about 0.02.
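
The gap can be traced to the ingredients of each interval. Assuming `tconfint_mean` uses the 97.5% quantile of a t distribution with $n - 1 = 505$ degrees of freedom (about 1.965) together with the formula-based SE of 0.4089, rather than the multiplier 2 and the bootstrap SE, its interval can be reproduced approximately as:

\begin{align}
CI_t &= [\hat{\mu} - t_{0.975,\,505}\,SE(\hat{\mu}),\ \hat{\mu} + t_{0.975,\,505}\,SE(\hat{\mu})] \\
&\approx [22.5328 - 1.965 \times 0.4089,\ 22.5328 + 1.965 \times 0.4089] \\
&\approx [21.73, 23.34]
\end{align}

Both the smaller multiplier and the smaller SE make this interval slightly narrower than the bootstrap-based one.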

(e) The sample median of medv is $\hat{\mu}_{med} = 21.2$, as calculated below.

In [39]:
np.median(df.medv)

21.2

(f) We use the same bootstrap loop, but now compute the median of each resample.

In [41]:
boot_median = []

# bootstrap 10,000 times: resample the rows with replacement
# and record the median of medv for each resample
for _ in range(10000):
    df2 = resample(df, n_samples=506, replace=True)
    boot_median.append(np.median(df2.medv))

print(f'The estimated SE of median using bootstrap: {np.std(boot_median)}')

The estimated SE of median using bootstrap: 0.37626728368940043


With 10,000 bootstrap resamples we get $SE(\hat{\mu}_{med}) = 0.3763$, which is smaller
than the estimated SE of the mean (0.4105).

(g) The 10th percentile of medv, computed with numpy, is $\hat{\mu}_{0.1} = 12.75$.

In [44]:
print(f'Lowest 10% of medv = {np.quantile(df.medv, q=0.1)}')

Lowest 10% of medv = 12.75
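
Note that `np.quantile` interpolates linearly between neighbouring sorted values by default, so the reported quantile need not be an observed value. A toy sketch of that behaviour (the array `toy` is made up for illustration and has nothing to do with the Boston data):

```python
import numpy as np

toy = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical data, not from Boston

# Default method: position = q * (n - 1) = 0.1 * 3 = 0.3,
# i.e. 30% of the way from the 1st to the 2nd sorted value
q10 = np.quantile(toy, q=0.1)
print(q10)
```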


(h) Again, using the same loop, we can estimate the SE of $\hat{\mu}_{0.1}$.

In [47]:
boot_10 = []

# bootstrap 10,000 times: resample the rows with replacement
# and record the 10% quantile of medv for each resample
for _ in range(10000):
    df2 = resample(df, n_samples=506, replace=True)
    boot_10.append(np.quantile(df2.medv, q=0.1))

print(f'The estimated SE of 10% quantile using bootstrap: {np.std(boot_10)}')

The estimated SE of 10% quantile using bootstrap: 0.5025508296680048


$SE(\hat{\mu}_{0.1}) = 0.5026$ is small relative to the 10th percentile value of 12.75 (about 4%).
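
Since the three bootstrap cells above repeat the same loop with a different statistic, the pattern could be factored into one helper. The sketch below is one possible way to do so, not the notebook's own code: `bootstrap_se` is a hypothetical name, the default `n_boot` and seeds are arbitrary, and the function is run on synthetic data so that the example is self-contained.

```python
import numpy as np

def bootstrap_se(values, statistic, n_boot=1000, seed=0):
    """Bootstrap standard error of `statistic` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Resample with replacement, same size as the original sample
        sample = rng.choice(values, size=values.size, replace=True)
        estimates[b] = statistic(sample)
    return estimates.std(ddof=1)

# Synthetic stand-in for medv (assumption: NOT the real Boston data)
demo = np.random.default_rng(1).normal(loc=22.5, scale=9.2, size=506)

se_mean = bootstrap_se(demo, np.mean)
se_median = bootstrap_se(demo, np.median)
se_q10 = bootstrap_se(demo, lambda x: np.quantile(x, 0.1))
print(se_mean, se_median, se_q10)
```

With the real data, one would pass `df.medv` in place of `demo`; passing the statistic as a function keeps the resampling logic in one place.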