## Applied - Question 13

This question uses the Boston dataset, a data frame with 506 observations and 14 variables.
The data were originally published in Harrison, D. and Rubinfeld, D.L.,
'Hedonic prices and the demand for clean air',
J. Environ. Economics & Management, vol. 5, 81-102, 1978.

There are 14 attributes in each case of the dataset. They are:

  1. CRIM - per capita crime rate by town  
  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS - proportion of non-retail business acres per town.
  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX - nitric oxides concentration (parts per 10 million)
  6. RM - average number of rooms per dwelling
  7. AGE - proportion of owner-occupied units built prior to 1940
  8. DIS - weighted distances to five Boston employment centres
  9. RAD - index of accessibility to radial highways
  10. TAX - full-value property-tax rate per $10,000
  11. PTRATIO - pupil-teacher ratio by town
  12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT - % lower status of the population
  14. MEDV - Median value of owner-occupied homes in $1000's

We will try to classify whether a given suburb has a crime rate above or below the median.

### [Part I](#part1) : Correct way to do this!  
### [Part II](#part2) : How I am so wrong!

Import block

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.formula.api as smf

import sklearn.linear_model as skl_lm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from sklearn import preprocessing, neighbors
from sklearn.model_selection import train_test_split
from util import print_cm

%matplotlib inline
plt.style.use('seaborn-white')
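
The `print_cm` helper imported from `util` is not shown in this notebook. Judging only from the outputs further down (predictions on rows, true labels on columns, a three-digit classification report), it might look roughly like the sketch below. The body, the returned frame, and the `demo_*` names are my assumptions, not the actual helper.

```python
from types import SimpleNamespace

import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report


def print_cm(y_true, y_pred, model):
    """Hypothetical stand-in for util.print_cm (the real helper is not shown)."""
    # confusion_matrix puts true labels on rows; transpose so that rows are
    # predictions and columns are true labels, as in the printed outputs.
    cm = pd.DataFrame(
        confusion_matrix(y_true, y_pred, labels=model.classes_).T,
        index=model.classes_,
        columns=model.classes_,
    )
    cm.index.name = 'Predicted'
    cm.columns.name = 'True'
    print('Confusion Matrix')
    print(cm, '\n')
    print('Classification report')
    print(classification_report(y_true, y_pred, digits=3))
    return cm  # returned only to make the sketch easy to check


# Tiny made-up example; any fitted classifier exposing classes_ would do.
demo_model = SimpleNamespace(classes_=[0, 1])
demo_cm = print_cm([0, 0, 1, 1, 1], [0, 1, 1, 1, 1], demo_model)
```

With this layout, the off-diagonal cell in row 1, column 0 counts false positives, and the cell in row 0, column 1 counts false negatives.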

Load data

In [5]:
data_path = 'D:\\PycharmProjects\\ISLR\\data\\'
boston = pd.read_csv(f'{data_path}Boston.csv')
boston.describe()

| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677082 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


In [8]:
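# Work on a copy so the original frame stays untouched
df = boston.copy(deep=True)
median_value = np.median(df.crim)
# crim01 is 1 when the crime rate is above the median, 0 otherwise
df['crim01'] = np.where(df['crim'] > median_value, 1, 0)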
df = df.drop('crim', axis=1)
df.head()

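| | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv | crim01 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 | 0 |
| 1 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 | 0 |
| 2 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 | 0 |
| 3 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 | 0 |
| 4 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 | 0 |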
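
Splitting at the median should give two classes of roughly equal size, which is what makes plain accuracy a reasonable yardstick later on. A minimal sketch of that check follows; the `toy` frame and its values are made up for illustration and are not the Boston data.

```python
import pandas as pd

# Toy stand-in for the crim column (values are invented for this example).
toy = pd.DataFrame({'crim': [0.01, 0.05, 0.2, 0.3, 4.0, 9.5]})

# Same idea as the cell above: label rows strictly above the median as 1.
median_value = toy['crim'].median()
toy['crim01'] = (toy['crim'] > median_value).astype(int)

# Count how many rows fall in each class.
counts = toy['crim01'].value_counts().sort_index()
print(counts)
```

On the real frame, `df['crim01'].value_counts()` would be the equivalent check.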


We learned from Exercise 15 of Chapter 3 that putting all of the variables in the model
leaves most of them statistically insignificant. I therefore planned to use only the
significant ones:
 1. zn
 2. dis
 3. rad
 4. black
 5. medv

Nevertheless, we will proceed with the full dataset in the first part and the 5 variables
listed above in the second part.
Let's start by splitting the data into 60% train and 40% test.

In [15]:
X = df.drop('crim01', axis=1)
y = df.crim01
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
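
As a sanity check on the split sizes: with 506 rows and `test_size=0.4`, scikit-learn rounds the test set up to 203 rows, leaving 303 for training. The sketch below reproduces that on placeholder arrays. The `stratify` argument is my addition (the cell above does not use it); it keeps the class proportions equal in both parts. `X_demo` and `y_demo` are made-up names and data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the Boston frame.
X_demo = np.arange(506 * 2).reshape(506, 2)
y_demo = np.array([0, 1] * 253)  # perfectly balanced labels

# stratify=y_demo is an optional extra, not part of the original cell.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=1, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(np.bincount(y_te))  # class counts in the test part
```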

<a id='part1'></a>

## Part I: All features classification

#### Logistic Model

In [18]:
regr = skl_lm.LogisticRegression(solver='newton-cg')
pred = regr.fit(X_train, y_train).predict(X_test)
print_cm(y_test, pred, regr)

Confusion Matrix 
 True        0   1
Predicted        
0          91  18
1          10  84 

Classification report 
               precision    recall  f1-score   support

           0      0.835     0.901     0.867       101
           1      0.894     0.824     0.857       102

    accuracy                          0.862       203
   macro avg      0.864     0.862     0.862       203
weighted avg      0.864     0.862     0.862       203



We get an accuracy of 0.862, which is not too bad considering that we did no feature
selection and simply used every predictor!

#### LDA Model

In [19]:
lda = LinearDiscriminantAnalysis()
pred = lda.fit(X_train, y_train).predict(X_test)
print_cm(y_test, pred, lda)

Confusion Matrix 
 True        0   1
Predicted        
0          94  22
1           7  80 

Classification report 
               precision    recall  f1-score   support

           0      0.810     0.931     0.866       101
           1      0.920     0.784     0.847       102

    accuracy                          0.857       203
   macro avg      0.865     0.858     0.856       203
weighted avg      0.865     0.857     0.856       203



LDA gives us a similar result, with 0.857 accuracy.

#### QDA Model

In [20]:
qda = QuadraticDiscriminantAnalysis()
pred = qda.fit(X_train, y_train).predict(X_test)
print_cm(y_test, pred, qda)

Confusion Matrix 
 True         0   1
Predicted         
0          100  18
1            1  84 

Classification report 
               precision    recall  f1-score   support

           0      0.847     0.990     0.913       101
           1      0.988     0.824     0.898       102

    accuracy                          0.906       203
   macro avg      0.918     0.907     0.906       203
weighted avg      0.918     0.906     0.906       203



Wow, QDA gives a stunning result, with 0.906 accuracy and 0.990 specificity (avoidance of
false positives).
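
For reference, these figures can be recomputed by hand from the QDA confusion matrix above, treating class 1 (above-median crime) as the positive class. The variable names are mine.

```python
# Counts copied from the QDA confusion matrix (rows = predicted, columns = true).
tn, fn = 100, 18  # predicted 0: true 0, true 1
fp, tp = 1, 84    # predicted 1: true 0, true 1

specificity = tn / (tn + fp)  # share of true 0s correctly predicted as 0
sensitivity = tp / (tp + fn)  # share of true 1s correctly predicted as 1
accuracy = (tn + tp) / (tn + fn + fp + tp)

print(round(specificity, 3), round(sensitivity, 3), round(accuracy, 3))
```

Specificity here is the same number as the recall reported for class 0 in the classification report.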

Let's try KNN with K = 1

In [21]:
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
pred = knn.fit(X_train, y_train).predict(X_test)
print_cm(y_test, pred, knn)

Confusion Matrix 
 True        0   1
Predicted        
0          93   7
1           8  95 

Classification report 
               precision    recall  f1-score   support

           0      0.930     0.921     0.925       101
           1      0.922     0.931     0.927       102

    accuracy                          0.926       203
   macro avg      0.926     0.926     0.926       203
weighted avg      0.926     0.926     0.926       203



Compared with QDA, our specificity takes a hit (0.990 to 0.921), but our accuracy improves
by about 0.02 to 0.926.

KNN with K = 5

In [25]:
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
pred = knn.fit(X_train, y_train).predict(X_test)
print_cm(y_test, pred, knn)

Confusion Matrix 
 True        0   1
Predicted        
0          94   9
1           7  93 

Classification report 
               precision    recall  f1-score   support

           0      0.913     0.931     0.922       101
           1      0.930     0.912     0.921       102

    accuracy                          0.921       203
   macro avg      0.921     0.921     0.921       203
weighted avg      0.921     0.921     0.921       203



Accuracy slips slightly to 0.921, so it seems like increasing K is not a good idea in this case.

KNN with K = 10

In [26]:
knn = neighbors.KNeighborsClassifier(n_neighbors=10)
pred = knn.fit(X_train, y_train).predict(X_test)
print_cm(y_test, pred, knn)

Confusion Matrix 
 True        0   1
Predicted        
0          92  17
1           9  85 

Classification report 
               precision    recall  f1-score   support

           0      0.844     0.911     0.876       101
           1      0.904     0.833     0.867       102

    accuracy                          0.872       203
   macro avg      0.874     0.872     0.872       203
weighted avg      0.874     0.872     0.872       203



Another step back from increasing K: accuracy falls to 0.872. Of the values tried (1, 5 and 10), K = 1 gives the best accuracy, so let's take it as the best fit for the KNN method.
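
Comparing a handful of K values on the test set works as a quick look, but a common alternative is to pick K by cross-validation on the training data. Since KNN is distance based, it also usually helps to standardize the predictors first (here `tax` runs into the hundreds while `nox` stays below 1). The sketch below shows that approach, which is not what this notebook does. It runs on synthetic data from `make_classification`, and the grid of K values and the `*_demo` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 13 predictors, binary target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=1
)

# Scale inside the pipeline so each CV fold is standardized on its own training part.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# 5-fold cross-validated search over a small grid of K values.
search = GridSearchCV(
    pipe, {'knn__n_neighbors': [1, 3, 5, 10]}, cv=5, scoring='accuracy'
)
search.fit(X_demo, y_demo)

best_k = search.best_params_['knn__n_neighbors']
print(best_k, round(search.best_score_, 3))
```

On the real data, one would fit `search` on `X_train, y_train` and only then score the chosen model once on `X_test`.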

<a id='part2'></a>

## Part II: Using only 5 variables
Our predictors :
 1. zn
 2. dis
 3. rad
 4. black
 5. medv
 

In [29]:
X = df[['zn', 'dis','rad','black','medv']]
y = df.crim01
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

#### Logistic Model

In [31]:
regr = skl_lm.LogisticRegression()
pred = regr.fit(X_train, y_train).predict(X_test)
print_cm(y_test, pred, regr)

Confusion Matrix 
 True        0   1
Predicted        
0          88  25
1          13  77 

Classification report 
               precision    recall  f1-score   support

           0      0.779     0.871     0.822       101
           1      0.856     0.755     0.802       102

    accuracy                          0.813       203
   macro avg      0.817     0.813     0.812       203
weighted avg      0.817     0.813     0.812       203





Hmm! We get a somewhat worse result using only 5 variables.

Our accuracy drops from 0.862 to 0.813, a difference of about 0.05.

#### LDA Model

In [32]:
lda = LinearDiscriminantAnalysis()
pred = lda.fit(X_train, y_train).predict(X_test)
print_cm(y_test, pred, lda)

Confusion Matrix 
 True        0   1
Predicted        
0          93  23
1           8  79 

Classification report 
               precision    recall  f1-score   support

           0      0.802     0.921     0.857       101
           1      0.908     0.775     0.836       102

    accuracy                          0.847       203
   macro avg      0.855     0.848     0.847       203
weighted avg      0.855     0.847     0.847       203



Again we get a drop in accuracy from 0.857 to 0.847.

#### QDA model

In [33]:
qda = QuadraticDiscriminantAnalysis()
pred = qda.fit(X_train, y_train).predict(X_test)
print_cm(y_test, pred, qda)

Confusion Matrix 
 True        0   1
Predicted        
0          96  31
1           5  71 

Classification report 
               precision    recall  f1-score   support

           0      0.756     0.950     0.842       101
           1      0.934     0.696     0.798       102

    accuracy                          0.823       203
   macro avg      0.845     0.823     0.820       203
weighted avg      0.845     0.823     0.820       203



This time we get a much worse result. Our accuracy drops by about 0.08, from 0.906 to 0.823.
The omission of many features/predictors hinders the flexibility of QDA to capture non-linear
relationships.
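
To keep the comparison honest, here are the accuracy drops between the two parts, computed from the accuracies printed in the classification reports. The dictionary names are mine.

```python
# Accuracies copied from the classification reports in Part I and Part II.
part1 = {'logistic': 0.862, 'lda': 0.857, 'qda': 0.906}
part2 = {'logistic': 0.813, 'lda': 0.847, 'qda': 0.823}

# Drop in accuracy when going from all predictors to the 5 selected ones.
drops = {name: round(part1[name] - part2[name], 3) for name in part1}
print(drops)
```

QDA loses the most, logistic regression about 0.05, and LDA barely moves.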

#### KNN with K = 1, 2, 5 and 10

In [35]:
for i in [1, 2, 5, 10]:
    print(f'For K = {i}\n')
    knn = neighbors.KNeighborsClassifier(n_neighbors=i)
    pred = knn.fit(X_train, y_train).predict(X_test)
    print_cm(y_test, pred, knn)
    print()

For K = 1

Confusion Matrix 
 True        0   1
Predicted        
0          89  17
1          12  85 

Classification report 
               precision    recall  f1-score   support

           0      0.840     0.881     0.860       101
           1      0.876     0.833     0.854       102

    accuracy                          0.857       203
   macro avg      0.858     0.857     0.857       203
weighted avg      0.858     0.857     0.857       203


For K = 2

Confusion Matrix 
 True        0   1
Predicted        
0          94  28
1           7  74 

Classification report 
               precision    recall  f1-score   support

           0      0.770     0.931     0.843       101
           1      0.914     0.725     0.809       102

    accuracy                          0.828       203
   macro avg      0.842     0.828     0.826       203
weighted avg      0.842     0.828     0.826       203


For K = 5

Confusion Matrix 
 True        0   1
Predicted        
0          89  15
1   

Again, seems like I was wrong to omit many other features from our dataset!
