Exercises

### Create a new notebook, knn_model, and work with the titanic dataset to answer the following:

1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)

2. Evaluate your results using the model score, confusion matrix, and classification report.

3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

4. Run through steps 1-3 setting k to 10

5. Run through steps 1-3 setting k to 20

6. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

7. Which model performs best on our out-of-sample data from validate?

In [1]:
# custom modules for data prep:
import acquire as a
import prepare as p
import model as m

# tabular manipulation
import numpy as np
import pandas as pd

# ML stuff:
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             recall_score, precision_score, f1_score)
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression  # logistic, not linear!
from sklearn.neighbors import KNeighborsClassifier  # pick the classifier one

In [2]:
# acquire
df=a.get_titanic_data()

this file exists, reading csv


In [3]:
df.head(3)

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1


In [4]:
# prepare
df=p.prep_titanic(df)
df

Unnamed: 0,passenger_id,survived,pclass,sex,sibsp,parch,fare,embark_town,alone
0,0,0,3,male,1,0,7.2500,Southampton,0
1,1,1,1,female,1,0,71.2833,Cherbourg,0
2,2,1,3,female,0,0,7.9250,Southampton,1
3,3,1,1,female,1,0,53.1000,Southampton,0
4,4,0,3,male,0,0,8.0500,Southampton,1
...,...,...,...,...,...,...,...,...,...
886,886,0,2,male,0,0,13.0000,Southampton,1
887,887,1,1,female,0,0,30.0000,Southampton,1
888,888,0,3,female,1,2,23.4500,Southampton,0
889,889,1,1,male,0,0,30.0000,Cherbourg,1


In [5]:
train, val, test = p.splitting_data(df, 'survived')

In [6]:

train_en, val_en, test_en = m.preprocess_titanic(train, val, test)


In [7]:
# separate independent features & target
X_train, y_train = train_en.drop(columns='survived'), train_en.survived
X_validate, y_validate = val_en.drop(columns='survived'), val_en.survived
X_test, y_test = test_en.drop(columns='survived'), test_en.survived

### Baseline (train) -- the same baseline applies to all split dataframes

In [8]:
# for train
y_train.value_counts()

survived
0    329
1    205
Name: count, dtype: int64

In [9]:
y_train.mode()

0    0
Name: survived, dtype: int64

In [10]:
# baseline accuracy
y_train.value_counts(normalize=True)[0].round(2)

0.62

> Conclusion

    baseline prediction = 0 (did not survive)

    baseline accuracy = 62%
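
The baseline figures can be reproduced in one place. The following is a minimal sketch, assuming a hypothetical target series `y` built only from the class counts shown above (329 and 205), not from the actual titanic rows.

```python
import pandas as pd

# Hypothetical target with the same class counts as y_train above
# (329 zeros, 205 ones); the real rows are not reproduced here.
y = pd.Series([0] * 329 + [1] * 205, name="survived")

# The baseline always predicts the most frequent class.
baseline_class = y.mode()[0]

# Baseline accuracy is the share of rows that already belong to that class.
baseline_accuracy = (y == baseline_class).mean()

print(f"baseline prediction: {baseline_class}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

Any model worth keeping should beat this accuracy on the same split.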

> Q1) Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)


In [11]:
knn = KNeighborsClassifier()  # default n_neighbors=5
knn.fit(X_train, y_train)


In [12]:
# predict the target on the training sample with the fitted knn model

y_pred = knn.predict(X_train)
y_pred[:10]


array([0, 1, 1, 1, 0, 0, 1, 0, 1, 0])

> Q2) Evaluate your results using the model score, confusion matrix, and classification report.


In [13]:
# model score
acc = knn.score(X_train, y_train)
acc

0.8258426966292135

In [14]:
# confusion matrix
pd.crosstab(y_train,y_pred)

survived \ col_0,0,1
0,286,43
1,50,155
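
The same table can also be produced with `confusion_matrix`, which is already imported at the top of the notebook. This is a minimal sketch, assuming hypothetical label lists `y_true` and `y_hat` constructed only to reproduce the four counts in the crosstab; they are not the notebook's data.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels built to reproduce the counts in the crosstab above.
y_true = [0] * 329 + [1] * 205
y_hat = [0] * 286 + [1] * 43 + [0] * 50 + [1] * 155

# Rows are actual classes, columns are predicted classes, both sorted 0 then 1.
cm = confusion_matrix(y_true, y_hat)

# For a binary problem, ravel() flattens the matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Unpacking the four cells by name avoids mixing up positional indexes such as `iloc[0, 1]` and `iloc[1, 0]`.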


In [15]:
print(classification_report(y_train, y_pred))


              precision    recall  f1-score   support

           0       0.85      0.87      0.86       329
           1       0.78      0.76      0.77       205

    accuracy                           0.83       534
   macro avg       0.82      0.81      0.81       534
weighted avg       0.82      0.83      0.83       534



> Q3) Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [16]:
def compute_class_metrics(y_true, y_pred):
    """Print binary classification metrics, treating class 1 (survived) as positive."""

    # rows are actual classes, columns are predicted classes
    counts = pd.crosstab(y_true, y_pred)
    TP = counts.iloc[1, 1]
    TN = counts.iloc[0, 0]
    FP = counts.iloc[0, 1]
    FN = counts.iloc[1, 0]

    all_ = (TP + TN + FP + FN)

    accuracy = (TP + TN) / all_

    TPR = recall = TP / (TP + FN)
    FPR = FP / (FP + TN)

    TNR = TN / (FP + TN)
    FNR = FN / (FN + TP)

    precision = TP / (TP + FP)
    f1 = 2 * ((precision * recall) / (precision + recall))

    # support is the number of actual observations in each class
    support_pos = TP + FN  # actual class 1
    support_neg = FP + TN  # actual class 0

    print(f"Accuracy: {accuracy}\n")
    print(f"True Positive Rate/Sensitivity/Recall/Power: {TPR}")
    print(f"False Positive Rate/False Alarm Ratio/Fall-out: {FPR}")
    print(f"True Negative Rate/Specificity/Selectivity: {TNR}")
    print(f"False Negative Rate/Miss Rate: {FNR}\n")
    print(f"Precision/PPV: {precision}")
    print(f"F1 Score: {f1}\n")
    print(f"Support (0): {support_neg}")
    print(f"Support (1): {support_pos}")

In [17]:
compute_class_metrics(y_train, y_pred)

Accuracy: 0.8258426966292135

True Positive Rate/Sensitivity/Recall/Power: 0.7560975609756098
False Positive Rate/False Alarm Ratio/Fall-out: 0.13069908814589665
True Negative Rate/Specificity/Selectivity: 0.8693009118541033
False Negative Rate/Miss Rate: 0.24390243902439024

Precision/PPV: 0.7828282828282829
F1 Score: 0.7692307692307692

Support (0): 329
Support (1): 205
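
As a cross-check on the printed values, the same arithmetic can be done directly from the four counts in the confusion matrix for the default model. This is a minimal sketch; the helper `rates` is hypothetical and not part of the notebook.

```python
# Counts taken from the confusion matrix for the default model (k=5).
TP, TN, FP, FN = 155, 286, 43, 50


def rates(tp, tn, fp, fn):
    """Return the headline metrics as a dict, treating class 1 as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "tpr": recall,
        "fpr": fp / (fp + tn),
        "tnr": tn / (fp + tn),
        "fnr": fn / (fn + tp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "support_0": fp + tn,  # actual class 0
        "support_1": tp + fn,  # actual class 1
    }


metrics = rates(TP, TN, FP, FN)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Two quick sanity checks follow from the definitions: TPR and FNR sum to 1, as do TNR and FPR, and the two supports match the class counts in the classification report (329 and 205).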


> Q4) Run through steps 1-3 setting k to 10

In [18]:
knn10 = KNeighborsClassifier(n_neighbors=10)
knn10.fit(X_train, y_train)
y_pred = knn10.predict(X_train)
compute_class_metrics(y_train, y_pred)

Accuracy: 0.7883895131086143

True Positive Rate/Sensitivity/Recall/Power: 0.6536585365853659
False Positive Rate/False Alarm Ratio/Fall-out: 0.1276595744680851
True Negative Rate/Specificity/Selectivity: 0.8723404255319149
False Negative Rate/Miss Rate: 0.3463414634146341

Precision/PPV: 0.7613636363636364
F1 Score: 0.7034120734908137

Support (0): 329
Support (1): 205


> Q5) Run through steps 1-3 setting k to 20

In [19]:
knn20 = KNeighborsClassifier(n_neighbors=20)
knn20.fit(X_train, y_train)
y_pred = knn20.predict(X_train)
compute_class_metrics(y_train, y_pred)

Accuracy: 0.7340823970037453

True Positive Rate/Sensitivity/Recall/Power: 0.5317073170731708
False Positive Rate/False Alarm Ratio/Fall-out: 0.1398176291793313
True Negative Rate/Specificity/Selectivity: 0.8601823708206687
False Negative Rate/Miss Rate: 0.4682926829268293

Precision/PPV: 0.7032258064516129
F1 Score: 0.6055555555555556

Support (0): 329
Support (1): 205


> Q6) What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

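From the results above, the default model with 5 nearest neighbors performs best on the in-sample data: accuracy falls from 0.83 (k=5) to 0.79 (k=10) to 0.73 (k=20). Most of the loss comes from recall on survivors, which drops from 0.76 to 0.65 to 0.53, while the true negative rate stays between 0.86 and 0.87. A smaller k lets each prediction follow its closest training points, so the model fits the training sample more tightly; a larger k averages over more neighbors and tends to lean toward the majority class (did not survive).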
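
The link between k and in-sample accuracy can be illustrated on synthetic data. This is a minimal sketch, assuming a stand-in dataset from `make_classification` rather than the titanic features; the variable names are hypothetical. With k=1 every training row is its own nearest neighbor, so the training accuracy is perfect as long as no two rows are identical.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training split (an assumption; not the titanic data).
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

train_scores = {}
for k in (1, 5, 10, 20):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # Scoring on the same rows used for fitting gives in-sample accuracy.
    train_scores[k] = model.score(X, y)

for k, score in train_scores.items():
    print(f"k={k:>2}: train accuracy = {score:.3f}")
```

A high in-sample score at small k therefore says little about how the model generalizes, which is why the comparison on held-out data matters.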

> Q7) Which model performs best on our out-of-sample data from validate?

In [20]:
knn.score(X_train, y_train)

0.8258426966292135

In [21]:
knn10.score(X_train, y_train)

0.7883895131086143

In [22]:
knn20.score(X_train, y_train)


0.7340823970037453

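From the output, the model with 5 nearest neighbors scores highest. Note that the three cells above score on X_train and y_train, so these are the same in-sample accuracies as before; to answer the question for out-of-sample data, the models need to be scored on X_validate and y_validate instead.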
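
A minimal sketch of an out-of-sample comparison follows. It uses synthetic data and hypothetical names (`X_tr`, `X_val`, `results`, `best_k`), so the numbers it prints say nothing about the titanic models. In this notebook, the equivalent would be calling `.score(X_validate, y_validate)` on `knn`, `knn10`, and `knn20`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the notebook's splits (an assumption; not the titanic data).
X, y = make_classification(n_samples=600, n_features=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

results = {}
for k in (5, 10, 20):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    results[k] = {
        "train": model.score(X_tr, y_tr),        # in-sample accuracy
        "validate": model.score(X_val, y_val),   # out-of-sample accuracy
    }

# Pick the k with the highest validate accuracy.
best_k = max(results, key=lambda k: results[k]["validate"])

for k, scores in results.items():
    print(f"k={k:>2}: train={scores['train']:.3f} validate={scores['validate']:.3f}")
print(f"best k on validate: {best_k}")
```

Keeping the train and validate scores side by side also shows the gap between them, which is a rough indicator of overfitting.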