## Day 31 Lecture 2 Assignment

In this assignment, we will learn about weighting and scaling with the k-nearest neighbors (KNN) algorithm. We will use the acute nephritis dataset loaded below and analyze the model fitted to it.

In [2]:
%matplotlib inline

import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import ds_useful as ds

[My Useful Data Science Functions](https://github.com/cobyoram/python-for-data-scientists/blob/master/ds_useful.py)

In [3]:
# columns: 
# Temperature of patient { 35C-42C }
# Occurrence of nausea { yes, no }
# Lumbar pain { yes, no }
# Urine pushing (continuous need for urination) { yes, no }
# Micturition pains { yes, no }
# Burning of urethra, itch, swelling of urethra outlet { yes, no }
# decision: Nephritis of renal pelvis origin { yes, no } 

cols = ['temp', 'nausea', 'lumbar_pain', 'urine_pushing', 'micturition_pains', 'burning', 'nephritis']
nephritis = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/acute.csv', names=cols)

Recall that we need to check for missing data and create dummy variables from the non-numeric columns. Perform both steps below:

In [62]:
# answer below:

ds.missingness_summary(nephritis)

nephritis            0.0
burning              0.0
micturition_pains    0.0
urine_pushing        0.0
lumbar_pain          0.0
nausea               0.0
temp                 0.0
dtype: float64
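`ds.missingness_summary` comes from the helper module linked above, and its implementation is not shown here. Assuming it reports the fraction of missing values per column (the output above is consistent with that), a rough plain-pandas equivalent might look like the sketch below. The function name `missing_fraction` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_fraction(df):
    """Fraction of missing values per column, largest first (hypothetical helper)."""
    return df.isnull().mean().sort_values(ascending=False)


# Toy frame with made-up values, only to show the shape of the result.
toy = pd.DataFrame({
    'temp': [36.6, np.nan, 40.1, 38.0],
    'nausea': ['no', 'yes', None, None],
})

summary = missing_fraction(toy)
print(summary)
```

A column that reports 0.0, as every column does above, has no missing entries and needs no imputation.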

Create the dummy variables first, then scale only the independent variables using the min-max scaler.

In [7]:
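# Split the columns into text (yes/no) columns and numeric columns.
obj_columns = nephritis.select_dtypes('object').columns
non_obj_columns = nephritis.drop(obj_columns, axis=1).columns
# drop_first=True keeps one 0/1 column per yes/no variable.
dums = pd.get_dummies(nephritis[obj_columns], drop_first=True)
# Put the dummy columns and the numeric columns back together.
feat_neph = pd.concat([dums, nephritis[non_obj_columns]], axis=1)
feat_neph.head()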

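|    | nausea_yes | lumbar_pain_yes | urine_pushing_yes | micturition_pains_yes | burning_yes | nephritis_yes | temp |
|----|------------|-----------------|-------------------|-----------------------|-------------|---------------|------|
| 35 | 0 | 1 | 0 | 0 | 0 | 0 | 5 |
| 35 | 0 | 0 | 1 | 1 | 1 | 0 | 9 |
| 35 | 0 | 1 | 0 | 0 | 0 | 0 | 9 |
| 36 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 36 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |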
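The table above shows the integers 35 and 36 in the unnamed index and single digits in `temp`, although the column comments describe a temperature between 35C and 42C. One possible explanation (an assumption, since the raw file is not shown here) is that the file writes temperatures with a decimal comma, such as `35,5`. Each row would then hold eight comma-separated fields for seven names, and pandas would use the first field as the index. The sketch below reproduces that behavior on two made-up rows and rebuilds the temperature; the rows and the single-digit fraction are illustrative assumptions.

```python
import io

import pandas as pd

cols = ['temp', 'nausea', 'lumbar_pain', 'urine_pushing',
        'micturition_pains', 'burning', 'nephritis']

# Two made-up rows in the assumed layout: the decimal comma splits the temperature.
raw = "35,5,no,yes,no,no,no,no\n36,0,no,no,yes,yes,yes,no\n"

demo = pd.read_csv(io.StringIO(raw), names=cols)

# Eight fields but seven names: the first field ends up in the index,
# and 'temp' holds only the digit after the comma.
whole = demo.index.to_numpy()
frac = demo['temp'].to_numpy()

# Rebuild the temperature, assuming exactly one digit after the comma.
demo = demo.reset_index(drop=True)
demo['temp'] = whole + frac / 10
print(demo[['temp', 'nausea', 'nephritis']])
```

If the real file follows this layout, the same two steps could be applied to `nephritis` before creating the dummy variables. The outputs shown in this notebook were produced without that correction.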


In [10]:
# answer below:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

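# Separate the features from the target so only the features are scaled.
X = feat_neph.drop('nephritis_yes', axis=1)
y = feat_neph['nephritis_yes']

# Rescale every feature to the range [0, 1]; the result is a NumPy array.
X = MinMaxScaler().fit_transform(X)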
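KNN compares points by distance, so a feature with a wide range can dominate features that only take the values 0 and 1. Min-max scaling maps each column to [0, 1] with (x - min) / (max - min). As a small check of that formula, the sketch below compares `MinMaxScaler` with the manual calculation on a made-up array; the values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: a temperature-like column and a 0/1 column.
toy = np.array([[35.0, 0.0],
                [38.5, 1.0],
                [42.0, 0.0]])

scaled = MinMaxScaler().fit_transform(toy)

# The same transformation written out by hand, column by column.
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual))
```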

Create a train test split with 20% of the data in the test subsample.

In [68]:
# answer below:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
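The cells above scale the full dataset before splitting it, as the assignment asks. A common variation is to split first and fit the scaler on the training rows only, so the test rows do not influence the minimum and maximum. The sketch below shows that ordering with a scikit-learn pipeline on synthetic data; the data and the names `X_demo`, `y_demo`, and `model` are assumptions made for illustration, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with the same number of rows as the nephritis set.
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

# Split first, keeping 20% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

# The pipeline fits the scaler on the training rows only and reuses
# those min/max values when it transforms the test rows.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)

acc = model.score(X_te, y_te)
print(acc)
```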

Create a KNN model for our scaled data with k=5 and report the accuracy score.

In [69]:
# answer below:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X_train, y_train)

knn.score(X_test, y_test)

1.0

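By default, `KNeighborsClassifier` uses `weights='uniform'`, so each of the k neighbors gets an equal vote. Setting `weights='distance'` instead weights each neighbor by the inverse of its distance, so closer neighbors have more influence. We can also write our own custom weight function.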
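To make the effect of the `weights` setting concrete, here is a minimal sketch on made-up one-dimensional data where the two built-in options disagree. The points, labels, and query are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One class-1 point close to the query, two class-0 points farther away.
X_toy = np.array([[0.0], [1.0], [1.2]])
y_toy = np.array([1, 0, 0])
query = np.array([[0.1]])

uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_toy, y_toy)
distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_toy, y_toy)

# Uniform: every neighbor has one vote, so the two class-0 points win.
pred_uniform = uniform.predict(query)[0]

# Distance: votes are 1/0.1 for class 1 against 1/0.9 + 1/1.1 for class 0,
# so the single close neighbor wins.
pred_distance = distance.predict(query)[0]

print(pred_uniform, pred_distance)
```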

Write a custom weight function that assigns the weight of 1/sqrt(distance) and use this function in your model. Report the accuracy score.

Hint: Use the `_get_weights` function in scikit-learn as a resource. The code is [here](https://github.com/scikit-learn/scikit-learn/blob/fdbaa58acbead5a254f2e6d597dc1ab3b947f4c6/sklearn/neighbors/base.py#L63 "_get_weights").

In [81]:
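# answer below:
import math

def inverse_sqrt(x):
    """Return 1/sqrt(x) for a single distance x."""
    # A zero distance (an exact match) would divide by zero, so it gets
    # weight 0 here. Note that this means exact matches do not vote.
    if x == 0:
        return 0
    return 1/math.sqrt(x)

def give_weight(s):
    # Apply the weight to every distance in one row (one query point).
    return s.apply(inverse_sqrt)

def get_weight(neighbors):
    # scikit-learn passes an array of neighbor distances with one row per
    # query point and expects an array of weights with the same shape.
    print(type(neighbors))  # debug: input type
    neigh_dist = pd.DataFrame(neighbors)
    print(neigh_dist)  # debug: raw distances
    neigh_dist = neigh_dist.apply(give_weight, axis=1)
    print(neigh_dist)  # debug: resulting weights

    return neigh_dist.to_numpy()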
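The function above gives weight 0 to a neighbor at distance 0, so an exact match has no vote, and a row where every neighbor is an exact match carries no vote at all. An alternative, sketched below, is to let exact matches decide the row on their own, which is how the built-in `weights='distance'` option treats zero distances. This is a vectorized NumPy variant under that assumption, not the answer used in this notebook; the name `inverse_sqrt_weights` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def inverse_sqrt_weights(distances):
    """Weight each neighbor by 1/sqrt(distance), one row per query point."""
    distances = np.asarray(distances, dtype=float)
    with np.errstate(divide='ignore'):
        weights = 1.0 / np.sqrt(distances)  # zero distances become inf
    # In rows that contain an exact match, give weight 1 to the exact
    # matches and 0 to every other neighbor.
    zero_dist = np.isinf(weights)
    rows_with_zero = zero_dist.any(axis=1)
    weights[rows_with_zero] = zero_dist[rows_with_zero].astype(float)
    return weights


# Row 0 contains an exact match; row 1 does not.
demo_weights = inverse_sqrt_weights(np.array([[0.0, 0.25, 1.0],
                                              [0.25, 1.0, 4.0]]))
print(demo_weights)

# Using the function as a custom weight on made-up data.
X_toy = np.array([[0.0], [1.0], [1.2]])
y_toy = np.array([1, 0, 0])
knn_demo = KNeighborsClassifier(n_neighbors=3, weights=inverse_sqrt_weights)
knn_demo.fit(X_toy, y_toy)
pred = knn_demo.predict(np.array([[0.1]]))[0]
print(pred)
```

Working on the whole distance array at once also avoids the row-by-row `apply` calls, which matters more as the number of query points grows.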

In [83]:

knn = KNeighborsClassifier(n_neighbors=5, weights=get_weight)
knn.fit(X_train, y_train)

print(knn.score(X_train, y_train))
print(knn.score(X_test, y_test))

<class 'numpy.ndarray'>
      0         1         2         3         4
0   0.0  0.111111  0.111111  0.333333  0.333333
1   0.0  0.000000  0.111111  0.222222  0.222222
2   0.0  0.000000  0.000000  0.111111  0.222222
3   0.0  0.000000  0.111111  0.222222  0.222222
4   0.0  0.000000  0.000000  0.111111  0.111111
..  ...       ...       ...       ...       ...
91  0.0  0.000000  0.000000  0.000000  0.111111
92  0.0  0.000000  0.000000  0.111111  0.111111
93  0.0  0.000000  0.222222  0.222222  0.222222
94  0.0  0.000000  0.222222  0.222222  0.333333
95  0.0  0.000000  0.111111  0.111111  0.111111

[96 rows x 5 columns]
      0    1        2         3         4
0   0.0  3.0  3.00000  1.732051  1.732051
1   0.0  0.0  3.00000  2.121320  2.121320
2   0.0  0.0  0.00000  3.000000  2.121320
3   0.0  0.0  3.00000  2.121320  2.121320
4   0.0  0.0  0.00000  3.000000  3.000000
..  ...  ...      ...       ...       ...
91  0.0  0.0  0.00000  0.000000  3.000000
92  0.0  0.0  0.00000  3.000000  3.000000

I think I did it