## Day 31 Lecture 2 Assignment

In this assignment, we will learn about the weighting and scaling with the K nearest neighbor algorithm. We will use the wine quality dataset loaded below and analyze the model generated for this dataset.

In [1]:
%matplotlib inline

import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
wine = pd.read_csv('https://raw.githubusercontent.com/Thinkful-Ed/data-science-lectures/master/wineQualityReds.csv', index_col=0)
wine.head()

Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
2,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
3,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
4,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
5,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


Recall that we need to check for missing data.

In [3]:
# answer below:
wine.isnull().sum()


fixed.acidity           0
volatile.acidity        0
citric.acid             0
residual.sugar          0
chlorides               0
free.sulfur.dioxide     0
total.sulfur.dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

Convert quality to a binary variable, with the dividing line between 5 and 6.

In [4]:
# answer below
wine['quality'] = np.where(wine['quality'] >= 6, 1, 0)

wine.head()

Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,0
2,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,0
3,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,0
4,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,1
5,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,0


Create a train test split with 20% of the data in the test subsample.

In [5]:
# answer below:
from sklearn.model_selection import train_test_split

X = wine.drop('quality', axis=1)
y = wine['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


Scale only the independent variables using the minmax scaler.

In [6]:
# answer below:
from sklearn.preprocessing import MinMaxScaler

scale = MinMaxScaler()

X_train_scale = scale.fit_transform(X_train)
X_test_scale = scale.fit_transform(X_test)


Create a KNN model with k=5 and report the accuracy scores. Then make a second model using the scaled data and compare your results.

In [7]:
# answer below:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train, y_train)

print(f"KNN Train Score: {knn.score(X_train, y_train):.3f}")
print(f"KNN Test Score: {knn.score(X_test, y_test):.3f}")

KNN Train Score: 0.764
KNN Test Score: 0.675


In [8]:
knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train_scale, y_train)

print(f"KNN Train Scaled Score: {knn.score(X_train_scale, y_train):.3f}")
print(f"KNN Test Scaled Score: {knn.score(X_test_scale, y_test):.3f}")

KNN Train Scaled Score: 0.817
KNN Test Scaled Score: 0.728


**Scaled model has better training and test scores.**

When generating a KNN model, we can use the weighted model by setting `weights='distance'`. We can also write our own custom weights function.

Write a custom weight function that assigns the weight of 1/sqrt(distance) and use this function in your model. Report the accuracy score.

Hint: Use the `_get_weights` function in scikit learn as a resource. The code is <a href="https://github.com/scikit-learn/scikit-learn/blob/fdbaa58acbead5a254f2e6d597dc1ab3b947f4c6/sklearn/neighbors/base.py#L63" title="_get_weights">here</a>.

In [9]:
# answer below:

def get_weights(dist):
    """Get the weights from an array of distances and a parameter ``weights``
    Parameters
    ----------
    dist : ndarray
        The input distances
    weights : {'uniform', 'distance' or a callable}
        The kind of weighting used
    Returns
    -------
    weights_arr : array of the same shape as ``dist``
        if ``weights == 'uniform'``, then returns None
    """
    if dist.dtype is np.dtype(object):
        for point_dist_i, point_dist in enumerate(dist):
            # check if point_dist is iterable
            # (ex: RadiusNeighborClassifier.predict may set an element of
            # dist to 1e-6 to represent an 'outlier')
            if hasattr(point_dist, '__contains__') and 0. in point_dist:
                dist[point_dist_i] = point_dist == 0.
            else:
                dist[point_dist_i] = 1. / np.sqrt(point_dist)
    else:
        with np.errstate(divide='ignore'):
            dist = 1. / np.sqrt(dist)
        inf_mask = np.isinf(dist)
        inf_row = np.any(inf_mask, axis=1)
        dist[inf_row] = inf_mask[inf_row]
    return dist

In [10]:
#custom distance weighted model
knn = KNeighborsClassifier(n_neighbors=5, weights=get_weights)

knn.fit(X_train_scale, y_train)

print(f"KNN Train Scaled Score: {knn.score(X_train_scale, y_train):.3f}")
print(f"KNN Test Scaled Score: {knn.score(X_test_scale, y_test):.3f}")

KNN Train Scaled Score: 1.000
KNN Test Scaled Score: 0.728


In [11]:
#distance weighted model
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')

knn.fit(X_train_scale, y_train)

print(f"KNN Train Scaled Score: {knn.score(X_train_scale, y_train):.3f}")
print(f"KNN Test Scaled Score: {knn.score(X_test_scale, y_test):.3f}")

KNN Train Scaled Score: 1.000
KNN Test Scaled Score: 0.734


Our custom weighted distance model did slightly worse on the test data than KNN's default weighted distance. Training scores improved to 100% using these approaches compared to our scaled model training score of 0.817. However test scores were the same for the scaled model (0.728) and our custom weighted distance model (0.728). We did see a slight improvement in test score to (0.734) with the KNN default weighted distance model.