### Required Assignment 12.1: Identifying the Best K

This activity focuses on identifying the "best" number of neighbors, meaning the value that optimizes the accuracy of a `KNeighborsClassifier` estimator. The ideal number of neighbors will be selected through cross-validation and a grid search over the `n_neighbors` parameter. As before, you will scale the data inside a `Pipeline` before fitting the model.

**Expected Time: 60 Minutes**

**Total Points: 50**

#### Index

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)
- [Problem 6](#Problem-6)


In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

### The Dataset

Again, you will use the credit default dataset to predict default (yes or no). The data is loaded and split into train and test sets for you below. You will again build a column transformer to encode the `student` feature. Note that scikit-learn's `KNeighborsClassifier` handles a string target, so we do not need to encode the `default` column.

In [10]:
df = pd.read_csv('data/default.csv', index_col=0)

In [11]:
df.head(2)

|   | default | student | balance | income |
|---|---------|---------|---------|--------|
| 1 | No | No | 729.526495 | 44361.625074 |
| 2 | No | Yes | 817.180407 | 12106.1347 |


In [12]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('default', axis = 1), 
                                                    df['default'],
                                                   random_state=42)

[Back to top](#-Index)

### Problem 1

#### Baseline for Models

**5 Points**

Before starting the modeling process, you should have a baseline to determine whether your model is any good. 

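Consider the `default` column of `df`. Perform a `value_counts` operation with the argument `normalize` equal to `True`. The largest value is the share of the majority class, which is also the accuracy of a classifier that always predicts that class.

What would the accuracy of such a classifier be?  Enter your answer as a float to `baseline` below.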
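
As a sanity check on the idea, here is a minimal sketch on a made-up target. The `toy_target` series and its 9-to-1 split are assumptions for illustration, not the credit data. It shows that always predicting the most frequent class is correct exactly as often as that class appears.

```python
import pandas as pd

# Hypothetical toy target: 9 "No" labels and 1 "Yes" label.
toy_target = pd.Series(["No"] * 9 + ["Yes"])

# Relative frequency of each class, sorted from most to least common.
class_shares = toy_target.value_counts(normalize=True)

# The first entry belongs to the majority class; its share is the
# accuracy of a classifier that always predicts that class.
majority_label = class_shares.index[0]
toy_baseline = float(class_shares.iloc[0])

print(majority_label, toy_baseline)
```

Any model worth keeping should beat this number on held-out data.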



In [13]:
### GRADED

baseline = df['default'].value_counts(normalize = True).iloc[0]
# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(baseline)

0.9667


[Back to top](#-Index)

### Problem 2

#### Column transforms and KNN

**10 Points**

Use `make_column_transformer` to create a column transformer named `transformer`. Inside `make_column_transformer`, specify an instance of the `OneHotEncoder` transformer from scikit-learn with `drop` set to `'if_binary'`, and apply it to the `student` column. To the `remainder` columns, apply a `StandardScaler()` transformation.
 

Next, build a `Pipeline` named `knn_pipe` with steps `transform` and `knn`. Set `transform` equal to `transformer` and `knn` equal to `KNeighborsClassifier()`. Be sure to leave all the settings in `knn` at their defaults.
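
To see what such a transformer produces, here is a small sketch on a hypothetical four-row frame. The `toy` data and the names `toy_transformer` and `toy_encoded` are assumptions for illustration. With `drop='if_binary'`, a two-category column becomes a single 0/1 column, and the remaining numeric columns are standardized.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with the same column names as the credit dataset.
toy = pd.DataFrame({
    "student": ["No", "Yes", "No", "Yes"],
    "balance": [100.0, 200.0, 300.0, 400.0],
    "income": [10.0, 20.0, 30.0, 40.0],
})

# One-hot encode the binary column, scale everything else.
toy_transformer = make_column_transformer(
    (OneHotEncoder(drop="if_binary"), ["student"]),
    remainder=StandardScaler(),
)

toy_encoded = toy_transformer.fit_transform(toy)

# One encoded column for `student` plus two scaled numeric columns.
print(toy_encoded.shape)
```

The encoded `student` column comes first in the output, followed by the remainder columns in their original order.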

In [14]:
### GRADED

transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), ['student']),
    remainder=StandardScaler()
)
knn_pipe = Pipeline([('transform', transformer), ('knn', KNeighborsClassifier())])

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
knn_pipe

[Back to top](#-Index)

### Problem 3

#### Parameter grid

**10 Points**

Now that your pipeline is ready, you are to construct a parameter grid to search over.  Consider two things:

- `KNeighborsClassifier` cannot predict when `n_neighbors` exceeds the number of samples it was fit on, which puts an upper bound on `k`. In this example, too high a `k` will also slow down the computation, so only consider `k = [1, 3, 5, ..., 21]`.
- Ties in voting are decided somewhat arbitrarily, so for speed and clarity you should consider only odd values for the number of neighbors.

Create a dictionary called `params` that specifies hyperparameters for the KNN classifier.

- The key of your dictionary will be `knn__n_neighbors`
- The values in your dictionary will be `list(range(1, 22, 2))`
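
The key follows the `<step name>__<parameter name>` convention that pipelines use to route a parameter to one of their steps. A minimal sketch, assuming a stand-in pipeline called `demo_pipe` (not the graded `knn_pipe`), lists the keys a grid is allowed to use for the `knn` step:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline for illustration; only the step name "knn" matters here.
demo_pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Every tunable parameter of the "knn" step appears with a "knn__" prefix.
knn_keys = sorted(k for k in demo_pipe.get_params() if k.startswith("knn__"))
print(knn_keys)

# A grid key is valid only if it is one of the pipeline's parameter names.
demo_params = {"knn__n_neighbors": list(range(1, 22, 2))}
keys_are_valid = all(key in demo_pipe.get_params() for key in demo_params)
print(keys_are_valid)
```

A misspelled key, such as one with a single underscore, would be rejected when the grid search tries to set it.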



In [15]:
### GRADED

params = {
    'knn__n_neighbors': list(range(1, 22, 2))
}


# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
list(params.values())[0]

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]

[Back to top](#-Index)

### Problem 4

#### Grid search `k`

**10 Points**

- Use `GridSearchCV` with the `knn_pipe` and `param_grid` equal to `params`. Assign the result to `knn_grid`.
- Use the `fit` function on `knn_grid` to train your model on `X_train` and `y_train`.
- Retrieve the best value for the hyperparameter `k` from the `best_params_` attribute of the grid search object `knn_grid`. Assign the result to `best_k`.
- Use the `score` function to calculate the accuracy of the `knn_grid` classifier on the test dataset. Assign your best model's accuracy on the test data as a float to `best_acc`.
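
The same workflow can be sketched on synthetic data to show where each result lives. Everything here is an assumption for illustration: the `make_classification` data, the shortened grid, and the `demo_`-prefixed names. The per-`k` cross-validation scores sit in `cv_results_`, the winning setting in `best_params_`, and `score` evaluates the refit best model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data.
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

demo_pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
demo_grid = GridSearchCV(demo_pipe, param_grid={"knn__n_neighbors": [1, 3, 5, 7]})
demo_grid.fit(Xtr, ytr)

# Best k according to mean cross-validated accuracy on the training data.
demo_best_k = demo_grid.best_params_["knn__n_neighbors"]

# Accuracy of the refit best pipeline on the held-out test split.
demo_test_acc = demo_grid.score(Xte, yte)

# One row per candidate k with its mean cross-validation score.
scores = pd.DataFrame(demo_grid.cv_results_)[["param_knn__n_neighbors", "mean_test_score"]]
print(scores)
print(demo_best_k, demo_test_acc)
```

Looking at the full `scores` table, rather than only the best entry, shows how sensitive the accuracy is to the choice of `k`.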



In [17]:
### GRADED

knn_grid = GridSearchCV(estimator=knn_pipe, param_grid=params, scoring='accuracy')
knn_grid.fit(X_train, y_train)
best_k = knn_grid.best_params_['knn__n_neighbors']
best_acc = knn_grid.score(X_test, y_test)

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(best_acc)
print(best_k)

0.9708
11


[Back to top](#-Index)

### Problem 5

#### Other parameters to consider

**10 Points**

The number of neighbors is not the only parameter in the implementation from scikit-learn.  For example, you can also consider different weightings of points based on their distance, change the distance metric, and search over alternative versions of certain metrics like Minkowski.  See the docstring from `KNeighborsClassifier` below. 

```
weights : {'uniform', 'distance'} or callable, default='uniform'
    Weight function used in prediction.  Possible values:

    - 'uniform' : uniform weights.  All points in each neighborhood
      are weighted equally.
    - 'distance' : weight points by the inverse of their distance.
      in this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
    - [callable] : a user-defined function which accepts an
      array of distances, and returns an array of the same shape
      containing the weights.
      
===========================

p : int, default=2
    Power parameter for the Minkowski metric. When p = 1, this is
    equivalent to using manhattan_distance (l1), and euclidean_distance
    (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
    
```

Create a new parameter grid and consider both weightings as well as `p = [1, 2]`.  Assign this as a dictionary to `params2` below.  

Search over these parameters in your `knn_pipe` with a `GridSearchCV` named `weight_grid` below. Also include `n_neighbors` as in [Problem 4](#-Problem-4). Did your new grid search perform better than the earlier one? Assign this grid's accuracy on the test data to `weights_acc` below.
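
To make the two options from the docstring concrete, here is a plain-Python sketch. The helper `minkowski`, the two points, and the three hypothetical neighbors are assumptions for illustration, and this is not scikit-learn's internal implementation. It computes the distance for `p = 1` and `p = 2`, then compares a uniform vote with an inverse-distance vote.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

point_a, point_b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski(point_a, point_b, p=2)  # sqrt(3**2 + 4**2)
print(manhattan, euclidean)

# Hypothetical neighbors of a query point: (label, distance to the query).
neighbors = [("Yes", 0.5), ("No", 2.0), ("No", 4.0)]

uniform_votes = {}
distance_votes = {}
for label, dist in neighbors:
    # 'uniform': every neighbor counts once.
    uniform_votes[label] = uniform_votes.get(label, 0) + 1
    # 'distance': each neighbor counts 1 / distance, so closer points weigh more.
    distance_votes[label] = distance_votes.get(label, 0) + 1 / dist

uniform_winner = max(uniform_votes, key=uniform_votes.get)
distance_winner = max(distance_votes, key=distance_votes.get)
print(uniform_winner, distance_winner)
```

In this toy case the two weightings disagree: the majority of neighbors says `No`, while the single very close neighbor outweighs them under inverse-distance weighting.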

In [28]:
### GRADED

params2 = {
    'knn__n_neighbors': list(range(1, 22, 2)),
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2]
}
weight_grid = GridSearchCV(estimator=knn_pipe, param_grid=params2)
weight_grid.fit(X_train, y_train)
weights_acc = weight_grid.score(X_test, y_test)

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(weights_acc)

0.9708


[Back to top](#-Index)

### Problem 6

#### Further considerations

**5 Points**

When performing your grid search, you also want to be sensitive to the number of parameters you are searching over and the number of different models being built.  How many models were constructed in [Problem 5](#-Problem-5)?  Enter your answer as an integer to `ans6` below.  You might use the grid's `.cv_results_` attribute to determine this.
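
The count can also be worked out by hand. A minimal sketch, assuming the three-parameter grid described in Problem 5 and the default of 5 cross-validation folds in `GridSearchCV` (the names `grid`, `n_combinations`, `n_folds`, and `n_cv_fits` are illustrative):

```python
from itertools import product

# Assumed grid: 11 neighbor counts, 2 weightings, 2 values of p.
grid = {
    "knn__n_neighbors": list(range(1, 22, 2)),
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

# Every combination of one value per parameter is one candidate setting.
n_combinations = len(list(product(*grid.values())))

# Assumption: the grid search uses its default 5-fold cross-validation,
# so each candidate is fit once per fold.
n_folds = 5
n_cv_fits = n_combinations * n_folds

print(n_combinations, n_cv_fits)
```

This counts cross-validation fits only; with the default `refit=True`, the grid search also fits the best setting once more on the full training set.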

In [30]:
### GRADED
ans6 = 5 * 44  # 5 cross-validation folds x (11 * 2 * 2) parameter combinations

# YOUR CODE HERE
#raise NotImplementedError()

# Answer check
print(ans6)

220
