<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# KNN Classification and Imputation: Cell Phone Churn Data

_Authors: Kiefer Katovich (SF)_

---

In this lab you will practice using KNN for classification (and a little bit for regression as well).

The dataset covers "churn" in cell phone plans. It has information on how different account holders used their phones and whether or not they churned.

Our goal is to predict whether a user will churn or not based on the other features.

We will also be using the KNN model to **impute** missing data. There are a couple of columns in the dataset with missing values, and we can build KNN models to predict what those missing values will most likely be. This is a more advanced imputation method than just filling in the mean or median.

In [55]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.neighbors import KNeighborsClassifier

### 1. Load the cell phone "churn" data containing some missing values.

In [56]:
churn = pd.read_csv('churn_missing.csv')

### 2. Examine the data. What columns have missing values?

In [57]:
# A:

In [58]:
churn.head()

Unnamed: 0,state,account_length,area_code,intl_plan,vmail_plan,vmail_message,day_mins,day_calls,day_charge,eve_mins,eve_calls,eve_charge,night_mins,night_calls,night_charge,intl_mins,intl_calls,intl_charge,custserv_calls,churn
0,KS,128,415,no,yes,25.0,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,no,yes,26.0,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,no,no,0.0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,yes,no,0.0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,yes,no,0.0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [59]:
churn.shape

(3333, 20)

In [60]:
round(churn.describe(include='all'),1).transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
state,3333,51.0,WV,106.0,,,,,,,
account_length,3333,,,,101.1,39.8,1.0,74.0,101.0,127.0,243.0
area_code,3333,,,,437.2,42.4,408.0,408.0,415.0,510.0,510.0
intl_plan,3333,2.0,no,3010.0,,,,,,,
vmail_plan,2933,2.0,no,2130.0,,,,,,,
vmail_message,2933,,,,8.0,13.7,0.0,0.0,0.0,19.0,51.0
day_mins,3333,,,,179.8,54.5,0.0,143.7,179.4,216.4,350.8
day_calls,3333,,,,100.4,20.1,0.0,87.0,101.0,114.0,165.0
day_charge,3333,,,,30.6,9.3,0.0,24.4,30.5,36.8,59.6
eve_mins,3333,,,,201.0,50.7,0.0,166.6,201.4,235.3,363.7


In [61]:
churn.isna().sum()

state               0
account_length      0
area_code           0
intl_plan           0
vmail_plan        400
vmail_message     400
day_mins            0
day_calls           0
day_charge          0
eve_mins            0
eve_calls           0
eve_charge          0
night_mins          0
night_calls         0
night_charge        0
intl_mins           0
intl_calls          0
intl_charge         0
custserv_calls      0
churn               0
dtype: int64

In [62]:
churn.dtypes

state              object
account_length      int64
area_code           int64
intl_plan          object
vmail_plan         object
vmail_message     float64
day_mins          float64
day_calls           int64
day_charge        float64
eve_mins          float64
eve_calls           int64
eve_charge        float64
night_mins        float64
night_calls         int64
night_charge      float64
intl_mins         float64
intl_calls          int64
intl_charge       float64
custserv_calls      int64
churn                bool
dtype: object

### 3. Convert the `vmail_plan` and `intl_plan` columns to binary integer columns.

Make sure that if a value is missing that you don't fill it in with a new value! Preserve the missing values.

In [63]:
# A:

In [139]:
churn.vmail_plan.value_counts(dropna=False)

0      2130
1       803
NaN     400
Name: vmail_plan, dtype: Int64

In [138]:
churn.intl_plan.value_counts(dropna=False)

0      3010
1       323
NaN       0
Name: intl_plan, dtype: Int64

In [66]:
churn.vmail_plan = churn.vmail_plan.map(dict(yes=1, no=0)).astype('Int64')
churn.intl_plan = churn.intl_plan.map(dict(yes=1, no=0)).astype('Int64')

In [67]:
churn.vmail_plan.value_counts()

0    2130
1     803
Name: vmail_plan, dtype: Int64

In [68]:
churn.intl_plan.value_counts()

0    3010
1     323
Name: intl_plan, dtype: Int64

In [70]:
churn[['vmail_plan','intl_plan']].dtypes

vmail_plan    Int64
intl_plan     Int64
dtype: object

### 4. Create dummy coded columns for state and concatenate them to the churn dataset.

> **Remember:** You will need to leave out one of the state dummy coded columns to serve as the "reference" column since we will be using these for modeling.

In [5]:
# A:

In [79]:
churn.state.value_counts().sort_index()

AK     52
AL     80
AR     55
AZ     64
CA     34
CO     66
CT     74
DC     54
DE     61
FL     63
GA     54
HI     53
IA     44
ID     73
IL     58
IN     71
KS     70
KY     59
LA     51
MA     65
MD     70
ME     62
MI     73
MN     84
MO     63
MS     65
MT     68
NC     68
ND     62
NE     61
NH     56
NJ     68
NM     62
NV     66
NY     83
OH     78
OK     61
OR     78
PA     45
RI     65
SC     60
SD     60
TN     53
TX     72
UT     72
VA     77
VT     73
WA     66
WI     78
WV    106
WY     77
Name: state, dtype: int64

In [106]:
state_dummies = pd.get_dummies(churn.state.sort_values(), drop_first=True)

In [107]:
churn.shape

(3333, 20)

In [108]:
state_dummies.shape

(3333, 50)

In [109]:
churn = churn.merge(state_dummies, left_index=True, right_index=True)

In [112]:
churn.drop(columns='state', inplace=True)

In [140]:
churn.head()

Unnamed: 0,account_length,area_code,intl_plan,vmail_plan,vmail_message,day_mins,day_calls,day_charge,eve_mins,eve_calls,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
0,128,415,0,1,25.0,265.1,110,45.07,197.4,99,...,0,0,0,0,0,0,0,0,0,0
1,107,415,0,1,26.0,161.6,123,27.47,195.5,103,...,0,0,0,0,0,0,0,0,0,0
2,137,415,0,0,0.0,243.4,114,41.38,121.2,110,...,0,0,0,0,0,0,0,0,0,0
3,84,408,1,0,0.0,299.4,71,50.9,61.9,88,...,0,0,0,0,0,0,0,0,0,0
4,75,415,1,0,0.0,166.7,113,28.34,148.3,122,...,0,0,0,0,0,0,0,0,0,0


In [219]:
churn.vmail_plan.value_counts(dropna=False)

0      2130
1       803
NaN     400
Name: vmail_plan, dtype: Int64

In [216]:
churn.vmail_message.value_counts(dropna=False)

0.0     2130
NaN      400
31.0      46
29.0      46
28.0      45
33.0      42
27.0      42
24.0      37
30.0      37
26.0      36
32.0      36
25.0      33
36.0      32
23.0      31
22.0      30
35.0      28
21.0      27
39.0      25
34.0      24
38.0      23
37.0      22
20.0      19
40.0      15
19.0      14
42.0      13
17.0      12
41.0      10
16.0       9
43.0       9
15.0       7
18.0       6
14.0       6
12.0       6
44.0       6
45.0       5
46.0       4
13.0       4
47.0       3
48.0       2
11.0       2
8.0        2
50.0       2
9.0        2
4.0        1
51.0       1
49.0       1
Name: vmail_message, dtype: int64

In [225]:
churn[churn.vmail_plan.isnull()]

Unnamed: 0,account_length,area_code,intl_plan,vmail_plan,vmail_message,day_mins,day_calls,day_charge,eve_mins,eve_calls,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
6,121,510,0,,,218.2,88,37.09,348.5,108,...,0,0,0,0,0,0,0,0,0,0
8,117,408,0,,,184.5,97,31.37,351.6,80,...,0,0,0,0,0,0,0,0,0,0
15,161,415,0,,,332.9,67,56.59,317.8,97,...,0,0,0,0,0,0,0,0,0,0
21,77,408,0,,,62.4,89,10.61,169.9,121,...,0,0,0,0,0,0,0,0,0,0
22,130,415,0,,,183.0,112,31.11,72.9,99,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3251,88,408,0,,,274.6,105,46.68,161.1,121,...,0,0,0,0,0,0,0,0,0,0
3254,57,415,0,,,179.2,105,30.46,283.2,83,...,0,0,0,0,0,0,0,0,0,0
3290,127,510,0,,,107.9,128,18.34,187.0,77,...,0,0,0,0,0,0,0,0,0,0
3302,75,510,1,,,153.2,78,26.04,210.8,99,...,0,0,0,0,0,0,0,0,0,0


In [224]:
churn[churn.vmail_message.isnull()]

Unnamed: 0,account_length,area_code,intl_plan,vmail_plan,vmail_message,day_mins,day_calls,day_charge,eve_mins,eve_calls,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
6,121,510,0,,,218.2,88,37.09,348.5,108,...,0,0,0,0,0,0,0,0,0,0
8,117,408,0,,,184.5,97,31.37,351.6,80,...,0,0,0,0,0,0,0,0,0,0
15,161,415,0,,,332.9,67,56.59,317.8,97,...,0,0,0,0,0,0,0,0,0,0
21,77,408,0,,,62.4,89,10.61,169.9,121,...,0,0,0,0,0,0,0,0,0,0
22,130,415,0,,,183.0,112,31.11,72.9,99,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3251,88,408,0,,,274.6,105,46.68,161.1,121,...,0,0,0,0,0,0,0,0,0,0
3254,57,415,0,,,179.2,105,30.46,283.2,83,...,0,0,0,0,0,0,0,0,0,0
3290,127,510,0,,,107.9,128,18.34,187.0,77,...,0,0,0,0,0,0,0,0,0,0
3302,75,510,1,,,153.2,78,26.04,210.8,99,...,0,0,0,0,0,0,0,0,0,0


In [227]:
churn.vmail_message = pd.to_numeric(churn.vmail_message, errors='coerce')

In [230]:
churn.vmail_plan = pd.to_numeric(churn.vmail_plan, errors='coerce')

### 5. Create a version of the churn data that has no missing values.

Report its shape.

In [113]:
# A:
churn.shape

(3333, 69)

In [121]:
churn.vmail_plan.value_counts(dropna=False)

0      2130
1       803
NaN     400
Name: vmail_plan, dtype: Int64

In [137]:
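# Create X_test: rows where vmail_plan is missing, minus the two columns to impute.
# A non-inplace drop avoids modifying a slice of churn (SettingWithCopyWarning).
X_test = churn[churn.vmail_plan.isna()].drop(columns=['vmail_plan','vmail_message'])
X_test.head()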

Unnamed: 0,account_length,area_code,intl_plan,day_mins,day_calls,day_charge,eve_mins,eve_calls,eve_charge,night_mins,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
6,121,510,0,218.2,88,37.09,348.5,108,29.62,212.6,...,0,0,0,0,0,0,0,0,0,0
8,117,408,0,184.5,97,31.37,351.6,80,29.89,215.8,...,0,0,0,0,0,0,0,0,0,0
15,161,415,0,332.9,67,56.59,317.8,97,27.01,160.6,...,0,0,0,0,0,0,0,0,0,0
21,77,408,0,62.4,89,10.61,169.9,121,14.44,209.6,...,0,0,0,0,0,0,0,0,0,0
22,130,415,0,183.0,112,31.11,72.9,99,6.2,181.8,...,0,0,0,0,0,0,0,0,0,0


In [136]:
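# Create X_train: rows where vmail_plan is known, minus the two columns to impute.
# A non-inplace drop avoids modifying a slice of churn (SettingWithCopyWarning).
X_train = churn[~churn.vmail_plan.isna()].drop(columns=['vmail_plan','vmail_message'])
X_train.head()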

Unnamed: 0,account_length,area_code,intl_plan,day_mins,day_calls,day_charge,eve_mins,eve_calls,eve_charge,night_mins,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
0,128,415,0,265.1,110,45.07,197.4,99,16.78,244.7,...,0,0,0,0,0,0,0,0,0,0
1,107,415,0,161.6,123,27.47,195.5,103,16.62,254.4,...,0,0,0,0,0,0,0,0,0,0
2,137,415,0,243.4,114,41.38,121.2,110,10.3,162.6,...,0,0,0,0,0,0,0,0,0,0
3,84,408,1,299.4,71,50.9,61.9,88,5.26,196.9,...,0,0,0,0,0,0,0,0,0,0
4,75,415,1,166.7,113,28.34,148.3,122,12.61,186.9,...,0,0,0,0,0,0,0,0,0,0


In [210]:
# Create Y Train
Y_train = churn[~churn.vmail_plan.isna()]
Y_train_vmail_plan = Y_train.vmail_plan.astype(int)
Y_train_vmail_message = Y_train.vmail_message.astype(int)

In [211]:
# Check Shapes and Types
print(X_train.shape)
print(Y_train_vmail_plan.shape)

print(type(X_train.values))
print(type(Y_train_vmail_plan))

(2933, 67)
(2933,)
<class 'numpy.ndarray'>
<class 'pandas.core.series.Series'>


In [196]:
Y_train_vmail_plan.value_counts(dropna=False)

0    2130
1     803
Name: vmail_plan, dtype: int64

In [197]:
# KNN to predict vmail_plan

# Instantiate
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)


# Fit
knn.fit(X_train.values, Y_train_vmail_plan)

# Predict (pass .values so the input matches the array the model was fit on)
vmail_plan_preds = knn.predict(X_test.values)

In [232]:
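# KNN to predict vmail_message
# Note: a classifier treats every message count as its own class label;
# section 11 asks for KNeighborsRegressor instead.

# Instantiate
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

# Fit
knn.fit(X_train.values, Y_train_vmail_message)

# Predict (pass .values so the input matches the array the model was fit on)
vmail_message_preds = knn.predict(X_test.values)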

In [233]:
vmail_plan_preds

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,

In [234]:
vmail_message_preds

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

#### Try using KNN Imputer 

In [207]:
from sklearn.impute import KNNImputer

In [213]:
print(churn.vmail_plan.dtypes)
print(churn.vmail_message.dtypes)

Int64
float64


In [231]:
imputer = KNNImputer(n_neighbors=5)
imputer.fit_transform(churn)

TypeError: float() argument must be a string or a number, not 'NAType'
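The error above appears to come from the nullable `Int64` columns: their missing entries are `pd.NA`, which cannot be converted to a float. One possible workaround, sketched here on a small made-up frame (the `toy` data and its values are assumptions, not the churn file), is to cast the frame to `float64` first so that the missing entries become `np.nan`.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in for the churn frame: one nullable-integer column containing pd.NA
toy = pd.DataFrame({
    "day_mins": [265.1, 161.6, 243.4, 299.4, 166.7, 223.4],
    "vmail_plan": pd.array([1, 1, 0, pd.NA, 0, pd.NA], dtype="Int64"),
})

# Casting to float64 turns pd.NA into np.nan, which KNNImputer can handle
toy_float = toy.astype("float64")

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy_float), columns=toy.columns)
print(imputed)
```

Note that `KNNImputer` averages the neighbors' values, so a binary column such as `vmail_plan` can come back with fractional values that would need rounding, and the distances are computed on unscaled features unless you standardize first.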

### 6. Create a target vector and predictor matrix.

- Target should be the `churn` column.
- Predictor matrix should be all columns except `area_code`, `state`, and `churn`.

In [7]:
# A:
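As a rough sketch of this step, the target and predictor matrix can be built by selecting one column and dropping the excluded ones. The `toy` frame below is invented for illustration. Passing `errors="ignore"` is a convenience assumption so the drop still works if `state` has already been removed, as it was in step 4.

```python
import pandas as pd

# Toy stand-in for the churn frame (values are illustrative only)
toy = pd.DataFrame({
    "state": ["KS", "OH", "NJ"],
    "area_code": [415, 415, 408],
    "day_mins": [265.1, 161.6, 243.4],
    "custserv_calls": [1, 1, 0],
    "churn": [False, False, True],
})

# Target vector: churn as 0/1
y = toy["churn"].astype(int)

# Predictor matrix: everything except the excluded columns
X = toy.drop(columns=["area_code", "state", "churn"], errors="ignore")
print(X.columns.tolist())
```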


### 7. Calculate the baseline accuracy for `churn`.

In [8]:
# A:
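The baseline accuracy is the accuracy you would get by always predicting the majority class. A minimal sketch, using a made-up target series in place of `churn['churn']`:

```python
import pandas as pd

# Hypothetical target values standing in for churn['churn']
y = pd.Series([False, False, True, False, False, False, True, False])

# Baseline: proportion of the most common class
baseline = y.value_counts(normalize=True).max()
print(baseline)  # 0.75, since 6 of the 8 values are False
```

Any model worth keeping should beat this number.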

### 8. Cross-validate a KNN model predicting `churn`. 

- Number of neighbors should be 5.
- Make sure to standardize the predictor matrix.
- Set cross-validation folds to 10.

Report the mean cross-validated accuracy.

In [9]:
# A:
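One way to approach this step is sketched below on synthetic data from `make_classification` (an assumption, since the churn file is not loaded here). Putting the `StandardScaler` inside a pipeline is a design choice rather than something the lab requires: it refits the scaler on each training fold, so the held-out fold never influences the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictor matrix and churn target
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Standardize, then classify with 5 neighbors
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 10-fold cross-validated accuracy
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(scores.mean())
```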

### 9. Iterate from k=1 to k=49 (only odd k) and cross-validate the accuracy of the model for each.

Plot the cross-validated mean accuracy for each value of k. What is the best accuracy?

In [10]:
# A:
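The search over k might look like the sketch below. It again uses synthetic data as a stand-in for the churn predictors, and the names `ks`, `mean_scores`, and `best_k` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictor matrix and churn target
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

ks = list(range(1, 50, 2))  # odd k from 1 to 49
mean_scores = []
for k in ks:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores.append(cross_val_score(model, X, y, cv=10).mean())

# k with the highest mean cross-validated accuracy
best_k = ks[int(np.argmax(mean_scores))]
print(best_k, max(mean_scores))
```

Passing `ks` and `mean_scores` to `plt.plot` gives the accuracy curve the question asks for.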

### 10. Imputing with KNN

K-Nearest Neighbors can be used to impute missing values in datasets. What we will do is estimate the most likely value for the missing data based on a KNN model.

We have two columns with missing data:
- `vmail_plan`
- `vmail_message`

**10.A Create two subsets of the churn dataset: one without missing values for `vmail_plan` and `vmail_message`, and one with the missing values.**

In [11]:
# A:

First we will impute values for `vmail_plan`. This is a categorical column and so we will impute using classification (predicting whether the plan is yes or no, 1 vs. 0).

**10.B Create a target that is `vmail_plan` and predictor matrix that is all columns except `state`, `area_code`, `churn`, `vmail_plan`, and `vmail_message`.**

> **Note:** We don't include the `churn` variable in the model to impute. Why? We are imputing these missing values so that we can use the rows to predict churn with more data afterwards. If we imputed with churn as a predictor then we would be cheating.

In [12]:
# A:

**10.C Standardize the predictor matrix.**

In [13]:
# A:

**10.D Find the best K for predicting `vmail_plan`.**

You may want to write a function for this. What is the accuracy for predicting `vmail_plan` at the best K? What is the baseline accuracy for `vmail_plan`?

In [14]:
# A:

**10.E Fit a `KNeighborsClassifier` with the best number of neighbors.**

In [15]:
# A:

**10.F Predict the missing `vmail_plan` values using the subset of the data where it is missing.**

You will need to:
1. Create a new predictor matrix using the same predictors but from the missing subset of data.
2. Standardize this predictor matrix *using the `StandardScaler` object fit on the non-missing data*. This means you will only call `.transform()`. The new predictors must be standardized the same way as the original predictors for the predictions to make sense; calling `.fit_transform()` would refit the scaler and change the scale.
3. Predict what the missing `vmail_plan` values should be.
4. Replace the missing values in the original data with the predicted values.

> **Note:** It may predict all 0's. This is OK. If you want to see the predicted probabilities of `vmail_plan` for each row, you can use the `.predict_proba()` function instead of `.predict()`. You can use these probabilities to set the classification threshold manually.

In [16]:
# A:
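The four steps above can be sketched as follows. The `toy` frame, the two predictor columns, `n_neighbors=5`, and the 0.5 threshold are all assumptions chosen for illustration; in the lab you would use the real predictors and the best k from 10.D.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Toy data: 40 rows, with vmail_plan missing in the first 8
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "day_mins": rng.normal(180, 50, 40),
    "eve_mins": rng.normal(200, 50, 40),
    "vmail_plan": rng.integers(0, 2, 40).astype(float),
})
toy.loc[toy.index[:8], "vmail_plan"] = np.nan

predictors = ["day_mins", "eve_mins"]
known = toy[toy["vmail_plan"].notna()]
missing = toy[toy["vmail_plan"].isna()]

# Fit the scaler on the non-missing rows only, then reuse it on the missing rows
scaler = StandardScaler()
X_known = scaler.fit_transform(known[predictors])
X_missing = scaler.transform(missing[predictors])  # transform only, no refit

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_known, known["vmail_plan"].astype(int))

# Probability of class 1, thresholded manually (0.5 is an assumed cutoff)
probs = knn.predict_proba(X_missing)[:, 1]
preds = (probs >= 0.5).astype(int)

# Write the predictions back into the original frame
toy.loc[missing.index, "vmail_plan"] = preds
print(toy["vmail_plan"].isna().sum())
```

Lowering the threshold below 0.5 is one way to get some predicted 1's when the plain majority vote returns all 0's.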

### 11. Impute the missing values for `vmail_message` using the same process.

Since `vmail_message` is essentially a continuous measure, you need to use `KNeighborsRegressor` instead of the `KNeighborsClassifier`.

KNN can do both regression and classification! Instead of "voting" on the class like in classification, the neighbors will average their value for the target in regression.

In [17]:
# A:
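To make the voting-versus-averaging distinction concrete, here is a tiny sketch on made-up one-dimensional data. The arrays are invented for illustration and are not taken from the churn file.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_count = np.array([0, 0, 3, 30, 30, 33])  # e.g. voicemail message counts
y_label = np.array([0, 0, 0, 1, 1, 1])     # e.g. voicemail plan no/yes

reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_count)
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_label)

# The 3 nearest neighbors of 1.2 are the points at 1.0, 2.0 and 0.0
query = np.array([[1.2]])
reg_pred = reg.predict(query)[0]  # mean of targets 0, 3, 0 -> 1.0
clf_pred = clf.predict(query)[0]  # majority vote of labels 0, 0, 0 -> 0
print(reg_pred, clf_pred)
```

For the regressor, the score reported by `cross_val_score` defaults to $R^2$ rather than accuracy.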

### 12. Given the accuracy (and $R^2$) of your best imputation models when finding the best K neighbors, do you think imputing is a good idea?

In [18]:
# A:

### 13. With the imputed dataset, cross-validate the accuracy predicting churn. Is it better? Worse? The same?

In [19]:
# A: