# Week 1 Lab: k Nearest Neighbors

We're going to take a look at a simple way to use one of the easiest Machine Learning algorithms - k Nearest Neighbors (kNN).

k Nearest Neighbors works by treating the feature values of each item as the co-ordinates of a point in space. It then guesses the category of new data by calculating how close or far the new point is to the training points in this space, and taking the most common category among its k closest neighbors.
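To make the idea concrete, here is a minimal sketch of kNN in plain Python. The helper `knn_predict` and the six 2-D points are made up for illustration; the lab itself uses scikit-learn's classifier rather than this function.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k training points with the smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # The most common label among those neighbors wins
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up data: two features per item, two categories
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["Movie", "Movie", "Movie", "OVA", "OVA", "OVA"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))
print(knn_predict(train_points, train_labels, (5.1, 5.0)))
```

The first query sits inside the `Movie` cluster and the second inside the `OVA` cluster, so each takes the label of the points around it.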

We'll work on two datasets (if we have time!). The first one is an anime dataset from Kaggle (https://www.kaggle.com/CooperUnion/anime-recommendations-database). We'll train a kNN classifier and print a confusion matrix, which will tell us how accurate/precise our classifier is, and then run a cross-validation, which will give us a more reliable score of how well our classifier works. We'll then run the same analysis on a new dataset about wine (https://archive.ics.uci.edu/ml/datasets/wine), and we'll see how the results differ. This will give us some sense of how much the features and data matter to the success of the classifier.

## Import Libraries

`numpy`, `pandas`, and `re` are libraries in python that will help us with data analysis and string processing. You can learn more about them here:

**numpy**<br>
http://www.numpy.org/

**pandas**<br>
https://pandas.pydata.org/

**re**<br>
https://docs.python.org/2/library/re.html

##### `import` the numpy, pandas, and re libraries

In [2]:
import numpy as np


In [3]:
import pandas as pd

In [4]:
import re


In [5]:
anime_dataset = pd.read_csv('anime.csv', index_col = False)

In [6]:
anime_dataset.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


# Data preprocessing 

## Episodes


Many animes have an unknown number of episodes, even ones with similar ratings. On top of that, many super popular animes such as Naruto Shippuden and Attack on Titan Season 2 were still ongoing when the data was collected, so their number of episodes was recorded as "Unknown". For some animes, we'll fill in the episode numbers manually. For the other animes, we'll make some educated guesses.

Animes grouped under "OVA" (which stands for "Original Video Animation") are generally one or two episodes long, so let's just fill their unknown episode counts with 1. Animes grouped under "Movie" are counted as 1 episode, as per the dataset overview.

Fill in 1 episode for anime whose type is `Movie` and whose episodes are `Unknown`. Do this by using the `.loc` indexer in pandas (i.e. `anime_dataset.loc[...]`) with a condition that checks for `anime_dataset['type'] == 'Movie'` and `anime_dataset['episodes'] == 'Unknown'`. Set the number of episodes for rows that match the condition to 1.


Repeat the above for anime whose type is `OVA` and whose episodes are `Unknown`, with 1 episode.

In [8]:
### Preprocessing empty episodes 
anime_dataset.loc[((anime_dataset['type'] == 'Movie') | (anime_dataset['type'] == 'OVA')) & (anime_dataset['episodes'] == 'Unknown'), ['episodes']] = 1
anime_dataset['episodes']

0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
12264    False
12265    False
12266    False
12267    False
12268    False
12269    False
12270    False
12271    False
12272    False
12273    False
12274    False
12275    False
12276    False
12277    False
12278    False
12279    False
12280    False
12281    False
12282    False
12283    False
12284    False
12285    False
12286    False
12287    False
12288    False
12289    False
12290    False
12291    False
12292    False
12293    False
Name: episodes, Length: 12294, dtype: bool

Great! Now, shown below are the animes whose episode numbers we know. `known_animes` is a dictionary object, where each item is a key-value pair. The key is the name of the anime, and the value is the number of episodes.

In [8]:
known_animes = {"Naruto Shippuuden":500, "One Piece":784,"Detective Conan":854, "Dragon Ball Super":86,
                "Crayon Shin chan":942, "Yu Gi Oh Arc V":148,"Shingeki no Kyojin Season 2":25,
                "Boku no Hero Academia 2nd Season":25,"Little Witch Academia TV":25}


Fill `anime_dataset` with the episode numbers from the dictionary above. Do this by iterating over the dictionary items in a `for` loop. For each item, find the anime whose `name` matches the key, and replace its number of `episodes` with the corresponding value. Again, use the `.loc` indexer to do so.


In [9]:
def fill_anime_dataset():
    for name, episodes in known_animes.items():
        anime_dataset.loc[anime_dataset['name'] == name, ['episodes']] = episodes

In [10]:
fill_anime_dataset()

Now, for any remaining anime whose episodes we don't know, let's re-assign them to the median number of episodes. This is a two step process: first, replace the unknown episodes with np.nan. You can do this by using a `lambda` function on `anime_dataset['episodes'].map`.

`NaN` stands for 'Not a Number'. Some of the episodes have unknown numbers -- so, replace these `NaN` values by using the `.fillna` function on `anime_dataset['episodes']` with the median of the episodes. Do this `inplace`. 'In place' simply means that, we will be actually replacing/updating data in the original dataframe and hence changing it, rather than simply displaying what the result *would* look like had we performed the operation.
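The two steps can be sketched on a small made-up series as follows. The values in `episodes` are invented stand-ins for `anime_dataset['episodes']`, and the sketch assigns each result back to the variable instead of using `inplace=True`, which is an assumption made here to keep the example short and free of chained-assignment warnings.

```python
import numpy as np
import pandas as pd

# Toy stand-in for anime_dataset['episodes']: numbers mixed with the string 'Unknown'
episodes = pd.Series(["12", "Unknown", "24", 1, "Unknown", "50"])

# Step 1: map 'Unknown' to NaN and leave every other value untouched
episodes = episodes.map(lambda x: np.nan if x == "Unknown" else x)

# Convert the remaining strings and ints to floats so that a median can be computed
episodes = episodes.astype(float)

# Step 2: replace the NaN entries with the median of the known values
episodes = episodes.fillna(episodes.median())

print(episodes.tolist())
```

The median is computed over the known values only, because `median()` skips `NaN` entries by default.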

### Rating and Members


We're about to build our features. Features are the pieces of information that our classifier needs in order to try and guess / categorize our anime. Features have to be numbers, because computers work with numbers! So we need to convert any features that hold information in string format to floats. First, convert `anime_dataset['members']` to `float` using `.astype`.

In [25]:
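# .astype returns a new series, so assign it back to keep the conversion
anime_dataset['members'] = anime_dataset['members'].astype('float')
anime_dataset['members']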

0        200630.0
1        793665.0
2        114262.0
3        673572.0
4        151266.0
5         93351.0
6        425855.0
7         80679.0
8         72534.0
9         81109.0
10       456749.0
11       102733.0
12       336376.0
13       572888.0
14       179342.0
15       466254.0
16       416397.0
17        75894.0
18       226193.0
19       715151.0
20       157670.0
21       129307.0
22       486824.0
23       552458.0
24       339556.0
25       240297.0
26       205959.0
27       101351.0
28       300030.0
29       562962.0
           ...   
12264       254.0
12265       205.0
12266       262.0
12267       174.0
12268       111.0
12269       164.0
12270       147.0
12271       240.0
12272       186.0
12273       146.0
12274       392.0
12275       108.0
12276        66.0
12277       176.0
12278       138.0
12279        79.0
12280       240.0
12281       221.0
12282       195.0
12283       112.0
12284       118.0
12285       485.0
12286       148.0
12287       201.0
12288     

In [12]:
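# .astype returns a new series, so assign it back to keep the conversion
anime_dataset['rating'] = anime_dataset['rating'].astype('float')
anime_dataset['rating']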

0        9.37
1        9.26
2        9.25
3        9.17
4        9.16
5        9.15
6        9.13
7        9.11
8        9.10
9        9.11
10       9.06
11       9.05
12       9.04
13       8.98
14       8.93
15       8.93
16       8.92
17       8.88
18       8.84
19       8.83
20       8.83
21       8.83
22       8.82
23       8.82
24       8.81
25       8.81
26       8.80
27       8.80
28       8.78
29       8.78
         ... 
12264    6.42
12265     NaN
12266    5.43
12267    4.11
12268    2.86
12269    4.08
12270    3.00
12271    5.20
12272    4.14
12273    4.00
12274     NaN
12275    3.14
12276    4.00
12277    4.66
12278    3.61
12279     NaN
12280     NaN
12281    4.53
12282     NaN
12283    4.95
12284    4.45
12285     NaN
12286    4.67
12287    4.33
12288    4.37
12289    4.15
12290    4.28
12291    4.88
12292    4.98
12293    5.46
Name: rating, Length: 12294, dtype: float64

Some of the ratings are `NaN`. Use the `fillna` function to replace the `NaN` entries in `anime_dataset['rating']` with the median of `anime_dataset['rating']`. Do this `inplace`.

In [13]:
anime_dataset['rating'].fillna(anime_dataset['rating'].median(), inplace=True)
anime_dataset['rating']

0        9.37
1        9.26
2        9.25
3        9.17
4        9.16
5        9.15
6        9.13
7        9.11
8        9.10
9        9.11
10       9.06
11       9.05
12       9.04
13       8.98
14       8.93
15       8.93
16       8.92
17       8.88
18       8.84
19       8.83
20       8.83
21       8.83
22       8.82
23       8.82
24       8.81
25       8.81
26       8.80
27       8.80
28       8.78
29       8.78
         ... 
12264    6.42
12265    6.57
12266    5.43
12267    4.11
12268    2.86
12269    4.08
12270    3.00
12271    5.20
12272    4.14
12273    4.00
12274    6.57
12275    3.14
12276    4.00
12277    4.66
12278    3.61
12279    6.57
12280    6.57
12281    4.53
12282    6.57
12283    4.95
12284    4.45
12285    6.57
12286    4.67
12287    4.33
12288    4.37
12289    4.15
12290    4.28
12291    4.88
12292    4.98
12293    5.46
Name: rating, Length: 12294, dtype: float64

Now, what we are going to do is build a kNN classifier that can categorize an anime as either a Movie or an OVA (Original Video Animation). For this, we only really want the `Movie` and `OVA` entries from our database.

First, let's create a new data frame `anime_dataset_movies_ova` that contains only the `Movie` and `OVA` entries from our original dataframe. Do this by checking `anime_dataset['type'] == 'Movie'` OR `anime_dataset['type'] == 'OVA'` on `anime_dataset`. In addition, call `.copy()` on the filtered result. This gives us a new, independent dataframe rather than a reference to part of `anime_dataset`, so the in-place changes we make to it later don't affect the original.

In [14]:
anime_dataset_movies_ova = anime_dataset[(anime_dataset['type'] == 'Movie') | (anime_dataset['type'] == 'OVA')].copy()
anime_dataset_movies_ova

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
7,820,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.10,72534
11,28851,Koe no Katachi,"Drama, School, Shounen",Movie,1,9.05,102733
15,199,Sen to Chihiro no Kamikakushi,"Adventure, Drama, Supernatural",Movie,1,8.93,466254
18,12355,Ookami Kodomo no Ame to Yuki,"Fantasy, Slice of Life",Movie,1,8.84,226193
21,44,Rurouni Kenshin: Meiji Kenkaku Romantan - Tsui...,"Action, Drama, Historical, Martial Arts, Roman...",OVA,4,8.83,129307
24,164,Mononoke Hime,"Action, Adventure, Fantasy",Movie,1,8.81,339556
25,7311,Suzumiya Haruhi no Shoushitsu,"Comedy, Mystery, Romance, School, Sci-Fi, Supe...",Movie,1,8.81,240297
33,28957,Mushishi Zoku Shou: Suzu no Shizuku,"Adventure, Fantasy, Historical, Mystery, Seine...",Movie,1,8.75,32266


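Now, we want to get rid of some features that we think won't be useful in helping us guess what type the anime is. Two obvious candidates are the `anime_id` and the `name`. Remove columns from `anime_dataset_movies_ova` by calling `.drop` on it, making sure to provide a list of column names, the axis on which these columns exist (`1`, or equivalently `'columns'`), and doing so inplace. The cell below drops only `anime_id`; `name` stays in the dataframe, as you can see in the tables that follow, but it is never added to the feature matrix, so it does not affect the classifier. <br>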
*Hint: you can see the arguments a function takes in by pressing Shift+Tab after opening the first parenthesis when writing the function call.*

In [15]:
anime_dataset_movies_ova.drop(['anime_id'], axis='columns', inplace=True)

Print the head of `anime_dataset_movies_ova`. You should notice that the `index` values from the original dataframe (`anime_dataset`) are preserved -- which we don't really want, so reset the index of the dataframe by calling `.reset_index`. Make sure to `drop` the index to replace the original indices, and do this `inplace`.


In [64]:
anime_dataset_movies_ova.reset_index(drop = True, inplace = True)

In [65]:
anime_dataset_movies_ova

Unnamed: 0,name,genre,type,episodes,rating,members
0,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
2,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.10,72534
3,Koe no Katachi,"Drama, School, Shounen",Movie,1,9.05,102733
4,Sen to Chihiro no Kamikakushi,"Adventure, Drama, Supernatural",Movie,1,8.93,466254
5,Ookami Kodomo no Ame to Yuki,"Fantasy, Slice of Life",Movie,1,8.84,226193
6,Rurouni Kenshin: Meiji Kenkaku Romantan - Tsui...,"Action, Drama, Historical, Martial Arts, Roman...",OVA,4,8.83,129307
7,Mononoke Hime,"Action, Adventure, Fantasy",Movie,1,8.81,339556
8,Suzumiya Haruhi no Shoushitsu,"Comedy, Mystery, Romance, School, Sci-Fi, Supe...",Movie,1,8.81,240297
9,Mushishi Zoku Shou: Suzu no Shizuku,"Adventure, Fantasy, Historical, Mystery, Seine...",Movie,1,8.75,32266


Display the `head()` of `anime_dataset_movies_ova` to confirm that the new indices have been applied.

In [18]:
anime_dataset_movies_ova.head()

Unnamed: 0,name,genre,type,episodes,rating,members
0,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
2,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
3,Koe no Katachi,"Drama, School, Shounen",Movie,1,9.05,102733
4,Sen to Chihiro no Kamikakushi,"Adventure, Drama, Supernatural",Movie,1,8.93,466254


### Features

Now we're ready to build our feature vector! We do this by simply concatenating the genres, ratings, members, and episodes into a single object. Remember that feature data always has to be *numbers*. Currently, `anime_dataset_movies_ova['genre']` is a string. So we want to convert the genres into a number format before concatenating it with the rest of the numerical features. 

pandas has a `.str.get_dummies` method that generates one 0/1 column for each distinct value of a non-numerical category. We're going to invoke this on `anime_dataset_movies_ova['genre']` by splitting each string in its series using `,` as a separator.

Call `anime_dataset_movies_ova['genre'].str.get_dummies(sep=",")`, and store the reference to the dataframe returned in a new variable `anime_dataset_ova_genres`.

In [19]:
anime_dataset_ova_genres = anime_dataset_movies_ova['genre'].str.get_dummies(sep=",")
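To see what `.str.get_dummies` produces, here is a small sketch on three made-up genre strings. The series `genre` is a toy stand-in for `anime_dataset_movies_ova['genre']`. One assumption differs from the cell above: the genre strings in the `head()` output separate labels with a comma followed by a space, so the sketch uses `sep=", "`, and its last line shows what `sep=","` does to the same strings.

```python
import pandas as pd

# Toy stand-in for anime_dataset_movies_ova['genre']
genre = pd.Series(["Drama, Romance", "Action, Drama", "Comedy"])

# One 0/1 column per distinct genre; the separator includes the space after the comma
genre_dummies = genre.str.get_dummies(sep=", ")

print(genre_dummies.columns.tolist())
print(genre_dummies.values.tolist())

# With sep="," every label after a comma keeps its leading space,
# so " Drama" and "Drama" become two separate columns
print(genre.str.get_dummies(sep=",").columns.tolist())
```

Whether to switch separators in the lab is a judgment call: it would change the number of genre columns, and therefore the shape of the feature matrix and the scores reported later.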

In [48]:
anime_dataset_movies_ova['rating']

0       9.37
1       9.11
2       9.10
3       9.05
4       8.93
5       8.84
6       8.83
7       8.81
8       8.81
9       8.75
10      8.74
11      8.73
12      8.69
13      8.68
14      8.64
15      8.64
16      8.61
17      8.61
18      8.60
19      8.59
20      8.59
21      8.58
22      8.58
23      8.57
24      8.55
25      8.53
26      8.53
27      8.53
28      8.50
29      8.50
        ... 
5629    3.11
5630    6.57
5631    5.43
5632    4.11
5633    2.86
5634    4.08
5635    3.00
5636    5.20
5637    4.14
5638    4.00
5639    6.57
5640    3.14
5641    4.00
5642    4.66
5643    3.61
5644    6.57
5645    6.57
5646    4.53
5647    6.57
5648    4.95
5649    4.45
5650    6.57
5651    4.67
5652    4.33
5653    4.37
5654    4.15
5655    4.28
5656    4.88
5657    4.98
5658    5.46
Name: rating, Length: 5659, dtype: float64

In [66]:
anime_dataset_movies_ova_members = anime_dataset_movies_ova['members'].astype('float')

Finally, create `anime_dataset_movies_ova_features` by calling `pd.concat` with `axis=1` on a list containing `anime_dataset_ova_genres`, `anime_dataset_movies_ova['rating']`, `anime_dataset_movies_ova_members`, and `anime_dataset_movies_ova['episodes']`.

In [77]:
anime_dataset_movies_ova_features = pd.concat([anime_dataset_ova_genres, 
                                               anime_dataset_movies_ova['rating'],
                                               anime_dataset_movies_ova_members,
                                               anime_dataset_movies_ova['episodes']], axis=1)
anime_dataset_movies_ova_features

Unnamed: 0,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,Harem,...,Shoujo,Shounen,Slice of Life,Sports,Super Power,Supernatural,Yaoi,rating,members,episodes
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,9.37,200630.0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,9.11,80679.0,110
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,9.10,72534.0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,9.05,102733.0,1
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,8.93,466254.0,1
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,8.84,226193.0,1
6,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,8.83,129307.0,4
7,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,8.81,339556.0,1
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,8.81,240297.0,1
9,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,8.75,32266.0,1



Check the `shape` of the `anime_dataset_movies_ova_features`! It should be `(5659, 81)`. This means that we have 5659 vectors, each of which has 81 features. Equivalently, its `size` (the total number of entries) should be 5659 × 81 = 458379.

In [79]:
print(anime_dataset_movies_ova_features.size)

458379


Print `anime_dataset_movies_ova_features` to see what it looks like. You should see a table of numbers corresponding to all the different features.

### Scaling

Now, the way kNN works is by computing the distance between different vectors. Every vector is a list of co-ordinates, one per feature, that represents the position of a point in space. For the distances to be meaningful, all co-ordinates need to share a comparable scale; otherwise the feature with the largest numbers dominates the distance.

Episode numbers, members and rating take very different values. Rating ranges from 0-10 in the dataset, while the episode count can exceed 800 for long running popular animes such as One Piece. So we'll pass the features through a scaler: `MinMaxScaler` from `sklearn.preprocessing` rescales every feature to the range 0-1.
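With its default settings, `MinMaxScaler` rescales each column with the formula `(x - min) / (max - min)`. The sketch below applies that formula by hand with NumPy to three made-up rows of `[rating, members, episodes]`, just to show what happens to columns on very different scales.

```python
import numpy as np

# Made-up rows: [rating, members, episodes]
features = np.array([
    [9.0, 200000.0, 1.0],
    [7.0, 80000.0, 110.0],
    [5.0, 500.0, 12.0],
])

# Min-max scaling, column by column: (x - min) / (max - min)
col_min = features.min(axis=0)
col_max = features.max(axis=0)
scaled = (features - col_min) / (col_max - col_min)

print(scaled.round(2))
```

After scaling, every column runs from 0 to 1, so `members` no longer swamps `rating` when distances are computed.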

Import `MinMaxScaler` from `sklearn.preprocessing`.

In [83]:
from sklearn.preprocessing import MinMaxScaler

Create a MinMaxScaler object, and call it `min_max_scaler`.

In [85]:
min_max_scaler = MinMaxScaler()

Create `anime_scaled_features` by calling `min_max_scaler.fit_transform` on `anime_dataset_movies_ova_features`.

In [88]:
anime_scaled_features = min_max_scaler.fit_transform(anime_dataset_movies_ova_features)

Use `np.round` to round the features in `anime_scaled_features` to 2 decimal places for consistency, and store the result in `df_features`. You should see that the features are now a 2-D NumPy array: one row per anime, with each row being a list of numbers between 0 and 1.

In [96]:
df_features = np.round(anime_scaled_features, decimals = 2)

# Fit Feature Data to k Nearest Neighbor Model

Okay, this is the main part! We're going to create a `KNeighborsClassifier` from `sklearn.neighbors` to attempt to classify our data! 

Import `KNeighborsClassifier` from `sklearn.neighbors`.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

Create a `knn` object from `KNeighborsClassifier`, and set the number of neighbors (`n_neighbors`) to 3.

In [101]:
knn = KNeighborsClassifier(n_neighbors = 3)

Now, we're going to create training and test partitions of our data for our kNN classifier to work on!

Import `train_test_split` from `sklearn.model_selection`

In [103]:
from sklearn.model_selection import train_test_split

We need to pass our features and their corresponding categories to our classifier. Our features are stored in `anime_scaled_features`. What we are trying to predict is the `type` (`Movie` or `OVA`) -- so we will pass the `anime_dataset_movies_ova['type']` series as our output labels.

`X` is going to be our features (`anime_scaled_features`).
`y` is going to be our output categories for the features (`anime_dataset_movies_ova['type']`).

Call `X_train, X_test, y_train, y_test = train_test_split` (you can use *Shift+Tab* to check out the parameters you need to pass). `X` represents features and `y` represents labels. Assign the output of `train_test_split` to:

`X_train` (the feature training data),<br>
`X_test` (the feature testing data), <br>
`y_train` (the training output labels), and <br>
`y_test` (the testing output labels). <br>

Use a `test_size` of 0.2, which means that we'll use 20% of our data as testing data, and the remaining 80% as training data. Set the random state to 101, which simply represents a seed we'll use to split our data.

In [105]:
X_train, X_test, y_train, y_test = train_test_split(anime_scaled_features, anime_dataset_movies_ova['type'], test_size = 0.2, random_state = 101)
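Conceptually, a shuffled 80/20 split just permutes the row indices and holds out a fraction of them. The sketch below shows that idea with NumPy. `shuffled_split` is a hypothetical helper written for illustration, not scikit-learn's implementation, so using the same seed will not reproduce the rows that `train_test_split` selects.

```python
import numpy as np

def shuffled_split(X, y, test_size=0.2, seed=101):
    """Shuffle the row indices, then hold out the first `test_size` fraction as test data."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up data: 10 rows with 2 features each, and alternating labels
X = np.arange(20).reshape(10, 2)
y = np.array(["Movie", "OVA"] * 5)

X_train, X_test, y_train, y_test = shuffled_split(X, y)
print(X_train.shape, X_test.shape)
```

Every row ends up in exactly one of the two parts, and fixing the seed makes the split repeatable from run to run.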

Now, fit our knn model to our training data by calling `knn.fit`, and passing it our training feature data, `X_train`, and our training labels, `y_train`.

In [107]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

Let's see how well our model works! Call `knn.predict` and pass in our test data, `X_test`. Store the predictions returned in a variable called `y_pred`.

In [113]:
y_pred = knn.predict(X_test)

Display `y_pred` -- you should see an array containing a large list of `Movie` and `OVA` entries -- these are the predictions that our classifier just made for all our corresponding feature test data. To get a sense of how successful our classifier was, we will compare our predictions with the actual ('real') output, and print a confusion matrix and a classification report that helps put these results into perspective.

From `sklearn.metrics`,  import `classification_report` and `confusion_matrix`

In [117]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

Print the `confusion_matrix`, by passing in `y_test` (actual output) and `y_pred` (our predictions).

In [120]:
accuracy_score(y_test, y_pred)
confusion_matrix(y_test, y_pred) 

array([[344, 122],
       [160, 506]])

OK! You should see a 2x2 matrix of numbers, but what do these numbers mean? A confusion matrix is actually very simple to read: it shows the output predicted and compares it to the actual output, by categorizing the number of:

*true positives* - actual output is yes, predicted output is yes <br>
*false positives* - actual output is no, predicted output is yes <br>
*true negatives* - actual output is no, predicted output is no <br>
*false negatives* - actual output is yes, predicted output is no <br>

You can learn more about the confusion matrix here: <br>
http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

We can print a slightly 'prettier' version of the confusion matrix by using the `nltk` library instead. We'll use this library in more depth later on in the course (for natural language processing), but for now, let's quickly use it to print a slightly easier to read confusion matrix:

From `nltk`, import `ConfusionMatrix`, and then print `ConfusionMatrix(list(y_test), list(y_pred))`.

In [124]:
from nltk import ConfusionMatrix
print(ConfusionMatrix(list(y_test), list(y_pred)))

      |   M     |
      |   o     |
      |   v   O |
      |   i   V |
      |   e   A |
------+---------+
Movie |<344>122 |
  OVA | 160<506>|
------+---------+
(row = reference; col = test)



The rows represent the actual output (the reference), and the columns represent the predicted output.

To quantify the results in our confusion matrix, we can print a classification report. This gives us the values of the precision, recall, f1-score, and support of our classifier. 

Precision = true positives / (true positives + false positives) <br>
Recall = true positives / (true positives + false negatives) <br>
F1 Score = Harmonic Mean of Precision & Recall <br>
Support = Number of occurrences of each label in actual output. <br>
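As a worked example, the formulas above can be applied by hand to the confusion matrix printed earlier (rows are actual labels, columns are predicted labels). The variable names in this sketch are made up for illustration; each label is treated in turn as the 'positive' class.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
matrix = [[344, 122],   # actual Movie: 344 predicted Movie, 122 predicted OVA
          [160, 506]]   # actual OVA:   160 predicted Movie, 506 predicted OVA
labels = ["Movie", "OVA"]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

report = {}
for i, label in enumerate(labels):
    tp = matrix[i][i]
    fp = sum(matrix[r][i] for r in range(2)) - tp  # predicted as `label`, actually the other
    fn = sum(matrix[i]) - tp                       # actually `label`, predicted as the other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    support = sum(matrix[i])
    report[label] = (round(precision, 2), round(recall, 2), round(f1, 2), support)

print("accuracy:", round(accuracy, 2))
for label, row in report.items():
    print(label, row)
```

The per-label precision, recall, F1 score, and support computed this way agree with the values in the classification report for this run.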

The lectures will cover these topics in more detail and in a way that is more conceptual and less mathematical.

Print the `classification_report`, by passing in `y_test` (actual output) and `y_pred` (our predictions).

In [125]:
# print() renders the report's line breaks instead of showing the raw string
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

      Movie       0.68      0.74      0.71       466
        OVA       0.81      0.76      0.78       666

avg / total       0.76      0.75      0.75      1132

One problem with a simple `train_test_split` is that it makes only a single split of the data. The training part may happen to miss some of the variation in feature values, which can cause our model to *overfit*: it fits the training data closely but is very sensitive to the slightest change in test data, resulting in high variance (we'll explore these concepts in more detail later). The score we get also depends on which rows happened to land in the test set.

To avoid this problem, we run a process called cross-validation. k-fold cross-validation splits our data into k equally sized partitions (*folds*). Each fold takes one turn as the test set while the remaining k-1 folds are used for training, so `fit` and `predict` run k times and we get k scores.

To do this:

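from `sklearn.model_selection` import `cross_val_score` (older scikit-learn versions exposed it in `sklearn.cross_validation`, a module that was removed in version 0.20)


In [128]:
from sklearn.model_selection import cross_val_score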

Compute the `scores` by running `cross_val_score` and passing in our classifier (`knn`), all our feature data (`X`) and label data (`y`), and the number of *folds* for the cross validation (let's go with 10).

In [132]:
np.mean(cross_val_score(knn, anime_scaled_features, anime_dataset_movies_ova['type'], cv=10))

0.69207234763245484

Print `scores`. You'll notice it's an array of scores; one for each fold that was run.

We can simply print the average (`np.mean`) of our scores to get the overall score for our classifier.
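To make the fold mechanics concrete, here is a minimal NumPy sketch of k-fold cross-validation around a 1-nearest-neighbor classifier. The helpers `one_nn_predict` and `k_fold_scores` and the ten data points are made up for illustration. The sketch neither shuffles nor stratifies the folds, so it is not a drop-in replacement for `cross_val_score`.

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_test):
    """Give each test row the label of its single nearest training row."""
    # Pairwise Euclidean distances, shape (n_test, n_train)
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[dists.argmin(axis=1)]

def k_fold_scores(X, y, k=5):
    """Accuracy on each of k folds; every fold is the test set exactly once."""
    all_idx = np.arange(len(X))
    scores = []
    for test_idx in np.array_split(all_idx, k):
        train_idx = np.setdiff1d(all_idx, test_idx)
        y_pred = one_nn_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(float(np.mean(y_pred == y[test_idx])))
    return scores

# Made-up data: two well-separated clusters, interleaved so each fold sees both labels
X = np.array([[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.3],
              [4.9, 5.0], [0.3, 0.2], [5.1, 5.3], [0.0, 0.4], [5.3, 5.2]])
y = np.array(["Movie", "OVA"] * 5)

scores = k_fold_scores(X, y, k=5)
print(scores, np.mean(scores))
```

Because the two clusters are far apart, every fold is classified correctly here; on real data the per-fold scores vary, which is why we average them.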

## Wine DataSet

You'll notice that we only got around a 70% score for our kNN classifier. How good a classifier is also depends on the kinds of data you use for its features. We are going to quickly explore this concept on a different dataset, the wine dataset, and we'll see how the results differ. We won't have to clean our data, so this process should be much faster.

Import the `wine.csv` dataset into `wine_dataset` by using `pd.read_csv`.

In [133]:
wine_dataset = pd.read_csv('wine.csv')

Print its `head()`.

In [134]:
wine_dataset.head()

Unnamed: 0,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


Notice that the data looks so much more relevant. i.e. The features look like they are going to be more helpful in trying to guess the kind of wine. 

Note that this data set contains 3 kinds of wine (1, 2, 3) -- all contained in the index of the dataframe. These are effectively our output labels. This is convenient, because the remaining columns of the dataset form our features. So we don't need to slice our dataframe to create our feature and label variables separately. We can simply use `wine_dataset` as our features, and `wine_dataset.index.values` as our labels.

Of course, we still need to scale our features before using them. To mix things up, let's use a `StandardScaler` object instead of a `MinMaxScaler`. The steps remain the same: use the `StandardScaler` in the exact same way as you did for the `anime_dataset` to scale the `wine_dataset`.
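Where `MinMaxScaler` squeezes each column into 0-1, `StandardScaler` with its default settings subtracts each column's mean and divides by its standard deviation. The sketch below applies that formula by hand with NumPy to the `Alcohol` and `Proline` values from the first three rows of the `head()` output above; using only three rows is an assumption made to keep the example small.

```python
import numpy as np

# Alcohol and Proline from the first three rows of wine_dataset.head()
features = np.array([
    [14.23, 1065.0],
    [13.20, 1050.0],
    [13.16, 1185.0],
])

# Standardization, column by column: (x - mean) / standard deviation
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

print(standardized.round(2))
```

After standardization each column has mean 0 and standard deviation 1, so `Proline` (in the thousands) and `Alcohol` (around 13) contribute on the same scale.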

from `sklearn.preprocessing`, import `StandardScaler`

In [135]:
from sklearn.preprocessing import StandardScaler

Create a `StandardScaler` object, and store it in `standard_scaler`.

In [136]:
standard_scaler = StandardScaler()

Apply a `fit_transform` on `wine_dataset` and store the result in `wine_scaled_features`.

In [138]:
wine_scaled_features = standard_scaler.fit_transform(wine_dataset)
wine_scaled_features

array([[ 1.51861254, -0.5622498 ,  0.23205254, ...,  0.36217728,
         1.84791957,  1.01300893],
       [ 0.24628963, -0.49941338, -0.82799632, ...,  0.40605066,
         1.1134493 ,  0.96524152],
       [ 0.19687903,  0.02123125,  1.10933436, ...,  0.31830389,
         0.78858745,  1.39514818],
       ..., 
       [ 0.33275817,  1.74474449, -0.38935541, ..., -1.61212515,
        -1.48544548,  0.28057537],
       [ 0.20923168,  0.22769377,  0.01273209, ..., -1.56825176,
        -1.40069891,  0.29649784],
       [ 1.39508604,  1.58316512,  1.36520822, ..., -1.52437837,
        -1.42894777, -0.59516041]])

Let's store our wine_scaled_features in a DataFrame. Call `pd.DataFrame` and pass in `wine_scaled_features` for the data, using the `wine_dataset.columns.values` as the columns. Store the result in `wine_scaled_features_df`.

In [147]:
wine_scaled_features_df = pd.DataFrame(wine_scaled_features, columns=wine_dataset.columns.values)

Let's re-evaluate our knn model on our new wine dataset! Compute the `scores` by running `cross_val_score` and passing in our classifier (`knn`), all our feature data (`wine_scaled_features_df`), and our label data (`wine_dataset.index.values`). Use 10 folds again, and store the result in `scores`.

In [150]:
np.mean(cross_val_score(knn, wine_scaled_features_df, wine_dataset.index.values, cv=10))

0.95449991400068801

Print `np.mean(scores)`. You should see a score of around 95%!

So by passing in 'better' data, you can see that our classifier works more successfully. We'll explore many of the concepts covered in this lab in greater detail in the lectures. Completing this lab should put you in an excellent position to tackle the homework assignment. As always, ask your instructor or TA if you have any questions. Good luck!