# Week 1 Lab: k Nearest Neighbors

We're going to take a look at a simple way to use one of the easiest Machine Learning algorithms - k Nearest Neighbors (kNN).

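k Nearest Neighbors works by treating the features of each data point as co-ordinates, so that every data point sits somewhere in space. It then guesses the category of new data by calculating how close the new point is to the training data in this space, and picking the most common category among its k closest training points.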
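The idea can be sketched in a few lines of plain `numpy`. This is only a minimal illustration under stated assumptions, not how `sklearn` implements it: the helper name `knn_predict` and the toy points are made up for this example, and the distance is plain Euclidean distance.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two features per point, two categories
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
train_y = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_predict(train_X, train_y, np.array([1.1, 1.0])))  # query near the "A" cluster
print(knn_predict(train_X, train_y, np.array([5.1, 5.0])))  # query near the "B" cluster
```

In the rest of the lab we let `sklearn` do this work for us.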

We'll work on two datasets (if we have time!). The first one is an anime dataset from Kaggle (https://www.kaggle.com/CooperUnion/anime-recommendations-database). We'll train a kNN classifier on it and print a confusion matrix, which will tell us how accurate/precise our classifier is, and then run a cross-validation, which will give us a more reliable score of how well our classifier works. We'll then run the same analysis on a new dataset about wine (https://archive.ics.uci.edu/ml/datasets/wine), and we'll see how the results differ. This will give us some sense of the importance of features and data on the success of the classifier.

## Import Libraries

`numpy` and `pandas` are libraries in Python that will help us with data analysis. You can learn more about them here:

**numpy**<br>
http://www.numpy.org/

**pandas**<br>
https://pandas.pydata.org/

`import` the numpy and pandas libraries

In [43]:
import pandas as pd
import numpy as np

# Load the dataset

Our first dataset is an anime dataset, named `anime.csv`. You can learn more about it here:<br>
https://www.kaggle.com/CooperUnion/anime-recommendations-database


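Read the anime CSV using pandas `pd.read_csv`. Let's call it `anime_dataset`. (The solution cells in this notebook use shorter variable names than the instructions: `df` for `anime_dataset`, and later `df_ao` for `anime_dataset_movies_ova` and `ascaled_features` for `anime_scaled_features`.)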

In [44]:
df = pd.read_csv("anime.csv")

Display the `head()` of `anime_dataset`.

In [45]:
df.head(10)

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266
5,32935,Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...,"Comedy, Drama, School, Shounen, Sports",TV,10,9.15,93351
6,11061,Hunter x Hunter (2011),"Action, Adventure, Shounen, Super Power",TV,148,9.13,425855
7,820,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
9,15417,Gintama&#039;: Enchousen,"Action, Comedy, Historical, Parody, Samurai, S...",TV,13,9.11,81109


# Data preprocessing 

## Episodes


Many animes have an unknown number of episodes even if they have a similar rating. On top of that, many super popular animes such as Naruto Shippuden, Attack on Titan Season 2 were ongoing when the data was collected, thus their number of episodes was considered as "Unknown". For some animes, we'll fill in the episode numbers manually. For the other animes, we'll make some educated guesses.

Animes that are grouped under "OVA" (which stands for "Original Video Animation") are generally one or two episodes long, so let's just fill their unknown numbers of episodes with 1. Animes that are grouped under "Movie" are considered as 1 episode, as per the dataset overview.

Fill in anime whose type is `Movie` and whose episodes are unknown with 1 episode. Do this by using the `.loc` indexer in pandas (i.e. `anime_dataset.loc[...]`), with a condition that checks for both `anime_dataset['type'] == 'Movie'` and `anime_dataset['episodes'] == 'Unknown'`. Set the number of episodes for rows that match the condition to 1.

In [46]:
df.loc[(df['type']=='Movie')&(df['episodes']=='Unknown'), 'episodes'] = 1

Repeat the above for anime whose type is `OVA` and whose episodes are `Unknown`, with 1 episode.

In [47]:
df.loc[(df['type']=='OVA')&(df['episodes']=='Unknown'), 'episodes'] = 1
# df.groupby('episodes').count()

Great! Now, shown below are the animes whose episode numbers we know. `known_animes` is a dictionary object, where each item is a key-value pair. The key is the name of the anime, and the value is the number of episodes. (Make sure to run the cell below!)

In [48]:
known_animes = {"Naruto Shippuuden":500, "One Piece":784,"Detective Conan":854, "Dragon Ball Super":86,
                "Crayon Shin chan":942, "Yu Gi Oh Arc V":148,"Shingeki no Kyojin Season 2":25,
                "Boku no Hero Academia 2nd Season":25,"Little Witch Academia TV":25}

Fill `anime_dataset` with the episode numbers from the dictionary above. Do this by iterating over the dictionary items above in a `for` loop, and for each item in the dictionary, find the anime whose `name` matches the key of the dictionary item, and replace the number of `episodes` with the corresponding value. Again, use the `.loc` function to do so.

In [49]:
for name, episodes in known_animes.items():
    df.loc[df['name'] == name, 'episodes'] = episodes

Now, for any remaining anime whose episodes we don't know, let's re-assign them to the median number of episodes. This is a two step process: first, replace the unknown episodes with np.nan. You can do this in several different ways:
* The same way you replaced the `episodes` count for `Movies`/`OVA` above <br>
* By passing a `lambda` function to `anime_dataset['episodes'].map` <br>
* By using `anime_dataset.loc` over all `episodes` and passing a `lambda` function to `anime_dataset['episodes'].apply`.
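The three options listed above can be sketched side by side on a toy column. The dataframe `toy` and its values are made up for illustration; only the `'Unknown'` placeholder comes from the lab's dataset.

```python
import numpy as np
import pandas as pd

# Toy column (made up) mixing episode counts and the 'Unknown' placeholder
toy = pd.DataFrame({"episodes": ["12", "Unknown", "24", "Unknown"]})

# 1) Boolean mask with .loc, as for the Movie/OVA rows
via_loc = toy.copy()
via_loc.loc[via_loc["episodes"] == "Unknown", "episodes"] = np.nan

# 2) lambda passed to .map
via_map = toy["episodes"].map(lambda x: np.nan if x == "Unknown" else x)

# 3) lambda passed to .apply
via_apply = toy["episodes"].apply(lambda x: np.nan if x == "Unknown" else x)

# All three mark the same rows as missing
print(via_loc["episodes"].isnull().tolist())
print(via_map.isnull().tolist())
print(via_apply.isnull().tolist())
```

Options 2 and 3 return a new Series, so the result has to be assigned back to the column to take effect.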

In [50]:
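df.loc[(df['episodes']=='Unknown'), 'episodes'] = np.nan
# The remaining counts are a mix of strings and numbers, so convert before taking the median
median = pd.to_numeric(df['episodes']).median()
# df.loc[df['episodes'].isnull()]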


`NaN` stands for 'Not a Number'. Some of the episodes have unknown numbers, so replace these `NaN` values by using the `.fillna` function on `anime_dataset['episodes']` with the median of the episodes. Do this `inplace`. 'In place' simply means that we update the data in the original dataframe itself, rather than getting back a modified copy that shows what the result *would* look like had we performed the operation.

In [51]:
df['episodes'].fillna(median, inplace= True)
df.loc[df['episodes'].isnull()]

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
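As a small sketch of the median fill on toy data (the values are made up): the example assigns the result of `fillna` back to the column instead of using `inplace`. The two approaches are intended to give the same result, but on some recent pandas versions calling `fillna(..., inplace=True)` on a selected column may emit a warning or leave the dataframe unchanged, so the assignment form is shown here as an alternative.

```python
import numpy as np
import pandas as pd

# Toy episode counts (made up) with two missing values
toy = pd.DataFrame({"episodes": [1.0, np.nan, 12.0, 24.0, np.nan]})

median = toy["episodes"].median()                  # NaN values are skipped
toy["episodes"] = toy["episodes"].fillna(median)   # assign the filled column back

print(median)
print(toy["episodes"].tolist())
print(toy["episodes"].isnull().sum())              # no missing values remain
```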


### Rating and Members


We're about to build our features. Features are the pieces of information that our classifier needs in order to try and guess / categorize our anime. Features have to be numbers, because computers work with numbers! So we need to convert any features that have information in string format to floats. First, convert `anime_dataset['members']` to `float` using `.astype`. Note that `.astype` returns a new Series rather than changing the column, so assign the result back to the column if you want the conversion to persist.

In [52]:
df['members'].astype(float)

0        200630.0
1        793665.0
2        114262.0
3        673572.0
4        151266.0
5         93351.0
6        425855.0
7         80679.0
8         72534.0
9         81109.0
10       456749.0
11       102733.0
12       336376.0
13       572888.0
14       179342.0
15       466254.0
16       416397.0
17        75894.0
18       226193.0
19       715151.0
20       157670.0
21       129307.0
22       486824.0
23       552458.0
24       339556.0
25       240297.0
26       205959.0
27       101351.0
28       300030.0
29       562962.0
           ...   
12264       254.0
12265       205.0
12266       262.0
12267       174.0
12268       111.0
12269       164.0
12270       147.0
12271       240.0
12272       186.0
12273       146.0
12274       392.0
12275       108.0
12276        66.0
12277       176.0
12278       138.0
12279        79.0
12280       240.0
12281       221.0
12282       195.0
12283       112.0
12284       118.0
12285       485.0
12286       148.0
12287       201.0
12288     

Next, convert `anime_dataset['rating']` to float as well, in the exact same way as you just did for `anime_dataset['members']`.

In [53]:
df['rating'].astype(float)

0        9.37
1        9.26
2        9.25
3        9.17
4        9.16
5        9.15
6        9.13
7        9.11
8        9.10
9        9.11
10       9.06
11       9.05
12       9.04
13       8.98
14       8.93
15       8.93
16       8.92
17       8.88
18       8.84
19       8.83
20       8.83
21       8.83
22       8.82
23       8.82
24       8.81
25       8.81
26       8.80
27       8.80
28       8.78
29       8.78
         ... 
12264    6.42
12265     NaN
12266    5.43
12267    4.11
12268    2.86
12269    4.08
12270    3.00
12271    5.20
12272    4.14
12273    4.00
12274     NaN
12275    3.14
12276    4.00
12277    4.66
12278    3.61
12279     NaN
12280     NaN
12281    4.53
12282     NaN
12283    4.95
12284    4.45
12285     NaN
12286    4.67
12287    4.33
12288    4.37
12289    4.15
12290    4.28
12291    4.88
12292    4.98
12293    5.46
Name: rating, Length: 12294, dtype: float64

Some of the ratings are `NaN`. Use the `fillna` function to replace the `NaN` entries in `anime_dataset['rating']` with the median of `anime_dataset['rating']`. Do this `inplace`.

In [54]:
# df['rating'].unique()
median = df['rating'].median()
df['rating'].fillna(median, inplace= True)

Now, what we are going to do is build a kNN classifier that can categorize an anime as either a Movie or an OVA (Original Video Animation). For this, we only really want the `Movie` and `OVA` entries from our database.

First, let's create a new data frame `anime_dataset_movies_ova` that contains only the `Movie` and `OVA` entries from our original dataframe. Do this by checking `anime_dataset['type'] == 'Movie'` OR `anime_dataset['type'] == 'OVA'` on `anime_dataset`. In addition, take a `.copy()` of the filtered result. This will produce a new, independent dataframe, as opposed to a view that may still be tied to `anime_dataset`, so the in-place changes we make next apply only to the new dataframe.

In [55]:
df_ao_idx = (df['type']=='Movie')|(df['type']=="OVA")
df_ao = df[df_ao_idx].copy()
df_ao.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
7,820,Ginga Eiyuu Densetsu,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
8,15335,Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
11,28851,Koe no Katachi,"Drama, School, Shounen",Movie,1,9.05,102733
15,199,Sen to Chihiro no Kamikakushi,"Adventure, Drama, Supernatural",Movie,1,8.93,466254


Now, we want to get rid of some features that we think won't be useful in helping us guess what type the anime is. Two obvious features are the `anime_id` and the `name`. Remove these features from `anime_dataset_movies_ova` by calling `.drop` on it, making sure to provide a list of feature names, the axis on which these features exist (1), and doing so inplace. <br>
*Hint: you can see the arguments a function takes in by pressing Shift+Tab after opening the first parenthesis when writing the function call.*

In [56]:
df_ao.drop(['anime_id', 'name'], axis= 1, inplace = True)
df_ao.head()

Unnamed: 0,genre,type,episodes,rating,members
0,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
7,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
8,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
11,"Drama, School, Shounen",Movie,1,9.05,102733
15,"Adventure, Drama, Supernatural",Movie,1,8.93,466254


Print the head of `anime_dataset_movies_ova`. You should notice that the `index` values from the original dataframe (`anime_dataset`) are preserved -- which we don't really want, so reset the index of the dataframe by calling `.reset_index`. Make sure to `drop` the index to replace the original indices, and do this `inplace`.


In [57]:
# df_ao.drop('index', inplace=True)
df_ao.reset_index(drop=True, inplace=True)

Display the `head()` of `anime_dataset_movies_ova` to confirm that the new indices have been applied.

In [58]:
df_ao.head()

Unnamed: 0,genre,type,episodes,rating,members
0,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,"Drama, Military, Sci-Fi, Space",OVA,110,9.11,80679
2,"Action, Comedy, Historical, Parody, Samurai, S...",Movie,1,9.1,72534
3,"Drama, School, Shounen",Movie,1,9.05,102733
4,"Adventure, Drama, Supernatural",Movie,1,8.93,466254


### Features

Now we're ready to build our feature vector! We do this by simply concatenating the genres, ratings, members, and episodes into a single object. Remember that feature data always has to be *numbers*. Currently, `anime_dataset_movies_ova['genre']` is a string. So we want to convert the genres into a number format before concatenating it with the rest of the numerical features.

pandas has a `.str.get_dummies` function that will generate a 0/1 indicator column for each of the different values found in a given non-numerical category. We're going to invoke this on `anime_dataset_movies_ova['genre']` by splitting each string in its series using `,` as a separator.

Call `anime_dataset_movies_ova['genre'].str.get_dummies(sep=",")`, and store the reference to the dataframe returned in a new variable `anime_dataset_movies_ova_genres`.

In [59]:
df_ao_genres = df_ao['genre'].str.get_dummies(sep=",")
df_ao_genres.head()

Unnamed: 0,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,Harem,...,School,Sci-Fi,Seinen,Shoujo,Shounen,Slice of Life,Sports,Super Power,Supernatural,Yaoi
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
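One detail worth knowing about the separator: the genre strings in this dataset separate entries with a comma followed by a space. The sketch below, on a made-up two-row Series, shows how splitting on `","` alone keeps the leading space, so the same genre can end up in two different columns. This is an observation about `get_dummies`, not a change to the lab's steps; the results in this lab were produced with `sep=","`.

```python
import pandas as pd

# Toy genre strings (made up) in the same "a, b" style as the dataset
genres = pd.Series(["Drama, Romance", "Romance"])

with_comma = genres.str.get_dummies(sep=",")         # splits on the comma only
with_comma_space = genres.str.get_dummies(sep=", ")  # splits on comma + space

print(sorted(with_comma.columns))        # 'Romance' appears with and without a leading space
print(sorted(with_comma_space.columns))  # one column per genre
```

Splitting on `", "` would give fewer, cleaner columns, at the cost of changing the feature count reported in the lab.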


Finally, create `anime_dataset_movies_ova_features` by calling `pd.concat` on `anime_dataset_movies_ova_genres`,   `anime_dataset_movies_ova["rating"]`, `anime_dataset_movies_ova[["members"]]`, and `anime_dataset_movies_ova["episodes"]`.

**REMEMBER:** by default, concatenation happens on axis = 0 -- which represents the _rows_. However, what we want is to concatenate features on the _columns_. Can you guess what the axis value should be ? Make sure you enter the correct value!

In [60]:
df_ao_rating = df_ao["rating"]
df_ao_members = df_ao["members"]
df_ao_episodes = df_ao["episodes"]
# concat: https://pandas.pydata.org/pandas-docs/stable/merging.html
# axis=1 places the pieces side by side as columns
df_ao_features = pd.concat([df_ao_genres, df_ao_rating, df_ao_members, df_ao_episodes], axis=1)
df_ao_features.head()


Unnamed: 0,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,Harem,...,Shoujo,Shounen,Slice of Life,Sports,Super Power,Supernatural,Yaoi,rating,members,episodes
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,9.37,200630,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,9.11,80679,110
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,9.1,72534,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,9.05,102733,1
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,8.93,466254,1
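To see what the `axis` argument does, here is a small sketch on two made-up one-column dataframes (`left` and `right` are hypothetical names used only for this example).

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [3, 4]})

# axis=0 stacks rows: 4 rows, and cells with no value become NaN
stacked = pd.concat([left, right], axis=0)

# axis=1 places columns next to each other: still 2 rows, now 2 columns
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape)
print(side_by_side.shape)
print(side_by_side)
```

For features we want the second form, where each row still describes one anime.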


Check the `shape` of the `anime_dataset_movies_ova_features`! It should be `(5659, 81)`. This means that we have 5659 vectors, each of which has 81 features. 

In [61]:
df_ao_features.shape

(5659, 81)

**NOTE!** 81 features is a lot of features -- in fact, choosing features carefully is an entire science in itself, which we will explore later in the course. For now, just keep in mind that more features does not always mean better results (in fact, [it can be a bad thing](https://i.stack.imgur.com/DUZKm.png) !).

Print `anime_dataset_movies_ova_features` to see what it looks like. You should see a table of numbers corresponding to all the different features.

In [62]:
df_ao_features

Unnamed: 0,Adventure,Cars,Comedy,Dementia,Demons,Drama,Ecchi,Fantasy,Game,Harem,...,Shoujo,Shounen,Slice of Life,Sports,Super Power,Supernatural,Yaoi,rating,members,episodes
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,9.37,200630,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,9.11,80679,110
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,9.10,72534,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,9.05,102733,1
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,8.93,466254,1
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,8.84,226193,1
6,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,8.83,129307,4
7,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,8.81,339556,1
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,8.81,240297,1
9,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,8.75,32266,1


### Scaling

Now, the way kNN works is by computing the distance between different vectors. Every vector has several co-ordinates (one per feature) that represent the position of a point in space. For the distance between vectors to be meaningful, all co-ordinates need to share a comparable 'scale'; otherwise the features with the largest values dominate the distance.

Episode numbers, members and rating take very different ranges of values. Rating ranges from 0-10 in the dataset, while the episode number can be 800+ for long-running popular animes such as One Piece, Naruto etc. So we'll pass the features through a scaler: we use `MinMaxScaler` from `sklearn.preprocessing`, as it rescales the values of each feature to the range 0-1.
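As a rough sketch of what min-max scaling does, the example below applies the formula (value - column minimum) / (column maximum - column minimum) by hand and compares it with `MinMaxScaler`. The small matrix `X` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (made up): columns stand for rating and episodes
X = np.array([[2.0, 1.0],
              [6.0, 13.0],
              [10.0, 801.0]])

# Min-max scaling by hand, column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same operation using sklearn
scaled = MinMaxScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))
```

After scaling, both columns run from 0 to 1, so neither dominates the distance calculation.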

Import `MinMaxScaler` from `sklearn.preprocessing`.

In [63]:
from sklearn.preprocessing import MinMaxScaler

Create a MinMaxScaler object, and call it `min_max_scaler`.

In [64]:
min_max_scaler = MinMaxScaler()

Create `anime_scaled_features` by calling `fit_transform` on `anime_dataset_movies_ova_features`.

In [65]:
ascaled_features = min_max_scaler.fit_transform(df_ao_features)
print(ascaled_features)

[[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 9.22029703e-01
  4.30295829e-01 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 8.89851485e-01
  1.73027717e-01 1.00000000e+00]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 ... 8.88613861e-01
  1.55558511e-01 0.00000000e+00]
 ...
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 3.66336634e-01
  4.58982218e-04 2.75229358e-02]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 3.78712871e-01
  3.64612042e-04 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 4.38118812e-01
  2.93834410e-04 0.00000000e+00]]


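Use `np.round` to round the features in `anime_scaled_features` to 2 decimal places for consistency. You should see that the features are now a 2-D array: one row per anime, with each row holding that anime's scaled feature values. Note that `np.round` returns a rounded copy, so `anime_scaled_features` itself is unchanged unless you assign the result.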

In [66]:
# https://stackoverflow.com/questions/22261843/python-np-round-with-decimal-option-larger-than-2
np.round(ascaled_features, decimals=2)

array([[0.  , 0.  , 0.  , ..., 0.92, 0.43, 0.  ],
       [0.  , 0.  , 0.  , ..., 0.89, 0.17, 1.  ],
       [0.  , 0.  , 1.  , ..., 0.89, 0.16, 0.  ],
       ...,
       [0.  , 0.  , 0.  , ..., 0.37, 0.  , 0.03],
       [0.  , 0.  , 0.  , ..., 0.38, 0.  , 0.  ],
       [0.  , 0.  , 0.  , ..., 0.44, 0.  , 0.  ]])

# Fit Feature Data to k Nearest Neighbor Model

Okay, this is the main part! We're going to create a `KNeighborsClassifier` from `sklearn.neighbors` to attempt to classify our data! 

Import `KNeighborsClassifier` from `sklearn.neighbors`.

In [67]:
from sklearn.neighbors import KNeighborsClassifier

Create a `knn` object from `KNeighborsClassifier`, and set the number of neighbors (`n_neighbors`) to 3.

In [68]:
# instantiate learning model (k = 3)
knn = KNeighborsClassifier(n_neighbors=3)

Now, we're going to create training and test partitions of our data for our kNN classifier to work on!

Import `train_test_split` from `sklearn.model_selection`

In [69]:
from sklearn.model_selection import train_test_split

We need to pass our features and their corresponding categories to our classifier. Our features are stored in `anime_scaled_features`. What we are trying to predict is the `type` (`Movie` or `OVA`) -- so we will pass the `anime_dataset_movies_ova['type']` series as our output labels.

`X` is going to be our features (`anime_scaled_features`).
`y` is going to be our output categories for the features (`anime_dataset_movies_ova['type']`).

Call `X_train, X_test, y_train, y_test = train_test_split` (you can use *Shift+Tab* to check out the parameters you need to pass). `X` represents features and `y` represents labels. Assign the output of `train_test_split` to:

`X_train` (the feature training data),<br>
`X_test` (the feature testing data), <br>
`y_train` (the training output labels), and <br>
`y_test` (the testing output labels). <br>

Use a `test_size` of 0.2, which means that we'll use 20% of our data as testing data, and the remaining 80% as training data. Set the random state to 101, which simply represents a seed we'll use to split our data.

In [70]:
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
# X_train, X_test, y_train, y_test = train_test_split(
# ...     X, y, test_size=0.33, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(ascaled_features, df_ao['type'],test_size = 0.2, random_state = 101)
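To get a feel for `test_size` and `random_state`, here is a small sketch on made-up data: 10 samples split 80/20, run twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 10 samples with 2 features each, and alternating labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101)
print(X_train.shape, X_test.shape)   # 8 training rows, 2 test rows

# Using the same seed again reproduces exactly the same split
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=101)
print(np.array_equal(X_test, X_test_again))
```

Fixing the seed makes results repeatable from run to run, which is useful when comparing notebooks.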

Now, fit our knn model to our new training data by calling `knn.fit`, and passing it our training feature data, `X_train`, and our training labels, `y_train`.

In [71]:
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

Let's see how well our model works! Call `knn.predict` and pass in our test data, `X_test`. Store the predictions returned in a variable called `predictions`.

In [72]:
predictions = knn.predict(x_test)

Display `predictions` -- you should see an array containing a large list of `Movie` and `OVA` entries -- these are the predictions that our classifier just made for all our corresponding feature test data.

In [73]:
predictions

array(['Movie', 'OVA', 'Movie', ..., 'OVA', 'OVA', 'Movie'], dtype=object)

To get a sense of how successful our classifier was, we will compare our predictions with the actual ('real') output, and print a confusion matrix and a classification report that helps put these results into perspective. From `sklearn.metrics`,  import `classification_report` and `confusion_matrix`

In [74]:
from sklearn.metrics import classification_report, confusion_matrix

Print the `confusion_matrix`, by passing in `y_test` (actual output) and `predictions` (our predictions)

In [75]:
print(confusion_matrix(y_test, predictions))

[[340 126]
 [163 503]]


OK! You should see a 2x2 matrix of numbers, but what do these numbers mean? A confusion matrix is actually very simple to read: it shows the output predicted and compares it to the actual output, by categorizing the number of:

*true positives* - actual output is yes, predicted output is yes <br>
*false positives* - actual output is no, predicted output is yes <br>
*true negatives* - actual output is no, predicted output is no <br>
*false negatives* - actual output is yes, predicted output is no <br>

You can learn more about the confusion matrix here: <br>
http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
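The four counts can also be tallied by hand. The sketch below uses two short made-up label lists and treats `OVA` as the 'yes' class (that choice is an assumption for this example), then compares the tallies with `confusion_matrix`.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (made up for illustration)
actual    = ["OVA", "OVA", "Movie", "Movie", "OVA", "Movie"]
predicted = ["OVA", "Movie", "Movie", "OVA", "OVA", "Movie"]

# Treat "OVA" as the positive ('yes') class
pairs = list(zip(actual, predicted))
tp = sum(a == "OVA" and p == "OVA" for a, p in pairs)      # actual yes, predicted yes
fp = sum(a == "Movie" and p == "OVA" for a, p in pairs)    # actual no, predicted yes
tn = sum(a == "Movie" and p == "Movie" for a, p in pairs)  # actual no, predicted no
fn = sum(a == "OVA" and p == "Movie" for a, p in pairs)    # actual yes, predicted no
print(tp, fp, tn, fn)

# sklearn puts actual labels on the rows and predictions on the columns,
# so with labels ordered [Movie, OVA] the layout is [[tn, fp], [fn, tp]]
cm = confusion_matrix(actual, predicted, labels=["Movie", "OVA"])
print(cm)
```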

We can print a slightly 'prettier' version of the confusion matrix by using the `nltk` library instead. We'll use this library in more depth later on in the course (for natural language processing), but for now, let's quickly use it to print a slightly easier to read confusion matrix:

from `nltk`, import `ConfusionMatrix`, and then print `ConfusionMatrix(list(y_test), list(predictions))`.

In [76]:
from nltk import ConfusionMatrix
print(ConfusionMatrix(list(y_test), list(predictions)))

      |   M     |
      |   o     |
      |   v   O |
      |   i   V |
      |   e   A |
------+---------+
Movie |<340>126 |
  OVA | 163<503>|
------+---------+
(row = reference; col = test)



The rows represent the actual output (the 'reference'), and the columns represent the predicted output (which `nltk` labels 'test').

To quantify the results in our confusion matrix, we can print a classification report. This gives us the values of the precision, recall, f1-score, and support of our classifier. 

Precision = true positives / (true positives + false positives) <br>
Recall = true positives / (true positives + false negatives) <br>
F1 Score = Harmonic Mean of Precision & Recall <br>
Support = Number of occurrences of each label in actual output. <br>
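These formulas can be applied directly to the confusion matrix printed above. The sketch below is just the arithmetic, with the matrix values copied from that output; the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix from above: rows = actual (Movie, OVA), columns = predicted
cm = np.array([[340, 126],
               [163, 503]])

results = {}
for i, label in enumerate(["Movie", "OVA"]):
    tp = cm[i, i]                  # correctly predicted as this label
    fp = cm[:, i].sum() - tp       # predicted as this label, but actually the other
    fn = cm[i, :].sum() - tp       # actually this label, but predicted as the other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    results[label] = (precision, recall, f1, cm[i, :].sum())
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={cm[i, :].sum()}")

# Overall accuracy: correct predictions (the diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()
print(f"accuracy={accuracy:.2f}")
```

Each label is treated in turn as the 'positive' class, which is why the report has one row per label.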

The lectures will cover these topics in more detail and in a way that is more conceptual and less mathematical.

Print the `classification_report`, by passing in `y_test` (actual output) and `predictions` (our predictions)

In [77]:
print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

      Movie       0.68      0.73      0.70       466
        OVA       0.80      0.76      0.78       666

avg / total       0.75      0.74      0.75      1132



One problem with a simple `train_test_split` is that, since only a single split of data is made, the score we get depends on which rows happened to land in the test set, and it's possible that our data has been split into parts that don't contain enough variation in terms of feature values. This can cause our model to *overfit*, i.e. it fits the training data very closely but is very sensitive to the slightest change in test data, resulting in high variance (we'll explore these concepts in more detail later).

To reduce this problem, we run a process called cross-validation. Cross-validation involves splitting our data into equally sized partitions (called *folds*), and then running `fit` and `predict` once per fold, each time holding out a different fold as the test data and training on all the remaining folds.

To do this:

from `sklearn.model_selection` import `cross_val_score` (older scikit-learn releases exposed it in `sklearn.cross_validation`, which has since been removed)


In [78]:
from sklearn.model_selection import cross_val_score

Compute the `scores` by running `cross_val_score` and passing in our classifier (`knn`), all our feature data (`X`) and label data (`y`), and the number of *folds* for the cross validation (let's go with 10).

In [80]:
# https://github.com/arahuja/GADS7/wiki/Scikits-Learn-and-K-Nearest-Neighbors
scores = cross_val_score(knn, ascaled_features, y = df_ao['type'], cv = 10)

Print `scores`. You'll notice it's an array of scores; one for each fold that was run.

In [82]:
print(scores)

[0.58201058 0.64840989 0.61660777 0.54240283 0.51590106 0.60070671
 0.80565371 0.8745583  0.84247788 0.85486726]


We can simply print the average (`np.mean`) of our scores to get the overall score for our classifier.

In [87]:
np.mean(scores)

0.6883595997439456
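To make the fold mechanics concrete, the sketch below runs the folds by hand on synthetic data and compares the result with `cross_val_score`. It assumes that, for a classifier and an integer `cv`, `cross_val_score` uses stratified folds without shuffling; the synthetic dataset from `make_classification` is a stand-in and has nothing to do with the anime data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 samples, 5 features, 2 classes
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)

# Run the folds by hand: train on 4 folds, score on the held-out fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    knn.fit(X[train_idx], y[train_idx])
    manual_scores.append(knn.score(X[test_idx], y[test_idx]))

# Let sklearn do the same thing in one call
auto_scores = cross_val_score(knn, X, y, cv=5)

print(np.round(manual_scores, 3))
print(np.allclose(manual_scores, auto_scores))
```

Because the folds are taken in order, a dataset that is sorted (the anime rows appear to be ordered by rating) may produce quite different scores from fold to fold, which could be one reason the ten scores above vary so much.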

## Wine DataSet

You'll notice that we only got around a 70% score for our kNN classifier. As mentioned earlier, how good a classifier is depends (amongst other things) on how good the features are. We are going to quickly explore this concept on a different dataset, the wine dataset, and we'll see how the results differ. We won't have to clean our data, so this process should be much faster.

Import the `wine.csv` dataset into `wine_dataset` by using `pd.read_csv`.

In [101]:
wine_dataset = pd.read_csv('wine.csv')

Print its `head()`.

In [103]:
wine_dataset.head()

Unnamed: 0,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


Notice that the data looks much more relevant, i.e. the features look like they are going to be more helpful in trying to guess the kind of wine.

Note that this data set contains 3 kinds of wine (1, 2, 3) -- all contained in the index of the dataframe. These are effectively our output labels. This is convenient, because the remaining columns of the dataset form our features. So we don't need to slice our dataframe to create our feature and label variables separately. We can simply use `wine_dataset` as our features, and `wine_dataset.index.values` as our labels.

Of course, we still need to scale our features before using them. To mix things up, let's use a `StandardScaler` object instead of a `MinMaxScaler`. The steps remain the same: use the `StandardScaler` in the exact same way as you did for the `anime_dataset` to scale the `wine_dataset`.
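For comparison with min-max scaling, the sketch below shows what standard scaling does: each column has its mean subtracted and is divided by its standard deviation. The small matrix `X` is made up for illustration, and the hand-written version assumes the population standard deviation (`ddof=0`), which is what `np.std` computes by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made up): two columns on very different scales
X = np.array([[14.2, 1065.0],
              [13.2, 1050.0],
              [13.1, 1185.0],
              [14.4, 1480.0]])

# Standard scaling by hand: (value - column mean) / column standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same operation using sklearn
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6))  # each column now has mean 0
print(scaled.std(axis=0).round(6))   # and standard deviation 1
```

Unlike `MinMaxScaler`, the scaled values are not confined to 0-1, which is why the wine features include negative numbers.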

from `sklearn.preprocessing`, import `StandardScaler`

In [98]:
from sklearn.preprocessing import StandardScaler

Create a `StandardScaler` object, and store it in `standard_scaler`.

In [100]:
standard_scaler = StandardScaler() 

Apply a `fit_transform` on `wine_dataset` and store the result in `wine_scaled_features`.

In [108]:
wine_scaled_features = standard_scaler.fit_transform(wine_dataset)
wine_scaled_features

array([[ 1.51861254, -0.5622498 ,  0.23205254, ...,  0.36217728,
         1.84791957,  1.01300893],
       [ 0.24628963, -0.49941338, -0.82799632, ...,  0.40605066,
         1.1134493 ,  0.96524152],
       [ 0.19687903,  0.02123125,  1.10933436, ...,  0.31830389,
         0.78858745,  1.39514818],
       ...,
       [ 0.33275817,  1.74474449, -0.38935541, ..., -1.61212515,
        -1.48544548,  0.28057537],
       [ 0.20923168,  0.22769377,  0.01273209, ..., -1.56825176,
        -1.40069891,  0.29649784],
       [ 1.39508604,  1.58316512,  1.36520822, ..., -1.52437837,
        -1.42894777, -0.59516041]])

Let's store our wine_scaled_features in a DataFrame. Call `pd.DataFrame` and pass in `wine_scaled_features` for the data, using the `wine_dataset.columns.values` as the columns. Store the result in `wine_scaled_features_df`.

In [115]:
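label = wine_dataset.columns.values
# Store the scaled features in a DataFrame, then display it
wine_scaled_features_df = pd.DataFrame(wine_scaled_features, columns=label)
wine_scaled_features_df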

Unnamed: 0,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1.518613,-0.562250,0.232053,-1.169593,1.913905,0.808997,1.034819,-0.659563,1.224884,0.251717,0.362177,1.847920,1.013009
1,0.246290,-0.499413,-0.827996,-2.490847,0.018145,0.568648,0.733629,-0.820719,-0.544721,-0.293321,0.406051,1.113449,0.965242
2,0.196879,0.021231,1.109334,-0.268738,0.088358,0.808997,1.215533,-0.498407,2.135968,0.269020,0.318304,0.788587,1.395148
3,1.691550,-0.346811,0.487926,-0.809251,0.930918,2.491446,1.466525,-0.981875,1.032155,1.186068,-0.427544,1.184071,2.334574
4,0.295700,0.227694,1.840403,0.451946,1.281985,0.808997,0.663351,0.226796,0.401404,-0.319276,0.362177,0.449601,-0.037874
5,1.481555,-0.517367,0.305159,-1.289707,0.860705,1.562093,1.366128,-0.176095,0.664217,0.731870,0.406051,0.336606,2.239039
6,1.716255,-0.418624,0.305159,-1.469878,-0.262708,0.328298,0.492677,-0.498407,0.681738,0.083015,0.274431,1.367689,1.729520
7,1.308617,-0.167278,0.890014,-0.569023,1.492625,0.488531,0.482637,-0.417829,-0.597284,-0.003499,0.449924,1.367689,1.745442
8,2.259772,-0.625086,-0.718336,-1.650049,-0.192495,0.808997,0.954502,-0.578985,0.681738,0.061386,0.537671,0.336606,0.949319
9,1.061565,-0.885409,-0.352802,-1.049479,-0.122282,1.097417,1.125176,-1.143031,0.453967,0.935177,0.230557,1.325316,0.949319


Let's re-evaluate our knn model on our new wine dataset! Compute the `scores` by running `cross_val_score` and passing in our classifier (`knn`), all our feature data (`wine_scaled_features_df`), and our label data (`wine_dataset.index.values`). Use 10 folds again, and store the result in `scores`.

In [116]:
# https://github.com/arahuja/GADS7/wiki/Scikits-Learn-and-K-Nearest-Neighbors
# scores = cross_val_score(knn, ascaled_features, y = df_ao['type'], cv = 10)

scores = cross_val_score(knn, wine_scaled_features, wine_dataset.index.values, cv =10)

Print `np.mean(scores)`. You should see a score of around 95%!

In [119]:
np.mean(scores)

0.954499914000688

So by passing in 'better' data & features, you can see that our classifier works more successfully. We'll explore many of the concepts covered in this lab in greater detail in the lectures. Completing this lab should put you in an excellent position to tackle the homework assignment. As always, ask your instructor or TA if you have any questions. Good luck!