# Machine Learning: Lab 4 - Using Python to Design Classification Systems

Welcome to the final lab of the Machine Learning course at CoderSchool! Your final project is to build a *Spotify Music Recommendation System*. 

In order to do this, you will have to focus on having a good **DESIGN** for your system.

A good design means knowing a few things:
* Which Classifiers to use
* How to Connect them together
* How to produce a final "Score"

You spent the first half of today's lab trying to draw out a design and think of ways to produce your score from your classifiers.

Now, we will look at a few Python functions that can help you achieve your design. You don't *have* to use any of these objects; the final project can be completed without using these concepts. But knowing that they exist gives you some more options and you could find them helpful!

## DataSet

We'll use a consolidated dataset of `Dance`, `Jazz`, `Rock`, and `Rap` from Assignment 2.

In [1]:
import pandas as pd
import numpy as np

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
songs_dataset = pd.read_csv('Consolidated_DF_6000.csv')

In [3]:
songs_dataset.head()

Unnamed: 0,key,energy,liveliness,tempo,speechiness,acousticness,instrumentalness,time_signature,duration,loudness,valence,danceability,mode,time_signature_confidence,tempo_confidence,key_confidence,mode_confidence,genres
0,6,0.618964,0.375099,114.907,0.03443,0.129149,0.000397,0,4,302.37333,-9.496,0.947362,0.905836,0.712,0.62,0.828,0.87,dance
1,7,0.844817,0.067792,109.935,0.048568,0.127837,0.910389,1,4,435.8,-8.461,0.449367,0.687389,0.359,0.498,0.76,1.0,dance
2,0,0.940507,0.0508,128.046,0.029577,0.034144,0.881883,0,4,235.62667,-7.588,0.932188,0.728027,0.541,0.557,1.0,1.0,dance
3,0,0.965342,0.350438,124.939,0.047233,0.082816,0.000383,1,4,198.65333,-6.038,0.778784,0.638805,0.0,0.045,0.687,1.0,dance
4,1,0.639406,0.064024,88.306,0.116464,0.100388,0.941271,1,4,65.13333,-6.737,0.71993,0.866491,0.051,0.436,0.312,1.0,dance


Let's hold out 10% of the data as a test set and train on the remaining 90%. We'll work with `genres` first.

In [4]:
from sklearn.model_selection import train_test_split

In [1]:
# X = songs_dataset.drop('genres', axis=1)
# y = songs_dataset['genres']

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)

## Pipeline

The `Pipeline` object in scikit-learn is a simple way to connect many different processing steps together. Let's combine the `SelectKBest` step with a `RandomForestClassifier` step to see how we can use a `Pipeline`.

First, from `sklearn.pipeline` import `Pipeline`. Then from `sklearn.feature_selection` import `SelectKBest`, and import `RandomForestClassifier` from `sklearn.ensemble`. 

Now, create a `RandomForestClassifier` called `rfc`, and a `SelectKBest` object called `selector`.

The first thing we need to do is tell the `Pipeline` all the different steps that will be involved.

We can arrange steps in a list in the following way:
`steps = [(<name of step 1>, object 1), (<name of step 2>, object 2), etc.]`

So for our example, we can use something like
`steps = [('feature_selection', selector), ('random_forest', rfc)]`

Try it out!

Then, to make a `Pipeline`, simply pass the `steps` to our `Pipeline` object in the following way:<br>
`pipeline = Pipeline(steps)`

Now, we can call `.fit`, and `.predict` on our `pipeline` object just like any other model!

Call `.fit`, `.predict`, and then print the `classification_report`.
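
The steps above can be sketched as follows. This is a minimal sketch, not the lab solution: it assumes synthetic data from `make_classification` standing in for the songs dataset, and `k=10` and `n_estimators=50` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Assumption: synthetic stand-in for the songs data (17 numeric features, 4 classes).
X, y = make_classification(n_samples=600, n_features=17, n_informative=8,
                           n_classes=4, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=101)

selector = SelectKBest(k=10)  # keep the 10 highest-scoring features
rfc = RandomForestClassifier(n_estimators=50, random_state=101)

# Each step is a (name, object) tuple; the names are reused later by GridSearchCV.
steps = [('feature_selection', selector), ('random_forest', rfc)]
pipeline = Pipeline(steps)

# The pipeline behaves like a single model: fit runs every step in order.
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(classification_report(y_test, predictions))
```

With your own data, replace the `make_classification` call with the `X` and `y` taken from `songs_dataset`.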

### Combining GridSearchCV and Pipeline

The cool thing about a `Pipeline` is that you can use a single `GridSearchCV` object to try different combinations for different values! 

from `sklearn.model_selection` import `GridSearchCV`

We will try 2 values of `k` for `SelectKBest`, 2 values for `n_estimators` and 2 values for the `min_samples_split` for our `RandomForestClassifier`.

We have to use the following syntax for defining the parameters:
* `<name of step in pipeline> + '__' + <name of parameter>`.

For example:
* the name of our `SelectKBest` step in our pipeline is `feature_selection`
* we want to change the `k` value
* so the syntax is: `feature_selection__k`

In [782]:
# parameters = dict(feature_selection__k=[5,10])

Modify `parameters` above to add the following values for the `RandomForestClassifier` step as well:
* `n_estimators`: `[50, 100]`
* `min_samples_split`: `[2, 10]`

Great! Now you know how to use `GridSearchCV` -- call it on your `pipeline` object, passing in your new `parameters`. Set `verbose=3` so you can see it in action! Call `.fit` and `.predict` on your `GridSearchCV` object and print the `classification_report`.
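
One way the grid search over a pipeline might look is sketched below. It assumes synthetic data in place of the songs dataset and the step names `feature_selection` and `random_forest` used earlier; `cv=3` is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Assumption: synthetic stand-in for the songs data.
X, y = make_classification(n_samples=400, n_features=17, n_informative=8,
                           n_classes=4, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=101)

pipeline = Pipeline([
    ('feature_selection', SelectKBest()),
    ('random_forest', RandomForestClassifier(random_state=101)),
])

# Parameter names follow <step name>__<parameter name>.
parameters = dict(
    feature_selection__k=[5, 10],
    random_forest__n_estimators=[50, 100],
    random_forest__min_samples_split=[2, 10],
)

# 2 x 2 x 2 = 8 combinations, each evaluated with 3-fold cross-validation.
grid = GridSearchCV(pipeline, param_grid=parameters, cv=3, verbose=3)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```

After fitting, `grid.predict` uses the best combination found, refitted on the full training set.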

## Pickling

'Pickling' is a way to save your Python objects to disk. You can save almost any Python variable to your computer as a `.pickle` file and read it back later. It saves time and will help you when testing!

In [2]:
# import pickle
# some_model = [1, 2, 3, 4, 5, 6]

Print `some_model`.

The way to save your object is to call `pickle.dump`. You pass in the object and a file opened under a name with a `.pickle` extension (in this case, `my_model.pickle`). `wb` means the file is opened for writing in binary mode.

In [786]:
# pickle.dump(some_model, open('my_model.pickle', 'wb'))

Check your computer! You should see a file called `my_model.pickle`.

The way to load your object is to call `pickle.load`. You pass in the opened file that you want to load. `rb` means the file is opened for reading in binary mode.

In [None]:
# some_model_2 = pickle.load(open('my_model.pickle', 'rb'))

Print `some_model_2`.

Cool! We successfully saved a Python variable to disk and loaded it back.

In our example, we used a list, but you can use almost any object -- a pandas DataFrame, a RandomForestClassifier model, a Doc2Vec model, a Bag Of Words matrix - almost **ANYTHING** !
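
The save-and-load round trip can also be written with `with` blocks, which close the file automatically. This is a small sketch; writing to the system temp directory is an assumption made here to avoid cluttering your working folder.

```python
import os
import pickle
import tempfile

some_model = [1, 2, 3, 4, 5, 6]

# Assumption: store the file in the temp directory instead of the current folder.
path = os.path.join(tempfile.gettempdir(), 'my_model.pickle')

# 'wb' = write, binary mode
with open(path, 'wb') as f:
    pickle.dump(some_model, f)

# 'rb' = read, binary mode
with open(path, 'rb') as f:
    some_model_2 = pickle.load(f)

print(some_model_2 == some_model)  # True: the loaded copy equals the original
```

Only load pickle files that you created yourself or that come from a source you trust, because loading a pickle can execute arbitrary code.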

## Multi-Label Output

A song can be *both* `happy` and `celebratory`. For this reason, `moods` is an example of a variable that can be described as `multi-label`. In your final project, you might choose to work with `moods` and hence with multi-label output. You can do this in several ways. One way is to use one of scikit-learn's classifiers that support multi-label output directly! Examples: `RandomForestClassifier`, `KNeighborsClassifier`.

### DataSet
Remember how we saw that you can save almost anything using `.pickle`? Read in `songs_aggressive.pickle` and save it in `songs_aggressive`.

Print the `type` of songs_aggressive.

Notice it is a pandas DataFrame -- this means we saved a pandas dataframe as a pickle file. Nice! Print its `head`.

### Classification

We want to create a `train_test_split`. We want our features (`X`) to be our `audio_features` from `songs_aggressive`.<br>
**Note:** A quick way to get your features into a list format for the `train_test_split` from the pandas Series format is to use `.values.tolist()`. 

We want our labels, `y`,  to be the `moods` from `songs_aggressive`.

Print `y`.

You should see that, for many entries, there are multiple labels. For example, for index `36359`, the label has 3 values: <br>`['angsty', 'aggressive', 'rowdy']`.

Now what we want to do is convert our labels into numbers. Remember, computers always work with numbers!!! We will do this using the `MultiLabelBinarizer`.

from `sklearn.preprocessing` import `MultiLabelBinarizer`


Create a new instance of `MultiLabelBinarizer()` and save it in `mlb`.

Now call `.fit_transform` on our labels above (`y`). Store it in `y_labels`.

Print `y_labels` to see what it looks like. It should look like a bunch of lists of 1s and 0s. Each list is one song's set of labels converted into numbers.

Print `mlb.classes_`, `y_labels[0]`, and `y.iloc[0]` and compare them. You will notice how the labels have been encoded.
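
To see the encoding on a tiny example, here is a sketch with three made-up songs. Only the first label list comes from the text above; the other two are invented for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Assumption: toy labels; only the first row appears in the lab text.
y = [
    ['angsty', 'aggressive', 'rowdy'],
    ['aggressive'],
    ['rowdy', 'angsty'],
]

mlb = MultiLabelBinarizer()
y_labels = mlb.fit_transform(y)

# classes_ is sorted alphabetically; each column of y_labels matches one class.
print(mlb.classes_)   # ['aggressive' 'angsty' 'rowdy']
print(y_labels[0])    # [1 1 1]
print(y_labels[1])    # [1 0 0]
print(y_labels[2])    # [0 1 1]

# inverse_transform turns the 0/1 rows back into label tuples.
print(mlb.inverse_transform(y_labels)[1])  # ('aggressive',)
```

The same `inverse_transform` call is handy later for turning a classifier's 0/1 predictions back into mood names.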

Now, because we are going to use a `KNeighborsClassifier`, we need to scale our features. Use a `StandardScaler` to scale `X`.

Now we are ready to create our `train_test_split`! Use `test_size=0.1`, `random_state=42`, `X`, and `y_labels`.

Finally, create your `KNeighborsClassifier`, `.fit` it to the training data, call `.predict` and print the `classification_report`.
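
A minimal sketch of the full multi-label workflow is shown below. It assumes synthetic data from `make_multilabel_classification` in place of `songs_aggressive`, so the labels arrive already binarized; `n_neighbors=5` is an illustrative choice.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic features and 0/1 label matrix standing in for
# the audio features and the binarized moods.
X, y_labels = make_multilabel_classification(
    n_samples=500, n_features=12, n_classes=5, random_state=42)

# kNN relies on distances, so put every feature on the same scale first.
X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_labels, test_size=0.1, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)          # y_train has one column per label
predictions = knn.predict(X_test)  # so predictions do too

# zero_division=0 avoids warnings for labels that are never predicted.
print(classification_report(y_test, predictions, zero_division=0))
```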

## Classifier Chains

Classifier Chains combine several binary classifiers, one per label, so that each classifier in the chain receives the predictions of the previous classifiers as extra features! This can help when your labels are correlated. Moods are probably correlated, so let's see if it helps.

Let's try it out on our current dataset.

from `sklearn.multioutput` import `ClassifierChain`

`ClassifierChain` takes in as an argument the kind of classifier you want to use. Let's use `KNeighborsClassifier` so we can compare our results.

You can now `.fit` and `.predict` on your chain just like you would a normal classifier.

What did you observe? Did your score improve?

Remember that the number of classifiers in the chain equals the number of labels! You can check this by comparing `len(chain.estimators_)` and `len(mlb.classes_)`.

It's also useful to know that the *order* of the classifiers inside the chain matters and can influence your results. Check the scikit-learn [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.ClassifierChain.html) and [example](http://scikit-learn.org/stable/auto_examples/multioutput/plot_classifier_chain_yeast.html#sphx-glr-auto-examples-multioutput-plot-classifier-chain-yeast-py) for the parameters you can pass to your chain.

For more info, you can also check out section 4.1.2 on Classifier Chains in this [article](https://www.analyticsvidhya.com/blog/2017/08/introduction-to-multi-label-classification/).
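
A chain built on `KNeighborsClassifier` might look like the sketch below. It assumes synthetic multi-label data in place of the moods dataset; the random chain order and the micro-averaged F1 score are illustrative choices.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the scaled audio features and binarized moods.
X, y_labels = make_multilabel_classification(
    n_samples=500, n_features=12, n_classes=5, random_state=42)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_labels, test_size=0.1, random_state=42)

# The base classifier is copied once per label; order='random' shuffles
# the order in which the labels are chained.
chain = ClassifierChain(KNeighborsClassifier(n_neighbors=5),
                        order='random', random_state=42)
chain.fit(X_train, y_train)
chain_predictions = chain.predict(X_test).astype(int)

# One fitted classifier per label column.
print(len(chain.estimators_), y_train.shape[1])
print(f1_score(y_test, chain_predictions, average='micro'))
```

To compare against a plain `KNeighborsClassifier`, compute the same `f1_score` on its predictions for the same split.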

## Building Binary Classifiers and looking at Probabilities

Another technique you can try is to instead use individual binary classifiers and look at the probability of their predictions. This can help you do things like declare the *top* 3 moods for example. You can also use the probabilities in the calculation of your similarity score, when deciding which songs are more similar than others to your test song.

Let's take a quick look at how to inspect the probability for a prediction, using a `RandomForestClassifier` as an example.

### DataSet

We'll work with `genres` here, as in Assignment 1. Read in `Consolidated_Dance_Jazz.csv`, which contains all the audio features and the genres for `Dance` and `Jazz` songs. Store it in `songs_dance_jazz`.

In [4]:
# songs_dance_jazz = pd.read_csv('Consolidated_Dance_Jazz.csv')
# songs_dance_jazz.head()

Perform the following steps:
* Build a `LogisticRegression` classifier called `logReg`.
* Use all the audio features as features.
    * Remember to scale your features!
* Use `genres` as the labels.
* Make a `train_test_split` with `test_size=0.33, random_state=42`. 
* `.fit`, `.predict`, and print the `classification_report` for `logReg`.

You should see an `f1-score` of around `0.95`

### Probabilities

Let's look at the song at index 20 and see what the prediction probabilities were!

In [5]:
# logReg.predict(X_test[20].reshape(1,-1))

You can see the classifier predicted the song at index 20 as `dance`. How confident was it? We use the `predict_proba` method to find out!

You should see that it was around `80.14 %` sure that it was a `dance` song, and `19.86 %` sure it was `jazz`

To see the classes themselves, you can simply print `logReg.classes_`
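
Here is a sketch of inspecting the probabilities for one song. It assumes synthetic two-class data relabeled as `dance` and `jazz`, so the numbers will differ from the ones quoted above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for Consolidated_Dance_Jazz.csv.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
y = np.where(y == 0, 'dance', 'jazz')
X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

logReg = LogisticRegression()
logReg.fit(X_train, y_train)

song = X_test[20].reshape(1, -1)   # a single sample must be 2-D
prediction = logReg.predict(song)
probabilities = logReg.predict_proba(song)[0]

# The columns of predict_proba follow the order of logReg.classes_.
for genre, probability in zip(logReg.classes_, probabilities):
    print(genre, round(probability * 100, 2), '%')
print('predicted:', prediction[0])
```

Sorting the probabilities (for example with `np.argsort`) is one way to pick the top few classes when there are more than two.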

## Voting Classifiers

Voting Classifiers are a convenient way to group many classifiers together and take the result that most of them predict. Let's see this with a quick example using 3 classifiers: `KNeighborsClassifier`, `SVC`, and `LogisticRegression`.

Do the following:
* Create a `KNeighborsClassifier` (k-Nearest Neighbors classifier)
* Create an `SVC` (Support Vector Machine classifier)
* Create a `LogisticRegression` classifier

Now, using the same `X_train` / `X_test` / `y_train` / `y_test` as before, fit all your classifiers to the training data.

Now: Let us look at the prediction each classifier has for the song at index `20`.

In [813]:
# song_index = 20
# print(logReg.predict(X_test[song_index].reshape(1,-1))) # Logistic Regression prediction
# print(knn.predict(X_test[song_index].reshape(1,-1))) # kNN prediction
# print(svc.predict(X_test[song_index].reshape(1,-1))) # Support Vector Machine Prediction

['dance']
['jazz']
['dance']


Notice how `kNN` predicted the song as `jazz`, while the other 2 predicted the song as `dance`.

What we can do is create a `VotingClassifier` that has these 3 classifiers, and its output will automatically be the most popular prediction.

from `sklearn.ensemble` import `VotingClassifier`

Create a `VotingClassifier` that contains all your above classifiers! It requires you pass in a `list` of your classifiers, each element in the list with the following syntax:

`(<some name for your classifier>, <your classifier object>)`

Now, `.fit` your `voting_classifier` on `X_train` and `y_train`.

Let's look at the prediction for song at index 20 !

Notice it is `dance`! This is because it was the majority vote from the previous 3 classifiers.
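
The voting setup can be sketched as follows, assuming synthetic two-class data in place of the dance and jazz songs. The names `knn`, `svc`, and `logReg` mirror the cell above; `voting='hard'` means a plain majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in for the dance/jazz dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
y = np.where(y == 0, 'dance', 'jazz')
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

knn = KNeighborsClassifier().fit(X_train, y_train)
svc = SVC().fit(X_train, y_train)
logReg = LogisticRegression().fit(X_train, y_train)

# Each entry is (<some name>, <classifier object>).
# VotingClassifier fits its own copies of the three classifiers.
voting_classifier = VotingClassifier(
    estimators=[('knn', knn), ('svc', svc), ('log_reg', logReg)],
    voting='hard')
voting_classifier.fit(X_train, y_train)

song = X_test[20].reshape(1, -1)
for name, clf in [('knn', knn), ('svc', svc), ('log_reg', logReg)]:
    print(name, clf.predict(song))
print('vote', voting_classifier.predict(song))
```

With three classifiers and two classes there can never be a tie, so the vote always matches at least two of the individual predictions.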

## Cosine Similarity

If you decide to use the **Million Song DataSet** *(MSD)*, you won't have any mood information (unless you try to use the *MSD* and `MasterSongList.json` together ;)). In this case, one very popular similarity metric is called *cosine similarity*. It measures the cosine of the angle between 2 vectors, so 2 vectors count as similar when they point in nearly the same direction, regardless of their lengths. You can learn more about cosine similarity in this [picture](https://lh5.googleusercontent.com/lYq5EWtpgku57oUGff4oBcQWNaxmvj9IIXGF7_ILr9uA1wgvlI0_j8dYc00).

Let's create 3 sentences that are similar. We'll make sentence_1 and sentence_2 more 'similar' to each other than sentence_3.

In [7]:
# sentence_1 = "Mason really loves food"
# sentence_2 = "Hannah loves food too"
# sentence_3 = "The pizza is food"
# all_sentences = [sentence_1, sentence_2, sentence_3]

Great, now let's create a bag of words model using `CountVectorizer` and `.fit_transform` it to the above sentences.

from `sklearn.metrics.pairwise` import `cosine_similarity`

Let's see how 'similar' sentence_1 and sentence_2 are. Remember that the feature vectors for these sentences live inside our `bag_of_words` !

You see a value of `0.5`, which means they are pretty similar!

What about sentence_1 and sentence_3 ?

We see a value of `0.25`, which means they are not as similar.

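Remember, a value of `1` means the 2 vectors point in the same direction, and the closer the value gets to `0`, the less similar they are. `-1` means they are the 'opposite'! With word counts, which are never negative, the value always falls between `0` and `1`.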
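
The whole comparison fits in a few lines. This sketch uses the three sentences from above with a default `CountVectorizer`; the variable name `similarities` is introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentence_1 = "Mason really loves food"
sentence_2 = "Hannah loves food too"
sentence_3 = "The pizza is food"
all_sentences = [sentence_1, sentence_2, sentence_3]

# One row per sentence, one column per word in the vocabulary.
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(all_sentences)

# similarities[i, j] is the cosine similarity between sentence i and sentence j.
similarities = cosine_similarity(bag_of_words)

print(similarities[0, 1])  # sentence_1 vs sentence_2: about 0.5 (share 'loves', 'food')
print(similarities[0, 2])  # sentence_1 vs sentence_3: about 0.25 (share only 'food')
```

Each sentence has four distinct words, so its vector has length 2; two shared words give 2 / (2 × 2) = 0.5 and one shared word gives 1 / (2 × 2) = 0.25.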

OK, that's it for now. All the best with the final project!

## Recommended Reading

**Multi-Label Classification: Yelp Example**<br>
http://mondego.ics.uci.edu/projects/yelp/

**scikit-learn page on Multi-Label and Multi-Class Classification**<br>
http://scikit-learn.org/stable/modules/multiclass.html

**GridSearchCV & Pipeline: Example**<br>
https://www.civisanalytics.com/blog/workflows-in-python-using-pipeline-and-gridsearchcv-for-more-compact-and-comprehensive-code/

**Classifier Chains**<br>
https://www.analyticsvidhya.com/blog/2017/08/introduction-to-multi-label-classification/ (see section 4.1.2)<br>
http://scikit-learn.org/stable/auto_examples/multioutput/plot_classifier_chain_yeast.html

---