## Machine Learning with Scikit-Learn

The underlying mathematics of different machine learning algorithms can differ substantially.  However, for our example, they all have something in common - they use information about what makes a review "positive" or "negative" to look for patterns in how to categorize records.  

In the previous notebook, we created a "bag of words" listing the most common terms associated with movie reviews.  Next, we generated a word vector for each record in our training set, showing us how often these terms occur within a movie review.  

We can now program or "train" a computer to recognize which word count vectors tend to be associated with a positive review and which ones tend to be associated with a negative one.

In [64]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import pandas as pd
import numpy as np
import re

### Training vs Testing sets

When running a machine learning algorithm, it's common to split your data into a training set and a testing set.  You use the training set to train your initial model, and you use the testing set to test the performance of the model on data that was not used to train the algorithm.

Generally, a machine learning model will perform very well on the data used to train it, so it's important to set aside some pre-categorized data to test the model.  

For this exercise, we have two different data sets, each with 500 records, trainReviews.tsv, and testReviews.tsv.  We'll use the training set to create a bag of words, decompose each record into a word vector, and use these data structures to populate and run a machine learning model that can be used to predict whether a review is positive or negative.  
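Our data already comes as two separate files, but if you only have a single labeled file, scikit-learn can carve out a testing set for you.  The sketch below is only an illustration on made-up data (the `texts` and `labels` lists are hypothetical, not part of this repository), using `train_test_split` to hold back 20% of the records.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten short reviews with alternating 0/1 labels
texts = [f"review number {i}" for i in range(10)]
labels = [0, 1] * 5

# Hold out 20% of the records for testing.
# stratify keeps the positive/negative balance the same in both sets.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

print(len(train_texts), len(test_texts))
```

Fixing `random_state` makes the split repeatable from run to run, which helps when comparing models.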

### Bag of words and word vectors

We'll start by creating a word vector for each record in the training set (see the workbook on BagOfWords for more on this).  

In [65]:
train = pd.read_csv('data/trainReviews.tsv', header=0, delimiter="\t", quoting=3)
train_records = []

for i in range( 0, len(train["text"])):
    text = train["text"][i]
    text = re.sub("[^a-zA-Z0-9]"," ", text)
    train_records.append(text.lower())

vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = 'english',   \
                             max_features = 5000)

train_data_features = vectorizer.fit_transform(train_records)
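Since the same cleaning step is applied again later to the testing set, it could be wrapped in a small helper.  This is just one possible way to organize it, not how the original notebook does it; `clean_text` and `sample_reviews` are names made up for this sketch.

```python
import re

def clean_text(text):
    """Replace every non-alphanumeric character with a space, then lowercase."""
    return re.sub("[^a-zA-Z0-9]", " ", text).lower()

# Hypothetical sample reviews, standing in for train["text"]
sample_reviews = ["Great film!!", "Waste of time... 0/10"]

# A list comprehension replaces the index-based loop
cleaned = [clean_text(t) for t in sample_reviews]
print(cleaned)
```

With a helper like this, the training and testing cells could both call `[clean_text(t) for t in df["text"]]` and stay in sync.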


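There is one detail we didn't need to think about last time - the data type of our word vectors.  `fit_transform` returns a SciPy sparse matrix rather than a dense numpy array.  Scikit-learn's classifiers accept sparse matrices directly, so no conversion is required before training.  Note that `np.asarray`, shown below, does not produce a dense array of counts: it wraps the sparse matrix in an object array (see the `dtype=object` in the output), and because the result isn't assigned, our data features are unchanged.  If you ever need a dense array, call `train_data_features.toarray()` instead.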

In [66]:
np.asarray(train_data_features)

array(<499x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 94239 stored elements in Compressed Sparse Row format>, dtype=object)
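To make the difference between wrapping and converting a sparse matrix concrete, here is a small sketch on a made-up two-document corpus (`docs` is hypothetical).  It compares `np.asarray` with the sparse matrix's own `toarray()` method.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus
docs = ["good good movie", "bad movie"]
sparse_features = CountVectorizer().fit_transform(docs)

# np.asarray wraps the sparse matrix in an object array; no counts are unpacked
wrapped = np.asarray(sparse_features)

# toarray() builds a real 2-D array of word counts (one row per document)
dense = sparse_features.toarray()

print(wrapped.dtype)
print(dense.shape)
print(dense)
```

For a large vocabulary the dense form can use a lot of memory, which is one reason to keep the sparse matrix whenever the model accepts it.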

### Choose a Machine Learning model

Up to now, all our work has focused on preparing data - cleaning, parsing, filtering, and formatting into data structures usable by scikit-learn's machine learning libraries.

While that may seem like preamble, I'd like to once again emphasize how much of the "real work" is in the preparation of the data.  Our example is based on a neatly formatted data set, and even then we had to do plenty of processing and formatting.  In a real world project, the amount of work involved in gathering and preparing data would increase by an order of magnitude.

We are now ready to train and run a machine learning model.  Scikit-learn provides a very consistent API that allows you to swap out different machine learning models with minimal code changes.  

For our first pass, let's go with a Random Forest model.  First, we build a random forest classifier, then we fit our training set to it.  

In [67]:
clf = RandomForestClassifier()
clf.fit( train_data_features, train["category"] )

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

Believe it or not, we now have a trained Random Forest model that we can use to make predictions.

There are, of course, a number of parameters you will want to consider, so it's not all *quite* that simple.  But really, the amount of work that goes into preparing data can easily exceed the amount of work that goes into building and running a model.  

### Evaluating the Model

Now that we have a trained random forest, we can apply it to our testing set - the pre-categorized records that we put aside to evaluate the effectiveness of the algorithm.  

First, let's use this random forest to categorize our test reviews into positive and negative (1 and 0).  Once that's done, we'll compare these to the actual reviews to evaluate the accuracy of our Random Forest categorizer's predictions.

In [68]:
test = pd.read_csv('data/testReviews.tsv', header=0, delimiter="\t", quoting=3)
test_records = [] 
    
for i in range(0,len(test["text"])):
    text = test["text"][i]
    text = re.sub("[^a-zA-Z0-9]"," ", text)
    test_records.append(text.lower())

### Making predictions

Using the model to make predictions follows a similar pattern to building the model.  We decompose each record in our testing set into a word vector and feed it to the model.  

The difference is that, rather than using each record's word vector and its known category to train the classifier, the random forest model now uses a record's word vector to predict that record's category.  

In [69]:
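# Reuse the fitted vectorizer so the test vectors share the training vocabulary
test_data_features = vectorizer.transform(test_records)

# predict() accepts the sparse matrix directly, so no conversion is needed
predictions = clf.predict(test_data_features)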
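Notice that the testing set goes through `transform`, not `fit_transform`.  The sketch below, on made-up documents (`train_docs` and `test_docs` are hypothetical), illustrates why: the vocabulary is learned once from the training data, and the test vectors must line up with the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data
train_docs = ["good movie", "bad movie"]
test_docs = ["good acting bad plot"]

toy_vectorizer = CountVectorizer()

# fit_transform learns the vocabulary (bad, good, movie) and counts the terms
toy_train = toy_vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; words never seen in training are dropped
toy_test = toy_vectorizer.transform(test_docs)

print(toy_train.shape, toy_test.shape)
print(toy_test.toarray())
```

Calling `fit_transform` on the test data instead would build a different vocabulary, and the columns would no longer mean the same thing to the trained model.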

We now have a prediction for each record in the testing set.  We also have an actual classification for each record in the testing set, so we can use these to assess the accuracy of the random forest model.

### Parsing and interpreting results

Let's use a pandas data frame to store the record id and the predicted category for each record in our testing set.

In [70]:
prediction_list = pd.DataFrame( data={"id":test["id"], "category":predictions} )

I'm leaving these commented out because they generate long lists.  You can uncomment them if you'd like to see the prediction list alongside the actual category for each record.

The prediction list will provide you with the record id and the predicted category (0 for negative, 1 for positive) for each record.  The category list will provide the actual category of each review.

In [71]:
#print(prediction_list)
#print(test['category'])

### Evaluating the accuracy

Now that we have predictions for each record in the testing set, we can compare the results with the actual categories to evaluate the accuracy of the random forest model.

We'll take a very simple approach: calculating the fraction of predictions that match the actual category.  

In [72]:
print(1 - float(sum(abs(prediction_list['category'] - test['category']))) / float(len(test['category'])))

0.632


How did you do?  Note that (not surprisingly) there is an element of randomness in a "random forest".  I'm getting an accuracy of about 65%, but the precise number will vary each time you run the algorithm.
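The same calculation is available as a ready-made function in `sklearn.metrics`.  As a minimal sketch on made-up labels (`actual` and `predicted` are hypothetical stand-ins for `test['category']` and `predictions`), `accuracy_score` gives the fraction of correct predictions, and `confusion_matrix` breaks the errors down by class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, standing in for test["category"] and predictions
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]

# Fraction of records where the prediction matches the actual category
accuracy = accuracy_score(actual, predicted)

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
matrix = confusion_matrix(actual, predicted)

print(accuracy)
print(matrix)
```

If you want the same score every time you rerun the notebook, you can pass a fixed `random_state` when creating the `RandomForestClassifier`.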

### Choosing a different machine learning algorithm

One nice thing about the scikit-learn API is that it allows you to swap out algorithms very easily.  Neural nets, naive bayes, random forest, logistic regression, support vector machines, and other algorithms all have a very similar interface.  The underlying mathematics can be dramatically different, but the programming interface is, fortunately, very similar.  

### Try a neural net!

To use a neural net, you only need to make a few changes to this workbook.

First, scroll back up to the first cell and load the appropriate library

from sklearn.neural_network import MLPClassifier

Then, where you build your machine learning model, call a neural network classifier rather than a random forest model

clf = MLPClassifier()

Run the full notebook, and you'll be making classification predictions using a neural net model rather than a random forest model.
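Because every classifier exposes the same `fit` and `predict` methods, you can also compare several in a loop instead of editing the notebook each time.  The following is a rough sketch on a tiny made-up data set (all documents and labels are hypothetical), so the predictions themselves mean very little; the point is the shared interface.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data
train_docs = ["great film", "excellent acting", "bad plot", "waste of time"]
train_labels = [1, 1, 0, 0]
test_docs = ["great acting", "bad film waste"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

# Three different algorithms, one shared interface
models = {
    "random forest": RandomForestClassifier(random_state=0),
    "naive bayes": MultinomialNB(),
    "neural net": MLPClassifier(max_iter=500, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, train_labels)           # train on the word vectors
    results[name] = model.predict(X_test).tolist()
    print(name, results[name])
```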

### Try some out

You can try out these algorithms and see how well they predict our test set.  Your numbers will vary, but for this particular case, random forest appears to be somewhat less effective than many of the other algorithms.

Of course, this doesn't mean that random forest is a less effective machine learning algorithm than naive bayes or neural nets!  

For starters, we left all the parameters for these algorithms at their defaults.  There are a number of parameters you can supply to these libraries that can dramatically influence the performance and effectiveness of your predictions.  The parameters depend on the particular algorithm and the underlying mathematics.  For instance, with a random forest, you can control the depth of the trees, whereas for neural nets, you can control the number of hidden layers.  
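As a small illustration of where these parameters go, the sketch below sets the tree depth on a random forest and the hidden layers on a neural net.  The specific values are arbitrary examples, not tuned recommendations for the movie review data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Illustrative values only, not tuned for any particular data set
forest = RandomForestClassifier(
    n_estimators=100,   # number of trees in the forest
    max_depth=10,       # maximum depth of each tree
    random_state=0,     # fixed seed for repeatable results
)

net = MLPClassifier(
    hidden_layer_sizes=(50, 50),  # two hidden layers of 50 units each
    max_iter=500,                 # upper limit on training iterations
    random_state=0,
)

print(forest.get_params()["max_depth"])
print(net.get_params()["hidden_layer_sizes"])
```

Either object can then be dropped in wherever the notebook assigns `clf`.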

Check out the various classification models at 
http://scikit-learn.org/stable/supervised_learning.html

The data itself may also be the biggest factor here.  Some data sets just match very well to a particular prediction algorithm.  Again, this may have less to do with the innate effectiveness of an algorithm and more to do with factors in the data.  

### Insight vs Predictions

Machine learning models can be useful for gaining insight into your data in addition to making predictions.  Many of these models produce additional data that can help you understand what is driving the categorization of records.

The additional data produced varies between the different algorithms, as they often take different mathematical approaches to classification.  The scikit-learn library does provide a consistent interface to the extent that the algorithms produce similar data, but not all methods are available on all algorithms.

### Feature importances from a Random Forest

One type of insight you can gain from a machine learning model is feature importance.  Which words or terms are most likely to influence the classification?  Are some terms very pivotal, are others largely ignored?

Because a random forest model branches repeatedly on different terms, the model becomes "aware" of which terms are particularly influential in classifying a record.  Scikit-learn allows us to read this information off of a trained random forest model through the `feature_importances_` attribute.   

In [73]:
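# Pair each vocabulary term with its importance score.
# Note: on scikit-learn 1.0 and later, use vectorizer.get_feature_names_out();
# get_feature_names() was removed in version 1.2.
feature_importances = dict(zip(vectorizer.get_feature_names(), clf.feature_importances_))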
print(feature_importances)



That's kind of a mess.  Let's format it to make it more readable.

In [74]:
import operator
sorted_feature_importances = sorted(feature_importances.items(), key=operator.itemgetter(1), reverse=True)

In [75]:
sorted_feature_importances

[('bad', 0.015553064789704757),
 ('good', 0.0088329042495910795),
 ('excellent', 0.0083224234343156258),
 ('great', 0.0078809220736550233),
 ('job', 0.0076521765795480121),
 ('waste', 0.0070545789756716439),
 ('true', 0.0066687453569042755),
 ('video', 0.0064547926517219576),
 ('created', 0.0060192550442140069),
 ('ve', 0.0057939804332425241),
 ('intelligent', 0.0054351223295262124),
 ('supposed', 0.0054008143293932083),
 ('effective', 0.0052808289061364701),
 ('attitude', 0.0050297689077647301),
 ('mother', 0.0048077696232518164),
 ('laugh', 0.0046772571967839934),
 ('purpose', 0.0046673308191292409),
 ('frightening', 0.004479257448358788),
 ('hilarious', 0.0042847778402554893),
 ('issues', 0.0042149812865499448),
 ('shows', 0.0042131737375373807),
 ('entertaining', 0.0041052208480488756),
 ('just', 0.0040967500229007903),
 ('intriguing', 0.0039534251405054857),
 ('including', 0.0039235734960191681),
 ('form', 0.0038463905583789523),
 ('mature', 0.0038009409518776844),
 ('flawless', 0

Here, we can see what terms are most likely to assist the algorithm with categorization.  Some are intuitive, such as "bad", "great", and "excellent".  Keep in mind, our data set is relatively small, so it's possible that some terms are assigned greater predictive power than they really have.  For instance, is "ben" really predictive for a new film that has a character named "ben"?
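If you only want the handful of most influential terms, one option is to sort the importance scores with numpy and look up the matching terms.  This sketch uses a tiny made-up corpus (all documents and labels are hypothetical), so the ranking it prints says nothing about real movie reviews; it reads the terms from the vectorizer's `vocabulary_` attribute, which maps each term to its column index.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data
docs = ["great film", "excellent film", "great acting",
        "bad film", "bad plot", "waste of time"]
labels = [1, 1, 1, 0, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)

# vocabulary_ maps term -> column index; sort terms into column order
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Indices of the three largest importance scores, highest first
top = np.argsort(forest.feature_importances_)[::-1][:3]
top_terms = [(terms[i], float(forest.feature_importances_[i])) for i in top]

for term, score in top_terms:
    print(term, round(score, 3))
```

The importance scores across all terms sum to 1, so each value can be read as a share of the model's total splitting "credit".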

So it's not all just about predictive power - the algorithm you choose may have more to do with getting insight into your data.  

### Next steps

Jupyter notebooks are great for interactively exploring code.  However, for very sequential operations on data, you'll probably want to write a more succinct script.

Classifier.py, included in this repository, provides a shorter version of everything we've covered here, including reading in a file, creating a bag of words and word vector, training a model, running the model on a training set, and parsing the output.  

I've also included the libraries and method calls (commented out) so you can experiment with swapping out different machine learning libraries.


### Keep reading and coding!

At this point you have the software carpentry basics for running a machine learning model on a small (fits on one laptop) data set.  

From here, I'd recommend learning as much about the different algorithms as you can, investigating the scikit-learn docs, experimenting with parameters and attributes, and looking for new datasets to try out.

Check out the READ_ME file in this repository for more links and suggestions.