## Introduction

### Random Forests
This tutorial is intended as a gentle introduction to **random forests**, which are an essential machine learning tool for any data scientist. The tutorial will introduce random forests in the context of NFL two point conversions, which are a critical part of decision making for coaches. With million dollar jobs on the line, coaches cannot afford to make mistakes, and generating a model can assist coaches in making these critical decisions. By using data science, coaches can make better decisions by casting model-based predictions, and thus get better results. In addition, to show the versatility of the random forest technique, we will run our model on tweet classification as well to classify a tweet as Republican or Democrat.

Random forests are considered to be a great way to predict results, since good old decision trees tend to overfit data. Random forests have also been shown to produce the best results out of the multitude of prediction algorithms out there (source: [link](http://machinelearningmastery.com/use-random-forest-testing-179-classifiers-121-datasets/)), and we will show this today by comparing the random forest classifier with SVM's on NFL and tweet data.

### Outline
Random forests can be used in practically any situation, so we will be showing aspects of the algorithm on two completely different use cases - one for predicting certain outcomes of NFL games, and the other for predicting the author of a tweet. The following topics will be covered in the extent of this tutorial:

1. Setup and installation
2. Introduction to Trees and Random Forests
3. NFL Prediction - data processing
4. NFL Prediction - Random Forest prediction of 2 pt conversions
5. Why Only 72%?
5. Another Worked Example: Tweets - data processing
6. Tweets - Random Forest prediction
8. Conclusion

## Setup and Installation

The majority of the code written in the tutorial will use the module sci-kit learn library, which has many pre-written algorithms for classifiers such as random forests. To install sci-kit learn, simply use pip as follows.

`pip install -U scikit-learn`

Next import the libraries to be used:

In [586]:
import text_classification as tc

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

## Introduction to Trees and Random Forests

In machine learning, one of the most important problems is **classification**. Over the years, researchers have developed various models that can help with classification. A very basic model for classification is the decision tree, which takes in a bunch of features, and basically goes feature by feature down a tree until an answer is reached (typically success or failure).

![Decision Tree](Decision_Tree_1.png)

This is a fairly simple way to clasify a test set we have never seen before - the for each data entry, traverse the decision tree based on the values of each feature in the entry, and come to a prediction at the leaf. Decision trees are really useful for a quick and simple predictor of the data set. In addition, the model is extremely easy to follow, as most users can easily visualize the tree and understand on a high level how the decision tree operates. 

However, plain decision trees have their Achilles' heel - overfitting of data. This happens because the tree and the data set are very closely related to each other. Thus, for the training data, decision trees will have a very high accuracy, but for a test set, a decision tree will suffer, since it is so closely related to the training set.

Thus, when it is known that the test set is completely different from training (which it usually is), a decision tree will not give the best results.

To solve this problem, random forests were invented! A **random forest** is a machine learning method used to counteract the major flaw with decision trees mentioned above. Random forests make many decision trees at once, using only randomly generated subsets of the data at the time. In addition, to determine splits in the tree, a random subset of features is considered at any given time. Both of these in tandem allow for greater randomization, and thus combat the problem of overfitting. 

![Random Forest](random_forest_new2.png)


In this tutorial today, we will be constructing our own random forests using the sci-kit learn library, which has a pre-existing class for random forests.


## NFL Prediction: Data Processing

To show the usefulness of random forests, we will first be predicting the success or failure of a 2-point conversion in any football game. Some background: After a touchdown, a team has the option of kicking an extra point (1 point, probability 93%) [[source]](http://www.rollingstone.com/sports/features/the-nfls-new-rule-extra-points-are-no-longer-pointless-20150908), or going for a two point conversion (2 points, 46.3% success) [[source]](http://archive.advancedfootballanalytics.com/2010/12/almost-always-go-for-2-point.html). Because of this, coaches are usually reluctant to go for 2, and rather take the sure bet of 1 point (with millions of dollars on the line, this makes sense!) We will create a model that will hopefully better predict the result of a two point conversion, based on various factors, so that coaches can make a more informed decision.


Our first data set is a set of all plays called throughout 2005-2013 (Go 49ers!!) - every single play is accounted for in this data set. Load the data set as follows. We will be using data from the 2005-2012 NFL season as the training set, and use the 2013 NFL season (through week 12) as the testing test to predict results.

Finally, we will be using a metric called DVOA, which measures the efficiency of a teams offense for a given season - these metrics are in a seperate csv.

Lets get started!

In [409]:
#DVOA sets, converted to dictionaries for future use
dvoa_all = dict()
dvoa_2005 = pd.read_csv("2005_offdvoa.csv").set_index('off').to_dict()
dvoa_2006 = pd.read_csv("2006_offdvoa.csv").set_index('off').to_dict()
dvoa_2007 = pd.read_csv("2007_offdvoa.csv").set_index('off').to_dict()
dvoa_2008 = pd.read_csv("2008_offdvoa.csv").set_index('off').to_dict()
dvoa_2009 = pd.read_csv("2009_offdvoa.csv").set_index('off').to_dict()
dvoa_2010 = pd.read_csv("2010_offdvoa.csv").set_index('off').to_dict()
dvoa_2011 = pd.read_csv("2011_offdvoa.csv").set_index('off').to_dict()
dvoa_2012 = pd.read_csv("2012_offdvoa.csv").set_index('off').to_dict()
dvoa_2013 = pd.read_csv("2013_offdvoa.csv").set_index('off').to_dict()
for d in [dvoa_2005, dvoa_2006,dvoa_2007,dvoa_2008,dvoa_2009,dvoa_2010, dvoa_2011, dvoa_2012, dvoa_2013]:
    dvoa_all.update(d)

#test set
test_set = pd.read_csv("2013_nfl_pbp_data_through_wk_12.csv")


# training set
train2005_set = pd.read_csv("2005_nfl_pbp_data.csv")
train2006_set = pd.read_csv("2006_nfl_pbp_data.csv")
train2007_set = pd.read_csv("2007_nfl_pbp_data.csv")
train2008_set = pd.read_csv("2008_nfl_pbp_data.csv")
train2009_set = pd.read_csv("2009_nfl_pbp_data.csv")
train2010_set = pd.read_csv("2010_nfl_pbp_data.csv")
train2011_set = pd.read_csv("2011_nfl_pbp_data.csv")
train2012_set = pd.read_csv("2012_nfl_pbp_data.csv")

train_set = pd.concat([train2005_set, train2006_set, train2007_set, 
                       train2008_set, train2009_set, train2010_set,
                       train2011_set, train2012_set]).dropna(how = 'all')


Verify that the data sets have been loaded by printing this for each data set

In [410]:
print train_set.columns
print train_set.head()
print train_set.tail()

Index([u'gameid', u'qtr', u'min', u'sec', u'off', u'def', u'down', u'togo',
       u'ydline', u'description', u'offscore', u'defscore', u'season'],
      dtype='object')
            gameid  qtr   min sec  off  def  down  togo  ydline  \
0  20050908_OAK@NE  1.0   0.0   0   NE  OAK   NaN   NaN    22.0   
1  20050908_OAK@NE  1.0  59.0  55  OAK   NE   1.0  10.0    72.0   
2  20050908_OAK@NE  1.0  59.0  19  OAK   NE   2.0   3.0    65.0   
3  20050908_OAK@NE  1.0  58.0  34  OAK   NE   1.0  10.0    61.0   
4  20050908_OAK@NE  1.0  58.0  14  OAK   NE   1.0  10.0    32.0   

                                         description  offscore  defscore  \
0  A.Vinatieri kicks 67 yards from NE 30 to OAK 3...       0.0       0.0   
1  (14:55) L.Jordan left end to OAK 35 for 7 yard...       0.0       0.0   
2  (14:19) K.Collins pass to J.Porter to OAK 39 f...       0.0       0.0   
3  (13:34) K.Collins pass to R.Moss to NE 32 for ...       0.0       0.0   
4  (13:14) K.Collins pass to L.Jordan pushed ob

As you can see, there are various columns regarding different aspects of the game. In this case, though, we only need to concern ourselves with two point conversions. So, we need to filter our dataframe by the **description** column. In addition, we drop unnecessary columns like togo, since yards to go doesn't really make sense when going for a two point conversion. In fact, we for our model, we only need the description, year, offense, offense team's score, and whether or not the play was a run or a pass.

Finally, for any classification algorithm, an important variable is to have is a boolean variable indicating success or failure. For each data frame, we will include such a variable which is marked zero if the try was not successful, and one if the try was successful.

In [411]:
def getTwoPointConv(df):
    '''This function gets all entries in the dataframe that pertain to two-point conversions'''
    tempDf = pd.DataFrame()
    succ = [] #will store a indicator variable for either success or not success
    typ = [] #stores an indicator variable for either run or pass
    for tup in df.iterrows():
        row = tup[1]
        if("CONVERSION" in row['description']):
            tempDf = tempDf.append(row)
    newDf = pd.DataFrame()
    newDf = newDf.assign(description = tempDf['description'], off = tempDf['off'], 
                         offScore = tempDf['offscore'], defScore = tempDf['defscore'], season = tempDf['season'])
    #get the success or failure of each two PT conversion
    for tup in newDf.iterrows():
        row = tup[1]
        if("ATTEMPT SUCCEEDS" in row['description']):
            succ.append(1)
        else:
            succ.append(0)
            
    #figure out whether or not the attemp was a run or a pass
    for tup in newDf.iterrows():
        row = tup[1]
        if("pass" in row['description']):
            typ.append(1)
        else:
            typ.append(0)
    res = pd.Series(succ)
    resTyp = pd.Series(typ)
    newDf = newDf.reset_index(drop = True)
    newDf = newDf.assign(success = res)
    newDf = newDf.assign(runOrPass = resTyp).dropna().reset_index(drop = True)
    return newDf

two_point = getTwoPointConv(train_set)
two_point_test = getTwoPointConv(test_set)
print two_point.head()
print two_point.tail()
print len(two_point)

                                         description  off  offScore  season  \
0  TWO-POINT CONVERSION ATTEMPT. K.Collins pass t...  OAK      14.0  2005.0   
1  TWO-POINT CONVERSION ATTEMPT. S.Jackson rushes...  STL      12.0  2005.0   
2  TWO-POINT CONVERSION ATTEMPT. K.Warner pass to...  ARI      13.0  2005.0   
3  (Pass formation) TWO-POINT CONVERSION ATTEMPT....  MIN       0.0  2005.0   
4  TWO-POINT CONVERSION ATTEMPT. M.Schaub pass to...  ATL      10.0  2005.0   

   success  runOrPass  
0        0          1  
1        0          0  
2        0          1  
3        1          1  
4        1          1  
                                           description  off  offScore  season  \
418  TWO-POINT CONVERSION ATTEMPT. C.Henne pass to ...  JAC      14.0  2012.0   
419  (Run formation) TWO-POINT CONVERSION ATTEMPT. ...  DAL      10.0  2012.0   
420  (Pass formation) TWO-POINT CONVERSION ATTEMPT....  SEA      13.0  2012.0   
421  TWO-POINT CONVERSION ATTEMPT. M.Schaub pass to...  H

In addition, based on analysis done by other sources [[link]](http://harvardsportsanalysis.org/2016/09/predicting-two-point-conversion-success-did-the-raiders-have-a-special-edge/), DVOA (Defense-adjusted Value Over Average) is a good indicator of a team's offensive strength. We will be using offensive DVOA by year for each team, and load these variables into our dataframe as well. To read more about DVOA, check out this link [here](http://www.footballoutsiders.com/info/methods).

In [412]:
def add_DVOA(df):
    dvoa_list = []
    for tup in df.iterrows():
        off_team = tup[1]['off']
        year = str(int(tup[1]['season']))
        dvoa_list.append(dvoa_all[year][off_team])
    res = pd.Series(dvoa_list)
    newDf = df.assign(offdvoa = res).dropna().reset_index(drop = True)
    return newDf

two_point = add_DVOA(two_point)
two_point_test = add_DVOA(two_point_test)
print two_point.head()
print len(two_point)
print len(two_point_test)

                                         description  off  offScore  season  \
0  TWO-POINT CONVERSION ATTEMPT. K.Collins pass t...  OAK      14.0  2005.0   
1  TWO-POINT CONVERSION ATTEMPT. S.Jackson rushes...  STL      12.0  2005.0   
2  TWO-POINT CONVERSION ATTEMPT. K.Warner pass to...  ARI      13.0  2005.0   
3  (Pass formation) TWO-POINT CONVERSION ATTEMPT....  MIN       0.0  2005.0   
4  TWO-POINT CONVERSION ATTEMPT. M.Schaub pass to...  ATL      10.0  2005.0   

   success  runOrPass  offdvoa  
0        0          1   -0.006  
1        0          0   -0.079  
2        0          1   -0.095  
3        1          1   -0.139  
4        1          1    0.033  
422
57


Finally, we must remove any columns with strings in them, as these will confuse the predictor.

In [525]:
two_point_clean = two_point.drop(["success", "off", "description"], axis = 1)
two_point_test_clean = two_point_test.drop(["success", "off", "description"], axis = 1)

We are now ready to apply our random forest method to our cleaned up data set!

## NFL Prediction: Random Forest Prediction

### Random Forest: Initial Setup

At the heart of any machine learning prediction algorithm are three main functions - these are fit, predict and score. **Fit** is used on the training data, **predict** is used on the test data to predict the outcome of an indicator variable, and **score** is used to determine the accuracy of the results. The random forest method is no different - we will be calling these three methods on our data set.

In [604]:
clf = RandomForestClassifier(random_state = 1234567)
clf.fit(two_point_clean, two_point["success"])
clf.score(two_point_test_clean, two_point_test["success"])

0.684210526316


Awesome!! This score indicates that our model predicted 67% of the 2-point conversions in the partial data for the 2013 NFL season. However, we can do better, and to do this, we need to change the parameters we input into our classifer. This process is known as **tuning** the random forest. 

### Random Forest: Tuning the Parameters

When we initialize our RandomForestClassifier, there are many optional parameters that it can take it. Take a look [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)! We will specially look at a couple, and how they can further enhance our model.

- *n_estimators*: This value determines the number of trees in the forest. More trees lead to higher accuracy, but the tradeoff is that the model will take longer to run. Finding the perfect value for this parameter is critical.
- *max_features*: This will determine the maximum number of features to consider - options include 'sqrt' or 'log2'

This tutorial will focus on n_estimators, although there are others, which you can check out on that website.

I found that for my computer, it is best to set n_estimators to be a relatively large number, like 500.

In [608]:
clf2 = RandomForestClassifier(n_estimators = 100, random_state = 1234567)
clf2.fit(two_point_clean, two_point["success"])
clf2.score(two_point_test_clean, two_point_test["success"])

0.7192982456140351

Aha! By setting n_estimators to be 500, the accuracy of our predictions increased by 0.03. If we increase the amount of features to consider (instead of the default sqrt(n), lets try a number > sqrt(n)), and see what happens to the accuracy of our model.

In [605]:
clf2 = RandomForestClassifier(n_estimators = 100, random_state = 1234567, max_features = 4)
clf2.fit(two_point_clean, two_point["success"])
clf2.score(two_point_test_clean, two_point_test["success"])

0.66666666666666663

Moral of the Story: Sometimes, its best to leave parameters blank! I observed that the accuracy of the model peaked at the default value of sqrt(n), so in this case, I would leave that parameter untouched.

Finally, some more cool things you can do with almost any classification mechanism in sklearn is extract the feature probabilities and get the actual predictions with the following commands.

In [611]:
print clf2.predict(two_point_test_clean)
print clf2.predict_proba(two_point_test_clean)

[0 1 0 0 0 0 0 1 0 0 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0
 1 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 1 1 1 0]
[[ 0.76        0.24      ]
 [ 0.16        0.84      ]
 [ 0.57        0.43      ]
 [ 0.61        0.39      ]
 [ 0.82        0.18      ]
 [ 0.61        0.39      ]
 [ 0.71        0.29      ]
 [ 0.47833333  0.52166667]
 [ 0.66        0.34      ]
 [ 0.54        0.46      ]
 [ 0.46        0.54      ]
 [ 0.4         0.6       ]
 [ 0.18        0.82      ]
 [ 0.46        0.54      ]
 [ 0.138       0.862     ]
 [ 0.31666667  0.68333333]
 [ 0.58        0.42      ]
 [ 0.81        0.19      ]
 [ 0.35        0.65      ]
 [ 0.52        0.48      ]
 [ 0.81        0.19      ]
 [ 0.46        0.54      ]
 [ 0.55833333  0.44166667]
 [ 0.55        0.45      ]
 [ 0.09        0.91      ]
 [ 0.32        0.68      ]
 [ 0.54833333  0.45166667]
 [ 0.54166667  0.45833333]
 [ 0.14083333  0.85916667]
 [ 0.045       0.955     ]
 [ 0.4         0.6       ]
 [ 0.32333333  0.67666667]
 [ 0.51        0.49

Thus, the probabilities indicate that there for the first attempt, there is a 24% chance of success. Therefore, in the predictions array, 0 is given as the prediction, since the chance of a failure is 76%.

### Comparision to Other Models

# Linear Support Vector Classification

In [564]:
clf5 = LinearSVC()
clf5.fit(two_point_clean, two_point["success"])
clf5.score(two_point_test_clean, two_point_test["success"])

0.49122807017543857

#### K-Neighbors Classification

In [565]:
clf6 = KNeighborsClassifier()
clf6.fit(two_point_clean, two_point["success"])
clf6.score(two_point_test_clean, two_point_test["success"])

0.49122807017543857

As you can see, our newly learned algorithm significantly outperforms other classification methods by about .20 in terms of score. With improvement, an NFL coach could use the Random Forest model that we just wrote, as it would help them correctly classify whether or not going for two points is a good idea.

## Why Only 72%

This is a good question - even though our new model outperformed the SVM we learned in class - 72% still means our classification model has a 1 in 4 chance of being incorrect. While that is pretty accurate, there are still ways we can improve the performance of our model. 

One major way we can improve our model is to use more training data - sadly, only about 400 2 point attempts have been made in the NFL from 2005 to 2012. More data could be used to train our model, and thus, make it more accurate. Another way that we could make more model stronger is by adding more relevant features to our data set, such as the DVOA of the defense, or data pertaining to weather conditions, which play a big role in the NFL.

## Another Worked Example: Tweets - Data Processing

Enough of sports; lets dive into another hotly contested topic - politics! We will be using the tweet dataset from the third homework to run a random forest classification to determine the author of a given tweet as Republican (class label 0) or Democrat (class label 1). Load the tweets as such.

In [589]:
tweets = pd.read_csv("tweets_train.csv", na_filter=False)
unlabeled_tweets = pd.read_csv("tweets_test.csv", na_filter=False)
print tweets.head()

      screen_name                                               text
0             GOP  RT @GOPconvention: #Oregon votes today. That m...
1    TheDemocrats  RT @DWStweets: The choice for 2016 is clear: W...
2  HillaryClinton  Trump's calling for trillion dollar tax cuts f...
3  HillaryClinton  .@TimKaine's guiding principle: the belief tha...
4        timkaine  Glad the Senate could pass a #THUD / MilCon / ...


To pre-process the data set, we will simply call the functions from our homework (essentially, we will make a TF-IDF vector that we can then predict on)

In [578]:
processed_tweets = tc.process_all(tweets)
rare_words = tc.get_rare_words(processed_tweets)
(tfidf, X) = tc.create_features(processed_tweets, rare_words)
y = tc.create_labels(processed_tweets)

## Tweets - Random Forest Prediction

In [607]:
#for some reason, this method was not importing correctly, so I have included it here
def classify_tweets(tfidf, classifier, unlabeled_tweets):
    cpy = tc.process_all(unlabeled_tweets)
    allWords = []
    for thing in cpy.iterrows():
        row = thing[1]
        words = row['text']
        allWords.append(" ".join(words))
    return classifier.predict(tfidf.transform(allWords))

clf7 = RandomForestClassifier(random_state = 1234567)
classifier = clf7.fit(X, y)
print clf7.score(X, y)
y_pred = classify_tweets(tfidf, classifier, unlabeled_tweets)
print y_pred

0.993236212279
[1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 1 1 0 1 0
 0 1 1 0 1 0 1 1 1 0 0 0 1 0 0 1 1 1 0 1 1 0 0 1 0 1 1 1 1 1 0 1 0 0 1 1 1
 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 1 1 0 1 1 1 0 0 0 0 1 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 1 0 0 1 1 1 1 1 0 1 0 0 0 0 1 1 1 1
 0 1 0 0 0 0 1 0 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 1 0 1 1 0
 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 0 1 0 0
 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 1 1 0 1
 0 0 1 0 1 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 1 0 1 1 1 1 0 0 1 1 0 1 1 1
 1 0 1 1 1 0 0 1 1 1 0 1 1 0 0 1 1 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 0 0 1 0 0
 1 1 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 1
 0 0 1 1 1 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 0 0 1 0 1 0 0 1 1 1 0 0 1 1 0 0 1
 1 0 1 0 1 1 0 0 1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 1
 0 0 1 1 1 0 0 1 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 1
 0 0 1 0 0

Once again, our random forest model outperformed the SVC we used in homework 3 - we have 99.3% accuracy with a random forest, as opposed to the 95% obtained by our SVC with the same function calls from the homework!

## Conclusion
In this tutorial, we went over the idea behind a random forest, and then we went through two worked examples of real world, appplicable problems that can be solved with a random forest. I hope this tutorial was useful, and if you need more information be sure to check out the links below!

Random Forest documentation: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Original Two Point Conversion Model: http://harvardsportsanalysis.org/2016/09/predicting-two-point-conversion-success-did-the-raiders-have-a-special-edge/

A very informative video about the concept of random forests: https://www.youtube.com/watch?v=loNcrMjYh64

A very famous paper going into great detail about random forests (only read if very interested - its highly technical!): https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf