# Random Forests: Presidential Contributions

Let's look at a random forest model for the presidential contributions dataset.

This dataset contains presidential campaign contribution amounts taken from publicly available information.

The purpose here is to try to predict the amount that a contributor will give.

Here are the feature columns we will use:
1. Last Name (converted from Contributor Name)
2. First Name (converted from Contributor Name)
3. State 
4. Latitude (converted from Zip code)
5. Longitude (converted from Zip code)
6. Employer
7. Occupation
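
As a rough sketch of how the name columns might have been derived, the snippet below splits a combined contributor name into last and first names. The raw column name `CONTBR_NM`, the `"LAST, FIRST"` layout, and the sample rows are all assumptions made for illustration; the cleaned file used in this notebook already contains `LASTNAME` and `FIRSTNAME`.

```python
import pandas as pd

# Made-up sample rows; the "LAST, FIRST" layout is an assumption about the raw data.
raw = pd.DataFrame({"CONTBR_NM": ["SMITH, JANE", "DOE, JOHN Q", "GARCIA"]})

# Split on the first comma only: the left part is the last name,
# everything after it belongs to the first name (plus any middle initial).
parts = raw["CONTBR_NM"].str.split(",", n=1, expand=True)
raw["LASTNAME"] = parts[0].str.strip()

# Keep only the first whitespace-separated token so a middle initial is dropped.
# Rows without a comma end up with a missing first name.
raw["FIRSTNAME"] = parts[1].str.strip().str.split().str[0]

print(raw)
```

Rows that do not follow the assumed layout (such as the last sample row) simply get a missing first name, which would need to be handled before indexing.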

### Notes

This is going to be a very difficult dataset to get high accuracy on, because none of our features is highly correlated with the outcome. Part of our analysis is to see which features prove to be the most useful.

One might suspect that a feature like State would be very predictive, because presumably New Yorkers might contribute to Hillary Clinton and Texans might contribute to Donald Trump. However, it turns out that State is only weakly correlated with the outcome.

One nice thing about random forests is that since we "bag" features in different trees, we can empirically see which variables have the most predictive power. This is helpful for analytical reasons.
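
To make the idea of empirical feature importance concrete, here is a minimal sketch on synthetic data (not the contributions dataset). The feature names `signal` and `noise`, the sample size, and the coefficients are invented for illustration; only `RandomForestRegressor` and its `feature_importances_` attribute come from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

signal = rng.normal(size=n)  # this feature drives the target
noise = rng.normal(size=n)   # this feature is unrelated to the target
demo_features = np.column_stack([signal, noise])
demo_target = 3.0 * signal + rng.normal(scale=0.1, size=n)

demo_rf = RandomForestRegressor(n_estimators=20, random_state=0)
demo_rf.fit(demo_features, demo_target)

# Importances are normalized to sum to 1; higher means the trees
# relied on that feature more when reducing prediction error.
for name, score in zip(["signal", "noise"], demo_rf.feature_importances_):
    print(f"{name}: {score:.3f}")
```

In this toy setup the forest should assign nearly all of the importance to `signal`. On the real data the picture is expected to be much less clear-cut.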



In [None]:
%matplotlib inline
import time
import pandas as pd

## Step 1: Load the data

In [None]:
t1 = time.perf_counter()
dataset = pd.read_csv("/data/presidential_election_contribs/2016/2016-medium-clean.csv")
t2 = time.perf_counter()

print("read {:,} records in {:,.2f} ms".format(len(dataset), (t2-t1)*1000))

In [None]:
dataset

In [None]:
prediction_column = ['CONTB_RECEIPT_AMT']
numeric_columns = ['LAT', 'LNG']
feature_columns = ['LASTNAME', 'FIRSTNAME', 'CONTBR_ST', 'LAT', 'LNG', 'CONTBR_EMPLOYER', "CONTBR_OCCUPATION"]
categorical_columns = ['CAND_NM', 'LASTNAME', 'FIRSTNAME', 'CONTBR_ST', 'CONTBR_EMPLOYER', "CONTBR_OCCUPATION"]
categorical_index = ['CAND_NM_index', 'FIRSTNAME_index', 'LASTNAME_index', 'CONTBR_ST_index', 'CONTBR_EMPLOYER_index', 
                     "CONTBR_OCCUPATION_index"]

### Print out a contribution count broken down by candidate
**=> Q: Which candidates got the most donations (in terms of number of contributions)?**
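
As a hint for the pattern involved, here is a small sketch of counting rows per group and sorting the result. It uses a made-up toy frame with hypothetical column names (`candidate`, `amount`), not the real dataset columns.

```python
import pandas as pd

# Toy data with hypothetical column names, purely for illustration.
toy = pd.DataFrame({
    "candidate": ["A", "B", "A", "C", "A", "B"],
    "amount": [25, 50, 10, 100, 5, 20],
})

# size() counts rows per group; sort_values puts the largest count first.
counts = toy.groupby("candidate").size().sort_values(ascending=False)
print(counts)
```

Note that `size()` counts contributions (rows), so a donor who gave several times is counted several times.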

In [None]:
## TODO : print out per candidate breakdown
## Hint : What column represents Candidate name
dataset.groupby('???').size()


In [None]:
## TODO : sort the output by number of contributions
dataset.groupby('???').size()

## Step 2: Build Indexers and feature vector

Let's index all the categorical columns and build a labeled index.
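
As a quick illustration of what `pd.factorize` does, the sketch below encodes a tiny made-up series of state codes. The sample values are invented; the behavior shown (codes assigned in order of first appearance, missing values encoded as -1) is the pandas default.

```python
import pandas as pd

# Made-up sample values, including one missing entry.
states = pd.Series(["NY", "TX", "NY", None, "CA"])

# factorize returns an integer code per row plus the unique values it found.
codes, uniques = pd.factorize(states)

print(codes)    # codes: 0, 1, 0, -1, 2 (missing value becomes -1)
print(uniques)  # unique values in order of first appearance: NY, TX, CA
```

The integer codes are arbitrary labels, so their numeric order carries no meaning. Tree-based models can still split on them, but this is one reason high-cardinality columns such as employer may be hard for the model to use well.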

In [None]:
for col in categorical_columns:
    dataset[col + '_index'] = pd.factorize(dataset[col])[0]

In [None]:
features = ??? # numeric columns plus our *_index columns
label = ??? #  What are we trying to predict?

## Step 4: Split data into training and test

**=> TODO: Build training and test datasets (70%/30%)**


In [None]:
## TODO : split 70% training and 30% testing
## Hint : 0.7 ,  0.3

from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(features, label, test_size=0.3)
print("training set =", len(train_x))
print("testing set =", len(test_x))

## Step 5: Train the Model


In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=20)

In [None]:
print("Starting model training....this will take some time")
t1 = time.perf_counter()
## TODO : train the model with our training set
## Hint : training
rf.fit(train_x,train_y)
t2 = time.perf_counter()
print("trained on {:,} records in {:,.2f} ms".format(len(train_x), (t2 - t1) * 1000))

## Step 6: Evaluate the model

In [None]:
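from sklearn.metrics import mean_squared_error

# Predict on the held-out test set, then compare against the true amounts
predicted = rf.predict(test_x)
mean_squared_error(test_y, predicted)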

**=> TODO: Think about the test error here. Does it seem high? What does that say about our model?**

**=> How do we define model success?**
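
One possible way to put a raw MSE in context is to compare it against a naive baseline that always predicts the mean, and to report RMSE and R² alongside it. The sketch below uses made-up true and predicted amounts (not results from this notebook) purely to show the mechanics.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up amounts for illustration only; not output from the model above.
demo_true = np.array([25.0, 50.0, 100.0, 250.0, 2700.0])
demo_pred = np.array([40.0, 60.0, 90.0, 300.0, 1500.0])

mse = mean_squared_error(demo_true, demo_pred)
rmse = np.sqrt(mse)  # same units as the contribution amount

# Baseline: always predict the mean of the true values.
baseline_pred = np.full_like(demo_true, demo_true.mean())
baseline_mse = mean_squared_error(demo_true, baseline_pred)

# R^2 equals 1 - mse / baseline_mse when the baseline is the mean of demo_true.
r2 = r2_score(demo_true, demo_pred)

print(f"MSE={mse:,.0f}  RMSE={rmse:,.1f}  baseline MSE={baseline_mse:,.0f}  R^2={r2:.3f}")
```

Because MSE squares the errors, a handful of very large contributions can dominate it. A model is arguably only useful if it beats the mean baseline, that is, if its R² is clearly above zero on the test set.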

## Step 8: Print the feature importances

In [None]:
import pandas as pd
rf.feature_importances_

**=> TODO: Compare the relative weights of the feature importances.**



### Visualize feature importances

TODO: Do a visualization of the feature importances.
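
One way this might be done is to pair each importance with its feature name in a pandas Series and draw a horizontal bar chart. The helper `importance_series`, the synthetic data, and the feature names below are all hypothetical; the same helper could be tried on the fitted model and feature columns from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def importance_series(model, feature_names):
    """Return the model's feature importances as a Series, sorted ascending."""
    return pd.Series(model.feature_importances_, index=feature_names).sort_values()


# Synthetic stand-in data with invented feature names.
rng = np.random.default_rng(1)
plot_X = pd.DataFrame({
    "f_strong": rng.normal(size=300),
    "f_weak": rng.normal(size=300),
    "f_noise": rng.normal(size=300),
})
plot_y = 5.0 * plot_X["f_strong"] + 0.5 * plot_X["f_weak"] + rng.normal(scale=0.5, size=300)

plot_rf = RandomForestRegressor(n_estimators=20, random_state=1).fit(plot_X, plot_y)

importances = importance_series(plot_rf, plot_X.columns)

# Horizontal bars read well when feature names are long; largest bar ends up on top.
ax = importances.plot.barh(title="Feature importances (synthetic demo)")
ax.set_xlabel("importance")
```
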

**=> TODO: Compare the relative weights of the feature importances.**

Why do you think that the lat/long and other fields did not contribute?

**=> BONUS: Compute a Pearson correlation matrix of the variables against the outcome to see how strongly each one is correlated.**
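
As a starting point for the bonus, here is a minimal sketch of a Pearson correlation matrix using `DataFrame.corr`. The data and column names (`x_related`, `x_unrelated`, `target`) are synthetic and invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic data: one column tied to the target, one independent of it.
corr_demo = pd.DataFrame({
    "x_related": rng.normal(size=200),
    "x_unrelated": rng.normal(size=200),
})
corr_demo["target"] = 2.0 * corr_demo["x_related"] + rng.normal(scale=0.5, size=200)

# Pairwise Pearson correlations between all numeric columns.
corr_matrix = corr_demo.corr(method="pearson")

# The "target" column shows how each variable correlates with the outcome.
print(corr_matrix["target"].sort_values(ascending=False))
```

Pearson correlation only measures linear association between numeric values, so applying it to the `*_index` columns is of limited value: their integer codes are arbitrary labels rather than quantities.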

