# Introduction

My goal in this notebook is to practice SQL and interpretable machine learning procedures. To do this I will use the 2016 US primary dataset. I will focus on the factors that led to Trump's victory or defeat in each county. 

# Load and view data



### Libraries

In [None]:
# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# SQL
import sqlite3

### Database connection

In [None]:
# Data 
path = "../input/2016-us-election/" 
database = path + 'database.sqlite'
conn = sqlite3.connect(database)

### Understanding the database

The first step will be to view the tables we have in the database.

In [None]:
pd.read_sql("""

SELECT *
FROM sqlite_master
WHERE type='table';

""", conn)
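When only the table names and their columns matter, the same catalog can be queried more narrowly. The following is a minimal sketch on a throwaway in-memory database. The `demo` connection and its two tables are invented for illustration and are not the real schema of `database.sqlite`.

```python
import sqlite3

# Throwaway in-memory database; the schema is invented for illustration.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE results (fips INTEGER, candidate TEXT, votes INTEGER)")
demo.execute("CREATE TABLE facts (fips INTEGER, area_name TEXT)")

# Table names only, instead of every column of sqlite_master.
tables = [row[0] for row in demo.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]

# PRAGMA table_info returns one row per column; index 1 holds the column name.
columns = {
    t: [col[1] for col in demo.execute(f"PRAGMA table_info({t})")]
    for t in tables
}

print(tables)
print(columns)
```

Against the real database, the same two statements could be run through `conn` to check how many columns each table has before selecting from it.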

In the `primary_results` table we have the election results for each candidate in each county (number of votes and percentages). Note that the latter are the percentages in relation to the total votes obtained by each party:

In [None]:
pd.read_sql("""

SELECT * 
FROM primary_results
LIMIT 5;

""", conn)

In the `county_facts` table we have 54 columns of social data associated with each county:

In [None]:
pd.read_sql("""

SELECT * 
FROM county_facts
LIMIT 5;

""", conn)

The `county_facts_dictionary` table describes the codes used as column names in the `county_facts` table. I save it in `dictionary` for later use.

In [None]:
dictionary = pd.read_sql("""

SELECT *
FROM county_facts_dictionary;

""", conn)

### Candidates

We can see that a total of 11 candidates ran in the Republican primary.

In [None]:
pd.read_sql("""

SELECT DISTINCT candidate
FROM primary_results
WHERE party = 'Republican';

""", conn)

### Join winners with county facts

The following query first selects the Republican candidates (`WHERE party = 'Republican'`) from the election results table. It then ranks the candidates by vote share (`ORDER BY fraction_votes DESC`) within each county (`PARTITION BY county`) and stores the position in a new column called `rank`. Keeping only the rows where `rank` equals 1 leaves the winner of each county. Since there is now only one row per county, we can join this result to the `county_facts` table and obtain the dataset we are going to work with.

In [None]:
df_raw = pd.read_sql("""

SELECT p.fips,
       p.county,
       p.candidate,
       p.votes,
       p.fraction_votes,
       c.*
FROM (
    SELECT *
    FROM (
        SELECT fips,
               county,
               party,
               candidate, 
               votes, 
               fraction_votes,
               row_number() 
                   OVER (PARTITION BY county 
                         ORDER BY fraction_votes desc) as rank
        FROM primary_results
        WHERE party = 'Republican') 
    WHERE rank = 1) p
JOIN county_facts c
    ON p.fips = c.fips;

""", conn)

df_raw.head()
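The "rank within a group, keep rank 1" pattern is easier to follow on a few rows. The sketch below uses an invented table (`toy_results`) in an in-memory database and assumes SQLite 3.25 or later, the first version with window functions. Unlike the query above, it partitions by `fips` rather than by `county`. That is a deliberate substitution for the example: county names such as Washington occur in several states, so a partition on the name alone would treat them as one group.

```python
import sqlite3

# Toy data, invented for illustration: two counties that share a name
# but have different FIPS codes, with three candidates each.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE toy_results "
    "(fips INTEGER, county TEXT, candidate TEXT, fraction_votes REAL)"
)
demo.executemany(
    "INSERT INTO toy_results VALUES (?, ?, ?, ?)",
    [
        (1001, "Washington", "A", 0.50),
        (1001, "Washington", "B", 0.30),
        (1001, "Washington", "C", 0.20),
        (2001, "Washington", "A", 0.25),
        (2001, "Washington", "B", 0.45),
        (2001, "Washington", "C", 0.30),
    ],
)

# Number the candidates inside each FIPS code by descending vote share,
# then keep only the first row of every partition.
winners = demo.execute("""
    SELECT fips, candidate, fraction_votes
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY fips
                                  ORDER BY fraction_votes DESC) AS rnk
        FROM toy_results)
    WHERE rnk = 1
    ORDER BY fips;
""").fetchall()

print(winners)
```

With `PARTITION BY county` the same toy table would return a single winner, because both counties would fall into one partition.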

Only 5 out of 11 candidates won at least one county:

In [None]:
print(df_raw['candidate'].value_counts())

However, what I am interested in is understanding the vote for Trump, so I am not going to take into account the specificities of the vote for the other candidates. I am going to keep only one binary variable where 0 represents the vote for Trump and 1 the vote for any other candidate.

In [None]:
# Create df with social variables only (copy so that adding a column
# does not trigger a SettingWithCopyWarning)
df = df_raw.iloc[:, 8:].copy()

# Create binary target
df['target'] = df_raw['candidate']!='Donald Trump'
df['target'] = df['target'].astype(int)

# Distribution of target
counts = df['target'].value_counts()
print(counts)

proportion = round(counts[0] / (counts[0] + counts[1]), 2)
print('proportion: ', proportion, '/', round(1 - proportion, 2))


The classes are imbalanced, at roughly 75/25, so the modelling step has to account for this.

### X and y

Separate features and target:

In [None]:
X = df.drop('target', axis=1)
y = df.target

# Modelling

### Libraries

In [None]:
# Classifier
from sklearn import tree

# Pre-processing
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# Tune and cross validation
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

## Decision Tree

Although the decision tree is an extremely versatile technique, handling imbalanced samples [is not among its capabilities](https://stats.stackexchange.com/questions/450634/why-decision-tree-handle-unbalanced-data-well). Therefore, I will use the [SMOTE](https://arxiv.org/pdf/1106.1813.pdf) algorithm to balance the data. In addition, I will use `GridSearchCV` to tune the hyperparameters of the tree, evaluating each combination by cross-validation on the stratified folds created by `StratifiedKFold`.

In [None]:
# Define tools
clf = tree.DecisionTreeClassifier(random_state=0)
smote = SMOTE(random_state=1) 
cv = StratifiedKFold(n_splits=10)

# Pipeline
pipeline = Pipeline([ 
                    ('smote', smote),
                    ('model', clf)
                   ])

# Tune algorithm
param_grid = {
              'model__max_depth': [3, 4, 5],
              'model__min_samples_leaf': [0.04, 0.06, 0.08]
             }

grid = GridSearchCV(pipeline, 
                    param_grid=param_grid, 
                    cv=cv, 
                    scoring="accuracy", 
                    n_jobs=-1)

# Fit
grid.fit(X, y)
best_estimator = grid.best_estimator_

# Print details
print('best params: ', grid.best_params_)
print('best score: ', round(grid.best_score_, 3))
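To make the balancing step less of a black box, here is a simplified sketch of the idea behind SMOTE as described in the linked paper: each synthetic sample lies on the segment between a minority-class point and one of its nearest minority-class neighbours. The function `smote_like` and the toy points are invented for illustration. This is not the `imblearn` implementation, which handles many details omitted here.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (simplified sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Sort the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between x and the neighbour.
        u = rng.random()
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Invented 2-D minority class with four points.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.8)]
new_points = smote_like(minority, n_new=3)
print(new_points)
```

Because the sampler sits inside an `imblearn` pipeline, resampling is applied only when the pipeline is fitted. Within `GridSearchCV` this means the synthetic points are generated from the training folds, and the held-out fold is scored as it is.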

The accuracy achieved shows that the model cannot predict the data much better than it would by classifying everything in the majority class. Rarely is a single tree sufficient to make a good prediction. However, we can extract other information from it.
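The comparison with the majority class can be made explicit. The sketch below computes the accuracy of a classifier that always predicts the most frequent label. The helper `majority_baseline_accuracy` and the toy labels are invented for illustration and only mirror the 75/25 proportion reported earlier.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Invented labels with a 75/25 split.
toy_y = [0] * 75 + [1] * 25
baseline = majority_baseline_accuracy(toy_y)
print(baseline)
```

In the notebook, `majority_baseline_accuracy(list(y))` would give the figure to set against `grid.best_score_`.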


### Feature importance

To understand the importance of the features, we read the `feature_importances_` attribute of the fitted tree. However, the trained model is inside a pipeline, so we first have to access it through the pipeline's `named_steps` attribute. Once we have the scores, we can join them with their respective identification codes and their descriptions, which we saved in `dictionary`.

In [None]:
# Add feature importances to dictionary
score = pd.Series(best_estimator.named_steps['model'].feature_importances_)
cols = pd.Series(X.columns)
feature_importance = pd.concat([cols, 
                               dictionary.iloc[:,1], 
                               score], 
                               axis=1)

# Drop low importance features
threshold = 0.01
mask = score>threshold
important_features = feature_importance[mask]
important_features

In [None]:
# Add a new column for feature names
names = pd.Series(['Under 18', 'Native American', 'Pacific Islander', 'Latino', 'Multi-units density', 'Land Area'])
names.index = important_features.index
important_features = pd.concat([names, important_features], axis=1)

# Rename columns
important_features.columns = ['name', 'code', 'description', 'score']
important_features

We can see it more clearly with a graph:

In [None]:
ax = sns.barplot(x='name', y='score', data=important_features)
plt.xticks(rotation=70)

The percentage of the Latino population living in the county is the feature with the greatest discriminating power in the model. It is followed by the shares of Native Americans and Pacific Islanders. However, to understand the direction of these relationships we have to look inside the tree.
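For a scikit-learn decision tree with the default Gini criterion, these scores are the normalized total reduction in impurity contributed by each feature, with each split weighted by the share of samples that reach it. The sketch below computes that reduction for a single split. The labels are invented, and the two helper functions are illustrative rather than scikit-learn's own code.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Invented node: 8 samples split into a pure left child and a mixed right child.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0]
right = [0, 1, 1, 1, 1]
decrease = impurity_decrease(parent, left, right)
print(round(decrease, 3))
```

A feature used in several splits accumulates the decreases of all of them, which is why a high score says how much a feature separates the classes but not in which direction.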

### Tree plot



In [None]:
# Replace codes with new names
cols.update(important_features['name'])

# Plot
fig = plt.figure(figsize=(20, 10))
tree.plot_tree(best_estimator.named_steps['model'], 
               filled=True, 
               feature_names=cols.tolist(), 
               class_names=['Trump', 'Other'],
               rounded=True)

Indeed, the feature that divides the sample most clearly is the percentage of Latinos. A low presence of this group tips the balance towards Trump. This result should not surprise anyone at this point, given the enormous attention the candidate dedicated to this group. Where there were few Latinos in the county, few housing units in multi-unit structures, and few people under 18, Trump won in most cases.

Somewhat less clear is the case of counties with a high percentage of Latinos. If these counties also had a high percentage of Native Hawaiians and other Pacific Islanders, the balance could still tip back toward Trump. This can be explained by the high presence of the military on these islands (a group close to Trump) and by the period of instability that Hawaii was going through ([article](https://www.civilbeat.org/2016/11/who-voted-for-donald-trump-in-hawaii/)).

Finally, Trump had a particularly difficult time in those counties where, in addition to a high presence of Latinos and a low presence of Pacific Islanders, there was a high percentage of Native Americans. Trump has a long history of confrontation with this group, for whom his public statements since the 1990s have not gone unnoticed ([article](https://www.washingtonpost.com/national/donald-trumps-long-history-of-clashes-with-native-americans/2016/07/25/80ea91ca-3d77-11e6-80bc-d06711fd2125_story.html)).

That's all for this little exercise.

See you around!