# Model Predictions  
  
This notebook attempts to build a model that predicts whether Trump's campaign donations go up or down based on what he tweets. Originally this was going to be a regression problem in which the model predicted the exact amount of his daily donations. However, given the limitations of NLP and the randomness of donation patterns, the models instead focus on whether his next-day donations go up or down.  
  
In the modeling process I will start with a logistic regression model, then try Random Forest, Gaussian Naive Bayes, and XGBoost. 
    
The workflow is as follows:  
  
> - [Import Datasets](#importing_data)
> - [Naive Accuracy](#naive)
> - [Declare Independent Features and Target Variable](#features)
> - [Logistic Model](#logistic)
> - [Random Forest Model](#random)
> - [XGBoost Model](#xgboost)
> - [Conclusions](#conclusion)

In [54]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split,cross_val_score, GridSearchCV
from sklearn.naive_bayes import GaussianNB
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

<a id='importing_data'></a>
## Import datasets

In [2]:
df= pd.read_csv('../datasets/frequency_dataset.csv')

In [3]:
df.dropna(inplace=True)

In [4]:
#Create a column that indicates whether the next day's donations are higher (1) or not (0) than the current day's donations.
df['up_down']=(df['contribution_receipt_amount'] < df['next_day_contributions']).astype(int)
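
To make the labeling rule concrete, here is a minimal sketch on a toy frame. The dollar values are invented for illustration, and building `next_day_contributions` with `shift(-1)` is an assumption about how that column was produced upstream; only the comparison itself mirrors the cell above.

```python
import pandas as pd

# Toy values, invented purely for illustration (not from the real dataset).
toy = pd.DataFrame({"contribution_receipt_amount": [100.0, 250.0, 250.0, 80.0]})

# ASSUMPTION: next day's total is the current column shifted up by one row.
toy["next_day_contributions"] = toy["contribution_receipt_amount"].shift(-1)

# Drop the last row, which has no next day to compare against.
toy = toy.dropna()

# 1 when the next day's total is strictly higher, 0 otherwise (ties count as 0).
toy["up_down"] = (
    toy["contribution_receipt_amount"] < toy["next_day_contributions"]
).astype(int)

print(toy["up_down"].tolist())  # [1, 0, 0]
```

Note that the strict `<` means a day followed by an identical total is labeled 0, so "down" here really means "not higher".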

In [5]:
#Reset index from Datetime to integer.
df.reset_index(inplace=True)

In [6]:
#Read in LSA dataset
df1= pd.read_csv('../datasets/lsa_dataset.csv')

In [7]:
#Merge the two datasets into one.
df1= df1.merge(df[['up_down','contribution_receipt_amount']],right_index=True,left_index=True)
df1.dropna(inplace=True)

<a id='naive'></a>
### Naive Prediction Accuracy

In [8]:
df['up_down'].value_counts(normalize=True)

1    0.519553
0    0.480447
Name: up_down, dtype: float64

> The naive accuracy, meaning the accuracy of a model that always predicts the next day's donations will be higher, is about 52%. The classes are fairly balanced, so the predictions should not be skewed toward either class.
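
The same baseline can be computed with scikit-learn's `DummyClassifier`, which makes it easy to score on exactly the rows used for the real models. The sketch below uses hypothetical labels (`y_demo`) with roughly the 52/48 balance reported above; it is not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with roughly the same 52/48 balance as up_down.
y_demo = np.array([1] * 52 + [0] * 48)
# DummyClassifier ignores the features, so a placeholder column is enough.
X_demo = np.zeros((len(y_demo), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)
print(baseline_accuracy)
```

In this notebook the equivalent would be to fit the dummy model on `X_train, y_train` and score it on `X_test, y_test`, so the baseline and the models are compared on the same rows.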

<a id='features'></a>
## Set up features and target variables

In [9]:
X = df1.drop(columns=['up_down','next_day_contributions'])
y = df1['up_down']

In [10]:
#Split data into training and validation sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

<a id='logistic'></a>
## Logistic Regression model  
  
The first model is logistic regression. Because LSA creates a large number of features, the model is grid searched over several values of the regularization parameter C. 

In [11]:
param_grid = {'C': [0.001, 0.01, 0.1, 1], 
             'penalty':['l2'],
             }
lr_gs = GridSearchCV(LogisticRegression(), param_grid)

In [12]:
lr_gs.fit(X_train,y_train)

GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'C': [0.001, 0.01, 0.1, 1], 'penalty': ['l2']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [13]:
lr_gs.best_params_

{'C': 0.001, 'penalty': 'l2'}

In [14]:
lr_gs.best_score_

0.6240601503759399

In [15]:
lr_gs.score(X_train,y_train)

0.5338345864661654

In [16]:
cross_val_score(lr_gs,X_train,y_train).mean()

0.5794612794612796

In [17]:
lr_gs.score(X_test,y_test)

0.4666666666666667

> The logistic regression model performed worse on the test data than the naive accuracy: even with grid search it scored 46.7%, below the 52% baseline. The features may not correlate strongly enough with campaign contributions.
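
One variation that may be worth trying: L2 regularization is sensitive to feature scale, and `X` appears to mix a raw dollar column (`contribution_receipt_amount`) with LSA components. The sketch below standardizes features inside a `Pipeline` before searching over the same C values. It runs on synthetic stand-in data, and the names `pipe`, `scaled_gs`, and the `_demo` arrays are hypothetical; whether scaling helps on the real data is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real X and y come from the notebook cells above.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Scaling lives inside the pipeline so each CV fold is scaled using only its own training part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Same C values as the grid above, addressed through the pipeline step name.
param_grid = {"logreg__C": [0.001, 0.01, 0.1, 1]}
scaled_gs = GridSearchCV(pipe, param_grid, cv=5)
scaled_gs.fit(X_tr, y_tr)

print(scaled_gs.best_params_)
print(scaled_gs.score(X_te, y_te))
```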

<a id='random'></a>
## Random Forest Model
  
A Random Forest model is fit to the same training data, with a grid search over the number of trees and the maximum tree depth. 

In [18]:
rf = RandomForestClassifier()
rf_params = {
    'n_estimators': [25,50,100, 150],
    'max_depth': [1, 2, 3, 4, 5],
}
gs = GridSearchCV(rf, param_grid=rf_params, cv=5)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.6240601503759399


{'max_depth': 4, 'n_estimators': 150}

In [19]:
gs.score(X_test,y_test)

0.6222222222222222

> Random Forest performed better than Logistic Regression. The accuracy on the validation set rose to 62.2%, compared with the naive accuracy of 51.9%. The Forest model therefore appears to be picking up some signal from the LSA components.
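
To see which features a forest leans on, one option is to rank its `feature_importances_`. The sketch below uses synthetic data with hypothetical column names (`component_0`, ...) and the hyperparameters the grid search selected; it illustrates the inspection step only and says nothing about the real LSA components.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=42
)
cols = [f"component_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=cols)

# Hyperparameters match the best_params_ reported above.
forest = RandomForestClassifier(n_estimators=150, max_depth=4, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances.head())
```

In this notebook the fitted forest would be `gs.best_estimator_`, indexed by `X_train.columns`.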

## Naive Bayes Model

In [45]:
#Instantiate and fit Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train,y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [46]:
gnb.score(X_train,y_train)

0.5789473684210527

In [47]:
gnb.score(X_test,y_test)

0.4666666666666667

> Gaussian Naive Bayes did not beat the naive accuracy: it scored 46.7% on the test set. One potential reason is that the dataset I had does not provide enough prior information. 

<a id='xgboost'></a>
## XGBoost Model

In [66]:
#Instantiate XGBoost model.
model = XGBClassifier()
model.fit(X_train, y_train)

#Calculate predictions for test set. predict() already returns class labels, so no rounding is needed.
predictions = model.predict(X_test)

#Calculate accuracy of model.
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {np.round(accuracy * 100.0,2)}%')

Accuracy: 64.44%


> XGBoost achieved the best accuracy of all the models at 64.44%. While this improves on every other model, I would not call it a great score. 
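
A rough way to judge how much weight a single test score can bear is a normal-approximation confidence interval for the accuracy. The sketch below is only illustrative: the test-set size of 45 is an assumption inferred from the reported fractions (for example 29/45 = 0.6444), and `accuracy_interval` is a hypothetical helper, not part of any library.

```python
import math

def accuracy_interval(accuracy, n_samples, z=1.96):
    """Normal-approximation (Wald) interval for an accuracy measured on n_samples."""
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    return accuracy - z * std_err, accuracy + z * std_err

# ASSUMPTION: 45 test rows, inferred from the reported fractions;
# replace with len(y_test) in the notebook.
n_test = 45
low, high = accuracy_interval(29 / 45, n_test)
print(f"{low:.3f} to {high:.3f}")
```

Under that assumption the interval runs from roughly 0.50 to 0.78, which still contains the 52% baseline, so the gap between XGBoost and the naive prediction should be read with caution.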

<a id='conclusion'></a>
## Conclusions  
  
There is some signal in Trump's tweets for predicting the direction of his campaign donations. However, none of the models predicted whether his donations would go up or down with an accuracy above 64.4%. Intuitively this makes sense: Trump's tweets tend to be reactions to events happening in the world, so their predictive ability might be residual signal from other factors. As mentioned in the original cleaning of the data, it might make sense to look at other information, such as who specifically donated or how frequently. My conclusion is that there isn't any significant reason to believe the words in Trump's tweets influence his campaign donations. 