<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Establish-programming-components" data-toc-modified-id="Establish-programming-components-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Establish programming components</a></span><ul class="toc-item"><li><span><a href="#Import-libraries" data-toc-modified-id="Import-libraries-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Import libraries</a></span></li><li><span><a href="#Establish-parameters" data-toc-modified-id="Establish-parameters-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Establish parameters</a></span></li></ul></li><li><span><a href="#Read-in-the-data" data-toc-modified-id="Read-in-the-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Read in the data</a></span><ul class="toc-item"><li><span><a href="#Read-in-2016-poll-data" data-toc-modified-id="Read-in-2016-poll-data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Read in 2016 poll data</a></span></li><li><span><a href="#Realign-and-add-columns-to-the-2016-poll-DataFrame" data-toc-modified-id="Realign-and-add-columns-to-the-2016-poll-DataFrame-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Realign and add columns to the 2016 poll DataFrame</a></span></li><li><span><a href="#Read-in-2016-press-data" data-toc-modified-id="Read-in-2016-press-data-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Read in 2016 press data</a></span></li><li><span><a href="#Clean-and-add-columns-to-the-2016-press-DataFrame" data-toc-modified-id="Clean-and-add-columns-to-the-2016-press-DataFrame-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Clean and add columns to the 2016 press DataFrame</a></span></li><li><span><a href="#Merge-the-DataFrames" data-toc-modified-id="Merge-the-DataFrames-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Merge the DataFrames</a></span></li></ul></li></ul></div>

# Reading data from GDELT GKG

This notebook combines 2016 polling data with press metrics from the GDELT Global Knowledge Graph (GKG) to create a modeling dataset, then inspects the resulting DataFrame.


## References

- http://www.datasciencemadesimple.com/reshape-wide-long-pandas-python-melt-function/



# Establish programming components

## Import libraries

In [1]:
# Import libraries
import os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta



## Establish parameters

In [4]:
# Location and file names of the input data
data_path = '../data'
poll_file_2016 = 'pure_poll_data_2016.csv'
press_file_2016 = 'press_metrics_2016'

# The candidates to be searched
subject_mapping  = {'Trump':'Donald Trump', 'Kasich': 'John Kasich', 'Cruz':'Ted Cruz', 
                    'Rubio':'Marco Rubio', 'Carson':'Ben Carson', 'Bush':'Jeb Bush', 
                    'Christie':'Chris Christie', 'Fiorina':'Carly Fiorina', 'Santorum':'Rick Santorum', 
                    'Paul':'Rand Paul', 'Huckabee':'Mike Huckabee'}


## Establish functions

# Read in the data

## Read in 2016 poll data

In [42]:
# Get the pure poll data for 2016
df_polls_2016 = pd.read_csv(os.path.join(data_path, poll_file_2016))
df_polls_2016.head(5)


Unnamed: 0,end_time,Bush,Carson,Christie,Cruz,Fiorina,Huckabee,Kasich,Paul,Rubio,Santorum,Trump
0,2015-02-15,0.12,0.09,0.07,0.03,0.01,0.17,0.02,0.11,0.06,0.02,
1,2015-02-23,0.16,0.06,0.07,0.05,0.0,0.1,0.01,0.1,0.08,0.02,
2,2015-03-02,0.16,0.07,0.08,0.06,,0.08,0.01,0.06,0.05,0.02,
3,2015-03-15,0.16,0.09,0.07,0.04,0.0,0.1,0.02,0.12,0.07,0.01,
4,2015-03-19,0.16,0.095,0.07,0.09,0.02,0.065,0.02,0.095,0.07,0.035,0.01


## Realign and add columns to the 2016 poll DataFrame

In [43]:
# Create a long-form DataFrame
df_polls_2016 = pd.melt(frame = df_polls_2016, id_vars=['end_time'], var_name='subject', 
                        value_name='poll_result')
df_polls_2016['subject'] = df_polls_2016['subject'].map(subject_mapping)

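# Add a three-class target for the next poll: 1.0 if the next result is higher,
# 0.0 if unchanged, -1.0 if lower. Note that diff() runs over the whole
# long-form frame, so the last row of each candidate is compared with the
# first row of the next candidate.
df_polls_2016['next_poll'] = -df_polls_2016['poll_result'].diff(-1).round(4)
df_polls_2016['next_poll'] = df_polls_2016['next_poll'].map(np.sign)

# Add a combined subject/end_time key (e.g. jeb_bush_2015-02-15) for merging
df_polls_2016['subject_end_time'] = [str(s).lower().replace(' ', '_') + '_' + str(e)
                                     for s, e in zip(df_polls_2016['subject'], df_polls_2016['end_time'])]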

df_polls_2016.head(10)


Unnamed: 0,end_time,subject,poll_result,next_poll,subject_end_time
0,2015-02-15,Jeb Bush,0.12,1.0,jeb_bush_2015-02-15
1,2015-02-23,Jeb Bush,0.16,0.0,jeb_bush_2015-02-23
2,2015-03-02,Jeb Bush,0.16,0.0,jeb_bush_2015-03-02
3,2015-03-15,Jeb Bush,0.16,0.0,jeb_bush_2015-03-15
4,2015-03-19,Jeb Bush,0.16,1.0,jeb_bush_2015-03-19
5,2015-03-20,Jeb Bush,0.17,0.0,jeb_bush_2015-03-20
6,2015-03-21,Jeb Bush,0.17,1.0,jeb_bush_2015-03-21
7,2015-03-22,Jeb Bush,0.18,-1.0,jeb_bush_2015-03-22
8,2015-03-23,Jeb Bush,0.155,1.0,jeb_bush_2015-03-23
9,2015-03-24,Jeb Bush,0.17,-1.0,jeb_bush_2015-03-24
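
The reshaping and `next_poll` steps can be sketched on a toy frame. Because `diff(-1)` in the cell above runs over the whole long-form frame, the last poll of one candidate is compared with the first poll of the next. The variant below is a minimal sketch, not the notebook's method, that groups by `subject` so the last poll of each candidate gets `NaN` instead. The names `toy_wide` and `toy_long` are hypothetical. The values are the first three poll rows for Bush and Carson shown earlier.

```python
import numpy as np
import pandas as pd

# Toy wide-form poll data: one row per poll date, one column per candidate
toy_wide = pd.DataFrame({
    'end_time': ['2015-02-15', '2015-02-23', '2015-03-02'],
    'Bush': [0.12, 0.16, 0.16],
    'Carson': [0.09, 0.06, 0.07],
})

# Reshape to long form: one row per (poll date, candidate)
toy_long = pd.melt(toy_wide, id_vars=['end_time'], var_name='subject',
                   value_name='poll_result')

# Change from this poll to the next one, computed within each candidate
next_change = -toy_long.groupby('subject')['poll_result'].diff(-1).round(4)

# 1.0 = rises in the next poll, 0.0 = unchanged, -1.0 = falls, NaN = no next poll
toy_long['next_poll'] = np.sign(next_change)
print(toy_long)
```

With this grouping, the count of non-null `next_poll` values would differ from the one reported by `info()` at the end of the notebook, since each candidate's final row would have no label.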


## Read in 2016 press data

In [45]:
# Get the press metrics data for 2016
df_press_2016 = pd.read_csv(os.path.join(data_path, press_file_2016))
df_press_2016.head(5)


Unnamed: 0.1,Unnamed: 0,subject,art_count,word_count,tone_avg,pos_score_avg,neg_score_avg,polarity_avg,act_ref_den_avg,self_ref_den_avg,themes
0,2015-02-15,Donald Trump,0,0.0,,,,,,,[]
1,2015-02-23,Donald Trump,1,877.0,1.735358,4.121475,2.386117,6.507592,24.94577,3.253796,"['TAX_FNCACT', 'TAX_FNCACT', 'MEDIA_SOCIAL', '..."
2,2015-03-02,Donald Trump,1,390.0,-3.424658,2.511416,5.936073,8.447489,20.776256,0.456621,"['TAX_FNCACT_CANDIDATES', 'TAX_FNCACT_CANDIDAT..."
3,2015-03-15,Donald Trump,2,1150.0,-0.776682,1.921842,2.698525,4.620367,22.07787,1.21281,"['MANMADE_DISASTER_IMPLIED', 'WB_696_PUBLIC_SE..."
4,2015-03-19,Donald Trump,0,0.0,,,,,,,[]


## Clean and add columns to the 2016 press DataFrame

In [46]:
# Rename the unnamed date column to end_time
df_press_2016.rename(columns={'Unnamed: 0':'end_time'}, inplace=True)
# Add a combined subject/end_time key (e.g. donald_trump_2015-02-15) for merging
df_press_2016['subject_end_time'] = [str(s).lower().replace(' ', '_') + '_' + str(e)
                                     for s, e in zip(df_press_2016['subject'], df_press_2016['end_time'])]


df_press_2016.head(10)


Unnamed: 0,end_time,subject,art_count,word_count,tone_avg,pos_score_avg,neg_score_avg,polarity_avg,act_ref_den_avg,self_ref_den_avg,themes,subject_end_time
0,2015-02-15,Donald Trump,0,0.0,,,,,,,[],donald_trump_2015-02-15
1,2015-02-23,Donald Trump,1,877.0,1.735358,4.121475,2.386117,6.507592,24.94577,3.253796,"['TAX_FNCACT', 'TAX_FNCACT', 'MEDIA_SOCIAL', '...",donald_trump_2015-02-23
2,2015-03-02,Donald Trump,1,390.0,-3.424658,2.511416,5.936073,8.447489,20.776256,0.456621,"['TAX_FNCACT_CANDIDATES', 'TAX_FNCACT_CANDIDAT...",donald_trump_2015-03-02
3,2015-03-15,Donald Trump,2,1150.0,-0.776682,1.921842,2.698525,4.620367,22.07787,1.21281,"['MANMADE_DISASTER_IMPLIED', 'WB_696_PUBLIC_SE...",donald_trump_2015-03-15
4,2015-03-19,Donald Trump,0,0.0,,,,,,,[],donald_trump_2015-03-19
5,2015-03-20,Donald Trump,0,0.0,,,,,,,[],donald_trump_2015-03-20
6,2015-03-21,Donald Trump,0,0.0,,,,,,,[],donald_trump_2015-03-21
7,2015-03-22,Donald Trump,0,0.0,,,,,,,[],donald_trump_2015-03-22
8,2015-03-23,Donald Trump,0,0.0,,,,,,,[],donald_trump_2015-03-23
9,2015-03-24,Donald Trump,0,0.0,,,,,,,[],donald_trump_2015-03-24
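
The same key can be built with pandas' vectorized string methods instead of a list comprehension. This is a minimal sketch on a toy frame (`toy_press` is a hypothetical name), assuming `subject` and `end_time` are stored as strings, as they are in the cells above.

```python
import pandas as pd

# Toy frame with the two columns that form the key
toy_press = pd.DataFrame({
    'subject': ['Donald Trump', 'Jeb Bush'],
    'end_time': ['2015-02-15', '2015-02-23'],
})

# Lower-case the subject, swap spaces for underscores, then append the date
toy_press['subject_end_time'] = (
    toy_press['subject'].astype(str).str.lower().str.replace(' ', '_', regex=False)
    + '_' + toy_press['end_time'].astype(str)
)
print(toy_press['subject_end_time'].tolist())
```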


## Merge the DataFrames

In [47]:
# Inner-join the poll and press DataFrames on the combined subject/end_time key
df_2016 = pd.merge(left=df_polls_2016, right=df_press_2016,
                   how='inner', on='subject_end_time')
df_2016.head(20)


Unnamed: 0,end_time_x,subject_x,poll_result,next_poll,subject_end_time,end_time_y,subject_y,art_count,word_count,tone_avg,pos_score_avg,neg_score_avg,polarity_avg,act_ref_den_avg,self_ref_den_avg,themes
0,2015-02-15,Jeb Bush,0.12,1.0,jeb_bush_2015-02-15,2015-02-15,Jeb Bush,0,0.0,,,,,,,[]
1,2015-02-23,Jeb Bush,0.16,0.0,jeb_bush_2015-02-23,2015-02-23,Jeb Bush,2937,2337239.0,-0.650403,2.700619,3.351021,6.05164,21.937579,0.743177,"['TAX_FNCACT', 'TAX_RELIGION', 'TAX_FNCACT', '..."
2,2015-03-02,Jeb Bush,0.16,0.0,jeb_bush_2015-03-02,2015-03-02,Jeb Bush,4960,3513601.0,-0.87893,2.722577,3.601507,6.324084,22.107775,0.800885,"['TAX_FNCACT_GUIDE', 'TAX_FNCACT_GUIDE', 'TAX_..."
3,2015-03-15,Jeb Bush,0.16,0.0,jeb_bush_2015-03-15,2015-03-15,Jeb Bush,8540,6928782.0,-0.527594,2.284455,2.812049,5.096503,21.814635,0.658352,"['TAX_WORLDMAMMALS_HORSE', 'TAX_FNCACT_IMMIGRA..."
4,2015-03-19,Jeb Bush,0.16,1.0,jeb_bush_2015-03-19,2015-03-19,Jeb Bush,2668,2003146.0,-0.541959,2.572629,3.114589,5.687218,22.460508,0.660205,"['TAX_FNCACT_CHIEF', 'TAX_FNCACT_CHIEF', 'TAX_..."
5,2015-03-20,Jeb Bush,0.17,0.0,jeb_bush_2015-03-20,2015-03-20,Jeb Bush,490,351729.0,-0.350046,2.776159,3.126204,5.902363,21.454094,0.610172,"['TAX_FNCACT_GUIDE', 'TAX_FNCACT_PILOT', 'ECON..."
6,2015-03-21,Jeb Bush,0.17,1.0,jeb_bush_2015-03-21,2015-03-21,Jeb Bush,314,262379.0,1.230456,4.315145,3.084689,7.399833,21.074816,0.794701,"['TAX_FNCACT_CANDIDATES', 'TAX_FNCACT_CANDIDAT..."
7,2015-03-22,Jeb Bush,0.18,-1.0,jeb_bush_2015-03-22,2015-03-22,Jeb Bush,937,518392.0,0.121088,2.789871,2.668782,5.458653,22.102877,0.458582,"['TAX_FNCACT_CANDIDATES', 'TAX_FNCACT_CANDIDAT..."
8,2015-03-23,Jeb Bush,0.155,1.0,jeb_bush_2015-03-23,2015-03-23,Jeb Bush,2215,1627843.0,0.446504,3.298023,2.851519,6.149541,21.184024,0.527605,"['WB_2670_JOBS', 'WB_2769_JOBS_STRATEGIES', 'W..."
9,2015-03-24,Jeb Bush,0.17,-1.0,jeb_bush_2015-03-24,2015-03-24,Jeb Bush,872,645943.0,-0.394681,2.768083,3.162764,5.930847,21.777336,0.566875,"['TAX_FNCACT_CANDIDATES', 'TAX_POLITICAL_PARTY..."
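
Merging on the composite string key leaves duplicate `end_time_x`/`end_time_y` and `subject_x`/`subject_y` columns in the result. As a minimal sketch of an alternative, not the approach used in this notebook, the two frames could be joined directly on `subject` and `end_time`. The toy frames below are hypothetical, and `validate='one_to_one'` is an added safeguard that raises an error if either side has duplicate keys.

```python
import pandas as pd

# Toy poll and press frames sharing the subject and end_time columns
toy_polls = pd.DataFrame({
    'end_time': ['2015-02-15', '2015-02-23'],
    'subject': ['Jeb Bush', 'Jeb Bush'],
    'poll_result': [0.12, 0.16],
})
toy_press = pd.DataFrame({
    'end_time': ['2015-02-15', '2015-02-23', '2015-02-23'],
    'subject': ['Jeb Bush', 'Jeb Bush', 'Donald Trump'],
    'art_count': [0, 2937, 1],
})

# Join on both columns at once; the key columns appear only once in the result
toy_merged = pd.merge(toy_polls, toy_press, how='inner',
                      on=['subject', 'end_time'], validate='one_to_one')
print(toy_merged.columns.tolist())
```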


In [49]:
df_2016.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3786 entries, 0 to 3785
Data columns (total 16 columns):
end_time_x          3786 non-null object
subject_x           3786 non-null object
poll_result         3692 non-null float64
next_poll           3667 non-null float64
subject_end_time    3786 non-null object
end_time_y          3786 non-null object
subject_y           3786 non-null object
art_count           3786 non-null int64
word_count          3786 non-null float64
tone_avg            3277 non-null float64
pos_score_avg       3277 non-null float64
neg_score_avg       3277 non-null float64
polarity_avg        3277 non-null float64
act_ref_den_avg     3277 non-null float64
self_ref_den_avg    3277 non-null float64
themes              3786 non-null object
dtypes: float64(9), int64(1), object(6)
memory usage: 502.8+ KB
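
The `info()` output shows 3277 non-null values for the tone and score columns against 3786 rows, and the sample rows above with `art_count` equal to 0 have no tone values. Whether every missing value comes from a zero-article row is not confirmed by the output shown here. The following is a minimal sketch of how that could be checked, using a hypothetical toy frame `toy_2016` in place of `df_2016`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged frame (values are illustrative only)
toy_2016 = pd.DataFrame({
    'art_count': [0, 2937, 0, 490],
    'tone_avg': [np.nan, -0.650403, np.nan, -0.350046],
    'poll_result': [0.12, 0.16, np.nan, 0.17],
})

# Count missing values per column
missing_counts = toy_2016.isna().sum()
print(missing_counts)

# Check that tone is missing exactly on the rows with no articles
no_articles = toy_2016['art_count'] == 0
tone_missing_matches = bool((toy_2016['tone_avg'].isna() == no_articles).all())
print(tone_missing_matches)
```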
