# White House Posts' Effect on Stock Market

### Author: Richard Doan
### Partner: Jon Khaykin


## Introduction

Given recent events, there seem to be correlations between text data sources and stock market motions. For example, there are a few [bots that use Twitter data to make stock transactions](http://www.npr.org/2017/04/14/523890750/twitter-bot-botus-will-buy-and-sell-stock-based-on-trumps-tweets).
This inspired us to consider stock market motions given announcements made by the White House itself, president and staff included. 

To run through the examples interactively, please run the code in the **Appendix** at the bottom of this notebook after reading this.

## Data Schema and Collection Method

Because we are looking into how White House announcements affect the stock market, our dataset is composed of raw White House blog posts and a few index fund prices. Major index fund prices are used as indicators of how strong a country's economy is, which is what inspired their use in our analysis. The index funds we focus on are the Dow Jones, S&P 500, and NASDAQ.

For each blog post, we get the closing prices of our three index funds on the day the post was published, and then the closing prices of the same funds on the next day. We subtract the posting day's price from the next day's price to get the price change, and then average the three index fund price changes. This gave us numeric data to work with. We decided to average the price changes instead of taking the max or the median.

We noticed that absolute price changes may not give us as much information as relative price changes, so we repeated the above, but after the subtraction we also divided each change in price by the price on the blog post's date, and then averaged. Labels were then generated from the average proportion change of the three index fund prices. If the proportion change was non-negative, we labeled the corresponding blog post as 1; otherwise, we labeled it as -1.
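
As a minimal sketch of the labeling steps above, the snippet below works through one posting date in plain Python. The closing prices are made-up numbers rather than values from our dataset, and `label_post_date` is a hypothetical helper that is not part of the notebook. It assumes the change is measured as the next day's close minus the posting day's close, and it treats a change of exactly zero as positive, matching the `>= 0` test in the Appendix.

```python
def label_post_date(today, next_day):
    """Return (mean delta, mean proportion, label) for one posting date.

    `today` and `next_day` map an index name to its closing price.
    """
    # Absolute change per index: next day's close minus posting day's close.
    deltas = {name: next_day[name] - today[name] for name in today}
    # Relative change per index: absolute change over the posting day's close.
    proportions = {name: deltas[name] / today[name] for name in today}

    mean_delta = sum(deltas.values()) / len(deltas)
    mean_proportion = sum(proportions.values()) / len(proportions)

    # Label 1 for a non-negative average relative change, -1 otherwise.
    label = 1 if mean_proportion >= 0 else -1
    return mean_delta, mean_proportion, label


# Illustrative closing prices only (hypothetical, not from the dataset).
today = {'Dow Jones': 20000.0, 'Nasdaq': 5000.0, 'S&P 500': 2000.0}
next_day = {'Dow Jones': 20100.0, 'Nasdaq': 4990.0, 'S&P 500': 2004.0}

mean_delta, mean_proportion, label = label_post_date(today, next_day)
print(mean_delta, mean_proportion, label)
```

In this toy case the Dow Jones move dominates the mean delta because its price level is the largest, while the mean proportion weighs the three indexes equally, which is the reason we label on proportions.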

There were a few nuances that we had to overcome, which have added biases to our data collection methods. We weren't sure which blog posts would correspond to market motions, so we took every post on the White House blog page. This meant that there are cases where many posts fall on one day, so many posts map to one closing price. To overcome this problem, we concatenated the blog posts that occur on the same day in this analysis. We also removed NaNs from our dataframe. These NaN values come from the fact that some posts are published on days when the stock market isn't open for trading.

Below is the resulting dataframe from our collection method.


In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/rykard95/wh_stocks/master/data/dataset.csv', dtype=DATATYPES).dropna()
dataset = join_blog_posts_on_date(df)
dataset['Title'] = dataset['Title'].apply(lambda title: title +' ')
dataset['Body'] = dataset['Body'].apply(lambda body: body + ' ')
dataset[:3]

| | Date | Body | Title | Dow Jones Delta | Dow Jones Proportion | Nasdaq Delta | Nasdaq Proportion | S&P 500 Delta | S&P 500 Proportion | Mean Delta | Mean Proportion | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-03-13 | On Monday President Donald J Trump welcomed to... | Readout of President Donald J Trumps Meeting w... | -44.111328 | -0.002112 | -18.959961 | -0.003227 | -8.02002 | -0.003379 | -23.697104 | -0.002906 | -1 |
| 1 | 2017-03-06 | President Donald J Trump spoke separately toda... | Readout of the Presidents Calls with Prime Min... | -29.580078 | -0.001412 | -15.25 | -0.002607 | -6.920166 | -0.002913 | -17.250082 | -0.002311 | -1 |
| 2 | 2017-03-27 | James S Brady Press Briefing Room PM EDTSENIOR... | Background Briefing on the Presidents Energy I... | 150.519531 | 0.007324 | 34.77002 | 0.005953 | 16.97998 | 0.007251 | 67.42318 | 0.006843 | 1 |


## Word Analysis

Once we had our data schema, I focused on the word level of our project. The hope was to gain insight into which featurization method would be best for building a predictive model. Text analysis is trickier than numeric problems in machine learning. There are many ways to featurize unstructured text data, and the choice of featurization often introduces latent biases into machine learning models.

Below I grouped the blog posts according to their label, concatenated the bodies of the blog posts together, and then tokenized the larger text (removing stop words). I used the difference of proportions method to see if there were any major differences in the usage of certain words between the two labels. The largest difference across the labels was for 'spicer', whose proportion was only 0.002 higher in the negative label.

I then repeated the same process, but with the titles of the blog posts instead. The results of the difference of proportions method seem more promising for the titles than for the bodies. Here the largest difference is 0.014, in favor of the negative label, and the word associated with it is 'meeting'. From these results, it seems as though the titles might give us more information about stock market motions. A reason for this might be that titles need to be concise, so each word must carry more distinctive information about the theme, in which case more distinctive themes could correlate in some way with stock market motions.
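
To make the difference of proportions concrete, here is a small sketch on two made-up strings. It is not the notebook's `generate_word_df`: `difference_of_proportions` is a hypothetical helper, stop-word removal is skipped for brevity, and, like the inner merge in the Appendix, only words that occur under both labels are compared.

```python
from collections import Counter


def difference_of_proportions(positive_text, negative_text):
    """Rank shared words by |positive proportion - negative proportion|."""
    positive = Counter(positive_text.lower().split())
    negative = Counter(negative_text.lower().split())

    # Keep only words seen under both labels (mirrors an inner merge).
    shared = set(positive) & set(negative)
    positive_total = sum(positive[word] for word in shared)
    negative_total = sum(negative[word] for word in shared)

    diffs = {
        word: positive[word] / positive_total - negative[word] / negative_total
        for word in shared
    }
    # Largest absolute difference first.
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy texts, invented for illustration only.
positive_text = "call call call president meeting order"
negative_text = "meeting meeting president president call march"

ranking = difference_of_proportions(positive_text, negative_text)
for word, diff in ranking:
    print(word, round(diff, 3))
```

A positive difference means the word makes up a larger share of the positive-label text, and a negative one means it is relatively more common under the negative label. Words such as 'order' and 'march' drop out here because they appear under only one label.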

In [6]:
body_df = generate_word_df(dataset, text='Body')
body_df[:10]

The total number of positive words is 141560
The total number of negative words is 140548


| | Word | Positive | Negative | Positive Proportion | Negative Proportion | diff | diff proportion | abs diff proportion |
|---|---|---|---|---|---|---|---|---|
| 3 | spicer | 1044 | 1252 | 0.007 | 0.009 | -208 | -0.002 | 0.002 |
| 0 | president | 2628 | 2501 | 0.019 | 0.018 | 127 | 0.001 | 0.001 |
| 73 | applause | 259 | 180 | 0.002 | 0.001 | 79 | 0.001 | 0.001 |
| 128 | healthcare | 178 | 295 | 0.001 | 0.002 | -117 | -0.001 | 0.001 |
| 110 | eo | 193 | 34 | 0.001 | 0.0 | 159 | 0.001 | 0.001 |
| 103 | actions | 201 | 78 | 0.001 | 0.001 | 123 | 0.001 | 0.001 |
| 101 | regulations | 202 | 88 | 0.001 | 0.001 | 114 | 0.001 | 0.001 |
| 99 | plan | 205 | 287 | 0.001 | 0.002 | -82 | -0.001 | 0.001 |
| 97 | women | 209 | 90 | 0.001 | 0.001 | 119 | 0.001 | 0.001 |
| 88 | process | 224 | 297 | 0.002 | 0.002 | -73 | -0.001 | 0.001 |


In [7]:
title_df = generate_word_df(dataset, 'Title')
title_df[:10]

The total number of positive words is 1509
The total number of negative words is 1340


| | Word | Positive | Negative | Positive Proportion | Negative Proportion | diff | diff proportion | abs diff proportion |
|---|---|---|---|---|---|---|---|---|
| 29 | meeting | 11 | 29 | 0.007 | 0.022 | -18 | -0.014 | 0.014 |
| 53 | march | 6 | 21 | 0.004 | 0.016 | -15 | -0.012 | 0.012 |
| 12 | call | 24 | 7 | 0.016 | 0.005 | 17 | 0.011 | 0.011 |
| 0 | president | 145 | 116 | 0.096 | 0.087 | 29 | 0.01 | 0.01 |
| 14 | house | 21 | 32 | 0.014 | 0.024 | -11 | -0.01 | 0.01 |
| 4 | donald | 41 | 47 | 0.027 | 0.035 | -6 | -0.008 | 0.008 |
| 5 | j | 41 | 47 | 0.027 | 0.035 | -6 | -0.008 | 0.008 |
| 173 | proclaims | 2 | 12 | 0.001 | 0.009 | -10 | -0.008 | 0.008 |
| 9 | trumps | 26 | 12 | 0.017 | 0.009 | 14 | 0.008 | 0.008 |
| 10 | order | 25 | 12 | 0.017 | 0.009 | 13 | 0.008 | 0.008 |


## Most Distinctive Words

Below is a different analysis of the most distinctive words. Instead of the difference of proportions, I used a chi-squared test to determine which words differed significantly in frequency across our two labels. I first did this analysis using the blog post bodies, and then repeated the process using the blog post titles.

When analyzing words in the body, there doesn't seem to be any overlap between the words found to be significantly different using the difference of proportions and the chi-squared test. The same holds for words in the title.

This discrepancy is worrisome because it means that a bag of words method may not give us as meaningful a representation of our blog posts as we would like. We will want to perform operations such as stemming in order to tease out the topic of each post.
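
For intuition about what the chi-squared score measures, here is a minimal plain-Python sketch on four invented documents. It computes Pearson's chi-squared statistic per word, taking the expected count under each label to be the word's total count times that label's share of documents. I believe this is how `sklearn.feature_selection.chi2` treats count features, but treat that as an assumption; `chi_squared_scores` is a hypothetical helper, not part of the notebook. A larger statistic means a stronger association with the labels, so the most distinctive words are the ones with the highest scores.

```python
def chi_squared_scores(doc_counts, labels):
    """Map each word to its chi-squared statistic against the labels.

    `doc_counts` is a list of {word: count} dicts, one per document.
    """
    classes = sorted(set(labels))
    class_share = {c: labels.count(c) / len(labels) for c in classes}
    vocabulary = sorted({word for doc in doc_counts for word in doc})

    scores = {}
    for word in vocabulary:
        total = sum(doc.get(word, 0) for doc in doc_counts)
        statistic = 0.0
        for c in classes:
            # Observed count of the word in documents of class c.
            observed = sum(doc.get(word, 0)
                           for doc, label in zip(doc_counts, labels)
                           if label == c)
            # Expected count if the word were independent of the label.
            expected = class_share[c] * total
            statistic += (observed - expected) ** 2 / expected
        scores[word] = statistic
    return scores


# Toy word counts and labels, invented for illustration only.
doc_counts = [
    {'call': 3, 'president': 1},
    {'call': 2, 'president': 1},
    {'meeting': 4, 'president': 1},
    {'meeting': 1, 'president': 1, 'call': 1},
]
labels = [1, 1, -1, -1]

scores = chi_squared_scores(doc_counts, labels)
# Sort descending so the highest-scoring (most distinctive) words come first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Here 'president' is spread evenly across both labels and scores zero, while 'meeting' occurs only under the negative label and scores highest.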


In [8]:
vectorizer = CountVectorizer(stop_words='english')
body_data = vectorizer.fit_transform(dataset['Body'])

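# chi2(...)[0] holds the chi-squared statistic for each word.
# np.argsort sorts ascending, so this slice selects the 100 words with the
# lowest scores; reverse the order ([::-1]) to rank by highest score instead.
indexes = np.argsort(chi2(body_data, dataset['Label'])[0])[:100]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
top_body_words = np.array(vectorizer.get_feature_names())[indexes]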
print("These are the words given by chi-squared...")
print()
print(top_body_words)
print()
print()
print("These are the words given by the difference of proportions...")
print()
print(np.array(body_df['Word'][:100].tolist()))
print()
print()
print("For each word in the difference of proportions list, whether it is also in the chi-squared list...")
print()
print(np.array([word in top_body_words for word in body_df['Word'][:100].tolist()]))

These are the words given by chi-squared...

['publish' 'portion' 'proper' 'rightmr' 'soldiers' 'informed' 'honestly'
 'area' 'dollars' 'free' 'addition' 'having' 'sell' 'readout' 'parties'
 'range' 'kids' 'learned' 'change' 'mike' 'wrote' 'subsequent'
 'generations' 'break' 'challenge' 'marketplace' 'lack' 'litigation' 'gave'
 'policy' 'rights' 'peaceful' 'hearings' 'largely' 'russians' 'function'
 'directing' 'uniform' 'sides' 'took' 'nominee' 'broken' 'list' 'pence'
 'right' 'direct' 'problems' 'aspect' 'developing' 'chemical'
 'professional' 'forgotten' 'rolling' 'private' 'hope' 'friend' 'allies'
 'ending' 'empower' 'door' 'creators' 'additionally' 'troops' 'alex'
 'keeps' 'reset' 'negotiations' 'hoping' 'launched' 'german' 'extremely'
 'taxpayer' 'presented' 'talks' 'district' 'arent' 'specifically'
 'extraordinary' 'car' 'named' 'podium' 'prepare' 'revenues' 'grant'
 'component' 'service' 'course' 'international' 'yeah' 'simple' 'bless'
 'particular' 'reinvest' 'meetingsq' 'impl

In [9]:
vectorizer = CountVectorizer(stop_words='english')
title_data = vectorizer.fit_transform(dataset['Title'])

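# chi2(...)[0] holds the chi-squared statistic for each word.
# np.argsort sorts ascending, so this slice selects the 100 words with the
# lowest scores; reverse the order ([::-1]) to rank by highest score instead.
indexes = np.argsort(chi2(title_data, dataset['Label'])[0])[:100]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
top_title_words = np.array(vectorizer.get_feature_names())[indexes]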
print("These are the words given by chi-squared...")
print()
print(top_title_words)
print()
print()
print("These are the words given by the difference of proportions...")
print()
print(np.array(title_df['Word'][:100].tolist()))
print()
print()
print("For each word in the difference of proportions list, whether it is also in the chi-squared list...")
print()
print(np.array([word in top_title_words for word in title_df['Word'][:100].tolist()]))

These are the words given by chi-squared...

['statement' 'discussion' 'guests' 'travel' 'discuss' 'committee' 'majesty'
 'insurance' 'dan' 'treasury' 'initiative' 'alsisi' 'promote' 'make'
 'federalism' 'sanders' 'angela' 'investment' 'bipartisan' 'excellence'
 'bin' 'second' 'support' 'prolife' 'photos' 'regulations' 'contracting'
 'empowerment' 'karen' 'charter' 'authority' 'attorney' 'florida'
 'upcoming' 'medal' 'acting' 'officials' 'comprehensive' 'reopening'
 'reorganizing' 'abuse' 'gerald' 'winter' 'gala' 'new' 'cabinet'
 'recipients' 'administrator' 'parentteacher' 'revocation' 'right' 'ford'
 'african' 'verma' 'minister' 'address' 'delivers' 'japan' 'establishing'
 'orders' 'route' 'week' 'en' 'message' 'care' 'al' 'womens' 'general'
 'department' 'chancellor' 'jordan' 'alabadi' 'legislative' 'memorandum'
 'weekly' 'nominate' 'intent' 'respect' 'gorsuch' 'visit' 'supreme'
 'budgetary' 'entitled' 'impact' 'congress' 'analysis' 'announces'
 'secretary' 'press' 'independence' 'c

## Clustering 

After noticing the inconclusiveness of the above methods, we approached our problem from a clustering standpoint to see why it was so difficult to find distinct words. We used the cosine distance metric to measure the dissimilarity of two blog posts. This metric was used so that our plots would be agnostic of how long the text documents were. We then used multidimensional scaling (MDS), which we learned about in class, to project these distances into two dimensions for plotting.
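
The claim that cosine distance ignores document length can be checked with a short sketch. The word-count vectors below are invented for illustration, and `cosine_distance` is a hypothetical plain-Python stand-in for `1 - cosine_similarity(...)` as used in the cells below.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy word-count vectors over a three-word vocabulary (hypothetical).
short_post = [1, 2, 0]
long_post = [10, 20, 0]   # same word mix as short_post, ten times longer
other_post = [0, 1, 3]    # a different word mix

same_mix = cosine_distance(short_post, long_post)
different_mix = cosine_distance(short_post, other_post)
print(same_mix, different_mix)
```

The first distance is zero up to floating-point error even though one post is ten times longer, while the second is clearly larger because the word mix differs.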

Below we see that the plot of the dataset created from the blog titles seems to have a few weak clusters based on label, while the dataset created from the blog post bodies is more mixed. The clusters in the title data plot could indicate that the blog titles form a nice feature set for our labels, or at least a better one than the data generated from the blog post bodies.


In [10]:

body_cos_dist = 1 - cosine_similarity(body_data)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)

pos = mds.fit_transform(body_cos_dist)  # shape (n_samples, n_components)
df = pd.DataFrame()
df['x'] = pos[:, 0]
df['y'] = pos[:, 1]
df['label'] = dataset['Label']

p = Scatter(df, x='x', y='y', title="MDS: White House Post Bodies", color='label',
           legend="top_right")

show(p)

In [11]:
title_cos_dist = 1 - cosine_similarity(title_data)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(title_cos_dist)  # shape (n_samples, n_components)
df = pd.DataFrame()
df['x'] = pos[:, 0]
df['y'] = pos[:, 1]
df['label'] = dataset['Label']

p = Scatter(df, x='x', y='y', title="MDS: White House Post Titles", color='label',
           legend="top_right")

show(p)

## Discussion

From our exploration, it seems that a simple bag of words model over our text data is not enough to produce a vector space in which our labels separate well. Ideally, our featurization would place the labeled data in a space where a statistical model can be trained to discriminate the labels with high accuracy. As seen in the plots above, our data points are only semi-clustered, and only when looking at the blog post titles. However, from Jon's topic models, it does seem as though there is a relationship between posts about 'Regulation' and stock market movements. From his findings, it may be possible to find a vector space where we can more easily train a predictive model to estimate stock market movement.

Another problem that hinders us is the level of noise in the stock market itself. We would want to make sure that the movements that we look at are meaningful and not due to chance. That way when there is a significant movement after a blog post, we can more accurately say that there is a correlation between the type of content in the blog post and the movement the stock market makes. This assignment helped us stop and think more critically about our data collection process and analysis methods.


## Next Steps

- Label our data after denoising our stock data
- Stem our text data
- Look into combining the title and body of a post

## Appendix: Imports, Globals, and Helper Functions

In [2]:
from re import sub
import pandas as pd
import numpy as np

from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_selection import chi2

from bokeh.io import output_notebook, show
from bokeh.charts import Scatter

output_notebook()


stopwords = set(stopwords.words('english'))

DATATYPES = {'Dow Jones Value': np.float32,
            'Dow Jones Delta': np.float32,
            'Dow Jones Proportion': np.float32,
            'Nasdaq Value': np.float32,
            'Nasdaq Delta': np.float32,
            'Nasdaq Proportion': np.float32,
            'S&P 500 Value': np.float32,
            'S&P 500 Delta': np.float32,
            'S&P 500 Proportion': np.float32 
            }

def join_blog_posts_on_date(df):
    posts = {}
    titles = {}
    for index,row in df.iterrows():
        date = row['Date']
        if date not in posts:
            posts[date] = []
            titles[date] = []
        posts[date].append(row['Body'])
        titles[date].append(row['Title'])

    posts = {date:' '.join(posts[date]) for date in posts}
    titles = {date: ' '.join(titles[date]) for date in titles}

    posts = pd.DataFrame(list(posts.items()))
    posts.columns = ['Date', 'Body']

    titles = pd.DataFrame(list(titles.items()))
    titles.columns = ['Date', 'Title']

    # Compute each index's relative change on the full frame first, so every
    # proportion stays attached to its own date when duplicates are dropped.
    df = df.assign(**{
        'Dow Jones Proportion': df['Dow Jones Delta'] / df['Dow Jones Value'],
        'Nasdaq Proportion': df['Nasdaq Delta'] / df['Nasdaq Value'],
        'S&P 500 Proportion': df['S&P 500 Delta'] / df['S&P 500 Value'],
    })

    dj_deltas = df[['Date', 'Dow Jones Delta', 'Dow Jones Proportion']].drop_duplicates()
    nd_deltas = df[['Date', 'Nasdaq Delta', 'Nasdaq Proportion']].drop_duplicates()
    sp_deltas = df[['Date', 'S&P 500 Delta', 'S&P 500 Proportion']].drop_duplicates()
    
    dataset = pd.merge(posts, titles, how='inner', on=['Date'])
    dataset = pd.merge(dataset, dj_deltas, how='inner', on=['Date'])
    dataset = pd.merge(dataset, nd_deltas, how='inner', on=['Date'])
    dataset = pd.merge(dataset, sp_deltas, how='inner', on=['Date'])

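    # Raw strings avoid invalid-escape warnings; regex=True is required on recent pandas.
    dataset['Body'] = dataset['Body'].str.replace(r'\d+', '', regex=True).str.replace(r'[^a-zA-Z ]', '', regex=True)
    dataset['Title'] = dataset['Title'].str.replace(r'\d+', '', regex=True).str.replace(r'[^a-zA-Z ]', '', regex=True)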
    dataset['Mean Delta'] = (dataset['Dow Jones Delta'] + dataset['Nasdaq Delta'] + dataset['S&P 500 Delta']) / 3
    dataset['Mean Proportion'] = (dataset['Dow Jones Proportion'] + 
                                  dataset['Nasdaq Proportion'] + 
                                  dataset['S&P 500 Proportion']) / 3
    dataset['Label'] = (dataset['Mean Proportion'] >= 0).apply(int)
    dataset['Label'] = 2 * dataset['Label'] - 1
    return dataset

def generate_word_df(df, text='Body'):
    grouped = df.groupby('Label')
    joined_text = grouped[text].sum()
    negative_text = joined_text[-1]
    positive_text = joined_text[1]
    
    positive_text_words = count_words(positive_text)
    negative_text_words = count_words(negative_text)
    
    positive_words = pd.DataFrame(list(positive_text_words.items()), 
                              columns=['Word', 'Positive']).sort_values(by='Positive',
                                                                     ascending=False)
    negative_words = pd.DataFrame(list(negative_text_words.items()), 
                              columns=['Word', 'Negative']).sort_values(by='Negative',
                                                                     ascending=False)
    
    word_df = pd.merge(positive_words, negative_words, how='inner', on=['Word'])

    total_positive = word_df['Positive'].sum()
    total_negative = word_df['Negative'].sum()
    word_df['Positive Proportion'] = word_df['Positive'] / total_positive
    word_df['Negative Proportion'] = word_df['Negative'] / total_negative
    word_df['diff'] = word_df['Positive'] - word_df['Negative']
    word_df['diff proportion'] = word_df['Positive Proportion'] - word_df['Negative Proportion']
    print("The total number of positive words is %d" % word_df['Positive'].sum())
    print("The total number of negative words is %d" % word_df['Negative'].sum())
    
    word_df = word_df.round({'Positive Proportion':3, 'Negative Proportion':3, 'diff proportion':3})
    word_df['abs diff proportion'] = word_df['diff proportion'].abs()
    word_df = word_df.sort_values(by='abs diff proportion', ascending=False)
    return word_df

def count_words(text):
    text = sub('[^a-zA-Z ]', '', text)
    counts = {}
    for word in text.lower().split():
        if word in stopwords:
            continue
        if word not in counts:
            counts[word] = 0
        counts[word] += 1
    return counts