# Market Predictions

This notebook attempts to predict whether returns of the S&P 500 will be positive or negative in the month following a Federal Reserve meeting. We do this by using the summary section of the Beige Book as our model input.

In [1]:
# set up connection to s3

import os
import boto3
import re
import copy
import time
from time import gmtime, strftime
from sagemaker import get_execution_role

role = get_execution_role()

region = boto3.Session().region_name

bucket='stevenkoenemann-sagemaker' # Replace with your s3 bucket name
prefix = 'sagemaker/xgboost-mnist' # Used as part of the path in the bucket where you store data
bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region,bucket) # The URL to access the bucket

In [2]:
# retrieve data from s3

import boto3
import pandas as pd
import numpy as np
from sagemaker import get_execution_role

role = get_execution_role()
bucket='stevenkoenemann-sagemaker'
data_key = 'market_predictions.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

data = pd.read_csv(data_location)
data.head()

Unnamed: 0,Date,Minutes,Bond Sentiment,S&P Sentiment
0,2013-11-30,Reports from the twelve Federal Reserve Distri...,1,1
1,2013-10-31,Reports from the twelve Federal Reserve Distri...,0,1
2,2013-08-31,Reports from the twelve Federal Reserve Distri...,1,0
3,2013-07-31,Overall economic activity increased at a modes...,1,1
4,2013-06-30,Reports from the twelve Federal Reserve Distri...,1,0


The first step is to clean and process the text in the dataframe. To do this we will use NLTK and loop through the text to remove non-letter characters, make the text lowercase, split it into words, remove stop words, and stem the remaining words.

In [3]:
# Clean the text: keep letters only, lowercase, remove stop words, and stem
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
corpus = []
for text in data['Minutes']:  # iterate over every row instead of a hard-coded count
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    corpus.append(' '.join(ps.stem(word) for word in words if word not in stop_words))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
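
To make the effect of the cleaning steps concrete, here is a minimal sketch on a single made-up sentence. It uses only the standard library, so the stop-word set `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list, and the stemming step is left out; `clean_text` is an illustrative helper, not part of the notebook.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's full English list)
STOP_WORDS = {"the", "at", "a", "in", "of", "and"}

def clean_text(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split into words, and drop stop words."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)  # digits and punctuation become spaces
    words = letters_only.lower().split()
    return ' '.join(w for w in words if w not in stop_words)

sample = "The economy expanded at a moderate pace in 12 districts."  # made-up sentence
cleaned = clean_text(sample)
print(cleaned)  # economy expanded moderate pace districts
```

In the real cell, each surviving word is additionally passed through `PorterStemmer`, which reduces it to a word stem.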


In [4]:
X = corpus

y = data['S&P Sentiment']

The next step is to convert the cleaned text into numbers so that we have something to input into the model. For this we will use the Keras one_hot function. Despite its name, this function does not one-hot encode the document; it integer-encodes each word using the hashing_trick function. We pass a vocabulary size so that the function knows the range of integers it can assign. This number should be at least about the number of unique words in the corpus; if it is smaller, some different words will be mapped to the same integer.

In [6]:
from numpy import array
import tensorflow as tf
import keras
from keras.preprocessing.text import one_hot

# integer encode the documents
vocab_size = 500
encoded_docs = [one_hot(d, vocab_size) for d in X]


Using TensorFlow backend.
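
As a rough illustration of what hash-based integer encoding does, the sketch below maps each word to an integer between 1 and `n - 1` by hashing it. This is an assumption-laden stand-in rather than the Keras implementation: `stable_hash` and `hash_encode` are hypothetical helpers, and MD5 is used here only so that the result is reproducible, whereas `one_hot` relies on Python's built-in `hash`, which in Python 3 can differ between sessions.

```python
import hashlib

def stable_hash(word):
    """Deterministic integer hash of a word, based on MD5."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)

def hash_encode(text, n):
    """Map each word to an integer in [1, n - 1]; 0 is left free for padding."""
    return [stable_hash(w) % (n - 1) + 1 for w in text.split()]

# The same word always receives the same code
codes = hash_encode("econom activ increas econom", 500)
print(codes)

# With a tiny vocabulary size, only the codes 1 and 2 exist,
# so three distinct words cannot all receive different codes
small = hash_encode("econom activ increas", 3)
print(small)
```

The second call shows why the vocabulary size matters: when it is small relative to the number of unique words, collisions are unavoidable and the model can no longer tell the colliding words apart.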


After we encode our documents into numbers we must pad the sequences, because they are most likely not all the same length. With padding='post', pad_sequences adds zeros to the end of each sequence until it reaches the max_length that we assign (the default, padding='pre', adds them to the beginning instead). Sequences longer than max_length are truncated, by default from the beginning.

In [7]:
from keras.preprocessing.sequence import pad_sequences

# pad (or truncate) documents to a length of 500 words
max_length = 500
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
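
The padding behaviour used above can be sketched in plain Python. The helper `pad_post` below is hypothetical and handles a single sequence; it is meant to mirror `pad_sequences` with `padding='post'` and the default front truncation, not to replace it.

```python
def pad_post(seq, maxlen, value=0):
    """Drop items from the front if seq is too long, then append `value` up to maxlen."""
    if len(seq) > maxlen:
        trimmed = list(seq[len(seq) - maxlen:])  # keep only the last maxlen items
    else:
        trimmed = list(seq)
    return trimmed + [value] * (maxlen - len(trimmed))

print(pad_post([7, 3], 4))              # [7, 3, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

Note that with front truncation a document longer than `max_length` loses its opening words, so the choice of `max_length` decides how much of each summary the model actually sees.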

Now that the text has been fully processed, we will build our LSTM model using Keras. An LSTM is a type of recurrent neural network, which means that the output state from one time step is saved and used as an input to the next step, along with the next element of the sequence. This helps the network recognize patterns in sequential data such as text.
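
To show how state is carried from step to step, here is a minimal sketch of a single-unit LSTM written in plain Python. It is only an illustration of the standard LSTM update equations: the helpers `lstm_step` and `run_lstm` are hypothetical, the input is a scalar rather than an embedding vector, and the weights are arbitrary values, not anything learned by the model in this notebook.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM with a scalar input."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g   # new cell state: keep part of the old, add part of the new
    h = o * math.tanh(c)     # new hidden state, passed on to the next step
    return h, c

def run_lstm(sequence, w):
    """Feed a sequence through the cell, carrying (h, c) from step to step."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
    return h

# Arbitrary illustrative weights (a trained model would learn these)
weights = {name: 0.5 for name in ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug")}
weights.update({"bf": 0.0, "bi": 0.0, "bo": 0.0, "bg": 0.0})

h_forward = run_lstm([1.0, 0.0, -1.0], weights)
h_reversed = run_lstm([-1.0, 0.0, 1.0], weights)
print(h_forward, h_reversed)  # same inputs in a different order give a different final state
```

Because the hidden and cell states depend on everything seen so far, the same values in a different order produce a different result, which is the property that makes recurrent layers suitable for text.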

In [8]:
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM, Dropout
from keras import optimizers

# Build model


print('Build model...')
model = Sequential()
# Embedding layer maps each integer word index to a dense 128-dimensional vector
model.add(Embedding(vocab_size, 128, input_length=max_length))
# Stacked LSTM layers, with dropout after the first two to reduce overfitting
model.add(LSTM(500, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(500, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(500, return_sequences=True))
model.add(LSTM(200))
# Final layer of model with sigmoid function 
model.add(Dense(1, activation='sigmoid'))

# Create optimizer
o=optimizers.Nadam(lr=0.000001)

# Compile Model 
model.compile(loss='binary_crossentropy',
              optimizer=o,
              metrics=['accuracy'])


Build model...



In [None]:
# Train the model on the full dataset (no validation split) and save the weights
print('Train...')
model.fit(padded_docs, y,
          batch_size=5,
          epochs=10)
model.save_weights('fed_predictions.h5')

Train...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10

The results are not quite as good as I had hoped, but there may be reasons for this. First, I may simply not have had enough data: there are only 168 data points, which is not many for a machine learning project like this. Second, I conflated the sentiment of the text with market returns. The Fed could talk positively about the economy and the market might still not respond accordingly due to other factors. Overall, I think that adding more data points and using more of the Federal Reserve minutes would improve the performance of the model.