# Event Driven Stock Prediction

Deep learning implementation of stock prediction, inspired by "Deep Learning for Event-Driven Stock Prediction" (Ding et al., 2015).

This is a simplified implementation: it does not include the Neural Tensor Network or the Convolutional Neural Network.

#### Data Preparation
#####  News Data
###### News dataset from Bloomberg & Reuters (Oct. 20, 2006 ~ Nov. 26, 2013)
  - Extract news titles only (generators/data_generator.py)
  - Extract Relation Triples using OpenIE 5.0 (generators/svo_generator.py)
  - Match Relation Triples with corresponding word embeddings (generators/svo_embedding_generator.py)
  - For a detailed description of the preprocessing steps, see the corresponding .py files
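
As a rough illustration of the last preprocessing step, the sketch below turns one relation triple into a single 100-dimensional event vector. It is not the code in generators/svo_embedding_generator.py: the helper names `embed_phrase` and `embed_triple`, the toy vocabulary, and the choice to average word vectors within each phrase are all assumptions made for this example. The final mean over the three components mirrors the `np.mean(val, axis=0)` call used later in this notebook.

```python
import numpy as np

EMBED_DIM = 100  # matches the 100-d vectors used later in this notebook


def embed_phrase(phrase, word_vectors, dim=EMBED_DIM):
    """Average the vectors of the known words in a phrase (hypothetical helper)."""
    vectors = [word_vectors[w] for w in phrase.lower().split() if w in word_vectors]
    if not vectors:
        # Assumption: phrases with no known words map to a zero vector.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


def embed_triple(triple, word_vectors, dim=EMBED_DIM):
    """Mean of the subject, relation and object embeddings (hypothetical helper)."""
    subject, relation, obj = triple
    parts = [embed_phrase(p, word_vectors, dim) for p in (subject, relation, obj)]
    return np.mean(parts, axis=0)


# Toy vocabulary with random vectors, standing in for real pretrained embeddings.
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=EMBED_DIM) for w in ["apple", "sues", "samsung"]}

event_vector = embed_triple(("Apple", "sues", "Samsung"), toy_vectors)
print(event_vector.shape)  # (100,)
```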

##### S&P 500 Data (2006 ~ 2013)
- The data is labeled by volatility level.

- I train a multi-class classification model on the next day's volatility. (The original paper uses binary classification.)
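
The three-way labeling can be sketched as a small function. The two thresholds and the one-hot order `[positive, neutral, negative]` are taken from the cells further down in this notebook; the function name `label_move` is hypothetical and only serves this example.

```python
POS_THRESHOLD = 0.620074   # threshold used in the notebook cells below
NEG_THRESHOLD = -0.471559  # threshold used in the notebook cells below


def label_move(pct_change, pos=POS_THRESHOLD, neg=NEG_THRESHOLD):
    """Return a one-hot label in the order [positive, neutral, negative]."""
    if pct_change > pos:
        return [1, 0, 0]
    if pct_change < neg:
        return [0, 0, 1]
    return [0, 1, 0]


print(label_move(1.228803))   # large up move
print(label_move(-0.336872))  # small down move, inside the neutral band
print(label_move(-1.0))       # made-up value, below the negative threshold
```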

In [4]:
import numpy as np
import pickle
import os
import scipy.stats as stats
import pandas as pd
from collections import defaultdict
from keras import backend as K
from keras.engine.topology import Layer
from keras.layers import Input

In [5]:
#Load dictionaries
with open(os.getcwd()+'/data/news_dict.pickle', 'rb') as handle:
    news_dict = pickle.load(handle)
    
with open(os.getcwd()+'/data/svo_dict.pickle', 'rb') as handle:
    svo_dict = pickle.load(handle)
    
with open(os.getcwd()+'/data/svo_dict_embed.pickle', 'rb') as handle:
    svo_dict_embed = pickle.load(handle)

In [None]:
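df = pd.read_csv("target.csv")
# 'Volatility' here is the signed open-to-close change of the same day, in percent.
df['Volatility'] = ((df['Close']-df['Open'])/df['Open']) * 100
# Remove '-' from string cells so dates read as YYYYMMDD, matching the news dictionary keys.
df.replace('-', '', regex=True, inplace=True)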

In [208]:
df

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Volatility
0,20061002,1335.819946,1338.540039,1330.280029,1331.319946,1331.319946,2154480000,-0.336872
1,20061003,1331.319946,1338.310059,1327.099976,1334.109985,1334.109985,2682690000,0.209569
2,20061004,1333.810059,1350.199951,1331.479980,1350.199951,1350.199951,3019880000,1.228803
3,20061005,1349.839966,1353.790039,1347.750000,1353.219971,1353.219971,2817240000,0.250400
4,20061006,1353.219971,1353.219971,1344.209961,1349.589966,1349.589966,2523000000,-0.268249
5,20061009,1349.579956,1352.689941,1346.550049,1350.660034,1350.660034,1935170000,0.080031
6,20061010,1350.619995,1354.229980,1348.599976,1353.420044,1353.420044,2376140000,0.207316
7,20061011,1353.280029,1353.969971,1343.569946,1349.949951,1349.949951,2521000000,-0.246075
8,20061012,1349.939941,1363.760010,1349.939941,1362.829956,1362.829956,2514350000,0.954858
9,20061013,1362.819946,1366.630005,1360.500000,1365.619995,1365.619995,2482920000,0.205460
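
To make the Volatility column concrete, here is the same formula applied by hand to the first two rows of the table above. The function name `open_to_close_pct` is an assumption for this sketch; the prices are copied from the table.

```python
def open_to_close_pct(open_price, close_price):
    """Signed percentage change from the open to the close of one session."""
    return (close_price - open_price) / open_price * 100


# Open and Close values of rows 0 and 1 in the table above.
row0 = open_to_close_pct(1335.819946, 1331.319946)
row1 = open_to_close_pct(1331.319946, 1334.109985)
print(round(row0, 6), round(row1, 6))
```

The two printed values should agree with the Volatility entries of rows 0 and 1. Because the measure is signed, it behaves like an intraday return rather than a volatility in the statistical sense.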


In [156]:
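# Thresholds on the open-to-close % change: above 0.620074 is "positive", below -0.471559 is "negative".
pos_mask = df['Volatility'] > 0.620074
neg_mask = df['Volatility'] < -0.471559
# Dates (YYYYMMDD strings) that fall in each of the two outer classes.
vol_pos = np.array(df[pos_mask]['Date'])
vol_neg = np.array(df[neg_mask]['Date'])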

In [157]:
df.drop(df[pos_mask].index, inplace= True)

In [158]:
# Select the labels where neg_mask is True; df[neg_mask] no longer aligns with df after the drop above.
df.drop(neg_mask[neg_mask].index, inplace=True)


In [159]:
vol_nothing = np.array(df['Date'])
print(vol_nothing)

['20061002' '20061003' '20061005' '20061006' '20061009' '20061010'
 '20061011' '20061013' '20061016' '20061017' '20061018' '20061019'
 '20061020' '20061023' '20061024' '20061025' '20061026' '20061030'
 '20061031' '20061102' '20061103' '20061107' '20061108' '20061110'
 '20061113' '20061115' '20061116' '20061117' '20061120' '20061121'
 '20061122' '20061124' '20061128' '20061130' '20061201' '20061205'
 '20061206' '20061207' '20061208' '20061211' '20061212' '20061213'
 '20061215' '20061218' '20061219' '20061220' '20061221' '20061226'
 '20061228' '20061229' '20070103' '20070104' '20070108' '20070109'
 '20070110' '20070112' '20070116' '20070117' '20070118' '20070119'
 '20070123' '20070126' '20070129' '20070130' '20070201' '20070202'
 '20070205' '20070206' '20070207' '20070208' '20070212' '20070215'
 '20070216' '20070220' '20070221' '20070222' '20070223' '20070226'
 '20070228' '20070301' '20070307' '20070309' '20070312' '20070315'
 '20070316' '20070322' '20070323' '20070326' '20070329' '20070

In [202]:
# Reload the full table, since df was filtered in place above.
df_2 = pd.read_csv("target.csv")
df_2['Volatility'] = ((df_2['Close']-df_2['Open'])/df_2['Open']) * 100
df_2.replace('-', '', regex=True, inplace=True)

20131126
2595


['20061020',
 '20061021',
 '20061022',
 '20061023',
 '20061024',
 '20061025',
 '20061026',
 '20061027',
 '20061028',
 '20061029',
 '20061030',
 '20061031',
 '20061101',
 '20061102',
 '20061103',
 '20061104',
 '20061105',
 '20061106',
 '20061107',
 '20061108',
 '20061109',
 '20061110',
 '20061111',
 '20061112',
 '20061113',
 '20061114',
 '20061115',
 '20061116',
 '20061117',
 '20061118',
 '20061119',
 '20061120',
 '20061121',
 '20061122',
 '20061123',
 '20061124',
 '20061125',
 '20061126',
 '20061127',
 '20061128',
 '20061129',
 '20061130',
 '20061201',
 '20061202',
 '20061203',
 '20061204',
 '20061205',
 '20061206',
 '20061207',
 '20061208',
 '20061209',
 '20061210',
 '20061211',
 '20061212',
 '20061213',
 '20061214',
 '20061215',
 '20061216',
 '20061217',
 '20061218',
 '20061219',
 '20061220',
 '20061221',
 '20061222',
 '20061223',
 '20061224',
 '20061225',
 '20061226',
 '20061227',
 '20061228',
 '20061229',
 '20061230',
 '20061231',
 '20070101',
 '20070102',
 '20070103',
 '20070104',

In [211]:
news_date_list = list(sorted(svo_dict_embed.keys()))
X_temp_list = []
y_temp_list = []
vol = []
pos_count = 0
neg_count = 0
neut_count = 0
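# Walk the news dates in order; k is a YYYYMMDD string, v holds that day's relation-triple embeddings.
for k, v in sorted(svo_dict_embed.items()):
    # Stop near the end of the range so the indx+3 lookup below stays in bounds.
    if int(k)+3 > int(news_date_list[-1]):
        print(k)
        break
    indx = news_date_list.index(k)
    # Target date: the first of the next news dates that is also an S&P 500 trading day.
    if (df_2['Date'] == news_date_list[indx+1]).any():
        pred_date = news_date_list[indx+1]
    elif (df_2['Date'] == news_date_list[indx+2]).any():
        pred_date = news_date_list[indx+2]
    else:
        # Not checked against df_2; if this is not a trading day, vol keeps its previous value.
        pred_date = news_date_list[indx+3]
    # One-hot label in the order [positive, neutral, negative].
    if pred_date in vol_nothing:
        vol = [0,1,0]
    if pred_date in vol_pos:
        vol = [1,0,0]
    if pred_date in vol_neg:
        vol = [0,0,1]
    for val in v:
        # If a component is not already a 100-d vector, take its first element.
        if len(val[0]) != 100:
            val[0] = val[0][0]
        if len(val[1]) != 100:
            val[1] = val[1][0]
        if len(val[2]) != 100:
            val[2] = val[2][0]
        # Event vector: mean of the three component embeddings.
        X_temp_list.append(np.mean(val,axis=0))
        y_temp_list.append(vol)
        # Count events per class.
        if vol[0] == 1:
            pos_count += 1
        if vol[1] == 1:
            neut_count += 1
        if vol[2] == 1:
            neg_count += 1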
        
print(pos_count)
print(neg_count)
print(neut_count)

20131124
37617
36986
108825
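
The loop above finds the prediction date by checking up to three following news dates and relies on integer arithmetic over date strings. As an alternative sketch, not what this notebook does, the next trading day can be looked up directly in a sorted list of trading dates with `bisect`. This works because YYYYMMDD strings sort in chronological order. The function name `next_trading_day` is hypothetical.

```python
from bisect import bisect_right


def next_trading_day(news_date, trading_days):
    """Return the first trading day strictly after news_date, or None if there is none.

    trading_days must be a sorted list of YYYYMMDD strings.
    """
    i = bisect_right(trading_days, news_date)
    return trading_days[i] if i < len(trading_days) else None


# Three consecutive trading days taken from the date listing earlier in this notebook.
trading_days = ["20061020", "20061023", "20061024"]

print(next_trading_day("20061020", trading_days))  # 20061023 (skips the weekend)
print(next_trading_day("20061021", trading_days))  # 20061023 (Saturday news)
print(next_trading_day("20061024", trading_days))  # None (past the end of the list)
```

With this lookup every news date maps to a real trading day or to `None`, so a label is never carried over from a previous iteration.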


In [206]:
news_date_list[-1]

'20131126'

In [212]:
y_full = np.array(y_temp_list,dtype='float')
X_full = np.stack(X_temp_list,axis=0)

In [163]:
#Data preparation complete

#### Modeling
- Simple settings with default parameters are used.
- My focus was on learning the NN architecture, so the model is not tuned to a deployable level.
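
Before reading any accuracy figure, it may help to know what a trivial classifier would score. The sketch below uses the per-event class counts printed at the end of the data preparation step (in the order positive, negative, neutral) and computes the accuracy of always predicting the most frequent class. It assumes the test split has roughly the same class proportions as the full dataset.

```python
# Per-event class counts printed by the data preparation loop.
class_counts = {"positive": 37617, "negative": 36986, "neutral": 108825}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

print(f"{majority_class}: {baseline_accuracy:.4f}")  # neutral: 0.5933
```

A model that does not clearly beat this figure has probably learned little beyond the class imbalance.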

In [213]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full)

In [214]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(31, activation='relu', input_dim=100))
model.add(Dropout(0.5))
model.add(Dense(31, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

model.fit(X_train, y_train, epochs=20, batch_size=128)
# With metrics=['accuracy'], evaluate returns [loss, accuracy] on the held-out split.
score = model.evaluate(X_test, y_test, batch_size=128)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
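
A single accuracy number hides which classes the model confuses. The following is a minimal NumPy sketch of a confusion matrix built from one-hot labels and softmax outputs. The function name and the toy arrays are made up for illustration; in this notebook the probabilities would presumably come from `model.predict(X_test)`.

```python
import numpy as np


def confusion_matrix_from_one_hot(y_true, y_prob, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    true_idx = np.argmax(y_true, axis=1)
    pred_idx = np.argmax(y_prob, axis=1)
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # add.at accumulates correctly when the same (row, col) pair repeats.
    np.add.at(matrix, (true_idx, pred_idx), 1)
    return matrix


# Toy data in the notebook's label order [positive, neutral, negative].
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
y_prob = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.5, 0.4, 0.1],
                   [0.1, 0.6, 0.3]])

cm = confusion_matrix_from_one_hot(y_true, y_prob)
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)  # 0.5
```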
