# Relating News Data to Market Movement

According to the efficient-market hypothesis, asset prices fully reflect all available information. The hypothesis is commonly stated in three forms: weak-form, semi-strong-form and strong-form efficiency. The semi-strong form states that trading strategies built on public information will not produce excess profits.

However, the success of a growing number of funds suggests that trading strategies based on public and private information can make money. Public information has two parts. The first, widely used by funds, is historical market data. The second is public news, which shapes investor psychology. It is an important source of information, yet it is often neglected and quantitative analysts seldom pay attention to it.

In this project, I used machine learning and deep learning methods, together with some NLP methods, to build models that link market movement to news data.

The data can be found on GitHub: https://github.com/dineshdaultani/StockPredictions/tree/master/Data

Two types of data were gathered over the ten years from **2007** to **2016**.

The first part is the stock index data. This data set includes the DJIA Open, High, Low, Close, Volume and Adjusted Close Price. Here, we will use the **Adjusted Close Price** only. The data was collected from Yahoo Finance.

The second data set contains **News Data** from the NY Times Archive API. The news text for each day was joined into a single line.
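
The one-line-per-day layout described above can be sketched in a few lines. This is only an illustration: the function name, the sample headlines and the `". "` separator are assumptions (the separator is guessed from the `articles` preview shown in the first cells of Section 1), not the code that produced the pickle file.

```python
from collections import defaultdict

def join_headlines_by_day(records, sep=". "):
    """Group (date, headline) pairs by date and join each day's headlines."""
    by_day = defaultdict(list)
    for date, headline in records:
        by_day[date].append(headline.strip())
    # Sort by date so the result follows calendar order.
    return {date: sep.join(headlines) for date, headlines in sorted(by_day.items())}

# Hypothetical sample records, not taken from the dataset.
sample = [
    ("2007-01-02", "Headline B"),
    ("2007-01-01", "Headline A"),
    ("2007-01-01", "Headline C"),
]
daily = join_headlines_by_day(sample)
```

Each key of `daily` is a date and each value is that day's headlines as a single string, which is the shape the `articles` column has in the notebook.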

## 1. Data Preprocessing
In this section, I preprocess the data: tidy the combined article text, create binary labels, and split the data into training and test sets.

In [1]:
# Reading the saved data pickle file
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_pickle('/Users/xuhuizhou/Desktop/courses_2019spring/stat541/project/pickled_ten_year_filtered_data.pkl')

In [2]:
df.head()

Unnamed: 0,close,adj close,articles
2007-01-01,12469.971875,12469.971875,. What Sticks from '06. Somalia Orders Islamis...
2007-01-02,12472.245703,12472.245703,. Heart Health: Vitamin Does Not Prevent Death...
2007-01-03,12474.519531,12474.519531,. Google Answer to Filling Jobs Is an Algorith...
2007-01-04,12480.69043,12480.69043,. Helping Make the Shift From Combat to Commer...
2007-01-05,12398.009766,12398.009766,. Rise in Ethanol Raises Concerns About Corn a...


In [3]:
df['price'] = df['adj close']
df['articles'] = df['articles'].map(lambda x: x.lstrip('.-'))
df['date'] = df.index
df.reset_index(inplace = True)
df = df[['date', 'price', 'articles']]
df.head()

Unnamed: 0,date,price,articles
0,2007-01-01,12469.971875,What Sticks from '06. Somalia Orders Islamist...
1,2007-01-02,12472.245703,Heart Health: Vitamin Does Not Prevent Death ...
2,2007-01-03,12474.519531,Google Answer to Filling Jobs Is an Algorithm...
3,2007-01-04,12480.69043,Helping Make the Shift From Combat to Commerc...
4,2007-01-05,12398.009766,Rise in Ethanol Raises Concerns About Corn as...


In [4]:
# create binary label. When the next day's close price
# is higher than today's, we label it 1, otherwise 0
df['next_price'] = df['price'].shift(-1)
df['label'] = (df['next_price'] > df['price']) * 1
df.head()

Unnamed: 0,date,price,articles,next_price,label
0,2007-01-01,12469.971875,What Sticks from '06. Somalia Orders Islamist...,12472.245703,1
1,2007-01-02,12472.245703,Heart Health: Vitamin Does Not Prevent Death ...,12474.519531,1
2,2007-01-03,12474.519531,Google Answer to Filling Jobs Is an Algorithm...,12480.69043,1
3,2007-01-04,12480.69043,Helping Make the Shift From Combat to Commerc...,12398.009766,0
4,2007-01-05,12398.009766,Rise in Ethanol Raises Concerns About Corn as...,12406.503255,1


In [5]:
# remove the last day's data because there
# is no tomorrow for that day in the data
df = df[:-1]

In [6]:
# train and test splitting. 2007-2014 train, 2015-2016 test
train = df[df['date'] < '2015-01-01']
test = df[df['date'] > '2014-12-31']
print(len(train))
print(len(test))

2922
730
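
A useful reference point for the accuracies reported in Section 2 is the score of a classifier that always predicts the most frequent training label. The sketch below is a minimal illustration with toy labels; `majority_baseline_accuracy` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for y in test_labels if y == majority)
    return majority, hits / len(test_labels)

# Toy labels for illustration only, not taken from the dataset.
toy_train = [1, 1, 0, 1, 0]
toy_test = [1, 0, 1, 1]
majority, acc = majority_baseline_accuracy(toy_train, toy_test)
```

With the notebook's data frames, the same helper could be called as `majority_baseline_accuracy(train['label'].tolist(), test['label'].tolist())`.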


## 2. Models
In this section, multiple models were tested and tuned.
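
The subsections that follow vary two things in the text features: the n-gram size ("one", "two" or "three connected words", set by `ngram_range`) and the document-frequency bounds `min_df` and `max_df`. The pure-Python sketch below illustrates both ideas on toy sentences. It is a simplification, assuming whitespace tokenization and fractional bounds treated as inclusive; scikit-learn's vectorizers use their own tokenizer, so results on real text will differ. The helper names are hypothetical.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return the n-grams (runs of n consecutive words) in a text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngrams_within_df_bounds(docs, n, min_df, max_df):
    """Keep n-grams whose share of documents lies in [min_df, max_df]."""
    doc_frequency = Counter()
    for doc in docs:
        # Count each n-gram at most once per document.
        doc_frequency.update(set(word_ngrams(doc, n)))
    total = len(docs)
    return sorted(g for g, c in doc_frequency.items()
                  if min_df <= c / total <= max_df)

# Toy documents for illustration only.
toy_docs = [
    "north korea talks",
    "north korea plans to",
    "plans to cut costs",
    "the day",
]
bigrams = word_ngrams(toy_docs[0], 2)
kept = ngrams_within_df_bounds(toy_docs, n=2, min_df=0.5, max_df=1.0)
```

Tight bounds shrink the feature space sharply, which is consistent with the small matrices printed in some cells, for example `(2922, 48)` in Section 2.2.
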
### 2.1. Logistic Regression with One Connected Word

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import random

random.seed(0)
vectorizer_1 = CountVectorizer()
train_1 = vectorizer_1.fit_transform(train['articles'])
print(train_1.shape)
model_1 = LogisticRegression().fit(train_1, train["label"])
test_1 = vectorizer_1.transform(test['articles'])
pred_1 = model_1.predict(test_1)
acc_1=accuracy_score(test['label'], pred_1)
print('Logistic Regression 1 accuracy: ', acc_1)

(2922, 54425)
Logistic Regression 1 accuracy:  0.52602739726


In [22]:
# check the words with the largest and smallest coefficients
basicwords = vectorizer_1.get_feature_names()
basiccoeffs = model_1.coef_.tolist()[0]
coeffdf = pd.DataFrame({'Word' : basicwords, 
                        'Coefficient' : basiccoeffs})
coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(5)

Unnamed: 0,Coefficient,Word
49875,0.600179,try
12111,0.448701,cuba
18046,0.442023,ferguson
25823,0.441164,john
5093,0.440137,bay


In [23]:
coeffdf.tail(5)

Unnamed: 0,Coefficient,Word
16832,-0.393243,ethics
44949,-0.409957,software
25363,-0.419106,italy
50856,-0.422388,unlikely
12534,-0.430101,danger


### 2.2. Logistic Regression with Two Connected Words

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [30]:
# settings that gave a relatively reasonable accuracy
vectorizer_2 = TfidfVectorizer( min_df=0.2, max_df=0.95, max_features = 200000, ngram_range = (2, 2))
train_2 = vectorizer_2.fit_transform(train['articles'])
print(train_2.shape)
model_2 = LogisticRegression().fit(train_2, train["label"])
test_2 = vectorizer_2.transform(test['articles'])
pred_2 = model_2.predict(test_2)
acc_2=accuracy_score(test['label'], pred_2)
print('Logistic Regression 2 accuracy: ', acc_2)

(2922, 48)
Logistic Regression 2 accuracy:  0.501369863014


In [39]:
# check the word pairs with the largest and smallest coefficients
basicwords = vectorizer_2.get_feature_names()
basiccoeffs = model_2.coef_.tolist()[0]
coeffdf = pd.DataFrame({'Word' : basicwords, 
                        'Coefficient' : basiccoeffs})
coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(5)

Unnamed: 0,Coefficient,Word
37,1.227785,the day
21,0.681015,million in
25,0.574881,north korea
24,0.551523,news quiz
30,0.37457,plans to


In [40]:
coeffdf.tail(5)

Unnamed: 0,Coefficient,Word
18,-0.23468,in the
29,-0.259136,plan to
26,-0.269091,of the
39,-0.820618,the early
5,-1.033149,daily report


### 2.3. Logistic Regression with Three Connected Words

In [42]:
vectorizer_3 = TfidfVectorizer( min_df=0.004, max_df=0.016, max_features = 200000, ngram_range = (3, 3))
train_3 = vectorizer_3.fit_transform(train['articles'])
print(train_3.shape)
model_3 = LogisticRegression().fit(train_3, train["label"])
test_3 = vectorizer_3.transform(test['articles'])
pred_3 = model_3.predict(test_3)
acc_3=accuracy_score(test['label'], pred_3)
print('Logistic Regression 3 accuracy: ', acc_3)

(2922, 3217)
Logistic Regression 3 accuracy:  0.532876712329


In [43]:
advwords = vectorizer_3.get_feature_names()
advcoeffs = model_3.coef_.tolist()[0]
advcoeffdf = pd.DataFrame({'Words' : advwords, 
                        'Coefficient' : advcoeffs})
advcoeffdf = advcoeffdf.sort_values(['Coefficient', 'Words'], ascending=[0, 1])
advcoeffdf.head(5)

Unnamed: 0,Coefficient,Words
273,1.187452,bay area report
456,1.027108,china moves to
374,1.023931,business as usual
1084,1.022957,in northern ireland
2750,1.021028,to cut costs


In [44]:
advcoeffdf.tail(5)

Unnamed: 0,Coefficient,Words
772,-1.067543,for weather channel
2549,-1.100151,the life report
3154,-1.109636,world war ii
126,-1.122023,and the future
260,-1.240254,back on the


### 2.4. Naive Bayes with One Word

In [45]:
from sklearn.naive_bayes import MultinomialNB

In [57]:
vectorizer_4 = TfidfVectorizer( min_df=0, max_df=0.8, max_features = 200000, ngram_range = (1, 1))
train_4 = vectorizer_4.fit_transform(train['articles'])
print(train_4.shape)
model_4 = MultinomialNB(alpha=0.01).fit(train_4, train["label"])
test_4 = vectorizer_4.transform(test['articles'])
pred_4 = model_4.predict(test_4)
acc_4=accuracy_score(test['label'], pred_4)
print('Naive Bayes 1 accuracy: ', acc_4)

(2922, 54399)
Naive Bayes 1 accuracy:  0.538356164384


In [53]:
advwords = vectorizer_4.get_feature_names()
advcoeffs = model_4.coef_.tolist()[0]
advcoeffdf = pd.DataFrame({'Words' : advwords, 
                        'Coefficient' : advcoeffs})
advcoeffdf = advcoeffdf.sort_values(['Coefficient', 'Words'], ascending=[0, 1])
advcoeffdf.head(5)

Unnamed: 0,Coefficient,Words
12730,-6.79526,deal
29590,-6.935432,may
40219,-6.973518,report
5124,-6.973774,be
33051,-6.974065,not


In [54]:
advcoeffdf.tail(5)

Unnamed: 0,Coefficient,Words
54385,-15.007661,تهران
54392,-15.007661,نه
54393,-15.007661,چندان
54396,-15.007661,中国的不法
54398,-15.007661,总理家人隐秘的财富


### 2.5. Naive Bayes with Two Connected Words

In [60]:
vectorizer_5 = TfidfVectorizer( min_df=0.07, max_df=0.25, max_features = 200000, ngram_range = (2, 2))
train_5 = vectorizer_5.fit_transform(train['articles'])
print(train_5.shape)
model_5 = MultinomialNB(alpha=0.001)
model_5 = model_5.fit(train_5, train["label"])
test_5 = vectorizer_5.transform(test['articles'])
pred_5 = model_5.predict(test_5)
acc_5 = accuracy_score(test['label'], pred_5)
print('Naive Bayes 2 accuracy: ', acc_5)

(2922, 323)
Naive Bayes 2 accuracy:  0.523287671233


In [61]:
advwords = vectorizer_5.get_feature_names()
advcoeffs = model_5.coef_.tolist()[0]
advcoeffdf = pd.DataFrame({'Words' : advwords, 
                        'Coefficient' : advcoeffs})
advcoeffdf = advcoeffdf.sort_values(['Coefficient', 'Words'], ascending=[0, 1])
advcoeffdf.head(5)

Unnamed: 0,Coefficient,Words
138,-5.059054,north korea
76,-5.087397,hedge fund
234,-5.204116,the new
21,-5.205317,bank of
0,-5.230265,about the


In [62]:
advcoeffdf.tail(5)

Unnamed: 0,Coefficient,Words
166,-6.227342,people of
318,-6.232397,with new
8,-6.2372,and its
285,-6.250869,to start
141,-6.261136,note in


### 2.6. Random Forest with One Word

In [64]:
from sklearn.ensemble import RandomForestClassifier

In [73]:
vectorizer_6 = TfidfVectorizer( min_df=0.04, max_df=0.6, max_features = 200000, ngram_range = (1, 1))
train_6 = vectorizer_6.fit_transform(train['articles'])
print(train_6.shape)
model_6 = RandomForestClassifier(random_state = 12345).fit(train_6, train["label"])
test_6 = vectorizer_6.transform(test['articles'])
pred_6 = model_6.predict(test_6)
acc_6 = accuracy_score(test['label'], pred_6)
print('Random Forest 1 accuracy: ', acc_6)

(2922, 2906)
Random Forest 1 accuracy:  0.530136986301


### 2.7. Random Forest with Two Connected Words

In [76]:
vectorizer_7 = TfidfVectorizer( min_df=0.05, max_df=0.28, max_features = 200000, ngram_range = (2, 2))
train_7 = vectorizer_7.fit_transform(train['articles'])
print(train_7.shape)
model_7 = RandomForestClassifier(random_state = 12345).fit(train_7, train["label"])
test_7 = vectorizer_7.transform(test['articles'])
pred_7 = model_7.predict(test_7)
acc_7 = accuracy_score(test['label'], pred_7)
print('Random Forest 2 accuracy: ', acc_7)

(2922, 590)
Random Forest 2 accuracy:  0.53698630137


### 2.8. Gradient Boost Machine with One Word

In [77]:
from sklearn.ensemble import GradientBoostingClassifier

In [83]:
vectorizer_8 = TfidfVectorizer( min_df=0.05, max_df=0.75, 
                                             max_features = 200000, ngram_range = (1, 1))
train_8 = vectorizer_8.fit_transform(train['articles'])
print(train_8.shape)
model_8 = GradientBoostingClassifier(random_state = 12345).fit(train_8, train["label"])
test_8 = vectorizer_8.transform(test['articles'])
pred_8 = model_8.predict(test_8.toarray())
acc_8 = accuracy_score(test['label'], pred_8)
print('Gradient Boost Machine 1 accuracy: ', acc_8)

(2922, 2397)
Gradient Boost Machine 1 accuracy:  0.541095890411


### 2.9. Gradient Boost Machine with Two Connected Words

In [85]:
vectorizer_9 = TfidfVectorizer( min_df=0.07, max_df=0.21, 
                                             max_features = 200000, ngram_range = (2, 2))
train_9 = vectorizer_9.fit_transform(train['articles'])
print(train_9.shape)
model_9 = GradientBoostingClassifier(random_state=12345).fit(train_9, train["label"])
test_9 = vectorizer_9.transform(test['articles'])
pred_9 = model_9.predict(test_9.toarray())
acc_9 = accuracy_score(test['label'], pred_9)
print('Gradient Boost Machine 2 accuracy: ', acc_9)

(2922, 315)
Gradient Boost Machine 2 accuracy:  0.554794520548


### 2.10. Stochastic Gradient Descent Classifier with One Word

In [87]:
from sklearn.linear_model import SGDClassifier

In [92]:
vectorizer_10 = TfidfVectorizer( min_df=0.01, max_df=0.75,
                                             max_features = 200000, ngram_range = (1, 1))
train_10 = vectorizer_10.fit_transform(train['articles'])
print(train_10.shape)
model_10 = SGDClassifier(loss='modified_huber', n_iter=5, random_state=12345, shuffle=True)
model_10 = model_10.fit(train_10, train["label"])
test_10 = vectorizer_10.transform(test['articles'])
pred_10 = model_10.predict(test_10.toarray())
acc_10 = accuracy_score(test['label'], pred_10)
print('Stochastic Gradient Descent Classifier 1 accuracy: ', acc_10)

(2922, 7647)
Stochastic Gradient Descent Classifier 1 accuracy:  0.535616438356


### 2.11. Stochastic Gradient Descent Classifier with Two Connected Words

In [96]:
vectorizer_11 = TfidfVectorizer( min_df=0.09, max_df=0.15, 
                                             max_features = 200000, ngram_range = (2, 2))
train_11 = vectorizer_11.fit_transform(train['articles'])
print(train_11.shape)
model_11 = SGDClassifier(loss='modified_huber', n_iter=5, random_state=12345, shuffle=True)
model_11 = model_11.fit(train_11, train["label"])
test_11 = vectorizer_11.transform(test['articles'])
pred_11 = model_11.predict(test_11.toarray())
acc_11 = accuracy_score(test['label'], pred_11)
print('Stochastic Gradient Descent Classifier 2 accuracy: ', acc_11)

(2922, 127)
Stochastic Gradient Descent Classifier 2 accuracy:  0.532876712329


### 2.12. Multi-Layer Perceptron

In [100]:
from keras.preprocessing import sequence
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding

In [109]:
batch_size = 32
nb_classes = 2
vectorizer_12 = TfidfVectorizer( min_df=0.13, max_df=0.54, max_features = 200000, ngram_range = (2, 2))
train_12 = vectorizer_12.fit_transform(train['articles'])
test_12 = vectorizer_12.transform(test['articles'])
print(train_12.shape)

X_train = train_12.toarray()
X_test = test_12.toarray()

print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
y_train = np.array(train["label"])
y_test = np.array(test["label"])

Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)


# pre-processing: divide by max and subtract mean
scale = np.max(X_train)
X_train /= scale
X_test /= scale

mean = np.mean(X_train)
X_train -= mean
X_test -= mean

input_dim = X_train.shape[1]

# Here's a Deep Dumb MLP (DDMLP)
model = Sequential()
model.add(Dense(256, input_dim=input_dim))
model.add(Activation('relu'))
model.add(Dropout(0.4))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.4))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

# we'll use categorical cross-entropy for the loss, and RMSprop as the optimizer
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

print("Training...")
model.fit(X_train, Y_train, epochs=2, batch_size=16, validation_split=0.15)

print("Generating test predictions...")
pred_12 = model.predict_classes(X_test, verbose=0)
acc_12 = accuracy_score(test["label"], pred_12)

print('prediction accuracy: ', acc_12)

(2922, 107)
X_train shape: (2922, 107)
X_test shape: (730, 107)
Training...
Train on 2483 samples, validate on 439 samples
Epoch 1/2
Epoch 2/2
Generating test predictions...
prediction accuracy:  0.486301369863


### 2.13. LSTM

In [117]:
import keras
from keras.layers.recurrent import LSTM
from keras.preprocessing.text import Tokenizer

In [118]:
max_features = 10000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.1
maxlen = 200
batch_size = 32
nb_classes = 2

# vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(num_words=max_features)
train_13 = train['articles']
tokenizer.fit_on_texts(train_13)
train_13 = tokenizer.texts_to_sequences(train_13)
test_13 = tokenizer.texts_to_sequences(test['articles'])

print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(train_13, maxlen=maxlen)
X_test = sequence.pad_sequences(test_13, maxlen=maxlen)

Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)


print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(keras.layers.SpatialDropout1D(rate = 0.2))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2)) 
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(X_train, Y_train, batch_size=batch_size, epochs=3,
          validation_data=(X_test, Y_test))
score, acc = model.evaluate(X_test, Y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)


print("Generating test predictions...")
pred_13 = model.predict_classes(X_test, verbose=0)
acc_13 = accuracy_score(test['label'], pred_13)

print('prediction accuracy: ', acc_13)

Pad sequences (samples x time)
X_train shape: (2922, 200)
X_test shape: (730, 200)
Build model...
Train...
Train on 2922 samples, validate on 730 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Test score: 1.03852928354
Test accuracy: 0.48767123255
Generating test predictions...
prediction accuracy:  0.487671232877


## 3. Conclusion

In Section 2, I tried a variety of machine learning and deep learning models. All of them give a test accuracy of roughly 0.5 (between about 0.49 and 0.55), which is close to random guessing.
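
To put numbers near 0.5 in context, one can ask how many standard errors an accuracy lies from coin-flip performance on a test set of this size. The sketch below uses a normal approximation to the binomial; it assumes independent test days and a 0.5 baseline, neither of which is guaranteed here, and the function name is hypothetical.

```python
import math

def accuracy_z_score(accuracy, n_samples, baseline=0.5):
    """Standard errors between an observed accuracy and a baseline rate."""
    standard_error = math.sqrt(baseline * (1.0 - baseline) / n_samples)
    return (accuracy - baseline) / standard_error

# Toy check: 60% accuracy on 100 samples is about 2 standard errors above 0.5.
toy_z = accuracy_z_score(0.60, 100)

# Highest accuracy reported in Section 2 (Gradient Boost Machine 2), 730 test days.
gbm_z = accuracy_z_score(0.554794520548, 730)
```

If the vectorizer settings were chosen by looking at test accuracy, a score computed this way is optimistic, so it should be read as a rough guide rather than as evidence of a real signal.
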
I have spoken with some friends in the CS department who do NLP research. They told me they had tried to extract signals from financial data with NLP methods, but the signal is so weak that it averages out and disappears on a large dataset.
I think that is reasonable if they use a general-purpose vocabulary, such as one from NLTK. To find signals in news data that predict stock prices, we would need to build a vocabulary closely related to the market.
I will try some more NLP methods in the future.
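
As one possible starting point for a market-related word collection, the sketch below counts occurrences of a fixed term list in a day's text. The term list and function name are purely hypothetical placeholders, not a vetted financial lexicon.

```python
import re
from collections import Counter

# Hypothetical finance-related terms, chosen only for illustration.
FINANCE_TERMS = ("earnings", "rate", "inflation", "fed", "recession")

def lexicon_counts(text, lexicon=FINANCE_TERMS):
    """Return one count per lexicon term, in lexicon order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in lexicon]

features = lexicon_counts("Fed raises rate; rate fears grow")
```

scikit-learn's `CountVectorizer` and `TfidfVectorizer` also accept a `vocabulary` argument, which would restrict the models in Section 2 to such a list.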