# Assignment 4 - Using NLP to play the stock market

In this assignment, we'll use everything we've learned to analyze corporate news and pick stocks. Be aware that in this assignment, we're trying to beat the benchmark of random chance (aka better than 50%).

This assignment will involve building three models:

**1. An RNN based on word inputs**

**2. A CNN based on character inputs**

**3. A neural net architecture that merges the previous two models**

You will apply these models to predicting whether a stock return will be positive or negative in the same day of a news publication.

## Your X - Reuters news data

Reuters is a news outlet that reports on corporations, among many other things. Stored in the `news_reuters.csv` file is news data listed in columns. The corresponding columns are the `ticker`, `name of company`, `date of publication`, `headline`, `first sentence`, and `news category`.

In this assignment it is up to you to decide how to clean this dataset. For instance, many of the first sentences contain a location name showing where the reporting is done. This is largely irrevant information and will probably just make your data noisier. You can also choose to subset on a certain news category, which might enhance your model performance and also limit the size of your data.

## Your Y - Stock information from Yahoo! Finance

Trading data from Yahoo! Finance was collected and then normalized using the [S&P 500](https://en.wikipedia.org/wiki/S%26P_500_Index). This is stored in the `stockReturns.json` file. 

In our dataset, the ticker for the S&P is `^GSPC`. Each ticker is compared the S&P and then judged on whether it is outperforming (positive value) or under-performing (negative value) the S&P. Each value is reported on a daily interval from 2004 to now.

Below is a diagram of the data in the json file. Note there are three types of data: short: 1 day return, mid: 7 day return, long 28 day return.

```
          term (short/mid/long)
         /         |         \
   ticker A   ticker B   ticker C
      /   \      /   \      /   \
  date1 date2 date1 date2 date1 date2
```

You will need to pick a length of time to focus on (day, week, month). You are welcome to train models on each dataset as well.  

Transform the return data such that the outcome will be binary:

```
label[y < 0] = 0
label[y >= 0] = 1
```

Finally, this data needs needs to be joined on the date and ticker - For each date of news publication, we want to join the corresponding corporation's news on its return information. We make the assumption that the day's return will reflect the sentiment of the news, regardless of timing.


# Your models - RNN, CNN, and RNN+CNN

For your RNN model, it needs to be based on word inputs, embedding the word inputs, encoding them with an RNN layer, and finally a decoding step (such as softmax or some other choice).

Your CNN model will be based on characters. For reference on how to do this, look at the CNN class demonstration in the course repository.

Finally you will combine the architecture for both of these models, either [merging](https://github.com/ShadyF/cnn-rnn-classifier) using the [Functional API](https://keras.io/getting-started/functional-api-guide/) or [stacking](http://www.aclweb.org/anthology/S17-2134). See the links for reference.

For each of these models, you will need to:
1. Create a train and test set, retaining the same test set for every model
2. Show the architecture for each model, printing it in your python notebook
2. Report the peformance according to some metric
3. Compare the performance of all of these models in a table (precision and recall)
4. Look at your labeling and print out the underlying data compared to the labels - for each model print out 2-3 examples of a good classification and a bad classification. Make an assertion why your model does well or poorly on those outputs.
5. For each model, calculate the return from the three most probable positive stock returns. Compare it to the actual return. Print this information in a table.

### Good luck!

### Import Data + Clean Data

### News csv file

In [45]:
# Read in the news data
import pandas as pd

news = pd.read_csv('news_reuters.csv', header=None, encoding = "ISO-8859-1")
news.columns = ['ticker', 'company_name', 'publish_date', 'headline', 'first_sentence', 'news_category']
news.head(3)

Unnamed: 0,ticker,company_name,publish_date,headline,first_sentence,news_category
0,AA,Alcoa Corporation,20110707,Alcoa profit seen higher on aluminum price surge,* Analysts expect profit of 34 cts/shr vs yea...,topStory
1,AA,Alcoa Corporation,20110708,Global markets weekahead: Lacking conviction,LONDON Investors are unlikely to gain strong c...,normal
2,AA,Alcoa Corporation,20110708,Jobs halt Wall Street rally investors eye ear...,NEW YORK Stocks fell on Friday as a weak jobs ...,topStory


In [46]:
news['headline'] = news['headline'].str.lower()
#news['first_sentence'] = news['first_sentence']

news.head(3)

Unnamed: 0,ticker,company_name,publish_date,headline,first_sentence,news_category
0,AA,Alcoa Corporation,20110707,alcoa profit seen higher on aluminum price surge,* Analysts expect profit of 34 cts/shr vs yea...,topStory
1,AA,Alcoa Corporation,20110708,global markets weekahead: lacking conviction,LONDON Investors are unlikely to gain strong c...,normal
2,AA,Alcoa Corporation,20110708,jobs halt wall street rally investors eye ear...,NEW YORK Stocks fell on Friday as a weak jobs ...,topStory


In [47]:
from datetime import datetime, timedelta
from pandas import to_datetime

news['date'] = pd.to_datetime(news['publish_date'], format='%Y%m%d') + timedelta(days=1)
#modified_date = date + timedelta(days=1)
# = [datetime.strptime(x, '%Y%m%d%H') for x in  ]
#news['publish_date_formatted'] = datetime.strptime(news['publish_date'],'%Y%m%d')
#date = datetime.strptime(news['publish_date'], "%Y%m%d")

In [48]:
news.head(3)

Unnamed: 0,ticker,company_name,publish_date,headline,first_sentence,news_category,date
0,AA,Alcoa Corporation,20110707,alcoa profit seen higher on aluminum price surge,* Analysts expect profit of 34 cts/shr vs yea...,topStory,2011-07-08
1,AA,Alcoa Corporation,20110708,global markets weekahead: lacking conviction,LONDON Investors are unlikely to gain strong c...,normal,2011-07-09
2,AA,Alcoa Corporation,20110708,jobs halt wall street rally investors eye ear...,NEW YORK Stocks fell on Friday as a weak jobs ...,topStory,2011-07-09


### Stock Return JSON file

#### following is how i re-format the stock data. Im going to only look at "short" stock return for now, so I filted on short. I also transferred it from JSON to a csv.

In [1]:
import json
#stock_return = json.loads(stockReturns.json)
stock_return = json.load(open('stockReturns.json'))

In [3]:
stock_key = stock_return.keys()
stock_key

dict_keys(['short', 'mid', 'long'])

In [4]:
short_key = ['short']
short_stock_return = dict((k, stock_return[k]) for k in short_key if k in stock_return)

In [5]:
short_stock_return.keys()

dict_keys(['short'])

In [6]:
from pandas.io.json import json_normalize
stock_return_norm = json_normalize(short_stock_return)

In [7]:
stock_return_norm.keys()

Index(['short.AAPL.20040106', 'short.AAPL.20040107', 'short.AAPL.20040108',
       'short.AAPL.20040109', 'short.AAPL.20040113', 'short.AAPL.20040114',
       'short.AAPL.20040115', 'short.AAPL.20040116', 'short.AAPL.20040121',
       'short.AAPL.20040122',
       ...
       'short.^GSPC.20180405', 'short.^GSPC.20180406', 'short.^GSPC.20180410',
       'short.^GSPC.20180411', 'short.^GSPC.20180412', 'short.^GSPC.20180413',
       'short.^GSPC.20180417', 'short.^GSPC.20180418', 'short.^GSPC.20180419',
       'short.^GSPC.20180420'],
      dtype='object', length=908537)

In [10]:
stock_return_norm.items

<bound method DataFrame.iteritems of    short.AAPL.20040106  short.AAPL.20040107  short.AAPL.20040108  \
0              -0.0013               0.0162               0.0311   

   short.AAPL.20040109  short.AAPL.20040113  short.AAPL.20040114  \
0              -0.0089               0.0226              -0.0083   

   short.AAPL.20040115  short.AAPL.20040116  short.AAPL.20040121  \
0               -0.063              -0.0068              -0.0169   

   short.AAPL.20040122          ...           short.^GSPC.20180405  \
0              -0.0153          ...                            0.0   

   short.^GSPC.20180406  short.^GSPC.20180410  short.^GSPC.20180411  \
0                   0.0                   0.0                   0.0   

   short.^GSPC.20180412  short.^GSPC.20180413  short.^GSPC.20180417  \
0                   0.0                   0.0                   0.0   

   short.^GSPC.20180418  short.^GSPC.20180419  short.^GSPC.20180420  
0                   0.0                   0.0          

In [11]:
import csv

with open('my_file.csv', 'w') as f:
    [f.write('{0},{1}\n'.format(key, value)) for key, value in stock_return_norm.items()]

In [15]:
import pandas as pd
my_file = pd.read_csv('my_file.csv', header=None, encoding = "ISO-8859-1")
my_file2 = my_file[::2]

In [16]:
my_file2.head(3)

Unnamed: 0,0,1
0,short.AAPL.20040106,0 -0.0013
2,short.AAPL.20040107,0 0.0162
4,short.AAPL.20040108,0 0.0311


In [17]:
my_file2.to_csv("short_stocks.csv")

#### this is the end of stock data reformatting.

In [50]:
stock = pd.read_csv('short_stocks.csv', header=None, encoding = "ISO-8859-1")
stock.columns = ['stock_type', 'ticker', 'stock_date', 'stock_return']
stock.head(5)

Unnamed: 0,stock_type,ticker,stock_date,stock_return
0,short,AAPL,20040106,-0.0013
1,short,AAPL,20040107,0.0162
2,short,AAPL,20040108,0.0311
3,short,AAPL,20040109,-0.0089
4,short,AAPL,20040113,0.0226


In [51]:
from datetime import datetime, timedelta
from pandas import to_datetime

stock['date'] = pd.to_datetime(stock['stock_date'], format='%Y%m%d')

In [52]:
stock.head(5)

Unnamed: 0,stock_type,ticker,stock_date,stock_return,date
0,short,AAPL,20040106,-0.0013,2004-01-06
1,short,AAPL,20040107,0.0162,2004-01-07
2,short,AAPL,20040108,0.0311,2004-01-08
3,short,AAPL,20040109,-0.0089,2004-01-09
4,short,AAPL,20040113,0.0226,2004-01-13


### Join Stock data and News data by date and ticker

In [55]:
news.head(2)

Unnamed: 0,ticker,company_name,publish_date,headline,first_sentence,news_category,date
0,AA,Alcoa Corporation,20110707,alcoa profit seen higher on aluminum price surge,* Analysts expect profit of 34 cts/shr vs yea...,topStory,2011-07-08
1,AA,Alcoa Corporation,20110708,global markets weekahead: lacking conviction,LONDON Investors are unlikely to gain strong c...,normal,2011-07-09


In [56]:
stock.head(2)

Unnamed: 0,stock_type,ticker,stock_date,stock_return,date
0,short,AAPL,20040106,-0.0013,2004-01-06
1,short,AAPL,20040107,0.0162,2004-01-07


In [57]:
data = pd.merge(news, stock, on= ['ticker', 'date'], how='inner')

In [58]:
data.head(3)

Unnamed: 0,ticker,company_name,publish_date,headline,first_sentence,news_category,date,stock_type,stock_date,stock_return
0,CVCY,Central Valley Community Bancorp,20170125,brief-central valley community bancorp reports...,* Reports earnings results for the year and qu...,topStory,2017-01-26,short,20170126,0.0248
1,CVEO,Civeo Corporation,20170201,brief-civeo corp announces public offering of ...,* Civeo Corporation announces public offering ...,topStory,2017-02-02,short,20170202,-0.1028
2,CVEO,Civeo Corporation,20170202,brief-civeo announces public offering of 20 ml...,* Civeo Corporation announces pricing of publi...,topStory,2017-02-03,short,20170203,0.0019


In [60]:
data = data[['ticker','company_name', 'date', 'headline', 'first_sentence', 'stock_return']]

In [95]:
data.head(3)

Unnamed: 0,ticker,company_name,date,headline,first_sentence,stock_return
0,CVCY,Central Valley Community Bancorp,2017-01-26,brief-central valley community bancorp reports...,* Reports earnings results for the year and qu...,0.0248
1,CVEO,Civeo Corporation,2017-02-02,brief-civeo corp announces public offering of ...,* Civeo Corporation announces public offering ...,-0.1028
2,CVEO,Civeo Corporation,2017-02-03,brief-civeo announces public offering of 20 ml...,* Civeo Corporation announces pricing of publi...,0.0019


In [97]:
import numpy as np

data['stock_label'] = np.where(data['stock_return'] < 0, 0, 1)

data.head(3)

Unnamed: 0,ticker,company_name,date,headline,first_sentence,stock_return,stock_label
0,CVCY,Central Valley Community Bancorp,2017-01-26,brief-central valley community bancorp reports...,* Reports earnings results for the year and qu...,0.0248,1
1,CVEO,Civeo Corporation,2017-02-02,brief-civeo corp announces public offering of ...,* Civeo Corporation announces public offering ...,-0.1028,0
2,CVEO,Civeo Corporation,2017-02-03,brief-civeo announces public offering of 20 ml...,* Civeo Corporation announces pricing of publi...,0.0019,1


In [61]:
data.shape

(7828, 6)

### Now we have the dataset (with news and stock return as a binary label)

In [101]:
data.head(10)

Unnamed: 0,ticker,company_name,date,headline,first_sentence,stock_return,stock_label
0,CVCY,Central Valley Community Bancorp,2017-01-26,brief-central valley community bancorp reports...,* Reports earnings results for the year and qu...,0.0248,1
1,CVEO,Civeo Corporation,2017-02-02,brief-civeo corp announces public offering of ...,* Civeo Corporation announces public offering ...,-0.1028,0
2,CVEO,Civeo Corporation,2017-02-03,brief-civeo announces public offering of 20 ml...,* Civeo Corporation announces pricing of publi...,0.0019,1
3,CVEO,Civeo Corporation,2017-02-15,brief-renaissance technologies reports 6.2 pct...,* Renaissance Technologies LLC reports 6.20 pe...,0.0136,1
4,CVEO,Civeo Corporation,2017-02-22,brief-civeo corporation announces amendment to...,* Civeo Corp- Under amended credit facility C...,-0.002,0
5,CVLY,Codorus Valley Bancorp Inc,2017-01-20,brief-codorus valley bancorp qtrly earnings pe...,* Codorus Valley Bancorp Inc reports earnings ...,-0.0234,0
6,CWH,Camping World Holdings Inc,2016-11-11,brief-camping world announces refinancing of s...,* Camping World Holdings - refinanced senior s...,0.0793,1
7,CWH,Camping World Holdings Inc,2016-11-23,brief-camping world announces acquisition of t...,* Camping World announces acquisition of Thomp...,0.0336,1
8,CZR,Caesars Entertainment Corporation,2012-08-16,text-s&p revises caesars entertainment corp ra...,-- U.S. casino operator Caesars Entertainment ...,-0.0095,0
9,CZR,Caesars Entertainment Corporation,2012-08-28,text-s&p revises caesars linq llc and caesars ...,Overview -- We recently revised our ratin...,-0.0627,0


In [100]:
# Find the number of times each word was used and the size of the vocabulary
word_counts = {}

for news in data['headline']:
    for word in data['headline']:
        if word not in word_counts:
            word_counts[word] = 1
        else:
            word_counts[word] += 1
            
print("Size of Vocabulary:", len(word_counts))

Size of Vocabulary: 6738


In [268]:
data.to_csv("SJ_HW4_formatted.csv")

## **

In [1]:
import pandas as pd

data = pd.read_csv('SJ_HW4_formatted.csv', encoding = "ISO-8859-1")

In [3]:
data.head(3)

Unnamed: 0.1,Unnamed: 0,ticker,company_name,date,headline,first_sentence,stock_return,stock_label
0,0,CVCY,Central Valley Community Bancorp,1/26/17,brief-central valley community bancorp reports...,* Reports earnings results for the year and qu...,0.0248,1
1,1,CVEO,Civeo Corporation,2/2/17,brief-civeo corp announces public offering of ...,* Civeo Corporation announces public offering ...,-0.1028,0
2,2,CVEO,Civeo Corporation,2/3/17,brief-civeo announces public offering of 20 ml...,* Civeo Corporation announces pricing of publi...,0.0019,1


In [4]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, 
                               test_size = .13, 
                               random_state=25)

print(train.shape, test.shape)

(6810, 8) (1018, 8)


### Train set and Test set are ready!!

### w2v

In [97]:
#split the dataset into train (85%) and test (15%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['headline'], 
                                                    data['stock_label'], 
                                                    test_size = .13, 
                                                    random_state=25)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(6810,) (1018,) (6810,) (1018,)


In [98]:
# Building word2vec model
from nltk.tokenize import PunktSentenceTokenizer
import string

sentences = []
for item in data['headline']:
    sentences.extend([[w.translate(str.maketrans('','',string.punctuation)).strip().lower() for w in sent.split()]\
                      for sent in PunktSentenceTokenizer().tokenize(item)])

In [100]:
import gensim

w2v_model = gensim.models.Word2Vec (sentences, size=150, window=10, min_count=2, workers=10)
w2v_model.train(sentences,total_examples=len(sentences),epochs=10)
w2v = dict(zip(w2v_model.wv.index2word, w2v_model.wv.syn0))

  """


In [101]:
# Checking word2vec model
w2v_model.similar_by_word('public')

  


[('direct', 0.995242714881897),
 ('charter', 0.9940692186355591),
 ('updates', 0.9937134385108948),
 ('briefdiana', 0.9928981065750122),
 ('puerto', 0.9928292632102966),
 ('disposal', 0.9926966428756714),
 ('approved', 0.9925059080123901),
 ('rico', 0.992316722869873),
 ('continuation', 0.9921209216117859),
 ('support', 0.9917473196983337)]

In [102]:
type(y_train)

pandas.core.series.Series

In [103]:
y_train.shape

(6810,)

In [107]:
#Build a neural net model using word2vec embeddings (both pretrained and within an Embedding layer from Keras
import numpy as np
import keras

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, Flatten
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

max_words = 250
batch_size = 64
epochs = 20

num_classes = 2
print(num_classes, 'classes')

print('Vectorizing sequence data...')
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)
x_train = tokenizer.texts_to_matrix(X_train)
x_test = tokenizer.texts_to_matrix(X_test)
print('x_train shape:', X_train.shape)
print('x_test shape:', X_test.shape)

# Borrow our binarized labels from the previous model
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

2 classes
Vectorizing sequence data...
x_train shape: (6810,)
x_test shape: (1018,)
y_train shape: (6810,)
y_test shape: (1018,)


In [110]:
x_train[0]

array([0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

In [112]:
train_label = train['stock_label']
train_label = train_label.reshape((6810, 1))
type(train_label)

  


numpy.ndarray

In [113]:
train_label[0]

array([0])

#### Now we have the train set and the test set

In [108]:
from nltk.tokenize import PunktSentenceTokenizer
import string

sentences = []
for item in data['headline']:
    sentences.extend([[w.translate(str.maketrans('','',string.punctuation)).strip().lower() for w in sent.split()]\
                      for sent in PunktSentenceTokenizer().tokenize(item)])

In [132]:
sentences[15:18]

[['textsp', 'rates', 'caesars', 'entertainments', 'proposed', 'notes', 'b'],
 ['briefboyd', 'and', 'caesars', 'jump', 'in', 'afternoon', 'trading'],
 ['textfitch', 'imminent', 'online', 'gaming', 'in', 'nj']]

In [110]:
import gensim

model = gensim.models.Word2Vec (sentences, size=150, window=10, min_count=2, workers=10)
model.train(sentences,total_examples=len(sentences),epochs=10)
w2v = dict(zip(model.wv.index2word, model.wv.syn0))

  """


In [114]:
# Checking word2vec model
model.similar_by_word('public')

  


[('briefdiana', 0.9951826930046082),
 ('notes', 0.9946377277374268),
 ('fund', 0.9930644035339355),
 ('offering', 0.9923483729362488),
 ('growing', 0.9918162226676941),
 ('support', 0.9897124171257019),
 ('direct', 0.9886226058006287),
 ('holdings', 0.9884518384933472),
 ('advisers', 0.9878647327423096),
 ('completes', 0.987798810005188)]

In [137]:
#Build a neural net model using word2vec embeddings (both pretrained and within an Embedding layer from Keras
import numpy as np
import keras

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, Flatten
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

max_words = 214
batch_size = 64
epochs = 20

num_classes = 2
print(num_classes, 'classes')

print('Vectorizing sequence data...')
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)
x_train = tokenizer.texts_to_matrix(X_train)
x_test = tokenizer.texts_to_matrix(X_test)
print('x_train shape:', X_train.shape)
print('x_test shape:', X_test.shape)

# Borrow our binarized labels from the previous model
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

2 classes
Vectorizing sequence data...
x_train shape: (6810,)
x_test shape: (1018,)
y_train shape: (6810,)
y_test shape: (1018,)


In [138]:
x_train.shape

(6810, 400)

In [145]:
## Get length of longest sequence
max_seq_len = max([len(idx_seq) for idx_seq in data['headline']])
print("Max length in headline is: ", max_seq_len)

Max length in headline is:  214


In [139]:
x_train[13:14, :]

array([[0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 

In [180]:
y_train[13:14]

6361    1
Name: stock_label, dtype: int64

#### tokenize

In [148]:
from nltk.tokenize import word_tokenize
X_train_tokenized = [word_tokenize(x) for x in X_train]
X_test_tokenized = [word_tokenize(x) for x in X_test]

In [149]:
from nltk.corpus import stopwords
import re
from nltk.stem import WordNetLemmatizer

stop = stopwords.words('english')

def hasNumbers(inputString):
    return bool(re.search(r'\d', inputString))
def isSymbol(inputString):
    return bool(re.match(r'[^\w]', inputString))

wordnet_lemmatizer = WordNetLemmatizer()

def check(word):
    if word in stop:
        return False
    elif hasNumbers(word) or isSymbol(word):
        return False
    else:
        return True
    
def preprocessing(sen):
    res = []
    for word in sen:  
        if check(word): 
            word = word.lower().replace(".", '').replace('"', '').replace("'", '')
            res.append(wordnet_lemmatizer.lemmatize(word))
    return res

In [150]:
X_train_processed = [preprocessing(x) for x in X_train_tokenized]
X_test_processed = [preprocessing(x) for x in X_test_tokenized]

In [153]:
X_train_processed[0]
X_test_processed[0]

['ge', 'mammography', 'device', 'get', 'u', 'fda', 'approval']

In [156]:
from gensim.models.word2vec import Word2Vec
model_w2v = Word2Vec(X_train_processed, size=400, window=5, min_count=5, workers=4)

In [157]:
model_w2v.wv['public']

array([ 0.03800208, -0.13542098, -0.05232017,  0.05867366,  0.03618153,
        0.05351061, -0.01684152, -0.04570716,  0.06966811,  0.03118808,
        0.01305857, -0.04003138,  0.01996516,  0.00277247,  0.01923244,
        0.1010325 , -0.00668759,  0.05723053,  0.05684376,  0.07368633,
       -0.00032078, -0.00496446, -0.05285033, -0.0207696 , -0.06006723,
        0.02183884,  0.06949441, -0.04099743,  0.03313633, -0.04996708,
        0.017782  ,  0.08012734, -0.02781224, -0.10544675,  0.00986416,
       -0.01352925,  0.1012539 ,  0.0549473 , -0.06386796,  0.00466527,
       -0.11189996,  0.01975877,  0.05812088,  0.11149258, -0.01931848,
       -0.04613898, -0.02669925, -0.0267346 , -0.0046218 , -0.01631208,
        0.03660669,  0.12000056, -0.03243194,  0.04035706, -0.05602781,
        0.00073939,  0.00186324,  0.03259039,  0.03362983,  0.07117146,
       -0.12409593, -0.05546589, -0.02826298, -0.02625254,  0.00727845,
       -0.01028248,  0.05686618, -0.08262352,  0.02459825,  0.08

In [162]:
vocab = model_w2v.wv.vocab

def get_vector(word_list):
    res = np.zeros([400])
    count=0
    for word in word_list:
        res += model[word]
        count += 1
    return res/count

In [163]:
X_train_w2v = [get_vector(x) for x in X_train_processed]
X_test_w2v = [get_vector(x) for x in X_test_processed]

  import sys


KeyError: "word 'aisle' not in vocabulary"

In [191]:
import collections

def build_vocab(filename):
    data = X_train

    counter = collections.Counter(data)
    count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))

    words, _ = list(zip(*count_pairs))
    word_to_id = dict(zip(words, range(len(words))))

    return word_to_id

In [192]:
word_to_id = build_vocab(x_train)
vocabulary = len(word_to_id)

In [193]:
vocabulary

5974

In [204]:
def file_to_word_ids(filename, word_to_id):
    data = X_train
    return [word_to_id[word] for word in data if word in word_to_id]

In [206]:
train_data = file_to_word_ids(X_train, word_to_id)

In [210]:
len(train_data)

6810

## Model 1: RNN

### create lexicons + index 

In [5]:
print(train.shape, test.shape)

(6810, 8) (1018, 8)


In [6]:
from nltk.tokenize import word_tokenize
train['text_tokenized'] = [word_tokenize(x) for x in train['headline']]
test['text_tokenized'] = [word_tokenize(x) for x in test['headline']]

data['headline_tokenized'] = [word_tokenize(x) for x in data['headline']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [7]:
train.head(3)

Unnamed: 0.1,Unnamed: 0,ticker,company_name,date,headline,first_sentence,stock_return,stock_label,text_tokenized
518,518,DIS,Walt Disney Company (The),10/6/11,wal-mart goes back to basics in holiday toy ai...,NORTH BERGEN New Jersey Wal-Mart is taking cu...,-0.0018,0,"[wal-mart, goes, back, to, basics, in, holiday..."
5204,5204,JD,JDcom Inc,5/13/15,alibaba rolls out 3-hour delivery service for ...,BEIJING May 12 Chinese online shopping giant ...,0.0,1,"[alibaba, rolls, out, 3-hour, delivery, servic..."
6516,6516,MAR,Marriott International,10/7/16,corrected-update 1-marriott expands in south a...,JOHANNESBURG Oct 6 Marriott International sa...,-0.0022,0,"[corrected-update, 1-marriott, expands, in, so..."


In [8]:
#create lexicon

import pickle

def make_lexicon(token_seqs, min_freq=1):
    # First, count how often each word appears in the text.
    token_counts = {}
    for news in token_seqs:
        for word in news:
            if word in token_counts:
                token_counts[word] += 1
            else:
                token_counts[word] = 1

    # Then, assign each word to a numerical index. Filter words that occur less than min_freq times.
    lexicon = [token for token, count in token_counts.items() if count >= min_freq]
    # Indices start at 1. 0 is reserved for padding, and 1 is reserved for unknown words.
    lexicon = {token:idx + 2 for idx,token in enumerate(lexicon)}
    lexicon[u'<UNK>'] = 1 # Unknown words are those that occur fewer than min_freq times
    lexicon_size = len(lexicon)

    print("LEXICON SAMPLE ({} total items):".format(len(lexicon)))
    print(dict(list(lexicon.items())[:20]))
    
    return lexicon

In [9]:
print("WORDS:")
words_lexicon = make_lexicon(data['headline_tokenized'])

WORDS:
LEXICON SAMPLE (8912 total items):
{'brief-central': 2, 'valley': 3, 'community': 4, 'bancorp': 5, 'reports': 6, 'q4': 7, 'eps': 8, '$': 9, '0.21': 10, 'brief-civeo': 11, 'corp': 12, 'announces': 13, 'public': 14, 'offering': 15, 'of': 16, 'common': 17, 'shares': 18, '20': 19, 'mln': 20, 'brief-renaissance': 21}


In [10]:
def tokens_to_idxs(token_seqs, lexicon):
    #idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for token in token_seq]  
    #                                                                 for token_seq in token_seqs]
    #idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] ] for token in token_seqs]  
    idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for token in token_seq]  
                                                                     for token_seq in token_seqs]                                                                 
    return idx_seqs

In [11]:
#train['Word_Idxs'] = tokens_to_idxs(train['Tokenized_Words'], words_lexicon)
train['text_tokenized_idx'] = tokens_to_idxs(train['text_tokenized'], words_lexicon)
train[['text_tokenized', 'text_tokenized_idx', 'stock_label']][:10]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,text_tokenized,text_tokenized_idx,stock_label
518,"[wal-mart, goes, back, to, basics, in, holiday...","[1154, 1155, 1129, 31, 1156, 27, 1015, 1157, 1...",0
5204,"[alibaba, rolls, out, 3-hour, delivery, servic...","[4637, 1026, 126, 6810, 4573, 1427, 92, 1021, ...",1
6516,"[corrected-update, 1-marriott, expands, in, so...","[530, 7862, 1001, 27, 197, 3156, 448, 1426, 11...",0
6581,"[two, mediobanca, board, members, quit, in, li...","[1519, 7898, 162, 586, 3617, 27, 2092, 448, 11...",1
5080,"[update, 1-big, hedge, funds, shopped, at, j.c...","[110, 4330, 327, 328, 6693, 145, 3024, 3025, 2...",1
3168,"[update, 2-apple, plans, fix, next, week, for,...","[110, 4338, 130, 4935, 2145, 540, 92, 2665, 49...",1
1928,"[ge, points, way, for, 'too, big, to, fail, ',...","[2959, 3548, 3363, 92, 3549, 836, 31, 983, 81,...",0
2596,"[refile-update, 2-gamestop, 's, forecast, miss...","[1549, 4258, 72, 2818, 2682, 68, 1184, 4257, 1...",0
5656,"[update, 1-banking, venture, nbnk, in, talks, ...","[110, 7205, 263, 7188, 27, 218, 244, 45]",1
4747,"[fitch, downgrades, j.c., penney, on, strategy...","[185, 208, 3024, 3025, 68, 3200, 905]",1


In [12]:
### then padding
from keras.preprocessing.sequence import pad_sequences

def pad_idx_seqs(idx_seqs, max_seq_len):
    # Keras provides a convenient padding function; 
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len)
    return padded_idxs

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
  return f(*args, **kwds)


In [13]:
max_seq_len = max([len(idx_seq) for idx_seq in train['text_tokenized_idx']]) # Get length of longest sequence
train_padded_words = pad_idx_seqs(train['text_tokenized_idx'], 
                                  max_seq_len + 1) #Add one to max length for offsetting sequence by 1

print("WORDS:\n", train_padded_words)
print("SHAPE:", train_padded_words.shape, "\n")

WORDS:
 [[   0    0    0 ... 1015 1157 1158]
 [   0    0    0 ...   92 1021 2155]
 [   0    0    0 ... 1426  113 6995]
 ...
 [   0    0    0 ...  264  162  703]
 [   0    0    0 ...   16  446   76]
 [   0    0    0 ... 1241   31 1532]]
SHAPE: (6810, 33) 



In [134]:
from keras.models import Model, Input
from keras.models import Sequential
from keras.layers.recurrent import SimpleRNN
from keras.layers import Input, Concatenate, TimeDistributed, Dense
from keras.layers import Activation,Conv1D,Dense,Embedding,Input,Dropout,LSTM,Bidirectional,MaxPooling1D,Flatten
from keras.preprocessing.sequence import pad_sequences
from keras.initializers import RandomUniform

hidden_size = 500
max_length = 33
num_steps = 30
vocabulary = 5974
n_class = 1
n_word_input_nodes=len(words_lexicon) + 1, #Add one for 0 padding

model_rnn = Sequential()
model_rnn.add(Embedding(input_dim=n_word_input_nodes[0],
                     input_length=max_length,
                     output_dim=20, 
                     mask_zero=True))
#model_rnn.add(Flatten())
model_rnn.add(SimpleRNN(100, return_sequences=False))
model_rnn.add(Dropout(0.25))
model_rnn.add(Dense(n_class, activation="sigmoid"))

#model_rnn.add(Embedding(max_words, 100, input_length= x_train.shape[1]))
#model_rnn.add(Flatten())
#model_rnn.add(LSTM(hidden_size, return_sequences=True))
#model_rnn.add(Dense(256, input_shape=(max_words,)))
#model_rnn.add(Activation('relu'))
#model_rnn.add(Dropout(0.5))


#model_rnn.add(TimeDistributed(Dense(n_class, activation = 'softmax')))

#model_rnn.add(LSTM(hidden_size, return_sequences=True))
#model_rnn.add(LSTM(hidden_size, return_sequences=True))
#if use_dropout:
#    model_rnn.add(Dropout(0.5))
#model_rnn.add(TimeDistributed(Dense(vocabulary)))
#model_rnn.add(Flatten())
#model_rnn.add(Activation('softmax'))

model_rnn.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_43 (Embedding)     (None, 33, 20)            178260    
_________________________________________________________________
simple_rnn_11 (SimpleRNN)    (None, 100)               12100     
_________________________________________________________________
dropout_22 (Dropout)         (None, 100)               0         
_________________________________________________________________
dense_27 (Dense)             (None, 1)                 101       
Total params: 190,461
Trainable params: 190,461
Non-trainable params: 0
_________________________________________________________________


In [141]:
from keras.optimizers import Adam

model_rnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [75]:
train_padded_words[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0, 1154, 1155, 1129,   31, 1156,   27, 1015, 1157, 1158],
      dtype=int32)

In [88]:
len(train['stock_label'])

6810

In [89]:
train_label = train['stock_label']
train_label = train_label.reshape((6810, 1))
type(train_label)

  


numpy.ndarray

In [92]:
train_label[:2]

array([[0],
       [1]])

In [142]:
rnn_fit = model_rnn.fit(x = train_padded_words, #y = train['stock_label'],
                        y = train_label,
                        batch_size=32,
                        epochs=5,
                        verbose=1,
                        validation_split=0.1)

Train on 6129 samples, validate on 681 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [158]:
##process the test set
test['text_tokenized_idx'] = tokens_to_idxs(test['text_tokenized'], words_lexicon)

#max_seq_len = max([len(idx_seq) for idx_seq in train['text_tokenized_idx']]) # Get length of longest sequence
test_padded_words = pad_idx_seqs(test['text_tokenized_idx'], 
                                  max_seq_len + 1) #Add one to max length for offsetting sequence by 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [157]:
max_seq_len = max([len(idx_seq) for idx_seq in train['text_tokenized_idx']])
max_seq_len

32

In [159]:
len(test_padded_words[0])

33

In [160]:
test_label = test['stock_label']
test_label = test_label.reshape((1018, 1))
type(test_label)

  


numpy.ndarray

In [162]:
score = model_rnn.evaluate(test_padded_words, test_label,
                       batch_size=batch_size, verbose=1)

print('Test score:', score[0])
print('Test accuracy:', score[1])

Test score: 0.8676007067993254
Test accuracy: 0.6905697442459451


In [167]:
rnn_pred = model_rnn.predict(test_padded_words)

In [171]:
test['rnn_pred'] = rnn_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [177]:
def label_recode(series):
    if series < 0.5 :
        return 0
    elif series >= 0.5 :
        return 1
    
test['rnn_pred'] = test['rnn_pred'].apply(label_recode)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


In [337]:
from sklearn.metrics import average_precision_score, recall_score

rnn_precision = average_precision_score(test['stock_label'], test['rnn_pred'])
rnn_recall = recall_score(test['stock_label'], test['rnn_pred'], average="macro")
print('Average precision-recall score: {0:0.2f}'.format(
      rnn_precision))
print('Recall Score: {0:0.2f}'.format(
      rnn_recall))

Average precision-recall score: 0.61
Recall Score: 0.69


In [178]:
test.head(7)

Unnamed: 0.1,Unnamed: 0,ticker,company_name,date,headline,first_sentence,stock_return,stock_label,text_tokenized,text_tokenized_idx,rnn_pred
1790,1790,GEK,General Electric Capital Corporation,9/4/14,ge's 3d mammography device gets u.s. fda appro...,Sept 2 General Electric Co's healthcare unit l...,-0.0052,0,"[ge, 's, 3d, mammography, device, gets, u.s., ...","[2959, 72, 1087, 3408, 3409, 382, 219, 1927, 203]",0
5550,5550,LW,Lamb Weston Holdings Inc,1/11/17,brief-lamb weston reports fiscal q2 2017 eps $...,* Q2 earnings per share view $0.55 -- Thomson ...,-0.0014,0,"[brief-lamb, weston, reports, fiscal, q2, 2017...","[7077, 7078, 6, 1524, 1805, 1871, 8, 9, 7079]",1
4639,4639,IRMD,iRadimed Corporation,2/7/17,brief-iradimed corp q4 non-gaap eps $0.11,* Iradimed Corporation announces fourth quarte...,-0.0292,0,"[brief-iradimed, corp, q4, non-gaap, eps, $, 0...","[6305, 12, 7, 4265, 8, 9, 6306]",1
1094,1094,EQT,EQT Corporation,5/3/16,brief-eqt intends to commence public offering ...,* Intends to commence a registered public offe...,-0.0084,0,"[brief-eqt, intends, to, commence, public, off...","[2053, 2286, 31, 2287, 14, 15, 16, 472, 20, 18]",0
4084,4084,IBM,International Business Machines Corporation,12/4/13,google takes on amazon by cutting cloud servic...,SAN FRANCISCO Dec 3 Google Inc will lower pri...,-0.0007,0,"[google, takes, on, amazon, by, cutting, cloud...","[2756, 959, 68, 2975, 360, 3660, 1626, 1427, 1...",0
1318,1318,FOX,Twenty-First Century Fox Inc,10/22/15,australia waves through news corp buy-in to st...,SYDNEY Oct 22 Australia's antitrust regulator...,0.0018,1,"[australia, waves, through, news, corp, buy-in...","[2122, 2707, 2275, 882, 12, 2708, 31, 2709, 27...",1
3922,3922,IBM,International Business Machines Corporation,5/31/12,text-s&p raises ibm ratings,Overview\t -- U.S. technology and solutio...,-0.0062,0,"[text-s, &, p, raises, ibm, ratings]","[49, 50, 51, 693, 5288, 213]",0


## Model 2: CNN

In [187]:
test_padded_words[:2]

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0, 2959,   72, 1087, 3408, 3409,  382,  219, 1927,  203],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0, 7077, 7078,    6, 1524, 1805, 1871,    8,    9, 7079]],
      dtype=int32)

In [188]:
train_label[:2]

array([[0],
       [1]])

In [189]:
train.head(2)

Unnamed: 0.1,Unnamed: 0,ticker,company_name,date,headline,first_sentence,stock_return,stock_label,text_tokenized,text_tokenized_idx
518,518,DIS,Walt Disney Company (The),10/6/11,wal-mart goes back to basics in holiday toy ai...,NORTH BERGEN New Jersey Wal-Mart is taking cu...,-0.0018,0,"[wal-mart, goes, back, to, basics, in, holiday...","[1154, 1155, 1129, 31, 1156, 27, 1015, 1157, 1..."
5204,5204,JD,JDcom Inc,5/13/15,alibaba rolls out 3-hour delivery service for ...,BEIJING May 12 Chinese online shopping giant ...,0.0,1,"[alibaba, rolls, out, 3-hour, delivery, servic...","[4637, 1026, 126, 6810, 4573, 1427, 92, 1021, ..."


In [192]:
from functools import reduce
cnn_vocab = sorted(reduce(lambda x, y: x | y, (set(words) for words in train['text_tokenized'])))
len(cnn_vocab)

8415

In [197]:
import operator
from collections import Counter

def word_freq(Xs, num):
    all_words = [words.lower() for sentences in train['text_tokenized'] for words in sentences]
    sorted_vocab = sorted(dict(Counter(all_words)).items(), key=operator.itemgetter(1))
    final_vocab = [k for k,v in sorted_vocab if v>num]
    word_idx = dict((c, i + 1) for i, c in enumerate(final_vocab))
    return final_vocab, word_idx

final_vocab, word_idx = word_freq(train['text_tokenized'],2)
vocab_len = len(final_vocab) # Finally we have 3254 words!

print(vocab_len)

3254


In [200]:
final_vocab[5:8]

['blood', 'returning', 'authors']

In [208]:
test_padded_words.shape

(1018, 33)

In [235]:
vocabulary_size = 20000
tokenizer = Tokenizer(num_words= vocabulary_size)
tokenizer.fit_on_texts(train['headline'])

sequences = tokenizer.texts_to_sequences(train['headline'])
x_train_cnn = pad_sequences(sequences, maxlen=50)

In [236]:
x_train_cnn.shape

(6810, 50)

In [292]:
from keras.layers.core import Activation, Flatten, Dropout, Dense
from keras.layers import Convolution1D, Convolution2D
from keras.layers.pooling import MaxPooling1D
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam, rmsprop

hidden_size = 500
max_length = 33
num_steps = 30
vocabulary_size = 20000
n_class = 1
n_word_input_nodes=len(words_lexicon) + 1, #Add one for 0 padding
kernel_size = 3
nb_filter = 250
filter_length = 3

model_cnn = Sequential()
model_cnn.add(Embedding(vocabulary_size, 100, input_length=33))
model_cnn.add(Dropout(0.2))
model_cnn.add(Conv1D(64, 5, activation='relu'))
model_cnn.add(MaxPooling1D(pool_size=4))
#model_cnn.add(LSTM(100))
model_cnn.add(Flatten())
model_cnn.add(Dense(1, activation='sigmoid'))

#model_cnn = Sequential()
#model_cnn.add(Embedding(input_dim=n_word_input_nodes[0],
#                     input_length=max_length,
#                     output_dim=20, 
#                     mask_zero=True))
#model_cnn.add(Dropout(0.25))
#model_cnn.add(Convolution1D(nb_filter=nb_filter,
#                        filter_length=filter_length,
#                        border_mode='valid',
#                        activation='relu',
#                        subsample_length=1))
#model_cnn.add(Convolution1D(padding="same", kernel_size=3, filters=32))
#model_cnn.add(Convolution1D(nb_filter=nb_filter, filter_length=filter_kernels[1],
#                      border_mode='valid', activation='relu'))
#model_cnn.add(GlobalMaxPooling1D())
#model_cnn.add(Dense(n_class, activation="sigmoid"))

model_cnn.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_82 (Embedding)     (None, 33, 100)           2000000   
_________________________________________________________________
dropout_64 (Dropout)         (None, 33, 100)           0         
_________________________________________________________________
conv1d_39 (Conv1D)           (None, 29, 64)            32064     
_________________________________________________________________
max_pooling1d_22 (MaxPooling (None, 7, 64)             0         
_________________________________________________________________
flatten_17 (Flatten)         (None, 448)               0         
_________________________________________________________________
dense_61 (Dense)             (None, 1)                 449       
Total params: 2,032,513
Trainable params: 2,032,513
Non-trainable params: 0
_________________________________________________________________


In [293]:
from keras.optimizers import Adam

model_cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [294]:
cnn_fit = model_cnn.fit(x = train_padded_words, #y = train['stock_label'],
                        y = train_label,
                        batch_size=32,
                        epochs=5,
                        verbose=1,
                        validation_split=0.1)

Train on 6129 samples, validate on 681 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [306]:
cnn_score = model_cnn.evaluate(test_padded_words, test_label,
                       batch_size=batch_size, verbose=1)

print('Test score:', cnn_score[0])
print('Test accuracy:', cnn_score[1])

Test score: 0.7459681071560603
Test accuracy: 0.7445972492746379


In [296]:
cnn_pred = model_cnn.predict(test_padded_words)

In [297]:
test['cnn_pred'] = cnn_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [298]:
def label_recode(series):
    if series < 0.5 :
        return 0
    elif series >= 0.5 :
        return 1
    
test['cnn_pred'] = test['cnn_pred'].apply(label_recode)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


In [339]:
from sklearn.metrics import average_precision_score, recall_score

cnn_precision = average_precision_score(test['stock_label'], test['cnn_pred'])
cnn_recall = recall_score(test['stock_label'], test['cnn_pred'], average="macro")
print('Average precision-recall score: {0:0.2f}'.format(
      cnn_precision))
print('Recall Score: {0:0.2f}'.format(
      cnn_recall))

Average precision-recall score: 0.66
Recall Score: 0.75


In [299]:
test.head(7)

Unnamed: 0.1,Unnamed: 0,ticker,company_name,date,headline,first_sentence,stock_return,stock_label,text_tokenized,text_tokenized_idx,rnn_pred,cnn_pred
1790,1790,GEK,General Electric Capital Corporation,9/4/14,ge's 3d mammography device gets u.s. fda appro...,Sept 2 General Electric Co's healthcare unit l...,-0.0052,0,"[ge, 's, 3d, mammography, device, gets, u.s., ...","[2959, 72, 1087, 3408, 3409, 382, 219, 1927, 203]",0,0
5550,5550,LW,Lamb Weston Holdings Inc,1/11/17,brief-lamb weston reports fiscal q2 2017 eps $...,* Q2 earnings per share view $0.55 -- Thomson ...,-0.0014,0,"[brief-lamb, weston, reports, fiscal, q2, 2017...","[7077, 7078, 6, 1524, 1805, 1871, 8, 9, 7079]",1,1
4639,4639,IRMD,iRadimed Corporation,2/7/17,brief-iradimed corp q4 non-gaap eps $0.11,* Iradimed Corporation announces fourth quarte...,-0.0292,0,"[brief-iradimed, corp, q4, non-gaap, eps, $, 0...","[6305, 12, 7, 4265, 8, 9, 6306]",1,1
1094,1094,EQT,EQT Corporation,5/3/16,brief-eqt intends to commence public offering ...,* Intends to commence a registered public offe...,-0.0084,0,"[brief-eqt, intends, to, commence, public, off...","[2053, 2286, 31, 2287, 14, 15, 16, 472, 20, 18]",0,0
4084,4084,IBM,International Business Machines Corporation,12/4/13,google takes on amazon by cutting cloud servic...,SAN FRANCISCO Dec 3 Google Inc will lower pri...,-0.0007,0,"[google, takes, on, amazon, by, cutting, cloud...","[2756, 959, 68, 2975, 360, 3660, 1626, 1427, 1...",0,0
1318,1318,FOX,Twenty-First Century Fox Inc,10/22/15,australia waves through news corp buy-in to st...,SYDNEY Oct 22 Australia's antitrust regulator...,0.0018,1,"[australia, waves, through, news, corp, buy-in...","[2122, 2707, 2275, 882, 12, 2708, 31, 2709, 27...",1,0
3922,3922,IBM,International Business Machines Corporation,5/31/12,text-s&p raises ibm ratings,Overview\t -- U.S. technology and solutio...,-0.0062,0,"[text-s, &, p, raises, ibm, ratings]","[49, 50, 51, 693, 5288, 213]",0,1


## Model 3: RNN+CNN

In [301]:
from keras.layers.core import Activation, Flatten, Dropout, Dense
from keras.layers import Convolution1D, Convolution2D
from keras.layers.pooling import MaxPooling1D
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam, rmsprop

hidden_size = 500
max_length = 33
num_steps = 30
vocabulary_size = 20000
n_class = 1
n_word_input_nodes=len(words_lexicon) + 1, #Add one for 0 padding
kernel_size = 3
nb_filter = 250
filter_length = 3

model_3 = Sequential()
model_3.add(Embedding(vocabulary_size, 100, input_length=33))
model_3.add(Dropout(0.2))
model_3.add(Conv1D(64, 5, activation='relu'))
model_3.add(MaxPooling1D(pool_size=4))
model_3.add(SimpleRNN(100, return_sequences=False))
model_3.add(Dense(1, activation='sigmoid'))

model_3.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_84 (Embedding)     (None, 33, 100)           2000000   
_________________________________________________________________
dropout_66 (Dropout)         (None, 33, 100)           0         
_________________________________________________________________
conv1d_41 (Conv1D)           (None, 29, 64)            32064     
_________________________________________________________________
max_pooling1d_24 (MaxPooling (None, 7, 64)             0         
_________________________________________________________________
simple_rnn_19 (SimpleRNN)    (None, 100)               16500     
_________________________________________________________________
dense_63 (Dense)             (None, 1)                 101       
Total params: 2,048,665
Trainable params: 2,048,665
Non-trainable params: 0
_________________________________________________________________


In [302]:
from keras.optimizers import Adam

model_3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [303]:
model_3_fit = model_3.fit(x = train_padded_words, #y = train['stock_label'],
                        y = train_label,
                        batch_size=32,
                        epochs=5,
                        verbose=1,
                        validation_split=0.1)

Train on 6129 samples, validate on 681 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [305]:
model_3_score = model_3.evaluate(test_padded_words, test_label,
                       batch_size=batch_size, verbose=1)

print('Test score:', model_3_score[0])
print('Test accuracy:', model_3_score[1])

Test score: 0.745201237309424
Test accuracy: 0.7328094309580115


In [307]:
model_3_pred = model_3.predict(test_padded_words)

In [308]:
test['model_3_pred'] = model_3_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [310]:
def label_recode(series):
    if series < 0.5 :
        return 0
    elif series >= 0.5 :
        return 1
    
test['model_3_pred'] = test['model_3_pred'].apply(label_recode)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


In [340]:
from sklearn.metrics import average_precision_score, recall_score

model_3_precision = average_precision_score(test['stock_label'], test['model_3_pred'])
model_3_recall = recall_score(test['stock_label'], test['model_3_pred'], average="macro")
print('Average precision-recall score: {0:0.2f}'.format(
      model_3_precision))
print('Recall Score: {0:0.2f}'.format(
      model_3_recall))

Average precision-recall score: 0.65
Recall Score: 0.73


In [314]:
test.head(25)

Unnamed: 0.1,Unnamed: 0,ticker,company_name,date,headline,first_sentence,stock_return,stock_label,text_tokenized,text_tokenized_idx,rnn_pred,cnn_pred,model_3_pred
1790,1790,GEK,General Electric Capital Corporation,9/4/14,ge's 3d mammography device gets u.s. fda appro...,Sept 2 General Electric Co's healthcare unit l...,-0.0052,0,"[ge, 's, 3d, mammography, device, gets, u.s., ...","[2959, 72, 1087, 3408, 3409, 382, 219, 1927, 203]",0,0,0
5550,5550,LW,Lamb Weston Holdings Inc,1/11/17,brief-lamb weston reports fiscal q2 2017 eps $...,* Q2 earnings per share view $0.55 -- Thomson ...,-0.0014,0,"[brief-lamb, weston, reports, fiscal, q2, 2017...","[7077, 7078, 6, 1524, 1805, 1871, 8, 9, 7079]",1,1,1
4639,4639,IRMD,iRadimed Corporation,2/7/17,brief-iradimed corp q4 non-gaap eps $0.11,* Iradimed Corporation announces fourth quarte...,-0.0292,0,"[brief-iradimed, corp, q4, non-gaap, eps, $, 0...","[6305, 12, 7, 4265, 8, 9, 6306]",1,1,1
1094,1094,EQT,EQT Corporation,5/3/16,brief-eqt intends to commence public offering ...,* Intends to commence a registered public offe...,-0.0084,0,"[brief-eqt, intends, to, commence, public, off...","[2053, 2286, 31, 2287, 14, 15, 16, 472, 20, 18]",0,0,0
4084,4084,IBM,International Business Machines Corporation,12/4/13,google takes on amazon by cutting cloud servic...,SAN FRANCISCO Dec 3 Google Inc will lower pri...,-0.0007,0,"[google, takes, on, amazon, by, cutting, cloud...","[2756, 959, 68, 2975, 360, 3660, 1626, 1427, 1...",0,0,0
1318,1318,FOX,Twenty-First Century Fox Inc,10/22/15,australia waves through news corp buy-in to st...,SYDNEY Oct 22 Australia's antitrust regulator...,0.0018,1,"[australia, waves, through, news, corp, buy-in...","[2122, 2707, 2275, 882, 12, 2708, 31, 2709, 27...",1,0,0
3922,3922,IBM,International Business Machines Corporation,5/31/12,text-s&p raises ibm ratings,Overview\t -- U.S. technology and solutio...,-0.0062,0,"[text-s, &, p, raises, ibm, ratings]","[49, 50, 51, 693, 5288, 213]",0,1,1
3238,3238,GOOGL,Alphabet Inc,4/17/15,yahoo and facebook shares outperform google in...,LONDON The Frankfurt-listed shares of Internet...,-0.0087,0,"[yahoo, and, facebook, shares, outperform, goo...","[1117, 59, 1056, 18, 4999, 2756, 27, 5000]",0,0,0
4166,4166,IBM,International Business Machines Corporation,5/15/14,ibm expects hardware business to stabilize in ...,NEW YORK May 14 International Business Machin...,-0.0027,0,"[ibm, expects, hardware, business, to, stabili...","[5288, 1484, 5747, 990, 31, 5898, 27, 574, 264...",0,0,0
4997,4997,JCP,JC Penney Company Inc Holding Company,8/14/13,bad jc penney bet calls ackman's retail acumen...,BOSTON Aug 13 Billionaire investor William Ac...,0.0385,1,"[bad, jc, penney, bet, calls, ackman, 's, reta...","[4410, 6438, 3025, 3422, 278, 6442, 72, 3154, ...",0,0,0


### Model Summary

In [341]:
from IPython.display import HTML, display
import tabulate
pd.set_option('display.max_colwidth', -1)

example_1 = [["Model", "Accuracy", "Precision", "Recall", "News Content", "Actual Stock Return", "Predicted Stock Return"],
            ["RNN", "0.6906", "0.61", "0.69", test['headline'][24:25], "0", "0"],
            ["CNN", "0.7445", "0.66", "0.75", test['headline'][24:25], "0", "1"],
            ["RNN + CNN", "0.7328", "0.65", "0.73", test['headline'][24:25], "0", "0"]
    
]

display(HTML(tabulate.tabulate(example_1, tablefmt='html')))

0,1,2,3,4,5,6
Model,Accuracy,Precision,Recall,News Content,Actual Stock Return,Predicted Stock Return
RNN,0.6906,0.61,0.69,"195 caesars tries to woo strengthened creditors with new plan Name: headline, dtype: object",0,0
CNN,0.7445,0.66,0.75,"195 caesars tries to woo strengthened creditors with new plan Name: headline, dtype: object",0,1
RNN + CNN,0.7328,0.65,0.73,"195 caesars tries to woo strengthened creditors with new plan Name: headline, dtype: object",0,0


In [342]:
pd.set_option('display.max_colwidth', -1)

example_2 = [["Model", "Accuracy", "Precision", "Recall", "News Content", "Actual Stock Return", "Predicted Stock Return"],
            ["RNN", "0.6906", "0.61", "0.69", test['headline'][6:7], "0", "0"],
            ["CNN", "0.7445", "0.66", "0.75", test['headline'][6:7], "0", "1"],
            ["RNN + CNN", "0.7328", "0.65", "0.73", test['headline'][6:7], "0", "1"]
    
]

display(HTML(tabulate.tabulate(example_2, tablefmt='html')))

0,1,2,3,4,5,6
Model,Accuracy,Precision,Recall,News Content,Actual Stock Return,Predicted Stock Return
RNN,0.6906,0.61,0.69,"3922 text-s&p raises ibm ratings Name: headline, dtype: object",0,0
CNN,0.7445,0.66,0.75,"3922 text-s&p raises ibm ratings Name: headline, dtype: object",0,1
RNN + CNN,0.7328,0.65,0.73,"3922 text-s&p raises ibm ratings Name: headline, dtype: object",0,1


In [343]:
pd.set_option('display.max_colwidth', -1)

example_3 = [["Model", "Accuracy", "Precision", "Recall", "News Content", "Actual Stock Return", "Predicted Stock Return"],
            ["RNN", "0.6906", "0.61", "0.69", test['headline'][5:6], "1", "1"],
            ["CNN", "0.7445", "0.66", "0.75", test['headline'][5:6], "1", "0"],
            ["RNN + CNN", "0.7328", "0.65", "0.73", test['headline'][5:6], "1", "0"]
    
]

display(HTML(tabulate.tabulate(example_3, tablefmt='html')))

0,1,2,3,4,5,6
Model,Accuracy,Precision,Recall,News Content,Actual Stock Return,Predicted Stock Return
RNN,0.6906,0.61,0.69,"1318 australia waves through news corp buy-in to struggling free-to-air ten Name: headline, dtype: object",1,1
CNN,0.7445,0.66,0.75,"1318 australia waves through news corp buy-in to struggling free-to-air ten Name: headline, dtype: object",1,0
RNN + CNN,0.7328,0.65,0.73,"1318 australia waves through news corp buy-in to struggling free-to-air ten Name: headline, dtype: object",1,0
