<a href="https://www.nvidia.com/dli"> <img src="images/DLI Header.png" alt="Header" style="width: 400px;"/> </a>

# Natural Language Processing for Signal Generation on News Data

<img src = "images/RNFC-logo.png" width="800" height="800">

### RN Financial Corporation
#### Andrew Tan
#### Rafael Nicolas Fermin Cota

Natural Language Processing (NLP) is a field of artificial intelligence that models the interaction between human (natural) language and computers. The tasks that NLP models address can be split into several categories:
1. Syntax: Parsing, part-of-speech tagging, morphological segmentation, stemming...
2. Semantics: Machine translation, sentiment analysis, natural language understanding...
3. Discourse: Automatic summarization (TL;DR)...
4. Speech: Speech recognition, text-to-speech...

In the early days of NLP (pre-1980s), most of these tasks were accomplished with hand-crafted rules. With the introduction of machine learning and the steady increase in computational power, more NLP models were built with statistical learning on natural language corpora.

Fast forward to today, where deep learning models are achieving state-of-the-art performance in a variety of vision and NLP tasks. The combination of larger datasets, advances in GPU technology, increased research into deep learning architectures and applications, and the vast number of problems a modern company deals with means that today's data scientist must be capable of deploying deep learning solutions.

In this lab we will focus on the following:

1. Building deep neural networks to process and interpret news data.
2. Understanding the various building blocks of an NLP system.
3. Backtesting and applying the models to news data for signal generation.

# Word Embeddings

A word embedding is a mathematical mapping from a very high-dimensional space, in which each word occupies its own dimension, to a continuous vector space of much lower dimension. Typically a large corpus of text is used to train these embeddings.
There are various methods for generating word embeddings, including neural networks, probabilistic models, and dimensionality reduction.
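
To make the mapping concrete, the sketch below contrasts a one-hot vector (one dimension per vocabulary word) with a dense embedding lookup. The four-word vocabulary and the 2-dimensional embedding values are made up for illustration; they are not trained vectors.

```python
import numpy as np

# Toy vocabulary; real vocabularies contain tens of thousands of words.
vocab = ["stock", "share", "profit", "loss"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# One-hot representation: one dimension per word in the vocabulary.
one_hot = np.eye(len(vocab))

# Hypothetical 2-dimensional embedding matrix (illustrative values, not trained).
embedding = np.array([
    [0.9, 0.1],    # stock
    [0.8, 0.2],    # share
    [0.1, 0.9],    # profit
    [0.2, -0.8],   # loss
])

def embed(word):
    """Map a word to its dense vector by multiplying its one-hot vector with the matrix."""
    return one_hot[word_to_index[word]] @ embedding

print(one_hot[word_to_index["share"]])  # sparse, vocabulary-sized vector
print(embed("share"))                   # dense, low-dimensional vector
```

In practice the multiplication is replaced by a direct row lookup, which is what an embedding layer does.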

#### GloVe

GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm. It trains on the word-to-word co-occurrence statistics from a corpus and attempts to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence. GloVe has several neat features:
* Nearest neighbors - the cosine similarity between two word vectors can be an effective measure of the linguistic or semantic similarity of the corresponding words.
* Linear substructures - in contrast to the cosine similarity, which yields a single number, a great deal of information is captured in the vector differences between word vectors. GloVe tries to capture the information pertaining to the relationships between words, and this can be showcased through the vector differences.
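
Both properties can be illustrated with a few lines of NumPy. This is a minimal sketch: the 3-dimensional vectors are hand-picked so that the relationships hold exactly, whereas real GloVe vectors have hundreds of dimensions and the relationships hold only approximately.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked toy vectors (assumption for illustration, not real GloVe output).
vectors = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.1, 0.9]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
    "bank":  np.array([0.5, 0.5, 0.5]),
}

def nearest(query, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `query`."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], query))

# Nearest neighbors: rank words by cosine similarity to a query vector.
print(nearest(vectors["king"], exclude=("king",)))

# Linear substructure: the "royalty" offset is the same for both pairs.
print(vectors["king"] - vectors["man"])
print(vectors["queen"] - vectors["woman"])

# Analogy via vector arithmetic: king - man + woman lands on queen.
analogy = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(analogy, exclude=("king", "man", "woman")))
```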


<img src="images/Word-Vectors.png">

#### FastText


FastText is a library for text classification and representation learning developed by Facebook AI Research. Its focus is on speed and scalability while maintaining performance comparable to other methods.

FastText provides two methods for computing word representations from a corpus. Both define a supervised learning task, and learning that task well produces useful word vectors.

* Skipgram

The Skipgram model uses a given word to predict the word(s) surrounding it. Skipgram thus learns the likelihood of a word being present based on the occurrence of the word(s) that appear near it in the corpus. You can think of the task as predicting the context given a word.

* Continuous Bag of Words (CBOW)

The CBOW method is the inverse: it takes a bag of words surrounding the target word and attempts to predict that target word. You can think of this as predicting the word given the context.
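
The difference between the two tasks is easiest to see in the training examples each one generates. The sketch below builds them from a single tokenized headline; the helper names and the window size are illustrative assumptions, and FastText's actual implementation adds details (such as subword information and negative sampling) that are not shown here.

```python
def skipgram_pairs(tokens, window=1):
    """(center word, context word) pairs: predict the context given the word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=1):
    """(context words, target word) examples: predict the word given the context."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = "ford posts biggest loss".split()

for pair in skipgram_pairs(tokens):
    print("skipgram:", pair)
for example in cbow_examples(tokens):
    print("cbow:", example)
```

Skipgram produces one example per (word, neighbor) pair, while CBOW produces one example per word with all of its neighbors bundled together.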

## Applications of Natural Language Processing to News Data

### Sentiment Analysis for News

It is undeniable that when a story with a strong impact on an industry or company is released, intraday market prices react accordingly. For the average person with an investment account, the news is a substantial signal in the decision to buy or sell a certain stock. However, systematically trading on news signals is simply impossible for a human to do manually:

- More than 92,000 news articles are released per day.
- An average human reads at a speed of 200-250 words per minute.
- Reading at this rate for 8 hours continuously, a human may process up to 40-50 articles per day. This calculation disregards the time it takes to find the articles and perform any analysis.
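
The arithmetic behind these figures can be checked in a few lines. The article length is not given in the text; the value of 2,400 words per article used below is an assumption chosen because it reproduces the 40-50 range.

```python
WORDS_PER_MINUTE = (200, 250)      # reading speed range from the text
HOURS_READING = 8                  # from the text
ARTICLES_PER_DAY = 92000           # from the text
ASSUMED_WORDS_PER_ARTICLE = 2400   # assumption, not stated in the text

minutes = HOURS_READING * 60
articles_read = [wpm * minutes // ASSUMED_WORDS_PER_ARTICLE for wpm in WORDS_PER_MINUTE]

print("Articles read per day:", articles_read)
print("Share of daily news covered: {:.3%}".format(max(articles_read) / ARTICLES_PER_DAY))
```

Even at the upper end of the range, a single reader covers well under a tenth of a percent of the daily flow.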

In finance, the efficiency and speed at which you process information can be vital for making well-informed, smart decisions. We can leverage deep learning to train models that provide sentiment scores for headlines, articles, tweets, and posts. These sentiment scores can produce valuable signals to support a buy/sell/hold decision, as well as inputs to valuation models.

### Multi-channel LSTM network for Sentiment Classification

In the following code, we will use the Keras and TensorFlow libraries to construct and train a multi-channel LSTM network for classifying sentiment. A multi-channel network simply means that we use more than one type of embedding, so the network has access to features from separate word embeddings. The idea is that a single type of embedding may not contain enough information; by adding another embedding, trained either on a different corpus or with a different algorithm entirely, we obtain stronger, more robust features. In the model below we will use pretrained embeddings from both GloVe and FastText Skipgram models.

In [1]:
## Import the necessary libraries
import numpy as np
import pandas as pd
from datetime import date
from tqdm import tqdm

import sklearn
from sklearn.model_selection import train_test_split, StratifiedKFold


import tensorflow as tf
from tensorflow.python import pywrap_tensorflow
import keras

from IPython.display import Image

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

import nltk
nltk.download('stopwords')
from nltk import RegexpTokenizer
from nltk.corpus import stopwords

from numpy.random import seed
from tensorflow import set_random_seed
seed(42)
set_random_seed(42)
MAX_SEQUENCE_LENGTH = 32
EMBEDDING_DIM = 300

print("Keras version:",keras.__version__)
print("Tensorflow version:",tf.__version__)
print("Sklearn version:",sklearn.__version__)


Using TensorFlow backend.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Keras version: 2.2.2
Tensorflow version: 1.10.0
Sklearn version: 0.19.2


We will be utilizing open source news data from Bloomberg and Reuters between 2006 and 2012.


In [2]:
## Read and sample the data.
df = pd.read_csv("/dli/data/news_data/news_data_labelled.csv")
df.head()

| Unnamed: 0 | headline | timestamp | url | tldr | Class |
|---|---|---|---|---|---|
| 0 | Exxon Mobil offers plan to end Alaska dispute | 2006-10-20 06:15:00 | http://www.reuters.com/article/2006/10/20/busi... | In a proposal sent earlier this week to the Al... | 2 |
| 1 | Hey buddy, can you spare $600 for a Google share? | 2006-10-20 04:25:00 | http://www.reuters.com/article/2006/10/20/busi... | SAN FRANCISCO/NEW YORK (Reuters) - Wall Stree... | 1 |
| 2 | Ford posts biggest loss in 14 years | 2006-10-23 06:42:00 | http://www.reuters.com/article/2006/10/23/us-a... | Ford also said it was considering raising new ... | 1 |
| 3 | Shell looks to buy out Canada unit for C$7.7 b... | 2006-10-23 04:34:00 | http://www.reuters.com/article/2006/10/23/us-e... | In July, Shell Canada rattled the industry and... | 1 |
| 4 | U.S. venture investors betting on energy, Web 2.0 | 2006-10-23 08:36:00 | http://www.reuters.com/article/2006/10/23/us-f... | SAN FRANCISCO (Reuters) - U.S. venture capita... | 1 |


In [3]:
print("Starting timestamp: {} \nEnding timestamp: {}".format(df.timestamp.min(),df.timestamp.max()))

Starting timestamp: 2006-10-20 04:25:00 
Ending timestamp: 2012-12-31 23:06:28


In [7]:
## Split the dataset into testing and training datasets
X = df.tldr
y = df.Class
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify = y, test_size=0.10, random_state=42)

In [8]:
## Tokenize training set and apply to train + test sets. Pad the sequences with zeros.
tokenizer = Tokenizer(num_words=None,
                       filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                       lower=True,
                       split=" ",
                       char_level=False)

tokenizer.fit_on_texts(X_train)

X_train = pad_sequences(tokenizer.texts_to_sequences(X_train),maxlen=MAX_SEQUENCE_LENGTH,value=0.)
X_test = pad_sequences(tokenizer.texts_to_sequences(X_test),maxlen=MAX_SEQUENCE_LENGTH,value=0.)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

word_index = tokenizer.word_index
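
To see what the tokenizing and padding steps produce, here is a pure-Python sketch that mimics the behaviour relied on above: indices are assigned by descending word frequency starting at 1, words unseen during fitting are dropped, and sequences are padded and truncated at the front (the `pad_sequences` defaults). The helper names and toy sentences are hypothetical, and the lab itself uses the Keras implementations.

```python
from collections import Counter

MAX_LEN = 5  # the lab uses MAX_SEQUENCE_LENGTH = 32

def fit_word_index(texts):
    """Assign indices 1..N by descending frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

train_texts = ["ford posts loss", "ford shares fall after loss report today"]
test_texts = ["toyota posts profit"]

toy_word_index = fit_word_index(train_texts)  # fit on the training set only
train_padded = [pad_pre(s, MAX_LEN) for s in texts_to_sequences(train_texts, toy_word_index)]
test_padded = [pad_pre(s, MAX_LEN) for s in texts_to_sequences(test_texts, toy_word_index)]

print(toy_word_index)
print(train_padded)
print(test_padded)
```

Fitting on the training set only means that test-set words such as "toyota" have no index and disappear from the sequence, which mirrors how the real test data is treated.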

In [None]:
## Define the embedding matrices for Glove and FastText
def embedding_matrix(path_to_embedding : str, embedding_dim: int, word_index : dict) -> np.array:
    """
    This function creates an embedding matrix.
    
    Inputs:
    path_to_embedding - path to text file of word embeddings
    embedding_dim - dimension of word embeddings
    word_index - dictionary mapping words to indices
    
    Outputs:
    embedding_matrix - numpy matrix containing the embeddings
    
    """
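    embeddings_index = {}
    with open(path_to_embedding, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            try:
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
            except (IndexError, ValueError):
                # Skip blank or malformed lines (e.g. tokens that contain spaces).
                continue
            # Skip header lines and vectors of the wrong size.
            if coefs.shape[0] != embedding_dim:
                continue
            embeddings_index[word] = coefs

    # Row 0 is reserved for padding; words not found in the embedding index stay all-zeros.
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
    found = 0
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            found += 1
            embedding_matrix[i] = embedding_vector

    print("Found vectors for {} of {} words".format(found, len(word_index)))
    return embedding_matrix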
                                
                                
glove_embedding_matrix = embedding_matrix("/dli/data/news_data/glove/glove.840B.300d.txt",EMBEDDING_DIM,tokenizer.word_index)
fasttext_embedding_matrix = embedding_matrix("/dli/data/news_data/fasttext/wiki-news-300d-1M.vec",EMBEDDING_DIM,tokenizer.word_index)
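
The following self-contained sketch runs the same idea on a tiny in-memory "file", so the shape and contents of the resulting matrix are easy to inspect. The function name `toy_embedding_matrix`, the vectors, and the word index are hypothetical; the first line imitates the count/dimension header found at the top of fastText `.vec` files.

```python
import io
import numpy as np

def toy_embedding_matrix(lines, embedding_dim, word_index):
    """Build a (len(word_index) + 1, embedding_dim) matrix from 'word v1 v2 ...' lines."""
    vectors = {}
    for line in lines:
        values = line.split()
        # Skip blank lines, headers, and vectors of the wrong size.
        if len(values) != embedding_dim + 1:
            continue
        vectors[values[0]] = np.asarray(values[1:], dtype="float32")

    matrix = np.zeros((len(word_index) + 1, embedding_dim))  # row 0 = padding
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = vectors[word]  # words without a vector stay all-zeros
    return matrix

toy_file = io.StringIO(
    "3 2\n"             # header line: vocabulary size and dimension
    "ford 0.1 0.2\n"
    "loss -0.3 0.4\n"
    "profit 0.5 0.6\n"
)
toy_word_index = {"ford": 1, "loss": 2, "toyota": 3}

matrix = toy_embedding_matrix(toy_file, embedding_dim=2, word_index=toy_word_index)
print(matrix)
```

Row `i` of the matrix holds the vector of the word with index `i`, so the row for "toyota" (no pretrained vector) and the padding row 0 are both zero.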

In [None]:
## Import libraries + tools for building the multi-channel LSTM
from keras.models import Model
from keras import layers
from keras.optimizers import Adam
from keras.layers import Input, Dense, Dropout, Activation, Embedding, BatchNormalization
from keras.layers import LSTM, concatenate, Bidirectional, PReLU, ELU, LeakyReLU, GRU, SimpleRNN
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, TensorBoard
from keras.utils.vis_utils import plot_model
import datetime
import pydot, graphviz

In [None]:
## Define the neural network architecture

## Subchannel network for encoding sequential information
def subnetwork_channel(input_layer : layers, RNN_architecture : str, units : int, dropout_rate : float) -> layers:
    """
    This function creates a sub network for encoding sequences.
    
    Inputs:
    input_layer - The input keras layer into the subnetwork
    RNN_architecture - Name of the RNN type to use
    units - Number of units in the RNN
    dropout_rate - dropout rate
    
    Outputs:
    batch - Batch Normalized output layer
    
    """
    assert RNN_architecture in ["LSTM", "GRU", "RNN"]
    
    dropout1 = Dropout(rate = dropout_rate)(input_layer)
    
    if RNN_architecture == "LSTM":
        rnn_layer = Bidirectional(LSTM(units = units, return_sequences = False))(dropout1)
    elif RNN_architecture == "GRU":
        rnn_layer = Bidirectional(GRU(units = units, return_sequences = False))(dropout1)
    elif RNN_architecture == "RNN":
        rnn_layer = Bidirectional(SimpleRNN(units = units, return_sequences = False))(dropout1)
    
    dropout2 = Dropout(rate = dropout_rate)(rnn_layer)
    batch = BatchNormalization()(dropout2)
    return batch

## Output layer network
def output_channel(input_layer : layers ,activation : str, units : int, dropout_rate : float) -> layers:
    """
    This function creates a sub network for outputting classification probabilities.
    
    Inputs:
    input_layer - The input keras layer into the subnetwork
    activation  - Name of the activation type to use
    units - Number of units in the Dense network
    dropout_rate - dropout rate
    
    Outputs:
    output - Softmax output layer
    
    """
    assert activation in ["ReLU","PReLU", "ELU", "LeakyReLU"]
    
    dense = Dense(units)(input_layer)
    
    if activation == "PReLU":
        act = PReLU()(dense)
    elif activation == "ELU":
        act = ELU()(dense)
    elif activation == "LeakyReLU":
        act = LeakyReLU()(dense)
    elif activation == "ReLU":
        # Apply ReLU to the Dense layer defined above rather than creating a second Dense layer.
        act = Activation('relu')(dense)
    
        
    dropout = Dropout(rate = dropout_rate)(act)
    batch = BatchNormalization()(dropout)
    output = Dense(3,activation='softmax', name = "Output")(batch)
    
    return output
    
## Define full model.
def define_model(RNN_architecture : str = "LSTM", rnn_units : int = 256, dense_units : int = 128,dense_activation : str = "PReLU" ,dropout_rate : float = 0.4) -> Model:
    """
    This function defines and compiles a Multichannel RNN for Sentiment Classification.
    
    Inputs:
    RNN_architecture - Name of the RNN type to use
    rnn_units - Number of units in the RNN
    dense_units - Number of units in the Dense network
    dense_activation  - Name of the activation type to use
    dropout_rate - dropout rate
    
    Outputs:
    model - A Keras model
    
    """
    # Input Layer
    shape = (MAX_SEQUENCE_LENGTH,)
    input1 = Input(shape = shape, name = "Main_input")
    
    # Channel 1 - GLoVe
    embedding1 = Embedding(len(word_index) + 1,
              EMBEDDING_DIM,
              weights=[glove_embedding_matrix],
              input_length=MAX_SEQUENCE_LENGTH,
              trainable=False,
              input_shape=X_train.shape[1:], name = "GLoVe_Embedding")(input1)

    net1 = subnetwork_channel(embedding1, RNN_architecture = RNN_architecture, units = rnn_units, dropout_rate = dropout_rate)
    
    # Channel 2 - Fast Text
    embedding2 = Embedding(len(word_index) + 1,
              EMBEDDING_DIM,
              weights=[fasttext_embedding_matrix],
              input_length=MAX_SEQUENCE_LENGTH,
              trainable=False,
              input_shape=shape, name = "FastText_Embedding")(input1)

    net2 = subnetwork_channel(embedding2, RNN_architecture = RNN_architecture, units = rnn_units, dropout_rate = dropout_rate)
    
    # Merge
    merged = concatenate([net1,net2], name ="Merge")
    # Output channel
    output = output_channel(merged, activation = dense_activation, units = dense_units, dropout_rate = dropout_rate)
    
    # Compile 
    model = Model(inputs = input1, outputs = output)
    model.compile(loss = 'categorical_crossentropy', optimizer = Adam(0.002), metrics = ['categorical_accuracy'])
    
    return model

In [None]:
## Compile and display the model.
model = define_model(RNN_architecture = "LSTM", rnn_units= 256, dense_units = 128, dense_activation = "ReLU", dropout_rate = 0.4)
model.summary()
pic_name = 'images/multichannel-bidirectionalLSTM.png'
plot_model(model,show_shapes=True,to_file=pic_name)

<img src="images/multichannel-bidirectionalLSTM.png">

In [None]:
## Train
tensorboard = TensorBoard(log_dir='/dli/tasks/tensorboard/logs/')
model.fit(X_train,y_train,epochs = 10, batch_size = 1024,callbacks = [tensorboard])

### Click [here](/tensorboard/) to start TensorBoard.

In [None]:
## Find the testing accuracy
test_loss, test_categorical_accuracy = model.evaluate(X_test,y_test)
print("Test Accuracy: {:.1f}%".format(test_categorical_accuracy * 100))

Our model was able to achieve ~79% accuracy. According to research on sentiment analysis and classification, human raters may only agree with each other about 80% of the time. Due to the nature of sentiment analysis, the outcome a reader arrives at can be very subjective depending on how the reader interprets the words, tone or phrasing of the text. Thus, a model that predicts with 100% accuracy may still disagree with a human 20% of the time. 

### Exercise: Re-tune Neural Network Parameters
Try experimenting with different parameters in the neural network.
In the function 'define_model':
- 'RNN_architecture' can be one of: "RNN", "GRU", "LSTM".
- 'rnn_units' is the number of units in the RNN.
- 'dense_units' is the number of units in the dense network.
- 'dense_activation' can be one of: "PReLU", "LeakyReLU", "ELU", "ReLU".
- 'dropout_rate' is the rate of dropout throughout the network.

## Intraday Sentiment Strategy

To illustrate the power of Sentiment Analysis we'll construct and backtest a simple strategy.
- Trade intraday over the year of 2013
- Companies: 
    - Apple, Microsoft, Boeing, JPMorgan, Google, GM, Citigroup, Ford, Toyota, HSBC, ICAP
- Assume perfect market entry and exit, no transaction fees
- Sentiment score is the confidence of a text being positive or negative.
- Basic strategy: 
    - BUY when 'sentiment_score' >= 'sentiment_cutoff' and SELL 'time_to_close_position' minutes later. 
    - SHORT SELL when 'sentiment_score' <= -'sentiment_cutoff' and BUY 'time_to_close_position' minutes later. 
    - If news is released when the market is closed, then BUY as soon as it opens.
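
The entry and exit rules above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the `interactive_backtest` used in the lab: prices are a hypothetical list with one entry per minute, each news item is a hypothetical `(minute, sentiment_score)` pair with a signed score in [-1, 1], and market-closed handling is left out.

```python
def backtest(prices, news, sentiment_cutoff, time_to_close_position):
    """Return the simple return of each trade triggered by a news item.

    prices: list of prices, one per minute (toy data)
    news:   list of (minute, sentiment_score) pairs, score in [-1, 1]
    """
    trade_returns = []
    for minute, score in news:
        exit_minute = minute + time_to_close_position
        if exit_minute >= len(prices):
            continue  # not enough price data left to close the position
        entry_price, exit_price = prices[minute], prices[exit_minute]
        if score >= sentiment_cutoff:
            # BUY now, SELL later: profit when the price rises.
            trade_returns.append((exit_price - entry_price) / entry_price)
        elif score <= -sentiment_cutoff:
            # SHORT SELL now, BUY back later: profit when the price falls.
            trade_returns.append((entry_price - exit_price) / entry_price)
    return trade_returns

prices = [100.0, 101.0, 102.0, 101.0, 99.0, 98.0, 100.0]
news = [(0, 0.9), (2, -0.8), (3, 0.2), (5, 0.95)]

returns = backtest(prices, news, sentiment_cutoff=0.7, time_to_close_position=2)
print(["{:.2%}".format(r) for r in returns])
```

Here the first item triggers a long trade, the second a short trade, the third is ignored because its score is below the cutoff, and the fourth is skipped because the toy price series ends before the position could be closed.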

In [None]:
from utils import interactive_backtest
interactive_backtest()

## News Processing for Risk Management (Optional)

Another use case of news in trading is the ability to monitor portfolio holdings and mitigate risk. Being able to identify a possible drop in a stock's price, or to observe that the market is reacting to the release of particular news, can be a useful component in managing risk.

### Case Study: Apple cuts iPhone X production due to weak demand

A Nikkei report on Monday, January 29th revealed that Apple would cut its production target for the iPhone X from 40 to 20 million units. Apple's stock did not react well: in the wake of the report the stock fell further, although it was already in a downtrend due to earnings reports. In this case we can see that Apple's stock price is correlated with the sentiment of the news articles related to the iPhone.



<img src="images/apple_price.png">

<img src="images/apple_sentiment.png">

### Case Study: Aimia Inc Receives Notice of Contract Non-renewal from Air Canada

Aimia Inc is a data-driven marketing and loyalty analytics company. On May 11th, 2017 the company announced that Air Canada, its largest client, had given notice of non-renewal. The market responded with a sharp drop in price. The relative volume of articles on Aimia in the few days leading up to the announcement skyrocketed. A drastic change in article volume can be a signal for redirecting attention to certain companies.
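
One simple way to flag such a change is to compare each day's article count with the recent history for the same company. The sketch below uses a rolling z-score; the window length, threshold, and daily counts are hypothetical choices for illustration and are not taken from the Aimia data.

```python
from statistics import mean, stdev

def volume_spikes(daily_counts, window=5, threshold=3.0):
    """Return the indices of days whose article count is unusually high
    relative to the preceding `window` days."""
    spikes = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat history: the z-score is undefined
        z_score = (daily_counts[day] - mu) / sigma
        if z_score >= threshold:
            spikes.append(day)
    return spikes

# Hypothetical number of articles per day mentioning one company.
article_counts = [4, 5, 3, 4, 6, 5, 4, 21, 40, 6]

print(volume_spikes(article_counts))
```

A flagged day does not say whether the news is good or bad; it only indicates that the company deserves attention, after which the sentiment model can be applied to the articles themselves.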

<img src="images/aimia_price.png">

<img src="images/aimia_vol.png">

Finally, don't forget to save your work from this lab before time runs out and the instance shuts down!

In [None]:
!tar -cvf output.zip  "Applying Natural Language Processing on News Data using Deep Learning.ipynb" utils.py images

[Download output.zip](output.zip)

### References

- GloVe: https://nlp.stanford.edu/projects/glove/
- FastText: https://fasttext.cc/
- News articles per day: https://www.slideshare.net/chartbeat/mockup-infographicv4-27900399
- News data source: https://github.com/philipperemy/financial-news-dataset
- Word embeddings: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
- Natural Language Processing: https://en.wikipedia.org/wiki/Natural-language_processing
- Sentiment Analysis: https://en.wikipedia.org/wiki/Sentiment_analysis

