# Daily News for Stock Market Prediction - Part I<a name="breakdown"></a>

In this project, we will create a classifier that can predict stock market movement based on daily news headlines. In the first part, we will do some basic data preprocessing and analysis. 

Here is the phase breakdown of Part I:

[Phase1: Loading in the Data](#phase1)  
[Phase2: EDA](#phase2)  
[Phase3: Text Encoding techniques](#phase3)  

In [1]:
import numpy as np
import pandas as pd
import re

import plotly.graph_objs as go
import plotly.offline as py
py.init_notebook_mode(connected=True)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

## Loading in the Data<a name="phase1"></a> 
[Return to phase breakdown](#breakdown)

In [2]:
news_data = pd.read_csv('data/RedditNews.csv')
combined_data = pd.read_csv('data/Combined_News_DJIA.csv')
DJIA_data = pd.read_csv('data/DJIA_table.csv')

In [3]:
combined_data.head(1)

,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""


In [4]:
DJIA_data.head(1)

,Date,Open,High,Low,Close,Volume,Adj Close
0,2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141


In [5]:
print(combined_data.shape)
print(DJIA_data.shape)

(1989, 27)
(1989, 7)


Both `combined_data` and `DJIA_data` have 1989 rows, one per trading day from 2008-08-08 to 2016-07-01.

In [6]:
combined_data.isnull().sum()

Date     0
Label    0
Top1     0
Top2     0
Top3     0
Top4     0
Top5     0
Top6     0
Top7     0
Top8     0
Top9     0
Top10    0
Top11    0
Top12    0
Top13    0
Top14    0
Top15    0
Top16    0
Top17    0
Top18    0
Top19    0
Top20    0
Top21    0
Top22    0
Top23    1
Top24    3
Top25    3
dtype: int64

We can see that a few values are missing in the last three headline columns (`Top23` to `Top25`), so we keep only `Top1` through `Top20` and merge the resulting dataframe with `DJIA_data`.

In [7]:
data = pd.merge(DJIA_data, combined_data.iloc[:, 0:22], on = 'Date')
print("Data column names:", data.columns)
data.head(2)

Data column names: Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close', 'Label',
       'Top1', 'Top2', 'Top3', 'Top4', 'Top5', 'Top6', 'Top7', 'Top8', 'Top9',
       'Top10', 'Top11', 'Top12', 'Top13', 'Top14', 'Top15', 'Top16', 'Top17',
       'Top18', 'Top19', 'Top20'],
      dtype='object')


,Date,Open,High,Low,Close,Volume,Adj Close,Label,Top1,Top2,...,Top11,Top12,Top13,Top14,Top15,Top16,Top17,Top18,Top19,Top20
0,2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141,1,A 117-year-old woman in Mexico City finally re...,IMF chief backs Athens as permanent Olympic host,...,France Cracks Down on Factory Farms - A viral ...,Abbas PLO Faction Calls Killer of 13-Year-Old ...,Taiwanese warship accidentally fires missile t...,"Iran celebrates American Human Rights Week, mo...",U.N. panel moves to curb bias against L.G.B.T....,"The United States has placed Myanmar, Uzbekist...",S&amp;P revises European Union credit rating t...,India gets $1 billion loan from World Bank for...,U.S. sailors detained by Iran spoke too much u...,Mass fish kill in Vietnam solved as Taiwan ste...
1,2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234,1,Jamaica proposes marijuana dispensers for tour...,Stephen Hawking says pollution and 'stupidity'...,...,Turkish Cop Who Took Down Istanbul Gunman Hail...,Cannabis compounds could treat Alzheimer's by ...,Japan's top court has approved blanket surveil...,CIA Gave Romania Millions to Host Secret Prisons,Groups urge U.N. to suspend Saudi Arabia from ...,Googles free wifi at Indian railway stations i...,Mounting evidence suggests 'hobbits' were wipe...,The men who carried out Tuesday's terror attac...,Calls to suspend Saudi Arabia from UN Human Ri...,More Than 100 Nobel Laureates Call Out Greenpe...
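The slice-then-merge step can be checked on a toy pair of frames. The sketch below is only an illustration: the two frames are hypothetical stand-ins for `DJIA_data` and `combined_data`, the headline strings are placeholders, and the column slice `0:3` plays the role of `0:22` in the real data.

```python
import pandas as pd

# Toy stand-ins for the real tables; headline strings are placeholders.
prices = pd.DataFrame({
    "Date": ["2016-07-01", "2016-06-30"],
    "Adj Close": [17949.369141, 17929.990234],
})
news = pd.DataFrame({
    "Date": ["2016-06-30", "2016-07-01"],
    "Label": [1, 1],
    "Top1": ["headline a", "headline b"],
    "Top2": ["headline c", None],  # a missing headline, like Top23 to Top25
})

# Keep only the leading, complete columns: Date, Label, Top1.
news_subset = news.iloc[:, 0:3]

# Inner join on Date; rows follow the order of the left frame (prices).
merged = pd.merge(prices, news_subset, on="Date")

print(merged)
print("Missing cells:", int(merged.isnull().sum().sum()))
```

Because the join is an inner join on `Date`, any date present in only one table would be dropped, so comparing the merged row count with the inputs is a cheap sanity check.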


## Basic EDA<a name="phase2"></a>
[Return to phase breakdown](#breakdown)

The first thing to check is the balance of the two labels: on how many days did the DJIA rise, and on how many did it fall?

In [8]:
data['Label'].value_counts()

1    1065
0     924
Name: Label, dtype: int64

In [9]:
# Index the counts explicitly (1 = up, 0 = down) so each bar matches its label
label_counts = data['Label'].value_counts().reindex([1, 0])
label_barplot = go.Bar(x = ['Up', 'Down'], 
                       y = 100 * label_counts.values / label_counts.sum())
py.iplot(go.Figure(data = [label_barplot], 
                   layout = go.Layout(autosize=False, width=500, height=500, 
                                      title="Market Movement from 2008-08-08 to 2016-07-01")), 
         filename='images/label_count')

During the period covered (2008-08-08 to 2016-07-01), the DJIA rose on **1065** days and fell on the remaining **924** days. 
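As a quick worked calculation, the two counts can be turned into the percentages that the bar chart displays. This is a minimal sketch using only the numbers printed by `value_counts()`; the variable names are illustrative.

```python
# Counts taken from the value_counts() output above.
up_days, down_days = 1065, 924
total_days = up_days + down_days

up_share = 100 * up_days / total_days
down_share = 100 * down_days / total_days

print(f"Up:   {up_share:.1f}% of {total_days} days")
print(f"Down: {down_share:.1f}% of {total_days} days")
```

The classes are only mildly imbalanced, so a model that always predicted "up" would be right on roughly 53.5% of these days, which is a useful reference point for any classifier built later.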

What about the overall DJIA trend over the period?

In [10]:
adjclose_trace = go.Scatter(x = data['Date'], y = data['Adj Close'])
layout = dict(xaxis = dict(title = "Date", rangeslider = dict(visible = True)), 
              yaxis = dict(title = "Currency in USD"),
              title = "DJIA Trend")
              

py.iplot(go.Figure(data = [adjclose_trace], layout = layout), filename='images/ctrend')

We can also plot the daily highest point and lowest point during the period:

In [11]:
high_trace = go.Scatter(x = data['Date'], y = data['High'], name = "High")
low_trace = go.Scatter(x = data['Date'], y = data['Low'], name = "Low")

py.iplot(go.Figure(data = [high_trace, low_trace], layout = layout), filename='images/hltrend')

Overall, the trend over the period is clearly upward.

The index fell to its lowest point around `2009-03`, and after that it continued to rise until the first half of 2015.

Let's find the days with the lowest and highest Adj Close:

In [12]:
lowest_adjclose = data.loc[data['Adj Close'].idxmin()]
print("Lowest Adj Close: ", lowest_adjclose['Adj Close'])
print("Date: ", lowest_adjclose['Date'])

highest_adjclose = data.loc[data['Adj Close'].idxmax()]
print("Highest Adj Close: ", highest_adjclose['Adj Close'])
print("Date: ", highest_adjclose['Date'])

Lowest Adj Close:  6547.049805
Date:  2009-03-09
Highest Adj Close:  18312.390625
Date:  2015-05-19


In the data:
1. The DJIA fell to its lowest close on `2009-03-09`, at an index level of about **6547**.
2. The DJIA rose to its highest close on `2015-05-19`, at an index level of about **18312**.
3. The DJIA almost tripled from its lowest to its highest close.
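The "almost tripled" claim can be verified directly from the two values printed above. This is a small arithmetic sketch; the variable names are illustrative.

```python
lowest_close = 6547.049805    # 2009-03-09
highest_close = 18312.390625  # 2015-05-19

ratio = highest_close / lowest_close
gain_pct = 100 * (ratio - 1)

print(f"Highest / lowest: {ratio:.2f}x")
print(f"Gain from low to high: {gain_pct:.0f}%")
```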

We can combine all the plots above into a single interactive OHLC chart with Plotly. An [OHLC](https://en.wikipedia.org/wiki/Open-high-low-close_chart) (open-high-low-close chart) is a type of chart typically used to illustrate movements in the price of a financial instrument over time. Each vertical line on the chart shows the price range (the highest and lowest prices) over one unit of time, which in our case is a single day:

In [13]:
trace = go.Ohlc(x=data['Date'], open=data['Open'], 
                high=data['High'], low=data['Low'],
                close=data['Adj Close'])
layout = dict(xaxis=dict(title="Date", rangeselector=dict(buttons=[
                            dict(count=1, label='1 month', 
                                 step='month', stepmode='backward'),
                            dict(count=3, label='1 quarter', 
                                 step='month', stepmode='backward'),
                            dict(count=12, label='1 year',
                                 step='month', stepmode='backward'),
                            dict(step='all')]),
                         rangeslider=dict(visible = True)), 
              yaxis=dict(title="Currency in USD"),
              title="DJIA Trend",
              shapes = [dict(x0='2009-03-09', x1='2009-03-09', y0=0, y1=1,
                             yref='paper', opacity = 0.25)],
              annotations = [dict(x='2015-05-19', y=0.95, yref='paper',
                                  text='Highest point',
                                  hovertext = "2015-05-19, 18312.390625"),
                             dict(x='2009-03-09', y=0.05, yref='paper',
                                  ax=50, ay=-20, text='Lowest point', 
                                  hovertext = "2009-03-09, 6547.049805")])

py.iplot(go.Figure(data = [trace], layout = layout), filename='images/ohlctrend')

## Text Encoding techniques<a name="phase3"></a>
[Return to phase breakdown](#breakdown)

Our goal is to use daily news headlines to make a binary prediction, but many techniques assume that the inputs are **quantitative** variables, in which the relative magnitude of a feature encodes information about the response variable. In other words, we need a way to transform our text into quantitative variables before we can use it in the following steps.

This process is usually called [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering), which transforms the representation of model inputs to enable better model approximation. In this phase, we will introduce two widely used representations of text:
* **Bag-of-Words Encoding**: encodes text by the frequency of each word.
* **N-Gram Encoding**: encodes text by the frequency of sequences of words of length N.

After that, we will use a numerical statistic called TF-IDF based on these two techniques to reflect how important a word is to a document in a corpus.

### The Bag-of-Words Encoding

Bag-of-words encoding is commonly used in NLP. In this method, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. 
The following is a simple illustration of the bag-of-words encoding:

<img src="images/bag_of_words.png" width="600px">

**Notice**
1. **Stop words are removed**. Stop words are words like "and", "is", "him", which are presumed to be uninformative in representing the content of a text, and which may be removed to avoid them being construed as signal for prediction. Here is a good list of [stop-words in many languages](https://code.google.com/archive/p/stop-words).
2. **Word order information is lost**. Nonetheless, the vector still suggests that the sentence is about fun, machines, and learning, though several readings remain possible: "learning machines have fun learning" or "learning about machines is fun learning", for example.
3. **Capitalization and punctuation are typically removed.**
4. **Sparse Encoding**: It is necessary to represent the bag-of-words efficiently. There are millions of possible words (including terminology, names, and misspellings) and so instantiating a 0 for every word that is not in each record would be incredibly inefficient.
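The four points above can be sketched in a few lines of plain Python. This is only an illustration, not how any particular library implements it: the three-word stop list is an assumption chosen for these two sentences, and a `Counter` stands in for a sparse encoding because it stores only the words that actually occur.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"have", "about", "is"}

def bag_of_words(text):
    """Lowercase, drop punctuation and stop words, then count each word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

bag_a = bag_of_words("Learning machines have fun learning.")
bag_b = bag_of_words("Learning about machines is fun learning!")

print(bag_a)
print(bag_a == bag_b)  # the two readings collapse to the same bag
```

Both sentences end up with the same counts for `learning`, `machines` and `fun`, which is exactly the loss of word order described in point 2.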

Here is a basic example using sklearn:

In [14]:
corpus = [
    'This is the first sentence.',
    'And this sentence is the second sentence.',
    'Is this the third one?',
    'I just know I am the fourth.'
]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)
print("Corpus: ")
for d in corpus:
    print(d)
print("\nWords:", vectorizer.get_feature_names_out().tolist())
print("\nSentence Encoding:")
print(X.toarray())

Corpus: 
This is the first sentence.
And this sentence is the second sentence.
Is this the third one?
I just know I am the fourth.

Words: ['fourth', 'just', 'know', 'second', 'sentence']

Sentence Encoding:
[[0 0 0 0 1]
 [0 0 0 1 2]
 [0 0 0 0 0]
 [1 1 1 0 0]]


The matrix holds the word counts for each sentence, with the columns following the alphabetical order of the feature words. For example, the word `sentence` appears twice in the second sentence, so its count there is 2, and none of the selected words appears in the third sentence, so the third row is all zeros. 

Only 5 words are selected as features. If we drop the `stop_words` argument, every word of at least two characters is chosen as a feature (the default tokenizer ignores single-character tokens such as `I`):

In [15]:
vectorizer_nostopwords = CountVectorizer()
vectorizer_nostopwords.fit_transform(corpus)
print("Words:", vectorizer_nostopwords.get_feature_names_out().tolist())

Words: ['am', 'and', 'first', 'fourth', 'is', 'just', 'know', 'one', 'second', 'sentence', 'the', 'third', 'this']


### The N-Gram Encoding

The bag-of-words model is an orderless document representation: only the counts of words matter. For instance, in the above example "This is the first sentence. And this sentence is the second sentence", the bag-of-words representation will not reveal that `"the"` always comes directly before the ordinal words `"first"` and `"second"` in this text. As an alternative, the n-gram model can store this local ordering information. Conceptually, we can view the bag-of-words model as a special case of the n-gram model, with n=1. Consider the following sentence:

> _The novel is pretty good but I do not like it._

If we re-arrange the words:

> _The novel is not pretty good but I do like it._

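Both sentences contain exactly the same words, so their bag-of-words encodings are identical, yet their meanings are opposite. Local word order can therefore be important when making decisions about text. The n-gram encoding captures local word order by counting sequences of $n$ adjacent words rather than single words. In the following example a bi-gram ($n=2$) encoding is constructed: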

<img src="images/n_grams.png" width="600px">

The above n-gram would be encoded in the sparse vector:

<img src="images/n_grams_table.png" width="300px">

Notice that the n-gram captures key pieces of sentiment information: `"pretty good"` and `"not like"`. 
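The contrast between the two encodings can be reproduced on the two example sentences with a short plain-Python sketch. The helper names `tokens` and `ngrams` are illustrative, and the tokenizer is a simple assumption (lowercase letters only), not a library routine.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens, punctuation dropped."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(words, n):
    """All runs of n adjacent words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

s1 = tokens("The novel is pretty good but I do not like it.")
s2 = tokens("The novel is not pretty good but I do like it.")

# Same bag of words (unigram counts) ...
print(Counter(s1) == Counter(s2))

# ... but different bigram counts.
b1, b2 = Counter(ngrams(s1, 2)), Counter(ngrams(s2, 2))
print(sorted(set(b1) - set(b2)))  # bigrams only in the first sentence
print(sorted(set(b2) - set(b1)))  # bigrams only in the second sentence
```

The bigram `("not", "like")` appears only in the first sentence and `("not", "pretty")` only in the second, which is the sentiment signal that the unigram counts cannot express.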

**Notice**
1. The n-gram representation is extremely sparse, and maintaining the dictionary of possible n-grams can be very costly. The hashing trick is a popular way to approximate the sparse n-gram encoding. 
2. As $N$ increases, the chance of seeing the same n-grams at prediction time decreases rapidly.
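The hashing trick mentioned in point 1 can be sketched as follows. This is a simplified, unsigned variant written for illustration: the function name, the bucket count of 16 and the use of `zlib.crc32` as the hash are all assumptions, not a description of any library's internals. scikit-learn ships a `HashingVectorizer` built on the same idea.

```python
import zlib

def hashed_bigram_counts(text, n_buckets=16):
    """Count bigrams into a fixed-size vector without storing a vocabulary."""
    words = text.lower().split()
    vector = [0] * n_buckets
    for first, second in zip(words, words[1:]):
        key = f"{first} {second}".encode("utf-8")
        # crc32 is stable across runs (Python's built-in hash() of str is not).
        bucket = zlib.crc32(key) % n_buckets
        vector[bucket] += 1  # distinct bigrams may collide in one bucket
    return vector

vec = hashed_bigram_counts("the novel is pretty good but i do not like it")
print(len(vec), sum(vec))
```

The vector length is fixed in advance no matter how many distinct bigrams appear, at the cost of occasional collisions and of no longer being able to map a column back to its bigram.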

Here is the same example using sklearn:

In [16]:
bigram = CountVectorizer(ngram_range=(2,2)) 
X = bigram.fit_transform(corpus)
print("Corpus: ")
for d in corpus:
    print(d)
print("\nWords:", bigram.get_feature_names_out().tolist())
print("\nSentence Encoding:")
print(X.toarray())

Corpus: 
This is the first sentence.
And this sentence is the second sentence.
Is this the third one?
I just know I am the fourth.

Words: ['am the', 'and this', 'first sentence', 'is the', 'is this', 'just know', 'know am', 'second sentence', 'sentence is', 'the first', 'the fourth', 'the second', 'the third', 'third one', 'this is', 'this sentence', 'this the']

Sentence Encoding:
[[0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0]
 [0 1 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1]
 [1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0]]


Similarly, there are fewer features if we add the `stop_words` parameter:

In [17]:
bigram_nostopwords = CountVectorizer(ngram_range=(2,2), stop_words="english") 
bigram_nostopwords.fit_transform(corpus)
print("Words:", bigram_nostopwords.get_feature_names_out().tolist())

Words: ['just know', 'know fourth', 'second sentence', 'sentence second']
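Features such as `'know fourth'` never occur as adjacent words in the corpus. They appear because the stop words are filtered out first and the bigrams are built from what remains. The sketch below mimics that order of operations in plain Python; the stop list is a hand-picked subset that covers this corpus (an assumption for illustration, the built-in English list is far longer).

```python
import re

# Subset of English stop words that occur in this corpus (an assumption made
# for illustration; scikit-learn's built-in list is far longer).
STOP_WORDS = {"this", "is", "the", "first", "and", "third", "one", "am"}

corpus = [
    "This is the first sentence.",
    "And this sentence is the second sentence.",
    "Is this the third one?",
    "I just know I am the fourth.",
]

features = set()
for doc in corpus:
    # Tokens of two or more characters, mirroring CountVectorizer's default.
    words = re.findall(r"\b\w\w+\b", doc.lower())
    kept = [w for w in words if w not in STOP_WORDS]  # filter first ...
    features.update(f"{a} {b}" for a, b in zip(kept, kept[1:]))  # ... then pair

print(sorted(features))
```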


So far we have introduced `CountVectorizer`, but in practice people often use another method, `TfidfVectorizer`, which is equivalent to `CountVectorizer` followed by `TfidfTransformer` and usually gives a more informative encoding. We will also use this feature vectorization method in our project, so let's see how it works.

### TF-IDF

`CountVectorizer` simply counts word frequencies, but in a corpus that is often not enough. A handful of common words account for most of the counts while carrying very little information about the content of a document. If we feed these counts directly to a classifier, those frequently occurring words will overshadow the rarer, more interesting terms. For example, words such as `crepuscular` and `petrichor` are highly informative when they do appear, but how often do we really use them?

With the [TF-IDF](https://spark.apache.org/docs/latest/mllib-feature-extraction.html) (Term Frequency–Inverse Document Frequency) Vectorizer, the value of a term increases proportionally to its count in a document, but is offset by how many documents in the corpus contain the term. TF–IDF is one of the most popular term-weighting schemes today: [83% of text-based recommender systems in digital libraries use tf–idf](https://en.wikipedia.org/wiki/Tf–idf#cite_note-2). The TF-IDF measure is simply the product of TF and IDF:

$$ TFIDF(t,d,D) = TF(t,d) \cdot IDF(t,D)$$

In the equation:

* $t$ is a term, $d$ is a document, and $D$ is the corpus.
* $TF(t,d)$: (Term Frequency) is the number of times that term $t$ appears in document $d$.
* $DF(t,D)$: (Document Frequency) is the number of documents in $D$ that contain term $t$.

If we only use Term Frequency to measure the importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g., `'a'`, `'the'`, and `'of'`. If a term appears very often across the corpus, it means it doesn’t carry special information about a particular document. Inverse Document Frequency is a numerical measure of how much information a term provides:

$$IDF(t,D) = \log\frac{|D|+1}{DF(t,D)+1}$$

There are several variants of the definitions of term frequency and inverse document frequency. By default, scikit-learn adds 1 to the logarithm above and then scales each document vector to unit Euclidean length. Details can be found in [sk-learn TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer).
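To make the IDF formula concrete, here is a small worked calculation on the four-sentence corpus used earlier, with $|D| = 4$. It is a plain-Python sketch of the formula exactly as written above; the corpus is typed in lowercase without punctuation, and the helper name `idf` is illustrative.

```python
import math

corpus = [
    "this is the first sentence",
    "and this sentence is the second sentence",
    "is this the third one",
    "i just know i am the fourth",
]
docs = [set(doc.split()) for doc in corpus]

def idf(term):
    """IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), as in the formula above."""
    df = sum(term in doc for doc in docs)
    return math.log((len(docs) + 1) / (df + 1))

for term in ["the", "this", "sentence", "first"]:
    print(f"{term:>8}: {idf(term):.3f}")
```

A term that occurs in every document, such as `the`, gets an IDF of exactly 0 under this formula, while a term found in a single document, such as `first`, gets the largest value.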

Here is the same example using TF-IDF Vectorizer:

In [18]:
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(corpus)
print("Corpus: ")
for d in corpus:
    print(d)
print("\nWords:", tfidf_vectorizer.get_feature_names_out().tolist())
print("\nSentence Encoding:")
print(X.toarray())

Corpus: 
This is the first sentence.
And this sentence is the second sentence.
Is this the third one?
I just know I am the fourth.

Words: ['am', 'and', 'first', 'fourth', 'is', 'just', 'know', 'one', 'second', 'sentence', 'the', 'third', 'this']

Sentence Encoding:
[[0.         0.         0.60759891 0.         0.38782252 0.
  0.         0.         0.         0.47903796 0.31707032 0.
  0.38782252]
 [0.         0.42358016 0.         0.         0.27036573 0.
  0.         0.         0.42358016 0.66791093 0.2210417  0.
  0.27036573]
 [0.         0.         0.         0.         0.36327702 0.
  0.         0.56914364 0.         0.         0.29700276 0.56914364
  0.36327702]
 [0.48380259 0.         0.         0.48380259 0.         0.48380259
  0.48380259 0.         0.         0.         0.25246826 0.
  0.        ]]


In the first sentence, `this`, `is`, `the`, `first` and `sentence` each appear once, so `CountVectorizer` would give all of them a value of 1, but with the TF-IDF Vectorizer their weights differ. Only the first sentence includes `first`, so its $DF(t,D)$ is lowest and its $IDF(t,D)$ is highest, while all four sentences contain `the`, so its $DF(t,D)$ is highest and its $IDF(t,D)$ is lowest. Without the Inverse Document Frequency part, less meaningful words such as `the` (assuming stop words are not removed) would carry as much weight as these less frequent words, or more.
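The first row of the matrix can be reproduced by hand, which also shows why the numbers are not a direct read-out of the IDF formula given earlier. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which the IDF is $\ln\frac{1+|D|}{1+DF(t,D)} + 1$ and each row is scaled to unit Euclidean length. The corpus is typed in lowercase without punctuation, and the helper names are illustrative.

```python
import math

corpus = [
    "this is the first sentence",
    "and this sentence is the second sentence",
    "is this the third one",
    "i just know i am the fourth",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

def sklearn_style_idf(term):
    """Smoothed IDF with the extra +1: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count times IDF, then scaled to unit Euclidean length."""
    weights = {t: doc.count(t) * sklearn_style_idf(t) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf_row(docs[0])
for term in sorted(row):
    print(f"{term:>8}: {row[term]:.8f}")
```

The printed weights line up with the non-zero entries of the first row above: `first` receives the largest weight and `the` the smallest, even though each word occurs exactly once in the sentence.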