# Prep

First things first, let's import the libraries we'll be using.  

In [20]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

Pandas will make our data easy to look at and work with.
CountVectorizer, part of scikit-learn, will take care of our NLP tasks.
LogisticRegression, also part of scikit-learn, will train and test our predictive models.

# Data Import

Now, let's [read](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) in the data with Pandas.  

In [21]:
data = pd.read_csv("./Data/Combined_News_DJIA.csv")

Now look at the head of the data.

In [22]:
data.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


We've got a lot of variables here, but the layout is pretty straightforward.  
As a reminder, the Label variable will be a **1** if the DJIA **stayed the same or rose** on that date or **0** if the DJIA **fell** on that date.

And finally, before we get started on the rest of the notebook, we need to split our data into a training set and a testing set. We'll use all of the dates up to the end of 2014 as our training data and everything after as testing data.

In [23]:
train = data[data['Date'] < '2015-01-01']
test = data[data['Date'] > '2014-12-31']
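
The split above compares dates as plain strings. That works because dates written as YYYY-MM-DD sort in the same order as the dates themselves. Here is a minimal sketch on a made-up four-row frame (the `toy` names and values are illustrative, not part of the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data; the dates and labels are made up.
toy = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-01", "2015-01-02"],
    "Label": [1, 0, 1, 0],
})

# ISO-formatted date strings sort chronologically, so string comparison
# is enough to separate the two periods.
toy_train = toy[toy["Date"] < "2015-01-01"]
toy_test = toy[toy["Date"] > "2014-12-31"]

# Every row lands in exactly one of the two sets.
print(len(toy_train), len(toy_test))
```

If the Date column were stored in another format (for example MM/DD/YYYY), string comparison would no longer be safe and converting with `pd.to_datetime` first would be the more robust choice.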

# Text Preprocessing

Now that our data is loaded in, we need to clean it up just a little bit to prepare it for the rest of our analysis.  
To illustrate this process, look at how the example headline below changes from cell to cell.  
Don't worry about the code too much here, since this example is only meant to be visual.

In [24]:
example = train.iloc[3,10]
print(example)

b"The commander of a Navy air reconnaissance squadron that provides the President and the defense secretary the airborne ability to command the nation's nuclear weapons has been relieved of duty"


In [25]:
example2 = example.lower()
print(example2)

b"the commander of a navy air reconnaissance squadron that provides the president and the defense secretary the airborne ability to command the nation's nuclear weapons has been relieved of duty"


In [26]:
example3 = CountVectorizer().build_tokenizer()(example2)
print(example3)

['the', 'commander', 'of', 'navy', 'air', 'reconnaissance', 'squadron', 'that', 'provides', 'the', 'president', 'and', 'the', 'defense', 'secretary', 'the', 'airborne', 'ability', 'to', 'command', 'the', 'nation', 'nuclear', 'weapons', 'has', 'been', 'relieved', 'of', 'duty']


In [27]:
pd.DataFrame([[x,example3.count(x)] for x in set(example3)], columns = ['Word', 'Count'])

Unnamed: 0,Word,Count
0,nation,1
1,provides,1
2,president,1
3,has,1
4,nuclear,1
5,been,1
6,secretary,1
7,the,5
8,to,1
9,command,1


Were you able to see everything that changed?  
The process involved:  
- Converting the headline to lowercase letters  
- Splitting the sentence into a list of words  
- Removing punctuation and single-character tokens (such as "a")  
- Transforming that list into a table of counts

What started as a relatively "messy" sentence has now become a neatly organized table!  
And while this may not be exactly what goes on behind the scenes with scikit-learn, this example should give you a pretty good idea about how it works.
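
For comparison, scikit-learn exposes its default lowercasing and tokenizing steps through `build_analyzer()`. The sketch below runs it on a shortened version of the example headline (the shortening is ours, purely for readability) and counts the tokens with `collections.Counter`. It is meant as an illustration of the steps listed above, not as a description of scikit-learn's internals.

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

# Shortened, hand-edited version of the example headline.
headline = "The commander of a Navy squadron has been relieved of duty"

# build_analyzer() bundles the default preprocessing (lowercasing) with the
# default tokenization (tokens made of two or more word characters).
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer(headline)

# Count how often each token appears.
counts = Counter(tokens)
print(tokens)
print(counts.most_common(3))
```

Note that common words such as "the" and "of" are kept under the default settings; only the single-character "a" is dropped.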

So now that you've seen what the text processing looks like, let's get started on the fun part, modeling!

# Basic Model Training and Testing

As mentioned previously, scikit-learn is going to take care of all of our preprocessing needs.  
The tool we'll be using is CountVectorizer, which takes a single list of strings as input, and produces word counts for each one.

You might be wondering if our dataframe meets this "single list of strings" criterion, and the answer to that is... it doesn't!  
In order to meet this criterion, we'll use the following [for loop](https://wiki.python.org/moin/ForLoop) to iterate through each row of our dataset, [combine](https://docs.python.org/3.5/library/stdtypes.html#str.join) all of our headlines into a single string, then [add](https://docs.python.org/3.5/tutorial/datastructures.html) that string to the list we need for CountVectorizer.

In [28]:
trainheadlines = []
for row in range(0,len(train.index)):
    trainheadlines.append(' '.join(str(x) for x in train.iloc[row,2:27]))
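
As an aside, the same row-by-row join can probably be written without an explicit loop. The sketch below uses a made-up two-row frame with only two headline columns (`toy`, `looped`, and `joined` are illustrative names) and checks that both approaches produce the same list.

```python
import pandas as pd

# Toy frame with the same leading columns as the real data (Date, Label)
# followed by just two headline columns; the text is made up.
toy = pd.DataFrame({
    "Date": ["2008-08-08", "2008-08-11"],
    "Label": [0, 1],
    "Top1": ["first headline", "third headline"],
    "Top2": ["second headline", "fourth headline"],
})

# Loop version, mirroring the cell above (headline columns start at position 2).
looped = []
for row in range(len(toy.index)):
    looped.append(" ".join(str(x) for x in toy.iloc[row, 2:]))

# Row-wise version: cast the headline columns to str, then join each row.
joined = toy.iloc[:, 2:].astype(str).apply(" ".join, axis=1).tolist()

print(joined)
```

The explicit loop is arguably easier to read for newcomers; the row-wise version is mostly a matter of taste.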

With our headlines formatted, we can set up our CountVectorizer.  
To start, let's just use the default settings and see how it goes!  
Below, we'll name our default vectorizer, then [use](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit_transform) it on our list of combined headlines.  
After that, we'll take a look at the size of the result to see how many words we have.

In [29]:
basicvectorizer = CountVectorizer()
basictrain = basicvectorizer.fit_transform(trainheadlines)
print(basictrain.shape)

(1611, 31675)


Wow! Our resulting table contains counts for 31,675 different words!

Now, let's train a logistic regression model using this data.  
In the cell below, we're simply naming our model, then [fitting](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) the model based on our X and Y values.

In [30]:
basicmodel = LogisticRegression()
basicmodel = basicmodel.fit(basictrain, train["Label"])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
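
The warning above says the solver stopped at its iteration limit before converging; the model is still fitted and can be used. One response the warning itself suggests is raising `max_iter`. Here is a minimal sketch on made-up documents (the value 1000 is an arbitrary illustration, not a tuned setting, and the results in this notebook were produced with the defaults):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus with made-up labels, just to have something to fit.
docs = ["markets rally on earnings", "markets fall on fears",
        "stocks rally again", "stocks fall again"]
labels = [1, 0, 1, 0]

X = CountVectorizer().fit_transform(docs)

# max_iter=1000 is an illustrative value; the scikit-learn default is lower.
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# n_iter_ reports how many iterations the solver actually used.
print(model.n_iter_)
```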


Our model is ready to go, so let's set up our test data.  
Here, we're just going to repeat the steps we used to prep our training data, then [predict](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict) whether the DJIA increased or decreased for each day in the test dataset.

In [31]:
testheadlines = []
for row in range(0,len(test.index)):
    testheadlines.append(' '.join(str(x) for x in test.iloc[row,2:27]))
basictest = basicvectorizer.transform(testheadlines)
predictions = basicmodel.predict(basictest)
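
Notice that the test headlines go through `transform`, not `fit_transform`: the vocabulary learned from the training text is reused, so the test matrix has the same columns the model was trained on. A small sketch with made-up documents shows what happens to a word the vectorizer has never seen:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training and test documents.
train_docs = ["russia georgia war", "georgia peace talks"]
test_docs = ["georgia olympics war"]

vec = CountVectorizer()
vec.fit(train_docs)  # the vocabulary comes from the training text only

# transform() maps new text onto the existing vocabulary.
test_matrix = vec.transform(test_docs)

print(sorted(vec.vocabulary_))
print(test_matrix.toarray())
```

The word "olympics" never appeared in the training documents, so it has no column and is silently ignored in the test matrix.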

The predictions are set, so let's use a [crosstab](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html) to take a look at the results!

In [32]:
pd.crosstab(test["Label"], predictions, rownames=["Actual"], colnames=["Predicted"])

Predicted,0,1
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1
0,61,125
1,92,100


Prediction accuracy is just over 42%. It seems like this model isn't too reliable.  
Now, let's also take a look at the coefficients of our model. (Excellent request from [Lucie](https://www.kaggle.com/luciegattepaille)!)

The cell below will get a list of the names from our CountVectorizer and a list of the coefficients from our model, then combine the two lists into a Pandas dataframe.  
Once that's made, we can sort it and check out the top 10 positive and negative coefficients.

In [34]:
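# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
basicwords = basicvectorizer.get_feature_names_out()
basiccoeffs = basicmodel.coef_.tolist()[0]
coeffdf = pd.DataFrame({'Word' : basicwords, 
                        'Coefficient' : basiccoeffs})
# Sort by coefficient (descending), breaking ties alphabetically by word.
coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[False, True])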
coeffdf.head(10)



Unnamed: 0,Word,Coefficient
19419,nigeria,0.50398
25261,self,0.462117
15998,korea,0.434983
29286,tv,0.426209
20135,olympics,0.426088
26323,so,0.419507
15843,kills,0.40832
10874,fears,0.39806
29256,turn,0.394045
28274,territory,0.384038


In [35]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
8478,did,-0.429427
27299,students,-0.430881
6683,congo,-0.433462
12818,hacking,-0.449924
7139,country,-0.451514
16949,low,-0.465453
3651,begin,-0.470763
25433,sex,-0.4952
24754,sanctions,-0.550044
24542,run,-0.599664


Our most positive words don't seem particularly interesting; however, there are some negative-sounding words within our bottom 10, such as "sanctions," "low," and "hacking."  
Maybe the saying "no news is good news" is true here?

# Advanced Modeling

The technique we just used is known as a **bag-of-words** model. We essentially placed all of our headlines into a "bag" and counted the words as we pulled them out.  
However, most people would agree that a single word doesn't always have enough meaning by itself.  
Obviously, we need to consider the rest of the words in the sentence as well!  

This is where the **n-gram** model comes in.  
In this model, n represents the length of a sequence of words to be counted.  
This means our bag-of-words model was the same as an n-gram model where n = 1.  
So now, let's see what happens when we run an n-gram model where n = 2.
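
To make the difference concrete, here is a small sketch that fits two vectorizers on a single made-up headline, one with n = 1 and one with n = 2, and prints the features each one learns:

```python
from sklearn.feature_extraction.text import CountVectorizer

# One made-up headline is enough to see the difference.
docs = ["russia ends georgia operation"]

# ngram_range=(1, 1) counts single words; (2, 2) counts two-word sequences.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))
print(sorted(bigrams.vocabulary_))
```

Four words yield three overlapping two-word sequences, which hints at why the feature count grows so quickly on real text.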

Below, we'll create a new CountVectorizer with the ngram_range parameter set to (2,2) instead of the default value of (1,1), so that only two-word sequences are counted.

In [36]:
advancedvectorizer = CountVectorizer(ngram_range=(2,2))
advancedtrain = advancedvectorizer.fit_transform(trainheadlines)

Now that we've run our vectorizer, let's see what our data looks like this time around.


In [37]:
print(advancedtrain.shape)

(1611, 366721)


This time we have 366,721 unique variables representing two-word combinations!  
And here I thought last time was big...

So, just like last time, let's name and fit our regression model.

In [38]:
advancedmodel = LogisticRegression()
advancedmodel = advancedmodel.fit(advancedtrain, train["Label"])

And again like last time, let's transform our test data and make some predictions!

In [39]:
testheadlines = []
for row in range(0,len(test.index)):
    testheadlines.append(' '.join(str(x) for x in test.iloc[row,2:27]))
advancedtest = advancedvectorizer.transform(testheadlines)
advpredictions = advancedmodel.predict(advancedtest)

Crosstab says...!


In [40]:
pd.crosstab(test["Label"], advpredictions, rownames=["Actual"], colnames=["Predicted"])

Predicted,0,1
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1
0,62,124
1,43,149


This time we're up to nearly 56% prediction accuracy.  
We might only consider this a slight improvement, but keep in mind that we've barely scratched the surface of NLP here, and we haven't even touched more advanced machine learning techniques.  
Let's check out our coefficients again as well!