# Reddit News Data Exploration and Cleaning

Below I will be exploring the Reddit News data source from [kaggle](https://www.kaggle.com/aaron7sun/stocknews). The eventual goal is to predict stock movements from news headlines posted to the /r/news subreddit. The input data that we will be working with is a set of headlines. To prepare it for training, we need to "clean" the data. This process is shown below.

---

## Basic Data Measures

In [45]:
import pandas as pd
import numpy as np
import nltk               #natural language tool kit
import re                 #regular expression library

Below, the data is imported from a CSV file into a pandas DataFrame and the first ten rows are printed. The dataset has two columns: `Date` and `News`. As is apparent below, the Reddit news headlines are contained in the `News` column.

In [63]:
pre_clean_newsdf = pd.read_csv('../../Data/RedditNews.csv')
pre_clean_newsdf.head(10)

Unnamed: 0,Date,News
0,2016-07-01,A 117-year-old woman in Mexico City finally re...
1,2016-07-01,IMF chief backs Athens as permanent Olympic host
2,2016-07-01,"The president of France says if Brexit won, so..."
3,2016-07-01,British Man Who Must Give Police 24 Hours' Not...
4,2016-07-01,100+ Nobel laureates urge Greenpeace to stop o...
5,2016-07-01,Brazil: Huge spike in number of police killing...
6,2016-07-01,Austria's highest court annuls presidential el...
7,2016-07-01,"Facebook wins privacy case, can track any Belg..."
8,2016-07-01,Switzerland denies Muslim girls citizenship af...
9,2016-07-01,China kills millions of innocent meditators fo...


Here the dimensions of the dataset are checked. We can see that there are 73608 rows in this dataframe (i.e. that many news headlines). Next, running `count()` on each column checks for `NaN` values, since `count()` tallies only non-null entries. Because both counts equal the number of rows, we know that there are no missing values; the `isnull()` check at the end confirms this.

In [47]:
pre_clean_newsdf.shape

(73608, 2)

In [48]:
pre_clean_newsdf['Date'].count()

73608

In [49]:
pre_clean_newsdf['News'].count()

73608

In [50]:
#No null values
pre_clean_newsdf.isnull().values.any()

False
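
To see why comparing `count()` against the row count detects missing values, here is a minimal sketch on a toy frame. The frame `toy` and its contents are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing headline
toy = pd.DataFrame({
    'Date': ['2016-07-01', '2016-07-01', '2016-06-30'],
    'News': ['first headline', np.nan, 'third headline'],
})

n_rows = len(toy)                         # total rows, NaN included
n_news = toy['News'].count()              # non-null entries only
missing_per_column = toy.isnull().sum()   # number of NaN values per column

print(n_rows, n_news)
print(missing_per_column)
```

When the two numbers differ, as they do here, the column has missing values; `isnull().sum()` additionally reports how many per column.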

The Date column is dropped: it is not used in our prediction and serves only to sequence the data, so it does not need cleaning.

In [64]:
# drop() returns a new frame, so pre_clean_newsdf keeps its Date column
pre_clean_newsdf2 = pre_clean_newsdf.drop('Date', axis=1)

In [66]:
pre_clean_newsdf2.head(10)

Unnamed: 0,News
0,A 117-year-old woman in Mexico City finally re...
1,IMF chief backs Athens as permanent Olympic host
2,"The president of France says if Brexit won, so..."
3,British Man Who Must Give Police 24 Hours' Not...
4,100+ Nobel laureates urge Greenpeace to stop o...
5,Brazil: Huge spike in number of police killing...
6,Austria's highest court annuls presidential el...
7,"Facebook wins privacy case, can track any Belg..."
8,Switzerland denies Muslim girls citizenship af...
9,China kills millions of innocent meditators fo...


In [73]:
pre_clean_newsdf2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73608 entries, 0 to 73607
Data columns (total 1 columns):
News    73608 non-null object
dtypes: object(1)
memory usage: 575.1+ KB


---

## Cleaning the Data

To begin, all letters should be lower-case, since we are primarily interested in semantics and capitalization doesn't help us here from a machine's perspective.

In [79]:
# converts all uppercase chars in a string to lower case
def toLowerCase(sent):
    return sent.lower()

# apply to the News column of the frame created above
pre_clean__newsdf3 = pre_clean_newsdf2['News'].apply(toLowerCase)

In [80]:
pre_clean__newsdf3.head(10)

0    a 117-year-old woman in mexico city finally re...
1     imf chief backs athens as permanent olympic host
2    the president of france says if brexit won, so...
3    british man who must give police 24 hours' not...
4    100+ nobel laureates urge greenpeace to stop o...
5    brazil: huge spike in number of police killing...
6    austria's highest court annuls presidential el...
7    facebook wins privacy case, can track any belg...
8    switzerland denies muslim girls citizenship af...
9    china kills millions of innocent meditators fo...
Name: News, dtype: object
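
As a side note, pandas offers a vectorized string accessor that performs the same lower-casing without a helper function. The sketch below uses a made-up two-headline Series (`headlines` is a hypothetical stand-in for the News column).

```python
import pandas as pd

# Hypothetical two-headline Series standing in for the News column
headlines = pd.Series(['IMF chief backs Athens', 'Brazil: Huge spike'], name='News')

# Vectorized equivalent of headlines.apply(toLowerCase)
lowered = headlines.str.lower()
print(lowered.tolist())
```

One practical difference: `.str.lower()` passes `NaN` entries through unchanged, whereas calling `.lower()` on a missing value inside `apply` would raise an error. That does not matter here because the column has no missing values.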

Next, syntax plays a minor role when it comes to news headlines, as these strings are closer to short phrases than to structured sentences. Therefore, punctuation will be removed. The regular expression below matches a punctuation symbol (i.e. any character that is not a letter, digit, or underscore) together with adjacent whitespace and replaces the match with a single space. This is shown below.

In [82]:
# this currently seems to be doing everything I need it to
pre_clean_newsdf4 = pre_clean__newsdf3.apply(lambda sent: re.sub(r'\s\W*|\W\s*',' ',sent))

In [14]:
# Alternative to the one-liner above; not needed for the rest of the pipeline
def remove_punc(sent):
    sent = re.sub(r'[^a-z0-9\s-]+', '', sent)  # keeps only lowercase alphanumerics, whitespace and '-'
    sent = re.sub(r'-', ' ', sent)             # replaces each '-' with a space
    sent = re.sub(r'\s+', ' ', sent)           # collapses runs of whitespace into a single space
    return sent

# might want to look into replacing punctuation that has no space around it with a space,
# and punctuation that is next to a space with nothing

ns_lc_newsdf = pre_clean__newsdf3.apply(remove_punc)

**Woohoo** no punctuation.

In [83]:
pre_clean_newsdf4.head(10) #this Series contains non-punctuated lower case sentences

0    a 117 year old woman in mexico city finally re...
1     imf chief backs athens as permanent olympic host
2    the president of france says if brexit won so ...
3    british man who must give police 24 hours noti...
4    100 nobel laureates urge greenpeace to stop op...
5    brazil huge spike in number of police killings...
6    austria s highest court annuls presidential el...
7    facebook wins privacy case can track any belgi...
8    switzerland denies muslim girls citizenship af...
9    china kills millions of innocent meditators fo...
Name: News, dtype: object
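
The behaviour of the punctuation-stripping pattern can be checked on a few short strings. The sketch below reuses the same regular expression; the names `PUNCT`, `strip_punct` and `collapse_punct` are introduced here only for illustration.

```python
import re

PUNCT = re.compile(r'\s\W*|\W\s*')  # same pattern as in the cleaning cell

def strip_punct(sent):
    # Replace each match (punctuation plus adjacent whitespace) with one space
    return PUNCT.sub(' ', sent)

def collapse_punct(sent):
    # Simpler variant: replace any run of non-word characters with one space
    return re.sub(r'\W+', ' ', sent)

samples = [
    "austria's highest court",
    "a 117-year-old woman",
    "if brexit won, so can trump",
    "100+ nobel laureates",
]
for s in samples:
    print(repr(strip_punct(s)), repr(collapse_punct(s)))
```

On these samples both versions give the same result. They differ when punctuation marks are adjacent with no whitespace between them (for example `"what?!"`): the original pattern emits one space per mark, while `\W+` emits a single space. Since the text is tokenized afterwards, the extra spaces are harmless. Note also that `\W` does not match the underscore, so underscores survive either version.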

### Removing Stop Words

When looking at this data from a tokenized perspective, many of the individual words offer little help to a machine trying to derive semantic meaning from the text. Additionally, the more data we work with, the harder it is to train a model, especially in the deep learning space where computation requirements skyrocket. Our main goal for each headline is to identify the important entities (names, places, concrete/nouny stuff) and to support part-of-speech identification. The connective fabric between these entities is made up of what are called [stop words](https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html). Essentially, stop words are very common function words such as articles, prepositions, pronouns, and conjunctions. In terms of deciphering the knowledge to be gained from a sentence these words offer little value.

Below, a list of stop words is grabbed from the NLTK library.

In [25]:
# Now to remove stop words
#nltk.download() 
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
sw = set(stopwords.words("english"))
print(sw)

{'isn', 'why', 'against', 'to', 'no', 'its', 'aren', 'wasn', 'doesn', 'hadn', 'them', 'as', 'if', 'now', 'will', 'is', 'each', 'hasn', 'shouldn', 'from', 'your', 'won', 'down', 'only', 'than', 'wouldn', "weren't", "shan't", "wouldn't", "mustn't", 'not', "that'll", 'under', 've', 'once', 'again', 'this', 'it', 'were', 'which', 'or', 'who', 'nor', 'doing', 'between', 'being', 'about', 'after', 'before', 'me', "you'd", 'mightn', 'yours', 'should', 'y', 'his', 'off', 'did', 're', 'he', 'couldn', 'how', 'himself', 'herself', 'ma', 'then', 'and', 'while', 'been', "wasn't", 'but', 'all', 'him', 'are', 'had', 'mustn', 'very', "you're", "didn't", 'have', 'her', 'i', 'of', 'when', 'they', 'above', 'some', 'needn', 'ourselves', 'out', 'on', 'over', 'during', 'has', 'an', 'whom', 't', 'too', "hasn't", 'these', 'having', 'yourselves', 's', "shouldn't", 'own', 'most', 'here', 'she', 'such', 'o', 'shan', 'itself', 'the', "haven't", 'hers', 'both', 'there', "isn't", "hadn't", 'a', 'just', 'be', 'our',

Before the stop words can be removed, the data needs to be tokenized.

In [26]:
stops_newsdf = pre_clean_newsdf4.apply(word_tokenize)

In [85]:
stops_newsdf.head(10)

0    [a, 117, year, old, woman, in, mexico, city, f...
1    [imf, chief, backs, athens, as, permanent, oly...
2    [the, president, of, france, says, if, brexit,...
3    [british, man, who, must, give, police, 24, ho...
4    [100, nobel, laureates, urge, greenpeace, to, ...
5    [brazil, huge, spike, in, number, of, police, ...
6    [austria, s, highest, court, annuls, president...
7    [facebook, wins, privacy, case, can, track, an...
8    [switzerland, denies, muslim, girls, citizensh...
9    [china, kills, millions, of, innocent, meditat...
Name: News, dtype: object

Now the stop words can be discarded.

In [86]:
def remove_stops(sentence):
    no_stops_sentence = [w for w in sentence if w not in sw]   #sw = set of stop words
    return no_stops_sentence

no_stops_newsdf = stops_newsdf.apply(remove_stops)

In [90]:
no_stops_newsdf.head(10)

0    [117, year, old, woman, mexico, city, finally,...
1    [imf, chief, backs, athens, permanent, olympic...
2     [president, france, says, brexit, donald, trump]
3    [british, man, must, give, police, 24, hours, ...
4    [100, nobel, laureates, urge, greenpeace, stop...
5    [brazil, huge, spike, number, police, killings...
6    [austria, highest, court, annuls, presidential...
7    [facebook, wins, privacy, case, track, belgian...
8    [switzerland, denies, muslim, girls, citizensh...
9    [china, kills, millions, innocent, meditators,...
Name: News, dtype: object
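
The tokenize-then-filter step can be illustrated without downloading any NLTK data. In the sketch below a whitespace split stands in for `word_tokenize`, and `toy_stops` is a small hand-picked subset of NLTK's English stop-word list; both are simplifying assumptions, not what the notebook itself runs.

```python
# A few entries from NLTK's English stop-word list (assumption: the
# notebook uses the full set `sw`, this subset is only for illustration)
toy_stops = {'a', 'in', 'the', 'of', 'if', 's', 'so', 'can', 'as'}

def tokenize(sentence):
    # Whitespace split; adequate only because punctuation is already removed
    return sentence.split()

def remove_stops(tokens, stops):
    # Keep tokens that are not stop words, preserving their order
    return [w for w in tokens if w not in stops]

headline = 'imf chief backs athens as permanent olympic host'
tokens = tokenize(headline)
kept = remove_stops(tokens, toy_stops)
print(kept)
```

Keeping the stop words in a `set` rather than a list makes each membership test a constant-time lookup on average, which is worthwhile when the filter runs over tens of thousands of headlines.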

## Stemming/Lemmatizing The Data

[Info](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

### Some things Joe mentioned 

- Backoff tagger
- HMM Tagger
- named entity recognition
- part of speech tagger
- noun phrase chunking
- look for map from company to stock market
- multi objective neural networks

Below are the other data tables.

In [88]:
df = pd.read_csv('../../Data/DJIA_table.csv')
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141
1,2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234
2,2016-06-29,17456.019531,17704.509766,17456.019531,17694.679688,106380000,17694.679688
3,2016-06-28,17190.509766,17409.720703,17190.509766,17409.720703,112190000,17409.720703
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234


In [89]:
df2 = pd.read_csv('../../Data/Combined_News_DJIA.csv')
df2.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...
