# Bidirectional LSTM for News Source Classification

### Author 
Stephen Lee

### Goal
Classify the news source of an article from its text. The training data comes from three sources:
- Fox News
- Vox News
- PBS News

### Date 
3.4.19

## Read Data, Remove Missing Values

In [2]:
from keras.preprocessing.text import Tokenizer 
from keras.preprocessing.sequence import pad_sequences 


import os 
import math 
import numpy as np 
import pandas as pd 

from sklearn.model_selection import train_test_split

In [3]:
os.getcwd()

'/Users/stevelee/Dropbox/General/Projects/Thesis/code/clean-data'

In [4]:
FOLDER_READ = '/Users/stevelee/Dropbox/General/Projects/Thesis/data'
FILE = 'articles.csv'

In [5]:
os.chdir(FOLDER_READ)

In [6]:
os.listdir()

['.DS_Store', 'archive', 'clean_article_df.csv', 'articles.csv', 'raw']

In [7]:
df_all = pd.read_csv(FILE, sep='|').drop('Unnamed: 0', axis=1)
df_all.head()

,article id,source,article
0,fox_politics_166,Fox,Video\n<br>\nFormer New Jersey Gov. Chris Chri...
1,fox_politics_390,Fox,"FILE--In this July 28, 2016 file photo, Sen. B..."
2,fox_politics_423,Fox,"Video\nHoward Kurtz: How Michael Cohen, Democr..."
3,fox_politics_102,Fox,Video\nStudent Union: Make UC Berkeley a sanct...
4,fox_politics_492,Fox,Video\nPresident Trump’s health care executive...


In [8]:
df_all.groupby('source').count()

source,article id,article
Fox,1024,1023
PBS,1752,1752
Vox,2000,1938


In [9]:
df_all = df_all.dropna()
df_all.groupby('source').count()

source,article id,article
Fox,1023,1023
PBS,1752,1752
Vox,1938,1938
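
The per-source counts above imply one Fox row and 62 Vox rows with missing article text (1024 vs. 1023 and 2000 vs. 1938). A minimal sketch of how those gaps could be counted directly before dropping them; the `toy` frame is a hypothetical stand-in for `df_all`, and restricting `dropna` to the `article` column is an assumption, not what the notebook does.

```python
import pandas as pd

# Hypothetical stand-in for df_all; the real frame is loaded from articles.csv.
toy = pd.DataFrame({
    "article id": ["fox_1", "fox_2", "vox_1", "vox_2", "pbs_1"],
    "source": ["Fox", "Fox", "Vox", "Vox", "PBS"],
    "article": ["text a", None, None, "text b", "text c"],
})

# Count rows with missing article text, per source, before dropping them.
missing_per_source = toy["article"].isna().groupby(toy["source"]).sum()

# Drop only the rows whose article text is missing.
toy_clean = toy.dropna(subset=["article"])
```

Limiting `dropna` to one column keeps rows that are complete in the text field even if some other column is empty.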


### Check for and remove duplicates

In [10]:
df_all.groupby("source").describe()

source,article count,article unique,article top,article freq,article id count,article id unique,article id top,article id freq
Fox,1023,661,Video\nTrump: The media refuses to acknowledge...,4,1023,1023,fox_politics_454,1
PBS,1752,1739,"It is messy, tentacled, and increasingly confu...",5,1752,1752,pbs_politics_1621,1
Vox,1938,1027,"Part of The 2018 midterm elections, explained",152,1938,1938,vox_politics_1220,1


In [11]:
df_all = df_all.drop_duplicates('article', keep='first')
df_all.groupby("source").describe()

source,article count,article unique,article top,article freq,article id count,article id unique,article id top,article id freq
Fox,661,661,Gov. Kristi Noem smiles after signing her firs...,1,661,661,fox_politics_85,1
PBS,1739,1739,WASHINGTON — House Speaker Paul Ryan announced...,1,1739,1739,pbs_politics_1621,1
Vox,1027,1027,"ROSWELL, Georgia — Political strategist Brian ...",1,1027,1027,vox_politics_690,1
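
Before dropping duplicates it can help to see how many rows are affected per source. The sketch below uses a hypothetical `toy` frame in place of `df_all`; `duplicated(..., keep=False)` flags every copy of a repeated text, while the `drop_duplicates` call mirrors the one in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_all.
toy = pd.DataFrame({
    "source": ["Vox", "Vox", "Vox", "Fox"],
    "article": ["Part of X", "Part of X", "Unique text", "Part of X"],
})

# Flag every row whose article text occurs more than once (all copies).
dup_mask = toy.duplicated("article", keep=False)
dup_counts = toy[dup_mask].groupby("source").size()

# Same call as in the notebook: keep the first occurrence of each text.
deduped = toy.drop_duplicates("article", keep="first")
```

Because the comparison uses the `article` column alone, a text that appears under two sources survives only in the row that comes first.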


### Remove "Video\n"
Many of the Fox News articles start with "Video\n". This cell removes that string from every article.

In [12]:
df_all['article'] = df_all['article'].str.replace('Video\n','')
df_all.head()

,article id,source,article
0,fox_politics_166,Fox,<br>\nFormer New Jersey Gov. Chris Christie sa...
1,fox_politics_390,Fox,"FILE--In this July 28, 2016 file photo, Sen. B..."
2,fox_politics_423,Fox,"Howard Kurtz: How Michael Cohen, Democrats sto..."
3,fox_politics_102,Fox,Student Union: Make UC Berkeley a sanctuary ca...
4,fox_politics_492,Fox,President Trump’s health care executive order:...
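
The replacement above removes "Video\n" wherever it occurs. If only the leading prefix should go, an anchored pattern is one option. This is a sketch on hypothetical sample strings, not the notebook's method; `regex=True` is passed explicitly because the default of `str.replace` has changed across pandas versions.

```python
import pandas as pd

# Hypothetical sample texts.
articles = pd.Series([
    "Video\nFormer Gov. speaks",
    "Watch the Video\nbelow",
    "No prefix here",
])

# Strip "Video\n" only when it opens the article; "^" matches the start of the string.
stripped = articles.str.replace(r"^Video\n", "", regex=True)
```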


### Look at text from Fox, Vox, and PBS for idiosyncrasies

In [13]:
df_all[df_all['source'] == "Fox"].head(15)

,article id,source,article
0,fox_politics_166,Fox,<br>\nFormer New Jersey Gov. Chris Christie sa...
1,fox_politics_390,Fox,"FILE--In this July 28, 2016 file photo, Sen. B..."
2,fox_politics_423,Fox,"Howard Kurtz: How Michael Cohen, Democrats sto..."
3,fox_politics_102,Fox,Student Union: Make UC Berkeley a sanctuary ca...
4,fox_politics_492,Fox,President Trump’s health care executive order:...
5,fox_politics_554,Fox,Former Bernie Sanders campaign staffer on repo...
6,fox_politics_490,Fox,Trump takes on ObamaCare subsidies\nBrad Blake...
7,fox_politics_590,Fox,Washington State proposes a new carbon tax\nTh...
8,fox_politics_1,Fox,Do voters think Ocasio-Cortez's Green New Deal...
9,fox_politics_971,Fox,President Trump gave Dr. Ronny Jackson a clean...


In [14]:
df_all[df_all['source'] == "Vox"].head(15)

,article id,source,article
1024,vox_politics_396,Vox,Senate Republicans on Thursday revealed the Be...
1025,vox_politics_372,Vox,"“New York will be destroyed,” the state’s Gov...."
1026,vox_politics_602,Vox,The Trump administration wants to send a messa...
1027,vox_politics_1198,Vox,"Donald Trump’s long, improbable journey to pol..."
1028,vox_politics_682,Vox,The Trump administration threw the fate of the...
1029,vox_politics_1634,Vox,"On Wednesday, the White House released a state..."
1030,vox_politics_976,Vox,"Part of The 2018 midterm elections, explained"
1031,vox_politics_590,Vox,Part of Understanding the Trump era
1032,vox_politics_71,Vox,Republicans and Democrats in Congress have fin...
1033,vox_politics_714,Vox,"Two months ago, things looked dire for Obamaca..."


In [15]:
df_all[df_all['source'] == "PBS"].head(15)

,article id,source,article
3024,pbs_politics_396,PBS,President Donald Trump’s longtime personal law...
3025,pbs_politics_372,PBS,WASHINGTON — Facing a midnight deadline to avo...
3026,pbs_politics_602,PBS,WASHINGTON — President Donald Trump is exagger...
3027,pbs_politics_1198,PBS,\nPresident Donald Trump says newly confirmed ...
3028,pbs_politics_682,PBS,President Donald Trump is adding a new lawyer ...
3029,pbs_politics_1634,PBS,DALLAS — U.S. Rep. Joe Barton told a woman tha...
3030,pbs_politics_976,PBS,WASHINGTON — President Donald Trump said Tuesd...
3031,pbs_politics_590,PBS,"ALEXANDRIA, Va. — In a blistering back-and for..."
3032,pbs_politics_71,PBS,WASHINGTON (AP) — Michael Cohen’s closed-door ...
3033,pbs_politics_714,PBS,Supreme Court nominee Brett Kavanaugh says he ...


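### Remove location text in PBS
This removes the string "WASHINGTON" and, in the next cell, the em dash that follows a dateline. Other datelines, such as "DALLAS" or "ALEXANDRIA, Va.", stay in place, and both replacements apply to every source and every position in the text, not only to the start of PBS articles.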

In [16]:
df_all['article'] = df_all['article'].str.replace('WASHINGTON', '')
df_all[df_all['source'] == "PBS"].head(20)

,article id,source,article
3024,pbs_politics_396,PBS,President Donald Trump’s longtime personal law...
3025,pbs_politics_372,PBS,— Facing a midnight deadline to avoid a parti...
3026,pbs_politics_602,PBS,— President Donald Trump is exaggerating the ...
3027,pbs_politics_1198,PBS,\nPresident Donald Trump says newly confirmed ...
3028,pbs_politics_682,PBS,President Donald Trump is adding a new lawyer ...
3029,pbs_politics_1634,PBS,DALLAS — U.S. Rep. Joe Barton told a woman tha...
3030,pbs_politics_976,PBS,— President Donald Trump said Tuesday that th...
3031,pbs_politics_590,PBS,"ALEXANDRIA, Va. — In a blistering back-and for..."
3032,pbs_politics_71,PBS,(AP) — Michael Cohen’s closed-door testimony ...
3033,pbs_politics_714,PBS,Supreme Court nominee Brett Kavanaugh says he ...


In [17]:
df_all['article'] = df_all['article'].str.replace(u"\u2014", "")
df_all[df_all['source'] == "PBS"].head(25)

,article id,source,article
3024,pbs_politics_396,PBS,President Donald Trump’s longtime personal law...
3025,pbs_politics_372,PBS,Facing a midnight deadline to avoid a partia...
3026,pbs_politics_602,PBS,President Donald Trump is exaggerating the n...
3027,pbs_politics_1198,PBS,\nPresident Donald Trump says newly confirmed ...
3028,pbs_politics_682,PBS,President Donald Trump is adding a new lawyer ...
3029,pbs_politics_1634,PBS,DALLAS U.S. Rep. Joe Barton told a woman that...
3030,pbs_politics_976,PBS,President Donald Trump said Tuesday that the...
3031,pbs_politics_590,PBS,"ALEXANDRIA, Va. In a blistering back-and fort..."
3032,pbs_politics_71,PBS,(AP) Michael Cohen’s closed-door testimony b...
3033,pbs_politics_714,PBS,Supreme Court nominee Brett Kavanaugh says he ...
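
As the output shows, datelines such as "DALLAS" and "ALEXANDRIA, Va." remain after these two steps. One possible alternative is a single anchored pattern that removes the whole dateline. The `DATELINE` pattern and the sample strings below are hypothetical and would need checking against the real articles; the pattern would also strip any other all-caps opening followed by an em dash.

```python
import pandas as pd

# Hypothetical pattern: an all-caps city, an optional state abbreviation,
# an optional "(AP)" credit, then an em dash (U+2014), all at the start of the text.
DATELINE = r"^\s*[A-Z]{2,}[A-Za-z .,]{0,30}?(?:\s*\(AP\))?\s*\u2014\s*"

# Hypothetical samples modelled on the PBS rows shown above.
articles = pd.Series([
    "WASHINGTON \u2014 Facing a midnight deadline",
    "ALEXANDRIA, Va. \u2014 In a blistering exchange",
    "WASHINGTON (AP) \u2014 Michael Cohen testified",
    "President Donald Trump is adding a new lawyer",
])

no_dateline = articles.str.replace(DATELINE, "", regex=True)
```

Anchoring at the start of the string leaves em dashes and city names inside the body of an article untouched.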


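#### Remove Associated Press credits
The target is the Associated Press credit in Fox articles, but the replacements below strip the strings "Associated", "Press", and "AP" from every article, wherever they occur.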

In [18]:
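# Remove wire-service credit strings from every article (all sources, any position).
df_all['article'] = df_all['article'].str.replace("Associated", "")
df_all['article'] = df_all['article'].str.replace("Press", "")
# The fully cleaned text is stored in a new column, 'clean_articles'.
df_all['clean_articles'] = df_all['article'].str.replace("AP", "")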
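
Plain substring replacement also removes "Press" from phrases like "Press secretary" and "AP" from inside longer upper-case words. A hedged alternative is a pattern with word boundaries. The `CREDIT` pattern and the sample strings are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample texts.
articles = pd.Series([
    "The Associated Press contributed to this report.",
    "WASHINGTON (AP) Lawmakers met on CAPITOL Hill.",
    "Press secretary Sarah Sanders spoke on Tuesday.",
])

# Hypothetical pattern: the "(AP)" tag, the phrase "Associated Press",
# or a standalone "AP" token; \b leaves "AP" inside "CAPITOL" intact.
CREDIT = r"\(AP\)|\bAssociated Press\b|\bAP\b"
cleaned = articles.str.replace(CREDIT, "", regex=True).str.strip()
```

The replacement can leave double spaces behind, so a whitespace-normalizing step may be needed afterwards.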

## Change source into a target number

In [19]:
TARGETS = 4
tokenizer = Tokenizer(num_words=TARGETS)
tokenizer.fit_on_texts(df_all['source'])

targets = tokenizer.texts_to_sequences(df_all['source'])
df_all['targets'] = [(i[0] - 1) for i in targets]
df_all[df_all['source'] == "Fox"].describe()

,targets
count,661.0
mean,2.0
std,0.0
min,2.0
25%,2.0
50%,2.0
75%,2.0
max,2.0


In [20]:
df_all[df_all['source'] == "Vox"].describe()

,targets
count,1027.0
mean,1.0
std,0.0
min,1.0
25%,1.0
50%,1.0
75%,1.0
max,1.0


In [21]:
df_all[df_all['source'] == "PBS"].describe()

,targets
count,1739.0
mean,0.0
std,0.0
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,0.0
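
The Keras `Tokenizer` lowercases its input and assigns indices in order of decreasing frequency, so the most common source (PBS) gets index 1, and subtracting 1 gives the 0-based targets seen above. The same encoding can be written as an explicit mapping, which does not depend on class frequencies. This is a sketch; `SOURCE_TO_TARGET` is a hypothetical name, and its values are copied from the `describe()` outputs above.

```python
import pandas as pd

# Mapping copied from the outputs above: PBS -> 0, Vox -> 1, Fox -> 2.
SOURCE_TO_TARGET = {"PBS": 0, "Vox": 1, "Fox": 2}

# Hypothetical sample of the 'source' column.
sources = pd.Series(["Fox", "Vox", "PBS", "Vox"])
targets = sources.map(SOURCE_TO_TARGET)

# map() yields NaN for unknown sources, so fail loudly if any slipped through.
assert targets.notna().all()
```

An explicit dictionary keeps the label assignment stable if the class counts change after further cleaning.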


### Save df

In [22]:
df_all.to_csv('cleaner_article_df.csv', sep='|')
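
Saving with the default `index=True` writes the row index as an unnamed first column, which is why the earlier `read_csv` call had to drop `'Unnamed: 0'`. A minimal sketch of a round trip without the index follows; the `toy` frame is hypothetical, and an in-memory buffer stands in for the file.

```python
import io
import pandas as pd

# Hypothetical stand-in for df_all.
toy = pd.DataFrame({
    "source": ["Fox", "PBS"],
    "article": ["Text with, commas", "Another article"],
    "targets": [2, 0],
})

# Same '|' separator as the notebook, but without the index column.
buffer = io.StringIO()
toy.to_csv(buffer, sep="|", index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep="|")
```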