### Download nltk stopwords and spacy model

We will need the stopwords from NLTK and spaCy’s English model for text pre-processing. The spaCy model is used later for lemmatization, which converts a word to its base (dictionary) form, called its lemma.

For example, the lemma of the word ‘machines’ is ‘machine’.
Likewise, ‘walking’ → ‘walk’, ‘mice’ → ‘mouse’, and so on.

In [1]:
import nltk; nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yidingweng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Import Packages

In [2]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this; in pyLDAvis >= 3.0 the module is named pyLDAvis.gensim_models
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

### Prepare Stopwords

We have already downloaded the stopwords. Let’s import them and make them available in stop_words.

In [21]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
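
To make the role of stop_words concrete, here is a minimal sketch of how such a list is typically applied to tokenized text. The short demo_stop_words list, the drop_stopwords helper and the sample tokens are made up for illustration; in this notebook the full NLTK list plays that role. Converting the list to a set is an optional choice that keeps each membership test cheap on a large corpus.

```python
# Hypothetical mini stop-word list standing in for stopwords.words('english')
demo_stop_words = ['the', 'is', 'to', 'of', 'and']
# Corpus-specific additions, mirroring the extend() call above
demo_stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

# A set gives constant-time membership checks
stop_set = set(demo_stop_words)

def drop_stopwords(tokens, stop_set):
    """Return the tokens that are not in stop_set, preserving order."""
    return [tok for tok in tokens if tok not in stop_set]

tokens = ['the', 'subject', 'of', 'the', 'article', 'is', 'health', 'insurance']
filtered = drop_stopwords(tokens, stop_set)
print(filtered)  # ['article', 'health', 'insurance']
```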

### Import news article data

The dataset consists of 3 .csv files. We will concatenate them.

In [22]:
data1 = pd.read_csv('data/articles1.csv', index_col=0)
data2 = pd.read_csv('data/articles2.csv', index_col=0)
data3 = pd.read_csv('data/articles3.csv', index_col=0)

In [23]:
df_total = pd.concat([data1, data2, data3])
df_total.shape

(142570, 9)

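This version of the dataset contains about 142k news articles from 15 different media outlets. It is available at https://www.kaggle.com/jannesklaas/analyzing-the-news/data. Because of limited computing power, I will use a 30% sample of the data for this project. Note that replace=True in the call below samples with replacement, so the same article can appear more than once in the sample; pass replace=False to draw distinct rows.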

In [24]:
df = df_total.sample(frac=0.3, replace=True, random_state=1)

In [25]:
df.head()

| | id | title | publication | author | date | year | month | url | content |
|---|---|---|---|---|---|---|---|---|---|
| 131500 | 198273 | Ivanka Trump just got booed in Germany for cal... | Vox | Zeeshan Aleem | 2017/4/25 | 2017.0 | 4.0 | http://www.vox.com/world/2017/4/25/15420358/iv... | Ivanka Trump is having trouble convincing the... |
| 5192 | 23073 | At Debate, Hillary Clinton Leaves Questions Ab... | New York Times | Peter Eavis | 2016-04-15 | 2016.0 | 4.0 | | A jarring regulatory action this week against ... |
| 53350 | 73533 | Why Mars Is the Best Planet | Atlantic | Rebecca Boyle | 2017-01-13 | 2017.0 | 1.0 | | Our tale of two planets begins four billion ye... |
| 112720 | 167391 | Sketch To Impress: How An Oscar-Winning Design... | NPR | Mandalit del Barco | 2016-02-24 | 2016.0 | 2.0 | http://www.npr.org/2016/02/24/467800435/sketch... | British costumer Sandy Powell already has thre... |
| 76783 | 117088 | What to Make of the Saudi Shake-up | National Review | Elliott Abrams | 2017-06-21 | 2017.0 | 6.0 | http://www.nationalreview.com/article/448834/s... | On Wednesday, King Salman of Saudi Arabia push... |


### Check empty fields

In [26]:
df.isnull().sum()

id                 0
title              0
publication        0
author          4872
date             798
year             798
month            798
url            16934
content            0
dtype: int64

'title' and 'content' are the features that carry the text we want to study. Since neither column has empty fields in this dataset, we will not remove any rows.

#df.dropna(subset=['title'], inplace = True)
#df.isnull().sum()

In [27]:
df.to_csv('data/allNews_30%sample.csv', sep='\t')

### Remove emails and newline characters

As you can see, the text contains many email addresses, newlines and extra spaces, which are quite distracting. Let’s get rid of them using regular expressions.

In [28]:
# Convert to list
df.content = df.content.values.tolist()

In [29]:
# Remove Emails (raw strings avoid invalid-escape warnings in newer Python versions)
df.content = [re.sub(r'\S*@\S*\s?', '', sent) for sent in df.content]

# Remove new line characters (collapses any run of whitespace into one space)
df.content = [re.sub(r'\s+', ' ', sent) for sent in df.content]

# Remove distracting single quotes
df.content = [re.sub(r"'", "", sent) for sent in df.content]

pprint(df[:1])

            id                                              title publication  \
131500  198273  Ivanka Trump just got booed in Germany for cal...         Vox   

               author       date    year  month  \
131500  Zeeshan Aleem  2017/4/25  2017.0    4.0   

                                                      url  \
131500  http://www.vox.com/world/2017/4/25/15420358/iv...   

                                                  content  
131500   Ivanka Trump is having trouble convincing the...  
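
To see what the three substitutions do, here is a small standalone sketch that applies the same patterns, in the same order, to a single string. The clean_text helper and the sample sentence (with a placeholder address at example.com) are made up for illustration.

```python
import re

def clean_text(text):
    """Apply the three clean-up substitutions, in order, to one string."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop any token containing '@'
    text = re.sub(r'\s+', ' ', text)        # collapse whitespace runs, including newlines
    text = re.sub(r"'", '', text)           # remove ASCII single quotes
    return text

sample = "Contact jane.doe@example.com for details.\n\nIt's   a test."
cleaned = clean_text(sample)
print(cleaned)  # Contact for details. Its a test.
```

Note that the last pattern only matches the straight ASCII quote; typographic quotes such as ’ are left in place.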


In [30]:
df.to_pickle("data/allNews_30%sample_lemmatized_nontokenized.pkl")

After removing the emails and extra spaces, we need to break down each sentence into a list of words through tokenization, while clearing up all the messy text in the process.

Gensim’s simple_preprocess is great for this.

### Tokenize words and Clean-up text

In [12]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; punctuation is dropped by the tokenizer itself

df.content = list(sent_to_words(df.content))

print(df.content[:1])

131500    [ivanka, trump, is, having, trouble, convincin...
Name: content, dtype: object
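
For intuition about what this step does, the sketch below approximates the behaviour of simple_preprocess using only the standard library: lowercase the text, optionally strip accents, keep runs of non-digit word characters, and drop tokens shorter than 2 or longer than 15 characters. It is a rough stand-in written for illustration, not Gensim's actual implementation; the names rough_preprocess and strip_accents are my own.

```python
import re
import unicodedata

# Runs of word characters that are not digits (an approximation of Gensim's tokenizer)
WORD_RE = re.compile(r'(?:(?!\d)\w)+')

def strip_accents(text):
    """Remove combining accent marks after Unicode decomposition."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

def rough_preprocess(text, deacc=False, min_len=2, max_len=15):
    """Lowercase, optionally de-accent, tokenize, and drop very short or long tokens."""
    text = text.lower()
    if deacc:
        text = strip_accents(text)
    return [tok for tok in WORD_RE.findall(text) if min_len <= len(tok) <= max_len]

tokens = rough_preprocess("Ivanka Trump's café visit, 2017: a résumé!", deacc=True)
print(tokens)  # ['ivanka', 'trump', 'cafe', 'visit', 'resume']
```

Punctuation and digits disappear because they never match the token pattern, while the single-letter tokens 's' and 'a' are removed by the minimum-length filter.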


### Creating Bigram and Trigram Models

Bigrams are two words that frequently occur together in a document. Trigrams are three words that frequently occur together.
Gensim’s Phrases model can build and apply bigrams, trigrams, quadgrams and more. The two important arguments to Phrases are min_count and threshold. The higher the values of these parameters, the harder it is for words to be combined into bigrams.
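
To illustrate why larger min_count and threshold values yield fewer phrases, here is a simplified sketch of the kind of score Phrases uses by default: (count_ab - min_count) * vocab_size / (count_a * count_b), with a pair accepted only when its score exceeds threshold. This is a toy under stated assumptions, not Gensim's code: the functions bigram_scores and accepted_phrases are hypothetical names, the corpus is made up, and vocab_size here counts distinct words only.

```python
from collections import Counter

def bigram_scores(sentences, min_count):
    """Score each adjacent pair: (count_ab - min_count) * vocab_size / (count_a * count_b)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # simplification: distinct words only
    return {
        pair: (n_ab - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n_ab in bigrams.items()
    }

def accepted_phrases(sentences, min_count, threshold):
    """Return the set of word pairs whose score is strictly above threshold."""
    scores = bigram_scores(sentences, min_count)
    return {pair for pair, score in scores.items() if score > threshold}

toy = [
    ['health', 'care', 'reform'],
    ['health', 'care', 'costs'],
    ['health', 'care', 'law'],
    ['new', 'law'],
]
loose = accepted_phrases(toy, min_count=1, threshold=1.0)
strict = accepted_phrases(toy, min_count=1, threshold=5.0)
print(loose)   # {('health', 'care')}
print(strict)  # set()
```

In this toy corpus ('health', 'care') scores (3 - 1) * 6 / (3 * 3) ≈ 1.33, so it passes a threshold of 1.0 but not 5.0. Raising min_count to 3 drives its score to 0, which removes it as well.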

In [13]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(df.content, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[df.content.tolist()], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

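# See trigram example
# Note: df.content[0] selects the row whose index label is 0, not the first row of the sample;
# use df.content.iloc[0] for the latter
print(trigram_mod[bigram_mod[df.content[0]]])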

['washington', 'congressional', 'republicans', 'have', 'new', 'fear', 'when', 'it', 'comes', 'to', 'their', 'health_care', 'lawsuit', 'against', 'the', 'obama', 'administration', 'they', 'might', 'win', 'the', 'incoming', 'trump', 'administration', 'could', 'choose', 'to', 'no', 'longer', 'defend', 'the', 'executive_branch', 'against', 'the', 'suit', 'which', 'challenges', 'the', 'administration', 'authority', 'to', 'spend', 'billions', 'of', 'dollars', 'on', 'health_insurance', 'subsidies', 'for', 'and', 'americans', 'handing', 'house', 'republicans', 'big', 'victory', 'on', 'issues', 'but', 'sudden', 'loss', 'of', 'the', 'disputed', 'subsidies', 'could', 'conceivably', 'cause', 'the', 'health_care', 'program', 'to', 'implode', 'leaving', 'millions', 'of', 'people', 'without', 'access', 'to', 'health_insurance', 'before', 'republicans', 'have', 'prepared', 'replacement', 'that', 'could', 'lead', 'to', 'chaos', 'in', 'the', 'insurance', 'market', 'and', 'spur', 'political', 'backlash',

Some examples of the resulting phrases are ‘health_insurance’, ‘health_care’, ‘white_house’, etc.

### Remove Stopwords, Make Bigrams and Lemmatize

The bigram model is ready. Let’s define functions to remove stopwords, make bigrams and lemmatize, and then call them sequentially.

In [14]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
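
The filtering logic inside lemmatization can be tried without loading a spaCy model. In the sketch below, hand-written (text, lemma, POS) triples stand in for spaCy tokens; they are illustrative annotations, not real model output, and keep_lemmas is a hypothetical helper that mirrors the list comprehension above.

```python
# Hand-written (text, lemma, POS) triples standing in for spaCy tokens
annotated = [
    ('the', 'the', 'DET'),
    ('machines', 'machine', 'NOUN'),
    ('were', 'be', 'AUX'),
    ('walking', 'walk', 'VERB'),
    ('quickly', 'quickly', 'ADV'),
    ('past', 'past', 'ADP'),
    ('mice', 'mouse', 'NOUN'),
]

def keep_lemmas(annotated_tokens, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Return the lemmas of tokens whose POS tag is in allowed_postags."""
    return [lemma for _text, lemma, pos in annotated_tokens if pos in allowed_postags]

lemmas = keep_lemmas(annotated)
print(lemmas)  # ['machine', 'walk', 'quickly', 'mouse']
```

Determiners, auxiliaries and prepositions are dropped, which is why the lemmatized articles contain mostly content words.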

In [15]:
# Remove Stop Words
df.content = remove_stopwords(df.content)

In [16]:
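# Initialize spacy 'en' model, disabling the parser and NER components (for efficiency)
# python3 -m spacy download en
# Note: the 'en' shortcut works in spaCy 2.x only; in spaCy 3+ load 'en_core_web_sm' instead
nlp = spacy.load('en', disable=['parser', 'ner'])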

# Do lemmatization keeping only noun, adj, vb, adv
df.content = lemmatization(df.content, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [17]:
df.to_pickle("data/allNews_30%sample_lemmatized_nonbigrams.pkl")

In [18]:
# Form Bigrams
df.content = make_bigrams(df.content)

print(df.content[:1])

131500    [ivanka, trump, trouble, convince, world, fath...
Name: content, dtype: object


In [20]:
df.to_pickle("data/allNews_30%sample_lemmatized.pkl")