# Week 10 Document Classification:
## CNN or Fox News
### Aaron Grzasko
### 11/9/2018

## Assignment Overview

The goal of this assignment is to build a binary classifier model using a collection of text-based documents. 

Classification is an example of a supervised learning procedure: a model is trained using labeled target values.  The quality of the model is then assessed against holdout, or test, data.

In the scripts below, I build a classifier model to determine whether a given news article came from [cnn.com](http://cnn.com) or [foxnews.com](http://foxnews.com). My hypothesis is that there are material differences in the content of each news site (e.g. word choices, article length, political leanings) that can be used to accurately determine the source of a particular article. 

## Data Retrieval

Let's first import the Python modules relevant to the assignment.

In [860]:
# import relevant modules
import newspaper
from newspaper import Article
import re
import numpy as np
import pandas as pd
import string

import nltk
from nltk.stem import PorterStemmer

import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn import preprocessing,naive_bayes, metrics
from sklearn import ensemble

import warnings
warnings.filterwarnings('ignore')

**CNN Data Initial Pull**

I used the *newspaper* Python library to scrape news articles from both CNN and Fox News.

For the CNN data pull, I limited the categories to opinion and politics-related articles. 

Below are scripts to generate a list of relevant article urls: 

In [15]:
# initial list of cnn article urls
cnn_paper = newspaper.build('http://cnn.com',memoize_articles = False)

# pull urls related to opinions and politics on cnn
cnn_po_articles = []
for article in cnn_paper.articles:
    match = re.search( r'/opinion/|/politics/',article.url)
    if match and article.url not in cnn_po_articles:
        cnn_po_articles.append(article.url)


In [277]:
# check number of cnn urls found
len(cnn_po_articles)

145

The *newspaper* library limits the data pull to mostly recent articles.  Because of this limitation, our sample list of urls is small. 
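
The section filter used in the cell above can be checked offline, without scraping anything. Below is a minimal sketch on made-up URLs; the URLs and the `filter_sections` helper are my own illustrations, not part of the original pull. It keeps only opinion and politics links and drops exact duplicates, using a set for the membership test.

```python
import re

# Hypothetical URLs, made up for illustration; not taken from the scrape.
sample_urls = [
    "https://www.cnn.com/2018/11/08/politics/example-story/index.html",
    "https://www.cnn.com/2018/11/08/opinion/example-column/index.html",
    "https://www.cnn.com/2018/11/08/sport/example-match/index.html",
    "https://www.cnn.com/2018/11/08/politics/example-story/index.html",  # duplicate
]

# same pattern as in the notebook cell
section_pat = re.compile(r"/opinion/|/politics/")

def filter_sections(urls, pattern):
    """Return URLs matching pattern, in first-seen order, without duplicates."""
    seen = set()
    kept = []
    for url in urls:
        if pattern.search(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept

kept = filter_sections(sample_urls, section_pat)
print(len(kept))
```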

Let's save the initial url list to a csv file in our working directory.

In [278]:
# write cnn urls to csv file
df = pd.DataFrame(cnn_po_articles, columns=["cnn_urls"])
df.to_csv('cnn_urls.csv', index=False)

Now, let's download all articles: 

In [None]:
# pull article text for each url; save to list
cnn_articles = []
for art in cnn_po_articles:
    #print(art)
    cnn_article = Article(url = art)
    cnn_article.download()
    cnn_article.parse()
    cnn_articles.append(cnn_article.text)
    time.sleep(15)
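
If a download or parse step raises an error, the loop above stops and the texts collected so far exist only in memory. Below is a hedged sketch of one way to skip a failing url instead. The `download_texts` helper and the `fake_fetch` stand-in are hypothetical names of mine; a real fetcher would wrap the `Article` calls shown above.

```python
import time

def download_texts(urls, fetch_text, delay_seconds=15):
    """Fetch each url with fetch_text; record None when a fetch raises."""
    texts = []
    for url in urls:
        try:
            texts.append(fetch_text(url))
        except Exception as exc:  # broad on purpose: any failure skips one url
            print(f"skipping {url}: {exc}")
            texts.append(None)
        time.sleep(delay_seconds)  # pause between requests, as in the notebook
    return texts

# Stand-in fetcher for demonstration only (an assumption, not the real scraper).
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("download failed")
    return "text of " + url

results = download_texts(
    ["http://example.com/a", "http://example.com/bad"],
    fake_fetch,
    delay_seconds=0,
)
print(results)
```

Recording `None` keeps the list of texts aligned with the list of urls, so failed rows can be dropped or retried later.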
    

Finally, I will save the articles in a csv file in my working directory.

In [None]:
# write cnn article texts to csv file
df = pd.DataFrame(cnn_articles, columns=["cnn_articles"])
df.to_csv('cnn_articles.csv', index=False, encoding='utf-16')

**Subsequent CNN Data Pulls**

As discussed earlier, the initial sample of CNN articles was fairly small.  Fortunately, I was able to pull additional articles as they were published on the cnn website.

The scripts below are slightly modified versions of the scripts used to do the initial data pull.

We first read in the original urls and retrieved articles.

In [514]:
# read back in initial cnn urls
cnn_po_articles = pd.read_csv('cnn_urls.csv')

# read back in initial cnn articles
cnn_articles = pd.read_csv('cnn_articles.csv', encoding = 'utf-16')

Now, let's find new urls, append to the existing url list, and save to a csv file.

In [508]:
# update with new article urls as they become available
cnn_paper_new = newspaper.build('http://cnn.com',memoize_articles = False)

# generate new article urls
cnn_po_articles_new = []
for article in cnn_paper_new.articles:
    match = re.search( r'/opinion/|/politics/',article.url)
    if match and article.url not in list(cnn_po_articles.iloc[:,0]):
        cnn_po_articles_new.append(article.url)
        
# add new articles to existing article urls
cnn_po_articles_new = pd.DataFrame(cnn_po_articles_new,columns=["cnn_urls"])
cnn_po_articles = pd.concat([cnn_po_articles, cnn_po_articles_new])

# write cnn urls with new data to csv file
df = pd.DataFrame(cnn_po_articles, columns=["cnn_urls"])
df.to_csv('cnn_urls.csv', index=False)

We are now ready to download the new articles.  Once the new articles are downloaded, I can append them to the master file.

In [517]:
# pull article text for each new cnn url; save to list
cnn_articles_new = []
for art in cnn_po_articles_new.iloc[:,0]:
    #print(art)
    cnn_article = Article(url = art)
    cnn_article.download()
    cnn_article.parse()
    cnn_articles_new.append(cnn_article.text)
    time.sleep(15)
    
# append new articles to existing list
cnn_articles_new = pd.DataFrame(cnn_articles_new,columns=["cnn_articles"])
cnn_articles = pd.concat([cnn_articles, cnn_articles_new])

# write all cnn article texts to new csv file
df = pd.DataFrame(cnn_articles, columns=["cnn_articles"])
df.to_csv('cnn_articles.csv', index=False, encoding='utf-16')


In [531]:
# read back in cnn articles
cnn_articles = pd.read_csv('cnn_articles.csv', encoding = 'utf-16')

In [532]:
len(cnn_articles)

175

We now have 175 total cnn articles.

**Fox News Initial Data Retrieval**

The scripts used to pull Fox News articles are similar to those used for downloading cnn articles.

I limited my Fox News article search to those labeled as "opinion", "politics", or "insider".

Below are scripts to identify relevant fox urls.  The urls are then saved to a csv file.

In [87]:
# initial list of fox article urls
fox_paper = newspaper.build('http://foxnews.com', memoize_articles = False)

# newspaper library pulls some duplicate fox articles, some with http, others with https
# regex below identifies url after http(s) prefix.
re_pat = re.compile(r'(?<=https://).+|(?<=http://).+')

# pull urls related to opinions, politics, and insider on fox 
# make sure to avoid duplicate entries
fox_po_articles = []
for article in fox_paper.articles:
    match = re.search( r'/opinion/|/politics/|/insider',article.url)
    if match and re_pat.search(article.url).group(0) not in " ".join(fox_po_articles):
        fox_po_articles.append(article.url)
        
# write fox urls to csv file
df = pd.DataFrame(fox_po_articles, columns=["fox_urls"])
df.to_csv('fox_urls.csv', index=False)
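
The duplicate check above tests whether a scheme-stripped url is a substring of all previously kept urls joined together. One side effect is that a url which is a prefix of a longer, already kept url is also treated as a duplicate. Below is a sketch of an alternative that compares whole urls through a set of scheme-stripped keys. The `dedupe_ignoring_scheme` helper and the sample urls are hypothetical.

```python
import re

# strips a leading http:// or https://
scheme_pat = re.compile(r"^https?://")

def dedupe_ignoring_scheme(urls):
    """Keep the first occurrence of each url, ignoring the http/https prefix."""
    seen = set()
    kept = []
    for url in urls:
        key = scheme_pat.sub("", url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept

# Made-up urls for illustration only.
sample = [
    "http://www.foxnews.com/politics/example-story",
    "https://www.foxnews.com/politics/example-story",   # same page, https
    "https://www.foxnews.com/politics/example-story-part-two",
    "https://www.foxnews.com/politics/example",          # prefix of earlier urls
]
unique_urls = dedupe_ignoring_scheme(sample)
print(len(unique_urls))
```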

With the fox urls identified, I can download the articles.

In [148]:
fox_articles = []
for art in fox_po_articles:
    #print(art)
    fox_article = Article(url = art)
    fox_article.download()
    fox_article.parse()
    fox_articles.append(fox_article.text)
    time.sleep(15)

# write fox articles to csv file
df = pd.DataFrame(fox_articles, columns=["fox_articles"])
df.to_csv('fox_articles.csv', index=False, encoding='utf-16')

**Subsequent Fox Data Pulls**

Once again, the initial sample of news articles is quite small.  I used the scripts below to augment the initial sample with newer Fox articles.

I will first read in the original urls and fox articles.

In [523]:
# read back in original fox urls
fox_po_articles = pd.read_csv('fox_urls.csv')

# read back in original fox articles
fox_articles = pd.read_csv('fox_articles.csv', encoding = 'utf-16')

Now, let's look for new urls:

In [524]:
# update with new article urls as they become available
fox_paper_new = newspaper.build('http://foxnews.com',memoize_articles = False)

re_pat = re.compile(r'(?<=https://).+|(?<=http://).+')

# pull urls related to opinions, politics, and insider on fox
fox_po_articles_new = []
for article in fox_paper_new.articles:
    match = re.search( r'/opinion/|/politics/|/insider',article.url)
    if match and re_pat.search(article.url).group(0) not in " ".join(fox_po_articles.iloc[:,0]) \
    and re_pat.search(article.url).group(0) not in " ".join(fox_po_articles_new):
        fox_po_articles_new.append(article.url)

# add new articles to existing article urls
fox_po_articles_new = pd.DataFrame(fox_po_articles_new,columns=["fox_urls"])
fox_po_articles = pd.concat([fox_po_articles, fox_po_articles_new])

# write fox urls with new data to csv file
df = pd.DataFrame(fox_po_articles, columns=["fox_urls"])
df.to_csv('fox_urls.csv', index=False)

Let's download the new articles, and append to the initial file.

In [527]:
# pull article text for each new fox url; save to list
fox_articles_new = []
for art in fox_po_articles_new.iloc[:,0]:
    #print(art)
    fox_article = Article(url = art)
    fox_article.download()
    fox_article.parse()
    fox_articles_new.append(fox_article.text)
    time.sleep(15)

fox_articles_new = pd.DataFrame(fox_articles_new,columns=["fox_articles"])
fox_articles = pd.concat([fox_articles, fox_articles_new])

# write fox article texts to csv file
df = pd.DataFrame(fox_articles, columns=["fox_articles"])
df.to_csv('fox_articles.csv', index=False, encoding='utf-16')

In [529]:
# read back in fox articles
fox_articles = pd.read_csv('fox_articles.csv', encoding = 'utf-16')

In [530]:
len(fox_articles)

98

We now have almost 100 Fox News articles.

## Pre-Processing

**Data Scrubbing**

Before I can build any models, I need to prep the data.

Let's begin by removing any references to CNN or Fox News from the articles' text.  We want to determine the origin of each article from contextual clues, not explicit references to the article's source.

In [534]:
# remove source references in cnn articles
for i in range(len(cnn_articles)):
    cnn_articles.iloc[i,0] = str(cnn_articles.iloc[i,0]).replace("(CNN)", "Outlet")
    cnn_articles.iloc[i,0] = str(cnn_articles.iloc[i,0]).replace("CNN", "Outlet")
    cnn_articles.iloc[i,0] = str(cnn_articles.iloc[i,0]).replace("cnn", "Outlet")
    cnn_articles.iloc[i,0] = str(cnn_articles.iloc[i,0]).replace("FoxNews", "Outlet")
    cnn_articles.iloc[i,0] = str(cnn_articles.iloc[i,0]).replace("Foxnews", "Outlet")
    cnn_articles.iloc[i,0] = str(cnn_articles.iloc[i,0]).replace("FOXNEWS", "Outlet")
    cnn_articles.iloc[i,0] = str(cnn_articles.iloc[i,0]).replace("foxnews", "Outlet")
    cnn_articles.iloc[i,0] = str(cnn_articles.iloc[i,0]).replace("FOX", "Outlet")
    cnn_articles.iloc[i,0] = str(cnn_articles.iloc[i,0]).replace("Fox", "Outlet")
    cnn_articles.iloc[i,0] = str(cnn_articles.iloc[i,0]).replace("fox", "Outlet")
    
# remove source references in fox articles
for i in range(len(fox_articles)):
    fox_articles.iloc[i,0] = str(fox_articles.iloc[i,0]).replace("FoxNews", "Outlet")
    fox_articles.iloc[i,0] = str(fox_articles.iloc[i,0]).replace("Foxnews", "Outlet")
    fox_articles.iloc[i,0] = str(fox_articles.iloc[i,0]).replace("FOXNEWS", "Outlet")
    fox_articles.iloc[i,0] = str(fox_articles.iloc[i,0]).replace("foxnews", "Outlet")
    fox_articles.iloc[i,0] = str(fox_articles.iloc[i,0]).replace("FOX", "Outlet")
    fox_articles.iloc[i,0] = str(fox_articles.iloc[i,0]).replace("Fox", "Outlet")
    fox_articles.iloc[i,0] = str(fox_articles.iloc[i,0]).replace("fox", "Outlet")
    fox_articles.iloc[i,0] = str(fox_articles.iloc[i,0]).replace("(CNN)", "Outlet")
    fox_articles.iloc[i,0] = str(fox_articles.iloc[i,0]).replace("CNN", "Outlet")
    fox_articles.iloc[i,0] = str(fox_articles.iloc[i,0]).replace("cnn", "Outlet")
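
The chained `replace` calls above can also be expressed as one case-insensitive regular expression. This is a sketch, not a drop-in equivalent: it replaces "Fox News" as a whole (the cell above leaves "Outlet News"), and it matches any capitalization. The `scrub_sources` helper and the sample sentence are my own.

```python
import re

# longer alternatives first, so "(CNN)" and "Fox News" are consumed whole
source_pat = re.compile(r"\(CNN\)|CNN|fox\s?news|fox", flags=re.IGNORECASE)

def scrub_sources(text, placeholder="Outlet"):
    """Replace references to either news source with a neutral placeholder."""
    return source_pat.sub(placeholder, str(text))

sample = "(CNN) A Fox News host and a CNN anchor spoke to foxnews.com."
scrubbed = scrub_sources(sample)
print(scrubbed)
```

The function could then be applied to a text column with `.apply(scrub_sources)` rather than looping over row positions.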
  

Let's create appropriate news source labels for each collection of articles.

In [536]:
# cnn labels
cnn_articles['label'] = pd.Series(['cnn']*len(cnn_articles))

# fox labels
fox_articles['label'] = pd.Series(['fox']*len(fox_articles))

# rename article columns to "text"
cnn_articles = cnn_articles.rename(index=str, columns={"cnn_articles": "text"})
fox_articles = fox_articles.rename(index=str, columns={"fox_articles": "text"})

With the appropriate labels assigned, I can combine the two article dataframes together.

In [748]:
# combined dataset with labels
combined = pd.concat([cnn_articles,fox_articles])

Below, I apply a variety of text transformations including:
- converting text to lowercase
- removing punctuation
- removing newlines and carriage returns
- word stemming using the Porter Stemmer

In [749]:
# convert text to lower case
combined['text'] = combined['text'].apply(lambda x: x.lower())

# remove punctuation
combined['text'] = combined['text'].apply(lambda s: ''.join(ch for ch in s if ch not in set(string.punctuation)))

# remove additional items such as newlines
combined['text'] = combined['text'].apply(lambda s: s.replace('\n',' '))
combined['text'] = combined['text'].apply(lambda s: s.replace('\r',''))
combined['text'] = combined['text'].apply(lambda s: s.replace('  ',' '))

# tokenize words
combined['text'] = combined['text'].apply(nltk.word_tokenize)  

# stem words
stemmer = PorterStemmer()
combined['text'] = combined['text'].apply(lambda x: [stemmer.stem(y) for y in x])

# convert text back to string from list of words
combined['text'] = combined['text'].apply(lambda x: " ".join(x))
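
The first three cleaning steps (lowercasing, punctuation removal, whitespace cleanup) can be tried on a single string with the standard library alone. The sketch below leaves out tokenizing and stemming, and the `normalize` helper is a hypothetical name. It builds the punctuation table once and collapses whitespace runs of any length.

```python
import string

# translation table that deletes every ASCII punctuation character
_punct_table = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace runs."""
    lowered = text.lower()
    no_punct = lowered.translate(_punct_table)
    # split() with no argument breaks on spaces, newlines, and carriage returns
    return " ".join(no_punct.split())

cleaned = normalize("Hello,\r\nWorld!   It's   2018.")
print(cleaned)
```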


**Train/Test Split**

Using a built-in sklearn function, I will split the combined dataset into training and test components.  I will use 70% of the data for training, and the other 30% for testing. 

In [751]:
# split the dataset into training and test datasets 
train_x, test_x, train_y, test_y = train_test_split(combined['text'], combined['label'], random_state=4, test_size = 0.3)
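
The two classes are unbalanced (175 cnn articles versus 98 fox articles), and a purely random split can shift that ratio: the test set used below ends up with 33 fox articles out of 82. Below is a sketch of a stratified split, which keeps the class ratio roughly equal in both partitions. The documents are placeholders; only the label counts mirror the dataset. This is an alternative I did not use for the results in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder documents; only the label counts mirror the dataset.
labels = ["cnn"] * 175 + ["fox"] * 98
texts = [f"doc {i}" for i in range(len(labels))]

# stratify=labels asks for the same class proportions in train and test
train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.3, random_state=4, stratify=labels
)

overall_fox = labels.count("fox") / len(labels)
test_fox = Counter(test_y)["fox"] / len(test_y)
print(round(overall_fox, 3), round(test_fox, 3))
```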

Many modeling routines expect the target variable to be coded with numerical values rather than text.  Below, I convert "cnn" and "fox" values to 0s and 1s, respectively.  

In [752]:
# convert target values to numbers
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
# reuse the mapping learned from the training labels
test_y = encoder.transform(test_y)
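
A minimal sketch of label encoding on toy labels: the mapping is learned once from the training labels and then reused on the test labels with `transform`, so both partitions share the same codes. `LabelEncoder` sorts the classes, which is why "cnn" maps to 0 and "fox" to 1. The toy label lists are my own.

```python
from sklearn.preprocessing import LabelEncoder

# toy labels for illustration
train_labels = ["cnn", "fox", "cnn", "fox"]
test_labels = ["fox", "cnn"]

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_labels)  # learn the mapping
test_encoded = encoder.transform(test_labels)        # reuse the mapping

print(list(encoder.classes_))
print(list(test_encoded))
```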

**Feature Extraction**

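I will use tf-idf scores as a possible set of features for the models.  A tf-idf score weights a word by how often it appears in a document, discounted by how many documents in the corpus contain it, so words that are frequent in one article but rare overall receive the highest scores.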

In [784]:
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=500)
tfidf_vect.fit(combined['text'])
train_x_tfidf =  tfidf_vect.transform(train_x)
test_x_tfidf =  tfidf_vect.transform(test_x)
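
To make the tf-idf weighting concrete, here is a small standard-library sketch. It assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that I understand to be the `TfidfVectorizer` defaults. The `tfidf` helper and the toy (already stemmed) documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

toy_docs = ["senat vote tax", "senat vote border", "presid tweet border wall"]
weights = tfidf(toy_docs)
for row in weights:
    print({t: round(w, 3) for t, w in row.items()})
```

In the first toy document, "tax" appears in only one document and so receives a larger weight than "senat", which appears in two.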


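I will also produce a set of tf-idf scores for word n-grams.  Note that `ngram_range=(2,3)` covers both bigrams and trigrams; for brevity, the rest of this notebook refers to these features as "bigrams".

In [789]:
# bigram and trigram tf-idf
tfidf_bi = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=500)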
tfidf_bi.fit(combined['text'])
train_x_tfidf_bi =  tfidf_bi.transform(train_x)
test_x_tfidf_bi =  tfidf_bi.transform(test_x)
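
As an illustration of what an n-gram range of 2 to 3 produces, the sketch below lists the word n-grams of a short stemmed phrase. The `word_ngrams` helper is hypothetical and only mimics the idea; it is not the vectorizer's own tokenizer.

```python
def word_ngrams(text, min_n, max_n):
    """Return all word n-grams of text for n from min_n to max_n inclusive."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("the senat pass the bill", 2, 3)
print(grams)
```

Five tokens yield four bigrams and three trigrams; single words are not included.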

## Models

With the data processed, we can fit a variety of classifier models.  For this assignment, we will fit four models in total:
* Naive Bayes Model using TFIDF features
* Naive Bayes Model using TFIDF word bigram features
* Random Forest Model using TFIDF features
* Random Forest Model using TFIDF word bigram features

**Naive Bayes Using TFIDF Features**

In [785]:
# fit naive bayes using tf-idf
NB_tfidf = MultinomialNB().fit(train_x_tfidf, train_y)

In [811]:
# calculate accuracy on test set
predictions = NB_tfidf.predict(test_x_tfidf)
    
print("Naive Bayes with tfidf, accuracy: ",metrics.accuracy_score(predictions, test_y))   

Naive Bayes with tfidf, accuracy:  0.6707317073170732


In [812]:
print("Confusion Matrix for NB with TFIDF:")
print(metrics.confusion_matrix(test_y, predictions))

Confusion Matrix for NB with TFIDF:
[[49  0]
 [27  6]]


In [813]:
print("Classification Metrics for NB with TFIDF:")
print(metrics.classification_report(test_y, predictions, target_names = ['cnn','fox']))

Classification Metrics for NB with TFIDF:
             precision    recall  f1-score   support

        cnn       0.64      1.00      0.78        49
        fox       1.00      0.18      0.31        33

avg / total       0.79      0.67      0.59        82



The initial Naive Bayes model does not achieve great accuracy: about 67%, only modestly better than the 60% obtained by labeling every test article as cnn (49 of 82).  The model also correctly labels only 18% of the Fox News articles in the test set.  On the other hand, the model has perfect precision for Fox News articles.  In other words, every article that the model labels as Fox News is in fact a Fox News article.
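
The figures quoted here follow directly from the confusion matrix above. The sketch below redoes the arithmetic by hand, treating fox as the positive class; the variable names are my own.

```python
# Confusion matrix reported above: rows are true labels, columns are
# predictions, both in the order [cnn, fox].
matrix = [[49, 0],
          [27, 6]]

(tn, fp), (fn, tp) = matrix  # fox is treated as the positive class
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
fox_recall = tp / (tp + fn)        # share of true fox articles that were found
fox_precision = tp / (tp + fp)     # share of fox predictions that were right
cnn_precision = tn / (tn + fn)     # share of cnn predictions that were right
majority_baseline = (tn + fp) / total  # accuracy of always predicting cnn

print(f"accuracy          {accuracy:.2f}")
print(f"fox recall        {fox_recall:.2f}")
print(f"fox precision     {fox_precision:.2f}")
print(f"cnn precision     {cnn_precision:.2f}")
print(f"majority baseline {majority_baseline:.2f}")
```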

**Naive Bayes Using TFIDF Bigram Features**

In [814]:
# fit naive bayes using tf-idf bigrams
NB_tfidf_bi = MultinomialNB().fit(train_x_tfidf_bi, train_y)

In [815]:
# calculate accuracy on test set
predictions = NB_tfidf_bi.predict(test_x_tfidf_bi)
    
print("Naive Bayes with tfidf bigrams, accuracy: ",metrics.accuracy_score(predictions, test_y))

Naive Bayes with tfidf bigrams, accuracy:  0.7804878048780488


In [816]:
print("Confusion Matrix for NB with TFIDF bigrams:")
print(metrics.confusion_matrix(test_y, predictions))

Confusion Matrix for NB with TFIDF bigrams:
[[47  2]
 [16 17]]


In [818]:
print("Classification Metrics for NB with TFIDF bigrams:")
print(metrics.classification_report(test_y, predictions, target_names = ['cnn','fox']))

Classification Metrics for NB with TFIDF bigrams:
             precision    recall  f1-score   support

        cnn       0.75      0.96      0.84        49
        fox       0.89      0.52      0.65        33

avg / total       0.81      0.78      0.76        82



The Naive Bayes model shows a significant improvement in accuracy using the TFIDF bigram features.  Accuracy is now roughly 78%.  The recall for Fox News articles is still not great (just over 50%), but recall for cnn is high (96%), and the precision scores for both news sources are acceptable.

**Random Forest with TFIDF**

In [839]:
# fit random forest tfidf 
RF_tfidf = ensemble.RandomForestClassifier().fit(train_x_tfidf, train_y)

In [866]:
# calculate accuracy on test set
predictions = RF_tfidf.predict(test_x_tfidf)
    
print("Random Forest with tfidf, accuracy: ",metrics.accuracy_score(predictions, test_y))

Random Forest with tfidf, accuracy:  0.7682926829268293


In [867]:
print("Confusion Matrix for RF with TFIDF:")
print(metrics.confusion_matrix(test_y, predictions))

Confusion Matrix for RF with TFIDF:
[[48  1]
 [18 15]]


In [868]:
print("Classification Metrics for RF with TFIDF:")
print(metrics.classification_report(test_y, predictions, target_names = ['cnn','fox']))

Classification Metrics for RF with TFIDF:
             precision    recall  f1-score   support

        cnn       0.73      0.98      0.83        49
        fox       0.94      0.45      0.61        33

avg / total       0.81      0.77      0.75        82



The random forest model trained on TFIDF scores is similar in quality to the Naive Bayes model using TFIDF bigram data.  Its recall for Fox News articles is still not very good (45%).
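
One caveat: a random forest is a randomized procedure, and `RandomForestClassifier()` is called above without a seed, so refitting the model may give somewhat different numbers. The sketch below, on synthetic stand-in data rather than the article features, shows that fixing `random_state` makes two fits agree. The value `n_estimators=100` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the article features.
rng = np.random.RandomState(0)
X = rng.rand(60, 5)
y = (X[:, 0] + 0.1 * rng.randn(60) > 0.5).astype(int)

# two forests with the same seed and the same data
forest_a = RandomForestClassifier(n_estimators=100, random_state=4).fit(X, y)
forest_b = RandomForestClassifier(n_estimators=100, random_state=4).fit(X, y)

same = np.array_equal(forest_a.predict_proba(X), forest_b.predict_proba(X))
print(same)
```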

**Random Forest with TFIDF Bigrams**

In [856]:
# fit random forest tfidf on tfidf bigrams
RF_tfidf_bi = ensemble.RandomForestClassifier().fit(train_x_tfidf_bi, train_y)

In [857]:
# calculate accuracy on test set
predictions = RF_tfidf_bi.predict(test_x_tfidf_bi)
    
print("Random Forest with tfidf bigrams, accuracy: ",metrics.accuracy_score(predictions, test_y))

Random Forest with tfidf bigrams, accuracy:  0.8536585365853658


In [858]:
print("Confusion Matrix for RF with TFIDF bigrams:")
print(metrics.confusion_matrix(test_y, predictions))

Confusion Matrix for RF with TFIDF bigrams:
[[49  0]
 [12 21]]


In [859]:
print("Classification Metrics for RF with TFIDF bigrams:")
print(metrics.classification_report(test_y, predictions, target_names = ['cnn','fox']))

Classification Metrics for RF with TFIDF bigrams:
             precision    recall  f1-score   support

        cnn       0.80      1.00      0.89        49
        fox       1.00      0.64      0.78        33

avg / total       0.88      0.85      0.85        82



The Random Forest model, trained on TFIDF bigrams, produced the most accurate result on the holdout data.  Accuracy is over 85%.  Fox News recall is acceptable, with a value of 64%.
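
For a side-by-side view, the sketch below recomputes accuracy and the Fox News F1 score from the four confusion matrices reported above. The dictionary keys and the `summarize` helper are my own labels.

```python
# Confusion matrices as reported above; rows are true [cnn, fox],
# columns are predicted [cnn, fox].
reported = {
    "NB, tf-idf": [[49, 0], [27, 6]],
    "NB, tf-idf bigrams": [[47, 2], [16, 17]],
    "RF, tf-idf": [[48, 1], [18, 15]],
    "RF, tf-idf bigrams": [[49, 0], [12, 21]],
}

def summarize(matrix):
    """Return (accuracy, F1 for the fox class) for a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = matrix
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    fox_f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, fox_f1

summary = {name: summarize(m) for name, m in reported.items()}
for name, (accuracy, fox_f1) in summary.items():
    print(f"{name:<20} accuracy={accuracy:.2f}  fox_f1={fox_f1:.2f}")

best = max(summary, key=lambda name: summary[name][0])
```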

## Video Commentary

http://youtu.be/M5v20YSogoM?hd=1

## References

- newspaper module:  https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#parsing-an-article
- classification models: https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/
- More classification models: https://stackabuse.com/the-naive-bayes-algorithm-in-python-with-scikit-learn/