# 1. Load libraries and prepare dataset

In [1]:
import pandas as pd
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
os.getcwd()

'/Users/TobiasMerkt/Desktop/IE/NLP/Assignments'

After loading the libraries, the dataset is loaded and its first five rows displayed.

In [3]:
data = pd.read_csv("fake_or_real_news_training.csv")
data.head()

Unnamed: 0,ID,title,text,label,X1,X2
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,,
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,,
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,,
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,,
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,,


Since the X1 and X2 columns are undocumented and contain almost only NULL values, they are dropped. Keeping them could be tested later to see whether they improve performance.

In [4]:
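# Note: drop() returns a new DataFrame and the result is not assigned here,
# so X1 and X2 remain in `data` (they still appear in the later outputs).
data.drop(['X1','X2'], axis=1)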

Unnamed: 0,ID,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


Next, the title and text are merged into a single field. One could instead build separate document-term matrices for the title and the text if they are likely to carry different indicators (e.g. a certain word in the title signalling fake news). However, a web search suggested that fake news mostly uses the same vocabulary as real news and that the differences are subtle. Time did not allow testing both options against each other.

In [5]:
data['fulltext'] = data.title.str.cat(data.text, sep = ' ')
data.head(2)

Unnamed: 0,ID,title,text,label,X1,X2,fulltext
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,,,You Can Smell Hillary’s Fear Daniel Greenfield...
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,,,Watch The Exact Moment Paul Ryan Committed Pol...


In [6]:
final_test = pd.read_csv("fake_or_real_news_test.csv")
final_test.head()

Unnamed: 0,ID,title,text
0,10498,September New Homes Sales Rise——-Back To 1992 ...,September New Homes Sales Rise Back To 1992 Le...
1,2439,Why The Obamacare Doomsday Cult Can't Admit It...,But when Congress debated and passed the Patie...
2,864,"Sanders, Cruz resist pressure after NY losses,...",The Bernie Sanders and Ted Cruz campaigns vowe...
3,4128,Surviving escaped prisoner likely fatigued and...,Police searching for the second of two escaped...
4,662,Clinton and Sanders neck and neck in Californi...,No matter who wins California's 475 delegates ...


In [7]:
final_test['fulltext'] = final_test.title.str.cat(final_test.text, sep = ' ')
final_test.head(2)

Unnamed: 0,ID,title,text,fulltext
0,10498,September New Homes Sales Rise——-Back To 1992 ...,September New Homes Sales Rise Back To 1992 Le...,September New Homes Sales Rise——-Back To 1992 ...
1,2439,Why The Obamacare Doomsday Cult Can't Admit It...,But when Congress debated and passed the Patie...,Why The Obamacare Doomsday Cult Can't Admit It...


The dataset is split into a training set (70%) and a test set (30%).

In [8]:
train, test = train_test_split(data, test_size=0.3)
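Because no `random_state` is passed, the split above is drawn afresh on every run, so the scores reported later can shift slightly between runs. The following is a minimal standard-library sketch of what a seeded split does; `seeded_split` is a hypothetical helper written for illustration, not part of scikit-learn, and its rounding of the test size is an assumption.

```python
import random

def seeded_split(rows, test_size=0.3, seed=42):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = [f"article {i}" for i in range(10)]  # stand-in for the real DataFrame rows
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)

print(len(train_a), len(test_a))  # sizes of the two parts
print(train_a == train_b)         # same seed, same split
```

In scikit-learn itself, the same effect should be obtainable by passing a fixed `random_state` to `train_test_split`.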

In [9]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2799 entries, 2008 to 2446
Data columns (total 7 columns):
ID          2799 non-null int64
title       2799 non-null object
text        2799 non-null object
label       2799 non-null object
X1          22 non-null object
X2          2 non-null object
fulltext    2799 non-null object
dtypes: int64(1), object(6)
memory usage: 174.9+ KB


# 2. Data Preprocessing

In [10]:
train.fulltext.isnull().sum()

0

To train a model, the text must be converted into numerical feature vectors. CountVectorizer does this by building a document-term matrix from the input text.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train.fulltext)
train_counts.shape

(2799, 48023)
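To make the shape above concrete (one row per article, one column per distinct term), here is a small standard-library sketch of how such a count matrix can be built. The two toy documents and the `tokenize` helper are assumptions for illustration; the regular expression mirrors CountVectorizer's default token pattern (lowercased tokens of at least two word characters).

```python
import re
from collections import Counter

docs = [
    "Kerry to go to Paris",
    "Paris primary matters",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

counts = [Counter(tokenize(d)) for d in docs]
vocabulary = sorted(set().union(*counts))  # one column per distinct term
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row holds the raw term counts of one document, so the matrix has `len(docs)` rows and `len(vocabulary)` columns.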

This document-term matrix is a plain bag-of-words model in which only the raw term frequency counts. Terms that appear often in only a few documents are likely to be more indicative than terms that appear in many documents, so they should receive a higher weight. This is what a TF-IDF matrix provides.

In [12]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)
train_tfidf.shape

(2799, 48023)
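The weighting can be sketched by hand on a toy count matrix. The sketch below assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is what TfidfTransformer applies with its default settings; the terms and counts are made up for illustration.

```python
import math

# Toy document-term counts: rows are documents, columns are terms
terms = ["the", "election", "hoax"]
counts = [
    [3, 1, 0],
    [2, 0, 0],
    [4, 1, 2],
]

n_docs = len(counts)
# Document frequency: in how many documents does each term occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]
# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
    tfidf.append([v / norm for v in weighted])  # L2-normalised row

for term, w in zip(terms, idf):
    print(f"{term}: idf={w:.3f}")
```

A term that occurs in every document ("the") gets the lowest weight, while a term confined to a single document ("hoax") gets the highest.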

# 3. Model creation

## 3.1. Naive Bayes Classifier

In [13]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(train_tfidf, train.label)

One can also use a pipeline instead of the work above. Since I used this analysis for learning purposes, I still found it useful to code the entire process. From now on, the pipeline is used to optimise the performance and use different models quickly.

In [45]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf_NB = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB(fit_prior=False)),  # fit_prior=False slightly improves predictions
])
text_clf_NB = text_clf_NB.fit(train.fulltext, train.label)

As can be seen below, a 76% accuracy is achieved by using a Naive Bayes classifier on the TF-IDF matrix.

In [46]:
predicted = text_clf_NB.predict(test.fulltext)
np.mean(predicted == test.label)

0.76333333333333331

## 3.2. Support Vector Machines

An SVM with penalty = l2 produces the following test accuracies for different values of alpha:
    alpha = 1e-2: 0.776
    alpha = 1e-3: 0.905
    alpha = 1e-4: 0.9267
    alpha = 1e-5: 0.9258
Consequently, alpha = 1e-4 is chosen.
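To illustrate what alpha controls, here is a bare-bones sketch of stochastic gradient descent on the hinge loss with an L2 penalty, the combination that `SGDClassifier(loss='hinge', penalty='l2')` trains. It is a simplification under stated assumptions: the function `sgd_hinge`, the fixed learning rate `lr` and the toy data are all hypothetical, whereas SGDClassifier uses its own decaying learning-rate schedule by default.

```python
import random

def sgd_hinge(X, y, alpha=1e-4, epochs=5, lr=0.1, seed=42):
    """Plain SGD on hinge loss with an L2 penalty; labels must be +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # L2 penalty: shrink every weight a little on each step (scaled by alpha)
            w = [wj * (1 - lr * alpha) for wj in w]
            if margin < 1:  # hinge loss is active: move towards the correct side
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

# Two well-separated toy classes
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, -1, -1]
w, b = sgd_hinge(X, y)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1 for x in X]
print(predictions)
```

A larger alpha shrinks the weights more strongly on every step, which is consistent with the weaker accuracy observed at alpha = 1e-2.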

In [36]:
from sklearn.linear_model import SGDClassifier
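# n_iter was removed from SGDClassifier in newer scikit-learn releases;
# max_iter=5 with tol=None runs a fixed five passes over the data instead.
text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4, max_iter=5, tol=None, random_state=42))])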

text_clf_svm = text_clf_svm.fit(train.fulltext, train.label)
predicted_svm = text_clf_svm.predict(test.fulltext)
np.mean(predicted_svm == test.label)

0.92666666666666664

## 3.3. Grid Search

By performing a grid search, some parameter values will be tested to find their optimal combination.

### 3.3.1. Naive Bayes

In [22]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf__alpha': (1e-2, 1e-3)}
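The parameter grid above expands into every combination of the listed values, and each combination is fitted and scored in turn. A short standard-library sketch of that expansion (the variable names `names` and `combinations` are chosen here for illustration):

```python
from itertools import product

parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

names = list(parameters)
# Cartesian product of all value lists: 2 * 2 * 2 candidates
combinations = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(combinations))
for combo in combinations:
    print(combo)
```

The number of candidates multiplies with every added parameter, which is why only a few values per parameter are tested here.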

In [23]:
gs_clf = GridSearchCV(text_clf_NB, parameters, n_jobs=-1)  # n_jobs=-1: use all available cores
gs_clf = gs_clf.fit(train.fulltext, train.label)



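The grid search improved the Naive Bayes score from 0.76 to 0.89, a substantial improvement. Note that the two figures are not measured on the same data: 0.76 is the accuracy on the test set, whereas 0.89 is the cross-validated score on the training set.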

In [26]:
gs_clf.best_score_

0.89424794569489108

The following parameters optimise the Naive Bayes result:

In [27]:
gs_clf.best_params_

{'clf__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)}

### 3.3.2. SVM

Over several iterations, I tested alpha values from 1e-03 to 1e-08 and found that 1e-05 is optimal. The score is lower than in the previous example because it is measured on different data: the grid search reports a cross-validated score on the training set, whereas the earlier figure was measured on the test set.

In [33]:
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False),'clf-svm__alpha': (1e-5, 1e-8)}

gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(train.fulltext, train.label)



In [34]:
gs_clf_svm.best_score_

0.91032511611289746
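The `best_score_` above is an average over validation folds carved out of the training set, which is why it is not directly comparable with a test-set accuracy. A minimal sketch of how k-fold indices can be generated; `k_fold_indices` is a hypothetical helper, and `k=3` is an assumption, since the number of folds GridSearchCV uses depends on its `cv` argument and the scikit-learn version.

```python
def k_fold_indices(n_samples, k=3):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Every sample is used for validation exactly once, and each candidate model is trained on only part of the training set, so its score tends to be somewhat conservative.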

In [35]:
gs_clf_svm.best_params_

{'clf-svm__alpha': 1e-05, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

On the test dataset, the performance is 0.92833 and thus slightly better than the previous result of 0.9267.

In [37]:
gs_svm_predict = gs_clf_svm.predict(test.fulltext)
np.mean(gs_svm_predict == test.label)

0.92833333333333334

# 4. Optimisation

With these basic models running, the next step is to optimise them by 1) removing stop words and 2) stemming. POS tagging and parsing could also be used to improve the classification models.

## 4.1. Removing stop words

Stop words are words which appear very often and carry little information about the content (e.g., "and", "the"). Removing them increases the Naive Bayes accuracy on the test set to 0.8067.
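As a small illustration of the filtering step, the sketch below removes stop words from a tokenised headline. The stop list is a tiny hand-picked assumption; the list behind `stop_words='english'` in scikit-learn is far longer.

```python
import re

# A tiny illustrative stop list (assumption, not scikit-learn's actual list)
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def content_tokens(text):
    """Tokenise, lowercase and drop stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

filtered = content_tokens("Kerry to go to Paris in gesture of sympathy")
print(filtered)
```

The remaining tokens are the ones that end up as columns of the document-term matrix.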

In [39]:
text_clf_NB = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), 
                     ('clf', MultinomialNB())])
text_clf_NB = text_clf_NB.fit(train.fulltext, train.label)

In [40]:
predicted = text_clf_NB.predict(test.fulltext)
np.mean(predicted == test.label)

0.80666666666666664

For the SVM, the accuracy with stop words removed (0.93) is slightly better than without removing them (0.9283).

In [52]:
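# n_iter was removed from SGDClassifier in newer scikit-learn releases;
# max_iter=5 with tol=None runs a fixed five passes over the data instead.
text_clf_svm_sw = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range=(1, 2))), ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-5, max_iter=5, tol=None, random_state=42))])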

text_clf_svm_sw = text_clf_svm_sw.fit(train.fulltext, train.label)
predicted_svm_sw = text_clf_svm_sw.predict(test.fulltext)
np.mean(predicted_svm_sw == test.label)

0.93000000000000005

## 4.2. Stemming

A stemming algorithm reduces words to their root form (e.g., stemmer, stemmed -> stem). This gives a better estimate of the frequency of roots which occur in varying forms.
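The idea can be shown with a deliberately crude suffix stripper. This is only a toy under stated assumptions: `toy_stem` and its four-suffix rule set are hypothetical and far simpler than the Snowball stemmer used in this notebook, and they will mis-stem many real words.

```python
SUFFIXES = ("ing", "ed", "er", "s")  # hypothetical, deliberately tiny rule set

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "stemm" -> "stem"
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

stems = [toy_stem(w) for w in ["stemmer", "stemmed", "stemming", "stems"]]
print(stems)
```

All four surface forms collapse onto the same root, so their counts are pooled into a single column of the document-term matrix.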

In [49]:
import nltk
#nltk.download()

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])
    
stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

### 4.2.1. Naive Bayes

The Naive Bayes model achieves an accuracy of 0.8117, which is slightly better than without stemming (0.8067).

In [53]:
mnb_stemmed = Pipeline([('vect', stemmed_count_vect), ('tfidf', TfidfTransformer()), 
                             ('mnb', MultinomialNB(fit_prior=False))])

mnb_stemmed = mnb_stemmed.fit(train.fulltext, train.label)

predicted_mnb_stemmed = mnb_stemmed.predict(test.fulltext)
np.mean(predicted_mnb_stemmed == test.label)

0.81166666666666665

### 4.2.2. SVM

Compared to the previous result (0.93), the stemmed SVM produces slightly worse results (0.9275).

In [57]:
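# n_iter was removed from SGDClassifier in newer scikit-learn releases;
# max_iter=5 with tol=None runs a fixed five passes over the data instead.
text_clf_svm_stem = Pipeline([('vect', stemmed_count_vect), ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-5, max_iter=5, tol=None, random_state=42))])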

text_clf_svm_stem = text_clf_svm_stem.fit(train.fulltext, train.label)
predicted_svm_stem = text_clf_svm_stem.predict(test.fulltext)
np.mean(predicted_svm_stem == test.label)

0.92749999999999999