# Fake News Detection 
Building a classifier that distinguishes fake from real news using pre-defined labels

### Major Steps
1. Libraries
2. Loading and Cleaning datasets
3. Preparing dataset (BOW, TF-IDF)
4. SVM
5. Naive Bayes
6. Logistic Regression
7. PassiveAggressive Classifier
8. Summary

## 1. Import libraries

In [12]:
import pandas as pd
import numpy as np
import nltk
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.feature_extraction.text import CountVectorizer,  TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import accuracy_score

from sklearn.pipeline import Pipeline

In [13]:
import warnings
warnings.filterwarnings('ignore')

## 2. Load & Examine Datasets

In [14]:
df_train = pd.read_csv("/Users/taniaelachkar/Desktop/Natural Language Processing/Assignment 1/fake_or_real_news_training.csv")

df_test = pd.read_csv("/Users/taniaelachkar/Desktop/Natural Language Processing/Assignment 1/fake_or_real_news_test.csv")

In [15]:
df_train.head(20)

Unnamed: 0,ID,title,text,label,X1,X2
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,,
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,,
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,,
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,,
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,,
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE,,
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE,,
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL,,
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL,,
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL,,


In [16]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 0 to 3998
Data columns (total 6 columns):
ID       3999 non-null int64
title    3999 non-null object
text     3999 non-null object
label    3999 non-null object
X1       33 non-null object
X2       2 non-null object
dtypes: int64(1), object(5)
memory usage: 187.5+ KB


### 2.1 Examine the X1 and X2 variables

In [17]:
df_train.X1.unique()

array([nan, 'REAL', 'FAKE',
       'PLANNED PARENTHOOD’S LOBBYING GETS AGGRESSIVE.\xa0Congress may have spent August away from Washington but Planned Parenthood’s campaign to convince lawmakers to protect the group’s funding followed them back to their home states. Power Post has more.\n\n“Lawmakers will raise the stakes when Congress returns next week by threatening to defund the group through the federal appropriations process. Planned Parenthood’s counter-offensive is widespread and varied and is unfolding inside and outside the Beltway. The group has been\xa0organizing rallies, flooding lawmakers’ town hall meetings, commissioning polls, shelling\xa0out six figures for television\xa0ads and\xa0hiring forensics experts to try to discredit undercover video footage that sparked the controversy. The success of these lobbying efforts will be tested when Congress returns and must move a short-term spending bill to keep the government open. Some conservatives in both chambers are pushing 

In [18]:
df_train.X2.unique()

array([nan, 'REAL', 'FAKE'], dtype=object)

In [19]:
#showing all the instances with X1
df_train[df_train.X1.notnull()]

Unnamed: 0,ID,title,text,label,X1,X2
192,599,Election Day: No Legal Pot In Ohio,Democrats Lose In The South,Election Day: No Legal Pot In Ohio; Democrats ...,REAL,
308,10194,Who rode it best? Jesse Jackson mounts up to f...,Leonardo DiCaprio to the rescue?,Who rode it best? Jesse Jackson mounts up to f...,FAKE,
382,356,Black Hawk crashes off Florida,human remains found,(CNN) Thick fog forced authorities to suspend ...,REAL,
660,2786,Afghanistan: 19 die in air attacks on hospital,U.S. investigating,(CNN) Aerial bombardments blew apart a Doctors...,REAL,
889,3622,Al Qaeda rep says group directed Paris magazin...,US issues travel warning,A member of Al Qaeda's branch in Yemen said Fr...,REAL,
911,7375,Shallow 5.4 magnitude earthquake rattles centr...,shakes buildings in Rome,00 UTC © USGS Map of the earthquake's epicent...,FAKE,
1010,9097,ICE Agent Commits Suicide in NYC,Leaves Note Revealing Gov’t Plans to Round-up...,Email Print After writing a lengthy suicide no...,FAKE,
1043,9203,Political Correctness for Yuengling Brewery,What About Our Opioid Epidemic?,We Are Change \n\nIn today’s political climate...,FAKE,
1218,1602,Poll gives Biden edge over Clinton against GOP...,VP meets with Trumka,A new national poll shows Vice President Biden...,REAL,
1438,4562,Russia begins airstrikes in Syria,U.S. warns of new concerns in conflict,Russian warplanes began airstrikes in Syria on...,REAL,


### 2.2 Deleting the X1 and X2
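X1 is populated in only 33 rows and X2 in 2. As the rows above show, these appear to be records that were split incorrectly at load time, so their values are shifted into the wrong columns. Since there is no clear way to fit them back into the existing data, we decided to drop those rows and then remove the two columns.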

In [20]:
# keep only the rows where X1 is empty (this drops the 33 misaligned rows)
df_train = df_train[df_train.X1.isnull()]

In [21]:
# drop the X1 and X2 columns
df_train = df_train.drop(["X1", "X2"], axis=1)

In [22]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3966 entries, 0 to 3998
Data columns (total 4 columns):
ID       3966 non-null int64
title    3966 non-null object
text     3966 non-null object
label    3966 non-null object
dtypes: int64(1), object(3)
memory usage: 154.9+ KB


## 3. Prepare Dataset

### 3.1 Train/Test Split

In [23]:
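# create training and testing vars (train_test_split holds out 25% by default;
# no random_state is set, so the split changes on every run)
X_train, X_test, y_train, y_test = train_test_split(df_train.text, df_train.label)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)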

(2974,) (2974,)
(992,) (992,)
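
Because the split above is random, the accuracy figures later in the notebook move a little from run to run. The following is a minimal sketch of a reproducible, stratified split. The toy `texts` and `labels` lists are hypothetical stand-ins for `df_train.text` and `df_train.label`, and the seed value 42 is an arbitrary choice.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for df_train.text / df_train.label (hypothetical data).
texts = [f"article {i}" for i in range(8)]
labels = ["FAKE", "REAL"] * 4

# random_state fixes the shuffle; stratify keeps the FAKE/REAL ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))  # 6 2
print(sorted(y_te))          # ['FAKE', 'REAL']
```

With a fixed seed, every model in the notebook could be scored on the same held-out rows.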


### 3.2 Feature Extraction
Bag of Words

In [24]:
# Feature Extraction
from sklearn.feature_extraction.text import CountVectorizer

tf_weighting = CountVectorizer()
X_train_counts = tf_weighting.fit_transform(df_train.text)
X_train_counts.shape

(3966, 55312)
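
To make the bag-of-words step concrete, here is a minimal sketch on a two-sentence toy corpus (the sentences are made up for illustration and are not taken from the dataset). It prints the vocabulary that `CountVectorizer` learns and the resulting count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["fake news spreads fast", "real news spreads slowly"]  # toy corpus

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each token to its column index; columns are in alphabetical order.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)             # ['fake', 'fast', 'news', 'real', 'slowly', 'spreads']
print(counts.toarray())
# [[1 1 1 0 0 1]
#  [0 0 1 1 1 1]]
```

The same mechanism, applied to the 3966 articles, yields the 55312 columns reported above.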

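### 3.3 TF-IDF
TF-IDF weights each term by how often it occurs in a document and reduces the weight of terms that occur in many documents. scikit-learn's TfidfTransformer does not divide by the length of the document directly; by default it L2-normalizes each document vector, which has a similar length-compensating effect.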

In [25]:
#TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(3966, 55312)
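
As a worked check of what `TfidfTransformer` computes with its default settings, the sketch below rebuilds the weights by hand: a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, multiplied by the raw term count, followed by L2 normalization of each row. The small count matrix is a hypothetical example, not data from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy term-count matrix: 3 documents x 3 terms (hypothetical values).
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 1, 0]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf (default smooth_idf=True)
weighted = counts * idf                     # raw term count times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

sk = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, sk))  # True
```

The first term appears in every document, so it gets the lowest idf of the three.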

## 4. SVM 

#### First attempt - SGDClassifier

In [26]:
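from sklearn.linear_model import SGDClassifier
# Note: 'squared_loss' (named 'squared_error' from scikit-learn 1.0) is a least-squares loss,
# so this first attempt is a linear model trained by SGD rather than a true SVM.
# max_iter=5 with tol=None replaces n_iter=5, which scikit-learn 0.21 removed.
text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='squared_loss', penalty='l2', alpha=1e-3,
                                                   max_iter=5, tol=None, random_state=42))])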

text_clf_svm = text_clf_svm.fit(X_train, y_train)
predicted_svm = text_clf_svm.predict(X_test)
np.mean(predicted_svm == y_test)

0.9153225806451613

#### SVM with stop words removed (slightly lower score)

In [27]:
# Train a linear SVM (hinge loss) with English stop words removed and measure its accuracy.
# max_iter=5 with tol=None replaces n_iter=5, which scikit-learn 0.21 removed.
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english')),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                                                   max_iter=5, tol=None, random_state=42))])

text_clf_svm = text_clf_svm.fit(X_train, y_train)
predicted_svm = text_clf_svm.predict(X_test)
np.mean(predicted_svm == y_test)

0.9112903225806451
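
The two SGD runs above differ in their `loss` setting as well as in stop-word handling: the first uses `squared_loss`, the second `hinge`. The sketch below, written in plain Python with hypothetical helper names, shows how the two losses treat the same decision score for a positive example. It is meant only to illustrate the difference, not to reproduce scikit-learn's internals.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw decision score."""
    return max(0.0, 1.0 - y * score)


def squared_loss(y, score):
    """Least-squares loss, which treats the label as a regression target."""
    return 0.5 * (y - score) ** 2


# A confidently correct score (2.0), a weakly correct one (0.5), and a wrong one (-1.0).
for score in (2.0, 0.5, -1.0):
    print(score, hinge_loss(+1, score), squared_loss(+1, score))
```

Hinge loss is zero once an example is on the correct side of the margin, whereas the squared loss still penalizes a confidently correct score of 2.0. Because both the loss and the stop-word setting changed between the two cells, the drop from 0.915 to 0.911 cannot be attributed to stop-word removal alone.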

### 4.1 Applying LinearSVC (score improvement)

In [28]:
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

text_clf_svm2 = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                         ('clf-svm', LinearSVC())])

text_clf_svm2 = text_clf_svm2.fit(X_train, y_train)
predicted_svm2 = text_clf_svm2.predict(X_test)
np.mean(predicted_svm2 == y_test)

0.9243951612903226

### 4.2 Grid Search for SVM

In [None]:

from sklearn.model_selection import GridSearchCV
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False),'clf-svm__alpha': (1e-2, 1e-3)}

gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(X_train, y_train)


print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)

## 5. Naive Bayes

In [19]:
# Train a Naive Bayes (NB) classifier on the TF-IDF matrix of the full training file (this model is not evaluated below).
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, df_train.label)

In [20]:
# Pipeline - standard
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

text_clf = text_clf.fit(X_train, y_train)

In [21]:
# Performance - standard
predicted = text_clf.predict(X_test)
np.mean(predicted == y_test)

0.7368951612903226

In [22]:
# Pipeline with adjusting MultinomialNB

text_clf2 = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB(fit_prior=False))])

text_clf2 = text_clf2.fit(X_train, y_train)

In [23]:
# Performance with adjusting MultinomialNB
predicted = text_clf2.predict(X_test)
np.mean(predicted == y_test)

0.7520161290322581
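
The only change between the two pipelines above is `fit_prior=False`. As a small sketch of what that switch does, the example below fits `MultinomialNB` on a deliberately imbalanced toy set (hypothetical counts and labels) and prints the class priors each variant ends up with.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count features with an imbalanced label set: 3 FAKE, 1 REAL (hypothetical).
X = np.array([[2, 0], [1, 0], [3, 1], [0, 2]])
y = ["FAKE", "FAKE", "FAKE", "REAL"]

learned = MultinomialNB(fit_prior=True).fit(X, y)    # priors from class frequencies
uniform = MultinomialNB(fit_prior=False).fit(X, y)   # equal prior for every class

print(learned.classes_, np.exp(learned.class_log_prior_))  # priors 0.75 and 0.25
print(uniform.classes_, np.exp(uniform.class_log_prior_))  # priors 0.5 and 0.5
```

With a uniform prior the classifier relies on the word likelihoods alone, which can shift predictions when the training labels are not evenly balanced.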

### 5.1 Grid Search for Naive Bayes
We used grid search to tune the parameters of the Naive Bayes pipeline (the guide below covers the estimator and scoring options):
http://scikit-learn.org/stable/modules/grid_search.html#grid-search

In [24]:
# Original Search
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf__alpha': (1e-2, 1e-3)}

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(X_train, y_train)

print(gs_clf.best_score_)
print(gs_clf.best_params_)

0.8971082716879624
{'clf__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)}


In [25]:
# Search 2 - changed the parameter grid based on the previous results; the best score and parameters are unchanged
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 5)], 'tfidf__use_idf': (True, False), 'clf__alpha': (0.0001, 0.001, 0.01)}

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(X_train, y_train)

print(gs_clf.best_score_)
print(gs_clf.best_params_)

0.8971082716879624
{'clf__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)}


In [26]:
# Search 3 - wider n-gram ranges; the best score is marginally lower
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 4), (1, 5), (1, 6)], 'tfidf__use_idf': (True, False), 'clf__alpha': (0.001, 0.005, 0.01)}

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(X_train, y_train)

print(gs_clf.best_score_)
print(gs_clf.best_params_)

0.8967720242098184
{'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 6)}
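
The searches above print only the best combination. A sketch of how every combination could be listed from `cv_results_` follows, which makes it easier to see whether the winner is clearly ahead or within noise of the others. The toy corpus and the reduced grid are hypothetical; in the notebook the search would be fitted on `X_train` and `y_train`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical headlines), four per class.
texts = ["shocking secret exposed", "you will not believe this",
         "secret plot exposed", "believe this shocking plot",
         "senate passes budget bill", "court hears budget case",
         "senate hears new bill", "court passes new ruling"]
labels = ["FAKE"] * 4 + ["REAL"] * 4

pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", MultinomialNB())])
grid = {"vect__ngram_range": [(1, 1), (1, 2)], "clf__alpha": [0.001, 0.01]}

search = GridSearchCV(pipe, grid, cv=2).fit(texts, labels)

# Pair each parameter combination with its mean cross-validated score, best first.
results = sorted(zip(search.cv_results_["mean_test_score"], search.cv_results_["params"]),
                 key=lambda pair: pair[0], reverse=True)
for score, params in results:
    print(f"{score:.3f}  {params}")
```

Note that `best_score_` is a cross-validated score on the training data, so it is not directly comparable to the hold-out accuracies computed with `np.mean(predicted == y_test)`.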


### 5.2 Stemming for Naive Bayes

In [27]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])
    
stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect), 
                             ('tfidf', TfidfTransformer()), 
                             ('mnb', MultinomialNB(fit_prior=False))])

text_mnb_stemmed = text_mnb_stemmed.fit(X_train, y_train)

predicted_mnb_stemmed = text_mnb_stemmed.predict(X_test)

np.mean(predicted_mnb_stemmed == y_test)

0.8064516129032258

## 6. Logistic Regression

### 6.1 Repeating the dataset preparation for Logistic Regression
Reusing the feature matrices from section 3 with logistic regression caused a crash. Those matrices were built from the full df_train.text (3966 rows), so a likely cause is that they do not line up with the 2974-row training split. The only workaround we have found so far is to repeat the preparation here: we redo the split, fit the vectorizer and the TF-IDF transformer on X_train only, and then apply them to X_test.

In [28]:
# Train/test split (redrawn here, so it differs from the split used in sections 4 and 5)
X_train, X_test, y_train, y_test = train_test_split(df_train.text, df_train.label)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

# Feature Extraction - Logistic Regression
from sklearn.feature_extraction.text import CountVectorizer
tf_weighting = CountVectorizer()
X_train_counts = tf_weighting.fit_transform(X_train)
X_test_counts = tf_weighting.transform(X_test)

#TF-IDF - Logistic Regression
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

X_test_tfidf = tfidf_transformer.transform(X_test_counts)

(2974,) (2974,)
(992,) (992,)


### 6.2 Logistic Regression Model

In [29]:
lr_count_clf = LogisticRegression()

In [30]:
lr_count_clf.fit(X_train_counts, y_train)
prediction = lr_count_clf.predict(X_test_counts)
result0 = metrics.accuracy_score(y_test, prediction)
print("logistic regression accuracy: %0.3f" % result0)

result1 = cross_val_score(lr_count_clf, X_train_counts, y_train, scoring="accuracy", cv=5).mean()
print("logistic regression cross-validation accuracy: %0.3f" % result1)

logistic regression accuracy: 0.923
logistic regression cross-validation accuracy: 0.899
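
In the cell above, `cross_val_score` runs on counts whose vocabulary was learned from all of `X_train`, so each validation fold has already influenced the features. One possible alternative, sketched here on a hypothetical toy corpus, is to put the vectorizer inside a pipeline so that each fold learns its own vocabulary from its training part only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical headlines), four per class.
texts = ["shocking secret exposed", "you will not believe this",
         "secret plot exposed", "believe this shocking plot",
         "senate passes budget bill", "court hears budget case",
         "senate hears new bill", "court passes new ruling"]
labels = ["FAKE"] * 4 + ["REAL"] * 4

# The pipeline is refitted from raw text inside every fold.
model = make_pipeline(CountVectorizer(), LogisticRegression())
scores = cross_val_score(model, texts, labels, scoring="accuracy", cv=2)

print(scores, scores.mean())
```

In the notebook this would mean passing `X_train` (raw text) rather than `X_train_counts` to `cross_val_score`.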


In [31]:
X_train, X_test, y_train, y_test = train_test_split(df_train.text, df_train.label)

tf_weighting = CountVectorizer()
X_train_counts = tf_weighting.fit_transform(df_train.text)
X_train_counts.shape

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(3966, 55312)

In [32]:
from sklearn.pipeline import Pipeline

text_clf_logreg = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf2', LogisticRegression())])

text_clf_logreg = text_clf_logreg.fit(X_train, y_train)


# Performance - standard
predicted = text_clf_logreg.predict(X_test)
np.mean(predicted == y_test)

0.8931451612903226

## 7. PassiveAggressive Classifier
https://github.com/docketrun/Detecting-Fake-News-with-Scikit-Learn/blob/master/Attempting%20to%20detect%20fake%20news.ipynb

In [33]:
# Preparation
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

In [34]:
# Importing and adjusting the PAC
from sklearn.linear_model import PassiveAggressiveClassifier
# max_iter=100 with tol=None replaces n_iter=100, which scikit-learn 0.21 removed
linear_clf = PassiveAggressiveClassifier(max_iter=100, tol=None, fit_intercept=False, warm_start=True)

In [35]:
linear_clf.fit(tfidf_train, y_train)
pred = linear_clf.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])

accuracy:   0.917
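
The cell above computes a confusion matrix `cm` but does not display it. As a sketch of how to read one, the example below uses a handful of hypothetical true and predicted labels with the same `labels=['FAKE', 'REAL']` ordering.

```python
from sklearn import metrics

# Hypothetical labels for five articles.
y_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL"]
y_pred = ["FAKE", "FAKE", "REAL", "REAL", "REAL"]

# Rows are true classes, columns are predicted classes, both in the order FAKE, REAL.
cm = metrics.confusion_matrix(y_true, y_pred, labels=["FAKE", "REAL"])
print(cm)
# [[2 1]
#  [0 2]]
```

Here one FAKE article was predicted as REAL (top-right cell) and no REAL article was predicted as FAKE (bottom-left cell). Printing `cm` in the notebook would show the same breakdown for the 992 test articles.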


## 8. Summary

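We tried several models taken from different sources. For our specific approach, the SVM with LinearSVC gave the best hold-out accuracy, around 0.924. Logistic regression on raw counts reached 0.923 on a different random split, so the ranking between these two is not conclusive.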
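
Since the train/test split was redrawn between sections, the scores above come from different held-out sets. A sketch of a fairer comparison, in which every model sees one shared split with a fixed seed, is given below. The toy corpus, the seed and the choice of three models are assumptions for illustration; `TfidfVectorizer` stands in for the `CountVectorizer` plus `TfidfTransformer` pair used earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical headlines), four per class.
texts = ["shocking secret exposed", "you will not believe this",
         "secret plot exposed", "believe this shocking plot",
         "senate passes budget bill", "court hears budget case",
         "senate hears new bill", "court passes new ruling"]
labels = ["FAKE"] * 4 + ["REAL"] * 4

# One split, reused by every model.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

models = {
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(),
}

results = {}
for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(), clf).fit(X_tr, y_tr)
    results[name] = pipe.score(X_te, y_te)  # accuracy on the shared test set

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

On the real data, averaging over several seeds or using cross-validation would give a steadier ranking than a single split.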

#### Resources:
https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a
https://github.com/docketrun/Detecting-Fake-News-with-Scikit-Learn/blob/master/Attempting%20to%20detect%20fake%20news.ipynb
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
http://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html#sphx-glr-auto-examples-model-selection-plot-multi-metric-evaluation-py
