## The following is a Fake News Detector. It was built using a Random Forest Classifier and Natural Language Processing. 


###  The code in this notebook, when given an example article title, produces a classification of either a 0 or a 1, not fake news or fake news, respectively. This is meant to be the first of many iterations; all suggestions are welcome. 

##### All work was completed by Katy Spalding, November 2018. 





### Step 1: Imports, light Exploratory Data Analysis, cleaning data

#### 1.1: Import. Read in not_fake data. Name not_fake dataset "not_fake_og".

In [32]:
import pandas as pd
import numpy as np
import nltk
from nltk.stem import WordNetLemmatizer
import string
import seaborn as sns
nltk.download('punkt')
nltk.download('wordnet')
import matplotlib.pyplot as plt
import sklearn.metrics



not_fake_og = pd.read_csv('/Users/katyspalding/Desktop/not_fake.csv')
pd.set_option('display.max_colwidth', 1000)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/katyspalding/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/katyspalding/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#### 1.2: Drop all rows without titles. 

In [33]:
not_fake_og.dropna(subset = ['title'], inplace = True)
not_fake_og.shape

(49998, 10)

In [34]:
not_fake_og.isnull().sum()

Unnamed: 0        0
id                0
title             0
publication       0
author         8597
date           2626
year           2626
month          2626
url            7011
content           0
dtype: int64

#### 1.3: Trim not_fake data to just 1 column: 'title'.

In [35]:
not_fake_og = not_fake_og[['title']]

In [36]:
not_fake_og.shape

(49998, 1)

#### 1.4: Make not_fake data 12,000 rows long.

In [37]:
not_fake = not_fake_og[1:12200]
not_fake.shape

(12199, 1)

#### 1.5: Add a "fake news label" column to not_fake data, and call all of these 0, or not fake.

In [38]:
not_fake['fake news label'] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


#### 1.6: Carry out the same steps on fake data. 

Read in fake data, drop all empty title rows. 

In [39]:
# read in the data
fake_og = pd.read_csv('/Users/katyspalding/Desktop/fake.csv')
pd.set_option('display.max_colwidth', 1000)

# drop all rows with empty titles
fake_og.dropna(subset = ['title'], inplace = True)

# ALSO, remove all fake news articles with the fake news label "state"
fake = (fake_og.loc[fake_og['type'] != 'state'])

# check shape
fake.shape

(12198, 20)

In [40]:
# trim fake data to just 1 column: 'title'
fake = fake[['title']]

In [41]:
# add another column to the fake data, and name all of these articles 1, denoting fake news
fake['fake news label'] = 1

#### 1.7: Concatenate both fake news data and not fake news data into one dataframe.

In [42]:
df = pd.concat([fake, not_fake])

In [43]:
df

Unnamed: 0,title,fake news label
0,Muslims BUSTED: They Stole Millions In Gov’t Benefits,1
1,Re: Why Did Attorney General Loretta Lynch Plead The Fifth?,1
2,BREAKING: Weiner Cooperating With FBI On Hillary Email Investigation,1
3,"PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnapped And Killed By ISIS: ""I have voted for Donald J. Trump!"" » 100percentfedUp.com",1
4,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Healthcare Begins With A Bombshell! » 100percentfedUp.com,1
5,Hillary Goes Absolutely Berserk On Protester At Rally! (Video),1
6,"BREAKING! NYPD Ready To Make Arrests In Weiner Case…Hillary Visited Pedophile Island At Least 6 Times…Money Laundering, Underage Sex, Pay-for-Play,Proof of Inappropriate Handling Classified Information » 100percentfedUp.com",1
7,WOW! WHISTLEBLOWER TELLS CHILLING STORY Of Massive Voter Fraud: Trump Campaign Readies Lawsuit Against FL Sec Of Elections In Critical District [VIDEO] » 100percentfedUp.com,1
8,BREAKING: CLINTON CLEARED...Was This A Coordinated Last Minute Trick To Energize Hillary's Base? » 100percentfedUp.com,1
9,"EVIL HILLARY SUPPORTERS Yell ""F*ck Trump""…Burn Truck Of Daddy Fishing With 2 Yr Son Over Of Trump Bumper-Stickers [VIDEO] » 100percentfedUp.com",1


In [44]:
df['fake news label'].value_counts()

0    12199
1    12198
Name: fake news label, dtype: int64

### Step 2:  Defining X and y, NLP pre-processing.

#### 2.1: Define X and y.

In [45]:
X = df[['title']]
y = df[['fake news label']]

#### 2.2: NLP Preprocessing. Add a new 'lemma' column to the dataframe. To every title in the dataframe: lower case it, tokenize it, remove all punctuation, and convert every invididual word to its lemma. Lastly, add this new lemmatized title to the column 'lemma'.

In [46]:
wordnet_lemmatizer = WordNetLemmatizer() 

def clean_lemma_X(og_article_title):
    og_article_title = og_article_title.lower() # 1. make it lower case
    
    article_title_split = nltk.word_tokenize(og_article_title) # 2. tokenize it
    
    article_title_split = [word for word in article_title_split if word not in string.punctuation] # 3. remove all punctuation
    
    article_title_lemmatized=[] # 4. convert to lemma
    word = [wordnet_lemmatizer.lemmatize(word, pos='n') for word in article_title_split]
    word = [wordnet_lemmatizer.lemmatize(word, pos='a') for word in article_title_split]
    word = [wordnet_lemmatizer.lemmatize(word, pos='v') for word in article_title_split]

    article_title_lemmatized.append(word)
    return ' '.join(word)    

# 5. Add new lemmatized title to the column 'lemma'
X['lemma'] = X['title'].map(clean_lemma_X)

#### 2.3: TFIDF the 'lemma' column.

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(X['lemma'])

# X is now a sparse matrix

### Step 3: train/test/validation split.

In [48]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

#### 3.2: Make y into a dummy column.

In [49]:
y_train_dummies = pd.get_dummies(y_train) # DO I EVER EVEN USE Y_TRAIN_DUMMIES IN MY CLASSIFIER?

### Step 4: Grid search over Random Forest parameters.

In [50]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

cls = RandomForestClassifier()

params = {
    'n_estimators': range(5, 30, 2),
    'max_depth': range(2, 12, 2),
}

gs = GridSearchCV(cls, params, scoring='roc_auc', n_jobs=4)

gs.fit(X_train, y_train_dummies)

  self.best_estimator_.fit(X, y, **fit_params)


GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'n_estimators': range(5, 30, 2), 'max_depth': range(2, 12, 2)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

#### 4.2: Print the parameters that yeiled the best outcome.

In [20]:
print (gs.best_score_)
print (gs.best_estimator_)

0.7692452331019601
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=29, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)


### Step 5: Create and fit a Random Forest Classifier with the aforementioned parameters on both training and validation data. 

#### 5.1: Create an instance of a RFC.

In [21]:
cls = RandomForestClassifier(max_depth=10, n_estimators=29, class_weight='balanced')

  


RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=10, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=29, n_jobs=None, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

#### 5.2: Fit the model to trianing data.

In [None]:
cls.fit(X_train, y_train_dummies)

#### 5.3: Fit the model to validation data.

In [22]:
cls.fit(X_val, y_val)

  """Entry point for launching an IPython kernel.


RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=10, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=29, n_jobs=None, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

### Step 6: Check model performance on training, validation and testing data. 

In [23]:
from sklearn.metrics import f1_score

f1_score(cls.predict(X_train), y_train_dummies, average='weighted')

0.6856794529824058

In [24]:
f1_score(cls.predict(X_val), y_val, average='weighted')

0.7503440149870262

In [25]:
f1_score(cls.predict(X_test), y_test, average='weighted')

0.6814339486838772

### Step 7: Perform inference. 

In [26]:
test_string = ['Trump is a Republican']

import pickle
#print (tfidf_vectorizer)
with open('tfidf_save.pkl', 'wb') as handle:
    pickle.dump(tfidf_vectorizer, handle)

tfidf_loaded = pickle.load(open('tfidf_save.pkl', 'rb'))
#print (tfidf_loaded)


sample = tfidf_loaded.transform(test_string)
demo = cls.predict(sample)
demo

print(test_string)
print(demo)
if demo[0] == 0:
    print('This article is not fake news.')
if demo[0] == 1:
    print('This article is fake news.')

['Trump is a Republican']
[0]
This article is not fake news.


In [27]:
test_string_2 = ['Hillary Clinton is a Democrat']

import pickle
with open('tfidf_save.pkl', 'wb') as handle:
    pickle.dump(tfidf_vectorizer, handle)

tfidf_loaded = pickle.load(open('tfidf_save.pkl', 'rb'))


sample = tfidf_loaded.transform(test_string_2)
demo = cls.predict(sample)
demo


print(test_string_2)
print(demo)
if demo[0] == 0:
    print('This article is not fake news.')
if demo[0] == 1:
    print('This article is fake news.')

['Hillary Clinton is a Democrat']
[1]
This article is fake news.


In [28]:
test_string_3 = ["YOU WONT BELIEVE IT: All Science is False"]

import pickle
with open('tfidf_save.pkl', 'wb') as handle:
    pickle.dump(tfidf_vectorizer, handle)

tfidf_loaded = pickle.load(open('tfidf_save.pkl', 'rb'))


sample = tfidf_loaded.transform(test_string_3)
demo = cls.predict(sample)

print(test_string_3)
print(demo)
if demo[0] == 0:
    print('This article is not fake news.')
if demo[0] == 1:
    print('This article is fake news.')

['YOU WONT BELIEVE IT: All Science is False']
[1]
This article is fake news.


In [29]:
test_string_4 = ["Scientists Say Voting Could Kill You!"]

import pickle

with open('tfidf_save.pkl', 'wb') as handle:
    pickle.dump(tfidf_vectorizer, handle)

tfidf_loaded = pickle.load(open('tfidf_save.pkl', 'rb'))



sample = tfidf_loaded.transform(test_string_4)
demo = cls.predict(sample)



print(test_string_4)
print(demo)
if demo[0] == 0:
    print('This article is not fake news.')
if demo[0] == 1:
    print('This article is fake news.')

['Scientists Say Voting Could Kill You!']
[1]
This article is fake news.
