## Predicting whether a news article is Fake or Real
In this notebook we use two bag-of-words techniques: count vectorization and tfidf vectorization.

Depending on whether one runs this notebook using just the titles or the full articles, the accuracy varies between roughly 92% and 99%. However, the full-article version does have some data leakage (see the notebook https://www.kaggle.com/mosewintner/5-data-leaks-100-acc-1-word-99-6-acc for why it is easy to exceed 99%).

This notebook was originally put together for an NLP study group session. For some of the context around this notebook, you can view the recording of the study group session on YouTube: https://youtu.be/HlmmXrA4FUU

In [None]:
import numpy as np
import pandas as pd

real = pd.read_csv('../input/fake-and-real-news-dataset/True.csv')
fake = pd.read_csv('../input/fake-and-real-news-dataset/Fake.csv')

In [None]:
### To run the notebook faster at the cost of accuracy, uncomment the two lines below to use only a sample of 40k articles (20k per class)

# real = real.sample(20000)
# fake = fake.sample(20000)
real.shape, fake.shape

In [None]:
real.head()

In [None]:
num = 100 # Selects an article to preview from the real dataset

print('Title: ', real.title[num],'\n')
print('Article:\n', real.text[num])

In [None]:
### Based on the differences in this column we cannot use this feature without data leakage.
print('subjects of fake news articles:',fake['subject'].unique())
print('subjects of real news articles:',real['subject'].unique())
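
One way to make the leakage check explicit is to intersect the two sets of subject values. The sketch below uses hypothetical toy frames (`toy_real`, `toy_fake`) with placeholder subject values rather than the actual files, so it only illustrates the check itself.

```python
import pandas as pd

# Hypothetical toy frames standing in for `real` and `fake`;
# the subject values are placeholders chosen for illustration.
toy_real = pd.DataFrame({"subject": ["politicsNews", "worldnews", "worldnews"]})
toy_fake = pd.DataFrame({"subject": ["News", "politics", "left-news"]})

real_subjects = set(toy_real["subject"].unique())
fake_subjects = set(toy_fake["subject"].unique())

# If no subject appears in both classes, the subject alone reveals the label.
shared = real_subjects & fake_subjects
print("shared subjects:", shared)
print("subject alone identifies the label:", len(shared) == 0)
```

If the intersection is empty on the real data as well, a model given `subject` as a feature could separate the classes without reading any text.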

In [None]:
### Since the real news articles and fake news articles are in two different data sets we can add a label column easily
real['is_real'] = 1
fake['is_real'] = 0

In [None]:
data = pd.concat([real, fake])  # DataFrame.append was removed in pandas 2.0
data.index = range(data.shape[0])
data.sample(10)

In [None]:
data.isnull().sum()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

Here we create the count vectorizer and the tfidf vectorizer. The optional arguments strip accents, include both unigrams and bigrams (`ngram_range=(1,2)`), filter out stop words, and cap the vocabulary (and therefore the vector length) at 1k features. Both vectorizers perform tokenization on their own, so that step has been skipped. Update the `text_column` variable to select whether to vectorize the articles by title or by the full article text. Using only the titles, the models reach about 92% accuracy, but vectorizing the entire corpus completes in less than 3 seconds. Using the full articles, accuracy exceeds 99% on test data, but the vectorization time jumps to almost a minute for the entire corpus.

In [None]:
countvec = CountVectorizer(strip_accents='ascii', stop_words=stopwords, ngram_range=(1,2), max_features=1000)
tfidf = TfidfVectorizer(strip_accents='ascii', stop_words=stopwords, ngram_range=(1,2), max_features=1000)

text_column = 'title' # use 'text' to train on the full articles or 'title' to only use the titles.
                     # 'text' will take a lot longer for the vectorizers to run

The sample from the count vectorizer below may appear to contain only zeroes. This is because the count matrix is sparse: the vast majority of columns in any one row will be zero. Below this cell we can also view the vocabulary key used to build the vectors.

In [None]:
%%time
count_dat = countvec.fit_transform(data[text_column])
count_dat = pd.DataFrame(count_dat.toarray())
count_dat.sample(10)
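
The vectorizer itself returns a SciPy sparse matrix, which the cell above converts to a dense DataFrame for viewing. A small sketch on made-up documents (the `toy_docs` list is an assumption) shows how to measure how many entries are actually non-zero:

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration only.
toy_docs = ["tax bill passes", "senate votes", "court blocks order", "tax vote delayed"]
toy_matrix = CountVectorizer().fit_transform(toy_docs)

# fit_transform returns a sparse matrix; only non-zero entries are stored.
print("is sparse:", sparse.issparse(toy_matrix))

n_rows, n_cols = toy_matrix.shape
density = toy_matrix.nnz / (n_rows * n_cols)
print(f"{toy_matrix.nnz} non-zero entries out of {n_rows * n_cols} ({density:.1%})")
```

With short texts and a 1k-term vocabulary the density is typically far lower than in this toy case, which is why a random sample of rows can look entirely zero.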

In [None]:
### Unhide the output to see the vocabulary dictionary generated by the vectorizer

countvec.vocabulary_

In [None]:
%%time
tfidf_dat = tfidf.fit_transform(data[text_column])
tfidf_dat = pd.DataFrame(tfidf_dat.toarray())
tfidf_dat.sample(10)
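
To make the difference between the two vectorizers concrete, the sketch below rebuilds the tfidf weights by hand from raw counts and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings (smoothed idf, L2 row normalization) and a made-up three-document corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents for illustration only.
docs = ["trump says tax", "tax bill passes", "trump trump video"]

# Raw term counts, one row per document.
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf, the scikit-learn default (smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then scale each row to unit L2 length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk = TfidfVectorizer().fit_transform(docs).toarray()
print("manual tfidf matches TfidfVectorizer:", np.allclose(manual, sk))
```

Terms that occur in fewer documents get a larger idf, so they contribute more to a row than a term that appears everywhere.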

Here we split the data into training and test sets. The two sets of training data represent the two different vectorization methods, performed side by side for comparison. Since this is a balanced dataset we will use accuracy as the metric.

In [None]:
from sklearn.model_selection import train_test_split
y = data.is_real

train_x1, test_x1, train_y1, test_y1 = train_test_split(count_dat, y, test_size=.3, random_state=42)
train_x2, test_x2, train_y2, test_y2 = train_test_split(tfidf_dat, y, test_size=.3, random_state=42)
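
Because both calls use the same `random_state` on inputs with the same number of rows, they select the same rows for the test set, which keeps the comparison between vectorizers fair. A small sketch with hypothetical toy frames (`toy_a`, `toy_b`, `toy_y`) illustrates this:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Two hypothetical feature tables with the same rows but different values,
# standing in for the count and tfidf matrices.
toy_a = pd.DataFrame(np.arange(20).reshape(10, 2))
toy_b = pd.DataFrame(np.arange(20, 40).reshape(10, 2))
toy_y = pd.Series([0, 1] * 5)

a_train, a_test, ya_train, ya_test = train_test_split(
    toy_a, toy_y, test_size=0.3, random_state=42)
b_train, b_test, yb_train, yb_test = train_test_split(
    toy_b, toy_y, test_size=0.3, random_state=42)

# Same seed and same number of rows give the same partition.
print("same test rows:", list(a_test.index) == list(b_test.index))
print("same test labels:", ya_test.equals(yb_test))
```

Passing both feature tables to a single `train_test_split(toy_a, toy_b, toy_y, ...)` call would give the same guarantee without relying on matching seeds.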

Below we use three different models: a support vector machine, a random forest, and a naive Bayes model. The %%time magic is used to view the time each model takes to complete training/inference. It is worth pointing out how fast the SVM is able to train and perform inference compared to the other two models.

Depending on the model used, the difference between the count vectorizer and tfidf is either trivial or significant. The most interesting of these changes is in the naive Bayes model, which not only becomes much more accurate but also cuts a significant amount of time off training/inference. One possible explanation is that tfidf produces weighted, normalized values instead of straightforward counts, which may suit some models better than raw counts. Treat this as a hypothesis rather than an established result: neither naive Bayes nor a random forest is fit by iterating from initial coefficients, so any benefit is more likely to come from the scale and distribution of the features than from starting closer to an optimum.

In [None]:
%%time

from sklearn.svm import LinearSVC
from sklearn import metrics

svm = LinearSVC()
svm.fit(train_x1, train_y1)
preds = svm.predict(test_x1)
print('Accuracy with count vectorizer:', metrics.accuracy_score(preds, test_y1), '\n\n')

In [None]:
%%time

svm = LinearSVC()
svm.fit(train_x2, train_y2)
preds = svm.predict(test_x2)
print('Accuracy with tfidf vectorizer:', metrics.accuracy_score(preds, test_y2), '\n\n')

In [None]:
%%time
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(train_x1, train_y1)
print('Accuracy with count vectorizer:', metrics.accuracy_score(rfc.predict(test_x1), test_y1), '\n\n')

In [None]:
%%time
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(train_x2, train_y2)
print('Accuracy with tfidf vectorizer:', metrics.accuracy_score(rfc.predict(test_x2), test_y2), '\n\n')

In [None]:
%%time
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(train_x1, train_y1)
print('Accuracy with count vectorizer:', metrics.accuracy_score(gnb.predict(test_x1), test_y1), '\n\n')

In [None]:
%%time
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(train_x2, train_y2)
print('Accuracy with tfidf vectorizer:', metrics.accuracy_score(gnb.predict(test_x2), test_y2), '\n\n')
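
The six model cells above repeat the same fit, predict, score pattern. Below is a minimal sketch of a hypothetical helper, `fit_and_score` (not part of the original notebook), that keeps each score tied to the predictions of the model being evaluated. It runs on synthetic data from `make_classification` rather than on the news vectors, so the printed numbers say nothing about this dataset.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC


def fit_and_score(model, train_x, train_y, test_x, test_y):
    """Fit `model`, predict on the test split, and return the accuracy."""
    model.fit(train_x, train_y)
    preds = model.predict(test_x)  # predictions always come from `model` itself
    return metrics.accuracy_score(test_y, preds)


# Synthetic stand-in for the vectorized articles (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
tr_x, te_x, tr_y, te_y = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "svm": LinearSVC(),
    "random forest": RandomForestClassifier(random_state=42),
    "naive bayes": GaussianNB(),
}

scores = {name: fit_and_score(model, tr_x, tr_y, te_x, te_y)
          for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

In the notebook, the same helper could be called once per model with `train_x1`/`test_x1` and once with `train_x2`/`test_x2`.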