In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
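# pd.read_msgpack was removed in pandas 1.0, so this cell needs pandas < 1.0.
df = pd.read_msgpack("./combined.msg")
# Binary target: article types below 2 are labelled fake (1), all others real (0).
df['fakeness'] = 0
df.loc[df['type'] < 2, 'fakeness'] = 1
# Keep the first 15,000 rows of each class: a 30,000-article subset.
smaller = df.groupby('fakeness').head(15000)
df.head()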

Unnamed: 0,type,content,title,fakeness
0,0,\n\n\n\n\n\n\n\nRev Dr. Childress is available...,BlackGenocide.org,1
1,0,\n\nSpeaking Engagement Request\n\n\n\nContact...,Request Speaking Engagement,1
2,0,"""…I have set before you life and death, blessi...",BlackGenocide.org,1
3,0,Why We Oppose Planned Parent Hood ( The follow...,Why We Oppose Planned Parenthood,1
4,1,Headline: Bitcoin & Blockchain Searches Exceed...,Surprise: Socialist Hotbed Of Venezuela Has Lo...,1


In [3]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

In [4]:
fitted = text_clf.fit(smaller.content.values, smaller.fakeness.values)

In [5]:
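# Predictions are made on the same articles used for fitting,
# so the scores below are training-set scores.
predicted = text_clf.predict(smaller.content.values)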

In [6]:
print("Accuracy: " + str(np.mean(predicted == smaller.fakeness.values) * 100) + "%")

Accuracy: 90.81%


In [7]:
print(metrics.classification_report(smaller.fakeness, predicted, target_names=['real', 'fake']))

             precision    recall  f1-score   support

       real       0.90      0.92      0.91     15000
       fake       0.91      0.90      0.91     15000

avg / total       0.91      0.91      0.91     30000



In [8]:
metrics.confusion_matrix(smaller.fakeness, predicted)

array([[13728,  1272],
       [ 1485, 13515]])

### From Here
----

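An accuracy of about 91% is acceptable if the model is one component of a larger system that separates likely fake news from likely real news. Note that this figure was measured on the same 30,000 articles the model was trained on, so accuracy on unseen articles is probably lower.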
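
Cells 4 to 6 fit and score the pipeline on the same articles, so a held-out split would give a more honest estimate. The following is a minimal sketch of that idea, not the notebook's own evaluation: the toy corpus is made up and stands in for `smaller.content` and `smaller.fakeness`, and the 25% test size is an arbitrary assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for smaller.content / smaller.fakeness.
texts = [
    "officials confirm budget vote scheduled for tuesday",
    "city council approves new transit funding plan",
    "court publishes ruling on water rights dispute",
    "university releases annual enrollment figures",
    "agency reports quarterly employment statistics",
    "senate committee schedules hearing on trade bill",
    "shocking secret cure doctors do not want you to know",
    "you will not believe what this celebrity secretly did",
    "miracle pill melts fat overnight shocking results",
    "secret plot exposed shocking truth they are hiding",
    "aliens secretly control the weather shocking leak",
    "shocking miracle trick banks do not want you to know",
]
labels = [0] * 6 + [1] * 6  # 0 = real, 1 = fake, as in the notebook

# Hold back a quarter of the articles; stratify keeps both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

# Fit on the training split only, then score on articles the model has not seen.
text_clf.fit(X_train, y_train)
held_out_accuracy = text_clf.score(X_test, y_test)
print("Held-out accuracy: {:.2f}%".format(held_out_accuracy * 100))
```

On the real data, the same split could be applied to `smaller` before cell 4, with the classification report and confusion matrix computed on the test portion.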

Other components of such a system could examine the source of the article or its publication date. So far, however, little effort has gone into optimizing the current model.

__It is valuable to perform a grid search on the classifier parameters to find optimal values.__
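
One way such a search might look is sketched below with scikit-learn's `GridSearchCV`. The parameter grid, the 3-fold setting, and the toy corpus are all assumptions chosen for illustration; sensible ranges for the real data would need to be worked out separately.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; in the notebook this would be
# smaller.content.values and smaller.fakeness.values.
texts = [
    "officials confirm budget vote scheduled for tuesday",
    "city council approves new transit funding plan",
    "court publishes ruling on water rights dispute",
    "university releases annual enrollment figures",
    "agency reports quarterly employment statistics",
    "senate committee schedules hearing on trade bill",
    "shocking secret cure doctors do not want you to know",
    "you will not believe what this celebrity secretly did",
    "miracle pill melts fat overnight shocking results",
    "secret plot exposed shocking truth they are hiding",
    "aliens secretly control the weather shocking leak",
    "shocking miracle trick banks do not want you to know",
]
labels = [0] * 6 + [1] * 6  # 0 = real, 1 = fake

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

# Keys follow the <step name>__<parameter> convention of Pipeline.
# The candidate values are illustrative assumptions, not tuned choices.
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    'tfidf__use_idf': [True, False],        # with or without IDF weighting
    'clf__alpha': [0.01, 0.1, 1.0],         # Naive Bayes smoothing strength
}

# Every combination is scored with 3-fold cross-validation.
search = GridSearchCV(text_clf, param_grid, cv=3, scoring='accuracy')
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Because each candidate is scored on held-out folds, `best_score_` is also a fairer estimate than accuracy on the training articles.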

The model is likely fitted to the events that were "current" when its training articles were written. Ideally it would be more resilient and "future-proof".

__This means that the model needs to be set up to keep learning from new data sets.__

There are larger datasets out in the wild. The current model uses a subset of [Open Sources'](http://www.opensources.co/) dataset: only 30,000 article bodies. The full dataset has millions of articles and occupies ~30 GB, and scraping the web for new articles can produce multiple GB each day. That is more than many machines can hold in memory at once.

__This means that the model needs to be scaled up to handle a larger set of data that may not fit in memory simultaneously.__
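
A common scikit-learn approach to both points (learning from new data as it arrives, and handling data that does not fit in memory) is incremental training with `partial_fit`. The sketch below is an assumption about how that might be set up, not the notebook's method. It swaps `CountVectorizer` and `TfidfTransformer` for the stateless `HashingVectorizer`, because IDF weights depend on the whole corpus. The `iter_batches` helper and the toy corpus are hypothetical stand-ins for reading chunks from disk.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; real batches would be streamed from disk or a scraper.
texts = [
    "officials confirm budget vote scheduled for tuesday",
    "shocking secret cure doctors do not want you to know",
    "city council approves new transit funding plan",
    "you will not believe what this celebrity secretly did",
    "court publishes ruling on water rights dispute",
    "miracle pill melts fat overnight shocking results",
    "university releases annual enrollment figures",
    "secret plot exposed shocking truth they are hiding",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]  # 0 = real, 1 = fake


def iter_batches(texts, labels, batch_size):
    """Yield (texts, labels) mini-batches; a stand-in for chunked file reads."""
    for start in range(0, len(texts), batch_size):
        stop = start + batch_size
        yield texts[start:stop], labels[start:stop]


# HashingVectorizer needs no fit step, so each batch can be transformed alone.
# alternate_sign=False keeps features non-negative, as MultinomialNB requires.
vectorizer = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)
clf = MultinomialNB()
all_classes = [0, 1]  # partial_fit must be told every class up front

for batch_texts, batch_labels in iter_batches(texts, labels, batch_size=3):
    X_batch = vectorizer.transform(batch_texts)
    clf.partial_fit(X_batch, batch_labels, classes=all_classes)

# Only one batch is in memory at a time; new batches can be fed in later.
prediction = clf.predict(vectorizer.transform(["shocking secret miracle cure"]))
print(prediction)
```

The trade-off is that hashed features cannot be mapped back to words, so inspecting which terms drive a prediction becomes harder.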

Open Sources also provides the following tags, which are more fine-grained than a binary true/false label:

- Fake News (tag fake) Sources that entirely fabricate information, disseminate deceptive content, or grossly distort actual news reports

- Satire (tag satire) Sources that use humor, irony, exaggeration, ridicule, and false information to comment on current events.

- Extreme Bias (tag bias) Sources that come from a particular point of view and may rely on propaganda, decontextualized information, and opinions distorted as facts.

- Conspiracy Theory (tag conspiracy) Sources that are well-known promoters of kooky conspiracy theories.

- Rumor Mill (tag rumor) Sources that traffic in rumors, gossip, innuendo, and unverified claims.

- State News (tag state) Sources in repressive states operating under government sanction.

- Junk Science (tag junksci) Sources that promote pseudoscience, metaphysics, naturalistic fallacies, and other scientifically dubious claims.

- Hate News (tag hate) Sources that actively promote racism, misogyny, homophobia, and other forms of discrimination.

- Clickbait (tag clickbait) Sources that provide generally credible content, but use exaggerated, misleading, or questionable headlines, social media descriptions, and/or images.

- Proceed With Caution (tag unreliable) Sources that may be reliable but whose contents require further verification.

- Political (tag political) Sources that provide generally verifiable information in support of certain points of view or political orientations.

- Credible (tag reliable) Sources that circulate news and information in a manner consistent with traditional and ethical practices in journalism (Remember: even credible sources sometimes rely on clickbait-style headlines or occasionally make mistakes. No news organization is perfect, which is why a healthy news diet consists of multiple sources of information).

How reliably can a model differentiate between a dozen tags?
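
One way to start answering that question is to train the same pipeline on the tags directly and look at per-tag precision and recall. `MultinomialNB` handles more than two classes without changes. The sketch below is only an illustration: the three tag strings, the toy corpus, and the 2-fold setting are assumptions, and nothing here says how the real dataset would score.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with three of the tags listed above.
texts = [
    "council approves transit funding after public hearing",
    "court publishes ruling on water rights dispute",
    "agency reports quarterly employment statistics",
    "senate committee schedules hearing on trade bill",
    "local man declares himself mayor of his own couch",
    "area cat elected to council on nap platform",
    "nation relieved as monday officially cancelled forever",
    "scientists confirm couch is best place for nap",
    "you will not believe what happened next",
    "ten tricks doctors hate number seven will shock you",
    "this one weird trick will shock you",
    "what happened next will leave you speechless",
]
tags = ['reliable'] * 4 + ['satire'] * 4 + ['clickbait'] * 4

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

# Each article is predicted by a model that did not see it during training.
predicted = cross_val_predict(text_clf, texts, tags, cv=2)

# Per-tag precision, recall and F1 show which tags get confused with each other.
print(classification_report(tags, predicted, zero_division=0))
```

On the full tag set, `metrics.confusion_matrix` would additionally show which pairs of tags the model mixes up most often.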