## Out-of-core Learning
----

Another major problem with the initial model was its restriction to datasets that fit in memory. This means:
  - Technically, we are severely limited in the number of possible data sources
  - Practically, the model must be retrained from scratch each time new data arrives
    - Imagine running a model trained during the 2012 presidential election on the 2016 election
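
One way around the memory limit is to stream the corpus from disk in fixed-size batches instead of loading it all at once. The following is a minimal sketch, assuming the articles are stored in a CSV file with `content` and `fakeness` columns; the `stream_batches` helper, the file layout, and the demonstration rows are illustrative assumptions, not part of the original notebook.

```python
import csv
import os
import tempfile
from itertools import islice


def stream_batches(path, batch_size=10000, text_col="content", label_col="fakeness"):
    """Yield (texts, labels) lists of at most batch_size rows from a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        while True:
            # islice pulls at most batch_size rows, so only one batch is held in memory
            rows = list(islice(reader, batch_size))
            if not rows:
                break
            yield [r[text_col] for r in rows], [int(r[label_col]) for r in rows]


# Tiny demonstration file (hypothetical data) so the sketch runs on its own
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="",
                                 encoding="utf-8", delete=False) as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["content", "fakeness"])
    for i in range(5):
        writer.writerow([f"article {i}", i % 2])
    demo_path = tmp.name

batch_sizes = [len(texts) for texts, _ in stream_batches(demo_path, batch_size=2)]
os.remove(demo_path)
print(batch_sizes)
```

Each batch produced this way could be vectorized and passed to `partial_fit` in the same way as the in-memory batches used in the cells that follow.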



In [1]:
import time
import pandas as pd
import numpy as np
# ENGLISH_STOP_WORDS is exported from sklearn.feature_extraction.text;
# the separate stop_words module is gone from newer scikit-learn releases
from sklearn.feature_extraction.text import (ENGLISH_STOP_WORDS, HashingVectorizer,
                                             TfidfTransformer)
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
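# Note: pd.read_msgpack was removed in pandas 1.0, so this cell needs an older pandas
df = pd.read_msgpack("./combined.msg")
# Binary label: 1 = fake news (type < 2), 0 = factual
df['fakeness'] = 0
df.loc[df['type'] < 2, 'fakeness'] = 1
# Shuffle the rows so that every batch contains a mix of both classes
df = df.reindex(np.random.permutation(df.index))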

In [3]:
df.fakeness.value_counts()

0    290438
1    171636
Name: fakeness, dtype: int64

In [4]:
vect = HashingVectorizer(n_features=2**18,
                         alternate_sign=False,
                         stop_words=ENGLISH_STOP_WORDS,
                         ngram_range=(1,2))
trans = TfidfTransformer()
clf = MultinomialNB(alpha=0.250)

In [5]:
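def iter_batches(max_size=None, batch_size=10000):
    """Yield (texts, labels) batches, skipping the first batch_size rows.

    The first batch_size rows are left out so they can serve as the
    held-out test set; this only lines up when batch_size equals the
    test-set size.
    """
    if max_size is None:
        max_size = df.shape[0]
    counter = batch_size
    while counter < max_size:
        batch = df.iloc[counter:counter + batch_size]
        yield batch.content.values, batch.fakeness.values
        counter += batch_size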

In [6]:
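# Hold out the first 15,000 shuffled rows as the test set
test = df.iloc[:15000].content.values
y_test = df.iloc[:15000].fakeness.values
test_vect = vect.transform(test)
# TfidfTransformer has no partial_fit, so the IDF weights are fitted once
# here (on the held-out rows) and reused for every training batch
test_tfidf = trans.fit_transform(test_vect)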

batches = iter_batches(batch_size=15000)
for i, (x_train, y_train) in enumerate(batches):
    tick = time.time()
    x_vect = vect.transform(x_train)
    x_tfidf = trans.transform(x_vect)
    clf.partial_fit(x_tfidf, y_train, classes=[0, 1])

    print("Train time", time.time() - tick)
    print("Pos", (i + 1) * 15000)
    print("Score", clf.score(test_tfidf, y_test))
    predicted = clf.predict(test_tfidf)
    print("Accuracy: " + str(np.mean(predicted == y_test) * 100) + "%")

Train time 8.76163649559021
Pos 15000
Score 0.8398
Accuracy: 83.98%
Train time 9.653783798217773
Pos 30000
Score 0.8082
Accuracy: 80.82000000000001%
Train time 9.237140655517578
Pos 45000
Score 0.7992
Accuracy: 79.92%
Train time 9.591607332229614
Pos 60000
Score 0.7921333333333334
Accuracy: 79.21333333333334%
Train time 8.916282176971436
Pos 75000
Score 0.7913333333333333
Accuracy: 79.13333333333334%
Train time 8.801216125488281
Pos 90000
Score 0.7884666666666666
Accuracy: 78.84666666666666%
Train time 8.779746532440186
Pos 105000
Score 0.7882666666666667
Accuracy: 78.82666666666667%
Train time 8.794522523880005
Pos 120000
Score 0.7876666666666666
Accuracy: 78.76666666666667%
Train time 9.300930261611938
Pos 135000
Score 0.7863333333333333
Accuracy: 78.63333333333333%
Train time 9.719456434249878
Pos 150000
Score 0.7852666666666667
Accuracy: 78.52666666666667%
Train time 9.331478357315063
Pos 165000
Score 0.7857333333333333
Accuracy: 78.57333333333332%
Train time 9.788714170455933
Pos 

In [7]:
predicted = clf.predict(test_tfidf)

In [8]:
print("Accuracy: " + str(np.mean(predicted == y_test) * 100) + "%")

Accuracy: 78.28%


In [14]:
print(metrics.classification_report(y_test, predicted, target_names=['factual', 'fake news']))

             precision    recall  f1-score   support

    factual       0.99      0.67      0.80      9408
  fake news       0.64      0.99      0.78      5592

avg / total       0.86      0.79      0.79     15000



## Discussion of Results
----

78% accuracy doesn't seem very reassuring.

When looking at the classification report, the columns can be read as follows:
  - __Support__: The number of true examples of that class in the test set
  - __Precision__: Given a positive prediction from the classifier, how likely is it to be correct?
  - __Recall__: Given a positive example, will the classifier detect it?
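
The three quantities above can be computed directly from true and predicted labels. The following is a small sketch on toy labels; the arrays and the `class_report` helper are hypothetical and are not taken from the dataset or from scikit-learn. The toy predictions are chosen to mimic the pattern in the report: nothing labeled factual is wrong, but some factual items are flagged as fake.

```python
# Toy labels (hypothetical, not taken from the dataset): 0 = factual, 1 = fake news
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]


def class_report(y_true, y_pred, label):
    """Return (precision, recall, support) for one class label."""
    # True positives: items of this class that were also predicted as this class
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)   # all predictions of this class
    support = sum(t == label for t in y_true)     # all true items of this class
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    return precision, recall, support


for label, name in [(0, "factual"), (1, "fake news")]:
    precision, recall, support = class_report(y_true, y_pred, label)
    print(f"{name:>9}: precision={precision:.2f} recall={recall:.2f} support={support}")
```

In this toy case the factual class has perfect precision but lower recall, while the fake news class has perfect recall but lower precision, which is the same shape as the report above.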
  
Upon closer inspection, these results actually look very promising. The recall for factual articles (0.67) is one of the _less_ important statistics. It is typically easy to identify what is real, and there are many other means of doing so: we can look at the publisher, the journalist, and the "web of trust" built around a piece of news to quickly decide whether an article is clearly accurate or suspect.

On the other side of the coin, the high recall for "fake news" (0.99) is also promising. If an article appears suspect by the other measures listed above, there is a very high chance that the model will confirm our suspicions.

Why is the accuracy of this model so low?

One possibility is that the class imbalance in this subset of the larger corpus affects the model's ability to learn. With more than $3/5$ of the dataset labeled as likely factual, there is room for more "fake news" examples.
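
To put the imbalance in numbers, the class shares can be computed from the counts printed by `df.fakeness.value_counts()` earlier, together with the accuracy of a trivial classifier that always predicts the majority class. This is a small arithmetic sketch; the variable names are illustrative.

```python
# Class counts as printed by df.fakeness.value_counts() (0 = factual, 1 = fake news)
counts = {0: 290438, 1: 171636}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most common class scores this accuracy
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print(f"total articles: {total}")
print(f"factual share:  {shares[0]:.3f}")
print(f"fake share:     {shares[1]:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

The factual share comes out just under 63%, which confirms the "more than $3/5$" figure and gives a floor against which the model's accuracy can be compared.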

## Future Work
----

### Dataset Improvements

When initially working through the ~9 million articles in the dataset, I noticed significant problems with data corruption both in the CSV and after it was imported with pandas.

There is a lot of room for improvement in the input data sets, and luckily this is something that can be automated. I would be interested in casting a wide net over news sources, scraping them from their respective websites, and cleaning up any corruption that occurs during scraping.

This would also solve another problem that was noticed: many of the documents from "fake-news" sites were not news articles in any way. Some were simply home pages, others were comment threads. I'm unsure how much these non-news documents influenced the output of the model.

### Modeling Improvements

This model is missing some of the features commonly found in neural networks:
- The ability to "drop" connections between neurons
  - A possibly productive analogue would be the ability to mark specific nouns to be "forgotten" during training. We wouldn't want to train on words that are over-represented in current events.
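
The "forgetting" idea could be prototyped as a text filter that removes selected terms before vectorization. The sketch below is only an illustration under stated assumptions: the `FORGOTTEN_TERMS` set and the `forget_terms` helper are hypothetical, and the example words are placeholders rather than a vetted list.

```python
import re

# Hypothetical list of over-represented, event-specific terms to "forget"
FORGOTTEN_TERMS = {"election", "candidate"}

# Simple word tokenizer; an assumption, not the tokenizer used by the vectorizer
_TOKEN = re.compile(r"\b\w+\b")


def forget_terms(text, forgotten=FORGOTTEN_TERMS):
    """Lowercase the text and drop every token found in `forgotten`."""
    tokens = _TOKEN.findall(text.lower())
    return " ".join(t for t in tokens if t not in forgotten)


cleaned = forget_terms("The candidate spoke after the Election results")
print(cleaned)
```

With the setup used in this notebook, a simpler route may be to extend the stop-word list instead, for example by passing something like `stop_words=list(ENGLISH_STOP_WORDS.union(FORGOTTEN_TERMS))` to `HashingVectorizer`. Either way, choosing which terms to forget remains an open question.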