# Fake News Assignment
**Authors**: Vilhelm Stiernstedt, Sharon Marín Salazar & Andrea Tondella
<br>
**Date**: 26/05/2018

#### Bag of Words Approaches
In this section we will import our cleaned data from the analysis section and explore various models by setting up pipelines with different:
- vectorizers
- stemmers
- lemmatizers
- tf-idfs

#### Model Optimization
We will try to optimize the same models established in our baseline tests in the previous section, using hyperparameter optimization via a randomized search over a parameter grid. As always, we seek to split our data into training and validation sets and use cross-validation to avoid overfitting. Models to try:
- Naive Bayes
- SGD
- SVM
    - linear
    - rbf
    - poly (removed due to heavy computations and inferior results)
    - sigmoid (removed due to heavy computations and inferior results)
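As a rough illustration of the cross-validation idea mentioned above, the sketch below splits sample indices into three contiguous folds, so every sample is used for validation exactly once. The helper name `k_fold_indices` is hypothetical, and the splitting is simplified: for classifiers, scikit-learn stratifies the folds by label, which this sketch ignores.

```python
def k_fold_indices(n_samples, n_splits=3):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled k-fold splits."""
    # spread the remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# toy example with 10 samples and 3 folds
folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, val_idx in folds:
    print(len(train_idx), 'training samples, validation indices:', val_idx)
```

Each candidate model is fitted once per fold and scored on the held-out part; the mean of those scores is what a search with `cv=3` reports.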

#### Evaluation
We will evaluate our models by trying to:
- produce models that achieve high average accuracy
- produce models that score high on either the Fake or the Real class

By doing so we hope to later use stacking/ensembling to bring our models together and achieve a higher prediction score.
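To make the difference between overall accuracy and per-class scores concrete, here is a minimal sketch that computes precision and recall for each class by hand. The function name and the toy labels are assumptions for illustration; they are not results from this notebook.

```python
def per_class_precision_recall(y_true, y_pred, labels=("FAKE", "REAL")):
    """Return {label: (precision, recall)} computed from two label sequences."""
    scores = {}
    for label in labels:
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)   # how often the model says `label`
        actual = sum(1 for t in y_true if t == label)      # how often `label` is correct
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


# toy labels, not taken from the assignment data
y_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]
y_pred = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL"]

scores = per_class_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(scores, accuracy)
```

In this toy case the model is never wrong when it says FAKE (precision 1.0) but misses one fake article (recall 2/3), which is the kind of class-specific strength an ensemble could try to exploit.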

## Import Libraries

In [4]:
# general libs
import warnings
import collections
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import re

# nltk libs
from nltk import ngrams
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
# download required nltk packages (NB. commented out)
# nltk.download()

# sklearn libs
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# plot settings
%matplotlib inline

# pandas view settings -> see all contents of column (None = no truncation; -1 is deprecated)
pd.set_option('display.max_colwidth', None)

# Warning settings -> suppress deprecation warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
# Warning settings -> suppress future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Function Definitions

### Pipeline 

In [5]:
# function to build pipeline with multiple models
# https://github.com/bmurauer/pipelinehelper/blob/master/pipelinehelper.py

from sklearn.base import TransformerMixin, BaseEstimator, ClassifierMixin
from collections import defaultdict
import itertools

class PipelineHelper(BaseEstimator, TransformerMixin, ClassifierMixin):

    def __init__(self, available_models=None, selected_model=None, include_bypass=False):
        self.include_bypass = include_bypass
        self.selected_model = selected_model
        # this is required for the clone operator used in gridsearch
        if type(available_models) == dict:
            self.available_models = available_models
        # this is the case for constructing the helper initially
        else:
            # a string identifier is required for assigning parameters
            self.available_models = {}
            for (key, model) in available_models:
                self.available_models[key] = model

    def generate(self, param_dict={}):
        per_model_parameters = defaultdict(lambda: defaultdict(list))

        # collect parameters for each specified model
        for k, values in param_dict.items():
            model_name = k.split('__')[0]
            param_name = k[len(model_name)+2:]  # might be nested
            if model_name not in self.available_models:
                raise Exception('no such model: {0}'.format(model_name))
            per_model_parameters[model_name][param_name] = values

        ret = []

        # create instance for cartesian product of all available parameters for each model
        for model_name, param_dict in per_model_parameters.items():
            parameter_sets = (dict(zip(param_dict, x)) for x in itertools.product(*param_dict.values()))
            for parameters in parameter_sets:
                ret.append((model_name, parameters))

        # for every model that has no specified parameters, add the default model
        for model_name in self.available_models.keys():
            if model_name not in per_model_parameters:
                ret.append((model_name, dict()))

        # check if the stage is to be bypassed as one configuration
        if self.include_bypass:
            ret.append((None, dict(), True))
        return ret

    def get_params(self, deep=False):
        return {'available_models': self.available_models,
                'selected_model': self.selected_model,
                'include_bypass': self.include_bypass}

    def set_params(self, selected_model, available_models=None, include_bypass=False):
        include_bypass = len(selected_model) == 3 and selected_model[2]

        if available_models:
            self.available_models = available_models

        if selected_model[0] is None and include_bypass:
            self.selected_model = None
            self.include_bypass = True
        else:
            if selected_model[0] not in self.available_models:
                raise Exception('no such model available: {0}'.format(selected_model[0]))
            self.selected_model = self.available_models[selected_model[0]]
            self.selected_model.set_params(**selected_model[1])

    def fit(self, X, y=None):
        if self.selected_model is None and not self.include_bypass:
            raise Exception('no model was set')
        elif self.selected_model is None:
            # print('bypassing model for fitting, returning self')
            return self
        else:
            # print('using model for fitting: ', self.selected_model.__class__.__name__)
            return self.selected_model.fit(X, y)

    def transform(self, X, y=None):
        if self.selected_model is None and not self.include_bypass:
            raise Exception('no model was set')
        elif self.selected_model is None:
            # print('bypassing model for transforming:')
            # print(X[:10])
            return X
        else:
            # print('using model for transforming: ', self.selected_model.__class__.__name__)
            return self.selected_model.transform(X)

    def predict(self, x):
        if self.include_bypass:
            raise Exception('bypassing classifier is not allowed')
        if self.selected_model is None:
            raise Exception('no model was set')
        return self.selected_model.predict(x)
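
The expansion performed by `generate` can be easier to follow in isolation. The sketch below is a standalone re-statement of that logic (the function name `expand_model_params` is hypothetical and it is not part of the `pipelinehelper` package): parameters are grouped by model prefix, the Cartesian product is taken per model, and models without listed parameters are added once with defaults.

```python
import itertools
from collections import defaultdict


def expand_model_params(available_models, param_dict):
    """Standalone sketch of the expansion done by PipelineHelper.generate."""
    per_model = defaultdict(dict)
    for key, values in param_dict.items():
        # 'counter__ngram_range' -> ('counter', 'ngram_range')
        model_name, _, param_name = key.partition('__')
        per_model[model_name][param_name] = values

    candidates = []
    for model_name, params in per_model.items():
        # one candidate per combination of this model's parameter values
        for combo in itertools.product(*params.values()):
            candidates.append((model_name, dict(zip(params, combo))))

    # models without any listed parameters are tried once with their defaults
    for model_name in available_models:
        if model_name not in per_model:
            candidates.append((model_name, {}))
    return candidates


candidates = expand_model_params(
    available_models=['counter', 'lemmatizer'],
    param_dict={'counter__ngram_range': [(1, 2), (1, 3)],
                'counter__lowercase': (True, False)})
for candidate in candidates:
    print(candidate)
```

Each `(model_name, parameters)` tuple is one value that the search can assign to `selected_model`, which is why the search logs print tuples of this shape.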


## Import Data

In [6]:
# set path to data
data_path = 'data/'

# load test and train
df_train = pd.read_csv(data_path+'training_clean.csv')
df_test = pd.read_csv(data_path+'fake_or_real_news_test.csv')

# set index
df_train.set_index('ID', inplace=True)
df_test.set_index('ID', inplace=True)

# define combined df (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
all_data = pd.concat([df_train, df_test])

## Data Processing 

### Text Processing
#### Stemmers & Lemmatizer

In [7]:
# define count vectorizer for modelling (different parameter inputs will be given in modelling)
count_vectorizer = CountVectorizer()

# define Snowball stemmer (different parameter inputs will be given in modelling)
snowball_stemmer = SnowballStemmer("english")

# define new vectorizer function with snowball stemmer
class SnowballCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(SnowballCountVectorizer, self).build_analyzer()
        return lambda doc: ([snowball_stemmer.stem(w) for w in analyzer(doc)])
    
# define Porter stemmer in NLTK_EXTENSIONS mode
# (different parameter inputs will be given in modelling)
porter_stemmer = PorterStemmer(mode='NLTK_EXTENSIONS')

# define new vectorizer function with porter stemmer
class PorterCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(PorterCountVectorizer, self).build_analyzer()
        return lambda doc: ([porter_stemmer.stem(w) for w in analyzer(doc)])
    
# define lemmatizer
lemmatizer = WordNetLemmatizer()

# define new vectorizer function with lemmatizer
class LemmatizerCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(LemmatizerCountVectorizer, self).build_analyzer()
        return lambda doc: ([lemmatizer.lemmatize(w) for w in analyzer(doc)])
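
The three classes above share one pattern: take the analyzer the vectorizer would normally build, then map a word normalizer over its output. The sketch below shows that pattern without scikit-learn or NLTK. `base_analyzer` is a simplified stand-in for the default analyzer (lowercasing plus tokens of two or more word characters), and `naive_stem` is a deliberately crude toy, not the Snowball or Porter algorithm.

```python
import re


def base_analyzer(doc):
    # simplified stand-in for the analyzer CountVectorizer builds by default
    return re.findall(r"\b\w\w+\b", doc.lower())


def naive_stem(word):
    # toy suffix stripper for illustration only, NOT Snowball or Porter
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def with_stemmer(analyzer, stem):
    # same wrapping idea as the build_analyzer overrides above
    return lambda doc: [stem(w) for w in analyzer(doc)]


stemming_analyzer = with_stemmer(base_analyzer, naive_stem)
tokens = stemming_analyzer("Reporters reported the reports")
print(tokens)
```

One point to keep in mind: the analyzer returned by `build_analyzer` already emits n-grams when `ngram_range` goes beyond unigrams, so in the classes above the stemmer or lemmatizer may receive multi-word strings rather than single words.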

### Model Pipeline 1
In our first pipeline we will try our simple count vectorizer as well as different types of stemmers, with varying settings such as n-gram range, stop-word removal and lowercasing. We will also see whether TF-IDF (term frequency times inverse document frequency) can increase our score, along with some parameter tuning for our models. We hope that at least one of these stemmers will beat our baseline for all tested models.

We will assess the following combinations:
- count vectorizer
- count vectorizer w. snowball stemmer
- count vectorizer w. porter stemmer
- count vectorizer w. lemmatizer

TF-IDF: to reduce the weight of very common words, we pass the vectorizer output through the TF-IDF transformer, which assigns more weight to less common words.
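
As a small worked example of that re-weighting, the sketch below computes TF-IDF by hand for a toy count matrix. It assumes the smoothed formula documented for scikit-learn's `TfidfTransformer` with `smooth_idf=True`, namely idf = ln((1 + n) / (1 + df)) + 1, followed by l2 normalisation of each row. The counts and the helper name `tfidf_rows` are made up for illustration.

```python
import math


def tfidf_rows(counts):
    """counts: list of per-document term-count lists (same column order)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # document frequency: number of documents containing each term
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        rows.append([v / norm for v in weighted])  # l2 normalisation
    return idf, rows


# columns: "the", "election", "hoax" (toy counts)
counts = [[3, 1, 0],
          [2, 0, 0],
          [4, 0, 1]]
idf, rows = tfidf_rows(counts)
print([round(w, 3) for w in idf])
```

"the" appears in every document and keeps the minimum weight of 1.0, while "election" and "hoax" each appear in one document and receive a higher weight.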

#### Variable Selection

In [8]:
# create different feature subsets
x = df_train.text

#### Label

In [9]:
# save label
y = df_train.label

#### Split training data

In [10]:
# split training data and labels into train and validation 80/20 -> not used 
# -> causes too much data leakage -> model overfits
#x_train, x_validation, y_train, y_validation = train_test_split(x, y,
#                                                                test_size=0.2, random_state=42)

In [11]:
# define pipeline (vectorizer, models)
pipeline = Pipeline([('vect', PipelineHelper([
                            ('counter', CountVectorizer()),
                            #('snowball_stemmer', SnowballCountVectorizer()),
                            #('porter_stemmer', PorterCountVectorizer()),
                            #('lemmatizer', LemmatizerCountVectorizer()),
                        ])),
                     ('clf', PipelineHelper([
                            #('sgd', SGDClassifier()),
                            #('svm-lin', LinearSVC()),
                            #('svm-ker', SVC()),
                            ('multi_nb', MultinomialNB()),
                        ])),
                       ])

#### Parameters
We will extend the model parameters and hope to improve our score.

In [12]:
# define pipeline parameters
parameters = {'vect__selected_model': pipeline.named_steps['vect'].generate({
                  'counter__ngram_range': [(1, 2), (1, 3)],
                  'counter__stop_words': ('english', None),
                  'counter__lowercase': (True, False),
                  'counter__max_df': (0.25, 0.5, 0.75, 1.0),  # float = proportion of documents; int 1 would mean "at most 1 document"
                  #'snowball_stemmer__ngram_range': [(1, 2), (1, 3)],
                  #'snowball_stemmer__stop_words': ('english', None),
                  #'snowball_stemmer__lowercase': (True, False),
                  #'snowball_stemmer__max_df': (0.5, 0.75, 1),
                  #'porter_stemmer__ngram_range': [(1, 2), (1, 3)],
                  #'porter_stemmer__stop_words': ('english', None),
                  #'porter_stemmer__max_df': (0.5, 0.75, 0.1),
                  #'porter_stemmer__lowercase': (True, False),
                  #'lemmatizer__ngram_range': [(1, 2), (1, 3)],
                  #'lemmatizer__stop_words': ('english', None),
                  #'lemmatizer__lowercase': (True, False),
                  #'lemmatizer__max_df': (0.5, 0.75, 1),
                }),
              'clf__selected_model': pipeline.named_steps['clf'].generate({
                    #'sgd__alpha': (1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3),
                    #'sgd__loss': ('hinge', 'squared_hinge', 'log'),
                    #'sgd__l1_ratio': (0, 0.1, 0.25, 0.5, 0.75, 1.0),
                    #'svm-lin__penalty': ('l1', 'l2'),
                    #'svm-lin__loss': ('hinge', 'squared_hinge'),
                    #'svm-lin__C': (0.1, 1, 10, 50, 100, 500, 1000),
                    #'svm-ker__kernel': ('rbf', 'poly', 'sigmoid'),
                    #'svm-ker__C': (0.1, 1, 10, 50, 100, 500, 1000),
                    'multi_nb__alpha': (0.1, 0.25, 0.5)
                })
              }
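
One detail about `max_df` is easy to miss: scikit-learn documents a float as a proportion of documents and an integer as an absolute document count, so `1.0` and `1` are not interchangeable. The sketch below mimics that interpretation; the helper name `max_doc_count` and the corpus size are assumptions for illustration.

```python
import numbers


def max_doc_count(max_df, n_docs):
    """Upper bound on a term's document frequency implied by a max_df setting."""
    if isinstance(max_df, numbers.Integral):
        return max_df           # absolute number of documents
    return max_df * n_docs      # proportion of the corpus


n_docs = 4000  # hypothetical corpus size
for value in (0.75, 1.0, 1):
    print(repr(value), '->', max_doc_count(value, n_docs))
```

With the integer `1`, every term that occurs in more than one document would be dropped, which is rarely what is intended for a bag-of-words model.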

#### GridSearch

In [13]:
# define random search grid with cv
rscv_clf = RandomizedSearchCV(estimator=pipeline, verbose=3,
                              param_distributions=parameters,
                              n_jobs=1, n_iter=5, cv=3, 
                              random_state=42)

# fit model
rscv_clf_mod = rscv_clf.fit(x, y)

# get best score from CV
rscv_clf_mod.best_score_

Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] vect__selected_model=('counter', {'ngram_range': (1, 3), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), clf__selected_model=('multi_nb', {'alpha': 0.1}) 
[CV]  vect__selected_model=('counter', {'ngram_range': (1, 3), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), clf__selected_model=('multi_nb', {'alpha': 0.1}), score=0.9115442278860569, total=  33.4s
[CV] vect__selected_model=('counter', {'ngram_range': (1, 3), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), clf__selected_model=('multi_nb', {'alpha': 0.1}) 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   41.2s remaining:    0.0s


[CV]  vect__selected_model=('counter', {'ngram_range': (1, 3), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), clf__selected_model=('multi_nb', {'alpha': 0.1}), score=0.913728432108027, total=  32.4s
[CV] vect__selected_model=('counter', {'ngram_range': (1, 3), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), clf__selected_model=('multi_nb', {'alpha': 0.1}) 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.4min remaining:    0.0s


[CV]  vect__selected_model=('counter', {'ngram_range': (1, 3), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), clf__selected_model=('multi_nb', {'alpha': 0.1}), score=0.8888888888888888, total=  32.4s
[CV] vect__selected_model=('counter', {'ngram_range': (1, 3), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), clf__selected_model=('multi_nb', {'alpha': 0.25}) 
[CV]  vect__selected_model=('counter', {'ngram_range': (1, 3), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), clf__selected_model=('multi_nb', {'alpha': 0.25}), score=0.9070464767616192, total=  34.3s
[CV] vect__selected_model=('counter', {'ngram_range': (1, 3), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), clf__selected_model=('multi_nb', {'alpha': 0.25}) 
[CV]  vect__selected_model=('counter', {'ngram_range': (1, 3), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), clf__selected_model=('multi_nb', {'alpha': 0.25}), score=0.9062265566391597, total=  32.7s
[CV] vect__selected_mod

[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  8.0min finished


0.9047261815453863
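
To put `n_iter=5` into perspective, the sketch below counts the full grid implied by the parameter lists of this pipeline and draws five candidates from it. It uses Python's `random` module purely for illustration, so the draws will not match the candidates scikit-learn selected above.

```python
import itertools
import random

# vectorizer settings: 2 x 2 x 2 x 4 combinations
vect_options = list(itertools.product(
    [(1, 2), (1, 3)],           # ngram_range
    ['english', None],          # stop_words
    [True, False],              # lowercase
    [0.25, 0.5, 0.75, 1.0]))    # max_df
clf_options = [0.1, 0.25, 0.5]  # multi_nb alpha

full_grid = list(itertools.product(vect_options, clf_options))

rng = random.Random(42)         # seed chosen for illustration only
sampled = rng.sample(full_grid, 5)
coverage = len(sampled) / len(full_grid)
print(len(full_grid), 'candidates in the grid,', '{:.1%}'.format(coverage), 'sampled')
```

Five draws cover only a small share of the 96 combinations, so the best score reported by the search is best read as a lower bound on what this grid can reach.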

In [14]:
# get parameters for best score from CV
rscv_clf_mod.best_params_

{'vect__selected_model': ('counter',
  {'ngram_range': (1, 3),
   'stop_words': None,
   'lowercase': False,
   'max_df': 0.75}),
 'clf__selected_model': ('multi_nb', {'alpha': 0.1})}

In [15]:
# make predictions -> no validation at this stage -> data leakage -> overfitting of model
#rscv_clf_pred = rscv_clf_mod.best_estimator_.predict(x_validation)

# model evaluation
#print(metrics.classification_report(y_validation, rscv_clf_pred, digits=3))

#### Score Log

1. {'clf__selected_model': ('multi_nb', {'alpha': 0.25}),
 'vect__selected_model': ('counter',
  {'lowercase': True, 'ngram_range': (1, 3), 'stop_words': None})}

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| FAKE | 0.96 | 0.89 | 0.92 | 383 |
| REAL | 0.90 | 0.96 | 0.93 | 417 |
| avg / total | 0.93 | 0.93 | 0.93 | 800 |
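
As a quick consistency check on the report for the first entry, the sketch below recomputes the f1-scores and the support-weighted average precision from the logged precision, recall and support values. The inputs are already rounded to two decimals, so this only confirms that the logged numbers agree with each other.

```python
def f1(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# values copied from the first score log entry
report = {'FAKE': {'precision': 0.96, 'recall': 0.89, 'support': 383},
          'REAL': {'precision': 0.90, 'recall': 0.96, 'support': 417}}

f1_scores = {label: round(f1(r['precision'], r['recall']), 2)
             for label, r in report.items()}
total = sum(r['support'] for r in report.values())
weighted_precision = round(
    sum(r['precision'] * r['support'] for r in report.values()) / total, 2)
print(f1_scores, total, weighted_precision)
```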

----------------------------------------------------------------------

2. {'clf__selected_model': ('sgd', {'alpha': 0.1, 'l1_ratio': 0.75, 'loss': 'hinge', 'n_iter': 800}), 'vect__selected_model': ('snowball_stemmer', {'lowercase': True, 'ngram_range': (1, 2), 'stop_words': None})}

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| FAKE | 0.87 | 0.95 | 0.91 | 383 |
| REAL | 0.95 | 0.87 | 0.91 | 417 |
| avg / total | 0.91 | 0.91 | 0.91 | 800 |


----------------------------------------------------------------------
3. {'clf__selected_model': ('svm-ker', {'C': 10000.0, 'kernel': 'rbf'}),
 'vect__selected_model': ('porter_stemmer',
  {'lowercase': True, 'ngram_range': (1, 4), 'stop_words': None})}

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| FAKE | 0.86 | 0.94 | 0.90 | 383 |
| REAL | 0.94 | 0.86 | 0.90 | 417 |
| avg / total | 0.90 | 0.90 | 0.90 | 800 |

----------------------------------------------------------------------
4. svm-kernel (cv=3, n-iter=20): 
- vect__selected_model=('counter', {'ngram_range': (1, 3),
 'stop_words': None, 'lowercase': False}), clf__selected_model=('svm-ker',
  {'kernel': 'rbf', 'C': 1000.0}), score=0.8818011257035647, total= 1.3min

- vect__selected_model=('porter_stemmer', {'ngram_range': (1, 4),
'stop_words': None, 'lowercase': True}), clf__selected_model=('svm-ker',
{'kernel': 'rbf', 'C': 10000.0}), score=0.8958724202626641, total=12.4min


### Model Pipeline 2
Our second pipeline introduces a TF-IDF step to see whether some of our models improve.

#### Variable Selection

In [19]:
# subset for original text
x = df_train.text

#### Label

In [20]:
# save label
y = df_train.label

#### Split training data

In [21]:
# split training data and labels into train and validation 80/20 -> not used due to too much data leakage
#x_train, x_validation, y_train, y_validation = train_test_split(x, y,
#                                                                test_size=0.2, random_state=42)

In [39]:
# define pipeline (vectorizer, models)
pipeline_2 = Pipeline([('vect', PipelineHelper([
                            ('counter', CountVectorizer()),
                            #('snowball_stemmer', SnowballCountVectorizer()),
                            #('porter_stemmer', PorterCountVectorizer()),
                            ('lemmatizer', LemmatizerCountVectorizer()),
                        ])),
                     ('tfidf', TfidfTransformer()),
                     ('clf', PipelineHelper([
                            #('sgd', SGDClassifier()),
                            #('svm-lin', LinearSVC()),
                            #('svm-ker', SVC()),
                            ('multi_nb', MultinomialNB()),
                        ])),
                       ])

#### Parameters
We will extend the model parameters and hope to improve our score.

In [40]:
# define pipeline parameters (generated from pipeline_2, whose 'vect' step includes the lemmatizer)
parameters_2 = {'vect__selected_model': pipeline_2.named_steps['vect'].generate({
                  'counter__ngram_range': [(1, 1), (1, 2), (1, 3)],
                  'counter__stop_words': ('english', None),
                  'counter__lowercase': (True, False),
                  'counter__max_df': (0.5, 0.75, 1.0),  # float = proportion of documents; int 1 would mean "at most 1 document"
                  #'snowball_stemmer__ngram_range': [(1, 1), (1, 2), (1, 3)],
                  #'snowball_stemmer__stop_words': ('english', None),
                  #'snowball_stemmer__lowercase': (True, False),
                  #'snowball_stemmer__max_df': (0.5, 0.75, 1),
                  #'porter_stemmer__ngram_range': [(1, 1), (1, 2), (1, 3)],
                  #'porter_stemmer__stop_words': ('english', None),
                  #'porter_stemmer__lowercase': (True, False),
                  #'porter_stemmer__max_df': (0.5, 0.75, 0. 1),
                  'lemmatizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
                  'lemmatizer__stop_words': ('english', None),
                  'lemmatizer__lowercase': (True, False),
                  'lemmatizer__max_df': (0.5, 0.75, 1.0),  # float = proportion of documents; int 1 would mean "at most 1 document"
                }),
              'tfidf__norm': ('l1', 'l2'),
              'tfidf__use_idf': (True, False),
              'tfidf__smooth_idf': (True, False),
              'clf__selected_model': pipeline_2.named_steps['clf'].generate({
                  #'sgd__alpha': (0.5, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6),
                  #'sgd__loss': ('hinge', 'squared_hinge'),
                  #'sgd__l1_ratio': (0, 0.1, 0.25, 0.5, 0.75, 1.0),
                  'multi_nb__alpha': (0.01, 0.1, 0.25, 0.5)
                })
              }

#### GridSearch

In [41]:
# define random search grid with cv
rscv_clf = RandomizedSearchCV(estimator=pipeline_2, verbose=3,
                              param_distributions=parameters_2,
                              n_jobs=1, n_iter=5, cv=3, 
                              random_state=42)

# fit model
rscv_clf_mod = rscv_clf.fit(x, y)

# get best score from CV
rscv_clf_mod.best_score_

Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] vect__selected_model=('lemmatizer', {'ngram_range': (1, 1), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), tfidf__use_idf=False, tfidf__smooth_idf=False, tfidf__norm=l1, clf__selected_model=('multi_nb', {'alpha': 0.1}) 
[CV]  vect__selected_model=('lemmatizer', {'ngram_range': (1, 1), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), tfidf__use_idf=False, tfidf__smooth_idf=False, tfidf__norm=l1, clf__selected_model=('multi_nb', {'alpha': 0.1}), score=0.8185907046476761, total=  30.1s
[CV] vect__selected_model=('lemmatizer', {'ngram_range': (1, 1), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), tfidf__use_idf=False, tfidf__smooth_idf=False, tfidf__norm=l1, clf__selected_model=('multi_nb', {'alpha': 0.1}) 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   40.6s remaining:    0.0s


[CV]  vect__selected_model=('lemmatizer', {'ngram_range': (1, 1), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), tfidf__use_idf=False, tfidf__smooth_idf=False, tfidf__norm=l1, clf__selected_model=('multi_nb', {'alpha': 0.1}), score=0.8304576144036009, total=  26.2s
[CV] vect__selected_model=('lemmatizer', {'ngram_range': (1, 1), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), tfidf__use_idf=False, tfidf__smooth_idf=False, tfidf__norm=l1, clf__selected_model=('multi_nb', {'alpha': 0.1}) 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.3min remaining:    0.0s


[CV]  vect__selected_model=('lemmatizer', {'ngram_range': (1, 1), 'stop_words': None, 'lowercase': False, 'max_df': 0.75}), tfidf__use_idf=False, tfidf__smooth_idf=False, tfidf__norm=l1, clf__selected_model=('multi_nb', {'alpha': 0.1}), score=0.8108108108108109, total=  25.7s
[CV] vect__selected_model=('lemmatizer', {'ngram_range': (1, 1), 'stop_words': 'english', 'lowercase': True, 'max_df': 0.5}), tfidf__use_idf=False, tfidf__smooth_idf=True, tfidf__norm=l1, clf__selected_model=('multi_nb', {'alpha': 0.25}) 
[CV]  vect__selected_model=('lemmatizer', {'ngram_range': (1, 1), 'stop_words': 'english', 'lowercase': True, 'max_df': 0.5}), tfidf__use_idf=False, tfidf__smooth_idf=True, tfidf__norm=l1, clf__selected_model=('multi_nb', {'alpha': 0.25}), score=0.8298350824587706, total=  17.0s
[CV] vect__selected_model=('lemmatizer', {'ngram_range': (1, 1), 'stop_words': 'english', 'lowercase': True, 'max_df': 0.5}), tfidf__use_idf=False, tfidf__smooth_idf=True, tfidf__norm=l1, clf__selected_mo

[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  8.0min finished


0.8617154288572143

In [42]:
# get parameters for best score from CV
rscv_clf_mod.best_params_

{'vect__selected_model': ('counter',
  {'ngram_range': (1, 1),
   'stop_words': None,
   'lowercase': False,
   'max_df': 0.75}),
 'tfidf__use_idf': True,
 'tfidf__smooth_idf': False,
 'tfidf__norm': 'l2',
 'clf__selected_model': ('multi_nb', {'alpha': 0.25})}

In [None]:
# make predictions
#rscv_clf_pred = rscv_clf_mod.best_estimator_.predict(x_validation)

# model evaluation
#print(metrics.classification_report(y_validation, rscv_clf_pred))

#### Score Log

- SGD / multi_nb randomsearchgrid with tf-idf (cv=3, n-iter=20):
- vect__selected_model=('porter_stemmer', {'ngram_range': (1, 4), 'stop_words': None, 'lowercase': True}), tf-idf__use_idf=False, clf__selected_model=('sgd', {'alpha': 0.0001, 'loss': 'hinge', 'l1_ratio': 0.75}), score=0.9090056285178236, total=11.9min


## Conclusion
After trying a handful of different bag-of-words approaches and model parameters, we can conclude that the following two models performed best, each noticeably more precise on one class than on the other. Hopefully this sets us up for an improved ensemble model in the last section.

1. Fake news classifier:
    - multi_nb (precision: Fake 96% / Real 90%)
<br>
2. Real news classifier:
    - SGD (precision: Real 95% / Fake 87%)