# Fake News Assignment
**Authors**: Vilhelm Stiernstedt & Sharon Marín Salazar
<br>
**Date**: 20/05/2018

### Description
Binary classification of news reports (documents) into two classes, FAKE and REAL. We try text classifiers such as Naive Bayes, MaxEnt and SVM, using NLTK and scikit-learn for NLP pre-processing, classification and cross-validated evaluation.

#### Dataset
**fake_or_real_news_training:**
- ID: ID of the tweet
- Title: Title of the news report
- Text: Textual content of the news report
- Label: Target Variable [FAKE, REAL]
- X1, X2 additional fields

**fake_or_real_news_test:**
- ID, title and text
- Predict Label

#### Advice
- Take a look at the data
- Try the pre-processing methodologies we have seen in class
- TF-IDF seems to be better (but try it!)
- N-grams are worth the effort
- Less than 90-92%? -> Try again
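
As a reminder of what TF-IDF weighting does compared with raw counts, here is a small dependency-free sketch. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is the default documented for scikit-learn's `TfidfTransformer`; the three toy documents and the `tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        # term count times smoothed idf
        raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        # L2-normalise so long and short documents are comparable
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

docs = [['trump', 'rally', 'video'], ['trump', 'debate'], ['iran', 'deal', 'trump']]
weights = tfidf(docs)
print(weights[0])
```

A term that occurs in every document (`trump`) ends up with a lower weight than a term that occurs in only one (`rally`), which is the effect the advice above is hinting at.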

#### Plan
1. Variable analysis
    - Features
    - Other insight
2. Data Processing
    - Drop features
    - Label
3. Modelling
    - Naive Bayes
    - MaxEnt
    - SVM
4. Evaluation

## Import Libraries

In [47]:
import collections
import matplotlib.pyplot as plt
import nltk
from nltk import ngrams
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.classify import MaxentClassifier
import numpy as np
import pandas as pd
import seaborn as sns
import re
import PipelineHelper # https://github.com/bmurauer/pipelinehelper/blob/master/pipelinehelper.py
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from wordcloud import WordCloud, STOPWORDS
import warnings

# download required nltk packages (NB. commented out)
# nltk.download()

# plot settings
%matplotlib inline

# pandas view settings -> see all contents of column
pd.set_option('display.max_colwidth', -1)

# Warning settings -> suppress deprecation warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

## Function Definitions
### Analysis 

In [2]:
# define stop words - english
stop_words = set(stopwords.words('english'))

# define lemmatizer for simple analysis 
wordnet_lemmatizer = WordNetLemmatizer()

# create normalization function for analysis of title and text
def normalizer(text):
    clean_text = re.sub('[^\x00-\x7F]+', "", text) # remove non-ascii characters
    clean_text = re.sub('(\r)+', "",  clean_text) # remove carriage returns
    clean_text = re.sub(r'@([A-Za-z0-9_]+)', "",  clean_text) # remove twitter handles
    clean_text = re.sub(r"(https|http)\S+", "",  clean_text) # remove hyperlinks
    clean_text = re.sub("[^a-zA-Z]", " ", clean_text) # replace everything but letters with spaces
    tokens = nltk.word_tokenize(clean_text)[2:] # tokenize words (NB. the [2:] slice discards the first two tokens)
    lower_case = [l.lower() for l in tokens] # convert to lowercase
    filtered_result = list(filter(lambda l: l not in stop_words, lower_case)) # filter stopwords
    lemmas = [wordnet_lemmatizer.lemmatize(t) for t in filtered_result] # lemmatize words
    return lemmas

# define function to construct our ngrams for analysis of title and text
# (NB. this definition shadows the ngrams function imported from nltk above)
def ngrams(input_list):
    bigrams = [' '.join(t) for t in list(zip(input_list, input_list[1:]))]
    trigrams = [' '.join(t) for t in list(zip(input_list, input_list[1:], input_list[2:]))]
    # NB. this zips positions i, i+1 and i+3 -> three-token skip-grams, not contiguous four-grams
    quadgrams = [' '.join(t) for t in list(zip(input_list, input_list[1:], input_list[3:]))]
    return bigrams+trigrams+quadgrams

# define function to count words for analysis of ngrams (bi, tri, quad) for title and text
def count_words(input):
    cnt = collections.Counter()
    for row in input:
        for word in row:
            cnt[word] += 1
    return cnt

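# exclamation counter for analysis of text (potentially introduce as new feature)
def exclaimation_counter(article):
    nr_abs = article.count('!') # absolute number of exclamation marks
    text_len = len(article) # article length in characters
    nr_rel = nr_abs/text_len if text_len else 0.0 # avoid division by zero for empty text
    return nr_rel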

### Text Processing
#### Stemmers

In [3]:
# define count vectorizer for modelling (different parameter inputs will be given in modelling)
count_vectorizer = CountVectorizer()

# define Snowball stemmer (different parameter inputs will be given in modelling)
snowball_stemmer = SnowballStemmer("english")

# define new vectorizer function with snowball stemmer
class SnowballCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(SnowballCountVectorizer, self).build_analyzer()
        return lambda doc: ([snowball_stemmer.stem(w) for w in analyzer(doc)])
    
# define Porter stemmer with NLTK extensions
# (different parameter inputs will be given in modelling)
porter_stemmer = PorterStemmer(mode='NLTK_EXTENSIONS')

# define new vectorizer function with porter stemmer
class PorterCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(PorterCountVectorizer, self).build_analyzer()
        return lambda doc: ([porter_stemmer.stem(w) for w in analyzer(doc)])

### Pipeline 

In [4]:
# function to build pipeline with multiple models
# https://github.com/bmurauer/pipelinehelper/blob/master/pipelinehelper.py

from sklearn.base import TransformerMixin, BaseEstimator, ClassifierMixin
from collections import defaultdict
import itertools

class PipelineHelper(BaseEstimator, TransformerMixin, ClassifierMixin):

    def __init__(self, available_models=None, selected_model=None, include_bypass=False):
        self.include_bypass = include_bypass
        self.selected_model = selected_model
        # this is required for the clone operator used in gridsearch
        if type(available_models) == dict:
            self.available_models = available_models
        # this is the case for constructing the helper initially
        else:
            # a string identifier is required for assigning parameters
            self.available_models = {}
            for (key, model) in available_models:
                self.available_models[key] = model

    def generate(self, param_dict={}):
        per_model_parameters = defaultdict(lambda: defaultdict(list))

        # collect parameters for each specified model
        for k, values in param_dict.items():
            model_name = k.split('__')[0]
            param_name = k[len(model_name)+2:]  # might be nested
            if model_name not in self.available_models:
                raise Exception('no such model: {0}'.format(model_name))
            per_model_parameters[model_name][param_name] = values

        ret = []

        # create instance for cartesian product of all available parameters for each model
        for model_name, param_dict in per_model_parameters.items():
            parameter_sets = (dict(zip(param_dict, x)) for x in itertools.product(*param_dict.values()))
            for parameters in parameter_sets:
                ret.append((model_name, parameters))

        # for every model that has no specified parameters, add the default model
        for model_name in self.available_models.keys():
            if model_name not in per_model_parameters:
                ret.append((model_name, dict()))

        # check if the stage is to be bypassed as one configuration
        if self.include_bypass:
            ret.append((None, dict(), True))
        return ret

    def get_params(self, deep=False):
        return {'available_models': self.available_models,
                'selected_model': self.selected_model,
                'include_bypass': self.include_bypass}

    def set_params(self, selected_model, available_models=None, include_bypass=False):
        include_bypass = len(selected_model) == 3 and selected_model[2]

        if available_models:
            self.available_models = available_models

        if selected_model[0] is None and include_bypass:
            self.selected_model = None
            self.include_bypass = True
        else:
            if selected_model[0] not in self.available_models:
                raise Exception('no such model available: {0}'.format(selected_model[0]))
            self.selected_model = self.available_models[selected_model[0]]
            self.selected_model.set_params(**selected_model[1])

    def fit(self, X, y=None):
        if self.selected_model is None and not self.include_bypass:
            raise Exception('no model was set')
        elif self.selected_model is None:
            # print('bypassing model for fitting, returning self')
            return self
        else:
            # print('using model for fitting: ', self.selected_model.__class__.__name__)
            return self.selected_model.fit(X, y)

    def transform(self, X, y=None):
        if self.selected_model is None and not self.include_bypass:
            raise Exception('no model was set')
        elif self.selected_model is None:
            # print('bypassing model for transforming:')
            # print(X[:10])
            return X
        else:
            # print('using model for transforming: ', self.selected_model.__class__.__name__)
            return self.selected_model.transform(X)

    def predict(self, x):
        if self.include_bypass:
            raise Exception('bypassing classifier is not allowed')
        if self.selected_model is None:
            raise Exception('no model was set')
        return self.selected_model.predict(x)
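
The core of `generate` is the Cartesian product over each model's parameter lists. The following standalone sketch shows that step in isolation; `expand_grid` and the example grid are hypothetical and not part of the pipelinehelper module.

```python
import itertools

def expand_grid(param_dict):
    """Expand {name: [values]} into one dict per parameter combination."""
    keys = list(param_dict)
    return [dict(zip(keys, values))
            for values in itertools.product(*param_dict.values())]

# hypothetical grid for a single model
grid = {'alpha': [0.1, 1.0], 'fit_prior': [True, False]}
combinations = expand_grid(grid)
for params in combinations:
    print(params)
```

Two values for each of two parameters give four combinations, so the number of candidate configurations grows multiplicatively with every parameter list added to the grid.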


## Import Data

In [5]:
# set path to data
data_path = 'data/'

# load test and train
df_train = pd.read_csv(data_path+'fake_or_real_news_training.csv')
df_test = pd.read_csv(data_path+'fake_or_real_news_test.csv')

# set index
df_train.set_index('ID', inplace=True)
df_test.set_index('ID', inplace=True)

# define combined df
all_data = pd.concat([df_train, df_test])

## Inspect Data

### Structure and features

In [6]:
# check dimension of training data
df_train.shape

(3999, 5)

In [7]:
# check dimension of test data -> more than one column difference!
df_test.shape

(2321, 2)

In [8]:
# check column names and dtypes for training data
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3999 entries, 8476 to 9673
Data columns (total 5 columns):
title    3999 non-null object
text     3999 non-null object
label    3999 non-null object
X1       33 non-null object
X2       2 non-null object
dtypes: object(5)
memory usage: 187.5+ KB


In [9]:
# check column names and dtypes for test data
# -> X1 and X2 not in testset ... -> need manipulation to be used for modelling
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2321 entries, 10498 to 4330
Data columns (total 2 columns):
title    2321 non-null object
text     2321 non-null object
dtypes: object(2)
memory usage: 54.4+ KB


In [10]:
# check df_train -> text lengthy ...
# df_train.head(1)

In [11]:
# check df_test -> similar structure as train -> good.
df_test.head(1)

| ID | title | text |
|---|---|---|
| 10498 | September New Homes Sales Rise——-Back To 1992 Level! | September New Homes Sales Rise Back To 1992 Level! By David Stockman. Posted On Wednesday, October 26th, 2016 \n\nDavid Stockman's Contra Corner is the only place where mainstream delusions and cant about the Warfare State, the Bailout State, Bubble Finance and Beltway Banditry are ripped, refuted and rebuked. Subscribe now to receive David Stockman’s latest posts by email each day as well as his model portfolio, Lee Adler’s Daily Data Dive and David’s personally curated insights and analysis from leading contrarian thinkers. |


### Missing values

In [12]:
# missing values for training data -> X1 and X2 almost 100% NaNs -> probably drop values
df_train.isnull().sum()

title    0   
text     0   
label    0   
X1       3966
X2       3997
dtype: int64

In [13]:
# missing values for test data -> none
df_test.isnull().sum()

title    0
text     0
dtype: int64

## Variable Analysis

### X1

In [14]:
# general overview -> only 33 non-null values with 4 unique values, REAL being the most frequent
df_train.X1.describe()

count     33  
unique    4   
top       REAL
freq      17  
Name: X1, dtype: object

In [15]:
# view rows where X1 != NaN
# value counts -> it seems the variable holds data that has been misaligned
# let's look at individual rows and see whether the values belong to other columns
# -> a long title has been split across both fields (title and text), shifting everything else rightwards
# df_train[df_train.X1.notnull()]

In [16]:
# title merge test -> works!
# df_train.loc[599]['title'] + '' + df_train.loc[599]['text']

In [17]:
# shift all row fields left where X1 != NaN
# (uses .loc[row, column] so the assignment writes to df_train itself rather than to a copy)
for row_id in df_train[df_train.X1.notnull()].index:
    # title will be a concatenation of title and text
    df_train.loc[row_id, 'title'] = df_train.loc[row_id, 'title'] + '' + df_train.loc[row_id, 'text']
    # text will be current label
    df_train.loc[row_id, 'text'] = df_train.loc[row_id, 'label']
    # label will be current X1
    df_train.loc[row_id, 'label'] = df_train.loc[row_id, 'X1']

In [18]:
# preview X1 again -> looks good
# df_train[df_train.X1.notnull()]

#### notes
After the shift, X1 no longer contains any useful data, so we can remove it completely.
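
The left shift applied to the misaligned rows can be illustrated on a single row represented as a plain dict. The row contents and the `shift_left` helper below are hypothetical and only mirror the three assignments made in the loop above.

```python
def shift_left(row, extra_col):
    """Return a copy of row with title/text/label shifted one field left."""
    fixed = dict(row)
    fixed['title'] = row['title'] + row['text']   # re-join the split title
    fixed['text'] = row['label']                  # article body was stored in label
    fixed['label'] = row[extra_col]               # true label was stored in the extra column
    return fixed

# hypothetical misaligned row
row = {'title': 'A long title, ',
       'text': 'second half of the title',
       'label': 'Body of the article ...',
       'X1': 'REAL'}
fixed = shift_left(row, 'X1')
print(fixed['title'])
print(fixed['label'])
```

Working on a copy keeps the original row intact, which makes it easy to compare the row before and after the shift.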

### X2

In [19]:
# general overview -> only 2 non-null values, both unique (one of them FAKE)
df_train.X2.describe()

count     2   
unique    2   
top       FAKE
freq      1   
Name: X2, dtype: object

In [20]:
# view rows where X2 != NaN
# it seems the variable holds misaligned data, just like X1
# after our last shift, we only need to merge title and text, move label to text, and X2 to label
# df_train[df_train.X2.notnull()]

In [21]:
# shift all row fields left where X2 != NaN
# (uses .loc[row, column] so the assignment writes to df_train itself rather than to a copy)
for row_id in df_train[df_train.X2.notnull()].index:
    # title will be a concatenation of title and text
    df_train.loc[row_id, 'title'] = df_train.loc[row_id, 'title'] + '' + df_train.loc[row_id, 'text']
    # text will be current label
    df_train.loc[row_id, 'text'] = df_train.loc[row_id, 'label']
    # label will be current X2
    df_train.loc[row_id, 'label'] = df_train.loc[row_id, 'X2']

In [22]:
# preview X2 again -> looks good
# df_train[df_train.X2.notnull()]

#### notes
After the shift, X2 no longer contains any useful data, so we can remove it completely.

### label

In [23]:
# most important is that labels are correct -> yes, only FAKE and REAL!
# also an almost equal split of fake and real articles
df_train.label.value_counts()

REAL    2008
FAKE    1991
Name: label, dtype: int64

### title

In [24]:
# general overview -> mostly unique titles; the most frequent title appears 4 times, which could suggest duplicates exist
df_train.title.describe()

count     3999                         
unique    3968                         
top       OnPolitics | 's politics blog
freq      4                            
Name: title, dtype: object

#### wordcloud - Fake news
Most common words used in fake news titles.

In [None]:
# subset fake news
df_cloud = df_train[df_train['label']=='FAKE']

# join all titles into one string
words = ' '.join(df_cloud['title'])

# normalize whitespace between words
split_words = " ".join([word for word in words.split()])

# create wordcloud based on word frequency
wordcloud = WordCloud(stopwords=STOPWORDS, # remove stopwords
                      background_color='white',
                      width=3000,
                      height=2500
                     ).generate(split_words)

# plot wordcloud
plt.figure(1, figsize=(10, 10))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

#### wordcloud - Real news
Most common words used in real news titles.

In [None]:
# subset real news
df_cloud = df_train[df_train['label']=='REAL']

# join all titles into one string
words = ' '.join(df_cloud['title'])

# normalize whitespace between words
split_words = " ".join([word for word in words.split()])

# create wordcloud based on word frequency
wordcloud = WordCloud(stopwords=STOPWORDS, # remove stopwords
                      background_color='white',
                      width=3000,
                      height=2500
                     ).generate(split_words)

# plot wordcloud
plt.figure(1, figsize=(10, 10))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

#### normalize text
Let's try to structure the titles and remove all elements that are not useful, to allow a better analysis. The steps are:
- Remove all but letters
- Tokenize words
- Convert to lowercase
- Filter stopwords
- Lemmatize words

In [25]:
# create new feature -> apply function on text
df_train['title_normalized'] = df_train.title.apply(normalizer)

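# view results -> some leading words fall off (e.g. 'Kerry', 'Bernie supporters'); this is consistent with
# the [2:] slice in normalizer dropping the first two tokens rather than with the lemmatizer.
# We will still try other methods for our modelling such as snowball and porter stemmers.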
df_train[['title','title_normalized', 'label']].head()

| ID | title | title_normalized | label |
|---|---|---|---|
| 8476 | You Can Smell Hillary’s Fear | [smell, hillary, fear] | FAKE |
| 10294 | Watch The Exact Moment Paul Ryan Committed Political Suicide At A Trump Rally (VIDEO) | [exact, moment, paul, ryan, committed, political, suicide, trump, rally, video] | FAKE |
| 3608 | Kerry to go to Paris in gesture of sympathy | [go, paris, gesture, sympathy] | REAL |
| 10142 | Bernie supporters on Twitter erupt in anger against the DNC: 'We tried to warn you!' | [twitter, erupt, anger, dnc, tried, warn] | FAKE |
| 875 | The Battle of New York: Why This Primary Matters | [new, york, primary, matter] | REAL |
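
The missing leading words in the output above (for example `Kerry`) match what a `[2:]` slice on the token list would produce. The sketch below reproduces the effect without NLTK; `simple_normalize` is a hypothetical stand-in for `normalizer`, it uses a tiny illustrative stop word set instead of NLTK's English list, and it skips lemmatization.

```python
import re

# illustrative subset only; the notebook uses NLTK's English stop words
STOP_WORDS = {'to', 'in', 'of', 'the', 'on', 'a', 'at'}

def simple_normalize(text, skip=0):
    """Keep letters only, split on whitespace, drop `skip` leading tokens,
    lowercase and filter stop words."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    tokens = letters_only.split()[skip:]
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

title = 'Kerry to go to Paris in gesture of sympathy'
with_slice = simple_normalize(title, skip=2)
without_slice = simple_normalize(title, skip=0)
print(with_slice)
print(without_slice)
```

With `skip=2` the result matches the row for ID 3608 above, while `skip=0` keeps `kerry`, which suggests the slice rather than the lemmatizer is responsible for the lost words.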


#### ngrams
As words mostly matter in context, we'll look at bi-, tri- and quad-grams instead of just individual tokens.

In [26]:
# create new feature -> apply function on normalized_text
df_train['title_grams'] = df_train.title_normalized.apply(ngrams)

# view results
df_train[['title_grams']].head()

| ID | title_grams |
|---|---|
| 8476 | [smell hillary, hillary fear, smell hillary fear] |
| 10294 | [exact moment, moment paul, paul ryan, ryan committed, committed political, political suicide, suicide trump, trump rally, rally video, exact moment paul, moment paul ryan, paul ryan committed, ryan committed political, committed political suicide, political suicide trump, suicide trump rally, trump rally video, exact moment ryan, moment paul committed, paul ryan political, ryan committed suicide, committed political trump, political suicide rally, suicide trump video] |
| 3608 | [go paris, paris gesture, gesture sympathy, go paris gesture, paris gesture sympathy, go paris sympathy] |
| 10142 | [twitter erupt, erupt anger, anger dnc, dnc tried, tried warn, twitter erupt anger, erupt anger dnc, anger dnc tried, dnc tried warn, twitter erupt dnc, erupt anger tried, anger dnc warn] |
| 875 | [new york, york primary, primary matter, new york primary, york primary matter, new york matter] |
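
The last group in each row above (for example `exact moment ryan`) skips a token, because the `ngrams` helper zips positions i, i+1 and i+3 for its third list. For comparison, here is a minimal sketch of strictly contiguous n-grams; `contiguous_ngrams` is a hypothetical helper and not part of the notebook.

```python
def contiguous_ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['exact', 'moment', 'paul', 'ryan', 'committed']
bigrams = contiguous_ngrams(tokens, 2)
fourgrams = contiguous_ngrams(tokens, 4)
print(bigrams)
print(fourgrams)
```

Lists shorter than `n` simply yield no n-grams, so short titles need no special handling.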


#### frequency count - Fake news

In [27]:
# show 20 most common ngrams for fake news
df_train[(df_train.label == 'FAKE')][['title_grams']].apply(count_words)['title_grams'].most_common(20)

[('hillary clinton', 51),
 ('donald trump', 32),
 ('onion america', 23),
 ('america finest', 23),
 ('finest news', 23),
 ('news source', 23),
 ('onion america finest', 23),
 ('america finest news', 23),
 ('finest news source', 23),
 ('onion america news', 23),
 ('america finest source', 23),
 ('clinton email', 18),
 ('world war', 13),
 ('email investigation', 12),
 ('clinton foundation', 12),
 ('standing rock', 11),
 ('u election', 11),
 ('hillary campaign', 9),
 ('fbi director', 8),
 ('trump supporter', 8)]

#### frequency count - Real news

In [28]:
# show 20 most common ngrams for real news
df_train[(df_train.label == 'REAL')][['title_grams']].apply(count_words)['title_grams'].most_common(20)

[('donald trump', 55),
 ('hillary clinton', 36),
 ('white house', 22),
 ('supreme court', 17),
 ('bernie sander', 14),
 ('gop debate', 14),
 ('iran deal', 12),
 ('nuclear deal', 12),
 ('new hampshire', 12),
 ('fox news', 12),
 ('islamic state', 12),
 ('trump clinton', 11),
 ('trade deal', 11),
 ('foreign policy', 11),
 ('new york', 10),
 ('ted cruz', 10),
 ('presidential debate', 10),
 ('jeb bush', 9),
 ('climate change', 9),
 ('iran nuclear', 8)]

#### Summary
Both Hillary Clinton and Donald Trump top the lists for real and fake news alike. However, Trump appears more often in real news titles and Clinton more often in fake news titles.

*Noteworthy words for fake news:*
- onion (probably ref to satire news)
- finest 
- email (clinton scandal)

*Noteworthy words for real news:*
- deal (iran nuclear deal)
- gop
- islamic state

### text

In [29]:
# general overview -> the most frequent text appears 41 times, while the most frequent title appears only 4 times.
# the same news article seems to be reused and published with new titles
# perhaps could we
df_train.text.describe()

count     3999                                                                                                                 
unique    3839                                                                                                                 
top       Killing Obama administration rules, dismantling Obamacare and pushing through tax reform are on the early to-do list.
freq      41                                                                                                                   
Name: text, dtype: object

#### normalize text
Let's try to structure the text and remove all elements that are not useful, to allow a better analysis. The steps are:
- Remove all but letters
- Tokenize words
- Convert to lowercase
- Filter stopwords
- Lemmatize words

In [30]:
# create new feature -> apply function on text
df_train['text_normalized'] = df_train.text.apply(normalizer)

# view results
# df_train[['text','text_normalized', 'label']].head(1)

#### ngrams
As words mostly matter in context, we'll look at bi-, tri- and quad-grams instead of just individual tokens.

In [31]:
# create new feature -> apply function on normalized_text
df_train['text_grams'] = df_train.text_normalized.apply(ngrams)

# view reuslts
# df_train[['text_grams']].head(1)

#### frequency count - Fake news

In [32]:
# show 20 most common ngrams for fake news
df_train[(df_train.label == 'FAKE')][['text_grams']].apply(count_words)['text_grams'].most_common(20)

[('hillary clinton', 1475),
 ('donald trump', 1142),
 ('united state', 836),
 ('new york', 437),
 ('white house', 426),
 ('clinton campaign', 331),
 ('year old', 277),
 ('bill clinton', 256),
 ('presidential election', 246),
 ('clinton foundation', 237),
 ('american people', 236),
 ('secretary state', 235),
 ('wall street', 231),
 ('year ago', 195),
 ('foreign policy', 190),
 ('democratic party', 180),
 ('law enforcement', 179),
 ('look like', 176),
 ('human right', 175),
 ('election day', 169)]

#### frequency count - Real news

In [33]:
# show 20 most common ngrams for real news
df_train[(df_train.label == 'REAL')][['text_grams']].apply(count_words)['text_grams'].most_common(20)

[('donald trump', 1177),
 ('hillary clinton', 1112),
 ('united state', 1050),
 ('white house', 1023),
 ('new york', 976),
 ('fox news', 647),
 ('new hampshire', 630),
 ('islamic state', 521),
 ('president obama', 518),
 ('trump said', 496),
 ('secretary state', 466),
 ('supreme court', 460),
 ('ted cruz', 424),
 ('last week', 423),
 ('bernie sander', 415),
 ('foreign policy', 404),
 ('presidential candidate', 392),
 ('republican party', 390),
 ('barack obama', 388),
 ('south carolina', 350)]

#### Summary
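Repeating the n-gram breakdown on the article text gives a similar picture to the titles: Hillary Clinton leads in fake news and Donald Trump in real news, although the gap between the two is narrower than in the titles.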

*Noteworthy words for fake news:*
- wall street
- bill clinton
- year old

*Noteworthy words for real news:*
- fox news
- president obama
- new hampshire

### Further Analysis

### Exclamation counting
We want to inspect whether fake news uses more exclamation marks than real news. To do so, we create a new feature that contains the number of exclamation marks in each article relative to the length of the article in characters. We compute this feature for the main text body.

In [34]:
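# relative exclamation mark count per article (exclamation marks / characters)
df_train['ex_rel_count'] = df_train.text.apply(exclaimation_counter)
# bin into up to 20 quantiles; duplicate bin edges are dropped, which leaves 5 bins here
df_train['ex_rel_count'] = pd.qcut(df_train['ex_rel_count'], 20, duplicates='drop')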

In [35]:
df_train.groupby(['label', 'ex_rel_count']).ex_rel_count.count().unstack()

| label \ ex_rel_count | (-0.001, 8.81e-05] | (8.81e-05, 0.000227] | (0.000227, 0.000464] | (0.000464, 0.000959] | (0.000959, 0.0361] |
|---|---|---|---|---|---|
| FAKE | 1430 | 85 | 132 | 159 | 185 |
| REAL | 1769 | 115 | 68 | 41 | 15 |
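
To compare the two classes, the counts above can be converted to row percentages. The sketch below uses the counts copied from the table; `row_shares` is a hypothetical helper.

```python
# counts per ex_rel_count bin (lowest to highest), copied from the table above
counts = {
    'FAKE': [1430, 85, 132, 159, 185],
    'REAL': [1769, 115, 68, 41, 15],
}

def row_shares(row):
    """Convert a list of counts into percentages of the row total."""
    total = sum(row)
    return [100 * c / total for c in row]

shares = {label: row_shares(row) for label, row in counts.items()}
for label, row in shares.items():
    print(label, sum(counts[label]), [round(s, 1) for s in row])
```

The highest bin holds a clearly larger share of FAKE articles than of REAL ones, which supports keeping the relative exclamation count as a candidate feature.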


## Baseline Model
Before we get into complex modelling, let's establish a baseline for our different models. We will only use a simple count vectorizer.
- SVM
- Naive Bayes - Multinomial
- MaxEntropy (not working atm)

### Model Preprocessing

#### Missing Values

In [36]:
df_train.isnull().sum()

title               0   
text                0   
label               0   
X1                  3966
X2                  3997
title_normalized    0   
title_grams         0   
text_normalized     0   
text_grams          0   
ex_rel_count        0   
dtype: int64

#### Variable Selection

In [40]:
df_model = df_train.text

#### Label

In [41]:
# save label
label = df_train.label

#### CountVectorizer

In [42]:
# define vectorizer without text modifications
count_vectorizer = CountVectorizer(lowercase=False)

# apply vector on text in df_train
vectorized_data = count_vectorizer.fit_transform(df_model)

#### Split training data

In [43]:
# split training data and labels into train and validation 80/20
x_train, x_validation, y_train, y_validation = train_test_split(vectorized_data, label,
                                                                test_size=0.2, random_state=42)

### Models

#### SVM

In [48]:
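# define svm classifier (linear SVM trained with SGD)
# max_iter/tol replace the n_iter argument, which newer scikit-learn versions no longer accept
clf_svm = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                        max_iter=5, tol=None, random_state=42)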

# fit model 
clf_svm_output = clf_svm.fit(x_train, y_train)

# make predictions
clf_svm_pred = clf_svm_output.predict(x_validation)

# model evaluation
print(metrics.classification_report(y_validation, clf_svm_pred))

             precision    recall  f1-score   support

       FAKE       0.94      0.84      0.89       383
       REAL       0.87      0.95      0.91       417

avg / total       0.90      0.90      0.90       800
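
As a sanity check on how the report's columns relate, the F1 scores can be recomputed from the precision and recall values shown above. Because those inputs are already rounded to two decimals, the result is only approximate; `f1_from` is a hypothetical helper.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) copied from the SVM report above
report = {'FAKE': (0.94, 0.84, 383), 'REAL': (0.87, 0.95, 417)}

f1 = {label: f1_from(p, r) for label, (p, r, _) in report.items()}
total_support = sum(s for _, _, s in report.values())
# support-weighted average, as in the "avg / total" row
weighted_f1 = sum(f1[label] * s for label, (_, _, s) in report.items()) / total_support

for label, value in f1.items():
    print(label, round(value, 2))
print('weighted', round(weighted_f1, 2))
```

The recomputed values agree with the f1-score column after rounding, and the weighted figure shows that the last row is an average weighted by each class's support.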



####  Naive Bayes - Multinomial

In [50]:
# define nb classifier
clf_nb = MultinomialNB()

# fit model 
clf_nb_output = clf_nb.fit(x_train, y_train)

# make predictions
clf_nb_pred = clf_nb_output.predict(x_validation)

# model evaluation
print(metrics.classification_report(y_validation, clf_nb_pred))

             precision    recall  f1-score   support

       FAKE       0.91      0.86      0.89       383
       REAL       0.88      0.92      0.90       417

avg / total       0.89      0.89      0.89       800



#### MaxEntropy 
Not working at the moment.

In [59]:
# define maxent classifier (NB. commented out, not working)
#clf_maxent = MaxentClassifier(encoding=x_train, weights=y_train)
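
NLTK's `MaxentClassifier` is normally built through its `train` class method, which takes a list of `(featureset, label)` pairs where each featureset is a dict, rather than a sparse matrix and a label vector. Below is a minimal sketch of preparing that input; the two documents and the `to_featureset` helper are made up for illustration, and the commented `train` call has not been run against this dataset.

```python
def to_featureset(tokens):
    """Bag-of-words featureset: one boolean feature per distinct token."""
    return {token: True for token in tokens}

# hypothetical, already normalized documents with their labels
docs = [
    (['clinton', 'email', 'investigation'], 'FAKE'),
    (['supreme', 'court', 'ruling'], 'REAL'),
]

train_toks = [(to_featureset(tokens), label) for tokens, label in docs]
print(train_toks[0])

# With NLTK available, this list could then be passed on, for example:
# classifier = MaxentClassifier.train(train_toks, max_iter=10)
# classifier.classify(to_featureset(['clinton', 'email']))
```

Since a maximum entropy classifier is equivalent to multinomial logistic regression, a logistic regression model on the existing count vectors may be a simpler route if the NLTK interface remains awkward.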

In [None]:
# define SVM classifier
svm_clf = SGDClassifier()

# define multinomial naive bayes classifier
multi_nb_clf = MultinomialNB()

# define gaussian naive bayes classifier
gus_nb_clf = GaussianNB()

# define max entropy classifer -> doesn't seem to be able to be predefined. we will employ later
# maxent_clf = MaxentClassifier()