# Text Classification of Embedded Documents Using Doc2Vec

## Goal: 
The goal of this project is to apply classifiers such as XGBoost to document embeddings of a fake news dataset: 
https://github.com/fakerfact/FakeNewsTutorials/blob/master/data/fake_or_real_news.csv

This notebook studies how the results of various classifiers change when each document is represented by a Doc2Vec paragraph/document vector. The results were surprisingly good, and better than those of previously used methods such as word2vec. 

In [1]:
import numpy as np
import pandas as pd 
import gensim

all_data = pd.read_csv("C:\\Users\\Prajakta\\fake_or_real_news.csv")
all_data.head(5)



Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


Convert the labels from strings to binary integers (FAKE becomes 0, REAL becomes 1)

In [2]:
all_data['label'] = np.where(all_data['label'] == 'FAKE', 0 , 1)
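
As a minimal sketch of what the `np.where` call above does, the snippet below applies the same mapping to a small made-up array (`demo_labels` is hypothetical and not part of the dataset):

```python
import numpy as np

# Hypothetical sample labels, standing in for all_data['label'].
demo_labels = np.array(["FAKE", "REAL", "REAL", "FAKE"])

# Where the condition holds the result is 0, everywhere else it is 1.
encoded = np.where(demo_labels == "FAKE", 0, 1)
print(encoded)  # [0 1 1 0]
```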

Clean the data by removing punctuation, stopwords, etc.

In [3]:
import os
import re
import string

from nltk.corpus import stopwords
from gensim.models.doc2vec import LabeledSentence
from gensim import utils

# Build the stopword set once rather than on every call.
STOPS = set(stopwords.words("english"))

def textClean(text):
    # Replace every character outside the allowed set with a space.
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    # Lowercase, split on whitespace, and drop English stopwords.
    words = [w for w in text.lower().split() if w not in STOPS]
    return " ".join(words)

def cleanup(text):
    text = textClean(text)
    # Strip any remaining ASCII punctuation.
    return text.translate(str.maketrans("", "", string.punctuation))
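
To make the cleaning steps concrete, here is a self-contained sketch of the same pipeline. It assumes a tiny hypothetical stopword set (`DEMO_STOPWORDS`) in place of NLTK's English list so that it runs without downloading a corpus; the regular expression is the one used above.

```python
import re
import string

# Hypothetical stand-in for NLTK's English stopword list.
DEMO_STOPWORDS = {"the", "is", "a", "of", "to", "in"}

def demo_cleanup(text, stops=DEMO_STOPWORDS):
    # Step 1: replace characters outside the allowed set with spaces.
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    # Step 2: lowercase, split on whitespace, drop stopwords.
    words = [w for w in text.lower().split() if w not in stops]
    # Step 3: remove the remaining ASCII punctuation.
    return " ".join(words).translate(str.maketrans("", "", string.punctuation))

print(demo_cleanup("The U.S. Secretary of State is in Paris!"))
# us secretary state paris
print(demo_cleanup("Hillary’s Fear"))
# hillary s fear
```

Note that the curly apostrophe is not in the allowed set, so it becomes a space and splits the word, while ASCII punctuation such as the periods in "U.S." is simply deleted in the last step.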

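Gensim's Doc2Vec expects its training data as an iterable of LabeledSentence objects (named TaggedDocument in later gensim releases), each pairing a document's list of words with a list of tags. More information can be found at: 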
https://radimrehurek.com/gensim/models/doc2vec.html

In [4]:
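def constructLabeledSentences(data):
    # Tag each document as 'Text_<index>' so its vector can be looked up after training.
    sentences = []
    # Series.items() yields (index, value) pairs; iteritems() was removed in newer pandas.
    for index, row in data.items():
        sentences.append(LabeledSentence(utils.to_unicode(row).split(), ['Text_%s' % index]))
    return sentences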
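
The shape of the objects built above can be illustrated without gensim. The sketch below assumes a stand-in named tuple (`DemoTaggedDoc`, a hypothetical name) with the same two fields, `words` and `tags`, that gensim's tagged-document type carries:

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's tagged-document type.
DemoTaggedDoc = namedtuple("DemoTaggedDoc", ["words", "tags"])

def demo_tag_documents(texts):
    # One tagged document per text: a token list plus a unique 'Text_<i>' tag.
    return [DemoTaggedDoc(text.split(), ["Text_%d" % i])
            for i, text in enumerate(texts)]

docs = demo_tag_documents(["us secretary state", "primary day new york"])
print(docs[0].words)  # ['us', 'secretary', 'state']
print(docs[1].tags)   # ['Text_1']
```

The unique tag is what later allows each document's trained vector to be retrieved by name.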


In [5]:
#Applying cleanup to the whole data
allText = all_data['text'].apply(cleanup)

In [6]:
#Construct labeled sentences for the data
#Doc2Vec learns one vector per document; the tags assigned here identify those vectors.
sentences = constructLabeledSentences(allText)

In [7]:
allText.head()

0    daniel greenfield shillman journalism fellow f...
1    google pinterest digg linkedin reddit stumbleu...
2    us secretary state john f kerry said monday st...
3    kaydee king kaydeeking november 9 2016 lesson ...
4    its primary day new york frontrunners hillary ...
Name: text, dtype: object

In [8]:
from gensim.models import Doc2Vec
text_model = Doc2Vec(min_count=1, window=5, size=100, sample=1e-4, negative=5, workers=4, iter=5,seed=1)

In [9]:
text_model.build_vocab(sentences)

In [10]:
text_model.train(sentences, total_examples=text_model.corpus_count, epochs=text_model.iter)

11114237

In [11]:
train_arrays = np.zeros((all_data.shape[0], 100))

In [12]:
train_labels = np.zeros(all_data.shape[0])

The docvecs property of the Doc2Vec model holds the trained vectors for all of the document tags seen during training.
Let's look up these vectors by tag, pair them with the labels from the CSV data, and store both in arrays that can then be split into training and test sets.

In [13]:
for i in range(all_data.shape[0]):
    train_arrays[i] = text_model.docvecs['Text_'+str(i)]
    train_labels[i] = all_data["label"][i]
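
The lookup-by-tag step can be sketched in isolation. The snippet below assumes a plain dictionary (`demo_docvecs`, hypothetical) in place of `text_model.docvecs` and three-dimensional vectors instead of 100-dimensional ones:

```python
import numpy as np

# Hypothetical stand-in for text_model.docvecs: maps a tag to its vector.
demo_docvecs = {
    "Text_0": np.array([0.1, 0.2, 0.3]),
    "Text_1": np.array([0.4, 0.5, 0.6]),
}
demo_labels = [0, 1]

# Stack the vectors row by row in document order and keep labels aligned.
X = np.vstack([demo_docvecs["Text_%d" % i] for i in range(len(demo_labels))])
y = np.asarray(demo_labels, dtype=float)

print(X.shape, y.shape)  # (2, 3) (2,)
```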

Let's split the data into training and test sets for our classifier

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_arrays, train_labels, test_size=0.2, random_state=42)
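
For intuition, a shuffled hold-out split can be written by hand as below. This is a simplified sketch of the idea, not scikit-learn's implementation, and `demo_holdout_split` is a hypothetical helper; it will not reproduce the exact rows that `train_test_split` selects.

```python
import numpy as np

def demo_holdout_split(X, y, test_size=0.2, seed=42):
    # Shuffle row indices reproducibly, then cut off the test portion.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2
X_tr, X_te, y_tr, y_te = demo_holdout_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```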

In [16]:
from xgboost import XGBClassifier

XGmodel = XGBClassifier(max_depth=7, learning_rate=0.2, 
                        n_estimators=1000, silent=True, 
                        objective='binary:logistic', nthread=-1, 
                        gamma=0, min_child_weight=1, max_delta_step=0, 
                        subsample=1, colsample_bytree=1, 
                        colsample_bylevel=1, reg_alpha=0, 
                        reg_lambda=1, scale_pos_weight=1, 
                        base_score=0.5, seed=0, missing=None)

In [17]:
XGmodel.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.2, max_delta_step=0,
       max_depth=7, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=-1, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=0, silent=True,
       subsample=1)

In [18]:
y_pred = XGmodel.predict(X_test)
# score() returns the mean accuracy on the given test data.
XGmodel.score(X_test, y_test)

0.88792423046566693
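
The value above is an accuracy: the fraction of test documents whose predicted label matches the true label. A minimal sketch of that computation, using made-up labels (`demo_accuracy` is a hypothetical helper):

```python
def demo_accuracy(y_true, y_pred):
    # Count positions where prediction and truth agree, then normalise.
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

print(demo_accuracy([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))  # 0.8
```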

In [20]:
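from sklearn.metrics import classification_report

# Note: this cell is numbered In [20], so it ran after the logistic regression
# cell (In [19]), which reassigns y_pred. The report below therefore appears to
# describe the logistic regression predictions rather than the XGBoost ones.
target_names = ['FAKE', 'REAL']

print(classification_report(y_test, y_pred, target_names=target_names))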

             precision    recall  f1-score   support

       FAKE       0.81      0.90      0.85       628
       REAL       0.89      0.79      0.84       639

avg / total       0.85      0.84      0.84      1267
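
The columns of the report are related: F1 is the harmonic mean of precision and recall, and the support-weighted average of the per-class recalls equals the overall accuracy. The sketch below recomputes both from the rounded figures printed above, so the results are approximate (`demo_f1` is a hypothetical helper):

```python
def demo_f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the report above.
f1_fake = demo_f1(0.81, 0.90)
f1_real = demo_f1(0.89, 0.79)
print(round(f1_fake, 2), round(f1_real, 2))  # 0.85 0.84

# Recall times support gives the correctly classified count per class,
# so the support-weighted recall is the accuracy implied by the report.
implied_accuracy = (0.90 * 628 + 0.79 * 639) / (628 + 639)
print(round(implied_accuracy, 2))  # 0.84
```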



Apply logistic regression to the same features to compare its results with the XGBoost classifier above

In [19]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression()
lr_model = lr_model.fit(X_train, y_train)

y_pred = lr_model.predict(X_test)
lr_model.score(X_test, y_test)

0.84372533543804262

The XGBoost classifier reaches an accuracy of almost 89% on the test set, compared with about 84% for logistic regression, a surprisingly large improvement. 