# Final Capstone: Detecting Fake News


### What is the problem you are attempting to solve?

This project detects fake news using NLP and classification models. The data comes from Kaggle: https://www.kaggle.com/ahmedsmara/full-fake-news. The data consists of the following columns:

- title: the title of a news article
- author: the author of the news article
- text: the text of the article
- label: marks the article as potentially real or fake
    - 1: FAKE
    - 0: REAL
    
### How is your solution valuable?
Today, we get news from all over the place. Many people say they get their news from Facebook or an online blog. News is very powerful, so it is important that its integrity is verified and protected, and that we can tell whether the news we get from a given source is fake or real.
    
### What is your data source and how will you access it?
I will use the Kaggle data for now. Time permitting, I may collect articles from The Onion as fake news and articles from other outlets as real news, and run my model on those. For now, I will split the Kaggle data into train and test sets.
    
### What techniques from the course do you anticipate using?
I will use the data cleaning and data visualization techniques learned so far. In addition, I will use NLP techniques to clean and tokenize the text, and several classification models to see how accurately they detect fake news.

### What do you anticipate to be the biggest challenge you’ll face?
- I think data cleaning will be challenging
- The large number of tokens may cause memory problems and long runtimes
- Overfitting and hyperparameter tuning

###### showing all rows and columns 
- pd.set_option('display.max_columns', 500)
- pd.set_option('display.max_rows', 500)

#### check code before submission
http://pep8online.com/


In [1]:
### importing all libraries

import pandas as pd
import numpy as np
import warnings

import scipy.stats as stats

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

from nltk.tokenize import RegexpTokenizer

# Natural Language Processing
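# The stop_words module is deprecated in newer scikit-learn releases; the list itself lives here
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS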
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

# Modeling
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB # Naive Bayes
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import ensemble #boosting
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix
from sklearn.metrics import classification_report

%matplotlib inline 

# These two lines let you show all the columns and rows
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

warnings.filterwarnings('ignore')



In [2]:

df = pd.read_csv('full_dataset.csv') 

In [3]:
df.head(10)

Unnamed: 0,title,author,text,label
0,a news release,federation-american-immigration-reform,Unemployment has been on the rise throughout W...,1
1,"Black Turnout Soft in Early Voting, Boding Ill...",Henry Wolff,"Black Turnout Soft in Early Voting, Boding Ill...",1
2,a television interview,chris-abele,Says Milwaukee County buses are no less safe n...,1
3,10 Things I Learned From Being My Own General ...,Luke Stranahan,One of the foundations of living a good life i...,1
4,a statement responding to Gov. Rick Scott's St...,rod-smith,Says Rick Scotts proposed budget would lay off...,0
5,Rafael Nadal Wins a Marathon to Set Up a Final...,Ben Rothenberg,"MELBOURNE, Australia — Rafael Nadal complet...",0
6,Mexican Politician’s Wife Arrested in Texas fo...,Ildefonso Ortiz,A top politician who at one time served as the...,0
7,Merkel Floats Fake News at Trump Presser: TTIP...,Chris Tomlinson,In the first joint press conference between U....,0
8,a campaign mailer,republican-party-florida,Barack Obama has consistently voted against to...,1
9,"Number Of Accusers Grows To 12, As Former Miss...",Sarah Jones,"By Sarah Jones on Thu, Oct 27th, 2016 at 1:41 ...",1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30938 entries, 0 to 30937
Data columns (total 4 columns):
title     30380 non-null object
author    28981 non-null object
text      30899 non-null object
label     30938 non-null int64
dtypes: int64(1), object(3)
memory usage: 966.9+ KB


In [5]:
df.shape

(30938, 4)

In [6]:
df.label.value_counts()

0    16085
1    14853
Name: label, dtype: int64

In [7]:
### Check for missing values
df.isnull().sum()

title      558
author    1957
text        39
label        0
dtype: int64

In [8]:
## Filling in missing values
df=df.fillna(' ')
df.isnull().sum()


title     0
author    0
text      0
label     0
dtype: int64

In [9]:
print(df.iloc[3].text)

One of the foundations of living a good life in today’s times is having a good place to call home. Whether you want a solid, comfortable place with which to pursue your hobbies, recover from the day’s tribulations, and just to be, or whether you want a bachelor pad for your romantic pursuits, or both; a good home is essential to the modern man.
I purchased a home and I decided to be my own general contractor for the renovations. My home was an as-originally-furnished home of the 1970s, and I brought its multi-color painted, green shag carpeted datedness up to a sharply trimmed, hardwood-floored modernity while being of a somewhat timeless style. A general contractor is a person hired by the architect or engineer to run the job site, source the labor, follow the schedule, get the materials, and execute the vision of the plan. Here are ten things I learned as my own general contractor.
1. There are good contractors, and there are bad contractors 
You will run into both good and bad contr

In [10]:
### 'label' could later clash with the token 'label' in the bag-of-words columns, so rename the target column
df.rename(columns={'label':'news_label'}, inplace=True)

### Get baseline accuracy score

In [11]:
df['news_label'].value_counts(normalize=True)

0    0.519911
1    0.480089
Name: news_label, dtype: float64

 - 1 is fake and 0 is real, so about 48% of the articles are fake and about 52% are real. A model that always predicts the majority class (real) would score about 52% accuracy, so my model should do at least that well.
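
The baseline is the accuracy of always predicting the more common class. A minimal sketch in plain Python, reusing the class counts printed by `value_counts()` above; the helper name `majority_baseline` is made up for illustration:

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, its share of all labels)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Class counts taken from the value_counts() output: 16085 real (0), 14853 fake (1)
labels = [0] * 16085 + [1] * 14853
baseline_label, baseline_accuracy = majority_baseline(labels)
print(baseline_label, round(baseline_accuracy, 4))
```

Any model scoring close to this share has learned little beyond the class balance.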

# Let's try the title and see if it is important
### BoW with title - is title important?

In [12]:
### lowercase the title
df['title'] = df['title'].str.lower()

In [13]:
### Lets see if title is important at all

### try 0.005
### max = .85, 95
### tried with min_df=2 and finally .01 worked

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_df=0.75,  min_df=.01, stop_words = 'english', analyzer = 'word')
X = vectorizer.fit_transform(df["title"])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences_title = pd.concat([bow_df, df[["news_label"]]], axis=1)

In [14]:
sentences_title.shape

(30938, 43)
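
The `min_df` and `max_df` arguments prune the vocabulary by document frequency, which is why so few title terms survive. The sketch below mimics that thresholding in plain Python for float values of both arguments. It is a simplification: tokenization is a whitespace split, and `prune_vocabulary` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=0.01, max_df=0.75):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Both thresholds are proportions of the number of documents, as with
    float values of min_df / max_df in CountVectorizer.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # set() so that a term counts once per document
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if min_df * n_docs <= df <= max_df * n_docs
    )


# Made-up toy titles
toy_titles = [
    "trump wins election",
    "trump speech video",
    "trump election news",
    "trump video report",
]
kept = prune_vocabulary(toy_titles, min_df=0.5, max_df=0.75)
print(kept)
```

In the toy corpus, "trump" is dropped because it appears in every title, and words that appear only once are dropped as too rare. With 30,938 titles and `min_df=.01`, a term has to appear in at least 310 titles to be kept, which leaves the 42 terms seen in the shape above (43 columns including `news_label`).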

In [15]:
sentences_title.head()

Unnamed: 0,ad,america,breitbart,campaign,clinton,cnn,comment,comments,conference,debate,donald,election,email,fbi,fox,hillary,house,interview,media,new,news,obama,post,president,presidential,press,radio,release,report,republican,russia,says,speech,state,times,trump,tv,video,war,white,world,york,news_label
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,0,0


In [16]:
#bow_df.isnull().sum().sum()

In [17]:
total_wc = bow_df.sum(axis = 0)
total_top_5 = total_wc.sort_values(ascending=False).head(5)

In [18]:
total_top_5

new          7246
times        6413
york         6410
trump        3733
breitbart    2407
dtype: int64

In [19]:
#sentences = sentences.reset_index()

### Let's see how well the title predicts fake news

In [20]:
Y = sentences_title['news_label']
X = np.array(sentences_title.drop(['news_label'], axis=1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()

lr.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
#print('CV score:', cross_val_score(lr, X_train, y_train, cv=3).mean())
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))


----------------------Logistic Regression Scores----------------------
Training set score: 0.7846676004740868

Test set score: 0.7794925662572721


In [21]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [22]:
print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))


----------------------Random Forest Scores----------------------
Training set score: 0.8000215494020041

Test set score: 0.7854718810601163


### The title alone does reasonably well: about 78-79% test accuracy against a 52% baseline

### Let's try author
Let's see whether author contributes much to detecting fake news and whether we should include it.

In [23]:
df['author'] = df['author'].str.lower()

In [24]:
vectorizer = CountVectorizer(max_df=0.75, min_df=.005, stop_words = 'english', analyzer = 'word')
X = vectorizer.fit_transform(df["author"])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences_author = pd.concat([bow_df, df[["news_label"]]], axis=1)

In [25]:
sentences_author.head()

Unnamed: 0,admin,alan,alexander,andrew,baker,barack,ben,blogger,bob,breitbart,charlie,chris,clinton,com,committee,dan,daniel,david,donald,eric,hillary,hudson,ian,james,jason,jeff,jerome,jim,joe,john,key,mark,michael,mike,mitt,neil,news,noreply,obama,pam,patrick,paul,republican,rick,robert,romney,scott,smith,thomas,tim,tom,trump,walker,news_label
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


In [26]:
sentences_author.shape

(30938, 54)

In [27]:
total_wc = bow_df.sum(axis = 0)
total_top_5 = total_wc.sort_values(ascending=False).head(5)

In [28]:
total_top_5

john       819
michael    587
obama      494
barack     481
com        476
dtype: int64

In [29]:
Y = sentences_author['news_label']
X = np.array(sentences_author.drop(['news_label'], axis=1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
#rfc = RandomForestClassifier()
#gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
#rfc.fit(X_train, y_train)
#gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.6157741622669971

Test set score: 0.6233031674208145


In [30]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

----------------------Random Forest Scores----------------------
Training set score: 0.6166900118521711

Test set score: 0.6236263736263736


In [31]:
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Gradient Boosting Scores----------------------
Training set score: 0.6100635707359121

Test set score: 0.6182934712346477


### Author does not do as well as title (about 62% test accuracy), but it still beats the 52% baseline

#### Let's try text

In [32]:
df['text'] = df['text'].str.lower()

In [33]:
## token_pattern = '[a-zA-Z0-9]+'
vectorizer = CountVectorizer(min_df=.02, max_df=0.75, stop_words = 'english', analyzer = 'word')
X = vectorizer.fit_transform(df["text"])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences_text = pd.concat([bow_df, df[["news_label"]]], axis=1)

In [34]:
sentences_text.head()

Unnamed: 0,000,10,100,11,12,13,14,15,16,17,18,19,20,200,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,21,22,23,24,25,26,27,28,29,30,300,31,35,40,400,45,50,500,60,70,80,90,ability,able,absolutely,abuse,accept,accepted,access,according,account,accounts,accusations,accused,acknowledged,act,acting,action,actions,active,activist,activists,activities,activity,acts,actual,actually,add,added,adding,addition,additional,address,administration,admitted,advance,advantage,advice,adviser,affairs,afghanistan,african,afternoon,age,agencies,agency,agenda,agents,aggressive,ago,agree,agreed,agreement,ahead,aid,aides,aimed,air,al,allegations,alleged,allies,allow,allowed,allowing,allows,alternative,ambassador,amendment,america,american,americans,amid,analysis,analyst,andrew,angeles,anger,angry,announced,announcement,annual,answer,anti,apart,apparently,appeal,appear,appearance,appeared,appears,approach,approval,approved,april,arabia,area,areas,aren,argued,argument,armed,arms,army,arrested,arrived,art,article,articles,aside,ask,asked,asking,assault,assistant,associated,association,attack,attacked,attacks,attempt,attempts,attended,attention,attorney,audience,august,author,authorities,authority,available,average,avoid,aware,away,backed,background,bad,balance,ban,bank,banks,bar,barack,base,based,basic,basically,basis,battle,beat,began,begin,beginning,begun,behalf,behavior,believe,believed,believes,benefit,benefits,bernie,best,better,big,bigger,biggest,billion,bit,black,blame,block,blood,blue,board,body,book,books,border,borders,born,bought,boy,break,breaking,breitbart,bring,bringing,britain,british,broad,broadcast,broke,broken,brother,brought,brown,budget,build,building,built,bureau,bush,business,...,story,straight,strategic,strategy,street,streets,strike,strong,struck,struggle,student,students,studies,study,stuff,style,subject,success,successful,suddenly,suggest,suggested,suggesting,suggests,summer,sunday,supply,support,supported,supporters,supporting,supposed,supreme,sure,surprise
,suspect,syria,syrian,systems,table,taken,takes,taking,talk,talked,talking,talks,target,targeted,task,tax,taxes,team,technology,television,tell,telling,tells,tens,term,terms,territory,terror,terrorism,terrorist,terrorists,test,texas,thank,thanks,theory,thing,things,think,thinking,thinks,thomas,thought,thousands,threat,threatened,threats,thursday,ties,time,times,today,told,tom,took,total,totally,tough,town,track,trade,traditional,training,transition,travel,treated,treatment,trial,tried,trip,troops,trouble,true,truly,trump,trust,truth,try,trying,tuesday,turkey,turn,turned,turning,turns,tv,twice,twitter,type,typically,ultimately,unable,unclear,understand,understanding,unfortunately,union,united,university,unless,unlike,unlikely,unusual,urged,usa,use,used,uses,using,usual,usually,value,values,various,vast,ve,version,vice,victims,victory,video,videos,view,views,violence,violent,virginia,vision,visit,vladimir,voice,vote,voted,voter,voters,votes,voting,wait,waiting,wake,walk,walking,wall,want,wanted,wants,war,warned,warning,wars,washington,wasn,watch,watched,watching,water,way,ways,wealth,weapons,wearing,website,wednesday,week,weekend,weeks,welcome,went,west,western,white,wide,widely,wife,wikileaks,willing,win,wing,winning,wins,wisconsin,wish,woman,women,won,wonder,word,words,work,worked,workers,working,works,world,worried,worry,worse,worst,worth,wouldn,write,writer,writes,writing,written,wrong,wrote,year,years,yes,york,young,youtube,zero,news_label
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,3,1,0,6,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1
3,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,4,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,12,3,0,0,0,0,0,0,0,0,0,0,10,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7,2,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,14,0,0,0,2,0,0,0,0,0,0,1,1,0,0,0,0,2,0,0,1,0,0,0,0,0,1
4,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [35]:
total_wc = bow_df.sum(axis = 0)
total_top_5 = total_wc.sort_values(ascending=False).head(5)

In [36]:
total_top_5

said      80430
mr        66301
trump     56420
people    37135
new       30522
dtype: int64

In [37]:
Y = sentences_text['news_label']
X = np.array(sentences_text.drop(['news_label'], axis=1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)



In [38]:
# Models
lr = LogisticRegression()

lr.fit(X_train, y_train)

print("----------------------Logistic Regression Scores on text ----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

----------------------Logistic Regression Scores on text ----------------------
Training set score: 0.8581510613080487

Test set score: 0.7949256625727213


In [39]:
#vectorizer = CountVectorizer(min_df = float in range [0.0, 1.0], max_df=0.75, token_pattern = '[a-zA-Z0-9]+')
#min_df = float in range [0.0, 1.0] or int, default=1

In [40]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

----------------------Random Forest Scores----------------------
Training set score: 0.998168300829652

Test set score: 0.8169844861021331


### Finally, text also does well: about 79% (logistic regression) to 82% (random forest) test accuracy, although the random forest overfits (training score 0.998)

### What if we added title, author and text?

Now that we know that author, title and text each help to identify fake news on their own, we will combine them and see whether together they are a better indicator of fake news.

In [41]:
df_total = df.copy()

In [42]:
df_total['total']=df_total['title']+' '+df_total['author']+' '+df_total['text']
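
The concatenation above only works cleanly because the missing values were filled with `' '` earlier; in pandas, adding a string to a missing value gives a missing value. A small sketch on a made-up toy frame (the rows are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (made up for illustration)
toy = pd.DataFrame({
    "title": ["Breaking Story", np.nan],
    "author": [np.nan, "Jane Doe"],
    "text": ["Body one", "Body two"],
})

# Fill gaps first; string concatenation with NaN would otherwise yield NaN
toy = toy.fillna(" ")
toy["total"] = (toy["title"] + " " + toy["author"] + " " + toy["text"]).str.lower()
print(toy["total"].tolist())
```

The extra spaces left by the filler values do not matter, because the vectorizer splits on word boundaries.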

In [43]:
df_total['total'] = df_total['total'].str.lower()

In [44]:
vectorizer = CountVectorizer(max_df=0.75,  min_df=.01, stop_words = 'english', analyzer = 'word')
X = vectorizer.fit_transform(df_total['total'])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences_total = pd.concat([bow_df, df[["news_label"]]], axis=1)

In [45]:
sentences_total.shape

(30938, 3513)
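
`X.toarray()` stores every zero. For the 30,938 x 3,512 count matrix above, that is about 108.7 million cells, or roughly 0.87 GB assuming 8-byte integers. The vectorizer already returns a SciPy sparse matrix that stores only the non-zero counts, and scikit-learn's linear models accept it directly. A minimal sketch on a made-up corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus; 1 = fake, 0 = real, as in the notebook
docs = [
    "shocking secret they do not want you to know",
    "senate passes budget bill after long debate",
    "you will not believe this miracle cure",
    "city council approves new transit plan",
]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X_sparse = vectorizer.fit_transform(docs)  # SciPy sparse matrix, zeros are not stored

total_cells = X_sparse.shape[0] * X_sparse.shape[1]
print("stored values:", X_sparse.nnz, "of", total_cells, "cells")

# The sparse matrix is passed to the model as-is, no toarray() needed
clf = LogisticRegression(solver="liblinear").fit(X_sparse, labels)
print(clf.predict(X_sparse))
```

The dense DataFrame is still handy for inspecting word counts, but the models do not need it.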

In [46]:
total_wc = bow_df.sum(axis = 0)
total_top_5 = total_wc.sort_values(ascending=False).head(5)

In [47]:
total_top_5

said      80512
mr        66340
trump     60431
new       37849
people    37391
dtype: int64

In [48]:
## n gram
vectorizer = CountVectorizer(max_df=0.75,  min_df=.01, stop_words = 'english', analyzer = 'word', ngram_range=(2,2))
X = vectorizer.fit_transform(df_total['total'])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences_total_ngram = pd.concat([bow_df, df[["news_label"]]], axis=1)

In [49]:
sentences_total_ngram.shape

(30938, 221)

In [50]:
total_wc = bow_df.sum(axis = 0)
total_top_5 = total_wc.sort_values(ascending=False).head(5)

In [51]:
total_top_5

mr trump           17348
new york           15098
united states      12708
donald trump       10735
hillary clinton     9519
dtype: int64

In [52]:
X = df_total['total']
y = df['news_label']

In [53]:
# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=123)

### Trying to get the best parameters

In [56]:
pipe = Pipeline([('count_vect', CountVectorizer()),    
                 ('lr', LogisticRegression(solver='liblinear'))])

# Tune GridSearchCV
pipe_params = {'count_vect__stop_words': [None, 'english'],
               'count_vect__ngram_range': [(1,1), (2,2), (1,3)],
               'lr__C': [0.01, 1]}

cvec_gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
cvec_gs.fit(X_train, y_train);

print("----------------------Logistic Regression Scores----------------------")
print("Best score:", cvec_gs.best_score_)
print("Train score", cvec_gs.score(X_train, y_train))
print("Test score", cvec_gs.score(X_test, y_test))

cvec_gs.best_params_

----------------------Logistic Regression Scores----------------------
Best score: 0.8357395400626464
Train score 0.9441331753043853
Test score 0.8472042663219134


{'count_vect__ngram_range': (1, 3),
 'count_vect__stop_words': None,
 'lr__C': 0.01}
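
Accuracy alone does not show which kind of mistake the model makes (fake articles passed as real, or real articles flagged as fake). A minimal sketch of a follow-up check, assuming the best parameters reported above and a made-up toy corpus in place of the real split:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

# Made-up toy data; 1 = fake, 0 = real
train_texts = [
    "you will not believe this miracle cure",
    "shocking secret they do not want you to know",
    "aliens built the pyramids says anonymous source",
    "senate passes budget bill after long debate",
    "city council approves new transit plan",
    "central bank holds interest rates steady",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["shocking miracle cure they do not want", "council passes new budget plan"]
test_labels = [1, 0]

# Same steps as the grid search pipeline, with the reported best parameters
best_pipe = Pipeline([
    ("count_vect", CountVectorizer(ngram_range=(1, 3), stop_words=None)),
    ("lr", LogisticRegression(solver="liblinear", C=0.01)),
])
best_pipe.fit(train_texts, train_labels)
predictions = best_pipe.predict(test_texts)

# Rows are true labels, columns are predicted labels, in the order [0, 1]
cm = confusion_matrix(test_labels, predictions, labels=[0, 1])
print(cm)
print(classification_report(test_labels, predictions, labels=[0, 1], zero_division=0))
```

In the notebook itself, the fitted `cvec_gs.best_estimator_` could be passed to the same two metric calls with `X_test` and `y_test`. Because the vectorizer sits inside the pipeline, its vocabulary is learned from the training folds only.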

### Let's use the best parameters: ngram_range=(1,3), stop_words = None

In [57]:
vectorizer = CountVectorizer(min_df=.01, max_df=0.75, stop_words = None, analyzer = 'word', ngram_range=(1,3))
X = vectorizer.fit_transform(df_total['total'].apply(lambda x: np.str_(x)))
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
sentences_total = pd.concat([bow_df, df_total[["news_label"]]], axis=1)

In [58]:
sentences_total.head(5)

Unnamed: 0,000,000 in,000 people,000 to,10,10 000,10 percent,10 years,100,100 000,11,12,13,14,15,15 years,150,16,17,18,19,1960s,1970s,1980s,1990s,1999,20,20 years,200,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2015 the,2016,2016 by,2016 election,2016 presidential,2016 the,2017,2018,20th,21,22,23,24,25,250,26,26 2016,27,27 2016,28,28 2016,29,30,30 years,300,31,32,33,34,35,36,37,38,39,40,400,41,42,43,44,45,46,47,48,49,50,500,51,52,55,60,600,65,70,700,75,80,800,90,95,99,abandoned,abc,abedin,ability,ability to,able,able to,abortion,about,about an,about her,about his,about how,about it,about mr,about that,about the,about their,about this,about to,about what,about whether,above,above the,abroad,absence,absolute,absolutely,abuse,academic,academy,accept,accepted,accepting,access,access to,accident,accompanied,according,according to,according to the,account,accountable,accounts,accurate,accusations,accused,accused of,accusing,achieve,acknowledge,acknowledged,acknowledged that,across,across the,across the country,act,act of,acted,acting,action,actions,active,actively,activist,activists,activities,activity,actor,actors,acts,actual,actually,ad,adam,add,added,added that,added the,adding,adding that,addition,addition to,additional,address,address the,addressed,addressing,adds,admin,administration,administration and,administration has,admit,admitted,adopted,ads,adult,adults,advance,advanced,advantage,advantage of,advertisement,advertising,advice,advised,adviser,adviser to,advisers,advocacy,advocate,advocates,affairs,affect,affected,afford,affordable,affordable care,affordable care act,afghanistan,afraid,africa,african,after,after all,after an,after being,after he,after his,after it,after mr,after she,after that,after the,after the election,after they,aftermath,afternoon,again,again and,again the,against,against him,...,who has been,who have,who have been,who is,who said,who served,who want,who was,who were,who will,who 
would,whole,whom,whose,why,why he,why the,wide,widely,widespread,wife,wikileaks,wild,will,will also,will be,will be the,will continue,will continue to,will do,will get,will go,will have,will have to,will help,will make,will never,will not,will not be,will take,william,williams,willing,willing to,win,win the,wind,window,wing,winner,winning,wins,winter,wisconsin,wish,with,with all,with an,with each,with her,with him,with his,with it,with its,with many,with me,with more,with mr,with mr trump,with my,with new,with no,with one,with other,with our,with people,with president,with russia,with some,with that,with the,with their,with them,with this,with those,with trump,with two,with us,with what,with you,with your,within,within the,without,without the,witness,witnessed,witnesses,woman,woman who,women,women and,women in,women who,won,won be,won the,wonder,wonderful,wondering,word,words,wore,work,work and,work for,work in,work of,work on,work to,work with,worked,worked for,worker,workers,working,working for,working on,working to,working with,works,world,world and,world in,world is,world of,world the,world war,world war ii,worldwide,worried,worried about,worry,worse,worst,worth,would,would also,would be,would be the,would do,would have,would have been,would have to,would like,would like to,would make,would never,would not,would not be,would take,would you,wouldn,wounded,write,writer,writers,writes,writing,written,written by,wrong,wrote,wrote in,wrote on,wrote the,www,yeah,year,year and,year in,year old,year the,year to,years,years after,years ago,years and,years in,years later,years of,years old,years the,years to,yemen,yes,yesterday,yet,yet another,yet the,yet to,york,york and,york city,york times,you,you and,you are,you can,you can follow,you could,you do,you don,you get,you have,you have to,you just,you know,you like,you ll,you look,you may,you might,you need,you need to,you re,you re going,you see,you should,you that,you the,you think,you to,you ve,you want,you want to,you 
were,you will,you would,young,young people,younger,your,your own,yourself,youth,youtube,zero,zone,news_label
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,0,2,0,0,0,0,0,0,0,0,0,0,0,1,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0,0,0,0,27,0,1,0,0,0,2,0,0,0,0,0,0,0,3,1,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,14,1,0,1,0,0,0,3,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,41,0,0,5,0,0,1,0,0,4,1,0,0,0,1,0,1,0,1,1,0,0,0,1,0,0,2,1,0,5,0,1,4,0,0,0,0,10,1,1,0,0,0,0,1
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [59]:
sentences_total.shape

(30938, 7659)

In [60]:
total_wc = bow_df.sum(axis=0)
total_top_5 = total_wc.sort_values(ascending=False).head(5)

In [61]:
total_top_5

and     389994
that    220236
is      160459
for     149407
on      136408
dtype: int64

In [62]:
Y = sentences_total['news_label']
X = np.array(sentences_total.drop(['news_label'], axis=1))

# Hold out 40% of the rows as the test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.4, random_state=123)
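As a quick sanity check on the split, the row counts implied by `test_size=0.4` can be worked out by hand. This is a minimal sketch that assumes scikit-learn's rounding for a fractional `test_size` (the test count is rounded up and the remainder goes to training); the helper name `split_sizes` is hypothetical and not part of the notebook.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test count is rounded up and the rest goes to training,
    which is how train_test_split treats a float test_size.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 30938 is the row count reported by sentences_total.shape
n_train, n_test = split_sizes(30938, 0.4)
print(n_train, n_test)
```

The test count should agree with the total support of 12376 reported in the classification reports.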


In [63]:
# Logistic regression on the CountVectorizer (bag-of-words) features
lr = LogisticRegression()
lr.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.937883848723198

Test set score: 0.8256302521008403


In [64]:
print("----------------------Confusion Matrix----------------------")
predictions = lr.predict(X_test)
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

----------------------Confusion Matrix----------------------


Unnamed: 0,predict neg,predict pos
actual neg,5305,1071
actual pos,1087,4913


In [65]:
print("Classification Report: \n {}\n".format(classification_report(y_test, lr.predict(X_test))))

Classification Report: 
               precision    recall  f1-score   support

           0       0.83      0.83      0.83      6376
           1       0.82      0.82      0.82      6000

    accuracy                           0.83     12376
   macro avg       0.83      0.83      0.83     12376
weighted avg       0.83      0.83      0.83     12376




In [66]:
# Calculate classification metrics
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
misclassification = (100 - accuracy)
sensitivity = tp / (tp + fn) * 100
specificity = tn / (tn + fp) * 100
precision = tp / (tp + fp) * 100

print(f'Accuracy: {round(accuracy, 2)}%')
print(f'Misclassification rate: {round(misclassification, 2)}%')
print(f'Recall / Sensitivity: {round(sensitivity, 2)}%')
print(f'Specificity: {round(specificity, 2)}%')
print(f'Precision: {round(precision, 2)}%')

Accuracy: 82.56%
Misclassification rate: 17.44%
Recall / Sensitivity: 81.88%
Specificity: 83.2%
Precision: 82.1%
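The same metric block is repeated after every model in this notebook, so it could be wrapped in a small helper. The sketch below is one possible way to do that; the function name `classification_metrics` is hypothetical, and the counts are the ones from the logistic regression confusion matrix above.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return the five summary metrics, each expressed in percent."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total * 100
    return {
        "Accuracy": accuracy,
        "Misclassification rate": 100 - accuracy,
        "Recall / Sensitivity": tp / (tp + fn) * 100,
        "Specificity": tn / (tn + fp) * 100,
        "Precision": tp / (tp + fp) * 100,
    }


# Counts taken from the confusion matrix shown above
metrics = classification_metrics(tn=5305, fp=1071, fn=1087, tp=4913)
for name, value in metrics.items():
    print(f"{name}: {round(value, 2)}%")
```

Inside the notebook, the arguments could be supplied directly with `classification_metrics(*confusion_matrix(y_test, predictions).ravel())`, since `ravel()` returns the counts in the order tn, fp, fn, tp.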


In [67]:
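# NOTE: `lr` uses the default 'lbfgs' solver, which supports only the 'l2' penalty.
# The 'l1' candidates therefore fail to fit and are scored as nan, so this search
# effectively selects 'l2'. Comparing both penalties would need a solver such as
# 'liblinear' or 'saga'.
lr_params = {"penalty": ["l1", "l2"]}
clf_lr = GridSearchCV(lr, lr_params, cv=5)
clf_lr.fit(X_train, y_train)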

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'penalty': ['l1', 'l2']}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=False, scoring=None, verbose=0)

In [68]:
print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', clf_lr.score(X_train, y_train))
print('\nTest set score:', clf_lr.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.937883848723198

Test set score: 0.8256302521008403


In [69]:
print("----------------------Confusion Matrix----------------------")
predictions = clf_lr.predict(X_test)
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

----------------------Confusion Matrix----------------------


Unnamed: 0,predict neg,predict pos
actual neg,5305,1071
actual pos,1087,4913


In [70]:
# Calculate classification metrics
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
misclassification = (100 - accuracy)
sensitivity = tp / (tp + fn) * 100
specificity = tn / (tn + fp) * 100
precision = tp / (tp + fp) * 100

print(f'Accuracy: {round(accuracy, 2)}%')
print(f'Misclassification rate: {round(misclassification, 2)}%')
print(f'Recall / Sensitivity: {round(sensitivity, 2)}%')
print(f'Specificity: {round(specificity, 2)}%')
print(f'Precision: {round(precision, 2)}%')

Accuracy: 82.56%
Misclassification rate: 17.44%
Recall / Sensitivity: 81.88%
Specificity: 83.2%
Precision: 82.1%


In [71]:
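# 10-fold cross-validation; note that this runs on the held-out test split,
# not on the training data
result = cross_val_score(lr, X_test, y_test, cv=10, scoring='accuracy')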
print("Average Accuracy: \t {0:.4f}".format(np.mean(result)))

Average Accuracy: 	 0.8159


In [72]:
#print("Classification Report: \n {}\n".format(classification_report(y_test, lr.predict(X_test))))

In [73]:
rfc = RandomForestClassifier(max_features=50)
rfc.fit(X_train, y_train)

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

----------------------Random Forest Scores----------------------
Training set score: 0.9999461264949898

Test set score: 0.8489010989010989


In [74]:
print("----------------------Confusion Matrix----------------------")
predictions = rfc.predict(X_test)
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

----------------------Confusion Matrix----------------------


Unnamed: 0,predict neg,predict pos
actual neg,5435,941
actual pos,929,5071


In [75]:
# Calculate classification metrics
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
misclassification = (100 - accuracy)
sensitivity = tp / (tp + fn) * 100
specificity = tn / (tn + fp) * 100
precision = tp / (tp + fp) * 100

print(f'Accuracy: {round(accuracy, 2)}%')
print(f'Misclassification rate: {round(misclassification, 2)}%')
print(f'Recall / Sensitivity: {round(sensitivity, 2)}%')
print(f'Specificity: {round(specificity, 2)}%')
print(f'Precision: {round(precision, 2)}%')

Accuracy: 84.89%
Misclassification rate: 15.11%
Recall / Sensitivity: 84.52%
Specificity: 85.24%
Precision: 84.35%


In [76]:
rfc_params = {"n_estimators": [3, 5, 10, 15],
              "max_depth": [2, 3, 4, 5],
              "min_samples_split": [3, 5, 7, 9]}

clf_rfc = GridSearchCV(rfc, rfc_params, cv=5)
clf_rfc.fit(X_train, y_train)

print("----------------------Random Forest Scores----------------------")
print('Training set score:', clf_rfc.score(X_train, y_train))
print('\nTest set score:', clf_rfc.score(X_test, y_test))

----------------------Random Forest Scores----------------------
Training set score: 0.7316560715440147

Test set score: 0.7234162895927602
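The drop from roughly 0.85 to 0.72 is likely a consequence of the grid itself: every candidate caps tree depth at 5 and uses at most 15 trees, whereas the untuned forest above left depth unrestricted and kept the library's default number of trees. As a rough sketch of what the grid search enumerates here (plain Python, no scikit-learn calls; every name other than `rfc_params` is illustrative):

```python
from itertools import product

rfc_params = {"n_estimators": [3, 5, 10, 15],
              "max_depth": [2, 3, 4, 5],
              "min_samples_split": [3, 5, 7, 9]}

# Every combination of the listed values is one candidate model
candidates = [dict(zip(rfc_params, values))
              for values in product(*rfc_params.values())]

cv_folds = 5
total_fits = len(candidates) * cv_folds
max_depth_tried = max(c["max_depth"] for c in candidates)

print("candidates:", len(candidates))
print("cross-validation fits:", total_fits)
print("deepest tree allowed:", max_depth_tried)
```

Because the search can only return one of these candidates, widening the grid (deeper trees, more estimators) would be needed before the tuned model could match the untuned one.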


In [77]:
###
#print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, rfc.predict(X_test))))

In [78]:
print("----------------------Confusion Matrix----------------------")
predictions = clf_rfc.predict(X_test)
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

----------------------Confusion Matrix----------------------


Unnamed: 0,predict neg,predict pos
actual neg,3433,2943
actual pos,480,5520


In [79]:
#print("Classification Report: \n {}\n".format(classification_report(y_test, rfc.predict(X_test))))

print("----------------------Classification Report----------------------")
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
misclassification = (100 - accuracy)
sensitivity = tp / (tp + fn) * 100
specificity = tn / (tn + fp) * 100
precision = tp / (tp + fp) * 100

print(f'Accuracy: {round(accuracy, 2)}%')
print(f'Misclassification rate: {round(misclassification, 2)}%')
print(f'Recall / Sensitivity: {round(sensitivity, 2)}%')
print(f'Specificity: {round(specificity, 2)}%')
print(f'Precision: {round(precision, 2)}%')

----------------------Classification Report----------------------
Accuracy: 72.34%
Misclassification rate: 27.66%
Recall / Sensitivity: 92.0%
Specificity: 53.84%
Precision: 65.23%


### TF-IDF (ngram_range=(1, 3))

In [80]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_df=0.75, min_df=0.01, use_idf=True, norm='l2', smooth_idf=True,
    ngram_range=(1, 3))

# Fit the vectorizer on the combined text column and transform it
X = vectorizer.fit_transform(df_total['total'])

# Note: scikit-learn 1.2+ replaces get_feature_names() with get_feature_names_out()
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
tfidf_total = pd.concat([tfidf_df, df[["news_label"]]], axis=1)
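For reference, with `use_idf=True`, `smooth_idf=True` and `norm='l2'`, scikit-learn documents the weight of a term as its raw count times idf(t) = ln((1 + n) / (1 + df(t))) + 1, after which each row is scaled to unit length. The sketch below reproduces that arithmetic in plain Python on a made-up three-document count matrix. The data and all variable names are hypothetical, and it leaves out the `max_df`, `min_df` and n-gram settings, which only decide which columns exist.

```python
import math

# Toy term counts: rows are documents, columns are terms (hypothetical data)
terms = ["news", "trump", "wikileaks"]
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [1, 1, 3],
]

n_docs = len(counts)
# Document frequency: number of documents that contain each term
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]
# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]


def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted


tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 3) for v in row])
```

A term that appears in every document ("news" here) gets the minimum idf of 1, while rarer terms are weighted up, which is why the TF-IDF columns are fractional values rather than the integer counts of the bag-of-words table.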


In [81]:
tfidf_total.shape

(30938, 7659)

In [82]:
tfidf_total.head()

Unnamed: 0,000,000 in,000 people,...,zero,zone,news_label
0,0.0,0.0,0.0,...,0.0,0.0,1
1,0.0,0.0,0.0,...,0.0,0.0,1
2,0.0,0.0,0.0,...,0.0,0.0,1
3,0.0,0.0,0.0,...,0.0,0.0,1
4,0.133819,0.0,0.0,...,0.0,0.0,0


In [83]:
Y = tfidf_total['news_label']
X = np.array(tfidf_total.drop(['news_label'], axis=1))

# Hold out 40% of the rows as the test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.4, random_state=123)


In [84]:
# Models
lr = LogisticRegression()
lr.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))


----------------------Logistic Regression Scores----------------------
Training set score: 0.882178644542614

Test set score: 0.8333872010342599


In [85]:
#print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, lr.predict(X_test))))

print("----------------------Confusion Matrix----------------------")
predictions = lr.predict(X_test)
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

----------------------Confusion Matrix----------------------


Unnamed: 0,predict neg,predict pos
actual neg,5357,1019
actual pos,1043,4957


In [86]:
#print("Classification Report: \n {}\n".format(classification_report(y_test, lr.predict(X_test))))

print("----------------------Classification Report----------------------")
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
misclassification = (100 - accuracy)
sensitivity = tp / (tp + fn) * 100
specificity = tn / (tn + fp) * 100
precision = tp / (tp + fp) * 100

print(f'Accuracy: {round(accuracy, 2)}%')
print(f'Misclassification rate: {round(misclassification, 2)}%')
print(f'Recall / Sensitivity: {round(sensitivity, 2)}%')
print(f'Specificity: {round(specificity, 2)}%')
print(f'Precision: {round(precision, 2)}%')

----------------------Classification Report----------------------
Accuracy: 83.34%
Misclassification rate: 16.66%
Recall / Sensitivity: 82.62%
Specificity: 84.02%
Precision: 82.95%


In [87]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))


----------------------Random Forest Scores----------------------
Training set score: 0.9999461264949898

Test set score: 0.8481738849385908


In [88]:
#print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, rfc.predict(X_test))))

print("----------------------Confusion Matrix----------------------")
predictions = rfc.predict(X_test)
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

----------------------Confusion Matrix----------------------


Unnamed: 0,predict neg,predict pos
actual neg,5447,929
actual pos,950,5050


In [89]:
#print("Classification Report: \n {}\n".format(classification_report(y_test, rfc.predict(X_test))))


print("----------------------Classification Report----------------------")
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
misclassification = (100 - accuracy)
sensitivity = tp / (tp + fn) * 100
specificity = tn / (tn + fp) * 100
precision = tp / (tp + fp) * 100

print(f'Accuracy: {round(accuracy, 2)}%')
print(f'Misclassification rate: {round(misclassification, 2)}%')
print(f'Recall / Sensitivity: {round(sensitivity, 2)}%')
print(f'Specificity: {round(specificity, 2)}%')
print(f'Precision: {round(precision, 2)}%')

----------------------Classification Report----------------------
Accuracy: 84.82%
Misclassification rate: 15.18%
Recall / Sensitivity: 84.17%
Specificity: 85.43%
Precision: 84.46%


In [90]:
print("Classification Report: \n {}\n".format(classification_report(y_test, rfc.predict(X_test))))

Classification Report: 
               precision    recall  f1-score   support

           0       0.85      0.85      0.85      6376
           1       0.84      0.84      0.84      6000

    accuracy                           0.85     12376
   macro avg       0.85      0.85      0.85     12376
weighted avg       0.85      0.85      0.85     12376




In [91]:
rfc_params = {"n_estimators": [3, 5, 10, 15],
              "max_depth": [2, 3, 4, 5],
              "min_samples_split": [3, 5, 7, 9]}

clf_rfc = GridSearchCV(rfc, rfc_params, cv=5)
clf_rfc.fit(X_train, y_train)

print("----------------------Random Forest Scores----------------------")
print('Training set score:', clf_rfc.score(X_train, y_train))
print('\nTest set score:', clf_rfc.score(X_test, y_test))

----------------------Random Forest Scores----------------------
Training set score: 0.7411378084258162

Test set score: 0.7365061409179057


In [92]:

print("----------------------Confusion Matrix----------------------")
predictions = clf_rfc.predict(X_test)
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

----------------------Confusion Matrix----------------------


Unnamed: 0,predict neg,predict pos
actual neg,3506,2870
actual pos,391,5609


In [93]:
print("Classification Report: \n {}\n".format(classification_report(y_test, clf_rfc.predict(X_test))))

Classification Report: 
               precision    recall  f1-score   support

           0       0.90      0.55      0.68      6376
           1       0.66      0.93      0.77      6000

    accuracy                           0.74     12376
   macro avg       0.78      0.74      0.73     12376
weighted avg       0.78      0.74      0.73     12376




In [94]:
print("----------------------Classification Report----------------------")
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
misclassification = (100 - accuracy)
sensitivity = tp / (tp + fn) * 100
specificity = tn / (tn + fp) * 100
precision = tp / (tp + fp) * 100

print(f'Accuracy: {round(accuracy, 2)}%')
print(f'Misclassification rate: {round(misclassification, 2)}%')
print(f'Recall / Sensitivity: {round(sensitivity, 2)}%')
print(f'Specificity: {round(specificity, 2)}%')
print(f'Precision: {round(precision, 2)}%')

----------------------Classification Report----------------------
Accuracy: 73.65%
Misclassification rate: 26.35%
Recall / Sensitivity: 93.48%
Specificity: 54.99%
Precision: 66.15%
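
The `classification_report` output and the hand-computed metrics describe the same confusion matrix from two angles: recall for class 1 is the sensitivity, and recall for class 0 is the specificity. A minimal sketch of that relationship, using the counts from the grid-searched model above. `per_class_report` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_report(tn, fp, fn, tp):
    """Precision, recall and F1 for each class of a binary confusion matrix."""
    def prf(correct, predicted_as_class, actually_in_class):
        precision = correct / predicted_as_class
        recall = correct / actually_in_class
        f1 = 2 * precision * recall / (precision + recall)
        return {"precision": precision, "recall": recall, "f1": f1}

    return {
        0: prf(tn, tn + fn, tn + fp),  # class 0 (real): correct = true negatives
        1: prf(tp, tp + fp, tp + fn),  # class 1 (fake): correct = true positives
    }


# Counts copied from the grid-searched Random Forest confusion matrix above.
report = per_class_report(tn=3506, fp=2870, fn=391, tp=5609)
for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

Read this way, the tuned model catches most fake articles (high class 1 recall) but labels almost half of the real articles as fake (low class 0 recall), which is why its overall accuracy is lower than that of the untuned forest.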


### word2vec 

In [95]:
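import gensim

# Train six CBOW models (sg=0) that differ only in context window (4, 6, 8)
# and vector size (100, 200).
# Note: `size` is the gensim 3.x parameter name; gensim 4.x calls it `vector_size`.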
model1 = gensim.models.Word2Vec(
    df_total['total'],
    workers=4,
    min_count=1,
    window=4,
    sg=0,
    sample=1e-3,
    size=100,
    hs=1
)

model2 = gensim.models.Word2Vec(
    df_total['total'],
    workers=4,
    min_count=1,
    window=6,
    sg=0,
    sample=1e-3,
    size=100,
    hs=1
)

model3 = gensim.models.Word2Vec(
    df_total['total'],
    workers=4,
    min_count=1,
    window=8,
    sg=0,
    sample=1e-3,
    size=100,
    hs=1
)

model4 = gensim.models.Word2Vec(
    df_total['total'],
    workers=4,
    min_count=1,
    window=4,
    sg=0,
    sample=1e-3,
    size=200,
    hs=1
)

model5 = gensim.models.Word2Vec(
    df_total['total'],
    workers=4,
    min_count=1,
    window=6,
    sg=0,
    sample=1e-3,
    size=200,
    hs=1
)

model6 = gensim.models.Word2Vec(
    df_total['total'],
    workers=4,
    min_count=1,
    window=8,
    sg=0,
    sample=1e-3,
    size=200,
    hs=1
)

In [96]:
# One row per document; the column count matches each model's vector size.
word2vec_arr1 = np.zeros((df_total.shape[0], 100))
word2vec_arr2 = np.zeros((df_total.shape[0], 100))
word2vec_arr3 = np.zeros((df_total.shape[0], 100))
word2vec_arr4 = np.zeros((df_total.shape[0], 200))
word2vec_arr5 = np.zeros((df_total.shape[0], 200))
word2vec_arr6 = np.zeros((df_total.shape[0], 200))

# Represent each document as the mean of its word vectors.
# Vectors are read through `.wv`; indexing the model directly is deprecated.
for i, sentence in enumerate(df_total['total']):
    word2vec_arr1[i, :] = np.mean([model1.wv[lemma] for lemma in sentence], axis=0)
    word2vec_arr2[i, :] = np.mean([model2.wv[lemma] for lemma in sentence], axis=0)
    word2vec_arr3[i, :] = np.mean([model3.wv[lemma] for lemma in sentence], axis=0)
    word2vec_arr4[i, :] = np.mean([model4.wv[lemma] for lemma in sentence], axis=0)
    word2vec_arr5[i, :] = np.mean([model5.wv[lemma] for lemma in sentence], axis=0)
    word2vec_arr6[i, :] = np.mean([model6.wv[lemma] for lemma in sentence], axis=0)

word2vec_arr1 = pd.DataFrame(word2vec_arr1)
word2vec_arr2 = pd.DataFrame(word2vec_arr2)
word2vec_arr3 = pd.DataFrame(word2vec_arr3)
word2vec_arr4 = pd.DataFrame(word2vec_arr4)
word2vec_arr5 = pd.DataFrame(word2vec_arr5)
word2vec_arr6 = pd.DataFrame(word2vec_arr6)
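
Each article is represented by the average of its word vectors. Because the models are trained with `min_count=1`, every token is in the vocabulary, but an article with no tokens at all would leave the mean undefined. The sketch below shows the averaging step with that guard added. A plain dict stands in for the trained embeddings, and `mean_vector` and the toy vectors are made up for illustration.

```python
def mean_vector(tokens, vectors, dim):
    """Average the vectors of the known tokens; all zeros if none are known."""
    rows = [vectors[token] for token in tokens if token in vectors]
    if not rows:
        return [0.0] * dim
    # zip(*rows) groups the values by dimension, so each column is averaged.
    return [sum(column) / len(rows) for column in zip(*rows)]


# Toy 2-dimensional embeddings standing in for a trained Word2Vec model.
toy_vectors = {"fake": [1.0, 0.0], "news": [0.0, 1.0]}

print(mean_vector(["fake", "news"], toy_vectors, dim=2))  # [0.5, 0.5]
print(mean_vector([], toy_vectors, dim=2))                # [0.0, 0.0]
```

In the notebook, the same guard would wrap the `np.mean(..., axis=0)` call.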


In [97]:
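# The labels are identical for all six feature sets.
# This assumes df and df_total share the same row order.
Y1 = df['news_label']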
Y2 = df['news_label']
Y3 = df['news_label']
Y4 = df['news_label']
Y5 = df['news_label']
Y6 = df['news_label']

X1 = np.array(word2vec_arr1)
X2 = np.array(word2vec_arr2)
X3 = np.array(word2vec_arr3)
X4 = np.array(word2vec_arr4)
X5 = np.array(word2vec_arr5)
X6 = np.array(word2vec_arr6)

In [98]:
# 60/40 train/test split. The same random_state is used each time,
# so all six feature sets are split on the same rows.
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, Y1, test_size=0.4, random_state=123)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, Y2, test_size=0.4, random_state=123)
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, Y3, test_size=0.4, random_state=123)
X_train4, X_test4, y_train4, y_test4 = train_test_split(X4, Y4, test_size=0.4, random_state=123)
X_train5, X_test5, y_train5, y_test5 = train_test_split(X5, Y5, test_size=0.4, random_state=123)
X_train6, X_test6, y_train6, y_test6 = train_test_split(X6, Y6, test_size=0.4, random_state=123)


In [99]:
# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
#gbc = GradientBoostingClassifier()


In [100]:
print("-----------------------Word2vec Model 1------------------------------")
lr.fit(X_train1, y_train1)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train1, y_train1))
print('\nTest set score:', lr.score(X_test1, y_test1))


-----------------------Word2vec Model 1------------------------------
----------------------Logistic Regression Scores----------------------
Training set score: 0.7803038465682577

Test set score: 0.7782805429864253


In [101]:
rfc.fit(X_train1, y_train1)
print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train1, y_train1))
print('\nTest set score:', rfc.score(X_test1, y_test1))


----------------------Random Forest Scores----------------------
Training set score: 0.9999461264949898

Test set score: 0.8207821590174531


In [102]:
print("----------------------Confusion Matrix----------------------")
predictions = rfc.predict(X_test1)
cm = confusion_matrix(y_test1, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

----------------------Confusion Matrix----------------------


Unnamed: 0,predict neg,predict pos
actual neg,5327,1049
actual pos,1169,4831


In [103]:
print("----------------------Classification Report----------------------")
tn, fp, fn, tp = confusion_matrix(y_test1, predictions).ravel()

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
misclassification = (100 - accuracy)
sensitivity = tp / (tp + fn) * 100
specificity = tn / (tn + fp) * 100
precision = tp / (tp + fp) * 100

print(f'Accuracy: {round(accuracy, 2)}%')
print(f'Misclassification rate: {round(misclassification, 2)}%')
print(f'Recall / Sensitivity: {round(sensitivity, 2)}%')
print(f'Specificity: {round(specificity, 2)}%')
print(f'Precision: {round(precision, 2)}%')

----------------------Classification Report----------------------
Accuracy: 82.08%
Misclassification rate: 17.92%
Recall / Sensitivity: 80.52%
Specificity: 83.55%
Precision: 82.16%


In [104]:

print("-----------------------Word2vec Model 2------------------------------")
lr.fit(X_train2, y_train2)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train2, y_train2))
print('\nTest set score:', lr.score(X_test2, y_test2))


-----------------------Word2vec Model 2------------------------------
----------------------Logistic Regression Scores----------------------
Training set score: 0.7886542398448443

Test set score: 0.7861182934712346


In [105]:
rfc.fit(X_train2, y_train2)
print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train2, y_train2))
print('\nTest set score:', rfc.score(X_test2, y_test2))


----------------------Random Forest Scores----------------------
Training set score: 0.9999461264949898

Test set score: 0.8236910148674854


In [106]:
print("----------------------Confusion Matrix----------------------")
predictions = rfc.predict(X_test2)
cm = confusion_matrix(y_test2, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

----------------------Confusion Matrix----------------------


Unnamed: 0,predict neg,predict pos
actual neg,5408,968
actual pos,1214,4786


In [107]:
print("----------------------Classification Report----------------------")
tn, fp, fn, tp = confusion_matrix(y_test2, predictions).ravel()

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
misclassification = (100 - accuracy)
sensitivity = tp / (tp + fn) * 100
specificity = tn / (tn + fp) * 100
precision = tp / (tp + fp) * 100

print(f'Accuracy: {round(accuracy, 2)}%')
print(f'Misclassification rate: {round(misclassification, 2)}%')
print(f'Recall / Sensitivity: {round(sensitivity, 2)}%')
print(f'Specificity: {round(specificity, 2)}%')
print(f'Precision: {round(precision, 2)}%')

----------------------Classification Report----------------------
Accuracy: 82.37%
Misclassification rate: 17.63%
Recall / Sensitivity: 79.77%
Specificity: 84.82%
Precision: 83.18%


In [108]:

print("-----------------------Word2vec Model 3------------------------------")
lr.fit(X_train3, y_train3)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train3, y_train3))
print('\nTest set score:', lr.score(X_test3, y_test3))


-----------------------Word2vec Model 3------------------------------
----------------------Logistic Regression Scores----------------------
Training set score: 0.7975972416765434

Test set score: 0.7972689075630253


In [109]:
rfc.fit(X_train3, y_train3)
print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train3, y_train3))
print('\nTest set score:', rfc.score(X_test3, y_test3))


----------------------Random Forest Scores----------------------
Training set score: 0.9999461264949898

Test set score: 0.8322559793148029


In [110]:
print("----------------------Confusion Matrix----------------------")
predictions = rfc.predict(X_test3)
cm = confusion_matrix(y_test3, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

----------------------Confusion Matrix----------------------


Unnamed: 0,predict neg,predict pos
actual neg,5418,958
actual pos,1118,4882


In [111]:
print("----------------------Classification Report----------------------")
tn, fp, fn, tp = confusion_matrix(y_test3, predictions).ravel()

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
misclassification = (100 - accuracy)
sensitivity = tp / (tp + fn) * 100
specificity = tn / (tn + fp) * 100
precision = tp / (tp + fp) * 100

print(f'Accuracy: {round(accuracy, 2)}%')
print(f'Misclassification rate: {round(misclassification, 2)}%')
print(f'Recall / Sensitivity: {round(sensitivity, 2)}%')
print(f'Specificity: {round(specificity, 2)}%')
print(f'Precision: {round(precision, 2)}%')

----------------------Classification Report----------------------
Accuracy: 83.23%
Misclassification rate: 16.77%
Recall / Sensitivity: 81.37%
Specificity: 84.97%
Precision: 83.6%


In [112]:
print("-----------------------Word2vec Model 4------------------------------")
lr.fit(X_train4, y_train4)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train4, y_train4))
print('\nTest set score:', lr.score(X_test4, y_test4))

-----------------------Word2vec Model 4------------------------------
----------------------Logistic Regression Scores----------------------
Training set score: 0.781758431203534

Test set score: 0.7800581771170007


In [113]:
rfc.fit(X_train4, y_train4)

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train4, y_train4))
print('\nTest set score:', rfc.score(X_test4, y_test4))

----------------------Random Forest Scores----------------------
Training set score: 0.9999461264949898

Test set score: 0.8234486102133161


In [114]:
print("----------------------Confusion Matrix----------------------")
predictions = rfc.predict(X_test4)
cm = confusion_matrix(y_test4, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

----------------------Confusion Matrix----------------------


Unnamed: 0,predict neg,predict pos
actual neg,5396,980
actual pos,1205,4795


In [115]:
print("----------------------Classification Report----------------------")
tn, fp, fn, tp = confusion_matrix(y_test4, predictions).ravel()

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
misclassification = (100 - accuracy)
sensitivity = tp / (tp + fn) * 100
specificity = tn / (tn + fp) * 100
precision = tp / (tp + fp) * 100

print(f'Accuracy: {round(accuracy, 2)}%')
print(f'Misclassification rate: {round(misclassification, 2)}%')
print(f'Recall / Sensitivity: {round(sensitivity, 2)}%')
print(f'Specificity: {round(specificity, 2)}%')
print(f'Precision: {round(precision, 2)}%')

----------------------Classification Report----------------------
Accuracy: 82.34%
Misclassification rate: 17.66%
Recall / Sensitivity: 79.92%
Specificity: 84.63%
Precision: 83.03%


In [116]:
print("-----------------------Word2vec Model 5------------------------------")
lr.fit(X_train5, y_train5)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train5, y_train5))
print('\nTest set score:', lr.score(X_test5, y_test5))


-----------------------Word2vec Model 5------------------------------
----------------------Logistic Regression Scores----------------------
Training set score: 0.7870919081995474

Test set score: 0.7857950872656755


In [117]:
rfc.fit(X_train5, y_train5)
print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train5, y_train5))
print('\nTest set score:', rfc.score(X_test5, y_test5))


----------------------Random Forest Scores----------------------
Training set score: 0.9999461264949898

Test set score: 0.8239334195216548


In [118]:
print("----------------------Confusion Matrix----------------------")
predictions = rfc.predict(X_test5)
cm = confusion_matrix(y_test5, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

----------------------Confusion Matrix----------------------


Unnamed: 0,predict neg,predict pos
actual neg,5374,1002
actual pos,1177,4823


In [119]:
print("----------------------Classification Report----------------------")
tn, fp, fn, tp = confusion_matrix(y_test5, predictions).ravel()

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
misclassification = (100 - accuracy)
sensitivity = tp / (tp + fn) * 100
specificity = tn / (tn + fp) * 100
precision = tp / (tp + fp) * 100

print(f'Accuracy: {round(accuracy, 2)}%')
print(f'Misclassification rate: {round(misclassification, 2)}%')
print(f'Recall / Sensitivity: {round(sensitivity, 2)}%')
print(f'Specificity: {round(specificity, 2)}%')
print(f'Precision: {round(precision, 2)}%')

----------------------Classification Report----------------------
Accuracy: 82.39%
Misclassification rate: 17.61%
Recall / Sensitivity: 80.38%
Specificity: 84.28%
Precision: 82.8%


In [120]:
print("-----------------------Word2vec Model 6------------------------------")
lr.fit(X_train6, y_train6)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train6, y_train6))
print('\nTest set score:', lr.score(X_test6, y_test6))

-----------------------Word2vec Model 6------------------------------
----------------------Logistic Regression Scores----------------------
Training set score: 0.7946880724059907

Test set score: 0.7940368455074337


In [121]:
rfc.fit(X_train6, y_train6)

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train6, y_train6))
print('\nTest set score:', rfc.score(X_test6, y_test6))

----------------------Random Forest Scores----------------------
Training set score: 0.9999461264949898

Test set score: 0.8297511312217195


In [122]:
print("----------------------Confusion Matrix----------------------")
predictions = rfc.predict(X_test6)
cm = confusion_matrix(y_test6, predictions)
cm_df = pd.DataFrame(cm, columns=['predict neg', 'predict pos'], index=['actual neg', 'actual pos'])
cm_df

----------------------Confusion Matrix----------------------


Unnamed: 0,predict neg,predict pos
actual neg,5457,919
actual pos,1188,4812


In [123]:
print("----------------------Classification Report----------------------")
tn, fp, fn, tp = confusion_matrix(y_test6, predictions).ravel()

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
misclassification = (100 - accuracy)
sensitivity = tp / (tp + fn) * 100
specificity = tn / (tn + fp) * 100
precision = tp / (tp + fp) * 100

print(f'Accuracy: {round(accuracy, 2)}%')
print(f'Misclassification rate: {round(misclassification, 2)}%')
print(f'Recall / Sensitivity: {round(sensitivity, 2)}%')
print(f'Specificity: {round(specificity, 2)}%')
print(f'Precision: {round(precision, 2)}%')

----------------------Classification Report----------------------
Accuracy: 82.98%
Misclassification rate: 17.02%
Recall / Sensitivity: 80.2%
Specificity: 85.59%
Precision: 83.96%
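
To compare the six Word2vec configurations at a glance, the test scores printed above can be collected into one table. The sketch below copies the scores (rounded to four decimals) from the cell outputs; the variable names are illustrative only.

```python
# (model number, window, vector size, LR test score, RF test score),
# rounded from the outputs of cells In [100] to In [121].
results = [
    (1, 4, 100, 0.7783, 0.8208),
    (2, 6, 100, 0.7861, 0.8237),
    (3, 8, 100, 0.7973, 0.8323),
    (4, 4, 200, 0.7801, 0.8234),
    (5, 6, 200, 0.7858, 0.8239),
    (6, 8, 200, 0.7940, 0.8298),
]

print(f"{'model':>5} {'window':>6} {'size':>5} {'LR test':>8} {'RF test':>8}")
for model, window, size, lr_score, rf_score in results:
    print(f"{model:>5} {window:>6} {size:>5} {lr_score:>8.4f} {rf_score:>8.4f}")

# Pick the configuration with the highest Random Forest test score.
best = max(results, key=lambda row: row[4])
print(f"Best Random Forest test score: model {best[0]} "
      f"(window={best[1]}, size={best[2]}): {best[4]:.4f}")
```

For both vector sizes the widest window (8) gives the best scores. The Random Forest training score of about 0.9999 against test scores of 0.82 to 0.83 points to overfitting, and none of the Word2vec models beats the TF-IDF Random Forest test score of 0.848.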


## Summary

The presentation for this project can be found here: https://docs.google.com/presentation/d/12o4x_3f_1oKBUhZdDyxAAFGIv3bCiwR3xBJ7Zg6DJ9g/edit#slide=id.g710c842b61_0_109

I used the following combinations of feature extraction and classifier in this project:
- CountVectorizer and MultinomialNB
- CountVectorizer and Logistic Regression
- CountVectorizer and Random Forest
- TFIDF and MultinomialNB
- TFIDF and Logistic Regression
- TFIDF and Random Forest
- Word2vec and MultinomialNB
- Word2vec and Logistic Regression
- Word2vec and Random Forest

Of these, the best model was CountVectorizer with Random Forest, followed closely by TFIDF with Random Forest.

1) 
* CountVectorizer with ngram_range=(1,3) and stop_words=None, using Random Forest
* Best test score: 0.8516483516 => 85.16%
* True positive rate (recall / sensitivity): 84.6%
    * Among all fake news articles, 84.6% were correctly predicted as fake
* True negative rate (specificity): 85.7%
    * Among all non-fake news articles, 85.7% were correctly predicted as real
* Misclassification rate: 14.84%
    * Across all predictions, 14.84% were incorrect


2) 
* TFIDF with ngram_range=(1,3) and stop_words=None, using Random Forest
* Best test score: 0.8481738849385908 => 84.82%
* True positive rate (recall / sensitivity): 84.17%
    * Among all fake news articles, 84.17% were correctly predicted as fake
* True negative rate (specificity): 85.43%
    * Among all non-fake news articles, 85.43% were correctly predicted as real
* Misclassification rate: 15.18%
    * Across all predictions, 15.18% were incorrect

This project set out to find the best model for detecting fake news, using the simplest approach possible. Future work:
- Scrape articles from The Onion, feed them through the models, and see how well the models perform.
- With a little more tuning and testing on real-world data, this project could be valuable in an election year for identifying fake news, and during the COVID-19 pandemic for separating fake news from real news.

