# Alternus Vera Project (Team Virgo Cluster) - Yu Xu

## Introduction

Due December 12, 2018
By Team Virgo Cluster

### About this Notebook

This python notebook downloads csv files from different sources uploaded by
the individual members of the team working on each factor(s).

This is invidual notebook II to Aternus Vera Team Virgo Cluster. Highlight of this file is to combine tf-idf factors (not aggregated but as a dataframe) with other compound factors, so they are combined together to build the model for prediction. The final accuracy is <strong>~90%</strong>

After downloading the csv files, the factor columns are extracted. The highlight of this note

### <font color='red'>Alarming</font>: this notebook may take >30 mins to run. But I've run through it and guarantee it runs successfully!

## Dependencies

In [2]:
# dependencies
import pandas as pd
import nltk
import numpy as np
import io
import requests
# from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import chi2
from string import punctuation
from nltk import PorterStemmer
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from xgboost import XGBClassifier

nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/yuxu/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /Users/yuxu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/yuxu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/yuxu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Downloading Individual CSV Files (Factors)

The individual CSV files should have the same rows (fake news and all news dataset concatenated together), with articles in the same order as
prepared originally by Gene.

1. Fake news comes first before Non-fake (all) news
2. Ensuring that the counts are as follows:

```
<your_labeled_csv_data>.type.value_counts()
0    51507
1    11492
Name: type, dtype: int64
 ```
 
 3. Ensuring that the labels are complete with no "holes"

In [3]:
def get_parsed_data(url):
    return pd.read_csv(io.StringIO(requests.get(url, verify=False).content.decode('utf-8')), sep=',', header='infer')
  

### Master Dataset

In [4]:
data_kg_fake_news = get_parsed_data('https://github.com/synle/machine-learning-sample-dataset/raw/master/liar_dataset/kaggle/kaggle-fake.csv')
data_kg_nonfake_news = get_parsed_data('https://dock2.hyunwookshin.com/public/cmpe257_a1/articles1.csv')
data_kg_nonfake_news.rename(columns={"content": "text"}, inplace=True)
data_kg_nonfake_news['type'] = 0
data_kg_fake_news.loc[data_kg_fake_news['type']!='bs', 'type'] = 0
data_kg_fake_news.loc[data_kg_fake_news['type']=='bs', 'type'] = 1


all_data = pd.concat([data_kg_fake_news[['title','text','type']], data_kg_nonfake_news[['title','text','type']]])



Verify dimensions

In [5]:
assert all_data.shape[0] == 62999, "Please review your csv" # INSERTED BY JAMES

### Gene 

In [6]:
w2v_d2v_factors = pd.read_csv(io.StringIO(requests.get('https://dock2.hyunwookshin.com/public/cmpe257_a1/fake_news_w2v_d2v_only.csv', \
                                                       verify=False).content.decode('utf-8')), sep=',', header=None, names=['text_w2v_mean','title_w2v_mean','text_d2v_mean','title_d2v_mean'])



In [7]:
w2v_d2v_factors.head()

Unnamed: 0,text_w2v_mean,title_w2v_mean,text_d2v_mean,title_d2v_mean
0,-0.052476,0.066654,-0.220385,-0.114133
1,-0.126095,-0.309628,-0.048451,0.017952
2,-0.095904,-0.209579,-0.118836,-0.079925
3,-0.027249,-0.07195,-0.038647,-0.168344
4,-0.07503,-0.066515,-0.155317,-0.07495


In [8]:
assert w2v_d2v_factors.shape[0] == 62999, "Please review your csv" # INSERTED BY JAMES

### Mojdeh

In [9]:
sentiment_factors = get_parsed_data('https://raw.githubusercontent.com/mojdehkeykhanzadeh/NLP_Proj/master/all_news_sentiment.csv')



In [10]:
sentiment_factors.head()

Unnamed: 0.1,Unnamed: 0,title_senti_neg,title_senti_neu,title_senti_pos,title_senti_cmpd,text_senti_neg,text_senti_neu,text_senti_pos,text_senti_cmp
0,0,0.4588,0.0,0.625,0.375,-0.34,0.209,0.606,0.185
1,1,0.0,0.0,1.0,0.0,-0.296,0.063,0.887,0.05
2,2,0.0,0.0,1.0,0.0,0.8957,0.021,0.871,0.108
3,3,-0.7783,0.43,0.57,0.0,0.8316,0.133,0.517,0.35
4,4,0.0,0.0,1.0,0.0,0.9517,0.066,0.765,0.17


In [11]:
assert sentiment_factors.shape[0] == 62999, "Please review your csv" # INSERTED BY JAMES

### Hyunwook (James)

News event coverage scores ranging from 0 to 18 are added to the dataset

In [12]:
coverage_factor = get_parsed_data('https://dock2.hyunwookshin.com/public/cmpe257_a1/all_data_coverage_condensed.processed.csv')



In [13]:
coverage_factor.head(10)

Unnamed: 0.1,Unnamed: 0,title,Coverage
0,0,Muslims BUSTED: They Stole Millions In Gov’t B...,0
1,1,Re: Why Did Attorney General Loretta Lynch Ple...,0
2,2,BREAKING: Weiner Cooperating With FBI On Hilla...,0
3,3,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,0
4,4,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,0
5,5,Hillary Goes Absolutely Berserk On Protester A...,0
6,6,BREAKING! NYPD Ready To Make Arrests In Weiner...,0
7,7,WOW! WHISTLEBLOWER TELLS CHILLING STORY Of Mas...,0
8,8,BREAKING: CLINTON CLEARED...Was This A Coordin...,0
9,9,"EVIL HILLARY SUPPORTERS Yell ""F*ck Trump""…Burn...",0


In [14]:
assert coverage_factor.shape[0] == 62999, "Please review your csv" # INSERTED BY JAMES

### Sy

Here we have 3 scores for reputation and social activeness, and all of them ranges from 0 to 10

*   calculated_reputation_score
*   calculated_spam_score
*   calculated_social_score

In [15]:
social_reliability_factors = get_parsed_data('https://github.com/synle/machine-learning-sample-dataset/raw/master/liar_dataset/factor_social_reliablity.csv')



In [16]:
social_reliability_factors.head()

Unnamed: 0.1,Unnamed: 0,type,calculated_reputation_score,calculated_spam_score,calculated_social_score
0,0,0,8,0.0,0.011
1,1,0,8,0.0,0.011
2,2,0,8,0.0,0.011
3,3,0,8,0.68,0.0
4,4,0,8,8.65,0.0


In [17]:
assert social_reliability_factors.shape[0] == 62999, "Please review your csv" # INSERTED BY JAMES

### Lin

In [18]:
import pandas as pd
import io
import requests

def get_parsed_data2(url):
    return pd.read_csv(io.StringIO(requests.get(url, verify=False).content.decode('utf-8')), sep=',', header='infer', error_bad_lines=False)

# download and parse the dataset...
data_kg_fake_news2 = get_parsed_data2('https://github.com/synle/machine-learning-sample-dataset/raw/master/liar_dataset/kaggle/kaggle-fake.csv')



In [19]:
data_kg_nonfake_news2 = get_parsed_data2('https://dock2.hyunwookshin.com/public/cmpe257_a1/articles1.csv')



In [20]:
data_kg_nonfake_news2.rename(columns={"content": "text"}, inplace=True)
data_kg_nonfake_news2['type'] = 'non-bs'
print(data_kg_nonfake_news2.shape)
data_kg_nonfake_news2.head()

(50000, 11)


Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,text,type
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...,non-bs
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood...",non-bs
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri...",non-bs
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t...",non-bs
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ...",non-bs


Combine those two datasets, mark data "bias 443 bs 11492 conspiracy 430 fake 19 hate 246 junksci 102 satire 146 state 121" to "bs".

In [21]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
import numpy as np
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from string import punctuation
from nltk import PorterStemmer
import copy 
import re
from sklearn.model_selection import train_test_split

nltk.download('punkt')

cachedStopWords = set(stopwords.words('english') + list(punctuation) + [''])
print(data_kg_fake_news2.shape)
print(data_kg_fake_news2.groupby(['type'])['type'].count())

print(data_kg_nonfake_news2.shape)
print(data_kg_nonfake_news2.groupby(['type'])['type'].count())

data_kg_fake_news_b2=copy.deepcopy(data_kg_fake_news2);
data_kg_fake_news_b2.loc[data_kg_fake_news_b2['type']!='non-bs', 'type'] = 'bs'

all_data2 = pd.concat([data_kg_fake_news_b2[['text','type']], data_kg_nonfake_news2[['text','type']]])

print(all_data2.groupby(['type'])['type'].count())

print(all_data2.shape)
X2=all_data2['text'].astype('U')
y2=all_data2['type']

[nltk_data] Downloading package punkt to /Users/yuxu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
(12999, 20)
type
bias            443
bs            11492
conspiracy      430
fake             19
hate            246
junksci         102
satire          146
state           121
Name: type, dtype: int64
(50000, 11)
type
non-bs    50000
Name: type, dtype: int64
type
bs        12999
non-bs    50000
Name: type, dtype: int64
(62999, 2)


In [22]:
assert social_reliability_factors.shape[0] == 62999, "Please review your csv" # INSERTED BY JAMES

Now try to use TfidfVectorizer to get a matrix for further classification. Also tried applying SVD for dimension reduction.

In [23]:
def tokenize2(text):
    min_length = 3
    words = map(lambda word: word.lower(), word_tokenize(text))
    words = [word for word in words if word not in cachedStopWords]
    tokens = list(map(lambda token: PorterStemmer().stem(token), words))
    p = re.compile('[a-zA-Z]+')
    filtered_tokens = list(filter(lambda token: p.match(token) and len(token) >= min_length, tokens))
    return filtered_tokens

## Data Aggregation

This is where the magic happens. Please ensure that your dataframe follows
the dimensions, and integrate your factor columns to **all_data**.

In [24]:
### Aggregate Multiple CSV Data into One Data Frame

# #################################################################################################################################
# Only include ones that passed the 62999 Test
# This is important because the columns have -----------------------------------------------------+-------------------------------+
# to align                                                                                        | (Add your name here)          | (Factor)
# ############################################                                                    V                               V

all_data[ 'Coverage' ]    = coverage_factor[ 'Coverage' ]                                     # <-- HYUNWOOK (JAMES)              News Coverage
all_data[ 'Reputation' ]  = social_reliability_factors[ 'calculated_reputation_score' ]       # <-- SY                            Social Reliability
all_data[ 'Spam' ]        = social_reliability_factors[ 'calculated_spam_score' ]             # <-- SY
all_data[ 'Social' ]      = social_reliability_factors[ 'calculated_social_score' ]           # <-- SY
all_data[ 'title_senti_neg' ]  = sentiment_factors[ 'title_senti_neg' ]                       # <-- MOJDEH                        Sentiment
all_data[ 'title_senti_neu' ]  = sentiment_factors[ 'title_senti_neu' ]                       # <-- MOJDEH
all_data[ 'title_senti_pos'	]  = sentiment_factors[ 'title_senti_pos'	]                       # <-- MOJDEH
all_data[ 'title_senti_cmp' ]  = sentiment_factors[ 'title_senti_cmpd' ]                      # <-- MOJDEH
all_data[ 'text_senti_neg' ]   = sentiment_factors[ 'text_senti_neg' ]                        # <-- MOJDEH
all_data[ 'text_senti_neu' ]   = sentiment_factors[ 'text_senti_neu' ]                        # <-- MOJDEH
all_data[ 'text_senti_pos' ]   = sentiment_factors[ 'text_senti_pos' ]                        # <-- MOJDEH
all_data[ 'text_senti_cmp' ]   = sentiment_factors[ 'text_senti_cmp' ]                        # <-- MOJDEH
all_data[['text_w2v_mean','title_w2v_mean','text_d2v_mean','title_d2v_mean']] = w2v_d2v_factors[['text_w2v_mean','title_w2v_mean','text_d2v_mean','title_d2v_mean']]

In [25]:
all_data.head(5)

Unnamed: 0,title,text,type,Coverage,Reputation,Spam,Social,title_senti_neg,title_senti_neu,title_senti_pos,title_senti_cmp,text_senti_neg,text_senti_neu,text_senti_pos,text_senti_cmp,text_w2v_mean,title_w2v_mean,text_d2v_mean,title_d2v_mean
0,Muslims BUSTED: They Stole Millions In Gov’t B...,Print They should pay all the back all the mon...,0,0,8,0.0,0.011,0.4588,0.0,0.625,0.375,-0.34,0.209,0.606,0.185,-0.052476,0.066654,-0.220385,-0.114133
1,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,0,0,8,0.0,0.011,0.0,0.0,1.0,0.0,-0.296,0.063,0.887,0.05,-0.126095,-0.309628,-0.048451,0.017952
2,BREAKING: Weiner Cooperating With FBI On Hilla...,Red State : \nFox News Sunday reported this mo...,0,0,8,0.0,0.011,0.0,0.0,1.0,0.0,0.8957,0.021,0.871,0.108,-0.095904,-0.209579,-0.118836,-0.079925
3,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...,0,0,8,0.68,0.0,-0.7783,0.43,0.57,0.0,0.8316,0.133,0.517,0.35,-0.027249,-0.07195,-0.038647,-0.168344
4,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,0,0,8,8.65,0.0,0.0,0.0,1.0,0.0,0.9517,0.066,0.765,0.17,-0.07503,-0.066515,-0.155317,-0.07495


## Combine tf-idf together with other factors as features

In [26]:
X2 = X2.reset_index()['text']

In [27]:
vectorizer_all = TfidfVectorizer(tokenizer=tokenize2, min_df=1, max_features=50000)
vectorised_all = vectorizer_all.fit_transform(X2)

### Using RandomForrest Classifier To Determine Important Factors

In [None]:
### Define polynomial function
all_data[ 'Score' ] = \
    all_data[ 'Coverage' ] * 10 + \
    all_data[ 'Reputation' ] * 10 + \
    all_data[ 'Spam' ]  * 10 + \
    all_data[ 'Social' ] * 200 + \
    all_data[ 'title_senti_neg' ] * 20 + \
    all_data[ 'title_senti_neu' ] + \
    all_data[ 'title_senti_pos'	] * 20 + \
    all_data[ 'title_senti_cmp' ] + \
    all_data[ 'text_senti_neg' ] * 10 + \
    all_data[' text_senti_neu' ] + \
    all_data[ 'text_senti_pos' ] + \
    all_data[ 'text_senti_cmp' ] 

In [28]:
from sklearn.model_selection import train_test_split
########################################
# UPDATE YOUR FACTOR HERE

########################################
X = all_data[['Coverage', 'Reputation', 'Spam', 'Social', 'title_senti_neg', 'title_senti_neu', 'title_senti_pos', 'title_senti_cmp', 'text_senti_neg', 'text_senti_neu', 'text_senti_pos', 'text_senti_cmp','text_w2v_mean','title_w2v_mean','text_d2v_mean','title_d2v_mean' ]]
Y = all_data['type']
X_ = X.reset_index()

In [29]:
X_ = pd.concat([X_, pd.DataFrame(vectorised_all.toarray(), columns=vectorizer_all.vocabulary_)], axis = 1)

In [30]:
X_train , X_test , Y_train , Y_test = train_test_split(X_, Y, test_size=0.3)

clf = RandomForestClassifier()
clf.fit(X_train,Y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [32]:
from sklearn import metrics
Y_pred = clf.predict(X_test)
print('Accuracy Score Is:', metrics.accuracy_score(Y_test, Y_pred))

Accuracy Score Is: 0.8978306878306879


# We got around 90% accuracy in determining if the news is fake or not.