## Introduction

This Python notebook downloads the CSV files that individual team members
uploaded for the factor(s) they worked on.

After downloading the CSV files, we extract the factor columns and combine them,
using weighted values, into a score that estimates the "fake"-ness of
each news article.

Please refer to the individual notebooks to see how each CSV file
was generated.

In [0]:
# Combined notebook can be found here, but to cut down run time, the labels
# are pre-computed for each factor and uploaded as csv files.

# https://colab.research.google.com/drive/1bOoY6V0ytxSigKuZ6lJntNWJJcTM_6wU#scrollTo=myrJOvEVIhue

## Dependencies

In [38]:
# dependencies
import pandas as pd
import nltk
import numpy as np
import io
import requests
# from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import chi2
from string import punctuation
from nltk import PorterStemmer
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from xgboost import XGBClassifier

nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Downloading Individual CSV Files (Factors)

Each individual CSV file should contain the same rows (the fake news and all-news datasets concatenated together), with the articles in the same order as
originally prepared by Gene.

1. Fake news comes before non-fake (all) news.
2. Please ensure that the counts are as follows:

```
<your_labeled_csv_data>.type.value_counts()
0    51507
1    11492
Name: type, dtype: int64
```

3. Please ensure that your labels are complete, with no "holes" (missing values).
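
The row-count and "no holes" requirements above can be checked mechanically before a CSV is merged. The following is a minimal sketch, not part of the original notebook: the helper name `check_factor_csv` and the toy dataframe are hypothetical, and only the 62,999 row count comes from the assertions used in this notebook.

```python
import pandas as pd

EXPECTED_ROWS = 62999  # row count asserted throughout this notebook


def check_factor_csv(df, factor_columns, expected_rows=EXPECTED_ROWS):
    """Return a list of problems found in a factor dataframe (empty if none)."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    for col in factor_columns:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif df[col].isna().any():
            problems.append(f"{col} has {int(df[col].isna().sum())} missing values")
    return problems


# Toy frame with one hole and one absent column, checked against a toy row count of 3.
toy = pd.DataFrame({"Coverage": [0, 2, None]})
toy_problems = check_factor_csv(toy, ["Coverage", "Spam"], expected_rows=3)
print(toy_problems)
```

Returning a list of problems rather than raising on the first one lets a contributor see every issue with their file in a single pass.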

In [0]:
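def get_parsed_data(url):
    # Download the CSV over HTTP and parse it into a DataFrame.
    # Note: verify=False skips TLS certificate verification for these hosts.
    response = requests.get(url, verify=False)
    return pd.read_csv(io.StringIO(response.content.decode('utf-8')), sep=',', header='infer')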
  

### Master Dataset

In [39]:
data_kg_fake_news = get_parsed_data('https://github.com/synle/machine-learning-sample-dataset/raw/master/liar_dataset/kaggle/kaggle-fake.csv')
data_kg_nonfake_news = get_parsed_data('https://dock2.hyunwookshin.com/public/cmpe257_a1/articles1.csv')
data_kg_nonfake_news.rename(columns={"content": "text"}, inplace=True)
data_kg_nonfake_news['type'] = 0
data_kg_fake_news.loc[data_kg_fake_news['type']!='bs', 'type'] = 0
data_kg_fake_news.loc[data_kg_fake_news['type']=='bs', 'type'] = 1


all_data = pd.concat([data_kg_fake_news[['title','text','type']], data_kg_nonfake_news[['title','text','type']]])



Verify dimensions

In [0]:
assert all_data.shape[0] == 62999, "Please review your csv" # INSERTED BY JAMES

### Gene 

In [27]:
w2v_d2v_factors = pd.read_csv(
    io.StringIO(requests.get('https://dock2.hyunwookshin.com/public/cmpe257_a1/fake_news_w2v_d2v_only.csv',
                             verify=False).content.decode('utf-8')),
    delimiter=",", header=None)



In [34]:
cols = ['index','text_w2v_mean', 'title_w2v_mean', 'text_d2v_mean', 'title_d2v_mean']
w2v_d2v_factors.columns=cols
w2v_d2v_factors.head()

Unnamed: 0,index,text_w2v_mean,title_w2v_mean,text_d2v_mean,title_d2v_mean
0,0,-0.052476,0.066654,-0.220385,-0.114133
1,1,-0.126095,-0.309628,-0.048451,0.017952
2,2,-0.095904,-0.209579,-0.118836,-0.079925
3,3,-0.027249,-0.07195,-0.038647,-0.168344
4,4,-0.07503,-0.066515,-0.155317,-0.07495


In [0]:
assert w2v_d2v_factors.shape[0] == 62999, "Please review your csv" # INSERTED BY JAMES

### Mojdeh

In [41]:
sentiment_factors = get_parsed_data('https://raw.githubusercontent.com/mojdehkeykhanzadeh/NLP_Proj/master/all_news_sentiment.csv')



In [0]:
sentiment_factors

Unnamed: 0.1,Unnamed: 0,title_senti_neg,title_senti_neu,title_senti_pos,title_senti_cmpd,text_senti_neg,text_senti_neu,text_senti_pos,text_senti_cmp
0,0,0.4588,0.000,0.625,0.375,-0.3400,0.209,0.606,0.185
1,1,0.0000,0.000,1.000,0.000,-0.2960,0.063,0.887,0.050
2,2,0.0000,0.000,1.000,0.000,0.8957,0.021,0.871,0.108
3,3,-0.7783,0.430,0.570,0.000,0.8316,0.133,0.517,0.350
4,4,0.0000,0.000,1.000,0.000,0.9517,0.066,0.765,0.170
5,5,-0.2500,0.250,0.750,0.000,-0.9936,0.352,0.618,0.030
6,6,-0.3400,0.107,0.893,0.000,-0.9559,0.103,0.813,0.084
7,7,-0.4588,0.317,0.528,0.154,-0.9836,0.138,0.844,0.019
8,8,0.3400,0.102,0.678,0.220,0.1027,0.044,0.902,0.054
9,9,-0.4019,0.243,0.608,0.149,-0.8402,0.151,0.809,0.040


In [0]:
assert sentiment_factors.shape[0] == 62999, "Please review your csv" # INSERTED BY JAMES

### Hyunwook (James)

News event coverage scores, ranging from 0 to 18, are added to the dataset.

In [43]:
coverage_factor = get_parsed_data('https://dock2.hyunwookshin.com/public/cmpe257_a1/all_data_coverage_condensed.processed.csv')



In [0]:
coverage_factor.head(10)

Unnamed: 0.1,Unnamed: 0,title,Coverage
0,0,Muslims BUSTED: They Stole Millions In Gov’t B...,0
1,1,Re: Why Did Attorney General Loretta Lynch Ple...,0
2,2,BREAKING: Weiner Cooperating With FBI On Hilla...,0
3,3,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,0
4,4,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,0
5,5,Hillary Goes Absolutely Berserk On Protester A...,0
6,6,BREAKING! NYPD Ready To Make Arrests In Weiner...,0
7,7,WOW! WHISTLEBLOWER TELLS CHILLING STORY Of Mas...,0
8,8,BREAKING: CLINTON CLEARED...Was This A Coordin...,0
9,9,"EVIL HILLARY SUPPORTERS Yell ""F*ck Trump""…Burn...",0


In [0]:
assert coverage_factor.shape[0] == 62999, "Please review your csv" # INSERTED BY JAMES

### Sy

Here we have three scores for reputation and social activity, each ranging from 0 to 10:

*   calculated_reputation_score
*   calculated_spam_score
*   calculated_social_score
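
Since each score is expected to stay within a stated range, a quick range check can catch a mis-scaled column before it is added to the combined score. This is a minimal sketch under assumptions: the helper `out_of_range` and the toy values are hypothetical and do not come from the original notebook.

```python
import pandas as pd


def out_of_range(scores, low, high):
    """Return the entries of `scores` that fall outside [low, high]."""
    return scores[(scores < low) | (scores > high)]


# Toy values: the last one exceeds the 0 to 10 range described above.
toy_scores = pd.Series([0.0, 8.65, 9.95, 10.4])
bad = out_of_range(toy_scores, 0, 10)
print(bad)
```

Missing values compare as false on both sides, so this check does not flag them; they need a separate `isna()` check.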

In [44]:
social_reliability_factors = get_parsed_data('https://github.com/synle/machine-learning-sample-dataset/raw/master/liar_dataset/factor_social_reliablity.csv')



In [0]:
social_reliability_factors

Unnamed: 0.1,Unnamed: 0,type,calculated_reputation_score,calculated_spam_score,calculated_social_score
0,0,0,8,0.00,0.011
1,1,0,8,0.00,0.011
2,2,0,8,0.00,0.011
3,3,0,8,0.68,0.000
4,4,0,8,8.65,0.000
5,5,0,8,0.00,0.011
6,6,0,8,7.01,0.000
7,7,0,8,1.88,0.000
8,8,0,8,1.44,0.000
9,9,0,8,9.95,0.000


In [0]:
assert social_reliability_factors.shape[0] == 62999, "Please review your csv" # INSERTED BY JAMES

### Lin

In [0]:
import pandas as pd
import io
import requests

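def get_parsed_data2(url):
    # on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
    return pd.read_csv(io.StringIO(requests.get(url, verify=False).content.decode('utf-8')), sep=',', header='infer', on_bad_lines='skip')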

# download and parse the dataset...
data_kg_fake_news = get_parsed_data2('https://github.com/synle/machine-learning-sample-dataset/raw/master/liar_dataset/kaggle/kaggle-fake.csv')



In [0]:
data_kg_nonfake_news = get_parsed_data2('https://dock2.hyunwookshin.com/public/cmpe257_a1/articles1.csv')



In [0]:
data_kg_nonfake_news.rename(columns={"content": "text"}, inplace=True)
data_kg_nonfake_news['type'] = 'non-bs'
print(data_kg_nonfake_news.shape)
data_kg_nonfake_news.head()

(50000, 11)


Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,text,type
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...,non-bs
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood...",non-bs
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri...",non-bs
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t...",non-bs
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ...",non-bs


Combine the two datasets and relabel every category in the fake-news dataset (bias: 443, bs: 11492, conspiracy: 430, fake: 19, hate: 246, junksci: 102, satire: 146, state: 121) as "bs".

In [0]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
import numpy as np
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from string import punctuation
from nltk import PorterStemmer
import copy 
import re
from sklearn.model_selection import train_test_split

nltk.download('punkt')

cachedStopWords = set(stopwords.words('english') + list(punctuation) + [''])
print(data_kg_fake_news.shape)
print(data_kg_fake_news.groupby(['type'])['type'].count())

print(data_kg_nonfake_news.shape)
print(data_kg_nonfake_news.groupby(['type'])['type'].count())

# Work on a copy and collapse every fake-news category into a single 'bs' label
data_kg_fake_news_b = copy.deepcopy(data_kg_fake_news)
data_kg_fake_news_b.loc[data_kg_fake_news_b['type'] != 'non-bs', 'type'] = 'bs'

all_data = pd.concat([data_kg_fake_news_b[['text','type']], data_kg_nonfake_news[['text','type']]])

print(all_data.groupby(['type'])['type'].count())

print(all_data.shape)
X=all_data['text'].astype('U')
y=all_data['type']

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
(12999, 20)
type
bias            443
bs            11492
conspiracy      430
fake             19
hate            246
junksci         102
satire          146
state           121
Name: type, dtype: int64
(50000, 11)
type
non-bs    50000
Name: type, dtype: int64
type
bs        12999
non-bs    50000
Name: type, dtype: int64
(62999, 2)


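Next, use TfidfVectorizer to turn the text into a feature matrix for classification. Dimensionality reduction with truncated SVD was also tried; in the cell below that pipeline is commented out and only the vectorizer is applied.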

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn import svm
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22 and is not used below
from sklearn.preprocessing import LabelEncoder, MaxAbsScaler
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

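def tokenize2(text):
    """Lowercase, drop stop words and punctuation, stem, then filter short tokens."""
    min_length = 3
    stemmer = PorterStemmer()  # create one stemmer instead of one per token
    words = (word.lower() for word in word_tokenize(text))
    words = [word for word in words if word not in cachedStopWords]
    tokens = [stemmer.stem(word) for word in words]
    # keep tokens that start with a letter and have at least min_length characters
    p = re.compile('[a-zA-Z]+')
    filtered_tokens = [token for token in tokens if p.match(token) and len(token) >= min_length]
    return filtered_tokens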

vectorizer = TfidfVectorizer(tokenizer=tokenize2)
svd_model = TruncatedSVD(n_components=200,       
                         algorithm='randomized',
                         n_iter=10)
# svd_transformer = Pipeline([('tfidf', vectorizer), 
#                             ('svd', svd_model)])
svd_transformer=vectorizer
    
vectorised_train_documents = svd_transformer.fit_transform(X_train)
vectorised_test_documents = svd_transformer.transform(X_test)
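
The cell above leaves the SVD step commented out. As a rough sketch of how the TF-IDF and truncated SVD stages could be chained, the example below runs the same two scikit-learn classes on a made-up toy corpus. The corpus, the `n_components=2` setting, and `random_state=0` are assumptions chosen for illustration; the notebook's own configuration uses 200 components and the custom `tokenize2` tokenizer.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration only.
train_docs = [
    "senate passes budget bill",
    "shocking secret the media will not tell you",
    "court rules on budget dispute",
    "you will not believe this shocking trick",
]
test_docs = ["senate budget dispute"]

# n_components=2 only because the toy vocabulary is tiny.
svd_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=2, n_iter=10, random_state=0)),
])

train_reduced = svd_pipeline.fit_transform(train_docs)  # fit on training text only
test_reduced = svd_pipeline.transform(test_docs)        # reuse the fitted vocabulary and SVD
print(train_reduced.shape, test_reduced.shape)
```

Fitting on the training split and only transforming the test split mirrors the `fit_transform` / `transform` pattern already used in the cell above.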

Next, train and tune two models:
- Random forest
- Logistic regression

In [0]:
gs=RandomForestClassifier()
gs.fit(vectorised_train_documents, y_train)
print(vectorised_train_documents.shape)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
feature_imp = pd.Series(gs.feature_importances_, index=vectorizer.get_feature_names_out()).nlargest(20)
print(feature_imp)

y_pred=gs.predict(vectorised_test_documents)
print(y_test.value_counts(sort=False))
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))

(44099, 178824)
said         0.017446
u.s.         0.011317
n't          0.008697
twitter      0.007315
octob        0.006272
advertis     0.003722
sunday       0.003615
breitbart    0.003405
novemb       0.003116
com          0.002982
e-mail       0.002953
click        0.002899
mr.          0.002824
week         0.002718
saturday     0.002639
share        0.002631
follow       0.002571
cnn          0.002528
war          0.002482
presid       0.002449
dtype: float64
bs         3880
non-bs    15020
Name: type, dtype: int64
Accuracy: 0.8895767195767196
[[ 2057  1823]
 [  264 14756]]
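
The accuracy figure hides how unevenly the two classes are handled. The arithmetic below works through the confusion matrix printed above, assuming scikit-learn's convention that rows are true labels and columns are predictions, in sorted label order ("bs", "non-bs"). The row sums (3880 and 15020) match the `y_test` counts in the output, which is consistent with that reading.

```python
# Confusion matrix from the random forest output above.
tp, fn = 2057, 1823   # true "bs":     predicted "bs", predicted "non-bs"
fp, tn = 264, 14756   # true "non-bs": predicted "bs", predicted "non-bs"

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall_bs = tp / (tp + fn)       # share of "bs" articles that were caught
precision_bs = tp / (tp + fp)    # share of "bs" predictions that were correct

print(f"accuracy={accuracy:.4f} recall={recall_bs:.4f} precision={precision_bs:.4f}")
```

Roughly 53% of the "bs" articles are recovered, while about 89% of the "bs" predictions are correct, so most of the errors are fake articles labeled as non-fake.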


In [0]:
# logistic = LogisticRegression()
# logistic = LogisticRegression(class_weight='balanced')
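# solver='liblinear' is set explicitly because it supports the l1 penalty used in the grid below
logistic = LogisticRegression(class_weight={"bs":5,"non-bs":3}, solver='liblinear')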
# C = [0.1, 1]
# penalty = ['l1','l2']
C = [1]
penalty = ['l1']

param_grid = dict(C=C, penalty=penalty)
gs = GridSearchCV(logistic, param_grid=param_grid, cv= 5, scoring='accuracy')

gs.fit(vectorised_train_documents, y_train)
print(gs.best_params_)

y_pred=gs.predict(vectorised_test_documents)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("F1:",metrics.f1_score(y_test, y_pred, pos_label='bs'))
print(metrics.confusion_matrix(y_test, y_pred))

NameError: ignored

## Data Aggregation

This is where the factor columns come together. Please make sure your dataframe
has the expected 62,999 rows, then add your factor columns to **all_data**.
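
One detail worth checking here: assigning a column from another dataframe aligns rows by index label, not by position. The master dataset is built with `pd.concat` without `ignore_index=True`, and the score output later in this notebook shows index labels restarting for the second dataset, so label-based alignment may be mixing rows. The toy example below is a sketch of the effect with made-up frames; it is not the notebook's data.

```python
import pandas as pd

fake = pd.DataFrame({"title": ["f0", "f1"]})
real = pd.DataFrame({"title": ["r0", "r1", "r2"]})
factor = pd.Series([10, 11, 12, 13, 14])  # one value per article, index 0..4

# Plain concat keeps each frame's own labels: 0, 1, 0, 1, 2.
stacked = pd.concat([fake, real])
stacked["factor"] = factor
print(stacked["factor"].tolist())    # [10, 11, 10, 11, 12]

# ignore_index=True relabels the rows 0..4, so values line up by position.
aligned = pd.concat([fake, real], ignore_index=True)
aligned["factor"] = factor
print(aligned["factor"].tolist())    # [10, 11, 12, 13, 14]
```

If this applies to `all_data`, passing `ignore_index=True` to the concat (or calling `reset_index(drop=True)` afterwards) would make each factor row land on its own article.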

In [0]:
### Aggregate Multiple CSV Data into One Data Frame

# #################################################################################################################################
# Only include ones that passed the 62999 Test
# This is important because the columns have -----------------------------------------------------+-------------------------------+
# to align                                                                                        | (Add your name here)          | (Factor)
# ############################################                                                    V                               V

all_data[ 'Coverage' ]    = coverage_factor[ 'Coverage' ]                                     # <-- HYUNWOOK (JAMES)              News Coverage
all_data[ 'Reputation' ]  = social_reliability_factors[ 'calculated_reputation_score' ]       # <-- SY                            Social Reliability
all_data[ 'Spam' ]        = social_reliability_factors[ 'calculated_spam_score' ]             # <-- SY
all_data[ 'Social' ]      = social_reliability_factors[ 'calculated_social_score' ]           # <-- SY
all_data[ 'title_senti_neg' ]  = sentiment_factors[ 'title_senti_neg' ]                       # <-- MOJDEH                        Sentiment
all_data[ 'title_senti_neu' ]  = sentiment_factors[ 'title_senti_neu' ]                       # <-- MOJDEH
all_data[ 'title_senti_pos' ]  = sentiment_factors[ 'title_senti_pos' ]                       # <-- MOJDEH
all_data[ 'title_senti_cmp' ]  = sentiment_factors[ 'title_senti_cmpd' ]                      # <-- MOJDEH
all_data[ 'text_senti_neg' ]   = sentiment_factors[ 'text_senti_neg' ]                        # <-- MOJDEH
all_data[ 'text_senti_neu' ]   = sentiment_factors[ 'text_senti_neu' ]                        # <-- MOJDEH
all_data[ 'text_senti_pos' ]   = sentiment_factors[ 'text_senti_pos' ]                        # <-- MOJDEH
all_data[ 'text_senti_cmp' ]   = sentiment_factors[ 'text_senti_cmp' ]                        # <-- MOJDEH
all_data[ 'text_w2v_mean' ]    = w2v_d2v_factors[ 'text_w2v_mean' ]                           # <-- GENE                          w2v / d2v
all_data[ 'title_w2v_mean' ]   = w2v_d2v_factors[ 'title_w2v_mean' ]                          # <-- GENE
all_data[ 'text_d2v_mean' ]    = w2v_d2v_factors[ 'text_d2v_mean' ]                           # <-- GENE
all_data[ 'title_d2v_mean' ]   = w2v_d2v_factors[ 'title_d2v_mean' ]                          # <-- GENE

In [47]:
all_data.head(5)

Unnamed: 0,title,text,type,Coverage,Reputation,Spam,Social,title_senti_neg,title_senti_neu,title_senti_pos,title_senti_cmp,text_senti_neg,text_senti_neu,text_senti_pos,text_senti_cmp,text_w2v_mean,title_w2v_mean,text_d2v_mean,title_d2v_mean
0,Muslims BUSTED: They Stole Millions In Gov’t B...,Print They should pay all the back all the mon...,0,0,8,0.0,0.011,0.4588,0.0,0.625,0.375,-0.34,0.209,0.606,0.185,-0.052476,0.066654,-0.220385,-0.114133
1,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,0,0,8,0.0,0.011,0.0,0.0,1.0,0.0,-0.296,0.063,0.887,0.05,-0.126095,-0.309628,-0.048451,0.017952
2,BREAKING: Weiner Cooperating With FBI On Hilla...,Red State : \nFox News Sunday reported this mo...,0,0,8,0.0,0.011,0.0,0.0,1.0,0.0,0.8957,0.021,0.871,0.108,-0.095904,-0.209579,-0.118836,-0.079925
3,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...,0,0,8,0.68,0.0,-0.7783,0.43,0.57,0.0,0.8316,0.133,0.517,0.35,-0.027249,-0.07195,-0.038647,-0.168344
4,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,0,0,8,8.65,0.0,0.0,0.0,1.0,0.0,0.9517,0.066,0.765,0.17,-0.07503,-0.066515,-0.155317,-0.07495


## Polynomial Equation

In [0]:
### Define polynomial function
# Unweighted sum of all sixteen factor columns
all_data[ 'Score' ] = (
    all_data[ 'Coverage' ] + all_data[ 'Reputation' ] + all_data[ 'Spam' ] + all_data[ 'Social' ]
    + all_data[ 'title_senti_neg' ] + all_data[ 'title_senti_neu' ] + all_data[ 'title_senti_pos' ] + all_data[ 'title_senti_cmp' ]
    + all_data[ 'text_senti_neg' ] + all_data[ 'text_senti_neu' ] + all_data[ 'text_senti_pos' ] + all_data[ 'text_senti_cmp' ]
    + all_data[ 'text_w2v_mean' ] + all_data[ 'title_w2v_mean' ] + all_data[ 'text_d2v_mean' ] + all_data[ 'title_d2v_mean' ]
)
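
The cell above adds every factor with an implicit weight of 1. Since the introduction describes the score as using weighted values, here is a minimal sketch of how per-factor weights could be applied. The weights and the toy dataframe are hypothetical, chosen only for illustration; they are not the team's actual weights.

```python
import pandas as pd

# Hypothetical weights, chosen only for illustration.
weights = {"Coverage": 0.5, "Reputation": 0.25, "Spam": 0.25}

toy = pd.DataFrame({
    "Coverage": [0, 4],
    "Reputation": [8, 2],
    "Spam": [1.0, 5.0],
})

# Multiply each column by its weight, then sum across columns for each row.
weight_series = pd.Series(weights)
toy["Score"] = toy[weight_series.index].mul(weight_series, axis=1).sum(axis=1)
print(toy["Score"].tolist())
```

Keeping the weights in one dictionary makes it easy to adjust a single factor's influence without rewriting the sum. Note that `sum(axis=1)` skips missing values by default, unlike the `+` chain, which propagates them.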

In [49]:
all_data['Score']

0         9.809461
1         9.248778
2        10.402457
3        10.427110
4        19.230888
5         8.721359
6        15.345283
7        10.059044
8        11.608526
9        18.406794
10        8.985965
11       20.388263
12       11.178958
13        8.410227
14       10.592515
15       10.937271
16       13.924439
17       10.274847
18       10.651439
19       11.442158
20       13.057999
21        8.650405
22        9.274496
23       10.488185
24       10.788798
25       10.573849
26        9.149650
27       13.419171
28        9.219040
29       10.641042
           ...    
49970     1.481569
49971     2.794354
49972     3.605974
49973     2.983899
49974    -0.036702
49975     2.106052
49976     1.918986
49977    -0.421742
49978     2.430215
49979     0.795236
49980     2.305448
49981     3.147310
49982     0.186210
49983     3.249474
49984    -0.571762
49985     2.939873
49986     4.727411
49987     2.757415
49988     3.257516
49989     4.673154
49990     0.736079
49991     2.

## Final Fake-news Score