<!-- Tala Vahedi
October 30, 2021
Week Ten Assignment

Script Purpose: Perform the Scikit-Learn Tutorial
Script Version: 1.0 
Script Author:  Tala Vahedi, University of Arizona

Script Revision History:
Version 1.0 Oct 30, 2021, Python 3.x

Using the Scikit-Learn Tutorial found here:  https://www.dataquest.io/blog/sci-kit-learn-tutorial/ 
Perform the operations from the beginning of the Tutorial stopping when you get to the Building the 
Model Section.

Create a Report that captures each major step of the process and provide a short write-up of what 
you learned during that step and which areas you are still confused about.
 -->

In [154]:
# Script Purpose: Fake News Detection
# Script Version: 1.0 
# Script Author:  Tala Vahedi, University of Arizona

# Script Revision History:
# Version 1.0 Dec 6, 2021, Python 3.x

# Pseudo Constants
SCRIPT_NAME    = "Script: Fake News Detection"
SCRIPT_VERSION = "Version 1.0"
SCRIPT_AUTHOR  = "Author: Tala Vahedi"

# Standard library and third-party imports
import re
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [155]:
# Print Basic Script Information
print()
print(SCRIPT_NAME)
print(SCRIPT_VERSION)
print(SCRIPT_AUTHOR)
print()  


Script: Final Scripting Project
Version 1.0
Author: Tala Vahedi



In [156]:
# Read in the data with `read_csv()`
data = pd.read_csv("News.csv", dtype=str)

# Using .head() method to view the first few records of the data set
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,subject,date,label
0,0,donald trump sends embarrassing new years eve ...,donald trump couldn t all americans happy new ...,news,"december 31, 2017",fake
1,1,drunk bragging trump staffer started russian c...,house intelligence committee chairman devin nu...,news,"december 31, 2017",fake
2,2,sheriff david clarke internet joke threatening...,friday revealed milwaukee sheriff david clarke...,news,"december 30, 2017",fake
3,3,trump obsessed obamas coded website images,christmas day donald trump announced work day ...,news,"december 29, 2017",fake
4,4,pope francis called donald trump christmas speech,pope francis annual christmas day message rebu...,news,"december 25, 2017",fake


In [157]:
# using the dtypes attribute to display the data type of each column
data.dtypes

Unnamed: 0    object
title         object
text          object
subject       object
date          object
label         object
dtype: object

In [159]:
#feature engineering: expand the date column into month, day, year
data1 = pd.DataFrame(data.date.str.split(' ', n=2).tolist(), columns = ['month','day', 'year'])
data['month'] = data1['month']
data['day'] = data1['day']
data['year'] = data1['year']

# dropping columns that are no longer needed
data = data.drop(columns=['date', 'Unnamed: 0'])
data['day'] = data['day'].str.replace(',','')
# dropna() returns a new DataFrame; the result is not assigned, so no rows are removed here
data.dropna()

# print first five rows
data.head()

Unnamed: 0,title,text,subject,label,month,day,year
0,donald trump sends embarrassing new years eve ...,donald trump couldn t all americans happy new ...,news,fake,december,31,2017
1,drunk bragging trump staffer started russian c...,house intelligence committee chairman devin nu...,news,fake,december,31,2017
2,sheriff david clarke internet joke threatening...,friday revealed milwaukee sheriff david clarke...,news,fake,december,30,2017
3,trump obsessed obamas coded website images,christmas day donald trump announced work day ...,news,fake,december,29,2017
4,pope francis called donald trump christmas speech,pope francis annual christmas day message rebu...,news,fake,december,25,2017
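
The date split above can also be written with `expand=True`, which returns the pieces as columns directly. The following is a minimal sketch on made-up rows; the `toy` frame and its values are illustrative and are not taken from News.csv.

```python
import pandas as pd

# Toy rows in the same "month day, year" shape as the date column (illustrative only)
toy = pd.DataFrame({"date": ["december 31, 2017", "december 30, 2017", "march 5, 2016"]})

# n=2 caps the split at three pieces; expand=True returns them as separate columns
parts = toy["date"].str.split(" ", n=2, expand=True)
parts.columns = ["month", "day", "year"]

# strip the trailing comma from the day
parts["day"] = parts["day"].str.replace(",", "", regex=False)

# replace the original date column with the three new columns
toy = pd.concat([toy.drop(columns="date"), parts], axis=1)
print(toy)
```

Passing `n` and `columns` as keywords avoids the positional forms, which newer pandas releases no longer accept.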


In [160]:
#feature engineering: get total number of words in title and text columns
data["numWordsTitle"] = data['title'].apply(lambda x: len(str(x).split(' ')))
data["numWordsText"] = data['text'].apply(lambda x: len(str(x).split(' ')))

# print first five rows
data.head()

Unnamed: 0,title,text,subject,label,month,day,year,numWordsTitle,numWordsText
0,donald trump sends embarrassing new years eve ...,donald trump couldn t all americans happy new ...,news,fake,december,31,2017,9,297
1,drunk bragging trump staffer started russian c...,house intelligence committee chairman devin nu...,news,fake,december,31,2017,8,173
2,sheriff david clarke internet joke threatening...,friday revealed milwaukee sheriff david clarke...,news,fake,december,30,2017,9,353
3,trump obsessed obamas coded website images,christmas day donald trump announced work day ...,news,fake,december,29,2017,6,261
4,pope francis called donald trump christmas speech,pope francis annual christmas day message rebu...,news,fake,december,25,2017,7,188


In [161]:
#feature engineering: getting the number of positive words, and negative words in text
with open("POSITIVE_WORDS.txt", 'r') as positiveWordList:
    positiveWords = positiveWordList.read()
POSITIVE_WORDS = positiveWords.split()

with open("NEGATIVE_WORDS.txt", 'r') as negativeWordList:
    negativeWords = negativeWordList.read()
NEGATIVE_WORDS = negativeWords.split()

txt = data['text'].astype(str)

data['numPosWords'] = txt.apply(lambda x: len([i for i in x.split() if i in POSITIVE_WORDS]))
data['numNegWords'] = txt.apply(lambda x: len([i for i in x.split() if i in NEGATIVE_WORDS]))

# print first five rows
data.head()

Unnamed: 0,title,text,subject,label,month,day,year,numWordsTitle,numWordsText,numPosWords,numNegWords
0,donald trump sends embarrassing new years eve ...,donald trump couldn t all americans happy new ...,news,fake,december,31,2017,9,297,30,30
1,drunk bragging trump staffer started russian c...,house intelligence committee chairman devin nu...,news,fake,december,31,2017,8,173,10,13
2,sheriff david clarke internet joke threatening...,friday revealed milwaukee sheriff david clarke...,news,fake,december,30,2017,9,353,10,32
3,trump obsessed obamas coded website images,christmas day donald trump announced work day ...,news,fake,december,29,2017,6,261,18,11
4,pope francis called donald trump christmas speech,pope francis annual christmas day message rebu...,news,fake,december,25,2017,7,188,11,8
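
The word counts above test each token against a Python list, which scans the whole list for every token. A possible alternative, sketched here under the assumption that the lexicons hold one word per entry, is to store them in sets. The two mini-lexicons and the `count_in_lexicon` helper are hypothetical stand-ins, not the contents of POSITIVE_WORDS.txt or NEGATIVE_WORDS.txt.

```python
import pandas as pd

# Hypothetical mini-lexicons standing in for the two word-list files
positive = {"happy", "good", "great"}
negative = {"bad", "sad"}

def count_in_lexicon(text, lexicon):
    """Count the tokens of `text` found in `lexicon`; repeated tokens count each time."""
    return sum(1 for token in str(text).split() if token in lexicon)

toy = pd.DataFrame({"text": ["happy happy good day", "bad sad day", "nothing here"]})
toy["numPosWords"] = toy["text"].apply(lambda t: count_in_lexicon(t, positive))
toy["numNegWords"] = toy["text"].apply(lambda t: count_in_lexicon(t, negative))
print(toy)
```

Set membership is a constant-time lookup on average, so this should give the same counts as the list version while scaling better on long articles.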


In [163]:
#feature engineering: getting the percentage of positive words and negative words in text
data['percPosWords'] = round((data['numPosWords']/data['numWordsText'])*100).astype(int)
data['percNegWords'] = round((data['numNegWords']/data['numWordsText'])*100).astype(int)

# print first five rows
data.head()

Unnamed: 0,title,text,subject,label,month,day,year,numWordsTitle,numWordsText,numPosWords,numNegWords,percPosWords,percNegWords
0,donald trump sends embarrassing new years eve ...,donald trump couldn t all americans happy new ...,news,fake,december,31,2017,9,297,30,30,10,10
1,drunk bragging trump staffer started russian c...,house intelligence committee chairman devin nu...,news,fake,december,31,2017,8,173,10,13,6,8
2,sheriff david clarke internet joke threatening...,friday revealed milwaukee sheriff david clarke...,news,fake,december,30,2017,9,353,10,32,3,9
3,trump obsessed obamas coded website images,christmas day donald trump announced work day ...,news,fake,december,29,2017,6,261,18,11,7,4
4,pope francis called donald trump christmas speech,pope francis annual christmas day message rebu...,news,fake,december,25,2017,7,188,11,8,6,4


In [165]:
#feature engineering: getting the pos tag counts from the text column
import nltk
from nltk.tag import pos_tag
from nltk import word_tokenize
from collections import Counter

txt = data['text'].astype(str)

tokens = txt.apply(word_tokenize) 
posTags = tokens.apply(pos_tag)

tagCountDF = pd.DataFrame(posTags.map(lambda x: Counter(tag[1] for tag in x)).to_list())
data1 = pd.concat([data, tagCountDF], axis=1).fillna(0)

# print first five rows
data1.head()

Unnamed: 0,title,text,subject,label,month,day,year,numWordsTitle,numWordsText,numPosWords,...,NNPS,CC,UH,PRP$,WDT,EX,POS,'',PDT,SYM
0,donald trump sends embarrassing new years eve ...,donald trump couldn t all americans happy new ...,news,fake,december,31,2017,9,297,30,...,0,0,0,0,0,0,0,0,0,0
1,drunk bragging trump staffer started russian c...,house intelligence committee chairman devin nu...,news,fake,december,31,2017,8,173,10,...,0,0,0,0,0,0,0,0,0,0
2,sheriff david clarke internet joke threatening...,friday revealed milwaukee sheriff david clarke...,news,fake,december,30,2017,9,353,10,...,0,0,0,0,0,0,0,0,0,0
3,trump obsessed obamas coded website images,christmas day donald trump announced work day ...,news,fake,december,29,2017,6,261,18,...,0,0,0,0,0,0,0,0,0,0
4,pope francis called donald trump christmas speech,pope francis annual christmas day message rebu...,news,fake,december,25,2017,7,188,11,...,0,0,0,0,0,0,0,0,0,0
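
To make the tag-count step easier to follow, here is a small sketch of how a list of `Counter` objects turns into a DataFrame. The `(token, tag)` pairs are hand-written stand-ins for `pos_tag` output, so the example runs without NLTK or its tagger data.

```python
from collections import Counter

import pandas as pd

# Hand-written (token, tag) pairs standing in for nltk.pos_tag output (illustrative only)
tagged_docs = [
    [("trump", "NN"), ("sends", "VBZ"), ("new", "JJ"), ("message", "NN")],
    [("pope", "NN"), ("called", "VBD")],
]

# One Counter of tag frequencies per document
tag_counts = [Counter(tag for _, tag in doc) for doc in tagged_docs]

# Each distinct tag becomes a column; a tag missing from a document is NaN until filled
tag_df = pd.DataFrame(tag_counts).fillna(0).astype(int)
print(tag_df)
```

The NaN values introduced for missing tags are what make the count columns float64 before they are filled and cast.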


In [166]:
# using the dtypes attribute to display the data type of each column
data1.dtypes

title             object
text              object
subject           object
label             object
month             object
day               object
year              object
numWordsTitle      int64
numWordsText       int64
numPosWords        int64
numNegWords        int64
percPosWords       int64
percNegWords       int64
JJ               float64
NN               float64
IN               float64
DT               float64
NNS              float64
VBP              float64
CD               float64
RB               float64
VBZ              float64
JJR              float64
RBR              float64
VBD              float64
VBG              float64
VB               float64
MD               float64
VBN              float64
PRP              float64
NNP              float64
$                float64
WP               float64
RBS              float64
RP               float64
TO               float64
FW               float64
JJS              float64
WRB              float64
NNPS             float64


In [171]:
# converting the pos tag count columns (the 14th column onward) from float to int
posTagColumns = data1.columns[13:]
data1[posTagColumns] = data1[posTagColumns].astype(int)

# showing all columns data types
data1.dtypes

title            object
text             object
subject          object
label            object
month            object
day              object
year             object
numWordsTitle     int64
numWordsText      int64
numPosWords       int64
numNegWords       int64
percPosWords      int64
percNegWords      int64
JJ                int64
NN                int64
IN                int64
DT                int64
NNS               int64
VBP               int64
CD                int64
RB                int64
VBZ               int64
JJR               int64
RBR               int64
VBD               int64
VBG               int64
VB                int64
MD                int64
VBN               int64
PRP               int64
NNP               int64
$                 int64
WP                int64
RBS               int64
RP                int64
TO                int64
FW                int64
JJS               int64
WRB               int64
NNPS              int64
CC                int64
UH              

In [172]:
#import the necessary module
from sklearn.preprocessing import LabelEncoder

txt = data1['text'].astype(str)
day = data1['day'].astype(str)
yr = data1['year'].astype(str)

# create the LabelEncoder object
le = LabelEncoder()
#convert the categorical columns into numeric
data1['title'] = le.fit_transform(data1['title'])
data1['text'] = le.fit_transform(txt)
data1['subject'] = le.fit_transform(data1['subject'])
data1['label'] = le.fit_transform(data1['label'])
data1['month'] = le.fit_transform(data1['month'])
data1['day'] = le.fit_transform(day)
data1['year'] = le.fit_transform(yr)

#display the initial records
data1.head()

Unnamed: 0,title,text,subject,label,month,day,year,numWordsTitle,numWordsText,numPosWords,...,NNPS,CC,UH,PRP$,WDT,EX,POS,'',PDT,SYM
0,7780,8027,3,0,11,25,4,9,297,30,...,0,0,0,0,0,0,0,0,0,0
1,7946,12396,3,0,11,25,4,8,173,10,...,0,0,0,0,0,0,0,0,0,0
2,25508,10583,3,0,11,24,4,9,353,10,...,0,0,0,0,0,0,0,0,0,0
3,30127,5976,3,0,11,22,4,6,261,18,...,0,0,0,0,0,0,0,0,0,0
4,21500,21496,3,0,11,18,4,7,188,11,...,0,0,0,0,0,0,0,0,0,0


In [173]:
#assigning the target variable
target = data1['label']
# print first 10 rows
data1.head(10)

Unnamed: 0,title,text,subject,label,month,day,year,numWordsTitle,numWordsText,numPosWords,...,NNPS,CC,UH,PRP$,WDT,EX,POS,'',PDT,SYM
0,7780,8027,3,0,11,25,4,9,297,30,...,0,0,0,0,0,0,0,0,0,0
1,7946,12396,3,0,11,25,4,8,173,10,...,0,0,0,0,0,0,0,0,0,0
2,25508,10583,3,0,11,24,4,9,353,10,...,0,0,0,0,0,0,0,0,0,0
3,30127,5976,3,0,11,22,4,6,261,18,...,0,0,0,0,0,0,0,0,0,0
4,21500,21496,3,0,11,18,4,7,188,11,...,0,0,0,0,0,0,0,0,0,0
5,22302,19881,3,0,11,18,4,9,142,4,...,0,0,0,0,0,0,0,0,0,0
6,10687,8979,3,0,11,16,4,10,215,15,...,0,0,0,0,0,0,0,0,0,0
7,29633,30556,3,0,11,16,4,8,186,10,...,0,0,0,0,0,0,0,0,0,0
8,5225,21143,3,0,11,15,4,12,225,23,...,0,0,0,0,0,0,0,0,0,0
9,36092,6857,3,0,11,14,4,9,138,13,...,0,0,0,0,0,0,0,0,0,0


In [174]:
# showing all of the columns datatypes once again to make sure all columns are ints
data1.dtypes

title            int64
text             int64
subject          int64
label            int64
month            int64
day              int64
year             int64
numWordsTitle    int64
numWordsText     int64
numPosWords      int64
numNegWords      int64
percPosWords     int64
percNegWords     int64
JJ               int64
NN               int64
IN               int64
DT               int64
NNS              int64
VBP              int64
CD               int64
RB               int64
VBZ              int64
JJR              int64
RBR              int64
VBD              int64
VBG              int64
VB               int64
MD               int64
VBN              int64
PRP              int64
NNP              int64
$                int64
WP               int64
RBS              int64
RP               int64
TO               int64
FW               int64
JJS              int64
WRB              int64
NNPS             int64
CC               int64
UH               int64
PRP$             int64
WDT        

In [177]:
#import the necessary module
from sklearn.model_selection import train_test_split
#split data set into train and test sets
# note: data1 still includes the 'label' column, so the target is also among the features
data_train, data_test, target_train, target_test = train_test_split(data1, target, test_size = 0.30, random_state = 10)
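
Because the split above passes `data1` with the `label` column still in it, the classifier can see the answer among its inputs. A common way to avoid that, sketched here on made-up numbers, is to drop the target from the feature frame before splitting. The `toy`, `features`, `X_*` and `y_*` names are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up feature values and labels, for illustration only
toy = pd.DataFrame({
    "numWordsText": [100, 120, 90, 300, 310, 280, 150, 160, 140, 200],
    "numPosWords": [5, 6, 4, 20, 22, 18, 8, 9, 7, 10],
    "label": [0, 0, 0, 1, 1, 1, 0, 0, 0, 1],
})

# Separate the target from the features so the label cannot leak into training
features = toy.drop(columns="label")
target = toy["label"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.30, random_state=10
)
print(X_train.columns.tolist(), len(X_train), len(X_test))
```

Metrics computed on features that exclude the label may differ from those obtained when the label is included.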

In [179]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

#create object of the classifier
svm = LinearSVC()
#Train the algorithm
svm.fit(data_train, target_train)
# predict the testing set
pred = svm.predict(data_test)

# print the evaluate metrics
print(" accuracy score : ", accuracy_score(target_test, pred))
print(" precision : ", precision_score(target_test, pred, average="macro"))
print(" recall : ", recall_score(target_test, pred, average="macro"))
print(" F1 score : ", f1_score(target_test, pred, average="macro"))

 accuracy score :  0.8115070527097253
 precision :  0.8579022891132879
 recall :  0.8033119492467045
 F1 score :  0.8020311191227405


In [None]:
# add the svm predictions (already computed as `pred`) to the testing dataset for analysis
data_test["predLabel"] = pred

In [227]:
# create a df with the encoded, predicted values - this will be decoded next
df = data_test[['text', 'label','predLabel']].head()
display(df)

Unnamed: 0,text,label,predLabel
3040,7327,0,0
41612,29802,1,1
6040,22789,0,0
30742,34826,1,1
19492,17044,0,0


In [243]:
# numpy is needed for np.where below
import numpy as np

# identifying the predicted target label
targetLabel = data_test["predLabel"]
# selecting the original, un-encoded rows of `data` that share index values with df
indexMerger = data.loc[df.index]
# identifying the decoded columns
decodedColumns = indexMerger[['text', 'label']]
# adding the target label to the rest of the decoded columns
finalDF = decodedColumns.join(targetLabel)
# decoding the target label (0 = fake, 1 = true) to create a final decoded dataframe for analysis
finalDF['predLabel'] = np.where(finalDF['predLabel'] == 0, "fake","true")

# print first five rows
finalDF.head()

Unnamed: 0,text,label,predLabel
3040,democrats stood american people republicans sh...,fake,fake
41612,united nations reuters saudi arabia friday rej...,true,true
6040,republican house speaker paul ryan slammed mom...,fake,fake
30742,washington reuters u s house democratic leader...,true,true
19492,michigan court appeals rejected green party ca...,fake,fake
