last time, the passive aggressive classifier (PAC) trained on tf-idf vectorized data reached 92.6% accuracy, and changing the PAC parameter values over set intervals did not improve on that.
<br/><br/>
this time, i train on the same data but with a support vector machine (svm).
<br/>source for ideas: https://github.com/MehtaPlusTutoring-MLBootcamp20/Real_vs_Fake_News

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

In [2]:
df=pd.read_csv('/Users/stefbp/Desktop/bata/ML_AI/FakeNewsDetection/data/news.csv')#data reading

side note: definitely take examples from the above project on storing text data as sparse matrices. from reading scipy's sparse matrices section, it seems like this will really help save memory, since only the non-zero entries are stored.
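to get a feel for the memory argument, here is a minimal sketch comparing a dense array with its CSR (compressed sparse row) version. the matrix size and the number of non-zero entries are made up for illustration, they are not taken from the news data.

```python
import numpy as np
from scipy import sparse

# hypothetical toy matrix: 1,000 documents x 5,000 vocabulary terms, mostly zeros
rng = np.random.default_rng(0)
dense = np.zeros((1000, 5000), dtype=np.int64)
rows = rng.integers(0, 1000, size=20000)
cols = rng.integers(0, 5000, size=20000)
dense[rows, cols] = 1

csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
# a CSR matrix stores three arrays: values, column indices and row pointers
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print("dense :", dense_bytes, "bytes")
print("sparse:", sparse_bytes, "bytes")
```

the dense array pays for every cell, zeros included, while the sparse one only pays for the entries that are actually set.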

In [3]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE


In [4]:
#features are 'title' and 'text', variable to predict is 'label'
X = df.loc[:,['title','text']]
y = df.loc[:,'label']

something else to learn: in the above project, text is handled with use of the nltk package. procedure of said handling:
1. lowercase conversion
2. word token-ization
3. whitespace & punctuation removal
4. stopword removal
5. lemmatization (reducing inflected forms of a word to one base form, e.g. "supporters" to "supporter") and then rejoining the tokens of title and text into strings.<br/><br/>
for really big data, they saved the whole thing as an .npz archive with scipy for later reloading. we're not doing that this time, i'll learn that one for next time.
let's try to do steps 1-5 following the quoted project.
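for reference, the .npz round trip mentioned above could look roughly like this. it's only a sketch: the small matrix and the file name are made up, and `save_npz` / `load_npz` are the scipy functions i'd expect to use for it.

```python
import os
import tempfile
from scipy import sparse

# hypothetical small matrix standing in for the vectorized news data
matrix = sparse.csr_matrix([[0, 2, 0], [1, 0, 0], [0, 0, 3]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.npz")  # file name is made up
    sparse.save_npz(path, matrix)      # write the sparse matrix to disk
    reloaded = sparse.load_npz(path)   # read it back later

# the reloaded matrix holds the same values as the original
same = (matrix != reloaded).nnz == 0
print(same)
```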

In [5]:
#step 1, lowercase conversion
X['title'] = X['title'].str.lower()
X['text'] = X['text'].str.lower()

In [None]:
#using nltk. 'punkt' is the pretrained tokenizer model that word_tokenize relies on (not a list of punctuation marks); 'wordnet' is the lexical database behind the lemmatizer.
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')#this one is used for lemmatization apparently

In [6]:
#step 2-3, tokenization and whitespace-punctuation removal
#make new columns in our dataframe with all tokenized words, remove whitespace and punctuation
def identify_tokens_for_title(row):
    title = row['title']
    tokens_for_title = nltk.word_tokenize(title)
    # keep only alphanumeric tokens; this drops punctuation and also contraction pieces like "n't" (so "can't" becomes "ca")
    token_words_for_title = [w for w in tokens_for_title if w.isalnum()]
    return token_words_for_title

def identify_tokens_for_text(row):
    text = row['text']
    tokens_for_text = nltk.word_tokenize(text)
    # keep only alphanumeric tokens (same filter as for the title)
    token_words_for_text = [w for w in tokens_for_text if w.isalnum()]
    return token_words_for_text

X['title_tokenized'] = X.apply(identify_tokens_for_title, axis=1)
X['text_tokenized'] = X.apply(identify_tokens_for_text, axis=1)

X

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/stefbp/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /Users/stefbp/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /Users/stefbp/nltk_data...


Unnamed: 0,title,text,title_tokenized,text_tokenized
0,you can smell hillary’s fear,"daniel greenfield, a shillman journalism fello...","[you, can, smell, hillary, s, fear]","[daniel, greenfield, a, shillman, journalism, ..."
1,watch the exact moment paul ryan committed pol...,google pinterest digg linkedin reddit stumbleu...,"[watch, the, exact, moment, paul, ryan, commit...","[google, pinterest, digg, linkedin, reddit, st..."
2,kerry to go to paris in gesture of sympathy,u.s. secretary of state john f. kerry said mon...,"[kerry, to, go, to, paris, in, gesture, of, sy...","[secretary, of, state, john, kerry, said, mond..."
3,bernie supporters on twitter erupt in anger ag...,"— kaydee king (@kaydeeking) november 9, 2016 t...","[bernie, supporters, on, twitter, erupt, in, a...","[kaydee, king, kaydeeking, november, 9, 2016, ..."
4,the battle of new york: why this primary matters,it's primary day in new york and front-runners...,"[the, battle, of, new, york, why, this, primar...","[it, primary, day, in, new, york, and, hillary..."
...,...,...,...,...
6330,state department says it can't find emails fro...,the state department told the republican natio...,"[state, department, says, it, ca, find, emails...","[the, state, department, told, the, republican..."
6331,the ‘p’ in pbs should stand for ‘plutocratic’ ...,the ‘p’ in pbs should stand for ‘plutocratic’ ...,"[the, p, in, pbs, should, stand, for, plutocra...","[the, p, in, pbs, should, stand, for, plutocra..."
6332,anti-trump protesters are tools of the oligarc...,anti-trump protesters are tools of the oligar...,"[protesters, are, tools, of, the, oligarchy, i...","[protesters, are, tools, of, the, oligarchy, a..."
6333,"in ethiopia, obama seeks progress on peace, se...","addis ababa, ethiopia —president obama convene...","[in, ethiopia, obama, seeks, progress, on, pea...","[addis, ababa, ethiopia, obama, convened, a, m..."


In [7]:
#step 4, getting stopwords out the way
from nltk.corpus import stopwords
stops = set(stopwords.words("english"))                  

def remove_stops_for_title(row):
    my_list = row['title_tokenized']
    meaningful_words = [w for w in my_list if w not in stops]
    return meaningful_words

def remove_stops_for_text(row):
    my_list = row['text_tokenized']
    meaningful_words = [w for w in my_list if w not in stops]
    return meaningful_words

X['title_no_stopwords'] = X.apply(remove_stops_for_title, axis=1)
X['text_no_stopwords'] = X.apply(remove_stops_for_text, axis=1)

X

Unnamed: 0,title,text,title_tokenized,text_tokenized,title_no_stopwords,text_no_stopwords
0,you can smell hillary’s fear,"daniel greenfield, a shillman journalism fello...","[you, can, smell, hillary, s, fear]","[daniel, greenfield, a, shillman, journalism, ...","[smell, hillary, fear]","[daniel, greenfield, shillman, journalism, fel..."
1,watch the exact moment paul ryan committed pol...,google pinterest digg linkedin reddit stumbleu...,"[watch, the, exact, moment, paul, ryan, commit...","[google, pinterest, digg, linkedin, reddit, st...","[watch, exact, moment, paul, ryan, committed, ...","[google, pinterest, digg, linkedin, reddit, st..."
2,kerry to go to paris in gesture of sympathy,u.s. secretary of state john f. kerry said mon...,"[kerry, to, go, to, paris, in, gesture, of, sy...","[secretary, of, state, john, kerry, said, mond...","[kerry, go, paris, gesture, sympathy]","[secretary, state, john, kerry, said, monday, ..."
3,bernie supporters on twitter erupt in anger ag...,"— kaydee king (@kaydeeking) november 9, 2016 t...","[bernie, supporters, on, twitter, erupt, in, a...","[kaydee, king, kaydeeking, november, 9, 2016, ...","[bernie, supporters, twitter, erupt, anger, dn...","[kaydee, king, kaydeeking, november, 9, 2016, ..."
4,the battle of new york: why this primary matters,it's primary day in new york and front-runners...,"[the, battle, of, new, york, why, this, primar...","[it, primary, day, in, new, york, and, hillary...","[battle, new, york, primary, matters]","[primary, day, new, york, hillary, clinton, do..."
...,...,...,...,...,...,...
6330,state department says it can't find emails fro...,the state department told the republican natio...,"[state, department, says, it, ca, find, emails...","[the, state, department, told, the, republican...","[state, department, says, ca, find, emails, cl...","[state, department, told, republican, national..."
6331,the ‘p’ in pbs should stand for ‘plutocratic’ ...,the ‘p’ in pbs should stand for ‘plutocratic’ ...,"[the, p, in, pbs, should, stand, for, plutocra...","[the, p, in, pbs, should, stand, for, plutocra...","[p, pbs, stand, plutocratic, pentagon]","[p, pbs, stand, plutocratic, pentagon, posted,..."
6332,anti-trump protesters are tools of the oligarc...,anti-trump protesters are tools of the oligar...,"[protesters, are, tools, of, the, oligarchy, i...","[protesters, are, tools, of, the, oligarchy, a...","[protesters, tools, oligarchy, information]","[protesters, tools, oligarchy, always, provoke..."
6333,"in ethiopia, obama seeks progress on peace, se...","addis ababa, ethiopia —president obama convene...","[in, ethiopia, obama, seeks, progress, on, pea...","[addis, ababa, ethiopia, obama, convened, a, m...","[ethiopia, obama, seeks, progress, peace, secu...","[addis, ababa, ethiopia, obama, convened, meet..."


In [10]:
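#step 5, lemmatization. as a reminder to self: reducing each word to its base form. lemmatize() treats every word as a noun by default (pos='n'), so verb forms like "convened" stay unchanged and odd cases like "pbs" -> "pb" show up.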
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 

def lemmatize_list_for_title(row):
    my_list = row['title_no_stopwords']
    lemmatized_list = [lemmatizer.lemmatize(word) for word in my_list]
    return (lemmatized_list)

def lemmatize_list_for_text(row):
    my_list = row['text_no_stopwords']
    lemmatized_list = [lemmatizer.lemmatize(word) for word in my_list]
    return (lemmatized_list)

X['lemmatized_title'] = X.apply(lemmatize_list_for_title, axis=1)
X['lemmatized_text'] = X.apply(lemmatize_list_for_text, axis=1)

X

Unnamed: 0,title,text,title_tokenized,text_tokenized,title_no_stopwords,text_no_stopwords,lemmatized_title,lemmatized_text
0,you can smell hillary’s fear,"daniel greenfield, a shillman journalism fello...","[you, can, smell, hillary, s, fear]","[daniel, greenfield, a, shillman, journalism, ...","[smell, hillary, fear]","[daniel, greenfield, shillman, journalism, fel...","[smell, hillary, fear]","[daniel, greenfield, shillman, journalism, fel..."
1,watch the exact moment paul ryan committed pol...,google pinterest digg linkedin reddit stumbleu...,"[watch, the, exact, moment, paul, ryan, commit...","[google, pinterest, digg, linkedin, reddit, st...","[watch, exact, moment, paul, ryan, committed, ...","[google, pinterest, digg, linkedin, reddit, st...","[watch, exact, moment, paul, ryan, committed, ...","[google, pinterest, digg, linkedin, reddit, st..."
2,kerry to go to paris in gesture of sympathy,u.s. secretary of state john f. kerry said mon...,"[kerry, to, go, to, paris, in, gesture, of, sy...","[secretary, of, state, john, kerry, said, mond...","[kerry, go, paris, gesture, sympathy]","[secretary, state, john, kerry, said, monday, ...","[kerry, go, paris, gesture, sympathy]","[secretary, state, john, kerry, said, monday, ..."
3,bernie supporters on twitter erupt in anger ag...,"— kaydee king (@kaydeeking) november 9, 2016 t...","[bernie, supporters, on, twitter, erupt, in, a...","[kaydee, king, kaydeeking, november, 9, 2016, ...","[bernie, supporters, twitter, erupt, anger, dn...","[kaydee, king, kaydeeking, november, 9, 2016, ...","[bernie, supporter, twitter, erupt, anger, dnc...","[kaydee, king, kaydeeking, november, 9, 2016, ..."
4,the battle of new york: why this primary matters,it's primary day in new york and front-runners...,"[the, battle, of, new, york, why, this, primar...","[it, primary, day, in, new, york, and, hillary...","[battle, new, york, primary, matters]","[primary, day, new, york, hillary, clinton, do...","[battle, new, york, primary, matter]","[primary, day, new, york, hillary, clinton, do..."
...,...,...,...,...,...,...,...,...
6330,state department says it can't find emails fro...,the state department told the republican natio...,"[state, department, says, it, ca, find, emails...","[the, state, department, told, the, republican...","[state, department, says, ca, find, emails, cl...","[state, department, told, republican, national...","[state, department, say, ca, find, email, clin...","[state, department, told, republican, national..."
6331,the ‘p’ in pbs should stand for ‘plutocratic’ ...,the ‘p’ in pbs should stand for ‘plutocratic’ ...,"[the, p, in, pbs, should, stand, for, plutocra...","[the, p, in, pbs, should, stand, for, plutocra...","[p, pbs, stand, plutocratic, pentagon]","[p, pbs, stand, plutocratic, pentagon, posted,...","[p, pb, stand, plutocratic, pentagon]","[p, pb, stand, plutocratic, pentagon, posted, ..."
6332,anti-trump protesters are tools of the oligarc...,anti-trump protesters are tools of the oligar...,"[protesters, are, tools, of, the, oligarchy, i...","[protesters, are, tools, of, the, oligarchy, a...","[protesters, tools, oligarchy, information]","[protesters, tools, oligarchy, always, provoke...","[protester, tool, oligarchy, information]","[protester, tool, oligarchy, always, provokes,..."
6333,"in ethiopia, obama seeks progress on peace, se...","addis ababa, ethiopia —president obama convene...","[in, ethiopia, obama, seeks, progress, on, pea...","[addis, ababa, ethiopia, obama, convened, a, m...","[ethiopia, obama, seeks, progress, peace, secu...","[addis, ababa, ethiopia, obama, convened, meet...","[ethiopia, obama, seek, progress, peace, secur...","[addis, ababa, ethiopia, obama, convened, meet..."


In [11]:
#step 5, rejoin!
def rejoin_words_in_title(row):
    my_list = row['lemmatized_title']
    joined_words = ( " ".join(my_list))
    return joined_words

def rejoin_words_in_text(row):
    my_list = row['lemmatized_text']
    joined_words = ( " ".join(my_list))
    return joined_words

X['processed_title'] = X.apply(rejoin_words_in_title, axis=1)
X['processed_text'] = X.apply(rejoin_words_in_text, axis=1)

X

Unnamed: 0,title,text,title_tokenized,text_tokenized,title_no_stopwords,text_no_stopwords,lemmatized_title,lemmatized_text,processed_title,processed_text
0,you can smell hillary’s fear,"daniel greenfield, a shillman journalism fello...","[you, can, smell, hillary, s, fear]","[daniel, greenfield, a, shillman, journalism, ...","[smell, hillary, fear]","[daniel, greenfield, shillman, journalism, fel...","[smell, hillary, fear]","[daniel, greenfield, shillman, journalism, fel...",smell hillary fear,daniel greenfield shillman journalism fellow f...
1,watch the exact moment paul ryan committed pol...,google pinterest digg linkedin reddit stumbleu...,"[watch, the, exact, moment, paul, ryan, commit...","[google, pinterest, digg, linkedin, reddit, st...","[watch, exact, moment, paul, ryan, committed, ...","[google, pinterest, digg, linkedin, reddit, st...","[watch, exact, moment, paul, ryan, committed, ...","[google, pinterest, digg, linkedin, reddit, st...",watch exact moment paul ryan committed politic...,google pinterest digg linkedin reddit stumbleu...
2,kerry to go to paris in gesture of sympathy,u.s. secretary of state john f. kerry said mon...,"[kerry, to, go, to, paris, in, gesture, of, sy...","[secretary, of, state, john, kerry, said, mond...","[kerry, go, paris, gesture, sympathy]","[secretary, state, john, kerry, said, monday, ...","[kerry, go, paris, gesture, sympathy]","[secretary, state, john, kerry, said, monday, ...",kerry go paris gesture sympathy,secretary state john kerry said monday stop pa...
3,bernie supporters on twitter erupt in anger ag...,"— kaydee king (@kaydeeking) november 9, 2016 t...","[bernie, supporters, on, twitter, erupt, in, a...","[kaydee, king, kaydeeking, november, 9, 2016, ...","[bernie, supporters, twitter, erupt, anger, dn...","[kaydee, king, kaydeeking, november, 9, 2016, ...","[bernie, supporter, twitter, erupt, anger, dnc...","[kaydee, king, kaydeeking, november, 9, 2016, ...",bernie supporter twitter erupt anger dnc tried...,kaydee king kaydeeking november 9 2016 lesson ...
4,the battle of new york: why this primary matters,it's primary day in new york and front-runners...,"[the, battle, of, new, york, why, this, primar...","[it, primary, day, in, new, york, and, hillary...","[battle, new, york, primary, matters]","[primary, day, new, york, hillary, clinton, do...","[battle, new, york, primary, matter]","[primary, day, new, york, hillary, clinton, do...",battle new york primary matter,primary day new york hillary clinton donald tr...
...,...,...,...,...,...,...,...,...,...,...
6330,state department says it can't find emails fro...,the state department told the republican natio...,"[state, department, says, it, ca, find, emails...","[the, state, department, told, the, republican...","[state, department, says, ca, find, emails, cl...","[state, department, told, republican, national...","[state, department, say, ca, find, email, clin...","[state, department, told, republican, national...",state department say ca find email clinton spe...,state department told republican national comm...
6331,the ‘p’ in pbs should stand for ‘plutocratic’ ...,the ‘p’ in pbs should stand for ‘plutocratic’ ...,"[the, p, in, pbs, should, stand, for, plutocra...","[the, p, in, pbs, should, stand, for, plutocra...","[p, pbs, stand, plutocratic, pentagon]","[p, pbs, stand, plutocratic, pentagon, posted,...","[p, pb, stand, plutocratic, pentagon]","[p, pb, stand, plutocratic, pentagon, posted, ...",p pb stand plutocratic pentagon,p pb stand plutocratic pentagon posted oct 27 ...
6332,anti-trump protesters are tools of the oligarc...,anti-trump protesters are tools of the oligar...,"[protesters, are, tools, of, the, oligarchy, i...","[protesters, are, tools, of, the, oligarchy, a...","[protesters, tools, oligarchy, information]","[protesters, tools, oligarchy, always, provoke...","[protester, tool, oligarchy, information]","[protester, tool, oligarchy, always, provokes,...",protester tool oligarchy information,protester tool oligarchy always provokes rage ...
6333,"in ethiopia, obama seeks progress on peace, se...","addis ababa, ethiopia —president obama convene...","[in, ethiopia, obama, seeks, progress, on, pea...","[addis, ababa, ethiopia, obama, convened, a, m...","[ethiopia, obama, seeks, progress, peace, secu...","[addis, ababa, ethiopia, obama, convened, meet...","[ethiopia, obama, seek, progress, peace, secur...","[addis, ababa, ethiopia, obama, convened, meet...",ethiopia obama seek progress peace security ea...,addis ababa ethiopia obama convened meeting le...


In [13]:
#after this, we'll keep only the processed_title and processed_text columns; the others have served their purpose.
#errors='ignore' makes this cell safe to re-run: without it, a second run raised a KeyError ("... not found in axis") because the columns were already dropped.
drop_cols = ['title','text','title_tokenized','text_tokenized','title_no_stopwords','text_no_stopwords','lemmatized_title','lemmatized_text']
X.drop(columns=drop_cols, inplace=True, errors='ignore')

In [14]:
X.head(2)

Unnamed: 0,processed_title,processed_text
0,smell hillary fear,daniel greenfield shillman journalism fellow f...
1,watch exact moment paul ryan committed politic...,google pinterest digg linkedin reddit stumbleu...


the best part is that doing it this way doesn't change the order of the rows. we still have 6335 observations lined up with their labels, only now with the processed text data. now for svm.
<br>
when i tried to just split the data normally and apply svm, it didn't work, because the svm needs numeric features rather than raw strings. turns out i need to count-vectorize the text, i.e. turn it into sparse matrices of counts: one for the title, one for the text. the quoted project also had the whole .npz saving and loading part, which i think we can forgo.
<br/><br/>so that's what i'll be doing. 

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
from scipy import sparse
wvec_title = CountVectorizer(ngram_range=(2, 2), analyzer='word')#ngram_range=(2, 2) means only bigrams (two-word sequences) become features; (1, 1) would be unigrams only, (1, 2) both.
sparse_title = wvec_title.fit_transform(X['processed_title'])
wvec_text = CountVectorizer(ngram_range=(2, 2), analyzer='word')
sparse_text = wvec_text.fit_transform(X['processed_text'])
X = sparse.hstack([sparse_title,sparse_text])#hstacking two sparse matrices is just about putting the second one on the right of the first one.
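to see what `ngram_range` does, here is a small sketch on two of the processed titles from the table above. the three vectorizers are just for comparison; only the `(2, 2)` setting matches what the cell above uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two processed titles from the table above, used as a tiny corpus
docs = ["smell hillary fear", "kerry go paris gesture sympathy"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)
both = CountVectorizer(ngram_range=(1, 2)).fit(docs)

# vocabulary_ maps each feature (n-gram) to its column index
print(sorted(unigrams.vocabulary_))
print(sorted(bigrams.vocabulary_))
print(len(both.vocabulary_))
```

with `(2, 2)` single words are not features at all, only adjacent word pairs, which is probably part of why the matrix ends up with about 1.5 million columns.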

In [22]:
X

<6335x1508481 sparse matrix of type '<class 'numpy.int64'>'
	with 2517139 stored elements in COOrdinate format>

so now X is no longer something i can inspect by eye. furthermore, it's a sparse matrix in COO (coordinate) format. <br>
a COO format sparse matrix stores only the non-zero entries: for each one, its coordinates and its value.<br>
the coordinates are (row, column) pairs.
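a minimal sketch of what that storage looks like, on a made-up 3x4 matrix (not the news data). `row`, `col` and `data` are the three arrays a scipy COO matrix exposes.

```python
import numpy as np
from scipy import sparse

# hypothetical 3x4 count matrix with three non-zero entries
dense = np.array([[0, 0, 5, 0],
                  [0, 0, 0, 0],
                  [7, 0, 0, 1]])
coo = sparse.coo_matrix(dense)

# COO keeps three parallel arrays: row index, column index, value
triples = [(int(r), int(c), int(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]
for r, c, v in triples:
    print(f"({r}, {c}) -> {v}")

# CSR is the format generally used for row slicing, so convert before indexing
first_row = coo.tocsr()[0].toarray()
```

as far as i can tell, sklearn converts the matrix to an indexable format on its own when it needs to pick rows, which would explain why the split in the next cell works on X as it is.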

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=66)#data split

In [29]:
#svm. there are several kernels to sklearn's svm.SVC: linear, poly, rbf, sigmoid. 
#we try poly, sigmoid and rbf below (poly with degree=1 behaves like the linear kernel up to a scaling factor). keep in mind that this will take a good amount of time, a bit below a minute.
svc = svm.SVC(kernel='poly',degree=1,gamma=0.1).fit(X_train, y_train.values.ravel())
print("kernel=poly, poly degree=1, gamma=0.1, regularization=1 as default. accuracy:", svc.score(X_test, y_test))

kernel=poly, poly degree=1, gamma=0.1, regularization=1 as default. accuracy: 0.8871349644830308
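a note on the setting above: with `degree=1` and SVC's default `coef0=0.0`, the poly kernel is just the linear kernel multiplied by `gamma`, so this run is roughly a linear svm with a rescaled regularization strength. a small check on a made-up toy matrix, using sklearn's pairwise kernel helpers:

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel

# hypothetical toy feature matrix (4 samples, 3 features)
X_toy = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0],
                  [3.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0]])

gamma = 0.1
# the poly kernel is (gamma * <x, x'> + coef0) ** degree; coef0=0.0 mirrors SVC's default
poly = polynomial_kernel(X_toy, degree=1, gamma=gamma, coef0=0.0)
linear = linear_kernel(X_toy)

same_up_to_scale = np.allclose(poly, gamma * linear)
print(same_up_to_scale)
```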


In [26]:
svc = svm.SVC(kernel='sigmoid', C=10, gamma=.1).fit(X_train, y_train.values.ravel())
print("kernel=sigmoid, regularization=10, gamma=0.1. accuracy:", svc.score(X_test, y_test))

kernel=sigmoid, regularization=10, gamma=0.1. accuracy: 0.7269139700078927


In [28]:
svc = svm.SVC(kernel='rbf', gamma=.1, C=10).fit(X_train, y_train.values.ravel())
print("kernel=rbf, regularization=10, gamma=0.1. accuracy:", svc.score(X_test, y_test))

kernel=rbf, regularization=10, gamma=0.1. accuracy: 0.56353591160221


In [30]:
#project said that sigmoid kernel, with default gamma but C=10 is their best choice. let's see.
svc = svm.SVC(kernel='sigmoid', C=10).fit(X_train, y_train.values.ravel())
print("kernel=sigmoid, regularization=10. accuracy:", svc.score(X_test, y_test))

kernel=sigmoid, regularization=10. accuracy: 0.8445146014206788


all this effort and time just to see that svm (88.7% at best, with the poly kernel) is so far not better than the passive-aggressive classifier on tf-idf vectorized data (92.6%).
<br>next i'll run more experiments on the poly kernel since it seems to be doing best: different gammas and regularization factors, maybe even up the degree to 2.
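one way to organize those experiments is a grid search with cross-validation. this is only a sketch under assumptions: the toy corpus and labels are made up, the parameter values are placeholders, and the vectorizer is put inside a pipeline so that it is fitted on the training folds only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# made-up toy corpus; the real run would use the processed news text and labels
texts = ["aliens rigged vote", "senate passed budget", "miracle cure hidden",
         "court upheld ruling", "secret plot exposed", "mayor opened bridge",
         "shocking hoax revealed", "council approved plan"]
labels = ["FAKE", "REAL"] * 4

pipe = Pipeline([
    ("vec", CountVectorizer()),   # fitted on the training folds only
    ("svc", SVC(kernel="poly")),
])

# placeholder values for degree, gamma and the regularization factor C
param_grid = {
    "svc__degree": [1, 2],
    "svc__gamma": [0.01, 0.1, 1.0],
    "svc__C": [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

on the real data each fit takes close to a minute, so the grid should stay small or the number of folds low.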