First Cell involved importing the two datasets, merging them into one dataframe, shuffling the data
and partitioning the data into training and test data with a 0.20 test proportion. Both the training and test data are further partitioned into X and y components, denoting each entries (tweets) and labels (0 or 1) respectively.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

gen = pd.read_csv('genuine1.csv')
poli = pd.read_csv('Political_Cleaned1.csv')
df0 = pd.concat([gen,poli],axis=0,sort=True)
df = df0.drop(['1'],axis=1)
X_train, X_test, y_train, y_test = train_test_split(df['content'], df['label'], test_size=0.20)

Next we apply feature extraction to the X data. To do this we use a TfidfVectorizer which extracts a word dictionary from the data and creates a sparse matrix where for each row (data entry) we have the column indicating the prevelance of a specific word. The term frequency–inverse document frequency makes it so that rare words are weighted higher than frequent ones. Our new features allow us to work numerically since X has been tranformed from rows of tweets into rows of floating point numbers.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train.values.astype('U'))
vector = vectorizer.transform(X_train.values.astype('U'))
test_vector = vectorizer.transform(X_test.values.astype('U'))
print(vector.shape)

(118797, 66171)


The large dataset and the variety of words (66,145) create a very sparse matrix. So we use TruncatedSVD, in this context a form of LSA (Latent Semantic Analysis). This allows us to reduce the dimension to the specified number of components.

In [3]:
from sklearn.decomposition import TruncatedSVD
n_components = [10000]
explained_variance= [0]
v_reduced = []
test_v_reduced = []
for i in range(len(n_components)):
    n = n_components[i]
    tsvd = TruncatedSVD(n_components=n, n_iter=5, random_state=32)
    vector_reduced = tsvd.fit_transform(vector)
    test_vector_reduced = tsvd.fit_transform(test_vector)
    v_reduced.append(vector_reduced)
    test_v_reduced.append(test_vector_reduced)
    explained_v = tsvd.explained_variance_ratio_[0:n].sum()
    explained_variance[i] = round(explained_v,4)

print("N_Components",n_components)
print("Explained Variance Ratio",explained_variance)

N_Components [10000]
Explained Variance Ratio [0.9467]


In [9]:
type(v_reduced[0])

numpy.ndarray

In [None]:
np.savetxt("truncatedTrain10000.csv", v_reduced[0], delimiter=",")

from sklearn.svm import LinearSVC
from sklearn import metrics
svm = LinearSVC(C=10)
for n in range(len(n_components)):
    clf = svm.fit(v_reduced[n],y_train)
    p1 = clf.predict(v_reduced[n])
    p2 = clf.predict(test_v_reduced[n])
    accTrain = metrics.accuracy_score(p1,y_train)
    accTest = metrics.accuracy_score(p2,y_test)
    print("N_Components",n_components[n])
    print("Train Accuracy",accTrain)
    print("Test Accuracy", accTest)

Next we apply a linear Support Vector Machine on the reduced data.

In [32]:
from sklearn.svm import LinearSVC
C = np.logspace(0, 4, 10)
hyperparameters = dict(C=C)
svm = LinearSVC()
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(svm, hyperparameters, cv=5, verbose=0)
model = clf.fit(vector,y_train)
print('Best C:', model.best_estimator_.get_params()['C'])
pred = model.predict(vector)
trainAcc = metrics.accuracy_score(y_train,pred)
print(trainAcc)
testPred = model.predict(test_vector)
testAcc = metrics.accuracy_score(y_test,testPred)
print(testAcc)


Best C: 1.0
0.9643593693443437
0.9046127946127946


In [5]:

clf = LinearSVC()
clf.fit(vector,y_train)
p = clf.predict(vector)
p2 = clf.predict(test_vector)
from sklearn import metrics
acc = metrics.accuracy_score(y_train,p)
testacc = metrics.accuracy_score(y_test,p2)


In [6]:
print("Train Acc", acc)
print("Test Acc", testacc)

Train Acc 0.9644687997171646
Test Acc 0.9045791245791246


In [None]:
#a = [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’]
from sklearn.svm import SVC
clf = SVC(kernel = 'poly')
clf.fit(vector,y_train)
p = clf.predict(vector)
p2 = clf.predict(test_vector)
from sklearn import metrics
acc = metrics.accuracy_score(y_train,p)
testacc = metrics.accuracy_score(y_test,p2)
