The first cell imports the two datasets, merges them into one dataframe, and partitions the data into training and test sets with a 0.20 test proportion. The rows are shuffled as part of the split, since train_test_split shuffles by default. Both the training and test sets are further partitioned into X and y components, holding the entries (tweets) and the labels (0 or 1) respectively.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

gen = pd.read_csv('genuine1.csv')
poli = pd.read_csv('Political_Cleaned1.csv')
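# Stack the two datasets row-wise into a single dataframe
df0 = pd.concat([gen,poli],axis=0,sort=True)
# Drop the column named '1', which is not used in the rest of the notebook
df = df0.drop(['1'],axis=1)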
X_train, X_test, y_train, y_test = train_test_split(df['content'], df['label'], test_size=0.20)
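
To make the split concrete, here is a minimal sketch on a made-up ten-row dataframe. The `toy_*` names and the toy tweets are assumptions for illustration only; `random_state` and `stratify` are optional arguments that the cell above does not use, shown here because they make the split reproducible and keep the class balance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the merged dataframe: 10 tweets, 5 per class.
toy_df = pd.DataFrame({
    "content": [f"tweet number {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

# test_size=0.20 keeps 8 rows for training and 2 for testing.
# random_state fixes the shuffle; stratify keeps both labels in each part.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_df["content"], toy_df["label"],
    test_size=0.20, random_state=0, stratify=toy_df["label"],
)

print(len(toy_X_train), len(toy_X_test))
print(sorted(toy_y_test.tolist()))
```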


Next we apply feature extraction to the X data. To do this we use a TfidfVectorizer, which extracts a word dictionary from the data and creates a sparse matrix in which each row is a data entry and each column indicates the prevalence of a specific word in that entry. The term frequency–inverse document frequency weighting means that words appearing in few tweets are weighted higher than words appearing in many. Our new features allow us to work numerically, since X has been transformed from rows of tweets into rows of floating point numbers.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
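vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training tweets only
vectorizer.fit(X_train.values.astype('U'))
vector = vectorizer.transform(X_train.values.astype('U'))
# Transform the test tweets with the vocabulary fitted on the training data
testVector = vectorizer.transform(X_test.values.astype('U'))
# Keep at most the first 10,000 rows of each matrix
shortTestV = testVector[0:10000]
shortV = vector[0:10000]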
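
The weighting described above can be checked on a tiny corpus. This is only a sketch: the three documents and the `toy_*` names are invented for illustration, and the vectorizer uses its default settings as in the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "tweets"; "for" occurs in one of them, "vote" in two.
toy_docs = ["vote vote today", "vote for lunch", "lunch today"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per distinct word.
print(toy_matrix.shape)

# vocabulary_ maps each word to its column; idf_ holds the learned IDF weights.
toy_vocab = toy_vectorizer.vocabulary_
toy_idf = toy_vectorizer.idf_
print("idf(for) :", toy_idf[toy_vocab["for"]])
print("idf(vote):", toy_idf[toy_vocab["vote"]])
```

The word that appears in fewer documents receives the larger IDF weight, which is the effect the paragraph above refers to.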

The large dataset and the size of the vocabulary (66,145 distinct words) produce a very sparse, high-dimensional matrix, so we apply TruncatedSVD, which in this context is a form of LSA (Latent Semantic Analysis). This reduces the feature dimension to the specified number of components.

In [None]:
from sklearn.decomposition import TruncatedSVD
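tsvd = TruncatedSVD(n_components=5000, n_iter=10, random_state=32)
# Fit the SVD on the training rows and project them onto 5000 components
vector_reduced = tsvd.fit_transform(shortV)
# Fraction of the variance retained by the 5000 components
print(tsvd.explained_variance_ratio_.sum())
# Project the test rows with the already fitted SVD; refitting here would
# put the test data in a different component space than the training data
testvector_reduced = tsvd.transform(shortTestV)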
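
As a small-scale illustration of the reduction step, the sketch below fits TruncatedSVD on a random sparse training matrix and reuses the fitted object on a test matrix. The matrices, their sizes, and the `toy_*` names are assumptions chosen only to keep the example fast; they do not come from the tweet data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)

def toy_sparse(n_rows, n_cols):
    # Roughly 10% of the entries are non-zero, mimicking a sparse TF-IDF matrix.
    values = rng.rand(n_rows, n_cols)
    mask = rng.rand(n_rows, n_cols) < 0.1
    return csr_matrix(values * mask)

toy_train = toy_sparse(50, 30)
toy_test = toy_sparse(10, 30)

toy_svd = TruncatedSVD(n_components=5, random_state=0)
# fit_transform learns the components from the training rows only.
toy_train_reduced = toy_svd.fit_transform(toy_train)
# transform projects the test rows onto those same components.
toy_test_reduced = toy_svd.transform(toy_test)

print(toy_train_reduced.shape, toy_test_reduced.shape)
print(toy_svd.explained_variance_ratio_.sum())
```

Both outputs have the same number of columns because the test rows are projected with the components learned from the training rows.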

Next we fit a linear Support Vector Machine to the reduced training data and evaluate it on the reduced test data.

In [None]:
from sklearn import svm
from sklearn import metrics

clf = svm.SVC(kernel = 'linear')
clf.fit(vector_reduced,y_train[0:10000])
# Predict on the reduced test features, which have the same 5000 columns
# the classifier was trained on
p = clf.predict(testvector_reduced)
acc = metrics.accuracy_score(y_test[0:10000],p)
print(acc)
cm = metrics.confusion_matrix(y_test[0:10000],p)
print(cm)
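
One possible way to keep the three steps consistent is to chain them in a scikit-learn Pipeline, so that each stage is fitted on the training data and only applied to the test data. This is a sketch and not the approach used in the cells above: the six toy tweets, their labels, and the `toy_*` names are invented, and `n_components=2` is chosen only because the toy vocabulary is tiny.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC
from sklearn import metrics

# Made-up tweets: label 1 for political, 0 for genuine.
toy_texts = [
    "vote for the senator",
    "election rally tonight",
    "senator wins election vote",
    "lunch with friends",
    "coffee and lunch today",
    "friends at the beach today",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF, then SVD, then a linear SVM, fitted together in one call.
toy_pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    SVC(kernel="linear"),
)
toy_pipe.fit(toy_texts, toy_labels)

# predict() reuses the fitted vectorizer and SVD; nothing is refitted here.
toy_pred = toy_pipe.predict(["senator election", "lunch friends"])
toy_train_acc = metrics.accuracy_score(toy_labels, toy_pipe.predict(toy_texts))
print(toy_pred, toy_train_acc)
```

With a pipeline, calling `predict` on new text cannot accidentally refit the vectorizer or the SVD, which is the kind of mismatch between training and test features that is easy to introduce when the steps are run by hand.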