# Readme

This notebook briefly explains the reproducibility tasks we performed. First, we describe how we obtained the data. Second, we compare the feature generation for tweet-based and user-based features. Finally, we look at the classifiers for tweet-based and user-based features and at the agreement-based retraining classifier.

# Data Procurement

To replicate the task, we use the data that the authors provide in their GitHub repository (https://github.com/MKLab-ITI/computational-verification), which includes the extracted tweet and user features. We run our experiments on this data. To replicate the feature generation process, we obtain JSON objects for some of the tweets using the tweet IDs listed in the feature files.

# Feature Generation

### Tweet Feature Generation

As an example, consider the following tweet features, which we generated for a randomly selected tweet:


In [29]:
%run C:/Users/imaad/twitteradvancedsearch/fakeimagedetectiontwitter/features/itemFeaturesGeneration.py

(u'262983821446758401', 136, 22, False, True, True, 1, False, False, False, False, False, 0, 1, 1, 2, 0, 1, 0, 8, True, False, 33, 0, 1, 155.07, '1784.0', '1.9922944E7', (u'17', u'15', u'14', u'-1'))


Now consider the features provided by the authors for the same tweet ID:

'262983821446758401',136,21,false,true,false,11,false,false,false,false,false,9,0,2,2,0,1,0,1,true,false,?,0,1,81.424,?,?,?,?,?,?,fake
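Feature rows like the one above mix quoted IDs, booleans, numbers and `?` placeholders for missing values. The following is a minimal sketch of how such a row could be parsed with the standard library. The function name `parse_feature_row` is hypothetical and is not part of the authors' repository or of our scripts.

```python
import csv


def parse_feature_row(line):
    """Parse one comma-separated feature row into Python values.

    Hypothetical helper (an assumption, not code from the repository).
    The first field (the tweet ID) is kept as a string. For the rest,
    '?' becomes None, 'true'/'false' become booleans, numeric text
    becomes int or float, and anything else (e.g. the class label)
    stays a string.
    """
    # quotechar="'" strips the single quotes around the tweet ID.
    fields = next(csv.reader([line], quotechar="'"))
    parsed = [fields[0]]
    for field in fields[1:]:
        if field == "?":
            parsed.append(None)
        elif field in ("true", "false"):
            parsed.append(field == "true")
        else:
            try:
                parsed.append(int(field))
            except ValueError:
                try:
                    parsed.append(float(field))
                except ValueError:
                    parsed.append(field)
    return parsed


row = parse_feature_row(
    "'262983821446758401',136,21,false,true,false,11,false,false,false,"
    "false,false,9,0,2,2,0,1,0,1,true,false,?,0,1,81.424,?,?,?,?,?,?,fake"
)
print(row[0], row[1], row[-1])
```

Keeping the ID as a string avoids any loss of precision or formatting when IDs are later used to look up tweets.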

The two sets of values are tabulated below:

In [36]:
import pandas as pd
df = pd.read_csv("tweet_features_comparison.csv")
df

Unnamed: 0,id,tweetTextLen,numItemWords,questionSymbol,exclamSymbol,externLinkPresent,numberNouns,happyEmo,sadEmo,containFirstPron,...,numQuesSymbol,numExclamSymbol,readabilityValue,Indegree,Harmonic,AlexaCountry,AlexaDelta,AlexaPopularity,AlexaReach,class
0,'262983821446758401',136,21,False,True,False,11,False,False,False,...,0,1,81.424,?,?,?,?,?,?,fake
1,262983821446758401',136,22,False,True,True,1,False,False,False,...,0,1,155.07,'1784.0','1.9922944E7',(u'17',u'15',u'14',u'-1')),


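Several of the values we obtain differ from the authors' values, for example numItemWords (22 versus 21), externLinkPresent (True versus False), numberNouns (1 versus 11) and readabilityValue (155.07 versus 81.424). We also obtain values for some features that are missing (?) in the authors' file, such as Indegree and Harmonic. For a fuller explanation of the differences, please refer to the paper.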
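A comparison like the one in the table can also be done programmatically. The sketch below uses a handful of the values shown in the table; the function name `differing_features` is hypothetical, and the dictionaries contain only a subset of the features.

```python
def differing_features(ours, theirs):
    """Return {feature: (our_value, their_value)} where the two rows disagree.

    Hypothetical helper for illustration. Features that are missing (None)
    on either side are skipped, because a missing value cannot be compared.
    """
    diffs = {}
    for name in ours:
        a, b = ours[name], theirs.get(name)
        if a is None or b is None:
            continue
        if a != b:
            diffs[name] = (a, b)
    return diffs


# A subset of the values from the comparison table above.
ours = {
    "tweetTextLen": 136, "numItemWords": 22, "questionSymbol": False,
    "exclamSymbol": True, "externLinkPresent": True, "numberNouns": 1,
    "readabilityValue": 155.07,
}
authors = {
    "tweetTextLen": 136, "numItemWords": 21, "questionSymbol": False,
    "exclamSymbol": True, "externLinkPresent": False, "numberNouns": 11,
    "readabilityValue": 81.424,
}

diff = differing_features(ours, authors)
for name in sorted(diff):
    print(name, diff[name])
```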

### User Features Generation

In [63]:
%run C:/Users/imaad/twitteradvancedsearch/fakeimagedetectiontwitter/features/userFeaturesGeneration.py

(u'262983821446758401', 294, 338, 0.8698224852071006, 2, False, False, 16273, True, True, False, 33, 390.0, 1506816000.0, False, True, '16273.0', '1784.0', '1.9922944E7', (u'17', u'15', u'14', u'-1'))


The authors provide the following user features for the same tweet ID:

'262983821446758401',235,396,1.6851064,2,false,false,11097,true,true,false,?,326,1303397965,false,true,7.1120014,?,?,?,?,?,?,fake


In [66]:
import pandas as pd
df = pd.read_csv("user_features_comparision.csv")
df

Unnamed: 0,id,numFriends,numFollowers,followerFriendRatio,timesListed,hasUrlCheck,verifiedUser,numTweets,bioCheck,locationCheck,...,profileImgCheck,headerImgCheck,tweetRatio,Indegree,Harmonic,alexaCountryRank,alexaDeltaRank,alexaPopularity,alexaReachRank,class
0,'262983821446758401',235,396,1.685106,2,False,False,11097,True,True,...,False,True,7.1120014,?,?,?,?,?,?,fake
1,262983821446758401',294,338,0.869822,2,False,False,16273,True,True,...,False,True,'16273.0','1784.0','1.9922944E7',(u'17',u'15',u'14',u'-1')),


The user features for the same tweet are compared above. Most of them differ because user details, such as the number of friends, change over time.
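As a quick arithmetic check on the followerFriendRatio column, the sketch below recomputes the ratio from the counts in the table, assuming the column order shown there (numFriends first, then numFollowers). It only describes the printed numbers and makes no claim about how either implementation computes the feature; the helper name `ratio` is hypothetical.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator as a float, or None when undefined."""
    return float(numerator) / denominator if denominator else None


# Counts as printed in the comparison table (numFriends, numFollowers).
authors_friends, authors_followers = 235, 396
ours_friends, ours_followers = 294, 338

# The authors' printed ratio (1.6851064) matches followers / friends.
authors_followers_per_friend = ratio(authors_followers, authors_friends)

# Our printed ratio (0.8698...) matches friends / followers under this
# column order, so the two rows may not use the same definition.
ours_friends_per_follower = ratio(ours_friends, ours_followers)

print(authors_followers_per_friend, ours_friends_per_follower)
```

If the definitions do differ, part of the gap in this feature would come from the definition rather than from changes in the account over time.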

# Classification

### Tweet Features Classification
The following functions can be called to perform Tweet Features Classification.
#####getting the file with the extracted tweet features
file_name = read_args()
#####read in csv with tweet features
df = pd.read_csv(file_name)
#####preprocess it (Linear regression for missing values and normalizing the numeric values)
df = preprocess(df)
#####prepare data for training and testing (split data samples based on class)
df = prepare(df)
#####train 
models = fit(df)
#####predict
final_predictions = predict(models)
#####calculate accuracy
acc = accuracy(final_predictions)
avg = sum(acc)/len(acc)
print(avg)
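The preprocessing step above fills missing values with linear regression and normalizes the numeric values. The sketch below illustrates that idea on toy data in plain Python. It rests on two assumptions that the steps above do not state: a single fully observed feature is used as the predictor, and normalization is min-max scaling. The function names are hypothetical and this is not the code in `preprocess`.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0
    return slope, mean_y - slope * mean_x


def impute_with_regression(predictor, target):
    """Fill None entries of `target` from a line fitted on the complete rows."""
    known = [(x, y) for x, y in zip(predictor, target) if y is not None]
    slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])
    return [slope * x + intercept if y is None else y
            for x, y in zip(predictor, target)]


def min_max_normalize(values):
    """Scale values linearly to the range [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / float(hi - lo) for v in values]


# Toy data (invented for illustration): text length predicts word count.
text_len = [100, 120, 140, 160]
num_words = [15, 18, None, 24]

filled = impute_with_regression(text_len, num_words)
normalized = min_max_normalize(filled)
print(filled)
print(normalized)
```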

In [69]:
%run C:/Users/imaad/twitteradvancedsearch/fakeimagedetectiontwitter/models/itemFeaturesClassification.py

get file name
read the file
pre processing
preparing data
training
prediction
calc_accuracy
0.862299226359


### User Features Classification
The following functions can be called to perform User Features Classification.
#####getting the file with the extracted user features
file_name = read_args()
#####read in csv with user features
df = pd.read_csv(file_name)
#####preprocess it (Linear regression for missing values and normalizing the numeric values)
df = preprocess(df)
#####prepare data for training and testing (split data samples based on class)
df = prepare(df)
#####train 
models = fit(df)
#####predict
final_predictions = predict(models)
#####calculate accuracy
acc = accuracy(final_predictions)
avg = sum(acc)/len(acc)
print(avg)
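The final step averages a list of accuracies. The sketch below shows that computation on toy labels, assuming `acc` holds one accuracy per trained model. The helper `fraction_correct` is hypothetical, and its signature differs from the `accuracy` function called above.

```python
def fraction_correct(predicted, actual):
    """Share of samples whose predicted label equals the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / float(len(actual))


# Toy labels (invented for illustration).
actual = ["fake", "real", "fake", "fake"]
predictions_per_model = [
    ["fake", "real", "fake", "real"],  # 3 of 4 correct
    ["fake", "real", "fake", "fake"],  # 4 of 4 correct
]

# One accuracy per model, then the mean over models.
acc = [fraction_correct(p, actual) for p in predictions_per_model]
avg = sum(acc) / len(acc)
print(avg)
```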

In [71]:
%run C:/Users/imaad/twitteradvancedsearch/fakeimagedetectiontwitter/models/userFeaturesClassification.py

get file name
read the file
pre processing
preparing data
training
prediction
calc_accuracy
0.788584701736


### Agreement Based Retraining Classification
The following functions can be called to perform Agreement Based Retraining Classification.
#####Get values predicted by Tweet Feature Classifier (stored in a pickle)
import pickle
with open('item_fin_val.pkl', 'rb') as f:
    finVal = pickle.load(f)
#####Get values predicted by User Feature Classifier (stored in a pickle)
with open('user_fin_val.pkl', 'rb') as f:
    user_finVal = pickle.load(f)
#####Check for agreement of TF and UF Classifiers
agreed_values = agreement(finVal, user_finVal)
#####Retrain the disagreed values
agreed_values = retraining(agreed_values)
#####train 
models = fit(df)
#####predict
final_predictions = predict(models)
#####calculate accuracy
accu = calc_accuracy(agreed_values)
avg = sum(accu)/len(accu)
print(avg)
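The steps above can be illustrated with a small self-contained sketch: samples on which the two classifiers agree keep their label and serve as training data, and the samples on which they disagree are relabelled by a model trained on the agreed ones. The nearest-centroid model used here is a stand-in chosen for brevity, not the classifier used in our scripts, and all function names and data are hypothetical.

```python
def split_by_agreement(item_preds, user_preds):
    """Return index lists (agreed, disagreed) for two prediction lists."""
    agreed, disagreed = [], []
    for idx, (a, b) in enumerate(zip(item_preds, user_preds)):
        (agreed if a == b else disagreed).append(idx)
    return agreed, disagreed


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    n = float(len(rows))
    return [sum(col) / n for col in zip(*rows)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def retrain_and_relabel(features, item_preds, user_preds):
    """Relabel disagreed samples with a model trained on agreed samples."""
    agreed, disagreed = split_by_agreement(item_preds, user_preds)
    final = list(item_preds)
    # Group the agreed samples by their (shared) predicted label.
    by_label = {}
    for i in agreed:
        by_label.setdefault(item_preds[i], []).append(features[i])
    if not by_label:
        return final  # nothing to train on; keep the original predictions
    centroids = {label: centroid(rows) for label, rows in by_label.items()}
    # Assign each disagreed sample to the nearest class centroid.
    for i in disagreed:
        final[i] = min(
            centroids,
            key=lambda label: squared_distance(features[i], centroids[label]),
        )
    return final


# Toy data (invented for illustration).
features = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]]
item_preds = ["fake", "real", "real", "real"]
user_preds = ["fake", "real", "fake", "fake"]

final = retrain_and_relabel(features, item_preds, user_preds)
print(final)
```

In this toy case the classifiers agree on the first two samples, and each disagreed sample is assigned to the class whose agreed example lies closer in feature space.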



In [74]:
%run C:/Users/imaad/twitteradvancedsearch/fakeimagedetectiontwitter/models/agreementRetraining.py

get finVal
get user_finVal
get agreed values
retrain
average accuracy
0.900453085145
