# Machine vs Bot: 
## tuning machine learning models to detect bots on Twitter 


This is the code for the experiments presented in the paper "Machine vs Bot: tuning machine learning models to detect bots on Twitter", submitted to the 5th Workshop on Communication Networks and Power Systems (WCNPS) 2020.

Authors:

- Stefano M P C Souza - Department of Electrical Engineering, University of Brasília (UnB), Brasília, Brazil
- Tito Resende - Institute of Computing, University of Campinas (IC/Unicamp), Campinas, Brazil
- José Nascimento - IC/Unicamp, Campinas, Brazil
- Levy G Chaves - IC/Unicamp, Campinas, Brazil
- Darlinne H P Soto - IC/Unicamp, Campinas, Brazil
- Soroor Salavati - IC/Unicamp, Campinas, Brazil

    Abstract — Bot generated content on social media can spread fake news and hate speech, manipulate 
    public opinion and influence the community on relevant topics, such as elections. Thus, bot 
    detection in social media platforms plays an important role for the health of the platforms and
    for the well-being of societies. In this work, we approach the detection of bots on Twitter as a
    binary output problem through the analysis of account features. We propose a pipeline for feature 
    engineering and model training, tuning and selection. We test our pipeline using 3 publicly 
    available bot datasets, comparing the performance of all trained models with the model selected 
    at the end of our pipeline. 
 
    Keywords — machine learning, hyper-parameter, random search, bot detection

The `data` folder contains three datasets copied from previous works:

1. **cresci-stock**: S. Cresci, F. Lillo, D. Regoli, S. Tardelli, and M. Tesconi, “Cash-tag piggybacking:
Uncovering spam and bot activity in stock microblogs on Twitter”, ACM Transactions on the Web (TWEB), vol.
13, no. 2, pp. 1–27, 2019.
2. **botwiki**: K.-C. Yang, O. Varol, P.-M. Hui, and F. Menczer, “Scalable and generalizable social bot
detection through data selection”, Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34,
no. 01, pp. 1096–1103, Apr 2020.
3. **cresci-rtbust**: M. Mazza, S. Cresci, M. Avvenuti, W. Quattrociocchi, and M. Tesconi, “RTbust:
Exploiting temporal patterns for botnet detection on Twitter”, in Proceedings of the 10th ACM Conference
on Web Science, 2019, pp. 183–192.

In [2]:
# Basic libraries used to manipulate and display data
import numpy as np
import pandas as pd
import inspect
import os
import sys

# Add the parent directory to the import path
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0, parentdir)

## Reading the account information received from the Twitter API

We provide the utility function `read_twitter_dataset` to read account data in the JSON format returned by the Twitter API into a pandas DataFrame. Each dataset must contain a `json` file with accounts in the format provided by the Twitter API, and a `tsv` or `csv` file with two columns: the account ID and a boolean that is True when the account was identified as a bot.

In [3]:
from twitter_utils import read_twitter_dataset

cresci_rtbust = read_twitter_dataset('data/cresci-rtbust-2019')
cresci_stock = read_twitter_dataset('data/cresci-stock-2018')
botwiki = read_twitter_dataset('data/botwiki-2019')

# Stack the three datasets into one frame (DataFrame.append was removed in pandas 2.0)
X = pd.concat([cresci_rtbust, cresci_stock, botwiki], ignore_index=True)

# Separate the labels from the features
y = X.is_bot
X = X.drop(columns='is_bot')

X.head()

| | created_at | user.id | user.id_str | user.name | user.screen_name | user.location | user.description | user.url | user.entities.description.urls | user.protected | ... | user.profile_text_color | user.profile_use_background_image | user.has_extended_profile | user.default_profile | user.default_profile_image | user.following | user.follow_request_sent | user.notifications | user.translator_type | user.entities.url.urls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wed May 15 16:00:19 +0000 2019 | 3022357312 | 3022357312 | patatavis | pattavis | | | | [] | False | ... | 333333 | True | False | True | False | False | False | False | none | |
| 1 | Thu May 16 20:07:35 +0000 2019 | 753659579582541824 | 753659579582541824 | #1DMITAMProject ♡ #OneDirectionBestFans | PromotingLouisT | Italy | `➡ here we promote all of the boys' music 🐼💖\n\...` | https://t.co/8HvscWygQd | [] | False | ... | 333333 | True | False | True | False | False | False | False | none | `[{'url': 'https://t.co/8HvscWygQd', 'expanded_...` |
| 2 | Fri May 17 17:22:28 +0000 2019 | 901802279623417856 | 901802279623417856 | Anna 05600885 | Anna10mila | Milano, Lombardia | Sii il cambiamento | | [] | False | ... | 333333 | True | False | True | False | False | False | False | none | |
| 3 | Mon Sep 24 13:01:10 +0000 2018 | 2982392825 | 2982392825 | Hanna | AlleHanna | | | | [] | False | ... | 333333 | True | False | True | False | False | False | False | none | |
| 4 | Wed May 15 16:41:07 +0000 2019 | 825436422609969152 | 825436422609969152 | Zaffiro Blu | ZaffiroBlu3 | | Se puoi sognarlo puoi farlo 💙 | | [] | False | ... | 333333 | True | False | True | False | False | False | False | none | |
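
The dotted column names in the output above, such as `user.screen_name` and `user.entities.url.urls`, are what you get when nested JSON objects are flattened into one row per account. The sketch below illustrates that idea with the standard library only. It is not the implementation of `read_twitter_dataset`; the record, the `flatten` helper and the `labels` lookup are all made up for illustration.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one dict with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            # Lists and scalar values are kept unchanged
            flat[name] = value
    return flat


# Hypothetical, heavily trimmed record shaped like a Twitter API object
raw = json.loads("""
{
  "created_at": "Mon Jan 01 00:00:00 +0000 2018",
  "user": {
    "id": 12345,
    "screen_name": "example_account",
    "entities": {"description": {"urls": []}}
  }
}
""")

# Hypothetical label lookup: account ID -> True if labelled as a bot
labels = {12345: True}

row = flatten(raw)
row["is_bot"] = labels[row["user.id"]]
print(sorted(row))
```

Each flattened dict corresponds to one row of the frame, with the label attached through the account ID.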


## Loading the pipeline and its steps

We have placed the code for the pipeline in the file `machine_vs_bot.py`. We organized the code so that the pipeline steps presented in the paper can be clearly identified. For those familiar with scikit-learn's Pipeline, the code will be really easy to read and understand.

In [4]:
from machine_vs_bot import DataSplitStep, FeatureEngineeringStep
from machine_vs_bot import ModelTuningStep, ModelSelectionStep, ModelTestStep
from machine_vs_bot import BotClassifierPipeline

We have also created transformers customized for Twitter account feature generation and selection. For the sake of readability, we have placed the code specific to Twitter in the `twitter_utils.py` file. The pipeline can be easily extended to fit bot detection classifiers for other platforms just by creating specific transformers for each platform.

In [5]:
from twitter_utils import twitter_feature_generation_transformer, twitter_feature_selection_transformer
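
As a rough illustration of what a platform-specific transformer could look like, the sketch below follows the `fit`/`transform` convention used by scikit-learn transformers without depending on the library. The class name, the input fields and the derived features are hypothetical, and whether `FeatureEngineeringStep` would accept such an object unchanged is an assumption.

```python
class AccountRatioFeatures:
    """Hypothetical feature generator following the fit/transform convention."""

    def fit(self, X, y=None):
        # Nothing to learn: the derived features are stateless
        return self

    def transform(self, X):
        rows = []
        for account in X:
            # Guard the denominators against zero
            following = max(account["following"], 1)
            age_days = max(account["age_days"], 1)
            rows.append({
                "followers_per_following": account["followers"] / following,
                "posts_per_day": account["posts"] / age_days,
            })
        return rows

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Made-up accounts from an imaginary platform
accounts = [
    {"followers": 10, "following": 200, "posts": 5000, "age_days": 50},
    {"followers": 300, "following": 150, "posts": 90, "age_days": 900},
]
features = AccountRatioFeatures().fit_transform(accounts)
print(features)
```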

We run our experiments with 3 different options for feature scaling: 'none', 'standard' (for the `StandardScaler` standardization transform) and 'min-max' (for the `MinMaxScaler`); and 3 different cross-validation scoring functions: `roc_auc_score`, `f1_score` and `accuracy_score`. We fixed `roc_auc_score` as the only validation scoring function so that all models can be compared on the same metric.
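
For readers unfamiliar with the two scalers, here is a minimal pure-Python sketch of what each does to a single feature column. The data is made up, and the helper functions are illustrative stand-ins, not the scikit-learn classes themselves (which also handle constant columns, where the divisions below would fail).

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Shift to zero mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


followers = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up feature column
print(standard_scale(followers))
print(min_max_scale(followers))
```

Tree-based models are largely insensitive to these monotonic rescalings, while distance-based and linear models usually are not, which is one reason to treat scaling as an experimental factor.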

In [6]:
# Cartesian product of the scaling options and the CV scoring functions
experiments = np.array(np.meshgrid(['none', 'standard', 'min-max'], ['roc_auc', 'f1', 'accuracy'])).T.reshape(-1,2)
info = pd.DataFrame()
best = pd.DataFrame()

# Run a pipeline for each combination, recording the performance of all the models
# on the validation set and the performance of the model selected in each pipeline
# on the test set
for scaling, cv_scoring in experiments:
    pipe = BotClassifierPipeline([
        ('data_split', DataSplitStep()),
        ('feature_engineering', FeatureEngineeringStep(
            feature_generation=twitter_feature_generation_transformer,
            feature_selection=twitter_feature_selection_transformer,
            feature_scaling=scaling
        )),
        ('model_tuning', ModelTuningStep(scoring=cv_scoring)),
        ('model_selection', ModelSelectionStep(scoring='roc_auc')),
        ('model_test', ModelTestStep())
    ], name=f'{scaling}.{cv_scoring}', n_iter=20, save_models=True).fit(X, y)
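    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    # (assumes pipe.info() and pipe.score() return DataFrames, as the outputs below suggest)
    info = pd.concat([info, pipe.info()], ignore_index=True)
    best = pd.concat([best, pipe.score()])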
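
The paper's keywords mention random search, and the `n_iter=20` argument above presumably sets how many hyper-parameter combinations are sampled per model. The internals of `ModelTuningStep` are not shown here, so the following is only a generic sketch of random search. The `random_search` helper, the toy objective and the parameter space are all hypothetical.

```python
import random


def random_search(score_fn, param_space, n_iter=20, seed=42):
    """Sample n_iter random parameter combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(options) for name, options in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Toy stand-in for a cross-validated score, peaking at (500, 10)."""
    return (1.0
            - abs(params["n_estimators"] - 500) / 1000
            - abs(params["max_depth"] - 10) / 100)


space = {
    "n_estimators": list(range(100, 1001, 100)),
    "max_depth": list(range(2, 21, 2)),
}
best_params, best_score = random_search(toy_score, space)
print(best_params, best_score)
```

Unlike an exhaustive grid, the cost is fixed by `n_iter` regardless of how many parameters or candidate values the space contains.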

Now we can study the details of the selected classifier for each pipeline run:

In [7]:
best

| | pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | afr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | none.roc_auc | none | Random Forest | `{'n_estimators': 566, 'min_samples_split': 2, ...` | roc_auc | 0.861323 | roc_auc | 0.95183 | 0.867008 | 0.867061 | 0.866158 | 0.866599 |
| 0 | none.f1 | none | Random Forest | `{'n_estimators': 566, 'min_samples_split': 2, ...` | f1 | 0.875722 | roc_auc | 0.95183 | 0.867008 | 0.867061 | 0.866158 | 0.866599 |
| 0 | none.accuracy | none | Random Forest | `{'n_estimators': 566, 'min_samples_split': 2, ...` | accuracy | 0.862455 | roc_auc | 0.95183 | 0.867008 | 0.867061 | 0.866158 | 0.866599 |
| 0 | standard.roc_auc | standard | XGBoost | `{'learning_rate': 0.037999397928169254, 'max_d...` | roc_auc | 0.860866 | roc_auc | 0.854664 | 0.861466 | 0.861516 | 0.860522 | 0.861009 |
| 0 | standard.f1 | standard | XGBoost | `{'learning_rate': 0.037999397928169254, 'max_d...` | f1 | 0.875338 | roc_auc | 0.854664 | 0.861466 | 0.861516 | 0.860522 | 0.861009 |
| 0 | standard.accuracy | standard | XGBoost | `{'learning_rate': 0.037999397928169254, 'max_d...` | accuracy | 0.862028 | roc_auc | 0.854664 | 0.861466 | 0.861516 | 0.860522 | 0.861009 |
| 0 | min-max.roc_auc | min-max | Random Forest | `{'n_estimators': 566, 'min_samples_split': 2, ...` | roc_auc | 0.861081 | roc_auc | 0.925237 | 0.865729 | 0.865746 | 0.864574 | 0.865156 |
| 0 | min-max.f1 | min-max | XGBoost | `{'learning_rate': 0.1036111842654619, 'max_dep...` | f1 | 0.875318 | roc_auc | 0.858628 | 0.858483 | 0.85855 | 0.857635 | 0.858079 |
| 0 | min-max.accuracy | min-max | Random Forest | `{'n_estimators': 566, 'min_samples_split': 2, ...` | accuracy | 0.862455 | roc_auc | 0.925237 | 0.865729 | 0.865746 | 0.864574 | 0.865156 |
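
To make the `accuracy`, `f1` and `roc_auc` columns concrete, here is a small pure-Python sketch of the three metrics for a binary bot/human problem. The labels and scores are made up. How the pipeline itself thresholds predictions or averages F1 is not shown in this document, so treat this as a reference for the definitions only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (bot) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn)


def roc_auc(y_true, y_score):
    """Probability that a random bot outscores a random human (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]                    # 1 = bot, 0 = human (made up)
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]       # made-up classifier scores
y_pred = [int(s >= 0.5) for s in y_score]      # hard labels at a 0.5 threshold

print(accuracy(y_true, y_pred), f1(y_true, y_pred), roc_auc(y_true, y_score))
```

Accuracy and F1 depend on the chosen threshold, while ROC AUC depends only on how the scores rank the accounts.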


And the best validation score for each classifier:

In [8]:
models = info.sort_values(by='validation_score', ascending=False).groupby('name').first().reset_index()
models

| | name | pipeline | scaling | scoring | model | params | best_score | mean_score | std_score | run_time | validation_scoring | validation_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baggin | none.f1 | none | f1 | `(DecisionTreeClassifier(random_state=195292617...` | `{'n_estimators': 411, 'max_samples': 0.7000000...` | 0.875489 | 0.875489 | 0.004247 | 11.693905 | roc_auc | 0.949762 |
| 1 | Decision Tree | none.accuracy | none | accuracy | `DecisionTreeClassifier(criterion='entropy', ma...` | `{'min_samples_leaf': 1, 'max_features': 10, 'm...` | 0.800979 | 0.800979 | 0.008484 | 0.761214 | roc_auc | 0.826921 |
| 2 | K-Nearest Neighbors | none.roc_auc | none | roc_auc | `KNeighborsClassifier(algorithm='ball_tree', n_...` | `{'weights': 'distance', 'n_neighbors': 14, 'al...` | 0.809983 | 0.809983 | 0.008247 | 0.395795 | roc_auc | 0.935024 |
| 3 | Logistic Regression | standard.roc_auc | standard | roc_auc | `LogisticRegression(fit_intercept=False, max_it...` | `{'penalty': 'l2', 'max_iter': 147, 'fit_interc...` | 0.647329 | 0.647329 | 0.006773 | 0.406529 | roc_auc | 0.652058 |
| 4 | Random Forest | none.roc_auc | none | roc_auc | `(DecisionTreeClassifier(max_depth=45, max_feat...` | `{'n_estimators': 566, 'min_samples_split': 2, ...` | 0.861323 | 0.861323 | 0.006239 | 10.114457 | roc_auc | 0.95183 |
| 5 | Support Vector Machines | standard.f1 | standard | f1 | `SVC(C=89, gamma='auto', random_state=42)` | `{'kernel': 'rbf', 'gamma': 'auto', 'C': 89}` | 0.759334 | 0.759334 | 0.002617 | 9.497426 | roc_auc | 0.732568 |
| 6 | XGBoost | min-max.f1 | min-max | f1 | `XGBClassifier(base_score=0.5, booster='gbtree'...` | `{'learning_rate': 0.1036111842654619, 'max_dep...` | 0.875318 | 0.875318 | 0.004492 | 6.042803 | roc_auc | 0.858628 |
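
The "sort, then take the first row per group" idiom used above can be sketched in plain Python as follows. The rows are made up and only stand in for the `info` frame. One caveat: pandas' `GroupBy.first()` picks the first non-null value in each column, which coincides with the top-scoring row of each group only when that row has no missing values, whereas this sketch always keeps whole rows.

```python
# Made-up rows standing in for the `info` frame (names and scores are illustrative)
rows = [
    {"name": "model_a", "pipeline": "none.f1", "validation_score": 0.91},
    {"name": "model_b", "pipeline": "none.f1", "validation_score": 0.84},
    {"name": "model_a", "pipeline": "standard.f1", "validation_score": 0.88},
    {"name": "model_b", "pipeline": "min-max.f1", "validation_score": 0.86},
]

best_by_name = {}
for row in sorted(rows, key=lambda r: r["validation_score"], reverse=True):
    # The first row seen for each name is its highest-scoring one
    best_by_name.setdefault(row["name"], row)

# Order the result by model name, as groupby does by default
models = [best_by_name[name] for name in sorted(best_by_name)]
for row in models:
    print(row["name"], row["pipeline"], row["validation_score"])
```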
