**<h1><center>TWIBOT-20</center></h1>**

In [None]:
# to print all output for a cell instead of only last one 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

import sys
sys.path.insert(0, "..")

In [None]:
import utils_analytics as ut
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
import os 
from pandas.core.common import flatten

from src.process import process_account_dataset
from sklearn.ensemble import RandomForestClassifier

from string import punctuation
import nltk 
from nltk.corpus import stopwords

## **Introduction**
For this project we decided to focus on data. Our goal was to understand what kind of information is needed to accurately discriminate bots from humans on the Twitter social media platform. To do this, we fed our models progressively more data from multiple sources to discover which pieces of information work best for this task. 

We used (nearly) the same model at all steps, with increasingly more data to learn from: 
- the text from only one tweet per user 
- the text from one tweet + custom metadata features on that tweet 
- the text from multiple tweets + custom statistical features on the aggregated tweets
- all the above + custom features from the user account information 

Given the same pool of users, a tweet-based bot detection approach would have significantly more labeled examples to exploit at training time, and would be much faster and more flexible when actually deployed. However, prior results in bot detection suggest that tweet text alone is not highly predictive of bot accounts, and many works on account-level classification have found that user metadata tends to be the best predictor for bot detection. 

In this project we tackle both account-level and tweet-level bot detection with a custom DL model based on LSTM. Given the tabular nature of the metadata features, and with a view to achieving better model efficiency and interpretability, we also decided to test the performance of a Random Forest classifier on the detection task. 

## **Tweet level**

Rather than using a lot of features obtained by analysing and processing a user's behaviour (including but not limited to tweets), which is difficult and slow to do, one can start from the text data of individual tweets and see if it is possible to differentiate between bots and non-bots in this way.
The types of data chosen as input for the corresponding models are:
- Single tweets
- Single tweet plus that tweet's text features 
- Multiple tweets plus those tweets' text features 

The following sections analyse each dataset and the features derived from it.

### Preprocessing text

In [None]:
tweets_df, account_df = ut.loadData()
tweets_df.head()

In [None]:
print(tweets_df.loc[9,'tweet'])

As can be seen, tweets usually contain hashtags, mentions, URLs, etc., which are hard to encode as useful information to feed to an LSTM. A preprocessing step is therefore needed to reduce the number of OOV (out-of-vocabulary) words while preserving as much as possible of the information these elements carry.

Preprocessing steps for each individual tweet:
- 'RT' -> ' retweet '
- '\n' -> ' '
- '$apple' -> ' stock '
- '@' -> ' email '
- '1,2,3..' -> ' number '
- '$,£..' -> ' money '
- '#' -> ' hashtag '
- '@pontifex' -> ' username '
- 'http,https..' -> 'url'
- 'ahah, haha, ajaj, jaja' -> 'ahah'
- '-' -> ' '
- "'" -> " '"
- Remove tweets that are too short (minimum 3 tokens required)
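
The replacement rules above can be sketched with a few regular expressions. This is only a minimal illustration, not the project's actual preprocessing code: the regex patterns, the order in which the rules are applied, the whitespace tokenization and the `preprocess` name are all assumptions made for this example.

```python
import re

# Ordered (pattern, replacement) pairs. The patterns and their order are
# assumptions of this sketch; URLs and e-mail addresses are handled first so
# that later rules do not split them.
RULES = [
    (re.compile(r"https?://\S+"), " url "),
    (re.compile(r"\bRT\b"), " retweet "),
    (re.compile(r"\n"), " "),
    (re.compile(r"\S+@\S+\.\S+"), " email "),
    (re.compile(r"@\w+"), " username "),
    (re.compile(r"\$[A-Za-z]+"), " stock "),   # cashtag such as $apple
    (re.compile(r"[$£€]"), " money "),
    (re.compile(r"#"), " hashtag "),
    (re.compile(r"\d+(?:[.,]\d+)*"), " number "),
    # laughter variants (ahah, haha, ajaj, jaja, ...) collapse to one token
    (re.compile(r"\b(?:a?(?:[hj]a){2,}[hj]?|(?:a[hj]){2,}a?)\b", re.I), " ahah "),
    (re.compile(r"-"), " "),
    (re.compile(r"'"), " '"),
]


def preprocess(tweet, min_tokens=3):
    """Apply the rules in order; return the tokens, or None if the tweet is too short."""
    for pattern, replacement in RULES:
        tweet = pattern.sub(replacement, tweet)
    tokens = tweet.split()
    return tokens if len(tokens) >= min_tokens else None


example = "RT @pontifex: buy $apple now for $5 #deal https://t.co/x haha"
print(preprocess(example))
print(preprocess("ok"))  # fewer than 3 tokens, so the tweet is dropped
```

Rule order matters here: the cashtag rule must run before the currency rule, otherwise `$apple` would be turned into `money apple`.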

Why we do it this way:
- cashtag, money, emoji:
- exclusion of short tweets: we decided to discard tweets that had fewer than 3 tokens after preprocessing. This 'cleans' the dataset of tweets that carry little information (even for a human), which would have acted as outliers and degraded performance
- English vs other languages: 
- FastText vs GloVe:

Cleaned text:

Print examples of cleaned text

In [None]:
dataset_df = pd.read_pickle(ut.DATA_FOLDER / 'processed_dataset_v1.pkl')
dataset_df.head()

In [None]:
print(dataset_df.loc[9,'processed_tweet'])

### Text of single tweet
- model architecture
- from text to embeddings 
- hyperparameter tuning 
- motivate: dropout, weight decay, class imbalance  
- results

### Text of single tweet + text features 
- Why we use features (the LSTM does not give much weight to statistical properties of the text such as RT, hashtags, counts of elements, etc.)
- How we feed them to the model: model structure + z-score 
- Features used:
    - Is retweet? Yes/No
    - N° of URLs, tags, hashtags, cashtags, currency symbols, emails, numbers, emoticons, emojis, stopwords, punctuation
- Why these features? Relevance, correlation analysis
- Example tweets showing the relevance of the features (bots tend to have more mentions, hashtags, etc.) and results 
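
To make the feature list above more concrete, here is a minimal sketch of how count features could be derived from a preprocessed token list and then z-scored. The mapping from column names to placeholder tokens, the choice to fit the statistics on training rows only, and the helper names (`count_features`, `fit_zscore`, `apply_zscore`) are assumptions of this example, not the project's implementation. `is_rt` is left as a 0/1 flag, in line with the `is_rt == 1.0` comparisons used later in this notebook.

```python
from statistics import mean, pstdev

# Hypothetical mapping from feature column to the placeholder token it counts.
PLACEHOLDERS = {
    "url_c": "url",
    "tag_c": "username",
    "hashtag_c": "hashtag",
    "cashtag_c": "stock",
    "money_c": "money",
    "email_c": "email",
    "number_c": "number",
}


def count_features(tokens):
    """Count placeholder tokens in one preprocessed tweet."""
    feats = {name: float(tokens.count(tok)) for name, tok in PLACEHOLDERS.items()}
    feats["len_tweet"] = float(len(tokens))
    feats["is_rt"] = 1.0 if tokens and tokens[0] == "retweet" else 0.0
    return feats


def fit_zscore(train_rows, skip=("is_rt",)):
    """Compute (mean, std) per column on the training rows only."""
    stats = {}
    for name in train_rows[0]:
        if name in skip:
            continue
        values = [row[name] for row in train_rows]
        sigma = pstdev(values)
        stats[name] = (mean(values), sigma if sigma > 0 else 1.0)
    return stats


def apply_zscore(row, stats):
    """Standardize the fitted columns; leave the others (e.g. is_rt) untouched."""
    return {
        name: (value - stats[name][0]) / stats[name][1] if name in stats else value
        for name, value in row.items()
    }


train = [
    count_features(["retweet", "username", "url"]),
    count_features(["hello", "hashtag", "world", "url", "url"]),
]
stats = fit_zscore(train)
z_first = apply_zscore(train[0], stats)
print(z_first["url_c"], z_first["is_rt"])
```

Fitting the mean and standard deviation on the training split and reusing them for validation and test rows avoids leaking information from the held-out data into the inputs.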


In [None]:
singletweet_features_df = pd.read_pickle(ut.DATA_FOLDER / 'processed_dataset_v2.pkl')
singletweet_features_df.head()

In [None]:
plt.figure(figsize=(20,10))
cor = singletweet_features_df.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# feature selection
def select_features(X_train, y_train, X_test):
	# configure to select all features
	fs = SelectKBest(score_func=f_classif, k='all')
	# learn relationship from training data
	fs.fit(X_train, y_train)
	# transform train input data
	X_train_fs = fs.transform(X_train)
	# transform test input data
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs, fs

In [None]:
feature_columns = ['is_rt','url_c','tag_c','hashtag_c','cashtag_c','money_c','email_c','number_c','emoji_c','emoticon_c','len_tweet','stopwords_c','punct_c']
train_ds = singletweet_features_df[singletweet_features_df['split'] == 'train'].reset_index(drop=True)
val_ds = singletweet_features_df[singletweet_features_df['split'] == 'val'].reset_index(drop=True)
test_ds = singletweet_features_df[singletweet_features_df['split'] == 'test'].reset_index(drop=True)

X_train = train_ds[feature_columns]
y_train = train_ds['label']

X_test = val_ds[feature_columns]
y_test = val_ds['label']

# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# what are scores for the features
for i in range(len(fs.scores_)):
	print(f'{i} -> {feature_columns[i]}: {fs.scores_[i]/sum(fs.scores_)*100:.3f}%')
# plot the scores

plt.figure(figsize=(30,10))
plt.bar([i for i in range(len(fs.scores_))], fs.scores_)
plt.show()

As can be seen, the 'is_rt' feature (Is retweet? Yes/No) is strongly correlated with the label (Bot/Human). We therefore look at the percentages of bot tweets that are retweets versus 'new' posts.

In [None]:
num_rt_bot = singletweet_features_df[(singletweet_features_df['is_rt'] == 1.0) & (singletweet_features_df['label'] == 1.0)].shape[0]
num_nort_bot = singletweet_features_df[(singletweet_features_df['is_rt'] == 0.0) & (singletweet_features_df['label'] == 1.0)].shape[0]
num_tweets = singletweet_features_df.shape[0]
print(f'Number of tweets from bots which are retweet: {num_rt_bot} - ({num_rt_bot/num_tweets*100:.1f}%)')
print(f'Number of tweets from bots which are not retweet: {num_nort_bot} - ({num_nort_bot/num_tweets*100:.1f}%)\n')

num_rt_human = singletweet_features_df[(singletweet_features_df['is_rt'] == 1.0) & (singletweet_features_df['label'] == 0.0)].shape[0]
num_nort_human = singletweet_features_df[(singletweet_features_df['is_rt'] == 0.0) & (singletweet_features_df['label'] == 0.0)].shape[0]
print(f'Number of tweets from humans which are retweet: {num_rt_human} - ({num_rt_human/num_tweets*100:.1f}%)')
print(f'Number of tweets from humans which are not retweet: {num_nort_human} - ({num_nort_human/num_tweets*100:.1f}%)\n')

print(f"Pearson Correlation:\n{singletweet_features_df[['is_rt','label']].corr()}")

So, if a tweet is a retweet, the user is twice as likely to actually be a bot.
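
A "twice as likely" statement is a comparison of conditional rates, i.e. P(bot | retweet) against P(bot | not retweet), whereas the cell above normalizes every count by the total number of tweets. The sketch below shows the conditional computation on made-up toy counts; the numbers and the `bot_rate_by_retweet` helper are purely illustrative and are not taken from the dataset.

```python
def bot_rate_by_retweet(rows):
    """rows: iterable of (is_rt, label) pairs with 0/1 values.

    Returns {is_rt: share of tweets in that group written by bots}.
    """
    counts = {0: [0, 0], 1: [0, 0]}  # is_rt -> [n_tweets, n_bot_tweets]
    for is_rt, label in rows:
        counts[is_rt][0] += 1
        counts[is_rt][1] += label
    return {k: (bots / n if n else float("nan")) for k, (n, bots) in counts.items()}


# Toy data (hypothetical): 5 retweets of which 4 by bots, 5 original posts of which 2 by bots
toy_rows = [(1, 1)] * 4 + [(1, 0)] * 1 + [(0, 1)] * 2 + [(0, 0)] * 3
rates = bot_rate_by_retweet(toy_rows)
print(f"P(bot | retweet)     = {rates[1]:.2f}")
print(f"P(bot | not retweet) = {rates[0]:.2f}")
print(f"ratio                = {rates[1] / rates[0]:.2f}")
```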

Let us now run the same study on the 'cashtag_c' feature, which holds the number of cashtags in each individual tweet and appears to be only very weakly correlated with the final label.

In [None]:
print(f"Pearson Correlation:\n{singletweet_features_df[['cashtag_c','label']].corr()}")

Finally, we compare the average number of URLs used per tweet by bots and humans with the corresponding average number of hashtags.

In [None]:
mean_url_bot = singletweet_features_df[singletweet_features_df['label'] == 1.0]['url_c'].mean()
mean_url_nobot = singletweet_features_df[singletweet_features_df['label'] == 0.0]['url_c'].mean()
print(f"Average z-score of URLs per single tweet by bot user: {mean_url_bot:.3f}")
print(f"Average z-score of URLs per single tweet by human user: {mean_url_nobot:.3f}")
print(f"Difference: {abs(mean_url_bot - mean_url_nobot):.3f}")

In [None]:
mean_hashtag_bot = singletweet_features_df[singletweet_features_df['label'] == 1.0]['hashtag_c'].mean()
mean_hashtag_nobot = singletweet_features_df[singletweet_features_df['label'] == 0.0]['hashtag_c'].mean()
print(f"Average z-score of hashtags per single tweet by bot user: {mean_hashtag_bot:.3f}")
print(f"Average z-score of hashtags per single tweet by human user: {mean_hashtag_nobot:.3f}")
print(f"Difference: {abs(mean_hashtag_bot - mean_hashtag_nobot):.3f}")

Here the margin is markedly wider (more than double).

### Multi tweets + text features 
- Why? With more tweets the picture is clearer and there is more information, both in the text and in the metadata, whose analysis becomes more statistically meaningful (number of identical RTs, average number of hashtags per tweet, etc.) 
- Which features, with explanation: 
    - feature relevance (why a feature makes sense) 
    - correlation and examples taken from the dataset 
- Results 


In [None]:
multitweet_features_df = pd.read_pickle(ut.DATA_FOLDER / 'processed_dataset_v3.pkl')
multitweet_features_df.head()

In [None]:
plt.figure(figsize=(20,10))
cor = multitweet_features_df.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

In [None]:
feature_columns = multitweet_features_df.columns.difference(
	['account_id','label','split','tweet','processed_tweet','n_processed_tweet','n_tweet']).tolist()
train_ds = multitweet_features_df[multitweet_features_df['split'] == 'train'].reset_index(drop=True)
val_ds = multitweet_features_df[multitweet_features_df['split'] == 'val'].reset_index(drop=True)
test_ds = multitweet_features_df[multitweet_features_df['split'] == 'test'].reset_index(drop=True)

X_train = train_ds[feature_columns]
y_train = train_ds['label']

X_test = val_ds[feature_columns]
y_test = val_ds['label']

# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# what are scores for the features
for i in range(len(fs.scores_)):
	print(f'{i} -> {feature_columns[i]}: {fs.scores_[i]/sum(fs.scores_)*100:.3f}%')
# plot the scores

plt.figure(figsize=(30,10))
plt.bar([i for i in range(len(fs.scores_))], fs.scores_)
plt.show()

Analysis of the number of retweets per account:

In [None]:
NUM_TW_FEATURES = 30
NUM_TW_TXT = 10

def retweet_count(proc_sentence : list):
    return proc_sentence.count('retweet')

df = dataset_df.copy(deep=True)

# AGGREGATE TWEET FROM SAME ACCOUNT 
aggregation_functions = {'account_id': 'first', 'tweet': lambda x : x.tolist(), 'label': 'first', 'split': 'first','processed_tweet': lambda x : x.tolist()}
df = df.groupby(df['account_id'],as_index=False,sort=False).agg(aggregation_functions) 
df = df[df['tweet'].map(lambda x: len(x)) >= NUM_TW_FEATURES].reset_index(drop=True)
df['n_processed_tweet'] = df['processed_tweet'].map(lambda x: x[:NUM_TW_FEATURES]).apply(lambda x : list(flatten(x)))
df['rt_count'] = df['n_processed_tweet'].apply(retweet_count)

In [None]:
mean_rt_bot = df[df['label'] == 1.0]['rt_count'].mean()
mean_rt_nobot = df[df['label'] == 0.0]['rt_count'].mean()
mean_rt = df['rt_count'].mean()
print(f"Average number of retweets: {mean_rt:.3f}")
# the '+' format flag prints the sign of the relative difference, whatever its direction
print(f"Average number of retweets by bot users: {mean_rt_bot:.3f} ({(mean_rt_bot-mean_rt)/mean_rt*100:+.3f}%)")
print(f"Average number of retweets by human users: {mean_rt_nobot:.3f} ({(mean_rt_nobot-mean_rt)/mean_rt*100:+.3f}%)")
print(f"Pearson Correlation:\n{multitweet_features_df[['rt_count','label']].corr()}")

Analysis of distinct words per account:

In [None]:
nltk.download('stopwords',ut.DATA_FOLDER)

sw = stopwords.words('english')

def clean_tweet(tweet: list ):
        to_remove = ['retweet','username','hashtag','url','emoticon','emoji','number','stock','money','email']
        return [x for x in tweet if x not in to_remove and x not in punctuation and x not in sw]

def unique_words_ratio(sentence_list : list[list]):
    s = []
    for sentence in sentence_list:
        if sentence[0] != 'retweet':
            s.extend(clean_tweet(sentence))
    
    if s : return len(set(s)) / len(s)
    else : return 1.0

df['unique_words_ratio'] = df['n_processed_tweet'].apply(unique_words_ratio)

In [None]:
mean_unique_words_bot = df[df['label'] == 1.0]['unique_words_ratio'].mean()
mean_unique_words_nobot = df[df['label'] == 0.0]['unique_words_ratio'].mean()
mean_unique_words = df['unique_words_ratio'].mean()
print(f"Average unique words used by each account in general: {mean_unique_words:.3f}")
# the '+' format flag prints the sign of the relative difference, whatever its direction
print(f"Average unique words used by each bot: {mean_unique_words_bot:.3f} ({(mean_unique_words_bot-mean_unique_words)/mean_unique_words*100:+.3f}%)")
print(f"Average unique words used by each human: {mean_unique_words_nobot:.3f} ({(mean_unique_words_nobot-mean_unique_words)/mean_unique_words*100:+.3f}%)")
print(f"Pearson Correlation:\n{multitweet_features_df[['unique_words_ratio','label']].corr()}")

Example of a user whose tweets mostly repeat the same words:

In [None]:
df.loc[df['unique_words_ratio'].idxmin(),'tweet'][:3]

Example of a user whose tweets tend to use different words:

In [None]:
df.loc[df['unique_words_ratio'][:100].idxmax(),'tweet'][:10]

## **Account level**
(As the results show, text alone, even when complemented by other features, is not enough. We can therefore use the general account data available in the dataset.)
- Features used:
    
    According to the paper *"BotSpot: Deep Learning Classification of Bot Accounts Within Twitter"*, recent studies show that similarly high performance can be achieved with a minimal number of features. For this reason we decided to use only 21 features and let the Random Forest choose the most relevant ones. This approach favors model efficiency and interpretability (a limited set of features with a clear meaning, like those provided by account metadata, makes it possible to build interpretable models).
    
    *"TweezBot: An AI-Driven Online Media Bot Identification Algorithm for Twitter Social Networks"* finds a strong correlation between **verification status**, **listed count**, **number of friends and followers**, and **default profile** on the one hand and the target classification variable on the other. 
    In particular, regarding the **followers/following features**, bot profiles often buy followers from online black-marketeers to mimic an influential public status. On Twitter, friends are the specific users that a given user chooses to follow (i.e., following). For bots, the friend count is concentrated towards the bottom, indicating a low following, whereas for non-bots the ratio is uniform, as the number of followers is not exorbitantly high and the number of friends is relatively uniform up to a certain degree. A key observation is that bots do not experience an increase in their follower count even when they post more and stay extremely active as part of their attempt to mimic natural behavior. For non-bots, instead, the follower count grows at a certain rate with the number of statuses.
    The **listed count feature**, on the other hand, indicates the number of public lists that the user is a member of. The authors find that non-bots have clearly been added to a number of lists, which increases their listed count, while bots have not been added to many lists because of their suspicious nature and lack of credibility, though some bots may also show non-malicious behavior.
    Finally, bot users often try to conceal their personal information. This is apparent from the fact that the profile-based parameters for bots are missing. This indicates collusion in the online social network caused by such suspicious users. Having less information in the public domain actually helps these bots disappear rapidly after fraudulent activity. In particular, the most commonly missing features for bots are: **location** (present or not), **description** (present or not), **url** (present or not) and **verified account** (yes or no).

    In another paper, *"Detecting Social Bots on Twitter: A Literature Review"*, the authors computed 20 of the most common features as input to their model. They identify the social connections and ranking of users by performing content and behavioral analysis. For example, if an **account is verified** or protected, this is a logical indicator that it is a human account rather than a bot account. The profile features extracted from the metadata, such as **profile image**, **screen name**, and **description**, may also indicate the nature of the account. For example, a **default profile image** is a sign of a new user or a bot account. Temporal patterns, such as the **average of tweeting ratios**, can be a sign of bot activity if tweets occur with small inter-arrival times. Therefore, using an entropy component to detect behavior as part of the classification system is essential. In addition, the rate of **posting similar content** with URLs can be an indicator of a spammer. In other words, the **URL feature** can be used to detect the link farming behavior that is typically employed by spammers and bot accounts. Also, the **mention feature**, in association with the URL and number-of-links features and the entropy of tweets, can indicate a bot account with malicious intent. Moreover, if the **number of followers** is high yet the account is relatively new, it is likely that the followers are fake and the account is a bot.

- Account verification feature: 

    We saw that 'is_verified' is the most important feature for the decision and that the RF is biased towards it, to the point that using this feature alone for classification yields 81% accuracy, which is competitive with the best results reported in other papers. The problem is that this behaviour is unrealistic, since in reality unverified humans far outnumber verified humans. So one can conclude that an account is not a bot if it is verified, but certainly not the converse:
    - 'is_verified': False & 'label': 0 (no bot) -> tot: 1538, perc: 18.7036%
    - 'is_verified': True & 'label': 0 (no bot) -> tot: 2054, perc: 24.9787%
    - 'is_verified': False & 'label': 1 (bot) -> tot: 4630, perc: 56.3054%
    - 'is_verified': True & 'label': 1 (bot) -> tot: 1, perc: 0.0121%

    Using this feature for classification on TwiBot-20 is therefore a hack that yields unrealistically high results. For this reason we decided not to use 'is_verified' as an input to the RF, even though all the other papers use it and nonetheless obtain results comparable to ours.

- Model used and why: 
    - The RF works better on tabular data, but we could not use it for text
    - No normalization is needed
    - General advantages of random forests 
    - Choices for the RF:
        - Number of trees 
        - Number of features 
        - Depth 
        - Max features per node 
        - Other 
        - Tuning
    - Correlation analysis (for free with the RF ??) and explainability 
    - Results 
- Conclusions 
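
The 'is_verified' counts listed above are enough to check the single-feature baseline by hand. The sketch below assumes the simplest possible rule (predict "human" if the account is verified, "bot" otherwise) and uses only the four counts reported in the list. The rule itself is an assumption about how the 81% figure was obtained.

```python
# (is_verified, label) -> number of accounts, copied from the list above
counts = {
    (False, 0): 1538,  # unverified humans
    (True, 0): 2054,   # verified humans
    (False, 1): 4630,  # unverified bots
    (True, 1): 1,      # verified bots
}

total = sum(counts.values())

# Assumed rule: verified -> human (label 0), unverified -> bot (label 1)
correct = counts[(True, 0)] + counts[(False, 1)]
accuracy = correct / total

# How reliable each branch of the rule is on this dataset
p_human_given_verified = counts[(True, 0)] / (counts[(True, 0)] + counts[(True, 1)])
p_bot_given_unverified = counts[(False, 1)] / (counts[(False, 1)] + counts[(False, 0)])

print(f"accounts: {total}, correctly classified: {correct}")
print(f"accuracy of the single-feature rule: {accuracy:.4f}")
print(f"P(human | verified)   = {p_human_given_verified:.4f}")
print(f"P(bot   | unverified) = {p_bot_given_unverified:.4f}")
```

The "verified implies human" branch is almost always right, while the "unverified implies bot" branch is much weaker, which matches the asymmetry described above.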


In [None]:
NUM_ESTIMATORS = 100
CLASS_WEIGHT = 'balanced'
RND_STATE = 18

account_df = process_account_dataset(account_df,False)

train = account_df[account_df['split'] == 'train'].reset_index(drop=True)
val = account_df[account_df['split'] == 'val'].reset_index(drop=True)
test = account_df[account_df['split'] == 'test'].reset_index(drop=True)

X_train, y_train = train.drop(columns=["account_id", "label", "split"], axis=1), train["label"]
X_val, y_val = val.drop(columns=["account_id", "label", "split"], axis=1), val["label"]
X_test, y_test = test.drop(columns=["account_id", "label", "split"], axis=1), test["label"]

rf = RandomForestClassifier(n_estimators=NUM_ESTIMATORS, class_weight=CLASS_WEIGHT, random_state=RND_STATE)
rf.fit(X_train, y_train)

plt.figure(figsize=(20, 10))
plt.barh(train.drop(columns=["account_id", "label", "split"], axis=1).columns, rf.feature_importances_)
plt.show()
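
As a side note on `CLASS_WEIGHT = 'balanced'`: in scikit-learn this mode weights each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that formula in plain Python. The class counts are obtained by summing the 'is_verified' breakdown reported earlier and are used here only as an illustration; the weights actually used by the forest depend on the training split.

```python
def balanced_class_weights(class_counts):
    """Weights inversely proportional to class frequency: n / (k * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: n_samples / (n_classes * count) for c, count in class_counts.items()}


# Illustrative counts: 1538 + 2054 humans (label 0), 4630 + 1 bots (label 1)
illustrative_counts = {0: 3592, 1: 4631}
weights = balanced_class_weights(illustrative_counts)
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.4f}")
```

With these weights both classes contribute the same total weight, so the minority class is not dominated by the majority one.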

In [None]:
median_fofo_bot = account_df[account_df['label'] == 1.0]['fofo_ratio'].median()
median_fofo_nobot = account_df[account_df['label'] == 0.0]['fofo_ratio'].median()
median_fofo = account_df['fofo_ratio'].median()
print(f"Median followers/following ratio per account: {median_fofo:.3f}")
# the '+' format flag prints the sign of the relative difference, whatever its direction
print(f"Median followers/following ratio per bot account: {median_fofo_bot:.3f} ({(median_fofo_bot-median_fofo)/median_fofo*100:+.3f}%)")
print(f"Median followers/following ratio per human account: {median_fofo_nobot:.3f} ({(median_fofo_nobot-median_fofo)/median_fofo*100:+.3f}%)")
print(f"Pearson Correlation:\n{account_df[['fofo_ratio','label']].corr()}")

In [None]:
print(f'Followers/following ratio of 5 random bot accounts:')
account_df[account_df['label'] == 1.0].sample(n=5)[['account_id', 'fofo_ratio']]
print(f'Followers/following ratio of 5 random human accounts:')
account_df[account_df['label'] == 0.0].sample(n=5)[['account_id', 'fofo_ratio']]

Other features, on the other hand, do not seem to be relevant for the RF, such as:
- 'def_profile': default profile
- 'num_in_screen_name': numbers in screen name
- 'has_location': location enabled by the user

Let us therefore look at a few examples to check whether these features really do not allow a clear distinction between bots and real users.

In [None]:
print(f'Default profile feature of 5 random bot accounts:')
account_df[account_df['label'] == 1.0].sample(n=5)[['account_id', 'def_profile']]
print(f'Default profile feature of 5 random human accounts:')
account_df[account_df['label'] == 0.0].sample(n=5)[['account_id', 'def_profile']]

In [None]:
print(f'Numbers in screen name of 5 random bot accounts:')
account_df[account_df['label'] == 1.0].sample(n=5)[['account_id', 'num_in_screen_name']]
print(f'Numbers in screen name of 5 random human accounts:')
account_df[account_df['label'] == 0.0].sample(n=5)[['account_id', 'num_in_screen_name']]

In [None]:
print(f'Location enabled of 5 random bot accounts:')
account_df[account_df['label'] == 1.0].sample(n=5)[['account_id', 'has_location']]
print(f'Location enabled of 5 random human accounts:')
account_df[account_df['label'] == 0.0].sample(n=5)[['account_id', 'has_location']]