# Bot Community Analysis
Use this notebook to analyze communities in the bot retweet network.

Data = Bot profiles and community membership, bot tweets

Analysis steps

1) Look at popular retweeted users, Arabic profiles, and account creation dates within each bot retweet community.

2) Cluster bots by creation date. Look at popular retweeted users and Arabic profiles in each created_at community.


In [1]:
from datetime import datetime, timedelta
import numpy as np
import networkx as nx
from networkx.algorithms import community

import sqlite3,sys,os,string
import pandas as pd
import matplotlib.pyplot as plt
from os import path

from helper_retweet_network import *

#from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
#import arabic_reshaper
#from bidi.algorithm import get_display


## Load Data

Input:

1) fname_bots_db = file of database with bot tweets

2) fname_Gretweet = file where we saved the bot retweet network

3) fname_Gsim = file where we saved retweet similarity network of bots

Output: 

1) df_profiles = dataframe with bot profiles and created_at as a datetime object

2) df_communities = dataframe with bot profiles and communities

3) Gretweet = retweet network including bots and who they retweet

4) Gsim = similarity graph of bot accounts, with similarity measured by the Jaccard index
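
For reference, the Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|. The sketch below shows how it could be computed for two bots, assuming similarity is measured on the sets of accounts each bot retweets. The function name `jaccard_index` and the toy data are hypothetical; the code that actually built Gsim is not part of this notebook.

```python
def jaccard_index(a, b):
    """Return |a & b| / |a | b| for two iterables, or 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical retweet sources for two bots (toy data, not from the dataset)
retweeted_by_bot1 = {"source_a", "source_b", "source_c"}
retweeted_by_bot2 = {"source_b", "source_c", "source_d"}

similarity = jaccard_index(retweeted_by_bot1, retweeted_by_bot2)
print("Jaccard index = %.2f" % similarity)
```

The two toy bots share two of four distinct sources, so the index is 0.5. A value of 1 means identical retweet sets and 0 means no overlap.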

In [2]:
path_data = "Libya//"

fname_bots_db = path_data+"Libya_bot_forensics.db"
fname_Gretweet = path_data + "Gretweet.gpickle"
fname_Gsim = path_data + "Gsim.gpickle"
fname_bots_updated_csv = path_data+"Libya_bot_forensics_community.csv"
conn = sqlite3.connect("%s"%fname_bots_db)
df_tweets = pd.read_sql_query("SELECT * FROM tweet", conn)
df_profiles = pd.read_sql_query("SELECT * FROM user_profile", conn)
df_communities = pd.read_csv(fname_bots_updated_csv)
Gretweet = nx.read_gpickle(fname_Gretweet)
Gsim = nx.read_gpickle(fname_Gsim)
fmt = '%Y-%m-%d %H:%M:%S'

#convert created_at to datetime objects
df_profiles["created_at_datetime"] = pd.to_datetime(df_profiles.created_at, format=fmt)
df_communities["created_at_datetime"] = pd.to_datetime(df_communities.created_at, format=fmt)

#earliest creation date among bots with a community label
t0 = df_communities["created_at_datetime"].min()
ncomm  = max(df_communities.Community)+1

print("%s bots\n%s bot tweets\n%s bot communities"%(len(df_profiles),
                                                    len(df_tweets),
                                                    ncomm))



2026 bots
413974 bot tweets
3 bot communities


## Function to detect Arabic characters

In [3]:
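## functions to detect if a string has Arabic characters
def isarabic_char(ch):
    """Return True if ch lies in an Arabic Unicode block (or is U+1F1E6)."""
    return ('\u0600' <= ch <= '\u06FF' or          # Arabic
            '\u0750' <= ch <= '\u077F' or          # Arabic Supplement
            '\u08A0' <= ch <= '\u08FF' or          # Arabic Extended-A
            '\uFB50' <= ch <= '\uFDFF' or          # Arabic Presentation Forms-A
            '\uFE70' <= ch <= '\uFEFF' or          # Arabic Presentation Forms-B
            '\U00010E60' <= ch <= '\U00010E7F' or  # Rumi Numeral Symbols
            '\U0001EE00' <= ch <= '\U0001EEFF' or  # Arabic Mathematical Alphabetic Symbols
            # regional indicator "A", the second code point of the
            # Saudi flag emoji (it also occurs in other flags)
            ch == '\U0001F1E6')

def isarabic_str(s):
    """Return True if any character of s is Arabic."""
    return any(isarabic_char(ch) for ch in s)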
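
As a cross-check on the hard-coded ranges, a similar test can be written with the standard library's `unicodedata` module, which exposes each character's official Unicode name. This is only a sketch of an alternative, not the method used in this notebook: the function name `has_arabic_letter` is hypothetical, and unlike the range check it ignores the flag emoji and any code point whose name does not start with "ARABIC".

```python
import unicodedata

def has_arabic_letter(text):
    """Return True if any character's Unicode name starts with 'ARABIC'."""
    for ch in text:
        # unicodedata.name raises ValueError for unnamed characters
        # unless a default is given, so pass "" as the default.
        if unicodedata.name(ch, "").startswith("ARABIC"):
            return True
    return False

print(has_arabic_letter("Libya news"))
print(has_arabic_letter("أخبار ليبيا"))
```

Comparing the two detectors on a sample of profile descriptions is a quick way to spot characters that one approach catches and the other misses.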

   


## Fraction of Arabic profiles in each community

In [4]:
mask_arab = df_communities.arabic_profile==True
for counter in range(ncomm):
    mask_comm = df_communities.Community==counter
    nc = mask_comm.sum()                      #bots in this community
    nc_arab = (mask_comm & mask_arab).sum()   #...with an Arabic profile
    frac_arab = nc_arab/nc
    print("Community %s: fraction of Arabic profiles = %.2f"%(counter,frac_arab))

Community 0: fraction of Arabic profiles = 0.41
Community 1: fraction of Arabic profiles = 0.41
Community 2: fraction of Arabic profiles = 0.03


## Top retweeted users in each community

For each community of bots, we form the subgraph containing the bots and everyone they retweet.  Then we look at the top retweeted users in this subgraph.

Input

1) display_max = number of retweet sources to display for each community
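
The subgraph construction described above can be illustrated on a toy edge list without loading the data. The sketch assumes edges point from a retweet source to the account that retweeted it, which matches the use of `predecessors` and `out_degree` in the cell; all account names are hypothetical.

```python
from collections import Counter

# Toy directed edges (retweet source -> account that retweeted it).
# All names here are hypothetical.
edges = [
    ("src_a", "bot_1"), ("src_a", "bot_2"), ("src_a", "bot_3"),
    ("src_b", "bot_1"), ("src_b", "bot_2"),
    ("src_c", "bot_9"),          # bot_9 is not in the community
    ("src_a", "src_b"),          # src_b also retweeted src_a
]
community_bots = {"bot_1", "bot_2", "bot_3"}

# Nodes of the subgraph: the bots plus every account they retweet
nodes = set(community_bots)
for source, retweeter in edges:
    if retweeter in community_bots:
        nodes.add(source)

# Out-degree within the induced subgraph (both endpoints must be kept)
out_degree = Counter()
for source, retweeter in edges:
    if source in nodes and retweeter in nodes:
        out_degree[source] += 1

top_sources = out_degree.most_common(2)
print(top_sources)
```

Because the subgraph is induced on all kept nodes, a source's out-degree also counts retweets by other sources in the subgraph (here `src_a` gets 4, not 3), so the scores can slightly exceed the number of community bots retweeting that source.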

In [5]:
display_max = 20  #number of nodes to display

for counter in range(ncomm):
    community_screen_names = list(df_communities.screen_name[df_communities.Community==counter])
    Vsub = []
    for v in community_screen_names:
        if Gretweet.has_node(v):
            nb = list(Gretweet.predecessors(v))
            Vsub+=nb
            Vsub.append(v)
    
    print("Retweet community %s with %s users"%(counter,len(community_screen_names)))
    G = Gretweet.subgraph(Vsub)
    Dout = dict(G.out_degree())
    print("Top out degree")
    Centrality = Dout
    display_top_centrality_nodes(Centrality,display_max)


Retweet community 0 with 1073 users
Top out degree
	Centrality = 512.00,  monther72
	Centrality = 457.00,  TurkeyAffairs
	Centrality = 424.00,  5a1di
	Centrality = 392.00,  AlArabiya
	Centrality = 377.00,  meshaluk
	Centrality = 357.00,  70sul
	Centrality = 318.00,  AlArabiya_Brk
	Centrality = 300.00,  sattam_al_saud
	Centrality = 289.00,  amjadt25
	Centrality = 289.00,  naif4002
	Centrality = 287.00,  Alshaikh2
	Centrality = 266.00,  KSA24
	Centrality = 263.00,  amhfarraj
	Centrality = 240.00,  SalmanAldosary
	Centrality = 238.00,  fdeet_alnssr
	Centrality = 231.00,  SAUDI_POWER0
	Centrality = 226.00,  Dr_SultanAsqah
	Centrality = 222.00,  SPAregions
	Centrality = 211.00,  alekhbariyatv
	Centrality = 208.00,  AlHadath
	Centrality = 199.00,  s_hm2030
Retweet community 1 with 757 users
Top out degree
	Centrality = 421.00,  EbrahimGasuda
	Centrality = 374.00,  emad_badish
	Centrality = 320.00,  RD_turk
	Centrality = 304.00,  nasser_duwailah
	Centrality = 296.00,  TurkiShalhoub
	[output truncated]

## Top retweeted users in each (retweet, profile language) community

For each retweet community of bots, we separate the bots with Arabic profiles from those with non-Arabic profiles. For each group, we form the subgraph containing the bots and everyone they retweet, then look at the top retweeted users in this subgraph.

Input

1) display_max = number of retweet sources to display for each community

In [6]:
display_max = 10  #number of nodes to display

for counter in range(ncomm):
    mask_arab = df_communities.arabic_profile==True
    mask_comm = df_communities.Community==counter
    community_screen_names = list(df_communities.screen_name[mask_comm & mask_arab])
    Vsub = []
    for v in community_screen_names:
        if Gretweet.has_node(v):
            nb = list(Gretweet.predecessors(v))
            Vsub+=nb
            Vsub.append(v)
    print("Arabic profile retweet community %s with %s users"%(counter,len(community_screen_names)))
    G = Gretweet.subgraph(Vsub)
    Dout = dict(G.out_degree())
    print("Top out degree")
    Centrality = Dout
    display_top_centrality_nodes(Centrality,display_max)

for counter in range(ncomm):
    mask_arab = df_communities.arabic_profile==False
    mask_comm = df_communities.Community==counter
    community_screen_names = list(df_communities.screen_name[mask_comm & mask_arab])
    Vsub = []
    for v in community_screen_names:
        if Gretweet.has_node(v):
            nb = list(Gretweet.predecessors(v))
            Vsub+=nb
            Vsub.append(v)
    print("\nNon-Arabic profile retweet community %s with %s users"%(counter,len(community_screen_names)))
    G = Gretweet.subgraph(Vsub)
    Dout = dict(G.out_degree())
    print("Top out degree")
    Centrality = Dout
    display_top_centrality_nodes(Centrality,display_max)

Arabic profile retweet community 0 with 435 users
Top out degree
	Centrality = 309.00,  monther72
	Centrality = 283.00,  TurkeyAffairs
	Centrality = 261.00,  5a1di
	Centrality = 241.00,  AlArabiya
	Centrality = 227.00,  meshaluk
	Centrality = 215.00,  70sul
	Centrality = 192.00,  sattam_al_saud
	Centrality = 190.00,  AlArabiya_Brk
	Centrality = 174.00,  naif4002
	Centrality = 171.00,  Alshaikh2
	Centrality = 169.00,  amjadt25
Arabic profile retweet community 1 with 309 users
Top out degree
	Centrality = 218.00,  EbrahimGasuda
	Centrality = 192.00,  emad_badish
	Centrality = 168.00,  RD_turk
	Centrality = 162.00,  full_confident
	Centrality = 158.00,  QATARTEAM
	Centrality = 152.00,  nasser_duwailah
	Centrality = 144.00,  TurkiShalhoub
	Centrality = 134.00,  Hamza_tekin2023
	Centrality = 132.00,  akarh90
	Centrality = 129.00,  mshinqiti
	Centrality = 119.00,  aa_arabic
Arabic profile retweet community 2 with 6 users
Top out degree
	Centrality = 4.00,  AliBakeer
	Centrality = 4.00,  RD_t [output truncated]

## Retweet sources and their bot followers

Print the bots in each community that retweet a given source.

INPUT:
1) source = screen name of retweet source

OUTPUT:
1) List of bots retweeting source in each community
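
The grouping logic amounts to intersecting the source's retweeters with each community's members. A minimal sketch on toy data follows; the dictionaries and account names are hypothetical stand-ins for `Gretweet` and `df_communities`.

```python
# Hypothetical data: accounts that retweeted one source, and community membership
retweeters_of_source = {"bot_1", "bot_4", "bot_5", "someone_else"}
community_of = {"bot_1": 0, "bot_2": 0, "bot_4": 1, "bot_5": 1}

bots_by_community = {}
for name in sorted(retweeters_of_source):
    if name in community_of:  # skip retweeters that are not known bots
        bots_by_community.setdefault(community_of[name], []).append(name)

for comm, bots in sorted(bots_by_community.items()):
    print("%s bots in community %s: %s" % (len(bots), comm, bots))
```

A source whose retweeters fall almost entirely in one community, as in the output of the cell below, is a sign that the community is organized around that source.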

In [7]:
source = "ghadaoueiss"

display_max = 0  #number of bots to list per community (0 = counts only)

nb = list(Gretweet.successors(source))
print("%s retweeted by %s bots in retweet graph "%(source,len(nb)))
for counter in range(ncomm):
    community_screen_names = list(df_communities.screen_name[df_communities.Community==counter])
    Vsub = list(set(community_screen_names).intersection(nb))
    print("\t%s bots in community %s"%(len(Vsub),counter))
    
    for cv,v in enumerate(Vsub):
        if cv>=display_max:break  #list at most display_max bots
        print("\t\tBot %s: %s"%(cv,v))


ghadaoueiss retweeted by 191 bots in retweet graph 
	4 bots in community 0
	186 bots in community 1
	1 bots in community 2


## Collect Bots Created in Different Time Windows

Choose a start and stop date. This cell finds all bots in each community created between those dates and saves their profiles to a CSV file whose name tells us the bot community, start date, and stop date.

Input:

tstart = start date (string)

tstop = stop date (string)

df_communities = dataframe with community info

ncomm = number of communities
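
The window test used in the cell is `tstart < created_at <= tstop`, with both date strings parsed as midnight. The sketch below shows what that means at the boundaries; the helper name `in_window` and the sample timestamps are hypothetical.

```python
from datetime import datetime

def in_window(created_at, tstart, tstop, fmt="%Y-%m-%d"):
    """Return True if tstart < created_at <= tstop."""
    return datetime.strptime(tstart, fmt) < created_at <= datetime.strptime(tstop, fmt)

# Hypothetical creation times
created = [
    datetime(2019, 1, 1, 0, 0, 0),     # exactly at tstart: excluded
    datetime(2019, 3, 15, 12, 30, 0),  # inside the window
    datetime(2019, 6, 1, 0, 0, 0),     # exactly at tstop: included
    datetime(2019, 6, 1, 0, 0, 1),     # one second later: excluded
]
selected = [t for t in created if in_window(t, "2019-01-01", "2019-06-01")]
print(len(selected))
```

Since the stop date parses to midnight, accounts created later on the stop day fall outside the window. To include the whole stop day, pass the following day as `tstop`.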

In [19]:
tstart = '2019-01-01'
tstop = '2019-06-01'

dtstart = datetime.strptime(tstart,"%Y-%m-%d")
dtstop = datetime.strptime(tstop,"%Y-%m-%d")

for counter in range(ncomm):
    df_comm = df_communities[df_communities.Community==counter]
    print("Community %s with %s accounts"%(counter,len(df_comm)))
    #build the masks on df_comm so their index matches the frame being filtered
    mask1 = (pd.to_datetime(df_comm.created_at_datetime)>dtstart)
    mask2 = (pd.to_datetime(df_comm.created_at_datetime)<=dtstop)
    Bots_in_window = df_comm[mask1 & mask2]
    print("\t%s bots in community %s created between %s and %s"%(len(Bots_in_window),
                                                                  counter,tstart,tstop))
    fname = path_data + "Bots_Community_%s_%s_to_%s.csv"%(counter,tstart,tstop)
    Bots_in_window.to_csv(fname)

Community 0 with 1073 accounts
	57 bots in community 0 created between 2019-01-01 and 2019-06-01
Community 1 with 757 accounts
	45 bots in community 1 created between 2019-01-01 and 2019-06-01
Community 2 with 195 accounts
	14 bots in community 2 created between 2019-01-01 and 2019-06-01


