## Import User Data
The `users.pickle` file contains all the data scraped using the `TwitterScrape.ipynb` notebook. In a nutshell, `users` is a list of records, one per Twitter account, each holding that account's attributes along with a classification as either a bot or not a bot. The attributes include features from the downloaded corpus (such as whether the account uses a default profile image, or the URL displayed in its bio) and a list of the account's followers, which is scraped from Twitter in the same notebook.

In [1]:
import pickle

with open("./data/users510.pickle", "rb") as f:
    users = pickle.load(f)[:510]

In [3]:
import datetime

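def parse_date(date_str):
    """Parse a Twitter `created_at` timestamp; return None if it cannot be parsed."""
    # Values that are already datetime objects are passed through unchanged
    if isinstance(date_str, datetime.datetime):
        return date_str
    try:
        return datetime.datetime.strptime(str(date_str), "%a %b %d %H:%M:%S %z %Y")
    except ValueError:
        return None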

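# Reference timestamp that account creation dates are measured against
beginning_time = datetime.datetime(2009, 7, 2, 1, 9, 57, tzinfo=datetime.timezone.utc)
for user in users:
    # "age" is the number of whole days between beginning_time and account creation.
    # This assumes every created_at parses and that age is never 0; otherwise the
    # lines below raise TypeError or ZeroDivisionError.
    user["age"] = (parse_date(user["created_at"]) - beginning_time).days
    user["activity_statuses"] = user["statuses_count"] / user["age"]
    user["activity_friends"] = user["friends_count"] / user["age"]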
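
To make the feature computation above concrete, here is a minimal sketch on a single made-up record. The values in `sample_user` are hypothetical and only the three fields used by the cell above are included; the timestamp format and the reference time are the ones the notebook uses.

```python
import datetime

# Timestamp format used by parse_date in the notebook
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Same reference timestamp as beginning_time in the notebook
reference = datetime.datetime(2009, 7, 2, 1, 9, 57, tzinfo=datetime.timezone.utc)

# Hypothetical record: these values are invented for illustration
sample_user = {
    "created_at": "Sun Jul 12 01:09:57 +0000 2009",
    "statuses_count": 50,
    "friends_count": 20,
}

created = datetime.datetime.strptime(sample_user["created_at"], TWITTER_FORMAT)

# Whole days between the reference time and account creation
age_days = (created - reference).days

# Per-day rates, as in the activity_* features
activity_statuses = sample_user["statuses_count"] / age_days
activity_friends = sample_user["friends_count"] / age_days

print(age_days, activity_statuses, activity_friends)
```

Note that a record created less than a day from the reference time would give `age_days == 0` and make both divisions fail, so real data may need a guard.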

## Construct the graph
Now we will construct the graph with each user as a node and an edge between two users if they follow at least one of the same accounts. This makes a denser graph than drawing an edge for each following relationship, in part because of the small-world effect that exists on Twitter. We assume that two accounts are linked in some way if they follow the same accounts. In the future we might investigate assigning a weight to each edge based on how many accounts the two users share in their following lists.

In [4]:
import networkx as nx

G = nx.Graph()

# Add all users in the list to the graph
G.add_nodes_from([user["id"] for user in users])

In [5]:
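from itertools import combinations

# Build each user's following list as a set once, instead of once per pair
following = {user["id"]: set(user["following"]) for user in users}

# Link every pair of users who follow at least one account in common
for id1, id2 in combinations(following, 2):
    if not following[id1].isdisjoint(following[id2]):
        G.add_edge(id1, id2)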

print(f"{len(G)} users in the graph")
print(f"{len(G.edges)} links between users in the graph")

510 users in the graph
35477 links between users in the graph
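
The edge-weighting idea mentioned at the start of this section could be sketched roughly as follows. This is not part of the notebook's pipeline: `shared_following_weights` and `toy_users` are hypothetical names, and the weight is assumed to be the raw count of shared accounts (a normalised measure such as Jaccard similarity would be another option).

```python
from itertools import combinations


def shared_following_weights(users):
    """Map each pair of user ids to the number of accounts both follow.

    Pairs with nothing in common are omitted, so the keys correspond to
    the edges of the unweighted graph built above.
    """
    following = {user["id"]: set(user["following"]) for user in users}
    weights = {}
    for id1, id2 in combinations(following, 2):
        shared = len(following[id1] & following[id2])
        if shared:
            weights[(id1, id2)] = shared
    return weights


# Made-up records with only the fields this sketch needs
toy_users = [
    {"id": 1, "following": [10, 11, 12]},
    {"id": 2, "following": [11, 12, 13]},
    {"id": 3, "following": [99]},
]

weights = shared_following_weights(toy_users)
print(weights)
```

Each entry could then be added to the graph with `G.add_edge(id1, id2, weight=shared)`, which stores the count as an edge attribute.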


## Role Detection
The next three cells are the most computationally intensive in the whole notebook. They first set up the graph with the sequential IDs that GraphRole needs for role extraction, and then run the extraction itself. The result is a dictionary mapping each node ID to the role assigned to it in the graph.
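
For orientation, the dictionary described above can be pictured with a small stand-in. The node IDs and labels below are made up, and the `role_<n>` label format is an assumption inferred from how the notebook later strips a five-character prefix from each label; `role_index` is a hypothetical helper, not part of GraphRole.

```python
from collections import Counter

# Made-up stand-in for the extractor output: node id -> role label
example_roles = {101: "role_0", 102: "role_2", 103: "role_0", 104: "role_1"}


def role_index(label, prefix="role_"):
    """Turn a label such as 'role_3' into the integer 3."""
    if not label.startswith(prefix):
        raise ValueError(f"unexpected role label: {label!r}")
    return int(label[len(prefix):])


# Integer role per node, suitable as a categorical feature
role_ids = {node: role_index(label) for node, label in example_roles.items()}

# How many nodes were assigned to each role
role_counts = Counter(role_ids.values())

print(role_ids)
print(role_counts)
```

Checking the prefix explicitly, rather than slicing blindly, makes a change in label format fail loudly instead of producing a wrong number.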

In [6]:
# Give every node a sequential integer ID (the ordering itself is arbitrary)
for seq_id, node in enumerate(set(G.nodes)):
    G.nodes[node]["seq_id"] = seq_id

In [7]:
from graphrole import RecursiveFeatureExtractor, RoleExtractor

feature_extractor = RecursiveFeatureExtractor(G)
features = feature_extractor.extract_features()

In [8]:
role_extractor = RoleExtractor()
role_extractor.extract_role_factors(features)
roles = role_extractor.roles

  kl_div = np.sum(np.where(vec1 != 0, vec1 * np.log(vec1 / vec2) - vec1 + vec2, 0))

## Centrality Metrics
In addition to role detection, we calculate the eigenvector and closeness centrality of every node in the graph. All the graph-based features are then added to each user so that the full feature set can be compiled into an input and an output for each classifier. The results are also saved to disk here so the expensive computation does not have to be repeated each time.
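
As a rough illustration of what closeness centrality measures, the sketch below computes it by hand on a three-node path using breadth-first search. It assumes an unweighted, connected graph stored as a plain adjacency dictionary; `closeness` and `path` are hypothetical names, and the notebook itself relies on networkx, whose implementation also rescales scores by default when the graph is disconnected.

```python
from collections import deque


def closeness(adjacency, source):
    """Closeness of `source`: reachable node count / sum of shortest-path lengths."""
    distances = {source: 0}
    queue = deque([source])
    # Breadth-first search yields shortest hop counts in an unweighted graph
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    total = sum(distances.values())
    reachable = len(distances) - 1
    return reachable / total if total else 0.0


# Path graph a - b - c: the middle node is closest to everything else
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = {node: closeness(path, node) for node in path}

print(scores)
```

The middle node scores highest because its average distance to the other nodes is smallest, which is the property the feature is meant to capture for each account.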

In [9]:
eigenvector = nx.algorithms.centrality.eigenvector_centrality(G)
closeness = nx.algorithms.centrality.closeness_centrality(G)

In [11]:
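for user in users:
    # Role labels look like "role_<n>"; drop the 5-character prefix to keep the number
    user["role"] = int(roles[user["id"]][5:])
    user["eigenvector"] = float(eigenvector[user["id"]])
    user["closeness"] = float(closeness[user["id"]])

# Save point: persist the enriched records so the steps above need not be rerun
with open("./data/user_filled.pickle", "wb") as f:
    pickle.dump(users, f)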