## >whoami : a journey into infosec twitter

"I'm checking twitter. It's for work."

Like it or not, twitter is a key information resource for anyone working in cybersecurity these days. And in order to check twitter, you have to be on twitter. While I try to keep my twitter feed somewhat focused on cybersecurity topics, over time I have added a couple of guilty pleasures, so now I split twitter into two worlds: infosec twitter and not-infosec-twitter.

Infosec twitter is a great community. There are links to great resources, lively discussions, and interesting viewpoints. Like the cyber community in the real world, Infosec twitter can also be a very small place. Sometimes it feels like an echo chamber, sometimes a lively debate, and most of the time it seems like everyone is innovating insanely fast and there is no way to catch up. I thought it might be fun to look into some quantitative ways to analyze this community, my place in it, and what I am learning from it.

First, I needed to learn how to interact with the twitter API via python, to be able to programmatically collect tweets for analysis. Tweepy is a great package that makes interacting with the twitter API easier for coding n00bz like myself. 

Here I load up a bunch of packages which I probably will use at some point, or may have forgotten to remove:

In [1]:
import tweepy  # https://github.com/tweepy/tweepy
import csv
import json
import os
import sys  # used for the progress bar
from collections import Counter

import numpy as np
import pandas as pd
from tweepy import OAuthHandler
from tweepy import API
# Older scikit-learn only; newer releases replace this with jaccard_score
from sklearn.metrics import jaccard_similarity_score

You need to get an API key from twitter, and here is a good explanation of how to do that:
https://themepacific.com/how-to-generate-api-key-consumer-token-access-key-for-twitter-oauth/994/

In [4]:
#Twitter API credentials
consumer_key = "CONSUMER_KEY"
consumer_secret = "CONSUMER_SECRET"
access_key = "ACCESS_KEY"
access_secret = "MOTHERS_MAIDEN_NAME"



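OAUTH_KEYS = {'consumer_key':consumer_key, 'consumer_secret':consumer_secret,
 'access_token_key':access_key, 'access_token_secret':access_secret}
auth = tweepy.OAuthHandler(OAUTH_KEYS['consumer_key'], OAUTH_KEYS['consumer_secret'])
# Attach the access token so requests are made on behalf of your account
auth.set_access_token(OAUTH_KEYS['access_token_key'], OAUTH_KEYS['access_token_secret'])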

# In order to manage the rate limiting, use these options below. 
# You will find later that rate limiting is, well, a limiting factor.

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
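Hard-coding keys in a notebook makes it easy to publish them by accident. One option, sketched below, is to read them from environment variables instead. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the `load_oauth_keys` helper are made up for illustration; neither tweepy nor twitter requires them.

```python
import os

# Hypothetical environment variable names; use whatever fits your setup.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token_key": "TWITTER_ACCESS_KEY",
    "access_token_secret": "TWITTER_ACCESS_SECRET",
}


def load_oauth_keys(environ=os.environ):
    """Build an OAUTH_KEYS-style dict from environment variables."""
    missing = [name for name in ENV_NAMES.values() if name not in environ]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(sorted(missing)))
    return {key: environ[name] for key, name in ENV_NAMES.items()}
```

With the variables set in your shell, `OAUTH_KEYS = load_oauth_keys()` would stand in for the hard-coded dictionary above.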

Tweepy has a couple terms that bear defining:

Followers: These are users who follow the account in question.

Friends: These are users that the account in question follows.

account_ID: Each account on twitter has an account ID. At times it is easier to return these IDs, and then later transform them back into account names.


Off the bat, I wanted to know: who is following me? Obviously I could just check on the twitter website, but what if I want information at scale, like when each follower joined twitter or how many followers they have? The API can help me out.

## Find friends and followers of my account

In [146]:
user = 'secbern'

friends = []
followers = []

print("looking for friends of {}".format(user))
for friend in api.friends_ids(user):
    sn = api.get_user(friend).screen_name
    friends.append(sn)
    
print("looking for followers of {}".format(user))
for follower in api.followers_ids(user):
    sn = api.get_user(follower).screen_name
    followers.append(sn)
    
print("Friends: {}".format(len(friends)))
print(friends[:5])
print("Followers: {}".format(len(followers)))
print(followers[:5])

looking for friends of secbern
looking for followers of secbern
Friends: 215
['CyberScoopNews', 'Bing_Chris', '4n6ir', 'lucaskossack', 'RenditionSec']
Followers: 171
['shahankhatch', 'keeghin', '_HelenaBD', 'W00Tock', 'guerillamos']
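Once both lists exist, plain Python sets are enough to split them into mutuals and one-way relationships. This is a small sketch using a few of the names printed above; `classify_relationships` is a helper name I made up, and it makes no API calls.

```python
def classify_relationships(friends, followers):
    """Split two lists of screen names into mutual and one-way groups."""
    friends, followers = set(friends), set(followers)
    return {
        "mutual": sorted(friends & followers),          # we follow each other
        "friends_only": sorted(friends - followers),    # I follow them, they do not follow me
        "followers_only": sorted(followers - friends),  # they follow me, I do not follow them
    }


# A handful of names taken from the output above
sample_friends = ["CyberScoopNews", "Bing_Chris", "_HelenaBD"]
sample_followers = ["shahankhatch", "keeghin", "_HelenaBD"]

groups = classify_relationships(sample_friends, sample_followers)
print(groups["mutual"])
```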


Second, I'd like to see who my followers are following. Should I also be following them? In tweepy-API talk, this would be my followers' friends. So here's how you can grab that info.


In [6]:
def get_followers(target):
    followers = []
    for follower in api.followers_ids(target):
        sn = api.get_user(follower).screen_name
        followers.append(sn)
    return followers

def get_friends(target):
    friends = []
    for friend in api.friends_ids(target):
        sn = api.get_user(friend).screen_name
        friends.append(sn)
    return friends

def get_network(user):
    network = get_friends(user) + get_followers(user)
    return network

In [7]:
get_friends('secbern')

['vicfcs',
 'CyberScoopNews',
 'Bing_Chris',
 '4n6ir',
 'lucaskossack',
 'RenditionSec',
 '13M4C',
 'cyb3rops',
 'Lee_Holmes',
 'Viking_Sec',
 'siedlmar',
 'tankbusta',
 '_HelenaBD',
 'DrScottCoull',
 'elephant_musing',
 'Glasswalk3r',
 'hadleywickham',
 'BSlay88',
 'ACKSYNjACKSYN',
 'reesespcres',
 'MLSecProject',
 '3dRailForensics',
 'generationlext',
 'stvemillertime',
 'infosec_truths',
 'Pbarry122',
 'a_tweeter_user',
 'BEERegg',
 'FireEye',
 'gimbi',
 'GradyS',
 'pr0cy0n',
 'MikeOppenheim',
 'dadjokehansolo',
 'hexwaxwing',
 'Bewg12',
 'QW5kcmV3',
 'dewey1net',
 'mikko',
 'taosecurity',
 'BartInglot',
 'proud2bgeeky',
 'ByrneGh',
 'DenverSec',
 'CyberAmyntas',
 'harr0ey',
 'OmarNajam',
 'JerrySeinfeld',
 'SenDuckworth',
 'CarbonBlack_Inc',
 'cylanceinc',
 'Tanium',
 'ebitdad',
 'CAICfrontrange',
 'nihilist_ds',
 'ProfFeynman',
 'pattonoswalt',
 'ClickHole',
 'DidierStevens',
 'KyloR3n',
 'RobertMLee',
 'thecyberwire',
 'Hexacorn',
 'SenAnitaHawkins',
 'amazingmap',
 'TerribleMaps',
 ...]

## How similar are two accounts?

Given another account on twitter, who do we have in common? Who do we mutually follow? Who mutually follows us? How do our networks overlap?

Naturally, this lends itself to a larger question: Of all the people in my network, how much do we overlap?
For this I used a common similarity metric called the Jaccard index, which divides the size of the intersection of two sets by the size of their union (the combination of both sets). This gives me a normalized score between 0 and 1 for "how closely related are we?"
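Written out for two sets $A$ and $B$, the index is:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
```

As a toy example, take $A = \{a, b, c\}$ and $B = \{b, c, d\}$. The intersection has 2 members and the union has 4, so $J(A, B) = 2/4 = 0.5$. Identical sets score 1 and sets with nothing in common score 0.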

In [8]:
def friends_overlap(user1, user2):
    return list(set(get_friends(user1)) & set(get_friends(user2)))

def followers_overlap(user1, user2):
    return list(set(get_followers(user1)) & set(get_followers(user2)))

def network_overlap(user1, user2):
    return list(set(get_network(user1)) & set(get_network(user2)))


In [16]:
friends_overlap('secbern', 'infosec_truths')

Rate limit reached. Sleeping for: 478


['MikeOppenheim',
 '3dRailForensics',
 'BartInglot',
 'robknake',
 'cyb3rops',
 'matthewdunwoody',
 'stvemillertime',
 'derekcoulson',
 'BarryV',
 'gentilkiwi',
 'Cyb3rWard0g',
 'enigma0x3',
 'bwithnell',
 'ISecPlayasClub',
 'christruncer',
 'williballenthin',
 'JohnHultquist',
 'jackcr',
 'subTee',
 'ByrneGh',
 'PyroTek3',
 'lucaskossack',
 'Glasswalk3r',
 'MITREattack',
 '_devonkerr_',
 'generationlext',
 'mattifestation',
 'ItsReallyNick',
 'RobertMLee',
 'QW5kcmV3',
 'danielhbohannon',
 'kwm',
 'harmj0y',
 'cglyer',
 'Hexacorn']

In [18]:
followers_overlap('secbern', 'infosec_truths')

['ItsReallyNick',
 'bwithnell',
 'parkerrm39',
 'williballenthin',
 'QW5kcmV3',
 'W00Tock',
 'stvemillertime',
 'GrimaldoChris',
 'cglyer',
 'BarryV',
 'generationlext',
 'WhisperScrape',
 'lucaskossack',
 'Glasswalk3r']

In [None]:
network_overlap('secbern', 'infosec_truths')

In [10]:
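def get_jaccard(user1, user2):
    # Jaccard index of two users' networks (friends + followers)
    a = get_network(user1)
    b = get_network(user2)
    x = list(set(a) & set(b))
    # |A| + |B| - |A & B| stands in for the size of the union. a and b are lists,
    # so an account that is both a friend and a follower is counted twice here.
    j = float(len(x)) / (len(a) + len(b) - len(x))
    return j

def jaccard_index(a,b):
    # Same calculation for two lists that are already in memory
    x = list(set(a) & set(b))
    # The small constant avoids dividing by zero when both lists are empty
    j = float(len(x)) / (len(a) + len(b) - len(x)+.0001)
    return j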

In [15]:
get_jaccard('secbern', 'infosec_truths')

0.0894854586129754
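One thing to keep in mind when reading that number: `get_network` concatenates the friends and followers lists, so a mutual shows up twice, and `len()` counts both copies in the denominator. The sketch below compares the list-length formula with a version that deduplicates first. The names are hypothetical, and I have not re-run the real accounts through it, so I can't say how far the figure above would move.

```python
def jaccard_from_lists(a, b):
    """Formula as used in get_jaccard: list lengths in the denominator."""
    shared = set(a) & set(b)
    return len(shared) / (len(a) + len(b) - len(shared))


def jaccard_from_sets(a, b):
    """Deduplicate first, then divide intersection size by union size."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical networks: "carol" is both a friend and a follower of the
# first user, so she appears twice in that user's combined list.
net1 = ["alice", "bob", "carol", "carol"]
net2 = ["bob", "carol", "dave"]

print(jaccard_from_lists(net1, net2))  # 2 / (4 + 3 - 2) = 0.4
print(jaccard_from_sets(net1, net2))   # 2 / 4 = 0.5
```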

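The scikit-learn package also has a Jaccard function, but it compares two equal-length label arrays rather than two lists of names, so each network first has to be encoded as a 0/1 membership vector over the union of both networks. Newer scikit-learn releases call the function jaccard_score (it replaced jaccard_similarity_score):

In [None]:
from sklearn.metrics import jaccard_score

a = set(get_network('secbern'))
b = set(get_network('infosec_truths'))
universe = sorted(a | b)  # one position per account in either network
jaccard_score([int(n in a) for n in universe], [int(n in b) for n in universe])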

I'd like to also output all the metadata about my network - here is how you do that:

(Thanks to Andy Patel's work at https://labsblog.f-secure.com/2018/02/27/how-to-get-twitter-follower-data-using-python-and-tweepy/ for this information and code. I've extracted some of it and changed slightly to output a csv file with all follower information.)

In [11]:
def get_follower_ids(target):
    return api.followers_ids(target)

def get_friends_ids(target):
    return api.friends_ids(target)

def get_follower_list(target):
    followers =[]
    for follower_id in api.followers_ids(target):
        follower_user = api.get_user(follower_id)
        follower_name = follower_user.screen_name
        followers.append(follower_name)
    return followers
        
def get_user_objects(follower_ids):
    batch_len = 100
    num_batches = len(follower_ids) / 100
    batches = (follower_ids[i:i+batch_len] for i in range(0, len(follower_ids), batch_len))
    all_data = []
    for batch_count, batch in enumerate(batches):
        sys.stdout.write("\r")
        sys.stdout.flush()
        sys.stdout.write("Fetching batch: " + str(batch_count) + "/" + str(num_batches))
        sys.stdout.flush()
        users_list = api.lookup_users(user_ids=batch)
        users_json = (map(lambda t: t._json, users_list))
        all_data += users_json
    return all_data

def get_follower_profiles(user_data):
    d = []
    for user in user_data:
        if "followers_count" in user:
            d.append({"name": user["name"], "SN": user["screen_name"], "follower_count": user["followers_count"], "friends": user["friends_count"], "made": user["created_at"]})
    # Build the DataFrame once, after all rows have been collected
    user_pd = pd.DataFrame(d)
    return user_pd


user = 'infosec_truths'
followers = get_follower_ids(user)
follower_objects = get_user_objects(followers)
profile_pd = get_follower_profiles(follower_objects)
output_file = str(user + '.csv')
profile_pd.to_csv(output_file)

Fetching batch: 0/0.44
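The "made" column holds each account's creation timestamp, which answers the earlier question of how long a follower has been on twitter. Here is a minimal sketch of turning it into an age in days, assuming the timestamp looks like `Wed Oct 10 20:19:24 +0000 2018` (check a row of your own CSV first). `account_age_days` is a helper name I made up.

```python
from datetime import datetime, timezone

# Assumed layout of the created_at field, e.g. "Wed Oct 10 20:19:24 +0000 2018"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def account_age_days(created_at, now=None):
    """Return the number of whole days between account creation and `now`."""
    created = datetime.strptime(created_at, TWITTER_TIME_FORMAT)
    now = now or datetime.now(timezone.utc)
    return (now - created).days


# A fixed reference date keeps the example reproducible
reference = datetime(2019, 1, 1, tzinfo=timezone.utc)
print(account_age_days("Wed Oct 10 20:19:24 +0000 2018", now=reference))
```

With pandas, something like `profile_pd["made"].apply(account_age_days)` would add the same figure as a column.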

### Dealing with Rate Limits and Second-degrees

Rate limiting becomes a real problem once you get into large numbers of friends/followers. In the code above, Andy found a clever way to look up 100 users per request, so we'll rewrite those functions using this batching/pagination method.
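Stripped of the API calls, the batching idea is just slicing a list of IDs into groups of 100. A simplified sketch, with `chunked` being a helper name I made up and plain integers standing in for real account IDs:

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


ids = list(range(250))  # stand-in for a list of account IDs
batches = list(chunked(ids, 100))
print([len(batch) for batch in batches])  # [100, 100, 50]
```

Each batch then costs one lookup request instead of one request per account.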

In [12]:
#https://stackoverflow.com/questions/14265082/query-regarding-pagination-in-tweepy-get-followers-of-a-particular-twitter-use

import itertools
import tweepy

def paginate(iterable, page_size):
    while True:
        i1, i2 = itertools.tee(iterable)
        iterable, page = (itertools.islice(i1, page_size, None),
                list(itertools.islice(i2, page_size)))
        if len(page) == 0:
            break
        yield page

def paginate_friends(target):
    try:
        friends = api.friends_ids(screen_name=target)
        friendlist = []
        for page in paginate(friends, 100):
            results = api.lookup_users(user_ids=page)
            for result in results:
                friendlist.append(result.screen_name)
        return friendlist
    except tweepy.TweepError:
        # Typically a protected account; return a placeholder instead of failing
        print("Bummer, not authorized")
        null = ['n/a']
        return null
        

def paginate_followers(target):
    try:
        followers = api.followers_ids(screen_name=target)
        followerlist = []
        for page in paginate(followers, 100):
            results = api.lookup_users(user_ids=page)
            for result in results:
                followerlist.append(result.screen_name)
        return followerlist
    except tweepy.TweepError:
        # Typically a protected account; return a placeholder instead of failing
        print("Bummer, not authorized")
        null = ['n/a']
        return null
    

def get_network_paginate(user):
    network = paginate_friends(user) + paginate_followers(user)
    return network

def get_jaccard_paginate(user1, user2):
    a = get_network_paginate(user1)
    b = get_network_paginate(user2)
    x = list(set(a) & set(b))
    j = float(len(x)) / (len(a) + len(b) - len(x))
    return j
    

I also wanted to go out two degrees, to see who my followers are following, who is following my friends, etc. These functions will do that, but bear in mind they are going to take a very LONG time. I have had success running these overnight or for several hours during the day. You will hit the twitter API rate limit here.

What I'm doing (or trying to do) is make one query for each user in the first degree. So in this case, I first ask, "who are my friends?". This is one API query, and it returns a list of my friends. Then the code loops through that list and asks, for each friend, "who are this friend's friends?". Thus for each user in the first-degree list, it makes another API query. The rate limits for the endpoint I hit in this query are documented here: https://developer.twitter.com/en/docs/basics/rate-limits.html and allow 15 requests within a 15-minute window, or about one request per minute. Thus, the amount of time it will take to go out two degrees is roughly equivalent, in minutes, to the number of users in that list. If you follow 285 people, that means roughly 285 minutes, or 285/60 = 4.75 hours.
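The back-of-the-envelope math above fits in a few lines. This is only a rough estimate under the stated limit of 15 requests per 15-minute window; it ignores the extra lookup requests and the fact that the first 15 requests go through immediately. `estimate_crawl_minutes` is a made-up helper name.

```python
def estimate_crawl_minutes(num_users, requests_per_window=15, window_minutes=15):
    """Rough crawl time, assuming one rate-limited request per first-degree user."""
    return num_users * window_minutes / requests_per_window


minutes = estimate_crawl_minutes(285)
print(minutes, minutes / 60)  # 285.0 4.75
```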

If you know a better way to do this, I am all ears :)

In [13]:

def two_degrees(target, first, second):
    data = {}
    if first == 'friends':
        print('Making list of friends')
        list1 = paginate_friends(target)
    elif first == 'followers':
        print('Making list of followers')
        list1 = paginate_followers(target)
        
    total = len(list1)
    n =1
    
    if second == 'friends':
        for user in list1:
            print('Getting friends of {}'.format(user))
            print('{} of {}'.format(n, total))
            data[user] = paginate_friends(user)
            n+=1
    elif second == 'followers':
        for user in list1:
            print('Getting followers of {}'.format(user))
            print('{} of {}'.format(n, total))
            data[user] = paginate_followers(user)
            n+=1
    return data
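Because a run like this can take hours, it may be worth saving progress as you go so an interrupted crawl can pick up where it left off. The sketch below is one way to do that, not part of the original code: `crawl_with_checkpoint` and the JSON file path are hypothetical, and `fetch` is whatever function returns a list for one user (for example `paginate_friends`).

```python
import json
import os


def crawl_with_checkpoint(users, fetch, path):
    """Call fetch(user) for each user, saving results to `path` after every call."""
    data = {}
    if os.path.exists(path):
        # Resume from an earlier, possibly interrupted, run
        with open(path) as handle:
            data = json.load(handle)
    for user in users:
        if user in data:
            continue  # already fetched in a previous run
        data[user] = fetch(user)
        with open(path, "w") as handle:
            json.dump(data, handle)
    return data
```

Something like `crawl_with_checkpoint(paginate_friends('infosec_truths'), paginate_friends, 'second_degree.json')` would build the same kind of dictionary as `two_degrees`, with the file name chosen arbitrarily.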


Now I want to see how much the accounts in my network overlap with each other. Of the people I am following (my friends), how much do _they_ overlap in who they are following? In other words, how much of an echo chamber am I in?

In [14]:
paginate_friends('infosec_truths')

['derekcoulson',
 'ByrneGh',
 'BartInglot',
 '3dRailForensics',
 'TunnelsUp',
 'spresec',
 'mttcrns',
 'BarryV',
 'ISecPlayasClub',
 'MikeOppenheim',
 'jepayneMSFT',
 'gentilkiwi',
 'harmj0y',
 'markrussinovich',
 'vysecurity',
 'PyroTek3',
 'jaredhaight',
 'robknake',
 'RobertMLee',
 'cnoanalysis',
 'JohnHultquist',
 'cyb3rops',
 'Hexacorn',
 'pwnallthethings',
 'malwareunicorn',
 'jackcr',
 'Cyb3rWard0g',
 '_devonkerr_',
 'MITREattack',
 'enigma0x3',
 'mattifestation',
 'MalwareTechBlog',
 'thegrugq',
 'SwiftOnSecurity',
 'kwm',
 'subTee',
 'christruncer',
 'QW5kcmV3',
 'lucaskossack',
 'Malwarenailed',
 'Glasswalk3r',
 'Matt_Grandy_',
 'deantyler',
 'williballenthin',
 'secbern',
 'generationlext',
 'cglyer',
 'matthewdunwoody',
 'bwithnell',
 'stvemillertime',
 'danielhbohannon',
 'ItsReallyNick']

In [19]:
#How overlapping is my network?
network = paginate_friends('infosec_truths')
network_dict = two_degrees('infosec_truths', 'friends', 'friends')


Making list of friends
Getting friends of derekcoulson
1 of 52
Getting friends of ByrneGh
2 of 52
Getting friends of BartInglot
3 of 52
Getting friends of 3dRailForensics
4 of 52
Getting friends of TunnelsUp
5 of 52
Getting friends of spresec
6 of 52
Getting friends of mttcrns
7 of 52
Getting friends of BarryV
8 of 52
Getting friends of ISecPlayasClub
9 of 52
Getting friends of MikeOppenheim
10 of 52
Getting friends of jepayneMSFT
11 of 52
Getting friends of gentilkiwi
12 of 52
Getting friends of harmj0y
13 of 52


Rate limit reached. Sleeping for: 758


Getting friends of markrussinovich
14 of 52
Bummer, not authorized
Getting friends of vysecurity
15 of 52
Getting friends of PyroTek3
16 of 52
Getting friends of jaredhaight
17 of 52
Getting friends of robknake
18 of 52
Getting friends of RobertMLee
19 of 52
Getting friends of cnoanalysis
20 of 52
Getting friends of JohnHultquist
21 of 52
Getting friends of cyb3rops
22 of 52
Getting friends of Hexacorn
23 of 52
Getting friends of pwnallthethings
24 of 52
Getting friends of malwareunicorn
25 of 52
Getting friends of jackcr
26 of 52
Getting friends of Cyb3rWard0g
27 of 52
Getting friends of _devonkerr_
28 of 52
Getting friends of MITREattack
29 of 52
Getting friends of enigma0x3
30 of 52


Rate limit reached. Sleeping for: 782


Getting friends of mattifestation
31 of 52
Getting friends of MalwareTechBlog
32 of 52
Getting friends of thegrugq
33 of 52
Getting friends of SwiftOnSecurity
34 of 52
Getting friends of kwm
35 of 52
Getting friends of subTee
36 of 52
Getting friends of christruncer
37 of 52
Getting friends of QW5kcmV3
38 of 52
Getting friends of lucaskossack
39 of 52
Getting friends of Malwarenailed
40 of 52
Getting friends of Glasswalk3r
41 of 52
Getting friends of Matt_Grandy_
42 of 52
Getting friends of deantyler
43 of 52
Getting friends of williballenthin
44 of 52


Rate limit reached. Sleeping for: 786


Getting friends of secbern
45 of 52
Getting friends of generationlext
46 of 52
Getting friends of cglyer
47 of 52
Getting friends of matthewdunwoody
48 of 52
Getting friends of bwithnell
49 of 52
Getting friends of stvemillertime
50 of 52
Getting friends of danielhbohannon
51 of 52
Getting friends of ItsReallyNick
52 of 52


Rate limit reached. Sleeping for: 678
Rate limit reached. Sleeping for: 683
Rate limit reached. Sleeping for: 684


KeyboardInterrupt: 

So let's run one of these to get the dictionary built. From there we can do some fun stuff.
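One example of fun stuff, going back to the earlier question of whether I should be following the accounts my network follows: count how often each account appears across the second-degree lists. This is a sketch on hypothetical data; `suggest_accounts` is a made-up helper, and it assumes the dictionary maps each first-degree user to a list of screen names, with `'n/a'` as the placeholder for accounts that could not be read.

```python
from collections import Counter


def suggest_accounts(network_dict, already_following, top_n=5):
    """Rank accounts by how many first-degree users follow them."""
    counts = Counter()
    for second_degree in network_dict.values():
        counts.update(set(second_degree))  # count each account once per user
    for name in already_following:
        counts.pop(name, None)
    counts.pop("n/a", None)  # placeholder returned for unreadable accounts
    return counts.most_common(top_n)


# Hypothetical second-degree data
sample = {
    "alice": ["dave", "erin"],
    "bob": ["dave", "alice"],
    "carol": ["dave", "erin", "n/a"],
}
print(suggest_accounts(sample, already_following={"alice", "bob", "carol"}))
```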

Obviously you can then run similarity metrics within these dictionaries as well:

How much do people within my network overlap with one another? (I call this the "echo-chamber-index"):

In [24]:
df = pd.DataFrame(index=network, columns=network)
for i in network:
    for j in network:
        if i != j:
            # .loc[row, column] avoids chained assignment, which newer pandas may silently ignore
            df.loc[j, i] = jaccard_index(network_dict[i],network_dict[j])

In [27]:
df.head()

| | derekcoulson | ByrneGh | BartInglot | 3dRailForensics | TunnelsUp | spresec | mttcrns | BarryV | ISecPlayasClub | MikeOppenheim | ... | deantyler | williballenthin | secbern | generationlext | cglyer | matthewdunwoody | bwithnell | stvemillertime | danielhbohannon | ItsReallyNick |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| derekcoulson | | 0.193416 | 0.131086 | 0.122066 | 0.000776548 | 0.0808823 | 0.0178394 | 0.139847 | 0.0812808 | 0.064759 | ... | 0.0652921 | 0.0639764 | 0.191083 | 0.11985 | 0.188482 | 0.27044 | 0.0990338 | 0.149038 | 0.145933 | 0.0698125 |
| ByrneGh | 0.193416 | | 0.145228 | 0.126984 | 0.000780488 | 0.0938775 | 0.015213 | 0.104854 | 0.0868421 | 0.0460829 | ... | 0.0716981 | 0.0507968 | 0.152318 | 0.100806 | 0.148571 | 0.188679 | 0.0645161 | 0.126984 | 0.0968523 | 0.0502901 |
| BartInglot | 0.131086 | 0.145228 | | 0.0817307 | 0.00097352 | 0.106719 | 0.0180905 | 0.104563 | 0.0732323 | 0.0358744 | ... | 0.1341 | 0.0501968 | 0.111111 | 0.0877862 | 0.151351 | 0.123919 | 0.0915841 | 0.0975609 | 0.088993 | 0.049904 |
| 3dRailForensics | 0.122066 | 0.126984 | 0.0817307 | | 0.000394011 | 0.053398 | 0.0138741 | 0.0974576 | 0.0523256 | 0.0344828 | ... | 0.0401786 | 0.0203252 | 0.133588 | 0.132653 | 0.136364 | 0.0936455 | 0.0830945 | 0.218045 | 0.066313 | 0.0333988 |
| TunnelsUp | 0.000776548 | 0.000780488 | 0.00097352 | 0.000394011 | | 0.000584795 | 0.000682361 | 0.0020284 | 0.000189502 | 0.00036075 | ... | 0.000388576 | 0.00152258 | 0.000767902 | 0.00077912 | 0.000592534 | 0.000572519 | 0.000756144 | 0.000788333 | 0.000752729 | 0.00557448 |


Thus I can query the similarity of any two network members like:

In [29]:
df['ItsReallyNick']['stvemillertime']

0.04711346734482632

And I can generate a cumulative measure of overlap:

In [34]:
df.mean().mean()

0.05221995349270476
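The same echo-chamber figure can be computed straight from the dictionary, without building the DataFrame first. Since the matrix is symmetric with an empty diagonal, averaging over every unordered pair should give essentially the same number as `df.mean().mean()`, apart from the small 0.0001 term in `jaccard_index`. The sketch below uses hypothetical data, and `echo_chamber_index` is a helper name I made up.

```python
from itertools import combinations


def echo_chamber_index(network_dict):
    """Mean Jaccard index over every unordered pair of users."""
    scores = []
    for user_a, user_b in combinations(network_dict, 2):
        a, b = set(network_dict[user_a]), set(network_dict[user_b])
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical data: alice and bob follow the same accounts, carol does not
sample_network = {
    "alice": ["dave", "erin"],
    "bob": ["dave", "erin"],
    "carol": ["frank"],
}
print(echo_chamber_index(sample_network))  # (1.0 + 0.0 + 0.0) / 3
```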

## Timeline evaluations and beyond
Obviously the rate limiting calculus gets dicey (and annoying). Additionally, this metric is crude and doesn't really measure my twitter experience, since some users are far more active than others. So how can I subsample this list based on the actual content these users are producing? There are a couple ideas I'd like to explore later:

1. Analyze overlap of inbound tweets, using timeline analytics
2. Analyze overlap of likes/RTs
3. Go 3 or 4 degrees out

But we'll leave that for another day. If you have suggestions, please let me know! @secbern