# Twitter Influencer Analysis*

By Shahab Sheikh-Bahaei

Contents
#### Part 1 - User Graph
- [Get 1 month of home tweets](#Q:-get-1-month-of-home-tweets)
- [Who has the most number of tweets?](#Q:-Who-has-the-most-number-of-tweets?)
- [Histogram of number of tweets per user](#Q:--Histograms-of-number-of-tweets-per-user)
- [Top most liked tweets](#Top-most-liked-tweets)
- [Who has the most number of followers?](#Q:-Who-has-the-most-number-of-followers?)
- [Create a directed graph of users](#Q:-create-a-directed-graph-of-users-(nodes:-authors,-edges:-following))
- [Analyse Centrality](#Q:-Analyse-Centrality-(as-a-potential-measure-of-influence))
- [Who has the highest centrality score?](#Q:-Who-has-the-highest-betweenness-score?)
- [Visualize user graph](#Q:-Visualize-the-graph)

#### Part 2 - Retweet Graph
- [Analyze influence using number of retweets](#Q:-Analyze-influence-using-number-of-retweets-and-likes)
- [Create a new graph based on retweets](#Q:-Create-a-new-graph-based-on-retweets)
- [Algorithm to collect retweet data and create the graph](#Algorithm:)
- [Who has the most number of total retweets?](#Q:-Who-has-the-most-number-of-total-retweets?)
- [Visualize the retweet graph](#Visualize-Graph)
- [Measure the influence](#Measure-Influence)
- [User with the highest influence score](#Q:-which-node-has-the-highest-influence-score?)
- [Retweet Graph](#Retweet-Graph-Visualization)
- [How do the retweets propagate through the network?](#Q:-How-do-the-retweets-propagate-through-the-network?)
- [What is the probability that a tweet published by user A reaches user B?](#Q:-What-is-the-probability-that-a-tweet-published-by-user-A-reaches-user-B?)



*work in progress

## Introduction

The purpose of this work is to analyse how Twitter users influence one another.

This type of analysis also applies to network security, where a compromised node can affect other nodes in the network.

#### Install necessary Python libraries

In [1]:
# !pip install tweepy
# !pip install visJS2jupyter
# !pip install py2cytoscape
# !pip install networkx

In [84]:
from matplotlib import pyplot as plt
import networkx as nx
from visJS2jupyter import visJS_module
import pandas as pd
import numpy as np
from collections import Counter
import tweepy

In [77]:
import keys

In [5]:
%matplotlib notebook

In [130]:
consumer_key = keys.consumer_key
consumer_secret = keys.consumer_secret
access_token = keys.access_token
access_token_secret = keys.access_token_secret




In [468]:
# Creating the authentication object
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# Setting your access token and secret
auth.set_access_token(access_token, access_token_secret)
# Creating the API object while passing in auth information
api = tweepy.API(auth, wait_on_rate_limit=True)

### Q: get 1 month of home tweets

In [466]:
# start_date = "2017-08-01"
ht=api.home_timeline(count=200)

In [144]:
last_id = ht[-1].id
while True:
    t=api.home_timeline(count=200, max_id=last_id-1)
    if not t:
        break
    last_id = t[-1].id
    ht.extend(t)
    print len(ht)

399
599
799
830


In [145]:
len(ht)

830

In [146]:
latest = ht[0]

In [147]:
latest.created_at

datetime.datetime(2017, 9, 11, 0, 39, 18)

In [148]:
oldest = ht[-1]

In [149]:
oldest.created_at

datetime.datetime(2017, 8, 5, 19, 15, 22)
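
The paging loop above walks backwards through the timeline by passing `max_id` set to one less than the oldest id seen so far. A minimal sketch of the same pattern against an in-memory list follows. `fake_home_timeline` is a hypothetical stand-in for `api.home_timeline` and is assumed to return ids newest first.

```python
def fake_home_timeline(all_ids, count, max_id=None):
    """Hypothetical stand-in for api.home_timeline: newest first, ids <= max_id."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return eligible[:count]


def fetch_all(all_ids, count=200):
    """Collect every page by asking for ids strictly below the last one seen."""
    collected = fake_home_timeline(all_ids, count)
    while collected:
        page = fake_home_timeline(all_ids, count, max_id=collected[-1] - 1)
        if not page:
            break
        collected.extend(page)
    return collected


# Simulated timeline: 830 tweet ids, newest (largest id) first.
timeline_ids = list(range(830, 0, -1))
tweets = fetch_all(timeline_ids, count=200)
print(len(tweets))
```

Subtracting 1 from the last id keeps the oldest tweet of one page from being returned again as the first tweet of the next page.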

### Q: Who has the most number of tweets?

In [150]:
a=Counter([t.author.screen_name for t in ht])

a.most_common()

[(u'realDonaldTrump', 284),
 (u'SenSanders', 96),
 (u'SenWarren', 92),
 (u'fashionistatalk', 89),
 (u'HenryJEvans', 88),
 (u'kellymcevers', 81),
 (u'StephenAtHome', 29),
 (u'HillaryClinton', 20),
 (u'elizabethforma', 19),
 (u'astorino_steven', 10),
 (u'BarackObama', 8),
 (u'ShahriarSh', 8),
 (u'JZarif', 6)]

### Q:  Histograms of number of tweets per user

In [183]:
d=pd.DataFrame(a.items(), columns=["screen_name","tweet_count"])

In [184]:
d.set_index("screen_name", inplace=True)

In [185]:
d.sort_values(by="tweet_count", inplace=True)

In [186]:
d.plot.barh(y="tweet_count");


### Q: Transform the Twitter data into a DataFrame for easier analysis

In [452]:
# latest.__dict__

In [144]:
use_fields = ["id","created_at", "favorite_count","retweet_count","retweeted","user.screen_name","user.followers_count",
              "text"]

In [187]:
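from operator import attrgetter  # resolves dotted paths such as "user.screen_name" without eval
df=pd.DataFrame([[attrgetter(c)(t) for c in use_fields] for t in ht], columns=use_fields).set_index("id")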
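
The entries of `use_fields` are dotted attribute paths such as `user.screen_name`. One way to resolve such paths without evaluating strings as code is `operator.attrgetter`. The sketch below uses a hypothetical stand-in object rather than a real tweepy status.

```python
from operator import attrgetter
from types import SimpleNamespace

# Hypothetical stand-in for a tweet object with a nested user.
tweet = SimpleNamespace(
    id=1,
    text="hello",
    user=SimpleNamespace(screen_name="example_user", followers_count=42),
)

fields = ["id", "user.screen_name", "user.followers_count", "text"]

# attrgetter follows dotted paths, so "user.screen_name"
# resolves to tweet.user.screen_name.
row = [attrgetter(f)(tweet) for f in fields]
print(row)
```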

### Top most liked tweets

In [451]:
df.sort_values(by="favorite_count", ascending=False).head(5)

| id | created_at | favorite_count | retweet_count | retweeted | user.screen_name | user.followers_count | text |
|---|---|---|---|---|---|---|---|
| 896523232098078720 | 2017-08-13 00:06:09 | 4590167 | 1707297 | False | BarackObama | 94589040 | "No one is born hating another person because ... |
| 898261944095789056 | 2017-08-17 19:15:11 | 1631429 | 324677 | False | BarackObama | 94589040 | Michelle and I are thinking of the victims and... |
| 896523304873238528 | 2017-08-13 00:06:27 | 1587579 | 502624 | False | BarackObama | 94589040 | "People must learn to hate, and if they can le... |
| 896523357272911872 | 2017-08-13 00:06:39 | 1417239 | 415697 | False | BarackObama | 94589040 | "...For love comes more naturally to the human... |
| 905141484386750469 | 2017-09-05 18:52:01 | 910297 | 387807 | False | BarackObama | 94589039 | To target hopeful young strivers who grew up h... |


### Q: plot favorite_count as a time series for specific users

In [190]:
ax=df.loc[df["user.screen_name"]=="realDonaldTrump"].set_index("created_at").plot(y="favorite_count",marker='.')
df.loc[df["user.screen_name"]=="SenSanders"].set_index("created_at").plot(y="favorite_count", marker='o',ax=ax)
ax.legend(["Trump's favorite_count","Sanders' favorite_count"])


<matplotlib.legend.Legend at 0x124d75b10>

### Q: Who has the most number of followers?

In [191]:
gfl=df[["user.screen_name","user.followers_count"]].groupby("user.screen_name").max().sort_values("user.followers_count")

In [192]:
gfl.plot.barh()


<matplotlib.axes._subplots.AxesSubplot at 0x12804f1d0>

#### Log Scale

In [162]:
gfl.plot.barh(log=True)


<matplotlib.axes._subplots.AxesSubplot at 0x1216af1d0>

### Q: create a directed graph of users (nodes: authors, edges: following)

In [15]:
import matplotlib as mpl
import networkx as nx
import visJS2jupyter.visJS_module

In [21]:
import cPickle as pickle

In [16]:
users={i.id:i for i in api.friends()}

In [17]:
g=nx.DiGraph()

g.add_nodes_from(users)

In [19]:
g.nodes()

[1916120257,
 16303106,
 970207298,
 813286,
 87823144,
 29442313,
 46541122,
 2238324110,
 40815629,
 47813521,
 1339835893,
 357606935,
 1926683448,
 104673756,
 2176745083,
 197574940,
 25073877]

In [None]:
LOAD_GRAPH1 = False
if LOAD_GRAPH1:
    with open("my_twitter_network_v2.pkl","rb") as f:  # pickle files should be read in binary mode
        g_loaded = pickle.load(f)

    g = g_loaded

In [85]:
len(g.nodes())

861

In [797]:
user_list=g.nodes()
for n in np.random.permutation(user_list):
    if g.edges(n):
        continue
    fl=users[n].followers()
    fr=users[n].friends()
    new_users={i.id:i for i in fl+fr}
    users.update(new_users)
    g.add_nodes_from(new_users)
    print "Num. users:", len(g.nodes())
    # edge from A to B: A has influence on B => B follows A
    g.add_edges_from([(n,i.id) for i in fl])
    g.add_edges_from([(i.id,n) for i in fr])
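
The edge direction used above is easy to get backwards: an edge points from the followed account to the follower, because the followed account is the one assumed to have influence. A minimal sketch of that convention with a plain dictionary of successor sets follows. The helper name and the sample users are hypothetical.

```python
from collections import defaultdict


def add_follow_edges(successors, user, followers, friends):
    """Record influence edges: user -> each follower, each friend -> user."""
    for follower in followers:
        # follower follows user, so user may influence follower
        successors[user].add(follower)
    for friend in friends:
        # user follows friend, so friend may influence user
        successors[friend].add(user)
    return successors


successors = defaultdict(set)
# Hypothetical data: "bob" is followed by "carol" and follows "alice".
add_follow_edges(successors, "bob", followers=["carol"], friends=["alice"])
edges = sorted((a, b) for a, targets in successors.items() for b in targets)
print(edges)
```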

##### Sanity Check

In [90]:
user_list=users#g.nodes()
i=0
j=0
for n in user_list:
    #print n,
    if g.edges(n):
        #print "yes"
        i+=1
    else:
        #print "no"
        j+=1

print "nodes with edge:",i
print "nodes without edge:",j


nodes with edge: 733
nodes without edge: 513


In [91]:
len(g.edges())

1311

In [92]:
len(g.nodes())

1246

In [93]:
with open("my_twitter_network_v2.pkl","wb") as f:  # pickle files should be written in binary mode
    pickle.dump(g, f)

### Q: Analyse Centrality (as a potential measure of influence)

In [94]:
G1 = g 
nodes1 = G1.nodes()
edges1 = G1.edges()

#### betweenness_centrality
Compute the shortest-path betweenness centrality for nodes.

Betweenness centrality of a node $v$ is the sum of the
fraction of all-pairs shortest paths that pass through $v$:

$$c_B(v) = \sum_{s,t \in V} \frac{\sigma(s, t|v)}{\sigma(s, t)}$$

where $V$ is the set of nodes, $\sigma(s, t)$ is the number of
shortest $(s, t)$-paths, and $\sigma(s, t|v)$ is the number of those
paths passing through some node $v$ other than $s, t$.
If $s = t$, $\sigma(s, t) = 1$, and if $v \in \{s, t\}$,
$\sigma(s, t|v) = 0$.
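
To make the definition concrete, here is a minimal brute-force sketch that evaluates the sum directly on a small directed graph. It is an illustration only, not the algorithm networkx uses, and it returns the unnormalized score, whereas `nx.betweenness_centrality` normalizes by default. The function names and the sample graph are assumptions made for this example.

```python
from collections import deque


def shortest_path_counts(adj, source):
    """BFS from source: distance and number of shortest paths per reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """Unnormalized betweenness: sum over s != t of sigma(s,t|v) / sigma(s,t)."""
    nodes = set(adj) | {w for targets in adj.values() for w in targets}
    info = {s: shortest_path_counts(adj, s) for s in nodes}
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        dist_s, sigma_s = info[s]
        for t in dist_s:
            if t == s:
                continue
            for v in nodes:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff d(s,v) + d(v,t) == d(s,t)
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / float(sigma_s[t])
    return score


# Diamond: two shortest paths from "a" to "d", one via "b" and one via "c".
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
scores = betweenness(adj)
print(scores)
```

In the diamond, each of `b` and `c` carries one of the two shortest paths from `a` to `d`, so each receives a contribution of one half from that pair and nothing from any other pair.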

In [95]:
bc = nx.betweenness_centrality(G1)

In [96]:
df_bc=pd.DataFrame(data=[(users[i].screen_name, b) for i,b in sorted(bc.items(),key=lambda x:x[1], reverse=True)],
                   columns = ["screen_name","betweenness_score"]).set_index("screen_name")
                   

### Q: Who has the highest betweenness score?

In [97]:
df_bc.head()

| screen_name | betweenness_score |
|---|---|
| kymihypek | 0.002607 |
| rymarag | 0.002117 |
| ShahriarSh | 0.002051 |
| realMouseLight | 0.001911 |
| kellymcevers | 0.001278 |


In [98]:
df_bc.head(20).plot.barh(y="betweenness_score", log=True)


<matplotlib.axes._subplots.AxesSubplot at 0x11d602bd0>

#### load_centrality

In [99]:
lc=nx.load_centrality(g)

In [100]:
df_lc=pd.DataFrame(data=[(users[i].screen_name, b) for i,b in sorted(lc.items(),key=lambda x:x[1], reverse=True)],
                   columns = ["screen_name","load_centrality"]).set_index("screen_name")

In [101]:
df_lc.head(10)

| screen_name | load_centrality |
|---|---|
| kymihypek | 0.002607 |
| rymarag | 0.002117 |
| ShahriarSh | 0.002051 |
| realMouseLight | 0.001911 |
| kellymcevers | 0.001278 |
| MansoorSB | 0.001263 |
| vrebex | 0.001213 |
| shahabeddin56 | 0.001081 |
| GabbyZetino | 0.00103 |
| \_\_\_Kathleen\_\_\_\_ | 0.000801 |


In [102]:
df_lc.head(20).plot.barh(y="load_centrality", log=True)


<matplotlib.axes._subplots.AxesSubplot at 0x12366a950>

### Q: Visualize the graph

In [103]:
degree = G1.degree()
p=nx.spring_layout(G1)

nx.set_node_attributes(G1,'degree',degree)

nx.set_node_attributes(G1,'betweenness_centrality',bc)

# map the betweenness centrality to the node color, using matplotlib spring_r colormap
node_to_color = visJS2jupyter.visJS_module.return_node_to_color(G1,field_to_map='betweenness_centrality',
                                                                cmap=mpl.cm.spring_r,alpha = 1,
                                                 color_max_frac = .9,color_min_frac = .1)

# set node initial positions using networkx's spring_layout function
pos = nx.spring_layout(G1)

nodes_dict1 = [{"id":users[n].screen_name,"color":node_to_color[n],
               "degree":np.log(users[n].followers_count), #nx.degree(G1,n),
              "x":pos[n][0]*1000,
              "y":pos[n][1]*1000
               } for n in nodes1
              ]
node_map1 = dict(zip(nodes1,range(len(nodes1))))  # map to indices for source/target in edges
edges_dict1 = [{"source":node_map1[edges1[i][0]], "target":node_map1[edges1[i][1]], 
              "color":"gray","title":'test'} for i in range(len(edges1))]

In [755]:
# set some network-wide styles
visJS2jupyter.visJS_module.visjs_network(nodes_dict1,edges_dict1,graph_id=1,
                          node_size_multiplier=5,node_size_transform = '',
                          node_color_highlight_border='red',node_color_highlight_background='#D3918B',
                          node_color_hover_border='blue',node_color_hover_background='#8BADD3',
                          node_font_size=25,edge_arrow_to=True,physics_enabled=True,edge_color_highlight='#8A324E',
                          edge_color_hover='#8BADD3',edge_width=3,max_velocity=15,min_velocity=1)




### Q: Analyze influence using number of retweets and likes

- Which user has more influence on their followers?
- Retweets and likes may be indicators of influence.

### Q: Create a new graph based on retweets 
#### (nodes: users, edges: retweets, i.e. an edge from A to B means B retweeted A)

In [265]:
g2=nx.DiGraph()

#### Algorithm:
    1. Create an empty directed graph.
    2. Initialize nodes from home friends
    3. Go through the nodes in random order:
        - get n latest tweets for the current user
        - go through the tweets:
            - if the tweet was retweeted:
                - add a new edge from current user to the retweeter  

    Repeat step 3 after the appropriate wait period, if rate limit exceeded.
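
The steps above can be sketched without any network access. In the sketch below, `get_tweets` and `get_retweeters` are hypothetical callables standing in for the Twitter API calls, the sample data is made up, and rate-limit handling is left out.

```python
import random


def build_retweet_edges(seed_users, get_tweets, get_retweeters, n=20, rng=None):
    """Return {author: {retweeter: count}} following the listed steps.

    get_tweets and get_retweeters are hypothetical stand-ins for API calls.
    """
    rng = rng or random.Random(0)
    edge_weights = {u: {} for u in seed_users}   # steps 1-2: graph and seed nodes
    seen = set()
    order = list(seed_users)
    rng.shuffle(order)                           # step 3: visit in random order
    for author in order:
        for tweet_id in get_tweets(author)[:n]:  # the n latest tweets
            if tweet_id in seen:
                continue
            seen.add(tweet_id)
            for retweeter in get_retweeters(tweet_id):
                weights = edge_weights[author]
                # edge author -> retweeter, weighted by number of retweets
                weights[retweeter] = weights.get(retweeter, 0) + 1
    return edge_weights


# Hypothetical in-memory data instead of live API responses.
tweets_by_user = {"alice": [101, 102], "bob": [201]}
retweeters_by_tweet = {101: ["bob", "carol"], 102: ["carol"], 201: []}

weights = build_retweet_edges(
    ["alice", "bob"],
    get_tweets=lambda u: tweets_by_user.get(u, []),
    get_retweeters=lambda t: retweeters_by_tweet.get(t, []),
)
print(weights)
```

Keeping the counts in a nested dictionary makes it straightforward to add them later as weighted edges of a directed graph.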

#### Notes

A major challenge is that the API limits the number of calls we can make.

As a result, we can only scratch the surface and collect a sample of the retweet data.

The question is how to spend the available API calls so that the sample represents the graph as well as possible.

If we fetch too many recent tweets per user (i.e. n is too large), we may hit the limit before visiting other users.

If we fetch too few recent tweets per user, we may hit the limit before encountering enough retweets.

In [266]:
g2.add_nodes_from([i.id for i in fr])

In [267]:
len(g2.nodes())

20

In [268]:
# set of processed tweets
tweets_store=set()

In [269]:
user_last_tweet = {u:None for u in g2.nodes()}

In [271]:
# edge A->B weight indicates the number of times B retweeted A.
edge_weights={u:{} for u in g2.nodes()}

In [795]:
# This loop can be run multiple times.
# Each run adds edges to the network in random order until the API's rate limit is reached.
user_list = g2.nodes()
for uid in np.random.permutation(user_list):
    print uid, 
    if uid not in users:
        continue
    if uid not in user_last_tweet:
        user_last_tweet[uid]=None
    if uid not in edge_weights:
        edge_weights[uid]={}
    user = users[uid]
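    recent_tweets = user.timeline(count=20, max_id = user_last_tweet[uid])
    if not recent_tweets:
        # no (more) tweets for this user; avoids an IndexError on recent_tweets[-1]
        continue
    user_last_tweet[uid] = recent_tweets[-1].id -1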
    #if uid in tweets_store:
    #    tweets_store[uid].extend(recent_tweets)
    #else:
    #    tweets_store[uid] = recent_tweets
    for tweet in recent_tweets:
        print ".",
        if tweet.id in tweets_store:
            # this tweet is already processed
            continue
        # store the id rather than the object so repeated fetches of the same tweet are recognized
        tweets_store.add(tweet.id)
        if tweet.retweet_count:
            retweeters_lst = api.retweeters(tweet.id)
            if not retweeters_lst:
                continue
            # edge from A to B: A has influence on B => B retweets A's tweet
            for retweeter in retweeters_lst:
                if retweeter in edge_weights[uid]:
                    edge_weights[uid][retweeter]+=1
                else:
                    edge_weights[uid][retweeter]=1
                    
            g2.add_weighted_edges_from([(uid,rid,edge_weights[uid][rid]) for rid in retweeters_lst])
            print
            print "Number of edges:", len(g2.edges())

Since we added new edges to the graph, it may now contain users that we still need to look up on Twitter.

In [767]:
while True:
    new_ids=[i for i in g2.nodes() if i not in users]
    if not new_ids:
        break
    print "found new users:",len(new_ids)
    users_update = api.lookup_users(user_ids = new_ids[:100])
    users.update({i.id:i for i in users_update})

 found new users: 2


In [768]:
len(g2.nodes()),len(g2.edges())

(855, 841)

We have a graph. Let's save it!

In [796]:
with open("my_twitter_network__retweets_v1.pkl","wb") as f:  # pickle files should be written in binary mode
    pickle.dump(g2, f)

### Q: Who has the most number of total retweets?

In [769]:
uv_retweet = [(users[u].screen_name,v,g2.get_edge_data(u,v)['weight']) for (u,v) in g2.edges()]

In [770]:
df_retwt = pd.DataFrame(data=uv_retweet, columns=["tweeter","retweeter","retweet_count"])

In [771]:
len(df_retwt)

841

In [772]:
df_retwt.sort_values(by="retweet_count", ascending=False).head()

|  | tweeter | retweeter | retweet_count |
|---|---|---|---|
| 155 | GalatasaraySK | 2878585546 | 18 |
| 786 | GalatasaraySK | 2223760357 | 18 |
| 114 | GalatasaraySK | 751340422966960128 | 17 |
| 331 | GalatasaraySK | 1086338846 | 16 |
| 522 | GalatasaraySK | 896697378262192129 | 14 |


In [773]:
df_retwt_g = df_retwt[["tweeter","retweet_count"]].groupby("tweeter").sum().sort_values("retweet_count", ascending=False)

In [774]:
df_retwt_g

| tweeter | retweet_count |
|---|---|
| GalatasaraySK | 1166 |
| sabinelisicki | 73 |
| netana_80 | 32 |
| gs1905Cimb | 11 |
| seba4434 | 10 |
| kbrd_01 | 4 |
| ygmrsyhn | 2 |
| adar1703 | 1 |
| eceakdemiiir | 1 |
| tarrafalense | 1 |


In [775]:
df_retwt_g.plot.barh(y="retweet_count")


<matplotlib.axes._subplots.AxesSubplot at 0x137160650>

#### Log scale

In [799]:
df_retwt_g.plot.barh(y="retweet_count", log=True)


<matplotlib.axes._subplots.AxesSubplot at 0x1272f5e90>

### Visualize Graph

In [776]:
G2=g2.copy()

In [777]:
for n in G2.nodes():
    if G2.degree(n)<1:
        G2.remove_node(n)


In [778]:
len(G2.nodes()),len(G2.edges())

(839, 841)

In [779]:
nodes2 = G2.nodes()
edges2 = G2.edges()

### Measure Influence

We define an "Influence Score" and use it to color the nodes.

The Influence Score of a node is a function of:
    1. Centrality: how central the node is (with eigenvector centrality, a node scores highly when it is connected to other high-scoring nodes)
    2. Degree: how many nodes it is connected to
    
There are many ways to calculate centrality. We use "eigenvector_centrality" here because it gives a nice distribution for this graph.

After several iterations, we arrived at the following definition of the Influence Score:

$$\text{InfluenceScore} = (\text{EigenvectorCentrality} + 0.1) \cdot (\log(\text{degree}) + 0.1)$$
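
As a quick worked check of the formula, the sketch below evaluates it with the natural logarithm for two pairs of centrality and degree values as they are displayed in this notebook's results (gs1905Cimb and GalatasaraySK). The function name is an assumption made for this example.

```python
import math


def influence_score(eigenvector_centrality, degree):
    """(centrality + 0.1) * (ln(degree) + 0.1); degree must be at least 1."""
    return (eigenvector_centrality + 0.1) * (math.log(degree) + 0.1)


# Centrality and degree as displayed in the notebook's results table.
print(round(influence_score(1.0, 3), 6))    # gs1905Cimb
print(round(influence_score(0.0, 756), 6))  # GalatasaraySK
```

The logarithm is undefined for a degree of zero, which is consistent with isolated nodes being removed from the graph before the score is computed.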

In [780]:
eig2=nx.eigenvector_centrality(G2, weight="weight")

In [781]:
df_bc2=pd.DataFrame(data=[(i,users[i].screen_name, b, G2.degree(i)) 
                          for i,b in sorted(eig2.items(),key=lambda x:x[1], reverse=True)],
                   columns = ["id","screen_name","eigenvector_centrality", "degree"]).set_index("id")

In [782]:
df_bc2["influence_score"]=(df_bc2["eigenvector_centrality"]+0.1)*(np.log(df_bc2["degree"])+0.1)

### Q: which node has the highest influence score?

In [783]:
df_bc2.sort_values(by="influence_score", ascending=False).head()

| id | screen_name | eigenvector_centrality | degree | influence_score |
|---|---|---|---|---|
| 828436568914403328 | gs1905Cimb | 1.0 | 3 | 1.318474 |
| 23186079 | GalatasaraySK | 0.0 | 756 | 0.672804 |
| 39710722 | sabinelisicki | 0.0 | 63 | 0.424313 |
| 118115748 | netana_80 | 0.0 | 13 | 0.266495 |
| 3094343105 | seba4434 | 6e-06 | 4 | 0.148638 |


In [784]:
cent2 = {n:df_bc2.loc[n]["influence_score"] for n in G2.nodes()}

In [785]:
degree = G2.degree()
nx.set_node_attributes(G2,'degree',degree)
nx.set_node_attributes(G2,'centrality',cent2)

In [786]:
pos = nx.spring_layout(G2, k=0.2)

In [787]:
node_to_color = visJS_module.return_node_to_color(G2,field_to_map='centrality',
                                                  cmap=mpl.cm.coolwarm,alpha = 1,
                                                  color_vals_transform=None,
                                                  color_max_frac = .9,color_min_frac = .2)

In [788]:
edge_to_color = visJS_module.return_edge_to_color(G2,
                                                  field_to_map = 'weight',
                                                  cmap=mpl.cm.Greys,alpha=.6)#,vmin=0,vmax=1)

In [789]:
nodes_dict = [{"id":users[n].screen_name if n in users else str(n),
               "degree":G2.degree(n),
               "color":node_to_color[n],
               "node_size":np.log(users[n].followers_count)+1 if n in users else 1, #node_to_nodeSize[n],
              "edge_label":'',
              "x":pos[n][0]*1000,
              "y":pos[n][1]*1000} for n in nodes2
              ]

In [790]:
node_map = dict(zip(G2.nodes(),range(G2.node.__len__())))  # map to indices for source/target in edges

In [791]:
edges_with_data = G2.edges(data=True)

In [792]:
edges_dict = [{"source":node_map[edges2[i][0]], 
               "target":node_map[edges2[i][1]], 
              "color":edge_to_color[edges2[i]],
               "title":edges_with_data[i][2]['weight']} for i in range(len(edges2))]

#### Retweet Graph Visualization

In [793]:
visJS_module.visjs_network(nodes_dict,edges_dict, graph_id=2,
                            node_size_field='node_size',
                            node_size_transform='Math.sqrt',
                            node_size_multiplier=4,
                            node_border_width=2,
                            hover = False,
                            edge_width=1,
                            edge_arrow_to=True,
                            hover_connected_edges = False,
                            physics_enabled=False,
                            min_velocity=.5,
                            max_velocity=16,
                            draw_threshold=20,
                            min_label_size=12,
                            max_label_size=25,
                            max_visible=10,
                            edge_title_field='title',
                            graph_title = 'Retweets Graph')

### Q: How do the retweets propagate through the network?

It would be interesting to analyze the dynamics of retweets, i.e. how a tweet propagates through the network.

When a new tweet is published, the followers see it first, and then their followers and so on.
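
The wave-by-wave picture above corresponds to a breadth-first traversal of the follower relation. A minimal sketch follows, under the strong assumption that every user who sees a tweet passes it on, so it gives an upper bound on reach rather than a realistic estimate. The follower lists are hypothetical.

```python
from collections import deque


def exposure_waves(followers, author):
    """Return {user: hop}; hop 1 = direct followers, hop 2 = their followers, ..."""
    hops = {author: 0}
    queue = deque([author])
    while queue:
        u = queue.popleft()
        for f in followers.get(u, ()):
            if f not in hops:
                hops[f] = hops[u] + 1
                queue.append(f)
    return hops


# Hypothetical follower lists: each key is followed by the users in its value.
followers = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
}
waves = exposure_waves(followers, "alice")
print(waves)
```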


In [794]:
# future work

### Q: What is the probability that a tweet published by user A reaches user B?

Intuitively, the chance (risk) that user B sees user A's tweets should be a function of:
    - distance between A and B
    - influence of A over its neighbors
    - influence of nodes on the path between A and B
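
One possible way to turn these factors into a number, offered only as an illustration and not as the method of this notebook, is a Monte Carlo simulation under an independent-cascade assumption: each edge passes a tweet on with its own probability, independently of the others. The edge probabilities below are hypothetical; how to derive them from retweet data is left open.

```python
import random


def reach_probability(prob, source, target, trials=10000, seed=0):
    """Estimate P(target is reached from source) under independent cascade.

    prob[u][v] is a hypothetical probability that a tweet seen by u
    is passed on to v.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        reached = {source}
        frontier = [source]
        while frontier:
            u = frontier.pop()
            # each newly reached user gets one chance to pass the tweet on
            for v, p in prob.get(u, {}).items():
                if v not in reached and rng.random() < p:
                    reached.add(v)
                    frontier.append(v)
        hits += target in reached
    return hits / float(trials)


# Hypothetical chain a -> b -> c with a 50% chance on each edge.
chain = {"a": {"b": 0.5}, "b": {"c": 0.5}}
estimate = reach_probability(chain, "a", "c")
print(estimate)
```

In this model, a longer path multiplies more probabilities together, which captures the distance factor, while the per-edge probabilities play the role of the influence of A and of the intermediate nodes.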

In [798]:
# future work