# Graph Analysis using cuGraph
### Exploring COVID-19 Data

Author:  Brad Rees
Date:    July 1st, 2022

This material was created for presentation at SciPy '22 in Austin, TX.  July 13th - 15th, 2022


Aguilar-Gallegos, Norman (2020), “Dataset on dynamics of Coronavirus on Twitter”, Mendeley Data, V1, doi: 10.17632/7ph4nx8hnc.1

data:  https://data.mendeley.com/datasets/7ph4nx8hnc/1

__Notice__:  I have no affiliation with the author. The dataset is used only to illustrate graph analytics, not to derive any insights from the data.


The data has already been downloaded, but it can be obtained with
`wget https://md-datasets-cache-zipfiles-prod.s3.eu-west-1.amazonaws.com/7ph4nx8hnc-1.zip`
and then `unzip 7ph4nx8hnc-1.zip`

In [1]:
import cudf
import cugraph

-----
## Exploring the data
Tweet Types:
  *  1.Tw      Original Tweet
  *  2.MT      Mentioned within Tweet
  *  3.RT      Retweet
  *  4.Re      Replies

In [2]:
# Let's load the "edges" dataset
gdf_tw_edges = cudf.read_csv("07a.Tw.edges.csv")

In [3]:
# how much data
print("{:,}".format(len(gdf_tw_edges)))

7,296,841


In [4]:
# Count the number of different Tweet types
gdf_tw_edges['Type'].value_counts()

3. RT    5570466
4. Re    1025937
2. MT     700438
Name: Type, dtype: int32

The difference in counts is due to how the data was created.  

----
# Build a Graph

In [5]:
# What does the data look like?
gdf_tw_edges.head()

Unnamed: 0,from,to,Type,status_id,width
0,DrNancyM_CDC,CDCgov,2. MT,x1216747185565507585,1
1,WHO,WHOWPRO,4. Re,x1217027488137826304,1
2,CDCDirector,CDCgov,2. MT,x1217126105506820096,1
3,WHO,pr_moph,2. MT,x1217151178884222976,1
4,HelenBranswell,WHO,2. MT,x1217191264858206209,1


Since the `from` (source) and `to` (destination) columns are strings, we need to use the renumbering feature in cuGraph to convert the strings to integer values. The nice part of the renumbering feature is that it also converts the data back to the original string values in any output.
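To make the idea of renumbering concrete, here is a minimal CPU-only sketch in plain Python. It is not how cuGraph implements renumbering; the names `name_to_id` and `id_to_name` are made up for illustration, and the edges are the five rows shown by `head()` above.

```python
# Hypothetical, CPU-only illustration of renumbering (not cuGraph's implementation).
edges = [
    ("DrNancyM_CDC", "CDCgov"),
    ("WHO", "WHOWPRO"),
    ("CDCDirector", "CDCgov"),
    ("WHO", "pr_moph"),
    ("HelenBranswell", "WHO"),
]

# Give every distinct account name a dense integer ID, in order of first appearance.
name_to_id = {}
for src, dst in edges:
    for name in (src, dst):
        name_to_id.setdefault(name, len(name_to_id))

# Inverse lookup, used to translate results back to the original strings.
id_to_name = {i: name for name, i in name_to_id.items()}

# The integer edge list is what a graph algorithm would actually work on.
int_edges = [(name_to_id[s], name_to_id[d]) for s, d in edges]

# Mapping back recovers the original string edges.
restored = [(id_to_name[s], id_to_name[d]) for s, d in int_edges]

print(int_edges)
print(restored == edges)
```

With `renumber=True`, this bookkeeping is handled by cuGraph, which is why the outputs later in this notebook show account names rather than integers.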

In [6]:
# Let's build an undirected graph
G = cugraph.Graph(directed=False)
G.from_cudf_edgelist(gdf_tw_edges, source='from', destination='to', renumber=True)

In [7]:
(G.number_of_nodes(), G.number_of_edges())

(2922882, 5894695)

----
Let's look at the degree of the nodes.   We can use:
* degree              - total number of edges incident on (connected to) the vertex
* degrees             - returns both the in-degree and out-degree
* degree_centrality   - same idea as 'degree', but normalized by default

Since this is an undirected graph, `degree` would count each edge twice (once for in and once for out)
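As a rough illustration of why the in-degree and out-degree agree in an undirected graph, the sketch below counts degrees for a tiny made-up edge list in plain Python. The vertex names are hypothetical and this is not cuGraph code.

```python
from collections import Counter

# Hypothetical undirected edge list; each edge is listed once.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]

# In an undirected graph every edge touches both endpoints,
# so each edge adds one to the degree of both vertices.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Treating each undirected edge as one "in" and one "out" edge gives
# in_degree == out_degree, and their sum counts every edge twice.
in_degree = dict(degree)
out_degree = dict(degree)
in_plus_out = {v: in_degree[v] + out_degree[v] for v in degree}

print(dict(degree))
print(in_plus_out)
```

This is why the cells below sort on `out_degree` alone: for this graph the `in_degree` column carries the same numbers.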

In [8]:
degree = G.degrees()

In [9]:
degree.sort_values(by='out_degree', ascending=False).head(5)

Unnamed: 0,in_degree,out_degree,vertex
2885879,56119,56119,OlaTinee
2674548,36534,36534,spectatorindex
11776,34876,34876,WHO
2828921,30196,30196,realDonaldTrump
24779,26421,26421,howroute


Wow, there are some very popular nodes!

In [10]:
degree['out_degree'].describe().to_pandas().apply("{:,.2f}".format)

count    2,922,882.00
mean             4.03
std             85.08
min              1.00
25%              1.00
50%              1.00
75%              2.00
max         56,119.00
Name: out_degree, dtype: object

75% of nodes have a degree of 2 or less!

In [11]:
del degree

-----
Let's see how many subgraphs (components) there are.
We use the Weakly Connected Components (WCC) algorithm.
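For readers new to connected components, the following is a small CPU-only sketch using a union-find structure. It is only meant to show what a component label is; cuGraph's GPU implementation is different, and the function name `connected_components` and the toy edges are assumptions made for this example.

```python
from collections import Counter

def connected_components(edges):
    """Return {vertex: label}, where vertices in the same component share a label."""
    parent = {}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components

    return {v: find(v) for v in list(parent)}

# Hypothetical toy graph: one triple, one pair, and one vertex with a self loop.
edges = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "f")]
labels = connected_components(edges)

# Counting vertices per label mirrors the groupby on 'labels' used in this notebook.
sizes = Counter(labels.values())
print(sorted(sizes.values(), reverse=True))
```

The label values themselves carry no meaning; only whether two vertices share a label matters.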

In [12]:
comp = cugraph.weakly_connected_components(G)

In [13]:
# Use groupby on the 'labels' column of the WCC output to get the counts of each connected component with the same label
label_count = comp.groupby('labels').count()
label_count.rename(columns={"vertex": "count"}, inplace=True)

print("Total number of components found : ", "{:,}".format(len(label_count)))

Total number of components found :  107,493


In [14]:
# How many components have only two nodes?
len(label_count[label_count['count'] == 2])

81883

In [15]:
# what are the largest components?
label_count.sort_values(by='count', ascending=False).head()

Unnamed: 0_level_0,count
labels,Unnamed: 1_level_1
1731182,2668807
2085265,83
1167083,69
2916221,51
2660936,49


-----
# Let's only look at the big component

In [16]:
# Get the label ID for the largest component
max_comp = label_count['count'].max()
big_comp_id = label_count[label_count['count'] == max_comp].index[0]

In [17]:
# Now get the vertex IDs that belong to the largest component
big_comp = comp[comp['labels'] == big_comp_id]

In [18]:
# Now get the subgraph that contains just the nodes in the largest component
subgraph = cugraph.subgraph(G, big_comp['vertex'])

In [19]:
(subgraph.number_of_nodes(), subgraph.number_of_edges())

(2668807, 5741055)

-----


In [20]:
#import graphistry

In [21]:
#graphistry.register(api=3, protocol="https", server="hub.graphistry.com", username=xxxxx, password=xxxxx)  

In [22]:
#graphistry.edges(subgraph.edges(), 'src', 'dst').plot()

-----
# Who is important?
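Centrality is a measure of how important, or central, a node is within the graph.
__Note__: always set the 'k' value (the number of source vertices to sample) for Betweenness Centrality, since the exact computation is very expensive on a graph of this size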

In [23]:
# Compute Centrality
# the centrality calls are very straightforward, with the graph being the primary argument
# apart from 'k' for betweenness and 'tol' for PageRank and eigenvector, we use the default argument values
def compute_centrality(_graph) :
    # Compute Degree Centrality
    _d = cugraph.degree_centrality(_graph)
        
    # Compute the Betweenness Centrality (approximate, using k=100 sampled sources)
    _b = cugraph.betweenness_centrality(_graph, k=100)

    # Compute Katz Centrality
    _k = cugraph.katz_centrality(_graph)
    
    # Compute PageRank Centrality
    _p = cugraph.pagerank(_graph,tol=0.0001)
    
    # Compute Eigenvector Centrality
    _e = cugraph.eigenvector_centrality(_graph, tol=0.0001)
    
    return (_d, _b, _k, _p, _e)

In [24]:
# Print function
# The input is the tuple from the `compute_centrality` function
from IPython.display import display_html 
def print_centrality(_C, _n):
    dc_top = _C[0].sort_values(by='degree_centrality', ascending=False).head(_n).to_pandas()
    bc_top = _C[1].sort_values(by='betweenness_centrality', ascending=False).head(_n).to_pandas()
    katz_top = _C[2].sort_values(by='katz_centrality', ascending=False).head(_n).to_pandas()
    pr_top = _C[3].sort_values(by='pagerank', ascending=False).head(_n).to_pandas()
    ec_top = _C[4].sort_values(by='eigenvector_centrality', ascending=False).head(_n).to_pandas()
    
    df1_styler = dc_top.style.set_table_attributes("style='display:inline'").set_caption('Degree').hide(axis='index')
    df2_styler = bc_top.style.set_table_attributes("style='display:inline'").set_caption('Betweenness').hide(axis='index')
    df3_styler = katz_top.style.set_table_attributes("style='display:inline'").set_caption('Katz').hide(axis='index')
    df4_styler = pr_top.style.set_table_attributes("style='display:inline'").set_caption('PageRank').hide(axis='index')
    df5_styler = ec_top.style.set_table_attributes("style='display:inline'").set_caption('Eigenvector').hide(axis='index')

    display_html(df1_styler._repr_html_()+
                 df2_styler._repr_html_()+
                 df3_styler._repr_html_()+
                 df4_styler._repr_html_()+
                 df5_styler._repr_html_(), 
                 raw=True)

In [25]:
C = compute_centrality(subgraph)

In [26]:
print_centrality(C, 5)

degree_centrality,vertex
0.042056,OlaTinee
0.027379,spectatorindex
0.026136,WHO
0.022629,realDonaldTrump
0.0198,howroute

betweenness_centrality,vertex
0.105148,OlaTinee
0.099841,WHO
0.095573,generate_output
0.043395,spectatorindex
0.032213,YouTube

katz_centrality,vertex
0.001224,OlaTinee
0.001011,spectatorindex
0.000993,WHO
0.000942,realDonaldTrump
0.0009,howroute

pagerank,vertex
0.006835,OlaTinee
0.003161,spectatorindex
0.002761,vinistupido
0.002161,YouTube
0.001999,WHO

eigenvector_centrality,vertex
0.385401,OlaTinee
0.250899,spectatorindex
0.239513,WHO
0.207373,realDonaldTrump
0.181448,howroute


The centrality numbers are very low, which could indicate that the graph needs to be further clustered
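One reason the degree centrality values look small is normalization. The sketch below is an arithmetic check, not cuGraph code: it assumes the score is (in-degree + out-degree) / (n - 1), and that each account's degree in the largest component equals the degree reported earlier for the full graph. Both are assumptions inferred from the numbers printed in this notebook.

```python
# Values copied from the outputs earlier in this notebook.
n_nodes = 2_668_807  # number of nodes in the largest component
degrees = {"OlaTinee": 56_119, "spectatorindex": 36_534}

# Assumption: degree centrality = (in_degree + out_degree) / (n - 1).
# For an undirected graph, in_degree == out_degree, hence the factor of 2.
centrality = {
    name: 2 * degree / (n_nodes - 1)
    for name, degree in degrees.items()
}

for name, value in centrality.items():
    print(f"{name}: {value:.6f}")
```

Under these assumptions the results round to the 0.042056 and 0.027379 shown in the Degree table, so even the most connected account touches only about 2% of the other nodes.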

-----
# Community Detection
Component != Community 

Let's run Louvain 
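Louvain searches for a partition with high modularity, which is the score inspected a few cells below. As a hedged sketch of what that score measures, the plain-Python function below computes modularity for an unweighted, undirected graph without self loops. The function name `modularity` and the toy graph are assumptions for illustration and are not part of cuGraph.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q = sum over communities of L_c/m - (d_c / 2m)^2.

    L_c is the number of edges inside community c, d_c is the summed degree
    of its vertices, and m is the total number of edges.
    Assumes an unweighted, undirected graph with no self loops.
    """
    m = len(edges)
    internal = defaultdict(int)    # L_c
    degree_sum = defaultdict(int)  # d_c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Hypothetical toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

two_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_group = {v: 0 for v in range(6)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(q_two, q_one)
```

Splitting the toy graph along the bridge scores higher than putting everything in one community, which scores zero. That is the sense in which a higher modularity suggests a better clustering.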

In [27]:
communities_df, mod_score = cugraph.louvain(subgraph)

In [28]:
#  Do we have a good clustering?  Look at the modularity score
mod_score

0.7411910891532898

In [29]:
# How many communities were found?
part_ids = communities_df["partition"].unique()
print("Louvain found " + str(len(part_ids)) + " communities")

Louvain found 1618 communities


In [30]:
community_count = communities_df.groupby('partition').count().rename(columns={"vertex": "count"})

In [31]:
# what are the largest communities?
sorted_community_counts = community_count.sort_values(by='count', ascending=False)
sorted_community_counts.head(10)

Unnamed: 0_level_0,count
partition,Unnamed: 1_level_1
1271,309076
1607,205208
503,171562
633,167905
1390,150457
1478,133335
1489,129612
472,118715
367,114616
1076,111258


-----
Keep diving deeper and analyze the largest communities

In [32]:
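def extract_subgraph(idx, sorted_cc, graph):
    # Note: this relies on the global `communities_df` returned by Louvain above
    # Partition ID of the idx-th largest community
    _com_id = sorted_cc.index[idx]
    # Vertices assigned to that community
    _v = communities_df[communities_df['partition'] == _com_id]  
    # Subgraph containing only those vertices
    _s = cugraph.subgraph(graph, _v['vertex'])
    return _s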

In [33]:
g0 = extract_subgraph(0, sorted_community_counts, subgraph)
print_centrality(compute_centrality(g0), 5)

degree_centrality,vertex
0.142458,realDonaldTrump
0.103723,howroute
0.07848,RealJamesWoods
0.062186,IsChinar
0.061765,BNODesk

betweenness_centrality,vertex
0.142623,realDonaldTrump
0.091157,howroute
0.055516,IsChinar
0.051839,RealJamesWoods
0.04846,DarrenPlymouth

katz_centrality,vertex
0.003597,realDonaldTrump
0.003108,howroute
0.002789,RealJamesWoods
0.002584,IsChinar
0.002579,BNODesk

pagerank,vertex
0.012928,realDonaldTrump
0.009578,howroute
0.006905,RealJamesWoods
0.005073,Education4Libs
0.005004,BNODesk

eigenvector_centrality,vertex
0.419529,realDonaldTrump
0.314915,howroute
0.209167,IsChinar
0.158311,BNODesk
0.154515,jenniferatntd


In [34]:
g1 = extract_subgraph(1, sorted_community_counts, subgraph)
print_centrality(compute_centrality(g1), 5)

degree_centrality,vertex
0.097687,CNNEE
0.092443,ActualidadRT
0.070495,AlertaNews24
0.056294,MaihenH
0.047796,dw_espanol

betweenness_centrality,vertex
0.154718,CNNEE
0.137271,ActualidadRT
0.083529,MaihenH
0.062553,AlertaNews24
0.06018,ChalecosAmarill

katz_centrality,vertex
0.004415,CNNEE
0.004296,ActualidadRT
0.0038,AlertaNews24
0.003479,MaihenH
0.003287,dw_espanol

pagerank,vertex
0.011602,CNNEE
0.010421,ActualidadRT
0.008343,AlertaNews24
0.005778,MaihenH
0.004908,dw_espanol

eigenvector_centrality,vertex
0.382461,CNNEE
0.336205,ActualidadRT
0.17312,AlertaNews24
0.151726,dw_espanol
0.136776,MaihenH


In [35]:
g2 = extract_subgraph(2, sorted_community_counts, subgraph)
print_centrality(compute_centrality(g2), 5)

degree_centrality,vertex
0.213755,vinistupido
0.162788,AlineTosin
0.058335,Byano_DJ
0.057332,celsolamounier
0.055432,lucasrohan

betweenness_centrality,vertex
0.265578,vinistupido
0.213782,AlineTosin
0.089499,lucasrohan
0.060687,celsolamounier
0.059768,g1

katz_centrality,vertex
0.004828,vinistupido
0.004252,AlineTosin
0.003073,Byano_DJ
0.003061,celsolamounier
0.00304,lucasrohan

pagerank,vertex
0.043346,vinistupido
0.031361,AlineTosin
0.011662,Byano_DJ
0.011165,celsolamounier
0.00986,carloscrl144

eigenvector_centrality,vertex
0.659705,vinistupido
0.502406,AlineTosin
0.180037,Byano_DJ
0.176943,celsolamounier
0.171078,lucasrohan


-----
We could have gone straight to community detection from the start

In [36]:
%%time
all_communities_df, mod_score = cugraph.louvain(subgraph)

CPU times: user 724 ms, sys: 145 ms, total: 869 ms
Wall time: 863 ms


In [37]:
# How many communities were found?
all_part_ids = all_communities_df["partition"].unique()
print("Louvain found " + str(len(all_part_ids)) + " communities")

Louvain found 1618 communities


In [38]:
all_community_count = all_communities_df.groupby('partition').count().rename(columns={"vertex": "count"})

In [39]:
# what are the largest communities?
all_sorted_community_counts = all_community_count.sort_values(by='count', ascending=False)
all_sorted_community_counts.head(5)

Unnamed: 0_level_0,count
partition,Unnamed: 1_level_1
1271,309076
1607,205208
503,171562
633,167905
1390,150457


-----
Looking for similarities

In [40]:
Jaccard = cugraph.jaccard(g0)

In [41]:
Jaccard.sort_values(by='jaccard_coeff', ascending=False).head()

Unnamed: 0,jaccard_coeff,source,destination
37566,1.0,howroute,howroute
61005,1.0,IsChinar,IsChinar
65558,1.0,BNODesk,BNODesk
76222,1.0,jenniferatntd,jenniferatntd
96239,1.0,DarrenPlymouth,DarrenPlymouth


Lots of users referencing themselves

In [42]:
# Drop self links  (should have done this at the graph level)
J2 = Jaccard[Jaccard['source'] != Jaccard['destination']]

In [43]:
J2.sort_values(by='jaccard_coeff', ascending=False).head(10)

Unnamed: 0,jaccard_coeff,source,destination
1009425,1.0,GuyGuidoFawkes1,PutinRF_English
1011294,1.0,PutinRF_English,GuyGuidoFawkes1
1183188,1.0,800dbcloud,1NKDR0P
1185214,1.0,1NKDR0P,800dbcloud
1298775,1.0,cyberdefensemag,miliefsky
1316297,1.0,miliefsky,cyberdefensemag
697283,0.918919,TresaBridges,lazyishhound
711856,0.918919,lazyishhound,TresaBridges
1181191,0.909091,REAL_DARGI,BLACKON13649707
1214883,0.909091,BLACKON13649707,REAL_DARGI


In [44]:
Jaccard[Jaccard['source'] == 'realDonaldTrump'].sort_values(by='jaccard_coeff', ascending=False).head()

Unnamed: 0,jaccard_coeff,source,destination
5643,0.051783,realDonaldTrump,PressSec
5635,0.042745,realDonaldTrump,POTUS
5636,0.040349,realDonaldTrump,BrianKolfage
5637,0.03085,realDonaldTrump,SecAzar
6065,0.024799,realDonaldTrump,BrianKarem


In [45]:
Overlap = cugraph.overlap(g0)

In [46]:
Overlap[Overlap['source'] == 'realDonaldTrump'].sort_values(by='overlap_coeff', ascending=False).head()

Unnamed: 0,overlap_coeff,source,destination
7169,0.99635,realDonaldTrump,BrianKarem
6407,0.983333,realDonaldTrump,EbneHava
9498,0.98,realDonaldTrump,marc_lotter
9583,0.958333,realDonaldTrump,BruhResign
27105,0.952381,realDonaldTrump,SteveScalise
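The same pair (realDonaldTrump, BrianKarem) scores about 0.02 with Jaccard but about 0.99 with Overlap. A small plain-Python sketch of the two textbook set formulas shows why a hub paired with a small account behaves this way. The neighbor sets are made up, and cuGraph's exact definition of a vertex's neighbor set may differ from this simplified version.

```python
def jaccard(a, b):
    """Shared neighbors divided by the size of the combined neighborhood."""
    return len(a & b) / len(a | b)

def overlap(a, b):
    """Shared neighbors divided by the size of the smaller neighborhood."""
    return len(a & b) / min(len(a), len(b))

# Hypothetical neighbor sets: a hub with 1,000 neighbors and a small account
# with 5 neighbors, 4 of which are also neighbors of the hub.
hub = set(range(1000))
small = {0, 1, 2, 3, 999_999}

print(round(jaccard(hub, small), 6))
print(overlap(hub, small))

# Comparing a set with itself always gives 1.0, which matches the
# self-link rows that topped the first Jaccard listing.
print(jaccard(hub, hub))
```

Jaccard penalizes the size mismatch because the union is dominated by the hub, while Overlap only asks how much of the smaller neighborhood is shared.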
