# Social Network Analysis - Python Hands-on

In [6]:
import numpy as np
import networkx as nx

import pandas as pd
import datetime
import matplotlib.pyplot as plt

### CUDA RAPIDS
# import cugraph as cnx
# import cudf
# import cupy as cp

## The Data

The data set is provided by Der Standard, one of the top Austrian newspapers.
On the newspaper's website, users can post comments below articles and up- or down-vote comments.
The data set used in this hands-on, and later in the project part of the course, is a sample of those articles, comments, and votes.

In [4]:
date_cols = ["PostingCreatedAt","ArticlePublishingDate"]

df1 = pd.read_csv('../data/Postings_01052019_15052019.csv',usecols=["ID_CommunityIdentity", "ID_Posting", "PostingCreatedAt", "ArticleTitle",'ArticleChannel' ,"ArticleRessortName","ArticlePublishingDate"],parse_dates=date_cols, sep=';')
df2 = pd.read_csv('../data/Postings_16052019_31052019.csv', usecols=["ID_CommunityIdentity", "ID_Posting","PostingCreatedAt", "ArticleTitle",'ArticleChannel' ,"ArticleRessortName","ArticlePublishingDate"], parse_dates=date_cols,sep=';')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df1, df2], ignore_index=True)
df.shape

(739094, 7)

In [5]:
df = df[(df.ArticleChannel == "Inland") & (~df.ArticleRessortName.isin([ "Pensionen", "Eurofighter","Off-Topic"]))]
df.head()

Unnamed: 0,ID_Posting,ID_CommunityIdentity,PostingCreatedAt,ArticlePublishingDate,ArticleTitle,ArticleChannel,ArticleRessortName
0,1041073586,671476,2019-05-01 18:21:15.127,2019-05-01 10:28:57.490,1. Mai in Wien: SPÖ fordert von Strache Rücktritt,Inland,Parteien
1,1041073839,566938,2019-05-01 18:28:22.040,2019-05-01 10:28:57.490,1. Mai in Wien: SPÖ fordert von Strache Rücktritt,Inland,Parteien
2,1041073872,669286,2019-05-01 18:29:05.533,2019-05-01 10:28:57.490,1. Mai in Wien: SPÖ fordert von Strache Rücktritt,Inland,Parteien
3,1041080734,671476,2019-05-01 22:37:56.010,2019-05-01 10:28:57.490,1. Mai in Wien: SPÖ fordert von Strache Rücktritt,Inland,Parteien
4,1041080828,671476,2019-05-01 22:42:06.310,2019-05-01 10:28:57.490,1. Mai in Wien: SPÖ fordert von Strache Rücktritt,Inland,Parteien


In [6]:
df.shape

(185509, 7)

There are different entities in the data set: 
* Users - identified by *ID_CommunityIdentity* (or *UserCommunityName*)
* Postings - identified by *ID_Posting*
* Articles - identified by *ID_Article*

Thus, there are different possibilities to build networks based on voting and posting data. 
We will concentrate now on the ***votes-to-network***. 


In [7]:
date_cols = ["VoteCreatedAt","UserCreatedAt"]
votes1 = pd.read_csv('../data/Votes_01052019_15052019.csv',parse_dates=date_cols, sep=';')
votes2 = pd.read_csv('../data/Votes_16052019_31052019.csv', parse_dates=date_cols,sep=';')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
votes = pd.concat([votes1, votes2], ignore_index=True)
votes.head()

Unnamed: 0,ID_CommunityIdentity,ID_Posting,VoteNegative,VotePositive,VoteCreatedAt,UserCommunityName,UserGender,UserCreatedAt
0,675862,1041076570,1,0,2019-05-06 16:47:46.883,Heckscheibenwischer,m,2018-06-26 06:04:30.513
1,689023,1041076570,1,0,2019-05-01 22:19:06.240,Heinz Fettleber,,2019-03-08 21:23:11.463
2,24810,1041076745,0,1,2019-05-01 23:54:54.600,Bruce Campbell,m,2011-01-12 16:50:40.597
3,673781,1041076745,0,1,2019-05-01 20:59:29.910,Erdäpfelsack,,2018-05-29 07:13:49.350
4,24810,1041076831,0,1,2019-05-01 23:51:42.730,Bruce Campbell,m,2011-01-12 16:50:40.597


In [9]:
PostAndVotes=pd.merge(df,votes,on="ID_Posting")
PostAndVotes.head()
PostAndVotes.shape

(1046528, 14)

In [10]:
PostAndVotes_less=PostAndVotes.groupby('ID_Posting').filter(lambda x : len(x)>5).copy()
split_date= datetime.datetime(2019,5,17)

PostAndVotes_before = PostAndVotes_less.loc[PostAndVotes_less['PostingCreatedAt'] <= split_date]
PostAndVotes_after = PostAndVotes_less.loc[PostAndVotes_less['PostingCreatedAt'] > split_date]
PostAndVotes_after.head()

Unnamed: 0,ID_Posting,ID_CommunityIdentity_x,PostingCreatedAt,ArticlePublishingDate,ArticleTitle,ArticleChannel,ArticleRessortName,ID_CommunityIdentity_y,VoteNegative,VotePositive,VoteCreatedAt,UserCommunityName,UserGender,UserCreatedAt
288026,1041556100,690843,2019-05-17 11:55:52.037,2019-05-17 06:00:00,"Neos-Kandidatin Gamon: ""Was ÖVP und FPÖ machen...",Inland,Neos,566359,0,1,2019-05-17 12:58:25.500,RudiSemmel,m,2015-08-18 00:14:31.843
288027,1041556100,690843,2019-05-17 11:55:52.037,2019-05-17 06:00:00,"Neos-Kandidatin Gamon: ""Was ÖVP und FPÖ machen...",Inland,Neos,509478,0,1,2019-05-17 12:59:20.530,Seidelbast,w,2013-02-28 09:52:08.860
288028,1041556100,690843,2019-05-17 11:55:52.037,2019-05-17 06:00:00,"Neos-Kandidatin Gamon: ""Was ÖVP und FPÖ machen...",Inland,Neos,580319,1,0,2019-05-17 11:58:30.570,stay calm,m,2016-04-09 21:20:45.280
288029,1041556100,690843,2019-05-17 11:55:52.037,2019-05-17 06:00:00,"Neos-Kandidatin Gamon: ""Was ÖVP und FPÖ machen...",Inland,Neos,505803,0,1,2019-05-17 12:13:32.723,The_new_Dude,m,2004-07-06 14:10:25.217
288030,1041556100,690843,2019-05-17 11:55:52.037,2019-05-17 06:00:00,"Neos-Kandidatin Gamon: ""Was ÖVP und FPÖ machen...",Inland,Neos,170067,0,1,2019-05-17 12:41:33.613,Peter Williams,m,2001-04-29 13:51:13.000


In [11]:
print('Before shape: ' + str(PostAndVotes_before.shape))
print('After shape: ' + str(PostAndVotes_after.shape))


Before shape: (280769, 14)
After shape: (579639, 14)


Each line in the table above combines a posting with one vote on it. Every posting has its own unique identifier (i.e., *ID_Posting*) and was written by a user (i.e., *ID_CommunityIdentity*). A vote is identified by the *ID_Posting* it was cast for and the *ID_CommunityIdentity* of the voter. Since both tables contain an *ID_CommunityIdentity* column, the merge renames them: *ID_CommunityIdentity_x* is the author of the posting and *ID_CommunityIdentity_y* is the voter. Whether the vote was negative or positive is recorded in *VoteNegative* and *VotePositive*, respectively.

We want to bring the structure above into the following format: 
* source, i.e., the voting user
* target, i.e., the post creator
* weight, i.e., how often the source voted for the target (positive and negative)

In other words, we are aiming for a *weighted edge-list*.
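The target format can be sketched on a tiny, made-up table. The column names `voter` and `author` and all values are hypothetical stand-ins for the two *ID_CommunityIdentity* columns of the merged data; this is an illustration of the idea, not the notebook's actual pipeline.

```python
import pandas as pd

# Hypothetical toy data: one row per vote, already merged with the posting's author.
toy = pd.DataFrame({
    "voter":        [1, 1, 1, 2, 2],
    "author":       [9, 9, 9, 9, 8],
    "VotePositive": [1, 1, 0, 0, 1],
    "VoteNegative": [0, 0, 1, 1, 0],
})

# One row per (source, target) pair with the summed positive and negative votes.
toy_edges = (
    toy.groupby(["voter", "author"])
       .agg(votes_pos_count=("VotePositive", "sum"),
            votes_neg_count=("VoteNegative", "sum"))
       .rename_axis(["source", "target"])
       .reset_index()
)
print(toy_edges)
```

A weight column can then be derived from the two count columns in whatever way suits the analysis.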

### Edges

In [12]:
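# _x is the posting author, _y is the voter (suffixes added by the merge).
# Note: with this key order the author becomes the first index level; swap the two keys
# if the voter is meant to be the source, as described above.
edgeListBefore= PostAndVotes_before.groupby(["ID_CommunityIdentity_x","ID_CommunityIdentity_y"]).agg({"VoteNegative": [("votes_neg_count","sum")], "VotePositive":[("votes_pos_count","sum")]})
edgeListAfter= PostAndVotes_after.groupby(["ID_CommunityIdentity_x","ID_CommunityIdentity_y"]).agg({"VoteNegative": [("votes_neg_count","sum")], "VotePositive":[("votes_pos_count","sum")]})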


In [13]:
edgeListBefore.columns=edgeListBefore.columns.droplevel()
edgeListAfter.columns=edgeListAfter.columns.droplevel()

In [None]:
edgeListBefore.loc[edgeListBefore["votes_neg_count"]>10]


### ATTENTION: 
PLEASE DEFINE YOUR WEIGHT BELOW 

In [15]:
## original
# edgeListBefore["weight"]=edgeListBefore["votes_pos_count"]-edgeListBefore["votes_neg_count"]
# edgeListAfter["weight"]=edgeListAfter["votes_pos_count"]-edgeListAfter["votes_neg_count"]

In [21]:
## v1
edgeListBefore["weight"]= (1+edgeListBefore["votes_pos_count"])/(1+edgeListBefore["votes_neg_count"])
edgeListAfter["weight"]=(1+edgeListAfter["votes_pos_count"])/(1+edgeListAfter["votes_neg_count"])
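To get a feeling for the two weight definitions, the following sketch applies both to three made-up edges (the counts are hypothetical). It only illustrates how the values behave; which definition is appropriate remains a modelling decision.

```python
import pandas as pd

# Hypothetical vote counts for three edges.
toy = pd.DataFrame({"votes_pos_count": [3, 0, 2],
                    "votes_neg_count": [0, 3, 2]})

# Difference ("original"): can be negative or zero.
toy["weight_diff"] = toy["votes_pos_count"] - toy["votes_neg_count"]
# Smoothed ratio ("v1"): always positive; 1.0 means balanced votes.
toy["weight_ratio"] = (1 + toy["votes_pos_count"]) / (1 + toy["votes_neg_count"])
print(toy)
```

The ratio stays strictly positive, which matters for algorithms that cannot handle zero or negative edge weights.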

In [22]:
edgeListBefore.rename_axis(['source', 'target'], inplace=True)
edgeListAfter.rename_axis(['source', 'target'], inplace=True)

In [23]:
edgesBefore = edgeListBefore.reset_index()
edgesAfter = edgeListAfter.reset_index()

In [28]:
#list(edgesAfter.weight.tail(100))

In [30]:
edgesBefore.to_csv("../data/votes_to_comments_before.csv", index=False)
edgesAfter.to_csv("../data/votes_to_comments_after.csv", index=False)

### Graph

We use the *networkx* library.
Since we build a *votes-to-network*, every edge has a *source* node and a *target* node. 
Thus, the network is directed.
Therefore, we use *nx.DiGraph()*.
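What "directed" means in practice can be checked on a two-edge toy edge list (the values are made up): an edge from 1 to 2 does not imply an edge from 2 to 1.

```python
import networkx as nx
import pandas as pd

# Hypothetical edge list with the same column names as the real one.
toy_edges = pd.DataFrame({"source": [1, 2], "target": [2, 3], "weight": [2.0, 0.5]})

D = nx.from_pandas_edgelist(toy_edges, source="source", target="target",
                            edge_attr="weight", create_using=nx.DiGraph())

print(D.has_edge(1, 2), D.has_edge(2, 1))  # direction matters
print(D[1][2]["weight"])                   # edge attribute is kept
```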

In [26]:
edgesBefore = pd.read_csv("../data/votes_to_comments_before.csv")
edgesAfter = pd.read_csv("../data/votes_to_comments_after.csv")

In [27]:
G = nx.from_pandas_edgelist(edgesAfter, 
                            source='source', 
                            target='target', 
                            edge_attr = 'weight',
                            create_using=nx.DiGraph())

In [62]:
# ## or CUDA (CuGraph variant)
# edgesBefore = cudf.read_csv("../data/votes_to_comments_before.csv")
# edgesAfter = cudf.read_csv("../data/votes_to_comments_after.csv")

# G = cnx.DiGraph()
# G = cnx.from_cudf_edgelist(edgesAfter, 
#                             source ='source', 
#                             destination ='target',
#                             edge_attr = 'weight')

In [None]:
fig = plt.figure(figsize=(50,50))
nx.draw_spring(G)
plt.show()

### Basic Stats

In [30]:
print(nx.info(G))

Name: 
Type: DiGraph
Number of nodes: 19311
Number of edges: 516939
Average in degree:  26.7691
Average out degree:  26.7691
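The same figures can also be computed directly, which is useful on NetworkX versions where `nx.info` is not available (to my knowledge it was removed in NetworkX 3.0). The sketch below uses a small made-up graph. Since every edge contributes exactly one in-degree and one out-degree, both averages equal the number of edges divided by the number of nodes, which matches 516939 / 19311 ≈ 26.77 above.

```python
import networkx as nx

# Hypothetical toy graph with 3 nodes and 4 directed edges.
D = nx.DiGraph([(1, 2), (2, 3), (3, 1), (1, 3)])

n = D.number_of_nodes()
m = D.number_of_edges()
avg_in = sum(d for _, d in D.in_degree()) / n
avg_out = sum(d for _, d in D.out_degree()) / n

print(f"Number of nodes: {n}")
print(f"Number of edges: {m}")
print(f"Average in degree: {avg_in:.4f}")
print(f"Average out degree: {avg_out:.4f}")
```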


In [36]:
edges = G.edges()
edges
# number of edges with weight 1
len(edgesBefore[edgesBefore.weight == 1])

857

In [None]:
# max weight of edges (G.edges() has no weight column, so use the edge-list DataFrame G was built from)
edgesAfter.weight.max()

In [None]:
# average weight 
edgesAfter.weight.mean()

### Network density and path lengths

In [37]:
# network density
nx.density(G)

0.0013862841594108118

In [39]:
# Average distance (i.e. average shortest path length)
nx.average_shortest_path_length(G)

1.2015565061527573

The method average_shortest_path_length throws an exception if the underlying graph is disconnected. As an alternative, one can calculate the average of all finite distances (i.e., of the existing shortest paths). nx.single_source_shortest_path_length(G, N) delivers the lengths of all shortest paths starting from node N. The first entry is always the distance from N to itself (i.e., zero), which has to be filtered out later on.

In [40]:
# compute all distances
distances = [list(nx.single_source_shortest_path_length(G,N).values()) for N in G.nodes]
# Flatten the distances list! Currently list of lists of single node distances
# and filter out the unnecessary zeroes
distances = [distance for single_distances in distances for distance in single_distances if distance > 0]

In [41]:
# average
np.mean(distances)

3.1636479960640003

To take the weight into account, one can use e.g. nx.single_source_dijkstra_path_length(). But watch out: what does weight mean in our case?
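Dijkstra's algorithm sums edge weights as distances, so a large weight makes an edge "long". The toy graph below (nodes and weights are made up) shows the consequence: the direct edge with weight 4 loses against a two-hop detour with small weights.

```python
import networkx as nx

D = nx.DiGraph()
D.add_weighted_edges_from([
    ("a", "b", 4.0),   # direct edge with a large weight
    ("a", "c", 0.5),   # detour, first hop
    ("c", "b", 0.5),   # detour, second hop
])

hops = nx.single_source_shortest_path_length(D, "a")                    # counts edges
cost = nx.single_source_dijkstra_path_length(D, "a", weight="weight")   # sums weights

print(hops["b"], cost["b"])
```

If a large vote-based weight is meant to express closeness rather than distance, the weights would have to be transformed first (for instance with something like 1/weight, which is only one possible assumption).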

In [42]:
# Diameter (i.e., longest shortest path)
np.max(distances)

8

### Connected components

G.subgraph(c) for c in nx.weakly_connected_components(G) delivers a generator, which can be used to iterate over all weakly connected components (each delivered as a subgraph for further analysis).

In [43]:
wccs = [c for c in (G.subgraph(c) for c in nx.weakly_connected_components(G))]

In [44]:
# number of wccs
len(wccs)

1

In [45]:
# number of wccs
nx.number_weakly_connected_components(G)

1

Sizes of the wccs:

nx.number_of_nodes() delivers the number of nodes of a graph. This can be done for every weakly connected component wcc in the list of weakly connected components. Furthermore, with set() one can get the unique values.

In [46]:
set([nx.number_of_nodes(wcc) for wcc in wccs])

{19311}

If the focus is not on the unique values but, for example, on how often a wcc with n nodes appears, one can use Counter().most_common() as follows:

In [47]:
from collections import Counter
Counter([nx.number_of_nodes(wcc) for wcc in wccs]).most_common()

[(19311, 1)]

Strongly connected components:

In [48]:
sccs = [c for c in (G.subgraph(c) for c in nx.strongly_connected_components(G))]
Counter([nx.number_of_nodes(scc) for scc in sccs]).most_common()

[(1, 12557), (6754, 1)]
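The difference between weak and strong connectivity is easiest to see on a small made-up graph: weak components ignore edge direction, whereas in a strong component every node must reach every other node along directed edges.

```python
import networkx as nx
from collections import Counter

# Hypothetical toy graph: 1 <-> 2 -> 3, and a separate edge 4 -> 5.
D = nx.DiGraph([(1, 2), (2, 1), (2, 3), (4, 5)])

wcc_sizes = Counter(len(c) for c in nx.weakly_connected_components(D))
scc_sizes = Counter(len(c) for c in nx.strongly_connected_components(D))

print("weak:  ", dict(wcc_sizes))
print("strong:", dict(scc_sizes))
```

Only nodes 1 and 2 form a strongly connected component with more than one node; nodes 3, 4 and 5 each end up alone, analogous to the many size-1 components in the output above.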

### Clustering Coefficients

**Local**

nx.clustering(G) returns a dictionary with the clustering coefficient of each node.
With the combination of sorted() and itemgetter() one can get a sorted list of (ID, clustering coeff.) tuples.
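On a small toy graph the coefficients can be verified by hand. The sketch uses an undirected, made-up graph for simplicity (a triangle a, b, c plus a pendant node d attached to a); for directed graphs networkx applies a directed generalisation of the same idea.

```python
import networkx as nx
from operator import itemgetter

# Hypothetical toy graph: triangle a-b-c with an extra node d attached to a.
U = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")])

cc = nx.clustering(U)
# a has 3 neighbours (3 possible pairs), only b-c is connected -> 1/3
# b and c have 2 neighbours each, and those are connected      -> 1.0
# d has a single neighbour                                      -> 0
print(sorted(cc.items(), key=itemgetter(1), reverse=True))
```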

In [49]:
from operator import itemgetter
sorted(nx.clustering(G).items(), key=itemgetter(1), reverse=True)[:5]

[(130264, 1.0), (317646, 1.0), (632768, 1.0), (574288, 1.0), (676509, 1.0)]

"*the clustering coefficient quantifies how close the neighbours of i are to being a clique.*" (lecture slides), i.e., how densely the neighbours of a node are connected among themselves.

**Global**

The global clustering coefficient can have alternative definitions:

1) as the average of the local clustering coefficients

In [50]:
nx.average_clustering(G)

0.06559233425493029

Note that there might be differences if you use other tools (e.g., Gephi, igraph, etc.).
So, why does networkx deliver a different average clustering coefficient?
In order to find an answer, take a look at the nx.clustering() documentation (since nx.average_clustering is just averaging over the individual values): https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.cluster.clustering.html
It says that the clustering coefficient of nodes with degree lower than 2 is set to zero.

Thus, there is no right or wrong implementation, but you have to be aware of what you are using.

2) as the ratio of triangles and connected triples

In [51]:
nx.transitivity(G)

0.030381907301514577
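That the two definitions generally disagree can be reproduced on a small undirected toy graph (made up for illustration): a triangle a, b, c with a pendant node d attached to a.

```python
import networkx as nx

U = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")])

# 1) average of the local coefficients: (1/3 + 1 + 1 + 0) / 4
avg_cc = nx.average_clustering(U)

# 2) 3 * number of triangles / number of connected triples: 3 * 1 / 5
trans = nx.transitivity(U)

print(avg_cc, trans)
```

The average of local coefficients gives every node the same say, while transitivity weights nodes by their number of neighbour pairs, so high-degree nodes dominate the second measure.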

### Centrality Indices

**In-Degree**

[nx.in_degree_centrality(G)](https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.in_degree_centrality.html#networkx.algorithms.centrality.in_degree_centrality) delivers in-degree-centrality of each node in a graph G.
Note, that the centralities are normalized.


With a combination of sorted() and itemgetter() one can again get a sorted list of (node, centrality) tuples,
of which one can take just the first 5 for reporting.
Note that reverse=True means decreasing order.
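"Normalized" here means that the in-degree is divided by n - 1, the maximum possible in-degree in a simple graph with n nodes. A minimal check on a made-up graph:

```python
import networkx as nx
from operator import itemgetter

# Hypothetical toy graph: node 3 receives edges from all three other nodes.
D = nx.DiGraph([(1, 3), (2, 3), (4, 3), (3, 1)])

idc = nx.in_degree_centrality(D)
n = D.number_of_nodes()
manual = {v: d / (n - 1) for v, d in D.in_degree()}

print(sorted(idc.items(), key=itemgetter(1), reverse=True)[:2])
```

Multiplying a reported centrality by n - 1 therefore recovers the raw in-degree.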



In [52]:
sorted(nx.in_degree_centrality(G).items(), key=itemgetter(1), reverse=True)[:5]

[(13770, 0.10719834282755049),
 (91348, 0.07804246504401864),
 (691990, 0.058933195235629204),
 (499631, 0.05033661315380632),
 (586763, 0.04873122734334542)]

User *13770* has the highest in-degree centrality, i.e., receives edges from the largest number of distinct users.

**Out-Degree**

In [53]:
sorted(nx.out_degree_centrality(G).items(), key=itemgetter(1), reverse=True)[:5]

[(55475, 0.13821853961677888),
 (688289, 0.08342827550491973),
 (519657, 0.0825479026411186),
 (692018, 0.07540134645261522),
 (204447, 0.06607975142413257)]

User *55475* has the highest out-degree centrality, i.e., has edges to the largest number of distinct users.

**Eigenvector-Centrality**

Eigenvector centrality computes the centrality for a node based on the centrality of its neighbors. The eigenvector centrality for node $i$ is the $i$-th element of the vector $x$ defined by the equation

$$Ax = \lambda x$$

where $A$ is the adjacency matrix of the graph G with eigenvalue $\lambda$. By virtue of the Perron–Frobenius theorem, there is a unique solution $x$, all of whose entries are positive, if $\lambda$ is the largest eigenvalue of the adjacency matrix $A$.
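The definition can be checked numerically on a small graph. The sketch below uses a made-up undirected graph, so that the adjacency matrix is symmetric and the left/right eigenvector distinction of the directed case does not arise. It takes the eigenvector of the largest eigenvalue with numpy and compares it with the result of networkx, assuming both are scaled to unit Euclidean length.

```python
import numpy as np
import networkx as nx

# Hypothetical toy graph: triangle a-b-c with an extra node d attached to a.
U = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")])
nodes = list(U.nodes)
A = nx.to_numpy_array(U, nodelist=nodes)

# eigh sorts eigenvalues in ascending order, so the last column belongs to the largest one.
eigenvalues, eigenvectors = np.linalg.eigh(A)
x = np.abs(eigenvectors[:, -1])   # fix the arbitrary sign of the eigenvector
x = x / np.linalg.norm(x)

ec = nx.eigenvector_centrality(U, max_iter=1000)
for node, value in zip(nodes, x):
    print(node, round(float(value), 4), round(ec[node], 4))
```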

In [54]:
sorted(nx.eigenvector_centrality(G).items(), key=itemgetter(1), reverse=True)[:5]

[(13770, 0.17154734158027612),
 (91348, 0.14210789770185256),
 (691990, 0.10521084588501653),
 (586763, 0.09673589889007222),
 (526088, 0.09599874654467412)]

**In-closeness centrality**

Closeness centrality of a node $u$ is the reciprocal of the average shortest path distance to $u$ over all $n-1$ reachable nodes.

$$C(u) = \frac{n - 1}{\sum_{v} d(v, u)}$$

where $d(v, u)$ is the shortest-path distance between $v$ and $u$, and $n$ is the number of nodes that can reach $u$. Notice that the closeness distance function computes the incoming distance to $u$ for directed graphs. To use outward distance, act on G.reverse().
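The effect of the direction can be seen on a made-up directed path a -> b -> c. With the default settings networkx additionally scales the value by the fraction of nodes that can reach u, which is why b ends up at 0.5 rather than 1.

```python
import networkx as nx

# Hypothetical toy graph: a -> b -> c
D = nx.DiGraph([("a", "b"), ("b", "c")])

closeness_in = nx.closeness_centrality(D)             # based on distances *to* each node
closeness_out = nx.closeness_centrality(D.reverse())  # based on distances *from* each node

print(closeness_in)
print(closeness_out)
```

Node c can be reached by both other nodes (distances 1 and 2), so its in-closeness is 2/3, while nobody can reach a; for the out-closeness the roles of a and c are swapped.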

In [55]:
sorted(nx.closeness_centrality(G).items(), key=itemgetter(1), reverse=True)[:5]

[(13770, 0.23534927320118174),
 (91348, 0.2246666410887312),
 (691990, 0.21575385562111715),
 (586763, 0.21161370077918698),
 (499631, 0.2114801484277959)]

**Out-closeness centrality**

In [None]:
sorted(nx.closeness_centrality(G.reverse()).items(), key=itemgetter(1), reverse=True)[:5]

**Betweenness centrality**

Betweenness centrality of a node $v$ is the sum of the fraction of all-pairs shortest paths that pass through $v$

$$c_B(v) = \sum_{s,t} \frac{\sigma(s, t \mid v)}{\sigma(s, t)}$$

where $\sigma(s, t)$ is the number of shortest $(s, t)$-paths, and $\sigma(s, t \mid v)$ is the number of those paths passing through some node $v$ other than $s, t$.


Computing the exact value is expensive on a graph of this size, so a sample of k=100 nodes is used to estimate the betweenness centrality.
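On a small graph the exact values are cheap and easy to verify by hand, which helps when interpreting the sampled estimate. The sketch uses a made-up undirected path 0-1-2-3-4; the values of `k` and `seed` are arbitrary choices for illustration.

```python
import networkx as nx

P = nx.path_graph(5)  # 0 - 1 - 2 - 3 - 4

# Exact, normalized betweenness: node 2 lies on 4 of the 6 pairs not involving itself.
exact = nx.betweenness_centrality(P)

# Estimate based on k sampled source nodes; seed fixed for reproducibility.
approx = nx.betweenness_centrality(P, k=3, seed=42)

print(exact)
print(approx)
```

The estimate depends on which nodes are sampled, so rankings based on a small k should be read with some caution.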

In [None]:
sorted(nx.betweenness_centrality(G, k=100).items(), key=itemgetter(1), reverse=True)[:5]

### Link Analysis

**Hubs and Authorities**

In [56]:
hubs_auth = nx.hits(G)

In [57]:
# hub scores
sorted(hubs_auth[0].items(), key=itemgetter(1), reverse=True)[:5]

[(519657, 0.007530638908006095),
 (204447, 0.003626318108992372),
 (692018, 0.0034952475278469393),
 (688289, 0.003486458185833507),
 (501035, 0.002635002262536037)]

In [58]:
# authority scores
sorted(hubs_auth[1].items(), key=itemgetter(1), reverse=True)[:5]

[(13770, 0.005023856325453032),
 (91348, 0.0037148012038317976),
 (691990, 0.002411994874554427),
 (526088, 0.0022889160067541482),
 (568564, 0.0021508604775567026)]

**PageRank**

In [59]:
sorted(nx.pagerank(G).items(), key=itemgetter(1), reverse=True)[:5]

[(13770, 0.002786153322149273),
 (91348, 0.0019966283709411253),
 (691990, 0.0014991130837114835),
 (586763, 0.0011135276467596075),
 (499631, 0.0011099405727271906)]

### To explore
networkx.algorithms.triads.triadic_census

In [61]:
nx.algorithms.triads.triadic_census(G)

{'003': 1190347414874,
 '012': 9413301276,
 '102': 141763297,
 '021D': 53832931,
 '021U': 39848218,
 '021C': 33555437,
 '111D': 2941611,
 '111U': 3445452,
 '030T': 2740732,
 '030C': 95262,
 '201': 132212,
 '120D': 157641,
 '120U': 198516,
 '120C': 107199,
 '210': 41076,
 '300': 2881}