### Workshop series, Koç University, Turkey, 11-12 April 2023

## [Introduction to Computational Social Science methods with Python](https://socialcomquant.ku.edu.tr/intro-to-css-methods-with-python/)

# Workshop 2: Introduction to network analysis with Python - Part II

**Description**: Computational Social Science is often concerned with the traces of human behavior, like those left by uses of social media, messaging services, or cell phones. Such digital behavioral data is genuinely relational and can, therefore, be studied using the formal techniques of network analysis. The basic units of networks, called nodes, can be actors (e.g., users), communicative symbols (e.g., hashtags), or even transactions (e.g., tweets). By focusing on the edges (relations) among nodes, network analysis is capable of creating insights that are not possible by merely doing statistics on the nodes and their attributes. In the workshop, we will give an introduction to how network data should be organized, how networks can be created in Python, and how they can be analyzed on three levels. On the micro level, we will introduce centrality analysis, which results in numerical descriptions of nodes. On the meso level, we will introduce community detection, which results in sets of nodes that form groups or clusters. On the macro level, we will introduce measures, such as homophily and assortativity, that describe the network in its entirety. We will be using network data from the Copenhagen Networks Study, which describes four different types of social relations among students over time. The workshop will alternate between live-coding demonstrations and periods in which participants apply that knowledge in context, both using Jupyter Notebooks. The software we will be using is NetworkX, a standard Python library that is simple to understand, provides a breadth of options, and has a large user community.

**Target group**: Undergraduate and master's students, doctoral candidates, and experienced researchers who want to get introduced to the practice of Computational Social Science.

**Requirements**: Participants are expected to know the basics of Python and have at least some experience using it. For the workshops, participants should bring a running system on which they can execute Jupyter Notebooks. We will be using Python 3.9 and several standard libraries that are part of the Anaconda 2022.10 distribution or can be installed on top of it. A list of the libraries and library versions that participants should import will be circulated before the workshops. We recommend that participants install Anaconda 2022.10. Feel free to work in a cloud environment such as Google Colab instead. Consult [this link](https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/1_computing_environment.ipynb) for more detailed instructions on how to set up your computing environment.

**Lecturers**: Dr. Haiko Lietz is a postdoctoral researcher in the Computational Social Science department at GESIS - Leibniz Institute for the Social Sciences. His research interests are in computational sociology, network science, and complexity science. Dr. N. Gizem Bacaksizlar Turbic is a postdoctoral researcher in the Computational Social Science departments at RWTH Aachen University and GESIS - Leibniz Institute for the Social Sciences. Her research areas include complex adaptive systems and social and political networks.

## Documentation of NetworkX 2.8.4

https://networkx.org/documentation/networkx-2.8.4/reference/index.html

## Network analysis

### Centrality measures

In [None]:
import networkx as nx
import pandas as pd
from operator import itemgetter

# only the last expression of a cell is displayed, so check the NetworkX version last
nx.__version__


In [None]:
# we will focus on the directed network of sms data first
D_sms = nx.read_gml('./data/D_sms_week1_lcc.gml')
D_sms = nx.convert_node_labels_to_integers(D_sms, first_label=0, ordering='default', label_attribute=None)


#### Degree centrality
The __degree__ is the number of ties a node has. The __degree centrality__ for a node _v_ is the fraction of nodes it is connected to. It measures potential communication activity. 
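
As a minimal sketch of this definition, the hypothetical toy graph below (not the SMS data) checks that the degree centrality reported by NetworkX equals a node's degree divided by the _n_-1 other nodes. In a directed graph the degree counts incoming and outgoing edges together, so values can exceed 1.

```python
import networkx as nx

# hypothetical toy graph: node 0 sends to nodes 1 and 2, node 1 replies to node 0
toy = nx.DiGraph([(0, 1), (0, 2), (1, 0)])
n = toy.number_of_nodes()

toy_degree_cent = nx.degree_centrality(toy)

# manual computation: degree (in + out) divided by the n - 1 other nodes
manual_degree_cent = {node: degree / (n - 1) for node, degree in toy.degree()}

print(toy_degree_cent)
print(manual_degree_cent)
```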

In [None]:
degree_cent = nx.degree_centrality(D_sms)

In [None]:
# show the first five nodes of the degree centrality dictionary
list(degree_cent.items())[:5]

In [None]:
# get the five nodes with the highest degree centrality
dict(sorted(degree_cent.items(), key = itemgetter(1), reverse = True)[:5])

#### In-degree centrality
The __in-degree centrality__ for a node _v_ is the fraction of nodes its incoming edges are connected to.

In [None]:
in_degree_cent = nx.in_degree_centrality(D_sms)

In [None]:
# show the first five nodes of the in_degree centrality dictionary
list(in_degree_cent.items())[:5]

In [None]:
# get the five nodes with the highest in_degree centrality
dict(sorted(in_degree_cent.items(), key = itemgetter(1), reverse = True)[:5])

#### Out-degree centrality
The __out-degree centrality__ for a node _v_ is the fraction of nodes its outgoing edges are connected to.
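
To illustrate how the two directed variants relate, the sketch below uses a hypothetical toy graph (not the SMS data). It splits each node's degree centrality into its in-degree and out-degree parts and confirms that the two parts add up to the total.

```python
import networkx as nx

# hypothetical toy graph: node 0 sends to nodes 1 and 2, node 1 replies to node 0
toy = nx.DiGraph([(0, 1), (0, 2), (1, 0)])

toy_in = nx.in_degree_centrality(toy)
toy_out = nx.out_degree_centrality(toy)
toy_total = nx.degree_centrality(toy)

# for every node, in-degree centrality plus out-degree centrality gives degree centrality
for node in toy.nodes():
    print(node, toy_in[node], toy_out[node], toy_total[node])
```

Node 2 only receives messages in this toy graph, so its out-degree centrality is zero while its in-degree centrality is not.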

In [None]:
out_degree_cent = nx.out_degree_centrality(D_sms)

In [None]:
# show the first five nodes of the out_degree centrality dictionary
list(out_degree_cent.items())[:5]

In [None]:
# get the five nodes with the highest out_degree centrality
dict(sorted(out_degree_cent.items(), key = itemgetter(1), reverse = True)[:5])

#### PageRank 
__PageRank__ counts the number and quality of links to a node to determine a rough estimate of how important the node is. The underlying assumption is that more important nodes are likely to receive more edges from other nodes.

Note: Undirected graphs will be converted to a directed graph with two directed edges for each undirected edge.

More details here: https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html
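
To make the roles of `alpha`, `max_iter`, and `tol` more concrete, here is a simplified, unweighted power-iteration sketch in plain Python. It is an illustration under stated assumptions, not the NetworkX implementation: the function name `pagerank_power_iteration` and the toy edge list are hypothetical, edge weights and personalization are ignored, and the stopping rule (total change below `n * tol`) is assumed to mirror the one NetworkX documents.

```python
def pagerank_power_iteration(edges, alpha=0.85, max_iter=100, tol=1e-06):
    """Simplified, unweighted PageRank by power iteration (illustrative sketch)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {node: 1 / n for node in nodes}  # start from a uniform distribution
    for _ in range(max_iter):
        # rank held by dangling nodes (no outgoing edges) is spread uniformly
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        # with probability 1 - alpha the random surfer jumps to a random node
        new_rank = {node: (1 - alpha) / n + alpha * dangling / n for node in nodes}
        # with probability alpha the surfer follows one of the outgoing edges
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += alpha * rank[node] / len(out_links[node])
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < n * tol:  # stop once the scores have stabilized
            break
    return rank

# hypothetical toy graph: nodes 1, 2, and 3 link to node 0, which links back to node 1
toy_rank = pagerank_power_iteration([(1, 0), (2, 0), (3, 0), (0, 1)])
print(max(toy_rank, key=toy_rank.get))
```

Node 0 receives the most links and ends up with the highest score, while nodes 2 and 3, which receive no links, keep only the baseline share from random jumps.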

In [None]:
pagerank_cent = nx.pagerank(D_sms, alpha=0.85, personalization=None, max_iter=100, tol=1e-06, nstart=None, weight='weight', dangling=None)


In [None]:
# show the first five nodes of the pagerank centrality dictionary
list(pagerank_cent.items())[:5]

In [None]:
# get the five nodes with the highest pagerank centrality
dict(sorted(pagerank_cent.items(), key = itemgetter(1), reverse = True)[:5])

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

In [None]:
# weighted degree of node 4: the summed weights of its incoming and outgoing edges
D_sms.degree(weight='weight')[4]

In [None]:
# neighbors of node 4; in a directed graph these are its successors
list(D_sms.neighbors(4))

##### Correlation of centrality measures

In [None]:
# one row per node and one column per centrality measure, aligned on the node keys
nodelist_cent = pd.DataFrame({
    'degree': degree_cent,
    'in_degree': in_degree_cent,
    'out_degree': out_degree_cent,
    'pagerank': pagerank_cent
})

In [None]:
nodelist_cent.corr()

In [None]:
# turn correlations into dissimilarities: strongly correlated measures get a small distance
data = (1 - nodelist_cent.corr()).values

In [None]:
from sklearn.manifold import MDS

In [None]:
mds = MDS(n_components=2, random_state=42, dissimilarity='precomputed')
pos = mds.fit(data).embedding_

In [None]:
plt.scatter(pos[:, 0], pos[:, 1])
# label each point with the name of its centrality measure
for i, column in enumerate(nodelist_cent.columns):
    plt.text(pos[i, 0], pos[i, 1], column)

### Exercise 2: Add closeness and betweenness centralities, then update the correlation analysis

__Closeness centrality__ of a node _u_ is the reciprocal of the average shortest path distance to _u_ over all _n-1_ reachable nodes. 

`closeness_centrality(G, u=None, distance=None)` where `u` is the node for which the centrality should be calculated and `distance` is the name of the edge attribute that should be used as the distance in shortest paths.
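
As a small sanity check of the definition, the sketch below recomputes closeness centrality by hand on a hypothetical undirected path graph (not the workshop data). Note that for directed graphs such as `D_sms`, NetworkX uses the distances of incoming paths to the node.

```python
import networkx as nx

# hypothetical toy graph: an undirected path 0 - 1 - 2 - 3 - 4
toy = nx.path_graph(5)
toy_closeness = nx.closeness_centrality(toy)

# manual computation for the middle node 2
distances = nx.shortest_path_length(toy, source=2)  # includes distance 0 to node 2 itself
reachable = len(distances) - 1
manual_closeness = reachable / sum(distances.values())

print(toy_closeness[2], manual_closeness)
```

The middle node has the shortest average distance to all others and therefore the highest closeness, while the two end nodes have the lowest.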

__Betweenness centrality__ of a node _v_ is the sum of the fraction of all-pairs shortest paths that pass through _v_.

`betweenness_centrality(G, k=None, normalized=True, weight=None, endpoints=False, seed=None)` where `k` is the number of nodes sampled to estimate betweenness, `weight` is the name of the edge attribute that should be used as the distance metric, and `seed` is the random number generation state.

More details here: https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.betweenness_centrality.html
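
The effect of the `normalized` parameter can be seen on a hypothetical undirected path graph (not the workshop data). In this sketch, the raw score counts the node pairs whose shortest path passes through a node, and the normalized score divides that count by the number of pairs among the other nodes.

```python
import networkx as nx

# hypothetical toy graph: an undirected path 0 - 1 - 2 - 3 - 4
toy = nx.path_graph(5)

toy_raw = nx.betweenness_centrality(toy, normalized=False)
toy_norm = nx.betweenness_centrality(toy, normalized=True)

# node 2 lies on the shortest paths of the pairs (0,3), (0,4), (1,3), and (1,4)
print(toy_raw[2], toy_norm[2])
```

With five nodes there are six pairs among the four other nodes, so the middle node's raw score of 4 becomes 4/6 after normalization.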


In [None]:
closeness_cent = nx.closeness_centrality(D_sms, distance = None)

In [None]:
betweenness_cent = nx.betweenness_centrality(D_sms, normalized=True)

In [None]:
nodelist_cent['closeness'] = closeness_cent.values()
nodelist_cent['betweenness'] = betweenness_cent.values()

In [None]:
nodelist_cent.corr()

In [None]:
new_data = (1 - nodelist_cent.corr()).values

In [None]:
mds = MDS(n_components=2, random_state=42, dissimilarity='precomputed')
pos = mds.fit(new_data).embedding_

In [None]:
plt.scatter(pos[:, 0], pos[:, 1])
# label each point with the name of its centrality measure
for i, column in enumerate(nodelist_cent.columns):
    plt.text(pos[i, 0], pos[i, 1], column)

### Community detection

For community detection, we move to the networks from our Bluetooth exercises.

In [None]:
import pickle

Note: The GML files we have do not contain node attributes, so they cannot be used here. As a placeholder, we start with the pickle file below. Once the file versions are decided, the cells below can be changed.

In [None]:
# G_bt_unlayered = nx.read_gml('./data/D_sms_week1_lcc.gml', destringizer=int)
with open('./data/G_cns_bt_f2f_p_lcc_unlayered.pickle', 'rb') as f:
    G_bt_unlayered = pickle.load(f)

In [None]:
nodelist_bt = pd.DataFrame(index=G_bt_unlayered.nodes())
nodelist_bt.head()

In [None]:
from networkx.algorithms.community import louvain_communities

In [None]:
louvain = louvain_communities(G=G_bt_unlayered, weight='weight', seed=42)

In [None]:
def get_partition(communities):
    # map each node to the index of the community that contains it, sorted by node
    partition = {node: i for i, community in enumerate(communities) for node in community}
    return dict(sorted(partition.items()))
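
The conversion from a list of node sets to a node-to-community mapping can be illustrated on a hypothetical toy graph (not the Bluetooth data). This sketch assumes a NetworkX version that provides `louvain_communities`, as the version used in this workshop does. The variable names `toy`, `toy_communities`, and `toy_partition` are made up for illustration.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# hypothetical toy graph: two 5-node cliques joined by a single edge
toy = nx.barbell_graph(5, 0)

# the result is a list of sets, one set of nodes per community
toy_communities = louvain_communities(toy, seed=42)

# map each node to the index of the community that contains it, sorted by node
toy_partition = {node: i for i, community in enumerate(toy_communities) for node in community}
toy_partition = dict(sorted(toy_partition.items()))

print(toy_communities)
print(toy_partition)
```

Because every node belongs to exactly one community, the mapping has one entry per node and can be stored as a column of a node list.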

In [None]:
# align community labels with the node index instead of relying on node order
nodelist_bt['louvain_week1'] = pd.Series(get_partition(louvain))
nodelist_bt

In [None]:
nx.get_node_attributes(G_bt_unlayered, name = 'pos')

In [None]:
G_bt_unlayered.nodes()

In [None]:
nx.draw(
    G = G_bt_unlayered, 
    pos = nx.get_node_attributes(G=G_bt_unlayered, name='pos'), 
    node_size = 40, 
    node_color = nodelist_bt['louvain_week1'], 
    cmap = plt.cm.rainbow
)

### Exercise 3: Try the Louvain method on the graph of snapshot 4

Will be added when snapshot 4 is available with attributes or as a pickle file.

### Homophily

Assortativity helps to analyze the pattern of connections in networks. To analyze homophily (whether people tend to connect to similar people), we can use the attribute assortativity coefficient and the attribute mixing matrix.
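
To get a feeling for the range of the coefficient, the sketch below builds two hypothetical four-node graphs (not the workshop data). In one, all edges connect nodes of the same group. In the other, all edges connect nodes of different groups. The helper `two_group_graph` and the attribute name `group` are made up for this illustration.

```python
import networkx as nx

def two_group_graph(edges):
    # nodes 'a' and 'b' belong to group 0, nodes 'c' and 'd' to group 1
    G = nx.Graph()
    G.add_nodes_from(['a', 'b'], group=0)
    G.add_nodes_from(['c', 'd'], group=1)
    G.add_edges_from(edges)
    return G

# all edges within groups: perfect homophily
G_within = two_group_graph([('a', 'b'), ('c', 'd')])
# all edges between groups: perfect heterophily
G_between = two_group_graph([('a', 'c'), ('b', 'd')])

r_within = nx.attribute_assortativity_coefficient(G_within, 'group')
r_between = nx.attribute_assortativity_coefficient(G_between, 'group')

print(r_within, r_between)
```

The coefficient is 1 when ties only occur within groups and -1 when they only occur between groups. Values near 0 indicate mixing that is independent of the attribute.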

The sex categories are 0: male; 1: female; 2: unknown.

In [None]:
nx.attribute_assortativity_coefficient(G=G_bt_unlayered, attribute='sex')

In [None]:
nx.attribute_mixing_matrix(G=G_bt_unlayered, attribute='sex', normalized=False)

In [None]:
from collections import Counter
import numpy as np

In [None]:
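def attribute_density_matrix(G, attribute):
    # count the nodes per category and fix the category order
    counts = Counter(data[attribute] for v, data in G.nodes(data=True))
    categories = sorted(counts)
    mapping = {category: i for i, category in enumerate(categories)}
    n = np.array([counts[category] for category in categories], dtype=float)
    # number of possible ordered node pairs per category pair (no self-pairs on the diagonal)
    possible = np.outer(n, n) - np.diag(n)
    # observed edge counts, with rows and columns in the same category order
    observed = nx.attribute_mixing_matrix(G=G, attribute=attribute, mapping=mapping, normalized=False)
    return observed / possible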

In [None]:
p_bt = nx.density(G=G_bt_unlayered)

In [None]:
p_bt_sex = attribute_density_matrix(G=G_bt_unlayered, attribute='sex').round(4)
p_bt_sex

In [None]:
# ratio of group-specific density to overall density: values above 1 mean more ties than expected
(p_bt_sex / p_bt).round(2)