### Workshop series, Koç University, Turkey, 11-12 April 2023

## [Introduction to Computational Social Science methods with Python](https://socialcomquant.ku.edu.tr/intro-to-css-methods-with-python/)

# Workshop 1: Introduction to network analysis with Python - Part I

**Description**: Computational Social Science is often concerned with the traces of human behavior like those left by uses of social media, messaging services, or cell phones. Such digital behavioral data is genuinely relational and can, therefore, be studied using the formal techniques of network analysis. The basic units of networks called nodes can be actors (e.g., users), communicative symbols (e.g., hashtags), or even transactions (e.g., tweets). By focusing on the edges (relations) among nodes, network analysis is capable of creating insights that are not possible by merely doing statistics on the nodes and their attributes. In the workshop, we will give an introduction to how network data should be organized, how networks can be created in Python, and how they can be analyzed on three levels. On the micro level, we will introduce centrality analysis which results in numerical descriptions of nodes. On the meso level, we will introduce community detection, which results in sets of nodes that form groups or clusters. On the macro level, we will introduce measures that describe inequality in, and the cohesion of, the network in its entirety. We will be using network data from the Copenhagen Networks Study, which describes four different types of social relations among students over time. The workshop will alternate between live-coding demonstrations and periods in which participants apply that knowledge in context, both using Jupyter Notebooks. The software we will be using is NetworkX, a standard Python library that is simple to understand, provides a breadth of options and has a large user community.

**Target group**: Undergraduate and master's students, doctoral candidates, and experienced researchers who want to be introduced to the practice of Computational Social Science.

**Requirements**: Participants are expected to know the basics of Python and have at least some experience using it. For the workshops, participants should bring a running system on which they can execute Jupyter Notebooks. We will be using Python 3.9 and several standard libraries that are part of the Anaconda 2022.10 distribution or can be installed on top of it. A list of the required libraries and their versions will be circulated before the workshops. We recommend that participants install Anaconda 2022.10. Feel free to also work in a cloud environment like Google Colab. Consult [this link](https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/1_computing_environment.ipynb) for more detailed instructions on how to set up your computing environment.

**Lecturers**: Dr. Haiko Lietz is a postdoctoral researcher in the Computational Social Science department at GESIS - Leibniz Institute for the Social Sciences. His research interests are in computational sociology, network science, and complexity science. Dr. N. Gizem Bacaksizlar Turbic is a postdoctoral researcher in the Computational Social Science departments at RWTH Aachen University and GESIS - Leibniz Institute for the Social Sciences. Her research areas include complex adaptive systems and social and political networks.

## Documentation of NetworkX 2.8.4

https://networkx.org/documentation/networkx-2.8.4/reference/index.html

## Network construction

### Constructing from scratch

In [None]:
import networkx as nx
nx.__version__

In [None]:
G = nx.Graph()

In [None]:
D = nx.DiGraph()

In [None]:
G.add_node(5)

In [None]:
G.add_nodes_from(['pretty', 13])

In [None]:
G.nodes()

In [None]:
G.add_edge('pretty', 13)

In [None]:
G.add_edges_from([('pretty', 13), ('ruby', 0)])

In [None]:
G.edges()

In [None]:
G.nodes()
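
The cells above can be condensed into one small self-contained sketch. It illustrates two behaviours that are easy to miss: adding an edge that already exists leaves a simple `Graph` unchanged, and `add_edges_from` creates nodes that do not exist yet. The name `G_demo` is only used for this illustration.

```python
import networkx as nx

# Rebuild the toy graph from the cells above under a separate name
G_demo = nx.Graph()
G_demo.add_node(5)
G_demo.add_nodes_from(['pretty', 13])
G_demo.add_edge('pretty', 13)

# ('pretty', 13) already exists and is not duplicated in a simple Graph;
# 'ruby' and 0 are not nodes yet and are created automatically
G_demo.add_edges_from([('pretty', 13), ('ruby', 0)])

print(list(G_demo.nodes()))
print(G_demo.number_of_edges())
```

Nodes are reported in insertion order, so the two automatically created nodes appear last.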

### Constructing from Pandas dataframes

In [None]:
import pandas as pd
pd.__version__

#### Copenhagen Networks Study interaction data

In [None]:
edgelist_sms = pd.read_csv('data/sms.csv')
edgelist_sms.sort_values('sender').head()

In [None]:
with open('data/sms.README', 'r') as f:
    print(f.read())

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.figure(figsize=[3, 2])
edgelist_sms['timestamp'].hist(bins=28)
# Mark week boundaries (one week = 7*24*60*60 seconds)
for i in range(5):
    plt.axvline(x=7*24*60*60*i, color='black')
plt.xlabel('timestamp')
plt.ylabel('Count')
plt.show()
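
The vertical lines in the histogram are drawn at multiples of `7*24*60*60 = 604800` seconds, that is, at week boundaries. As a minimal sketch of the same idea without plotting, timestamps can be mapped to a week index by integer division. The timestamps below are made up for illustration and are not taken from the SMS data.

```python
import pandas as pd

SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800

# Hypothetical timestamps in seconds since the start of the observation period
timestamps = pd.Series([0, 3600, 604799, 604800, 1300000])

# Integer division assigns each timestamp to the week it falls into (week 0 is the first week)
week = timestamps // SECONDS_PER_WEEK

# Number of messages per week, ordered by week index
messages_per_week = week.value_counts().sort_index()
print(messages_per_week)
```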

#### Text messaging network

In [None]:
D_sms = nx.from_pandas_edgelist(
    df = edgelist_sms, 
    source = 'sender', 
    target = 'recipient', 
    create_using = nx.DiGraph
)
D_sms

In [None]:
D_sms.edges()

In [None]:
nx.draw(G=D_sms, node_size=10)
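
Note that a `DiGraph` stores at most one edge per ordered node pair, so repeated messages between the same sender and recipient collapse into a single edge. The following sketch uses a made-up edge list with the same column names as `edgelist_sms` to show the difference to a `MultiDiGraph`, which keeps every row as a separate edge.

```python
import networkx as nx
import pandas as pd

# Hypothetical edge list: user 1 texts user 2 twice, user 2 replies once
edgelist_demo = pd.DataFrame({
    'sender': [1, 1, 2],
    'recipient': [2, 2, 1],
    'timestamp': [10, 20, 30]
})

D_demo = nx.from_pandas_edgelist(
    df = edgelist_demo,
    source = 'sender',
    target = 'recipient',
    create_using = nx.DiGraph
)
M_demo = nx.from_pandas_edgelist(
    df = edgelist_demo,
    source = 'sender',
    target = 'recipient',
    create_using = nx.MultiDiGraph
)

print(D_demo.number_of_edges())  # the repeated 1 -> 2 edge is stored once
print(M_demo.number_of_edges())  # every row is kept as its own edge
```

Aggregating the edge list before constructing the graph keeps the number of messages as an edge weight instead of discarding it.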

#### Aggregated text messaging network

In [None]:
def aggregate_edges(df, time, source, target, weight, time_zero, window_size, inclusive, fun):
    '''
    Aggregates edges from a time-stamped edge list according to an aggregation function.
    
    Parameters:
        df : Pandas DataFrame
            Time-stamped edgelist.
        time : String
            Name of the column in df which contains the timestamp.
        source : String
            Name of the column in df which contains the source node.
        target : String
            Name of the column in df which contains the target node.
        weight : String or None
            Name of the column in df which contains the edge weights. If None, a column with unit edge weights will be created.
        time_zero : String or numerical
            Time where aggregation begins.
        window_size : String or numerical
            Size of the time window used for aggregation.
        inclusive : {'both', 'neither', 'left', 'right'}
            Include boundaries. Whether to set each bound as closed or open.
        fun : {'max', 'sum', 'mean'}
            Aggregation method. Either the maximum edge weight is used, weights are summed, or weights are averaged.
    
    Returns:
        Aggregated edge list consisting of a node pair and a weight column.
    '''
    if fun not in ('max', 'sum', 'mean'):
        raise NotImplementedError("Valid functions are 'max', 'sum', and 'mean'.")
    if weight is None:
        # Work on a copy so that the caller's dataframe is not modified
        df = df.copy()
        weight = 'weight'
        df[weight] = 1
    # Keep only the edges that fall into the time window
    in_window = df[time].between(left=time_zero, right=time_zero+window_size, inclusive=inclusive)
    # Aggregate the weights of each node pair
    df_agg = df[in_window].groupby([source, target])[weight].agg(fun).reset_index()
    return df_agg

In [None]:
edgelist_sms_week1 = aggregate_edges(
    df = edgelist_sms, 
    time = 'timestamp', 
    source = 'sender', 
    target = 'recipient', 
    weight = None, 
    time_zero = 0, 
    window_size = 604800, 
    inclusive = 'left', 
    fun = 'sum'
)
edgelist_sms_week1.head()
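
To see what the aggregation does, here is a minimal stand-alone sketch of the same idea on made-up data: rows are filtered to the time window with `Series.between` and then grouped by node pair. With `inclusive='left'` the window is closed on the left and open on the right, so a message sent exactly at `time_zero + window_size` belongs to the next window. The dataframe and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical time-stamped edge list
edgelist_demo = pd.DataFrame({
    'timestamp': [0, 100, 604799, 604800],
    'sender': [1, 1, 2, 1],
    'recipient': [2, 2, 1, 2]
})
edgelist_demo['weight'] = 1  # unit weight per message

time_zero = 0
window_size = 604800  # one week in seconds

# Keep rows with time_zero <= timestamp < time_zero + window_size
in_window = edgelist_demo['timestamp'].between(
    left=time_zero, right=time_zero + window_size, inclusive='left'
)

# Sum the unit weights per (sender, recipient) pair
edgelist_demo_week1 = (
    edgelist_demo[in_window]
    .groupby(['sender', 'recipient'])['weight']
    .sum()
    .reset_index()
)
print(edgelist_demo_week1)
```

The message at timestamp 604800 is excluded, which leaves two messages from user 1 to user 2 and one in the opposite direction.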

In [None]:
D_sms_week1 = nx.from_pandas_edgelist(
    df = edgelist_sms_week1, 
    source = 'sender', 
    target = 'recipient', 
    edge_attr = 'weight', 
    create_using = nx.DiGraph
)
D_sms_week1

In [None]:
D_sms_week1.edges(data=True)

In [None]:
nx.draw(G=D_sms_week1, node_size=10)

#### Components

In [None]:
nx.is_weakly_connected(G=D_sms_week1)

In [None]:
nx.number_weakly_connected_components(G=D_sms_week1)

In [None]:
cc_sms_week1 = nx.weakly_connected_components(D_sms_week1)
cc_sms_week1

In [None]:
next(cc_sms_week1)

In [None]:
cc_sms_week1 = sorted(nx.weakly_connected_components(D_sms_week1), key=len, reverse=True)
cc_sms_week1

In [None]:
D_sms_week1_lcc = D_sms_week1.subgraph(nodes=cc_sms_week1[0])
D_sms_week1_lcc
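
The following sketch repeats these steps on a made-up directed graph. "Weakly" connected means that edge directions are ignored: nodes 1 and 3 end up in the same component although there is no directed path between them. `subgraph` returns a read-only view of the original graph, so the sketch calls `.copy()` to obtain an independent graph; whether that is needed depends on what you do with the result.

```python
import networkx as nx

# Hypothetical directed graph with two weakly connected components
D_demo = nx.DiGraph()
D_demo.add_edges_from([(1, 2), (3, 2), (3, 4), (5, 6)])

print(nx.is_weakly_connected(D_demo))

# Components are returned as sets of nodes; sort them by size, largest first
cc_demo = sorted(nx.weakly_connected_components(D_demo), key=len, reverse=True)
print(cc_demo)

# Largest component as an independent, editable graph
D_demo_lcc = D_demo.subgraph(cc_demo[0]).copy()
print(D_demo_lcc.number_of_nodes(), D_demo_lcc.number_of_edges())
```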

#### Layout

In [None]:
pos_sms_week1_lcc = nx.spring_layout(G=D_sms_week1_lcc, weight=None)

In [None]:
pos_sms_week1_lcc = nx.kamada_kawai_layout(G=D_sms_week1_lcc, weight=None)
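
Layout functions return a dictionary that maps each node to an array of coordinates. `spring_layout` starts from random positions, so repeated calls generally give different drawings; passing a `seed` makes the result reproducible. The sketch below uses a small built-in example graph rather than the SMS network, and the seed value is arbitrary.

```python
import networkx as nx
import numpy as np

G_demo = nx.path_graph(5)  # small example graph with nodes 0..4

# Two calls with the same seed yield the same layout
pos_a = nx.spring_layout(G=G_demo, weight=None, seed=42)
pos_b = nx.spring_layout(G=G_demo, weight=None, seed=42)

print(pos_a[0])  # array with the x and y coordinates of node 0
print(all(np.allclose(pos_a[n], pos_b[n]) for n in G_demo))
```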

In [None]:
import numpy as np

In [None]:
width_sms_week1_lcc = [np.log(w) + 1 for w in nx.get_edge_attributes(G=D_sms_week1_lcc, name='weight').values()]
width_sms_week1_lcc

In [None]:
import os
directory = 'results'
# Create the output directory if it does not exist yet
os.makedirs(directory, exist_ok=True)

In [None]:
nx.draw(
    G = D_sms_week1_lcc, 
    pos = pos_sms_week1_lcc, 
    node_size = 40, 
    width = width_sms_week1_lcc
)
#plt.savefig('results/D_sms_week1_lcc.pdf')
#plt.savefig('results/D_sms_week1_lcc.png')

## Exercise 1: Construct a face-to-face interaction network

The physical proximity relations of the CNS dataset resemble a link stream at high temporal resolution. Physical proximity of two students is measured using the Bluetooth devices of the cell phones handed out to the students. These devices scan their environment every five minutes and record the presence of other phones. All instances of students A and B discovering each other were identified, and the larger signal strength (`rssi`) is reported ([Sapiezynski et al. 2019](https://doi.org/10.1038/s41597-019-0325-x)).

In [None]:
edgelist_bt = pd.read_csv('data/bt_symmetric.csv.gz')
edgelist_bt.head()

In [None]:
with open('data/bt_symmetric.README', 'r') as f:
    print(f.read())

Remove relations involving devices not participating in the experiment:

In [None]:
edgelist_bt = edgelist_bt[edgelist_bt['user_b'] >= 0].reset_index(drop=True)

RSSI is a value between -100 and 0. As a rule of thumb, an RSSI of -75 means that two devices are about 1 meter apart (Mones *et al.* 2017); values closer to 0 indicate that the devices are closer together. Filter the edgelist to obtain very-close-range proximity:

In [None]:
edgelist_bt = edgelist_bt[edgelist_bt['rssi'] >= -75]

Now aggregate the edges in `edgelist_bt`, starting at `time_zero = 118800` (which is 9am on day 2). Construct a graph from this edgelist and draw it. Experiment with different values for `window_size`. Can you reproduce the finding from the following figure that groups of students are visible for a certain aggregation window?

|<img src='https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41597-019-0325-x/MediaObjects/41597_2019_325_Fig5_HTML.png' style='float: none; width: 480px'>|
|:--|
|<em style='float: center'>**Figure 1**: Temporal aggregation of the Bluetooth network ([Sapiezynski et al. 2019](https://doi.org/10.1038/s41597-019-0325-x)).</em>|
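
The value `time_zero = 118800` can be checked with a little arithmetic, assuming that timestamps are given in seconds and that time 0 is midnight at the start of day 1 (consult the README to confirm this). `seconds_to_day_and_hour` is a hypothetical helper written for this sketch only.

```python
def seconds_to_day_and_hour(timestamp):
    '''Converts seconds since time zero into a (day, hour) pair, counting days from 1.'''
    days, remainder = divmod(timestamp, 24 * 60 * 60)
    hour = remainder / (60 * 60)
    return days + 1, hour

print(seconds_to_day_and_hour(118800))

# Length of a candidate aggregation window in minutes
window_size = 6900
print(window_size / 60)
```

The same conversion helps when choosing values for `window_size`: since the phones scan every five minutes, windows shorter than 300 seconds can contain at most one scan per pair.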

In [None]:
edgelist_bt_snapshot1 = aggregate_edges(
    df = edgelist_bt, 
    time = '# timestamp', 
    source = 'user_a', 
    target = 'user_b', 
    weight = None, 
    time_zero = 118800, 
    window_size = 6900, 
    inclusive = 'left', 
    fun = 'sum'
)
edgelist_bt_snapshot1.head()

In [None]:
G_bt_snapshot1 = nx.from_pandas_edgelist(
    df = edgelist_bt_snapshot1, 
    source = 'user_a', 
    target = 'user_b', 
    edge_attr = 'weight', 
    create_using = nx.Graph
)
G_bt_snapshot1

In [None]:
nx.draw(G=G_bt_snapshot1, node_size=10)
plt.savefig('results/G_bt_snapshot1.pdf')
plt.savefig('results/G_bt_snapshot1.png')

## Transforming, exporting, and importing networks

#### Removing direction of edges

In [None]:
D_sms.to_undirected()

In [None]:
def weighted_digraph_to_graph(G, layered, fun, reciprocal, weight='weight'):
    '''
    Transforms a weighted directed graph into a weighted undirected network.
    
    Parameters:
        G : DiGraph or MultiDiGraph
            Directed network to be transformed.
        layered : Boolean
            Whether or not G is a MultiDiGraph.
        fun : String
            Function used to combine the edge weights of the two edge directions. Valid functions are 'mean', 'sum', and 'max'.
        reciprocal : Boolean
            If True only keep edges that appear in both directions in the original DiGraph.
        weight : String, default 'weight'
            Name of edge attribute.
    
    Returns:
        A weighted Graph.
    '''
    G = G.copy()
    
    for u, v, data in G.edges(data=True):
        data['diweight'] = data.pop(weight)
    
    if layered:
        for node in G:
            for neighbor in nx.neighbors(G, node):
                for key in G[node][neighbor].keys():
                    # Combine weights only if the reverse edge exists in the same layer
                    if G.has_edge(neighbor, node, key):
                        if fun == 'mean':
                            G.edges[node, neighbor, key][weight] = (G.edges[node, neighbor, key]['diweight'] + G.edges[neighbor, node, key]['diweight']) / 2
                        elif fun == 'sum':
                            G.edges[node, neighbor, key][weight] = (G.edges[node, neighbor, key]['diweight'] + G.edges[neighbor, node, key]['diweight'])
                        elif fun == 'max':
                            G.edges[node, neighbor, key][weight] = max(G.edges[node, neighbor, key]['diweight'], G.edges[neighbor, node, key]['diweight'])
                        else:
                            raise NotImplementedError("Valid functions are 'mean', 'sum', and 'max'.")
                    else:
                        # Otherwise keep the weight of the single direction
                        G.edges[node, neighbor, key][weight] = G.edges[node, neighbor, key]['diweight']
    else:
        for node in G:
            for neighbor in nx.neighbors(G, node):
                if node in nx.neighbors(G, neighbor):
                    if fun == 'mean':
                        G.edges[node, neighbor][weight] = (G.edges[node, neighbor]['diweight'] + G.edges[neighbor, node]['diweight']) / 2
                    elif fun == 'sum':
                        G.edges[node, neighbor][weight] = (G.edges[node, neighbor]['diweight'] + G.edges[neighbor, node]['diweight'])
                    elif fun == 'max':
                        G.edges[node, neighbor][weight] = max(G.edges[node, neighbor]['diweight'], G.edges[neighbor, node]['diweight'])
                    else:
                        raise NotImplementedError("Valid functions are 'mean', 'sum', and 'max'.")
                else:
                    G.edges[node, neighbor][weight] = G.edges[node, neighbor]['diweight']
    
    for u, v, data in G.edges(data=True):
        del data['diweight']
    
    G = G.to_undirected(reciprocal=reciprocal)
    
    return G

In [None]:
G_sms_week1_lcc = weighted_digraph_to_graph(G=D_sms_week1_lcc, layered=False, fun='mean', reciprocal=False, weight='weight')
G_sms_week1_lcc
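
The `reciprocal` argument is passed on to NetworkX's `to_undirected` method. As a small sketch on a made-up graph, the following shows its effect: with `reciprocal=True` only node pairs that are connected in both directions keep their edge, while all nodes are retained.

```python
import networkx as nx

# Hypothetical directed graph: a <-> b is reciprocated, a -> c is not
D_demo = nx.DiGraph()
D_demo.add_edges_from([('a', 'b'), ('b', 'a'), ('a', 'c')])

G_all = D_demo.to_undirected(reciprocal=False)
G_reciprocal = D_demo.to_undirected(reciprocal=True)

print(G_all.number_of_edges())         # both node pairs are connected
print(G_reciprocal.number_of_edges())  # only the reciprocated pair is kept
print(sorted(G_reciprocal.nodes()))    # 'c' remains as an isolated node
```

Because unreciprocated nodes stay in the graph as isolates, it can be useful to extract the largest connected component again after such a transformation.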

#### Exporting

In [None]:
#nx.write_gexf(G=D_sms_week1_lcc, path='results/D_sms_week1_lcc.gexf')
#nx.write_gexf(G=G_sms_week1_lcc, path='results/G_sms_week1_lcc.gexf')

In [None]:
nx.write_gml(G=D_sms_week1_lcc, path='results/D_sms_week1_lcc.gml')
nx.write_gml(G=G_sms_week1_lcc, path='results/G_sms_week1_lcc.gml')

#### Importing

In [None]:
#nx.read_gexf(path='results/D_sms_week1_lcc.gexf')

In [None]:
nx.read_gml(path='results/D_sms_week1_lcc.gml')
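
One thing to watch out for when reading GML files: node identifiers are stored as string labels, so integer nodes come back as strings unless they are converted. The following round-trip sketch on a made-up graph writes to a temporary directory (not to `results/`) and uses the `destringizer` argument of `read_gml` to restore integer identifiers.

```python
import os
import tempfile
import networkx as nx

# Hypothetical weighted directed graph with integer node identifiers
D_demo = nx.DiGraph()
D_demo.add_edge(1, 2, weight=3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'D_demo.gml')
    nx.write_gml(D_demo, path)
    D_str = nx.read_gml(path)                    # node labels come back as strings
    D_int = nx.read_gml(path, destringizer=int)  # convert labels back to integers

print(list(D_str.nodes()))
print(list(D_int.nodes()))
print(D_int.edges[1, 2]['weight'])
```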