Social Networks SS21

# Home Assignment 1



### General Instructions

Submit your solution via Moodle by 23:59 on Wednesday, May 12th.
Late submissions are accepted for 12 hours following the deadline, with 1/4 of the total possible points deducted from the score.

Submit your solutions in teams of 2-4 members.
Please list all members of the team with their student ID and full name in the notebook.
Please submit only one notebook per team.
Only submit a notebook. Do not submit the datasets you used or image files that you have created, since these have to be created from your notebook.
Also, do NOT compress/zip your submission!

Cite ALL your sources for coding this home assignment.
In case of plagiarism (copying solutions from other teams or from the internet), ALL team members will be expelled from the course without warning.


### Evaluation and Grading

Evaluation of your submission is done semi-automatically.
Think of it as this notebook being executed once.
Afterwards, some test functions are appended to this file and executed.

Therefore:
* Submit valid _Python3_ code only!
* Make sure to restrict yourself to using packages that are automatically installed along with anaconda, plus some additional packages that have been introduced in context of this class. An overview of packages that may be used in this assignment can be found in the file 'environment.yaml'.
* Ensure your definitions (functions, classes, methods, variables) follow the specification if given. The concrete signature and header of a function are usually specified in the task description and via code skeletons.
* Again, make sure that all your function and variable names match what we have specified! The automated grading will only match these exact names, and everything that cannot be matched will not be graded.
* Whenever there is a written task, e.g. task 1b), enter your answer in the specified markdown cell. Do NOT remove or edit the label (e.g. '__A1b):__' ) from the markdown cell, as this will have to be parsed by the grading system and matched to your answer. 
* Ensure the notebook does not rely on current notebook or system state!
  * Use `Kernel --> Restart & Run All` to see if you are using any definitions, variables etc. that 
    are not in scope anymore.
  * Do not rename any of the datasets you use, and load them from the same directory that your ipynb-notebook is located in, i.e., your working directory. In particular, when loading a file via a pandas, numpy or networkx command, make sure that it has the form `nx.read_edgelist("example.edgelist")` instead of `nx.read_edgelist("C:/User/Path/to/your/Homework/example.edgelist")` so that the code works directly on our machines.
* Make sure that your code is executable. Any task for which the code does not directly run on our machine will be graded with 0 points. Run your notebook from top to bottom and make sure there is no error!
  Minimize usage of global variables. Do not reuse variable names multiple times!
* Ensure your code/notebook terminates in reasonable time.
* Textual answers must always be backed by code and may not refer to results that are not part of
  your submission.


**There's a story behind each of these points! Don't expect us to fix your stuff!**

##### List team members, including all student IDs, in the cell below:

In [1]:
# credentials of all team members (you may add or remove items from the list)
team_members = [
    {
        'first_name': 'Alice',
        'last_name': 'Foo',
        'student_id': 12345
    },
    {
        'first_name': 'Bob',
        'last_name': 'Bar',
        'student_id': 54321
    },
    {
        'first_name': 'J',
        'last_name': 'Doe',
        'student_id': 90000
    }
]

In [3]:
# general imports may go here!
import networkx as nx
import numpy as np
from typing import List, Optional, Tuple, Dict

### The Train Bombing Network

For most of this home assignment, you will be working on the train bombing network, which is provided in an edgelist format in the file _train.edgelist_.
This undirected network contains contacts between suspected terrorists involved in the train bombing of Madrid on March 11, 2004, as reconstructed from newspapers. A node represents a terrorist, and an edge between two terrorists shows that there was a contact between them. The edge weights denote how 'strong' a connection was. This includes friendship and co-participating in training camps or previous attacks. In the following, we will denote this network as $G$.

__References:__  
1) Dataset in the KONECT graph repository: http://konect.cc/networks/moreno_train/

2) Brian Hayes. Connecting the dots. Can the tools of graph theory and social-network studies unravel the next big plot? American Scientist, 94(5):400--404, 2006.

### Task 1:  Basic Network Properties (8 pts)

__a)__ Read in the data file and store the network into variable `G`. Store the number of nodes and edges of `G` into variables `n_nodes` and `n_edges`! **(2 pts)**
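
As a minimal sketch of the loading step, the snippet below writes a tiny made-up edgelist to a temporary directory and reads it back. It assumes a whitespace-separated `u v weight` layout and uses `nx.read_weighted_edgelist`; the actual layout of _train.edgelist_ may differ, and all `toy_*` names are hypothetical.

```python
import os
import tempfile

import networkx as nx

# Hypothetical toy data; the real train.edgelist may use another layout.
toy_lines = ["1 2 1.0", "2 3 2.0", "3 1 1.0", "3 4 4.0"]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.edgelist")
    with open(toy_path, "w") as handle:
        handle.write("\n".join(toy_lines) + "\n")
    # Node IDs are read as strings unless nodetype is given.
    toy_graph = nx.read_weighted_edgelist(toy_path)

toy_n_nodes = toy_graph.number_of_nodes()
toy_n_edges = toy_graph.number_of_edges()
print(toy_n_nodes, toy_n_edges)
```

In the notebook itself, the file name alone (a relative path) would be passed instead of a temporary path.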


In [1]:
G = ...
n_nodes = ...
n_edges = ...

__b)__ Compute the average degree and the density of `G`. Store them into variables `avg_degree` and `density`. Is it sparse? Explain your answer! **(2 pts)**
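
For orientation, here is a small sketch of both quantities on a toy path graph (not on `G`). It assumes the usual definitions for an undirected simple graph with $n$ nodes and $m$ edges: average degree $2m/n$ and density $2m/(n(n-1))$. The `path_*` names are hypothetical.

```python
import networkx as nx

path_graph = nx.path_graph(4)  # toy graph: 0 - 1 - 2 - 3
path_n = path_graph.number_of_nodes()
path_m = path_graph.number_of_edges()

# Each edge contributes to the degree of exactly two nodes.
path_avg_degree = 2 * path_m / path_n

# Fraction of the n * (n - 1) / 2 possible edges that are present.
path_density = 2 * path_m / (path_n * (path_n - 1))

# networkx offers the same density computation as a built-in.
assert abs(path_density - nx.density(path_graph)) < 1e-12
print(path_avg_degree, path_density)
```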


In [2]:
avg_degree = ...
density = ...

**A1b):** _Please provide your answer regarding the sparsity here!_

__c)__ Determine the network's diameter and average path length. Store them into variables `diameter` and `avg_pl`. Does it display the small-world effect? Explain your answer! **(2 pts)**
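
The sketch below illustrates both measures on a toy path graph using unweighted hop distances. Falling back to the largest connected component is an assumption of this sketch, added because `nx.diameter` and `nx.average_shortest_path_length` raise an error on disconnected graphs; the `chain_*` names are hypothetical.

```python
import networkx as nx

chain = nx.path_graph(4)  # toy graph: 0 - 1 - 2 - 3

# Both measures are only defined on a connected graph.
if nx.is_connected(chain):
    chain_core = chain
else:
    largest = max(nx.connected_components(chain), key=len)
    chain_core = chain.subgraph(largest).copy()

chain_diameter = nx.diameter(chain_core)
chain_avg_pl = nx.average_shortest_path_length(chain_core)
print(chain_diameter, chain_avg_pl)
```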

In [3]:
diameter = ...
avg_pl = ...

**A1c):** _Please provide your answer regarding the small-world effect here!_

__d)__ Compute the average local clustering coefficient and store it into `avg_lcc`. Do you think it is highly clustered? Explain your answer! **(2 pts)**
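
As a small illustration on a toy graph (a triangle with one pendant node), the sketch below computes the local clustering coefficient per node and its average. It assumes the unweighted definition, which is what `nx.clustering` uses when no `weight` argument is passed; the `tri_*` names are hypothetical.

```python
import networkx as nx

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
tri_pendant = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

# Per node: fraction of neighbor pairs that are themselves connected.
tri_local = nx.clustering(tri_pendant)

# Nodes with fewer than two neighbors count as 0 in the average by default.
tri_avg_lcc = nx.average_clustering(tri_pendant)
print(tri_local, tri_avg_lcc)
```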

In [1]:
avg_lcc = ...

**A1d):** _Please provide your answer regarding the high clustering here!_

### Task 2: Node Centralities (15 pts)

In this task, we consider the following four node centrality measures:

1. Degree Centrality (DC)
2. Closeness Centrality (CC)
3. Betweenness Centrality (BC)
4. Eigenvector Centrality (EC)

__a)__ For each of the four measures, compute and store the corresponding centrality values of all nodes in the network in dictionaries `DC`, `CC`, `BC` and `EC`! The keys of the dictionaries should represent the node IDs and the corresponding values should represent the centrality of that node. Additionally, store for each of the four measures the node IDs (not the centrality values) with the 10 highest centrality values in lists `DC_top`, `CC_top`, `BC_top` and `EC_top` in descending order (first node in list should have the highest centrality value and so on)! **(3 pts)**

**Example:** _In the example below node 4 has a betweenness centrality of 0.01, which is the third highest value in the network of 4 nodes, as denoted by its third position in list `BC_top`:_

`BC = {'1': 0.05, '2': 0.221, '3': 0.0, '4': 0.01}`

`BC_top = ['2', '1', '4', '3']`
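
The ranking step can be sketched as follows. `top_k_nodes` is a hypothetical helper, not part of the required interface, and it breaks ties by dictionary order, which is an assumption because the task does not say how ties should be handled.

```python
from typing import Dict, Hashable, List

import networkx as nx


def top_k_nodes(centrality: Dict[Hashable, float], k: int) -> List[Hashable]:
    """Return the k node IDs with the highest values, best first."""
    return sorted(centrality, key=centrality.get, reverse=True)[:k]


# The example dictionary from the task text above.
example_bc = {'1': 0.05, '2': 0.221, '3': 0.0, '4': 0.01}
example_bc_top = top_k_nodes(example_bc, 4)

# A toy star graph: node 0 is the hub and has the highest degree centrality.
star = nx.star_graph(3)
star_dc = nx.degree_centrality(star)
star_dc_top = top_k_nodes(star_dc, 2)
print(example_bc_top, star_dc_top)
```

The other three measures are available in networkx as `nx.closeness_centrality`, `nx.betweenness_centrality` and `nx.eigenvector_centrality`, each returning a dictionary keyed by node ID.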

In [5]:
DC = {} # Degree Centrality of all nodes
CC = {} # Closeness Centrality of all nodes
BC = {} # Betweenness Centrality of all nodes
EC = {} # Eigenvector Centrality of all nodes

DC_top = [] # Top10 Degree Centrality nodes
CC_top = [] # Top10 Closeness Centrality nodes
BC_top = [] # Top10 Betweenness Centrality nodes
EC_top = [] # Top10 Eigenvector Centrality nodes

__b)__ For each of the four measures, compute the average and maximum distance of the most central node to all other nodes in the network! Store the average distances in `DC_avg`, `CC_avg`, `BC_avg`, `EC_avg` and maximum distances in `DC_max`, `CC_max`, `BC_max`, `EC_max`. **(4 pts)**
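
A minimal sketch of the distance computation on a toy path graph is shown below. It assumes unweighted hop distances and excludes the source node itself (distance 0) from the average; both choices are assumptions of this sketch, and the `line_*` names are hypothetical.

```python
import networkx as nx

line = nx.path_graph(5)  # toy graph: 0 - 1 - 2 - 3 - 4
line_source = 2          # stands in for the most central node

# Hop distance from the source to every reachable node (source included).
line_dist = nx.single_source_shortest_path_length(line, line_source)

# Drop the source itself before aggregating.
line_others = [d for node, d in line_dist.items() if node != line_source]
line_avg_dist = sum(line_others) / len(line_others)
line_max_dist = max(line_others)
print(line_avg_dist, line_max_dist)
```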

In [6]:
DC_avg = ...
DC_max = ...

CC_avg = ...
CC_max = ...

BC_avg = ...
BC_max = ...

EC_avg = ...
EC_max = ...

__c)__ For each of the four centrality measures, scale all node centralities in the graph such that their maximum is 1, i.e., divide them by the maximum value occurring in the network, and store the updated node centralities into `DC_scaled`, `CC_scaled`, `BC_scaled` and `EC_scaled` in the same format as in 2a! Plot the graph in a spring layout with node colors according to their centrality. Use the "coolwarm" colormap from matplotlib for this coloring. Make sure all networks have the same orientation! The code below should save your plots into files **"DC.png"**, **"CC.png"**, **"BC.png"** and **"EC.png"**. Do not remove the lines of code which create and save the .png files, and do not modify your solution from 2a), i.e., do not modify `DC`, `CC`, `BC` and `EC`! **(5 pts)**
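
One way to keep the orientation identical across plots is to compute the layout once and reuse it, as in the sketch below on a toy path graph. The seed value, the `demo_*` names, the `Agg` backend and the in-memory buffer are assumptions made so that the snippet runs without a display and without writing files.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend, only needed outside a notebook
import matplotlib.pyplot as plt
import networkx as nx

demo = nx.path_graph(5)
demo_bc = nx.betweenness_centrality(demo)

# Divide by the maximum so that the largest value becomes 1.
# This assumes the maximum is non-zero.
demo_max = max(demo_bc.values())
demo_bc_scaled = {node: value / demo_max for node, value in demo_bc.items()}

# Compute the layout once; passing the same dict to every plot keeps
# node positions, and hence the orientation, identical.
demo_pos = nx.spring_layout(demo, seed=42)

fig, ax = plt.subplots()
nx.draw(
    demo,
    pos=demo_pos,
    ax=ax,
    node_color=[demo_bc_scaled[node] for node in demo],
    cmap=plt.cm.coolwarm,
    vmin=0.0,
    vmax=1.0,
)
demo_buffer = io.BytesIO()
fig.savefig(demo_buffer, format="png")
plt.close(fig)
```

Fixing `vmin` and `vmax` to the scaled range keeps the color scale comparable between the four plots.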

In [2]:
DC_scaled = {} # Scaled Degree Centrality of all nodes
CC_scaled = {} # Scaled Closeness Centrality of all nodes
BC_scaled = {} # Scaled Betweenness Centrality of all nodes
EC_scaled = {} # Scaled Eigenvector Centrality of all nodes

__d)__ After looking at these measures simultaneously, we look into how those measures differ from each other.
Use the node-wise centralities computed in a) to compute the correlation coefficient between any two centrality measures over all nodes. For example, store the correlation coefficient of the degree centrality (DC) and the closeness centrality (CC) in `DC_CC`. Which measures are the most correlated, and which measure is least correlated with the rest? Argue why that is the case! **(3 pts)**
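
The sketch below shows one possible way to correlate two centrality dictionaries. It assumes the Pearson correlation coefficient (the task does not name a specific coefficient) and uses made-up values; the important step is aligning both dictionaries on the same node order before building the vectors.

```python
import numpy as np

# Made-up values; the dictionaries deliberately differ in key order.
measure_a = {'x': 0.1, 'y': 0.4, 'z': 0.9}
measure_b = {'z': 1.8, 'x': 0.2, 'y': 0.8}

# Fix one node order and use it for both vectors.
shared_order = sorted(measure_a)
vec_a = np.array([measure_a[node] for node in shared_order])
vec_b = np.array([measure_b[node] for node in shared_order])

# np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal entry.
pearson_ab = np.corrcoef(vec_a, vec_b)[0, 1]
print(pearson_ab)
```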

In [7]:
DC_CC = ...
DC_BC = ...
DC_EC = ...
CC_BC = ...
CC_EC = ...
BC_EC = ...

**A2d):** _Please provide your answer regarding the correlation coefficients here!_

### Task 3: Weak Ties and Triadic Closure (17 pts)

After looking at the nodes of the network, we now consider the (weighted) edges of the network.

__a)__ Consider the distribution of edge weights. Which edge weights occur in the network, and how often does each edge weight occur in the network? Plot these occurrences using a histogram and save it into a file called **"hist_weights.png"**! Do not remove the lines of code which create and save the .png file! **(2 pts)**
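
Counting the weights can be sketched as follows on a toy weighted graph. It assumes the weights are stored under the edge attribute `"weight"`; the graph and the `weight_counts` name are hypothetical.

```python
from collections import Counter

import networkx as nx

weighted = nx.Graph()
weighted.add_weighted_edges_from(
    [("a", "b", 1.0), ("b", "c", 1.0), ("c", "d", 2.0), ("a", "d", 4.0)]
)

# edges(data="weight") yields (u, v, weight) triples.
weight_counts = Counter(w for _, _, w in weighted.edges(data="weight"))
print(sorted(weight_counts.items()))
```

The resulting counts could then be passed to a matplotlib bar or histogram plot.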

__b)__ Write a function that computes the neighborhood overlap score of a given edge, using the function signature which is specified in the cell below. Note that we want to return -1 if the edge does not exist in the network, and that for an edge between nodes $u$ and $v$, we do not count $u$ and $v$ into the union of neighbors in the denominator. **(4 pts)**

**Example:** _Let `H = nx.from_numpy_matrix(np.array([[0,1,1,1,0],[1,0,0,1,1],[1,0,0,1,0],[1,1,1,0,0],[0,1,0,0,0]]))` be an undirected NetworkX graph. Your implementation of `neighborhood_overlap` should return the same output as in the given examples below. Please note that correct output values do not necessarily mean that you have implemented the function correctly. Ideally, you should come up with your own data to test your function._

`neighborhood_overlap((0,1), H) == 0.3333333333333333`

`neighborhood_overlap((0,2), H) == 0.5`

`neighborhood_overlap((0,3), H) == 1.0`

`neighborhood_overlap((1,4), H) == 0.0`
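
Written as a formula, the score described above can be stated as follows, where $N(x)$ denotes the set of neighbors of node $x$ (the symbols $\mathrm{NO}$ and $N$ are notation introduced here, not taken from the task):

$$
\mathrm{NO}(u, v) = \frac{|N(u) \cap N(v)|}{|(N(u) \cup N(v)) \setminus \{u, v\}|}
$$

For the edge $(0,1)$ in `H`, we have $N(0) = \{1, 2, 3\}$ and $N(1) = \{0, 3, 4\}$. The only common neighbor is node 3, and the union without the endpoints is $\{2, 3, 4\}$, which gives $1/3$ as in the first example. The task text does not say what to return when the denominator is empty.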

In [120]:
def neighborhood_overlap(edge: Tuple[str, str], G: nx.Graph) -> float:
    """
    :param edge: pair of node IDs which indicate the edge we want to compute the neighborhood overlap on.
    :param G: networkx graph whose nodes we want to check. You may assume that it is undirected, but weighted
    :
    :return: the neighborhood overlap of the given edge as a float
    """
    # your code here
    raise NotImplementedError

__c)__ Apply your neighborhood overlap function on the network $G$, and save all edges which are local bridges as tuples into list `lb`! The first two values of the tuple should be the nodes of an edge, while the third value should represent the weight of this edge. Again, plot the graph and save it into file **"local_bridges.png"** using a spring layout with the same orientation as in task 2, with all nodes being blue, and color all edges which are local bridges in red! Do not remove the lines of code which create and save the .png file! **(3 pts)**

**Example:** _In the example below there are three local bridges in total. The first local bridge is an edge between nodes 6 and 33 with an edge weight of 1:_

`lb = [('6', '33', 1.0), ('8', '39', 1.0), ('8', '48', 1.0)]`
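
As a cross-check on a toy graph, the sketch below uses the common definition of a local bridge as an edge whose endpoints have no common neighbor, which corresponds to a neighborhood overlap of 0. It relies on `nx.common_neighbors` rather than on your own `neighborhood_overlap` function, and the `tri_*` names are hypothetical.

```python
import networkx as nx

# Triangle 0-1-2 plus one extra edge 2-3.
tri = nx.Graph()
tri.add_weighted_edges_from(
    [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 1.0), (2, 3, 1.0)]
)

# Keep every edge whose endpoints share no neighbor, together with its weight.
tri_local_bridges = [
    (u, v, w)
    for u, v, w in tri.edges(data="weight")
    if not list(nx.common_neighbors(tri, u, v))
]
print(tri_local_bridges)
```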

In [9]:
lb = []

__d)__ Finally, we want to check whether a weighted graph fulfills the strong triadic closure property. For this purpose, we define an edge $e$ to be strong if its weight is strictly higher than a given threshold $t$, and weak otherwise. Implement a function that checks whether all nodes in a graph fulfill the strong triadic closure property, using the function signature in the cell below! **(6 pts)**

**Example:** _Let `H = nx.from_numpy_matrix(np.array([[0,2,1,0,0],[2,0,0,2,1],[1,0,0,0,0],[0,2,0,0,0],[0,1,0,0,0]]))` be an undirected weighted NetworkX graph. Your implementation of `check_stc` should return the same output as in the given examples below. The first output shows that nodes 0 and 1 do not fulfill the STC property in H for t=0. If we added an edge between nodes 1 and 2, then node 0 would fulfill the property. The last example shows that all nodes fulfill the property for t=2. Please note that correct output values do not necessarily mean that you have implemented the function correctly. Ideally, you should come up with your own data to test your function._

`check_stc(H,threshold=0) == (False, {0: [(1, 2)], 1: [(0, 3), (0, 4), (3, 4)]})`

`check_stc(H,threshold=1) == (False, {1: [(0, 3)]})`

`check_stc(H,threshold=2) == (True, {})`
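
Stated formally, with $\omega(x, y)$ for the weight of edge $\{x, y\}$, $N(v)$ for the neighbors of $v$ and $E$ for the edge set (notation introduced here for illustration), the graph fulfills the property for threshold $t$ if

$$
\forall v \in V,\ \forall u, w \in N(v),\ u \neq w:\quad \omega(v, u) > t \,\wedge\, \omega(v, w) > t \;\Longrightarrow\; \{u, w\} \in E
$$

For $t = 1$ in `H`, the only strong edges are $\{0, 1\}$ and $\{1, 3\}$, both with weight 2. Node 1 therefore has strong ties to nodes 0 and 3, but the edge $\{0, 3\}$ is missing, which matches the second example output.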

In [1]:
def check_stc(G: nx.Graph, attr: Optional[str]="weight", threshold: Optional[int]=1) -> (bool, Dict[str, List[Tuple[str, str]]]):
    """
    :param G: networkx graph whose nodes we want to check. You may assume that it is undirected, but weighted
    :param attr: edge attribute that contains the weights we want to look into
    :param threshold: weight threshold which determines whether an edge is weak or strong. 
    :                
    :return: 1. A bool (True or False), indicating whether all nodes in the graph fulfill the STC property
    :        2. A dictionary of all nodes which do not fulfill the STC property (may be empty if no such node exists). 
    :           The keys are the node IDs, the values are a list of missing edges. 
    """
    # your code here
    raise NotImplementedError

__e)__ Apply your function from d) to determine if all nodes in $G$ fulfill the strong triadic closure property, using the threshold $t=1$. If not, which nodes violate it? Do you obtain a different result for $t=2$? Give your answer in the corresponding cell below. Save all nodes which do not fulfill the STC property (may be empty if no such node exists) as a dictionary into `t1_violations` and `t2_violations` for the respective threshold. The keys are the node IDs, the values are a list of missing edges. **(2 pts)**

**Example:** _In the example below there are two violating nodes in total for threshold 1. The first one is node 3, where edges are missing between nodes 13 and 19, and between nodes 14 and 27 in order to fulfill the strong triadic closure property:_

`t1_violations = {'3': [('13', '19'), ('14', '27')], '11': [('22', '31')]}`

In [15]:
t1_violations = {}
t2_violations = {}

**A3e):** _Please provide your answer regarding the strong triadic closure property here!_