# ST5225 Final Exam — Part 2

This is the second part of the final exam. This file is both the problem set and your answer sheet. You need to add your code to the respective code cells, and then upload this notebook to Canvas.

**You have 1 hour 15 minutes to complete this part.**

**Do not use any libraries other than those imported below.**

The solution to each problem is to be stored in the dictionary variable `ans`: Your answer to Problem 1 should be stored in `ans[1]`, your answer to Problem 2 should be stored in `ans[2]`, and so forth. If you do the coding in a separate place and only store the results in this notebook, you will not get any partial points in case your answer is wrong, so I encourage you to add the full code to this notebook. **Make sure the whole Jupyter Notebook runs without errors before submitting**. You can check this by pressing _Run All_ in the menu.

For each problem, the number of points you can achieve is indicated in the question. **The maximum number of points is 20**. Points are only given if the code runs error-free and the output is correct. Partial points may be given depending on the quality of the answer. It is also indicated whether you should use Python or R to solve the problem.

**Ensure your notebook runs by pressing "Run All" before submitting.**

In [1]:
# Import some libraries
import math
import random
import json

try:
  import networkx as nx
  import pandas as pd
  import numpy as np
  import powerlaw
except:
  %pip install networkx
  %pip install pandas
  %pip install numpy
  %pip install powerlaw


# Create your answer dictionary
ans = dict()

# Load the rpy2 extension
try:
  %load_ext rpy2.ipython
except:
  %pip install rpy2

Error importing in API mode: ImportError('On Windows, cffi mode "ANY" is only "ABI".')
Trying to import in ABI mode.


In [2]:
%%R 
options(repos="https://cloud.r-project.org")
packages <- c("igraph", "ergm", "intergraph", "network", "latentnet")
install.packages(setdiff(packages, rownames(installed.packages())))  

library(igraph)
library(intergraph)
library(ergm)
library(network)
library(latentnet)


Attaching package: 'igraph'

The following objects are masked from 'package:stats':

    decompose, spectrum

The following object is masked from 'package:base':

    union

Loading required package: network

'network' 1.19.0 (2024-12-08), part of the Statnet Project
* 'news(package="network")' for changes since last version
* 'citation("network")' for citation information
* 'https://statnet.org' for help, support, and other information


Attaching package: 'network'

The following objects are masked from 'package:igraph':

    %c%, %s%, add.edges, add.vertices, delete.edges, delete.vertices,
    get.edge.attribute, get.edges, get.vertex.attribute, is.bipartite,
    is.directed, list.edge.attributes, list.vertex.attributes,
    set.edge.attribute, set.vertex.attribute


'ergm' 4.10.1 (2025-08-26), part of the Statnet Project
* 'news(package="ergm")' for changes since last version
* 'citation("ergm")' for citation information
* 'https://statnet.org' for help, support, and other informa

## Problem 1 (Erdős-Rényi graphs, Component Sizes)

[**1 point**, **Python**] Generate 1000 Erdős-Rényi graphs each with $n=150$ nodes and edge probability $p=0.01$. For each graph, calculate the size of the second largest component. Save the average of the sizes of second largest components in `ans[1]`. 

In [3]:
### DO NOT CHANGE THE LINES BELOW ###
random.seed(42)
np.random.seed(42)
#### END OF PART THAT SHOULD NOT BE MODIFIED ####

#### ADD YOUR CODE BELOW ####

second_largest_component = []

for _ in range(1000):
  G = nx.erdos_renyi_graph(n=150, p=0.01)
  # Find size of second largest component
  size = sorted([len(c) for c in nx.connected_components(G)], reverse=True)[1]
  second_largest_component.append(size)

# Calculate the average size of the second largest component
average = np.mean(second_largest_component)

# Change the line below to store your answer
ans[1] = average
print(ans[1])

8.847


## Problem 2 (Erdős–Rényi Graphs, 4-Cycles)

[**1 point**, **Python**] Assume an Erdős-Rényi graph on $n=120000$ nodes. How do you need to choose the edge connection probability $p$ so that the expected number of 4-cycles equals 3200? Save your answer in `ans[2]`.

In [4]:
### DO NOT CHANGE THE LINES BELOW ###
random.seed(42)
np.random.seed(42)
#### END OF PART THAT SHOULD NOT BE MODIFIED ####

#### ADD YOUR CODE BELOW ####

# Calculate the edge connection probability p
n = 120000
expected_4_cycles = 3200

# The expected number of 4-cycles in an Erdős-Rényi graph is given by:
# E[T] = (n choose 4) * p^4

# Calculate the number of ways to choose 4 nodes from n
n_choose_4 = math.comb(n, 4)

# Solve for p
p = (expected_4_cycles / n_choose_4) ** (1/4)

# Change the line below to store your answer
ans[2] = p
print(ans[2])

0.00013872811578372914


## Problem 3 (Erdős-Rényi Graphs, Connected Components)

[**1 point**, **Python**] Generate 1000 Erdős-Rényi graphs with $n=200$ nodes and edge probability $p=0.008$. For each graph, calculate the number of components. Save the average number of components in `ans[3]`.

In [5]:
### DO NOT CHANGE THE LINES BELOW ###
random.seed(42)
np.random.seed(42)
#### END OF PART THAT SHOULD NOT BE MODIFIED ####

#### ADD YOUR CODE BELOW ####

num_components = []

for _ in range(1000):
  G = nx.erdos_renyi_graph(n=200, p=0.008)
  num_components.append(nx.number_connected_components(G))

average_num_components = np.mean(num_components)

# Change the line below to store your answer
ans[3] = average_num_components
print(ans[3])

53.198


## Problem 4 (Configuration Model, Diameter)

[**1 point**, **Python**] Generate 1000 realisations of the configuration multigraph model on $n=60$ nodes. Assume the degree distribution is given as follows:
- 10 nodes have degree 1
- 10 nodes have degree 2
- 15 nodes have degree 3
- 15 nodes have degree 4
- 5 nodes have degree 10
- 5 nodes have degree 25

For each realisation, extract the largest component and calculate the diameter of that component. Save the average `ans[4]`. _Note: We have not explicitly discussed **diameter** in the lecture. Use the internet to find out what it is._ 

In [6]:
### DO NOT CHANGE THE LINES BELOW ###
random.seed(42)
np.random.seed(42)
#### END OF PART THAT SHOULD NOT BE MODIFIED ####

#### ADD YOUR CODE BELOW ####

degree_sequence = [1]*10 + [2]*10 + [3]*15 + [4]*15 + [10]*5 + [25]*5

diameters = []

for _ in range(1000):
  G = nx.configuration_model(degree_sequence)
  # Extract the largest connected component
  G_max_nodes = max(nx.connected_components(G), key=len)
  G_max = G.subgraph(G_max_nodes)
  # Find the diameter of the graph
  diameters.append(nx.diameter(G_max))

average = np.mean(diameters)

# Change the line below to store your answer
ans[4] = average
print(ans[4])


5.817


## Problem 5 (Powerlaw)

Ensure the that file `graph_1.graphml` containing the second graph is stored in the same directory as this notebook. If not download it from `https://nus-st5225.netlify.app/final/graph_1.graphml`. The file is a GraphML file containing the graph. **Do not change the name of the file**! The code to load the graph as networkx graph is given below. The graph is stored in the variable `G1` and used in Problems 5, 6 and 7. Do not change the code.

[**2 points**, **Python**]  Use the `powerlaw` package to fit a power-law distribution to the degree distribution of the graph `G1`. Let the algorithm choose $x_{\min}$ on its own. Save the fitted exponent in `ans[5]`. You need to decide whether a discrete or continuous power-law distribution is more appropriate.

In [7]:
### DO NOT CHANGE THE LINES BELOW ###
random.seed(42)
np.random.seed(42)
G1 = nx.read_graphml('D:\\Academic\\Master\\ST5225\\reference\\graph_1.graphml')
#### END OF PART THAT SHOULD NOT BE MODIFIED ####

#### ADD YOUR CODE BELOW ####

degree_sequence = [d for n, d in G1.degree()]
print(degree_sequence)

fit = powerlaw.Fit(degree_sequence, discrete=False)

# Change the line below to store your answer
ans[8] = fit.alpha
print(ans[8])

[135, 86, 49, 39, 81, 102, 109, 68, 110, 128, 35, 66, 116, 29, 32, 26, 57, 44, 32, 40, 70, 20, 24, 14, 42, 33, 50, 36, 40, 45, 25, 30, 19, 59, 21, 18, 32, 42, 33, 33, 58, 16, 46, 25, 42, 32, 33, 21, 35, 22, 18, 45, 26, 23, 20, 29, 20, 22, 37, 23, 57, 22, 29, 22, 30, 22, 15, 28, 9, 22, 24, 8, 18, 24, 11, 13, 45, 17, 7, 18, 16, 14, 11, 23, 15, 14, 13, 29, 38, 21, 15, 28, 10, 13, 19, 31, 15, 10, 11, 17, 21, 54, 16, 26, 20, 10, 20, 14, 20, 22, 18, 17, 24, 11, 7, 19, 14, 17, 15, 14, 9, 10, 30, 18, 11, 13, 9, 11, 11, 23, 15, 7, 15, 19, 22, 11, 19, 19, 8, 34, 20, 30, 15, 23, 8, 5, 8, 7, 9, 23, 16, 17, 12, 12, 12, 10, 9, 8, 9, 12, 17, 27, 26, 19, 11, 7, 19, 27, 6, 12, 9, 16, 24, 11, 8, 9, 14, 11, 16, 5, 6, 7, 12, 8, 13, 17, 5, 10, 19, 16, 11, 18, 12, 10, 17, 9, 9, 13, 11, 14, 14, 19, 17, 27, 13, 20, 10, 19, 8, 5, 13, 13, 12, 12, 10, 14, 9, 9, 5, 15, 12, 12, 15, 9, 16, 15, 11, 10, 6, 17, 7, 8, 8, 6, 6, 16, 8, 14, 19, 13, 10, 14, 10, 8, 11, 10, 11, 14, 7, 8, 15, 13, 9, 9, 12, 14, 20, 14, 12, 15,

## Problem 6 (Robustness, Random Attacks)

[**2 points**, **Python**] For each fraction $f$ from a predefined list (`fractions = [0.01, 0.04, 0.08, 0.12, 0.16, 0.20, 0.24, 0.28, 0.32]`) repeat for 100 times: Delete a fraction $f$ of the nodes **at random** from `G1` and calculate the number of connected components in the resulting graph. After all trials for a fraction are done, calculate the number of trials where the graph got disconnected, that is, where the number of connected components was greater than 1. Which is the lowest fraction of nodes that needs to be removed to disconnect the graph in strictly more than 50% of the trials? Save your answer in `ans[6]`.

 

In [8]:
### DO NOT CHANGE THE LINES BELOW ###
random.seed(42)
np.random.seed(42)
#### END OF PART THAT SHOULD NOT BE MODIFIED ####

#### ADD YOUR CODE BELOW ####

num_components = {}
fractions = [0.01, 0.04, 0.08, 0.12, 0.16, 0.20, 0.24, 0.28, 0.32]

for f in fractions:
  num_components[f] = []
  for _ in range(100):
    nodes = list(G1.nodes())
    to_remove = random.sample(nodes, int(f*len(nodes)))
    G = G1.copy()
    G.remove_nodes_from(to_remove)
    num_components[f].append(nx.number_connected_components(G))

# For each percentage, calculate the fraction of times the number of connected components is bigger than 1
fraction_disconnected = {f: np.mean([x > 1 for x in num_components[f]]) for f in fractions}

for f in fractions:
  if fraction_disconnected[f] > 0.5:
    breaking_point = f
    break

# Change the line below to store your answer
ans[6] = breaking_point
print(ans[6])


0.2


## Problem 7 (Robustness, Targeted Attacks)

[**2 point**, **Python**] For each fraction $f$ from a predefined list (`fractions = [0.01, 0.04, 0.08, 0.12, 0.16, 0.20, 0.24, 0.28, 0.32]`) do the following: Delete the fraction $f$ of the **highest degrees nodes** from `G1` and calculate the number of connected components in the resulting graph. Which is the smallest fraction of nodes that needs to be removed to disconnect the graph? Save your answer in `ans[7]`. _Note: There is no need to repeat the process 100 times for each fraction, since the fraction of nodes is chosen deterministically, starting with the highest degrees._

In [9]:
### DO NOT CHANGE THE LINES BELOW ###
random.seed(42)
np.random.seed(42)
#### END OF PART THAT SHOULD NOT BE MODIFIED ####

#### ADD YOUR CODE BELOW ####

num_components = {}
fractions = [0.01, 0.04, 0.08, 0.12, 0.16, 0.20, 0.24, 0.28, 0.32]

# Sort the list of nodes by degree
nodes = sorted(G1.nodes(), key=lambda x: G1.degree(x), reverse=True)

num_components = {}

for f in fractions:
  to_remove = nodes[:int(f*len(nodes))]
  G = G1.copy()
  G.remove_nodes_from(to_remove)
  num_components[f] = nx.number_connected_components(G)

for f in fractions:
  if num_components[f] > 1:
    breaking_point = f
    break

# Change the line below to store your answer
ans[7] = breaking_point
print(ans[7])


0.04


## Problem 8 (Exponential Random Graph Models)

Ensure the that file `graph_2.graphml` containing the second graph is stored in the same directory as this notebook. If not, download it from `https://nus-st5225.netlify.app/final/graph_2.graphml`. The file is a GraphML file containing the graph. **Do not change the name of the file**! The code to load the graph as networkx graph is given below. The graph is stored in the variable `G2` and used in Problems 8—12. Do not change the code.

This network is a social interaction network. The nodes represent individuals, and the edges represent interactions between them. The nodes have two attributes, _gender_ (categorical, "male" or "female") and _age_ (scalar value, between 18 and 65).

[**2 point**, **R**] Use the `ergm` package in R to fit an Exponential Random Graph Model to the graph `G2` with only an `edges` term. Save the estimated coefficient of the edge term in `ans_8`. _Please note that the variable name is different from the previous problems. It will be exported to Python and stored in `ans[8]` automatically._ 

In [10]:
%%R
### DO NOT CHANGE THE LINES BELOW ###
set.seed(42)
G2_ <- read_graph('D:\\Academic\\Master\\ST5225\\reference\\graph_2.graphml', format='graphml')
G2 <- asNetwork(G2_)
#### END OF PART THAT SHOULD NOT BE MODIFIED ####

#### ADD YOUR R CODE BELOW ####

fit <- ergm(G2 ~ edges)
# print(summary(fit))

# Change the line below to store your answer
ans_8 = coef(fit)["edges"]
print(ans_8)


    edges 
-0.443003 


Starting maximum pseudolikelihood estimation (MPLE):
Obtaining the responsible dyads.
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Evaluating log-likelihood at the estimate. 
1: In ergm(G2 ~ edges) :
  strings not representable in native encoding will be translated to UTF-8
2: In ergm(G2 ~ edges) : input string '
<f5><dc>
' cannot be translated to UTF-8, is it valid in 'UTF-8'?
3: In ergm(G2 ~ edges) : input string '
<f5><dc>
' cannot be translated to UTF-8, is it valid in 'UTF-8'?
4: In ergm(G2 ~ edges) : input string '
<f5><dc>
' cannot be translated to UTF-8, is it valid in 'UTF-8'?
5: In ergm(G2 ~ edges) : input string '
<f5><dc>
' cannot be translated to UTF-8, is it valid in 'UTF-8'?


In [11]:
### DO NOT CHANGE THE CODE BELOW ###
%R -o ans_8
ans[8] = ans_8[0]
print(ans[8])

-0.44300302742705666


## Problem 9/10/11/12 (Latent Space Model)

[**4 points** (one point each), **R**]. Use the function `ergmm` (**set `seed=42` as argument to the function**) from the `latentnet` R-library to fit a latent space model to the graph `G2` with (1) an edge term, and (2) a two dimensional latent space term with Euclidean distance as distance function. For the latent space term, fit the model with 1, 2, and 3 groups, respectively. Save the BIC of models with 1, 2, and 3 groups in `ans_9`, `ans_10`, and `ans_11`, respectively. Then, use the BIC as selection criterion and store the best model number in `ans_12`, that is, 1, 2 or 3. _Please note that the variable name is different from the previous problems. It will be exported to Python and stored in `ans` automatically._ _Note: Do not worry about convergence or other MCMC warnings for the purpose of this exam._


In [12]:
%%R
### DO NOT CHANGE THE LINES BELOW ###
set.seed(42)
#### END OF PART THAT SHOULD NOT BE MODIFIED ####

#### ADD YOUR R CODE BELOW ####

fit1 <- ergmm(G2 ~ edges + euclidean(d=2, G=1), seed=42)
fit2 <- ergmm(G2 ~ edges + euclidean(d=2, G=2), seed=42)
fit3 <- ergmm(G2 ~ edges + euclidean(d=2, G=3), seed=42)

# Change the line below to store your answer

# Extract the BIC values from the summary of the models
ans_9  = summary(fit1)$bic$overall
ans_10 = summary(fit2)$bic$overall
ans_11 = summary(fit3)$bic$overall

# Choose the model with the lowest BIC
ans_12 = which.min(c(ans_9, ans_10, ans_11))

print(ans_9)
print(ans_10)
print(ans_11)
print(ans_12)


[1] 1069.493
[1] 1056.261
[1] 1061.884
[1] 2


NOTE: It is not certain whether it is appropriate to use latentnet's BIC to select latent space dimension, whether or not to include actor-specific random effects, and to compare clustered models with the unclustered model.
In backoff.check(model, burnin.sample, burnin.control) :
  Backing off: too few acceptances. If you see this message several times in a row, use a longer burnin.


In [13]:
### DO NOT CHANGE THE CODE BELOW ###
%R -o ans_9
ans[9] = ans_9[0]
print(ans[9])
%R -o ans_10
ans[10] = ans_10[0]
print(ans[10])
%R -o ans_11
ans[11] = ans_11[0]
print(ans[11])
%R -o ans_12
ans[12] = float(ans_12[0])
print(ans[12])

1069.4932246184028
1056.261055054699
1061.8844534575671
2.0


## Problem 13/14/15/16 (Latent Space Model)

[**4 points** (one point each), **R**]. Use the function `ergmm` (**set `seed=42` as argument to the function**) from the `latentnet` R-library to fit a latent space model to the graph `G2` with (1) an edge term (2) a term representing the absolute difference between the ages of the nodes, (3) a term that reflects whether two nodes have the same gender (4) a two dimensional latent space term with Euclidean distance. For the latent space term, fit the model with 1, 2, and 3 groups, respectively. Save the BIC of the model with 1, 2, and 3 groups in `ans_13`, `ans_14`, and `ans_15`, respectively. Then, use the BIC as selection criterion and store the best model number in `ans_16`, that is, store the number 1, 2 or 3. _Please note that the variable name is different from the previous problems. It will be exported to Python and stored in `ans` automatically._ _Note: Do not worry about convergence or other MCMC warnings for the purpose of this exam._


In [14]:
%%R
### DO NOT CHANGE THE LINES BELOW ###
set.seed(42)
#### END OF PART THAT SHOULD NOT BE MODIFIED ####

#### ADD YOUR R CODE BELOW ####

fit1 <- ergmm(G2 ~ edges + absdiff("age") + nodematch("gender") + euclidean(d=2, G=1), seed=42)
fit2 <- ergmm(G2 ~ edges + absdiff("age") + nodematch("gender") + euclidean(d=2, G=2), seed=42)
fit3 <- ergmm(G2 ~ edges + absdiff("age") + nodematch("gender") + euclidean(d=2, G=3), seed=42)

# Change the line below to store your answer
# Extract the BIC values from the summary of the models
ans_13 = summary(fit1)$bic$overall
ans_14 = summary(fit2)$bic$overall
ans_15 = summary(fit3)$bic$overall
ans_16 = which.min(c(ans_13, ans_14, ans_15))

print(ans_13)
print(ans_14)
print(ans_15)
print(ans_16)


[1] 1043.365
[1] 1009.309
[1] 1013.629
[1] 2


In backoff.check(model, burnin.sample, burnin.control) :
  Backing off: too few acceptances. If you see this message several times in a row, use a longer burnin.


In [15]:
### DO NOT CHANGE THE CODE BELOW ###
%R -o ans_13
ans[13] = ans_13[0]
print(ans[13])
%R -o ans_14
ans[14] = ans_14[0]
print(ans[14])
%R -o ans_15
ans[15] = ans_15[0]
print(ans[15])
%R -o ans_16
ans[16] = float(ans_16[0])
print(ans[16])

1043.364708885525
1009.3090361783005
1013.6292143279242
2.0


In [16]:
ans

{1: np.float64(8.847),
 2: 0.00013872811578372914,
 3: np.float64(53.198),
 4: np.float64(5.817),
 8: np.float64(-0.44300302742705666),
 6: 0.2,
 7: 0.04,
 9: np.float64(1069.4932246184028),
 10: np.float64(1056.261055054699),
 11: np.float64(1061.8844534575671),
 12: 2.0,
 13: np.float64(1043.364708885525),
 14: np.float64(1009.3090361783005),
 15: np.float64(1013.6292143279242),
 16: 2.0}

**Ensure your notebook runs by pressing "Run All" before submitting.**

# END OF ASSIGNMENT

In [16]:
#### DO NOT CHANGE ANYTHING BELOW THIS LINE ####

# Save the dictionary as JSON
with open('solutions.json', 'w') as file:
  json.dump(ans, file, indent=2)