# SI608 Fall 2025 – Homework 2

**Due:** Wednesday, October 8, 12:00PM EST (noon the day before class)

**Instructions:** Please submit your **.ipynb** and **.html** files and **a text file including your genAI chat history related to this homework** to Canvas. Remember to rename the notebook by filling in your uniqname. If you have _additional_ supplementary files (e.g., images), please also attach them in your submission. 

**Collaboration policy:** You may work with others on this, but specific answers need to be your own. Please indicate with whom you worked when you turn in this assignment.

## **Question 1 |**  Fitting Models
*[[40 points]]*

In Lab 3, we compared Lada’s Facebook network to a random network to examine whether it was a small-world network. We found that it was likely to be small-world, as it was different from the random model in certain ways: the clustering coefficient was greater and the average shortest path approximated ln(N). This analysis, however, does not tell us what kind of network model it actually is. For this, we need to generate different networks and compare them to the one we observe.
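A minimal sketch of the random-graph baseline behind that comparison, assuming an undirected graph. It uses two common estimates for a random graph of the same size: clustering close to the link probability, and average shortest path close to ln(N) / ln(mean degree). The function name and the sample sizes are hypothetical.

```python
import math

def random_graph_baseline(n_nodes, n_edges):
    """Approximate clustering and path length of a same-size random graph."""
    mean_degree = 2 * n_edges / n_nodes
    # In a random graph, clustering is roughly the link probability p
    clustering = mean_degree / (n_nodes - 1)
    # Common estimate of the average shortest path (needs mean_degree > 1)
    path_length = math.log(n_nodes) / math.log(mean_degree)
    return clustering, path_length

# Hypothetical network with 1000 nodes and 5000 edges
clustering, path_length = random_graph_baseline(1000, 5000)
print(clustering, path_length)
```

An observed network with much higher clustering than this baseline but a similar path length would be consistent with the small-world pattern.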

For this question, we will be working with the **San Juan Sur Village dataset** which you worked with in HW1 Question 7. You again will need the file "sj.gml". This network data was collected by sociologists in 1948, who observed visiting patterns among families in the Costa Rican village San Juan Sur. Based on the number of visits, each household was assigned a “status score” ranging from 1 to 14.

To properly compare the San Juan Sur network to the possible network models, it's not enough to generate just one graph. With just one, we can’t be confident (in the statistical sense) whether we found a match: the chance that we get exactly the same thing is slim. Therefore, a better approach is to generate many (say, 200) different instances of the network and find the confidence intervals for the metric of interest (e.g., components, diameter, mean degree).

### **Q1.1 |** Create a function to generate multiple networks + metric confidence intervals
*[[15 points]]*

Write a function that generates multiple instances (100 by default) of an Erdos-Renyi graph. The parameters of the function should include the number of nodes, the link formation probability, and the number of networks to generate.

For each of these instances find the following: 
1. diameter (of the largest component), 
2. average shortest path (of the largest component), 
3. average degree, 
4. average clustering coefficient.  

For all 100 networks you generated, identify the 90% confidence interval (i.e., the 5% and 95% bounds) of each metric. Submit your code and report the intervals for the Erdos-Renyi network with 200 nodes and link probability 0.01.

This may take a few minutes to calculate, depending on the speed of your machine, so you might want to set a lower number for debugging. On the flip side, you can increase the number to get results you can be more confident in.
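A minimal sketch of the 5% and 95% bounds for one metric, using only the standard library. The helper name `percentile_bounds` and the sample values are illustrative assumptions, not part of the assignment.

```python
import statistics

def percentile_bounds(values):
    """Return the 5th and 95th percentiles of a list of metric values."""
    # n=20 splits the data into 5% steps, so the first and last cut points
    # are the 5th and 95th percentiles.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return cuts[0], cuts[-1]

# Hypothetical metric values, one per simulated network
sample = list(range(101))
lower, upper = percentile_bounds(sample)
print(lower, upper)
```

The same idea applies to each of the four metrics separately: collect one value per generated network, then take the bounds of that list.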

In [5]:
# YOUR CODE HERE
import networkx as nx
import numpy as np
import pandas as pd

def generate_erdos_metrics(n_nodes=200, p=0.01, n_networks=100, seed=None):
    """
    Generate Erdos-Renyi graphs and summarize their metrics.

    Parameters
    ----------
    n_nodes : int
        Number of nodes in each network.
    p : float
        Probability of edge formation.
    n_networks : int
        Number of networks to generate.
    seed : int or None
        Random seed for reproducibility.
    """
    results = {
        "diameter": [],
        "avg_shortest_path": [],
        "avg_degree": [],
        "avg_clustering": []
    }

    for i in range(n_networks):
        # Give each graph its own seed so runs are reproducible but graphs differ
        graph_seed = None if seed is None else seed + i
        G = nx.erdos_renyi_graph(n_nodes, p, seed=graph_seed)

        # Path-based metrics are only defined on a connected graph,
        # so measure them on the largest connected component
        largest_cc = max(nx.connected_components(G), key=len)
        Gc = G.subgraph(largest_cc)
        diameter = nx.diameter(Gc)
        avg_shortest_path = nx.average_shortest_path_length(Gc)

        avg_degree = np.mean([deg for _, deg in G.degree()])
        avg_clustering = nx.average_clustering(G)
    
        # Store results
        results["diameter"].append(diameter)
        results["avg_shortest_path"].append(avg_shortest_path)
        results["avg_degree"].append(avg_degree)
        results["avg_clustering"].append(avg_clustering)

    # Convert to DataFrame
    df = pd.DataFrame(results)
    
    # Compute the 90% interval (5% and 95% bounds)
    summary = df.describe(percentiles=[0.05, 0.95]).loc[["mean", "5%", "95%"]]
    summary.index = ["mean", "lower_5%", "upper_95%"]
    
    return summary


### **Q1.2 |** Inspecting the San Juan Sur Village network
*[[5 points in total]]*

#### **Q1.2.1 |** Load in the San Juan Sur Village graph and report network statistics
*[[2 points]]*

Using the sj.gml file, create an *undirected* graph with an edge between two houses if at least one of them visited the other.

In [7]:
# YOUR CODE HERE
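G = nx.read_gml("sj.gml", label="id")
# Keep an edge if either household visited the other
G = G.to_undirected()
# nx.info() was removed in NetworkX 3.0, so report the counts directly
print(f"{G.number_of_nodes()} nodes, {G.number_of_edges()} edges")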

#### **Q1.2.2 |** Report 6 statistics for the San Juan Sur Village network
*[[3 points]]*


Report the following network statistics:
1. number of nodes, 
2. number of edges,
3. diameter,
4. average shortest path, 
5. average degree, 
6. average clustering coefficient

In [None]:
# YOUR CODE HERE

### **Q1.3 |** Fitting network models to match the San Juan Sur Village graph
*[[6 points total]]*


#### **Q1.3.1 |** Fitting the Erdos Renyi model 
*[[2 points]]*

Identify the proper parameters for an Erdos Renyi network model of approximately the same size (nodes and edges) as the San Juan Sur network. Report the n and p you found.
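A minimal sketch of one way to pick p, assuming the goal is for the expected number of edges, p · n(n − 1)/2, to equal the observed edge count. The function name and the example numbers are hypothetical and are not the San Juan Sur values.

```python
def er_probability(n_nodes, n_edges):
    """Link probability whose expected edge count equals n_edges."""
    possible_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible_edges

# Hypothetical target: 100 nodes and 495 edges
p = er_probability(100, 495)
print(p)
```

Any single generated graph will have an edge count that varies around the target, which is one reason to look at many instances.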

In [None]:
# YOUR CODE HERE

#### **Q1.3.2 |** Fitting the Watts Strogatz model
*[[2 points]]*

Now, create a Watts_Strogatz_Graph approximately the same size as San Juan Sur Village graph (See https://networkx.org/documentation/stable/reference/generated/networkx.generators.random_graphs.watts_strogatz_graph.html).

We don’t really know what p is. You can play with it, but when reporting your answers, you can just set it to 0.2 (somewhat arbitrary). You’ll need to figure out and report n and k. Report the actual number of edges for this graph. It should match the San Juan Sur village network pretty closely.
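A minimal sketch for choosing k, assuming the graph starts as a ring lattice with n · k / 2 edges (for even k) and that rewiring moves edges without changing their number. The function name and the example numbers are hypothetical.

```python
def ws_neighbors(n_nodes, n_edges):
    """Even k whose ring lattice (n * k / 2 edges) is closest to n_edges."""
    # Number of neighbors on each side of a node in the ring
    half_k = max(1, round(n_edges / n_nodes))
    k = 2 * half_k
    return k, n_nodes * half_k

# Hypothetical target: 100 nodes and 495 edges
k, lattice_edges = ws_neighbors(100, 495)
print(k, lattice_edges)
```

Because k must be even for the lattice to be symmetric, the edge count can only be matched to within a multiple of n.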

In [None]:
# YOUR CODE HERE

#### **Q1.3.3 |** Fitting the Barabasi Albert model
*[[2 points]]*

Create a Barabasi_albert_graph that is approximately as large as the San Juan Sur Village graph, and report the n and m values you used. Report the actual number of edges for this graph. Again, you’ll need to play with it a little, but you should be able to get a rough match.
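A minimal sketch for choosing m, assuming the generator ends up with m · (n − m) edges. That count is an assumption about how the graph is grown, so it is worth confirming against the actual edge count of the generated graph. The function name and the example numbers are hypothetical.

```python
def ba_attachments(n_nodes, n_edges):
    """m whose assumed edge count m * (n - m) is closest to n_edges."""
    candidates = range(1, n_nodes // 2 + 1)
    return min(candidates, key=lambda m: abs(m * (n_nodes - m) - n_edges))

# Hypothetical target: 100 nodes and 495 edges
m = ba_attachments(100, 495)
edges = m * (100 - m)
print(m, edges)
```

Since m is an integer, the match is only approximate, which is why a rough match is the most that can be expected here.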

In [None]:
# YOUR CODE HERE

### **Q1.4 |** Finding the best fit
*[[9 points total]]*

#### **Q1.4.1 |** Use the confidence interval function on the network models 
*[[4 points]]*

Execute the confidence interval finding code on the three graph types you found in Q1.3, and report the values for all three of the network models: Erdos_Renyi, Watts_Strogatz, Barabasi_Albert.  

(Again, this might take a bit of time, so you might want to go do something else while it runs. 3 points for each type of network generator.)

In [None]:
# YOUR CODE HERE

#### **Q1.4.2 |** Which network model fits the San Juan Sur Village graph best? 
*[[5 points]]*

Compare the values of the San Juan Sur network to the values you found in Q1.4.1. Which ones seem to fit within the bounds of the confidence intervals you found? (Note that no single graph type may model the network perfectly.)

Speculate on the properties of the networks that do fit and what they might mean about the San Juan Sur Village network.

[YOUR ANSWER HERE]

## **Question 2 |**  Small World Networks
*[[20 points]]*

### **Q2.1 |** Compute clustering coefficient and avg shortest path for bitcoin graph
*[[10 points]]*


Compute the clustering coefficient and the average shortest path of the largest **strongly** connected component in http://snap.stanford.edu/data/soc-sign-bitcoin-alpha.html. For this exercise, please create an unweighted and directed graph $G = (V, E)$ where $V$ is the set of bitcoin users and $E = \{e_{ij}\}$, where $e_{ij}$ exists if user $i$ has given user $j$ a rating (either positive or negative).
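A minimal sketch of turning rating rows into unweighted directed edges, assuming each row has the layout SOURCE,TARGET,RATING,TIME. The sample rows and the function name `read_edges` are hypothetical, and the rows are read from an in-memory string rather than the downloaded file.

```python
import csv
import io

# Hypothetical rows in the assumed SOURCE,TARGET,RATING,TIME layout
sample_csv = """1,2,10,1407470400
2,1,-3,1407470500
2,3,5,1407470600
"""

def read_edges(handle):
    """Return the set of directed (source, target) pairs, ignoring ratings."""
    edges = set()
    for source, target, rating, timestamp in csv.reader(handle):
        edges.add((int(source), int(target)))
    return edges

edges = read_edges(io.StringIO(sample_csv))
print(sorted(edges))
```

The resulting pairs could then be loaded into an empty `nx.DiGraph()` with `add_edges_from`.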

In [12]:
# YOUR CODE HERE

### **Q2.2 |** Compute clustering coefficient and avg shortest path with different parameters
*[[2 points]]*


Compute the clustering coefficient and the average shortest path of the largest strongly connected component of graph $G' = (V, E')$ where $E' = \{e_{ij}\}$ and $e_{ij}$ exists if user $i$ has given user $j$ a positive rating.

In [None]:
# YOUR CODE HERE

### **Q2.3 |** Compute clustering coefficient and avg shortest path with yet again different parameters
*[[3 points]]*

Compute the clustering coefficient and the average shortest path of the largest strongly connected component of graph $G_x'' = (V, E_x'')$ where $E_x'' = \{e_{ij}\}$ and $e_{ij}$ exists if user $i$ has given user $j$ a rating of at least $x$. Compute these two measures for $x = 5, 6, 7$.

In [None]:
# YOUR CODE HERE

### **Q2.4 |** Interpreting results
*[[5 points]]*

Report on the patterns you observe across the various graphs. Does the original graph have small world characteristics? What happens to clustering coefficient and average shortest path?

[YOUR ANSWER HERE]

## **Question 3 |**  Bow Tie Model
*[[20 points]]*

### **Q3.1 |** Learning about the Bow Tie model
*[[5 points]]*

A different model for networks we haven't yet considered is the bow-tie model.  We see this kind of network on the Web. A generic image of this type of network is displayed here.  

Please read http://cs.wellesley.edu/~pmetaxas/Why_Is_the_the_Web_a_Bowtie.pdf

Provide a description of the core, in, out, tendril, and island components for the bow-tie model.

<img src="bow-tie.jpg" width=600 align="top">

[YOUR ANSWER HERE]

### **Q3.2 |** Computing component sizes for the Bow Tie model
*[[10 points]]*

Ignore for the moment tubes, tendrils, and disconnected components.  Write a program in python (using networkx) to calculate the sizes of each of those main 3 pieces (IN, SCC, and OUT).

Fill out the following function:

def bowtie(grph):
<br>&emsp;	IN = …
<br>&emsp;	OUT = …
<br>&emsp;	SCC = …
<br>&emsp;	…
<br>&emsp;	return([IN,OUT,SCC])
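A minimal sketch of the reachability idea behind the three pieces, written in plain Python rather than networkx so it stands alone. It assumes one node in the core is already known: the core is everything that can both reach and be reached from that node, IN is the rest of what reaches it, and OUT is the rest of what it reaches. The names `reachable`, `bowtie_sizes`, and `core_node` are hypothetical.

```python
from collections import deque

def reachable(adjacency, start):
    """Breadth-first search: all nodes reachable from start (inclusive)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def bowtie_sizes(edges, core_node):
    """Sizes of IN, OUT, and SCC given one node known to lie in the core."""
    forward, backward = {}, {}
    for source, target in edges:
        forward.setdefault(source, []).append(target)
        backward.setdefault(target, []).append(source)
    downstream = reachable(forward, core_node)   # core plus OUT
    upstream = reachable(backward, core_node)    # core plus IN
    scc = downstream & upstream
    return [len(upstream - scc), len(downstream - scc), len(scc)]

# Hypothetical toy graph: a -> (b <-> c) -> d
toy_edges = [("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")]
sizes = bowtie_sizes(toy_edges, core_node="b")
print(sizes)
```

In networkx, a core node could be taken from the largest set returned by `nx.strongly_connected_components`, with `nx.ancestors` and `nx.descendants` playing the role of the two searches.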




In [None]:
# YOUR CODE HERE

### **Q3.3 |** Running your code
*[[5 points]]*

Load in the BowTie.gml file and test your new function. Report the sizes for IN, OUT, and SCC. Turn in the code and your answer.

In [None]:
# YOUR CODE HERE

## **Question 4 |** Barabási-Albert Model Part 1
*[[10 points in total]]*

You are tasked with analyzing a Barabási-Albert (BA) model. The degree evolution of two vertices is of interest: one vertex introduced early at  t = 5  and another vertex introduced later at  t = 95.

### **Q4.1 |** Create a BA model
*[[5 points]]*

Create a BA model with 5 initial nodes. After the initial graph is formed, extend the graph to 10,000 nodes by adding new nodes step-by-step, where each new node connects to 2 existing nodes.

Track the degree evolution of two vertices and plot the result:
1. A vertex introduced early at t = 5 (i.e., the 5th vertex overall in the graph).
2. A vertex introduced later at t = 95 (i.e., the 95th vertex overall in the graph).
3. Plot the degree evolution of both the early and late vertices on a log-log scale.
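A minimal sketch of growing the graph step by step while recording degrees, in plain Python. It assumes the 5 initial nodes form a ring, since the starting topology is not specified, and it uses 0-based labels, so the 5th and 95th vertices are nodes 4 and 94. The function name, the seed, and the smaller graph size are hypothetical choices to keep the example quick.

```python
import random

def simulate_ba(n_nodes, m, tracked, seed=0):
    """Grow a BA-style graph and record the degree of tracked nodes over time."""
    rng = random.Random(seed)
    m0 = 5
    # Assumption: the 5 initial nodes form a ring, so each starts with degree 2
    degree = {node: 2 for node in range(m0)}
    # Each node appears in this list once per unit of degree, so a uniform
    # draw from it is a degree-proportional (preferential) draw
    endpoints = [node for node in range(m0) for _ in range(2)]
    history = {node: [] for node in tracked}
    for new_node in range(m0, n_nodes):
        # Pick m distinct existing nodes, favoring high-degree ones
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        degree[new_node] = m
        for target in targets:
            degree[target] += 1
            endpoints.extend([target, new_node])
        # Record (current graph size, degree) for tracked nodes that exist
        for node in tracked:
            if node in degree:
                history[node].append((new_node + 1, degree[node]))
    return degree, history

degree, history = simulate_ba(n_nodes=500, m=2, tracked=[4, 94])
print(history[4][-1], history[94][-1])
```

Each `history` list holds (graph size, degree) pairs that could be drawn with matplotlib's `plt.loglog`. Under the standard BA analysis, the degree of a vertex born at time $t_i$ grows roughly as $m\sqrt{t/t_i}$, so both curves should look close to straight lines of slope 1/2 on a log-log plot.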


In [None]:
# YOUR CODE HERE

### **Q4.2 |** Compute the clustering coefficient for the BA model
*[[5 points]]*

Compute the graph-level clustering coefficient of the network as it grows. For this, compute the clustering coefficient after every 10 nodes are added, starting from the initial 5-node graph, and track how it changes over time.

1. Clustering Coefficient Calculation: As the graph grows, after every 10 nodes added, compute the graph’s clustering coefficient.
2. Plot: Create a plot showing the clustering coefficient on the y-axis and the number of nodes on the x-axis.
3. Interpretation: Analyze and interpret how the clustering coefficient changes as the network grows. Does it increase, decrease, or remain constant? How does the structure of the graph affect this?
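A minimal sketch of a graph-level clustering coefficient, taking it to mean the average of the local coefficients with nodes of degree below 2 counted as 0 (transitivity would be another reasonable reading). The adjacency-dict representation, the function name, and the toy graph are hypothetical.

```python
def average_clustering(adjacency):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    total = 0.0
    for node, neighbors in adjacency.items():
        k = len(neighbors)
        if k < 2:
            continue
        # Count edges among this node's neighbors (each edge is seen twice)
        links = sum(len(adjacency[nb] & neighbors) for nb in neighbors) / 2
        total += links / (k * (k - 1) / 2)
    return total / len(adjacency)

# Hypothetical toy graph: a triangle (0, 1, 2) with node 3 attached to node 2
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
value = average_clustering(toy)
print(value)
```

Calling a function like this once every 10 added nodes gives the series of values to plot against the number of nodes.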

In [None]:
# YOUR CODE HERE

### Explanation

[YOUR ANSWER HERE] 

## **Question 5 |** Barabási-Albert Model Part 2
*[[10 points in total]]*

![Description of the image](q5_graph.png)

### **Q5.1 |** Identify the most likely connections in a standard BA model
*[[5 points]]*

The following network is being created through the Barabasi-Albert attachment model. 

Identify the three nodes the yellow node is most likely to connect to. 

Report the probability of connecting to those three nodes.  Show your work.
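A minimal sketch of the preferential attachment rule, where the probability of attaching to a node is its degree divided by the sum of all degrees. The degrees below are hypothetical placeholders; the real ones have to be read off the figure.

```python
from fractions import Fraction

def attachment_probabilities(degrees):
    """Probability of attaching to each node, proportional to its degree."""
    total = sum(degrees.values())
    return {node: Fraction(k, total) for node, k in degrees.items()}

# Hypothetical degrees, not taken from the figure
degrees = {"A": 1, "B": 3, "C": 2, "D": 2}
probs = attachment_probabilities(degrees)
print(probs["B"])
```

Using `Fraction` keeps the results exact, which makes it easy to show the work as degree over total degree.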

[YOUR ANSWER HERE]

### **Q5.2 |** Identify the most likely connections in an adapted BA model
*[[5 points]]*

Assume the network follows an adapted BA model. With a 25% chance, the new node will be connected to some other node at random, and with a 75% chance, it will be connected to one of the nodes through the standard preferential attachment model (as in Q5.1).

Report the attachment probabilities for the two nodes you picked above (most and second most likely) as well as the probability of attaching to nodes A and K (so you will report 4 probabilities). Show your work.
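A minimal sketch of the mixed rule, assuming "at random" means uniformly over the existing nodes. Each node's probability is then 0.25 · (1 / N) + 0.75 · (degree / total degree). The degrees are hypothetical placeholders, not values from the figure.

```python
from fractions import Fraction

def mixed_probabilities(degrees, random_share=Fraction(1, 4)):
    """Blend uniform attachment with degree-proportional attachment."""
    total = sum(degrees.values())
    n_nodes = len(degrees)
    return {
        node: random_share * Fraction(1, n_nodes)
        + (1 - random_share) * Fraction(k, total)
        for node, k in degrees.items()
    }

# Hypothetical degrees, not taken from the figure
degrees = {"A": 1, "B": 3, "C": 2, "D": 2}
probs = mixed_probabilities(degrees)
print(probs["A"], probs["B"])
```

Compared with pure preferential attachment, the uniform term raises the probability for low-degree nodes and lowers it for high-degree ones.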

[YOUR ANSWER HERE]