# Random networks

In this assignment, we will follow the 1938 study by Moreno and Jennings, ["Statistics of social configurations"](http://dx.doi.org/10.2307/2785588). Their research explored how the structures observed in social networks differ from those produced by random connections. They are credited with developing one of the first models for random networks. We will follow their experiment with similar but different data.

The data is a network of seventh grade students in Victoria, Australia. The students were asked to nominate the classmates they get on with. Each edge is directed from $i$ to $j$, meaning that student $i$ nominates student $j$.

This data has been modified for this exercise: some edges were removed so that each student has three nominated friends. In the original data, each student nominates more than three friends, and the data contains two other networks based on different questions, namely who your best friends are and whom you would prefer to work with. See [the source](https://manliodedomenico.com/data.php).

Let's load and visualize the network:

In [None]:
import pandas as pd
import igraph

edge_table = pd.read_csv(
    "https://raw.githubusercontent.com/skojaku/adv-net-sci-course/main/data/seventh_graders.csv"
)

g = igraph.Graph.DataFrame(edge_table, directed=True)

igraph.plot(g)

We can see that reciprocal relationships are prevalent. Nevertheless, the picture is mixed, since there are also nonreciprocal relationships, where one student nominates the other but not vice versa. It also remains unclear whether the prevalence of reciprocal relationships is merely a consequence of the number of nominations the students make.
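
To put a number on this visual impression, one could count the reciprocated pairs directly from the adjacency matrix. The following is a minimal sketch on a toy network. The function name `count_reciprocal_pairs` and the toy edges are illustrative and not part of the assignment data, and the sketch assumes a binary matrix with $A_{ij}=1$ if $i$ nominates $j$.

```python
import numpy as np
from scipy import sparse


def count_reciprocal_pairs(A):
    """Count unordered pairs {i, j}, i != j, with edges i->j and j->i."""
    A = sparse.csr_matrix(A)
    # Entry (i, j) of the element-wise product is nonzero only if both
    # i->j and j->i exist.
    mutual = A.multiply(A.T)
    n_ordered = mutual.sum() - mutual.diagonal().sum()  # ignore self-loops
    return int(n_ordered // 2)  # each pair was counted in both orders


# Toy network (hypothetical): 0->1, 1->0, 1->2, 2->0
rows = np.array([0, 1, 1, 2])
cols = np.array([1, 0, 2, 0])
A_toy = sparse.csr_matrix((np.ones(4, dtype=int), (rows, cols)), shape=(3, 3))

# Only the pair {0, 1} is reciprocated in the toy network.
n_reciprocal = count_reciprocal_pairs(A_toy)
```

The same function can be applied to the actual network and to each random network, which gives one number per network to compare.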

How substantial are the reciprocal relationships? Moreno & Jennings developed a statistical assessment based on random networks. Their idea is to create a fictional random network with the same number of nodes and edges, in which each student nominates others *uniformly at random*.
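
As a building block, the random choice made by a single student might be sketched as follows. The helper `nominate_at_random` is a hypothetical name, and the sketch assumes that a student never nominates themselves or the same classmate twice. Turning this into a full network generator is left to the assignment.

```python
import numpy as np


def nominate_at_random(i, n_students, k, rng):
    """Pick k distinct classmates of student i uniformly at random."""
    candidates = np.delete(np.arange(n_students), i)  # everyone except i
    return rng.choice(candidates, size=k, replace=False)


# Hypothetical class of 10 students, each making 3 nominations
rng = np.random.default_rng(42)
nominees = nominate_at_random(i=0, n_students=10, k=3, rng=rng)
```

Excluding `i` from the candidates is what prevents self-loops, and `replace=False` prevents duplicate nominations.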

In [None]:
# Assignment:
# Implement the function named `generate_random_network` to generate the Moreno & Jennings random network.
# - The function takes the CSR matrix of the actual network and returns the CSR matrix of the random network.
# - Make sure that no self-loop is formed
def generate_random_network(A):
    pass

Then, generate 30 random networks and show with a visualization that reciprocal relationships are more prevalent in the actual network than in the random networks.

In [None]:
# Assignment:
# - Generate 30 random networks
# - Show that the reciprocal relationships are more prevalent in the actual network than in the random networks by using a simple visualization.
# - You can use a violin plot, histogram, swarm plot, or any other visualization for 1D data.

Random networks serve as a reference point, and provide insights into the unique characteristics of empirical networks.

Let's consider another example where random networks provide a useful reference. In the original network, students receive different numbers of edges. Let's visualize the distribution of the in-degree.

In [None]:
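import numpy as np
import seaborn as sns

# Adjacency matrix of the directed network `g` loaded above (scipy sparse matrix)
A = g.get_adjacency_sparse()

# Recall that A_{ij}=1 if an edge points from i to j. Thus, the column sum
# corresponds to the number of incoming edges.
indeg = np.array(A.sum(axis=0)).reshape(-1)
ax = sns.histplot(indeg, binwidth=1)
ax.set_xlabel("# of in-coming edges")
sns.despine()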

The distribution of the in-coming edges is more or less uniform. Is this what we would expect from random networks?

In [None]:
# Assignment:
# - Using the random networks generated above, show the in-degree distributions for the actual and random networks using a simple visualization. The in-degree of a node is the number of edges that the node receives.
# - You can use a violin plot, histogram, swarm plot, or any other visualization for 1D data.

You might notice a stark difference: in the random networks, everyone has a more or less similar number of in-coming edges, while in the actual network, the distribution is closer to bimodal, with peaks at both extremes. Comparing the data with random data reveals what is unique about the real data, which helps us form interesting hypotheses about the mechanisms that generated it.

## Random networks with heterogeneous degree distribution

The random networks by Moreno and Jennings have an important shortcoming: every node has more or less the same number of edges, since the edges are redistributed *uniformly* at random. This random model therefore cannot explain a heterogeneous degree distribution like the one we saw in the previous example.
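
To see why uniform nominations lead to homogeneous in-degrees, consider a class of $N$ students who each nominate $k$ distinct classmates uniformly at random. A given student is nominated by each of the other $N-1$ students independently with probability $k/(N-1)$, so the in-degree follows a binomial distribution with mean $k$. The sketch below evaluates this distribution. Note that $N=30$ is a hypothetical class size chosen for illustration, not the size of the data.

```python
import numpy as np
from scipy import stats

n_students = 30  # hypothetical class size
k = 3  # nominations made by each student
p = k / (n_students - 1)  # chance that one given classmate nominates you

# In-degree ~ Binomial(N - 1, k / (N - 1)); possible values are 0..N-1
indeg_values = np.arange(n_students)
pmf = stats.binom.pmf(indeg_values, n_students - 1, p)

mean_indeg = float(np.sum(indeg_values * pmf))  # equals k
prob_at_most_6 = float(pmf[:7].sum())  # mass on in-degrees 0..6
```

Most of the probability mass sits within a few edges of the mean, so very popular or very unpopular students are rare under this model.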

Degree is still a non-structural feature, since it describes how many friends a node has but nothing about how these friends are connected to each other. We therefore often want random networks to have the same degree distribution as the actual network when assessing statistical significance. Such a random network with a specified degree sequence is called the *configuration model*.

Let's calculate the number of reciprocal relationships expected for the configuration model. We will use the [Chung-Lu](https://www.pnas.org/doi/10.1073/pnas.252631999) configuration model, a variant of the configuration model for unweighted networks.
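The Chung-Lu model places edges independently of each other with probability
$$
p_{ij} = \frac{d^{\text{out}}_i d^{\text{in}}_j}{m}
$$
for nodes $i$ and $j$, where $d^{\text{out}}_i$ is the out-degree of node $i$, $d^{\text{in}}_j$ is the in-degree of node $j$, and $m$ is the number of edges. Because the in-degrees of a directed network sum to $m$, the expected out-degree of node $i$ is $\sum_j p_{ij} = d^{\text{out}}_i$, and likewise for the in-degrees.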
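
Because edges are placed independently, the probability that both $i \to j$ and $j \to i$ exist is $p_{ij} p_{ji}$, and summing this over pairs gives the expected number of reciprocal relationships. The sketch below computes the probability matrix and this expectation on a toy network. The function names and the toy edges are illustrative. Here the denominator is taken to be $m$, the number of directed edges, so that each row of the matrix sums to the corresponding out-degree, and probabilities are clipped at 1 as a practical guard.

```python
import numpy as np
from scipy import sparse


def chung_lu_probabilities(A):
    """Dense matrix of edge probabilities p_ij = d_out[i] * d_in[j] / m."""
    A = sparse.csr_matrix(A)
    d_out = np.asarray(A.sum(axis=1)).ravel()  # row sums
    d_in = np.asarray(A.sum(axis=0)).ravel()  # column sums
    m = A.sum()  # number of directed edges
    P = np.outer(d_out, d_in) / m
    return np.clip(P, 0.0, 1.0)  # products of large degrees can exceed 1


def expected_reciprocal_pairs(P):
    """Expected number of pairs {i, j}, i != j, connected in both directions."""
    R = P * P.T  # independence: P(i->j and j->i) = p_ij * p_ji
    return float((R.sum() - np.trace(R)) / 2)


# Toy network (hypothetical): 0->1, 1->0, 1->2, 2->0
rows = np.array([0, 1, 1, 2])
cols = np.array([1, 0, 2, 0])
A_toy = sparse.csr_matrix((np.ones(4, dtype=int), (rows, cols)), shape=(3, 3))

P_toy = chung_lu_probabilities(A_toy)
expected = expected_reciprocal_pairs(P_toy)
```

Sampling a network from `P_toy` then amounts to drawing each entry as an independent Bernoulli variable, which is the part left to the assignment.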

In [None]:
# Assignment.
# - Implement the Chung-Lu model as a function that takes a scipy CSR matrix and returns a scipy CSR matrix.
# - Self-loops are allowed.
def ChungLu(A):
    pass

Next, let's generate a random network and visualize it.

In [None]:
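Arand = ChungLu(A)
sources, targets = Arand.nonzero()
edges = list(zip(sources.tolist(), targets.tolist()))
# Pass the number of nodes explicitly so that isolated nodes are not dropped
grand = igraph.Graph(n=Arand.shape[0], edges=edges, directed=True)
igraph.plot(grand)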

Since the configuration model preserves the in-degree and out-degree sequences in expectation, the random networks should have an in-degree distribution similar to that of the original network.

In [None]:
# Assignment:
# Visualize the distributions of in-degree for the random and actual network.

The configuration model should produce networks that are more similar to the actual network than the Moreno & Jennings random networks. In particular, like the original network, the configuration-model networks show no clear peak in the middle of the in-degree distribution.

Now, let's compare the number of reciprocal relationships in random networks and that in the actual network.

In [None]:
# Assignment:
# Compare the number of reciprocal relationships for the original network, Moreno & Jennings random networks, and the configuration model by using a simple visualization.

OK. Let's examine another social theory, i.e., [social balance theory](https://en.wikipedia.org/wiki/Social_balance_theory). This theory suggests that triangles are common in social networks because two people are likely to be good friends if they have friends in common.

To simplify the analysis, let's ignore the directionality of the edges:

In [None]:
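import pandas as pd
import igraph

edge_table = pd.read_csv(
    "https://raw.githubusercontent.com/skojaku/adv-net-sci-course/main/data/seventh_graders.csv"
)

# Same edge list as before, but loaded as an undirected graph
g = igraph.Graph.DataFrame(edge_table, directed=False)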

Next, count the number of triangles in the network, and use the Moreno & Jennings random network model to estimate the probability that a random network has more triangles than the actual network. This probability measures the statistical significance of the triangles in the original network.
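
Both steps might be sketched as follows on toy inputs. The function names, the toy edges, and the null values are illustrative only. The triangle count treats every edge as undirected and collapses multiple edges between the same pair, and the probability is estimated as the fraction of random networks with at least as many triangles as observed, which is a common convention (a strict inequality would match the wording above more literally).

```python
import numpy as np
from scipy import sparse


def count_triangles(A):
    """Count triangles, ignoring direction, multi-edges, and self-loops."""
    A = sparse.csr_matrix(A)
    S = ((A + A.T) > 0).astype(int)  # symmetric 0/1 matrix
    S = sparse.csr_matrix(S - sparse.diags(S.diagonal()))  # drop self-loops
    # (S @ S)[i, j] is the number of common neighbors of i and j. Summing it
    # over all adjacent ordered pairs counts every triangle six times.
    return int((S @ S).multiply(S).sum() // 6)


def empirical_p_value(observed, null_values):
    """Fraction of null values that are at least as large as the observed one."""
    return float(np.mean(np.asarray(null_values) >= observed))


# Toy network (hypothetical): a triangle 0-1-2, a pendant node 3,
# and one reciprocated edge (0->1 and 1->0).
rows = np.array([0, 1, 1, 2, 2])
cols = np.array([1, 0, 2, 0, 3])
A_toy = sparse.csr_matrix((np.ones(5, dtype=int), (rows, cols)), shape=(4, 4))
n_triangles = count_triangles(A_toy)

# Made-up triangle counts standing in for five random networks
p_value = empirical_p_value(observed=3, null_values=[0, 1, 2, 3, 5])
```

With only a handful of random networks the estimate is coarse, so in practice one would generate many more random networks than in this toy example.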

In [None]:
# Assignment:
# Count the number of triangles in the network, and examine whether
# the number of triangles is significantly larger than that for the Moreno & Jennings random networks.

In [None]:
# Assignment
# Compare the number of triangles in the random and actual networks. The random networks should be generated by the Chung-Lu model.

The result suggests that the original network has significantly more triangles than the Moreno & Jennings random networks. On the other hand, the Chung-Lu model produces random networks with a number of triangles similar to that of the original network. This implies that the significance of the number of triangles depends on the choice of random network model.

So, which random graph model should one use? It comes down to the question of what counts as random. For example, a network with many edges is expected to have many triangles. You can check whether this is the case by randomly shuffling the edges and counting the triangles in the shuffled network. Alternatively, if you want to know whether the many triangles form around popular individuals who have many edges, the configuration model might be an option, since it preserves the degrees while randomizing everything else.
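
The edge-shuffling check could be sketched as follows. This is one possible reading of "shuffling", namely placing the same number of undirected edges uniformly at random among all node pairs. The function names and the sizes (20 nodes, 40 edges) are hypothetical.

```python
import numpy as np
from itertools import combinations


def random_graph_same_size(n_nodes, n_edges, rng):
    """Undirected simple graph with n_edges edges placed uniformly at random."""
    pairs = np.array(list(combinations(range(n_nodes), 2)))  # all node pairs
    chosen = pairs[rng.choice(len(pairs), size=n_edges, replace=False)]
    S = np.zeros((n_nodes, n_nodes), dtype=int)
    S[chosen[:, 0], chosen[:, 1]] = 1
    S[chosen[:, 1], chosen[:, 0]] = 1  # keep the matrix symmetric
    return S


def count_triangles_dense(S):
    """Triangles in a symmetric 0/1 matrix without self-loops: trace(S^3) / 6."""
    return int(np.trace(S @ S @ S) // 6)


rng = np.random.default_rng(0)
triangle_counts = [
    count_triangles_dense(random_graph_same_size(20, 40, rng)) for _ in range(100)
]
```

Comparing the triangle count of the actual network against `triangle_counts` shows how many triangles the sheer number of edges would produce on its own.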





