# Exercise #4: Model Selection in Networks Using a Bayesian Approach

## Overview

In this exercise, we aim to understand how real-world networks form by investigating the mechanisms behind edge formation. Often, in empirical networks, we are interested in identifying whether nodes connect based on shared characteristics (homophily) or popularity (preferential attachment). 

### Key Concepts:
- **Homophily**: The tendency of similar individuals to connect with each other more often than with dissimilar ones. This is a common pattern in social networks ([McPherson et al., 2001](https://www.annualreviews.org/content/journals/10.1146/annurev.soc.27.1.415)).
- **Preferential Attachment**: The phenomenon where popular nodes (those with many connections) attract even more connections, reinforcing their popularity. This mechanism is tied to the *Matthew Effect* ([Merton, 1968](https://www.science.org/doi/abs/10.1126/science.159.3810.56)) and the *Barabási-Albert model* ([Barabási and Albert, 1999](https://www.science.org/doi/full/10.1126/science.286.5439.509)).
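
To make the two mechanisms concrete, here is a minimal sketch in plain Python of how each one could weight the candidate targets of a new edge. The toy degrees, the group labels, the homophily value `h`, and both function names are illustrative assumptions and are not part of `netin`.

```python
def pa_probabilities(degrees):
    """Preferential attachment: target probability proportional to degree."""
    total = sum(degrees.values())
    return {node: d / total for node, d in degrees.items()}


def homophily_probabilities(source_group, groups, h):
    """Homophily: weight h for same-group targets, 1 - h otherwise, then normalize."""
    weights = {
        node: (h if group == source_group else 1.0 - h)
        for node, group in groups.items()
    }
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


# Toy data (assumed for illustration): node "a" is a hub, groups are "M" and "m"
degrees = {"a": 3, "b": 1, "c": 1, "d": 1}
groups = {"a": "M", "b": "M", "c": "m", "d": "m"}

pa = pa_probabilities(degrees)                      # favors the hub "a"
hom = homophily_probabilities("m", groups, h=0.8)   # favors same-group nodes "c", "d"
```

Under preferential attachment the hub receives most of the probability mass regardless of its group, whereas under homophily the source's own group does, regardless of degree.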

There are multiple approaches to investigating these patterns, such as fitting generative models to the data or using the **Multiple Regression Quadratic Assignment Procedure (MRQAP)**, an extension of QAP that handles multiple covariate matrices ([Dekker et al., 2007](https://link.springer.com/article/10.1007/S11336-007-9016-1)). However, this exercise covers a different approach based on **Bayesian inference**.

We will use **JANUS**, an approach published by [Espín-Noboa et al. (2017)](https://link.springer.com/article/10.1007/s41109-017-0036-1) and based on **HypTrails** ([Singer et al., 2017](https://dl.acm.org/doi/abs/10.1145/3054950)). JANUS encodes each hypothesis as prior beliefs about which edges are likely, computes the marginal likelihood (evidence) of the observed network under each hypothesis, and ranks the hypotheses by that evidence to determine which one best explains the observed connections.
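
The idea of turning prior beliefs into an evidence score can be sketched with a single categorical variable and a Dirichlet prior, the standard conjugate setup. This is a simplified illustration under that assumption, not the `netin` implementation: the function name `log_evidence`, the toy counts, and the three priors are all made up for the example.

```python
import math


def log_evidence(counts, alphas):
    """Log marginal likelihood of categorical counts under a Dirichlet(alphas) prior."""
    result = math.lgamma(sum(alphas)) - math.lgamma(sum(alphas) + sum(counts))
    for n, a in zip(counts, alphas):
        result += math.lgamma(a + n) - math.lgamma(a)
    return result


# Toy observations (assumed): 8 edges of type 0, 1 of type 1, 1 of type 2
counts = [8, 1, 1]

# Each hypothesis is expressed as Dirichlet pseudo-counts (assumed values)
priors = {
    "uniform": [1, 1, 1],      # no preference among edge types
    "informed": [9, 2, 2],     # belief that agrees with the data
    "mismatched": [2, 2, 9],   # belief that contradicts the data
}

log_ev = {name: log_evidence(counts, alphas) for name, alphas in priors.items()}
```

A prior that agrees with the data scores higher than the uniform prior, and a prior that contradicts the data scores lower, which is the ordering the comparison of hypotheses relies on.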

## Task

1. **Generate a Synthetic Network**: Create a synthetic network of your choice (directed or undirected).
2. **Set Hypotheses**: Use the `netin.algorithms.janus.JanusModelFitting` method to test multiple hypotheses about how the network formed.
   - **Baseline Hypotheses**: Random connections, self-loops, and data-based connections.
   - **Edge Formation Mechanisms**: Include hypotheses based on **homophily**, **preferential attachment**, or a combination of both. *Only hypotheses about undirected networks are supported so far.*
3. **Apply Bayesian Inference**: Use JANUS to compute the **marginal likelihoods** for each hypothesis.
4. **Compare Hypotheses**: Analyze the relative plausibility of each hypothesis based on the computed evidence.
5. **Store the Evidence Scores**: Save the evidence scores of all your hypotheses to a file.

### Instructions

1. Generate a synthetic network with customizable parameters (directed/undirected, size, etc.).
2. Define multiple hypotheses using JANUS, including random baselines and more sophisticated mechanisms like homophily and preferential attachment.
3. Use Bayesian inference to compare these hypotheses by calculating the marginal likelihood for each one.
4. Visualize the results to see which hypothesis best explains the formation of the synthetic network.

### Expected Outcome

By the end of this exercise, you will be able to apply a **Bayesian approach** to model selection in networks, testing multiple hypotheses about edge formation and determining which one best explains the observed patterns. This process will enhance your understanding of how to infer the underlying mechanisms of network formation using **evidence-based** comparison methods.

### Disclaimer

The implementation of JANUS is still in testing mode. If you encounter any bug or inconsistency, please report it on our [GitHub repository](https://github.com/CSHVienna/NetworkInequalities/issues).



___

In [None]:
# ### If running this on Google Colab, run the following lines:
# import os
# !pip install netin==2.0.0a1
# !pip install networkx==3.2.1
# !mkdir plots
# !mkdir results
# os.kill(os.getpid(), 9)

In [None]:
## Undirected Network models
from netin.models import ...
from netin.models import ...
...

In [None]:
## Directed Network models
from netin.models import ...
...

In [None]:
## Janus: A Bayesian approach for hypothesis testing on edge formation
from ... import JanusModelFitting

In [None]:
## Utils
from netin.utils import io

## Constants

In [None]:
PLOTS = 'plots/'        # where to store the plots
OUTPUT_DIR = 'results/' # where to store the evidence values
io.validate_dir(PLOTS)
io.validate_dir(OUTPUT_DIR)

## Task 1. Generate a Synthetic (Undirected or Directed) Graph
This graph will serve as your "empirical" input data.

In [None]:
# Network properties
...

In [None]:
m_graph = ...Model(..., seed=seed)
m_graph = m_graph.simulate()

## Tasks 2 and 3. Generate hypotheses and compute their marginal likelihoods
Hint:  
```python
JanusModelFitting(graph: Graph,
                  is_global: bool = True,
                  k_max: int = 10,
                  k_log_scale: bool = True,
                  **attr)
```

In [None]:
# Janus' parameters
is_global = False
k_max = 10
k_log_scale = False
verbose = False

In [None]:
j = JanusModelFitting(...)

### Default hypotheses
Hint:
```python
h = j.get_uniform_hypothesis() -> Hypothesis
e = j.generate_evidences(h: Hypothesis)
```

#### Uniform

In [None]:
# Uniform (all nodes are equally likely to be connected to each other)
h = ...
e = ...
j.add_evidences(h.name, e)
del(e)
del(h)

#### Self-loop

In [None]:
# Self-loop hypothesis (only diagonal)
h = ...
e = ...
j.add_evidences(h.name, e)
del(e)
del(h)

#### Data

In [None]:
# Data hypothesis (upper bound)
h = ...
e = ...
j.add_evidences(h.name, e)
del(e)
del(h)

### Link formation hypotheses (belief-based)
Hint:
```python
j.model_fitting_belief_based(m: netin.models.*, first_mover_bias:bool)
```

*Disclaimer: This method only supports undirected networks.*

In [None]:
# PA model
e = j.model_fitting_belief_based(PAModel, first_mover_bias=False)
name = (*e,)[0]
e[f"{name}"] = e.pop(name)
j.update_evidences(e)

In [None]:
# PA model accounting for node age
e = j.model_fitting_belief_based(PAModel, first_mover_bias=True)
name = (*e,)[0]
e[f"{name}_FMB"] = e.pop(name)
j.update_evidences(e)

In [None]:
# PAH model
# This might take a while, as it generates 121 hypotheses (multiple combinations of h_m and h_M)
# It returns only the best one (the one with the highest marginal likelihood score)
e = ...
j.update_evidences(e)

## Task 4. Compare hypotheses

### Evidence
Marginal likelihood

In [None]:
j.

### Bayes factor
Compared against the uniform hypothesis
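
As a rough illustration of this comparison, the sketch below computes Bayes factors against a baseline hypothesis. It assumes the evidences are available as log values, in which case the ratio of evidences becomes a difference. The numbers, the hypothesis names, and the helper `log_bayes_factors` are hypothetical and not taken from `netin`.

```python
# Hypothetical log-evidence values, for illustration only
log_evidence = {"uniform": -120.0, "PA": -110.5, "self_loop": -300.0}


def log_bayes_factors(log_evidence, baseline="uniform"):
    """Log Bayes factor of each hypothesis relative to the baseline hypothesis."""
    base = log_evidence[baseline]
    return {name: value - base for name, value in log_evidence.items()}


log_bf = log_bayes_factors(log_evidence)

# Hypotheses ordered from most to least plausible relative to the baseline
ranking = sorted(log_bf, key=log_bf.get, reverse=True)
```

A positive log Bayes factor means the hypothesis explains the data better than the uniform baseline, a negative one means it explains the data worse, and the baseline itself is always at zero.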

In [None]:
# Plots the Bayes factors, i.e., the evidence of each hypothesis divided by the evidence of the uniform hypothesis
j.

## Task 5. Store all evidence scores

In [None]:
# Stores the dictionary with all evidence scores to disk
j.