# Exercise #2: Auditing Node Rankings in Directed Networks

## Overview

In this exercise, we will explore how network structure, particularly the mechanisms of edge formation, impacts node ranking algorithms. Node rankings help determine the importance or relevance of nodes in a network, with applications ranging from social networks to citation networks. We will specifically focus on **PageRank**, a widely used algorithm for ranking nodes based on their centrality.

Our goal is to audit how **majority** and **minority** groups are represented in the top-k rankings of PageRank. A real-world example of this issue is the ranking of scholars based on citation or collaboration networks. For instance, how do men and women rank in the top-k of a PageRank algorithm, and how does this compare to their overall representation in the population?
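
To make the ranking step concrete, here is a minimal sketch of PageRank by power iteration on a toy directed edge list. It is an illustration under stated assumptions, not the implementation used by `netin` or `networkx`: the function name `pagerank`, the damping factor `alpha=0.85`, and the toy graph are all chosen here for demonstration.

```python
def pagerank(edges, nodes, alpha=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a directed edge list of (source, target) pairs."""
    out_links = {v: [] for v in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread uniformly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - alpha) / n + alpha * dangling / n for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = alpha * rank[v] / len(out_links[v])
                for target in out_links[v]:
                    new_rank[target] += share
        converged = sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy citation network (hypothetical): b, c and d cite a; a cites only b.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("b", "a"), ("c", "a"), ("d", "a"), ("a", "b")]
toy_scores = pagerank(toy_edges, toy_nodes)
order = sorted(toy_nodes, key=toy_scores.get, reverse=True)
print(order)
```

In this toy graph, node `a` collects the most incoming links and ends up first, while `b` ranks second only because the top node points to it. This is the kind of structural effect the audit looks at.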

### Key Concepts:
1. **Node Ranking**: Ranking nodes based on their importance using algorithms like degree centrality or PageRank.
2. **Disparity**: The relationship between inequality (distribution of rankings) and inequity (representation of minority nodes in the top-k rankings).
    - **Inequality**: Measured by the Gini coefficient of the PageRank distribution.
    - **Inequity**: The representation of minority nodes in the top-k.
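
As a rough illustration of the inequality measure, the sketch below computes a Gini coefficient from a list of non-negative scores using the standard sorted-data formula. The function name `gini` and the toy inputs are assumptions made for this example; the library's built-in function may differ in details such as normalization.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Sorted-data formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, i = 1..n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


equal_scores = gini([1, 1, 1, 1])         # every node has the same score
concentrated_scores = gini([0, 0, 0, 1])  # one node holds all the score
print(equal_scores, concentrated_scores)
```

With this unnormalized formula, the maximum for `n` values is `(n - 1) / n`, so the concentrated case with four nodes yields 0.75 rather than 1.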

We will use the **DPAHModel** to generate multiple synthetic directed networks and calculate **disparity scores** (inequality and inequity) to understand how these networks treat minority nodes in comparison to majority nodes.

This approach was published in [Espín-Noboa et al. (2022)](https://www.nature.com/articles/s41598-022-05434-1) in *Scientific Reports*.

## Task

1. **Generate Synthetic Networks**: Use the `DPAHModel` to create multiple synthetic directed networks with varying parameters.
2. **Compute centrality metrics**: Rank the nodes in each network using a centrality metric, e.g., the PageRank algorithm.
3. **Get to know your data visually!**: Plot the types of edges and degree distribution to see any patterns given the characteristics of the network.
4. **Compute Disparity Scores**:
   - Calculate the **Gini coefficient** of the PageRank distribution to measure **inequality**.
   - Analyze the **representation** of minority nodes in the top-k PageRank rankings to measure **inequity**.
5. **Plot and Compare**: Visualize the disparity scores across the networks to see how inequality and inequity vary based on network structure.
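
The representation of minority nodes in the top-k can be sketched as follows. This is a simplified version that takes the top k% of nodes by score and ignores ties; the function name `minority_fraction_in_top_k`, the labels `"M"`/`"m"` as plain strings, and the toy data are assumptions made for illustration only.

```python
import math


def minority_fraction_in_top_k(scores, labels, k_percent, minority="m"):
    """Fraction of minority nodes among the top k% of nodes ranked by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_n = max(1, math.ceil(len(ranked) * k_percent / 100))
    top = ranked[:top_n]
    return sum(labels[v] == minority for v in top) / len(top)


# Hypothetical toy data: 10 nodes, 3 of them in the minority group ("m").
toy_scores = dict(enumerate([0.25, 0.20, 0.15, 0.10, 0.08, 0.07, 0.06, 0.04, 0.03, 0.02]))
toy_labels = {0: "M", 1: "M", 2: "m", 3: "M", 4: "M", 5: "m", 6: "M", 7: "M", 8: "m", 9: "M"}

for k in (20, 50, 100):
    print(k, minority_fraction_in_top_k(toy_scores, toy_labels, k))
```

Here the minority makes up 30% of the population but is absent from the top 20% and holds only one of five places in the top 50%, which is what an under-representation looks like in this measure.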

## Instructions

1. Use the provided function to generate networks using the `DPAHModel`.
2. Compute a node centrality metric, e.g., `pagerank` or `in_degree`, for each network.
3. Use the built-in function to compute the inequality (Gini coefficient) of the PageRank distribution.
4. Use another built-in function to compute the inequity (ME: mean error) of the representation of minority nodes in top-k ranks.
5. Plot the disparity scores (inequality and inequity) for comparison.
6. BONUS: Disentangle the effects of homophily, preferential attachment, and directed links.

## Expected Outcome

By the end of this exercise, you will have a deeper understanding of how different network structures influence node rankings, and how inequality and inequity manifest in these rankings. You will also learn to audit algorithmic outcomes in the context of network science.

___

In [None]:
# ### If running this on Google Colab, run the following lines:
# import os
# !pip install netin==2.0.0a1
# !pip install networkx==3.2.1
# !wget -nc https://raw.githubusercontent.com/snma-tutorial/ecmlpkdd2024/main/exercises/helper.py
# !mkdir plots
# os.kill(os.getpid(), 9)

## Dependencies

In [None]:
## Directed Network models
# import the models that generate directed networks with:
# - only preferential attachment
# - only homophily
# - both, preferential attachment and homophily
...

In [None]:
## Utils
from netin import viz
from netin.utils import io
from netin.stats import ranking 
from netin.stats import distributions
from netin.utils import constants as const
from netin.stats import networks

In [None]:
import pandas as pd

In [None]:
## Helper with additional functions
%load_ext autoreload
%autoreload 2

import helper

## Constants

In [None]:
PLOTS = 'plots/'     # where to store the plots
EXID = 2                # exercise id to name the plot files
io.validate_dir(PLOTS)

## Task 1. Generate Synthetic Directed Graphs

In [None]:
### Fix some parameters of the networks

N = ...      # number of nodes
d = ...      # edge density (fraction of all possible directed edges that are realized)
             # Hint: Remember that the final number of edges will be: e = d * N * (N - 1)
f_m = ...    # fraction of minority group
plo_M = ...  # powerlaw out_degree exponent of the majority group (activity)
plo_m = ...  # powerlaw out_degree exponent of the minority group (activity)
seed = ...   # random seed (reproducibility)
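
The hint about the final number of edges can be checked with a short calculation. The helper name `expected_edges` and the values 200 and 0.01 are hypothetical, chosen only to make the arithmetic easy to follow; they are not the intended answers for the parameters above.

```python
def expected_edges(n_nodes, density):
    """Directed edges implied by a density, excluding self-loops: d * n * (n - 1)."""
    return round(density * n_nodes * (n_nodes - 1))


# Hypothetical example: 200 nodes with density 0.01 gives 0.01 * 200 * 199 edges.
example_edges = expected_edges(200, 0.01)
print(example_edges)
```

Because the edge count grows with `n * (n - 1)`, doubling the number of nodes at a fixed density roughly quadruples the number of edges.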

In [None]:
# DPAH graphs:
# Homophilic h > 0.5
# Neutral h = 0.5
# Heterophilic h < 0.5

# Generate 9 directed graphs with both preferential attachment and homophily
# Add each of them to the graph_models list

graph_models = []

...
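
One way to organize nine configurations is a 3x3 grid over the homophily of the majority and minority groups. The sketch below only enumerates parameter pairs and does not build any model. The values `0.2`, `0.5`, `0.8`, the dictionary keys, and the helper `regime` are assumptions for illustration.

```python
from itertools import product

# Hypothetical homophily values: heterophilic, neutral, homophilic.
h_values = [0.2, 0.5, 0.8]

# Every combination of majority (h_M) and minority (h_m) homophily: 3 x 3 = 9 settings.
param_grid = [{"h_M": h_M, "h_m": h_m} for h_M, h_m in product(h_values, repeat=2)]


def regime(h):
    """Name the mixing regime implied by a homophily value."""
    if h > 0.5:
        return "homophilic"
    if h < 0.5:
        return "heterophilic"
    return "neutral"


for params in param_grid:
    print(params, regime(params["h_M"]), regime(params["h_m"]))
```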

## Task 2. Compute Centrality metrics

In [None]:
# Generate the node metadata dataframe for each graph
# Add them to the metadata list

metadata = []
for m in graph_models:
    df = ...
    df.name = helper.get_title(df, m.SHORT, m.f_m, m.h_M, m.h_m)
    metadata.append(df)

In [None]:
# Inspect the content of the graph's metadata
metadata[0].head()

## Task 3. Getting to know the data visually

In [None]:
### Setting the look & feel
viz.reset_style()
viz.set_paper_style()

In [None]:
### Plotting all graphs at once
### Showing 3 graphs per row

fn = io.path_join(PLOTS, f'{EXID}_all_graphs.pdf')
viz.plot_graph(graph_models,
               nc = 3,
               cell_size = 2.0,
               wspace = 0.1,
               ignore_singletons=True,
               fn = fn)

In [None]:
### Plot edge counts for each graph

fn = io.path_join(PLOTS, f'{EXID}_edge_types.pdf')
helper.plot_edge_type_counts(graph_models, 
                             figsize = (12,5),
                             width_bar = 0.08,
                             nc_legend = 3,
                             loc = 'best',
                             fn=fn)
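
To see what an edge-type count measures, the following sketch tallies directed edges by the group of the source and the target node (`MM`, `Mm`, `mM`, `mm`). It is a standalone illustration: the function `edge_type_counts` and the toy graph are hypothetical and unrelated to how `helper.plot_edge_type_counts` is implemented.

```python
from collections import Counter


def edge_type_counts(edges, labels):
    """Count directed edges by (source group + target group), e.g. 'mM' = minority -> majority."""
    return Counter(labels[source] + labels[target] for source, target in edges)


# Hypothetical toy graph: nodes 0 and 1 are majority ("M"), nodes 2 and 3 are minority ("m").
toy_labels = {0: "M", 1: "M", 2: "m", 3: "m"}
toy_edges = [(0, 1), (1, 0), (2, 0), (3, 0), (2, 3), (0, 2)]
toy_counts = edge_type_counts(toy_edges, toy_labels)
print(dict(toy_counts))
```

Comparing the within-group counts (`MM`, `mm`) against the between-group counts (`Mm`, `mM`) gives a first visual impression of whether a network looks homophilic or heterophilic.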

In [None]:
metadata[0].head()

In [None]:
#### Set the metric of interest (network property of the node)
#### in_degree, out_degree, (degree for undirected), clustering, betweenness, etc. (see metadata)
metric = 'pagerank'

In [None]:
### Plot the distribution of the chosen metric (e.g., in_degree) for the whole graph
### Hint: Check out the dataframe. Which column has the in_degree of the node?

kind = 'pdf'

fn = io.path_join(PLOTS, f'{EXID}_{metric}_distribution.pdf')
viz.plot_powerlaw_fit(data = metadata,
                      col_name = metric,
                      kind = kind,
                      sharex = True, 
                      sharey = True,
                      cell_size = (2.5,2.5),
                      wspace = 0.1,
                      loc = 3,
                      nc = 3,
                      fn = fn)
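
For intuition about what a distribution plot of kind `'pdf'` summarizes, here is a minimal sketch of an empirical PDF and the complementary CDF for a list of node values. The function names and the toy in-degree list are assumptions; the power-law fitting done by the plotting function is not reproduced here.

```python
from collections import Counter


def empirical_pdf(values):
    """Share of observations at each distinct value, in increasing order of value."""
    counts = Counter(values)
    n = len(values)
    return {x: c / n for x, c in sorted(counts.items())}


def empirical_ccdf(values):
    """P(X >= x) for each distinct value x."""
    ccdf, remaining = {}, 1.0
    for x, p in empirical_pdf(values).items():
        ccdf[x] = remaining
        remaining -= p
    return ccdf


# Hypothetical in-degrees: most nodes have few incoming links, one node has many.
toy_in_degrees = [1, 1, 1, 1, 2, 2, 3, 8]
toy_pdf = empirical_pdf(toy_in_degrees)
toy_ccdf = empirical_ccdf(toy_in_degrees)
print(toy_pdf)
print(toy_ccdf)
```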

In [None]:
### Plot the distribution of the chosen metric (e.g., in_degree) for each group
### Hint: Check out the dataframe. Which column has the class of the node?
### M for majority, and m for minority.

hue = 'real_label'

fn = io.path_join(PLOTS, f'{EXID}_{metric}_distribution_by_{hue}.pdf')
viz.plot_powerlaw_fit(data = metadata,
                      col_name = metric,
                      kind = kind,
                      hue = hue,
                      sharex = True, 
                      sharey = True,
                      cell_size = (2.5,2.5),
                      wspace = 0.1,
                      loc = 1,
                      nc = 3,
                      fontsize = 9,
                      fn = fn)

## Task 4. Compute disparity scores

In [None]:
### Smoothness (beta): tolerance around zero within which the ME (mean error: expected f_m - observed f_m) is treated as negligible
beta = const.INEQUITY_BETA
beta
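
The role of a smoothness threshold can be sketched as a tolerance band around zero. The sign convention (observed fraction minus overall fraction), the class labels, the function names, and the value `0.05` are all assumptions made for this illustration. They are not taken from the library, so check the built-in functions for the actual definitions.

```python
def mean_error(observed_fractions, f_m):
    """Average of (minority fraction observed in each top-k) - (overall minority fraction)."""
    return sum(f - f_m for f in observed_fractions) / len(observed_fractions)


def inequity_class(me, beta):
    """Label the mean error relative to the tolerance band [-beta, beta]."""
    if me > beta:
        return "over-represented"
    if me < -beta:
        return "under-represented"
    return "fair"


# Hypothetical numbers: the minority is 30% of all nodes but holds 0%, 20% and 30%
# of the top-20%, top-50% and top-100% ranks, respectively.
toy_me = mean_error([0.0, 0.2, 0.3], f_m=0.3)
toy_class = inequity_class(toy_me, beta=0.05)
print(round(toy_me, 4), toy_class)
```

A small non-zero mean error inside the band is treated as noise rather than as evidence of a disparity, which is why the threshold matters when comparing many networks.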

In [None]:
### Inspect the disparity scores (Gini and ME) for each network

df_disparity = pd.DataFrame(columns=['model','params','inequality','inequity','inequity_class'])

for df in metadata:
    f_m = df.query("real_label == @const.MINORITY_LABEL").shape[0] / df.shape[0]
    
    inequity, inequality = distributions...(...) # disparity
    inequity_class = ranking....(...) # inequity class
    
    model_name = df.name.replace('$_{','').replace('}$','')
    model_name, params = model_name.split('\n')
    tmp = pd.DataFrame({'model':model_name, 
                        'params':params, 
                        'inequality':inequality, 
                        'inequity':inequity, 
                        'inequity_class':inequity_class}, index=[0])
    
    df_disparity = pd.concat([df_disparity, tmp], ignore_index=True)

In [None]:
df_disparity

## Task 5. Plot and compare

### Inequity: Minority fraction in top-k

In [None]:
### Plot the inequity of the 'pagerank' distribution (ME: mean error)
### It shows the fraction of minority nodes (y-axis) at each top-k rank (x-axis).
### ME is then computed from the differences between the fraction of minority nodes
### in each top-k and the actual fraction of minorities.

fn = io.path_join(PLOTS, f'{EXID}_{metric}_inequity.pdf')

viz.plot_fraction_of_minority(metadata, 
                              col_name=metric, 
                              sharex=True, sharey=True,
                              cell_size = (2.5,2.5),
                              wspace = 0.1,
                              nc = 3,
                              fn = fn)

### Inequality: Gini coefficient of distribution

In [None]:
### Plot the inequality of the 'pagerank' distribution
### It shows the Gini coefficient in each top-k.
### The global Gini is the Gini coefficient at top-100% (all nodes).

fn = io.path_join(PLOTS, f'{EXID}_{metric}_inequality.pdf')

viz.plot_gini_coefficient(metadata, 
                          col_name = metric, 
                          sharex = True, sharey = True,
                          nc = 3, 
                          wspace = 0.08, 
                          cell_size = (1.9,2.2),
                          fn = fn)

### Disparity: Inequality vs. Inequity

In [None]:
### Plot the disparity of the 'pagerank' distribution
### It shows the inequity (ME) vs. inequality (Gini)

fn = io.path_join(PLOTS, f'{EXID}_{metric}_disparity.pdf')

viz.plot_disparity(metadata, 
                   col_name = metric, 
                   sharex = True, sharey = True,
                   nc = 3, 
                   wspace = 0.08, 
                   cell_size = (1.9,2.2),
                   fn = fn)

## Bonus: Disentangling the effect of PA and H in ranking disparities

In [None]:
### Homophily values to test
h_mm = ...
h_MM = ...
metadata_models_directed = []

### Graphs

## Only preferential attachment
mg = ...
mg.simulate()
md = networks.get_node_metadata_as_dataframe(mg.graph) 
md.name = mg.SHORT
metadata_models_directed.append(md)

## Only homophily
mg = ...
mg.simulate()
md = networks.get_node_metadata_as_dataframe(mg.graph) 
md.name = mg.SHORT
metadata_models_directed.append(md)

## Both, preferential attachment and homophily
mg = ...
mg.simulate()
md = networks.get_node_metadata_as_dataframe(mg.graph) 
md.name = mg.SHORT
metadata_models_directed.append(md)


In [None]:
### Visualize
fn = io.path_join(PLOTS, f'{EXID}_{metric}_disparity_DPA_DH_DPAH.pdf')
viz.plot_disparity(metadata_models_directed, 
                   col_name = metric, 
                   sharex = True, sharey = True,
                   nc = 3, 
                   wspace = 0.08, 
                   cell_size = (2.2,2.6),
                   suptitle = "Effects of homophily and preferential attachment",
                   fn = fn)

## Bonus: Disentangling the effect of directed links

In [None]:
### Undirected networks
from netin.models import PAModel
from netin.models import HomophilyModel
from netin.models import PAHModel

In [None]:

### Add graphs (similar to the ones above) but without directed links
m = 2  # number of edges each new node attaches to existing nodes
metadata_models_undirected = []

## Only preferential attachment
mg = ...
mg.simulate()
md = networks.get_node_metadata_as_dataframe(mg.graph) 
md.name = mg.SHORT
metadata_models_undirected.append(md)

## Only homophily
mg = ...
mg.simulate()
md = networks.get_node_metadata_as_dataframe(mg.graph) 
md.name = mg.SHORT
metadata_models_undirected.append(md)

## Both, preferential attachment and homophily
mg = ...
mg.simulate()
md = networks.get_node_metadata_as_dataframe(mg.graph) 
md.name = mg.SHORT
metadata_models_undirected.append(md)


In [None]:
### Visualize
fn = io.path_join(PLOTS, f'{EXID}_{metric}_disparity_directed_vs_undirected.pdf')
viz.plot_disparity(metadata_models_directed + metadata_models_undirected, 
                   col_name = metric, 
                   sharex = True, sharey = True,
                   nc = 3, 
                   wspace = 0.08, 
                   cell_size = (2.2,2.6),
                   suptitle = "Effects of directed links",
                   fn = fn)