# [NML-25] Notebook 1: Introduction to the Python toolbox

Responsible TA: [Jeremy Baffou](https://people.epfl.ch/jeremy.baffou)

# Instructions


**Expected output:**

Throughout the different lab sessions, you will have coding and theoretical questions. Coding exercises shall be solved within the specified space:
```python
# Your solution here ###########################################################
...
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
Sometimes we provide variable names, such as `x = ...`; do not change names and stick to hinted typing, as they will be reused later.
Within the solution space, you can declare any other variable or function that you might need, but anything outside these lines shall not be changed, or it will invalidate your answers.

Theoretical questions shall be answered in the markdown cell that follows them. The first line will be:
```markdown
**Your answer here:**
...
```

**Solutions:**
* Your code should be self-contained in the `.ipynb` file. The solution to the exercises will be provided in an external `.ipynb` file.

* Try to make your code clean and readable; it is good training for the project. Provide meaningful variable names and comment where needed.

* You cannot import any library other than those we imported, unless explicitly stated.

# Objective

The goal of this notebook is to introduce (or refresh) elementary Python libraries and graph-related toolboxes. The exercises can be divided into two sections:
* Elementary Python Toolbox
* Basics of graph modelling/analysis

The first part covers some basic Python libraries that will be used multiple times during the exercise sessions and the project, namely [Pandas](https://pandas.pydata.org/docs/), [NumPy](https://numpy.org/devdocs/user/index.html) and [Scikit-learn](https://scikit-learn.org/stable/user_guide.html). If you know those libraries, most tasks in this section will be straightforward. If you struggle with some aspects, take time to review the library documentation, as the next sessions will rely on these tools.

The second part of the notebook will introduce [Networkx](https://networkx.org/documentation/latest/tutorial.html), a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. We will start by building graphs from edge lists and from features, and then we will explore some basic graph properties. A tutorial on networkx can be found [here](https://networkx.github.io/documentation/stable/tutorial.html).



# Section I: Introduction to elementary Python toolbox

## Dataset

We will use the [Palmer Archipelago (Antarctica) penguin data](https://github.com/allisonhorst/palmerpenguins/tree/main) for this exercise session.

We provide a simplified version of the data in `penguins_size.csv`

Dataset reference: https://doi.org/10.5281/zenodo.3960218

In [None]:
# Plotting functions
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()

## Part 1: Pandas, to manipulate tabular data

In [None]:
import pandas as pd

### Question 1.1: Data loading and examination

**1.1.1** Read the `penguins_size.csv` file into a Pandas DataFrame, using the `read_csv` function.

In [None]:
# Your solution here ###########################################################
penguins: pd.DataFrame = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**1.1.2** Extract the first five rows of the data frame and 10 random ones, then concatenate and display them. You can use the built-in `display` function.

In [None]:
# Your solution here ###########################################################

display(...)

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**1.1.3** Look at the fourth entry: it is missing information, which is filled with `NaN` (not a number) values.
Let's drop all rows with missing values, then display the first 10 rows.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**1.1.4** Compute and display the mean and std of `culmen_length_mm` and `body_mass_g`.

In [None]:
# Your solution here ###########################################################

print("Mean values:")
...

print("Standard deviation:")
...

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**1.1.5** Examine statistics of all columns with the `describe` method.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**1.1.6** Plot a histogram of `body_mass_g`, split by `sex`.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
plt.show()

_(Note that one penguin seems to have a missing value for the sex entry, leading to the additional panel in the histogram.)_

**1.1.7** The [seaborn](https://seaborn.pydata.org/tutorial.html) library provides nicer visualization functionalities. Let's produce the same histogram with it. (Hint: Try to use the `hue` argument)

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
plt.show()

### Question 1.2: indexing and manipulation

NumPy allows manipulating vectors, matrices and higher order tensors as `arrays`. For instance vectors are 1d arrays, and matrices have two dimensions (rows and columns).
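
As a minimal sketch of what "dimensions" means here, the snippet below builds a vector, a matrix and a higher-order tensor and inspects their `ndim` and `shape`. The variable names and numbers are made up for illustration and are unrelated to the penguin data.

```python
import numpy as np

# A vector is a 1-d array; a matrix is a 2-d array (rows, columns).
vector = np.array([1.0, 2.0, 3.0])
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
# A higher-order tensor simply has more axes.
tensor = np.zeros((2, 3, 4))

print(vector.ndim, vector.shape)  # 1 (3,)
print(matrix.ndim, matrix.shape)  # 2 (2, 3)
print(tensor.ndim, tensor.shape)  # 3 (2, 3, 4)
```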

**1.2.1** Use the `loc` property to remove penguins without sex. (The `value_counts()` function should return only `MALE` and `FEMALE` entries)

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
penguins["sex"].value_counts()

**1.2.2** Make `sex` a boolean property, with value `True` for `"FEMALE"` and `False` for `"MALE"`.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
penguins["sex"].value_counts()

**1.2.3** In the next questions we will numerically encode the `island` and `species` properties. Let's start by identifying the sets of unique island and species names.

In [None]:
# Your solution here ###########################################################
islands: list[str] = ...
species: list[str] = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("islands:", islands)
print("species:", species)

**1.2.4** For each island and species, add a column with a boolean value indicating whether the penguin comes from said island or belongs to said species.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
display(penguins.loc[[13, 26, 39, 52]])

**1.2.5** In some cases, we might want to encode species as integers.
Use the `map` method and a dictionary mapping species to numbers to get a `y_species` vector.

In [None]:
# Your solution here ###########################################################
y_species: pd.Series = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
y_species

**1.2.6** Drop the `island` and `species` columns, since they are not needed anymore.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
assert 'species' not in penguins.columns and 'island' not in penguins.columns

In [None]:
penguins.head(10)

## Part 2: NumPy, scientific computing in Python



In [None]:
import numpy as np

### Question 2.1: Array Creation

**2.1.1** Pandas is built on top of NumPy. We can access the underlying array through the `values` attribute of a DataFrame. Let's put all columns but `body_mass_g` in the *design matrix* `x`, and the `body_mass_g` column in the *target vector* `y_mass`.

In [None]:
# Your solution here ###########################################################
x: np.ndarray = ...
y_mass: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.1.2** Let's inspect the `shape` of these two arrays.

In [None]:
# Your solution here ###########################################################
print("x shape:", ...)
print("y shape:", ...)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.1.3** Then let's check the first five rows of `x`.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.1.4** Notice that the `dtype` of `x` is `object`, which means that it contains multiple types (namely `float` and `bool`). Let's convert it to `float` and check the first rows again.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

### Question 2.2: Array manipulation


**2.2.1** Extract the values of `Dream` and `Gentoo` columns into two vectors. Convert them to boolean.

In [None]:
dream: np.ndarray
gentoo: np.ndarray
# Your solution here ###########################################################
dream, gentoo = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.2.2** Count how many penguins come from the Dream island using the `sum` method. Repeat for the Gentoo species.

In [None]:
# Your solution here ###########################################################
print("Dream's penguins:", ...)
print("Gentoo's penguins:", ...)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.2.3** You can use a boolean mask to extract values from an array.
Compute the average mass and std of Dream's penguins using the corresponding NumPy functions.
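
To make the idea of a boolean mask concrete, here is a small sketch on made-up numbers (the arrays `values` and `mask` are hypothetical and unrelated to the penguin data). Indexing with a boolean array keeps only the entries where the mask is `True`.

```python
import numpy as np

# Toy data (made up for illustration): five values and a boolean mask.
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
mask = np.array([True, False, True, False, True])

selected = values[mask]       # keeps only entries where mask is True
print(selected)               # [10. 30. 50.]
print(mask.sum())             # 3, since True counts as 1
print(selected.mean())        # 30.0
```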

In [None]:
# Your solution here ###########################################################
print("Average Dream's mass:", ...)
print("Dream's mass std:", ...)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.2.4** Now compute the average mass of Dream's penguins again, but without the `mean` function. Try using the scalar product between the mass vector and the Dream boolean mask.

In [None]:
# Your solution here ###########################################################
dream_avg_mass: float = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("Average Dream's mass:", dream_avg_mass)

**2.2.5** Compute the standard deviation as an inner product too, and compare with the values from the previous answers.

In [None]:
# Your solution here ###########################################################
dream_std_mass: float = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("Dream's mass std:", dream_std_mass)

### Question 2.3: Linear regression

Linear regression aims to find a weight vector $\mathbf w$ such that the target value $y_i$ can be retrieved as a weighted sum of the corresponding features $\mathbf x_i$, or in matrix notation
$$ \mathbf{X w} = \mathbf y. $$

Most of the time we cannot find an exact solution to this problem, therefore we introduce an error function and look for weights that minimize it.
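
As an illustration of "minimizing an error function", the sketch below fits a toy overdetermined system with `np.linalg.lstsq`, which minimizes the squared error $\|\mathbf{X w} - \mathbf y\|_2^2$. The numbers are made up, and `lstsq` is used here only for illustration; it is not the method requested in the questions that follow.

```python
import numpy as np

# Toy overdetermined system (made-up numbers): 4 samples, 2 features.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])

# No w satisfies X @ w = y exactly, so we minimize the squared error instead.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

errors = X @ w - y
print("weights:", w)
print("mean squared error:", np.mean(errors ** 2))
```

At the minimizer, the errors are orthogonal to the columns of `X`, which is one way to check that no other weight vector gives a smaller squared error.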

**2.3.1** Find a solution for `w` by solving a linear system with `np.linalg.solve`.
For this method to work you need as many equations as variables, thus as many samples as the number of variables. Choose them randomly with `np.random.choice`.

Note that by selecting random rows, your design matrix might become singular. You can use `try` to rerun until it works.
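
The following sketch shows, on made-up 2x2 systems, what "singular" means in practice: `np.linalg.solve` returns a solution for an invertible matrix and raises `np.linalg.LinAlgError` when one row is a multiple of another.

```python
import numpy as np

# Square, invertible system: np.linalg.solve returns the exact solution.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
w = np.linalg.solve(A, b)
print(w, np.allclose(A @ w, b))

# Singular system: the second row is twice the first, so no unique solution.
A_singular = np.array([[1.0, 2.0],
                       [2.0, 4.0]])
try:
    np.linalg.solve(A_singular, b)
    raised = False
except np.linalg.LinAlgError as err:
    raised = True
    print("LinAlgError:", err)
```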

In [None]:
n_samples, n_features = x.shape

while True:
    try:
        # Your solution here ###################################################

        w_solve: np.ndarray = ...

        # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        break
    except np.linalg.LinAlgError:
        pass

print("Weights:", w_solve)

**2.3.2** Define a function to compute the mean squared error ([MSE](https://en.wikipedia.org/wiki/Mean_squared_error)) between the real `y` and the predicted one.

In [None]:
def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Your solution here #######################################################
    return ...
    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.3.3** Use the previously computed weights to predict the penguins masses, then compute their MSE.

In [None]:
# Your solution here ###########################################################
mse_solve: float = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("MSE of random subproblem:", mse_solve)

**2.3.4** Randomly selecting samples is suboptimal as we ignore a significant part of the dataset. Let's look for a solution that uses all the data by using the pseudoinverse of `x`. (Hint: Remember the Linear Regression Equation)

In [None]:
# Your solution here ###########################################################
w_pinv: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.3.5** Compute the MSE of this solution's predictions.

In [None]:
# Your solution here ###########################################################
mse_pinv: float = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("MSE of pseudoinverse solution:", mse_pinv)

### Question 2.4: Broadcasting

In this section, we focus on a convenient way to manipulate arrays which allows parallelizing operations over the same input.
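
As a warm-up, here is a minimal sketch of broadcasting on made-up arrays (the names `a`, `b`, `pair_sums` and `centered` are hypothetical). NumPy aligns shapes from the last axis and stretches axes of length 1, so a single expression applies the same operation to many rows or columns at once.

```python
import numpy as np

a = np.array([0.0, 10.0, 20.0])        # shape (3,)
b = np.array([1.0, 2.0])               # shape (2,)

# (3, 1) and (2,) broadcast to (3, 2): every pair a[i] + b[j] in one call.
pair_sums = a[:, np.newaxis] + b
print(pair_sums.shape)                 # (3, 2)

# (3, 2) minus (2,) subtracts the column means from every row.
centered = pair_sums - pair_sums.mean(axis=0)
print(centered.mean(axis=0))           # [0. 0.]
```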

**2.4.1** Extract all island one-hot encodings from the data frame and convert them to boolean.

In [None]:
# Your solution here ###########################################################
islands_oh: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("islands shape", islands_oh.shape)

**2.4.2** Multiply the penguin masses by `islands_oh` to mask them in parallel.

In [None]:
try:
    # Your solution here ###########################################################
    masked_mass: np.ndarray = ...
    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
except ValueError as err:
    print("There's an ERROR:", err)

**2.4.3** The error indicates that arrays with incompatible shapes cannot be broadcast automatically! Using `np.newaxis`, add a dimension to `y_mass` and try again.

In [None]:
# Your solution here ###########################################################
masked_mass: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.4.4** Compute the average masses over the different islands by summing the masked masses along the corresponding axes.

In [None]:
# Your solution here ###########################################################
avg_masses: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("Average masses:")
print(dict(zip(islands, avg_masses)))

**2.4.5** Use broadcasting and masking to compute standard deviations for each island.

In [None]:
# Your solution here ###########################################################
std_masses: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("Mass standard deviations:")
print(dict(zip(islands, std_masses)))

### Question 2.5: K-Nearest neighbors

Let's implement a k-nearest neighbors (kNN) classifier to identify penguin species from their physical attributes.

For a query data point, kNN predicts its label as the most frequent one among those of the k closest samples from the training dataset. In this setting, we will use the Euclidean distance between points:
$$ d(\mathbf x_i, \mathbf x_j) = \sqrt{\sum_{n=1}^D (x_{in} - x_{jn})^2} $$
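
To fix ideas, the sketch below runs the kNN rule for a single query on a made-up 2-d dataset (all names and numbers are hypothetical). It is a plain, single-query illustration, not the vectorized implementation asked for in the questions below.

```python
import numpy as np

# Toy training set (made-up numbers): 5 points in 2-d with labels 0 or 1.
x_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1])
query = np.array([0.5, 0.5])
k = 3

# Euclidean distance from the query to every training point.
dists = np.sqrt(((x_train - query) ** 2).sum(axis=1))

# Labels of the k closest points, then a majority vote.
nearest = np.argsort(dists)[:k]
votes = np.bincount(y_train[nearest])
prediction = votes.argmax()
print(prediction)  # 0
```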

In the next cell, we prepare to split the data in training and test sets.

In [None]:
physical_attributes = [
    "culmen_length_mm",
    "culmen_depth_mm",
    "flipper_length_mm",
    "body_mass_g",
]

# We count the number of samples and set the amount of training data to 70% of that
n_samples = len(penguins)
samples_tr = int(n_samples * 0.7)

# We shuffle the indices of the data and take the first 70% to be in training
# Working with indices allows us to recognise which features go with which labels
rng = np.random.default_rng(11)
shuffled = rng.permutation(np.arange(n_samples))
idx_tr = shuffled[:samples_tr]
idx_te = shuffled[samples_tr:]

**2.5.1** Extract training and test features from the `penguins` data frame using the indices defined above and split the `y_species` series (computed in 1.2.5) accordingly.

You can use `DataFrame.iloc` to work with integer indexing in pandas.
Remember to extract arrays from data frames.

In [None]:
# Your solution here ###########################################################
x_tr: np.ndarray = ...
x_te: np.ndarray = ...
y_tr: np.ndarray = ...
y_te: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("Training features:", x_tr.shape)
print("Test features:", x_te.shape)

**2.5.2** Write a function that computes pairwise distances between all query points and training ones. Use broadcasting and sum along corresponding axes to get an efficient implementation.

In [None]:
def pairwise_distances(x_query: np.ndarray, x_tr: np.ndarray) -> np.ndarray:
    """Compute pairwise distances

    Args:
        x_query (np.ndarray): Array of shape (n_queries, n_features)
        x_tr (np.ndarray): Array of shape (n_samples, n_features)

    Returns:
        np.ndarray: Distances in array of shape (n_queries, n_samples)
    """
    # Your solution here #######################################################
    return ...
    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


pdists_te = pairwise_distances(x_te, x_tr)
print("Distances shape:", pdists_te.shape)

**2.5.3** For each query, identify the closest training samples using `np.argsort`. Use the `axis` argument to avoid iterating over the matrix.

In [None]:
# Your solution here ###########################################################
nearest_ngbs_te: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

print("5 nearest neighbors for query 4:", nearest_ngbs_te[4, :5])

**2.5.4** Write a function that takes an array of nearest neighbors for each query, together with training labels and the predefined $k$ and returns the predicted labels for queries.

In [None]:
def predict_labels(nearest_ngbs: np.ndarray, y_tr: np.ndarray, k: int) -> np.ndarray:
    """Predict labels from k-nearest neighbors

    Args:
        nearest_ngbs (np.ndarray): Array of neighbors indices, sorted by distance.
            Shape: (n_queries, n_samples)
        y_tr (np.ndarray): Training labels of shape (n_samples,)
        k (int): number of nearest neighbors to consider

    Returns:
        np.ndarray: Predicted labels of shape (n_queries,)
    """
    # Your solution here #######################################################
    # Extract nearest ngbs labels
    ...

    # Count label occurrences for each query
    ...

    # Return most frequent occurrence
    return ...

    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


print(
    "Predicted labels of the first 10 queries:",
    predict_labels(nearest_ngbs_te[:10], y_tr, 5),
)
print("Real labels:                             ", y_te[:10])

**2.5.5** Compute the precision of kNN's predictions for both the training and test datasets for all k between 1 and 30.

In [None]:
# Your solution here ###########################################################
precisions_te: list[float] = ...

precisions_tr: list[float] = ...

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

# Plot
fig, ax = plt.subplots(figsize=(8, 3), dpi=100)
ax.plot(precisions_tr, label="Train")
ax.plot(precisions_te, label="Test")
ax.set(title="My kNN implementation", ylabel="precision", xlabel="k value")
plt.legend()
plt.show()

## Part 3: Scikit-learn, machine learning toolbox

Scikit-learn provides implementations for many machine learning algorithms, which all share the basic common interface of `fit` and `predict` methods. Let's compare it to our kNN implementation.
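
To illustrate the shared `fit`/`predict` interface, here is a minimal sketch using a different estimator, `DecisionTreeClassifier`, on made-up data. The choice of estimator and the numbers are assumptions for illustration only; the same two calls apply to the other Scikit-learn classifiers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up): one feature, two well-separated classes.
x_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Every scikit-learn estimator follows the same two steps.
model = DecisionTreeClassifier(random_state=0)
model.fit(x_train, y_train)                      # learn from training data
predictions = model.predict(np.array([[0.5], [11.5]]))
print(predictions)
```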

**3.1** Import the k nearest neighbor classifier from Scikit-learn.

In [None]:
# Your solution here ###########################################################
from sklearn ...

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**3.2** Import the precision function.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**3.3** Compute train and test precision for all k between 1 and 30. Use the "micro" option for the precision score.

In [None]:
# Your solution here ###########################################################

sk_prec_tr: list[float] = ...
sk_prec_te: list[float] = ...

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**3.4** Plot scores and compare to your implementation.

In [None]:
fig, ax = plt.subplots(figsize=(8, 3), dpi=100)
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ax.set(title="SkLearn kNN implementation", ylabel="precision", xlabel="k value")
plt.legend()
plt.show()

# Section II: Introduction to Networkx

In [None]:
import pandas as pd
import numpy as np
import networkx as nx
import os
from scipy.spatial.distance import pdist, squareform
import matplotlib.pyplot as plt

The following line is a [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html). It enables plotting inside the notebook.

In [None]:
%matplotlib inline

You may also try `%matplotlib notebook` for a zoomable version of plots.

## Part 1: Building Graphs from Edge Lists

### Dataset

We will play with a partial segment of the Tree of Life. The full version is available here: [Open Tree of Life](https://tree.opentreeoflife.org/about/taxonomy-version/ott3.0). In this tutorial, the dataset is reduced to the first 999 taxa (starting from the root node), which can be found in `data/taxonomy_small.tsv`.

![Public domain, https://en.wikipedia.org/wiki/File:Phylogenetic_tree.svg](https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/Phylogenetic_tree.svg/800px-Phylogenetic_tree.svg.png)

In [None]:
# If needed, change this variable to the relative path to the taxonomy file
path ='./data/taxonomy_small.tsv'

In [None]:
tree_of_life = pd.read_csv(path, sep=r'\t\|\t?', encoding='utf-8', engine='python')

### Exploring the dataset

For a quick recap, we will go through a guided preliminary exploration of the dataset. We will use what we learned about `pandas` in Section I.

We start by looking at the head of the dataframe.
The description of the entries is given here:
https://github.com/OpenTreeOfLife/reference-taxonomy/wiki/Interim-taxonomy-file-format

In [None]:
tree_of_life.head()



Let us now drop some columns.

In [None]:
tree_of_life = tree_of_life.drop(columns=['sourceinfo', 'uniqname', 'flags','Unnamed: 7'])
tree_of_life.head()

Note that Pandas infers the type of values inside each column (in this case: int, float, string and string - run below).
The parent_uid column has float values because there was a missing value, converted to `NaN`.
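
The effect of a missing value on the inferred type can be reproduced on a tiny made-up example (the series `complete` and `with_gap` are hypothetical): an integer column stays integer, but a single missing entry turns it into a float column because `NaN` is a float.

```python
import pandas as pd

complete = pd.Series([1, 2, 3])
with_gap = pd.Series([1, 2, None])   # None is stored as NaN

print(complete.dtype)   # integer dtype
print(with_gap.dtype)   # float64, because NaN is a float
print(with_gap.isna().sum())
```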

In [None]:
print("Types of the columns in the dataframe:")
for col in tree_of_life.columns:
    print(f"{col}: {tree_of_life[col].dtype}")

To order the data, we can use the function `sort_values()`.

In [None]:
tree_of_life.sort_values(by='name').head()

*Remark:* Some functions do not change the dataframe (option `inplace=False` by default). As you can see below, the `tree_of_life` dataframe remains unchanged.
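
The same behaviour can be seen on a small made-up table (the dataframe `df` is hypothetical): `sort_values` returns a new, sorted dataframe and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame({"name": ["b", "c", "a"], "value": [2, 3, 1]})

sorted_df = df.sort_values(by="name")   # returns a new, sorted DataFrame
original_order = df["name"].tolist()
print(original_order)                   # ['b', 'c', 'a'], original order kept
print(sorted_df["name"].tolist())       # ['a', 'b', 'c']

df = df.sort_values(by="name")          # reassign to keep the sorted result
```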

In [None]:
tree_of_life.head()

Which classes of `rank` do we have?

In [None]:
tree_of_life['rank'].unique()

Can we filter only `species` entries?

In [None]:
tree_of_life[tree_of_life['rank'] == 'species'].head()

Ok, let us now find how many species entries we have!

In [None]:
len(tree_of_life[tree_of_life['rank'] == 'species'])

For all the possible `rank`s:

In [None]:
tree_of_life['rank'].value_counts()

Now, it is your turn!

### Question 1.1: Operations on columns

**1.1.1** Display the entry with name 'Archaea'.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**1.1.2** Display the entry of its parent. _(Hint: Remember how the `parent_uid` is used to link entities in the tree)_

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#### Extracting relevant information for the graph

Let us build the adjacency matrix of the graph. For that we need to reorganize the data. First we separate the nodes and their properties from the edges.

In [None]:
nodes = tree_of_life[['uid', 'name','rank']]
edges = tree_of_life[['uid', 'parent_uid']]

When using an adjacency matrix, nodes are indexed by their row or column number and not by a `uid`. Let us create a new index for the nodes.

In [None]:
nodes.head()

In [None]:
# Create a column for node index.
nodes.reset_index(level=0, inplace=True)
nodes = nodes.rename(columns={'index':'node_idx'})
nodes.head()

In [None]:
# Create a conversion table from uid to node index.
uid2idx = nodes[['node_idx', 'uid']]
uid2idx = uid2idx.set_index('uid')
uid2idx.head()

For the edges, we should leverage a powerful function of `pandas`: the `join` function.
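
Before applying it to the edges, here is a minimal sketch of how `join` works on two made-up tables (`items` and `owners` are hypothetical). With the `on` argument, the values of a column in the left table are matched against the index of the right table.

```python
import pandas as pd

# Toy tables (made up): items refer to owners through an id column.
items = pd.DataFrame({"item": ["pen", "cup", "hat"], "owner_id": [20, 10, 20]})
owners = pd.DataFrame({"owner_id": [10, 20], "owner": ["Ann", "Bob"]})

# join matches the values of items["owner_id"] against the index of the lookup.
lookup = owners.set_index("owner_id")
joined = items.join(lookup, on="owner_id")
print(joined)
```

This is why the conversion table above is indexed by `uid` before being joined.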

In [None]:
edges.head()

In [None]:
# Add a new column, matching the uid with the node_idx.
edges = edges.join(uid2idx, on='uid')
edges.head()

In [None]:
# Do the same with the parent_uid.
edges = edges.join(uid2idx, on='parent_uid', rsuffix='_parent')
edges.head()

In [None]:
# Drop the uids.
edges_renumbered = edges.drop(columns=['uid','parent_uid'])

The `edges_renumbered` table is a list of renumbered edges connecting each node to its parent.

In [None]:
edges_renumbered.head()

#### Building the (unweighted and undirected) adjacency matrix

**1.1.3** We will use numpy to build this matrix. Note that we don't have edge weights here, so our graph is going to be unweighted.

In [None]:
n_nodes = len(nodes)
adjacency = np.zeros((n_nodes, n_nodes), dtype=int)

for idx, row in edges.iterrows():
  # Your solution here #########################################################
  pass
  #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

adjacency[:15, :15]

Congratulations, you have built the adjacency matrix!

#### Graph Creation

Let's create a graph object from the adjacency matrix using networkx.

In [None]:
# A simple command to create the graph from the adjacency matrix.
graph = nx.from_numpy_array(adjacency)

In addition, let us add some attributes to the nodes:

In [None]:
node_props = nodes.to_dict()

for key in node_props:
    nx.set_node_attributes(graph, node_props[key], key)

Let us check if it is correctly stored:

In [None]:
print(graph.nodes[3])
nodes.head()

### Question 1.2: Alternative ways to build the graph

**1.2.1** Build the graph directly from the `edges` table (without using the adjacency matrix). _(Hint: Create an empty graph object and iteratively add the edges)_

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [None]:
# sanity check
assert len(graph.nodes) == len(nodes)

**1.2.2** Build the graph from the initial `tree_of_life` table by directly iterating over the rows of this table (without building the adjacency matrix).

Store the final graph in the variable `graph`.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**1.2.3** Get the adjacency matrix with `nx.adjacency_matrix(graph)` and compare it with what we obtained previously.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#### Graph visualization

To conclude, let us visualize the graph. We will again take advantage of the readily available functions in `networkx`.

Draw the graph with two different [layout algorithms](https://en.wikipedia.org/wiki/Graph_drawing#Layout_methods).

In [None]:
nx.draw_spectral(graph)

In [None]:
nx.draw_spring(graph)

Save the graph to disk in the `gexf` format, readable by gephi and other tools that manipulate graphs. You may now explore the graph using [gephi](https://gephi.org/) and compare the visualizations.

In [None]:
# Define your saving path
savepath = './tree_of_life.gexf'

In [None]:
nx.write_gexf(graph, savepath)

## Part 2: Building Graphs from Data

It often happens in real life that, when we wish to represent data as a graph, we do not have access to the underlying ground truth network. In this case, we need to make some assumptions about the graph/network structure. We will now explore a very basic and intuitive approach for graph construction based on similarity between nodes.

### Import and explore the data

We will play with the famous Iris dataset. This dataset can be found in many places on the net and was first released at <https://archive.ics.uci.edu/ml/index.php>. For example it is stored on [Kaggle](https://www.kaggle.com/uciml/iris/), with many demos and Jupyter notebooks you can test (have a look at the "kernels" tab).

![Iris Par Za — Travail personnel, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=144395](https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/Iris_germanica_002.jpg/251px-Iris_germanica_002.jpg)

In [None]:
path = 'data/iris.csv'

In [None]:
iris = pd.read_csv(path)
iris.head()

The description of the entries is given here:
https://www.kaggle.com/uciml/iris/home

In [None]:
iris['Species'].unique()

In [None]:
# generate statistics
iris.describe()

### Build a graph from the features

We are going to build a graph from these data. The idea is to represent iris samples (rows of the table) as nodes, with connections depending on their physical similarity.

The main question is how to define the notion of similarity between the flowers. For that, we need to introduce a measure of similarity. It should use the properties of the flowers and provide a positive real value for each pair of samples.

*Remark:* The value should increase with the similarity.

Let us separate the data into two parts: physical properties (`features`) and labels (`species`).

In [None]:
features = iris.loc[:, ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
species = iris.loc[:, 'Species']

In [None]:
features.head()

In [None]:
species.head()

### Similarity, distance and edge weight

You can define many similarity measures. One of the most intuitive and perhaps the easiest to program relies on the notion of distance. If a distance between samples is defined, we can compute the weight accordingly: if the distance is small, which means the nodes are similar, we want a strong connection between them (large weight).

#### Different distances

The cosine distance is a good candidate for high-dimensional data. It is defined as follows:
$$d(u,v) = 1 - \frac{u \cdot v} {\|u\|_2 \|v\|_2},$$
where $u$ and $v$ are two feature vectors.

The distance depends only on the angle $\theta$ formed by the two vectors, since $d(u,v) = 1 - \cos\theta$ (0 if they point in the same direction, 1 if orthogonal, 2 if they point in opposite directions).

Alternatives are the [$p$-norms](https://en.wikipedia.org/wiki/Norm_%28mathematics%29#p-norm) (or $\ell_p$-norms), defined as
$$d(u,v) = \|u - v\|_p,$$
of which the Euclidean distance is a special case, with $p=2$.
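
As a rough illustration of the definitions above, the sketch below evaluates these distances with `scipy.spatial.distance.pdist` on three made-up 2-D vectors. The toy array `points` is an assumption chosen for readability; it is not part of the Iris data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three hypothetical feature vectors (one per row), chosen for readability.
points = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [-1.0, 0.0]])

# pdist returns a condensed vector ordered as the pairs (0,1), (0,2), (1,2).
cosine = pdist(points, metric='cosine')
euclidean = pdist(points, metric='euclidean')
manhattan = pdist(points, metric='cityblock')  # p-norm with p = 1

print(cosine)     # orthogonal pairs give 1, opposite vectors give 2
print(euclidean)
print(manhattan)

# squareform expands the condensed vector into a symmetric matrix.
print(squareform(euclidean))
```

Note that `pdist` only stores each pair once, which is why `squareform` is needed to recover a full matrix.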

**2.1** Compute the Euclidean pairwise distances of all the points in the data. Use the `pdist` function from `scipy` (already imported).

In [None]:
# Your solution here ###########################################################
distances: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# other optional metrics: 'cosine', 'cityblock', 'minkowski'

Now that we have a distance, we can compute the weights.

#### Distance to weights

A common function used to turn distances into edge weights is the Gaussian function:
$$\mathbf{W}(u,v) = \exp \left( \frac{-d^2(u, v)}{\sigma^2} \right),$$
where $\sigma$ is the parameter which controls the width of the Gaussian.
  
The function giving the weights should be positive and monotonically decreasing with respect to the distance. It should take its maximum value when the distance is zero, and tend to zero when the distance increases. Note that distances are non-negative by definition. So any function $f : \mathbb{R}^+ \rightarrow [0,C]$ that satisfies $f(0)=C$ and $\lim_{x \rightarrow +\infty}f(x)=0$ and is *strictly* decreasing is suitable. The choice of the function depends on the data.

Some examples:
* A simple linear function $\mathbf{W}(u,v) = \frac{d_{max} - d(u, v)}{d_{max} - d_{min}}$. As the cosine distance is bounded by $[0,2]$, a suitable linear function for it would be $\mathbf{W}(u,v) = 1 - d(u,v)/2$.
* A triangular kernel: a straight line between the points $(0,1)$ and $(t_0,0)$, and equal to 0 beyond $t_0$.
* The logistic kernel $\left(e^{d(u,v)} + 2 + e^{-d(u,v)} \right)^{-1}$.
* An inverse function $(\epsilon+d(u,v))^{-n}$, with $n \in \mathbb{N}^{*}$ and $\epsilon > 0$.
* You can find some more [here](https://en.wikipedia.org/wiki/Kernel_%28statistics%29).
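
To make the candidates above more concrete, here is a minimal sketch of four of them applied to a handful of made-up distances. The function names, the toy vector `d` and the parameter values (`sigma`, `t0`, `eps`, `n`) are assumptions for illustration only.

```python
import numpy as np

def gaussian_kernel(d, sigma):
    """Gaussian weights exp(-d^2 / sigma^2), as in the formula above."""
    return np.exp(-d**2 / sigma**2)

def triangular_kernel(d, t0):
    """Straight line from (0, 1) to (t0, 0), then 0 beyond t0."""
    return np.clip(1.0 - d / t0, 0.0, None)

def logistic_kernel(d):
    """Logistic kernel 1 / (exp(d) + 2 + exp(-d))."""
    return 1.0 / (np.exp(d) + 2.0 + np.exp(-d))

def inverse_kernel(d, eps=1.0, n=1):
    """Inverse function (eps + d)^(-n); eps > 0 keeps the value finite at d = 0."""
    return (eps + d) ** (-n)

# Hypothetical toy distances, sorted in increasing order.
d = np.array([0.0, 0.5, 1.0, 2.0, 4.0])

for name, w in [('gaussian', gaussian_kernel(d, sigma=1.0)),
                ('triangular', triangular_kernel(d, t0=2.0)),
                ('logistic', logistic_kernel(d)),
                ('inverse', inverse_kernel(d))]:
    print(f'{name:>10}: {np.round(w, 3)}')
```

Each row of the output starts at the kernel's maximum (distance zero) and decreases as the distance grows; the kernels differ mainly in how fast they decay.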


**2.2** Let's use the Gaussian function. Define the Gaussian's width and compute the weights. (Hint: think about what would make sense as $\sigma^2$, considering the statistics of the data.)

In [None]:
# Your solution here ###########################################################
weights_list: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [None]:
# Turn the list of weights into a matrix.
weight_matrix = squareform(weights_list)

**2.3** Find the nodes with highest degree and display their respective entry in the `iris` dataframe. Do they belong to the same iris species?

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We have obtained a full matrix but we may not need all the connections (reducing the number of connections saves some space and computations!). We can sparsify the graph by removing the values (edges) below some fixed threshold. Let's see what kind of threshold we could use:

In [None]:
plt.hist(weights_list)
plt.title('Distribution of weights')
plt.show()

**2.4:** Plot the number of edges with respect to the threshold, for threshold values between 0 and 1.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*Remark:* The distances presented here do not work well for categorical data.

**2.5:** Based on the above plot, define a threshold for the sparsification. Remember that a threshold that is too high might split the graph into disconnected components, while one that is too low might leave the graph too dense.
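
The trade-off between density and connectivity can be checked numerically. The sketch below is only an illustration on made-up 1-D points (`toy_points`), with a hypothetical helper `sparsify` and arbitrarily chosen thresholds; it counts edges and connected components after thresholding.

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

# Hypothetical 1-D toy data: two tight groups that are far from each other.
toy_points = np.array([[0.0], [0.1], [0.2], [3.0], [3.1]])

# Gaussian kernel with sigma = 1, expanded into a full weight matrix.
toy_weights = squareform(np.exp(-pdist(toy_points) ** 2))

def sparsify(weights, thresh):
    """Return a copy of `weights` with entries below `thresh` set to zero."""
    sparse = weights.copy()
    sparse[sparse < thresh] = 0
    return sparse

summary = {}
for thresh in (0.0, 0.5, 0.995):
    g = nx.from_numpy_array(sparsify(toy_weights, thresh))
    summary[thresh] = (g.number_of_edges(), nx.number_connected_components(g))
    print(f'threshold={thresh}: edges={summary[thresh][0]}, components={summary[thresh][1]}')
```

On this toy example, the lowest threshold keeps every edge, the middle one separates the two groups, and the highest one isolates every node.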

In [None]:
# Your solution here ###########################################################
thresh: float = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
weight_matrix[weight_matrix < thresh] = 0

### Graph visualization

To conclude, let us visualize the graph. We will use the Python module NetworkX.

In [None]:
# A simple command to create the graph from the adjacency matrix.
graph = nx.from_numpy_array(weight_matrix)

Let us try some direct visualizations using networkx.

In [None]:
# Let us add some colors: map each species name to an integer.
# Using `map` avoids overwriting the labels stored in `species`.
species_to_color = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
colors = species.map(species_to_color).to_numpy()

In [None]:
nx.draw_spectral(graph, node_color=colors)

It seems to be separated into 3 parts! Are they related to the 3 different species of iris?

Let's try another [layout algorithm](https://en.wikipedia.org/wiki/Graph_drawing#Layout_methods), where the edges are modeled as springs.

In [None]:
nx.draw_spring(graph, node_color=colors)

Save the graph to disk in the `gexf` format, readable by Gephi and other tools that manipulate graphs. You may now explore the graph using Gephi and compare the visualizations.

In [None]:
savepath = './'
nx.write_gexf(graph, os.path.join(savepath,'iris.gexf'))

**2.6:** Modify the experiment such that the distance is computed using normalized features, i.e., all features (columns of `features`) having the same mean and variance (hint: use pandas to compute the statistics of the dataframe's columns).
This avoids having some features with too much importance in the computation of distance.
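
As a small illustration of why normalization matters, the sketch below standardizes a made-up table whose two columns live on very different scales. The DataFrame `toy` and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy table: the second column has a much larger scale.
toy = pd.DataFrame({'length_cm': [10.0, 20.0, 30.0],
                    'width_mm': [1000.0, 1500.0, 2000.0]})

# Standardize each column: subtract its mean and divide by its standard deviation.
toy_norm = (toy - toy.mean()) / toy.std()

print(toy_norm)
print(toy_norm.mean())  # approximately 0 for every column
print(toy_norm.std())   # 1 for every column
```

Before normalization, a Euclidean distance between rows would be dominated by `width_mm`; afterwards both columns contribute equally. Note that pandas computes the sample standard deviation (`ddof=1`) by default, whereas NumPy's `std` uses `ddof=0`.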


In [None]:
# Your solution here ###########################################################
features_norm = ...

# The rest is the same, define the edges with a kernel function, sparsify the edges with thresholding and build the graph. Visualize the graph and compare it with the one built with the original feature values.
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

## Part 3: Graph Metrics

We will now look into some tools we can use to derive interesting properties of the graph under study. More precisely, we will be interested in the notion of [**Centrality**](https://en.wikipedia.org/wiki/Centrality) and [**Clustering**](https://en.wikipedia.org/wiki/Clustering_coefficient).

Briefly, let's recall that:
- Centrality refers to how important a given node is in a network. The definition of importance depends on the application; it can be structural or related to features.
- Clustering is a measure of how strongly nodes tend to form clusters.

To perform such analysis, let's travel a bit and cross the ocean, direction New York!

In [None]:
# Dataset curated thanks to https://github.com/gboeing/osmnx
# Can be downloaded here: https://www.kaggle.com/datasets/crailtap/street-network-of-new-york-in-graphml?resource=download
ny_graph = nx.read_graphml('./data/manhatten.graphml.xml')
print(type(ny_graph))

The dataset is the street network of Manhattan, with nodes being intersections and edges being roads. We can see that this graph has a specific type: `MultiDiGraph`. This means it is a directed multigraph (two nodes can be connected by multiple edges with different properties). In real life, this may occur when we encode different types of relationships between entities. However, in our setup we are only interested in the roads, not in the additional links that may exist between two intersections. This is a bit too much information for now, so let's focus on the key properties of the graph and convert it to a simpler representation: a directed graph.
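
To see what parallel edges look like in practice, here is a minimal sketch on a made-up three-node multigraph. The node names and the `name` edge attribute are hypothetical; the loop shows one possible way of keeping a single edge per ordered pair of nodes.

```python
import networkx as nx

# Hypothetical toy multigraph: two parallel edges from 'a' to 'b'.
multi = nx.MultiDiGraph()
multi.add_edge('a', 'b', name='first road')
multi.add_edge('a', 'b', name='second road')
multi.add_edge('b', 'c', name='third road')

print(multi.number_of_edges())  # parallel edges are counted separately
print(list(multi.edges(data=True)))

# Keep only the first edge met between each ordered pair of nodes.
simple = nx.DiGraph()
for u, v, data in multi.edges(data=True):
    if not simple.has_edge(u, v):
        simple.add_edge(u, v, **data)

print(list(simple.edges(data=True)))
```

The resulting `DiGraph` has one edge per ordered pair, and the attributes of the discarded parallel edges are lost.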

### Graph Modelling and Visualization

**3.1** Create a DiGraph (directed graph) from the `ny_graph`. _(Hint: Create an empty directed graph object and iteratively add the edges, keeping only the first one encountered if multiple exist.)_

In [None]:
# Your solution here ###########################################################
simpler_ny_graph = ...
for u, v, data in ny_graph.edges(data=True):
    ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [None]:
assert type(simpler_ny_graph) == nx.DiGraph

Good, now we have our directed graph. However, depending on our construction, we may have lost the node features. From now on, we will only be interested in the spatial coordinates (x,y) of the different cross points.

**3.2** Retrieve the x,y coordinates of each node in the initial graph `ny_graph` and store them in the simpler one `simpler_ny_graph`. _(Hint: Look at nx.get_node_attributes() and nx.set_node_attributes(). The names of the features can be accessed by indexing `ny_graph.nodes` with a node id.)_
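
The two functions mentioned in the hint can be tried on a small made-up example first. In the sketch below, the graphs `source` and `target`, the node ids and the `label` attribute are all hypothetical.

```python
import networkx as nx

# Hypothetical source graph whose nodes carry a 'label' attribute.
source = nx.DiGraph()
source.add_node('n1', label='north')
source.add_node('n2', label='south')
source.add_edge('n1', 'n2')

# Target graph with the same nodes but no attributes.
target = nx.DiGraph()
target.add_edge('n1', 'n2')

print(source.nodes['n1'])  # attribute dictionary of a single node

# get_node_attributes returns {node_id: value} for one attribute name.
labels = nx.get_node_attributes(source, 'label')

# set_node_attributes copies that dictionary onto the target under a given name.
nx.set_node_attributes(target, labels, name='label')

print(target.nodes(data=True))
```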

In [None]:
# Your solution here ###########################################################
x_pos : dict = ...
y_pos : dict = ...
... # store x,y coordinates in `simpler_ny_graph`
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We introduced the graph as being a map of Manhattan; it is now time to see it.

In [None]:
fig,ax = plt.subplots(figsize=(10,10))
nx.draw_networkx(simpler_ny_graph,arrows=False,ax=ax,node_size=5,with_labels=False,)

You would probably agree that this is not the clearest map of New York ever created. This is due to how NetworkX plots graphs: if you do not tell it where the nodes are located, it computes a layout that tries to give a "meaningful" representation of the graph, which is not always a success... Fortunately, we do have spatial information about the nodes! Good thing we kept those x and y coordinates.

**3.3** Using the `x_pos` and `y_pos`, create a dictionary that maps each node id to a numpy array with the coordinates in float format.

In [None]:
# Your solution here ###########################################################
pos = dict()
for node_id in simpler_ny_graph.nodes:
    pos[node_id] = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [None]:
assert pos['2895441122'].dtype == np.float64

Now we can plot our graph again. Note how much faster it is: NetworkX no longer has to compute a complicated layout now that it has the node coordinates.

In [None]:
fig,ax = plt.subplots(figsize=(10,10))
nx.draw_networkx(simpler_ny_graph,pos,arrows=False,ax=ax,node_size=5,with_labels=False)
fig.suptitle('View of Manhattan Roads')

### Clustering Coefficient

The graph modelling the street network of New York can be studied in many ways. We will first focus on the notion of [clustering](https://en.wikipedia.org/wiki/Clustering_coefficient). Recall from the lecture that the clustering coefficient of a node quantifies how close the subgraph defined by the node and its neighborhood is to a clique (fully connected graph).
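
As a quick reminder of how the coefficient behaves, the sketch below computes it on a made-up undirected graph: a triangle with one extra node attached. The toy graph is an assumption for illustration; `nx.clustering` also accepts directed graphs, for which it uses a directed variant of the definition.

```python
import networkx as nx

# Hypothetical toy graph: a triangle (0, 1, 2) plus node 3 attached to node 2.
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

# For each node: fraction of pairs of neighbors that are themselves connected.
coefficients = nx.clustering(toy)
print(coefficients)

# Average over all nodes.
print(nx.average_clustering(toy))
```

Nodes 0 and 1 only have neighbors that are connected to each other (coefficient 1), node 2 has three neighbors of which only one pair is connected (coefficient 1/3), and node 3 has a single neighbor (coefficient 0).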

Before doing the computations, based on the very specific structure of the streets of New York, do you think that the clustering coefficient will be low or high on average?

**3.4** Compute the clustering coefficients for all nodes in the graph. _(Starting from this exercise and for the rest of the notebook, only use `simpler_ny_graph` for computations.)_

In [None]:
from networkx import clustering

In [None]:
# Your solution here ###########################################################
ny_clustering : dict = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**3.5** Normalize the clustering values using min-max normalization (0-1 range) to create a color gradient of the clustering value of each node for later visualization. (Check [cmap](https://matplotlib.org/stable/users/explain/colors/colormaps.html) from matplotlib for more information.)
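
For reference, min-max normalization maps a value $x$ to $(x - x_{min}) / (x_{max} - x_{min})$. The sketch below applies it to a made-up dictionary of scores (`scores` and its keys are hypothetical) and feeds the result to a colormap. It imports `matplotlib.cm` directly, which exposes the same `Blues` colormap as `plt.cm.Blues`.

```python
from matplotlib import cm

# Hypothetical per-node scores.
scores = {'a': 2.0, 'b': 4.0, 'c': 10.0}

lowest = min(scores.values())
highest = max(scores.values())

# Min-max normalization: the smallest value maps to 0 and the largest to 1.
scaled = {node: (value - lowest) / (highest - lowest)
          for node, value in scores.items()}
print(scaled)

# A colormap turns a float in [0, 1] into an RGBA tuple.
cmap = cm.Blues
node_colors = {node: cmap(value) for node, value in scaled.items()}
print(node_colors['a'], node_colors['c'])
```

This sketch assumes that the values are not all identical; otherwise the denominator would be zero.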


In [None]:
# Your solution here ###########################################################
min_clustering : float = ...
max_clustering : float = ...
normalized_clustering : dict = ... # key = node_ids, values = normalized clustering values
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [None]:
# Plot of clustering coefficients on the graph
cmap = plt.cm.Blues
color_mapping = {node_id: cmap(norm_clustering) for node_id,norm_clustering in normalized_clustering.items()}
fig,ax = plt.subplots(figsize=(10,10))
nx.draw_networkx(simpler_ny_graph,pos,arrows=False,ax=ax,node_color=list(color_mapping.values()),node_size=5,with_labels=False)
fig.suptitle('Clustering coefficients over Manhattan Roads')

**3.6** We can see that most of the nodes are homogeneous with a low clustering coefficient. Was this expected?


**Your answer here:**


### Node Centrality

Another important notion in graph analysis is the [centrality](https://en.wikipedia.org/wiki/Centrality) of the nodes. In short, a centrality measure is a function that ranks the nodes according to a notion of importance in the graph. The latter is usually defined according to the application. In our case, we are interested in a road network and we can easily connect the notion of importance of a road to the one of shortest path. Indeed, a road that belongs to many shortest paths in the graph is very important for the flow and connectivity of the graph. Let's see how it works on our road dataset!

#### Betweenness Centrality

We will begin our analysis with the [betweenness centrality](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.betweenness_centrality.html#networkx.algorithms.centrality.betweenness_centrality). It is a measure of how central a node is in terms of shortest paths. In other words, a node $v$ has a high centrality if many shortest paths connecting any two other nodes $(s,t)$ go through $v$.
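
A tiny example may help build intuition. The sketch below uses a made-up undirected path graph (an assumption for illustration, unrelated to the road data) and counts, for each node, the shortest paths between other nodes that pass through it.

```python
import networkx as nx

# Hypothetical toy graph: an undirected path 0 - 1 - 2 - 3 - 4.
path = nx.path_graph(5)

# With normalized=False, each value counts the shortest paths through the node.
scores = nx.betweenness_centrality(path, normalized=False)
print(scores)
```

The middle node lies on the shortest paths of four pairs, (0,3), (0,4), (1,3) and (1,4), so it gets the highest score, while the two end nodes lie on none.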

In [None]:
from networkx import betweenness_centrality

**3.7** Compute the betweenness centrality for all nodes in the graph. _(Hint: set `normalized=False` in the function arguments to avoid later scaling issues when we apply the color map.)_

In [None]:
# Your solution here ###########################################################
b_centrality : dict = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let's have a look at these centrality values. (It is always good practice to inspect the values first, to get an idea of the kind of post-processing you will want to apply to them later on.)

In [None]:
# Plot centrality histogram
fig,ax = plt.subplots(figsize=(6,3))
ax.hist(b_centrality.values())
ax.set_xlabel('Centrality Value')
ax.set_ylabel('Number of Nodes')
fig.suptitle('Histogram of betweenness centrality values')

We can see that the centrality values seem to follow a log-normal distribution, which we can normalize to have a better visual representation of the scale of the results.

**3.8** Transform the centrality values using log scaling and z-normalization. _(Hint: Do not forget to add a small term $0 < \epsilon \ll 1$ inside the log function to avoid evaluating it at zero.)_
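
The two transformations can be sketched on made-up values first. In the example below, the dictionary `raw` and the value of `eps` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical heavy-tailed scores, including a zero.
raw = {'a': 0.0, 'b': 10.0, 'c': 100.0, 'd': 1000.0}

eps = 1e-6  # small positive constant so that log(0) is never evaluated

# Log scaling compresses the large values.
log_scores = {node: np.log(value + eps) for node, value in raw.items()}

values = np.array(list(log_scores.values()))
mean, std = values.mean(), values.std()

# z-normalization: zero mean and unit standard deviation.
z_scores = {node: (value - mean) / std for node, value in log_scores.items()}
print(z_scores)
```

The choice of `eps` affects where the zero entries end up after the log, so it is worth checking the histogram again after the transformation.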

In [None]:
# Your solution here ###########################################################
log_centrality: dict = ... # key = node_ids, values = log centrality values
mean_centrality : float = ...
std_centrality : float = ...
normalized_centrality : dict = ... # key = node_ids, values = normalized centrality values
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [None]:
# Plot of centrality coefficients on the graph
cmap = plt.cm.Blues
color_mapping = {node_id: cmap(norm_centrality) for node_id,norm_centrality in normalized_centrality.items()}
fig,ax = plt.subplots(figsize=(10,10))
nx.draw_networkx(simpler_ny_graph,pos,arrows=False,ax=ax,node_color=list(color_mapping.values()),node_size=5,with_labels=False)
fig.suptitle('Betweenness Centrality over Manhattan Roads')

**3.9** Give a brief interpretation of the previous plot (why do nodes on the border of the graph have higher centrality? What happens around Central Park?)

**Your answer here:**


---

#### Closeness Centrality

We will now look at another notion of centrality, the [closeness centrality](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.closeness_centrality.html#networkx.algorithms.centrality.closeness_centrality). This metric is also connected to the notion of shortest path. Here, a node $v$ has high centrality if its average shortest path distance to all other nodes is low (meaning that we can quickly reach any other node $u$ in the network starting from $v$). More formally, it is defined as $C(v)=\frac{n-1}{\sum_{u=1}^{n-1}d(v,u)}$, where $d(v,u)$ is the shortest path distance between nodes $v$ and $u$.
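
The formula can be checked by hand on a very small example. The sketch below uses a made-up undirected path with three nodes (an assumption for illustration) and compares the NetworkX result for the middle node with a direct evaluation of the definition.

```python
import networkx as nx

# Hypothetical toy graph: an undirected path 0 - 1 - 2.
path = nx.path_graph(3)

scores = nx.closeness_centrality(path)
print(scores)

# Same value for the middle node, computed from the definition.
distances = nx.shortest_path_length(path, source=1)  # {node: distance from node 1}
n = path.number_of_nodes()
manual = (n - 1) / sum(distances.values())
print(manual)
```

The middle node is at distance 1 from both other nodes, giving $2/2 = 1$, whereas each end node has distances 1 and 2, giving $2/3$.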

**3.10** Compute the closeness centrality for all nodes in the graph.

In [None]:
from networkx import closeness_centrality

In [None]:
# Your solution here ###########################################################
c_centrality : dict = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [None]:
fig,ax = plt.subplots(figsize=(6,3))
ax.hist(c_centrality.values())
ax.set_xlabel('Centrality Value')
ax.set_ylabel('Number of Nodes')
fig.suptitle('Histogram of closeness centrality values')

It looks again like a known distribution (Gaussian in this case). So let's normalize it!

**3.11** Transform the centrality values using z-normalization.

In [None]:
# Your solution here ###########################################################
mean_centrality : float = ...
std_centrality : float = ...
normalized_centrality : dict = ... # key = node_ids, values = normalized centrality values
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [None]:
# Plot of centrality coefficients on the graph
cmap = plt.cm.Blues
color_mapping = {node_id: cmap(norm_centrality) for node_id,norm_centrality in normalized_centrality.items()}
fig,ax = plt.subplots(figsize=(10,10))
nx.draw_networkx(simpler_ny_graph,pos,arrows=False,ax=ax,node_color=list(color_mapping.values()),node_size=5,with_labels=False)
fig.suptitle('Closeness Centrality over Manhattan Roads')

**3.12** Give a brief interpretation of the previous plot (why does the centrality follow this vertical (south-north) distribution over roads?)

**Your answer here:**
