# Initializing the notebook

This notebook provides an introduction to using sparsebm on GPU with _Google Colab_.

**⚠️ Do not skip this step ⚠️**

You must enable GPUs for the notebook:
 - Navigate to Edit→Notebook Settings;
 - Select GPU from the Hardware Accelerator drop-down list.
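
Once the runtime has restarted with the new setting, a quick check such as the one below can confirm that a GPU is actually visible. This is only a sketch: it assumes `cupy` is the GPU backend in use, and `gpu_available` is a hypothetical helper name, not part of `sparsebm`.

```python
def gpu_available():
    """Return True if cupy can be imported and sees at least one CUDA device."""
    try:
        import cupy  # preinstalled on Colab GPU runtimes

        return cupy.cuda.runtime.getDeviceCount() > 0
    except Exception:
        # cupy is missing, or no CUDA device/driver is reachable
        return False


print("GPU available:", gpu_available())
```

If this prints `False`, check the Notebook Settings again before going further.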


# Installing `sparsebm` and importing the module

The SparseBM module is distributed through the [PyPI repository](https://pypi.org/project/sparsebm/) and the documentation is available [here](https://jbleger.gitlab.io/sparsebm).

On _Google Colab_, the `cupy` module required for GPU computation is already installed, so only `sparsebm` needs to be installed:

In [None]:
# estimated time in colab: <10s
!pip install --upgrade sparsebm

Now, we just have to import the module.

In [None]:
# estimated time in colab: <1s
import sparsebm

# Example with the Latent Block Model

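First, we generate a synthetic graph. To illustrate the behavior at scale, we generate a large graph ($10^4$ row nodes and $2\cdot10^4$ column nodes) with 3 row clusters and 4 column clusters. With the `80/N2` scaling applied to the connection matrix in the cell below, the average row degree is about 80 when the clusters have similar sizes.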

Note that you should consider a smaller size if the GPU you've been allocated doesn't have enough memory to handle graphs of this size.
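
To make the sizes concrete, the following sketch computes the expected row degree implied by the `80 / N2 * U / U.mean()` scaling used in the generation cell, together with a rough memory comparison between a dense and a sparse representation of the adjacency matrix. It assumes balanced clusters (each block weighted equally); the fixed seed and the byte estimates are illustrative choices and cover the matrix only, not the memory needed by the inference itself.

```python
import numpy as np

N1, N2 = 10**4, 2 * 10**4
rng = np.random.default_rng(0)  # seed chosen only to make this sketch repeatable

U = rng.uniform(size=(3, 4))
connection_matrix = 80 / N2 * U / U.mean()

# Assumption: balanced clusters, so every block has the same weight.
expected_row_degree = N2 * connection_matrix.mean()
expected_edges = N1 * expected_row_degree

dense_bytes = N1 * N2 * 8                # float64 dense matrix
sparse_bytes = expected_edges * 2 * 8    # two int64 indices per edge, overhead ignored

print(f"expected row degree: {expected_row_degree:.1f}")
print(f"dense matrix:  {dense_bytes / 1e9:.2f} GB")
print(f"sparse matrix: {sparse_bytes / 1e6:.2f} MB (rough estimate)")
```

Reducing `N1` and `N2` while keeping the same scaling leaves the expected row degree unchanged and shrinks the number of edges proportionally to `N1`.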


In [None]:
# estimated time in colab: <1m
import numpy as np

N1, N2 = 10**4, 2*10**4
U = np.random.uniform(size=(3,4))
connection_matrix = 80/N2*U/U.mean()

dataset = sparsebm.generate_LBM_dataset(
    number_of_rows=N1,
    number_of_columns=N2,
    nb_row_clusters=3,
    nb_column_clusters=4,
    connection_probabilities=connection_matrix,
)

We can now access the generated dataset through the `dataset` object. The most useful values are `dataset.data` (the sparse adjacency matrix) and `dataset.row_labels` and `dataset.column_labels` (the simulated labels). Other attributes and properties are also available.

In [None]:
dataset.data

## Inference with a known number of groups

In this part, we assume the number of clusters is known (3 row clusters and 4 column clusters in this example).

We can now perform inference by declaring the `model` object and fitting the model. The module follows the scikit-learn syntax here.

In [None]:
# estimated time in colab: <2m
model = sparsebm.LBM(3,4)
model.fit(dataset.data)

We can now compare the inferred labels in `model.row_labels` and `model.column_labels` to the simulated labels in `dataset.row_labels` and `dataset.column_labels` using the Co-clustering Adjusted Rand Index (CARI):


In [None]:
# estimated time in colab: <1s
sparsebm.utils.CARI(
    dataset.row_labels,
    dataset.column_labels,
    model.row_labels,
    model.column_labels,
)

A CARI close to 1 indicates that the inference retrieves the simulated row and column clusters. Other elements can be extracted from the fitted `model` object.
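
The index does not depend on how the clusters are numbered, which is why inferred labels can be compared to simulated ones directly. The sketch below illustrates this under the assumption that the CARI is the adjusted Rand index of the partition of matrix cells induced by the row and column labels, whose contingency table is the Kronecker product of the row and column contingency tables. The helpers `contingency`, `ari_from_table` and `cocluster_ari` are illustrative names, not part of `sparsebm`, and this is not the library's implementation.

```python
import numpy as np


def contingency(a, b):
    """Count how often label a[i] co-occurs with label b[i]."""
    _, ai = np.unique(np.asarray(a).ravel(), return_inverse=True)
    _, bi = np.unique(np.asarray(b).ravel(), return_inverse=True)
    table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=np.int64)
    np.add.at(table, (ai, bi), 1)
    return table


def ari_from_table(table):
    """Adjusted Rand index computed from a contingency table."""
    pairs = lambda x: x * (x - 1) / 2.0  # number of unordered pairs
    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(table.sum())
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


def cocluster_ari(rows_true, cols_true, rows_pred, cols_pred):
    """ARI of the cell partitions induced by (row, column) label pairs."""
    table = np.kron(contingency(rows_true, rows_pred),
                    contingency(cols_true, cols_pred))
    return ari_from_table(table)


# Same partitions, different label numbering: the index is still 1.
rows_true, rows_pred = [0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]
cols_true, cols_pred = [0, 0, 1, 1], [1, 1, 0, 0]
print(cocluster_ari(rows_true, cols_true, rows_pred, cols_pred))
```

Values near 0 correspond to agreement no better than chance, and negative values are possible for partitions that disagree more than chance would predict.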

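## Inference with an unknown number of groups
In this part, we assume that the number of clusters is unknown. The `ModelSelection` object explores models with different numbers of groups, and the best fit is the one selected by the Integrated Completed Likelihood (ICL) criterion.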

In [None]:
# estimated time in colab: <5m
model_selection = sparsebm.ModelSelection(model_type="LBM", plot=False)
models = model_selection.fit(dataset.data)

We can show the optimal fit:

In [None]:
models.best

In [None]:
# estimated time in colab: <1s
sparsebm.utils.CARI(
    dataset.row_labels,
    dataset.column_labels,
    models.best.row_labels,
    models.best.column_labels,
)  # for the best fit according to the ICL

Or we can examine a specific model for an arbitrary number of groups:

In [None]:
models[2,2]

In [None]:
# estimated time in colab: <1s
sparsebm.utils.CARI(
    dataset.row_labels,
    dataset.column_labels,
    models[2,2].row_labels,
    models[2,2].column_labels,
) # for 2 row groups and 2 col. groups

We can show the ICL as a function of the total number of groups (the sum of the numbers of row groups and column groups).

In [None]:
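import matplotlib.pyplot as plt

# One point per explored model: x is the total number of groups, y is the ICL.
# Markers are used instead of a line because several models can share the same total.
plt.plot([sum(x) for x in models.keys()], [m.get_ICL() for m in models.values()], "o")
plt.xlabel('number of groups (row+col.)')
plt.ylabel('ICL')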
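
Several explored models can share the same total number of groups (for example 2+3 and 3+2), so it can help to keep only the best ICL for each total before reading the curve. The sketch below shows one way to do this on a plain dictionary keyed by `(row groups, column groups)`. The ICL values are placeholders made up for illustration, not results from this notebook, and `best_icl_per_total` is a hypothetical helper name.

```python
# Placeholder ICL values keyed by (row groups, column groups); not real results.
icl_by_model = {
    (1, 1): -1000.0,
    (1, 2): -950.0,
    (2, 1): -960.0,
    (2, 2): -900.0,
    (2, 3): -905.0,
    (3, 2): -910.0,
}


def best_icl_per_total(icl_by_model):
    """For each total number of groups, keep the model with the highest ICL."""
    best = {}
    for (n_row, n_col), icl in icl_by_model.items():
        total = n_row + n_col
        if total not in best or icl > best[total][1]:
            best[total] = ((n_row, n_col), icl)
    return dict(sorted(best.items()))


curve = best_icl_per_total(icl_by_model)
best_overall = max(icl_by_model, key=icl_by_model.get)  # higher ICL is better

for total, (shape, icl) in curve.items():
    print(total, shape, icl)
print("selected model:", best_overall)
```

With real fits, the same dictionary could be built from the ICL of each explored model and the resulting `curve` plotted as a single line.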