# Building graphs from data

In this notebook, we show how to use some of the functions in the file **functions/data_specifics.py**. This file stores default parameters for each dataset, so the same function call can build any object for any dataset.

In [None]:
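# execfile() no longer exists in Python 3; exec(open(...).read()) does the same job
exec(open('functions/data_specifics.py').read())
exec(open('functions/graph_functions.py').read())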

In [None]:
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

import matplotlib.pyplot as plt
import seaborn as sns

import hdbscan
import umap
import umap.plot

import igraph as ig
from collections import Counter

sns.set()

## Available datasets

In [None]:
print(data_set_list)

## Function *get_dataset*

Requires a dataset_id, an integer between 0 and 4 that indexes into data_set_list. Called with *return_images=True*, it returns:

* raw_data: the high-dimensional vectors
* targets: the labels
* dataset_name: the name of the dataset
* image_list: a list of images that we can display

In [None]:
dataset_id = 1
raw_data, targets, dataset_name, image_list = get_dataset(dataset_id, return_images=True)

In [None]:
# Verify dataset
print(dataset_name)

In [None]:
# Count how many samples each target label has
Counter(targets)

In [None]:
# Look at an image
plt.imshow(image_list[0], cmap='gray')

## Function *get_umap_graph*

Takes the raw data and a dataset_id, an integer between 0 and 4 that indexes into data_set_list.
The return value is configurable: the function can return the graph alone, as an igraph or networkx object, or it can return matrices.

Setting the option *set_op_mix_ratio* to 1 returns the fuzzy union; setting it to 0 returns the fuzzy intersection.
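
To make the two extremes concrete, here is a minimal NumPy sketch of how UMAP blends the fuzzy union and the fuzzy intersection of a directed membership matrix with its transpose. The helper name *mix_fuzzy_sets* and the toy matrix are illustrative only, and the sketch assumes that get_umap_graph passes *set_op_mix_ratio* through to UMAP unchanged.

```python
import numpy as np

def mix_fuzzy_sets(A, set_op_mix_ratio=1.0):
    """Symmetrize a directed membership matrix A with entries in [0, 1].

    Hypothetical helper for illustration; not part of the notebook's code.
    """
    At = A.T
    intersection = A * At                 # fuzzy intersection (product)
    union = A + At - intersection         # fuzzy union (probabilistic sum)
    # Linear blend: ratio 1 gives the union, ratio 0 gives the intersection
    return set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * intersection

# Toy directed memberships: A[i, j] is the strength of the edge i -> j
A = np.array([[0.0, 0.8, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

union = mix_fuzzy_sets(A, set_op_mix_ratio=1.0)
inter = mix_fuzzy_sets(A, set_op_mix_ratio=0.0)
```

The union keeps an edge if either direction is present, while the intersection keeps it only when both directions are present, so lower values of the ratio give sparser graphs.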

In [None]:
G = get_umap_graph(raw_data, dataset_id=dataset_id)

In [None]:
G.vcount()

In [None]:
G.ecount()

## Function *get_umap_vectors*

Takes the raw data and a dataset_id, an integer between 0 and 4 that indexes into data_set_list.
The function exposes many options that map to UMAP arguments, but its default parameters are predetermined per dataset.

With *return_vectors* set to True, the function returns only the vectors; set to False, it returns the UMAP object, which holds the vectors in the attribute *.embedding_*.

In [None]:
umap_rep = get_umap_vectors(dataset_id=dataset_id, raw_data=raw_data, n_components=2, return_vectors=False)

In [None]:
umap_rep.embedding_

In [None]:
umap.plot.points(umap_rep, labels=targets, color_key_cmap='Paired', background='black')

In [None]:
umap.plot.connectivity(umap_rep)

## Get directed knn graph

In [None]:
G_di = knn_digraph(raw_data, k=25)

In [None]:
G_di.is_directed()

In [None]:
# Every node has out-degree 24: each point counts among its own 25 nearest neighbors
G_di.degree(mode='out')[0:10]
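
The out-degree of k - 1 can be reproduced with a small brute-force sketch in NumPy. This is not the implementation of knn_digraph, which lives in functions/graph_functions.py and is not shown here; the helper *knn_edges* is hypothetical and assumes Euclidean distance and no duplicate points.

```python
import numpy as np

def knn_edges(X, k):
    """Directed edges (i, j) where j is one of the k nearest points to i.

    Hypothetical helper: each point is its own nearest neighbor (distance 0),
    so dropping the self-loop leaves k - 1 outgoing edges per node.
    """
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)                       # squared Euclidean distances
    nn = np.argsort(d2, axis=1, kind="stable")[:, :k]   # k closest, self included
    return [(i, int(j)) for i in range(len(X)) for j in nn[i] if j != i]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # 50 random points in 3 dimensions
edges = knn_edges(X, k=5)

# Count outgoing edges per node
out_degree = np.bincount([i for i, _ in edges], minlength=len(X))
```

In-degrees, by contrast, are not constant: a point in a dense region can be a nearest neighbor of many others.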

## Get UMAP + HDBSCAN baseline

In [None]:
dataset_id = 1
# Get data and labels
raw_data, targets, dataset_name = get_dataset(dataset_id)
# Project to a lower-dimensional space (not 2-D here; the target dimension is one of the predetermined parameters)
umap_rep = get_umap_vectors(dataset_id=dataset_id, raw_data=raw_data)
# Run HDBSCAN with predetermined parameters
hd_umap_labels = h_dbscan(umap_rep, which_algo='hdbscan', dataset_id=dataset_id)

In [None]:
ari = adjusted_rand_score(targets, hd_umap_labels)
ami = adjusted_mutual_info_score(targets, hd_umap_labels)

In [None]:
print(dataset_name)
print(f'ARI = {ari}')
print(f'AMI = {ami}')