# ``giotto-tda`` Mapper – More features

See also: [Getting started with Mapper](https://giotto-ai.github.io/gtda-docs/0.4.0/notebooks/mapper_quickstart.html).

## Import libraries

In [None]:
from IPython.display import SVG, display

# Data wrangling
import numpy as np
import pandas as pd

# TDA magic
from gtda.mapper import (
    CubicalCover,
    make_mapper_pipeline,
    plot_static_mapper_graph,
    MapperInteractivePlotter,
    method_to_transform,
    transformer_from_callable_on_rows
    )

# ML tools
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_union

In [None]:
display(SVG("https://giotto-ai.github.io/gtda-docs/latest/_images/mapper_pipeline.svg"))

## Load Wisconsin breast cancer dataset

Via: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

**We can use pandas dataframes directly**

In [None]:
df = pd.read_csv("Wisconsin_data.csv")
feature_names = [c for c in df.columns if c not in ["id", "diagnosis"]]
X = df[feature_names].fillna(0)
y = df["diagnosis"]

In [None]:
X.head()

In [None]:
y.head()

In [None]:
y.unique()

## Define the filter function 

For each row, the value of the filter function is a vector with two entries:

1. The value of the **decision function** of a fitted `IsolationForest` (from `sklearn`)
2. The $L^2$ norm (square root of sum of squares) of **all** features in the data

→ Define each as a `sklearn` transformer and then combine them using `make_union`!

In [None]:
# First
isolation_forest = method_to_transform(IsolationForest, "decision_function")(random_state=42)

In [None]:
# Second
l2_norm = transformer_from_callable_on_rows(np.linalg.norm)

In [None]:
# Combine!
filter_func = make_union(isolation_forest, l2_norm)
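
As a quick check on the second component of the filter, the sketch below uses a small made-up array (not the Wisconsin data) to show that applying `np.linalg.norm` to each row gives the square root of the sum of squares of that row. The names `toy`, `row_norms` and `explicit` are illustrative only.

```python
import numpy as np

# Hypothetical toy data: 3 rows, 2 features (not the Wisconsin dataset)
toy = np.array([[3.0, 4.0],
                [0.0, 0.0],
                [1.0, 1.0]])

# Apply the norm one row at a time, as a row-wise transformer would
row_norms = np.array([np.linalg.norm(row) for row in toy])

# The same quantity written out: square root of the sum of squares per row
explicit = np.sqrt((toy ** 2).sum(axis=1))

print(row_norms)
print(np.allclose(row_norms, explicit))
```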

## Define the covering scheme

In [None]:
cover = CubicalCover(
    kind="balanced",
    n_intervals=15,
    overlap_frac=0.4
    )
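
To give a feel for what `n_intervals` and `overlap_frac` control, here is a simplified pure-Python sketch of a one-dimensional cover by equal-length, overlapping intervals. It assumes that the overlap is measured as a fraction of the interval length. It is an illustration only, not giotto-tda's implementation, and it ignores the `kind="balanced"` option used above, which (roughly speaking) chooses intervals according to how the data is distributed rather than giving them equal lengths. The function name `uniform_intervals` is hypothetical.

```python
def uniform_intervals(low, high, n_intervals, overlap_frac):
    """Return n_intervals equal-length intervals covering [low, high].

    Consecutive intervals overlap by overlap_frac times their length.
    """
    # Solve n * L - (n - 1) * overlap_frac * L = high - low for the length L
    length = (high - low) / (n_intervals - (n_intervals - 1) * overlap_frac)
    # Distance between the left endpoints of consecutive intervals
    step = length * (1 - overlap_frac)
    return [(low + i * step, low + i * step + length)
            for i in range(n_intervals)]


intervals = uniform_intervals(0.0, 1.0, n_intervals=15, overlap_frac=0.4)
interval_length = intervals[0][1] - intervals[0][0]
shared = intervals[0][1] - intervals[1][0]  # overlap of the first two intervals
print(len(intervals), shared / interval_length)
```

Raising `overlap_frac` makes neighbouring intervals share more of their length, so more points fall into two intervals at once.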

## Advanced pipeline options

### Use the `memory` argument to avoid recomputation -- as in `sklearn`

This can make your interactive session much faster to refresh! Pass a temporary folder as ``memory``, as explained in the `Pipeline` documentation at https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html (user guide section 6.1.1.3).

In [None]:
from tempfile import mkdtemp
from shutil import rmtree

cachedir = mkdtemp()

**Note**: Don't forget to clear the cache directory once you no longer need it! You can do so with:
```
rmtree(cachedir)
```
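
If you would rather not rely on remembering the cleanup, one option is to wrap the work in `try`/`finally` so that the folder is removed even if something fails along the way. The sketch below uses only the standard library; the placeholder file stands in for whatever a pipeline would cache, and `demo_cachedir` is an illustrative name, separate from the `cachedir` used in this notebook.

```python
import os
from shutil import rmtree
from tempfile import mkdtemp

demo_cachedir = mkdtemp()
try:
    # Stand-in for cached pipeline output (hypothetical file name)
    with open(os.path.join(demo_cachedir, "placeholder.txt"), "w") as f:
        f.write("cached result")
    existed_during_work = os.path.isdir(demo_cachedir)
finally:
    # Runs whether or not the block above raised an exception
    rmtree(demo_cachedir)

print(existed_during_work, os.path.exists(demo_cachedir))
```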

### Use the `n_jobs` argument to parallelize the clustering step across the cover sets

In [None]:
# -1 means use all available cores
n_jobs = -1

### Pass `contract_nodes=True` to have a less redundant graph, or `min_intersection` > 1 to remove "weak" edges

You can also change these in the interactive session.

In [None]:
contract_nodes = True
min_intersection = 2
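
To make the effect of these two options concrete, here is a toy sketch in plain Python. It treats each Mapper node as a set of row indices, drops nodes that are strict subsets of another node (the idea behind `contract_nodes=True`), and links two nodes only when they share at least `min_intersection` rows. This is a conceptual illustration on made-up data, not giotto-tda's implementation; the names `nodes`, `contract` and `edges` are hypothetical.

```python
from itertools import combinations

# Hypothetical nodes: each is the set of row indices in one cluster
nodes = {
    "a": {0, 1, 2, 3},
    "b": {2, 3},            # strict subset of "a" (and of "c")
    "c": {2, 3, 4, 5},
    "d": {5, 6},
}


def contract(nodes):
    """Drop every node whose members are a strict subset of another node."""
    return {name: members for name, members in nodes.items()
            if not any(members < other for other in nodes.values())}


def edges(nodes, min_intersection=1):
    """Link two nodes when they share at least min_intersection rows."""
    return [(u, v) for u, v in combinations(sorted(nodes), 2)
            if len(nodes[u] & nodes[v]) >= min_intersection]


kept = contract(nodes)
print(sorted(kept))
print(edges(kept, min_intersection=1))
print(edges(kept, min_intersection=2))
```

In this toy example, node `b` disappears after contraction, and raising `min_intersection` from 1 to 2 removes the edge between `c` and `d`, which share only one row.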

In [None]:
pipeline = make_mapper_pipeline(
    filter_func=filter_func,
    cover=cover,
    memory=cachedir,
    n_jobs=n_jobs,
    contract_nodes=contract_nodes,
    min_intersection=min_intersection
    )

## Visualize in 3D!

We can color nodes according to arbitrary features using ``color_data``:

In [None]:
color_data = pd.get_dummies(y)
color_data.head()

In [None]:
plotter = MapperInteractivePlotter(pipeline, X)
plotter.plot(layout_dim=3,
             color_data=color_data,
             node_scale=30)

In [None]:
# rmtree(cachedir)

### Exploit the flexibility of `color_features` and `node_color_statistic`

**Pass arbitrary `sklearn` objects and custom functions!**
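
Conceptually, node coloring reduces the feature values of the rows in each node to a single number per node. The sketch below illustrates that idea on made-up data with plain NumPy: `feature` holds one hypothetical value per row, `node_elements` is a hypothetical list of row indices per node, and the statistic is swapped between `np.mean` and `np.max`. It is only meant to show the principle; in practice the plotting functions carry out this reduction for you.

```python
import numpy as np

# Hypothetical feature values, one per data row
feature = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

# Hypothetical node membership: row indices belonging to each node
node_elements = [np.array([0, 1]), np.array([1, 2, 3]), np.array([3, 4])]


def node_colors(statistic):
    """Reduce the feature values in each node to one number."""
    return np.array([statistic(feature[idx]) for idx in node_elements])


mean_colors = node_colors(np.mean)  # average feature value per node
max_colors = node_colors(np.max)    # largest feature value per node
print(mean_colors)
print(max_colors)
```

Swapping the statistic changes which nodes stand out: the outlying value in the last row dominates the maximum far more than the mean.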

### Pass `graph_step=False` to have Mapper behave like a `sklearn` clusterer

In [None]:
pipeline_no_graph = make_mapper_pipeline(graph_step=False)
labels = pipeline_no_graph.fit_transform(X)
labels[:10]

In [None]:
len(labels) == len(X)

### Use a different graph layout!

In [None]:
plotter = MapperInteractivePlotter(pipeline, X)
plotter.plot(layout='fruchterman_reingold',
             layout_dim=3,
             color_data=color_data,
             node_scale=30)

## Finer color control on static plots with ``plot_static_mapper_graph``

In interactive mode, the number and composition of Mapper nodes change whenever the pipeline parameters change, so it is not possible to hard-code node colors there. You can do so, however, by giving up pipeline interactivity and using the ``plot_static_mapper_graph`` function on a fixed pipeline. There is no object-oriented interface here: you obtain a static figure directly by calling
```
plot_static_mapper_graph(pipeline, X, **optional_keyword_arguments)
```
Custom node colors can then be passed via the optional keyword argument ``node_color_statistic``. The array of hard-coded node colors must have exactly one row per node in the graph obtained by applying the pipeline to ``X``. As usual, it can have as many columns as you like, and a dropdown lets you switch between the colors coming from different columns.

In [None]:
n_nodes = len(pipeline.fit_transform(X).vs)
node_color_statistic = np.random.randn(n_nodes, 2) # Random node colors
plot_static_mapper_graph(pipeline, X, node_color_statistic=node_color_statistic)

### Use the gap- and histogram-based clusterers `FirstSimpleGap` and `FirstHistogramGap`

In [None]:
from gtda.mapper import FirstSimpleGap, FirstHistogramGap

# Either clusterer can be passed to make_mapper_pipeline via `clusterer=`
simple_gap_clusterer = FirstSimpleGap()
histogram_gap_clusterer = FirstHistogramGap()