# Datagnosis Tutorial 01 - simple tabular example

*If you prefer, this tutorial is also available on [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/1PPcjl9jq6E4j3Qz0cZIQbbQTaeK2qH6b/view?usp=sharing).*

In this tutorial we will see how to use "hardness characterization method" (HCM) plugins to calculate hardness scores for the data points in a dataset. We will also plot these scores and extract some data points based on them. For this tutorial we will use the iris dataset from scikit-learn. For a more realistic dataset, check out tutorials 2 and 3!

OK, let's start!

First we import the logger from Datagnosis and set the logging level to "INFO". If something goes wrong and you want more detailed logs, change the level to "DEBUG". Conversely, if you don't want to see any logs, you can remove them with `log.remove()`.

In [None]:
import sys
import datagnosis.logger as log
log.add(sink=sys.stderr, level="INFO")

Next, load the dataset and combine the features and the target into a single `pandas.DataFrame` for display.

In [None]:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True, as_frame=True)
df = X.copy(deep=True)
df['target'] = y
display(df)

Pre-process the data if you like, for example by scaling it. The cell below splits the data into train and test sets and standardises the features.

The next key step is to pass the data to the `DataHandler` object provided by Datagnosis, giving the features and the labels separately. The features can be a `pandas.DataFrame`, `numpy.ndarray` or `torch.Tensor`. The labels can be a `pandas.Series`, `numpy.ndarray` or `torch.Tensor`.

In [None]:

from datagnosis.plugins.core.datahandler import DataHandler
from datagnosis.plugins.core.models.simple_mlp import SimpleMLP
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn

std_scaler = StandardScaler()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train = std_scaler.fit_transform(X_train)
X_test = std_scaler.transform(X_test)

datahander = DataHandler(X_train, y_train, batch_size=32)
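
To make the scaling step above concrete, here is a minimal pure-Python sketch of what standardisation does: the per-column mean and standard deviation are computed from the training rows only and then reused for the test rows. The helper names `fit_standardizer` and `apply_standardizer` and the sample rows are made up for illustration; in the tutorial itself, `StandardScaler` does this work.

```python
from statistics import fmean, pstdev


def fit_standardizer(rows):
    """Return per-column means and standard deviations of the training rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation (ddof=0), which is StandardScaler's default.
    # A constant column has std 0, so fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_standardizer(rows, means, stds):
    """Scale every value with statistics that were fitted on the training rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical measurements, two features per row.
train = [[5.1, 3.5], [4.9, 3.0], [6.2, 3.4], [5.9, 3.0]]
test = [[5.5, 3.2]]

means, stds = fit_standardizer(train)
train_scaled = apply_standardizer(train, means, stds)
test_scaled = apply_standardizer(test, means, stds)  # reuses the train statistics
```

Fitting on the training rows only keeps information about the test rows out of the pre-processing, which is why the cell above calls `fit_transform` on `X_train` but only `transform` on `X_test`.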

Now we define some values which we will pass to the plugin, such as the model that we want to use to classify the data.

In [None]:

# Create the model. We want to use it downstream, and the plugin will also use it to judge the hardness of the data points.
model = SimpleMLP()

# creating our optimizer and loss function objects
learning_rate = 0.01
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

Import the `Plugins` object from Datagnosis. By calling `list()` on it, we can see all the available plugins.

In [None]:
# datagnosis absolute
from datagnosis.plugins import Plugins

plugins = Plugins().list()
print(plugins)


Now we can call `get()` to load up a specific plugin from the list.

In [None]:
hcm = Plugins().get(
    "aum",
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr=learning_rate,
    epochs=10,
    num_classes=3,
    logging_interval=1,
)
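
For intuition about what a hardness score can measure, here is a minimal sketch of the Area Under the Margin (AUM) idea, assuming the `"aum"` plugin follows the published definition: at each epoch, take the logit of the assigned class minus the largest logit among the other classes, then average these margins over all epochs. The function names and logits below are invented for illustration, and this is not Datagnosis's implementation.

```python
def margin(logits, label):
    """Assigned-class logit minus the largest logit among the other classes."""
    assigned = logits[label]
    largest_other = max(v for i, v in enumerate(logits) if i != label)
    return assigned - largest_other


def area_under_margin(logits_per_epoch, label):
    """Average the margin of a single sample over all recorded epochs."""
    margins = [margin(logits, label) for logits in logits_per_epoch]
    return sum(margins) / len(margins)


# Made-up logits for two samples over three epochs (three classes, as in iris).
# Both samples carry the label 0.
easy_sample = [[2.0, 0.5, 0.1], [3.0, 0.2, 0.0], [4.0, 0.1, -0.5]]
hard_sample = [[0.4, 0.6, 0.1], [0.5, 0.9, 0.0], [0.7, 0.8, 0.2]]

aum_easy = area_under_margin(easy_sample, label=0)  # positive: confidently correct
aum_hard = area_under_margin(hard_sample, label=0)  # negative: another class wins
```

Under this definition, a sample that the model confidently assigns to its label gets a large positive score, while a sample that keeps losing to another class gets a low or negative one, so low scores indicate hard points.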


Next we need to `fit()` the plugin on the data held by the `DataHandler`.

In [None]:

hcm.fit(
    datahandler=datahander,
    use_caches_if_exist=True,
)

Now that the plugin has been fit, we can access the scores. First, let's get a description of the scores, then print them.

In [None]:
print(hcm.score_description())
print(hcm.scores)


Printed scores are difficult to digest, so we will plot them instead. One-dimensional scores can be plotted in two ways: with `plot_type="dist"` or `plot_type="scatter"`. Why not have a look at both types and compare?

In [None]:

hcm.plot_scores(axis=1, plot_type="dist")

Finally, the `extract_datapoints` method can be used to select data based on the hardness score. Available extraction methods include `"top_n"`, `"threshold"` and `"index"`. Give them all a go!

The following cell takes the 10 hardest data points and summarises them in a `pandas.DataFrame`.

In [None]:
import pandas as pd
print(f"Data points that are hard to classify have scores that are: {hcm.hard_direction()}")
hardest_10 = hcm.extract_datapoints(method="top_n", n=10)

display(pd.DataFrame(
    data={
        "indices":hardest_10[0][2],
        f"{X.columns[0]}": hardest_10[0][0].transpose(0,1)[0],
        f"{X.columns[1]}": hardest_10[0][0].transpose(0,1)[1],
        f"{X.columns[2]}": hardest_10[0][0].transpose(0,1)[2],
        f"{X.columns[3]}": hardest_10[0][0].transpose(0,1)[3],
        "labels": hardest_10[0][1],
        "scores": hardest_10[1],
    }
))
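
To illustrate what the three extraction methods named above might do conceptually, here is a small sketch that works on a plain list of scores. The helpers `top_n`, `by_threshold` and `by_index`, their parameters, and the `"low"`/`"high"` values for the hard direction are all assumptions made for this example; they are not the signature of `extract_datapoints`.

```python
def top_n(scores, n, hard_direction="low"):
    """Indices of the n hardest points, hardest first."""
    order = sorted(
        range(len(scores)),
        key=lambda i: scores[i],
        reverse=(hard_direction == "high"),
    )
    return order[:n]


def by_threshold(scores, threshold, hard_direction="low"):
    """Indices of points at least as hard as the threshold."""
    if hard_direction == "low":
        return [i for i, s in enumerate(scores) if s <= threshold]
    return [i for i, s in enumerate(scores) if s >= threshold]


def by_index(scores, indices):
    """(index, score) pairs for explicitly requested points."""
    return [(i, scores[i]) for i in indices]


# Hypothetical scores where low values mean "hard", as with a margin-based score.
scores = [2.1, -0.3, 1.7, 0.2, 3.0, -1.1]

hardest_two = top_n(scores, n=2)        # [5, 1]
below_half = by_threshold(scores, 0.5)  # [1, 3, 5]
picked = by_index(scores, [0, 4])       # [(0, 2.1), (4, 3.0)]
```

Whichever method you use, check `hcm.hard_direction()` first, because "hardest" means lowest for some scores and highest for others.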