# Welcome to the dataquality client demo

### A brief introduction to how the client works, with a look under the hood

## Installing

You can currently install dataquality from PyPI:
`pip install dataquality`

For development, you may want to install it from GitHub instead. This gives you the latest changes on the main branch:

`pip install git+https://www.github.com/rungalileo/dataquality.git`

You can also clone the repo and install from a local path. This is the recommended setup for development:

`pip install /path/to/dataquality/directory`

**(It's good to restart the kernel after an install)**

In [None]:
!pip install -q /path/to/dataquality

In [None]:
# If you have cloned the dataquality repo and are running this from the docs folder, you can run this
!pip install -q ../dataquality

In [None]:
# Or install latest from main
!pip install -qqq git+https://www.github.com/rungalileo/dataquality.git

## Components

The data quality client is currently very simple. It has just a few components:

* logging - the inputs and outputs to your model
* config - the urls, usernames, and passwords to interact with the server
* init - how you start a new project/run
* finish - how you end your run

## Getting started

To get started, simply `import dataquality`<br>
If your environment variables are set, the import completes without prompting. If not, you will be prompted for the URL and other config values.<br>

To bypass the prompt, set the following environment variables
* `GALILEO_CONSOLE_URL`

If you have your server (API, MinIO, MySQL) running locally for development, the following will work:
```
import os

os.environ['GALILEO_CONSOLE_URL']="http://localhost"
```

If you don't set these environment variables, the client will prompt you for the fields (assuming you're running from the newest code).
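
To see ahead of time which variables would trigger a prompt, you can inspect the environment before importing. The sketch below is only an illustration: `missing_env_vars` is a hypothetical helper, not part of dataquality, and it assumes an unset or empty variable counts as missing.

```python
import os
from typing import List, Mapping, Sequence


def missing_env_vars(
    names: Sequence[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the variable names that are unset or empty in `env`.

    Hypothetical helper for this demo; dataquality does its own checks.
    """
    return [name for name in names if not env.get(name)]


# Variable name taken from this notebook.
print("Would be prompted for:", missing_env_vars(["GALILEO_CONSOLE_URL"]))
```

An empty list means the import should not need to prompt for the console URL.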

### How do I get everything running locally?

See our [CONTRIBUTING](https://github.com/rungalileo/api/blob/main/CONTRIBUTING.md) doc.
(When running the API, `./scripts/run-gunicorn.sh` is enough; you don't need to run all of the scripts.)

In [None]:
import os

os.environ['GALILEO_CONSOLE_URL']="http://localhost"

In [None]:
# For dev cluster, run this cell

# import os
# os.environ['GALILEO_CONSOLE_URL']="https://console.dev.rungalileo.io"

In [None]:
import dataquality

## Logging in

Once you have dataquality imported, you can log into your server and start logging data<br>

To log in, you can call `dataquality.login()` <br>
This will prompt you for your auth method, email, and password. You can skip this prompt with the following environment variables:

* `GALILEO_USERNAME`
* `GALILEO_PASSWORD`

### How do I create a user?

If you are running everything locally, you can do the following to create the admin user.

**Note: If the admin user already exists, you cannot create another one.**

```
import requests

data={
  "email": "me@rungalileo.io",
  "first_name": "Me",
  "last_name": "Me",
  "username": "Galileo",
  "auth_method": "email",
  "password": "Th3secret_"
}

r = requests.post('http://localhost:8088/users/admin', json=data)
r.json()
```

Then set your env vars
```
import os

# Use the email returned by the API and the password you sent in `data`
os.environ["GALILEO_USERNAME"] = r.json()["email"]
os.environ["GALILEO_PASSWORD"] = data["password"]
```

If you don't set these environment variables, the client will prompt you for the fields (assuming you're running from the newest code).

Now log in

```
dataquality.login()
```


In [None]:
import requests

pwd = "MyPassword!123"

data={
  "email": "me@rungalileo.io",
  "first_name": "Me",
  "last_name": "Me",
  "username": "Galileo",
  "auth_method": "email",
  "password": pwd
}

r = requests.post(f'{dataquality.config.api_url}/users/admin', json=data)

import os

os.environ["GALILEO_USERNAME"]=f"{r.json()['email']}"
os.environ["GALILEO_PASSWORD"]=pwd

In [None]:
dataquality.login()

## Start my project/run

Now you can start using the tool with `dataquality.init()`<br>

You **must** provide a `task_type` when calling `init`
* A task type describes the kind of modeling you are doing (text classification, multi-label, NER etc).
* Currently the only available task is "text_classification"

You can optionally provide a project name for this run.

In [None]:
dataquality.init?

In [None]:
task = "text_classification"
# Base case
dataquality.init(task)

In [None]:
# New project, unset run (new)
dataquality.init(task_type=task, project_name="a_new_project")

In [None]:
# Existing project, unset run (new)
dataquality.init(task_type=task, project_name="a_new_project")

In [None]:
# Existing project, new run
dataquality.init(task_type=task, project_name="a_new_project", run_name="a_new_run")

In [None]:
# Existing project, existing run
dataquality.init(task_type=task, project_name="a_new_project", run_name="a_new_run")

In [None]:
# New project, new run
dataquality.init(task_type=task, project_name="a_new_project2", run_name="a_new_run2")

## Log to my project/run

Now that you've started your run, all you need to do is log data to it by calling the `dataquality.log_input_data` and `dataquality.log_model_outputs` functions.<br>

`dataquality.log_input_data` knows which task you are logging for, and accepts the proper arguments.
For "text_classification" it expects
* text - list of strings indicating the text input
* labels - list of strings indicating the labels
* split - string indicating the data split (training, validation, test)
* id (optional) - list of ints indicating the id of each row. If not provided, IDs will be added automatically
  * NOTE: This ID must match the output ID in log_model_outputs in order to join them for analysis

`dataquality.log_model_outputs` also knows which task you are logging for.
For "text_classification" it expects
* emb - list of lists of embedding values, one list per text input
* probs - list of lists of per-class probabilities, one list per text input
* split - string indicating the data split (training, validation, test)
* epoch - int indicating the training/test/validation epoch for the input
* ids - list of ints indicating the matching id of each input row
  * NOTE: This ID must match the ID of the corresponding logged input row in order to join them for analysis
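
The shape rules in the two lists above can be sketched as a small pre-flight check. This is a hypothetical helper (`check_input_batch` is not dataquality's own validation), and the fallback to ids `0..n-1` is an assumption made for illustration only.

```python
from typing import List, Optional, Sequence


def check_input_batch(
    text: Sequence[str],
    labels: Sequence[str],
    ids: Optional[Sequence[int]] = None,
) -> List[int]:
    """Validate one batch of inputs and return the row ids to use.

    Hypothetical helper; the client performs its own validation.
    """
    if len(text) != len(labels):
        raise ValueError(
            f"text has {len(text)} rows but labels has {len(labels)}"
        )
    if ids is None:
        # Assumption: number rows 0..n-1 when no ids are given.
        return list(range(len(text)))
    if len(ids) != len(text):
        raise ValueError(f"ids has {len(ids)} rows but text has {len(text)}")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    return list(ids)


row_ids = check_input_batch(["some text", "more text"], ["sci.space", "rec.autos"])
```

The ids returned here are the ones the model outputs for the same rows would need to carry.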


### log some data

We use `log_input_data` and `log_model_outputs` to log our inputs and model outputs

In [None]:
dataquality.init(task)

In [None]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

newsgroups = fetch_20newsgroups(subset="train", remove=('headers', 'footers', 'quotes'))

dataset = pd.DataFrame()
dataset["text"] = newsgroups.data
label_ind = newsgroups.target_names
dataset["label"] = [label_ind[i] for i in newsgroups.target]
dataset = dataset[:100]

dataquality.log_input_data(text=dataset['text'], labels=dataset['label'], split="train")
dataquality.log_input_data(text=dataset['text'], labels=dataset['label'], split="test")

## We validate data before logging

#### See what happens with invalid input

In [None]:
# Labels and text inputs don't match in shape
dataquality.log_input_data(text=dataset['text'], labels=dataset['label'][:3], split="train")

In [None]:
import numpy as np

# Generate fake model outputs
def log_fake_data(log_num: int = 0):
    # Ensure unique IDs
    # Because we're going to call this twice, we need the other dataset rows for the second call, so /2
    num_rows = len(dataset) // 2 
        
    emb = np.random.rand(num_rows, 800)
    prob = np.random.rand(num_rows, 20)
    for split in ['test','train']:
        epoch = 0
        
        r = range(num_rows*log_num, num_rows*(log_num+1))
        ids = list(r)
        dataquality.log_model_outputs(emb=emb, probs=prob, split=split, epoch=epoch, ids=ids)

log_fake_data()

In [None]:
!tree .galileo/logs/{dataquality.config.current_project_id}/{dataquality.config.current_run_id}

### What happened?

When you call `log_input_data` you are logging the input data for this training job. This would typically be run once (per split).<br>

Then, as you train your model in batches, each call to `log_model_outputs` takes the data in that batch, joins it to the input data, and stores it in three files: data, emb, and prob.<br>

If we were to log another fake dataset to this, we'd see another file in each dir (under the epoch we set).

The file names in each subdir will match so we can join them at the end
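
The join described above can be pictured with plain Python dictionaries. This is a minimal sketch under stated assumptions: `join_outputs_to_inputs` is a hypothetical function, and the client's actual storage format and join logic are not shown here.

```python
from typing import Any, Dict, List, Sequence


def join_outputs_to_inputs(
    inputs_by_id: Dict[int, Dict[str, Any]],
    ids: Sequence[int],
    emb: Sequence[Sequence[float]],
    probs: Sequence[Sequence[float]],
) -> List[Dict[str, Any]]:
    """Attach each output row to the input row with the same id.

    Hypothetical illustration of an id-based join, not the client's code.
    """
    if not (len(ids) == len(emb) == len(probs)):
        raise ValueError("ids, emb, and probs must have the same number of rows")
    joined = []
    for row_id, emb_row, prob_row in zip(ids, emb, probs):
        if row_id not in inputs_by_id:
            raise KeyError(f"no logged input with id {row_id}")
        joined.append(
            {
                "id": row_id,
                **inputs_by_id[row_id],
                "emb": list(emb_row),
                "prob": list(prob_row),
            }
        )
    return joined


inputs_by_id = {
    0: {"text": "first row", "label": "sci.space"},
    1: {"text": "second row", "label": "rec.autos"},
}
batch = join_outputs_to_inputs(
    inputs_by_id,
    ids=[1, 0],
    emb=[[0.1, 0.2], [0.3, 0.4]],
    probs=[[0.9, 0.1], [0.2, 0.8]],
)
```

Because rows are matched by id rather than by position, a batch can arrive in any order and still line up with its inputs.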

In [None]:
log_fake_data(1)

In [None]:
!tree .galileo/logs/{dataquality.config.current_project_id}/{dataquality.config.current_run_id}

## Take a look at our logged model outputs

Below is the model output data we've logged for the test split. You can see all of the values available across both logs<br>
To see the training data, change the `split` variable to `training`

In [None]:
import vaex

split = "test"
vaex.open(f'.galileo/logs/{dataquality.config.current_project_id}/{dataquality.config.current_run_id}/{split}/0/*.hdf5')


## How do I see my results in the UI?

Simply set your labels (`set_labels_for_run`) and call `finish()`

Once called, the data will be joined together at a _per-epoch_ level, and added to minio, with one file for each `prob`, `emb`, and `data` per split/epoch. 
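
As a rough picture of the per-epoch join, the sketch below groups logged batches by `(split, epoch)` and concatenates their rows. The batch dictionaries and the function `group_by_split_and_epoch` are hypothetical; the client's real file handling is not shown.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def group_by_split_and_epoch(
    batches: List[Dict[str, Any]],
) -> Dict[Tuple[str, int], List[Any]]:
    """Concatenate the rows of all batches sharing a split and epoch.

    Hypothetical illustration; batches keep their logging order.
    """
    grouped = defaultdict(list)
    for batch in batches:
        grouped[(batch["split"], batch["epoch"])].extend(batch["rows"])
    return dict(grouped)


batches = [
    {"split": "test", "epoch": 0, "rows": [0, 1]},
    {"split": "training", "epoch": 0, "rows": [0, 1]},
    {"split": "test", "epoch": 0, "rows": [2, 3]},
]
merged = group_by_split_and_epoch(batches)
```

Each key in `merged` corresponds to one combined output per split and epoch.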

A job will be kicked off to process your data on the server, and after it's done you'll see your results in the UI

#### Why do I need to set my labels?

Since your model is simply outputting probabilities, we have no way to map the index of each probability to a label name. Setting your labels enables us to map them so you can see the meaningful values in the UI.<br>
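
To illustrate why the label list matters, the sketch below turns rows of probabilities into label names by picking the highest-probability index. The function `predicted_labels` is hypothetical, and using the arg-max is an assumption for illustration; it is not a description of what the server does.

```python
from typing import List, Sequence


def predicted_labels(
    probs: Sequence[Sequence[float]], labels: Sequence[str]
) -> List[str]:
    """Map each probability row to the label at its highest-scoring index.

    Hypothetical helper; `labels` must be in the model's output order.
    """
    names = []
    for row in probs:
        if len(row) != len(labels):
            raise ValueError(
                f"row has {len(row)} probabilities but there are {len(labels)} labels"
            )
        best_index = max(range(len(row)), key=row.__getitem__)
        names.append(labels[best_index])
    return names


names = predicted_labels(
    [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]],
    ["alt.atheism", "comp.graphics", "sci.space"],
)
```

Without the ordered label list, the same rows could only be reported as class indices.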

If you have the UI running, you should see it at the URL returned.

**Note:** Check out your local API logs to see the background job!

In [None]:
dataquality.set_labels_for_run(newsgroups.target_names)
dataquality.finish()

## That should take ~10-20 seconds to complete (if you are running the server locally)

### Now we can export our results to a CSV

In [None]:
from dataquality.schemas.split import Split
from dataquality.clients.api import ApiClient
import pandas as pd

api_client = ApiClient()
pname, rname = api_client.get_project_run_name()
api_client.export_run(pname, rname, Split.training, "training_data.csv")

pd.read_csv("training_data.csv")

### (Dev only) we can also read the data from minio

In [None]:
from minio import Minio

url = dataquality.config.minio_url
client = Minio(url, 'minioadmin', 'minioadmin', secure=(':9000' not in url))
p = dataquality.config.current_project_id
r = dataquality.config.current_run_id
client.fget_object('galileo-project-runs-results', f'{p}/{r}/training/data/data.hdf5', 'training_data.hdf5')
client.fget_object('galileo-project-runs-results', f'{p}/{r}/test/data/data.hdf5', 'test_data.hdf5')

display(vaex.open('training_data.hdf5'))

display(vaex.open('test_data.hdf5'))