# Welcome to the dataquality client demo

### A brief introduction to how the client works, with a look under the hood

## Installing

You can currently install dataquality from PyPI
`pip install dataquality`

But for development, you may want to install it from GitHub. This will give you the latest changes on main

`pip install git+https://www.github.com/rungalileo/dataquality.git`

You can also clone the repo and install from a path. This is recommended for development

`pip install /path/to/dataquality/directory`

**(It's good to restart the kernel after an install)**

In [1]:
!pwd

/Users/anthcor/dataquality/docs


In [None]:
!pip install -q ..

In [None]:
# If you have cloned the dataquality repo and are running this from the docs folder, you can run this
!pip install -q ../dataquality

In [None]:
# Or install latest from main
!pip install -qqq git+https://www.github.com/rungalileo/dataquality.git

## Components

The data quality client is currently very simple. It has just a few components:

* logging - the inputs and outputs to your model
* config - the urls, usernames, and passwords to interact with the server
* init - how you start a new project/run
* finish - how you end your run

## Getting started

To get started, simply `import dataquality`<br>
If your environment variables are set, your import will pass through. If not, you will be prompted for some url and config variables.<br>

To bypass the prompt, set the following environment variable
* `GALILEO_CONSOLE_URL`

If you have your server (api, minio, mysql) running locally for development, the following will work
```
import os

os.environ['GALILEO_CONSOLE_URL']="http://localhost"
```

If you don't set this environment variable, the client will prompt you for it (assuming you're running from the newest code).

### How do I get everything running locally??

See our [CONTRIBUTING](https://github.com/rungalileo/api/blob/main/CONTRIBUTING.md) doc
(When running the API, use `./scripts/run-gunicorn.sh`; you don't need the other scripts)

In [1]:
import os

os.environ['GALILEO_CONSOLE_URL']="http://localhost"

In [2]:
# For dev cluster, run this cell

# import os
# os.environ['GALILEO_CONSOLE_URL']="https://console.dev.rungalileo.io"

In [1]:
import dataquality as dq

## Logging in

Once you have dataquality imported, you can log into your server and start logging data<br>

To log in, you can call `dataquality.login()` <br>
This will prompt you for your auth method, email, and password. You can skip this prompt with the following environment variables:

* `GALILEO_USERNAME`
* `GALILEO_PASSWORD`
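
If you would rather not hard-code a password in a notebook cell, one option is to fill these two variables interactively, and only when they are missing. The helper below is a sketch under that assumption: `ensure_credentials` is a hypothetical name, not part of dataquality, and it relies only on the two environment variable names listed above.

```python
import os
from getpass import getpass


def ensure_credentials(env=os.environ, ask=input, ask_secret=getpass):
    """Set GALILEO_USERNAME / GALILEO_PASSWORD only if they are not already set.

    Hypothetical helper (not part of dataquality). `ask` and `ask_secret`
    are injectable so the function can be exercised without a terminal.
    """
    if not env.get("GALILEO_USERNAME"):
        env["GALILEO_USERNAME"] = ask("Galileo email: ")
    if not env.get("GALILEO_PASSWORD"):
        # getpass hides the typed password instead of echoing it
        env["GALILEO_PASSWORD"] = ask_secret("Galileo password: ")
    return env["GALILEO_USERNAME"]
```

Calling `ensure_credentials()` before `dataquality.login()` should then let the login skip its own prompt, since both variables are set by the time it runs.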

### How do I create a user?

If you are running everything locally, you can do the following to create the admin user.

**Note: If the admin user already exists, you cannot create another one.**

```
import requests

data={
  "email": "me@rungalileo.io",
  "first_name": "Me",
  "last_name": "Me",
  "username": "Galileo",
  "auth_method": "email",
  "password": "Th3secret_"
}

r = requests.post('http://localhost:8088/users/admin', json=data)
r.json()
```

Then set your env vars
```
import os

# Reuse the values sent in the request above
os.environ["GALILEO_USERNAME"] = data["email"]
os.environ["GALILEO_PASSWORD"] = data["password"]
```

If you don't set these environment variables, the client will prompt you for the fields (assuming you're running from the newest code).

Now login

```
dataquality.login()
```


In [2]:
# import requests

# pwd = "MyPassword!123"

# data={
#   "email": "me@rungalileo.io",
#   "first_name": "Me",
#   "last_name": "Me",
#   "username": "Galileo",
#   "auth_method": "email",
#   "password": pwd
# }

# r = requests.post(f'{dq.config.api_url}/users/admin', json=data)

# import os

os.environ["GALILEO_USERNAME"]="user@example.com"
os.environ["GALILEO_PASSWORD"]="Th3secret_"

In [3]:
dq.login()

📡 http://localhost:8088
🔭 Logging you into Galileo

👀 Found auth method email set via env, skipping prompt.
🚀 You're logged in to Galileo as user@example.com!


## Start my project/run

Now you can start using the tool with `dataquality.init()`<br>

You **must** provide a `task_type` when calling `init`
* A task type describes the kind of modeling you are doing (text classification, multi-label, NER etc).
* Currently the only available task is "text_classification"

You can optionally provide a project name for this run.

In [4]:
dq.init?

In [5]:
task = "text_classification"
# Base case
dq.init(task)

✨ Initializing public project excellent_violet_landfowl
🏃‍♂️ Starting run melodic_fuchsia_fish
🛰 Created project, excellent_violet_landfowl, and new run, melodic_fuchsia_fish.


In [6]:
# New project, unset run (new)
dq.init(task_type=task, project_name="a_new_project")

💭 Project a_new_project was not found.
✨ Initializing public project a_new_project
🏃‍♂️ Starting run roasted_yellow_hippopotamus


In [7]:
# Existing project, unset run (new)
dq.init(task_type=task, project_name="a_new_project")

📡 Retrieved project, a_new_project, and starting a new run
🏃‍♂️ Starting run miniature_green_sailfish
🛰 Connected to project, a_new_project, and created run, miniature_green_sailfish.


In [8]:
# Existing project, new run
dq.init(task_type=task, project_name="a_new_project", run_name="a_new_run")

📡 Retrieving run from existing project, a_new_project
🏃‍♂️ Starting run a_new_run
🛰 Connected to project, a_new_project and created new run, a_new_run.


In [9]:
# Existing project, existing run
dq.init(task_type=task, project_name="a_new_project", run_name="a_new_run")

📡 Retrieving run from existing project, a_new_project
🛰 Connected to project, a_new_project, and run, a_new_run.




In [10]:
# New project, new run
dq.init(task_type=task, project_name="a_new_project2", run_name="a_new_run2")

💭 Project a_new_project2 was not found.
✨ Initializing public project a_new_project2
🏃‍♂️ Starting run a_new_run2
🛰 Created project, a_new_project2, and new run, a_new_run2.


## Log to my project/run

Now that you've started your run, all you need to do is log data to it.<br>

To do so, call `dataquality.log_data_samples` for your model's inputs and `dataquality.log_model_outputs` for its outputs.

`dataquality.log_data_samples` knows which task you are logging for, and accepts the proper arguments.
For "text_classification" it is expecting
* texts - list of strings indicating the text input
* labels - list of strings indicating the labels
* split - string indicating the data split (training, validation, test)
* ids - list of ints indicating the id of each row.
  * NOTE: This ID must match the output ID in log_model_outputs in order to join them for analysis

`dataquality.log_model_outputs` also knows which task you are logging for.
For "text_classification" it is expecting
* embs - list of lists of embedding values, one list per text input
* logits - list of lists of per-class scores, one list per text input
* split - string indicating the data split (training, validation, test)
* epoch - int indicating the training/test/validation epoch for the input
* ids - list of ints indicating the matching id to the input row
  * NOTE: This ID must match the input ID in log_data_samples in order to join them for analysis
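
Because the join depends entirely on the IDs, it can help to check a batch of output IDs against the logged input IDs before calling `log_model_outputs`. The function below is a minimal sketch of such a check; `check_ids_align` is a hypothetical helper, not a dataquality function, and the ID values are made up.

```python
def check_ids_align(input_ids, output_ids):
    """Return the output IDs that have no matching input row.

    Hypothetical pre-flight check, not a dataquality function.
    """
    if len(set(output_ids)) != len(output_ids):
        raise ValueError("output ids must be unique within a batch")
    known = set(input_ids)
    return sorted(i for i in output_ids if i not in known)


input_ids = list(range(100))   # ids passed to log_data_samples
batch_ids = [0, 1, 2, 250]     # ids about to be passed to log_model_outputs
unmatched = check_ids_align(input_ids, batch_ids)
print(unmatched)  # [250]
```

An empty result means every output row in the batch has an input row to join to.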


### log some data

We use `log_dataset` (for the input DataFrame) and `log_model_outputs` to log our data

In [11]:
dq.init(task)

✨ Initializing public project excited_chocolate_parakeet
🏃‍♂️ Starting run unaware_rose_catfish
🛰 Created project, excited_chocolate_parakeet, and new run, unaware_rose_catfish.


In [12]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

newsgroups = fetch_20newsgroups(subset="train", remove=('headers', 'footers', 'quotes'))

dataset = pd.DataFrame()
dataset["text"] = newsgroups.data
label_ind = newsgroups.target_names
dataset["label"] = [label_ind[i] for i in newsgroups.target]
dataset = dataset[:100]

# Add IDs to the dataset for logging
dataset["id"] = list(range(len(dataset)))

dq.log_dataset(dataset, split="train")
dq.log_dataset(dataset, split="test")

Exporting input data [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
Appending input data [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
 


## We validate data before logging

#### See what happens with invalid data (not enough labels and IDs)

In [13]:
# Labels and text inputs don't match in length
dq.log_data_samples(texts=dataset['text'], labels=dataset['label'][:3], split="train", ids=list(range(3)))

AssertionError: labels and text must be the same length, but got(labels, text) (3, 100)
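
To catch this kind of mismatch before anything reaches the client, a small length check of your own is enough. The sketch below is a hypothetical stand-in, not the client's actual validation: `check_same_length` is an invented name, and it raises `ValueError` rather than the `AssertionError` shown above.

```python
def check_same_length(texts, labels, ids):
    """Raise before logging if the parallel inputs disagree in length.

    Hypothetical stand-in for the client's own validation.
    """
    lengths = {"texts": len(texts), "labels": len(labels), "ids": len(ids)}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"inputs must be the same length, got {lengths}")


texts = ["a", "b", "c"]
try:
    # One label for three texts: this should be rejected
    check_same_length(texts, labels=["x"], ids=[0, 1, 2])
except ValueError as err:
    message = str(err)
    print(message)
```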

In [14]:
import numpy as np

# Generate fake model outputs
def log_fake_data(log_num: int = 0):
    # Ensure unique IDs
    # Because we're going to call this twice, we need the other dataset rows for the second call, so /2
    num_rows = len(dataset) // 2 
        
    embs = np.random.rand(num_rows, 800)
    logits = np.random.rand(num_rows, 20)
    for split in ['test','train']:
        epoch = 0
        
        r = range(num_rows*log_num, num_rows*(log_num+1))
        ids = list(r)
        dq.log_model_outputs(embs=embs, logits=logits, split=split, epoch=epoch, ids=ids)

log_fake_data()

In [15]:
!tree ~/.galileo/logs/{dq.config.current_project_id}/{dq.config.current_run_id}

/Users/anthcor/.galileo/logs/42ce0d80-fde2-4fd8-886b-f002fb5f14de/d816e0dc-849d-4e8d-8091-0b762f8c9cd0
├── input_data.arrow
├── test
│   └── 0
│       └── 90c8553e6986.hdf5
└── training
    └── 0
        └── 689fe0b90f8a.hdf5

4 directories, 3 files


### What happened?

When you call `log_dataset` (or `log_data_samples`), you are logging the input data for this training job. This would typically be run once (per split).<br>

Then, as you train your model in batches, each call to `log_model_outputs` takes the data in that batch and stores it in its own `.hdf5` file under `<split>/<epoch>/`, as the tree output above shows.<br>

If we were to log another fake dataset to this, we'd see another file in each dir (under the epoch we set).

Each file gets its own name; the `id` column is what lets us join the outputs to the inputs at the end
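
The directory layout can be mimicked in a few lines. This is only an illustration of the `<split>/<epoch>/<file>` structure seen in the tree output, not the client's implementation: `write_batch` is a hypothetical function, the random file name is an assumption about how names might be generated, and the sketch writes placeholder bytes to `.bin` files where the real client writes `.hdf5` files.

```python
import tempfile
import uuid
from pathlib import Path


def write_batch(root, split, epoch, payload):
    """Write one batch to <root>/<split>/<epoch>/<random-name>.bin.

    Illustrative only: this writes raw placeholder bytes just to show
    the directory layout and the one-file-per-call naming.
    """
    batch_dir = Path(root) / split / str(epoch)
    batch_dir.mkdir(parents=True, exist_ok=True)
    # 12 hex characters, similar in shape to the names in the tree output
    path = batch_dir / f"{uuid.uuid4().hex[:12]}.bin"
    path.write_bytes(payload)
    return path


root = tempfile.mkdtemp()
for _ in range(2):  # two logging calls per split
    for split in ("test", "training"):
        write_batch(root, split, epoch=0, payload=b"fake batch")

# Collect (split, epoch) for every file written
files = sorted(p.relative_to(root).parts[:2] for p in Path(root).rglob("*.bin"))
print(files)
```

Each call adds one file to the matching split and epoch directory, so two calls per split leave two files in each.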

In [16]:
log_fake_data(1)

In [17]:
!tree ~/.galileo/logs/{dq.config.current_project_id}/{dq.config.current_run_id}

/Users/anthcor/.galileo/logs/42ce0d80-fde2-4fd8-886b-f002fb5f14de/d816e0dc-849d-4e8d-8091-0b762f8c9cd0
├── input_data.arrow
├── test
│   └── 0
│       ├── 90c8553e6986.hdf5
│       └── c9f0272c242f.hdf5
└── training
    └── 0
        ├── 689fe0b90f8a.hdf5
        └── fc1358905b08.hdf5

4 directories, 5 files


## Take a look at our logged model outputs

Below is the model output data we've logged for the test split. You can see all of the values available across both logging calls<br>
To see the training data, set `split` to `training`

In [18]:
import vaex
from pathlib import Path

split = "test"
vaex.open(f'{Path.home()}/.galileo/logs/{dq.config.current_project_id}/{dq.config.current_run_id}/{split}/0/*.hdf5')


| # | data_schema_version | emb | epoch | id | pred | prob | split |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 'array([8.57106623e-01, 8.93980231e-01, 6.313378... | 0 | 0 | 13 | 'array([0.04859771, 0.05946847, 0.04279913, 0.05... | b'test' |
| 1 | 1 | 'array([0.35334157, 0.65359292, 0.30023309, 0.33... | 0 | 1 | 10 | 'array([0.05875081, 0.07051047, 0.03753773, 0.03... | b'test' |
| 2 | 1 | 'array([1.08380079e-01, 9.05909130e-01, 9.835939... | 0 | 2 | 4 | 'array([0.04596923, 0.07366717, 0.06822204, 0.05... | b'test' |
| 3 | 1 | 'array([0.85230045, 0.57928173, 0.34920781, 0.33... | 0 | 3 | 3 | 'array([0.04394911, 0.07299215, 0.0611209 , 0.07... | b'test' |
| 4 | 1 | 'array([0.19895986, 0.04940142, 0.04691344, 0.27... | 0 | 4 | 14 | 'array([0.03113366, 0.0478598 , 0.05287021, 0.05... | b'test' |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 1 | 'array([5.57126912e-01, 6.27893930e-01, 3.404685... | 0 | 95 | 9 | 'array([0.04180124, 0.06067209, 0.05874632, 0.07... | b'test' |
| 96 | 1 | 'array([3.26800281e-01, 5.57304539e-02, 3.614219... | 0 | 96 | 5 | 'array([0.08142364, 0.05431067, 0.0339393 , 0.07... | b'test' |
| 97 | 1 | 'array([0.48076435, 0.50786567, 0.51101304, 0.02... | 0 | 97 | 0 | 'array([0.07782885, 0.07363639, 0.05245695, 0.03... | b'test' |
| 98 | 1 | 'array([1.91997988e-01, 6.01161071e-01, 7.477167... | 0 | 98 | 13 | 'array([0.04381874, 0.06332405, 0.0422959 , 0.05... | b'test' |
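
The `prob` values above look like rows of a probability distribution over the 20 classes, and `pred` like the index of the largest entry in each row. Assuming the client derives both from the logged `logits` with a softmax and an argmax (an inference from this output, not something the notebook confirms), the relationship can be sketched with NumPy. The `softmax` function here is written for illustration and is not taken from dataquality.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
logits = rng.random((5, 20))   # 5 rows, 20 classes, like the fake logits logged earlier
probs = softmax(logits)
preds = probs.argmax(axis=1)   # index of the most likely class per row

print(probs.sum(axis=1))       # each row sums to 1, up to floating-point error
print(preds)
```

Softmax preserves the ordering within a row, so the predicted index is the same whether it is taken from the logits or from the probabilities.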


## How do I see my results in the UI?

Simply set your labels (`set_labels_for_run`) and call `finish()`

Once called, the data will be joined together at a _per-epoch_ level, and added to minio, with one file for each `prob`, `emb`, and `data` per split/epoch. 

A job will be kicked off to process your data on the server, and after it's done you'll see your results in the UI
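
Conceptually, the per-epoch join amounts to stacking the logged batches and matching them to the input rows by `id`. The pandas sketch below shows that idea on made-up data; the column names and values are illustrative assumptions, and the real client works on its own file formats rather than on in-memory DataFrames like these.

```python
import pandas as pd

# Made-up input rows and two made-up output batches for one split/epoch
inputs = pd.DataFrame({"id": [0, 1, 2, 3], "text": ["a", "b", "c", "d"]})
batch_1 = pd.DataFrame({"id": [0, 1], "epoch": 0, "pred": [13, 10]})
batch_2 = pd.DataFrame({"id": [2, 3], "epoch": 0, "pred": [4, 3]})

# Stack the batches, then attach the input columns by id.
# validate="one_to_one" raises if an id appears more than once on either side.
outputs = pd.concat([batch_1, batch_2], ignore_index=True)
joined = outputs.merge(inputs, on="id", how="left", validate="one_to_one")
print(joined)
```

With a left join, an output row whose `id` has no matching input would end up with a missing `text`, which is one way a mismatched ID would show up.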

#### Why do I need to set my labels?

Since your model is simply outputting probabilities, we have no way to map the index of each prediction to a label name. Setting your labels enables us to map them so you can see the meaningful values in the UI.<br>
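
The mapping itself is a lookup by index, which is why the order of the list passed to `set_labels_for_run` presumably has to match the order of the model's output classes. A minimal sketch with a shortened, illustrative label list and made-up probabilities (`to_label` is a hypothetical helper, not part of dataquality):

```python
labels = ["alt.atheism", "comp.graphics", "sci.space"]  # shortened, illustrative list
probs = [[0.2, 0.1, 0.7], [0.6, 0.3, 0.1]]              # made-up model outputs


def to_label(prob_row, labels):
    """Map the highest-probability index of one row to its label name."""
    best = max(range(len(prob_row)), key=prob_row.__getitem__)
    return labels[best]


named = [to_label(row, labels) for row in probs]
print(named)  # ['sci.space', 'alt.atheism']
```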

If you have the UI running, you should see it at the URL returned.

**Note:** Check out your local API logs to see the background job!

In [19]:
dq.set_labels_for_run(newsgroups.target_names)
dq.finish()

☁️ Uploading Data


training:   0%|          | 0/1 [00:00<?, ?it/s]

Combining batches for upload:   0%|          | 0/2 [00:00<?, ?it/s]

training (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

test:   0%|          | 0/1 [00:00<?, ?it/s]

Combining batches for upload:   0%|          | 0/2 [00:00<?, ?it/s]

test (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

🧹 Cleaning up
Job default successfully submitted. Results will be available soon at http://127.0.0.1:3000/insights?projectId=42ce0d80-fde2-4fd8-886b-f002fb5f14de&runId=9541d893-bbe6-429c-b071-22aee032bb5e&split=training&taskType=0&depHigh=1&depLow=0


{'project_id': '42ce0d80-fde2-4fd8-886b-f002fb5f14de',
 'run_id': '9541d893-bbe6-429c-b071-22aee032bb5e',
 'job_name': 'default',
 'labels': ['alt.atheism',
  'comp.graphics',
  'comp.os.ms-windows.misc',
  'comp.sys.ibm.pc.hardware',
  'comp.sys.mac.hardware',
  'comp.windows.x',
  'misc.forsale',
  'rec.autos',
  'rec.motorcycles',
  'rec.sport.baseball',
  'rec.sport.hockey',
  'sci.crypt',
  'sci.electronics',
  'sci.med',
  'sci.space',
  'soc.religion.christian',
  'talk.politics.guns',
  'talk.politics.mideast',
  'talk.politics.misc',
  'talk.religion.misc'],
 'task_type': 0,
 'tasks': None,
 'non_inference_logged': False,
 'migration_name': None,
 'message': 'Processing dataquality!',
 'link': 'http://127.0.0.1:3000/insights?projectId=42ce0d80-fde2-4fd8-886b-f002fb5f14de&runId=9541d893-bbe6-429c-b071-22aee032bb5e&split=training&taskType=0&depHigh=1&depLow=0'}

## That should take ~10-20 seconds to complete (if you are running the server locally)



In [20]:
# from time import sleep
# from tqdm.notebook import tqdm


# for i in tqdm(range(20)):
#     sleep(1)
dq.get_dq_log_file()

Your logfile has been written to /Users/anthcor/.galileo/out/9541d893-bbe6-429c-b071-22aee032bb5e/out.log


'/Users/anthcor/.galileo/out/9541d893-bbe6-429c-b071-22aee032bb5e/out.log'

### Now we can export our results to a CSV

In [None]:
from dataquality.schemas.split import Split
from dataquality.clients.api import ApiClient
import pandas as pd

api_client = ApiClient()
pname, rname = api_client.get_project_run_name()
api_client.export_run(pname, rname, Split.training, "training_data.csv")

pd.read_csv("training_data.csv")

### (Local only) we can also read the data from minio

In [None]:
from minio import Minio

url = dq.config.minio_url
client = Minio(url, 'minioadmin', 'minioadmin', secure=(':9000' not in url))
p = dq.config.current_project_id
r = dq.config.current_run_id
client.fget_object('galileo-project-runs-results', f'{p}/{r}/training/data/data.hdf5', 'training_data.hdf5')
client.fget_object('galileo-project-runs-results', f'{p}/{r}/test/data/data.hdf5', 'test_data.hdf5')

display(vaex.open('training_data.hdf5'))

display(vaex.open('test_data.hdf5'))