# Welcome to the dataquality client demo

### A brief introduction to how the client works, with a look under the hood

## Installing

You can currently install dataquality from PyPI:
`pip install dataquality`

For development, you may want to install it from GitHub instead. This gives you the latest changes in master:

`pip install git+https://www.github.com/rungalileo/dataquality.git`

You can also clone the repo and install from a path. This is recommended for development

`pip install /path/to/dataquality/directory`

**(It's good to restart the kernel after an install)**

In [18]:
!pip install -q /Users/benepstein/Documents/GitHub/dataquality


In [1]:
!pip install -qqq git+https://www.github.com/rungalileo/dataquality.git

## Components

The data quality client is currently very simple. It has just a few components:

* logging - the inputs and outputs to your model
* config - the urls, usernames, and passwords to interact with the server
* init - how you start a new project/run
* finish - how you end your run

## Getting started

To get started, simply `import dataquality`<br>
If your environment variables are set, the import will pass straight through. If not, you will be prompted for the URL and config values.<br>

To bypass the prompt, set the following environment variables:
* `GALILEO_API_URL`
* `GALILEO_MINIO_URL`
* `GALILEO_MINIO_ACCESS_KEY`
* `GALILEO_MINIO_SECRET_KEY`
* `GALILEO_MINIO_REGION`
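
Before importing, it can help to check which of these variables are still unset. The snippet below is a minimal sketch using only the standard library; `missing_galileo_vars` is a hypothetical helper, not part of dataquality, and treating an empty string as "unset" is an assumption.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = [
    "GALILEO_API_URL",
    "GALILEO_MINIO_URL",
    "GALILEO_MINIO_ACCESS_KEY",
    "GALILEO_MINIO_SECRET_KEY",
    "GALILEO_MINIO_REGION",
]


def missing_galileo_vars(env=None):
    """Return the required names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# An empty list means the import should not need to prompt for these values.
print(missing_galileo_vars())
```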

If you have your server (API, MinIO, MySQL) running locally for development, the following will work:
```
import os

os.environ['GALILEO_API_URL']="http://localhost:8000"
os.environ['GALILEO_MINIO_URL']="127.0.0.1:9000"
os.environ['GALILEO_MINIO_ACCESS_KEY']="minioadmin"
os.environ['GALILEO_MINIO_SECRET_KEY']="minioadmin"
os.environ['GALILEO_MINIO_REGION']="us-east-1"
```

If you don't set these environment variables, the client will prompt you for the fields (assuming you're running from the newest code).

### How do I get everything running locally?

See our [CONTRIBUTING](https://github.com/rungalileo/api/blob/main/CONTRIBUTING.md) doc.
(When running the API, use the `./scripts/run-gunicorn.sh` script. You don't need to run all of the scripts.)

In [2]:
import os

os.environ['GALILEO_API_URL']="http://localhost:8088"
os.environ['GALILEO_MINIO_URL']="127.0.0.1:9000"
os.environ['GALILEO_MINIO_ACCESS_KEY']="minioadmin"
os.environ['GALILEO_MINIO_SECRET_KEY']="minioadmin"
os.environ['GALILEO_MINIO_REGION']="us-east-1"

In [3]:
import dataquality

## Logging in

Once you have dataquality imported, you can log into your server and start logging data<br>

To log in, you can call `dataquality.login()` <br>
This will prompt you for your auth method, email, and password. You can skip this prompt with the following environment variables:

* `GALILEO_AUTH_METHOD`
* `GALILEO_USERNAME`
* `GALILEO_PASSWORD`

### How do I create a user?

If you are running everything locally, you can do the following

```
import requests

data={
  "email": "me@rungalileo.io",
  "first_name": "Me",
  "last_name": "Me",
  "username": "Galileo",
  "auth_method": "email",
  "password": "Th3secret_"
}

r = requests.post('http://localhost:8088/users', json=data)
r.json()
```

Then set your env vars
```
import os

os.environ["GALILEO_AUTH_METHOD"]="email"
os.environ["GALILEO_USERNAME"]=r.json()["email"]
# Reuse the password sent in the request above
os.environ["GALILEO_PASSWORD"]=data["password"]
```

If you don't set these environment variables, the client will prompt you for the fields (assuming you're running from the newest code).

Now log in:

```
dataquality.login()
```


In [1]:
import requests

pwd = "MyPassword!123"

data={
  "email": "me@rungalileo.io",
  "first_name": "Me",
  "last_name": "Me",
  "username": "Galileo",
  "auth_method": "email",
  "password": pwd
}

r = requests.post('http://localhost:8088/users', json=data)
r.json()

import os

os.environ["GALILEO_AUTH_METHOD"]="email"
os.environ["GALILEO_USERNAME"]=f"{r.json()['email']}"
os.environ["GALILEO_PASSWORD"]=pwd

In [4]:
dataquality.login()

🔭 Logging you into Galileo

👀 Found auth method email set via env, skipping prompt.
🚀 You're logged in to Galileo as me@rungalileo.io!


## Start my project/run

Now you can start using the tool with `dataquality.init()`<br>

You can optionally provide a project name and a run name. If you leave one out, a name is generated for you.

In [5]:
dataquality.init?

In [10]:
# Base case
dataquality.init()

✨ Initializing project anxious_gold_rooster
🏃‍♂️ Starting run safe_tan_puffin
🛰 Created project, anxious_gold_rooster, and new run, safe_tan_puffin.


In [13]:
# New project, unset run (new)
dataquality.init(project_name="a_new_project")

📡 Retrieved project, a_new_project, and starting a new run
🏃‍♂️ Starting run neutral_tomato_wallaby
🛰 Connected to project, a_new_project, and created run, neutral_tomato_wallaby.


In [16]:
# Existing project, unset run (new)
dataquality.init(project_name="a_new_project")

📡 Retrieved project, a_new_project, and starting a new run
🏃‍♂️ Starting run uncertain_tan_halibut
🛰 Connected to project, a_new_project, and created run, uncertain_tan_halibut.


In [19]:
# Existing project, new run
dataquality.init(project_name="a_new_project", run_name="a_new_run")

📡 Retrieving existing run from project, a_new_project
🛰 Connected to project, a_new_project, and run, a_new_run.


In [25]:
# Existing project, existing run
dataquality.init(project_name="a_new_project", run_name="a_new_run")

📡 Retrieving existing run from project, a_new_project
🛰 Connected to project, a_new_project, and run, a_new_run.


In [27]:
# New project, new run
dataquality.init(project_name="a_new_project2", run_name="a_new_run2")

📡 Retrieving existing run from project, a_new_project2
🛰 Connected to project, a_new_project2, and run, a_new_run2.


## Log to my project/run

Now that you've started your run, all you need to do is log data to it.<br>

To make that easier, we've created the `GalileoModelConfig` and `GalileoDataConfig` classes that you can log with. They have smart validators that will tell you if something is wrong.<br>

Your `GalileoDataConfig` takes your text, label, and (optional) id fields. If ids aren't provided, we will make them for you.<br>

Your `GalileoModelConfig` takes your embedding, probability, and id fields.

In [32]:
from dataquality.core.integrations.config import GalileoModelConfig, GalileoDataConfig

help(GalileoDataConfig)

Help on class GalileoDataConfig in module dataquality.core.integrations.config:

class GalileoDataConfig(builtins.object)
 |  GalileoDataConfig(text: List[str] = None, labels: List[str] = None, ids: List[Union[int, str]] = None, split: dataquality.schemas.split.Split = None) -> None
 |  
 |  Class for storing training data metadata to be logged to Galileo. Separate
 |  GalileoDataConfigs will be created for training, validation, and testing data
 |  * text: The raw text inputs for model training. List[str]
 |  * labels: the ground truth labels aligned to each text field. List[Union[str,int]]
 |  * ids: Optional unique indexes for each record. If not provided, will default to
 |  the index of the record. Optional[List[Union[int,str]]]
 |  
 |  Methods defined here:
 |  
 |  __init__(self, text: List[str] = None, labels: List[str] = None, ids: List[Union[int, str]] = None, split: dataquality.schemas.split.Split = None) -> None
 |      Initialize self.  See help(type(self)) for accurate s

In [36]:
help(GalileoModelConfig)

Help on class GalileoModelConfig in module dataquality.core.integrations.config:

class GalileoModelConfig(builtins.object)
 |  GalileoModelConfig(emb: List[List[Union[int, float]]] = None, probs: List[List[float]] = None, ids: List[Union[int, str]] = None, split: Optional[str] = None, epoch: Optional[int] = None) -> None
 |  
 |  Class for storing model metadata to be logged to Galileo.
 |  * Embeddings: List[List[Union[int,float]]]
 |  * Probabilities from forward passes during model training/evaluation.
 |  List[List[float]]
 |  * ids: Indexes of each input field: List[Union[int,str]]
 |  
 |  Methods defined here:
 |  
 |  __init__(self, emb: List[List[Union[int, float]]] = None, probs: List[List[float]] = None, ids: List[Union[int, str]] = None, split: Optional[str] = None, epoch: Optional[int] = None) -> None
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __setattr__(self, key: Any, value: Any) -> None
 |      Implement setattr(self, name, value).

### Use them to log data

We use `log_batch_input_data` and `log_model_outputs` to log our metadata.

In [91]:
dataquality.init()

✨ Initializing project visual_blush_wallaby
🏃‍♂️ Starting run innocent_emerald_caribou
🛰 Created project, visual_blush_wallaby, and new run, innocent_emerald_caribou.


In [92]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

newsgroups = fetch_20newsgroups(subset="train", remove=('headers', 'footers', 'quotes'))

dataset = pd.DataFrame()
dataset["text"] = newsgroups.data
label_ind = newsgroups.target_names
dataset["label"] = [label_ind[i] for i in newsgroups.target]
dataset = dataset[:5]

data_conf = GalileoDataConfig(text=dataset['text'], labels=dataset['label'], split="train")

dataquality.log_batch_input_data(data_conf)

In [93]:
## See what happens with an invalid data config

# Labels and text inputs don't match in length
bad_data_conf = GalileoDataConfig(text=dataset['text'], labels=dataset['label'][:3], split="train")

dataquality.log_batch_input_data(bad_data_conf)

GalileoException: labels and text must be the same length, but got(labels, text) (3,5)
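
The error above comes from a length check on the config. As a rough illustration of that kind of validation (not the library's actual code), here is a minimal sketch; `check_same_length` is a hypothetical helper, and it raises a plain `ValueError` rather than `GalileoException`.

```python
from typing import Sequence


def check_same_length(text: Sequence[str], labels: Sequence[str]) -> None:
    """Hypothetical stand-in for the length validation shown above."""
    if len(labels) != len(text):
        raise ValueError(
            "labels and text must be the same length, "
            f"but got (labels, text) ({len(labels)},{len(text)})"
        )


# Matching lengths pass silently.
check_same_length(["a", "b"], ["x", "y"])

# Mismatched lengths raise, mirroring the 3-vs-5 case above.
try:
    check_same_length(["a", "b", "c", "d", "e"], ["x", "y", "z"])
except ValueError as err:
    print(err)
```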

In [94]:
import numpy as np

# Generate fake model outputs
def log_fake_data():
    emb = np.random.rand(len(dataset), 800)
    prob = np.random.rand(len(dataset), 20)
    split = "train"
    epoch = 0

    model_conf = GalileoModelConfig(emb=emb, probs=prob, split=split, epoch=epoch, ids=list(range(len(dataset))))
    dataquality.log_model_outputs(model_conf)

log_fake_data()

In [95]:
!tree .galileo/logs/{dataquality.config.current_project_id}/{dataquality.config.current_run_id}

.galileo/logs/1724d7cf-8c8e-4a1a-ae79-b971645a7446/e2677774-9190-446f-8689-e53841bef758
├── input_data.jsonl
└── training
    └── 0
        ├── data
        │   └── bd1a65d7c8ce.arrow
        ├── emb
        │   └── bd1a65d7c8ce.arrow
        └── prob
            └── bd1a65d7c8ce.arrow

5 directories, 4 files


### What happened?

When you call `log_batch_input_data` you are logging the input data for this training job. This would typically be run once (per split).<br>

Then, as you train your model in batches, each call to `log_model_outputs` takes the data in that batch, joins it to the input data, and stores it in three files: `data`, `emb`, and `prob`.<br>

If we were to log another fake batch to this run, we'd see another file in each directory (under the epoch we set).

The file names in each subdir will match so we can join them at the end
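
To make the layout concrete, here is a minimal sketch of the shared-file-name idea using only the standard library. It is an illustration under assumptions, not the client's implementation: `write_batch` is a hypothetical helper, JSON files stand in for the Arrow files, and the random name is generated with `uuid`.

```python
import json
import tempfile
import uuid
from pathlib import Path

KINDS = ("data", "emb", "prob")


def write_batch(root: Path, split: str, epoch: int, batch: dict) -> str:
    """Write one batch as three files that share a random name (hypothetical helper)."""
    name = uuid.uuid4().hex[:12]
    for kind in KINDS:
        out_dir = root / split / str(epoch) / kind
        out_dir.mkdir(parents=True, exist_ok=True)
        # JSON stands in for the Arrow files the client writes.
        (out_dir / f"{name}.json").write_text(json.dumps(batch[kind]))
    return name


root = Path(tempfile.mkdtemp())
batch = {
    "data": [{"id": 0, "text": "hello"}],
    "emb": [[0.1, 0.2]],
    "prob": [[0.9, 0.1]],
}

# One call per logged batch; each call adds one file to every subdirectory.
write_batch(root, "training", 0, batch)
write_batch(root, "training", 0, batch)

names = {
    kind: sorted(p.name for p in (root / "training" / "0" / kind).iterdir())
    for kind in KINDS
}
print(names)
```

Because each batch uses the same file name in all three subdirectories, the pieces of a batch can be matched up again by name alone.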

In [96]:
log_fake_data()

In [97]:
!tree .galileo/logs/{dataquality.config.current_project_id}/{dataquality.config.current_run_id}

.galileo/logs/1724d7cf-8c8e-4a1a-ae79-b971645a7446/e2677774-9190-446f-8689-e53841bef758
├── input_data.jsonl
└── training
    └── 0
        ├── data
        │   ├── bd1a65d7c8ce.arrow
        │   └── c1c1fbb45da1.arrow
        ├── emb
        │   ├── bd1a65d7c8ce.arrow
        │   └── c1c1fbb45da1.arrow
        └── prob
            ├── bd1a65d7c8ce.arrow
            └── c1c1fbb45da1.arrow

5 directories, 7 files


## How do I see my results in the UI?

Simply set your labels (`set_labels_for_run`) and call `finish()`

Once called, the data will be joined together at a _per-epoch_ level, and added to minio, with one file for each `prob`, `emb`, and `data` per split/epoch. 

A job will be kicked off to process your data on the server, and after it's done you'll see your results in the UI.

#### Why do I need to set my labels?

Since your model simply outputs probabilities, we have no way to map the index of each prediction to a label name. Setting your labels lets us make that mapping so you can see meaningful values in the UI.<br>
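
As a small illustration of that mapping, the sketch below turns each row of probabilities into a label name by taking the index of the largest value. The three labels and the probability values are made up for the example, and the `argmax` helper is hypothetical; the order of `labels` is assumed to match the order of the model's output columns.

```python
# Label names, in the same order as the model's output columns (assumed).
labels = ["alt.atheism", "comp.graphics", "sci.space"]

# One row of class probabilities per sample (made-up values).
probs = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]


def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)


# Without the label list, all we would have is the integer index.
pred_idx = [argmax(row) for row in probs]
pred_labels = [labels[i] for i in pred_idx]

print(pred_idx)     # [1, 0, 2]
print(pred_labels)  # ['comp.graphics', 'alt.atheism', 'sci.space']
```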

If you have the UI running, you should see it at the URL returned.

**Note:** Check out your local API logs to see the background job!

In [98]:
dataquality.set_labels_for_run(newsgroups.target_names)
dataquality.finish()

☁️ Uploading Data
🧹 Cleaning up
Job default successfully submitted.


{'project_id': '1724d7cf-8c8e-4a1a-ae79-b971645a7446',
 'run_id': 'e2677774-9190-446f-8689-e53841bef758',
 'proc_name': 'default',
 'labels': ['alt.atheism',
  'comp.graphics',
  'comp.os.ms-windows.misc',
  'comp.sys.ibm.pc.hardware',
  'comp.sys.mac.hardware',
  'comp.windows.x',
  'misc.forsale',
  'rec.autos',
  'rec.motorcycles',
  'rec.sport.baseball',
  'rec.sport.hockey',
  'sci.crypt',
  'sci.electronics',
  'sci.med',
  'sci.space',
  'soc.religion.christian',
  'talk.politics.guns',
  'talk.politics.mideast',
  'talk.politics.misc',
  'talk.religion.misc'],
 'message': 'Processing dataquality!',
 'link': 'http://127.0.0.1:3000/projects/1724d7cf-8c8e-4a1a-ae79-b971645a7446/runs/e2677774-9190-446f-8689-e53841bef758'}

In [100]:
from dataquality.core.init import _Init

client = _Init()
client.get_project_run_by_name_for_user("a_new_project2", "a_new_run2")


{'name': 'a_new_run2',
 'project_id': '140fd905-1efd-4adf-b929-f962364f056d',
 'id': 'b034c5de-1295-4800-8a44-7ecaf6d49638',
 'created_at': '2021-11-15T19:23:40.513716',
 'updated_at': '2021-11-15T19:23:40.513880'}