# Welcome to the dataquality client demo

### A brief introduction to how the client works, with a look under the hood

## Installing

You can currently install dataquality from PyPI:
`pip install dataquality`

But for development, you may want to install it from GitHub. This will give you the latest changes in master:

`pip install git+https://www.github.com/rungalileo/dataquality.git`

You can also clone the repo and install from a path. This is recommended for development

`pip install /path/to/dataquality/directory`

**(It's good to restart the kernel after an install)**

In [None]:
!pip install -q /path/to/dataquality

In [6]:
# If you have cloned the dataquality repo and are running this from the tests/notebooks folder, you can run this
!pip install ../../../dataquality

Processing /Users/benepstein/Documents/GitHub/dataquality
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done


Building wheels for collected packages: dataquality
  Building wheel for dataquality (pyproject.toml) ... done
  Created wheel for dataquality: filename=dataquality-0.0.4-py3-none-any.whl size=48594 sha256=788568b60060000f779ee4a7a11f8aeb9b7cdd9fe6c5a1d38376d997c3772b0b
  Stored in directory: /private/var/folders/3d/d0dl2ykn6c18qg7kg_j7tplm0000gn/T/pip-ephem-wheel-cache-bjdrmqd8/wheels/6d/01/5c/5a5e4c4f4d3ea2066d07fb7f93b690fdf5530f314a686cc929
Successfully built dataquality
Installing collected packages: dataquality
  Attempting uninstall: dataquality
    Found existing installation: dataquality 0.0.4
    Uninstalling dataquality-0.0.4:
      Successfully uninstalled dataquality-0.0.4
Successfully installed dataquality-0.0.4


In [None]:
!pip install -qqq git+https://www.github.com/rungalileo/dataquality.git

## Components

The data quality client is currently very simple. It has just a few components:

* logging - the inputs and outputs to your model
* config - the urls, usernames, and passwords to interact with the server
* init - how you start a new project/run
* finish - how you end your run

## Getting started

To get started, simply `import dataquality`<br>
If your environment variables are set, your import will pass through. If not, you will be prompted for some url and config variables.<br>

To bypass the prompt, set the following environment variables
* `GALILEO_API_URL`
* `GALILEO_MINIO_URL`
* `GALILEO_MINIO_ACCESS_KEY`
* `GALILEO_MINIO_SECRET_KEY`
* `GALILEO_MINIO_REGION`

If you have your server (api, minio, mysql) running locally for development, the following will work
```
import os

os.environ['GALILEO_API_URL']="http://localhost:8088"
os.environ['GALILEO_MINIO_URL']="127.0.0.1:9000"
os.environ['GALILEO_MINIO_ACCESS_KEY']="minioadmin"
os.environ['GALILEO_MINIO_SECRET_KEY']="minioadmin"
```

If you don't set these environment variables, the client will prompt you for the fields (assuming you're running from the newest code).
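To check ahead of time whether the import will prompt you, you can compare your environment against the variables listed above. The snippet below is only a minimal sketch: `missing_galileo_vars` is a hypothetical helper, not part of dataquality, and it assumes that every variable in the list is required.

```python
import os
from typing import List, Mapping

# Variable names taken from the list above. Whether each one is strictly
# required by the client is an assumption of this sketch.
GALILEO_VARS = [
    "GALILEO_API_URL",
    "GALILEO_MINIO_URL",
    "GALILEO_MINIO_ACCESS_KEY",
    "GALILEO_MINIO_SECRET_KEY",
    "GALILEO_MINIO_REGION",
]


def missing_galileo_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the listed variables that are unset or empty in `env`."""
    return [name for name in GALILEO_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment
print(missing_galileo_vars({"GALILEO_API_URL": "http://localhost:8088"}))
```

Calling `missing_galileo_vars()` with no arguments checks the real process environment.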

### How do I get everything running locally?

See our [CONTRIBUTING](https://github.com/rungalileo/api/blob/main/CONTRIBUTING.md) doc
(When running the API, use the `./scripts/run-gunicorn.sh` script; you don't need every script.)

In [1]:
import os

os.environ['GALILEO_API_URL']="http://localhost:8088"
os.environ['GALILEO_MINIO_URL']="127.0.0.1:9000"
os.environ['GALILEO_MINIO_ACCESS_KEY']="minioadmin"
os.environ['GALILEO_MINIO_SECRET_KEY']="minioadmin"
!rm .galileo/config.json

In [None]:
# For the dev cluster, uncomment the lines below and run this cell

# import os
# os.environ['GALILEO_API_URL']="https://api.dev.rungalileo.io"
# os.environ['GALILEO_MINIO_URL']="data.dev.rungalileo.io"
# os.environ['GALILEO_MINIO_ACCESS_KEY']="minioadmin"
# os.environ['GALILEO_MINIO_SECRET_KEY']="minioadmin"
!rm .galileo/config.json

In [2]:
import dataquality

## Logging in

Once you have dataquality imported, you can log into your server and start logging data<br>

To log in, you can call `dataquality.login()` <br>
This will prompt you for your auth method, email, and password. You can skip this prompt with the following environment variables:

* `GALILEO_AUTH_METHOD`
* `GALILEO_USERNAME`
* `GALILEO_PASSWORD`

### How do I create a user?

If you are running everything locally, you can do the following

```
import requests

data={
  "email": "me@rungalileo.io",
  "first_name": "Me",
  "last_name": "Me",
  "username": "Galileo",
  "auth_method": "email",
  "password": "Th3secret_"
}

r = requests.post('http://localhost:8088/users', json=data)
r.json()
```

Then set your env vars
```
import os

os.environ["GALILEO_AUTH_METHOD"]="email"
os.environ["GALILEO_USERNAME"]=data["email"]
os.environ["GALILEO_PASSWORD"]=data["password"]
```

If you don't set these environment variables, the client will prompt you for the fields (assuming you're running from the newest code).

Now login

```
dataquality.login()
```


In [7]:
import requests

pwd = "MyPassword!123"

data={
  "email": "me@rungalileo.io",
  "first_name": "Me",
  "last_name": "Me",
  "username": "Galileo1",
  "auth_method": "email",
  "password": pwd
}

r = requests.post(f'{dataquality.config.api_url}/users', json=data)
r.json()

import os

os.environ["GALILEO_AUTH_METHOD"]="email"
os.environ["GALILEO_USERNAME"]=f"{r.json()['email']}"
os.environ["GALILEO_PASSWORD"]=pwd

In [9]:
dataquality.login()

🔭 Logging you into Galileo

👀 Found auth method email set via env, skipping prompt.
🚀 You're logged in to Galileo as me@rungalileo.io!


## Start my project/run

Now you can start using the tool with `dataquality.init()`<br>

You **must** provide a `task_type` when calling `init`
* A task type describes the kind of modeling you are doing (text classification, multi-label, NER etc).
* Currently the only available task is "text_classification"

You can optionally provide a project name for this run.

In [None]:
dataquality.init?

In [10]:
task = "text_classification"
# Base case
dataquality.init(task)

✨ Initializing project shallow_aquamarine_sailfish
🏃‍♂️ Starting run retired_silver_dolphin
🛰 Created project, shallow_aquamarine_sailfish, and new run, retired_silver_dolphin.


In [4]:
# New project, unset run (new)
dataquality.init(task_type=task, project_name="a_new_project")

📡 Retrieved project, a_new_project, and starting a new run
🏃‍♂️ Starting run strange_sapphire_locust
🛰 Connected to project, a_new_project, and created run, strange_sapphire_locust.


In [5]:
# Existing project, unset run (new)
dataquality.init(task_type=task, project_name="a_new_project")

📡 Retrieved project, a_new_project, and starting a new run
🏃‍♂️ Starting run olympic_ivory_bat
🛰 Connected to project, a_new_project, and created run, olympic_ivory_bat.


In [6]:
# Existing project, new run
dataquality.init(task_type=task, project_name="a_new_project", run_name="a_new_run")

📡 Retrieving run from existing project, a_new_project
🛰 Connected to project, a_new_project, and run, a_new_run.




In [7]:
# Existing project, existing run
dataquality.init(task_type=task, project_name="a_new_project", run_name="a_new_run")

📡 Retrieving run from existing project, a_new_project
🛰 Connected to project, a_new_project, and run, a_new_run.


In [8]:
# New project, new run
dataquality.init(task_type=task, project_name="a_new_project2", run_name="a_new_run2")

💭 Project a_new_project2 was not found.
✨ Initializing project a_new_project2
🏃‍♂️ Starting run a_new_run2
🛰 Created project, a_new_project2, and new run, a_new_run2.


## Log to my project/run

Now that you've started your run, all you need to do is log data to it.<br>

To do that, call the `dataquality.log_input_data` and `dataquality.log_model_outputs` functions.

`dataquality.log_input_data` knows which task you are logging for and accepts the appropriate arguments.
For "text_classification" it expects:
* text - list of strings indicating the text input
* labels - list of strings indicating the labels
* split - string indicating the data split (training, validation, test)
* id (optional) - list of ints indicating the id of each row. If not provided, IDs will be added automatically
  * NOTE: This ID must match the output ID in log_model_outputs in order to join them for analysis

`dataquality.log_model_outputs` also knows which task you are logging for.
For "text_classification" it expects:
* emb - list of lists of embedding values, one list per text input
* probs - list of lists of per-class probabilities, one list per text input
* split - string indicating the data split (training, validation, test)
* epoch - int indicating the training/test/validation epoch for the input
* ids - list of ints indicating the matching id of each input row
  * NOTE: These IDs must match the input IDs in log_input_data in order to join them for analysis
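To see why the IDs have to line up, here is a small sketch of the kind of join that matching IDs make possible. It uses pandas with toy data and is only an illustration of the idea, not the client's actual join code. The column names mirror the arguments described above.

```python
import pandas as pd

# Toy stand-ins for what the two logging calls receive
inputs = pd.DataFrame(
    {"id": [0, 1, 2], "text": ["a", "b", "c"], "label": ["x", "y", "x"]}
)
outputs = pd.DataFrame(
    {"id": [2, 0, 1], "epoch": [0, 0, 0], "pred": ["x", "x", "y"]}
)

# An inner join on `id` pairs each model output with its input row,
# regardless of the order in which the batch was logged.
joined = inputs.merge(outputs, on="id", how="inner").sort_values("id")
print(joined[["id", "text", "label", "pred"]])
```

If an output is logged with an ID that has no matching input row, the inner join silently drops it, which is why the IDs must agree.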


### log some data

We use `log_input_data` and `log_model_outputs` to log our data

In [18]:
dataquality.init(task)

✨ Initializing project just_beige_scallop
🏃‍♂️ Starting run clinical_plum_gerbil
🛰 Created project, just_beige_scallop, and new run, clinical_plum_gerbil.


In [19]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

newsgroups = fetch_20newsgroups(subset="train", remove=('headers', 'footers', 'quotes'))

dataset = pd.DataFrame()
dataset["text"] = newsgroups.data
label_ind = newsgroups.target_names
dataset["label"] = [label_ind[i] for i in newsgroups.target]
dataset = dataset[:100]

dataquality.log_input_data(text=dataset['text'], labels=dataset['label'], split="train")
dataquality.log_input_data(text=dataset['text'], labels=dataset['label'], split="test")

export(arrow) [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
export(arrow) [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
 

## We validate data before logging

#### See what happens with invalid input

In [11]:
# Labels and text inputs don't match in length
dataquality.log_input_data(text=dataset['text'], labels=dataset['label'][:3], split="train")

AssertionError: labels and text must be the same length, but got(labels, text) (3,100)
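A check of this kind can be sketched in a few lines. `check_same_length` below is a hypothetical helper written for illustration; it is not the client's own validation code, and only the error wording is modeled on the message above.

```python
from typing import Sequence


def check_same_length(text: Sequence[str], labels: Sequence[str]) -> None:
    """Raise AssertionError when text and labels differ in length."""
    assert len(labels) == len(text), (
        "labels and text must be the same length, "
        f"but got (labels, text) ({len(labels)},{len(text)})"
    )


try:
    check_same_length(["a", "b", "c"], ["x"])
except AssertionError as err:
    print(err)
```

Validating before anything is written keeps a mismatched batch from ever reaching the log files.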

In [20]:
import numpy as np

# Generate fake model outputs
def log_fake_data(log_num: int = 0):
    # Ensure unique IDs
    # Because we're going to call this twice, we need the other dataset rows for the second call, so /2
    num_rows = len(dataset) // 2 
        
    emb = np.random.rand(num_rows, 800)
    prob = np.random.rand(num_rows, 20)
    for split in ['test','train']:
        epoch = 0
        
        r = range(num_rows*log_num, num_rows*(log_num+1))
        ids = list(r)
        dataquality.log_model_outputs(emb=emb, probs=prob, split=split, epoch=epoch, ids=ids)

log_fake_data()

In [21]:
!tree .galileo/logs/{dataquality.config.current_project_id}/{dataquality.config.current_run_id}

.galileo/logs/b0ff7b5a-fb09-4cb2-8dc9-05542c10e05b/038639fc-d2ad-498a-a34e-54de3c4dbb07
├── input_data.arrow
├── test
│   └── 0
│       └── 3bff4cc4b01d.hdf5
└── training
    └── 0
        └── fee06682f1f1.hdf5

4 directories, 3 files


### What happened?

When you call `log_input_data` you are logging the input data for this training job. This would typically be run once (per split).<br>

Then, as you train your model in batches, each call to `log_model_outputs` takes the data in that batch and stores it in a new `.hdf5` file under the directory for its split and epoch.<br>

If we were to log another fake dataset to this, we'd see another file in each dir (under the epoch we set).

Each row in these files carries the `id` of its input, so we can join them at the end
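The directory layout shown by `tree` can be mimicked with a small sketch. `batch_log_path` is a hypothetical helper, and the random 12-character hex file name is an assumption based on the listing, not the client's actual naming code.

```python
from pathlib import Path
from uuid import uuid4


def batch_log_path(
    root: str, project_id: str, run_id: str, split: str, epoch: int
) -> Path:
    """Build a unique file path for one logged batch under split/epoch."""
    # Assumption: a random 12-character hex name, like those in the listing
    file_name = f"{uuid4().hex[:12]}.hdf5"
    return Path(root) / project_id / run_id / split / str(epoch) / file_name


first = batch_log_path(".galileo/logs", "project-id", "run-id", "test", 0)
second = batch_log_path(".galileo/logs", "project-id", "run-id", "test", 0)

# Two batches for the same split and epoch share a directory but not a file
print(first.parent == second.parent, first.name != second.name)
```

Giving each batch its own file means a logging call never has to reopen or append to an earlier file.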

In [22]:
log_fake_data(1)

In [23]:
!tree .galileo/logs/{dataquality.config.current_project_id}/{dataquality.config.current_run_id}

.galileo/logs/b0ff7b5a-fb09-4cb2-8dc9-05542c10e05b/038639fc-d2ad-498a-a34e-54de3c4dbb07
├── input_data.arrow
├── test
│   └── 0
│       ├── 0505c61d45e0.hdf5
│       └── 3bff4cc4b01d.hdf5
└── training
    └── 0
        ├── 5d8541cb5836.hdf5
        └── fee06682f1f1.hdf5

4 directories, 5 files


## Take a look at our logged model outputs

Below is the model output data we've logged to test. You can see all of the values available across both logs<br>
To see the training data, just change the variable to `training`

In [24]:
import vaex

split = "test"
vaex.open(f'.galileo/logs/{dataquality.config.current_project_id}/{dataquality.config.current_run_id}/{split}/0/*.hdf5')


#,id,epoch,split,emb,prob,pred,data_schema_version
0,50,0,test,"'array([7.81780961e-01, 4.23221817e-01, 5.679266...","'array([0.25665689, 0.74681296, 0.58567357, 0.29...",16,1
1,51,0,test,"'array([0.74282788, 0.0697037 , 0.70588388, 0.42...","'array([0.29112115, 0.84031507, 0.05922634, 0.49...",1,1
2,52,0,test,"'array([0.52490394, 0.44603427, 0.11126275, 0.76...","'array([0.09134986, 0.99827672, 0.71067921, 0.14...",1,1
3,53,0,test,"'array([8.81728988e-01, 7.58652927e-01, 1.065074...","'array([0.32182321, 0.80060134, 0.40267719, 0.27...",17,1
4,54,0,test,"'array([0.59171398, 0.95735806, 0.06825045, 0.09...","'array([0.48810191, 0.62990156, 0.20230503, 0.57...",19,1
...,...,...,...,...,...,...,...
95,45,0,test,"'array([7.86610175e-01, 9.57990255e-01, 1.953924...","'array([0.61353825, 0.41159917, 0.11739276, 0.86...",7,1
96,46,0,test,"'array([0.92117707, 0.72841896, 0.72389721, 0.47...","'array([0.53939169, 0.53173078, 0.02529276, 0.33...",5,1
97,47,0,test,"'array([8.47431601e-01, 9.16523767e-03, 5.429822...","'array([0.69445993, 0.24304976, 0.63699774, 0.65...",10,1
98,48,0,test,"'array([5.01069267e-02, 4.23538886e-01, 6.637389...","'array([0.06512617, 0.42445504, 0.40538823, 0.23...",19,1


## How do I see my results in the UI?

Simply set your labels (`set_labels_for_run`) and call `finish()`

Once called, the data will be joined together at a _per-epoch_ level, and added to minio, with one file for each `prob`, `emb`, and `data` per split/epoch. 

A job will be kicked off to process your data on the server, and after it's done you'll see your results in the UI

#### Why do I need to set my labels?

Since your model is simply outputting probabilities, we have no way to map the index of each prediction to a label name. Setting your labels enables us to map them so you can see the meaningful values in the UI.<br>
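As a rough illustration of that mapping, the sketch below turns a probability matrix into label names with NumPy. The label list and probabilities are made-up values, and this is not the server's implementation.

```python
import numpy as np

# Hypothetical label list and probabilities for two samples
labels = ["sci.space", "rec.autos", "sci.med"]
probs = np.array([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])

# argmax yields the predicted class index per row; the label list turns
# each index into a readable name.
pred_idx = probs.argmax(axis=1)
pred_labels = [labels[i] for i in pred_idx]
print(pred_idx.tolist(), pred_labels)
```

The order of the label list matters: position `i` must be the class that column `i` of the probabilities refers to.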

If you have the UI running, you should see it at the URL returned.

**Note:** Check out your local API logs to see the background job!

In [25]:
dataquality.set_labels_for_run(newsgroups.target_names)
dataquality.finish()

☁️ Uploading Data


training:   0%|          | 0/3 [00:00<?, ?it/s]

export(hdf5) [########################################] 100.00% elapsed time  :     0.04s =  0.0m =  0.0h
export(hdf5) [########################################] 100.00% elapsed time  :     0.07s =  0.0m =  0.0h
export(hdf5) [########################################] 100.00% elapsed time  :     0.11s =  0.0m =  0.0h
 

test:   0%|          | 0/3 [00:00<?, ?it/s]

export(hdf5) [########################################] 100.00% elapsed time  :     0.04s =  0.0m =  0.0h
export(hdf5) [########################################] 100.00% elapsed time  :     0.07s =  0.0m =  0.0h
export(hdf5) [########################################] 100.00% elapsed time  :     0.11s =  0.0m =  0.0h
 🧹 Cleaning up
Job default successfully submitted. Results will be available soon at http://127.0.0.1:3000/projects/b0ff7b5a-fb09-4cb2-8dc9-05542c10e05b/runs/038639fc-d2ad-498a-a34e-54de3c4dbb07


{'project_id': 'b0ff7b5a-fb09-4cb2-8dc9-05542c10e05b',
 'run_id': '038639fc-d2ad-498a-a34e-54de3c4dbb07',
 'proc_name': 'default',
 'labels': ['alt.atheism',
  'comp.graphics',
  'comp.os.ms-windows.misc',
  'comp.sys.ibm.pc.hardware',
  'comp.sys.mac.hardware',
  'comp.windows.x',
  'misc.forsale',
  'rec.autos',
  'rec.motorcycles',
  'rec.sport.baseball',
  'rec.sport.hockey',
  'sci.crypt',
  'sci.electronics',
  'sci.med',
  'sci.space',
  'soc.religion.christian',
  'talk.politics.guns',
  'talk.politics.mideast',
  'talk.politics.misc',
  'talk.religion.misc'],
 'tasks': None,
 'message': 'Processing dataquality!',
 'link': 'http://127.0.0.1:3000/projects/b0ff7b5a-fb09-4cb2-8dc9-05542c10e05b/runs/038639fc-d2ad-498a-a34e-54de3c4dbb07'}

## That should take ~10-20 seconds to complete (if you are running the server locally)

### Now we can also read the data from minio

In [26]:
from minio import Minio

url = dataquality.config.minio_url
client = Minio(url, 'minioadmin', 'minioadmin', secure=(':9000' not in url))
p = dataquality.config.current_project_id
r = dataquality.config.current_run_id
client.fget_object('galileo-project-runs-results', f'{p}/{r}/training/data/data.hdf5', 'training_data.hdf5')
client.fget_object('galileo-project-runs-results', f'{p}/{r}/test/data/data.hdf5', 'test_data.hdf5')

display(vaex.open('training_data.hdf5'))

display(vaex.open('test_data.hdf5'))

#,epoch,pred,id,text,split,data_schema_version,data_error_potential,gold,emb_pca,xy,galileo_text_length,galileo_language_id,galileo_pii,galileo_malformed_data
0,0,6,0,'I was wondering if anyone out there could enlig...,training,1,0.5630325204316611,7,"'array([-1.60789669e+00, 7.84201276e-01, -8.418...","array([-5.9552817, 4.489755 ], dtype=float32)",475,en,,Well Formed
1,0,18,1,'A fair number of brave souls who upgraded their...,training,1,0.733762515045353,4,"'array([-4.69114105e-01, -4.81318059e-01, 6.909...","array([-3.8453536, 3.2029743], dtype=float32)",530,en,,Well Formed
2,0,16,2,"'well folks, my mac plus finally gave up the gho...",training,1,0.5790402646066584,4,"'array([-3.17718330e-01, 1.29768016e+00, 1.273...","array([-4.4039063, 4.854735 ], dtype=float32)",1659,en,email,Well Formed
3,0,9,3,"""\nDo you have Weitek's address/phone number? I'...",training,1,0.6572137422565507,1,"'array([ 9.41291715e-01, 2.19386460e+00, -3.339...","array([-7.556582, 3.030379], dtype=float32)",95,en,,Well Formed
4,0,9,4,"'From article <C5owCB.n3p@world.std.com>, by tom...",training,1,0.7974292186134081,14,"'array([ 3.90338895e-01, -3.89708675e-01, -2.981...","array([-5.551905 , 4.2039523], dtype=float32)",448,en,email,Well Formed
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,8,95,"'\nOn the back might be tricky, but here in Bould...",training,1,0.48018147020515034,8,"'array([-3.59754923e-01, -1.15906567e+00, 1.095...","array([-6.1999326, 3.9643443], dtype=float32)",199,en,,Well Formed
96,0,5,96,'Here is a press release from the Reserve Office...,training,1,0.7944635222854108,18,"'array([-6.24780239e-01, -1.24853640e+00, -1.057...","array([-6.4813957, 3.8455062], dtype=float32)",2777,en,phone,Well Formed
97,0,18,97,"""..deleted...\n\nIn plain Motify using a dialog 'i...",training,1,0.9907870251327324,5,"'array([ 1.16839637e+00, 3.62197069e-01, 1.667...","array([-4.8365474, 3.865284 ], dtype=float32)",476,en,,Well Formed
98,0,6,98,': Are you saying that their was a physical Adam...,training,1,0.6502322470646098,0,"'array([ 1.29698200e+00, -7.88884550e-01, 8.164...","array([-2.6808245, 4.097144 ], dtype=float32)",375,en,,Well Formed


#,epoch,pred,id,text,split,data_schema_version,data_error_potential,gold,emb_pca,xy,galileo_text_length,galileo_language_id,galileo_pii,galileo_malformed_data
0,0,6,0,'I was wondering if anyone out there could enlig...,test,1,0.5630325204316611,7,"'array([-1.60789669e+00, 7.84201276e-01, -8.418...","array([-5.9552817, 4.489755 ], dtype=float32)",475,en,,Well Formed
1,0,18,1,'A fair number of brave souls who upgraded their...,test,1,0.733762515045353,4,"'array([-4.69114105e-01, -4.81318059e-01, 6.909...","array([-3.8453536, 3.2029743], dtype=float32)",530,en,,Well Formed
2,0,16,2,"'well folks, my mac plus finally gave up the gho...",test,1,0.5790402646066584,4,"'array([-3.17718330e-01, 1.29768016e+00, 1.273...","array([-4.4039063, 4.854735 ], dtype=float32)",1659,en,email,Well Formed
3,0,9,3,"""\nDo you have Weitek's address/phone number? I'...",test,1,0.6572137422565507,1,"'array([ 9.41291715e-01, 2.19386460e+00, -3.339...","array([-7.556582, 3.030379], dtype=float32)",95,en,,Well Formed
4,0,9,4,"'From article <C5owCB.n3p@world.std.com>, by tom...",test,1,0.7974292186134081,14,"'array([ 3.90338895e-01, -3.89708675e-01, -2.981...","array([-5.551905 , 4.2039523], dtype=float32)",448,en,email,Well Formed
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,8,95,"'\nOn the back might be tricky, but here in Bould...",test,1,0.48018147020515034,8,"'array([-3.59754923e-01, -1.15906567e+00, 1.095...","array([-6.1999326, 3.9643443], dtype=float32)",199,en,,Well Formed
96,0,5,96,'Here is a press release from the Reserve Office...,test,1,0.7944635222854108,18,"'array([-6.24780239e-01, -1.24853640e+00, -1.057...","array([-6.4813957, 3.8455062], dtype=float32)",2777,en,phone,Well Formed
97,0,18,97,"""..deleted...\n\nIn plain Motify using a dialog 'i...",test,1,0.9907870251327324,5,"'array([ 1.16839637e+00, 3.62197069e-01, 1.667...","array([-4.8365474, 3.865284 ], dtype=float32)",476,en,,Well Formed
98,0,6,98,': Are you saying that their was a physical Adam...,test,1,0.6502322470646098,0,"'array([ 1.29698200e+00, -7.88884550e-01, 8.164...","array([-2.6808245, 4.097144 ], dtype=float32)",375,en,,Well Formed


In [27]:
from dataquality.utils import tqdm

In [2]:
!pip install ../../../dataquality

Processing /Users/benepstein/Documents/GitHub/dataquality
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done


Building wheels for collected packages: dataquality
  Building wheel for dataquality (pyproject.toml) ... done
  Created wheel for dataquality: filename=dataquality-0.0.4-py3-none-any.whl size=48536 sha256=2b1d2026ea937a1d3aa9e424dc1203bd35c940ddc87c00fabe80a030779aa498
  Stored in directory: /private/var/folders/3d/d0dl2ykn6c18qg7kg_j7tplm0000gn/T/pip-ephem-wheel-cache-7mluzeuy/wheels/6d/01/5c/5a5e4c4f4d3ea2066d07fb7f93b690fdf5530f314a686cc929
Successfully built dataquality
Installing collected packages: dataquality
  Attempting uninstall: dataquality
    Found existing installation: dataquality 0.0.4
    Uninstalling dataquality-0.0.4:
      Successfully uninstalled dataquality-0.0.4
Successfully installed dataquality-0.0.4


## Bad run (no data to process)

In [1]:
import dataquality
from dataquality.clients import api_client 
dataquality.login()
dataquality.config


p, r = api_client.get_projects()[2]["name"], api_client.get_projects()[2]["runs"][0]["name"]
p, r


api_client.reprocess_run(p, r)

🔭 Logging you into Galileo

👀 Found auth method email set via env, skipping prompt.
🚀 You're logged in to Galileo as me@rungalileo.io!


GalileoException: It seems no data is available for run a_new_project2/a_new_run2

## Good run (default currently initialized)

In [6]:
api_client.reprocess_run()

Job default successfully resubmitted. New results will be available soon at http://127.0.0.1:3000/projects/b0ff7b5a-fb09-4cb2-8dc9-05542c10e05b/runs/038639fc-d2ad-498a-a34e-54de3c4dbb07


{'project_id': 'b0ff7b5a-fb09-4cb2-8dc9-05542c10e05b',
 'run_id': '038639fc-d2ad-498a-a34e-54de3c4dbb07',
 'proc_name': 'default',
 'labels': ['alt.atheism',
  'comp.graphics',
  'comp.os.ms-windows.misc',
  'comp.sys.ibm.pc.hardware',
  'comp.sys.mac.hardware',
  'comp.windows.x',
  'misc.forsale',
  'rec.autos',
  'rec.motorcycles',
  'rec.sport.baseball',
  'rec.sport.hockey',
  'sci.crypt',
  'sci.electronics',
  'sci.med',
  'sci.space',
  'soc.religion.christian',
  'talk.politics.guns',
  'talk.politics.mideast',
  'talk.politics.misc',
  'talk.religion.misc'],
 'tasks': None,
 'message': 'Processing dataquality!',
 'link': 'http://127.0.0.1:3000/projects/b0ff7b5a-fb09-4cb2-8dc9-05542c10e05b/runs/038639fc-d2ad-498a-a34e-54de3c4dbb07'}

## Good run (manually provided)

In [5]:
from dataquality import config
p, r = config.current_project_id, config.current_run_id
print(p, r)
pn = api_client.get_project(p)["name"]
rn = api_client.get_project_run(p, r)["name"]

print(pn, rn)

api_client.reprocess_run(pn, rn)

b0ff7b5a-fb09-4cb2-8dc9-05542c10e05b 038639fc-d2ad-498a-a34e-54de3c4dbb07
just_beige_scallop clinical_plum_gerbil
Job default successfully resubmitted. New results will be available soon at http://127.0.0.1:3000/projects/b0ff7b5a-fb09-4cb2-8dc9-05542c10e05b/runs/038639fc-d2ad-498a-a34e-54de3c4dbb07


{'project_id': 'b0ff7b5a-fb09-4cb2-8dc9-05542c10e05b',
 'run_id': '038639fc-d2ad-498a-a34e-54de3c4dbb07',
 'proc_name': 'default',
 'labels': ['alt.atheism',
  'comp.graphics',
  'comp.os.ms-windows.misc',
  'comp.sys.ibm.pc.hardware',
  'comp.sys.mac.hardware',
  'comp.windows.x',
  'misc.forsale',
  'rec.autos',
  'rec.motorcycles',
  'rec.sport.baseball',
  'rec.sport.hockey',
  'sci.crypt',
  'sci.electronics',
  'sci.med',
  'sci.space',
  'soc.religion.christian',
  'talk.politics.guns',
  'talk.politics.mideast',
  'talk.politics.misc',
  'talk.religion.misc'],
 'tasks': None,
 'message': 'Processing dataquality!',
 'link': 'http://127.0.0.1:3000/projects/b0ff7b5a-fb09-4cb2-8dc9-05542c10e05b/runs/038639fc-d2ad-498a-a34e-54de3c4dbb07'}

In [20]:
import numpy as np

x = np.random.rand(10)
y = np.random.randint(5, size=10)
y[0] = 10000
x, y

(array([0.93977603, 0.47317141, 0.66128151, 0.1081444 , 0.23966007,
        0.34280039, 0.84580327, 0.32180892, 0.78400448, 0.86464419]),
 array([10000,     1,     4,     1,     0,     3,     4,     3,     2,
            4]))

In [22]:
np.average(x, weights=y)

0.9390730758423024

In [23]:
x.mean()

0.55810946777751

In [2]:
from sklearn.metrics import precision_recall_fscore_support
import numpy as np

for _ in range(1000):
    ypred = np.random.randint(5, size=10_000)
    ygold = np.random.randint(5, size=10_000)

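    # scikit-learn expects (y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(ygold, ypred, average="micro")
    # np.allclose compares two values; a third positional argument would be read as rtol
    assert np.allclose(prec, rec) and np.allclose(rec, f1)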

In [229]:
import json
from collections import defaultdict
from random import random
from typing import Any, Dict, List, Optional, Union

import numpy as np
import pandas as pd
import vaex


__data_schema_version__ = 1
class Logger:
    def __init__(
        self,
        text: List[str] = None,
        text_tokenized: List[List[List[str]]] = None,
        gold_spans: List[List[Dict]] = None,
        ids: List[int] = None,
        split: str = None,
        meta: Optional[Dict[str, List[Union[str, float, int]]]] = None,
    ) -> None:
        self.text = text if text is not None else []
        self.text_tokenized = text_tokenized if text_tokenized is not None else []
        self.gold_spans = gold_spans if gold_spans is not None else []
        self.ids = ids if ids is not None else []
        self.split = split
        self.meta = meta or {}
        
    def _get_input_dict(self) -> Dict[str, Any]:
        return dict(
            id=self.ids,
            text=self.text,
            text_tokenized=json.dumps(self.text_tokenized),
            split=self.split,
            data_schema_version=__data_schema_version__,
            gold_spans=json.dumps(self.gold_spans) if self.split != "inference" else None,
            **self.meta,
        )
    
    
class ModelLogger:
    def __init__(
        self,
        gold_emb: List[List[np.ndarray]] = None,
        pred_emb: List[List[np.ndarray]] = None,
        pred_spans: List[List[List[np.ndarray]]] = None,
        probs: Union[List, np.ndarray] = None,
        ids: Union[List, np.ndarray] = None,
        split: Optional[str] = None,
        epoch: Optional[int] = None,
    ) -> None:
        super().__init__()
        # Need to compare to None because they may be np arrays which cannot be
        # evaluated with bool directly
        self.gold_emb = gold_emb if gold_emb is not None else []
        self.pred_emb = pred_emb if pred_emb is not None else []
        self.pred_spans = pred_spans if pred_spans is not None else []
        self.probs = probs if probs is not None else []
        self.ids = ids if ids is not None else []
        self.split = split
        self.epoch = epoch
        self.dep_scores = []
        for i in self.probs:
            self.dep_scores.append([random() for _ in range(len(i))])
        
        
    def _get_data_dict(self) -> Dict[str, Any]:
        data = defaultdict(list)
        for record_id, span_deps, gold_emb, pred_emb, pred_span in zip(
            self.ids, self.dep_scores, self.gold_emb, self.pred_emb, self.pred_spans
        ):
            record = {
                "id": record_id,
                "epoch": self.epoch,
                "split": self.split,
                "pred_span": json.dumps(pred_span),
                "num_pred_spans": len(pred_span),
                "num_gold_spans": len(gold_emb),
                "data_schema_version": __data_schema_version__,
            }
            for i, gold_span_emb in enumerate(gold_emb):
                record[f"gold_emb_{i}"] = gold_span_emb
            for i, pred_span_emb in enumerate(pred_emb):
                record[f"pred_emb_{i}"] = pred_span_emb
            for i, span_dep in enumerate(span_deps):
                record[f"dep_{i}"] = span_dep

            # Pad the embeddings and deps for missing values
            # We allow up to 5 spans, so for samples with fewer we need to fill
            # those vaex df columns. We add np.zeros for embs and -1 for dep scores
            rng = range(record["num_pred_spans"], 5)
            for i in rng:
                record[f"pred_emb_{i}"] = np.zeros(768)
                record[f"dep_{i}"] = -1
            
            rng = range(record["num_gold_spans"], 5)
            for i in rng:
                record[f"gold_emb_{i}"] = np.zeros(768)

            for k in record.keys():
                data[k].append(record[k])
        return data
    
    def validate(self):
        avg_gold_emb: List[List[np.ndarray]] = []
        avg_pred_emb: List[List[np.ndarray]] = []
        for gold_sample, pred_sample in zip(self.gold_emb, self.pred_emb):
            avg_gold_emb.append(
                [np.mean(gold_span, axis=0) for gold_span in gold_sample]
            )
            avg_pred_emb.append(
                [np.mean(pred_span, axis=0) for pred_span in pred_sample]
            )
        self.gold_emb = avg_gold_emb
        self.pred_emb = avg_pred_emb
    



In [230]:
text_inputs: List[str] = [
    "The president is Joe Biden",
    "Joe Biden addressed the United States on Monday"
]
text_tokenized: List[List[List[str]]] = [
    [
       # span indexes
       # 0        1       2            3       4        5     6          7
        ["the"], ["pres", "##ident"], ["is"], ["joe"], ["bi", "##den"], ["."]
    ],
    [
       # 0        1     2          3          4         5
        ["joe"], ["bi", "##den"], ["address", "##ed"], ["the"],
       # 6        7        8        9        10      11          12
        ["unite", "##d"], ["state", "##s"], ["on"], ["monday"], ["."]
    ]
]
gold_spans: List[List[dict]] = [
    [
        {"start":4, "end":7, "label":"person"}  # [joe], [bi, ##den]
    ],
    [
        {"start":0, "end":3, "label":"person"},    # [joe], [bi, ##den]
        {"start":6, "end":10, "label":"location"}  # [unite, ##d], [state, ##s]
    ]
]
ids: List[int] = [0, 1]
split = "training"

logger = Logger(text_inputs, text_tokenized, gold_spans, ids, split)

In [231]:
df = vaex.from_pandas(pd.DataFrame(logger._get_input_dict()))
df.export("input_data.arrow", reduce_large=True)
df["text_tokenized"].values[0]

TypeError: export() got an unexpected keyword argument 'reduce_large'

In [232]:
probs = [
    [
        [np.random.rand(10), np.random.rand(10), np.random.rand(10)]
    ],
    [
        [np.random.rand(10), np.random.rand(10)],
        [np.random.rand(10)]
    ]
]

pred_spans = [
    [
        {"start":4, "end":7, "label":"person"}  # [joe], [bi, ##den]
    ],
    [
        {"start":0, "end":3, "label":"person"},    # [joe], [bi, ##den]
        {"start":11, "end":12, "label":"person"}  # [monday] (bad prediction)
    ]
]


pred_emb = [
    [
        [np.random.rand(768), np.random.rand(768), np.random.rand(768)]  # Correct span
    ],
    [
        [np.random.rand(768), np.random.rand(768)],  # Correct span
        [np.random.rand(768)]  # Incorrect span, but the prediction
    ]
]


gold_emb = [
    [
        [np.random.rand(768), np.random.rand(768), np.random.rand(768)]  # True span
    ],
    [
        [np.random.rand(768), np.random.rand(768), np.random.rand(768)],                      # True span
        [np.random.rand(768), np.random.rand(768), np.random.rand(768), np.random.rand(768)]  # True span
    ]
]

ids = [0, 1]
epoch = 0
split = "training"


mlogger = ModelLogger(
    gold_emb, pred_emb, pred_spans, probs, ids, split, epoch
)
mlogger.validate()
data = mlogger._get_data_dict()
model_output = vaex.from_dict(data)
model_output.export('model_output.hdf5')
model_output

#,id,epoch,split,pred_span,num_pred_spans,num_gold_spans,data_schema_version,gold_emb_0,pred_emb_0,dep_0,pred_emb_1,dep_1,pred_emb_2,dep_2,pred_emb_3,dep_3,pred_emb_4,dep_4,gold_emb_1,gold_emb_2,gold_emb_3,gold_emb_4
0,0,0,training,"[{""start"": 4, ""end"": 7, ""label"": ""person""}]",1,1,1,"'array([0.88952657, 0.60228823, 0.71759194, 0.20...","'array([0.45615093, 0.54049334, 0.73304172, 0.33...",0.156726,"'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ...",-1.0,"'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ...",-1,"'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ...",-1,"'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ...",-1,"'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ...","'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ...","'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ...","'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ..."
1,1,0,training,"'[{""start"": 0, ""end"": 3, ""label"": ""person""}, {""s...",2,2,1,"'array([0.38557429, 0.41706254, 0.6371061 , 0.38...","'array([0.43439484, 0.38752056, 0.35483564, 0.52...",0.614494,"'array([9.04401978e-02, 2.41261098e-01, 6.729725...",0.213313,"'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ...",-1,"'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ...",-1,"'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ...",-1,"'array([0.43243748, 0.29057348, 0.50017519, 0.72...","'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ...","'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ...","'array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., ..."


In [228]:
df.dtypes

id                      int64
text                   string
text_tokenized         object
split                  string
data_schema_version     int64
gold_spans             object
dtype: object

In [185]:
# len(mlogger.gold_emb[1])
x = [np.random.rand(10), np.array([10])]
y = [np.array([])]
np.isnan(np.mean(y))



gold_emb = [
    [
        [np.random.rand(768), np.random.rand(768), np.random.rand(768)]  # True span
    ],
    [
        [np.random.rand(768), np.random.rand(768), np.random.rand(768)],                      # True span
        [np.random.rand(768), np.random.rand(768), np.random.rand(768), np.random.rand(768)]  # True span
    ],
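    [np.array([])]  # sample whose only span has no token embeddings
]

avg_gold_emb = []
# np.mean over an empty span yields NaN (with a RuntimeWarning)
for gold_sample in gold_emb:
    avg_gold_emb.append([np.mean(gold_span, axis=0) for gold_span in gold_sample])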


In [200]:
np.isnan(avg_gold_emb[0][0]).all()

False

In [203]:
np.isnan(np.nan).all()

True

In [207]:
len(avg_gold_emb[2])

next(filter(lambda emb: not np.isnan(emb[0]).all(), avg_gold_emb))[0].shape[0]

768
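
The cells above probe what happens when a span has no token embeddings: `np.mean` of an empty array returns NaN, so the embedding dimension has to be read from the first span that is not NaN. A small sketch of that lookup follows; `infer_emb_dim` is a hypothetical helper name and the dimension of 8 is chosen only for the demo.

```python
import warnings
from typing import List, Optional

import numpy as np


def infer_emb_dim(avg_emb: List[List[np.ndarray]]) -> Optional[int]:
    """Return the length of the first averaged embedding that is not NaN.

    Hypothetical helper: empty spans average to a scalar NaN, so scalars
    and all-NaN vectors are skipped.
    """
    for sample in avg_emb:
        for span in sample:
            if np.ndim(span) > 0 and not np.isnan(span).all():
                return span.shape[0]
    return None  # every span was empty


# Averaging an empty span gives NaN; silence the expected RuntimeWarning
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    empty_mean = np.mean(np.array([]), axis=0)

avg_emb = [[empty_mean], [np.random.rand(8)]]
dim = infer_emb_dim(avg_emb)
print(dim)
```

Scanning all spans, rather than only the first span of each sample, is a design choice of this sketch; the notebook cell checks only `emb[0]`.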

In [173]:
df1 = vaex.from_dict({"id": list(range(0,10)), "emb_1": np.random.rand(10, 5), "emb_2": np.random.rand(10, 6)})
df2 = vaex.from_dict({"id": list(range(10,20)), "emb_1": np.random.rand(10, 5), "emb_2": np.random.rand(10, 5)})


missing_cols = set(df2.get_column_names()) - set(df1.get_column_names())
print(missing_cols)
for col in missing_cols:
    df1[col] = np.zeros(shape=(10,5))


vaex.concat([df1,df2])[:1].emb_1.values

set()


ValueError: Unequal shapes are not supported yet, please open an issue on https://github.com/vaexio/vaex/issues
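
The `ValueError` above most likely comes from `emb_2` being `(10, 6)` in `df1` but `(10, 5)` in `df2`. One possible workaround, which is an assumption and not something the notebook or the client does, is to zero-pad the narrower array to a common width with NumPy before building the data frames. `pad_to_width` is a hypothetical helper.

```python
import numpy as np


def pad_to_width(emb: np.ndarray, width: int) -> np.ndarray:
    """Zero-pad a 2-D (rows, dim) array on the right up to `width` columns."""
    if emb.shape[1] > width:
        raise ValueError(f"array has {emb.shape[1]} columns, more than {width}")
    # Pad nothing along rows, and (width - dim) zeros after the last column
    return np.pad(emb, ((0, 0), (0, width - emb.shape[1])))


wide = np.random.rand(10, 6)    # stands in for df1's emb_2
narrow = np.random.rand(10, 5)  # stands in for df2's emb_2
target = max(wide.shape[1], narrow.shape[1])

wide_padded = pad_to_width(wide, target)
narrow_padded = pad_to_width(narrow, target)
print(wide_padded.shape, narrow_padded.shape)
```

With both arrays at the same width, the columns have equal shapes, which is the condition the error message complains about. Whether zero-padding is appropriate for real embeddings depends on how they are used downstream.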

In [182]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [175]:
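test = "test"
# Adjacent string literals are concatenated; only the first one is an f-string,
# so "{}" remains a literal placeholder
err = (
    f"cannot have more than {test} but had"
    "{}"
)

err.format(100)  # returns a new string; the result is discarded here
err  # unchanged, so the placeholder is still present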

'cannot have more than test but had{}'
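
The output above still contains `{}` because `str.format` returns a new string and leaves the original unchanged. A short sketch of the same pattern with the result kept; the variable names `template` and `message` are chosen here for illustration.

```python
test = "test"

# The f-string part is filled immediately; the plain "{}" literal is left
# as a placeholder for a later .format() call
template = (
    f"cannot have more than {test} but had "
    "{}"
)

# str.format returns a new string, so the result must be assigned
message = template.format(100)
print(message)
```

Here `template` is still reusable for other values, while `message` holds the filled-in text.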

In [56]:
import os

account_clients = {"1234-6973": "Betty"}
CPA_clients = {"HerreidTim": ["1234-6973", "1454-5933"]}

file_loc = "."
must_be_pdf = True
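# List only regular files; join with file_loc so the check also works
# when file_loc is not the current directory
all_files = [
    file for file in os.listdir(file_loc)
    if os.path.isfile(os.path.join(file_loc, file))
]

# Override with a hard-coded example file name for this demo
all_files = ["1234-6973_001_021221_1099Composite.pdf"]

if must_be_pdf:
    all_files = [file for file in all_files if file.endswith(".pdf")]
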
# Structure is 1234-5678_rest_of_name.pdf
# First 8 digits are account #, first 4 must be redacted
cleaned_files = []
for file in all_files:
    acct_number, *rest_of_file = file.split("_")
    rest_of_file = "_".join(rest_of_file)
    redacted_acct_number = "X"*4 + acct_number[4:]
    client = account_clients[acct_number]
    CPA = [n for n, v in CPA_clients.items() if acct_number in v][0] # should only be 1
    cleaned_files.append(f"{CPA}_{client}_{redacted_acct_number}_{rest_of_file}")  

print('input:', all_files[0], 'output:', cleaned_files[0])

input: 1234-6973_001_021221_1099Composite.pdf output: HerreidTim_Betty_XXXX-6973_001_021221_1099Composite.pdf
