## End to end examples logging data to Galileo for Text Classification, MLTC, and NER

### For understanding the client and how to get started, see the [Dataquality Demo](./Dataquality-Client-Demo.ipynb)
### Check out the full documentation [here](https://rungalileo.gitbook.io/galileo/getting-started)
### To see real end-to-end notebooks training real ML models, see [here](https://drive.google.com/drive/folders/17-cHuRzXIpWaD8rYwy69RMQr__HiAiDk?usp=sharing)

In [17]:
## Local

import os

os.environ['GALILEO_CONSOLE_URL']="http://localhost:8088"
os.environ["GALILEO_USERNAME"]="user@example.com"
os.environ["GALILEO_PASSWORD"]="Th3secret_"

In [13]:
## Local

import os

os.environ['GALILEO_CONSOLE_URL']="http://console.dev.rungalileo.io"
os.environ["GALILEO_USERNAME"]="galileo@rungalileo.io"
os.environ["GALILEO_PASSWORD"]="A11a1una!"

In [None]:
dq.metrics.api_client

In [10]:
%%time
import dataquality as dq
dq.configure()
dq.__version__

📡 https://console.dev.rungalileo.io
🔭 Logging you into Galileo

🚀 You're logged in to Galileo as galileo@rungalileo.io!
CPU times: user 1.07 s, sys: 438 ms, total: 1.51 s
Wall time: 2.98 s


'v0.8.0'

In [11]:
from tqdm.notebook import tqdm
import time
import numpy as np
from uuid import uuid4
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import os

# os.environ["GALILEO_DATA_EMBS_ENCODER"] = "hf-internal-testing/tiny-random-distilbert"
os.environ["GALILEO_DATA_EMBS_ENCODER"] = "tmp/testing-random-distilbert-sq"

BATCH_SIZE=16
EMB_DIM=85
NUM_EPOCHS=1

dq.init("text_classification", "foo","bar")

newsgroups = fetch_20newsgroups(subset="train", remove=('headers', 'footers', 'quotes'))
dataset = pd.DataFrame()
dataset["text"] = newsgroups.data
label_ind = newsgroups.target_names
dataset["label"] = [label_ind[i] for i in newsgroups.target]
dataset["id"] = list(range(len(dataset)))

dataset = dataset[:200]


def generate_random_embeddings(batch_size: int, emb_dims: int) -> np.ndarray:
    return np.random.rand(batch_size, emb_dims)


def generate_random_probabilities(batch_size: int, num_classes: int) -> np.ndarray:
    probs = np.random.rand(batch_size, num_classes)
    return probs / probs.sum(axis=-1).reshape(-1, 1)  # Normalize to sum to 1


t_start = time.time()
dq.set_labels_for_run(dataset["label"].unique())

print("Logging input data")
for split in ["training", "test"]:
    dq.log_dataset(dataset, split=split)
    
print("Done")
print(f"Input logging took {time.time() - t_start} seconds\n\n")


print("Logging model outputs")
t_start = time.time()
num_classes = dataset["label"].nunique()
# Simulates model training loop
for epoch_idx in range(NUM_EPOCHS):
    print(f"Epoch {epoch_idx}")
    print('-'*100)
    for split in ["training", "test"]:
        print(split.capitalize())
        dq.set_split(split)
        for i in tqdm(range(0, len(dataset), BATCH_SIZE)):
            batch = dataset[i : i + BATCH_SIZE]
            embeddings = generate_random_embeddings(len(batch), EMB_DIM)
            probs = generate_random_probabilities(len(batch), num_classes)
            dq.log_model_outputs(
                embs=embeddings,
                probs=probs,
                epoch=epoch_idx,
                ids=batch["id"],
            )
    print('-'*100,end="\n\n")
            
print("Done")

time_spent = time.time() - t_start
print(f"Logging output took {time_spent} seconds")
dq.finish(create_data_embs=False)

📡 Retrieving run from existing project, foo




🛰 Connected to project, foo, and run, bar.
Logging input data
Logging 200 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
Logging 200 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
 Done
Input logging took 6.895177125930786 seconds


Logging model outputs
Epoch 0
----------------------------------------------------------------------------------------------------
Training


  0%|          | 0/13 [00:00<?, ?it/s]

Test




  0%|          | 0/13 [00:00<?, ?it/s]

----------------------------------------------------------------------------------------------------

Done
Logging output took 0.23681902885437012 seconds
☁️ Uploading Data


training:   0%|          | 0/1 [00:00<?, ?it/s]

Processing data for upload:   0%|          | 0/13 [00:00<?, ?it/s]

training (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/145k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/49.6k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/286k [00:00<?, ?B/s]

test:   0%|          | 0/1 [00:00<?, ?it/s]

Processing data for upload:   0%|          | 0/13 [00:00<?, ?it/s]

test (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/145k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/49.6k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/286k [00:00<?, ?B/s]

Job default successfully submitted. Results will be available soon at https://console.dev.rungalileo.io/insights?projectId=113c8883-7320-45c2-9f63-9abdcae89ec2&runId=c5be7886-a521-4494-9da7-2ec220527e31&split=training&metric=f1&depHigh=1&depLow=0&taskType=0
Waiting for job...
	Found embs. Analyzing dimensions
	Applying dimensionality reduction to training embs
	Applying dimensionality reduction to test embs
	Calculating data error potential
	Looking for likely mislabeled samples
	Saving processed training data
	Looking for likely mislabeled samples
	Measuring class overlap
	Saving processed test data
Done! Job finished with status completed
Click here to see your run! https://console.dev.rungalileo.io/insights?projectId=113c8883-7320-45c2-9f63-9abdcae89ec2&runId=c5be7886-a521-4494-9da7-2ec220527e31&split=training&metric=f1&depHigh=1&depLow=0&taskType=0
🧹 Cleaning up


'https://console.dev.rungalileo.io/insights?projectId=113c8883-7320-45c2-9f63-9abdcae89ec2&runId=c5be7886-a521-4494-9da7-2ec220527e31&split=training&metric=f1&depHigh=1&depLow=0&taskType=0'

In [25]:
len("____"), len("______")

(4, 6)

In [36]:
token_embs = data_model.encode(
    [
        "this is a long sentence about how jon is really pretty and tall and has great eyes",
        "this is a long sentence about how jon is "
    ],
    output_value=None, 
    convert_to_numpy=True
)
token_embs[0]

{'input_ids': tensor([ 101, 2023, 2003, 1037, 2146, 6251, 2055, 2129, 6285, 2003, 2428, 3492,
         1998, 4206, 1998, 2038, 2307, 2159,  102]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 'token_embeddings': tensor([[-0.0989, -0.1244, -0.0870,  ...,  0.1521,  0.4170,  0.0777],
         [-0.1934, -0.1937, -0.3912,  ...,  0.1710,  0.5467,  0.0530],
         [-0.1983, -0.0436, -0.0218,  ...,  0.3699,  0.0821,  0.4078],
         ...,
         [ 0.1655,  0.3409,  0.2666,  ..., -0.1975,  0.1927, -0.1595],
         [-0.0490,  0.0316, -0.0028,  ...,  0.0534,  0.2502, -0.5220],
         [ 0.7531, -0.0019, -0.3970,  ...,  0.1781, -0.3043, -0.6650]]),
 'sentence_embedding': tensor([ 1.1392e-02,  5.2962e-03,  2.4081e-02,  5.5669e-02,  2.9600e-01,
         -2.5092e-01,  3.5211e-04,  1.0010e+00, -3.2073e-01, -2.2458e-01,
          1.2752e-01, -5.7345e-01, -3.2998e-02,  4.6655e-01, -2.2689e-01,
          2.9207e-01,  2.8244e-01,  2.1811e-01, -8.0712e-02, 

In [38]:
SentenceTransformer("all-MiniLM-L6-v2")

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [32]:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

toks = tok.encode("this is a long sentence about how jon is really pretty and tall and has great eyes")
toks

[101,
 2023,
 2003,
 1037,
 2146,
 6251,
 2055,
 2129,
 6285,
 2003,
 2428,
 3492,
 1998,
 4206,
 1998,
 2038,
 2307,
 2159,
 102]

## Multi Label

In [15]:
from typing import *
from random import choice
import numpy as np


dq.init("text_multi_label", "test-mltc-run")
dq.set_labels_for_run([["not "+_label, _label] for _label in ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult','identity_hate']]) 
dq.set_tasks_for_run(['task_0', 'task_1', 'task_2', 'task_3', 'task_4', 'task_5'], binary=False)

n = 5000

texts: List[str] = [f"text sample {i}" for i in range(n)]

labels: List[str] = [
    [choice(i) for i in dq.get_data_logger().logger_config.labels]
    for _ in range(n)
]

ids = list(range(n))


dq.log_data_samples(texts=texts, task_labels=labels, ids=ids, split="training")
dq.log_data_samples(texts=texts, task_labels=labels, ids=ids, split="test")
dq.log_data_samples(texts=texts, task_labels=labels, ids=ids, split="validation")

for split in ["training", "test", "validation"]:
    for epoch in range(5):
        emb=np.random.rand(n, 768)
        logits=[[np.random.rand(2)] * 6] * n
        ids=list(range(n))
        
        for i in range(0, n, 32):
            dq.log_model_outputs(
                embs=emb[i:i+5],
                logits=logits[i:i+5],
                ids=ids[i:i+5],
                split=split,
                epoch=epoch
            )

dq.finish()
df_train, df_test, df_val = see_results()


📡 Retrieved project, test-mltc-run, and starting a new run
🏃‍♂️ Starting run near_beige_sole
🛰 Connected to project, test-mltc-run, and created run, near_beige_sole.
Logging 5000 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
Logging 5000 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
Logging 5000 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
 ☁️ Uploading Data


training:   0%|          | 0/5 [00:00<?, ?it/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

training (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

training (epoch=1):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

training (epoch=2):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

training (epoch=3):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/4.62M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/102k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

training (epoch=4):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/4.62M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/102k [00:00<?, ?B/s]

validation:   0%|          | 0/5 [00:00<?, ?it/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

validation (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

validation (epoch=1):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

validation (epoch=2):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

validation (epoch=3):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/4.62M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/104k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

validation (epoch=4):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/4.62M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/104k [00:00<?, ?B/s]

test:   0%|          | 0/5 [00:00<?, ?it/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

test (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

test (epoch=1):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

test (epoch=2):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

test (epoch=3):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/4.62M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/99.0k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/157 [00:00<?, ?it/s]

test (epoch=4):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/4.62M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/135k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/99.0k [00:00<?, ?B/s]

Job default successfully submitted. Results will be available soon at http://127.0.0.1:3000/insights?projectId=2a0464fe-78e1-49d9-a0ea-0d3b92bd2a12&runId=ee9e8834-050b-485f-81df-63f0182dc297&split=training&metric=f1&depHigh=1&depLow=0&taskType=1
Waiting for job...
	Applying dimensionality reduction to training embs
	Applying dimensionality reduction to validation embs
	Applying dimensionality reduction to test embs
	Calculating data error potential
	Looking for likely mislabeled samples
	Measuring sample similarity for training
	Measuring class overlap
	Saving processed training data
	Calculating data error potential
	Looking for likely mislabeled samples
	Measuring class overlap
	Saving processed validation data
	Calculating data error potential
	Looking for likely mislabeled samples
	Measuring class overlap
	Saving processed test data
Done! Job finished with status completed
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disablin

NameError: name 'see_results' is not defined

## NER

In [12]:
from dataquality.schemas.task_type import TaskType
from dataquality import config 
from uuid import uuid4
import numpy as np
from time import sleep
from tqdm.notebook import tqdm


dq.init("text_ner", "test-ner-run")


def log_inputs():
    text_inputs = ['what movies star bruce willis', 'show me films with drew barrymore from the 1980s', 'what movies starred both al pacino and robert deniro', 'find me all of the movies that starred harold ramis and bill murray', 'find me a movie with a quote about baseball in it']
    tokens = [[(0, 4), (5, 11), (12, 16), (17, 22), (17, 22), (23, 29), (23, 29)], [(0, 4), (5, 7), (8, 13), (14, 18), (19, 23), (24, 33), (24, 33), (24, 33), (34, 38), (39, 42), (43, 48)], [(0, 4), (5, 11), (12, 19), (20, 24), (25, 27), (28, 34), (28, 34), (28, 34), (35, 38), (39, 45), (39, 45), (46, 52), (46, 52)], [(0, 4), (5, 7), (8, 11), (12, 14), (15, 18), (19, 25), (26, 30), (31, 38), (39, 45), (39, 45), (39, 45), (46, 51), (46, 51), (52, 55), (56, 60), (61, 67), (61, 67), (61, 67)], [(0, 4), (5, 7), (8, 9), (10, 15), (16, 20), (21, 22), (23, 28), (29, 34), (35, 43), (44, 46), (47, 49)]]
    gold_spans = [[{'start': 17, 'end': 29, 'label': 'ACTOR'}], [{'start': 19, 'end': 33, 'label': 'ACTOR'}, {'start': 43, 'end': 48, 'label': 'YEAR'}], [{'start': 25, 'end': 34, 'label': 'ACTOR'}, {'start': 39, 'end': 52, 'label': 'ACTOR'}], [{'start': 39, 'end': 51, 'label': 'ACTOR'}, {'start': 56, 'end': 67, 'label': 'ACTOR'}], []]
    ids = [0, 1, 2, 3, 4]

    labels = ['[PAD]', '[CLS]', '[SEP]', 'O', 'B-ACTOR', 'I-ACTOR', 'B-YEAR', 'B-TITLE', 'B-GENRE', 'I-GENRE', 'B-DIRECTOR', 'I-DIRECTOR', 'B-SONG', 'I-SONG', 'B-PLOT', 'I-PLOT', 'B-REVIEW', 'B-CHARACTER', 'I-CHARACTER', 'B-RATING', 'B-RATINGS_AVERAGE', 'I-RATINGS_AVERAGE', 'I-TITLE', 'I-RATING', 'B-TRAILER', 'I-TRAILER', 'I-REVIEW', 'I-YEAR']
    dq.set_labels_for_run(labels)
    dq.set_tagging_schema("BIO")
    dq.log_data_samples(texts=text_inputs, text_token_indices=tokens, ids=ids, gold_spans=gold_spans, split="training")
    dq.log_data_samples(texts=text_inputs, text_token_indices=tokens, ids=ids, gold_spans=gold_spans, split="validation")
    dq.log_data_samples(texts=text_inputs, text_token_indices=tokens, ids=ids, gold_spans=gold_spans, split="test")

def log_outputs():
    num_classes = 28
    embs = [np.random.rand(119, 768) for _ in range(5)]
    logits= [np.random.rand(119, 28) for _ in range(5)]                                      
    ids= list(range(5))
    for epoch in tqdm(range(6)):
        for split in ["training"]:#, "test", "validation"]:
            dq.log_model_outputs(
                embs=embs, logits=logits, ids=ids, split=split, epoch=epoch
            )
    
def finish():
    dq.finish()
    
    
def runit():
    log_inputs()
    log_outputs()
#     finish()
    
runit()
# df_train, df_test, df_val = see_results()
%time dq.finish()
# %time dq.wait_for_run()


📡 Retrieved project, test-ner-run, and starting a new run
🏃‍♂️ Starting run elegant_magenta_pelican
🛰 Connected to project, test-ner-run, and created run, elegant_magenta_pelican.
Logging 5 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
Logging 5 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
Logging 5 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
 

  0%|          | 0/6 [00:00<?, ?it/s]

☁️ Uploading Data


training:   0%|          | 0/6 [00:00<?, ?it/s]

Processing data for upload:   0%|          | 0/1 [00:00<?, ?it/s]

training (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/43.7k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/1 [00:00<?, ?it/s]

training (epoch=1):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/43.7k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/1 [00:00<?, ?it/s]

training (epoch=2):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/43.7k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/1 [00:00<?, ?it/s]

training (epoch=3):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/43.7k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/1 [00:00<?, ?it/s]

training (epoch=4):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/181k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/43.4k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/2.16k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/1 [00:00<?, ?it/s]

training (epoch=5):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/181k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/43.4k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/2.16k [00:00<?, ?B/s]

Job default successfully submitted. Results will be available soon at https://console.dev.rungalileo.io/insights?projectId=9683fbaa-5f4b-471d-9c68-40095c543d5c&runId=8a4b7381-3446-4308-a7ba-f6bc288b858a&split=training&metric=f1&depHigh=1&depLow=0&taskType=2
Waiting for job...
	Applying dimensionality reduction to training embs
	Measuring class overlap
	Saving processed training data
Done! Job finished with status completed
Click here to see your run! https://console.dev.rungalileo.io/insights?projectId=9683fbaa-5f4b-471d-9c68-40095c543d5c&runId=8a4b7381-3446-4308-a7ba-f6bc288b858a&split=training&metric=f1&depHigh=1&depLow=0&taskType=2
🧹 Cleaning up
CPU times: user 3.05 s, sys: 340 ms, total: 3.39 s
Wall time: 24.8 s


'https://console.dev.rungalileo.io/insights?projectId=9683fbaa-5f4b-471d-9c68-40095c543d5c&runId=8a4b7381-3446-4308-a7ba-f6bc288b858a&split=training&metric=f1&depHigh=1&depLow=0&taskType=2'

In [7]:
!pip install --upgrade vaex-core==4.12.0

Collecting vaex-core==4.12.0
  Using cached vaex_core-4.12.0-cp39-cp39-macosx_10_9_x86_64.whl (5.1 MB)
Installing collected packages: vaex-core
  Attempting uninstall: vaex-core
    Found existing installation: vaex-core 4.15.0
    Uninstalling vaex-core-4.15.0:
      Successfully uninstalled vaex-core-4.15.0
Successfully installed vaex-core-4.12.0


In [2]:
import vaex
df = vaex.example()[:1].export("file.hdf5")


df2 = vaex.open("file.hdf5")
df_copy = df2.copy()
df_copy = df_copy[df_copy["x"] > 50]
float(df_copy["x"].mean())

nan

In [8]:
import numpy as np

avg = df_copy["x"].mean()
std = df_copy["x"].std()

np.min((avg + std, 1.0))

nan

In [12]:
df = vaex.example()
df["id"] = np.array(list(range(len(df))))
df2 = df.copy()
df2["d"] = df2.x * df2.y
df = df.join(df2[["id","d"]], on="id")

In [13]:
df

#,id,x,y,z,vx,vy,vz,E,L,Lz,FeH,d
0,0,1.2318684,-0.39692867,-0.59805775,301.15527,174.05948,27.427546,-149431.4,407.38898,333.95554,-1.0053853,-0.48896387
1,1,-0.16370061,3.6542213,-0.25490645,-195.00023,170.47217,142.53023,-124247.95,890.24115,684.6676,-1.708667,-0.59819824
2,2,-2.120256,3.3260527,1.7078403,-48.63423,171.6473,-2.0794373,-138500.55,372.2411,-202.17618,-1.8336141,-7.052083
3,3,4.715589,4.585251,2.2515438,-232.42084,-294.85083,62.85865,-60037.04,1297.6304,-324.6875,-1.4786882,21.622158
4,4,7.217187,11.994717,-1.0645622,-1.6891745,181.32935,-11.333611,-83206.84,1332.799,1328.949,-1.8570484,86.568115
...,...,...,...,...,...,...,...,...,...,...,...,...
329995,329995,1.9938701,0.7892761,0.2220599,-216.9299,16.12442,-211.24438,-146457.44,457.72247,203.36758,-1.7451677,1.5737141
329996,329996,3.7180912,0.7213376,1.6415337,-185.9216,-117.250824,-105.49866,-126627.11,335.00256,-301.837,-0.9822322,2.681999
329997,329997,0.36885077,13.029609,-3.6339347,-53.677147,-145.15771,76.7091,-84912.26,817.1376,645.8507,-1.7645613,4.805981
329998,329998,-0.112592645,1.4529126,2.1689527,179.30865,205.7971,-68.75873,-133498.47,724.00024,-283.69104,-1.8808953,-0.16358727


In [54]:
vaex.__version__

{'vaex-core': '4.12.0', 'vaex-hdf5': '0.12.3'}