## The content

1. Set up dev environment / MLflow / Outerbounds

1. High-level overview of the Training Pipeline

1. Introduction to Metaflow
    1. Create a sample Metaflow flow
    1. Explain how to run it from the command line and from a notebook
    1. Metaflow: @project
    1. Metaflow: pypi/conda - How they work

1. Integrating Metaflow and MLflow
    1. Step: start
    1. Metaflow: Artifacts
    1. Metaflow: Cards
    1. Metaflow: importing libraries within a step

1. Loading the data
    1. Step: load
    1. Metaflow: Parameters
    1. Metaflow: IncludeFile
    1. Metaflow: S3 class
    1. Metaflow: Retry
    1. Metaflow: Branches

1. Using cross-validation to train a model
    1. Step: cross_validation
    1. Metaflow: foreach
    1. Why cross-validation? Compare with train-test split
    1. Break down the general structure of a cross-validation process

1. Transforming the data
    1. Step: transform_fold
    1. Scikit-learn transformation pipelines
    1. Imputation
    1. Scaling
    1. Encoding (one-hot, label)

1. Training a model
    1. Step: train_model_fold
    1. Metaflow: Environment variables
    1. MLflow: Experiment tracking
    1. MLflow: Parent runs and child runs
    1. MLflow: auto-logging
    1. Using Keras with different backends
    1. Model architecture, loss function, optimizer

1. Evaluating the model
    1. Step: evaluate_model_fold
    1. MLflow: Logging metrics

1. Final model evaluation
    1. Step: evaluate_model
    1. Metaflow: Merging artifacts
    1. Why do we need this step as part of cross-validation?

1. Transforming the entire dataset
    1. Step: transform_dataset
    1. Why do we need this step?

1. Training the final model
    1. Step: train_model
    1. Why do we need this step?

1. Hyperparameter tuning
    1. Keras Tuner

1. Introduction to model versioning
    1. Step: register_model
    1. Metaflow: Conditional execution of a step
    1. MLflow: Model registry
    1. MLflow: log_model

1. Building a custom inference process
    1. MLflow: Subclassing PythonModel
    1. Why do we need a custom inference process?
    1. Running     

1. Deploying the model
    1. MLflow: Deploying the model locally
    1. Loading the latest model from Model Registry
    1. Deploying the model to a SageMaker Endpoint
    1. Deploying the model to Azure
    1. Deploying the model to GCP

1. Running a Production Pipeline in the Cloud
    1. Running on AWS Batch
    1. Running on Kubernetes
    1. Metaflow: @resources (Requesting compute resources)
    1. Mixing cloud environments

1. Scheduling Pipelines
    1. Scheduling with AWS Step Functions
    1. Scheduling with Argo Workflows
    
1. Setting up monitoring

1. Connecting Flows via Events


### Bonus lessons
1. Build an EDA flow that includes the code I wrote to explore the penguins dataset
1. Multi-user workflows using the @project decorator in Outerbounds
1. Shadow deployments
1. Active Learning example
1. Knowledge distillation example
1. Model compression example
1. Test-time augmentation example
1. Adversarial validation example
1. Human-in-the-loop example


In [2]:
%load_ext autoreload
%autoreload 2
%load_ext dotenv
%dotenv

import os
from pathlib import Path

os.environ["KERAS_BACKEND"] = "tensorflow"

import sys

In [3]:
CODE_FOLDER = Path("code")
CODE_FOLDER.mkdir(exist_ok=True)
sys.path.extend([f"./{CODE_FOLDER}"])

In [15]:
%%writefile code/load.py

from io import StringIO
from pathlib import Path

import pandas as pd
from metaflow import S3


def load_data_from_s3(location: str):
    """Load the dataset from an S3 location.

    This function will concatenate every CSV file in the given location
    and return a single DataFrame.
    """
    print(f"Loading dataset from location {location}")

    with S3(s3root=location) as s3:
        files = s3.get_all()

        print(f"Found {len(files)} file(s) in remote location")

        raw_data = [pd.read_csv(StringIO(file.text)) for file in files]
        return pd.concat(raw_data)


def load_data_from_file(dataset_location):
    """Load the dataset from a local file.

    This function is useful to test the pipeline locally
    without having to access the data remotely.
    """
    location = Path(dataset_location)
    print(f"Loading dataset from location {location.as_posix()}")
    return pd.read_csv(location)


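def load_data(dataset_location, debug=False):
    """Load the dataset and return it shuffled.

    When `debug` is True, the data is read from a local file;
    otherwise, it is read from S3.
    """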
    if debug:
        df = load_data_from_file(dataset_location)
    else:
        df = load_data_from_s3(dataset_location)

    # Shuffle the data
    data = df.sample(frac=1, random_state=42)

    print(f"Loaded dataset with {len(data)} samples")

    return data

Overwriting code/load.py
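
To see what `load_data_from_s3` does with the files it downloads without touching S3, the sketch below applies the same `StringIO` and `pd.concat` pattern to two in-memory CSV strings. The sample rows and the names `csv_files`, `frames`, and `combined` are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Two hypothetical CSV payloads standing in for the `file.text` of each S3 object.
csv_files = [
    "species,island,body_mass_g\nAdelie,Torgersen,3750\n",
    "species,island,body_mass_g\nGentoo,Biscoe,4550\nChinstrap,Dream,3550\n",
]

# Parse every payload and stack the frames into a single DataFrame.
frames = [pd.read_csv(StringIO(text)) for text in csv_files]
combined = pd.concat(frames)

print(len(combined))
print(combined.index.tolist())
```

Note that `pd.concat` keeps each file's original index, so row labels can repeat across files; passing `ignore_index=True` would renumber them.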


In [16]:
from load import load_data

data = load_data("../penguins.csv", debug=True)
data

Loading dataset from location ../penguins.csv
Loaded dataset with 344 samples


Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
194,Chinstrap,Dream,50.9,19.1,196.0,3550.0,MALE
157,Chinstrap,Dream,45.2,17.8,198.0,3950.0,FEMALE
225,Gentoo,Biscoe,46.5,13.5,210.0,4550.0,FEMALE
208,Chinstrap,Dream,45.2,16.6,191.0,3250.0,FEMALE
318,Gentoo,Biscoe,48.4,14.4,203.0,4625.0,FEMALE
...,...,...,...,...,...,...,...
188,Chinstrap,Dream,47.6,18.3,195.0,3850.0,FEMALE
71,Adelie,Torgersen,39.7,18.4,190.0,3900.0,MALE
106,Adelie,Biscoe,38.6,17.2,199.0,3750.0,FEMALE
270,Gentoo,Biscoe,46.6,14.2,210.0,4850.0,FEMALE


## Session X - Loading the data

In [17]:
from metaflow import FlowSpec, NBRunner, step, pypi


class TrainingFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.load_data)

    @pypi(packages={"pandas": "2.2.2"})
    @step
    def load_data(self):
        from load import load_data

        data = load_data("../../penguins.csv", debug=True)
        print(data.head())

        self.next(self.end)

    @step
    def end(self):
        print("the end")


run = NBRunner(TrainingFlow, base_dir="code", environment="pypi").nbrun()

Metaflow 2.12.3 executing TrainingFlow for user:svpino
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
Bootstrapping virtual environment(s) ...
Virtual environment(s) bootstrapped!
2024-07-29 11:07:23.354 Workflow starting (run-id 1722265643352845):
2024-07-29 11:07:23.385 [1722265643352845/start/1 (pid 37577)] Task is starting.
2024-07-29 11:07:23.526 [1722265643352845/start/1 (pid 37577)] Task finished successfully.
2024-07-29 11:07:23.553 [1722265643352845/load_data/2 (pid 37582)] Task is starting.
2024-07-29 11:07:23.820 [1722265643352845/load_data/2 (pid 37582)] Loading dataset from location ../../penguins.csv
2024-07-29 11:07:23.821 [1722265643352845/load_data/2 (pid 37582)] Loaded dataset with 344 samples
2024-07-29 11:07:23.824 [1722265643352845/load_data/2 (pid 37582)] species  island  ...  body_mass_g     sex
2024-07-29 11:07:23.861 [1722265643352845/load_data/2 (pid 37582)] 194  Chinstrap   Dream  ...   

## Session X - Cross Validation

In [18]:
from metaflow import FlowSpec, NBRunner, step, pypi


class TrainingFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.load_data)

    @pypi(packages={"pandas": "2.2.2"})
    @step
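    def load_data(self):
        from load import load_data

        # Store the dataset as an artifact so later steps can access it.
        self.data = load_data("../../penguins.csv", debug=True)
        print(self.data.head())

        self.next(self.cross_validation)

    @step
    def cross_validation(self):
        from sklearn.model_selection import KFold

        # Each entry is (fold index, (train indices, test indices)).
        kfold = KFold(n_splits=5, shuffle=True)
        self.folds = list(enumerate(kfold.split(self.data)))

        # NOTE: `transform_fold` is not defined in this cell yet; it must
        # exist (along with a join step) for the flow to run.
        self.next(self.transform_fold, foreach="folds")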

    @step
    def end(self):
        print("the end")


run = NBRunner(TrainingFlow, base_dir="code", environment="pypi").nbrun()

Metaflow 2.12.3 executing TrainingFlow for user:svpino
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
Bootstrapping virtual environment(s) ...
Virtual environment(s) bootstrapped!
2024-07-29 11:09:38.478 Workflow starting (run-id 1722265778477383):
2024-07-29 11:09:38.510 [1722265778477383/start/1 (pid 38366)] Task is starting.
2024-07-29 11:09:38.657 [1722265778477383/start/1 (pid 38366)] Task finished successfully.
2024-07-29 11:09:38.683 [1722265778477383/load_data/2 (pid 38371)] Task is starting.
2024-07-29 11:09:38.944 [1722265778477383/load_data/2 (pid 38371)] Loading dataset from location ../../penguins.csv
2024-07-29 11:09:38.946 [1722265778477383/load_data/2 (pid 38371)] Loaded dataset with 344 samples
2024-07-29 11:09:38.949 [1722265778477383/load_data/2 (pid 38371)] species  island  ...  body_mass_g     sex
2024-07-29 11:09:38.986 [1722265778477383/load_data/2 (pid 38371)] 194  Chinstrap   Dream  ...   
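
A minimal sketch of what the `folds` list built in `cross_validation` looks like, run outside of Metaflow. The ten-row placeholder array and the fixed `random_state` are assumptions added here for illustration; they are not part of the flow above.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten placeholder samples; only the number of rows matters to KFold.
samples = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Each entry is (fold index, (train indices, test indices)).
folds = list(enumerate(kfold.split(samples)))

for fold, (train_indices, test_indices) in folds:
    print(fold, len(train_indices), len(test_indices))
```

Every sample lands in the test set of exactly one fold, which is what lets a `foreach` over `folds` evaluate the model on the whole dataset.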

# Monitoring

In [1]:
from mlflow import MlflowClient

client = MlflowClient()
latest_model_version = client.search_model_versions(
    "name='penguins'",
    max_results=1,
    order_by=["last_updated_timestamp DESC"],
)[0]
print(f"Model version: {latest_model_version.version}")

Model version: 1


In [3]:
import mlflow
mlflow.set_tracking_uri("http://127.0.0.1:5000")
model = mlflow.pyfunc.load_model(latest_model_version.source)

 - scikit-learn (current: 1.4.2, required: scikit-learn==1.5.1)
 - keras (current: 3.3.0, required: keras==3.5.0)
 - jax (current: 0.4.26, required: jax[cpu]==0.4.31)
 - packaging (current: 24.0, required: packaging==24.1)
 - mlflow (current: 2.12.1, required: mlflow==2.15.1)
 - setuptools (current: 65.5.0, required: setuptools==72.1.0)
To fix the mismatches, call `mlflow.pyfunc.get_model_dependencies(model_uri)` to fetch the model's environment and install dependencies using the resulting environment file.
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations

In [4]:
import pandas as pd
import numpy as np

data = pd.read_csv("../penguins.csv")
data.pop("species")
data["sex"] = data["sex"].replace(".", np.nan)
data = data.sample(frac=1).reset_index(drop=True)


def nan_to_none(value):
    return None if pd.isna(value) else value
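
The sketch below shows one likely reason for the `nan_to_none` helper: `NaN` is not a valid value in strict JSON, while `None` serializes to `null`. The sample row is hypothetical, and using `json.dumps` here is an assumption about how the payload ends up being serialized.

```python
import json

import numpy as np
import pandas as pd


def nan_to_none(value):
    return None if pd.isna(value) else value


# Hypothetical row with a missing `sex` value.
row = pd.Series({"island": "Biscoe", "body_mass_g": 4550.0, "sex": np.nan})

# Replace every missing value with None before building the payload.
payload = {k: nan_to_none(v) for k, v in row.to_dict().items()}
print(json.dumps(payload))
```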

In [5]:
std_dev = data["body_mass_g"].std()

# Add positive uniform noise of up to 3 standard deviations to body_mass_g
rng = np.random.default_rng()
data["body_mass_g"] += rng.uniform(1, 3 * std_dev, size=len(data))

for _, row in data[0:200].iterrows():
    payload = {k: nan_to_none(v) for k, v in row.to_dict().items()}
    prediction = model.predict(payload, params={"data_capture": True})
    # print(prediction)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 71ms/step
...

In [6]:
import sqlite3
import random

connection = sqlite3.connect("penguins.db")

query = "SELECT * FROM data"
df = pd.read_sql_query(query, connection)

# Update the species column
for index, row in df.iterrows():
    if random.random() < 0.9:
        # 90% of the time, set species to prediction
        species = row["prediction"]
    else:
        # 10% of the time, set species to a random value
        species = random.choice(["Adelie", "Gentoo", "Chinstrap"])

    # Update the database
    update_query = "UPDATE data SET species = ? WHERE rowid = ?"
    connection.execute(update_query, (species, index + 1))

# Commit the changes
connection.commit()

# Close the connection
connection.close()

print("Database updated successfully.")


Database updated successfully.
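
Once ground-truth labels sit next to the predictions, the two columns can be compared directly. The sketch below computes the fraction of matching rows against an in-memory SQLite database. The table name `data` and the columns `prediction` and `species` come from the cell above; the sample rows and the accuracy query are assumptions for illustration.

```python
import sqlite3

# In-memory stand-in for penguins.db; only the two columns used above are modeled.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE data (prediction TEXT, species TEXT)")
connection.executemany(
    "INSERT INTO data (prediction, species) VALUES (?, ?)",
    [
        ("Adelie", "Adelie"),
        ("Gentoo", "Gentoo"),
        ("Chinstrap", "Adelie"),
        ("Gentoo", "Gentoo"),
    ],
)

# Fraction of rows where the model's prediction matches the ground-truth label.
total, correct = connection.execute(
    "SELECT COUNT(*), SUM(prediction = species) FROM data"
).fetchone()
accuracy = correct / total
connection.close()

print(accuracy)
```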


# Other

In [22]:
from metaflow import Runner

with Runner("training.py", environment="pypi", env={"KERAS_BACKEND": "jax"}).run(
    max_workers=1
) as running:
    print(f"{running.run}")

Metaflow 2.12.3 executing TrainingFlow for user:svpino
Project: penguins, Branch: user.svpino
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
Bootstrapping virtual environment(s) ...
Virtual environment(s) bootstrapped!
2024-07-30 11:45:45.292 Workflow starting (run-id 1722354345288526):
2024-07-30 11:45:45.328 [1722354345288526/start/1 (pid 87120)] Task is starting.
2024-07-30 11:45:45.996 [1722354345288526/start/1 (pid 87120)] Running flow in development mode.
2024-07-30 11:45:46.130 [1722354345288526/start/1 (pid 87120)] Task finished successfully.
2024-07-30 11:45:46.161 [1722354345288526/load_data/2 (pid 87130)] Task is starting.
2024-07-30 11:45:46.756 [1722354345288526/load_data/2 (pid 87130)] Loading dataset from location ../penguins.csv
2024-07-30 11:45:46.757 [1722354345288526/load_data/2 (pid 87130)] Loaded dataset with 344 samples
2024-07-30 11:45:46.825 [1722354345288526/load_data/2 (pid 87130)] Loaded

In [53]:
from metaflow import Runner

with Runner("flows/00-introduction.py").run() as running:
    print(f"{running.run}")

Metaflow 2.12.3 executing IntroductionFlow for user:svpino
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2024-07-18 20:38:10.756 Workflow starting (run-id 1721327890756156):
2024-07-18 20:38:10.762 [1721327890756156/start/1 (pid 36197)] Task is starting.
2024-07-18 20:38:10.881 [1721327890756156/start/1 (pid 36197)] Start
2024-07-18 20:38:10.900 [1721327890756156/start/1 (pid 36197)] Task finished successfully.
2024-07-18 20:38:10.903 [1721327890756156/end/2 (pid 36200)] Task is starting.
2024-07-18 20:38:11.025 [1721327890756156/end/2 (pid 36200)] the end
2024-07-18 20:38:11.043 [1721327890756156/end/2 (pid 36200)] Task finished successfully.
2024-07-18 20:38:11.043 Done!
Run('IntroductionFlow/1721327890756156')


In [54]:
from metaflow import Runner

with Runner("flows/01-load.py").run() as running:
    print(f"{running.run}")

Metaflow 2.12.3 executing LoadFlow for user:svpino
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2024-07-18 20:40:03.290 Workflow starting (run-id 1721328003289419):
2024-07-18 20:40:03.297 [1721328003289419/start/1 (pid 38623)] Task is starting.
2024-07-18 20:40:03.426 [1721328003289419/start/1 (pid 38623)] Start
2024-07-18 20:40:03.448 [1721328003289419/start/1 (pid 38623)] Task finished successfully.
2024-07-18 20:40:03.452 [1721328003289419/load_data/2 (pid 38627)] Task is starting.
2024-07-18 20:40:03.580 [1721328003289419/load_data/2 (pid 38627)] Load
2024-07-18 20:40:03.600 [1721328003289419/load_data/2 (pid 38627)] Task finished successfully.
2024-07-18 20:40:03.603 [1721328003289419/end/3 (pid 38630)] Task is starting.
2024-07-18 20:40:03.730 [1721328003289419/end/3 (pid 38630)] the end
2024-07-18 20:40:03.753 [1721328003289419/end/3 (pid 38630)] Task finished successfully.
2024-07-18 20:40:03.753 Done!


In [20]:
from pathlib import Path
import pandas as pd
import numpy as np

location = Path("../penguins.csv")
df = pd.read_csv(location)
df["sex"] = df["sex"].replace(".", np.nan)
df.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE


In [16]:
import numpy as np

labels = df.pop("species")

In [17]:
labels = labels.to_numpy().reshape(-1, 1)
labels.shape

(344, 1)

In [18]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

label_transformer = ColumnTransformer(
    transformers=[("species", OrdinalEncoder(), [0])],
)
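
A small sketch of what this transformer produces. `OrdinalEncoder` sorts the categories it finds and maps each one to its position, so the species names become floating-point class indices. The `sample_labels` array is hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical labels shaped as a single column, like `labels` above.
sample_labels = np.array(["Gentoo", "Adelie", "Chinstrap", "Adelie"]).reshape(-1, 1)

label_transformer = ColumnTransformer(
    transformers=[("species", OrdinalEncoder(), [0])],
)
encoded = label_transformer.fit_transform(sample_labels)

print(encoded.ravel().tolist())
```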

In [21]:
def build_target_transformer():
    """Build a Scikit-Learn transformer to preprocess the target variable."""
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OrdinalEncoder

    return ColumnTransformer(
        transformers=[("species", OrdinalEncoder(), [0])],
    )


def build_features_transformer():
    """Build a Scikit-Learn transformer to preprocess the feature columns."""
    from sklearn.compose import ColumnTransformer, make_column_selector
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric_transformer = make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
    )

    categorical_transformer = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        # We can use the `handle_unknown="ignore"` parameter to ignore
        # unseen categories during inference. When encoding an unknown
        # category, the transformer will return an all-zero vector.
        OneHotEncoder(handle_unknown="ignore"),
    )

    return ColumnTransformer(
        transformers=[
            (
                "numeric",
                numeric_transformer,
                make_column_selector(dtype_exclude="object"),
            ),
            (
                "categorical",
                categorical_transformer,
                ["island", "sex"],
                # make_column_selector(dtype_include="object"),
            ),
        ],
    )
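
The numeric pipeline above chains mean imputation with standardization. As a rough illustration of what those two steps compute, here is a minimal pure-Python sketch. The helper names `impute_mean` and `standardize` are hypothetical, and the input is only the five `culmen_length_mm` values shown by `df.head()`, so the results will not match the values obtained when fitting on the full dataset.

```python
import math


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def standardize(values):
    """Scale to zero mean and unit variance using the population std (ddof=0)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


# culmen_length_mm for the first five rows; row 3 is missing in the dataset.
culmen_length_mm = [39.1, 39.5, 40.3, None, 36.7]

imputed = impute_mean(culmen_length_mm)
scaled = standardize(imputed)

print(imputed)
print(scaled)
```

An imputed entry sits exactly at the column mean, so it maps to zero after scaling.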

In [22]:
target_transformer = build_target_transformer()
y = target_transformer.fit_transform(df.species.to_numpy().reshape(-1, 1))

features_transformer = build_features_transformer()
x = features_transformer.fit_transform(df)

In [23]:
x[0]

array([-0.88708123,  0.78774251, -1.42248782, -0.56578921,  0.        ,
        0.        ,  1.        ,  0.        ,  1.        ])
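
The row above has 9 values: 4 scaled numeric columns followed by 3 one-hot columns for `island` and 2 for `sex`. The sketch below mimics the categorical part in plain Python, assuming the categories are kept in sorted order; `one_hot` is a hypothetical helper, not part of scikit-learn. It also shows the all-zero vector that `handle_unknown="ignore"` yields for an unseen category.

```python
def one_hot(value, categories):
    """One-hot encode `value`; an unknown category maps to an all-zero vector."""
    return [1.0 if value == category else 0.0 for category in categories]


# Categories in sorted order (an assumption that matches the output above).
islands = ["Biscoe", "Dream", "Torgersen"]
sexes = ["FEMALE", "MALE"]

# Row 0 of the dataset: a male penguin from Torgersen.
encoded = one_hot("Torgersen", islands) + one_hot("MALE", sexes)
print(encoded)  # [0.0, 0.0, 1.0, 0.0, 1.0]

# An island that was never seen during fitting produces only zeros.
unknown = one_hot("Atlantis", islands)
print(unknown)  # [0.0, 0.0, 0.0]
```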

In [24]:
from keras import Input
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD

model = Sequential(
    [
        # The scikit-learn transformer above outputs 9 features: 4 scaled
        # numeric columns, 3 one-hot columns for island, and 2 for sex.
        Input(shape=(9,)),
        Dense(10, activation="relu"),
        Dense(8, activation="relu"),
        Dense(3, activation="softmax"),
    ],
)

model.compile(
    optimizer=SGD(learning_rate=0.01),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

In [25]:
model.fit(x, y, verbose=0, epochs=1, batch_size=32)


<keras.src.callbacks.history.History at 0x358888a00>

In [49]:
transformer.transform(df2)

array([[-0.88708123,  0.78774251, -1.42248782, -0.56578921,  0.        ,
         0.        ,  0.        ]])

In [348]:
df["sex"] = df["sex"].replace(".", np.nan)
df = df.dropna()

In [345]:
test_df = df.sample(frac=0.2, random_state=42)
train_df = df.drop(test_df.index)

In [346]:
train_df.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,FEMALE


In [349]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=2, shuffle=True)
kfold_indices = list(enumerate(kfold.split(train_df)))

In [351]:
kfold_indices[0]

(0,
 (array([  0,   1,   3,   4,   5,   7,  10,  14,  16,  17,  18,  23,  24,
          25,  26,  28,  29,  30,  31,  32,  35,  36,  37,  38,  40,  41,
          43,  45,  46,  47,  50,  52,  53,  59,  60,  61,  65,  66,  67,
          68,  69,  71,  72,  78,  79,  80,  81,  82,  84,  95,  98,  99,
         101, 102, 103, 104, 108, 109, 111, 115, 116, 117, 119, 122, 130,
         132, 133, 136, 137, 139, 140, 141, 143, 144, 145, 147, 149, 150,
         152, 153, 154, 155, 160, 162, 166, 172, 173, 174, 175, 177, 178,
         179, 187, 188, 194, 195, 196, 197, 200, 201, 202, 203, 205, 215,
         216, 217, 219, 220, 221, 222, 224, 225, 227, 228, 230, 234, 235,
         238, 239, 242, 244, 245, 246, 247, 248, 249, 251, 254, 255, 256,
         261, 263, 265, 267, 268, 271, 274]),
  array([  2,   6,   8,   9,  11,  12,  13,  15,  19,  20,  21,  22,  27,
          33,  34,  39,  42,  44,  48,  49,  51,  54,  55,  56,  57,  58,
          62,  63,  64,  70,  73,  74,  75,  76,  77,  83,  85

In [352]:
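# The fold indices are positions within `train_df`, the frame passed to
# `kfold.split`, so they are applied to `train_df` rather than `df`.
train_indices, test_indices = kfold_indices[0][1]
train_df, test_df = train_df.iloc[train_indices], train_df.iloc[test_indices]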
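
The positions returned by a k-fold split refer to rows of the object that was split, so they should be used to index that same object. The following is a simplified pure-Python sketch of the idea. The function `kfold_positions` is hypothetical and does not reproduce scikit-learn's exact fold assignment.

```python
import random


def kfold_positions(n_rows, n_splits, seed=0):
    """Yield (train_positions, test_positions) for each fold."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    # Deal the shuffled positions into `n_splits` roughly equal folds.
    folds = [positions[i::n_splits] for i in range(n_splits)]
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield sorted(train), sorted(test)


rows = ["a", "b", "c", "d", "e", "f"]  # stand-in for the rows of train_df
splits = list(kfold_positions(len(rows), n_splits=2))

# Index the same collection that the positions were computed from.
train_positions, test_positions = splits[0]
fold_train = [rows[p] for p in train_positions]
fold_test = [rows[p] for p in test_positions]
print(fold_train, fold_test)
```

Every row lands in exactly one test fold, and the train and test positions of a fold never overlap.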

In [353]:
import tensorflow as tf
from keras.layers import StringLookup

label_lookup = StringLookup(
    # The order matters: the first entry of the vocabulary is encoded as 0.
    vocabulary=["Adelie", "Chinstrap", "Gentoo"],
    num_oov_indices=0,
)


def encode_label(x, y):
    encoded_y = label_lookup(y)
    return x, encoded_y


def dataframe_to_dataset(df):
    df = df.copy()
    labels = df.pop("species")
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    ds = ds.map(encode_label, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.shuffle(buffer_size=len(df))
    return ds


train_dataset = dataframe_to_dataset(train_df)
test_dataset = dataframe_to_dataset(test_df)

In [354]:
for x, y in train_dataset.take(1):
    print("Input:", x)
    print("Target:", y)

Input: {'island': <tf.Tensor: shape=(), dtype=string, numpy=b'Biscoe'>, 'culmen_length_mm': <tf.Tensor: shape=(), dtype=float64, numpy=42.8>, 'culmen_depth_mm': <tf.Tensor: shape=(), dtype=float64, numpy=14.2>, 'flipper_length_mm': <tf.Tensor: shape=(), dtype=float64, numpy=209.0>, 'body_mass_g': <tf.Tensor: shape=(), dtype=float64, numpy=4700.0>, 'sex': <tf.Tensor: shape=(), dtype=string, numpy=b'FEMALE'>}
Target: tf.Tensor(2, shape=(), dtype=int64)
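
The target `2` above is the position of the species in the vocabulary given to `StringLookup`. A minimal plain-Python sketch of that mapping and its inverse follows. The `encode` and `decode` helpers are hypothetical and only approximate the layer's behavior when there are no out-of-vocabulary slots.

```python
vocabulary = ["Adelie", "Chinstrap", "Gentoo"]

# Forward lookup: the position in the vocabulary becomes the class index.
label_to_index = {label: index for index, label in enumerate(vocabulary)}


def encode(label):
    """Return the class index; unknown labels raise since there is no OOV slot."""
    if label not in label_to_index:
        raise ValueError(f"Unknown label: {label!r}")
    return label_to_index[label]


def decode(index):
    """Return the species name for a class index."""
    return vocabulary[index]


print(encode("Gentoo"))  # 2
print(decode(0))  # Adelie
```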


2024-06-08 17:21:07.476200: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In [119]:
train_dataset = train_dataset.batch(32)
test_dataset = test_dataset.batch(32)

In [271]:
for x, y in train_dataset.take(1):
    print("Input:", x)

Input: {'island': <tf.Tensor: shape=(32,), dtype=string, numpy=
array([b'Biscoe', b'Torgersen', b'Biscoe', b'Dream', b'Biscoe', b'Biscoe',
       b'Torgersen', b'Torgersen', b'Dream', b'Biscoe', b'Dream',
       b'Biscoe', b'Dream', b'Biscoe', b'Biscoe', b'Biscoe', b'Dream',
       b'Biscoe', b'Biscoe', b'Biscoe', b'Biscoe', b'Biscoe', b'Biscoe',
       b'Biscoe', b'Biscoe', b'Biscoe', b'Dream', b'Dream', b'Dream',
       b'Torgersen', b'Dream', b'Biscoe'], dtype=object)>, 'culmen_length_mm': <tf.Tensor: shape=(32,), dtype=float64, numpy=
array([44. , 41.5, 49.6, 49.8, 35.5, 43.3, 36.2, 36.2, 43.5, 39.7, 47.6,
       45.6, 51.3, 49.9, 49.5, 46.8, 45.6, 45.2, 37.7, 47.5, 40.6, 50.8,
       49.4, 50. , 36.5, 49.3, 50.8, 39.7, 49. , 39.7, 35.6, 42. ])>, 'culmen_depth_mm': <tf.Tensor: shape=(32,), dtype=float64, numpy=
array([13.6, 18.3, 15. , 17.3, 16.2, 14. , 17.2, 16.1, 18.1, 17.7, 18.3,
       20.3, 19.9, 16.1, 16.1, 14.3, 19.4, 13.8, 16. , 14.2, 18.6, 15.7,
       15.8, 15.9, 16.6, 15

2024-06-08 16:08:53.143833: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In [120]:
from keras.utils import FeatureSpace

feature_space = FeatureSpace(
    features={
        "sex": FeatureSpace.string_categorical(num_oov_indices=0),
        "island": "string_categorical",
        "culmen_length_mm": "float_normalized",
        "culmen_depth_mm": "float_normalized",
        "flipper_length_mm": "float_normalized",
        "body_mass_g": "float_normalized",
    },
    output_mode="concat",
)

In [121]:
train_ds_with_no_labels = train_dataset.map(lambda x, _: x)
feature_space.adapt(train_ds_with_no_labels)

2024-06-08 14:52:46.673085: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

In [122]:
for x, _ in train_dataset.take(1):
    preprocessed_x = feature_space(x)
    print("preprocessed_x.shape:", preprocessed_x.shape)
    print("preprocessed_x.dtype:", preprocessed_x.dtype)

preprocessed_x.shape: (32, 10)
preprocessed_x.dtype: <dtype: 'float32'>


2024-06-08 14:52:47.613255: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In [123]:
preprocessed_train_ds = train_dataset.map(
    lambda x, y: (feature_space(x), y),
    num_parallel_calls=tf.data.AUTOTUNE,
)
preprocessed_train_ds = preprocessed_train_ds.prefetch(tf.data.AUTOTUNE)

preprocessed_test_ds = test_dataset.map(
    lambda x, y: (feature_space(x), y),
    num_parallel_calls=tf.data.AUTOTUNE,
)
preprocessed_test_ds = preprocessed_test_ds.prefetch(tf.data.AUTOTUNE)

In [124]:
dict_inputs = feature_space.get_inputs()
encoded_features = feature_space.get_encoded_features()

In [125]:
dict_inputs

{'island': <KerasTensor shape=(None, 1), dtype=string, sparse=None, name=island>,
 'sex': <KerasTensor shape=(None, 1), dtype=string, sparse=None, name=sex>,
 'culmen_length_mm': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=culmen_length_mm>,
 'culmen_depth_mm': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=culmen_depth_mm>,
 'flipper_length_mm': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=flipper_length_mm>,
 'body_mass_g': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=body_mass_g>}

In [126]:
encoded_features

<KerasTensor shape=(None, 10), dtype=float32, sparse=False, name=keras_tensor_110>

In [127]:
from keras import Model
from keras.layers import Dense
from keras.optimizers import SGD

x = Dense(10, activation="relu")(encoded_features)
x = Dense(8, activation="relu")(x)
outputs = Dense(3, activation="softmax")(x)

model = Model(inputs=encoded_features, outputs=outputs)
model.compile(
    optimizer=SGD(learning_rate=0.01),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

In [128]:
inference_model = Model(inputs=dict_inputs, outputs=outputs)

In [129]:
model.fit(
    preprocessed_train_ds,
    epochs=50,
    validation_data=preprocessed_test_ds,
    verbose=2,
)

Epoch 1/50


9/9 - 0s - 23ms/step - accuracy: 0.1873 - loss: 1.2161 - val_accuracy: 0.2879 - val_loss: 1.1632
Epoch 2/50
9/9 - 0s - 5ms/step - accuracy: 0.2322 - loss: 1.1728 - val_accuracy: 0.3636 - val_loss: 1.1302
Epoch 3/50
9/9 - 0s - 5ms/step - accuracy: 0.2996 - loss: 1.1336 - val_accuracy: 0.3636 - val_loss: 1.1041
Epoch 4/50
9/9 - 0s - 5ms/step - accuracy: 0.3408 - loss: 1.1017 - val_accuracy: 0.4242 - val_loss: 1.0803
Epoch 5/50
9/9 - 0s - 5ms/step - accuracy: 0.3858 - loss: 1.0732 - val_accuracy: 0.4545 - val_loss: 1.0581
Epoch 6/50
9/9 - 0s - 5ms/step - accuracy: 0.4120 - loss: 1.0464 - val_accuracy: 0.5152 - val_loss: 1.0365
Epoch 7/50
9/9 - 0s - 5ms/step - accuracy: 0.4494 - loss: 1.0204 - val_accuracy: 0.5152 - val_loss: 1.0140
Epoch 8/50
9/9 - 0s - 5ms/step - accuracy: 0.4906 - loss: 0.9946 - val_accuracy: 0.5606 - val_loss: 0.9917
Epoch 9/50
9/9 - 0s - 5ms/step - accuracy: 0.5243 - loss: 0.9701 - val_accuracy: 0.6364 - val_loss: 0.9700
Epoch 10/50
9/9 - 0s - 5ms/step - accuracy: 0.5

<keras.src.callbacks.history.History at 0x3a63ae4a0>

In [304]:
sample = {
    "island": ["Biscoe", "Torgersen", "Torgersen"],
    "culmen_length_mm": [48.6, 44.1, 39.1],
    "culmen_depth_mm": [16.0, 18.0, 18.7],
    "flipper_length_mm": [230.0, 210.0, 181.0],
    "body_mass_g": [5800.0, 4000.0, 3750.0],
    "sex": ["MALE", "FEMALE", "MALE"],
}

input_dict = {name: tf.convert_to_tensor(value) for name, value in sample.items()}
input_dict

{'island': <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'Biscoe', b'Torgersen', b'Torgersen'], dtype=object)>,
 'culmen_length_mm': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([48.6, 44.1, 39.1], dtype=float32)>,
 'culmen_depth_mm': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([16. , 18. , 18.7], dtype=float32)>,
 'flipper_length_mm': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([230., 210., 181.], dtype=float32)>,
 'body_mass_g': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([5800., 4000., 3750.], dtype=float32)>,
 'sex': <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'MALE', b'FEMALE', b'MALE'], dtype=object)>}

In [314]:
result = inference_model.predict(input_dict)
result

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step


array([[0.03257361, 0.14812116, 0.81930524],
       [0.3940005 , 0.28193894, 0.32406056],
       [0.94649315, 0.02197206, 0.0315347 ]], dtype=float32)

In [315]:
from keras.layers import Lambda


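def pred(p):
    """Return the predicted class index and its probability for each sample.

    `tf.stack` stacks along axis 0 by default, so the result has shape
    (2, batch_size): row 0 holds class indices, row 1 holds confidences.
    """
    return tf.stack(
        [
            tf.cast(tf.math.argmax(p, axis=1), dtype=tf.float32),
            tf.math.reduce_max(p, axis=1),
        ],
    )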


prediction = Lambda(pred)(outputs)

inference_model2 = Model(inputs=dict_inputs, outputs=prediction)

In [316]:
result2 = inference_model2.predict(input_dict)
result2

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 50ms/step


array([[2.        , 0.        , 0.        ],
       [0.81930524, 0.3940005 , 0.94649315]], dtype=float32)
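
The output above is laid out as two rows (class indices, then confidences) rather than one row per sample. One way to regroup it per sample, sketched here in plain Python with the values copied from that output, is to transpose it. Passing `axis=1` to `tf.stack` should give the per-sample layout directly, although that variant was not run here.

```python
# Output of inference_model2: row 0 holds class indices, row 1 confidences.
result2 = [
    [2.0, 0.0, 0.0],
    [0.81930524, 0.3940005, 0.94649315],
]

# Transpose from (2, batch_size) to (batch_size, 2).
per_sample = [list(pair) for pair in zip(*result2)]
print(per_sample)
# [[2.0, 0.81930524], [0.0, 0.3940005], [0.0, 0.94649315]]
```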

In [317]:
label_lookup.get_vocabulary()

['Adelie', 'Chinstrap', 'Gentoo']

In [318]:
decoder = StringLookup(
    vocabulary=label_lookup.get_vocabulary(),
    invert=True,
    num_oov_indices=0,
)

In [320]:
decoder(np.argmax(result, axis=1)).numpy()

array([b'Gentoo', b'Adelie', b'Adelie'], dtype=object)

In [321]:
np.argmax(result, axis=1), np.max(result, axis=1)

(array([2, 0, 0]), array([0.81930524, 0.3940005 , 0.94649315], dtype=float32))

In [12]:
import numpy as np

classes = ["Adelie", "Chinstrap", "Gentoo"]

prediction = np.array([0, 2, 1, 1])
confidence = np.array([0.6, 0.9, 0.8, 0.7])

prediction = np.vectorize(lambda x: classes[x])(prediction)

[
    {"prediction": p, "confidence": c}
    for p, c in zip(prediction, confidence, strict=True)
]

[{'prediction': 'Adelie', 'confidence': 0.6},
 {'prediction': 'Gentoo', 'confidence': 0.9},
 {'prediction': 'Chinstrap', 'confidence': 0.8},
 {'prediction': 'Chinstrap', 'confidence': 0.7}]
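
The same record format can be built straight from the softmax rows returned by `inference_model.predict`. The sketch below uses the three probability rows printed earlier and plain Python lists. `to_records` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
classes = ["Adelie", "Chinstrap", "Gentoo"]

# Softmax rows copied from the `inference_model.predict` output.
probabilities = [
    [0.03257361, 0.14812116, 0.81930524],
    [0.3940005, 0.28193894, 0.32406056],
    [0.94649315, 0.02197206, 0.0315347],
]


def to_records(rows):
    """Turn each probability row into a {"prediction", "confidence"} record."""
    records = []
    for row in rows:
        confidence = max(row)
        # The index of the highest probability selects the class name.
        species = classes[row.index(confidence)]
        records.append({"prediction": species, "confidence": confidence})
    return records


records = to_records(probabilities)
for record in records:
    print(record)
```

The predicted species agree with the decoded labels shown earlier (`Gentoo`, `Adelie`, `Adelie`).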

In [7]:
predictions

['Adelie', 'Gentoo', 'Chinstrap', 'Chinstrap']

In [30]:
input_example = {
    "island": "Biscoe",
    "culmen_length_mm": 48.6,
    "culmen_depth_mm": 16.0,
    "flipper_length_mm": 230.0,
    "body_mass_g": 5800.0,
    "sex": "MALE",
}

In [33]:
from mlflow.models import infer_signature

infer_signature(
    model_input=input_example, model_output={"prediction": "Adelie", "confidence": 0.90}
)

inputs: 
  ['island': string (required), 'culmen_length_mm': double (required), 'culmen_depth_mm': double (required), 'flipper_length_mm': double (required), 'body_mass_g': double (required), 'sex': string (required)]
outputs: 
  ['prediction': string (required), 'confidence': double (required)]
params: 
  None