# 🔄 Build an ML Pipeline with scikit-learn & Union

<a target="_blank" href="https://colab.research.google.com/github/unionai-oss/scikit-learn-ml-pipelines/blob/main/tutorial.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This tutorial will walk you through building an end-to-end machine learning pipeline using scikit-learn and Union's AI workflow and inference platform. We'll download a dataset, train a machine learning model, deploy it, and track its artifacts using Union's powerful MLOps features. Although this example may seem relatively simple, all the concepts and tools used here can be applied to more complex machine learning and AI projects.

By just adding a few lines of code to your Python functions, you'll be able to create a reproducible ML pipeline, taking advantage of Union's features:

- Reproducible AI workflows: Ensure your ML pipeline runs in the same environment every time.
- Versioning of code and artifacts: Track changes in your code and models automatically.
- Data Caching for faster iterations: Reuse results from previous executions to save time.
- Declarative Infrastructure: Define your ML infrastructure needs directly in your code without worrying about provisioning.
- Artifact Management for models and data: Automatically manage your model files and datasets.
- Container Image Builder: Build and deploy your code in a consistent environment.
- Local Development: Test your workflows locally before deploying them to the cloud.
- Actors for long-running stateful containers: Handle tasks that require continuous state or interaction.
- And more...


## 🧰 Setup

Sign up for a Union Serverless account at [Union.ai](https://union.ai) by clicking the "Get Started" button. No card required, and you'll get $30 in free credits to get started. Signing up can take a few minutes.

Or you can use your [Union BYOC Enterprise](https://www.union.ai/pricing) login if you have one.

### 📦 Install Python Packages & Clone Repo

Install the packages listed in [requirements.txt](requirements.txt) into your local environment with your preferred package manager. For example: `pip install -r requirements.txt`.

To clone the repo, run `git clone` followed by the repository URL in your environment. The Colab cell below uses `https://github.com/unionai-oss/scikit-learn-ml-pipelines`.

If you're running this notebook in a Google Colab environment, you can install the packages and clone the GitHub repo directly in the notebook by running the following cell:


In [1]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    !git clone https://github.com/unionai-oss/scikit-learn-ml-pipelines
    %cd scikit-learn-ml-pipelines
    !pip install -r requirements.txt

### 🔐 Authenticate

If you're using [Union BYOC Enterprise](https://www.union.ai/pricing) use: `union create login --host <union-host-url>`

Otherwise, authenticate to [Union Serverless](https://www.union.ai/) by running the command below. If you don't have an account, create one for free at [Union.ai](https://union.ai).
 

In [None]:
!union create login --serverless --auth device-flow

## 🧩 Create a Simple Workflow

Before we build our ML pipeline, let's build a simple workflow to understand the basics of Union's workflow system.

`ImageSpec` - Allows you to specify the environment in which your task will run directly in your Python code. This includes the Python packages, CUDA version, and any additional environment setup you need. When a task is run, Union will automatically build a container image with the specified environment if it doesn't already exist and run the task in that container.

`Tasks` - Tasks are the building blocks of workflows. They let you define a unit of work and the infrastructure it should use.

`Workflows` - A workflow is a collection of tasks and defines how data flows between them. Workflows can be run locally or in the cloud.

Both tasks and workflows are strongly typed.
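To make "strongly typed" concrete, here is a minimal plain-Python sketch of how type annotations on a function can be used to reject a bad input before the function runs. It only illustrates the idea and is not how flytekit performs its type checking. The helper `check_inputs` is hypothetical and handles only simple classes such as `str` and `int`.

```python
from typing import get_type_hints


def hello_world(name: str) -> str:
    return f"Hello {name}"


def check_inputs(func, **kwargs):
    """Call func only if every keyword argument matches its annotation.

    Hypothetical helper for illustration; not part of flytekit or Union.
    """
    hints = get_type_hints(func)
    for arg, value in kwargs.items():
        expected = hints.get(arg)
        # Skip arguments without an annotation
        if expected is not None and not isinstance(value, expected):
            raise TypeError(
                f"{func.__name__}: '{arg}' expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return func(**kwargs)


print(check_inputs(hello_world, name="world"))

try:
    check_inputs(hello_world, name=42)
except TypeError as err:
    print(err)
```

The practical benefit is that a mismatch such as passing an `int` where a `str` is declared surfaces as an error at the boundary of the task rather than somewhere inside it.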

Note: In this section we register tasks and workflows directly in the notebook cells. This is useful for prototyping and testing, and lets you create and run reproducible workflows without leaving the notebook. However, most people will probably prefer to define tasks in a separate file and import them into workflows, which we will do in the next section.

In [10]:
# Simple workflow
import flytekit as fl

# Container image the task runs in; Union builds it if it doesn't already exist
base_image = fl.ImageSpec(
    name="base_image_for_tutorial",
    packages=["flytekit==1.14.2", "union==0.1.117"]
)

@fl.task(container_image=base_image)
def hello_world(name: str) -> str:
    return f"Hello {name}"

# workflow
@fl.workflow()
def simple_workflow(name: str) -> str:
    return hello_world(name=name)

In [None]:
# Execute the workflow locally
simple_workflow(name="world")

In [None]:
# Create remote context
from union.remote import UnionRemote
serverless = UnionRemote()

In [None]:
# Execute the workflow remotely
execution = serverless.execute(simple_workflow, inputs={'name': 'world'})

## 🔀 ML Model Training Pipeline

In this section we'll run tasks and workflows defined in Python files under the relevant folders.

Navigate to the `tasks` and `workflows` folders to see the code. If you're following along in a hosted Jupyter notebook, you should be able to view the code by clicking on the folder icon (usually on the left side of the screen).

First we'll create a machine learning pipeline that trains a model on the iris dataset.

Our workflow will have the following steps:
- Load the iris dataset
- Split the dataset into training and testing sets
- Train a Random Forest model
- Evaluate the model
- Save model as an artifact
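
The steps above can be sketched with plain scikit-learn functions. This is an assumption-laden illustration, not the code in the repo's `tasks` folder: the function names, the 80/20 split, the seed, and the `iris_model.joblib` file name are all hypothetical, and the model is saved to a temporary directory rather than registered as a Union artifact.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def load_data():
    # The iris dataset ships with scikit-learn, so no download is needed
    iris = load_iris()
    return iris.data, iris.target


def split_data(X, y, test_size=0.2, seed=42):
    # Stratify so each class keeps the same proportion in both splits
    return train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )


def train_model(X_train, y_train, seed=42):
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return model


def evaluate_model(model, X_test, y_test):
    return accuracy_score(y_test, model.predict(X_test))


def save_model(model, directory):
    path = Path(directory) / "iris_model.joblib"  # hypothetical file name
    joblib.dump(model, path)
    return path


def train_iris_sketch(directory):
    # Each call below corresponds to one step in the list above
    X, y = load_data()
    X_train, X_test, y_train, y_test = split_data(X, y)
    model = train_model(X_train, y_train)
    accuracy = evaluate_model(model, X_test, y_test)
    model_path = save_model(model, directory)
    return accuracy, model_path


with tempfile.TemporaryDirectory() as tmp:
    accuracy, model_path = train_iris_sketch(tmp)
    print(f"accuracy={accuracy:.3f}, saved as {model_path.name}")
```

In a Union workflow, each of these functions would typically become a task, with the workflow function wiring their inputs and outputs together.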

Note: In more complex projects, data pipelines could be separate from model training pipelines. In this example we'll keep it simple and combine them into one workflow.

Navigate to the `workflows` folder and open the [`workflows.py`](workflows/workflows.py) file. Find the `train_iris_classification()` function to see the code for the workflow. This workflow uses tasks defined in the `tasks` folder and builds a container image from `container.py`.

In [None]:
!union run --remote workflows/workflows.py train_iris_classification

The `--remote` flag runs the workflow in the cloud. To run the workflow locally, remove the flag: `union run workflows/workflows.py train_iris_classification`.

You will often want to run a workflow locally to test it before running it in the cloud. This is especially useful when you're developing a new workflow or debugging an existing one.

It can be useful to do some things differently when running locally, such as using a subset of the data or saving files in a different format for debugging. To trigger a section of code only when running locally, check for the `"FLYTE_INTERNAL_EXECUTION_ID"` environment variable in the code. If it's not present, the code is running locally.

```python
import os

if "FLYTE_INTERNAL_EXECUTION_ID" not in os.environ:
    # Only run this code locally
    print("Running locally")
```
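
As a slightly fuller sketch of the same check, the snippet below wraps it in a helper and uses it to pick a smaller data subset for local runs. The names `is_local_execution` and `select_rows`, and the limit of 10 rows, are hypothetical; only the `FLYTE_INTERNAL_EXECUTION_ID` variable comes from the text above. The `environ` parameter exists so the behavior can be tried without changing the real environment.

```python
import os


def is_local_execution(environ=None):
    """Return True when FLYTE_INTERNAL_EXECUTION_ID is absent."""
    env = os.environ if environ is None else environ
    return "FLYTE_INTERNAL_EXECUTION_ID" not in env


def select_rows(rows, local_limit=10, environ=None):
    """Use only the first `local_limit` rows locally, all rows remotely."""
    if is_local_execution(environ):
        return rows[:local_limit]
    return rows


data = list(range(100))

# Simulated local run: the variable is not set
print(len(select_rows(data, environ={})))

# Simulated remote run: the variable is set (value is a made-up placeholder)
print(len(select_rows(data, environ={"FLYTE_INTERNAL_EXECUTION_ID": "abc123"})))
```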



## 🚀 Model Serving

In this tutorial we'll show you the common ways to serve a model using Union, but you can also download or move the model to your own infrastructure.

- Use regular containers for batch inference
- Use Actors (long-running, stateful containers) for near real-time inference
- _Coming soon:_ Serve the model and application interface within Union

### Batch Prediction ML Workflow
The training workflow produced a model artifact that we can use to make predictions on new data.

Let's run our first prediction workflow. This workflow ... Next, we'll see how Actors, which are long-running containers, can give us faster predictions.

In [None]:
!union run --remote workflows/workflows.py batch_prediction_knn

### ⚡ Enabling Near Real-time Predictions with Actors
Actors are long-running, stateful containers, which makes them a good fit for near real-time inference.

In [None]:
!union run --remote workflows/workflows.py actor_prediction_knn
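
To see why a long-running, stateful container can speed up predictions, consider this plain-Python sketch. It does not use Union's Actor API; it only illustrates the underlying idea that a process which stays alive can load a model once and reuse it for every request. `ThresholdModel`, `load_model`, and `predict` are hypothetical stand-ins, and the "model" is a trivial placeholder rather than a trained classifier.

```python
from functools import lru_cache

LOAD_CALLS = 0  # counts how many times the model is actually loaded


class ThresholdModel:
    """Placeholder model: predicts True when the feature sum exceeds 10."""

    def predict(self, features):
        return sum(features) > 10


@lru_cache(maxsize=1)
def load_model():
    # In a real service this would be the slow step (reading a file, etc.)
    global LOAD_CALLS
    LOAD_CALLS += 1
    return ThresholdModel()


def predict(features):
    # The cached model is reused for as long as the process stays alive
    model = load_model()
    return model.predict(features)


results = [predict(f) for f in ([1, 2, 3], [5, 6, 7], [4, 4, 4])]
print(results)     # three predictions
print(LOAD_CALLS)  # the model was loaded only once
```

With a fresh container per request, the load step would run every time; keeping state between requests removes that cost from all but the first call.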

### Build an application with Gradio

Full app serving coming soon! 

## Learn More About Union and Building AI Pipelines:

We hope you had fun and learned something new from this tutorial on building ML pipelines with Union! Creating reproducible AI workflows is a powerful way to increase productivity and collaboration across your team, and an essential part of MLOps for deploying and managing machine learning models in production.

To learn more about Union and building AI pipelines: 
- Check out the [Union Documentation](https://docs.union.ai/).
- Contact us at [Union.ai](https://union.ai) for a demo or to learn more about Union Enterprise.
- Join our Slack community to ask questions and share your projects with other Union users.

### Other resources to learn more about MLOps and building AI pipelines:



