# 🔄 Build an ML Pipeline with scikit-learn & Union

<a target="_blank" href="https://colab.research.google.com/github/unionai-oss/bert-llm-classification-pipeline/blob/main/tutorial.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This tutorial walks you through building an end-to-end machine learning pipeline with scikit-learn and Union's AI workflow and inference platform. We'll download a dataset, train a model, deploy it, and track its artifacts using Union's MLOps features. Although the example is simple, the same concepts and tools apply to more complex machine learning and AI projects.


By adding just a few lines of code to your Python functions, you can create a reproducible ML pipeline that takes advantage of Union's features:

- Reproducible AI workflows: Ensure your ML pipeline runs in the same environment every time.
- Versioning of code and artifacts: Track changes in your code and models automatically.
- Data Caching for faster iterations: Reuse results from previous executions to save time.
- Declarative Infrastructure: Define your ML infrastructure needs directly in your code without worrying about provisioning.
- Artifact Management for models and data: Automatically manage your model files and datasets.
- Container Image Builder: Build and deploy your code in a consistent environment.
- Local Development: Test your workflows locally before deploying them to the cloud.
- Actors for long-running stateful containers: Handle tasks that require continuous state or interaction.
- And more...

```python
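# Assumes the Flytekit SDK, which Union's SDK builds on.
import pandas as pd
from flytekit import ImageSpec, Resources, task, workflow
from sklearn.base import BaseEstimator

# Container image holding the task dependencies (package list is illustrative).
image = ImageSpec(packages=["pandas", "scikit-learn"])


@task(
    cache=True,  # reuse the output of earlier runs with the same inputs
    cache_version="4",  # bump this to invalidate previously cached results
    container_image=image,
    requests=Resources(cpu="2", mem="2Gi"),
)
def download_data() -> pd.DataFrame:
    ...


@task(
    container_image=image,
    requests=Resources(cpu="2", mem="20Gi", gpu="1"),
)
def train_model(data: pd.DataFrame) -> BaseEstimator:
    ...


@workflow
def pipeline_workflow():
    data = download_data()
    train_model(data=data)
    ...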
```
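To build intuition for what `cache=True` and `cache_version` do in the snippet above, here is a plain-Python sketch of the general idea: a result is reused when the task name, cache version, and inputs all match a previous call. This is a conceptual illustration only, not Union's actual implementation; the `cached_task` decorator, the in-memory `_CACHE` store, and the example URL are hypothetical names introduced for this sketch.

```python
import hashlib
import json
from functools import wraps

# Hypothetical in-memory store; a real platform would persist results elsewhere.
_CACHE = {}


def cached_task(cache_version):
    """Illustrative decorator: reuse a result when name, version, and inputs match."""

    def decorator(fn):
        @wraps(fn)
        def wrapper(**kwargs):
            # Build a deterministic key from the task name, cache version, and inputs.
            payload = json.dumps(
                {"task": fn.__name__, "version": cache_version, "inputs": kwargs},
                sort_keys=True,
            )
            key = hashlib.sha256(payload.encode()).hexdigest()
            if key not in _CACHE:
                _CACHE[key] = fn(**kwargs)  # cache miss: run the function
            return _CACHE[key]

        return wrapper

    return decorator


calls = []  # records every real execution of the task body


@cached_task(cache_version="4")
def download_data(url: str) -> list:
    calls.append(url)
    return [1, 2, 3]


first = download_data(url="https://example.com/data.csv")
second = download_data(url="https://example.com/data.csv")  # served from the cache
```

In this sketch, calling `download_data` with a different `url`, or decorating it with a different `cache_version`, produces a new key and so triggers a fresh execution. That mirrors why bumping the cache version is the usual way to force a re-run after changing a task's logic.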


## 🧰 Setup 

Sign up for a Union Serverless account at [Union.ai](https://union.ai) by clicking the "Get Started" button. No credit card is required, and you'll receive $30 in free credits. Signing up can take a few minutes.

Or you can use your [Union BYOC Enterprise](https://www.union.ai/pricing) login if you have one.

### 📦 Install Python Packages & Clone Repo

Install the packages listed in the [requirements.txt](requirements.txt) file into your local environment with your preferred package manager. For example: `pip install -r requirements.txt`.

To clone the repo, run the following command in your environment: `git clone `
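After installing, it can help to confirm that the key packages are importable before moving on. The snippet below is a minimal sketch that uses only the standard library. The module names in `REQUIRED` are assumptions about what requirements.txt contains, so adjust them to match the actual file.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Assumed import names; edit this list to match the entries in requirements.txt.
REQUIRED = ["sklearn", "pandas", "union"]

missing = missing_packages(REQUIRED)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are importable.")
```

Note that the import name can differ from the name used at install time (for example, `scikit-learn` is imported as `sklearn`), which is why the list holds module names rather than package names.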

If you're running this notebook in a Google Colab environment, you can install the packages and clone the GitHub repo directly in the notebook by running the following cell:
