# Automate Model Lifecycle

Automating the model lifecycle refers to streamlining and automating the stages involved in developing, deploying, and maintaining machine learning models. Some of these steps are: 
- data preprocessing,
- model training,
- evaluation, 
- deployment, 
- monitoring, 
- and retraining. 

Automating these processes can save time, improve efficiency, and ensure consistency in model development.

# Lecture plan

# Reminders

#### Google Cloud Platform

- Console vs CLI vs code
- Authentication: one method for each interface

Google Cloud Platform (GCP) provides multiple interfaces for interacting with its services and resources, including the Console (web-based graphical user interface), the Command-Line Interface (CLI), and various SDKs (which allow interaction via code). Each of these interfaces requires authentication, and GCP offers a different authentication method for each one.

#### Cloud Storage

- Immutable data

In Google Cloud Storage (GCS), blobs (binary large objects) are the fundamental units of data storage. They represent individual pieces of data stored in buckets. They are immutable: once written, a blob's content cannot be edited in place. To change it, you upload a new version that replaces the old one (or delete the blob).

#### BigQuery

- Relational data
- Columnar storage & partitions

In BigQuery, columnar storage and partitioning are two important features that can significantly improve query performance and cost efficiency when working with large datasets.

If they ask:

`Columnar Storage`:
BigQuery utilizes a columnar storage format called Capacitor, which organizes data by column rather than by row. Because the values of each column are stored together, similar values sit next to each other and compress more efficiently. This format also enables BigQuery to scan and process only the specific columns needed for a query, minimizing the amount of data read from storage. <br>
`Partitioning` is a technique that divides a table into smaller, more manageable sections based on a specific column's values. BigQuery supports partitioning tables by ingestion time (the time the rows were loaded) or by the values of a date/timestamp column; integer-range partitioning is also available.
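
If they want to see the idea in code, here is a toy sketch in plain Python. It is not how Capacitor or BigQuery are implemented; the column names (`pickup_date`, `fare`, `distance`) and the values are made up for the illustration. It only shows why a query that needs one column from one partition can skip everything else.

```python
from collections import defaultdict
from datetime import date

# Toy rows; column names and values are invented for this example.
rows = [
    {"pickup_date": date(2015, 1, 1), "fare": 7.5, "distance": 1.2},
    {"pickup_date": date(2015, 1, 1), "fare": 12.0, "distance": 3.4},
    {"pickup_date": date(2015, 1, 2), "fare": 5.0, "distance": 0.8},
]


def to_partitioned_columns(rows, partition_key):
    """Group rows by partition value, then store each partition column by column."""
    partitions = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for name, value in row.items():
            partitions[row[partition_key]][name].append(value)
    return partitions


def total_fare(partitions, day):
    """Read a single column ("fare") from a single partition (day)."""
    if day not in partitions:
        return 0.0
    return sum(partitions[day]["fare"])


table = to_partitioned_columns(rows, "pickup_date")
print(total_fare(table, date(2015, 1, 1)))  # 19.5
```

The `distance` column and the January 2nd partition are never touched by the query, which is the saving that columnar storage and partitioning aim for.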

#### Compute Engine

We talked about the idea of a virtual machine, which is an important building block of cloud computing.

#### direnv

direnv is a command-line tool designed to manage environment variables on a per-directory basis. It reads a `.envrc` file in the project directory (which can in turn load the variables declared in a `.env` file) and automates the loading and unloading of environment variables based on the current working directory. CURIOSITY: The name "direnv" is short for "directory environment."
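
On the Python side, the variables loaded by direnv are just regular environment variables. A minimal sketch of reading them, assuming a hypothetical variable named `DATA_SIZE` (the name and the default are only for illustration):

```python
import os


def get_setting(name, default=None):
    """Return an environment variable, failing loudly if it is required but missing."""
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"{name} is not set; has direnv loaded this directory?")
    return value


# DATA_SIZE is a hypothetical variable name used only for this example.
data_size = get_setting("DATA_SIZE", default="1k")
```

Failing early with a clear message is a design choice: a missing variable is easier to diagnose at startup than halfway through a training run.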

# Objective

### What is our progress?

# What's next

As things stand right now, someone has to log into the virtual machine, manually run `make run_train`, and manually re-preprocess the data to keep it up to date. In the long run that is unrealistic: with one model it may be OK, but if every new model you add has to be maintained by hand, it becomes cumbersome and pointless.

That is why we need to **Create a robust model lifecycle** - to:
- Ensure the reproducibility of the training in the future
- Track the performance of the model over time
- Serve multiple versions of the model
- Automate the model lifecycle

# Robust Lifecycle

# Experiment tracking with MLflow

MLflow is an open-source platform that enables machine learning experiment tracking, reproducibility, and model management. What it's going to do for us is help track our experiments.

# Cloud Training

Yesterday we were working with this workflow: our data warehouse passed preprocessed data to our virtual machine, which trained on it and then pushed the result, our model, to Google Cloud Storage.

# Experiment Tracking

Now we want to track the experiment and that is gonna come with a few things:
- data version: refers to the data used for the training. In our case we are going to look at the datetime (or timestamp) and the number of rows; over the past couple of days we have been working with 1K, 200K, or all rows.


- with experiment parameters:
1. code version
2. code parameters: configurable settings such as input and output paths
3. training env: Python + package versions
4. preprocessing type: what we used to preprocess the data
5. model hyperparameters: learning rate, epochs, etc.

- with experiment metrics:
1. training metrics: your loss, your MAE, your accuracy and so on...


- then finally the model version, the actual model itself, like V1 trained on May 25th, V2 trained on May 30th and so on.

As you can see, there are quite a lot of things that we want to keep track of, and if we were doing it just by ourselves in Python it would involve writing tons and tons of code.

# Tracking Requirements

For tracking machine learning experiments using MLflow, you'll need to ensure you have the necessary infrastructure and practices in place to capture various aspects of your experiments effectively. Here are some of the requirements for tracking experiments using MLflow:

1. **Experiment Params & Metrics**:
   - **Code Version**: Record the version of the code used for training the model. This can be achieved by capturing the git commit hash (SHA) of the code repository.
   - **Code Parameters**: Document the parameters used in the code, such as input paths, output paths, and any other configurable settings.
   - **Training Environment**: Include information about the Python version and package versions used for training the model. This ensures reproducibility by capturing the exact environment in which the model was trained.
   - **Preprocessing Type**: Describe the preprocessing steps applied to the data before training the model. This may include data cleaning, feature engineering, scaling, etc.
   - **Model Hyperparameters**: Record the hyperparameters used for training the model, such as learning rate, batch size, number of epochs, etc.
   - **Training Metrics**: Log relevant training metrics, such as loss, accuracy, precision, recall, F1 score, etc., to evaluate the performance of the model during training.

2. **Model Version**:
   - **Persisted Trained Model**: Save the trained model artifacts, including the model architecture, weights, and any other necessary files required to reproduce the model.
   - **Version Number**: Assign a unique version number or identifier to the trained model artifacts. This helps track different versions of the model over time.

3. **Data Version**:
   - **Data Used for Training**: It's good practice to keep track of the data used for training the model. This includes information such as the start date, end date, and the size of the dataset (e.g., 1k, 200k, or entire dataset).
   - **Data Versioning Control**: Consider using a data versioning tool like DVC to manage and track changes to your datasets. DVC integrates with Git: small metadata files that point to each version of the data are committed to Git, while the large data files themselves are kept in a cache or remote storage.
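
To give a feel for what collecting this information by hand looks like, here is a minimal sketch using only the standard library. The function names, the dictionary layout, and the example dates are assumptions made for this illustration; MLflow does not require this particular structure.

```python
import platform
import subprocess
from importlib import metadata


def get_code_version():
    """Return the current git commit SHA, or "unknown" outside a git repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"


def get_training_env(packages):
    """Return the Python version and the installed version of each package."""
    env = {"python": platform.python_version()}
    for name in packages:
        try:
            env[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            env[name] = "not installed"
    return env


def describe_run(start_date, end_date, row_count, packages=()):
    """Bundle code version, training environment, and data version in one dict."""
    return {
        "code_version": get_code_version(),
        "training_env": get_training_env(packages),
        "data_version": {"start": start_date, "end": end_date, "rows": row_count},
    }


# Example values only: the dates and row count are placeholders.
run_info = describe_run("2015-01-01", "2015-01-31", 200_000, packages=["pip"])
```

Even this small helper covers only three of the items in the list, which is a good argument for letting a dedicated tool handle the rest.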

To establish experiment tracking from a data warehouse to a virtual machine and finally to the experiment tracking system (such as MLflow), you'll need to design a workflow that captures and logs relevant information at each stage:

1. Data Warehouse: In the data warehouse, data relevant to your machine learning experiments is stored. This data may include raw datasets, preprocessed data, and any associated metadata.

2. Virtual Machine: You will provision a virtual machine instance in your cloud environment and establish connectivity between the VM and the data warehouse so that it can access the required datasets.

3. Experiment Tracking with MLflow: Finally, configure MLflow on the virtual machine to track machine learning experiments. Within your machine learning code running on the VM, integrate MLflow to log experiment parameters, metrics, and artifacts. Use MLflow's APIs or command-line interface to start and manage experiments, log metrics, and save model artifacts.

What are model artifacts?

A model artifact refers to the tangible output or result of a machine learning model training process. These artifacts encapsulate the knowledge and learned patterns extracted from the training data, which the model uses to make predictions or perform other tasks during inference.
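
As a rough illustration of what "persisting an artifact with a version" means, the sketch below pickles a stand-in model under a timestamp-based file name. The `model-<timestamp>.pkl` naming scheme and the dictionary used as a "model" are assumptions for this example; a real project would save the trained model in whatever format its framework provides.

```python
import pickle
import tempfile
import time
from pathlib import Path


def save_model(model, directory):
    """Persist a model under a timestamp-based version name and return its path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    version = time.strftime("%Y%m%d-%H%M%S")
    path = directory / f"model-{version}.pkl"
    with open(path, "wb") as file:
        pickle.dump(model, file)
    return path


def load_latest_model(directory):
    """Load the most recent artifact; timestamped names sort chronologically."""
    paths = sorted(Path(directory).glob("model-*.pkl"))
    if not paths:
        return None
    with open(paths[-1], "rb") as file:
        return pickle.load(file)


# A plain dict stands in for a trained model here.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model({"weights": [0.1, 0.2]}, tmp)
    restored = load_latest_model(tmp)
```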

# How to track our experiments?

To track machine learning experiments using MLflow, you can follow these steps:

1. **Set Up MLflow Server**:
   - Install and configure the MLflow server, which stores experiment tracking data in a database and trained models in a file storage system.
   - The MLflow server can be hosted on a local machine or in the cloud. 

2. **Use MLflow UI**:
   - Access the MLflow UI, a web interface that allows you to visualize experiment tracking data and annotate trained models.
   - The MLflow UI provides a user-friendly way to monitor experiments, compare runs, and track model performance over time.

3. **MLflow CLI**:
   - We will not focus on it, just be aware that it exists.

4. **Integrate MLflow into Your Code**:
   - Import the MLflow library into your Python code and use its functions to track experiments.

5. **Push Tracking Data to MLflow Server**:
   - When your code runs, it will push experiment tracking data (parameters, metrics, model artifacts) to the MLflow server through an API.

6. **Monitor Experiments**:
   - Use the MLflow UI to monitor your experiments, view tracking data, compare different runs, and analyze model performance.
   - The MLflow UI provides insights into experiment results and helps you make informed decisions about model improvements and optimizations.

By following these steps, you can effectively track your machine learning experiments using MLflow, from logging experiment data in your code to visualizing and analyzing results in the MLflow UI.
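
Steps 4 and 5 above can be sketched as follows. This is a minimal example, not the challenge solution: the parameter names, the metric value, the tracking URI, and the experiment name are all placeholders. The `mlflow` import is done lazily inside `log_run` so that the rest of the snippet runs even where MLflow is not installed.

```python
def build_run_record(learning_rate, batch_size, row_count, mae):
    """Split what we want to track into params and metrics dictionaries."""
    params = {
        "learning_rate": learning_rate,
        "batch_size": batch_size,
        "row_count": row_count,
    }
    metrics = {"mae": float(mae)}
    return params, metrics


def log_run(params, metrics, tracking_uri, experiment_name):
    """Push one run to an MLflow tracking server (requires `pip install mlflow`)."""
    import mlflow  # imported lazily so the rest of the module works without it

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)


# Placeholder values for illustration only.
params, metrics = build_run_record(0.001, 256, 200_000, mae=2.31)
# Uncomment once a tracking server is reachable at this (assumed) address:
# log_run(params, metrics, "http://localhost:5000", "example_experiment")
```

Saving the trained model itself is left out here, since the exact call depends on the framework used to train it.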

# Track Experiment + Save Model

-> GO TO ROADMAP

-> BACK ON SLIDE: THEORY

# CHART

# MLflow Tracking

# MLflow Server Architecture

So this is how it is structured behind the scenes. There is a remote host running MLflow, connected to a SQL database that stores the metrics and parameters. But a SQL database is not a proper way to store a model, so behind the scenes it uses a bucket to store the models.
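
To make that split concrete, here is a toy store that mimics the idea: small numeric values go into a SQL table, large files go to a separate location. It is only an illustration; the class, the table layout, and the use of a temporary directory in place of a bucket are assumptions and do not reflect MLflow's real schema.

```python
import sqlite3
import tempfile
from pathlib import Path


class ToyTrackingStore:
    """Metrics go to SQL; model files go to a separate artifact location."""

    def __init__(self, artifact_root):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE metrics (run_id TEXT, key TEXT, value REAL)")
        self.artifact_root = Path(artifact_root)

    def log_metric(self, run_id, key, value):
        self.db.execute("INSERT INTO metrics VALUES (?, ?, ?)", (run_id, key, value))

    def log_artifact(self, run_id, filename, payload):
        run_dir = self.artifact_root / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        path = run_dir / filename
        path.write_bytes(payload)
        return path

    def get_metric(self, run_id, key):
        row = self.db.execute(
            "SELECT value FROM metrics WHERE run_id = ? AND key = ?", (run_id, key)
        ).fetchone()
        return row[0] if row else None


# A temporary directory plays the role of the bucket.
with tempfile.TemporaryDirectory() as bucket:
    store = ToyTrackingStore(bucket)
    store.log_metric("run-1", "mae", 2.5)
    model_path = store.log_artifact("run-1", "model.bin", b"\x00\x01")
    stored_mae = store.get_metric("run-1", "mae")
    artifact_exists = model_path.exists()
```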

# Automate the model lifecycle with Prefect

#### Manual trigger
At the moment we still need to log into our VM (or work locally) and manually run `make run_train`.

#### Model Lifecycle
What we want is to break down the model's lifecycle. Let's say we want to train our model every month. To do that, we need to get fresh data and check our model's performance on the new data. Do we want to retrain? Do we want to send it to production?

#### get fresh data
First we need to get fresh data, preprocess it, and push it to our data warehouse.

#### evaluate model performance
Here we want to see how the existing model performs, so we pull the new data from the warehouse and run an evaluation of our model on it.

#### retrain model
We'll retrain the model on a VM, but you can do it locally as well.

#### mark model for production
From that, we compare the performance of the models and then maybe mark the new one for production.


# Model Lifecycle

# Goal of today's challenge
Implement an automated workflow to:
- Fetch fresh data
- Preprocess the fresh data
- Evaluate the performance of the Production model on fresh data
- Train a Staging model on the fresh data, in parallel to the model evaluation
- Compare the performance of the Production model vs Staging model
- Set a threshold for a model being good enough for production
- If neither meets the threshold, notify a human who will decide whether or not to deploy the Staging model to Production and what other fixes are needed!
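
The last three bullets boil down to a small decision rule. Here is one possible sketch, assuming the metric is MAE (so lower is better); the function name, the returned labels, and the threshold values are made up for this example.

```python
def decide(production_mae, staging_mae, threshold):
    """Return the next action given two MAE scores (lower is better)."""
    best = min(production_mae, staging_mae)
    if best > threshold:
        # Neither model is good enough: a human has to look at it.
        return "notify_human"
    if staging_mae < production_mae:
        return "promote_staging"
    return "keep_production"


print(decide(production_mae=3.0, staging_mae=2.5, threshold=2.8))  # promote_staging
```

Keeping the rule in a pure function like this makes it easy to test on its own before plugging it into the automated workflow.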

**We want to decompose our model's lifecycle into tasks that fit into a directed acyclic graph.**

# Directed Acyclic Graph (DAG)

A Directed Acyclic Graph (DAG) is a graph data structure that consists of vertices (nodes) connected by directed edges (arrows), where edges have a direction and there are no cycles.

1. **Directed Edges**:
   - Each edge in a DAG has a direction, indicating a one-way relationship between nodes. For example, if there is an edge from node A to node B, it implies that there is a relationship or dependency from A to B.

2. **Acyclic Property**:
   - The term "acyclic" means that the graph does not contain any cycles. A cycle is a path in the graph that starts and ends at the same node, traversing through one or more edges. In other words, you cannot follow the edges of a DAG and return to the starting node via a directed path.

3. **Vertices (Nodes)**:
   - Nodes represent entities or elements within the graph. These can represent various entities depending on the application. For example, in a workflow DAG, nodes might represent tasks or processes.

4. **Directed Paths**:
   - A directed path is a sequence of vertices connected by directed edges, where each edge leads from one vertex to the next. In a DAG, directed paths always lead in one direction and never form a closed loop.

DAGs have numerous applications such as:
- Workflow and task scheduling: representing dependencies between tasks or processes that need to be executed in a specific order.
- Dependency management: managing dependencies between software components or modules.
- Data pipelines: representing data flows and transformations in ETL (Extract, Transform, Load) processes.
- Version control systems: tracking changes and dependencies in projects.
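
For the task-scheduling case, Python's standard library (3.9+) ships a small DAG helper, `graphlib.TopologicalSorter`. The sketch below uses it to order a set of lifecycle tasks and to detect a cycle; the task names are illustrative and not taken from the challenge code.

```python
from graphlib import CycleError, TopologicalSorter

# Each key depends on the tasks in its set. Task names are illustrative.
lifecycle = {
    "preprocess": {"fetch_data"},
    "evaluate_production": {"preprocess"},
    "train_staging": {"preprocess"},
    "compare": {"evaluate_production", "train_staging"},
}

# A valid execution order: every task comes after the tasks it depends on.
order = list(TopologicalSorter(lifecycle).static_order())


def has_cycle(graph):
    """Return True if the dependency graph contains a cycle (so it is not a DAG)."""
    try:
        TopologicalSorter(graph).prepare()
        return False
    except CycleError:
        return True


print(order[0], "->", order[-1])  # fetch_data -> compare
```

Note that `evaluate_production` and `train_staging` do not depend on each other, so their relative order is not fixed: that is exactly the freedom a workflow engine uses to run them in parallel.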

# Livecode

So now we are going to decompose the model lifecycle into tasks that fit into a DAG.

# Workflow

- So basically we are going to have a task A, followed by tasks B and C, which happen at the same time
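
Before bringing in a workflow tool, the same shape can be sketched with the standard library. This is only an illustration of "A, then B and C at the same time"; the three task functions are trivial placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def task_a():
    """Placeholder for the upstream task that produces the data."""
    return [1, 2, 3, 4]


def task_b(data):
    """Placeholder for the first downstream task."""
    return sum(data)


def task_c(data):
    """Placeholder for the second downstream task."""
    return max(data)


def run_workflow():
    data = task_a()  # A must finish first
    with ThreadPoolExecutor(max_workers=2) as pool:
        # B and C only depend on A, so they can be submitted together
        future_b = pool.submit(task_b, data)
        future_c = pool.submit(task_c, data)
        return future_b.result(), future_c.result()


print(run_workflow())  # (10, 4)
```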

# theory

# Production workflow

So everything has been put into a single production workflow now and all we have to do is work on sending that somewhere else.

# How to automate our workflow?

1. **Install Prefect**:
   - First, ensure that you have Prefect installed in your Python environment:
     ```
     pip install prefect
     ```

2. **Create Workflow and Tasks**:
   - Use the Prefect Python package to define your workflow and individual tasks. Prefect provides a flexible and intuitive API for creating workflows using Python code.
   - Define your tasks as functions or classes, and specify dependencies between tasks to create the workflow graph.

3. **Set Up Prefect Server**:
   - Install and configure Prefect Server, which is responsible for storing workflow execution parameters and results in a database.
   - Prefect Server can be hosted on a local machine or in the cloud, depending on your requirements.

4. **Use Prefect UI / Prefect CLI**:
   - Prefect provides a web interface (Prefect UI) and a command-line interface (Prefect CLI) for interacting with and managing workflows.
   - Use the Prefect UI to parametrize, visualize, and monitor workflow execution. You can define parameters, schedule workflows, and view execution logs and results.
   - Alternatively, use the Prefect CLI for scripting and automating workflow management tasks from the command line.

5. **Run the Workflow**:
   - Once your workflow is defined and configured, you can execute it by running the Python script that contains the workflow definition.
   - You can also trigger workflow execution programmatically using Prefect's API or schedule it to run at specific intervals using Prefect Server or an external scheduling tool.

6. **Monitor and Debug**:
   - Monitor workflow execution and track task statuses, inputs, outputs, and execution times using Prefect UI or CLI.
   - Debug any issues that arise during workflow execution by inspecting logs and outputs, and modify the workflow as needed to address them.
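
Step 2 above might look roughly like this, assuming the decorator-based API of Prefect 2 and later (`flow`, `task`, `.submit()`). The step functions return hard-coded placeholder values and the flow name is made up; this is a sketch of the structure, not the content of `workflow.py`. Prefect is imported inside `build_flow` so the plain functions stay usable without it.

```python
def preprocess(row_count):
    """Placeholder for the real preprocessing step."""
    return {"rows": row_count}


def evaluate(data):
    """Placeholder evaluation of the Production model (hard-coded score)."""
    return {"mae": 3.0, "rows": data["rows"]}


def train(data):
    """Placeholder training of the Staging model (hard-coded score)."""
    return {"mae": 2.5, "rows": data["rows"]}


def build_flow():
    """Wrap the plain functions as Prefect tasks (requires `pip install prefect`)."""
    from prefect import flow, task

    preprocess_task = task(preprocess)
    evaluate_task = task(evaluate)
    train_task = task(train)

    @flow(name="example_train_flow")
    def train_flow(row_count: int = 1000):
        data = preprocess_task(row_count)
        # submit() hands both tasks to the task runner so they can run concurrently
        evaluation = evaluate_task.submit(data)
        training = train_task.submit(data)
        return evaluation.result(), training.result()

    return train_flow


# With Prefect installed: build_flow()(row_count=1000)
```

Keeping the business logic in plain functions and adding the Prefect wrapping in one place is a design choice that keeps each step testable without a running Prefect server.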

# Livecode 🚧
 

🎯 Automate the model workflow

GO TO ROADMAP -> workflow.py