# Automate Model Lifecycle

Automating the model lifecycle refers to streamlining and automating the stages involved in developing, deploying, and maintaining machine learning models. Some of these stages are:
- data preprocessing
- model training
- evaluation
- deployment
- monitoring
- retraining

Automating these processes can save time, improve efficiency, and ensure consistency in model development.

# Lecture plan

# Reminders

#### Google Cloud Platform

#### Cloud Storage

In BigQuery, columnar storage and partitioning are two important features that can significantly improve query performance and cost efficiency when working with large datasets.

So what are these things?

`Columnar Storage`:
BigQuery uses a columnar storage format called Capacitor, which organizes data by column rather than by row. Because the values of each column are stored together, similar values sit next to each other and compress more efficiently. This format also lets BigQuery scan and process only the columns a query needs, minimizing the amount of data read from storage.

`Partitioning`:
Partitioning divides a table into smaller, more manageable segments based on the values of a specific column. BigQuery supports partitioning tables by ingestion time (ingestion-time partitioning) or by a date/timestamp column. A query that filters on the partitioning column only has to read the matching partitions.
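To make the two ideas concrete, here is a toy sketch in plain Python. It is not how BigQuery is implemented: it only shows why storing values per column and grouping rows per date means a query touches less data. The table contents and the column names (`pickup_date`, `fare`, `passengers`) are made up for the example.

```python
from collections import defaultdict
from datetime import date

# Toy rows; the column names are hypothetical.
rows = [
    {"pickup_date": date(2015, 1, 1), "fare": 7.5, "passengers": 1},
    {"pickup_date": date(2015, 1, 1), "fare": 12.0, "passengers": 2},
    {"pickup_date": date(2015, 1, 2), "fare": 5.0, "passengers": 1},
    {"pickup_date": date(2015, 1, 3), "fare": 20.0, "passengers": 4},
]


def to_columnar(records):
    """Turn a list of row dicts into one list of values per column."""
    columns = defaultdict(list)
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return dict(columns)


def partition_by(records, key):
    """Group rows by the value of `key`, then store each group column-wise."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return {value: to_columnar(group) for value, group in groups.items()}


def total_fare(partitions, day):
    """Sum `fare` for one day, reading one partition and one column only."""
    partition = partitions.get(day, {"fare": []})
    return sum(partition["fare"])


partitions = partition_by(rows, "pickup_date")
print(total_fare(partitions, date(2015, 1, 1)))
```

The query for January 1st never looks at the other days (partitioning) and never looks at `passengers` (columnar storage), which is the intuition behind the performance and cost savings.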

#### Compute Engine

#### direnv

direnv is a command-line tool designed to manage environment variables on a per-directory basis. It automates the loading and unloading of environment variables based on the current working directory. CURIOSITY: The name "direnv" is short for "directory environment."
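From the Python side, variables loaded by direnv are ordinary environment variables. The sketch below shows one way to read them with defaults. The variable names `MODEL_TARGET` and `DATA_SIZE` are hypothetical and only serve as an illustration.

```python
import os


def load_settings(environ=os.environ):
    """Read project settings from environment variables.

    MODEL_TARGET and DATA_SIZE are hypothetical names used for illustration.
    """
    return {
        "model_target": environ.get("MODEL_TARGET", "local"),
        "data_size": environ.get("DATA_SIZE", "1k"),
    }


# With direnv, these values would come from the project's .envrc file.
# Here a plain dict stands in for the environment so the example is repeatable.
settings = load_settings({"MODEL_TARGET": "gcs", "DATA_SIZE": "200k"})
print(settings)
```

Calling `load_settings()` with no argument reads the real environment, so the same code picks up whatever direnv loaded for the current directory.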

# Objective

### What is our progress?

# What's next

As things stand, someone has to log into your virtual machine, manually re-preprocess the data to bring it up to date, and manually run make run train. In the long run that is an unrealistic prospect. With one model it may be OK, someone can do that, but every time you add a new model that you need to maintain, it becomes cumbersome.

That is why we need to **create a robust model lifecycle**, so that we can:
- Ensure the reproducibility of the training in the future
- Track the performance of the model over time
- Serve multiple versions of the model
- Automate the model lifecycle

# Robust Lifecycle

# Experiment tracking with MLflow

MLflow is an open-source platform that enables machine learning experiment tracking, reproducibility, and model management. What it is going to do for us is help track our experiments.

Yesterday we had this workflow: our data warehouse passed preprocessed data to our virtual machine, which trained the model and then passed it on to Google Cloud Storage.

Now we want to track the experiment, and that comes with a few things:
- data version: the data used for the training. In our case we are going to look at the datetime or timestamp and the number of rows, which over the past couple of days has been 1K, 200K or all.
- experiment parameters:
  1. code version
  2. code parameters (learning rate, epochs, etc.)
  3. training environment: Python and package versions
  4. preprocessing type: what we used to preprocess the data
  5. the model hyperparameters
- experiment metrics:
  1. training metrics: your loss, your MAE, your accuracy and so on
- model version: the actual model itself, for example V1 trained on May 25, V2 trained on May 30, and so on.

As you can see, there are quite a lot of things that we want to keep track of, and if we were doing it just by ourselves in Python it would involve writing tons and tons of code.
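To get a feel for it, here is a minimal hand-rolled sketch that records a single run as a JSON file. The class, the function, the field names, and the example values are all hypothetical. It covers only the simplest case and says nothing about comparing runs or serving model versions.

```python
import json
import platform
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class ExperimentRecord:
    """Everything we want to remember about one training run."""

    data_timestamp: str
    data_rows: str
    code_version: str
    params: dict
    metrics: dict
    model_version: str
    # Training environment and logging time are filled in automatically.
    python_version: str = field(default_factory=platform.python_version)
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def save_record(record, directory):
    """Write the record to <directory>/<model_version>.json and return the path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{record.model_version}.json"
    path.write_text(json.dumps(asdict(record), indent=2))
    return path


# Made-up values, for illustration only.
record = ExperimentRecord(
    data_timestamp="2015-01-01",
    data_rows="200k",
    code_version="abc1234",
    params={"learning_rate": 0.001, "epochs": 10, "preprocessing": "v1"},
    metrics={"loss": 4.2, "mae": 2.1},
    model_version="V1",
)
saved_path = save_record(record, tempfile.mkdtemp())
print(saved_path.read_text())
```

Even this small version needs decisions about file layout, naming, and environment capture. In MLflow, the same bookkeeping is handled by calls such as `mlflow.log_params` and `mlflow.log_metrics`.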

# How to track our experiments?

So instead of having to write all that code, what we are going to have is an MLflow server. And this server will:

-> GO TO ROADMAP

-> BACK ON SLIDE: THEORY

# CHART

# MLflow Server Architecture

So this is how it is structured behind the scenes. There is a remote host that has MLflow running on it, and it is connected to a SQL database that stores the metrics and parameters. But a SQL database is not a proper place to store a model, so behind the scenes it uses a bucket to store the models.
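The split between the two stores can be sketched with the standard library alone. This is a toy stand-in, not MLflow's code: an in-memory SQLite database plays the SQL database, a temporary directory plays the bucket, and the class and method names are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


class ToyTrackingServer:
    """Stand-in for the two stores: SQL for numbers, a 'bucket' for files."""

    def __init__(self, bucket_dir):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE params (run_id TEXT, key TEXT, value TEXT)")
        self.db.execute("CREATE TABLE metrics (run_id TEXT, key TEXT, value REAL)")
        self.bucket = Path(bucket_dir)

    def log_param(self, run_id, key, value):
        self.db.execute(
            "INSERT INTO params VALUES (?, ?, ?)", (run_id, key, str(value))
        )

    def log_metric(self, run_id, key, value):
        self.db.execute(
            "INSERT INTO metrics VALUES (?, ?, ?)", (run_id, key, float(value))
        )

    def log_model(self, run_id, model_bytes):
        """Large binary artifacts go to the bucket, not to the database."""
        run_dir = self.bucket / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        path = run_dir / "model.bin"
        path.write_bytes(model_bytes)
        return path

    def best_run(self, metric):
        """Return the run_id with the lowest value for `metric`, or None."""
        row = self.db.execute(
            "SELECT run_id FROM metrics WHERE key = ? ORDER BY value ASC LIMIT 1",
            (metric,),
        ).fetchone()
        return row[0] if row else None


server = ToyTrackingServer(tempfile.mkdtemp())
server.log_param("run-1", "learning_rate", 0.001)
server.log_metric("run-1", "mae", 2.4)
server.log_metric("run-2", "mae", 2.1)
model_path = server.log_model("run-2", b"fake model weights")
print(server.best_run("mae"), model_path.name)
```

Small, queryable values (parameters, metrics) fit naturally in SQL tables, while the model itself is an opaque file that is easier to keep in object storage and reference by run.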

# Automate the model lifecycle with Prefect

#### Manual trigger
At the moment we still need to log in to our VM, or work locally, and manually run make run train.

#### Model Lifecycle
What we want instead is to break down the model's lifecycle. Based on the lecture today and the dates that we were using, I am presuming we want to train our model every month. To do that, we need to get fresh data and check our model's performance on the new data. Do we want to retrain? Do we want to send it to production?

#### get fresh data
First we need to get fresh data, preprocess it, and push it to our data warehouse.

#### evaluate model performance
Here we want to look at the past performance of the model. To do that, we pull our new data from the warehouse and run an evaluation of our model on the new data.

#### retrain model
Then we retrain the model. This should happen on a VM, but it can happen locally as well.

#### mark model for production
Finally, we compare the performance of the retrained model with that of the previous one and then maybe mark it for production.
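The four steps can be sketched as plain Python functions chained by one orchestrating function. Everything here is a made-up stand-in: the data is synthetic, the "model" is a two-number dictionary, and the function names are hypothetical. The point is only the control flow: fresh data, evaluate, retrain, compare, decide.

```python
def get_fresh_data(month):
    """Pretend to fetch and preprocess one month of data.

    Returns hypothetical (feature, target) pairs instead of warehouse rows.
    """
    return [(x, 2.0 * x + month) for x in range(1, 6)]


def evaluate(model, data):
    """Mean absolute error of a toy linear model y = slope * x + bias."""
    errors = [abs(model["slope"] * x + model["bias"] - y) for x, y in data]
    return sum(errors) / len(errors)


def retrain(data):
    """Fit only the bias of the toy model; the slope is fixed for simplicity."""
    bias = sum(y - 2.0 * x for x, y in data) / len(data)
    return {"slope": 2.0, "bias": bias}


def run_lifecycle(production_model, month):
    """Run one cycle and return (model to serve, decision)."""
    data = get_fresh_data(month)
    old_mae = evaluate(production_model, data)
    candidate = retrain(data)
    # Evaluated on the training data only to keep the toy short;
    # a real pipeline would hold out separate evaluation data.
    new_mae = evaluate(candidate, data)
    if new_mae < old_mae:
        return candidate, "promoted"
    return production_model, "kept"


production = {"slope": 2.0, "bias": 0.0}
production, decision = run_lifecycle(production, month=3)
print(decision, production)
```

With Prefect, a common pattern is to decorate each step with `@task` and the orchestrating function with `@flow`, so that the same structure can be scheduled and monitored instead of being triggered by hand.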
