### 2.1 Video

ML experiment: The process of building an ML model
Experiment run: Each trial in an ML experiment
Run Artifact: Any file that is associated with an ML run
Experiment metadata

What is experiment tracking?
Experiment tracking is the process of keeping track of all the relevant information from an ML experiment, which includes:
- Source code
- Environment
- Data
- Model
- Hyperparameters
- Metrics
...

Why is experiment tracking so important?
In general, because of these 3 main reasons:
- Reproducibility
- Organization
- Optimization

Tracking experiments in spreadsheets:
Why is it not enough?
- Error Prone
- No standard format
- Visibility & Collaboration

MLflow
Definition: "An open source platform for the machine learning lifecycle"

In practice, it's just a Python package that can be installed with pip, and it contains four main modules:
- Tracking
- Models
- Model Registry
- Projects

Tracking experiments with MLflow
The MLflow Tracking module allows you to organize your experiments into runs, and to keep track of:
- Parameters: Hyperparameters, plus any other parameter that you think may affect the model's metrics. Example: the path to the training dataset, so that a later change of dataset is recorded in the run and can be tracked. The pre-processing techniques used are another example.
- Metrics: Any evaluation metric
- Metadata: e.g. tags
- Artifacts: Any file, such as the trained model, visualizations, or even datasets (although logging datasets as artifacts doesn't scale very well).
- Models: Save the model

Along with this information, MLflow automatically logs extra information about the run:
- Source code
- Version of the code (git commit)
- Start and end time
- Author

mlflow demo
To launch the mlflow ui: `mlflow ui`

### 2.2 Video

Check out the requirements.txt

Launch the MLflow UI with a SQLite backend store: `mlflow ui --backend-store-uri sqlite:///mlflow.db`

Create environment: `conda create -p venv python=3.9 -y`
Activate: `conda activate venv/`
Install requirements.txt: `pip install -r requirements.txt`

```
.
├── duration_prediction.ipynb
├── models
├── requirements.txt
└── venv
```

Check out the `duration_prediction.ipynb` file

### 2.3 Video



### 2.4 Video

Model Management:
Problems with managing models manually, e.g. with directory names such as final_model, model_final_final, ...:
- Error prone
- No versioning
- No model lineage


1. Log the model as artifact:
```
mlflow.log_artifact(local_path="models/lin_reg.bin", artifact_path="models_pickle")
```

2. Log model using the method `log_model`:
```
mlflow.xgboost.log_model(model_as_ip, artifact_path="models_mlflow")
# General pattern for other frameworks:
# mlflow.<framework>.log_model(model_as_ip, artifact_path="models_mlflow")
```

Now let's save the pre-processing step as well
```
with open("models/preprocessor.b", "wb") as handle:
    pickle.dump(dv, handle) # dv = dictvectorizer

mlflow.log_artifact("models/preprocessor.b", artifact_path="preprocessor")
```

### 2.5 Video

Model Registry
All models that are ready for production should be stored in the model registry.
It helps communication between the person building the model and the person in charge of deploying it.

The model registry has multiple stages:
- Staging
- Production
- Archived

The data scientist only decides which models are ready for production. Once a model is registered in the model registry, the deployment engineer can inspect the parameters that were used, the size of the model, and its performance, and based on that decide to move the model between the different stages.

The model registry does not deploy any model; it is only a place to list the models that are production-ready, and the stages are just labels. Complement the model registry with a CI/CD pipeline for deployment.
```
from mlflow.tracking import MlflowClient

MLFLOW_TRACKING_URI = "sqlite:///mlflow.db"
client = MlflowClient(tracking_uri=MLFLOW_TRACKING_URI)

# On older MLflow 1.x releases this was client.list_experiments() (removed in MLflow 2.0)
client.search_experiments()

client.create_experiment(name = "my-cool-experiment")

from mlflow.entities import ViewType

runs = client.search_runs(
    experiment_ids='1',
    filter_string='',
    run_view_type=ViewType.ACTIVE_ONLY,
    max_results=5,
    order_by=["metrics.rmse ASC"]
)
```
**Promote some of the models to model registry**

```
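import mlflow

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

# run_id is the ID of the run whose model is being registered (e.g. copied from the MLflow UI)
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri=model_uri, name="nyc-taxi-regressor")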

# On older MLflow 1.x releases this was client.list_registered_models() (removed in MLflow 2.0)
client.search_registered_models()

model_name = "nyc-taxi-regressor"
latest_versions = client.get_latest_versions(name=model_name)

model_version = 4
new_stage = "Staging"
client.transition_model_version_stage(
    name=model_name,
    version=model_version,
    stage=new_stage,
    archive_existing_versions=False
)

from datetime import datetime
date = datetime.today().date()
client.update_model_version(
    name=model_name,
    version=model_version,
    description = f"The model version {model_version} was transitioned to {new_stage} on {date}" 
)

from sklearn.metrics import mean_squared_error
import pandas as pd


def read_dataframe(filename):
    df = pd.read_csv(filename)

    df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)
    df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)

    df['duration'] = df.lpep_dropoff_datetime - df.lpep_pickup_datetime
    df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)

    df = df[(df.duration >= 1) & (df.duration <= 60)]

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)
    
    return df


def preprocess(df, dv):
    df['PU_DO'] = df['PULocationID'] + '_' + df['DOLocationID']
    categorical = ['PU_DO']
    numerical = ['trip_distance']
    train_dicts = df[categorical + numerical].to_dict(orient='records')
    return dv.transform(train_dicts)


def test_model(name, stage, X_test, y_test):
    model = mlflow.pyfunc.load_model(f"models:/{name}/{stage}")
    y_pred = model.predict(X_test)
    # On scikit-learn >= 1.4, prefer sklearn.metrics.root_mean_squared_error(y_test, y_pred)
    return {"rmse": mean_squared_error(y_test, y_pred, squared=False)}

df = read_dataframe("data/green_tripdata_2021-03.csv")

client.download_artifacts(run_id=run_id, path='preprocessor', dst_path='.')

import pickle

with open("preprocessor/preprocessor.b", "rb") as f_in:
    dv = pickle.load(f_in)

X_test = preprocess(df, dv)

target = "duration"
y_test = df[target].values

%time test_model(name=model_name, stage="Production", X_test=X_test, y_test=y_test)

%time test_model(name=model_name, stage="Staging", X_test=X_test, y_test=y_test)

client.transition_model_version_stage(
    name=model_name,
    version=4,
    stage="Production",
    archive_existing_versions=True
)
```
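
The metric used to compare the Staging and Production models is RMSE, the square root of the mean squared error. As a reference for what `test_model` returns, here is a plain-Python sketch of the formula; the `rmse` helper and the sample values are illustrative and not part of the course code:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up trip durations in minutes: actual vs. predicted.
score = rmse([10.0, 20.0, 30.0], [12.0, 20.0, 26.0])
print(f"rmse: {score:.4f}")
```

A lower value means predictions are closer to the actual durations, in the same unit as the target (minutes here).
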
Model management in MLflow
The model registry component is a centralized model store, set of APIs, and a UI, to collaboratively manage the full lifecycle of an MLflow Model.

It provides:
- Model lineage
- Model versioning
- Stage transitions
- Annotations

### 2.6 Video

MLflow in Practice
Let's consider these three scenarios:

- A single data scientist participating in an ML competition
    - A remote tracking server would be overkill
- A cross-functional team with one data scientist working on an ML model
    - Sharing information is required, but a remote tracking server is not; running it locally is enough.
    - A model registry could be a good idea to manage the lifecycle of the models, but it is not clear whether to run it locally or remotely.
- Multiple data scientists working on multiple ML models
    - Sharing the information is very important.
    - Remote tracking is important.
    - Model registry is also important.

Configuring MLflow
- Backend Store
    - local filesystem
    - SQLAlchemy compatible DB (eg. SQLite)
- Artifacts Store
    - local filesystem
    - remote (eg. S3 bucket)
- Tracking Server
    - No tracking server
    - localhost
    - remote
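
The three choices above end up as arguments to `mlflow server`. The sketch below only builds the command line, so it can be inspected without starting anything; `mlflow_server_command` is a hypothetical helper, and the S3 bucket name is a placeholder:

```python
import shlex
from typing import List, Optional


def mlflow_server_command(
    backend_store_uri: str,
    default_artifact_root: Optional[str] = None,
    host: str = "127.0.0.1",
    port: int = 5000,
) -> List[str]:
    """Build the argument list for `mlflow server` (hypothetical helper)."""
    cmd = ["mlflow", "server", "--backend-store-uri", backend_store_uri]
    if default_artifact_root is not None:
        cmd += ["--default-artifact-root", default_artifact_root]
    cmd += ["--host", host, "--port", str(port)]
    return cmd


# Local tracking server: SQLite backend store, artifacts on the local filesystem.
local_cmd = mlflow_server_command("sqlite:///backend.db", "./artifacts")
print(shlex.join(local_cmd))

# Same backend, but artifacts in a remote store (placeholder bucket name).
remote_cmd = mlflow_server_command("sqlite:///backend.db", "s3://example-bucket/artifacts")
print(shlex.join(remote_cmd))

# Clients would then point at the server, e.g. mlflow.set_tracking_uri(tracking_uri).
tracking_uri = "http://127.0.0.1:5000"
```

With no tracking server at all, the client instead talks to the backend store directly, as in `mlflow.set_tracking_uri("sqlite:///mlflow.db")`.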

Check out the directory at: https://github.com/DataTalksClub/mlops-zoomcamp/tree/main/02-experiment-tracking/running-mlflow-examples

Remote tracking server:
The tracking server can be easily deployed to the cloud.
Some benefits:
- Share experiments with other data scientists.
- Collaborate with others to build and deploy models
- Give more visibility of the data science efforts.

Issues with running a remote (shared) MLflow server

- Security
    - Restrict access to the server (eg. access through VPN)
- Scalability
    - Check "Deploy MLflow on AWS Fargate"
    - Check "MLflow at Company Scale" by Jean-Denis Lesage
- Isolation
    - Define standards for naming experiments and models, and a set of default tags
    - Restrict access to artifacts (e.g. use S3 buckets living in different AWS accounts)

MLflow limitations (and when not to use it)
- Authentication & Users: The open source version of MLflow doesn't provide any sort of authentication
- Data versioning: To ensure full reproducibility we need to version the data used to train the model. MLflow doesn't provide a built-in solution for that, but there are a few ways to deal with this limitation (e.g. log the data path as a param).
- Model/Data Monitoring & Alerting: This is outside the scope of MLflow, and currently there are more suitable tools for doing this.
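
One lightweight workaround for the data versioning gap is to log a content hash next to the data path, so that a silently changed file shows up as a different param value. This is an assumption on my part rather than something MLflow or the course prescribes; `dataset_fingerprint` and the param names are hypothetical:

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(path, chunk_size=1 << 20):
    """Return the data path and its SHA-256 digest as a dict of params."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "train_data_path": str(path),
        "train_data_sha256": digest.hexdigest(),
    }


# Demo on a tiny stand-in file; a real run would point at the training data.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"abc")
    params = dataset_fingerprint(sample)

print(params["train_data_sha256"])
```

Inside a run, the resulting dict could be passed to `mlflow.log_params(params)`.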

Alternatives:
- Neptune
- Comet
- Weights & Biases

### Homework 2

1. 2.3.2
2. 154 kb
3. max_depth = 10
4. mlflow server --backend-store-uri sqlite:///backend.db --default-artifact-root ./artifacts
    RMSE: 2.45
5. 2.185 (I got 2.285)
6. Model version, Source Experiment, Model Signature -> All of the above