## MLOps Zoomcamp - Week 2
- [Questions](https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/cohorts/2023/02-experiment-tracking/homework.md)

### Q1. MLflow version

In [1]:
!mlflow --version

mlflow, version 2.3.2
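
The same check can be done from Python instead of the shell. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration, and it returns `None` when the package is not installed rather than raising.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string in an environment where mlflow is installed.
print(installed_version("mlflow"))
```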


### Q2. Downloading the Data

In [5]:
data_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-"
jan, feb, mar = "01.parquet", "02.parquet", "03.parquet"
out_dir = "../data/hw2-data/green_tripdata_2022-"

for file in [jan, feb, mar]:
    !wget {data_url}{file} -O {out_dir}{file} -q
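
On a machine without `wget`, the same three files could be fetched with the standard library. The sketch below reuses the URL prefix and output directory from the cell above; the function names `build_jobs` and `download_all` are hypothetical, and the download call is left commented out so the block does not hit the network when run.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-"
MONTHS = ["01", "02", "03"]


def build_jobs(out_dir: str) -> list[tuple[str, Path]]:
    """Pair each monthly parquet URL with its local destination path."""
    return [
        (
            f"{BASE_URL}{month}.parquet",
            Path(out_dir) / f"green_tripdata_2022-{month}.parquet",
        )
        for month in MONTHS
    ]


def download_all(out_dir: str) -> None:
    """Download every file, creating the output directory if needed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for url, dest in build_jobs(out_dir):
        urlretrieve(url, str(dest))  # overwrites dest if it already exists


# download_all("../data/hw2-data")  # uncomment to fetch the three files
```

Unlike the `wget` loop, this version creates the output directory first, so it does not fail when `../data/hw2-data` is missing.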


In [6]:
!python hw-utils/preprocess_data.py --raw_data_path ../data/hw2-data/ --dest_path ./output

In [8]:
!ls -lah ./output

total 7.0M
drwxr-xr-x 1 uditm 197609    0 May 31 13:47 .
drwxr-xr-x 1 uditm 197609    0 May 31 13:47 ..
-rw-r--r-- 1 uditm 197609 151K May 31 13:47 dv.pkl
-rw-r--r-- 1 uditm 197609 2.6M May 31 13:47 test.pkl
-rw-r--r-- 1 uditm 197609 2.1M May 31 13:47 train.pkl
-rw-r--r-- 1 uditm 197609 2.3M May 31 13:47 val.pkl
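
`train.py` reads these files with a `load_pickle` helper and unpacks each split into a features/target pair. The helper's body is not shown in this notebook, so the version below is an assumption about what it probably looks like, exercised on a made-up toy pair rather than the real data.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(filename):
    """Read one pickled object back from disk (assumed helper body)."""
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


# Round-trip a toy (X, y) pair through a temporary file; the values are made up.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.pkl"
    with open(path, "wb") as f_out:
        pickle.dump(([[1.0, 2.0], [3.0, 4.0]], [10.0, 20.0]), f_out)
    X, y = load_pickle(path)

print(len(X), len(y))
```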


### Q3. Train model with autolog

In [18]:
# check updated script
# !cat hw-utils/train.py
!cat -n hw-utils/train.py | sed '21,31!d'

    21	def run_train(data_path: str):
    22	    mlflow.sklearn.autolog()
    23	    X_train, y_train = load_pickle(os.path.join(data_path, "train.pkl"))
    24	    X_val, y_val = load_pickle(os.path.join(data_path, "val.pkl"))
    25	
    26	    rf = RandomForestRegressor(max_depth=10, random_state=0)
    27	    with mlflow.start_run():
    28	        rf.fit(X_train, y_train)
    29	    y_pred = rf.predict(X_val)
    30	
    31	    rmse = mean_squared_error(y_val, y_pred, squared=False)
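
For reference, `mean_squared_error(..., squared=False)` returns the root mean squared error. The plain-Python sketch below shows the arithmetic on made-up numbers; it is an illustration of the formula, not scikit-learn's implementation.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors are 3 and 4, so the mean squared error is 12.5.
print(rmse([2.0, 4.0], [5.0, 8.0]))
```

Note that the `squared` argument was deprecated in scikit-learn 1.4 in favour of `root_mean_squared_error`, so the script as listed may need adjusting on newer installs.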


In [19]:
!python hw-utils/train.py 



In [20]:
# Run in a terminal:
#   mlflow ui
# Open the run in the UI and check max_depth under Parameters.

# Answer: max_depth = 10

### Q4. Tune model hyperparameters

In [21]:
# check updated script
!cat -n hw-utils/hpo.py | sed '45,51!d'

    45	        with mlflow.start_run():
    46	            mlflow.log_params(params)
    47	            rf = RandomForestRegressor(**params)
    48	            rf.fit(X_train, y_train)
    49	            y_pred = rf.predict(X_val)
    50	            rmse = mean_squared_error(y_val, y_pred, squared=False)
    51	            mlflow.log_metric("RMSE", rmse)


In [22]:
!python hw-utils/hpo.py

2023/05/31 14:47:34 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-hyperopt' does not exist. Creating a new experiment.
[I 2023-05-31 14:47:34,364] A new study created in memory with name: no-name-576b7200-b8c0-4110-9e52-e67b6530d4e9
[I 2023-05-31 14:47:36,474] Trial 0 finished with value: 2.451379690825458 and parameters: {'n_estimators': 25, 'max_depth': 20, 'min_samples_split': 8, 'min_samples_leaf': 3}. Best is trial 0 with value: 2.451379690825458.
[I 2023-05-31 14:47:36,759] Trial 1 finished with value: 2.4667366020368333 and parameters: {'n_estimators': 16, 'max_depth': 4, 'min_samples_split': 2, 'min_samples_leaf': 4}. Best is trial 0 with value: 2.451379690825458.
[I 2023-05-31 14:47:38,635] Trial 2 finished with value: 2.449827329704216 and parameters: {'n_estimators': 34, 'max_depth': 15, 'min_samples_split': 2, 'min_samples_leaf': 4}. Best is trial 2 with value: 2.449827329704216.
[I 2023-05-31 14:47:39,339] Trial 3 finished with value: 2.460983516558473 a

Best validation `RMSE = 2.45`
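
Rather than reading the best value off the log by eye, the trial lines can be parsed. This is a rough sketch that relies on the Optuna log format shown above; it uses only the three complete trial lines from that output (parameter dicts trimmed), and `best_trial` is a hypothetical helper name.

```python
import re

# The three complete trial lines from the output above, parameters trimmed.
LOG_LINES = [
    "[I 2023-05-31 14:47:36,474] Trial 0 finished with value: 2.451379690825458 and parameters: {...}",
    "[I 2023-05-31 14:47:36,759] Trial 1 finished with value: 2.4667366020368333 and parameters: {...}",
    "[I 2023-05-31 14:47:38,635] Trial 2 finished with value: 2.449827329704216 and parameters: {...}",
]

TRIAL_RE = re.compile(r"Trial (\d+) finished with value: (\d+\.\d+)")


def best_trial(lines):
    """Return (trial_number, value) for the lowest objective value found."""
    results = []
    for line in lines:
        match = TRIAL_RE.search(line)
        if match:
            results.append((int(match.group(1)), float(match.group(2))))
    if not results:
        raise ValueError("no finished trials found")
    return min(results, key=lambda item: item[1])


number, value = best_trial(LOG_LINES)
print(number, round(value, 2))  # prints: 2 2.45
```

Among these three trials the lowest value belongs to trial 2, which rounds to the 2.45 quoted above. Later trials are cut off in the captured output, so this does not confirm the overall minimum.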

### Q5. Promote the best model to the model registry

In [24]:
# check updated script
!cat -n hw-utils/register_model.py | sed '72,82!d'

    72	    experiment = client.get_experiment_by_name(EXPERIMENT_NAME)
    73	    best_run = client.search_runs(
    74	        experiment_ids=experiment.experiment_id,
    75	        run_view_type=ViewType.ACTIVE_ONLY,
    76	        order_by=["metrics.test_rmse ASC"]
    77	    )[0]
    78	
    79	    # Register the best model
    80	    run_id = best_run.info.run_id
    81	    model_uri = f"runs:/{run_id}/model"
    82	    mlflow.register_model(model_uri=model_uri, name=EXPERIMENT_NAME)


In [25]:
!python hw-utils/register_model.py

2023/05/31 16:40:51 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-best-models' does not exist. Creating a new experiment.
Successfully registered model 'random-forest-best-models'.
2023/05/31 16:41:13 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: random-forest-best-models, version 1
Created version '1' of model 'random-forest-best-models'.


In [33]:
# list best test RMSE
import mlflow
from mlflow.entities import ViewType

EXPERIMENT_NAME = "random-forest-best-models"

client = mlflow.tracking.MlflowClient()
experiment = client.get_experiment_by_name(EXPERIMENT_NAME)
best_run = client.search_runs(
    experiment_ids=experiment.experiment_id,
    run_view_type=ViewType.ACTIVE_ONLY,
    order_by=["metrics.test_rmse ASC"]
)[0]
# print(best_run)
print(best_run.data.metrics["test_rmse"])

2.2913864757421787
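
To make the selection logic explicit: sorting by `metrics.test_rmse ASC` and taking element `[0]` amounts to picking the run with the smallest `test_rmse`. The plain-Python sketch below mimics that on hypothetical run records (the run IDs and two of the three scores are invented); it is not MLflow code, and how MLflow itself orders runs that lack the metric is not covered here.

```python
# Hypothetical run records: the run IDs and two of the three scores are made up.
runs = [
    {"run_id": "run-a", "metrics": {"test_rmse": 2.31}},
    {"run_id": "run-b", "metrics": {"test_rmse": 2.2913864757421787}},
    {"run_id": "run-c", "metrics": {"test_rmse": 2.35}},
]


def best_by_metric(runs, key):
    """Sort ascending by metrics[key] and return the first run with that metric."""
    scored = [run for run in runs if key in run["metrics"]]
    if not scored:
        raise ValueError(f"no run logged the metric {key!r}")
    return sorted(scored, key=lambda run: run["metrics"][key])[0]


best = best_by_metric(runs, "test_rmse")
print(best["run_id"], best["metrics"]["test_rmse"])
```

The sketch raises when no run carries the requested key, which highlights why the metric name in `order_by` has to match the name that was actually logged.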


### Q6. Model metadata

![Model Registry UI](model-reg-ui.png)