# **Lab: Reproducible ML Pipeline with DVC and MLflow**

**Duration: 15 mins**

This lab demonstrates how to:

1. Use **DVC** to track datasets, models, and metrics.
2. Integrate **MLflow** for experiment tracking.
3. Reproduce pipelines and manage different versions of data and models.
4. Compare metrics across experiments.

## **Pre-Created Files**
The following files are pre-created and available in your folder:

1. **params.yaml**: Contains hyperparameters for the model.

2. **train.py**: A Python script to train the Random Forest model, evaluate it, and log metrics to MLflow.

3. **dvc.yaml**: Defines the pipeline structure, including dependencies and outputs.

You will modify and use these files to explore DVC and MLflow.

## **Step 1: Initialize DVC**

Set up DVC to version-control your datasets and pipeline.

In [None]:
# Initialize DVC
!dvc init

## **Explore DVC Components**

After initializing **DVC**, consider exploring the following components to understand how it manages and tracks your data:

### 1. `.dvc/` Directory
- This directory contains **DVC's internal files and configurations**.
- Reviewing its contents can provide insight into how **DVC manages data** and tracks changes.

---

### 2. `.dvcignore` File
- Similar to `.gitignore`, this file tells DVC which files or directories to **ignore**.
- Understanding its configuration can help you **manage tracked files** effectively.

---

### 3. Integration with Git
- If your project is under **Git version control**, DVC modifies the `.gitignore` file to prevent large data files from being tracked by Git.
- This ensures that:
  - **Metadata** is versioned in Git.
  - Actual **data is managed by DVC**.

---

By exploring these components, you’ll gain a better understanding of how DVC organizes and manages your project’s data, enabling **efficient and reproducible machine learning workflows**.
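As a quick way to look at these components from the notebook, the following is a minimal sketch that reports whether `.dvc/` and `.dvcignore` exist and lists the top-level entries of `.dvc/`. The helper name `describe_dvc_layout` is hypothetical, and the exact entries you see depend on your DVC version.

```python
from pathlib import Path


def describe_dvc_layout(root="."):
    """Summarize the DVC-related files found under a project root."""
    root = Path(root)
    dvc_dir = root / ".dvc"
    ignore_file = root / ".dvcignore"
    entries = sorted(p.name for p in dvc_dir.iterdir()) if dvc_dir.is_dir() else []
    return {
        "initialized": dvc_dir.is_dir(),       # True once `dvc init` has run
        "dvc_dir_entries": entries,            # top-level names inside .dvc/
        "has_dvcignore": ignore_file.is_file(),
    }


if __name__ == "__main__":
    for key, value in describe_dvc_layout().items():
        print(f"{key}: {value}")
```

Running it before and after `dvc init` shows which files the command created.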


## **Step 2: Understand Pre-Created Files**

### 1. **params.yaml**
This file contains hyperparameters for your machine learning model. You can modify it to tune your model.

```yaml
train:
  test_size: 0.2
  random_state: 42
  n_estimators: 100
  max_depth: 5
```

### 2. **train.py**
The provided training script performs the following tasks:
- Loads the dataset.
- Splits the data into training and testing sets.
- Trains a Random Forest model.
- Logs metrics and the model to the MLflow tracking server at `127.0.0.1:5001`.
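The tasks above could be sketched roughly as follows. This is not the provided `train.py`; it is an illustration under several assumptions: the target column is named `sales` (hypothetical), all feature columns are numeric, the model is a `RandomForestRegressor`, the metrics are MSE and R², and `metrics.txt` holds `name: value` lines. It relies on `pandas`, `PyYAML`, `scikit-learn`, and `mlflow` being installed.

```python
from pathlib import Path

TRACKING_URI = "http://127.0.0.1:5001"  # address named in the lab text


def format_metrics(metrics):
    """Render metrics as 'name: value' lines (assumed layout for metrics.txt)."""
    return "".join(f"{name}: {value:.4f}\n" for name, value in metrics.items())


def train(data_path="data/sales.csv", params_path="params.yaml", target="sales"):
    """Train, evaluate, and log a Random Forest; `target` is a hypothetical column."""
    # Third-party imports stay local so format_metrics works without them.
    import mlflow
    import mlflow.sklearn
    import pandas as pd
    import yaml
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    params = yaml.safe_load(Path(params_path).read_text())["train"]

    # Load the dataset and split it into training and testing sets.
    df = pd.read_csv(data_path)
    X, y = df.drop(columns=[target]), df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=params["test_size"], random_state=params["random_state"]
    )

    # Train the model with the hyperparameters from params.yaml.
    model = RandomForestRegressor(
        n_estimators=params["n_estimators"],
        max_depth=params["max_depth"],
        random_state=params["random_state"],
    )
    model.fit(X_train, y_train)

    # Evaluate on the held-out split.
    predictions = model.predict(X_test)
    metrics = {
        "mse": float(mean_squared_error(y_test, predictions)),
        "r2": float(r2_score(y_test, predictions)),
    }

    # Log parameters, metrics, and the model to MLflow.
    mlflow.set_tracking_uri(TRACKING_URI)
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        mlflow.sklearn.log_model(model, "model")

    # Write the file that dvc.yaml declares as the stage output.
    Path("metrics.txt").write_text(format_metrics(metrics))
    return metrics
```

Calling `train()` requires the MLflow server to be reachable at the address above; open the provided `train.py` to see what the lab actually does.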

### 3. **dvc.yaml**
Defines the pipeline, including:
- Dependencies: `train.py`, `params.yaml`, and `data/sales.csv`.
- Outputs: `metrics.txt`.

You will use this pipeline to track your ML workflow.

```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/sales.csv
      - params.yaml
    outs:
      - metrics.txt
    metrics:
      - metrics.txt
```
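The `deps` list matters because `dvc repro` re-runs a stage only when one of its dependencies has changed since the last recorded run. The sketch below illustrates that idea with plain file hashes. It is a conceptual model, not DVC's implementation: DVC keeps its own records in `dvc.lock`, and the `fingerprint` and `changed_deps` helpers are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """Return the MD5 hex digest of a file's bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def changed_deps(deps, recorded):
    """List dependencies whose current fingerprint differs from the recorded one.

    `recorded` maps a path to the fingerprint saved after the last run;
    a path missing from it counts as changed.
    """
    return [dep for dep in deps if recorded.get(str(dep)) != fingerprint(dep)]
```

An empty result corresponds to the case where the stage is up to date and can be skipped; any listed path is a reason to re-run it.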

## **Step 3: Add and Track the Dataset**

Track the dataset with DVC and commit the generated `.dvc` metadata file to Git.

In [None]:
# Add dataset to DVC
!dvc add data/sales.csv

!git add data/sales.csv.dvc data/.gitignore
!git commit -m "Initial dataset tracking with DVC"
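`data/.gitignore` is staged because `dvc add` is expected to write an entry for the dataset there, so that Git ignores the data file itself. A small sketch for checking this from Python follows; the helper name `gitignore_entries` is hypothetical.

```python
from pathlib import Path


def gitignore_entries(path="data/.gitignore"):
    """Return the non-comment, non-blank lines of a .gitignore file."""
    p = Path(path)
    if not p.is_file():
        return []
    lines = (line.strip() for line in p.read_text().splitlines())
    return [line for line in lines if line and not line.startswith("#")]


if __name__ == "__main__":
    # '/sales.csv' is the entry DVC is expected to add for data/sales.csv.
    print("/sales.csv" in gitignore_entries())
```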



## **Step 4: Run the Pipeline**

Execute the pipeline with `dvc repro`, then commit the pipeline definition, lock file, and metrics to Git.

In [None]:
# Run the pipeline
print("Running the pipeline")
!dvc repro


# Commit changes
print("Committing the files to Git for the first run")
!git add params.yaml dvc.yaml dvc.lock metrics.txt
!git commit -m "Run pipeline with first version of data and hyperparameters"

## **Step 5: Modify Hyperparameters in params.yaml**

Modify the hyperparameters to experiment with different model configurations.

In [None]:
# Modify params.yaml
yaml_content = """
train:
  test_size: 0.3
  random_state: 24
  n_estimators: 200
  max_depth: 10
"""

with open("params.yaml", "w") as f:
    f.write(yaml_content)

print("params.yaml updated successfully!")
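When you try several configurations, it can help to generate the file contents from a function that rejects obviously invalid values before anything is written. The sketch below is one way to do that; `render_params` is a hypothetical helper, and the checks assume `test_size` is a fraction, as it is in this lab.

```python
def render_params(test_size, random_state, n_estimators, max_depth):
    """Validate hyperparameters and return the text for params.yaml."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction between 0 and 1")
    for name, value in (("n_estimators", n_estimators), ("max_depth", max_depth)):
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer")
    return (
        "train:\n"
        f"  test_size: {test_size}\n"
        f"  random_state: {random_state}\n"
        f"  n_estimators: {n_estimators}\n"
        f"  max_depth: {max_depth}\n"
    )
```

For example, `render_params(0.3, 24, 200, 10)` returns the same four settings used in this step, ready to be written to `params.yaml`.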

## **Step 6: Change Dataset to sales_new.csv**

Overwrite `data/sales.csv` with the contents of `data/sales_new.csv`, re-add it to DVC, and re-run the pipeline on the new data.

In [None]:
# Replace dataset and update pipeline
import shutil
shutil.copy('data/sales_new.csv', 'data/sales.csv')

# Re-add the dataset to DVC
!dvc add data/sales.csv

# Re-run the pipeline
!dvc repro

# Commit the updated metadata
!git add data/sales.csv.dvc dvc.yaml dvc.lock metrics.txt
!git commit -m "Run pipeline with sales_new.csv and new hyperparameters"

## **Step 7: Run the Pipeline Again with New Hyperparameters**

Re-run the pipeline and commit the updated `params.yaml`. If no dependency has changed since the previous run, `dvc repro` skips the stage.

In [None]:
# Run the pipeline
!dvc repro


# Commit changes
!git add params.yaml dvc.yaml dvc.lock metrics.txt
!git commit -m "Run pipeline with updated hyperparameters"
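With more than one run committed, you can compare their metrics. The sketch below assumes that `metrics.txt` contains `name: value` lines, which may not match what the provided `train.py` writes, and that you kept the text of `metrics.txt` from an earlier run. The helper names are hypothetical.

```python
def parse_metrics(text):
    """Parse 'name: value' lines into a dict of floats, skipping other lines."""
    metrics = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, so not a metric line
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # value is not numeric
    return metrics


def diff_metrics(old_text, new_text):
    """Return {name: (old, new, change)} for metrics present in both runs."""
    old, new = parse_metrics(old_text), parse_metrics(new_text)
    return {
        name: (old[name], new[name], new[name] - old[name])
        for name in old
        if name in new
    }
```

The runs logged by `train.py` can also be compared side by side in the MLflow UI at the tracking address used in this lab.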

## **Step 8: Restore a Previous Version**

Restore an older version of the pipeline, dataset, or model to reproduce past results.

In [None]:
# List Git commits
!git log --oneline

# Check out a previous version (replace d19312f with a hash from your own log)
!git checkout d19312f

# Restore files using DVC
!dvc checkout

# Re-run the pipeline
!dvc repro
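The same restore sequence can be wrapped in a small Python helper so that the commit hash is supplied once and checked before anything runs. This is a sketch: `restore_commands` and `restore` are hypothetical names, and accepting only 7 to 40 hexadecimal characters is a deliberately strict choice for illustration.

```python
import re
import subprocess


def restore_commands(revision):
    """Build the commands that restore and re-run the pipeline at a Git revision."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", revision):
        raise ValueError(f"not a commit hash: {revision!r}")
    return [
        ["git", "checkout", revision],  # restore code, params, and DVC metadata
        ["dvc", "checkout"],            # restore data files matching that metadata
        ["dvc", "repro"],               # re-run stages whose inputs changed
    ]


def restore(revision, dry_run=True):
    """Print the commands, and execute them only when dry_run is False."""
    for command in restore_commands(revision):
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

Calling `restore("d19312f")` only prints the three commands; pass `dry_run=False` to execute them.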