# 📓 **Module 2.1: MLflow Projects Basics**

## 🎯 **Learning Objectives Expanded**

### 1️⃣ **Understand What MLflow Projects Are and Why They Matter**

* **What it means:**
  MLflow Projects are a standardized way to package your ML code—including scripts, environments, and configurations—so that it can be easily shared, run, and reproduced on any machine.

* **Key Concepts:**

  * A project is just a folder (or Git repo) containing your training code and an `MLproject` file.
  * The `MLproject` file can declare entry points and their parameters, and point to an environment definition (such as `conda.yaml`).

* **Why it matters:**
  MLflow Projects make your code reproducible and portable, whether you're running it locally, in the cloud, or in automated pipelines. This promotes collaboration and consistent results across different environments.

---

### 2️⃣ **Structure a Simple Project with `train.py`, `MLproject`, and `conda.yaml`**

* **What it means:**
  Building the basic components of a self-contained MLflow Project.

* **Directory Structure:**

  ```
  my_project/
  ├── MLproject
  ├── conda.yaml
  └── train.py
  ```

* **File Descriptions:**

  ✅ `train.py`
  The main training script with argument parsing:

  ```python
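  import argparse


  def main(alpha):
      # Placeholder for the real training logic
      print(f"Training with alpha={alpha}")


  if __name__ == "__main__":
      # Read --alpha from the command line; the default mirrors the MLproject file
      parser = argparse.ArgumentParser()
      parser.add_argument("--alpha", type=float, default=0.5)
      args = parser.parse_args()
      main(args.alpha)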
  ```

  ✅ `MLproject`
  The project specification file:

  ```yaml
  name: simple_linear_project

  conda_env: conda.yaml

  entry_points:
    main:
      parameters:
        alpha: {type: float, default: 0.5}
      command: "python train.py --alpha {alpha}"
  ```

  ✅ `conda.yaml`
  The environment file that sets the Python version and lists the project's dependencies:

  ```yaml
  name: simple-env
  channels:
    - defaults
  dependencies:
    - python=3.9
    - scikit-learn
  ```

* **Why it matters:**
  This structure ensures all collaborators (and machines) run the same code in the same environment with consistent results.
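
As a quick sanity check on the layout above, the following minimal sketch reports which of the three files are missing from a project folder. The helper name `missing_project_files` and the `EXPECTED_FILES` tuple are illustrative and not part of MLflow, and the check reflects the layout used in this module rather than what MLflow itself requires.

```python
from pathlib import Path
import tempfile

# The three files used by the layout in this module (an assumption of this sketch).
EXPECTED_FILES = ("MLproject", "conda.yaml", "train.py")


def missing_project_files(project_dir):
    """Return the expected files that are absent from project_dir (hypothetical helper)."""
    root = Path(project_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Usage: build a throwaway project that lacks conda.yaml and check it.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("MLproject", "train.py"):
        (Path(tmp) / name).write_text("")
    print(missing_project_files(tmp))  # ['conda.yaml']
```

An empty list means the folder matches the structure shown in this section.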

---

### 3️⃣ **Learn How to Execute a Parameterized Project Using the MLflow CLI**

* **What it means:**
  Running your project from the command line with different parameter values and environments using MLflow’s command-line interface.

* **Basic Command:**

  ```bash
  mlflow run . -P alpha=0.2
  ```

* **Additional Options:**

  * Run from a Git repo:

    ```bash
    mlflow run https://github.com/yourname/my_project -P alpha=0.7
    ```
  * Use a remote execution backend (e.g., Databricks or Kubernetes) if configured.

* **What happens under the hood:**

  * MLflow creates a conda environment from `conda.yaml` (or reuses one it has already built from the same file).
  * It substitutes the parameter value (`alpha=0.2`) into the entry point's command and runs `train.py`.
  * It records the run in MLflow Tracking like any other run, so it can be viewed in the Tracking UI.

* **Why it matters:**
  This allows anyone to reproduce your experiment by running a single command, ensuring consistency in both code and environment—critical for MLOps and collaboration.
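
The parameter hand-off described above can be pictured as filling in a command template. The sketch below is a simplified illustration, not MLflow's actual implementation: `ENTRY_POINT`, `CONVERTERS`, and `render_command` are hypothetical names, and the dictionary only mirrors the `main` entry point from the `MLproject` example in this module.

```python
import shlex

# Hypothetical, simplified stand-in for the `main` entry point of an MLproject file.
ENTRY_POINT = {
    "parameters": {"alpha": {"type": "float", "default": 0.5}},
    "command": "python train.py --alpha {alpha}",
}

# Only the converters this sketch needs; any other declared type is treated as a string.
CONVERTERS = {"float": float, "string": str}


def render_command(entry_point, user_params):
    """Fill the command template with user-supplied values or declared defaults."""
    resolved = {}
    for name, spec in entry_point["parameters"].items():
        if name in user_params:
            raw = user_params[name]
        elif "default" in spec:
            raw = spec["default"]
        else:
            raise ValueError(f"No value provided for parameter '{name}'")
        convert = CONVERTERS.get(spec.get("type", "string"), str)
        # Quote the value so it is safe to embed in a shell command.
        resolved[name] = shlex.quote(str(convert(raw)))
    return entry_point["command"].format(**resolved)


print(render_command(ENTRY_POINT, {"alpha": "0.2"}))  # python train.py --alpha 0.2
print(render_command(ENTRY_POINT, {}))                # python train.py --alpha 0.5
```

When no value is supplied for `alpha`, the declared default is used, so the second call renders the command with `0.5`.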




In [1]:
# 📓 Module 2.1: MLflow Projects Basics
# Goal: Learn how to structure and run MLflow Projects for reproducible ML workflows

# ✅ Step 1: Install MLflow
!pip install -q mlflow

# ✅ Step 2: Create a directory structure for the MLflow Project
import os
project_name = "mlflow_example_project"
os.makedirs(project_name, exist_ok=True)

# ✅ Step 3: Write a simple training script (train.py) to the project folder
code = '''
import mlflow
import mlflow.sklearn
from sklearn.linear_model import Ridge
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import sys

# Parse alpha from command-line argument
alpha = float(sys.argv[1]) if len(sys.argv) > 1 else 1.0

# Enable autologging
mlflow.sklearn.autolog()

with mlflow.start_run():
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"Test MSE: {mse:.4f}")
'''

with open(os.path.join(project_name, "train.py"), "w") as f:
    f.write(code)

# ✅ Step 4: Create an MLproject file to define the project structure
mlproject_content = '''
name: RidgeRegressionProject

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 1.0}
    command: "python train.py {alpha}"
'''

with open(os.path.join(project_name, "MLproject"), "w") as f:
    f.write(mlproject_content)

# ✅ Step 5: Write a conda.yaml file defining the environment
conda_yaml = '''
name: mlflow-env
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.8
  - scikit-learn
  - pip
  - pip:
      - mlflow
'''

with open(os.path.join(project_name, "conda.yaml"), "w") as f:
    f.write(conda_yaml)

# ✅ Step 6: Show the MLflow CLI command for running the project (note: works only in a local CLI, not in Colab)
print("\n📦 Project setup complete! You can now run this project locally with:")
print(f"mlflow run {project_name} -P alpha=0.5")



📦 Project setup complete! You can now run this project locally with:
mlflow run mlflow_example_project -P alpha=0.5


In [2]:
!mlflow run mlflow_example_project -P alpha=0.5

Retrieving notices: ...working... done
Channels:
 - defaults
 - conda-forge
Platform: win-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Installing pip dependencies: ...working... Ran pip subprocess with arguments:
['C:\\Users\\ryass\\anaconda3\\envs\\mlflow-2ced9beaaa4525d5306a92a4f35089d553876b2f\\python.exe', '-m', 'pip', 'install', '-U', '-r', 'c:\\Users\\ryass\\OneDrive\\Documents\\GitHub\\MLflow_learn\\MLflow_step_by_step\\mlflow_example_project\\condaenv.ntadszqk.requirements.txt', '--exists-action=b']
Pip subprocess output:
Collecting mlflow (from -r c:\Users\ryass\OneDrive\Documents\GitHub\MLflow_learn\MLflow_step_by_step\mlflow_example_project\condaenv.ntadszqk.requirements.txt (line 1))

  Downloading mlflow-2.17.2-py3-none-any.whl.meta

2025/08/03 08:53:48 INFO mlflow.utils.conda: === Creating conda environment mlflow-2ced9beaaa4525d5306a92a4f35089d553876b2f ===
2025/08/03 08:56:33 INFO mlflow.projects.utils: === Created directory C:\Users\ryass\AppData\Local\Temp\tmpewp2m553 for downloading remote URIs passed to arguments of type 'path' ===
2025/08/03 08:56:33 INFO mlflow.projects.backend.local: === Running command 'conda activate mlflow-2ced9beaaa4525d5306a92a4f35089d553876b2f && python train.py 0.5' in run with ID 'afab7c1ca5f846afbe6045e058d52c24' === 
Traceback (most recent call last):
  File "train.py", line 16, in <module>
    with mlflow.start_run():
  File "C:\Users\ryass\anaconda3\envs\mlflow-2ced9beaaa4525d5306a92a4f35089d553876b2f\lib\site-packages\mlflow\tracking\fluent.py", line 338, in start_run
    active_run_obj = client.get_run(existing_run_id)
  File "C:\Users\ryass\anaconda3\envs\mlflow-2ced9beaaa4525d5306a92a4f35089d553876b2f\lib\site-packages\mlflow\tracking\client.py", line 226, in get_run
    r

## 📝 Assessment: MLflow Projects Basics

### 📘 Multiple Choice (Choose the best answer)

**1. What is the main purpose of the `MLproject` file in an MLflow Project?**      
A. To log metrics and parameters       
B. To track model performance over time      
**C. To define how the project should be run and what parameters it uses** ✅       
D. To register models in the model registry       

---
    
**2. Which command is used to execute an MLflow Project locally with a parameter?**    
A. `mlflow train alpha=0.5`    
**B. `mlflow run <project_path> -P alpha=0.5`** ✅    
C. `python train.py --alpha=0.5`    
D. `mlflow ui run project`    

---

**3. What file in an MLflow Project specifies the Python and library dependencies?**    
A. `requirements.txt`    
B. `MLproject`    
**C. `conda.yaml`** ✅    
D. `run_config.json`    

---

**4. If the `MLproject` file defines a parameter `alpha` with a default value of 1.0, what happens when no `-P alpha=...` is provided in the `mlflow run` command?**    
A. The run fails    
**B. It uses the default value defined in the MLproject file** ✅    
C. It uses zero as a fallback    
D. It asks the user to input the value interactively    

---

### ✏️ Short Answer
    
**5. What are the benefits of structuring your code as an MLflow Project?**
*Hint: Think reproducibility, environment portability, and standardized execution.*    

---
    
**6. Why do MLflow Projects typically include a `conda.yaml` file instead of `requirements.txt`?**    
*Hint: Conda can control Python version and non-Python dependencies.*    

---

### 🧪 Mini Project    
    
**7. Task:**    
You want to train a classification model using MLflow Projects.    

* Create a directory with `train.py`, `MLproject`, and `conda.yaml`    
* Add a parameter `max_depth` to control the depth of a Decision Tree    
* Use `mlflow.run()` or the CLI to run the project with `max_depth=4`    
* Log the model, parameters, and accuracy    


