## Jupyter Notebook

### A. System Overview

This machine learning system showcases an end-to-end implementation covering the entire lifecycle of a machine learning model. It integrates the MLOps tools and technologies discussed during the DASCI 270 sessions to handle data ingestion, preprocessing, training, validation, deployment, and monitoring of the model. The system also includes drift detection to help keep the model effective and accurate over time. This document guides you through interacting with the deployed system and documents each component and its functionality.

#### System Architecture

! insert diagram

#### Components

1. **Machine Learning Model (XGBoost for Classification)** - 

2. **Data Pipeline (Dagster)** - Orchestrates the workflow for data ingestion, preprocessing, and preparation. Dagster manages the sequence of these operations to ensure data flows correctly from one process to another, maintaining a clear and manageable execution order.

3. **Experiment Tracking (MLflow)** - Provides a framework to track experiments, including model training runs, parameters, metrics, and artifacts, enabling easier debugging and optimization. It stores models, performance metrics, and custom objects like drift reports, making them easy to access and compare across different runs.

4. **Model Serving (FastAPI)** - Deploys the trained model through a REST API using FastAPI, facilitating easy access to the model’s predictive capabilities. The API handles requests for predictions and provides model metadata, ensuring input validation and structured outputs.

5. **Containerization (Docker)** - Containerizes the services making up the system to ensure consistency and reproducibility across environments. Each isolated environment ("container") contains all necessary dependencies for each service, which can be easily deployed on any system supporting Docker.

6. **Drift Detection (Evidently AI)** - Integrates Evidently AI to monitor the model for any signs of data or concept drift. This component is crucial for maintaining the model's accuracy, providing insights into how the data characteristics and relationships are changing over time.

7. **Data Validation (Pydantic)** - Ensures that the data received by the API matches the expected format and type. This prevents errors during model prediction and ensures reliable model performance.

8. **Testing (Pytest and GitHub Actions)** - Uses Pytest to develop and run tests that validate the correctness of the data processing and feature engineering components. GitHub Actions runs these tests automatically, so that code integrations are checked against the same quality standards before they are merged.
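
To make the data validation component more concrete, here is a minimal sketch of what a Pydantic request schema for this dataset could look like. The class name `PredictionRequest` and all field names are hypothetical and are not taken from the project's code; the allowed `product_type` values assume the dataset's three product variants (L, M, H). The sample values come from the first row of the dataset preview in this notebook.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class PredictionRequest(BaseModel):
    """Hypothetical request body; field names are illustrative only."""

    product_type: Literal["L", "M", "H"]
    air_temperature_k: float = Field(gt=0)      # Kelvin, must be positive
    process_temperature_k: float = Field(gt=0)  # Kelvin, must be positive
    rotational_speed_rpm: float = Field(ge=0)
    torque_nm: float = Field(ge=0)
    tool_wear_min: float = Field(ge=0)


# A payload built from the first row of the dataset passes validation.
valid = PredictionRequest(
    product_type="M",
    air_temperature_k=298.1,
    process_temperature_k=308.6,
    rotational_speed_rpm=1551,
    torque_nm=42.8,
    tool_wear_min=0,
)

# An unknown product type and a negative torque are rejected before
# anything reaches the model.
try:
    PredictionRequest(
        product_type="X",
        air_temperature_k=298.1,
        process_temperature_k=308.6,
        rotational_speed_rpm=1551,
        torque_nm=-1.0,
        tool_wear_min=0,
    )
except ValidationError as exc:
    print(f"Rejected with {len(exc.errors())} validation error(s)")
```

When a model like this is used as the type of a FastAPI endpoint parameter, FastAPI performs the same validation on incoming request bodies and returns a 422 response for invalid input.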

### B. Setup Instructions

This section provides detailed instructions on how to set up and run the system using Docker Compose. The instructions assume you have Docker and Docker Compose installed on your machine. If not, please install them from the [Docker official website](https://www.docker.com/) before proceeding.

**Step 1: Clone the Repository**
First, clone the project repository from GitHub to get the necessary code and configuration files. Use the following command:

```bash
git clone https://github.com/raymundojavajr/ml-fp.git
cd ml-fp
```

These commands clone the repository into a directory named `ml-fp` on your local machine and change into that directory.

**Step 2: Set Up the Python Environment**
Create a virtual environment in a `.venv` directory at the project root and install the project dependencies into it. This can be conveniently done using uv:

```bash
uv sync
```

It can also be done using pip:

```bash
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
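
As a quick sanity check after this step, the following sketch reports whether Python is running inside a virtual environment. It relies only on the standard library; the helper name `in_virtualenv` is made up for this example.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
    print("Interpreter:", sys.executable)
```

If this prints `False`, the environment was probably created but not activated in the current shell.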

**Step 3: Build the Docker Containers**
Use Docker Compose to build the services defined in the docker-compose.yml file. This includes your Dagster orchestration, MLflow tracking server, and FastAPI application. Run the following command:

```bash
docker-compose build
```

This command builds the Docker images for each service according to the specifications in the Dockerfile and docker-compose.yml files. It installs all necessary dependencies in these images, ensuring each service has what it needs to run.

**Step 4: Run the Services**
Once the images are built, you can start the services with the following command:

```bash
docker-compose up
```

This command starts all services defined in docker-compose.yml. It creates and starts Docker containers for each service. The services will be networked together as defined, allowing them to communicate with each other.

**Step 5: Access the Services**
* FastAPI Application: Access the FastAPI application at http://localhost:8000. By default, FastAPI serves its automatically generated API documentation at http://localhost:8000/docs, which allows you to interact with the API directly through your web browser.
* MLflow Tracking Server: Access the MLflow tracking server at http://localhost:5000. Here you can view and compare different model runs and their metrics.
* Dagster Dagit UI: Access the Dagster orchestration UI at http://localhost:3000. This interface allows you to monitor and control the data pipelines.
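
To confirm that all three services came up, a small reachability check can help. The sketch below uses only the standard library and the URLs listed above; the helper name `is_reachable` is made up for this example, and the ports are an assumption if your docker-compose.yml maps different ones.

```python
import http.client
import urllib.error
import urllib.request

# URLs from the list above; adjust them if docker-compose.yml maps other ports.
SERVICES = {
    "FastAPI": "http://localhost:8000",
    "MLflow": "http://localhost:5000",
    "Dagster": "http://localhost:3000",
}

# The services run locally, so proxy settings from the environment are ignored.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at url, whatever the status code."""
    try:
        with _OPENER.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status (e.g. 404), so it is up.
        return True
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed response.
        return False


if __name__ == "__main__":
    for name, url in SERVICES.items():
        status = "up" if is_reachable(url) else "not reachable"
        print(f"{name:8s} {url:24s} {status}")
```

A service that reports "not reachable" right after `docker-compose up` may simply still be starting, so it is worth retrying after a few seconds before checking the container logs.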

**Step 6: Shut Down the Services**
To stop and remove all running containers, use the following command:

```bash
docker-compose down
```

This command stops all containers and removes the containers and networks created by `docker-compose up`. Volumes are kept unless you add the `-v` flag, and the built images are kept as well, allowing for a quick restart of the services if needed.



### C. Happy Path

#### Prediction Request

#### Model Information Retrieval

### D. Drift Detection Demonstration

#### Evidently AI Access from MLflow

Insert instructions and demonstration on how to access the Evidently AI drift reports from MLflow

Show Python code to access MLflow and retrieve the reports
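
One possible shape for that code is sketched below. It assumes the drift reports are logged as HTML artifacts under a folder named `drift_reports` in each run; that folder name, the helper names, and the local `reports` destination are hypothetical and should be replaced with whatever the pipeline actually uses. The tracking URI is the MLflow URL from the setup section.

```python
import os
from typing import Iterable, List

TRACKING_URI = "http://localhost:5000"  # MLflow URL from the setup section
REPORT_DIR = "drift_reports"            # hypothetical artifact folder name


def html_reports(paths: Iterable[str]) -> List[str]:
    """Keep only the artifact paths that look like HTML reports."""
    return sorted(p for p in paths if p.lower().endswith(".html"))


def download_drift_reports(run_id: str, dst_path: str = "reports") -> List[str]:
    """Download every HTML artifact under REPORT_DIR for one run.

    Returns the local file paths of the downloaded reports.
    """
    # Imported lazily so html_reports() can be used without MLflow installed.
    import mlflow
    import mlflow.artifacts
    from mlflow.tracking import MlflowClient

    mlflow.set_tracking_uri(TRACKING_URI)
    client = MlflowClient(tracking_uri=TRACKING_URI)
    os.makedirs(dst_path, exist_ok=True)

    # List the artifacts stored under the report folder of this run.
    artifact_paths = [f.path for f in client.list_artifacts(run_id, REPORT_DIR)]
    return [
        mlflow.artifacts.download_artifacts(
            run_id=run_id, artifact_path=path, dst_path=dst_path
        )
        for path in html_reports(artifact_paths)
    ]
```

With the services running, `download_drift_reports("<run id>")` would be called with a run ID copied from the MLflow UI, and the returned files can then be opened in a browser.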

Show screenshots or directly embed the HTML reports

#### Drift Result Analysis

In [None]:
# Data source: https://archive.ics.uci.edu/dataset/601/ai4i+2020+predictive+maintenance+dataset

In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
from sklearn.feature_selection import mutual_info_classif

In [4]:
file_path = "../data/raw/predictive_maintenance.csv"

In [5]:
df = pd.read_csv(file_path)
df.head(5)

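|   | UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Target | Failure Type |
|---|-----|------------|------|---------------------|-------------------------|------------------------|-------------|-----------------|--------|--------------|
| 0 | 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | No Failure |
| 1 | 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | No Failure |
| 2 | 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | No Failure |
| 3 | 4 | L47183 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | No Failure |
| 4 | 5 | L47184 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | No Failure |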


In [None]:
print(df.dtypes)

In [None]:
# Cell: Define Data Columns
categorical_cols = ["Type", "Product ID", "Failure Type"]
numerical_cols = [
    "Air temperature [K]",
    "Process temperature [K]",
    "Rotational speed [rpm]",
    "Torque [Nm]",
    "Tool wear [min]"
]


In [None]:
def define_data_columns():
    """Return the categorical and numerical column names of the dataset."""
    categorical_cols = ["Type", "Product ID", "Failure Type"]

    numerical_cols = [
        "Air temperature [K]",
        "Process temperature [K]",
        "Rotational speed [rpm]",
        "Torque [Nm]",
        "Tool wear [min]",
    ]
    return categorical_cols, numerical_cols

cat_cols, num_cols = define_data_columns()
print("Categorical Columns:", cat_cols)
print("Numerical Columns:", num_cols)

In [None]:
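# Label encoding: fit one LabelEncoder per categorical column and keep it in
# le_dict so predictions can be mapped back to the original labels later.
le_dict = {}

for col in cat_cols:
    le = LabelEncoder()
    df[col + "_encoded"] = le.fit_transform(df[col])
    le_dict[col] = le

print("Encoded columns:")
print(df[[col + "_encoded" for col in cat_cols]].head(5))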

In [None]:
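# Leave out "Failure Type": it is the multiclass target and closely tied to the
# binary Target, so using its encoding as a feature would leak the label.
feature_cols = numerical_cols + [
    col + "_encoded" for col in categorical_cols if col != "Failure Type"
]
features_array = df[feature_cols].values
features_array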

In [None]:
y_binary = df["Target"].values
y_multiclass = df["Failure Type_encoded"].values

# Fix random_state so the mutual information estimates are reproducible.
mi_scores = mutual_info_classif(features_array, y_binary, random_state=42)

feature_importance = pd.DataFrame({
    "feature": feature_cols,
    "mi_scores": mi_scores
})

feature_importance = feature_importance.sort_values(by="mi_scores", ascending=False)

print("Feature Importance:")
print(feature_importance)

In [None]:
import matplotlib.pyplot as plt

# Plot the mutual information scores as a bar chart
plt.figure(figsize=(8, 5))
plt.bar(feature_importance["feature"], feature_importance["mi_scores"], color="skyblue")
plt.xlabel("Features")
plt.ylabel("Mutual Information Score")
plt.title("Feature Importance Based on Mutual Information")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()


In [None]:
threshold = mi_scores.mean()

selected_features = feature_importance[feature_importance["mi_scores"] > threshold]["feature"].tolist()
X = df[selected_features].values


# Display the selected features and the shape of the resulting feature matrix.
print("Selected Features:", selected_features)
print("Feature matrix shape:", X.shape)
X

In [None]:
# 5-fold cross validation

kf = KFold(n_splits=5, shuffle=True, random_state=42)

binary_predictions = np.zeros(len(df))
multiclass_predictions = np.zeros(len(df))

f1_scores = []


In [None]:

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Split the data into training and validation sets
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y_binary[train_idx], y_binary[val_idx]
    y_train_multi, y_val_multi = y_multiclass[train_idx], y_multiclass[val_idx]

    # --- Binary Classification Model ---
    binary_model = XGBClassifier(random_state=42, eval_metric="logloss")
    binary_model.fit(X_train, y_train)
    binary_pred = binary_model.predict(X_val)
    fold_f1 = f1_score(y_val, binary_pred)
    f1_scores.append(fold_f1)
    print(f"Fold {fold + 1} - Binary F1 Score: {fold_f1:.4f}")
    binary_predictions[val_idx] = binary_pred

    # --- Multiclass Classification Model ---
    multiclass_model = XGBClassifier(random_state=42, eval_metric="mlogloss")
    multiclass_model.fit(X_train, y_train_multi)
    multiclass_predictions[val_idx] = multiclass_model.predict(X_val)
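
The loop above collects one binary F1 score per fold in `f1_scores` but does not summarize them. A minimal sketch of such a summary follows; the helper name `summarize_scores` is made up for this example, and the scores are placeholder values for illustration only, not results from this notebook.

```python
from statistics import mean, stdev


def summarize_scores(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    if len(scores) < 2:
        raise ValueError("need at least two fold scores")
    return mean(scores), stdev(scores)


# Placeholder values for illustration only; in the notebook, pass f1_scores.
example_scores = [0.70, 0.75, 0.80, 0.75, 0.75]
avg, spread = summarize_scores(example_scores)
print(f"Mean F1 across folds: {avg:.4f} (std {spread:.4f})")
```

Reporting the spread alongside the mean makes it easier to tell whether a difference between two model variants is larger than the fold-to-fold variation.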


In [None]:
# Create the submission DataFrame using:
# - "UDI": unique identifiers from the original dataset.
# - "Target": binary predictions (converted to integers).
# - "Failure_Type": multiclass predictions, inverse-transformed back to the original labels.
submission = pd.DataFrame({
    "UDI": df["UDI"],
    "Target": binary_predictions.astype(int),
    "Failure_Type": le_dict["Failure Type"].inverse_transform(multiclass_predictions.astype(int))
})

# Display the first few rows of the submission DataFrame
print("Submission preview:")
print(submission.head())
