[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ramonzaca/MLSecOPs/blob/main/TP_03/03_model_and_data_monitoring.ipynb)

**How's the model performing? - Practice 3**

*Now that you have a model, you want to make sure that it's performing well and that the data used to train it is still relevant. This is where model and data monitoring comes into play.*

*To do so, you'll use the Evidently library to monitor model performance and data drift.*

*First, let's learn the basics of how to use Evidently to monitor your model and data.*

In [None]:
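# First, let's install Evidently if it's not already installed.
# This notebook uses the Report / TestSuite API (evidently.report, evidently.test_suite).
# Assumption: that API lives in releases before 0.7, so we pin the version instead of
# installing the latest development code from GitHub.
try:
    import evidently
except ImportError:
    !pip install "evidently<0.7"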

In [None]:
# Import necessary libraries
import numpy as np
from sklearn import datasets
from evidently.report import Report
from evidently.metric_preset import (
    DataDriftPreset,
)
from evidently.metrics import (
    ColumnSummaryMetric,
    ColumnQuantileMetric,
    ColumnDriftMetric,
)
from evidently.test_suite import TestSuite
from evidently.test_preset import (
    DataStabilityTestPreset,
    NoTargetPerformanceTestPreset,
    RegressionTestPreset,
)
from evidently.tests import (
    TestNumberOfColumnsWithMissingValues,
    TestNumberOfRowsWithMissingValues,
    TestNumberOfConstantColumns,
    TestNumberOfDuplicatedRows,
    TestNumberOfDuplicatedColumns,
    TestColumnsType,
    TestNumberOfDriftedColumns,
    TestColumnDrift,
    TestMeanInNSigmas,
    TestShareOfOutRangeValues,
)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Suppress warnings for cleaner output
import warnings

# A single blanket filter is enough; simplefilter("ignore") would add the same rule again
warnings.filterwarnings("ignore")

---

*In this step, we load the California Housing dataset, prepare it for analysis, and train a Random Forest model. We split the data into reference (training) and current (testing) sets, which is crucial for drift detection.*

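*Here, the reference dataset is the data the model was trained on, while the current dataset is held-out data that stands in for what the model would see in production. Because both halves come from a random split of the same dataset, we should expect little or no drift between them.*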

In [None]:
# 1. Data Preparation
print("1. Data Preparation")

# Load California Housing dataset
data = datasets.fetch_california_housing(as_frame=True)
housing_data = data.frame
housing_data.rename(columns={"MedHouseVal": "target"}, inplace=True)

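# Split the data into training (reference) and testing (current) sets.
# .copy() gives each half its own DataFrame, so adding the "prediction"
# column later does not write into a slice of housing_data.
reference, current = train_test_split(housing_data, test_size=0.5, random_state=42)
reference, current = reference.copy(), current.copy()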

# Train a Random Forest model
features = [col for col in housing_data.columns if col != "target"]
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(reference[features], reference["target"])

# Add predictions to both reference and current datasets
reference["prediction"] = rf_model.predict(reference[features])
current["prediction"] = rf_model.predict(current[features])

print("Data prepared and model trained successfully.")

---

*Here, we create a basic data drift report using Evidently's DataDriftPreset. This gives us an initial overview of potential data drift between our reference and current datasets.*
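
*To build some intuition for what a drift check measures, here is a minimal, library-free sketch of the two-sample Kolmogorov-Smirnov statistic, one common way to compare a reference and a current sample. Evidently chooses a statistical test per column on its own; this sketch does not claim to reproduce that choice, and the function name `ks_statistic` and the synthetic samples are assumptions made for illustration.*

```python
import random
from bisect import bisect_right


def ks_statistic(reference_values, current_values):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref_sorted = sorted(reference_values)
    cur_sorted = sorted(current_values)
    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so evaluating the gap at those values is sufficient.
    for value in ref_sorted + cur_sorted:
        ref_cdf = bisect_right(ref_sorted, value) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, value) / len(cur_sorted)
        largest_gap = max(largest_gap, abs(ref_cdf - cur_cdf))
    return largest_gap


# Synthetic data (not the housing dataset): one sample from the same
# distribution as the baseline, and one whose mean has shifted.
rng = random.Random(42)
baseline = [rng.gauss(0.0, 1.0) for _ in range(500)]
same_distribution = [rng.gauss(0.0, 1.0) for _ in range(500)]
shifted = [rng.gauss(1.5, 1.0) for _ in range(500)]

ks_same = ks_statistic(baseline, same_distribution)
ks_shifted = ks_statistic(baseline, shifted)
print(f"same distribution: {ks_same:.3f}")
print(f"shifted mean:      {ks_shifted:.3f}")
```

*A statistic near 0 means the two samples look alike; a value near 1 means they barely overlap. A drift test turns such a score into a pass/fail decision by comparing it (or its p-value) with a threshold.*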

In [None]:
# 2. Basic Report Generation
print("\n2. Basic Report Generation")

# Create a basic data drift report
basic_report = Report(metrics=[DataDriftPreset()])
basic_report.run(reference_data=reference, current_data=current)
print("Basic data drift report generated. Use basic_report.show() to display it.")

In [None]:
basic_report.show()

---

*We now create a more focused report with custom metrics. This demonstrates how to analyze specific aspects of your data, such as summary statistics, a quantile, and drift for an individual column. Here we look at the `AveRooms` column.*
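
*As a reminder of what a quantile metric reports, the sketch below computes a quantile by linear interpolation between the two nearest sorted values, which is the default rule in pandas and NumPy. That Evidently uses exactly this rule is an assumption; the helper `quantile` and the toy values are illustrative only.*

```python
def quantile(values, q):
    """Return the q-quantile using linear interpolation between sorted neighbours."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    # Fractional index of the quantile within the sorted list
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Toy values standing in for a numeric column in each dataset
reference_rooms = [3.0, 4.0, 5.0, 6.0, 7.0]
current_rooms = [4.0, 5.0, 6.0, 7.0, 8.0]

ref_q25 = quantile(reference_rooms, 0.25)
cur_q25 = quantile(current_rooms, 0.25)
print(f"reference 25th percentile: {ref_q25}")
print(f"current 25th percentile:   {cur_q25}")
```

*Comparing the same quantile across reference and current data is a quick way to see whether the lower (or upper) end of a distribution has moved, even when the mean has not.*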

In [None]:
# 3. Custom Metric Reports
print("\n3. Custom Metric Reports")

# Create a report with custom metrics
custom_report = Report(
    metrics=[
        ColumnSummaryMetric(column_name="AveRooms"),
        ColumnQuantileMetric(column_name="AveRooms", quantile=0.25),
        ColumnDriftMetric(column_name="AveRooms"),
    ]
)
custom_report.run(reference_data=reference, current_data=current)
print("Custom metric report generated. Use custom_report.show() to display it.")

In [None]:
custom_report.show()

---

*This section introduces a helper function that builds the same metric for every numeric column, so we don't have to list each column by hand. We then use it to create a report with the 0.25 quantile of each numeric column (including `target` and `prediction`).*
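
*The pattern behind such a helper is a small factory: take a metric class, a list of columns, and shared keyword arguments, and return one instance per column. The sketch below shows the idea with an explicit column list. `StubQuantileMetric` is a hypothetical stand-in so the example runs without Evidently, and `build_column_metrics` is an illustrative name, not part of any library.*

```python
from dataclasses import dataclass


@dataclass
class StubQuantileMetric:
    """Hypothetical stand-in for a per-column metric class; not part of Evidently."""

    column_name: str
    quantile: float


def build_column_metrics(metric_class, columns, parameters):
    """Create one metric_class instance per column, sharing the same parameters."""
    return [metric_class(column_name=column, **parameters) for column in columns]


metrics = build_column_metrics(
    StubQuantileMetric,
    columns=["MedInc", "HouseAge", "AveRooms"],
    parameters={"quantile": 0.25},
)
for metric in metrics:
    print(metric)
```

*Passing the columns explicitly, rather than reading them from a global DataFrame, makes it easy to leave out columns such as `target` or `prediction` when they are not of interest.*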

In [None]:
# 4. Generate Column Metrics
print("\n4. Generate Column Metrics")


def generate_column_metrics(metric_class, parameters):
    """Build one metric_class instance per numeric column of `reference`."""
    # Numeric columns here include "target" and "prediction" as well as the features
    columns = reference.select_dtypes(include=[np.number]).columns.tolist()
    return [metric_class(column_name=col, **parameters) for col in columns]


# Generate metrics for multiple columns
multi_column_report = Report(
    metrics=generate_column_metrics(
        ColumnQuantileMetric,
        parameters={"quantile": 0.25},
    )
)
multi_column_report.run(reference_data=reference, current_data=current)
print("Multi-column report generated. Use multi_column_report.show() to display it.")

In [None]:
multi_column_report.show()

---

*Here, we combine various metrics to create a comprehensive report. This includes column summaries, quantile metrics for numeric columns, and overall data drift analysis.*

In [None]:
# 5. Comprehensive Report
print("\n5. Comprehensive Report")

# Create a comprehensive report with various metrics
comprehensive_report = Report(
    metrics=[
        ColumnSummaryMetric(column_name="AveRooms"),
        *generate_column_metrics(ColumnQuantileMetric, parameters={"quantile": 0.25}),
        DataDriftPreset(),
    ]
)
comprehensive_report.run(reference_data=reference, current_data=current)
print("Comprehensive report generated. Use comprehensive_report.show() to display it.")

In [None]:
comprehensive_report.show()

---

*Now that we have generated reports, we can export them as HTML or JSON files, allowing for easy sharing and integration with other tools.*

In [None]:
# 6. Exporting Reports
print("\n6. Exporting Reports")

# Uncomment these lines to save the report
# comprehensive_report.save_html("report.html")
# comprehensive_report.save_json("report.json")
print("Report can be exported as HTML or JSON.")

---

*Now that we have our reports and metrics, we can shift focus to test suites. Unlike a report, a test suite gives each check an explicit pass/fail outcome. We start with a basic suite that checks for common data quality issues such as missing values, constant columns, and duplicates.*
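
*To make these checks concrete, here is a minimal, library-free sketch that counts the same kinds of problems on a handful of toy rows. It illustrates what is being counted, not how Evidently implements it; the function `data_quality_counts`, the column names, and the rows are all made up for this example.*

```python
def data_quality_counts(rows):
    """Count basic data quality problems in a list of row dictionaries."""
    columns = list(rows[0].keys()) if rows else []
    rows_with_missing = sum(
        1 for row in rows if any(row[col] is None for col in columns)
    )
    columns_with_missing = sum(
        1 for col in columns if any(row[col] is None for row in rows)
    )
    # A column is constant when every row holds the same value
    constant_columns = sum(
        1 for col in columns if len({row[col] for row in rows}) == 1
    )
    # A row is a duplicate when an identical row appeared earlier
    seen = set()
    duplicated_rows = 0
    for row in rows:
        key = tuple(row[col] for col in columns)
        if key in seen:
            duplicated_rows += 1
        seen.add(key)
    return {
        "rows_with_missing": rows_with_missing,
        "columns_with_missing": columns_with_missing,
        "constant_columns": constant_columns,
        "duplicated_rows": duplicated_rows,
    }


# Toy rows; None marks a missing value
toy_rows = [
    {"age": 10, "rooms": 5.0, "region": "north"},
    {"age": 10, "rooms": 5.0, "region": "north"},  # exact duplicate of the first row
    {"age": None, "rooms": 6.1, "region": "north"},
    {"age": 25, "rooms": None, "region": "north"},
]
counts = data_quality_counts(toy_rows)
print(counts)
```

*A test suite wraps counts like these in conditions, for example "no more rows with missing values than in the reference data", and reports each condition as passed or failed.*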

In [None]:
# 7. Test Suites
print("\n7. Test Suites")

# Create a basic test suite
basic_suite = TestSuite(
    tests=[
        TestNumberOfColumnsWithMissingValues(),
        TestNumberOfRowsWithMissingValues(),
        TestNumberOfConstantColumns(),
        TestNumberOfDuplicatedRows(),
        TestNumberOfDuplicatedColumns(),
        TestColumnsType(),
        TestNumberOfDriftedColumns(),
    ]
)
basic_suite.run(reference_data=reference, current_data=current)
print("Basic test suite executed. Use basic_suite.show() to display results.")

In [None]:
basic_suite.show()

---

*As with reports, we can use preset test suites to quickly check for common issues.*

In [None]:
# 8. Preset Test Suites
print("\n8. Preset Test Suites")

# Use a preset test suite
preset_suite = TestSuite(tests=[NoTargetPerformanceTestPreset()])
preset_suite.run(reference_data=reference, current_data=current)
print("Preset test suite executed. Use preset_suite.show() to display results.")

In [None]:
preset_suite.show()

---

*We can also create custom test suites that check specific conditions or drift in individual columns.*
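
*One such condition is whether a column's current mean stays within a few standard deviations of its reference mean. The sketch below shows that idea in plain Python. The use of the sample standard deviation and the default of 2 sigmas are assumptions of this sketch, and `mean_in_n_sigmas` is an illustrative helper rather than Evidently's implementation.*

```python
import statistics


def mean_in_n_sigmas(reference_values, current_values, n_sigmas=2):
    """Check that the current mean lies within n_sigmas reference std devs of the reference mean."""
    ref_mean = statistics.fmean(reference_values)
    ref_std = statistics.stdev(reference_values)  # sample standard deviation
    cur_mean = statistics.fmean(current_values)
    lower = ref_mean - n_sigmas * ref_std
    upper = ref_mean + n_sigmas * ref_std
    return lower <= cur_mean <= upper


# Toy values standing in for a numeric column such as a house age
reference_age = [20, 25, 30, 35, 40]
current_age_stable = [28, 31, 33]
current_age_shifted = [50, 55, 60]

stable_ok = mean_in_n_sigmas(reference_age, current_age_stable)
shifted_ok = mean_in_n_sigmas(reference_age, current_age_shifted)
print(f"stable batch passes:  {stable_ok}")
print(f"shifted batch passes: {shifted_ok}")
```

*A check like this is cheap and easy to explain, but it only looks at the mean, so it pairs well with a distribution-level drift test on the same column.*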

In [None]:
# 9. Custom Test Suites
print("\n9. Custom Test Suites")

# Create a custom test suite
custom_suite = TestSuite(
    tests=[
        TestColumnDrift("Population"),
        TestMeanInNSigmas("HouseAge"),
        NoTargetPerformanceTestPreset(columns=["AveRooms", "AveBedrms", "AveOccup"]),
    ]
)
custom_suite.run(reference_data=reference, current_data=current)
print("Custom test suite executed. Use custom_suite.show() to display results.")

In [None]:
custom_suite.show()

---

*Once we have our test suites, we can create a comprehensive suite that checks for all the issues we care about.*
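
*Range checks are a useful addition to such a suite: they ask what share of current values falls outside the range seen in the reference data. A minimal sketch of that calculation follows. Using the reference minimum and maximum as the range is an assumption of this sketch, and `share_out_of_range` and the toy values are illustrative only.*

```python
def share_out_of_range(reference_values, current_values):
    """Share of current values that fall outside the [min, max] range of the reference."""
    if not current_values:
        raise ValueError("current_values must not be empty")
    low, high = min(reference_values), max(reference_values)
    outside = sum(1 for value in current_values if value < low or value > high)
    return outside / len(current_values)


# Toy values standing in for a numeric column such as a population count
reference_population = [100, 250, 400, 900]
current_population = [50, 300, 850, 1200, 500]

share = share_out_of_range(reference_population, current_population)
print(f"share of out-of-range values: {share:.0%}")
```

*Values outside the reference range are inputs the model never saw during training, so a rising share is an early hint that predictions may become less reliable.*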

In [None]:
# 10. Comprehensive Test Suite
print("\n10. Comprehensive Test Suite")

# Create a comprehensive test suite
comprehensive_suite = TestSuite(
    tests=[
        TestNumberOfColumnsWithMissingValues(),
        TestNumberOfRowsWithMissingValues(),
        TestNumberOfConstantColumns(),
        TestNumberOfDuplicatedRows(),
        TestNumberOfDuplicatedColumns(),
        TestColumnsType(),
        TestNumberOfDriftedColumns(),
        TestColumnDrift("Population"),
        TestShareOfOutRangeValues("Population"),
        DataStabilityTestPreset(),
        RegressionTestPreset(),
    ]
)
comprehensive_suite.run(reference_data=reference, current_data=current)
print(
    "Comprehensive test suite executed. Use comprehensive_suite.show() to display results."
)

In [None]:
comprehensive_suite.show()

---

*Finally, we can export the test suite results as HTML or JSON, just as we did with the reports.*

In [None]:
# 11. Exporting Test Suites
print("\n11. Exporting Test Suites")

# Uncomment these lines to save the test suite results
# comprehensive_suite.save_html("test_suite.html")
# comprehensive_suite.save_json("test_suite.json")

print("Test suite results can be exported as HTML or JSON.")