## Connect to MLflow in Your Python Code

Now, in your Jupyter notebook or Python script, you can load these environment variables and tell MLflow where to send its data.


How to use `mlflow.start_run()` and `mlflow.end_run()` in a notebook:


In a notebook environment, you often want a single MLflow run to span multiple cells. For example, one cell might load data, another might preprocess it, and a third might train a model, all as part of the same experimental "run."
Here's how you can set it up:

In [1]:
# --- Cell 1: Environment Setup and Start MLflow Run ---
from dotenv import load_dotenv
import os
import mlflow

# Load environment variables from .env file
load_dotenv()

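# Get credentials and URI from environment variables
MLFLOW_USERNAME = os.getenv('MLFLOW_TRACKING_USERNAME')
MLFLOW_PASSWORD = os.getenv('MLFLOW_TRACKING_PASSWORD')
MLFLOW_TRACKING_URI = os.getenv('MLFLOW_TRACKING_URI')

# Fail early with a clear message if anything is missing from the .env file
# (assigning None to os.environ below would otherwise raise a TypeError)
if not all([MLFLOW_USERNAME, MLFLOW_PASSWORD, MLFLOW_TRACKING_URI]):
    raise RuntimeError("Set MLFLOW_TRACKING_USERNAME, MLFLOW_TRACKING_PASSWORD and MLFLOW_TRACKING_URI in .env")

# MLflow reads these two variables to authenticate with your server.
# load_dotenv() normally sets them already; re-assigning keeps this explicit.
os.environ['MLFLOW_TRACKING_USERNAME'] = MLFLOW_USERNAME
os.environ['MLFLOW_TRACKING_PASSWORD'] = MLFLOW_PASSWORD

# Set the tracking URI
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)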

# Define an experiment name
# If the experiment doesn't exist, MLflow creates it.
EXPERIMENT_NAME = "UCI Adult Income Prediction - Centralized"
mlflow.set_experiment(EXPERIMENT_NAME)

# Start a new run manually. We'll end it later.
# It's good practice to give your run a descriptive name.
current_run = mlflow.start_run(run_name="Initial_Setup_and_Connection_Test")
print(f"MLflow Run Name: {current_run.data.tags.get('mlflow.runName')}")
print(f"MLflow Run ID: {current_run.info.run_id}")
print(f"MLflow Experiment ID: {current_run.info.experiment_id}")
print(f"MLflow tracking URI: {mlflow.get_tracking_uri()}")
print(f"MLflow artifact URI: {mlflow.get_artifact_uri()}")

# You can log a simple parameter to test
mlflow.log_param("connection_status", "successful")

MLflow Run Name: Initial_Setup_and_Connection_Test
MLflow Run ID: 0294f20a6c274613aeaa8a140c210d77
MLflow Experiment ID: 2
MLflow tracking URI: http://135.235.251.124
MLflow artifact URI: wasbs://artifactroot@tharindumlflow0aa3981a.blob.core.windows.net/2/0294f20a6c274613aeaa8a140c210d77/artifacts


'successful'

In [2]:
# --- Cell X: End the MLflow Run ---
if mlflow.active_run(): # Check if a run is active before trying to end it
    mlflow.end_run()
    print("MLflow run ended.")
else:
    print("No active MLflow run to end.")

🏃 View run Initial_Setup_and_Connection_Test at: http://135.235.251.124/#/experiments/2/runs/0294f20a6c274613aeaa8a140c210d77
🧪 View experiment at: http://135.235.251.124/#/experiments/2
MLflow run ended.


# Practical Application: The UCI Adult Income Dataset
With our MLflow environment connected to our central server, let's get practical. We'll use the well-known UCI Adult Income dataset. The goal is to predict whether an individual's income is more than $50,000 per year based on census data. This is a binary classification problem (two possible outcomes).
Throughout this project, we'll see how to use MLflow to keep track of our data cleaning, exploratory data analysis (EDA), model training, and model evaluation steps.

# Phase 1: Data Ingestion and Initial Preparation
First, we need to get the dataset. The ucimlrepo library is a handy way to download datasets directly from the UCI Machine Learning Repository.


**Note:** The following code is meant to run in a Jupyter notebook cell. It assumes you have already run the setup cell (Cell 1 from "Connect to MLflow in Your Python Code"), which sets the MLflow tracking URI and experiment. You could keep logging to the run started there, but a dedicated run per phase keeps the logs easier to navigate.

Let's create a new run specifically for data ingestion.

In [17]:
# --- Notebook Cell: Data Ingestion and Initial Preparation ---
# Make sure you've run the initial MLflow setup cell first to set the tracking URI and experiment.

import pandas as pd
import json # For handling JSON data
from ucimlrepo import fetch_ucirepo # To get the dataset

# --- Start a new MLflow run for this data loading & prep phase ---
# (This assumes your MLFLOW_TRACKING_URI, USERNAME, PASSWORD are set via os.environ
# and mlflow.set_experiment() has been called from the previous setup cell)

# If a run is already active from a previous cell, you might want to end it first
if mlflow.active_run():
    mlflow.end_run()
    print("Ended previous active run.")

# Start a new run for data ingestion
data_ingestion_run = mlflow.start_run(run_name="Data_Ingestion_and_Initial_Prep")
print(f"Starting new MLflow Run for Data Ingestion: {data_ingestion_run.data.tags.get('mlflow.runName')}")
print(f"Run ID: {data_ingestion_run.info.run_id}")

# --- 1. Fetch Dataset ---
print("Fetching UCI Adult dataset...")
try:
    adult_dataset_info = fetch_ucirepo(id=2) # id=2 is for the Adult dataset
    X_raw = adult_dataset_info.data.features
    y_raw = adult_dataset_info.data.targets
    print("Dataset fetched successfully.")
except Exception as e:
    print(f"Error fetching dataset: {e}")
    mlflow.log_param("data_fetching_status", "failed")
    mlflow.log_param("data_fetching_error", str(e))
    if mlflow.active_run():
        mlflow.end_run(status="FAILED") # Mark the run as failed rather than finished
    raise # Re-raise the exception to stop execution

# --- 2. Combine Features and Target ---
# For easier handling, let's put features and target into one DataFrame
df_raw = pd.concat([X_raw, y_raw], axis=1)
print("\nCombined DataFrame head (first 5 rows):")
print(df_raw.head())

# --- 3. Basic Data Type Conversion ---
# Some integer columns might be better as floats for later processing
# or to handle potential NaNs more consistently.
int_columns = df_raw.select_dtypes(include='int64').columns
if not int_columns.empty:
    df_raw[int_columns] = df_raw[int_columns].astype('float64')
    print(f"\nConverted integer columns to float: {list(int_columns)}")
    mlflow.log_param("int_columns_converted_to_float", list(int_columns))
else:
    print("\nNo int64 columns found to convert.")
    mlflow.log_param("int_columns_converted_to_float", "None")

target_column_name = y_raw.columns[0] # Get the target column name (e.g., 'income')

# --- 4. Logging Dataset as an MLflow Input ---
# MLflow can track datasets as inputs, which helps with lineage.
print(f"\nLogging dataset as MLflow input. Target column: '{target_column_name}'")
try:
    mlflow_dataset = mlflow.data.from_pandas(
        df_raw,
        source=adult_dataset_info.metadata.get('data_url', 'UCI Repository ID 2'),
        name="UCI Adult Income - Raw Combined",
        targets=target_column_name
    )
    mlflow.log_input(mlflow_dataset, context="raw_dataset") # context is a tag
    print("Dataset logged as MLflow input.")
    mlflow.log_param("raw_dataset_logged_as_input", "success")
except Exception as e:
    print(f"Error logging dataset as MLflow input: {e}")
    mlflow.log_param("raw_dataset_logged_as_input", "failed")
    mlflow.log_param("raw_dataset_logging_error", str(e))


# --- 5. Save and Log Dataset Metadata as Artifacts ---
# The fetched dataset has metadata. Let's save this as a JSON artifact.
# This is handy for understanding where the data came from, variable descriptions, etc.

# Create a local directory to temporarily store artifacts before logging
# This is good practice if you have multiple files or want to organize them.
local_artifacts_dir = "temp_data_ingestion_artifacts"
os.makedirs(local_artifacts_dir, exist_ok=True)

# Save original UCI metadata
uci_metadata_file_path = os.path.join(local_artifacts_dir, "uci_adult_dataset_metadata.json")
try:
    with open(uci_metadata_file_path, "w") as f:
        json.dump(adult_dataset_info.metadata, f, indent=4)
    mlflow.log_artifact(uci_metadata_file_path, artifact_path="dataset_description")
    print(f"\nUCI dataset metadata logged to MLflow artifacts: {uci_metadata_file_path}")
except Exception as e:
    print(f"Error logging UCI metadata: {e}")

# Save variable information from UCI
uci_variables_file_path = os.path.join(local_artifacts_dir, "uci_adult_dataset_variables.json")
try:
    # Convert DataFrame to dict for JSON serialization if it's a DataFrame
    variables_info_serializable = adult_dataset_info.variables.to_dict(orient='records') if isinstance(adult_dataset_info.variables, pd.DataFrame) else adult_dataset_info.variables
    with open(uci_variables_file_path, "w") as f:
        json.dump(variables_info_serializable, f, indent=4)
    mlflow.log_artifact(uci_variables_file_path, artifact_path="dataset_description")
    print(f"UCI dataset variable info logged to MLflow artifacts: {uci_variables_file_path}")
except Exception as e:
    print(f"Error logging UCI variable info: {e}")


# --- 6. Extract and Log Unique Values for Categorical Features ---
# Knowing unique values in categorical columns is key for EDA and preprocessing.
unique_values_categorical = {}
# Select columns that are 'object' (strings) or 'category' type
categorical_cols = df_raw.select_dtypes(include=['object', 'category']).columns.tolist()

# Make sure the target column isn't accidentally included if it's an object type
if target_column_name in categorical_cols:
    categorical_cols.remove(target_column_name)
    
for col in categorical_cols:
    # df_raw[col].dropna().unique() ensures we don't include NaN if it's treated as a category
    unique_values_categorical[col] = [str(val) for val in df_raw[col].dropna().unique().tolist()]


unique_values_file_path = os.path.join(local_artifacts_dir, "categorical_features_unique_values.json")
try:
    with open(unique_values_file_path, "w") as f:
        json.dump(unique_values_categorical, f, indent=4)
    mlflow.log_artifact(unique_values_file_path, artifact_path="dataset_description")
    print(f"Unique values for categorical features logged to MLflow artifacts: {unique_values_file_path}")
except Exception as e:
    print(f"Error logging unique categorical values: {e}")

# --- 7. Log Dataset Overview Parameters ---
# Let's log some general stats about the dataset as MLflow parameters.
# This gives a quick overview in the MLflow UI for this run.
print("\nLogging dataset overview parameters to MLflow...")
try:
    dataset_params = {
        "dataset_name": "UCI Adult Income",
        "source_repository_id": adult_dataset_info.metadata.get('uci_id', 2),
        "num_rows_raw": df_raw.shape[0],
        "num_columns_raw_total": df_raw.shape[1],
        "num_features_raw": X_raw.shape[1],
        "num_target_columns_raw": y_raw.shape[1],
        "column_names_raw": df_raw.columns.tolist(),
        "numerical_columns_count_raw": df_raw.select_dtypes(include='number').shape[1],
        "categorical_columns_count_raw": len(categorical_cols),
        "categorical_column_names_raw": categorical_cols,
        "numerical_column_names_raw": df_raw.select_dtypes(include='number').columns.tolist(),
        "missing_values_total_raw": int(df_raw.isnull().sum().sum()),
        "target_column_name": target_column_name,
        "target_unique_values_count": df_raw[target_column_name].nunique(),
        "target_unique_values_list": [str(val) for val in df_raw[target_column_name].unique().tolist()],
        "target_value_counts": {str(k): v for k, v in df_raw[target_column_name].value_counts().to_dict().items()},
        "target_value_counts_percentage": {str(k): v for k, v in df_raw[target_column_name].value_counts(normalize=True).to_dict().items()}
    }
    # MLflow limits the length of parameter values. The exact limit depends on the MLflow version
    # and backend (older releases allowed 250 characters), so for long lists, consider logging a JSON artifact.
    # For this example, we'll try logging directly.
    for key, value in dataset_params.items():
        if isinstance(value, (list, dict)):
            # Convert lists/dicts to string for parameters, or log them as JSON artifacts if too long/complex
            try:
                mlflow.log_param(key, json.dumps(value))
            except TypeError: # handles cases where json.dumps might fail for complex objects
                 mlflow.log_param(key, str(value))

        else:
            mlflow.log_param(key, value)
    print("Dataset overview parameters logged.")
except Exception as e:
    print(f"Error logging dataset parameters: {e}")


# --- 8. Log a Sample of the Raw Data as an Artifact (e.g., CSV) ---
# This can be useful for quick inspection from the MLflow UI.
# Be mindful of data size; log a small sample if the dataset is large.
sample_df_path = os.path.join(local_artifacts_dir, "raw_data_sample.csv")
try:
    df_raw.head(100).to_csv(sample_df_path, index=False) # Log first 100 rows
    mlflow.log_artifact(sample_df_path, artifact_path="dataset_samples")
    print(f"Logged a sample of raw data to MLflow artifacts: {sample_df_path}")
except Exception as e:
    print(f"Error logging data sample: {e}")

# Clean up temporary local artifacts directory (optional)
# import shutil
# shutil.rmtree(local_artifacts_dir)
# print(f"Removed temporary local artifacts directory: {local_artifacts_dir}")

print(f"\nData ingestion and initial preparation run finished. Check MLflow UI for run ID: {data_ingestion_run.info.run_id}")

# --- End the current MLflow run ---
if mlflow.active_run():
    mlflow.end_run()
    print("MLflow run for Data Ingestion ended.")

🏃 View run EDA at: http://135.235.251.124/#/experiments/2/runs/a5e9075e3c794825a523a791000dcd19
🧪 View experiment at: http://135.235.251.124/#/experiments/2
Ended previous active run.
Starting new MLflow Run for Data Ingestion: Data_Ingestion_and_Initial_Prep
Run ID: ec782e2e90684de5b2811a3cbc3a6269
Fetching UCI Adult dataset...
Dataset fetched successfully.

Combined DataFrame head (first 5 rows):
   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
[output truncated]


Dataset logged as MLflow input.

UCI dataset metadata logged to MLflow artifacts: temp_data_ingestion_artifacts/uci_adult_dataset_metadata.json
UCI dataset variable info logged to MLflow artifacts: temp_data_ingestion_artifacts/uci_adult_dataset_variables.json
Unique values for categorical features logged to MLflow artifacts: temp_data_ingestion_artifacts/categorical_features_unique_values.json

Logging dataset overview parameters to MLflow...
Dataset overview parameters logged.
Logged a sample of raw data to MLflow artifacts: temp_data_ingestion_artifacts/raw_data_sample.csv

Data ingestion and initial preparation run finished. Check MLflow UI for run ID: ec782e2e90684de5b2811a3cbc3a6269
🏃 View run Data_Ingestion_and_Initial_Prep at: http://135.235.251.124/#/experiments/4/runs/ec782e2e90684de5b2811a3cbc3a6269
🧪 View experiment at: http://135.235.251.124/#/experiments/4
MLflow run for Data Ingestion ended.
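
The comment in step 7 mentions that long values may not fit into a parameter. One way to handle this is to separate short values from oversized ones before logging. The helper below is a hypothetical sketch, not part of MLflow; the 250-character cut-off is an assumed conservative value, since the real limit depends on the MLflow version and backend.

```python
import json

# Assumption: a conservative cut-off; the real limit depends on the MLflow version and backend.
MAX_PARAM_LEN = 250


def split_params(params, max_len=MAX_PARAM_LEN):
    """Return (short, oversized): values that fit in a parameter vs. artifact candidates."""
    short, oversized = {}, {}
    for key, value in params.items():
        # Serialize containers as JSON so the stored text stays machine-readable.
        text = json.dumps(value) if isinstance(value, (list, dict)) else str(value)
        if len(text) <= max_len:
            short[key] = text
        else:
            oversized[key] = value
    return short, oversized


# Made-up example values in the spirit of the dataset overview above.
example = {
    "num_rows_raw": 48842,
    "categorical_column_names_raw": ["workclass", "education"],
    "column_names_raw": [f"column_{i}" for i in range(100)],
}
short, oversized = split_params(example)
print(sorted(short))
print(sorted(oversized))
```

Inside an active run, `short` could be passed to `mlflow.log_params(short)`, and each oversized entry could be stored with `mlflow.log_dict({key: value}, f"params/{key}.json")`.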


# MLflow setup

In [4]:
import os
import mlflow

MLFLOW_USERNAME = os.getenv('MLFLOW_TRACKING_USERNAME')
MLFLOW_PASSWORD = os.getenv('MLFLOW_TRACKING_PASSWORD')
MLFLOW_TRACKING_URI = os.getenv('MLFLOW_TRACKING_URI')

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
mlflow.start_run(run_name="EDA")

<ActiveRun: >

# Data Loading

In [5]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
X = adult.data.features 
y = adult.data.targets 
  
# metadata 
print(adult.metadata) 
  
# variable information 
print(adult.variables) 


{'uci_id': 2, 'name': 'Adult', 'repository_url': 'https://archive.ics.uci.edu/dataset/2/adult', 'data_url': 'https://archive.ics.uci.edu/static/public/2/data.csv', 'abstract': 'Predict whether annual income of an individual exceeds $50K/yr based on census data. Also known as "Census Income" dataset. ', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 48842, 'num_features': 14, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Income', 'Education Level', 'Other', 'Race', 'Sex'], 'target_col': ['income'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1996, 'last_updated': 'Tue Sep 24 2024', 'dataset_doi': '10.24432/C5XW20', 'creators': ['Barry Becker', 'Ronny Kohavi'], 'intro_paper': None, 'additional_info': {'summary': "Extraction was done by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the fol

# Data Combination

In [6]:
import pandas as pd

# Combine features and target into one DataFrame
df = pd.concat([X, y], axis=1)
print("Data sample after combining X and y:")
df


Data sample after combining X and y:


| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48837 | 39 | Private | 215419 | Bachelors | 13 | Divorced | Prof-specialty | Not-in-family | White | Female | 0 | 0 | 36 | United-States | <=50K. |
| 48838 | 64 | NaN | 321403 | HS-grad | 9 | Widowed | NaN | Other-relative | Black | Male | 0 | 0 | 40 | United-States | <=50K. |
| 48839 | 38 | Private | 374983 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K. |
| 48840 | 44 | Private | 83891 | Bachelors | 13 | Divorced | Adm-clerical | Own-child | Asian-Pac-Islander | Male | 5455 | 0 | 40 | United-States | <=50K. |


In [8]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object

In [7]:
int_columns = df.select_dtypes(include='int64').columns
df[int_columns] = df[int_columns].astype('float64')


In [10]:
df.dtypes

age               float64
workclass          object
fnlwgt            float64
education          object
education-num     float64
marital-status     object
occupation         object
relationship       object
race               object
sex                object
capital-gain      float64
capital-loss      float64
hours-per-week    float64
native-country     object
income             object
dtype: object

In [8]:
target_column = y.columns[0]

# Create a Dataset object from the DataFrame
dataset = mlflow.data.from_pandas(df, name="UCI Adult Income", targets=target_column)

mlflow.log_input(dataset, context="dataset")

In [9]:
import json

# Define the directory name
metadata_dir = "../data/metadata"

# Create the directory if it doesn't exist
os.makedirs(metadata_dir, exist_ok=True)

metadata_file_path = os.path.join(metadata_dir, "adult_metadata.json")
with open(metadata_file_path, "w") as f:
    json.dump(adult.metadata, f, indent=2)


unique_values = {}
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    unique_values[col] = df[col].dropna().unique().tolist()


unique_values_file_path = os.path.join(metadata_dir, "unique_values.json")
with open(unique_values_file_path, "w") as f:
    json.dump(unique_values, f, indent=2)


In [10]:
mlflow.log_artifact(unique_values_file_path, artifact_path="metadata")
mlflow.log_artifact(metadata_file_path, artifact_path="metadata")

# Initial Data Inspection

In [11]:
# Check the dimensions and basic info
print("Dataset shape:", df.shape)
df.info()
print(df.describe())

# For categorical columns, you can view unique values as well
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    print(f"Unique values for {col}: {df[col].unique()}")


Dataset shape: (48842, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             48842 non-null  float64
 1   workclass       47879 non-null  object 
 2   fnlwgt          48842 non-null  float64
 3   education       48842 non-null  object 
 4   education-num   48842 non-null  float64
 5   marital-status  48842 non-null  object 
 6   occupation      47876 non-null  object 
 7   relationship    48842 non-null  object 
 8   race            48842 non-null  object 
 9   sex             48842 non-null  object 
 10  capital-gain    48842 non-null  float64
 11  capital-loss    48842 non-null  float64
 12  hours-per-week  48842 non-null  float64
 13  native-country  48568 non-null  object 
 14  income          48842 non-null  object 
dtypes: float64(6), object(9)
memory usage: 5.6+ MB
                age        fnlwgt  education-nu

### 🔍 Problem:
The unique values of the `income` column contain four labels instead of the expected two:
```
['<=50K', '>50K', '<=50K.', '>50K.']
```

The **same labels** appear both with and without a **trailing period (`.`)**.

---

### 📌 Cause:
The UCI Adult dataset comes in two files:
- **`adult.data`** (no headers, training data)
- **`adult.test`** (starts with a header/comment line, test data)

In the `adult.test` file, **labels have a period** at the end, i.e., `'>50K.'` and `'<=50K.'`.

---

### ⚠️ Why It Matters:
- Your model might treat `'<=50K'` and `'<=50K.'` as **different classes**, which leads to:
  - Incorrect label counts
  - Misleading model evaluation
  - Skewed visualizations

---

### ✅ Solution:
Standardize the labels early during preprocessing:

In [12]:
# Strip whitespace, then remove only the trailing period from the income labels
df['income'] = df['income'].str.strip().str.rstrip('.')
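
To see the effect of the clean-up in isolation, here is a small sketch on a made-up Series that mimics the mixed label formats. It uses `str.rstrip('.')`, which removes only trailing periods; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up labels mimicking the mix of adult.data and adult.test formats.
income = pd.Series(["<=50K", ">50K", "<=50K.", ">50K.", " <=50K"])

# Before cleaning, every spelling variant counts as its own class.
n_before = income.nunique()

# Strip surrounding whitespace, then remove only a trailing period.
cleaned = income.str.strip().str.rstrip(".")

print(n_before)
print(sorted(cleaned.unique()))
print(cleaned.value_counts().to_dict())
```

After cleaning, only the two intended classes remain, so label counts and class balance can be computed reliably.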


# Handling Missing Values

In [13]:
import numpy as np

# Replace '?' with np.nan for consistent missing value notation
df.replace('?', np.nan, inplace=True)

# Check the number of missing values per column
print("Missing values count:")
print(df.isnull().sum())

# Calculate percentage of missing values (optional)
missing_percent = df.isnull().mean() * 100
print("Missing values percentage per column:")
print(missing_percent)


Missing values count:
age                  0
workclass         2799
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        2809
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     857
income               0
dtype: int64
Missing values percentage per column:
age               0.000000
workclass         5.730724
fnlwgt            0.000000
education         0.000000
education-num     0.000000
marital-status    0.000000
occupation        5.751198
relationship      0.000000
race              0.000000
sex               0.000000
capital-gain      0.000000
capital-loss      0.000000
hours-per-week    0.000000
native-country    1.754637
income            0.000000
dtype: float64
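
The counts above are higher than what `df.info()` reported earlier (for example, `workclass` had 47879 non-null values out of 48842, i.e. 963 missing, versus 2799 here). A likely explanation is that `'?'` placeholders are ordinary strings and only become visible to `isnull()` after being replaced with `NaN`. The toy frame below, with made-up values, sketches this behavior.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): one real NaN and two '?' placeholders.
toy = pd.DataFrame({
    "workclass": ["Private", "?", np.nan, "State-gov"],
    "occupation": ["Sales", "?", "Tech-support", "Sales"],
})

# '?' is an ordinary string, so isnull() does not count it yet.
before = toy.isnull().sum().to_dict()

# After the replacement, placeholders and real NaNs are counted together.
toy = toy.replace("?", np.nan)
after = toy.isnull().sum().to_dict()
percent = (toy.isnull().mean() * 100).to_dict()

print(before)
print(after)
print(percent)
```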


In [14]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection.
# The inplace form triggers a pandas FutureWarning and will stop working in pandas 3.0.
for col in categorical_cols:
    df[col] = df[col].fillna('Unknown')


## Display and log dataset metadata

In [15]:
# Log basic info
mlflow.log_params({
    "dataset_name": "adult",
    "no_of_cols": df.shape[1],
    "no_of_rows": df.shape[0],
    "columns": ','.join(df.columns.tolist()),
    "numerical_columns_count": df.select_dtypes(include='number').shape[1],
    "missing_values_total": int(df.isnull().sum().sum()),
    "target_column": "income",
    "categorical_columns_count": len(categorical_cols),
    "categorical_columns": ','.join(categorical_cols),
    "numerical_columns": ','.join(df.select_dtypes(include='number').columns.tolist()),
    "target_column_unique_values": df['income'].nunique(),
    "target_column_unique_values_list": ','.join(df['income'].unique().tolist()),
    "target_column_value_counts": df['income'].value_counts().to_dict(),
    "target_column_value_counts_percentage": df['income'].value_counts(normalize=True).to_dict()
})


# Phase 2: Exploratory Data Analysis (EDA) with MLflow Artifact Logging

Exploratory Data Analysis is a crucial step to understand the dataset's characteristics, distributions, correlations, and potential issues. A key part of EDA involves generating visualizations. Traditionally, these plots might live temporarily in a notebook's output or be manually saved to local folders. With MLflow, we can automatically log these plots as artifacts directly associated with a specific experiment run, preserving the visual insights alongside the code and parameters that generated them.

Let's start a new MLflow run specifically for our EDA phase. This helps keep the logs organized.

In [25]:
# --- Notebook Cell: Start EDA Run ---
# Ensure MLflow tracking URI and credentials are set from the initial setup cell
# Assumes 'df_raw' DataFrame is available from the previous Data Ingestion phase.
# Assumes 'target_column_name' and 'categorical_cols' were defined earlier.

# If a previous run is active, end it first.
if mlflow.active_run():
    mlflow.end_run()
    print("Ended previous active run.")

# Start a new run dedicated to EDA
eda_run = mlflow.start_run(run_name="Exploratory_Data_Analysis")
print(f"Starting new MLflow Run for EDA: {eda_run.data.tags.get('mlflow.runName')}")
print(f"Run ID: {eda_run.info.run_id}")

# For convenience, let's use a shorter variable name for our DataFrame in this phase
df = df_raw # Or df = df_cleaned if you performed cleaning steps

# Log a parameter indicating the data state used for EDA
mlflow.log_param("eda_data_source", "raw_combined") # Or "cleaned" if applicable

🏃 View run EDA at: http://135.235.251.124/#/experiments/2/runs/806585c8a8d545e191b53fb80668533a
🧪 View experiment at: http://135.235.251.124/#/experiments/2
Ended previous active run.
Starting new MLflow Run for EDA: Exploratory_Data_Analysis
Run ID: 0ad8a52b62274b879be38a9253ed0923


'raw_combined'

Now, let's generate various plots and log them using `mlflow.log_figure()`. This function takes a figure object (here the current Matplotlib figure, obtained with `plt.gcf()`) and saves it as an image artifact at the specified path within the run's artifact store (our Azure Blob Storage).
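
Because `mlflow.log_figure()` accepts any figure object, an alternative to `plt.gcf()` is to create the figure explicitly and pass it around. The sketch below assumes a hypothetical helper named `make_histogram` and a handful of sample ages; the MLflow call is left as a comment because it needs an active run.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt


def make_histogram(values, column_name):
    """Build a histogram on its own Figure and return it (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.hist(values, bins=30, color="skyblue", edgecolor="black")
    ax.set_title(f"Distribution of {column_name}")
    ax.set_xlabel(column_name)
    ax.set_ylabel("Frequency")
    fig.tight_layout()
    return fig


fig = make_histogram([39, 50, 38, 53, 28], "age")
# Inside an active run, the figure object can be logged explicitly:
# mlflow.log_figure(fig, "eda_plots/numerical_distributions/hist_age.png")
plt.close(fig)  # release the figure once it has been logged
```

Holding an explicit `fig` reference avoids depending on which figure happens to be "current", which can matter when several plots are built in the same cell.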

## **1. Visualizing Numerical Feature Distributions (Histograms)**

In [26]:
# --- Notebook Cell: Numerical Histograms ---
import matplotlib.pyplot as plt
import seaborn as sns
import mlflow

print("Generating and logging histograms for numerical features...")
sns.set(style="whitegrid")

# Select only numerical columns (excluding potential IDs if necessary)
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
mlflow.log_param("numerical_features_for_eda", list(numerical_cols))

for col in numerical_cols:
    plt.figure(figsize=(8, 4))
    sns.histplot(df[col], bins=30, kde=True, color='skyblue', edgecolor='black')
    plt.title(f"Distribution of {col}", fontsize=14, fontweight='bold')
    plt.xlabel(col)
    plt.ylabel("Frequency")
    plt.tight_layout()

    # Log the figure directly to MLflow under a specific sub-directory
    # Using f-strings makes organizing artifacts easy
    artifact_path = f"eda_plots/numerical_distributions/hist_{col}.png"
    mlflow.log_figure(plt.gcf(), artifact_path)
    print(f"  Logged histogram: {artifact_path}")

    # plt.show() # Display the plot in the notebook (optional)
    plt.close() # Close the figure to free memory, crucial in loops!

print("Finished logging histograms.")

Generating and logging histograms for numerical features...
  Logged histogram: eda_plots/numerical_distributions/hist_age.png
  Logged histogram: eda_plots/numerical_distributions/hist_fnlwgt.png
  Logged histogram: eda_plots/numerical_distributions/hist_education-num.png
  Logged histogram: eda_plots/numerical_distributions/hist_capital-gain.png
  Logged histogram: eda_plots/numerical_distributions/hist_capital-loss.png
  Logged histogram: eda_plots/numerical_distributions/hist_hours-per-week.png
Finished logging histograms.


*   **`mlflow.log_figure(plt.gcf(), artifact_path)`**: This is the core command. `plt.gcf()` gets the current Matplotlib figure, and `artifact_path` defines where it will be stored within the MLflow run's artifacts (e.g., `eda_plots/numerical_distributions/hist_age.png`).
*   **`plt.close()`**: Closing the figure after logging is important, especially within loops, to prevent plots from consuming excessive memory or interfering with subsequent plots.
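
The artifact paths above are built directly from column names, which works for names like `education-num`. For datasets whose column names contain spaces or slashes, a small sanitizing step may help keep artifact paths predictable. The helper below is a hypothetical sketch using only the standard library; the example column names are made up.

```python
import re


def safe_artifact_name(column_name):
    """Turn a column name into a file-name-friendly token (hypothetical helper)."""
    # Replace runs of anything other than letters, digits, dots, underscores, and hyphens.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", column_name.strip())
    # Drop leading/trailing underscores; fall back to a placeholder for empty names.
    return cleaned.strip("_") or "unnamed"


for col in ["education-num", "hours per week", "capital/gain (USD)"]:
    print(f"eda_plots/numerical_distributions/hist_{safe_artifact_name(col)}.png")
```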

## **2. Visualizing Numerical Feature Distributions (Boxplots)**

Boxplots help identify outliers and understand the spread.

In [27]:
# --- Notebook Cell: Numerical Boxplots ---
print("Generating and logging boxplots for numerical features...")

for col in numerical_cols: # Reuse numerical_cols from previous cell
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=df[col], color='lightgreen', linewidth=1.5)
    plt.title(f"Boxplot for {col}", fontsize=14, fontweight='bold')
    plt.xlabel(col)
    plt.tight_layout()

    # Log the figure directly to MLflow
    artifact_path = f"eda_plots/numerical_distributions/boxplot_{col}.png"
    mlflow.log_figure(plt.gcf(), artifact_path)
    print(f"  Logged boxplot: {artifact_path}")

    # plt.show() # Optional display
    plt.close()

print("Finished logging boxplots.")

Generating and logging boxplots for numerical features...
  Logged boxplot: eda_plots/numerical_distributions/boxplot_age.png
  Logged boxplot: eda_plots/numerical_distributions/boxplot_fnlwgt.png
  Logged boxplot: eda_plots/numerical_distributions/boxplot_education-num.png
  Logged boxplot: eda_plots/numerical_distributions/boxplot_capital-gain.png
  Logged boxplot: eda_plots/numerical_distributions/boxplot_capital-loss.png
  Logged boxplot: eda_plots/numerical_distributions/boxplot_hours-per-week.png
Finished logging boxplots.
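
The individual points a boxplot draws beyond its whiskers conventionally follow the 1.5 × IQR rule, which is the default whisker setting in seaborn and Matplotlib. As a rough illustration, the sketch below computes those limits with the standard library on made-up values; the function name is hypothetical.

```python
from statistics import quantiles


def iqr_outlier_bounds(values, whisker=1.5):
    """Return (lower, upper) limits beyond which a point counts as an outlier."""
    # method="inclusive" interpolates linearly between the observed data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr


hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up values with one extreme point
lower, upper = iqr_outlier_bounds(hours)
outliers = [v for v in hours if v < lower or v > upper]
print(lower, upper, outliers)
```

Counts of such points per column could also be logged to the run, for instance with `mlflow.log_metric`, to complement the plots.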


## **3. Visualizing Numerical Feature Correlations (Heatmap)**

Understanding correlations is vital for feature selection and modeling.

In [28]:
# --- Notebook Cell: Correlation Heatmap ---
print("Generating and logging correlation heatmap...")

plt.figure(figsize=(12, 10))
# Recalculate correlation matrix on the numerical columns identified
corr_matrix = df[numerical_cols].corr()
sns.heatmap(
    corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5,
    square=True, cbar_kws={"shrink": .8}
)
plt.title("Correlation Heatmap of Numerical Features", fontsize=16, fontweight='bold')
plt.tight_layout()

# Log the figure directly to MLflow
artifact_path = "eda_plots/correlation_heatmap.png"
mlflow.log_figure(plt.gcf(), artifact_path)
print(f"  Logged heatmap: {artifact_path}")

# plt.show() # Optional display
plt.close()

print("Finished logging correlation heatmap.")

Generating and logging correlation heatmap...
  Logged heatmap: eda_plots/correlation_heatmap.png
Finished logging correlation heatmap.
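
A heatmap is easy to scan visually, but a ranked list of the strongest pairs can be handy for feature selection. The sketch below, with a hypothetical `top_correlations` helper and a made-up toy frame, extracts the column pairs with the largest absolute Pearson correlation.

```python
from itertools import combinations

import pandas as pd


def top_correlations(frame, k=3):
    """Return the k column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Visit each unordered pair once (the upper triangle of the matrix).
    pairs = [(a, b, float(corr.loc[a, b])) for a, b in combinations(corr.columns, 2)]
    return sorted(pairs, key=lambda pair: abs(pair[2]), reverse=True)[:k]


# Made-up data: b is exactly 2 * a, while c tends to move in the opposite direction.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 3, 4, 1, 2]})
top = top_correlations(toy)
for a, b, r in top:
    print(f"{a} vs {b}: {r:+.2f}")
```

In the notebook, the same helper could be applied as `top_correlations(df[numerical_cols])`.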


## **4. Visualizing Categorical Feature Distributions**

Let's see the counts for each category in our categorical features.

In [29]:
# --- Notebook Cell: Categorical Distributions ---
import matplotlib.pyplot as plt # Ensure imported if in new session
import seaborn as sns         # Ensure imported
import mlflow               # Ensure imported

print("Generating and logging distributions for categorical features...")
sns.set(style="whitegrid")

# Ensure 'categorical_cols' is defined (likely from data ingestion phase)
# Example: categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
# Make sure the target column isn't included if it's categorical
# if target_column_name in categorical_cols:
#     categorical_cols.remove(target_column_name)

mlflow.log_param("categorical_features_for_eda", list(categorical_cols))


for col in categorical_cols:
    plt.figure(figsize=(10, 5))
    # Recent seaborn versions deprecate `palette` without `hue`, so hue is mapped to the same column
    sns.countplot(
        data=df,
        y=col, # Horizontal bars keep long category names readable
        order=df[col].value_counts().index, # Order by frequency
        palette="viridis",
        hue=col, # One color per category
        legend=False # The legend would only repeat the axis labels
    )
    plt.title(f"Distribution of {col}", fontsize=14, fontweight='bold')
    plt.xlabel("Count", fontsize=12)
    plt.ylabel(col, fontsize=12)
    # plt.xticks(rotation=45) # Not needed if using y=col
    plt.tight_layout()

    # Log the figure directly to MLflow
    artifact_path = f"eda_plots/categorical_distributions/dist_{col}.png"
    mlflow.log_figure(plt.gcf(), artifact_path)
    print(f"  Logged categorical distribution: {artifact_path}")

    # plt.show() # Optional display
    plt.close()

print("Finished logging categorical distributions.")

Generating and logging distributions for categorical features...
  Logged categorical distribution: eda_plots/categorical_distributions/dist_workclass.png
  Logged categorical distribution: eda_plots/categorical_distributions/dist_education.png
  Logged categorical distribution: eda_plots/categorical_distributions/dist_marital-status.png
  Logged categorical distribution: eda_plots/categorical_distributions/dist_occupation.png
  Logged categorical distribution: eda_plots/categorical_distributions/dist_relationship.png
  Logged categorical distribution: eda_plots/categorical_distributions/dist_race.png
  Logged categorical distribution: eda_plots/categorical_distributions/dist_sex.png
  Logged categorical distribution: eda_plots/categorical_distributions/dist_native-country.png
  Logged categorical distribution: eda_plots/categorical_distributions/dist_income.png
Finished logging categorical distributions.


## **5. Visualizing Categorical Features vs. Target Variable**

Understanding how categorical features relate to the target variable (`income`) is key for predictive modeling.

In [30]:
# --- Notebook Cell: Categorical vs Target ---
print("Generating and logging categorical features vs target...")
sns.set(style="whitegrid")

# Ensure 'target_column_name' is defined (e.g., 'income')
# target_col = target_column_name
target_col = 'income' # Explicitly set if not using variable from ingestion


if target_col in df.columns:
    mlflow.log_param("target_column_for_eda_comparison", target_col)
    for col in categorical_cols: # Reuse categorical_cols
        plt.figure(figsize=(12, 6)) # Adjusted size
        sns.countplot(
            data=df,
            y=col, # Again, using y=col often better
            hue=target_col, # Color bars based on income level
            order=df[col].value_counts().index, # Order categories by frequency
            palette="Set1" # Use a different palette for contrast
        )
        plt.title(f"{col} vs {target_col}", fontsize=14, fontweight='bold')
        plt.xlabel("Count", fontsize=12)
        plt.ylabel(col, fontsize=12)
        plt.legend(title=target_col, loc='center right', bbox_to_anchor=(1.25, 0.5)) # Place legend outside the axes
        plt.tight_layout(rect=[0, 0, 1, 1]) # rect is the default full figure; shrink its right edge if the legend gets clipped

        # Log the figure directly to MLflow
        artifact_path = f"eda_plots/categorical_vs_target/{col}_vs_{target_col}.png"
        mlflow.log_figure(plt.gcf(), artifact_path)
        print(f"  Logged categorical vs target: {artifact_path}")

        # plt.show() # Optional display
        plt.close()
else:
    print(f"Target column '{target_col}' not found in DataFrame. Skipping comparison plots.")
    mlflow.log_param("target_comparison_skipped", f"Column '{target_col}' not found.")

print("Finished logging categorical vs target plots.")


Generating and logging categorical features vs target...
  Logged categorical vs target: eda_plots/categorical_vs_target/workclass_vs_income.png
  Logged categorical vs target: eda_plots/categorical_vs_target/education_vs_income.png
  Logged categorical vs target: eda_plots/categorical_vs_target/marital-status_vs_income.png
  Logged categorical vs target: eda_plots/categorical_vs_target/occupation_vs_income.png
  Logged categorical vs target: eda_plots/categorical_vs_target/relationship_vs_income.png
  Logged categorical vs target: eda_plots/categorical_vs_target/race_vs_income.png
  Logged categorical vs target: eda_plots/categorical_vs_target/sex_vs_income.png
  Logged categorical vs target: eda_plots/categorical_vs_target/native-country_vs_income.png


  Logged categorical vs target: eda_plots/categorical_vs_target/income_vs_income.png
Finished logging categorical vs target plots.
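
Count plots show absolute numbers, which makes small categories hard to compare. A row-normalized cross-tabulation gives the share of each income class per category instead. This is a small sketch on made-up rows; only the column names `sex` and `income` are borrowed from the dataset.

```python
import pandas as pd

# Made-up rows for illustration; the real notebook would use df.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female", "Male"],
    "income": [">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

# normalize="index" divides each row by its total, so every row sums to 1.
rates = pd.crosstab(toy["sex"], toy["income"], normalize="index")
print(rates.round(2))
```

The resulting table could be logged as an artifact alongside the plots, for instance via `mlflow.log_text(rates.to_string(), "eda_plots/sex_vs_income_rates.txt")`.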


# **Phase 3: Data Preprocessing with Scikit-learn Pipelines**

After exploring the data, the next logical step is preprocessing. This involves transforming our raw data into a format suitable for machine learning algorithms. Common steps include:

1.  **Handling Categorical Features:** Most ML models require numerical input. We need to convert categorical features (like 'workclass', 'education', 'occupation') into a numerical representation, typically using One-Hot Encoding.
2.  **Scaling Numerical Features:** Numerical features often have different ranges (e.g., 'age' vs. 'capital-gain'). Scaling them (e.g., using Standardization) ensures that features with larger values don't disproportionately influence the model.
3.  **Handling Missing Values:** Our EDA did not focus on it, but real-world datasets usually need an explicit strategy for missing data (imputation). *Note: For this example, we assume the fetched UCI dataset is relatively clean and add no imputation step. Recent scikit-learn versions of `OneHotEncoder` treat a missing value in a categorical column as its own category, but missing numerical values would need explicit imputation in a full project.*

**Why Use Scikit-learn Pipelines?**

Instead of applying these steps one after another with pandas or separate Scikit-learn transformers, we will use Scikit-learn's `Pipeline` and `ColumnTransformer`. This approach offers significant advantages:

1.  **Workflow Simplification:** It bundles multiple processing steps into a single object. You call `fit` and `transform` (or `fit_transform`) once on the pipeline, and it handles the sequence internally.
2.  **Preventing Data Leakage:** This is crucial. When using tools like `ColumnTransformer`, transformations (like scaling parameters or categories for one-hot encoding) are learned *only* from the training data during the `fit` step. When `transform` is called on the test set (or new data), it uses the *already learned* parameters, preventing information from the test set leaking into the training process. Applying transformations manually *before* splitting the data (as sometimes done with pandas) can lead to overly optimistic results.
3.  **Consistency and Reproducibility:** The pipeline ensures the exact same sequence of steps is applied during training, evaluation, and prediction.
4.  **Easier Model Persistence:** You can save the entire fitted pipeline (including preprocessing steps and the model) as a single object, simplifying deployment.
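
To make the leakage point concrete, the sketch below fits a tiny pipeline on synthetic data and checks that the scaler's learned mean comes from the training split only. The data, the step names `scaler` and `clf`, and the split sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature dataset (not the Adult data).
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 1))
y = (X[:, 0] > 50.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)  # the scaler only ever sees X_train here

learned_mean = pipe.named_steps["scaler"].mean_[0]
print("Scaler mean equals training mean:", np.isclose(learned_mean, X_train[:, 0].mean()))

# predict() reuses the stored training statistics; nothing is re-fitted on X_test.
test_predictions = pipe.predict(X_test)
```

Had the scaler been fitted on `X` before splitting, its mean and variance would include the test rows, which is exactly the leakage the pipeline avoids.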

## **Building the Preprocessing Pipeline**

Let's define the preprocessing steps using `ColumnTransformer`, which allows applying different transformations to different columns.

In [34]:
# --- Notebook Cell: Define Preprocessing Pipeline ---
# Ensure MLflow tracking URI and credentials are set from the initial setup cell
# Assumes 'df_raw' DataFrame and 'target_column_name' ('income') are available.
# Assumes 'X_raw' and 'y_raw' (features and target DataFrames from fetch_ucirepo) are available.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression # We'll add a placeholder model later
import mlflow
import os

print("Defining preprocessing steps...")

# --- 1. Identify Feature Columns ---
# We need the names of numerical and categorical columns from the *original* feature set (X_raw)
numerical_cols = X_raw.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = X_raw.select_dtypes(include=['object', 'category']).columns.tolist()

print(f"Identified {len(numerical_cols)} numerical columns: {numerical_cols}")
print(f"Identified {len(categorical_cols)} categorical columns: {categorical_cols}")

# Log these columns for reference in MLflow (optional, but good practice)
# We can start a specific run for pipeline definition or log later during training.
# Let's log them during the first training run that uses this preprocessor.

# --- 2. Create the ColumnTransformer ---
# This object will apply specific transformers to designated columns.

preprocessor = ColumnTransformer(
    transformers=[
        # ('name', transformer_object, list_of_columns_to_apply_to)
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore', drop='first', sparse_output=False), categorical_cols)
        # handle_unknown='ignore': Allows the transformer to handle categories seen in test but not train.
        # drop='first': Avoids multicollinearity by dropping one category per feature (like pd.get_dummies(drop_first=True)).
        # sparse_output=False: Returns a dense numpy array, often easier to work with downstream.
    ],
    remainder='passthrough' # In case we missed any columns, keep them. Should be empty if cols identified correctly.
)

print("\nPreprocessor defined:")
print(preprocessor)

# --- 3. (Placeholder) Define a full pipeline (Preprocessor + Model) ---
# We will integrate the actual model later, but let's see the structure.
# This pipeline object isn't fitted yet.

# Example with Logistic Regression
# placeholder_pipeline = Pipeline(steps=[
#     ('preprocessor', preprocessor),
#     ('classifier', LogisticRegression(max_iter=1000, random_state=42)) # Example model
# ])
# print("\nExample full pipeline structure:")
# print(placeholder_pipeline)

print("\nPreprocessing steps using ColumnTransformer are ready.")
# We will use the 'preprocessor' object within our main training pipeline later.

Defining preprocessing steps...
Identified 6 numerical columns: ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
Identified 8 categorical columns: ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

Preprocessor defined:
ColumnTransformer(remainder='passthrough',
                  transformers=[('num', StandardScaler(),
                                 ['age', 'fnlwgt', 'education-num',
                                  'capital-gain', 'capital-loss',
                                  'hours-per-week']),
                                ('cat',
                                 OneHotEncoder(drop='first',
                                               handle_unknown='ignore',
                                               sparse_output=False),
                                 ['workclass', 'education', 'marital-status',
                                  'occupation', 'relationship', 'race', 'sex'

### **Explanation:**

*   **Column Identification:** We re-identify numerical and categorical columns directly from `X_raw`, the original feature DataFrame. This is important because the pipeline will operate on this raw input.
*   **`ColumnTransformer`:**
    *   We define two main transformations:
        *   `('num', StandardScaler(), numerical_cols)`: Applies `StandardScaler` to all columns listed in `numerical_cols`.
        *   `('cat', OneHotEncoder(...), categorical_cols)`: Applies `OneHotEncoder` to all columns in `categorical_cols`. We use `handle_unknown='ignore'` so that categories unseen during `fit` do not raise an error at prediction time; they are encoded as all zeros. `drop='first'` mimics `pd.get_dummies(drop_first=True)`, reducing dimensionality and potential multicollinearity. Note that with `drop='first'` the dropped category is also encoded as all zeros, so an unknown category becomes indistinguishable from it. `sparse_output=False` returns a dense NumPy array.
    *   `remainder='passthrough'` ensures any columns not explicitly mentioned are kept as-is (though ideally, all feature columns should be covered by 'num' or 'cat').
*   **Placeholder Pipeline:** The commented-out section shows how this `preprocessor` would typically be combined with a classifier (like `LogisticRegression`) into a single `Pipeline` object. This is the structure we'll use for training.

# **Phase 4: Model Training and Tracking with Pipelines & Autologging**

Now, let's integrate this preprocessing logic into our model training workflow. We'll use the `Pipeline` structure and leverage `mlflow.sklearn.autolog()` for automated tracking.

**Benefits of `mlflow.sklearn.autolog()`:**

*   **Automatic Logging:** When `fit()` is called, it logs the estimator's parameters (for a pipeline, those of every step, including the final estimator), metrics computed on the training data (prefixed with `training_`), and the fitted Scikit-learn model/pipeline artifact. Metrics on held-out data are recorded only when you evaluate explicitly after fitting, for example with `.score(X_test, y_test)` while the run is still active.
*   **Reduced Boilerplate:** Significantly reduces the amount of manual `mlflow.log_param()` and `mlflow.log_metric()` calls needed.
*   **Model Signature & Input Example:** Can automatically log the expected input schema (signature) and an example input, which is useful for deployment and validation.
*   **Code and Environment:** The logged model includes environment files (such as `conda.yaml` and `requirements.txt`), and MLflow tags the run with source information such as the entry point and, when available, the Git commit.

## **Training a Single Model (Logistic Regression) with Pipeline and Autolog**

In [None]:
# --- Notebook Cell: Train Single Model with Pipeline & Autolog ---
# Make sure MLflow environment variables and tracking URI are set.
# Assumes 'X_raw', 'y_raw', 'preprocessor' are defined from previous cells.

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report # For potential manual checks
import mlflow
import mlflow.sklearn
import os
import numpy as np # For potential type checks

# --- 1. Prepare Data (Map Target and Split) ---
# Ensure the target variable is binary (0/1) BEFORE splitting.
# Handle potential variations like '.' suffixes found in raw data.
print("Preparing target variable and splitting data...")
try:
    # Clean potential variations and map to 0/1
    y_mapped = y_raw['income'].str.strip().replace({'<=50K.': '<=50K', '>50K.': '>50K'})
    y_binary = y_mapped.apply(lambda x: 1 if x == '>50K' else 0).astype(np.int32) # Ensure integer type

    target_name = 'income_binary' # Define a clear name for the processed target
    y_binary.name = target_name

    print(f"Target mapping successful. Value counts:\n{y_binary.value_counts()}")

    # Split the *raw* features and the *binary* target
    X_train, X_test, y_train, y_test = train_test_split(
        X_raw, y_binary,
        test_size=0.2,
        random_state=42,
        stratify=y_binary # Stratify helps maintain class proportion in splits
    )
    print(f"Train/Test split complete. Train shape: {X_train.shape}, Test shape: {X_test.shape}")

except Exception as e:
    print(f"Error during target mapping or data split: {e}")
    # Consider stopping execution if data prep fails
    raise

# --- 2. Define the Full Pipeline ---
# Combine the preprocessor with the chosen model.
print("\nDefining the full Scikit-learn pipeline...")
lr_model = LogisticRegression(max_iter=1000, random_state=42, solver='liblinear') # Using liblinear for potential L1 later
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor), # The ColumnTransformer defined earlier
    ('classifier', lr_model)        # The Logistic Regression model
])
print("Pipeline created:")

# --- 3. Configure MLflow and Enable Autologging ---
print("\nConfiguring MLflow and enabling autologging...")
mlflow.set_experiment("UCI Adult Income Prediction - Centralized") # Ensure consistent experiment name

# Enable autologging
# - log_input_examples=True: Records a sample of training data.
# - log_model_signatures=True: Infers the model input/output schema.
# - registered_model_name: Optionally register the logged model in MLflow Model Registry.
mlflow.sklearn.autolog(
    log_input_examples=True,
    log_model_signatures=True,
    # registered_model_name="AdultIncomeLogisticRegressionPipeline", # Uncomment to register
    disable=False # Explicitly enable (default)
)
print("MLflow autologging for Scikit-learn enabled.")

# --- 4. Train the Model within an MLflow Run ---
print("\nStarting MLflow run for training...")
# Autologging logs to the active MLflow run (it would create one itself if none were active).
run_name = "Pipeline_LogisticRegression_Autolog"
with mlflow.start_run(run_name=run_name) as run:
    run_id = run.info.run_id
    print(f"MLflow Run started (ID: {run_id}, Name: {run_name}).")

    # Log the columns used by the preprocessor manually for clarity
    mlflow.log_param("numerical_features", numerical_cols)
    mlflow.log_param("categorical_features", categorical_cols)
    mlflow.log_param("target_variable_processed", target_name)

    # Fit the *entire* pipeline on the training data
    print("Fitting the pipeline...")
    pipeline.fit(X_train, y_train)
    print("Pipeline fitting complete.")

    # During fit(), autologging has already logged:
    # - The pipeline parameters (preprocessor steps and classifier params).
    # - Metrics computed on the *training* data (prefixed "training_").
    # It does not see X_test/y_test during fit(), so test metrics require an
    # explicit evaluation call while the run is still active.
    print("Evaluating model on test set (triggers autolog metric calculation)...")
    test_score = pipeline.score(X_test, y_test) # Recorded by autolog as a post-training metric
    print(f"Pipeline test accuracy score: {test_score:.4f}")

    # Autologging also logs:
    # - The fitted pipeline artifact (including preprocessor and model).
    # - The input example and model signature.

    # (Optional) Add Custom Tags or Artifacts if needed
    mlflow.set_tag("model_variant", "LogisticRegression")
    mlflow.set_tag("pipeline_description", "StandardScaler + OHE(drop_first)")
    # Example: Log classification report manually if autolog doesn't capture it as desired
    # y_pred_test = pipeline.predict(X_test)
    # report = classification_report(y_test, y_pred_test)
    # mlflow.log_text(report, "classification_report_test.txt")

    print(f"MLflow Run {run_name} finished. Check the MLflow UI for details.")

# --- 5. Disable Autologging (Good Practice) ---
# Disable if you plan to run other non-Scikit-learn code afterwards
# that you don't want autologged, or if setting up a new autolog config.
mlflow.sklearn.autolog(disable=True)
print("\nMLflow autologging disabled.")

Preparing target variable and splitting data...
Target mapping successful. Value counts:
income_binary
0    37155
1    11687
Name: count, dtype: int64
Train/Test split complete. Train shape: (39073, 14), Test shape: (9769, 14)

Defining the full Scikit-learn pipeline...
Pipeline created:

Configuring MLflow and enabling autologging...
MLflow autologging for Scikit-learn enabled.

Starting MLflow run for training...
MLflow Run started (ID: 6859ec890605440cba7f4e1a615e17e1, Name: Pipeline_LogisticRegression_Autolog).
Fitting the pipeline...
Pipeline fitting complete.
Evaluating model on test set (triggers autolog metric calculation)...
Pipeline test accuracy score: 0.8519
MLflow Run Pipeline_LogisticRegression_Autolog finished. Check the MLflow UI for details.
🏃 View run Pipeline_LogisticRegression_Autolog at: http://135.235.251.124/#/experiments/2/runs/6859ec890605440cba7f4e1a615e17e1
🧪 View experiment at: http://135.235.251.124/#/experiments/2

MLflow autologging disabled.




### **Explanation:**

1.  **Data Preparation:** We perform the critical step of mapping the target variable (`income`) to a binary format (0/1) *before* the `train_test_split`. This ensures the split happens on the final target representation. We use `stratify=y_binary` to maintain the class balance between train and test sets, which is important for classification tasks.
2.  **Full Pipeline Definition:** We create the `Pipeline` object, explicitly listing the steps: first our `preprocessor` (`ColumnTransformer`), then the `classifier` (`LogisticRegression`).
3.  **Autologging Setup:** We set the MLflow experiment name and enable `mlflow.sklearn.autolog()`. We include `log_input_examples` and `log_model_signatures` for richer tracking. Optionally, `registered_model_name` can be set to directly register the resulting model in the MLflow Model Registry upon logging.
4.  **Training within MLflow Run:**
    *   We start an MLflow run using `with mlflow.start_run(...)`.
    *   We manually log the feature lists used by the preprocessor for extra clarity using `mlflow.log_param`.
    *   The core step is `pipeline.fit(X_train, y_train)`. This single command trains the entire pipeline: the preprocessor learns scaling parameters and categories from `X_train`, transforms `X_train`, and then trains the `LogisticRegression` model on the transformed data.
    *   We call `pipeline.score(X_test, y_test)`. Autologging only sees the training data during `.fit()`, so test-set metrics are not produced automatically. An explicit evaluation call such as `.score()` (or a `sklearn.metrics` function) made afterwards, while the run is still active, is recorded as a post-training metric.
    *   Autologging captures the pipeline's parameters (e.g., `C` from Logistic Regression, `handle_unknown` from OHE), training metrics (accuracy, F1, etc., computed on the training data during `.fit()`), the test score from the `.score` call, and the fitted pipeline itself as an artifact.
    *   We add optional custom tags using `mlflow.set_tag` for easier filtering/grouping in the UI.
5.  **Disable Autologging:** It's good practice to disable autologging when you're done with the specific training block to avoid unintended logging from subsequent code.
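
The effect of `stratify` mentioned in step 1 can be checked directly. The sketch below uses a made-up label vector with roughly the same imbalance as the income target (about 24% positives) and compares the positive rate in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 760 negatives, 240 positives.
y = np.array([0] * 760 + [1] * 240)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"Positive rate in train: {train_rate:.3f}, in test: {test_rate:.3f}")
```

Without `stratify`, the two rates would differ by chance, which makes test metrics on an imbalanced target noisier.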

### **Viewing Autologged Results in MLflow UI**

After running the cell above:

1.  Go back to your MLflow UI (http://\<YOUR\_EXTERNAL\_IP\>).
2.  Navigate to the "UCI Adult Income Prediction - Centralized" experiment.
3.  Find the run named "Pipeline\_LogisticRegression\_Autolog".
4.  **Parameters:** You should see parameters logged from *both* the `preprocessor` (like `preprocessor__remainder`, `preprocessor__cat__handle_unknown`) and the `classifier` (like `classifier__C`, `classifier__max_iter`). These names follow scikit-learn's `get_params()` convention, which joins nested step names with double underscores (e.g., `classifier__`). You'll also see the manually logged `numerical_features` and `categorical_features`.
5.  **Metrics:** Autologging should have captured training metrics such as `training_accuracy_score`, `training_f1_score`, `training_precision_score`, and `training_recall_score`, all computed on `X_train` during `.fit()`. The test accuracy from the `.score(X_test, y_test)` call appears as a separate post-training metric whose name is derived from the estimator and the dataset variable (for example, `Pipeline_score_X_test`). The exact names might vary slightly based on MLflow/Scikit-learn versions.
6.  **Artifacts:**
    *   You'll find a `model` directory containing the *entire fitted pipeline*, saved with both the `sklearn` and `python_function` flavors. This artifact includes the fitted preprocessor and the trained Logistic Regression model.
    *   The model directory may also contain `input_example.json` (if `log_input_examples` was successful). The model signature is stored inside the `MLmodel` file rather than in a separate file.
    *   Any manually logged artifacts (like the classification report if you uncommented that part) would also appear here.
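
The double-underscore parameter names seen in the UI come from scikit-learn itself, which can be verified without MLflow. This is a small sketch with a cut-down preprocessor; the column lists `["age"]` and `["workclass"]` are placeholders.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Cut-down version of the notebook's preprocessor (placeholder columns).
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ],
    remainder="passthrough",
)
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# get_params(deep=True) flattens nested steps into "step__param" keys.
param_names = sorted(pipeline.get_params(deep=True))
for name in param_names:
    if name.endswith(("__C", "__handle_unknown", "__remainder")):
        print(name)
```

The same keys work with `pipeline.set_params(classifier__C=0.5)` and in hyperparameter search grids.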


# **Phase 5: Training Multiple Models with Pipelines and Autologging**

This phase shows how easily you can swap estimators within the same preprocessing framework: we define several model configurations with different hyperparameters and train each one in its own pipeline and MLflow run.
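
One design choice worth considering when many pipelines share a preprocessor: `Pipeline` does not copy its steps, so every pipeline built from the same object refits that one instance. The sketch below shows the alternative of giving each pipeline its own copy with `sklearn.base.clone`. The `StandardScaler` stand-in, the two candidate models, and the synthetic data are assumptions for illustration.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

shared_preprocessor = StandardScaler()  # stand-in for the ColumnTransformer
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

# clone() returns an unfitted copy with the same hyperparameters.
pipelines = {
    name: Pipeline(steps=[
        ("preprocessor", clone(shared_preprocessor)),
        ("classifier", estimator),
    ])
    for name, estimator in candidates.items()
}

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
for name, pipe in pipelines.items():
    pipe.fit(X, y)
    print(name, "fitted")
```

With this pattern, each logged pipeline artifact carries its own fitted preprocessor, and the template object stays unfitted.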

In [43]:
# --- Notebook Cell: Train Multiple Models with Pipelines & Autolog ---
# Ensure MLflow environment variables and tracking URI are set.
# Assumes 'X_raw', 'y_raw', 'X_train', 'X_test', 'y_train', 'y_test' are defined.
# Assumes 'numerical_cols', 'categorical_cols' are identified based on X_raw.

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC # Note: SVC can be slow without tuning/sampling
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
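# Imports repeated here so the cell also runs in a fresh session
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import mlflow
import mlflow.sklearn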

print("Setting up for multiple model training with pipelines...")

# --- 1. Define Preprocessor (if not already defined globally) ---
# Ensure the preprocessor using numerical_cols and categorical_cols from X_raw is available.
# (Assuming 'preprocessor' ColumnTransformer object exists from previous cells)
if 'preprocessor' not in locals():
    print("Re-defining preprocessor...")
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numerical_cols),
            ('cat', OneHotEncoder(handle_unknown='ignore', drop='first', sparse_output=False), categorical_cols)
        ],
        remainder='passthrough'
    )
    print("Preprocessor re-defined.")


# --- 2. Define Models Configuration ---
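# Use a dictionary for clarity: the key is a descriptive name, the value is a model instance.
# Note: recent XGBoost versions ignore `use_label_encoder` and print a "not used" warning; it can be dropped there.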
models_config = {
    "LogisticRegression_C_1_0": LogisticRegression(max_iter=2000, solver="liblinear", C=1.0, random_state=42),
    "RandomForest_n100_md10": RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1),
    "RandomForest_n200_md15": RandomForestClassifier(n_estimators=200, max_depth=15, random_state=42, n_jobs=-1),
    "XGBoost_default": XGBClassifier(use_label_encoder=False, eval_metric="logloss", random_state=42, n_jobs=-1),
    "XGBoost_lr01_md3": XGBClassifier(learning_rate=0.1, max_depth=3, use_label_encoder=False, eval_metric="logloss", random_state=42, n_jobs=-1),
    "KNN_n5_minkowski": KNeighborsClassifier(n_neighbors=5, metric='minkowski', n_jobs=-1),
    "GradientBoosting_n100_lr01_md3": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
    # Add more models or variations as needed
    # "SVC_linear": SVC(kernel='linear', probability=True, random_state=42), # Can be slow
}
print(f"Defined {len(models_config)} model configurations.")

# --- 3. Enable Autologging for the Loop ---
# Make sure it's enabled before the loop starts.
# Set registered_model_name=None if you don't want to register every single variant automatically.
mlflow.sklearn.autolog(
    log_input_examples=True,
    log_model_signatures=True,
    registered_model_name=None, # Avoid registering every model in the loop automatically
    disable=False # Ensure it's enabled
)
print("MLflow autologging enabled for the training loop.")

# --- 4. Training Loop ---
experiment_name = "UCI Adult Income Prediction - Centralized" # Or a new one like "Adult_Pipeline_Comparison"
mlflow.set_experiment(experiment_name)
print(f"Logging runs to experiment: '{experiment_name}'")

for model_name, model_instance in models_config.items():
    print(f"\n--- Training {model_name} ---")

    # Create the full pipeline for *this specific model*
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor), # Reuse the same preprocessor
        ('classifier', model_instance)  # Insert the current model
    ])

    # Start a unique run for this model pipeline
    run_name = f"Pipeline_{model_name}_Autolog"
    with mlflow.start_run(run_name=run_name) as run:
        mlflow.set_tag("model_name", model_name) # Tag with the specific model name
        mlflow.set_tag("pipeline_used", "standard_prep_v1") # Tag the preprocessing version

        # Log feature lists for reference (can be repetitive but ensures it's in each run)
        mlflow.log_param("numerical_features", numerical_cols)
        mlflow.log_param("categorical_features", categorical_cols)

        print(f"Fitting pipeline for {model_name}...")
        try:
            # Fit the pipeline
            pipeline.fit(X_train, y_train)

            # Evaluate (helps ensure autolog captures test metrics)
            test_score = pipeline.score(X_test, y_test)
            print(f"  {model_name} Test Accuracy: {test_score:.4f}")

            # Autologging handles params, metrics, model artifact for this pipeline

        except Exception as e:
            print(f"  ERROR training {model_name}: {e}")
            mlflow.set_tag("status", "failed")
            mlflow.log_param("error_message", str(e))
            # Optionally log stack trace or more details

    print(f"--- Finished {model_name} ---")

# --- 5. Disable Autologging After Loop ---
mlflow.sklearn.autolog(disable=True)
print("\nTraining loop complete. MLflow autologging disabled.")


Setting up for multiple model training with pipelines...
Defined 7 model configurations.
MLflow autologging enabled for the training loop.
Logging runs to experiment: 'UCI Adult Income Prediction - Centralized'

--- Training LogisticRegression_C_1_0 ---
Fitting pipeline for LogisticRegression_C_1_0...




  LogisticRegression_C_1_0 Test Accuracy: 0.8519
🏃 View run Pipeline_LogisticRegression_C_1_0_Autolog at: http://135.235.251.124/#/experiments/2/runs/c9f12d5d3ff740b0b145a5e83e239ac7
🧪 View experiment at: http://135.235.251.124/#/experiments/2
--- Finished LogisticRegression_C_1_0 ---

--- Training RandomForest_n100_md10 ---
Fitting pipeline for RandomForest_n100_md10...




  RandomForest_n100_md10 Test Accuracy: 0.8597
🏃 View run Pipeline_RandomForest_n100_md10_Autolog at: http://135.235.251.124/#/experiments/2/runs/747dd40a876f4addbbddc13f64703e4c
🧪 View experiment at: http://135.235.251.124/#/experiments/2
--- Finished RandomForest_n100_md10 ---

--- Training RandomForest_n200_md15 ---
Fitting pipeline for RandomForest_n200_md15...




  RandomForest_n200_md15 Test Accuracy: 0.8654
🏃 View run Pipeline_RandomForest_n200_md15_Autolog at: http://135.235.251.124/#/experiments/2/runs/5964418eb79e465dae3cadeb7edb1c38
🧪 View experiment at: http://135.235.251.124/#/experiments/2
--- Finished RandomForest_n200_md15 ---

--- Training XGBoost_default ---
Fitting pipeline for XGBoost_default...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


  XGBoost_default Test Accuracy: 0.8759
🏃 View run Pipeline_XGBoost_default_Autolog at: http://135.235.251.124/#/experiments/2/runs/65f88853841548f1a042b5a60688a3de
🧪 View experiment at: http://135.235.251.124/#/experiments/2
--- Finished XGBoost_default ---

--- Training XGBoost_lr01_md3 ---
Fitting pipeline for XGBoost_lr01_md3...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


  XGBoost_lr01_md3 Test Accuracy: 0.8700
🏃 View run Pipeline_XGBoost_lr01_md3_Autolog at: http://135.235.251.124/#/experiments/2/runs/2f6eee9cba6142ddb7506ea81c19524f
🧪 View experiment at: http://135.235.251.124/#/experiments/2
--- Finished XGBoost_lr01_md3 ---

--- Training KNN_n5_minkowski ---
Fitting pipeline for KNN_n5_minkowski...




  KNN_n5_minkowski Test Accuracy: 0.8314
🏃 View run Pipeline_KNN_n5_minkowski_Autolog at: http://135.235.251.124/#/experiments/2/runs/bf4192c152364996adaa0edfcecd20b5
🧪 View experiment at: http://135.235.251.124/#/experiments/2
--- Finished KNN_n5_minkowski ---

--- Training GradientBoosting_n100_lr01_md3 ---
Fitting pipeline for GradientBoosting_n100_lr01_md3...




  GradientBoosting_n100_lr01_md3 Test Accuracy: 0.8715
🏃 View run Pipeline_GradientBoosting_n100_lr01_md3_Autolog at: http://135.235.251.124/#/experiments/2/runs/37974f7eaf4245179a0dca3959b7c01f
🧪 View experiment at: http://135.235.251.124/#/experiments/2
--- Finished GradientBoosting_n100_lr01_md3 ---

Training loop complete. MLflow autologging disabled.


- Preprocessor Reuse: The preprocessor (`ColumnTransformer`) is defined once (or ensured to exist) outside the loop.
- Pipeline Creation Inside Loop: For each `model_instance` from `models_config`, a new `Pipeline` is created, combining the shared preprocessor with that `model_instance`.
- Unique Run Per Model: Each iteration of the loop starts a new MLflow run with a descriptive name (e.g., `Pipeline_RandomForest_n100_md10_Autolog`).
- Autologging Per Run: `autolog()` captures the parameters, metrics, and the specific fitted pipeline (e.g., preprocessor + RandomForest) for each run.
- Tagging: We add tags (`model_name`, `pipeline_used`) to each run for easier filtering and comparison in the MLflow UI.
- Error Handling: A basic `try`/`except` block catches errors while fitting an individual model, so one failure does not stop the entire loop; the error message and a status tag are logged to MLflow.
- Configuration: The `models_config` dictionary makes adding or removing models easy. `n_jobs=-1` is set on classifiers that support it, so training uses all CPU cores. `use_label_encoder=False` is passed to `XGBClassifier`; the XGBoost version used here no longer uses that parameter (hence the "are not used" warning in the output above), so it can be dropped.
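
The tags mentioned above become useful once you query runs programmatically. The following is a minimal sketch of building an MLflow search filter from tag values; `build_tag_filter` is a hypothetical helper (not part of MLflow), and it assumes the tag values contain no single quotes and that `model_name` was logged with the value shown.

```python
def build_tag_filter(**tags):
    """Build an MLflow search filter that requires every given tag value.

    Assumes tag keys are plain identifiers and values contain no single quotes.
    """
    clauses = [f"tags.{key} = '{value}'" for key, value in tags.items()]
    return " and ".join(clauses)


filter_string = build_tag_filter(model_name="RandomForest_n100_md10")
print(filter_string)  # tags.model_name = 'RandomForest_n100_md10'
```

The resulting string can be passed as `filter_string` to `mlflow.search_runs(...)` or `MlflowClient.search_runs(...)` to retrieve only the runs for one model configuration.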

# **Phase 6: Integrating TensorFlow and Deep Learning with MLflow Autologging**

Beyond Scikit-learn, MLflow offers robust support for popular deep learning frameworks like TensorFlow and PyTorch. This is crucial, as many complex problems benefit from the power of neural networks. MLflow's autologging for TensorFlow (`mlflow.tensorflow.autolog()`) simplifies the tracking of deep learning experiments significantly, capturing essential information with minimal code changes.

### **The Power of `mlflow.tensorflow.autolog()`**

When enabled, `mlflow.tensorflow.autolog()` automatically logs a wealth of information during your Keras/TensorFlow model training, including:

1.  **Model Parameters:** Hyperparameters like learning rate, batch size, number of epochs, and optimizer configurations.
2.  **Model Summary:** The architecture of your neural network.
3.  **Training Metrics:** Metrics specified in `model.compile()` (e.g., loss, accuracy) are logged for each epoch, for both training and validation sets. This allows you to visualize learning curves directly in the MLflow UI.
4.  **Callbacks:** Information about Keras callbacks used, such as `EarlyStopping` parameters.
5.  **Fitted Model:** The trained TensorFlow/Keras model is saved as an MLflow artifact, ready for deployment. The on-disk format depends on your TensorFlow/Keras and MLflow versions (for example, TensorFlow's SavedModel format or the native Keras format).
6.  **TensorBoard Logs:** If you're using TensorBoard, autologging can also capture the TensorBoard log directory.
7.  **(Optional) Model Signature & Input Example:** Similar to Scikit-learn, it can infer and log the model's input/output schema and an example input.

This comprehensive logging helps in understanding model behavior, comparing different architectures or hyperparameters, and reproducing results.
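
Because per-epoch metrics are stored as a metric history on the run, they can also be read back in code rather than only viewed in the UI. The sketch below shows one way to do that with `MlflowClient.get_metric_history`; `metric_curve` and `best_step` are hypothetical helper names, and the metric key (for example `val_loss`) is an assumption that depends on what autologging actually recorded for your run.

```python
def metric_curve(client, run_id, key):
    """Return [(step, value), ...] for one metric of a run, sorted by step.

    `client` is expected to be an mlflow.tracking.MlflowClient instance.
    """
    history = client.get_metric_history(run_id, key)
    return sorted((metric.step, metric.value) for metric in history)


def best_step(curve, higher_is_better=False):
    """Return the (step, value) pair with the best metric value."""
    pick = max if higher_is_better else min
    return pick(curve, key=lambda point: point[1])


# Example usage (requires a reachable tracking server and an existing run):
#   from mlflow.tracking import MlflowClient
#   curve = metric_curve(MlflowClient(), run_id, "val_loss")
#   print(best_step(curve))
```

Passing the client in as an argument keeps the helpers independent of how the tracking URI and credentials were configured earlier in the notebook.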

### **Practical Example: Training a Deep Neural Network**

Let's train a simple Deep Neural Network (DNN) for our UCI Adult Income prediction task using TensorFlow (Keras API) and see how `mlflow.tensorflow.autolog()` works.

For this example, we will use the preprocessed data (`X_train`, `X_test`, `y_train`, `y_test`) that was saved earlier (after one-hot encoding and scaling were applied directly). In a production scenario with pipelines, you would typically extract the transformed numerical arrays from your Scikit-learn pipeline to feed into TensorFlow.
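
As a rough illustration of that extraction step, the sketch below fits a small `ColumnTransformer` and converts its output into a dense `float32` array that Keras can consume. The toy DataFrame, the column choices, and the helper `to_dense_float32` are assumptions made for the example; they are not the preprocessor defined earlier in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame standing in for the Adult data (values are illustrative only)
df = pd.DataFrame({
    "age": [38, 52, 27, 45],
    "hours-per-week": [40, 45, 20, 60],
    "workclass": ["Private", "Self-emp", "Private", "Gov"],
})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "hours-per-week"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
])


def to_dense_float32(matrix):
    """Convert a dense or SciPy-sparse matrix into a float32 NumPy array."""
    if hasattr(matrix, "toarray"):  # one-hot output may be sparse
        matrix = matrix.toarray()
    return np.asarray(matrix, dtype=np.float32)


X_dense = to_dense_float32(preprocessor.fit_transform(df))
print(X_dense.shape, X_dense.dtype)  # (4, 5) float32
```

With a preprocessor that is already fitted on the training split, call `transform` (not `fit_transform`) on the test split so that both arrays share the same columns and scaling.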

In [None]:
# --- Notebook Cell: TensorFlow DNN Training with Autolog ---
import os
import joblib # To load our preprocessed data
import mlflow
import mlflow.tensorflow # Specific TensorFlow integration
from mlflow.models.signature import infer_signature # For model schema
import numpy as np
import tempfile
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping # Useful callback

from dotenv import load_dotenv
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    log_loss, roc_auc_score, average_precision_score,
    confusion_matrix, classification_report, ConfusionMatrixDisplay
)

# --- 1. Setup and Load Data ---
print("Setting up MLflow and loading preprocessed data...")
load_dotenv() # Load environment variables (MLFLOW_TRACKING_URI, etc.)
mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI"))
mlflow.set_experiment("UCI Adult Income Prediction - Centralized") # Consistent experiment

# Load the preprocessed data that was saved earlier
# This data has already been one-hot encoded and scaled.
try:
    X_train, X_test, y_train, y_test = joblib.load("../data/train_test_data.pkl")
    print(f"Data loaded. X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
except FileNotFoundError:
    print("ERROR: train_test_data.pkl not found. Please ensure preprocessing steps were run and data was saved.")
    # Depending on notebook flow, you might want to raise an error or stop
    raise

# Ensure y_train and y_test are 1D arrays, as Keras expects for binary classification
y_train = np.asarray(y_train).ravel()
y_test = np.asarray(y_test).ravel()
print(f"Target arrays reshaped. y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")

# --- 2. Enable MLflow Autologging for TensorFlow ---
print("\nEnabling MLflow autologging for TensorFlow...")
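# Key options (names vary between MLflow versions; check the docs for the one you use):
# - log_models: (default True) Save the trained Keras model.
# - log_every_epoch / log_every_n_steps: Control how often metrics are logged
#   (default: every epoch; older releases use every_n_iter instead).
# - registered_model_name: Optionally register the model directly.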
mlflow.tensorflow.autolog(
    log_model_signatures=True, # Log input/output schema
    log_input_examples=True,   # Log an input example
    # registered_model_name="AdultIncomeDNN", # Uncomment to register model
    disable=False # Ensure it's enabled
)
print("TensorFlow autologging enabled.")

# --- 3. Train the TensorFlow Model within an MLflow Run ---
run_name = "Deep_Neural_Network_TensorFlow_Autolog"
print(f"\nStarting MLflow run: {run_name}")

with mlflow.start_run(run_name=run_name) as run:
    run_id = run.info.run_id
    print(f"MLflow Run started (ID: {run_id}).")

    # --- A. Build the DNN Model ---
    print("Building the Keras Sequential model...")
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(X_train.shape[1],), name="input_layer"), # Input layer matching feature dimension
        tf.keras.layers.Dense(128, activation='relu', name="dense_1"),
        tf.keras.layers.Dropout(0.3, name="dropout_1"),
        tf.keras.layers.Dense(64, activation='relu', name="dense_2"),
        tf.keras.layers.Dropout(0.3, name="dropout_2"),
        tf.keras.layers.Dense(1, activation='sigmoid', name="output_layer") # Output layer for binary classification
    ])
    print("Model built successfully.")
    # Autolog will capture model.summary()

    # --- B. Compile the Model ---
    print("Compiling the model...")
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), # Autolog captures optimizer_name, learning_rate
        loss='binary_crossentropy',      # Autolog captures loss function
        metrics=['accuracy', tf.keras.metrics.AUC(name='roc_auc'), tf.keras.metrics.AUC(name='pr_auc', curve='PR')] # Autolog captures these metrics per epoch
    )
    print("Model compiled.")

    # --- C. Define EarlyStopping Callback ---
    # Autologging will also log parameters of callbacks like EarlyStopping.
    early_stop = EarlyStopping(
        monitor='val_loss',       # Metric to monitor
        patience=10,              # Number of epochs with no improvement after which training will be stopped
        verbose=1,
        restore_best_weights=True # Restores model weights from the epoch with the best value of the monitored quantity.
    )
    print(f"EarlyStopping callback configured: monitor='val_loss', patience={early_stop.patience}.")

    # --- D. Train the Model ---
    print("Training the model...")
    history = model.fit(
        X_train, y_train,
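        # Note: the test set doubles as validation data here, so EarlyStopping is tuned on it;
        # hold out a separate validation split if the test metrics must stay unbiased.
        validation_data=(X_test, y_test), # Autolog uses this for validation metrics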
        epochs=100,                       # Autolog captures epochs
        batch_size=64,                    # Autolog captures batch_size
        callbacks=[early_stop],           # Autolog logs callback info
        verbose=1                         # Set to 1 or 2 to see Keras progress
    )
    print("Model training complete.")
    # Autologging automatically logs epoch-wise metrics (loss, accuracy, val_loss, val_accuracy, etc.)
    # and the final trained model artifact.

    # --- E. (Optional) Manual Logging for Detailed Final Evaluation ---
    # While autolog captures epoch-wise metrics and the model, we might want specific
    # overall performance metrics or visualizations not covered by default.
    print("\nPerforming final predictions and logging detailed evaluation metrics...")
    y_train_prob_tf = model.predict(X_train).ravel()
    y_test_prob_tf = model.predict(X_test).ravel()
    y_train_pred_tf = (y_train_prob_tf >= 0.5).astype(int)
    y_test_pred_tf = (y_test_prob_tf >= 0.5).astype(int)

    # Helper function for logging common classification metrics
    def log_classification_metrics_manual(y_true, y_pred, y_prob, prefix):
        mlflow.log_metric(f"{prefix}_accuracy_final", accuracy_score(y_true, y_pred))
        mlflow.log_metric(f"{prefix}_precision_final", precision_score(y_true, y_pred))
        mlflow.log_metric(f"{prefix}_recall_final", recall_score(y_true, y_pred))
        mlflow.log_metric(f"{prefix}_f1_score_final", f1_score(y_true, y_pred))
        mlflow.log_metric(f"{prefix}_log_loss_final", log_loss(y_true, y_prob))
        mlflow.log_metric(f"{prefix}_roc_auc_final", roc_auc_score(y_true, y_prob)) # This will be overall ROC AUC
        mlflow.log_metric(f"{prefix}_pr_auc_final", average_precision_score(y_true, y_prob)) # Overall PR AUC

        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        mlflow.log_metric(f"{prefix}_true_negative_final", tn)
        mlflow.log_metric(f"{prefix}_false_positive_final", fp)
        mlflow.log_metric(f"{prefix}_false_negative_final", fn)
        mlflow.log_metric(f"{prefix}_true_positive_final", tp)

    # Log final metrics for train and test sets
    log_classification_metrics_manual(y_train, y_train_pred_tf, y_train_prob_tf, "train")
    log_classification_metrics_manual(y_test, y_test_pred_tf, y_test_prob_tf, "test")
    print("Final evaluation metrics (accuracy, precision, recall, F1, AUCs) logged manually.")

    # Log confusion matrix plots manually
    for prefix_cm, y_true_cm, y_pred_cm in [
        ("Train_Final", y_train, y_train_pred_tf),
        ("Test_Final", y_test, y_test_pred_tf)
    ]:
        cm = confusion_matrix(y_true_cm, y_pred_cm)
        disp = ConfusionMatrixDisplay(confusion_matrix=cm)
        disp.plot(cmap=plt.cm.Blues)
        plt.title(f"Confusion Matrix - {prefix_cm}")
        # log_figure uploads the figure under a stable artifact name; unlike a
        # NamedTemporaryFile it leaves no stray temp file and also works on Windows.
        mlflow.log_figure(
            disp.figure_,
            f"evaluation_plots/confusion_matrices/confusion_matrix_{prefix_cm.lower()}.png"
        )
        plt.close(disp.figure_) # Close the plot to free memory
    print("Confusion matrix plots logged as artifacts.")

    # Log classification reports manually
    for prefix_cr, y_true_cr, y_pred_cr in [
        ("train_final", y_train, y_train_pred_tf),
        ("test_final", y_test, y_test_pred_tf)
    ]:
        report = classification_report(y_true_cr, y_pred_cr)
        # log_text writes the string directly to a named artifact, so no temp file is needed
        mlflow.log_text(
            f"Classification Report - {prefix_cr.capitalize()}\n\n{report}",
            f"evaluation_reports/classification_reports/classification_report_{prefix_cr}.txt"
        )
    print("Classification reports logged as artifacts.")

    # (Optional) Log learning curves plot
    if history:
        plt.figure(figsize=(12, 5))
        # Plot training & validation accuracy values
        plt.subplot(1, 2, 1)
        if 'accuracy' in history.history and 'val_accuracy' in history.history:
            plt.plot(history.history['accuracy'])
            plt.plot(history.history['val_accuracy'])
            plt.title('Model Accuracy')
            plt.ylabel('Accuracy')
            plt.xlabel('Epoch')
            plt.legend(['Train', 'Validation'], loc='upper left')

        # Plot training & validation loss values
        plt.subplot(1, 2, 2)
        if 'loss' in history.history and 'val_loss' in history.history:
            plt.plot(history.history['loss'])
            plt.plot(history.history['val_loss'])
            plt.title('Model Loss')
            plt.ylabel('Loss')
            plt.xlabel('Epoch')
            plt.legend(['Train', 'Validation'], loc='upper left')
        
        plt.tight_layout()
        # log_figure stores the current figure under a stable artifact name
        mlflow.log_figure(plt.gcf(), "evaluation_plots/learning_curves/learning_curves.png")
        plt.close()
        print("Learning curves plot logged as artifact.")

    mlflow.set_tag("model_type", "TensorFlow_DNN")
    print(f"MLflow Run {run_name} (ID: {run_id}) finished successfully.")

# --- 4. Disable Autologging (Good Practice) ---
mlflow.tensorflow.autolog(disable=True)
print("\nMLflow autologging for TensorFlow disabled.")

Setting up MLflow and loading preprocessed data...
Data loaded. X_train shape: (39073, 102), y_train shape: (39073,)
Target arrays reshaped. y_train shape: (39073,), y_test shape: (9769,)

Enabling MLflow autologging for TensorFlow...
TensorFlow autologging enabled.

Starting MLflow run: Deep_Neural_Network_TensorFlow_Autolog
MLflow Run started (ID: 743e45a2945f4711910f9896afcfdf70).
Building the Keras Sequential model...
Model built successfully.
Compiling the model...
Model compiled.
EarlyStopping callback configured: monitor='val_loss', patience=10.
Training the model...

Epoch 1/100
611/611 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.8354 - loss: 0.3457 - pr_auc: 0.4906 - roc_auc: 0.8219 - val_accuracy: 0.8999 - val_loss: 0.2195 - val_pr_auc: 0.7547 - val_roc_auc: 0.9419
Epoch 2/100
611/611 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.8969 - loss: 0.2251 - pr_auc: 0.7341 - roc_auc: 0.9366 - val_accuracy: 0.9030 - val_loss: 0.2121 - val_pr_auc: 0.7690 - val_roc_auc: 0.9445
Epoch 3/100
611/611 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8996 - loss: 0.2148 - pr_auc: 0.7639 - roc_auc: 0.9415 - val_accuracy: 0.9011 - val_loss: 0.2092 - val_pr_auc: 0.7735 - val_roc_auc: 0.9459
Epoch 4/100
611/611 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9019 - loss: 0.2115 - pr_auc: 0.7689 - roc_auc: 0.9439 - val_accuracy: 0.9041 - val_loss: 0.2088 - val_pr_auc: 0.7779 - val_roc_auc: 0.9456
Epoch 5/100
611/611 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9042 - loss: 0.2060 - pr_auc: 0.7827 - roc_auc: 0.9466 - val_accuracy: 0.9060 - val_loss: 0.2076 - val_pr_auc: 0.7793 - val_roc_auc: 0.9464
Epoch 6/100
611/611 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9031 - loss: 0.2064 - pr_auc: 0.7715 - roc_auc: 0.9463 - val_accuracy: 0.9058 - val_loss: 0.2073 - val_pr_auc: 0.7799 - val_roc_auc: 0.9465
Epoch 7/100
611/611 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9064 - loss: 0.2011 - pr_auc: 0.7966 - roc_auc: 0.9496 - val_accuracy: 0.9037 - val_loss: 0.2077 - val_pr_auc: 0.7784 - val_roc_auc: 0.9462
Epoch 8/100
611/611 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9060 - loss: 0.2015 - pr_auc: 0.7953 - roc_auc: 0.9503 - val_accuracy: 0.9044 - val_loss: 0.2095 - val_pr_auc: 0.7791 - val_roc_auc: 0.9451
Epoch 9/100
611/611 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9089 - loss: 0.1961 - pr_auc: 0.7932 - roc_auc: 0.9513 - val_accuracy: 0.9043 - val_loss: 0.2086 - val_pr_auc: 0.7771 - val_roc_auc: 0.9455
Epoch 10/100
611/611 ━━━━━━━━━━━━━━ ...

Model training complete.

Performing final predictions and logging detailed evaluation metrics...
[1m1222/1222[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 614us/step
[1m306/306[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 429us/step
Final evaluation metrics (accuracy, precision, recall, F1, AUCs) logged manually.
Confusion matrix plots logged as artifacts.
Classification reports logged as artifacts.
Learning curves plot logged as artifact.
MLflow Run Deep_Neural_Network_TensorFlow_Autolog (ID: 743e45a2945f4711910f9896afcfdf70) finished successfully.
🏃 View run Deep_Neural_Network_TensorFlow_Autolog at: http://135.235.251.124/#/experiments/2/runs/743e45a2945f4711910f9896afcfdf70
🧪 View experiment at: http://135.235.251.124/#/experiments/2

MLflow autologging for TensorFlow disabled.
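
To make the `EarlyStopping` settings in the cell above more concrete, the sketch below replays the nine complete `val_loss` values from the training log through a simplified patience rule. This is an illustration only, not Keras's actual implementation; `simulate_early_stopping` is a hypothetical helper, and the second call uses `patience=3` purely for contrast.

```python
def simulate_early_stopping(val_losses, patience):
    """Replay a simplified patience rule over per-epoch validation losses.

    Returns (best_epoch, stop_epoch) with 1-based epochs; stop_epoch is None
    when the rule never triggers within the given losses.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None


# val_loss for epochs 1-9, copied from the training log above
val_losses = [0.2195, 0.2121, 0.2092, 0.2088, 0.2076, 0.2073, 0.2077, 0.2095, 0.2086]

print(simulate_early_stopping(val_losses, patience=10))  # (6, None)
print(simulate_early_stopping(val_losses, patience=3))   # (6, 9)
```

With `patience=10`, nothing stops within these nine epochs, while the lowest `val_loss` so far is at epoch 6. With `restore_best_weights=True`, the weights from the best epoch are the ones kept once training does stop early.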


# **Phase 8: Model Deployment**

## Finding the best model from the parameter tuning experiment

In [1]:
# --- Cell 1: Environment Setup and Best-Run Lookup ---
from dotenv import load_dotenv
import os
import mlflow
from mlflow.tracking import MlflowClient

# Load environment variables from .env file
load_dotenv()

# Get credentials and URI from environment variables
MLFLOW_USERNAME = os.getenv('MLFLOW_TRACKING_USERNAME')
MLFLOW_PASSWORD = os.getenv('MLFLOW_TRACKING_PASSWORD')
MLFLOW_TRACKING_URI = os.getenv('MLFLOW_TRACKING_URI')

# These lines are crucial for MLflow to authenticate with your server
os.environ['MLFLOW_TRACKING_USERNAME'] = MLFLOW_USERNAME
os.environ['MLFLOW_TRACKING_PASSWORD'] = MLFLOW_PASSWORD

# Set the tracking URI
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

from mlflow.entities import ViewType

# 1. Initialize the client (uses the tracking URI set above)
client = MlflowClient()

# 2. Look up the experiment by name and fail early if it does not exist
experiment = client.get_experiment_by_name("Adult_Classification_Tuning_XGboost")
if experiment is None:
    raise ValueError("Experiment 'Adult_Classification_Tuning_XGboost' not found on the tracking server.")
experiment_id = experiment.experiment_id

# 3. Query runs ordered by the primary metric, descending, keeping only the top one
best_runs = client.search_runs(
    experiment_ids=[experiment_id],
    filter_string="",
    run_view_type=ViewType.ACTIVE_ONLY,
    max_results=1,
    order_by=["metrics.test_pr_auc DESC"]  # Replace with your primary metric
)
if not best_runs:
    raise ValueError("No runs found in the experiment.")
best_run = best_runs[0]
print(f"Best run ID: {best_run.info.run_id}")
print(f"Best test PR AUC: {best_run.data.metrics['test_pr_auc']}")

Best run ID: 5b426b4e758a426792dfb4b0b1fa1458
Best test PR AUC: 0.8309769945531836


In [5]:
model_uri = f"runs:/{best_run.info.run_id}/model"
print(f"Model URI: {model_uri}")

Model URI: runs:/5b426b4e758a426792dfb4b0b1fa1458/model


### Registering the model in the Model Registry

In [3]:
model_details = mlflow.register_model(
    model_uri=model_uri,
    name="XGBoost_AdultIncome_BestModel"
)
print(f"Registered model version: {model_details.version}")

Successfully registered model 'XGBoost_AdultIncome_BestModel'.
2025/05/18 17:45:56 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: XGBoost_AdultIncome_BestModel, version 1


Registered model version: 1


Created version '1' of model 'XGBoost_AdultIncome_BestModel'.


In [20]:
print(model_details)

<ModelVersion: aliases=[], creation_timestamp=1747570556188, current_stage='None', description='', last_updated_timestamp=1747570556188, name='XGBoost_AdultIncome_BestModel', run_id='5b426b4e758a426792dfb4b0b1fa1458', run_link='', source='wasbs://artifactroot@tharindumlflow615422a9.blob.core.windows.net/1/5b426b4e758a426792dfb4b0b1fa1458/artifacts/model', status='READY', status_message=None, tags={}, user_id='', version='1'>


In [21]:
print(model_details.source)

wasbs://artifactroot@tharindumlflow615422a9.blob.core.windows.net/1/5b426b4e758a426792dfb4b0b1fa1458/artifacts/model


### Transition to “Staging”

In [None]:
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Replace with your model name and version from registration step
model_name = "XGBoost_AdultIncome_BestModel"
model_version = model_details.version  

# Transition to “Staging”
client.transition_model_version_stage(
    name=model_name,
    version=model_version,
    stage="Staging"
)
print(f"Model {model_name} v{model_version} is now in Staging.")


  client.transition_model_version_stage(


Model XGBoost_AdultIncome_BestModel v1 is now in Staging.


### Transition to “Production”

In [None]:

client.transition_model_version_stage(
    name=model_name,
    version=model_version,
    stage="Production"
)
print(f"Model {model_name} v{model_version} is now in Production.")

  client.transition_model_version_stage(


Model XGBoost_AdultIncome_BestModel v1 is now in Production.
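
Recent MLflow releases mark `transition_model_version_stage` as deprecated and recommend model version aliases instead of stages. The following is a minimal sketch of the alias-based equivalent, assuming an MLflow version that provides `MlflowClient.set_registered_model_alias`; the helper name `promote_with_alias` and the alias `champion` are hypothetical choices for this example.

```python
def promote_with_alias(client, model_name, version, alias):
    """Point `alias` at the given model version and return a loadable URI.

    `client` is expected to be an mlflow.tracking.MlflowClient instance.
    """
    client.set_registered_model_alias(
        name=model_name, alias=alias, version=str(version)
    )
    # The models:/<name>@<alias> form resolves to whichever version holds the alias
    return f"models:/{model_name}@{alias}"


# Example usage (requires a reachable tracking server and a registered model):
#   uri = promote_with_alias(client, "XGBoost_AdultIncome_BestModel",
#                            model_details.version, "champion")
#   model = mlflow.pyfunc.load_model(uri)
```

Reassigning the alias to a newer version later changes what the URI resolves to without any change to the serving code.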


### Load your model from MLflow

In [9]:
model_name = "XGBoost_AdultIncome_BestModel"
stage = "Production"  # or "Staging", or None if you want latest
model_uri = f"models:/{model_name}/{stage or 'latest'}"
print("Loading:", model_uri)
model = mlflow.pyfunc.load_model(model_uri)

Loading: models:/XGBoost_AdultIncome_BestModel/Production


  from .autonotebook import tqdm as notebook_tqdm
Downloading artifacts: 100%|██████████| 8/8 [00:00<00:00,  9.79it/s]  
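
The `models:/` URI built in the cell above can target a stage, a specific version number, or the latest version. As a small sketch of those forms, the hypothetical helper `registry_model_uri` below (not an MLflow function) assembles each variant:

```python
def registry_model_uri(name, version=None, stage=None):
    """Build a models:/ URI for a registered model.

    Precedence: explicit version, then stage, then the latest version.
    """
    if version is not None:
        return f"models:/{name}/{version}"
    return f"models:/{name}/{stage or 'latest'}"


print(registry_model_uri("XGBoost_AdultIncome_BestModel", stage="Production"))
# models:/XGBoost_AdultIncome_BestModel/Production
print(registry_model_uri("XGBoost_AdultIncome_BestModel", version=1))
# models:/XGBoost_AdultIncome_BestModel/1
print(registry_model_uri("XGBoost_AdultIncome_BestModel"))
# models:/XGBoost_AdultIncome_BestModel/latest
```

Pinning a version number gives reproducible loads, whereas a stage or `latest` can resolve to a different version over time.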


### Build a sample input DataFrame

In [13]:
import pandas as pd

# Re-use your example function, or just hard-code one row:
sample = {
    'age': 38,
    'workclass': 'Private',
    'fnlwgt': 215646,
    'education': 'HS-grad',
    'education-num': 9,
    'marital-status': 'Divorced',
    'occupation': 'Handlers-cleaners',
    'relationship': 'Not-in-family',
    'race': 'White',
    'sex': 'Male',
    'capital-gain': 0,
    'capital-loss': 0,
    'hours-per-week': 40,
    'native-country': 'United-States'
}
input_df = pd.DataFrame([sample])
input_df

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |


### Run inference

In [None]:
preds = model.predict(input_df)

In [19]:
print(f"Predicted class: {preds[0]}  (<=50K=0, >50K=1)")

Predicted class: 0  (<=50K=0, >50K=1)
