We will:
1. Load the cleaned dataset  
2. Preprocess:
   - Handle missing values
   - Encode categorical features
   - Scale numerical features
3. Split into train/test sets
4. Train a basic model
5. Use **MLflow** to track:
   - Parameters
   - Metrics
   - Model artifacts

## 1. Load the cleaned data


In [0]:
import pandas as pd
import numpy as np

file_path = "/dbfs/FileStore/tables/kidney_disease_cleaned.csv"
df = pd.read_csv(file_path)

df['classification'] = df['classification'].str.strip().str.lower()
df.drop('id', axis=1, inplace=True, errors='ignore')  # drop if not already dropped

## 2. Define Features & Target

In [0]:
X = df.drop("classification", axis=1)
y = df["classification"].map({'ckd': 1, 'notckd': 0})  # binary target
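
One thing worth checking at this step: `Series.map` returns `NaN` for any label that is not a key in the dictionary, and a `NaN` in `y` would typically cause errors later when the model is fit. Below is a minimal sketch of that check. It uses made-up toy labels rather than the real dataset, so the values in `raw` (including the `"unknown"` label) are assumptions for illustration only.

```python
import pandas as pd

# Toy labels (hypothetical), with whitespace/case noise and one unexpected value
raw = pd.Series(["ckd", "ckd\t", " NotCKD", "notckd", "unknown"])

# Same cleaning and mapping steps as in the cells above
cleaned = raw.str.strip().str.lower()
target = cleaned.map({"ckd": 1, "notckd": 0})

# Labels that were not covered by the mapping end up as NaN in the target
unmapped = cleaned[target.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

If the list is not empty on the real data, either extend the mapping or drop those rows before splitting.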

## 3. Create Preprocessing Pipeline
We'll use:
- **SimpleImputer** to fill missing values
- **StandardScaler** to standardize numeric features
- **OneHotEncoder** to encode categorical features

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Identify column types ("number" also catches int32/float32 columns)
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(include="object").columns.tolist()

# Pipelines
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols)
])

## 4. Train-Test Split

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
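
The `stratify=y` argument keeps the class ratio the same in both splits, which matters when the classes are imbalanced. A quick sketch on synthetic labels shows the effect; the 60/40 balance below is an assumption for illustration, not the balance of the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 100 rows, 60% positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 60 + [0] * 40)

_, _, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Share of the positive class in the full data, the train split, and the test split
print(y_toy.mean(), y_toy_train.mean(), y_toy_test.mean())
```

Running the same comparison on the real `y`, `y_train`, and `y_test` is a cheap sanity check after the split.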

## 5. Build Full Pipeline + Model
We’ll start with `RandomForestClassifier` as a baseline:

In [0]:
from sklearn.ensemble import RandomForestClassifier

model_pipeline = Pipeline([
    ("preprocessing", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])
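
Putting the preprocessing inside the pipeline means its statistics (means, scales, categories) are learned only from the data passed to `fit`, so nothing from the test set leaks into training. The sketch below illustrates this with a deliberately tiny pipeline; the arrays and the reduced `n_estimators=10` are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): four training rows and one very different new row
X_fit = np.array([[0.0], [2.0], [4.0], [6.0]])
y_fit = np.array([0, 0, 1, 1])
X_new = np.array([[100.0]])

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])
toy_pipeline.fit(X_fit, y_fit)

# The scaler's mean comes from X_fit only; X_new is transformed with it at predict time
learned_mean = toy_pipeline.named_steps["scaler"].mean_[0]
toy_preds = toy_pipeline.predict(X_new)
print(learned_mean, toy_preds)
```

The same `named_steps` lookup works on `model_pipeline` for inspecting the fitted preprocessor or classifier.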

## 6. Train Model + Track with MLflow

In [0]:
import mlflow
import mlflow.sklearn
from sklearn.metrics import classification_report, accuracy_score

mlflow.set_experiment("/Users/zhao.xinyuan@northeastern.edu/ckd-mlops")  # replace with your own Databricks workspace path

with mlflow.start_run():
    model_pipeline.fit(X_train, y_train)
    preds = model_pipeline.predict(X_test)
    
    acc = accuracy_score(y_test, preds)
    clf_report = classification_report(y_test, preds, output_dict=True)

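    # Log params and metrics
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_params({"n_estimators": 100, "random_state": 42})  # match the classifier settings above
    mlflow.log_metric("accuracy", acc)
    # classification_report keys its per-class entries by the string form of the label ("0", "1")
    mlflow.log_metrics({f"f1_{k}": v["f1-score"] for k, v in clf_report.items() if k in ["0", "1"]})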
    
    # Log model
    mlflow.sklearn.log_model(model_pipeline, "ckd_rf_model")

    print("Run logged with MLflow 🎉")
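
If you want to track more than one parameter and one F1 score per class, a possible approach is to build the dictionaries first and then pass them to `mlflow.log_params` and `mlflow.log_metrics`. The sketch below only builds the dictionaries so it runs without a tracking server. The toy labels and the choice of which hyperparameters to keep (`wanted`) are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Toy labels and predictions (hypothetical)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
report = classification_report(y_true, y_pred, output_dict=True)

# Flatten per-class precision/recall/F1 into names like "f1_0" or "recall_1"
metrics = {
    f"{name.replace('-score', '')}_{label}": value
    for label in ("0", "1")
    for name, value in report[label].items()
    if name != "support"
}

# Read hyperparameters from the estimator instead of hardcoding them
clf = RandomForestClassifier(n_estimators=100, random_state=42)
wanted = ("n_estimators", "max_depth", "random_state")
params = {"model_type": type(clf).__name__,
          **{key: clf.get_params()[key] for key in wanted}}

print(params)
print(metrics)

# Inside an active run, these could then be logged with:
#     mlflow.log_params(params)
#     mlflow.log_metrics(metrics)
```

Reading values from `get_params()` keeps the logged parameters in sync with the model if you later change its settings.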


## What You’ll Learn/Showcase

| Feature | Why It’s Important |
|--------|--------------------|
| `Pipeline` | Clean, repeatable ML process |
| `ColumnTransformer` | Real-world data handling (numeric + categorical) |
| `MLflow` | Experiment tracking, a core MLOps skill for the Sanofi role |
| `RandomForestClassifier` | A solid baseline ML model |
| `classification_report` | Communicates performance clearly |

---

## Next Steps

Once you're done:
1. I’ll help you move to `03_model_selection_and_tuning.ipynb` for **hyperparameter tuning** and **model comparison**.
2. Then we can add a **Streamlit** app or an **MLflow serving endpoint** if you want to demo!

---