This example lab walks you through building an end-to-end readmission prediction pipeline, consolidating concepts and techniques from the previous labs.

---
## 📘 Section 12: Labs and Exercises

### 🧪 Lab 12.2: Example Lab: Build an End-to-End Readmission Prediction Pipeline

**Objective:** To design and implement a complete end-to-end pipeline in Microsoft Fabric for predicting patient readmissions. This includes data ingestion (conceptual), data transformation, feature engineering, model training with MLflow, batch scoring, and creating a Power BI dashboard for visualization.

**Scenario:** Valley General Hospital wants a robust system to identify patients at high risk of readmission within 30 days of discharge. This pipeline will provide actionable insights to care teams to implement interventions.

**Assumed Prerequisites:**
* A Microsoft Fabric enabled environment with an available capacity.
* Permissions to create workspaces, Lakehouses, Notebooks, Data Factory items, and Power BI reports.
* Understanding of concepts from Labs 2.9, 3.8, 4.8, 5.9, 6.8, and 7.8.

---
**Phase 1: Environment Setup & Data Ingestion (Recap & Assumptions)**

* **Step 1.1: Create a New Workspace.**
    * Name: `E2E_ReadmissionAnalytics_YourName`. Assign it to a Fabric capacity.
* **Step 1.2: Create a New Lakehouse.**
    * Within the new workspace, create a Lakehouse named `HealthClinicLH_E2E_YourName`.
* **Step 1.3: Assumed Bronze Layer Data.**
    * For this lab, we will assume that raw data has been ingested into the `Files` section of your Lakehouse (or into Bronze tables directly). We'll create sample CSV files and then load them into Spark DataFrames to simulate this.

    * **Create `sample_patients_bronze.csv` locally:**
        ```csv
        PatientID,MRN,DateOfBirth,Gender,ZipCode,DeceasedDate
        P001,MRN001,1965-03-15,Female,90210,
        P002,MRN002,1972-07-22,Male,90010,
        P003,MRN003,1950-11-02,Male,90211,2023-08-15
        P004,MRN004,1985-01-30,Female,92101,
        P005,MRN005,1990-05-10,Other,90030,
        ```
    * **Create `sample_encounters_bronze.csv` locally:**
        ```csv
        EncounterID,PatientID,AdmissionDate,DischargeDate,EncounterType,PrimaryDiagnosisCode,HospitalizationLengthDays,DischargeDisposition
        E001,P001,2023-01-10T10:00:00,2023-01-15T14:30:00,Inpatient,I21.3,5,Home
        E002,P002,2023-02-01T11:00:00,2023-02-05T12:00:00,Inpatient,J44.9,4,Skilled Nursing Facility
        E003,P001,2023-02-10T09:00:00,2023-02-12T17:00:00,Inpatient,I21.3,2,Home
        E004,P003,2023-03-01T15:00:00,2023-03-10T10:00:00,Inpatient,N18.6,9,Expired
        E005,P002,2023-03-05T08:00:00,2023-03-08T18:00:00,Emergency,R07.4,3,Home
        E006,P004,2023-04-10T16:00:00,2023-04-18T11:00:00,Inpatient,S06.9,8,Rehab Facility
        E007,P001,2023-04-15T10:00:00,2023-04-20T14:30:00,Inpatient,I50.0,5,Home
        E008,P005,2023-05-01T07:00:00,2023-05-03T10:00:00,Observation,Z03.89,2,Home
        ```
    * **Upload these CSV files:**
        * In `HealthClinicLH_E2E_YourName`, navigate to `Files`.
        * Create a folder named `bronze_layer_uploads`.
        * Upload `sample_patients_bronze.csv` and `sample_encounters_bronze.csv` into this folder.
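
Before uploading, it can help to sanity-check the two files locally. The sketch below is a minimal, standard-library example that embeds the sample rows from above and checks two things: that identifiers are unique, and that every encounter points at a known patient. The helper names (`read_rows`, `orphan_encounters`, `duplicate_ids`) are hypothetical and not part of the lab.

```python
import csv
import io

# Sample rows copied from the two CSV files defined in this step.
PATIENTS_CSV = """PatientID,MRN,DateOfBirth,Gender,ZipCode,DeceasedDate
P001,MRN001,1965-03-15,Female,90210,
P002,MRN002,1972-07-22,Male,90010,
P003,MRN003,1950-11-02,Male,90211,2023-08-15
P004,MRN004,1985-01-30,Female,92101,
P005,MRN005,1990-05-10,Other,90030,
"""

ENCOUNTERS_CSV = """EncounterID,PatientID,AdmissionDate,DischargeDate,EncounterType,PrimaryDiagnosisCode,HospitalizationLengthDays,DischargeDisposition
E001,P001,2023-01-10T10:00:00,2023-01-15T14:30:00,Inpatient,I21.3,5,Home
E002,P002,2023-02-01T11:00:00,2023-02-05T12:00:00,Inpatient,J44.9,4,Skilled Nursing Facility
E003,P001,2023-02-10T09:00:00,2023-02-12T17:00:00,Inpatient,I21.3,2,Home
E004,P003,2023-03-01T15:00:00,2023-03-10T10:00:00,Inpatient,N18.6,9,Expired
E005,P002,2023-03-05T08:00:00,2023-03-08T18:00:00,Emergency,R07.4,3,Home
E006,P004,2023-04-10T16:00:00,2023-04-18T11:00:00,Inpatient,S06.9,8,Rehab Facility
E007,P001,2023-04-15T10:00:00,2023-04-20T14:30:00,Inpatient,I50.0,5,Home
E008,P005,2023-05-01T07:00:00,2023-05-03T10:00:00,Observation,Z03.89,2,Home
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def duplicate_ids(rows, key):
    """Return the sorted values of `key` that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return sorted(dupes)


def orphan_encounters(patients, encounters):
    """Return EncounterIDs whose PatientID is missing from the patients file."""
    known = {p["PatientID"] for p in patients}
    return [e["EncounterID"] for e in encounters if e["PatientID"] not in known]


patients = read_rows(PATIENTS_CSV)
encounters = read_rows(ENCOUNTERS_CSV)
print("patients:", len(patients), "encounters:", len(encounters))
print("duplicate PatientIDs:", duplicate_ids(patients, "PatientID"))
print("orphan encounters:", orphan_encounters(patients, encounters))
```

An empty list from both checks suggests the inner join on `PatientID` in Phase 2 should not silently drop any encounters.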

---
**Phase 2: Data Preparation and Transformation (Silver & Gold Layers)**

* **Step 2.1: Create a New Notebook.**
    * In your workspace, create a new Notebook named `Prepare_Readmission_Data_E2E_YourName`.
    * Attach it to the `HealthClinicLH_E2E_YourName` Lakehouse.

* **Step 2.2: Notebook Code for Data Prep (PySpark).**

In [None]:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window
    from pyspark.sql.types import DateType, TimestampType, IntegerType

    lakehouse_name = "HealthClinicLH_E2E_YourName" # Ensure this matches your Lakehouse name

    # --- Load Bronze Data ---
    # Relative "Files/..." paths resolve against the Lakehouse attached to this notebook as default.
    patients_bronze_df = spark.read.format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load("Files/bronze_layer_uploads/sample_patients_bronze.csv")

    encounters_bronze_df = spark.read.format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load("Files/bronze_layer_uploads/sample_encounters_bronze.csv")

    print("Bronze data loaded.")
    patients_bronze_df.show()
    encounters_bronze_df.show()

    # --- Create Dim_Patient_E2E (Gold) ---
    dim_patient_df = patients_bronze_df \
        .withColumn("DateOfBirth", F.to_date(F.col("DateOfBirth"))) \
        .withColumn("DeceasedDate", F.to_date(F.col("DeceasedDate"))) \
        .withColumn("IsDeceased", F.when(F.col("DeceasedDate").isNotNull(), True).otherwise(False)) \
        .withColumn("Age", F.floor(F.datediff(F.current_date(), F.col("DateOfBirth")) / 365.25).cast(IntegerType())) \
        .withColumn("PatientKey", F.md5(F.col("PatientID"))) \
        .select("PatientKey", "PatientID", "MRN", "DateOfBirth", "Age", "Gender", "ZipCode", "IsDeceased", "DeceasedDate")

    dim_patient_df.write.format("delta").mode("overwrite").saveAsTable(f"{lakehouse_name}.Dim_Patient_E2E")
    print("Dim_Patient_E2E created.")
    spark.read.table(f"{lakehouse_name}.Dim_Patient_E2E").show()

    # --- Create Dim_Date_E2E (Gold) ---
    # (Simplified Dim_Date for this lab)
    date_range_df = spark.sql("SELECT sequence(to_date('2022-01-01'), to_date('2025-12-31'), interval 1 day) as date_array")
    dim_date_df = date_range_df.select(F.explode(F.col("date_array")).alias("FullDate")) \
        .withColumn("DateKey", F.date_format(F.col("FullDate"), "yyyyMMdd").cast(IntegerType())) \
        .withColumn("Year", F.year(F.col("FullDate"))) \
        .withColumn("Month", F.month(F.col("FullDate"))) \
        .withColumn("Day", F.dayofmonth(F.col("FullDate"))) \
        .withColumn("DayOfWeek", F.dayofweek(F.col("FullDate")))

    dim_date_df.write.format("delta").mode("overwrite").saveAsTable(f"{lakehouse_name}.Dim_Date_E2E")
    print("Dim_Date_E2E created.")
    spark.read.table(f"{lakehouse_name}.Dim_Date_E2E").show(5)

    # --- Create Fact_Encounter_E2E (Gold) ---
    # Convert string dates to timestamp/date
    encounters_transformed_df = encounters_bronze_df \
        .withColumn("AdmissionDate", F.to_timestamp(F.col("AdmissionDate"))) \
        .withColumn("DischargeDate", F.to_timestamp(F.col("DischargeDate")))

    # Calculate DaysToNextAdmission & PriorAdmissionsCount per patient, ordered by admission time
    window_patient_admission_asc = Window.partitionBy("PatientID").orderBy(F.col("AdmissionDate").asc())

    fact_encounter_intermediate_df = encounters_transformed_df \
        .withColumn("NextAdmissionDate", F.lead(F.col("AdmissionDate"), 1).over(window_patient_admission_asc)) \
        .withColumn("PriorAdmissionsCount", F.row_number().over(window_patient_admission_asc) - 1) \
        .withColumn("DaysToNextAdmission", F.when(F.col("NextAdmissionDate").isNotNull() & F.col("DischargeDate").isNotNull(),
                                                 F.datediff(F.col("NextAdmissionDate"), F.col("DischargeDate")))
                      .otherwise(None))

    # Join with Dim_Patient to get PatientKey, Dim_Date for AdmissionDateKey
    fact_encounter_df = fact_encounter_intermediate_df \
        .join(dim_patient_df.select("PatientID", "PatientKey"), "PatientID", "inner") \
        .join(dim_date_df.select("FullDate", "DateKey").alias("dim_adm_date"),
              F.to_date(fact_encounter_intermediate_df.AdmissionDate) == F.col("dim_adm_date.FullDate"), "left") \
        .select(
            F.md5(F.col("EncounterID")).alias("EncounterKey"),
            F.col("PatientKey"),
            F.col("dim_adm_date.DateKey").alias("AdmissionDateKey"),
            F.to_date(F.col("AdmissionDate")).alias("AdmissionDateOnly"), # For easier use in ML
            F.to_date(F.col("DischargeDate")).alias("DischargeDateOnly"), # For easier use in ML
            F.col("EncounterType"),
            F.col("PrimaryDiagnosisCode"),
            F.col("HospitalizationLengthDays").alias("LengthOfStayInDays"), # Renaming for consistency
            F.col("DischargeDisposition"),
            F.col("PriorAdmissionsCount"),
            F.col("DaysToNextAdmission")
        )

    fact_encounter_df.write.format("delta").mode("overwrite").saveAsTable(f"{lakehouse_name}.Fact_Encounter_E2E")
    print("Fact_Encounter_E2E created.")
    spark.read.table(f"{lakehouse_name}.Fact_Encounter_E2E").show()
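
To check the window logic against the sample data, the following plain-Python sketch recomputes `PriorAdmissionsCount` and `DaysToNextAdmission` for the eight sample encounters. It is only a cross-check under the assumption that the sample CSV is unchanged; `derive_history` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime
from itertools import groupby

# (EncounterID, PatientID, AdmissionDate, DischargeDate) from sample_encounters_bronze.csv
ENCOUNTERS = [
    ("E001", "P001", "2023-01-10T10:00:00", "2023-01-15T14:30:00"),
    ("E002", "P002", "2023-02-01T11:00:00", "2023-02-05T12:00:00"),
    ("E003", "P001", "2023-02-10T09:00:00", "2023-02-12T17:00:00"),
    ("E004", "P003", "2023-03-01T15:00:00", "2023-03-10T10:00:00"),
    ("E005", "P002", "2023-03-05T08:00:00", "2023-03-08T18:00:00"),
    ("E006", "P004", "2023-04-10T16:00:00", "2023-04-18T11:00:00"),
    ("E007", "P001", "2023-04-15T10:00:00", "2023-04-20T14:30:00"),
    ("E008", "P005", "2023-05-01T07:00:00", "2023-05-03T10:00:00"),
]


def derive_history(encounters):
    """Mirror lead()/row_number() over a per-patient window ordered by admission time."""
    parsed = [
        (eid, pid, datetime.fromisoformat(adm), datetime.fromisoformat(dis))
        for eid, pid, adm, dis in encounters
    ]
    # groupby needs its input sorted by the grouping key (patient), then by admission.
    parsed.sort(key=lambda r: (r[1], r[2]))
    result = {}
    for _, group in groupby(parsed, key=lambda r: r[1]):
        rows = list(group)
        for i, (eid, _pid, _adm, discharge) in enumerate(rows):
            next_admission = rows[i + 1][2] if i + 1 < len(rows) else None
            # Spark's datediff() compares calendar dates, so drop the time of day first.
            days = (next_admission.date() - discharge.date()).days if next_admission else None
            result[eid] = {"PriorAdmissionsCount": i, "DaysToNextAdmission": days}
    return result


history = derive_history(ENCOUNTERS)
for encounter_id in sorted(history):
    print(encounter_id, history[encounter_id])
```

If the values in `Fact_Encounter_E2E` differ from this output, the date parsing or the window ordering is the first place to look.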

---
**Phase 3: Model Training and Experimentation**

* **Step 3.1: Create a New Notebook.**
    * Name: `Train_Readmission_Model_E2E_YourName`. Attach to `HealthClinicLH_E2E_YourName`.
* **Step 3.2: Notebook Code for Model Training (PySpark & MLflow).**

In [None]:

    import mlflow
    from pyspark.sql import functions as F
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, Imputer, StandardScaler
    from pyspark.ml.classification import RandomForestClassifier # Using RF for this example
    from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
    from pyspark.ml import Pipeline

    lakehouse_name = "HealthClinicLH_E2E_YourName"
    fact_table = f"{lakehouse_name}.Fact_Encounter_E2E"
    dim_patient_table = f"{lakehouse_name}.Dim_Patient_E2E"

    # --- Load Data for Modeling ---
    fact_df = spark.read.table(fact_table)
    patient_df = spark.read.table(dim_patient_table)

    model_input_df = fact_df.join(patient_df, "PatientKey", "inner") \
        .select(
            fact_df.LengthOfStayInDays,
            fact_df.PrimaryDiagnosisCode,
            fact_df.EncounterType,
            fact_df.PriorAdmissionsCount,
            fact_df.DaysToNextAdmission,
            patient_df.Age,
            patient_df.Gender,
            patient_df.PatientKey # Keep for potential analysis, but not as a feature
        ).na.drop(subset=["LengthOfStayInDays", "Age", "PriorAdmissionsCount"]) # Drop rows missing key numeric features; keep null DaysToNextAdmission (no later admission means label 0)

    # --- Feature Engineering for Model ---
    # Create Target Variable
    model_input_df = model_input_df.withColumn(
        "label", # Standard name for target in Spark ML
        F.when((F.col("DaysToNextAdmission") >= 0) & (F.col("DaysToNextAdmission") <= 30), 1.0).otherwise(0.0)
    )

    # Handle categorical features
    categorical_cols = ["PrimaryDiagnosisCode", "EncounterType", "Gender"]
    indexers = [StringIndexer(inputCol=col, outputCol=col+"_Index", handleInvalid="keep") for col in categorical_cols]
    encoders = [OneHotEncoder(inputCol=col+"_Index", outputCol=col+"_Vec") for col in categorical_cols]

    # Handle numerical features (impute nulls if any before scaling)
    numerical_cols = ["LengthOfStayInDays", "PriorAdmissionsCount", "Age"]
    imputer_numerical = Imputer(inputCols=numerical_cols, outputCols=[c + "_Imputed" for c in numerical_cols], strategy="median")

    # Assemble features (numerical imputed + categorical one-hot encoded)
    imputed_numerical_feature_cols = [c + "_Imputed" for c in numerical_cols]
    encoded_categorical_feature_cols = [col+"_Vec" for col in categorical_cols]
    assembler_inputs = imputed_numerical_feature_cols + encoded_categorical_feature_cols
    vector_assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="assembledFeatures")

    # Scale features
    scaler = StandardScaler(inputCol="assembledFeatures", outputCol="features") # 'features' is standard name

    # Split data
    train_df, test_df = model_input_df.randomSplit([0.8, 0.2], seed=123)

    # --- Model Training with MLflow ---
    with mlflow.start_run(run_name="RandomForest_Readmission_E2E") as run:
        mlflow.log_param("data_source_fact", fact_table)
        mlflow.log_param("data_source_dim_patient", dim_patient_table)
        mlflow.log_param("train_test_split_ratio", "0.8/0.2")

        # Define Random Forest model
        rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100, maxDepth=10, seed=123)
        mlflow.log_param("numTrees", 100)
        mlflow.log_param("maxDepth", 10)

        # Create pipeline
        pipeline_stages = [imputer_numerical] + indexers + encoders + [vector_assembler, scaler, rf]
        pipeline_rf = Pipeline(stages=pipeline_stages)

        print("Training Random Forest model...")
        rf_model = pipeline_rf.fit(train_df)
        print("Model training complete.")

        # Make predictions
        predictions = rf_model.transform(test_df)

        # Evaluate model
        evaluator_auc = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
        # metricName="f1" is the weighted F1 across classes; "fMeasureByLabel" with metricLabel=1.0 and beta=1.0 gives F1 for the positive class
        evaluator_f1 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="fMeasureByLabel", metricLabel=1.0, beta=1.0)

        # Note: with only a handful of sample rows, the test split may be empty or hold a single class, so treat these metrics as illustrative
        auc_score = evaluator_auc.evaluate(predictions)
        f1_score = evaluator_f1.evaluate(predictions) # Evaluate on the full test set so false positives are counted


        print(f"Test AUC: {auc_score}")
        print(f"Test F1-Score (for class 1): {f1_score}")

        mlflow.log_metric("AUC", auc_score)
        mlflow.log_metric("F1_Class1", f1_score)

        # Log model
        mlflow.spark.log_model(
            spark_model=rf_model,
            artifact_path="readmission_rf_pipeline_model",
            registered_model_name="ReadmissionRiskModelE2E" # This will register the model
        )
        print("Random Forest model and metrics logged to MLflow, and model registered.")

        run_id_rf = run.info.run_id # The 'with' block ends the run on exit, so no explicit mlflow.end_run() is needed

    print(f"MLflow Run ID for Random Forest E2E: {run_id_rf}")
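
F1 for the positive class depends on false positives, and those only occur among rows whose true label is 0, so it has to be computed over the whole test set. The plain-Python sketch below illustrates this with made-up `(label, prediction)` pairs; the data and the helper `f1_for_label` are hypothetical and serve only to show the arithmetic.

```python
def f1_for_label(pairs, positive=1.0):
    """F1 for one class, from (label, prediction) pairs."""
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so report 0.0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical test-set results: (true label, predicted label)
all_rows = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0), (0.0, 0.0)]
positives_only = [pair for pair in all_rows if pair[0] == 1.0]

print("F1 over all rows:      ", round(f1_for_label(all_rows), 4))
print("F1 over positives only:", round(f1_for_label(positives_only), 4))
```

Restricting the input to positive rows removes every false positive, so precision becomes 1.0 and the score is inflated.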

---
**Phase 4: Batch Scoring of Upcoming Appointments**

* **Step 4.1: Create a New Notebook.**
    * Name: `Score_Upcoming_Appointments_E2E_YourName`. Attach to `HealthClinicLH_E2E_YourName`.
* **Step 4.2: Notebook Code for Batch Scoring.**

In [None]:

    import mlflow
    from pyspark.sql import functions as F

    lakehouse_name = "HealthClinicLH_E2E_YourName"
    registered_model_name = "ReadmissionRiskModelE2E"

    # --- Simulate Upcoming Appointments Data ---
    # In a real scenario, this would be new, unlabeled data (for example, a new CSV shaped like
    # sample_encounters_bronze.csv but covering recent discharges).
    # For simplicity, we re-load Fact_Encounter_E2E with the necessary Dim_Patient_E2E columns,
    # leave out DaysToNextAdmission (the source of the label), and take a couple of rows
    # to act as our "new, unscored" data.
    fact_df_all = spark.read.table(f"{lakehouse_name}.Fact_Encounter_E2E")
    patient_df_all = spark.read.table(f"{lakehouse_name}.Dim_Patient_E2E")

    upcoming_appointments_raw_df = fact_df_all.join(patient_df_all, "PatientKey", "inner") \
        .select(
            fact_df_all.EncounterKey, # For identifying the appointment
            fact_df_all.PatientKey,
            fact_df_all.LengthOfStayInDays,
            fact_df_all.PrimaryDiagnosisCode,
            fact_df_all.EncounterType,
            fact_df_all.PriorAdmissionsCount,
            patient_df_all.Age,
            patient_df_all.Gender
        ).limit(2) # Take a small sample to simulate "new" data for scoring

    if upcoming_appointments_raw_df.count() == 0:
        print("No data in upcoming_appointments_raw_df to score. Exiting scoring.")
    else:
        print("Simulated upcoming appointments data:")
        upcoming_appointments_raw_df.show()

        # --- Load Registered Model ---
        # Load the latest version of the registered model
        # The model URI format is "models:/<model_name>/<version>" or "models:/<model_name>/latest"
        try:
            loaded_model = mlflow.spark.load_model(f"models:/{registered_model_name}/latest")
            print(f"Successfully loaded model '{registered_model_name}/latest'.")
        except Exception as e:
            print(f"Error loading model: {e}. Make sure the model '{registered_model_name}' is registered.")
            raise # Stop the notebook here; dbutils is a Databricks utility, so re-raise instead of calling dbutils.notebook.exit


        # --- Make Predictions ---
        # The loaded_model is the entire pipeline, so it will handle preprocessing.
        predictions_df = loaded_model.transform(upcoming_appointments_raw_df)

        # Select relevant output columns
        # 'probability' is an ML Vector, so convert it to an array before indexing;
        # index 1 (zero-based) holds the probability of class 1 (readmission)
        from pyspark.ml.functions import vector_to_array

        scored_appointments_df = predictions_df.select(
            F.col("EncounterKey"),
            F.col("PatientKey"),
            F.col("prediction").alias("PredictedReadmissionLabel"), # 0 or 1
            vector_to_array(F.col("probability"))[1].alias("ReadmissionProbability") # Probability of class 1
        )

        print("Scored appointments:")
        scored_appointments_df.show(truncate=False)

        # --- Save Predictions ---
        scored_appointments_df.write.format("delta").mode("overwrite") \
            .saveAsTable(f"{lakehouse_name}.Gold_Upcoming_ReadmissionPredictions_E2E")
        print(f"Predictions saved to Gold_Upcoming_ReadmissionPredictions_E2E")
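
Because the loaded model is the whole preprocessing-plus-classifier pipeline, the scoring input needs every raw column that the pipeline was trained on. A small guard like the one below can catch a missing column before `transform` fails. This is a sketch: `REQUIRED_INPUT_COLUMNS` is assembled from the categorical and numerical column lists in the training notebook, and `missing_input_columns` is a hypothetical helper.

```python
# Raw columns the training pipeline reads (its categorical and numerical inputs).
REQUIRED_INPUT_COLUMNS = [
    "PrimaryDiagnosisCode",
    "EncounterType",
    "Gender",
    "LengthOfStayInDays",
    "PriorAdmissionsCount",
    "Age",
]


def missing_input_columns(available_columns, required=REQUIRED_INPUT_COLUMNS):
    """Return the required columns that are absent, in the order they are required."""
    available = set(available_columns)
    return [column for column in required if column not in available]


# Columns selected for scoring in this lab, plus the two identifier columns.
scoring_columns = [
    "EncounterKey", "PatientKey", "LengthOfStayInDays", "PrimaryDiagnosisCode",
    "EncounterType", "PriorAdmissionsCount", "Age", "Gender",
]
print(missing_input_columns(scoring_columns))
```

In the notebook, the same check could be called as `missing_input_columns(upcoming_appointments_raw_df.columns)` before scoring.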

---
**Phase 5: Reporting and Visualization**

* **Step 5.1: Create a New Power BI Report.**
    * In your Fabric workspace, create a new Power BI report named `ReadmissionRiskDashboard_E2E`.
* **Step 5.2: Connect to Data.**
    * Connect to the SQL analytics endpoint of your `HealthClinicLH_E2E_YourName` Lakehouse.
    * Load tables:
        * `Gold_Upcoming_ReadmissionPredictions_E2E`
        * `Dim_Patient_E2E`
        * `Fact_Encounter_E2E` (for historical context or joining to get more appointment details if needed for the predictions)
        * `Dim_Date_E2E`
* **Step 5.3: Create Relationships in Power BI Model.**
    * `Gold_Upcoming_ReadmissionPredictions_E2E[PatientKey]` to `Dim_Patient_E2E[PatientKey]`
    * (If joining predictions back to original encounter details) `Gold_Upcoming_ReadmissionPredictions_E2E[EncounterKey]` to `Fact_Encounter_E2E[EncounterKey]`
* **Step 5.4: Design Dashboard (Conceptual).**
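    * **KPIs:** Overall Predicted Readmission Rate (for example, the share of rows with `PredictedReadmissionLabel` = 1, or the average `ReadmissionProbability`), Number of High-Risk Patients (patients whose `ReadmissionProbability` exceeds a threshold agreed with the care teams).
    * **Table/List:** Upcoming appointments, sorted by `ReadmissionProbability` (descending), showing a patient identifier such as `MRN` (from `Dim_Patient_E2E`, which has no patient name column in this lab), predicted probability, and original appointment details (from `Fact_Encounter_E2E` if joined).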
    * **Charts:**
        * Distribution of `ReadmissionProbability` scores.
        * Count of high-risk patients by clinic (if `AssignedClinicID` is in `Dim_Patient_E2E` and joined).
        * Average `LengthOfStayInDays` or `PriorAdmissionsCount` for high-risk vs. low-risk predicted patients.
    * **Slicers:** By date (for upcoming appointments), by risk level, by clinic.
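
The KPI and risk-level ideas above come down to simple arithmetic over the scored table. In the report itself these would be Power BI measures; the Python sketch below only illustrates the calculation. The thresholds (0.6 and 0.3), the sample rows, and the helper names are all hypothetical and should be replaced with values agreed with clinical stakeholders.

```python
from statistics import mean

# Hypothetical cut-offs for illustration only; not clinically validated.
HIGH_RISK_THRESHOLD = 0.6
MEDIUM_RISK_THRESHOLD = 0.3


def risk_level(probability):
    """Map a readmission probability to a coarse risk band."""
    if probability >= HIGH_RISK_THRESHOLD:
        return "High"
    if probability >= MEDIUM_RISK_THRESHOLD:
        return "Medium"
    return "Low"


def dashboard_kpis(scored_rows):
    """Compute the headline KPIs from rows shaped like the predictions table."""
    probabilities = [row["ReadmissionProbability"] for row in scored_rows]
    labels = [row["PredictedReadmissionLabel"] for row in scored_rows]
    return {
        "PredictedReadmissionRate": mean(labels),
        "AverageReadmissionProbability": mean(probabilities),
        "HighRiskCount": sum(1 for p in probabilities if risk_level(p) == "High"),
    }


# Made-up rows using the column names of Gold_Upcoming_ReadmissionPredictions_E2E.
sample_scores = [
    {"PredictedReadmissionLabel": 1.0, "ReadmissionProbability": 0.72},
    {"PredictedReadmissionLabel": 0.0, "ReadmissionProbability": 0.41},
    {"PredictedReadmissionLabel": 0.0, "ReadmissionProbability": 0.12},
]
print([risk_level(row["ReadmissionProbability"]) for row in sample_scores])
print(dashboard_kpis(sample_scores))
```

A derived risk-level column like this is also what the "by risk level" slicer would filter on.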

---
**Phase 6: Governance and Security (Recap & Considerations)**

* **Microsoft Purview:**
    * Ensure the `E2E_ReadmissionAnalytics_YourName` workspace is scanned.
    * Classify `Dim_Patient_E2E` and `Fact_Encounter_E2E` with appropriate sensitivity labels (e.g., "Confidential - PHI").
    * Review lineage from Bronze data through Gold predictions to the Power BI report.
* **Access Control:**
    * Define workspace roles (Admin, Member, Contributor, Viewer) for different team members.
    * Share the `ReadmissionRiskDashboard_E2E` Power BI report with "View" access to care teams/schedulers.
    * Implement Row-Level Security (RLS) on the Power BI dataset so that users only see predictions for patients within their authorized scope (e.g., their clinic/department).
* **Auditing:** Use Fabric audit logs to monitor access to sensitive data and the dashboard.
* **MLOps (MLflow):** The use of MLflow for model versioning, registration, and tracking metrics is a core part of MLOps and governance for the ML solution.

This end-to-end lab provides a framework. Each phase can be expanded with more detailed data, complex feature engineering, more sophisticated model tuning, and more elaborate Power BI visualizations based on specific organizational needs.

---
*(Continuing to Lab 12.3 - Challenge Lab)*
