regulate-tech · Oct 24, 2025
diff --git a/.gitignore b/.gitignore
@@ -26,3 +26,5 @@ __pycache__/
 
 # Visual Studio working files
 .vs
+/demo-feature-store/diabetes_model.pkl
+/demo-feature-store/feature_list.pkl
diff --git a/demo-feature-store/README.md b/demo-feature-store/README.md
@@ -0,0 +1,198 @@
+
+-----
+
+# NHS Diabetes Risk Calculator (Feature Store Demo)
+
+This project is a functional, interactive web application that demonstrates the **Feature Store** approach in a healthcare context. It uses a synthetic dataset of patient records to train a diabetes risk model and provides a simple web interface to get real-time predictions.
+
+The core idea is to show how a feature store separates the **data engineering** (creating features like "BMI") from the **data science** (training models) and the **application** (getting a real-time risk score). This creates a "single source of truth" for features, ensuring that the data used to train a model is the same as the data used for a real-time prediction.
+
+This demo uses:
+
+  * **Feast:** An open-source feature store.
+  * **Flask:** A lightweight Python web server.
+  * **Scikit-learn:** For training the risk model.
+  * **Parquet / SQLite:** As the offline (training) and online (real-time) databases.
+
+## System Architecture
+
+The application demonstrates the full MLOps loop, from data generation to real-time inference.
+
+```mermaid
+flowchart TD
+    subgraph DataEngineering["1. Data Engineering (lib.py / generate_data.py)"]
+        A["Add New Patients"] -- "Appends to" --> B(Offline Store patient_gp_data.parquet)
+        B -- "feast materialize" --> C(Online Store online_store.db)
+    end
+
+    subgraph DataScience["2. Data Science (lib.py / train_model.py)"]
+        B -- "get_historical_features" --> D["Train New Model (train_and_save_model)"]
+        D -- "Saves" --> E("Risk Model diabetes_model.pkl")
+    end
+
+    subgraph Application["3. Real-time Application (app.py)"]
+        F(User Web Browser) <--> G["Flask Web App app.py"]
+        G -- "Loads" --> E
+        G -- "get_online_features" --> C
+        G -- "Displays Prediction" --> F
+    end
+
+    F -- "Clicks Button" --> A
+    F -- "Clicks Button" --> D
+```
+
+-----
+
+## Project Structure
+
+Your repository should contain the following key files:
+
+  * **`app.py`**: The main Flask web server.
+  * **`lib.py`**: A library containing all core logic for data generation and model training.
+  * **`check_population.py`**: A utility script to analyze the risk distribution of your patient database.
+  * **`generate_data.py`**: A helper script to manually run data generation (optional, can be done from the app).
+  * **`train_model.py`**: A helper script to manually run model training (optional, can be done from the app).
+  * **`templates/index.html`**: The web page template.
+  * **`nhs_risk_calculator/`**: The Feast feature store repository.
+      * **`definitions.py`**: The formal definitions of our patient features.
+      * **`feature_store.yaml`**: The Feast configuration file.
+  * **`.gitignore`**: (Recommended) Excludes generated files (`*.db`, `*.pkl`, `*.parquet`, `.venv/`) from version control.
+
+-----
+
+## Local PC Setup
+
+Follow these steps to get the application running locally.
+
+1.  **Clone the Repository:**
+
+    ```bash
+    git clone <your-repo-url>
+    cd <your-repo-name>
+    ```
+
+2.  **Create a Virtual Environment:**
+
+    ```bash
+    python -m venv .venv
+    ```
+
+3.  **Activate the Environment:**
+
+      * On Windows:
+        ```bash
+        .venv\Scripts\activate
+        ```
+      * On macOS/Linux:
+        ```bash
+        source .venv/bin/activate
+        ```
+
+4.  **Install Required Packages:**
+
+    ```bash
+    pip install feast pandas faker scikit-learn flask joblib
+    ```
+
+5.  **Generate Initial Data:**
+    You must create an initial dataset before the app can run. This script creates the first batch of patients, registers them with Feast, and populates the databases.
+
+    ```bash
+    python generate_data.py
+    ```
+
+6.  **Train the First Model:**
+    Now that you have data, you must train the first version of the risk model.
+
+    ```bash
+    python train_model.py
+    ```
+
+7.  **Run the Web App:**
+    You're all set. Start the Flask server:
+
+    ```bash
+    python app.py
+    ```
+
+    Now, open `http://127.0.0.1:5000` in your web browser.
+
+-----
+
+## Using the Application
+
+The web app provides an interactive way to simulate the MLOps lifecycle.
+
+### 1. Calculate Patient Risk (The "GP View")
+
+This is the main function of the app.
+
+  * **Action:** Enter a Patient ID (e.g., 1-500) and click "Calculate Risk".
+  * **What it does:** The app queries the **online store (SQLite)** for the *latest* features for that patient, feeds them into the loaded **model (.pkl file)**, and displays the resulting risk score (LOW/MEDIUM/HIGH).
+
+### 2. Add New Patients (The "Data Engineering" View)
+
+This simulates new patient data arriving in the health system.
+
+  * **Action:** Click the "Add 500 New Patients" button.
+  * **What it does:**
+    1.  Generates 500 new synthetic patients and appends them to the **offline store (Parquet file)**.
+    2.  Runs `feast materialize` to scan the offline store and update the **online store (SQLite)** with the latest features for all patients (including the new ones).
+
+### 3. Retrain Risk Model (The "Data Science" View)
+
+This simulates a data scientist updating the risk model with new data.
+
+  * **Action:** Click the "Retrain Risk Model" button.
+  * **What it does:**
+    1.  Fetches the *entire patient history* from the **offline store (Parquet file)**.
+    2.  Generates synthetic `True`/`False` diabetes labels from each patient's risk factors.
+    3.  Trains a *new* `LogisticRegression` model on this fresh, complete dataset.
+    4.  Saves the new model over the old `diabetes_model.pkl` and reloads it into the app's memory.
+
+-----
+
+## Test the Full Loop (A "What If" Scenario)
+
+This is the best way to see the whole system in action. The predictions you see depend on two things: the **Patient's Data** and the **Model's "Brain"**. You can change both.
+
+### The Experiment
+
+Follow these steps to see how a prediction can change for the *same patient*.
+
+1.  **Get a Baseline:**
+
+      * Run the app (`python app.py`).
+      * Enter Patient ID **10**.
+      * Note their features (e.g., BMI, Age) and their risk (e.g., **28.5% - MEDIUM RISK**).
+
+2.  **Check the Population:**
+
+      * In your terminal (while the app is still running), run the `check_population.py` script:
+        ```bash
+        python check_population.py
+        ```
+      * Note the total number of HIGH risk patients (e.g., `HIGH RISK: 212`).
+
+3.  **Add More Data:**
+
+      * Go back to the browser and click the **"Add 500 New Patients"** button.
+      * After it reloads, click it **again**. You have now added 1,000 new patients to the database.
+      * **Test Patient 10 again:** Enter Patient ID **10**. Their risk score will be **identical (28.5% - MEDIUM RISK)**.
+      * **Why?** Because their personal data hasn't changed, and the *model is still the same old one*.
+
+4.  **Retrain the Model:**
+
+      * Now, click the **"Retrain Risk Model"** button.
+      * The app will now "learn" from the *entire* database, including the 1,000 new patients you added. This new data will slightly change the model's understanding of how features (like BMI) correlate with risk.
+
+5.  **See the Change:**
+
+      * **Test Patient 10 one last time:** Enter Patient ID **10**.
+      * You will see that their risk score has changed! (e.g., it might now be **30.1% - MEDIUM RISK**).
+      * **Why?** Their personal data is the same, but the **model's "brain" has been updated**. It has a more refined understanding of risk, so its prediction for the *same patient* is now different.
+
+6.  **Check the Population Again:**
+
+      * Run `python check_population.py` one more time.
+      * You will see that the total number of LOW/MEDIUM/HIGH risk patients has changed, reflecting the new model's predictions across the entire population.
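
Review note on the README above: the "Add New Patients" section says `feast materialize` updates the online store with the latest features for every patient. The idea can be sketched in plain Python, assuming each offline row carries a patient ID and a timestamp. This is a conceptual illustration only, not how Feast implements materialization; the field names `event_timestamp` and `bmi` and the helper `materialize_latest` are hypothetical, since `definitions.py` is not part of this diff.

```python
from datetime import datetime

# Hypothetical offline rows: one dict per (patient, timestamp) observation.
# Field names are assumptions; the real schema lives in definitions.py.
offline_rows = [
    {"patient_id": 10, "event_timestamp": datetime(2025, 1, 1), "bmi": 27.4},
    {"patient_id": 10, "event_timestamp": datetime(2025, 6, 1), "bmi": 28.1},
    {"patient_id": 11, "event_timestamp": datetime(2025, 3, 1), "bmi": 22.0},
]


def materialize_latest(rows):
    """Return {patient_id: newest row}, mimicking what an online store holds."""
    online = {}
    for row in rows:
        current = online.get(row["patient_id"])
        # Keep the row only if it is newer than what we already have.
        if current is None or row["event_timestamp"] > current["event_timestamp"]:
            online[row["patient_id"]] = row
    return online


online_store = materialize_latest(offline_rows)
print(online_store[10]["bmi"])  # 28.1
```

The offline store keeps the full history (used for training), while the online store keeps one row per patient (used for real-time lookups), which is why adding new patients does not change the features returned for patient 10.
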
diff --git a/demo-feature-store/app.py b/demo-feature-store/app.py
@@ -0,0 +1,165 @@
+
+import joblib
+import pandas as pd
+from flask import Flask, request, render_template, flash, redirect, url_for
+from feast import FeatureStore
+import os
+
+from lib import (
+    generate_and_save_data,
+    run_feast_commands,
+    train_and_save_model,
+    MODEL_FILE,
+    FEATURES_FILE,
+    FEAST_REPO_PATH
+)
+
+app = Flask(__name__)
+app.secret_key = os.urandom(24) 
+
+model = None
+feature_list = None
+feast_features = []
+store = None
+
+def load_resources():
+    global model, feature_list, feast_features, store
+
+    try:
+        model = joblib.load(MODEL_FILE)
+        feature_list = joblib.load(FEATURES_FILE)
+        feast_features = [f"gp_records:{name}" for name in feature_list]
+        print(f"Model and feature list loaded. Features: {feature_list}")
+    except FileNotFoundError:
+        print("WARNING: Model or feature list not found.")
+        print("Please run train_model.py to generate them.")
+        model = None
+        feature_list = None
+
+    try:
+        store = FeatureStore(repo_path=FEAST_REPO_PATH)
+        print("Connected to Feast feature store.")
+    except Exception as e:
+        print(f"FATAL: Could not connect to Feast feature store: {e}")
+        store = None
+
+
+@app.route('/')
+def home():
+    return render_template('index.html')
+
+# --- Prediction endpoint ---
+
+@app.route('/predict', methods=['POST'])
+def predict():
+    """Handles the form submission and returns a prediction."""
+
+    # We define two thresholds to create three levels
+    HIGH_RISK_THRESHOLD = 0.35   # (35%)
+    MEDIUM_RISK_THRESHOLD = 0.15 # (15%)
+
+    if model is None or store is None or not feature_list:
+        return render_template('index.html', 
+                               error="Server error: Model or feature store not loaded.")
+
+    patient_id_str = request.form.get('patient_id', '').strip()
+    if not patient_id_str.isdigit():
+        return render_template('index.html', error="Invalid Patient ID. Must be a number.")
+
+    patient_id = int(patient_id_str)
+
+    try:
+        entity_rows = [{"patient_id": patient_id}]
+        online_features_dict = store.get_online_features(
+            features=feast_features,
+            entity_rows=entity_rows
+        ).to_dict()
+
+        features_df = pd.DataFrame(online_features_dict)
+        # Feast returns null features (not a null key) for unknown patients.
+        if features_df.empty or features_df[feature_list].isnull().values.all():
+            return render_template('index.html',
+                                   error=f"Patient ID {patient_id} not found.")
+
+        X_predict = features_df[feature_list]
+        prediction_error = None
+
+        if X_predict.isnull().values.any():
+            prediction_error = "Patient data is incomplete. Prediction may be inaccurate."
+            X_predict = X_predict.fillna(0) # Fill with 0 for demo
+
+
+        # 1. Get the probability of "True" (diabetes)
+        probability_true = model.predict_proba(X_predict)[0][1]
+
+        # 2. Compare against our new thresholds
+        if probability_true >= HIGH_RISK_THRESHOLD:
+            prediction_text = "HIGH RISK"
+        elif probability_true >= MEDIUM_RISK_THRESHOLD:
+            prediction_text = "MEDIUM RISK"
+        else:
+            prediction_text = "LOW RISK"
+
+        # 3. Format the results
+        risk_percent = round(probability_true * 100, 1)
+
+
+        return render_template(
+            'index.html',
+            patient_id=patient_id,
+            patient_data=X_predict.to_dict('records')[0],
+            prediction=prediction_text,
+            probability=risk_percent,
+            error=prediction_error
+        )
+
+    except Exception as e:
+        return render_template('index.html', error=f"An error occurred: {e}")
+
+# --- Data and model management endpoints ---
+@app.route('/add-data', methods=['POST'])
+def add_data():
+    """Generates new data, materializes it, and reloads the store."""
+    global store  
+    try:
+        generate_and_save_data()
+        run_feast_commands()
+
+        print("Reloading feature store...")
+        store = FeatureStore(repo_path=FEAST_REPO_PATH)
+
+        flash("Successfully added 500 new patients and updated feature store.", "success")
+    except Exception as e:
+        print(f"Error adding data: {e}")
+        flash(f"Error adding data: {e}", "error")
+
+    return redirect(url_for('home'))
+
+
+@app.route('/retrain-model', methods=['POST'])
+def retrain_model():
+    """Retrains the model, saves it, and reloads it into the app."""
+    global model, feature_list, feast_features, store 
+
+    try:
+        train_and_save_model()
+
+        model = joblib.load(MODEL_FILE)
+        feature_list = joblib.load(FEATURES_FILE)
+        feast_features = [f"gp_records:{name}" for name in feature_list]
+
+        print("Reloading feature store...")
+        store = FeatureStore(repo_path=FEAST_REPO_PATH)
+
+
+        flash("Successfully retrained and reloaded the risk model.", "success")
+    except Exception as e:
+        print(f"Error retraining model: {e}")
+        flash(f"Error retraining model: {e}", "error")
+
+    return redirect(url_for('home'))
+
+
+if __name__ == '__main__':
+    load_resources()
+    app.run(debug=True)
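
Review note on `predict()` above: the two-threshold banding is easy to get wrong at the boundaries, and inside the route it can only be exercised with a loaded model and a running feature store. One option is to factor it into a pure helper. The sketch below copies the threshold values from `predict()`; the function name `risk_band` is hypothetical and does not exist in this PR.

```python
# Thresholds copied from predict(); the helper itself is illustrative only.
HIGH_RISK_THRESHOLD = 0.35    # 35%
MEDIUM_RISK_THRESHOLD = 0.15  # 15%


def risk_band(probability: float) -> str:
    """Map a predicted diabetes probability to a LOW/MEDIUM/HIGH label."""
    if probability >= HIGH_RISK_THRESHOLD:
        return "HIGH RISK"
    if probability >= MEDIUM_RISK_THRESHOLD:
        return "MEDIUM RISK"
    return "LOW RISK"


# The README's example score of 28.5% falls in the medium band.
print(risk_band(0.285))  # MEDIUM RISK
```

Both thresholds are inclusive, so a probability of exactly 0.35 is HIGH and exactly 0.15 is MEDIUM, matching the `>=` comparisons in the route.
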