Merged
2 changes: 2 additions & 0 deletions .gitignore
@@ -26,3 +26,5 @@ __pycache__/

# Visual Studio working files
.vs
/demo-feature-store/diabetes_model.pkl
/demo-feature-store/feature_list.pkl
198 changes: 198 additions & 0 deletions demo-feature-store/README.md
@@ -0,0 +1,198 @@

-----

# NHS Diabetes Risk Calculator (Feature Store Demo)

This project is a functional, interactive web application that demonstrates the **Feature Store** approach in a healthcare context. It uses a synthetic dataset of patient records to train a diabetes risk model and provides a simple web interface to get real-time predictions.

The core idea is to show how a feature store separates the **data engineering** (creating features like "BMI") from the **data science** (training models) and the **application** (getting a real-time risk score). This creates a "single source of truth" for features, ensuring that the data used to train a model is the same as the data used for a real-time prediction.

This demo uses:

* **Feast:** An open-source feature store.
* **Flask:** A lightweight Python web server.
* **Scikit-learn:** For training the risk model.
* **Parquet / SQLite:** As the offline (training) and online (real-time) databases.
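
The "single source of truth" idea can be sketched in a few lines of plain Python. This is only an illustration, not how Feast is implemented: `compute_bmi`, `RAW_RECORDS`, `OFFLINE_ROWS`, and `ONLINE_STORE` are hypothetical names, and the records are made-up values. The point is that the feature is computed once and both the training path and the serving path read the same stored value.

```python
# Illustrative sketch only: a plain-Python stand-in for a feature store.
# All names and values here are hypothetical, not part of this repository.

def compute_bmi(weight_kg: float, height_m: float) -> float:
    """The single, shared definition of the BMI feature."""
    return round(weight_kg / height_m ** 2, 1)

# Raw records as they might arrive from a GP system (synthetic values).
RAW_RECORDS = [
    {"patient_id": 1, "weight_kg": 70.0, "height_m": 1.75},
    {"patient_id": 2, "weight_kg": 95.0, "height_m": 1.80},
]

# Data engineering: compute each feature once and publish it to both stores.
OFFLINE_ROWS = [
    {"patient_id": r["patient_id"],
     "bmi": compute_bmi(r["weight_kg"], r["height_m"])}
    for r in RAW_RECORDS
]
ONLINE_STORE = {row["patient_id"]: row for row in OFFLINE_ROWS}

def training_features() -> list:
    """Data science path: read every stored BMI value for training."""
    return [row["bmi"] for row in OFFLINE_ROWS]

def serving_features(patient_id: int) -> float:
    """Application path: look up one patient's BMI in real time."""
    return ONLINE_STORE[patient_id]["bmi"]
```

Because neither path recomputes BMI on its own, the value a model was trained on and the value it is served with cannot drift apart.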

## System Architecture

The application demonstrates the full MLOps loop, from data generation to real-time inference.

```mermaid
flowchart TD
subgraph DataEngineering["1. Data Engineering (lib.py / generate_data.py)"]
A["Add New Patients"] -- "Appends to" --> B(Offline Store patient_gp_data.parquet)
B -- "feast materialize" --> C(Online Store online_store.db)
end

subgraph DataScience["2. Data Science (lib.py / train_model.py)"]
B -- "get_historical_features" --> D["Train New Model (train_and_save_model)"]
D -- "Saves" --> E("Risk Model diabetes_model.pkl")
end

subgraph Application["3. Real-time Application (app.py)"]
F(User Web Browser) <--> G["Flask Web App app.py"]
G -- "Loads" --> E
G -- "get_online_features" --> C
G -- "Displays Prediction" --> F
end

F -- "Clicks Button" --> A
F -- "Clicks Button" --> D
```

-----

## Project Structure

Your repository should contain the following key files:

* **`app.py`**: The main Flask web server.
* **`lib.py`**: A library containing all core logic for data generation and model training.
* **`check_population.py`**: A utility script to analyze the risk distribution of your patient database.
* **`generate_data.py`**: A helper script to manually run data generation (optional, can be done from the app).
* **`train_model.py`**: A helper script to manually run model training (optional, can be done from the app).
* **`templates/index.html`**: The web page template.
* **`nhs_risk_calculator/`**: The Feast feature store repository.
    * **`definitions.py`**: The formal definitions of our patient features.
    * **`feature_store.yaml`**: The Feast configuration file.
* **`.gitignore`**: (Recommended) Excludes generated files (`*.db`, `*.pkl`, `*.parquet`, `.venv/`) from version control.

-----

## Local PC Setup

Follow these steps to get the application running locally.

1. **Clone the Repository:**

```bash
git clone <your-repo-url>
cd <your-repo-name>
```

2. **Create a Virtual Environment:**

```bash
python -m venv .venv
```

3. **Activate the Environment:**

* On Windows:
```bash
.venv\Scripts\activate
```
* On macOS/Linux:
```bash
source .venv/bin/activate
```

4. **Install Required Packages:**

```bash
pip install feast pandas faker scikit-learn flask joblib
```

5. **Generate Initial Data:**
You must create an initial dataset before the app can run. This script creates the first batch of patients, registers them with Feast, and populates the databases.

```bash
python generate_data.py
```

6. **Train the First Model:**
Now that you have data, you must train the first version of the risk model.

```bash
python train_model.py
```

7. **Run the Web App:**
You're all set. Start the Flask server:

```bash
python app.py
```

Now, open `http://127.0.0.1:5000` in your web browser.

-----

## Using the Application

The web app provides an interactive way to simulate the MLOps lifecycle.

### 1\. Calculate Patient Risk (The "GP View")

This is the main function of the app.

* **Action:** Enter a Patient ID (e.g., 1-500) and click "Calculate Risk".
* **What it does:** The app queries the **online store (SQLite)** for the *latest* features for that patient, feeds them into the loaded **model (.pkl file)**, and displays the resulting risk score (LOW/MEDIUM/HIGH).
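
The mapping from a predicted probability to a risk band can be sketched as follows. The two thresholds (0.35 and 0.15) are the ones defined in `app.py`; the helper names `risk_band` and `format_result` are hypothetical and exist only for this example.

```python
# Thresholds as defined in app.py; the function names are illustrative.
HIGH_RISK_THRESHOLD = 0.35    # 35%
MEDIUM_RISK_THRESHOLD = 0.15  # 15%

def risk_band(probability: float) -> str:
    """Map the model's probability of diabetes to one of three bands."""
    if probability >= HIGH_RISK_THRESHOLD:
        return "HIGH RISK"
    if probability >= MEDIUM_RISK_THRESHOLD:
        return "MEDIUM RISK"
    return "LOW RISK"

def format_result(probability: float) -> str:
    """Render the score as a percentage rounded to one decimal, plus its band."""
    return f"{round(probability * 100, 1)}% - {risk_band(probability)}"

print(format_result(0.5))
```

Both thresholds are inclusive, so a probability of exactly 0.15 is reported as MEDIUM RISK and exactly 0.35 as HIGH RISK.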

### 2\. Add New Patients (The "Data Engineering" View)

This simulates new patient data arriving in the health system.

* **Action:** Click the "Add 500 New Patients" button.
* **What it does:**
1. Generates 500 new synthetic patients and appends them to the **offline store (Parquet file)**.
2. Runs `feast materialize` to scan the offline store and update the **online store (SQLite)** with the latest features for all patients (including the new ones).
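
Conceptually, materialization copies the most recent feature values for each patient from the offline store into the online store. The sketch below is a simplified plain-Python model of that idea, not Feast's actual implementation; `materialize_latest`, the `event_timestamp` field name, and the sample rows are assumptions made for illustration.

```python
from datetime import datetime

def materialize_latest(offline_rows):
    """Simplified model: keep only the newest row per patient_id."""
    online = {}
    for row in offline_rows:
        current = online.get(row["patient_id"])
        if current is None or row["event_timestamp"] > current["event_timestamp"]:
            online[row["patient_id"]] = row
    return online

# Synthetic history: patient 10 has two records, patient 11 has one.
offline_rows = [
    {"patient_id": 10, "event_timestamp": datetime(2024, 1, 1), "bmi": 27.0},
    {"patient_id": 10, "event_timestamp": datetime(2024, 6, 1), "bmi": 28.4},
    {"patient_id": 11, "event_timestamp": datetime(2024, 3, 1), "bmi": 22.1},
]

online_store = materialize_latest(offline_rows)
```

The offline store keeps the full history (useful for training), while the online store holds one row per patient (useful for fast lookups at prediction time).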

### 3\. Retrain Risk Model (The "Data Science" View)

This simulates a data scientist updating the risk model with new data.

* **Action:** Click the "Retrain Risk Model" button.
* **What it does:**
1. Fetches the *entire patient history* from the **offline store (Parquet file)**.
2. Generates `True`/`False` diabetes labels for each patient based on their risk factors.
3. Trains a *new* `LogisticRegression` model on this fresh, complete dataset.
4. Saves the new model over the old `diabetes_model.pkl` and reloads it into the app's memory.

-----

## Test the Full Loop (A "What If" Scenario)

This is the best way to see the whole system in action. The predictions you see depend on two things: the **Patient's Data** and the **Model's "Brain"**. You can change both.

### The Experiment

Follow these steps to see how a prediction can change for the *same patient*.

1. **Get a Baseline:**

* Run the app (`python app.py`).
* Enter Patient ID **10**.
* Note their features (e.g., BMI, Age) and their risk (e.g., **28.5% - MEDIUM RISK**).

2. **Check the Population:**

* In your terminal (while the app is still running), run the `check_population.py` script:
```bash
python check_population.py
```
* Note the total number of HIGH risk patients (e.g., `HIGH RISK: 212`).

3. **Add More Data:**

* Go back to the browser and click the **"Add 500 New Patients"** button.
* After it reloads, click it **again**. You have now added 1,000 new patients to the database.
* **Test Patient 10 again:** Enter Patient ID **10**. Their risk score will be **identical (28.5% - MEDIUM RISK)**.
* **Why?** Because their personal data hasn't changed, and the *model is still the same old one*.

4. **Retrain the Model:**

* Now, click the **"Retrain Risk Model"** button.
* The app will now "learn" from the *entire* database, including the 1,000 new patients you added. This new data will slightly change the model's understanding of how features (like BMI) correlate with risk.

5. **See the Change:**

* **Test Patient 10 one last time:** Enter Patient ID **10**.
* Their risk score will typically have changed (e.g., it might now be **30.1% - MEDIUM RISK**).
* **Why?** Their personal data is the same, but the **model's "brain" has been updated**. It has a more refined understanding of risk, so its prediction for the *same patient* is now different.

6. **Check the Population Again:**

* Run `python check_population.py` one more time.
* You will see that the total number of LOW/MEDIUM/HIGH risk patients has changed, reflecting the new model's predictions across the entire population.
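
The reason the score moves in the experiment above can be sketched with a hand-written logistic model. This is an illustration only: the feature values and coefficients are made up, and `predict_risk`, `old_model`, and `new_model` are hypothetical names rather than anything the app produces. The patient's features stay fixed; only the coefficients differ between the two models.

```python
import math

def predict_risk(features: dict, weights: dict, bias: float) -> float:
    """Logistic regression by hand: sigmoid of the weighted feature sum."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# The patient's data does not change between the two predictions.
patient_10 = {"age": 54, "bmi": 29.0}

# Made-up coefficients standing in for the model before and after retraining.
old_model = {"weights": {"age": 0.020, "bmi": 0.050}, "bias": -3.40}
new_model = {"weights": {"age": 0.021, "bmi": 0.052}, "bias": -3.45}

old_score = predict_risk(patient_10, old_model["weights"], old_model["bias"])
new_score = predict_risk(patient_10, new_model["weights"], new_model["bias"])

print(f"before retraining: {old_score:.3f}, after retraining: {new_score:.3f}")
```

Small shifts in the learned coefficients are enough to move an individual score, and across a whole population they can move patients from one risk band to another.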
165 changes: 165 additions & 0 deletions demo-feature-store/app.py
@@ -0,0 +1,165 @@

import joblib
import pandas as pd
from flask import Flask, request, render_template, flash, redirect, url_for
from feast import FeatureStore
import os

from lib import (
    generate_and_save_data,
    run_feast_commands,
    train_and_save_model,
    MODEL_FILE,
    FEATURES_FILE,
    FEAST_REPO_PATH
)

app = Flask(__name__)
app.secret_key = os.urandom(24)

model = None
feature_list = None
feast_features = []
store = None

def load_resources():
    global model, feature_list, feast_features, store

    try:
        model = joblib.load(MODEL_FILE)
        feature_list = joblib.load(FEATURES_FILE)
        feast_features = [f"gp_records:{name}" for name in feature_list]
        print(f"Model and feature list loaded. Features: {feature_list}")
    except FileNotFoundError:
        print("WARNING: Model or feature list not found.")
        print("Please run train_model.py to generate them.")
        model = None
        feature_list = None

    try:
        store = FeatureStore(repo_path=FEAST_REPO_PATH)
        print("Connected to Feast feature store.")
    except Exception as e:
        print(f"FATAL: Could not connect to Feast feature store: {e}")
        store = None


@app.route('/')
def home():
    return render_template('index.html')

# ... (imports, app = Flask(...), load_resources(), home(), etc.) ...

@app.route('/predict', methods=['POST'])
def predict():
    """Handles the form submission and returns a prediction."""

    # We define two thresholds to create three levels
    HIGH_RISK_THRESHOLD = 0.35  # (35%)
    MEDIUM_RISK_THRESHOLD = 0.15  # (15%)

    if not model or not store or not feature_list:
        return render_template('index.html',
                               error="Server error: Model or feature store not loaded.")

    patient_id_str = request.form.get('patient_id', '').strip()
    if not patient_id_str.isdigit():
        return render_template('index.html', error="Invalid Patient ID. Must be a number.")

    patient_id = int(patient_id_str)

    try:
        entity_rows = [{"patient_id": patient_id}]
        online_features_dict = store.get_online_features(
            features=feast_features,
            entity_rows=entity_rows
        ).to_dict()

        features_df = pd.DataFrame(online_features_dict)

        if features_df.empty or features_df['patient_id'][0] is None:
            return render_template('index.html',
                                   error=f"Patient ID {patient_id} not found.")

        X_predict = features_df[feature_list]
        prediction_error = None

        if X_predict.isnull().values.any():
            prediction_error = "Patient data is incomplete. Prediction may be inaccurate."
            X_predict = X_predict.fillna(0)  # Fill with 0 for demo

        # 1. Get the probability of "True" (diabetes)
        probability_true = model.predict_proba(X_predict)[0][1]

        # 2. Compare against our new thresholds
        if probability_true >= HIGH_RISK_THRESHOLD:
            prediction_text = "HIGH RISK"
        elif probability_true >= MEDIUM_RISK_THRESHOLD:
            prediction_text = "MEDIUM RISK"
        else:
            prediction_text = "LOW RISK"

        # 3. Format the results
        risk_percent = round(probability_true * 100, 1)

        return render_template(
            'index.html',
            patient_id=patient_id,
            patient_data=X_predict.to_dict('records')[0],
            prediction=prediction_text,
            probability=risk_percent,
            error=prediction_error
        )

    except Exception as e:
        return render_template('index.html', error=f"An error occurred: {e}")

# ... (add_data(), retrain_model(), if __name__ == '__main__':, etc.) ...
@app.route('/add-data', methods=['POST'])
def add_data():
    """Generates new data, materializes it, and reloads the store."""
    global store
    try:
        generate_and_save_data()
        run_feast_commands()

        print("Reloading feature store...")
        store = FeatureStore(repo_path=FEAST_REPO_PATH)

        flash("Successfully added 500 new patients and updated feature store.", "success")
    except Exception as e:
        print(f"Error adding data: {e}")
        flash(f"Error adding data: {e}", "error")

    return redirect(url_for('home'))


@app.route('/retrain-model', methods=['POST'])
def retrain_model():
    """Retrains the model, saves it, and reloads it into the app."""
    global model, feature_list, feast_features, store

    try:
        train_and_save_model()

        model = joblib.load(MODEL_FILE)
        feature_list = joblib.load(FEATURES_FILE)
        feast_features = [f"gp_records:{name}" for name in feature_list]

        print("Reloading feature store...")
        store = FeatureStore(repo_path=FEAST_REPO_PATH)

        flash("Successfully retrained and reloaded the risk model.", "success")
    except Exception as e:
        print(f"Error retraining model: {e}")
        flash(f"Error retraining model: {e}", "error")

    return redirect(url_for('home'))


if __name__ == '__main__':
    load_resources()
    app.run(debug=True)