diff --git a/.gitignore b/.gitignore index 8db0495..dcf3083 100644 --- a/.gitignore +++ b/.gitignore @@ -26,3 +26,5 @@ __pycache__/ # Visual Studio working files .vs +/demo-feature-store/diabetes_model.pkl +/demo-feature-store/feature_list.pkl diff --git a/demo-feature-store/README.md b/demo-feature-store/README.md new file mode 100644 index 0000000..8ac760e --- /dev/null +++ b/demo-feature-store/README.md @@ -0,0 +1,198 @@ + +----- + +# NHS Diabetes Risk Calculator (Feature Store Demo) + +This project is a functional, interactive web application that demonstrates the **Feature Store** approach in a healthcare context. It uses a synthetic dataset of patient records to train a diabetes risk model and provides a simple web interface to get real-time predictions. + +The core idea is to show how a feature store separates the **data engineering** (creating features like "BMI") from the **data science** (training models) and the **application** (getting a real-time risk score). This creates a "single source of truth" for features, ensuring that the data used to train a model is the same as the data used for a real-time prediction. + +This demo uses: + + * **Feast:** An open-source feature store. + * **Flask:** A lightweight Python web server. + * **Scikit-learn:** For training the risk model. + * **Parquet / SQLite:** As the offline (training) and online (real-time) databases. + +## System Architecture + +The application demonstrates the full MLOps loop, from data generation to real-time inference. + +```mermaid +flowchart TD + subgraph DataEngineering["1. Data Engineering (lib.py / generate_data.py)"] + A["Add New Patients"] -- "Appends to" --> B(Offline Store patient_gp_data.parquet) + B -- "feast materialize" --> C(Online Store online_store.db_) + end + + subgraph DataScience["2. 
Data Science (lib.py / train_model.py)"] + B -- "get_historical_features" --> D["Train New Model (train_and_save_model)"] + D -- "Saves" --> E("Risk Model diabetes_model.pkl") + end + + subgraph Application["3. Real-time Application (app.py)"] + F(User Web Browser) <--> G["Flask Web App app.py"] + G -- "Loads" --> E + G -- "get_online_features" --> C + G -- "Displays Prediction" --> F + end + + F -- "Clicks Button" --> A + F -- "Clicks Button" --> D +``` + +----- + +## Project Structure + +Your repository should contain the following key files: + + * **`app.py`**: The main Flask web server. + * **`lib.py`**: A library containing all core logic for data generation and model training. + * **`check_population.py`**: A utility script to analyze the risk distribution of your patient database. + * **`generate_data.py`**: A helper script to manually run data generation (optional, can be done from the app). + * **`train_model.py`**: A helper script to manually run model training (optional, can be done from the app). + * **`templates/index.html`**: The web page template. + * **`nhs_risk_calculator/`**: The Feast feature store repository. + * **`definitions.py`**: The formal definitions of our patient features. + * **`feature_store.yaml`**: The Feast configuration file. + * **`.gitignore`**: (Recommended) To exclude generated files (`*.db`, `*.pkl`, `*.parquet`, `.venv/`) from GitHub. + +----- + +## Local PC Setup + +Follow these steps to get the application running locally. + +1. **Clone the Repository:** + + ```bash + git clone + cd + ``` + +2. **Create a Virtual Environment:** + + ```bash + python -m venv .venv + ``` + +3. **Activate the Environment:** + + * On Windows: + ```bash + .venv\Scripts\activate + ``` + * On macOS/Linux: + ```bash + source .venv/bin/activate + ``` + +4. **Install Required Packages:** + + ```bash + pip install feast pandas faker scikit-learn flask joblib + ``` + +5. 
**Generate Initial Data:** + You must create an initial dataset before the app can run. This script creates the first batch of 500 patients, runs `feast apply` to register the feature definitions, and materializes the data into the online store. + + ```bash + python generate_data.py + ``` + +6. **Train the First Model:** + Now that you have data, you must train the first version of the risk model. + + ```bash + python train_model.py + ``` + +7. **Run the Web App:** + You're all set. Start the Flask server: + + ```bash + python app.py + ``` + + Now, open `http://127.0.0.1:5000` in your web browser. + +----- + +## Using the Application + +The web app provides an interactive way to simulate the MLOps lifecycle. + +### 1\. Calculate Patient Risk (The "GP View") + +This is the main function of the app. + + * **Action:** Enter a Patient ID (e.g., 1-500) and click "Calculate Risk". + * **What it does:** The app queries the **online store (SQLite)** for the *latest* features for that patient, feeds them into the loaded **model (.pkl file)**, and displays the predicted probability with its risk band: LOW (below 15%), MEDIUM (15% to 35%), or HIGH (35% and above). + +### 2\. Add New Patients (The "Data Engineering" View) + +This simulates new patient data arriving in the health system. + + * **Action:** Click the "Add 500 New Patients" button. + * **What it does:** + 1. Generates 500 new synthetic patients and appends them to the **offline store (Parquet file)**. + 2. Runs `feast apply` and then `feast materialize` to scan the offline store and update the **online store (SQLite)** with the latest features for all patients (including the new ones). + +### 3\. Retrain Risk Model (The "Data Science" View) + +This simulates a data scientist updating the risk model with new data. + + * **Action:** Click the "Retrain Risk Model" button. + * **What it does:** + 1. Retrieves point-in-time features for every patient from the **offline store (Parquet file)** using `get_historical_features`. + 2. Generates synthetic `True`/`False` diabetes labels by sampling from a probability computed from each patient's risk factors (age, BMI, family history, hypertension). + 3. 
Trains a *new* `LogisticRegression` model on the rows that have complete features and a label. + 4. Saves the new model over the old `diabetes_model.pkl` (and the feature list over `feature_list.pkl`), then reloads both into the app's memory. + +----- + +## Test the Full Loop (A "What If" Scenario) + +This is the best way to see the whole system in action. The predictions you see depend on two things: the **Patient's Data** and the **Model's "Brain"**. You can change both. + +### The Experiment + +Follow these steps to see how a prediction can change for the *same patient*. + +1. **Get a Baseline:** + + * Run the app (`python app.py`). + * Enter Patient ID **10**. + * Note their features (e.g., BMI, Age) and their risk (e.g., **28.5% - MEDIUM RISK**). + +2. **Check the Population:** + + * In a second terminal (while the app is still running), run the `check_population.py` script: + ```bash + python check_population.py + ``` + * Note the total number of HIGH risk patients (e.g., `HIGH RISK: 212`). + +3. **Add More Data:** + + * Go back to the browser and click the **"Add 500 New Patients"** button. + * After it reloads, click it **again**. You have now added 1,000 new patients to the database. + * **Test Patient 10 again:** Enter Patient ID **10**. Their risk score will be **identical (28.5% - MEDIUM RISK)**. + * **Why?** Because their personal data hasn't changed, and the *model is still the same old one*. + +4. **Retrain the Model:** + + * Now, click the **"Retrain Risk Model"** button. + * The app will now "learn" from the *entire* database, including the 1,000 new patients you added. Because the training set is larger and the labels are re-sampled, the fitted coefficients for features such as BMI shift slightly. + +5. **See the Change:** + + * **Test Patient 10 one last time:** Enter Patient ID **10**. + * You will see that their risk score has changed (e.g., it might now be **30.1% - MEDIUM RISK**). + * **Why?** Their personal data is the same, but the **model's "brain" has been updated**. 
It was refit on a larger dataset, so its prediction for the *same patient* is now different. + +6. **Check the Population Again:** + + * Run `python check_population.py` one more time. + * You will see that the total number of LOW/MEDIUM/HIGH risk patients has changed, reflecting the new model's predictions across the entire population. \ No newline at end of file diff --git a/demo-feature-store/app.py b/demo-feature-store/app.py new file mode 100644 index 0000000..632947b --- /dev/null +++ b/demo-feature-store/app.py @@ -0,0 +1,165 @@ + +import joblib +import pandas as pd +from flask import Flask, request, render_template, flash, redirect, url_for +from feast import FeatureStore +import os + +from lib import ( + generate_and_save_data, + run_feast_commands, + train_and_save_model, + MODEL_FILE, + FEATURES_FILE, + FEAST_REPO_PATH +) + +app = Flask(__name__) +app.secret_key = os.urandom(24) + +model = None +feature_list = None +feast_features = [] +store = None + +def load_resources(): + global model, feature_list, feast_features, store + + try: + model = joblib.load(MODEL_FILE) + feature_list = joblib.load(FEATURES_FILE) + feast_features = [f"gp_records:{name}" for name in feature_list] + print(f"Model and feature list loaded. Features: {feature_list}") + except FileNotFoundError: + print("WARNING: Model or feature list not found.") + print("Please run train_model.py to generate them.") + model = None + feature_list = None + + try: + store = FeatureStore(repo_path=FEAST_REPO_PATH) + print("Connected to Feast feature store.") + except Exception as e: + print(f"FATAL: Could not connect to Feast feature store: {e}") + store = None + + +@app.route('/') +def home(): + return render_template('index.html') + +# --- Prediction route --- 
+ +@app.route('/predict', methods=['POST']) +def predict(): + """Handles the form submission and returns a prediction.""" + + # We define two thresholds to create three levels + HIGH_RISK_THRESHOLD = 0.35 # (35%) + MEDIUM_RISK_THRESHOLD = 0.15 # (15%) + + if not model or not store or not feature_list: + return render_template('index.html', + error="Server error: Model or feature store not loaded.") + + patient_id_str = request.form.get('patient_id', '').strip() + if not patient_id_str.isdigit(): + return render_template('index.html', error="Invalid Patient ID. Must be a number.") + + patient_id = int(patient_id_str) + + try: + entity_rows = [{"patient_id": patient_id}] + online_features_dict = store.get_online_features( + features=feast_features, + entity_rows=entity_rows + ).to_dict() + + features_df = pd.DataFrame(online_features_dict) + + if features_df.empty or features_df['patient_id'][0] is None: + return render_template('index.html', + error=f"Patient ID {patient_id} not found.") + + X_predict = features_df[feature_list] + prediction_error = None + + if X_predict.isnull().values.any(): + prediction_error = "Patient data is incomplete. Prediction may be inaccurate." + X_predict = X_predict.fillna(0) # Fill with 0 for demo + + + # 1. Get the probability of "True" (diabetes) + probability_true = model.predict_proba(X_predict)[0][1] + + # 2. Compare against the risk thresholds + if probability_true >= HIGH_RISK_THRESHOLD: + prediction_text = "HIGH RISK" + elif probability_true >= MEDIUM_RISK_THRESHOLD: + prediction_text = "MEDIUM RISK" + else: + prediction_text = "LOW RISK" + + # 3. Format the results + risk_percent = round(probability_true * 100, 1) + + + return render_template( + 'index.html', + patient_id=patient_id, + patient_data=X_predict.to_dict('records')[0], + prediction=prediction_text, + probability=risk_percent, + error=prediction_error + ) + + except Exception as e: + return render_template('index.html', error=f"An error occurred: {e}") + +# --- Admin routes: add data and retrain the model ---
+@app.route('/add-data', methods=['POST']) +def add_data(): + """Generates new data, materializes it, and reloads the store.""" + global store + try: + generate_and_save_data() + run_feast_commands() + + print("Reloading feature store...") + store = FeatureStore(repo_path=FEAST_REPO_PATH) + + flash("Successfully added 500 new patients and updated feature store.", "success") + except Exception as e: + print(f"Error adding data: {e}") + flash(f"Error adding data: {e}", "error") + + return redirect(url_for('home')) + + +@app.route('/retrain-model', methods=['POST']) +def retrain_model(): + """Retrains the model, saves it, and reloads it into the app.""" + global model, feature_list, feast_features, store + + try: + train_and_save_model() + + model = joblib.load(MODEL_FILE) + feature_list = joblib.load(FEATURES_FILE) + feast_features = [f"gp_records:{name}" for name in feature_list] + + print("Reloading feature store...") + store = FeatureStore(repo_path=FEAST_REPO_PATH) + + + flash("Successfully retrained and reloaded the risk model.", "success") + except Exception as e: + print(f"Error retraining model: {e}") + flash(f"Error retraining model: {e}", "error") + + return redirect(url_for('home')) + + +if __name__ == '__main__': + load_resources() + app.run(debug=True) \ No newline at end of file diff --git a/demo-feature-store/check_population.py b/demo-feature-store/check_population.py new file mode 100644 index 0000000..1ea9ebf --- /dev/null +++ b/demo-feature-store/check_population.py @@ -0,0 +1,126 @@ +import pandas as pd +import joblib +from feast import FeatureStore +import os +import sys + +# --- Configuration --- +# Ensure these match your app.py and lib.py settings +MODEL_FILE = "diabetes_model.pkl" +FEATURES_FILE = "feature_list.pkl" +FEAST_REPO_PATH = "nhs_risk_calculator" +DATA_FILE = "nhs_risk_calculator/data/patient_gp_data.parquet" + +# These thresholds MUST match what is in app.py 
+HIGH_RISK_THRESHOLD = 0.35 # (35%) +MEDIUM_RISK_THRESHOLD = 0.15 # (15%) + +def categorize_risk(prob): + """Assigns a risk category based on probability.""" + if prob >= HIGH_RISK_THRESHOLD: + return "HIGH" + elif prob >= MEDIUM_RISK_THRESHOLD: + return "MEDIUM" + else: + return "LOW" + +def run_population_check(): + """ + Loads all patients from the feature store, runs predictions, + and prints a summary of the risk distribution. + """ + print("--- Running Population Risk Analysis ---") + + # --- 1. Load Model and Features --- + try: + print(f"Loading model from {MODEL_FILE}...") + model = joblib.load(MODEL_FILE) + feature_list = joblib.load(FEATURES_FILE) + feast_features = [f"gp_records:{name}" for name in feature_list] + except FileNotFoundError: + print(f"ERROR: Model or feature file not found.", file=sys.stderr) + print("Please run 'Retrain Risk Model' first.", file=sys.stderr) + return + + # --- 2. Connect to Feast and Get All Patient IDs --- + try: + print("Connecting to feature store...") + store = FeatureStore(repo_path=FEAST_REPO_PATH) + + print(f"Reading patient list from {DATA_FILE}...") + patient_data = pd.read_parquet(DATA_FILE) + patient_ids = patient_data['patient_id'].unique() + + if len(patient_ids) == 0: + print("No patients found in data file.") + return + + print(f"Found {len(patient_ids)} unique patients.") + # Create the entity_rows structure Feast expects + entity_rows = [{"patient_id": int(pid)} for pid in patient_ids] + + except Exception as e: + print(f"Error connecting to store or reading data: {e}", file=sys.stderr) + return + + # --- 3. 
Get Online Features for ALL Patients --- + try: + print("Retrieving online features for all patients...") + online_features_dict = store.get_online_features( + features=feast_features, + entity_rows=entity_rows + ).to_dict() + + features_df = pd.DataFrame(online_features_dict) + # Drop any patients Feast couldn't find (should be 0) + features_df = features_df.dropna(subset=['patient_id']) + + except Exception as e: + print(f"Error getting online features: {e}", file=sys.stderr) + return + + # --- 4. Prepare Data and Handle Missing Values --- + + # Get just the feature columns in the correct order + X_predict = features_df[feature_list] + total_patients = len(X_predict) + + # Find rows with ANY null (missing) features + missing_data_mask = X_predict.isnull().any(axis=1) + num_missing = missing_data_mask.sum() + + # Create a clean DataFrame with only complete records + X_complete = X_predict[~missing_data_mask] + + print(f"\nRetrieved features for {total_patients} patients.") + print(f" - {len(X_complete)} patients have complete data.") + print(f" - {num_missing} patients have missing data (will be excluded).") + + if len(X_complete) == 0: + print("\nNo patients with complete data. Cannot generate report.") + return + + # --- 5. Run Predictions and Categorize --- + print(f"Running predictions on {len(X_complete)} patients...") + + # Get the probability of "True" (diabetes) + probabilities = model.predict_proba(X_complete)[:, 1] + + # Assign categories + risk_groups = [categorize_risk(p) for p in probabilities] + + # Count the results + counts = pd.Series(risk_groups).value_counts() + + # --- 6. 
Display Report --- + print("\n--- Population Risk Profile ---") + print(f"Total Patients Scored: {len(X_complete)}\n") + print("Risk Group Counts:") + print("---------------------------------") + print(f" HIGH RISK (>= {HIGH_RISK_THRESHOLD*100: >2.0f}%): \t{counts.get('HIGH', 0)}") + print(f" MEDIUM RISK ({MEDIUM_RISK_THRESHOLD*100: >2.0f}-{HIGH_RISK_THRESHOLD*100:2.0f}%): \t{counts.get('MEDIUM', 0)}") + print(f" LOW RISK (< {MEDIUM_RISK_THRESHOLD*100: >2.0f}%): \t{counts.get('LOW', 0)}") + print("---------------------------------") + +if __name__ == "__main__": + run_population_check() \ No newline at end of file diff --git a/demo-feature-store/generate_data.py b/demo-feature-store/generate_data.py new file mode 100644 index 0000000..0d568ba --- /dev/null +++ b/demo-feature-store/generate_data.py @@ -0,0 +1,17 @@ + +from lib import generate_and_save_data, run_feast_commands +import sys + +if __name__ == "__main__": + try: + print("--- Running Data Generation ---") + generate_and_save_data() + + print("\n--- Updating Feature Store ---") + run_feast_commands() + + print("\n--- Process Complete ---") + + except Exception as e: + print(f"An error occurred: {e}", file=sys.stderr) + sys.exit(1) \ No newline at end of file diff --git a/demo-feature-store/lib.py b/demo-feature-store/lib.py new file mode 100644 index 0000000..b597503 --- /dev/null +++ b/demo-feature-store/lib.py @@ -0,0 +1,227 @@ + +import pandas as pd +import numpy as np +import random +from faker import Faker +from datetime import datetime, timedelta +import os +import subprocess +import joblib + +from feast import FeatureStore +from sklearn.linear_model import LogisticRegression + +# --- Configuration --- +DATA_DIR = "nhs_risk_calculator/data" +DATA_FILE = os.path.join(DATA_DIR, "patient_gp_data.parquet") +FEAST_REPO_PATH = "nhs_risk_calculator" +MODEL_FILE = "diabetes_model.pkl" +FEATURES_FILE = "feature_list.pkl" + +# --- 1. Data Generation Logic ---
+ +def _create_synthetic_records(num_patients=500, records_per_patient=5): + """Generates a DataFrame of new synthetic patient data.""" + print(f"Generating data for {num_patients} new patients...") + + try: + existing_df = pd.read_parquet(DATA_FILE) + start_id = existing_df['patient_id'].max() + 1 + except FileNotFoundError: + start_id = 1 + + patient_ids = list(range(start_id, start_id + num_patients)) + all_records = [] + fake = Faker() + + for pid in patient_ids: + base_age = random.randint(25, 70) + base_bmi = round(random.uniform(18.0, 40.0), 1) + base_bp = round(random.uniform(110, 160), 0) + has_hypertension = (base_bp > 140) or (random.random() < 0.15) + family_history = random.random() < 0.2 + current_time = datetime.now() - timedelta(days=365 * 2) + + for i in range(records_per_patient): + event_ts = current_time + timedelta(days=i * 90) + record_age = base_age + (i * 0.25) + record_bmi = round(base_bmi + random.uniform(-0.5, 0.5), 1) + record_bp = round(base_bp + random.uniform(-5, 5), 0) + + all_records.append({ + "patient_id": pid, + "event_timestamp": event_ts, + "created_timestamp": datetime.now(), + "patient_age_years": int(record_age), + "patient_bmi_latest": record_bmi, + "patient_systolic_bp_avg_12months": record_bp, + "patient_has_hypertension": has_hypertension, + "patient_family_history_diabetes": family_history, + }) + + df = pd.DataFrame(all_records) + df["patient_age_years"] = df["patient_age_years"].astype(np.int64) + df["patient_bmi_latest"] = df["patient_bmi_latest"].astype(np.float32) + df["patient_systolic_bp_avg_12months"] = df["patient_systolic_bp_avg_12months"].astype(np.float32) + return df + +def generate_and_save_data(): + """ + Generates new data and appends it to the existing Parquet file. 
+ """ + if not os.path.exists(DATA_DIR): + os.makedirs(DATA_DIR) + + new_data = _create_synthetic_records() + + try: + existing_data = pd.read_parquet(DATA_FILE) + combined_data = pd.concat([existing_data, new_data], ignore_index=True) + print(f"Appending {len(new_data)} new records. Total records: {len(combined_data)}") + except FileNotFoundError: + combined_data = new_data + print(f"Creating new data file with {len(combined_data)} records.") + + combined_data.to_parquet(DATA_FILE, index=False) + print(f"Data saved to {DATA_FILE}") + +def run_feast_commands(): + """Runs 'apply' and 'materialize' to update the feature store.""" + print("Running 'feast apply'...") + subprocess.run(["feast", "apply"], cwd=FEAST_REPO_PATH, check=True) + + print("Running 'feast materialize' (this may take a moment)...") + + start_date = "2010-01-01T00:00:00" + end_date = "2099-01-01T00:00:00" + + subprocess.run( + ["feast", "materialize", start_date, end_date], + cwd=FEAST_REPO_PATH, + check=True + ) + + print("Feast store updated.") + + +# --- 2. Model Training Logic --- + +def calculate_diabetes_label(row): + """ + Calculates a realistic diabetes label based on feature values. + Returns True (has diabetes) or False (does not). + """ + # Handle missing data - if any risk factor is missing, return None + if row.isnull().any(): + return None + + # 1. Start with a base log-odds (e.g., -4.5 is a good test base) + # This is our 'intercept' + score = -4.5 + + # 2. Add weighted factors (our "correlations") + + # Age: Add 0.075 for each year over 40 + score += max(0, row['patient_age_years'] - 40) * 0.075 + + # BMI: Add 0.15 for each BMI point over 25 (overweight) + score += max(0, row['patient_bmi_latest'] - 25) * 0.15 + + # Family History: Add a 1.2 "point" bonus (a strong indicator) + if row['patient_family_history_diabetes']: + score += 1.2 + + # Hypertension: Add a 0.8 "point" bonus + if row['patient_has_hypertension']: + score += 0.8 + + # 3. 
Convert log-odds 'score' to probability (0.0 to 1.0) + # This is the logistic/sigmoid function + probability = 1 / (1 + np.exp(-score)) + + # 4. Use this probability to make a weighted random choice + # If prob is 0.3 (30%), this will return True ~30% of the time. + return random.random() < probability + + +def train_and_save_model(): + """ + Fetches latest data from Feast, generates realistic labels, + trains a new model, and saves it to disk. + """ + print("Connecting to Feast store...") + store = FeatureStore(repo_path=FEAST_REPO_PATH) + + print(f"Reading patient list from {DATA_FILE}...") + try: + patient_data = pd.read_parquet(DATA_FILE) + patient_ids = patient_data['patient_id'].unique() + if len(patient_ids) == 0: + raise Exception("No patients found in data file.") + except FileNotFoundError: + print("No data file found. Cannot train model.") + raise Exception("Parquet data file not found. Run 'Add Data' first.") + + print(f"Creating 'entity_df' for {len(patient_ids)} patients...") + + # 1. Create the "scaffolding" DataFrame WITHOUT the random label + entity_df = pd.DataFrame({ + "patient_id": patient_ids, + "event_timestamp": [ + datetime.now() - timedelta(days=random.randint(365, 365*3)) + for _ in patient_ids + ] + }) + + features_to_get = [ + "gp_records:patient_age_years", + "gp_records:patient_bmi_latest", + "gp_records:patient_systolic_bp_avg_12months", + "gp_records:patient_has_hypertension", + "gp_records:patient_family_history_diabetes", + ] + + print("Retrieving historical features from Feast...") + # 2. Get the features for our "scaffolding" + training_data_job = store.get_historical_features( + entity_df=entity_df, # Use the label-less df + features=features_to_get, + ) + training_df = training_data_job.to_df() + + print("Calculating realistic diabetes labels based on features...") + + # 3. Get the plain feature names (without the 'gp_records:' prefix) + features = [f.split(":")[1] for f in features_to_get] + + # 4. 
Apply the label logic row by row to create the target column + training_df["developed_diabetes_5yr"] = training_df[features].apply( + calculate_diabetes_label, + axis=1 + ) + + # Now 'training_df' has features AND a realistic, correlated label + target = "developed_diabetes_5yr" + + print("Cleaning data and training model...") + # 5. Drop rows where any feature was missing OR our label could not be calculated + training_df_clean = training_df.dropna(subset=features + [target]).copy() + + if training_df_clean.empty: + print("No complete data rows to train on. Aborting.") + raise Exception("No complete data rows to train on.") + + X_train = training_df_clean[features] + y_train = training_df_clean[target].astype(bool) # labels may be object dtype (True/False/None); cast for scikit-learn + print(f"Training on {len(X_train)} complete rows.") + + model = LogisticRegression() + model.fit(X_train, y_train) + print("Model training complete.") + + print(f"Saving model to {MODEL_FILE}...") + joblib.dump(model, MODEL_FILE) + + print(f"Saving feature list to {FEATURES_FILE}...") + joblib.dump(features, FEATURES_FILE) + print("Model and features saved.") \ No newline at end of file diff --git a/demo-feature-store/nhs_risk_calculator/definitions.py b/demo-feature-store/nhs_risk_calculator/definitions.py new file mode 100644 index 0000000..479411a --- /dev/null +++ b/demo-feature-store/nhs_risk_calculator/definitions.py @@ -0,0 +1,48 @@ +import pandas as pd +from datetime import timedelta + +from feast import ( + Entity, + FeatureView, + Field, + FileSource, + ValueType, # Used for the Entity's value_type +) +# --- Feature dtypes --- +# Import the specific type classes used in the schema below +from feast.types import Bool, Float32, Int64 + +# --- 1. Define Data Source --- +patient_gp_data_source = FileSource( + name="patient_gp_data_source", + path="data/patient_gp_data.parquet", + timestamp_field="event_timestamp", + created_timestamp_column="created_timestamp", +) + +# --- 2. 
Define Entity --- +# ValueType sets the type of the entity's join key +patient = Entity( + name="patient_id", + value_type=ValueType.INT64, + description="Unique patient identifier" +) + +# --- 3. Define the Feature View --- +gp_records_view = FeatureView( + name="gp_records", + entities=[patient], + ttl=timedelta(days=365 * 10), + schema=[ + # --- Feature schema --- + # Each Field uses a type class imported from feast.types + Field(name="patient_age_years", dtype=Int64), + Field(name="patient_bmi_latest", dtype=Float32), + Field(name="patient_systolic_bp_avg_12months", dtype=Float32), + Field(name="patient_has_hypertension", dtype=Bool), + Field(name="patient_family_history_diabetes", dtype=Bool), + ], + online=True, + source=patient_gp_data_source, + tags={"owner": "clinical_data_team"}, +) \ No newline at end of file diff --git a/demo-feature-store/nhs_risk_calculator/feature_store.yaml b/demo-feature-store/nhs_risk_calculator/feature_store.yaml new file mode 100644 index 0000000..10821d3 --- /dev/null +++ b/demo-feature-store/nhs_risk_calculator/feature_store.yaml @@ -0,0 +1,8 @@ +project: nhs_risk_calculator +registry: data/registry.db +provider: local +online_store: + type: sqlite + path: data/online_store.db # Feast will create this file +offline_store: + type: file # Use the FileSource (Parquet) defined in definitions.py \ No newline at end of file diff --git a/demo-feature-store/requirements.txt b/demo-feature-store/requirements.txt new file mode 100644 index 0000000..1465495 --- /dev/null +++ b/demo-feature-store/requirements.txt @@ -0,0 +1,14 @@ +# Install all the required modules with "pip install -r requirements.txt" + +# -- Main Application -- +flask # The web server +joblib # For saving/loading the model + +# -- Data & ML -- +pandas # For data manipulation +numpy # Required for scikit-learn and pandas; also imported by lib.py +scikit-learn # For the machine learning model +faker # For generating synthetic data + +# -- Feature Store -- +feast # The feature store \ No newline at end of file diff --git 
a/demo-feature-store/templates/index.html b/demo-feature-store/templates/index.html new file mode 100644 index 0000000..7479f8c --- /dev/null +++ b/demo-feature-store/templates/index.html @@ -0,0 +1,106 @@ + + + + + + NHS Risk Calculator + + + +
+ + {% with messages = get_flashed_messages(with_categories=true) %} + {% if messages %} +
    + {% for category, message in messages %} +
  • {{ message }}
  • + {% endfor %} +
+ {% endif %} + {% endwith %} + +

NHS Diabetes Risk Calculator (Demo)

+ +
+
+ + +
+ +
+ + {% if error and not prediction %} +

Error: {{ error }}

+ {% endif %} + + {% if prediction %} + +
+

Risk Assessment for Patient ID: {{ patient_id }}

+

Prediction: {{ prediction }}

+

Probability of Diabetes (5-yr): {{ probability }}%

+ +

Patient Features Used:

+ + + + + + {% for key, value in patient_data.items() %} + + + + + {% endfor %} +
FeatureValue
{{ key }}{{ value }}
+ {% if error %} +

Note: {{ error }}

+ {% endif %} +
+ {% endif %} + +
+

Admin Tools

+
+ +
+
+ +
+
+ +
+ + \ No newline at end of file diff --git a/demo-feature-store/train_model.py b/demo-feature-store/train_model.py new file mode 100644 index 0000000..79414ed --- /dev/null +++ b/demo-feature-store/train_model.py @@ -0,0 +1,14 @@ +# Command-line entry point: trains and saves the risk model + +from lib import train_and_save_model +import sys + +if __name__ == "__main__": + try: + print("--- Running Model Training ---") + train_and_save_model() + print("\n--- Process Complete ---") + + except Exception as e: + print(f"An error occurred: {e}", file=sys.stderr) + sys.exit(1) \ No newline at end of file
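
Review note: the label rule in `calculate_diabetes_label` (lib.py) and the risk bands in `app.py` / `check_population.py` can be checked by hand. The sketch below reproduces only the deterministic part of that rule (the log-odds score and the sigmoid) and then applies the 15% / 35% thresholds from the diff. The function name `label_probability` and the sample patient are hypothetical, chosen for illustration. The app applies its thresholds to the fitted model's `predict_proba` output rather than to this generating probability, so the band shown here only approximates what the trained model would report.

```python
import math

# Thresholds copied from app.py / check_population.py in the diff above.
HIGH_RISK_THRESHOLD = 0.35
MEDIUM_RISK_THRESHOLD = 0.15


def label_probability(age, bmi, family_history, hypertension):
    """Deterministic part of calculate_diabetes_label: log-odds to probability.

    Hypothetical helper for illustration; the diff's function additionally
    draws a random True/False label from this probability.
    """
    score = -4.5                          # intercept (base log-odds)
    score += max(0, age - 40) * 0.075     # each year over 40
    score += max(0, bmi - 25) * 0.15      # each BMI point over 25
    if family_history:
        score += 1.2
    if hypertension:
        score += 0.8
    return 1 / (1 + math.exp(-score))     # sigmoid


def categorize_risk(prob):
    """Same banding as check_population.categorize_risk."""
    if prob >= HIGH_RISK_THRESHOLD:
        return "HIGH"
    if prob >= MEDIUM_RISK_THRESHOLD:
        return "MEDIUM"
    return "LOW"


# Hypothetical patient: age 60, BMI 32, family history, no hypertension.
p = label_probability(60, 32.0, True, False)
print(f"{p:.3f} -> {categorize_risk(p)}")
```

For this patient the score is -4.5 + 20(0.075) + 7(0.15) + 1.2 = -0.75, which gives a probability of about 0.32 and so falls in the MEDIUM band. Keeping the thresholds in one shared module (for example `lib.py`) would also remove the "MUST match" comment that currently ties the two files together.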