Purpose: To simulate and test the logic for Phase 1.

Content: Loads a sample of raw user data, formats it into a text prompt (like in the example), makes a call to an LLM API (Ollama), and displays the generated preliminary insight.

#### Objective
This notebook simulates Phase 1 of the CrohnCare ML approach. We will act as the "backend" that generates preliminary insights for the user.

Our process will be:
- Load & Merge Data: Load meals.csv and symptoms.csv and combine them into a single chronological event log.
- Prepare DataFrame: Add predicted and actual columns to store LLM outputs and future user feedback.
- Engineer a Few-Shot Prompt: Create a function that builds a dynamic prompt. This prompt will include past prediction examples to give the LLM context, asking it to predict a trigger for a new symptom.
- Simulate LLM Interaction: Loop through symptom events, generate a prompt for each, and use a mock LLM function to populate the predicted column.



In [30]:
import pandas as pd
import numpy as np

In [31]:
try:
    df_user = pd.read_csv("data/raw/user.csv")
    df_meals = pd.read_csv("data/raw/meals.csv")
    df_symptoms = pd.read_csv("data/raw/symptoms.csv")
    print("All data files loaded successfully!")
except FileNotFoundError as e:
    print(f"Error: {e}. Make sure you have run the '0_data_generation.ipynb' notebook first.")

All data files loaded successfully!


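#### Prepare Data
Parse the timestamps and round them to the nearest 12 hours, so that meals and symptoms logged in the same half-day share one timestamp.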

In [32]:
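# Parse timestamps; unparseable values become NaT instead of raising.
df_meals['timestamp'] = pd.to_datetime(df_meals['timestamp'], errors='coerce')
df_symptoms['timestamp'] = pd.to_datetime(df_symptoms['timestamp'], errors='coerce')

# Round to the nearest 12 hours ('12h' replaces the deprecated '12H' alias).
df_meals['timestamp'] = df_meals['timestamp'].dt.round('12h')
df_symptoms['timestamp'] = df_symptoms['timestamp'].dt.round('12h')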
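
To make the effect of the 12-hour rounding concrete, here is a small sketch on made-up timestamps (the values are illustrative and not taken from meals.csv). It assumes a pandas version that accepts the lowercase `'12h'` alias.

```python
import pandas as pd

# Illustrative timestamps only; these are not taken from the dataset.
sample = pd.Series(pd.to_datetime([
    "2025-01-02 08:15",  # morning meal
    "2025-01-02 13:00",  # midday meal
    "2025-01-02 19:30",  # evening meal
]))

# Round each timestamp to the nearest 12-hour boundary (00:00 or 12:00).
rounded = sample.dt.round("12h")

for original, bucket in zip(sample, rounded):
    print(f"{original} -> {bucket}")
```

The evening entry rounds forward to midnight of the next day, which would explain why dinners in the timeline carry a 00:00:00 timestamp.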


Tag each row with its event type so that meal and symptom rows stay distinguishable after they are combined by timestamp.

In [33]:
df_meals['event_type'] = 'meal'
df_symptoms['event_type'] = 'symptom'

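Combine the meal and symptom DataFrames into a new DataFrame, matching rows on timestamp. With a tolerance of 0 seconds, merge_asof only pairs a meal with a symptom that has the identical rounded timestamp and the same user_id; meals without a match keep empty symptom columns.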

In [34]:
df_timeline = pd.merge_asof(
    df_meals.sort_values('timestamp'),
    df_symptoms.sort_values('timestamp'),
    on='timestamp',
    by='user_id',
    direction='nearest',
    tolerance=pd.Timedelta('0s'),
    suffixes=('', '_symptom')
)
df_timeline.iloc[5:10]

Unnamed: 0,meal_id,user_id,timestamp,meal_type,food_name,food_tags,event_type,symptom_id,symptom,severity,event_type_symptom
5,meal_6,user_123,2025-01-03 00:00:00,Dinner,Cereal with Milk,"gluten, dairy, processed, sugary",meal,symp_2,Bloating,5.0,symptom
6,meal_7,user_123,2025-01-03 12:00:00,Breakfast,Large Apple,"high-fiber, fruit, raw",meal,symp_3,Fatigue,4.0,symptom
7,meal_8,user_123,2025-01-03 12:00:00,Lunch,Grilled Chicken Breast,"low-fodmap, protein",meal,symp_3,Fatigue,4.0,symptom
8,meal_9,user_123,2025-01-04 00:00:00,Dinner,Cereal with Milk,"gluten, dairy, processed, sugary",meal,symp_4,Abdominal Pain,6.0,symptom
9,meal_10,user_123,2025-01-04 12:00:00,Breakfast,Pizza Slice,"gluten, dairy, processed, high-fat",meal,,,,
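
As a minimal sketch of how this merge behaves, the toy example below reuses the same merge_asof arguments on three made-up meal rows and one made-up symptom row. The column names mirror the notebook; the data is invented for illustration.

```python
import pandas as pd

# Toy data; column names mirror the notebook, values are made up.
meals = pd.DataFrame({
    "user_id": ["user_123"] * 3,
    "timestamp": pd.to_datetime(
        ["2025-01-02 12:00", "2025-01-02 12:00", "2025-01-03 00:00"]
    ),
    "food_name": ["Lentil Soup", "Grilled Chicken Breast", "Cereal with Milk"],
})
symptoms = pd.DataFrame({
    "user_id": ["user_123"],
    "timestamp": pd.to_datetime(["2025-01-02 12:00"]),
    "symptom": ["Abdominal Pain"],
})

# Both inputs must be sorted on the 'on' key before calling merge_asof.
merged = pd.merge_asof(
    meals.sort_values("timestamp"),
    symptoms.sort_values("timestamp"),
    on="timestamp",
    by="user_id",
    direction="nearest",
    tolerance=pd.Timedelta("0s"),  # zero tolerance: exact timestamp matches only
)
print(merged)
```

Both meals at 12:00 receive the same symptom, while the midnight meal has no match and keeps a missing value. This is why one symptom can appear on several meal rows in the timeline.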


Add the new predicted and actual columns. (Ideally these would have been created in the data generation notebook.)

In [35]:
df_timeline['predicted'] = np.nan
df_timeline['actual'] = np.nan
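
One design point worth noting: filling a column with np.nan gives it a float64 dtype, and newer pandas releases may warn when a string is later written into it. The sketch below shows an alternative that is only a suggestion, not what the notebook does: creating the column with object dtype. The toy frame and the "dairy" label are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_timeline.
df = pd.DataFrame({"meal_id": ["meal_1", "meal_2"]})

# np.nan creates a float64 column, which is a poor fit for text labels.
df["predicted_float"] = np.nan

# An object-dtype column can hold strings and missing values side by side.
df["predicted"] = pd.Series([None] * len(df), index=df.index, dtype="object")
df.loc[0, "predicted"] = "dairy"  # hypothetical trigger label

print(df.dtypes)
```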

Inspect the rows that have both a meal and a symptom.

In [36]:
df_timeline[df_timeline['symptom_id'].notnull() & df_timeline['meal_id'].notnull()]

Unnamed: 0,meal_id,user_id,timestamp,meal_type,food_name,food_tags,event_type,symptom_id,symptom,severity,event_type_symptom,predicted,actual
3,meal_4,user_123,2025-01-02 12:00:00,Breakfast,Lentil Soup,"high-fiber, legume",meal,symp_1,Abdominal Pain,7.0,symptom,,
4,meal_5,user_123,2025-01-02 12:00:00,Lunch,Grilled Chicken Breast,"low-fodmap, protein",meal,symp_1,Abdominal Pain,7.0,symptom,,
5,meal_6,user_123,2025-01-03 00:00:00,Dinner,Cereal with Milk,"gluten, dairy, processed, sugary",meal,symp_2,Bloating,5.0,symptom,,
6,meal_7,user_123,2025-01-03 12:00:00,Breakfast,Large Apple,"high-fiber, fruit, raw",meal,symp_3,Fatigue,4.0,symptom,,
7,meal_8,user_123,2025-01-03 12:00:00,Lunch,Grilled Chicken Breast,"low-fodmap, protein",meal,symp_3,Fatigue,4.0,symptom,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
252,meal_253,user_123,2025-03-26 12:00:00,Breakfast,Mac & Cheese,"gluten, dairy, processed",meal,symp_40,Nausea,8.0,symptom,,
253,meal_254,user_123,2025-03-26 12:00:00,Lunch,Cheeseburger,"gluten, dairy, processed, red-meat",meal,symp_40,Nausea,8.0,symptom,,
264,meal_265,user_123,2025-03-30 12:00:00,Breakfast,Spicy Curry,"spicy, high-fat, dairy-free",meal,symp_42,Nausea,8.0,symptom,,
265,meal_266,user_123,2025-03-30 12:00:00,Lunch,Bean Burrito,"high-fiber, gluten",meal,symp_42,Nausea,8.0,symptom,,


-------

#### Engineering the Few-Shot LLM Prompt
This is the core of our task. The function below takes the user_profile as an argument and injects it directly into the prompt, giving the LLM crucial context about the patient it is analyzing.

In [41]:
# First, let's get the user's profile data, which we'll need for the prompt.
user_profile = df_user.iloc[0]
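
For orientation, here is one way a profile row and past predictions could be folded into a few-shot prompt. This is a sketch under stated assumptions, not the notebook's own function: the helper name build_few_shot_prompt, the prompt wording, and the 'diagnosis' profile field are all hypothetical. The helper renders whatever fields the profile contains, so it does not depend on the actual columns of user.csv.

```python
import pandas as pd


def build_few_shot_prompt(user_profile, past_examples, new_event):
    """Assemble a few-shot prompt (hypothetical helper for illustration)."""
    # Render every profile field as "key: value" so no column names are assumed.
    profile_lines = [f"- {key}: {value}" for key, value in user_profile.items()]

    # Each past example shows the model a meal, a symptom, and the chosen trigger.
    example_lines = [
        f"Meal: {ex['food_name']} (tags: {ex['food_tags']})\n"
        f"Symptom: {ex['symptom']} (severity {ex['severity']})\n"
        f"Likely trigger: {ex['predicted']}"
        for ex in past_examples
    ]

    sections = [
        "You are assisting a patient who tracks meals and symptoms.",
        "Patient profile:\n" + "\n".join(profile_lines),
    ]
    if example_lines:
        sections.append("Past predictions:\n" + "\n\n".join(example_lines))
    # The prompt ends on an open "Likely trigger:" line for the model to complete.
    sections.append(
        "New event:\n"
        f"Meal: {new_event['food_name']} (tags: {new_event['food_tags']})\n"
        f"Symptom: {new_event['symptom']} (severity {new_event['severity']})\n"
        "Likely trigger:"
    )
    return "\n\n".join(sections)


# Usage with made-up values; 'diagnosis' is a hypothetical profile field.
profile = pd.Series({"user_id": "user_123", "diagnosis": "Crohn's disease"})
past = [{"food_name": "Cereal with Milk", "food_tags": "gluten, dairy, processed, sugary",
         "symptom": "Bloating", "severity": 5.0, "predicted": "dairy"}]
new = {"food_name": "Mac & Cheese", "food_tags": "gluten, dairy, processed",
       "symptom": "Nausea", "severity": 8.0}
prompt = build_few_shot_prompt(profile, past, new)
print(prompt)
```

Ending the prompt on an open "Likely trigger:" line nudges the model to answer in the same short format as the examples.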

Identify the rows in our timeline that represent a symptom event that needs a prediction. These are the rows where a symptom was successfully merged with a meal.

In [43]:
symptom_event_rows = df_timeline[df_timeline['symptom_id'].notna()].copy()

In [45]:
# Get the row numbers (indices) of these symptom events.
symptom_indices = symptom_event_rows.index
symptom_indices

Int64Index([  3,   4,   5,   6,   7,   8,  15,  16,  18,  19,  33,  34,  35,
             39,  40,  59,  75,  76,  81,  82,  83,  87,  88, 101, 102, 103,
            104, 105, 106, 135, 136, 141, 142, 144, 145, 162, 163, 183, 184,
            189, 190, 204, 205, 207, 208, 218, 222, 223, 231, 232, 234, 235,
            246, 247, 249, 250, 252, 253, 264, 265, 269],
           dtype='int64')
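
With the symptom indices in hand, the simulation step can be sketched as a loop that builds a prompt per symptom row and writes the response into the predicted column. The example below is a simplified illustration on a toy timeline: mock_llm is a hypothetical stand-in that returns the first food tag, and the prompt omits the few-shot examples for brevity.

```python
import pandas as pd


def mock_llm(prompt):
    """Stand-in for a real LLM call: returns the first food tag in the prompt."""
    tags_line = next(line for line in prompt.splitlines() if line.startswith("Tags:"))
    return tags_line[len("Tags:"):].split(",")[0].strip()


# Toy timeline; values are made up and only mimic the notebook's columns.
timeline = pd.DataFrame({
    "food_tags": ["high-fiber, legume", "low-fodmap, protein", "gluten, dairy"],
    "symptom": ["Abdominal Pain", None, "Bloating"],
    "symptom_id": ["symp_1", None, "symp_2"],
})
timeline["predicted"] = pd.Series([None] * len(timeline), index=timeline.index, dtype="object")

# Only rows with a merged symptom need a prediction.
symptom_indices = timeline[timeline["symptom_id"].notna()].index

for idx in symptom_indices:
    row = timeline.loc[idx]
    prompt = f"Symptom: {row['symptom']}\nTags: {row['food_tags']}\nLikely trigger:"
    timeline.loc[idx, "predicted"] = mock_llm(prompt)

print(timeline[["symptom", "predicted"]])
```

Swapping mock_llm for a real API call would leave the loop unchanged, which keeps the simulation easy to test offline.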