## Data simulation

Reference: https://www.frontiersin.org/journals/pharmacology/articles/10.3389/fphar.2019.00383/full

1. Define the Observation Period and Initial Fill:
- Every individual starts on the same day with an initial fill that covers 30 days.
- This ensures all individuals “initiate” treatment simultaneously.

2. Simulate Subsequent Refills:
- After the initial fill, each subsequent refill’s duration is randomly chosen from a discrete set (typically 30, 60, or 90 days).
- Refills continue until the observation period ends.
- For each refill, the next refill date is the current date plus the chosen duration, plus any group-specific delay.
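
Steps 1 and 2 can be sketched for a single individual as follows. This is a minimal illustration, not the notebook's own function: the helper name `simulate_one_individual` and its `extra_delay` and `seed` parameters are hypothetical, and the fixed 30-day initial fill follows the description in step 1.

```python
import numpy as np

def simulate_one_individual(obs_period=720, extra_delay=0, seed=None):
    """Return the refill days for one individual (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    refill_days = [0]   # initial fill on day 0
    day = 0
    duration = 30       # the initial fill covers 30 days (step 1)
    while True:
        # Next refill: current date + days supplied + any delay
        day += duration + extra_delay
        if day > obs_period:
            break
        refill_days.append(day)
        # Later fills supply 30, 60, or 90 days, chosen at random
        duration = int(rng.choice([30, 60, 90]))
    return refill_days

days = simulate_one_individual(seed=0)
```

With `extra_delay=0` every gap between consecutive refill days equals the supply of the preceding fill, which corresponds to the on-time pattern.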

3. Introduce Group-Specific Patterns:
- Individuals are assigned to one of six groups with distinct refill behaviors. For example:

    - Group 1 (“High adherence”): Refills occur on time, leading to near-perfect adherence.
    - Group 2 (“Erratic adherence”): Refills are more variable (with additional random delays).
    - Group 3 (“Gradual decline”): The delay between refills increases over time.
    - Group 4 (“Intermittent adherence”): Refills alternate between on‐time and delayed refills.
    - Group 5 (“Partial drop-off”): Initially good adherence with a later drop-off.
    - Group 6 (“Non-persistence”): Only one or two refills occur before treatment stops.

- The simulation can “tune” these patterns by, for example, adding extra delay (or even ceasing refills) according to the group.

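
One way to express the group patterns is a small function that returns the extra delay for a given refill. The sketch below is an assumption-laden illustration: the delay ranges for groups 2 and 3 mirror the ones used in this notebook's function, while the rules for groups 4 and 5 (a 30-day delay on every second refill, and a 45-day delay after the sixth refill) are invented here purely for demonstration. The name `group_delay` is hypothetical.

```python
import numpy as np

def group_delay(group, refill_index, rng):
    """Extra days added before refill number `refill_index` (1-based)."""
    if group == 2:
        # Erratic adherence: random delay of 0-20 days
        return int(rng.integers(0, 21))
    if group == 3:
        # Gradual decline: the delay grows with the refill count
        return int(rng.integers(0, 11)) * refill_index
    if group == 4:
        # Intermittent adherence (assumed rule): every second refill is late
        return 30 if refill_index % 2 == 0 else 0
    if group == 5:
        # Partial drop-off (assumed rule): on time at first, late afterwards
        return 0 if refill_index <= 6 else 45
    # Groups 1 and 6 refill on time; stopping group 6 early is handled elsewhere
    return 0

rng = np.random.default_rng(42)
delays_group4 = [group_delay(4, k, rng) for k in range(1, 7)]
```

Keeping the delay rule in one function makes it easy to adjust a single group's behavior without touching the refill loop.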
4. Generate Multiple Individuals:
- Repeat the process for a specified number of individuals (e.g., 1000), randomly assigning each individual to a group with predetermined proportions (for instance, groups 1 and 6 might each make up 10% of the sample).
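
Group assignment with fixed proportions can be sketched as below. The proportions are the ones used later in this notebook; the seeded generator and the `observed_share` summary are additions for illustration only.

```python
import numpy as np

# Proportions for groups 1-6; groups 1 and 6 are 10% each
group_probs = [0.1, 0.2, 0.2, 0.2, 0.2, 0.1]

rng = np.random.default_rng(0)  # seeded so the draw is repeatable
groups = rng.choice(np.arange(1, 7), size=1000, p=group_probs)

# Compare the realized shares with the requested proportions
labels, counts = np.unique(groups, return_counts=True)
observed_share = dict(zip(labels.tolist(), (counts / counts.sum()).tolist()))
```

Because assignment is random, the realized shares only approximate the requested proportions in any single sample.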

5. Create the Final Dataset:
- The resulting dataset should record for each individual the dates of each refill, the duration supplied, and (optionally) the group membership.
- This simulated refill history can then be used to calculate longitudinal adherence metrics (such as LCMA1 or LCMA2) with sliding windows.
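
To illustrate the sliding-window idea, the sketch below computes the share of days with medication on hand in each window. It is a simplified stand-in and not an implementation of LCMA1 or LCMA2: it ignores carry-over of oversupply, and it assumes the days supplied per fill are available as a separate input. The function name `windowed_coverage` and its parameters are hypothetical.

```python
import numpy as np

def windowed_coverage(fill_days, supplies, obs_period=720, window=90, step=30):
    """Share of covered days in each sliding window (simplified)."""
    covered = np.zeros(obs_period, dtype=bool)
    for day, supply in zip(fill_days, supplies):
        # Mark the days covered by this fill; slicing clips at obs_period
        covered[day:day + supply] = True
    # Window start days, spaced `step` days apart
    starts = np.arange(0, obs_period - window + 1, step)
    coverage = np.array([covered[s:s + window].mean() for s in starts])
    return starts, coverage

# Three 30-day fills with a 30-day gap before the last one
starts, coverage = windowed_coverage([0, 30, 90], [30, 30, 30], obs_period=180)
```

In this toy example the gap between day 60 and day 90 lowers the coverage of the later windows, which is the kind of change over time that a longitudinal metric is meant to expose.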

### Import Libraries

import numpy as np
import pandas as pd

### Define Function/s

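def simulate_refill_history(n_individuals=1000, obs_period=720):
    """Simulate refill dates (in days from initiation) for each individual.

    Returns one row per refill with the columns individual, group,
    refill_date, and refill_duration (days since the previous refill).
    """
    # Group probabilities for groups 1-6 (groups 1 and 6 are 10% each)
    group_probs = [0.1, 0.2, 0.2, 0.2, 0.2, 0.1]
    groups = np.random.choice(np.arange(1, 7), size=n_individuals, p=group_probs)

    data_list = []
    for i in range(n_individuals):
        group = groups[i]
        refill_dates = [0]  # every individual starts at day 0
        current_day = 0

        while current_day < obs_period:
            # Days supplied by the current fill (30, 60, or 90)
            refill_duration = np.random.choice([30, 60, 90])

            # Group-specific behavior
            if group == 3:
                # Gradual decline: the delay scales with the number of fills so far
                extra_delay = np.random.randint(0, 11) * len(refill_dates)
            elif group == 2:
                # Erratic adherence: random delay of 0-20 days
                extra_delay = np.random.randint(0, 21)
            elif group == 6 and len(refill_dates) > 1:
                # Non-persistence: stop after the first refill
                break
            else:
                # Groups 1, 4, and 5 (and the first refill of group 6) refill on time
                extra_delay = 0

            # Record the next refill only if it falls within the observation period
            current_day += refill_duration + extra_delay
            if current_day <= obs_period:
                refill_dates.append(current_day)
            else:
                break

        # Create a DataFrame for this individual (scalars are broadcast to every row)
        individual_df = pd.DataFrame({
            'individual': i+1,
            'group': group,
            'refill_date': refill_dates
        })
        # Gap to the previous refill; NaN for the initial fill
        individual_df['refill_duration'] = individual_df['refill_date'].diff()
        data_list.append(individual_df)

    simulated_data = pd.concat(data_list, ignore_index=True)
    return simulated_data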

#### Simulated Data Features:
- individual: patient ID
- group: simulated adherence group (not used in clustering)
- refill_date: the refill day (numeric, representing days from treatment initiation)
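- refill_duration: the number of days since the individual's previous refill (missing for the initial fill); because it is the difference between consecutive refill dates, it includes any group-specific delay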
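
The relationship between refill_date and refill_duration can be checked on a small hand-built table. The values below are made up for illustration; only the column names come from the simulated dataset.

```python
import pandas as pd

# Toy refill history for two individuals (illustrative values)
toy = pd.DataFrame({
    "individual":  [1, 1, 1, 2, 2],
    "group":       [1, 1, 1, 6, 6],
    "refill_date": [0, 30, 90, 0, 60],
})

# Difference between consecutive refill dates, computed per individual
# so that the first fill of each person has no duration
toy["refill_duration"] = toy.groupby("individual")["refill_date"].diff()

# Number of recorded fills per individual
refills_per_person = toy.groupby("individual")["refill_date"].count()
```

Grouping by individual before calling `diff()` avoids subtracting one person's last refill date from the next person's first.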

### Generate simulated data

export_df = simulate_refill_history()

export_df.tail()

| | individual | group | refill_date | refill_duration |
|---|---|---|---|---|
| 10624 | 1000 | 5 | 480 | 30.0 |
| 10625 | 1000 | 5 | 510 | 30.0 |
| 10626 | 1000 | 5 | 600 | 90.0 |
| 10627 | 1000 | 5 | 660 | 60.0 |
| 10628 | 1000 | 5 | 690 | 30.0 |


### Export to csv

export_df.to_csv('../data/simulated_data.csv', index=False)
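
A quick round trip can confirm that the exported file reads back with the expected columns. The sketch below uses a temporary directory and a two-row stand-in table rather than the notebook's actual output path and data; both are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for the simulated data (illustrative values)
df = pd.DataFrame({
    "individual": [1, 1],
    "group": [5, 5],
    "refill_date": [0, 30],
    "refill_duration": [float("nan"), 30.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "simulated_data.csv")
    # index=False keeps the row index out of the file
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
```

Writing with `index=False` means the file contains only the four data columns, so no extra unnamed index column appears when it is read back.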