INSTRUCTIONS

Assignment 1 for Clustering:
Novel methods in Machine Learning are made either by borrowing formulas and concepts from other scientific fields and redefining them based on new sets of assumptions, or by adding an extra step to an already existing methodological framework.

In this exercise (Assignment 1 of the Clustering Topic), we will try to develop a novel method of Target Trial Emulation by integrating concepts of Clustering into the already existing framework. Target Trial Emulation is a new methodological framework in epidemiology that tries to account for the biases in traditional designs.

These are the instructions:
1. Look at this website: https://rpubs.com/alanyang0924/TTE
2. Extract the dummy data in the package and save it as "data_censored.csv".
3. Convert the R code into Python code (use Jupyter Notebook) and replicate the results using your Python code.
4. Create another copy of your Python code and name it TTE-v2 (use Jupyter Notebook).
5. Using TTE-v2, think of a creative way to integrate a clustering mechanism: understand each step carefully and decide at which step a clustering method can be implemented. Generate insights from your results.
6. Do this in pairs, preferably with your thesis partner.
7. Push to your GitHub repository.
8. Deadline: March 9, 2025 at 11:59 pm.

HINT: For those who don't have a thesis topic yet, you can actually develop a thesis topic out of this assignment.

Libraries

In [34]:
import os
import pandas as pd
import numpy as np
import statsmodels 
import linearmodels
import sklearn
import dowhy
import causalml
import econml
import statsmodels.api as sm
from statsmodels.formula.api import glm
from sklearn.linear_model import LogisticRegression
from pathlib import Path
from Functions.trial_seq import prepare_data_for_tte


Step 1: Setup

In [35]:
observational_data = pd.read_csv('dataset/data_censored.csv')
pd.set_option('display.max_columns', None)
print(observational_data.head())

   id  period  treatment  x1        x2  x3        x4  age     age_s  outcome  \
0   1       0          1   1  1.146148   0  0.734203   36  0.083333        0   
1   1       1          1   1  0.002200   0  0.734203   37  0.166667        0   
2   1       2          1   0 -0.481762   0  0.734203   38  0.250000        0   
3   1       3          1   0  0.007872   0  0.734203   39  0.333333        0   
4   1       4          1   1  0.216054   0  0.734203   40  0.416667        0   

   censored  eligible  
0         0         1  
1         0         0  
2         0         0  
3         0         0  
4         0         0  


In [36]:
# project_path = r"D:\Jean\Documents\TTE"
# # directory_name = "trial_pp"
# directory_name = "trial_itt"

# full_path = os.path.join(project_path, directory_name)

# if not os.path.exists(full_path):
#     os.mkdir(full_path)
#     print(f"Directory '{full_path}' created successfully.")
# else:
#     print(f"Directory '{full_path}' already exists.")

Step 2: Data Preparation

In [37]:
columns = observational_data.columns
print(columns)

Index(['id', 'period', 'treatment', 'x1', 'x2', 'x3', 'x4', 'age', 'age_s',
       'outcome', 'censored', 'eligible'],
      dtype='object')


In [38]:
# Define a class to handle trial analysis
class TrialAnalysis:
    def __init__(self):
        self.data = None
        self.id = None
        self.period = None
        self.treatment = None
        self.outcome = None
        self.eligible = None

    def set_data(self, data, id, period, treatment, outcome, eligible):
        self.data = data
        self.id = id
        self.period = period
        self.treatment = treatment
        self.outcome = outcome
        self.eligible = eligible
        return self

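    def filter_eligible(self):
        """Keep only rows where the eligibility flag is 1 (used here for the PP analysis).

        Note: this drops every person-period with eligible == 0, including
        follow-up rows of participants who were eligible at an earlier period.
        """
        if self.eligible is not None:
            # The flag is stored as 0/1, so compare against 1 rather than True.
            self.data = self.data[self.data[self.eligible] == 1].copy()
        return self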

# Per-Protocol (PP) analysis
trial_pp = TrialAnalysis().set_data(
    data=observational_data,
    id="id",
    period="period",
    treatment="treatment",
    outcome="outcome",
    eligible="eligible"
).filter_eligible()  # Filter to include only eligible participants

# Intention-to-Treat (ITT) analysis
trial_itt = TrialAnalysis().set_data(
    data=observational_data,
    id="id",
    period="period",
    treatment="treatment",
    outcome="outcome",
    eligible="eligible"
)  # No filtering for ITT

# Check the results
print("Per-Protocol Analysis Data:")
print(trial_pp.data)

print("\nIntention-to-Treat Analysis Data:")
print(trial_itt.data)

Per-Protocol Analysis Data:
     id  period  treatment  x1        x2  x3        x4  age     age_s  \
0     1       0          1   1  1.146148   0  0.734203   36  0.083333   
6     2       0          0   1 -0.802142   0 -0.990794   26 -0.750000   
7     2       1          1   1 -0.983030   0 -0.990794   27 -0.666667   
11    3       0          1   0  0.571029   1  0.391966   48  1.083333   
19    4       0          0   0 -0.107079   1 -1.613258   29 -0.500000   
..   ..     ...        ...  ..       ...  ..       ...  ...       ...   
681  96       0          0   0 -1.954236   1 -1.293043   47  1.000000   
682  96       1          1   0 -1.085325   1 -1.293043   48  1.083333   
701  97       0          0   1  0.621108   1  0.830741   36  0.083333   
702  98       0          1   1  1.392339   0  0.317418   64  2.416667   
717  99       0          1   1 -0.346378   1  0.575268   65  2.500000   

     outcome  censored  eligible  
0          0         0         1  
6          0         0   

Step 3: Weight Models and Censoring

3.1 Censoring due to treatment switching

3.2 Other informative censoring
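
One way the censoring weights for this step could be sketched in Python is shown below. It is a minimal illustration under stated assumptions, not the TrialEmulation implementation: the data are simulated, the column names `x1` and `censored` are borrowed from `data_censored.csv`, and group means over a binary `x1` stand in for fitted censoring models such as logistic regressions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_ids, n_periods = 500, 5

# Simulated person-period data (hypothetical, not data_censored.csv).
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_ids), n_periods),
    "period": np.tile(np.arange(n_periods), n_ids),
})
df["x1"] = rng.binomial(1, 0.5, size=len(df))
# Informative censoring: rows with x1 == 0 are censored more often.
df["censored"] = rng.binomial(1, np.where(df["x1"] == 1, 0.05, 0.20))

# Keep each id's rows up to and including the first censoring event.
censored_before = df.groupby("id")["censored"].cumsum() - df["censored"]
df = df[censored_before == 0].copy()

# Numerator: overall probability of staying uncensored in a period.
p_uncensored = 1 - df["censored"].mean()
# Denominator: probability of staying uncensored given x1 (group means).
p_uncensored_given_x1 = 1 - df.groupby("x1")["censored"].transform("mean")

# Stabilized per-period weight and its running product within each id.
df["w_period"] = p_uncensored / p_uncensored_given_x1
df["ipcw"] = df.groupby("id")["w_period"].cumprod()

print(df.groupby("x1")["w_period"].first())
```

In a weighted outcome model, rows with `censored == 1` would be dropped and the remaining rows weighted by `ipcw`, so that people resembling those who were censored count for more.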

PP dataset

ITT dataset

Step 4:

Let's break down what Target Trial Emulation (TTE) is, at a level suited to a college student who's brand new to data science. Think of this as building blocks, step by step.

**Imagine you want to answer a really important question:**  Does a new study method actually *help* students learn better compared to the old method?

To really know for sure, the **best way** to answer this question would be to run an **ideal experiment**.  Think of it like a perfectly designed science experiment in a lab, but with people and in the real world. This "ideal experiment" is what we call a **Target Trial**.

**1. What is a "Target Trial"? (The Ideal Experiment in Our Heads)**

Imagine the *perfect* way to figure out if the new study method is better.  This perfect way is our "Target Trial". Let's picture it:

*   **Random Assignment:** We'd take a bunch of students and randomly assign half of them to use the **new study method** and the other half to use the **old study method**.  "Random" is key here – like flipping a coin for each student to decide which group they are in. This makes sure the groups are as similar as possible at the start, except for the study method they use.
*   **Follow-up Over Time:** We'd then follow both groups of students for a certain amount of time, let's say a semester, and carefully track how well they are learning. We'd measure this maybe through exam scores, project grades, or participation.
*   **Compare Outcomes:** At the end of the semester, we'd compare the learning outcomes (like average exam scores) between the two groups. If the group using the new study method consistently performs better than the group using the old method, we could be pretty confident that the new method is indeed more effective.

**That, in a nutshell, is a Target Trial.** It's the gold standard, the ideal experiment we *wish* we could always run to answer questions about what works. It’s like having a perfect recipe for answering our question.

**2.  The Problem:  We Can't Always Run the "Perfect" Trial in the Real World**

In reality, running a perfect "Target Trial" is often **impossible** or **unethical**.  Think about it for our study method example:

*   **Ethical Concerns:** Is it ethical to randomly assign students to a potentially *worse* study method just for the sake of an experiment?  Maybe the old method is known to be less effective, and we feel obligated to give everyone the best possible chance.
*   **Practical Issues:**  It can be super hard to control everything in a real-world classroom or study environment. Students might use methods from both groups, they might drop out of the study, or other factors outside our control might influence their learning.
*   **Time and Money:**  Running a large, well-controlled experiment takes a lot of time, money, and resources.

**This is where "Observational Data" comes in, and this is where "Target Trial Emulation" becomes important.**

**3.  Observational Data:  Real-World Information We Already Have**

Instead of running a brand new experiment, we often have access to existing data that was collected in the real world, without a planned experiment. This is called **observational data**.

Let's say a university has been using the old study method for years, and then some professors started trying out the new study method on their own. The university probably has data on student performance, which study methods were used (maybe not perfectly recorded!), student characteristics, etc.  This is observational data.

**The Catch with Observational Data:**

*   **No Random Assignment:**  In observational data, students weren't randomly assigned to study methods.  Professors and students chose which method to use, and these choices are often **not random**. For example, maybe the most motivated students choose to try the new method, or maybe professors who are already good teachers are more likely to adopt it.
*   **Confusing Factors (Confounding):** Because of the lack of random assignment, it's hard to know if differences in outcomes are *really* because of the study method itself, or because of these other factors (like student motivation or teacher skill) that are mixed up with the choice of study method. These other factors are called **confounding factors**.
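
The confounding problem can be made concrete with a small simulation. The sketch below is hypothetical throughout: the motivation variable, the probabilities, and the effect sizes are made-up numbers chosen for illustration, with a true effect of +2 exam points built in.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical confounder: 1 = highly motivated student, 0 = otherwise.
motivated = rng.binomial(1, 0.5, size=n)

# Non-random choice: motivated students pick the new method more often.
new_method = rng.binomial(1, np.where(motivated == 1, 0.8, 0.2))

# Exam score: the new method adds +2 points by construction,
# while motivation adds +10 points regardless of the method used.
score = 60 + 2 * new_method + 10 * motivated + rng.normal(0, 5, size=n)

# Naive comparison ignores motivation, so it mixes the two effects.
naive = score[new_method == 1].mean() - score[new_method == 0].mean()

# Stratified comparison: take the difference within each motivation
# level, then average the differences weighted by stratum size.
adjusted = 0.0
for level in (0, 1):
    in_stratum = motivated == level
    diff = (score[in_stratum & (new_method == 1)].mean()
            - score[in_stratum & (new_method == 0)].mean())
    adjusted += diff * in_stratum.mean()

print(f"naive difference:    {naive:.2f}")
print(f"adjusted difference: {adjusted:.2f}")
```

With these settings the naive difference is pulled far above the true +2 because motivated students are over-represented among new-method users, while the stratified estimate stays close to +2. Stratifying only works here because the single confounder is measured, which real observational data rarely guarantees.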

**4. Target Trial Emulation: "Playing Pretend" with Observational Data**

Target Trial Emulation (TTE) is like "playing pretend" with observational data.  Our goal is to use this messy, real-world data to **mimic** or **emulate** the "Target Trial" we described earlier as closely as possible.  We want to pretend we *did* run that perfect experiment, even though we didn't.

**Think of it like this analogy:**

Imagine you want to learn to bake a cake, but you don't have all the ingredients or a fancy oven.  Instead, you decide to "emulate" baking a cake: you might use a toy oven, pretend ingredients, and follow a simplified recipe. It's not the *real* cake, but by going through the steps, you learn about the process.

TTE does something similar with data. We take our messy observational data and go through steps to *emulate* the key features of a Target Trial, especially **random assignment** (even though we didn't have it originally).

**5.  Key Steps of Target Trial Emulation (Based on the Source Text):**

Let's break down the steps mentioned in the source text to understand how we "emulate" a Target Trial:

*   **a) Define the Estimand (What are we trying to measure?)**
    *   The text mentions "intention-to-treat (ITT)" or "per-protocol (PP)".  These are fancy terms for *exactly* what question we are asking about the study methods.
    *   **Intention-to-Treat (ITT):** We compare everyone who was *assigned* (or in our case, who we *emulate* as being assigned) to the new method group *versus* everyone assigned to the old method group, regardless of whether they actually stuck with that method throughout the whole time.  It's like asking: "What is the effect of *assigning* the new method?"
    *   **Per-Protocol (PP):**  We only compare those who *perfectly followed* the new method *versus* those who *perfectly followed* the old method.  This is like asking: "What is the effect of *actually using* the new method exactly as intended?"
    *   **Choosing the Estimand:** Deciding whether to focus on ITT or PP depends on the specific question you want to answer.  For example, ITT is often used because it's closer to the original random assignment idea and less affected by people dropping out or changing methods.

*   **b) Prepare Observational Data:**
    *   We need our observational data to have specific columns or information. The text mentions:
        *   **Treatment:**  Which study method was used (new or old).
        *   **Outcomes:**  How well students performed (exam scores, grades).
        *   **Eligibility:**  Who was eligible to be included in our "emulated" trial.  We might want to set rules, like only including undergraduate students, or students in a specific department.

*   **c) Censoring and IPCW (Dealing with People Dropping Out or Changing Methods):**
    *   **Censoring:** In a real study (or our emulated one), things can go wrong. Students might:
        *   **Treatment Switching:** Start with the new method but then switch to the old one, or vice-versa.
        *   **Informative Censoring:** Drop out of the study altogether, and the reason they drop out might be related to the study method or their learning progress.  This is called "informative" because the dropout tells us something important.
    *   **Inverse Probability of Censoring Weights (IPCW):**  To deal with these problems, especially "informative censoring," we use a clever technique called IPCW. Think of it like giving more "weight" to the data from students who are similar to those who dropped out, but *didn't* drop out.  It's a way to statistically adjust for the bias caused by people leaving the study in a non-random way.  The text mentions "separate models" to calculate these weights, which is a bit more technical but just know it's about making adjustments.

*   **d) Expand the Observational Dataset into Trials:**
    *   This is a bit more advanced and depends on the specifics of the `TrialEmulation` package in R. But the general idea is to take our observational data, which might be collected over a long period, and break it down into a sequence of "mini-trials."
    *   Imagine we have student data over several semesters. We might "expand" this data to create a series of "trials," maybe one trial for each semester. This allows us to analyze how the study methods perform over time and in different groups.
    *   The "expansion options" mentioned in the text are ways to control *how* we break down the data into these mini-trials.

*   **e) Fit a Marginal Structural Model (MSM):**
    *   MSM is a statistical technique used to analyze the "expanded" data and estimate the causal effects.  Think of it as a sophisticated tool to untangle the relationship between study methods and student outcomes, while accounting for all the potential confounding factors and biases we've tried to address.
    *   It's called "marginal" because it looks at the average effect of the treatment across the whole population, rather than focusing on specific subgroups.

*   **f) Predictions and Visualization:**
    *   After fitting the MSM, we can make predictions.  For example, we can predict the "survival probabilities" – in our case, maybe the probability of students achieving a certain grade level over time, for both the new and old study methods.  Or we could look at "cumulative incidences" – like the percentage of students who fail a course in each group.
    *   **Visualization:** The results are then visualized using graphs, charts, etc., to show the differences in outcomes between the study methods over time.  This makes it easier to see if there's a meaningful difference and to understand the size and direction of the effect.
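
The expansion in step d), together with the ITT/PP distinction from step a), can be sketched on a tiny person-period table. This is a minimal illustration under assumptions: `expand_trials` and its `per_protocol` flag are hypothetical names and not the API of the `TrialEmulation` package, the toy rows are invented, and the per-protocol rule used here (drop follow-up from the first period where treatment differs from the assigned one) is a simplification.

```python
import pandas as pd

# Invented person-period data using the column names of data_censored.csv.
toy = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2],
    "period":    [0, 1, 2, 0, 1],
    "treatment": [1, 1, 1, 0, 1],
    "outcome":   [0, 0, 1, 0, 0],
    "eligible":  [1, 0, 0, 1, 1],
})


def expand_trials(df, per_protocol=False):
    """Start one emulated trial at every eligible person-period."""
    # Each eligible row is a trial baseline with its assigned treatment.
    baseline = (
        df.loc[df["eligible"] == 1, ["id", "period", "treatment"]]
        .rename(columns={"period": "trial_period",
                         "treatment": "assigned_treatment"})
    )
    # Attach every row of the same person, then keep baseline and later.
    expanded = baseline.merge(
        df[["id", "period", "treatment", "outcome"]], on="id"
    )
    expanded = expanded[expanded["period"] >= expanded["trial_period"]].copy()
    expanded["followup_time"] = expanded["period"] - expanded["trial_period"]
    expanded = expanded.sort_values(["id", "trial_period", "followup_time"])

    if per_protocol:
        # Flag rows at or after the first deviation from the assignment.
        deviated = (
            expanded["treatment"] != expanded["assigned_treatment"]
        ).astype(int)
        ever_deviated = deviated.groupby(
            [expanded["id"], expanded["trial_period"]]
        ).cummax()
        expanded = expanded[ever_deviated == 0]

    return expanded.reset_index(drop=True)


itt = expand_trials(toy)
pp = expand_trials(toy, per_protocol=True)
print(itt)
print(pp)
```

Person 2 is eligible at two periods and therefore contributes to two emulated trials. Under the per-protocol option, that person's first trial loses the period in which they switched treatment, which is the kind of artificial censoring the weights in step c) are meant to correct for.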

**6. TrialEmulation Package in R:**

The text mentions that this whole process is implemented in R using the `TrialEmulation` package.  R is a programming language widely used in data science and statistics. This package provides tools and functions to help researchers carry out all the steps of TTE in a systematic way, from data preparation to model fitting and visualization.

**In Summary: Why is TTE Useful?**

Target Trial Emulation is a powerful approach because:

*   **It tries to get as close as possible to the "gold standard" (randomized controlled trial) when we can't actually run one.**
*   **It helps us use real-world, observational data to answer important causal questions more reliably.**
*   **It forces us to be very clear about what our "ideal" experiment would look like and what assumptions we are making when we use observational data.**
*   **It provides a structured framework for addressing common problems in observational data, like confounding and censoring.**

**Think of it this way:**  If you can't bake a real cake, emulating the process is the next best thing to learn about baking and maybe even get a pretty good (though not perfect) idea of what the cake would be like!  Similarly, TTE helps us learn about cause and effect from observational data, even when we can't run the perfect experiment.

This is a high-level overview.  If you go into data science, you'll learn much more about each of these steps in detail. But hopefully, this gives you a good starting understanding of what Target Trial Emulation is all about!
