### Identifiers and Study Design
- PatientID → unique patient code
- Group → study group:
    CR = corticosteroid-resistant asthmatics
    CS = corticosteroid-sensitive asthmatics
    Healthy = healthy controls
- Part → study part: A or B (Part B only for CR patients)
- RandomizationArm → treatment during crossover: AZD7624, Placebo, or None (if not in treatment block).
- Visit → study visit name:
    CR: Baseline, PostPrednisone, Week4_Block1, Washout, Week4_Block2
    CS: Baseline, PostPrednisone
    Healthy: Baseline
- TreatmentBlock → treatment phase in crossover (1 or 2, only for CR)

### Demographics
- Age → age in years
- Sex → M or F
- BMI → body mass index in kg/m²

### Lung Function / Spirometry
- FEV1_percent_predicted → Forced Expiratory Volume in 1 second (% predicted for age/sex/height).
    Lower in CR (~65–75%).
    Higher in CS (~75–85%).
    Near normal in Healthy (~95%).
- Bronchodilator_Response_% → percent improvement in FEV1 after bronchodilator:
    CR: typically <10%.
    CS: typically >12%.
    Healthy: not measured (NaN)

### Biomarkers (Inflammatory / MAPK Pathway)
- p38_MAPK_Blood → mean fluorescence intensity (MFI) of phosphorylated p38 MAPK in blood (by flow cytometry)
- p38_MAPK_BAL → MAPK activity in bronchoalveolar lavage (only for CR, Part B, during bronchoscopy)
- p38_MAPK_Sputum → MAPK activity in induced sputum (only for CR, Part B, crossover visits)
- CRP → C-reactive protein (mg/L), systemic inflammation marker

### Clinical Outcomes
- Asthma_Control_Score → Asthma Control Test (ACT), 5–25 scale:
    Lower in CR (10–16)
    Higher in CS (18–22)
    Very high in Healthy (22–25)
- AdverseEvent → Yes/No for any adverse event at that visit.

### Create a dummy dataset modeled on the clinical trial registered on ClinicalTrials.gov as NCT02753764

import pandas as pd
import numpy as np
import random

def generate_clinical_trial_dataset(seed=42):
    np.random.seed(seed)
    random.seed(seed)

    n_CR, n_CS, n_HC = 10, 10, 10
    CR_ids = [f"CR{i+1:02d}" for i in range(n_CR)]
    CS_ids = [f"CS{i+1:02d}" for i in range(n_CS)]
    HC_ids = [f"HC{i+1:02d}" for i in range(n_HC)]

    visits_CR = ["Baseline", "PostPrednisone", "Week4_Block1", "Washout", "Week4_Block2"]
    visits_CS = ["Baseline", "PostPrednisone"]
    visits_HC = ["Baseline"]

    def gen_demo():
        return np.random.randint(18, 65), np.random.choice(["M", "F"]), round(np.random.uniform(20, 35), 1)

    def gen_patient_rows(pid, group):
        age, sex, bmi = gen_demo()
        rows = []
        if group == "CR":
            order = np.random.choice(["AZD7624_first", "Placebo_first"])
            for v in visits_CR:
                if v in ["Baseline", "PostPrednisone"]:
                    part, arm, block = "A", None, None
                elif v == "Week4_Block1":
                    part, block = "B", 1
                    arm = "AZD7624" if order == "AZD7624_first" else "Placebo"
                elif v == "Washout":
                    part, arm, block = "B", None, None
                else:  # Week4_Block2
                    part, block = "B", 2
                    arm = "Placebo" if order == "AZD7624_first" else "AZD7624"

                fev1 = round(np.random.normal(70, 5), 1)
                broncho = np.random.randint(4, 9)
                p38_blood = round(np.random.uniform(2.0, 3.5), 2)
                p38_bal = round(np.random.uniform(1.5, 3.0), 2) if "Week4" in v else np.nan
                p38_sputum = round(np.random.uniform(2.0, 3.5), 2) if "Week4" in v else np.nan
                crp = round(np.random.uniform(3.0, 6.0), 1)
                act = np.random.randint(10, 17)
                ae = np.random.choice(["Yes", "No"], p=[0.2, 0.8])

                rows.append([pid, group, part, arm, v, block, age, sex, bmi,
                             fev1, broncho, p38_blood, p38_bal, p38_sputum,
                             crp, act, ae])

        elif group == "CS":
            for v in visits_CS:
                fev1 = round(np.random.normal(80, 5), 1)
                broncho = np.random.randint(12, 20)
                p38_blood = round(np.random.uniform(0.8, 1.5), 2)
                crp = round(np.random.uniform(2.0, 4.0), 1)
                act = np.random.randint(18, 23)
                ae = "No"
                rows.append([pid, group, "A", None, v, None, age, sex, bmi,
                             fev1, broncho, p38_blood, np.nan, np.nan,
                             crp, act, ae])

        else:  # Healthy
            for v in visits_HC:
                fev1 = round(np.random.normal(95, 3), 1)
                broncho = np.nan
                p38_blood = round(np.random.uniform(0.7, 1.2), 2)
                crp = round(np.random.uniform(0.5, 2.0), 1)
                # randint excludes the upper bound, so this draws 22–24 (the data dictionary lists 22–25)
                act = np.random.randint(22, 25)
                ae = "No"
                rows.append([pid, group, "A", None, v, None, age, sex, bmi,
                             fev1, broncho, p38_blood, np.nan, np.nan,
                             crp, act, ae])
        return rows

    all_rows = []
    for pid in CR_ids:
        all_rows.extend(gen_patient_rows(pid, "CR"))
    for pid in CS_ids:
        all_rows.extend(gen_patient_rows(pid, "CS"))
    for pid in HC_ids:
        all_rows.extend(gen_patient_rows(pid, "Healthy"))

    columns = ["PatientID", "Group", "Part", "RandomizationArm", "Visit", "TreatmentBlock",
               "Age", "Sex", "BMI", "FEV1_percent_predicted", "Bronchodilator_Response_%",
               "p38_MAPK_Blood", "p38_MAPK_BAL", "p38_MAPK_Sputum", "CRP",
               "Asthma_Control_Score", "AdverseEvent"]

    return pd.DataFrame(all_rows, columns=columns)


df = generate_clinical_trial_dataset()
df.to_csv("clinical_trial_dummy.csv", index=False)
df.head()

## SET UP NOTEBOOK and LOAD RESOURCES

In [None]:
# Import libraries

import pandas as pd
import numpy as np

In [None]:
# Open data file

df = pd.read_csv("clinical_trial_dummy.csv")

In [None]:
df.head()

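| | PatientID | Group | Part | RandomizationArm | Visit | TreatmentBlock | Age | Sex | BMI | FEV1_percent_predicted | Bronchodilator_Response_% | p38_MAPK_Blood | p38_MAPK_BAL | p38_MAPK_Sputum | CRP | Asthma_Control_Score | AdverseEvent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CR01 | CR | A | NaN | Baseline | NaN | 56 | F | 34.3 | 72.4 | 5.0 | 2.23 | NaN | NaN | 3.2 | 14 | No |
| 1 | CR01 | CR | A | NaN | PostPrednisone | NaN | 56 | F | 34.3 | 76.8 | 6.0 | 2.03 | NaN | NaN | 5.9 | 13 | No |
| 2 | CR01 | CR | B | AZD7624 | Week4_Block1 | 1.0 | 56 | F | 34.3 | 77.3 | 8.0 | 2.65 | 1.94 | 2.92 | 3.4 | 13 | No |
| 3 | CR01 | CR | B | NaN | Washout | NaN | 56 | F | 34.3 | 77.7 | 6.0 | 2.57 | NaN | NaN | 5.9 | 10 | Yes |
| 4 | CR01 | CR | B | Placebo | Week4_Block2 | 2.0 | 56 | F | 34.3 | 64.2 | 5.0 | 3.42 | 2.95 | 3.21 | 3.9 | 14 | No |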


In [None]:
df.shape

(80, 17)

## DATA PREPARATION

In [None]:
# Check for missing values per column

df_missing = df.isna().sum()
df_missing

PatientID                     0
Group                         0
Part                          0
RandomizationArm             60
Visit                         0
TreatmentBlock               60
Age                           0
Sex                           0
BMI                           0
FEV1_percent_predicted        0
Bronchodilator_Response_%    10
p38_MAPK_Blood                0
p38_MAPK_BAL                 60
p38_MAPK_Sputum              60
CRP                           0
Asthma_Control_Score          0
AdverseEvent                  0
dtype: int64

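### Reason for missing values

- RandomizationArm → None outside the two treatment blocks, so it is set only at the CR Week4_Block1 and Week4_Block2 visits (60 of 80 rows missing).
- TreatmentBlock → treatment phase in crossover (1 or 2); NaN for CS and Healthy rows and for CR visits outside a treatment block (60 rows missing).
- Bronchodilator_Response_% → percent improvement in FEV1 after bronchodilator; not measured in Healthy (10 rows missing).
- p38_MAPK_BAL → MAPK activity in bronchoalveolar lavage (only for CR, Part B, during bronchoscopy); in this dataset it is filled only at the two Week4 visits and NaN everywhere else (60 rows missing).
- p38_MAPK_Sputum → MAPK activity in induced sputum (only for CR, Part B, crossover visits); filled only at the two Week4 visits and NaN everywhere else (60 rows missing).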
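
One way to confirm that the missing values follow the study design, rather than being accidental gaps, is to compare the missingness pattern against Group and Visit. The sketch below does this on a small hand-made frame; the `demo` rows and the helper names (`bal_by_design`, `bdr_by_design`) are illustrative assumptions, not part of the generated file.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame using the same column names as the dummy dataset
demo = pd.DataFrame({
    "Group": ["CR", "CR", "CR", "CS", "Healthy"],
    "Visit": ["Baseline", "Week4_Block1", "Washout", "Baseline", "Baseline"],
    "RandomizationArm": [None, "AZD7624", None, None, None],
    "p38_MAPK_BAL": [np.nan, 1.94, np.nan, np.nan, np.nan],
    "Bronchodilator_Response_%": [5.0, 8.0, 6.0, 14.0, np.nan],
})

# BAL should be present exactly on CR rows at a Week4 visit
bal_measured = demo["p38_MAPK_BAL"].notna()
is_week4_cr = (demo["Group"] == "CR") & demo["Visit"].str.startswith("Week4")
bal_by_design = bool((bal_measured == is_week4_cr).all())

# Bronchodilator response should be missing exactly for healthy controls
bdr_missing = demo["Bronchodilator_Response_%"].isna()
bdr_by_design = bool((bdr_missing == (demo["Group"] == "Healthy")).all())

# Missing-value counts per visit, for a quick visual check
missing_by_visit = (
    demo.drop(columns=["Group", "Visit"]).isna().groupby(demo["Visit"]).sum()
)

print(bal_by_design, bdr_by_design)
print(missing_by_visit)
```

Running the same checks on the full `df` should flag any row where a value is missing (or present) against the design.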

In [None]:
# Check and fix dtypes

df.dtypes

PatientID                     object
Group                         object
Part                          object
RandomizationArm              object
Visit                         object
TreatmentBlock               float64
Age                            int64
Sex                           object
BMI                          float64
FEV1_percent_predicted       float64
Bronchodilator_Response_%    float64
p38_MAPK_Blood               float64
p38_MAPK_BAL                 float64
p38_MAPK_Sputum              float64
CRP                          float64
Asthma_Control_Score           int64
AdverseEvent                  object
dtype: object

All data types are as expected. TreatmentBlock is float64 rather than an integer type because it contains NaN.
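
As an optional refinement, Visit can be stored as an ordered categorical so that sorting follows the study timeline instead of the alphabet, and TreatmentBlock can use pandas' nullable integer dtype. This is a sketch on made-up rows; the `visit_order` list is taken from the visit names in the data dictionary, and `demo` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Visit order as listed for CR patients in the data dictionary
visit_order = ["Baseline", "PostPrednisone", "Week4_Block1", "Washout", "Week4_Block2"]

# Small illustrative frame (values are made up for the example)
demo = pd.DataFrame({
    "PatientID": ["CR01"] * 3,
    "Visit": ["Week4_Block1", "Washout", "Baseline"],
    "TreatmentBlock": [1.0, None, None],
})

# Ordered categorical: sorting now follows the study timeline
demo["Visit"] = pd.Categorical(demo["Visit"], categories=visit_order, ordered=True)

# Nullable integer dtype keeps block numbers as 1/2 while still allowing missing values
demo["TreatmentBlock"] = demo["TreatmentBlock"].astype("Int64")

demo = demo.sort_values("Visit").reset_index(drop=True)
print(demo)
print(demo.dtypes)
```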

## STATISTICAL ANALYSIS

In [None]:
# Count frequency of nominal data

# Select only object dtype columns
obj_cols = df.select_dtypes(include="object")

# Count frequencies for each column
for col in obj_cols.columns:
    print(f"\n--- {col} ---")
    print(obj_cols[col].value_counts())


--- PatientID ---
PatientID
CR01    5
CR02    5
CR03    5
CR04    5
CR05    5
CR06    5
CR07    5
CR08    5
CR09    5
CR10    5
CS10    2
CS09    2
CS08    2
CS07    2
CS06    2
CS05    2
CS04    2
CS03    2
CS02    2
CS01    2
HC01    1
HC02    1
HC03    1
HC04    1
HC05    1
HC06    1
HC07    1
HC08    1
HC09    1
HC10    1
Name: count, dtype: int64

--- Group ---
Group
CR         50
CS         20
Healthy    10
Name: count, dtype: int64

--- Part ---
Part
A    50
B    30
Name: count, dtype: int64

--- RandomizationArm ---
RandomizationArm
AZD7624    10
Placebo    10
Name: count, dtype: int64

--- Visit ---
Visit
Baseline          30
PostPrednisone    20
Week4_Block1      10
Washout           10
Week4_Block2      10
Name: count, dtype: int64

--- Sex ---
Sex
F    43
M    37
Name: count, dtype: int64

--- AdverseEvent ---
AdverseEvent
No     72
Yes     8
Name: count, dtype: int64
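
The frequencies above count rows (visits), not patients. A sketch of two complementary summaries follows: unique patients per group, and the share of visits with an adverse event per randomization arm. The `demo` frame is a made-up stand-in for `df`, and rows with no arm are dropped by `pd.crosstab` by default.

```python
import pandas as pd

# Illustrative rows only; column names match the dummy dataset
demo = pd.DataFrame({
    "PatientID": ["CR01", "CR01", "CR02", "CR02", "CS01"],
    "Group": ["CR", "CR", "CR", "CR", "CS"],
    "RandomizationArm": ["AZD7624", "Placebo", "Placebo", "AZD7624", None],
    "AdverseEvent": ["No", "Yes", "No", "No", "No"],
})

# Number of distinct patients in each group
patients_per_group = demo.groupby("Group")["PatientID"].nunique()

# Row-normalised cross-tab: proportion of visits with / without an adverse event per arm
ae_by_arm = pd.crosstab(
    demo["RandomizationArm"], demo["AdverseEvent"], normalize="index"
)

print(patients_per_group)
print(ae_by_arm)
```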


In [None]:
# Describe numerical data

df.describe()

| | TreatmentBlock | Age | BMI | FEV1_percent_predicted | Bronchodilator_Response_% | p38_MAPK_Blood | p38_MAPK_BAL | p38_MAPK_Sputum | CRP | Asthma_Control_Score |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 20.0 | 80.0 | 80.0 | 80.0 | 70.0 | 80.0 | 20.0 | 20.0 | 80.0 | 80.0 |
| mean | 1.5 | 42.8 | 29.6025 | 76.72125 | 8.585714 | 2.168 | 2.2985 | 2.731 | 3.71 | 15.8875 |
| std | 0.512989 | 12.937435 | 3.622503 | 9.467141 | 4.447741 | 0.925361 | 0.43335 | 0.402269 | 1.348661 | 4.339964 |
| min | 1.0 | 18.0 | 21.1 | 60.3 | 4.0 | 0.72 | 1.51 | 2.06 | 0.7 | 10.0 |
| 25% | 1.0 | 33.0 | 25.975 | 70.0 | 5.0 | 1.1875 | 2.0075 | 2.41 | 3.1 | 12.0 |
| 50% | 1.5 | 44.0 | 30.4 | 75.4 | 7.0 | 2.36 | 2.365 | 2.67 | 3.7 | 15.0 |
| 75% | 2.0 | 56.0 | 32.6 | 80.075 | 12.75 | 3.0175 | 2.6125 | 3.06 | 4.625 | 20.0 |
| max | 2.0 | 64.0 | 34.8 | 100.6 | 18.0 | 3.47 | 2.95 | 3.36 | 5.9 | 24.0 |


In [None]:
# Demographics: age frequencies (counted per row, so each patient contributes once per visit)

df_age = df[['Age']].sort_values(by='Age').value_counts()
df_age.head(5)  # five most frequent ages

Age
56     15
45      7
58      6
26      6
18      5
Name: count, dtype: int64
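
Because the dataset has one row per visit, the age counts above weight each patient by their number of visits (CR01's age of 56 appears in all five of that patient's rows). A minimal sketch of a per-patient view, assuming demographics are constant within a patient as they are in the generator; `demo` is an illustrative stand-in for `df`.

```python
import pandas as pd

# Illustrative long-format rows (one row per visit)
demo = pd.DataFrame({
    "PatientID": ["CR01", "CR01", "CR01", "CS01", "CS01", "HC01"],
    "Group": ["CR", "CR", "CR", "CS", "CS", "Healthy"],
    "Age": [56, 56, 56, 45, 45, 26],
    "Sex": ["F", "F", "F", "M", "M", "F"],
})

# Keep one row per patient so repeated visits do not inflate the counts
patients = demo.drop_duplicates(subset="PatientID")[["PatientID", "Group", "Age", "Sex"]]

age_counts_rows = demo["Age"].value_counts()          # counted per visit row
age_counts_patients = patients["Age"].value_counts()  # counted per patient

print(age_counts_rows)
print(age_counts_patients)
print(patients.groupby("Group")["Age"].agg(["count", "mean"]))
```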

In [None]:
df_patient = df.groupby('PatientID')
df_patient

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11ed1b110>
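
A `DataFrameGroupBy` object is lazy: displaying it shows only its memory address until an aggregation is applied. The sketch below uses named aggregation to get one summary row per patient; the `demo` values and the output column names `n_visits` and `mean_fev1` are illustrative choices.

```python
import pandas as pd

# Illustrative rows only; column names match the dummy dataset
demo = pd.DataFrame({
    "PatientID": ["CR01", "CR01", "CS01", "CS01", "HC01"],
    "Visit": ["Baseline", "PostPrednisone", "Baseline", "PostPrednisone", "Baseline"],
    "FEV1_percent_predicted": [72.4, 76.8, 81.0, 83.0, 95.5],
})

# Named aggregation: one row per patient with visit count and mean FEV1
per_patient = demo.groupby("PatientID").agg(
    n_visits=("Visit", "count"),
    mean_fev1=("FEV1_percent_predicted", "mean"),
)

print(per_patient)
```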

In [None]:
# Long and wide data

In [None]:
# Possible uses:
# - Between-group comparisons: CR vs CS vs Healthy.
# - Plotting practice (bar plots, box plots, violin plots).
# - Input features for classification (predict group from biomarkers).
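
The dataset is stored in long format (one row per patient visit). A minimal sketch of converting one measurement to wide format (one row per patient, one column per visit) and back again, using made-up rows; `long_df`, `wide_df`, and `back_to_long` are names chosen for this example.

```python
import pandas as pd

# Illustrative long-format rows (one row per patient visit); values are made up
long_df = pd.DataFrame({
    "PatientID": ["CR01", "CR01", "CS01", "CS01", "HC01"],
    "Group": ["CR", "CR", "CS", "CS", "Healthy"],
    "Visit": ["Baseline", "PostPrednisone", "Baseline", "PostPrednisone", "Baseline"],
    "FEV1_percent_predicted": [72.4, 76.8, 81.0, 83.0, 95.5],
})

# Long -> wide: one row per patient, one FEV1 column per visit
wide_df = (
    long_df.pivot(index=["PatientID", "Group"], columns="Visit",
                  values="FEV1_percent_predicted")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "Visit" axis label

# Wide -> long: melt back and drop visits a patient never attended
back_to_long = wide_df.melt(
    id_vars=["PatientID", "Group"],
    var_name="Visit",
    value_name="FEV1_percent_predicted",
).dropna(subset=["FEV1_percent_predicted"])

print(wide_df)
print(back_to_long)
```

Visits that a patient never attended (for example PostPrednisone for healthy controls) appear as NaN in the wide table.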

In [None]:
# Correlation matrix
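
A minimal sketch of a correlation matrix over the numeric columns, on made-up values. `DataFrame.corr` defaults to Pearson correlation and uses pairwise-complete observations. Note that in the real dataset repeated visits from the same patient are not independent, so restricting to one visit (for example Baseline) before correlating may be a more defensible choice; that restriction is a suggestion, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Illustrative values only; column names match the dummy dataset
demo = pd.DataFrame({
    "FEV1_percent_predicted": [72.4, 76.8, 81.0, 83.0, 95.5, 64.2],
    "p38_MAPK_Blood": [2.23, 2.03, 1.10, 0.95, 0.80, 3.42],
    "CRP": [3.2, 5.9, 2.5, 3.0, 1.1, 3.9],
    "Sex": ["F", "F", "M", "M", "F", "F"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = demo.select_dtypes(include="number").corr()

print(corr.round(2))
```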

In [None]:
# Descriptive stats
#   - Mean ± SD of FEV1_percent_predicted by Group.
#   - Adverse event rates by RandomizationArm.
# Visualization
#   - Box plot: p38_MAPK_Blood by Group.
#   - Line plot: FEV1_percent_predicted across visits for CR patients.
# Statistical testing
#   - Paired t-test: Baseline vs PostPrednisone FEV1_percent_predicted, run separately within CR and within CS.
#   - ANOVA: p38_MAPK_Blood across CR, CS, and Healthy.
# Regression
#   - Linear regression: FEV1_percent_predicted ~ p38_MAPK_Blood + Age + BMI.
#   - Logistic regression: AdverseEvent ~ RandomizationArm + CRP.
# Machine learning
#   - Classification: predict Group (CR vs CS vs Healthy) from biomarkers.
#   - Clustering: cluster patients by p38 MAPK measures + CRP.
# Data wrangling
#   - Reshape data from long (one row per visit) to wide (one row per patient).
#   - Handle missing data (p38_MAPK_BAL and p38_MAPK_Sputum are NaN except at CR Week4 visits).
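
Two of the planned items can be sketched with pandas and NumPy alone: mean ± SD of FEV1 by group, and the paired t statistic for Baseline vs PostPrednisone within each group. The data below are made up, and `paired_t` is a hypothetical helper written for this example. It computes only the t statistic, t = mean(d) / (sd(d) / sqrt(n)), where d is the per-patient difference; a p-value would additionally need the t distribution with n - 1 degrees of freedom (for instance via `scipy.stats.ttest_rel`).

```python
import numpy as np
import pandas as pd

# Illustrative rows only; column names match the dummy dataset
demo = pd.DataFrame({
    "PatientID": ["CR01", "CR01", "CR02", "CR02", "CR03", "CR03",
                  "CS01", "CS01", "CS02", "CS02"],
    "Group": ["CR"] * 6 + ["CS"] * 4,
    "Visit": ["Baseline", "PostPrednisone"] * 5,
    "FEV1_percent_predicted": [70.0, 72.0, 65.0, 66.0, 74.0, 77.0,
                               80.0, 86.0, 78.0, 85.0],
})

# Mean and sample SD of FEV1 by group (all visits pooled)
summary = demo.groupby("Group")["FEV1_percent_predicted"].agg(["mean", "std"])


def paired_t(group_df):
    """Paired t statistic for PostPrednisone minus Baseline within one group."""
    wide = group_df.pivot(index="PatientID", columns="Visit",
                          values="FEV1_percent_predicted")
    diff = (wide["PostPrednisone"] - wide["Baseline"]).dropna()
    n = len(diff)
    return diff.mean() / (diff.std(ddof=1) / np.sqrt(n))


# One t statistic per group, computed on within-patient differences
t_stats = {name: paired_t(g) for name, g in demo.groupby("Group")}

print(summary.round(2))
print(t_stats)
```

Pivoting to one row per patient before subtracting guarantees that each Baseline value is paired with the same patient's PostPrednisone value.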