a. All patients had to have at least one APE to be enrolled in the study\
b. Some patients did not receive any treatment during the study\
c. Some patients have no measurement with "Is Exacerbated" == True: the predictive classifier never flagged any of their measurements as falling in an exacerbation period.

This notebook compares a), b) and c).

Conclusions

- 147 individuals in total
- 103 individuals had a treatment (they appear in the antibiotics data)
- 103 individuals have at least one exacerbation label (exacerbated or stable), BUT they are NOT the same 103 individuals as those who had a treatment: 33 IDs in each set are missing from the other
- **only 70 individuals had a treatment AND are listed in the ex labels data**
- in the ex labels data, only 57/103 individuals have measurements in both ex and stable periods (note that the discovery `2023-09-01_ex_labels_add_transition_period` will show that **only 53 individuals have measurements in both ex and stable periods** after merging O2_FEV1 with ex_labels)
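
The counts above can be cross-checked with inclusion-exclusion. The sketch below uses only the numbers reported in this notebook; the variable names are illustrative and are not part of the data loaders. It relies on both datasets being subsets of the patient data, which is checked further down.

```python
# Counts reported in this notebook (variable names are illustrative)
n_patients = 147
n_treated = 103   # IDs in the antibiotics data
n_labelled = 103  # IDs in the ex labels data
n_both = 70       # IDs present in both datasets

n_treated_only = n_treated - n_both
n_labelled_only = n_labelled - n_both
n_either = n_treated + n_labelled - n_both
# Both datasets are subsets of the patient data, so the remainder is in neither
n_neither = n_patients - n_either

print(n_treated_only, n_labelled_only, n_either, n_neither)
```

Under these counts, 33 IDs appear only in the antibiotics data, 33 only in the ex labels data, and 11 in neither.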

In [2]:
import sys

sys.path.append("../data/")
sys.path.append("../O2-FEV1 analysis/")

import antibiotics_data
import patient_data
import measurements_data
import ex_labels_data

import numpy as np
import pandas as pd


In [3]:
# Individuals with no treatments during study
df_patient_data = patient_data.load()
df_antibio = antibiotics_data.load()
ex_labels = ex_labels_data.load().reset_index()
df_measurement = measurements_data.load()


** Loading patient data **


  for idx, row in parser.parse():
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.Height.loc[df.ID == "60"] = tmp * 100
  for idx, row in parser.parse():



* Dropping unnecessary columns from patient data *
Columns filtered: ['ID', 'Study Date', 'DOB', 'Age', 'Sex', 'Height', 'Weight', 'Predicted FEV1', 'FEV1 Set As']
Columns dropped: {'Comments', 'Unable Informed Consent', 'Pulmonary Exacerbation', 'Transplant Recipients', 'Study Number', 'Less Exacerbation', 'Hospital', 'Study Email', 'Telemetric Measures', 'GP Letter Sent', 'CFQR Quest Comp', 'Date Last PE Stop', 'Age 18 Years', 'Genetic Testing', 'Unable Sputum Samples', 'Inconvenience Payment', 'Remote Monitoring App User ID', 'Sputum Samples', 'Informed Consent', 'Freezer Required', 'Date Last PE Start', 'Date Consent Obtained'}

* Correcting patient data *
ID 60: Corrected height 60 from 1.63 to 163.0
ID 66: Corrected height for ID 66 from 1.62 to 162.0
Replace Age by calculate age
Drop FEV1 Set As and Predicted FEV1
Compute Calculated Predicted FEV1

* Applying data sanity checks *
Loaded patient data with 147 entries (147 initially)

** Loading antibiotics data **

* Dropping un

  df = pd.read_csv(datadir + "mydata.csv")



* Dropping unnecessary columns from measurements data *
Columns filtered ['User ID', 'UserName', 'Recording Type', 'Date/Time recorded', 'FEV 1', 'Weight in Kg', 'O2 Saturation', 'Pulse (BPM)', 'Rating', 'Temp (deg C)']
Dropping columns {'Activity - Points', 'Calories', 'FEV 1 %', 'Sputum sample taken?', 'Activity - Steps', 'FEV 10', 'Predicted FEV'}

* Renaming columns *
Renamed columns {'Date/Time recorded': 'Date recorded', 'FEV 1': 'FEV1', 'Weight in Kg': 'Weight (kg)'}

* Applying data sanity checks *

FEV1
Dropping 1 entries with FEV1 = 3.45 for user Kings004

Weight (kg)
Dropping 2 entries with Weight (kg) = 6.0 for user Papworth033
Dropping 1 entries with Weight (kg) = 0.55 for user Kings013
Dropping 1 entries with Weight (kg) = 8.262500000000001 for user Papworth017
Dropping 1 entries with Weight (kg) = 1056.0 for user leeds01730
Dropping 1 entries with Weight (kg) = 20.0 for user Papworth019

Pulse (BPM)
Dropping 14 entries with Pulse (BPM) == 511)
       Pulse (BPM)      Us

In [4]:
# Do the four datasets contain the same IDs?
patient_ids = df_patient_data.ID.unique()  # All individuals
antibio_ids = df_antibio.ID.unique()  # Had a treatment
ex_labels_ids = (
    ex_labels.ID.unique()
)  # Was used in the predictive classifier
measurement_ids = df_measurement.ID.unique()  # Had a measurement

print(f"{len(measurement_ids)} individuals in measurement data")
print(f"{len(patient_ids)} individuals in patient data")
print(f"{len(antibio_ids)} individuals had a treatment (are in the antibiotics data), BUT")
ids_treatment_stable = np.intersect1d(antibio_ids, ex_labels_ids)
print(
    f" only {len(ids_treatment_stable)} individuals that had a treatment were used in the pred classifier"
)
print(f"{len(ex_labels_ids)} individuals have at least one exacerbation label")

print()
print("IMPORTANT: the 103 individuals in the antibiotics data and in the ex labels data are NOT the same")


146 individuals in measurement data
147 individuals in patient data
103 individuals had a treatment (are in the antibiotics data), BUT
 only 70 individuals that had a treatment were used in the pred classifier
103 individuals have at least one exacerbation label

IMPORTANT: the 103 individuals in the antibiotics data and in the ex labels data are NOT the same


### Patient vs measurement data

In [5]:
# Which IDs are in the patient data but not in the measurement data?
print("IDs in patient data but not in measurement data:")
print(np.setdiff1d(patient_ids, measurement_ids))

# Are measurement IDs a subset of patient IDs?
print("\nIs measurement IDs a subset of patient IDs?")
# np.isin replaces np.in1d, which is deprecated in recent NumPy versions
print(np.isin(measurement_ids, patient_ids).all())


IDs in patient data but not in measurement data:
['204']

Is measurement IDs a subset of patient IDs?
True
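
The same two questions can also be asked with plain Python sets, which some readers find easier to scan than the NumPy calls. This is a minimal sketch on toy IDs; `toy_patient_ids` and `toy_measurement_ids` are hypothetical stand-ins for `patient_ids` and `measurement_ids`.

```python
# Toy IDs standing in for patient_ids and measurement_ids (hypothetical values)
toy_patient_ids = {"100", "101", "204"}
toy_measurement_ids = {"100", "101"}

# IDs present in the patient data but absent from the measurement data
missing = sorted(toy_patient_ids - toy_measurement_ids)

# `<=` on sets tests for "is a subset of"
is_subset = toy_measurement_ids <= toy_patient_ids

print(missing, is_subset)
```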


### Ex labels data

In [6]:
# Count number of true and false for each id
def count_true(s):
    return s.sum()


def count_false(s):
    return len(s) - s.sum()


ex_labels_desc = (
    ex_labels[["ID", "Is Exacerbated"]]
    .groupby("ID")
    .agg([count_true, count_false, "count"])
)


# Spot-check a few IDs (the notebook only displays the last expression)
ex_labels_desc.loc["144"]
ex_labels_desc.loc["138"]
ex_labels_desc.loc["189"]

Is Exacerbated  count_true      21
                count_false    108
                count          129
Name: 189, dtype: int64
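
The per-ID counts can also be obtained with pandas named aggregation, which yields flat column names instead of a two-level column index. The sketch below runs on a toy frame (`toy_labels` is hypothetical) and assumes the "Is Exacerbated" column is boolean with no missing values.

```python
import pandas as pd

# Toy stand-in for ex_labels (hypothetical values)
toy_labels = pd.DataFrame(
    {
        "ID": ["1", "1", "1", "2", "2"],
        "Is Exacerbated": [True, False, False, False, False],
    }
)

# One row per ID with the number of True, False and total labels
toy_desc = toy_labels.groupby("ID")["Is Exacerbated"].agg(
    count_true="sum",
    count_false=lambda s: (~s).sum(),
    count="count",
)
print(toy_desc)
```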

### Patient vs antibiotics vs ex labels data

In [14]:
# Are antibiotics data and ex labels data both subsets of patient data?
print(
    f"Antibiotics data is a subset of patient data: {np.all(np.isin(antibio_ids, patient_ids))}"
)
print(
    f"Ex labels data is a subset of patient data: {np.all(np.isin(ex_labels_ids, patient_ids))}\n"
)

# Study the exacerbation labels data
ids_stable = ex_labels[ex_labels["Is Exacerbated"] == False].ID.unique()
ids_ex = ex_labels[ex_labels["Is Exacerbated"] == True].ID.unique()

# Create a dataframe with one row per patient and four columns:
# ID, had a treatment, has an ex label, Ex labels (type of labels found for this ID)
df = pd.DataFrame(columns=["ID", "had a treatment", "has an ex label", "Ex labels"])
# In ID column, put all individuals
df["ID"] = patient_ids
# In had a treatment column, put true if the ID is in antibio_ids, false otherwise
df["had a treatment"] = np.isin(patient_ids, antibio_ids)
# In ex labels column, put true if the ID is in ex_labels_ids, false otherwise
df["has an ex label"] = np.isin(patient_ids, ex_labels_ids)

df["Ex labels"] = np.where(
    np.isin(patient_ids, ex_labels_ids),
    np.where(
        np.isin(patient_ids, ids_stable),
        np.where(np.isin(patient_ids, ids_ex), "ex and stable labels", "stable labels only"),
        np.where(np.isin(patient_ids, ids_ex), "ex labels only", "should not happen"),
    ),
    "no ex label for this ID",
)
# Sort by treatment status, then by presence of an ex label, then by type of ex labels
df = df.sort_values(
    by=["had a treatment", "has an ex label", "Ex labels"], ascending=False
).reset_index(drop=True)

print("Grouping individuals by types of exacerbation labels")
print(df["Ex labels"].value_counts())

pd.set_option("display.max_rows", None)
df.head(147)

Antibiotics data is a subset of patient data: True
Ex labels data is a subset of patient data: True

Grouping individuals by types of exacerbation labels
ex and stable labels       57
stable labels only         46
no ex label for this ID    44
Name: Ex labels, dtype: int64


| | ID | had a treatment | has an ex label | Ex labels |
|---|---|---|---|---|
| 0 | 100 | True | True | ex and stable labels |
| 1 | 101 | False | True | stable labels only |
| 2 | 102 | True | True | stable labels only |
| 3 | 107 | True | False | no ex label for this ID |
| 4 | 113 | True | True | ex and stable labels |
| 5 | 114 | False | True | stable labels only |
| 6 | 115 | True | True | ex and stable labels |
| 7 | 121 | False | True | stable labels only |
| 8 | 122 | False | False | no ex label for this ID |
| 9 | 123 | True | True | ex and stable labels |
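
The nested `np.where` used to build the "Ex labels" column can be flattened with `np.select`, which evaluates conditions in order and takes the first match. This is a sketch on toy arrays (all `toy_*` names are hypothetical), and it assumes every labelled ID has at least one stable or one exacerbated label, so the "should not happen" branch is not needed.

```python
import numpy as np

# Toy stand-ins for patient_ids, ids_stable and ids_ex (hypothetical values)
toy_ids = np.array(["1", "2", "3", "4"])
toy_stable = np.array(["1", "2"])  # IDs with at least one stable label
toy_ex = np.array(["1", "3"])      # IDs with at least one exacerbated label

has_stable = np.isin(toy_ids, toy_stable)
has_ex = np.isin(toy_ids, toy_ex)

# Conditions are checked in order, so the most specific one comes first
categories = np.select(
    [has_stable & has_ex, has_stable, has_ex],
    ["ex and stable labels", "stable labels only", "ex labels only"],
    default="no ex label for this ID",
)
print(categories)
```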


In [9]:
# Get IDs of individuals with ex and stable labels
ids_ex_and_stable_labels = df[df["Ex labels"] == "ex and stable labels"].ID.unique()

# Intersection of ids_ex_and_stable_labels and measurement_ids
ids_ex_and_stable_labels_in_measurement = np.intersect1d(
    ids_ex_and_stable_labels, measurement_ids
)
print(
    f"{len(ids_ex_and_stable_labels_in_measurement)} individuals with ex and stable labels have measurements"
)
print(ids_ex_and_stable_labels_in_measurement)

57 individuals with ex and stable labels have measurements
['100' '113' '115' '123' '130' '132' '133' '137' '138' '139' '140' '141'
 '143' '144' '151' '153' '171' '172' '173' '176' '179' '186' '188' '189'
 '193' '194' '195' '200' '214' '215' '229' '23' '231' '232' '233' '24'
 '241' '29' '30' '31' '32' '35' '36' '38' '39' '42' '45' '58' '59' '66'
 '69' '70' '71' '75' '78' '79' '92']


In [16]:
# Check the overlap between IDs in the antibiotics data and in the ex labels data
# IDs in the antibiotics data that are missing from the ex labels data
diff = np.setdiff1d(antibio_ids, ex_labels_ids)
print(f"{len(diff)} ids in antibio_ids and not in ex_labels_ids:\n{diff}")

# IDs in the ex labels data that are missing from the antibiotics data
diff = np.setdiff1d(ex_labels_ids, antibio_ids)
print(f"{len(diff)} ids in ex_labels_ids and not in antibio_ids:\n{diff}")

33 ids in antibio_ids and not in ex_labels_ids:
['107' '135' '152' '170' '174' '177' '181' '182' '190' '192' '198' '199'
 '201' '202' '203' '204' '205' '206' '207' '228' '230' '235' '27' '33'
 '37' '41' '43' '47' '67' '68' '72' '73' '76']
33 ids in ex_labels_ids and not in antibio_ids:
['101' '114' '121' '125' '126' '127' '128' '131' '134' '136' '175' '187'
 '197' '209' '213' '216' '223' '229' '236' '34' '40' '46' '53' '54' '55'
 '57' '61' '62' '63' '80' '81' '93' '99']
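
Both differences and the intersection can be obtained in one pass with an outer merge and `indicator=True`. The sketch below uses toy frames (`toy_antibio` and `toy_ex` are hypothetical) with one row per ID; on the real data, the IDs would first need to be de-duplicated, for example with `drop_duplicates`.

```python
import pandas as pd

# Toy stand-ins with one row per ID (hypothetical values)
toy_antibio = pd.DataFrame({"ID": ["1", "2", "3"]})
toy_ex = pd.DataFrame({"ID": ["2", "3", "4", "5"]})

# The _merge column records where each ID was found:
# left_only (antibiotics only), right_only (ex labels only) or both
overlap = toy_antibio.merge(toy_ex, on="ID", how="outer", indicator=True)
counts = overlap["_merge"].value_counts()
print(counts)
```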


In [17]:
# IDs that had a treatment and have at least one exacerbated label
ids_treatment_ex = np.intersect1d(antibio_ids, ids_ex)
# IDs that had a treatment and have any ex label (exacerbated or stable)
ids_treatment_stable = np.intersect1d(antibio_ids, ex_labels_ids)

print(
    f"{len(ids_treatment_ex)} individuals had a treatment and were marked as exacerbated: \n{ids_treatment_ex}\n"
)
print(
    f"{len(ids_treatment_stable)} individuals had a treatment and are listed in the predictive classifier: \n{ids_treatment_stable}\n"
)

56 individuals had a treatment and were marked as exacerbated: 
['100' '113' '115' '123' '130' '132' '133' '137' '138' '139' '140' '141'
 '143' '144' '151' '153' '171' '172' '173' '176' '179' '186' '188' '189'
 '193' '194' '195' '200' '214' '215' '23' '231' '232' '233' '24' '241'
 '29' '30' '31' '32' '35' '36' '38' '39' '42' '45' '58' '59' '66' '69'
 '70' '71' '75' '78' '79' '92']

70 individuals had a treatment and are listed in the predictive classifier: 
['100' '102' '113' '115' '123' '129' '130' '132' '133' '137' '138' '139'
 '140' '141' '143' '144' '151' '153' '169' '171' '172' '173' '176' '178'
 '179' '186' '188' '189' '191' '193' '194' '195' '196' '200' '212' '214'
 '215' '227' '23' '231' '232' '233' '234' '24' '241' '28' '29' '30' '31'
 '32' '35' '36' '38' '39' '42' '44' '45' '56' '58' '59' '66' '69' '70'
 '71' '74' '75' '78' '79' '82' '92']
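
A quick consistency check on the outputs above: 57 individuals have both ex and stable labels, but only 56 treated individuals were marked as exacerbated. The sketch below copies the two ID lists from the printed outputs and looks for the difference; the variable names are illustrative.

```python
# IDs copied from the printed outputs above (variable names are illustrative)
ex_and_stable = (
    "100 113 115 123 130 132 133 137 138 139 140 141 "
    "143 144 151 153 171 172 173 176 179 186 188 189 "
    "193 194 195 200 214 215 229 23 231 232 233 24 "
    "241 29 30 31 32 35 36 38 39 42 45 58 59 66 "
    "69 70 71 75 78 79 92"
).split()
treated_and_ex = (
    "100 113 115 123 130 132 133 137 138 139 140 141 "
    "143 144 151 153 171 172 173 176 179 186 188 189 "
    "193 194 195 200 214 215 23 231 232 233 24 241 "
    "29 30 31 32 35 36 38 39 42 45 58 59 66 69 "
    "70 71 75 78 79 92"
).split()

# IDs with ex and stable labels that never had a treatment
untreated = sorted(set(ex_and_stable) - set(treated_and_ex))
print(len(ex_and_stable), len(treated_and_ex), untreated)
```

The only ID left over is '229', which also appears in the list of IDs that are in the ex labels data but not in the antibiotics data.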

