# Proportion representative test split (90% 0 | 10% 1)

As referenced in `1 - dataPreprocessing`, the HRCT_Pilot folder is not representative of the dataset's class imbalance: it contains very few fibrosis cases (only 2 in 301 slices).

For that reason, the next logical step is to build a test dataframe containing ``SliceID`` and ``Class`` pairs with the following characteristics:

 - if a patient is present in the test dataset, it cannot be in the train dataset (and vice versa), so that no patient's slices leak between the two sets
 - class imbalance must approximate the overall 90-10 proportion, within a reasonable tolerance threshold, so that the sample is representative of the entire dataset
 - test dataframe size must represent 20 to 30% of the entire dataset, yielding roughly 630-900 out of 3075 slices

Respecting these rules will ensure the creation of a valid test split.
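
The three rules above can be written as an explicit check on a candidate split. This is a minimal sketch and not part of the original notebook: `is_valid_split`, its parameters, and the `PatientID` column are hypothetical names, and the toy data is made up for illustration.

```python
import pandas as pd

def is_valid_split(train, test, target=0.9, prop_thr=0.01, min_frac=0.2, max_frac=0.3):
    """Check the three split rules on dataframes with 'PatientID' and 'Class' columns."""
    # Rule 1: no patient appears in both train and test
    disjoint = set(train["PatientID"]).isdisjoint(test["PatientID"])
    # Rule 2: share of Class 0 in test lies within the tolerance around the target
    prop = (test["Class"] == 0).mean()
    representative = abs(prop - target) <= prop_thr
    # Rule 3: test holds between min_frac and max_frac of all slices
    frac = len(test) / (len(train) + len(test))
    sized = min_frac <= frac <= max_frac
    return bool(disjoint and representative and sized)

# Toy data (assumption): patients A and B in train, patient C in test
train = pd.DataFrame({"PatientID": ["A"] * 15 + ["B"] * 15, "Class": [0] * 27 + [1] * 3})
test = pd.DataFrame({"PatientID": ["C"] * 10, "Class": [0] * 9 + [1]})
print(is_valid_split(train, test))
```

Checking all three conditions in one place makes it easy to confirm a finished split independently of the procedure that produced it.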

Code intuition is as follows:

 - Randomly select a patient from the full dataset (can easily be extracted from `SliceID`)
 - Remove patient from train and add to test dataframe

After that, the new dataframe will undergo a series of tests and approximation measures:

 - if conditions_above_met: test split successful
 - elif test_split[`Class` == 0] < 90% - threshold: add a patient whose `Class` == 0 proportion is above 90%, raising the test split's `Class` == 0 proportion towards the dataset's
 - elif test_split[`Class` == 0] > 90% + threshold: add a patient whose `Class` == 0 proportion is below 90%, adding `Class` == 1 examples and lowering the proportion towards the dataset's
 - elif len(test_split) < 630 - threshold: add the patient whose proportion is closest to (90% 0 | 10% 1)
 - elif len(test_split) > 900 + threshold: remove the patient whose proportion is closest to (90% 0 | 10% 1)
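
The decision rules above can be sketched as a pure function that only looks at the current proportion and size. This is an illustrative sketch: `next_action`, its parameter names, and the returned labels are hypothetical, and the default values simply mirror the numbers quoted in this section.

```python
def next_action(prop, size, target=0.9, prop_thr=0.01,
                min_size=630, max_size=900, size_thr=25):
    """Return the adjustment suggested for a test split with Class 0 share `prop` and `size` slices."""
    if prop < target - prop_thr:
        return "add_patient_above_target"        # raises the Class 0 share
    if prop > target + prop_thr:
        return "add_patient_below_target"        # adds Class 1 examples
    if size < min_size - size_thr:
        return "add_patient_closest_to_target"   # grows the split, proportion barely moves
    if size > max_size + size_thr:
        return "remove_patient_closest_to_target"
    return "done"                                # proportion and size are both acceptable

print(next_action(prop=0.85, size=700))
print(next_action(prop=0.90, size=700))
```

Keeping the decision separate from the dataframe manipulation makes the thresholds easy to test in isolation.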

To keep progress visible, every n iterations the function prints the test dataframe's size and proportion. This also allows fine-tuning, in case the proportion and size are already good enough but the loop keeps running because of poorly chosen thresholds.

In [1]:
import pandas as pd
import pickle
import random 

In [2]:
#df_fibrosis = pd.read_pickle(r'D:\Rafa\A1Uni\2semestre\Estágio\fibrosis_data.pkl')

df_fibrosis = pd.read_pickle(r'..\..\\fibrosis_data.pkl')

# Removing SliceData as it is not necessary for this procedure
df_fibrosis = df_fibrosis.drop(columns=["SliceData"])

### Utility functions

The produced ID ends in `__` to make it easier to find slices of the same patient. This guarantees that a check such as "if slice_id in df["SliceID"]" identifies the patient correctly even if the number appears in a different place (for example, 142__[...] is correct for patient 142, but [...]__142-77 is not):

In [3]:
def getPatientID(slice_id):
    
    # Finds index of "__" occurrence -> finds flag index
    flag = slice_id.find("__")

    # Finds main folder in "txt ROI's" 
    main_folder = slice_id[:flag] if flag != -1 else slice_id

    # Main folder is not a number, patient is in the 
    # "HRCT_Pilot" folder, extract the patient id in front
    if "HRCT_Pilot" in str(main_folder):
        # Removes "HRCT_Pilot__"
        patientID = slice_id[12:]
        # Crops to "PatientID__"
        return patientID[:5]

    # Main folder is already a number, even if a patient
    # has more than 1 exam folder, use number as ID
    elif str(main_folder).isnumeric(): return main_folder + "__"

    else: 
        print("ERROR IN GETPATIENTID")
        print(slice_id)
        return slice_id

The function uses the pre-defined `__` flag to identify patients and handles each special case:

 - `HRCT_Pilot`: all patient IDs are >= 200, and flagged as "HRCT_Pilot__PatientID__SliceID"
 - `Nested Folders`: SliceID for patients with more than 1 exam still starts with PatientID
 - `Regular`: same as above
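
The effect of the flag can be seen on a few made-up slice IDs. This is a small sketch, assuming hypothetical IDs that only imitate the formats described above: matching the bare number anywhere in the string picks up another patient's slice, while matching the flagged ID at the start does not.

```python
import pandas as pd

# Hypothetical slice IDs: two slices of patient 142 and one slice of patient 7
slice_ids = pd.Series(["142__exam1-01", "142__exam1-02", "7__142-77"])

# Bare number matched anywhere: also matches the slice of patient 7
loose = slice_ids.str.contains("142", regex=False)

# Flagged ID anchored at the start: matches only patient 142
strict = slice_ids.str.startswith("142__")

print(loose.tolist())   # [True, True, True]
print(strict.tolist())  # [True, True, False]
```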

In [4]:
# Moves all patient slices from one dataset to another
def movePatient(patient, goes_from, to):

    # If the ID is a number >= 200 (HRCT_Pilot), the SliceID contains "_number__"
    # instead of starting with it
    # NOT WORKING
    if int(patient[:-2]) >= 200: mask = goes_from["SliceID"].str.contains(f"_{patient}")
    else: mask = goes_from["SliceID"].str.startswith(patient)
    to = pd.concat([to, goes_from[mask]])
    goes_from = goes_from[~mask]

    return goes_from, to

Simple calculation of proportion of Class 0 in defined dataframe:

In [5]:
# Returns proportion of class 0
def getProportion(df):
    total_samples = len(df)
    if total_samples == 0: return 0
    # Counting matches directly avoids a KeyError when no class 0 rows exist
    return (df['Class'] == 0).sum() / total_samples
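
As a worked example of the same quantity, the proportion of Class 0 is the mean of a boolean mask. The sketch below uses a hypothetical helper name (`class0_share`) and toy data, and also covers the two edge cases of an empty dataframe and a dataframe with no Class 0 rows.

```python
import pandas as pd

def class0_share(df):
    """Share of rows with Class == 0; returns 0.0 for an empty dataframe."""
    if df.empty:
        return 0.0
    return float((df["Class"] == 0).mean())

toy = pd.DataFrame({"Class": [0, 0, 0, 1]})
print(class0_share(toy))                     # 3 of 4 rows are Class 0 -> 0.75
print(class0_share(toy[toy["Class"] == 1]))  # no Class 0 rows -> 0.0
```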

The code below creates a dictionary with (patientID: proportion in df) pairs:

In [6]:
# Returns dict mapping each patient ID to its proportion of class 0
def getInfoDict(df_original):
    df = df_original.copy()
    info_dict = {}

    while not df.empty:
        patient_id = getPatientID(df["SliceID"].iloc[0])

        # Get all rows for that patient
        mask = df["SliceID"].str.contains(patient_id)
        df_patient = df[mask]

        # Add info
        proportion = getProportion(df_patient) 
        info_dict[patient_id] = proportion

        # Remove from df
        df = df[~mask]

    return info_dict
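
The same dictionary can also be built with a single `groupby` instead of repeatedly filtering the dataframe. This is an alternative sketch rather than the notebook's method: `patient_proportions` is a hypothetical name, the toy slice IDs are made up, and the ID rule used here (everything before the first `__`, plus the flag) is a simplification of `getPatientID` that ignores the `HRCT_Pilot` case.

```python
import pandas as pd

def patient_proportions(df, patient_ids):
    """Map each patient ID to its share of Class 0 slices.

    `patient_ids` is a Series aligned with `df`, holding one ID per slice.
    """
    is_class0 = (df["Class"] == 0).astype(float)
    return is_class0.groupby(patient_ids, sort=False).mean().to_dict()

toy = pd.DataFrame({
    "SliceID": ["12__a", "12__b", "34__a", "34__b"],
    "Class": [0, 1, 0, 0],
})
# Simplified ID rule for the toy data (assumption): text before the first "__", plus the flag
ids = toy["SliceID"].str.split("__").str[0] + "__"
print(patient_proportions(toy, ids))
```

Because each slice is assigned to exactly one ID before grouping, one patient's ID cannot match another patient's slices by substring. In the notebook, the IDs could be obtained with `df["SliceID"].map(getPatientID)`.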

The main function below builds on these utility functions. If needed, it can easily be adapted to take the target proportion and test size as parameters:

In [7]:
def fibrosisTestSplit(original_dataframe, show_updates=False):

    df = original_dataframe.copy()
    test_df = pd.DataFrame(columns=["SliceID", "Class"])


    # Initialize loop with random slice
    init_slice = df.iloc[random.choice(range(len(df)))]
    cur_id = getPatientID(init_slice["SliceID"])
    df, test_df = movePatient(patient=cur_id, goes_from=df, to=test_df)

    print("Started by moving patient ",cur_id)
    print("Initial proportion: ", getProportion(test_df),"   |    Initial size: ",len(test_df))

    # Proportion and size tolerance/threshold
    prop_thr, size_thr = 0.01, 25

    # Counter for integrity checks
    n, nothing_counter = 0, 0


    # Until test dataframe has reasonable size and proportion, loop will run 
    # (range checks on proportion and size). Also includes iteration limit
    while (not ((0.9-prop_thr <= getProportion(test_df) <= 0.9+prop_thr)
            and (630 <= len(test_df) <= 900+size_thr))) and n<=1000:
        
        # Dictionary with (patientID: proportion in df) pairs
        info_train = getInfoDict(df)
        info_test = getInfoDict(test_df)

        # Approximation measures
        if getProportion(test_df) < 0.9-prop_thr: 
            nothing_counter = 0
            # Add from random.choice(list of patient ids with proportion >0.9)
            cur_patient = random.choice([key for key, value in info_train.items() if value > 0.9])
            df, test_df = movePatient(cur_patient,goes_from=df,to=test_df)

        elif getProportion(test_df) > 0.9+prop_thr: 
            nothing_counter = 0
            # Add from random.choice(list of patient ids with proportion <0.9)
            cur_patient = random.choice([key for key, value in info_train.items() if value < 0.9])
            df, test_df = movePatient(cur_patient,goes_from=df,to=test_df)

        elif len(test_df) < 630: 
            nothing_counter = 0
            # Add the TRAIN patient whose proportion is closest to 0.9
            cur_patient = min(info_train, key=lambda k: abs(info_train[k] - 0.9))
            df, test_df = movePatient(cur_patient,goes_from=df,to=test_df)

        elif len(test_df) > 900+size_thr: 
            nothing_counter = 0
            # Remove the TEST patient whose proportion is closest to 0.9
            cur_patient = min(info_test, key=lambda k: abs(info_test[k] - 0.9))
            # Swap order of patient trade
            test_df, df = movePatient(cur_patient,goes_from=test_df,to=df)

        else: 
            nothing_counter += 1

        # Useful for debugging
        if show_updates and n % 10 == 0:
            print("Proportion: ", getProportion(test_df),"   |    Size: ",len(test_df))

        # Early stop if program is doing nothing
        if nothing_counter > 50: break

        n+=1

    print("\n----------------------------------------------\n")
    print("Final test proportion: ", getProportion(test_df),"   |    Final test size: ",len(test_df))
    print("Final train proportion: ", getProportion(df),"   |    Final train size: ",len(df))

    return df, test_df

In [8]:
df_train, df_test = fibrosisTestSplit(df_fibrosis)

Started by moving patient  107__
Initial proportion:  1.0    |    Initial size:  31

----------------------------------------------

Final test proportion:  0.8917682926829268    |    Final test size:  656
Final train proportion:  0.9073614557485525    |    Final train size:  2418


Code below is disabled in order to prevent overwrite:

```py

df_train, df_test = pd.read_csv("train_dataframe.csv"), pd.read_csv("test_dataframe.csv")

```

In [9]:
if int(getProportion(df_test)*100) in range(89,91): print(f"Proportion ({getProportion(df_test):.2f}) is representative of entire dataset.")
print(f"Test dataframe consists of {len(df_test)/len(df_fibrosis)*100:.2f}% of entire dataset.")

Proportion (0.89) is representative of entire dataset.
Test dataframe consists of 21.34% of entire dataset.


Code below is disabled in order to prevent overwrite:

```py

df_train.to_csv("train_dataframe.csv", index=False)
df_test.to_csv("test_dataframe.csv", index=False)

```