# W1 Review Code

In [2]:
import pandas as pd

## Load Dataset

---
### Preventing Data Leakage
* It is worth noting that our dataset contains multiple images for each patient.
* This could be the case, for example, when a patient has taken multiple X-ray images at different times during their hospital visits.
* In our data splitting, we have ensured that the split is done on the patient level so that **there is no data "leakage" between the train, validation, and test datasets.**
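
A patient-level split of this kind can be sketched as below. This is only an illustration, not the procedure actually used to build our splits: the helper `split_by_patient`, the toy `images_df` frame, and the `test_frac` and `seed` parameters are all hypothetical. The idea is to shuffle the unique patient IDs, reserve a fraction of them for the held-out set, and then assign every image row according to its patient.

```python
import numpy as np
import pandas as pd


def split_by_patient(df, patient_col, test_frac=0.2, seed=0):
    """Split rows so that each patient lands in exactly one subset."""
    # Work on unique patient IDs, not on individual image rows.
    patient_ids = df[patient_col].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(patient_ids)

    # Reserve a fraction of the patients (at least one) for the held-out set.
    n_test = max(1, int(round(len(shuffled) * test_frac)))
    test_ids = shuffled[:n_test]

    # Every row follows its patient into the same subset.
    in_test = df[patient_col].isin(test_ids)
    return df[~in_test], df[in_test]


# Toy data (hypothetical): several patients have more than one image.
images_df = pd.DataFrame({
    'PatientId': [0, 0, 1, 2, 2, 2, 3, 4, 4, 5],
    'Image': [f'img_{i}.png' for i in range(10)],
})

train_df, test_df = split_by_patient(images_df, 'PatientId', test_frac=0.3)
print(sorted(train_df['PatientId'].unique()))
print(sorted(test_df['PatientId'].unique()))
```

Because the split is drawn over patient IDs rather than rows, all images of a given patient end up on the same side, which is exactly the property the leakage check below verifies.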

#### Checking Data Leakage
* Write a function to check whether there is leakage between two datasets.
* We'll use this to make sure there are no patients in the test set that are also present in either the train or validation sets.

##### Identifying Overlapping Records

Function: **check_for_leakage**: Return True if any patients are in both df1 and df2.

    """
    Args:
        df1 (dataframe): dataframe describing first dataset
        df2 (dataframe): dataframe describing second dataset
        patient_col (str): string name of column with patient IDs

    Returns:
        leakage (bool): True if there is leakage, otherwise False
    """


In [36]:
def check_for_leakage(df1, df2, patient_col):
    df1_patients_unique = set(df1[patient_col].values)
    df2_patients_unique = set(df2[patient_col].values)

    patients_in_both_groups = df1_patients_unique.intersection(df2_patients_unique)

    # leakage is True if at least one patient appears in both groups, otherwise False.
    leakage = len(patients_in_both_groups) != 0

    return leakage

If the check returns False for both the train/test and valid/test pairs, then we're ready to start preparing the datasets for training. Remember to always check for data leakage!

In [69]:
def zach_check_for_leakage(df1, df2, patient_col):
    df1_patients_unique = set(df1[patient_col].values)
    df2_patients_unique = set(df2[patient_col].values)

    patients_in_both_groups = df1_patients_unique.intersection(df2_patients_unique)

    # leakage is True if at least one patient appears in both groups, otherwise False.
    leakage = len(patients_in_both_groups) != 0

    # Report the result; print() returns None, so there is nothing useful to return here.
    if not leakage:
        print(f'{leakage}: There is No Leakage between the datasets. Ready to start preparing the datasets for training')
    else:
        print(f'{leakage}: Leakage: there are {len(patients_in_both_groups)} records in both the datasets: {patients_in_both_groups}')

Run the next cell to check if there are patients in both train and test or in both valid and test.

In [81]:
# Case 1
case1_train_df = pd.DataFrame({'PatientId': [0, 1, 2, 3, 4, 5, 6, 8]})
case1_valid_df = pd.DataFrame({'PatientId': [9, 10, 11, 12]})
case1_test_df = pd.DataFrame({'PatientId': [6, 0, 4]})

In [83]:
zach_check_for_leakage(case1_train_df, case1_test_df, 'PatientId')
zach_check_for_leakage(case1_valid_df, case1_test_df, 'PatientId')

True: Leakage: there are 3 records in both the datasets: {0, 4, 6}
False: There is No Leakage between the datasets. Ready to start preparing the datasets for training


In [84]:
print("leakage between train and test: {}".format(check_for_leakage(case1_train_df, case1_test_df, 'PatientId')))
print("leakage between valid and test: {}".format(check_for_leakage(case1_valid_df, case1_test_df, 'PatientId')))

leakage between train and test: True
leakage between valid and test: False
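
The cells above check only train/test and valid/test. As a rough sketch, the same set-intersection idea can be applied to every pair of splits at once, so that a train/valid overlap is not missed either. The helper `find_shared_patients` and the `splits` dictionary are hypothetical names introduced for this illustration; the data reuses the Case 1 values.

```python
from itertools import combinations

import pandas as pd


def find_shared_patients(df1, df2, patient_col):
    """Return the set of patient IDs that appear in both dataframes."""
    return set(df1[patient_col]) & set(df2[patient_col])


# Same patient IDs as Case 1 above.
splits = {
    'train': pd.DataFrame({'PatientId': [0, 1, 2, 3, 4, 5, 6, 8]}),
    'valid': pd.DataFrame({'PatientId': [9, 10, 11, 12]}),
    'test': pd.DataFrame({'PatientId': [6, 0, 4]}),
}

# Compare every unordered pair of splits and record the shared patient IDs.
overlaps = {}
for (name_a, df_a), (name_b, df_b) in combinations(splits.items(), 2):
    shared = find_shared_patients(df_a, df_b, 'PatientId')
    overlaps[(name_a, name_b)] = shared
    print(f'{name_a} vs {name_b}: leakage={bool(shared)}, shared={sorted(shared)}')
```

Returning the shared IDs rather than only a boolean makes it easy to see which patients need to be dealt with when leakage is found.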


##### Remove Overlapping Records