# Patient Overlap and Data Leakage

Patient overlap in medical data is one instance of a more general machine learning problem called **data leakage**. When the same patient appears in more than one split, the model is evaluated on images that closely resemble ones it was trained on, so the measured performance is overly optimistic. To identify patient overlap in this week's graded assignment, you'll check whether a patient's ID appears in both the training set and the test set. You should also verify that there is no patient overlap between the training and validation sets, which is what you'll do here.

Below is a simple example showing how to check for and remove patient overlap between your training and validation sets.

In [2]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import os
import seaborn as sns
sns.set()

In [7]:
train = pd.read_csv('Files/nih/train-small.csv')
train

Unnamed: 0,Image,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,Nodule,PatientId,Pleural_Thickening,Pneumonia,Pneumothorax
0,00008270_015.png,0,0,0,0,0,0,0,0,0,0,0,8270,0,0,0
1,00029855_001.png,1,0,0,0,1,0,0,0,1,0,0,29855,0,0,0
2,00001297_000.png,0,0,0,0,0,0,0,0,0,0,0,1297,1,0,0
3,00012359_002.png,0,0,0,0,0,0,0,0,0,0,0,12359,0,0,0
4,00017951_001.png,0,0,0,0,0,0,0,0,1,0,0,17951,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,00015869_010.png,0,0,0,0,0,0,0,0,0,0,0,15869,0,0,0
996,00020113_005.png,0,0,0,0,1,0,0,0,0,0,0,20113,0,0,0
997,00019939_000.png,0,0,0,0,0,0,0,0,0,0,0,19939,0,0,0
998,00030496_000.png,0,0,0,0,0,0,0,0,0,0,0,30496,0,0,0


In [8]:
valid = pd.read_csv('Files/nih/valid-small.csv')
valid

Unnamed: 0,Image,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,Nodule,PatientId,Pleural_Thickening,Pneumonia,Pneumothorax
0,00027623_007.png,0,0,0,1,1,0,0,0,0,0,0,27623,0,0,0
1,00028214_000.png,0,0,0,0,0,0,0,0,0,0,0,28214,0,0,0
2,00022764_014.png,0,0,0,0,0,0,0,0,0,0,0,22764,0,0,0
3,00020649_001.png,1,0,0,0,1,0,0,0,0,0,0,20649,0,0,0
4,00022283_023.png,0,0,0,0,0,0,0,0,0,0,0,22283,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
104,00020290_008.png,0,0,0,0,0,0,0,0,0,0,0,20290,0,0,0
105,00019627_000.png,0,0,0,0,0,0,0,0,0,0,0,19627,0,0,0
106,00018464_017.png,0,0,0,1,0,0,0,0,1,0,0,18464,0,1,0
107,00009964_000.png,0,0,0,0,0,0,0,0,0,0,0,9964,0,0,0


In [12]:
# Extract the patient IDs from each set as NumPy arrays
train_pids = train.PatientId.values
valid_pids = valid.PatientId.values

In [14]:
# Patient IDs that appear in both the training and validation sets
common_ids = list(set(train_pids).intersection(set(valid_pids)))
common_ids, len(common_ids)


([20290, 27618, 9925, 10888, 22764, 19981, 18253, 4461, 28208, 8760, 7482], 11)

In [27]:
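# Collect the index labels of every validation row that belongs to an overlapping patient
valid_idxs = []
for i in common_ids:
    valid_idxs.extend(valid.index[valid['PatientId']==i].tolist())
print(valid_idxs)

# Drop those rows from the validation set in place
valid.drop(valid_idxs, inplace=True)

valid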

[104, 88, 65, 13, 2, 41, 56, 70, 26, 75, 20, 52, 55]


Unnamed: 0,Image,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,Nodule,PatientId,Pleural_Thickening,Pneumonia,Pneumothorax
0,00027623_007.png,0,0,0,1,1,0,0,0,0,0,0,27623,0,0,0
1,00028214_000.png,0,0,0,0,0,0,0,0,0,0,0,28214,0,0,0
3,00020649_001.png,1,0,0,0,1,0,0,0,0,0,0,20649,0,0,0
4,00022283_023.png,0,0,0,0,0,0,0,0,0,0,0,22283,0,0,0
5,00003098_000.png,0,0,0,0,1,0,0,0,1,0,0,3098,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103,00025335_000.png,0,0,0,0,0,0,0,0,0,0,0,25335,0,0,0
105,00019627_000.png,0,0,0,0,0,0,0,0,0,0,0,19627,0,0,0
106,00018464_017.png,0,0,0,1,0,0,0,0,1,0,0,18464,0,1,0
107,00009964_000.png,0,0,0,0,0,0,0,0,0,0,0,9964,0,0,0


In [28]:
# Recompute the overlap to confirm that no shared patient IDs remain
common_ids = list(set(train.PatientId.values).intersection(set(valid.PatientId.values)))
common_ids

[]
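
The loop-based removal above can also be written without an explicit loop by using `Series.isin`. The following is a minimal sketch, assuming a `PatientId` column like the one in the tables above. The helper name `remove_patient_overlap` and the toy data frames are hypothetical and only serve as an illustration; they are not part of the assignment.

```python
import pandas as pd

def remove_patient_overlap(train_df, valid_df, id_col="PatientId"):
    """Return a copy of valid_df without rows whose patient also appears in train_df.

    Hypothetical helper for illustration; id_col is assumed to exist in both frames.
    """
    # Boolean mask: True for validation rows whose patient ID is also in the training set
    overlap = valid_df[id_col].isin(train_df[id_col])
    return valid_df.loc[~overlap].copy()

# Toy data (made up for this sketch, not taken from the NIH files)
toy_train = pd.DataFrame({"Image": ["a.png", "b.png", "c.png"], "PatientId": [1, 2, 3]})
toy_valid = pd.DataFrame({"Image": ["d.png", "e.png", "f.png"], "PatientId": [3, 4, 3]})

toy_valid_clean = remove_patient_overlap(toy_train, toy_valid)

# Re-check the overlap the same way as above: intersect the two sets of IDs
remaining_overlap = set(toy_train["PatientId"]) & set(toy_valid_clean["PatientId"])
print(toy_valid_clean)
print(remaining_overlap)
```

Unlike `valid.drop(..., inplace=True)`, this version returns a new data frame and leaves the original validation set untouched, so rerunning the cell does not change the result.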