##Preprocessing Dataset Example

Chest X-ray dataset: https://www.kaggle.com/nih-chest-xrays/data

The example dataset can be used to predict whether a patient has some form of chest disease from the patient's characteristics, such as age and gender.
The objective of this example is to extract the features and labels and save them as a CSV file in Google Drive.

###Setting up

Import the required libraries.

In [9]:
import numpy as np
import pandas as pd

Mount Google Drive.

In [4]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Find the file path of your CSV file!

In [41]:
!ls /content/drive/MyDrive/ChestXRays.csv

/content/drive/MyDrive/ChestXRays.csv
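
The same check can be done from Python, which is handy if the path is stored in a variable and reused later. This is a minimal sketch: `csv_exists` is a hypothetical helper that is not part of the original notebook, and the path is the one used above.

```python
from pathlib import Path


def csv_exists(path_str):
    """Return True if path_str points to an existing regular file."""
    return Path(path_str).is_file()


# In Colab this is the mounted Drive path used in the cell above.
csv_path = "/content/drive/MyDrive/ChestXRays.csv"

if csv_exists(csv_path):
    print("Found:", csv_path)
else:
    print("Not found, check the Drive mount and the file name:", csv_path)
```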


###Pandas


Read in the CSV! Let us name the variable that stores the dataframe *data_df*.

In [80]:
data_df = pd.read_csv("/content/drive/MyDrive/ChestXRays.csv")

Take a look at the first few rows of the data!

In [81]:
data_df

Unnamed: 0,Image Index,Finding Labels,Follow-up #,Patient ID,Patient Age,Patient Gender,View Position,OriginalImage[Width,Height],OriginalImagePixelSpacing[x,y],Unnamed: 11
0,00000001_000.png,Cardiomegaly,0,1,58,M,PA,2682,2749,0.143,0.143,
1,00000001_001.png,Cardiomegaly|Emphysema,1,1,58,M,PA,2894,2729,0.143,0.143,
2,00000001_002.png,Cardiomegaly|Effusion,2,1,58,M,PA,2500,2048,0.168,0.168,
3,00000002_000.png,No Finding,0,2,81,M,PA,2500,2048,0.171,0.171,
4,00000003_000.png,Hernia,0,3,81,F,PA,2582,2991,0.143,0.143,
...,...,...,...,...,...,...,...,...,...,...,...,...
112115,00030801_001.png,Mass|Pneumonia,1,30801,39,M,PA,2048,2500,0.168,0.168,
112116,00030802_000.png,No Finding,0,30802,29,M,PA,2048,2500,0.168,0.168,
112117,00030803_000.png,No Finding,0,30803,42,F,PA,2048,2500,0.168,0.168,
112118,00030804_000.png,No Finding,0,30804,30,F,PA,2048,2500,0.168,0.168,


In [82]:
data_df.shape

(112120, 12)

In [69]:
data_df.head()

Unnamed: 0,Image Index,Finding Labels,Follow-up #,Patient ID,Patient Age,Patient Gender,View Position,OriginalImage[Width,Height],OriginalImagePixelSpacing[x,y],Unnamed: 11
0,00000001_000.png,Cardiomegaly,0,1,58,M,PA,2682,2749,0.143,0.143,
1,00000001_001.png,Cardiomegaly|Emphysema,1,1,58,M,PA,2894,2729,0.143,0.143,
2,00000001_002.png,Cardiomegaly|Effusion,2,1,58,M,PA,2500,2048,0.168,0.168,
3,00000002_000.png,No Finding,0,2,81,M,PA,2500,2048,0.171,0.171,
4,00000003_000.png,Hernia,0,3,81,F,PA,2582,2991,0.143,0.143,


In [83]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112120 entries, 0 to 112119
Data columns (total 12 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Image Index                  112120 non-null  object 
 1   Finding Labels               112120 non-null  object 
 2   Follow-up #                  112120 non-null  int64  
 3   Patient ID                   112120 non-null  int64  
 4   Patient Age                  112120 non-null  int64  
 5   Patient Gender               112120 non-null  object 
 6   View Position                112120 non-null  object 
 7   OriginalImage[Width          112120 non-null  int64  
 8   Height]                      112120 non-null  int64  
 9   OriginalImagePixelSpacing[x  112120 non-null  float64
 10  y]                           112120 non-null  float64
 11  Unnamed: 11                  0 non-null       float64
dtypes: float64(3), int64(5), object(4)
memory usage: 10.3+ MB
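
The `info()` output shows that `Unnamed: 11` has 0 non-null values. One plausible cause is a trailing comma on every line of the file, but that is an assumption about the file, not something the output confirms. The sketch below reproduces the effect on a tiny made-up CSV and removes the empty column with `dropna(axis=1, how="all")`; the names `raw`, `demo_df`, and `cleaned_df` are illustrative only.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: note the trailing comma on every line.
raw = io.StringIO(
    "Image Index,Patient Age,\n"
    "00000001_000.png,58,\n"
    "00000002_000.png,81,\n"
)

demo_df = pd.read_csv(raw)
# The empty trailing field becomes an extra, unnamed column.
print(demo_df.columns.tolist())

# Drop every column in which all values are missing.
cleaned_df = demo_df.dropna(axis=1, how="all")
print(cleaned_df.columns.tolist())
```

Dropping by `how="all"` avoids hard-coding the column name, which changes with the number of columns in the file.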


In [84]:
data_df.describe()

Unnamed: 0,Follow-up #,Patient ID,Patient Age,OriginalImage[Width,Height],OriginalImagePixelSpacing[x,y],Unnamed: 11
count,112120.0,112120.0,112120.0,112120.0,112120.0,112120.0,112120.0,0.0
mean,8.573751,14346.381743,46.901463,2646.078844,2486.438842,0.155649,0.155649,
std,15.40632,8403.876972,16.839923,341.246429,401.268227,0.016174,0.016174,
min,0.0,1.0,1.0,1143.0,966.0,0.115,0.115,
25%,0.0,7310.75,35.0,2500.0,2048.0,0.143,0.143,
50%,3.0,13993.0,49.0,2518.0,2544.0,0.143,0.143,
75%,10.0,20673.0,59.0,2992.0,2991.0,0.168,0.168,
max,183.0,30805.0,414.0,3827.0,4715.0,0.1988,0.1988,


In [85]:
data_df.columns

Index(['Image Index', 'Finding Labels', 'Follow-up #', 'Patient ID',
       'Patient Age', 'Patient Gender', 'View Position', 'OriginalImage[Width',
       'Height]', 'OriginalImagePixelSpacing[x', 'y]', 'Unnamed: 11'],
      dtype='object')

In [86]:
data_df['Follow-up #'].value_counts()

Unnamed: 0_level_0,count
Follow-up #,Unnamed: 1_level_1
0,30805
1,13302
2,9189
3,7089
4,5759
...,...
177,1
176,1
174,1
173,1


###Determining Features and Labels

Drop all columns that are not useful and view the new dataframe.

In [87]:
data_df = data_df.drop(columns=['Image Index', 'Follow-up #', 'Patient ID', 'OriginalImage[Width', 'Height]', 'OriginalImagePixelSpacing[x', 'y]', 'View Position', 'Unnamed: 11'])

Patient Age and Patient Gender are features, while Finding Labels is the label.

###Examining Dataframe

Let's examine each column. What is the distribution of values in each column? Are there any weird values?

In [91]:
data_df[['Patient Age', 'Patient Gender', 'Finding Labels']]


Unnamed: 0,Patient Age,Patient Gender,Finding Labels
0,58,M,Cardiomegaly
1,58,M,Cardiomegaly|Emphysema
2,58,M,Cardiomegaly|Effusion
3,81,M,No Finding
4,81,F,Hernia
...,...,...,...
112115,39,M,Mass|Pneumonia
112116,29,M,No Finding
112117,42,F,No Finding
112118,30,F,No Finding


In [92]:
data_df[data_df['Patient Gender'] == 'M']

Unnamed: 0,Finding Labels,Patient Age,Patient Gender
0,Cardiomegaly,58,M
1,Cardiomegaly|Emphysema,58,M
2,Cardiomegaly|Effusion,58,M
3,No Finding,81,M
12,Mass|Nodule,82,M
...,...,...,...
112112,No Finding,32,M
112114,No Finding,39,M
112115,Mass|Pneumonia,39,M
112116,No Finding,29,M


In [93]:
data_df.groupby("Patient Gender", as_index=False)["Patient Age"].mean()

Unnamed: 0,Patient Gender,Patient Age
0,F,46.521074
1,M,47.194411
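
A quick way to look for weird values is to count the categories in a text column and to compare a numeric column against a plausible range. The sketch below uses a small made-up dataframe; the 0 to 100 age range is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical sample, including one implausible age.
demo_df = pd.DataFrame({
    "Patient Age": [58, 81, 414, 29],
    "Patient Gender": ["M", "F", "M", "F"],
})

# Categorical column: count how often each value occurs.
gender_counts = demo_df["Patient Gender"].value_counts()
print(gender_counts)

# Numeric column: look at the extremes.
print(demo_df["Patient Age"].min(), demo_df["Patient Age"].max())

# Rows whose age falls outside the assumed plausible range of 0 to 100.
suspicious = demo_df[~demo_df["Patient Age"].between(0, 100)]
print(suspicious)
```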


###Removing rows with weird values

One feature column in particular has some weird values: Patient Age, whose maximum of 414 appears in the `describe()` output above. Let's delete the rows where this column has such values.

One way to drop weird values is to set the affected cells to np.nan using the NumPy library. NaN means "not a number".

Then we drop the rows with NaN values.

First, let's set the weird values to np.nan.

In [94]:
data_df.loc[data_df['Patient Age'] > 100, 'Patient Age'] = np.nan
data_df.loc[data_df['Patient Age'] > 100, 'Patient Age']

Unnamed: 0,Patient Age


Now, let's drop the rows with the NaN values.

In [95]:
data_df = data_df.dropna(axis=0, how='any')
data_df.loc[data_df['Patient Age'] > 100, 'Patient Age']

Unnamed: 0,Patient Age
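
For comparison, the same rows can be removed in one step by keeping only the rows that pass the check. The sketch below runs both approaches on a made-up dataframe and is only an illustration, not a replacement for the cells above. One visible difference is that the NaN route leaves `Patient Age` as floats (58.0), while the mask route keeps the original integers.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one implausible age.
demo_df = pd.DataFrame({
    "Finding Labels": ["Hernia", "No Finding", "Mass"],
    "Patient Age": [81, 414, 39],
})

# Approach from the text: mark out-of-range ages as NaN, then drop those rows.
via_nan = demo_df.copy()
via_nan["Patient Age"] = via_nan["Patient Age"].astype(float)  # NaN needs a float column
via_nan.loc[via_nan["Patient Age"] > 100, "Patient Age"] = np.nan
via_nan = via_nan.dropna(axis=0, how="any")

# Alternative: keep only the rows whose age passes the check.
via_mask = demo_df[demo_df["Patient Age"] <= 100]

print(via_nan.index.tolist(), via_mask.index.tolist())
print(via_nan["Patient Age"].dtype, via_mask["Patient Age"].dtype)
```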


###Transforming features into proper representation

The other feature, Patient Gender, is stored as text (M or F) and needs a numeric representation that a model can work with.

**Binary Representation**

Now transform that feature into this representation and print the new dataframe.

In [96]:
gen_data = pd.get_dummies(data_df['Patient Gender'], prefix='Gender', dtype=float, drop_first=False)
gen_data

Unnamed: 0,Gender_F,Gender_M
0,0.0,1.0
1,0.0,1.0
2,0.0,1.0
3,0.0,1.0
4,1.0,0.0
...,...,...
112115,0.0,1.0
112116,0.0,1.0
112117,1.0,0.0
112118,1.0,0.0
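
With only two categories, `Gender_F` and `Gender_M` carry the same information, since each is 1 exactly when the other is 0. As a hedged aside, the sketch below shows on made-up data how `drop_first=True` keeps a single binary column instead; whether one or two columns is preferable depends on the model used later.

```python
import pandas as pd

# Hypothetical sample of the gender column.
genders = pd.Series(["M", "M", "F"], name="Patient Gender")

# Two columns, as in the cell above.
both = pd.get_dummies(genders, prefix="Gender", dtype=float)

# drop_first=True drops the first category (F), leaving only Gender_M.
single = pd.get_dummies(genders, prefix="Gender", dtype=float, drop_first=True)

print(both.columns.tolist())
print(single.columns.tolist())
```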


Concatenate the resulting dataframe with the dataframe that you were working with.

In [97]:
concatenated_df = pd.concat([data_df, gen_data], axis=1)
concatenated_df

Unnamed: 0,Finding Labels,Patient Age,Patient Gender,Gender_F,Gender_M
0,Cardiomegaly,58.0,M,0.0,1.0
1,Cardiomegaly|Emphysema,58.0,M,0.0,1.0
2,Cardiomegaly|Effusion,58.0,M,0.0,1.0
3,No Finding,81.0,M,0.0,1.0
4,Hernia,81.0,F,1.0,0.0
...,...,...,...,...,...
112115,Mass|Pneumonia,39.0,M,0.0,1.0
112116,No Finding,29.0,M,0.0,1.0
112117,No Finding,42.0,F,1.0,0.0
112118,No Finding,30.0,F,1.0,0.0


Delete the columns we will no longer use and print out the new dataframe.

In [98]:
condition_df = concatenated_df.drop(columns=['Patient Gender'])
condition_df

Unnamed: 0,Finding Labels,Patient Age,Gender_F,Gender_M
0,Cardiomegaly,58.0,0.0,1.0
1,Cardiomegaly|Emphysema,58.0,0.0,1.0
2,Cardiomegaly|Effusion,58.0,0.0,1.0
3,No Finding,81.0,0.0,1.0
4,Hernia,81.0,1.0,0.0
...,...,...,...,...
112115,Mass|Pneumonia,39.0,0.0,1.0
112116,No Finding,29.0,0.0,1.0
112117,No Finding,42.0,1.0,0.0
112118,No Finding,30.0,1.0,0.0


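###Transforming labels into proper representation

Similarly, the labels must be transformed into a numeric encoding. Each entry in Finding Labels can list several findings separated by `|`, so each finding gets its own 0/1 column.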

In [99]:
combined_df = condition_df['Finding Labels'].str.get_dummies(sep='|')
combined_df

Unnamed: 0,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,No Finding,Nodule,Pleural_Thickening,Pneumonia,Pneumothorax
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0
2,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112115,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
112116,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
112117,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
112118,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
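
To see what `str.get_dummies(sep='|')` does on a small scale, the sketch below encodes three made-up label strings. A row that lists two findings gets a 1 in two columns, which is why this is a multi-label encoding and not a one-hot encoding with a single 1 per row.

```python
import pandas as pd

# Hypothetical label strings in the same format as Finding Labels.
labels = pd.Series(["Cardiomegaly|Emphysema", "No Finding", "Hernia"])

# Split each string on "|" and create one 0/1 column per distinct finding.
encoded = labels.str.get_dummies(sep="|")
print(encoded)

# Number of findings per row.
print(encoded.sum(axis=1).tolist())
```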


Concatenate the resulting dataframe with the dataframe that you were working with. Drop any unnecessary columns and print the dataframe.

In [100]:
data = pd.concat([combined_df, condition_df], axis=1)
data = data.drop(columns=['Finding Labels'])
data

Unnamed: 0,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,No Finding,Nodule,Pleural_Thickening,Pneumonia,Pneumothorax,Patient Age,Gender_F,Gender_M
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,58.0,0.0,1.0
1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,58.0,0.0,1.0
2,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,58.0,0.0,1.0
3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,81.0,0.0,1.0
4,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,81.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112115,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,39.0,0.0,1.0
112116,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,29.0,0.0,1.0
112117,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,42.0,1.0,0.0
112118,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,30.0,1.0,0.0


###Extract the features from dataframe

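Let's extract the features. The dataframe built above already holds the feature columns (Patient Age, Gender_F, Gender_M) alongside the encoded label columns, so we keep it whole.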

In [101]:
features = data
features


Unnamed: 0,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,No Finding,Nodule,Pleural_Thickening,Pneumonia,Pneumothorax,Patient Age,Gender_F,Gender_M
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,58.0,0.0,1.0
1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,58.0,0.0,1.0
2,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,58.0,0.0,1.0
3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,81.0,0.0,1.0
4,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,81.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112115,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,39.0,0.0,1.0
112116,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,29.0,0.0,1.0
112117,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,42.0,1.0,0.0
112118,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,30.0,1.0,0.0
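
If a later step needs the inputs and the targets as separate tables, the combined dataframe can be split by column name. The sketch below assumes the feature columns are `Patient Age`, `Gender_F`, and `Gender_M` and treats every other column as a label; `demo`, `X`, and `y` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Hypothetical two-row stand-in for the combined dataframe.
demo = pd.DataFrame({
    "Cardiomegaly": [1, 0],
    "No Finding": [0, 1],
    "Patient Age": [58.0, 81.0],
    "Gender_F": [0.0, 0.0],
    "Gender_M": [1.0, 1.0],
})

feature_cols = ["Patient Age", "Gender_F", "Gender_M"]

# Inputs: only the feature columns.
X = demo[feature_cols]

# Targets: everything that is not a feature column.
y = demo.drop(columns=feature_cols)

print(X.columns.tolist())
print(y.columns.tolist())
```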


###Save a CSV of features and labels to your Google Drive

In [102]:
features.to_csv('features.csv')

In [None]:
!cp features.csv /content/drive/MyDrive/
from google.colab import files
files.download('features.csv')
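
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below compares the default with `index=False` on a made-up dataframe written to a temporary directory; whether to keep the index is a choice that depends on how the file is used later.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row dataframe standing in for the features.
demo = pd.DataFrame({"Patient Age": [58.0, 81.0], "Gender_M": [1.0, 1.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    demo.to_csv(with_index)                  # default: index is written
    demo.to_csv(without_index, index=False)  # index is left out

    cols_with = pd.read_csv(with_index).columns.tolist()
    cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```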
