# Exploratory Data Analysis of Phenotypic Data

This notebook investigates the phenotypic data from all sites in the ADHD-200 Competition set.

# Imports

Load everything this notebook needs to run: 
the packages, a helper function, and the phenotypic file for all sites.

## Packages

Since this is only an exploratory data analysis, few imports are needed:

- `os` for building file paths

- `pandas` for dataframes

- `numpy` for arrays

- `matplotlib.pyplot` for plotting

- `seaborn` for customizing plots

In [1]:
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

## File

Import the file to perform exploratory data analysis on. 
This file is located in another folder, so it will need to be accessed from the root folder for this project. 

### get_base_filepath()

Access the filepath for the base folder of the project. 
From here, any other asset of the project can be located.

In [2]:
def get_base_filepath():
    '''
    Access the filepath for the base folder of the project
    
    Input: None
    
    Output: The filepath to the root of the folder
    '''
    # Get current directory (the folder this notebook runs from)
    current_folder_filepath = os.path.abspath(os.curdir)

    # Go up a directory level without changing the working directory,
    # so rerunning this cell always returns the same folder
    base_folder_filepath = os.path.dirname(current_folder_filepath)
    return base_folder_filepath

### Access file

Extend the base filepath with the file's location and open the file as a dataframe. 
The index is the 'ID' column, which holds the subject ID for each row.

In [3]:
# The folder for the project
base_folder_filepath = get_base_filepath()

# Phenotypic data file (os.path.join picks the right separator for the operating system)
filepath = os.path.join(base_folder_filepath, 'Data', 'Phenotypic', 'allSubs_testSet_phenotypic_dx.csv')

# Dataframe from filepath
df_pheno = pd.read_csv(filepath, index_col='ID')
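
The same path can also be assembled with `pathlib`, which avoids hard-coded separators. The sketch below is only an illustration: `build_pheno_path` is a hypothetical helper, and it assumes the notebook sits exactly one folder below the project root, as the function above implies.

```python
from pathlib import Path


def build_pheno_path(notebook_dir):
    """Return the phenotypic CSV path, assuming notebook_dir is one level below the project root."""
    project_root = Path(notebook_dir).resolve().parent
    return project_root / 'Data' / 'Phenotypic' / 'allSubs_testSet_phenotypic_dx.csv'


# Hypothetical usage from the notebook's own folder
pheno_path = build_pheno_path(Path.cwd())
print(pheno_path.name)
```

A `Path` object can be passed straight to `pd.read_csv`, and `pheno_path.exists()` gives a quick check before loading.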

# Exploratory Data Analysis

Look at the dataframe and draw conclusions from the insights.

View basic properties of the unchanged dataframe.

In [4]:
df_pheno.shape

(197, 23)

This dataframe includes 197 subjects, each with 23 features.

In [5]:
df_pheno.head()

| ID | Disclaimer | Site | Gender | Age | Handedness | DX | Secondary Dx | ADHD Measure | ADHD Index | Inattentive | ... | Verbal IQ | Performance IQ | Full2 IQ | Full4 IQ | QC_Rest_1 | QC_Rest_2 | QC_Rest_3 | QC_Rest_4 | QC_Anatomical_1 | QC_Anatomical_2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1038415 |  | 1 | 1 | 14.92 | 1 | 3 | ODD | 1 | 52 | 34 | ... | 109.0 | 103.0 | -999.0 | 107.0 | 1 |  |  |  | 1 |  |
| 1201251 |  | 1 | 1 | 12.33 | 1 | 3 |  | 1 | 49 | 28 | ... | 115.0 | 103.0 | -999.0 | 110.0 | 1 |  |  |  | 1 |  |
| 1245758 |  | 1 | 0 | 8.58 | 1 | 0 |  | 1 | 35 | 20 | ... | 121.0 | 88.0 | -999.0 | 106.0 | 1 |  |  |  | 1 |  |
| 1253411 |  | 1 | 1 | 8.08 | 1 | 0 |  | 1 | 35 | 19 | ... | 119.0 | 106.0 | -999.0 | 114.0 | 1 |  |  |  | 1 |  |
| 1419103 |  | 1 | 0 | 9.92 | 1 | 0 |  | 1 | 41 | 22 | ... | 124.0 | 76.0 | -999.0 | 102.0 | 1 |  |  |  | 1 |  |


List the 23 columns.

In [6]:
df_pheno.columns

Index(['Disclaimer', 'Site', 'Gender', 'Age', 'Handedness', 'DX',
       'Secondary Dx ', 'ADHD Measure', 'ADHD Index', 'Inattentive',
       'Hyper/Impulsive', 'Med Status', 'IQ Measure', 'Verbal IQ',
       'Performance IQ', 'Full2 IQ', 'Full4 IQ', 'QC_Rest_1', 'QC_Rest_2',
       'QC_Rest_3', 'QC_Rest_4', 'QC_Anatomical_1', 'QC_Anatomical_2'],
      dtype='object')
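
Note that the listing shows `'Secondary Dx '` with a trailing space, so selecting `'Secondary Dx'` without it would raise a `KeyError`. A quick way to spot such names is sketched below on a hypothetical three-column frame rather than the real file.

```python
import pandas as pd

# Hypothetical miniature frame; the trailing space mirrors the column listing above
df_demo = pd.DataFrame(columns=['DX', 'Secondary Dx ', 'ADHD Measure'])

# Column names that differ from their whitespace-stripped form
padded_columns = [col for col in df_demo.columns if col != col.strip()]
print(padded_columns)  # → ['Secondary Dx ']
```

The rest of this notebook keeps the original name, trailing space included, so the column is left unrenamed here.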

## Filter Dataframe

Not all columns will be useful for making predictions. 

### Unnecessary

These are the features that are least likely to be useful when training a machine learning model. 
A diagnosis should not depend on the quality of the scan.

In [7]:
drop_features = ['Disclaimer',
                 'QC_Rest_1', 'QC_Rest_2', 'QC_Rest_3', 'QC_Rest_4', 
                 'QC_Anatomical_1', 'QC_Anatomical_2']

# drop() returns a new dataframe, so df_pheno itself is left unchanged
df_pheno_filtered = df_pheno.drop(columns=drop_features)

In [8]:
df_pheno_filtered.head()

| ID | Site | Gender | Age | Handedness | DX | Secondary Dx | ADHD Measure | ADHD Index | Inattentive | Hyper/Impulsive | Med Status | IQ Measure | Verbal IQ | Performance IQ | Full2 IQ | Full4 IQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1038415 | 1 | 1 | 14.92 | 1 | 3 | ODD | 1 | 52 | 34 | 18 | 1 | 3.0 | 109.0 | 103.0 | -999.0 | 107.0 |
| 1201251 | 1 | 1 | 12.33 | 1 | 3 |  | 1 | 49 | 28 | 21 | 2 | 3.0 | 115.0 | 103.0 | -999.0 | 110.0 |
| 1245758 | 1 | 0 | 8.58 | 1 | 0 |  | 1 | 35 | 20 | 15 | 1 | 3.0 | 121.0 | 88.0 | -999.0 | 106.0 |
| 1253411 | 1 | 1 | 8.08 | 1 | 0 |  | 1 | 35 | 19 | 16 | 1 | 3.0 | 119.0 | 106.0 | -999.0 | 114.0 |
| 1419103 | 1 | 0 | 9.92 | 1 | 0 |  | 1 | 41 | 22 | 19 | 1 | 3.0 | 124.0 | 76.0 | -999.0 | 102.0 |


With this change, the number of columns has been reduced to 16. 
This will also simplify training, since the model has fewer features to look at.

In [9]:
df_pheno_filtered.shape

(197, 16)

### Holdout

Some of the data was used as a holdout for testing during the competition. 
This data is from the Brown site (Site 2) and has 'pending' for all data directly related to the diagnosis.

In [10]:
df_brown = df_pheno_filtered.loc[df_pheno_filtered['Site'] == 2]
df_pheno_filtered = df_pheno_filtered.drop(df_pheno_filtered.loc[df_pheno_filtered['Site'] == 2].index)
df_brown

| ID | Site | Gender | Age | Handedness | DX | Secondary Dx | ADHD Measure | ADHD Index | Inattentive | Hyper/Impulsive | Med Status | IQ Measure | Verbal IQ | Performance IQ | Full2 IQ | Full4 IQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 26001 | 2 | 1 | 16.92 | 1 | pending | pending | pending | pending | pending | pending | pending |  | 133.0 | 104.0 |  | 120.0 |
| 26002 | 2 | 1 | 15.68 | 1 | pending | pending | pending | pending | pending | pending | pending |  | 106.0 | 106.0 |  | 107.0 |
| 26004 | 2 | 0 | 14.99 | 1 | pending | pending | pending | pending | pending | pending | pending |  | 119.0 | 123.0 |  | 125.0 |
| 26005 | 2 | 0 | 15.16 | 1 | pending | pending | pending | pending | pending | pending | pending |  | 116.0 | 131.0 |  | 126.0 |
| 26009 | 2 | 1 | 16.91 | 0 | pending | pending | pending | pending | pending | pending | pending |  | 113.0 | 81.0 |  | 97.0 |
| 26014 | 2 | 0 | 16.21 | 1 | pending | pending | pending | pending | pending | pending | pending |  | 101.0 | 102.0 |  | 102.0 |
| 26015 | 2 | 0 | 15.2 | 1 | pending | pending | pending | pending | pending | pending | pending |  | 127.0 | 98.0 |  | 113.0 |
| 26016 | 2 | 1 | 16.07 | 1 | pending | pending | pending | pending | pending | pending | pending |  | 120.0 | 96.0 |  | 109.0 |
| 26017 | 2 | 0 | 14.56 | 1 | pending | pending | pending | pending | pending | pending | pending |  | 95.0 | 87.0 |  | 89.0 |
| 26022 | 2 | 1 | 17.83 | 1 | pending | pending | pending | pending | pending | pending | pending |  | 105.0 | 111.0 |  | 109.0 |


There are 26 subjects at the Brown site.

In [11]:
df_brown.shape

(26, 16)

When the Brown subjects are removed from the dataframe, there are 171 subjects remaining.

In [12]:
df_pheno_filtered.shape

(171, 16)
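
The holdout split can also be written with a single boolean mask, which makes it easy to confirm that no subject is lost or duplicated (here 26 + 171 = 197). The sketch below uses a hypothetical five-row frame in place of the real data; `is_holdout`, `df_holdout`, and `df_rest` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature frame; Site 2 stands in for the Brown holdout
df_demo = pd.DataFrame(
    {'Site': [1, 2, 4, 2, 1], 'Age': [14.92, 16.92, 20.90, 15.68, 12.33]},
    index=pd.Index([1038415, 26001, 27000, 26002, 1201251], name='ID'),
)

# One mask drives both halves of the split
is_holdout = df_demo['Site'] == 2
df_holdout = df_demo[is_holdout].copy()
df_rest = df_demo[~is_holdout].copy()

# Every subject lands in exactly one of the two frames
assert len(df_holdout) + len(df_rest) == len(df_demo)
assert df_holdout.index.intersection(df_rest.index).empty
print(df_holdout.shape, df_rest.shape)  # → (2, 2) (3, 2)
```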

### Targets

These columns directly relate to the diagnosis. 
At the Brown site, these columns are withheld and replaced with 'pending'.

In [13]:
targets_features = ['DX', 'Secondary Dx ', 
                    'ADHD Measure', 'ADHD Index', 'Inattentive', 
                    'Hyper/Impulsive', 'Med Status']

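# Copy the selections so that later edits to the targets do not raise SettingWithCopyWarning
df_targets = df_pheno_filtered[targets_features].copy()
df_brown_targets = df_brown[targets_features].copy()
df_pheno_filtered = df_pheno_filtered.drop(targets_features, axis=1)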

In [14]:
df_targets.head()

| ID | DX | Secondary Dx | ADHD Measure | ADHD Index | Inattentive | Hyper/Impulsive | Med Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1038415 | 3 | ODD | 1 | 52 | 34 | 18 | 1 |
| 1201251 | 3 |  | 1 | 49 | 28 | 21 | 2 |
| 1245758 | 0 |  | 1 | 35 | 20 | 15 | 1 |
| 1253411 | 0 |  | 1 | 35 | 19 | 16 | 1 |
| 1419103 | 0 |  | 1 | 41 | 22 | 19 | 1 |


In [15]:
df_pheno_filtered.head()

| ID | Site | Gender | Age | Handedness | IQ Measure | Verbal IQ | Performance IQ | Full2 IQ | Full4 IQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1038415 | 1 | 1 | 14.92 | 1 | 3.0 | 109.0 | 103.0 | -999.0 | 107.0 |
| 1201251 | 1 | 1 | 12.33 | 1 | 3.0 | 115.0 | 103.0 | -999.0 | 110.0 |
| 1245758 | 1 | 0 | 8.58 | 1 | 3.0 | 121.0 | 88.0 | -999.0 | 106.0 |
| 1253411 | 1 | 1 | 8.08 | 1 | 3.0 | 119.0 | 106.0 | -999.0 | 114.0 |
| 1419103 | 1 | 0 | 9.92 | 1 | 3.0 | 124.0 | 76.0 | -999.0 | 102.0 |


## Null values

Look at every column and find the null values, including the placeholder value (-999) that stands in for a null.
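
Before going column by column, both kinds of missing value can be counted in one pass. This is a minimal sketch on a hypothetical four-row frame, assuming -999 is the only placeholder in use; `missing_summary` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame mixing real NaN values with the -999 placeholder
df_demo = pd.DataFrame({
    'Verbal IQ': [109.0, np.nan, 121.0, np.nan],
    'Full2 IQ': [-999.0, 91.0, -999.0, np.nan],
    'Full4 IQ': [107.0, np.nan, 106.0, 114.0],
})

# One row per column: true nulls, -999 placeholders, and their sum
missing_summary = pd.DataFrame({
    'null': df_demo.isnull().sum(),
    'placeholder': (df_demo == -999).sum(),
})
missing_summary['total'] = missing_summary['null'] + missing_summary['placeholder']
print(missing_summary)
```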

### Main Dataframe

Focuses on the main dataframe.

#### Gender

There are no null values for 'Gender' and the minimum value is 0, which is a valid input.

In [16]:
df_pheno_filtered['Gender'].isnull().sum()

0

In [17]:
min(df_pheno_filtered['Gender'])

0

#### Age

There are no null values for 'Age' and the lowest age is 7.26.

In [18]:
df_pheno_filtered['Age'].isnull().sum()

0

In [19]:
min(df_pheno_filtered['Age'])

7.26

#### Handedness

There are 2 null values for 'Handedness' and one input with a letter.

In [20]:
df_pheno_filtered['Handedness'].isnull().sum()

2

In [21]:
df_pheno_filtered[df_pheno_filtered['Handedness'] == 'L']

| ID | Site | Gender | Age | Handedness | IQ Measure | Verbal IQ | Performance IQ | Full2 IQ | Full4 IQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4125514 | 1 | 1 | 9.17 | L | 3.0 | 136.0 | 100.0 | -999.0 | 121.0 |


Replace the 'L' value with its numeric code (0).

In [22]:
df_pheno_filtered.loc[df_pheno_filtered['Handedness'] == 'L', 'Handedness'] = 0
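
Replacing the letter does not change the column's type: a column that once held a string is typically still stored as `object`. The sketch below, on hypothetical values, shows one way to finish the cleanup with `pd.to_numeric`; whether the real column needs this depends on how pandas parsed the file.

```python
import numpy as np
import pandas as pd

# Hypothetical values; a single 'L' makes pandas store the whole column as objects
handedness = pd.Series([1, 'L', 0, np.nan, 1], name='Handedness')

# Same fix as above: map the letter to its numeric code
handedness.loc[handedness == 'L'] = 0

# Convert to a numeric dtype; the missing entry stays NaN
handedness_numeric = pd.to_numeric(handedness)
print(handedness_numeric.dtype)  # → float64
```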

#### IQ Measure

There are no null values for 'IQ Measure' and the minimum value is 1, which is a valid input.

In [23]:
df_pheno_filtered['IQ Measure'].isnull().sum()

0

In [24]:
min(df_pheno_filtered['IQ Measure'])

1.0

#### Verbal IQ

There are 60 null values for 'Verbal IQ' and the minimum IQ is 80, which is a valid input.

In [25]:
df_pheno_filtered['Verbal IQ'].isnull().sum()

60

In [26]:
min(df_pheno_filtered['Verbal IQ'])

80.0
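
One caution about using the built-in `min` on a column with nulls: every comparison against NaN is False, so the result can depend on where the missing values sit. `Series.min()` skips NaN by default and is the safer choice. The values below are hypothetical and only illustrate the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with a missing value in first position
verbal_iq = pd.Series([np.nan, 109.0, 80.0], name='Verbal IQ')

builtin_min = min(verbal_iq)   # NaN is never replaced, because x < NaN is always False
pandas_min = verbal_iq.min()   # skips NaN by default (skipna=True)
print(builtin_min, pandas_min)  # → nan 80.0
```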

#### Performance IQ

There are 60 null values for 'Performance IQ' and the minimum IQ is 67, which is a valid input.

In [27]:
df_pheno_filtered['Performance IQ'].isnull().sum()

60

In [28]:
min(df_pheno_filtered['Performance IQ'])

67.0

The 60 subjects with a null Performance IQ are the same 60 subjects with a null Verbal IQ.

In [43]:
df_pheno_filtered[df_pheno_filtered['Performance IQ'].isnull()]

| ID | Site | Gender | Age | Handedness | IQ Measure | Verbal IQ | Performance IQ | Full2 IQ | Full4 IQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 27000 | 4 | 0 | 20.9 | 1.0 | 5.0 |  |  | 91.0 |  |
| 27003 | 4 | 1 | 17.54 | 0.0 | 5.0 |  |  | 106.0 |  |
| 27004 | 4 | 1 | 14.24 | 0.0 | 5.0 |  |  | 91.0 |  |
| 27005 | 4 | 0 | 18.65 | 0.0 | 5.0 |  |  | 87.0 |  |
| 27007 | 4 | 1 | 16.92 | 1.0 | 5.0 |  |  | 91.0 |  |
| 27008 | 4 | 1 | 25.04 | 1.0 | 5.0 |  |  | 78.0 |  |
| 27010 | 4 | 0 | 16.45 | 1.0 | 5.0 |  |  | 116.0 |  |
| 27011 | 4 | 1 | 20.82 | 1.0 | 5.0 |  |  | 128.0 |  |
| 27012 | 4 | 0 | 14.05 |  | 5.0 |  |  |  |  |
| 27015 | 4 | 0 | 20.34 | 1.0 | 5.0 |  |  | 106.0 |  |
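
Reading the rows by eye only covers the preview. The overlap between the two sets of nulls can be checked directly by comparing the null masks, as in this sketch on a hypothetical four-row frame.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame where both scores are missing for the same rows
df_demo = pd.DataFrame({
    'Verbal IQ': [109.0, np.nan, 121.0, np.nan],
    'Performance IQ': [103.0, np.nan, 88.0, np.nan],
})

verbal_null = df_demo['Verbal IQ'].isnull()
performance_null = df_demo['Performance IQ'].isnull()

# True only if the two columns are null on exactly the same rows
same_rows = (verbal_null == performance_null).all()
print(same_rows)  # → True
```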


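#### Full2 IQ

There are 96 null values for 'Full2 IQ' and the minimum value is -999, a placeholder that marks a missing score rather than a valid IQ.

In [29]: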
df_pheno_filtered['Full2 IQ'].isnull().sum()

96

In [30]:
min(df_pheno_filtered['Full2 IQ'])

-999.0

In [31]:
df_pheno_filtered[df_pheno_filtered['Full2 IQ'] == -999]

| ID | Site | Gender | Age | Handedness | IQ Measure | Verbal IQ | Performance IQ | Full2 IQ | Full4 IQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1038415 | 1 | 1 | 14.92 | 1 | 3.0 | 109.0 | 103.0 | -999.0 | 107.0 |
| 1201251 | 1 | 1 | 12.33 | 1 | 3.0 | 115.0 | 103.0 | -999.0 | 110.0 |
| 1245758 | 1 | 0 | 8.58 | 1 | 3.0 | 121.0 | 88.0 | -999.0 | 106.0 |
| 1253411 | 1 | 1 | 8.08 | 1 | 3.0 | 119.0 | 106.0 | -999.0 | 114.0 |
| 1419103 | 1 | 0 | 9.92 | 1 | 3.0 | 124.0 | 76.0 | -999.0 | 102.0 |
| 1517058 | 1 | 1 | 9.75 | 1 | 3.0 | 141.0 | 138.0 | -999.0 | 144.0 |
| 1581470 | 1 | 1 | 8.83 | 1 | 3.0 | 111.0 | 123.0 | -999.0 | 118.0 |
| 1784368 | 1 | 1 | 8.92 | 1 | 3.0 | 136.0 | 89.0 | -999.0 | 116.0 |
| 1849382 | 1 | 1 | 11.67 | 1 | 3.0 | 140.0 | 114.0 | -999.0 | 131.0 |
| 1854691 | 1 | 0 | 8.83 | 1 | 3.0 | 117.0 | 108.0 | -999.0 | 114.0 |


In [32]:
df_pheno_filtered['Full2 IQ'].isnull().sum() + len(df_pheno_filtered[df_pheno_filtered['Full2 IQ'] == -999])

147
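
The count of 147 adds the true nulls to the -999 placeholders, which leaves 24 of the 171 subjects with a usable Full2 IQ. Converting the placeholder to NaN first lets `isnull()` and `min()` treat both cases the same way. A minimal sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical scores mixing the -999 placeholder with a true null
full2_iq = pd.Series([-999.0, 91.0, np.nan, -999.0, 106.0], name='Full2 IQ')

# Turn the placeholder into a real missing value
full2_clean = full2_iq.replace(-999, np.nan)

missing_count = full2_clean.isnull().sum()
lowest_score = full2_clean.min()
print(missing_count, lowest_score)  # → 3 91.0
```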

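#### Full4 IQ

There are 27 null values for 'Full4 IQ' and the minimum IQ is 77, which is a valid input.

In [33]: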
df_pheno_filtered['Full4 IQ'].isnull().sum()

27

In [34]:
min(df_pheno_filtered['Full4 IQ'])

77.0

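### Targets Dataframe

Focuses on the targets dataframe.

#### Med Status

There are 111 null values for 'Med Status'.

In [35]: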
df_targets['Med Status'].isnull().sum()

111

In [38]:
min(df_targets['Med Status'].astype(int))

ValueError: cannot convert float NaN to integer
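
The cast fails because NaN has no integer representation. One way around it is to convert with `pd.to_numeric`, which keeps NaN as a float, and let `Series.min()` skip the missing entries. The sketch below assumes the column holds string codes, which seems plausible since the Brown rows contained 'pending', but the values shown are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values: string codes with missing entries in between
med_status = pd.Series(['1', np.nan, '2', np.nan, '1'], name='Med Status')

# Anything non-numeric becomes NaN instead of raising an error
med_numeric = pd.to_numeric(med_status, errors='coerce')
print(med_numeric.min())  # → 1.0

# Optional: pandas' nullable integer dtype keeps whole numbers alongside missing values
med_int = med_numeric.astype('Int64')
```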

#### Secondary Dx

Replace the null values for 'Secondary Dx ' with 'none'.

In [40]:
df_targets['Secondary Dx '] = df_targets['Secondary Dx '].fillna('none')

#### DX

There are no null values for 'DX'.

In [41]:
df_targets['DX'].isnull().sum()

0