# Cleaning Phenotypic Data

This notebook cleans the phenotypic data for all sites. 
Many of the insights from the Exploratory Data Analysis notebook were used when writing it.

The purpose of this notebook is to prepare the dataframe for building a machine learning model. 
The primary issues with the current dataframe are its null values and excess features. 

The result is a .csv file containing the cleaned dataframe. 

## Prepare Notebook

Import packages and files to prepare for the contents of the notebook.

### Imports

These are the packages that will be used to perform the evaluations and actions on the dataframe.

- `os` for accessing files

- `pandas` for using dataframes

- `matplotlib.pyplot` for plotting

- `seaborn` for more customized plots

In [1]:
import os
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

### Functions

There are only two functions in this notebook:

1. get_base_filepath

2. get_null_values

#### get_base_filepath()

Access the filepath for the base folder of the project

**Input:** None

**Output:** The filepath to the root of the folder

In [2]:
def get_base_filepath():
    '''
    Access the filepath for the base folder of the project
    
    Input: None
    
    Output: The filepath to the root of the folder
    '''
    # Move the working directory up one level.
    # Side effect: every call moves up again, so call this only once per session.
    os.chdir('..')

    # The new working directory is the project folder
    base_folder_filepath = os.path.abspath(os.curdir)
    return base_folder_filepath
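
Because `get_base_filepath()` calls `os.chdir('..')`, each call changes the working directory, and a second call would move up another level. The sketch below shows one possible alternative that reads the parent folder without changing the working directory. The name `get_base_filepath_no_chdir` is hypothetical and is not used elsewhere in this notebook. Relative paths used later in the notebook would then resolve against the notebook folder rather than the project folder.

```python
from pathlib import Path


def get_base_filepath_no_chdir():
    """Return the parent of the current working directory without changing it."""
    return str(Path.cwd().parent)


# The working directory is the same before and after the call
cwd_before = Path.cwd()
base = get_base_filepath_no_chdir()
cwd_after = Path.cwd()
```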

#### get_null_values()

Generate a dataframe of the null value count and the minimum value.

**Input:** 

- A list of numeric features to find the null and min values for

- A dataframe to access the features from

**Output:** A dataframe of null value count and minimum value for each feature

In [3]:
def get_null_values(features, df):
    '''
    Generate a dataframe of the null value count and the minimum value
    
    Input:
        - A list of numeric features to find the null and min values for
        - A dataframe to access the features from
        
    Output: A dataframe of null value count and minimum value for each feature
    '''
    null_vals = dict()
    for col in features:
        null_vals[col] = (df[col].isnull().sum(), df[col].min())
        
    df_null_vals = pd.DataFrame(data=null_vals, index=['null_count', 'min_value'])
    return df_null_vals
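
As a quick illustration of what `get_null_values()` returns, the sketch below repeats the function and applies it to a small invented dataframe. The values, including the `-999.0` placeholder, are made up for the example and are not taken from the phenotypic data. The minimum value is reported next to the null count so that placeholder values like this one stand out.

```python
import numpy as np
import pandas as pd


def get_null_values(features, df):
    """Return the null count and the minimum value for each feature."""
    null_vals = {col: (df[col].isnull().sum(), df[col].min()) for col in features}
    return pd.DataFrame(data=null_vals, index=['null_count', 'min_value'])


# Invented data: one true null in Age and one placeholder value in IQ
toy = pd.DataFrame({'Age': [7.5, 9.0, np.nan],
                    'IQ': [-999.0, 110.0, 95.0]})

summary = get_null_values(['Age', 'IQ'], toy)
print(summary)
```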

### Load File

Read the phenotypic file as a dataframe. Set the 'ID' as the index since it is unique for each patient.

In [4]:
# The folder for the project
base_folder_filepath = get_base_filepath()

# Phenotypic data file (os.path.join keeps the path portable across operating systems)
filepath = os.path.join(base_folder_filepath, 'Data', 'Phenotypic', 'allSubs_testSet_phenotypic_dx.csv')

# Dataframe from filepath
df_pheno = pd.read_csv(filepath, index_col='ID')

## Feature Engineering

Adjust existing features and create new ones to improve the dataframe.

### Drop Features

Some features in the current dataframe will not be useful for making predictions. 
The instruments used to measure IQ and ADHD should not act as indicators of ADHD. 
Similarly, the quality of the fMRIs should not determine which diagnosis the patient has.

In [5]:
drop_features = ['Disclaimer', 'IQ Measure', 'ADHD Measure',
                 'QC_Rest_1', 'QC_Rest_2', 'QC_Rest_3', 'QC_Rest_4', 
                 'QC_Anatomical_1', 'QC_Anatomical_2']

# drop() returns a new dataframe, so df_pheno is left unchanged
df_pheno_filtered = df_pheno.drop(columns=drop_features)

### IQ 

Subjects with a null Full4 IQ generally have a Full2 IQ instead. 
These two features can be combined to create a single IQ feature. 
A few subjects have neither a Full4 nor a Full2 IQ; those nulls are filled later in the notebook.

In [6]:
# Where Full4 IQ is null, take the Full2 IQ value (rows are aligned on the ID index)
df_pheno_filtered['Full4 IQ'] = df_pheno_filtered['Full4 IQ'].fillna(df_pheno_filtered['Full2 IQ'])

In [7]:
df_pheno_filtered.loc[df_pheno_filtered['Full4 IQ'].isnull(), 'Full4 IQ']

ID
27012   NaN
21008   NaN
21030   NaN
Name: Full4 IQ, dtype: float64

In [8]:
df_pheno_filtered['IQ'] = df_pheno_filtered['Full4 IQ']
df_pheno_filtered = df_pheno_filtered.drop(['Full4 IQ', 'Full2 IQ'], axis=1)
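
The combination of the two IQ columns can be illustrated on a small invented dataframe. This is only a sketch of the idea, with made-up scores and IDs: one subject has only Full4, one has only Full2, and one has neither.

```python
import numpy as np
import pandas as pd

# Invented scores, indexed by made-up subject IDs
toy = pd.DataFrame({'Full4 IQ': [105.0, np.nan, np.nan],
                    'Full2 IQ': [np.nan, 98.0, np.nan]},
                   index=[1, 2, 3])

# Prefer Full4 IQ and fall back to Full2 IQ where Full4 is null
toy['IQ'] = toy['Full4 IQ'].fillna(toy['Full2 IQ'])
toy = toy.drop(columns=['Full4 IQ', 'Full2 IQ'])
print(toy)
```

The third subject stays null after the merge, which matches the remaining nulls shown in the output above.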

### Handedness

One of the sites measures handedness on a continuous scale, unlike the other sites. 
On that scale, values greater than 0 are right-handed and values less than 0 are left-handed. 
The continuous values are replaced with categorical codes: 1 for right and 0 for left.

In [9]:
df_pheno_filtered.loc[df_pheno_filtered['Handedness'] > 0, 'Handedness'] = 1
df_pheno_filtered.loc[df_pheno_filtered['Handedness'] < 0, 'Handedness'] = 0
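
The thresholding rule can be checked on a few invented handedness scores. In this sketch the numbers are made up: some are already categorical (0 or 1) and some are continuous.

```python
import pandas as pd

# Invented values mixing categorical codes (0, 1) with continuous scores
hand = pd.Series([1.0, 0.0, 0.62, -0.45, 0.0], name='Handedness')

binary = hand.copy()
binary.loc[hand > 0] = 1   # any positive score becomes right-handed
binary.loc[hand < 0] = 0   # any negative score becomes left-handed
print(binary.tolist())
```

A value of exactly 0 matches neither condition and is left unchanged, so rows already coded as 0 keep their code.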

## Extract Brown Site

Remove the data from the Brown site and set it as its own dataframe. 
Save the Brown dataframe as a .csv file.

In [10]:
df_brown = df_pheno_filtered.loc[df_pheno_filtered['Site'] == 2]
df_pheno_filtered = df_pheno_filtered.drop(df_pheno_filtered.loc[df_pheno_filtered['Site'] == 2].index)

df_brown.to_csv('2023.7.10-Brown_phenotypic.csv')
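
The same split can also be written with a single boolean mask and its complement, which makes it easy to confirm that no rows are lost or duplicated. The sketch below uses an invented dataframe and reuses site code 2 from the cell above; the variable names ending in `_toy` are hypothetical.

```python
import pandas as pd

# Invented rows; Site == 2 stands in for the site being extracted
toy = pd.DataFrame({'Site': [1, 2, 2, 5],
                    'Age': [9.1, 10.4, 8.8, 12.0]},
                   index=pd.Index([101, 102, 103, 104], name='ID'))

is_brown = toy['Site'] == 2
df_brown_toy = toy.loc[is_brown]    # rows from the extracted site
df_rest_toy = toy.loc[~is_brown]    # every other row

# The two parts together contain every original row exactly once
print(len(df_brown_toy), len(df_rest_toy), len(toy))
```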

## Null Values

Explore the null values and fill them with reasonable values.

### View null values

Look at the null values and potential false null values for the numeric features in the dataframe.

In [12]:
numeric_cols = ['Gender', 'Age', 'Handedness',
                'Verbal IQ', 'Performance IQ', 'IQ']

df_null_values_train = get_null_values(numeric_cols, df_pheno_filtered)

df_null_values_train.head()

| | Gender | Age | Handedness | Verbal IQ | Performance IQ | IQ |
|---|---|---|---|---|---|---|
| null_count | 0 | 0.0 | 2.0 | 60.0 | 60.0 | 3.0 |
| min_value | 0 | 7.26 | 0.0 | 80.0 | 67.0 | 75.0 |


In [13]:
df_null_values_brown = get_null_values(numeric_cols, df_brown)
df_null_values_brown.head()

| | Gender | Age | Handedness | Verbal IQ | Performance IQ | IQ |
|---|---|---|---|---|---|---|
| null_count | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| min_value | 0 | 8.5 | 0.0 | 89.0 | 81.0 | 85.0 |


### Handedness

There are two null values for handedness. 
Most people are right-handed, including the subjects in this study, so null values will be filled with 1 (right).

In [14]:
df_pheno_filtered['Handedness'] = df_pheno_filtered['Handedness'].fillna(1)
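
Rather than hard-coding the fill value, the most common category could be read from the data. The sketch below shows this on invented values; it is an alternative way to reach the same fill value, not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Invented handedness codes with two nulls (1 = right, 0 = left)
hand = pd.Series([1.0, 1.0, 0.0, np.nan, 1.0, np.nan], name='Handedness')

# mode() ignores nulls and returns the most frequent value(s)
most_common = hand.mode().iloc[0]
filled = hand.fillna(most_common)
print(most_common, filled.tolist())
```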

### Secondary DX

Whether a patient has a secondary diagnosis could also be a useful insight to include as a target.
This 0/1 indicator is created first, because it relies on the original null values.

In [15]:
df_pheno_filtered['Num Secondary DX'] = df_pheno_filtered['Secondary Dx '].notnull().astype(int)

Fill the null values with 'None' since it can be assumed that the patient does not have any secondary diagnosis.

In [16]:
df_pheno_filtered['Secondary Dx '] = df_pheno_filtered['Secondary Dx '].fillna('None')

Map the types of diagnosis to numbers. Any label that is missing from the mapping becomes NaN.

In [17]:
df_pheno_filtered['Secondary Dx '] = df_pheno_filtered['Secondary Dx '].map({'None': 0, 
                                                                             'ODD': 1, 
                                                                             'TS': 2, 
                                                                             'enuresis': 3})
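
The order of these steps matters: the indicator has to be computed from the original nulls, before they are filled. The sketch below walks through the steps on invented labels. The label `'dyslexia'` is a hypothetical value that is absent from the mapping, included only to show how `map` treats unlisted labels.

```python
import numpy as np
import pandas as pd

# Invented labels; 'dyslexia' is a hypothetical label missing from the mapping
dx = pd.Series(['ODD', np.nan, 'TS', 'dyslexia'], name='Secondary Dx ')

# 1 if a secondary diagnosis is recorded, 0 if the original value is null
has_secondary = dx.notnull().astype(int)

# Fill the nulls, then convert labels to numeric codes
codes = dx.fillna('None').map({'None': 0, 'ODD': 1, 'TS': 2, 'enuresis': 3})

print(has_secondary.tolist())
print(codes.tolist())
```

Labels not listed in the dictionary come back as NaN, so it is worth checking the mapped column for new nulls.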

## Mean-based fill values

The IQ features have more null values than the other features. 
For this reason, it is more important to fill the null values with points that are representative of the subject. 

With this in mind, the null values for these features are filled with the average value for other subjects with the same diagnosis. 
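
One way to sketch this kind of fill is with `groupby(...).transform('mean')`, which returns the group mean aligned to every row so it can be passed straight to `fillna`. The dataframe below is invented for illustration, with two diagnosis groups and one null in each.

```python
import numpy as np
import pandas as pd

# Invented scores for two diagnosis groups, one null per group
toy = pd.DataFrame({'DX': [0, 0, 0, 1, 1, 1],
                    'Verbal IQ': [120.0, 110.0, np.nan, 100.0, np.nan, 90.0]})

# Mean of the non-null scores within each DX group, repeated on every row
group_means = toy.groupby('DX')['Verbal IQ'].transform('mean')

# Only the null entries are replaced; observed scores are left untouched
toy['Verbal IQ'] = toy['Verbal IQ'].fillna(group_means)
print(toy)
```

Because the means are computed from the data on each run, there are no hard-coded numbers to keep in sync with the dataframe.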

### Performance IQ

Fill the null Performance IQ values with the average value for each type of diagnosis.

In [22]:
df_pheno_filtered.groupby('DX')['Performance IQ'].mean()

DX
0    108.480769
1    101.060606
2    104.000000
3    101.480000
Name: Performance IQ, dtype: float64

In [23]:
# Fill each null with the mean Performance IQ of the subjects that share its DX.
# Assigning the whole column avoids chained indexing and hard-coded means.
dx_mean = df_pheno_filtered.groupby('DX')['Performance IQ'].transform('mean')
df_pheno_filtered['Performance IQ'] = df_pheno_filtered['Performance IQ'].fillna(dx_mean)

### Verbal IQ

Fill the null Verbal IQ values with the average value for each type of diagnosis.

In [21]:
df_pheno_filtered.groupby('DX')['Verbal IQ'].mean()

DX
0    119.250000
1    109.212121
2    108.000000
3    108.560000
Name: Verbal IQ, dtype: float64

In [24]:
# Fill each null with the mean Verbal IQ of the subjects that share its DX.
# Computing the means here avoids reusing the Performance IQ means by mistake.
dx_mean = df_pheno_filtered.groupby('DX')['Verbal IQ'].transform('mean')
df_pheno_filtered['Verbal IQ'] = df_pheno_filtered['Verbal IQ'].fillna(dx_mean)

### IQ

Fill the null IQ values with the average value for each type of diagnosis.

In [None]:
df_pheno_filtered.groupby('DX')['IQ'].mean()

In [25]:
# Fill each null with the mean IQ of the subjects that share its DX.
# Assigning the whole column avoids chained indexing and hard-coded means.
dx_mean = df_pheno_filtered.groupby('DX')['IQ'].transform('mean')
df_pheno_filtered['IQ'] = df_pheno_filtered['IQ'].fillna(dx_mean)

## Export Dataframe

Save the dataframe as a .csv file and export it to the Phenotypic folder in the project folder for later use.

In [26]:
# `features` is not defined at notebook level, so all columns are exported here
df_pheno_filtered.to_csv(os.path.join(base_folder_filepath, 'Data', 'Phenotypic',
                                      '2023.7.11-Cleaned_Phenotypic_Training_Sites.csv'))
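
Before exporting, it can be useful to confirm that no nulls remain. The sketch below shows one possible check on an invented dataframe. The helper name `columns_with_nulls` is hypothetical and is not part of this notebook.

```python
import numpy as np
import pandas as pd


def columns_with_nulls(df):
    """Return the null count for each column that still contains nulls."""
    counts = df.isnull().sum()
    return counts[counts > 0]


# Invented data with one remaining null in IQ
toy = pd.DataFrame({'Age': [9.0, 10.5],
                    'IQ': [101.0, np.nan]})

remaining = columns_with_nulls(toy)
print(remaining)
```

An empty result would mean every column is fully populated and the dataframe is ready to export.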