# Understanding Site Phenotypic Data

This notebook examines the distribution of ADHD diagnosis types at each site included in the dataset. 

The goal is to determine which site has the most variety in its diagnoses. 
The intention is to build a simple model from the data at that site, which can then be expanded to include the other sites.

## Result

The site with the most variety is OHSU.

## Imports

This analysis needs only two imports. 

`pandas` is used for working with dataframes.

`os` is used to build the file paths.

In [1]:
import pandas as pd
import os

### get_base_filepath()

Function that returns the root of the project folder, from which all files can be reached.

In [2]:
def get_base_filepath():
    '''
    Access the filepath for the base folder of the project

    Input: None

    Output: The filepath to the root of the project folder
            (the parent of the current working directory)
    '''
    # Resolve the parent of the current directory without calling os.chdir,
    # so running this cell more than once always returns the same path
    base_folder_filepath = os.path.abspath(os.path.join(os.curdir, os.pardir))
    return base_folder_filepath
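
A `pathlib` variant of the same idea, shown as a sketch rather than as the notebook's own method. The name `get_base_path` is hypothetical. It computes the parent of the working directory without changing that directory, so it returns the same path however many times it is called.

```python
from pathlib import Path


def get_base_path():
    """Return the parent of the current working directory (hypothetical helper)."""
    # Path.cwd() only reads the working directory; nothing is changed
    return Path.cwd().resolve().parent


base = get_base_path()

# The / operator joins path parts with the correct separator for the platform
sites_dir = base / "Data" / "Phenotypic" / "Sites"
print(sites_dir)
```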

## Filepaths

Get the filepaths to each of the phenotypic datasets.

In [3]:
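# Universal filepath to where phenotypic data is located
# os.path.join keeps the path portable; the trailing '' adds a final separator
base_folder_filepath = get_base_filepath()
base_folder_filepath = os.path.join(base_folder_filepath, 'Data', 'Phenotypic', 'Sites', '')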

# Add filename to filepath
KKI_phenotypic_filepath = base_folder_filepath + 'KKI_phenotypic.csv'
NYU_phenotypic_filepath = base_folder_filepath + 'NYU_phenotypic.csv'
OHSU_phenotypic_filepath = base_folder_filepath + 'OHSU_phenotypic.csv'
Peking_1_phenotypic_filepath = base_folder_filepath + 'Peking_1_phenotypic.csv'
Peking_2_phenotypic_filepath = base_folder_filepath + 'Peking_2_phenotypic.csv'
Peking_3_phenotypic_filepath = base_folder_filepath + 'Peking_3_phenotypic.csv'
Pittsburgh_phenotypic_filepath = base_folder_filepath + 'Pittsburgh_phenotypic.csv'
WashU_phenotypic_filepath = base_folder_filepath + 'WashU_phenotypic.csv'
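
Because every filename follows the same `<site>_phenotypic.csv` pattern, the eight assignments can also be generated in a loop. The following is a minimal sketch. `SITES` and `build_filepaths` are hypothetical names, and the relative folder in the example call stands in for the real project path.

```python
import os

# Site names taken from the filenames used in this notebook
SITES = ["KKI", "NYU", "OHSU", "Peking_1", "Peking_2", "Peking_3",
         "Pittsburgh", "WashU"]


def build_filepaths(sites_folder, sites=SITES):
    """Map each site name to the path of its phenotypic CSV file."""
    return {
        site: os.path.join(sites_folder, f"{site}_phenotypic.csv")
        for site in sites
    }


filepaths = build_filepaths(os.path.join("Data", "Phenotypic", "Sites"))
print(filepaths["OHSU"])
```

Keeping the paths in a dictionary means a new site only needs to be added to the list of names.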

## Open files

Read each file into a dataframe so that the data it contains can be used.

In [4]:
KKI_phenotypic = pd.read_csv(KKI_phenotypic_filepath)
NYU_phenotypic = pd.read_csv(NYU_phenotypic_filepath)
OHSU_phenotypic = pd.read_csv(OHSU_phenotypic_filepath)
Peking_1_phenotypic = pd.read_csv(Peking_1_phenotypic_filepath)
Peking_2_phenotypic = pd.read_csv(Peking_2_phenotypic_filepath)
Peking_3_phenotypic = pd.read_csv(Peking_3_phenotypic_filepath)
Pittsburgh_phenotypic = pd.read_csv(Pittsburgh_phenotypic_filepath)
WashU_phenotypic = pd.read_csv(WashU_phenotypic_filepath)
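
The same reads can be done in one pass when the files share a naming pattern. This sketch uses a hypothetical helper, `load_sites`, and demonstrates it on two throwaway CSV files written to a temporary folder. The sites `A` and `B` and their values are made up for illustration only.

```python
import os
import tempfile

import pandas as pd


def load_sites(sites_folder, sites):
    """Read <site>_phenotypic.csv for each site into a dict of dataframes."""
    return {
        site: pd.read_csv(os.path.join(sites_folder, f"{site}_phenotypic.csv"))
        for site in sites
    }


# Demo data only: two tiny files with a DX column
with tempfile.TemporaryDirectory() as folder:
    for site, dx in {"A": [0, 1], "B": [0, 0, 3]}.items():
        path = os.path.join(folder, f"{site}_phenotypic.csv")
        pd.DataFrame({"DX": dx}).to_csv(path, index=False)
    frames = load_sites(folder, ["A", "B"])

print(frames["B"])
```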

## Matrix

The important information in each file is the diagnosis (`DX`). 
The number of patients with each diagnosis code can be obtained with `value_counts()`. 

Combining these counts into one list, with one entry per site, gives a matrix of sites by diagnoses that is easier to draw conclusions from once it is turned into a dataframe.

In [5]:
sites_diagnosis = [KKI_phenotypic['DX'].value_counts(),
                   NYU_phenotypic['DX'].value_counts(),
                   OHSU_phenotypic['DX'].value_counts(),
                   Peking_1_phenotypic['DX'].value_counts(),
                   Peking_2_phenotypic['DX'].value_counts(),
                   Peking_3_phenotypic['DX'].value_counts(),
                   Pittsburgh_phenotypic['DX'].value_counts(),
                   WashU_phenotypic['DX'].value_counts()]
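
To show what `value_counts()` returns, here is a small sketch on made-up data. The `toy` dataframe is not part of the dataset.

```python
import pandas as pd

# Made-up diagnoses for six patients
toy = pd.DataFrame({"DX": [0, 0, 1, 3, 0, 1]})

# One entry per distinct value, labelled by the value itself
counts = toy["DX"].value_counts()
print(counts.sort_index())
```

A diagnosis code that never occurs (here, 2) has no entry in the result at all, which is why some sites end up with missing values once the counts are combined.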

## Dataframe

Turn the matrix from the previous cell into a dataframe with one row per site and one column per diagnosis code, so that the sites can be compared directly.

Fill the null values with 0, since a missing entry means that no patients at that site had that diagnosis.

Sort the columns by diagnosis code. 
This makes the table easier to read and shows how many patients received each diagnosis.

In [6]:
df = pd.DataFrame(sites_diagnosis, index=['KKI', 'NYU', 'OHSU',
                                          'Peking_1', 'Peking_2', 'Peking_3',
                                          'Pittsburgh', 'WashU'])
# Fill null values with 0
df = df.fillna(0)

# Sort columns
df = df.reindex(sorted(df.columns), axis=1)
df

DX,0,1,2,3
KKI,61.0,16.0,1.0,5.0
NYU,99.0,77.0,2.0,44.0
OHSU,42.0,23.0,2.0,12.0
Peking_1,61.0,7.0,0.0,17.0
Peking_2,32.0,15.0,0.0,20.0
Peking_3,23.0,7.0,0.0,12.0
Pittsburgh,89.0,0.0,0.0,0.0
WashU,61.0,0.0,0.0,0.0
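
The counts display as floats (`61.0`) because the missing entries are NaN before they are filled, and NaN forces a float column. The sketch below repeats the construction on two made-up sites and adds an `astype(int)` step. That cast is an addition for illustration, not something the notebook does.

```python
import pandas as pd

# Made-up sites: site_b has no patients with diagnosis 1 or 3
site_a = pd.Series([0, 0, 1, 3]).value_counts()
site_b = pd.Series([0, 0, 0]).value_counts()

# Each Series becomes a row; the columns are the union of the diagnosis codes
table = pd.DataFrame([site_a, site_b], index=["site_a", "site_b"])

# Missing counts become 0, then the columns are cast back to integers and sorted
table = table.fillna(0).astype(int).sort_index(axis=1)
print(table)
```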


## Feature Extraction

Create additional features to better understand the contents of the dataframe

### Total Patients

Not all sites studied the same number of patients. The total per site is needed to compare the distribution of cases across sites rather than relying on raw counts.

In [7]:
df['total_patients'] = df[0] + df[1] + df[2] + df[3]
df

DX,0,1,2,3,total_patients
KKI,61.0,16.0,1.0,5.0,83.0
NYU,99.0,77.0,2.0,44.0,222.0
OHSU,42.0,23.0,2.0,12.0,79.0
Peking_1,61.0,7.0,0.0,17.0,85.0
Peking_2,32.0,15.0,0.0,20.0,67.0
Peking_3,23.0,7.0,0.0,12.0,42.0
Pittsburgh,89.0,0.0,0.0,0.0,89.0
WashU,61.0,0.0,0.0,0.0,61.0
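
The total can also be computed as a row-wise sum, which avoids listing every column in the expression. This is a sketch using three rows copied from the table above. The names `counts` and `dx_columns` are introduced here for illustration.

```python
import pandas as pd

# Counts for three sites, copied from the table above
counts = pd.DataFrame(
    {0: [61, 99, 42], 1: [16, 77, 23], 2: [1, 2, 2], 3: [5, 44, 12]},
    index=["KKI", "NYU", "OHSU"],
)

# axis=1 sums across the diagnosis columns within each row
dx_columns = [0, 1, 2, 3]
counts["total_patients"] = counts[dx_columns].sum(axis=1)
print(counts)
```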


### Diagnosis Percentage

Compute the share of patients at each site that received each diagnosis. 
This shows how the number of patients with a diagnosis relates to the total number of patients at the site. Note that the `percent_` columns hold fractions between 0 and 1, not values out of 100.

In [8]:
df['percent_0'] = df[0] / df['total_patients']
df['percent_1'] = df[1] / df['total_patients']
df['percent_2'] = df[2] / df['total_patients']
df['percent_3'] = df[3] / df['total_patients']
df

DX,0,1,2,3,total_patients,percent_0,percent_1,percent_2,percent_3
KKI,61.0,16.0,1.0,5.0,83.0,0.73494,0.192771,0.012048,0.060241
NYU,99.0,77.0,2.0,44.0,222.0,0.445946,0.346847,0.009009,0.198198
OHSU,42.0,23.0,2.0,12.0,79.0,0.531646,0.291139,0.025316,0.151899
Peking_1,61.0,7.0,0.0,17.0,85.0,0.717647,0.082353,0.0,0.2
Peking_2,32.0,15.0,0.0,20.0,67.0,0.477612,0.223881,0.0,0.298507
Peking_3,23.0,7.0,0.0,12.0,42.0,0.547619,0.166667,0.0,0.285714
Pittsburgh,89.0,0.0,0.0,0.0,89.0,1.0,0.0,0.0,0.0
WashU,61.0,0.0,0.0,0.0,61.0,1.0,0.0,0.0,0.0
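
The four divisions can be written as a single row-wise division with `DataFrame.div`. This is a sketch using two rows copied from the table above. The variable names are introduced here for illustration.

```python
import pandas as pd

# Counts for two sites, copied from the table above
counts = pd.DataFrame(
    {0: [42, 89], 1: [23, 0], 2: [2, 0], 3: [12, 0]},
    index=["OHSU", "Pittsburgh"],
)

totals = counts.sum(axis=1)

# axis=0 matches the totals to the rows, so each row is divided by its own total
shares = counts.div(totals, axis=0)
shares.columns = [f"percent_{dx}" for dx in counts.columns]
print(shares)
```

Each row of `shares` sums to 1, which is a quick check that the division used the right totals.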


## Conclusion

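The site that provides the most variety is OHSU. 
Only KKI, NYU, and OHSU have patients in all four diagnosis categories, and of these OHSU has the largest share of the rarest category (diagnosis 2, about 2.5% of its patients). 
This variety is important for building a model that does not overfit the dataset.

The OHSU data will be used to create a small model that will then be scaled up to include all sites.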
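
The notebook does not define "variety" numerically. One way to make it concrete, offered here as an assumption rather than the author's method, is the share of the least-represented diagnosis at each site: a site scores above zero only if every diagnosis code occurs there. The counts are copied from the table above, and `smallest_share` is a hypothetical helper.

```python
# Diagnosis counts per site for DX 0, 1, 2, 3 (copied from the table above)
site_counts = {
    "KKI": [61, 16, 1, 5],
    "NYU": [99, 77, 2, 44],
    "OHSU": [42, 23, 2, 12],
    "Peking_1": [61, 7, 0, 17],
    "Peking_2": [32, 15, 0, 20],
    "Peking_3": [23, 7, 0, 12],
    "Pittsburgh": [89, 0, 0, 0],
    "WashU": [61, 0, 0, 0],
}


def smallest_share(counts):
    """Fraction of patients in the least-represented diagnosis (assumed metric)."""
    return min(counts) / sum(counts)


variety = {site: smallest_share(counts) for site, counts in site_counts.items()}
most_varied = max(variety, key=variety.get)
print(most_varied)  # OHSU
```

Under this measure OHSU ranks first. A different measure, such as Shannon entropy over the four categories, may rank the sites differently, so the choice of metric should be stated alongside the conclusion.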