# ADHD First Look

This is my initial exploration of the data, so the insights here are preliminary.

## Load dataset

Use `pandas` to load the `.csv` file for analysis and manipulation.

Based on the first few rows, it seems like there are a lot of null values. These may be irrelevant to the analysis, but that can't be determined yet.

In [4]:
import pandas as pd

df = pd.read_csv('allSubs_testSet_phenotypic_dx.csv', index_col='ID')

df.head()

| ID | Disclaimer | Site | Gender | Age | Handedness | DX | Secondary Dx | ADHD Measure | ADHD Index | Inattentive | ... | Verbal IQ | Performance IQ | Full2 IQ | Full4 IQ | QC_Rest_1 | QC_Rest_2 | QC_Rest_3 | QC_Rest_4 | QC_Anatomical_1 | QC_Anatomical_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1038415 | NaN | 1 | 1 | 14.92 | 1 | 3 | ODD | 1 | 52 | 34 | ... | 109.0 | 103.0 | -999.0 | 107.0 | 1 | NaN | NaN | NaN | 1 | NaN |
| 1201251 | NaN | 1 | 1 | 12.33 | 1 | 3 | NaN | 1 | 49 | 28 | ... | 115.0 | 103.0 | -999.0 | 110.0 | 1 | NaN | NaN | NaN | 1 | NaN |
| 1245758 | NaN | 1 | 0 | 8.58 | 1 | 0 | NaN | 1 | 35 | 20 | ... | 121.0 | 88.0 | -999.0 | 106.0 | 1 | NaN | NaN | NaN | 1 | NaN |
| 1253411 | NaN | 1 | 1 | 8.08 | 1 | 0 | NaN | 1 | 35 | 19 | ... | 119.0 | 106.0 | -999.0 | 114.0 | 1 | NaN | NaN | NaN | 1 | NaN |
| 1419103 | NaN | 1 | 0 | 9.92 | 1 | 0 | NaN | 1 | 41 | 22 | ... | 124.0 | 76.0 | -999.0 | 102.0 | 1 | NaN | NaN | NaN | 1 | NaN |
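
The preview shows `-999.0` in every `Full2 IQ` row, which looks like a placeholder rather than a real score. Below is a minimal sketch of converting it to `NaN` so that it is counted with the other nulls. It assumes `-999` is a missing-value code, which the data does not confirm. The frame `toy` and the name `SENTINEL` are illustrative stand-ins, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; values copied from the first rows of the preview.
toy = pd.DataFrame(
    {"Full2 IQ": [-999.0, -999.0, -999.0], "Full4 IQ": [107.0, 110.0, 106.0]},
    index=pd.Index([1038415, 1201251, 1245758], name="ID"),
)

# Assumption: -999 marks a missing score rather than a measured value.
SENTINEL = -999
cleaned = toy.replace(SENTINEL, np.nan)

# Null counts now include the former sentinel entries.
print(cleaned.isnull().sum())
```

If the assumption holds, passing `na_values=[-999]` to `pd.read_csv` would apply the same conversion at load time.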


## Shape

The dataset contains 197 rows and 23 columns (`ID` is used as the index, so it is not counted). This is a fairly small dataset, so it will be important to avoid overfitting.

In [6]:
df.shape

(197, 23)

## Null Values

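This is the count of null values in each column. Two columns, `QC_Rest_4` and `QC_Anatomical_2`, are entirely null (197 of 197), and `Disclaimer`, `QC_Rest_3`, and `Secondary Dx` are each missing at least 160 values.

Most columns are missing a substantial amount of information. Some groups of columns have identical null counts, for example `ADHD Index`, `Inattentive`, and `Hyper/Impulsive` (83 each) and `Verbal IQ` and `Performance IQ` (60 each), which suggests they were recorded together and should be used together.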

In [5]:
df.isnull().sum()

Disclaimer         172
Site                 0
Gender               0
Age                  0
Handedness           2
DX                   0
Secondary Dx       160
ADHD Measure       120
ADHD Index          83
Inattentive         83
Hyper/Impulsive     83
Med Status         111
IQ Measure          26
Verbal IQ           60
Performance IQ      60
Full2 IQ           122
Full4 IQ            27
QC_Rest_1            0
QC_Rest_2          134
QC_Rest_3          163
QC_Rest_4          197
QC_Anatomical_1      0
QC_Anatomical_2    197
dtype: int64
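
Equal null counts do not prove that two columns are missing on the same rows. Here is a minimal sketch of how that could be checked, along with the share of missing values per column. The frame `toy` and the helper `same_null_rows` are hypothetical and only illustrate the idea:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: two columns that are missing together, one that is not.
toy = pd.DataFrame({
    "ADHD Index":  [52, np.nan, 35, np.nan],
    "Inattentive": [34, np.nan, 20, np.nan],
    "Full4 IQ":    [107, 110, np.nan, 114],
})

# Fraction of missing values per column, largest first.
null_share = toy.isnull().mean().sort_values(ascending=False)


def same_null_rows(frame, col_a, col_b):
    """Return True if col_a and col_b are null on exactly the same rows."""
    return bool((frame[col_a].isnull() == frame[col_b].isnull()).all())


print(null_share)
print(same_null_rows(toy, "ADHD Index", "Inattentive"))
print(same_null_rows(toy, "ADHD Index", "Full4 IQ"))
```

Running `same_null_rows(df, 'Verbal IQ', 'Performance IQ')` on the real data would show whether that pair is actually missing together.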

## Investigating features

### `'Site'`

There are 7 sites, with counts ranging from 51 rows at site 1 down to 9 at site 7, so the distribution is fairly uneven. Each site is the location where the evaluation was conducted.

In [9]:
df['Site'].value_counts()

Site
1    51
5    41
6    34
2    26
4    25
3    11
7     9
Name: count, dtype: int64
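
To judge how balanced the sites are, it can help to look at proportions rather than raw counts. This is a small sketch using the counts printed above; the names `site_share` and `imbalance` are illustrative:

```python
import pandas as pd

# Site counts copied from the value_counts() output above.
site_counts = pd.Series({1: 51, 5: 41, 6: 34, 2: 26, 4: 25, 3: 11, 7: 9}, name="count")

# Share of all rows contributed by each site.
site_share = (site_counts / site_counts.sum()).round(3)

# Ratio between the largest and the smallest site.
imbalance = site_counts.max() / site_counts.min()

print(site_share)
print(round(imbalance, 2))
```

On the real frame, `df['Site'].value_counts(normalize=True)` gives the same shares directly.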

### `'ADHD Measure'`

This is the measurement system used to score the ADHD symptoms. Only one system (coded `1`) appears, alongside 26 entries marked `pending`; the remaining 120 rows are null.

In [13]:
df['ADHD Measure'].value_counts()

ADHD Measure
1          51
pending    26
Name: count, dtype: int64
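
By default `value_counts()` drops null entries, which is why the counts above do not add up to 197. A minimal sketch, on toy data standing in for `df['ADHD Measure']`, of including the nulls in the tally:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['ADHD Measure'] with the same kinds of values.
measure = pd.Series(["1", "pending", np.nan, "1", np.nan], name="ADHD Measure")

# dropna=False keeps missing entries as their own row in the tally.
counts = measure.value_counts(dropna=False)

print(counts)
```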

## Features

### Dividing Features

It seems like there are three categories of features in this dataset:

1. Results: The diagnosis (whether the patient has ADHD) and the patient's scores on the ADHD measures

2. Quality Control: The columns starting with `'QC_'`, which likely stands for 'Quality Control'

3. Predicting Factors: The remaining columns, which could be used to predict whether the patient has ADHD

In [21]:
# Diagnosis, symptom scores, and medication status.
# The trailing space in 'Secondary Dx ' is intentional; it presumably matches the CSV header.
result_cols = ['DX', 'Secondary Dx ', 'ADHD Measure', 'ADHD Index',
               'Inattentive', 'Hyper/Impulsive', 'Med Status']

# Quality-control columns.
qc_cols = ['QC_Rest_1', 'QC_Rest_2', 'QC_Rest_3', 'QC_Rest_4',
           'QC_Anatomical_1', 'QC_Anatomical_2']

results = df[result_cols]
qc = df[qc_cols]

# Everything that is neither a result nor a QC column.
features = df.drop(columns=result_cols + qc_cols)
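
Since the quality-control columns share the `'QC_'` prefix, they could also be selected by name pattern instead of being listed by hand. A small sketch on a toy frame that reuses a few of the real column names (the frame itself is illustrative):

```python
import pandas as pd

# Toy stand-in for df with a handful of the real column names.
toy = pd.DataFrame({
    "DX": [3, 0],
    "Age": [14.92, 8.58],
    "QC_Rest_1": [1, 1],
    "QC_Anatomical_1": [1, 1],
})

# Pick every column whose name starts with the 'QC_' prefix.
qc_cols = [col for col in toy.columns if col.startswith("QC_")]

qc_toy = toy[qc_cols]
rest = toy.drop(columns=qc_cols)

print(qc_cols)
print(list(rest.columns))
```

This keeps the split correct if more `QC_` columns appear later, at the cost of relying on the naming convention.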


In [22]:
results.head()

| ID | DX | Secondary Dx | ADHD Measure | ADHD Index | Inattentive | Hyper/Impulsive | Med Status |
|---|---|---|---|---|---|---|---|
| 1038415 | 3 | ODD | 1 | 52 | 34 | 18 | 1 |
| 1201251 | 3 | NaN | 1 | 49 | 28 | 21 | 2 |
| 1245758 | 0 | NaN | 1 | 35 | 20 | 15 | 1 |
| 1253411 | 0 | NaN | 1 | 35 | 19 | 16 | 1 |
| 1419103 | 0 | NaN | 1 | 41 | 22 | 19 | 1 |


In [25]:
qc.head()

| ID | QC_Rest_1 | QC_Rest_2 | QC_Rest_3 | QC_Rest_4 | QC_Anatomical_1 | QC_Anatomical_2 |
|---|---|---|---|---|---|---|
| 1038415 | 1 | NaN | NaN | NaN | 1 | NaN |
| 1201251 | 1 | NaN | NaN | NaN | 1 | NaN |
| 1245758 | 1 | NaN | NaN | NaN | 1 | NaN |
| 1253411 | 1 | NaN | NaN | NaN | 1 | NaN |
| 1419103 | 1 | NaN | NaN | NaN | 1 | NaN |


In [23]:
features.head()

| ID | Disclaimer | Site | Gender | Age | Handedness | IQ Measure | Verbal IQ | Performance IQ | Full2 IQ | Full4 IQ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1038415 | NaN | 1 | 1 | 14.92 | 1 | 3.0 | 109.0 | 103.0 | -999.0 | 107.0 |
| 1201251 | NaN | 1 | 1 | 12.33 | 1 | 3.0 | 115.0 | 103.0 | -999.0 | 110.0 |
| 1245758 | NaN | 1 | 0 | 8.58 | 1 | 3.0 | 121.0 | 88.0 | -999.0 | 106.0 |
| 1253411 | NaN | 1 | 1 | 8.08 | 1 | 3.0 | 119.0 | 106.0 | -999.0 | 114.0 |
| 1419103 | NaN | 1 | 0 | 9.92 | 1 | 3.0 | 124.0 | 76.0 | -999.0 | 102.0 |
