# 1. Exploratory Data Analysis
The first part of this project will involve exploratory data analysis (EDA) to understand and describe the content and nature of the data.

Note that much of the work performed during your EDA will enable the completion of the final component of this project, which focuses on documenting your algorithm for the FDA. This is described in a later section, but some important things to focus on during your EDA may be:

- The patient demographic data such as gender, age, patient position, etc. (as it is available)
- The x-ray views taken (i.e. view position)
- The number of cases including:
  - number of pneumonia cases,
  - number of non-pneumonia cases
- The distribution of other diseases that are comorbid with pneumonia
- The number of diseases per patient
- Pixel-level assessments of the imaging data for healthy and disease states of interest (e.g. histograms of intensity values), comparing distributions across diseases
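
As an illustration of the pixel-level assessment in the last bullet, here is a minimal sketch of comparing intensity histograms between two images. The arrays are synthetic stand-ins (not real x-rays), and the helper name `intensity_histogram` is hypothetical; in the notebook you would pass in loaded image arrays instead.

```python
import numpy as np

# Synthetic stand-ins for two 8-bit grayscale images (hypothetical data,
# not real x-rays); in practice these would be loaded image arrays.
rng = np.random.default_rng(0)
healthy_img = rng.normal(110, 30, size=(64, 64)).clip(0, 255)
disease_img = rng.normal(140, 30, size=(64, 64)).clip(0, 255)

def intensity_histogram(img, bins=32):
    """Return a normalized histogram of pixel intensities over [0, 255]."""
    counts, edges = np.histogram(img.ravel(), bins=bins, range=(0, 255))
    return counts / counts.sum(), edges

healthy_hist, edges = intensity_histogram(healthy_img)
disease_hist, _ = intensity_histogram(disease_img)

# Normalized histograms share the same bin edges, so they can be compared
# bin by bin or plotted on the same axes.
print('mean intensity (healthy):', healthy_img.mean())
print('mean intensity (disease):', disease_img.mean())
```

Normalizing each histogram to sum to 1 makes images of different sizes comparable.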

In [4]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from glob import glob

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

##Import any other packages you may need here

In [3]:
## Below is some helper code to read all of your full image filepaths into a dataframe for easier manipulation

all_xray_df = pd.read_csv('/data/Data_Entry_2017.csv')
all_xray_df.sample(3)

data_sample = pd.read_csv('sample_labels.csv')
data_sample.sample(3)

FileNotFoundError: [Errno 2] No such file or directory: '/data/Data_Entry_2017.csv'
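
The helper comment above mentions reading full image filepaths into a dataframe, but the cell itself only reads the CSV files. A minimal sketch of that mapping follows. It assumes the images sit in a layout matching `images*/*/*.png` under some root directory and that the dataframe has an `Image Index` column holding file names; both are assumptions, and the function name `add_image_paths` is hypothetical. The demo builds a throwaway directory so it runs without the real data.

```python
import os
import tempfile
from glob import glob

import pandas as pd

def add_image_paths(df, image_root, pattern=os.path.join('images*', '*', '*.png')):
    """Attach a 'path' column by matching file basenames to 'Image Index'.

    The directory pattern and column name are assumptions about the dataset.
    """
    path_by_name = {os.path.basename(p): p
                    for p in glob(os.path.join(image_root, pattern))}
    out = df.copy()
    out['path'] = out['Image Index'].map(path_by_name)  # NaN when no file is found
    return out

# Self-contained demo using a temporary directory with one empty file
with tempfile.TemporaryDirectory() as root:
    img_dir = os.path.join(root, 'images_001', 'images')
    os.makedirs(img_dir)
    open(os.path.join(img_dir, 'a.png'), 'wb').close()

    demo_df = pd.DataFrame({'Image Index': ['a.png', 'missing.png']})
    demo_with_paths = add_image_paths(demo_df, root)
    print(demo_with_paths['path'].notna().sum(), 'of', len(demo_with_paths), 'images found')
```

Rows whose image is missing on disk end up with a null `path`, which makes them easy to count or drop before training.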

EDA is open-ended, and it is up to you to decide how to slice and dice your data. A good starting point is the requirements for the FDA documentation in the final part of this project, which can guide some of the analyses you do.

This EDA should also help inform you of how pneumonia looks in the wild: for example, what other diseases it is commonly found with, how often it is found, and what ages it affects.
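
The questions above (comorbid diseases, how often pneumonia is found) can be sketched on a toy table. The rows below are made up purely for illustration and only mimic the pipe-separated style of the `Finding Labels` column; the counts are per image, not per patient.

```python
import pandas as pd

# Hypothetical toy labels in the same pipe-separated style as 'Finding Labels'
toy_df = pd.DataFrame({
    'Finding Labels': [
        'Pneumonia|Infiltration',
        'No Finding',
        'Pneumonia',
        'Effusion|Infiltration',
        'Pneumonia|Infiltration|Edema',
    ],
})

# One binary column per finding
onehot = toy_df['Finding Labels'].str.get_dummies(sep='|')

# Number of findings per image, ignoring the 'No Finding' placeholder
findings_per_image = onehot.drop(columns=['No Finding']).sum(axis=1)

# How often each other finding co-occurs with pneumonia
pneumonia_rows = onehot[onehot['Pneumonia'] == 1]
comorbid_counts = (pneumonia_rows
                   .drop(columns=['Pneumonia', 'No Finding'])
                   .sum()
                   .sort_values(ascending=False))

# Fraction of images labelled with pneumonia
prevalence = onehot['Pneumonia'].mean()

print(comorbid_counts)
print('Pneumonia prevalence:', prevalence)
```

To get diseases per patient rather than per image, the one-hot columns could be grouped by a patient identifier first, assuming the data provides one.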

Note that this NIH dataset was not specifically acquired for pneumonia. So, while this is a representation of 'pneumonia in the wild,' the prevalence of pneumonia may be different if you were to take only chest x-rays that were acquired in an ER setting with suspicion of pneumonia. 

Also, **describe your findings and how you will set up the model training based on those findings.**
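
One possible way an EDA finding can feed into the training setup: if pneumonia turns out to be rare, the training set could be balanced by downsampling the negative class. This is only a sketch of one option, not a prescribed method. The toy class counts are invented and the helper `balance_classes` is hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced toy data: 10 positive and 90 negative images
toy = pd.DataFrame({'Pneumonia': [1] * 10 + [0] * 90})

def balance_classes(df, label_col='Pneumonia', seed=42):
    """Downsample negatives to match the number of positives.

    Assumes there are at least as many negatives as positives.
    """
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0].sample(n=len(pos), random_state=seed)
    # Concatenate and shuffle so the classes are interleaved
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)

balanced = balance_classes(toy)
print('Balanced size:', len(balanced))
print('Positive fraction:', balanced['Pneumonia'].mean())
```

A validation set would normally keep a class ratio closer to the real-world prevalence, so balancing like this is usually applied to the training split only.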

## EDA

### The patient demographic data such as gender, age, patient position, etc. (as it is available)

In [None]:
# One-hot encode the Finding Labels
from itertools import chain  # needed to flatten the per-image label lists

all_xray_labels = np.unique(list(chain(*all_xray_df['Finding Labels'].map(lambda x: x.split('|')))))
all_xray_labels = [x for x in all_xray_labels if len(x) > 0]

print('All Labels ({}): {}'.format(len(all_xray_labels), all_xray_labels))

# Add one binary column per label (1 if the label appears in 'Finding Labels')
for label in all_xray_labels:
    all_xray_df[label] = all_xray_df['Finding Labels'].map(lambda finding_labels: 1 if label in finding_labels else 0)

# labels distribution
print(all_xray_df[all_xray_labels].sum()/len(all_xray_df))
ax = all_xray_df[all_xray_labels].sum().plot(kind='bar')
ax.set(ylabel = 'Number of Images with Label')

#all_xray_df['Class'] = all_xray_df['Pneumonia'].apply(lambda x: 'Pneumonia' if x==1 else "Non-pneumonia")
#all_xray_df.sample(3)

In [None]:
# drop the Unnamed column
all_xray_df = all_xray_df.drop(columns=['Unnamed: 11'])
all_xray_df.info()

In [None]:
# stats
all_xray_df.describe()

#### Age

In [None]:
# Check the age histogram, then remove invalid ages (anything above 120)
all_xray_df['Patient Age'].plot(kind='hist')
all_xray_df = all_xray_df[all_xray_df['Patient Age']<=120]

In [None]:
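plt.figure(figsize=(10,6))
# Overlay the age distribution of pneumonia cases on that of all images
plt.hist(all_xray_df['Patient Age'], label='All')
plt.hist(all_xray_df[all_xray_df.Pneumonia==1]['Patient Age'], label='Pneumonia')
plt.legend()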

#### Gender

In [None]:
# Gender counts for all images, then for pneumonia cases only (separate figures)
all_xray_df['Patient Gender'].value_counts().plot(kind='bar')
plt.figure()
all_xray_df[all_xray_df.Pneumonia==1]['Patient Gender'].value_counts().plot(kind='bar')
# Gender distribution seems to be pretty equal in the whole population as well as with Infiltration, with a slight preference towards females in the Effusion distribution.

#### Patient position

In [None]:
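# View position counts for all images, then for pneumonia cases only (separate figures)
all_xray_df['View Position'].value_counts().plot(kind='bar')
plt.figure()
all_xray_df[all_xray_df.Pneumonia==1]['View Position'].value_counts().plot(kind='bar')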
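
Raw counts for the full dataset and for the pneumonia subset sit on very different scales, so proportions can be easier to compare. A minimal sketch using a row-normalized cross-tabulation follows. The toy rows are invented for illustration, and the same idea would apply to the `Patient Gender` column.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
toy = pd.DataFrame({
    'View Position': ['PA', 'PA', 'AP', 'AP', 'PA', 'AP'],
    'Pneumonia':     [0,    0,    1,    0,    1,    1],
})

# Each row (Pneumonia = 0 or 1) sums to 1, giving the share of each view position
view_share = pd.crosstab(toy['Pneumonia'], toy['View Position'], normalize='index')
print(view_share)
```

Calling `view_share.plot(kind='bar')` would then draw both groups side by side on a common 0 to 1 scale.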
