# NHAMCS Dataset:

In this notebook, we analyse the National Hospital Ambulatory Medical Care Survey (NHAMCS) dataset. It describes emergency departments in various US states.

In [1]:
# The dataset is uploaded on Google Drive so we need to import the drive utility library

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The next step is to import the dataset and inspect it. We will read it from a SAS file into a pandas DataFrame.

In [2]:
# import pandas
import pandas as pd

# read the dataset from the SAS file
NHAMCS = pd.read_sas(filepath_or_buffer = '/content/drive/Shared drives/Vodafone Internship/Dataset/ed2017_sas.sas7bdat')

# inspect the first few records
print(NHAMCS.head())

# look at the dimensions of the dataframe

print(NHAMCS.shape)

   VMONTH  VDAYR  ARRTIME  WAITTIME  ...     CSTRATM  CPSUM       PATWT      EDWT
0     6.0    6.0  b'2056'      72.0  ...  40100000.0    4.0  3723.12641  21.58043
1     6.0    2.0  b'1417'      64.0  ...  40100000.0    4.0  3723.12641       NaN
2     6.0    2.0  b'2303'      -7.0  ...  40100000.0    4.0  3723.12641       NaN
3     6.0    5.0  b'0930'      29.0  ...  40100000.0    4.0  3723.12641       NaN
4     6.0    2.0  b'1332'      20.0  ...  40100000.0    4.0  3723.12641       NaN

[5 rows x 949 columns]
(16709, 949)


The dataframe has about 16.7 thousand examples and each of these has 949 features.
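Note that text columns such as `ARRTIME` come back from `pd.read_sas` as byte strings (`b'2056'`). The sketch below shows one way to decode them, using a small made-up frame instead of the real file; the `toy` frame and the UTF-8 encoding are assumptions for illustration. Passing an `encoding` argument to `pd.read_sas` is an alternative.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the dtypes returned by pd.read_sas:
# numeric columns are floats, text columns are bytes.
toy = pd.DataFrame({
    "VMONTH": [6.0, 6.0],
    "ARRTIME": [b"2056", b"1417"],
})

# Find the object columns that actually hold bytes values
byte_columns = [
    column for column in toy.columns
    if toy[column].dtype == object
    and toy[column].map(lambda value: isinstance(value, bytes)).any()
]

# Decode each of them to regular strings (UTF-8 is an assumption here)
for column in byte_columns:
    toy[column] = toy[column].str.decode("utf-8")

print(toy["ARRTIME"].tolist())
```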

## Data Quality

In this section, we will be doing two things: exploring the NaNs and encoding the categorical features.

### 1- NaNs Exploration
In this subsection of the data quality section, we explore the following:
- how many NaNs are there?
- how can we visualize them?
- what could be the reasons for these NaNs?
- further steps to be taken for data quality remarks.

#### How many NaNs are there?


Count the number of NaNs in each column.

In [1]:
# Count the NaNs per column (this cell must run after the cell that loads NHAMCS)
nans = NHAMCS.isnull().sum()
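As a minimal sketch of the NaN count, the example below uses a small made-up frame (the `toy` values are hypothetical, not taken from the dataset). It turns the per-column counts into a summary table with percentages, which could also serve as the input to a bar chart.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; the real notebook would use NHAMCS instead
toy = pd.DataFrame({
    "WAITTIME": [72.0, 64.0, np.nan, 29.0],
    "EDWT": [21.58, np.nan, np.nan, np.nan],
    "VMONTH": [6.0, 6.0, 6.0, 6.0],
})

# Number of NaNs in each column
nan_counts = toy.isnull().sum()

# Summary table: count and percentage of NaNs, worst columns first
nan_summary = pd.DataFrame({
    "nan_count": nan_counts,
    "nan_percent": nan_counts / len(toy) * 100,
}).sort_values("nan_count", ascending=False)

# Keep only the columns that contain at least one NaN
columns_with_nans = nan_summary[nan_summary["nan_count"] > 0]
print(columns_with_nans)
```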

In [3]:
# Bassant's work goes here

### 2- Encoding the Categorical Features

The second step in this section is to convert the categorical features into numeric codes that can be fed to different ML models.

In [4]:
# import the LabelEncoder Class
from sklearn.preprocessing import LabelEncoder

# creating instance of labelencoder
labelencoder = LabelEncoder()

# Replace each categorical column with a numeric equivalent
# (fit_transform refits the encoder on every call, so the codes are specific to each column)
for column in ['CAUSE1', 'CAUSE2', 'CAUSE3', 'DIAG1', 'DIAG2', 'DIAG3', 'DIAG4']:
  NHAMCS[column] = labelencoder.fit_transform(NHAMCS[column])
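Because a single `LabelEncoder` is refitted for every column, only the mapping of the last column remains available afterwards. A possible variation, sketched here on made-up values, keeps one encoder per column so that the numeric codes can be mapped back to the original categories. The `toy` frame and the `encoders` dictionary are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with hypothetical category codes
toy = pd.DataFrame({
    "DIAG1": ["S01", "J45", "S01", "R07"],
    "DIAG2": ["-9", "I10", "-9", "-9"],
})

# Fit one encoder per column and keep it for later decoding
encoders = {}
for column in ["DIAG1", "DIAG2"]:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

print(toy)

# Map the numeric codes of DIAG1 back to the original categories
print(encoders["DIAG1"].inverse_transform(toy["DIAG1"]))
```

Keeping the fitted encoders is mostly useful when the encoded values need to be interpreted later, for example when reading model outputs.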



## Target Label

Given the dataset, we need to extract the label that we should be able to predict from the rest of the features. At first, we wanted to predict the department to which each patient should be directed. However, we couldn't find any information in the dataset regarding departments.

Instead, we decided to go with the immediacy level of each patient. In the dataset, the feature **"IMMEDR"** represents exactly that. However, a key requirement has to be met: this feature has to be present in most examples. If this is not the case, then predicting it would be very difficult.

In [13]:
# First, we look at the percentage of missing (or blank) values in the 'IMMEDR' column,
# i.e. the rows coded -9, -8, 7 or 0
missingValuesPercentage = NHAMCS['IMMEDR'].isin([-9, -8, 7, 0]).sum() / len(NHAMCS) * 100
print(str(missingValuesPercentage) + "% of the values are missing in the 'IMMEDR' feature")

# Only about 27% is missing, so we will still use it as our target variable
immediacyLevel = NHAMCS['IMMEDR']


26.925608953258724% of the values are missing in the 'IMMEDR' feature
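The same check can be sketched on a small made-up series. The values below are hypothetical, and the list of codes treated as missing simply follows the choice made in the cell above. The sketch also shows how the rows with a usable label could be separated out, which is one possible next step rather than something the notebook does.

```python
import pandas as pd

# Hypothetical IMMEDR values, for illustration only
immedr = pd.Series([1, 3, -9, 4, 0, 2, 7, 3, -8, 5], name="IMMEDR")

# Codes treated as missing, following the cell above
MISSING_CODES = [-9, -8, 0, 7]

# Boolean mask of the rows whose label counts as missing
is_missing = immedr.isin(MISSING_CODES)

# Share of missing labels, as a percentage
missing_percentage = is_missing.mean() * 100
print(f"{missing_percentage:.1f}% of the values are missing")

# Rows with a usable label (an optional filtering step)
labelled = immedr[~is_missing]
print(labelled.value_counts().sort_index())
```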


## Feature Selection

In the previous section, we extracted the target feature. As mentioned earlier, the dataset has almost 950 features, and training a model on all of them might not be the best idea. We therefore have to select a subset of them.

In order to do so, we devised a 2-step process:

### 1- Manual Extraction

Our first step was to manually extract the features that 'make sense'. We inspected the text files describing the dataset and came up with about 150 features that could be useful. These can be found below:

In [6]:
# The patient ID
patID = NHAMCS[['PATCODE']]

# Demographics of the patient
demographics = NHAMCS[['AGE', 'AGER', 'AGEDAYS', 'SEX', 'PATWT']]

# Data related to the ER visit
visit = NHAMCS[['WAITTIME', 'PAINSCALE', 'SEEN72', 'TOTDIAG']]

# The causes recorded for the patient's situation
causes = NHAMCS[['CAUSE1', 'CAUSE2', 'CAUSE3']]

# The proposed diagnoses for the patient as well as how probable each of them is
diagnoses = NHAMCS[['DIAG1', 'DIAG2', 'DIAG3', 'DIAG4']]
diagnosesProbable = NHAMCS[['PRDIAG1', 'PRDIAG2', 'PRDIAG3', 'PRDIAG4']]

# The complaints recorded by the patient in their previous visits
patientComplaintsDetailed = NHAMCS[['RFV1', 'RFV2', 'RFV3', 'RFV4']]
patientComplaintsBroad = NHAMCS[['RFV13D', 'RFV23D', 'RFV33D', 'RFV43D']]

# Data related to the patient's injury (if any)
injuryData = NHAMCS[['INJURY', 'INJPOISAD', 'INJURY72', 'INTENT15', 'INJURY_ENC']]

# The patient's vitals
vitals = NHAMCS[['VITALSD', 'TEMPDF', 'PULSED', 'RESPRD', 'BPSYSD', 'BPDIASD']]

# The patient's disease history
previousDiseases = NHAMCS[['ETOHAB' ,'ALZHD','ASTHMA','CANCER','CEBVD','CKD','COPD','CHF','CAD',
                           'DEPRN','DIABTYP1','DIABTYP2','DIABTYP0','ESRD','HPE','EDHIV','HYPLIPID','HTN',
                           'OBESITY' ,'OSA' ,'OSTPRSIS', 'SUBSTAB', 'NOCHRON','TOTCHRON']]

# Blood test results (if any)
blood = NHAMCS[['ABG','BAC','BMP','BNP','BUNCREAT','CARDENZ','CBC','CMP','BLOODCX',
                'TRTCX','URINECX','WOUNDCX','OTHCX','DDIMER','ELECTROL','GLUCOSE','LACTATE','LFT','PTTINR','OTHERBLD','CARDMON',
                'EKG','HIVTEST','FLUTEST','PREGTEST','TOXSCREN','URINE']]

# Imaging results (if any)
imaging = NHAMCS[['ANYIMAGE','XRAY','CATSCAN','CTCONTRAST','CTAB','CTCHEST','CTHEAD','CTOTHER','CTUNK','MRI','MRICONTRAST','ULTRASND','OTHIMAGE']]

# The patient's medicine history
medications = NHAMCS[['MED1','MED2','MED3','MED4','MED5','MED6','MED7','MED8','MED9','MED10',
                      'MED11','MED12','MED13','MED14','MED15','MED16','MED17','MED18','MED19',
                      'MED20','MED21','MED22','MED23','MED24','MED25','MED26','MED27','MED28','MED29','MED30']]

# Any medicine prescribed in the ER 
ERMedications = NHAMCS[['GPMED1','GPMED2','GPMED3','GPMED4','GPMED5','GPMED6','GPMED7','GPMED8','GPMED9','GPMED10',
                        'GPMED11','GPMED12','GPMED13','GPMED14','GPMED15','GPMED16','GPMED17','GPMED18','GPMED19',
                        'GPMED20','GPMED21','GPMED22','GPMED23','GPMED24','GPMED25','GPMED26','GPMED27','GPMED28','GPMED29','GPMED30']]

manually_selected_features = pd.concat([patID, demographics, visit, causes, diagnoses, diagnosesProbable, patientComplaintsDetailed, patientComplaintsBroad, injuryData, 
                        vitals, previousDiseases, blood, imaging, medications, ERMedications], axis=1)
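The numbered column groups (`MED1` to `MED30`, `GPMED1` to `GPMED30`) can also be generated instead of typed out. The sketch below does this and checks which of the wanted columns are actually present before selecting them. The `toy` frame is a made-up stand-in for `NHAMCS`, and the presence check is an optional safeguard rather than part of the original workflow.

```python
import pandas as pd

# Generate the numbered column names instead of listing them by hand
med_columns = [f"MED{i}" for i in range(1, 31)]
gpmed_columns = [f"GPMED{i}" for i in range(1, 31)]

# Toy stand-in for NHAMCS that deliberately lacks the GPMED columns
toy = pd.DataFrame(0, index=range(3), columns=["AGE", "SEX"] + med_columns)

# Split the wanted columns into those present and those missing
wanted = ["AGE", "SEX"] + med_columns + gpmed_columns
present = [column for column in wanted if column in toy.columns]
missing = [column for column in wanted if column not in toy.columns]

# Select only the columns that exist, and report the rest
selected = toy[present]
print(selected.shape)
print("Missing columns:", missing)
```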

### 2- Feature Selection using scikit-learn

After manually extracting about 150 features, we realized that this was still a large number, so we decided to use the feature_selection module provided by sklearn to select a small number of them.

In [7]:
# Import the SelectKBest Class, as well as the f_classif scoring metric
from sklearn.feature_selection import SelectKBest, f_classif

# create an instance of SelectKBest which will select the best 10 features
selector = SelectKBest(f_classif, k=10)

# Keep only the top 10 features that most affect our label (immediacyLevel)
X_new = selector.fit_transform(manually_selected_features, immediacyLevel)

# Recover the names of the selected features, since X_new does not contain column names
mask = selector.get_support()  # boolean mask, one entry per input feature
feature_names = manually_selected_features.columns
new_features = list(feature_names[mask])  # the names of the K best features

# Rebuild X_new as a DataFrame with the names of the selected columns
X_new = pd.DataFrame(X_new, columns=new_features)

# print the first 5 records of X_new to inspect it
print(X_new.head())


  PAINSCALE SEEN72 TOTDIAG BPSYSD BPDIASD CBC CMP OTHERBLD CARDMON EKG
0         5      2       0     -9      -9   0   0        0       0   0
1        -8      2       1     -9      -9   0   0        0       0   0
2        -9      2       0     -9      -9   0   0        0       0   0
3        -8      2       0     -9      -9   0   0        0       0   0
4        -9      2       1     -9      -9   0   0        0       0   0
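To illustrate how `SelectKBest` ranks features, here is a small self-contained sketch on synthetic data (the feature names and values are made up). It also reads the per-feature F-scores from `selector.scores_`, which can help judge how clear-cut the selection is. Note that in the real dataset the sentinel codes such as -9 are treated as ordinary numbers by `f_classif`, so the ranking above should be read with that in mind.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one feature tracks the label, two are pure noise
rng = np.random.default_rng(0)
y = np.repeat([1, 2, 3], 40)
features = pd.DataFrame({
    "informative": y + rng.normal(scale=0.1, size=y.size),
    "noise_a": rng.normal(size=y.size),
    "noise_b": rng.normal(size=y.size),
})

# Score every feature with the ANOVA F-test and keep the single best one
selector = SelectKBest(f_classif, k=1)
selector.fit(features, y)

# Names of the selected features, via the boolean support mask
selected_names = features.columns[selector.get_support()].tolist()

# F-scores of all features, highest first
scores = pd.Series(selector.scores_, index=features.columns).sort_values(ascending=False)

print(selected_names)
print(scores)
```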
