# Picking the table with diagnosis information
### Diagnosis information is contained in multiple tables
* The ADMISSIONS.csv table has a brief description of the diagnosis for which the patient was admitted. It is written as free text and does not conform to a systematic ontology such as ICD9 (International Classification of Diseases, 9th revision).
* The DIAGNOSES_ICD.csv table contains the ICD9 codes of all the diagnoses assigned to a particular hospital stay.
* Short and long descriptions of the ICD9 codes are in the D_ICD_DIAGNOSES.csv table.
* Finally, a higher-level description is in the DRGCODES.csv table. The DRG (Diagnosis Related Group) type (DRG_TYPE) together with DRG_CODE identifies a unique group.
    * Note that DRG_CODE alone does not uniquely identify a diagnosis, since different DRG types can use the same code.
    * The diagnosis description also contains important information on comorbid conditions or complications.
    * This table also contains DRG_SEVERITY and DRG_MORTALITY scores, which rate the severity and mortality risk of the given diagnosis on a scale of 0-4. This is an extremely useful metric; however, only the APR (All Patient Refined) DRG_TYPE has a value assigned.
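
The point about DRG_CODE being ambiguous without DRG_TYPE can be checked directly. The sketch below uses a few made-up rows (the type labels, codes and descriptions are placeholders, not values taken from DRGCODES.csv) and counts the distinct descriptions per key.

```python
import pandas as pd

# Made-up rows for illustration only; not real DRGCODES.csv content.
toy = pd.DataFrame({
    'DRG_TYPE': ['HCFA', 'APR', 'HCFA', 'APR'],
    'DRG_CODE': [1, 1, 2, 2],
    'DESCRIPTION': ['description A', 'description B',
                    'description C', 'description D'],
})

# Distinct descriptions per DRG_CODE alone: more than one means the code is ambiguous.
per_code = toy.groupby('DRG_CODE')['DESCRIPTION'].nunique()

# Distinct descriptions per (DRG_TYPE, DRG_CODE) pair.
per_pair = toy.groupby(['DRG_TYPE', 'DRG_CODE'])['DESCRIPTION'].nunique()

print("Ambiguous codes:", int((per_code > 1).sum()))
print("Ambiguous (type, code) pairs:", int((per_pair > 1).sum()))
```

Running the same two `groupby` calls on the real table, before DRG_TYPE and DRG_CODE are dropped, would show how often this happens in practice.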
    
I will engineer features from the frequent words in the description.

In [3]:
import pandas as pd
# load the DRGCODES table
diagnoses_df = pd.read_csv('../../data/raw/DRGCODES.csv')
diagnoses_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125557 entries, 0 to 125556
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   ROW_ID         125557 non-null  int64  
 1   SUBJECT_ID     125557 non-null  int64  
 2   HADM_ID        125557 non-null  int64  
 3   DRG_TYPE       125557 non-null  object 
 4   DRG_CODE       125557 non-null  int64  
 5   DESCRIPTION    125494 non-null  object 
 6   DRG_SEVERITY   66634 non-null   float64
 7   DRG_MORTALITY  66634 non-null   float64
dtypes: float64(2), int64(4), object(2)
memory usage: 7.7+ MB


In [4]:
import time
start_time = time.monotonic()

# get rid of unnecessary columns to free up memory
diagnoses_df.drop(columns=['ROW_ID', 'SUBJECT_ID', 'DRG_TYPE', 'DRG_CODE'], inplace=True)

# Combine multiple descriptions per admission
print("The length of the original dataframe is ", len(diagnoses_df))
desc_combined = diagnoses_df.groupby('HADM_ID')['DESCRIPTION'].apply(lambda x: x.str.cat(sep=', '))

print("After grouping by admission the number of rows is ", len(desc_combined))
print("The number of qualifying admissions without a diagnosis description ", desc_combined.isnull().sum())

# Fill out severity and mortality scores to be the same per group.
# before filling out find out the number of missing values
print("Number of diagnosis severity missing entries is ", diagnoses_df.DRG_SEVERITY.isnull().sum())
print("Number of diagnosis mortality missing entries is ", diagnoses_df.DRG_MORTALITY.isnull().sum())

# use forward fill and backward fill to fill out missing values
diagnoses_df['DRG_SEVERITY'] = diagnoses_df.groupby('HADM_ID')['DRG_SEVERITY'].transform(lambda x: x.ffill())
print("Severity score forward filling complete")
print("Time elapsed in seconds ", time.monotonic()-start_time)
start_time = time.monotonic()

diagnoses_df['DRG_SEVERITY'] = diagnoses_df.groupby('HADM_ID')['DRG_SEVERITY'].transform(lambda x: x.bfill())
print("Severity score backward filling complete")
print("Time elapsed in seconds ", time.monotonic()-start_time)
start_time = time.monotonic()

diagnoses_df['DRG_MORTALITY'] = diagnoses_df.groupby('HADM_ID')['DRG_MORTALITY'].transform(lambda x: x.ffill())
print("Mortality score forward filling complete")
print("Time elapsed in seconds ", time.monotonic()-start_time)
start_time = time.monotonic()

diagnoses_df['DRG_MORTALITY'] = diagnoses_df.groupby('HADM_ID')['DRG_MORTALITY'].transform(lambda x: x.bfill())
print("Mortality score backward filling complete")
print("Time elapsed in seconds ", time.monotonic()-start_time)
start_time = time.monotonic()

print("After tranformation number of diagnosis severity missing entries is ", diagnoses_df.DRG_SEVERITY.isnull().sum())
print("After tranformation number of diagnosis mortality missing entries is ", diagnoses_df.DRG_MORTALITY.isnull().sum())
print("Time elapsed in seconds ", time.monotonic()-start_time)

The length of the original dataframe is  125557
After grouping by admission the number of rows is  58890
The number of qualifying admissions without a diagnosis description  0
Number of diagnosis severity missing entries is  58923
Number of diagnosis mortality missing entries is  58923
Severity score forward filling complete
Time elapsed in seconds  33.18370428399999
Severity score backward filling complete
Time elapsed in seconds  20.87144844999989
Mortality score forward filling complete
Time elapsed in seconds  20.603132451000192
Mortality score backward filling complete
Time elapsed in seconds  20.57246440100016
After tranformation number of diagnosis severity missing entries is  19506
After tranformation number of diagnosis mortality missing entries is  19506
Time elapsed in seconds  0.0025895280000440835
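
Each of the four `transform` calls above takes roughly 20 to 33 seconds because the lambda runs once per admission in Python. A possible alternative, sketched here on made-up data, is the built-in `GroupBy.ffill()` and `GroupBy.bfill()`, which fill within each group without a Python-level loop. The helper name `fill_within_group` is hypothetical, and no timing was measured on the real table.

```python
import numpy as np
import pandas as pd

# Made-up data: admission 1 has one known score, admission 2 has none.
toy = pd.DataFrame({
    'HADM_ID': [1, 1, 1, 2, 2, 3],
    'DRG_SEVERITY': [np.nan, 2.0, np.nan, np.nan, np.nan, 3.0],
})

def fill_within_group(df, key, col):
    """Forward fill, then backward fill, `col` inside each `key` group."""
    filled = df.groupby(key)[col].ffill()
    # Group the partially filled Series by the same key and fill backwards.
    return filled.groupby(df[key]).bfill()

toy['DRG_SEVERITY'] = fill_within_group(toy, 'HADM_ID', 'DRG_SEVERITY')
print(toy)
```

Admissions with no score on any row stay missing, which matches the behaviour of the cell above.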


In [10]:
# The remaining missing values will be filled with the median of the column, but first
# collapse the redundant rows so that there is one entry per admission
sev_combined = diagnoses_df.groupby('HADM_ID')['DRG_SEVERITY'].mean()
mort_combined = diagnoses_df.groupby('HADM_ID')['DRG_MORTALITY'].mean()

print("Number of missing severity scores in the unique admission dataset", sev_combined.isnull().sum())
print("Number of missing mortality scores in the unique admission dataset", mort_combined.isnull().sum())

sev_combined = sev_combined.fillna(sev_combined.median())
mort_combined = mort_combined.fillna(mort_combined.median())

print("After median value fill the number of missing severity scores in the unique admission dataset", sev_combined.isnull().sum())
print("After median value fill the nNumber of missing mortality scores in the unique admission dataset", mort_combined.isnull().sum())

Number of missing severity scores in the unique admission dataset 19495
Number of missing mortality scores in the unique admission dataset 19495
After median value fill the number of missing severity scores in the unique admission dataset 0
After median value fill the nNumber of missing mortality scores in the unique admission dataset 0
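
The mean and median steps can also be applied to both score columns at once. A minimal sketch on made-up data, assuming the scores are already constant within each admission, as they are after the fill step above:

```python
import numpy as np
import pandas as pd

# Made-up data: admission 2 has no score on any of its rows.
toy = pd.DataFrame({
    'HADM_ID': [1, 1, 2, 3, 3],
    'DRG_SEVERITY': [2.0, 2.0, np.nan, 4.0, 4.0],
    'DRG_MORTALITY': [1.0, 1.0, np.nan, 1.0, 1.0],
})

# One row per admission; admissions with no score at all stay NaN.
scores = toy.groupby('HADM_ID')[['DRG_SEVERITY', 'DRG_MORTALITY']].mean()

# Fill each column with its own median, computed over admissions rather than rows.
scores = scores.fillna(scores.median())
print(scores)
```

Taking the median after grouping weights every admission equally, regardless of how many DRG rows it had.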


In [11]:
# Finally merge all the grouped data frames on HADM_ID
desc_combined = desc_combined.reset_index()
sev_combined = sev_combined.reset_index()
mort_combined = mort_combined.reset_index()
desc_combined = pd.merge(desc_combined, sev_combined[['HADM_ID', 'DRG_SEVERITY']], on='HADM_ID', how='left')
desc_combined = pd.merge(desc_combined, mort_combined[['HADM_ID', 'DRG_MORTALITY']], on='HADM_ID', how='left')
desc_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 58890 entries, 0 to 58889
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   HADM_ID        58890 non-null  int64  
 1   DESCRIPTION    58890 non-null  object 
 2   DRG_SEVERITY   58890 non-null  float64
 3   DRG_MORTALITY  58890 non-null  float64
dtypes: float64(2), int64(1), object(1)
memory usage: 2.2+ MB


In [14]:
!pip install nltk

Collecting nltk
  Using cached nltk-3.4.5.zip (1.5 MB)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.4.5-py3-none-any.whl size=1449905 sha256=9530c8da075bd308605e94e09a87ebc153dd0c115e317ae63d0d57fc37721160
  Stored in directory: /Users/zhannahakhverdyan/Library/Caches/pip/wheels/48/8b/7f/473521e0c731c6566d631b281f323842bbda9bd819eb9a3ead
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.4.5


In [15]:
# stem the words in the descriptions
import string
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def parseOutText(all_text):
    """Strip punctuation from a text field, stem every word, and
    return the stems as a single space-separated string.
    Each stem is followed by a space, so a non-empty result ends with one.
    Input of length 0 or 1 returns an empty string.
    """
    words = ""
    if len(all_text) > 1:
        # remove all punctuation characters, then split on whitespace
        text_string = all_text.translate(str.maketrans("", "", string.punctuation))
        for word in text_string.split():
            words += stemmer.stem(word) + ' '

    return words

In [18]:
# get rid of any numbers in diagnosis description
desc_combined['DESCRIPTION'] = desc_combined['DESCRIPTION'].str.replace(r'\d+', '', regex=True)

# process the text in descriptions
desc_combined['DESCRIPTION'] = desc_combined['DESCRIPTION'].apply(parseOutText)
desc_combined['DESCRIPTION'].head()

0                           diabet w cc diabet diabet 
1    peptic ulcer gastriti peptic ulcer gastriti gi...
2                   chronic obstruct pulmonari diseas 
3    major small larg bowel procedur w cc w major g...
4    coronari bypass wo cardiac cath or percutan ca...
Name: DESCRIPTION, dtype: object
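
The tokens `w` and `wo` in the output above are most likely what remains of abbreviations such as `W` and `W/O` once punctuation is stripped. The sketch below reproduces the cleaning steps without the stemmer on two invented descriptions; the input strings are assumptions, not rows from the table, and `lower()` stands in for the lowercasing that the stemmer performs.

```python
import string
import pandas as pd

# Invented examples in the style of DRG descriptions.
desc = pd.Series(['BYPASS W/O CATH, AGE >17', 'PROCEDURE W CC 123'])

# Remove digit runs; regex=True is spelled out because newer pandas
# versions treat the pattern as a literal string by default.
no_digits = desc.str.replace(r'\d+', '', regex=True)

# Remove punctuation the same way parseOutText does, then normalise whitespace.
table = str.maketrans('', '', string.punctuation)
cleaned = no_digits.apply(lambda s: ' '.join(s.translate(table).lower().split()))
print(cleaned.tolist())
```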

In [20]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# English stop words (needs the NLTK corpus: nltk.download('stopwords'));
# a separate name avoids shadowing the imported stopwords module
stop_words = stopwords.words('english')

# text vectorization: go from text to TF-IDF weighted term vectors
# ignore the words appearing in more than 50% of documents and keep the 200 most frequent words
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words=stop_words, max_features=200)
# transform descriptions
description_transformed = vectorizer.fit_transform(desc_combined['DESCRIPTION'])
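
To see what `max_df` and the default tokenizer do to text like this, here is a small sketch on four invented, already-stemmed descriptions. Two effects are worth knowing: terms present in more than half of the documents are discarded, and the default `token_pattern` only keeps tokens of two or more characters, so a lone `w` never becomes a feature. The names `docs`, `toy_vectorizer` and `toy_matrix` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented, already-stemmed descriptions.
docs = [
    'diabet w cc',
    'peptic ulcer w cc',
    'bowel procedur w cc',
    'coronari bypass wo cardiac cath',
]

# Same settings as above, minus the stop word list and the feature cap.
toy_vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)
toy_matrix = toy_vectorizer.fit_transform(docs)

# 'cc' occurs in 3 of 4 documents (75% > 50%), so max_df removes it;
# 'w' is a single character, so the default token pattern never emits it.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_matrix.shape)
```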

In [21]:
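# get_feature_names() was removed in newer scikit-learn; use get_feature_names_out() (scikit-learn >= 1.0)
features = vectorizer.get_feature_names_out()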
final_feature_df = pd.DataFrame(data=description_transformed.toarray(), columns=features)
final_feature_df.head()

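|   | abdomin | abus | acut | age | agent | alcohol | aliv | ami | anomali | anoth | ... | treatment | ulcer | unrel | unspecifi | urinari | valv | vascular | ventil | without | wo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.522269 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.359093 |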


In [22]:
# combine the severity, mortality scores and description features
columns1 = desc_combined.columns.tolist()
columns2 = final_feature_df.columns.tolist()
col_combined = columns1+columns2
desc_combined = pd.concat([desc_combined, final_feature_df], axis=1, ignore_index=True)
desc_combined.columns = col_combined
desc_combined.values.shape

(58890, 204)
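
The column renaming above is needed because `ignore_index=True` combined with `axis=1` replaces the column labels with integers. A sketch on two toy frames (values made up), assuming both share the same row index as they do here, shows that dropping the argument keeps the labels:

```python
import pandas as pd

# Made-up values; both frames use the default 0..n-1 row index.
left = pd.DataFrame({'HADM_ID': [100001, 100003], 'DRG_SEVERITY': [2.0, 3.0]})
right = pd.DataFrame({'diabet': [0.7, 0.0], 'ulcer': [0.0, 0.5]})

# With ignore_index=True the labels along the concatenation axis are reset.
relabelled = pd.concat([left, right], axis=1, ignore_index=True)

# Without it the original column names survive.
kept = pd.concat([left, right], axis=1)

print(list(relabelled.columns))
print(list(kept.columns))
```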

In [23]:
# save the intermediate analysis
desc_combined.to_csv('../../data/intermediate/inter022120')
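
Note that `to_csv` writes the row index as an unnamed first column by default, and that column comes back as `Unnamed: 0` when the file is read again. A sketch using an in-memory buffer and made-up values instead of the real path:

```python
import io
import pandas as pd

# Made-up values standing in for the combined frame.
df = pd.DataFrame({'HADM_ID': [100001, 100003], 'DRG_SEVERITY': [2.0, 3.0]})

# Default: the index is written as a first column with an empty header.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
round_trip_default = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
round_trip_clean = pd.read_csv(without_index)

print(list(round_trip_default.columns))
print(list(round_trip_clean.columns))
```

Passing `index=False` here, or `index_col=0` when reading, avoids carrying the extra column into later steps.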