# The Indiana Training Program in Public and Population Health Informatics
## EXERCISE 3 - BMI calculation using height and weight

In Exercise 3 we will extract and enhance the body mass index (BMI) information of patients. The exercise has 3 parts:<br>
3.1- Exploring the data for keywords related to height, weight and BMI, subsetting the data<br>
3.2- Cleaning the subset data, transforming (long to wide) and editing it<br>
3.3- Creating BMI values from weight and height information, visualization<br> 

### 3.1 Filtering the data 

*Competencies addressed*  

 1. II.1  Exploratory data analysis
 2. Transformation of raw data to formats more suitable for downstream use cases (I.1.2)
 3. ...


*Learning objectives*  
 At the end of this module the student will be able to do the following:

1. Explore the structure of the data ....
2. Search for words within the data ... using string search methods including RegEx 
3. Subset the data using the criteria based on the words found...

In this exercise we will explore the clinical_vars.csv file and select keywords for weight, height and BMI. We will then subset the data using these keywords. 
<br><s>we will  enhance the BMI information of the patients using the height and weight records. The BMI information is recorded in the clinical_var dataset, however much fewer cases have this information. </s>

Before we start working in R, we need to set the working directory. We will then list the folder contents and check the properties of the clinical_vars.csv file (such as its size) without loading it. This is a practical way to inspect large datasets. 

In [1]:
getwd()  #check current directory
setwd("/N/dc2/projects/T15/Sample") # setting working folder
list.files()
file.info("clinical_vars.csv")

Unnamed: 0,size,isdir,mode,mtime,ctime,atime,uid,gid,uname,grname
clinical_vars.csv,1410053638,False,660,2018-10-19 00:38:30,2018-10-19 00:51:32,2018-10-21 01:48:34,498945,1297,ukirbiyi,T15
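
Besides `file.info`, the first few lines of a large file can be read without loading the rest. The sketch below is only an illustration: it writes a tiny, hypothetical stand-in CSV to a temporary file (the rows are made up, not taken from clinical_vars.csv) and then reads the size and the first two lines with base R.

```r
# Hypothetical stand-in for a large CSV: a tiny file written to a temp location
tmp <- tempfile(fileext = ".csv")
writeLines(c("STUDYID,OBS,OBSVALUE",
             "16,Weight Lbs,144",
             "16,Height(In),65",
             "16,BMI,24"), tmp)

size_bytes  <- file.info(tmp)$size     # size in bytes, taken from the file system
first_lines <- readLines(tmp, n = 2)   # reads only the header and the first record

# Split the header line on commas to get the column names
col_names <- strsplit(first_lines[1], ",", fixed = TRUE)[[1]]

size_bytes
first_lines
col_names
```

With a real file, `readLines("clinical_vars.csv", n = 5)` would follow the same idea, since only the requested lines are read.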


### Loading clinical_vars.csv and checking the contents
This is a relatively large dataset (about 1.4 GB according to the `file.info` output above) and may take several minutes to load using base R. 
We will use the `fread` function from the "data.table" package for faster loading. 

In [2]:
#install.packages("data.table")
library(data.table)

In [3]:
clinical <- fread("clinical_vars.csv", sep = ",", header= TRUE)
head(clinical)
summary(clinical)

STUDYID,OBS,TYPE,OBSVALUE,CODED_CODE,CODE_NAME,DAYS_VIS_INDEX
16,Acetaminophen,Medications,3783.0,,,-392
16,Acetamnphn W/Cod,Medications,12.0,,,-392
16,BP Dias Sitting,Phenotypes,69.0,,,-392
16,BP Sys Sitting,Phenotypes,158.0,,,-392
16,Clinic Site,Other,,3636.0,MED CL,-392
16,Encounter Site,Other,,3636.0,MED CL,-392


    STUDYID            OBS                TYPE              OBSVALUE       
 Min.   :     16   Length:27776723    Length:27776723    Min.   :   -6407  
 1st Qu.: 110019   Class :character   Class :character   1st Qu.:       5  
 Median : 306685   Mode  :character   Mode  :character   Median :      22  
 Mean   : 384745                                         Mean   :     122  
 3rd Qu.: 549386                                         3rd Qu.:      87  
 Max.   :1256051                                         Max.   :42128382  
                                                         NA's   :2329016   
  CODED_CODE         CODE_NAME         DAYS_VIS_INDEX   
 Length:27776723    Length:27776723    Min.   :-7597.0  
 Class :character   Class :character   1st Qu.:  -10.0  
 Mode  :character   Mode  :character   Median :  484.0  
                                       Mean   :  743.7  
                                       3rd Qu.: 1761.0  
                                       Max.   : 76

<b>NOTE:</b>  `fread` loads the OBS variable as character, whereas base R (`read.csv` in R versions before 4.0, such as the 3.3 used here) loads it as a factor by default. This affects the way we interact with the data.<br> 
A factor retains all of its original levels after subsetting, while a character vector has no levels to retain. If the file were loaded with base R, we would need to convert OBS to character and then back to factor to remove the unused levels (in Ex 3.2).<br>
With `fread`, we will simply convert OBS to factor in Ex 3.2. 
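
The behaviour described in this note can be seen on a small vector. This is a minimal sketch with made-up values (the vector `obs_chr` is hypothetical); it shows that a subset factor keeps its unused levels, and two base R ways of dropping them.

```r
# Hypothetical observation names, stored as character and as factor
obs_chr <- c("BMI", "Weight Lbs", "BMI", "Height(In)")
obs_fct <- factor(obs_chr)

# Subsetting the factor keeps all three original levels
sub_fct <- obs_fct[obs_fct == "BMI"]
levels(sub_fct)
table(sub_fct)    # the unused levels are listed with a count of 0

# Two ways to remove the unused levels
via_character  <- factor(as.character(sub_fct))  # character and back to factor
via_droplevels <- droplevels(sub_fct)            # base R helper for the same task
levels(via_character)
levels(via_droplevels)
```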

### Exploring OBS variable levels 
This is where the names of all observations (and tests) are recorded. The corresponding OBSVALUE field contains the recorded value.

Let's create the list of all the observation (and test) names. 

In [4]:
obs_data<- as.data.frame(table(clinical$OBS))
dim(obs_data)
head(obs_data) #looking at the first 6 rows of the OBS table

Var1,Freq
# Cells Counted in Diff,19017
17 OH Progest SerPl Qn,100
A 1 Antitrypsin SerPl Qn,910
A1 Glob Ser Qn Elp,5243
A2 Glob Ser Qn Elp,5267
ABDOMINAL PAIN PAST MONTH,15


There are 2134 names in the list. We need to find the "weight", "height" and "BMI" observations within this table, so let's search for these words in the OBS levels. Because "weight" and "height" differ only in their first letter, we can search for "eight" to find them both. 
To do the search, we need the `str_detect` function from the stringr package. 

In [5]:
install.packages("stringr")
library(stringr)  

Installing package into '/gpfs/home/u/k/ukirbiyi/Carbonate/R/x86_64-pc-linux-gnu-library/3.3'
(as 'lib' is unspecified)


In [6]:
# Creating the name_list by appending the two query results
# the option ignore_case is used to disregard upper & lower case versions of letters. 
eight <- obs_data[str_detect(obs_data$Var1, fixed('eight', ignore_case=TRUE)),]
bmi <- obs_data[str_detect(obs_data$Var1, fixed('bmi', ignore_case=TRUE)),]
name_list <- rbind(eight,bmi) #add the rows of the two results 
name_list <- name_list[ order ( -name_list$Freq), ] #order from large to small by freq
name_list

rm(eight, bmi) # this code removes the temporary data created 

Unnamed: 0,Var1,Freq
2099,Weight Lbs,81715
2101,Weight Metric,43833
931,Height(In),35938
268,Body Weight,22690
930,Height Metric,15897
175,BMI,9217
1599,Patient Height Qn,5262
266,Body Height,4829
759,Fundus Height,2783
2100,Weight Loss 6 Months,1246


In [7]:
# The same result can be achieved with one line of code using a combination of 
# Regular Expressions (RegEx) and str_detect
# regex('eight|bmi') translates as "eight" or "bmi" 
name_list <- obs_data[str_detect(obs_data$Var1, regex('eight|bmi', ignore_case=TRUE)),]
name_list <- name_list[ order(-name_list$Freq), ] #order by freq (large to small)
as.data.frame(name_list)

Unnamed: 0,Var1,Freq
2099,Weight Lbs,81715
2101,Weight Metric,43833
931,Height(In),35938
268,Body Weight,22690
930,Height Metric,15897
175,BMI,9217
1599,Patient Height Qn,5262
266,Body Height,4829
759,Fundus Height,2783
2100,Weight Loss 6 Months,1246
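
For readers without stringr, a roughly equivalent search can be sketched in base R with `grepl`. The vector `obs_names` below is a small hand-picked sample for illustration, not the full OBS list. Note that a substring search also picks up names such as "Fundus Height", which is why the results still need to be reviewed by eye.

```r
# Small illustrative sample of observation names
obs_names <- c("Weight Lbs", "Height(In)", "BMI", "Fundus Height",
               "BP Sys Sitting", "Acetaminophen")

# "eight|bmi" matches any name containing "eight" OR "bmi";
# ignore.case = TRUE disregards upper and lower case
hits <- grepl("eight|bmi", obs_names, ignore.case = TRUE)

obs_names[hits]
```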


### Choosing variables of interest 
The query returns 18 OBS levels. The observations (OBS) that have a higher frequency and state their units are the most useful for our purposes, so we choose the following fields: 
* "Weight Lbs", 
* "Weight Metric", 
* "Height(In)", 
* "Height Metric", 
* "BMI"

We will filter (subset) the dataset using these OBS values as criteria. There are different ways to do this; below are 3 different approaches: 

### Method 1- Subsetting the dataset into 5 subsets using the selected OBS values
We then join these subsets with `rbind` to form our final dataset, wt_ht.  

In [8]:
a <- clinical[clinical$OBS == "Weight Lbs",]
b <- clinical[clinical$OBS == "Weight Metric",]
c <- clinical[clinical$OBS == "Height(In)",]
d <- clinical[clinical$OBS == "Height Metric",]
e <- clinical[clinical$OBS == "BMI",]

wt_ht <- rbind(a,b,c,d,e)

head(wt_ht)
length(unique(wt_ht$STUDYID))
rm(a,b,c,d,e) #deleting temporary variables

STUDYID,OBS,TYPE,OBSVALUE,CODED_CODE,CODE_NAME,DAYS_VIS_INDEX
16,Weight Lbs,Phenotypes,144,,,-392
16,Weight Lbs,Phenotypes,148,,,-382
16,Weight Lbs,Phenotypes,145,,,-266
16,Weight Lbs,Phenotypes,152,,,-245
16,Weight Lbs,Phenotypes,144,,,-200
16,Weight Lbs,Phenotypes,145,,,-154
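
As a side note, the five separate comparisons in Method 1 can be collapsed into a single base R step with `%in%`. The sketch below uses a small hypothetical data frame (`toy`, with made-up values) rather than the clinical data. One difference from the `rbind` approach is that the rows keep their original order instead of being grouped by OBS value.

```r
# Hypothetical miniature version of the clinical table
toy <- data.frame(
  STUDYID  = c(16, 16, 16, 21, 21),
  OBS      = c("Weight Lbs", "BP Sys Sitting", "BMI", "Height(In)", "Acetaminophen"),
  OBSVALUE = c(144, 158, 24, 65, 3783),
  stringsAsFactors = FALSE
)

# The OBS values we want to keep
keep <- c("Weight Lbs", "Weight Metric", "Height(In)", "Height Metric", "BMI")

# %in% returns TRUE for every row whose OBS is one of the values in keep
toy_wt_ht <- toy[toy$OBS %in% keep, ]

toy_wt_ht
nrow(toy_wt_ht)
```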


### Method 2- Using Regular Expressions (RegEx) 
The following regular expression (RegEx) does the same filtering as the example above. It uses the `grepl` function together with the `subset` function, and the match is case-insensitive because of `ignore.case = TRUE`. Here is how the RegEx translates:
`^[wh]+eight+.*(ric$|bs$|\\)$)|^bmi$`

  names that start with "w" or "h" followed by "eight", then any number of characters (`.*`), AND <br>
  end with "ric" or "bs" or ")" (note that the escape character "\" has to be written twice in an R string) <br>
OR (`|`) <br>
  names that consist only of "bmi" <br>
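
Before running a pattern like this on the full dataset, it can help to try it on a short list of names. The sketch below applies the pattern to the ten OBS names printed in the name_list output earlier; the variable names `pattern`, `obs_names` and `matched` are only illustrative.

```r
# The pattern from above, written as an R string (hence the doubled backslash)
pattern <- "^[wh]+eight+.*(ric$|bs$|\\)$)|^bmi$"

# The OBS names shown in the name_list output
obs_names <- c("Weight Lbs", "Weight Metric", "Height(In)", "Body Weight",
               "Height Metric", "BMI", "Patient Height Qn", "Body Height",
               "Fundus Height", "Weight Loss 6 Months")

# TRUE where the name matches the pattern, ignoring case
matched <- grepl(pattern, obs_names, ignore.case = TRUE)

obs_names[matched]    # names kept by the pattern
obs_names[!matched]   # names rejected by the pattern
```

Only the five chosen fields should be kept here: names such as "Body Weight" fail because they do not start with "w" or "h", and "Weight Loss 6 Months" fails because it does not end with "ric", "bs" or ")".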

In [9]:
# grepl() returns TRUE for each row whose OBS matches the pattern
wt_ht <- subset(clinical, grepl("^[wh]+eight+.*(ric$|bs$|\\)$)|^bmi$", OBS, ignore.case = TRUE))

head(wt_ht)
length(unique(wt_ht$STUDYID)) #number of unique patients
summary(wt_ht)

STUDYID,OBS,TYPE,OBSVALUE,CODED_CODE,CODE_NAME,DAYS_VIS_INDEX
16,Weight Lbs,Phenotypes,144,,,-392
16,Weight Lbs,Phenotypes,148,,,-382
16,Weight Lbs,Phenotypes,145,,,-266
16,Weight Lbs,Phenotypes,152,,,-245
16,Weight Lbs,Phenotypes,144,,,-200
16,Weight Lbs,Phenotypes,145,,,-154


    STUDYID            OBS                TYPE              OBSVALUE     
 Min.   :     16   Length:186600      Length:186600      Min.   :   0.0  
 1st Qu.:  44423   Class :character   Class :character   1st Qu.:  69.0  
 Median : 115472   Mode  :character   Mode  :character   Median : 136.0  
 Mean   : 236677                                         Mean   : 140.5  
 3rd Qu.: 313805                                         3rd Qu.: 189.0  
 Max.   :1255910                                         Max.   :6327.4  
  CODED_CODE         CODE_NAME         DAYS_VIS_INDEX   
 Length:186600      Length:186600      Min.   :-7532.0  
 Class :character   Class :character   1st Qu.: -485.0  
 Mode  :character   Mode  :character   Median :  300.0  
                                       Mean   :  408.9  
                                       3rd Qu.: 1628.0  
                                       Max.   : 7619.0  

### Method 3- Using the filter function from the dplyr package (quickest)
The code below does the criteria search in a single chained statement. The pipe operator `%>%` passes the result of `filter` on to `select`, which keeps only the columns we will need later (columns 1, 2, 4 and 7: STUDYID, OBS, OBSVALUE and DAYS_VIS_INDEX).

The `filter` and `select` functions and the `%>%` (pipe) operator are provided by dplyr, which is a part of the tidyverse package.

 

In [10]:
# Install the package if you haven't before. 
#install.packages("tidyverse")
library(tidyverse) 


-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
√ ggplot2 3.0.0     √ readr   1.1.1
√ tibble  1.4.2     √ purrr   0.2.5
√ tidyr   0.8.1     √ dplyr   0.7.6
√ ggplot2 3.0.0     √ forcats 0.3.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::between()   masks data.table::between()
x dplyr::filter()    masks stats::filter()
x dplyr::first()     masks data.table::first()
x dplyr::lag()       masks stats::lag()
x dplyr::last()      masks data.table::last()
x purrr::transpose() masks data.table::transpose()


In [11]:
#Piping operation 
wt_ht <- clinical %>% 
  filter(OBS %in% c("Weight Lbs", "Weight Metric", "Height(In)", "Height Metric", "BMI")) %>%
  select(c(1,2,4,7))
head(wt_ht)
length(unique(wt_ht$STUDYID))

STUDYID,OBS,OBSVALUE,DAYS_VIS_INDEX
16,Weight Lbs,144,-392
16,Weight Lbs,148,-382
16,Weight Lbs,145,-266
16,Weight Lbs,152,-245
16,Weight Lbs,144,-200
16,Weight Lbs,145,-154
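
To make the role of the pipe clearer, the sketch below runs the same filter-then-select step twice on a small hypothetical data frame (`toy`, with made-up values): once with `%>%` and once as nested function calls. It also selects columns by name rather than by position, which is an optional variation that keeps working if the column order changes. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical miniature version of the clinical table
toy <- data.frame(
  STUDYID  = c(16, 16, 21),
  OBS      = c("Weight Lbs", "BP Sys Sitting", "BMI"),
  TYPE     = c("Phenotypes", "Phenotypes", "Phenotypes"),
  OBSVALUE = c(144, 158, 24),
  stringsAsFactors = FALSE
)

# Piped version: read left to right, "take toy, then filter, then select"
piped <- toy %>%
  filter(OBS %in% c("Weight Lbs", "BMI")) %>%
  select(STUDYID, OBS, OBSVALUE)

# Nested version: the same two steps, read from the inside out
nested <- select(filter(toy, OBS %in% c("Weight Lbs", "BMI")),
                 STUDYID, OBS, OBSVALUE)

piped
all.equal(piped, nested)   # checks that both versions give the same result
```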


### Saving the data in RDATA format
We save the subset with `save`. To load the file again later, we use the following code: <br>
`load(file = "wt_ht.RDATA")`

In [12]:
save(wt_ht, file ="wt_ht.RDATA")
file.info("wt_ht.RDATA")

Unnamed: 0,size,isdir,mode,mtime,ctime,atime,uid,gid,uname,grname
wt_ht.RDATA,672195,False,660,2018-10-21 14:13:05,2018-10-21 14:13:05,2018-10-21 01:50:49,498945,1297,ukirbiyi,T15
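
The save and load round trip can be checked on a small object. This is a minimal sketch using a hypothetical data frame (`toy_wt_ht`) and a temporary file, so it does not touch wt_ht.RDATA. Note that `load` restores the object under its original name and returns that name.

```r
# Hypothetical small data frame to save
toy_wt_ht <- data.frame(STUDYID  = c(16, 16),
                        OBS      = c("Weight Lbs", "BMI"),
                        OBSVALUE = c(144, 24),
                        stringsAsFactors = FALSE)

# Save to a temporary location
rdata_path <- file.path(tempdir(), "toy_wt_ht.RDATA")
save(toy_wt_ht, file = rdata_path)

# Keep a copy for comparison, then remove the object from the workspace
original <- toy_wt_ht
rm(toy_wt_ht)

# load() recreates toy_wt_ht and returns the names of the restored objects
loaded_names <- load(rdata_path)

loaded_names
identical(toy_wt_ht, original)
```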


== END of EXERCISE 3.1 ==