# The Indiana Training Program in Public and Population Health Informatics
## EXERCISE 3  - BMI calculation using height and weight

With exercise 3, we will extract and enhance the body mass index information (BMI) of patients. The exercise is set in 3 parts:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3a. Exploring the data for keywords related to height, weight, and BMI, subsetting the data<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3b. Cleaning the subset data, transforming (long to wide) and editing it<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3c. Creating BMI values from weight and height information, visualization<br> 

*Learning objectives*  
 At the end of these three modules the student will be able to do the following:

1. Explore data using string keywords
2. Parse and manipulate string data using multiple methods, including regular expressions and the `dplyr` package
3. Search for words within the data using string operations, including regular expressions
4. Subset the data using criteria based on text strings
5. Perform basic transformations of the dataset in multiple ways
6. Create derived variables using several vector transformations
7. Create visualizations to describe the data

### 3a. Exploring the data for keywords related to height, weight, and BMI, subsetting the data 
In this exercise, we will explore the clinical_vars.csv file and select keywords for weight, height and BMI. These keywords will then be used to subset the data. 

Before we start working in R, we need to set the working directory. We will then check the folder contents and inspect the properties of the clinical_vars.csv file (such as its size) without loading it. This is a practical first step when working with large datasets. 
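
As a minimal sketch of inspecting a file without loading it, the example below writes a small throwaway CSV (hypothetical values, not the real clinical data), checks its size with `file.info`, and reads only the first two lines as plain text with `readLines`. The `readLines` step is an addition for illustration and is not part of the exercise code.

```r
# Build a small throwaway CSV so the sketch runs anywhere (hypothetical data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("STUDYID,OBS,OBSVALUE",
             "16,Weight Lbs,180",
             "16,Height(In),70",
             "17,BMI,25.8"), tmp)

# File size in bytes, obtained without reading the contents.
size_bytes <- file.info(tmp)$size

# Read only the first two lines (header plus one record) as plain text.
first_lines <- readLines(tmp, n = 2)

print(size_bytes)
print(first_lines)
```

For a file of more than a gigabyte, reading a handful of lines this way is far cheaper than loading the whole table.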

In [1]:
getwd()  # Check the current directory.
setwd("/N/dc2/projects/T15/Sample") # Set the working folder. Change this path to the location where YOUR files are.
list.files()
file.info("clinical_vars.csv")

Unnamed: 0,size,isdir,mode,mtime,ctime,atime,uid,gid,uname,grname
clinical_vars.csv,1410053638,False,660,2018-10-19 00:38:30,2018-10-19 00:51:32,2019-03-27 16:30:22,498945,1297,ukirbiyi,T15


### Load the clinical_vars.csv data file and check the contents
This is a relatively large dataset and may take several minutes to load using base R. 
We will use the `fread` function from the `data.table` package for faster loading. 

In [2]:
# If you haven't done so already, install the data.table package using the code below.
install.packages("data.table")
library(data.table)



In [3]:
clinical <- fread("clinical_vars.csv", sep = ",", header= TRUE)
head(clinical)
summary(clinical)

STUDYID,OBS,TYPE,OBSVALUE,CODED_CODE,CODE_NAME,DAYS_VIS_INDEX
16,Acetaminophen,Medications,3783.0,,,-392
16,Acetamnphn W/Cod,Medications,12.0,,,-392
16,BP Dias Sitting,Phenotypes,69.0,,,-392
16,BP Sys Sitting,Phenotypes,158.0,,,-392
16,Clinic Site,Other,,3636.0,MED CL,-392
16,Encounter Site,Other,,3636.0,MED CL,-392


    STUDYID            OBS                TYPE              OBSVALUE       
 Min.   :     16   Length:27776723    Length:27776723    Min.   :   -6407  
 1st Qu.: 110019   Class :character   Class :character   1st Qu.:       5  
 Median : 306685   Mode  :character   Mode  :character   Median :      22  
 Mean   : 384745                                         Mean   :     122  
 3rd Qu.: 549386                                         3rd Qu.:      87  
 Max.   :1256051                                         Max.   :42128382  
                                                         NA's   :2329016   
  CODED_CODE         CODE_NAME         DAYS_VIS_INDEX   
 Length:27776723    Length:27776723    Min.   :-7597.0  
 Class :character   Class :character   1st Qu.:  -10.0  
 Mode  :character   Mode  :character   Median :  484.0  
                                       Mean   :  743.7  
                                       3rd Qu.: 1761.0  
                                       Max.   : 76

<b>NOTE:</b>  `fread` loads the OBS variable as class character, whereas base R's `read.csv` loads it as a factor by default in R versions before 4.0. This affects the way we interact with the data.<br> 
A character vector has no levels, but a factor keeps all of its original levels after subsetting, even those that no longer occur in the subset. If the file were loaded with base R, we would need to convert OBS to character and then back to factor to remove these unused levels (in Ex 3.2).<br>
With `fread`, we will convert OBS to factor in Ex 3.2. 
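
The difference between the two classes can be seen on a toy vector. This is a sketch with made-up values. `droplevels` is shown as one base R way to discard unused levels, alongside the character round trip described in the note.

```r
# The same four observation names stored as character and as factor.
obs_chr <- c("BMI", "Weight Lbs", "Height(In)", "BMI")
obs_fct <- factor(obs_chr)

# Keep only the BMI entries.
bmi_chr <- obs_chr[obs_chr == "BMI"]
bmi_fct <- obs_fct[obs_fct == "BMI"]

# The character subset only knows about the values that remain.
chr_values <- unique(bmi_chr)

# The factor subset still lists all three original levels.
kept_levels <- levels(bmi_fct)

# Two ways to discard the unused levels.
dropped_levels   <- levels(droplevels(bmi_fct))
roundtrip_levels <- levels(factor(as.character(bmi_fct)))

print(kept_levels)
print(dropped_levels)
```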

### Exploring OBS variable levels 
This is where all the observation and test names are recorded. The corresponding OBSVALUE field contains the recorded value.

Let's create the list of all the observation and test names. 

In [4]:
obs_data<- as.data.frame(table(clinical$OBS))
dim(obs_data)
head(obs_data) # Look at the first 6 rows of the OBS table to get a sense of the data.

Var1,Freq
# Cells Counted in Diff,19017
17 OH Progest SerPl Qn,100
A 1 Antitrypsin SerPl Qn,910
A1 Glob Ser Qn Elp,5243
A2 Glob Ser Qn Elp,5267
ABDOMINAL PAIN PAST MONTH,15
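
To see where the `Var1` and `Freq` columns come from, here is a small sketch on a hypothetical data frame called `toy`. `table` counts each distinct value, `as.data.frame` turns the counts into a two-column table, and `order` sorts by frequency in the same way the exercise does later.

```r
# Hypothetical miniature of the OBS column.
toy <- data.frame(
  OBS = c("BMI", "Weight Lbs", "BMI", "Height(In)", "BMI", "Weight Lbs"),
  stringsAsFactors = FALSE
)

# Count each distinct name; the result has columns Var1 (name) and Freq (count).
obs_counts <- as.data.frame(table(toy$OBS))

# Sort from most to least frequent.
obs_counts <- obs_counts[order(-obs_counts$Freq), ]
print(obs_counts)
```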


There are 2134 names in the list. We need to find the "weight", "height" and "BMI" observations within this table. Let's search for "weight", "height" and "BMI" in the OBS levels. Because "weight" and "height" differ only in their first letter, we can search for "eight" to find them both. 
To do the search, we need the `str_detect` function from the stringr package. 
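
The idea behind the "eight" search can be tried on a few names first. The sketch below uses base R's `grepl` as a stand-in for `str_detect` so that it runs without extra packages. The names are illustrative and "Birth Weight" is a hypothetical entry.

```r
# A few illustrative observation names ("Birth Weight" is hypothetical).
obs_names <- c("Weight Lbs", "HEIGHT METRIC", "BMI", "BP Sys Sitting", "Birth Weight")

# Case-insensitive substring searches.
has_eight <- grepl("eight", obs_names, ignore.case = TRUE)
has_bmi   <- grepl("bmi", obs_names, ignore.case = TRUE)

# Names matching either keyword.
hits <- obs_names[has_eight | has_bmi]
print(hits)
```

A substring search is deliberately broad: it also returns names such as "Birth Weight", which is why the results need to be reviewed before choosing the final variables.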

In [5]:
# If you haven't already done so, install the stringr package using the code below.
install.packages("stringr")
library(stringr)  



In [None]:
# Create the name_list by appending the two query results.
# The option ignore_case is used to disregard upper & lower case versions of letters. 
eight <- obs_data[str_detect(obs_data$Var1, fixed('eight', ignore_case=TRUE)),]
bmi <- obs_data[str_detect(obs_data$Var1, fixed('bmi', ignore_case=TRUE)),]
name_list <- rbind(eight,bmi) # Add the rows of the two results. 
name_list <- name_list[ order ( -name_list$Freq), ] # Order from large to small by frequency.
name_list

rm(eight, bmi) # This code removes the temporary data created. 

In [None]:
# The same result can be achieved with one line of code using a combination of: 
# Regular Expressions (RegEx) and str_detect
# regex('eight|bmi') translates as "eight" or "bmi" 
name_list <- obs_data[str_detect(obs_data$Var1, regex('eight|bmi', ignore_case=TRUE)),]
name_list <- name_list[ order(-name_list$Freq), ] #order by freq (large to small)
as.data.frame(name_list)

### Choosing variables of interest 
The query returns 18 OBS levels. The observations (OBS) with higher frequency and with units are useful for our purposes. So we choose the following fields: 
* "Weight Lbs"
* "Weight Metric"
* "Height(In)"
* "Height Metric"
* "BMI"

We will filter (subset) the dataset using these OBS values as criteria. Below are three different approaches to accomplish this: 

### Method 1- Subset the dataset into 5 parts using the selected OBS values
We create one subset per OBS value and then stack them with `rbind` to form our final dataset, wt_ht.  

In [None]:
a <- clinical[clinical$OBS == "Weight Lbs",]
b <- clinical[clinical$OBS == "Weight Metric",]
c <- clinical[clinical$OBS == "Height(In)",]
d <- clinical[clinical$OBS == "Height Metric",]
e <- clinical[clinical$OBS == "BMI",]

wt_ht <- rbind(a,b,c,d,e)

head(wt_ht)
length(unique(wt_ht$STUDYID))
rm(a,b,c,d,e) # Delete the temporary variables.

### Method 2- Use Regular Expressions (RegEx) 
The following RegEx (regular expression) code does the same filtering as above. It uses the `grepl` function inside the `subset` function, with `ignore.case = TRUE` so that upper and lower case are treated alike. Here is how the RegEx code translates:

`^[wh]+eight+.*(ric$|bs$|\\)$)|^bmi$`

  Names that start (`^`) with "w" or "h" followed by "eight", then any number of characters (`.*`), AND <br>
  end (`$`) with "ric" or "bs" or ")" (note that inside an R string the escape character has to be written twice, so a literal ")" is written as `\\)`) <br>
OR (`|`) <br>
  names that consist of exactly "bmi" (`^bmi$`) <br>
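
Before running the pattern on the full dataset, it can be checked against a handful of names. In this sketch the first five candidates are the chosen OBS values. The last three ("Birth Weight", "Weight Kg", "BMI Percentile") are hypothetical names added to show what the pattern rejects.

```r
# The pattern from the text, written as an R string (note the doubled backslash).
pattern <- "^[wh]+eight+.*(ric$|bs$|\\)$)|^bmi$"

candidates <- c("Weight Lbs", "Weight Metric", "Height(In)", "Height Metric", "BMI",
                "Birth Weight", "Weight Kg", "BMI Percentile")

# TRUE where the whole rule (start, middle and ending) is satisfied.
matched <- grepl(pattern, candidates, ignore.case = TRUE)
print(data.frame(candidates, matched))
```

Only the five chosen names match. The anchors `^` and `$` are what exclude names that merely contain "weight" or "bmi".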

In [None]:
# If you would like to try out the regular expressions method for filtering the data, use the code below:

# wt_ht <- subset(clinical, grepl("^[wh]+eight+.*(ric$|bs$|\\)$)|^bmi$",OBS,ignore.case = TRUE))

head(wt_ht)
length(unique(wt_ht$STUDYID)) # The number of unique patients.
summary(wt_ht)

### Method 3- Use the `filter` function from dplyr (quickest)
The code below performs the search in a single chained statement. The pipe operator `%>%` passes the data to `filter`, and then passes the filtered rows to `select`, which keeps only the columns we need (a step we would otherwise do later).

The `filter` and `select` functions and the `%>%` pipe operator are available once dplyr is loaded. dplyr is part of the tidyverse collection of packages.

In [None]:
# If you haven't already done so, use the code below to install the Tidyverse package. 
install.packages("tidyverse")
library(tidyverse) 


In [None]:
# Piping operation: filter the rows, then keep columns 1, 2, 4 and 7
# (STUDYID, OBS, OBSVALUE and DAYS_VIS_INDEX).
wt_ht <- clinical %>% 
  filter(OBS %in% c("Weight Lbs", "Weight Metric", "Height(In)", "Height Metric", "BMI")) %>%
  select(c(1,2,4,7))
head(wt_ht)
length(unique(wt_ht$STUDYID))
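
A quick way to convince yourself that the three approaches select the same rows is to run them side by side on a toy table. This is a sketch under stated assumptions: `toy`, its values and the `ROW` helper column are hypothetical, and base R indexing with `%in%` stands in for dplyr's `filter` so that no packages are required.

```r
# Hypothetical miniature of the clinical data; ROW records the original row position.
toy <- data.frame(
  ROW = 1:7,
  OBS = c("Weight Lbs", "BP Sys Sitting", "BMI", "Height(In)",
          "Clinic Site", "Height Metric", "Weight Metric"),
  OBSVALUE = c(180, 120, 25.8, 70, NA, 178, 81.6),
  stringsAsFactors = FALSE
)
keep <- c("Weight Lbs", "Weight Metric", "Height(In)", "Height Metric", "BMI")

# Method 1 style: one subset per value, stacked with rbind.
m1 <- do.call(rbind, lapply(keep, function(k) toy[toy$OBS == k, ]))

# Method 2 style: regular expression inside subset().
m2 <- subset(toy, grepl("^[wh]+eight+.*(ric$|bs$|\\)$)|^bmi$", OBS, ignore.case = TRUE))

# Method 3 style: membership test with %in% (base R stand-in for filter()).
m3 <- toy[toy$OBS %in% keep, ]

# Method 1 returns rows grouped by OBS value, so sort before comparing.
same_rows <- identical(sort(m1$ROW), m2$ROW) && identical(m2$ROW, m3$ROW)
print(same_rows)
```

The row order differs for Method 1 because the subsets are stacked one OBS value at a time, but the set of selected rows is the same.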

### Save the data in RDATA format
Sometimes it is useful to save the data in different file formats. To save the file in RDATA format, use the code below. To load a file of this type later, use `load(file = "wt_ht.RDATA")`.
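
The save and load round trip can be sketched on a small object. Here `wt_ht_demo` and its values are hypothetical, and a temporary file is used so nothing in the working directory is overwritten. Note that `load` restores the object under its original name and returns that name.

```r
# Hypothetical two-row stand-in for wt_ht.
wt_ht_demo <- data.frame(
  STUDYID = c(16, 16),
  OBS = c("Weight Lbs", "Height(In)"),
  OBSVALUE = c(180, 70),
  stringsAsFactors = FALSE
)

# Save to a temporary .RDATA file.
rdata_path <- tempfile(fileext = ".RDATA")
save(wt_ht_demo, file = rdata_path)

# Keep a copy for comparison, then remove the object from the workspace.
original <- wt_ht_demo
rm(wt_ht_demo)

# load() recreates wt_ht_demo and returns the names of the restored objects.
loaded_names <- load(file = rdata_path)
round_trip_ok <- identical(original, wt_ht_demo)
print(loaded_names)
print(round_trip_ok)
```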

In [None]:
save(wt_ht, file ="wt_ht.RDATA")
file.info("wt_ht.RDATA")

****
## End of Exercise 3a