In [16]:
library(dplyr)
library(tidyverse)
library(stringr)

## Extracted Parents -> Notes for Annotation

Using the dataframe: extracted_parents.csv

This notebook extracts the notes that will be annotated for our project in Symbolic Methods. Each note will be annotated four ways: manually by two students, by MetaMap, by Usagi, and by our proposed method.

For this project we are interested in the modifiers: **severe, significant or serious**

From the original dataframe we:
1. remove notes that do not include spaces, because MetaMap requires words to be separated by spaces
2. remove repeats, e.g., pain and painful are both matched in the criteria_string "... painful ...". 
   To avoid biasing our gold standard, we keep only one row per NCT_id and parent concept
3. remove criteria_string values shorter than 200 characters, which result from an indexing error
4. remove rows in which the matched_string is not present in the criteria_string

From this edited dataframe we:
1. investigate the parent concepts that are most representative in the dataset
2. investigate the number of criteria_string that include the modifiers
3. ignore parent concepts whose name includes any of the three modifiers of interest

Finally, we create a dataset of notes for annotation that follows the criteria below:
1. parent concepts included must have at least 10 instances where a criteria_string includes a modifier
2. For each parent concept, we choose 5 instances with a modifier and 5 without a modifier (10 total per parent)

_Note_ In the dataset we include 5 instances for each parent in which the criteria_string includes a modifier. However, this modifier does not have to apply to the matched_string; it only has to be present somewhere in the criteria_string.
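
To illustrate the distinction in the note above, here is a minimal sketch that flags whether a modifier directly precedes the matched_string. The helper `modifier_precedes()` is hypothetical and not part of this notebook; the two example strings are shortened from sampled notes shown further down. It only catches a modifier followed by whitespace and then the matched string, so it would miss phrasings such as "severe kyphoscoliosis" or "severe or moderate scoliosis".

```r
library(stringr)

modifiers <- "severe|significant|serious"

# Hypothetical helper: TRUE when a modifier is immediately followed by matched_string
modifier_precedes <- function(criteria_string, matched_string) {
  # \Q...\E quotes matched_string so it is matched literally inside the regex
  pattern <- paste0("\\b(", modifiers, ")\\s+\\Q", matched_string, "\\E")
  str_detect(tolower(criteria_string), pattern)
}

examples <- c(
  "severe  scoliosis or chest wall deformities that affect lung function",
  "diagnosis of scoliosis and/or other significant lumbar spinal pathology"
)
res <- modifier_precedes(examples, c("scoliosis", "scoliosis"))
print(res)
```

Both strings contain a modifier, but only in the first does it apply to the matched string, which is the case a human annotator would need to separate from the second.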

**RESULT:** Dataframe with 77 unique parent_concept_ids. Total of 770 observations and 6 fields written to: **annotate_notes.csv**


In [116]:
mz_data <- read.delim("../data/extracted_parents.csv", sep = ",", stringsAsFactors = FALSE)

In [89]:
# example of where matched_string is not present in criteria_string
mz_data %>% filter(NCT_id == "NCT01829919")

NCT_id,matched_string,criteria_string,parent_concept_id,parent_concept_name
NCT01829919,major depressive disorder,sonality disorder; presence of any of the following psychiatric disorders within the  timeframes specified: Major Depressive Disorder-Lifetime; Dysthymia-Past 2 Years; Bipolar  Disorder-,4152280,Major depressive disorder
NCT01829919,bipolar disorder,story of clinical diagnosis of  depression; or treatment for depression; history of clinical diagnosis of border-line  personality disorder; presence of any of the following psychiatric,436665,Bipolar disorder
NCT01829919,depressive disorder,ty disorder; presence of any of the following psychiatric disorders within the  timeframes specified: Major Depressive Disorder-Lifetime; Dysthymia-Past 2 Years; Bipolar  Disorder-Lifeti,440383,Depressive disorder
NCT01829919,depression,retion of study  medication; history of self-injurious behavior; history of clinical diagnosis of  depression; or treatment for depression; history of clinical diagnosis of border-line,440383,Depressive disorder
NCT01829919,psychotic disorder,ence of any of the following psychiatric disorders within the  timeframes specified: Major Depressive Disorder-Lifetime; Dysthymia-Past 2 Years; Bipolar  Disorder-Lifetime; Panic Disorde,436073,Psychotic disorder
NCT01829919,panic,imeframes specified: Major Depressive Disorder-Lifetime; Dysthymia-Past 2 Years; Bipolar  Disorder-Lifetime; Panic Disorder-Lifetime; Agoraphobia-Past Month; Social Phobia-Past  Month; O,4196358,Panic


In [87]:
# remove notes without spaces
notes_edit <- mz_data[str_detect(mz_data$criteria_string, " "), ]
# remove repeated NCT_id / parent_concept_id pairs, keeping the first row of each pair
notes_edit <- notes_edit[rownames(unique(notes_edit[,c("NCT_id", "parent_concept_id")])), ]
# keep only notes that are exactly 200 characters long (shorter notes come from an indexing error)
notes_edit <- notes_edit[nchar(notes_edit$criteria_string) == 200, ]
# remove notes where matched_string is not in criteria_string
notes_edit <- notes_edit %>% rowwise() %>% 
                    filter(str_detect(string = tolower(criteria_string), pattern = matched_string))
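
The repeat-removal and presence filters can also be sketched with `distinct()` and `fixed()`. The rows below are made up for illustration and are not taken from extracted_parents.csv. Using `fixed()` treats matched_string as a literal string rather than a regular expression; that is an assumption about the intent, and it matters when a matched_string contains characters such as `(` that have a special meaning in a regex.

```r
library(dplyr)
library(stringr)

# Toy rows (hypothetical)
toy <- data.frame(
  NCT_id = c("NCT1", "NCT1", "NCT2", "NCT3"),
  matched_string = c("pain", "painful", "stage (iv)", "asthma"),
  criteria_string = c("history of painful joints", "history of painful joints",
                      "tumor stage (iv) disease", "no relevant history"),
  parent_concept_id = c(1L, 1L, 2L, 3L),
  stringsAsFactors = FALSE
)

toy_edit <- toy %>%
  # keep the first row for each NCT_id / parent_concept_id pair
  distinct(NCT_id, parent_concept_id, .keep_all = TRUE) %>%
  # keep rows whose matched_string occurs literally in the lower-cased note
  filter(str_detect(tolower(criteria_string), fixed(matched_string)))

print(toy_edit)
```

In this toy data the "painful" repeat and the "asthma" row are dropped, while "stage (iv)" is kept; with a plain regex pattern the parentheses would be read as a group and that row would be lost.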

In [96]:
# check point after editing original dataframe
table(nchar(notes_edit$criteria_string))
notes_edit %>% filter(NCT_id == "NCT01829919")


   200 
233365 


NCT_id,matched_string,criteria_string,parent_concept_id,parent_concept_name
NCT01829919,major depressive disorder,sonality disorder; presence of any of the following psychiatric disorders within the  timeframes specified: Major Depressive Disorder-Lifetime; Dysthymia-Past 2 Years; Bipolar  Disorder-,4152280,Major depressive disorder
NCT01829919,depressive disorder,ty disorder; presence of any of the following psychiatric disorders within the  timeframes specified: Major Depressive Disorder-Lifetime; Dysthymia-Past 2 Years; Bipolar  Disorder-Lifeti,440383,Depressive disorder
NCT01829919,panic,imeframes specified: Major Depressive Disorder-Lifetime; Dysthymia-Past 2 Years; Bipolar  Disorder-Lifetime; Panic Disorder-Lifetime; Agoraphobia-Past Month; Social Phobia-Past  Month; O,4196358,Panic


In [98]:
# Identify 100 most representative parent concepts
top100 <- notes_edit %>% group_by(parent_concept_id, parent_concept_name) %>% 
                        summarise(count = n()) %>% 
                        arrange(-count) %>% 
                        head(n = 100)
head(top100)

“Grouping rowwise data frame strips rowwise nature”

parent_concept_id,parent_concept_name,count
316866,Hypertensive disorder,28053
437312,Bleeding,25745
4329041,Pain,18620
440383,Depressive disorder,13529
317009,Asthma,11751
4182210,Dementia,9314


In [97]:
# Found that the top 100 don't always have >= 5 instances with modifiers
# Check out how many strings have "severe", "significant" or "serious" (modifiers of interest)
severe_ids <- notes_edit %>% filter(!grepl(tolower(parent_concept_name), pattern = "severe|significant|serious")) %>%
                group_by(parent_concept_id, parent_concept_name) %>%
                summarise(count = n(),
                          severe = sum(str_detect(criteria_string, "severe|significant|serious"))) %>%
                arrange(-severe) %>%
                head(n = 100)
head(severe_ids)

“Grouping rowwise data frame strips rowwise nature”

parent_concept_id,parent_concept_name,count,severe
316866,Hypertensive disorder,28053,5164
437312,Bleeding,25745,3972
440383,Depressive disorder,13529,3030
317009,Asthma,11751,2643
4182210,Dementia,9314,2062
4329041,Pain,18620,1819
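
The `severe` count above is case-sensitive, because criteria_string is not lower-cased before `str_detect()`. A minimal sketch of a case-insensitive variant on made-up rows follows; the column names `severe_cs` and `severe_ci` are hypothetical. Applying this to the real data would change the counts shown above, so it is illustrative only.

```r
library(dplyr)
library(stringr)

# Toy rows (hypothetical)
toy <- data.frame(
  parent_concept_id = c(1L, 1L, 1L, 2L),
  criteria_string = c("Severe asthma", "severe asthma", "mild asthma",
                      "Significant bleeding"),
  stringsAsFactors = FALSE
)

# same modifiers, matched regardless of letter case
mod_pattern <- regex("severe|significant|serious", ignore_case = TRUE)

toy_counts <- toy %>%
  group_by(parent_concept_id) %>%
  summarise(count = n(),
            severe_cs = sum(str_detect(criteria_string, "severe|significant|serious")),
            severe_ci = sum(str_detect(criteria_string, mod_pattern)))

print(toy_counts)
```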


In [108]:
# parent concepts that have at least 10 instances of criteria_string that includes modifier
parent_ids <- severe_ids %>% filter(severe >= 10) %>% select(parent_concept_id) %>% unlist() %>% unname()

In [112]:
# choose 5 with modifier and 5 without modifiers for these 77 parents
notes_770 <- notes_edit %>% filter(parent_concept_id %in% parent_ids) %>%
                mutate(w.mod = str_detect(criteria_string, "severe|significant|serious")) %>% 
                group_by(parent_concept_id, w.mod) %>% sample_n(size = 5)

“Grouping rowwise data frame strips rowwise nature”
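
Because `sample_n()` draws at random, rerunning the cell above gives a different set of notes each time. A minimal sketch of a repeatable draw on toy data is below; the seed value 42 and the toy data frame are assumptions made for illustration, not values used in this notebook.

```r
library(dplyr)

set.seed(42)  # arbitrary seed (assumption); any fixed value makes the draw repeatable

# Toy data: 2 parents, each with 10 notes with a modifier and 10 without
toy <- data.frame(
  parent_concept_id = rep(c(1L, 2L), each = 20),
  w.mod = rep(rep(c(TRUE, FALSE), each = 10), times = 2),
  note = paste("note", 1:40),
  stringsAsFactors = FALSE
)

toy_sample <- toy %>%
  group_by(parent_concept_id, w.mod) %>%
  sample_n(size = 5) %>%
  ungroup()

# each parent / modifier combination should contribute exactly 5 rows
cell_counts <- count(toy_sample, parent_concept_id, w.mod)
print(cell_counts)
```

Note that `sample_n(size = 5)` raises an error when a group has fewer than 5 rows, so every parent in the real data needs at least 5 notes with a modifier and 5 without.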

In [114]:
print(dim(notes_770))
head(notes_770, n = 20)

[1] 770   6


NCT_id,matched_string,criteria_string,parent_concept_id,parent_concept_name,w.mod
NCT03937804,scoliosis,"ve pulmonary disease, chronic obstructive airway disease,  emphysema, chronic bronchitis, lung transplant, kyphoscoliosis, sarcoidosis,  bronchopulmonary dysplasia, cystic fibr",72418,Scoliosis deformity of spine,False
NCT00252252,scoliosis,"ar disease (eg. old polio, bilateral diaphragm  paralysis) or chest wall disease (eg. idiopathic scoliosis, thoracoplasty)  - Recruited from 1300 patients attending Royal Brompt",72418,Scoliosis deformity of spine,False
NCT03329716,scoliosis,"en wearing underarm (Boston) bracing, and who have reached  skeletal maturity based on the Scoliosis Research Society (SRS) standardized criteria:  Risser stage ≥4, >2 years po",72418,Scoliosis deformity of spine,False
NCT03090971,scoliosis,"- a distinct osseous lesion, such as sphenoid wing dysplasia, pseudoarthrosis of the  tibia, macrocephaly, or scoliosis  - a first-degree relative with NF1  - Presen",72418,Scoliosis deformity of spine,False
NCT03404232,scoliosis,pine  - Relevant peripheral neuropathy  - Acute denervation subsequent to a radiculopathy  - Scoliosis with Cobb angle greater than 25°  - Spondylolisthesis,72418,Scoliosis deformity of spine,False
NCT01467882,scoliosis,"or other study endpoints (e.g. chronic steroid use [except mild topical steroids],  renal failure, diabetes, moderate to severe scoliosis, previously treated intracranial  tum",72418,Scoliosis deformity of spine,True
NCT03581084,scoliosis,"tis, vocal cord dysfunction  (that is the sole cause of respiratory symptoms and at the PI's discretion), severe  scoliosis or chest wall deformities that affect lung function,",72418,Scoliosis deformity of spine,True
NCT01926275,scoliosis,"nced  bronchiectases, active tuberculosis, post tuberculosis syndrome, pneumonia, severe  kyphoscoliosis, tracheostoma, neuromuscular diseases, or any other disorder, which",72418,Scoliosis deformity of spine,True
NCT01784094,scoliosis,":  - previous severe back or lower extremity injury or surgery  - major structural spinal deformity (scoliosis, kyphosis, stenosis)  - ankylosing spondylitis or rheuma",72418,Scoliosis deformity of spine,True
NCT02131090,scoliosis,men  - BMI < 50  Exclusion Criteria:  - BMI >50  - Documented diagnosis of scoliosis and/or other significant lumbar spinal pathology  - Previous lu,72418,Scoliosis deformity of spine,True


In [117]:
write.csv(file = "../data/annotate_notes.csv",
          x = notes_770, row.names = FALSE)
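
A quick round-trip check of the written file can confirm the 10-notes-per-parent layout. The sketch below uses a hypothetical stand-in for `notes_770` and a temporary file rather than ../data/annotate_notes.csv.

```r
# Hypothetical stand-in for notes_770: 2 parents x 2 modifier states x 5 notes
toy_notes <- data.frame(
  parent_concept_id = rep(c(72418L, 440383L), each = 10),
  w.mod = rep(rep(c(FALSE, TRUE), each = 5), times = 2),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".csv")  # temporary path used only for this sketch
write.csv(x = toy_notes, file = out_file, row.names = FALSE)

# read the file back and check the layout
check <- read.csv(out_file, stringsAsFactors = FALSE)
n_parents <- length(unique(check$parent_concept_id))
per_cell <- table(check$parent_concept_id, check$w.mod)

# 10 rows per parent, 5 in each modifier state
stopifnot(nrow(check) == 10 * n_parents, all(per_cell == 5))
```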