In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


In [2]:
modifiers <- 'severe|significant|serious'

# These files have rows with entries that start with /, " and other problematic
#  characters. Luckily ~ does not appear in them, so I chose it as the
#  quote character to avoid read errors.
concept <- read_tsv('../data/athena/CONCEPT.csv', quote="~") %>% 
    filter(vocabulary_id == 'SNOMED')

synonym <- read_tsv('../data/athena/CONCEPT_SYNONYM.csv', quote="~") %>% 
    filter(language_concept_id == 4180186) %>%  # Only keep English synonyms
    select(-language_concept_id)

relationship <- read_tsv('../data/athena/CONCEPT_RELATIONSHIP.csv', quote="~")

ancestor <- read_tsv('../data/athena/CONCEPT_ANCESTOR.csv', quote="~")

Parsed with column specification:
cols(
  concept_id = col_double(),
  concept_name = col_character(),
  domain_id = col_character(),
  vocabulary_id = col_character(),
  concept_class_id = col_character(),
  standard_concept = col_character(),
  concept_code = col_character(),
  valid_start_date = col_double(),
  valid_end_date = col_double(),
  invalid_reason = col_character()
)
Parsed with column specification:
cols(
  concept_id = col_double(),
  concept_synonym_name = col_character(),
  language_concept_id = col_double()
)
Parsed with column specification:
cols(
  concept_id_1 = col_double(),
  concept_id_2 = col_double(),
  relationship_id = col_character(),
  valid_start_date = col_double(),
  valid_end_date = col_double(),
  invalid_reason = col_logical()
)
Parsed with column specification:
cols(
  ancestor_concept_id = col_double(),
  descendant_concept_id = col_double(),
  min_levels_of_separation = col_double(),
  max_levels_of_separation = col_double()
)
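
The effect of the `quote="~"` choice can be checked on a small file. This is a minimal sketch on a hypothetical two-row TSV (`toy_path` and its contents are made up for illustration, not taken from the Athena download): the first name begins with an unmatched double quote, which is kept as an ordinary character once `~` is declared as the quote character.

```r
library(readr)

# Hypothetical file: the first concept_name starts with an unmatched double quote.
toy_path <- tempfile(fileext = ".tsv")
writeLines(
    c("concept_id\tconcept_name",
      "1\t\"Unbalanced quote at start",
      "2\tPlain name"),
    toy_path
)

# With quote = "~", the double quote is read as a normal character, so the
#  field does not swallow the following tab or newline.
toy <- read_tsv(
    toy_path,
    quote = "~",
    col_types = cols(
        concept_id = col_double(),
        concept_name = col_character()
    )
)

nrow(toy)
toy$concept_name[1]
```

Passing `col_types` explicitly, as done here, also silences the "Parsed with column specification" messages.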


# Concept names vs synonyms

We want concepts that have one of the modifiers either in the `concept_name` or in the name of one of their synonyms. Morbid obesity is a good example: the concept name contains no modifier, but several of its synonyms do.

| concept_id | concept_name | domain_id | concept_code | concept_synonym_name |
| -- | -- | -- | -- | -- |
| 40565487 | Morbid obesity | Condition | 389986000 | OBESITY, SEVERE |
| 40565487 | Morbid obesity | Condition | 389986000 | Severe obesity |
| 40565487 | Morbid obesity | Condition | 389986000 | obesity severe |
| 40565487 | Morbid obesity | Condition | 389986000 | severe obesity |

A couple more good examples are below.

| concept_id | concept_name | domain_id | concept_code | concept_synonym_name |
| -- | -- | -- | -- | -- |
| 440370 | Nutritional marasmus | Condition | 29740003 | Severe malnutrition |
| 256716 | Asthma with status asthmaticus | Condition | 57546000 | acute severe asthma |

In [3]:
concept_with_modifiers <- concept %>% 
    filter(concept_class_id == 'Clinical Finding') %>%
    filter(domain_id %>% str_detect('Condition')) %>%    
    select(concept_id, concept_name, concept_code) %>%

    # Want all concepts that have a synonym with a modifier
    left_join(synonym, by = 'concept_id') %>%

    # For some reason, not all concept_names are given as synonyms themselves
    #  in the CONCEPT_SYNONYM table. Have to check both separately.
    mutate(
        name_has_mod = concept_name %>% str_to_lower %>% str_detect(modifiers),
        syn_has_mod = concept_synonym_name %>% str_to_lower %>% str_detect(modifiers),
    ) %>%
    
    # All concepts with either a synonym or name having a modifier
    filter(name_has_mod | syn_has_mod) %>%

    # Just unique concepts, so we can find parents (HLT)
    select(concept_id, concept_name, concept_code) %>%
    distinct()

concept_with_modifiers %>% write_tsv('../data/computed/concepts_with_modifiers.tsv')

concept_with_modifiers %>% head(2)

concept_id,concept_name,concept_code
<dbl>,<chr>,<chr>
45765743,Severe dry skin,702757002
45765900,Severe cognitive impairment,702956004
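
One caveat of the plain `severe|significant|serious` pattern is that it matches substrings, so a name containing "insignificant" would also count as modified. The sketch below shows one possible way to require whole words by wrapping the alternation in `\\b` word boundaries. The variable `modifiers_whole_word` and the example names are hypothetical and are not used elsewhere in this notebook.

```r
library(stringr)

modifiers <- 'severe|significant|serious'

# Assumption: whole-word matching is wanted. \\b marks a word boundary.
modifiers_whole_word <- str_c('\\b(', modifiers, ')\\b')

# Hypothetical names for illustration, not taken from the Athena files.
toy_names <- c(
    'Severe dry skin',
    'Clinically insignificant finding',
    'Tendency to persevere'
)

# Substring matching, as in the cell above
substring_match <- str_detect(str_to_lower(toy_names), modifiers)

# Whole-word matching
whole_word_match <- str_detect(str_to_lower(toy_names), modifiers_whole_word)

data.frame(toy_names, substring_match, whole_word_match)
```

Whether the stricter pattern is preferable depends on how often such substring hits occur in the real concept names, which I have not checked here.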


## Parents of modified concepts

Rather than take a top-down approach, such as picking a fixed distance from the root of the SNOMED disease hierarchy, I think it is better to look directly at the parents of modified concepts.

First find the parents of modified children, then find the synonyms of those parents. Finally, associate each parent with the synonyms of ALL its descendants.

In [4]:
parents_modified_children <- concept_with_modifiers %>%
    left_join(
        relationship %>%
            filter(relationship_id == 'Is a'),
        by = c('concept_id' = 'concept_id_1')
    ) %>%
    select(
        modified_concept_id = concept_id,
        modified_concept_name = concept_name,
        modified_concept_code = concept_code,
        parent_concept_id = concept_id_2
    ) %>%
    left_join(
        concept %>%
            select(concept_id, parent_concept_name = concept_name,
                   parent_concept_code = concept_code),
        by = c('parent_concept_id' = 'concept_id')
    )

parents_modified_children %>% write_tsv('../data/computed/parents_with_modified_children.tsv')

parents_modified_children %>% head(2)

modified_concept_id,modified_concept_name,modified_concept_code,parent_concept_id,parent_concept_name,parent_concept_code
<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>
45765743,Severe dry skin,702757002,4039266,Dry skin,16386004
45765900,Severe cognitive impairment,702956004,443432,Impaired cognition,386806002


In [5]:
parents_synonyms <- parents_modified_children %>%
    select(starts_with('parent')) %>%
    # A parent with several modified children would otherwise be repeated
    distinct() %>%
    left_join(synonym, by = c('parent_concept_id' = 'concept_id'))

parents_synonyms %>% write_tsv('../data/computed/parents_synonyms.tsv')

parents_synonyms %>% head(2)

parent_concept_id,parent_concept_name,parent_concept_code,concept_synonym_name
<dbl>,<chr>,<chr>,<chr>
4039266,Dry skin,16386004,Anhydrotic skin
4039266,Dry skin,16386004,Dry skin (finding)


In [6]:
parent_to_descendant_synonyms <- parents_modified_children %>%
    distinct(parent_concept_id) %>%
    left_join(ancestor, by = c('parent_concept_id' = 'ancestor_concept_id')) %>%
    select(parent_concept_id, descendant_concept_id) %>%
    left_join(synonym, by = c('descendant_concept_id' = 'concept_id')) %>%
    rename(descendant_synonym_name = concept_synonym_name)

parent_to_descendant_synonyms %>% write_tsv('../data/computed/parent_to_descendant_synonyms.tsv')

parent_to_descendant_synonyms %>% head(2)

parent_concept_id,descendant_concept_id,descendant_synonym_name
<dbl>,<dbl>,<chr>
4039266,4287562,Xeroderma pigmentosum group D
4039266,4287562,"Xeroderma pigmentosum, group D"
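
One thing to keep in mind when using this table: in the standard OMOP layout, `CONCEPT_ANCESTOR` also lists each concept as its own ancestor with zero levels of separation, so a parent's own synonyms may appear among its "descendant" synonyms. If only true descendants are wanted, filtering on `min_levels_of_separation` is one option. The sketch below uses a made-up three-row table (`toy_ancestor`, with hypothetical IDs 1 to 3), not the Athena data.

```r
library(dplyr)
library(tibble)

# Hypothetical ancestor table: concept 1 is its own ancestor (0 levels),
#  the parent of concept 2 (1 level) and the grandparent of concept 3 (2 levels).
toy_ancestor <- tibble(
    ancestor_concept_id = c(1, 1, 1),
    descendant_concept_id = c(1, 2, 3),
    min_levels_of_separation = c(0, 1, 2),
    max_levels_of_separation = c(0, 1, 2)
)

# Drop the self-referencing row so only strict descendants remain
strict_descendants <- toy_ancestor %>%
    filter(min_levels_of_separation > 0) %>%
    select(parent_concept_id = ancestor_concept_id, descendant_concept_id)

strict_descendants
```

The same `filter()` could be placed right after the join on `ancestor` in the cell above, if excluding the parent itself turns out to be the desired behaviour.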
