# Dutch UMLS to concept table
This notebook describes how to convert a UMLS concept table containing Dutch terms into a formatted concept table for use in a tool such as MedCAT. In the second part of this notebook, we add drug names from Dutch SNOMED, because these concepts are not well represented in the Dutch UMLS source vocabularies. A large-scale automatic mapping from Dutch SNOMED to UMLS is not possible because of many-to-many mappings.

Requirements:
- MySQL database containing Dutch UMLS terms

For adding Dutch SNOMED drug names:
- Dutch SNOMED concept table
- MySQL database containing SNOMED-US, which is used for mapping SNOMED Dutch -> UMLS

In [None]:
from sqlalchemy import create_engine
from dotenv import load_dotenv
import pandas as pd
import numpy as np
import json
import re
import os

In [None]:
# Credentials to connect to UMLS MySQL database
load_dotenv()
user = os.getenv('MYSQL_USER')
password = os.getenv('MYSQL_PASSWORD')
host = os.getenv('MYSQL_HOST')
port = os.getenv('MYSQL_PORT')
database = os.getenv('MYSQL_DATABASE')

# Create the connection
connection_string = f'mysql://{user}:{password}@{host}:{port}/{database}'
connection = create_engine(connection_string)
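
If the password contains characters such as `@` or `/`, a connection URL built by plain string formatting may be parsed incorrectly. A minimal sketch of one way to guard against that, using only the standard library; the helper name `build_mysql_url` and the example credentials are hypothetical:

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, database):
    """Build a MySQL connection URL with percent-encoded credentials."""
    return (
        f"mysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials, for illustration only
example_url = build_mysql_url("umls_user", "p@ss/word", "localhost", "3306", "umls")
print(example_url)
```

The resulting string can be passed to `create_engine` in the same way as `connection_string` above.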

In [None]:
# Retrieve Dutch UMLS concepts
query = """
SELECT * FROM MRCONSO WHERE LAT = 'DUT'
"""
df_dutch_umls = pd.read_sql_query(query, con=connection)
df_dutch_umls.head()
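
Only four `MRCONSO` columns are used further on in this notebook (`CUI`, `STR`, `TTY`, `SAB`), so selecting just those could reduce the amount of data loaded. The sketch below illustrates the idea with an in-memory SQLite table standing in for the MySQL database; the table contents are made-up toy rows, not real UMLS data.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL UMLS database (illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MRCONSO (CUI TEXT, LAT TEXT, STR TEXT, TTY TEXT, SAB TEXT)")
conn.executemany(
    "INSERT INTO MRCONSO VALUES (?, ?, ?, ?, ?)",
    [
        ("C0000001", "DUT", "voorbeeldterm", "PT", "MDRDUT"),    # toy row
        ("C0000001", "ENG", "example term", "PT", "MDR"),        # toy row
        ("C0000002", "DUT", "ander voorbeeld", "LLT", "MDRDUT"), # toy row
    ],
)

# Select only the Dutch rows and only the columns needed later
dutch_rows = conn.execute(
    "SELECT CUI, STR, TTY, SAB FROM MRCONSO WHERE LAT = ? ORDER BY CUI",
    ("DUT",),
).fetchall()
conn.close()
print(dutch_rows)
```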

## Term type in source
Some source-defined term types are not relevant for our use case. In the next part we will drop those. See https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html 

In [None]:
df_dutch_umls.TTY.value_counts()

| TTY  | Description | Count | Example | Reference|
| - | - | - | - | - |
| PT | Designated preferred name| 111766 | harthypertrofie, Pancoast-syndroom ||
| LLT | Lower Level Term | 71603 | heupkombreuk, buikkramp| |
| LN | LOINC official fully specified name | 52313 | fencyclidine:massa/massa:moment:haar:kwantitatief | |
| MH | Main heading | 28657 | Dehydratie, Astma | |
| SY | Designated synonym | 11863 | Spanningshoofdpijn, Ziekte van Hodgkin | |
| OL | Non-current Lower Level Term| 9291 | acquired immunodeficiency syndrome, ankylose van gewricht, meerdere plaatsen | https://meddra.org/sites/default/files/page/documents_insert/meddra_-_terminologies_coding.pdf |
| HT | Hierarchical term | 3295 | calciummetabolismestoornissen, oculaire hemorragische aandoeningen | |
| LO | Obsolete official fully specified name | 1696 | promyelocyten/100 leukocyten:getalsfractie:mom... | |
| HG | High Level Group Term |  337| complicaties geassocieerd met medisch hulpmiddel, zuur-basestoornissen | |
| SMQ| Standardised MedDRA Query |  225| Leveraandoeningen (SMQ) , Tumormarkers (SMQ) | |
| CP | ICPC component process (in original form) |   38| Ander bloedonderzoek, Medicatie/recept/injectie | |
| OS | System-organ class |   27| Bloed- en lymfestelselaandoeningen, Infecties en parasitaire aandoeningen | |
| AB | Abbreviation in any source vocabulary |   27| Infec, Neopl, Ear, Endo | |

In [None]:
# Select a set of TTYs that seem most relevant for entity linking
tty_selection = ['PT', 'LLT', 'MH', 'SY']
df_dutch_umls_subset = df_dutch_umls[df_dutch_umls.TTY.isin(tty_selection)].copy()

# Keep only relevant columns
df_dutch_umls_subset = df_dutch_umls_subset[['CUI', 'STR', 'TTY', 'SAB']]
df_dutch_umls_subset.rename({'CUI': 'cui', 'STR': 'str', 'TTY': 'tty', 'SAB': 'sab'}, inplace=True, axis=1)

# Most terms in UMLS have an English Metathesaurus preferred name.
# As a rough but effective way to get a good preferred name for the Dutch terms,
# relabel terms whose Term Type in Source (TTY) is "Designated preferred name" (PT)
# as the Metathesaurus preferred name (PN). All other terms are stored as synonyms (SY).
# Assign the result back instead of using inplace=True on an attribute-accessed column,
# which may not modify the dataframe in recent pandas versions.
df_dutch_umls_subset['tty'] = df_dutch_umls_subset['tty'].replace({'PT': 'PN',
                                                                   'LLT': 'SY',
                                                                   'MH': 'SY'})

# Remove "NAO" ("Niet Anders Omschreven"), which is relevant for the source terminology but not for entity linking.
# See https://meddra.org/sites/default/files/guidance/file/intguide_15_0_dutch.pdf
df_dutch_umls_subset['str'] = df_dutch_umls_subset['str'].replace({' NAO': '', r' \(NAO\)': '', ' nao': ''}, regex=True)

# Sort values
df_dutch_umls_subset.sort_values(by=['cui', 'tty', 'str', 'sab'], inplace=True)

# Drop duplicates, only keep the first entry (which is a PN because we sorted)
print(f'Records before dropping duplicates: {df_dutch_umls_subset.shape[0]}')
df_dutch_umls_subset = df_dutch_umls_subset.drop_duplicates(subset=['cui', 'str'], keep='first').reset_index(drop=True)
print(f'Records after dropping duplicates: {df_dutch_umls_subset.shape[0]}')

# Because duplicates were dropped, only the first SAB value per (cui, str) pair is kept.
# Since the other source values are lost, set all values to 'UMLS-dutch'.
df_dutch_umls_subset['sab'] = 'UMLS-dutch'
df_dutch_umls_subset.head(20)
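
The cleaning steps in the cell above (strip the "NAO" suffix, sort so that `PN` precedes `SY`, keep the first row per `(cui, str)` pair) can be sketched in plain Python. This is a minimal illustration on made-up records; the helper names and the CUI are hypothetical.

```python
import re

# Same three patterns as in the notebook cell, combined into one regex
NAO_PATTERN = re.compile(r" \(NAO\)| NAO| nao")


def strip_nao(term):
    """Remove the 'NAO' (Niet Anders Omschreven) marker from a term."""
    return NAO_PATTERN.sub("", term)


def deduplicate(records):
    """Keep one record per (cui, term); 'PN' sorts before 'SY', so it wins."""
    kept = {}
    for cui, term, tty in sorted(records, key=lambda r: (r[0], r[2], r[1])):
        kept.setdefault((cui, term), (cui, term, tty))
    return list(kept.values())


# Toy records with a hypothetical CUI
raw = [
    ("C0000001", "voorbeeldterm NAO", "SY"),
    ("C0000001", "voorbeeldterm", "PN"),
    ("C0000001", "ander voorbeeld (NAO)", "SY"),
]
cleaned = deduplicate([(cui, strip_nao(term), tty) for cui, term, tty in raw])
print(cleaned)
```

After stripping, the first two toy records share the same name, and the `PN` record is the one that survives.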

# Add SNOMED

## Load SNOMED US

In [None]:
query = "SELECT distinct cui, scui FROM MRCONSO where sab = 'SNOMEDCT_US'"
df_snomed_us = pd.read_sql_query(query, con=connection)
df_snomed_us.scui = df_snomed_us.scui.astype(int)
print(f'SNOMED US terms with UMLS CUI: {df_snomed_us.shape[0]}')
df_snomed_us.head()

In [None]:
unambiguous_mapping_ids = []
snomed_names_to_add = []

In [None]:
snomed_to_umls_mapping = df_snomed_us.groupby('scui')['cui'].apply(list).to_dict()
print(f'Number of SNOMED IDs that map to at least 1 CUI: {len(snomed_to_umls_mapping)}')

In [None]:
unambiguous_mapping_ids = []
for snomed_id in snomed_to_umls_mapping:
    if len(snomed_to_umls_mapping[snomed_id]) == 1:
        unambiguous_mapping_ids.append(snomed_id)
print(f'Number of SNOMED IDs that map to only 1 CUI: {len(unambiguous_mapping_ids)}')
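
The two cells above group CUIs per SNOMED ID and keep only the IDs with exactly one CUI. A minimal sketch of that logic on made-up pairs (all identifiers below are hypothetical):

```python
from collections import defaultdict

# Toy (cui, scui) pairs; SNOMED ID 222 maps to two CUIs and is therefore ambiguous
pairs = [("C0000001", 111), ("C0000002", 222), ("C0000003", 222)]

snomed_to_cuis = defaultdict(list)
for cui, scui in pairs:
    snomed_to_cuis[scui].append(cui)

# Keep only SNOMED IDs that map to exactly one CUI
unambiguous = {scui for scui, cuis in snomed_to_cuis.items() if len(cuis) == 1}
print(unambiguous)
```

The sketch collects the unambiguous IDs in a set rather than a list, which keeps later membership checks fast.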

## Load SNOMED NL

In [None]:
df_snomed_dutch = pd.read_csv('04_ConceptDB/snomedct-dutch_v1.0.csv')
df_snomed_dutch.cui = df_snomed_dutch.cui.astype(int)
df_snomed_dutch.head()

In [None]:
df_dutch_umls_subset.head()

## Add SNOMED NL to UMLS

In [None]:
df_snomed_dutch.shape

In [None]:
dutch_umls_ids=df_dutch_umls_subset.groupby('cui')['str'].apply(list).to_dict()

dutch_umls_names_lowercase = set()
for cui in dutch_umls_ids:
    for value in dutch_umls_ids[cui]:
        dutch_umls_names_lowercase.add(value.lower())

In [None]:
df_snomed_dutch.head()

In [None]:
df_snomed_dutch['lowercase_str'] = df_snomed_dutch['str'].str.lower()

In [None]:
%%time
def map_dutch_snomed_to_umls(row):
    snomed_id = row['cui']
    if snomed_id in unambiguous_mapping_ids:
        cui = snomed_to_umls_mapping[snomed_id][0]
        
        # Check whether SNOMED name is a name in UMLS, under any CUI.
        # This is to prevent:
        # - Adding names for a concept that we already have.
        # - Introducing concepts that are already in our DB but map to a different CUI
        #   because of one-to-many SNOMED to UMLS mapping. 
        # The downside is that concepts that are not in our DB yet, will not be added. 
        if row['lowercase_str'] not in dutch_umls_names_lowercase:
            
            # Check whether the CUI is new, or already exists, in which case the name is always a synonym.
            if cui in dutch_umls_ids:
                snomed_names_to_add.append([cui, row['str'], 'SY'])
            else:
                snomed_names_to_add.append([cui, row['str'], row['tty']])
            
snomed_names_to_add = []

# Apply function
# Wall time: 12min 55s
df_snomed_dutch.apply(map_dutch_snomed_to_umls, axis = 1)

print(len(snomed_names_to_add))
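
The selection rules in `map_dutch_snomed_to_umls` can be sketched on toy data as follows. The function name, terms, and identifiers are hypothetical. Membership tests on a Python list are linear in its length, so this sketch uses sets for the lookups, which may be considerably faster on large tables.

```python
def select_snomed_names(snomed_rows, snomed_to_cuis, unambiguous_ids,
                        known_names_lower, known_cuis):
    """Return (cui, term, tty) for SNOMED names that are safe to add."""
    selected = []
    for snomed_id, term, tty in snomed_rows:
        if snomed_id not in unambiguous_ids:
            continue  # SNOMED ID maps to zero or several CUIs
        if term.lower() in known_names_lower:
            continue  # name already present under some CUI
        cui = snomed_to_cuis[snomed_id][0]
        # A name added to an existing CUI is always a synonym
        selected.append((cui, term, "SY" if cui in known_cuis else tty))
    return selected


# Toy inputs with hypothetical identifiers and names
snomed_to_cuis = {111: ["C0000001"], 222: ["C0000002", "C0000003"], 333: ["C0000004"]}
rows = [
    (111, "Voorbeeldmiddel A", "PN"),  # new name for a known CUI
    (111, "Bekende naam", "SY"),       # name already in the concept table
    (222, "Voorbeeldmiddel B", "PN"),  # ambiguous SNOMED ID
    (333, "Voorbeeldmiddel C", "PN"),  # new CUI
]
result = select_snomed_names(
    rows, snomed_to_cuis, {111, 333}, {"bekende naam"}, {"C0000001"}
)
print(result)
```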

In [None]:
snomed_names_with_cui = pd.DataFrame(snomed_names_to_add, columns = ['cui', 'str', 'tty'])
snomed_names_with_cui['sab'] = 'SNOMEDCT-NL'
snomed_names_with_cui.head()

In [None]:
df_dutch_umls_subset.shape

In [None]:
snomed_names_with_cui.shape

In [None]:
umls_snomed_merged = pd.concat([df_dutch_umls_subset, snomed_names_with_cui])
umls_snomed_merged.shape

In [None]:
umls_snomed_merged.sort_values(by=['cui', 'tty', 'sab', 'str'], inplace=True)
umls_snomed_merged.reset_index(drop=True,inplace=True)

## Remove problematic names


In [None]:
names_to_remove = ['Bij', # C0004923
                   'Bijen', # C0004923
                   'Haar', # C0018494
                   'bleek', # C0678215
                   'Weer', # C0043085
                   'Na+'] # C0337443
umls_snomed_merged[umls_snomed_merged.str.isin(names_to_remove)]

In [None]:
# Remove rows
rows_to_remove = umls_snomed_merged[umls_snomed_merged.str.isin(names_to_remove)].index
print(f'Number of rows before removing rows: {umls_snomed_merged.shape[0]}')
umls_snomed_filtered = umls_snomed_merged.drop(rows_to_remove)
print(f'Number of rows after removing rows: {umls_snomed_filtered.shape[0]}')
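
Note that `isin` compares whole values exactly and case-sensitively, so only names spelled exactly as in `names_to_remove` are dropped. A small plain-Python sketch of the same behaviour, using made-up rows and hypothetical CUIs:

```python
names_to_remove = {"Bij", "Haar"}

# Toy rows: (cui, name)
rows = [
    ("C0000001", "Bij"),
    ("C0000001", "bij"),          # different case: not removed
    ("C0000002", "Haar"),
    ("C0000002", "Haarverlies"),  # longer name: not removed
]

# Exact, case-sensitive match on the whole name
kept = [row for row in rows if row[1] not in names_to_remove]
print(kept)
```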

## Add custom CUIs
Sometimes names or concepts are not captured in any of the Dutch terminologies. By looking up the English name for these concepts, we can add custom Dutch names using the real UMLS identifier.

In [None]:
umls_snomed_filtered.head()

In [None]:
custom_concepts = pd.read_csv("custom_concepts.csv")
custom_concepts

In [None]:
print(f'Number of rows before adding rows: {umls_snomed_filtered.shape[0]}')
umls_snomed_custom = pd.concat([umls_snomed_filtered, custom_concepts])
print(f'Number of rows after adding rows: {umls_snomed_custom.shape[0]}')
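
The layout of `custom_concepts.csv` is not shown in this notebook. `pd.concat` aligns dataframes by column name and fills missing columns with NaN, so the file presumably needs the same four columns as the concept table. The sketch below checks that assumption on a made-up file; the column list and the row are assumptions for illustration.

```python
import csv
import io

# Assumed layout: same columns as the concept table built above
expected_columns = ["cui", "str", "tty", "sab"]

# Made-up file content with a hypothetical CUI
custom_csv = io.StringIO("cui,str,tty,sab\nC0000001,voorbeeldnaam,SY,custom\n")

reader = csv.DictReader(custom_csv)
missing = [column for column in expected_columns if column not in reader.fieldnames]
custom_rows = list(reader)
print(missing, custom_rows)
```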

## Add TUI (types)
UMLS concepts have one or multiple types. These types are kept in a separate table, `MRSTY`. See https://semanticnetwork.nlm.nih.gov/download/SemGroups.txt for all types.

In [None]:
# Load TUI table from MySQL
query = """
SELECT cui, tui, sty FROM MRSTY
"""
df_tui = pd.read_sql_query(query, con=connection)

In [None]:
# Add TUI column to previously created dataframe
umls_snomed_tui = umls_snomed_custom.merge(df_tui, how='left', on='cui')

# View some concepts that have multiple TUIs
umls_snomed_tui[umls_snomed_tui.duplicated(subset=['cui', 'str'], keep=False)].head(10)
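
Because a concept can have several types, the left merge repeats each name once per TUI, and names whose CUI has no entry in `MRSTY` keep empty type fields. A plain-Python sketch of that row multiplication, with hypothetical CUIs and a hypothetical helper name:

```python
def left_join_types(terms, cui_to_types):
    """Emulate a left join: one output row per (term, type) combination."""
    joined = []
    for cui, term in terms:
        # CUIs without a type still yield one row, with empty type fields
        for tui, sty in cui_to_types.get(cui) or [(None, None)]:
            joined.append((cui, term, tui, sty))
    return joined


terms = [("C0000001", "voorbeeldterm"), ("C0000002", "ander voorbeeld")]
cui_to_types = {
    "C0000001": [("T047", "Disease or Syndrome"), ("T033", "Finding")],
}
joined = left_join_types(terms, cui_to_types)
print(joined)
```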

In [None]:
print(f'Number of unique TUIs in Dutch UMLS subset: {len(umls_snomed_tui.tui.unique())}')

In [None]:
# Create dataframe with counts per TUI. Group on the (tui, sty) pair so that
# code and name stay aligned even when several types have the same count.
type_counts = (umls_snomed_tui.groupby(['tui', 'sty'])
                              .size()
                              .reset_index(name='count')
                              .sort_values('count', ascending=False)
                              .reset_index(drop=True))
type_counts
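
Counting the `(tui, sty)` pair as one unit keeps each code next to its name regardless of how ties in the counts are ordered. A minimal standard-library sketch on made-up rows:

```python
from collections import Counter

# Toy (tui, sty) values, one per row of the merged table
rows = [
    ("T047", "Disease or Syndrome"),
    ("T047", "Disease or Syndrome"),
    ("T121", "Pharmacologic Substance"),
    ("T078", "Idea or Concept"),
    ("T047", "Disease or Syndrome"),
]

# Count each (tui, sty) pair once per row, most frequent first
pair_counts = Counter(rows)
type_counts_sketch = [(tui, sty, n) for (tui, sty), n in pair_counts.most_common()]
print(type_counts_sketch)
```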

## TUI Filtering
We could implement filtering of TUIs here. This depends on the domain and question of subsequent analysis. For SNOMED

In [None]:
tuis_to_remove = ['T078', # Idea or Concept
                 ]
umls_snomed_tui[umls_snomed_tui.tui.isin(tuis_to_remove)].head()

In [None]:
# Remove rows
rows_to_remove = umls_snomed_tui[umls_snomed_tui.tui.isin(tuis_to_remove)].index
print(f'Number of rows before removing rows: {umls_snomed_tui.shape[0]}')
umls_snomed_tui_filtered = umls_snomed_tui.drop(rows_to_remove)
print(f'Number of rows after removing rows: {umls_snomed_tui_filtered.shape[0]}')

## Saving

In [None]:
umls_snomed_tui_filtered.to_csv('04_ConceptDB/umls-dutch_v1.4.csv', index=False)
type_counts.to_csv('04_ConceptDB/tuis-umls-dutch_v1.4.csv', index = False, sep='\t')