# Dutch UMLS to concept table
This notebook describes how to convert a UMLS concept table containing Dutch terms into a formatted concept table for use in a tool such as MedCAT. In the second part of this notebook, we add drug names from Dutch SNOMED, because these concepts are not well represented in the Dutch UMLS source vocabularies. A large-scale automatic mapping from Dutch SNOMED to UMLS is not possible because of many-to-many mappings, as explained later in this notebook.

Requirements:
- MySQL database containing Dutch UMLS terms

For adding Dutch SNOMED drug names:
- Dutch SNOMED concept table, created in `dutch-snomed_to_concept-table.ipynb`
- MySQL database containing SNOMED-US, which is used for mapping SNOMED Dutch -> UMLS

In [None]:
# Set output version of the generated UMLS dutch concept table
UMLS_DUTCH_VERSION = 'v1.9'

# Set input version of SNOMED to append to UMLS terms
SNOMED_DUTCH_VERSION = 'v1.2'

In [None]:
from sqlalchemy import create_engine
from dotenv import load_dotenv
import pandas as pd
import numpy as np
import json
import re
import os

In [None]:
# Credentials to connect to UMLS MySQL database
load_dotenv()
user = os.getenv('MYSQL_USER')
password = os.getenv('MYSQL_PASSWORD')
host = os.getenv('MYSQL_HOST')
port = os.getenv('MYSQL_PORT')
database = os.getenv('MYSQL_DATABASE')

# Create the connection
connection_string = f'mysql://{user}:{password}@{host}:{port}/{database}'
connection = create_engine(connection_string)

In [None]:
# Retrieve Dutch UMLS concepts
query = """
SELECT cui, str, tty, sab FROM MRCONSO WHERE LAT = 'DUT'
"""
dutch_umls_original = pd.read_sql_query(query, con=connection)
dutch_umls_original.head()

## Term type in source
Some source-defined term types are not relevant for our use case. In the next part we will drop those. See https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html 

In [None]:
dutch_umls_original.tty.value_counts()

| TTY  | Description | Count | Example | Reference|
| - | - | - | - | - |
| PT | Designated preferred name| 111766 | harthypertrofie, Pancoast-syndroom ||
| LLT | Lower Level Term | 71603 | heupkombreuk, buikkramp| |
| LN | LOINC official fully specified name | 52313 | fencyclidine:massa/massa:moment:haar:kwantitatief | |
| MH | Main heading | 28657 | Dehydratie, Astma | |
| SY | Designated synonym | 11863 | Spanningshoofdpijn, Ziekte van Hodgkin | |
| OL | Non-current Lower Level Term| 9291 | acquired immunodeficiency syndrome, ankylose van gewricht, meerdere plaatsen | https://meddra.org/sites/default/files/page/documents_insert/meddra_-_terminologies_coding.pdf |
| HT | Hierarchical term | 3295 | calciummetabolismestoornissen, oculaire hemorragische aandoeningen | |
| LO | Obsolete official fully specified name | 1696 | promyelocyten/100 leukocyten:getalsfractie:mom... | |
| HG | High Level Group Term |  337| complicaties geassocieerd met medisch hulpmiddel, zuur-basestoornissen | |
| SMQ| Standardised MedDRA Query |  225| Leveraandoeningen (SMQ) , Tumormarkers (SMQ) | |
| CP | ICPC component process (in original form) |   38| Ander bloedonderzoek, Medicatie/recept/injectie | |
| OS | System-organ class |   27| Bloed- en lymfestelselaandoeningen, Infecties en parasitaire aandoeningen | |
| AB | Abbreviation in any source vocabulary |   27| Infec, Neopl, Ear, Endo | |

In [None]:
# Select a set of TTYs that seem most relevant for entity linking
tty_selection = ['PT', 'LLT', 'MH', 'SY']
dutch_umls = dutch_umls_original[dutch_umls_original.tty.isin(tty_selection)].copy()

# Keep only relevant columns
dutch_umls = dutch_umls[['cui', 'str', 'tty', 'sab']]

## Preferred names
Most, if not all, UMLS concepts have a preferred name in English. For other languages,
it can be difficult to select a preferred name, because each source vocabulary has one or
multiple preferred names for a concept. For example, ICPC2ICD10DUT only contains preferred name-type values.

It's not possible to keep the English UMLS preferred names, because MedCAT would add those names to the concept table for entity linking. Perhaps future functionality can be added for MedCAT to prevent taking these preferred names into account during entity linking.

### Solution 1: Use UMLS source vocabularies preferred names
For a rough but effective solution to get decent preferred names for the Dutch terms, change the terms that have the value "Designated preferred name" (PT) for the "Term Type in Source" (TTY) to MedCAT's preferred name value (P), and all others can be saved as (A). See https://github.com/CogStack/MedCAT/blob/master/examples/README.md

In [None]:
# dutch_umls.tty.replace({'PT': 'P',
#                                   'LLT': 'A',
#                                   'MH': 'A',
#                                   'SY': 'A'}, inplace=True)

### Solution 2: Use preferred names from Dutch SNOMED
In previous experiments we have shown that the Dutch vocabularies from UMLS and Dutch SNOMED complement each other. Dutch SNOMED, however, provides most of the names and contains excellent primary names. So we could use the preferred names from Dutch SNOMED and, for concepts not in that vocabulary, let MedCAT pick a random one.
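
To see how many concepts would be left without a preferred name under this approach, a check along the following lines could be run once the SNOMED names have been merged in. This is a minimal sketch on a toy table: the CUIs and names are placeholders, and only the `cui` and `tty` column names are taken from this notebook.

```python
import pandas as pd

# Toy table; CUIs and names are placeholders, not real UMLS content.
table = pd.DataFrame({
    'cui': ['CUI_1', 'CUI_1', 'CUI_2'],
    'str': ['naam a', 'naam a2', 'naam b'],
    'tty': ['P', 'A', 'A'],
})

# For every CUI, check whether at least one of its names is marked 'P'.
has_preferred = table.groupby('cui')['tty'].apply(lambda s: (s == 'P').any())

# CUIs for which no name is marked as preferred.
without_preferred = has_preferred[~has_preferred].index.tolist()
print(without_preferred)
```
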

In [None]:
# Drop tty column, put it back in just before merging with SNOMED
dutch_umls.drop(['tty'], axis=1, inplace=True)
dutch_umls.head()

## Clean values

In [None]:
dutch_umls[dutch_umls.cui == 'C0000833']

In [None]:
# Remove "NAO" ("Niet Anders Omschreven"), which is relevant for the source terminology but not for entity linking.
# See https://meddra.org/sites/default/files/guidance/file/intguide_15_0_dutch.pdf
# Raw strings avoid invalid-escape warnings for the regex patterns
dutch_umls['str'] = dutch_umls['str'].replace({r' NAO': '', r' \(NAO\)': '', r' nao': ''}, regex=True)
dutch_umls[dutch_umls.cui == 'C0000833']

In [None]:
def convert_title_to_lowercase(name):
    if name.split(' ')[0].istitle():
        return name.lower()
    else:
        return name

# Many ontologies start every name with an uppercase letter and treat it as a title.
# SNOMEDCT does not do this, so to prevent duplication, convert names whose first word is title-cased to lowercase.
# Converting all names to lowercase could lead to issues for names that are fully uppercase, such as ALS.
dutch_umls['str'] = dutch_umls['str'].apply(convert_title_to_lowercase)
dutch_umls[dutch_umls.cui == 'C0000833']
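
The effect of this rule is easiest to see on a few names. The sketch below repeats the function from the cell above and applies it to names taken from the TTY table and the comments in this notebook; it only illustrates the behaviour and is not an extra processing step.

```python
def convert_title_to_lowercase(name):
    # Same rule as in the notebook: lowercase the whole name
    # when its first space-separated token is title-cased.
    if name.split(' ')[0].istitle():
        return name.lower()
    return name

examples = ['Astma', 'ALS', 'Ziekte van Hodgkin', 'Pancoast-syndroom', 'harthypertrofie']
for name in examples:
    print(f'{name!r} -> {convert_title_to_lowercase(name)!r}')
```

Two side effects are worth knowing. The whole name is lowercased, so an eponym later in the name (`Hodgkin`) loses its capital as well. And `str.istitle()` is false for a token such as `Pancoast-syndroom`, because the part after the hyphen starts with a lowercase letter, so such names are left unchanged.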

In [None]:
# Drop duplicates
print(f'Records before dropping duplicates: {dutch_umls.shape[0]}')
dutch_umls = dutch_umls.drop_duplicates(subset=['cui', 'str', 'sab'], keep='first').reset_index(drop=True)
print(f'Records after dropping duplicates: {dutch_umls.shape[0]}')
dutch_umls[dutch_umls.cui == 'C0000833']

In [None]:
dutch_umls[dutch_umls.cui == 'C0002736']

## Merge rows from different vocabularies

In [None]:
# Merge SAB into single row
print(f'Records before merging rows: {dutch_umls.shape[0]}')
dutch_umls = dutch_umls.groupby(['cui','str'])['sab'].apply('|'.join).reset_index()
print(f'Records after merging rows: {dutch_umls.shape[0]}')
dutch_umls[dutch_umls.cui == 'C0000833']

In [None]:
# Add tty column with value 'A' to set these names as synonyms 
dutch_umls['tty'] = 'A'
dutch_umls.head(20)

# Add Dutch names from SNOMED
UMLS does not contain the Dutch SNOMEDCT, but it does contain the English SNOMEDCT_US. So through the English SNOMED concepts, we can map the Dutch SNOMED names to UMLS.

Dutch SNOMED names with SNOMED ID **->** Get English SNOMED ID to UMLS ID mapping **->** Map Dutch SNOMED names with SNOMED ID to UMLS ID
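
The chain above can also be written as a join. The following is a minimal sketch on toy data, not the implementation used in the cells below: all IDs and names are made-up placeholders, and only the column names (`cui`, `scui`, `str`) follow this notebook. Note that in the Dutch SNOMED table the `cui` column holds the SNOMED ID.

```python
import pandas as pd

# Toy stand-in for the (cui, scui) pairs selected from MRCONSO; IDs are placeholders.
snomed_us = pd.DataFrame({
    'cui': ['CUI_1', 'CUI_2', 'CUI_3'],
    'scui': ['1001', '1002', '1002'],  # SNOMED ID 1002 maps to two CUIs
})

# Toy stand-in for the Dutch SNOMED table, where 'cui' holds the SNOMED ID.
snomed_dutch = pd.DataFrame({
    'cui': ['1001', '1002', '1003'],
    'str': ['naam a', 'naam b', 'naam c'],
})

# Keep only SNOMED IDs that map to exactly one CUI.
cuis_per_scui = snomed_us.groupby('scui')['cui'].nunique()
unambiguous = cuis_per_scui[cuis_per_scui == 1].index
mapping = snomed_us[snomed_us['scui'].isin(unambiguous)].drop_duplicates()

# Join the Dutch names to their CUI through the SNOMED ID.
mapped = (snomed_dutch.rename(columns={'cui': 'scui'})
          .merge(mapping, on='scui', how='inner')[['cui', 'str']])
print(mapped)
```

In this toy example only `naam a` is mapped: `1002` is ambiguous and `1003` has no English counterpart.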

### Load SNOMED US

In [None]:
query = "SELECT distinct cui, scui FROM MRCONSO where sab = 'SNOMEDCT_US'"
snomed_us = pd.read_sql_query(query, con=connection)
snomed_us.scui = snomed_us.scui.astype(str)
print(f'SNOMED US terms with UMLS CUI: {snomed_us.shape[0]}')
snomed_us.head()

### Load SNOMED NL
We're using a cleaned and filtered list of Dutch SNOMED names; see `dutch-snomed_to_concept-table.ipynb` in this repository for how it is created.

In [None]:
snomed_dutch = pd.read_csv(f'04_ConceptDB/snomedct-dutch_{SNOMED_DUTCH_VERSION}.csv', dtype=str)
snomed_dutch.head()

In [None]:
snomed_dutch.shape

## Find ambiguous mapping

First find which SNOMED concepts can map to UMLS concepts. SNOMED concepts could map to multiple UMLS concepts.

In [None]:
# Create SNOMED - UMLS mapping
snomed_to_umls_mapping = snomed_us.groupby('scui')['cui'].apply(list).to_dict()
print(f'Number of SNOMED US IDs that map to at least 1 CUI: {len(snomed_to_umls_mapping)}')

In [None]:
# Check ambiguity of UMLS-SNOMED mapping
unambiguous_mapping_ids = set()
ambiguous_mapping_ids = set()
for snomed_id in snomed_to_umls_mapping:
    if len(snomed_to_umls_mapping[snomed_id]) == 1:
        unambiguous_mapping_ids.add(snomed_id)
    else:
        ambiguous_mapping_ids.add(snomed_id)
print(f'Number of SNOMED IDs that map to only 1 CUI: {len(unambiguous_mapping_ids)}')
print(f'Number of SNOMED IDs that map to multiple CUIs: {len(ambiguous_mapping_ids)}')

So 2073 SNOMED concepts map to multiple UMLS concepts. If we added the Dutch SNOMED names for these concepts, we would have to add them to every UMLS concept they map to. This introduces ambiguity, which leads to problems in our downstream named entity linking methods. Therefore we do not add names for these ambiguously mapping SNOMED concepts.

## Example of ambiguous mapping

In [None]:
# Find example
ambiguous_mapping_ids = [int(code) for code in ambiguous_mapping_ids]
ambiguous_mapping_ids.sort()
ambiguous_mapping_ids = [str(code) for code in ambiguous_mapping_ids]
ambiguous_mapping_ids[0:5]

In [None]:
query = "SELECT distinct cui, code, str FROM MRCONSO where sab = 'SNOMEDCT_US' and CODE = '216004'"
snomed_us_example = pd.read_sql_query(query, con=connection)
snomed_us_example.head()

In [None]:
snomed_dutch[snomed_dutch.cui == '216004']

In [None]:
dutch_umls[dutch_umls.cui.isin(['C0151836', 'C1704268'])]

So SNOMED US has four names for 216004: three of them map to C1704268 and one maps to C0151836. In SNOMED NL, there is only one name for this concept. We could map this name to both concepts, to a specific one, or ignore it.

- Mapping to both introduces ambiguity. This might have no effect on entity linking, because MedCAT's unsupervised training could resolve it, depending on the synonyms and their presence in the training corpus. In this example there is only one Dutch SNOMED term, but when a concept has multiple Dutch SNOMED terms, adding all of them to both UMLS concepts leads to many duplicates.
- Mapping to a single concept is the best option for an individual case, but doing so is time-consuming, can be quite difficult, and falls outside the scope and responsibility of this project. There are about 2000 of these SNOMED concepts. Because the SNOMED US names are already in UMLS, the ambiguous mappings originate from either the UMLS or the SNOMED team.
- Ignoring the name is the easiest option and avoids results that are difficult to interpret downstream. The drawback is that the name, which in this example is unique to SNOMED NL, will not be in the final Dutch UMLS table.
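
The difference between the first and the last option can be made concrete with a small sketch. The mapping and names below are hypothetical placeholders, and `map_names` is a helper introduced only for this illustration.

```python
# Hypothetical mapping: SNOMED ID -> list of UMLS CUIs (placeholder IDs).
snomed_to_cuis = {
    '1001': ['CUI_1'],
    '1002': ['CUI_2', 'CUI_3'],  # ambiguous
}
# Hypothetical Dutch names per SNOMED ID.
dutch_names = {
    '1001': ['naam a'],
    '1002': ['naam b', 'naam c'],
}

def map_names(strategy):
    """Return (cui, name) rows for strategy 'all' (map to every CUI) or 'skip'."""
    rows = []
    for snomed_id, names in dutch_names.items():
        cuis = snomed_to_cuis.get(snomed_id, [])
        if strategy == 'skip' and len(cuis) != 1:
            continue  # ignore names of ambiguously mapping SNOMED IDs
        for cui in cuis:
            for name in names:
                rows.append((cui, name))
    return rows

rows_all = map_names('all')
rows_skip = map_names('skip')
print(len(rows_all), len(rows_skip))
```

With the `all` strategy every name of `1002` appears under both CUIs, so the same string points to two concepts. With `skip`, which is the option this notebook follows, those names are left out entirely.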

In [None]:
# Another example of ambiguous SNOMED-CT -> UMLS mapping:
snomed_dutch[snomed_dutch.cui == '2776000']

In [None]:
# Show that a single SNOMED ID maps to multiple UMLS concepts
query = "SELECT distinct cui, code, str FROM MRCONSO where sab = 'SNOMEDCT_US' and CODE = '2776000'"
snomed_us_example = pd.read_sql_query(query, con=connection)
snomed_us_example.head()

## Merge SNOMED Dutch with UMLS Dutch

In [None]:
# Create dictionary of UMLS concepts that are in our existing Dutch name table
dutch_umls_ids=dutch_umls.groupby('cui')['str'].apply(list).to_dict()

# Create a set with all Dutch UMLS names in lowercase
dutch_umls_names_lowercase = set()
for cui in dutch_umls_ids:
    for value in dutch_umls_ids[cui]:
        dutch_umls_names_lowercase.add(value.lower())
        
# Also create a column with the lowercase names, which allows for easy comparison 
snomed_dutch['lowercase_str'] = snomed_dutch.str.str.lower()

In [None]:
snomed_names_to_add = list()
snomed_names_to_skip = list()

def map_dutch_snomed_to_umls(row):
    # Add the name only when its SNOMED ID maps to exactly one CUI
    snomed_id = row['cui']
    if snomed_id in unambiguous_mapping_ids:
        cui = snomed_to_umls_mapping[snomed_id][0]
        snomed_names_to_add.append([cui, row['str'], row['tty']])
    else:
        # Ambiguous mapping, or no mapping to UMLS at all
        snomed_names_to_skip.append([snomed_id, row['str'], row['tty']])

# Map Dutch SNOMED to UMLS
snomed_dutch.apply(map_dutch_snomed_to_umls, axis=1)

print(f'Number of Dutch names in existing UMLS table: {dutch_umls.shape[0]}')
print(f'Number of Dutch SNOMED names to add: {len(snomed_names_to_add)}')
print(f'Number of Dutch SNOMED names to skip: {len(snomed_names_to_skip)}')

### Format SNOMED names to add

In [None]:
# Format SNOMED names in pandas dataframe
snomed_names_with_cui = pd.DataFrame(snomed_names_to_add, columns = ['cui', 'str', 'tty'])
snomed_names_with_cui['sab'] = 'SNOMEDCT_NL'
snomed_names_with_cui.head()

### Remove duplicate SNOMED concepts
Multiple SNOMED concepts can map to a single UMLS concept.
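
A minimal sketch of why this produces duplicates, using placeholder IDs: when two SNOMED IDs carry the same Dutch name and map to the same CUI, the mapped table contains two identical rows once the SNOMED ID is replaced by the CUI.

```python
import pandas as pd

# Placeholder rows: two different SNOMED IDs share a Dutch name and
# map to the same (made-up) CUI.
rows = pd.DataFrame({
    'snomed_id': ['2001', '2002', '2003'],
    'cui': ['CUI_9', 'CUI_9', 'CUI_8'],
    'str': ['abces', 'abces', 'andere naam'],
    'tty': ['P', 'P', 'P'],
})

# After mapping, the SNOMED ID is no longer part of the table,
# so rows are compared on CUI, name and name status only.
deduplicated = (rows.drop(columns='snomed_id')
                .drop_duplicates(subset=['cui', 'str', 'tty'], keep='first')
                .reset_index(drop=True))
print(f'{len(rows)} rows before, {len(deduplicated)} rows after')
```
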

In [None]:
snomed_dutch[snomed_dutch.str == 'abces']

In [None]:
snomed_to_umls_mapping['44132006']

In [None]:
snomed_to_umls_mapping['128477000']

In [None]:
snomed_names_with_cui[snomed_names_with_cui.cui == 'C0000833']

In [None]:
print(f'Number of SNOMED names before dropping duplicates: {snomed_names_with_cui.shape[0]}')
snomed_names_with_cui = snomed_names_with_cui.drop_duplicates(subset=['cui', 'str', 'sab', 'tty'], keep='first').reset_index(drop=True)
print(f'Number of SNOMED names after dropping duplicates: {snomed_names_with_cui.shape[0]}')
snomed_names_with_cui[snomed_names_with_cui.cui == 'C0000833']

### Concatenate SNOMED names to UMLS table

In [None]:
# Add SNOMED names to UMLS
dutch_umls_snomed = pd.concat([dutch_umls, snomed_names_with_cui])
print(f'Number of Dutch names in UMLS + SNOMED table: {dutch_umls_snomed.shape[0]}')

In [None]:
# Grouping rows on SAB
dutch_umls_snomed = dutch_umls_snomed.groupby(['cui', 'str', 'tty'])['sab'].apply('|'.join).reset_index()
print(f'Number of names after merging rows on SAB: {dutch_umls_snomed.shape[0]}')

In [None]:
# Sort on CUI and TTY
dutch_umls_snomed.sort_values(by=['cui', 'tty', 'sab', 'str'], ascending=[True, False, True, True], inplace=True)
dutch_umls_snomed.reset_index(drop=True,inplace=True)
dutch_umls_snomed[0:20]

In [None]:
# View some names that are "P" in SNOMEDCT_NL and "A" in another SAB
dutch_umls_snomed[dutch_umls_snomed.duplicated(subset=['cui', 'str'], keep=False)].head(10)

## Remove problematic names


In [None]:
names_to_remove = ['Bij', # C0004923
                   'Bijen', # C0004923
                   'Haar', # C0018494
                   'bleek', # C0678215
                   'Weer', # C0043085
                   'Na+'] # C0337443
dutch_umls_snomed[dutch_umls_snomed.str.isin(names_to_remove)]

In [None]:
# Remove rows by index label
rows_to_remove = dutch_umls_snomed[dutch_umls_snomed.str.isin(names_to_remove)].index
print(f'Number of rows before removing rows: {dutch_umls_snomed.shape[0]}')
dutch_umls_snomed = dutch_umls_snomed.drop(rows_to_remove)
print(f'Number of rows after removing rows: {dutch_umls_snomed.shape[0]}')

## Add custom CUIs
Sometimes names or concepts are not captured in any of the Dutch terminologies. By looking up the English name for these concepts, we can add custom Dutch names using the real UMLS identifier.
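
The layout of `custom_concepts.csv` is not shown in this notebook. The sketch below assumes it uses the same columns as the table at this point (`cui`, `str`, `tty`, `sab`); the CUI, name and source label in it are placeholders, not real entries.

```python
import io
import pandas as pd

# Placeholder file content; CUI, name and source label are hypothetical.
csv_text = """cui,str,tty,sab
C0000000,voorbeeldnaam,A,CUSTOM
"""
custom_concepts = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Toy stand-in for the existing table.
existing = pd.DataFrame({
    'cui': ['C0000001'],
    'str': ['bestaande naam'],
    'tty': ['A'],
    'sab': ['SOURCE_X'],
})

# A column mismatch would silently create NaN-filled columns in the concat,
# so compare the column sets first.
columns_match = set(custom_concepts.columns) == set(existing.columns)
combined = pd.concat([existing, custom_concepts], ignore_index=True)
print(columns_match, combined.shape)
```
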

In [None]:
dutch_umls_snomed.head()

In [None]:
custom_concepts = pd.read_csv("custom_concepts.csv")
custom_concepts

In [None]:
print(f'Number of rows before adding rows: {dutch_umls_snomed.shape[0]}')
dutch_umls_snomed = pd.concat([dutch_umls_snomed, custom_concepts])
print(f'Number of rows after adding rows: {dutch_umls_snomed.shape[0]}')

## Add TUI (types)
UMLS concepts have one or multiple types. These types are kept in a separate table, `MRSTY`. See https://semanticnetwork.nlm.nih.gov/download/SemGroups.txt for all types.
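
Because a concept can have several types, joining the name table with `MRSTY` yields one row per name and type. A minimal sketch on toy data, with placeholder CUIs and names, shows this fan-out and how the rows can later be collapsed into a single `|`-separated value:

```python
import pandas as pd

# Toy name table and toy type table; CUIs, names and sources are placeholders.
names = pd.DataFrame({
    'cui': ['CUI_1', 'CUI_2'],
    'str': ['naam a', 'naam b'],
    'tty': ['P', 'A'],
    'sab': ['SRC', 'SRC'],
})
types = pd.DataFrame({
    'cui': ['CUI_1', 'CUI_1', 'CUI_2'],
    'tui': ['T047', 'T033', 'T121'],
})

# The left join repeats a name once for every type of its concept.
with_types = names.merge(types, how='left', on='cui')

# Collapse the types of each name back into one row.
merged = (with_types.groupby(['cui', 'str', 'tty', 'sab'])['tui']
          .apply('|'.join).reset_index())
print(with_types.shape[0], merged.shape[0])
```
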

In [None]:
# Load TUI table from MySQL
query = """
SELECT cui, tui FROM MRSTY
"""
tui_original = pd.read_sql_query(query, con=connection)
tui_original.head()

In [None]:
# Add TUI column to UMLS + SNOMED CUI table
dutch_umls_snomed = dutch_umls_snomed.merge(tui_original, how='left', on='cui')
dutch_umls_snomed.head(20)

## TUI Filtering
We could implement filtering of TUIs here. This depends on the domain and question of subsequent analysis. For SNOMED names there has been a separate filtering step based on type, which is done in the notebook that creates the SNOMED concept table.
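
Since the table has one row per name and type at this point, dropping rows by TUI removes a concept only if all of its types are excluded. The sketch below contrasts this row-level filter with a stricter concept-level alternative. The data are placeholders and the concept-level variant is shown for comparison only; it is not what this notebook does.

```python
import pandas as pd

# Toy table with one row per name and type; CUIs and names are placeholders.
table = pd.DataFrame({
    'cui': ['CUI_1', 'CUI_1', 'CUI_2'],
    'str': ['naam a', 'naam a', 'naam b'],
    'tui': ['T047', 'T168', 'T168'],
})
tuis_to_remove = ['T168']

# Row-level filter: drop only the rows that carry an excluded TUI.
row_level = table[~table['tui'].isin(tuis_to_remove)]

# Concept-level filter: drop every row of a CUI that has any excluded TUI.
excluded_cuis = table.loc[table['tui'].isin(tuis_to_remove), 'cui'].unique()
concept_level = table[~table['cui'].isin(excluded_cuis)]

print(len(row_level), len(concept_level))
```

In the toy data `CUI_1` survives the row-level filter through its other type, while the concept-level filter removes it.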

In [None]:
tuis_to_remove = [
    
    # Concepts & Ideas
    'T078', # Idea or Concept
    'T089', # Regulation or Law

    # Living beings
    'T011', # Amphibian
    'T008', # Animal
    'T012', # Bird
    'T013', # Fish
    'T015', # Mammal
    'T001', # Organism
    'T002', # Plant
    'T014', # Reptile
    'T010', # Vertebrate
    
    # Objects
    'T168', # Food
    
    # Organizations
    'T093', # Healthcare Related Organization
    
    # Geographic areas
    'T083', # Geographic Area
]
                  
dutch_umls_snomed[dutch_umls_snomed.tui.isin(tuis_to_remove)].head()

In [None]:
# Remove rows based on TUI, by index label
rows_to_remove = dutch_umls_snomed[dutch_umls_snomed.tui.isin(tuis_to_remove)].index
print(f'Number of rows before removing rows: {dutch_umls_snomed.shape[0]}')
dutch_umls_snomed = dutch_umls_snomed.drop(rows_to_remove)
print(f'Number of rows after removing rows: {dutch_umls_snomed.shape[0]}')

In [None]:
dutch_umls_snomed = dutch_umls_snomed.groupby(['cui', 'str', 'tty', 'sab'])['tui'].apply('|'.join).reset_index()
print(f'Number of rows after merging TUIs in single value: {dutch_umls_snomed.shape[0]}')

In [None]:
dutch_umls_snomed

### Update column names
In MedCAT v1.0 the column name specification has changed and is defined as in the [README.md in examples](https://github.com/CogStack/MedCAT/tree/master/examples).

In [None]:
dutch_umls_snomed.rename(columns={'str': 'name', 'tty': 'name_status', 'sab': 'ontologies', 'tui': 'type_ids'}, inplace=True)
dutch_umls_snomed.head()

## Add drug names
Only run the part below if you want to further expand the concept database with drug names; this adds around 270k lines. We do not lowercase drug names that start with a capital letter, since these can be brand names.

In [None]:
# In case you want to begin from here, load the existing concept table:
# dutch_umls_snomed = pd.read_csv(f"04_ConceptDB/umls-dutch_{UMLS_DUTCH_VERSION}.csv", dtype=str)

In [None]:
# Retrieve drug names from the ATC, DrugBank and RxNorm vocabularies in UMLS
query = """
SELECT distinct MRCONSO.cui, str as name, sab as ontologies 
FROM MRCONSO
WHERE SAB in ('ATC','DRUGBANK','RXNORM')
"""
drugs_original = pd.read_sql_query(query, con=connection)
drugs_original.head()

In [None]:
# Mark drug names as synonyms ('A'); the preferred name comes from another vocabulary
drugs_original['name_status'] = 'A'

In [None]:
# Merge drugs dataframe with umls_snomed dataframe
dutch_umls_snomed_drugs = pd.concat([dutch_umls_snomed, drugs_original], axis=0)

print("UMLS_snomed lines: ", len(dutch_umls_snomed))
print("Drugs lines: ", len(drugs_original))
print("Adds up to: ", len(dutch_umls_snomed_drugs))

dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.groupby(['cui', 'name', 'name_status'])['ontologies'].apply('|'.join).reset_index()
print("Records after merging ontologies in single value: ", len(dutch_umls_snomed_drugs))
dutch_umls_snomed_drugs

In [None]:
# Add TUIs again
dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.merge(tui_original, how='left', on='cui')
print(f'Number of rows containing TUIs: {dutch_umls_snomed_drugs.shape[0]}')

# Remove rows based on TUI. Recompute the index for this dataframe;
# the earlier rows_to_remove refers to a different table.
rows_to_remove = dutch_umls_snomed_drugs[dutch_umls_snomed_drugs.tui.isin(tuis_to_remove)].index
dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.drop(rows_to_remove)
print(f'Number of rows after filtering TUIs: {dutch_umls_snomed_drugs.shape[0]}')

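# Merge TUIs in single value and use the same column name as the table without drugs
dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.groupby(['cui', 'name', 'name_status', 'ontologies'])['tui'].apply('|'.join).reset_index()
dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.rename(columns={'tui': 'type_ids'})
print(f'Number of rows after merging TUIs in single value: {dutch_umls_snomed_drugs.shape[0]}')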

# Sort values
dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.sort_values(by=['cui', 'name_status', 'name', 'ontologies'], ascending=[True, False, True, True]).reset_index(drop=True)
dutch_umls_snomed_drugs.head(20)

## Saving

In [None]:
# Save final concept table
dutch_umls_snomed.to_csv(f'04_ConceptDB/umls-dutch_{UMLS_DUTCH_VERSION}.csv', index=False)

In [None]:
# Save final concept table with drugs
dutch_umls_snomed_drugs.to_csv(f'04_ConceptDB/umls-dutch_{UMLS_DUTCH_VERSION}_with_drugs.csv', index=False)