# Dutch UMLS to concept table
This notebook describes how to convert a UMLS concept table containing Dutch terms, to a formatted concept table to be used in a tool such as MedCAT. In the second part of this notebook, we add drug names from Dutch SNOMED, because these concepts are not well represented in the Dutch UMLS source vocabularies. A large scale automatic mapping from SNOMED Dutch to UMLS is not possible because of many-to-many mapping, explained in this notebook.

Requirements:
- MySQL database containing Dutch UMLS terms

For adding Dutch SNOMED drug names:
- Dutch SNOMED concept tablel, created in `dutch-snomed_to_concept-table.ipynb`
- MySQL database containing SNOMED-US, which is used for mapping SNOMED Dutch -> UMLS

In [1]:
# Set output version of the generated UMLS dutch concept table
UMLS_DUTCH_VERSION = 'v1.9'

# Set input version of SNOMED to append to UMLS terms
SNOMED_DUTCH_VERSION = 'v1.2'

In [2]:
from sqlalchemy import create_engine
from dotenv import load_dotenv
import pandas as pd
import numpy as np
import json
import re
import os

In [3]:
# Credentials to connect to UMLS MySQL database
load_dotenv()
user = os.getenv('MYSQL_USER')
password = os.getenv('MYSQL_PASSWORD')
host = os.getenv('MYSQL_HOST')
port = os.getenv('MYSQL_PORT')
database = os.getenv('MYSQL_DATABASE')

# Create the connection
connection_string = f'mysql://{user}:{password}@{host}:{port}/{database}'
connection = create_engine(connection_string)

In [4]:
# Retrieve Dutch UMLS concepts
query = """
SELECT cui, str, tty, sab FROM MRCONSO WHERE LAT = 'DUT'
"""
dutch_umls_original = pd.read_sql_query(query, con=connection)
dutch_umls_original.head()

Unnamed: 0,cui,str,tty,sab
0,C0030271,Pancoast-syndroom,PT,MDRDUT
1,C0238106,Clostridium difficile-colitis,PT,MDRDUT
2,C0851107,Epstein-Barr-virustest,PT,MDRDUT
3,C0035232,ademhalingsverlamming,PT,MDRDUT
4,C0400161,anale poliepectomie,PT,MDRDUT


## Term type in source
Some source-defined term types are not relevant for our use case. In the next part we will drop those. See https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html 

In [5]:
dutch_umls_original.tty.value_counts()

PT     112383
LLT     73284
LN      54641
MH      28618
SY      11859
HT       3296
HG        337
SMQ       228
CP         38
AB         27
OS         27
Name: tty, dtype: int64

| TTY  | Description | Count | Example | Reference|
| - | - | - | - | - |
| PT | Designated preferred name| 111766 | harthypertrofie, Pancoast-syndroom ||
| LLT | Lower Level Term | 71603 | heupkombreuk, buikkramp| |
| LN | LOINC official fully specified name | 52313 | fencyclidine:massa/massa:moment:haar:kwantitatief | |
| MH | Main heading | 28657 | Dehydratie, Astma | |
| SY | Designated synonym | 11863 | Spanningshoofdpijn, Ziekte van Hodgkin | |
| OL | Non-current Lower Level Term| 9291 | acquired immunodeficiency syndrome, ankylose van gewricht, meerdere plaatsen | https://meddra.org/sites/default/files/page/documents_insert/meddra_-_terminologies_coding.pdf |
| HT | Hierarchical term | 3295 | calciummetabolismestoornissen, oculaire hemorragische aandoeningen	 | |
| LO | Obsolete official fully specified name | 1696| promyelocyten/100 leukocyten:getalsfractie:mom...	| |
| HG | High Level Group Term |  337| complicaties geassocieerd met medisch hulpmiddel, zuur-basestoornissen | |
| SMQ| Standardised MedDRA Query |  225| Leveraandoeningen (SMQ) , Tumormarkers (SMQ) | |
| CP | ICPC component process (in original form) |   38| Ander bloedonderzoek, Medicatie/recept/injectie | |
| OS | System-organ class |   27| Bloed- en lymfestelselaandoeningen, Infecties en parasitaire aandoeningen | |
| AB | Abbreviation in any source vocabulary |   27| Infec, Neopl, Ear, Endo | |

In [6]:
# Select a set of TTYs that seem most relevant for entity linking
tty_selection = ['PT', 'LLT', 'MH', 'SY']
dutch_umls = dutch_umls_original[dutch_umls_original.tty.isin(tty_selection)].copy()

# Keep only relevant columns
dutch_umls = dutch_umls[['cui', 'str', 'tty', 'sab']]

## Preferred names
Most, if not all, of UMLS concepts have a preferred name in English. For other languages,
it can be difficult to select a preferred name, because each source vocabulary has one or
multiple preferred names for a concepts. For example, ICPC2ICD10DUT only contains preferred name-type values.

It's not possible to keep the English UMLS preferred names, because MedCAT would add those names to the concept table for entity linking. Perhaps future functionality can be added for MedCAT to prevent taking these preferred names into account during entity linking.

### Solution 1: Use UMLS source vocabularies preferred names
For a rough but effective solution to get decent preferred names for the Dutch terms, change the terms that have the value "Designated preferred name" (PT) for the "Term Type in Source" (TTY) to MedCAT's preferred name value (P), and all others can be saved as (A). See https://github.com/CogStack/MedCAT/blob/master/examples/README.md

In [7]:
# dutch_umls.tty.replace({'PT': 'P',
#                                   'LLT': 'A',
#                                   'MH': 'A',
#                                   'SY': 'A'}, inplace=True)

### Solution 2: Use preferred names from Dutch SNOMED
In previous experiments we have shown that the Dutch vocabularies from UMLS and Dutch SNOMED complement each other. SNOMED however, does provide most of the names, and contains excellent primary names. So we could use the preferred names from Dutch SNOMED, and for the terms not in that vocabulary, let MedCAT pick a random one.

In [8]:
# Drop tty column, put it back in just before merging with SNOMED
dutch_umls.drop(['tty'], axis=1, inplace=True)
dutch_umls.head()

Unnamed: 0,cui,str,sab
0,C0030271,Pancoast-syndroom,MDRDUT
1,C0238106,Clostridium difficile-colitis,MDRDUT
2,C0851107,Epstein-Barr-virustest,MDRDUT
3,C0035232,ademhalingsverlamming,MDRDUT
4,C0400161,anale poliepectomie,MDRDUT


## Clean values

In [9]:
dutch_umls[dutch_umls.cui == 'C0000833']

Unnamed: 0,cui,str,sab
14054,C0000833,abces,MDRDUT
57417,C0000833,abces NAO,MDRDUT
227951,C0000833,abces,MDRDUT
244546,C0000833,Abces,MSHDUT


In [10]:
# Remove "NAO" ("Niet Anders Omschreven"), which is relevant for the source terminlogy but not for entity linking.
# See https://meddra.org/sites/default/files/guidance/file/intguide_15_0_dutch.pdf
dutch_umls.str = dutch_umls.str.replace({' NAO': '', ' \(NAO\)': '', ' nao': ''}, regex=True)
dutch_umls[dutch_umls.cui == 'C0000833']

Unnamed: 0,cui,str,sab
14054,C0000833,abces,MDRDUT
57417,C0000833,abces,MDRDUT
227951,C0000833,abces,MDRDUT
244546,C0000833,Abces,MSHDUT


In [11]:
def convert_title_to_lowercase(name):
    if name.split(' ')[0].istitle():
        return name.lower()
    else:
        return name

# Many ontologies start all names with an uppercase and consider it a title. 
# SNOMEDCT does not do this, so to prevent duplication, convert all title-cased names to lowercase.
# Converting all names to lowercase could lead to issues for names that are in all uppercase, such as ALS.
dutch_umls['str'] = dutch_umls['str'].apply(convert_title_to_lowercase)
dutch_umls[dutch_umls.cui == 'C0000833']

Unnamed: 0,cui,str,sab
14054,C0000833,abces,MDRDUT
57417,C0000833,abces,MDRDUT
227951,C0000833,abces,MDRDUT
244546,C0000833,abces,MSHDUT


In [12]:
# Drop duplicates
print(f'Records before dropping duplicates: {dutch_umls.shape[0]}')
dutch_umls = dutch_umls.drop_duplicates(subset=['cui', 'str', 'sab'], keep='first').reset_index(drop=True)
dutch_umls[dutch_umls.cui == 'C0000833']

Records before dropping duplicates: 226144


Unnamed: 0,cui,str,sab
12837,C0000833,abces,MDRDUT
148944,C0000833,abces,MSHDUT


In [13]:
dutch_umls[dutch_umls.cui == 'C0002736']

Unnamed: 0,cui,str,sab
39978,C0002736,ALS,MDRDUT
46813,C0002736,amyotrofe laterale sclerose,MDRDUT
73212,C0002736,amyotrofie; laterale sclerose,ICPC2ICD10DUT
85245,C0002736,creeping; palsy,ICPC2ICD10DUT
104344,C0002736,laterale sclerose; amyotrofie,ICPC2ICD10DUT
116004,C0002736,palsy; creeping,ICPC2ICD10DUT
123802,C0002736,"sclerose; spinaal, lateraal (amyotrofisch)",ICPC2ICD10DUT
125733,C0002736,"spinaal; sclerose, lateraal (amyotrofisch)",ICPC2ICD10DUT
148785,C0002736,ALS,MSHDUT
149919,C0002736,amyotrofische laterale sclerose (als),MSHDUT


## Merge rows from different vocabularies

In [14]:
# Merge SAB into single row
print(f'Records before merging rows: {dutch_umls.shape[0]}')
dutch_umls = dutch_umls.groupby(['cui','str'])['sab'].apply('|'.join).reset_index()
print(f'Records after merging rows: {dutch_umls.shape[0]}')
dutch_umls[dutch_umls.cui == 'C0000833']

Records before merging rows: 188809
Records after merging rows: 184777


Unnamed: 0,cui,str,sab
211,C0000833,abces,MDRDUT|MSHDUT


In [15]:
# Add tty column with value 'A' to set these names as synonyms 
dutch_umls['tty'] = 'A'
dutch_umls.head(20)

Unnamed: 0,cui,str,sab,tty
0,C0000696,A-zenuwvezels,MSHDUT,A
1,C0000715,abattoir,MSHDUT,A
2,C0000715,abattoirs,MSHDUT,A
3,C0000722,abbreviated injury scale,MSHDUT,A
4,C0000726,abdomen,MSHDUT,A
5,C0000726,buik,MSHDUT,A
6,C0000727,abdomen; acute buik,ICPC2ICD10DUT,A
7,C0000727,"abdominaal syndroom, acuut",MDRDUT,A
8,C0000727,"abdominaal; syndroom, acuut",ICPC2ICD10DUT,A
9,C0000727,acute buik,ICD10DUT|MSHDUT,A


# Add Dutch names from SNOMED
UMLS does not contain the Dutch SNOMEDCT, but it does contain the English SNOMEDCT_US. So through the English SNOMED concepts, we can map the Dutch SNOMED names to UMLS.

Dutch SNOMED names with SNOMED ID **->** Get English SNOMED ID to UMLS ID mapping **->** Map Dutch SNOMED names with SNOMED ID to UMLS ID

### Load SNOMED US

In [16]:
query = "SELECT distinct cui, scui FROM MRCONSO where sab = 'SNOMEDCT_US'"
snomed_us = pd.read_sql_query(query, con=connection)
snomed_us.scui = snomed_us.scui.astype(str)
print(f'SNOMED US terms with UMLS CUI: {snomed_us.shape[0]}')
snomed_us.head()

SNOMED US terms with UMLS CUI: 363794


Unnamed: 0,cui,scui
0,C0000052,58488005
1,C0000097,285407008
2,C0000102,13579002
3,C0000163,112116001
4,C0000167,46120009


### Load SNOMED NL
We're using a cleaned and filtered list of Dutch SNOMED names, see other notebook in this repository how this is created.

In [17]:
snomed_dutch = pd.read_csv(f'04_ConceptDB/snomedct-dutch_{SNOMED_DUTCH_VERSION}.csv', dtype=str)
snomed_dutch.head()

Unnamed: 0,cui,str,tty,tui,sab
0,104001,excisie van laesie van knieschijf,A,verrichting,SNOMEDCT_NL
1,104001,excisie van afwijkend weefsel van patella,P,verrichting,SNOMEDCT_NL
2,106004,posterieur gebied van handwortel,A,lichaamsstructuur,SNOMEDCT_NL
3,106004,posterieur carpaal gebied,A,lichaamsstructuur,SNOMEDCT_NL
4,106004,structuur van posterieure carpale regio,P,lichaamsstructuur,SNOMEDCT_NL


In [18]:
snomed_dutch.shape

(506733, 5)

## Find ambiguous mapping

First find which SNOMED concepts can map to UMLS concepts. SNOMED concepts could map to multiple UMLS concepts.

In [19]:
# Create SNOMED - UMLS mapping
snomed_to_umls_mapping = snomed_us.groupby('scui')['cui'].apply(list).to_dict()
print(f'Number of SNOMED US IDs that map to at least 1 CUI: {len(snomed_to_umls_mapping)}')

Number of SNOMED US IDs that map to at least 1 CUI: 361461


In [20]:
# Check ambiguity of UMLS-SNOMED mapping
unambiguous_mapping_ids = set()
ambiguous_mapping_ids = set()
for snomed_id in snomed_to_umls_mapping:
    if len(snomed_to_umls_mapping[snomed_id]) == 1:
        unambiguous_mapping_ids.add(snomed_id)
    else:
        ambiguous_mapping_ids.add(snomed_id)
print(f'Number of SNOMED IDs that map to only 1 CUI: {len(unambiguous_mapping_ids)}')
print(f'Number of SNOMED IDs that map to multiple CUIs: {len(ambiguous_mapping_ids)}')

Number of SNOMED IDs that map to only 1 CUI: 359388
Number of SNOMED IDs that map to multiple CUIs: 2073


So 2073 SNOMED concepts map to multiple UMLS concepts. If we would add the Dutch names from SNOMED, we would have to add them to all UMLS concepts. This will introduce ambiguity, which will lead to problems in our downstream named entity linking methods. Therefor we don't add names for these ambiguously mapping SNOMED concepts.

## Example of ambiguous mapping

In [21]:
# Find example
ambiguous_mapping_ids = [int(code) for code in ambiguous_mapping_ids]
ambiguous_mapping_ids.sort()
ambiguous_mapping_ids = [str(code) for code in ambiguous_mapping_ids]
ambiguous_mapping_ids[0:5]

['115006', '216004', '289002', '344001', '489004']

In [22]:
query = "SELECT distinct cui, code, str FROM MRCONSO where sab = 'SNOMEDCT_US' and CODE = '216004'"
snomed_us_example = pd.read_sql_query(query, con=connection)
snomed_us_example.head()

Unnamed: 0,cui,code,str
0,C1704268,216004,Delusion of persecution
1,C1704268,216004,Persecutory delusion
2,C0151836,216004,Paranoid reaction
3,C1704268,216004,Delusion of persecution (finding)


In [23]:
snomed_dutch[snomed_dutch.cui == '216004']

Unnamed: 0,cui,str,tty,tui,sab
122,216004,achtervolgingswaan,P,bevinding,SNOMEDCT_NL


In [24]:
dutch_umls[dutch_umls.cui.isin(['C0151836', 'C1704268'])]

Unnamed: 0,cui,str,sab,tty
43521,C0151836,paranode reactie,MDRDUT,A
43522,C0151836,paranoïd; reactie,ICPC2ICD10DUT,A
43523,C0151836,reactie paranode,MDRDUT,A
43524,C0151836,reactie; paranoïd,ICPC2ICD10DUT,A
169708,C1704268,vervolgingswaan,MDRDUT,A
169709,C1704268,waan van achtervolging,MDRDUT,A


So SNOMED US has four names for 216004. Three of these names map to C1704268 and one maps to C0151836. In SNOMED NL, there is only one name for this concept. We could map this name to both concepts, to a specific one, or ignore it.

- Mapping to both will cause ambiguity. It might have no effect on entity linking, as it could be solved during MedCAT's unsupervised training, depending on the synonyms and their presence in the training corpus. In this example there is only 1 Dutch SNOMED term, but when there are multiple Dutch SNOMED terms, adding all to both terms, will lead to many duplicates.
- Mapping to a single one is the best option for a single example, but this is time consuming, not within the scope and responsibility of this project and can be quite difficult. There are about 2000 of these terms. SNOMED US names are in UMLS, so those ambiguously mapping names are are added by either the UMLS or SNOMED team.
- Ignoring the name is the easiest option and will not lead to potential difficult downstream interpretation. The drawback is that the name, which in this example is unique to SNOMED NL, will not be in the final Dutch UMLS table.

In [25]:
# Another example of ambiguous SNOMED-CT -> UMLS mapping:
snomed_dutch[snomed_dutch.cui == '2776000']

Unnamed: 0,cui,str,tty,tui,sab
2805,2776000,delirium,A,aandoening,SNOMEDCT_NL
2806,2776000,delier,P,aandoening,SNOMEDCT_NL


In [26]:
# Show that a single SNOMED ID maps to multiple UMLS concepts
query = "SELECT distinct cui, code, str FROM MRCONSO where sab = 'SNOMEDCT_US' and CODE = '2776000'"
snomed_us_example = pd.read_sql_query(query, con=connection)
snomed_us_example.head()

Unnamed: 0,cui,code,str
0,C0011206,2776000,Acute brain syndrome
1,C0029221,2776000,Organic brain syndrome
2,C0011206,2776000,Delirium
3,C1285577,2776000,Acute confusional state
4,C1306588,2776000,Acute organic reaction


## Merge SNOMED Dutch with UMLS Dutch

In [27]:
# Create dictionary of UMLS concepts that are in our existing Dutch name table
dutch_umls_ids=dutch_umls.groupby('cui')['str'].apply(list).to_dict()

# Create a set with all Dutch UMLS names in lowercase
dutch_umls_names_lowercase = set()
for cui in dutch_umls_ids:
    for value in dutch_umls_ids[cui]:
        dutch_umls_names_lowercase.add(value.lower())
        
# Also create a column with the lowercase names, which allows for easy comparison 
snomed_dutch['lowercase_str'] = snomed_dutch.str.str.lower()

In [28]:
def map_dutch_snomed_to_umls(row):
    snomed_id = row['cui']
    if snomed_id in unambiguous_mapping_ids:
        cui = snomed_to_umls_mapping[snomed_id][0]
        snomed_names_to_add.append([cui, row['str'], row['tty']])
snomed_names_to_add = list()
snomed_names_to_skip = list()

# Map Dutch SNOMED to UMLS
snomed_dutch.apply(map_dutch_snomed_to_umls, axis = 1)

print(f'Number of Dutch names in existing UMLS table: {dutch_umls.shape[0]}')
print(f'Number of Dutch SNOMED names to add: {len(snomed_names_to_add)}')
print(f'Number of Dutch SNOMED names to skip: {len(snomed_names_to_skip)}')

Number of Dutch names in existing UMLS table: 184777
Number of Dutch SNOMED names to add: 467352
Number of Dutch SNOMED names to skip: 0


### Example of skipped names

In [29]:
# Format SNOMED names in pandas dataframe
snomed_names_with_cui = pd.DataFrame(snomed_names_to_add, columns = ['cui', 'str', 'tty'])
snomed_names_with_cui['sab'] = 'SNOMEDCT_NL'
snomed_names_with_cui.head()

Unnamed: 0,cui,str,tty,sab
0,C0187893,excisie van laesie van knieschijf,A,SNOMEDCT_NL
1,C0187893,excisie van afwijkend weefsel van patella,P,SNOMEDCT_NL
2,C0230364,posterieur gebied van handwortel,A,SNOMEDCT_NL
3,C0230364,posterieur carpaal gebied,A,SNOMEDCT_NL
4,C0230364,structuur van posterieure carpale regio,P,SNOMEDCT_NL


### Remove duplicate SNOMED concepts
Multiple SNOMED concepts can map to a single UMLS concept.

In [30]:
snomed_dutch[snomed_dutch.str == 'abces']

Unnamed: 0,cui,str,tty,tui,sab,lowercase_str
47875,44132006,abces,P,afwijkende morfologie,SNOMEDCT_NL,abces
132978,128477000,abces,P,aandoening,SNOMEDCT_NL,abces


In [31]:
snomed_to_umls_mapping['44132006']

['C0000833']

In [32]:
snomed_to_umls_mapping['128477000']

['C0000833']

In [33]:
snomed_names_with_cui[snomed_names_with_cui.cui == 'C0000833']

Unnamed: 0,cui,str,tty,sab
45880,C0000833,abces,P,SNOMEDCT_NL
128226,C0000833,abces,P,SNOMEDCT_NL


In [34]:
print(f'Number of duplicate SNOMED names: {snomed_names_with_cui.shape[0]}')
snomed_names_with_cui = snomed_names_with_cui.drop_duplicates(subset=['cui', 'str', 'sab', 'tty'], keep='first').reset_index(drop=True)
print(f'Number of SNOMED names: {snomed_names_with_cui.shape[0]}')
snomed_names_with_cui[snomed_names_with_cui.cui == 'C0000833']

Number of duplicate SNOMED names: 467352
Number of SNOMED names: 466585


Unnamed: 0,cui,str,tty,sab
45877,C0000833,abces,P,SNOMEDCT_NL


### Concatenate SNOMED names to UMLS table

In [35]:
# Add SNOMED names to UMLS
dutch_umls_snomed = pd.concat([dutch_umls, snomed_names_with_cui])
print(f'Number of Dutch names in UMLS + SNOMED table: {dutch_umls_snomed.shape[0]}')

Number of Dutch names in UMLS + SNOMED table: 651362


In [36]:
# Grouping rows on SAB
dutch_umls_snomed = dutch_umls_snomed.groupby(['cui', 'str', 'tty'])['sab'].apply('|'.join).reset_index()
print(f'Number of names after merging rows on SAB: {dutch_umls_snomed.shape[0]}')

Number of names after merging rows on SAB: 647865


In [37]:
# Sort on CUI and TTY
dutch_umls_snomed.sort_values(by=['cui', 'tty', 'sab', 'str'], ascending=[True, False, True, True], inplace=True)
dutch_umls_snomed.reset_index(drop=True,inplace=True)
dutch_umls_snomed[0:20]

Unnamed: 0,cui,str,tty,sab
0,C0000097,methyl-fenyltetrahydropyridine,P,SNOMEDCT_NL
1,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",A,SNOMEDCT_NL
2,C0000097,MPTP,A,SNOMEDCT_NL
3,C0000215,"2,4,5-trichloorfenoxyazijnzuur",P,SNOMEDCT_NL
4,C0000215,"2,4,5-T",A,SNOMEDCT_NL
5,C0000220,"2,4-dichloorfenoxyazijnzuur",P,SNOMEDCT_NL
6,C0000294,mesna,P,SNOMEDCT_NL
7,C0000294,mercapto-ethaansulfonzuur,A,SNOMEDCT_NL
8,C0000294,natrium-2-mercapto-ethaansulfonaat,A,SNOMEDCT_NL
9,C0000473,para-aminobenzoëzuur,P,SNOMEDCT_NL


In [38]:
# View some concepts that have are "P" in SNOMEDCT_NL and "A" in another SB 
dutch_umls_snomed[dutch_umls_snomed.duplicated(subset=['cui', 'str'], keep=False)].head(10)

Unnamed: 0,cui,str,tty,sab
24,C0000727,acute buik,P,SNOMEDCT_NL
25,C0000727,acute buik,A,ICD10DUT|MSHDUT
77,C0000737,abdominale pijn,P,SNOMEDCT_NL
80,C0000737,abdominale pijn,A,MDRDUT
160,C0000774,abnormale secretie van gastrine,P,SNOMEDCT_NL
161,C0000774,abnormale secretie van gastrine,A,ICD10DUT
175,C0000786,miskraam,P,SNOMEDCT_NL
182,C0000786,miskraam,A,MDRDUT|MSHDUT
237,C0000820,therapeutische abortus,P,SNOMEDCT_NL
240,C0000820,therapeutische abortus,A,MDRDUT


## Remove problematic names


In [39]:
names_to_remove = ['Bij', # C0004923
                   'Bijen', # C0004923
                   'Haar', # C0018494
                   'bleek', # C0678215
                   'Weer', # C0043085
                   'Na+'] # C0337443
dutch_umls_snomed[dutch_umls_snomed.str.isin(names_to_remove)]

Unnamed: 0,cui,str,tty,sab
35917,C0030232,bleek,P,SNOMEDCT_NL
349584,C0678215,bleek,A,MDRDUT


In [40]:
# Remove rows
rows_to_remove = dutch_umls_snomed[dutch_umls_snomed.str.isin(names_to_remove)].index
print(f'Number of rows before removing rows: {dutch_umls_snomed.shape[0]}')
dutch_umls_snomed = dutch_umls_snomed.drop(dutch_umls_snomed.index[rows_to_remove])
print(f'Number of rows after removing rows: {dutch_umls_snomed.shape[0]}')

Number of rows before removing rows: 647865
Number of rows after removing rows: 647863


## Add custom CUIs
Sometimes names or concept are not captured in any of the Dutch terminologies. By looking up the English name for these concepts, we can add custom Dutch names using the real UMLS identifier.

In [41]:
dutch_umls_snomed.head()

Unnamed: 0,cui,str,tty,sab
0,C0000097,methyl-fenyltetrahydropyridine,P,SNOMEDCT_NL
1,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",A,SNOMEDCT_NL
2,C0000097,MPTP,A,SNOMEDCT_NL
3,C0000215,"2,4,5-trichloorfenoxyazijnzuur",P,SNOMEDCT_NL
4,C0000215,"2,4,5-T",A,SNOMEDCT_NL


In [42]:
custom_concepts = pd.read_csv("custom_concepts.csv")
custom_concepts

Unnamed: 0,cui,str,tty,sab
0,C0456984,uitslag,A,UMCU
1,C0019080,bloedt,A,UMCU
2,C0019080,bloeden,A,UMCU
3,C0225844,RA,A,UMCU
4,C0225883,RV,A,UMCU
5,C0225897,LV,A,UMCU
6,C0011206,delier,P,UMCU


In [43]:
print(f'Number of rows before adding rows: {dutch_umls_snomed.shape[0]}')
dutch_umls_snomed = pd.concat([dutch_umls_snomed, custom_concepts])
print(f'Number of rows after adding rows: {dutch_umls_snomed.shape[0]}')

Number of rows before adding rows: 647863
Number of rows after adding rows: 647870


## Add TUI (types)
UMLS concepts have one or multiple types. These types are kept in a separate table, `MRSTY`. See https://semanticnetwork.nlm.nih.gov/download/SemGroups.txt for all types.

In [44]:
# Load TUI table from MySQL
query = """
SELECT cui, tui FROM MRSTY
"""
tui_original = pd.read_sql_query(query, con=connection)
tui_original.head()

Unnamed: 0,cui,tui
0,C0684279,T104
1,C0684298,T104
2,C0684300,T104
3,C0684301,T104
4,C0175815,T104


In [45]:
# Add TUI column to UMLS + SNOMED CUI table
dutch_umls_snomed = dutch_umls_snomed.merge(tui_original, how='left', on='cui')
dutch_umls_snomed.head(20)

Unnamed: 0,cui,str,tty,sab,tui
0,C0000097,methyl-fenyltetrahydropyridine,P,SNOMEDCT_NL,T131
1,C0000097,methyl-fenyltetrahydropyridine,P,SNOMEDCT_NL,T109
2,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",A,SNOMEDCT_NL,T131
3,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",A,SNOMEDCT_NL,T109
4,C0000097,MPTP,A,SNOMEDCT_NL,T131
5,C0000097,MPTP,A,SNOMEDCT_NL,T109
6,C0000215,"2,4,5-trichloorfenoxyazijnzuur",P,SNOMEDCT_NL,T131
7,C0000215,"2,4,5-trichloorfenoxyazijnzuur",P,SNOMEDCT_NL,T109
8,C0000215,"2,4,5-T",A,SNOMEDCT_NL,T131
9,C0000215,"2,4,5-T",A,SNOMEDCT_NL,T109


## TUI Filtering
We could implement filtering of TUIs here. This depends on the domain and question of subsequent analysis. For SNOMED names there has been a seperate filtering step based on type, which is done in the notebook that creates the SNOMED concept table.

In [46]:
tuis_to_remove = [
    
    # Concepts & Ideas
    'T078', # Idea or Concept
    'T089', # Regulation or Law

    # Living beings
    'T011', # Amphibian
    'T008', # Animal
    'T012', # Bird
    'T013', # Fish
    'T015', # Mammal
    'T001', # Organism
    'T001', # Plant
    'T014', # Reptile
    'T010', # Vertebrate
    
    # Objects
    'T168', # Food
    
    # Organizations
    'T093', # Healthcare Related Organization
    
    # Geographic areas
    'T083', #Geographic Aera
]
                  
dutch_umls_snomed[dutch_umls_snomed.tui.isin(tuis_to_remove)].head()

Unnamed: 0,cui,str,tty,sab,tui
312,C0000863,abu dhabi,A,MSHDUT,T083
321,C0000872,academisch medisch centrum,A,MSHDUT,T093
323,C0000872,academische medische centra,A,MSHDUT,T093
325,C0000872,"medische centra, academische",A,MSHDUT,T093
552,C0001153,acomys,A,MSHDUT,T015


In [47]:
# Remove rows based on TUI
rows_to_remove = dutch_umls_snomed[dutch_umls_snomed.tui.isin(tuis_to_remove)].index
print(f'Number of rows before removing rows: {dutch_umls_snomed.shape[0]}')
dutch_umls_snomed = dutch_umls_snomed.drop(dutch_umls_snomed.index[rows_to_remove])
print(f'Number of rows after removing rows: {dutch_umls_snomed.shape[0]}')

Number of rows before removing rows: 655877
Number of rows after removing rows: 651884


In [48]:
dutch_umls_snomed = dutch_umls_snomed.groupby(['cui', 'str', 'tty', 'sab'])['tui'].apply('|'.join).reset_index()
print(f'Number of rows after merging TUIs in single value: {dutch_umls_snomed.shape[0]}')

Number of rows after merging TUIs in single value: 644121


In [49]:
dutch_umls_snomed

Unnamed: 0,cui,str,tty,sab,tui
0,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",A,SNOMEDCT_NL,T131|T109
1,C0000097,MPTP,A,SNOMEDCT_NL,T131|T109
2,C0000097,methyl-fenyltetrahydropyridine,P,SNOMEDCT_NL,T131|T109
3,C0000215,"2,4,5-T",A,SNOMEDCT_NL,T131|T109
4,C0000215,"2,4,5-trichloorfenoxyazijnzuur",P,SNOMEDCT_NL,T131|T109
...,...,...,...,...,...
644116,C5439603,tweede vaccinatie tegen SARS-CoV-2 met mRNA-va...,A,SNOMEDCT_NL,T061
644117,C5439604,overgevoeligheid voor SARS-CoV-2-mRNA-vaccin,A,SNOMEDCT_NL,T047
644118,C5439604,overgevoeligheid voor mRNA-vaccin tegen COVID-19,A,SNOMEDCT_NL,T047
644119,C5439604,overgevoeligheid voor vaccin met mRNA van SARS...,A,SNOMEDCT_NL,T047


### Update column names
In MedCAT v1.0 the column name specification has changed and is defined as in the [README.md in examples](https://github.com/CogStack/MedCAT/tree/master/examples).

In [50]:
dutch_umls_snomed.rename(columns={'str': 'name', 'tty': 'name_status', 'sab': 'ontologies', 'tui': 'type_ids'}, inplace=True)
dutch_umls_snomed.head()

Unnamed: 0,cui,name,name_status,ontologies,type_ids
0,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",A,SNOMEDCT_NL,T131|T109
1,C0000097,MPTP,A,SNOMEDCT_NL,T131|T109
2,C0000097,methyl-fenyltetrahydropyridine,P,SNOMEDCT_NL,T131|T109
3,C0000215,"2,4,5-T",A,SNOMEDCT_NL,T131|T109
4,C0000215,"2,4,5-trichloorfenoxyazijnzuur",P,SNOMEDCT_NL,T131|T109


## Add drug names
Only run this part below if you want to further expand the concept database with drug names, adds around 270k lines. We're not lowering drug names that start with a capital letter, since this can be a brand name.

In [51]:
#In case you want to begin from here, load existing concept table:
#dutch_umls_snomed = pd.read_csv("04_ConceptDB/umls-dutch_{UMLS_DUTCH_VERSION}.csv", dtype=str)

In [52]:
# Retrieve Dutch UMLS concepts
query = """
SELECT distinct MRCONSO.cui, str as name, sab as ontologies 
FROM MRCONSO
WHERE SAB in ('ATC','DRUGBANK','RXNORM')
"""
drugs_original = pd.read_sql_query(query, con=connection)
drugs_original.head()

Unnamed: 0,cui,name,ontologies
0,C2348241,alcaftadine,ATC
1,C0039644,tetracycline,ATC
2,C0039943,thioridazine,ATC
3,C0040879,triazolam,ATC
4,C0041037,trimetazidine,ATC


In [53]:
# Get the name got a preferred name from another vocab
drugs_original['name_status'] = 'A'

In [54]:
# Merge drugs dataframe with umls_snomed dataframe
dutch_umls_snomed_drugs = pd.concat([dutch_umls_snomed, drugs_original], axis=0)

print("UMLS_snomed lines: ", len(dutch_umls_snomed))
print("Drugs lines: ", len(drugs_original))
print("Adds up to: ", len(dutch_umls_snomed_drugs))

dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.groupby(['cui', 'name', 'name_status'])['ontologies'].apply('|'.join).reset_index()
print("Records after merging ontologies in single value: ", len(dutch_umls_snomed_drugs))
dutch_umls_snomed_drugs

UMLS_snomed lines:  644121
Drugs lines:  183214
Adds up to:  827335
Records after merging ontologies in single value:  824193


Unnamed: 0,cui,name,name_status,ontologies
0,C0000039,"1,2-dipalmitoylphosphatidylcholine",A,RXNORM
1,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",A,SNOMEDCT_NL
2,C0000097,MPTP,A,SNOMEDCT_NL
3,C0000097,methyl-fenyltetrahydropyridine,P,SNOMEDCT_NL
4,C0000215,"2,4,5-T",A,SNOMEDCT_NL
...,...,...,...,...
824188,C5440848,tofacitinib Oral Liquid Product,A,RXNORM
824189,C5440849,umbralisib Oral Product,A,RXNORM
824190,C5440850,umbralisib Pill,A,RXNORM
824191,C5440851,evinacumab Injectable Product,A,RXNORM


In [55]:
# Add TUIs again
dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.merge(tui_original, how='left', on='cui')
print(f'Number of rows containing TUIs: {dutch_umls_snomed_drugs.shape[0]}')

# Remove TUIs
dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.drop(dutch_umls_snomed_drugs.index[rows_to_remove])
print(f'Number of rows filtering TUIs: {dutch_umls_snomed_drugs.shape[0]}')

# Merge TUIs in single value
dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.groupby(['cui', 'name', 'name_status', 'ontologies'])['tui'].apply('|'.join).reset_index()
print(f'Number of rows after merging TUIs in single value: {dutch_umls_snomed_drugs.shape[0]}')

# Rename TUI column to type_ids
dutch_umls_snomed_drugs.rename(columns={'tui': 'type_ids'}, inplace=True)

# Sort values
dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.sort_values(by=['cui', 'name_status', 'name', 'ontologies'], ascending=[True, False, True, True]).reset_index(drop=True)
dutch_umls_snomed_drugs.head(20)

Number of rows containing TUIs: 885598
Number of rows filtering TUIs: 881605
Number of rows after merging TUIs in single value: 820856


Unnamed: 0,cui,name,name_status,ontologies,tui
0,C0000039,"1,2-dipalmitoylphosphatidylcholine",A,RXNORM,T121|T109
1,C0000097,methyl-fenyltetrahydropyridine,P,SNOMEDCT_NL,T131|T109
2,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",A,SNOMEDCT_NL,T131|T109
3,C0000097,MPTP,A,SNOMEDCT_NL,T131|T109
4,C0000215,"2,4,5-trichloorfenoxyazijnzuur",P,SNOMEDCT_NL,T131|T109
5,C0000215,"2,4,5-T",A,SNOMEDCT_NL,T131|T109
6,C0000220,"2,4-dichloorfenoxyazijnzuur",P,SNOMEDCT_NL,T131|T109
7,C0000266,Parlodel,A,RXNORM,T121|T109
8,C0000294,mesna,P,SNOMEDCT_NL,T121|T109
9,C0000294,mercapto-ethaansulfonzuur,A,SNOMEDCT_NL,T121|T109


## Saving

In [56]:
# Save final concept table
dutch_umls_snomed.to_csv(f'04_ConceptDB/umls-dutch_{UMLS_DUTCH_VERSION}.csv', index=False)

In [57]:
# Save final concept table with drugs
dutch_umls_snomed_drugs.to_csv(f'04_ConceptDB/umls-dutch_{UMLS_DUTCH_VERSION}_with_drugs.csv', index=False)