# Dutch UMLS to concept table
This notebook describes how to create a UMLS concept table containing Dutch names, to be used in a named entity recognition and linking tool such as MedCAT. In the second part of this notebook, names from Dutch SNOMED are added. In the third part, English drug names are added, because these concepts are not well represented in the Dutch UMLS source vocabularies. 

Mapping Dutch SNOMED to UMLS can be difficult because the mapping is many-to-many, as explained later in this notebook.

Requirements:
- UMLS MySQL database containing Dutch ontologies.

For adding Dutch SNOMED names:
- Dutch SNOMED concept table, created in `dutch-snomed_to_concept-table.ipynb`
- UMLS MySQL database containing SNOMED-US, which is used for mapping SNOMED Dutch -> UMLS

For adding English drug names:
- UMLS MySQL database containing English drug ontologies such as RXNORM, ATC and Drugbank

In [None]:
import json
import os
import re
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from IPython.display import display
from pathlib import Path
from sqlalchemy import create_engine
from utils import clean_name_status_column, convert_title_to_lowercase

pd.options.display.max_colwidth=400
pd.options.display.max_rows=200

# Set output version of the generated UMLS dutch concept table
UMLS_DUTCH_VERSION = 'v1.12.0'

# Set version of SNOMED to append to UMLS terms
snomed_dutch_file = Path('04_ConceptDB') / 'snomedct-dutch_v1.3.csv'

# Set custom names and types files
custom_names_file = Path("05_CustomConcepts") / "dutch-umls_custom_names.csv"
custom_types_file = Path("05_CustomConcepts") / "dutch-umls_custom_types.csv"
custom_name_status_file = Path("05_CustomConcepts") / "dutch-umls_custom_name_status.csv"

# Output files
output_file = Path("04_ConceptDB") / f'umls-dutch_{UMLS_DUTCH_VERSION}.csv'
output_file_with_drug_names = Path('04_ConceptDB') / f'umls-dutch_{UMLS_DUTCH_VERSION}_with_drugs.csv'

In [None]:
# Credentials to connect to UMLS MySQL database
load_dotenv()
user = os.getenv('MYSQL_USER')
password = os.getenv('MYSQL_PASSWORD')
host = os.getenv('MYSQL_HOST')
port = os.getenv('MYSQL_PORT')
database = os.getenv('MYSQL_DATABASE')

# Create the connection
connection_string = f'mysql://{user}:{password}@{host}:{port}/{database}'
connection = create_engine(connection_string)
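
A password that contains characters such as `@`, `:` or `/` can break a connection string built with plain string formatting. The sketch below shows one way to URL-escape the credentials with the standard library. The helper name `build_mysql_url` and the credentials are placeholders for illustration, not part of this notebook.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, database):
    """Build a MySQL connection URL with URL-escaped credentials (hypothetical helper)."""
    return f"mysql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"


# Placeholder credentials, not real values
example_url = build_mysql_url("umls_user", "p@ss:word/1", "localhost", "3306", "umls")
print(example_url)
```

The resulting string can be passed to `create_engine` in the same way as `connection_string` above.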

## Retrieve medical concepts

In [None]:
# Retrieve Dutch UMLS concepts
query = """
SELECT cui, str, tty, sab FROM MRCONSO WHERE LAT = 'DUT'
"""
dutch_umls_original = pd.read_sql_query(query, con=connection)
dutch_umls_original.head()

## Manual corrections
Apply some manual corrections. It is easiest to make these as close to the source as possible, so that they are processed correctly downstream.

In [None]:
# Correct Respiratory Failure
# Respiratory Insufficiency / C0035229 / 409623005 / https://uts.nlm.nih.gov/uts/umls/concept/C0035229
# Respiratory Failure / C1145670 / 409622000 / https://uts.nlm.nih.gov/uts/umls/concept/C1145670
display(dutch_umls_original.loc[dutch_umls_original.cui == 'C0035229'])
display(dutch_umls_original.loc[dutch_umls_original.cui == 'C1145670'])

In [None]:
# Reassign two Dutch names for respiratory insufficiency to C0035229
dutch_umls_original.loc[dutch_umls_original['str'] == 'Respiratoire insufficiëntie, niet gespecificeerd', 'cui'] = 'C0035229'
dutch_umls_original.loc[dutch_umls_original['str'] == 'Ademhalingsinsufficiëntie', 'cui'] = 'C0035229'
display(dutch_umls_original.loc[dutch_umls_original.cui == 'C0035229'])
display(dutch_umls_original.loc[dutch_umls_original.cui == 'C1145670'])

## Filter on terminology and type

In [None]:
dutch_umls_original.sab.unique()

In [None]:
# Assess terms per terminology
dutch_umls_original.loc[dutch_umls_original['sab'].isin(['LNC-NL-NL', 'ICPC2ICD10DUT'])].sample(5)

In [None]:
# 'LNC-NL-NL' and 'ICPC2ICD10DUT' names are not useful for named entity linking, so we exclude these
dutch_umls_sab_filtered = dutch_umls_original.loc[~dutch_umls_original['sab'].isin(['LNC-NL-NL', 'ICPC2ICD10DUT'])].copy()
dutch_umls_sab_filtered.sample(5)

In [None]:
# Assess terms per type
dutch_umls_sab_filtered.loc[dutch_umls_sab_filtered['tty']=='HT'].sample(5)

## Term type in source
Some source-defined term types are not relevant for our use case. In the next part we will drop those. See https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html 

In [None]:
dutch_umls_sab_filtered.tty.value_counts()

| TTY  | Description | Example |
| - | - | - |
| LLT | Lower Level Term | heupkombreuk, buikkramp|
| PT | Designated preferred name | harthypertrofie, Pancoast-syndroom |
| MH | Main heading | Dehydratie, Astma |
| SY | Designated synonym | Spanningshoofdpijn, Ziekte van Hodgkin |
| HT | Hierarchical term | calciummetabolismestoornissen, oculaire hemorragische aandoeningen |
| HG | High Level Group Term  | complicaties geassocieerd met medisch hulpmiddel, zuur-basestoornissen |
| SMQ| Standardised MedDRA Query | Leveraandoeningen (SMQ) , Tumormarkers (SMQ) |
| CP | ICPC component process (in original form) | Ander bloedonderzoek, Medicatie/recept/injectie |
| AB | Abbreviation in any source vocabulary | Infec, Neopl, Ear, Endo |
| OS | System-organ class | Bloed- en lymfestelselaandoeningen, Infecties en parasitaire aandoeningen |

In [None]:
# Select a set of TTYs that seem most relevant for named entity recognition
tty_selection = ['PT', 'LLT', 'MH', 'SY']
dutch_umls_tty_filtered = dutch_umls_sab_filtered.loc[dutch_umls_sab_filtered.tty.isin(tty_selection)].copy()

# Keep only relevant columns
dutch_umls_tty_filtered = dutch_umls_tty_filtered[['cui', 'str', 'tty', 'sab']]
dutch_umls_tty_filtered.sample(5)

## Preferred/pretty/primary names
For MedCAT and other named entity linking methods, it is useful to designate a single name as the preferred name, sometimes also called the primary name or pretty name. This name can be presented to the end user in web applications, and should therefore be the most descriptive and commonly used name. All other names (synonyms, abbreviations and common misspellings) are then considered synonyms. MedCAT uses 'P' as the value for a preferred name; all other names should have the value 'A' (see https://github.com/CogStack/MedCAT/blob/master/examples/README.md).
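
As a rough illustration of the 'P'/'A' convention, the sketch below checks a toy concept table for concepts with more than one preferred name. The rows and the helper `concepts_with_multiple_preferred` are made up for this example and are not taken from the actual concept table.

```python
from collections import Counter

# Toy rows: (cui, name, name_status); the names are illustrative only
rows = [
    ("C0030567", "ziekte van Parkinson", "P"),
    ("C0030567", "parkinson", "A"),
    ("C0002736", "amyotrofische laterale sclerose", "P"),
    ("C0002736", "ALS", "P"),
]


def concepts_with_multiple_preferred(rows):
    """Return the CUIs that have more than one name with status 'P'."""
    counts = Counter(cui for cui, _, status in rows if status == "P")
    return sorted(cui for cui, n in counts.items() if n > 1)


multiple_preferred = concepts_with_multiple_preferred(rows)
print(multiple_preferred)
```

A check like this can be run on the final table to see how many concepts still carry several preferred names.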

Most, if not all, UMLS concepts have a preferred name in English. For other languages,
it can be difficult to select a preferred name, because each source vocabulary has one or
multiple preferred names for a concept.

It's not possible to keep the English UMLS preferred names, because MedCAT would add those names to the concept table for entity linking. Perhaps future functionality can be added for MedCAT to prevent taking these preferred names into account during entity linking.

### Solution 1: Use UMLS source vocabularies preferred names
For a rough but effective solution to get decent preferred names for the Dutch terms, change the terms that have the value "Designated preferred name" (PT) for the "Term Type in Source" (TTY) to MedCAT's preferred name value (P), and all others can be saved as (A). This leads to many concepts having multiple preferred names.

In [None]:
# dutch_umls_tty_filtered.tty.replace({'PT': 'P',
#                                      'LLT': 'A',
#                                      'MH': 'A',
#                                      'SY': 'A'}, inplace=True)

### Solution 2: Use preferred names from Dutch SNOMED
In previous experiments we have shown that the Dutch vocabularies from UMLS and Dutch SNOMED complement each other. SNOMED provides most of the names, and contains excellent primary names. So we could use the preferred names from Dutch SNOMED, and for the terms not in that vocabulary, let MedCAT pick a random one.

In [None]:
# Drop tty column, put it back in just before merging with SNOMED
dutch_umls_tty_filtered.drop(['tty'], axis=1, inplace=True)

## Clean values
ICD10DUT and MDRDUT contain names that read more like definitions than like the way they would appear in text. For example, "colonkanker NAO" (see https://alt.meddra.org/files_acrobat/intguide_25_0_Dutch.pdf) and "Aandoening van ooglid, niet gespecificeerd". Names are more useful for named entity recognition when the descriptive part is removed.
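
The idea can be sketched with two regular expressions on plain strings. This is a stricter variant than the pandas code in this notebook: it only strips the markers at the end of a name and ignores case. The pattern names and the function `strip_descriptive_parts` are assumptions made for this example.

```python
import re

# "NAO" marker at the end of a name, with or without parentheses
NAO_MARKER = re.compile(r"\s+\(?NAO\)?$", flags=re.IGNORECASE)
# Trailing ", niet gespecificeerd" or "; niet-gespecificeerd"
UNSPECIFIED_SUFFIX = re.compile(r"[,;]\s*niet[- ]gespecificeerd$", flags=re.IGNORECASE)


def strip_descriptive_parts(name):
    """Remove a trailing NAO marker and a trailing 'niet gespecificeerd' suffix."""
    name = NAO_MARKER.sub("", name)
    name = UNSPECIFIED_SUFFIX.sub("", name)
    return name.strip()


examples = [
    "colonkanker NAO",
    "colonkanker (NAO)",
    "Aandoening van ooglid, niet gespecificeerd",
    "astma",
]
cleaned = [strip_descriptive_parts(name) for name in examples]
print(cleaned)
```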

In [None]:
dutch_umls_clean = dutch_umls_tty_filtered.copy()

# Remove ' nao', ' NAO' and ' (NAO)'
print(f"Number of terms containing ' NAO': {len(dutch_umls_clean.loc[dutch_umls_clean['str'].str.contains(' NAO')])}")
print(f"Number of terms containing ' nao': {len(dutch_umls_clean.loc[dutch_umls_clean['str'].str.contains(' nao')])}")
# Raw strings avoid invalid escape sequence warnings for the regex patterns
dutch_umls_clean['str'] = dutch_umls_clean['str'].replace({r' NAO': '', r' \(NAO\)': '', r' nao': ''}, regex=True)

In [None]:
print(f"Number of terms containing 'gespecificeerd': {len(dutch_umls_clean.loc[dutch_umls_clean['str'].str.contains('gespecificeerd')])}")

# Remove suffixes (applied in this order)
suffixes_to_remove = [
    ', niet-gespecificeerd',
    ', niet-gespecificeerd onderzoek',
    ', niet-gespecificeerde graad',
    ', niet-gespecificeerde oorzaak',
    ', niet-gespecificeerde plaats',
    ', niet-gespecificeerd type',
    ', niet-gespecificeerd deel',
    ', niet-gespecificeerd gebruik',
    ', niet-gespecificeerd naar behandelperiode',
    ', niet-gespecificeerd naar betrokkenheid',
    ', niet-gespecificeerde toestand',
    ', niet-gespecificeerd naar oorzaak',
    ', niet gespecificeerd',
    '; Niet gespecificeerd',
]
for suffix in suffixes_to_remove:
    dutch_umls_clean['str'] = dutch_umls_clean['str'].str.removesuffix(suffix)

# Remove prefixes (applied in this order)
prefixes_to_remove = [
    'niet-gespecificeerde ',
    'niet-gespecificeerd ',
    'Niet gespecificeerd ',
    'Niet gespecificeerde ',
]
for prefix in prefixes_to_remove:
    dutch_umls_clean['str'] = dutch_umls_clean['str'].str.removeprefix(prefix)

print(f"Number of terms containing 'gespecificeerd': {len(dutch_umls_clean.loc[dutch_umls_clean['str'].str.contains('gespecificeerd')])}")

There are many occurrences of 'gespecificeerd' left in names, but these are more difficult to clean. For now, they are unlikely to cause any issues (but probably also do not add much), so we keep them in our table.

In [None]:
# Convert title-formatted names to lowercase
dutch_umls_clean['str'] = dutch_umls_clean['str'].apply(convert_title_to_lowercase, split_char=' ')
dutch_umls_clean['str'] = dutch_umls_clean['str'].apply(convert_title_to_lowercase, split_char='-')

In [None]:
# Drop duplicates
print(f'Records before dropping duplicates: {dutch_umls_clean.shape[0]}')
dutch_umls_clean = dutch_umls_clean.drop_duplicates(subset=['cui', 'str', 'sab'], keep='first').reset_index(drop=True)
print(f'Records after dropping duplicates: {dutch_umls_clean.shape[0]}')

In [None]:
dutch_umls_clean[dutch_umls_clean.cui == 'C0002736']

## Merge rows from different vocabularies

In [None]:
dutch_umls = dutch_umls_clean.copy()

# Merge SAB into single row
print(f'Records before merging rows: {dutch_umls.shape[0]}')
dutch_umls = dutch_umls.groupby(['cui','str'])['sab'].apply('|'.join).reset_index()
print(f'Records after merging rows: {dutch_umls.shape[0]}')
dutch_umls[dutch_umls.cui == 'C0002736']

In [None]:
# Add tty column with value 'A' to set these names as synonyms 
dutch_umls['tty'] = 'A'
dutch_umls.head()

# Add Dutch names from SNOMED
UMLS does not contain the Dutch SNOMEDCT, but it does contain the English SNOMEDCT. So through the English SNOMED concepts, we can map the Dutch SNOMED names to UMLS.

Dutch SNOMED names with SNOMED ID **->** Get English SNOMED ID to UMLS ID mapping **->** Map Dutch SNOMED names with SNOMED ID to UMLS ID
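
The chain above can be sketched with plain dictionaries. All identifiers and names below are placeholders, not real SNOMED or UMLS content, and the sketch assumes that every SNOMED ID maps to exactly one CUI; the ambiguous case is handled later in this notebook.

```python
# Toy data: (UMLS CUI, SNOMED ID) pairs as they would come from SNOMEDCT_US rows
snomed_us_rows = [("C0000001", "1001"), ("C0000002", "1002")]
# Toy data: (SNOMED ID, Dutch name) pairs from the Dutch SNOMED table
snomed_dutch_rows = [
    ("1001", "voorbeeldnaam a"),
    ("1002", "voorbeeldnaam b"),
    ("1003", "voorbeeldnaam c"),
]

# Step 1: SNOMED ID -> UMLS CUI lookup
scui_to_cui = {scui: cui for cui, scui in snomed_us_rows}

# Step 2: attach a CUI to every Dutch name whose SNOMED ID is known in UMLS
mapped = [(scui_to_cui[scui], name) for scui, name in snomed_dutch_rows if scui in scui_to_cui]
unmapped = [(scui, name) for scui, name in snomed_dutch_rows if scui not in scui_to_cui]

print(mapped)
print(unmapped)
```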

### Load SNOMED US

In [None]:
query = "SELECT distinct cui, scui FROM MRCONSO where sab = 'SNOMEDCT_US'"
snomed_us = pd.read_sql_query(query, con=connection)
snomed_us.scui = snomed_us.scui.astype(str)
print(f'SNOMED US terms with UMLS CUI: {snomed_us.shape[0]}')
snomed_us.head()

### Load SNOMED NL
We're using a cleaned and filtered list of Dutch SNOMED names; see `dutch-snomed_to_concept-table.ipynb` in this repository for how it is created.

In [None]:
snomed_dutch = pd.read_csv(snomed_dutch_file, dtype='str')
snomed_dutch.head()

In [None]:
snomed_dutch.shape

## Find ambiguous mapping

First find which SNOMED concepts can map to UMLS concepts. SNOMED concepts can map to multiple UMLS concepts.

In [None]:
# Create SNOMED - UMLS mapping
snomed_to_umls_mapping = snomed_us.groupby('scui')['cui'].apply(list).to_dict()
print(f'Number of SNOMED concepts that map to at least 1 UMLS concept: {len(snomed_to_umls_mapping)}')

In [None]:
# Check ambiguity of UMLS-SNOMED mapping
unambiguous_mapping_ids = set()
ambiguous_mapping_ids = set()
for snomed_id in snomed_to_umls_mapping:
    if len(snomed_to_umls_mapping[snomed_id]) == 1:
        unambiguous_mapping_ids.add(snomed_id)
    else:
        ambiguous_mapping_ids.add(snomed_id)
print(f'Number of SNOMED concepts that map to only 1 concept: {len(unambiguous_mapping_ids)}')
print(f'Number of SNOMED concepts that map to multiple concepts: {len(ambiguous_mapping_ids)}')

So 2073 SNOMED concepts map to multiple UMLS concepts. If the Dutch names from these concepts are added to the UMLS concept table, it will introduce ambiguity, which could lead to problems in our downstream named entity linking methods. Therefore, for now, this method does not add names for these ambiguously mapping SNOMED concepts.

## Example of ambiguous mapping
This section illustrates the ambiguous mapping problem.

In [None]:
# Find example
# Sort the SNOMED IDs numerically while keeping them as strings
ambiguous_mapping_ids = sorted(ambiguous_mapping_ids, key=int)
ambiguous_mapping_ids[0:5]

In [None]:
query = "SELECT distinct cui, code, str FROM MRCONSO where sab = 'SNOMEDCT_US' and CODE = '216004'"
snomed_us_example = pd.read_sql_query(query, con=connection)
snomed_us_example.head()

In [None]:
snomed_dutch[snomed_dutch.cui == '216004']

In [None]:
dutch_umls[dutch_umls.cui.isin(['C0151836', 'C1704268'])]

So SNOMED US has four names for 216004. Three of these names map to C1704268 and one maps to C0151836. In SNOMED NL, there is only one name for this concept. In our current UMLS table, we have 2 names for each UMLS concept. We could map the SNOMED name to both concepts (1), to a specific one (2), or skip it (3):

1. Mapping to both will cause ambiguity. It could affect entity linking, or it could be resolved by MedCAT's disambiguation functionality based on unsupervised training, but that depends on the synonyms and their presence in the training corpus. In this example there is only 1 Dutch SNOMED term, but when there are multiple Dutch SNOMED terms, adding all of them to both concepts will lead to many duplicates.
2. Mapping to a single concept is the best option for a single example, but it requires manual curation, which is time-consuming, can be quite difficult and is not within the scope of this project. There are about 2000 of these terms. SNOMED US names are in UMLS, so those ambiguously mapping names are added by either the UMLS or SNOMED team. Perhaps in future versions, this is corrected at either UMLS or SNOMED level.
3. Skipping the name is the easiest option and avoids potentially difficult downstream interpretation. The drawback is that the name, which in this example is not in any other Dutch ontology, will not be in the Dutch UMLS concept table.

Currently, approach #3 is used.
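
The three options can be compared on the example above with a small sketch. The function `map_snomed_name`, the placeholder name and the curated choice are hypothetical and only serve to show how many rows each option would produce.

```python
def map_snomed_name(snomed_id, name, mapping, strategy="skip", curated=None):
    """Return (cui, name) rows for one Dutch SNOMED name under a given strategy."""
    cuis = mapping.get(snomed_id, [])
    if len(cuis) == 1:
        return [(cuis[0], name)]  # unambiguous mapping: always added
    if strategy == "all":
        return [(cui, name) for cui in cuis]  # option 1: add to every CUI
    if strategy == "curated" and curated and snomed_id in curated:
        return [(curated[snomed_id], name)]  # option 2: manually chosen CUI
    return []  # option 3: skip the name


# SNOMED ID 216004 maps to two UMLS concepts, as shown above
mapping = {"216004": ["C0151836", "C1704268"]}

rows_all = map_snomed_name("216004", "placeholder naam", mapping, strategy="all")
rows_curated = map_snomed_name(
    "216004", "placeholder naam", mapping,
    strategy="curated", curated={"216004": "C1704268"},  # hypothetical manual choice
)
rows_skip = map_snomed_name("216004", "placeholder naam", mapping, strategy="skip")
print(len(rows_all), len(rows_curated), len(rows_skip))
```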

In [None]:
# Another example of ambiguous SNOMED-CT -> UMLS mapping:
snomed_dutch[snomed_dutch.cui == '2776000']

In [None]:
# Show that a single SNOMED ID maps to multiple UMLS concepts
query = "SELECT distinct cui, code, str FROM MRCONSO where sab = 'SNOMEDCT_US' and CODE = '2776000'"
snomed_us_example = pd.read_sql_query(query, con=connection)
snomed_us_example.head()

## Merge SNOMED Dutch with UMLS Dutch

In [None]:
def map_dutch_snomed_to_umls(row):
    snomed_id = row['cui']
    if snomed_id in unambiguous_mapping_ids:
        cui = snomed_to_umls_mapping[snomed_id][0]
        snomed_names_to_add.append([cui, row['str'], row['tty']])
    else:
        snomed_names_to_skip.append([snomed_id, row['str'], row['tty']])
        
# Create lists to fill with SNOMED names and their UMLS CUIs
snomed_names_to_add = list()
snomed_names_to_skip = list()

# Map Dutch SNOMED to UMLS
snomed_dutch.apply(map_dutch_snomed_to_umls, axis = 1)

print(f'Number of Dutch names in existing UMLS table: {dutch_umls.shape[0]}')
print(f'Number of Dutch SNOMED names to add: {len(snomed_names_to_add)}')
print(f'Number of Dutch SNOMED names to skip: {len(snomed_names_to_skip)}')

In [None]:
# Format SNOMED names in pandas dataframe
snomed_names_with_cui = pd.DataFrame(snomed_names_to_add, columns = ['cui', 'str', 'tty'])
snomed_names_with_cui['sab'] = 'SNOMEDCT_NL'
snomed_names_with_cui.head()

### Remove duplicate SNOMED concepts
Earlier we discussed the problem of a SNOMED concept that maps to multiple UMLS concepts. A separate problem is SNOMED names that are ambiguous within SNOMED itself. Some of these, like "abces", are not ambiguous in UMLS, so mapping these concepts to UMLS resolves the ambiguity.
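
A minimal sketch of this effect, assuming (as the cells below suggest) that both SNOMED IDs for "abces" map to C0000833: after mapping, the two rows become identical and can be dropped as duplicates.

```python
# Assumed mapping for the two SNOMED concepts that share the name "abces"
scui_to_cui = {"44132006": "C0000833", "128477000": "C0000833"}
snomed_names = [("44132006", "abces"), ("128477000", "abces")]

# After mapping, both rows carry the same CUI and the same name
mapped = [(scui_to_cui[scui], name) for scui, name in snomed_names]

# Order-preserving removal of duplicates
deduplicated = list(dict.fromkeys(mapped))
print(deduplicated)
```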

In [None]:
snomed_dutch[snomed_dutch.str == 'abces']

In [None]:
snomed_to_umls_mapping['44132006']

In [None]:
snomed_to_umls_mapping['128477000']

In [None]:
snomed_names_with_cui[snomed_names_with_cui.cui == 'C0000833']

In [None]:
print(f'Number of SNOMED names before dropping duplicates: {snomed_names_with_cui.shape[0]}')
snomed_names_with_cui = snomed_names_with_cui.drop_duplicates(subset=['cui', 'str', 'sab', 'tty'], keep='first').reset_index(drop=True)
print(f'Number of SNOMED names: {snomed_names_with_cui.shape[0]}')
snomed_names_with_cui[snomed_names_with_cui.cui == 'C0000833']

### Clean SNOMED names
Some names from SNOMED are also in title format, such as names for ziekte van Parkinson. To prevent duplication, convert these names to lowercase.
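
The helper `convert_title_to_lowercase` is imported from `utils` and its implementation is not shown in this notebook. The sketch below is only a guess at what such a helper could do: it lowercases a name when it consists of several parts that are all title-cased. The function name `title_to_lowercase_sketch` and its exact rules are assumptions; the real helper may behave differently.

```python
def title_to_lowercase_sketch(name, split_char=" "):
    """Lowercase a name if it has multiple parts and every part is title-cased."""
    parts = [part for part in name.split(split_char) if part]
    if len(parts) > 1 and all(part.istitle() for part in parts):
        return name.lower()
    return name


print(title_to_lowercase_sketch("Amyotrofische Laterale Sclerose"))
print(title_to_lowercase_sketch("ziekte van Parkinson"))
print(title_to_lowercase_sketch("Spier-Atrofie", split_char="-"))
```

Requiring more than one part is a design choice of this sketch: it leaves single proper nouns and abbreviations untouched.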

In [None]:
display(snomed_names_with_cui[snomed_names_with_cui.cui.isin(['C0030567', 'C0002736'])])

In [None]:
snomed_names_with_cui['str'] = snomed_names_with_cui['str'].apply(convert_title_to_lowercase, split_char=' ')
snomed_names_with_cui['str'] = snomed_names_with_cui['str'].apply(convert_title_to_lowercase, split_char='-')

In [None]:
# Examples
display(snomed_names_with_cui[snomed_names_with_cui.cui.isin(['C0030567', 'C0002736'])])

### Concatenate SNOMED names to UMLS table

In [None]:
# Add SNOMED names to UMLS
dutch_umls_snomed = pd.concat([dutch_umls, snomed_names_with_cui])
print(f'Number of Dutch names in UMLS + SNOMED table: {dutch_umls_snomed.shape[0]}')

dutch_umls_snomed.sort_values(by=['cui', 'tty', 'sab', 'str'], ascending=[True, False, True, True], inplace=True)
dutch_umls_snomed.loc[dutch_umls_snomed.cui == 'C0030567']

In [None]:
# Grouping rows on SAB
dutch_umls_snomed = dutch_umls_snomed.groupby(['cui', 'str'], as_index=False).agg({'sab' : '|'.join, 'tty' : '|'.join}).copy()

# Clean tty column
dutch_umls_snomed.tty = dutch_umls_snomed.tty.apply(clean_name_status_column)
dutch_umls_snomed.sort_values(by=['cui', 'tty'], ascending=[True, False], inplace=True)
dutch_umls_snomed.reset_index(drop=True,inplace=True)
print(f'Number of names after merging rows on SAB: {dutch_umls_snomed.shape[0]}')

# Check example
dutch_umls_snomed.loc[dutch_umls_snomed.cui == 'C0030567']

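## Remove problematic names
Some names coincide with very common Dutch words, such as 'bij', 'haar' and 'weer', and would likely cause false positive matches in entity linking. These names are removed.
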
In [None]:
names_to_remove = ['bij', # C0004923
                   'bijen', # C0004923
                   'haar', # C0018494
                   'bleek', # C0678215
                   'weer', # C0043085
                   'na+'] # C0337443
dutch_umls_snomed[dutch_umls_snomed.str.isin(names_to_remove)]

In [None]:
# Remove rows (drop by index label, which also works when the index is not a default range)
rows_to_remove = dutch_umls_snomed[dutch_umls_snomed['str'].isin(names_to_remove)].index
print(f'Number of rows before removing rows: {dutch_umls_snomed.shape[0]}')
dutch_umls_snomed = dutch_umls_snomed.drop(rows_to_remove)
print(f'Number of rows after removing rows: {dutch_umls_snomed.shape[0]}')

## Add custom CUIs
Sometimes names or concepts are not captured in any of the Dutch terminologies. By looking up the English name for these concepts, we can add custom Dutch names using the real UMLS identifier.

In [None]:
dutch_umls_snomed.head()

In [None]:
custom_names = pd.read_csv(custom_names_file)
custom_names.head()

In [None]:
print(f'Number of rows before adding rows: {dutch_umls_snomed.shape[0]}')
dutch_umls_snomed = pd.concat([dutch_umls_snomed, custom_names])
print(f'Number of rows after adding rows: {dutch_umls_snomed.shape[0]}')

In [None]:
dutch_umls_snomed.head()

## Add TUI (types)
UMLS concepts have one or multiple types. These types are kept in a separate table, `MRSTY`. See https://semanticnetwork.nlm.nih.gov/download/SemGroups.txt for all types.
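
Because a concept can have several types, joining names with types produces one row per type, and these are later merged into a single pipe-separated value. The sketch below shows the end result on toy data in plain Python. The CUIs and names are placeholders, and the notebook itself performs this in two pandas steps (a merge followed by a group-by).

```python
from collections import defaultdict

# Toy MRSTY-like rows: (cui, tui)
mrsty_rows = [("C0000001", "T047"), ("C0000002", "T121"), ("C0000002", "T109")]
# Toy concept table rows: (cui, name)
names = [("C0000001", "voorbeeldnaam a"), ("C0000002", "voorbeeldnaam b")]

# Collect all TUIs per concept
tuis_per_cui = defaultdict(list)
for cui, tui in mrsty_rows:
    tuis_per_cui[cui].append(tui)

# Attach the TUIs to each name as a single pipe-separated value
table = [(cui, name, "|".join(tuis_per_cui[cui])) for cui, name in names]
print(table)
```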

In [None]:
# Load TUI table from MySQL
query = """
SELECT cui, tui FROM MRSTY
"""
umls_types = pd.read_sql_query(query, con=connection)
umls_types.head()

In [None]:
# Load custom types file
custom_types = pd.read_csv(custom_types_file)
custom_types.head()

In [None]:
concept_types = pd.concat([umls_types, custom_types])
concept_types.head(10)

In [None]:
# Add TUI column to UMLS + SNOMED CUI table
dutch_umls_snomed = dutch_umls_snomed.merge(concept_types, how='left', on='cui')
dutch_umls_snomed.head(20)

## TUI Filtering
Which types of concepts (TUIs) should be removed depends on the domain and the question of the subsequent analysis. After assessing the performance of named entity linking of these names on a few clinical documents, our team decided to remove the following TUIs.

SNOMED names are already filtered in a separate filtering step based on type, which is done in the notebook that creates the SNOMED concept table.
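
Note that the filter in the cells below works per row, and at that point there is one row per name and type. A concept with several types therefore only loses the rows for the removed types. The sketch below illustrates this on placeholder data and contrasts it with a stricter, hypothetical variant that drops a concept as soon as any of its types is in the removal list.

```python
types_to_drop = {"T168"}  # Food

# Toy rows after adding types: (cui, name, tui); values are placeholders
rows = [
    ("C0000001", "voorbeeldnaam a", "T168"),
    ("C0000002", "voorbeeldnaam b", "T168"),
    ("C0000002", "voorbeeldnaam b", "T109"),
]

# Row-wise filter: only rows with a removed type disappear
kept_rowwise = [row for row in rows if row[2] not in types_to_drop]

# Stricter variant (not used in this notebook): drop every concept that has a removed type
flagged_cuis = {cui for cui, _, tui in rows if tui in types_to_drop}
kept_strict = [row for row in rows if row[0] not in flagged_cuis]

print(kept_rowwise)
print(kept_strict)
```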

In [None]:
types_to_remove = [
    
    # Concepts & Ideas
    'T078', # Idea or Concept
    'T089', # Regulation or Law

    # Living beings
    'T011', # Amphibian
    'T008', # Animal
    'T012', # Bird
    'T013', # Fish
    'T015', # Mammal
    'T001', # Organism
    'T002', # Plant
    'T014', # Reptile
    'T010', # Vertebrate
    
    # Objects
    'T168', # Food
    
    # Organizations
    'T093', # Healthcare Related Organization
    
    # Geographic areas
    'T083', # Geographic Area
]
                  
dutch_umls_snomed[dutch_umls_snomed.tui.isin(types_to_remove)].head()

In [None]:
# Remove rows based on TUI (drop by index label)
rows_to_remove = dutch_umls_snomed[dutch_umls_snomed.tui.isin(types_to_remove)].index
print(f'Number of rows before removing rows: {dutch_umls_snomed.shape[0]}')
dutch_umls_snomed = dutch_umls_snomed.drop(rows_to_remove)
print(f'Number of rows after removing rows: {dutch_umls_snomed.shape[0]}')

In [None]:
# Check whether there are concepts without TUI.
# This can happen when adding custom concepts that originate from ontologies
# that are not in the UMLS subset generated with MetamorphoSys.
# For example: a concept from MeSH English is not in the generated UMLS subset of Dutch concepts,
# so its TUI is also not present in the UMLS subset, and therefore it is not in the UMLS MySQL database.
dutch_umls_snomed[dutch_umls_snomed.tui.isnull()]

In [None]:
dutch_umls_snomed = dutch_umls_snomed.groupby(['cui', 'str', 'tty', 'sab'])['tui'].apply('|'.join).reset_index()
print(f'Number of rows after merging TUIs in single value: {len(dutch_umls_snomed)}')

## Custom name status
To change the primary/preferred/pretty name, which is relevant for display purposes in downstream applications such as MedCAT Trainer and MedCAT Service, a list of name statuses to change is used.
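
The override can be sketched as a lookup on the (cui, name) pair. The rows below are placeholders; the column names `cui`, `str` and `tty` follow the columns used in the cell below. The example also shows that promoting a name to 'P' usually requires demoting the previous preferred name in the same file, otherwise the concept ends up with two preferred names.

```python
# Toy concept table rows
table = [
    {"cui": "C0000001", "str": "voorbeeldnaam a", "tty": "P"},
    {"cui": "C0000001", "str": "voorbeeldnaam b", "tty": "A"},
]

# Toy custom name status rows: promote name b and demote name a
overrides = [
    {"cui": "C0000001", "str": "voorbeeldnaam b", "tty": "P"},
    {"cui": "C0000001", "str": "voorbeeldnaam a", "tty": "A"},
]

# Apply each override to the row with the same CUI and name
status_lookup = {(row["cui"], row["str"]): row["tty"] for row in overrides}
for row in table:
    row["tty"] = status_lookup.get((row["cui"], row["str"]), row["tty"])

print([(row["str"], row["tty"]) for row in table])
```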

In [None]:
custom_name_status = pd.read_csv(custom_name_status_file, dtype='str')
for index, row in custom_name_status.iterrows():
    dutch_umls_snomed.loc[(dutch_umls_snomed.cui == row['cui']) & (dutch_umls_snomed.str == row['str']), 'tty'] = row['tty']

### Update column names
In MedCAT v1.0 the column name specification has changed and is defined in the [README.md in examples](https://github.com/CogStack/MedCAT/tree/master/examples).

In [None]:
dutch_umls_snomed.rename(columns={'str': 'name', 'tty': 'name_status', 'sab': 'ontologies', 'tui': 'type_ids'}, inplace=True)
dutch_umls_snomed.sort_values(by=['cui', 'name_status'], ascending=[True, False], inplace=True)
dutch_umls_snomed.head()

### Save output

In [None]:
# Print statistics
print(f'Unique concepts: {len(dutch_umls_snomed.cui.unique())}')
print(f'Unique names: {len(dutch_umls_snomed.name.unique())}')
print(f'Ambiguous names: {len(dutch_umls_snomed) - len(dutch_umls_snomed.name.unique())}')
print(f'Total names in concepts: {len(dutch_umls_snomed)}')

# Save final concept table
dutch_umls_snomed.to_csv(output_file, index=False)

## Add drug names
Only run this part below if you want to further expand the concept database with drug names, which adds around 270k lines. Many drugs only have an international name, or use the international name more often than the Dutch name, so adding these from ATC, Drugbank and RXNorm can be a good addition to the concept table. 

After assessing the resulting list it will be clear that many names will not be useful in named entity recognition, because they will probably never be used in natural language.

In [None]:
# Retrieve drug names from ATC, DrugBank and RxNorm
query = """
SELECT distinct MRCONSO.cui, str as name, sab as ontologies 
FROM MRCONSO
WHERE SAB in ('ATC','DRUGBANK','RXNORM')
"""
drugs_original = pd.read_sql_query(query, con=connection)
drugs_original.head()

In [None]:
# Convert title format to lowercase
drugs_original['name'] = drugs_original['name'].apply(convert_title_to_lowercase, split_char=' ')
drugs_original['name'] = drugs_original['name'].apply(convert_title_to_lowercase, split_char='-')

In [None]:
drugs_original[drugs_original.cui =='C0020740']

In [None]:
# Mark all drug names as synonyms ('A'), so that previously defined preferred names are kept
drugs_original['name_status'] = 'A'

In [None]:
# Merge drugs dataframe with umls_snomed dataframe
dutch_umls_snomed_drugs = pd.concat([dutch_umls_snomed, drugs_original], axis=0)

print("UMLS & SNOMED records: ", len(dutch_umls_snomed))
print("Drug name records: ", len(drugs_original))
print("Combined: ", len(dutch_umls_snomed_drugs))

dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.groupby(['cui', 'name', 'name_status'])['ontologies'].apply('|'.join).reset_index()
print("Records after merging ontologies in single value: ", len(dutch_umls_snomed_drugs))
dutch_umls_snomed_drugs.head()

In [None]:
# Add TUI column
dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.merge(concept_types, how='left', on='cui')
print(f'Number of rows containing TUIs: {dutch_umls_snomed_drugs.shape[0]}')

# Remove TUIs that we decided to filter (drop by index label)
rows_to_remove = dutch_umls_snomed_drugs[dutch_umls_snomed_drugs.tui.isin(types_to_remove)].index
dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.drop(rows_to_remove)
print(f'Number of rows after filtering TUIs: {dutch_umls_snomed_drugs.shape[0]}')

# Merge TUIs in single value
dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.groupby(['cui', 'name', 'name_status', 'ontologies'])['tui'].apply('|'.join).reset_index()
print(f'Number of rows after merging TUIs in single value: {dutch_umls_snomed_drugs.shape[0]}')

# Rename TUI column to type_ids
dutch_umls_snomed_drugs.rename(columns={'tui': 'type_ids'}, inplace=True)

In [None]:
dutch_umls_snomed_drugs.head()

In [None]:
# Grouping rows on SAB
print(f'Number of names before merging rows: {dutch_umls_snomed_drugs.shape[0]}')
dutch_umls_snomed_drugs = dutch_umls_snomed_drugs.groupby(['cui', 'name', 'type_ids'], as_index=False).agg({'ontologies' : '|'.join, 'name_status' : '|'.join}).copy()

# Clean name status column
dutch_umls_snomed_drugs.name_status = dutch_umls_snomed_drugs.name_status.apply(clean_name_status_column)
dutch_umls_snomed_drugs.sort_values(by=['cui', 'name_status'], ascending=[True, False], inplace=True)
dutch_umls_snomed_drugs.reset_index(drop=True,inplace=True)
print(f'Number of names after merging rows: {dutch_umls_snomed_drugs.shape[0]}')

In [None]:
# Check example
dutch_umls_snomed_drugs.loc[dutch_umls_snomed_drugs.cui == 'C0020740']

### Save output

In [None]:
# Print statistics
print(f'Unique concepts: {len(dutch_umls_snomed_drugs.cui.unique())}')
print(f'Unique names: {len(dutch_umls_snomed_drugs.name.unique())}')
print(f'Ambiguous names: {len(dutch_umls_snomed_drugs) - len(dutch_umls_snomed_drugs.name.unique())}')
print(f'Total names in concepts: {len(dutch_umls_snomed_drugs)}')

# Save final concept table with drugs
dutch_umls_snomed_drugs.to_csv(output_file_with_drug_names, index=False)