# Dutch UMLS to concept table
This notebook describes how to convert a UMLS concept table containing Dutch terms, to a formatted concept table to be used in a tool such as MedCAT. In the second part of this notebook, we add drug names from Dutch SNOMED, because these concepts are not well represented in the Dutch UMLS source vocabularies. A large scale automatic mapping from SNOMED Dutch to UMLS is not possible because of many-to-many mapping, explained in this notebook.

Requirements:
- MySQL database containing Dutch UMLS terms

For adding Dutch SNOMED drug names:
- Dutch SNOMED concept tablel, created in `dutch-snomed_to_concept-table.ipynb`
- MySQL database containing SNOMED-US, which is used for mapping SNOMED Dutch -> UMLS

In [1]:
from sqlalchemy import create_engine
from dotenv import load_dotenv
import pandas as pd
import numpy as np
import json
import re
import os

In [2]:
# Credentials to connect to UMLS MySQL database
load_dotenv()
user = os.getenv('MYSQL_USER')
password = os.getenv('MYSQL_PASSWORD')
host = os.getenv('MYSQL_HOST')
port = os.getenv('MYSQL_PORT')
database = os.getenv('MYSQL_DATABASE')

# Create the connection
connection_string = f'mysql://{user}:{password}@{host}:{port}/{database}'
connection = create_engine(connection_string)

In [3]:
# Retrieve Dutch UMLS concepts
query = """
SELECT cui, str, tty, sab, code FROM MRCONSO WHERE LAT = 'DUT'
"""
df_dutch_umls = pd.read_sql_query(query, con=connection)
df_dutch_umls.head()

Unnamed: 0,cui,str,tty,sab,code
0,C0030271,Pancoast-syndroom,PT,MDRDUT,10065249
1,C0238106,Clostridium difficile-colitis,PT,MDRDUT,10009657
2,C0851107,Epstein-Barr-virustest,PT,MDRDUT,10050681
3,C0035232,ademhalingsverlamming,PT,MDRDUT,10038708
4,C0400161,anale poliepectomie,PT,MDRDUT,10002169


## Term type in source
Some source-defined term types are not relevant for our use case. In the next part we will drop those. See https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html 

In [4]:
df_dutch_umls.tty.value_counts()

PT     112383
LLT     73284
LN      54641
MH      28618
SY      11859
HT       3296
HG        337
SMQ       228
CP         38
AB         27
OS         27
Name: tty, dtype: int64

| TTY  | Description | Count | Example | Reference|
| - | - | - | - | - |
| PT | Designated preferred name| 111766 | harthypertrofie, Pancoast-syndroom ||
| LLT | Lower Level Term | 71603 | heupkombreuk, buikkramp| |
| LN | LOINC official fully specified name | 52313 | fencyclidine:massa/massa:moment:haar:kwantitatief | |
| MH | Main heading | 28657 | Dehydratie, Astma | |
| SY | Designated synonym | 11863 | Spanningshoofdpijn, Ziekte van Hodgkin | |
| OL | Non-current Lower Level Term| 9291 | acquired immunodeficiency syndrome, ankylose van gewricht, meerdere plaatsen | https://meddra.org/sites/default/files/page/documents_insert/meddra_-_terminologies_coding.pdf |
| HT | Hierarchical term | 3295 | calciummetabolismestoornissen, oculaire hemorragische aandoeningen	 | |
| LO | Obsolete official fully specified name | 1696| promyelocyten/100 leukocyten:getalsfractie:mom...	| |
| HG | High Level Group Term |  337| complicaties geassocieerd met medisch hulpmiddel, zuur-basestoornissen | |
| SMQ| Standardised MedDRA Query |  225| Leveraandoeningen (SMQ) , Tumormarkers (SMQ) | |
| CP | ICPC component process (in original form) |   38| Ander bloedonderzoek, Medicatie/recept/injectie | |
| OS | System-organ class |   27| Bloed- en lymfestelselaandoeningen, Infecties en parasitaire aandoeningen | |
| AB | Abbreviation in any source vocabulary |   27| Infec, Neopl, Ear, Endo | |

In [5]:
# Select a set of TTYs that seem most relevant for entity linking
tty_selection = ['PT', 'LLT', 'MH', 'SY']
df_dutch_umls_subset = df_dutch_umls[df_dutch_umls.tty.isin(tty_selection)].copy()

# Keep only relevant columns
df_dutch_umls_subset = df_dutch_umls_subset[['cui', 'str', 'tty', 'sab']]

# Most of the terms in UMLS have the Metathesaurus preferred name in English. 
# For a rough but effective fix to get a good preferred name for the Dutch terms, 
# change the terms that have the value "Designated preferred name" (PT) for the 
# Term Type in Source (TTY) to the Metathesaurus preferred name (PN). All others
# can be saved as synonym.
df_dutch_umls_subset.tty.replace({'PT': 'PN',
                                  'LLT': 'SY',
                                  'MH': 'SY'}, inplace=True)

# Remove "NAO" ("Niet Anders Omschreven"), which is relevant for the source terminlogy but not for entity linking.
# See https://meddra.org/sites/default/files/guidance/file/intguide_15_0_dutch.pdf
df_dutch_umls_subset.str = df_dutch_umls_subset.str.replace({' NAO': '', ' \(NAO\)': '', ' nao': ''}, regex=True)

# Sort values
df_dutch_umls_subset.sort_values(by=['cui', 'tty', 'str', 'sab'], inplace=True)

# Drop duplicates, only keep the first entry (which is a PN because we sorted)
print(f'Records before dropping duplicates: {df_dutch_umls_subset.shape[0]}')
df_dutch_umls_subset = df_dutch_umls_subset.drop_duplicates(subset=['cui', 'str'], keep='first').reset_index(drop=True)
print(f'Records after dropping duplicates: {df_dutch_umls_subset.shape[0]}')

# Because dropped duplicates, only the first value in SAB is saved. Because we lost the information of other colums, rename the values to UMLS-dutch
df_dutch_umls_subset['sab'] = 'UMLS-dutch'
df_dutch_umls_subset.head(20)

Records before dropping duplicates: 226144
Records after dropping duplicates: 187795


Unnamed: 0,cui,str,tty,sab
0,C0000696,A-zenuwvezels,SY,UMLS-dutch
1,C0000715,Abattoir,SY,UMLS-dutch
2,C0000715,Abattoirs,SY,UMLS-dutch
3,C0000722,Abbreviated Injury Scale,SY,UMLS-dutch
4,C0000726,Abdomen,SY,UMLS-dutch
5,C0000726,Buik,SY,UMLS-dutch
6,C0000727,Acute buik,PN,UMLS-dutch
7,C0000727,abdomen; acute buik,PN,UMLS-dutch
8,C0000727,"abdominaal; syndroom, acuut",PN,UMLS-dutch
9,C0000727,acuut abdomen,PN,UMLS-dutch


# Add Dutch names from SNOMED
UMLS does not contain the Dutch SNOMED names, but it does contain the English (US) SNOMED terms. So through the English SNOMED concepts, we can map the Dutch SNOMED names to UMLS.

Dutch SNOMED names with SNOMED ID **->** Get English SNOMED ID to UMLS ID mapping **->** Map Dutch SNOMED names with SNOMED ID to UMLS ID

### Load SNOMED US

In [6]:
query = "SELECT distinct cui, scui FROM MRCONSO where sab = 'SNOMEDCT_US'"
df_snomed_us = pd.read_sql_query(query, con=connection)
df_snomed_us.scui = df_snomed_us.scui.astype(str)
print(f'SNOMED US terms with UMLS CUI: {df_snomed_us.shape[0]}')
df_snomed_us.head()

SNOMED US terms with UMLS CUI: 363794


Unnamed: 0,cui,scui
0,C0000052,58488005
1,C0000097,285407008
2,C0000102,13579002
3,C0000163,112116001
4,C0000167,46120009


### Load SNOMED NL
We're using a cleaned and filtered list of Dutch SNOMED names, see other notebook in this repository how this is created.

In [7]:
df_snomed_dutch = pd.read_csv('04_ConceptDB/snomedct-dutch_v1.1.csv', dtype=str)
df_snomed_dutch.head()

Unnamed: 0,cui,str,tty,tui,sab
0,104001,excisie van afwijkend weefsel van patella,PN,verrichting,SNOMED-CT-NL
1,104001,excisie van laesie van knieschijf,SY,verrichting,SNOMED-CT-NL
2,106004,structuur van posterieure carpale regio,PN,lichaamsstructuur,SNOMED-CT-NL
3,106004,posterieur gebied van handwortel,SY,lichaamsstructuur,SNOMED-CT-NL
4,106004,posterieur carpaal gebied,SY,lichaamsstructuur,SNOMED-CT-NL


In [8]:
df_snomed_dutch.shape

(493716, 5)

In [9]:
df_dutch_umls_subset.head()

Unnamed: 0,cui,str,tty,sab
0,C0000696,A-zenuwvezels,SY,UMLS-dutch
1,C0000715,Abattoir,SY,UMLS-dutch
2,C0000715,Abattoirs,SY,UMLS-dutch
3,C0000722,Abbreviated Injury Scale,SY,UMLS-dutch
4,C0000726,Abdomen,SY,UMLS-dutch


## Find ambiguous mapping

First find which SNOMED concepts can map to UMLS concepts. SNOMED concepts could map to multiple UMLS concepts.

In [10]:
# Create SNOMED - UMLS mapping
snomed_to_umls_mapping = df_snomed_us.groupby('scui')['cui'].apply(list).to_dict()
print(f'Number of SNOMED US IDs that map to at least 1 CUI: {len(snomed_to_umls_mapping)}')

Number of SNOMED US IDs that map to at least 1 CUI: 361461


In [11]:
# Check ambiguity of UMLS-SNOMED mapping
unambiguous_mapping_ids = set()
ambiguous_mapping_ids = set()
for snomed_id in snomed_to_umls_mapping:
    if len(snomed_to_umls_mapping[snomed_id]) == 1:
        unambiguous_mapping_ids.add(snomed_id)
    else:
        ambiguous_mapping_ids.add(snomed_id)
print(f'Number of SNOMED IDs that map to only 1 CUI: {len(unambiguous_mapping_ids)}')
print(f'Number of SNOMED IDs that map to multiple CUIs: {len(ambiguous_mapping_ids)}')

Number of SNOMED IDs that map to only 1 CUI: 359388
Number of SNOMED IDs that map to multiple CUIs: 2073


So 2073 SNOMED concepts map to multiple UMLS concepts. If we would add the Dutch names from SNOMED, we would have to add them to both UMLS concepts. This will introduce ambiguity, which will lead to problems in our downstream named entity linking methods. Therefor we'll all these ambiguous mapping SNOMED concepts.

## Example of ambiguous mapping

In [12]:
# Find example
ambiguous_mapping_ids = [int(code) for code in ambiguous_mapping_ids]
ambiguous_mapping_ids.sort()
ambiguous_mapping_ids = [str(code) for code in ambiguous_mapping_ids]
ambiguous_mapping_ids[0:5]

['115006', '216004', '289002', '344001', '489004']

In [13]:
query = "SELECT distinct cui, scui, str FROM MRCONSO where sab = 'SNOMEDCT_US' and CODE = '216004'"
df_snomed_us_example = pd.read_sql_query(query, con=connection)
df_snomed_us.head()

Unnamed: 0,cui,scui
0,C0000052,58488005
1,C0000097,285407008
2,C0000102,13579002
3,C0000163,112116001
4,C0000167,46120009


In [14]:
df_snomed_dutch[df_snomed_dutch.cui == '216004']

Unnamed: 0,cui,str,tty,tui,sab
122,216004,achtervolgingswaan,PN,bevinding,SNOMED-CT-NL


In [15]:
df_dutch_umls_subset[df_dutch_umls_subset.cui.isin(['C0151836', 'C1704268'])]

Unnamed: 0,cui,str,tty,sab
45691,C0151836,paranoïd; reactie,PN,UMLS-dutch
45692,C0151836,reactie; paranoïd,PN,UMLS-dutch
45693,C0151836,paranode reactie,SY,UMLS-dutch
45694,C0151836,reactie paranode,SY,UMLS-dutch
172676,C1704268,vervolgingswaan,PN,UMLS-dutch
172677,C1704268,waan van achtervolging,SY,UMLS-dutch


So SNOMED US has four names for 216004. Three of these names map to C1704268 and one maps to C0151836. In SNOMED NL, there is only one name for this concept. We could map this name to both concepts, to a specific one, or ignore it.

- Mapping to both will cause ambiguity. It might have no effect on entity linking, as it could be solved during MedCAT's unsupervised training, depending on the synonyms and their presence in the training corpus. In this example there is only 1 Dutch SNOMED term, but when there are multiple Dutch SNOMED terms, adding all to both terms, will lead to many duplicates.
- Mapping to a single one is the best option for a single example, but this is time consuming, not within the scope and responsibility of this project and can be quite difficult. There are about 2000 of these terms.
- Ignoring the name is the easiest option and will not lead to potential difficult downstream interpretation. The drawback is that the name, which in this example is unique to SNOMED NL, will not be in the final Dutch UMLS table.

## Merge SNOMED Dutch with UMLS Dutch

In [16]:
# Create dictionary of UMLS concepts that are in our existing Dutch name table
dutch_umls_ids=df_dutch_umls_subset.groupby('cui')['str'].apply(list).to_dict()

# Create a set with all Dutch UMLS names in lowercase
dutch_umls_names_lowercase = set()
for cui in dutch_umls_ids:
    for value in dutch_umls_ids[cui]:
        dutch_umls_names_lowercase.add(value.lower())
        
# Also create a column with all Dutch SNOMED names
df_snomed_dutch['lowercase_str'] = df_snomed_dutch.str.str.lower()

In [17]:
def map_dutch_snomed_to_umls(row):
    snomed_id = row['cui']
    if snomed_id in unambiguous_mapping_ids:
        cui = snomed_to_umls_mapping[snomed_id][0]
        
        # Check whether SNOMED name is a name in UMLS, under any CUI.
        # This is to prevent:
        # - Adding names for a concept that we already have.
        # - Introducing concepts that are already in our DB but map to a different CUI
        #   because of one-to-many SNOMED to UMLS mapping.
        if row['lowercase_str'] not in dutch_umls_names_lowercase:
            
            # Check if the term is new, or already exists and therefor always is a synonym.
            if cui in dutch_umls_ids:
                snomed_names_to_add.append([cui, row['str'], 'SY'])
            else:
                snomed_names_to_add.append([cui, row['str'], row['tty']])
        else:
            # For debugging purposes, track snomed names that are already in UMLS
            snomed_names_to_skip.append([cui, row['str'], row['tty']])
            
            # In the future, we might want to add SNOMED-NL to the SAB of this concept.
            # We'll need to be cautious for the case of a different concept that has the same name.
            
snomed_names_to_add = list()
snomed_names_to_skip = list()

# Apply function
df_snomed_dutch.apply(map_dutch_snomed_to_umls, axis = 1)

print(f'Number of Dutch names in existing UMLS table: {df_dutch_umls_subset.shape[0]}')
print(f'Number of Dutch SNOMED names to add: {len(snomed_names_to_add)}')
print(f'Number of Dutch SNOMED names to skip: {len(snomed_names_to_skip)}')

Number of Dutch names in existing UMLS table: 187795
Number of Dutch SNOMED names to add: 452457
Number of Dutch SNOMED names to skip: 13082


In [18]:
# Format SNOMED names in pandas dataframe
snomed_names_with_cui = pd.DataFrame(snomed_names_to_add, columns = ['cui', 'str', 'tty'])
snomed_names_with_cui['sab'] = 'SNOMEDCT-NL'
snomed_names_with_cui.head()

Unnamed: 0,cui,str,tty,sab
0,C0187893,excisie van afwijkend weefsel van patella,PN,SNOMEDCT-NL
1,C0187893,excisie van laesie van knieschijf,SY,SNOMEDCT-NL
2,C0230364,structuur van posterieure carpale regio,PN,SNOMEDCT-NL
3,C0230364,posterieur gebied van handwortel,SY,SNOMEDCT-NL
4,C0230364,posterieur carpaal gebied,SY,SNOMEDCT-NL


In [19]:
# Check which snomed names are skipped because they are already in UMLS
snomed_names_to_skip_df = pd.DataFrame(snomed_names_to_skip, columns = ['cui', 'str', 'tty'])
snomed_names_to_skip_df.head()

Unnamed: 0,cui,str,tty
0,C0030518,glandula parathyroidea,SY
1,C0030518,bijschildklier,SY
2,C0155825,chronische faryngitis,PN
3,C0153225,gonokokkenmeningitis,SY
4,C0000919,onhandigheid,SY


In [20]:
umls_snomed_merged = pd.concat([df_dutch_umls_subset, snomed_names_with_cui])
print(f'Number of Dutch names in UMLS + SNOMED table: {umls_snomed_merged.shape[0]}')

Number of Dutch names in UMLS + SNOMED table: 640252


In [21]:
umls_snomed_merged.head()

Unnamed: 0,cui,str,tty,sab
0,C0000696,A-zenuwvezels,SY,UMLS-dutch
1,C0000715,Abattoir,SY,UMLS-dutch
2,C0000715,Abattoirs,SY,UMLS-dutch
3,C0000722,Abbreviated Injury Scale,SY,UMLS-dutch
4,C0000726,Abdomen,SY,UMLS-dutch


In [22]:
# Sort on CUI and TTY
umls_snomed_merged.sort_values(by=['cui', 'tty', 'sab', 'str'], inplace=True)
umls_snomed_merged.reset_index(drop=True,inplace=True)

## Remove problematic names


In [23]:
names_to_remove = ['Bij', # C0004923
                   'Bijen', # C0004923
                   'Haar', # C0018494
                   'bleek', # C0678215
                   'Weer', # C0043085
                   'Na+'] # C0337443
umls_snomed_merged[umls_snomed_merged.str.isin(names_to_remove)]

Unnamed: 0,cui,str,tty,sab
4608,C0004923,Bij,SY,UMLS-dutch
4609,C0004923,Bijen,SY,UMLS-dutch
20427,C0018494,Haar,SY,UMLS-dutch
50782,C0043085,Weer,SY,UMLS-dutch
180706,C0337443,Na+,SY,UMLS-dutch
344412,C0678215,bleek,SY,UMLS-dutch


In [24]:
# Remove rows
rows_to_remove = umls_snomed_merged[umls_snomed_merged.str.isin(names_to_remove)].index
print(f'Number of rows before removing rows: {umls_snomed_merged.shape[0]}')
umls_snomed_filtered = umls_snomed_merged.drop(umls_snomed_merged.index[rows_to_remove])
print(f'Number of rows after removing rows: {umls_snomed_filtered.shape[0]}')

Number of rows before removing rows: 640252
Number of rows after removing rows: 640246


## Add custom CUIs
Sometimes names or concept are not captured in any of the Dutch terminologies. By looking up the English name for these concepts, we can add custom Dutch names using the real UMLS identifier.

In [25]:
umls_snomed_filtered.head()

Unnamed: 0,cui,str,tty,sab
0,C0000097,methyl-fenyltetrahydropyridine,PN,SNOMEDCT-NL
1,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",SY,SNOMEDCT-NL
2,C0000097,MPTP,SY,SNOMEDCT-NL
3,C0000215,"2,4,5-trichloorfenoxyazijnzuur",PN,SNOMEDCT-NL
4,C0000215,"2,4,5-T",SY,SNOMEDCT-NL


In [26]:
custom_concepts = pd.read_csv("custom_concepts.csv")
custom_concepts

Unnamed: 0,cui,str,tty,sab
0,C0456984,uitslag,SY,UMCU
1,C0019080,bloedt,SY,UMCU
2,C0019080,bloeden,SY,UMCU
3,C0225844,RA,SY,UMCU
4,C0225883,RV,SY,UMCU
5,C0225897,LV,SY,UMCU


In [27]:
print(f'Number of rows before adding rows: {umls_snomed_filtered.shape[0]}')
umls_snomed_custom = pd.concat([umls_snomed_filtered, custom_concepts])
print(f'Number of rows after adding rows: {umls_snomed_custom.shape[0]}')

Number of rows before adding rows: 640246
Number of rows after adding rows: 640252


## Add TUI (types)
UMLS concepts have one or multiple types. These types are kept in a separate table, `MRSTY`. See https://semanticnetwork.nlm.nih.gov/download/SemGroups.txt for all types.

In [28]:
# Load TUI table from MySQL
query = """
SELECT cui, tui, sty FROM MRSTY
"""
df_tui = pd.read_sql_query(query, con=connection)

In [29]:
# Add TUI column to previously created dataframe
umls_snomed_tui = umls_snomed_custom.merge(df_tui, how='left', on='cui')

# View some concepts that have multiple TUIs
umls_snomed_tui[umls_snomed_tui.duplicated(subset=['cui', 'str'], keep=False)].head(10)

Unnamed: 0,cui,str,tty,sab,tui,sty
0,C0000097,methyl-fenyltetrahydropyridine,PN,SNOMEDCT-NL,T131,Hazardous or Poisonous Substance
1,C0000097,methyl-fenyltetrahydropyridine,PN,SNOMEDCT-NL,T109,Organic Chemical
2,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",SY,SNOMEDCT-NL,T131,Hazardous or Poisonous Substance
3,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",SY,SNOMEDCT-NL,T109,Organic Chemical
4,C0000097,MPTP,SY,SNOMEDCT-NL,T131,Hazardous or Poisonous Substance
5,C0000097,MPTP,SY,SNOMEDCT-NL,T109,Organic Chemical
6,C0000215,"2,4,5-trichloorfenoxyazijnzuur",PN,SNOMEDCT-NL,T131,Hazardous or Poisonous Substance
7,C0000215,"2,4,5-trichloorfenoxyazijnzuur",PN,SNOMEDCT-NL,T109,Organic Chemical
8,C0000215,"2,4,5-T",SY,SNOMEDCT-NL,T131,Hazardous or Poisonous Substance
9,C0000215,"2,4,5-T",SY,SNOMEDCT-NL,T109,Organic Chemical


In [30]:
print(f'Number of unique TUIs in Dutch UMLS subset: {len(umls_snomed_tui.tui.unique())}')

Number of unique TUIs in Dutch UMLS subset: 125


In [31]:
# Create dataframe with counts per TUI name
type_counts = umls_snomed_tui.sty.value_counts().to_frame()
type_counts_tui = umls_snomed_tui.sty.value_counts().to_frame()
type_counts.reset_index(inplace=True)

# Add TUI code
type_counts_tui = umls_snomed_tui.tui.value_counts().to_frame()
tuis = type_counts_tui.index
type_counts['tui'] = tuis

# Format nicely
type_counts.columns = ['sty', 'count', 'tui']
type_counts = type_counts[['tui', 'sty', 'count']]
type_counts

Unnamed: 0,tui,sty,count
0,T047,Disease or Syndrome,138813
1,T061,Therapeutic or Preventive Procedure,77484
2,T033,Finding,71615
3,T023,"Body Part, Organ, or Organ Component",71479
4,T037,Injury or Poisoning,50119
...,...,...,...
120,T088,Language,4
121,T171,Molecular Sequence,4
122,T021,Fully Formed Anatomical Structure,3
123,T103,Chemical,2


## TUI Filtering
We could implement filtering of TUIs here. This depends on the domain and question of subsequent analysis. For SNOMED

In [32]:
tuis_to_remove = [
    
    # Concepts & Ideas
    'T078', # Idea or Concept
    'T089', # Regulation or Law

    # Living beings
    'T011', # Amphibian
    'T008', # Animal
    'T012', # Bird
    'T013', # Fish
    'T015', # Mammal
    'T001', # Organism
    'T001', # Plant
    'T014', # Reptile
    'T010', # Vertebrate
    
    # Objects
    'T168', # Food
    
    # Organizations
    'T093', # Healthcare Related Organization
    
    # Geographic areas
    'T083', #Geographic Aera
]
                  
                 ]
umls_snomed_tui[umls_snomed_tui.tui.isin(tuis_to_remove)].head()

Unnamed: 0,cui,str,tty,sab,tui,sty
2653,C0003057,"Dier, rechten van het",SY,UMLS-dutch,T078,Idea or Concept
4896,C0004942,Behaviorisme,SY,UMLS-dutch,T078,Idea or Concept
4926,C0004978,Weldadigheid,SY,UMLS-dutch,T078,Idea or Concept
5213,C0005488,"Kwesties, bioethische",SY,UMLS-dutch,T078,Idea or Concept
6578,C0006343,Boeddhisme,SY,UMLS-dutch,T078,Idea or Concept


In [33]:
# Remove rows based on TUI
rows_to_remove = umls_snomed_tui[umls_snomed_tui.tui.isin(tuis_to_remove)].index
print(f'Number of rows before removing rows: {umls_snomed_tui.shape[0]}')
umls_snomed_tui_filtered = umls_snomed_tui.drop(umls_snomed_tui.index[rows_to_remove])
print(f'Number of rows after removing rows: {umls_snomed_tui_filtered.shape[0]}')

Number of rows before removing rows: 647935
Number of rows after removing rows: 647619


## Column Names
In MedCAT v1.0 the column name specification has changed and is defined as in the [README.md in examples](https://github.com/CogStack/MedCAT/tree/master/examples).

In [34]:
umls_snomed_tui_filtered.rename(columns={'str': 'name', 'tty': 'name_status', 'sab': 'ontologies', 'tui': 'type_ids'}, inplace=True)
umls_snomed_tui_filtered.drop(['sty'], axis = 1, inplace=True)
umls_snomed_tui_filtered.head()

Unnamed: 0,cui,name,name_status,ontologies,type_ids
0,C0000097,methyl-fenyltetrahydropyridine,PN,SNOMEDCT-NL,T131
1,C0000097,methyl-fenyltetrahydropyridine,PN,SNOMEDCT-NL,T109
2,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",SY,SNOMEDCT-NL,T131
3,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",SY,SNOMEDCT-NL,T109
4,C0000097,MPTP,SY,SNOMEDCT-NL,T131


## Saving

In [35]:
# Save final concept table
umls_snomed_tui_filtered.to_csv('04_ConceptDB/umls-dutch_v1.7.csv', index=False)

# Save number of concepts per TUI
type_counts.to_csv('04_ConceptDB/tuis-umls-dutch_v1.7.csv', index = False, sep='\t')

## Expand Concept Database to include drug names
Only run this part below if you want to further expand the concept database with drug names, adds around 270k lines.

In [12]:
#In case you want to begin from here, load existing concept table:

#umls_snomed_tui_filtered = pd.read_csv(".../umls-dutch_v1.8.csv", dtype=str)

In [9]:
# Retrieve Dutch UMLS concepts
query = """
SELECT distinct MRCONSO.cui, str as name, sab as ontologies, tty as name_status, tui as type_ids
FROM MRCONSO
LEFT JOIN MRSTY ON MRSTY.cui = MRCONSO.cui
WHERE SAB in ('ATC','DRUGBANK','RXNORM')
"""
df_drugs = pd.read_sql_query(query, con=connection)
df_drugs.head()

Unnamed: 0,cui,name,ontologies,name_status,type_ids
0,C2348241,alcaftadine,ATC,IN,T121
1,C2348241,alcaftadine,ATC,IN,T109
2,C0039644,tetracycline,ATC,IN,T195
3,C0039644,tetracycline,ATC,IN,T109
4,C0039943,thioridazine,ATC,IN,T121


In [11]:
#Swap columns name_status and ontologies to match earlier generated dataframe
df_drugs=df_drugs.reindex(columns=['cui','name','name_status','ontologies','type_ids'])
df_drugs.head()

Unnamed: 0,cui,name,name_status,ontologies,type_ids
0,C2348241,alcaftadine,IN,ATC,T121
1,C2348241,alcaftadine,IN,ATC,T109
2,C0039644,tetracycline,IN,ATC,T195
3,C0039644,tetracycline,IN,ATC,T109
4,C0039943,thioridazine,IN,ATC,T121


In [16]:
#Merge drugs dataframe with umls_snomed dataframe
concept_drugs_expanded = pd.concat([umls_snomed_tui_filtered, df_drugs], axis=0)

print("UMLS_snomed lines: ", len(umls_snomed_tui_filtered))
print("Drugs lines: ", len(df_drugs))
print("Adds up to: ", len(concept_drugs_expanded))

UMLS_snomed lines:  644065
Drugs lines:  266483
Adds up to:  910548


In [19]:
#Sort again and reset index
concept_drugs_expanded.sort_values(by=['cui', 'name_status', 'ontologies', 'name'], inplace=True)
concept_drugs_expanded.reset_index(drop=True,inplace=True)
concept_drugs_expanded.head()

Unnamed: 0,cui,name,name_status,ontologies,type_ids
0,C0000039,"1,2-dipalmitoylphosphatidylcholine",IN,RXNORM,T121
1,C0000039,"1,2-dipalmitoylphosphatidylcholine",IN,RXNORM,T109
2,C0000097,methyl-fenyltetrahydropyridine,PN,SNOMEDCT-NL,T131
3,C0000097,methyl-fenyltetrahydropyridine,PN,SNOMEDCT-NL,T109
4,C0000097,"1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine",SY,SNOMEDCT-NL,T131


In [20]:
# Save final concept table
concept_drugs_expanded.to_csv('umls-dutch_v1.8_with_drugs.csv', index=False)