# I. Installing the Unified Medical Language System (UMLS)


In [1]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.insert(0,'../../trove')


## A. Download the UMLS Release Files

Trove requires access to the [Unified Medical Language System (UMLS)](https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html), which is freely available after signing up for an account with the National Library of Medicine (NLM).

Download the latest "UMLS Metathesaurus Files" release, [2020AB](https://download.nlm.nih.gov/umls/kss/2020AB/umls-2020AB-metathesaurus.zip). This file is quite large (5.3 GB compressed), so it may take some time to download. **Please note that "full" release zip files are currently not supported.**

Alternatively, if you have an existing API key, you can use the following script to download the zip file from the command line. See https://documentation.uts.nlm.nih.gov/automating-downloads.html for technical details on NLM authentication.

    python download_umls.py \
       --apikey XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX \
       --url https://download.nlm.nih.gov/umls/kss/2020AB/umls-2020AB-metathesaurus.zip
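
Before moving on, it can be worth confirming that the downloaded file is a readable zip archive, since an interrupted or unauthenticated download may leave something else on disk. The following is a minimal sketch using only the Python standard library; `summarize_zip` is a hypothetical helper written for this example and is not part of Trove.

```python
import zipfile
from pathlib import Path


def summarize_zip(path):
    """Return (member_count, total_uncompressed_bytes) for a zip archive.

    Raises FileNotFoundError if the file is missing and ValueError if it is
    not a readable zip archive.
    """
    path = Path(path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    if not zipfile.is_zipfile(path):
        raise ValueError(f"Not a valid zip archive: {path}")
    with zipfile.ZipFile(path) as zf:
        infos = zf.infolist()
        return len(infos), sum(info.file_size for info in infos)


# File name taken from the download URL above; adjust to your local path.
zip_path = Path("umls-2020AB-metathesaurus.zip")
if zip_path.is_file():
    n_members, n_bytes = summarize_zip(zip_path)
    print(f"{n_members} members, {n_bytes / 1e9:.1f} GB uncompressed")
else:
    print(f"{zip_path} not found in the current directory")
```

The check only reads the archive's directory listing, so it should finish quickly even for a multi-gigabyte file.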

## B. Installation

Currently there are two ways to initialize the UMLS, with a third planned:
- From the source zip file (e.g., `umls-2020AB-metathesaurus.zip`)
- From the Rich Release Format (RRF) files generated by [MetamorphoSys](https://www.ncbi.nlm.nih.gov/books/NBK9683/)
- **TBD**: From an existing database instance

Depending on your machine, this should take 2-5 minutes. 

### Option 1: Install from NLM Zip File

In [2]:
%%time
from trove.labelers.umls import UMLS

# setup defaults
UMLS.config(
    cache_root = "~/.trove/umls2020AB",
    backend = 'pandas'
)
 
NLM_ZIPFILE_PATH = "umls-2020AB-metathesaurus.zip"
if not UMLS.is_initalized():
    print("Initializing the UMLS from zip file...")
    UMLS.init_from_nlm_zip(NLM_ZIPFILE_PATH, use_checksum=True)


Initializing the UMLS from zip file...
Detected UMLS version {'release': 'meta', 'year': 2020, 'version': 'AB'}
CPU times: user 1min 18s, sys: 12.8 s, total: 1min 31s
Wall time: 1min 38s
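
If you want to verify the download yourself, one common approach is to hash the file in chunks so that the 5.3 GB archive never has to fit in memory. The sketch below is a generic illustration and is not necessarily what `use_checksum=True` does inside Trove. `file_digest` is a hypothetical helper, and the choice of MD5 as the default is an assumption; use whichever algorithm your reference checksum was published with.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A typical use would be to compare `file_digest(NLM_ZIPFILE_PATH)` against a checksum obtained from a trusted source and re-download the file if the two differ.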


### Option 2: Install from RRF Files

If you have previously used [MetamorphoSys](https://www.ncbi.nlm.nih.gov/books/NBK9683/) to install the UMLS and create custom vocabulary subsets, you can use the generated RRF files directly.

In [3]:
%%time

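# Set this to the location of your RRF files. If the UMLS was already
# initialized (e.g., via Option 1), the check below skips this step.
RRF_FILE_PATH = ""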
if not UMLS.is_initalized():
    print("Initializing the UMLS from RRFs...")
    UMLS.init_from_rrfs(RRF_FILE_PATH)

CPU times: user 86 µs, sys: 183 µs, total: 269 µs
Wall time: 798 µs
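
For readers unfamiliar with the RRF format: each file is plain text with one record per line, fields separated by `|`, and a trailing `|` at the end of each line. The sketch below parses a single line into a dictionary. The column list is the one documented for `MRCONSO.RRF` in the UMLS Reference Manual, `parse_rrf_line` is a hypothetical helper (not Trove's loader), and the sample row is made up, with placeholder identifiers.

```python
# Column names for MRCONSO.RRF, as listed in the UMLS Reference Manual.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def parse_rrf_line(line, columns):
    """Split one pipe-delimited RRF line into a {column: value} dict."""
    fields = line.rstrip("\r\n").split("|")
    # RRF lines end with a trailing "|", which yields one extra empty field.
    if fields and fields[-1] == "":
        fields = fields[:-1]
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    return dict(zip(columns, fields))


# Made-up row with placeholder identifiers, for illustration only.
sample = (
    "C0000000|ENG|P|L0000000|PF|S0000000|Y|A0000000||||"
    "MSH|MH|D000000|acetaminophen|0|N||\n"
)
row = parse_rrf_line(sample, MRCONSO_COLUMNS)
print(row["SAB"], row["LAT"], row["STR"])  # MSH ENG acetaminophen
```

The `SAB` (source vocabulary) and `LAT` (language) columns are the kind of fields that options such as `filter_sabs` and `languages` would plausibly act on, though how Trove reads them internally is not shown here.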


### Option 3: Install from an Existing Database Instance
**TBD**: This option is not yet available. Once supported, a live UMLS database instance could be used to initialize Trove as follows.

In [4]:
# if not UMLS.is_initalized():
#     UMLS.init_from_dbconn(engine='mysql', dbname='UMLS2020AB')
    

## C. Test the Installation

Here we apply some common term transformations. This should run in 2-4 minutes.

In [5]:
%%time
from trove.labelers.umls import UMLS
from trove.transforms import SmartLowercase
from trove.contrib.datasets.stopwords import stopwords

# English stopwords, plus a capitalized variant of each (e.g., "the" and "The")
stopwords = stopwords.union({t[0].upper() + t[1:] for t in stopwords})

# options for filtering terms
config = {
    "type_mapping"  : "TUI",  # TUI = semantic types, CUI = concept ids
    'min_char_len'  : 2,
    'max_tok_len'   : 8,
    'min_dict_size' : 500,
    'stopwords'     : stopwords,
    'transforms'    : [SmartLowercase()],
    'languages'     : {"ENG"},
    'filter_sabs'   : {"SNOMEDCT_VET"},
    'filter_rgx'    : r'''^[-+]*[0-9]+([.][0-9]+)*$'''  # filter numbers
}

umls = UMLS(**config)


CPU times: user 1min 20s, sys: 7.98 s, total: 1min 28s
Wall time: 1min 28s
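
To make two of the filtering options more concrete, the sketch below exercises the `filter_rgx` pattern and the stopword capitalization step on a few hand-picked strings, using only the standard library. It assumes that terms matching the pattern are dropped; `is_number_like` and `add_capitalized` are hypothetical helpers for this example and do not come from Trove.

```python
import re

# Same pattern as 'filter_rgx' in the config above.
NUMBER_RGX = re.compile(r'''^[-+]*[0-9]+([.][0-9]+)*$''')


def is_number_like(term):
    """True if the whole term is digits with optional signs and dot groups."""
    return NUMBER_RGX.match(term) is not None


for term in ["42", "-3.14", "1.2.3", "3.", "B12", "1e5"]:
    print(f"{term!r:8} -> {is_number_like(term)}")
# '42', '-3.14' and '1.2.3' match; '3.', 'B12' and '1e5' do not.


def add_capitalized(words):
    """Return the words plus a copy of each with its first letter uppercased."""
    return set(words) | {w[0].upper() + w[1:] for w in words if w}


print(sorted(add_capitalized({"the", "of"})))  # ['Of', 'The', 'of', 'the']
```

Note that the pattern also accepts dotted sequences such as `1.2.3`, so version-like strings are treated as numbers, while mixed alphanumeric terms such as `B12` are not.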


Look at the semantic type assignments for an example term, `acetaminophen`, from the Medical Subject Headings (MeSH®) terminology.

In [6]:
from trove.labelers.umls import SemanticGroups

semgroups = SemanticGroups()
stys = umls.terminologies['MSH']['acetaminophen']
print(stys)
print([semgroups.types[sty] for sty in stys])

{'T109', 'T121'}
['Organic Chemical', 'Pharmacologic Substance']
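
Because the semantic types come back as a Python set, the order in which they print can differ between runs. If you need stable output (for logging or tests, say), sorting the identifiers first is a simple fix. The sketch below uses a small stand-in dictionary in place of `semgroups.types` so that it runs without a UMLS installation; `TUI_NAMES` and `describe_types` are hypothetical names, and the mapping contains only the two entries shown in the output above.

```python
# Stand-in for the TUI -> name mapping; only the two entries from the
# output above are included.
TUI_NAMES = {
    "T109": "Organic Chemical",
    "T121": "Pharmacologic Substance",
}


def describe_types(tuis, names=TUI_NAMES):
    """Return (TUI, name) pairs sorted by TUI; unknown TUIs map to None."""
    return [(tui, names.get(tui)) for tui in sorted(tuis)]


print(describe_types({"T121", "T109"}))
# [('T109', 'Organic Chemical'), ('T121', 'Pharmacologic Substance')]
```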
