<a href="https://colab.research.google.com/github/sophie-w/COVID-MDS/blob/master/MedCAT_Tutorial_%7C_Part_4_3_Annotating_documents_with_the_full_MedCAT_pipeline_with_MetaAnnotations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install --upgrade medcat
# Get the scispacy model
! pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_md-0.2.4.tar.gz

Collecting medcat
  Downloading https://files.pythonhosted.org/packages/ca/b5/62d61238e8929b5ee718e7bb8fed7559702f10c32c3309ebb23d075c64bd/medcat-1.0.33-py3-none-any.whl (125kB)
Collecting elasticsearch~=7.10
  Downloading https://files.pythonhosted.org/packages/ab/b1/58cfb0bf54e29c20669d6e588496fb7fe8b54f53bc238be4cb0a185a1e76/elasticsearch-7.13.1-py2.py3-none-any.whl (354kB)
Collecting numpy~=1.20
  Downloading https://files.pythonhosted.org/packages/a5/42/560d269f604d3e186a57c21a363e77e199358d054884e61b73e405dd217c/numpy-1.20.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.3MB)
Collecting transformers~=4.5.1
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_md-0.2.4.tar.gz
  Downloading https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_md-0.2.4.tar.gz (70.0MB)
Building wheels for collected packages: en-core-sci-md
  Building wheel for en-core-sci-md (setup.py) ... done
  Created wheel for en-core-sci-md: filename=en_core_sci_md-0.2.4-cp37-none-any.whl size=70498246 sha256=6d2f4f6bad6505786e1eef718364d00f9a3f5f0d97ec2fecc2bcc00f770b4542
  Stored in directory: /root/.cache/pip/wheels/12/b3/89/7fbb30f56411e8b4002eac6d5568ab46da63191a2287aa17bf
Successfully built en-core-sci-md
Installing collected packages: en-core-sci-md
Successfully installed en-core-sci-md-0.2.4


**Restart the runtime if you are on Colab; this is sometimes necessary after installing models.**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import json 

from medcat.cat import CAT
from medcat.cdb import CDB
from medcat.config import Config
from medcat.utils.vocab import Vocab
from medcat.meta_cat import MetaCAT
from medcat.preprocessing.tokenizers import TokenizerWrapperBPE
from tokenizers import ByteLevelBPETokenizer



In [None]:
DATA_DIR = "./data/"
MODEL_DIR = "./models/"
vocab_path = MODEL_DIR + "vocab.dat"
cdb_path = MODEL_DIR + "cdb-medmen-v1.dat"

In [None]:
# Download the models and required data
!wget https://raw.githubusercontent.com/CogStack/MedCAT/master/tutorial/data/pt_notes.csv -P ./data/
!wget https://raw.githubusercontent.com/CogStack/MedCAT/master/tutorial/data/MedCAT_Export.json -P ./data/
# You can also use the models created in Part 4.1 of the Tutorial
!wget https://medcat.rosalind.kcl.ac.uk/media/mc_status.zip -P ./models/

# Get MedCAT models
!wget https://medcat.rosalind.kcl.ac.uk/media/vocab.dat -P ./models/
!wget https://medcat.rosalind.kcl.ac.uk/media/cdb-medmen-v1.dat -P ./models/

--2021-06-10 23:13:53--  https://raw.githubusercontent.com/CogStack/MedCAT/master/tutorial/data/pt_notes.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3644222 (3.5M) [text/plain]
Saving to: ‘./data/pt_notes.csv’


2021-06-10 23:13:53 (14.5 MB/s) - ‘./data/pt_notes.csv’ saved [3644222/3644222]

--2021-06-10 23:13:53--  https://raw.githubusercontent.com/CogStack/MedCAT/master/tutorial/data/MedCAT_Export.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 272538 (266K) [text/plain]
Saving to: ‘./data/MedCAT_Export.json’


2021

In [None]:
# Unzip the status model
!unzip ./models/mc_status.zip

Archive:  ./models/mc_status.zip
   creating: mc_status/
  inflating: mc_status/vars.dat      
  inflating: mc_status/embeddings.npy  
   creating: mc_status/bert/
  inflating: mc_status/bert/tokenizer_config.json  
  inflating: mc_status/bert/vocab.txt  
  inflating: mc_status/bert/special_tokens_map.json  
  inflating: mc_status/lstm.dat      


In [None]:
# Create and load the CDB (Concept Database)
# The model we want to load here is the one fine-tuned in Part 4.2
cdb = CDB.load(cdb_path)

# Create and load the Vocabulary
vocab = Vocab.load(vocab_path)

# Config
config = Config()
config.general['spacy_model'] = 'en_core_sci_md'

tui_filter = ['T047', 'T048'] # Detect only Disease and Mental Disorders
cui_filters = set()
for tui in tui_filter:
  cui_filters.update(cdb.addl_info['type_id2cuis'][tui])
config.linking['filters']['cuis'] = cui_filters


# Get the status model for meta_annotations
mc_status = MetaCAT.load("mc_status")

# Create the full pipeline with models for meta-annotations
cat = CAT(cdb=cdb, config=config, vocab=vocab, meta_cats=[mc_status])
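
The CUI filter built in the cell above is the union of all concepts that belong to the selected semantic types. A minimal sketch of that step on a toy mapping follows; `type_id2cuis` here is hypothetical sample data standing in for `cdb.addl_info['type_id2cuis']`, and the CUIs listed under each type are illustrative only.

```python
# Toy stand-in for cdb.addl_info['type_id2cuis'] (hypothetical sample data).
type_id2cuis = {
    "T047": {"C0020538", "C0011849"},
    "T048": {"C0011570"},
    "T121": {"C0004057"},
}

tui_filter = ["T047", "T048"]

# Union of the CUIs of every selected type; types absent from the map are skipped.
cui_filters = set()
for tui in tui_filter:
    cui_filters.update(type_id2cuis.get(tui, set()))

print(sorted(cui_filters))
```

Concepts whose type is not listed in `tui_filter` never enter the set, so the linker would ignore them.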





## Document annotation

The following replicates the document annotation code from [Part 3.2](https://colab.research.google.com/drive/1q29RbHlZoFK7TcvMKITi3ABbE-E_fw30); the only change is that we now have meta-annotations.

In [None]:
# This will be a map from CUI to a list of documents where it appears: {"cui": [<doc_id>, <doc_id>, ...], ..}
cui_location = {}
# Let's also save the TUI location (semantic type)
tui_location = {}

In [None]:
# Load the data 
data = pd.read_csv(DATA_DIR + "pt_notes.csv")
data.head()

| | Unnamed: 0_x | subject_id | chartdate | category | text | create_year | Unnamed: 0_y | gender | dob | dob_year | age_year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 1 | 2079-01-01 | General Medicine | HISTORY OF PRESENT ILLNESS:, The patient is a ... | 2079 | 1 | F | 2018-01-01 | 2018 | 61 |
| 1 | 7 | 1 | 2079-01-01 | Rheumatology | HISTORY OF PRESENT ILLNESS: , A 71-year-old fe... | 2079 | 1 | F | 2018-01-01 | 2018 | 61 |
| 2 | 8 | 1 | 2079-01-01 | Consult - History and Phy. | HISTORY OF PRESENT ILLNESS:, The patient is a ... | 2079 | 1 | F | 2018-01-01 | 2018 | 61 |
| 3 | 9 | 2 | 2037-01-01 | Consult - History and Phy. | CHIEF COMPLAINT:,1. Infection.,2. Pelvic pai... | 2037 | 2 | F | 2018-01-01 | 2018 | 19 |
| 4 | 10 | 2 | 2037-01-01 | Dermatology | SUBJECTIVE:, This is a 29-year-old Vietnamese... | 2037 | 2 | F | 2018-01-01 | 2018 | 19 |


In [None]:
data.shape

(1088, 11)

In [None]:
batch_size = 100
batch = []
cnt = 0
for idx, row in data.iterrows():
    text = row['text']
    # Skip texts of 10 characters or fewer. Not strictly necessary as the data
    # has been filtered before, but it is a cheap safeguard.
    if len(text) > 10:
        batch.append((idx, text))

    # Process once the batch holds more than batch_size texts or the last row is reached
    if len(batch) > batch_size or idx == len(data) - 1:
        # Update the number of processes depending on your machine.
        results = cat.multiprocessing(batch, nproc=8)

        for pair in results:
            row_id = pair[0]
            entities = pair[1]['entities'] # Entities detected in this document

            for entity in entities.values():
                cui = entity['cui']
                # We know there is only one meta-annotation for status and
                # here we grab its value
                status = entity['meta_anns']['Status']['value']

                # Only take the entity into account if its status is confirmed
                if status == 'Confirmed':
                    if cui in cui_location and row_id not in cui_location[cui]:
                        cui_location[cui].append(row_id)
                    elif cui not in cui_location:
                        # elif (not else), so an existing list is never overwritten
                        cui_location[cui] = [row_id]

                    # This is not necessary as it can be done later, we have
                    # the cdb.cui2type_ids map.
                    tuis = cdb.cui2type_ids[cui]
                    for tui in tuis:
                        if tui in tui_location and row_id not in tui_location[tui]:
                            tui_location[tui].append(row_id)
                        elif tui not in tui_location:
                            tui_location[tui] = [row_id]

        cnt += 1
        # Progress counter (approximate: full batches hold batch_size + 1 texts)
        print("Done: {} - rows".format((cnt -1)*batch_size + len(batch)))

        # Reset the batch
        batch = []

Done: 101 - rows
Done: 201 - rows
Done: 301 - rows
Done: 401 - rows
Done: 501 - rows
Done: 601 - rows
Done: 701 - rows
Done: 801 - rows
Done: 901 - rows
Done: 1001 - rows
Done: 1078 - rows
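
The bookkeeping inside the loop can also be isolated into a small helper, which makes it easier to check on its own. The sketch below is one possible way to write it, not part of MedCAT: `add_confirmed` is a hypothetical name, and `sample_entities` is hand-written sample data shaped like `pair[1]['entities']` above (the status value `"Other"` is an assumption used for illustration).

```python
from collections import defaultdict


def add_confirmed(entities, row_id, cui_location):
    """Record row_id under every CUI whose 'Status' meta-annotation is 'Confirmed'."""
    for entity in entities.values():
        if entity["meta_anns"]["Status"]["value"] != "Confirmed":
            continue
        doc_ids = cui_location[entity["cui"]]
        # Store each document only once per concept.
        if row_id not in doc_ids:
            doc_ids.append(row_id)


# Hypothetical sample shaped like pair[1]['entities'].
sample_entities = {
    0: {"cui": "C0020538", "meta_anns": {"Status": {"value": "Confirmed"}}},
    1: {"cui": "C0011849", "meta_anns": {"Status": {"value": "Other"}}},
    2: {"cui": "C0020538", "meta_anns": {"Status": {"value": "Confirmed"}}},
}

cui_location_demo = defaultdict(list)
add_confirmed(sample_entities, row_id=7, cui_location=cui_location_demo)
add_confirmed(sample_entities, row_id=9, cui_location=cui_location_demo)
print(dict(cui_location_demo))
```

Using `defaultdict(list)` removes the need for a separate "key not yet present" branch, which is where an accidental overwrite is most likely to creep in.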


## Done

We have now annotated all documents in our dataset, and for each CUI (Concept Unique Identifier) we know in which documents it appears. We also know that every entity recorded in `cui_location` has the status "Confirmed".
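
As a rough illustration of what such a map allows, the sketch below counts documents per concept and finds the documents in which two concepts co-occur. `cui_location_demo` is hypothetical sample data shaped like `cui_location`, not output of the run above.

```python
# Hypothetical sample shaped like the cui_location map.
cui_location_demo = {
    "C0020538": [563, 684, 757],
    "C0011849": [12, 684],
    "C0011570": [40],
}

# Number of documents per concept, most frequent first.
doc_counts = sorted(
    ((cui, len(doc_ids)) for cui, doc_ids in cui_location_demo.items()),
    key=lambda item: item[1],
    reverse=True,
)

# Documents in which both concepts were found.
both = sorted(set(cui_location_demo["C0020538"]) & set(cui_location_demo["C0011849"]))

print(doc_counts)
print(both)
```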

---

Please note that the number of examples I have provided is not enough to fully train the "Status" meta-annotation model; more annotated examples would be needed.

In [None]:
# For example, the concept with CUI: C0020538 (hypertension) appears in
cui_location['C0020538']

[563, 684, 757, 758, 760, 898]

In [None]:
# Save concept location in corpus
import json
with open("./cui_location.json", 'w') as f:
    json.dump(cui_location, f)
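
Reading the file back is the mirror image of saving it. The sketch below round-trips a small hypothetical map through a temporary file; the path and sample data are assumptions for illustration. It relies on the document ids being plain Python integers, since NumPy integer types are not serialisable by the standard `json` module without conversion.

```python
import json
import os
import tempfile

# Hypothetical sample shaped like cui_location.
cui_location_demo = {"C0020538": [563, 684]}

# Write to a temporary location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "cui_location.json")
with open(path, "w") as f:
    json.dump(cui_location_demo, f)

# Load the map back; keys are strings and document ids are integers.
with open(path) as f:
    restored = json.load(f)

print(restored)
```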

In [None]:
# TODO: this CUI is now missing from cui_location, so the lookup below raises a KeyError
cui_location['C0020538']

KeyError: ignored
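
When a concept may be absent from the map, a lookup with a default avoids the exception. A minimal sketch, using a hypothetical sample map that does not contain `C0020538`:

```python
# Hypothetical sample shaped like cui_location, without C0020538.
cui_location_demo = {"C0011849": [12, 684]}

# dict.get returns the default instead of raising KeyError for an absent CUI.
doc_ids = cui_location_demo.get("C0020538", [])
print(len(doc_ids))
```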