# Introduction

In this lecture, we study "automatic label extraction": labelling a chest X-ray image dataset using radiology reports. (Refer to the [What is a Radiology Report like?](#what-is) section to see what such a report looks like.)

## Motivation

In medicine, datasets are scarce, and labelled datasets are scarcer still.

There are several ways to obtain a labelled dataset:
1. Annotation of data samples by experts (such as radiologists labelling X-rays), who inspect the raw data samples directly.

    _However, manual annotation is time- and cost-intensive, which makes this process sub-optimal._

    <figure>
    <center><img src="../../assets/W2/W2_P2_annotation_methods_1.png" width="900">
    <figcaption align="center"> Fig 1: Annotation method 1: Use experts to label data. </figcaption>
    </figure>

2. Annotation of data samples by non-experts, who inspect reports written by experts. For example, radiologists write a report while interpreting a scan (such as an X-ray, MRI, or CT) that summarizes the interpretation and highlights the findings.

    _Though less time-consuming than the first option, this process is still manual and sub-optimal._

    <figure>
    <center><img src="../../assets/W2/W2_P2_annotation_methods_2.png" width="900">
    <figcaption align="center"> Fig 2: Annotation method 2: Use the inspection report to label data. </figcaption>
    </figure>

> Since data labelling is cumbersome and time-consuming, __can we use AI to automate such a labelling task?__

3. Annotation of data samples by machines (a multi-modal AI system), where the system takes the inspection report and the raw image as input and outputs the labels for the sample, i.e. a _supervised ML problem_.

    <figure>
    <center><img src="../../assets/W2/W2_P2_annotation_methods_3.png" width="900">
    <figcaption align="center"> Fig 3: Annotation method 3: Use BERT for labelling data, as a supervised task. </figcaption>
    </figure>


## What is a Radiology Report like?

After examining a scan (here, a chest X-ray), a radiologist writes a succinct account of the findings in a report, termed a radiology report. Although radiology datasets are increasingly available, only a few of them, such as MIMIC-CXR [\[1\]](https://www.nature.com/articles/s41597-019-0322-0), come with radiology reports attached. An example from the MIMIC-CXR dataset is shown below:

<figure>
<center><img src="https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41597-019-0322-0/MediaObjects/41597_2019_322_Fig1_HTML.png" align="middle" width="700">
<figcaption> Fig 4: Example study contained in MIMIC-CXR. Above (a), the radiology report provides the interpretation of the image. PHI (protected health information) has been removed and replaced with three underscores (_ _ _). Below, the two chest radiographs for this study are shown: (b) the frontal view (left image) and (c) the lateral view (right image). </figcaption>
</figure>

## Steps for extraction of labels from radiology report

1. Determine: "Is the label mentioned in the report's summary/conclusion/impression?"

    To find out whether the label is mentioned in the report, we can:
    1. Search directly for the exact label, or
    2. Search for synonyms of the label (a list of words to match), since the report may refer to the label by other words. For example, the label `pneumonia` can appear in a report as `pneumonia`, `infection`, `infectious process`, or `infectious`.
    
    The list of words to match against the report (the exact label and its synonyms) can be obtained by:
    1. Asking a medical professional to write a list of synonyms, or
    2. Using a standard _terminology_ (a thesaurus or vocabulary) such as SNOMED CT, which contains ~300,000 concepts.
        Each "concept", for example "Common Cold", has a __concept number__, __synonyms__, and __Is-A relations__.
        
        <figure>
        <center><img src="../../assets/W2/W2_P2_vocabulary_example.png" width="700">
        <figcaption> Fig 5: Example Concept from SNOMED CT vocabulary </figcaption>
        </figure>

        The Is-A relationship is useful when the label is a generic term, such as _disease_ or _infection_, but the report's summary names a more specific condition. Suppose we need to extract the label for _lung disease_ from a report summary which states:

        > There are signs of pneumonia in the lungs.

        Here, although the summary does not mention lung disease directly, we can use the Is-A relationships to look up _the subtypes of the label and their synonyms_ and use those words to answer the question:

        <center>
        
        `Viral pneumonia` IS-A `Infectious pneumonia`

        `Infectious pneumonia` IS-A `Pneumonia`
        
        `Pneumonia` IS-A `Lung disease`
        
        </center>
        
    Here are some pros and cons of the above strategy for automatic label extraction:
    
    __Pros__
    - No labeled data needed for supervised learning, as this automatic label extraction is a rule-based approach.

    __Cons__
    - Requires a lot of manual work to refine and test the rules.
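
The mention check described in step 1 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: `CONCEPTS` is a hypothetical, hand-written stand-in for a terminology such as SNOMED CT (it is not the real SNOMED CT data or API), the synonym `pulmonary disease` is invented for the example, and the helper names `subtypes`, `terms_for`, and `is_mentioned` are made up here.

```python
import re

# Hypothetical, hand-written stand-in for a terminology such as SNOMED CT.
# Each concept lists its synonyms and its direct Is-A parents.
CONCEPTS = {
    "lung disease": {
        "synonyms": ["lung disease", "pulmonary disease"],  # illustrative only
        "is_a": [],
    },
    "pneumonia": {
        "synonyms": ["pneumonia", "infection", "infectious process", "infectious"],
        "is_a": ["lung disease"],
    },
    "infectious pneumonia": {
        "synonyms": ["infectious pneumonia"],
        "is_a": ["pneumonia"],
    },
    "viral pneumonia": {
        "synonyms": ["viral pneumonia"],
        "is_a": ["infectious pneumonia"],
    },
}


def subtypes(label):
    """Return the label plus every concept that reaches it through Is-A links."""
    found = {label}
    frontier = [label]
    while frontier:
        parent = frontier.pop()
        for name, concept in CONCEPTS.items():
            if parent in concept["is_a"] and name not in found:
                found.add(name)
                frontier.append(name)
    return found


def terms_for(label):
    """Collect the synonyms of the label and of all its subtypes."""
    terms = set()
    for name in subtypes(label):
        terms.update(CONCEPTS[name]["synonyms"])
    return terms


def is_mentioned(label, report):
    """True if any term for the label appears as a whole word or phrase in the report."""
    text = report.lower()
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text)
        for term in terms_for(label)
    )
```

With this toy vocabulary, `is_mentioned("lung disease", "There are signs of pneumonia in the lungs.")` returns `True`, because `pneumonia` is reached from `lung disease` through the Is-A links, while a report that only says `Heart size normal.` returns `False`.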

2. Determine: "Is the observation present or absent?"

    After finding that a label is mentioned, automatic label extraction needs to determine whether the label (observation) is present (1) or absent (0) in the summary. For example:

    > Heart size normal and lungs are clear. _No_ <u>edema or pneumonia</u>.

    > Minor consolidations in lungs. There are signs of <u>pneumonia</u>.

    In the first statement, the _No_ conveys `edema=0 and pneumonia=0`; the second statement conveys `pneumonia=1`. However, there are multiple ways to convey the presence or absence of a label. Some ways to express the absence of edema are:
    ```
    No edema
    No XXX or edema
    Without (XXX) edema
    No evidence of edema
    ```

    There are multiple ways to determine the presence or absence of the label:
    1. Rule-based (no labelled data needed, but time-consuming and requires expertise)
        1. Regex rules: Match presence/absence using user-defined regular expressions.
        2. Dependency parse rules: Use the dependencies between grammatical units to find the answer.
    2. Supervised learning
        1. Negation classification: The report is taken as input and a model outputs the presence or absence of the label.
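
The regex-rule option can be sketched as follows. This is a rough illustration rather than a complete rule set: the negation cues are only the ones listed above, the function names `label_value` and `extract_labels` are made up for this example, and returning `None` for a label that is never mentioned is an assumption of the sketch (the lecture only defines 1 for present and 0 for absent).

```python
import re

# Illustrative negation cues taken from the patterns listed above; a real
# rule set would be longer and tuned against actual reports.
NEGATION_CUES = r"\b(?:no evidence of|no|without)\b"


def label_value(terms, report):
    """Return None if no term is mentioned, 1 if any mention is not negated, else 0."""
    value = None
    # Split into sentences so a cue only affects mentions in its own sentence.
    for sentence in re.split(r"(?<=[.!?])\s+", report.lower()):
        for term in terms:
            mention = re.search(r"\b" + re.escape(term) + r"\b", sentence)
            if mention is None:
                continue
            # Rule: a mention is negated if a cue precedes it in the sentence.
            before = sentence[: mention.start()]
            if re.search(NEGATION_CUES, before):
                if value is None:
                    value = 0
            else:
                value = 1
    return value


def extract_labels(label_terms, report):
    """Apply label_value to every label; label_terms maps label -> list of terms."""
    return {label: label_value(terms, report) for label, terms in label_terms.items()}


LABEL_TERMS = {"edema": ["edema"], "pneumonia": ["pneumonia"]}

print(extract_labels(LABEL_TERMS, "Heart size normal and lungs are clear. No edema or pneumonia."))
# {'edema': 0, 'pneumonia': 0}
print(extract_labels(LABEL_TERMS, "Minor consolidations in lungs. There are signs of pneumonia."))
# {'edema': None, 'pneumonia': 1}
```

The rule "a cue anywhere earlier in the sentence negates the mention" is deliberately crude. It handles `No XXX or edema` and `Without (XXX) edema`, but it would also wrongly negate a sentence such as `No fever but edema is present`, which is the kind of case that makes rule refinement labour-intensive.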

After both of these questions are answered by our algorithm, we confidently assign values to the labels:

<figure>
<center><img src="../../assets/W2/W2_P2_auto_extraction_example.png" width="700">
<figcaption> Fig 6: Example of Auto label extraction </figcaption>
</figure>

This lecture mainly covered rule-based approaches to automatic label extraction. Supervised learning algorithms were not discussed.

## References

1. Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data 6, 317 (2019). https://doi.org/10.1038/s41597-019-0322-0
