# From Mention-Level Annotations to Document Classification

## 1. Why do we need document classification?

Consider a document that contains several mentions. How do we decide on a document-level conclusion when these mentions carry conflicting information? For example:

>Small **left pleural effusion**. **Right pleural effusion can be excluded**.

In this example, should we conclude that the report indicates pneumonia or does not indicate pneumonia?

There are many other situations in which we need to draw a document-level conclusion from multiple mention-level annotations. We could certainly train a machine learning classifier for this task, which you will learn about in another class. Here, we will learn how to do it in a rule-based way.

## 2. Pick up where we left off with pyConText

In [None]:
#import everything that we will need
import pyConTextNLP
from pyConTextNLP import pyConTextGraph
from pyConTextNLP.itemData import itemData
from pyConTextNLP.display.html import mark_document_with_html
import os
import os.path
from nlp_pneumonia_utils import Annotation
from nlp_pneumonia_utils import AnnotatedDocument
from nlp_pneumonia_utils import read_brat_annotations
from nlp_pneumonia_utils import read_annotations
from nlp_pneumonia_utils import calculate_prediction_metrics
from nlp_pneumonia_utils import markup_context_document
from nlp_pneumonia_utils import DocumentClassifier
from IPython.display import display, HTML, Image

In [None]:
# Let's treat a variation of the opening example as a document,
# and run pyConText to get markups

report = "Right pleural effusion can be excluded. Likely small left pleural effusion. "

targets = itemData(["effusion", "SPAN_POSITIVE_PNEUMONIA_EVIDENCE", r"effusion[s]?", ""])

modifiers = pyConTextNLP.itemData.instantiateFromCSVtoitemData(os.path.join(os.getcwd(),'KB/pneumonia_modifiers.tsv'))

markups=markup_context_document(report,modifiers,targets)

In [None]:
# To confirm what we get from pyConText
print(markups.getDocumentGraph())
        

context_html = pyConTextNLP.display.html.\
    mark_document_with_html(markups, colors = {"span_positive_pneumonia_evidence": "blue"})
display(HTML(context_html))

## 3. Use DocumentClassifier to define the rules for document classification

DocumentClassifier is a simple class that lets you define your own rules outside the Python code. For each type of annotation, we can draw a document-level conclusion by checking the annotations' modifier values. Each rule is written on one line and consists of one or more elements separated by a TAB character.

* If a rule has one element, that element is the default document-level conclusion.

  
* If a rule has two elements, the left element is the document-level conclusion and the right one is the annotation type.
 + *For example: "indicate_pneumonia	span_positive_pneumonia_evidence" means that if any annotation of type "span_positive_pneumonia_evidence" exists, we conclude "indicate_pneumonia" no matter what the modifier values are.*


* If a rule has three or more elements, the first two elements have the same meaning as in the two-element format above. The remaining elements are a set of modifier values. The rule applies only when **all** of these modifier values are present on the same annotation.
 + *For example: "indicate_pneumonia	span_positive_pneumonia_evidence	definite_existence" means that if an annotation of type "span_positive_pneumonia_evidence" has the modifier value "definite_existence", we conclude "indicate_pneumonia".*
 

The rules are evaluated in order for each annotation type; as soon as one rule is satisfied, the remaining rules for that type are skipped. Rules near the top therefore have higher priority, and the default-conclusion rule should usually go last.
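To make the rule format concrete, here is a minimal, self-contained sketch of how such rules could be parsed and applied. It is not the actual `DocumentClassifier` implementation: the names `parse_rules`, `classify_toy`, and the tuple layouts are hypothetical, and it assumes that a rule fires as soon as any single annotation of the given type carries all of the required modifier values.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical in-memory forms, used only for this illustration:
# a rule is (conclusion, annotation_type or None, required_modifier_values)
ToyRule = Tuple[str, Optional[str], Set[str]]
# a mention-level annotation is (annotation_type, modifier_values)
ToyAnnotation = Tuple[str, Set[str]]


def parse_rules(rule_text: str) -> List[ToyRule]:
    """Parse TAB-separated rule lines, skipping blank lines."""
    parsed = []
    for line in rule_text.splitlines():
        fields = [field.strip() for field in line.split("\t") if field.strip()]
        if not fields:
            continue
        conclusion = fields[0]
        ann_type = fields[1] if len(fields) > 1 else None
        parsed.append((conclusion, ann_type, set(fields[2:])))
    return parsed


def classify_toy(annotations: List[ToyAnnotation], rules: List[ToyRule]) -> Optional[str]:
    """Return the conclusion of the first rule (top-down) that is satisfied."""
    for conclusion, ann_type, required in rules:
        if ann_type is None:
            return conclusion  # one-element rule: default conclusion
        for a_type, mods in annotations:
            # every required modifier value must be present on the same annotation
            if a_type == ann_type and required <= mods:
                return conclusion
    return None


toy_rules = parse_rules(
    "possible_pneumonia\tspan_positive_pneumonia_evidence\tprobable_existence\n"
    "no_indicate_pneumonia\tspan_positive_pneumonia_evidence\tdefinite_negated_existence\n"
    "no_indicate_pneumonia"
)
toy_annotations = [
    ("span_positive_pneumonia_evidence", {"definite_negated_existence"}),
    ("span_positive_pneumonia_evidence", {"probable_existence"}),
]
toy_result = classify_toy(toy_annotations, toy_rules)
print(toy_result)  # possible_pneumonia
```

Because the `probable_existence` rule sits above the negation rule, it wins even though the negated mention comes first in the list. With no annotations at all, only the default rule can fire.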


### (1) Define a default document conclusion


Let's start with something really simple:

```Python
no_indicate_pneumonia
```
This rule means that no matter what annotations we get from pyConText, we conclude 'no_indicate_pneumonia'.


In [None]:
# The 1st argument holds the rules: here we pass them directly as a string (a rule file comes later)
# The 2nd argument specifies whether to run in debug mode (try it yourself by changing it to True)
docClassifier = DocumentClassifier('no_indicate_pneumonia', False)
print(docClassifier.classify_markups(markups))

### (2) Define one more rule

Now let's add a rule that takes the annotations into account:

```Python
possible_pneumonia	span_positive_pneumonia_evidence	probable_existence
```
What does this rule mean?

Let's put this rule into code, together with a rule for negated mentions and the default rule, and try it:

In [None]:
rule='''possible_pneumonia\tspan_positive_pneumonia_evidence\tprobable_existence\n
no_indicate_pneumonia\tspan_positive_pneumonia_evidence\tdefinite_negated_existence\n
no_indicate_pneumonia'''
docClassifier = DocumentClassifier(rule, False)
print(docClassifier.classify_markups(markups))

### (3) Exercise
Try one more rule with two modifiers:


In [None]:
## Let's try a document that has an annotation with two modifiers
report = "Right pleural effusion can be excluded. Likely without any left pleural effusion."
# Re-run pyConText so that the markups reflect the new report
markups = markup_context_document(report, modifiers, targets)
rule='''possible_neg_pneumonia\tspan_positive_pneumonia_evidence\tprobable_existence\tdefinite_negated_existence\n
no_indicate_pneumonia'''

docClassifier = DocumentClassifier(rule, True)
print(docClassifier.classify_markups(markups))


### (4) Put the rules outside our code

Similar to how we handle pyConText rules, we can keep the document classification rules outside our code. Here is the rule file [classifierRules.csv](../../edit/decart_rule_based_nlp/KB/classifierRules.csv). Now we can pass the file path as the 1st argument of DocumentClassifier to instantiate the class:

In [None]:
docClassifier = DocumentClassifier('KB/classifierRules.csv', True, modifiers, targets)
print(docClassifier.classify_doc(report))

### (5) Let's try to switch the sentences in the example
See what happens. Does the order of the mention-level annotations affect the final conclusion?
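One way to reason about this question before running the real classifier is a toy model. The sketch below assumes first-match semantics in which rules are tried in priority order and each rule looks at all mentions. The name `classify_first_match` and the tuple layout are hypothetical and are not part of `DocumentClassifier`, whose actual behavior may differ, so treat what the notebook prints as the authoritative answer.

```python
from itertools import permutations

# Toy rules in priority order: (conclusion, annotation_type, required_modifier_values).
# annotation_type None marks the default rule. Hypothetical layout, for illustration only.
TOY_RULES = [
    ("possible_pneumonia", "span_positive_pneumonia_evidence", {"probable_existence"}),
    ("no_indicate_pneumonia", "span_positive_pneumonia_evidence", {"definite_negated_existence"}),
    ("no_indicate_pneumonia", None, set()),
]


def classify_first_match(mentions):
    """Try rules top-down; a rule fires if any mention satisfies it."""
    for conclusion, ann_type, required in TOY_RULES:
        if ann_type is None or any(
            m_type == ann_type and required <= m_mods for m_type, m_mods in mentions
        ):
            return conclusion
    return None


mentions = [
    ("span_positive_pneumonia_evidence", {"definite_negated_existence"}),  # "can be excluded"
    ("span_positive_pneumonia_evidence", {"probable_existence"}),          # "Likely ..."
]

# Classify every ordering of the mentions and collect the distinct conclusions.
conclusions = {classify_first_match(list(order)) for order in permutations(mentions)}
print(conclusions)  # {'possible_pneumonia'}
```

In this toy model the conclusion depends on the order of the rules, not on the order of the mentions, which is why every permutation yields the same result.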




### (6) Let's add one more question

Is the document's conclusion certain or uncertain?

Put it all together in a separate file.

## 4. Quiz
Let's try a few questions to see if you've understood the content of this notebook:

In [None]:
from quiz_utils import doc_classify_1
doc_classify_1()

In [None]:
from quiz_utils import doc_classify_2
doc_classify_2()

In [None]:
from quiz_utils import doc_classify_3
doc_classify_3()

<br/><br/>This material was presented as part of the DeCART Data Science for the Health Science Summer Program at the University of Utah in 2017.<br/>
Presenters: Dr. Wendy Chapman, Jianlin Shi, and Kelly Peterson