# Error Analysis (2)

In the previous notebook, you saw that pyConTextNLP improved precision by using modifiers to exclude irrelevant annotations.

This notebook continues that error analysis by covering both false negatives and false positives, with a step-by-step demonstration of how to locate and reduce these errors.

## 1. Locate the errors

In [1]:
# Let's import some packages
import os
import pyConTextNLP
from pyConTextNLP import pyConTextGraph
import sklearn.metrics
import pandas as pd
import networkx as nx
import radnlp.view as rview

from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display, HTML, Image
import ipywidgets
# And also our utilities for this class

from nlp_pneumonia_utils import read_doc_annotations
from nlp_pneumonia_utils import mark_document_with_html
from nlp_pneumonia_utils import DocumentClassifier


Remember yesterday, in [06_NLP_ErrorAnalysis1](06_NLP_ErrorAnalysis1.ipynb#cell1), we created a function called *"list_false_negatives."* Now we will extend this function to return both **false negative** and **false positive** document names at the same time: *"list_errors."* We will also integrate the *"calculate_prediction_metrics"* function into *"list_errors"*, so that we can get the metrics without re-running pyConText over the documents.
<br/><br/>


In [None]:
def list_errors(gold_docs, prediction_function, print_prediction_metrics=False):
    fn_docs=[]
    fp_docs=[]
    gold_labels = [x.positive_label for x in gold_docs.values()]
    pred_labels = []
    for doc_name, gold_doc in gold_docs.items():
        gold_label = gold_doc.positive_label
        pred_label = prediction_function(gold_doc.text,{'indicate_pneumonia'},doc_name)
        pred_labels.append(pred_label)
#       Differentiate false positive and false negative error
        if gold_label==0 and pred_label==1:
            fp_docs.append(doc_name)
        elif gold_label==1 and pred_label==0:
            fn_docs.append(doc_name)
    if print_prediction_metrics:
        precision = sklearn.metrics.precision_score(gold_labels, pred_labels)
        recall = sklearn.metrics.recall_score(gold_labels, pred_labels)
        f1 = sklearn.metrics.f1_score(gold_labels, pred_labels)
        # Let's use Pandas to make a confusion matrix for us
        confusion_matrix_df = pd.crosstab(pd.Series(gold_labels, name='Actual'),
                                          pd.Series(pred_labels, name='Predicted'))

        print('Precision : {0}'.format(precision))
        print('Recall :    {0}'.format(recall))
        print('F1:         {0}'.format(f1))

        print('\nConfusion Matrix : ')
        display(confusion_matrix_df)
    return fn_docs,fp_docs   
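
To make the bookkeeping inside *"list_errors"* concrete, here is a minimal sketch on made-up labels that counts the four outcomes by hand and derives precision, recall and F1 from them. The names `toy_gold`, `toy_pred`, `count_errors` and `precision_recall_f1` are hypothetical and exist only for this illustration; the notebook itself relies on `sklearn.metrics` for the scores.

```python
# Made-up labels for illustration only (1 = pneumonia, 0 = no pneumonia)
toy_gold = {'doc_a': 1, 'doc_b': 0, 'doc_c': 1, 'doc_d': 0, 'doc_e': 1}
toy_pred = {'doc_a': 1, 'doc_b': 1, 'doc_c': 0, 'doc_d': 0, 'doc_e': 1}


def count_errors(gold, pred):
    """Count tp/fp/fn/tn and collect the names of the misclassified documents."""
    tp = fp = fn = tn = 0
    fp_names, fn_names = [], []
    for name, gold_label in gold.items():
        pred_label = pred[name]
        if gold_label == 1 and pred_label == 1:
            tp += 1
        elif gold_label == 0 and pred_label == 1:
            fp += 1                    # predicted positive, actually negative
            fp_names.append(name)
        elif gold_label == 1 and pred_label == 0:
            fn += 1                    # predicted negative, actually positive
            fn_names.append(name)
        else:
            tn += 1
    return (tp, fp, fn, tn), fp_names, fn_names


def precision_recall_f1(tp, fp, fn):
    """Derive the three scores from the counts, guarding against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


toy_counts, toy_fp_names, toy_fn_names = count_errors(toy_gold, toy_pred)
toy_precision, toy_recall, toy_f1 = precision_recall_f1(*toy_counts[:3])
print('False positives:', toy_fp_names)
print('False negatives:', toy_fn_names)
print('Precision, recall, F1:', toy_precision, toy_recall, toy_f1)
```

False positives lower precision and false negatives lower recall, which is why the two lists are worth inspecting separately.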


Now we restore what we got from [10_NLP_DocumentClassification.ipynb](10_NLP_DocumentClassification.ipynb):<br/><br/>

In [None]:
#Read in the training documents and annotations
annotated_doc_map = read_doc_annotations('data/training_v2.zip')

#Here we initiate our DocumentClassifier directly through rule files:
#Change the file names if you use different files 
docClassifier = DocumentClassifier('KB/classifierRules.csv', False,'KB/pneumonia_modifiers.tsv','KB/pneumonia_targets.tsv') 
docClassifier.reset_saved_predictions()

In [None]:
# Process the corpus using docClassifier to return errors
current_false_negatives,current_false_positives=list_errors(annotated_doc_map, docClassifier.predict,True)

## 2. Display errors
Now we put everything together to display the errors. First, let's define the display helpers:<br/><br/>

In [None]:
# Copy the snippets_markup function from 06_NLP_ErrorAnalysis1
def snippets_markup(annotated_doc_map):
    html = ["<html>","<table width=100% >",
            "<col style=\"width:25%\"><col style=\"width:75%\">",
            "<tr><th style=\"text-align:center\">document name</th><th style=\"text-align:center\">Snippets</th></tr>"]
    for doc_name, anno_doc in annotated_doc_map.items():
        html.extend(snippet_markup(doc_name,anno_doc))
    html.append("</table>")
    html.append("</html>")
    return ''.join(html) 


def snippet_markup(doc_name,anno_doc):
    from pyConTextNLP.display.html import __sort_by_span
    from pyConTextNLP.display.html import __insert_color
    html=[]
    color= 'blue'    
    window_size=50    
    html.append("<tr>")
    html.append("<td style=\"text-align:left\">{0}</td>".format(doc_name))
    html.append("<td></td>")
    html.append("</tr>")
    for anno in anno_doc.annotations:
        if anno.type == 'SPAN_POSITIVE_PNEUMONIA_EVIDENCE':
#           make sure the snippet is cut inside the text boundaries
            begin=anno.start_index-window_size
            end=anno.end_index+window_size
            begin=begin if begin>0 else 0
            end=end if end<len(anno_doc.text) else len(anno_doc.text)    
#           render a highlighted snippet
            cell=__insert_color(anno_doc.text[begin:end],[anno.start_index-begin,anno.end_index-begin],color)
#           add the snippet into table
            html.append("<tr>")
            html.append("<td></td>")
            html.append("<td style=\"text-align:left\">{0}</td>".format(cell))
            html.append("</tr>") 
    return html

# Let's tweak the pyConText markup display function from 09_NLP_pneumonia_pyConText_targets_and_modifiers
# This function lets us view the markups saved in docClassifier without re-processing them through pyConText
def view_pycontext_graph(saved_markups, colors):
    @interact(i=ipywidgets.IntSlider(min=0, max=len(saved_markups)-1))
    def _view_markup(i):
        markup = saved_markups[i]
        ag=nx.nx_pydot.to_pydot(rview.documentgraph_to_viewgraph(markup.getDocumentGraph()))
        ag.write_png("tmp.png")
        display(Image("tmp.png"))        
        report_html = mark_document_with_html(markup, colors, default_color="black")        
        display(HTML(report_html))
        
colors = {
    "evidence_of_pneumonia": "orange",
    "definite_negated_existence": "red",
    "probable_negated_existence": "indianred",
    "ambivalent_existence": "forestgreen",
    "probable_existence": "forestgreen",
    "definite_existence": "green",
    "historical": "goldenrod",
    "indication": "pink",
    "acute": "gold"
}        
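
The window arithmetic in *"snippet_markup"* can be sketched on its own with plain string slicing. This is only an illustration under stated assumptions: `highlight_snippet` and the toy sentence are hypothetical, and the sketch does not call pyConTextNLP's internal coloring helper. The point it shows is that, once the window has been clipped to the document, both highlight offsets are measured from the start of the snippet.

```python
def highlight_snippet(text, start, end, window_size=50, color='blue'):
    """Cut a context window around text[start:end] and wrap that span in a colored <span>."""
    # Clip the window so it never runs past either end of the document
    begin = max(start - window_size, 0)
    stop = min(end + window_size, len(text))
    snippet = text[begin:stop]
    # Both offsets are relative to the start of the snippet
    rel_start = start - begin
    rel_end = end - begin
    return (snippet[:rel_start]
            + '<span style="color:{0}">'.format(color)
            + snippet[rel_start:rel_end]
            + '</span>'
            + snippet[rel_end:])


# Made-up sentence whose annotated span ends exactly at the end of the text
toy_text = 'Findings consistent with right lower lobe pneumonia'
toy_start = toy_text.index('pneumonia')
toy_end = toy_start + len('pneumonia')
toy_html = highlight_snippet(toy_text, toy_start, toy_end, window_size=10)
print(toy_html)
```

The toy span sits at the very end of the text, which is the boundary case where clipping matters most.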

### * Display false negatives
Now we can display the **false negatives** with expert annotations.<br/><br/>

In [None]:
fn_docs=dict((k, v) for k, v in annotated_doc_map.items() if k in current_false_negatives)
display(HTML(snippets_markup(fn_docs)))

### * Display false positives
Then we can display the **false positives** with pyConText markups.<br/><br/>

In [None]:
fp_docs=list(v for k,v in docClassifier.saved_markups_map.items() if k in current_false_positives)
view_pycontext_graph(fp_docs,colors)

### * Rethink the causes of the false negatives
Are the false negatives all caused by missed keywords?<br/><br/>
You may want to review these pyConText markups in **false negative** documents:<br/><br/>


In [None]:
fn_markups=list(v for k,v in docClassifier.saved_markups_map.items() if k in current_false_negatives)
view_pycontext_graph(fn_markups,colors)
# The variable "saved_markups_map" in "docClassifier" only saves the documents that have at least one annotation.
# Thus, the false negatives caused by missing keywords (no annotations) will not be saved in it,
# and only the false negatives caused by pyConText will be displayed here
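
One way to act on this distinction is to split the false negatives by whether a markup was saved for them. The sketch below uses made-up stand-ins (`toy_false_negatives`, `toy_saved_markups`) in place of `current_false_negatives` and `docClassifier.saved_markups_map`, and `split_false_negatives` is a hypothetical helper, not part of the class utilities.

```python
# Hypothetical stand-ins for current_false_negatives and docClassifier.saved_markups_map
toy_false_negatives = ['doc_1', 'doc_2', 'doc_3']
toy_saved_markups = {'doc_2': '<markup for doc_2>', 'doc_9': '<markup for doc_9>'}


def split_false_negatives(false_negatives, saved_markups):
    """Separate false negatives that have a saved markup from those that have none."""
    # A saved markup means a target was found but the document was still classified negative
    with_markup = [name for name in false_negatives if name in saved_markups]
    # No saved markup means no target keyword matched anywhere in the document
    without_markup = [name for name in false_negatives if name not in saved_markups]
    return with_markup, without_markup


fn_with_markup, fn_without_markup = split_false_negatives(toy_false_negatives,
                                                          toy_saved_markups)
print('Review modifiers and classifier rules for:', fn_with_markup)
print('Review target keywords for:', fn_without_markup)
```

Documents in the first list point at modifier or rule problems, while documents in the second list suggest missing target keywords.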

## 3. Now what?<br/><br/>

## 4. Quiz
Let's see if you are ready.

In [None]:
from quiz_utils import error_analyses_7
error_analyses_7()

In [None]:
from quiz_utils import error_analyses_8
error_analyses_8()

In [None]:
from quiz_utils import error_analyses_9
error_analyses_9()

<br/><br/>This material was presented as part of the DeCART Data Science for the Health Science Summer Program at the University of Utah in 2017.<br/>
Presenters: Dr. Wendy Chapman, Jianlin Shi and Kelly Peterson