## Conclusion 

| Model Name      | Components                                                   | Accuracy (%)  | Time per PDF (s) |
| --------------- | ------------------------------------------------------------ | ------------- | ---------------- |
| **v1-bert**     | Block Classification<br />Obtain text blocks and tokens from the ground-truth annotations<br />Classify each block using **BERT** <sup>1</sup> | 91.93         | 15.41            |
| **v1-layoutlm** | Block Classification<br />Obtain text blocks and tokens from the ground-truth annotations<br />Classify each block using **LayoutLM** | 87.59         | 14.52            |
| **v2-bert**     | Pipeline Model: <br />Block Detection <br />Block Classification with **BERT** | 85.87 / 88.42 <sup>2</sup>| 92.68            |
| **v2-layoutlm** | Pipeline Model: <br />Block Detection <br />Block Classification with **LayoutLM** | 81.80 / 84.04 | 92.59            |
| **v3**          | Directly use the original **LayoutLM** model to predict token categories | 88.47         | 19.05            |

1. All models are based on the `base` version of BERT.
2. The second number is the accuracy after the caption pseudo-fix; see the last section for details.


Visualization results are available [here](https://drive.google.com/drive/folders/11G_yPtntiUCcV08vRZ13iI0OJ_y3Q_Xj?usp=sharing).



## Download Models & Dataset

Download all the models into the `models` directory under this directory and store the dataset in the `dataset` directory. The evaluation results are saved to `SAVE_PATH`.

In [1]:
SAVE_PATH = './eval/'

## Initialize Modules 

In [2]:
from evaluate import *
from utils import LayoutVisualizer

In [3]:
%load_ext autoreload
%autoreload 2

import sys
import os 

import layoutparser as lp
from tqdm import tqdm
from copy import deepcopy

sys.path.append("../../src/")
from scienceparseplus.pdftools import * 
from scienceparseplus.modeling.layoutlm import *
from scienceparseplus.modeling.pipeline import *

In [4]:
visualizer=LayoutVisualizer()
pdf_extractor = PDFExtractor('PDFPlumber')

cvt_labels_from_coco_to_docbank = {
    "Title": 'title', 
    "Author": 'author', 
    "Abstract": 'abstract', 
    "Section": 'section',
    "Paragraph": 'paragraph',
    "ListItem": 'list', 
    "Footer": 'footer', 
    "BibItem": 'reference',
    "Equation": 'equation',
    "Figure": 'figure', 
    "Table": 'table', 
    "Caption": 'caption'
}

gt_dataset = COCOGroundTruth(
    coco_base='dataset',
    allowed_cat_ids=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    use_numerical_id=False,
    cvt_category_name=cvt_labels_from_coco_to_docbank,
)

loading annotations into memory...
Done (t=0.00s)
creating index...
index created!


### Extra Configurations 

In [5]:
layoutlm_label_map = {
    "paragraph": 0, "title": 1, "equation": 2,
    "reference": 3, "section": 4, "list": 5,
    "table": 6, "caption": 7, "author": 8,
    "abstract": 9, "footer": 10, "date": 11, 
}

docbank_orig_label_map = ["abstract", "author",  "caption",  
                          "date",  "equation",  "figure",  
                          "footer",  "list",  "paragraph",  
                          "reference",  "section",  "table",  "title"]

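The two label maps above are used in different directions: the block classifiers receive an inverted index-to-name dictionary, while the original DocBank list is passed as is. The sketch below shows how such maps decode class indices. The `decode` helper is hypothetical, and reading the list by position is an assumption about how `LayoutLMTokenPredictor` uses it.

```python
# Copied from the configuration cell above
layoutlm_label_map = {
    "paragraph": 0, "title": 1, "equation": 2,
    "reference": 3, "section": 4, "list": 5,
    "table": 6, "caption": 7, "author": 8,
    "abstract": 9, "footer": 10, "date": 11,
}
docbank_orig_label_map = ["abstract", "author", "caption",
                          "date", "equation", "figure",
                          "footer", "list", "paragraph",
                          "reference", "section", "table", "title"]

# Invert name -> index into index -> name, as done when the block classifiers are built
id_to_name = {idx: name for name, idx in layoutlm_label_map.items()}


def decode(class_ids, id_to_name):
    """Hypothetical helper: translate predicted class indices into category names."""
    return [id_to_name[i] for i in class_ids]


print(decode([1, 8, 9, 0], id_to_name))  # ['title', 'author', 'abstract', 'paragraph']

# Assumption: a list acts as an index -> name map through positional indexing
print(docbank_orig_label_map[8])  # paragraph

# Categories in the original DocBank list that the block classifier map lacks
print(sorted(set(docbank_orig_label_map) - set(layoutlm_label_map)))  # ['figure']
```

The block classifier map has no `figure` entry, which is consistent with figure blocks being handled separately from text classification.
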
## Export Ground-Truth Dataset

In [6]:
gt_dataset.export_and_visualize(PDFExtractor('PDFPlumber'), visualizer, SAVE_PATH)

100%|██████████| 8/8 [00:43<00:00,  5.46s/it]


## Model V1

This model extracts block categories with a text classification model:
1. Load blocks from the ground truth and find the tokens within each block.
2. Classify each block's text with a language model (BERT or LayoutLM).
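
The first step, finding the tokens within each block, can be pictured as a geometric containment test. The following is only a minimal sketch under the assumption that a token belongs to the first block containing its center; the `Box` class and `group_tokens` function are hypothetical and are not part of `scienceparseplus`.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle; `text` is only used for tokens."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""


def center_inside(token, block):
    """Return True if the center of `token` lies inside `block`."""
    cx = (token.x1 + token.x2) / 2
    cy = (token.y1 + token.y2) / 2
    return block.x1 <= cx <= block.x2 and block.y1 <= cy <= block.y2


def group_tokens(blocks, tokens):
    """Assign each token to the first block containing its center."""
    bundles = [[] for _ in blocks]
    remaining = []
    for token in tokens:
        for i, block in enumerate(blocks):
            if center_inside(token, block):
                bundles[i].append(token)
                break
        else:
            # No block contains this token
            remaining.append(token)
    return bundles, remaining


blocks = [Box(0, 0, 100, 20), Box(0, 30, 100, 80)]
tokens = [Box(5, 5, 30, 15, "Deep"), Box(35, 5, 70, 15, "Parsing"),
          Box(5, 40, 40, 50, "We"), Box(5, 90, 40, 98, "3")]

bundles, remaining = group_tokens(blocks, tokens)
# The text of each block is what a block classifier would receive in step 2
block_texts = [" ".join(t.text for t in bundle) for bundle in bundles]
print(block_texts)                  # ['Deep Parsing', 'We']
print([t.text for t in remaining])  # ['3']
```
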

In [7]:
class V1ModelEvaluator(BaseModelEvaluator):
    
    def layout_extractor(self, page_image, pdf_info, gt_layout, **kwargs):
        
        block_classifier = self.block_classifier
        
        # Start from the ground-truth blocks; only the categories are predicted
        pred_layout = deepcopy(gt_layout)
        width, height = pdf_info['width'], pdf_info['height']
        
        for bundle in pred_layout.bundles: 
            
            if bundle.block.type == 'caption':
                # Classify the caption without its first three tokens, then give
                # those tokens the type predicted for the first classified token
                detected_tokens = block_classifier.detect(bundle.tokens[3:], width, height) 
                detected_tokens = [ele.set(type=detected_tokens[0].type) for ele in bundle.tokens[:3]] + detected_tokens
            elif bundle.block.type == 'figure':
                # Figure blocks are not classified; their tokens keep the 'figure' label
                detected_tokens = [ele.set(type='figure') for ele in bundle.tokens]
            else:
                detected_tokens = block_classifier.detect(bundle.tokens, width, height) 

            # The block takes the type of its first token
            bundle.block.type = detected_tokens[0].type 
            bundle.tokens = detected_tokens
        
        return pred_layout

### Using LayoutLM as the Language Model

In [8]:
block_classifier_layoutlm = LayoutLMBlockPredictor(
    model_path = "models/docbank-classifier/layoutlm-v2/checkpoint-8000/",
    tokenizer_path = "microsoft/layoutlm-base-uncased",
    label_map={idx: name for name, idx in layoutlm_label_map.items()}
) # For block text classification, pre-trained on 

evaluator = V1ModelEvaluator(
    pdf_extractor = pdf_extractor,
    gt_dataset    = gt_dataset,
    block_predictor = None,
    block_classifier= block_classifier_layoutlm,
    model_name    ='v1-layoutlm',
    visualizer    = visualizer
)

In [9]:
evaluator.eval(save_path=SAVE_PATH, save_name='v1-layoutlm')

100%|██████████| 8/8 [01:56<00:00, 14.52s/it]


In [10]:
del block_classifier_layoutlm
del evaluator

### Using BERT as the Language Model

In [11]:
block_classifier_bert = LayoutLMBlockPredictor(
    model_path = "models/docbank-classifier/bert/checkpoint-44000/",
    tokenizer_path = "bert-base-uncased",
    label_map={idx: name for name, idx in layoutlm_label_map.items()},
    use_basic_lm=True
)

evaluator = V1ModelEvaluator(
    pdf_extractor = pdf_extractor,
    gt_dataset    = gt_dataset,
    block_predictor = None,
    block_classifier= block_classifier_bert,
    model_name    ='v1-bert',
    visualizer    = visualizer
)

evaluator.eval(save_path=SAVE_PATH, save_name='v1-bert')

100%|██████████| 8/8 [02:03<00:00, 15.41s/it]


In [12]:
del block_classifier_bert
del evaluator

## Model V2

This is the pipeline model: vision models detect each block's position (x, y, w, h), and language models classify the blocks.

1. Detect blocks using Faster/Mask R-CNN models and find the tokens within each block:
    1. A Mask R-CNN trained on the PubLayNet dataset detects textual regions (paragraph, title, list) and non-textual regions (table, figure).
    2. A Faster R-CNN trained on the MFD dataset detects (display) equation regions.
2. Classify each block's text with a language model (BERT or LayoutLM).
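
The control flow of this pipeline can be sketched without any real detector. In the toy version below, detected blocks are plain tuples, the second detector only sees tokens left over by the first, and only textual blocks are re-labeled by the classifier. All names (`assign`, `run_pipeline`, `toy_classify`) are hypothetical stand-ins for the detection and classification models, and the check for empty blocks is an addition of this sketch.

```python
def assign(blocks, tokens):
    """blocks: (label, x1, y1, x2, y2) tuples; tokens: (text, cx, cy) tuples."""
    bundles, remaining = [], list(tokens)
    for label, x1, y1, x2, y2 in blocks:
        inside = [t for t in remaining if x1 <= t[1] <= x2 and y1 <= t[2] <= y2]
        remaining = [t for t in remaining if t not in inside]
        bundles.append({"type": label, "tokens": inside})
    return bundles, remaining


def run_pipeline(general_blocks, equation_blocks, tokens, classify,
                 textual=("text", "title", "list")):
    # Stage 1: general layout blocks claim their tokens first
    general, rest = assign(general_blocks, tokens)
    # Stage 2: equation blocks only see the tokens that are still unassigned
    equations, rest = assign(equation_blocks, rest)
    merged = general + equations
    # Stage 3: a language model re-labels textual blocks only
    for bundle in merged:
        if bundle["type"] in textual and bundle["tokens"]:
            bundle["type"] = classify(bundle["tokens"])
    return merged, rest


def toy_classify(tokens):
    """Stand-in for BERT/LayoutLM: numbered blocks are treated as section headers."""
    return "section" if tokens[0][0].isdigit() else "paragraph"


general_blocks = [("title", 0, 0, 100, 20), ("text", 0, 30, 100, 60),
                  ("figure", 0, 100, 100, 200)]
equation_blocks = [("equation", 0, 70, 100, 90)]
tokens = [("1", 10, 10), ("Introduction", 50, 10), ("We", 10, 40),
          ("E=mc^2", 50, 80), ("7", 50, 250)]

merged, rest = run_pipeline(general_blocks, equation_blocks, tokens, toy_classify)
print([b["type"] for b in merged])  # ['section', 'paragraph', 'figure', 'equation']
print(rest)                         # [('7', 50, 250)]
```
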

In [40]:
block_predictorA = ObjectDetectionBlockPredictor(
     config_path ='models/publaynet/mask_rcnn_R_50_FPN_3x/config.yaml',
     model_path  ='models/publaynet/mask_rcnn_R_50_FPN_3x/model_final.pth',
     extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.55, 
                   "MODEL.ROI_HEADS.NMS_THRESH_TEST", 0.4],
     label_map   ={0: "text", 1: "title", 2: "list", 3:"table", 4:"figure"}
    
) # pre-trained on publaynet

block_predictorB = ObjectDetectionBlockPredictor(
     config_path ='models/MFD/fast_rcnn_R_50_FPN_3x/config.yaml',
     model_path  ='models/MFD/fast_rcnn_R_50_FPN_3x/model_final.pth',
     extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.6, 
                   "MODEL.ROI_HEADS.NMS_THRESH_TEST", 0.2],
     label_map   ={1: "equation"}
) # For detecting equations, pre-trained on the MFD Dataset 

In [41]:
class V2ModelEvaluator(BaseModelEvaluator):
    
    textual_blocks=['text', 'title', 'list']
    
    def layout_extractor(self, page_image, pdf_info, gt_layout, **kwargs):
        
        block_predictorA, block_predictorB = self.block_predictor
        
        # Detect blocks from the page image; the equation detector only
        # receives the tokens left over by the general detector
        general_layouts = block_predictorA.detect(page_image, pdf_info['tokens'])
        equation_layouts = block_predictorB.detect(page_image, general_layouts.remaining_tokens)
        
        merged_layouts = HierarchicalLayout(
            bundles=general_layouts.bundles + equation_layouts.bundles,
            remaining_tokens=equation_layouts.remaining_tokens
        )
    
        # Predict text categories for textual blocks only
        block_classifier = self.block_classifier
        
        for bundle in merged_layouts.bundles: 
            if bundle.block.type not in self.textual_blocks: 
                continue
                    
            width, height = pdf_info['width'], pdf_info['height']
            detected_tokens = block_classifier.detect(bundle.tokens, width, height) 

            bundle.block.type = detected_tokens[0].type 
            bundle.tokens = detected_tokens
        
        merged_layouts.set_tokens_with_block_class()
        return merged_layouts

### Using LayoutLM as the Language Model

In [42]:
block_classifier_layoutlm = LayoutLMBlockPredictor(
    model_path = "models/docbank-classifier/layoutlm-v2/checkpoint-8000/",
    tokenizer_path = "microsoft/layoutlm-base-uncased",
    label_map={idx: name for name, idx in layoutlm_label_map.items()}
) # For block text classification, pre-trained on 

evaluator = V2ModelEvaluator(
    pdf_extractor = pdf_extractor,
    gt_dataset    = gt_dataset,
    block_predictor = [block_predictorA, block_predictorB],
    block_classifier= block_classifier_layoutlm,
    model_name    ='v2-layoutlm',
    visualizer    = visualizer
)

In [43]:
evaluator.eval(save_path=SAVE_PATH, save_name='v2-layoutlm')

100%|██████████| 8/8 [12:20<00:00, 92.59s/it] 


In [44]:
del block_classifier_layoutlm
del evaluator

### Using BERT as the Language Model

In [50]:
block_classifier_bert = LayoutLMBlockPredictor(
    model_path = "models/docbank-classifier/bert/checkpoint-44000/",
    tokenizer_path = "bert-base-uncased",
    label_map={idx: name for name, idx in layoutlm_label_map.items()},
    use_basic_lm=True
)

evaluator = V2ModelEvaluator(
    pdf_extractor = pdf_extractor,
    gt_dataset    = gt_dataset,
    block_predictor = [block_predictorA, block_predictorB],
    block_classifier= block_classifier_bert,
    model_name    ='v2-bert',
    visualizer    = visualizer
)

evaluator.eval(save_path=SAVE_PATH, save_name='v2-bert')

100%|██████████| 8/8 [12:21<00:00, 92.68s/it] 


In [51]:
del block_classifier_bert
del evaluator

In [52]:
del block_predictorA
del block_predictorB

## Model V3

This directly uses the LayoutLM model for predicting categories for all the tokens on the page. 

1. Load all the tokens of a PDF page. 
2. Use LayoutLM to predict the category of each token. 
    - The max input text length is 512. 
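
A page often has more tokens than fit in one input of length 512, so the input has to be truncated or split. The notebook does not state how `LayoutLMTokenPredictor` handles this; the sketch below assumes simple non-overlapping windows and uses hypothetical helpers (`chunk_tokens`, `predict_page`, `fake_predict`). The limit normally applies to subword pieces plus special tokens, so the usable number of PDF tokens per window may be smaller than 512.

```python
def chunk_tokens(tokens, max_len=512):
    """Split a token list into consecutive windows of at most `max_len` items."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def predict_page(tokens, predict_chunk, max_len=512):
    """Run `predict_chunk` on every window and concatenate the per-token labels."""
    labels = []
    for chunk in chunk_tokens(tokens, max_len):
        labels.extend(predict_chunk(chunk))
    return labels


def fake_predict(chunk):
    """Stand-in for the model: label every token as 'paragraph'."""
    return ["paragraph"] * len(chunk)


page_tokens = [f"w{i}" for i in range(1200)]
chunk_sizes = [len(c) for c in chunk_tokens(page_tokens)]
print(chunk_sizes)  # [512, 512, 176]

page_labels = predict_page(page_tokens, fake_predict)
print(len(page_labels) == len(page_tokens))  # True
```
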

In [21]:
class V3ModelEvaluator(BaseModelEvaluator):
        
    def layout_extractor(self, page_image, pdf_info, gt_layout, **kwargs):
        
        width, height = pdf_info['width'], pdf_info['height']
        tokens = pdf_info['tokens']

        results = self.block_classifier.detect(tokens, width, height)
        final_layouts = HierarchicalLayout(bundles=[], remaining_tokens=results)

        return final_layouts

In [22]:
layoutlm_predictor = LayoutLMTokenPredictor(
        "models/dockbank-original/DocBank_LayoutLM_base", 
        max_seq_length=512,
        tokenizer_path="models/dockbank-original/DocBank_LayoutLM_base",
        label_map=docbank_orig_label_map
    )

evaluator = V3ModelEvaluator(
    pdf_extractor = pdf_extractor,
    gt_dataset    = gt_dataset,
    block_predictor = None,
    block_classifier= layoutlm_predictor,
    model_name    ='v3',
    visualizer    = visualizer
)

In [23]:
evaluator.eval(save_path=SAVE_PATH, save_name='v3')

100%|██████████| 8/8 [02:32<00:00, 19.05s/it]


## Results Aggregation

In [53]:
from glob import glob 
import cv2 
import os 
import numpy as np
import pandas as pd 
from sklearn.metrics import precision_recall_fscore_support


def combine_viz(all_viz):    
    return np.hstack([cv2.imread(viz) for viz in all_viz])

def preprocess(df, drop_na_cat=True):
    if drop_na_cat:
        return df[~df.is_block & ~df.category.isna()]
    else:
        return df[~df.is_block]
    
def show_detailed_statistics(prediction_table, target='gt'):
    keys = prediction_table['gt'].unique()
    scores = precision_recall_fscore_support(prediction_table['gt'], prediction_table[target], 
                                             labels=list(keys), zero_division=0)
    df = pd.DataFrame(scores, columns=keys, index=['precision', 'recall', 'f-score', 'support'])
    return df.applymap(lambda x: f"{x:.2%}")
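
`show_detailed_statistics` applies the percent format to every row, so the `support` row (a token count) is also printed as a percentage, for example `10600.00%` for 106 tokens. The dependency-free sketch below shows what each row means and keeps `support` as an integer. `per_class_scores` is a hypothetical helper that mirrors the per-label precision, recall, and F1 definitions with zero used for undefined ratios; it is not the scikit-learn implementation. Recent pandas releases also deprecate `applymap` in favor of `DataFrame.map`.

```python
from collections import Counter


def per_class_scores(gt, pred):
    """Per-label precision, recall, F1 (as percent strings) and integer support."""
    labels = list(dict.fromkeys(gt))  # unique ground-truth labels, in first-seen order
    true_pos = Counter(g for g, p in zip(gt, pred) if g == p)
    n_pred, n_gt = Counter(pred), Counter(gt)

    rows = {}
    for label in labels:
        precision = true_pos[label] / n_pred[label] if n_pred[label] else 0.0
        recall = true_pos[label] / n_gt[label]
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        rows[label] = {
            "precision": f"{precision:.2%}",
            "recall": f"{recall:.2%}",
            "f-score": f"{f_score:.2%}",
            "support": n_gt[label],  # kept as a count, not a percentage
        }
    return rows


gt = ["title", "paragraph", "paragraph", "paragraph", "caption"]
pred = ["title", "paragraph", "paragraph", "caption", "paragraph"]

scores = per_class_scores(gt, pred)
for label, row in scores.items():
    print(label, row)
```
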

### Combine all viz images for a page

In [54]:
VIZ_WRITE_PATH = f"{SAVE_PATH}/all_viz"
os.makedirs(VIZ_WRITE_PATH, exist_ok=True)

for paper_path in glob(f"{SAVE_PATH}/*"):
    
    if 'all_viz' in paper_path: continue 
    
    for page in glob(f"{paper_path}/*"):
        
        all_viz = sorted(glob(f"{page}/viz/*.png"))
        
        page_viz_save_name = page.replace(SAVE_PATH, '').replace('/', '-') + ".png"
        page_viz = combine_viz(all_viz)
        cv2.imwrite(f"{VIZ_WRITE_PATH}/{page_viz_save_name}", page_viz)   

### Get score reports for different models 

In [55]:
result_tables = []
for paper_path in sorted(glob(f"{SAVE_PATH}/*")):
    
    if 'all_viz' in paper_path: continue 
        
    for page in sorted(glob(f"{paper_path}/*")):
        
        layout_pred = {}
        
        all_layout = glob(f"{page}/layout/*.csv")
        
        for layout_table in all_layout:
            
            name = os.path.basename(layout_table).replace('.csv', '')
            
            df = preprocess(pd.read_csv(layout_table), name=='gt').rename(columns={'category':name})
            
            layout_pred[name] = df
        
        assert 'gt' in layout_pred
        
        result_table = layout_pred.pop('gt')
        
        for name, table in layout_pred.items():
            result_table = result_table.merge(table[['id', name]].fillna('paragraph'), on='id', how='left')
        
        result_tables.append(result_table)
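
The merge logic above can be illustrated on plain Python data. In this sketch, predictions with a missing category fall back to `paragraph` (as with `fillna('paragraph')`), while ground-truth tokens with no matching prediction row stay unlabeled after the left join and count as errors. `merge_predictions` and `accuracy` are hypothetical helpers, and the sketch assumes token ids are unique.

```python
def merge_predictions(gt_rows, pred_rows, default="paragraph"):
    """Left-join predictions onto ground truth by token id.

    gt_rows:   list of (token_id, gt_label)
    pred_rows: list of (token_id, label or None)
    """
    pred_by_id = {tid: (label if label is not None else default)
                  for tid, label in pred_rows}
    # Tokens without a prediction row get None, like NaN after a left merge
    return [(tid, gt, pred_by_id.get(tid)) for tid, gt in gt_rows]


def accuracy(merged):
    """Fraction of ground-truth tokens whose predicted label matches."""
    return sum(gt == pred for _, gt, pred in merged) / len(merged)


gt_rows = [(0, "title"), (1, "paragraph"), (2, "caption"), (3, "paragraph")]
pred_rows = [(0, "title"), (1, None), (2, "paragraph")]

merged = merge_predictions(gt_rows, pred_rows)
print(merged)
print(accuracy(merged))  # 0.5
```
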

In [56]:
all_res = pd.concat(result_tables)

In [57]:
for name in sorted(all_res.columns[10:]):
    print(name)
    print((all_res['gt'] == all_res[name]).mean())
    display(show_detailed_statistics(all_res, name))
    print("====")

v1-bert
0.9193446862081285


|  | title | author | abstract | footer | section | paragraph | equation | caption | table | figure | reference | list |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| precision | 85.71% | 100.00% | 80.86% | 88.73% | 91.00% | 91.32% | 99.70% | 63.41% | 99.84% | 100.00% | 99.74% | 71.81% |
| recall | 96.23% | 100.00% | 92.24% | 61.31% | 88.64% | 96.93% | 94.02% | 85.21% | 49.29% | 100.00% | 100.00% | 38.91% |
| f-score | 90.67% | 100.00% | 86.18% | 72.52% | 89.80% | 94.04% | 96.78% | 72.71% | 66.00% | 100.00% | 99.87% | 50.47% |
| support | 10600.00% | 35700.00% | 137900.00% | 41100.00% | 30800.00% | 3791000.00% | 177400.00% | 154800.00% | 513300.00% | 249800.00% | 680700.00% | 55000.00% |


====
v1-layoutlm
0.8759122845817526


|  | title | author | abstract | footer | section | paragraph | equation | caption | table | figure | reference | list |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| precision | 82.35% | 100.00% | 38.28% | 21.61% | 94.57% | 91.53% | 100.00% | 75.22% | 96.25% | 100.00% | 98.45% | 70.90% |
| recall | 79.25% | 100.00% | 100.00% | 78.83% | 96.10% | 89.39% | 93.12% | 82.56% | 52.48% | 100.00% | 100.00% | 41.64% |
| f-score | 80.77% | 100.00% | 55.37% | 33.93% | 95.33% | 90.45% | 96.44% | 78.72% | 67.93% | 100.00% | 99.22% | 52.46% |
| support | 10600.00% | 35700.00% | 137900.00% | 41100.00% | 30800.00% | 3791000.00% | 177400.00% | 154800.00% | 513300.00% | 249800.00% | 680700.00% | 55000.00% |


====
v2-bert
0.8587468739898947


|  | title | author | abstract | footer | section | paragraph | equation | caption | table | figure | reference | list |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| precision | 98.08% | 100.00% | 81.21% | 14.29% | 97.46% | 93.63% | 95.93% | 7.45% | 53.07% | 80.77% | 99.76% | 61.85% |
| recall | 96.23% | 89.36% | 90.57% | 1.70% | 87.34% | 97.31% | 83.77% | 3.29% | 92.50% | 69.94% | 49.89% | 38.91% |
| f-score | 97.14% | 94.38% | 85.64% | 3.04% | 92.12% | 95.43% | 89.44% | 4.57% | 67.44% | 74.96% | 66.52% | 47.77% |
| support | 10600.00% | 35700.00% | 137900.00% | 41100.00% | 30800.00% | 3791000.00% | 177400.00% | 154800.00% | 513300.00% | 249800.00% | 680700.00% | 55000.00% |


====
v2-layoutlm
0.8180024157465848


|  | title | author | abstract | footer | section | paragraph | equation | caption | table | figure | reference | list |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| precision | 100.00% | 100.00% | 40.49% | 4.06% | 95.74% | 94.24% | 99.41% | 33.97% | 53.09% | 80.77% | 97.17% | 73.45% |
| recall | 79.25% | 94.12% | 98.33% | 13.63% | 94.81% | 89.61% | 85.91% | 14.92% | 92.50% | 69.94% | 49.89% | 62.36% |
| f-score | 88.42% | 96.97% | 57.36% | 6.25% | 95.27% | 91.87% | 92.17% | 20.74% | 67.46% | 74.96% | 65.93% | 67.45% |
| support | 10600.00% | 35700.00% | 137900.00% | 41100.00% | 30800.00% | 3791000.00% | 177400.00% | 154800.00% | 513300.00% | 249800.00% | 680700.00% | 55000.00% |


====
v3
0.8846566067266634


|  | title | author | abstract | footer | section | paragraph | equation | caption | table | figure | reference | list |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| precision | 100.00% | 100.00% | 100.00% | 77.86% | 91.54% | 85.82% | 100.00% | 93.37% | 91.84% | 12.00% | 99.65% | 81.43% |
| recall | 79.25% | 91.04% | 78.25% | 24.82% | 94.81% | 99.36% | 77.45% | 84.56% | 46.89% | 0.24% | 100.00% | 99.64% |
| f-score | 88.42% | 95.31% | 87.79% | 37.64% | 93.14% | 92.10% | 87.29% | 88.75% | 62.08% | 0.47% | 99.82% | 89.62% |
| support | 10600.00% | 35700.00% | 137900.00% | 41100.00% | 30800.00% | 3791000.00% | 177400.00% | 154800.00% | 513300.00% | 249800.00% | 680700.00% | 55000.00% |


====


### Pseudo-fix the caption issues for v2 models

Because of incorrect labeling schemas in the training dataset, the classification accuracy for captions is extremely low on the evaluation set.

Copying the ground-truth labels for captions gives a rough sense of the overall accuracy the V2 models would reach if all captions were detected correctly.
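
The effect of this adjustment can be seen on a toy example. The sketch below uses hypothetical helpers (`accuracy`, `oracle_captions`) and made-up labels; it only illustrates the idea of overwriting predictions wherever the ground truth is `caption`.

```python
def accuracy(gt, pred):
    """Fraction of tokens whose predicted label equals the ground truth."""
    return sum(g == p for g, p in zip(gt, pred)) / len(gt)


def oracle_captions(gt, pred, label="caption"):
    """Copy the ground-truth label into the prediction for every caption token."""
    return [label if g == label else p for g, p in zip(gt, pred)]


gt = ["caption", "caption", "paragraph", "table", "paragraph"]
pred = ["paragraph", "table", "paragraph", "table", "caption"]

fixed = oracle_captions(gt, pred)
print(accuracy(gt, pred))   # 0.4
print(accuracy(gt, fixed))  # 0.8
print(fixed)                # ['caption', 'caption', 'paragraph', 'table', 'caption']
```

Caption recall becomes 100% by construction, but tokens wrongly predicted as captions are left untouched, so caption precision can stay below 100%.
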

In [58]:
for name in sorted(['v2-bert', 'v2-layoutlm']):
    print(name)
    
    tmp = all_res[name].copy()
    
    col = all_res[name].copy()
    col[all_res['gt']=='caption'] = 'caption'
    all_res[name] = col
    
    print((all_res['gt'] == all_res[name]).mean())
    display(show_detailed_statistics(all_res, name))
    print("====")
    
    all_res[name] = tmp

v2-bert
0.88421428692945


|  | title | author | abstract | footer | section | paragraph | equation | caption | table | figure | reference | list |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| precision | 98.08% | 100.00% | 81.21% | 14.29% | 100.00% | 97.03% | 95.93% | 70.94% | 53.66% | 81.10% | 99.76% | 61.85% |
| recall | 96.23% | 89.36% | 90.57% | 1.70% | 87.34% | 97.31% | 83.77% | 100.00% | 92.50% | 69.94% | 49.89% | 38.91% |
| f-score | 97.14% | 94.38% | 85.64% | 3.04% | 93.24% | 97.17% | 89.44% | 83.00% | 67.92% | 75.11% | 66.52% | 47.77% |
| support | 10600.00% | 35700.00% | 137900.00% | 41100.00% | 30800.00% | 3791000.00% | 177400.00% | 154800.00% | 513300.00% | 249800.00% | 680700.00% | 55000.00% |


====
v2-layoutlm
0.840407614705432


|  | title | author | abstract | footer | section | paragraph | equation | caption | table | figure | reference | list |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| precision | 100.00% | 100.00% | 40.49% | 4.06% | 97.99% | 97.49% | 99.41% | 77.52% | 53.69% | 81.10% | 97.17% | 73.45% |
| recall | 79.25% | 94.12% | 98.33% | 13.63% | 94.81% | 89.61% | 85.91% | 100.00% | 92.50% | 69.94% | 49.89% | 62.36% |
| f-score | 88.42% | 96.97% | 57.36% | 6.25% | 96.37% | 93.38% | 92.17% | 87.33% | 67.94% | 75.11% | 65.93% | 67.45% |
| support | 10600.00% | 35700.00% | 137900.00% | 41100.00% | 30800.00% | 3791000.00% | 177400.00% | 154800.00% | 513300.00% | 249800.00% | 680700.00% | 55000.00% |


====
