<a href="https://colab.research.google.com/github/ruifcruz/sroie-on-layoutlm/blob/main/LayoutLM_fine_tunning_for_SROIE_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine tune SROIE on LayoutLM
This notebook is an effort to fine tune the LayoutLM model for the SROIE dataset. The model is presented in the paper "[LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318)" by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei and Ming Zhou. 

- GitHub repo [here](https://github.com/microsoft/unilm/tree/master/layoutlm).

- Read about the SROIE competition and dataset [here](https://rrc.cvc.uab.es/?ch=13).

- Inspired by the Kaggle notebook [here](https://www.kaggle.com/jpmiller/layoutlm-starter).

## Notes:
- The repo includes a pre-processing script and fine-tuning code for the FUNSD dataset, but not for the SROIE dataset (even though the paper reports results on SROIE). This notebook intends to fill that gap.

- I used my Google Drive to manage the files. If you want to do the same, just change the folder names (both the one where you keep the SROIE files and the one where you keep the LayoutLM files).

- The best F1 scores I got on the predictions were between 93% and 94.5%, which is a bit less than the value presented in the paper (~94%/95%). The differences may be explained by:
  - different parameters (I haven't done an exhaustive grid search)
  - different sampling
  - different pre-processing. This one is far from perfect: some labels and invoices are lost along the way.
  - a different OCR base. As far as I understand, the authors ran their own OCR, while I used the one provided with the dataset
  - the label "company address", which I had difficulties with and therefore dropped
  - any other differences, as the paper doesn't explain this fine-tuning in detail

- Make sure you have a GPU enabled in the notebook (Edit -> Notebook settings)

- Yes, I know, the code is horrible and badly explained, sorry for that. Nevertheless, I hope it helps somehow.

# 1. Pre-process dataset

In [None]:
# Imports  
import os
import pandas as pd
import glob
import json 
import ast
import re
import random

In [None]:
# Connection to google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Define path for the dataset files (you should previously download the dataset from the link given at the header of the notebook)
# This is the folder with the files that contain the bounding boxes and the words
spath_words = '/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/0325updated.task1train(626p)/'
os.chdir(spath_words)
# Collect one row per invoice; the dataframe with the bounding boxes and words is built after the loop
# (DataFrame.append was removed in pandas 2.0)
rows_sentences = []

# Loops over every file in the folder
for file in glob.glob("*.txt"):
  try:
    # Treat each invoice as a sentence and a row of the df
    sfullpath = spath_words + file
    df_file = pd.read_csv(sfullpath, header=None, names=['x0', 'y0', 'x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'words'])
    if not df_file['words'].isnull().values.any():
      sentence_list = [str(i) for i in df_file['words']]
      bbox_list = []
      for index, row in df_file.iterrows():
        # Keep only the top-left (x0, y0) and bottom-right (x2, y2) corners
        bbox_list.append([row['x0'],row['y0'],row['x2'],row['y2']])
      rows_sentences.append({'filename':file, 'sentence':sentence_list, 'bboxes':bbox_list})
  except Exception as e:
    # A few files fail to parse; we skip them and print the associated error
    print(file + " | " + repr(e))

df_sentences = pd.DataFrame(rows_sentences, columns=['filename', 'sentence', 'bboxes'])

X51006619545.txt | ParserError('Error tokenizing data. C error: EOF inside string starting at row 78',)
X51006619785.txt | ParserError('Error tokenizing data. C error: EOF inside string starting at row 77',)
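
The two `ParserError` messages above say that the CSV parser hit the end of the file while inside a quoted string, which most likely means an OCR line contains an unmatched double quote. A minimal sketch of one way around this, assuming every line has the form `x0,y0,x1,y1,x2,y2,x3,y3,text`: split on the first eight commas only, so commas and quotes inside the text are left alone. The helper name `parse_sroie_line` and the sample line are hypothetical and not part of the notebook.

```python
def parse_sroie_line(line):
    """Parse 'x0,y0,x1,y1,x2,y2,x3,y3,text' into (bbox, text).

    Hypothetical helper: splits on the first 8 commas only, so commas or
    quote characters inside the text do not confuse the parser.
    """
    parts = line.rstrip("\n").split(",", 8)
    if len(parts) < 9:
        raise ValueError("expected 8 coordinates followed by text: %r" % line)
    coords = [int(p) for p in parts[:8]]
    # Keep the top-left (x0, y0) and bottom-right (x2, y2) corners,
    # as the notebook does.
    bbox = [coords[0], coords[1], coords[4], coords[5]]
    return bbox, parts[8]


# Made-up line with a comma and quotes inside the text
bbox, text = parse_sroie_line('72,25,326,25,326,64,72,64,TAN "WOON" YANN, LTD')
print(bbox, text)
```

With this approach the text field is kept verbatim, so the two skipped invoices could in principle be recovered instead of dropped.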


In [None]:
# Define path for the dataset files (you should previously download the dataset from the link given at the header of the notebook)
# This is the folder with the files that contain the values (company name, date, address and total)
spath_labels = '/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/0325updated.task2train(626p)/'
os.chdir(spath_labels)
# Collect one row per invoice; the dataframe with the invoice tags is built after the loop
# (DataFrame.append was removed in pandas 2.0)
rows_labels = []

for file in glob.glob("*.txt"):
  try:
    with open(file, 'r') as fileread:
      data = json.load(fileread)
    new_row = {'filename':file, 'value_company':data['company'], 'value_date':data['date'], 'value_address':data['address'], 'value_total':data['total']}
    rows_labels.append(new_row)
  except Exception as e:
    # Files with a missing field (e.g. no 'address' key) are skipped and reported
    print(file + " | " + repr(e))

df_labels = pd.DataFrame(rows_labels, columns=['filename', 'value_company', 'value_date', 'value_address', 'value_total'])

X51005663280(1).txt | KeyError('address',)
X51005663280.txt | KeyError('address',)
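
The two `KeyError` messages above come from label files that have no `address` key, so the whole invoice is skipped. A small sketch of a more tolerant loader, assuming it is acceptable to keep such invoices with a missing value: `dict.get` returns `None` instead of raising. The helper `load_labels` and the sample JSON are hypothetical.

```python
import json

FIELDS = ("company", "date", "address", "total")


def load_labels(json_text):
    """Hypothetical helper: return the four SROIE fields, None for missing keys."""
    data = json.loads(json_text)
    return {"value_" + field: data.get(field) for field in FIELDS}


# Made-up label file without an 'address' entry
row = load_labels('{"company": "ACME SDN BHD", "date": "01/01/2018", "total": "9.90"}')
print(row)
```

Rows with a `None` value would then need to be skipped for that particular field during labeling.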


In [None]:
# Now let's merge the two dataframes based on the filename
df = pd.merge(df_sentences,df_labels,on='filename')

In [None]:
# In case you want to store the df on drive (to avoid running the previous cells again and again), just uncomment this cell
# os.chdir('/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/')
# df.to_csv('df.csv')

In [None]:
# In case the df is stored on drive, just uncomment this cell
# df = pd.read_csv('/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/df.csv')
# df = df.drop(['Unnamed: 0'], axis=1)

In [None]:
# Parse the list columns back from strings (only needed when the df was reloaded from the CSV, where lists are stored as text)
df['sentence'] = df['sentence'].map(lambda a: ast.literal_eval(a) if isinstance(a, str) else a)
df['bboxes'] = df['bboxes'].map(lambda a: ast.literal_eval(a) if isinstance(a, str) else a)
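
As a small illustration of why this parsing step exists: a list written to a CSV cell ends up as its string representation, and `ast.literal_eval` turns that string back into a Python list. The values below are copied from the first invoice shown in this notebook; nothing else is assumed.

```python
import ast

# What a list-valued cell looks like once it has been written to a CSV file
stored = str([[178, 341, 671, 378], [200, 389, 674, 431]])

# literal_eval safely evaluates Python literals (lists, numbers, strings)
restored = ast.literal_eval(stored)
print(type(stored).__name__, "->", type(restored).__name__, restored[0])
```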

In [None]:
df.head(5)

Unnamed: 0,filename,sentence,bboxes,value_company,value_date,value_address,value_total
0,X51005447852.txt,"[99 SPEED MART S/B (519537-X), LOT P.T. 2811, ...","[[178, 341, 671, 378], [200, 389, 674, 431], [...",99 SPEED MART S/B,20-02-18,"LOT P.T. 2811, JALAN ANGSA, TAMAN BERKELEY 411...",9.9
1,X51008114281.txt,"[99 SPEED MART S/B (519537-X), LOP P.T. 2811, ...","[[153, 332, 648, 373], [176, 380, 650, 431], [...",99 SPEED MART S/B,04-06-18,"LOT P.T. 2811, JALAN ANGSA, TAMAN BERKELEY 411...",23.4
2,X51006556852.txt,"[GARDENIA BAKERIES (KL) SDN BHD (139386 X), LO...","[[36, 124, 591, 147], [172, 148, 450, 170], [1...",GARDENIA BAKERIES (KL) SDN BHD,11/09/2017,"LOT 3, JALAN PELABUR 23/1, 40300 SHAH ALAM, SE...",65.5
3,X51007339642(1).txt,"[AIK HUAT HARDWARE, ENTERPRISE (SETIA, ALAM) S...","[[73, 194, 502, 225], [77, 221, 503, 250], [12...",AIK HUAT HARDWARE ENTERPRISE (SETIA ALAM) SDN BHD,28/09/2017,"NO. 17-G, JALAN SETIA INDAH (X) U13/X, SETIA A...",14.0
4,X51005806678(1).txt,"[KAISON FURNISHING SDN BHD, L4-17 (B), UP2-01,...","[[333, 214, 698, 252], [378, 279, 652, 318], [...",KAISON FURNISHING SDN BHD,29-01-18,"L4-17 (B), LEVEL 4, UP2-01, MELAWATI MALL, 355...",7838.8


In [None]:
# Define some auxiliary functions
def a_in_x(A, X):
  '''
  Returns list with indexes of elements of list X which contain A
  '''
  l = []
  for i in range(len(X) - len(A) + 1):
    if str(A[0]) in str(X[i:i+len(A)][0]): 
      l.append(i)
  return l

def flat_list_one_level(l):
  '''
  Flattens list
  Doesn't include second level list of lists, only first level
  '''
  flat_list = []
  for sublist in l:
    if type(sublist) is list:
      for item in sublist:
          flat_list.append(item)
    else:
      flat_list.append(sublist)
  return flat_list

def flat_list_one_level_list_of_lists(l):
  '''
  Flattens list
  Flattens only the first element of the sub-list
  '''
  flat_list = []
  for sublist in l:
    if type(sublist) is list and len(sublist) > 0 and type(sublist[0]) is list:
      for item in sublist:
        flat_list.append(item)
    else:
      flat_list.append(sublist)
  return flat_list
    
def intersperse(lst, item):
  '''
  Places an item between elements of a list
  '''
  result = [item] * (len(lst) * 2 - 1)
  result[0::2] = lst
  return result

def split_box(box, n_splits):
  '''
  Splits a bbox [x0,y0,x1,y1] by its coordinates into n_splits bboxes of equal size
  '''
  boxs_splitted = []
  x0 = box[0]
  y0 = box[1]
  x1 = box[2]
  y1 = box[3]
  width = x1 - x0
  for i_split in range(0, n_splits):
    boxs_splitted.append([x0 + i_split * int(width/n_splits), y0, x0 + (i_split + 1) * int(width/n_splits), y1])
  return boxs_splitted

def split_box_weighted(box, l_splits):
  '''
  Splits a bbox [x0,y0,x1,y1] by its coordinates into len(l_splits)
  The size of each bbox is proportional to the weight present in l_splits
  '''
  boxs_splitted = []
  x0 = box[0]
  y0 = box[1]
  x1 = box[2]
  y1 = box[3]
  width = x1 - x0
  sum_splits = sum(l_splits)
  for i_split in l_splits:
    split_fraction = i_split/sum_splits
    x1f = x0 + int(width * split_fraction)
    boxs_splitted.append([x0, y0, x1f, y1])
    x0 = x1f
  return boxs_splitted
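
To make the helpers above more concrete, here is a short, self-contained sketch that repeats `intersperse` and `split_box_weighted` (lightly condensed) and runs them on made-up inputs. The sample text and box are illustrative only.

```python
def intersperse(lst, item):
    # Places `item` between the elements of `lst`
    result = [item] * (len(lst) * 2 - 1)
    result[0::2] = lst
    return result


def split_box_weighted(box, l_splits):
    # Splits [x0, y0, x1, y1] horizontally, proportionally to the weights
    x0, y0, x1, y1 = box
    width = x1 - x0
    total = sum(l_splits)
    boxes = []
    for weight in l_splits:
        x1f = x0 + int(width * (weight / total))
        boxes.append([x0, y0, x1f, y1])
        x0 = x1f
    return boxes


value = "01/01/2020"
parts = intersperse("Date: 01/01/2020".split(value), value)
print(parts)   # the trailing empty string is what define_labels strips later

# Weights are the character counts of "Date: " (6) and "01/01/2020" (10)
boxes = split_box_weighted([100, 50, 260, 80], [6, 10])
print(boxes)
```

Because each width is truncated with `int`, the last box can end a few pixels short of the original right edge when the widths do not divide evenly.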

In [None]:
# Define function to set the labels to the words
def define_labels(pos, sent, labels, bbox, class_value, classification, label_other = 'O'):
  # Compare as strings: values reloaded from the CSV may have been parsed as numbers
  class_value = str(class_value)
  # pos is a list with the positions of the groups of words associated with this label
  # So this loops over each group of words that has some relation to the label
  for i_pos in pos:
    if sent[i_pos] == class_value:
      # If the group of words is equal to the class value, then the whole group gets the label
      labels[i_pos] = classification
    else:
      # The value is contained within the group of words, so we have to split the group (ex: [... , "Date: 01/01/2020", ...] -> [..., ["Date: ", "01/01/2020"], ...])
      # We start by replacing the group of words with a split list
      sent[i_pos] = intersperse(sent[i_pos].split(class_value), class_value)
      # This split can leave an empty or whitespace-only element at the first or last position, so we have to remove it
      if sent[i_pos][0].isspace() or len(sent[i_pos][0])==0: sent[i_pos] = sent[i_pos][1:]
      if sent[i_pos][-1].isspace() or len(sent[i_pos][-1])==0: sent[i_pos] = sent[i_pos][0:-1]
      # Now we can associate the labels with the correct group of words (ex: for [..., ["Date: ", "01/01/2020"], ...] the labels would be [..., ["O", "B-DATE"], ...])
      labels[i_pos] = [classification if s == class_value else label_other for s in sent[i_pos]]
      # The bounding boxes should also be split
      # Here we do it proportionally to the number of chars of each fragment
      bbox[i_pos] = split_box_weighted(bbox[i_pos], [len(i) for i in sent[i_pos]])

  # The obtained lists now contain some second-level lists, so we have to flatten them
  sent = flat_list_one_level(sent)
  labels = flat_list_one_level(labels)
  bbox = flat_list_one_level_list_of_lists(bbox)
  return sent, labels, bbox
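
The labeling idea above can be sketched for a single OCR line as follows. This is a simplified stand-in, not the notebook's function: the hypothetical helper `label_line` uses `str.partition`, so it only handles the first occurrence of the value and does not split bounding boxes.

```python
def label_line(text, value, tag, other="O"):
    """Hypothetical simplification of define_labels for one OCR line.

    Returns (fragments, labels): the line split around the first occurrence
    of `value`, with `tag` on the matching fragment and `other` elsewhere.
    """
    value = str(value)
    before, match, after = text.partition(value)
    if not match:                      # value not present in this line
        return [text], [other]
    fragments, labels = [], []
    for piece, label in ((before, other), (match, tag), (after, other)):
        if piece.strip():              # drop empty or whitespace-only pieces
            fragments.append(piece)
            labels.append(label)
    return fragments, labels


date_result = label_line("Date: 01/01/2020", "01/01/2020", "B-DATE")
total_result = label_line("TOTAL 9.90", 9.9, "B-TOTAL")
print(date_result)
print(total_result)
```

The second call hints at one way labels can get lost: if the total was parsed as the number `9.9`, it only matches part of the printed `9.90` and leaves a stray `0` fragment behind.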

In [None]:
# Finally the loop to create lists with the sentences and their corresponding labels and bboxes
sentences_list = []
labels_list = []
bbox_list = []
class_other = 'O'
for index, row in df.iterrows():
  labels = [class_other] * len(row['sentence'])
  sent = row['sentence'].copy()
  bbox = row['bboxes'].copy()
  
  # Define labels for date
  class_value = row['value_date']
  classification = 'B-DATE'
  pos = a_in_x([class_value], sent)
  if len(pos) > 0:
    sent, labels, bbox = define_labels(pos, sent, labels, bbox, class_value, classification)
  
  # Define labels for total value
  class_value = row['value_total']
  classification = 'B-TOTAL'
  pos = a_in_x([class_value], sent)
  if len(pos) > 0:
    sent, labels, bbox = define_labels(pos, sent, labels, bbox, class_value, classification)  

  # Define labels for company name
  class_value = row['value_company']
  classification = 'B-COMPANY'
  pos = a_in_x([class_value], sent)
  if len(pos) > 0:
    sent, labels, bbox = define_labels(pos, sent, labels, bbox, class_value, classification)

  # Define labels for address 
  # class_value = row['value_address']
  # classification = 'B-ADDRESS'
  # pos = a_in_x([class_value], sent)
  # if len(pos) > 0:
  #   sent, labels, bbox = define_labels(pos, sent, labels, bbox, class_value, classification)

  # Appends the group of words, labels and bboxes to lists
  sentences_list.append(sent.copy())
  labels_list.append(labels.copy())
  bbox_list.append(bbox.copy())

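At this point each invoice is a list whose elements are groups of words (a whole OCR line, or a fragment of one after the labeling split).

To label at the word level, we split each group into single words and divide its bounding box among them in proportion to the length of each word.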

In [None]:
def break_sentences(sl, ll, bl):
  # sl: sentences per invoice, ll: labels per invoice, bl: bboxes per invoice
  # Returns the word-level sentences, bboxes and labels (in that order)
  sentences_list_temp = []
  bbox_list_temp = []
  labels_list_temp = []
  for sents, labels, boxs in zip(sl, ll, bl):
    sentences_list3 = []
    bbox_list3 = []
    labels_list3 = []
    for sent, label, box in zip(sents, labels, boxs):
      word_tokens = sent.split(" ")
      # Drop empty tokens left by repeated white spaces
      word_tokens = [w for w in word_tokens if (w != "" and w != " ")] 
      sentences_list3.extend(word_tokens)
      splitted_boxes = split_box_weighted(box, [len(i) for i in word_tokens])
      bbox_list3.extend(splitted_boxes)
      # BO scheme: every word of the group gets the same label
      labels_list3.extend([label] * len(word_tokens))
      # BIO scheme (alternative): first word keeps B-, the following words get I-
      #labels_list3.extend([label] + [label.replace('B-','I-')] * (len(word_tokens) - 1))
    sentences_list_temp.append(sentences_list3)
    bbox_list_temp.append(bbox_list3)
    labels_list_temp.append(labels_list3)
  return sentences_list_temp, bbox_list_temp, labels_list_temp
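
The difference between the BO tagging used here and the commented-out BIO alternative can be sketched as follows. The helper `tag_words` is hypothetical; it splits on any whitespace, which is slightly more permissive than the notebook's `split(" ")`.

```python
def tag_words(group, label, scheme="BO"):
    """Hypothetical helper: split a labeled group into words and tag each one."""
    words = group.split()
    if scheme == "BIO" and label.startswith("B-"):
        # First word keeps B-, the remaining words get the matching I- tag
        tags = [label] + ["I-" + label[2:]] * (len(words) - 1)
    else:
        # BO: every word repeats the group's label
        tags = [label] * len(words)
    return list(zip(words, tags))


bo = tag_words("99 SPEED MART S/B", "B-COMPANY")
bio = tag_words("99 SPEED MART S/B", "B-COMPANY", scheme="BIO")
print(bo)
print(bio)
```

Switching to BIO would also add the `I-` tags to the label set, so `labels.txt` would need to contain them.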

In [None]:
sentences_list, bbox_list, labels_list = break_sentences(sentences_list, labels_list, bbox_list)

In [None]:
# Check the first invoice data
for s, l, b in zip(sentences_list[0],labels_list[0],bbox_list[0]):
  print("{}\t\t{}\t\t{}".format(s,l,b))

99		B-COMPANY		[178, 341, 220, 378]
SPEED		B-COMPANY		[220, 341, 326, 378]
MART		B-COMPANY		[326, 341, 411, 378]
S/B		B-COMPANY		[411, 341, 475, 378]
(519537-X)		O		[477, 341, 670, 378]
LOT		O		[200, 389, 329, 431]
P.T.		O		[329, 389, 501, 431]
2811		O		[501, 389, 673, 431]
TAMAN		O		[304, 438, 399, 474]
BERKELEY		O		[399, 438, 551, 474]
41150		O		[256, 489, 441, 530]
KLANG		O		[441, 489, 626, 530]
1249-TMN		O		[233, 538, 391, 575]
PANDAN		O		[391, 538, 509, 575]
CAHAYA		O		[509, 538, 627, 575]
GST		O		[220, 590, 283, 625]
ID.		O		[283, 590, 346, 625]
NO		O		[346, 590, 388, 625]
:		O		[388, 590, 409, 625]
000181747712		O		[409, 590, 662, 625]
INVOICE		O		[198, 689, 336, 728]
NO		O		[336, 689, 375, 728]
:		O		[375, 689, 394, 728]
18314/102/T0422		O		[394, 689, 691, 728]
06:20PM		O		[99, 787, 221, 823]
568008		O		[399, 789, 505, 820]
20-02-18		B-DATE		[660, 786, 805, 823]
8991		O		[97, 885, 179, 922]
NUTRI		O		[179, 885, 282, 922]
PLUS		O		[282, 885, 364, 922]
TELUR		O		[364, 885, 467, 9

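Now everything is ready to write the files in the format expected by the LayoutLM sequence labeling script: `<split>.txt` (word and label), `<split>_box.txt` (word and normalized bbox) and `<split>_image.txt` (word, actual bbox, page size and file name).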

In [None]:
def bbox_string(box, width, length):
    return (
        str(int(1000 * (box[0] / width)))
        + " "
        + str(int(1000 * (box[1] / length)))
        + " "
        + str(int(1000 * (box[2] / width)))
        + " "
        + str(int(1000 * (box[3] / length)))
    )

def actual_bbox_string(box, width, length):
    return (
        str(box[0])
        + " "
        + str(box[1])
        + " "
        + str(box[2])
        + " "
        + str(box[3])
        + "\t"
        + str(width)
        + " "
        + str(length)
    )

def size(bboxes):
  max_width = 0
  max_height = 0
  min_x0 = 10e8
  min_y0 = 10e8
  for box in bboxes:
    if box[0] < min_x0: min_x0 = box[0]
    if box[1] < min_y0: min_y0 = box[1]
    if box[2] > max_width: max_width = box[2]
    if box[3] > max_height: max_height = box[3]
  max_width += min_x0
  max_height += min_y0
  return max_height, max_width

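def get_unique(some_array, seen=None):
    # Collects the distinct values of a (possibly nested) list
    if seen is None:
        seen = set()
    for i in some_array:
        if isinstance(i, list):
            # The recursive call adds to the shared `seen` set in place
            get_unique(i, seen)
        else:
            seen.add(i)
    return list(seen)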
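
The normalization done by `bbox_string` maps pixel coordinates to integers on a 0-1000 scale, which fits within the `max_2d_position_embeddings` value of 1024 shown in the model config later in this notebook. A minimal sketch with a hypothetical helper `normalize_box` and a made-up page size of 900 x 1500 pixels:

```python
def normalize_box(box, width, height):
    """Scale pixel coordinates [x0, y0, x1, y1] to integers on a 0-1000 scale."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# Box taken from the first invoice; the page size is an assumption
normalized = normalize_box([178, 341, 671, 378], width=900, height=1500)
print(normalized)
```

Note that the notebook does not read the real image size; `size` estimates it from the extent of the bounding boxes, so the normalized values are approximations.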


In [None]:
def write_files(output_dir, data_split, sentences_list, labels_list, bbox_list, split_indexes):
  with open(
      os.path.join(output_dir, data_split + ".txt"),
      "w",
      encoding="utf8",
  ) as fw, open(
      os.path.join(output_dir, data_split + "_box.txt"),
      "w",
      encoding="utf8",
  ) as fbw, open(
      os.path.join(output_dir, data_split + "_image.txt"),
      "w",
      encoding="utf8",
  ) as fiw:
      for index in split_indexes:
          sent = sentences_list[index]
          lab = labels_list[index]
          boxes = bbox_list[index]
          length, width = size(boxes)

          for words, label, box in zip(sent, lab, boxes):
              fw.write("{}\t{}\n".format(words, label))
              fbw.write("{}\t{}\n".format(words, bbox_string(box, width, length)))
              fiw.write("{}\t{}\t{}\n".format(words, actual_bbox_string(box, width, length), "filename.jpg"))
          fw.write("\n")
          fbw.write("\n")
          fiw.write("\n")

In [None]:
# First we split into train and test set
split_indexes = [*range(len(sentences_list))]
random.Random(4).shuffle(split_indexes)
cut = int(len(sentences_list) * 0.8)
split_indexes_train = split_indexes[:cut]
split_indexes_test = split_indexes[cut:]
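
The split above is reproducible because the shuffle uses its own `random.Random` instance with a fixed seed. A short sketch that wraps the same steps in a hypothetical function `make_split` and checks the properties one would expect (same result on every call, no overlap, nothing lost):

```python
import random


def make_split(n_items, train_fraction=0.8, seed=4):
    """Hypothetical wrapper around the notebook's shuffle-and-cut split."""
    indexes = list(range(n_items))
    random.Random(seed).shuffle(indexes)   # private RNG, independent of global state
    cut = int(n_items * train_fraction)
    return indexes[:cut], indexes[cut:]


train_a, test_a = make_split(10)
train_b, test_b = make_split(10)
print(len(train_a), len(test_a))
print(train_a == train_b and test_a == test_b)
```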

In [None]:
write_files('/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/', 'train', sentences_list, labels_list, bbox_list, split_indexes_train)

In [None]:
write_files('/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/', 'test', sentences_list, labels_list, bbox_list, split_indexes_test)

In [None]:
# Finally we write the labels.txt file
tag_values = get_unique(labels_list)
with open(
      os.path.join('/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/', "labels.txt"),
      "w",
      encoding="utf8",
  ) as lb:
      for val in tag_values:
        lb.write("{}\n".format(val))

# 2. Fine tune LayoutLM

In [None]:
os.chdir('/content')

In [None]:
%%bash
git clone https://github.com/microsoft/unilm.git
cd unilm/layoutlm
pip install .

Processing /content/unilm/layoutlm
Building wheels for collected packages: layoutlm
  Building wheel for layoutlm (setup.py): started
  Building wheel for layoutlm (setup.py): finished with status 'done'
  Created wheel for layoutlm: filename=layoutlm-0.0-cp36-none-any.whl size=11484 sha256=4509f71db801ad23d552705ca0a200c660112c560c17fd0fb73631d2d847009d
  Stored in directory: /tmp/pip-ephem-wheel-cache-v49lumeh/wheels/e8/9a/90/87de19930fb582e6176ea7912010f101efa37def32b8ced268
Successfully built layoutlm
Installing collected packages: layoutlm
  Found existing installation: layoutlm 0.0
    Uninstalling layoutlm-0.0:
      Successfully uninstalled layoutlm-0.0
Successfully installed layoutlm-0.0


Cloning into 'unilm'...


In [None]:
os.chdir('/content/unilm/layoutlm/examples/seq_labeling')

In [None]:
%%bash
# Copy the previously created files
mkdir data
cp '/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/train.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/train_box.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/train_image.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/test.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/test_box.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/test_image.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
cp '/content/drive/My Drive/Msc/Tese/Datasets/SROIE2019/labels.txt' '/content/unilm/layoutlm/examples/seq_labeling/data'
# Try to remove cached files (this is optional and only important if we make changes on the input files)
rm '/content/unilm/layoutlm/examples/seq_labeling/data/cached_train_layoutlm-base-uncased_512'
rm '/content/unilm/layoutlm/examples/seq_labeling/data/cached_test_layoutlm-base-uncased_512'

rm: cannot remove '/content/unilm/layoutlm/examples/seq_labeling/data/cached_train_layoutlm-base-uncased_512': No such file or directory
rm: cannot remove '/content/unilm/layoutlm/examples/seq_labeling/data/cached_test_layoutlm-base-uncased_512': No such file or directory


In [None]:
%%bash
ls /content/unilm/layoutlm/examples/seq_labeling/data/
cat /content/unilm/layoutlm/examples/seq_labeling/data/labels.txt

labels.txt
test_box.txt
test_image.txt
test.txt
train_box.txt
train_image.txt
train.txt
O
B-DATE
B-COMPANY
B-TOTAL


In [None]:
# Check model parameters
cat "/content/drive/My Drive/Msc/Tese/Modelos/layoutlm-base-uncased/config.json"

{
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "max_2d_position_embeddings": 1024,
  "num_attention_heads": 8,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 30522
}

In [None]:
%%bash
# Want to change any model parameter? For example, here I just replace the number of attention heads from 12 to 8 (the results are much better)
sed -i 's/"num_attention_heads": 12,/"num_attention_heads": 8,/' "/content/drive/My Drive/Msc/Tese/Modelos/layoutlm-base-uncased/config.json"
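
The same edit can be sketched in Python, which makes it easier to check the new value first. BERT-style self-attention splits `hidden_size` evenly across the heads, so the number of heads should divide it exactly. The function `set_attention_heads` is hypothetical, and the demo runs on a throwaway file rather than the real `config.json`.

```python
import json
import os
import tempfile


def set_attention_heads(config_path, n_heads):
    """Hypothetical helper: update num_attention_heads and return the size of each head."""
    with open(config_path, encoding="utf8") as f:
        config = json.load(f)
    if config["hidden_size"] % n_heads != 0:
        raise ValueError("hidden_size must be divisible by the number of heads")
    config["num_attention_heads"] = n_heads
    with open(config_path, "w", encoding="utf8") as f:
        json.dump(config, f, indent=2)
    return config["hidden_size"] // n_heads


# Demo on a temporary file containing only the two relevant keys
demo_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(demo_path, "w", encoding="utf8") as f:
    json.dump({"hidden_size": 768, "num_attention_heads": 12}, f)

head_size = set_attention_heads(demo_path, 8)
print(head_size)
```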

In [None]:
# Train the model
! CUDA_LAUNCH_BLOCKING=1 python run_seq_labeling.py  --data_dir data \
                            --model_type layoutlm \
                            --model_name_or_path '/content/drive/My Drive/Msc/Tese/Modelos/layoutlm-base-uncased' \
                            --do_lower_case \
                            --max_seq_length 512 \
                            --do_train \
                            --num_train_epochs 5.0 \
                            --logging_steps 10 \
                            --save_steps -1 \
                            --output_dir output \
                            --overwrite_output_dir \
                            --labels data/labels.txt \
                            --per_gpu_train_batch_size 8 \
                            --per_gpu_eval_batch_size 8

2020-11-25 00:12:21.493624: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Epoch:   0% 0/5 [00:00<?, ?it/s]
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
  exp_avg.mul_(beta1).add_(1.0 - beta1, grad)

Iteration:   1% 1/71 [00:00<01:01,  1.13it/s]
Iteration:   3% 2/71 [00:01<00:58,  1.18it/s]
Iteration:   4% 3/71 [00:02<00:56,  1.21it/s]
Iteration:   6% 4/71 [00:03<00:54,  1.24it/s]
Iteration:   7% 5/71 [00:03<00:52,  1.26it/s]
Iteration:   8% 6/71 [00:04<00:50,  1.28it/s]
Iteration:  10% 7/71 [00:05<00:49,  1.28it/s]
Iteration:  11% 8/71 [00:06<00:48,  1.30it/s]

Iteration:  14% 10/71 [00:07<00:46,  1.31it/s]
Iteration:  15% 11/71 [00:08<00:45,  1.31it/s]
Iteration:  17% 12/71 [00:09<00:44,  1.32it/s]
Iteration:  18% 13/71 [0

In [None]:
# Evaluate for test set
! python run_seq_labeling.py  --data_dir data \
                            --model_type layoutlm \
                            --model_name_or_path '/content/drive/My Drive/Msc/Tese/Modelos/layoutlm-base-uncased' \
                            --do_lower_case \
                            --max_seq_length 512 \
                            --do_predict \
                            --logging_steps 10 \
                            --save_steps -1 \
                            --output_dir output \
                            --labels data/labels.txt \
                            --per_gpu_eval_batch_size 8

2020-11-25 00:17:28.612540: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Evaluating: 100% 18/18 [00:04<00:00,  3.95it/s]


In [None]:
cat output/test_results.txt

f1 = 0.945092952875054
loss = 0.0345635545026097
precision = 0.9138795986622074
recall = 0.9785138764547896
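
As a quick sanity check on these numbers, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). Plugging in the reported values reproduces the reported F1 (about 0.945093):

```python
# Values copied from test_results.txt above
precision = 0.9138795986622074
recall = 0.9785138764547896

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))
```

Recall being higher than precision suggests the model tags somewhat more entities than the reference labels contain.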


In [None]:
%%bash
# We can check the results on the test set
head -60 output/test_predictions.txt

UNIHAKKA B-COMPANY
INTERNATIONAL B-COMPANY
SDN B-COMPANY
BHD B-COMPANY
22 B-DATE
JUN B-DATE
2018 B-DATE
18:07 O
(867388-U) O
12 O
TAMPOI O
TAX O
INVOICE O
INVOICE O
# O
: O
OR18062202170372 O
ITEM O
QTY O
TOTAL O
SR O
I00100000170- O
IMPORTED O
VEGGIES O
RM1.50 O
SR O
I00100000031- O
3 O
VEGE O
RM4.15 O
SR O
I00100000171-MEAT O
DISH O
RM2.83 O
1 O
1 O
1 O
RM1.50 O
RM4.15 O
RM2.83 O
TOTAL O
AMOUNT: O
RM8.48 O
GST O
@0%: O
RM0.00 O
ROUNDING: O
RM0.02 O
NETT O
TOTAL: O
RM8.50 B-TOTAL
PAYMENT O
MODE O
CASH O
CHANGE O
AMOUNT O
RM8.50 B-TOTAL
RM0.00 O
GST O
SUMMARY O
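
To turn word-level predictions like the ones above into field values, consecutive words with the same non-`O` tag can be joined. The sketch below is a hypothetical post-processing step, not part of the notebook or of the LayoutLM repo; it assumes each line is `word tag` separated by a single space, as in the output shown.

```python
from itertools import groupby


def group_entities(lines):
    """Hypothetical post-processing: join consecutive words sharing a non-O tag."""
    pairs = [line.rsplit(" ", 1) for line in lines if " " in line.strip()]
    entities = []
    for tag, run in groupby(pairs, key=lambda pair: pair[1]):
        if tag != "O":
            # Strip the "B-" prefix and join the words of the run
            entities.append((tag[2:], " ".join(word for word, _ in run)))
    return entities


# A few lines copied from the predictions above
sample = [
    "UNIHAKKA B-COMPANY", "INTERNATIONAL B-COMPANY", "SDN B-COMPANY", "BHD B-COMPANY",
    "22 B-DATE", "JUN B-DATE", "2018 B-DATE", "18:07 O",
    "RM8.50 B-TOTAL", "PAYMENT O", "RM8.50 B-TOTAL",
]
entities = group_entities(sample)
print(entities)
```

Because the BO scheme has no `I-` tags, two adjacent entities of the same type would be merged into one; the total also appears twice here, so a real pipeline would need a rule for picking one.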
