# Invoice reading NLP system
Remember this image? IT IS BACK!!!
![System image](https://storage.googleapis.com/aibootcamp/general_assets/ml_system_architecture.png)


This week is all about system building, because an ML system hardly ever stands alone. Your success in building a system for Ortec Finance depends as much on what is around your neural net as it depends on the neural net itself. This baseline is my approach to the problem. Much in this notebook was hacked together, so I am sure you can improve on many points. Perhaps you will even come up with a completely different approach.

## The approach, character-wise classification:
The goal of the task is to extract information from the invoice. The invoice has been run through optical character recognition (OCR). OCR turns PDFs into text but often messes up the order and confuses some characters. **To extract information from this text, we classify each character by category**. 

For example, if we only wanted to extract the amount, we would label the characters like this, with 1 marking the amount and 0 everything else:

|T|O|T|A|L|:| |€| |4|3|6|.|0|0|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|

We classify our text into 6 classes here:

| Class | Label |
|-|-|
| Ignore | 0 |
| Sender Name | 1 |
| Sender KVK | 2 |
| Sender IBAN | 3 |
| Invoice Reference | 4 |
| Total | 5 |

These are the classes that the training data generator tags. But the class of a character does not only depend on the character itself. It depends on its surroundings as well. To train our model, we create substrings of our invoice that include a certain number of preceding and succeeding characters. This number is defined in the `PADDING` global variable. 

If for example we wanted to classify the character '€' from the example above and had `PADDING = 3` we would feed
'L: € 43' into our network. You can see how the amount of padding has a great influence on the performance of our system.
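
The window extraction can be sketched in a few lines. This is a simplified illustration and not the notebook's `gen_sub`: the helper name `window` is hypothetical, and it assumes the window fits inside the text, so there is no wrap-around at the edges.

```python
def window(text, index, padding):
    """Return the focus character at `index` plus `padding` characters on each side.

    Hypothetical helper for illustration; assumes the window fits inside `text`.
    """
    return text[index - padding:index + padding + 1]


text = "TOTAL: € 436.00"
focus = text.index("€")               # position of the character we want to classify
print(repr(window(text, focus, 3)))   # 'L: € 43'
```

A window always has `2 * padding + 1` characters, so `PADDING = 20` gives the network 41 characters per sample.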

## Post processing:
A significant part of model performance stems from what is done with the outputs of the neural net. This approach groups consecutive predictions of the same category into sequences and only keeps sequences that are longer than 5 characters. One approach to try would be to allow sequences to be interrupted by one character. Another nice add-on would be to rank predicted sequences by the total confidence the neural network has in the sequence. 
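
One possible way to tolerate a single interrupting character is to smooth the predicted labels before grouping them. The sketch below is only one option and not part of the baseline: the helper name `fill_single_gaps` is hypothetical, and it assumes the predictions are a plain list of class ids.

```python
def fill_single_gaps(labels):
    """Relabel a lone character whose two neighbours share the same non-zero class.

    Hypothetical helper: `labels` is a list of per-character class ids.
    """
    smoothed = list(labels)
    for i in range(1, len(labels) - 1):
        left, right = labels[i - 1], labels[i + 1]
        # Only fill the gap if both neighbours agree on a real (non-ignore) class
        if labels[i] != left and left == right and left != 0:
            smoothed[i] = left
    return smoothed


labels = [0, 5, 5, 5, 0, 5, 5, 0, 0]
print(fill_single_gaps(labels))   # [0, 5, 5, 5, 5, 5, 5, 0, 0]
```

The neighbours are read from the original list, not the smoothed one, so a filled gap never triggers further fills.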

## Some tips:
For this assignment you can dive pretty deep into software development. 
You might find these jupyter tricks helpful: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

Especially debugging with `pdb` really makes things easier: https://docs.python.org/3.5/library/pdb.html#debugger-commands

Basically, if anything crashes, you can start a new cell and enter `%debug`. This drops you into a command line where you can inspect the state of the program at the moment of the crash.
The debugger has some special commands. For example, `p my_var` prints a variable. This also works for other Python expressions, e.g. `p len(my_list)`.

Good luck with building a great system!

In [1]:
# Setup 
# !wget -nc https://storage.googleapis.com/aibootcamp/data/ortec_templates.zip

In [2]:
#  !unzip - ortec_templates.zip

In [3]:
!ls

iban_noisy_100.csv			   README.md
Jannes_Baseline_Invoices_ToFromDisk.ipynb  templates
Jannes_Baseline_Ortec.ipynb		   train
LICENSE					   Wk6_Ortec_IBAN_generation.ipynb


In [1]:
 !cat templates/TEMPLATE_18.txt

{'/Title': IndirectObject(49, 0), '/Author': IndirectObject(51, 0), '/Subject': IndirectObject(52, 0), '/Producer': IndirectObject(50, 0), '/Creator': IndirectObject(53, 0), '/CreationDate': IndirectObject(54, 0), '/ModDate': IndirectObject(54, 0), '/Keywords': IndirectObject(55, 0), '/AAPL:Keywords': IndirectObject(56, 0)}

            <SENDER_NAME>
V.O.F
                                        Specialist in tegelvlakke cementdekvloeren
                 Fam de Rochebrune
              Vaarsdrift 3
         Juinen
              Papendrecht
02-06-2017     factuur 17068
    -            Geachte heer,
                      Aan u geleverd cementdekvloer
45m27cmvezel bewapend
Bedrag excl BTW
900,00
!
 B.T.W 21%
189,00
!
 Totaal door u te voldoen
1.089,00
!
                                                                    Betalingen via IBAN.<IBAN>
binnen 14 dagen
 na datum factuur
     Dennehof 23
 3355RJ Papendrecht
 Mob.tel.0
6-53190711 IBAN.43ABNA.042.97.260

In [2]:
!pip install -q keras

## Loading templates

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
# System hyper parameters here

# How many characters before and after the main char to feed the NN
PADDING = 20 


'''
Ignore:           0
Sender Name:      1 
Sender KVK:       2 
Sender IBAN:      3 
Invoice Reference:4
Total:            5
'''
N_CLASSES = 6

In [5]:
from templates.invoicegen import create_invoice

from keras.preprocessing.text import Tokenizer
import numpy as np

import pandas as pd
import glob
import uuid


Using TensorFlow backend.


In [6]:
# Create plenty of invoices for each template
PLENTY = 3

invoices = []
targets = []
truths = []

for txtname in glob.glob("./templates/TEMPLATE_*.txt"):
    print(txtname)
    with open(txtname, "r") as content_file:
        content = content_file.read()
        # Create invoices from template
        for i in range(PLENTY):
            inv, tar, truth = create_invoice(content)
            invoices.append(inv)
            targets.append(tar)
            truths.append(truth)
#

./templates/TEMPLATE_1.txt
./templates/TEMPLATE_10.txt
./templates/TEMPLATE_8.txt
./templates/TEMPLATE_2.txt
./templates/TEMPLATE_9.txt
./templates/TEMPLATE_20.txt
./templates/TEMPLATE_4.txt
./templates/TEMPLATE_18.txt
./templates/TEMPLATE_17.txt
./templates/TEMPLATE_11.txt
./templates/TEMPLATE_5.txt
./templates/TEMPLATE_7.txt
./templates/TEMPLATE_14.txt
./templates/TEMPLATE_13.txt
./templates/TEMPLATE_19.txt
./templates/TEMPLATE_12.txt
./templates/TEMPLATE_3.txt
./templates/TEMPLATE_16.txt
./templates/TEMPLATE_21.txt
./templates/TEMPLATE_6.txt
./templates/TEMPLATE_15.txt


In [7]:
print(invoices[3])
print(truths[3])

{'/Producer': 'PDFCreator 2.4.0.213', '/CreationDate': "D:20170321121729+01'00'", '/ModDate': "D:20170321121729+01'00'", '/Title': 'Factuur', '/Author': 'BoekhoudPC', '/Subject': '', '/Keywords': '', '/Creator': 'PDFCreator 2.4.0.213'}

Rembrandtstraat 5-i, 3262HN, Oud-Beijerland.

Telefoon   : 06-20398799

Email       : info@Amazon NL International Holdings B.V..nl

Bank ING : NL41RABO0150437878
BTW nr              : NL1361.72.738.B01

KVK Rotterdam   : 69988978
WEBSITE: http://www.Amazon NL International Holdings B.V..nl

Factuurdatum:

Factuurnummer:
21-3-2017

20170465
Rochebrune, de

Vaarsdrift 3

1982 KB Juinen
0071
Debiteurnummer:
Artikel
AantalEenheidsprijsTotaalbedrag
EUR
EUR
Artikelcode
Schuifpui volgens email opdracht 19-12-2016.1,00 4.
149,00 4.149,00
Overige kozijnen gekoppeld volgens email

opdracht 19-12-2016.
1,00 6.441,00 6.441,00
Rolluik met verbrede geleiders etc. volgens

akkoord email 18-1-2017.
1,00 1.284,00 1.284,00
Arbeid 2,5 uur hulp bij het vernieuwen beslag



In [13]:
# save all generated inv/tar combos to disk
# one combo per file
train_dir = "train/"
! mkdir -p {train_dir}

for inv, tar in zip(invoices, targets):
  # One row per character: list(inv) splits the invoice string into characters
  # so that it lines up with the per-character target list
  inv_tar = pd.DataFrame({'invoice': list(inv), 'target': tar})
  fname = train_dir + str(uuid.uuid4()) + '.csv'
  inv_tar.to_csv(fname, index=False)
  
#


In [14]:
# ! rm {train_dir}*csv
! ls {train_dir}*csv | wc -l
! du -sh {train_dir}

63
328K	train/


In [15]:
# example code to read in invoices/targets from disk:
# one inv/tar combo in each file

for file in glob.glob(train_dir + "*.csv"):
  mysample = pd.read_csv(file)
  inv = mysample.loc[:,'invoice']
  tar = mysample.loc[:,'target']


## Generate substring

In [241]:
# Create our tokenizer
# We will tokenize on character level!
# We will NOT remove any characters
# lower=True (the default) lowercases all characters; set lower=False to keep case
tokenizer = Tokenizer(num_words=96, char_level=True, filters=None, lower=True)
tokenizer.fit_on_texts(invoices)
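
To make the encoding concrete, here is a plain-Python sketch of character-level one-hot encoding. It is not how Keras implements `Tokenizer`: the names `build_index` and `one_hot` are hypothetical, ids are assigned alphabetically here while Keras orders them by frequency, and unseen characters raise an error in this sketch. Index 0 is left unused to mirror the fact that Keras reserves it.

```python
def build_index(texts):
    """Map each distinct (lowercased) character to an integer id, starting at 1."""
    chars = sorted({c for text in texts for c in text.lower()})
    return {c: i for i, c in enumerate(chars, start=1)}


def one_hot(text, index, width):
    """Encode `text` as one row of `width` zeros and a single one per character."""
    rows = []
    for c in text.lower():
        row = [0.0] * width
        row[index[c]] = 1.0   # KeyError for characters not seen by build_index
        rows.append(row)
    return rows


index = build_index(["Total: 12"])
encoded = one_hot("12", index, width=len(index) + 1)
print(len(encoded), len(encoded[0]))   # 2 9
```

In the notebook the width is fixed by `num_words=96`, which is why each sample later has 96 columns.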

In [242]:
def gen_sub(inv,tar,pad, m = None):
    '''
    Generates a substring from invoice inv and target list tar 
    using the character at index m as a midpoint.
    
    Params:
    inv - an invoice string
    tar - a target list specifying the type of each item
    pad - the amount of padding to attach before and after the focus character
    m - index of the focus character; chosen at random if None
    
    Returns:
    sub - a string with pad characters, the focus character, pad characters
    target - the class of the focus character, tar[m]
    '''
    # If no focus character index is set, choose at random
    if m is None:
        m = np.random.randint(0,len(inv))
        
    l = m - pad # define the lower bound of our substring
    h = m + pad + 1 # define the upper (high) of our substring

    # Sometimes, our lower bound could be below zero
    # In this case we attach the remaining characters from the back of the string
    if l < 0:
        # Get the characters from the back of the file
        s1 = inv[l:None]
        
        # Edge case: Sample size larger than string
        # Our upper bound might be higher than the length of the text
        # In that case we start from the front again
        if h >= len(inv): 
            # How many characters do we need from the front
            overlap = h - len(inv)
            # The string is the entire invoice + some chars from the front
            s2 = inv
            s_over = inv[None:overlap]
            s2 = s2 + s_over
        else:
            # If we don't need chars from the front 
            # we can just select to the upper bound
            s2 = inv[None:h]
            
        # Create substring
        sub = s1 + s2
        # Ensure the substring has the right length
        assert(len(sub) == pad*2 +1)
        return sub, tar[m]
    
    # Our lower bound might be positive but our upper bound might 
    # still be above the length of the invoice
    elif h >= len(inv):
        # Calc how many chars we need from the front
        overlap = h - len(inv)
        
        # Get string from lower bound to end
        s1 = inv[l:None]
        # Get string from the front of the doc
        s2 = inv[None:overlap]
        sub = s1 + s2
        # Make sure our string has the correct length
        assert(len(sub) == pad*2 +1)
        return sub, tar[m]
    
    # Upper and lower bound lie within the length of the invoice
    else: 
        sub = inv[l:h]
        assert(len(sub) == pad*2 +1)
        return sub, tar[m]
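
The three branches above all implement the same idea: treat the invoice as circular and wrap around at both ends. As a more compact alternative, the same window can be built with modular indexing. This is a sketch and not the notebook's implementation; the name `gen_sub_wrapped` is hypothetical.

```python
def gen_sub_wrapped(inv, tar, pad, m):
    """Return (window, label) for the character at index m, wrapping around the ends."""
    n = len(inv)
    # (m + k) % n maps positions before the start or past the end back into the string
    sub = ''.join(inv[(m + k) % n] for k in range(-pad, pad + 1))
    return sub, tar[m]


inv = "abcdefgh"
tar = [0, 0, 1, 1, 0, 0, 2, 2]
print(gen_sub_wrapped(inv, tar, 2, 0))   # ('ghabc', 0)
print(gen_sub_wrapped(inv, tar, 2, 7))   # ('fghab', 2)
```

Unlike the slicing version, this variant also returns a window of the right length when `pad` is larger than the invoice itself, at the cost of one index computation per character.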

## Generate dataset for training

In [243]:
def gen_dataset(sample_size, n_classes, invoices, targets, tokenizer):
    '''
    Generate a dataset of inputs and outputs for our neural network
    
    Params:
    sample_size - desired sample size
    n_classes - number of classes
    invoices - list of invoices to sample from
    targets - list of corresponding targets to sample from
    tokenizer - a keras tokenizer fit on the invoices
    
    The function creates balanced samples by randomly sampling until 
    an equal amount of samples of all types is created.
    
    Characters are one hot encoded
    
    Returns:
    x_arr: a numpy array of shape (sample_size, sequence length, number of unique characters)
    y_arr: a numpy array of shape (sample_size,)
    '''
    
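    # Create a budget of samples per class
    # Integer division keeps the budget a whole number even if
    # sample_size is not divisible by n_classes
    budget = [sample_size // n_classes] * n_classes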
    
    # Setup holding variables
    X_train = []
    y_train = []

    # While there is still a budget left...
    while sum(budget) > 0:
        # ... get a random invoice and target list
        index = np.random.randint(0,len(invoices))
        inv = invoices[index]
        tar = targets[index]
        # ... sample up to 10 items from this invoice 
        for j in range(10):
            # Get an item
            x, y = gen_sub(inv,tar,PADDING)
            # if we still have a budget for this items target
            if budget[y] > 0:
                # Tokenize to one hot
                xm = tokenizer.texts_to_matrix(x)
                # Add data and target
                X_train.append(xm)
                y_train.append(y)
                budget[y] -= 1
      
    # Create numpy arrays from all data and targets
    x_arr = np.array(X_train)
    y_arr = np.array(y_train)
    return x_arr,y_arr
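
The budget logic is easier to see in isolation. The sketch below reproduces it on toy labels with the standard library only. The name `fill_budget` and the `max_draws` guard are assumptions added for illustration: the guard stops the loop if a class never shows up in the data, a case in which a plain `while sum(budget) > 0` loop would not terminate.

```python
import random


def fill_budget(labels, per_class, n_classes, max_draws=100000, seed=0):
    """Draw random positions until each class has `per_class` samples or draws run out."""
    rng = random.Random(seed)
    budget = [per_class] * n_classes
    picked = []
    draws = 0
    while sum(budget) > 0 and draws < max_draws:
        i = rng.randrange(len(labels))
        draws += 1
        # Keep the sample only if its class still has budget left
        if budget[labels[i]] > 0:
            picked.append(i)
            budget[labels[i]] -= 1
    return picked, budget


labels = [0] * 90 + [1] * 8 + [2] * 2      # heavily imbalanced toy labels
picked, remaining = fill_budget(labels, per_class=5, n_classes=3)
counts = [sum(1 for i in picked if labels[i] == c) for c in range(3)]
print(counts, remaining)
```

Rare classes are sampled with replacement, so the same character can appear several times in the balanced set.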

In [244]:
m=212
r=0

In [245]:
m += 1
mysubstring = gen_sub(invoices[r], targets[r], PADDING, m=m)
print(m)
print(mysubstring)


213
('Website:\n\nFactuur\n\n24421148\n000019986750\n', 2)


In [246]:
# Generate data
train_size = 12000
val_size = 1200

x_tr, y_tr = gen_dataset(train_size, N_CLASSES, invoices, targets, tokenizer)
x_val, y_val = gen_dataset(val_size, N_CLASSES, invoices, targets, tokenizer)

In [247]:
#a numpy array of shape (sample_size, sequence length, number of unique characters)

x_tr.shape 

(_, input_width, input_depth) = x_tr.shape
input_shape = (input_width, input_depth)
print(input_shape)

(41, 96)


## Model building

In [248]:
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense,Activation, Conv1D, MaxPool1D

In [249]:
# A simple model
model = Sequential()

# The input shape is (sequence length, one-hot width), here (41, 96)
model.add(Conv1D(32,2,input_shape=input_shape))

model.add(MaxPool1D(2))
model.add(SimpleRNN(10))
model.add(Dense(6))
model.add(Activation('softmax'))

In [250]:
# sparse_categorical_crossentropy is like categorical crossentropy but without converting targets to one hot
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam', metrics=['acc'])
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_1 (Conv1D)            (None, 40, 32)            6176      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 20, 32)            0         
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 10)                430       
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 66        
_________________________________________________________________
activation_1 (Activation)    (None, 6)                 0         
Total params: 6,672
Trainable params: 6,672
Non-trainable params: 0
_________________________________________________________________
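
The numbers in the summary can be checked by hand. The sketch below recomputes them from the layer definitions with the usual parameter-count formulas for these layer types; the variable names exist only for this illustration.

```python
kernel_size, in_channels, filters = 2, 96, 32   # Conv1D(32, 2) on 96-wide one-hot rows
rnn_units, n_classes = 10, 6
seq_len = 41                                    # 2 * PADDING + 1 with PADDING = 20

conv_params = (kernel_size * in_channels + 1) * filters   # weights + one bias per filter
conv_len = seq_len - kernel_size + 1                      # no padding, stride 1
pool_len = conv_len // 2                                  # MaxPool1D(2)
rnn_params = rnn_units * (filters + rnn_units + 1)        # input, recurrent, bias
dense_params = rnn_units * n_classes + n_classes          # weights + bias

print(conv_len, pool_len)                                 # 40 20
print(conv_params, rnn_params, dense_params)              # 6176 430 66
print(conv_params + rnn_params + dense_params)            # 6672
```

Almost all of the parameters sit in the convolution, because every filter sees two full 96-wide one-hot rows.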


In [252]:
model.fit(x_tr,y_tr,batch_size=32,epochs=6,validation_data=(x_val,y_val))

Train on 12000 samples, validate on 1200 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x7f28317f5358>

## Generate demo invoice

In [253]:
'''
To make predictions from our model, we need to create 
sequences around every character from the invoice.

We are making predictions for every character based on their context in the invoice
'''

# Choose a random invoice:
index = np.random.randint(0,len(invoices))
inv = invoices[index]
tar = targets[index]
# inv = invoices[0]
# tar = targets[0]

chars = [] # Holds the individual characters
data = [] # Holds the sequences around the characters
y_true = [] # Holds the true targets for each character

# Loop over character indices
# Note: range(len(inv) - 1) stops one short, so the last character gets no prediction
for i in range(len(inv) -1):
    # Create sequence around this character
    x,y = gen_sub(inv,tar,PADDING,m=i)
    # Tokenize the sequence to one hot
    xm = tokenizer.texts_to_matrix(x)
    # Get the character itself
    c = inv[i]
    
    chars.append(c)
    data.append(xm)
    y_true.append(y)
#


In [340]:
print((data[0][0]))

[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


In [254]:
import pandas as pd

In [255]:
# For demo purposes we can look to see what our invoice looks like
df = pd.DataFrame({'Char':chars,'Target':y_true})

In [256]:
# Show all characters belonging to the Total (field #5)
df[df.Target == 5]

| Index | Char | Target |
|-|-|-|
| 883 | 7 | 5 |
| 884 | 5 | 5 |
| 885 | 9 | 5 |
| 886 | 5 | 5 |
| 887 | . | 5 |
| 888 | 3 | 5 |
| 889 | 0 | 5 |


In [257]:
# Create test data for predictions with neural net
x_test = np.array(data)

In [258]:
x_test.shape

(921, 41, 96)

## Making predictions

In [311]:
# Make predictions
y_pred = model.predict(x_test)
y_pred.shape

(921, 6)

In [312]:
for p in y_pred[860:880]:
    for q in p:
        print("{:.3f}   ".format(q) , end='' )
    print("argmax: {:d}    max: {:.3f}".format(np.argmax(p), np.max(p)))

0.262   0.260   0.018   0.000   0.048   0.411   argmax: 5    max: 0.411
0.701   0.029   0.105   0.000   0.002   0.163   argmax: 0    max: 0.701
0.237   0.075   0.005   0.004   0.000   0.679   argmax: 5    max: 0.679
0.549   0.007   0.006   0.051   0.000   0.387   argmax: 0    max: 0.549
0.503   0.010   0.009   0.212   0.006   0.260   argmax: 0    max: 0.503
0.173   0.001   0.003   0.819   0.004   0.000   argmax: 3    max: 0.819
0.021   0.000   0.002   0.976   0.000   0.000   argmax: 3    max: 0.976
0.029   0.000   0.003   0.967   0.000   0.000   argmax: 3    max: 0.967
0.159   0.000   0.144   0.697   0.001   0.000   argmax: 3    max: 0.697
0.673   0.000   0.240   0.060   0.001   0.026   argmax: 0    max: 0.673
0.134   0.000   0.819   0.042   0.004   0.000   argmax: 2    max: 0.819
0.591   0.000   0.256   0.077   0.050   0.025   argmax: 0    max: 0.591
0.317   0.000   0.048   0.623   0.010   0.002   argmax: 3    max: 0.623
0.061   0.000   0.023   0.909   0.005   0.001   argmax: 3    max

In [313]:
# Get the most likely class
y_pred = y_pred.argmax(axis=1)
print(y_pred[860:880])

[5 0 5 0 0 3 3 3 3 0 2 0 3 3 4 0 0 3 0 0]


In [314]:
# Show what our model predictions look like
df['Predicted'] = y_pred
# df['Predicted'] = y_true  # temp hack for documentation purps

In [328]:
y_pred.shape

(921,)

In [323]:
# Show all chars that are predicted to belong to the TOTAL amount (field #5)
df.loc[(df.Predicted == 5) | (df.Target == 5)]

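| Index | Char | Target | Predicted |
|-|-|-|-|
| 221 | 0 | 0 | 5 |
| 223 | 0 | 0 | 5 |
| 390 | I | 0 | 5 |
| 400 | / | 0 | 5 |
| 401 | 5 | 0 | 5 |
| 402 | / | 0 | 5 |
| 403 | 2 | 0 | 5 |
| 404 | 0 | 0 | 5 |
| 405 | 1 | 0 | 5 |
| 417 | A | 0 | 5 |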


## Obtain system outputs from predictions

In [359]:
from itertools import groupby
# Create groups by the predicted output
# This code builds a list of tuples with the format
# (category, length, starting index)

# TODO: This code is ugly and very hard to understand
# But it works

# Group by predicted category
g = groupby(enumerate(y_pred), lambda x:x[1])
# g = groupby(enumerate(y_true), lambda x:x[1])   # temp hack for docs

# Create list of groups
l = [(x[0], list(x[1])) for x in g]

# Create list with tuples of groups
groups = [(x[0], len(x[1]), x[1][0][0]) for x in l]

In [387]:
z=0

In [392]:
# Show grouping (hit CTRL-Enter multiple times)

# each tuple contains (category, length, index) of a substring of an invoice
# each "group" represents a piece of the invoice that is predicted to be of the same class [0-5]
z += 1
print(groups[:z])
print(groups[z])


[(0, 5, 0), (1, 3, 5), (0, 3, 8), (1, 18, 11), (0, 1, 29)]
(1, 5, 30)
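
The grouping step is a run-length encoding of the predicted labels. A toy example makes the tuple format easier to follow. The helper name `run_lengths` is hypothetical, and the loop is just a more explicit spelling of the same `groupby` idea used above.

```python
from itertools import groupby


def run_lengths(labels):
    """Collapse a label sequence into (category, length, start_index) tuples."""
    runs = []
    # enumerate pairs each label with its position; groupby splits on label changes
    for category, members in groupby(enumerate(labels), key=lambda pair: pair[1]):
        members = list(members)
        runs.append((category, len(members), members[0][0]))
    return runs


print(run_lengths([0, 0, 1, 1, 1, 0, 5, 5]))
# [(0, 2, 0), (1, 3, 2), (0, 1, 5), (5, 2, 6)]
```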


In [326]:
'''
We only want to consider sequences of predictions of the same type 
that have a minimum length. This way we remove the noise
But we also might remove some good predictions

Only sequences longer than 5 characters are kept here, certainly a value to experiment with
'''
candidates = []
# Loop over all groups
for group in groups:
    
    # Unpack group
    category, length, index = group
    
    # Ignore the ignore category and only consider category sequences longer than 5
    if category != 0 and length > 5:  # was 5
        # Create text
        candidate_text = ''.join(chars[index:index+length])
        # Remove line breaks, this is just one way to prettify outputs!
        candidate_text = candidate_text.replace('\n','')
        candidates.append((candidate_text,category))

In [327]:
# Show predictions

'''
Ignore:           0
Sender Name:      1 
Sender KVK:       2 
Sender IBAN:      3 
Invoice Reference:4
Total:            5
'''

sorted(candidates, key=lambda tup: tup[1])

[('n:ORTEC Finance ', 1),
 ('L International Hol', 1),
 ('r24421148', 2),
 ('BTW nr', 3),
 ('NL35RABO0386025669', 3),
 ('/5/201', 5),
 ('7595.30', 5)]
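
Several candidates can compete for the same field, as with the two class 5 candidates above. One way to choose between them is to rank groups by the confidence the network has in them. The sketch below assumes the per-character softmax output is still available as a nested list `probs`; in this notebook `y_pred` is overwritten by its argmax, so the probabilities would have to be kept in a separate variable first. The helper name `rank_by_confidence` is hypothetical.

```python
def rank_by_confidence(groups, probs):
    """Sort (category, length, start) groups by the summed probability of their category.

    Hypothetical helper: `probs[i][c]` is the model's probability that
    character i belongs to class c.
    """
    scored = []
    for category, length, start in groups:
        total = sum(probs[i][category] for i in range(start, start + length))
        scored.append((total, category, length, start))
    # Highest total confidence first
    return sorted(scored, reverse=True)


probs = [
    [0.1, 0.9], [0.2, 0.8], [0.9, 0.1],   # characters 0-2
    [0.4, 0.6], [0.45, 0.55],             # characters 3-4
]
groups = [(1, 2, 0), (0, 1, 2), (1, 2, 3)]
ranked = rank_by_confidence(groups, probs)
for total, category, length, start in ranked:
    print(category, start, round(total, 2))
```

Summing favours long sequences. Dividing the total by the group length would rank by average confidence instead, which may suit short fields better.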