# Invoice reading NLP system
Remember this image? IT IS BACK!!!
![System image](https://storage.googleapis.com/aibootcamp/general_assets/ml_system_architecture.png)


This week is all about system building, because an ML system hardly ever stands alone. Your success in building a system for Ortec Finance depends as much on what surrounds your neural net as on the neural net itself. This baseline is my approach to the problem. Much of this notebook was hacked together, so I am sure you can improve on many points. Perhaps you will even come up with a completely different approach.

## The approach, character-wise classification:
The goal of the task is to extract information from the invoice. The invoice has been run through optical character recognition (OCR). OCR turns PDFs into text but often messes up the order and confuses some characters. **To extract information from this text, we classify each character by category**. 

Take an example: if we just wanted to get the amount, we would classify the characters like this:

|T|O|T|A|L|:| |€| |4|3|6|.|0|0|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|

We classify our text into 6 classes here:

|Class|Label|
|-|-|
|Ignore|0|
|Sender Name|1|
|Sender KVK|2|
|Sender IBAN|3|
|Invoice Reference|4|
|Total|5|

These are the classes that the training data generator tags. But the class of a character does not depend only on the character itself. It depends on its surroundings as well. To train our model, we create substrings of our invoice that include a certain number of preceding and succeeding characters. That number is defined in the `PADDING` global variable. 

If, for example, we wanted to classify the character '€' from the example above and had `PADDING = 3`, we would feed
'L: € 43' into our network. The padding decides how much context the network sees, so it has a great influence on the performance of our system.
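
A minimal sketch of that window extraction, using the example string from the table above. The names `text`, `focus`, and `window` are made up for illustration and do not appear elsewhere in this notebook:

```python
# Illustrative only: take PADDING characters on each side of a focus character.
PADDING = 3
text = "TOTAL: € 436.00"

focus = text.index("€")  # position of the character we want to classify
window = text[focus - PADDING: focus + PADDING + 1]

print(repr(window))  # 'L: € 43'
print(len(window))   # 2 * PADDING + 1 = 7
```

Note that a plain slice like this only works when the window fits inside the text. The `gen_sub` function further down handles the edges by wrapping around to the other end of the invoice.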

## Post processing:
A significant part of model performance stems from what is done with the outputs of the neural net. This approach groups consecutive predictions of the same category into sequences and only keeps sequences that are longer than 5 characters. An approach to try would be to allow sequences to be interrupted by one character. Another nice add-on would be to rank predicted sequences by the total confidence the neural network has in the sequence. 
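
The idea of allowing a sequence to be interrupted by one character could be sketched as below. This is one possible reading of that idea, not something the baseline implements: the helper `bridge_single_gaps` is hypothetical, and it simply relabels a lone character when both of its neighbours share the same non-ignore class.

```python
def bridge_single_gaps(labels):
    """Relabel a lone character whose two neighbours share the same non-ignore class."""
    fixed = list(labels)
    for i in range(1, len(labels) - 1):
        left, right = labels[i - 1], labels[i + 1]
        # Class 0 is "ignore", so we never extend it over other predictions
        if left == right and left != 0 and labels[i] != left:
            fixed[i] = left
    return fixed


predicted = [0, 5, 5, 5, 0, 5, 5, 5, 0, 0]
print(bridge_single_gaps(predicted))  # [0, 5, 5, 5, 5, 5, 5, 5, 0, 0]
```

Running this on the predicted classes before grouping them would turn two short runs into one longer run, which then has a better chance of passing the minimum length filter.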

## Some tips:
For this assignment you can dive pretty deep into software development. 
You might find these jupyter tricks helpful: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

Especially debugging with `pdb` really makes things easier: https://docs.python.org/3.5/library/pdb.html#debugger-commands

Basically, if anything crashes, you can start a new cell and enter `%debug`. This drops you into a command line where you can inspect the state of the program at the point of the crash.
The debugger has some special commands. For example, `p my_var` prints out a variable. This also works for other Python expressions, e.g. `p len(my_list)`.

Good luck with building a great system!

In [18]:
%lsmagic



Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%python  %%python

In [8]:
!ls

Jannes_Baseline_Ortec.ipynb     Wk6_Ortec_IBAN_generation.ipynb
LICENSE                         templates
README.md                       templates-2


In [9]:
# !cat templates/TEMPLATE_1.txt

## Loading templates

In [10]:
# System hyper parameters here

# How many characters before and after the main char to feed the NN
PADDING = 20 


'''
Ignore:           0
Sender Name:      1 
Sender KVK:       2 
Sender IBAN:      3 
Invoice Reference:4
Total:            5
'''
N_CLASSES = 6

In [14]:
# Invoice data generator
from templates.invoicegen import create_invoice

In [15]:

# Your friendly tokenizer
from keras.preprocessing.text import Tokenizer

# Numpy
import numpy as np

ModuleNotFoundError: No module named 'keras'

In [0]:
# Create 100 invoices for each template

invoices = []
targets = []

# Load template 1
with open('templates/TEMPLATE_1.txt', 'r') as content_file:
    content = content_file.read()

# Create invoices from template
for i in range(100):
    inv, tar = create_invoice(content)
    invoices.append(inv)
    targets.append(tar)
    
# Load template 2
with open('templates/TEMPLATE_2.txt', 'r') as content_file:
    content = content_file.read()
    
# Create invoices from template
for i in range(100):
    inv, tar = create_invoice(content)
    invoices.append(inv)
    targets.append(tar)

In [0]:
len(targets)
r=0

In [0]:
r +=1
print(invoices[r])


## Generate substring

In [0]:
# Create our tokenizer
# We will tokenize on character level!
# We will NOT remove any characters
tokenizer = Tokenizer(char_level=True, filters=None)  # lower = False perhaps?
tokenizer.fit_on_texts(invoices)
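
To make clear what character-level one-hot encoding produces, here is a small NumPy-only sketch. It is not the Keras `Tokenizer`: the helpers `fit_char_index` and `one_hot` are hypothetical, the index order is alphabetical rather than by frequency, and nothing is lowercased. Like Keras, it leaves index 0 unused, which would be consistent with the last dimension of the training data being the number of distinct characters plus one.

```python
import numpy as np


def fit_char_index(texts):
    """Map each distinct character to an integer starting at 1 (0 stays unused)."""
    chars = sorted(set("".join(texts)))
    return {c: i + 1 for i, c in enumerate(chars)}


def one_hot(text, char_index):
    """Return an array of shape (len(text), len(char_index) + 1)."""
    matrix = np.zeros((len(text), len(char_index) + 1), dtype=np.float32)
    for row, c in enumerate(text):
        # Raises KeyError for characters that were not seen while fitting
        matrix[row, char_index[c]] = 1.0
    return matrix


char_index = fit_char_index(["total: 12", "iban: nl"])
encoded = one_hot("total", char_index)
print(encoded.shape)
```

Each row has exactly one entry set to 1, at the column belonging to that character.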

In [0]:
def gen_sub(inv,tar,pad, m = None):
    '''
    Generates a substring from invoice inv and target list tar 
    using the character at index m as a midpoint.
    
    Params:
    inv - an invoice string
    tar - a target list specifying the type of each item
    pad - the amount of padding to attach before and after the focus character
    m - index of the focus character, chosen at random if None
    
    Returns:
    sub - a string with pad characters, the focus character, pad characters
    target - the class of the focus character, tar[m]
    '''
    # If no focus character index is set, choose at random
    if m is None:
        m = np.random.randint(0,len(inv))
        
    l = m - pad # define the lower bound of our substring
    h = m + pad + 1 # define the upper (high) of our substring

    # Sometimes, our lower bound could be below zero
    # In this case we attach the remaining characters from the back of the string
    if l < 0:
        # Get the characters from the back of the file
        s1 = inv[l:None]
        
        # Edge case: Sample size larger than string
        # Our upper bound might be higher than the length of the text
        # In that case we start from the front again
        if h >= len(inv): 
            # How many characters do we need from the front
            overlap = h - len(inv)
            # The string is the entire invoice + some chars from the front
            s2 = inv
            s_over = inv[None:overlap]
            s2 = s2 + s_over
        else:
            # If we don't need chars from the front 
            # we can just select to the upper bound
            s2 = inv[None:h]
            
        # Create substring
        sub = s1 + s2
        # Ensure the substring has the right length
        assert(len(sub) == pad*2 +1)
        return sub, tar[m]
    
    # Our lower bound might be positive but our upper bound might 
    # still be above the length of the invoice
    elif h >= len(inv):
        # Calc how many chars we need from the front
        overlap = h - len(inv)
        
        # Get string from lower bound to end
        s1 = inv[l:None]
        # Get string from the front of the doc
        s2 = inv[None:overlap]
        sub = s1 + s2
        # Make sure our string has the correct length
        assert(len(sub) == pad*2 +1)
        return sub, tar[m]
    
    # Upper and lower bound lie within the length of the invoice
    else: 
        sub = inv[l:h]
        assert(len(sub) == pad*2 +1)
        return sub, tar[m]
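
The three branches above all implement the same idea: treat the invoice as circular and wrap around at both ends. As a point of comparison, that idea can be sketched more compactly with modular indexing. The helper `wrap_window` is hypothetical and only returns the substring, not the target:

```python
def wrap_window(text, mid, pad):
    """Return 2 * pad + 1 characters centred on index mid, wrapping around the ends."""
    n = len(text)
    return "".join(text[(mid + offset) % n] for offset in range(-pad, pad + 1))


print(wrap_window("abcdef", 0, 2))  # 'efabc'
print(wrap_window("abcdef", 5, 2))  # 'defab'
print(wrap_window("abcdef", 3, 1))  # 'cde'
```

For large invoices the slicing in `gen_sub` is likely faster than joining single characters, so this is mainly useful as a readable reference to test against.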

## Generate dataset for training

In [0]:
def gen_dataset(sample_size, n_classes, invoices, targets, tokenizer):
    '''
    Generate a dataset of inputs and outputs for our neural network
    
    Params:
    sample_size - desired sample size
    n_classes - number of classes
    invoices - list of invoices to sample from
    targets - list of corresponding targets to sample from
    tokenizer - a keras tokenizer fit on the invoices
    
    The function creates balanced samples by randomly sampling until 
    an equal number of samples of all types is created.
    
    Characters are one hot encoded
    
    Returns:
    x_arr: a numpy array of shape (sample_size, sequence length, number of unique characters)
    y_arr: a numpy array of shape (sample_size,)
    '''
    
    # Create a budget
    budget = [sample_size / n_classes] * n_classes
    
    # Setup holding variables
    X_train = []
    y_train = []

    # While there is still a budget left...
    while sum(budget) > 0:
        # ... get a random invoice and target list
        index = np.random.randint(0,len(invoices))
        inv = invoices[index]
        tar = targets[index]
        # ... sample up to 10 items from this invoice 
        for j in range(10):
            # Get an item
            x, y = gen_sub(inv,tar,PADDING)
            # if we still have a budget for this items target
            if budget[y] > 0:
                # Tokenize to one hot
                xm = tokenizer.texts_to_matrix(x)
                # Add data and target
                X_train.append(xm)
                y_train.append(y)
                budget[y] -= 1
      
    # Create numpy arrays from all data and targets
    x_arr = np.array(X_train)
    y_arr = np.array(y_train)
    return x_arr,y_arr
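
The budget logic is easier to see on toy data. The sketch below is a simplified stand-in, not the function above: `balanced_indices` is a hypothetical helper that samples indices from a plain list of labels. It also guards against one thing to watch out for with this kind of loop, namely that a class which never occurs in the data would keep its budget forever and the `while` loop would not finish.

```python
import random
from collections import Counter


def balanced_indices(labels, per_class, n_classes, seed=0):
    """Draw indices at random (with replacement) until each present class has per_class picks."""
    rng = random.Random(seed)
    present = set(labels)
    # Classes that never occur get a zero budget, otherwise the loop could not finish
    budget = [per_class if c in present else 0 for c in range(n_classes)]
    picked = []
    while sum(budget) > 0:
        i = rng.randrange(len(labels))
        if budget[labels[i]] > 0:
            picked.append(i)
            budget[labels[i]] -= 1
    return picked


labels = [0] * 90 + [1] * 8 + [2] * 2  # heavily skewed toy labels
picked = balanced_indices(labels, per_class=5, n_classes=4)
print(Counter(labels[i] for i in picked))
```

Because sampling is done with replacement, rare classes are filled by drawing the same few characters repeatedly, which is worth keeping in mind when judging validation accuracy.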

In [0]:
m=0

In [198]:
m += 1
mysubstring = gen_sub(invoices[r], targets[r], PADDING, m=m)
print(m)
print(mysubstring)


25
('ec%\nAan:\n\nORTEC Finance B.V.\nBoompjes 40\n', 1)


In [0]:
# Get data
train_size = 12000
val_size = 120

x_tr, y_tr = gen_dataset(train_size, N_CLASSES, invoices, targets, tokenizer)
x_val, y_val = gen_dataset(val_size, N_CLASSES, invoices, targets, tokenizer)

In [221]:
x_tr.shape #a numpy array of shape (sample_size, sequence length, number of unique characters)


(12000, 41, 85)

## Model building

In [0]:
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense,Activation, Conv1D, MaxPool1D

In [207]:
# A simple model
model = Sequential()
model.add(Conv1D(32,2,input_shape=(None, 85))) # The input shape assumes there are 85 possible characters
model.add(MaxPool1D(2))
model.add(SimpleRNN(10))
model.add(Dense(6))
model.add(Activation('softmax'))

Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead


In [208]:
# sparse_categorical_crossentropy is like categorical crossentropy but without converting targets to one hot
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam', metrics=['acc'])
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_1 (Conv1D)            (None, None, 32)          5472      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, None, 32)          0         
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 10)                430       
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 66        
_________________________________________________________________
activation_1 (Activation)    (None, 6)                 0         
Total params: 5,968
Trainable params: 5,968
Non-trainable params: 0
_________________________________________________________________
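
As a sanity check, the parameter counts in the summary above can be reproduced by hand from the layer sizes. The variable names below are only for this calculation:

```python
n_chars, filters, kernel = 85, 32, 2
rnn_units, n_classes = 10, 6

# Conv1D: one weight per (kernel position, input channel, filter) plus one bias per filter
conv_params = kernel * n_chars * filters + filters
# SimpleRNN: input weights, recurrent weights and one bias per unit
rnn_params = rnn_units * (filters + rnn_units + 1)
# Dense: one weight per (unit, class) plus one bias per class
dense_params = rnn_units * n_classes + n_classes

print(conv_params, rnn_params, dense_params)    # 5472 430 66
print(conv_params + rnn_params + dense_params)  # 5968
```

The pooling and activation layers have no trainable parameters, which is why they show 0 in the summary.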



In [222]:
model.fit(x_tr,y_tr,batch_size=32,epochs=6,validation_data=(x_val,y_val))

Train on 12000 samples, validate on 120 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x7fcb513d6160>

## Generate demo invoice

In [0]:
'''
To make predictions from our model, we need to create 
sequences around every character from the invoice.

We then make predictions for every character based on its surrounding sequence
'''

# Choose a random invoice:
index = np.random.randint(0,len(invoices))
inv = invoices[index]
tar = targets[index]


chars = [] # Holds the individual characters
data = [] # Holds the sequences around the characters
y_true = [] # Holds the true targets for each character

# Loop over characters indices
for i in range(len(inv) -1):
    # Create sequence around this character
    x,y = gen_sub(inv,tar,PADDING,m=i)
    # Tokenize the sequence to one hot
    xm = tokenizer.texts_to_matrix(x)
    # Get the character itself
    c = inv[i]
    
    chars.append(c)
    data.append(xm)
    y_true.append(y)

In [0]:
import pandas as pd

In [0]:
# For demo purposes we can look at what our invoice looks like
df = pd.DataFrame({'Char':chars,'Target':y_true})

In [0]:
# Show all characters belonging to the amount
df[df.Target == 5]

In [0]:
# Create test data for predictions with neural net
x_test = np.array(data)

In [0]:
x_test.shape

## Making predictions

In [0]:
# Make predictions
y_pred = model.predict(x_test)

In [0]:
# Get the most likely class
y_pred = y_pred.argmax(axis=1)

In [0]:
# Show what our model predictions look like
df['Predicted'] = y_pred

In [0]:
# Show all chars that are predicted to belong to the amount
df[df.Predicted == 5]

## Obtain system outputs from predictions

In [0]:
from itertools import groupby
# Create groups by the predicted output
# This code produces a list of tuples with the format
# (category, length, starting index)

# TODO: This code is ugly and very hard to understand
# But it works

# Group by predicted category
g = groupby(enumerate(y_pred), lambda x:x[1])

# Create list of groups
l = [(x[0], list(x[1])) for x in g]

# Create list with tuples of groups
groups = [(x[0], len(x[1]), x[1][0][0]) for x in l]

In [0]:
# Show grouping
groups[:10]
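
Since the grouping code is hard to read, here is a sketch of a more explicit version that should produce tuples in the same `(category, length, starting index)` format. The function name `group_runs` is hypothetical, and the toy label list stands in for `y_pred`:

```python
from itertools import groupby


def group_runs(labels):
    """Return (category, length, start_index) for each run of equal labels."""
    runs = []
    start = 0
    for category, items in groupby(labels):
        length = sum(1 for _ in items)
        runs.append((category, length, start))
        start += length  # the next run begins right after this one
    return runs


print(group_runs([0, 0, 5, 5, 5, 0, 1]))
# [(0, 2, 0), (5, 3, 2), (0, 1, 5), (1, 1, 6)]
```

Keeping a running `start` counter avoids the need for `enumerate` and the nested indexing.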

In [0]:
'''
We only want to consider sequences of predictions of the same type 
that have a minimum length. This way we remove the noise,
but we also might remove some good predictions.

Only sequences longer than 5 characters are kept here, certainly a value to experiment with
'''
candidates = []
# Loop over all groups
for group in groups:
    
    # Unpack group
    category, length, index = group
    
    # Ignore the ignore category and only consider category sequences longer than 5
    if category != 0 and length > 5:
        # Create text
        candidate_text = ''.join(chars[index:index+length])
        # Remove line breaks, this is just one way to prettify outputs!
        candidate_text = candidate_text.replace('\n','')
        candidates.append((candidate_text,category))

In [0]:
# Show predictions

'''
Ignore:           0
Sender Name:      1 
Sender KVK:       2 
Sender IBAN:      3 
Invoice Reference:4
Total:            5
'''

sorted(candidates, key=lambda tup: tup[1])
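
Ranking sequences by confidence could look roughly like the sketch below. It assumes the softmax output of `model.predict` is kept in its own array (the notebook overwrites `y_pred` with the argmax, so the probabilities would need a separate variable). The function `rank_runs` is hypothetical, and it uses the mean confidence per sequence rather than the total, so that long sequences are not favoured automatically. Swap `.mean()` for `.sum()` to get the total instead.

```python
import numpy as np
from itertools import groupby


def rank_runs(probs, min_length=1):
    """Return (category, start, length, mean_confidence) tuples, most confident first."""
    labels = probs.argmax(axis=1)
    ranked = []
    start = 0
    for category, items in groupby(labels):
        length = sum(1 for _ in items)
        # Skip the ignore class (0) and runs that are too short
        if category != 0 and length >= min_length:
            confidence = float(probs[start:start + length, category].mean())
            ranked.append((int(category), start, length, confidence))
        start += length
    return sorted(ranked, key=lambda run: run[3], reverse=True)


# Toy probabilities for 6 characters and 3 classes (class 0 is "ignore")
probs = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.0],
    [0.1, 0.9, 0.0],
    [0.7, 0.2, 0.1],
    [0.3, 0.1, 0.6],
    [0.4, 0.1, 0.5],
])
for run in rank_runs(probs):
    print(run)
```

In this toy example the class 1 sequence comes first because the network is more confident about it than about the class 2 sequence.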