# Invoice reading NLP system
Remember this image? IT IS BACK!!!
![System image](https://storage.googleapis.com/aibootcamp/general_assets/ml_system_architecture.png)


This week is all about system building, because an ML system hardly ever stands alone. Your success in building a system for Ortec Finance depends as much on what surrounds your neural net as on the neural net itself. This baseline is my approach to the problem. Much in this notebook was hacked together, so I am sure you can improve on many points. Perhaps you will even come up with a completely different approach.

## The approach, character-wise classification:
The goal of the task is to extract information from the invoice. The invoice has been run through optical character recognition (OCR). OCR turns PDFs into text but often messes up the order and confuses some characters. **To extract information from this text, we classify each character by category**. 

For example, if we just wanted to get the amount, we would classify the characters like this:

|T|O|T|A|L|:| |€| |4|3|6|.|0|0|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|
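
The label row in the table above can be reproduced with a few lines of Python. This is only a minimal sketch: `label_characters` is a hypothetical helper, not part of the notebook, and it assumes we already know where the amount starts and ends in the text.

```python
def label_characters(text, start, end, label=1, ignore=0):
    """Return one class label per character of `text`.

    Characters in text[start:end] get `label`, all others get `ignore`.
    (Hypothetical helper for illustration, not the notebook's own code.)
    """
    return [label if start <= i < end else ignore for i in range(len(text))]


text = "TOTAL: € 436.00"
start = text.index("€")  # here the amount field starts at the currency sign
labels = label_characters(text, start, len(text))

# Print each character next to its label, like the columns of the table.
for char, lab in zip(text, labels):
    print(repr(char), lab)
```

The result is one label per character, so the text and the label list always have the same length.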

We classify our text into 6 classes here:

|Category|Class|
|-|-|
|Ignore|0|
|Sender Name|1|
|Sender KVK|2|
|Sender IBAN|3|
|Invoice Reference|4|
|Total|5|

These are the classes that the training data generator tags. But the class of a character does not depend only on the character itself; it depends on its surroundings as well. To train our model, we create substrings of our invoice that include a certain number of preceding and succeeding characters. This number is defined in the `PADDING` global variable. 

If, for example, we wanted to classify the character '€' from the example above and had `PADDING = 3`, we would feed
'L: € 43' into our network. The amount of padding determines how much context the network sees, so it has a great influence on the performance of our system.
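
The windowing step can be sketched as follows. `char_window` is a hypothetical name, and filling the edges of the text with spaces is an assumption made here so that every window has the same length; the notebook may handle the edges differently.

```python
PADDING = 3  # value used in the example above


def char_window(text, index, padding=PADDING, fill=" "):
    """Return the character at `index` plus `padding` characters on each side.

    Assumption: positions before the start or after the end of the text
    are filled with `fill`, so the window always has 2 * padding + 1 characters.
    """
    padded = fill * padding + text + fill * padding
    # `index` in `text` corresponds to `index + padding` in `padded`,
    # so the window starts at `index` in `padded`.
    return padded[index:index + 2 * padding + 1]


text = "TOTAL: € 436.00"
window = char_window(text, text.index("€"))
print(repr(window))
```

Every character of the invoice yields one such window, and the class of the center character is the training target for that window.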

## Post-processing:
A significant part of model performance stems from what is done with the outputs of the neural net. This approach groups predictions into prediction sequences and only keeps sequences in which 5 consecutive characters were assigned to the same category. An approach to try would be to allow sequences to be interrupted by one character. Another nice add-on would be to rank predicted sequences by the total confidence the neural network has in the sequence. 
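
The grouping step could look roughly like this. It is a sketch, not the notebook's implementation: `extract_sequences` is a hypothetical name, it assumes a run must be at least `min_len` characters long, and it assumes class 0 (Ignore) is always dropped.

```python
from itertools import groupby


def extract_sequences(text, labels, min_len=5, ignore_class=0):
    """Group consecutive equal labels and keep the long, non-ignored runs.

    Returns a list of (label, substring) pairs.
    Assumptions: runs shorter than `min_len` are discarded and
    `ignore_class` is never reported.
    """
    sequences = []
    pos = 0
    for label, run in groupby(labels):
        run_len = len(list(run))
        if label != ignore_class and run_len >= min_len:
            sequences.append((label, text[pos:pos + run_len]))
        pos += run_len
    return sequences


text = "TOTAL: € 436.00"
labels = [0] * 7 + [5] * 8  # class 5 is Total
found = extract_sequences(text, labels)
print(found)
```

Short runs, such as two stray characters tagged as an IBAN, are filtered out by the `min_len` check, which is what makes this step robust against isolated misclassifications.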

## Some tips:
For this assignment you can dive pretty deep into software development. 
You might find these Jupyter tricks helpful: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

Debugging with `pdb` in particular makes things much easier: https://docs.python.org/3.5/library/pdb.html#debugger-commands

Basically, if anything crashes, you can start a new cell and enter `%debug`. This drops you into a command line in which you can inspect the state of the program at the time of the crash.
The debugger has some special commands. For example, `p my_var` prints the value of a variable. This also works for other Python expressions, e.g. `p len(my_list)`.

Good luck with building a great system!

In [25]:
!git clone https://github.com/riklmr/MLiFC_data_invoices

fatal: destination path 'MLiFC_data_invoices' already exists and is not an empty directory.


In [28]:
!cd MLiFC_data_invoices ;git pull

remote: Counting objects: 3, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 1), reused 3 (delta 1), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/riklmr/MLiFC_data_invoices
   42cd6ea..1adf5e3  master     -> origin/master
Updating 42cd6ea..1adf5e3
Fast-forward
 invoices_train.z01     | Bin 52428800 -> 0 bytes
 invoices_train.z02     | Bin 52428792 -> 0 bytes
 invoices_train.zip     | Bin 1666554 -> 0 bytes
 invoices_train_20k.zip | Bin 0 -> 21303034 bytes
 4 files changed, 0 insertions(+), 0 deletions(-)
 delete mode 100644 invoices_train.z01
 delete mode 100644 invoices_train.z02
 delete mode 100644 invoices_train.zip
 create mode 100644 invoices_train_20k.zip


In [0]:
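# `!cd` runs in its own subshell, so the notebook's working directory stays unchanged
# (the `%cd` magic would change it persistently).
!cd MLiFC_data_invoices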

In [30]:
!ls MLiFC_data_invoices

invoices_train_20k.zip	LICENSE  README.md


In [31]:
!unzip MLiFC_data_invoices/invoices_train_20k.zip -q

Archive:  MLiFC_data_invoices/invoices_train_20k.zip
   creating: train/
  inflating: train/09562215-ec1e-42d6-9526-fc3adf37eb68.csv  
  inflating: train/6814aa63-4d54-4f08-b763-91a90ea64afd.csv  
  inflating: train/3c9e0551-b35c-4e33-8dd8-a63f246e9899.csv  
  inflating: train/1502888c-d11a-4823-b424-bd82b3d9fd6e.csv  
  inflating: train/5889b787-625a-463e-b7b3-2826d8d3d06e.csv  
  inflating: train/78afe0ca-21ed-42bb-8d94-dc2316dd02a9.csv  
  inflating: train/9ce2d52a-9b21-4811-940f-bda76e513427.csv  
  inflating: train/1fae3397-8e53-454d-8a8f-464a67bc170d.csv  
  inflating: train/a6578490-6e7e-428a-afb1-ebb0fecad000.csv  
  inflating: train/3860113d-6038-4436-8431-da69f4f75e26.csv  
  inflating: train/56c5727c-7b9d-4e74-b716-b51dc6a07e41.csv  
  inflating: train/58657bb9-696b-4c8d-be5e-9ceb8e27fd09.csv  
  inflating: train/63a690ea-69fd-4ee9-851b-0cf6295f8c14.csv  
  inflating: train/f88d2c9c-cf7d-45fa-ac13-b769f4a6d23c.csv  

In [34]:
# Shows the disk usage of the directory train 
!du -sh train

134M	train


In [33]:
train.shape

NameError: ignored

In [18]:
!rm -r

datalab  MLiFC_data_invoices


In [0]:
!pip install -q keras

## Loading templates

In [0]:
# System hyper parameters here

# How many characters before and after the main char to feed the NN
PADDING = 20 


'''
Ignore:           0
Sender Name:      1 
Sender KVK:       2 
Sender IBAN:      3 
Invoice Reference:4
Total:            5
'''
N_CLASSES = 6

In [0]:
# Invoice data generator
from templates.invoicegen import create_invoice

In [0]:

# Your friendly tokenizer
from keras.preprocessing.text import Tokenizer

# Numpy
import numpy as np

import pandas as pd

# glob
import glob

In [0]:
# This is the ortec package made by Rik. It should be imported as a package in the final model in GitHub.

def select_batch(start=0, batch_size=32, stop=32):
    while start < stop:
        yield [start, min(stop, start + batch_size)]
        start += batch_size
#

def select_filebatch(filenames=[], start=0, batch_size=32, total=32):
    # Never read past the end of the file list or beyond the requested total
    stop = min(len(filenames), total)
    for (first_idx, last_idx) in select_batch(start=start, batch_size=batch_size, stop=stop):
        yield filenames[first_idx:last_idx]
#

def select_invoicebatch(filenames=[], batch_size=32, total=32):
    """yields list of invoices, list of targets, list of truths"""
    for file_batch in select_filebatch(filenames=filenames, batch_size=batch_size, total=total):
        invoices = []
        targets = []
        truths = []

        for file in file_batch:
            mysample = pd.read_csv(file)
            # each file only contains one row, that's why we get away with the 0 in .loc[0,'invoice']
            # else we needed to start another level of iteration
            invoice  = mysample.loc[0,'invoice']
            target   = eval(mysample.loc[0,'target'])
            truth    = eval(mysample.loc[0,'truth'])
            invoices.append(invoice)
            targets.append(target)
            truths.append(truth)
        #
        yield invoices, targets, truths
    #
#
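
The `eval` calls above execute whatever text is stored in the CSV cell. If the `target` and `truth` columns only ever hold Python literals (lists, tuples, dicts, numbers, strings), which is an assumption about the generated files, the standard library's `ast.literal_eval` parses them without running arbitrary code. A minimal sketch, where `parse_cell` is a hypothetical helper and the cell contents are made up:

```python
import ast

def parse_cell(cell):
    """Parse a CSV cell that holds a Python literal (hypothetical helper)."""
    return ast.literal_eval(cell)

# Made-up cell contents, only to illustrate the call
target = parse_cell("[0, 0, 1, 1, 5]")
truth = parse_cell("[('sender_name', 'Unilever Nederland')]")
print(target)
print(truth[0][1])
```

If a cell contains anything other than a literal, `literal_eval` raises an error instead of executing it.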

In [41]:
train_dir = 'train/'
filenames_all = glob.glob(train_dir + "*.csv")
print("{} files found in directory".format(len(filenames_all)))

batch_size = 32


19992 files found in directory


In [49]:
# after importing ortec package, use = ortec.select_invoicebatch...
# for instance, in a for-loop:
workbatch = select_invoicebatch(filenames=filenames_all, batch_size=batch_size, total=70)  

piece = slice(180,270)
for invoice, target, truth in workbatch:
    print()
    print("   retrieved batch of {} invoices".format(len(invoice)))
    # print some of the retrieved content, just for the first invoice in the batch
    print((invoice[0][piece]).replace("\n", " ")) #print a piece of the invoice, replace newlines
    print("".join([str(x) for x in target[0][piece]]) ) #print corresponding piece of target
    print(truth[0][1])  # print the True Sender Name


   retrieved batch of 32 invoices
69393 000019267231 ING NETHERLANDS NL02INGB0681309748 INGBNL2A 06-88163931 info@unilever.c
222220000000000000000000000000000003333333333333333330000000000000000000000000000000000000
Unilever Nederland

   retrieved batch of 32 invoices
 Boompjes 40 3011XB Rotterdam  Invoice Summary Invoice Number: Invoice Date:  QPHNASEKUX 1
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
ING Bank N.V.

   retrieved batch of 6 invoices
bject': ''} Oostplein 97 3011 KW Rotterdam (+31) (0)10 7982088 (+31) (0)10 2125351 klanten
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
ORTEC Finance B.V.


## This is where we got stuck on how to feed the self-generated invoices to the model. We could only get the originally generated invoices to work. None of the code after this point currently works.

In [54]:
# The all_invoices is a generator object yielding three lists of invoices, targets and truths
all_invoices = select_invoicebatch(filenames=filenames_all)
all_invoices()

for invoice, target, truth in all_invoices:
    pass

AttributeError: ignored
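
A generator object is iterated, not called, so `all_invoices()` cannot work. The later cells also expect flat lists named `invoices` and `targets`, while the generator yields one batch at a time. One possible way to bridge the two is sketched below. `flatten_batches` is a hypothetical helper, and `fake_batches` is a stand-in with made-up data that only mimics the shape of what `select_invoicebatch` yields:

```python
def flatten_batches(batches):
    """Collect every (invoices, targets, truths) batch into three flat lists."""
    invoices, targets, truths = [], [], []
    for inv_batch, tar_batch, truth_batch in batches:
        invoices.extend(inv_batch)
        targets.extend(tar_batch)
        truths.extend(truth_batch)
    return invoices, targets, truths

def fake_batches():
    """Stand-in generator with made-up data (two batches, three invoices)."""
    yield ["TOTAL: 1", "IBAN NL"], [[0] * 8, [0] * 7], [["a"], ["b"]]
    yield ["REF 42"], [[0] * 6], [["c"]]

invoices, targets, truths = flatten_batches(fake_batches())
print(len(invoices), len(targets), len(truths))  # prints: 3 3 3
```

With the real data this would presumably be called as `flatten_batches(select_invoicebatch(filenames=filenames_all, total=len(filenames_all)))`. Note that `select_invoicebatch` defaults to `total=32`, so without the `total` argument only the first 32 files are read.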

## Generate substring

In [0]:
# Create our tokenizer
# We will tokenize on character level!
# We will NOT remove any characters
tokenizer = Tokenizer(char_level=True, filters=None)
tokenizer.fit_on_texts(invoices)
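
To make the one hot encoding less of a black box, here is a simplified pure Python stand-in for what the character-level tokenizer produces. It is not the Keras implementation: Keras orders characters by frequency, while this sketch sorts them alphabetically, and the names `fit_char_index` and `one_hot` are hypothetical. Two properties are mirrored on purpose: index 0 stays unused, so the matrix is one column wider than the number of distinct characters, and text is lowercased, which is the Keras default.

```python
def fit_char_index(texts):
    """Map every distinct (lowercased) character to an index starting at 1."""
    chars = sorted({c for text in texts for c in text.lower()})
    return {c: i + 1 for i, c in enumerate(chars)}  # column 0 stays unused

def one_hot(text, index):
    """Return one row per character, with a single 1 in that character's column."""
    width = len(index) + 1
    rows = []
    for c in text.lower():
        row = [0] * width
        if c in index:  # characters never seen during fitting stay all zero
            row[index[c]] = 1
        rows.append(row)
    return rows

index = fit_char_index(["ab", "bc"])
print(index)
print(one_hot("ca", index))
```

Under this scheme a substring of 41 characters becomes a matrix with 41 rows, and the number of columns is the number of distinct characters plus one.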

In [0]:
def gen_sub(inv,tar,pad, m = None):
    '''
    Generates a substring from invoice inv and target list tar 
    using the character at index m as a midpoint.
    
    Params:
    inv - an invoice string
    tar - a target list specifying the class of each character
    pad - the amount of padding to attach before and after the focus character
    m - index of the focus character; chosen at random if None
    
    Returns:
    sub - a string with pad characters, the focus character, pad characters
    tar[m] - the target class of the focus character
    '''
    # If no focus character index is set, choose at random
    if m is None:
        m = np.random.randint(0,len(inv))
        
    l = m - pad # define the lower bound of our substring
    h = m + pad + 1 # define the upper (high) of our substring

    # Sometimes, our lower bound could be below zero
    # In this case we attach the remaining characters from the back of the string
    if l < 0:
        # Get the characters from the back of the file
        s1 = inv[l:None]
        
        # Edge case: Sample size larger than string
        # Our upper bound might be higher than the length of the text
        # In that case we start from the front again
        if h >= len(inv): 
            # How many characters do we need from the front
            overlap = h - len(inv)
            # The string is the entire invoice + some chars from the front
            s2 = inv
            s_over = inv[None:overlap]
            s2 = s2 + s_over
        else:
            # If we don't need chars from the front 
            # we can just select to the upper bound
            s2 = inv[None:h]
            
        # Create substring
        sub = s1 + s2
        # Ensure the substring has the right length
        assert(len(sub) == pad*2 +1)
        return sub, tar[m]
    
    # Our lower bound might be positive but our upper bound might 
    # still be above the length of the invoice
    elif h >= len(inv):
        # Calc how many chars we need from the front
        overlap = h - len(inv)
        
        # Get string from lower bound to end
        s1 = inv[l:None]
        # Get string from the front of the doc
        s2 = inv[None:overlap]
        sub = s1 + s2
        # Make sure our string has the correct length
        assert(len(sub) == pad*2 +1)
        return sub, tar[m]
    
    # Upper and lower bound lie within the length of the invoice
    else: 
        sub = inv[l:h]
        assert(len(sub) == pad*2 +1)
        return sub, tar[m]
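
The three branches of `gen_sub` all implement the same idea: treat the invoice as circular and read `pad` characters on either side of the focus character. A compact way to express that, shown here only as a sketch with a hypothetical function name, is modular indexing:

```python
def circular_window(text, mid, pad):
    """Return the 2*pad+1 characters centred on index mid, wrapping at both ends."""
    n = len(text)
    return "".join(text[i % n] for i in range(mid - pad, mid + pad + 1))

print(circular_window("abcdefgh", 0, 2))  # wraps at the front: ghabc
print(circular_window("abcdefgh", 7, 2))  # wraps at the back:  fghab
print(circular_window("abcdefgh", 4, 2))  # no wrapping needed: cdefg
```

Because every index is taken modulo the text length, this variant also returns a window of the right length when the text is shorter than the window itself.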

## Generate dataset for training

In [0]:
def gen_dataset(sample_size, n_classes, invoices, targets, tokenizer):
    '''
    Generate a dataset of inputs and outputs for our neural network
    
    Params:
    sample_size - desired sample size
    n_classes - number of classes
    invoices - list of invoices to sample from
    targets - list of corresponding targets to sample from
    tokenizer - a keras tokenizer fit on the invoices
    
    The function creates balanced samples by randomly sampling until 
    an equal amount of samples of all types is created.
    
    Characters are one hot encoded
    
    Returns:
    x_arr: a numpy array of shape (sample_size, sequence length, number of unique characters)
    y_arr: a numpy array of shape (sample_size,)
    '''
    
    # Create a budget
    budget = [sample_size / n_classes] * n_classes
    
    # Setup holding variables
    X_train = []
    y_train = []

    # While there is still a budget left...
    while sum(budget) > 0:
        # ... get a random invoice and target list
        index = np.random.randint(0,len(invoices))
        inv = invoices[index]
        tar = targets[index]
        # ... sample up to 10 items from this invoice 
        for j in range(10):
            # Get an item
            x, y = gen_sub(inv,tar,PADDING)
            # if we still have a budget for this items target
            if budget[y] > 0:
                # Tokenize to one hot
                xm = tokenizer.texts_to_matrix(x)
                # Add data and target
                X_train.append(xm)
                y_train.append(y)
                budget[y] -= 1
      
    # Create numpy arrays from all data and targets
    x_arr = np.array(X_train)
    y_arr = np.array(y_train)
    return x_arr,y_arr
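
The budget mechanism in `gen_dataset` keeps drawing until every class has used up its share. If one class never shows up in the invoices, the `while` loop never ends. The sketch below isolates the budget idea and adds a cap on the number of draws. The function name, the `max_draws` parameter and the toy data are all assumptions made for illustration:

```python
import random

def balanced_sample(items, labels, per_class, n_classes, max_draws=10_000, seed=0):
    """Draw items at random until each class has per_class samples or draws run out."""
    rng = random.Random(seed)
    budget = [per_class] * n_classes
    xs, ys = [], []
    draws = 0
    while sum(budget) > 0 and draws < max_draws:
        i = rng.randrange(len(items))
        draws += 1
        y = labels[i]
        if budget[y] > 0:  # only keep the item if its class still has budget
            xs.append(items[i])
            ys.append(y)
            budget[y] -= 1
    return xs, ys

# Toy data: class 0 is four times as common as class 1, class 2 never occurs
items = list("aaaaaaaabb")
labels = [0] * 8 + [1] * 2

xs, ys = balanced_sample(items, labels, per_class=3, n_classes=2)
print(ys.count(0), ys.count(1))

# With a class that never occurs, the cap stops the loop instead of hanging
xs3, ys3 = balanced_sample(items, labels, per_class=3, n_classes=3, max_draws=1000)
print(len(ys3))
```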

In [0]:
# Get data
train_size = 12000

# Shaffy: Increased validation size to 1200
val_size = 1200

x_tr, y_tr = gen_dataset(train_size, N_CLASSES, invoices, targets, tokenizer)
x_val, y_val = gen_dataset(val_size, N_CLASSES, invoices, targets, tokenizer)

In [0]:
x_tr.shape

(12000, 41, 85)

## Model building

In [0]:
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense,Activation, Conv1D, MaxPool1D

In [0]:
# A simple model
model = Sequential()

# Shaffy: Played around with convolutions and size, but this seems best
model.add(Conv1D(32,2,input_shape=(None, 85))) # The input shape assumes there are 85 possible characters
model.add(MaxPool1D(1))

# Shaffy: Changed the number of RNN units from 10 to 50
model.add(SimpleRNN(50))
model.add(Dense(6))

# Shaffy: tried a few different activations but this one was best
model.add(Activation('softmax'))

In [0]:
# sparse_categorical_crossentropy is like categorical crossentropy but without converting targets to one hot
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam', metrics=['acc'])
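
A small worked example may help show why the sparse loss needs no one hot targets. The sketch below uses plain Python rather than Keras, with made-up probabilities for three classes. It computes the cross entropy once by indexing with the integer label and once with an explicit one hot vector. Both give the same number.

```python
import math

# Made-up softmax outputs for two samples and three classes
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.1, 0.8]]
labels = [0, 2]
n_classes = 3

# Sparse form: look up the probability of the true class by its integer label
sparse_loss = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

# Dense form: one hot encode the labels, then sum over all classes
one_hot = [[1.0 if k == y else 0.0 for k in range(n_classes)] for y in labels]
dense_loss = -sum(
    t * math.log(q)
    for target_row, prob_row in zip(one_hot, probs)
    for t, q in zip(target_row, prob_row)
) / len(labels)

print(sparse_loss, dense_loss)
```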

In [0]:
model.fit(x_tr,y_tr,batch_size=32,epochs=6,validation_data=(x_val,y_val))

Train on 12000 samples, validate on 1200 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<keras.callbacks.History at 0x7ff107f6b7f0>

## Generate demo invoice

In [0]:
'''
To make predictions from our model, we need to create 
sequences around every character from the invoice.

We then make a prediction for every character based on the sequence around it
'''

# Choose a random invoice:
index = np.random.randint(0,len(invoices))
inv = invoices[index]
tar = targets[index]


chars = [] # Holds the individual characters
data = [] # Holds the sequences around the characters
y_true = [] # Holds the true targets for each character

# Loop over characters indices
for i in range(len(inv) -1):
    # Create sequence around this character
    x,y = gen_sub(inv,tar,PADDING,m=i)
    # Tokenize the sequence to one hot
    xm = tokenizer.texts_to_matrix(x)
    # Get the character itself
    c = inv[i]
    
    chars.append(c)
    data.append(xm)
    y_true.append(y)

In [0]:
import pandas as pd

In [0]:
# For demo purposes we can look what our invoice looks like
df = pd.DataFrame({'Char':chars,'Target':y_true})

In [0]:
# Show all characters belonging to the amount
df[df.Target == 5]

| |Char|Target|
|-|-|-|
|307|1|5|
|308|7|5|
|309|8|5|
|310|1|5|
|311|7|5|
|312|.|5|
|313|4|5|
|314|9|5|


In [0]:
# Create test data for predictions with neural net
x_test = np.array(data)

In [0]:
x_test.shape

(1837, 41, 85)

## Making predictions

In [0]:
# Make predictions
y_pred = model.predict(x_test)

In [0]:
# Get the most likely class
y_pred = y_pred.argmax(axis=1)

In [0]:
# Show what our model's predictions look like
df['Predicted'] = y_pred

In [0]:
# Show all chars that are predicted to belong to the amount
df[df.Predicted == 5]

| |Char|Target|Predicted|
|-|-|-|-|
|305|\n|0|5|
|306|\n|0|5|
|307|1|5|5|
|308|7|5|5|
|309|8|5|5|
|310|1|5|5|
|311|7|5|5|
|312|.|5|5|
|313|4|5|5|
|314|9|5|5|


## Obtain system outputs from predictions

In [0]:
from itertools import groupby
# Create groups by the predicted output
# This code returns a list of tuples with the format
# (category, length, starting index)

# TODO: This code is ugly and very hard to understand
# But it works

# Group by predicted category
g = groupby(enumerate(y_pred), lambda x:x[1])

# Create list of groups
l = [(x[0], list(x[1])) for x in g]

# Create list with tuples of groups
groups = [(x[0], len(x[1]), x[1][0][0]) for x in l]

In [0]:
# Show grouping
groups[:10]

[(0, 130, 0),
 (4, 6, 130),
 (0, 19, 136),
 (1, 20, 155),
 (0, 125, 175),
 (2, 5, 300),
 (5, 12, 305),
 (0, 601, 317),
 (4, 2, 918),
 (0, 478, 920)]
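
The grouping step is a run-length encoding of the predicted classes. A more readable way to get the same `(category, length, starting index)` tuples is sketched below with a hypothetical function name and a made-up label sequence:

```python
from itertools import groupby

def run_lengths(labels):
    """Return (category, length, start index) for each run of equal labels."""
    runs = []
    start = 0
    for category, run in groupby(labels):
        length = sum(1 for _ in run)  # count the items in this run
        runs.append((category, length, start))
        start += length
    return runs

print(run_lengths([0, 0, 0, 5, 5, 0, 4, 4, 4, 4]))
# [(0, 3, 0), (5, 2, 3), (0, 1, 5), (4, 4, 6)]
```

Tracking the running start index removes the need for `enumerate` and the nested indexing.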

In [0]:
'''
We only want to consider sequences of predictions of the same type 
that have a minimum length. This removes noise,
but it might also remove some good predictions.

Sequences must be longer than 5 characters here, certainly a value to experiment with
'''
candidates = []
# Loop over all groups
for group in groups:
    
    # Unpack group
    category, length, index = group
    
    
     # Shaffy: Raised length from 5 
    if category != 0 and length > 5:
        # Create text
        candidate_text = ''.join(chars[index:index+length])
        
        # Remove line breaks, this is just one way to prettify outputs!
        candidate_text = candidate_text.replace('\n','')
        candidates.append((candidate_text,category))
        
    # Shaffy: we can use these types of arguments similar to Regex    
    if category == 2 and length == 8:
        candidate_text = ''.join(chars[index:index+length])
        candidate_text = candidate_text.replace('\n','')
        candidates.append((candidate_text,category))
        
    if category == 3 and length > 15:
        candidate_text = ''.join(chars[index:index+length])
        candidate_text = candidate_text.replace('\n','')
        candidates.append((candidate_text,category))
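
In the loop above, a run that passes both the general length check and a category-specific check is appended twice. One way to avoid that is to apply exactly one length rule per category. The sketch below assumes that reading of the intent: the rule table, the function name and the toy data are hypothetical, and the thresholds only mirror the checks above.

```python
# Hypothetical rule table: category -> test on the length of the run
LENGTH_RULES = {
    1: lambda n: n > 5,    # sender name
    2: lambda n: n == 8,   # sender KVK, mirroring the length == 8 check
    3: lambda n: n > 15,   # sender IBAN, mirroring the length > 15 check
    4: lambda n: n > 5,    # invoice reference
    5: lambda n: n > 5,    # total
}

def select_candidates(groups, chars, rules=LENGTH_RULES):
    """Keep each run at most once, if it passes the rule for its category."""
    candidates = []
    for category, length, index in groups:
        rule = rules.get(category)  # category 0 has no rule and is skipped
        if rule is not None and rule(length):
            text = "".join(chars[index:index + length]).replace("\n", "")
            candidates.append((text, category))
    return candidates

# Toy data in the same (category, length, starting index) format
chars = list("REF AB12CD\nKVK 12345678 x")
groups = [(0, 4, 0), (4, 6, 4), (0, 5, 10), (2, 8, 15), (0, 2, 23)]
print(select_candidates(groups, chars))
# [('AB12CD', 4), ('12345678', 2)]
```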
        

In [0]:
# Show predictions

'''
Ignore:           0
Sender Name:      1 
Sender KVK:       2 
Sender IBAN:      3 
Invoice Reference:4
Total:            5
'''

sorted(candidates, key=lambda tup: tup[1])

[('TTN: ING Bank N.V.B', 1), ('DHE8F', 4), ('17817.49', 5)]
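
The introduction suggests ranking predicted sequences by the confidence of the network. That needs the per-character class probabilities, which the `argmax` cell overwrites, so they would have to be kept in a separate variable. The sketch below assumes such a list of probability rows and uses the mean probability of the winning class as the confidence of a run. Both the function name and the choice of the mean are assumptions, and the numbers are made up.

```python
from itertools import groupby

def rank_runs(probs):
    """Group per-character probabilities into runs and sort by mean confidence."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in probs]  # argmax per row
    confidences = [max(p) for p in probs]
    runs, start = [], 0
    for category, run in groupby(labels):
        length = len(list(run))
        mean_conf = sum(confidences[start:start + length]) / length
        runs.append((category, start, length, mean_conf))
        start += length
    # Most confident run first
    return sorted(runs, key=lambda r: r[3], reverse=True)

# Made-up probabilities for four characters and two classes
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
for category, start, length, conf in rank_runs(probs):
    print(category, start, length, round(conf, 2))
```

Ranked this way, the best run per category could be reported instead of every run that passes the length filter.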

In [24]:
print(candidates)

NameError: ignored