# Sample code: reading invoices from disk
The generated invoices are stored as individual CSV files on disk. Each file has the following format:
```
pandas DataFrame with three columns:
Invoice  Target  Truth
                 _____
                   |
                   V
   ________________________________________________________________________
   "uuid", "Sender Name", "Sender KvK", "Sender IBAN", "Reference", "Total"
```

* "Invoice" contains the full text of the invoice as a single string, regardless of the process that generated it.
* "Target" is a stringified version of Jannes' target array. Its elements correspond to character positions in the "Invoice" string. Each element holds one value: a number in the range [0-5] denoting the class of the character at that position.
* "Truth" is a string containing six strings, separated by commas.

Target and Truth are serialized arrays; deserialize the string back into an array with eval().
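
As a hedged alternative to `eval()`, the standard library's `ast.literal_eval` parses list literals without executing arbitrary code. The sketch below assumes both columns were written as plain Python list literals; the sample strings and the `deserialize` helper are made up for illustration.

```python
import ast

def deserialize(cell):
    """Parse a stringified list back into a Python list.

    ast.literal_eval only accepts literals (lists, strings, numbers, ...),
    so it cannot run code that happens to be stored in the file.
    """
    return ast.literal_eval(cell)

# Hypothetical cell contents, shaped like the description above.
target = deserialize("[0, 1, 1, 1, 0]")
truth = deserialize(
    "['hypothetical-uuid', 'ING Bank N.V.', '12345678', "
    "'NL00BANK0123456789', 'REF-1', '100.00']"
)

print(target)
print(truth[1])
```

If the arrays were serialized in some other form (for example a NumPy repr), this helper would raise an error and a different parser would be needed.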

The indexes in "Truth" correspond to the class numbers in "Target", except for 0. In "Target", class 0 means <unknown>. In "Truth", index 0 holds the uuid of the invoice.
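
The relation between the three columns can be sketched as follows: collecting the characters of "Invoice" whose "Target" class is k should give back entry k of "Truth". The invoice text, the uuid and the helper `extract_field` are illustrative assumptions, not part of ortec.py.

```python
def extract_field(invoice, target, class_id):
    """Join all characters of `invoice` whose class in `target` equals `class_id`.

    If a field occurs more than once in the invoice, the occurrences
    are concatenated without a separator.
    """
    return "".join(ch for ch, cls in zip(invoice, target) if cls == class_id)

# Hypothetical, already deserialized sample.
invoice = " ING Bank N.V. DATUM: 25 februari 2016"
target = [0] + [1] * 13 + [0] * (len(invoice) - 14)  # class 1 marks the sender name
truth = ["hypothetical-uuid", "ING Bank N.V.", "", "", "", ""]

sender_name = extract_field(invoice, target, 1)
print(sender_name == truth[1])
```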


# Do It Yourself
Because the Google Drive API will not do this for me:

Download [invoices_train.zip](https://drive.google.com/file/d/1oBT6NP0y6V6xBwVdLvIB_0Sk1NjWpUUZ)
via your local machine on to the notebook runtime environment.
Do not use "Save link as"; just click the link and open Google Drive in a new browser tab.

Make sure it is still named **invoices_train.zip** when it reaches the notebook runtime.
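
Once the file has arrived, a small check can confirm that it really is a zip archive (and not an HTML page) before unpacking it. This is a sketch using the standard `zipfile` module; the function name `extract_invoices` and the destination directory are assumptions, and the folder layout inside the archive is not known here.

```python
import zipfile
from pathlib import Path

def extract_invoices(zip_path="invoices_train.zip", dest="."):
    """Unpack the archive into `dest` and return the names of its members."""
    zip_path = Path(zip_path)
    # is_zipfile returns False for missing files and for non-zip content,
    # such as an HTML page saved under a .zip name.
    if not zipfile.is_zipfile(zip_path):
        raise ValueError("{} is not a zip archive".format(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()

# Usage (uncomment once the file is in the runtime's working directory):
# names = extract_invoices("invoices_train.zip", dest=".")
# print("{} entries extracted".format(len(names)))
```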

### Alternative: use the repo
git clone https://github.com/riklmr/MLiFC_data_invoices


### I tried these already
* Colab snippet for downloading a file from Google Drive: does not seem to work for (large?) zip files
* wget on the shareable link from Google Drive: only downloads a silly HTML file

### Different alternatives?
* recursive wget?

    

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
whos

Interactive namespace is empty.


In [119]:
import glob
import pandas as pd
import ortec # home made library by Rik


In [2]:
!git clone https://github.com/riklmr/MLiFC_data_invoices

Cloning into 'MLiFC_data_invoices'...
remote: Counting objects: 18, done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 18 (delta 3), reused 5 (delta 0), pack-reused 0
Unpacking objects: 100% (18/18), done.
Checking connectivity... done.


In [4]:
train_dir = "./train/"

The library ortec.py now contains the following functions.

In [122]:
! cat ortec.py

import glob
import pandas as pd

def select_batch(start=0, batch_size=32, stop=32):
    while start < stop:
        yield [start, min(stop, start + batch_size)]
        start += batch_size
#

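def select_filebatch(filenames=[], start=0, batch_size=32, total=32):
    # never read past the end of the list; `total` is an exclusive upper bound on the index
    stop = min(len(filenames), total)
    # pass `start` through so the caller's offset is honoured
    for (first_idx, last_idx) in select_batch(start=start, batch_size=batch_size, stop=stop):
        yield filenames[first_idx:last_idx]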
#

def select_invoicebatch(filenames=[], batch_size=32, total=32):
    """yields list of invoices, list of targets, list of truths"""
    for file_batch in select_filebatch(filenames=filenames, batch_size=batch_size, total=total):
        invoices = []
        targets = []
        truths = []

        for file in file_batch:
            mysample = pd.read_csv(file)
            # each file only contains one row, that's why we get away with the 0 in .loc[0,'invoice']
            # else we needed to start another level of iteration
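
The listing is cut off at this point. As a rough sketch of what reading one sample could look like (not the actual remainder of ortec.py), the hypothetical helper `load_sample` below reads a one-row CSV and deserializes the two array columns. The lowercase column names are taken from the comment above and are an assumption; `ast.literal_eval` is used here in place of `eval()`.

```python
import ast
import pandas as pd

def load_sample(path, columns=("invoice", "target", "truth")):
    """Read a one-row CSV file and return (invoice, target, truth).

    Column names are an assumption; pass e.g. ("Invoice", "Target", "Truth")
    if the files use capitalized headers.
    """
    sample = pd.read_csv(path)
    inv_col, tgt_col, tru_col = columns
    # each file holds a single row, so index 0 is the only one
    invoice = sample.loc[0, inv_col]
    target = ast.literal_eval(sample.loc[0, tgt_col])
    truth = ast.literal_eval(sample.loc[0, tru_col])
    return invoice, target, truth
```

Inside the batch loop, each returned triple would then be appended to `invoices`, `targets` and `truths` before the three lists are yielded.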
   

# Example of using the data and functions:



In [123]:
filenames_all = glob.glob(train_dir + "*.csv")
print("{} files found in directory".format(len(filenames_all)))

batch_size = 32


109503 files found in directory


### workbatch is a Python "generator"
It returns a batch of invoices every time it is asked for the next item, so it can be used wherever an iterable is expected.
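
To make the batching visible without touching any files, the index generator `select_batch` from ortec.py can be run on its own. With a batch size of 3 and a stop of 7 it yields two full index ranges and one short one, which is why a request for 7 invoices in batches of 3 produces batches of 3, 3 and 1.

```python
def select_batch(start=0, batch_size=32, stop=32):
    # copied from the ortec.py listing above
    while start < stop:
        yield [start, min(stop, start + batch_size)]
        start += batch_size

index_pairs = list(select_batch(start=0, batch_size=3, stop=7))
print(index_pairs)  # [[0, 3], [3, 6], [6, 7]]

# the size of each batch follows from the index pairs
batch_sizes = [last - first for first, last in index_pairs]
print(batch_sizes)  # [3, 3, 1]
```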


In [125]:
#for instance, in a for-loop:
workbatch = ortec.select_invoicebatch(filenames=filenames_all, batch_size=3, total=7)  

piece = slice(180,270)
for invoices, targets, truths in workbatch:
    print()
    print("   retrieved batch of {} invoices".format(len(invoices)))
    # print some of the retrieved content, just for the first invoice in the batch
    print((invoices[0][piece]).replace("\n", " ")) #print a piece of the invoice, replace newlines
    print("".join([str(x) for x in targets[0][piece]]) ) #print corresponding piece of target
    print(truths[0][1])  # print the True Sender Name


   retrieved batch of 3 invoices
 ING Bank N.V. DATUM: 25 februari 2016 FACTUURNR. 2016130Roggeakker 16 VOOR: 3773 AB Barne
011111111111110000000000000000000000000000000000000000000000000000000000000000000000000000
ING Bank N.V.

   retrieved batch of 3 invoices
6-23 #IN005229  ING Bank N.V. Loberingemaat 6 7942 JD Meppel Nederland Afleveradres r de R
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
ING Bank N.V.

   retrieved batch of 1 invoices
016-06-23 #IN005229  ORTEC Finance B.V. Loberingemaat 6 7942 JD Meppel Nederland Afleverad
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
ORTEC Finance B.V.


In [126]:

workbatch = ortec.select_invoicebatch(filenames=filenames_all, batch_size=16, total=30000)  


In [127]:
# for instance using the next() function
# hit CTRL-Enter for multiple retrievals

invoices, targets, truths = next(workbatch)

print()
print("   retrieved batch of {} invoices".format(len(invoices)))
# print some of the retrieved content, just for the first invoice in the batch
print((invoices[0][piece]).replace("\n", " ")) #print a piece of the invoice, replace newlines
print("".join([str(x) for x in targets[0][piece]]) ) #print corresponding piece of target
print(truths[0][1])  # print the True Sender Name


   retrieved batch of 16 invoices
 ING Bank N.V. DATUM: 25 februari 2016 FACTUURNR. 2016130Roggeakker 16 VOOR: 3773 AB Barne
011111111111110000000000000000000000000000000000000000000000000000000000000000000000000000
ING Bank N.V.
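
When retrieving batches with next(), keep in mind that a generator eventually runs out and then raises StopIteration. Passing a default value to next() avoids the exception. The sketch below uses a small stand-in generator, `batches`, which is hypothetical and not part of ortec.py.

```python
def batches(items, batch_size):
    """Hypothetical stand-in for a batch generator: yields slices of `items`."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

gen = batches(["a.csv", "b.csv", "c.csv"], 2)

first = next(gen)         # first batch of two file names
second = next(gen)        # last, shorter batch
third = next(gen, None)   # generator is exhausted: returns the default, no exception

if third is None:
    print("no more batches")
```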
