# Sample code: reading invoices from disk
The generated invoices are stored on disk in individual CSV files, one invoice per file. Each file has the following format:
```
pandas DataFrame with three columns:
Invoice  Target  Truth
                 _____
                   |
                   V
   ________________________________________________________________________
   "uuid", "Sender Name", "Sender KvK", "Sender IBAN", "Reference", "Total"
```

* "Invoice" contains the full text of the invoice as a single string, regardless of which process generated it.
* "Target" is a stringified version of Jannes' target array. Its elements correspond to positions in the string "Invoice". Each element holds one value: a number in the range [0-5] denoting the class of that position.
* "Truth" is a string that contains six strings, separated by commas. Basically a nested CSV.

In the files themselves the column names are lowercase: `invoice`, `target`, `truth`.

The class values in "Target" correspond to the indexes in "Truth", except for 0. In "Target", class 0 means `<unknown>`. In "Truth", index 0 holds the uuid of the invoice.
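
To make the correspondence concrete, here is a minimal sketch. The field names are taken from the diagram above; `FIELD_NAMES`, `describe_class` and the sample values are hypothetical and only serve as illustration. The sketch assumes "Truth" has already been turned into a real list of six strings.

```python
# Field names as listed in the format diagram: index i in "Truth" holds FIELD_NAMES[i].
FIELD_NAMES = ["uuid", "Sender Name", "Sender KvK", "Sender IBAN", "Reference", "Total"]


def describe_class(class_id, truth):
    """Return (label, value) for a class number found in "Target".

    Class 0 is the exception: it means <unknown> in "Target",
    while truth[0] holds the uuid of the invoice.
    """
    if class_id == 0:
        return "<unknown>", None
    return FIELD_NAMES[class_id], truth[class_id]


# Made-up values, for illustration only.
example_truth = ["some-uuid", "Example B.V.", "12345678",
                 "NL00BANK0123456789", "REF1", "10.00"]
print(describe_class(3, example_truth))  # ('Sender IBAN', 'NL00BANK0123456789')
print(describe_class(0, example_truth))  # ('<unknown>', None)
```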


# Do It Yourself
Because the Google Drive API will not do this for me:

Download [invoices_train.zip](https://drive.google.com/file/d/1oBT6NP0y6V6xBwVdLvIB_0Sk1NjWpUUZ)
to your local machine, then upload it to the notebook runtime environment.
Do not try "Save link as"; just click the link and open the Google Drive page in a new browser tab.

Make sure it is still named **invoices_train.zip** when it reaches the notebook runtime.
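
Once the archive has reached the runtime it still needs to be unpacked. Below is a minimal sketch using the standard library `zipfile` module. It assumes that the archive contains the `train/` directory used further down, which I have not verified; `extract_invoices` is a hypothetical helper name.

```python
import os
import zipfile


def extract_invoices(zip_path="invoices_train.zip", dest_dir="."):
    """Extract the archive into dest_dir and return the number of entries in it."""
    if not os.path.exists(zip_path):
        raise FileNotFoundError(
            "{} not found; upload it to the runtime first".format(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return len(archive.namelist())
```

Calling `extract_invoices()` from the directory that holds the zip file unpacks everything next to it.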

### I tried these already
* Colab snippet for downloading a file from Google Drive: does not seem to work for (large?) zip files
* wget on the shareable link from Google Drive: only downloads a silly HTML file

### Alternatives?
* Use a separate GitHub repo for the large datasets and do a `!git clone`.
* recursive wget?

    

In [None]:
whos

In [130]:
import glob
import pandas as pd

In [2]:
!ls

iban_noisy_100.csv			   LICENSE
invoice_from_disk_sample_code.ipynb	   README.md
invoices_train.zip.notzip		   templates
Jannes_Baseline_Invoices_ToFromDisk.ipynb  train
Jannes_Baseline_Ortec.ipynb		   Wk6_Ortec_IBAN_generation.ipynb


In [56]:
train_dir = "./train/"

In [57]:
# ! ls {train_dir}/*.csv   # raises "too many arguments!"

In [103]:
filenames_all = glob.glob(train_dir + "*.csv")
print("{} files found in directory".format(len(filenames_all)))

99981 files found in directory


In [110]:
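def select_batch(start=0, batch_size=32, stop=32):
    """Yield [first, last] index pairs (last exclusive) covering start..stop in steps of batch_size."""
    while start < stop:
        yield [start, min(stop, start + batch_size)]
        start += batch_size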


In [126]:
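def select_filebatch(filenames=(), start=0, batch_size=32, total=32):
    """Yield slices of filenames, each at most batch_size long, covering at most total files."""
    # Never run past the end of the list, and honour the start argument.
    stop = min(len(filenames), start + total)
    for first_idx, last_idx in select_batch(start=start, batch_size=batch_size, stop=stop):
        yield filenames[first_idx:last_idx]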

        

In [154]:
def select_invoicebatch(filenames=(), batch_size=32, total=32):
    """Yield (invoices, targets, truths): three parallel lists of strings, one entry per file."""
    for file_batch in select_filebatch(filenames=filenames, batch_size=batch_size, total=total):
        invoices = []
        targets = []
        truths = []

        for path in file_batch:
            mysample = pd.read_csv(path)
            # Each file contains only one row, so we take row 0 of each column.
            invoices.append(mysample['invoice'].iloc[0])
            targets.append(mysample['target'].iloc[0])
            truths.append(mysample['truth'].iloc[0])

        yield invoices, targets, truths


In [155]:
for invoices, targets, truths in select_invoicebatch(filenames=filenames_all, total=5):
    print(invoices[0])
    print(targets[0])
    print(truths[0])

{'/Title': IndirectObject(49, 0), '/Author': IndirectObject(51, 0), '/Subject': IndirectObject(52, 0), '/Producer': IndirectObject(50, 0), '/Creator': IndirectObject(53, 0), '/CreationDate': IndirectObject(54, 0), '/ModDate': IndirectObject(54, 0), '/Keywords': IndirectObject(55, 0), '/AAPL:Keywords': IndirectObject(56, 0)}

            Amazon NL International Holdings B.V.
V.O.F
                                        Specialist in tegelvlakke cementdekvloeren
                 Fam de Rochebrune
              Vaarsdrift 3
         Juinen
              Papendrecht
02-06-2017     factuur 17068
    -            Geachte heer,
                      Aan u geleverd cementdekvloer
45m27cmvezel bewapend
Bedrag excl BTW
900,00
!
 B.T.W 21%
189,00
!
 Totaal door u te voldoen
1.089,00
!
                                                                    Betalingen via IBAN.NL41RABO0150437878
binnen 14 dagen
 na datum factuur
     Dennehof 23
 3355RJ Papendrecht
 Mob.tel.0
6-53190711 IBAN.43ABNA.04

In [156]:
print(truths[0])

['5b5806e6-a10e-4d63-9060-f46873e7ac83', 'Amazon NL International Holdings B.V.', '69988978', 'NL41RABO0150437878', 'XQ8GPSK', '13561.433725600531']


In [157]:
print(truths[0][0])

[
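
The single `[` shows that `truths[0]` is still a string, so indexing it returns one character instead of the uuid. Below is a minimal sketch of one way to get the list back. It assumes every "truth" cell looks like the Python list literal printed above; `parse_truth` is a hypothetical helper, not part of the code so far.

```python
import ast


def parse_truth(truth_str):
    """Turn the stringified list from the "truth" column back into a list of strings."""
    # literal_eval only evaluates Python literals; it does not run arbitrary code.
    fields = ast.literal_eval(truth_str)
    if not isinstance(fields, list) or len(fields) != 6:
        raise ValueError("expected a list of six fields, got: {!r}".format(fields))
    return fields


# The string printed by print(truths[0]) above.
sample = ("['5b5806e6-a10e-4d63-9060-f46873e7ac83', "
          "'Amazon NL International Holdings B.V.', '69988978', "
          "'NL41RABO0150437878', 'XQ8GPSK', '13561.433725600531']")
fields = parse_truth(sample)
print(fields[0])         # 5b5806e6-a10e-4d63-9060-f46873e7ac83
print(float(fields[5]))  # the total, converted from string to float
```

The "target" column can probably be handled in a similar way, depending on how the array was stringified when the files were written.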
