In [1]:
import pandas as pd

In [2]:
df = pd.read_excel("https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx")

In [4]:
df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


In [7]:
# I have mounted my S3 bucket in a subdirectory `s3`:
#   s3fs strm-batch-demo s3
# Strm batch processing currently only supports CSV files, so export the data as CSV.
df.to_csv("s3/uci_online_retail.csv", index=False)

```
strm create schema strmprivacy/online-retail/1.0.0 --definition=online-retail.json  --public
strm activate schema strmprivacy/online-retail/1.0.0
strm create event-contract strmprivacy/online-retail/1.0.0 --public \
    -F online-retail-contract.json -S strmprivacy/online-retail/1.0.0
strm activate event-contract strmprivacy/online-retail/1.0.0
strm create batch-job -F batch-job-config.json
strm get batch-job ...
```

# Create schema and Event Contract
These steps were done by us, so you don't have to do them.
```
strm create schema strmprivacy/online-retail/1.0.0 --definition=online-retail.json --public
strm activate schema strmprivacy/online-retail/1.0.0
strm create event-contract strmprivacy/online-retail/1.0.0 --public \
    -F online-retail-contract.json -S strmprivacy/online-retail/1.0.0
strm activate event-contract strmprivacy/online-retail/1.0.0
```
With schema

```online-retail.json
{
  "name": "UCI Online Retail",
  "nodes": [
    {
      "type": "STRING",
      "name": "InvoiceNo"
    },
    {
      "type": "STRING",
      "name": "StockCode"
    },
    {
      "type": "STRING",
      "name": "Description"
    },
    {
      "type": "INTEGER",
      "name": "Quantity"
    },
    {
      "type": "STRING",
      "name": "InvoiceDate"
    },
    {
      "type": "STRING",
      "name": "UnitPrice"
    },
    {
      "type": "STRING",
      "name": "CustomerId"
    },
    {
      "type": "STRING",
      "name": "Country"
    }
  ]
}
```

and event contract

```online-retail-contract.json
{
  "keyField": "CustomerId",
  "piiFields": {
    "CustomerId": 1
  },
  "dataSubjectField": "CustomerId"
}
```
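Before submitting a job it can help to confirm that the field names in the schema and contract line up with the CSV header. The sketch below is only an illustration: it hard-codes the header from the `df.head()` output and the names from the two JSON files above, and it assumes names must match exactly (including case), which is not stated here. With the values shown in this notebook it flags the `CustomerID` / `CustomerId` spelling difference, which may or may not matter to the platform.

```python
import csv
import io
import json

# Header row as shown in the df.head() output above (index column left out).
csv_text = "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"

# Field names from online-retail.json and the key field from the contract.
schema = json.loads("""
{"nodes": [{"name": "InvoiceNo"}, {"name": "StockCode"}, {"name": "Description"},
           {"name": "Quantity"}, {"name": "InvoiceDate"}, {"name": "UnitPrice"},
           {"name": "CustomerId"}, {"name": "Country"}]}
""")
contract = {"keyField": "CustomerId", "piiFields": {"CustomerId": 1}}


def compare_columns(header, schema_names):
    """Return (names only in the schema, names only in the CSV), both sorted."""
    header_set, schema_set = set(header), set(schema_names)
    return sorted(schema_set - header_set), sorted(header_set - schema_set)


header = next(csv.reader(io.StringIO(csv_text)))
schema_names = [node["name"] for node in schema["nodes"]]
missing_in_csv, extra_in_csv = compare_columns(header, schema_names)

print("schema fields not in CSV header:", missing_in_csv)
print("CSV columns not in schema:", extra_in_csv)
print("key field present in CSV:", contract["keyField"] in header)
```

In practice you would read the header from the exported CSV file and load the two JSON files from disk instead of embedding them.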


Configuring the batch job is the complex part. You need to:

1. Figure out the bucket locations and access.
2. Figure out the timestamp format (take a sample from your file). This uses
   [Java time format](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html) patterns.
   
   
## The time format
The way we converted the Excel file to CSV *changed the timestamp format*, so make sure you look at the timestamp in the CSV file itself.
A sample timestamp in the UCI *csv* file is `2010-12-01 08:26:00`. This suggests the following format pattern: `yyyy-MM-dd HH:mm:ss`

If you get it wrong, you'll only notice after the batch job has started:

```
strm list batch-jobs
 BATCH JOB ID                           TIMESTAMP                           STATE   DETAILS                                           
                                                                                                                                      
 b2cead50-f85b-42cd-9198-53d7caa998e0   2022-09-13 08:24:03.226 +0000 UTC   ERROR   Invalid timestamp [Text '2010-12-01 08:26:00' could not be parsed at index 4] in row #1
```
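Because the job only reports a bad pattern once it is running, a quick local check on a sample value can save a round trip. This is a minimal sketch, assuming Python's `strptime` directives are an acceptable stand-in for the Java pattern. The translations are done by hand and the two syntaxes are not identical, so treat a pass as a hint and not as proof that the Java pattern is right.

```python
from datetime import datetime

# Sample value taken from the InvoiceDate column of the CSV file.
sample = "2010-12-01 08:26:00"

# Java pattern -> hand-translated Python strptime format (an assumption,
# not an official mapping).
candidates = {
    "yyyy-MM-dd HH:mm:ss": "%Y-%m-%d %H:%M:%S",
    "M/d/yyyy H:m": "%m/%d/%Y %H:%M",
}


def matches(value, strptime_format):
    """Return True if the value parses completely with the given format."""
    try:
        datetime.strptime(value, strptime_format)
        return True
    except ValueError:
        return False


results = {java: matches(sample, py) for java, py in candidates.items()}
for java_pattern, ok in results.items():
    print(f"{java_pattern!r}: {'parses' if ok else 'does not parse'}")
```

Only the first candidate parses the sample, which is consistent with the error above complaining about the character at index 4 (the first `-`).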



```batch-job-config.json
{
    "source_data": {
      "data_connector_ref": { "name": "s3"},
      "file_name": "batch-demo/uci_online_retail.csv",
      "data_type": { "csv": { "charset": "UTF-8" } }
    },
    "consent": { "default_consent_levels": [ 2 ] },
    "encryption": {
      "batch_job_group_id": "7824e975-20e1-4995-b129-2f9582728ca5",
      "timestamp_config": {
        "field": "InvoiceDate",
        "format": "yyyy-MM-dd HH:mm:ss",
        "default_time_zone": { "id": "Europe/Amsterdam" }
      }
    },
    "event_contract_ref": {
      "handle": "strmprivacy",
      "name": "online-retail",
      "version": "1.0.0"
    },
    "encrypted_data": {
      "target": {
        "data_connector_ref": { "name": "s3"},
        "data_type": { "csv": { "charset": "UTF-8" } },
        "file_name": "batch-demo/online_retail_II/encrypted.csv"
      }
    },
    "encryption_keys_data": {
      "target": {
        "data_connector_ref": { "name": "s3"},
        "data_type": { "csv": { "charset": "UTF-8" } },
        "file_name": "batch-demo/online_retail_II/keys.csv"
      }
    },
    "derived_data": [
      {
        "target": {
          "data_connector_ref": { "name": "s3"},
          "data_type": { "csv": { "charset": "UTF-8" } },
          "file_name": "batch-demo/online_retail_II/decrypted-0.csv"
        },
        "consent_levels": [ 2 ],
        "consent_level_type": "CUMULATIVE"
      }
    ]
}
```
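The config names one source file and several target files, and it is easy to mistype one of them. The sketch below lists every file the job would touch and checks that none of them coincide. It works on a trimmed copy of the config above; the helper `list_files` is a hypothetical name introduced for this example and is not part of any Strm tooling.

```python
import json

# Trimmed copy of batch-job-config.json; in practice load the real file with json.load().
config = json.loads("""
{
  "source_data": {"data_connector_ref": {"name": "s3"},
                  "file_name": "batch-demo/uci_online_retail.csv"},
  "encrypted_data": {"target": {"data_connector_ref": {"name": "s3"},
                                "file_name": "batch-demo/online_retail_II/encrypted.csv"}},
  "encryption_keys_data": {"target": {"data_connector_ref": {"name": "s3"},
                                      "file_name": "batch-demo/online_retail_II/keys.csv"}},
  "derived_data": [{"target": {"data_connector_ref": {"name": "s3"},
                               "file_name": "batch-demo/online_retail_II/decrypted-0.csv"},
                    "consent_levels": [2]}]
}
""")


def list_files(cfg):
    """Return (role, connector name, file name) for the source and every target."""
    src = cfg["source_data"]
    rows = [("source", src["data_connector_ref"]["name"], src["file_name"])]
    targets = [("encrypted", cfg["encrypted_data"]["target"]),
               ("keys", cfg["encryption_keys_data"]["target"])]
    targets += [(f"derived-{i}", d["target"]) for i, d in enumerate(cfg["derived_data"])]
    rows += [(role, t["data_connector_ref"]["name"], t["file_name"]) for role, t in targets]
    return rows


files = list_files(config)
for role, connector, name in files:
    print(f"{role:<10} {connector:<4} {name}")

# Every file should be distinct, otherwise one output could overwrite another file.
names = [name for _, _, name in files]
assert len(names) == len(set(names)), "duplicate file names in config"
```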

We have prepared a data connector named `s3` that points to an AWS S3 bucket:
```
strm get data-connector s3 -o json
{
    "dataConnector": {
        "ref": {
            "name": "s3",
            "projectId": "568ce89d-db4e-465b-bac1-cfeed6ca899a"
        },
        "s3Bucket": {
            "bucketName": "strm-demo-ecommerce"
        },
        "uuid": "c0093b82-bf0c-4b4b-889a-205643e39cbf",
        "dependentEntities": {}
    }
}
```
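If you want to read the bucket name from a script, the JSON output can be parsed directly. This is a small sketch under stated assumptions: `fetch_connector_json` simply shells out to the command shown above and needs an installed, logged-in `strm` CLI, so the example itself runs on a copy of the output instead. Both helper names are made up for this illustration.

```python
import json
import subprocess


def bucket_name(connector_json):
    """Extract the S3 bucket name from `strm get data-connector ... -o json` output."""
    return json.loads(connector_json)["dataConnector"]["s3Bucket"]["bucketName"]


def fetch_connector_json(name):
    """Run the CLI command shown above and return its stdout (not called here)."""
    result = subprocess.run(
        ["strm", "get", "data-connector", name, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Output copied from above, so the example runs without the CLI.
sample_output = """
{
    "dataConnector": {
        "ref": {
            "name": "s3",
            "projectId": "568ce89d-db4e-465b-bac1-cfeed6ca899a"
        },
        "s3Bucket": {
            "bucketName": "strm-demo-ecommerce"
        },
        "uuid": "c0093b82-bf0c-4b4b-889a-205643e39cbf",
        "dependentEntities": {}
    }
}
"""

print(bucket_name(sample_output))
```

With the CLI available, `bucket_name(fetch_connector_json("s3"))` would do the same against the live connector.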