# Welcome to the STRM Privacy engineering + 🍕 event! 👋

In this notebook, we'll take you through our workshop. The sections marked as Optional are, you guessed it, optional. They will allow you to explore more of STRM Privacy, or apply our platform to a dataset of your own choice. Feel free to revisit them after completing the remainder of the sections!

To follow along, make sure you have:

* Created a [STRM Privacy account](https://console.strmprivacy.io)
* Installed the [STRM CLI](https://github.com/strmprivacy/cli)
* Access to a bucket/blob storage in GCP/AWS/Azure (optional)
  * If you don't have a ready-to-use bucket, we can provide you with a Google Cloud Storage bucket (can be viewed unauthenticated, and can be written to with a service account that we provide)
* Installed [Jupyter](https://jupyter.org/install) to execute or modify some of the code examples yourself (optional)
* A belly full of pizza 😉

## Workshop outline

During the workshop, you will:

1. Create a STRM Pipeline
1. Send data to the pipeline
   1. Use the CLI to simulate random events
   1. Read the published events via the CLI web socket
   1. _(Optional) Publish the example webshop data using the (Python) driver_
   1. _(Optional) Create your own data contract and use it to publish your own data_
1. Create one or more privacy streams for the pipeline
1. Create one or more batch exporters
1. _(Optional) set up an external table with AWS Athena or Google Cloud BigQuery_
1. Explore the opportunities of privacy-safe data
1. Try out (some of) the optional sections and/or explore more of our product

💡 If at any time you would like more information about a specific concept or feature, feel free to ask us, or have a look at [our docs](https://docs.strmprivacy.io/docs/latest/overview). 

# 1. Create a STRM Pipeline 🔀

When you log in for the first time, a new project is generated for you. You can use that project or create a new one, either [through the console](https://console.strmprivacy.io/projects) or [the CLI](https://docs.strmprivacy.io/docs/latest/reference/cli-reference/strm/create/project/).

In the console, navigate to the Pipelines tab of your project to create a pipeline:

![create pipeline](images/create-pipeline.png)

Alternatively, create it [through the CLI](https://docs.strmprivacy.io/docs/latest/reference/cli-reference/strm/create/stream/).

# 2. Send data to the pipeline 🚀

When you send a data record (event) to a pipeline, all sensitive (PII) fields defined in the event's data contract are encrypted. In this section you will see the effect of this in real time. Below is the data contract used in this example:

![image.png](images/example-dc.png)

⚠️ **Note**: you can skip ahead to section 3 after having tried the CLI's web socket command. Or stick around and see how we generate some fake but real-looking sensitive data.

## 2.1 Simulate random events via the CLI

Our CLI provides a [simulate command](https://docs.strmprivacy.io/docs/latest/reference/cli-reference/strm/simulate/random-events/) that publishes events containing random data using the [`strmprivacy/example/1.5.0`](https://console.strmprivacy.io/data-contracts/a43fd098-30b4-4910-b6fb-3fe4334dfd64) schema.

💡 **Note:** if this is your first time using the CLI, make sure to authenticate with:

```bash
strm auth login
```

To start sending events, simply execute the following command:

```bash
strm simulate random-events <your-pipeline-name-here> --interval 5000
```

By default, an event is published every second, but as you can see, a different interval can be used instead (5 seconds here).

## 2.2 See the published events via the CLI

To see the events as they get published, execute the [web socket command](https://docs.strmprivacy.io/docs/latest/reference/cli-reference/strm/listen/web-socket/) in a separate shell:

```bash
strm listen web-socket <your-pipeline-name-here>
```

This will print the received events as JSON, one line per event. You could use the `jq` tool for a more readable output:

```bash
strm listen web-socket <your-pipeline-name-here> | jq
```
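
If you would rather process the events in Python than with `jq`, the one-JSON-object-per-line output can be parsed with the standard library. The following is a minimal, unofficial sketch: the helper names `parse_event_lines` and `count_by_contract` are hypothetical and not part of the STRM CLI, and the sample lines only imitate the CLI output.

```python
import json
from collections import Counter


def parse_event_lines(lines):
    """Yield one dict per line that holds a JSON object; skip everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Not JSON (for example a log message): ignore it.
            continue
        if isinstance(event, dict):
            yield event


def count_by_contract(events):
    """Count events per strmMeta.eventContractRef."""
    return Counter(
        event.get("strmMeta", {}).get("eventContractRef", "unknown") for event in events
    )


# Stand-in for lines printed by the web socket command.
sample_output = [
    '{"strmMeta": {"eventContractRef": "strmprivacy/example/1.5.0"}, "notSensitiveValue": "not-sensitive-41"}',
    "",
    "this line is not JSON",
]
print(count_by_contract(parse_event_lines(sample_output)))
```

To run it against live events, pipe the CLI output into a script and pass `sys.stdin` as `lines`.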

## 2.3 (Optional) Let's use some fake webshop data

### 2.3.1 Generating data with Faker

💡 **Note:** alternatively, you can use the `example_input.csv` file that we already generated. But feel free to generate your own data. If you change the structure of the data, you will also need to create your own data contract!

First, let's import the required packages.

In [1]:
!pip3 install -r requirements.txt

import pandas as pd
import datetime, time, arrow
import random
from random import choice



#### Faker is pretty great at making up data! 🥸

See for yourself:

In [2]:
from faker import Faker
fake = Faker('en_US')
for _ in range(10):
    print(fake.name(), fake.phone_number(), fake.street_address())

David Combs 887.574.3306 307 Larry Tunnel Apt. 843
Andrea Brown +1-698-231-6782x74652 97729 Tina Meadows
Dawn Patel 001-448-827-7445x360 526 Eric Vista Suite 460
Jeremy Espinoza +1-608-801-6969 776 Manuel Plaza
Martin Lowery 639.161.8726x53053 57339 Jack Shoal Suite 819
Tracy Sutton +1-063-418-1869x14377 7196 James Stravenue Apt. 811
Nancy Glenn 176-474-2423 7642 Melinda Haven Apt. 178
Henry Mitchell DVM 073-058-4162x5618 153 Haley Drives
Gloria Rich 123-764-9857x0102 020 Christopher Turnpike Suite 079
Richard Foster 634-381-5107 6265 Smith Avenue Apt. 147


#### Let's assume we are in e-commerce, a webshop, if you will 🏪

Let's generate an e-commerce inspired dataset that matches the `strmprivacy/PrivacyEngineeringWorkshop/1.0.0` data contract:

In [4]:
def fake_data_generation(total_records, total_customers, total_products):
    transactions = []
    fake = Faker()

    customer_ids = random.sample(range(100000, 999999), total_customers)
    emails = [fake.free_email() for _ in range(total_customers)]
    ages = [random.randrange(99) for _ in range(total_customers)]
    sizes = [random.choice(['XS', 'S', 'M', 'L', 'XL']) for _ in range(total_customers)]
    names = [fake.name() for _ in range(total_customers)]
    phone_numbers = [fake.phone_number() for _ in range(total_customers)]
    addresses = [fake.street_address() for _ in range(total_customers)]

    personal_data = list(zip(customer_ids, emails, ages, sizes, names, phone_numbers, addresses))
    products = random.sample(range(10000, 19999), total_products)

    for i in range(total_records):
        item_ids = random.choices(products, k=random.randrange(1,7))
        customer = random.choice(personal_data)
        transactions.append({
                "transactionId":  random.randrange(100000000, 999999999),
                "userId": customer[0],
                "email":  customer[1],
                "age" : customer[2],
                "tshirtSize": customer[3],
                "fullName": customer[4],
                "phoneNumber": customer[5],
                "address": customer[6],
                "transactionAmount": random.randrange(323),
                "items": item_ids,
                "totalItems" : len(item_ids),
                "date" : arrow.utcnow().shift(days=-1*random.randrange(180)).format('YYYY-MM-DD HH:mm:ss'),
                "purposeConsent" : random.randint(0, 2)
                })
        
    return transactions


df = pd.DataFrame(fake_data_generation(10000, 500, 50))

#### Show me the ~money~ data 🧐

In [5]:
df

Unnamed: 0,transactionId,userId,email,age,tshirtSize,fullName,phoneNumber,address,transactionAmount,items,totalItems,date,purposeConsent
0,849938247,477049,mistykirby@hotmail.com,31,M,William Robles,001-652-752-8469x68291,076 Mandy Corners Suite 751,12,"[11217, 18090]",2,2022-09-16 16:02:32,0
1,986081859,107296,johnsmith@hotmail.com,73,L,Bruce Patton,(311)182-4594,61283 Paul Gateway Apt. 026,153,"[18347, 17344, 19829, 13463, 18720]",5,2022-11-01 16:02:32,2
2,534542356,270039,rhowell@gmail.com,83,M,Crystal Bruce,001-096-429-9884x2572,5359 Hernandez Union Apt. 222,160,"[17453, 11685, 17985, 10845, 19366, 17451]",6,2022-09-14 16:02:32,0
3,991419130,975416,allenbetty@yahoo.com,30,S,Julie Martin,887.367.4267,021 Weber Underpass,110,[14137],1,2022-09-09 16:02:32,2
4,488932444,376203,owenspatricia@gmail.com,26,M,Meagan Willis,960.762.9771x96025,0641 Robert Views,34,"[11217, 12285, 11685, 19829]",4,2023-01-09 16:02:32,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,772769402,547509,kimberly39@yahoo.com,1,XL,Carrie Powers,001-554-692-2447x4680,909 Cox Underpass Apt. 704,157,[19829],1,2022-10-01 16:02:32,0
9996,145924309,303846,kathy99@yahoo.com,0,XL,Carl Bush,(396)957-8756x53943,518 Gabriel Meadow Apt. 515,90,[10534],1,2023-01-01 16:02:32,2
9997,408577849,751655,gibsonrobert@gmail.com,20,L,Wendy Jackson,(206)463-0028,141 Patrick Avenue Apt. 906,319,"[18720, 16880, 19227]",3,2022-08-29 16:02:32,0
9998,523334445,280664,nmurray@yahoo.com,23,S,Shannon Oconnor,001-727-065-1972,54307 Amber Motorway,272,"[12877, 13476, 17344, 15521, 11283]",5,2022-12-27 16:02:32,1


#### Persist data 🪣

For convenience, let's save the data as a csv:

In [6]:
df.to_csv('example_input.csv', index=False)

!head example_input.csv

transactionId,userId,email,age,tshirtSize,fullName,phoneNumber,address,transactionAmount,items,totalItems,date,purposeConsent
849938247,477049,mistykirby@hotmail.com,31,M,William Robles,001-652-752-8469x68291,076 Mandy Corners Suite 751,12,"[11217, 18090]",2,2022-09-16 16:02:32,0
986081859,107296,johnsmith@hotmail.com,73,L,Bruce Patton,(311)182-4594,61283 Paul Gateway Apt. 026,153,"[18347, 17344, 19829, 13463, 18720]",5,2022-11-01 16:02:32,2
534542356,270039,rhowell@gmail.com,83,M,Crystal Bruce,001-096-429-9884x2572,5359 Hernandez Union Apt. 222,160,"[17453, 11685, 17985, 10845, 19366, 17451]",6,2022-09-14 16:02:32,0
991419130,975416,allenbetty@yahoo.com,30,S,Julie Martin,887.367.4267,021 Weber Underpass,110,[14137],1,2022-09-09 16:02:32,2
488932444,376203,owenspatricia@gmail.com,26,M,Meagan Willis,960.762.9771x96025,0641 Robert Views,34,"[11217, 12285, 11685, 19829]",4,2023-01-09 16:02:32,2
832564492,531000,gutierrezlori@gmail.com,28,L,Darren Hernandez,779-449-1290x2272,85112 Re
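
Note that the `items` column ends up in the CSV as the text form of a Python list (for example `"[11217, 18090]"`), so it comes back as a string when the file is read. Below is a minimal sketch of turning it back into a list of integers with the standard library only. The helper name `parse_items` is hypothetical, and the sample rows show just a few of the columns.

```python
import csv
import io
import json


def parse_items(raw):
    """Convert the CSV text '[11217, 18090]' back into a list of ints."""
    return [int(item) for item in json.loads(raw)]


# Two rows in the same shape as example_input.csv (subset of the columns).
sample_csv = io.StringIO(
    'transactionId,items,totalItems\n'
    '849938247,"[11217, 18090]",2\n'
    '991419130,[14137],1\n'
)

for row in csv.DictReader(sample_csv):
    items = parse_items(row["items"])
    # totalItems should equal the number of parsed items.
    print(row["transactionId"], items, len(items) == int(row["totalItems"]))
```

The workshop contract declares `items` as a `STRING`, so the driver code further down can send the text as-is; parsing is only needed for your own analysis.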

### 2.3.2 Streaming the data using the Python driver ⚡️

The next step is to use STRM to its full potential: with event-driven, real-time data.

For the purpose of this workshop, we'll simulate a real-time anonymisation pipeline by replaying our generated data as if they were produced in real time. You will use the Python driver to send the events one-by-one into the STRM Privacy "event gateway", similar to how the CLI's `simulate` command does.

In sections 3 and later, we will build upon this pipeline by adding privacy streams, batch exporters, and more!

#### Installing the generated schema code

Execute the following commands to download a zip of generated code for our specific data contract and install it with Python's pip package installer.

In [None]:
!strm get schema-code strmprivacy/PrivacyEngineeringWorkshop/1.0.0 --language python

In [21]:
!unzip python-avro-PrivacyEngineeringWorkshop-1.0.0.zip
!cd python-avro-PrivacyEngineeringWorkshop-1.0.0 && make install

Archive:  python-avro-PrivacyEngineeringWorkshop-1.0.0.zip
   creating: python-avro-PrivacyEngineeringWorkshop-1.0.0/
  inflating: python-avro-PrivacyEngineeringWorkshop-1.0.0/README.md  
  inflating: python-avro-PrivacyEngineeringWorkshop-1.0.0/requirements-dev.txt  
  inflating: python-avro-PrivacyEngineeringWorkshop-1.0.0/Makefile  
  inflating: python-avro-PrivacyEngineeringWorkshop-1.0.0/setup.py  
  inflating: python-avro-PrivacyEngineeringWorkshop-1.0.0/requirements.txt  
  inflating: python-avro-PrivacyEngineeringWorkshop-1.0.0/build.py  
   creating: python-avro-PrivacyEngineeringWorkshop-1.0.0/schema/
  inflating: python-avro-PrivacyEngineeringWorkshop-1.0.0/schema/schema.avsc  
rm -fr build/
rm -fr dist/
rm -fr .eggs/
find . -name '*.egg-info' -exec rm -fr {} +
find . -name '*.egg' -exec rm -f {} +
find . -name '*.pyc' -exec rm -f {} +
find . -name '*.pyo' -exec rm -f {} +
find . -name '*~' -exec rm -f {} +
find . -name '__pycache__' -exec rm -fr {} +
python3 -m pip install 

running bdist_wheel
running build
running build_py
creating build
creating build/lib
creating build/lib/strmprivacy_A_A_v
copying strmprivacy_A_A_v/__init__.py -> build/lib/strmprivacy_A_A_v
copying strmprivacy_A_A_v/schema_classes.py -> build/lib/strmprivacy_A_A_v
creating build/lib/strmprivacy_A_A_v/A
copying strmprivacy_A_A_v/A/__init__.py -> build/lib/strmprivacy_A_A_v/A
creating build/lib/strmprivacy_A_A_v/A/A
copying strmprivacy_A_A_v/A/A/__init__.py -> build/lib/strmprivacy_A_A_v/A/A
creating build/lib/strmprivacy_A_A_v/A/A/v
copying strmprivacy_A_A_v/A/A/v/__init__.py -> build/lib/strmprivacy_A_A_v/A/A/v
creating build/lib/strmprivacy_A_A_v/A/A/v/strmmeta
copying strmprivacy_A_A_v/A/A/v/strmmeta/__init__.py -> build/lib/strmprivacy_A_A_v/A/A/v/strmmeta
running egg_info
writing strmprivacy_schemas_PrivacyEngineeringWorkshop_avro.egg-info/PKG-INFO
writing dependency_links to strmprivacy_schemas_PrivacyEngineeringWorkshop_avro.egg-info/dependency_links.txt
writing requirements to 

#### Creating and sending the events

Now, you can use the generated schema code to construct each event and use the Python driver (`SyncSender`) to send them to your pipeline.

⚠️ **Note:** you will need to insert your pipeline's credentials in the code snippet below; otherwise the `SyncSender` will fail on authentication. You can retrieve them by requesting JSON output on the CLI:

```bash
strm get stream <your-pipeline-name-here> -o json
```

Or alternatively, you can click on *Pipeline options* => *View Credentials* in the console:
![pipeline credentials](images/pipeline-credentials.png)

💡 The code below only sends the first few records. Remove the `.head()` call to send all records.

In [2]:
import logging
import sys
import time
import pandas as pd

from strmprivacy.driver.client.syncsender import SyncSender
from strmprivacy.driver.serializer import SerializationType
# Note: if you want to use your own data contract, adapt the schema code installation commands accordingly, 
# and replace "PrivacyEngineeringWorkshop" by your contract's schema name.
from strmprivacy_A_A_v.A.A.v import PrivacyEngineeringWorkshop

def create_avro_event(row):
    # Here we convert each row of generated data into the event objects matching our schema.
    # If you use your own contract, you will need to make the appropriate changes below.
    event = PrivacyEngineeringWorkshop()
    
    event.strmMeta.eventContractRef = "strmprivacy/PrivacyEngineeringWorkshop/1.0.0"
    event.strmMeta.consentLevels = [int(row['purposeConsent'])]
    
    event.transactionId = str(row['transactionId'])
    event.userId = str(row['userId'])
    event.email = str(row['email'])
    event.age = int(row['age'])
    event.tshirtSize = str(row['tshirtSize'])
    event.fullName = str(row['fullName'])
    event.phoneNumber = str(row['phoneNumber'])
    event.address = str(row['address'])
    event.transactionAmount = int(row['transactionAmount'])
    event.items = row['items']
    event.date = str(row['date'])
    event.totalItems = int(row['totalItems'])
    event.purposeConsent = int(row['purposeConsent'])

    return event

def main():
    logging.basicConfig(stream=sys.stdout)
    logger = logging.getLogger(__name__)
    
    # To send all records, remove the .head() call
    records = pd.read_csv("example_input.csv").head()

    # Use your Pipeline's credentials here.
    sender = SyncSender("<your_client_id>", "<your_client_secret>")
    sender.start()

    sender.wait_ready()

    for _, row in records.iterrows():
        event = create_avro_event(row)
        response = sender.send_event(event)

        if response is not None:
            logger.error(response)
        # print(event)
        time.sleep(0.002)
    
    time.sleep(2)
    print("Finished sending events")

if __name__ == '__main__':
    main()

DEBUG:strmprivacy.driver.client.auth:_initialize_auth_provider
DEBUG:strmprivacy.driver.client.auth:Initializing a new Auth Provider for SenderService
DEBUG:strmprivacy.driver.client.auth:authenticate


Exception in thread Thread-6:
Traceback (most recent call last):
  File "/Users/trietsch/.pyenv/versions/3.9.4/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/Users/trietsch/.pyenv/versions/3.9.4/lib/python3.9/site-packages/strmprivacy/driver/client/syncsender.py", line 35, in run
    asyncio.run(self.async_start(*self._props), debug=self.async_debug)
  File "/Users/trietsch/.pyenv/versions/3.9.4/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/Users/trietsch/.pyenv/versions/3.9.4/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/Users/trietsch/.pyenv/versions/3.9.4/lib/python3.9/site-packages/strmprivacy/driver/client/syncsender.py", line 39, in async_start
    client = StrmPrivacyClient(client_id, client_secret, self._config)
  File "/Users/trietsch/.pyenv/versions/3.9.4/lib/python3.9/site-packages/strmprivacy/driver/client/client.py", line 1

Finished sending events
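
To avoid pasting credentials into a notebook that you might share, one option is to read them from environment variables. This is a sketch, not part of the STRM driver: the variable names `STRM_CLIENT_ID` and `STRM_CLIENT_SECRET` and the helper `load_credentials` are hypothetical choices.

```python
import os


def load_credentials(env=None):
    """Return (client_id, client_secret) from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    names = ("STRM_CLIENT_ID", "STRM_CLIENT_SECRET")  # hypothetical variable names
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return env["STRM_CLIENT_ID"], env["STRM_CLIENT_SECRET"]


# Example with an explicit mapping; in the notebook you would call
# load_credentials() and pass the two values to SyncSender instead.
client_id, client_secret = load_credentials(
    {"STRM_CLIENT_ID": "example-id", "STRM_CLIENT_SECRET": "example-secret"}
)
print(client_id)
```

Failing early with a clear message is easier to diagnose than an authentication error raised from the sender's background thread.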


#### Check your data 🧐

You can again use the CLI's `listen web-socket` command to verify that the events are being sent as expected, or your batch export(s) if you've already set those up.

Congratulations on sending your first "real-time" events with STRM Privacy! If you want to try it out with your own (fake) data, see the next section!

## 2.4 (Optional) Create your own data contract 📑

To try STRM out with your own data, you will need to create a data contract that matches the shape (schema) of your data, and specify the privacy implications. STRM uses Simple Schema for creating data contracts, which is written in YAML.

For example, this is the (slightly simplified) schema for the `PrivacyEngineeringWorkshop` contract used before:

```yaml
name: PrivacyEngineeringWorkshop
nodes:
  - type: STRING
    name: transactionId
    repeated: false
    required: true
  - type: STRING
    name: userId
  - type: STRING
    name: email
  - type: INTEGER
    name: age
  - type: STRING
    name: tshirtSize
  - type: INTEGER
    name: transactionAmount
  - type: STRING
    name: items
  - type: INTEGER
    name: totalItems
  - type: STRING
    name: date
  - type: INTEGER
    name: purposeConsent
  - type: STRING
    name: fullName
  - type: STRING
    name: phoneNumber
  - type: STRING
    name: address
```

Simple Schema supports repeated and nested data structures. For a full reference of the format, see [our docs](https://docs.strmprivacy.io/docs/latest/quickstart/data-contracts/simple-schema/#simple-schema-reference).
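
If your own records already exist as Python dictionaries, a first draft of the `nodes` list can be generated rather than typed by hand. The sketch below is a hypothetical helper, not an STRM tool, and it only knows the two types shown above: `int` values become `INTEGER` and everything else becomes `STRING`. Review the result against the Simple Schema reference before using it.

```python
def simple_schema_draft(name, record):
    """Build Simple Schema YAML text for a flat record (field name -> sample value)."""
    lines = [f"name: {name}", "nodes:"]
    for field, value in record.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        lines.append(f"  - type: {'INTEGER' if is_int else 'STRING'}")
        lines.append(f"    name: {field}")
    return "\n".join(lines)


sample = {"transactionId": "849938247", "age": 31, "tshirtSize": "M"}
print(simple_schema_draft("MyContract", sample))
```

Nested or repeated structures are deliberately left out; those would need to be written by hand.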

To create your own contract, go to the *Data Contracts* tab of your project and click *Create data contract*. After defining the schema, click the *Generate fields* button to proceed with the next step:

![create contract](images/create-contract.png)

### Classify the privacy implications ⚖️

With the structure of your data defined, it is time to classify the privacy implications. This basically consists of answering two questions:

1. Which fields in my data are sensitive?
1. For which purposes can we use this data without additional privacy transformations?

First, specify the *Key field*. This must be a required field of type `STRING`, and it determines which events share their encryption keys: events with the same key field value will contain the same encrypted string for any identical values. This allows for some analysis on encrypted data: keys rotate after 24 hours, so within that window some patterns and trends are preserved.

![classify fields](images/classify-fields-2.png)
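
The effect of the key field can be illustrated with a toy model. This is not STRM's encryption: it uses a keyed hash (HMAC-SHA256) from the standard library as a stand-in, and the secrets are made up. It only demonstrates the property described above, namely that identical values under the same key produce identical tokens, while a different key (for example after rotation) produces a different token.

```python
import hashlib
import hmac


def toy_token(secret_key: bytes, value: str) -> str:
    """Deterministic stand-in for an encrypted field value (NOT real encryption)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


# Hypothetical secrets: one per key-field value and rotation window.
key_user_a_day1 = b"user-a/day-1"
key_user_a_day2 = b"user-a/day-2"

same_1 = toy_token(key_user_a_day1, "jane@example.com")
same_2 = toy_token(key_user_a_day1, "jane@example.com")
rotated = toy_token(key_user_a_day2, "jane@example.com")

print(same_1 == same_2)   # same key, same value: tokens match
print(same_1 == rotated)  # key rotated: tokens differ
```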

Under *PII fields*, specify for each sensitive field to which purpose it belongs. For example, 1 could denote analytics, and 2 marketing. We are currently working on a more descriptive way that uses a purpose mapping defined for your organization. Note that only (repeated) `STRING` fields can be marked as PII.

You can also include field validations. Currently, we support regex patterns as validations. These are simple (but powerful) checks that ensure only events with valid data are accepted and processed further. In the example above, the `tshirtSize` must be one of the allowed options. An event that does not match the pattern will be rejected with HTTP status code 400.
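
A regex validation of this kind can be tried out locally before adding it to a contract. The pattern below is an assumption, chosen to match the sizes generated in section 2.3.1. The pattern in the screenshot may differ, and the platform's regex dialect may not be identical to Python's `re`.

```python
import re

# Hypothetical pattern: the value must be exactly one of the allowed sizes.
TSHIRT_SIZE_PATTERN = re.compile(r"^(XS|S|M|L|XL)$")


def is_valid_tshirt_size(value: str) -> bool:
    """Return True only if the whole value matches the pattern."""
    return TSHIRT_SIZE_PATTERN.fullmatch(value) is not None


for candidate in ["M", "XL", "XXL", "m", ""]:
    print(repr(candidate), is_valid_tshirt_size(candidate))
```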

In the example above, `userId` is linked to purpose 1 and `email` to purpose 2. Both fields will be encrypted in the pipeline and only (potentially) decrypted in *privacy streams* based on this pipeline. More on privacy streams in the next section.

### Send your data using the (Python) driver 🚀

Once you have created the contract and _**activated**_ it (through the three dots menu on the data contracts tab), you can send your own events by adapting the steps from section 2.3.2. Let us know if you need any help!

# 3. Create one or more privacy streams

Now that you know how to send data to a pipeline, and how to peek at the encrypted stream using the CLI's `web-socket` command, it's time to create a couple of *privacy streams*. A privacy stream essentially decrypts (a part of) the encrypted data, but *only* if the data subject (the owner of the data, e.g. a user/customer) has consented.

You can create a privacy stream for your pipeline through the *Add privacy stream* button on the pipelines tab:

![create privacy stream](images/create-privacy-stream.png)

Alternatively, you can create one [through the CLI](https://docs.strmprivacy.io/docs/latest/reference/cli-reference/strm/create/stream/).

In this example, the purpose type is *Cumulative* with a highest purpose of 1. This means that fields with a purpose up to and including 1 will be decrypted, but again, only when the data subject has consented. By choosing *Granular*, you can specify the exact purposes that should be decrypted. For example, a cumulative configuration with purpose 2 would decrypt fields with purpose levels 1 and 2, while a granular configuration with purposes 1 and 3 would decrypt only those fields and leave fields with any other purpose level encrypted.
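
The difference between the two purpose types can be expressed in a few lines. This is a simplified model for illustration, not the platform's code: the function name and the `"CUMULATIVE"` / `"GRANULAR"` strings are hypothetical, and the consent check discussed in the rest of this section is left out.

```python
def is_decrypted(field_purpose, purpose_type, purposes):
    """Decide whether a PII field with the given purpose level would be decrypted.

    purpose_type: "CUMULATIVE" (purposes holds the highest purpose)
                  or "GRANULAR" (purposes lists the exact purposes).
    """
    if purpose_type == "CUMULATIVE":
        return field_purpose <= max(purposes)
    if purpose_type == "GRANULAR":
        return field_purpose in purposes
    raise ValueError(f"Unknown purpose type: {purpose_type}")


# Compare cumulative 2 with granular 1 and 3 for purpose levels 1 to 3.
for level in (1, 2, 3):
    print(level, is_decrypted(level, "CUMULATIVE", [2]), is_decrypted(level, "GRANULAR", [1, 3]))
```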

💡 In the example, field masking is applied to the `userId` field. Field masking will hash the decrypted value. This can be useful when the original value should remain unknown, but deterministic processing of the data is allowed.
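
The idea behind field masking can be sketched with a plain hash. Which hash function the platform uses is not stated here, so SHA-256 is purely an assumption for illustration. The point is only that the same input always yields the same masked value, so joins and counts still work while the original value stays hidden.

```python
import hashlib


def mask(value: str) -> str:
    """Return a deterministic hash of the value (illustrative; SHA-256 is assumed)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


print(mask("unique-37") == mask("unique-37"))  # same input, same masked value
print(mask("unique-37") == mask("unique-38"))  # different inputs, different values
```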


We've mentioned consent a few times now, but how does that work? Let's look at this example event from the CLI's random events simulator:
```json
{
  "strmMeta": {
    "eventContractRef": "strmprivacy/example/1.3.0",
    "nonce": 485472365,
    "timestamp": 1673610238506,
    "keyLink": "75e08aeb-131d-4f84-8070-108f8806fbe2",
    "consentLevels": [
      0,
      1,
      2
    ]
  },
  "uniqueIdentifier": "AXpgWnZ6thGw6v8pED36Z/5c536WoikG3BWJ7S5I",
  "consistentValue": "AXpgWnb4O+NsrSy8/9CmwzrXHKtrwGILHUSEfnAhadA=",
  "someSensitiveValue": "AXpgWnY8oDprlwtmdAO1hAbp417Ea2OLp1HaRxkQylj5",
  "notSensitiveValue": "not-sensitive-41"
}
```

In the `strmMeta` data of the event, the `consentLevels` field contains all levels the user consented to. This has to be specified by the sender. In this example, the user has consented to purposes 0, 1 and 2.

Now, looking at the exact same message, but processed by a privacy stream with a purpose type of cumulative 1, we see:
```json
{
  "strmMeta": {
    "eventContractRef": "strmprivacy/example/1.3.0",
    "nonce": 485472365,
    "timestamp": 1673610238506,
    "keyLink": "75e08aeb-131d-4f84-8070-108f8806fbe2",
    "consentLevels": [
      0,
      1,
      2
    ]
  },
  "uniqueIdentifier": "unique-37",
  "consistentValue": "AXpgWnb4O+NsrSy8/9CmwzrXHKtrwGILHUSEfnAhadA=",
  "someSensitiveValue": "AXpgWnY8oDprlwtmdAO1hAbp417Ea2OLp1HaRxkQylj5",
  "notSensitiveValue": "not-sensitive-41"
}
```

Everything is identical, except the value of the `uniqueIdentifier` field, which has purpose level 1, and got decrypted because the user has consented.

And after processing by a privacy stream of type cumulative 2, the `consistentValue` field is also decrypted:
```json
{
  "strmMeta": {
    "eventContractRef": "strmprivacy/example/1.3.0",
    "nonce": 485472365,
    "timestamp": 1673610238506,
    "keyLink": "75e08aeb-131d-4f84-8070-108f8806fbe2",
    "consentLevels": [
      0,
      1,
      2
    ]
  },
  "uniqueIdentifier": "unique-37",
  "consistentValue": "session-623",
  "someSensitiveValue": "AXpgWnY8oDprlwtmdAO1hAbp417Ea2OLp1HaRxkQylj5",
  "notSensitiveValue": "not-sensitive-41"
}
```

💡 **Note:** events whose data subject has not consented to the level required by a privacy stream will be absent from that privacy stream's output! In other words, privacy streams will likely process fewer events than were originally delivered to the pipeline.
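
Putting the pieces together, the behaviour of a cumulative privacy stream can be modelled roughly as follows. This is a simplified sketch rather than the platform's implementation: the `FIELD_PURPOSES` mapping mirrors the purpose levels implied by the example events above (level 3 for `someSensitiveValue` is a guess), `decrypt` is a placeholder, and the rule that an event passes when its highest consent level is at least the stream's purpose is an assumption based on the note above.

```python
# Purpose level per PII field, as implied by the example events (3 is an assumption).
FIELD_PURPOSES = {"uniqueIdentifier": 1, "consistentValue": 2, "someSensitiveValue": 3}


def decrypt(ciphertext):
    """Placeholder: a real privacy stream would decrypt with the event's key."""
    return f"decrypted({ciphertext})"


def cumulative_privacy_stream(events, highest_purpose):
    """Yield consented events with fields up to highest_purpose decrypted."""
    for event in events:
        consent = event["strmMeta"]["consentLevels"]
        # Assumed rule: drop events without consent for the required level.
        if not consent or max(consent) < highest_purpose:
            continue
        out = dict(event)
        for field, purpose in FIELD_PURPOSES.items():
            if field in out and purpose <= highest_purpose:
                out[field] = decrypt(out[field])
        yield out


events = [
    {"strmMeta": {"consentLevels": [0, 1, 2]}, "uniqueIdentifier": "enc-a", "consistentValue": "enc-b"},
    {"strmMeta": {"consentLevels": [0]}, "uniqueIdentifier": "enc-c", "consistentValue": "enc-d"},
]
for event in cumulative_privacy_stream(events, highest_purpose=1):
    print(event["uniqueIdentifier"], event["consistentValue"])
```

Only the first event is printed, with `uniqueIdentifier` decrypted and `consistentValue` still encrypted; the second event is dropped because its consent levels do not reach purpose 1.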

### Create your privacy streams

Now that you know how privacy streams are created, go ahead and create one or a few with configurations of your own liking!

💡 Use the CLI's random events simulator and one or more web socket commands to see this behaviour for yourself.

# 4. Create one or more batch exporters 🚰

Although the CLI's web socket command is useful to debug or experiment, it is not meant for production workloads. Instead, privacy streams can be exported to common file/blob stores through the use of *batch exporters*. If you have access to an AWS S3 bucket, GCS bucket, or Azure Blob Storage Container, you can use it in this section. If you don't have one readily available, we will provide you with a Google Cloud Storage bucket.

⚠️ **Note:** to prevent data from different participants (or privacy streams) from mixing when using our provided bucket, make sure to set a unique *Path prefix*, as explained further below.

A batch exporter writes to your storage through a *data connector*. To create one, go to the *Data connectors* tab of your project, or follow the respective [CLI quickstart](https://docs.strmprivacy.io/docs/latest/quickstart/data-connectors/).

💡 A data connector can be used for both reading and writing data, and can be reused for different purposes.

With your data connector ready for use, go to the *Exporters* tab to create a batch exporter. Specify the pipeline or privacy stream you want to export and the data connector to use. You can also give it a custom name and modify the default export interval. Make sure to set a unique path prefix! If no prefix is set, data will be exported to the "root" of your data connector. If you select *include existing events*, recent events will be immediately exported. An example:
![create batch exporter](images/create-batch-exporter.png)

Shortly after creating your first batch exporter(s), you should see `.jsonl` files getting added to your data connector's target storage. Each line in these files consists of a STRM event in JSON format, like you saw with the web socket before.
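
Exported files can be read back with a few lines of Python. The sketch below assumes a local copy of one exported `.jsonl` file. The file name is hypothetical, and the file is created on the spot so that the example runs by itself. It sums `transactionAmount` per `tshirtSize`, two non-sensitive fields of the workshop contract.

```python
import json
import os
import tempfile
from collections import defaultdict


def amount_per_size(path):
    """Sum transactionAmount per tshirtSize over a JSON-lines file."""
    totals = defaultdict(int)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                event = json.loads(line)
                totals[event["tshirtSize"]] += event["transactionAmount"]
    return dict(totals)


# Stand-in for a downloaded export file.
path = os.path.join(tempfile.mkdtemp(), "export-sample.jsonl")
with open(path, "w", encoding="utf-8") as handle:
    handle.write('{"tshirtSize": "L", "transactionAmount": 62}\n')
    handle.write('{"tshirtSize": "M", "transactionAmount": 12}\n')
    handle.write('{"tshirtSize": "L", "transactionAmount": 8}\n')

print(amount_per_size(path))
```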

With the JSON lines format being well-supported by many (cloud) data tools, you can build upon batch exporters in various interesting ways. See the next section for an example.

# 5. (Optional) Set up an external table with AWS Athena or Google Cloud BigQuery ☁️

## 5.1 Creating an external table with AWS Athena

If you used a batch exporter to export data to an S3 bucket, you can use AWS Athena to create an external table on the newline-delimited JSON files. This allows you to query the data using SQL. We were a bit rusty on what the `CREATE TABLE` statement should look like, so we thought "why not just ask ChatGPT?". Well, here's what we asked for the `PrivacyEngineeringWorkshop` data contract:

```
write an AWS Athena create table statement for querying this JSON data:
{
  "strmMeta": {
    "eventContractRef": "strmprivacy/PrivacyEngineeringWorkshop/1.0.0",
    "nonce": 942989520,
    "timestamp": 1673808342405,
    "keyLink": "c4b90723-08ff-4e7b-a21b-0cd6b6a9d400",
    "billingId": null,
    "consentLevels": [
      2
    ]
  },
  "transactionId": "820683509",
  "userId": "AWcF7Cl5mCYJ+QCYRNjTclRhC1KCMJd1PQ1b",
  "email": "AWcF7Cl5W2ONhCpfBGK4eAaKs+2uUmEaUynVl/iTZyzTSvar5ZdG0b5K3Sc=",
  "age": 26,
  "tshirtSize": "L",
  "transactionAmount": 62,
  "items": "[15081, 14707, 13197, 16778, 15546]",
  "totalItems": 5,
  "date": "2022-08-13 18:43:59",
  "purposeConsent": 2,
  "fullName": "AWcF7CldDr3B3R6y341aMh4h8Q/ke8o7a8JPRTbomg==",
  "phoneNumber": "AWcF7Ck85wX/V5wrJbCDHxoZOmHdgosHzbUZNpWuiU5LAQggXG+VRg==",
  "address": "AWcF7Cm95K6wXGAosawf0sV2w6ZBOfZHjxHqjuYkCc4D1rISyaHb32f3tRUQ"
}
```

To which it replied:

```
CREATE EXTERNAL TABLE IF NOT EXISTS events (
`strmMeta` struct<eventContractRef:string,nonce:bigint,timestamp:bigint,keyLink:string,consentLevels:array<int>>,
`transactionId` string,
`userId` string,
`email` string,
`age` int,
`tshirtSize` string,
`transactionAmount` int,
`items` array<int>,
`totalItems` int,
`date` timestamp,
`purposeConsent` int,
`fullName` string,
`phoneNumber` string,
`address` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<your-bucket-name>/path/to/data/';
```

Pretty convenient, don't you think? 🔥 One caveat: in the sample event, `items` is a JSON string rather than an array, so you may need to declare that column as `string`. Trying the same for Google Cloud BigQuery resulted in an invalid query, at least with the current version, so don't get your hopes up just yet. Or maybe this is one more example of why AI won't ever fully replace us, right?

## 5.2 Creating an external table with Google Cloud BigQuery

For BigQuery, the easiest way to create an external table is through the Cloud Console. See the example below:

![BigQuery create external table](images/bq-create-table.png)

# 6. Explore the opportunities of privacy-safe data 🔍💡

Now that you have had some practice applying STRM privacy transformations to data, what opportunities do you see for your own use cases, or in general? Can you see how data contracts could help your organization bridge the gap between legal and tech, and balance risk/compliance with utility?

If you haven't already done so, you may want to return to the optional sections and give the STRM (Python) driver a try to send "real-time" events, or even create a data contract for one of your own data sets.

# 7. The End 🏁

<div class="row">
  <div style="float: left; width: 50%; padding: 5px">

        
Thank you for joining this workshop, and we hope you enjoyed it. If you'd like to explore more of our platform, try out one of these tutorials:
* **On the fly decryption**  
  Why persist decrypted/raw data when you can decrypt on the fly ([Google BigQuery](https://strmprivacy.io/posts/on-the-fly-decryption-bigquery/), [AWS Redshift](https://strmprivacy.io/posts/on-the-fly-decryption-redshift/), [Apache Spark](https://strmprivacy.io/posts/on-the-fly-decryption-spark/), [Databricks](https://strmprivacy.io/posts/on-the-fly-decryption-spark/), [Snowflake](https://strmprivacy.io/posts/on-the-fly-decryption-snowflake/))?
* **[Data Subject Management](https://strmprivacy.io/posts/batchjobs-and-datasubjects/)**  
  Know which encryption keys to delete for RTBF (Right To Be Forgotten) requests, or find out which data records belong to a specific data subject for DSARs (Data Subject Access Requests), without needing to query large amounts of data.
* **[Batch Job Data Processing](https://strmprivacy.io/posts/batchjobs-and-datasubjects/)**  
  Not all data is streaming, we know, and therefore STRM Privacy also supports processing batches of data.

Currently, we're working on:
* Data contract wizard, including a review/approve flow
* Purpose mapping

Thanks again, and feel free to reach out to any of us for more details or if you have questions, for example about a use case within your company. Want to [stay in touch](https://share-eu1.hsforms.com/1pW7xWg-pRia9cCK-N_2dzAfw8w8)? Scan the QR code (or click the image)!

Feedback is welcome 🪞👀! 

      
  </div>
  <div style="float: left; width: 50%; padding: 5px">
    <a href="https://share-eu1.hsforms.com/1pW7xWg-pRia9cCK-N_2dzAfw8w8" target="_blank">
      <img src="images/stay-in-touch.png" style="width:100%">
    </a>
  </div>
</div>