# Parquet Datasets

Wrangler offers three write modes for storing Parquet datasets on Amazon S3.

- **append** (Default)

    Adds new files without deleting any existing ones.
    
- **overwrite**

    Deletes everything in the target directory and then adds new files.
    
- **overwrite_partitions** (Partition Upsert)

    Deletes only the paths of the partitions being updated and then writes the new partition files. It works like a "partition upsert".
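
To make the difference between the three modes concrete, here is a minimal sketch that models a dataset as a plain dictionary of object keys. It is a toy model under stated assumptions, not Wrangler's implementation: `write_files` and the file names are hypothetical, and a "partition" is simply the directory part of a key.

```python
from posixpath import dirname


def write_files(store, new_files, mode="append"):
    """Return a new key -> payload mapping after applying one write (toy model)."""
    if mode not in {"append", "overwrite", "overwrite_partitions"}:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "overwrite":
        # Everything under the target directory is dropped first.
        kept = {}
    elif mode == "overwrite_partitions":
        # Only partitions that receive new files are dropped.
        touched = {dirname(key) for key in new_files}
        kept = {k: v for k, v in store.items() if dirname(k) not in touched}
    else:
        # append: nothing is deleted.
        kept = dict(store)
    return {**kept, **new_files}


# Hypothetical file names; real writes use generated names.
store = {"date=2020-01-01/a.parquet": [1], "date=2020-01-02/b.parquet": [2]}
new = {"date=2020-01-02/c.parquet": [2], "date=2020-01-03/d.parquet": [3]}

appended = write_files(store, new, "append")
overwritten = write_files(store, new, "overwrite")
upserted = write_files(store, new, "overwrite_partitions")
```

In this model `appended` keeps all four files, `overwritten` keeps only the two new ones, and `upserted` keeps the untouched `date=2020-01-01` file plus the two new ones.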

Further resources:
- [Official Documentation](https://aws-data-wrangler.readthedocs.io/en/latest/)
- [Official Repository](https://github.com/awslabs/aws-data-wrangler)
- [Official Tutorials](https://github.com/awslabs/aws-data-wrangler/tree/master/tutorials)

In [None]:
from datetime import date
import awswrangler as wr
import pandas as pd
import json
import boto3

## Getting bucket name

In [None]:
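# The SSM parameter stores a JSON document; the bucket name sits under "s3-bucket-name".
ssm = boto3.client("ssm")
parameter = ssm.get_parameter(Name="/jam/notebook/bucket", WithDecryption=True)
s3_bucket_param = json.loads(parameter["Parameter"]["Value"])
bucket = s3_bucket_param["s3-bucket-name"]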

## Defining the S3 path for the Dataset

In [None]:
path = f"s3://{bucket}/awswrangler/parquet_dataset/"
path

## Creating the Dataset

In [None]:
df = pd.DataFrame({
    "id": [1, 2],
    "value": ["foo", "boo"],
    "date": [date(2020, 1, 1), date(2020, 1, 2)]
})

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite"
)

df = wr.s3.read_parquet(path, dataset=True)
df

In [None]:
assert df.shape == (2, 3)
assert df.id.sum() == 3

#### Check the S3 path below: it should contain 1 file

In [None]:
path
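
Instead of checking the path by hand in the S3 console, the file count can also be checked from the notebook. The sketch below is one possible way to do it: `count_parquet_files` and `list_dataset_files` are hypothetical helpers, the example key is made up, and the check assumes that the data files end in `.parquet`. The only Wrangler call used is `wr.s3.list_objects`, which needs AWS credentials and is therefore kept in its own function.

```python
def count_parquet_files(keys):
    """Count object keys that look like Parquet files (assumes a .parquet suffix)."""
    return sum(1 for key in keys if key.endswith(".parquet"))


def list_dataset_files(path):
    """List the objects under an S3 prefix; requires AWS credentials."""
    import awswrangler as wr  # imported lazily so the counter above works offline

    return wr.s3.list_objects(path)


# Made-up key standing in for the result of list_dataset_files(path).
example_keys = ["s3://example-bucket/awswrangler/parquet_dataset/0a1b.snappy.parquet"]
file_count = count_parquet_files(example_keys)
print(file_count)
```

In the notebook itself, `count_parquet_files(list_dataset_files(path))` can be re-run after each write to follow how the number of files changes.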

## Appending

In [None]:
df = pd.DataFrame({
    "id": [3],
    "value": ["bar"],
    "date": [date(2020, 1, 3)]
})

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="append"
)

df = wr.s3.read_parquet(path, dataset=True)
df

In [None]:
assert df.shape == (3, 3)
assert df.id.sum() == 6

#### Check the S3 path below: it should now contain 2 files (1 new file was added)

In [None]:
path

## Overwriting

In [None]:
wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite"
)

df = wr.s3.read_parquet(path, dataset=True)
df

In [None]:
assert df.shape == (3, 3)
assert df.id.sum() == 6

#### Check the S3 path below: it should contain 1 file (the previous files were overwritten)

In [None]:
path

## Creating a **Partitioned** Dataset

In [None]:
df = pd.DataFrame({
    "id": [1, 2],
    "value": ["foo", "boo"],
    "date": [date(2020, 1, 1), date(2020, 1, 2)]
})

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    partition_cols=["date"]
)

df = wr.s3.read_parquet(path, dataset=True)
df

In [None]:
assert df.shape == (2, 3)
assert df.id.sum() == 3

#### Check the S3 path below: it should contain 2 new folders, each with files inside (the previous file was overwritten)

In [None]:
path
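
The folders created by `partition_cols=["date"]` follow the Hive-style `name=value` layout, for example `date=2020-01-01/`. As a rough illustration, the sketch below recovers partition values from object keys. The helper `partition_values`, the bucket name, and the file names are all hypothetical and only serve to show the layout.

```python
def partition_values(key, root):
    """Extract Hive-style `name=value` directory segments from an object key."""
    relative = key[len(root):] if key.startswith(root) else key
    segments = relative.strip("/").split("/")[:-1]  # drop the file name
    return dict(seg.split("=", 1) for seg in segments if "=" in seg)


# Made-up bucket and file names; only the date=... folders mirror the notebook.
root = "s3://example-bucket/awswrangler/parquet_dataset/"
keys = [
    root + "date=2020-01-01/part-a.snappy.parquet",
    root + "date=2020-01-02/part-b.snappy.parquet",
]
partitions = sorted(partition_values(k, root)["date"] for k in keys)
print(partitions)
```

Each distinct value of the partition column gets its own folder, which is why two folders appear for the two dates written above.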

## Upserting partitions (overwrite_partitions)

In [None]:
df = pd.DataFrame({
    "id": [2, 3],
    "value": ["xoo", "bar"],
    "date": [date(2020, 1, 2), date(2020, 1, 3)]
})

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite_partitions",
    partition_cols=["date"]
)

df = wr.s3.read_parquet(path, dataset=True)
df

In [None]:
assert df.shape == (3, 3)
assert df.id.sum() == 6
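
The assertions above can be reasoned through with a row-level sketch in plain Python. This is only a model of the upsert semantics, not how Wrangler works internally, and `upsert_partitions` is a hypothetical helper: rows in any partition that appears in the new data are dropped, then the new rows are added.

```python
from datetime import date


def upsert_partitions(existing, incoming, key="date"):
    """Drop existing rows in any partition present in `incoming`, then add `incoming`."""
    replaced = {row[key] for row in incoming}
    kept = [row for row in existing if row[key] not in replaced]
    return kept + incoming


# Rows as they were written when the partitioned dataset was created.
existing = [
    {"id": 1, "value": "foo", "date": date(2020, 1, 1)},
    {"id": 2, "value": "boo", "date": date(2020, 1, 2)},
]
# Rows written with mode="overwrite_partitions".
incoming = [
    {"id": 2, "value": "xoo", "date": date(2020, 1, 2)},
    {"id": 3, "value": "bar", "date": date(2020, 1, 3)},
]

result = upsert_partitions(existing, incoming)
values = {row["id"]: row["value"] for row in result}
print(values)
```

The `2020-01-02` partition is replaced, so `id` 2 now carries `"xoo"` instead of `"boo"`, the `2020-01-01` partition is untouched, and `2020-01-03` is new. That leaves three rows whose ids sum to 6.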

#### Check the S3 path below: 1 new folder was added (3 folders in total), each with files inside

In [None]:
path