# Copy TSV Data To S3

#### We have chosen the [Amazon Customer Reviews Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) as our main dataset.

The dataset is shared in a public Amazon S3 bucket, and is available in two file formats: 

* Tab separated value (TSV), a text format - `s3://amazon-reviews-pds/tsv/`
* Parquet, an optimized columnar binary format - `s3://amazon-reviews-pds/parquet/`

The Parquet dataset is partitioned (divided into subfolders) by the column `product_category` to further improve query performance. Because of this layout, a `WHERE` clause on `product_category` in your SQL queries reads only the data for that category.
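
As a rough illustration of the partition layout described above, the sketch below builds the S3 prefix for one category and a matching SQL string. The helper names and the table name `amazon_reviews_parquet` are assumptions made for this example; they are not defined by the dataset or by this notebook.

```python
# Root of the public Parquet dataset (taken from the list above).
PARQUET_ROOT = "s3://amazon-reviews-pds/parquet"


def partition_prefix(category, root=PARQUET_ROOT):
    """Return the S3 prefix (subfolder) that holds one product category."""
    return "{}/product_category={}/".format(root.rstrip("/"), category)


def category_query(category, table="amazon_reviews_parquet"):
    """Build a SQL string that filters on the partition column.

    The table name is hypothetical; use the name under which the
    table is registered in your query engine.
    """
    return "SELECT * FROM {} WHERE product_category = '{}'".format(table, category)


print(partition_prefix("Books"))
print(category_query("Books"))
```

A query engine that understands this layout only has to read the objects under the matching prefix, which is why filtering on the partition column is cheap.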

We can list the contents of the S3 bucket with the AWS Command Line Interface (CLI):


In [1]:
!aws s3 ls s3://amazon-reviews-pds/tsv/

2017-11-24 13:22:50          0 
2017-11-24 13:48:03  241896005 amazon_reviews_multilingual_DE_v1_00.tsv.gz
2017-11-24 13:48:17   70583516 amazon_reviews_multilingual_FR_v1_00.tsv.gz
2017-11-24 13:48:34   94688992 amazon_reviews_multilingual_JP_v1_00.tsv.gz
2017-11-24 13:49:14  349370868 amazon_reviews_multilingual_UK_v1_00.tsv.gz
2017-11-24 13:48:47 1466965039 amazon_reviews_multilingual_US_v1_00.tsv.gz
2017-11-24 13:49:53  648641286 amazon_reviews_us_Apparel_v1_00.tsv.gz
2017-11-24 13:56:36  582145299 amazon_reviews_us_Automotive_v1_00.tsv.gz
2017-11-24 14:04:02  357392893 amazon_reviews_us_Baby_v1_00.tsv.gz
2017-11-24 14:08:11  914070021 amazon_reviews_us_Beauty_v1_00.tsv.gz
2017-11-24 14:17:41 2740337188 amazon_reviews_us_Books_v1_00.tsv.gz
2017-11-24 14:45:50 2692708591 amazon_reviews_us_Books_v1_01.tsv.gz
2017-11-24 15:10:21 1329539135 amazon_reviews_us_Books_v1_02.tsv.gz
2017-11-24 15:22:13  442653086 amazon_reviews_us_Camera_v1_00.tsv.gz
2017-11-24 15:27:13 2689739299 amazon_rev
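
To get a feel for how large the files are, the listing output above can be parsed in plain Python. This is a minimal sketch, assuming each object line has the form `date time size name`; the function names are made up for this example, and the sample text repeats only the first lines of the listing.

```python
def parse_ls_line(line):
    """Parse one object line of `aws s3 ls` output into (size_bytes, name).

    Returns None for lines that are not object entries, such as
    `PRE ...` prefix lines or the nameless zero-byte folder marker.
    """
    parts = line.strip().split(None, 3)
    if len(parts) < 4 or not parts[2].isdigit():
        return None
    return int(parts[2]), parts[3]


def total_bytes(listing):
    """Sum the sizes of all object lines in a multi-line listing."""
    entries = (parse_ls_line(line) for line in listing.splitlines())
    return sum(entry[0] for entry in entries if entry is not None)


sample = """\
2017-11-24 13:22:50          0
2017-11-24 13:48:03  241896005 amazon_reviews_multilingual_DE_v1_00.tsv.gz
2017-11-24 13:48:17   70583516 amazon_reviews_multilingual_FR_v1_00.tsv.gz
"""
print(total_bytes(sample) / 1e6, "MB")
```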

In [2]:
!aws s3 ls s3://amazon-reviews-pds/parquet/

                           PRE product_category=Apparel/
                           PRE product_category=Automotive/
                           PRE product_category=Baby/
                           PRE product_category=Beauty/
                           PRE product_category=Books/
                           PRE product_category=Camera/
                           PRE product_category=Digital_Ebook_Purchase/
                           PRE product_category=Digital_Music_Purchase/
                           PRE product_category=Digital_Software/
                           PRE product_category=Digital_Video_Download/
                           PRE product_category=Digital_Video_Games/
                           PRE product_category=Electronics/
                           PRE product_category=Furniture/
                           PRE product_category=Gift_Card/
                           PRE product_category=Grocery/
                           PRE product_category=Health_&_Personal_Care/
   

# To Simulate an Application Writing Into Our Data Lake, We Copy the Public TSV Dataset to a Private S3 Bucket in our Account

In [3]:
import boto3
import sagemaker
import pandas as pd

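# SageMaker session, default S3 bucket, and execution role for this notebook
sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()

# AWS region and account ID of the current credentials
region = boto3.Session().region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")

# Low-level SageMaker client
sm = boto3.Session().client(service_name="sagemaker", region_name=region)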

# Set S3 Source Location (Public S3 Bucket)

In [4]:
s3_public_path_tsv = "s3://amazon-reviews-pds/tsv"

In [5]:
%store s3_public_path_tsv

Stored 's3_public_path_tsv' (str)


# Set S3 Destination Location (Our Private S3 Bucket)

In [6]:
s3_private_path_tsv = "s3://{}/amazon-reviews-pds/tsv".format(bucket)
print(s3_private_path_tsv)

s3://sagemaker-us-east-1-117859797117/amazon-reviews-pds/tsv


In [7]:
%store s3_private_path_tsv

Stored 's3_private_path_tsv' (str)


# Copy Data From the Public S3 Bucket to our Private S3 Bucket in this Account
Because the full dataset is large, we copy only three files into our bucket to keep the later steps fast.

In [8]:
!aws s3 cp --recursive $s3_public_path_tsv/ $s3_private_path_tsv/ --exclude "*" --include "amazon_reviews_us_Digital_Software_v1_00.tsv.gz"
!aws s3 cp --recursive $s3_public_path_tsv/ $s3_private_path_tsv/ --exclude "*" --include "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz"
!aws s3 cp --recursive $s3_public_path_tsv/ $s3_private_path_tsv/ --exclude "*" --include "amazon_reviews_us_Gift_Card_v1_00.tsv.gz"

copy: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz to s3://sagemaker-us-east-1-117859797117/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz
copy: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz to s3://sagemaker-us-east-1-117859797117/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
copy: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Gift_Card_v1_00.tsv.gz to s3://sagemaker-us-east-1-117859797117/amazon-reviews-pds/tsv/amazon_reviews_us_Gift_Card_v1_00.tsv.gz
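
The three copy commands above differ only in the `--include` pattern, so they could also be expressed as a single call with several `--include` filters. The sketch below only builds and prints such a command; it does not run it. The helper name and the destination bucket `my-private-bucket` are placeholders for this example.

```python
import shlex

# The three files copied in this notebook.
FILES = [
    "amazon_reviews_us_Digital_Software_v1_00.tsv.gz",
    "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz",
    "amazon_reviews_us_Gift_Card_v1_00.tsv.gz",
]


def build_copy_command(src, dst, filenames):
    """Return the argument list for one `aws s3 cp` call covering all files."""
    args = [
        "aws", "s3", "cp", "--recursive",
        src.rstrip("/") + "/",
        dst.rstrip("/") + "/",
        "--exclude", "*",  # first exclude everything ...
    ]
    for name in filenames:
        args += ["--include", name]  # ... then re-include the chosen files
    return args


cmd = build_copy_command(
    "s3://amazon-reviews-pds/tsv",
    "s3://my-private-bucket/amazon-reviews-pds/tsv",  # placeholder bucket
    FILES,
)
print(shlex.join(cmd))
```

The argument list can be passed to `subprocess.run(cmd, check=True)` if you prefer to launch the copy from Python rather than with `!`.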


# _Make sure ^^^^ this ^^^^ S3 COPY command above runs successfully. We will need those data files for the rest of this workshop._

# List Files in our Private S3 Bucket in this Account

In [9]:
print(s3_private_path_tsv)

s3://sagemaker-us-east-1-117859797117/amazon-reviews-pds/tsv


In [10]:
!aws s3 ls $s3_private_path_tsv/

2021-04-12 14:57:24   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2021-04-12 14:57:25   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
2021-04-12 14:57:27   12134676 amazon_reviews_us_Gift_Card_v1_00.tsv.gz
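
Since the rest of the workshop depends on these files, it can be worth checking the listing programmatically. The sketch below assumes the listing is available as plain text in the format shown above; the helper name `missing_files` is made up for this example.

```python
# The three files the later notebooks expect to find.
EXPECTED_FILES = [
    "amazon_reviews_us_Digital_Software_v1_00.tsv.gz",
    "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz",
    "amazon_reviews_us_Gift_Card_v1_00.tsv.gz",
]


def missing_files(listing, expected=EXPECTED_FILES):
    """Return the expected file names that do not appear in the listing."""
    # The object name is the last whitespace-separated field of each line.
    present = {line.split()[-1] for line in listing.splitlines() if line.strip()}
    return sorted(set(expected) - present)


listing = """\
2021-04-12 14:57:24   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2021-04-12 14:57:25   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
2021-04-12 14:57:27   12134676 amazon_reviews_us_Gift_Card_v1_00.tsv.gz
"""

missing = missing_files(listing)
print("All files present" if not missing else "Missing: {}".format(missing))
```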


In [None]:
from IPython.display import display, HTML

# Link to our private bucket in the S3 console
display(
    HTML(
        '<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/amazon-reviews-pds/?region={}&tab=overview">S3 Bucket</a></b>'.format(
            bucket, region
        )
    )
)

# Store Variables for the Next Notebooks

In [None]:
%store

# Release Resources

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [None]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}
