# Copy TSV Data To S3

<img src="img/write_tsv_to_s3.png" width="45%" align="left">

#### We have chosen the [Amazon Customer Reviews Dataset](https://s3.amazonaws.com/dsoaws/amazon-reviews-pds/readme.html) as our main dataset.

The dataset is shared in a public Amazon S3 bucket, and is available in two file formats: 

* Tab-separated values (TSV), a text format - `s3://dsoaws/amazon-reviews-pds/tsv/`
* Parquet, an optimized columnar binary format - `s3://dsoaws/amazon-reviews-pds/parquet/`

The Parquet dataset is partitioned (divided into subfolders) by the `product_category` column to further improve query performance. A `WHERE` clause on `product_category` in your SQL queries then reads only the data for the matching categories.
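
To make the effect of partitioning concrete, here is a minimal sketch of how a filter on `product_category` narrows the set of S3 prefixes that need to be read. It assumes the Hive-style `product_category=<value>/` layout of the Parquet folder. The helper names `partition_prefix` and `prune_partitions` are hypothetical and only illustrate the idea; a real query engine performs this pruning internally.

```python
from typing import Iterable, List

# Assumption: partitions use the Hive-style layout
# ".../parquet/product_category=<value>/".
PARQUET_ROOT = "s3://dsoaws/amazon-reviews-pds/parquet"


def partition_prefix(category: str, root: str = PARQUET_ROOT) -> str:
    """Return the S3 prefix that holds all rows for one product category."""
    return f"{root}/product_category={category}/"


def prune_partitions(all_categories: Iterable[str], wanted: Iterable[str]) -> List[str]:
    """Keep only the prefixes that a filter on product_category would need to read."""
    wanted_set = set(wanted)
    return [partition_prefix(c) for c in all_categories if c in wanted_set]


# Illustrative subset of categories, not the full list in the bucket.
categories = ["Apparel", "Books", "Digital_Software", "Gift_Card"]
prefixes_to_scan = prune_partitions(categories, ["Digital_Software", "Gift_Card"])
print(prefixes_to_scan)
```

Only two of the four prefixes are scanned here; the other partitions are never opened, which is where the performance gain comes from.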

We can list the contents of the S3 bucket with the AWS Command Line Interface (CLI):


In [1]:
!aws s3 ls s3://dsoaws/amazon-reviews-pds/tsv/

2023-09-06 16:01:13  648641286 amazon_reviews_us_Apparel_v1_00.tsv.gz
2023-09-06 16:01:13  582145299 amazon_reviews_us_Automotive_v1_00.tsv.gz
2023-09-06 16:01:13  357392893 amazon_reviews_us_Baby_v1_00.tsv.gz
2023-09-06 16:01:13  914070021 amazon_reviews_us_Beauty_v1_00.tsv.gz
2023-09-06 16:01:13 2740337188 amazon_reviews_us_Books_v1_00.tsv.gz
2023-09-06 16:01:13 2692708591 amazon_reviews_us_Books_v1_01.tsv.gz
2023-09-06 16:01:13 1329539135 amazon_reviews_us_Books_v1_02.tsv.gz
2023-09-06 16:01:13  442653086 amazon_reviews_us_Camera_v1_00.tsv.gz
2023-09-06 16:01:13 2689739299 amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz
2023-09-06 16:01:13 1294879074 amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz
2023-09-06 16:01:13  253570168 amazon_reviews_us_Digital_Music_Purchase_v1_00.tsv.gz
2023-09-06 16:01:13   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2023-09-06 16:01:13  506979922 amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz
2023-09-06 16:01:13   2744264

In [2]:
!aws s3 ls s3://dsoaws/amazon-reviews-pds/parquet/

                           PRE product_category=Apparel/
                           PRE product_category=Automotive/
                           PRE product_category=Baby/
                           PRE product_category=Beauty/
                           PRE product_category=Books/
                           PRE product_category=Camera/
                           PRE product_category=Digital_Ebook_Purchase/
                           PRE product_category=Digital_Music_Purchase/
                           PRE product_category=Digital_Software/
                           PRE product_category=Digital_Video_Download/
                           PRE product_category=Digital_Video_Games/
                           PRE product_category=Electronics/
                           PRE product_category=Furniture/
                           PRE product_category=Gift_Card/
                           PRE product_category=Grocery/
                           PRE product_category=Health_&_Personal_Care/
   

# To Simulate an Application Writing Into Our Data Lake, We Copy the Public TSV Dataset to a Private S3 Bucket in our Account

<img src="img/copy_data_to_s3.png" width="60%" align="left">

# Check Prerequisites From Earlier Notebooks

In [3]:
%store -r setup_dependencies_passed

In [4]:
try:
    setup_dependencies_passed
except NameError:
    print("+++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup Dependencies.")
    print("+++++++++++++++++++++++++++++++")

In [5]:
print(setup_dependencies_passed)

True


In [6]:
%store -r setup_s3_bucket_passed

In [7]:
try:
    setup_s3_bucket_passed
except NameError:
    print("+++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup S3 Bucket.")
    print("+++++++++++++++++++++++++++++++")

In [8]:
print(setup_s3_bucket_passed)

True


In [9]:
%store -r setup_iam_roles_passed

In [10]:
try:
    setup_iam_roles_passed
except NameError:
    print("+++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup IAM Roles.")
    print("+++++++++++++++++++++++++++++++")

In [11]:
print(setup_iam_roles_passed)

True


In [12]:
if not setup_dependencies_passed:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup Dependencies.")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
if not setup_s3_bucket_passed:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup S3 Bucket.")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
if not setup_iam_roles_passed:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing Setup IAM Roles.")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
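
The three checks above follow the same pattern, so they could be folded into one loop. The sketch below is one possible way to do that, not part of the original workshop code. The helper `missing_prerequisites` and the `example_flags` values are hypothetical; in the notebook the values would come from the variables restored with `%store -r`.

```python
from typing import Dict, List


def missing_prerequisites(flags: Dict[str, bool]) -> List[str]:
    """Return the labels of setup steps whose flag is not truthy."""
    return [label for label, passed in flags.items() if not passed]


# Made-up values for illustration. In the notebook, each value could be read
# with e.g. globals().get("setup_dependencies_passed", False), which also
# covers the case where the variable was never stored.
example_flags = {
    "Setup Dependencies": True,
    "Setup S3 Bucket": True,
    "Setup IAM Roles": False,  # pretend this setup notebook was skipped
}

for label in missing_prerequisites(example_flags):
    print("+++++++++++++++++++++++++++++++")
    print(f"[ERROR] YOU HAVE TO RUN ALL NOTEBOOKS IN THE SETUP FOLDER FIRST. You are missing {label}.")
    print("+++++++++++++++++++++++++++++++")
```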

In [25]:
import boto3
import sagemaker
import pandas as pd

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")

sm = boto3.Session().client(service_name="sagemaker", region_name=region)

# Set S3 Source Location (Public S3 Bucket)

In [26]:
s3_public_path_tsv = "s3://dsoaws/amazon-reviews-pds/tsv"

In [27]:
%store s3_public_path_tsv

Stored 's3_public_path_tsv' (str)


# Set S3 Destination Location (Our Private S3 Bucket)

In [28]:
s3_private_path_tsv = "s3://{}/amazon-reviews-pds/tsv".format(bucket)
print(s3_private_path_tsv)

s3://sagemaker-us-east-1-992382405090/amazon-reviews-pds/tsv


In [29]:
%store s3_private_path_tsv

Stored 's3_private_path_tsv' (str)


# Copy Data From the Public S3 Bucket to our Private S3 Bucket in this Account
Because the full dataset is large, we copy only three files into our bucket to keep the later steps fast.

In [30]:
!aws s3 cp --recursive $s3_public_path_tsv/ $s3_private_path_tsv/ --exclude "*" --include "amazon_reviews_us_Digital_Software_v1_00.tsv.gz"
!aws s3 cp --recursive $s3_public_path_tsv/ $s3_private_path_tsv/ --exclude "*" --include "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz"
!aws s3 cp --recursive $s3_public_path_tsv/ $s3_private_path_tsv/ --exclude "*" --include "amazon_reviews_us_Gift_Card_v1_00.tsv.gz"

copy: s3://dsoaws/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz to s3://sagemaker-us-east-1-992382405090/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz
copy: s3://dsoaws/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz to s3://sagemaker-us-east-1-992382405090/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
copy: s3://dsoaws/amazon-reviews-pds/tsv/amazon_reviews_us_Gift_Card_v1_00.tsv.gz to s3://sagemaker-us-east-1-992382405090/amazon-reviews-pds/tsv/amazon_reviews_us_Gift_Card_v1_00.tsv.gz
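
The same copy could also be done from Python instead of the CLI. The following is a sketch under stated assumptions, not the workshop's method: it uses the boto3 S3 client's `copy` call, and the helper names `build_copy_plan` and `copy_files` are hypothetical. The planning step is pure Python; the actual copy needs AWS credentials and is therefore left commented out.

```python
from typing import List, Tuple

SOURCE_BUCKET = "dsoaws"
PREFIX = "amazon-reviews-pds/tsv"
FILES = [
    "amazon_reviews_us_Digital_Software_v1_00.tsv.gz",
    "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz",
    "amazon_reviews_us_Gift_Card_v1_00.tsv.gz",
]


def build_copy_plan(files: List[str], prefix: str = PREFIX) -> List[Tuple[str, str]]:
    """Return (source_key, destination_key) pairs; the key is the same in both buckets."""
    return [(f"{prefix}/{name}", f"{prefix}/{name}") for name in files]


def copy_files(dest_bucket: str, plan: List[Tuple[str, str]]) -> None:
    """Copy each object from the public bucket into dest_bucket."""
    import boto3  # imported lazily so the planning step runs without AWS access

    s3 = boto3.client("s3")
    for source_key, dest_key in plan:
        s3.copy(
            CopySource={"Bucket": SOURCE_BUCKET, "Key": source_key},
            Bucket=dest_bucket,
            Key=dest_key,
        )


plan = build_copy_plan(FILES)
print(plan[0])
# copy_files(bucket, plan)  # run in the notebook, where `bucket` is defined
```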


# _Make sure ^^^^ this ^^^^ S3 COPY command above runs successfully. We will need those data files for the rest of this workshop._

# List Files in our Private S3 Bucket in this Account

In [31]:
print(s3_private_path_tsv)

s3://sagemaker-us-east-1-992382405090/amazon-reviews-pds/tsv


In [32]:
!aws s3 ls $s3_private_path_tsv/

2024-07-10 06:24:13   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2024-07-10 06:24:15   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
2024-07-10 06:24:17   12134676 amazon_reviews_us_Gift_Card_v1_00.tsv.gz
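
Rather than checking the listing by eye, the output can be verified programmatically. The sketch below parses text in the `aws s3 ls` format shown above and checks that the three expected files are present. The parser `parse_s3_ls` is a hypothetical helper, and the `listing` string is pasted from the output above; in a notebook the text could be captured from the CLI call instead.

```python
from typing import List, NamedTuple


class S3Entry(NamedTuple):
    timestamp: str
    size_bytes: int
    name: str


def parse_s3_ls(output: str) -> List[S3Entry]:
    """Parse object lines of `aws s3 ls` output; skip blank and `PRE` prefix lines."""
    entries = []
    for line in output.splitlines():
        parts = line.split(maxsplit=3)
        # Object lines have four fields: date, time, size, name.
        if len(parts) != 4 or not parts[2].isdigit():
            continue
        date, time, size, name = parts
        entries.append(S3Entry(f"{date} {time}", int(size), name))
    return entries


listing = """
2024-07-10 06:24:13   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2024-07-10 06:24:15   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
2024-07-10 06:24:17   12134676 amazon_reviews_us_Gift_Card_v1_00.tsv.gz
"""

expected = {
    "amazon_reviews_us_Digital_Software_v1_00.tsv.gz",
    "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz",
    "amazon_reviews_us_Gift_Card_v1_00.tsv.gz",
}

entries = parse_s3_ls(listing)
missing = expected - {entry.name for entry in entries}
total_bytes = sum(entry.size_bytes for entry in entries)
print(f"missing files: {sorted(missing)}, total size: {total_bytes} bytes")
```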


In [33]:
# display and HTML live in IPython.display; importing them from
# IPython.core.display is deprecated.
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/sagemaker-{}-{}/amazon-reviews-pds/?region={}&tab=overview">S3 Bucket</a></b>'.format(
            region, account_id, region
        )
    )
)



# Store Variables for the Next Notebooks

In [34]:
%store

Stored variables and their in-db values:
USE_FULL_MOVIELENS                      -> False
bucket_name                             -> '992382405090personalizepocvod'
comprehend_endpoint_arn                 -> 'arn:aws:comprehend:us-east-1:992382405090:documen
comprehend_train_s3_uri                 -> 's3://sagemaker-us-east-1-992382405090/data/amazon
comprehend_training_job_arn             -> 'arn:aws:comprehend:us-east-1:992382405090:documen
data_dir                                -> 'poc_data'
dataset_dir                             -> 'poc_data/ml-latest-small/'
dataset_group_arn                       -> 'arn:aws:personalize:us-east-1:992382405090:datase
forecast_arn                            -> 'arn:aws:forecast:us-east-1:992382405090:forecast/
forecast_dataset_arn                    -> 'arn:aws:forecast:us-east-1:992382405090:dataset/u
forecast_dataset_group_arn              -> 'arn:aws:forecast:us-east-1:992382405090:dataset-g
forecast_predictor_arn                  -> 'arn:aws:

# Release Resources

In [35]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    const els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [None]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}