# File Transfer in AWS S3 with the AWS CLI and File Storage with Hive Tables Using Athena
Author: Yuan Huang

## Introduction
This notebook consists of two sections. The first section discusses how to transfer files between S3 folders using the AWS CLI. It also covers how to save log files during the transfer and how to compare the file names in the source and destination folders, to make sure that all the files in the source S3 folder were transferred to the destination folder.

The second section covers file storage. AWS S3 is commonly used for file storage and, together with other AWS services, integrates with big data technologies such as Presto and Spark. AWS Athena supports a subset of Presto functions for querying large volumes of data. In addition, Athena can create Hive tables that project a table schema onto S3 files in various data formats, including CSV, gzip-compressed text, and Parquet. This notebook shows scripts that create Hive tables using AWS Athena. AWS Glue can also be used to gather the table catalog information.

## 1. S3 file transfer
### 1a. File transfer by AWS CLI
#### 1. to transfer all files from one directory to another directory
```
aws s3 cp source_folder dest_folder --recursive --profile (aws profile of the destination s3 bucket)
```
#### 2. to transfer individual files:
```
aws s3 cp source_file dest_file --profile (aws profile of the destination file)
```
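
The two commands above can also be assembled from Python, which is convenient when the same transfer has to be repeated for several folders. The following is a minimal sketch, not a definitive implementation: `build_cp_command` and `run_cp` are hypothetical helper names, and the bucket paths and profile name are placeholders. The `--dryrun` flag of `aws s3 cp` lists the operations that would be performed without copying anything.

```python
import shlex
import subprocess


def build_cp_command(source, dest, profile, recursive=False, dryrun=False):
    """Return the argument list for an `aws s3 cp` call (hypothetical helper)."""
    cmd = ["aws", "s3", "cp", source, dest]
    if recursive:
        cmd.append("--recursive")  # copy every object under the source prefix
    if dryrun:
        cmd.append("--dryrun")     # only show what would be copied
    cmd += ["--profile", profile]
    return cmd


def run_cp(cmd):
    """Run the command and return its exit code. Requires the AWS CLI on PATH."""
    return subprocess.run(cmd, check=False).returncode


# Placeholder values: replace with your own S3 URIs and profile name.
cmd = build_cp_command(
    "s3://source-bucket/source_folder/",
    "s3://dest-bucket/dest_folder/",
    profile="dest-profile",
    recursive=True,
    dryrun=True,
)
print(shlex.join(cmd))  # print the command line instead of running it
```

Building the command as a list and passing it to `subprocess.run` avoids shell quoting problems when a path contains spaces.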

### 1b. Using a shell script with log files for the transfer of large volumes of S3 files
Log files provide important information about a file transfer and help to find files that went missing during the transfer. In addition, when large volumes of files are transferred, it is better to run the transfer in the background. It is therefore common practice to run the transfer from a shell script and write the script's output to a log file. The following procedure can be used for such a data transfer:
#### 1. create the shell script for file transfer:

The following is the shell script file:
```
#!/usr/bin/bash

# Placeholder: set this to the AWS CLI profile of the destination S3 bucket
PROFILE="dest_profile"

# To copy an entire directory to another directory
aws s3 cp source_folder dest_folder --recursive --profile "$PROFILE"

# To copy an individual file to another file
aws s3 cp source_file dest_file --profile "$PROFILE"
#### 2. save the shell file as abc.sh
#### 3. cd to the folder containing abc.sh, and run the following shell commands:
```
chmod 755 abc.sh
nohup ./abc.sh > output_log.txt 2>&1 &
```
This way, large volumes of files are transferred in the background on the server. You can log off the server without affecting the file transfer process.

#### 4. check output_log.txt for the log of the file transfer.
Common error messages include "access_denied", "no such file or directory", "copy failed:", and "fatal error: could not connect to the endpoint URL:". You can use grep to check whether any of these messages are present in the log file, for example with a case-insensitive search (`grep -i`).
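
The same check can be done in Python when the matching lines need to be collected for further processing. This is a minimal sketch under stated assumptions: `find_error_lines` and `scan_log` are hypothetical helper names, the patterns are the four messages listed above, and matching is case-insensitive because the exact capitalization in the log may differ.

```python
# Error messages listed in this section, matched case-insensitively.
ERROR_PATTERNS = [
    "access_denied",
    "no such file or directory",
    "copy failed:",
    "fatal error: could not connect to the endpoint URL:",
]


def find_error_lines(lines, patterns=ERROR_PATTERNS):
    """Return (line_number, text) pairs for lines that contain any pattern."""
    lowered = [pattern.lower() for pattern in patterns]
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if any(pattern in text.lower() for pattern in lowered):
            hits.append((number, text))
    return hits


def scan_log(path="output_log.txt"):
    """Scan a log file on disk and return the matching lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_error_lines(handle)
```

Calling `scan_log()` in the folder that contains output_log.txt returns an empty list when none of the messages appear in the log.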


### 1c. Comparing file names between source and destination S3 folders
After S3 files are transferred from the source to the destination folder, it is useful to compare the file names in the two folders and make sure they are consistent.
This procedure first writes the file names in the source and destination S3 folders to source_file.txt and dest_file.txt, respectively, and then compares the contents of these two files to find out which files are missing. Note that `aws s3 ls --recursive` prints each object's full key, so the two lists only match line by line when the source and destination use the same key prefix:
```
aws s3 ls "$source" --recursive | awk '{$1=$2=$3="";print $0}' | sed 's/^[ \t]*//' | sort > source_file.txt
aws s3 ls "$dest" --recursive | awk '{$1=$2=$3="";print $0}' | sed 's/^[ \t]*//' | sort > dest_file.txt
diff source_file.txt dest_file.txt
```
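
When the source and destination folders have different key prefixes, a plain diff reports every line as different. The sketch below is one possible way to handle that case. It assumes the raw output of `aws s3 ls --recursive` (date, time, size in bytes, key) was saved to a text file for each side; `parse_listing` and `compare_listings` are hypothetical helper names.

```python
def parse_listing(lines, prefix=""):
    """Map each object key (with `prefix` removed) to its size in bytes.

    Each line is expected to look like: <date> <time> <size> <key>.
    """
    objects = {}
    for line in lines:
        # Split into at most 4 fields so spaces inside the key are preserved.
        parts = line.rstrip("\n").split(None, 3)
        if len(parts) < 4 or not parts[2].isdigit():
            continue  # skip blank or unexpected lines
        key = parts[3]
        if prefix and key.startswith(prefix):
            key = key[len(prefix):]
        objects[key] = int(parts[2])
    return objects


def compare_listings(source, dest):
    """Return (missing_in_dest, extra_in_dest, size_mismatch) as sorted lists."""
    missing = sorted(set(source) - set(dest))
    extra = sorted(set(dest) - set(source))
    mismatch = sorted(
        key for key in set(source) & set(dest) if source[key] != dest[key]
    )
    return missing, extra, mismatch
```

Comparing sizes in addition to names also catches objects that exist on both sides but were only partially written.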

### 1d. Computing the total size of all the files in an S3 folder
```
aws s3 ls s3_folder --recursive | awk 'BEGIN {total=0} {total+=$3} END {print total/1024/1024" MB"}'
```
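
The same sum can be computed in Python from a saved listing, which is handy when the total is needed inside a larger script. This is a minimal sketch that assumes the third column of each `aws s3 ls --recursive` line is the object size in bytes, as in the awk command above; `total_size_bytes` and `format_mb` are hypothetical helper names.

```python
def total_size_bytes(lines):
    """Sum the size column (third field) over all listing lines."""
    total = 0
    for line in lines:
        parts = line.split(None, 3)
        if len(parts) >= 3 and parts[2].isdigit():
            total += int(parts[2])
    return total


def format_mb(num_bytes):
    """Format a byte count in MB, using 1 MB = 1024 * 1024 bytes."""
    return f"{num_bytes / 1024 / 1024:.2f} MB"
```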

## 2. File storage and query 
It is common for a large number of S3 files to be stored in one S3 folder. When these files share a consistent format, such as CSV, gzip-compressed text, JSON, or Parquet, and a consistent schema, all the files in the folder can be treated as one large data table. In AWS, Athena and Glue can be used to define such tables on top of the S3 files. Once the tables are defined, big data tools such as Presto and Spark can be used to query and manipulate the data in them.
### 2a. Build Hive tables in Athena using boto3:



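```
# External table over delimited text files stored in an S3 folder.
# table_name, the column list, and the S3 location are placeholders.
table_sql = """
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
`id` int,
`name` string,
`quantity` bigint,
`percentage` double,
`cost` float
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES(
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://bucket/gz_folder/'
TBLPROPERTIES ('has_encrypted_data'='false');
"""

# CTAS statement that writes the rows of an existing table as Snappy-compressed Parquet.
# gzip_partitioned_table is assumed to exist already and to have `protein` as its
# last column, because Athena expects partition columns last in the SELECT list.
# The table name must differ from any table that already exists in the database.
parquet_sql = """
CREATE TABLE IF NOT EXISTS table_name
WITH (
  external_location = 's3://bucket/parquet_folder',
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  partitioned_by = ARRAY['protein']
) AS SELECT * FROM
gzip_partitioned_table;
"""
```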

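```
import boto3
import os

# Credentials, region, and staging location are read from environment variables
aws_access_key = os.getenv("AWS_ACCESS_KEY")
aws_secret_access_key = os.getenv("AWS_SECRET_KEY")
region = os.getenv("AWS_REGION")
staging_dir = os.getenv("AWS_STAGING_DIR")  # S3 URI where Athena writes query results

session = boto3.Session(aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_access_key)
client = session.client('athena', region_name=region)

# start_query_execution is asynchronous: it returns a QueryExecutionId right away,
# before the query has finished running
table_response = client.start_query_execution(QueryString=table_sql,
                            ResultConfiguration={'OutputLocation':staging_dir})
print(table_response['QueryExecutionId'])

parquet_response = client.start_query_execution(QueryString=parquet_sql,
                            ResultConfiguration={'OutputLocation':staging_dir})
print(parquet_response['QueryExecutionId'])
```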
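
Because `start_query_execution` only submits a query, a script that needs the table to exist before the next step has to wait for the query to finish. The following is a minimal sketch of one way to do that by polling `get_query_execution`. `wait_for_query` is a hypothetical helper name, and the polling interval and timeout are arbitrary assumptions.

```python
import time

# States after which an Athena query will not change any more.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(client, query_execution_id, poll_seconds=2.0, timeout_seconds=600.0):
    """Poll Athena until the query reaches a terminal state.

    Returns (state, reason), where reason is the StateChangeReason if present.
    Raises TimeoutError if the query is still running after timeout_seconds.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_query_execution(QueryExecutionId=query_execution_id)
        status = response["QueryExecution"]["Status"]
        if status["State"] in TERMINAL_STATES:
            return status["State"], status.get("StateChangeReason", "")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"query {query_execution_id} did not finish in time")
        time.sleep(poll_seconds)


# Usage with the Athena client and SQL strings defined in this section:
# response = client.start_query_execution(
#     QueryString=table_sql,
#     ResultConfiguration={'OutputLocation': staging_dir})
# state, reason = wait_for_query(client, response['QueryExecutionId'])
# if state != "SUCCEEDED":
#     raise RuntimeError(f"Athena query ended in state {state}: {reason}")
```

Waiting for the first statement before submitting the second keeps a failure in the table definition from going unnoticed.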