# Tools for Querying AWS S3 to Pandas Dataframe by Presto Using Athena
Author: Yuan Huang

## Introduction
Athena lets users run Presto queries against big data stored on AWS S3. The query results are themselves stored on S3, where they can be downloaded or queried later. This notebook collects several routinely used Python functions that either save query results to local storage, where pandas can read them, or load the results directly into a pandas dataframe for further analysis. Several approaches and packages are available; this notebook focuses on the contextlib2, pyathenajdbc, and boto3 packages.

First, let's import the necessary packages and define the parameters: the AWS access key, secret access key, S3 bucket name, S3 path, and the region of your AWS account. All of these can be saved as environment variables and read into the notebook without exposing sensitive information.

In [1]:
# Import packages
import pandas as pd
import contextlib2
from pyathenajdbc import connect
from pyathenajdbc.util import as_pandas
import boto3
from boto3 import Session
import os
import time

# Import AWS account and s3 information
aws_access_key=os.environ["AWS_KEY"]
aws_secret_access_key=os.getenv("AWS_SECRET_KEY")

bucket_name=os.getenv("AWS_s3_bucket")
s3_path=os.getenv("AWS_s3_path")
output_file="test.csv"
region=os.getenv("AWS_region")
staging_dir='s3://'+bucket_name+'/'+s3_path
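
Note that `os.getenv` returns `None` when a variable is missing, so an unset `AWS_s3_bucket` would only surface later as a `TypeError` when `staging_dir` is built. A minimal sketch of a stricter lookup follows; `require_env` is a hypothetical helper, not part of the original notebook.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError("environment variable {} is not set".format(name))
    return value

# Possible usage with the variable names from the cell above:
# bucket_name = require_env("AWS_s3_bucket")
# s3_path = require_env("AWS_s3_path")
```

Failing at the lookup keeps the error message next to the variable that caused it.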

## 1. Data query and transfer using pyathenajdbc connect
### 1.1 Pyathenajdbc connect with s3 bucket download_file command
The pyathenajdbc package provides a convenient way to set up a connection to Athena, execute the Presto query, and store the query results in an S3 path chosen by the user. Specifically, this method consists of the following steps:
1. initialize a pyathenajdbc connect object using the AWS account and s3 information: AWS access key, secret access key, s3 bucket name and path, and region name
2. initialize a cursor object from the connect object
3. execute the presto query using the cursor.execute(presto_sql) command
4. obtain the location of the query result file from the s3 path and the query_id fetched by the cursor
5. initialize an s3 resource using the AWS account information
6. download the file through the s3 resource

The following function, athena_query_s3_csv, implements steps 1 to 6: it executes a Presto query through Athena, saves the query results to the local machine, and then deletes the result objects from S3.

In [186]:
def athena_query_s3_csv(athena_sql,bucket_name,s3_path,region,output_file,access_key,secret_key):
    """
    This function executes the presto query, stores the results in the s3 location
    designated by bucket_name and s3_path, downloads the results to the local machine,
    and then deletes the result objects from s3
    
    Inputs:
      athena_sql:  the presto query string
      bucket_name: s3 bucket name for result storage
      s3_path:     the path of the s3 storage (must end with '/')
      region:      the region of the AWS account
      output_file: the name of the output file when downloading results to local machine
      access_key:  AWS access key
      secret_key:  AWS secret access key
    Output:
      None. Query results are saved as output_file on the local machine.
    """
    staging_dir='s3://'+bucket_name+'/'+s3_path
    # connect with the credentials passed to the function
    conn = connect(access_key = access_key,
               secret_key = secret_key,
               s3_staging_dir = staging_dir,
               region_name = region)
    try:
        cursor = conn.cursor()
        cursor.execute(athena_sql)
        # the result file is named after the query id
        key = s3_path + cursor.query_id + '.csv'
    finally:
        conn.close()
    s3 = boto3.resource('s3', aws_access_key_id = access_key, aws_secret_access_key = secret_key)
    s3.Bucket(bucket_name).download_file(key, output_file)
    # delete the result file, and any object sharing its key prefix, from s3
    for s in s3.Bucket(bucket_name).objects.filter(Prefix = key):
        s.delete()
        
# athena_sql must hold the presto query string; it is not defined in this notebook
athena_query_s3_csv(athena_sql,bucket_name,
                    s3_path,region,output_file,
                    aws_access_key,aws_secret_access_key)
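
Step 4 builds the object key by plain string concatenation, which only yields a valid key when `s3_path` ends with a slash. The sketch below normalizes the path first. Both helpers are hypothetical names introduced for illustration, and they keep the notebook's assumption that the result file is named `<query_id>.csv` under the staging path.

```python
def staging_dir_for(bucket_name, s3_path):
    """Build the s3:// staging URL, tolerating leading or trailing slashes in s3_path."""
    prefix = s3_path.strip('/')
    if prefix:
        return 's3://{}/{}/'.format(bucket_name, prefix)
    return 's3://{}/'.format(bucket_name)

def result_key_for(s3_path, query_id):
    """Build the object key of the result file for a given query id."""
    prefix = s3_path.strip('/')
    filename = query_id + '.csv'
    return prefix + '/' + filename if prefix else filename

# 'demo-bucket', 'athena/results' and 'abc123' are made-up values
print(staging_dir_for('demo-bucket', 'athena/results'))  # s3://demo-bucket/athena/results/
print(result_key_for('athena/results/', 'abc123'))       # athena/results/abc123.csv
```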

### 1.2 Pyathenajdbc connect directly converted to pandas dataframe
A simpler approach is to pass the pyathenajdbc connection and the Presto query string directly to the read_sql function of pandas, which returns the query results as a pandas dataframe, as shown in the following function, athena_query_df_sql():

In [None]:
def athena_query_df_sql(athena_sql,staging_dir,region,access_key,secret_key):
    """
    This function executes the presto query using AWS Athena, and fetches the
    results directly into a pandas dataframe
    
    Inputs:
      athena_sql:  the presto query string
      staging_dir: the s3 bucket name + s3 path
      region:      AWS account region
      access_key:  AWS access key
      secret_key:  AWS secret access key
    Output:        A pandas dataframe containing query results  
    """
    conn = connect(access_key = access_key,
               secret_key = secret_key,
               s3_staging_dir = staging_dir,
               region_name = region)
    return pd.read_sql(athena_sql,conn)

athena_query_df_sql(athena_sql,staging_dir,region,aws_access_key,aws_secret_access_key).head()
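
The function above never closes its connection. One way to guarantee the close is the standard-library `contextlib.closing` (contextlib2 is a backport of the same module). The sketch below shows the pattern with a stand-in connection so that it runs without AWS access; `FakeConnection` and `run_with_connection` are hypothetical names, not pyathenajdbc API.

```python
from contextlib import closing

class FakeConnection:
    """Stand-in for a database connection; it only records whether close() was called."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

def run_with_connection(make_connection, work):
    """Open a connection, run work(conn), and close the connection even if work raises."""
    with closing(make_connection()) as conn:
        return work(conn)

opened = []

def make():
    conn = FakeConnection()
    opened.append(conn)
    return conn

result = run_with_connection(make, lambda conn: "rows")
print(result, opened[0].closed)  # rows True
```

In the notebook, `make_connection` would wrap the `connect(...)` call and `work` would be something like `lambda conn: pd.read_sql(athena_sql, conn)`.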

## 2. Data query and transfer using boto3 and athena client with paginator
This method uses the Athena client of boto3 directly, without third-party packages such as pyathenajdbc. boto3 is easier to install, especially if you manage your packages with conda. In addition, the Athena client can use a paginator to split the results into pages. Compared with the pyathenajdbc methods, however, this method requires some understanding of the Athena client class, which is less straightforward. The steps of the method are the following:
1. initialize a boto3 session using the AWS account information
2. initialize an athena client
3. execute the presto query with a designated s3 staging position, and fetch the query_id
4. repeatedly check the query status; once the query has succeeded, paginate the query results using the get_paginator method,
   with the 'get_query_results' method name as the input
5. fetch the paginated results to the results_iter variable
6. iterate over results_iter and append the rows to a list, which is data_list in the function
7. iterate over data_list and extract the value of each column into the result_data list
8. transpose result_data, and convert it to a dictionary by combining it with the column names
9. convert the dictionary to a pandas dataframe
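
Steps 6 to 8 can be tried offline on hand-built pages that mimic the shape of a `get_query_results` response. The sketch below uses only the keys the notebook itself reads (`ResultSet`, `ResultSetMetadata`, `ColumnInfo`, `Label`, `Rows`, `Data`, `VarCharValue`). The sample data is made up, and `pages_to_columns` is a hypothetical helper that transposes with `zip` rather than numpy.

```python
def pages_to_columns(pages):
    """Turn get_query_results-style pages into a {column name: list of values} dict."""
    column_head = []
    records = []
    for page_index, page in enumerate(pages):
        result_set = page['ResultSet']
        rows = result_set['Rows']
        if page_index == 0:
            column_head = [col['Label'] for col in result_set['ResultSetMetadata']['ColumnInfo']]
            rows = rows[1:]  # the first row of the first page repeats the column names
        for row in rows:
            # a cell without 'VarCharValue' is a null; store it as an empty string
            records.append([cell.get('VarCharValue', '') for cell in row['Data']])
    if records:
        columns = [list(col) for col in zip(*records)]  # transpose rows into columns
    else:
        columns = [[] for _ in column_head]
    return dict(zip(column_head, columns))

meta = {'ColumnInfo': [{'Label': 'id'}, {'Label': 'name'}]}
pages = [
    {'ResultSet': {'ResultSetMetadata': meta, 'Rows': [
        {'Data': [{'VarCharValue': 'id'}, {'VarCharValue': 'name'}]},
        {'Data': [{'VarCharValue': '1'}, {'VarCharValue': 'ann'}]}]}},
    {'ResultSet': {'ResultSetMetadata': meta, 'Rows': [
        {'Data': [{'VarCharValue': '2'}, {}]}]}},
]
print(pages_to_columns(pages))  # {'id': ['1', '2'], 'name': ['ann', '']}
```

Passing the returned dictionary to `pd.DataFrame` would complete step 9. All values arrive as strings, so numeric columns still need an explicit type conversion.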

In [200]:
import pandas as pd
import boto3
import numpy as np
import time

session=boto3.Session(aws_access_key_id=aws_access_key,aws_secret_access_key=aws_secret_access_key)

def athena_to_df(session,region,athena_sql,staging_dir):
    """
    This function executes a presto query and fetches the results into a pandas dataframe
    
    Inputs:
      session:     a boto3 session
      region:      AWS account region
      athena_sql:  presto query string
      staging_dir: s3 bucket name + s3 path
      
    Output:
      a pandas dataframe with the query results      
    """
    query_status=None
    page_index=0
    row_index=0
    data_list=[]
    result_data=[]
    
    # initialize an athena client
    client=session.client('athena',region)
    
    # execute the presto query, and fetch the query_id
    query_id=client.start_query_execution(QueryString=athena_sql,
                                     ResultConfiguration={
                                         'OutputLocation':staging_dir
                                     })['QueryExecutionId']
    
    # repeatedly check the query status until the query finishes
    while True:
        query_status=client.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
        if query_status=='FAILED' or query_status=='CANCELLED':
            raise Exception('Athena query with the string "{}" failed or was cancelled'.format(athena_sql))
        if query_status=='SUCCEEDED':
            break
        # wait before polling again
        time.sleep(10)
        
    # paginate the results, and save the paginated results to results_iter
    results_paginator=client.get_paginator('get_query_results')
    results_iter=results_paginator.paginate(
                   QueryExecutionId=query_id,
                   PaginationConfig={
                      'PageSize': 1000
                   }
                 )

    # iterate the results_iter and extract column name and column values to 
    # column_head and data_list, respectively
    for page in results_iter:
        start_row=0
        if page_index==0:
            column_head = [col['Label'] for col in page['ResultSet']['ResultSetMetadata']['ColumnInfo']]
        for row in page['ResultSet']['Rows']:
            if start_row==0 and page_index==0:
                start_row+=1
                continue
            data_list.append(row['Data'])
        page_index+=1 
    
    # iterate the data_list, extract the column values and transpose the results
    # as a list
    for record in data_list:
        result_data.append([x.get('VarCharValue','') for x in record])
    result_data=np.array(result_data).transpose().tolist() 
    
    # combine the column names and column values, and convert them to a dictionary
    # then convert the dictionary to the pandas dataframe to return
    return pd.DataFrame.from_dict(dict(zip(column_head,result_data)))

tdf=athena_to_df(session,region,athena_sql,staging_dir)
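
The polling loop in athena_to_df waits indefinitely if a query never leaves the QUEUED or RUNNING state. A minimal sketch of the same status check with an upper bound on the waiting time is given below. `wait_for_query` and its parameters are hypothetical, and `FakeAthenaClient` is a stand-in that replays a fixed list of states so the example runs without AWS access.

```python
import time

def wait_for_query(client, query_id, poll_seconds=10, timeout_seconds=600, sleep=time.sleep):
    """Poll get_query_execution until the query succeeds; raise on failure or timeout."""
    waited = 0
    while True:
        response = client.get_query_execution(QueryExecutionId=query_id)
        state = response['QueryExecution']['Status']['State']
        if state == 'SUCCEEDED':
            return state
        if state in ('FAILED', 'CANCELLED'):
            raise RuntimeError('Athena query {} ended in state {}'.format(query_id, state))
        if waited >= timeout_seconds:
            raise TimeoutError('Athena query {} still {} after {}s'.format(query_id, state, waited))
        sleep(poll_seconds)
        waited += poll_seconds

class FakeAthenaClient:
    """Stand-in for the boto3 Athena client; replays a fixed sequence of states."""
    def __init__(self, states):
        self._states = iter(states)

    def get_query_execution(self, QueryExecutionId):
        return {'QueryExecution': {'Status': {'State': next(self._states)}}}

sleeps = []
fake = FakeAthenaClient(['QUEUED', 'RUNNING', 'SUCCEEDED'])
final_state = wait_for_query(fake, 'demo-query-id', poll_seconds=1, sleep=sleeps.append)
print(final_state, sleeps)  # SUCCEEDED [1, 1]
```

Injecting `sleep` as a parameter keeps the function testable; with the real client the default `time.sleep` applies.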