# Tool for Executing Presto Queries on Qubole
Author: Yuan Huang

## Introduction
Qubole is a convenient platform for querying big data. It manages clusters, automatically adjusting the number of nodes, and supports a variety of big data engines, including Presto, Hive, and Spark. This notebook contains a function that runs a Presto query on Qubole and loads the results into a pandas DataFrame. Result sets smaller than 20 MB can usually be downloaded directly from the Qubole interface as CSV files, but larger result sets are less straightforward to retrieve. The `get_results()` method of the `PrestoCommand` class in the `qds_sdk` package can download these larger result files, and the `get_pandas_df()` function in this notebook uses it to fetch large result sets directly into a pandas DataFrame. For details of the Python implementation, see the [qds_sdk source](https://github.com/qubole/qds-sdk-py/blob/master/qds_sdk/commands.py).

In [2]:
# Import packages
import io
import json
import os

import pandas as pd

from qds_sdk.qubole import Qubole
from qds_sdk.commands import PrestoCommand

In [3]:
# Read credentials from environment variables
aws_access_key = os.environ["AWS_KEY"]
aws_access_secret_key = os.getenv("AWS_SECRET_KEY")
qubole_token = os.getenv("TOKEN")

In [5]:
# Configure the Qubole SDK with the API token
Qubole.configure(api_token=qubole_token)

def write_headers(qlog,fp):
    """
    Write the column names from a query log to a BytesIO object.

    Adapted from the qds_sdk package (see the GitHub link in the
    Introduction).

    Inputs:
      qlog: the query log as a JSON string
      fp: a BytesIO object
    Output:
      None. The column names found in the log are written to fp.
    """
    qlog = json.loads(qlog)
    schema = qlog["QBOL-QUERY-SCHEMA"]
    if schema is not None:
        # Take the column descriptors stored under the first schema key
        qlog_hash = next(iter(schema.values()))
        col_names = [qlog_item["ColumnName"] for qlog_item in qlog_hash]

        # Write a tab-separated header line; fp is binary, so encode first.
        # Nothing is written when the log has no schema.
        header = "\t".join(col_names) + "\n"
        fp.write(header.encode('utf-8'))

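def get_pandas_df(query_string, cluster_label='progessive'):
    """
    Run a Presto query on Qubole and return the results as a pandas DataFrame.
    """
    # Submit the query to the labeled cluster; run() returns once the command finishes
    cmd = PrestoCommand.run(query=query_string, label=cluster_label)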
   
    if PrestoCommand.is_success(cmd.status):
        # if the query executes successfully, write the column names
        buf = io.BytesIO()
        write_headers(cmd.qlog, buf)
        
        # write results to BytesIO, then to pandas
        cmd.get_results(buf, delim='\t', inline=False, qlog=cmd.qlog)
        buf.seek(0)
        df = pd.read_csv(buf, delimiter='\t', na_values='\\N')
        buf.close()
        
        return df
    else:
        raise RuntimeError("Presto query failed with status: {}".format(cmd.status))
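
The buffer handling in `get_pandas_df()` can be tried without a Qubole cluster. The following is a minimal sketch under stated assumptions: `sample_qlog` is a hypothetical log payload (the `"-1"` key and the `ColumnType` field are made up for illustration; only `QBOL-QUERY-SCHEMA` and `ColumnName` appear in the code above), and the two data rows stand in for whatever `cmd.get_results()` would write. It shows how the header line and the tab-delimited rows end up in one `BytesIO` buffer, and how `\N` values become missing values in pandas.

```python
import io
import json

import pandas as pd

# Hypothetical qlog payload; in the notebook this string comes from cmd.qlog
sample_qlog = json.dumps({
    "QBOL-QUERY-SCHEMA": {
        "-1": [
            {"ColumnName": "user_id", "ColumnType": "bigint"},
            {"ColumnName": "country", "ColumnType": "varchar"},
        ]
    }
})

# Extract the column names the same way write_headers() does
schema = json.loads(sample_qlog)["QBOL-QUERY-SCHEMA"]
col_names = [item["ColumnName"] for item in next(iter(schema.values()))]

# Write the tab-separated header line to a binary buffer
buf = io.BytesIO()
buf.write(("\t".join(col_names) + "\n").encode("utf-8"))

# Stand-in for the rows that cmd.get_results(buf, delim='\t', ...) would append;
# the second row uses \N to mark a null value
buf.write(b"1\tUS\n2\t\\N\n")

# Rewind and parse, treating \N as a missing value
buf.seek(0)
sample_df = pd.read_csv(buf, delimiter="\t", na_values="\\N")
buf.close()

print(sample_df)
```

With valid credentials and a running cluster, a real call would look something like `df = get_pandas_df("SELECT 1")`, with the query string and cluster label replaced by your own.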