# Create Athena Database Schema

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. 

Athena is based on Presto and supports standard data formats, including CSV, JSON, and Avro, as well as columnar formats such as Apache Parquet and Apache ORC.

Presto is an open source, distributed SQL query engine developed for fast analytic queries against data of any size. It queries data where it is stored, so the data does not need to be moved. Queries execute in parallel across an in-memory architecture, which makes Presto fast.


<img src="img/athena_setup.png" width="60%" align="left">

In [2]:
import boto3
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name



In [3]:
ingest_create_athena_db_passed = False

In [4]:
%store -r s3_public_path_tsv

In [5]:
try:
    s3_public_path_tsv
except NameError:
    print("*****************************************************************************")
    print("[ERROR] PLEASE RE-RUN THE PREVIOUS COPY TSV TO S3 NOTEBOOK ******************")
    print("[ERROR] THIS NOTEBOOK WILL NOT RUN PROPERLY. ********************************")
    print("*****************************************************************************")

In [6]:
print(s3_public_path_tsv)

s3://dsoaws/amazon-reviews-pds/tsv


In [7]:
%store -r s3_private_path_tsv

In [8]:
try:
    s3_private_path_tsv
except NameError:
    print("*****************************************************************************")
    print("[ERROR] PLEASE RE-RUN THE PREVIOUS COPY TSV TO S3 NOTEBOOK ******************")
    print("[ERROR] THIS NOTEBOOK WILL NOT RUN PROPERLY. ********************************")
    print("*****************************************************************************")

In [9]:
print(s3_private_path_tsv)

s3://sagemaker-us-east-1-992382405090/amazon-reviews-pds/tsv


# Import PyAthena

[PyAthena](https://pypi.org/project/PyAthena/) is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena.
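Because PyAthena follows the DB API 2.0 specification, it uses the same connect, cursor, execute, and fetch pattern as other compliant drivers. The sketch below illustrates that pattern with the standard-library `sqlite3` module standing in for Athena so that it runs without AWS credentials. The `reviews` table and its rows are made up for illustration and are not part of this notebook's dataset.

```python
import sqlite3

# sqlite3 stands in for Athena here: both drivers expose the PEP 249 interface.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table and rows, used only to have something to query.
cursor.execute("CREATE TABLE reviews (product_category TEXT, star_rating INTEGER)")
cursor.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("Books", 5), ("Books", 3), ("Toys", 4)],
)

cursor.execute(
    "SELECT product_category, COUNT(*) AS review_count "
    "FROM reviews GROUP BY product_category ORDER BY product_category"
)

# cursor.description holds one 7-item sequence per result column; item 0 is the name.
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)

connection.close()
```

Placeholder syntax for query parameters differs between drivers, so the `?` markers above should not be assumed to carry over to other DB API clients.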

In [10]:
from pyathena import connect

# Create Athena Database

In [11]:
database_name = "dsoaws"

Note: The databases and tables that we create in Athena use a data catalog service to store metadata about the data. For example, a table's schema (the table name, the column names, and the data type of each column) is saved as metadata in the data catalog.

Athena natively supports the AWS Glue Data Catalog service. When we run `CREATE DATABASE` and `CREATE TABLE` queries in Athena with the AWS Glue Data Catalog as our source, we automatically see the database and table metadata entries being created in the AWS Glue Data Catalog.
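One way to confirm this is to list the databases directly from the AWS Glue Data Catalog. The following is a minimal sketch, assuming a boto3 Glue client with permission to call `get_databases`. The helper name `list_glue_database_names` is hypothetical and is not used elsewhere in this notebook.

```python
def list_glue_database_names(glue_client):
    """Return the names of all databases visible in the Glue Data Catalog.

    `glue_client` is expected to behave like boto3.client("glue").
    """
    names = []
    # The paginator follows NextToken so that large catalogs are fully listed.
    paginator = glue_client.get_paginator("get_databases")
    for page in paginator.paginate():
        names.extend(db["Name"] for db in page["DatabaseList"])
    return names


# Usage in this notebook (requires AWS credentials and Glue read permissions):
#   import boto3
#   glue = boto3.client("glue", region_name=region)
#   print("dsoaws" in list_glue_database_names(glue))
```

Passing the client in as an argument keeps the helper free of any AWS setup, which also makes it easy to exercise with a stub object.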

In [12]:
# Set the S3 staging directory, where Athena writes query results
s3_staging_dir = "s3://{0}/athena/staging".format(bucket)

In [13]:
# Note: connect(...).cursor() returns a DB API cursor, so `conn` is a cursor, not a connection
conn = connect(region_name=region, s3_staging_dir=s3_staging_dir).cursor()

In [14]:
statement = "CREATE DATABASE IF NOT EXISTS {}".format(database_name)
print(statement)

CREATE DATABASE IF NOT EXISTS dsoaws
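The statement above is built with plain string formatting, which is fine for a hard-coded name but risky if the name ever comes from user input. The sketch below shows one way to validate the identifier first. The helper `create_database_statement` is hypothetical, and the rule it enforces (lowercase letters, digits, and underscores only) is a deliberately conservative assumption rather than a statement of Athena's exact naming rules.

```python
import re

# Assumed rule for this sketch: start with a lowercase letter or underscore,
# then lowercase letters, digits, or underscores only.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def create_database_statement(name):
    """Build a CREATE DATABASE statement after validating the database name."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError("Invalid database name: {!r}".format(name))
    return "CREATE DATABASE IF NOT EXISTS {}".format(name)


print(create_database_statement("dsoaws"))
```

A name such as `"dsoaws; DROP DATABASE default"` raises `ValueError` instead of being interpolated into the SQL.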


In [16]:
import pandas as pd
conn.execute(statement)
print(conn.description)
print(conn.fetchall())

[]
[]


# Verify the Database Has Been Created Successfully

In [18]:
statement = "SHOW DATABASES"

conn.execute(statement)
df_show = pd.DataFrame(conn.fetchall())
df_show.head()

|   | 0 |
|---|---|
| 0 | analyticsdb |
| 1 | default |
| 2 | dsoaws |
| 3 | sagemaker_featurestore |


In [19]:
if database_name in df_show.values:
    ingest_create_athena_db_passed = True
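The check above matches `database_name` against any cell of the DataFrame. A slightly stricter alternative, sketched below, works on the raw rows from `fetchall()` and compares only the first column. The helper name `database_exists` is hypothetical, and the sample rows simply mirror the `SHOW DATABASES` output shown earlier.

```python
def database_exists(rows, name):
    """Return True if `name` appears in the first column of the result rows.

    `rows` is a list of tuples, as returned by a DB API cursor's fetchall().
    """
    return any(len(row) > 0 and row[0] == name for row in rows)


# Sample rows shaped like the SHOW DATABASES result above.
sample_rows = [("analyticsdb",), ("default",), ("dsoaws",), ("sagemaker_featurestore",)]

print(database_exists(sample_rows, "dsoaws"))
print(database_exists(sample_rows, "missing_db"))
```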

In [20]:
%store ingest_create_athena_db_passed

Stored 'ingest_create_athena_db_passed' (bool)


# Store Variables for the Next Notebooks

In [21]:
%store

Stored variables and their in-db values:
USE_FULL_MOVIELENS                                    -> False
bucket_name                                           -> '992382405090personalizepocvod'
comprehend_endpoint_arn                               -> 'arn:aws:comprehend:us-east-1:992382405090:documen
comprehend_train_s3_uri                               -> 's3://sagemaker-us-east-1-992382405090/data/amazon
comprehend_training_job_arn                           -> 'arn:aws:comprehend:us-east-1:992382405090:documen
data_dir                                              -> 'poc_data'
dataset_dir                                           -> 'poc_data/ml-latest-small/'
dataset_group_arn                                     -> 'arn:aws:personalize:us-east-1:992382405090:datase
forecast_arn                                          -> 'arn:aws:forecast:us-east-1:992382405090:forecast/
forecast_dataset_arn                                  -> 'arn:aws:forecast:us-east-1:992382405090:dataset/u
foreca

# Release Resources

In [23]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    const els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [22]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}
