## Glue Studio Taxi Demo

In this demo we load New York City taxi data into S3 and populate the Glue Data Catalog. Once the catalog holds the source tables, we use Glue Studio to visually build a job that denormalizes them into a curated, Parquet-formatted data set.

We use the Python `boto3` library for parts of the demo, along with resources created by a CloudFormation stack defined with the CDK. To get started, review the [README.md](README.md) file.

You can find more information about the Python `boto3` library [here](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html).

In [None]:
import boto3
import botocore
import json
import os
import uuid
import pandas as pd

glue = boto3.client('glue')
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
lf = boto3.client('lakeformation')
cfn = boto3.client('cloudformation')

session = boto3.session.Session()
region = session.region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

#### Get the Outputs from the CloudFormation template

The `GlueStudioDemoStack` must be deployed before starting the demo. If you came straight to this notebook, run the commands below from the root directory of this repo. Deploying the stack requires the [AWS Cloud Development Kit](https://aws.amazon.com/cdk/) (CDK). The CDK synthesizes CloudFormation templates from code; this demo is written in `TypeScript`, but other languages are supported. For more information on the CDK API go [here](https://docs.aws.amazon.com/cdk/api/latest/docs/aws-construct-library.html).

From the root directory:

``` bash
npm install

cdk deploy
```

Once the stack deploys successfully, you can read the `Outputs` from the CloudFormation stack. The stack creates the resources below:

* S3 Bucket
* IAM Role
* Glue Database
* Glue Crawler

These resources will be used throughout the demo to store files, provide access to the data lake, and populate the catalog to be used by Glue Studio.
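
As a minimal sketch of an alternative to matching each key with its own `if`, the list under `Outputs` can be folded into a dictionary keyed by `OutputKey`. The helper name `outputs_to_dict` and the sample values are assumptions made for illustration; only the key names come from this demo.

```python
def outputs_to_dict(outputs):
    """Map each CloudFormation OutputKey to its OutputValue."""
    return {item["OutputKey"]: item["OutputValue"] for item in outputs}


# Illustrative payload shaped like response['Stacks'][0]['Outputs'];
# the values here are made up for this sketch.
sample_outputs = [
    {"OutputKey": "DataLakeBucketName", "OutputValue": "example-bucket"},
    {"OutputKey": "TaxiDatabase", "OutputValue": "example_db"},
]

stack_outputs = outputs_to_dict(sample_outputs)
print(stack_outputs["DataLakeBucketName"])
```

With a real stack, a lookup such as `stack_outputs['DataLakeRoleArn']` raises a `KeyError` if the output is missing, which surfaces a failed or partial deployment early.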

In [None]:
response = cfn.describe_stacks(
    StackName='GlueStudioDemoStack'
)

outputs = response['Stacks'][0]['Outputs']

for output in outputs:
    if (output['OutputKey'] == 'DataLakeBucketName'):
        bucket = output['OutputValue']
    if (output['OutputKey'] == 'TaxiDatabase'):
        database_name = output['OutputValue']
    if (output['OutputKey'] == 'DataLakeRoleArn'):
        role_arn = output['OutputValue']
    if (output['OutputKey'] == 'TaxiDataCrawler'):
        data_crawler = output['OutputValue']        
        
pd.set_option('display.max_colwidth', None)
pd.DataFrame(outputs, columns=["OutputKey", "OutputValue"])

### Upload data files to S3

Next, we will upload the `.csv` files located in the `data` folder to S3 to be used later in the demo. We are using a sample file from New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset available on the [AWS Open Data Registry](https://registry.opendata.aws/nyc-tlc-trip-records-pds/).

See the boto3 documentation for [s3.upload_file](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_file).
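
S3 object keys always use `/` as the separator, regardless of the operating system running the notebook. The sketch below builds the keys with `posixpath.join`, which joins with `/` on every platform. The helper name `s3_key` is an assumption for illustration; the `datalake` prefix, folder names, and file names are the ones used in this demo.

```python
import posixpath


def s3_key(table_folder, file_name, prefix="datalake"):
    """Build an S3 object key using '/' on every platform."""
    return posixpath.join(prefix, table_folder, file_name)


# Folder-to-file mapping used in this demo.
files = {
    "yellow": "yellow_tripdata_2020-06.csv",
    "paymenttype": "paymenttype.csv",
    "ratecode": "ratecode.csv",
    "taxi_zone_lookup": "taxi_zone_lookup.csv",
}

keys = [s3_key(folder, name) for folder, name in files.items()]
print(keys[0])
```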

In [None]:
# Map each data lake folder to the sample file it receives
files = {
    'yellow': 'yellow_tripdata_2020-06.csv',
    'paymenttype': 'paymenttype.csv',
    'ratecode': 'ratecode.csv',
    'taxi_zone_lookup': 'taxi_zone_lookup.csv',
}
path = 'data'
data_lake = session.resource('s3').Bucket(bucket)

for folder, file_name in files.items():
    # S3 keys always use '/', so build them directly rather than with os.path.join
    key = '/'.join(['datalake', folder, file_name])
    data_lake.Object(key).upload_file(os.path.join(path, file_name))

#### Load Taxi Demo database with S3 data from Glue Crawler
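
A crawler run is asynchronous, so the notebook has to poll the crawler state until it returns to `READY`. The sketch below shows one way to poll with a timeout so that a stuck crawler does not block the notebook forever. The function name `wait_for_state` and its parameters are assumptions for illustration, and the state sequence is simulated rather than read from Glue.

```python
import time


def wait_for_state(get_state, target="READY", poll_seconds=30,
                   timeout_seconds=1800, sleep=time.sleep):
    """Call get_state() until it returns target or the timeout elapses."""
    waited = 0
    state = get_state()
    while state != target:
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        state = get_state()
    return state


# Simulated crawler states; a real run would query Glue instead.
states = iter(["RUNNING", "STOPPING", "READY"])
final_state = wait_for_state(lambda: next(states), poll_seconds=0)
print(final_state)
```

In this notebook, the callable could be `lambda: glue.get_crawler(Name=data_crawler)['Crawler']['State']`.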



In [None]:
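import time  # needed for time.sleep; not imported in the first cell

glue.start_crawler(Name=data_crawler)

crawler_status = glue.get_crawler(Name=data_crawler)['Crawler']['State']

# Poll until the crawler returns to READY, which signals that the run has finished
while crawler_status != 'READY':
    print(crawler_status)
    time.sleep(30)
    crawler_status = glue.get_crawler(Name=data_crawler)['Crawler']['State']

print('Crawler Complete')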

#### View Crawler Results

Now that we have crawled the taxi data, we want to look at the results of the crawl to see the tables that were created. We will call the [glue.get_tables](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_tables) function and load key fields into a pandas DataFrame for display.
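
`get_tables` returns results in pages, so a database with many tables may need more than one call. The sketch below flattens a sequence of response pages into one row per table. The helper name `summarize_tables` and the sample page are assumptions for illustration; the field names follow the `TableList` structure used in this notebook.

```python
def summarize_tables(pages):
    """Flatten get_tables response pages into one summary row per table."""
    rows = []
    for page in pages:
        for table in page.get("TableList", []):
            descriptor = table.get("StorageDescriptor", {})
            rows.append({
                "Name": table["Name"],
                "Columns": [col["Name"] for col in descriptor.get("Columns", [])],
                "Location": descriptor.get("Location"),
            })
    return rows


# Made-up page shaped like a get_tables response.
sample_pages = [
    {"TableList": [
        {"Name": "ratecode",
         "StorageDescriptor": {
             "Columns": [{"Name": "ratecodeid", "Type": "bigint"}],
             "Location": "s3://example-bucket/datalake/ratecode/"}},
    ]},
]

rows = summarize_tables(sample_pages)
print(rows[0]["Columns"])
```

Against the real catalog, the pages could come from `glue.get_paginator('get_tables').paginate(DatabaseName=database_name)`.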

In [None]:
df = pd.json_normalize(glue.get_tables(DatabaseName=database_name)['TableList'])
pd.set_option('display.max_colwidth', None)
pd.DataFrame(df, columns=["Name", "DatabaseName", "StorageDescriptor.Columns", "StorageDescriptor.Location"])

#### Let's start with Glue Studio

Run the cell below to get a link to the Glue Studio console, then follow the instructions below to create the Glue job visually. We will join multiple tables from the Glue Data Catalog into a single data set and save it in Parquet format to the S3 bucket created earlier. Later, we will use the job created with Glue Studio in a Glue Workflow to create an end-to-end ETL solution.
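
To make the goal of the visual job concrete, the sketch below denormalizes two toy trip records in plain Python by replacing lookup codes with their descriptions. It is not the Glue Studio job itself. The lookup column names `payment_name` and `rate_name` and the sample rows are assumptions for illustration; the actual columns depend on what the crawler discovers.

```python
# Toy fact rows; real trips carry many more columns.
trips = [
    {"payment_type": 1, "RatecodeID": 1, "fare_amount": 12.5},
    {"payment_type": 2, "RatecodeID": 2, "fare_amount": 52.0},
]

# Toy lookup tables keyed by their code.
payment_types = {1: "Credit card", 2: "Cash"}
rate_codes = {1: "Standard rate", 2: "JFK"}


def denormalize(trip):
    """Left-join one trip to both lookups; unmatched codes become None."""
    return {
        **trip,
        "payment_name": payment_types.get(trip["payment_type"]),
        "rate_name": rate_codes.get(trip["RatecodeID"]),
    }


curated = [denormalize(trip) for trip in trips]
print(curated[1]["rate_name"])
```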

In [None]:
df = pd.DataFrame(["https://console.aws.amazon.com/gluestudio/home?region={0}#/".format(region)])
df.columns = ['Link']
def make_clickable(val):
    return '<a href="{}" target="_blank">{}</a>'.format(val,val)

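# Styler.hide_index() was removed in pandas 2.0; hide(axis='index') needs pandas 1.4 or later
df.style.hide(axis='index').format(make_clickable)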