# Project: Data Warehouse
Samuel Botter Martins

This Jupyter notebook inspects the datasets provided by the project. Although the project description already gives plenty of information about the datasets, such as their links and structures, I decided to double-check them to practice my coding skills with boto3 and S3.

In [None]:
import pandas as pd
import configparser
import boto3

## Before we start
First, I followed the same steps as in the exercise about IaC and created a new IAM user in my AWS account with the _AdministratorAccess_ policy. <br/>
I then used this user's _security credentials (access keys)_ throughout this project.

## 1. Load DWH Parameters from a configuration file
My AWS configuration is placed in the `dwh.cfg` file. For security reasons, I have just provided a template of this file called `dwh_template.cfg`.

In [None]:
config = configparser.ConfigParser()
# use a context manager so the file handle is closed after reading
with open('dwh.cfg') as cfg_file:
    config.read_file(cfg_file)

KEY = config.get('AWS', 'KEY')
SECRET = config.get('AWS', 'SECRET')
REGION = config.get('AWS', 'REGION')
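
For readers without a `dwh.cfg` at hand, the lookup above can be tried against an in-memory configuration. The `[AWS]` section and its `KEY`, `SECRET`, and `REGION` options match the cell above; the placeholder values and the helper name `load_aws_settings` are assumptions for illustration, and the actual layout of `dwh_template.cfg` may differ.

```python
import configparser

# Hypothetical stand-in for dwh.cfg: the values are placeholders, not real credentials.
SAMPLE_CFG = """
[AWS]
KEY = YOUR_ACCESS_KEY_ID
SECRET = YOUR_SECRET_ACCESS_KEY
REGION = us-west-2
"""


def load_aws_settings(cfg_text):
    """Parse configuration text and return the three AWS settings as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {name: parser.get('AWS', name) for name in ('KEY', 'SECRET', 'REGION')}


settings = load_aws_settings(SAMPLE_CFG)
print(settings['REGION'])
```

Keeping the parsing in a small function makes it easy to check the file layout before any real credentials are involved.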

In [None]:
# show the loaded parameters
# uncomment the code below to show your credentials

# pd.DataFrame({
#     "param": ['KEY', 'SECRET', 'REGION'],
#     "value": [KEY, SECRET, REGION],
# })

## 2. Create a Python Client for S3
I'm going to create a Python client for S3 to access the S3 bucket provided by the project. <br/>
We can do this using two different methods:
- `boto3.client`
- `boto3.resource`

In summary, the `client` method is best suited for _making direct API calls_ and offers **fine-grained control**, while the `resource` method provides a _higher-level interface_ and allows you to work with AWS resources using a _more object-oriented approach_.

More details at: https://www.learnaws.org/2021/02/24/boto3-resource-client/

We will use both methods at different times, depending on which operations each one offers.

### 2.1 A bit about S3
Before checking out the datasets, let's recap some concepts about **S3**.

An <span style='color: orange'><b>S3 bucket</b></span> is a logical _container_ of <span style='color: #d13212'>objects</span>. It resembles a ***“folder”***, but since **S3 deals with objects and _not_ files**, the distinction becomes important.

An <span style='color: #d13212'>object</span> is similar to a "file", but with a _different structure_ on AWS. Each <span style='color: #d13212'>object</span> is a **name/value data pair**: the **“content”** plus its **metadata**.

For privacy reasons, Amazon **cannot see the _data_** inside any object, but **can see the _metadata_**. The **metadata** is a *set of information about the object itself*, such as the last-modified date, the object size, and other HTTP-specific metadata.

Each object is identified by a <span style='color: #d13212'><b>key name</b></span> that _uniquely identifies_ it within a bucket. We use this <span style='color: #d13212'><b>object key</b></span> to retrieve the object.

<span style='color: orange'><b>Amazon S3</b></span> has a ***flat structure*** rather than a hierarchy like you would see in a file system. However, the <span style='color: orange'><b>Amazon S3</b></span> console **supports the _folder concept_** as a means of **object grouping**.
<br/><br/><br/>
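
The flat key space can be illustrated without touching AWS. In the sketch below, the keys are hypothetical examples shaped like the ones in this project, and `top_level_folders` is an illustrative helper, not part of boto3. It shows that a 'folder' is nothing more than the part of a key before the first `/`.

```python
def top_level_folders(keys, delimiter='/'):
    """Return the sorted 'folder' names implied by a flat list of object keys."""
    folders = set()
    for key in keys:
        head, sep, _ = key.partition(delimiter)
        if sep:  # the key contains the delimiter, so `head` acts like a folder
            folders.add(head)
    return sorted(folders)


# Hypothetical keys: the bucket only stores these full names, not directories.
example_keys = [
    'log_data/2018/11/2018-11-01-events.json',
    'song_data/A/A/A/example.json',
    'log_json_path.json',
]
print(top_level_folders(example_keys))
```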

### 2.2 Understanding the datasets
The project description mentions that we will be working with 3 datasets residing in a public S3 bucket, with the following S3 links:
- Song data: `s3://udacity-dend/song_data`
- Log data: `s3://udacity-dend/log_data`
- This third file `s3://udacity-dend/log_json_path.json` contains the _meta information_ that is required by AWS to correctly load `s3://udacity-dend/log_data`

According to the [official AWS documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html), one possible way to access an S3 bucket is using the `s3://` format, which is exactly our case. We should **be aware** that when using this format, the _bucket name_ **does not include the *AWS Region***.
<br/>

<code>s3://</code><code style='color: #d13212'>bucket-name</code><code>/</code><code style='color: #d13212'>key-name</code>

For example: <br/>
<code>s3://mybucket/puppy.jpg</code>
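
To make the two parts of this format concrete, here is a minimal sketch that splits such a URI with the standard library. The helper name `split_s3_uri` is hypothetical, and boto3 itself does not require this step: it takes the bucket and key as separate arguments.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an `s3://bucket-name/key-name` URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(f'not an S3 URI: {uri!r}')
    # the bucket is the "host" part; the key is the path without its leading slash
    return parsed.netloc, parsed.path.lstrip('/')


print(split_s3_uri('s3://udacity-dend/log_json_path.json'))
```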

In our case, the **bucket** is <code style='color: #d13212'>udacity-dend</code> and apparently it has some 'folders', like `song_data` and `log_data`. The **bucket region** mentioned in the project description is `us-west-2`. I will create the Redshift cluster in the same region, so I put this information in my _configuration file_.

Also, I created the `BUCKET='udacity-dend'` setting in the `dwh.cfg` file.

Let's confirm these premises.
<br/><br/>

#### Listing all objects ('files') inside the `udacity-dend` bucket

In [None]:
# creating an S3 resource via boto3.resource
s3 = boto3.resource('s3',
                    region_name=REGION,
                    aws_access_key_id=KEY,
                    aws_secret_access_key=SECRET)

In [None]:
# access/get the S3 bucket
bucket = s3.Bucket('udacity-dend')

In [None]:
# list all objects' names ('filenames') inside this bucket
for obj in bucket.objects.all():
    print(obj.key)

Since this run was taking too long, I stopped it after a few seconds.

From this run, we can confirm some assumptions about our _bucket_:
- `log-data` and `song-data` are 'folders'
- They have specific _subfolder structures_.
- The bucket also includes other files and folders irrelevant to our problem
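
Instead of interrupting the full listing by hand, the iteration can be capped. The sketch below assumes `bucket` is the boto3 `Bucket` resource created earlier; the helper name `first_keys` and the default of 20 are illustrative choices. Because the collection is fetched lazily, page by page, stopping early avoids walking the whole bucket.

```python
from itertools import islice


def first_keys(bucket, n=20):
    """Return the keys of at most `n` objects from a boto3 Bucket resource."""
    return [obj.key for obj in islice(bucket.objects.all(), n)]


# Usage with the `bucket` created above (requires AWS credentials):
# for key in first_keys(bucket, 20):
#     print(key)
```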

#### Listing objects from a specific 'folder'
There are two ways to do that:

##### Using `boto3.resource` with `filter` and `Prefix`

In [None]:
# `bucket` was previously loaded with `boto3.resource`

for obj in bucket.objects.filter(Prefix="log_data"):
    print(obj.key)

##### Using `boto3.client` and retrieving the specific folder

In [None]:
# creating an S3 client via boto3.client
s3_client = boto3.client('s3',
                         region_name=REGION,
                         aws_access_key_id=KEY,
                         aws_secret_access_key=SECRET)

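# getting the objects with prefix `log_data` inside the bucket `udacity-dend`
# (a single call returns at most 1,000 objects)
objects = s3_client.list_objects_v2(Bucket='udacity-dend', Prefix="log_data")

# 'Contents' is absent from the response when no object matches the prefix
for obj in objects.get('Contents', []):
    print(obj['Key'])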
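
A single `list_objects_v2` call returns at most 1,000 objects, so larger 'folders' need several requests. A minimal sketch using the client's paginator follows; `iter_keys` is a hypothetical helper name, and `s3_client` is assumed to be the client created above.

```python
def iter_keys(client, bucket_name, prefix=''):
    """Yield every key under `prefix`, following pagination automatically."""
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # 'Contents' is missing from a page when nothing matches the prefix
        for obj in page.get('Contents', []):
            yield obj['Key']


# Usage with the `s3_client` created above (requires AWS credentials):
# for key in iter_keys(s3_client, 'udacity-dend', 'log_data'):
#     print(key)
```

Writing it as a generator keeps memory use flat, since keys are produced one page at a time.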

##### BE CAREFUL
In both cases, we are retrieving the objects inside a _'folder'_ using a **prefix**: we pass the name of the folder as this _prefix_.

However, please note that this does not guarantee that only objects in this folder will be retrieved. Any other object or 'folder' whose key starts with the same prefix will also be retrieved.

For example, an object with the key `log_data_full.csv` would also be retrieved in the examples above.
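
The pitfall can be reproduced with plain strings, since a prefix match is just a "starts with" test on the key. The keys below are hypothetical and `match_prefix` is an illustrative helper; the point is that appending the `/` delimiter to the prefix (`log_data/`) keeps look-alike names out.

```python
# Hypothetical keys, used only to illustrate how prefix matching behaves.
keys = [
    'log_data/2018/11/2018-11-01-events.json',
    'log_data_full.csv',
    'song_data/A/A/A/example.json',
]


def match_prefix(all_keys, prefix):
    """Mimic S3 prefix filtering: keep the keys that start with `prefix`."""
    return [key for key in all_keys if key.startswith(prefix)]


print(match_prefix(keys, 'log_data'))   # also matches log_data_full.csv
print(match_prefix(keys, 'log_data/'))  # only keys inside the 'folder'
```

With boto3, the same idea would be to pass `Prefix="log_data/"` in the calls shown earlier.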

More details:
- https://dev.to/aws-builders/how-to-list-contents-of-s3-bucket-using-boto3-python-47mm
- https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/list_objects_v2.html

#### Downloading and inspecting a specific file of each dataset
Finally, let's download and inspect a specific file from each dataset. <br/>
Let's consider the files:
- `log_data/2018/11/2018-11-01-events.json`
- `song-data/A/N/S/TRANSJO128F1458706.json`

In [None]:
# creating an S3 resource via boto3.resource
s3 = boto3.resource('s3',
                    region_name=REGION,
                    aws_access_key_id=KEY,
                    aws_secret_access_key=SECRET)

# access/get the S3 bucket
bucket = s3.Bucket('udacity-dend')

In [None]:
import os

# download the sample files into a `samples` folder (created first if missing)
os.makedirs('./samples', exist_ok=True)
bucket.download_file('song-data/A/N/S/TRANSJO128F1458706.json', './samples/song_data_sample.json')
bucket.download_file('log_data/2018/11/2018-11-01-events.json', './samples/log_data_sample.json')

# download the third dataset
bucket.download_file('log_json_path.json', './samples/log_json_path.json')

**SONG DATA**

In [None]:
# show the json files as pandas dataframes
# SONG DATA
pd.read_json('./samples/song_data_sample.json', orient='index')

**LOG DATA**

In [None]:
# LOG Data
pd.read_json('./samples/log_data_sample.json', lines=True)
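
The `lines=True` argument tells pandas that the file holds one JSON object per line rather than a single JSON document. The sketch below shows the same idea with the standard library only; the two records and their field names are made up for illustration and are not taken from the real log file.

```python
import json

# Two hypothetical records in JSON Lines form: one JSON object per line.
sample_text = '{"user": "a", "page": "Home"}\n{"user": "b", "page": "NextSong"}\n'


def parse_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(sample_text)
print(len(records), records[1]['page'])
```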

<br/> <br/>

Now I can work on designing the tables.