# Tidying up our data - Part 1

## Learning objectives

- Manipulate deeply nested JSON and transform it into structured data ready to be loaded into a data warehouse

## Loading our data from S3

Because we expect you to know by now how to set things up to load data from S3, we're doing it for you this time.  
Go through our code and check that you have been following the same best practices.

---

On a side note, it is **tedious** to have to do this in every notebook, isn't it?  
And putting everything into a single notebook isn't a proper solution either.

The solution would be to store this configuration in an external file; that's what we would recommend.  
_Sadly, Databricks doesn't work with regular Python files, only notebooks..._
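
Outside of that constraint, the idea could look like the minimal sketch below: the settings live in a JSON file and each script reads them at startup. The file name `s3_config.json` and the `load_config` helper are hypothetical, and the values are placeholders. Credentials would normally be kept out of such a file altogether.

```python
import json
import tempfile
from pathlib import Path


def load_config(path):
    """Read settings from a JSON file and return them as a dict (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# For the demo only: write a placeholder config file into a temporary directory.
config_path = Path(tempfile.mkdtemp()) / "s3_config.json"
config_path.write_text(
    json.dumps({
        "SCHEME": "s3a",
        "BUCKET_NAME": "BUCKET_NAME",
        "PREFIX": "Big_Data/YOUTUBE",
    }),
    encoding="utf-8",
)

# Any script can now load the same settings instead of redefining them.
config = load_config(config_path)
print(config["PREFIX"])
```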

In [None]:
S3_RESOURCE = 's3'
SCHEME = 's3a'
ACCESS_KEY_ID = "ACCESS_KEY_ID" # access key of the student account
SECRET_ACCESS_KEY = "SECRET_ACCESS_KEY" # secret key of the student account
BUCKET_NAME = "BUCKET_NAME"
PREFIX = "Big_Data/YOUTUBE"

In [None]:
import boto3

# We create an S3 resource and a Bucket from this same resource
session = boto3.Session(
    region_name='eu-west-3',  # Datacenters located in Paris, FR
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY
)
s3 = session.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)

Last time, we worked with one file at a time. That was OK for quick analysis, but for processing, it would be nice to handle all files at once.  
We just have to make sure their schemas are the same, or we might run into surprises.  
Since all the API calls were made using the same version and settings of the API, we can expect things to be right, but **always keep an eye open**.
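
As an illustration of what "keeping an eye open" could mean, here is a minimal sketch in plain Python that compares the top-level keys of a few JSON documents. The payloads and file names are made up, and a real check would compare the full nested schema rather than only the top-level keys.

```python
import json

# Hypothetical payloads standing in for the content of three files
raw_documents = {
    "file_a": '{"kind": "list", "items": [{"id": 1}]}',
    "file_b": '{"kind": "list", "items": []}',
    "file_c": '{"kind": "list", "results": []}',
}


def top_level_keys(raw):
    """Return the set of top-level keys of a JSON document."""
    return frozenset(json.loads(raw).keys())


keys_by_file = {name: top_level_keys(raw) for name, raw in raw_documents.items()}

# Use the first file as the reference and list every file that differs from it
reference = keys_by_file["file_a"]
mismatched = sorted(name for name, keys in keys_by_file.items() if keys != reference)
print(mismatched)
```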

In [None]:
# TODO: print out the list of files inside {BUCKET}/{PREFIX}
### BEGIN STRIP ###
for bucket_object in bucket.objects.filter(Prefix=PREFIX):
    print(bucket_object)
### END STRIP ###

In [None]:
# A utility function that builds the full S3 path for a given key
def get_s3_path(key, bucket_name=BUCKET_NAME, scheme=SCHEME):
    return f"{scheme}://{bucket_name}/{key}"

As you can see in the [documentation](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.load) for the `.load(...)` method, its `path` argument accepts either a string or a list of strings.

We will pass it the list of all the S3 paths we want, using the `get_s3_path` function on the keys that start with PREFIX and end with `.json.gz`.

In [None]:
# TODO: create a list of s3 filepaths where keys have the prefix PREFIX and end with .json.gz
### BEGIN STRIP ###
jsons = [
    get_s3_path(bucket_object.key)
    for bucket_object in bucket.objects.filter(Prefix=PREFIX)
    if bucket_object.key.endswith('.json.gz')
]
jsons
### END STRIP ###

In [None]:
# TODO: load all the files into a DataFrame: `df`
### BEGIN STRIP ###
df = spark.read.format('json').load(jsons)
### END STRIP ###

In [None]:
# TODO: count the number of rows in your DataFrame, it should be 12627
### BEGIN STRIP ###
df.count()
### END STRIP ###

Looking great! But there's more: what the documentation doesn't tell you is that you can use wildcards in your paths:
- `*`: matches any string
- `?`: matches a single character

In our case, all our file keys are `youtube`, followed by a number, followed by `.json.gz`, like this: `youtube13.json.gz`.  
This means `youtube*.json.gz` will catch all of them.

This pattern is called globbing; you can learn about it on [Wikipedia](https://en.wikipedia.org/wiki/Glob_(programming)), and this is why we will name our variable `filekey_glob`.
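
To get a feel for how these two wildcards behave, here is a minimal sketch using Python's standard `fnmatch` module on a few made-up keys. Spark resolves path globs through Hadoop's own implementation, so treat this as an illustration of the pattern syntax rather than of Spark's exact behaviour.

```python
from fnmatch import fnmatchcase

# Hypothetical keys, made up for illustration only
keys = [
    "youtube1.json.gz",
    "youtube13.json.gz",
    "youtube13.json",
    "vimeo2.json.gz",
]

# `*` stands for any run of characters
star_matches = [k for k in keys if fnmatchcase(k, "youtube*.json.gz")]

# `?` stands for exactly one character
question_matches = [k for k in keys if fnmatchcase(k, "youtube?.json.gz")]

print(star_matches)
print(question_matches)
```

With `?`, `youtube13.json.gz` is not matched because two characters sit between `youtube` and `.json.gz`.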

In [None]:
# TODO: follow previous instructions: `filekey_glob`
### BEGIN STRIP ###
# Assumption: the files sit directly under PREFIX
filekey_glob = f"{PREFIX}/youtube*.json.gz"
### END STRIP ###

In [None]:
# TODO: Using `filekey_glob`, load all the files into a PySpark DataFrame: `df`
### BEGIN STRIP ###
df = spark.read.format('json').load(get_s3_path(filekey_glob))
### END STRIP ###

We'll just check that we get the same number of rows with this method.

In [None]:
# TODO: count the number of rows in the DataFrame
### BEGIN STRIP ###
df.count()
### END STRIP ###

Once again, we should have 12627 rows. **If that's not the case, you have an issue, go back and fix it.**

## Tidying up

---

We have multiple issues with our data. **It does not look like "tidy data" at all.**  
First, we have rows within rows...  
And second, most of the data resides in a deeply nested structure within the `items` column...

We will fix the former, then handle the latter in the next notebook.

### 1. Fixing the rows
Ever heard about `EXPLODE` in SQL?

In Spark SQL, `explode` is a generator function that takes an array (or map) column and produces one output row per element.

Luckily for us, there's an equivalent in PySpark: `.explode(...)`; here's the link to the [documentation](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.explode).  
As the documentation puts it, `.explode(...)` "Returns a new row for each element in the given array or map."

If you remember correctly, that's exactly the kind of structure we have in the schema of our DataFrame for the `items` column.
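
To make the behaviour concrete, here is a minimal sketch in plain Python of what exploding a list column amounts to. The rows and the `explode_items` helper are hypothetical and only mimic the idea; this is not PySpark code.

```python
# Hypothetical rows: each one carries an `items` list, like one API response
rows = [
    {"page": 1, "items": ["song_a", "song_b"]},
    {"page": 2, "items": ["song_c"]},
    {"page": 3, "items": []},
]


def explode_items(rows):
    """Return one row per element of each row's `items` list."""
    exploded = []
    for row in rows:
        for item in row["items"]:
            # Copy the other fields and replace the list with a single element
            exploded.append({**row, "items": item})
    return exploded


exploded_rows = explode_items(rows)
print(len(exploded_rows))
```

The row with an empty list produces no output row, which mirrors how `explode` treats empty arrays in Spark.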

In [None]:
# TODO: print out the schema of `df`
### BEGIN STRIP ###
df.printSchema()
### END STRIP ###

In [None]:
# TODO: import the PySpark SQL functions following usual convention
### BEGIN STRIP ###
from pyspark.sql import functions as F
### END STRIP ###

In [None]:
# TODO: use `.explode(...)` on the `items` column and count the number of results
### BEGIN STRIP ###
# Count without overwriting `df`, so that it can be exploded again below
df.select(F.explode(df.items)).count()
### END STRIP ###

If you got 512624 rows, you've made it, congrats! :)  
We will use this as our new working DataFrame:
- do the same thing, but this time save the result into a variable named `items_df`
- don't forget to give a proper alias to your newly computed column: `items`
- at the end, as a sanity check, make sure we have the right number of columns in our new DataFrame

In [None]:
# TODO: follow previous instructions
### BEGIN STRIP ###
items_df = df.select(F.explode(df.items).alias('items'))
len(items_df.columns)
### END STRIP ###

We're making progress: we now have one row per result (e.g. song)!

But each song is a deeply nested structure... We will take care of this in the following notebook.

### BEGIN STRIP ###
\# TODO: We could tell students to save their work as a Parquet file inside a key/folder `interim`
We will use Parquet storage in S3 in a "folder" called `interim`

**Question:** Do we want to do this? It will incur costs and overhead.

### END STRIP ###

In [None]:
# TODO: save the DataFrame as a parquet
### BEGIN STRIP ###
items_df.write.mode("overwrite").parquet("/tmp/interim.parquet")
### END STRIP ###

## Wrap-up
You learned:
- how to glob files using wildcards
- how to use `.explode(...)` to split array values into their own rows
- how to save your intermediate data as a Parquet file