# Tidying up our data - Part 2
## Flattening a nested schema

As usual, we'll start by loading the data.  
Because we expect you to know how to set things up so that you can load data from S3, we're doing it for you this time.  
Make sure you go through our code and check that you were actually following our best practices.

In [None]:
import boto3

In [None]:
S3_RESOURCE = 's3'
SCHEME = 's3a'
ACCESS_KEY_ID = "ACCESS_KEY_ID" # access key of the student account
SECRET_ACCESS_KEY = "SECRET_ACCESS_KEY" # secret key of the student account
BUCKET_NAME = "BUCKET_NAME"
PREFIX = "Big_Data/YOUTUBE"
# TODO: set BUCKET_NAME and PREFIX
### BEGIN STRIP ###

### END STRIP ###
OUTPUT_PATH = PREFIX.rstrip('/') + '/interim/'  # make sure a '/' separates the prefix from 'interim/'

In [None]:
import boto3

# We create a boto3 session with the student account credentials
session = boto3.Session(
    region_name='eu-west-3',  # Datacenters located in Paris, FR
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY
)

In [None]:
# We create an S3 resource and a Bucket from this same resource
s3 = session.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)

In [None]:
# Just a utility function: builds the full S3 URI for a given object key
def get_s3_path(key, bucket_name=BUCKET_NAME, scheme=SCHEME):
    return f"{scheme}://{bucket_name}/{key}"

In [None]:
# TODO: create the S3 path
### BEGIN STRIP ###

### END STRIP ###

In [None]:
# TODO: load the parquet file into a PySpark DataFrame: `df`
# NOTE: as a reminder, parquet is the default file format for loading with PySpark
#
# TODO: as a sanity check, count the rows in the DataFrame
### BEGIN STRIP ###

df = spark.read.parquet("/tmp/interim.parquet")
df.count()
### END STRIP ###

In [None]:
# TODO: print out the schema of the DataFrame
### BEGIN STRIP ###
df.printSchema()
### END STRIP ###

### Working with the schema

We're ready to get started :)

Our schema is like a tree: we want to collect all its leaves and lay them out neatly as columns of our DataFrame.  
That's called **flattening a schema**, and it would certainly tidy things up.

Let's give it a try with the `title` element. It lives inside the `items` column, then in the nested field `snippet`, and finally in the nested subfield `title`.

You can do this using the `.getField()` column method; here is the [documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.getField).

In [None]:
# TODO: select the `title` subfield from the `snippet` subfield in the `items` column
#       show the first 5 elements
### BEGIN STRIP ###
df.select(df.items.snippet.title).show(5)

### END STRIP ###

That's it, easy peasy 🙂.

We could just keep doing this for every single leaf of the schema and be done with it.

I don't know about you, but I think this is incredibly **boring**. Also, what if tomorrow YouTube adds a new leaf to its API results?

Come on, we're programmers, we're **supposed to automate stuff, aren't we?**

What we need is a way to build that list of leaves...  
Not gonna lie, it's not trivial: it's called a tree traversal, and it is beyond the scope of this course.

Which means we will do this part for you. In the following cell, we've included a function called `walkSchema`. It walks the nested schema of our DataFrame and harvests its leaves, returning each one as a string holding its full path, like `items.snippet.title`.

Well, "returning", not exactly. But we will see about that later.

**[TODO]**
Take a look at the function. You're not expected to understand everything it does; this is beyond the scope of this course.  
But when you're learning, it's always a good idea to be exposed to new things.

In [None]:
from pyspark.sql.types import StructType, StructField
from typing import List, Dict, Generator, Union, Callable

def walkSchema(schema: Union[StructType, StructField]) -> Generator[str, None, None]:
    """Traverse a PySpark schema:
    
    schema: StructType | StructField
    
    Yield
    -----
    A generator of strings, the name of each field in the schema
    """
    
    def _walk(schema_dct: Dict[str, Union[str, list, dict]],
              prefix: str = "") -> Generator[str, None, None]:
        assert isinstance(prefix, str), "prefix should be a string"
        
        # Prepend the parent path (if any) to a field name
        fullName: Callable[[str], str] = lambda name: (
            name if not prefix else f"{prefix}.{name}")
        
        name = schema_dct.get('name', '')
        if schema_dct['type'] == 'struct':
            assert 'fields' in schema_dct, (
                "It's a StructType, we should have some fields")
            for field in schema_dct['fields']:
                yield from _walk(field, prefix=prefix)
        elif isinstance(schema_dct['type'], dict):
            assert 'fields' not in schema_dct, (
                "We're missing some keys here")
            yield from _walk(schema_dct['type'], prefix=fullName(name))
        elif name:
            yield fullName(name)
    
    yield from _walk(schema.jsonValue())
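
To make the idea of a tree traversal more concrete, here is a minimal pure-Python sketch that does not need Spark. The `schema_dict` below is a hand-written stand-in for what `df.schema.jsonValue()` returns on a tiny schema, and `walk_leaves` is a hypothetical, simplified helper, not the `walkSchema` function above. One assumption to keep in mind: this sketch treats every field that is not a struct as a leaf.

```python
from typing import Any, Dict, Iterator

# Hand-written stand-in for `df.schema.jsonValue()` on a tiny, made-up schema
schema_dict: Dict[str, Any] = {
    "type": "struct",
    "fields": [
        {"name": "etag", "type": "string", "nullable": True, "metadata": {}},
        {
            "name": "items",
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "id", "type": "string", "nullable": True, "metadata": {}},
                    {
                        "name": "snippet",
                        "type": {
                            "type": "struct",
                            "fields": [
                                {"name": "title", "type": "string",
                                 "nullable": True, "metadata": {}},
                            ],
                        },
                        "nullable": True,
                        "metadata": {},
                    },
                ],
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}


def walk_leaves(node: Dict[str, Any], prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every leaf field, depth first."""
    for field in node["fields"]:
        path = f"{prefix}.{field['name']}" if prefix else field["name"]
        field_type = field["type"]
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            # Nested struct: recurse, carrying the path so far as the prefix
            yield from walk_leaves(field_type, prefix=path)
        else:
            # Anything else is treated as a leaf in this sketch
            yield path


leaf_paths = list(walk_leaves(schema_dict))
print(leaf_paths)  # ['etag', 'items.id', 'items.snippet.title']
```

The recursion mirrors the shape of the tree: each time we step into a nested struct, the current path becomes the prefix for everything below it.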

We will give this function a try and see how it behaves...  
You might have to look into the PySpark documentation to learn how to access the schema of a DataFrame.

In [None]:
# TODO: call `walkSchema(...)` on our dataframe schema: `col_names`
#       then print it out to the screen
### BEGIN STRIP ###
col_names = walkSchema(df.schema)
col_names
### END STRIP ###

You should see an output similar to `<generator object walkSchema at 0x7f9eb0e390c0>`.  
It's a Python generator; you can read more about it [here](https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/).

For now, you just have to know that, just like a Python `list`, a generator is iterable, which means we can iterate over it with a `for` loop.

```
for e in my_generator:
    # You can access each element of the generator here
    print(e)
```
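
One difference from a list is worth knowing: a generator can only be consumed once. The sketch below illustrates this with a hypothetical `count_up` generator (a made-up name, unrelated to the schema code).

```python
from typing import Iterator


def count_up(limit: int) -> Iterator[int]:
    """Hypothetical generator: yields 0, 1, ..., limit - 1, one value at a time."""
    for value in range(limit):
        yield value


numbers = count_up(3)
first_pass = [n for n in numbers]   # consumes the generator
second_pass = [n for n in numbers]  # nothing left to yield

print(first_pass)   # [0, 1, 2]
print(second_pass)  # []
```

So once you have looped over a generator, you need to build a fresh one (for instance by calling the generator function again) if you want to iterate a second time.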

We'll give it a try by printing out the values of `col_names`.

In [None]:
# TODO: iterate over the walked schema
# NOTE: give the name `col_name` to the iterating variable
### BEGIN STRIP ###
for col_name in col_names:
  print(col_name)
### END STRIP ###

Perfect, those are all the leaves of our schema.  
And we can just repeat the work we did with `items.snippet.title` for every column in this list.


There are a couple of ways to do this. You've got at least 2 options (using standard "non-functional" Python):
- build a list comprehension (or unpack the generator) and pass it to a `.select(...)` statement
- iterate over the generator, and use `.withColumn(...)`

_But our favorite uses a functional approach. It particularly makes sense because Spark is written in Scala, a functional language.  
If you're interested in this approach, take a look at `reduce` from the `functools` module in Python.  
In this simple, isolated case, it actually makes things look a bit harder than they should, but it makes it easier to integrate this step neatly into a global pipeline.  
**Beware: if you're not familiar with functional programming, this will probably feel non-trivial.**_
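
If `reduce` is new to you, here is a minimal sketch of how it folds a sequence into a single value, starting from an initial value. A plain dictionary stands in for a DataFrame, and `add_column` is a hypothetical helper playing the role of `.withColumn(...)`; both are assumptions made for illustration only.

```python
import functools
from typing import Dict, List, Optional

Table = Dict[str, Optional[str]]


def add_column(table: Table, col_name: str) -> Table:
    """Return a new table with one more column; the input is left untouched."""
    return {**table, col_name: None}


start: Table = {"items": None}
col_names: List[str] = ["items.id", "items.snippet.title"]

# reduce(f, [a, b], start) computes f(f(start, a), b)
flattened = functools.reduce(add_column, col_names, start)
by_hand = add_column(add_column(start, "items.id"), "items.snippet.title")

print(list(flattened))       # ['items', 'items.id', 'items.snippet.title']
print(flattened == by_hand)  # True
```

The accumulator (here the table) is threaded through every call, which is the same pattern as passing a DataFrame from one `.withColumn(...)` call to the next.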

In [None]:
# TODO: explode the nested schema into flat columns: `df_eclate`
### BEGIN STRIP ###
import functools
from pyspark.sql import functions as F

# Add one top-level column per leaf path, then drop the nested `items` struct
df_eclate = functools.reduce(
    lambda temp_df, col_name: temp_df.withColumn(col_name, F.col(col_name)),
    walkSchema(df.schema),
    df,
).drop("items")
df_eclate.toPandas()
### END STRIP ###

Unnamed: 0,etag,kind,pageInfo,items.contentDetails.caption,items.contentDetails.contentRating.ytRating,items.contentDetails.definition,items.contentDetails.dimension,items.contentDetails.duration,items.contentDetails.licensedContent,items.contentDetails.projection,items.etag,items.id,items.kind,items.snippet.categoryId,items.snippet.channelId,items.snippet.channelTitle,items.snippet.defaultAudioLanguage,items.snippet.defaultLanguage,items.snippet.description,items.snippet.liveBroadcastContent,items.snippet.localized.description,items.snippet.localized.title,items.snippet.publishedAt,items.snippet.thumbnails.default.height,items.snippet.thumbnails.default.url,items.snippet.thumbnails.default.width,items.snippet.thumbnails.high.height,items.snippet.thumbnails.high.url,items.snippet.thumbnails.high.width,items.snippet.thumbnails.maxres.height,items.snippet.thumbnails.maxres.url,items.snippet.thumbnails.maxres.width,items.snippet.thumbnails.medium.height,items.snippet.thumbnails.medium.url,items.snippet.thumbnails.medium.width,items.snippet.thumbnails.standard.height,items.snippet.thumbnails.standard.url,items.snippet.thumbnails.standard.width,items.snippet.title,items.statistics.commentCount,items.statistics.dislikeCount,items.statistics.favoriteCount,items.statistics.likeCount,items.statistics.viewCount,items.status.embeddable,items.status.license,items.status.madeForKids,items.status.privacyStatus,items.status.publicStatsViewable,items.status.uploadStatus,pageInfo.resultsPerPage,pageInfo.totalResults
0,U0fncx_GV9jD5SKQr15LMvwuPcs,youtube#videoListResponse,"{'resultsPerPage': 38, 'totalResults': 38}",false,,sd,2d,PT3M33S,True,rectangular,SqP7uUVSol30dxvuScN6JUny6T4,t1l8Z6gLPzo,youtube#video,10,UCUERSOitwgUq_37kGslN96w,VOLO,,,"Enregistré et mixé par Cyrille PELTIER au ""Kee...",none,"Enregistré et mixé par Cyrille PELTIER au ""Kee...","VOLO. ""L'air d'un con""",2013-07-22T12:09:11Z,90,https://i.ytimg.com/vi/t1l8Z6gLPzo/default.jpg,120,360,https://i.ytimg.com/vi/t1l8Z6gLPzo/hqdefault.jpg,480,,,,180,https://i.ytimg.com/vi/t1l8Z6gLPzo/mqdefault.jpg,320,480.0,https://i.ytimg.com/vi/t1l8Z6gLPzo/sddefault.jpg,640.0,"VOLO. ""L'air d'un con""",38,26,0,1028,223172,True,youtube,False,public,True,processed,38,38
1,U0fncx_GV9jD5SKQr15LMvwuPcs,youtube#videoListResponse,"{'resultsPerPage': 38, 'totalResults': 38}",false,,hd,2d,PT7M46S,False,rectangular,m3DnhzTEw9ABiqzBvdasfk5Av_8,we5gzZq5Avg,youtube#video,10,UCson549gpvRhPnJ3Whs5onA,LongWayToDream,,,Air Conditionné EP,none,Air Conditionné EP,Julian Jeweil - Air Conditionné,2012-03-17T08:34:30Z,90,https://i.ytimg.com/vi/we5gzZq5Avg/default.jpg,120,360,https://i.ytimg.com/vi/we5gzZq5Avg/hqdefault.jpg,480,720.0,https://i.ytimg.com/vi/we5gzZq5Avg/maxresdefau...,1280.0,180,https://i.ytimg.com/vi/we5gzZq5Avg/mqdefault.jpg,320,480.0,https://i.ytimg.com/vi/we5gzZq5Avg/sddefault.jpg,640.0,Julian Jeweil - Air Conditionné,2,3,0,124,13409,True,youtube,False,public,True,processed,38,38
2,U0fncx_GV9jD5SKQr15LMvwuPcs,youtube#videoListResponse,"{'resultsPerPage': 38, 'totalResults': 38}",false,,sd,2d,PT3M7S,False,rectangular,zyzs7STAR3NG-_pZe-0nGkbKoqg,49esza4eiK4,youtube#video,10,UCcHYZ8Ez4gG_2bHEuBL8IfQ,Downtown Records,,,myspace.com/etjusticepourtous\r\n(Downtown / E...,none,myspace.com/etjusticepourtous\r\n(Downtown / E...,Justice - D.A.N.C.E,2007-09-08T02:02:07Z,90,https://i.ytimg.com/vi/49esza4eiK4/default.jpg,120,360,https://i.ytimg.com/vi/49esza4eiK4/hqdefault.jpg,480,,,,180,https://i.ytimg.com/vi/49esza4eiK4/mqdefault.jpg,320,,,,Justice - D.A.N.C.E,3168,780,0,25540,10106655,True,youtube,False,public,True,processed,38,38
3,U0fncx_GV9jD5SKQr15LMvwuPcs,youtube#videoListResponse,"{'resultsPerPage': 38, 'totalResults': 38}",false,,hd,2d,PT3M43S,False,rectangular,hX2C15F6fdO5A-stUFMU5Az2PvI,BoO6LfR7ca0,youtube#video,22,UCQ0wLCF7u23gZKJkHFs1Tpg,Music Is Our Drug,,,♫ Music Is Our Drug - Spotify Playlist: https:...,none,♫ Music Is Our Drug - Spotify Playlist: https:...,Gramatik - Torture (feat. Eric Krasno),2014-01-24T12:52:38Z,90,https://i.ytimg.com/vi/BoO6LfR7ca0/default.jpg,120,360,https://i.ytimg.com/vi/BoO6LfR7ca0/hqdefault.jpg,480,720.0,https://i.ytimg.com/vi/BoO6LfR7ca0/maxresdefau...,1280.0,180,https://i.ytimg.com/vi/BoO6LfR7ca0/mqdefault.jpg,320,480.0,https://i.ytimg.com/vi/BoO6LfR7ca0/sddefault.jpg,640.0,Gramatik - Torture (feat. Eric Krasno),6,0,0,255,29153,True,youtube,False,public,True,processed,38,38
4,U0fncx_GV9jD5SKQr15LMvwuPcs,youtube#videoListResponse,"{'resultsPerPage': 38, 'totalResults': 38}",false,,hd,2d,PT5M,False,rectangular,rYHoV38PLpMbRuX_zhGTVBKNotw,DaH4W1rY9us,youtube#video,10,UCJsTMPZxYD-Q3kEmL4Qijpg,Harvey Pearson,,,Buy The Burgh Island EP now:\nhttps://itunes.a...,none,Buy The Burgh Island EP now:\nhttps://itunes.a...,Ben Howard - Oats In The Water,2012-12-02T12:41:13Z,90,https://i.ytimg.com/vi/DaH4W1rY9us/default.jpg,120,360,https://i.ytimg.com/vi/DaH4W1rY9us/hqdefault.jpg,480,720.0,https://i.ytimg.com/vi/DaH4W1rY9us/maxresdefau...,1280.0,180,https://i.ytimg.com/vi/DaH4W1rY9us/mqdefault.jpg,320,480.0,https://i.ytimg.com/vi/DaH4W1rY9us/sddefault.jpg,640.0,Ben Howard - Oats In The Water,5303,1784,0,136033,16488714,True,youtube,False,public,True,processed,38,38
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3902,cG4wV6cPEJ6319IKaA81zm0Oj6I,youtube#videoListResponse,"{'resultsPerPage': 38, 'totalResults': 38}",false,,sd,2d,PT5M,False,rectangular,GMmGMlMprluq1aE_CFkCrgGilAk,KoAbMfg9_Uk,youtube#video,10,UCbiAo0krO0BLeHGi900ESew,L01z,,,Burial - Unite (dubstep),none,Burial - Unite (dubstep),Burial - Unite,2007-06-16T00:18:14Z,90,https://i.ytimg.com/vi/KoAbMfg9_Uk/default.jpg,120,360,https://i.ytimg.com/vi/KoAbMfg9_Uk/hqdefault.jpg,480,,,,180,https://i.ytimg.com/vi/KoAbMfg9_Uk/mqdefault.jpg,320,,,,Burial - Unite,,66,0,2933,854649,True,youtube,False,public,True,processed,38,38
3903,cG4wV6cPEJ6319IKaA81zm0Oj6I,youtube#videoListResponse,"{'resultsPerPage': 38, 'totalResults': 38}",false,,sd,2d,PT5M47S,False,rectangular,0j4O82Rruwr62QhsLmCH9cRfd28,1gj42R698Ok,youtube#video,10,UCO_Qhg7Via2U6odtb-U-dtQ,roootsman99,,,,none,,lee perry - ketch vampire,2009-02-05T15:46:13Z,90,https://i.ytimg.com/vi/1gj42R698Ok/default.jpg,120,360,https://i.ytimg.com/vi/1gj42R698Ok/hqdefault.jpg,480,,,,180,https://i.ytimg.com/vi/1gj42R698Ok/mqdefault.jpg,320,,,,lee perry - ketch vampire,29,5,0,479,77433,True,youtube,False,public,True,processed,38,38
3904,cG4wV6cPEJ6319IKaA81zm0Oj6I,youtube#videoListResponse,"{'resultsPerPage': 38, 'totalResults': 38}",false,,hd,2d,PT3M29S,True,rectangular,7sWLCFY3yeu8xtpPPiU23FHYQZs,cHfp54PxZ3c,youtube#video,10,UCLf-3768gEw4wA81xLP1c1g,Memphis Industries,en-US,,"Buy on 7"" white vinyl or download: http://po.s...",none,"Buy on 7"" white vinyl or download: http://po.s...",Elephant - Shapeshifter,2013-11-20T14:26:22Z,90,https://i.ytimg.com/vi/cHfp54PxZ3c/default.jpg,120,360,https://i.ytimg.com/vi/cHfp54PxZ3c/hqdefault.jpg,480,720.0,https://i.ytimg.com/vi/cHfp54PxZ3c/maxresdefau...,1280.0,180,https://i.ytimg.com/vi/cHfp54PxZ3c/mqdefault.jpg,320,480.0,https://i.ytimg.com/vi/cHfp54PxZ3c/sddefault.jpg,640.0,Elephant - Shapeshifter,74,29,0,1640,167130,True,youtube,False,public,True,processed,38,38
3905,cG4wV6cPEJ6319IKaA81zm0Oj6I,youtube#videoListResponse,"{'resultsPerPage': 38, 'totalResults': 38}",false,,hd,2d,PT7M45S,False,rectangular,oxN8btERWt9PKA9C1zZQ32eMWOc,7-3i7kBwcxQ,youtube#video,10,UCIVYAXPwmi0dbMxaKi8G2kw,czetaboy,,,video from the movie Faster.\n\nhttp://youtu.b...,none,video from the movie Faster.\n\nhttp://youtu.b...,Dub fx - Flow vs. Rock,2011-02-17T17:10:56Z,90,https://i.ytimg.com/vi/7-3i7kBwcxQ/default.jpg,120,360,https://i.ytimg.com/vi/7-3i7kBwcxQ/hqdefault.jpg,480,720.0,https://i.ytimg.com/vi/7-3i7kBwcxQ/maxresdefau...,1280.0,180,https://i.ytimg.com/vi/7-3i7kBwcxQ/mqdefault.jpg,320,480.0,https://i.ytimg.com/vi/7-3i7kBwcxQ/sddefault.jpg,640.0,Dub fx - Flow vs. Rock,128,78,0,2209,599996,True,youtube,False,public,True,processed,38,38


Isn't it amazing what we can do with a couple of lines of well-written code?

Now that we're here, it would be a good time to start analyzing the data we got. We will do this in the next assignment.

In [None]:
# TODO: Save the output to S3 as a parquet file
### BEGIN STRIP ###
df_eclate.write.mode("overwrite").parquet("/tmp/youtube_eclate.parquet")
### END STRIP ###