# Tidying up our data - Part 2
## Flattening a nested schema

As usual, we'll start by loading the data.  
Because we expect you to know how to set things up to load data from S3, we're doing it for you this time.  
Make sure you go through our code and check that you were actually using our best practices.

In [None]:
items_exploded_path = "s3://full-stack-bigdata-datasets/Big_Data/YOUTUBE/items_exploded.json"

df = spark.read.json(items_exploded_path)

1. As a sanity check, count the rows in the DataFrame

2. Print out the schema of the DataFrame

### Working with the schema

We're ready to get started :)

Our schema is like a tree: we want to collect all its leaves and lay them out neatly as columns of our DataFrame.  
That's called **flattening a schema**.
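
To make the idea concrete before touching Spark, here is a minimal sketch of flattening on a plain Python dict. The function `flatten_record` and the sample record are made up for illustration; this is not the approach we will use on the DataFrame, just a picture of the end result we are after.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn a nested dict into a single-level dict with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        # Build the full path of the current key, e.g. "items.snippet"
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Not a leaf yet: dig one level deeper
            flat.update(flatten_record(value, prefix=path))
        else:
            # A leaf: keep it under its full dotted path
            flat[path] = value
    return flat


# Hypothetical record, loosely shaped like one row of our data
nested = {"items": {"id": "abc123", "snippet": {"title": "My video", "categoryId": "10"}}}
flat = flatten_record(nested)
print(flat)
```

Every leaf of the tree ends up as one top-level key, which is exactly what we want for the columns of our DataFrame.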

Let's give it a try with the `title` element: start from the `items` column, go into its nested field `snippet`, and finally into the nested subfield `title`.

You can do this using the `.getField()` column method [documentation](https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.Column.getField.html).

3. Select the `title` subfield from the `snippet` subfield in the `items` column and show the first 5 elements

That's it, easy peasy 🙂.

We could just keep doing this for every single leaf of the schema and we'd be done.

I don't know about you, but I think this is incredibly **boring**. Also, what if tomorrow YouTube adds a new leaf to its API results?

Come on, we're programmers, we're **supposed to automate stuff, aren't we?**

What we need is a way to build that list of leaves... 
Not gonna lie, it's not trivial: it's called a tree traversal, and it is beyond the scope of this course.

Which means we will do this part for you. In the following cell, we've included a function called `walkSchema`. It walks the nested schema of our DataFrame and harvests its leaves, returning each one as a string containing its full path, like this: `items.snippet.title`.

Well, "returning", not exactly. But we will see about that later.

**[TODO]**
Take a look at the function. You're not expected to understand everything it does; this is beyond the scope of this course.  
But when you're learning, it's always a good idea to be exposed to new things.

In [None]:
# Let's give you the intuition for the flattening function we will share with you now
# The idea is to automatically dig deeper and deeper into the schema in order to extract
# all the column names in the form "struct1.struct2.struct3.field1", let's go!

# we'll work with the schema in json format, which will be way easier to manipulate
df.schema.jsonValue()

# It's nothing more than a dictionary with keys

In [None]:
df.schema.jsonValue().keys()
# Only two keys at this stage, type and fields, let's explore the type key

In [None]:
df.schema.jsonValue()["type"]
# The value associated is struct

In [None]:
# let's explore the content of the other key
df.schema.jsonValue()["fields"]

In [None]:
# it's a list, what's the first element?
df.schema.jsonValue()["fields"][0]

In [None]:
# what keys does it have ?
df.schema.jsonValue()["fields"][0].keys()

In [None]:
# the key name contains the name of the field
df.schema.jsonValue()["fields"][0]["name"]

In [None]:
# when the value under the key type is itself a dictionary, the field has subfields inside it
df.schema.jsonValue()["fields"][0]["type"]
# and we are back to the same structure we had at the beginning and we can start digging again
# that's the spirit of the function below
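
If you don't have a Spark session at hand, the same exploration can be sketched on a hand-written dictionary. The `schema_json` below is a made-up, much smaller stand-in for `df.schema.jsonValue()`; only the overall shape (`type`, `fields`, `name`, `nullable`, `metadata`) is meant to mirror what Spark produces.

```python
# Hand-written stand-in for df.schema.jsonValue(); the fields are illustrative only
schema_json = {
    "type": "struct",
    "fields": [
        {
            "name": "items",
            "nullable": True,
            "metadata": {},
            # A nested struct: "type" holds a dictionary instead of a plain string
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "id", "type": "string", "nullable": True, "metadata": {}},
                    {"name": "kind", "type": "string", "nullable": True, "metadata": {}},
                ],
            },
        }
    ],
}

top_field = schema_json["fields"][0]
print(top_field["name"])

# Same structure as the top level, so we can dig again
inner = top_field["type"]
print(sorted(inner.keys()))

inner_names = [field["name"] for field in inner["fields"]]
print(inner_names)

# A leaf: "type" is a plain string, there is nothing left to dig into
leaf_type = inner["fields"][0]["type"]
print(leaf_type)
```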

In [None]:
from pyspark.sql.types import StructType, StructField
from typing import List, Dict, Generator, Union, Callable

# This is actually written like a Scala function, we'll walk you through it
def walkSchema(schema: Union[StructType, StructField]) -> Generator[str, None, None]:
    """Explores a PySpark schema.
    
    schema: StructType | StructField
    
    Yields
    ------
    The full dotted name of each leaf field in the schema, as strings
    """
    
    # we define a function _walk that produces a string generator from
    # a dictionary "schema_dct", and a string "prefix"
    def _walk(schema_dct: Dict[str, Union[str, list, dict]],
              prefix: str = "") -> Generator[str, None, None]:
        assert isinstance(prefix, str), "prefix should be a string" # check if prefix is a string
        
        # this function returns "name" if there's no prefix and "prefix.name" if prefix exists
        fullName: Callable[[str], str] = lambda name: ( 
            name if not prefix else f"{prefix}.{name}")
        
        # we get the name of the current field from the dictionary (empty if there is none)
        name = schema_dct.get('name', '')
        
        # if the type is struct then we search for the fields key
        # if fields is there we apply the function again and dig one level deeper in
        # the schema and set a prefix
        if schema_dct['type'] == 'struct':
            assert 'fields' in schema_dct, (
                "It's a StructType, we should have some fields")
            for field in schema_dct['fields']:
                yield from _walk(field, prefix=prefix)
        # if we have a dict type and we can't find fields then we
        # dig one level deeper and apply the _walk function again
        elif isinstance(schema_dct['type'], dict):
            assert 'fields' not in schema_dct, (
                "We're missing some keys here")
            yield from _walk(schema_dct['type'], prefix=fullName(name))
        # If we finally reached the end and found a name we yield the full name
        elif name:
            yield fullName(name)
    
    yield from _walk(schema.jsonValue())

# yield, as opposed to return, hands back one result and pauses the function instead of ending it:
# the function resumes right after the yield when the next value is requested.
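
To see this pause-and-resume behaviour without Spark, here is a tiny sketch in plain Python. The generator `count_up_to` and the `log` list are made up for the illustration; the only thing to take away is when the body of the function actually runs.

```python
from typing import Generator, List

log: List[str] = []


def count_up_to(n: int) -> Generator[int, None, None]:
    for i in range(1, n + 1):
        log.append(f"about to yield {i}")
        yield i  # hand back i, then pause here until the next value is requested
    log.append("done")


gen = count_up_to(2)      # nothing runs yet: calling the function only builds the generator
log_after_creation = list(log)

first = next(gen)         # runs the body up to the first yield
second = next(gen)        # resumes after the first yield, runs up to the second one
remaining = list(gen)     # resumes again, the loop ends and the generator is finished

print(log_after_creation, first, second, remaining)
print(log)
```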

We will give this function a try, and see how it behaves...  
You might have to look into the PySpark documentation to learn how to access the schema of a DataFrame.

5. Call `walkSchema(...)` on our DataFrame's schema, store the result in `col_names`, then print it out to the screen

You should see an output similar to `<generator object walkSchema at 0x7f9eb0e390c0>`.  
It's a Python generator, you can read more about it [here](https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/).

For now, all you need to know is that, just like a Python `list`, a generator is iterable, which means we can iterate over it with a `for` loop.

```python
for e in my_generator:
    # You can access each element of the generator here
```
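
One practical detail worth knowing: unlike a list, a generator can only be consumed once. The sketch below uses a made-up generator function, `leaves`, standing in for `walkSchema`, to show what happens on a second pass and how `list(...)` gives you something reusable.

```python
from typing import Generator


def leaves() -> Generator[str, None, None]:
    # Hypothetical stand-in for walkSchema(df.schema)
    yield "items.id"
    yield "items.snippet.title"


names = leaves()
first_pass = [name for name in names]
second_pass = [name for name in names]  # the generator is already exhausted

# Build a fresh generator and store its values in a list to reuse them freely
as_list = list(leaves())

print(first_pass)
print(second_pass)
print(as_list)
```

So if you need the column names more than once, either call the function again or keep the values in a list.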

We'll give it a try by printing out the values of our `col_names`.

6. Iterate over the walked schema

*NOTE: give the name `col_name` to the iterating variable*

Perfect, that's all the leaves of our schema.  
And we can just repeat the work we did with `items.snippet.title` for every column of this list.


There are a couple of ways to do this. You've got at least 2 options (using standard "non-functional" Python):
- build a list comprehension (or unpack the generator) and pass it to a `.select(...)` statement
- iterate over the generator, and use `.withColumn(...)`

_But our favorite uses a functional approach. It particularly makes sense because Spark is written in Scala, a language with strong support for functional programming.  
If you're interested in this approach, take a look at `reduce` from the `functools` package in Python.  
In this simple isolated case, it actually makes things look a bit harder than they should, but it would make it easier to neatly integrate this step in a global pipeline.  
**Beware, if you're not familiar with functional programming, this will probably feel non-trivial.**_
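
If `reduce` is new to you, here is a minimal sketch of the folding pattern on a plain Python dict, with no Spark involved. The helpers `get_path` and `with_field` are hypothetical stand-ins (for reading a nested field and for `.withColumn(...)` respectively), and the record is made up.

```python
from functools import reduce
from typing import Any, Dict, List


def get_path(record: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'items.snippet.title' into nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), record)


def with_field(record: Dict[str, Any], path: str) -> Dict[str, Any]:
    """Return a copy of the record with one extra flat key (stand-in for withColumn)."""
    return {**record, path: get_path(record, path)}


record = {"items": {"id": "abc123", "snippet": {"title": "My video"}}}
paths: List[str] = ["items.id", "items.snippet.title"]

# Functional version: thread the record through with_field, one path at a time
flat_functional = reduce(with_field, paths, record)

# Equivalent explicit loop
flat_loop = record
for path in paths:
    flat_loop = with_field(flat_loop, path)

# Drop the original nested key, keeping only the flat ones
result = {key: value for key, value in flat_functional.items() if key != "items"}
print(result)
```

`reduce` starts from the initial value (`record`), applies the function to it and the first path, then feeds the result back in with the next path, and so on until the paths run out.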

In [None]:
from functools import reduce

from pyspark.sql import functions as F

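# Non-functional way: build a list comprehension and pass it to select.
# The alias keeps the full dotted path as the column name; without it, Spark names each
# column after the leaf only, and leaves such as `title` would appear more than once.
# exploded_df = df.select(*[F.col(c).alias(c) for c in walkSchema(df.schema)])

# The functional way, using functools' reduce: start from df and add one column per leaf
exploded_df = reduce(
    lambda memo_df, col_name: memo_df.withColumn(col_name, F.col(col_name)),
    walkSchema(df.schema), df
).drop('items')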

exploded_df.limit(5).toPandas()

Unnamed: 0,items.contentDetails.caption,items.contentDetails.contentRating.ytRating,items.contentDetails.definition,items.contentDetails.dimension,items.contentDetails.duration,items.contentDetails.licensedContent,items.contentDetails.projection,items.etag,items.id,items.kind,items.snippet.categoryId,items.snippet.channelId,items.snippet.channelTitle,items.snippet.defaultAudioLanguage,items.snippet.defaultLanguage,items.snippet.description,items.snippet.liveBroadcastContent,items.snippet.localized.description,items.snippet.localized.title,items.snippet.publishedAt,items.snippet.thumbnails.default.height,items.snippet.thumbnails.default.url,items.snippet.thumbnails.default.width,items.snippet.thumbnails.high.height,items.snippet.thumbnails.high.url,items.snippet.thumbnails.high.width,items.snippet.thumbnails.maxres.height,items.snippet.thumbnails.maxres.url,items.snippet.thumbnails.maxres.width,items.snippet.thumbnails.medium.height,items.snippet.thumbnails.medium.url,items.snippet.thumbnails.medium.width,items.snippet.thumbnails.standard.height,items.snippet.thumbnails.standard.url,items.snippet.thumbnails.standard.width,items.snippet.title,items.statistics.commentCount,items.statistics.dislikeCount,items.statistics.favoriteCount,items.statistics.likeCount,items.statistics.viewCount,items.status.embeddable,items.status.license,items.status.madeForKids,items.status.privacyStatus,items.status.publicStatsViewable,items.status.uploadStatus
0,False,,sd,2d,PT3M33S,True,rectangular,SqP7uUVSol30dxvuScN6JUny6T4,t1l8Z6gLPzo,youtube#video,10,UCUERSOitwgUq_37kGslN96w,VOLO,,,"Enregistré et mixé par Cyrille PELTIER au ""Kee...",none,"Enregistré et mixé par Cyrille PELTIER au ""Kee...","VOLO. ""L'air d'un con""",2013-07-22T12:09:11Z,90,https://i.ytimg.com/vi/t1l8Z6gLPzo/default.jpg,120,360,https://i.ytimg.com/vi/t1l8Z6gLPzo/hqdefault.jpg,480,,,,180,https://i.ytimg.com/vi/t1l8Z6gLPzo/mqdefault.jpg,320,480.0,https://i.ytimg.com/vi/t1l8Z6gLPzo/sddefault.jpg,640.0,"VOLO. ""L'air d'un con""",38,26,0,1028,223172,True,youtube,False,public,True,processed
1,False,,hd,2d,PT7M46S,False,rectangular,m3DnhzTEw9ABiqzBvdasfk5Av_8,we5gzZq5Avg,youtube#video,10,UCson549gpvRhPnJ3Whs5onA,LongWayToDream,,,Air Conditionné EP,none,Air Conditionné EP,Julian Jeweil - Air Conditionné,2012-03-17T08:34:30Z,90,https://i.ytimg.com/vi/we5gzZq5Avg/default.jpg,120,360,https://i.ytimg.com/vi/we5gzZq5Avg/hqdefault.jpg,480,720.0,https://i.ytimg.com/vi/we5gzZq5Avg/maxresdefau...,1280.0,180,https://i.ytimg.com/vi/we5gzZq5Avg/mqdefault.jpg,320,480.0,https://i.ytimg.com/vi/we5gzZq5Avg/sddefault.jpg,640.0,Julian Jeweil - Air Conditionné,2,3,0,124,13409,True,youtube,False,public,True,processed
2,False,,sd,2d,PT3M7S,False,rectangular,zyzs7STAR3NG-_pZe-0nGkbKoqg,49esza4eiK4,youtube#video,10,UCcHYZ8Ez4gG_2bHEuBL8IfQ,Downtown Records,,,myspace.com/etjusticepourtous\r\n(Downtown / E...,none,myspace.com/etjusticepourtous\r\n(Downtown / E...,Justice - D.A.N.C.E,2007-09-08T02:02:07Z,90,https://i.ytimg.com/vi/49esza4eiK4/default.jpg,120,360,https://i.ytimg.com/vi/49esza4eiK4/hqdefault.jpg,480,,,,180,https://i.ytimg.com/vi/49esza4eiK4/mqdefault.jpg,320,,,,Justice - D.A.N.C.E,3168,780,0,25540,10106655,True,youtube,False,public,True,processed
3,False,,hd,2d,PT3M43S,False,rectangular,hX2C15F6fdO5A-stUFMU5Az2PvI,BoO6LfR7ca0,youtube#video,22,UCQ0wLCF7u23gZKJkHFs1Tpg,Music Is Our Drug,,,♫ Music Is Our Drug - Spotify Playlist: https:...,none,♫ Music Is Our Drug - Spotify Playlist: https:...,Gramatik - Torture (feat. Eric Krasno),2014-01-24T12:52:38Z,90,https://i.ytimg.com/vi/BoO6LfR7ca0/default.jpg,120,360,https://i.ytimg.com/vi/BoO6LfR7ca0/hqdefault.jpg,480,720.0,https://i.ytimg.com/vi/BoO6LfR7ca0/maxresdefau...,1280.0,180,https://i.ytimg.com/vi/BoO6LfR7ca0/mqdefault.jpg,320,480.0,https://i.ytimg.com/vi/BoO6LfR7ca0/sddefault.jpg,640.0,Gramatik - Torture (feat. Eric Krasno),6,0,0,255,29153,True,youtube,False,public,True,processed
4,False,,hd,2d,PT5M,False,rectangular,rYHoV38PLpMbRuX_zhGTVBKNotw,DaH4W1rY9us,youtube#video,10,UCJsTMPZxYD-Q3kEmL4Qijpg,Harvey Pearson,,,Buy The Burgh Island EP now:\nhttps://itunes.a...,none,Buy The Burgh Island EP now:\nhttps://itunes.a...,Ben Howard - Oats In The Water,2012-12-02T12:41:13Z,90,https://i.ytimg.com/vi/DaH4W1rY9us/default.jpg,120,360,https://i.ytimg.com/vi/DaH4W1rY9us/hqdefault.jpg,480,720.0,https://i.ytimg.com/vi/DaH4W1rY9us/maxresdefau...,1280.0,180,https://i.ytimg.com/vi/DaH4W1rY9us/mqdefault.jpg,320,480.0,https://i.ytimg.com/vi/DaH4W1rY9us/sddefault.jpg,640.0,Ben Howard - Oats In The Water,5303,1784,0,136033,16488714,True,youtube,False,public,True,processed


It's amazing what we can do with a couple of lines of well-written code, isn't it?

Now that we're here, it would be a good time to start analyzing the data we got. We will do this in the next assignment.

7. Save the output to S3 as a parquet file