# Tidying up our data - Part 1

## Learning objectives

- Manipulate deeply nested JSON data and transform it into structured data ready to be loaded into a data warehouse.

## Loading our data from S3

In [None]:
filepath = "s3://full-stack-bigdata-datasets/Big_Data/YOUTUBE/songs.json"

In [None]:
df = spark.read.format('json').load(filepath, multiline=True)

## Tidying up

---

Our data has several problems. **It does not look like "tidy data" at all.**  
First, we have rows within rows.  
Second, most of the data sits in a deeply nested structure inside the `items` column.

We will fix the former, then handle the latter in the next notebook.

### 1. Fixing the rows
Remember the `.explode` function? ([documentation](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.explode))  
According to the documentation, `.explode(...)` "Returns a new row for each element in the given array or map." We will use it a lot here!

If you remember correctly, that is exactly the kind of structure the schema of our DataFrame shows for the `items` column.
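
To make the quoted behaviour concrete, here is a plain-Python sketch of the row-per-element idea. It does not use Spark at all: the `response` record and the `explode_rows` helper are made up for illustration, and the real `songs.json` is far more deeply nested.

```python
# Toy record (hypothetical data): one "row" whose `items` field is an array.
response = {
    "kind": "toy#response",
    "items": [
        {"id": "a1", "title": "First song"},
        {"id": "b2", "title": "Second song"},
        {"id": "c3", "title": "Third song"},
    ],
}


def explode_rows(rows, column):
    """Yield one output row per element of the array stored in `column`.

    Rows whose array is missing or empty yield nothing, which mirrors how
    Spark's `explode` skips null and empty arrays.
    """
    for row in rows:
        for element in row.get(column) or []:
            yield {column: element}


exploded = list(explode_rows([response], "items"))
for row in exploded:
    print(row)
```

One input row holding three items becomes three output rows, each with a single item in the `items` column.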

1. Print out the schema of `df`

2. Import the PySpark SQL functions following usual convention

3. Use `.explode(...)` on the `items` column and count the number of results

If you got 3907 rows, you've made it, congrats! :)  
We will use this as our new working DataFrame:
- do the same thing again, but this time save the exploded dataset into a variable named `items_df`
- don't forget to give a proper alias to your newly computed column: `items`
- finally, as a sanity check, make sure the new DataFrame has the right number of columns

4. Follow the instructions above to create `items_df`

We're making progress: we now have one row per result (i.e. one row per song)!

But each song is a deeply nested structure... We will take care of this in the following notebook.

5. Show the first 5 rows of the exploded dataset

| | items |
|---|---|
| 0 | ((false, (None,), sd, 2d, PT3M33S, True, recta... |
| 1 | ((false, (None,), hd, 2d, PT7M46S, False, rect... |
| 2 | ((false, (None,), sd, 2d, PT3M7S, False, recta... |
| 3 | ((false, (None,), hd, 2d, PT3M43S, False, rect... |
| 4 | ((false, (None,), hd, 2d, PT5M, False, rectang... |


## Wrap-up

You learned how to use `.explode(...)` to split array values into their own rows! 🎉