# Data exploration

In this exercise we will explore a dataset that was created using the YouTube API to gather information about videos on the platform. The response from the API is a JSON file (a sort of dictionary) containing the response to each call to the API.

In [None]:
filepath = "s3://full-stack-bigdata-datasets/Big_Data/YOUTUBE/songs.json"


In [None]:
df = spark.read.format('json').load(filepath)

1. Count the number of entries in `df`

2. Display your DataFrame

Some people are very fond of Databricks' `display(...)`.  
However, it's limited to Databricks, and it's a little slow...

The alternative, `.show(...)`, produces output that is very hard to read...

Actually, there is another way: limit the results to a reasonable number of rows (one that can be collected without hogging all your memory) and then convert your PySpark DataFrame into a pandas DataFrame.

Just **make sure you're not running `df.toPandas().head()`**. In theory the result would be the same, but you would end up computing your whole DataFrame and storing it in memory. If your data is big, that would be a **horrible idea**.
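To see why the order matters, here is a small analogy in plain Python. It is only a sketch: a generator stands in for a lazily evaluated dataset, and `read_rows` and the row contents are hypothetical names and values, not part of the exercise. It does not model Spark's actual execution engine, only the idea that converting first forces every row to be produced.

```python
from itertools import islice

rows_produced = 0  # counts how many rows the "source" had to generate


def read_rows(n_rows):
    """Hypothetical stand-in for a large, lazily evaluated dataset."""
    global rows_produced
    for i in range(n_rows):
        rows_produced += 1
        yield {"id": i}


# Convert everything first, then keep 5 rows: every row is generated.
rows_produced = 0
head_after_full_load = list(read_rows(100_000))[:5]
cost_full_load = rows_produced

# Limit first, then convert: only 5 rows are ever generated.
rows_produced = 0
head_after_limit = list(islice(read_rows(100_000), 5))
cost_limit_first = rows_produced

print(head_after_full_load == head_after_limit)  # both approaches return the same rows
print(cost_full_load, cost_limit_first)          # but at very different costs
```

Both approaches return the same five rows; the difference is how much work and memory it took to get them.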

3. Get the first 5 rows of `df` as a pandas DataFrame and print them out (using notebook formatting)

| | etag | items | kind | pageInfo |
|---|---|---|---|---|
| 0 | U0fncx_GV9jD5SKQr15LMvwuPcs | [((false, Row(ytRating=None), sd, 2d, PT3M33S,... | youtube#videoListResponse | (38, 38) |
| 1 | LZV6LlN3-4QwaIGfe9KBxl0cJvE | [((false, Row(ytRating=None), hd, 2d, PT3M26S,... | youtube#videoListResponse | (38, 38) |
| 2 | Ou4xXi-09RdImAeo1EFJC01i8iM | [((true, Row(ytRating=None), hd, 2d, PT2M21S, ... | youtube#videoListResponse | (43, 43) |
| 3 | tDsVpy7PmDE2n6ZAO0rHpUpbqz0 | [((false, Row(ytRating=None), sd, 2d, PT4M8S, ... | youtube#videoListResponse | (40, 40) |
| 4 | otOtu8WFJDFkzdBR_PG0LptIkK4 | [((false, Row(ytRating=None), hd, 2d, PT4M40S,... | youtube#videoListResponse | (37, 37) |


4. Print out the schema of `df`

What can you say about the schema? Is it what you expected?

How many columns would you say there are right now?
Looking at the schema, how many columns would you say there would be in this DataFrame after flattening it?

5. Print out the columns of `df`

That's 4 columns, although the schema seems much more complicated.

This is because we have **nested structures**.
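The gap between "4 columns" and a much larger schema can be illustrated in plain Python. The sketch below uses one hand-written record shaped like a row of `df`; the values are made up, the fields inside each item are assumptions for illustration, and `leaf_paths` is a hypothetical helper, not a PySpark function.

```python
# One hand-written record shaped like a row of `df`; all values are made up
# and the fields inside each item are assumptions for illustration.
sample_response = {
    "etag": "example-etag",
    "kind": "youtube#videoListResponse",
    "pageInfo": {"totalResults": 2, "resultsPerPage": 2},
    "items": [
        {"id": "video-1", "contentDetails": {"duration": "PT3M33S", "definition": "sd"}},
        {"id": "video-2", "contentDetails": {"duration": "PT3M26S", "definition": "hd"}},
    ],
}


def leaf_paths(value, prefix=""):
    """Return the dotted path of every leaf field, descending into dicts and lists."""
    if isinstance(value, dict):
        paths = set()
        for key, child in value.items():
            paths |= leaf_paths(child, f"{prefix}.{key}" if prefix else key)
        return paths
    if isinstance(value, list):
        paths = set()
        for child in value:
            paths |= leaf_paths(child, prefix + "[]")
        return paths
    return {prefix}


top_level_columns = sorted(sample_response)
flattened_columns = sorted(leaf_paths(sample_response))

print(len(top_level_columns), top_level_columns)
print(len(flattened_columns), flattened_columns)
```

Even in this tiny example, the number of leaf fields is larger than the number of top-level keys; the real dataset nests far more fields inside `items`.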

Let's investigate each column one by one.

### Column `etag`
6. Take the first 5 rows of the column `etag`

### Column `kind`
How many different values do we have for the column `kind`?

7. Count the number of different values in the column `kind`

Only one, so **we can `.collect(...)` this** without running out of memory.

8. Collect the distinct values in the column `kind`

Surprised by the result?
It is not too surprising, since each row in the dataset represents a response from the YouTube API.

### Column `pageInfo`
9. Show the first 5 rows of the `pageInfo` column

What's this? Let's count how many different values...

10. Count the number of distinct values in the column `pageInfo`

Only a few, so we can collect them all.

11. Collect all distinct values from the column `pageInfo`

This gives us more information: these values are the number of results we got for each specific API call.
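The habit used for `kind` and `pageInfo` (count the distinct values first, collect them only if there are few) can be sketched in plain Python. The `pageInfo` pairs below are the ones visible in the five rows displayed earlier; the field order inside each pair, the `responses` list, and the `MAX_SAFE_TO_COLLECT` threshold are assumptions made for this illustration.

```python
# (totalResults, resultsPerPage) pairs as seen in the first 5 displayed rows;
# the order of the two fields is an assumption (both values are equal anyway).
responses = [
    {"kind": "youtube#videoListResponse", "pageInfo": (38, 38)},
    {"kind": "youtube#videoListResponse", "pageInfo": (38, 38)},
    {"kind": "youtube#videoListResponse", "pageInfo": (43, 43)},
    {"kind": "youtube#videoListResponse", "pageInfo": (40, 40)},
    {"kind": "youtube#videoListResponse", "pageInfo": (37, 37)},
]

MAX_SAFE_TO_COLLECT = 100  # hypothetical threshold, pick one that fits your memory

distinct_kinds = {row["kind"] for row in responses}
distinct_page_info = {row["pageInfo"] for row in responses}

# Step 1: count first, which is cheap to look at.
print(len(distinct_kinds), len(distinct_page_info))

# Step 2: only bring the values back if there are few of them.
if len(distinct_page_info) <= MAX_SAFE_TO_COLLECT:
    collected_page_info = sorted(distinct_page_info)
    print(collected_page_info)
```

On a real Spark DataFrame the counting step runs on the cluster, which is what makes it a safe first look before collecting anything to the driver.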

### Column `items`
Last but not least, let's investigate the column `items`.

What do you expect this column to contain? Let's check if your intuition is right.

12. Show the first 5 rows of the column `items`

Not very useful... but `display` wouldn't help much either.  
You can give it a try in the next cell.

13. Display the column `items` from df

It's very difficult to see anything...

Exploratory Data Analysis is a lot about knowing when to zoom in or zoom out.

Right now, we're looking at the data from the surface and it's difficult to see anything. In the following steps, we will zoom in on a single element of the column `items` and try and see if we can understand its content better.

We'll start by taking a single item and will call it `sample_item`

14. Take 1 element of `items` and call it `sample_item`

What's the type of `sample_item`?

15. Print out the type of `sample_item`

That's a `list`. How many elements does this list contain?

16. Print out the number of elements in `sample_item`

Only one :)

What do we do then?

**We keep zooming in**: we will look inside the list.

We will take the first (and only) element of `sample_item`. Remember that `sample_item` is a `list`, and lists are [indexable](https://docs.python.org/2/reference/datamodel.html#emulating-container-types).

17. Take the first element of `sample_item` and call it `row_item`

18. Print out the type of `row_item`

No surprise we gave it the name `row_item`.  
Although, next time you do this by yourself, you might have to first zoom inside to understand what kind of variable we're dealing with so that you can then give it a proper name.

As we keep zooming in, it's gonna get harder and harder to find good names for the variables.  
If we were writing production-ready code, that would be a problem. But here, we're doing analysis, and our main objective is speed.  
That doesn't mean we should choose stupid names, since that would make our code harder to read, but we don't have to obsess over it.

This idea of finding good names for the variables you use, and for any object you have to name in your code, is important. It is part of a programming philosophy of writing "legacy code". Legacy code is code that is here to stay and will be used, reused, and updated many times by many different people, so making it easy to read and understand is essential. But if you think about it, any code is legacy code: when you come back to one of these exercises in the future, or to some old code you wrote, to refresh your memory about something, you'll be glad you took some extra time to write clean and understandable code ;)

19. Print out the length of `row_item`

You know the drill, only one item in this sequence. Let's dig it out.

20. Print out the first item of the `Row` object

*NOTE: DO NOT USE `print(...)`. The notebook will display the value by itself. Using `print(...)` would not break anything, but the output would be harder to use.*

What do you think?
Looks like it contains a list!

21. Print out the length of the first and only element of the `Row` object
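The zooming-in sequence above (list, then `Row`, then inner list) can be rehearsed without Spark. This is a minimal sketch: a `namedtuple` called `Row` stands in for the PySpark `Row` class, and the video entries are made up. At each level we check the type and the length before going one level deeper.

```python
from collections import namedtuple

# Hypothetical stand-in for a PySpark Row with a single field named `items`.
Row = namedtuple("Row", ["items"])

# Shaped like the result of taking 1 row of the `items` column:
# a list containing one Row, which itself wraps a list of videos (made up here).
sample_item = [Row(items=[{"id": "video-1"}, {"id": "video-2"}, {"id": "video-3"}])]

# Level 1: a list holding a single Row.
print(type(sample_item).__name__, len(sample_item))

# Level 2: the Row itself, which has a single field.
row_item = sample_item[0]
print(type(row_item).__name__, len(row_item))

# Level 3: the list of videos stored inside that field.
videos = row_item[0]
print(type(videos).__name__, len(videos))
```

The pattern is always the same: look at the type, look at the length, then index into the first element and repeat.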

22. Import pyspark sql functions

23. Compute a new column, `items_size`, that measures the size of the `items` column across the whole dataset, then show the first 10 rows of this new DataFrame

*NOTE: we're just exploring, so you don't have to save the new DataFrame to a variable*

Can you see anything?

It seems that each `items_size` is equal to each of the elements of `pageInfo` 🤔.  
Let's make sure of it: we will compare the `totalResults` field of `pageInfo` to the size of `items`, and keep only the rows where they differ.
At the end we will count how many rows are left.

To keep our eyes on the prize: we want to count how many rows have a subfield `totalResults` (inside the column `pageInfo`) that is different from the size of the field `items`.

24. Filter the rows where the nested field `totalResults` of `pageInfo` is different from the size of `items`, then count how many rows are left:
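The logic of this check can be sketched in plain Python on a few hand-made rows. The numbers below are invented, and the third row is deliberately inconsistent so that the filter has something to catch. In PySpark the equivalent building blocks would typically be `pyspark.sql.functions.size` and a nested column reference such as `pageInfo.totalResults`; the snippet below only illustrates the comparison itself.

```python
# Hand-made rows shaped like `df`; the numbers are invented for illustration.
responses = [
    {"pageInfo": {"totalResults": 3, "resultsPerPage": 3}, "items": ["a", "b", "c"]},
    {"pageInfo": {"totalResults": 2, "resultsPerPage": 2}, "items": ["d", "e"]},
    # Deliberately inconsistent row: announces 4 results but holds only 3 items.
    {"pageInfo": {"totalResults": 4, "resultsPerPage": 4}, "items": ["f", "g", "h"]},
]

# Keep only the rows where the announced total differs from the actual size.
mismatches = [
    row for row in responses
    if row["pageInfo"]["totalResults"] != len(row["items"])
]

print(len(mismatches))
```

A count of zero on the real dataset would confirm that `totalResults` always matches the number of elements in `items`.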

### Conclusion
For each row of our DataFrame, we actually have many different songs.  
When we called the API, we requested songs in batches of 50. For each call we usually got a little less than 50, because some songs are not available on YouTube anymore.

In our DataFrame each row is one API call.

If we were more used to PySpark, we could actually have seen it earlier and saved ourselves some time.  
Indeed, print out the schema of `df`:

25. Print out the schema of `df`

We've got our 4 top-level fields (corresponding to our 4 columns): `etag`, `items`, `kind`, `pageInfo`.

`etag` and `kind` are of type `string`.

`items` is different: its type is `array`, which means it's a **container data type**, a type that contains multiple values.

We didn't get a table of songs, but instead a **table of API results**.

---

Now that we understand what's going on, we need to find a way to get the data in the shape we want, i.e. one row per result and not one row per API call.  
That will be the topic of the next assignment.

---

## Take away
- The data we collect is not always in a tidy format. Sometimes it can actually be difficult to understand what we are manipulating.
- A strategy of zooming in and zooming out is very effective for going through a dataset quickly and revealing how it is structured.