# Data selection for `songs` analysis

In this notebook, we will load the data we tidied up and perform column selection.

In [None]:
INPUT_FILENAME = 's3://full-stack-bigdata-datasets/Big_Data/YOUTUBE/items_clean.parquet'
songs_raw = spark.read.parquet(INPUT_FILENAME)
songs_raw.printSchema()

In [None]:
len(songs_raw.columns)

**44 columns, that's a lot!** It would take time to explore all of them.  
The goal of this notebook is to quickly scan through the columns and select a few interesting ones to explore first.

We will first get a sense of the different values in each column. Since we will do this for each of the 44 columns, we will write a function that we can then call on all columns.

Our function will be called `value_counts` and takes 2 parameters:
- `df`: a PySpark DataFrame
- `col_name`: the name of the column whose values we want to count

This function should return the count of each distinct value in the column as a PySpark DataFrame.

1. Create the function `value_counts` as per the instructions
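
One possible sketch of `value_counts`, not necessarily the intended solution: it groups by the column, counts the rows in each group, and sorts the most frequent values first. The descending sort is an assumption added for readability; the instructions only ask for the counts. The function relies only on the DataFrame methods `groupBy`, `count` and `orderBy`, so it needs no extra import.

```python
def value_counts(df, col_name):
    """Return the number of rows for each distinct value of `col_name`.

    `df` is expected to be a PySpark DataFrame. The result is a PySpark
    DataFrame with two columns: `col_name` and `count`.
    """
    return (
        df.groupBy(col_name)                 # one group per distinct value
          .count()                           # adds a `count` column
          .orderBy("count", ascending=False) # most frequent values first (assumption)
    )
```

For instance, `value_counts(songs_raw, 'kind').show()` would display the counts for the `kind` column.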

We're now ready to use our function over all columns of `songs_raw`.

2. Iterate over each column of `songs_raw` and `.show()` the results of calling `value_counts`
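
A minimal sketch of this loop, under a few assumptions: the helper name `show_all_value_counts` and its parameter `n` are hypothetical, and the grouping logic is inlined (instead of calling `value_counts`) only so that the block runs on its own.

```python
def show_all_value_counts(df, n=20):
    """Print the value counts of every column of a PySpark DataFrame."""
    for col_name in df.columns:
        print(f"--- {col_name} ---")
        counts = (
            df.groupBy(col_name)
              .count()
              .orderBy("count", ascending=False)
        )
        # Show at most `n` rows, without truncating long strings
        counts.show(n, truncate=False)
```

In the notebook this would be called as `show_all_value_counts(songs_raw)`. Each call to `.show()` triggers a Spark job, so expect one job per column.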

The next part will be done for you, because it is **tedious**. In case you didn't notice, data science is mostly about doing "boring" stuff such as data sourcing and cleaning, and only _sometimes_ about doing cool stuff like machine learning (which we'll study later). 😅

**Keep in mind that data sourcing and cleaning is CRUCIAL to the success of a data science project, garbage in → garbage out.**

Today, it's on us, but do not skimp on these kinds of tasks in your projects.

We do it for you, but it helps to understand how. Here is the process 👇.

First, for convenience, we create an [Enum](https://docs.python.org/3/library/enum.html#module-enum) with 4 members:

- `'not selected'`: this is not our priority, we won't use this column
- `'selected'`: this looks interesting, we will use this column for our next analysis
- `'maybe'`: unsure about this one, maybe for a second pass
- `'later'`: this is interesting, but either not a priority or time-consuming, so we will come back to it later

In [None]:
from enum import Enum

class Status(Enum):
  NOT_SELECTED = 'not selected'
  SELECTED = 'selected'
  LATER = 'later'
  MAYBE = 'maybe'
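
As a quick illustration of how such an `Enum` behaves, the sketch below repeats the class so that it runs on its own, then looks a member up by its value and reads back its `name` and `value`. Using an `Enum` instead of bare strings means a typo in a status raises an error rather than silently creating a fifth category.

```python
from enum import Enum

class Status(Enum):  # repeated from the cell above so this block is self-contained
    NOT_SELECTED = 'not selected'
    SELECTED = 'selected'
    LATER = 'later'
    MAYBE = 'maybe'

status = Status('later')          # look a member up by its value
print(status.name)                # LATER
print(status.value)               # later
print(status is Status.LATER)     # True

try:
    Status('latr')                # a typo is rejected
except ValueError as error:
    print(error)
```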

In the following cells, we looked at the `.show()` output with the value counts for each column.

As we take notes, we keep track of which selection we make for each column. We told you, **it is tedious work**.

### contentDetails

#### `contentDetails_caption`
Most values are `false` => NOT SELECTED

#### `contentDetails_contentRating_ytRating`
This column seems to indicate whether the content is restricted in any way, in particular whether it is age-restricted.
Most values are not restricted => NOT SELECTED

#### `contentDetails_definition`
This is the definition of the video, either `sd` (standard definition) or `hd` (high definition) => MAYBE

#### `contentDetails_dimension`
This seems to indicate whether the content is two-dimensional (`2d`) or three-dimensional (`3d`).
Almost all videos are `2d`. => NOT SELECTED

#### `contentDetails_duration`
ISO 8601 encoded duration of the song. Many different values, as expected, and an interesting feature. => SELECTED
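
To make the encoding concrete, here is a minimal sketch that converts an ISO 8601 duration string into seconds. The function name `duration_to_seconds` and the sample value `PT4M13S` are hypothetical, and the pattern only covers days, hours, minutes and whole seconds, which is an assumption about the values present in this column.

```python
import re

# Matches durations such as "PT4M13S", "PT1H2M3S" or "P1DT2H"
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def duration_to_seconds(value):
    """Convert an ISO 8601 duration string to a number of seconds."""
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError(f"Unsupported duration: {value!r}")
    # Components that are absent from the string count as zero
    parts = {name: int(text) if text else 0
             for name, text in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])

print(duration_to_seconds("PT4M13S"))   # 253
print(duration_to_seconds("PT1H2M3S"))  # 3723
```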

#### `contentDetails_licensedContent`
Boolean: more false than true, not sure it's interesting => MAYBE

#### `contentDetails_projection`
Two values, `rectangular` or `360`: presumably standard video versus VR-style 360° video. Almost all are `rectangular`. => NOT SELECTED

### Misc

#### `etag`
Unique values. I think it's related to the YouTube API, and we already have the `id` in another column => NOT SELECTED

#### `id`
Unique values: the unique `id` of the video on YouTube. We'll need this to join with our logs. => SELECTED

#### `kind`
Only one unique value: `youtube#video`. => NOT SELECTED

### Snippet

#### `snippet_categoryId`
Seems to indicate the category ID (within YouTube); most values are 10 (most likely 'music').  
Could be interesting. => MAYBE

#### `snippet_channelId`
The id of the YouTube channel the video belongs to. It could tell us whether some channels are more popular and whether users tend to listen to songs from the same channels. => SELECTED

#### `snippet_channelTitle`
The title of the channel. Will be useful in combo with `channelId` to perform analysis. => SELECTED

#### `snippet_defaultAudioLanguage`
The default audio language of the video. Most values are `null`. => LATER

#### `snippet_defaultLanguage`
The default language of the video. Most values are `null`. => LATER

#### `snippet_description`
The description of the video, as free text. => LATER

#### `snippet_liveBroadcastContent`
Tells if the video was a live broadcast or not. Almost none are. => NOT SELECTED

#### `snippet_localized_description`
Localized version of the description. Similar to `snippet_description`. => NOT SELECTED

#### `snippet_localized_title`
The localized title. => LATER

#### `snippet_publishedAt`
Date of publication of the video. That's very interesting. => SELECTED

#### `snippet_thumbnails_default_height`
Height of the image thumbnails. One unique value. => NOT SELECTED

#### `snippet_thumbnails_default_url`
URLs of the thumbnail images. Only unique values, although some substrings might not be unique; we won't dig into this for the moment. => NOT SELECTED

#### `snippet_thumbnails_default_width`
Similar to `snippet_thumbnails_default_height`. One unique value. => NOT SELECTED

#### `snippet_thumbnails_high_height`
Probably similar to `snippet_thumbnails_default_height` and `snippet_thumbnails_default_width`.  
One unique value. => NOT SELECTED

#### `snippet_thumbnails_high_url`
Similar to `snippet_thumbnails_default_url`. => NOT SELECTED

#### `snippet_thumbnails_high_width`
Similar to all the other thumbnail dimension values. => NOT SELECTED

#### `snippet_thumbnails_maxres_height`
Similar to the other thumbnail dimension values, except that about half of the values are missing.  
Not selecting it now, but the missing values could capture some time-dependent signal (that we might already get from `snippet_publishedAt`). => NOT SELECTED

#### `snippet_thumbnails_maxres_url`
Same number of null values as `snippet_thumbnails_maxres_height`, which makes sense. => NOT SELECTED

#### `snippet_thumbnails_maxres_width`
Same amount of null values as `snippet_thumbnails_maxres_height` and `snippet_thumbnails_maxres_url`. => NOT SELECTED

#### `snippet_thumbnails_medium_height`
Same as other thumbnails dimensions. => NOT SELECTED

#### `snippet_thumbnails_medium_url`
Same as other thumbnails URLs => NOT SELECTED

#### `snippet_thumbnails_medium_width`
Same as other thumbnails dimensions. => NOT SELECTED

#### `snippet_thumbnails_standard_height`
Height of the standard thumbnails; some values are missing. => NOT SELECTED

#### `snippet_thumbnails_standard_url`
URLs of the standard thumbnails; same number of missing values as `snippet_thumbnails_standard_height`. => NOT SELECTED

#### `snippet_thumbnails_standard_width`
Width of the standard thumbnails. The number of missing values matches the other standard thumbnail columns. => NOT SELECTED

#### `snippet_title`
Text content, will be useful for analysis. => SELECTED

### Statistics

#### `statistics_commentCount`
Count of comments on the video. => SELECTED

#### `statistics_dislikeCount`
Count of dislikes on the video => SELECTED

#### `statistics_favoriteCount`
Count of favorites on the video. That could be of interest, but only one unique value `0`. => NOT SELECTED

#### `statistics_viewCount`
View counts on the video. => SELECTED

### Status

#### `status_embeddable`
Whether the video is embeddable or not. I'm not really sure what that implies; we can dig into it later. => LATER

#### `status_license`
License of the content, either `youtube` or `creativeCommon`. Most are `youtube`. => LATER

#### `status_privacyStatus`
Whether the video is `public`, public but `unlisted`, or probably `private`. We only have values for the first two. Most are `public`. => LATER

#### `status_publicStatsViewable`
Are the stats publicly viewable (boolean), most are `true`. => LATER

#### `status_uploadStatus`
Either `processed` or `uploaded`. Almost all are `processed`. => NOT SELECTED

Now, we're going to store our results as a pandas DataFrame.

We create a Python dictionary whose keys are the column names and whose values are one of the 4 options in our `Enum`.

In [None]:
selection_dict = {
  'contentDetails_caption': Status.NOT_SELECTED,
  'contentDetails_contentRating_ytRating': Status.NOT_SELECTED,
  'contentDetails_definition': Status.MAYBE,
  'contentDetails_dimension': Status.NOT_SELECTED,
  'contentDetails_duration': Status.SELECTED,
  'contentDetails_licensedContent': Status.MAYBE,
  'contentDetails_projection': Status.NOT_SELECTED,
  'etag': Status.NOT_SELECTED,
  'id': Status.SELECTED,
  'kind': Status.NOT_SELECTED,
  'snippet_categoryId': Status.MAYBE,
  'snippet_channelId': Status.SELECTED,
  'snippet_channelTitle': Status.SELECTED,
  'snippet_defaultAudioLanguage': Status.LATER,
  'snippet_defaultLanguage': Status.LATER,
  'snippet_description': Status.LATER,
  'snippet_liveBroadcastContent': Status.NOT_SELECTED,
  'snippet_localized_description': Status.NOT_SELECTED,
  'snippet_localized_title': Status.LATER,
  'snippet_publishedAt': Status.SELECTED,
  'snippet_thumbnails_default_height': Status.NOT_SELECTED,
  'snippet_thumbnails_default_url': Status.NOT_SELECTED,
  'snippet_thumbnails_default_width': Status.NOT_SELECTED,
  'snippet_thumbnails_high_height': Status.NOT_SELECTED,
  'snippet_thumbnails_high_url': Status.NOT_SELECTED,
  'snippet_thumbnails_high_width': Status.NOT_SELECTED,
  'snippet_thumbnails_maxres_height': Status.NOT_SELECTED,
  'snippet_thumbnails_maxres_url': Status.NOT_SELECTED,
  'snippet_thumbnails_maxres_width': Status.NOT_SELECTED,
  'snippet_thumbnails_medium_height': Status.NOT_SELECTED,
  'snippet_thumbnails_medium_url': Status.NOT_SELECTED,
  'snippet_thumbnails_medium_width': Status.NOT_SELECTED,
  'snippet_thumbnails_standard_height': Status.NOT_SELECTED,
  'snippet_thumbnails_standard_url': Status.NOT_SELECTED,
  'snippet_thumbnails_standard_width': Status.NOT_SELECTED,
  'snippet_title': Status.SELECTED,
  'statistics_commentCount': Status.SELECTED,
  'statistics_dislikeCount': Status.SELECTED,
  'statistics_favoriteCount': Status.NOT_SELECTED,
  'statistics_viewCount': Status.SELECTED,
  'status_embeddable': Status.LATER,
  'status_license': Status.LATER,
  'status_privacyStatus': Status.LATER,
  'status_publicStatsViewable': Status.LATER,
  'status_uploadStatus': Status.NOT_SELECTED
}

import pandas as pd

selection_df = pd.DataFrame.from_dict({k: v.value for k, v in selection_dict.items()},
                                      orient='index', columns=['status'])
selection_df

| | status |
|---|---|
| contentDetails_caption | not selected |
| contentDetails_contentRating_ytRating | not selected |
| contentDetails_definition | maybe |
| contentDetails_dimension | not selected |
| contentDetails_duration | selected |
| contentDetails_licensedContent | maybe |
| contentDetails_projection | not selected |
| etag | not selected |
| id | selected |
| kind | not selected |


We'll check how many of each tag we got.

In [None]:
selection_df \
  .groupby('status') \
  .agg({'status': 'count'}) \
  .rename(columns={'status': 'count'})

| status | count |
|---|---|
| later | 8 |
| maybe | 3 |
| not selected | 25 |
| selected | 9 |


We selected only 9 of the 45 columns tagged above. This should make things easier to analyze.

3. Get the selected columns as a list: `selected_columns`
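
One way to approach this, shown as a sketch rather than the reference solution: filter the dictionary and keep the keys whose status is `Status.SELECTED`. The `Status` class and a four-entry excerpt of the dictionary are repeated here only so that the block runs on its own; in the notebook you would use the full `selection_dict`.

```python
from enum import Enum

class Status(Enum):  # repeated from the notebook so this block is self-contained
    NOT_SELECTED = 'not selected'
    SELECTED = 'selected'
    LATER = 'later'
    MAYBE = 'maybe'

# Short excerpt of the notebook's dictionary, for illustration only
selection_dict = {
    'contentDetails_duration': Status.SELECTED,
    'etag': Status.NOT_SELECTED,
    'id': Status.SELECTED,
    'snippet_description': Status.LATER,
}

# Keep the column names tagged as SELECTED, in insertion order
selected_columns = [
    name for name, status in selection_dict.items()
    if status is Status.SELECTED
]
print(selected_columns)  # ['contentDetails_duration', 'id']
```

An equivalent pandas route would filter `selection_df` on its `status` column and take the index as a list.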

4. Select the `selected_columns` from `songs_raw`: `songs`. Then, print out:
- the schema of the dataframe
- the shape of the DataFrame, i.e. the number of rows and columns
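
A possible sketch for this step. PySpark DataFrames have no `.shape` attribute, so the shape is assembled from `count()` (rows) and `len(df.columns)` (columns). The helper name `select_and_describe` is hypothetical.

```python
def select_and_describe(df, columns):
    """Select `columns` from a PySpark DataFrame, then print its schema and shape."""
    selected = df.select(*columns)
    selected.printSchema()
    # count() triggers a Spark job; len(columns) only reads metadata
    shape = (selected.count(), len(selected.columns))
    print(shape)
    return selected
```

In the notebook this would be used as `songs = select_and_describe(songs_raw, selected_columns)`.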

In [None]:
songs.write \
  .parquet("s3://full-stack-bigdata-datasets/Big_Data/YOUTUBE/items_selected.parquet", mode='overwrite')