# Worksheet 1 

## Girls in Data Science Camp

In this worksheet, we will be using a data set, obtained from the Spotify Web API, of the top 50 tracks from 2023 (more details can be found [here](https://www.kaggle.com/datasets/yukawithdata/spotify-top-tracks-2023)).

<img src="https://storage.googleapis.com/pr-newsroom-wp/1/2018/11/Spotify_Logo_RGB_Green.png" alt="Spotify logo" width="500">

I have reduced the data set to include the following variables:

- `artist_name`: The name of the artist
- `track_name`: The title of the track
- `album_release_date`: The date when the track was released
- `genres`: A list of genres associated with the track's artist(s)
- `danceability`: A measure from 0.0 to 1.0 indicating how suitable a track is for dancing based on a combination of musical elements
- `energy`: A measure from 0.0 to 1.0 representing a perceptual measure of intensity and activity
- `loudness`: The overall loudness of a track in decibels (dB)
- `key`: The key the track is in. Integers map to pitches using standard Pitch Class notation.
- `tempo`: The overall estimated tempo of a track in beats per minute (BPM)
- `duration_ms`: The length of the track in milliseconds
- `time_signature`: An estimated overall time signature of a track
- `popularity`: A score between 0 and 100, with 100 being the most popular
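
The `key` variable is stored as an integer rather than a note name. As a minimal sketch, assuming the usual pitch-class convention (0 = C, 1 = C#/Db, and so on up to 11 = B), a lookup vector can translate those integers into note names. The helper `key_to_pitch` is a hypothetical name used only for this illustration; it is not part of the data set or of any package.

```r
# Pitch-class names in order, so that key 0 is C and key 11 is B
pitch_names <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Hypothetical helper: convert integer keys (0 to 11) to note names.
# R vectors are 1-indexed, so we add 1 before looking up the name.
# Any value outside 0 to 11 (or a missing value) returns NA.
key_to_pitch <- function(key) {
  key <- as.integer(key)
  valid <- !is.na(key) & key >= 0 & key <= 11
  out <- rep(NA_character_, length(key))
  out[valid] <- pitch_names[key[valid] + 1]
  out
}

key_to_pitch(c(0, 7, 11))
```

For the three example keys, this returns `"C"`, `"G"`, and `"B"`.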

I also add a new variable called `pop` (created after the data is read in), which is `"yes"` if the song falls into any type of pop genre and `"no"` otherwise.

In [1]:
# Load libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
# Read in the data

spotify <- read_csv("data/spotify_top_50_2023.csv")

Rows: 50 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): artist_name, track_name, genres
dbl  (8): danceability, energy, loudness, key, tempo, duration_ms, time_sign...
date (1): album_release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [3]:
# Flag tracks whose genres mention any kind of pop
spotify <- spotify |>
  mutate(pop = if_else(str_detect(genres, "pop"), "yes", "no"))
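
To see how the `pop` flag behaves, here is a small sketch on made-up genre strings (these are assumptions for illustration, not rows from the Spotify data). `str_detect()` returns `TRUE` whenever the text `"pop"` appears anywhere in the string, so sub-genres such as "k-pop" are flagged too.

```r
library(tidyverse)

# Made-up genre strings, for illustration only
toy <- tibble(genres = c("dance pop, pop", "k-pop", "trap latino", "rock"))

# Same pattern as the cell above: "yes" if the string contains "pop"
toy_flagged <- toy |>
  mutate(pop = if_else(str_detect(genres, "pop"), "yes", "no"))

toy_flagged$pop
```

The first two strings contain "pop" and are flagged `"yes"`; the last two are flagged `"no"`.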

# Exercise 1: Wrangling 

## 1.1 

Is the `spotify` data set tidy? Why or why not?

> **YOUR ANSWER HERE**

## 1.2 

Using code, what are the dimensions of this data set (i.e., the number of rows and columns)?

In [4]:
### YOUR CODE HERE

## 1.3 

Without examining the entire data set, which artist and track are in the 35th row of the data set? Note that your code should return only the required variables (`artist_name` and `track_name`).

In [5]:
### YOUR CODE HERE

## 1.4 

Create a subset of the data that only includes tracks with a popularity score over 90. Assign this to an object called `pop_90`. How many songs have a popularity score over 90?

In [6]:
### YOUR CODE HERE

## 1.5

Now, I want to look at pop songs that have a danceability score over 0.7. Create a subset of the `spotify` data set to achieve this task, ordered from highest to lowest danceability. 

In [7]:
### YOUR CODE HERE

## 1.6

What is the average danceability score for Taylor Swift songs? 

In [8]:
### YOUR CODE HERE

## 1.7

Are Taylor Swift's songs more danceable than The Weeknd's? *Hint: use `group_by`*.
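
If `group_by` is new to you, here is a minimal sketch of the general pattern on made-up data (the `scores` table and its columns are assumptions for illustration and have nothing to do with the Spotify data). Grouping first and then summarizing gives one row per group.

```r
library(tidyverse)

# Made-up scores, for illustration only
scores <- tibble(
  team   = c("red", "red", "blue", "blue"),
  points = c(10, 20, 4, 6)
)

# One mean per team instead of one overall mean
team_means <- scores |>
  group_by(team) |>
  summarize(mean_points = mean(points))

team_means
```

The result has one row for each team: a mean of 5 for blue and 15 for red.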

In [9]:
### YOUR CODE HERE

## 1.8 

Are pop songs more popular than songs in other genres? Compare the average popularity score of pop songs with that of songs in other genres.

In [10]:
### YOUR CODE HERE

> **YOUR ANSWER HERE**

# Exercise 2: Data Visualization

## 2.1 

Using a histogram, visualize the distribution of popularity scores for Spotify's top 50 tracks from 2023. Describe what you see. 

In [11]:
### YOUR CODE HERE

> **YOUR ANSWER HERE**

## 2.2 

Are pop songs more popular than other genres? You answered this in question 1.8 using summary statistics, but now use histograms to visualize the popularity scores for the two groups on the same graph. Do you notice anything different?

> Hint: use `facet_grid`
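
As a rough sketch of how `facet_grid` works, the example below uses the `mtcars` data set that ships with R (an assumption chosen for illustration, not the Spotify data). It splits one histogram into stacked panels, one per value of the grouping variable, so the panels share the same x-axis and are easy to compare.

```r
library(tidyverse)

# mtcars ships with R; am is 0 (automatic) or 1 (manual)
facet_example <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  facet_grid(rows = vars(am)) +  # one panel (row) per value of am
  labs(x = "Miles per gallon", y = "Number of cars")

facet_example
```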

In [12]:
### YOUR CODE HERE

> **YOUR ANSWER HERE**

## 2.3 

Create a barplot comparing the counts of pop songs vs. non-pop songs.

In [13]:
### YOUR CODE HERE

## 2.4 

Is there a relationship between how loud a song is in decibels and its popularity? Visualize the relationship between loudness and popularity with a scatterplot, plotting loudness on the $y$-axis and popularity on the $x$-axis. What do you notice?

In [14]:
### YOUR CODE HERE

> **YOUR ANSWER HERE**

## 2.5 (Challenge)

Find the song that has the highest popularity score, but a relatively moderate loudness. Highlight this point on your scatterplot from 2.4 and label it.

> Hint: Use the `annotate` function to highlight a point
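
Here is a minimal sketch of `annotate`, again using the built-in `mtcars` data rather than the Spotify data (an assumption for illustration). One `annotate` call draws a coloured point on top of the scatterplot, and a second adds a text label next to it.

```r
library(tidyverse)

# Pick the car with the highest miles per gallon as the point to highlight
best <- mtcars[which.max(mtcars$mpg), ]

annotate_example <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Draw a larger red point over the chosen car
  annotate("point", x = best$wt, y = best$mpg, colour = "red", size = 3) +
  # Label it with the car's name, nudged to the right of the point
  annotate("text", x = best$wt, y = best$mpg, label = rownames(best),
           hjust = -0.1)

annotate_example
```

Because `annotate` takes plain x and y values, you need to look up the coordinates of the point you want to highlight before adding it to the plot.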

In [15]:
### YOUR CODE HERE

## 2.6 (Challenge)

List the artists who have more than one track in the top 50. For each artist, show the number of tracks and their average popularity score.

In [None]:
### YOUR CODE HERE