**SA433 &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2024**

# Project 4. Exploring Tracks on Spotify

Let's sharpen our Pandas skills while exploring trends in music.

In this project, you'll explore a dataset, formerly posted on [Kaggle](https://www.kaggle.com), containing the following information on over 160,000 tracks on Spotify:

| Column | Description |
| :- | :- |
| `id` | Spotify ID for the track |
| `title` | Title of track |
| `artists` | List of artists |
| `year` | Release year |
| `duration_ms` | Track length in milliseconds |
| `popularity` | Measure between 0 and 100, based mostly on the total number of plays and recency of plays |
| `tempo` | Beats per minute |
| `loudness` | Average loudness of a track in decibels |
| `mode` | 0 = Minor, 1 = Major |
| `key` | Estimated key of track: 0 = C, 1 = C-sharp/D-flat, 2 = D, etc. |
| `explicit` | 0 = No explicit content, 1 = Explicit content |
| `acousticness` | Confidence measure from 0.0 to 1.0 of whether the track is acoustic |
| `danceability` | Confidence measure from 0.0 to 1.0 of how danceable the track is, based on tempo, beat strength, etc. |
| `energy` | Confidence measure from 0.0 to 1.0 of perceived energy, intensity and activity |
| `instrumentalness` | Confidence measure from 0.0 to 1.0 of whether a track contains no vocals |
| `liveness` | Confidence measure from 0.0 to 1.0 of whether an audience is in the recording |
| `speechiness` | Confidence measure from 0.0 to 1.0 of whether a track contains spoken words |
| `valence` | Confidence measure from 0.0 to 1.0 of the musical positiveness of a track |

For more details on the various measures (especially `acousticness`, `danceability`, `energy`, `instrumentalness`, `liveness`, `speechiness`, and `valence`), see [this page](https://developer.spotify.com/documentation/web-api/reference/get-audio-features) of the Spotify API documentation.
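As a rough illustration of how the coded columns can be read, the sketch below decodes `key` and `mode` and converts `duration_ms` to minutes on a small made-up DataFrame. The names `KEY_NAMES` and `toy`, the new column names, and the toy values are assumptions for illustration only. The pitch names for key values 3 through 11 follow standard pitch-class notation, which the table above abbreviates as "etc."

```python
import pandas as pd

# Pitch-class names for key values 0 through 11 (standard pitch-class notation).
KEY_NAMES = ['C', 'C#/Db', 'D', 'D#/Eb', 'E', 'F',
             'F#/Gb', 'G', 'G#/Ab', 'A', 'A#/Bb', 'B']

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({
    'key': [0, 1, 7],
    'mode': [1, 0, 1],
    'duration_ms': [180000, 240000, 210500],
})

decoded = toy.assign(
    # Look up each integer key in a {0: 'C', 1: 'C#/Db', ...} dictionary.
    key_name=lambda d: d['key'].map(dict(enumerate(KEY_NAMES))),
    # 0 = Minor, 1 = Major, as in the table above.
    mode_name=lambda d: d['mode'].map({0: 'Minor', 1: 'Major'}),
    # 60,000 milliseconds per minute.
    duration_min=lambda d: d['duration_ms'] / 60000,
)
```

Decoded columns like these are only for readability; the numeric codes are usually easier to filter and aggregate on.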

*FYI.* [Here](https://www.kaggle.com/rodolfofigueroa/spotify-12m-songs) is a similar dataset that is still available on Kaggle.

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Setup

First, run the code cell below to import Pandas and NumPy.

In [None]:
import pandas as pd
import numpy as np

Next, the `data` folder alongside this notebook contains a CSV file, `data/spotify_oct2020.csv`, with the Spotify data described above. Run the code cell below to read this dataset into a DataFrame named `df`.

In [None]:
df = pd.read_csv('data/spotify_oct2020.csv', converters={'artists': eval})

Note that there is a keyword argument, `converters={'artists': eval}`. If you take a closer look at the raw CSV file, you'll see that the `artists` column contains data that look like this:

```python
"['Drake', 'Santigold', 'Lil Wayne']"
```

In other words, in each row, the `artists` column contains a list stored as a string. The Python function `eval()` evaluates a string as a Python expression. By telling `pd.read_csv()` to use `eval` as a converter for the `artists` column, we can read these strings in as actual lists.
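Because `eval()` will run any expression it is given, it is best reserved for files you trust. One possible alternative, sketched here and not the approach used in this project, is `ast.literal_eval()` from the standard library, which only accepts Python literals such as lists and strings. The two-row CSV and the name `toy` below are made up for illustration.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row CSV that mimics how the artists column is stored.
toy_csv = io.StringIO('''title,artists
Track A,"['Artist 1', 'Artist 2']"
Track B,"['Artist 3']"
''')

# ast.literal_eval parses literals only, so it cannot execute arbitrary code.
toy = pd.read_csv(toy_csv, converters={'artists': ast.literal_eval})

# Each value in the artists column is now a real Python list.
first_row_artists = toy['artists'].iloc[0]
```

For a file containing only list literals like the one described above, the resulting DataFrame should be the same with either converter.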

Finally, run the code cell below to take a peek at the dataset. You can see that the values in the `artists` column are lists.

In [None]:
df.head()

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problem 1

Using a single method chain, create a table showing the top 10 tracks by popularity value. The table should show the title, artists, release year, and popularity value. The top 3 rows should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>title</th>
      <th>artists</th>
      <th>year</th>
      <th>popularity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>87942</th>
      <td>Blinding Lights</td>
      <td>[The Weeknd]</td>
      <td>2020</td>
      <td>100</td>
    </tr>
    <tr>
      <th>87940</th>
      <td>ROCKSTAR (feat. Roddy Ricch)</td>
      <td>[DaBaby, Roddy Ricch]</td>
      <td>2020</td>
      <td>99</td>
    </tr>
    <tr>
      <th>87949</th>
      <td>death bed (coffee for your head) (feat. beabadoobee)</td>
      <td>[Powfu, beabadoobee]</td>
      <td>2020</td>
      <td>97</td>
    </tr>
  </tbody>
</table>

## Problem 2

Using a single method chain, find the values below for the tracks released after the year 2000:

1. Average duration of a track, in milliseconds.
2. Average popularity of tracks.
3. Fraction of explicit tracks.

## Problem 3

We may be interested in how tracks differ across decades. First, we need to compute the decade for each track. We can do this using the `np.floor()` function from NumPy and the `.astype()` Series method, like this:

In [None]:
(np.floor(df['year'] / 10) * 10).astype(int)
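To see what this expression does at decade boundaries, here is a small sketch on made-up years. It also shows integer floor division (`//`) as an equivalent shortcut, which is an alternative and not the method given above. The names `years`, `decade_floor`, and `decade_intdiv` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical release years, chosen to include decade boundaries.
years = pd.Series([1929, 1930, 1999, 2000, 2020], name='year')

# Same expression as above: divide by 10, round down, multiply back by 10.
decade_floor = (np.floor(years / 10) * 10).astype(int)

# Integer floor division gives the same result without the float round trip.
decade_intdiv = years // 10 * 10

print(decade_floor.tolist())   # [1920, 1930, 1990, 2000, 2020]
print(decade_intdiv.tolist())  # [1920, 1930, 1990, 2000, 2020]
```

Either expression returns a Series, so it can be used wherever a new column is computed.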

Using a single method chain, create a table that shows the median danceability value for tracks in each decade, sorted from highest to lowest danceability values. The first 3 rows should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>decade</th>
      <th>median_danceability</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>10</th>
      <td>2020</td>
      <td>0.693</td>
    </tr>
    <tr>
      <th>0</th>
      <td>1920</td>
      <td>0.624</td>
    </tr>
    <tr>
      <th>9</th>
      <td>2010</td>
      <td>0.612</td>
    </tr>
  </tbody>
</table>

## Problem 4

Recall that each value in the `artists` column is actually a *list* of strings. For the sake of simplicity, for each track, let's assign credit to the first artist in the list.

We can extract the first artist from each list in the `artists` column into a Series using the `.str.get()` method, as follows:

In [None]:
df['artists'].str.get(0)
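Although `.str` methods are usually associated with strings, `.str.get()` also indexes into list-like values. The sketch below shows this on a made-up Series, including what happens when a list is empty. The names `artists` and `first_artist` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical artist lists, not taken from the real dataset.
artists = pd.Series(
    [['Artist 1', 'Artist 2'], ['Artist 3'], []],
    name='artists',
)

# Element 0 of each list; a list with no element 0 yields a missing value.
first_artist = artists.str.get(0)
```

Here `first_artist` holds `'Artist 1'`, `'Artist 3'`, and a missing value for the empty list, so rows without any artist would need separate handling if they occur.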

Using a single method chain, create a table that shows the 10 artists with the most tracks released after 2000. Sort the rows from highest to lowest number of tracks. The first 3 rows should look like this:


<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>first_artist</th>
      <th>count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>2161</th>
      <td>Drake</td>
      <td>205</td>
    </tr>
    <tr>
      <th>7219</th>
      <td>Taylor Swift</td>
      <td>201</td>
    </tr>
    <tr>
      <th>2399</th>
      <td>Eminem</td>
      <td>184</td>
    </tr>
  </tbody>
</table>

## Problem 5

Using a single method chain, create a table that shows the 10 artists with the highest average track popularity. As in Problem 4, credit each track to the first artist in its list. Limit the table to artists that have 10 or more tracks in the dataset. Sort the rows from highest to lowest average popularity value. The first 3 rows should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>first_artist</th>
      <th>mean_popularity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>6888</th>
      <td>Harry Styles</td>
      <td>77.038462</td>
    </tr>
    <tr>
      <th>1864</th>
      <td>Billie Eilish</td>
      <td>76.560976</td>
    </tr>
    <tr>
      <th>9855</th>
      <td>Lewis Capaldi</td>
      <td>76.250000</td>
    </tr>
  </tbody>
</table>

## Problem 6

Ask your own question about the tracks in the Spotify dataset. Using a single method chain, produce a table that answers your question. In a few sentences, write about your findings.

*Write your question here.*

In [None]:
# Write your code here.

*Write about your findings here.*

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Grading rubric

| Problem | Criterion                                                               | Points  |
| :-      | :-                                                                      | -:      |
| 1       | Sorts by popularity in descending order                                 | 3       |
|         | Shows only columns for title, artists, release year, and popularity     | 3       |
|         | Shows only top 10 tracks                                                | 3       |
|         | Uses a single method chain                                              | 2       |
|         | Code runs without errors                                                | 4       |
| 2       | Filters for tracks released after 2000                                  | 3       |
|         | Computes average duration of track                                      | 3       |
|         | Computes average popularity of tracks                                   | 3       |
|         | Computes fraction of explicit tracks                                    | 3       |
|         | Uses a single method chain                                              | 2       |
|         | Code runs without errors                                                | 4       |
| 3       | Computes decade for each track                                          | 3       |
|         | Computes median danceability for each decade                            | 3       |
|         | Sorts by median danceability in descending order                        | 3       |
|         | Uses a single method chain                                              | 2       |
|         | Code runs without errors                                                | 4       |
| 4       | Filters for tracks released after 2000                                  | 3       |
|         | Extracts first artist                                                   | 3       |
|         | Computes number of tracks per artist                                    | 3       |
|         | Sorts by number of tracks per artist in descending order                | 3       |
|         | Uses a single method chain                                              | 2       |
|         | Code runs without errors                                                | 4       |
| 5       | Extracts first artist                                                   | 3       |
|         | Computes average popularity for each artist                             | 3       |
|         | Filters for artists with 10 or more tracks                              | 3       |
|         | Sorts by average popularity in descending order                         | 3       |
|         | Shows only columns for artist and popularity                            | 3       |
|         | Uses a single method chain                                              | 2       |
|         | Code runs without errors                                                | 4       |
| 6       | Asks a reasonable question about the tracks in the Spotify dataset      | 3       |
|         | Uses a single method chain to produce a table that answers the question | 4       |
|         | Code runs without errors                                                | 3       |
|         | Writes a few sentences explaining the findings                          | 3       |
|         | **Total**                                                               | **100** |