# Project 1: Music and Spotify - Due Monday, November 4th, 2019 at 11:59PM

In [None]:
# Don't change this cell. These are all your imports necessary for this project.
from datascience import *
import numpy as np
import matplotlib.pyplot as plots
%matplotlib inline
plots.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings("ignore", category = FutureWarning)

from client.api.notebook import Notebook
ok = Notebook('project01.ok')
_ = ok.auth(inline=True)

## Announcements

This assignment is due **Monday, Nov 4 at 11:59pm**. 

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

The project is more challenging than the homeworks. You are encouraged to find a partner and use pair programming.

Definitely start early so that you have time to get help if you're stuck. A calendar with lab hour times and locations appears on [https://www.dsc10.com/calendar](https://www.dsc10.com/calendar).

## Part 1: Introduction

[Spotify](https://www.spotify.com/us/) is a large digital music service that allows users to access its vast library of music. The data we are working with comes from the Spotify API, an interface for working with data from Spotify. 

Run the cell below to load this data into `tracks`.

In [None]:
tracks = Table.read_table('spotify.csv')
tracks

This data set is from [Kaggle](https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db), originally from the Spotify API. The data has been cleaned and slightly modified for the purposes of this project. The individuals in the table are songs (tracks) on Spotify, and the ten features recorded about each song are:

1. `artist_name`: Name of the artist.
2. `track_name`: Name of the track (song).
3. `key`: Estimated key of the track.
4. `popularity`: Describes the popularity of the song on a scale of 0-100.
5. `acousticness`: Describes the confidence of how acoustic the song is, on a scale of 0-1.
6. `danceability`: Describes how suitable a song is for dancing, on a scale of 0-1.
7. `energy`: Measures the level of intensity and activity of a song, on a scale of 0-1.
8. `valence`: Describes the overall positiveness or happiness of the track, on a scale of 0-1.
9. `tempo`: Measures the overall number of beats per minute (BPM) of the track.
10. `duration_ms`: Describes how long the song is in milliseconds (thousandths of a second).

More information on the columns of the table can be found on the Spotify Developers API page [here](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/). Let's familiarize ourselves with the intricacies of the data set and practice extracting information from the table.


**Question 1.** First, let's make `duration_ms` more readable by changing the unit from milliseconds to minutes. Add a new column called `duration_min` that contains the duration of the track in minutes, without rounding, then delete the column `duration_ms`. Resave this new table as `tracks`.
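
As a reminder of the unit conversion involved, here is a minimal sketch in plain Python. The helper name `ms_to_minutes` and the sample value are hypothetical and only illustrate the arithmetic on a single number; applying it to the table is up to you.

```python
# One minute is 60 seconds, and one second is 1000 milliseconds.
MS_PER_MINUTE = 60 * 1000

def ms_to_minutes(duration_ms):
    """Convert a duration in milliseconds to minutes, without rounding."""
    return duration_ms / MS_PER_MINUTE

# A hypothetical track lasting 210,000 ms is 3.5 minutes long.
example_minutes = ms_to_minutes(210000)
print(example_minutes)
```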

In [None]:
tracks = ...
tracks

In [None]:
_ = ok.grade('q1_1')

**Question 2.** Find the length of the longest song in the data set in *minutes* and save it to `longest_duration`.

In [None]:
longest_duration = ...
longest_duration

In [None]:
_ = ok.grade('q1_2')

**Question 3.** A [musical key](https://www.studybass.com/lessons/harmony/keys-in-music/) describes a certain set of pitches, and usually a song is played in a certain key. Create an array of all the unique keys in the data set. Label it as `unique_keys`.

In [None]:
unique_keys = ...
unique_keys

In [None]:
_ = ok.grade('q1_3')

**Question 4.** Make a visualization that allows you to easily compare the number of times each unique key is used among the songs in the Spotify data set.

In [None]:
# make plot here

**Question 5.** Let's say you're a big fan of sharp keys (i.e. tracks that have a # in `key`), and have somehow managed to listen to every track in a sharp key with popularity above 60. Find how many total tracks fit these criteria, and save it as `popular_sharp_tracks`.

In [None]:
popular_sharp_tracks = ...
popular_sharp_tracks

In [None]:
_ = ok.grade('q1_5')

**Question 6.** The `tempo` of a song measures the number of beats per minute of the song, which generally describes how slow or fast a song is. From the table `tracks`, find the name of the song with the greatest *number* of total beats. Save the answer into `most_beats`.
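
To see why tempo alone does not determine the total number of beats, consider this small sketch. The function name `total_beats` and the two sample songs are made up for illustration; it assumes the duration is already expressed in minutes.

```python
def total_beats(tempo_bpm, duration_min):
    """Total beats = beats per minute * number of minutes."""
    return tempo_bpm * duration_min

# A fast but short hypothetical song versus a slow but long one.
fast_short = total_beats(180, 2.0)
slow_long = total_beats(90, 5.0)

# The slower song still contains more beats because it lasts longer.
print(fast_short, slow_long)
```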

In [None]:
most_beats = ...
most_beats

In [None]:
_ = ok.grade('q1_6')

**Question 7.** Some artists consistently produce hit songs, and other artists produce one-hit wonders. Create a table named `popular_table` with two columns: `artist_name` and `proportion_top_songs`, in which `proportion_top_songs` describes the proportion of songs from the artist that are above 90 popularity.

*Note*: Artists with no songs above 90 popularity should not be included in the table.
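
The idea of a per-artist proportion can be sketched in plain Python on a toy list. The artists and popularity values below are invented, and this uses `collections.Counter` rather than table methods, so treat it only as an illustration of the calculation.

```python
from collections import Counter

# Hypothetical (artist, popularity) pairs, not taken from spotify.csv.
songs = [("A", 95), ("A", 80), ("B", 50), ("C", 91), ("C", 92)]

# Count all songs per artist, and separately the songs above 90 popularity.
total = Counter(artist for artist, _ in songs)
top = Counter(artist for artist, popularity in songs if popularity > 90)

# Only artists with at least one song above 90 appear in `top`,
# so artists like "B" are left out automatically.
proportion_top = {artist: top[artist] / total[artist] for artist in top}
print(proportion_top)
```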

In [None]:
popular_table = ...
popular_table

In [None]:
_ = ok.grade('q1_7')

## Part 2: Tempo Markings

In music, Italian words are used to describe the tempo, or pacing, of a song. In this part, we want to analyze the relationship between a song's tempo and its other characteristics: acousticness, danceability, and valence. But before we do that, we want to convert the tempo of each song into its respective description. The ranges below are approximations of the tempos that correspond to each Italian tempo marking.

- Lento - [0, 60) bpm
- Adagio - [60, 90) bpm
- Adante - [90, 110) bpm
- Moderato - [110, 120) bpm
- Allegro - [120, 160) bpm
- Vivace - [160, 180) bpm
- Presto - 180+ bpm
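
The bracket notation means each range includes its lower bound and excludes its upper bound, so a tempo of exactly 60 bpm is Adagio, not Lento. One possible way to express that rule is sketched below using the standard-library `bisect` module. The names `LOWER_BOUNDS`, `NAMES`, and `tempo_name` are hypothetical, and the function assumes a non-negative tempo.

```python
from bisect import bisect_right

# Lower bound of each range, in increasing order.
LOWER_BOUNDS = [0, 60, 90, 110, 120, 160, 180]
# Labels spelled exactly as in the list above, in the same order.
NAMES = ["Lento", "Adagio", "Adante", "Moderato", "Allegro", "Vivace", "Presto"]

def tempo_name(bpm):
    """Return the tempo marking whose half-open range [low, high) contains bpm.

    Assumes bpm >= 0.
    """
    # bisect_right counts how many lower bounds are <= bpm.
    return NAMES[bisect_right(LOWER_BOUNDS, bpm) - 1]

print(tempo_name(59.9), tempo_name(60), tempo_name(180))
```

An equivalent chain of `if`/`elif` statements works just as well; the lookup version simply keeps the boundaries in one place.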



**Question 1.** Create a new column in the table `tracks` called `tempo_name` that contains the Italian tempo name for the song.

In [None]:
tracks = ...
tracks

In [None]:
_ = ok.grade('q2_1')

**Question 2.** Find the most common `tempo_name` and `key` combination among all the songs in `tracks`. Save the answer into an array labeled `most_common_combination`. For example, the answer can look like: `array(['Vivace', 'D#'])`.
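
To illustrate what "most common combination" means, here is a toy sketch that counts pairs with `collections.Counter`. The pairs below are invented, and the table-based approach you use in the notebook may look quite different.

```python
from collections import Counter

# Hypothetical (tempo_name, key) pairs for four songs.
pairs = [("Allegro", "C"), ("Vivace", "D#"), ("Allegro", "C"), ("Lento", "G")]

# most_common(1) returns a list holding the single most frequent pair and its count.
best_pair, count = Counter(pairs).most_common(1)[0]
print(best_pair, count)
```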

In [None]:
most_common_combination = ...
most_common_combination

In [None]:
_ = ok.grade('q2_2')

**Question 3.** Now we want to categorize the songs by their Italian tempo names, and find the mean of each numerical variable according to its `tempo_name`. Non-numerical columns should be dropped. Save this new table as `track_means`. 

In [None]:
track_means = ...
track_means

In [None]:
_ = ok.grade('q2_3')

**Question 4.** Create a single plot that portrays how the `acousticness mean`, `danceability mean`, `energy mean`, and `valence mean` change according to the `tempo mean`. 

*Hint*: Check out the documentation for `.plot` [here](http://data8.org/datascience/_autosummary/datascience.tables.Table.plot.html).

In [None]:
# make plot here

**Question 5.** You may have noticed from the plot in the previous question that energy and valence seem to move together. Assuming that there is a strong association between energy and valence, can we use the energy of a song to predict its valence? In addition, does this mean that high energy causes high valence? Answer both questions and explain why below.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

Write your answer here.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

## Part 3: Looking into Artists

The file `artists.csv` contains data on musical artists from [last.fm](https://www.last.fm/), which is a service that allows listeners to keep a record of what songs they listen to across various platforms, so that they can get personalized music recommendations. 

Run the cell below to load this data into `artists`. 

In [None]:
artists = Table.read_table('artists.csv')
artists

This data set is from [Kaggle](https://www.kaggle.com/pieca111/music-artists-popularity), originally scraped from [last.fm](https://www.last.fm/). There are 5 columns in this data set. 

1. `artist`: artist name according to last.fm
2. `country`: country or countries associated with the artist, based on last.fm tags, separated by semicolon (;)
3. `artist_tags`: user-generated tags associated with the artist on last.fm, separated by semicolon (;), sorted in decreasing order of frequency
4. `artist_listeners`: number of listeners for the artist, among users of last.fm
5. `artist_plays`: number of times a song by this artist was played by users of last.fm

The table `artists` has a column `artist_tags`, which contains one long string with the artist's tags on [last.fm](https://www.last.fm/), each of which is separated by a semicolon ";". In order to look at artists with certain tags, first we need to find an easier way to access these tags.

**Question 1.** Create a new column in `artists` called `tag_list` that contains a *list* of tags for each artist, from the `artist_tags` column. Drop the old column `artist_tags`. 

The tags should be listed in order of decreasing frequency, as they are listed in the original `artist_tags` column. Note that some tags are the same, except for different capitalization ("hip hop" and "Hip hop"). Change all tags to lowercase letters. For example, if an artist has the tags "rock; alternative; Britpop; Rock", the value in the `tag_list` column should be [rock, alternative, britpop, rock]. As illustrated, each list may now have repeated tags.
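
A minimal sketch of turning one tag string into a lowercase list, using only built-in string methods. The function name `to_tag_list` is hypothetical, and stripping the whitespace around each tag is an assumption based on the example string above; check how the separators actually look in `artists.csv`.

```python
def to_tag_list(artist_tags):
    """Split a semicolon-separated tag string into a list of lowercase tags."""
    # split(";") keeps the original order; strip() removes surrounding spaces.
    return [tag.strip().lower() for tag in artist_tags.split(";")]

print(to_tag_list("rock; alternative; Britpop; Rock"))
```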

In [None]:
artists = ...
artists

In [None]:
_ = ok.grade('q3_1')

**Question 2.** Let's take a closer look at artists who are considered relaxing. Filter the `artists` table to find artists who have the tag `relax`, and save that table into `relax_artists`.

*Hint:* The Python keyword `in` can check whether an element belongs to a list. Try it out by running the two cells below.

In [None]:
5 in [1, 2, 3]

In [None]:
2 in [1, 2, 3]

In [None]:
relax_artists = ...
relax_artists

In [None]:
_ = ok.grade('q3_2')

**Question 3.** For artists who have the `relax` tag, what is the average number of listeners? Save that number into `relax_artists_avg_listeners`. 

In [None]:
relax_artists_avg_listeners = ...
relax_artists_avg_listeners

In [None]:
_ = ok.grade('q3_3')

**Question 4.** Compare `relax_artists_avg_listeners` to the average number of listeners for artists with the `party` tag and save the difference (party - relax) to `diff`. 

In [None]:
diff = ...
diff

In [None]:
_ = ok.grade('q3_4')

The `country` column also has multiple values separated by a ";". Let's look at the number of countries an artist is affiliated with.

**Question 5.** Add a column to the `artists` table called `num_countries` which contains the number of countries listed in the `country` column. Save your result in a new table called `with_num_countries`.
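
Counting the values in a semicolon-separated string can be sketched as follows. The function name `count_countries` and the sample strings are hypothetical, and the sketch assumes every entry names at least one country; missing or empty values would need separate handling.

```python
def count_countries(country):
    """Count the semicolon-separated entries in a country string."""
    return len(country.split(";"))

print(count_countries("United States; United Kingdom"))
print(count_countries("Japan"))
```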

In [None]:
with_num_countries = ...
with_num_countries

In [None]:
_ = ok.grade('q3_5')

**Question 6.** Let's see if artists who are affiliated with more countries have more listeners on average. Save a table `listeners_by_num_countries` which has two columns, `num_countries` and `average_listeners`, which gives the average number of listeners among all artists associated with a given number of countries. 

In [None]:
listeners_by_num_countries = ...
listeners_by_num_countries

In [None]:
_ = ok.grade('q3_6')

**Question 7.** Plot a graph of `num_countries` versus `average_listeners`. You should find that there is a clear *negative association* between number of countries and average listeners, except for a sudden spike in average listeners when the number of countries equals 11. Explore the data some more and see if you can find an explanation for this sudden departure from the downward trend. If you had to use this data to predict the number of listeners for a band associated with 12 countries, what would be your rough prediction (say, to the nearest 20,000 listeners) and why? Feel free to support your answer with additional calculations, visualizations, etc.

In [None]:
# make plot here

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

Write your answer here.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

## Part 4: Artist Plays and Songs

**Question 1.** For artists that appear in both the `artists` and `tracks` tables, find the average number of plays per song (`average_plays_per_song`) each artist has, based on the number of songs the artist has in the `tracks` table. For example, if an artist has 100,000 plays and has 2 songs in the `tracks` table, the artist's `average_plays_per_song` would be 50,000. Save a table called `average_plays` with three columns: `artist_name`, `tag_list`, and `average_plays_per_song`.
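
The worked example in the question can be reproduced on toy data like this. The artist names and play counts are invented, and the dictionaries stand in for the two tables only to show the arithmetic.

```python
from collections import Counter

# Hypothetical data: one entry per song in "tracks", total plays in "artists".
track_artists = ["A", "A", "B"]
plays = {"A": 100000, "C": 500}

# Number of songs each artist has in the toy "tracks" list.
song_counts = Counter(track_artists)

# Keep only artists present in both sources, then divide plays by song count.
average_plays_per_song = {
    artist: plays[artist] / song_counts[artist]
    for artist in plays
    if artist in song_counts
}
print(average_plays_per_song)
```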

In [None]:
average_plays = ...
average_plays

In [None]:
_ = ok.grade('q4_1')

**Question 2.** The first tag in the `tag_list` is the most popular tag for that artist. What is the most popular tag for the artist with the highest `average_plays_per_song`? Save that tag to `highest_average_tag`.

In [None]:
highest_average_tag = ...
highest_average_tag

In [None]:
_ = ok.grade('q4_2')

**Question 3.** For the top 5 artists with the highest `average_plays_per_song`, rank the frequency of tags associated with each of these artists by giving them a score of 1 to 5. A score of 1 means the tag is associated with only 1 of these top 5 artists, a score of 2 means that the tag is associated with 2 of the top 5 artists, etc. Create a table called `scores` with two columns: `tag` and `score`. Arrange the table so that tags appear in decreasing order of score. 
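
The scoring rule counts artists, not tag occurrences, so a tag repeated within one artist's list should only count once for that artist. A small sketch of that rule on invented tag lists follows; converting each list to a `set` before counting is one possible way to handle repeats.

```python
from collections import Counter

# Hypothetical tag lists for five artists; note "rock" repeats for the first one.
top_tag_lists = [
    ["rock", "pop", "rock"],
    ["pop"],
    ["pop", "indie"],
    ["rock"],
    ["jazz", "pop"],
]

# set(tags) removes repeats within an artist, so each artist adds at most 1 per tag.
score = Counter(tag for tags in top_tag_lists for tag in set(tags))

# most_common() lists (tag, score) pairs in decreasing order of score.
ranked = score.most_common()
print(ranked)
```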

In [None]:
scores = ...
scores

In [None]:
_ = ok.grade('q4_3')

## Part 5: Profile of a Top Hit

`billboard.csv`, also from [Kaggle](https://www.kaggle.com/saberianz/billboard-charts), contains songs that were on the Billboard Hot 100 charts from 2015 to 2019. The Billboard Hot 100 is a weekly list of the top 100 current songs, ranked based on a combination of sales, radio time, and streaming activity.  Each week, each song on the Billboard Hot 100 is ranked from 1 to 100, and the popularity of hit songs can be seen by keeping track of features such as how many weeks the song appeared in the Billboard Hot 100, and the best ranking it achieved in that time. 

Run the cell below to load this data into `billboard`.

In [None]:
billboard = Table.read_table('billboard.csv')
billboard

**Question 1.** Which artists have at least 2 songs that have peaked at #1 before? Save your answer in an array of artist names called `legends`.

In [None]:
legends = ...
legends

In [None]:
_ = ok.grade('q5_1')

**Question 2.** Join the tables `tracks` and `billboard` based on the song names, into a table called `joined_by_song`.

In [None]:
joined_by_song = ...
joined_by_song

In [None]:
_ = ok.grade('q5_2')

**Question 3.** Compare the number of rows in `joined_by_song` with the number of rows in `billboard`. Explain why the two numbers differ.

In [None]:
print(joined_by_song.num_rows)
print(billboard.num_rows)

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

Write your answer here.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

**Question 4.** It is better to join the tables based on two features: song *and* artist. Since the `join` command only allows joining on one feature, we will create a new feature that is a combination of song and artist. Add a new column to each of the tables `tracks` and `billboard` called `Song; Artist` and then join the two tables by `Song; Artist` into a table called `joined_by_song_artist`. The column `Song; Artist` should contain strings formatted like "Blank Space; Taylor Swift", for example.
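
Building the combined string for a single song can be sketched as below. The helper name `song_artist_key` is hypothetical; the only requirement taken from the question is the "song, semicolon, space, artist" format.

```python
def song_artist_key(song, artist):
    """Combine a song name and an artist name into one 'Song; Artist' string."""
    return song + "; " + artist

print(song_artist_key("Blank Space", "Taylor Swift"))
```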

In [None]:
joined_by_song_artist = ...
joined_by_song_artist

In [None]:
_ = ok.grade('q5_4')

**Question 5.** Filter the table `joined_by_song_artist` to find the songs that were 
* on the charts for at least 10 weeks, and   
* made it into the top 10 songs at some point in time.  

Call the table with just these songs `top_hits`.

In [None]:
top_hits = ...
top_hits

In [None]:
_ = ok.grade('q5_5')

**Question 6.** Let's see if we can determine what top hits have in common. Make a table called `comparison_table` that compares the mean of each of the following attributes among all songs in the `top_hits` table, versus among all songs in the `tracks` table. 
* acousticness  
* danceability  
* energy  
* valence  
* tempo
* duration_min 

Your table should have three columns: `Attribute`, `Mean for Top Hits`, `Mean for All Songs`.

In [None]:
comparison_table = ...
comparison_table

In [None]:
_ = ok.grade('q5_6')

**Question 7.** Fill in the blanks in the following code so that it prints a one-sentence summary of what we've learned about top hits. 

Hint: Use *if* statements based on the values in `comparison_table`.

Hint: Make sure you add spaces so the result reads like a sentence.
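
As a reminder of how an *if* statement can choose between two words, here is a small sketch. The function name `more_or_less` and the two numbers are made up; in the notebook the values would come from `comparison_table`.

```python
def more_or_less(top_hits_mean, all_songs_mean):
    """Return "more " or "less " (with a trailing space) by comparing two means."""
    if top_hits_mean > all_songs_mean:
        return "more "
    else:
        return "less "

# With hypothetical means 0.7 and 0.5, the phrase reads "more energy".
phrase = more_or_less(0.7, 0.5) + "energy"
print(phrase)
```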

In [None]:
s = "The analysis shows that top hits have "
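# `attributes` is not defined in this cell; it is assumed to be an array of the
# attribute names, for example the `Attribute` column of comparison_table.
for i in np.arange(4):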
    more_less = "_____" #should be either "more " or "less " depending on values in comparison_table
    s = s+more_less+attributes.item(i)+", "
faster_slower = "_____" #should be either "faster " or "slower " based on values in comparison_table
s = s+faster_slower+"tempo, and "
longer_shorter = "_____" #should be either "longer " or "shorter " based on values in comparison_table
s = s+longer_shorter+"duration than typical songs."

print(s)

In [None]:
_ = ok.grade('q5_7')

## Optional Extra Credit: Independent Data Analysis

Perhaps the data you've seen here has raised some interesting questions for you. Feel free to explore those questions here in an independent data analysis. You can do anything you want that relates to this project, such as
* explore a particular artist or genre, 
* analyze your own music listening preferences, or
* investigate trends in music over time.

Feel free to look for additional data sets that will support your analysis. 

Extra credit on this project will be given to our **five favorite data analyses**. Your analysis should be done below, and should include code and explanatory text that guides the reader in what you are doing and explains your findings. This part is completely optional and can be left blank.

In [None]:
# do whatever analysis you want here

Use Markdown cells to explain what you're doing.

# Congratulations! You finished the project!

## To submit:

1. Select `Run All` from the `Cell` menu to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine.
3. Submit using the cell below.
4. Save PDF and submit to Gradescope.
5. Add your partner (if applicable) on both OKPY and Gradescope.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
_ = ok.submit()

## Don't forget to submit your PDF to Gradescope!
If the usual way of getting a PDF does not work for the project, an alternative is to press Ctrl+P in your Jupyter notebook, which brings up a print preview page, and then change the destination to save it as a PDF.