In [None]:
import json
import importlib





# Spotify Popularity Predictor
The goal of this project is to let a user look up a song on Spotify and receive a prediction of how popular it is. Originally, we intended to predict the number of playlists a song appears on, but we switched to popularity so that the pool of possible unseen instances is the entirety of Spotify's library.

## Part 1: The Dataset
Our dataset is composed of over 2 million songs from Spotify that we gathered from a variety of sources. Currently, each entry contains 10 attributes, one of which is the attribute being predicted. These attributes include:
- Number of Artists
    - With Hip Hop being the dominant force in music right now, and features
    being a huge part of Hip Hop, it makes sense to take the number of
    features into account
- The Primary Artist for the track
    - While not inherently useful for models other than Naive Bayes, we can
    use the Spotify API to get more data about each track's artist, such as
    genre and artist popularity. Genre in particular would be very useful, but
    since Spotify attaches multiple genres to each artist, it will complicate
    the training process. There are only ~300k unique artists, so getting this
    information would be trivial
- Release Year
    - Different years favor different kinds of songs, and the things that make an older song popular don't necessarily translate to newer releases
- Explicit
    - It is hard to say how useful this will be on its own, though it will be useful alongside genre in a decision tree
- Danceability
    - This is a metric calculated by Spotify for its recommendation system,
    but it will probably be quite useful for us
- Energy
    - Much like Danceability, this one is a calculation by Spotify for their 
    recommendation system and will likely be useful. Both of these would     
    likely be enhanced by adding genre to the mix
- Tempo
    - The tempo of the song in Beats per Minute. Historically this has been
    a good predictor of how much a song gets played on the radio or in a 
    club
- Loudness
    - Spotify reports this as the overall loudness of the track in decibels
    (dB), averaged across the whole track rather than taken at its peak.
    Another one that would be enhanced by genre
- Time Signature
    - The number of counts per measure. This will likely help filter out
    less popular forms of music such as math rock, free jazz, etc. Songs with
    more complex time signatures are usually less popular. For example, a
    value of 3 corresponds to 3/4 time
- Popularity
    - The attribute we are predicting. This is a value in the range [0,100]
    that is based on the number of plays and how recent they were. An artist's
    top 10 tracks are shown based on this value. Unfortunately, the total number
    of plays is not exposed through the API.

### Getting the Dataset
As mentioned, we originally planned on using the Million Playlist Dataset from Spotify to factor in the number of
times a track appears on a playlist, but this would have severely limited the unseen instances we could use. The dataset
did include over 2 million unique songs though, so we opted to use that as the basis for our dataset. After getting the list
of songs, we then needed to query the Spotify Web API for the attributes listed above. This ended up being around 69,000
requests and took about 4 hours. Thankfully the API allowed us to query data for multiple songs at a time, which is why there
weren't over 4 million requests. Each instance contains info from both the `tracks` endpoint and the `audio-features` endpoint. The
`tracks` endpoint takes 50 songs per request, and the `audio-features` endpoint takes 100. Interestingly, more than 70% of
the time spent was retrieving the `audio-features` data, despite it making up only ~23,000 of the requests. That means that
each `audio-features` request took nearly 5 times as long to complete as a `tracks` request (roughly 0.44 s per request compared
to roughly 0.09 s, dividing the time spent by the request counts above). I'm not
sure what caused this, but my best theory is that endpoints are prioritized by usage. Logically the `tracks`, `artists`, and
`playlists` endpoints would be the most used and thus have the highest priority in whatever queuing system Spotify uses. I considered
rate limiting as the cause, but several of these requests timed out outright, whereas Spotify signals rate limiting by sending
a 429 with a `Retry-After` header when you perform too many requests. For that reason I think prioritization is the much more
reasonable explanation.
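
The batching described above can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: the `send` callable is a hypothetical stand-in for whatever HTTP client performs the request, the helper names are made up for this example, and the 2.3 million song count is an assumed figure chosen because it is consistent with the request counts quoted above.

```python
import math
import time
from typing import Callable, Dict, Iterator, Sequence, Tuple

# Batch sizes quoted in the text for each endpoint.
TRACKS_BATCH_SIZE = 50
AUDIO_FEATURES_BATCH_SIZE = 100

# Hypothetical transport result: (HTTP status, response headers, parsed body).
Response = Tuple[int, Dict[str, str], object]


def batched(ids: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `ids`, each holding at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def requests_needed(n_songs: int, batch_size: int) -> int:
    """Number of requests needed to cover `n_songs` at `batch_size` per request."""
    return math.ceil(n_songs / batch_size)


def fetch_batch(
    send: Callable[[Sequence[str]], Response],
    batch: Sequence[str],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Send one batch, waiting out 429 responses using the Retry-After header."""
    for _ in range(max_attempts):
        status, headers, body = send(batch)
        if status == 429:
            # Rate limited: wait the number of seconds the server asked for.
            sleep(float(headers.get("Retry-After", "1")))
            continue
        if status != 200:
            raise RuntimeError(f"unexpected status {status}")
        return body
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Assumed song count (not stated exactly in the text).
ASSUMED_SONG_COUNT = 2_300_000
print(requests_needed(ASSUMED_SONG_COUNT, TRACKS_BATCH_SIZE))          # 46000
print(requests_needed(ASSUMED_SONG_COUNT, AUDIO_FEATURES_BATCH_SIZE))  # 23000
```

Passing `send` and `sleep` in as arguments keeps the sketch independent of any particular HTTP library and makes the retry path easy to exercise without touching the network.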


### Example Instance
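Here is an example instance from the dataset, keyed by the track's Spotify ID. Attributes are in the order
they appear above. Note that the instance holds 11 values: the fourth (226863) is not in the list above and
appears to be the track duration in milliseconds.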

```{.json}
{
    ...,
    "0UaMYEvWZi0ZqiDOoHU3YI": [
        3,
        "2wIVse2owClT7go1WT98tk",
        "2005",
        226863,
        true,
        0.904,
        0.813,
        125.461,
        -7.105,
        4,
        68
    ],
    ...
}
```
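
To make the positional layout easier to work with, an instance like the one above could be unpacked into named fields. The sketch below is only illustrative: the `Track` class and its field names are hypothetical, and `duration_ms` in particular is an assumption about what the fourth value represents.

```python
import json
from typing import NamedTuple


class Track(NamedTuple):
    """One dataset instance; field names are illustrative, not from the dataset."""
    num_artists: int
    primary_artist_id: str
    release_year: str      # stored as a string in the example instance
    duration_ms: int       # assumption: the fourth value looks like a duration in ms
    explicit: bool
    danceability: float
    energy: float
    tempo: float
    loudness: float
    time_signature: int
    popularity: int        # the attribute being predicted


# The example instance from above, without the elided neighbors.
raw = """
{
    "0UaMYEvWZi0ZqiDOoHU3YI": [
        3, "2wIVse2owClT7go1WT98tk", "2005", 226863, true,
        0.904, 0.813, 125.461, -7.105, 4, 68
    ]
}
"""

# Map each track ID to a Track built from its positional values.
dataset = {track_id: Track(*values) for track_id, values in json.loads(raw).items()}

track = dataset["0UaMYEvWZi0ZqiDOoHU3YI"]
print(track.popularity)         # 68
print(int(track.release_year))  # 2005
```

Because `Track(*values)` raises a `TypeError` when the number of values does not match the number of fields, this also acts as a cheap check that every instance has the expected shape.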

## Part 2: Initial Dataset Analysis