# Spotify Popularity Predictor
Ben Howard and Elizabeth Larson

This will be a model that predicts the popularity of a song on Spotify.








## The Dataset
Our dataset will combine a Spotify-provided dataset with data that we generate ourselves. The base dataset is the Spotify Million Playlist Dataset, which contains 1,000,000 playlists and their songs. Here is an example of what a playlist entry looks like:

```
{
    "name": "2013/2014",
    "num_holdouts": 174,
    "pid": 1000323,
    "num_tracks": 199,
    "tracks": [
        {
            "pos": 0,
            "artist_name": "Of Monsters and Men",
            "track_uri": "spotify:track:2ihCaVdNZmnHZWt0fvAM7B",
            "artist_uri": "spotify:artist:4dwdTW1Lfiq0cM8nBAqIIz",
            "track_name": "Little Talks",
            "album_uri": "spotify:album:4p9dVvZDaZliSjTCbFRhJy",
            "duration_ms": 266600,
            "album_name": "My Head Is An Animal"
        },
        {
            "pos": 1,
            "artist_name": "Ellie Goulding",
            "track_uri": "spotify:track:7C7yqFTM0ncyJ04GIKrxdV",
            "artist_uri": "spotify:artist:0X2BH1fck6amBIoJhDVmmJ",
            "track_name": "Anything Could Happen",
            "album_uri": "spotify:album:4754Cgv1sdfwTpdVX83xAC",
            "duration_ms": 286322,
            "album_name": "Halcyon"
        },
        {
            "pos": 2,
            "artist_name": "The Wanted",
            "track_uri": "spotify:track:1CRY4X2b8X7X10EUdPUutw",
            "artist_uri": "spotify:artist:2NhdGz9EDv2FeUw6udu2g1",
            "track_name": "I Found You",
            "album_uri": "spotify:album:3wLINTYfZERHv3w5pXZLdK",
            "duration_ms": 238613,
            "album_name": "Word Of Mouth"
        }
    ],
    "num_samples": 2
}
```

As you can see, the playlist dataset carries little metadata about the tracks themselves. To get more information, we will query the Spotify API for each of the ~2 million songs listed and build a second dataset from the results. This data will include some info that the playlist dataset does not. The number of plays, which is the value we predict, will be binned into ranges. All of this will be parsed into one or more JSON files (depending on size).
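As a rough sketch of the first step, the unique track IDs can be pulled out of the playlist entries and grouped into batches for multi-track API requests. The helper names and the batch size of 50 are assumptions for illustration; the actual per-request cap should be checked against the Spotify Web API documentation.

```python
from itertools import islice

# Assumed batch size; confirm the real per-request limit in the Spotify docs.
BATCH_SIZE = 50


def unique_track_ids(playlists):
    """Collect distinct track IDs across playlists, in first-seen order."""
    seen = {}
    for playlist in playlists:
        for track in playlist["tracks"]:
            # "spotify:track:<id>" -> "<id>"
            track_id = track["track_uri"].rsplit(":", 1)[-1]
            seen.setdefault(track_id, None)
    return list(seen)


def batched(ids, size=BATCH_SIZE):
    """Yield consecutive lists of at most `size` IDs."""
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each batch would then become one API request, which keeps the request count far below one per song.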

### Attributes Used For Prediction
 - Artist(s)
 - Number of Playlists the song appears on
 - Genre
 - Release Year
 - Has Features
 - Availability (the regions in which it is available)
 - Track Length
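
Of these attributes, the number of playlists a song appears on can be derived directly from the playlist dataset rather than the API. A minimal sketch, assuming playlist entries shaped like the example above (the function name is hypothetical):

```python
from collections import Counter


def playlist_counts(playlists):
    """Map each track URI to the number of playlists that contain it."""
    counts = Counter()
    for playlist in playlists:
        # Use a set so a track repeated inside one playlist is counted once.
        uris = {track["track_uri"] for track in playlist["tracks"]}
        counts.update(uris)
    return counts
```

Whether a track repeated within a single playlist should count once or several times is a design choice; this sketch counts it once.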

### Attribute Predicted
 - Song Popularity (number of streams)
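
Since the stream count will be binned into ranges, the class label can be computed with a sorted list of bin edges. The edges below are placeholders; the actual cut points have not been decided here.

```python
from bisect import bisect_right

# Hypothetical bin edges (stream counts), for illustration only.
BIN_EDGES = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]


def popularity_bin(streams, edges=BIN_EDGES):
    """Return the bin index for a stream count.

    Bin 0 holds counts below edges[0]; bin len(edges) holds counts
    at or above edges[-1].
    """
    return bisect_right(edges, streams)
```

With these placeholder edges, a song with 5,000,000 streams falls in bin 4.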

## Implementation
When it comes to data storage, the enormous size of the dataset means it makes sense to abandon the `MyPyTable` structure, since all lookups would be O(n). Instead we will use Python's `dict`, which is implemented as a hash map with O(1) average lookup time and O(n) worst case. The key will be the Spotify track ID, and the value will be a list of that track's attributes. JSON will definitely take up more space on disk, and dicts take up more memory, but the trade-off is worth it. The JSON object would look something like this:

```
{
    "spotify:track:1CRY4X2b8X7X10EUdPUutw" : [ "2NhdGz9EDv2FeUw6udu2g1", 32,0,2017,false,286322,3 ]
    cont...
}
```

Since we are storing it this way, you will look up a track via our API using only the Spotify track ID. If we are feeling fancy, we may add lookups by other parameters, but this seems easiest.
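
A minimal sketch of that lookup, assuming the JSON has been loaded into a `dict` keyed by track URI as in the example object. The function names are hypothetical, and the meaning of each position in the value list is not fixed by this document.

```python
import json

# One-entry sample mirroring the example object above.
SAMPLE = (
    '{"spotify:track:1CRY4X2b8X7X10EUdPUutw": '
    '["2NhdGz9EDv2FeUw6udu2g1", 32, 0, 2017, false, 286322, 3]}'
)


def load_tracks(text):
    """Parse the JSON text into a dict keyed by track URI."""
    return json.loads(text)


def lookup(tracks, track_id):
    """Return the attribute list for a track, or None if it is unknown.

    Accepts either a bare ID or a full "spotify:track:<id>" URI.
    """
    prefix = "spotify:track:"
    key = track_id if track_id.startswith(prefix) else prefix + track_id
    return tracks.get(key)


tracks = load_tracks(SAMPLE)
row = lookup(tracks, "1CRY4X2b8X7X10EUdPUutw")
```

Using `dict.get` means an unknown ID returns `None` instead of raising, which an API layer could translate into a "not found" response.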

## Potential Challenges
The size of the dataset is significant at around 32GB uncompressed (this is why we aren't currently pushing the dataset in its raw state). As we mentioned, much of the data will need to be obtained from the Spotify API. The API is rate limited, but it does allow you to get information about multiple tracks in one request. We estimate that this will take ~2.5 hours to complete. This shouldn't be all that bad, but it does mean that any time we want to add attributes, we will have to repeat the process. Thankfully we have access to a powerful machine with a good internet connection, but that machine is also being used for other things. It only has 32GB of RAM, meaning that when generating the dataset, we will need to save it to multiple files initially to determine whether the final set is a suitable size to operate on. The way we plan on storing it, each song shouldn't take more than around 90 bytes, meaning that theoretically we should only have about 200MiB of data, which will hopefully be under the Heroku storage limit of 512GB. If it is not, one of us can simply host the service for a little while.
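
The figures above can be sanity-checked with a few lines of arithmetic. The song count, bytes per song, and 2.5 hour estimate come from this document; the batch size of 50 tracks per request is an assumption, so the implied request rate is only indicative.

```python
import math

SONGS = 2_000_000        # "~2 million songs" from the dataset section
BYTES_PER_SONG = 90      # upper estimate per stored song
ESTIMATED_HOURS = 2.5    # estimated total fetch time
ASSUMED_BATCH_SIZE = 50  # assumption: tracks fetched per API request


def estimated_size_mib(songs=SONGS, bytes_per_song=BYTES_PER_SONG):
    """Total stored size in MiB (1 MiB = 2**20 bytes)."""
    return songs * bytes_per_song / 2**20


def requests_needed(songs=SONGS, batch_size=ASSUMED_BATCH_SIZE):
    """Number of API requests if every request is a full batch."""
    return math.ceil(songs / batch_size)


def implied_requests_per_second(hours=ESTIMATED_HOURS):
    """Request rate needed to finish within the estimated time."""
    return requests_needed() / (hours * 3600)


print(f"{estimated_size_mib():.1f} MiB")  # 171.7 MiB
print(requests_needed())                  # 40000
```

At 90 bytes per song the total comes to roughly 172 MiB, which is consistent with the "about 200MiB" figure.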

Another limitation is that any testing or development of the model that isn't done on the desktop computer will need to use a subset of the data, which might make the model more difficult to tune. Furthermore, the EDA will be a pain to develop given the significant overhead of Jupyter. We may end up needing to work from a subset of the data and/or generate results for the entire dataset ahead of time.
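
One way to produce such a subset is a seeded random sample of the track dictionary, so that every machine works from the same tracks. This is only a sketch; the function name and default seed are hypothetical.

```python
import random


def sample_subset(tracks, k, seed=0):
    """Return a reproducible random subset of up to k entries."""
    rng = random.Random(seed)
    # Sort the keys first so the sample does not depend on insertion order.
    keys = rng.sample(sorted(tracks), min(k, len(tracks)))
    return {key: tracks[key] for key in keys}
```

A uniform sample may under-represent very popular songs, so a stratified sample per popularity bin could be worth considering if tuning on the subset proves unreliable.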


## Audience/Impact
Creating playlists is a popular feature on Spotify. Users can make playlists as personal and as creative as they wish. Playlists can be used to group and organize music (e.g. sorting by decade, genre, mood, favorite songs, etc.). If a song is particularly popular, it will likely be found on multiple playlists. The ability to predict the popularity of songs gives Spotify playlist makers and listeners some new ideas for their own playlist creation. 

The primary stakeholders of this project are Spotify users. Listeners can stay in the loop about the most popular tracks on the platform, leading them to add to their playlists, create new ones, or simply learn about a good song. Artists also benefit, because they can keep track of their most popular songs (aka the fan favorites). Spotify itself benefits from this project, because these predictions allow it to track popular songs on its platform. Other music streaming platforms are potential stakeholders, say, if they were to implement a similar playlist-centered format.



