## Phase I Project Proposal
### What Makes Music Popular?

#### Name: Eric Gerber, DS 3000

**NOTE**: This is an example project proposal; as such, no student may work on this particular topic (sorry).

### Introduction

What makes an artist's music popular? Many features might affect the popularity of artists and their songs. I'm interested in examining whether features such as a song's tempo, length, or energy make an artist (or their song) more or less likely to be popular. I would also like to see whether certain features can help me predict a song's genre. Both questions have practical uses: answering the first could let me recommend aspects of music creation that may increase an artist's popularity, and answering the second could let me recommend songs with no genre label to listeners, based on the predicted genre and each listener's genre preferences. Other interesting questions could also be addressed given enough time and data.

### Data Collection

I plan to use Spotipy, a Python library for Spotify's Web API, to collect data on the top 50 songs from the Today's Hits playlist. These are popular recent songs, so they give me the most up-to-date information relevant to my questions of interest. Spotipy is fairly easy to use, and I demonstrate below how I can read in the relevant data (even if it is not completely clean):

**Note:** the code below requires Spotify credentials, including a secret that must not be shared. If you need to run the code yourself, you can either create your own free account and get your own secret, or ask me to demonstrate its usage in office hours. However, I have given this Jupyter Notebook a fresh Restart & Run All, and the output below should serve as proof that I have access to the data. Just in case, I have also saved the data as a .csv file so it can be read into Python in the future, in the unlikely event the code below stops working.
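
As a minimal sketch of the .csv fallback mentioned in the note above, the saved file could be read back with the standard library's `csv` module and checked for at least two numeric features and one categorical feature. The two sample rows use a subset of columns copied from the output shown later in this notebook. The in-memory `SAMPLE_CSV` only stands in for the saved file, whose name is not given here, and `load_rows`, `is_number`, and `split_feature_types` are hypothetical helper names. In the notebook itself, `pandas.read_csv` would probably be the more natural choice.

```python
import csv
import io

# Stand-in for the saved .csv file (a few columns from the first two rows of the data)
SAMPLE_CSV = """danceability,energy,tempo,song_title,artist_name,track_pop
0.521,0.592,157.969,Die With A Smile,Lady Gaga,100
0.754,0.514,114.997,It's ok I'm ok,Tate McRae,85
"""


def load_rows(handle):
    """Read every CSV row into a dict keyed by column name."""
    return list(csv.DictReader(handle))


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def split_feature_types(rows):
    """Split column names into numeric and categorical, based on their values."""
    numeric, categorical = [], []
    for column in rows[0]:
        if all(is_number(row[column]) for row in rows):
            numeric.append(column)
        else:
            categorical.append(column)
    return numeric, categorical


# With a real file, pass open(<path to the saved .csv>, newline="") instead of StringIO
rows = load_rows(io.StringIO(SAMPLE_CSV))
numeric, categorical = split_feature_types(rows)
print("numeric:", numeric)
print("categorical:", categorical)
```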

```python
# Get the API and load the credentials to access it
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
# Import the secret from a file that is not to be shared
from spotify_secret import secret

cid = '592acf2d2dc84d94bbc652f2f1d72375'

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# The Today's Hits playlist contains only fifty songs, so the second request below (offset=100) is unnecessary right now, but I might grab a different, longer playlist later
# If a playlist has more than 100 songs, I need to use the "offset" argument of .playlist_tracks() to access them
playlist_link = "https://open.spotify.com/playlist/37i9dQZF1DXcBWIGoYBM5M"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]
track_uris0 = [x["track"]["uri"] for x in sp.playlist_tracks(playlist_URI, offset=0)["items"] if x.get("track") and x["track"].get("uri")]
track_uris1 = [x["track"]["uri"] for x in sp.playlist_tracks(playlist_URI, offset=100)["items"] if x.get("track") and x["track"].get("uri")]

# Putting the track URIs together into the same list (only necessary with > 100 tracks)
track_uris = track_uris0 + track_uris1

# Setting up the empty dictionary for track info
playlist_dict = {'track_uri': [],
                 'track_name': [],
                 'artist_name': [],
                 'artist_pop': [],
                 'artist_genres': [],
                 'track_pop': []}

# I am going to loop through all the tracks and save their info in the dictionary
## Dr. Gerber Note:
## Because this proposal is due before HW 2 this semester (Fall 2024), I am not including the code used to create the data frame below
## (it should still show up when you first open this Jupyter Notebook)
## I may update it after HW 2 is due so you can see what I did, but just know that you DON'T have to do as much as I did;
## You do not even need to have the data in a data frame for this proposal, just read it into Python with some proof that there are at least:
## TWO numeric features and ONE categorical feature
```

| Unnamed: 0 | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | song_title | artist_name | artist_pop | track_pop | artist_genres |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.521 | 0.592 | 6 | -7.777 | 0 | 0.0304 | 0.308 | 0.0 | 0.122 | 0.535 | 157.969 | 251668 | Die With A Smile | Lady Gaga | 90 | 100 | [art pop, dance pop, pop] |
| 1 | 0.754 | 0.514 | 1 | -7.721 | 0 | 0.0471 | 0.0257 | 4e-05 | 0.0808 | 0.363 | 114.997 | 156522 | It's ok I'm ok | Tate McRae | 84 | 85 | [pop] |
| 2 | 0.67 | 0.91 | 0 | -4.07 | 0 | 0.0634 | 0.0939 | 0.0 | 0.304 | 0.786 | 112.966 | 157280 | Taste | Sabrina Carpenter | 95 | 83 | [pop] |
| 3 | 0.7 | 0.582 | 11 | -5.96 | 0 | 0.0356 | 0.0502 | 0.0 | 0.0881 | 0.785 | 116.712 | 218424 | Good Luck, Babe! | Chappell Roan | 90 | 97 | [indie pop] |
| 4 | 0.776 | 0.667 | 7 | -6.622 | 1 | 0.0983 | 0.0146 | 0.3 | 0.0761 | 0.618 | 130.019 | 145219 | Guess featuring Billie Eilish | Charli xcx | 88 | 92 | [art pop, candy pop, metropopolis, pop, uk pop] |
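
The comments in the code above note that playlists longer than 100 tracks need the `offset` argument. A minimal sketch of that paging loop follows, assuming each page is a dict with an `"items"` list shaped like the results used above. `collect_track_uris` and `fetch_page` are hypothetical names, and the fake page function only stands in for `sp.playlist_tracks` so the loop can run without credentials.

```python
def collect_track_uris(fetch_page, page_size=100):
    """Collect track URIs page by page until an empty page comes back.

    fetch_page(offset) is assumed to return a dict with an "items" list,
    shaped like the result of sp.playlist_tracks(playlist_URI, offset=offset).
    """
    uris = []
    offset = 0
    while True:
        items = fetch_page(offset)["items"]
        if not items:
            break
        for item in items:
            track = item.get("track")
            # Skip entries with no track or no URI, as the list comprehensions above do
            if track and track.get("uri"):
                uris.append(track["uri"])
        offset += page_size
    return uris


# Fake 120-track playlist standing in for the Spotify API (illustration only)
fake_playlist = [{"track": {"uri": f"spotify:track:{i}"}} for i in range(120)]
fake_playlist[5] = {"track": None}  # an entry with no track, which should be skipped


def fake_fetch_page(offset):
    """Return one page of at most 100 items starting at the given offset."""
    return {"items": fake_playlist[offset:offset + 100]}


track_uris = collect_track_uris(fake_fetch_page)
print(len(track_uris))
```

With real credentials, the call would presumably look like `collect_track_uris(lambda offset: sp.playlist_tracks(playlist_URI, offset=offset))`, which replaces the separate `track_uris0` and `track_uris1` requests for a playlist of any length.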


### Data Usage and Remaining Issues

The data set above is mostly clean already, but some issues remain. Mainly, each track's artist genres are stored as a list inside a single column of the data frame. This could be fixed either by creating a column for each genre and labelling each track 1 or 0 under it, or by simply picking the first (or most appropriate) genre; I still need to decide which approach to take. However, I have plenty of numeric features, including the main popularity scores (for both artist and track), that may be useful in answering my first question, and genre, once it is cleaned, should let me answer my second. While we have not covered any ML models in class yet, I've read about supervised machine learning, and both of my questions seem like they could reasonably be answered with either regression (predicting a numeric feature, like popularity) or classification (predicting a categorical feature, like genre). Some unsupervised ML techniques may also help me characterize and understand the data a bit more, but I'm less familiar with those and would have to investigate further.
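
The two genre-cleaning options described above can be sketched in plain Python. This assumes the genres arrive as strings such as `[art pop, dance pop, pop]`, as they appear in the output above; in the live data frame they may already be Python lists, in which case the parsing step is unnecessary. `parse_genres`, `first_genre`, and `one_hot_genres` are hypothetical helper names, not part of the original notebook.

```python
def parse_genres(raw):
    """Turn a string such as "[art pop, dance pop, pop]" into a list of genres."""
    inner = raw.strip().strip("[]")
    return [genre.strip() for genre in inner.split(",") if genre.strip()]


def first_genre(genres):
    """Option 1: keep only the first listed genre (None if the list is empty)."""
    return genres[0] if genres else None


def one_hot_genres(genre_lists):
    """Option 2: build one 0/1 indicator per genre for every track."""
    all_genres = sorted({genre for genres in genre_lists for genre in genres})
    return [{genre: int(genre in genres) for genre in all_genres}
            for genres in genre_lists]


# Genre strings copied from the first three rows of the output above
raw_genres = ["[art pop, dance pop, pop]", "[pop]", "[pop]"]
genre_lists = [parse_genres(raw) for raw in raw_genres]

print([first_genre(genres) for genres in genre_lists])
print(one_hot_genres(genre_lists)[0])
```

Keeping only the first genre yields a single categorical label per track, which is convenient as a classification target, while the indicator columns keep every genre at the cost of many sparse columns.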