# Song Feature and Lyric Analysis
## Benjamin Fenison and Stephanie Myott-Beebe
#### (SIADS 591/592: Milestone I Project)

## Overview and Motivation

Our project uses audio features and lyrics to find similarities within sets of songs. The inspiration behind the project is a tool to give the user insight into creating a song with a similar *sound* to the input song(s). For example, a musician looking to write a hit R&B song may be interested in common features and language found in current chart-topping R&B songs.

For purposes of this project, we analyzed the top songs for the following genres: country, R&B/hip-hop, and rock/alternative. The hit songs for the week of May 15, 2021, as determined by the Billboard Top 100 charts, are:

> **Country:** 'Forever After All' by Luke Combs, 'The Good Ones' by Gabby Barrett, 'Made for You' by Jake Owen, 'Hell of a View' by Eric Church, 'Breaking Up Was Easy in the 90s' by Sam Hunt

> **R&B/Hip-Hop:** 'Leave the Door Open' by Silk Sonic (Bruno Mars & Anderson .Paak), 'Peaches' by Justin Bieber ft. Daniel Caesar & Giveon, 'Rapstar' by Polo G, 'Astronaut in the Ocean' by Masked Wolf, 'Up' by Cardi B 

> **Rock/Alternative:** 'Without You' by The Kid LAROI & Miley Cyrus, 'Your Power' by Billie Eilish, 'My Ex's Best Friend' by Machine Gun Kelly X blackbear, 'Mood' by 24kGoldn ft. iann dior, 'Therefore I Am' by Billie Eilish

We determined that we needed at least 1,000 songs similar to the input songs in order to find meaningful insight. Our general process for obtaining the songs and their audio features and lyrics (as shown in the spotify_analysis.ipynb and lyric_analysis.ipynb notebooks, respectively) was as follows:

1. **Seed tracks:** The seed tracks are those songs to which the user wants to draw comparisons.
 
1. **Spotify API's recommendation algorithm:** We ran the seed tracks through Spotify API's recommender algorithm (see the "Data Manipulation Methods" section, below) to identify similar songs. The algorithm returns **at most** 100 songs, thus requiring us to run the algorithm a number of times. After running the seed tracks through the algorithm, we used Euclidean distance to rank the similarity of the output songs to the seed tracks based on certain audio features. We ran the top-ranked songs one-by-one through the algorithm and filtered out repeats until we had at least 1,000 unique songs per set of seed tracks.

1. **Audio features:** We used the Spotify API to download audio features for each song, including, but not limited to, acousticness, danceability, and energy.

1. **Lyrics:** We used the Genius API to obtain lyrics for each song.

1. **Natural Language Processing:** We cleaned and preprocessed the lyrics to prepare them for certain natural language processing analyses.

1. **Dataframe:** We joined the data obtained from the Spotify API and the Genius API on 'track' and 'artist' into a comprehensive pandas dataframe for further analysis.

1. **Analysis:** We analyzed the audio features and structures of the songs and used natural language processing to analyze the lyrics.

Our ultimate goal is to translate the work from this project into a resource for budding musicians, to provide guidance in song creation.

## Data Sources

### Spotify API

We used the Spotify API to obtain songs similar to the seed tracks as well as the songs' audio features. The API's properties are:

> **Location:** The base address for the API is https://api.spotify.com; the documentation is at https://developer.spotify.com. We used the Python wrapper Spotipy (https://spotipy.readthedocs.io) to connect to and interact with the API. A user must have credentials, namely a client ID (cid) and a client secret (secret), in order to use the API.
 
> **Format:** The API returns a JSON-formatted response containing song features. Spotipy allows the user to access these features using Python. The most common data types are string, e.g., song name or artist, and float, e.g., audio features.
 
> **Important Variables:** The API's recommender algorithm allowed us to retrieve songs with similar audio features to the seed tracks, though the algorithm itself is somewhat of a black box (see the "Data Manipulation Methods" section, below).
> 
> For our purposes, the key features in the data are values that Spotify produces to describe certain aspects of a song, including:
> 
> - *Acousticness*: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
> - *Danceability*: How suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. 
> - *Energy*: A perceptual measure of intensity and activity from 0.0 to 1.0. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
> - *Liveness*: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
> - *Speechiness*: Detects the presence of spoken words in a track. The more exclusively speech-like the recording, e.g. talk show, audio book, or poetry, the closer to 1.0 the attribute value.
> - *Tempo*: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
> - *Valence*: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive, e.g., happy, cheerful, or euphoric, while tracks with low valence sound more negative, e.g., sad, depressed, or angry.
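
To compare songs on these features, each song can be reduced to a numeric vector. The following is a minimal sketch, not the code in the project notebooks. It assumes one audio-features record per song in the form of a dictionary keyed by the lowercase feature names. Because tempo is measured in BPM while the other features range from 0.0 to 1.0, the sketch optionally rescales tempo; the bounds `TEMPO_MIN` and `TEMPO_MAX` are hypothetical values chosen for illustration.

```python
# Feature names as listed above, in lowercase, used as dictionary keys.
FEATURES = ["acousticness", "danceability", "energy", "liveness",
            "speechiness", "tempo", "valence"]

# Hypothetical bounds used only to rescale tempo (assumption, not from the notebooks).
TEMPO_MIN, TEMPO_MAX = 0.0, 250.0


def feature_vector(audio_features, rescale_tempo=True):
    """Turn one audio-features dict into a list of floats in FEATURES order."""
    vector = []
    for name in FEATURES:
        value = float(audio_features[name])
        if name == "tempo" and rescale_tempo:
            # Clamp to the assumed bounds, then min-max scale BPM onto 0.0-1.0.
            value = min(max(value, TEMPO_MIN), TEMPO_MAX)
            value = (value - TEMPO_MIN) / (TEMPO_MAX - TEMPO_MIN)
        vector.append(value)
    return vector


# Made-up values for illustration only (not a real track).
example = {"acousticness": 0.12, "danceability": 0.70, "energy": 0.65,
           "liveness": 0.10, "speechiness": 0.05, "tempo": 125.0,
           "valence": 0.40, "id": "hypothetical-track-id"}
print(feature_vector(example))
```

Without some rescaling, tempo would dominate any distance computed across the features, since its raw values are two orders of magnitude larger than the others.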

> **Records Retrieved:** We retrieved the following numbers of records for each genre, which include the five seed tracks for the genre plus the songs deemed by the Spotify API to be most similar:
> - **Country:** 1,026 songs
> - **R&B/Hip-Hop:** 1,014 songs
> - **Rock/Alternative:** 1,075 songs
 

### Genius API

We used the Genius API to obtain song lyrics. The API's properties are:

> **Location:** https://docs.genius.com. We used the Python wrapper LyricsGenius (https://lyricsgenius.readthedocs.io) to connect to and interact with the API. A user must have a client access token in order to use the API.  
 
> **Format:** The API returns a JSON-formatted response containing information about a track. LyricsGenius adds the track's lyrics, which it scrapes from the song's Genius page (see the "Data Manipulation Methods" section, below), and allows the user to access the result using Python. The most common data type is string, e.g., song name, artist, or lyrics.
 
> **Important Variables:** The lyrics for a given song, which are in a single string, including section and line breaks.
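
Because the lyrics arrive as one string, a first processing step is usually to split that string into sections and lines. The sketch below is illustrative only and is not the notebook code. It assumes that section headers appear on their own line in square brackets (for example `[Chorus]`), which is the usual convention on Genius but is not guaranteed for every song; the function name `split_sections` is hypothetical.

```python
import re

# Assumption: a section header is a line consisting only of "[...]".
SECTION_HEADER = re.compile(r"^\[(?P<label>[^\]]+)\]\s*$")


def split_sections(lyrics):
    """Return a list of (label, lines) pairs in order of appearance.

    Lines before the first header are grouped under "Untitled".
    Headers with no lyric lines beneath them are dropped.
    """
    sections = []
    label, lines = "Untitled", []
    for raw_line in lyrics.splitlines():
        line = raw_line.strip()
        match = SECTION_HEADER.match(line)
        if match:
            if lines:
                sections.append((label, lines))
            label, lines = match.group("label"), []
        elif line:  # skip blank lines between sections
            lines.append(line)
    if lines:
        sections.append((label, lines))
    return sections


# Placeholder text, not real lyrics.
sample = "[Verse 1]\nfirst line\nsecond line\n\n[Chorus]\nhook line\nhook line\n"
for label, lines in split_sections(sample):
    print(label, len(lines))
```

Keeping the section labels alongside the lines leaves room for both word-level analysis and analysis of song structure.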
 
> **Records Retrieved:** We retrieved the following numbers of records for each genre, including the five seed tracks for the genre plus the songs deemed by the Spotify API to be most similar, and for which the Genius API had lyrics:
> - **Country:** 1,020 songs
> - **R&B/Hip-Hop:** 1,009 songs
> - **Rock/Alternative:** 1,071 songs

## Data Manipulation Methods

### Spotify API

Our project is limited to five seed tracks because Spotify's algorithm accepts at most five seeds per request, in any combination of songs, artists, and genres.
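
A request of this kind might look like the following sketch, which assumes a Spotipy client is passed in as `sp` and uses Spotipy's `recommendations` method. The helper name `recommend` and the constants are hypothetical, and the notebook code may differ.

```python
MAX_SEEDS = 5      # Spotify accepts at most five seeds per request.
MAX_RESULTS = 100  # The most songs a single request returns.


def recommend(sp, seed_track_ids, limit=MAX_RESULTS):
    """Ask Spotify for up to `limit` tracks similar to the seed tracks.

    `sp` is expected to be a `spotipy.Spotify` client. Returns a list of
    (track id, track name, first artist name) tuples.
    """
    seeds = list(seed_track_ids)
    if not 1 <= len(seeds) <= MAX_SEEDS:
        raise ValueError(f"expected 1 to {MAX_SEEDS} seed tracks, got {len(seeds)}")
    response = sp.recommendations(seed_tracks=seeds, limit=min(limit, MAX_RESULTS))
    return [(track["id"], track["name"], track["artists"][0]["name"])
            for track in response["tracks"]]


# Usage (requires the spotipy package and valid credentials):
#   import spotipy
#   from spotipy.oauth2 import SpotifyClientCredentials
#   sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
#       client_id=cid, client_secret=secret))
#   songs = recommend(sp, ["<seed track id>"])
```

Validating the seed count before the request makes the five-seed limit explicit in the code rather than leaving it to an API error.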

Every time the recommendations function is run, the API returns up to 100 songs similar to the input songs, based on Spotify's BaRT algorithm. Based on our research, the algorithm uses three functions:

 1. Natural language processing
 2. Raw audio analysis
 3. Collaborative filtering

However, we do not know the exact manner in which the analysis is performed, the threshold that must be met in order for the algorithm to return a song as being similar to the seed tracks, or the ranking of similarity of those songs that are returned. This is evident from the fact that the algorithm does not return the same songs each time it is run. As a result, the analysis is not exactly reproducible.

We used Spotify's algorithm to pull similar songs but performed some analysis of our own to increase our list of similar songs (in part, for purposes of lyric analysis).

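1. **Rank similarity to seed tracks:** Given that the recommendation algorithm returns at most 100 songs, we needed to run the algorithm a number of times. To determine which songs to run through the algorithm next, we calculated the Euclidean distance between each returned song and the seed tracks, based on certain audio features, and ranked the returned songs by that distance.

1. **Expand the sample:** In order to obtain a large enough sample of songs, the top X songs returned from the previous query were each run through Spotify's recommendation algorithm. We filtered out repeats and continued until we had at least 1,000 unique songs per set of seed tracks.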
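
The ranking and expansion steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the notebook code: similarity is measured as the Euclidean distance to the centroid (mean) of the seed tracks' feature vectors, and `recommend`, `features`, and `top_x` are hypothetical stand-ins for the recommendation call, the audio-feature lookup, and the "top X" cutoff.

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rank_by_distance(seed_vectors, candidates):
    """Sort candidate track ids from most to least similar to the seeds.

    `candidates` maps a track id to its feature vector. Similarity is the
    Euclidean distance to the centroid of the seed vectors (an assumption).
    """
    center = centroid(seed_vectors)
    scored = [(math.dist(center, vector), track_id)
              for track_id, vector in candidates.items()]
    return [track_id for _, track_id in sorted(scored)]


def expand(seed_ids, recommend, features, target=1000, top_x=10):
    """Grow a list of unique track ids until it holds at least `target` ids.

    `recommend(ids)` returns similar track ids and `features(track_id)`
    returns a feature vector; both are hypothetical callables.
    """
    seed_vectors = [features(seed) for seed in seed_ids]
    collected = list(seed_ids)
    seen = set(seed_ids)
    queue = [list(seed_ids)]  # the first request uses all seeds together
    while queue and len(collected) < target:
        request = queue.pop(0)
        new_ids = []
        for track_id in recommend(request):
            if track_id not in seen:  # filter out repeats
                seen.add(track_id)
                new_ids.append(track_id)
        collected.extend(new_ids)
        ranked = rank_by_distance(seed_vectors,
                                  {t: features(t) for t in new_ids})
        # Each of the closest new songs becomes a one-song request.
        queue.extend([t] for t in ranked[:top_x])
    return collected
```

The loop stops either when the target is reached or when no unseen songs remain to query, so the result can fall short of the target if the recommendations stop producing new songs.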

***INSERT WHY WE CHOSE EUCLIDEAN DISTANCE***

### Genius API

- The Genius API does not have a mechanism for returning the lyrics of a song.
- We are using the LyricsGenius package to interface with the Genius API. LyricsGenius uses the BeautifulSoup library to scrape song lyrics from a page's HTML.
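
A lookup through LyricsGenius might look like the sketch below. It assumes a `lyricsgenius.Genius` client is passed in as `genius` and relies on its `search_song` method, which returns `None` when no match is found. The helper names `fetch_lyrics` and `fetch_all` are hypothetical, and the notebook code may differ.

```python
def fetch_lyrics(genius, track, artist):
    """Return the lyrics string for a song, or None if Genius has no match.

    `genius` is expected to be a `lyricsgenius.Genius` client.
    """
    song = genius.search_song(track, artist)
    return song.lyrics if song is not None else None


def fetch_all(genius, songs):
    """Map each (track, artist) pair to its lyrics, skipping songs not found."""
    lyrics = {}
    for track, artist in songs:
        text = fetch_lyrics(genius, track, artist)
        if text is not None:
            lyrics[(track, artist)] = text
    return lyrics


# Usage (requires the lyricsgenius package and a client access token):
#   import lyricsgenius
#   genius = lyricsgenius.Genius(token)
#   lyrics = fetch_all(genius, [("Your Power", "Billie Eilish")])
```

Keying the result on track and artist matches the columns used to join the lyrics with the Spotify data, and skipping unmatched songs is one reason the lyric record counts can be lower than the Spotify record counts.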

## Analysis and Visualization

### Word Count

### N-Grams

### Sentiment Analysis

### Song Structure

### Audio Features

## Statement of Work