# Spotify Data Visualization

**Acknowledgement:** I'd like to thank all the Kaggle users who have shared notebooks on this dataset. Your insights and model building helped me find the right graphs to tell a great story.

## Introduction

*Spotify is one of the newest innovations in how people listen to and experience audio. With over 125 million subscribers, Spotify leads Apple Music and Amazon Music in the audio streaming market, and that figure does not even include Spotify's free-tier users. The service has also recently begun expanding significantly into podcasts and audiobooks to broaden the audio services it offers. Any musician or other content creator must therefore be aware of the trends and directions Spotify listeners are moving in to compete in this immensely competitive and growing market.*

![Spotify Label](http://www.electronicbeats.net/app/uploads/2018/01/jqbx-spotify-shared-listening.jpeg)

Because of this huge revolution in how music is listened to, we can now easily obtain data on what people are listening to and, potentially, shape the way we approach marketing, mixing, and even the creative process of making music.

### Purpose: This project uses the data to build a regression on the song attributes that correlate with popularity. The aim is to guide marketing, mixing, and songwriting decisions to increase the number of streams a given artist may receive.

**The data used in this project was collected from Spotify's Web API. Behind it is essentially an algorithm Spotify uses to estimate various aspects of an audio file. Below, I list the attributes given to every song, as described on Spotify's developer page: https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/**


* **duration_ms** - The duration of the track in milliseconds.


* **key** - The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.


* **mode** - Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.


* **time_signature** - An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).


* **acousticness** - A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.


* **danceability** - Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.


* **energy** - Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. 


* **instrumentalness** - Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.


* **liveness** - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.


* **loudness** - The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.


* **speechiness** - Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.


* **valence** - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).


* **tempo** - The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.


* **id** - The Spotify ID for the track.


* **type** - The object type: “audio_features”


* **popularity** - The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity is calculated by an algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Artist and album popularity is derived mathematically from track popularity. Note that the popularity value may lag actual popularity by a few days: the value is not updated in real time.
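To make the `key` and `mode` encodings above concrete, here is a small sketch that turns the two integers into a readable key name. The helper `describe_key` and the `PITCH_CLASSES` list are hypothetical names introduced for illustration; they are not part of the Spotify API or of the dataset.

```python
# Pitch-class names for the integer `key` field (0 = C, 1 = C♯/D♭, 2 = D, ...).
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def describe_key(key: int, mode: int) -> str:
    """Return a readable key name such as 'A minor'.

    Hypothetical helper for illustration only.
    """
    if key == -1:
        return "unknown"  # -1 means no key was detected
    if not 0 <= key <= 11:
        raise ValueError(f"key must be -1 or in 0..11, got {key}")
    scale = "major" if mode == 1 else "minor"  # mode: 1 = major, 0 = minor
    return f"{PITCH_CLASSES[key]} {scale}"


print(describe_key(0, 1))  # C major
print(describe_key(9, 0))  # A minor
```

With the dataset loaded, the same helper could be applied row by row, for example with `data.apply(lambda row: describe_key(row["key"], row["mode"]), axis=1)`.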

# Libraries required

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Imported Datasets

* data.csv
* data_by_artist.csv
* data_by_genres.csv
* data_by_year.csv
* data_w_genres.csv

In [None]:
data = pd.read_csv('../input/spotify-dataset-19212020-160k-tracks/data.csv')

In [None]:
data.head()

It looks like the general "data" set records each song along with its various attributes.

In [None]:
data_artist = pd.read_csv('../input/spotify-dataset-19212020-160k-tracks/data_by_artist.csv')

In [None]:
data_artist.head(50)

It can be a little difficult to determine whether a name is a mistake or the actual artist name. For example, "+44" may look like a coding mistake, but is actually a very popular band. The dataset creator has also clarified how each column is aggregated:

"Mean values: acousticness, danceability, energy, valence, instrumentalness, speechiness, tempo, loudness, duration_ms, liveness, popularity Mode values: key, mode
Count values: count (total number of tracks)"

In [None]:
data_gen = pd.read_csv('../input/spotify-dataset-19212020-160k-tracks/data_by_genres.csv')

In [None]:
data_gen.head(40)

This dataset gives us each genre and the attributes for each. 

In [None]:
data_year = pd.read_csv('../input/spotify-dataset-19212020-160k-tracks/data_by_year.csv')

In [None]:
data_year.head(100)

This data set breaks the data down by year, listing the attributes of the music released within each year. As stated before, all attributes except for key and mode are averages. Key and mode are the most common values (i.e. the statistical mode).

In [None]:
data_w_geners = pd.read_csv('../input/spotify-dataset-19212020-160k-tracks/data_w_genres.csv')

In [None]:
data_w_geners.head(50)

Lastly, this data set lists the artists again, but labels each with the genres associated with them. Notice that many artists do not have a genre listed.

In [None]:
data.info()

In [None]:
data.isnull().sum()

It's good to know that we don't seem to have any missing data points anywhere in `data`, which is the set we will mainly be using in this project.

#  Data Cleansing

As far as data goes, this data is fairly clean and doesn't look like it'll require much cleansing. As seen in the code directly above, we have no empty data entries. The only real issue is the artists column in the data.csv file. The column is formatted so that each artist name is wrapped in unnecessary brackets and quotation marks. We'll go ahead and fix that.

In [None]:
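# Strip the list punctuation (quotes and square brackets) from the artist strings.
# regex=False treats "[" and "]" as literal characters rather than regex syntax.
data['artists'] = data['artists'].str.replace("'", "", regex=False)
data['artists'] = data['artists'].str.replace("[", "", regex=False)
data['artists'] = data['artists'].str.replace("]", "", regex=False)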
data.head(50)

Now, we can just enter a name or a list of names separated by a comma (if we have more than one artist) rather than having wacky brackets surrounding it.

The example below shows the result: the 'artists' column of our "data.csv" set now lists only the artist name, without the brackets and quotation marks, so we can filter on the plain name.

In [None]:
filt = data['artists'] == 'The Beatles'
data[filt]
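Stripping every quote character also removes apostrophes that belong to a name. As an alternative sketch, assuming the raw `artists` strings are Python-style list literals (which the brackets and quotes suggest, but which I have not verified for every row), they could be parsed into real lists with the standard library's `ast.literal_eval`. The helper name `parse_artists` and the sample string are made up for illustration.

```python
import ast


def parse_artists(raw: str) -> list:
    """Parse a string like "['A', 'B']" into the Python list ['A', 'B'].

    Hypothetical helper; assumes `raw` is a valid Python list literal.
    """
    return ast.literal_eval(raw)


# Illustrative value only, written in the same bracketed style as the raw column.
raw_value = "['Sergei Rachmaninoff', \"Guns N' Roses\"]"
artists = parse_artists(raw_value)

print(artists)             # a real list of two names
print(", ".join(artists))  # comma-separated, apostrophe preserved
```

This would have to run on the raw column, before the `str.replace` calls, for example `data['artists'].apply(parse_artists)`. The trade-off is that a malformed row raises an error instead of being silently cleaned.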

While we're at it, let's go ahead and fix the "genres" column in data_w_geners too, since it seems to have the same problem. The code we use to fix this is nearly identical.

In [None]:
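# Same cleanup for the genres column; regex=False keeps "[" and "]" literal.
data_w_geners['genres'] = data_w_geners['genres'].str.replace("'", "", regex=False)
data_w_geners['genres'] = data_w_geners['genres'].str.replace("[", "", regex=False)
data_w_geners['genres'] = data_w_geners['genres'].str.replace("]", "", regex=False)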
data_w_geners.head(50)

Another issue we notice is from the following graph:

In [None]:
plt.figure(figsize=(16, 10))
sns.set(style="whitegrid")
x = data.groupby("year")["id"].count()
# Pass x and y as keywords; recent seaborn versions reject positional data arguments.
axis = sns.lineplot(x=x.index, y=x)
axis.set_title('Count of Tracks added')
axis.set_ylabel('Count')
axis.set_xlabel('Year')

Notice that, from around 1950, the count of songs per year only goes up to 2000. The dataset creator has pointed out that Spotify's developer website only allowed 2000 songs to be extracted per year. From this arises the natural question: **which** songs were chosen to be in the set? After looking at some of the data points and at the dataset creator's own comments, it's likely these are the 2000 most popular songs from each selected year.
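The cap described above can also be checked numerically rather than read off the plot. The sketch below uses a tiny made-up frame in place of `data` so that it runs on its own; `tracks_per_year` is a hypothetical helper name.

```python
import pandas as pd


def tracks_per_year(df: pd.DataFrame) -> pd.Series:
    """Count track ids per release year."""
    return df.groupby("year")["id"].count()


# Tiny made-up frame standing in for data.csv (illustrative values only).
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "year": [1949, 1950, 1950, 1951, 1951],
})

counts = tracks_per_year(toy)
cap = counts.max()                                   # largest per-year count
years_at_cap = counts[counts == cap].index.tolist()  # years that hit it
print(cap, years_at_cap)
```

Run against the real `data` frame, `cap` should come out at 2000 if the extraction limit described above holds, and `years_at_cap` would list the years that reach it.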

# Hypotheses to Explore

##    There are a few questions we want to answer with our data here:
    
###    1) Are there specific attributes that make a song more likely to be popular?
###    2) Do popular attributes change over time?
###    3) Is it possible to use these insights in a realistic way to improve songwriting?

# Insight

First, let's talk about which specific attributes make a song more popular. 

In [None]:
plt.figure(figsize=(16, 8))
sns.set(style="whitegrid")
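# Correlate numeric columns only; text columns such as name and artists have no correlation.
corr = data.select_dtypes(include="number").corr()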
sns.heatmap(corr,annot=True, cmap="YlGnBu")

Simply looking at a correlation table gives us some basic insights as to what attributes make a song more popular. 

1) As expected, popularity is highly correlated with the year released. This makes sense, as Spotify's algorithm generates its "popularity" metric not just from how many streams a song receives, but also from how recent those streams are.

2) Energy also seems to influence a song's popularity, with a correlation of about 0.5. Many popular songs are energetic, though not necessarily dance songs. Because the correlation here is not too high, low-energy songs still have some potential to be popular.

3) Acousticness seems to be negatively correlated with popularity. Most popular songs today have either electronic or electric instruments in them. It is very rare that a piece of music played by a chamber orchestra or a purely acoustic band becomes immensely popular (though, again, not impossible).

Other things worth noting:

1) Loudness and energy are highly correlated. This makes some sense, as energy is definitely influenced by the volume the music is played at.

2) Acousticness is highly negatively correlated with energy, loudness, and year.

3) Valence and danceability are highly correlated. Dance songs are usually happier and in a major key.

Thus, from this data, an artist would do better to create a high-energy song with either electric instruments or electronic sounds to have the best chance of generating popularity.
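Reading values off a large heatmap is error-prone, so it can help to pull out just the popularity column and rank the features by the strength of their correlation. The following is a minimal sketch on a made-up frame; `popularity_correlations` is a hypothetical helper, and the toy numbers are not taken from the dataset.

```python
import pandas as pd


def popularity_correlations(df: pd.DataFrame, target: str = "popularity") -> pd.Series:
    """Correlation of each numeric column with `target`, strongest first."""
    numeric = df.select_dtypes(include="number")  # ignore text columns
    corr = numeric.corr()[target].drop(target)    # drop the self-correlation
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)                    # keep the sign, sort by magnitude


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "name": ["s1", "s2", "s3", "s4", "s5"],
    "energy": [0.1, 0.2, 0.3, 0.4, 0.5],
    "acousticness": [0.9, 0.7, 0.8, 0.3, 0.2],
    "popularity": [10, 20, 30, 40, 50],
})

ranked = popularity_correlations(toy)
print(ranked)
```

On the real data this would be called as `popularity_correlations(data)`. Keep in mind that a correlation with popularity may partly reflect the release year, since both popularity and several attributes drift over time.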

Just to be sure, I went ahead and ran a correlation table on the artist dataset as well. 

In [None]:
plt.figure(figsize=(16, 8))
sns.set(style="whitegrid")
# Numeric columns only; the artists column is text.
corr = data_artist.select_dtypes(include="number").corr()
sns.heatmap(corr,annot=True)

Here we see that an artist's average popularity is significantly affected by their acousticness. Again, this may simply be because most recordings before 1950 are acoustic and carry little weight in the popularity score if they have not been played recently. Energy is also a factor in an artist's popularity, but not necessarily to the same degree that it was for an individual song.

### Let's look at how attributes have changed over time

In [None]:
plt.figure(figsize=(30, 30))
sns.set(style="whitegrid")
columns = ["acousticness","danceability","energy","speechiness","liveness","valence"]
for col in columns:
    x = data.groupby("year")[col].mean()
    ax= sns.lineplot(x=x.index,y=x,label=col)
ax.set_title('Audio characteristics over the years', fontsize = 50)
ax.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1, prop={'size': 30}, loc = 'upper right')
ax.set_ylabel('', fontsize = 50)
ax.set_xlabel('Year', fontsize = 50)

1) Acousticness has decreased significantly. Most tracks past 1960 used electric instruments and, especially past the 1980s, electronic sounds. Most recorded music today includes both electric and electronic elements. 

2) Danceability has varied significantly, but has stayed mostly at the same level since 1980. 

3) Energy seems to be inversely related to acousticness: it was very low in the first part of the century, but then rose significantly after 1960. It looks like it increased even more after 2000 as well.

4) Speechiness looks like it varied a lot in the first part of the 20th century, but then settled at a low level around 1960. Note that we do see a slight increase after 1980. This is likely due to the growth of rap music. Most music, however, is still mostly sung.

5) Liveness looks like it has always stayed relatively low. Most recorded music on Spotify was made with no audience present.

6) Valence seems to have risen until 2000 with energy and danceability, but has fallen since. 
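The observations above come from reading the line plot. To put numbers on them, the yearly means can be collected into one table and the first and last years compared. This is a sketch on made-up values; `yearly_means` is a hypothetical helper name.

```python
import pandas as pd


def yearly_means(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Mean of each listed column per release year (one row per year)."""
    return df.groupby("year")[columns].mean()


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "year": [1960, 1960, 2000, 2000],
    "energy": [0.2, 0.4, 0.6, 0.8],
    "acousticness": [0.9, 0.7, 0.3, 0.1],
})

trend = yearly_means(toy, ["energy", "acousticness"])
change = trend.iloc[-1] - trend.iloc[0]  # last year minus first year
print(trend)
print(change)
```

With the real data, `yearly_means(data, columns)` gives the same series the loop plots, and `trend.plot()` would draw them all in a single call.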

In [None]:
plt.figure(figsize=(30, 20))
sns.set(style="whitegrid")
columns = ["loudness"]
for col in columns:
    x = data.groupby("year")[col].mean()
    ax= sns.lineplot(x=x.index,y=x,label=col)
ax.set_title('Mean of Loudness of Songs per Year', fontsize = 50)
ax.set_ylabel('', fontsize = 50)
ax.set_xlabel('Year', fontsize = 50)

It is very clear that songs have been mixed progressively louder since the beginning of recorded music. This is partially due to better recording technology, but also to the "loudness war" that has been widely discussed in music circles.

Lastly, I wanted to examine specifically which songs and artists were the most popular. 

In [None]:
plt.figure(figsize=(30, 10))
sns.set(style="whitegrid")
x = data.groupby("name")["popularity"].mean().sort_values(ascending=False).head(30)
axis = sns.barplot(x=x.index, y=x)  # keyword arguments; positional data args fail in recent seaborn
axis.set_title('Top Tracks with Popularity')
axis.set_ylabel('Popularity')
axis.set_xlabel('Tracks')
plt.xticks(rotation = 90)

As we have noted throughout this project, popularity is heavily dependent on the timeframe. As we see, **death bed** has the highest popularity rating in this graph, but was released on February 8th, 2020. Using this data in our regression will give us a snapshot of the attributes popular songs had in mid-2020, but the results may stop working or become less relevant the further we get from that date.

Let's also try summing each artist's popularity numbers to see if we can get a better, long-term view of song attributes.

In [None]:
plt.figure(figsize=(30, 10))
sns.set(style="whitegrid")
x = data.groupby("artists")["popularity"].sum().sort_values(ascending=False).head(20)
ax = sns.barplot(x=x.index, y=x)  # keyword arguments; positional data args fail in recent seaborn
ax.set_title('Top Artists with Popularity')
ax.set_ylabel('Popularity')
ax.set_xlabel('Artists')
plt.xticks(rotation = 15)

To any longtime music fan, this list seems much more familiar. As we see, **The Beatles** have great popularity in this graph, followed closely by **The Rolling Stones**. Note that it isn't all older artists, either. **Taylor Swift**, who began her career in the mid-2000s, also makes an appearance.

It may be useful to run a regression on this list as well. This will likely give us a better, historical idea of which attributes are more liked.
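For orientation, a regression of popularity on a couple of attributes could look roughly like the sketch below. It uses plain NumPy least squares on made-up numbers, so it shows the mechanics only and says nothing about the real coefficients; `fit_linear` is a hypothetical helper, and a library such as scikit-learn or statsmodels would be a reasonable substitute.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept.

    Returns (intercept, coefficients). Hypothetical helper for illustration.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]


# Made-up rows: columns are [energy, acousticness].
X = np.array([[0.2, 0.9],
              [0.5, 0.4],
              [0.8, 0.1],
              [0.6, 0.7],
              [0.3, 0.2]])
# Synthetic target built from a known rule, so the fit can be checked by eye.
y = 20 + 50 * X[:, 0] - 10 * X[:, 1]

intercept, coefs = fit_linear(X, y)
print(np.round(intercept, 3), np.round(coefs, 3))
```

Because the synthetic target has no noise, the fit recovers the rule used to build it (intercept 20, coefficients 50 and -10). On the real data, the inputs would be columns such as `data[["energy", "acousticness"]]` and the target `data["popularity"]`, and including `year` as a predictor would help separate attribute effects from recency.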

In [None]:
plt.figure(figsize=(16, 4))
sns.set(style="whitegrid")
x = data_artist.groupby("artists")["popularity"].sum().sort_values(ascending=False).head(20)
ax = sns.barplot(x=x.index, y=x)  # keyword arguments; positional data args fail in recent seaborn
ax.set_title('Top Artists with Popularity')
ax.set_ylabel('Popularity')
ax.set_xlabel('Artists')
plt.xticks(rotation = 90)

Listing the most popular artists based on our artist data, we get **Emilee** and **StaySolidRocky**. Again, note that this list is heavily dependent on when the songs were played. Many of these artists may be one-hit wonders and may not otherwise be known.

# Final thoughts

An artist can increase their chances of popularity by creating songs that 1) are more energetic, 2) use electronic or electric instruments, and 3) are mixed to be relatively loud. An artist who only uses acoustic instruments and creates low-energy music looks likely to struggle to gain popularity though, again, it's not impossible.

Lastly, it should be noted that this is solely a guide to help artists and labels with marketing, mixing, and composing. For example, a label may want to put forth an artist's more dance-oriented, high-energy song as the first single off an album for promotion. An artist may want to try writing a high-energy, high-volume song even if they don't normally work in that genre.