<center><img src="https://1000logos.net/wp-content/uploads/2017/08/Spotify-Logo.png" alt="Spotify Logo" width="170" height="220"></center><center><h1> Spotify Descriptive and Exploratory Data Analysis</h1></center>



# Table of Contents

1. [Introduction](#1)
2. [Basis for Exploratory Data Analysis](#2)
    - [Feature Definitions](#2.1)
3. [Feature Analysis](#3)
    - [Correlations](#3.1)
    - [Number of Tracks and Average Track Durations Over Years](#3.2)
    - [Keys and Mode](#3.3)
4. [Genre Based Analysis](#4)
    - [Genre Popularity](#4.1)
    - [Genre Durations](#4.2)
    - [Keys and Modes](#4.3)
    - [Genre Names and Sub-genres](#4.4)
    - [Dimension Reduction and Genre Clustering](#4.5)
5. [Artist Based Analysis](#5)
    - [Most Popular Artists](#5.1)
    - [Most Productive Artists](#5.2)
6. [Decade Based Analysis](#6)
    - [Song Features' Trends over Decades](#6.1)
    - [Genres' popularity changes over the decades](#6.2)
7. [Psychedelic Rock](#7)
    - [Most Popular Psychedelic Artists and Songs Each Year](#7.1)
    - [Most Productive Psychedelic Artists and Songs Each Year](#7.2)
    - [The Beach Boys](#7.3)
    - [The Beatles](#7.4)
    - [The Who](#7.5)
    - [Pink Floyd](#7.6)
    - [The Doors](#7.7)
    - [Jimi Hendrix](#7.8)
    - [Frank Zappa](#7.9)
    - [Janis Joplin](#7.10)
8. [Conclusion](#8)
 

<a id="1"></a>
## Introduction
[Spotify](https://www.spotify.com/) is an audio streaming application that needs no introduction. Over the span of 14 years, it has reached 286 million **active users** and 130 million **premium subscriptions**. An average user listens to Spotify for 25 hours a month, and 44% of users dance with their souls through this app on a **daily basis**. These statistics suggest that, for a significant part of the world, Spotify is the go-to place for music.

In this notebook, I will try to list the most popular artists, investigate the kinds of music we listen to the most, and examine genre behaviours over the years while questioning the genre definitions. For this analysis, I will use [Spotify Dataset 1921-2020](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks) that contains over 160 thousand tracks, gathered by fellow Kaggler [Yamac Eren AY](https://www.kaggle.com/yamaerenay), using [Spotify Public API](https://developer.spotify.com/documentation/web-api/). If you're curious about popular songs and artists, the general reasons behind Spotify usage, what a pause track is (I didn't know before), and genre definitions, this might be an enjoyable notebook for you.




In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import warnings
warnings.filterwarnings('ignore')
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans #Clustering
from sklearn.decomposition import PCA #Dimension Reduction

import plotly.express as px #Interactive Plots
import plotly.graph_objs as pgo #Interactive Plots
from plotly.subplots import make_subplots #Interactive Plots

import matplotlib.pyplot as plt
from wordcloud import WordCloud

from collections import Counter

import random
from math import floor

<a id="2"></a>
## Basis for Exploratory Data Analysis
In this section, we'll briefly look at the features of the dataset, highlighting a few of them to build a better understanding for the rest of the analysis. The dataset contains various features of a song, both technical, such as tempo, key and loudness, and historical, such as release date and popularity. Dataset owner Yamac already provides clear definitions for these features, and further details can always be found in the [Spotify API documentation](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/). To make them easier to access, I present these definitions and their scales here.

<a id="2.1"></a>
### <u>Feature Definitions</u>

**1.	<u>artists</u>:** The list of artists of the song.

**2.	<u>danceability</u>:** Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

**3.	<u>duration_ms</u>:** The duration of the track in milliseconds.

**4.	<u>energy</u>:** Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

**5.	<u>explicit</u>:** Whether or not the track has explicit lyrics. In this dataset, 1 means explicit content and 0 means no explicit content (or unknown).

**6.	<u>id</u>:** The Spotify ID of the track.

**7.	<u>instrumentalness</u>:** Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

**8.	<u>key</u>:** The key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.

**9.	<u>liveness</u>:** Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

**10. <u>loudness</u>:** The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

**11. <u>mode</u>:** Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

**12. <u>name</u>:** Name of the song.

**13. <u>popularity</u>:** The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by an algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are.
Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity. Note that the popularity value may lag actual popularity by a few days: the value is not updated in real time.

**14. <u>release_date</u>:** The date the album was first released, for example “1981-12-15”. Depending on the precision, it might be shown as “1981” or “1981-12”.

**15. <u>speechiness</u>:** Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

**16. <u>tempo</u>:** The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

**17. <u>valence</u>:** A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

**18. <u>year</u>:** Year information extracted from release_date.

**19. <u>genres</u>:** A list of the genres used to classify the album. For example: “Prog Rock”, “Post-Grunge”. (If not yet classified, the array is empty.)
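
Several of the definitions above quote thresholds (0.33 and 0.66 for speechiness, 0.5 for instrumentalness, 0.8 for liveness). As a minimal sketch, here is how those thresholds could be turned into rough labels. The function name `describe_track` and the label strings are my own inventions for illustration; they are not part of the dataset or the Spotify API.

```python
def describe_track(speechiness: float, instrumentalness: float, liveness: float) -> dict:
    """Translate raw scores into rough labels using the thresholds quoted in the definitions.

    Hypothetical helper for illustration only; not part of the dataset or the Spotify API.
    """
    if speechiness > 0.66:
        speech_label = "mostly spoken word"
    elif speechiness >= 0.33:
        speech_label = "mix of music and speech"
    else:
        speech_label = "mostly music"

    return {
        "speech": speech_label,
        # Values above 0.5 are intended to represent instrumental tracks
        "instrumental": instrumentalness > 0.5,
        # Values above 0.8 give a strong likelihood that the track is live
        "likely_live": liveness > 0.8,
    }


example = describe_track(speechiness=0.05, instrumentalness=0.9, liveness=0.1)
print(example)
```

These cut-offs are soft guides rather than hard rules, so the labels should be read as likelihoods.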


In [None]:
main_df = pd.read_csv('../input/spotify-dataset-19212020-160k-tracks/data.csv')
main_df.info()

Let's see how these numeric metrics are distributed.

In [None]:
main_df.drop(['key','mode','year','explicit'],axis=1).describe().transpose().sort_index()

We can see that most of them lie between 0 and 1, except *duration_ms*, *loudness*, *popularity* and *tempo*. The nature of these metrics is defined in the feature definitions section. Let us take a sneak peek at the dataset.

In [None]:
main_df.head()

In [None]:
main_df.tail()

Let's set the index of the dataset as release_date, since we'll analyze the trends and changes through time.

In [None]:
# Set release date as the index and convert it to datetime so that dates without month and day components are handled
main_df.set_index('release_date',inplace=True)
main_df.index = pd.to_datetime(main_df.index)
main_df.head()

The oldest entry is the song called "Keep A Song In Your Soul" by Mamie Smith. Such a lovely beginning. [Keep a song in your soul](https://open.spotify.com/track/0cS0A1fUEUd1EW3FcF8AEI) friends!

In [None]:
# year interval
print("Oldest Record")
print(main_df.iloc[0])

The most recent entry is the piece called "Improvisations" by Roger Fly. That's what I call a cool ending!

In [None]:
print("Most Recent Record")
print(main_df.iloc[-1])

To make it more readable, I will change the *duration_ms* to *duration* while changing it from milliseconds to seconds.

In [None]:
#Change duration from milliseconds to seconds 
main_df['duration'] = main_df['duration_ms'].apply(lambda x:round(x/1000))
main_df.drop('duration_ms',axis=1,inplace=True)

In [None]:
main_df.drop(['key','mode','year','explicit'],axis=1).describe().transpose().sort_index()

<a id="3"></a>
## Feature Analysis

In this section, we'll look at the correlations between the features and try to understand which of them might have an effect on popularity. In addition, we'll see the number of tracks released each year, how the average duration changed over the years, and the most popular keys among these songs.

<a id="3.1"></a>
### Correlations
Let's see the correlations between the continuous metrics.

In [None]:
#Pearson Correlation Table (numeric columns only, so text columns such as name and id are skipped)
main_corr_df = main_df.drop(['key','mode','year','explicit'],axis=1).select_dtypes(include='number').corr(method='pearson')
fig = px.imshow(main_corr_df,title="Song Features Pearson Correlation Heatmap",width=750,height=500,labels={'color':"correlation"})
fig.show()

Some features are correlated by definition, such as acousticness and loudness or acousticness and energy, which is not unexpected. What's interesting is the correlation between energy (or acousticness) and popularity. Let's look at them in more detail.

In [None]:
def plot_corr(feature_1,feature_2,title):
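    # Work on a copy so that adding the bin column does not touch a slice of main_df
    corr_df = main_df[[feature_1,feature_2]].copy()
    # Split feature_1 into 9,999 bins of width 0.0001
    corr_df["feature_1_interval"] = pd.cut(corr_df[feature_1],np.arange(0,1,0.0001),labels=[f"{feature_1}_{i}" for i in range(1,10000)])
    # Keep the median of both features within each bin
    corr_df = corr_df.groupby("feature_1_interval").median()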
    plot = px.scatter(corr_df,x=feature_1, y=feature_2,trendline="ols",trendline_color_override="red")
    plot.update_traces(marker=dict(size=5,color='rgba(30, 215, 96, .9)',
                                  line=dict(width=1)),
                      selector=dict(mode='markers'))
    plot.update_layout(title_text=title)

    plot.show()

In [None]:
#Popularity vs Acousticness
plot_corr("acousticness","popularity","Popularity vs Acousticness")

We can see that, in terms of acousticness, 0.05 is the sweet spot for popularity. This might be a slight indication that users are looking for more uplifting songs on Spotify, since popularity decreases as acousticness increases. How about energy?

In [None]:
#Popularity vs Energy
plot_corr("energy","popularity","Popularity vs Energy")

While there are low energy songs that are not highly popular, high energy songs are almost always welcomed by listeners. However, these indications are insufficient for a definite conclusion. 
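
The plots above rely on binning one feature and taking the median of the other within each bin. As a minimal sketch of that idea, here is the same technique on synthetic data with only ten bins. The data frame `toy_df` and its values are invented for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data, invented for illustration only
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({"energy": rng.uniform(0, 1, 1000)})
toy_df["popularity"] = 30 + 20 * toy_df["energy"] + rng.normal(0, 10, 1000)

# Ten equal-width bins over [0, 1]; include_lowest keeps exact zeros in the first bin
bins = np.linspace(0, 1, 11)
toy_df["energy_bin"] = pd.cut(toy_df["energy"], bins=bins, include_lowest=True)

# One row per bin: the median energy and the median popularity inside that bin
binned = toy_df.groupby("energy_bin", observed=True)[["energy", "popularity"]].median()
print(binned)
```

Taking medians per bin smooths out individual tracks, which makes a trend easier to see but also hides the spread inside each bin.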

<a id="3.2"></a>

### Number of Tracks and Average Track Durations Over Years


In [None]:
#Number of entries over the years
fig = px.bar(main_df["id"].groupby(pd.Grouper(freq="Y")).count(),labels={
                     "release_date": "Release Year",
                     "value": "Number of tracks"})
fig.update_layout(height=600, width=1200, title_text="Number of Tracks Over Years")

fig.show()

We have fewer entries for the years prior to 1950. This could be related to the lower production numbers in those years.

In [None]:
#Average Duration over the years in seconds
fig = px.bar(main_df["duration"].groupby(pd.Grouper(freq="Y")).mean(),labels={
                     "release_date": "Release Year",
                     "value": "Track Duration (sec)"})
fig.update_layout(height=600, width=1200, title_text="Track Duration Over Years")

fig.show()

On average, tracks are around 3 to 4 minutes long. Let's see the most extreme outliers. Here is the longest track.

In [None]:
#Longest track
main_df[main_df['duration']==main_df['duration'].max()].iloc[0]

[The End of the Year: 2015, Pt. 1 - Continuous DJ Mix](https://open.spotify.com/track/3fQCeki8H8Up3KMIkXU6GD) is 89 minutes long! If you're having a party and you don't have time for a playlist, B-Max is there for you. Let's see the track with the shortest duration.

In [None]:
#Shortest Track
main_df[main_df['duration']==main_df['duration'].min()].iloc[0]

The name of the track is "Pause Track", and it has -60 loudness and no tempo? Yes! It turns out there is such a thing as a "Pause Track": a silent track recorded onto vinyl records to separate one song, or one group of songs, from another. Here are a few more pause tracks in this dataset.

In [None]:
# Pause tracks

main_df[main_df['loudness']==-60]

<a id="3.3"></a>

### Keys and Mode

Let's see the most used keys and the most popular mode of the dataset.

In [None]:
key_mapping = {0:"C",1:"C♯",2:"D",3:"D♯",4:"E",5:"F",6:"F♯",7:"G",8:"G♯",9:"A",10:"A♯",11:"B"}
key_counts = main_df["key"].value_counts()

# Take labels from the index so the code does not depend on the value_counts column name
key_labels = key_counts.index.map(key_mapping).values
key_values = key_counts.values

mode_mapping = {0:"Minor",1:"Major"}
mode_counts = main_df["mode"].value_counts()

# Take labels from the index so the code does not depend on the value_counts column name
mode_labels = mode_counts.index.map(mode_mapping).values
mode_values = mode_counts.values

fig = make_subplots(rows=1, cols=2,specs=[[{"type": "pie"}, {"type": "pie"}]])

fig.add_trace(
    pgo.Pie(labels=key_labels, values=key_values),row=1, col=1)

fig.add_trace(
    pgo.Pie(labels=mode_labels, values=mode_values),row=1, col=2)

fig.update_traces(textposition='inside', textinfo='percent+label')


fig.update_layout(height=600, width=1200, title_text="Keys and Modes")
fig.show()

The major mode dominates, and C, G, D and A are the most popular keys here.

<a id="4"></a>

## Genre Based Analysis
In this part, I will analyze the genre dataset and try to understand the characteristics of these genres. Let's take a look at the data first.
 

In [None]:
genre_df = pd.read_csv("../input/spotify-dataset-19212020-160k-tracks/data_by_genres.csv")
genre_df.info()

We have 3232 genres and sub-genres in this dataset. Again, for better readability, I will convert the duration_ms metric to duration in seconds.

In [None]:
genre_df['duration'] = genre_df['duration_ms'].apply(lambda x:round(x/1000))
genre_df.drop('duration_ms',axis=1,inplace=True)
genre_df.drop(['genres','key','mode'],axis=1).describe().transpose().sort_index()

In [None]:
genre_df.head()

In [None]:
genre_df.tail()

We have one unnamed genre which I will remove from the dataset.

In [None]:
# Missing Data
genre_df[genre_df['genres']=="[]"]

In [None]:
genre_df = genre_df[genre_df['genres']!="[]"]

In [None]:
#Number of genres
print(f"Number of unique genres : {genre_df['genres'].nunique()}")

<a id="4.1"></a>
### Genre Popularity
Let's see the most popular and the least popular genres.

In [None]:
#Most popular genres
print("Most popular genres")
genre_df.sort_values("popularity",ascending=False).head()

In [None]:
print("Least popular genres")
genre_df.sort_values("popularity",ascending=False).tail()

<a id="4.2"></a>
### Genre Durations
The genres with longest durations and shortest durations on average.

In [None]:
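# Sorted in descending order, so head() returns the longest durations
print("Longest Durations")
genre_df.sort_values("duration",ascending=False).head(10)

In [None]:
# Sorted in descending order, so tail() returns the shortest durations
print("Shortest Durations")
genre_df.sort_values("duration",ascending=False).tail(10)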

In [None]:
genre_df[genre_df["duration"]>280].head()

Notice that genres like "abstract" and "432hz" indicate that Spotify consists not only of songs but also of background sounds, which are fairly popular too. Spotify can really put you into the mood you wish to be in.

<a id="4.3"></a>
### Keys and Modes
Now let's see the keys and modes in this genre dataset

In [None]:
key_mapping = {0:"C",1:"C♯",2:"D",3:"D♯",4:"E",5:"F",6:"F♯",7:"G",8:"G♯",9:"A",10:"A♯",11:"B"}
genre_key_counts = genre_df["key"].value_counts()

# Take labels from the index so the code does not depend on the value_counts column name
key_labels = genre_key_counts.index.map(key_mapping).values
key_values = genre_key_counts.values

mode_mapping = {0:"Minor",1:"Major"}
genre_mode_counts = genre_df["mode"].value_counts()

# Take labels from the index so the code does not depend on the value_counts column name
mode_labels = genre_mode_counts.index.map(mode_mapping).values
mode_values = genre_mode_counts.values

fig = make_subplots(rows=1, cols=2,specs=[[{"type": "pie"}, {"type": "pie"}]])

fig.add_trace(
    pgo.Pie(labels=key_labels, values=key_values),row=1, col=1)

fig.add_trace(
    pgo.Pie(labels=mode_labels, values=mode_values),row=1, col=2)

fig.update_traces(textposition='inside', textinfo='percent+label')


fig.update_layout(height=600, width=1200, title_text="Keys and Modes in Genres")
fig.show()

Here are the most popular genres in a minor mode.

In [None]:
print("Most Popular minor genres")
genre_df.query("mode==0").sort_values("popularity",ascending=False).head(10)

<a id="4.4"></a>
### Genre Names and Sub-genres
Let's try to identify the genres by their names. I will simply look at word frequencies in the genre names and see which main genre appears the most.
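
As a minimal sketch of the word-frequency idea, here is the same counting on a handful of genre names. The list `toy_genres` is made up for illustration and is not taken from the dataset.

```python
from collections import Counter

# A few made-up genre names, for illustration only
toy_genres = ["indie pop", "dance pop", "indie rock", "hard rock", "pop rock"]

# Split every genre name into words and count how often each word appears
word_counts = Counter(word for genre in toy_genres for word in genre.split())
print(word_counts.most_common(3))
```

Words such as "pop" and "rock" rise to the top because they appear inside many sub-genre names, which is exactly what the frequency chart is meant to reveal.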

In [None]:
genre_names_text = " ".join(genre_df['genres'].tolist()).split(" ")
column_names = ["word","count"]
most_common_words_in_genres_df = pd.DataFrame([dict(zip(column_names,word_count)) for word_count in Counter(genre_names_text).most_common(30)])

In [None]:
fig = px.bar(most_common_words_in_genres_df.sort_values("count"),x="count",y="word",labels={
                     "word": "Word in Genres",
                     "count": "Word Frequency"},orientation='h')
fig.update_layout(height=600, width=1100, title_text="Most Common Words in Genre Names")

fig.show()

In [None]:
plt.subplots(figsize = (21,10))

wordcloud = WordCloud(background_color='black',
                      width = 2800,
                      height = 1024,
                      prefer_horizontal=1,
                      relative_scaling=1,
                      colormap = 'Greens').generate(" ".join(most_common_words_in_genres_df["word"].tolist()))

plt.imshow(wordcloud)
plt.axis('off') 
plt.show()

*pop*, *indie*, *rock* and *metal* are the most frequent ones. Let's look at the sub-genres of these genres and see their popularity.

In [None]:
pop_df = genre_df[genre_df["genres"].str.contains("pop")].sort_values("popularity").tail(10)
indie_df = genre_df[genre_df["genres"].str.contains("indie")].sort_values("popularity").tail(10)
rock_df = genre_df[genre_df["genres"].str.contains("rock")].sort_values("popularity").tail(10)
metal_df = genre_df[genre_df["genres"].str.contains("metal")].sort_values("popularity").tail(10)

fig = make_subplots(rows=2, cols=2,subplot_titles=('Pop', 'Indie', 'Rock','Metal'))

fig.add_trace(pgo.Bar(x=pop_df['popularity'],y=pop_df['genres'], orientation='h', marker_color='pink'),
              1, 1)

fig.add_trace(pgo.Bar(x=indie_df['popularity'],y=indie_df['genres'], orientation='h', marker_color='green'),
              1, 2)

fig.add_trace(pgo.Bar(x=rock_df['popularity'],y=rock_df['genres'], orientation='h', marker_color='red'),
              2, 1)
              
fig.add_trace(pgo.Bar(x=metal_df['popularity'],y=metal_df['genres'], orientation='h', marker_color='black'),
              2, 2)


fig.update_layout(height=600,width=1600,showlegend=False,bargap=0.3,title_text='Most Popular Subgenres',)
fig.show()

<a id="4.5"></a>
### Dimension Reduction and Genre Clustering

In this part, I will investigate which of these audio features distinguish genres and try to cluster them. By doing so, we can question whether genres with similar audio features are indeed related to each other. First, let's see how many components we need to cover the majority of this vector space.
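
Features such as loudness and tempo live on very different scales from the 0-1 metrics, so the next cell rescales everything with `MinMaxScaler` before PCA. As a rough sketch of what min-max scaling does to each column, here is the arithmetic on a tiny array. The numbers in `toy` are invented for illustration.

```python
import numpy as np

# Two made-up columns on very different scales (think loudness in dB and tempo in BPM)
toy = np.array([
    [-20.0,  90.0],
    [-10.0, 120.0],
    [ -5.0, 150.0],
])

# Min-max scaling: subtract each column's minimum and divide by its range
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
scaled = (toy - col_min) / (col_max - col_min)
print(scaled)
```

After scaling, every column runs from 0 to 1, so no single feature dominates the variance just because of its units.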

In [None]:
scaler = MinMaxScaler()
numeric_genre_df = genre_df.drop(["mode","genres","key","popularity"],axis=1)
scaled_genre_df = pd.DataFrame(scaler.fit_transform(numeric_genre_df.values),columns=numeric_genre_df.columns)
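# Use .values to assign by position: genre_df lost a row earlier, so its index no longer matches the new RangeIndex
scaled_genre_df["genres"] = genre_df["genres"].values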
scaled_genre_df.drop("genres",axis=1).keys()

In [None]:
pca = PCA()
pca.fit(scaled_genre_df.drop("genres",axis=1))
fig = pgo.Figure()
fig.add_trace(pgo.Scatter(x=[i for i in range(1,11)], y=pca.explained_variance_ratio_.cumsum(),
                    mode='lines+markers',
                    name='lines+markers'))

fig.update_layout(title="Explained Variance by Components",
                   xaxis_title='Number of Components',
                   yaxis_title='Cumulative Explained Variance',
                  yaxis_zeroline=False, xaxis_zeroline=False)
fig.show()

Apparently, with only four components, we can cover 90% of the variance. Let's look at how these components relate to the dataset features. Here are the component loadings and a heat map to visualize them.
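
Instead of reading the number of components off the chart, it can also be computed from the cumulative explained variance. Below is a minimal sketch with made-up ratios; the array `explained_variance_ratio` is hypothetical and does not come from the fitted PCA above.

```python
import numpy as np

# Hypothetical explained-variance ratios, invented for illustration only
explained_variance_ratio = np.array([0.45, 0.25, 0.12, 0.09, 0.05, 0.04])

cumulative = explained_variance_ratio.cumsum()

# Index of the first component where the cumulative share reaches 90%, plus one for a count
n_components = int(np.argmax(cumulative >= 0.90)) + 1
print(n_components)
```

With a fitted scikit-learn `PCA`, the same array is available as `pca.explained_variance_ratio_`. Alternatively, passing a float such as `PCA(n_components=0.90)` asks scikit-learn to pick the number of components for that share of variance.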

In [None]:
pca = PCA(n_components=4)
pca.fit(scaled_genre_df.drop("genres",axis=1))

pca_comp_df = pd.DataFrame(data = pca.components_,
                           columns =scaled_genre_df.drop("genres",axis=1).columns.values,
                           index = ['Component 1','Component 2','Component 3','Component 4'])
pca_comp_df

In [None]:
fig = px.imshow(pca_comp_df,title="Components' Relations with Features",width=750,height=500,labels={'color':"loading"})
fig.show()

It seems that component 1 mainly covers energy, while component 2 heavily covers instrumentalness and valence. Components 3 and 4 appear similar in nature, although component 4 leans more on valence and liveness. Let's try to cluster the genres using these 4 components. First, I will try to find an appropriate number of clusters using the dear old elbow method.

In [None]:
inertias = []
for k in range(1,40):
    kmeans = KMeans(n_clusters=k,init='k-means++',random_state=42)
    kmeans.fit(pca.transform(scaled_genre_df.drop(["genres"],axis=1)))
    inertias.append(kmeans.inertia_)
    
fig = px.line(x=range(1,40), y=inertias)
fig.update_layout(title='Elbow of the Genres',
                   xaxis_title='Number of Clusters',
                   yaxis_title='Inertia')
fig.show()

Surprisingly, the number of clusters suggested by the elbow is fairly low. Let's choose k = 20.
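
Reading an elbow by eye is subjective. One common heuristic, which is not what I used above, is to look for the point where the inertia curve bends the most, i.e. where its second difference is largest. Here is a minimal sketch on made-up inertia values; `toy_inertias` is invented for illustration.

```python
import numpy as np

# Made-up inertia values for k = 1..6, invented for illustration only
ks = np.arange(1, 7)
toy_inertias = np.array([100.0, 60.0, 35.0, 30.0, 27.0, 25.0])

# The second difference measures how sharply the curve bends at each interior k
second_diff = np.diff(toy_inertias, n=2)

# second_diff[j] is centred on ks[j + 1], hence the +1 offset
elbow_k = int(ks[np.argmax(second_diff) + 1])
print(elbow_k)
```

This is only a heuristic: on noisy or smoothly decaying curves it can disagree with a visual reading, so treat it as a second opinion rather than a rule.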

In [None]:
kmeans = KMeans(n_clusters=20,init='k-means++',random_state=42)
kmeans.fit(pca.transform(scaled_genre_df.drop("genres",axis=1)))
scaled_genre_df["cluster"] = kmeans.labels_

In [None]:
scaled_genre_df.head(10)

In [None]:
scaled_genre_df.keys()

Let's take a look at cluster 14 as an example and see if these genres are indeed really close to each other.

In [None]:
scaled_genre_df[scaled_genre_df["cluster"] == 14]

These results suggest that the audio features of a song do not place it in a specific genre. Although arab electronic and vintage swing tracks have similar technical features, these two genres are not considered close to each other by authorities. My conclusion from this analysis is that genre definitions are a lot more cultural and philosophical than technical.

<a id="5"></a>
## Artist Based Analysis
Here we will see the most popular artists for each year and their hit songs. We'll also check which artists have been the most hardworking over the years, using the artists dataset.

In [None]:
artists_df = pd.read_csv('../input/spotify-dataset-19212020-160k-tracks/data_w_genres.csv')
artists_df.info()

<a id="5.1"></a>
### Most Popular Artists
Here are the most popular artists and songs. Which one is your favourite?

In [None]:
years = main_df.year.unique()
top_artists_each_year = [main_df.query('year==@year').sort_values("popularity",ascending=False).iloc[0] for year in years]
top_artists_each_year_df = pd.DataFrame(top_artists_each_year)
top_artists_each_year_df['artists'] = top_artists_each_year_df['artists'].apply(lambda x:x.replace("[","").replace("]","").replace("'",""))
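# regex=False matches the artist name literally, so characters like "$" or "." are not treated as regex syntax
top_artists_each_year_df['genres'] = [artists_df[artists_df.artists.str.contains(artists.split(",")[0],regex=False)]["genres"].iloc[0].replace("[","").replace("]","") for artists in top_artists_each_year_df["artists"].values]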

In [None]:
fig = px.scatter(top_artists_each_year_df, x="year", y="popularity",hover_data=['artists','name'])
fig.update_traces(mode='markers', marker_line_width=2,marker=dict(size=10,color='rgba(30, 215, 96, .9)'))
fig.update_layout(title="Most Popular Artists and Songs Each Year")
fig.show()

<a id="5.2"></a>
### Most Productive Artists
Here are the most hardworking artists. Just look at Uruguayan violinist Francisco Canaro! Breathing in emotions, breathing out music.

In [None]:
fig = px.bar(artists_df.sort_values("count",ascending=False).head(30),x="count",y="artists",labels={
                     "artists": "Artists",
                     "count": "Number of Tracks"},orientation='h',hover_data=['genres'])
fig.update_layout(height=600, width=1100, title_text="Most Productive Artists",yaxis={'categoryorder':'total ascending'})

fig.show()

<a id="6"></a>
## Decade Based Analysis
Now, let's see how these audio features changed over the years and we'll also take a look at genre popularity changes over the decades.
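
The cells below label each ten-year bucket as a decade, such as "1960's". As a minimal sketch of that labelling on plain years, here is a small helper. The function name `decade_label` is hypothetical and is not used elsewhere in this notebook.

```python
def decade_label(year: int) -> str:
    """Return the decade a year belongs to, in the "1960's" style used in this notebook."""
    # Integer division drops the last digit, multiplying by 10 puts a zero back
    return f"{year // 10 * 10}'s"


for year in (1921, 1967, 2020):
    print(year, "->", decade_label(year))
```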

<a id="6.1"></a>
### Song Features' Trends over Decades

In [None]:
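# Keep numeric columns only so that text columns (artists, id, name) do not break the mean
decade_df = main_df.select_dtypes(include='number').resample(rule='10A').mean()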
decade_df.index = [f"{date_index-1}'s" for date_index in decade_df.index.year]
decade_df

In [None]:
trends_df = decade_df.drop(["explicit","key","mode","year","tempo","popularity","duration","loudness"],axis=1)
fig = px.line(trends_df, x=trends_df.index, y=trends_df.columns)
fig.update_layout(title="Song Features' Trends over Decades (between 0-1) ",
                   xaxis_title='Decade',
                   yaxis_title='Feature Value ')
fig.show()

Apparently, the sounds were much more acoustic back then and now they are much more energetic.

<a id="6.2"></a>
### Genres' popularity changes over the decades

Let's see how genres were affected by these changes in sound.
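
To join tracks with genres, each track's `artists` value first has to become a real list. I am assuming here that the column stores a Python-style list literal such as `"['A', 'B']"`. Under that assumption, `ast.literal_eval` from the standard library is a sturdier alternative to stripping brackets and quotes by hand, because it keeps apostrophes and commas inside names. The helper name `parse_artists` is hypothetical.

```python
import ast


def parse_artists(raw: str) -> list:
    """Parse a string like "['A', 'B']" into a list of artist names.

    Assumes the input is a valid Python list literal; hypothetical helper for illustration.
    """
    return ast.literal_eval(raw)


parsed = parse_artists("['Simon & Garfunkel', \"Guns N' Roses\"]")
print(parsed)
```

Names containing an apostrophe survive intact here, whereas removing every `'` character would turn them into strings that no longer match the artists dataset.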

In [None]:
main_w_genres_df = pd.read_csv('../input/spotify-dataset-19212020-160k-tracks/data.csv')
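# Split on ", " so that artist names after the first do not keep a leading space, which would break the merge below
main_w_genres_df["artists"] = main_w_genres_df.artists.apply(lambda x:x.replace("[","").replace("]","").replace("'","").split(", "))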
main_w_genres_df = main_w_genres_df.explode('artists')
main_w_genres_df = pd.merge(main_w_genres_df,artists_df[["artists","genres"]],on='artists',how='inner')
main_w_genres_df.set_index('release_date',inplace=True)
main_w_genres_df.index = pd.to_datetime(main_w_genres_df.index)
genre_df = main_w_genres_df.drop(["artists"],axis=1).drop_duplicates()
decade_popularity = pd.DataFrame()
most_popular_genres = ["pop","indie","rock","metal","rap","jazz"]
for genre in most_popular_genres:
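    # Keep numeric columns only so that text columns do not break the mean
    genre_decade_df = genre_df[genre_df.genres.str.contains(genre)].select_dtypes(include='number').resample(rule='10A').mean()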
    genre_decade_df.index = [f"{floor(date_index/10)*10}'s" for date_index in genre_decade_df.index.year]
    decade_popularity[genre] = genre_decade_df["popularity"]
decade_popularity.fillna(0,inplace=True)

In [None]:
fig = px.line(decade_popularity, x=decade_popularity.index, y=decade_popularity.columns)
fig.update_layout(title="Genre Popularity over Decades",
                   xaxis_title='Decade',
                   yaxis_title='Popularity')
fig.show()

In with the rap, out with the jazz!

<a id="7"></a>
## Psychedelic Rock
Let's see the names in my favorite genre!

In [None]:
psy_rock_df = main_w_genres_df[main_w_genres_df["genres"].str.contains("psychedelic rock")].drop_duplicates().sort_values("popularity")
psy_rock_df['duration'] = psy_rock_df['duration_ms'].apply(lambda x:round(x/1000))
psy_rock_df.drop('duration_ms',axis=1,inplace=True)
psy_rock_df.info()

In [None]:
psy_rock_df.drop(['key','mode','year','explicit'],axis=1).describe().transpose().sort_index()

<a id="7.1"></a>
### Most Popular Psychedelic Artists and Songs Each Year

In [None]:
psy_years = psy_rock_df.year.unique()
psy_top_artists_each_year = [psy_rock_df.query('year==@year').sort_values("popularity",ascending=False).iloc[0] for year in psy_years]
psy_top_artists_each_year_df = pd.DataFrame(psy_top_artists_each_year)
fig = px.scatter(psy_top_artists_each_year_df, x="year", y="popularity",
                 hover_data=['artists','name','genres'])
fig.update_traces(mode='markers', marker_line_width=2,marker=dict(size=10,color='rgba(30, 215, 96, .9)'))
fig.update_layout(title="Most Popular Psychedelic Rock Artists and Songs Each Year")
fig.show()

<a id="7.2"></a>
### Most Productive Psychedelic Artists and Songs Each Year

In [None]:
fig = px.bar(artists_df[artists_df["genres"].str.contains("psychedelic rock")].sort_values("popularity").sort_values("count",ascending=False).head(30),x="count",y="artists",labels={
                     "artists": "Artists",
                     "count": "Number of Tracks"},orientation='h',hover_data=['genres'])
fig.update_layout(height=600, width=1100, title_text="Most Productive Psychedelic Rock Artists",yaxis={'categoryorder':'total ascending'})

fig.show()

In [None]:
beach_df = psy_rock_df[psy_rock_df["artists"]=="The Beach Boys"].sort_values("popularity",ascending=False).head(10)
beatle_df = psy_rock_df[psy_rock_df["artists"]=="The Beatles"].sort_values("popularity",ascending=False).head(10)
who_df =  psy_rock_df[psy_rock_df["artists"]=="The Who"].sort_values("popularity",ascending=False).head(10)
pink_df = psy_rock_df[psy_rock_df["artists"]=="Pink Floyd"].sort_values("popularity",ascending=False).head(10)
doors_df = psy_rock_df[psy_rock_df["artists"]=="The Doors"].sort_values("popularity",ascending=False).head(10)
hendrix_df = psy_rock_df[psy_rock_df["artists"]=="Jimi Hendrix"].sort_values("popularity",ascending=False).head(10)
zappa_df = psy_rock_df[psy_rock_df["artists"]=="Frank Zappa"].sort_values("popularity",ascending=False).head(10)
janis_df = psy_rock_df[psy_rock_df["artists"]=="Janis Joplin"].sort_values("popularity",ascending=False).head(10)
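
The eight selections above all follow the same pattern: filter by artist, sort by popularity, keep the top ten. As a sketch only, the pattern could be wrapped in a small helper. The function name `top_tracks_by_artist`, the `toy_df` table and its rows are my own illustration and not part of the dataset, and the helper assumes the `artists` column holds one plain artist name per row, as the filters above do.

```python
import pandas as pd

def top_tracks_by_artist(df, artist, n=10):
    """Return the n most popular rows whose 'artists' value equals `artist`."""
    return df[df["artists"] == artist].sort_values("popularity", ascending=False).head(n)

# Toy stand-in for psy_rock_df; track names and popularity values are invented.
toy_df = pd.DataFrame({
    "artists": ["The Doors", "The Doors", "Pink Floyd", "The Doors", "Pink Floyd"],
    "name": ["Track 1", "Track 2", "Track 3", "Track 4", "Track 5"],
    "popularity": [50, 70, 65, 60, 80],
})

# One DataFrame per artist, stored in a dictionary keyed by artist name
artist_names = ["The Doors", "Pink Floyd"]
top_tracks = {artist: top_tracks_by_artist(toy_df, artist, n=2) for artist in artist_names}

print(top_tracks["The Doors"])
```

With the real data, `top_tracks_by_artist(psy_rock_df, "The Doors")` should give the same rows as `doors_df` as long as no two tracks tie at the cutoff.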

<a id="7.3"></a>
<center><h2>The Beach Boys</h2></center>
<center><img src="https://www.rollingstone.com/wp-content/uploads/2020/05/BeachBoys.jpg?resize=1800,1200&w=1800" alt="The Beach Boys" width="600" height="600"></center>



In [None]:
fig = px.bar(beach_df,x="popularity",y="name",labels={
                     "popularity": "Popularity",
                     "name": "Song Name"},orientation='h',hover_data=['genres'])
fig.update_layout(height=600, width=1100, title_text="The Beach Boys",yaxis={'categoryorder':'total ascending'})

fig.show()

In [None]:
beach_df.drop(["explicit","key","mode","year"],axis=1).describe().transpose().sort_index()

<a id="7.4"></a>
<center><h2>The Beatles</h2></center>
<center><img src="https://www.rollingstone.com/wp-content/uploads/2018/06/rs-7349-20121003-beatles-1962-624x420-1349291947.jpg?resize=1800,1200&w=1800" alt="The Beatles" width="600" height="600"></center>

In [None]:
fig = px.bar(beatle_df,x="popularity",y="name",labels={
                     "popularity": "Popularity",
                     "name": "Song Name"},orientation='h',hover_data=['genres'])
fig.update_layout(height=600, width=1100, title_text="The Beatles",yaxis={'categoryorder':'total ascending'})

fig.show()

In [None]:
beatle_df.drop(["explicit","key","mode","year"],axis=1).describe().transpose().sort_index()

<a id="7.5"></a>
<center><h2>The Who</h2></center>
<center><img src="https://www.rollingstone.com/wp-content/uploads/2018/06/rs-230674-the-who.jpg" alt="The Who" width="600" height="600"></center>

In [None]:
fig = px.bar(who_df,x="popularity",y="name",labels={
                     "popularity": "Popularity",
                     "name": "Song Name"},orientation='h',hover_data=['genres'])
fig.update_layout(height=600, width=1100, title_text="The Who",yaxis={'categoryorder':'total ascending'})

fig.show()

In [None]:
who_df.drop(["explicit","key","mode","year"],axis=1).describe().transpose().sort_index()

<a id="7.6"></a>
<center><h2>Pink Floyd</h2></center>
<center><img src="https://www.rollingstone.com/wp-content/uploads/2018/06/pink-floyd-syd-interview-ad5dbc74-49f4-4f2d-8e13-0f56c285fcc4.jpg?resize=1800,1200&w=1800" alt="Pink Floyd" width="600" height="600"></center>

In [None]:
fig = px.bar(pink_df,x="popularity",y="name",labels={
                     "popularity": "Popularity",
                     "name": "Song Name"},orientation='h',hover_data=['genres'])
fig.update_layout(height=600, width=1100, title_text="Pink Floyd",yaxis={'categoryorder':'total ascending'})

fig.show()

In [None]:
pink_df.drop(["explicit","key","mode","year"],axis=1).describe().transpose().sort_index()

<a id="7.7"></a>
<center><h2>The Doors</h2></center>
<center><img src="https://upload.wikimedia.org/wikipedia/commons/6/60/Doors_electra_publicity_photo.JPG" alt="The Doors" width="600" height="600"></center>

In [None]:
fig = px.bar(doors_df,x="popularity",y="name",labels={
                     "popularity": "Popularity",
                     "name": "Song Name"},orientation='h',hover_data=['genres'])
fig.update_layout(height=600, width=1100, title_text="The Doors",yaxis={'categoryorder':'total ascending'})

fig.show()

In [None]:
doors_df.drop(["explicit","key","mode","year"],axis=1).describe().transpose().sort_index()

<a id="7.8"></a>
<center><h2>Jimi Hendrix</h2></center>
<center><img src="https://ychef.files.bbci.co.uk/976x549/p0278f0f.jpg" alt="Jimi Hendrix" width="600" height="600"></center>

In [None]:
fig = px.bar(hendrix_df,x="popularity",y="name",labels={
                     "popularity": "Popularity",
                     "name": "Song Name"},orientation='h',hover_data=['genres'])
fig.update_layout(height=600, width=1100, title_text="Jimi Hendrix",yaxis={'categoryorder':'total ascending'})

fig.show()

In [None]:
hendrix_df.drop(["explicit","key","mode","year"],axis=1).describe().transpose().sort_index()

<a id="7.9"></a>
<center><h2>Frank Zappa</h2></center>
<center><img src="https://i.imgur.com/QaLceJZ.jpg" alt="Frank Zappa" width="600" height="600"></center>

In [None]:
fig = px.bar(zappa_df,x="popularity",y="name",labels={
                     "popularity": "Popularity",
                     "name": "Song Name"},orientation='h',hover_data=['genres'])
fig.update_layout(height=600, width=1100, title_text="Frank Zappa",yaxis={'categoryorder':'total ascending'})

fig.show()

In [None]:
zappa_df.drop(["explicit","key","mode","year"],axis=1).describe().transpose().sort_index()

<a id="7.10"></a>
<center><h2>Janis Joplin</h2></center>
<center><img src="https://www.musicconnection.com/wp-content/uploads/2020/09/Janis-Book-Cover.jpg" alt="Janis Joplin" width="600" height="600"></center>

In [None]:
fig = px.bar(janis_df,x="popularity",y="name",labels={
                     "popularity": "Popularity",
                     "name": "Song Name"},orientation='h',hover_data=['genres'])
fig.update_layout(height=600, width=1100, title_text="Janis Joplin",yaxis={'categoryorder':'total ascending'})

fig.show()

In [None]:
janis_df.drop(["explicit","key","mode","year"],axis=1).describe().transpose().sort_index()

<a id="8"></a>
## Conclusion
Spotify grows more influential every day as a medium where the "success metrics" of the music industry, if you believe in them, can be clearly observed. It also presents hidden or forgotten gems to users like me who enjoy discovering beautiful pieces and quieter stories. For many listeners, Spotify shapes the mood of the day: it is a popular place to turn to when we need a break from the outside world, or when we want to party and socialize with others. Music, the best friend of humankind, has found its way into Spotify to keep shaping our lives. In this study, we have seen many popular artists and songs. We learned that back in the 50s and 60s even the tracks on vinyl records took a break, unlike most of us in these fast-paced days. We also saw that genres are defined not only by their audio similarities and structures but, perhaps even more, by their cultural effects and their artists' philosophical views.

Thank you for your time, dear Kaggler, and big thanks to [Yamac Eren AY](https://www.kaggle.com/yamaerenay) for this beautiful dataset. Keep kaggling with music and curiosity.