# Spotify Tracks Analysis
![](https://digital.hbs.edu/platform-digit/wp-content/uploads/sites/2/2020/04/spotify-logo-1920x1080-2.jpg)

#### Introduction to Spotify
Spotify is a Swedish audio streaming and media services provider founded on 23 April 2006 by Daniel Ek and Martin Lorentzon. It is the world's largest music streaming service provider, with over 356 million monthly active users, including 158 million paying subscribers, as of March 2021.

Spotify offers digital copyright-restricted recorded music and podcasts, including more than 70 million songs, from record labels and media companies. As a freemium service, basic features are free with advertisements and limited control, while additional features, such as offline listening and commercial-free listening, are offered via paid subscriptions. Users can search for music based on artist, album, or genre, and can create, edit, and share playlists.

As of July 2021, Spotify is available in most of Europe, the Americas, and Oceania, as well as in Asia and more than 40 countries in Africa (including South Africa and Mauritius). By the end of 2021, Spotify is expected to operate in a total of 178 countries. The service is available on most modern devices, including Windows, macOS, and Linux computers, iOS and Android smartphones and tablets, and AI-enabled smart speakers such as Amazon Echo and Google Home.

**About the project:** This project is part of the course [Data Analysis with Python: Zero to Pandas](https://jovian.ai/learn/data-analysis-with-python-zero-to-pandas) provided by [Jovian](https://jovian.ai). Jovian is an online platform for data science and machine learning, designed to provide a hands-on learning experience.

**About the dataset:** This dataset was retrieved from Kaggle, an online platform for data science and machine learning that provides resources and tools for practitioners in the field.
The dataset is available at [Spotify data](https://www.kaggle.com/subhaskumarray/spotify-tracks-data?select=tracks.csv), and you can visit [Kaggle datasets](https://www.kaggle.com/datasets) for many other interesting datasets.

Our dataset contains information such as tracks, artists, popularity, duration, and much more.

### Objective of the analysis
- To extract as much information as possible from the data.
- To provide better insights that can support the company's future decisions.

## Data preparation and cleaning

1. Load the data into a pandas dataframe.
2. Check for any null values.
3. Insert or drop columns if needed.

In [None]:
import os, sys
import pandas as pd
import numpy as np

In [None]:
spotify_df = pd.read_csv(r'../input/spotify-tracks-data/tracks.csv')
spotify_df

We have 586672 rows and 20 columns, so this is a fairly large data file. Now, we will check some basic information about the dataset.

In [None]:
# Checking the column names. It will help us in deciding what variables we have to analyse.
spotify_df.columns

In [None]:
spotify_df.describe()

In [None]:
spotify_df.info()

In [None]:
spotify_df.isnull().sum()

Here we can see that there are 71 null values in the `name` column. That means 71 tracks have no name, although their other details are present. The track names are not grouped by artist, and with more than 586 thousand tracks in the dataset, removing the rows without track names won't affect our analysis much. So we are going to remove them.

In [None]:
# I will use the dropna function to drop only those rows which don't have track names.
# I will assign the output dataframe back to the original variable.

spotify_df = spotify_df.dropna(subset=['name'], axis=0)
spotify_df
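
To illustrate how `dropna` behaves with and without the `subset` argument, here is a minimal sketch on a made-up three-row frame. The `toy` dataframe and its values are assumptions for illustration only and are not taken from the Spotify data.

```python
import pandas as pd

# Made-up toy frame: one missing track name and one missing value elsewhere.
toy = pd.DataFrame({
    'name': ['Song A', None, 'Song C'],
    'tempo': [120.0, 98.5, None],
})

# Count missing values per column.
missing_per_column = toy.isnull().sum()

# Drop rows that have a missing value in ANY column.
dropped_any = toy.dropna(how='any', axis=0)

# Drop only the rows whose 'name' is missing.
dropped_name_only = toy.dropna(subset=['name'])

print(missing_per_column)
print(len(dropped_any), len(dropped_name_only))
```

When only one column contains nulls, both calls remove the same rows. Using `subset` simply makes the intent explicit.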

Now, we have 586601 rows left. Let's check if there are any more null values.

In [None]:
spotify_df.isnull().sum()

So, we don't have any null values remaining. Let's check for any duplicate rows.

In [None]:
# Count the rows that are exact duplicates of an earlier row
spotify_df.duplicated().sum()

In [None]:
spotify_df

We don't have any duplicate rows in our data.
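
As a small sketch of how a duplicate check works, the made-up frame below contains one repeated row. The data is purely illustrative.

```python
import pandas as pd

# Made-up toy frame in which the third row repeats the first one.
toy = pd.DataFrame({
    'name': ['Song A', 'Song B', 'Song A'],
    'popularity': [10, 20, 10],
})

# duplicated() marks rows that are identical to an earlier row;
# summing the boolean mask gives the number of such rows.
duplicate_rows = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates()

print(duplicate_rows, len(deduplicated))
```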

First of all, I am going to create a copy of this dataframe in case we need to roll back at some point. I will use `spotify_df` for the analysis; the copy is just a backup.

In [None]:
spotify_df_copy = spotify_df.copy()
spotify_df_copy

The duration of the tracks is recorded in milliseconds, which is not a convenient unit for music tracks. So, I am going to convert it into minutes with two decimal places. I can do that by dividing `duration_ms` by 1000 to get seconds and then by 60 to get minutes, storing the result in a new column `duration_min`.

In [None]:
# Adding new columns raised a SettingWithCopyWarning. Disabling the chained-assignment check, as suggested online, silences it.
pd.options.mode.chained_assignment = None
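
This warning usually appears when pandas cannot tell whether a dataframe is an independent object or a slice of another one. One possible alternative to silencing it globally, sketched here on made-up data, is to take an explicit `.copy()` of the filtered frame before adding columns. The `toy` and `clean` names and the `duration_s` column are assumptions for illustration.

```python
import pandas as pd

# Made-up toy frame with one missing track name.
toy = pd.DataFrame({
    'name': ['Song A', None],
    'duration_ms': [180000, 240000],
})

# .copy() makes the filtered result an independent dataframe,
# so adding a column to it later is unambiguous.
clean = toy.dropna(subset=['name']).copy()
clean['duration_s'] = clean['duration_ms'] / 1000

print(clean)
```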

In [None]:
spotify_df['duration_min'] = (spotify_df['duration_ms'])/1000/60
spotify_df

We no longer need the `duration_ms` column, so I will drop it.

In [None]:
spotify_df.drop(['duration_ms'], axis=1, inplace = True)
spotify_df

As we can see, `duration_min` has many digits after the decimal point. I will round it to 2 decimal places and then move the column to where `duration_ms` was initially.

In [None]:
spotify_df['duration_min'] = round(spotify_df['duration_min'],2)

# Shifting it to index 3

column_to_shift = spotify_df.pop('duration_min')
spotify_df.insert(3,'duration_min',column_to_shift)
spotify_df
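
The conversion, rounding, and repositioning steps can also be checked on a tiny example. The sketch below uses made-up values, where 210000 ms should come out as 3.5 minutes. The column names mirror the ones used in this notebook, but the frame itself is hypothetical.

```python
import pandas as pd

# Made-up toy frame with durations in milliseconds.
toy = pd.DataFrame({
    'id': ['a', 'b'],
    'name': ['Song A', 'Song B'],
    'popularity': [50, 60],
    'duration_ms': [210000, 185500],
})

# 1 minute = 60 * 1000 ms, so divide by 60000 and round to 2 decimals.
duration_min = (toy['duration_ms'] / 60_000).round(2)

# Replace the millisecond column with the minute column at index 3.
toy = toy.drop(columns='duration_ms')
toy.insert(3, 'duration_min', duration_min)

print(toy)
```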

Let's do one more thing. I am going to convert `release_date` to datetime and create columns for the year and month. This will help in analysing tracks by year and month.

In [None]:
spotify_df['release_date'] = pd.to_datetime(spotify_df['release_date'])
spotify_df['release_year'] = spotify_df['release_date'].dt.year
spotify_df['release_month'] = spotify_df['release_date'].dt.month_name()

In [None]:
spotify_df
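
One point worth checking after a datetime conversion is how partial dates are handled. The sketch below assumes, hypothetically, that some release dates are recorded as a year only or as a year and month. It parses each value separately and keeps a flag for values that carry a real day of the month. The `raw` values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw values: a full date, a year only, and a year-month.
raw = pd.Series(['1922-02-22', '1922', '1922-06'])

# Parse each value on its own so differing precisions do not conflict.
parsed = pd.to_datetime(raw.map(pd.Timestamp))

# Only 10-character strings ('YYYY-MM-DD') carry a real day of the month.
has_full_date = raw.str.len() == 10

summary = pd.DataFrame({
    'raw': raw,
    'month': parsed.dt.month_name(),
    'day': parsed.dt.day,
    'has_full_date': has_full_date,
})

print(summary)
```

Missing parts of a date default to the first month and first day, so a year-only value ends up in January. A flag like `has_full_date` makes it possible to exclude such rows from month-level or day-level analysis.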

Finally, we are done with data preparation. Now, it's time for some analysis.

## Exploratory Analysis and Visualization

I am going to use Matplotlib, Seaborn, and Plotly for clear and interactive visualizations.
We will look at the relationships between different variables, the performance of artists and their tracks, and more.

First, install and import the required libraries. I have already installed Matplotlib and Seaborn, so I only need to install Plotly before importing all of them.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

# Some basic configurations for graphs
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

### 1. Trends for number of tracks released throughout the years.
I will draw a line chart for this analysis.

In [None]:
# First, let's find out the number of tracks released each year.

number_of_tracks_by_year = spotify_df.groupby(spotify_df['release_year'])['name'].count().reset_index()
number_of_tracks_by_year['Tracks released'] = number_of_tracks_by_year['name']
# Plotting a line chart:
fig = px.line(number_of_tracks_by_year, x='release_year', y='Tracks released')
fig.update_layout(title="Tracks released throughout the years",title_x=0.5,
                  xaxis_title="Year of release", yaxis_title="Number of tracks")

fig.show();

The reason for using `Plotly` is that it is interactive. As our time series spans more than 100 years, it would have been hard to draw a clear visualization with `Seaborn` or `Matplotlib`.

From the graph above, we can see that the number of tracks released generally increases every year. However, there are two major drops in 1993 and 2000, whereas 1999 and 2000 have the highest number of tracks released. We can't account for 2021, as the year had not ended when the data was collected.
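
A compact way to get per-year counts and to leave out an incomplete final year is sketched below. The frame is made up, and treating 2021 as the incomplete year follows the remark above.

```python
import pandas as pd

# Made-up toy frame: two tracks in 2019, three in 2020, one in 2021.
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'release_year': [2019, 2019, 2020, 2020, 2020, 2021],
})

# value_counts() counts the rows per year; sort_index() restores time order.
tracks_per_year = toy['release_year'].value_counts().sort_index()

# Keep only the years that are complete in the data.
complete_years = tracks_per_year[tracks_per_year.index < 2021]

print(complete_years)
```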

### 2. Number of tracks released by month.
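We will use a bar chart for this analysis. Bar charts are a good way to compare counts across categories.

In [None]:
# Similar steps to the previous visualization

number_of_tracks_by_month = spotify_df.groupby(spotify_df['release_month'])['name'].count().reset_index()

# Calendar order for the x-axis (groupby sorts the month names alphabetically)
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

plt.figure(figsize=(16,5))
# The counts are already aggregated, so barplot is used instead of histplot
sns.barplot(x = 'release_month', y = 'name', data = number_of_tracks_by_month, order = month_order)
plt.xlabel('Month of release')
plt.ylabel('Number of tracks')
plt.title('Tracks released by month')

plt.show()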

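January leads by a very wide margin. This suggests that most tracks on Spotify are released at the start of the year. However, any release dates that record only a year would also be counted under January after the datetime conversion, so part of the spike may be an artifact.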

### 3. Top 10 artists by number of tracks recorded by them.
First, I am going to group the tracks by artist and count them. Then we will take just the top 10.

In [None]:
# Dataframe for top 10 artists

groupby_artists = spotify_df.groupby(spotify_df['artists'])['name'].count().reset_index()
top_10_artists = groupby_artists.sort_values(['name'],axis = 0, ascending=False).head(10)
top_10_artists['Tracks recorded'] = top_10_artists['name']

fig = px.bar(top_10_artists, x='artists', y='Tracks recorded', color = 'Tracks recorded')
fig.update_layout(title="Top 10 artists",title_x=0.5,
                  xaxis_title="Artists", yaxis_title="Number of tracks")         

fig.show()

Die drei has the maximum number of recordings, with 3856 tracks. We also have an Indian singer, Lata Mangeshkar, at number 5 with 1373 recordings.
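
The same kind of ranking can be obtained in a single step with `value_counts`, which counts rows per artist and sorts them in descending order. The sketch below uses made-up artist labels.

```python
import pandas as pd

# Made-up toy frame: 'Artist A' has three tracks, 'Artist B' two, 'Artist C' one.
toy = pd.DataFrame({
    'artists': ['Artist A', 'Artist B', 'Artist A', 'Artist C', 'Artist A', 'Artist B'],
    'name': ['t1', 't2', 't3', 't4', 't5', 't6'],
})

# Count tracks per artist (sorted from most to fewest) and keep the top 2.
top_artists = toy['artists'].value_counts().head(2)

print(top_artists)
```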

### 4. Relationship between duration and popularity of the tracks.
When we need to find the relationship between two variables in a large volume of data, a scatter plot is a great tool. So, I will use seaborn's scatterplot.

In [None]:
plt.figure(figsize=(16,8)) # for covering the entire width
sns.scatterplot(x = 'duration_min', y = 'popularity', data = spotify_df)
plt.xlabel('Duration(min)')
plt.ylabel('Popularity')
plt.title('Relation between Duration and Popularity');

Honestly, I was not expecting this. Some tracks are just too long. We can dig deeper and find out more information about these tracks.

But from this visualization we can see that most songs are shorter than 20 minutes and that the most popular songs are around 3-4 minutes long. This suggests that people don't prefer long songs and like songs that are no longer than about 5 minutes.
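
A visual impression like this can be backed up with numbers by binning the durations and comparing the mean popularity per bin. The sketch below uses made-up values, and the bin edges are an arbitrary assumption rather than anything derived from the dataset.

```python
import pandas as pd

# Made-up toy frame of track durations (minutes) and popularity scores.
toy = pd.DataFrame({
    'duration_min': [2.5, 3.2, 3.8, 4.5, 7.0, 12.0],
    'popularity': [40, 70, 65, 50, 20, 5],
})

# Assumed bin edges in minutes; each bin includes its right edge.
bins = [0, 3, 5, 10, float('inf')]
labels = ['0-3', '3-5', '5-10', '10+']
toy['duration_bin'] = pd.cut(toy['duration_min'], bins=bins, labels=labels)

# Mean popularity within each duration bin.
mean_popularity = toy.groupby('duration_bin', observed=True)['popularity'].mean()

print(mean_popularity)
```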

## Asking and answering questions
1. Asking some interesting questions about the dataset.
2. Answering them using numpy, pandas or visualizations.

### Q. No.1 What is the highest danceability among the tracks, and which 10 tracks have the highest danceability?
The danceability score reflects how strongly a song invites dancing.

In [None]:
highest_danceability = spotify_df['danceability'].max()
highest_danceability = round(highest_danceability,2)
print('The highest danceability of the tracks is {}.'.format(highest_danceability))

top_10_tracks_highest_danceability = spotify_df.sort_values(['danceability'],ascending=False)[['name','danceability']].head(10)
top_10_tracks_highest_danceability

### Q. No.2 What is the lowest loudness among the tracks, and which 10 tracks have the lowest loudness?

In [None]:
lowest_loudness = spotify_df['loudness'].min()
lowest_loudness = round(lowest_loudness,2)
print('The lowest loudness of the tracks is {}.'.format(lowest_loudness))

top_10_tracks_lowest_loudness = spotify_df.sort_values(['loudness'],ascending=True)[['name','loudness']].head(10)
top_10_tracks_lowest_loudness
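
For "top N" and "bottom N" questions like the two above, pandas also offers `nlargest` and `nsmallest`, which avoid writing out a full sort. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up toy frame with danceability and loudness values.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'danceability': [0.45, 0.99, 0.72, 0.88],
    'loudness': [-5.2, -60.0, -12.3, -30.1],
})

# The two rows with the highest danceability.
most_danceable = toy.nlargest(2, 'danceability')[['name', 'danceability']]

# The two rows with the lowest loudness.
quietest = toy.nsmallest(2, 'loudness')[['name', 'loudness']]

print(most_danceable)
print(quietest)
```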

### Q. No.3 What is the average tempo of the tracks, and which tracks have the average tempo?

In [None]:
average_tempo = spotify_df['tempo'].mean()
average_tempo = round(average_tempo,2)
print('The average tempo of the tracks is {}.'.format(average_tempo))

average_tempo_tracks = spotify_df[spotify_df['tempo'] == average_tempo][['name','tempo']]
average_tempo_tracks
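
Exact equality on floating-point values can easily return no rows, because few tracks will match the mean to the last digit. A hedged alternative is to select the tracks within a small tolerance of the mean. The `tolerance` value below is an arbitrary assumption, and the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up toy frame of tempos in beats per minute.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'tempo': [90.0, 118.4, 118.6, 147.0],
})

average_tempo = toy['tempo'].mean()

# Hypothetical tolerance: treat tempos within 0.5 BPM of the mean as "average".
tolerance = 0.5
near_average = toy[np.isclose(toy['tempo'], average_tempo, atol=tolerance, rtol=0)]

print(near_average)
```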

### Q. No.4 List the 10 most popular tracks of 2020

In [None]:
tracks_of_2020 = spotify_df[spotify_df['release_year'] == 2020][['name','popularity','artists']]
print('There are {} tracks released in 2020.'.format(tracks_of_2020['name'].count()))
print('')
print('List of 10 most popular tracks of 2020')
popular_10_tracks_of_2020 = tracks_of_2020.sort_values(['popularity'],ascending=False).head(10)
popular_10_tracks_of_2020

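### Q. No.5 Create a dataframe for the tracks released on New Year's Day and list the 10 most popular tracks among them.

In [None]:
# New Year's Day is 1 January, so both the month and the day must match.
# Note: any release dates recorded with only a year would also appear as 1 January.
tracks_released_on_new_year = spotify_df[(spotify_df['release_date'].dt.month == 1) & (spotify_df['release_date'].dt.day == 1)]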
tracks_released_on_new_year

In [None]:
print('There are {} tracks released on new year.'.format(tracks_released_on_new_year['name'].count()))

In [None]:
most_popular_tracks_released_on_new_year = tracks_released_on_new_year.sort_values(['popularity'],ascending=False)[['name','popularity','artists','release_date']].head(10)
most_popular_tracks_released_on_new_year

## Conclusion and summary
We discovered some very interesting insights from the dataset.
- The number of tracks released is increasing every year.
- The largest number of tracks is released in January.
- Die drei has recorded the highest number of tracks.
- We also found that people prefer short songs over long songs.
- We also calculated some statistical values.

It is true that no analysis is ever perfect or complete. We can always find more insights in the data. In future work, I will try to explore more relationships between the variables.

## References

- [Stack Overflow Developer Survey](https://insights.stackoverflow.com)
- [Pandas user guide](https://pandas.pydata.org)
- [Matplotlib user guide](https://matplotlib.org)
- [Seaborn user guide & tutorial](https://seaborn.pydata.org/tutorial.html)
- [Plotly documentation](https://plotly.com/python/)
- [opendatasets Python library](https://github.com/JovianML/opendatasets)
- [Kaggle dataset](https://www.kaggle.com/)
- [geeksforgeeks](https://www.geeksforgeeks.org)