<a href="https://colab.research.google.com/github/zephyrroche/Spotify-Data-Analysis/blob/main/Spotify_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. What are the top 10 most popular tracks?
2. What is the average duration of songs (in minutes)?
3. How many explicit songs are there compared to non-explicit ones?
4. Which year had the most song releases?
5. What is the correlation between energy and danceability?
6. What are the top 5 artists with the most songs in the dataset?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import files
data_to_load = files.upload()

In [None]:
df = pd.read_csv('tracks.csv')

In [None]:
df.head()

# What are the top 10 most popular tracks?

In [None]:
popular = df[['name', 'artists', 'popularity']].sort_values(by='popularity', ascending=False).head(10)
plt.figure(figsize=(10,6))
sns.barplot(x='popularity', y='name', data=popular, palette='viridis')
plt.title('Top 10 Most Popular Tracks')
plt.xlabel('Popularity')
plt.ylabel('Track Name')
plt.show()

**The chart above shows the 10 tracks with the highest popularity scores in the dataset, i.e. the ones listeners have played and loved the most.**

# What is the average duration of songs (in minutes)?

In [None]:
# Step 1: Take the average of the duration_ms column
duration = df['duration_ms'].mean()

# Step 2: Convert milliseconds to minutes
# 60 seconds in a minute. 1000 milliseconds in a second. So 60000 milliseconds in a minute.
duration_mins = duration / 60000

# Step 3: Print the result
print('Average Duration of Songs:', duration_mins, 'minutes')

In [None]:
# Round the value to 2 decimal places (the second argument of round)
print("Average Song Duration:", round(duration_mins, 2), "minutes")

**On average, songs in the dataset are 3.83 minutes long, which is close to 4 minutes (about 3 minutes 50 seconds).**
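
A decimal figure like 3.83 minutes is easy to misread as "3 minutes 83 seconds". The conversion to minutes and seconds can be sketched as follows; `ms_to_min_sec` is a hypothetical helper name, not part of the notebook, and the input is simply 3.83 minutes written in milliseconds.

```python
def ms_to_min_sec(duration_ms):
    """Convert milliseconds to (minutes, seconds), rounded to the nearest second."""
    total_seconds = round(duration_ms / 1000)
    minutes, seconds = divmod(total_seconds, 60)
    return minutes, seconds


# 3.83 minutes * 60000 ms per minute = 229800 ms
average_ms = 229_800
minutes, seconds = ms_to_min_sec(average_ms)
print(f"{minutes} min {seconds} s")  # 3 min 50 s
```

In the notebook itself the same helper could be applied to `df['duration_ms'].mean()`.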

# How many explicit songs are there compared to non-explicit ones?

In [None]:
explicit_songs = df['explicit'].value_counts()
print("Explicit vs Non-Explicit Songs:", explicit_songs)

# 0 = non-explicit, 1 = explicit

In [None]:
plt.figure(figsize=(6,4))
sns.barplot(x=explicit_songs.index, y=explicit_songs.values, palette='mako')
plt.xticks([0,1], ['Non-Explicit', 'Explicit'])
plt.title('Explicit vs Non-Explicit Songs')
plt.xlabel('Song Type')
plt.ylabel('Count')
plt.show()

**The dataset has significantly more non-explicit songs than explicit ones, so most of the music in it carries no explicit-content flag and can be considered family-friendly.**
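
Raw counts are easier to compare when shown next to percentages. The sketch below uses a tiny made-up DataFrame (`toy`, not taken from tracks.csv) and assumes the same 0/1 coding of the `explicit` column as above; `value_counts(normalize=True)` returns shares instead of counts.

```python
import pandas as pd

# Toy data for illustration only; these values are not from tracks.csv
toy = pd.DataFrame({"explicit": [0, 0, 0, 1]})

counts = toy["explicit"].value_counts().sort_index()
shares = toy["explicit"].value_counts(normalize=True).sort_index()

# Put counts and percentages side by side with readable labels
summary = pd.DataFrame({"count": counts, "share_pct": (shares * 100).round(1)})
summary.index = summary.index.map({0: "Non-Explicit", 1: "Explicit"})
print(summary)
```

Replacing `toy` with `df` would give the split for the full dataset.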

# Which year had the most song releases?

In [None]:
df['release_year'] = pd.to_datetime(df['release_date'], errors='coerce').dt.year
# errors='coerce' turns dates that cannot be parsed into NaT (missing) instead of raising an error
songs_per_year = df['release_year'].value_counts().sort_index()

In [None]:
plt.figure(figsize=(14,6))
songs_per_year.plot(kind='line')
plt.title('Number of Songs Released Each Year')
plt.xlabel('Year')
plt.ylabel('Number of Songs')
plt.grid()
plt.show()

**The year with the most song releases was 2020.**
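
The peak year can also be read off in code rather than from the chart. The sketch below is a minimal example on made-up data (`toy`). It assumes that some `release_date` values may contain only a year or a year and month, so it takes the first four characters as the year instead of parsing full dates; whether tracks.csv actually mixes formats this way is an assumption.

```python
import pandas as pd

# Toy data for illustration only; the mixed date formats are an assumption
toy = pd.DataFrame(
    {"release_date": ["2019-05-01", "2020", "2020-03", "2020-11-20", "not a date"]}
)

# Take the first four characters as the year; non-numeric values become NaN
year = pd.to_numeric(toy["release_date"].astype(str).str[:4], errors="coerce")
songs_per_year = year.dropna().astype(int).value_counts().sort_index()

# idxmax returns the index label (the year) with the highest count
busiest_year = songs_per_year.idxmax()
print(busiest_year, songs_per_year.max())  # 2020 3
```

Applied to the notebook's own `songs_per_year`, `idxmax()` is a direct way to confirm the year read from the line chart.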

# What is the correlation between energy and danceability?

In [None]:
correlation = df['energy'].corr(df['danceability'])
print('Correlation between Energy and Danceability:', round(correlation,2))

In [None]:
plt.figure(figsize=(12,6))
sns.scatterplot(x='energy', y='danceability', data=df, alpha=0.3)
plt.title('Energy vs Danceability')
plt.xlabel('Energy')
plt.ylabel('Danceability')
plt.show()

**There is a correlation of 0.24 between energy and danceability. This is a weak positive relationship: more energetic songs tend to be slightly more danceable, but energy alone is a poor predictor of danceability.**
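
One way to judge the strength of a Pearson correlation is to square it: r² is the share of variance in one variable that a straight-line fit on the other accounts for. The sketch below uses made-up values (`toy`) to show the calculation, then applies the same arithmetic to the 0.24 reported above.

```python
import pandas as pd

# Toy data for illustration only; these values are not from tracks.csv
toy = pd.DataFrame(
    {
        "energy": [0.2, 0.4, 0.5, 0.7, 0.9],
        "danceability": [0.5, 0.3, 0.6, 0.4, 0.7],
    }
)

# Series.corr computes the Pearson correlation by default
r = toy["energy"].corr(toy["danceability"])
r_squared = r ** 2
print("toy r:", round(r, 2), "toy r^2:", round(r_squared, 2))

# The same arithmetic for the correlation reported in the notebook
reported_r = 0.24
reported_r_squared = round(reported_r ** 2, 4)
print(reported_r_squared)  # 0.0576, i.e. about 6% of the variance
```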

# What are the top 5 artists with the most songs in the dataset?

In [None]:
# Inspect how the artists column is stored before counting
df[['name', 'artists']].head()

In [None]:
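# Note: this copies the 'artists' column as-is, so a track credited to several
# artists is counted under the combined entry, not under its first artist
df['primary_artist'] = df['artists']
top_artists = df['primary_artist'].value_counts().head(5)
print("Top 5 Artists with Most Songs:", top_artists)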
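
If the `artists` column stores each entry as a list written out as text, such as `"['A', 'B']"`, counting the raw strings treats every combination of artists as a separate "artist". That storage format is an assumption here, not something the notebook confirms. Under that assumption, a minimal sketch on made-up data (`toy`) could parse each entry and count either the first-listed artist or every credited artist; `parse_artists` is a hypothetical helper.

```python
import ast

import pandas as pd

# Toy data for illustration only; the "['A', 'B']" string format is an assumption
toy = pd.DataFrame({"artists": ["['A']", "['A', 'B']", "['B']", "['A']", "C"]})


def parse_artists(value):
    """Return a list of artist names; fall back to the raw string if it is not a list literal."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return [value]
    if isinstance(parsed, (list, tuple)):
        return [str(name) for name in parsed]
    return [str(parsed)]


artist_lists = toy["artists"].apply(parse_artists)

# Option 1: count only the first-listed artist of each track
primary_artist = artist_lists.apply(lambda names: names[0] if names else None)
print(primary_artist.value_counts().head(5))

# Option 2: count every credited artist (one row per artist after explode)
all_artists = artist_lists.explode()
print(all_artists.value_counts().head(5))
```

Which option fits depends on whether featured artists should count toward an artist's total.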