# Exercise for UML

# Project Title - Spotify recommender system
## About the Dataset
This dataset contains 114,000 songs along with metadata such as popularity and genre. The exercise is divided into three parts: EDA, PCA and clustering, and finally the recommender system.
Try to write your own functions and get to know your keyboard shortcuts.
You can work in Google Colab or locally.

The dataset: https://raw.githubusercontent.com/aaubs/ds-master/main/data/spotify_UML/spotify.csv

# Part 1

## Goals of Part 1
    1. Clean up the dataset and check for duplicates
    2. EDA
    3. Plots
## Relevant libraries for this part
    1. Pandas
    2. Numpy
    3. Matplotlib
    4. Seaborn
    5. Pygwalker



## Exercises Part 1

In [190]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib as mpl
import seaborn as sns
import altair as alt
import pygwalker as pyg # this is installed in the terminal as well
import matplotlib.pyplot as plt

In [None]:
# Import the dataset. 

## Define the URL of the dataset
url = "https://raw.githubusercontent.com/aaubs/ds-master/main/data/spotify_UML/spotify.csv"

## Use Pandas to read the dataset into a DataFrame
df = pd.read_csv(url)

## Display the first few rows of the dataset to understand its structure
df.head()

In [None]:
# Understand the dataset. What columns are available?
df.info()

In [None]:
# We can also get the columns by running this code
columns = df.columns
print("Columns available in the dataset:")
for column in columns:
    print(column)

In [None]:
df.describe().T

In [None]:
# Count the number of distinct values in each column:
df.nunique()
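
The goals of Part 1 mention checking for duplicates. A minimal sketch of how that could look, using a small made-up DataFrame in place of `df` (the rows and values are invented for illustration; the column names mirror the dataset). It assumes that a repeated `track_id` is what should count as a duplicate, which is a design choice rather than something the dataset dictates.

```python
import pandas as pd

# Toy stand-in for the Spotify DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "track_id": ["a1", "a1", "b2", "c3"],
    "track_name": ["Song A", "Song A", "Song B", "Song C"],
    "track_genre": ["pop", "dance", "rock", "jazz"],
    "popularity": [70, 70, 55, 40],
})

# Rows that are identical in every column (none here, because the genre differs)
n_exact = toy.duplicated().sum()

# Rows that repeat an earlier track_id, regardless of the other columns
n_same_track = toy.duplicated(subset="track_id").sum()

# Keep only the first occurrence of each track_id
toy_unique = toy.drop_duplicates(subset="track_id", keep="first")

print(n_exact, n_same_track, len(toy_unique))
```

Comparing `df.nunique()` for `track_id` with the number of rows gives a quick hint whether the second kind of duplicate is present.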

In [None]:
# Check for missing values. How would you handle them?
missing_values = df.isnull().sum()
print("Missing values in the dataset:")
missing_values

# There are missing values in three columns: 'artists', 'album_name' and 'track_name'.

In [None]:
# To list the columns that have missing values, run this code
missing_values = df.isna().any()
missing_values[missing_values].index.tolist()

In [None]:
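# One simple way to handle them is to fill them with 0.
# Note that this puts the integer 0 into text columns, which therefore end up with mixed data types.
df = df.fillna(0)
df.isnull().sum() # Now there are no missing values left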
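
Filling with 0 is only one option. As a hedged alternative, the sketch below shows two other common choices on a made-up DataFrame: dropping the affected rows, or filling text columns with a string placeholder so they stay purely textual. The placeholder `"Unknown"` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in with missing text fields (values are made up)
toy = pd.DataFrame({
    "artists": ["Artist A", np.nan, "Artist C"],
    "track_name": ["Song A", "Song B", np.nan],
    "popularity": [70, 55, 40],
})
text_cols = ["artists", "track_name"]

# Option 1: drop rows where any of the text fields is missing
dropped = toy.dropna(subset=text_cols)

# Option 2: keep all rows and fill with a string placeholder,
# so the text columns contain strings only
filled = toy.copy()
filled[text_cols] = filled[text_cols].fillna("Unknown")

print(len(dropped))
print(filled.isna().sum().sum())
```

Which option fits best depends on how many rows are affected and whether the text columns are used later, for example in the recommender of Part 3.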

In [None]:
# Box plots for some of the numeric columns to detect outliers
for column in ['popularity', 'duration_ms', 'danceability']:
    plt.figure(figsize=(10, 6))
    df.boxplot(column=[column])
    plt.title(f"Boxplot of {column}")
    plt.show()

In [None]:
# What are the distributions of song popularity, duration_ms, and danceability? Use appropriate visualizations.
sns.displot(data=df,
            x="popularity",
            kind="kde")
sns.displot(data=df,
            x="duration_ms",
            kind="kde")
sns.displot(data=df,
            x="danceability",
            kind="kde")

In [None]:
# How many unique genres are in the dataset? List the top 20. (Explain how you choose to list the top 20)
# To calculate the number of different genres print the below
unique_genres = df['track_genre'].nunique()
print("Number of unique genres:", unique_genres)

In [None]:
# Print every genre in the dataset
unique_values = df['track_genre'].unique()
print("Unique Values in 'track_genre':")
for value in unique_values:
    print(f"- {value}")

In [None]:
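# List the top 20. (Explain how you choose to list the top 20)
# value_counts() sorts genres by number of songs, so head(20) returns the 20 most frequent genres.
# If several genres have the same count, their relative order carries no meaning.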
top_20_genres = df['track_genre'].value_counts().head(20)
print("Top 20 genres:")
print(top_20_genres)

In [None]:
# Visualize the number of songs by genre. Which are the most common genres?
# There are 1000 songs in each genre, so all genres are equally represented in the dataset.
genre_counts = df['track_genre'].value_counts()

plt.figure(figsize=(15, 5))
genre_counts.plot(kind='bar')

plt.title('Number of Songs by Genre')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.xticks(rotation=90)

plt.show()

In [None]:
# Rank genres by the average popularity of their songs. Which genres tend to have more popular songs?
popularity = df.groupby('track_genre')['popularity'].mean().sort_values(ascending=False)
popularity

In [None]:
# If we want to make a column in the df which shows the rank, we can follow the below steps.

## Group by 'track_genre' and calculate the average popularity for each genre
genre_popularity_rank = df.groupby('track_genre')['popularity'].mean().reset_index()

## Rank genres by popularity in descending order
genre_popularity_rank['popularity_rank'] = genre_popularity_rank['popularity'].rank(ascending=False, method='min')

## Sort the DataFrame by popularity rank
genre_popularity_rank = genre_popularity_rank.sort_values(by='popularity_rank')

## Merge the popularity rank DataFrame with the original DataFrame
df = df.merge(genre_popularity_rank[['track_genre', 'popularity_rank']], on='track_genre', how='left')

## Display the updated DataFrame
df.head()

In [None]:
# Explore other characteristics (like danceability, energy, etc.) by genre. Are there any noticeable differences or trends?
energy = df.groupby('track_genre')['energy'].mean().sort_values(ascending=False)
energy

In [None]:
# Danceability is low for genres such as opera and sleep, which is expected
dance = df.groupby('track_genre')['danceability'].mean().sort_values(ascending=False)
dance

In [None]:
# 1 millisecond (ms) equals 1/60000 ≈ 1.666667×10^-5 minutes (min).
# Conversely, 1 minute (min) equals 60000 milliseconds (ms).

# Therefore, we divide the milliseconds column by 60000 to get the duration in minutes
df['duration_minutes'] = df['duration_ms'] / 60000

# As shown below, the average duration differs noticeably between genres
duration_minutes = df.groupby('track_genre')['duration_minutes'].mean().sort_values(ascending=False)
duration_minutes

In [None]:
# Investigate the relationship between danceability and energy. Do songs that are more danceable tend to have more energy? Use a scatter plot.
# Pass the column names so that each point is one song
sns.relplot(data=df,
            x="danceability",
            y="energy",
            kind="scatter")

In [None]:
# We can also make a scatter plot with a linear regression line to see the relationship between danceability and energy

# Set the Seaborn style to "whitegrid" to get gridlines
sns.set(style="whitegrid")

# Create a scatter plot with sns.relplot
scatter_plot_dance_energy = sns.relplot(data=df, x="danceability", y="energy", kind="scatter")

# Add a linear regression line using sns.regplot
sns.regplot(data=df, x="danceability", y="energy", scatter=False, ax=scatter_plot_dance_energy.ax)

# Optional: Customize the plot
plt.title("Scatter Plot with Linear Regression Line")
plt.xlabel("Danceability")
plt.ylabel("Energy")

# Show the plot
plt.show()

In [None]:
# How does song popularity relate to other characteristics like danceability, loudness, or tempo?
sns.relplot(data=df,
            x="danceability",
            y="popularity",
            kind="scatter")

In [None]:
# Scatter plot with a linear regression line
scatter_plot_dance_popularity = sns.relplot(data=df, x="danceability", y="popularity", kind="scatter")
sns.regplot(data=df, x="danceability", y="popularity", scatter=False, ax=scatter_plot_dance_popularity.ax)

# Optional: Customize the plot
plt.title("Scatter Plot with Linear Regression Line")
plt.xlabel("Danceability")
plt.ylabel("Popularity")

# Show the plot
plt.show()
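
Scatter plots show one pair of variables at a time. To compare several characteristics against popularity at once, a correlation table can help. The sketch below uses synthetic data (generated so that popularity is loosely tied to danceability, purely for illustration) and Pearson correlation, which only captures linear relationships.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the song features (not real Spotify values)
rng = np.random.default_rng(0)
n = 500
danceability = rng.uniform(0, 1, n)
toy = pd.DataFrame({
    "danceability": danceability,
    "loudness": rng.normal(-8, 3, n),
    "tempo": rng.normal(120, 25, n),
    # Popularity is constructed to depend on danceability plus noise
    "popularity": 40 * danceability + rng.normal(0, 10, n),
})

# Pearson correlation of every feature with popularity, strongest first
corr_with_popularity = (
    toy.corr()["popularity"]
    .drop("popularity")
    .sort_values(ascending=False)
)
print(corr_with_popularity)
```

On the real data the same pattern would be `df[[...]].corr()["popularity"]` with the numeric columns of interest.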

In [None]:
# How do explicit songs compare to non-explicit ones in terms of popularity or other characteristics?
# First we group data into explicit and non-explicit songs

explicit_songs = df[df['explicit'] == True]  # Select rows where 'explicit' is True
non_explicit_songs = df[df['explicit'] == False]  # Select rows where 'explicit' is False

# Compare popularity using summary statistics
explicit_popularity_mean = explicit_songs['popularity'].mean()
non_explicit_popularity_mean = non_explicit_songs['popularity'].mean()

# Print the mean popularity for explicit and non-explicit songs
print(f"Mean Popularity for Explicit Songs: {explicit_popularity_mean}")
print(f"Mean Popularity for Non-Explicit Songs: {non_explicit_popularity_mean}")
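
The same comparison can be extended to other characteristics and other statistics with a single `groupby`. This is a small sketch on made-up values, not the notebook's actual result.

```python
import pandas as pd

# Toy stand-in (values are made up for illustration)
toy = pd.DataFrame({
    "explicit": [True, True, False, False, False],
    "popularity": [60, 40, 30, 50, 10],
    "energy": [0.9, 0.7, 0.4, 0.6, 0.5],
})

# Mean, median and group size for each feature, split by the explicit flag
summary = toy.groupby("explicit")[["popularity", "energy"]].agg(["mean", "median", "count"])
print(summary)
```

Including `count` next to the mean makes it visible when one group is much smaller than the other, which matters when interpreting the difference.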

In [None]:
# Are there any trends related to tempo or time_signature?

# Looking at the genre averages in the plot below, the sweet spot appears to be at a tempo of around 110-130.
genre_means = df.groupby('track_genre')[['tempo', 'popularity']].mean()
sns.scatterplot(data=genre_means, x='tempo', y='popularity')
plt.show()

In [None]:
# Plot for time_signature (genre averages)
genre_means = df.groupby('track_genre')[['time_signature', 'popularity']].mean()
sns.scatterplot(data=genre_means, x='time_signature', y='popularity')
plt.show()

# Part 2

## Goals of Part 2
    1. Pre-processing for PCA (encoding & scaling)
    2. PCA and explanations of results
    3. Clustering
## Relevant libraries for this part
    1. StandardScaler
    2. PCA
    3. KMeans



In [None]:
# Importing StandardScaler from scikit-learn (sklearn)
from sklearn.preprocessing import StandardScaler

# Importing PCA (Principal Component Analysis) from scikit-learn (sklearn)
from sklearn.decomposition import PCA

# Importing KMeans clustering algorithm from scikit-learn (sklearn)
from sklearn.cluster import KMeans

# Importing LabelEncoder from scikit-learn (sklearn)
from sklearn.preprocessing import LabelEncoder

In [None]:
# Inspect Data Types: 
# We have to confirm the data types within the columns. 

# We can do this by checking the unique data types present in the column
for column in df.columns:
    data_type = df[column].apply(type).unique()
    print(f"Column '{column}' data types: {data_type}")

In [None]:
# As we can see from the output above, there are columns with more than one data type.
# To list only the columns that have more than one data type, we can use the code below
for column in df.columns:
    unique_data_types = df[column].apply(type).unique()
    if len(unique_data_types) > 1:
        print(f"Column '{column}' has multiple data types: {unique_data_types}")


In [None]:
# Convert all entries in 'artists', 'track_name' and 'album_name' to strings
df['artists'] = df['artists'].astype(str)
df['track_name'] = df['track_name'].astype(str)
df['album_name'] = df['album_name'].astype(str)

In [None]:
# Initialize the LabelEncoder
encoder = LabelEncoder()

# Encode all categorical columns (dtype as 'object') in the DataFrame
for column in df.select_dtypes(include=['object']):
    df[column + '_id'] = encoder.fit_transform(df[column])

In [None]:
df.head()

In [None]:
id_columns = [col for col in df.columns if '_id' in col]
df[id_columns]

In [None]:
# We select the below features from the dataframe
selected_columns = ['Unnamed: 0', 'danceability',  'energy', 'popularity',  'duration_ms', 'track_name_id',  'track_genre_id', 'track_id_id',  'artists_id',  'album_name_id', 'liveness',  'valence',  'tempo',  'time_signature']

# Create a new DataFrame with the selected columns
data_to_cluster = df[selected_columns]

# Display the new DataFrame
data_to_cluster.head()

In [None]:
# Handle any missing or categorical data.
# Standardize the dataset since PCA is sensitive to the magnitude of the data.
scaler = StandardScaler()
data_to_cluster_scaled = scaler.fit_transform(data_to_cluster)
data_to_cluster_scaled

In [None]:
# Conduct a PCA on the song characteristics.
# Create a PCA instance with the desired number of components

# Choose the number of components you want to keep
n_components = 5  
pca = PCA(n_components=n_components)

# Fit the PCA model to the data and transform the data to the principal components
X_pca = pca.fit_transform(data_to_cluster_scaled)

# Visualize the explained variance for each principal component.
explained_variance = pca.explained_variance_ratio_
print("Explained Variance Ratios:", explained_variance)
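
One way to explain the choice of `n_components` is the cumulative explained variance. The sketch below fits a PCA with all components on synthetic data and finds the smallest number of components that reaches a threshold. The 80% threshold and the random data are assumptions for illustration, not values taken from this exercise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 6 features, two of them nearly redundant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to see the full variance profile
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.80  # assumed cut-off, adjust to taste
n_needed = int(np.argmax(cumulative >= threshold) + 1)

print(cumulative.round(3), n_needed)
```

Plotting `cumulative` against the component number gives the usual scree-style view of the same information.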

In [None]:
# Reduce the dataset's dimensions based on the PCA results and visualize the data in the reduced dimension space.
plt.scatter(x=X_pca[:, 0], y=X_pca[:, 1], alpha=0.6, color='blue')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Result')
plt.show()

In [None]:
# Choose a clustering algorithm (e.g., KMeans, DBSCAN, or Hierarchical).

## We use KMeans because it scales well to a large number of data points.

In [None]:
# Determine the optimal number of clusters (if needed, like in KMeans). explain how you get to that number of clusters

# Initialize variables
inertia_values = []
k_range = range(1, 11)  # We will check for up to 10 clusters

# Run K-means with different k values and store the inertia (sum of squared distances)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_pca)
    inertia_values.append(kmeans.inertia_)

# Plot the Elbow method graph (sum of squared distances for each 'k')
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertia_values, marker='o')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.grid(True)
plt.show()
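
The elbow in the inertia curve can be hard to read. A complementary check is the silhouette score, which is highest when clusters are compact and well separated. The sketch below runs it on synthetic blobs with four known centers (made up for illustration), so it demonstrates the technique rather than the result for the Spotify data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four clearly separated groups
X_toy, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.6,
    random_state=42,
)

# Silhouette score needs at least 2 clusters
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On a dataset of this exercise's size the score is expensive to compute on every point; `silhouette_score` accepts a `sample_size` argument to evaluate it on a random subset instead.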

In [None]:
k=4
centroids = X_pca[np.random.choice(X_pca.shape[0], k, replace=False)]

# Plot observations
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], alpha=0.6, color='blue')

# Plot centroids
sns.scatterplot(x=centroids[:, 0], y=centroids[:, 1], color='red', s=100)

plt.title('PCA Reduced Data and Initial Centroids')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

In [None]:
def k_means_simple(X_pca, k, max_iters=100):
    # 1. Initialize the k cluster centroids
    centroids = X_pca[np.random.choice(X_pca.shape[0], k, replace=False)]

    for _ in range(max_iters):
        # 2. Assign each data point to the closest centroid
        distances = np.linalg.norm(X_pca - centroids[:, np.newaxis], axis=2)
        labels = np.argmin(distances, axis=0)

        # 3. Recompute the centroids
        new_centroids = np.array([X_pca[labels == i].mean(axis=0) for i in range(k)])

        # Check for convergence
        if np.all(centroids == new_centroids):
            break

        centroids = new_centroids

    return labels, centroids

# Print the centroids
labels, final_centroids = k_means_simple(X_pca, 4)
print("Cluster centroids:\n", final_centroids)

In [None]:
# Cluster the songs based on the reduced dimensions from PCA.
# Plot observations after the final iteration
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], alpha=0.6, color='blue')

# Plot centroids
sns.scatterplot(x=final_centroids[:, 0], y=final_centroids[:, 1], color='red', s=100)

plt.title('PCA Reduced Data and Final-Iteration Centroids')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

In [None]:
# Visualize the clusters and interpret any patterns. Write your interpretations

In [None]:
# 2. Assign each data point to the closest centroid
distances = np.linalg.norm(X_pca - centroids[:, np.newaxis], axis=2)
labels = np.argmin(distances, axis=0)

In [None]:
# 3. Recompute the centroids
new_centroids = np.array([X_pca[labels == i].mean(axis=0) for i in range(k)])
new_centroids

In [None]:
# Plot observations after the 1st iteration
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], alpha=0.6, color='blue')

# Plot centroids
sns.scatterplot(x=new_centroids[:, 0], y=new_centroids[:, 1], color='red', s=100)

plt.title('PCA Reduced Data and 1st-Iteration Centroids')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

In [None]:
clusterer = KMeans(n_clusters=4)

In [None]:
clusterer.fit(data_to_cluster_scaled)

In [None]:
# we can then copy the cluster-numbers into the original file and start exploring
df['cluster'] = clusterer.labels_
df.head(1)

In [None]:
plt.figure(figsize=(18,2))
sns.heatmap(pd.DataFrame(pca.components_, columns=data_to_cluster.columns), annot=True) 

In [None]:
# Plot observations after the final iteration
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], alpha=0.6, color='blue')

# Plot centroids
sns.scatterplot(x=final_centroids[:, 0], y=final_centroids[:, 1], color='red', s=100)

plt.title('PCA Reduced Data and Final-Iteration Centroids')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

# Part 3

## Goals of Part 3
    1. Vectorization
    2. Cosine similarities
    3. Build and test recommender
    Objective: Develop a basic music recommender system that suggests songs based on textual data and put it in a small Gradio app
## Relevant libraries for this part
    1. linear_kernel
    2. TfidfVectorizer
    3. Gradio

Build the Recommender:

  Create a function that takes a song name as input and outputs a list of songs recommended based on textual similarity. For this, you'll use the cosine similarity scores.
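
A minimal sketch of such a function, assuming a TF-IDF representation of a combined "artist - track" text column. The four songs, the artist names, and the function name `recommend` are made up for illustration; on the real data the same steps would run on the full DataFrame.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy catalogue (invented songs and artists)
songs = pd.DataFrame({
    "track_name": ["Blue Sky", "Blue Moon", "Red River", "Sky High"],
    "artists": ["Nova", "Luna", "Orion", "Nova"],
})
songs["song_artist"] = songs["artists"] + " - " + songs["track_name"]

# Turn the combined text into TF-IDF vectors (one row per song)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(songs["song_artist"])


def recommend(track_name, top_n=2):
    """Return the top_n most similar songs to the first song with this name."""
    matches = songs.index[songs["track_name"] == track_name]
    if len(matches) == 0:
        return []
    idx = int(matches[0])
    # TF-IDF rows are L2-normalised by default, so the linear kernel
    # (dot product) equals the cosine similarity
    scores = linear_kernel(tfidf_matrix[idx:idx + 1], tfidf_matrix).ravel()
    scores[idx] = -1.0  # exclude the query song itself
    top = scores.argsort()[::-1][:top_n]
    return songs["song_artist"].iloc[top].tolist()


print(recommend("Blue Sky"))
```

Here "Sky High" ranks first for "Blue Sky" because it shares both the artist and the word "sky", while "Blue Moon" shares only the word "blue". Only the similarity row for the query song is computed, not the full song-by-song matrix.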

In [None]:
# Import the libraries
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as ss
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_distances

In [None]:
# Refine the Textual Data: Consider merging multiple textual columns (e.g., artist name + track name) to generate recommendations based on combined textual data.

# Merge 'artists' and 'track_name' columns with a separator
df['song_artist'] = df['artists'] + ' - ' + df['track_name']

# Display the DataFrame with the new 'song_artist' column
df[['artists', 'track_name', 'song_artist']]

In [None]:
ones = np.ones(len(df), np.uint64)
matrix = ss.coo_matrix((ones, (df['track_id_id'], df['popularity'])))

In [None]:
# Inspect the sparse matrix.
matrix

In [None]:
print(matrix.row) # check row indices
print(matrix.col) # check column indices
print(matrix.shape)
print(matrix.data.shape)

In [None]:
svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
matrix_track_id_id = svd.fit_transform(matrix)
matrix_popularity = svd.fit_transform(matrix.T)

In [None]:
cosine_distance_matrix_places = cosine_distances(matrix_popularity)

In [None]:
# Filtering by Additional Features: How might you modify the recommender to suggest only songs from a particular genre or only non-explicit songs?
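
One possible answer, sketched under assumptions: compute the similarity scores as before, then restrict the candidates with a boolean mask before ranking. The toy catalogue, the function name `recommend_filtered`, and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy catalogue (invented songs, genres and flags)
songs = pd.DataFrame({
    "song_artist": ["Nova - Blue Sky", "Luna - Blue Moon",
                    "Orion - Blue River", "Nova - Sky High"],
    "track_genre": ["pop", "jazz", "pop", "rock"],
    "explicit": [False, False, True, False],
})
tfidf_matrix = TfidfVectorizer().fit_transform(songs["song_artist"])


def recommend_filtered(idx, genre=None, allow_explicit=True, top_n=3):
    """Rank songs similar to row idx, keeping only those that pass the filters."""
    scores = linear_kernel(tfidf_matrix[idx:idx + 1], tfidf_matrix).ravel()

    # Start with every song except the query, then narrow down
    mask = np.ones(len(songs), dtype=bool)
    mask[idx] = False
    if genre is not None:
        mask &= (songs["track_genre"] == genre).to_numpy()
    if not allow_explicit:
        mask &= ~songs["explicit"].to_numpy()

    # Rank only the remaining candidates by similarity, highest first
    candidates = np.flatnonzero(mask)
    ranked = candidates[np.argsort(scores[candidates])[::-1]][:top_n]
    return songs["song_artist"].iloc[ranked].tolist()


print(recommend_filtered(0, genre="pop"))
print(recommend_filtered(0, allow_explicit=False))
```

Filtering before ranking guarantees that `top_n` applies to the allowed songs only, rather than discarding results after the fact and returning fewer than requested.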

In [None]:
#  Improving Efficiency: If you have a very large dataset, computing cosine similarities can be time-consuming. How might you address this efficiency concern?
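
One common approach, sketched here as an assumption rather than the intended answer: avoid building the full song-by-song similarity matrix and query neighbours on demand. scikit-learn's `NearestNeighbors` with the cosine metric works directly on the sparse TF-IDF matrix and computes distances only for the rows passed to `kneighbors`. The four texts are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy texts (invented songs)
texts = pd.Series([
    "Nova - Blue Sky",
    "Luna - Blue Moon",
    "Orion - Red River",
    "Nova - Sky High",
])
tfidf_matrix = TfidfVectorizer().fit_transform(texts)

# Brute-force cosine search on the sparse matrix; nothing is precomputed
# beyond storing the vectors
nn = NearestNeighbors(n_neighbors=3, metric="cosine", algorithm="brute")
nn.fit(tfidf_matrix)

# Query a single song: the cost grows with the number of songs,
# not with its square
distances, indices = nn.kneighbors(tfidf_matrix[0:1])
print(indices[0], distances[0].round(3))
```

The query song itself comes back as its own nearest neighbour with distance 0, so it should be dropped from the result. Reducing the dimensionality first (for example with `TruncatedSVD`, already imported above) is another option.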