# 1. Dataset and Features

We use the Spotify tracks dataset. In this project, we explore the relationships between its numerical columns in order to make predictions about a track from its characteristics. Specifically, we are interested in the correlations between numerical columns such as valence, energy, danceability, speechiness, and instrumentalness. Understanding these correlations should help us predict statistics such as popularity from other track information.

We start by loading the dataset with `pandas`, reading the CSV directly from its Hugging Face `hf://` path (shown below).
Note for project members: you have to run this every time you reopen the notebook.

In [None]:
import pandas as pd
import tqdm as notebook_tqdm
spotify = pd.read_csv("hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv")

In [None]:
# import some libraries
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
pd.options.mode.copy_on_write = True

## Getting started

We first print out the columns of the dataset and its first 20 rows.

In [None]:
spotify.columns

In [None]:
spotify.head(20)

In [None]:
# Get 20 random rows
spotify.sample(20)

In [None]:
# Check the shape of spotify dataset
spotify.shape

In [None]:
# Sanity check: get the counts of each artist and track_name combination
counts = spotify.groupby(['artists', 'track_name']).size().reset_index(name='count')
print(counts)
print("There are " + str(sum(counts['count'] != 1)) + " artist, track_name combinations that are non-unique.")

### Sanity checks!

- Are there any entries with null values?
- Do numbers fall in the expected range?
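
The range checks below do not cover the first question, so here is a minimal sketch of a null-value check. It runs on a small hypothetical DataFrame named `toy` (not part of the project); the same two calls should apply to `spotify` unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `spotify`
toy = pd.DataFrame({
    "track_name": ["a", None, "c"],
    "valence": [0.5, 0.1, np.nan],
    "tempo": [120.0, 0.0, 98.5],
})

# Number of null entries in each column
null_counts = toy.isna().sum()
print(null_counts)

# Rows that contain at least one null value
rows_with_nulls = toy[toy.isna().any(axis=1)]
print(len(rows_with_nulls))
```

Note that a check like this only finds true nulls; placeholder values such as a tempo of 0 have to be found separately.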


In [None]:
# popularity between 0 and 100
sum(spotify['popularity'] < 0) + sum(spotify['popularity'] > 100)

In [None]:
# danceability between 0.0 and 1.0
sum(spotify['danceability'] < 0.0) + sum(spotify['danceability'] > 1.0)

In [None]:
# energy between 0.0 and 1.0
sum(spotify['energy'] < 0.0) + sum(spotify['energy'] > 1.0)

In [None]:
# mode is 0 or 1
sum(x not in [0,1] for x in spotify['mode'])

In [None]:
# speechiness between 0.0 and 1.0
sum(spotify['speechiness'] < 0.0) + sum(spotify['speechiness'] > 1.0)

In [None]:
# acousticness between 0.0 and 1.0
sum(spotify['acousticness'] < 0.0) + sum(spotify['acousticness'] > 1.0)

In [None]:
# instrumentalness between 0.0 and 1.0
sum(spotify['instrumentalness'] < 0.0) + sum(spotify['instrumentalness'] > 1.0)

In [None]:
# liveness between 0.0 and 1.0
sum(spotify['liveness'] < 0.0) + sum(spotify['liveness'] > 1.0)

In [None]:
# valence between 0.0 and 1.0
sum(spotify['valence'] < 0.0) + sum(spotify['valence'] > 1.0)

In [None]:
# tempo is non-negative
sum(spotify['tempo'] < 0.0)

In [None]:
# time signature between 3 and 7 (inclusive)
sum(spotify['time_signature'] < 3) + sum(spotify['time_signature'] > 7)
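
The range checks above all follow the same pattern, so they could be collected into one helper. The sketch below is one possible way to do that; `count_out_of_range`, `bounds`, and the `toy` DataFrame are hypothetical names introduced only for illustration. Unlike the cells above, this version also counts NaN as out of range, because `between` returns False for NaN.

```python
import pandas as pd

def count_out_of_range(df, bounds):
    """Return, per column, how many values fall outside the inclusive range [low, high]."""
    return pd.Series({
        # `between` is inclusive on both ends; NaN is flagged as out of range
        col: int((~df[col].between(low, high)).sum())
        for col, (low, high) in bounds.items()
    })

# Hypothetical toy frame standing in for `spotify`
toy = pd.DataFrame({
    "popularity": [0, 55, 101],
    "danceability": [0.2, 1.3, -0.1],
    "time_signature": [4, 0, 3],
})

# Expected ranges, taken from the comments in the cells above
bounds = {
    "popularity": (0, 100),
    "danceability": (0.0, 1.0),
    "time_signature": (3, 7),
}

out_of_range = count_out_of_range(toy, bounds)
print(out_of_range)
```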

### Let's visualize some missing values!

In [None]:
# identify whether each tempo value is zero
# and group by genre (index)
# count number of zero values
zero_tempo_by_genre = spotify.set_index("track_genre")["tempo"].eq(0).groupby(level=0).sum()

# Convert the result to a DataFrame 
zero_tempo_by_genre_df = zero_tempo_by_genre.reset_index()

# Create a bar chart 
px.bar(zero_tempo_by_genre_df,
       x='track_genre',
       y='tempo',  # The count of zero tempo values
       labels={'tempo': 'Number of zero tempo values', 'track_genre': 'Genre'},  # keys must match column names
       title="Zero Tempo Values by Genre")

In [None]:
# identify whether each valence value is zero
# and group by genre (index)
# count number of zero values
zero_valence_by_genre = spotify.set_index("track_genre")["valence"].eq(0).groupby(level=0).sum()

# Convert the result to a DataFrame 
zero_valence_by_genre_df = zero_valence_by_genre.reset_index()

# Create a bar chart 
px.bar(zero_valence_by_genre_df,
       x='track_genre',
       y='valence',  # The count of zero valence values
       labels={'valence': 'Number of zero valence values', 'track_genre': 'Genre'},  # keys must match column names
       title="Zero Valence Values by Genre")

In [None]:
# Flag time signatures outside the 3 to 7 range (True indicates an invalid value)
invalid_time_signatures = ~spotify.set_index("track_genre")["time_signature"].between(3, 7)

# Group by track_genre and sum the invalid counts
time_signatures = invalid_time_signatures.groupby(level=0).sum()

time_signatures_df = time_signatures.reset_index()

# Create a bar chart 
px.bar(time_signatures_df,
       x='track_genre',
       y='time_signature',  # The count of invalid time signature values
       labels={'time_signature': 'Number of invalid time signatures', 'track_genre': 'Genre'},  # keys must match column names
       title="Invalid Time Signature by Genre")
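
The three cells above repeat one pattern: build a boolean mask, then count the True values per genre. A possible helper for that step is sketched below; `count_by_genre` and the `toy` DataFrame are hypothetical names used only for this illustration.

```python
import pandas as pd

def count_by_genre(df, mask, genre_col="track_genre", name="count"):
    """Count the rows where `mask` is True within each genre."""
    # Grouping the boolean mask by the genre column and summing counts the True values
    return mask.groupby(df[genre_col]).sum().rename(name).reset_index()

# Hypothetical toy frame standing in for `spotify`
toy = pd.DataFrame({
    "track_genre": ["sleep", "sleep", "rock", "rock"],
    "tempo": [0.0, 0.0, 120.0, 0.0],
})

zero_tempo = count_by_genre(toy, toy["tempo"].eq(0), name="zero_tempo")
print(zero_tempo)
```

The resulting DataFrame has one row per genre and can be passed to `px.bar` in the same way as the DataFrames above.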


# 2. Exploratory Data Analysis (EDA)

## Examining relationships between pairs of variables

### Danceability and Energy

Danceability describes how easy it is to dance to a song, while energy measures how intense and active a track is. One would expect these to be positively correlated, and the scatter plot shows this to a small extent.

In [None]:
danceability = spotify['danceability']
energy = spotify['energy']
plt.scatter(danceability, energy, s=0.1)
plt.xlabel('Danceability')
plt.ylabel('Energy')
plt.title('Energy and Danceability of Spotify Tracks')

### Valence and Danceability

Valence is a measure describing how "positive" a track is, while danceability describes how suitable a track is to dance to. One might expect the two to have a positive correlation since more upbeat songs are often faster and more rhythmic, and thus easier to dance to.

In [None]:
danceability = spotify['danceability']
valence = spotify['valence']
plt.scatter(danceability, valence, s=0.1)
plt.xlabel('Danceability')
plt.ylabel('Valence')
plt.title('Valence and Danceability of Spotify Tracks')

### Speechiness and Instrumentalness

Speechiness measures the presence of spoken words in a song, while instrumentalness predicts whether a song contains no vocals. These two should be inversely related, which the scatter plot shows to some extent.

In [None]:
speechiness = spotify['speechiness']
instrumentalness = spotify['instrumentalness']
plt.scatter(speechiness, instrumentalness, s=0.1)
plt.xlabel('Speechiness')
plt.ylabel('Instrumentalness')
plt.title('Speechiness and Instrumentalness of Spotify Tracks')
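
The scatter plots give a visual impression only. To put a number on each relationship, one option is the Pearson correlation coefficient, sketched below on a small hypothetical DataFrame `toy` with made-up values. On the real data the same call would be applied to the corresponding `spotify` columns.

```python
import pandas as pd

# Hypothetical toy values, not taken from the Spotify dataset
toy = pd.DataFrame({
    "danceability": [0.1, 0.4, 0.6, 0.9],
    "energy": [0.2, 0.5, 0.5, 0.8],
    "instrumentalness": [0.9, 0.6, 0.4, 0.1],
})

# Pairwise Pearson correlations: +1 is a perfect positive linear relationship,
# -1 a perfect negative one, and values near 0 indicate no linear relationship
corr = toy.corr(method="pearson")
print(corr.round(2))
```

Pearson correlation only captures linear association, so a low value does not rule out a non-linear relationship.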

### Popularity with Respect to Valence and Danceability

Popularity measures how popular a song is. Valence measures how "positive" a track sounds, and danceability measures how suitable it is for dancing.

The heatmap below shows how tracks with low popularity (defined as popularity below 10) are distributed across valence and danceability bins.

In [None]:
import seaborn as sns
# Filter the DataFrame for tracks where 'popularity' < 10
spotify_filtered = spotify[spotify['popularity'] < 10]

# Put 'valence' and 'danceability' into bins
spotify_filtered['valence_bin'] = pd.cut(spotify_filtered['valence'], bins=10)
spotify_filtered['danceability_bin'] = pd.cut(spotify_filtered['danceability'], bins=10)

# Create a table where rows are binned 'valence' and columns are binned 'danceability', and values are the counts
heatmap_data = spotify_filtered.pivot_table(index='valence_bin', columns='danceability_bin', aggfunc='size', fill_value=0)

plt.figure(figsize=(12, 8))

# Create the heatmap with seaborn
sns.heatmap(heatmap_data, cmap='coolwarm', annot=True, fmt='d')

# Add labels and title
plt.title('Heatmap of Valence vs Danceability (Popularity < 10)', fontsize=16)
plt.xlabel('Danceability Bins', fontsize=12)
plt.ylabel('Valence Bins', fontsize=12)

plt.tight_layout()
plt.show()

# 3. Feature Imputation
The Spotify dataset contains many missing values, which are largely encoded as placeholders, although `null` values are also used. Such values are incompatible with our models because they introduce nonsensical patterns into the data.

A basic strategy (shown below) would be to discard entire rows or columns which contain the missing or placeholder values. However, the data lost may be valuable, and it may be a better strategy to **impute** values by inferring them from known data. 
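
The trade-off between the two strategies can be sketched on a small hypothetical DataFrame `toy`. The sketch assumes a tempo of 0.0 is a placeholder, and it fills gaps with the genre mean, which is only one simple choice and not necessarily the method used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; a tempo of 0.0 is assumed to be a placeholder
toy = pd.DataFrame({
    "track_genre": ["rock", "rock", "rock", "sleep"],
    "tempo": [120.0, 0.0, 100.0, 0.0],
})

# Encode the placeholder as NaN so pandas treats it as missing
marked = toy.assign(tempo=toy["tempo"].replace(0.0, np.nan))

# Strategy 1: discard every row with a missing tempo (loses two of four rows)
dropped = marked.dropna(subset=["tempo"])

# Strategy 2: keep all rows and fill gaps with the mean tempo of the same genre
genre_mean = marked.groupby("track_genre")["tempo"].transform("mean")
imputed = marked.assign(tempo=marked["tempo"].fillna(genre_mean))

print(len(dropped), len(imputed))
print(imputed)
```

A genre with no known values at all (here "sleep") cannot be filled this way and stays missing, so some fallback is still needed.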

Below are the cleaning steps we plan to take and the missing or placeholder values we have identified:

- Remove duplicate rows (same artist, same song, different genre or album)
  - These will have different track IDs
- Replace missing values
- Remove the "Unnamed: 0" column (which is just the row number)

- Missing values:
  - Explicit = unknown
  - Key = -1

- Time signatures < 3 or > 7
  - A time signature of 0 usually means the "sleep" genre

### Note for project members
**Warning**: `inplace=True` will modify the original DataFrame. For example, if you call `drop_duplicates(inplace=True)` on `spotify`, the original `spotify` DataFrame will no longer contain the duplicates.

`drop_duplicates` has a `subset` argument. It will consider two rows duplicates if they have the same values for `subset`.
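
A short sketch of both points, using a hypothetical three-row DataFrame `toy`:

```python
import pandas as pd

# Hypothetical toy frame: the first two rows share artist and track name
toy = pd.DataFrame({
    "artists": ["A", "A", "B"],
    "track_name": ["Song", "Song", "Song"],
    "track_genre": ["rock", "pop", "rock"],
})

# Without inplace: a new DataFrame is returned and `toy` keeps all 3 rows
deduped = toy.drop_duplicates(subset=["artists", "track_name"], keep="first")
print(len(toy), len(deduped))

# With inplace=True: the DataFrame itself is modified and the call returns None
toy_inplace = toy.copy()
result = toy_inplace.drop_duplicates(subset=["artists", "track_name"], inplace=True)
print(result, len(toy_inplace))
```

Because `subset` ignores `track_genre`, the "pop" row is dropped even though it differs from the "rock" row in that column.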

In [None]:
# Remove duplicate rows (the same song by same artist under different genre or album)
spotify_new = spotify.drop_duplicates(subset=['artists', 'track_name'], keep='first')
spotify_new.shape

In [None]:
# Dropping the Unnamed column (which is just the row index)
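# errors='ignore' lets this cell be re-run after the column has already been dropped
spotify = spotify.drop(columns=['Unnamed: 0'], errors='ignore')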
spotify

### `impute_feature()` function

This function handles missing values and placeholders in the data.

Because the Spotify dataset has many "placeholder" values rather than NaN or real missing data, we found it helpful to specify a `placeholder_value` which will be treated as a NaN value. 

In [None]:
def impute_feature(data, feature, group, impute_method="average", placeholder_value=0):
  '''
  Imputes missing or placeholder values in a specified feature column based on the given impute method.

  Parameters:
  - data (pandas.DataFrame): The DataFrame containing the data to impute. It is not modified.
  - feature (str): The name of the column where missing or placeholder values should be imputed.
  - group (str or list of str): Column(s) by which to group the data before applying the imputation.
  - impute_method (str, optional): The method used to impute the missing values. Defaults to "average".
    "average" performs forward and backward fills within each group, then takes the average of both.
    "placeholder" first treats placeholder_value as missing, then imputes in the same way.
  - placeholder_value (numeric, optional): The placeholder value (like 0) that should be treated as missing, used
    when impute_method is "placeholder". Defaults to 0.

  Returns:
  - pandas.Series: A Series with the imputed values for the specified feature.

  Raises:
  - ValueError: If an unsupported impute method is provided.
  '''
  # Change the impute method argument to lowercase before comparing it
  impute_method = impute_method.lower()
  if impute_method not in ["average", "placeholder"]:
      raise ValueError("Invalid impute_method")

  # Work on a copy so that the caller's DataFrame is left untouched
  data = data.copy()
  if impute_method == "placeholder":
      # Replace placeholder values with NaN
      data[feature] = data[feature].replace(placeholder_value, np.nan)

  # Create 2 temp variables equal to feature
  data = data.assign(imputed_feature_prev=data[feature], imputed_feature_next=data[feature])
  # Fill first var with forward fill
  data["imputed_feature_prev"] = data.groupby(group)["imputed_feature_prev"].ffill()
  # Fill second var with backward fill
  data["imputed_feature_next"] = data.groupby(group)["imputed_feature_next"].bfill()
  # Define feature_imputed column to be mean of the forward and backward fill
  data["feature_imputed"] = data[["imputed_feature_next", "imputed_feature_prev"]].mean(axis=1, skipna=True)
  # Impute remaining missing values with 0
  data["feature_imputed"] = data["feature_imputed"].fillna(0)
  return data["feature_imputed"]

In [None]:
# Test impute_feature and placeholder_value argument
spotify["imputed_valence"] = impute_feature(spotify, 
                                            feature = "valence", 
                                            group = "track_genre",
                                            impute_method = "placeholder", 
                                            placeholder_value = 0.6190)

spotify[["track_name", "popularity", "danceability", "energy", "valence", "imputed_valence"]].sample(20, random_state=1259)
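
To see what the forward/backward-fill average does, here is a small standalone sketch of the same fill logic on a hypothetical DataFrame `toy`. It does not call `impute_feature` itself, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap inside "rock" and one at the start of "pop"
toy = pd.DataFrame({
    "track_genre": ["rock", "rock", "rock", "pop", "pop"],
    "valence": [0.2, np.nan, 0.6, np.nan, 0.8],
})

grouped = toy.groupby("track_genre")["valence"]
prev_fill = grouped.ffill().rename("prev")  # last known value within the genre
next_fill = grouped.bfill().rename("next")  # next known value within the genre

# Average the two fills, skipping whichever side is still missing
averaged = pd.concat([prev_fill, next_fill], axis=1).mean(axis=1, skipna=True)
print(averaged.tolist())
```

The gap inside "rock" becomes the mean of its two neighbours (0.4), while the gap at the start of "pop" has no previous value in its genre and simply takes the next one (0.8). Note that the result depends on row order within each group.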