# Group 10 - Project Phase 1
This notebook is the work and submission of Group 10 of CSMODEL Section S16. The group's members consist of:
* David, Peter Jan B.
* De Guzman, Evan Mari B.
* Manaois, Kyla Nicole G.
* Wangkay, Laurize Jeante G.


## Brief Description of the Dataset
The dataset used for this study is the Spotify Top-2000s Mega Dataset. It contains the Top 2000 songs on Spotify, released from 1956 to 2019. The dataset was acquired from Kaggle.com and was published by the user Sumat Singh.

### Data Collection Process
To collect the data, the creator of this dataset used a third-party website, http://sortyourmusic.playlistmachinery.com/, which was created by Paul Lamere. The website uses the Spotify API to extract the data for each song.

### Structure of the Dataset
The structure of the dataset is simple. There are 1994 observations (rows) and 15 variables (columns). Each row holds the full set of information for one song, from its title and artist to its popularity. Each column represents a variable, described in the next section. The only column with no significance is the index column that comes with the dataset, since the index does not rank the songs in any way.

## Variables in the Dataset 
- **`Title`**: Name of the track.
- **`Artist`**: Name of the artist.
- **`Top Genre`**: Genre that the track belongs to.
- **`Year`**: Release year of the track.
- **`Beats Per Minute (BPM)`**: Tempo of the song.
- **`Energy`**: Energy of the song. A higher value means a more energetic song.
- **`Danceability`**: Danceability of the song. A higher value means the song is easier to dance to.
- **`Loudness (dB)`**: Loudness of the song in decibels. A higher value means a louder song.
- **`Valence`**: The positivity of a song. A higher value means a more positive mood for the song.
- **`Length (Duration)`**: The duration of the song in seconds.
- **`Acousticness`**: The acoustic value of the song. A higher value means the song is more acoustic, that is, made less electronically.
- **`Speechiness`**: The presence of spoken words in the song. A higher value means the song has more spoken words.
- **`Popularity`**: The popularity of a song. A higher value pertains to a more popular song.

## Data Cleaning


### Importing Libraries
This section of the notebook focuses on cleaning the dataset. For this purpose, the `numpy` and `pandas` libraries are imported.

In [None]:
import numpy as np
import pandas as pd

Next, we load the dataset and view the first few rows using the `head()` function.

In [None]:
spotify_df = pd.read_csv("Spotify-2000.csv")
spotify_df.head()

We then view the general dataset information using the `info()` function.

In [None]:
spotify_df.info()

The output of `info()` shows that there are 1994 observations and 15 columns, and that every column has exactly 1994 non-null entries. To double-check, the following code tests each column for missing values.

In [None]:
spotify_df.isnull().any()

To further check the correctness of the data, the `Title` and `Artist` values should be capitalized. To verify this, the group wrote a function that returns a boolean indicating whether a string is in title case, that is, whether it is equal to the result of `str.title()`.

In [None]:
def is_capitalized(s):
    return s == s.title()

is_title = spotify_df['Title'].apply(is_capitalized)
is_artist = spotify_df['Artist'].apply(is_capitalized)

capital_check = pd.DataFrame(
    {
    'Title' : is_title,
    'Artist' : is_artist
    }
)
capital_check
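
One caveat with this check is that `str.title()` capitalizes the letter after every non-letter character, so apostrophes and slashes can make a correctly written title fail. The short sketch below uses made-up strings, not rows from the dataset, to show this behaviour and how the boolean results could be summarized.

```python
import pandas as pd

# Hypothetical sample values, used only to illustrate str.title() behaviour.
samples = pd.Series(["Hallelujah", "Don't Stop Me Now", "AC/DC", "the scientist"])

def is_capitalized(s):
    # True only when the string already equals its title-cased form.
    return s == s.title()

checks = samples.apply(is_capitalized)
print(checks.tolist())

# Number of strings that fail the check.
n_failed = int((~checks).sum())
print(n_failed)
```

Only the first string passes: `"Don't Stop Me Now".title()` gives `"Don'T Stop Me Now"` and `"AC/DC".title()` gives `"Ac/Dc"`, so a failed check does not necessarily mean the value is wrong.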

### Uniqueness of Values

Another possible problem concerns the uniqueness of certain values. For example, the title "Hallelujah" appears in 3 different rows. Looking at `Title` alone, these rows might be treated as duplicates to be cleaned. However, they differ in other variables such as `Artist`, `Top Genre`, BPM, and Valence, so rows that share a title are considered unique observations and no cleaning is applied to them.

In [None]:
spotify_df.loc[(spotify_df['Title'] == 'Hallelujah')]
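
The distinction between a repeated title and a true duplicate row can be checked programmatically. The following is a minimal sketch on made-up rows (the artist names are placeholders): `value_counts()` finds titles that occur more than once, while `duplicated()` on the `Title` and `Artist` pair counts rows that repeat an earlier song by the same artist.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset.
toy_df = pd.DataFrame({
    "Title": ["Hallelujah", "Hallelujah", "Hallelujah", "Iris"],
    "Artist": ["Artist A", "Artist B", "Artist C", "Artist D"],
})

# Titles that occur more than once.
title_counts = toy_df["Title"].value_counts()
repeated_titles = title_counts[title_counts > 1]
print(repeated_titles)

# Rows that repeat an earlier (Title, Artist) pair would be true duplicates.
true_duplicates = int(toy_df.duplicated(subset=["Title", "Artist"]).sum())
print(true_duplicates)
```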

### The Index Column

The handling of this column was a point of contention within the group. The dataset is, by all means, clean: all titles and artists have the correct data type and follow one consistent format. The only problem noticed in this dataset is an artificial number column called `Index`.

The elements under `Index` are simply numbers that carry no real meaning or information. A song with Index 1 is not the top 1 song with respect to any variable. For this reason, the group decided to remove the `Index` column. It is simply an incrementing artificial number, and it is also redundant because data frames already have a built-in index that identifies each observation. The column is dropped using the `drop()` function.

In [None]:
spotify_df = spotify_df.drop(['Index'], axis = 1)

Effectively, there are now only 14 columns or variables of interest.

### The Top Genres

The `Top Genre` column is a difficult one. Some genres are broad categories, but the dataset records each genre as specifically as possible. Pre-processing this column therefore requires the group to bin and map nearly every unique entry to a general genre.

First, the group gets the number of unique values in `Top Genre` using the `unique()` function and the `size` attribute.

In [None]:
spotify_df['Top Genre'].unique().size

The code above returns 149 unique values in `Top Genre`. This is too many categories for meaningful relationships or conclusions to be drawn at this level of specificity. Thus, the group creates a function that checks whether the string in `Top Genre` contains a certain keyword and, if so, maps it to the corresponding general genre.

In [None]:
def generalized_genre(genre):
    if 'metal' in genre:
        return 'metal'
    elif 'pop' in genre:
        return 'pop'
    elif 'rock' in genre:
        return 'rock'
    elif 'hip hop' in genre:
        return 'hip hop'
    elif 'adult' in genre:
        return 'adult'
    elif 'indie' in genre:
        return  'indie'
    # The keywords must be separate list items; chaining them with `or` keeps only 'soul'.
    elif any(sub_genre in genre for sub_genre in ['soul', 'blues', 'funk', 'disco', 'reggae']):
        return 'R&B'
    elif 'folk' in genre:
        return 'folk'
    elif 'british invasion' in genre:
        return 'rock'
    elif 'country' in genre:
        return 'country'
    else:
        return 'other'
    
spotify_df['General Genre'] = spotify_df['Top Genre'].apply(generalized_genre)
spotify_df.loc[36]


In this example, the song "Iris" by The Goo Goo Dolls is listed as "alternate rock". After applying the function, the row now has a `General Genre` value of "rock".
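
The same mapping idea can also be written as an ordered keyword table, which keeps the priority of the rules visible in one place. This is only a sketch of an alternative layout: the keyword order follows the intent of the function described above, while the helper name `generalize_with_table` and the sample genre strings are illustrative.

```python
# Ordered (keyword, general genre) pairs; the first match wins.
GENRE_RULES = [
    ("metal", "metal"),
    ("pop", "pop"),
    ("rock", "rock"),
    ("hip hop", "hip hop"),
    ("adult", "adult"),
    ("indie", "indie"),
    ("soul", "R&B"),
    ("blues", "R&B"),
    ("funk", "R&B"),
    ("disco", "R&B"),
    ("reggae", "R&B"),
    ("folk", "folk"),
    ("british invasion", "rock"),
    ("country", "country"),
]

def generalize_with_table(genre):
    # Return the label of the first keyword found in the genre string.
    for keyword, label in GENRE_RULES:
        if keyword in genre:
            return label
    return "other"

# Illustrative genre strings.
examples = ["alternative rock", "dutch pop", "british blues", "finnish metal", "latin jazz"]
generalized = [generalize_with_table(g) for g in examples]
print(generalized)
```

Because the first match wins, the order of the table matters: a genre string containing both "pop" and "rock" would be labelled "pop".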

### Feature Engineering

The dataset is practically clean apart from the redundant `Index` column. Since little data cleaning is needed, the group instead performs feature engineering, creating new variables that can be of value for further analysis and study.

First, we create a mood category called `Affective Mood` from the `Valence` and `Energy` variables. There are three moods: Happy (both values above 50), Calm (Valence above 50 and Energy at most 50), and Sad (all other cases).

In [None]:
def mood_category(row):
    if row['Valence'] > 50 and row['Energy'] > 50:
        return 'Happy'
    elif row['Valence'] > 50 and row['Energy'] <= 50:
        return 'Calm'
    else:
        return 'Sad'

spotify_df['Affective Mood'] = spotify_df.apply(mood_category, axis = 1)
spotify_df
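
For larger data frames, the same three-way rule can be expressed without a row-wise `apply`. The sketch below uses `np.select` on made-up values and is meant as an equivalent alternative under the same thresholds, not as a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical Valence/Energy values on the same 0-100 scale.
toy_df = pd.DataFrame({"Valence": [80, 70, 30, 50], "Energy": [90, 40, 85, 50]})

high_valence = toy_df["Valence"] > 50
high_energy = toy_df["Energy"] > 50

# Conditions are checked in order; rows matching none fall back to the default.
toy_df["Affective Mood"] = np.select(
    [high_valence & high_energy, high_valence & ~high_energy],
    ["Happy", "Calm"],
    default="Sad",
)
print(toy_df)
```

Note that a value of exactly 50 does not count as high, so the last row is labelled Sad.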

Another feature that we can add is the decade in which the song was released. For this, we derive a `Decade` variable from the `Year` variable.

In [None]:
spotify_df['Decade'] = (spotify_df['Year'] // 10) * 10
spotify_df

The data frame now has a column showing the decade in which each song was released.
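
The arithmetic behind the `Decade` column is integer (floor) division by 10 followed by multiplication by 10, which drops the last digit of the year. A small worked example with a few sample years:

```python
import pandas as pd

# Sample years, including the earliest and latest years in the dataset's range.
years = pd.Series([1956, 1969, 1970, 2019])

# Floor-divide by 10 to drop the last digit, then scale back up.
decades = (years // 10) * 10
print(decades.tolist())  # [1950, 1960, 1970, 2010]
```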

## Exploratory Data Analysis

In this portion, we perform an exploratory data analysis to gain a comprehensive understanding of the Spotify dataset. This helps in formulating the research question of the project.



### Importing more libraries

Exploratory data analysis requires one more Python library for plotting, namely `matplotlib`.

In [None]:
import matplotlib.pyplot as plt

### Question 1 : What are the mean, median, mode, standard deviation, and variance for each numeric attribute (BPM, Energy, Danceability, Loudness, Valence, Length, Acousticness, Speechiness, Popularity)? 

To answer this question, we compute the requested statistics (mean, median, mode, standard deviation, and variance) for each attribute. Before computing them, we convert each column to a numeric type; any value that cannot be parsed as a number becomes `NaN` and is excluded from the calculations. The results are formatted to two decimal places.

In [None]:
numeric_columns = ['Beats Per Minute (BPM)', 'Energy', 'Danceability', 'Loudness (dB)', 'Valence', 'Length (Duration)', 'Acousticness', 'Speechiness', 'Popularity']

summary_stats = {'Attribute': [], 'Mean': [], 'Median': [], 'Mode': [], 'Standard Deviation': [], 'Variance': []}

for column in numeric_columns:
    if column in spotify_df.columns:
        spotify_df[column] = pd.to_numeric(spotify_df[column], errors='coerce')

        summary_stats['Attribute'].append(column)
        summary_stats['Mean'].append(spotify_df[column].mean())
        summary_stats['Median'].append(spotify_df[column].median())
        summary_stats['Mode'].append(spotify_df[column].mode().dropna().values[0] if not spotify_df[column].mode().dropna().empty else None)
        summary_stats['Standard Deviation'].append(spotify_df[column].std())
        summary_stats['Variance'].append(spotify_df[column].var())
    else:
        print(f"Column {column} not found in the DataFrame")

summary_df = pd.DataFrame(summary_stats)

summary_df['Mean'] = summary_df['Mean'].map("{:.2f}".format)
summary_df['Median'] = summary_df['Median'].map("{:.2f}".format)
summary_df['Mode'] = summary_df['Mode'].map("{:.2f}".format)
summary_df['Standard Deviation'] = summary_df['Standard Deviation'].map("{:.2f}".format)
summary_df['Variance'] = summary_df['Variance'].map("{:.2f}".format)

summary_df
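
The effect of `errors='coerce'` is easiest to see on a small example. The values below are made up; the assumption is that some entries may be stored as text with a thousands separator, which `pd.to_numeric` cannot parse and therefore turns into `NaN`. Removing the separator first would keep such values instead of dropping them.

```python
import pandas as pd

# Hypothetical raw values; "1,412" stands in for a number stored as text.
raw = pd.Series(["215", "301", "1,412"])

# Unparseable entries become NaN instead of raising an error.
coerced = pd.to_numeric(raw, errors="coerce")
n_missing = int(coerced.isna().sum())
print(n_missing)

# Stripping the thousands separator first keeps the value.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# agg() computes several statistics in one call.
print(cleaned.agg(["mean", "median", "std", "var"]))
```

Checking the number of `NaN` values after coercion is a quick way to see how many observations a statistic silently leaves out.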

We can examine the distribution of the data using histograms. Histograms visualize the distribution of a single variable, providing insights into the shape of the distribution (e.g., skewness). The code below generates a histogram for each attribute, giving a detailed view of how the values are distributed.

In [None]:
fig, axs = plt.subplots(nrows=len(numeric_columns), ncols=1, figsize=(14, 2 * len(numeric_columns)))
fig.tight_layout(pad=3.0)

for i, column in enumerate(numeric_columns):
    if column in spotify_df.columns:
        spotify_df[column] = pd.to_numeric(spotify_df[column], errors='coerce')
      
        axs[i].hist(spotify_df[column], bins=10, edgecolor='black', alpha=0.7)
        axs[i].set_title(f'Histogram of {column}')
        axs[i].set_xlabel(column)
        axs[i].set_ylabel('Frequency')

    else:
        print(f"Column {column} not found in the DataFrame")

plt.show()

The visualizations above show the different behaviours and characteristics of the data, including their central tendencies. For example, the histogram of `Length (Duration)` reveals a left-skewed distribution, while the histogram of `Danceability` is the closest to a normal distribution.
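
The shape read from a histogram can be cross-checked numerically with `Series.skew()`. In the sketch below, which uses made-up durations rather than the dataset, a positive value indicates a long right tail (right-skewed) and a negative value a long left tail (left-skewed). Comparing the mean with the median gives a similar hint.

```python
import pandas as pd

# Hypothetical durations in seconds with a few very long tracks.
durations = pd.Series([180, 200, 210, 220, 230, 240, 250, 600, 900])

# Sample skewness: positive for a right tail, negative for a left tail.
skewness = durations.skew()
print(round(skewness, 2))

# In a right-skewed sample the mean is usually pulled above the median.
print(durations.mean() > durations.median())
```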

### Question 2: Which genres are most prevalent in this dataset?

For the numerical summary, we calculate the frequency of each genre.

In [None]:
genre_counts = spotify_df['Top Genre'].value_counts()

top_genres = genre_counts.head(10)
print(top_genres)

For the visualization, a bar chart is used to display the most prevalent genres.

In [None]:
plt.figure(figsize=(12, 6))
top_genres.plot(kind='bar')
plt.title('Top 10 Most Prevalent Genres')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.show()

The frequency count shows which genres are most common in the dataset, with Album Rock being the most common genre in the dataset. The bar chart provides a visual representation, making it easier to compare the prevalence of different genres at a glance.

Given the specificity of the `Top Genre` column, the same analysis is repeated on the `General Genre` column to see whether the broader categories are more useful for later analysis.

In [None]:
gen_genre_counts = spotify_df['General Genre'].value_counts()

gen_genres = gen_genre_counts.head(10)
print(gen_genres)

plt.figure(figsize=(12, 6))
gen_genres.plot(kind='bar')
plt.title('Top 10 Most Prevalent General Genres')
plt.xlabel('General Genre')
plt.ylabel('Count')
plt.show()

### Question 3: What is the average song length `Length (Duration)` by genre?

For the Numerical Summary, we will calculate the average song duration for each genre.

In [None]:
average_duration_genre = spotify_df.groupby('Top Genre')['Length (Duration)'].mean().sort_values(ascending=False)

print(average_duration_genre)

For the Visualization, a horizontal bar chart will be used to show the average song length of each genre.

In [None]:
plt.figure(figsize=(12, 42))
plt.barh(average_duration_genre.index, average_duration_genre.values)
plt.title('Average Song Duration by Top Genre')
plt.xlabel('Average Duration (seconds)')
plt.ylabel('Genre')
plt.show()

The average song length for each genre provides insight into how song durations vary across genres. The bar chart visualizes these averages, highlighting genres with notably longer or shorter songs on average.

### Question 4: What genres tend to have longer song durations?
For the Numerical Summary, we refine the previous result by keeping only the top 10 genres with the longest average song durations. This makes it easy to identify which genres have the highest average durations.

In [None]:
average_duration_genre = spotify_df.groupby('Top Genre')['Length (Duration)'].mean().sort_values(ascending=False)

# Keep the 10 genres with the longest average duration, as described in the text.
top_average_duration_genre = average_duration_genre.head(10)
print(top_average_duration_genre)

For the Visualization, we will use a horizontal bar chart to compare the average song lengths of these genres.

In [None]:
plt.figure(figsize=(12, 6))
plt.barh(top_average_duration_genre.index, top_average_duration_genre.values)
plt.title('Average Song Duration by Top Genre')
plt.xlabel('Average Duration (seconds)')
plt.ylabel('Genre')
plt.show()

The averages highlight which genres have longer song durations, with Finnish Metal having the longest. The horizontal bar chart displays the average song length for the top 10 genres with the longest average durations.
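
Because very specific genres may contain only a handful of songs, an average can be driven by one or two tracks. One way to guard against this, sketched below on made-up data, is to report the song count next to the mean and keep only genres above a minimum count. The threshold `min_songs` is an arbitrary illustrative choice.

```python
import pandas as pd

# Hypothetical songs; one genre has only a single, very long track.
toy_df = pd.DataFrame({
    "Top Genre": ["album rock", "album rock", "album rock",
                  "finnish metal", "dutch pop", "dutch pop"],
    "Length (Duration)": [250, 270, 260, 500, 200, 220],
})

# Mean duration and number of songs per genre, longest first.
duration_by_genre = (
    toy_df.groupby("Top Genre")["Length (Duration)"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(duration_by_genre)

min_songs = 2  # illustrative threshold, not from the original analysis
reliable = duration_by_genre[duration_by_genre["count"] >= min_songs]
print(reliable)
```

In this toy example the genre with the longest average is backed by a single song and drops out once the threshold is applied.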

To provide broader, more consistent categories that can simplify analysis and visualization, and make the results clearer, we can use the  `General Genre` column.

In [None]:
average_duration_gen_genre = spotify_df.groupby('General Genre')['Length (Duration)'].mean().sort_values(ascending=False)

top_gen_genres_duration = average_duration_gen_genre.head(10)
print(top_gen_genres_duration)

plt.figure(figsize=(12, 6))
plt.barh(top_gen_genres_duration.index, top_gen_genres_duration.values)
plt.title('Top 10 General Genres by Average Song Duration')
plt.xlabel('Average Duration (seconds)')
plt.ylabel('Genre')
plt.show()

### Question 5: Which `Top Genre` has the highest average `Popularity`? 

For the Numerical Summary, we will compute the average popularity for each genre and identify the one with the highest average.

In [None]:
average_popularity_genre = spotify_df.groupby('Top Genre')['Popularity'].mean().sort_values(ascending=False)

top_genres_popularity = average_popularity_genre.head(10)
print(top_genres_popularity)

For the Visualization, we will use a bar chart to display the average popularity for the top genres.

In [None]:
plt.figure(figsize=(12, 6))
top_genres_popularity.plot(kind='bar')
plt.title('Average Popularity by Genre')
plt.xlabel('Genre')
plt.ylabel('Average Popularity')
plt.xticks(rotation=90)
plt.show()

The average popularity for each genre reveals which genres tend to have more popular songs. Celtic Punk and Indie Pop stand out as the top genres in terms of average popularity. The bar chart visually compares these averages, making it easy to identify the genre with the highest average popularity.

Similar to previous questions, we can analyze the `General Genre` column to further simplify analysis and visualization.

In [None]:
average_popularity_gen_genre = spotify_df.groupby('General Genre')['Popularity'].mean().sort_values(ascending=False)

top_gen_genres_popularity = average_popularity_gen_genre.head(10)

plt.figure(figsize=(12, 6))
top_gen_genres_popularity.plot(kind='bar')
plt.title('Top 10 General Genres by Average Popularity')
plt.xlabel('Genre')
plt.ylabel('Average Popularity')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

### Question 6: How do the average values of numeric features change over the years (Year-wise trends)?

To analyze year-wise trends, we first extract the unique years from the dataset and map each one to a year-group label. We then select the numeric variables for comparison and coerce any non-numeric values to `NaN` so that they are excluded from the analysis. Finally, we compute the mean of each variable per year group.

In [None]:
unique_years = spotify_df['Year'].unique()
year_mapping = {year: f'{year} Group' for year in unique_years}
spotify_df['Year Group'] = spotify_df['Year'].map(year_mapping)

numeric_columns = ['Beats Per Minute (BPM)', 'Energy', 'Danceability', 'Loudness (dB)', 'Valence', 'Length (Duration)', 'Acousticness', 'Speechiness', 'Popularity']
spotify_df[numeric_columns] = spotify_df[numeric_columns].apply(pd.to_numeric, errors='coerce')

yearly_mean = spotify_df.groupby('Year Group')[numeric_columns].mean()

To visualize the yearly means of the numeric variables, we use a line plot that shows the trend of each variable over the years.

In [None]:
plt.figure(figsize=(14, 10))  
for column in yearly_mean.columns:
    plt.plot(yearly_mean.index, yearly_mean[column], marker='o', label=column)

plt.title('Year-wise Trends of Numeric Features')
plt.xlabel('Year')
plt.ylabel('Average Value')
plt.legend()
plt.grid(True)
plt.xticks(yearly_mean.index, rotation=45, ha='right')  #for readability
plt.tight_layout()
plt.show()
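
Since the features are measured on different scales (for example, duration in seconds versus loudness in decibels), plotting them on one axis can flatten the smaller ones. A possible remedy, sketched on made-up yearly means, is to min-max scale each column to the range 0 to 1 before plotting.

```python
import pandas as pd

# Hypothetical yearly means for two features on very different scales.
yearly_mean = pd.DataFrame(
    {"Length (Duration)": [200.0, 250.0, 300.0], "Loudness (dB)": [-12.0, -9.0, -6.0]},
    index=[1990, 2000, 2010],
)

# Min-max scaling per column: (x - min) / (max - min).
scaled = (yearly_mean - yearly_mean.min()) / (yearly_mean.max() - yearly_mean.min())
print(scaled)
```

After scaling, each line shows the relative movement of a feature over time rather than its absolute value, so the original units are no longer readable from the plot.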

# Research Question

At this point, the group has finished the exploratory data analysis. Given the questions asked and the subsequent analysis, the group has decided on the following single research question moving forward.

``How do attributes such as 'Trend (by year)', 'Top Genre,' and 'Beats per Minute (BPM)' of songs relate to their 'Popularity', and can these relationships be used to predict the popularity of new songs using clustering methods?``

## Significance

Understanding the factors that contribute to a song's reach can have significant effects on the industry of music. This research can provide insights into what makes a song successful by examining the relationship between song qualities (such as the trend over years, genre, and BPM) and their popularity. When releasing and promoting new music, these insights may help musicians, producers, and record companies in making data-driven decisions.

Moreover, the study can support predictive models for the success of new releases by using clustering techniques to find patterns and group related music. This can be especially helpful in a market that is dynamic and fiercely competitive, where keeping up with trends is essential for business success.

# Data Modelling

In this section of the notebook, the group performs statistical data modelling using clustering. This will help the group answer the research question ``How do attributes such as 'Trend (by year)', 'Top Genre,' and 'Beats per Minute (BPM)' of songs relate to their 'Popularity', and can these relationships be used to predict the popularity of new songs using clustering methods?``

Firstly, the necessary external python file must be imported to do the chosen data modelling technique.

In [None]:
from kmeans import KMeans

The code above imports the `KMeans` class from the `kmeans.py` file. Next, an object must be instantiated in order to use the functions within `KMeans` to perform the k-means clustering algorithm.

In [None]:
kmeans = KMeans(3, 0, 2, 300, what df)
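
The contents of `kmeans.py` are not shown in this notebook, so the following is not that class. It is a minimal, self-contained NumPy sketch of the standard k-means procedure (assign each point to its nearest centroid, then move each centroid to the mean of its points), run on made-up two-dimensional data. The function name `simple_kmeans` and its parameters are illustrative.

```python
import numpy as np

def simple_kmeans(points, k, max_iters=300, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Distance of every point to every centroid, shape (n, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated hypothetical groups of points.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = simple_kmeans(points, k=2)
print(labels)
print(centroids)
```

Which cluster receives label 0 depends on the random starting centroids, so only the grouping of the points is meaningful, not the label numbers themselves.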

# Statistical Inference

Given all the data presented, the group will now move on to statistical inference. This part of the notebook focuses on hypothesis testing and inference to reach a conclusion on the research question. The hypotheses of the group are as follows:

*H<sub>0</sub>* = There is no relationship between popularity and other attributes such as Trend, Top Genre and Beats per Minute <br>
*H<sub>A</sub>* = There is a relationship between popularity and other attributes such as Trend, Top Genre and Beats per Minute
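
Whether a numeric attribute such as BPM is related to Popularity can be checked with a permutation test on the correlation coefficient. The popularity values are shuffled many times to simulate what the correlation looks like when there is no relationship, and the observed correlation is compared against that distribution. The data below are synthetic, and the choice of test is an illustration rather than the group's selected method.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data with a built-in positive relationship.
bpm = rng.normal(120, 25, size=200)
popularity = 0.3 * bpm + rng.normal(0, 10, size=200)

# Pearson correlation observed in the (synthetic) sample.
observed_r = np.corrcoef(bpm, popularity)[0, 1]

# Shuffle popularity to break any link with bpm, then recompute r.
n_permutations = 2000
permuted_r = np.array([
    np.corrcoef(bpm, rng.permutation(popularity))[0, 1]
    for _ in range(n_permutations)
])

# Two-sided p-value: share of shuffles at least as extreme as observed.
p_value = (np.sum(np.abs(permuted_r) >= abs(observed_r)) + 1) / (n_permutations + 1)
print(f"r = {observed_r:.3f}, p = {p_value:.4f}")
```

A small p-value is evidence against the hypothesis of no relationship. For a categorical attribute such as genre, a different test that compares groups would be needed.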

# Insights and Conclusions