# Spotify's Weekly Top 200 Songs Streaming Data Report
By Li Liang, Tracy Bui, and Jonathan Ma

### Introduction
Spotify has grown into one of the largest audio streaming and media service providers, with over 456 million monthly active users as of September 2022. For this final project, we will attempt to analyze and relate streaming data to see which artists, genres, and other music-related factors are the most popular in different countries around the world. From this analysis, we will build models to understand whether any of the countries have similar taste in music. Because the dataset is large and has to fit in RAM, it will be read in batches of 100 rows.

### Imports
The imports used for this project are below.

In [2]:
import pandas as pd
import numpy as np

### The Data
The dataset we are using is Spotify's <a href="https://www.kaggle.com/datasets/yelexa/spotify200?resource=download" target="_blank">"Weekly Top Songs" Streaming Data</a>. It contains the songs from the Spotify charts labelled "Weekly Top Songs" for each country, for the weeks from 02/04/2021 to 07/14/2022.

The 36 columns given in this dataset are the uri, rank, artist names, artists number, artist individual, artist id, artist genre, artist image, track name, release date, album number tracks, album cover, source, peak rank, previous rank, weeks on chart, streams, week, danceability, energy, key, mode, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration, country, region, language, and pivot. Full explanation of each of these columns can be found through the hyperlink above.

To download the data yourself, click on this <a href="https://drive.google.com/file/d/1DYm1ApyXTEU8EkhaMHeiQEsEE1shfiln/view?usp=sharing" target="_blank">Google link</a> to download a zipped copy of the data. Once unzipped, store the file in the same directory as the notebook so that it can be read by the cells below.

Warning: According to the page where we got the dataset, the unzipped `final.csv` file is about 835.55 MB.

Below is a sample of what the data looks like and what it contains.

In [3]:
spotify_dataset = pd.read_csv('final.csv', dtype='unicode', nrows=5)
spotify_dataset.head(5)

FileNotFoundError: [Errno 2] No such file or directory: 'final.csv'

### Problems To Tackle
Since the data is currently one large list of songs, we need several steps to get the information we are looking for. First, we will categorize the songs by country. Then we will find the average value of each parameter to see what is most popular in each country. Finally, we will look at the top values for each country and find similarities between countries, if there are any. These steps are summarized in the list below.

1. Categorize the Spotify song list by country
2. Find the average values in each country and model them
3. Find the top values in each country and model them
4. Relate any similarities between the countries

### Loading The Data In Chunks
The dataset is quite large, so we will load it into the initial Spotify dataframe in chunks of 100 rows per iteration.

Because of the size of the data, loading may take about 1 minute.

In [4]:
# %%time
spotify_chunks = pd.read_csv('final.csv', iterator=True, chunksize=100)
spotify_df = pd.concat(spotify_chunks, ignore_index=True)

print("Total rows in Spotify dataset: ", len(spotify_df))

spotify_df.head(5)

# CPU times: user 55 s, sys: 4.44 s, total: 59.4 s
# Wall time: 59.5 s

Total rows in Spotify dataset:  1787999


Unnamed: 0.1,Unnamed: 0,uri,rank,artist_names,artists_num,artist_individual,artist_id,artist_genre,artist_img,collab,...,acousticness,instrumentalness,liveness,valence,tempo,duration,country,region,language,pivot
0,0,spotify:track:2gpQi3hbcUAcEG8m2dlgfB,1,Paulo Londra,1.0,Paulo Londra,spotify:artist:3vQ0GE3mI0dAaxIMYe5g7z,argentine hip hop,https://i.scdn.co/image/ab6761610000e5ebf796a9...,0,...,0.0495,0.0,0.0658,0.557,173.935,178203.0,Argentina,South America,Spanish,0
1,1,spotify:track:2x8oBuYaObjqHqgGuIUZ0b,2,WOS,1.0,WOS,spotify:artist:5YCc6xS5Gpj3EkaYGdjyNK,argentine indie,https://i.scdn.co/image/ab6761610000e5eb75e151...,0,...,0.724,0.0,0.134,0.262,81.956,183547.0,Argentina,South America,Spanish,0
2,2,spotify:track:2SJZdZ5DLtlRosJ2xHJJJa,3,Paulo Londra,1.0,Paulo Londra,spotify:artist:3vQ0GE3mI0dAaxIMYe5g7z,argentine hip hop,https://i.scdn.co/image/ab6761610000e5ebf796a9...,0,...,0.241,0.0,0.0929,0.216,137.915,204003.0,Argentina,South America,Spanish,0
3,3,spotify:track:1O2pcBJGej0pmH2Y9XZMs6,5,Cris Mj,1.0,Cris Mj,spotify:artist:1Yj5Xey7kTwvZla8sqdsdE,urbano chileno,https://i.scdn.co/image/ab6761610000e5eb8f4ebc...,0,...,0.0924,4.6e-05,0.0534,0.832,96.018,153750.0,Argentina,South America,Spanish,0
4,4,spotify:track:1TpZKxGnHp37ohJRszTSiq,6,Emilia,1.0,Emilia,spotify:artist:0AqlFI0tz2DsEoJlKSIiT9,pop argentino,https://i.scdn.co/image/ab6761610000e5ebaf96d1...,0,...,0.0811,6.3e-05,0.101,0.501,95.066,133895.0,Argentina,South America,Spanish,0
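
The chunked load above still parses and keeps every column. One way to lower the memory footprint further, sketched here under assumptions, is to pass `usecols` so that pandas only parses the columns of interest in each chunk. The in-memory CSV and its rows are made-up stand-ins for `final.csv`, and `wanted_columns` is a hypothetical name.

```python
import io

import pandas as pd

# Tiny made-up stand-in for final.csv (illustrative rows only)
sample_csv = io.StringIO(
    "uri,rank,track_name,streams,energy,country\n"
    "spotify:track:a,1,Song A,300,0.8,Argentina\n"
    "spotify:track:b,2,Song B,250,0.4,Argentina\n"
    "spotify:track:c,1,Song C,500,0.6,Germany\n"
)

# Hypothetical list of the columns we actually want to keep
wanted_columns = ["rank", "track_name", "streams", "energy", "country"]

# usecols makes pandas parse only the listed columns, chunk by chunk
chunks = pd.read_csv(sample_csv, usecols=wanted_columns, chunksize=2)
small_df = pd.concat(chunks, ignore_index=True)

print(small_df.shape)
print(list(small_df.columns))
```

With the real file, `sample_csv` would be replaced by `'final.csv'` and `chunksize` by 100; whether this is noticeably faster on the full dataset has not been measured here.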


### Altering the Data to Fit Our Needs
As seen above, this dataset has many parameters that we don't need, so we will modify it to include only the columns that we want to look at. The columns of significance to our analysis, which are the ones kept by the cell below, are `rank`, `artist_names`, `artists_num`, `artist_individual`, `artist_genre`, `track_name`, `streams`, `energy`, `loudness`, `speechiness`, `acousticness`, `instrumentalness`, `liveness`, `valence`, `tempo`, and `country`.

In [5]:
spotify_df = spotify_df.drop(columns=['Unnamed: 0', 'uri', 'artist_id', 'artist_img',
                                      'collab', 'release_date', 'album_num_tracks',
                                      'album_cover', 'source', 'peak_rank',
                                      'previous_rank', 'weeks_on_chart', 'week',
                                      'danceability', 'key', 'mode', 'duration',
                                      'region', 'language', 'pivot'])
spotify_df.head(5)

Unnamed: 0,rank,artist_names,artists_num,artist_individual,artist_genre,track_name,streams,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,country
0,1,Paulo Londra,1.0,Paulo Londra,argentine hip hop,Plan A,3003411,0.834,-4.875,0.0444,0.0495,0.0,0.0658,0.557,173.935,Argentina
1,2,WOS,1.0,WOS,argentine indie,ARRANCARMELO,2512175,0.354,-7.358,0.0738,0.724,0.0,0.134,0.262,81.956,Argentina
2,3,Paulo Londra,1.0,Paulo Londra,argentine hip hop,Chance,2408983,0.463,-9.483,0.0646,0.241,0.0,0.0929,0.216,137.915,Argentina
3,5,Cris Mj,1.0,Cris Mj,urbano chileno,Una Noche en Medellín,2080139,0.548,-5.253,0.077,0.0924,4.6e-05,0.0534,0.832,96.018,Argentina
4,6,Emilia,1.0,Emilia,pop argentino,cuatro veinte,1923270,0.696,-3.817,0.0505,0.0811,6.3e-05,0.101,0.501,95.066,Argentina


### Processing the Countries

With 74 countries in the dataset and about 200 songs on each country's "Top Weekly 200 Songs" chart every week, the total number of songs adds up quickly. We therefore focus our attention on five countries from different regions around the world.

The five countries that we chose are the `United States`, `Germany`, `Hong Kong`, `Mexico`, and `Portugal`.


In [40]:

countries = pd.unique(spotify_df['country'])
countries = np.delete(countries, np.where(countries == "country"))
number_countries = len(countries)

print("There are", number_countries, "countries in this dataset to process\n")
print(countries)


There are 74 countries in this dataset to process

['Argentina' 'Australia' 'Austria' 'Belarus' 'Belgium' 'Bolivia' 'Brazil'
 'Bulgaria' 'Canada' 'Chile' 'Colombia' 'Costa Rica' 'Cyprus'
 'Czech Republic' 'Denmark' 'Dominican Republic' 'Ecuador' 'Egypt'
 'El Salvador' 'Estonia' 'Finland' 'France' 'Germany' 'Global' 'Greece'
 'Guatemala' 'Honduras' 'Hong Kong' 'Hungary' 'Iceland' 'India'
 'Indonesia' 'Ireland' 'Israel' 'Italy' 'Japan' 'Kazakhstan' 'Korea'
 'Latvia' 'Lithuania' 'Luxembourg' 'Malaysia' 'Mexico' 'Morocco'
 'Netherlands' 'New Zealand' 'Nicaragua' 'Nigeria' 'Norway' 'Pakistan'
 'Panama' 'Paraguay' 'Peru' 'Philippines' 'Poland' 'Portugal' 'Romania'
 'Saudi Arabia' 'Singapore' 'Slovakia' 'South Africa' 'Spain' 'Sweden'
 'Switzerland' 'Taiwan' 'Thailand' 'Turkey' 'United Arab Emirates'
 'United Kingdom' 'Ukraine' 'Uruguay' 'United States' 'Venezuela'
 'Vietnam']
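
As a minimal sketch of narrowing the data to the five chosen countries, the snippet below filters a dataframe with `Series.isin`. The small `sample_df` and its stream counts are made up for illustration; with the real data the same filter would be applied to `spotify_df`.

```python
import pandas as pd

# Made-up miniature of spotify_df (illustrative values only)
sample_df = pd.DataFrame({
    "country": ["United States", "Germany", "Japan", "Mexico",
                "Portugal", "Hong Kong", "Brazil"],
    "streams": [900, 400, 300, 500, 100, 50, 700],
})

selected_countries = ["United States", "Germany", "Hong Kong", "Mexico", "Portugal"]

# Keep only the rows whose country is one of the five selected countries
selected_df = sample_df[sample_df["country"].isin(selected_countries)]

print(selected_df["country"].value_counts())
```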


#### Function categorizeCountry
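This function creates a dataframe for a single country by filtering on `country`, which is one of the columns from the original dataset. We will use these per-country dataframes to analyze the values in each country. Note that, as written, the function returns only the first five rows of each country's data (`head(5)`), which is convenient for a preview but is not the full subset.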

In [4]:
def categorizeCountry(country_name):
    newdf = spotify_df[spotify_df["country"] == country_name]  
    return newdf.head(5)

NameError: name 'spotify_df' is not defined

In [None]:
country1 = categorizeCountry("United States")
country1

In [1]:
country2 = categorizeCountry("Germany")
country2

NameError: name 'categorizeCountry' is not defined

In [49]:
country3 = categorizeCountry("Hong Kong")
country3

Unnamed: 0,rank,artist_names,artists_num,artist_individual,artist_genre,track_name,streams,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,country
754406,1,Keung To,1.0,Keung To,hk-pop,作品的說話,256348,0.389,-13.106,0.0638,0.4429999999999999,0.0,0.114,0.667,79.991,Hong Kong
754407,2,Keung To,1.0,Keung To,hk-pop,"Dear My Friend,",241638,0.452,-8.152000000000001,0.0307,0.723,0.0,0.1169999999999999,0.266,94.566,Hong Kong
754408,3,Keung To,1.0,Keung To,hk-pop,鏡中鏡,222231,0.667,-6.938,0.111,0.0399,3.4200000000000005e-05,0.104,0.223,147.037,Hong Kong
754409,4,Anson Lo 盧瀚霆,1.0,Anson Lo 盧瀚霆,hk-pop,Mr. Stranger,195537,0.718,-5.282,0.142,0.0461,0.0,0.249,0.526,81.905,Hong Kong
754410,5,Anson Lo 盧瀚霆,1.0,Anson Lo 盧瀚霆,hk-pop,Megahit,191245,0.7020000000000001,-3.554,0.0568,0.106,0.0,0.127,0.487,122.802,Hong Kong


In [2]:
country4 = categorizeCountry("Mexico")
country4

NameError: name 'categorizeCountry' is not defined

In [3]:
country5 = categorizeCountry("Portugal")
country5

NameError: name 'categorizeCountry' is not defined
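
Because `categorizeCountry` returns `head(5)`, any statistic computed from its result would be based on five rows. A possible variant, sketched below, returns the full subset by default and only truncates when a preview is requested. The name `categorize_country`, the `preview_rows` parameter, and the sample values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def categorize_country(df, country_name, preview_rows=None):
    """Return the rows of df for one country.

    If preview_rows is given, only that many leading rows are returned.
    """
    subset = df[df["country"] == country_name]
    return subset if preview_rows is None else subset.head(preview_rows)


# Made-up data standing in for spotify_df
sample_df = pd.DataFrame({
    "country": ["Germany", "Germany", "Germany", "Mexico"],
    "energy": [0.5, 0.7, 0.9, 0.6],
})

germany_full = categorize_country(sample_df, "Germany")
germany_preview = categorize_country(sample_df, "Germany", preview_rows=2)

print(len(germany_full), len(germany_preview))
```

Passing the dataframe as an argument, rather than reading a global `spotify_df`, also avoids a `NameError` when cells are run out of order.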

#### Function averageCountryValues
This function takes one country's dataframe and averages its columns, so that we can assess each country's average music interests. Only the numerical values from the dataset are averaged, not the strings.

In [4]:
def averageCountryValues(country_df):
    # Drop the columns that should not be averaged
    country_avg_df = country_df.drop(columns=['rank', 'artist_names', 'artists_num',
                                              'artist_individual', 'artist_genre',
                                              'track_name'])

    # Average each remaining numeric column
    mean_df = country_avg_df.mean(axis=0, numeric_only=True)

    # Return the Series itself; `return print(...)` would always return None
    return mean_df

In [5]:
averageCountryValues(country1)

NameError: name 'country1' is not defined
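
An alternative to averaging one country at a time is to compute all country averages in a single `groupby` call. This is only a sketch on made-up values; `feature_columns` is a hypothetical name, and the real column list would come from the trimmed `spotify_df`.

```python
import pandas as pd

# Made-up rows standing in for the trimmed spotify_df
sample_df = pd.DataFrame({
    "country": ["Germany", "Germany", "Mexico", "Mexico"],
    "track_name": ["a", "b", "c", "d"],
    "energy": [0.5, 0.7, 0.6, 0.8],
    "tempo": [100.0, 120.0, 90.0, 110.0],
})

# Hypothetical list of numeric feature columns to average
feature_columns = ["energy", "tempo"]

# One row per country, one column per averaged feature
country_means = sample_df.groupby("country")[feature_columns].mean()

print(country_means)
```

The result has one row per country, which makes it straightforward to compare countries side by side later on.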

### Using Different Characteristics to Recommend Songs
Each dataset contains energy, loudness, speechiness, acousticness, instrumentalness, etc. One way to recommend similar songs is to simply choose a single characteristic and recommend songs with a similar value. Then we can work our way up to more complicated combinations of characteristics to recommend songs.

#### We can try all characteristics one at a time at first; energy for example
The plan is a function that takes a song name as input and compares its energy level with all the songs in the dataset, returning a few that have a similar or even identical energy level.
Then we use the function to test other characteristics.

#### Start combining characteristics
At some point we will know whether a single characteristic is enough to decide whether two songs are similar. Our bet is that it is simply not enough, so we will have to test whether these characteristics can be combined to give better results.
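
A simple way to combine characteristics, offered here as one possible approach rather than the method this project settled on, is to standardize each feature and rank songs by Euclidean distance across all of them. The function name `similar_by_features` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def similar_by_features(df, track_name, features, n=3):
    """Return the n tracks closest to track_name across several features."""
    # Standardize so that tempo (tens to hundreds) does not dominate 0-1 features
    values = df[features].astype(float)
    standardized = (values - values.mean()) / values.std(ddof=0)

    # Standardized feature vector of the query song (first match is used)
    is_query = df["track_name"] == track_name
    target = standardized[is_query].iloc[0]

    # Euclidean distance from every song to the query song
    distances = np.sqrt(((standardized - target) ** 2).sum(axis=1))
    distances = distances[~is_query]

    nearest_index = distances.sort_values(kind="stable").index[:n]
    return df.loc[nearest_index]


# Made-up songs: B matches A on energy only, D matches A on both features
sample_df = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "energy": [0.80, 0.79, 0.50, 0.82],
    "tempo": [120.0, 60.0, 122.0, 118.0],
})

closest = similar_by_features(sample_df, "A", ["energy", "tempo"], n=1)
print(closest["track_name"].tolist())
```

In this toy data, song B is the nearest neighbour of A on energy alone, but D becomes the nearest once tempo is included, which illustrates why a single characteristic may not be enough.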
    
If we look through the data by eye, we can spot some patterns. For example, tempo and acousticness seem to be highly correlated.
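
An impression like this can be checked numerically with a Pearson correlation. The sketch below uses made-up values, so its result says nothing about the real dataset; with the real data, the two columns would be taken from `spotify_df`.

```python
import pandas as pd

# Made-up values for illustration only
sample_df = pd.DataFrame({
    "tempo": [80.0, 95.0, 120.0, 140.0, 170.0],
    "acousticness": [0.72, 0.44, 0.24, 0.09, 0.05],
})

# Pearson correlation between the two columns, a value between -1 and 1
correlation = sample_df["tempo"].corr(sample_df["acousticness"])
print(round(correlation, 3))

# Full pairwise correlation matrix for all numeric columns
print(sample_df.corr())
```

A value near 1 or -1 would support a strong linear relationship, while a value near 0 would suggest the visual impression does not hold up.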

So now that we chose acousticness and tempo as a

### Resources Used

#### Loading Initial Data
* https://towardsdatascience.com/%EF%B8%8F-load-the-same-csv-file-10x-times-faster-and-with-10x-less-memory-%EF%B8%8F-e93b485086c7
* https://stackoverflow.com/questions/45532711/pandas-read-csv-method-is-using-too-much-ram
