# Spotify's Weekly Top 200 Songs Streaming Data Report
By Li Liang, Tracy Bui, and Jonathan Ma

### Introduction
Spotify has grown into one of the largest music streaming services, with over 456 million monthly active users as of September 2022. For this final project, we analyze Spotify's streaming data to see which artists, genres, and other music-related factors are most popular in different countries around the world. From this analysis, we will build models to understand whether any countries have similar taste in music. To keep memory use manageable while processing this much data, the dataset is read in chunks of 100 rows.

### Imports
The imports used for this project are below.

In [1]:
import pandas as pd
import numpy as np

### The Data
The dataset we are using is Spotify's <a href="https://www.kaggle.com/datasets/yelexa/spotify200?resource=download" target="_blank">"Weekly Top Songs" Streaming Data</a>. It contains the songs from the Spotify charts labelled "Weekly Top Songs" for each country, covering the weeks from 02/04/2021 to 07/14/2022.

The 36 columns given in this dataset are the uri, rank, artist names, artists number, artist individual, artist id, artist genre, artist image, track name, release date, album number tracks, album cover, source, peak rank, previous rank, weeks on chart, streams, week, danceability, energy, key, mode, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration, country, region, language, and pivot. A full explanation of each column is available through the link above.

Below is a sample of what the data looks like and what it contains. 

In [3]:
# Read only the first 5 rows as strings to preview the raw file contents.
spotify_dataset = pd.read_csv('final.csv', dtype=str, nrows=5)
spotify_dataset.head(5)

Unnamed: 0.1,Unnamed: 0,uri,rank,artist_names,artists_num,artist_individual,artist_id,artist_genre,artist_img,collab,...,acousticness,instrumentalness,liveness,valence,tempo,duration,country,region,language,pivot
0,0,spotify:track:2gpQi3hbcUAcEG8m2dlgfB,1,Paulo Londra,1.0,Paulo Londra,spotify:artist:3vQ0GE3mI0dAaxIMYe5g7z,argentine hip hop,https://i.scdn.co/image/ab6761610000e5ebf796a9...,0,...,0.0495,0.0,0.0658,0.557,173.935,178203.0,Argentina,South America,Spanish,0
1,1,spotify:track:2x8oBuYaObjqHqgGuIUZ0b,2,WOS,1.0,WOS,spotify:artist:5YCc6xS5Gpj3EkaYGdjyNK,argentine indie,https://i.scdn.co/image/ab6761610000e5eb75e151...,0,...,0.7240000000000001,0.0,0.134,0.262,81.956,183547.0,Argentina,South America,Spanish,0
2,2,spotify:track:2SJZdZ5DLtlRosJ2xHJJJa,3,Paulo Londra,1.0,Paulo Londra,spotify:artist:3vQ0GE3mI0dAaxIMYe5g7z,argentine hip hop,https://i.scdn.co/image/ab6761610000e5ebf796a9...,0,...,0.241,0.0,0.0929,0.216,137.915,204003.0,Argentina,South America,Spanish,0
3,3,spotify:track:1O2pcBJGej0pmH2Y9XZMs6,5,Cris Mj,1.0,Cris Mj,spotify:artist:1Yj5Xey7kTwvZla8sqdsdE,urbano chileno,https://i.scdn.co/image/ab6761610000e5eb8f4ebc...,0,...,0.0924,4.6e-05,0.0534,0.8320000000000001,96.018,153750.0,Argentina,South America,Spanish,0
4,4,spotify:track:1TpZKxGnHp37ohJRszTSiq,6,Emilia,1.0,Emilia,spotify:artist:0AqlFI0tz2DsEoJlKSIiT9,pop argentino,https://i.scdn.co/image/ab6761610000e5ebaf96d1...,0,...,0.0811,6.25e-05,0.101,0.501,95.066,133895.0,Argentina,South America,Spanish,0


### Problems To Tackle
Because the data is one large list of songs, we take several steps to extract what we are looking for. First, we categorize the songs by country. Next, we find the average value of each parameter that indicates what is most popular in each country. Finally, we look at the top values for each country and identify any similarities between countries. These steps are summarized in the list below.

1. Categorize the Spotify song list by country
2. Find the average values in each country and model them
3. Find the top values in each country and model them
4. Relate any similarities between countries
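
The four steps above can be sketched on a toy table. This is a minimal illustration, not the project's final method: the three sample countries, the made-up values, the choice of `energy` and `tempo` as features, and the use of Euclidean distance between standardized country averages as a similarity measure are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Spotify table (values are made up for illustration).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "track_name": ["a1", "a2", "b1", "b2", "c1", "c2"],
    "streams": [100, 300, 200, 50, 400, 10],
    "energy": [0.8, 0.6, 0.7, 0.7, 0.2, 0.4],
    "tempo": [120.0, 100.0, 110.0, 112.0, 80.0, 90.0],
})

# Steps 1-2: group rows by country and average the numeric features.
country_means = toy.groupby("country")[["energy", "tempo"]].mean()

# Step 3: the most-streamed track in each country.
top_rows = toy.groupby("country")["streams"].idxmax()
top_tracks = toy.loc[top_rows, ["country", "track_name", "streams"]]

# Step 4 (assumed measure): standardize each feature, then compute the
# Euclidean distance between every pair of country averages.
scaled = (country_means - country_means.mean()) / country_means.std()
values = scaled.to_numpy()
diff = values[:, None, :] - values[None, :, :]
distances = pd.DataFrame(
    np.sqrt((diff ** 2).sum(axis=2)),
    index=scaled.index,
    columns=scaled.index,
)

print(country_means)
print(top_tracks)
print(distances.round(2))
```

A smaller distance means two countries have closer average feature values. In this toy data, A and B end up much closer to each other than either is to C.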

### Loading The Data In Chunks
The dataset is quite large, so we load it into the initial Spotify DataFrame in chunks of 100 rows per iteration.

Because of the dataset's size, loading may take about a minute to finish.

In [4]:
# %%time
# Read the CSV in chunks of 100 rows, then combine the chunks into one DataFrame.
spotify_chunks = pd.read_csv('final.csv', iterator=True, chunksize=100)
spotify_df = pd.concat(spotify_chunks, ignore_index=True)

print("Total rows in Spotify dataset: ", len(spotify_df))

spotify_df.head(5)

# CPU times: user 55 s, sys: 4.44 s, total: 59.4 s
# Wall time: 59.5 s

Total rows in Spotify dataset:  1787999


Unnamed: 0.1,Unnamed: 0,uri,rank,artist_names,artists_num,artist_individual,artist_id,artist_genre,artist_img,collab,...,acousticness,instrumentalness,liveness,valence,tempo,duration,country,region,language,pivot
0,0,spotify:track:2gpQi3hbcUAcEG8m2dlgfB,1,Paulo Londra,1.0,Paulo Londra,spotify:artist:3vQ0GE3mI0dAaxIMYe5g7z,argentine hip hop,https://i.scdn.co/image/ab6761610000e5ebf796a9...,0,...,0.0495,0.0,0.0658,0.557,173.935,178203.0,Argentina,South America,Spanish,0
1,1,spotify:track:2x8oBuYaObjqHqgGuIUZ0b,2,WOS,1.0,WOS,spotify:artist:5YCc6xS5Gpj3EkaYGdjyNK,argentine indie,https://i.scdn.co/image/ab6761610000e5eb75e151...,0,...,0.724,0.0,0.134,0.262,81.956,183547.0,Argentina,South America,Spanish,0
2,2,spotify:track:2SJZdZ5DLtlRosJ2xHJJJa,3,Paulo Londra,1.0,Paulo Londra,spotify:artist:3vQ0GE3mI0dAaxIMYe5g7z,argentine hip hop,https://i.scdn.co/image/ab6761610000e5ebf796a9...,0,...,0.241,0.0,0.0929,0.216,137.915,204003.0,Argentina,South America,Spanish,0
3,3,spotify:track:1O2pcBJGej0pmH2Y9XZMs6,5,Cris Mj,1.0,Cris Mj,spotify:artist:1Yj5Xey7kTwvZla8sqdsdE,urbano chileno,https://i.scdn.co/image/ab6761610000e5eb8f4ebc...,0,...,0.0924,4.6e-05,0.0534,0.832,96.018,153750.0,Argentina,South America,Spanish,0
4,4,spotify:track:1TpZKxGnHp37ohJRszTSiq,6,Emilia,1.0,Emilia,spotify:artist:0AqlFI0tz2DsEoJlKSIiT9,pop argentino,https://i.scdn.co/image/ab6761610000e5ebaf96d1...,0,...,0.0811,6.3e-05,0.101,0.501,95.066,133895.0,Argentina,South America,Spanish,0
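
Concatenating every chunk rebuilds the full table in memory, so chunking mainly helps when each chunk is reduced before it is kept. The sketch below illustrates that idea on a small temporary CSV. The file contents, the `keep_cols` list, the `rank <= 4` filter, and the chunk size of 2 are hypothetical stand-ins for `final.csv` and its columns.

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV so the example runs without final.csv (made-up values).
sample = pd.DataFrame({
    "uri": ["u1", "u2", "u3", "u4", "u5"],
    "rank": [1, 2, 3, 4, 5],
    "streams": [500, 400, 300, 200, 100],
    "country": ["Argentina", "Argentina", "Canada", "Canada", "Canada"],
})
csv_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
sample.to_csv(csv_path, index=False)

# Hypothetical list of columns to keep; usecols skips the rest while parsing.
keep_cols = ["rank", "streams", "country"]

reduced_chunks = []
for chunk in pd.read_csv(csv_path, usecols=keep_cols, chunksize=2):
    # Each chunk is a small DataFrame; filter it before storing it.
    reduced_chunks.append(chunk[chunk["rank"] <= 4])

small_df = pd.concat(reduced_chunks, ignore_index=True)
print(small_df)
```

With this pattern, only the trimmed chunks are held at once, so the columns and rows that are never needed do not accumulate in memory.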


### Altering the Data to Fit Our Needs
As seen above, this dataset has many columns that we don't need, so we trim it to the columns we want to examine. The columns kept for our analysis are `rank`, `artist_names`, `artists_num`, `artist_individual`, `artist_genre`, `track_name`, `streams`, `energy`, `loudness`, `speechiness`, `acousticness`, `instrumentalness`, `liveness`, `valence`, `tempo`, and `country`.

In [5]:
spotify_df = spotify_df.drop(columns=[
    'Unnamed: 0', 'uri', 'artist_id', 'artist_img',
    'collab', 'release_date', 'album_num_tracks',
    'album_cover', 'source', 'peak_rank',
    'previous_rank', 'weeks_on_chart', 'week',
    'danceability', 'key', 'mode', 'duration',
    'region', 'language', 'pivot',
])
spotify_df.head(5)

Unnamed: 0,rank,artist_names,artists_num,artist_individual,artist_genre,track_name,streams,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,country
0,1,Paulo Londra,1.0,Paulo Londra,argentine hip hop,Plan A,3003411,0.834,-4.875,0.0444,0.0495,0.0,0.0658,0.557,173.935,Argentina
1,2,WOS,1.0,WOS,argentine indie,ARRANCARMELO,2512175,0.354,-7.358,0.0738,0.724,0.0,0.134,0.262,81.956,Argentina
2,3,Paulo Londra,1.0,Paulo Londra,argentine hip hop,Chance,2408983,0.463,-9.483,0.0646,0.241,0.0,0.0929,0.216,137.915,Argentina
3,5,Cris Mj,1.0,Cris Mj,urbano chileno,Una Noche en Medellín,2080139,0.548,-5.253,0.077,0.0924,4.6e-05,0.0534,0.832,96.018,Argentina
4,6,Emilia,1.0,Emilia,pop argentino,cuatro veinte,1923270,0.696,-3.817,0.0505,0.0811,6.3e-05,0.101,0.501,95.066,Argentina
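
An alternative to listing every column to drop is to list the columns to keep, which stays correct if the source file gains new columns. A minimal sketch on a made-up frame follows. The `keep_cols` list and the sample rows are assumptions for illustration, and the pattern for removing leftover `Unnamed:` index columns is one possible approach rather than what the notebook does.

```python
import pandas as pd

# Made-up frame with the same kind of leftover index columns as the dataset.
raw = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "uri": ["u1", "u2"],
    "rank": [1, 2],
    "track_name": ["Plan A", "ARRANCARMELO"],
    "country": ["Argentina", "Argentina"],
})

# Option 1: select an explicit allow-list of columns.
keep_cols = ["rank", "track_name", "country"]
selected = raw[keep_cols]

# Option 2: remove only the columns whose name starts with "Unnamed:".
no_index_cols = raw.loc[:, ~raw.columns.str.startswith("Unnamed:")]

print(list(selected.columns))
print(list(no_index_cols.columns))
```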


### Part I: Processing the Countries

In [40]:
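# Collect the distinct values of the country column.
countries = pd.unique(spotify_df['country'])
# Drop the literal value "country" if it appears (e.g. from a repeated header row).
countries = np.delete(countries, np.where(countries == "country"))
number_countries = len(countries)
# Note: the list includes 'Global', which is an aggregate chart rather than a country.
print("There are", number_countries, "countries in this dataset to process\n")
print(countries)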


There are 74 countries in this dataset to process

['Argentina' 'Australia' 'Austria' 'Belarus' 'Belgium' 'Bolivia' 'Brazil'
 'Bulgaria' 'Canada' 'Chile' 'Colombia' 'Costa Rica' 'Cyprus'
 'Czech Republic' 'Denmark' 'Dominican Republic' 'Ecuador' 'Egypt'
 'El Salvador' 'Estonia' 'Finland' 'France' 'Germany' 'Global' 'Greece'
 'Guatemala' 'Honduras' 'Hong Kong' 'Hungary' 'Iceland' 'India'
 'Indonesia' 'Ireland' 'Israel' 'Italy' 'Japan' 'Kazakhstan' 'Korea'
 'Latvia' 'Lithuania' 'Luxembourg' 'Malaysia' 'Mexico' 'Morocco'
 'Netherlands' 'New Zealand' 'Nicaragua' 'Nigeria' 'Norway' 'Pakistan'
 'Panama' 'Paraguay' 'Peru' 'Philippines' 'Poland' 'Portugal' 'Romania'
 'Saudi Arabia' 'Singapore' 'Slovakia' 'South Africa' 'Spain' 'Sweden'
 'Switzerland' 'Taiwan' 'Thailand' 'Turkey' 'United Arab Emirates'
 'United Kingdom' 'Ukraine' 'Uruguay' 'United States' 'Venezuela'
 'Vietnam']
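
The filter on the literal value `"country"` in the cell above suggests that some rows of the file may repeat the header line. That is an assumption, since the notebook does not show those rows. If it holds, a sketch like the following would drop them from the frame itself (not just from the list of names) and restore numeric types. The toy data is made up.

```python
import pandas as pd

# Made-up frame in which the third row is a repeated header line.
toy = pd.DataFrame({
    "rank": ["1", "2", "rank", "1"],
    "streams": ["300", "200", "streams", "150"],
    "country": ["Argentina", "Argentina", "country", "Canada"],
})

# Drop rows whose country cell is the header text itself.
clean = toy[toy["country"] != "country"].copy()

# Columns that contained header text are read as strings; convert them back.
for col in ["rank", "streams"]:
    clean[col] = pd.to_numeric(clean[col])

# Count how many chart rows each country contributes.
rows_per_country = clean["country"].value_counts()
print(rows_per_country)
```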


In [44]:
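def categorizeCountry(country_name):
    """Return the first 10 rows of spotify_df for the given country."""
    newdf = spotify_df[spotify_df["country"] == country_name]
    return newdf.head(10)

# Only the last expression in a notebook cell is displayed, so the output
# below shows Canada; the Hong Kong result is computed but not shown.
categorizeCountry("Hong Kong")
categorizeCountry("Canada")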


Unnamed: 0,rank,artist_names,artists_num,artist_individual,artist_genre,track_name,streams,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,country
186962,2,Olivia Rodrigo,1.0,Olivia Rodrigo,pop,good 4 u,1732495,0.664,-5.044,0.154,0.335,0.0,0.0849,0.688,166.928,Canada
186963,3,Ed Sheeran,1.0,Ed Sheeran,pop,Bad Habits,1557969,0.897,-3.712,0.0348,0.0469,3.14e-05,0.364,0.591,126.026,Canada
186964,4,Post Malone,1.0,Post Malone,dfw rap,Motley Crew,1426111,0.631,-3.818,0.0786,0.0904,3.71e-06,0.0998,0.288,129.915,Canada
186965,5,Måneskin,1.0,Måneskin,indie rock italiano,Beggin',1365828,0.8,-4.808,0.0504,0.127,0.0,0.359,0.589,134.002,Canada
186966,8,Lil Nas X,1.0,Lil Nas X,pop,MONTERO (Call Me By Your Name),1077506,0.508,-6.682,0.152,0.297,0.0,0.384,0.758,178.81799999999996,Canada
186967,10,Billie Eilish,1.0,Billie Eilish,pop,NDA,870137,0.373,-9.915,0.0713,0.3289999999999999,0.541,0.1119999999999999,0.59,85.015,Canada
186968,11,BTS,1.0,BTS,k-pop boy group,Permission to Dance,866435,0.741,-5.33,0.0427,0.0054399999999999,0.0,0.337,0.6459999999999999,124.925,Canada
186969,12,Glass Animals,1.0,Glass Animals,shiver pop,Heat Waves,828638,0.525,-6.9,0.0944,0.44,6.7e-06,0.0921,0.531,80.87,Canada
186970,13,Polo G,1.0,Polo G,chicago rap,RAPSTAR,752918,0.536,-6.862,0.242,0.41,0.0,0.129,0.4370000000000001,81.039,Canada
186971,16,Olivia Rodrigo,1.0,Olivia Rodrigo,pop,deja vu,745235,0.612,-7.222,0.1119999999999999,0.584,5.700000000000001e-06,0.37,0.178,180.917,Canada
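
`categorizeCountry` handles one country per call. To get the first rows of every country in one pass, `groupby(...).head(n)` is one option. The sketch below uses a made-up frame and a hypothetical helper name, `top_rows_per_country`, and it takes the DataFrame as an argument rather than reading the global `spotify_df`.

```python
import pandas as pd

def top_rows_per_country(df, n=10):
    """Return the n best-ranked rows for each country in df."""
    ordered = df.sort_values(["country", "rank"])
    return ordered.groupby("country").head(n)

# Made-up chart rows for two countries.
toy = pd.DataFrame({
    "rank": [2, 1, 3, 1, 2],
    "track_name": ["t2", "t1", "t3", "c1", "c2"],
    "country": ["Canada", "Canada", "Canada", "Hong Kong", "Hong Kong"],
})

top2 = top_rows_per_country(toy, n=2)
print(top2)
```

Because the dataset spans many weeks, the same rank can appear once per week for a country, so filtering to a single week before calling a helper like this is probably needed for a meaningful "top 10".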


### Resources Used

#### Loading Initial Data
* https://towardsdatascience.com/%EF%B8%8F-load-the-same-csv-file-10x-times-faster-and-with-10x-less-memory-%EF%B8%8F-e93b485086c7
* https://stackoverflow.com/questions/45532711/pandas-read-csv-method-is-using-too-much-ram
