# Spotify's Weekly Top 200 Songs Streaming Data Report
By Li Liang, Tracy Bui, and Jonathan Ma

### Introduction
Spotify has grown into one of the largest audio streaming services, with over 456 million monthly active users as of September 2022. For this final project, we analyze and relate streaming data to see which artists, genres, and other music-related factors are most popular in different countries around the world. From this analysis, we will build models to understand whether any countries have similar taste in music. Because the dataset is large and has to fit in RAM while it is processed, it will be read in batches of 30 rows.

### Imports
The imports used for this project are below.

In [1]:
import pandas as pd
import csv

### The Data
The dataset we are using is Spotify's <a href="https://www.kaggle.com/datasets/yelexa/spotify200?resource=download" target="_blank">"Weekly Top Songs" Streaming Data</a>. It contains the songs from the Spotify charts labelled "Weekly Top Songs" for each country, from the week of 02/04/2021 through the week of 07/14/2022.

The dataset has 36 columns, including uri, rank, artist names, artists number, artist individual, artist id, artist genre, artist image, track name, release date, album number tracks, album cover, source, peak rank, previous rank, weeks on chart, streams, week, danceability, energy, key, mode, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration, country, region, language, and pivot. A full explanation of each column can be found through the hyperlink above.

In [2]:
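# preview only the first 5 rows; read every column as a string to avoid mixed-type warnings
spotify_dataset = pd.read_csv('final.csv', dtype=str, nrows=5)

# rename the unnamed first column (the row index saved in the CSV)
spotify_dataset = spotify_dataset.rename(columns={'Unnamed: 0': 'Number'})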
spotify_dataset.head(5)

FileNotFoundError: [Errno 2] No such file or directory: 'final.csv'

### Problems To Tackle
Since the data is currently one large list of songs, we need several steps to get the information we are looking for. First, we will categorize the songs by country. Then we will find the average value of each parameter that indicates what is most popular in each country. Next, we will look at the top values for each country and find similarities between countries, if there are any. These steps are summarized in the list below.

1. Categorize the Spotify song list by country
2. Find the average values in each country and model them
3. Find the top values in each country and model them
4. Relate any similarities between countries
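
The four steps above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the rows are invented, the column names (`country`, `track_name`, `streams`, `danceability`, `energy`, `valence`) are assumed spellings of the columns described earlier, and using correlation between per-country averages as the similarity measure is one possible choice, not necessarily the model used later in this project.

```python
import pandas as pd

# Hypothetical toy rows: the values are invented and the column names are
# assumed spellings of the columns described in "The Data".
toy = pd.DataFrame({
    "country": ["US", "US", "Japan", "Japan", "Brazil", "Brazil"],
    "track_name": ["a", "b", "c", "d", "e", "f"],
    "streams": [100, 300, 50, 70, 200, 120],
    "danceability": [0.8, 0.6, 0.4, 0.5, 0.9, 0.7],
    "energy": [0.7, 0.5, 0.6, 0.8, 0.8, 0.6],
    "valence": [0.3, 0.5, 0.9, 0.7, 0.4, 0.4],
})
features = ["danceability", "energy", "valence"]

# Steps 1 and 2: group the songs by country and average each audio feature.
country_means = toy.groupby("country")[features].mean()

# Step 3: pick the most-streamed track in each country.
top_rows = toy.groupby("country")["streams"].idxmax()
top_tracks = toy.loc[top_rows, ["country", "track_name", "streams"]]

# Step 4: compare countries by correlating their average feature profiles.
# Transposing puts countries in the columns, which is what DataFrame.corr compares.
similarity = country_means.T.corr()

print(country_means)
print(top_tracks)
print(similarity)
```

With only three features the correlations are very coarse; the sketch is meant to show the shape of the computation rather than a meaningful result.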

### Batching The Data
As stated above, the dataset is quite large, so it will be read in batches of 30 rows. Here is how we will process it throughout the project.

In [3]:
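batch_size = 30
batch_30 = []
frames = []  # one DataFrame per batch, concatenated once at the end

# newline='' is the setting the csv module recommends for files it reads
with open('final.csv', encoding='utf-8', newline='') as csv_file:
    reader = csv.DictReader(csv_file)

    for row in reader:
        batch_30.append(row)

        # flush once the batch holds exactly 30 rows
        # (the earlier `i % 30 == 0` check fired on the very first row)
        if len(batch_30) == batch_size:
            frames.append(pd.DataFrame(batch_30))
            batch_30 = []

# flush the final, possibly smaller, batch
if batch_30:
    frames.append(pd.DataFrame(batch_30))

# a single concat avoids re-copying the growing DataFrame on every batch
spotify_dataset_1 = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

print("done")
spotify_dataset_1.head(5)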


FileNotFoundError: [Errno 2] No such file or directory: 'final.csv'
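
As a possible alternative to the manual loop, `pd.read_csv` accepts a `chunksize` argument and then yields DataFrames of at most that many rows. The sketch below is a minimal illustration, not the method used in this project: it reads an invented in-memory CSV as a stand-in for `final.csv`, and the three column names and 65 rows are made up for the example.

```python
import io
import pandas as pd

# Stand-in for final.csv: 65 invented rows with three assumed columns.
rows = [f"{i % 200 + 1},Global,{1000 - i}" for i in range(65)]
csv_text = "rank,country,streams\n" + "\n".join(rows)

chunks = []
chunk_sizes = []

# chunksize=30 makes read_csv return an iterator of DataFrames with up to 30 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=str, chunksize=30):
    chunk_sizes.append(len(chunk))
    chunks.append(chunk)

# Combine the batches into one DataFrame with a fresh 0..n-1 index.
batched = pd.concat(chunks, ignore_index=True)

print(chunk_sizes)  # [30, 30, 5]
print(batched.head(5))
```

Keeping every chunk and concatenating them still ends with the whole dataset in memory. If memory is the real constraint, each chunk could be filtered or aggregated inside the loop so that only the reduced result is kept.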