# Spotify Music Data Analysis in Quarantine

<p>Students:</p>
<ul>
    <li> Ronie Arauco </li>
    <li> Handry Guillen </li>
</ul>

## Context

<center>
<img src="https://portal.andina.pe/EDPfotografia3/Thumbnail/2020/03/18/000661676W.jpg" alt="drawing" width="600"/>
</center>

The world is currently facing a pandemic that has changed how people live.
Depending on how severe the situation is in each country, people's peace of mind has been negatively affected.

Our state of mind is often reflected in the music we listen to, since we tend to choose a particular kind of music depending on our mood.


Therefore, we present an analysis of people's moods during the quarantine, based on the most-listened-to music in each country.

The questions we want to answer are the following:


*   What kind of music do people listen to in this situation? Happy or sad music?
*   Does the severity of the situation in each country really affect people's musical taste? 


## Data Description

The data were first extracted from SpotifyCharts (the official Spotify page for the Top 200 and viral charts). This data contains the track IDs of the Top 200 for each country where Spotify is available.

Once we have the IDs, we use the Spotify Web API to extract all the variables available for each track. These are the following:

|Fields|Type|Description|
|---|---|---|
|country|string|Country of the Top 200 Playlist.|
|date_extraction|timestamp|Date of the Top 200 Playlist.|
|track_id|string|The Spotify ID for the track.|
|streams|int|Number of streams of the track.|
|album|string|The album on which the track appears.|
|artist|string|The artists who performed the track.|
|duration_ms|int|The duration of the track in milliseconds.|
|explicit|bool|Whether or not the track has explicit lyrics (true = yes, it does; false = no, it does not, or unknown).|
|track_name|string|The name of the track.|
|track_danceability|float|Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.|
|track_energy|float|Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.|
|track_key|int|The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.|
|track_loudness|float|The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.|
|track_mode|int|Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.|
|track_speechiness|float|Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.|
|track_acousticness|float|A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.|
|track_instrumentalness|float|Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.|
|track_liveness|float|Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.|
|track_valence|float|A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).|
|track_tempo|float|The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.|
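
As a minimal sketch of how some of these fields can be read, the helpers below turn `track_key`, `track_mode`, and `track_speechiness` into labels, using only the mappings and thresholds stated in the table. The function names are hypothetical (they are not part of the notebook), and the pitch names beyond C, C♯/D♭, and D are filled in from standard pitch-class notation.

```python
# Hypothetical helpers for reading the key, mode, and speechiness fields.
PITCH_CLASSES = ['C', 'C♯/D♭', 'D', 'D♯/E♭', 'E', 'F',
                 'F♯/G♭', 'G', 'G♯/A♭', 'A', 'A♯/B♭', 'B']

def describe_key(track_key, track_mode):
    """Return a label such as 'D major' from the integer key and mode."""
    if track_key is None or not 0 <= track_key < len(PITCH_CLASSES):
        return 'unknown'
    mode = 'major' if track_mode == 1 else 'minor'
    return PITCH_CLASSES[track_key] + ' ' + mode

def describe_speechiness(value):
    """Label a speechiness score using the thresholds from the table."""
    if value > 0.66:
        return 'mostly spoken words'
    if value >= 0.33:
        return 'music and speech'
    return 'mostly music'

print(describe_key(2, 1))
print(describe_speechiness(0.05))
```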

## Analysis

<p>This Spotify music analysis follows the KDD (Knowledge Discovery in Databases) process and uses the Spotify Web API to explore the relationship between people's musical taste and the situation we are currently living through.</p>
<p>Our requests go through the Spotify Client Credentials flow; the image below explains how it works.</p>

<img src="https://developer.spotify.com/assets/AuthG_ClientCredentials.png" alt="drawing" width="450"/>
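
To make the first step of this flow concrete, here is a small sketch of the `Authorization` header a client sends when requesting a token: the client ID and secret are joined with a colon and Base64-encoded, which is what HTTP Basic authentication does. The function name and the credentials are placeholders for illustration, not values from this notebook.

```python
import base64

def basic_auth_header(client_id, client_secret):
    """Build the HTTP Basic Authorization header for the token request."""
    raw = '{}:{}'.format(client_id, client_secret).encode('utf-8')
    return {'Authorization': 'Basic ' + base64.b64encode(raw).decode('ascii')}

# Placeholder credentials, for illustration only
headers = basic_auth_header('my-client-id', 'my-client-secret')
body = {'grant_type': 'client_credentials'}  # form field sent with the POST
print(headers)
```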

## 1. Spotifycharts Scraping (https://spotifycharts.com/regional)

### 1.1. Importing Libraries

In [None]:
from requests.auth import HTTPBasicAuth
import requests
import json
import numpy as np
import pandas as pd
import datetime
from bs4 import BeautifulSoup
import re
from multiprocessing import Pool
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 100)



### 1.2. Obtaining Filter Values

In [None]:
spotifycharts = 'https://spotifycharts.com/regional'
r = requests.get(spotifycharts)
charts = BeautifulSoup(r.text, 'lxml')

In [None]:
params_country = {'class': 'responsive-select', 'data-type': 'country'}
filter_country = charts.body.find(attrs=params_country).ul

country_list = []
country_name = {}
for child in filter_country.children:
    if (not isinstance(child, type(child.string))):
        country_list.append(child['data-value'])
        country_name[child['data-value']] = child.text

print(country_list)
print(country_name)

['global', 'us', 'gb', 'ad', 'ar', 'at', 'au', 'be', 'bg', 'bo', 'br', 'ca', 'ch', 'cl', 'co', 'cr', 'cy', 'cz', 'de', 'dk', 'do', 'ec', 'ee', 'es', 'fi', 'fr', 'gr', 'gt', 'hk', 'hn', 'hu', 'id', 'ie', 'il', 'in', 'is', 'it', 'jp', 'lt', 'lu', 'lv', 'mc', 'mt', 'mx', 'my', 'ni', 'nl', 'no', 'nz', 'pa', 'pe', 'ph', 'pl', 'pt', 'py', 'ro', 'se', 'sg', 'sk', 'sv', 'th', 'tr', 'tw', 'uy', 'vn', 'za']
{'global': 'Global', 'us': 'United States', 'gb': 'United Kingdom', 'ad': 'Andorra', 'ar': 'Argentina', 'at': 'Austria', 'au': 'Australia', 'be': 'Belgium', 'bg': 'Bulgaria', 'bo': 'Bolivia', 'br': 'Brazil', 'ca': 'Canada', 'ch': 'Switzerland', 'cl': 'Chile', 'co': 'Colombia', 'cr': 'Costa Rica', 'cy': 'Cyprus', 'cz': 'Czech Republic', 'de': 'Germany', 'dk': 'Denmark', 'do': 'Dominican Republic', 'ec': 'Ecuador', 'ee': 'Estonia', 'es': 'Spain', 'fi': 'Finland', 'fr': 'France', 'gr': 'Greece', 'gt': 'Guatemala', 'hk': 'Hong Kong', 'hn': 'Honduras', 'hu': 'Hungary', 'id': 'Indonesia', 'ie': 'Ir

In [None]:
params_recurrence = {'class': 'responsive-select', 'data-type': 'recurrence'}
filter_recurrence = charts.body.find(attrs=params_recurrence).ul

recurrence_list = []
for child in filter_recurrence.children:
    if (not isinstance(child, type(child.string))):
        recurrence_list.append(child['data-value'])

print(recurrence_list)

['daily', 'weekly']


In [None]:
params_date = {'class': 'responsive-select', 'data-type': 'date'}
filter_date = charts.body.find(attrs=params_date).ul

date = []
for child in filter_date.children:
    if (not isinstance(child, type(child.string))):
        date.append(child['data-value'])

print(date[:5], 'and {} more...'.format(len(date[5:])))

['2020-06-11', '2020-06-10', '2020-06-09', '2020-06-08', '2020-06-07'] and 1249 more...


### 1.3. Formatting Urls

In [None]:
# According to the date filter format
base = datetime.datetime(2020, 1, 1)
top = datetime.datetime.now()
numdays = (top - base).days
date_list = [(base + datetime.timedelta(days=x)).strftime('%Y-%m-%d') for x in range(numdays)]
print(date_list)

['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08', '2020-01-09', '2020-01-10', '2020-01-11', '2020-01-12', '2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16', '2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20', '2020-01-21', '2020-01-22', '2020-01-23', '2020-01-24', '2020-01-25', '2020-01-26', '2020-01-27', '2020-01-28', '2020-01-29', '2020-01-30', '2020-01-31', '2020-02-01', '2020-02-02', '2020-02-03', '2020-02-04', '2020-02-05', '2020-02-06', '2020-02-07', '2020-02-08', '2020-02-09', '2020-02-10', '2020-02-11', '2020-02-12', '2020-02-13', '2020-02-14', '2020-02-15', '2020-02-16', '2020-02-17', '2020-02-18', '2020-02-19', '2020-02-20', '2020-02-21', '2020-02-22', '2020-02-23', '2020-02-24', '2020-02-25', '2020-02-26', '2020-02-27', '2020-02-28', '2020-02-29', '2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04', '2020-03-05', '2020-03-06', '2020-03-07', '2020-03-08', '2020-03-09', '2020-03-10', '2020-03-11', '2020

In [None]:
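# Browser-like User-Agent header; adjacent string literals are joined
# so that no stray indentation ends up inside the header value
payload = {'User-Agent':
           'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
           'AppleWebKit/537.36 (KHTML, like Gecko) '
           'Chrome/53.0.2785.143 '
           'Safari/537.36'}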
path = 'https://spotifycharts.com/regional/'
url = []

for date in date_list:
    for country in country_list:
        url_aux = path + country + '/daily/' + date + '/download'
        url.append((url_aux, country, date))
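
The same URL list can be sketched with the standard library only. The helper names below (`daily_dates`, `build_urls`) are hypothetical, and the URL pattern is the one used in the cell above. The two-country list and the fixed end date are illustrative assumptions so that the result is reproducible.

```python
import datetime
from itertools import product

def daily_dates(start, end):
    """Return ISO dates from start (inclusive) to end (exclusive)."""
    days = (end - start).days
    return [(start + datetime.timedelta(days=d)).strftime('%Y-%m-%d')
            for d in range(days)]

def build_urls(dates, countries, base='https://spotifycharts.com/regional/'):
    """Return (url, country, date) tuples, one per date and country."""
    return [(base + country + '/daily/' + date + '/download', country, date)
            for date, country in product(dates, countries)]

dates = daily_dates(datetime.date(2020, 1, 1), datetime.date(2020, 1, 4))
urls = build_urls(dates, ['pe', 'us'])
print(len(urls))   # 3 days x 2 countries
print(urls[0][0])
```

`product(dates, countries)` iterates with the date in the outer position, which matches the order of the nested loops above.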

### 1.4. Downloading Charts

In [None]:
def get_track_ids(url):
    # url is a (download_url, country_code, date) tuple built in section 1.3
    payload = {'User-Agent':
           'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
           'AppleWebKit/537.36 (KHTML, like Gecko) '
           'Chrome/53.0.2785.143 '
           'Safari/537.36'}
    
    r = requests.get(url[0], headers=payload)
    if (r.ok):
        # Each CSV row ends with ",<streams>,<track URL>";
        # capture the stream count and the track ID
        track_id = re.findall(r',([-+]?[0-9]+),https://open\.spotify\.com/track/(.*?)\n', r.text)
        
        aux_track_id = []
        # Top 50
        #for t in track_id[:50]:
        for t in track_id:
            # Store streams as an integer, as listed in the data description
            aux_track_id.append({'country': country_name[url[1]], 'date_extraction': url[2], \
                'track_id': t[1], 'streams': int(t[0])})
        
        # If is not empty
        if (aux_track_id):
            return aux_track_id
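
To see what the regular expression captures, the sketch below applies it to a small hand-written sample. The sample rows, track IDs, and column layout are made up for illustration; the only assumption is that each row ends with the stream count followed by the track URL.

```python
import re

# Hypothetical sample shaped like a downloaded chart; not real chart data
sample = (
    'Position,Track Name,Artist,Streams,URL\n'
    '1,Song A,Artist A,12345,https://open.spotify.com/track/aaa111\n'
    '2,Song B,Artist B,6789,https://open.spotify.com/track/bbb222\n'
)

pattern = r',([-+]?[0-9]+),https://open\.spotify\.com/track/(.*?)\n'

# Converting streams with int() is an illustrative choice
rows = [{'track_id': track_id, 'streams': int(streams)}
        for streams, track_id in re.findall(pattern, sample)]
print(rows)
```

The header row is skipped because `Streams` is not a number, so only data rows produce a match.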

We use multiprocessing to download the charts quickly. Roughly 90 days have passed since the quarantine began, and we need to analyze the top 50 tracks from 61 countries. That means about 5,490 GET requests (one chart per country per day), yielding approximately 274,500 chart entries in the best case. We run this on Google Colab because we do not have the hardware to run this code locally.

In [None]:
p = Pool(30)
aux = p.map(get_track_ids, url)
aux2 = []
for i in aux:
    if i is not None:
        aux2.extend(i)
p.terminate()
p.join()
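
Because the downloads are network-bound, a thread pool is a possible alternative to a process pool. The sketch below is not the notebook's method: it swaps in `concurrent.futures.ThreadPoolExecutor`, and `fake_fetch` is a hypothetical stand-in for `get_track_ids` so that the example runs without network access. It shows the same pattern of mapping a function over the URL tuples and flattening the non-empty results.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    """Stand-in for get_track_ids: returns a list of rows, or None on failure."""
    _, country, date = url
    if country == 'xx':          # simulate a chart that could not be downloaded
        return None
    return [{'country': country, 'date_extraction': date}]

def download_all(fetch, urls, workers=30):
    """Run fetch over urls concurrently and flatten the non-empty results."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = executor.map(fetch, urls)   # preserves input order
        return [row for rows in results if rows for row in rows]

urls = [('u1', 'pe', '2020-04-01'), ('u2', 'xx', '2020-04-01'),
        ('u3', 'us', '2020-04-01')]
rows = download_all(fake_fetch, urls, workers=2)
print(len(rows))
```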

Once we get the data, we save it.

In [None]:
# df = pd.DataFrame(aux2)
# df.to_csv('tracks.csv', index=False)

Reading the data.

In [None]:
# df = pd.read_csv('tracks.csv')

In [None]:
# Just april
# df.shape

In [None]:
# df.head()

## 2. Getting Features of Tracks with Spotify Web API

### 2.1. Generating Authorization Token

In [None]:
client_id = '6d3ef950caae47758192dbd58723a460'
client_secret = '750d6da1d107477894fe1728aac6a1b7'
path = 'https://accounts.spotify.com/api/token'
payload = {'grant_type': 'client_credentials'}

r = requests.post(path, auth=HTTPBasicAuth(client_id, client_secret), data=payload)

if (r.ok):
    print('Response: ', r.text)
else:
    print('Something happened', r)

Response:  {"access_token":"BQDRtYAszYmhHJVlpEHylPdKraSV_pEqpw6WnUinSWK1OEvkw1r7riwHjDTk8vq-GU62nmP1Xnw_5x6voBk","token_type":"Bearer","expires_in":3600,"scope":""}
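
The response includes `expires_in` (3600 seconds here), so a token can be reused across requests until it expires. The sketch below caches the token and refreshes it shortly before expiry. `TokenCache`, `fetch_token`, and the 60-second safety margin are assumptions for illustration; `fetch_token` stands in for the POST request above and is stubbed so that the example runs offline.

```python
import time

class TokenCache:
    """Cache an access token and refresh it shortly before it expires."""

    def __init__(self, fetch_token, margin=60, clock=time.time):
        self._fetch_token = fetch_token   # callable returning the token JSON as a dict
        self._margin = margin             # seconds of safety margin before expiry
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def header(self):
        """Return an Authorization header, fetching a new token if needed."""
        if self._token is None or self._clock() >= self._expires_at:
            data = self._fetch_token()
            self._token = data['token_type'] + ' ' + data['access_token']
            self._expires_at = self._clock() + data['expires_in'] - self._margin
        return {'Authorization': self._token}

# Stub in place of the real token request (hypothetical values)
calls = []
def fetch_token():
    calls.append(1)
    return {'access_token': 'abc', 'token_type': 'Bearer', 'expires_in': 3600}

cache = TokenCache(fetch_token)
cache.header()
cache.header()
print(len(calls))   # the second call reuses the cached token
```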


### 2.2. Saving Tracks in Set Structure

In [None]:
track_ids = set()
for a in aux2:
    track_ids.add(a['track_id'])

In [None]:
len(track_ids)

16396

### 2.3. Getting Basic Data of Tracks

In [None]:
start = 0
track_max = 50
track_basics = []
track_ids = list(track_ids)

while (start < len(track_ids)):
    aux_track_ids = ','.join(track_ids[start:(start + track_max)])
    # Make request
    path = 'https://api.spotify.com/v1/tracks/' '?' + \
        'ids=' + aux_track_ids
    try:
        client_id = '6d3ef950caae47758192dbd58723a460'
        client_secret = '750d6da1d107477894fe1728aac6a1b7'
        aux_path = 'https://accounts.spotify.com/api/token'
        aux_payload = {'grant_type': 'client_credentials'}
        aux_r = requests.post(aux_path, auth=HTTPBasicAuth(client_id, client_secret), data=aux_payload)
        d = json.loads(aux_r.text)
        payload = {'Authorization': d['token_type'] + ' ' + d['access_token']}
        
        r = requests.get(path, headers=payload)
        j = json.loads(r.text)

        count = 0
        for f in j['tracks']:
            aux_artists = []
            for x in f['artists']:
                aux_artists.append(x['name'])
            artists = ','.join(aux_artists)
            
            track_basics.append(
                {'track_id': f['id'],
                'album': f['album']['name'],
                'artist': artists,
                'duration_ms': f['duration_ms'],
                'explicit': f['explicit'],
                'track_name': f['name']}
            )
            count += 1
    except Exception as e:
        print('Something went wrong: |', e, '|', path)
    finally:
        start += track_max

### 2.4. Getting Features of Tracks

In [None]:
start = 0
track_max = 100
track_features = []

while (start < len(track_ids)):
    aux_track_ids = ','.join(track_ids[start:(start + track_max)])
    # Make request
    path = 'https://api.spotify.com/v1/audio-features/' '?' + \
        'ids=' + aux_track_ids
    try:
        client_id = '6d3ef950caae47758192dbd58723a460'
        client_secret = '750d6da1d107477894fe1728aac6a1b7'
        aux_path = 'https://accounts.spotify.com/api/token'
        aux_payload = {'grant_type': 'client_credentials'}
        aux_r = requests.post(aux_path, auth=HTTPBasicAuth(client_id, client_secret), data=aux_payload)
        d = json.loads(aux_r.text)
        payload = {'Authorization': d['token_type'] + ' ' + d['access_token']}
        
        r = requests.get(path, headers=payload)
        j = json.loads(r.text)
        
        count = 0
        # Copy each audio feature, falling back to None when the API
        # returns a null entry or omits a field
        feature_names = ['danceability', 'energy', 'key', 'loudness', 'mode',
                         'speechiness', 'acousticness', 'instrumentalness',
                         'liveness', 'valence', 'tempo']
        for f in j['audio_features']:
            f = f if isinstance(f, dict) else {}
            row = {'track_id': f.get('id')}
            for name in feature_names:
                row['track_' + name] = f.get(name)
            track_features.append(row)
            count += 1
    except Exception as e:
        print('Something went wrong: |', e, '|', path)
    finally:
        start += track_max

In [None]:
len(track_features)

16396
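
Both loops above send the IDs in fixed-size batches (50 per request for `/tracks`, 100 for `/audio-features`). A minimal sketch of that batching step as a reusable generator follows; the helper name `chunked` and the fake IDs are hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

ids = ['id{}'.format(i) for i in range(120)]   # fake track IDs
batches = [','.join(batch) for batch in chunked(ids, 50)]
print(len(batches))   # 50 + 50 + 20 IDs are split into 3 batches
```

Each element of `batches` is a comma-separated string that could be passed as the `ids` query parameter.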

### 2.5. Merging and Saving Data

In [None]:
df1 = pd.DataFrame(aux2)
df2 = pd.DataFrame(track_basics)
df3 = pd.DataFrame(track_features)

In [None]:
df = df1.join(df2.set_index('track_id'), on='track_id').join(df3.set_index('track_id'), on='track_id')

In [None]:
df.shape

(1330714, 20)
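
The chained `join` above is a left join on `track_id`. An equivalent sketch with `merge` is shown below on toy data (the rows are invented). Passing `validate='many_to_one'` makes pandas raise an error if the right-hand table contains duplicate `track_id` values, which would otherwise silently multiply chart rows.

```python
import pandas as pd

# Toy stand-ins for df1 (chart rows), df2 (basic data), and df3 (features)
charts = pd.DataFrame({'track_id': ['a', 'a', 'b'],
                       'country': ['Peru', 'Chile', 'Peru'],
                       'streams': [10, 20, 30]})
basics = pd.DataFrame({'track_id': ['a', 'b'],
                       'track_name': ['Song A', 'Song B']})
features = pd.DataFrame({'track_id': ['a', 'b'],
                         'track_valence': [0.9, 0.2]})

merged = (charts
          .merge(basics, on='track_id', how='left', validate='many_to_one')
          .merge(features, on='track_id', how='left', validate='many_to_one'))
print(merged.shape)   # one row per chart entry
```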

In [None]:
# Just for Google Colab
from google.colab import drive
drive.mount('/content/drive')



In [None]:
filename = datetime.datetime.now().strftime('%Y%m%dT%H%M%S')
df.to_csv('/content/drive/My Drive/data-analysis-ta-2020/' + filename + '.csv', index=False)