# Descarga de datos desde APIs públicas

## APIs involucradas

Para la confección del dataset (inspirado en [este trabajo](https://www.kaggle.com/datasets/theoverman/the-spotify-hit-predictor-dataset)), se obtuvo información de canciones mediante el uso de las API Billboard y de Spotify, a si mismo como librerías de Python que facilitan su llamada con más facilidad.

| APIs | Librerías de Python |
| :-: | :-: |
| [Spotify](https://developer.spotify.com/documentation/web-api/reference/) | [spotipy](https://spotipy.readthedocs.io/en/2.22.1/) |
| [Billboard](https://rapidapi.com/LDVIN/api/billboard-api) | [billboard.py](https://pypi.org/project/billboard.py/) |

Para poder hacer uso de la API de Spotify, se creó una app en el [Spotify for Developers Dashboard](https://developer.spotify.com/dashboard/), haciendo uso de la [guía](https://developer.spotify.com/documentation/general/guides/authorization/app-settings/). 

A su vez, para garantizar el uso de credenciales y mantener las mismas protegidas, se creó un archivo `config.py` con la API key y credenciales correspondientes. Este archivo se mantiene oculto por razones de seguridad.

## Procedimiento para la obtención de obtención de Datos

### Librerías

In [3]:
import billboard
import csv
import pandas as pd
import time
import ast
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import config
from datetime import datetime

client_credentials_manager = SpotifyClientCredentials(client_id=config.client_id, client_secret=config.api_key)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

### 1. Obtener los top 100 hits por década

In [2]:
def get_chart_of(decade, week_of):
	
    chart = billboard.ChartData('hot-100', week_of)
    print(week_of)
    charts = []

    filename = f'billboard{decade}.csv'

    with open(filename, 'a+', newline='') as csv_file:
        csv_writer = csv.writer(csv_file)
        date = str(chart.date)
        charts.append(date)
        for song in chart.entries:
            charts.append([song.title, song.artist])

        csv_writer.writerows([charts])

def all_saturdays(decade):
    start_year = int(decade)
    end_year = start_year + 9
    start_date = f'{start_year}-01-02'
    end_date = f'{end_year}-12-27'
    return pd.date_range(start=start_date, end=end_date, freq='28D').strftime('%Y-%m-%d').tolist()

In [3]:
decades = ['1970','1980','1990','2000','2010']

for decade in decades:
    saturdays = all_saturdays(decade)
    for saturday in saturdays:
        try:
            get_chart_of(decade, saturday)
        except:
            print('Time Out.. Waiting.')
            time.sleep(100)
            get_chart_of(decade, saturday)
    print(f'Done with {decade}!')

1970-01-02
1970-01-30
1970-02-27
1970-03-27
1970-04-24
1970-05-22
1970-06-19
1970-07-17
1970-08-14
1970-09-11
1970-10-09
1970-11-06
1970-12-04
1971-01-01
1971-01-29
1971-02-26
1971-03-26
1971-04-23
1971-05-21
1971-06-18
1971-07-16
1971-08-13
1971-09-10
1971-10-08
1971-11-05
1971-12-03
1971-12-31
1972-01-28
1972-02-25
1972-03-24
1972-04-21
1972-05-19
1972-06-16
1972-07-14
1972-08-11
1972-09-08
1972-10-06
1972-11-03
1972-12-01
1972-12-29
1973-01-26
1973-02-23
1973-03-23
1973-04-20
1973-05-18
1973-06-15
1973-07-13
1973-08-10
1973-09-07
1973-10-05
1973-11-02
1973-11-30
1973-12-28
1974-01-25
1974-02-22
1974-03-22
1974-04-19
1974-05-17
1974-06-14
1974-07-12
1974-08-09
1974-09-06
1974-10-04
1974-11-01
1974-11-29
1974-12-27
1975-01-24
1975-02-21
1975-03-21
1975-04-18
1975-05-16
1975-06-13
1975-07-11
1975-08-08
1975-09-05
1975-10-03
1975-10-31
1975-11-28
1975-12-26
1976-01-23
1976-02-20
1976-03-19
1976-04-16
1976-05-14
1976-06-11
1976-07-09
1976-08-06
1976-09-03
1976-10-01
1976-10-29
1976-11-26

### 2. Eliminar canciones duplicadas

In [7]:
decades = ['1970','1980','1990','2000','2010']

for decade in decades:

	filename = f'billboard{decade}'

	songlist = set()
	song_count = 0
	count = 10

	with open(filename + '.csv') as fp1, open(filename + '_unique' + '.csv', 'w', newline="") as fp2:

		reader = csv.reader(fp1)
		writer = csv.writer(fp2)

		for row in reader: 

			row = row[1:]

			for song in row:

				song_count += 1
				track = list(ast.literal_eval(song))
				track = tuple(track)
				songlist.add(track)

		for song in songlist:
		
			song = list(song)
			writer.writerow(song)		

	print(f'Previous Song Count: {song_count}, New Song Count: {len(songlist)}')

Previous Song Count: 13297, New Song Count: 5139
Previous Song Count: 13100, New Song Count: 4133
Previous Song Count: 13100, New Song Count: 3435
Previous Song Count: 13100, New Song Count: 3198
Previous Song Count: 13100, New Song Count: 3314


### 3. Extracción de features de las canciones 'hit'

In [4]:
def get_song_uri(track_name, artist_name):
    
    try:
        track = sp.search(q='artist:' + artist_name+' OR track:' + track_name, type='track', limit=50)
        result = track['tracks']['items']
    except spotipy.exceptions.SpotifyException as e:
        print('Error:', e)
        return None

    result = [track for track in result if artist_name.lower() in [artist['name'].lower() for artist in track['artists']]]

    if result:
        uri = result[0]['uri']
        artist_name = result[0]['artists'][0]['name']
        return uri
    else:
        print('No results found')
        return None

def get_song_features(track_id):
    
    try:
        resp_track = sp.track(track_id)
        artist_fetched = resp_track['artists'][0]['name']
        track_fetched = resp_track['name']

        resp_audio_features = sp.audio_features(track_id)
        if resp_audio_features is None:
            raise ValueError('Failed to retrieve audio features')
        audio_features = [resp_audio_features[0][key] for key in ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']]

        resp_analysis = sp.audio_analysis(track_id)
        if len(resp_analysis['sections']) >= 3:
            chorus_hit = resp_analysis['sections'][2]['start']
        else:
            chorus_hit = None
        sections = len(resp_analysis['sections'])

        resp_artist = sp.artist(resp_track['artists'][0]['id'])
        popularity = resp_artist['popularity']

        data = [popularity] + audio_features + [chorus_hit, sections]
    
    except spotipy.exceptions.SpotifyException as e:
        print(f'Error in fetching {track_id}: {e}')
        data = None

    return data


In [11]:
#decades = ['1970','1980','1990','2000','2010']
decades = ['1970']

for decade in decades:

    filename1 = 'billboard' + decade + '_unique.csv'
    filename2 = 'billboard' + decade + '_features_database.csv'

    with open(filename1) as fp_1, open(filename2, 'a', newline="") as fp_2:
        reader = csv.reader(fp_1)
        writer = csv.writer(fp_2)

        header = ['track', 'artist', 'uri', 'popularity', 'danceability', 'energy', 'key',
                  'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
                  'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'chorus_hit', 'sections']
        writer.writerow(header)

        for _ in range(3064):
            next(reader)

        for row in reader:
            track = row[0]
            artist = row[1]

            print('Fetching', row, '....', end='')

            song_uri = get_song_uri(track, artist)

            if song_uri:
                song_features = get_song_features(song_uri)
                fetched_details = [track, artist, song_uri] + song_features
                writer.writerow(fetched_details)
                print('Fetched!')
                print(fetched_details)
            else:
                print(row, 'Not Found!')

    print(f'{decade} Done!')

Fetching ['You Got That Right', 'Lynyrd Skynyrd'] ....Fetched!
['You Got That Right', 'Lynyrd Skynyrd', 'spotify:track:2lLWfAW32oxUCv0iYQQwSX', 74, 0.537, 0.702, 0, -10.973, 1, 0.033, 0.0244, 0.0505, 0.21, 0.954, 146.208, 227293, 27.29937, 13]
Fetching ['Please Mr. Please', 'Olivia Newton-John'] ....Fetched!
['Please Mr. Please', 'Olivia Newton-John', 'spotify:track:1lL1jDnZTH60djVb6vKIQj', 69, 0.463, 0.453, 5, -13.204, 1, 0.0371, 0.371, 0, 0.0993, 0.561, 145.796, 202560, 40.04562, 10]
Fetching ['Please, Mr. President', 'Paula Webb'] ....No results found
['Please, Mr. President', 'Paula Webb'] Not Found!
Fetching ['Easy Rider (Let The Wind Pay The Way)', 'Iron Butterfly'] ....Fetched!
['Easy Rider (Let The Wind Pay The Way)', 'Iron Butterfly', 'spotify:track:6917QTtfHqiKowUDLt5G4A', 45, 0.412, 0.926, 5, -6.724, 1, 0.125, 0.000484, 0.000375, 0.193, 0.255, 133.738, 186173, 36.81406, 10]
Fetching ['Dreaming', 'Blondie'] ....Fetched!
['Dreaming', 'Blondie', 'spotify:track:2Rn7bVL1FVYboc4c5

In [15]:
#decades = ['1970','1980','1990','2000','2010']
decades = ['1980']

for decade in decades:

    filename1 = 'billboard' + decade + '_unique.csv'
    filename2 = 'billboard' + decade + '_features_database.csv'

    with open(filename1) as fp_1, open(filename2, 'a', newline="") as fp_2:
        reader = csv.reader(fp_1)
        writer = csv.writer(fp_2)

        header = ['track', 'artist', 'uri', 'popularity', 'danceability', 'energy', 'key',
                  'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
                  'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'chorus_hit', 'sections']
        writer.writerow(header)

        for _ in range(3064):
            next(reader)

        for row in reader:
            track = row[0]
            artist = row[1]

            print('Fetching', row, '....', end='')

            song_uri = get_song_uri(track, artist)

            if song_uri:
                song_features = get_song_features(song_uri)
                fetched_details = [track, artist, song_uri] + song_features
                writer.writerow(fetched_details)
                print('Fetched!')
                print(fetched_details)
            else:
                print(row, 'Not Found!')

    print(f'{decade} Done!')

Fetching ['Find Another Fool', 'Quarterflash'] ....Fetched!
['Find Another Fool', 'Quarterflash', 'spotify:track:1kWIbNb9gqmYBb9anvWkOA', 43, 0.522, 0.52, 7, -13.339, 1, 0.0545, 0.00964, 3.46e-06, 0.0432, 0.702, 77.757, 274933, 24.86253, 9]
Fetching ['Anyone Can See', 'Irene Cara'] ....Fetched!
['Anyone Can See', 'Irene Cara', 'spotify:track:71lQ6KbvNUEGFBNvPWzbcv', 64, 0.358, 0.542, 5, -7.504, 1, 0.0347, 0.512, 0.0117, 0.288, 0.357, 126.865, 223960, 21.12181, 15]
Fetching ['Stand', 'R.E.M.'] ....Fetched!
['Stand', 'R.E.M.', 'spotify:track:22UhQSbYimuCnvI0Y07gFX', 75, 0.653, 0.738, 4, -10.057, 1, 0.0285, 0.397, 0, 0.0931, 0.932, 109.322, 192960, 28.04756, 8]
Fetching ['Summergirls', 'Dino'] ....Fetched!
['Summergirls', 'Dino', 'spotify:track:5byYO8JjkxV27yQYo2pDRx', 22, 0.764, 0.58, 9, -15.2, 0, 0.093, 0.00219, 0.0551, 0.146, 0.81, 121.343, 376573, 36.53151, 15]
Fetching ['Do Me Baby', "Meli'sa Morgan"] ....No results found
['Do Me Baby', "Meli'sa Morgan"] Not Found!
Fetching ["Who Do 

In [16]:
#decades = ['1970','1980','1990','2000','2010']
decades = ['1990']

for decade in decades:

    filename1 = 'billboard' + decade + '_unique.csv'
    filename2 = 'billboard' + decade + '_features_database.csv'

    with open(filename1) as fp_1, open(filename2, 'a', newline="") as fp_2:
        reader = csv.reader(fp_1)
        writer = csv.writer(fp_2)

        header = ['track', 'artist', 'uri', 'popularity', 'danceability', 'energy', 'key',
                  'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
                  'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'chorus_hit', 'sections']
        writer.writerow(header)

        for _ in range(3064):
            next(reader)

        for row in reader:
            track = row[0]
            artist = row[1]

            print('Fetching', row, '....', end='')

            song_uri = get_song_uri(track, artist)

            if song_uri:
                song_features = get_song_features(song_uri)
                fetched_details = [track, artist, song_uri] + song_features
                writer.writerow(fetched_details)
                print('Fetched!')
                print(fetched_details)
            else:
                print(row, 'Not Found!')

    print(f'{decade} Done!')

Fetching ['Can\'t Help Falling In Love (From "Sliver")', 'UB40'] ....No results found
['Can\'t Help Falling In Love (From "Sliver")', 'UB40'] Not Found!
Fetching ['Whenever You Come Around', 'Vince Gill'] ....Fetched!
['Whenever You Come Around', 'Vince Gill', 'spotify:track:3PHLZ5wmtyZha1pp2405OT', 64, 0.719, 0.22, 11, -11.811, 1, 0.0294, 0.736, 9.4e-06, 0.118, 0.168, 115.375, 259160, 30.59945, 12]
Fetching ['All For Love', 'Bryan Adams/Rod Stewart/Sting'] ....No results found
['All For Love', 'Bryan Adams/Rod Stewart/Sting'] Not Found!
Fetching ['Humps For The Blvd.', 'Rodney O & Joe Cooley'] ....No results found
['Humps For The Blvd.', 'Rodney O & Joe Cooley'] Not Found!
Fetching ['What It Takes', 'Aerosmith'] ....Fetched!
['What It Takes', 'Aerosmith', 'spotify:track:2fAYTT9kcUm8tnUrhD80sC', 78, 0.522, 0.738, 0, -6.423, 1, 0.031, 0.0286, 5.8e-06, 0.4, 0.461, 142.702, 310533, 29.65034, 17]
Fetching ['My Up And Down', 'Adina Howard'] ....Fetched!
['My Up And Down', 'Adina Howard', 's

In [26]:
#decades = ['1970','1980','1990','2000','2010']
decades = ['2000']

for decade in decades:
    
    filename1 = 'billboard' + decade + '_unique.csv'
    filename2 = 'billboard' + decade + '_features_database.csv'

    with open(filename1) as fp_1, open(filename2, 'a', newline="") as fp_2:
        reader = csv.reader(fp_1)
        writer = csv.writer(fp_2)

        header = ['track', 'artist', 'uri', 'popularity', 'danceability', 'energy', 'key',
                  'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
                  'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'chorus_hit', 'sections']
        writer.writerow(header)

        for _ in range(3064):
            next(reader)

        for row in reader:
            track = row[0]
            artist = row[1]

            print('Fetching', row, '....', end='')

            try:
                song_uri = get_song_uri(track, artist)

                if song_uri:
                    song_features = get_song_features(song_uri)
                    fetched_details = [track, artist, song_uri] + song_features
                    writer.writerow(fetched_details)
                    print('Fetched!')
                    print(fetched_details)
                else:
                    print(row, 'Not Found!')
            except TypeError:
                print(row, 'Skipped!')

    print(f'{decade} Done!')

Fetching ["No One's Gonna Change You", 'Reina'] ....Fetched!
["No One's Gonna Change You", 'Reina', 'spotify:track:6VZciflHXKBKMxPXnGEmag', 20, 0.66, 0.767, 11, -6.622, 1, 0.0339, 0.0185, 0.00101, 0.12, 0.594, 132.984, 255200, 19.74987, 10]
Fetching ['Lose My Breath', "Destiny's Child"] ....Fetched!
['Lose My Breath', "Destiny's Child", 'spotify:track:4dvQg9sD8k9y4qiEURuj8v', 73, 0.814, 0.899, 1, -5.958, 1, 0.0637, 0.00727, 0.219, 0.0979, 0.545, 119.011, 242013, 35.78935, 11]
Fetching ['Move Shake Drop', 'DJ Laz Featuring Flo Rida & Casely'] ....No results found
['Move Shake Drop', 'DJ Laz Featuring Flo Rida & Casely'] Not Found!
Fetching ['Redneck Woman', 'Gretchen Wilson'] ....Fetched!
['Redneck Woman', 'Gretchen Wilson', 'spotify:track:26bL4gSULWDgdIMX0pRFrG', 57, 0.499, 0.825, 6, -5.146, 1, 0.177, 0.13, 0, 0.306, 0.753, 185.069, 221333, 43.91956, 8]
Fetching ['I Miss You', 'DMX Featuring Faith Evans'] ....No results found
['I Miss You', 'DMX Featuring Faith Evans'] Not Found!
Fetch

In [22]:
#decades = ['1970','1980','1990','2000','2010']
decades = ['2010']

for decade in decades:

    filename1 = 'billboard' + decade + '_unique.csv'
    filename2 = 'billboard' + decade + '_features_database.csv'

    with open(filename1) as fp_1, open(filename2, 'a', newline="") as fp_2:
        reader = csv.reader(fp_1)
        writer = csv.writer(fp_2)

        header = ['track', 'artist', 'uri', 'popularity', 'danceability', 'energy', 'key',
                  'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
                  'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'chorus_hit', 'sections']
        writer.writerow(header)

        for _ in range(3064):
            next(reader)

        for row in reader:
            track = row[0]
            artist = row[1]

            print('Fetching', row, '....', end='')

            song_uri = get_song_uri(track, artist)

            if song_uri:
                song_features = get_song_features(song_uri)
                fetched_details = [track, artist, song_uri] + song_features
                writer.writerow(fetched_details)
                print('Fetched!')
                print(fetched_details)
            else:
                print(row, 'Not Found!')

    print(f'{decade} Done!')

Fetching ['Silly Love Songs', 'Glee Cast'] ....Fetched!
['Silly Love Songs', 'Glee Cast', 'spotify:track:3aahH1sW41ubo1PZeQCviE', 72, 0.643, 0.585, 0, -6.724, 1, 0.0289, 0.584, 4.81e-06, 0.278, 0.563, 124.935, 230160, 46.78761, 7]
Fetching ['All I Want For Christmas Is You', 'Michael Buble'] ....No results found
['All I Want For Christmas Is You', 'Michael Buble'] Not Found!
Fetching ['In Your Arms', 'Nico & Vinz'] ....Fetched!
['In Your Arms', 'Nico & Vinz', 'spotify:track:0LH5xRQz5D36FpIkYUFv2e', 63, 0.431, 0.81, 7, -6.051, 1, 0.0958, 0.299, 0, 0.249, 0.674, 110.118, 205800, 35.29956, 11]
Fetching ['SOS', 'Avicii Featuring Aloe Blacc'] ....No results found
['SOS', 'Avicii Featuring Aloe Blacc'] Not Found!
Fetching ['One Man Can Change The World', 'Big Sean Featuring Kanye West & John Legend'] ....No results found
['One Man Can Change The World', 'Big Sean Featuring Kanye West & John Legend'] Not Found!
Fetching ['Locked Away', 'R. City Featuring Adam Levine'] ....No results found
['L

### 4. Establecer un criterio para las canciones "flop"

   - La canción no debe aparecer en la lista 'hit' de la década.
   - El artista de la canción no debe aparecer en la lista de 'hits' de la década.
   - La canción debe pertenecer a un género que podría considerarse no convencional.
   - El género de la pista no debe tener una canción en la lista de 'hits'.
   - La canción debe tener 'US' como uno de sus mercados.

### 5. Identificar los "flops" por década y crear playlists con esas canciones para cada década

In [3]:
playlists = [ 
'spotify:playlist:2XaeKTfGkXbl7azxOGFT8B',
'spotify:playlist:4hTSi1yGbvUzPanMIUvXAi',
'spotify:playlist:1DmzPOzqsNjQQJEPSc9nHN',
'spotify:playlist:4j53v3ELuxx4wMx5AgrP9Q',
'spotify:playlist:0kvsvGWgtyxacQZ7KA23Kj',
'spotify:playlist:5QusseFA3kBausg1ty9N2o',
]

### 6. Extraer información acerca de las canciones utilizando esas playlists

In [7]:
unique_bucket = set()

def make_unique_bucket():
	
	global unique_bucket

	decades = ['1970','1980','1990','2000','2010']

	for decade in decades:

		file = f'flop_songs{decade}.csv'

		with open(file, 'r', encoding='latin-1') as fp:

			reader = csv.reader(fp)

			for row in reader:

				song = (row[0], row[1], row[2])

				unique_bucket.add(song)

def write_to_file(track, artist, uri, release_date, release_date_prec):

	global unique_bucket

	song = (track, artist, uri)

	if song in unique_bucket:
		print('Song already fetched in', release_date)
		return

	else:
		unique_bucket.add(song)	

	if release_date_prec == 'day':
		release_date_obj = datetime.strptime(release_date, '%Y-%m-%d')

	elif release_date_prec == 'month':
		release_date_obj = datetime.strptime(release_date, '%Y-%m')

	elif release_date_prec == 'year':
		release_date_obj = datetime.strptime(release_date,'%Y')

	elif release_date_obj >= datetime(1970,1,1) and release_date_obj <= datetime(1979,12,31):
		decade = '1970'	

	elif release_date_obj >= datetime(1980,1,1) and release_date_obj <= datetime(1989,12,31):	
		decade = '1980'
	
	elif release_date_obj >= datetime(1990,1,1) and release_date_obj <= datetime(1999,12,31):
		decade = '1990'	

	elif release_date_obj >= datetime(2000,1,1) and release_date_obj <= datetime(2009,12,31):
		decade = '2000'	

	elif release_date_obj >= datetime(2010,1,1) and release_date_obj <= datetime(2019,12,31):
		decade = '2010'	

	else:
		problem_track(track, artist, uri)
		return

	file = f'flop_songs{decade}.csv'

	with open(file,'a', encoding='latin-1', newline='') as fp:

		writer = csv.writer(fp)
		writer.writerow([track, artist, uri])

		print(f'{track} by {artist} is written to {decade}')

def fetch_playlist(ID, off):

	return sp.user_playlist_tracks(user=config.user_name, playlist_id=ID, offset=off*100)

def problem_track(track, artist, uri):

	print(f'Problem {track}, {artist}, {uri}')

	with open('failed_songs_problem.csv', 'a', encoding='utf-8', newline='') as fp:

		writer = csv.writer(fp)
		writer.writerow([track, artist, uri])

In [10]:
make_unique_bucket()

for ID in playlists:

	for offset in range(80):
		playlist = fetch_playlist(ID, offset)

		items = playlist['items']

		for song in items:

			artist = song['track']['artists'][0]['name']
			track = song['track']['name']
			uri = song['track']['uri']
			release_date = song['track']['album']['release_date']
			release_date_prec = song['track']['album']['release_date_precision']

			print(release_date, release_date_prec)

			try:
				write_to_file(track, artist, uri, release_date, release_date_prec)

			except:		
				problem_track(track, artist, uri)

	print(f'{ID} is done!')

2016-09-30 day
Problem So What, The Mowgli's, spotify:track:0wAbE8PmaALSdGEpfOuk6J, 2016-09-30, day
2018-02-02 day
Problem hold on, flor, spotify:track:6RtRxaZN5RMWFbeZW24oRx, 2018-02-02, day
2018-02-20 day
Problem Colour Morning, Night Riots, spotify:track:4f6IOgyibWMOFgnuB4RKBb, 2018-02-20, day
2017-04-12 day
Problem Best Friends, grandson, spotify:track:1CmhkiXegdbj3Ewizvdg2Q, 2017-04-12, day
2017-06-02 day
Problem Name, Olen, spotify:track:6bTItgrgyjcB0h2u4974Nk, 2017-06-02, day
2017-10-06 day
Problem This Was a Home Once, Bad Suns, spotify:track:7LqYqe7uhDc1pw9urRoGq6, 2017-10-06, day
2015-12-04 day
Problem Catastrophic, Olen, spotify:track:0xtWbdkzNTcC6jjVcQFwhA, 2015-12-04, day
2017-03-16 day
Problem Coldhearted, Bryce Fox, spotify:track:3uQPPDnIaWxUH0W8xefsAw, 2017-03-16, day
2018-02-02 day
Problem heart, flor, spotify:track:6l9AVZVrvHw5u4nL3HL9N0, 2018-02-02, day
2015-04-14 day
Problem Pink Lemonade, The Wombats, spotify:track:7dFERLugNJZUgtX1V3KA4b, 2015-04-14, day
2016-09-16

### 7. Limpiar canciones con problemas

In [28]:
# Eliminar canciones flop que accidentalmente estén en hits

for decade in decades:

	billboard = 'billboard' + decade + '_features_database.csv'
	failed = 'failed' + decade + '_unique.csv'

	billboard_set = set()
	failed_set = set()

	count = 0
	count2 = 0

	with open(billboard, encoding='latin-1') as fp1, open(failed, encoding='latin-1') as fp2:

		reader1 = csv.reader(fp1)
		reader2 = csv.reader(fp2)

		for row in reader1:

			song = (row[0], row[1], row[2])
			billboard_set.add(song)

		for row in reader2:
			
			song = (row[0], row[1], row[2])

			if song in billboard_set:

				count += 1

			else:
			
				failed_set.add(song)	

				count2 += 1

	failed_unique = 'failed_final' + decade + '_unique.csv'

	with open(failed_unique,'w+', encoding='utf-8', newline='') as fp:

		writer = csv.writer(fp)

		for song in failed_set:

			song = list(song)
			writer.writerow(song)

	print(decade, len(billboard_set), len(failed_set), count)

1970 1468 4128 7
1980 817 3649 13
1990 261 3314 0
2000 93 3417 1
2010 161 6978 0


In [30]:
# Mismo número de canciones

for decade in decades:

	billboard = 'billboard' + decade + '_features_database.csv'
	failed = 'failed_final' + decade + '_unique.csv'

	billboard_set = set()
	failed_set = set()

	count = 0

	with open(billboard, encoding='latin-1') as fp1, open(failed, encoding='latin-1') as fp2:

		reader1 = csv.reader(fp1)
		reader2 = csv.reader(fp2)

		for row in reader1:

			song = (row[0], row[1], row[2])
			billboard_set.add(song)

		for row in reader2:
			
			song = (row[0], row[1], row[2])
			failed_set.add(song)	

	bill_count = len(billboard_set)	

	failed_equal = 'failed_equal' + decade + '_unique.csv'

	with open(failed_equal,'w+', encoding='utf-8', newline='') as fp:

		writer = csv.writer(fp)

		for song in failed_set:

			song = list(song)
			writer.writerow(song)

			count += 1

			if count == bill_count:
				break	
			
print("Done!")

Done!


### 8. Extracción de features de canciones 'flop'

In [None]:
#decades = ['1970','1980','1990','2000','2010']
decades = ['1970']

for decade in decades:

    filename1 = 'failed_equal' + decade + '_unique.csv'
    filename2 = 'failed' + decade + '_features_database.csv'

    with open(filename1) as fp_1, open(filename2, 'a', newline="") as fp_2:
        reader = csv.reader(fp_1)
        writer = csv.writer(fp_2)

        header = ['track', 'artist', 'uri', 'popularity', 'danceability', 'energy', 'key',
                  'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
                  'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'chorus_hit', 'sections']
        writer.writerow(header)

        for _ in range(3064):
            next(reader)

        for row in reader:
            track = row[0]
            artist = row[1]
            song_uri = row[2]

            print('Fetching', row, '....', end='')

            if song_uri:
                song_features = get_song_features(song_uri)
                fetched_details = [track, artist, song_uri] + song_features
                writer.writerow(fetched_details)
                print('Fetched!')
                print(fetched_details)
            else:
                print(row, 'Not Found!')

    print(f'{decade} Done!')

In [None]:
#decades = ['1970','1980','1990','2000','2010']
decades = ['1980']

for decade in decades:

    filename1 = 'failed_equal' + decade + '_unique.csv'
    filename2 = 'failed' + decade + '_features_database.csv'

    with open(filename1) as fp_1, open(filename2, 'a', newline="") as fp_2:
        reader = csv.reader(fp_1)
        writer = csv.writer(fp_2)

        header = ['track', 'artist', 'uri', 'popularity', 'danceability', 'energy', 'key',
                  'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
                  'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'chorus_hit', 'sections']
        writer.writerow(header)

        for _ in range(3064):
            next(reader)

        for row in reader:
            track = row[0]
            artist = row[1]

            print('Fetching', row, '....', end='')

            song_uri = get_song_uri(track, artist)

            if song_uri:
                song_features = get_song_features(song_uri)
                fetched_details = [track, artist, song_uri] + song_features
                writer.writerow(fetched_details)
                print('Fetched!')
                print(fetched_details)
            else:
                print(row, 'Not Found!')

    print(f'{decade} Done!')

In [None]:
#decades = ['1970','1980','1990','2000','2010']
decades = ['1990']

for decade in decades:

    filename1 = 'failed_equal' + decade + '_unique.csv'
    filename2 = 'failed' + decade + '_features_database.csv'

    with open(filename1) as fp_1, open(filename2, 'a', newline="") as fp_2:
        reader = csv.reader(fp_1)
        writer = csv.writer(fp_2)

        header = ['track', 'artist', 'uri', 'popularity', 'danceability', 'energy', 'key',
                  'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
                  'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'chorus_hit', 'sections']
        writer.writerow(header)

        for _ in range(3064):
            next(reader)

        for row in reader:
            track = row[0]
            artist = row[1]

            print('Fetching', row, '....', end='')

            song_uri = get_song_uri(track, artist)

            if song_uri:
                song_features = get_song_features(song_uri)
                fetched_details = [track, artist, song_uri] + song_features
                writer.writerow(fetched_details)
                print('Fetched!')
                print(fetched_details)
            else:
                print(row, 'Not Found!')

    print(f'{decade} Done!')

In [None]:
#decades = ['1970','1980','1990','2000','2010']
decades = ['2000']

for decade in decades:
    
    filename1 = 'failed_equal' + decade + '_unique.csv'
    filename2 = 'failed' + decade + '_features_database.csv'

    with open(filename1) as fp_1, open(filename2, 'a', newline="") as fp_2:
        reader = csv.reader(fp_1)
        writer = csv.writer(fp_2)

        header = ['track', 'artist', 'uri', 'popularity', 'danceability', 'energy', 'key',
                  'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
                  'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'chorus_hit', 'sections']
        writer.writerow(header)

        for _ in range(3064):
            next(reader)

        for row in reader:
            track = row[0]
            artist = row[1]

            print('Fetching', row, '....', end='')

            try:
                song_uri = get_song_uri(track, artist)

                if song_uri:
                    song_features = get_song_features(song_uri)
                    fetched_details = [track, artist, song_uri] + song_features
                    writer.writerow(fetched_details)
                    print('Fetched!')
                    print(fetched_details)
                else:
                    print(row, 'Not Found!')
            except TypeError:
                print(row, 'Skipped!')

    print(f'{decade} Done!')

In [None]:
#decades = ['1970','1980','1990','2000','2010']
decades = ['2010']

for decade in decades:

    filename1 = 'failed_equal' + decade + '_unique.csv'
    filename2 = 'failed' + decade + '_features_database.csv'

    with open(filename1) as fp_1, open(filename2, 'a', newline="") as fp_2:
        reader = csv.reader(fp_1)
        writer = csv.writer(fp_2)

        header = ['track', 'artist', 'uri', 'popularity', 'danceability', 'energy', 'key',
                  'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
                  'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'chorus_hit', 'sections']
        writer.writerow(header)

        for _ in range(3064):
            next(reader)

        for row in reader:
            track = row[0]
            artist = row[1]

            print('Fetching', row, '....', end='')

            song_uri = get_song_uri(track, artist)

            if song_uri:
                song_features = get_song_features(song_uri)
                fetched_details = [track, artist, song_uri] + song_features
                writer.writerow(fetched_details)
                print('Fetched!')
                print(fetched_details)
            else:
                print(row, 'Not Found!')

    print(f'{decade} Done!')

### 9. Creación del dataset

In [46]:
decades = ['1970','1980','1990','2000','2010']

for decade in decades:

    billboard_df = pd.read_csv(f'billboard{decade}_features_database.csv')
    failed_df = pd.read_csv(f'failed{decade}_features_database.csv')

    billboard_df['target'] = 1
    failed_df['target'] = 0

    combined_df = pd.concat([billboard_df, failed_df])
    combined_df.to_csv(f'dataset-of-{decade}s.csv', index=False)


## Dataset final

In [47]:
datasets = [pd.read_csv("dataset-of-{}s.csv".format(decade)) for decade in ['1970', '1980', '1990', '2000', '2010']]

for i, decade in enumerate([1970, 1980, 1990, 2000, 2010]):
    datasets[i]['decade'] = pd.Series(decade, index=datasets[i].index)

df = pd.concat(datasets, axis=0).sample(frac=1.0, random_state=1).reset_index(drop=True)
df

Unnamed: 0,track,artist,uri,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,chorus_hit,sections,target,decade
0,Everyone's Heart Gets Broken,Ebony,spotify:track:72dyzh7rRthEYDhEfJuJUe,0.577,0.457,9,-9.716,1,0.0274,0.573000,0.000091,0.1900,0.7390,125.222,197533,3,29.15381,9,0,1970
1,Adios a Jamaica,Los Gibson Boys,spotify:track:2chBwIMZqVq7yDkco7bOO3,0.554,0.667,4,-6.502,1,0.0471,0.716000,0.000000,0.0651,0.9170,148.870,143667,4,31.63368,7,0,1990
2,So Sexy,Twista Featuring R. Kelly,spotify:track:4mZpHYUrOvvmXCoyLLF7s7,0.868,0.805,11,-3.218,0,0.1780,0.126000,0.000000,0.0960,0.5440,143.983,231200,4,27.22702,9,1,2000
3,It Ain't Enough,Corey Hart,spotify:track:3q9LRpghuXumIQLna5MZjq,0.747,0.406,2,-12.134,1,0.0302,0.499000,0.000122,0.0670,0.6020,112.347,210027,4,31.79598,10,1,1980
4,BTSTU (Edit),Jai Paul,spotify:track:2NRRrr8ylDK38KD3Ffbw4K,0.665,0.389,1,-6.709,1,0.4410,0.745000,0.000982,0.3260,0.5210,89.962,210001,4,27.11471,10,0,2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32459,The Devil In Miss Jones,Mike Ness,spotify:track:59oD2JbcFRLcnhgQvQL1Kx,0.539,0.937,10,-5.740,0,0.0448,0.000236,0.057400,0.0915,0.4920,114.860,230693,4,66.63641,8,0,1990
32460,You're Moving Out Today,Carole Bayer Sager,spotify:track:6nSt8n7r0bznM8PCvHEmPj,0.638,0.557,0,-11.340,1,0.1110,0.729000,0.000010,0.0986,0.9150,112.893,215053,4,36.99424,9,1,1970
32461,The Batman Theme,Danny Elfman,spotify:track:50csT5Qb2qOF7lHdDQ1Sbx,0.201,0.475,11,-10.097,0,0.0417,0.659000,0.916000,0.0748,0.0398,73.563,158333,3,36.05006,7,0,1980
32462,No One Receiving - 2004 Digital Remaster,Brian Eno,spotify:track:4AHCh0JCCzPYsvRCUFdru3,0.641,0.789,9,-12.803,1,0.1540,0.127000,0.000218,0.1390,0.5480,107.672,232787,4,37.48169,10,0,1970


In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32464 entries, 0 to 32463
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track             32464 non-null  object 
 1   artist            32464 non-null  object 
 2   uri               32464 non-null  object 
 3   danceability      32464 non-null  float64
 4   energy            32464 non-null  float64
 5   key               32464 non-null  int64  
 6   loudness          32464 non-null  float64
 7   mode              32464 non-null  int64  
 8   speechiness       32464 non-null  float64
 9   acousticness      32464 non-null  float64
 10  instrumentalness  32464 non-null  float64
 11  liveness          32464 non-null  float64
 12  valence           32464 non-null  float64
 13  tempo             32464 non-null  float64
 14  duration_ms       32464 non-null  int64  
 15  time_signature    32464 non-null  int64  
 16  chorus_hit        32464 non-null  float6