# Creating a Spotify Database

We will create a database of Spotify songs, artists, and genres using SQLite and Python. The data comes from a [Kaggle data set](https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks?select=tracks.csv) containing about 600k tracks and 1 million artists from Spotify, covering 1921 to 2020. As of mid-2022, Spotify had over 82 million tracks [1], so this is nowhere near all the music on Spotify. However, it is enough to include a lot of great music from a wide range of time periods. The data comes in two .csv files, one for tracks and one for artists. We will import this data into a SQLite database and normalize the database so that all the data can be joined.

First, let's import the necessary packages, open a connection to a SQLite database and create a cursor, and set display options for pandas.

In [1]:
import sqlite3
import re
import csv
import pandas as pd
import numpy as np
from funcs import execute_print, split_explode

conn = sqlite3.connect('spotify.db')
c = conn.cursor()

#Set max rows and limit floats to 2 decimal places in pandas
pd.options.display.max_rows = 10
pd.set_option('display.float_format', lambda x: '%.2f' % x)

## I. Create and Populate artists and tracks Tables

First, we need to create tables for artists and tracks in the database to hold the data from our .csv files. Let's create an artists table. This table will have an artist ID as a primary key as well as columns for follower count, genres, artist name, and artist popularity (on a scale from 0 to 100).

In [2]:
create_artists = """
    CREATE TABLE IF NOT EXISTS artists (
        id TEXT PRIMARY KEY NOT NULL,
        followers INTEGER,
        genres TEXT,
        name TEXT,
        popularity INTEGER);
"""
c.execute(create_artists)
conn.commit()

Now let's read data from the artists.csv file and insert it into the artists table.

In [3]:
#Read data from artists.csv
with open('./data/artists.csv', 'r', encoding='utf8') as artists:
    dr = csv.DictReader(artists)
    to_db = [(artist['id'], artist['followers'], artist['genres'],
              artist['name'], artist['popularity']) for artist in dr]

#Insert data from artists.csv into artists table
insert_artists = """
    INSERT INTO artists (
        id,
        followers,
        genres,
        name,
        popularity)
    VALUES (?, ?, ?, ?, ?);
"""
c.executemany(insert_artists, to_db)
conn.commit()
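
One detail worth noting: csv.DictReader yields every field as a string, so the conversion to integers happens inside SQLite through column type affinity. The sketch below illustrates this on a throwaway in-memory table. The table name `demo_artists` and the sample values are made up for illustration and are not part of the data set.

```python
import sqlite3

# Throwaway in-memory database; nothing here touches spotify.db
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE demo_artists (id TEXT PRIMARY KEY NOT NULL, followers INTEGER)"
)

# All values are strings, as they would be coming out of csv.DictReader
demo.executemany(
    "INSERT INTO demo_artists (id, followers) VALUES (?, ?)",
    [("a", "71"), ("b", "5.0"), ("c", "")],
)

# typeof() reports the storage class SQLite actually used for each value
rows = demo.execute(
    "SELECT id, followers, typeof(followers) FROM demo_artists ORDER BY id"
).fetchall()
for row in rows:
    print(row)
demo.close()
```

Strings that look like whole numbers should be stored as integers, while an empty string stays as text. A missing value in the .csv file is therefore stored as '' rather than NULL unless it is converted before insertion.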

Next, let's create a tracks table. This table will have a track ID as a primary key as well as columns for track name, track popularity (on a scale from 0 to 100), artist names and artist IDs, release date, and various audio features of each track.

In [4]:
create_tracks = """
    CREATE TABLE IF NOT EXISTS tracks (
        id TEXT PRIMARY KEY NOT NULL,
        name TEXT,
        popularity INTEGER,
        duration_ms INTEGER,
        explicit INTEGER,
        artists TEXT,
        id_artists TEXT,
        release_date TEXT,
        danceability REAL,
        energy REAL,
        key INTEGER,
        loudness REAL,
        mode INTEGER,
        speechiness REAL,
        acousticness REAL,
        instrumentalness REAL,
        liveness REAL,
        valence REAL,
        tempo REAL,
        time_signature INTEGER);
"""
c.execute(create_tracks)
conn.commit()

Now let's read data from the tracks.csv file and insert this into the tracks table.

In [5]:
#Read data from tracks.csv
with open('./data/tracks.csv', 'r', encoding='utf8') as tracks:
    dr = csv.DictReader(tracks)
    to_db = [(track['id'], track['name'], track['popularity'], track['duration_ms'],
              track['explicit'], track['artists'], track['id_artists'], track['release_date'],
              track['danceability'], track['energy'], track['key'], track['loudness'],
              track['mode'], track['speechiness'], track['acousticness'],
              track['instrumentalness'], track['liveness'], track['valence'],
              track['tempo'], track['time_signature']) for track in dr]

#Insert data from tracks.csv into tracks table
insert_tracks = """
    INSERT INTO tracks (
        id,
        name,
        popularity,
        duration_ms,
        explicit,
        artists,
        id_artists,
        release_date,
        danceability,
        energy,
        key,
        loudness,
        mode,
        speechiness,
        acousticness,
        instrumentalness,
        liveness,
        valence,
        tempo,
        time_signature)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);
"""
c.executemany(insert_tracks, to_db)
conn.commit()

Let's look at each table. It will be easier to read the tables as pandas DataFrames. First, the artists table...

In [6]:
select_artists = """
    SELECT *
    FROM artists
    LIMIT 50;
"""
artists = pd.read_sql_query(select_artists, conn)
artists

Unnamed: 0,id,followers,genres,name,popularity
0,0DheY5irMjBUeLybbCUEZ2,0,[],Armid & Amir Zare Pashai feat. Sara Rouzbehani,0
1,0DlhY15l3wsrnlfGio2bjU,5,[],ปูนา ภาวิณี,0
2,0DmRESX2JknGPQyO15yxg7,0,[],Sadaa,0
3,0DmhnbHjm1qw6NCYPeZNgJ,0,[],Tra'gruda,0
4,0Dn11fWM7vHQ3rinvWEl4E,2,[],Ioannis Panoutsopoulos,0
...,...,...,...,...,...
45,0VLMVnVbJyJ4oyZs2L3Yl2,71,['carnaval cadiz'],Las Viudas De Los Bisabuelos,6
46,0dt23bs4w8zx154C5xdVyl,63,['carnaval cadiz'],Los De Capuchinos,5
47,0pGhoB99qpEJEsBQxgaskQ,64,['carnaval cadiz'],Los “Pofesionales”,7
48,3HDrX2OtSuXLW5dLR85uN3,53,['carnaval cadiz'],Los Que No Paran De Rajar,6


...and the tracks table.

In [7]:
select_tracks = """
    SELECT *
    FROM tracks
    LIMIT 50;
"""
tracks = pd.read_sql_query(select_tracks, conn)
tracks

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,['Uli'],['45tIt06XoI0Iio4LBEVpls'],1922-02-22,0.65,0.45,0,-13.34,1,0.45,0.67,0.74,0.15,0.13,104.85,3
1,021ht4sdgPcrDgSk7JTbKY,Capítulo 2.16 - Banquero Anarquista,0,98200,0,['Fernando Pessoa'],['14jtPCOoNZwquk5wd9DxrY'],1922-06-01,0.69,0.26,0,-22.14,1,0.96,0.80,0.00,0.15,0.66,102.01,1
2,07A5yehtSnoedViJAZkNnc,Vivo para Quererte - Remasterizado,0,181640,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.43,0.18,1,-21.18,1,0.05,0.99,0.02,0.21,0.46,130.42,5
3,08FmqUhxtyLTn6pAh6bk45,El Prisionero - Remasterizado,0,176907,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.32,0.09,7,-27.96,1,0.05,0.99,0.92,0.10,0.40,169.98,3
4,08y9GfoqCWfOGsKdwojr5e,Lady of the Evening,0,163080,0,['Dick Haymes'],['3BiJGZsyX9sJchTqcSA7Su'],1922,0.40,0.16,3,-16.90,0,0.04,0.99,0.13,0.31,0.20,103.22,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45,1kXWSsJkBVZ1jSoI8NnEDm,Marta,0,177693,0,['Dick Haymes'],['3BiJGZsyX9sJchTqcSA7Su'],1922,0.26,0.34,0,-9.37,1,0.03,0.99,0.02,0.33,0.31,101.10,4
46,1l1Wk0nOkuMCzioN6l2yfJ,Carol of the Bells,0,286370,0,['Grandcubby Trio'],['4XVZpokXbUzg6QeomBANY9'],1922,0.38,0.75,10,-11.60,1,0.07,0.12,0.75,0.72,0.35,149.79,3
47,1lia44teZBfbv0PnPkc5dK,Machinalement,0,145400,0,['Victor Boucher'],['7vVR02JJYvsEAEPNHQMx0Q'],1922,0.46,0.19,4,-16.82,1,0.07,1.00,0.35,0.08,0.67,177.10,4
48,1pGBOfY0PvpArBZT7GaUVK,Capítulo 2.19 - Banquero Anarquista,0,106000,0,['Fernando Pessoa'],['14jtPCOoNZwquk5wd9DxrY'],1922-06-01,0.73,0.21,10,-22.14,1,0.96,0.77,0.00,0.56,0.73,110.86,5


## II. Normalize Database

We now have a database that looks like this:

![Spotify Database](./images/diagram1.png)

We have a few issues with this from a database design perspective:

* Artists and genres have a many-to-many relationship (i.e., an artist can have many genres, and a genre can have many artists). Thus, the genres column in the artists table contains multiple genres for some artists.
* Similarly, artists and tracks have a many-to-many relationship (i.e., an artist can have many tracks, and a track can have many artists). Thus, the id_artists and artists columns in the tracks table contain multiple IDs and artists for some tracks.

As these tables are right now, we cannot perform joins on genres or artist IDs. To resolve this, we need to normalize the database. We need to create an intermediate table mapping tracks to artists, with each unique pair of artist and track on each row. Similarly, we need to create an intermediate table mapping genres to artists, with each unique pair of genre and artist on each row.

### 1. Create tracks_artists Table

Let's create a tracks_artists table. First, we query the tracks table for the track IDs and artist IDs. We also want to clean some of the extra characters (e.g., brackets, quotes, spaces) in the artist IDs. We then assign this to a pandas DataFrame.

In [8]:
select_tracks_artists = """
    SELECT
        id as track_id,
        REPLACE(REPLACE(REPLACE(REPLACE(id_artists, '[', ''), ']', ''), ' ', ''), '''', '') as artist_id
    FROM tracks;
"""
tracks_artists = pd.read_sql_query(select_tracks_artists, conn)
tracks_artists

Unnamed: 0,track_id,artist_id
0,35iwgR4jXetI318WEWsa1Q,45tIt06XoI0Iio4LBEVpls
1,021ht4sdgPcrDgSk7JTbKY,14jtPCOoNZwquk5wd9DxrY
2,07A5yehtSnoedViJAZkNnc,5LiOoJbxVSAMkBS2fUm3X2
3,08FmqUhxtyLTn6pAh6bk45,5LiOoJbxVSAMkBS2fUm3X2
4,08y9GfoqCWfOGsKdwojr5e,3BiJGZsyX9sJchTqcSA7Su
...,...,...
586667,5rgu12WBIHQtvej2MdHSH0,1QLBXKM5GCpyQQSVMNZqrZ
586668,0NuWgxEp51CutD2pJoF4OM,1dy5WNgIKQU6ezkpZs4y8z
586669,27Y1N4Q4U3EfDU5Ubw8ws2,37M5pPGs6V1fchFJSgCguX
586670,45XJsGpFTyzbzeWK8VzR8S,"4jGPdu95icCKVF31CcFKbS,5ebPSE9YI5aLeZ1Z2gkqjn"


Now we have a DataFrame with a column for track ID and a column with corresponding artist IDs separated by commas. Let's split these artist IDs at each comma and break each artist ID into its own row using the pandas explode method. For this, we will use a custom split_explode function imported from a separate module.
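
The funcs module is not shown in this notebook, so the following is only a minimal sketch of what a helper like split_explode might do, assuming it splits a string column and then calls explode. The function name `split_explode_sketch` and the sample rows are hypothetical.

```python
import pandas as pd

def split_explode_sketch(df, column, sep):
    """Hypothetical stand-in for split_explode: split the strings in
    `column` at `sep`, then give each piece its own row."""
    out = df.copy()
    out[column] = out[column].str.split(sep)
    # explode repeats the other columns for every element of each list
    return out.explode(column).reset_index(drop=True)

# Made-up sample: the second track has two artists
demo = pd.DataFrame({
    "track_id": ["t1", "t2"],
    "artist_id": ["a1", "a2,a3"],
})
exploded = split_explode_sketch(demo, "artist_id", ",")
print(exploded)
```

Resetting the index afterwards gives every exploded row its own index value, which matches the continuous numbering seen in the output below.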

In [9]:
tracks_artists = split_explode(tracks_artists, 'artist_id', ',')
tracks_artists

Unnamed: 0,track_id,artist_id
0,35iwgR4jXetI318WEWsa1Q,45tIt06XoI0Iio4LBEVpls
1,021ht4sdgPcrDgSk7JTbKY,14jtPCOoNZwquk5wd9DxrY
2,07A5yehtSnoedViJAZkNnc,5LiOoJbxVSAMkBS2fUm3X2
3,08FmqUhxtyLTn6pAh6bk45,5LiOoJbxVSAMkBS2fUm3X2
4,08y9GfoqCWfOGsKdwojr5e,3BiJGZsyX9sJchTqcSA7Su
...,...,...
757165,0NuWgxEp51CutD2pJoF4OM,1dy5WNgIKQU6ezkpZs4y8z
757166,27Y1N4Q4U3EfDU5Ubw8ws2,37M5pPGs6V1fchFJSgCguX
757167,45XJsGpFTyzbzeWK8VzR8S,4jGPdu95icCKVF31CcFKbS
757168,45XJsGpFTyzbzeWK8VzR8S,5ebPSE9YI5aLeZ1Z2gkqjn


Notice that some track IDs and some artist IDs are repeated. This is okay and expected as long as each pair of track ID and artist ID is unique.

Now we can write this DataFrame to the Spotify database as a tracks_artists table.

In [10]:
tracks_artists.to_sql('tracks_artists', conn, if_exists='replace', index=False)

757170

We would like to specify track_id and artist_id as foreign keys, with the combination of the two being a primary key. We would also like these to be not null. Unfortunately, the pandas to_sql method doesn't support specifying keys, and SQLite does not support modifying the table in this way once it is created. However, we can get around this by creating a new table with the desired constraints, populating that new table with data from the old table, and then dropping the old table.

In [11]:
new_tracks_artists = """
    PRAGMA foreign_keys = off;

    BEGIN TRANSACTION;
                
    --Create new table with desired constraints
    CREATE TABLE new_tracks_artists (
        track_id TEXT NOT NULL,
        artist_id TEXT NOT NULL,
        PRIMARY KEY (track_id, artist_id),
        FOREIGN KEY (track_id)
            REFERENCES tracks (id),
        FOREIGN KEY (artist_id)
            REFERENCES artists (id));
                
    --Insert data from old table into new table
    INSERT INTO new_tracks_artists SELECT * FROM tracks_artists;

    --Drop old table and rename new table
    DROP TABLE tracks_artists;
    ALTER TABLE new_tracks_artists RENAME TO tracks_artists;
                
    COMMIT TRANSACTION;

    PRAGMA foreign_keys = on;
"""
c.executescript(new_tracks_artists)
conn.commit()
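
Because foreign key enforcement is switched off while the rows are copied, nothing has yet checked that every track_id and artist_id actually exists in the parent tables. One way to audit this afterwards is SQLite's PRAGMA foreign_key_check. The sketch below runs it on a small in-memory copy of the schema with made-up rows, including one deliberately orphaned artist ID.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    PRAGMA foreign_keys = off;

    CREATE TABLE artists (id TEXT PRIMARY KEY NOT NULL);
    CREATE TABLE tracks (id TEXT PRIMARY KEY NOT NULL);
    CREATE TABLE tracks_artists (
        track_id TEXT NOT NULL,
        artist_id TEXT NOT NULL,
        PRIMARY KEY (track_id, artist_id),
        FOREIGN KEY (track_id)
            REFERENCES tracks (id),
        FOREIGN KEY (artist_id)
            REFERENCES artists (id));

    INSERT INTO artists VALUES ('a1');
    INSERT INTO tracks VALUES ('t1');
    --'a_missing' has no matching row in artists
    INSERT INTO tracks_artists VALUES ('t1', 'a1'), ('t1', 'a_missing');
""")

# Each returned row describes one violation: the child table, the rowid
# of the offending row, the parent table, and the index of the foreign key
violations = demo.execute("PRAGMA foreign_key_check(tracks_artists);").fetchall()
for violation in violations:
    print(violation)
demo.close()
```

An empty result would mean that every row in the junction table points at existing parent rows.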

### 2. Create genres and genres_artists Tables

Now we need to create a genres table holding each unique genre with a unique genre ID and a genres_artists table mapping artist IDs to corresponding genre IDs.

First, we query the artists table for the artist ID and genres while cleaning the brackets in the genres. We then assign this to a pandas DataFrame.

In [12]:
select_genres_artists = """
    SELECT
        id as artist_id,  
        REPLACE(REPLACE(genres, '[', ''), ']', '') as name
    FROM artists;
"""
genres_artists = pd.read_sql_query(select_genres_artists, conn)
genres_artists

Unnamed: 0,artist_id,name
0,0DheY5irMjBUeLybbCUEZ2,
1,0DlhY15l3wsrnlfGio2bjU,
2,0DmRESX2JknGPQyO15yxg7,
3,0DmhnbHjm1qw6NCYPeZNgJ,
4,0Dn11fWM7vHQ3rinvWEl4E,
...,...,...
1162090,3cOzi726Iav1toV2LRVEjp,'black comedy'
1162091,6LogY6VMM3jgAE6fPzXeMl,
1162092,19boQkDEIay9GaVAWkUhTa,
1162093,5nvjpU3Y7L6Hpe54QuvDjy,'black comedy'


Now we have a DataFrame with a column for artist ID and a column with corresponding genres separated by commas and spaces. The genres are a bit more complex than the artist IDs because some genres contain apostrophes, which are the same character as a single quote. So, while most genres are surrounded by single quotes, those that contain an apostrophe are surrounded by double quotes. We need to split the genres at each comma and break each genre into its own row with the pandas explode method. Then we need to strip the quotes and spaces from each genre, but only at the beginning and end, so that apostrophes inside a genre name are preserved.

In [13]:
genres_artists = split_explode(genres_artists, 'name', ',')
genres_artists['name'] = [name.strip("'").strip('"').strip(" '").strip(' "')
                          for name in genres_artists['name']]

#Check results
genres_artists

Unnamed: 0,artist_id,name
0,0DheY5irMjBUeLybbCUEZ2,
1,0DlhY15l3wsrnlfGio2bjU,
2,0DmRESX2JknGPQyO15yxg7,
3,0DmhnbHjm1qw6NCYPeZNgJ,
4,0Dn11fWM7vHQ3rinvWEl4E,
...,...,...
1325175,3cOzi726Iav1toV2LRVEjp,black comedy
1325176,6LogY6VMM3jgAE6fPzXeMl,
1325177,19boQkDEIay9GaVAWkUhTa,
1325178,5nvjpU3Y7L6Hpe54QuvDjy,black comedy
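
To see why stripping only at the ends matters, here is a small plain-Python sketch on a made-up genre string. It also compares the chained strip calls used above with a single strip call that takes all three characters at once. The sample string and variable names are illustrative only.

```python
# Made-up example of a genres value after the brackets have been removed
raw = "'carnaval cadiz', \"children's music\", 'harp'"

parts = raw.split(",")

# A single strip() with a set of characters removes spaces and both
# kinds of quotes from the ends, leaving interior apostrophes alone
cleaned = [part.strip(" '\"") for part in parts]

# The chained version from the cell above, for comparison
chained = [part.strip("'").strip('"').strip(" '").strip(' "') for part in parts]

print(cleaned)
print(cleaned == chained)
```

One caveat of either approach is that a genre name which itself begins or ends with an apostrophe would lose that character.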


Now we can create a table of unique genre names and a corresponding genre ID. We do this by creating arrays for the unique genre names and genre IDs and using these to create a new DataFrame. To create the genre ID array, we will use the numpy arange function to generate a series of integers of the same length as the genre names array.

In [14]:
names = pd.unique(genres_artists['name'])
genre_ids = np.arange(len(names))
genres = pd.DataFrame({'id': genre_ids, 'name': names})

#Check results
genres

Unnamed: 0,id,name
0,0,
1,1,carnaval cadiz
2,2,classical harp
3,3,harp
4,4,classical contralto
...,...,...
5362,5362,bhutanese pop
5363,5363,musica puntana
5364,5364,metal piauiense
5365,5365,ugandan traditional


We have an empty row at the very beginning, representing any artists who don't have a listed genre. We would like this to read 'none', so let's fill that in on both the genres table and the genres_artists table.

In [15]:
genres.loc[genres['id'] == 0, 'name'] = 'none'
genres

Unnamed: 0,id,name
0,0,none
1,1,carnaval cadiz
2,2,classical harp
3,3,harp
4,4,classical contralto
...,...,...
5362,5362,bhutanese pop
5363,5363,musica puntana
5364,5364,metal piauiense
5365,5365,ugandan traditional


In [16]:
genres_artists.loc[genres_artists['name'] == '', 'name'] = 'none'
genres_artists

Unnamed: 0,artist_id,name
0,0DheY5irMjBUeLybbCUEZ2,none
1,0DlhY15l3wsrnlfGio2bjU,none
2,0DmRESX2JknGPQyO15yxg7,none
3,0DmhnbHjm1qw6NCYPeZNgJ,none
4,0Dn11fWM7vHQ3rinvWEl4E,none
...,...,...
1325175,3cOzi726Iav1toV2LRVEjp,black comedy
1325176,6LogY6VMM3jgAE6fPzXeMl,none
1325177,19boQkDEIay9GaVAWkUhTa,none
1325178,5nvjpU3Y7L6Hpe54QuvDjy,black comedy


The genres table is now complete, so we can write it to the Spotify database.

In [17]:
genres.to_sql('genres', conn, if_exists='replace', index=False)

5367

As with the tracks_artists table, we would like to specify some constraints on the genres table. The id column should be a primary key. We would also like the name column to be unique and not null. Once again, we need to create a new table with the desired constraints, populate that new table with data from the old table, and then drop the old table.

In [18]:
new_genres = """
    PRAGMA foreign_keys = off;

    BEGIN TRANSACTION;

    --Create new table with desired constraints
    CREATE TABLE new_genres (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE);
                
    --Insert data from old table into new table
    INSERT INTO new_genres SELECT * FROM genres;

    --Drop old table and rename new table
    DROP TABLE genres;
    ALTER TABLE new_genres RENAME TO genres;
                
    COMMIT TRANSACTION;

    PRAGMA foreign_keys = on;
"""
c.executescript(new_genres)
conn.commit()

Turning back to the genres_artists table, let's join the genres_artists and genres tables on the name column.

In [19]:
genres_artists = pd.merge(genres_artists, genres, on='name')
genres_artists

Unnamed: 0,artist_id,name,id
0,0DheY5irMjBUeLybbCUEZ2,none,0
1,0DlhY15l3wsrnlfGio2bjU,none,0
2,0DmRESX2JknGPQyO15yxg7,none,0
3,0DmhnbHjm1qw6NCYPeZNgJ,none,0
4,0Dn11fWM7vHQ3rinvWEl4E,none,0
...,...,...,...
1325175,522Yauv605MRZ9bolexp7C,ugandan traditional,5365
1325176,2PYX7G0d5Ab99YVcc7SIJk,chapman stick,5366
1325177,3MrJlLeGYpbWTNTw3ts03Y,chapman stick,5366
1325178,3bYnscUVC63go1E763zMbF,chapman stick,5366


Now if we just drop the name column, we have a genres_artists table that maps each artist ID to a genre ID.

In [20]:
genres_artists.drop(['name'], axis=1, inplace=True)
genres_artists

Unnamed: 0,artist_id,id
0,0DheY5irMjBUeLybbCUEZ2,0
1,0DlhY15l3wsrnlfGio2bjU,0
2,0DmRESX2JknGPQyO15yxg7,0
3,0DmhnbHjm1qw6NCYPeZNgJ,0
4,0Dn11fWM7vHQ3rinvWEl4E,0
...,...,...
1325175,522Yauv605MRZ9bolexp7C,5365
1325176,2PYX7G0d5Ab99YVcc7SIJk,5366
1325177,3MrJlLeGYpbWTNTw3ts03Y,5366
1325178,3bYnscUVC63go1E763zMbF,5366


Now we can write the genres_artists table to the Spotify database.

In [21]:
genres_artists.to_sql('genres_artists', conn, if_exists='replace', index=False)

1325180

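The DataFrame we just wrote calls its genre column id, but in the database we would like it to be named genre_id. We would also like to specify genre_id and artist_id as foreign keys, with the combination of the two being a primary key, and both should be not null. Once again, we need to create a new table with the desired constraints, populate that new table with data from the old table, and then drop the old table. Because INSERT ... SELECT * maps columns by position, the values from the old id column land in genre_id.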

In [22]:
new_genres_artists = """
    PRAGMA foreign_keys = off;

    BEGIN TRANSACTION;

    --Create new table with desired constraints
    CREATE TABLE new_genres_artists (
        artist_id TEXT NOT NULL,
        genre_id TEXT NOT NULL,
        PRIMARY KEY (artist_id, genre_id),
        FOREIGN KEY (artist_id)
            REFERENCES artists (id),
        FOREIGN KEY (genre_id)
            REFERENCES genres (id));

    --Insert data from old table into new table
    INSERT INTO new_genres_artists SELECT * FROM genres_artists;

    --Drop old table and rename new table
    DROP TABLE genres_artists;
    ALTER TABLE new_genres_artists RENAME TO genres_artists;
                
    COMMIT TRANSACTION;

    PRAGMA foreign_keys = on;
"""
c.executescript(new_genres_artists)
conn.commit()

## III. Final Steps

Now that we are done creating tables, we don't need the id_artists column in the tracks table anymore, so we can drop it. And while we technically don't need the genres column in the artists table and the artists column in the tracks table, these columns might still be useful if we would like to pull up all the genres for a given artist or all the artists for a given song in one row, so we will leave them in.

In [23]:
drop_id_artists = """
    ALTER TABLE tracks 
    DROP COLUMN id_artists;
"""
c.execute(drop_id_artists)
conn.commit()
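
ALTER TABLE ... DROP COLUMN is only available in SQLite 3.35.0 and later, so on an older build the statement above raises an error. A minimal sketch for checking the SQLite library that Python's sqlite3 module is linked against (the variable name is illustrative):

```python
import sqlite3

# Version of the SQLite library itself, not of the Python sqlite3 module
print(sqlite3.sqlite_version)

# DROP COLUMN support arrived in SQLite 3.35.0
supports_drop_column = sqlite3.sqlite_version_info >= (3, 35, 0)
print(supports_drop_column)
```

If this prints False, the same table-rebuild approach used for the junction tables would be one way to remove the column instead.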

Finally, the release_date column in the tracks table has some dates in form "YYYY" and some dates in form "YYYY-MM-DD". This is fine for our purposes, but we would like to have a column that is just the year in form "YYYY". We can add this as a new column to the tracks table.

First, we should confirm that all the release dates do in fact have a 4-digit year at the beginning. In SQLite, we can do this by selecting all release dates and then using EXCEPT to remove those that start with 4 digits. The result will be the set of release dates that do NOT have 4 digits at the beginning.

We will need to use the REGEXP operator, which is a special syntax for the regexp() user function, to perform this query. However, SQLite does not define the regexp() user function by default. So, we will need to define a regexp() function.

In [24]:
def regexp(expr, item):
    """Takes a regex pattern and item to search,
    returns Boolean indicating whether pattern is in item.
    """
    #NULL values arrive as None and never match
    if item is None:
        return False
    return re.search(expr, item) is not None

conn.create_function('REGEXP', 2, regexp)

#Raw string so that \d reaches the regex engine unchanged
select_not_4_digits = r"""
    SELECT release_date
    FROM tracks
    EXCEPT
    SELECT release_date
    FROM tracks
    WHERE release_date regexp '^\d{4}';
"""
execute_print(select_not_4_digits, c)

No results
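
As an alternative that needs no user-defined function, the same check can be sketched with SQLite's built-in GLOB operator, which supports character classes such as [0-9]. This is a different technique from the REGEXP approach above, shown here on an in-memory table with made-up dates.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE tracks (release_date TEXT)")
demo.executemany(
    "INSERT INTO tracks (release_date) VALUES (?)",
    [("1922-02-22",), ("1922",), ("22-02-1922",), (None,)],
)

# Four digit classes followed by * match any value that starts with
# 4 digits; NULLs are listed explicitly because NOT GLOB skips them
bad_dates = demo.execute("""
    SELECT release_date
    FROM tracks
    WHERE release_date IS NULL
       OR release_date NOT GLOB '[0-9][0-9][0-9][0-9]*';
""").fetchall()
print(bad_dates)
demo.close()
```

Only the malformed date and the NULL should come back; on the real tracks table, an empty result would confirm the same thing as the REGEXP query.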


This returned no results, so we can create a new release_year column and populate it with the first 4 digits of the release_date column.

In [25]:
add_release_year = """
    --Create a release_year column
    ALTER TABLE tracks
    ADD COLUMN release_year INTEGER;
    
    --Add 4-digit year from 'release_date' column to 'release_year' column
    UPDATE tracks 
    SET release_year = substr(release_date, 1, 4);
"""
c.executescript(add_release_year)
conn.commit()

#Check results
select_release_year = """
    SELECT release_date, release_year 
    FROM tracks 
    LIMIT 20;
"""
execute_print(select_release_year, c)

('1922-02-22', 1922)
('1922-06-01', 1922)
('1922-03-21', 1922)
('1922-03-21', 1922)
('1922', 1922)
('1922', 1922)
('1922', 1922)
('1922', 1922)
('1922', 1922)
('1922-03-29', 1922)
('1922-06-01', 1922)
('1922-06-01', 1922)
('1922-02-22', 1922)
('1922', 1922)
('1922', 1922)
('1922-06-01', 1922)
('1922-06-01', 1922)
('1922-06-01', 1922)
('1922', 1922)
('1922-03-21', 1922)
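
The execute_print helper used here comes from the funcs module, which is not shown. Judging only from its output in this notebook, a minimal sketch might look like the following. The name `execute_print_sketch` and its return value are assumptions, and the demo table is made up.

```python
import sqlite3

def execute_print_sketch(query, cursor):
    """Hypothetical stand-in for execute_print: run a query and print
    each row, or 'No results' when nothing comes back."""
    rows = cursor.execute(query).fetchall()
    if not rows:
        print("No results")
    for row in rows:
        print(row)
    return rows

demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute("CREATE TABLE tracks (release_date TEXT, release_year INTEGER)")
# substr() returns text; the INTEGER column converts it on insertion
cur.execute(
    "INSERT INTO tracks VALUES ('1922-02-22', substr('1922-02-22', 1, 4))"
)

found = execute_print_sketch("SELECT release_date, release_year FROM tracks", cur)
empty = execute_print_sketch("SELECT * FROM tracks WHERE release_year > 2020", cur)
```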


We are now done setting up the database! Let's check the column info for each table to confirm that we have it set up correctly.

In [26]:
artists_info = "PRAGMA table_info(artists);"
execute_print(artists_info, c)

(0, 'id', 'TEXT', 1, None, 1)
(1, 'followers', 'INTEGER', 0, None, 0)
(2, 'genres', 'TEXT', 0, None, 0)
(3, 'name', 'TEXT', 0, None, 0)
(4, 'popularity', 'INTEGER', 0, None, 0)


In [27]:
tracks_info = "PRAGMA table_info(tracks);"
execute_print(tracks_info, c)

(0, 'id', 'TEXT', 1, None, 1)
(1, 'name', 'TEXT', 0, None, 0)
(2, 'popularity', 'INTEGER', 0, None, 0)
(3, 'duration_ms', 'INTEGER', 0, None, 0)
(4, 'explicit', 'INTEGER', 0, None, 0)
(5, 'artists', 'TEXT', 0, None, 0)
(6, 'release_date', 'TEXT', 0, None, 0)
(7, 'danceability', 'REAL', 0, None, 0)
(8, 'energy', 'REAL', 0, None, 0)
(9, 'key', 'INTEGER', 0, None, 0)
(10, 'loudness', 'REAL', 0, None, 0)
(11, 'mode', 'INTEGER', 0, None, 0)
(12, 'speechiness', 'REAL', 0, None, 0)
(13, 'acousticness', 'REAL', 0, None, 0)
(14, 'instrumentalness', 'REAL', 0, None, 0)
(15, 'liveness', 'REAL', 0, None, 0)
(16, 'valence', 'REAL', 0, None, 0)
(17, 'tempo', 'REAL', 0, None, 0)
(18, 'time_signature', 'INTEGER', 0, None, 0)
(19, 'release_year', 'INTEGER', 0, None, 0)


In [28]:
genres_info = "PRAGMA table_info(genres);"
execute_print(genres_info, c)

(0, 'id', 'INTEGER', 0, None, 1)
(1, 'name', 'TEXT', 1, None, 0)


In [29]:
tracks_artists_info = "PRAGMA table_info(tracks_artists);"
execute_print(tracks_artists_info, c)

(0, 'track_id', 'TEXT', 1, None, 1)
(1, 'artist_id', 'TEXT', 1, None, 2)


In [30]:
genres_artists_info = "PRAGMA table_info(genres_artists);"
execute_print(genres_artists_info, c)

(0, 'artist_id', 'TEXT', 1, None, 1)
(1, 'genre_id', 'TEXT', 1, None, 2)


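Everything looks as intended, with one caveat: genre_id in genres_artists was declared as TEXT, while genres.id is INTEGER. SQLite's type affinity rules should still let joins on these columns match, but declaring genre_id as INTEGER in the CREATE TABLE statement would be more consistent. Now we have a database that looks like this: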

![Spotify Database](./images/diagram2.png)

This database is much easier to work with in SQL now. We can join data from the tracks, artists, and genres tables together using the intermediate tracks_artists and genres_artists tables. From here, we are ready to do any remaining cleaning. Let's close the connection to the database and continue on to the clean_spotify_db file.

In [31]:
conn.close()
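
To illustrate the kind of query the junction tables make possible, here is a sketch of a join from tracks through artists to genres. It runs against a tiny in-memory database with made-up rows and a reduced set of columns, not against spotify.db.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE tracks (id TEXT PRIMARY KEY NOT NULL, name TEXT);
    CREATE TABLE artists (id TEXT PRIMARY KEY NOT NULL, name TEXT);
    CREATE TABLE genres (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE tracks_artists (
        track_id TEXT NOT NULL,
        artist_id TEXT NOT NULL,
        PRIMARY KEY (track_id, artist_id));
    --genre_id is TEXT here to mirror the table built in this notebook
    CREATE TABLE genres_artists (
        artist_id TEXT NOT NULL,
        genre_id TEXT NOT NULL,
        PRIMARY KEY (artist_id, genre_id));

    INSERT INTO tracks VALUES ('t1', 'Song One'), ('t2', 'Song Two');
    INSERT INTO artists VALUES ('a1', 'Artist A'), ('a2', 'Artist B');
    INSERT INTO genres VALUES (0, 'none'), (1, 'harp');
    INSERT INTO tracks_artists VALUES ('t1', 'a1'), ('t2', 'a1'), ('t2', 'a2');
    INSERT INTO genres_artists VALUES ('a1', 1), ('a2', 0);
""")

# One row per (track, artist, genre) combination
track_genres = demo.execute("""
    SELECT t.name, a.name, g.name
    FROM tracks AS t
    JOIN tracks_artists AS ta ON ta.track_id = t.id
    JOIN artists AS a ON a.id = ta.artist_id
    JOIN genres_artists AS ga ON ga.artist_id = a.id
    JOIN genres AS g ON g.id = ga.genre_id
    ORDER BY t.name, a.name;
""").fetchall()
for row in track_genres:
    print(row)
demo.close()
```

A track with two artists appears once per artist, which is exactly the one-pair-per-row shape that the original comma-separated columns could not provide.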

## IV. References

1. 'About Spotify'. Spotify. https://newsroom.spotify.com/company-info/.