# Why Use Databases?

### How to think about databases

#### Dataframes joined by keys
The simplest way to think about a database is a series of dataframes in which a special "key" column in one dataframe references a specific row in another dataframe.

#### Example
A set of dataframes that store:
- Songs
- Albums
- Playlists
- Artists
- Genres
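
The "key column" idea can be sketched with two small dataframes. This is only an illustration: the `album_id` column and the lowercase column names are assumptions made for the example, not part of the dataset used later.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
albums = pd.DataFrame({
    "album_id": [1, 2],
    "album": ["Rubber Factory", "Brothers"],
})
songs = pd.DataFrame({
    "song": ["Lengths", "Howlin' For You"],
    "album_id": [1, 2],  # key column pointing at a row in `albums`
})

# Follow the key from each song to its album row
songs_with_album = songs.merge(albums, on="album_id", how="left")
print(songs_with_album)
```

Each album name is stored once in `albums`, and songs only carry the key that points to it.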

### What are the advantages of a database?

#### Advantages
- Allows you to store information in more than 2 dimensions
- Reduces duplication of values
- Represents complex relationships easily
- Enables targeted retrieval of information

#### When to use a database
- Aggregating data from several different sources
- Sharing datasets that require a lot of cleaning to analyze
- Conducting an analysis that requires a _ton_ of "groupby" statements

# Creating Your Data Model

### Understanding the key terms and concepts

#### ERD
An **Entity Relationship Diagram** outlines the basic structure of your database by modeling the groups or **entities** of data you will store and how those groups will be **related** to one another.

#### Entities
An abstraction of a dataframe or table in which an *instance* of an entity corresponds to a row in a dataframe.

#### One-to-Many Relationship
A relationship between entities (dataframes) in which an instance of one entity (row of one dataframe) is related to many instances of another entity (rows of another dataframe). For example: One album *has many* related songs.

#### Many-to-Many Relationship
A relationship between entities (dataframes) in which many instances of one entity (rows of one dataframe) can be related to many instances of another entity (rows of another dataframe). For example: One genre *has many* related songs, and one song *can have many* related genres.

#### Diagram Keys
![Keys](P4GCKey.png)

### Step 1: Identify the core logical entities

![Keys](P4GC_Step1.png)

### Step 2: Define the relationships between entities

![Keys](P4GC_Step2.png)

### Step 3: Replace many-to-many relationships with "join tables"

![Keys](P4GC_Step3.png)
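
A join table can be sketched in a throwaway in-memory SQLite database. The `Song_Genre` table name and the hard-coded ids are assumptions for illustration; the songs and genres come from the playlist data used later in this notebook.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")  # temporary database, nothing is written to disk
demo_cur = demo_conn.cursor()
demo_cur.executescript("""
    CREATE TABLE Song  (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Genre (Id INTEGER PRIMARY KEY, Name TEXT);
    -- join table: one row per (song, genre) pairing
    CREATE TABLE Song_Genre (
        Song_Id  INTEGER REFERENCES Song (Id),
        Genre_Id INTEGER REFERENCES Genre (Id),
        PRIMARY KEY (Song_Id, Genre_Id)
    );
""")
demo_cur.executemany("INSERT INTO Song (Id, Name) VALUES (?, ?)",
                     [(1, "Lengths"), (2, "Howlin' For You")])
demo_cur.executemany("INSERT INTO Genre (Id, Name) VALUES (?, ?)",
                     [(1, "Blues"), (2, "Rock"), (3, "Folk")])
demo_cur.executemany("INSERT INTO Song_Genre (Song_Id, Genre_Id) VALUES (?, ?)",
                     [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)])
demo_conn.commit()

# All genres attached to the song with Id 1
demo_cur.execute("""
    SELECT Genre.Name
    FROM Genre
    JOIN Song_Genre ON Song_Genre.Genre_Id = Genre.Id
    WHERE Song_Genre.Song_Id = 1
    ORDER BY Genre.Name
""")
genres_for_lengths = [row[0] for row in demo_cur.fetchall()]
demo_conn.close()
```

The many-to-many relationship becomes two one-to-many relationships: one song has many `Song_Genre` rows, and one genre has many `Song_Genre` rows.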

### Step 4: Translate logical entities into tables

![Keys](P4GC_Step4.png)

# Importing and Cleaning the Data

### Step 1: Read and combine the playlists into a single dataframe

In [2]:
import pandas as pd

In [3]:
#read entire excel file to avoid multiple reads
xlsx = pd.ExcelFile("P4GC_SQLite3_Test.xlsx")

In [4]:
#read each sheet to a different dataframe
df_gritty = pd.read_excel(xlsx, "Gritty Playlist")
df_mix = pd.read_excel(xlsx, "Mix Tape Playlist")
df_alt = pd.read_excel(xlsx, "Alt Playlist")
df_rainy = pd.read_excel(xlsx, "Rainy Day Playlist")

In [5]:
#add corresponding playlist name to new playlist column
df_gritty["Playlist"] = "Gritty"
df_mix["Playlist"] = "Mix Tape"
df_alt["Playlist"] = "Alt"
df_rainy["Playlist"] = "Rainy Day"

In [6]:
#combine playlists and reset the index
df_all = pd.concat([df_gritty, df_mix, df_alt, df_rainy], ignore_index=True)
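
The label-then-combine pattern above can also be written in one pass. The sketch below assumes the sheets are held in a dict keyed by playlist name; the `sheets` dict and its tiny stand-in dataframes are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the dataframes read from each sheet
sheets = {
    "Gritty": pd.DataFrame({"Song": ["S.O.B.", "Lengths"]}),
    "Alt": pd.DataFrame({"Song": ["Example Song"]}),
}

# assign() returns a copy with the new column, so the originals stay untouched
df_combined = pd.concat(
    [df.assign(Playlist=name) for name, df in sheets.items()],
    ignore_index=True,
)
```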

In [7]:
#convert the duration timestamp to a duration delta
df_all["Duration_Delta"] = df_all["Duration"].apply(lambda x: pd.Timedelta(hours=x.hour,
                                                                           minutes=x.minute,
                                                                           seconds=x.second))

#extract the total duration in seconds from the duration delta
df_all["Duration_Seconds"] = df_all["Duration_Delta"].dt.seconds
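
A minimal sketch of the same conversion on a tiny series, assuming the duration cells arrive as `datetime.time` objects (which is what the `.minute` and `.second` attribute access implies). It keeps the hour component and uses `total_seconds()`, which also counts whole days, whereas `.dt.seconds` returns only the seconds within the current day.

```python
import datetime
import pandas as pd

# Two durations taken from the playlist data: 00:04:08 and 00:04:52
durations = pd.Series([datetime.time(0, 4, 8), datetime.time(0, 4, 52)])

deltas = durations.apply(
    lambda t: pd.Timedelta(hours=t.hour, minutes=t.minute, seconds=t.second)
)

# total_seconds() returns floats, so cast to int for storage
seconds = deltas.dt.total_seconds().astype(int)
```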

In [8]:
df_all.head()

|   | Song | Artist | Album | Duration | Genre | Playlist | Duration_Delta | Duration_Seconds |
|---|---|---|---|---|---|---|---|---|
| 0 | S.O.B. | Nathaniel Rateliff & The Night Sweats | Nathaniel Rateliff & The Night Sweats | 00:04:08 | Rock;Soul;Blues | Gritty | 00:04:08 | 248 |
| 1 | Lengths | The Black Keys | Rubber Factory | 00:04:52 | Blues;Rock;Folk | Gritty | 00:04:52 | 292 |
| 2 | Don't Wanna Fight | Alabama Shakes | Sound & Color | 00:03:53 | Blues;Soul;Rock | Gritty | 00:03:53 | 233 |
| 3 | Howlin' For You | The Black Keys | Brothers | 00:03:12 | Blues;Rock | Gritty | 00:03:12 | 192 |
| 4 | Twenty Miles | Deer Tick | The Black Dirt Sessions | 00:03:44 | Blues;Rock;Folk | Gritty | 00:03:44 | 224 |


### Step 2: Create separate dataframes to itemize the genres and artists for each song

In [9]:
#split artist and genre columns on ";" to itemize individual artists and genres later
df_all["Artist"] = df_all["Artist"].str.split(";")
df_all["Genre"] = df_all["Genre"].str.split(";")

#### Itemizing Artists

In [18]:
#export the records to a list of dicts
records_raw = df_all.to_dict("records")

#inserts a record for each artist on a song
artist_records = [{"Artist": y, 
                   "Album": x["Album"],
                   "Song": x["Song"], 
                   "Playlist": x["Playlist"],
                   "Duration_Delta": x["Duration_Delta"],
                   "Duration_Seconds": x["Duration_Seconds"]}
                  for x in records_raw for y in x["Artist"]]

In [19]:
#converts the list of records back into dataframes for further manipulation
df_artist_raw = pd.DataFrame(artist_records)

df_artist_raw.head()

|   | Album | Artist | Duration_Delta | Duration_Seconds | Playlist | Song |
|---|---|---|---|---|---|---|
| 0 | Nathaniel Rateliff & The Night Sweats | Nathaniel Rateliff & The Night Sweats | 00:04:08 | 248 | Gritty | S.O.B. |
| 1 | Rubber Factory | The Black Keys | 00:04:52 | 292 | Gritty | Lengths |
| 2 | Sound & Color | Alabama Shakes | 00:03:53 | 233 | Gritty | Don't Wanna Fight |
| 3 | Brothers | The Black Keys | 00:03:12 | 192 | Gritty | Howlin' For You |
| 4 | The Black Dirt Sessions | Deer Tick | 00:03:44 | 224 | Gritty | Twenty Miles |


#### Itemizing Genres

In [20]:
#inserts a record for each genre on a song
genre_records = [{"Genre": y, 
                  "Album": x["Album"],
                  "Song": x["Song"], 
                  "Playlist": x["Playlist"],
                  "Duration_Delta": x["Duration_Delta"],
                  "Duration_Seconds": x["Duration_Seconds"]}
                 for x in records_raw for y in x["Genre"]]

#converts the list of records back into dataframes for further manipulation
df_genre_raw = pd.DataFrame(genre_records)

df_genre_raw.head()

|   | Album | Duration_Delta | Duration_Seconds | Genre | Playlist | Song |
|---|---|---|---|---|---|---|
| 0 | Nathaniel Rateliff & The Night Sweats | 00:04:08 | 248 | Rock | Gritty | S.O.B. |
| 1 | Nathaniel Rateliff & The Night Sweats | 00:04:08 | 248 | Soul | Gritty | S.O.B. |
| 2 | Nathaniel Rateliff & The Night Sweats | 00:04:08 | 248 | Blues | Gritty | S.O.B. |
| 3 | Rubber Factory | 00:04:52 | 292 | Blues | Gritty | Lengths |
| 4 | Rubber Factory | 00:04:52 | 292 | Rock | Gritty | Lengths |
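
As an alternative to building the records with a list comprehension, recent pandas versions provide `DataFrame.explode`, which emits one row per list element. The sketch below uses a two-row stand-in dataframe rather than `df_all`, so treat it as an illustration of the technique and not as the method used in this notebook.

```python
import pandas as pd

# Small stand-in for the combined playlist dataframe
df_demo = pd.DataFrame({
    "Song": ["S.O.B.", "Howlin' For You"],
    "Genre": ["Rock;Soul;Blues", "Blues;Rock"],
})

# Split the delimited string into a list, then emit one row per list element
df_demo["Genre"] = df_demo["Genre"].str.split(";")
df_exploded = df_demo.explode("Genre").reset_index(drop=True)
```

The other columns are repeated for each exploded row, which matches the shape of `df_genre_raw` above.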


### Step 3: Isolate the unique values for non-junction tables

#### Albums

In [21]:
#pulls unique album names and aggregates song count and duration
df_album = (df_all.groupby("Album", as_index=False)
                  .agg({"Song": "count",
                        "Duration_Seconds": "sum",
                        "Duration_Delta": "sum"}))
df_album.head()

|   | Album | Song | Duration_Seconds | Duration_Delta |
|---|---|---|---|---|
| 0 | Ambivalence Avenue | 2 | 458 | 00:07:38 |
| 1 | Autumn Fallin' | 1 | 205 | 00:03:25 |
| 2 | Ben Kweller | 3 | 760 | 00:12:40 |
| 3 | Boom Forest | 1 | 265 | 00:04:25 |
| 4 | Boys & Girls | 1 | 226 | 00:03:46 |
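
The aggregated `Song` column above actually holds a count. A sketch of the same aggregation using pandas named aggregation, which lets the output column be called `Song_Count` directly; the sample rows, including "Hypothetical Track", are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Album": ["Brothers", "Brothers", "Rubber Factory"],
    "Song": ["Howlin' For You", "Hypothetical Track", "Lengths"],
    "Duration_Seconds": [192, 200, 292],
})

# Each keyword names an output column: (input column, aggregation)
summary = (df_demo.groupby("Album", as_index=False)
                  .agg(Song_Count=("Song", "count"),
                       Duration_Seconds=("Duration_Seconds", "sum")))
```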


#### Songs

In [22]:
#drops songs that appear in multiple playlists
df_song = df_all.drop_duplicates(["Album","Song"])
df_song.head()

|   | Song | Artist | Album | Duration | Genre | Playlist | Duration_Delta | Duration_Seconds |
|---|---|---|---|---|---|---|---|---|
| 0 | S.O.B. | [Nathaniel Rateliff & The Night Sweats] | Nathaniel Rateliff & The Night Sweats | 00:04:08 | [Rock, Soul, Blues] | Gritty | 00:04:08 | 248 |
| 1 | Lengths | [The Black Keys] | Rubber Factory | 00:04:52 | [Blues, Rock, Folk] | Gritty | 00:04:52 | 292 |
| 2 | Don't Wanna Fight | [Alabama Shakes] | Sound & Color | 00:03:53 | [Blues, Soul, Rock] | Gritty | 00:03:53 | 233 |
| 3 | Howlin' For You | [The Black Keys] | Brothers | 00:03:12 | [Blues, Rock] | Gritty | 00:03:12 | 192 |
| 4 | Twenty Miles | [Deer Tick] | The Black Dirt Sessions | 00:03:44 | [Blues, Rock, Folk] | Gritty | 00:03:44 | 224 |


#### Playlists

In [23]:
#pulls unique playlist names and aggregates song count and duration
df_playlist = (df_all.groupby("Playlist", as_index=False)
                     .agg({"Song": "count",
                           "Duration_Seconds": "sum",
                           "Duration_Delta": "sum"}))
df_playlist.head()

|   | Playlist | Song | Duration_Seconds | Duration_Delta |
|---|---|---|---|---|
| 0 | Alt | 8 | 1848 | 00:30:48 |
| 1 | Gritty | 10 | 2269 | 00:37:49 |
| 2 | Mix Tape | 13 | 2345 | 00:39:05 |
| 3 | Rainy Day | 14 | 3007 | 00:50:07 |


#### Genre

In [24]:
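#pulls unique genres and aggregates song count and duration
#note: deduplicating on Album and Song alone keeps only the first genre listed for each song;
#add "Genre" to the subset to count a song under every one of its genres
df_genre = (df_genre_raw.drop_duplicates(["Album","Song"])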
                        .groupby("Genre", as_index=False)
                        .agg({"Song": "count",
                              "Duration_Seconds": "sum",
                              "Duration_Delta": "sum"}))
df_genre.head()

|   | Genre | Song | Duration_Seconds | Duration_Delta |
|---|---|---|---|---|
| 0 | Alt | 12 | 2349 | 00:39:09 |
| 1 | Blues | 5 | 1253 | 00:20:53 |
| 2 | Country | 1 | 171 | 00:02:51 |
| 3 | Folk | 16 | 3240 | 00:54:00 |
| 4 | Pop | 1 | 229 | 00:03:49 |


#### Artists

In [25]:
#pulls unique artists and aggregates song count and duration
df_artist = (df_artist_raw.drop_duplicates(["Artist","Song"])
                          .groupby("Artist", as_index=False)
                          .agg({"Song": "count",
                                "Duration_Seconds": "sum",
                                "Duration_Delta": "sum"}))
df_artist.head()

|   | Artist | Song | Duration_Seconds | Duration_Delta |
|---|---|---|---|---|
| 0 | Ben Birdwell | 1 | 212 | 00:03:32 |
| 1 | Benjamin Gibbard | 1 | 182 | 00:03:02 |
| 2 | Jared Engel | 1 | 205 | 00:03:25 |
| 3 | Jessica Lea Mayfield | 2 | 374 | 00:06:14 |
| 4 | Phox | 1 | 265 | 00:04:25 |


# Using SQLite3

### Step 1: Install SQLite and create your database

1. Install SQLite and a GUI for the database by downloading DB Browser for SQLite: http://sqlitebrowser.org/
1. The `sqlite3` package used to access and manage SQLite databases from Python is part of the standard library, so no separate `pip install` is needed
1. Open DB Browser for SQLite and create a new database

### Step 2: Import the library and connect to your database

In [26]:
import sqlite3
conn = sqlite3.connect("AlbumDB.db")
c = conn.cursor()
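
`sqlite3.connect` creates the database file if it does not already exist. For experimenting without touching `AlbumDB.db`, a connection to the special name `":memory:"` gives a temporary database. The sketch below is a hedged illustration of that pattern; the `demo_` names are hypothetical.

```python
import sqlite3
from contextlib import closing

# ":memory:" creates a temporary database that disappears when the connection closes
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_cur = demo_conn.cursor()
    # Ask the engine which SQLite version the Python build is linked against
    demo_cur.execute("SELECT sqlite_version()")
    version = demo_cur.fetchone()[0]

print(version)
```

`contextlib.closing` is used because a bare `with conn:` block manages transactions but does not close the connection.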

### Step 3: Create and populate tables that don't have foreign keys

#### Albums

In [28]:
#convert df to list of tuples and reformat each value for entry to database
album_records = [(x["Album"],
                  str(x["Duration_Delta"]),
                  int(x["Duration_Seconds"]), 
                  int(x["Song"]))
                 for x in df_album.to_dict("records")]

In [29]:
#creates album table
c.execute('DROP TABLE IF EXISTS Album')
c.execute('''CREATE TABLE Album (
                Id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
                Name TEXT,
                Duration_Text TEXT,
                Duration_Seconds INTEGER,
                Song_Count INTEGER
             )''')
#inserts values from album_records list
c.executemany('''INSERT INTO Album (Name, Duration_Text, Duration_Seconds, Song_Count)
                 VALUES (?,?,?,?)''',
              album_records)
#commits changes to the database
conn.commit()
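
The `?` placeholders let the driver bind each value, so names containing apostrophes or ampersands need no manual escaping. A self-contained sketch against an in-memory database; the cut-down `Album` table with only a `Name` column is an assumption for brevity.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute(
    "CREATE TABLE Album (Id INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT)"
)

# Each record is a tuple, even when it holds a single value
rows = [("Autumn Fallin'",), ("Boys & Girls",)]
demo_cur.executemany("INSERT INTO Album (Name) VALUES (?)", rows)
demo_conn.commit()

# Ids are assigned automatically, starting at 1
demo_cur.execute("SELECT Id, Name FROM Album ORDER BY Id")
inserted = demo_cur.fetchall()
demo_conn.close()
```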

#### Playlists

In [30]:
#convert df to list of tuples and reformat each value for entry to database
playlist_records = [(x["Playlist"],
                     str(x["Duration_Delta"]),
                     int(x["Duration_Seconds"]), 
                     int(x["Song"]))
                    for x in df_playlist.to_dict("records")]

In [31]:
#creates playlist table
c.execute('DROP TABLE IF EXISTS Playlist')
c.execute('''CREATE TABLE Playlist (
                Id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
                Name TEXT,
                Duration_Text TEXT,
                Duration_Seconds INTEGER,
                Song_Count INTEGER
             )''')
#inserts values from playlist_records list
c.executemany('''INSERT INTO Playlist (Name, Duration_Text, Duration_Seconds, Song_Count)
                 VALUES (?,?,?,?)''',
              playlist_records)
#commits changes to the database
conn.commit()

#### Artists

In [32]:
#convert df to list of tuples and reformat each value for entry to database
artist_records = [(x["Artist"],
                   str(x["Duration_Delta"]),
                   int(x["Duration_Seconds"]), 
                   int(x["Song"]))
                  for x in df_artist.to_dict("records")]

In [33]:
#creates Artist table
c.execute('DROP TABLE IF EXISTS Artist')
c.execute('''CREATE TABLE Artist (
                Id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
                Name TEXT,
                Duration_Text TEXT,
                Duration_Seconds INTEGER,
                Song_Count INTEGER
             )''')
#inserts values from artist_records list
c.executemany('''INSERT INTO Artist (Name, Duration_Text, Duration_Seconds, Song_Count)
                 VALUES (?,?,?,?)''',
              artist_records)
#commits changes to the database
conn.commit()

#### Genres

In [34]:
#convert df to list of tuples and reformat each value for entry to database
genre_records = [(x["Genre"],
                  str(x["Duration_Delta"]),
                  int(x["Duration_Seconds"]), 
                  int(x["Song"]))
                 for x in df_genre.to_dict("records")]

In [35]:
#creates genre table
c.execute('DROP TABLE IF EXISTS Genre')
c.execute('''CREATE TABLE Genre (
                Id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
                Name TEXT,
                Duration_Text TEXT,
                Duration_Seconds INTEGER,
                Song_Count INTEGER
             )''')
#inserts values from genre_records list
c.executemany('''INSERT INTO Genre (Name, Duration_Text, Duration_Seconds, Song_Count)
                 VALUES (?,?,?,?)''',
              genre_records)
#commits changes to the database
conn.commit()

### Step 4: Query records, merge, and repeat Step 3 for tables with foreign keys

#### Songs

In [36]:
#query and retrieve id and name fields from album table
c.execute('SELECT Id, Name FROM Album')
album_ids = c.fetchall()

In [37]:
#convert to dataframe then merge with df_song
df_album_id = pd.DataFrame(album_ids, columns=["Id", "Album"])

df_song_join = df_song.merge(df_album_id, on="Album")
df_song_join.head()

|   | Song | Artist | Album | Duration | Genre | Playlist | Duration_Delta | Duration_Seconds | Id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | S.O.B. | [Nathaniel Rateliff & The Night Sweats] | Nathaniel Rateliff & The Night Sweats | 00:04:08 | [Rock, Soul, Blues] | Gritty | 00:04:08 | 248 | 11 |
| 1 | Howling At Nothing | [Nathaniel Rateliff & The Night Sweats] | Nathaniel Rateliff & The Night Sweats | 00:03:10 | [Soul, Blues, Rock] | Gritty | 00:03:10 | 190 | 11 |
| 2 | Lengths | [The Black Keys] | Rubber Factory | 00:04:52 | [Blues, Rock, Folk] | Gritty | 00:04:52 | 292 | 15 |
| 3 | Don't Wanna Fight | [Alabama Shakes] | Sound & Color | 00:03:53 | [Blues, Soul, Rock] | Gritty | 00:03:53 | 233 | 20 |
| 4 | Howlin' For You | [The Black Keys] | Brothers | 00:03:12 | [Blues, Rock] | Gritty | 00:03:12 | 192 | 6 |
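
`merge` performs an inner join by default, so a song whose album name has no match in the id table would silently disappear. A hedged sketch of one way to check for that, using a left join with pandas' `indicator` and `validate` options; the sample rows, including "Hypothetical Orphan", are invented for the example.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Album (Id INTEGER PRIMARY KEY, Name TEXT)")
demo_cur.executemany("INSERT INTO Album (Name) VALUES (?)",
                     [("Brothers",), ("Rubber Factory",)])
demo_cur.execute("SELECT Id, Name FROM Album")
df_ids = pd.DataFrame(demo_cur.fetchall(), columns=["Album_Id", "Album"])
demo_conn.close()

df_songs = pd.DataFrame({
    "Song": ["Lengths", "Howlin' For You", "Hypothetical Orphan"],
    "Album": ["Rubber Factory", "Brothers", "Not In Table"],
})

# how="left" keeps every song; validate raises if an album name maps to several ids
merged = df_songs.merge(df_ids, on="Album", how="left",
                        validate="many_to_one", indicator=True)

# Rows flagged "left_only" found no matching album id
unmatched = merged[merged["_merge"] == "left_only"]
```

An empty `unmatched` frame is a quick sanity check before building the records to insert.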


In [38]:
#convert df to list of tuples and reformat each value for entry to database
song_records = [(x["Song"],
                 str(x["Duration_Delta"]),
                 int(x["Duration_Seconds"]), 
                 int(x["Id"]))
                for x in df_song_join.to_dict("records")]

In [39]:
#creates song table
c.execute('DROP TABLE IF EXISTS Song')
c.execute('''CREATE TABLE Song (
                Id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
                Name TEXT,
                Duration_Text TEXT,
                Duration_Seconds INTEGER,
                Album_Id INTEGER,
                FOREIGN KEY(Album_Id) REFERENCES Album (Id)
             )''')
#inserts values from song_records list
c.executemany('''INSERT INTO Song (Name, Duration_Text, Duration_Seconds, Album_Id)
                 VALUES (?,?,?,?)''',
              song_records)
#commits changes to the database
conn.commit()
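
In default SQLite builds, declaring a `FOREIGN KEY` does not enforce it unless enforcement is switched on for the connection with `PRAGMA foreign_keys = ON`. The sketch below demonstrates this on an in-memory database; "Hypothetical Song" and the album id 99 are made up to trigger the violation.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
# Enforcement must be enabled per connection, before any transaction starts
demo_conn.execute("PRAGMA foreign_keys = ON")
demo_conn.execute("CREATE TABLE Album (Id INTEGER PRIMARY KEY, Name TEXT)")
demo_conn.execute("""CREATE TABLE Song (
                        Id INTEGER PRIMARY KEY,
                        Name TEXT,
                        Album_Id INTEGER,
                        FOREIGN KEY(Album_Id) REFERENCES Album (Id)
                     )""")
demo_conn.execute("INSERT INTO Album (Id, Name) VALUES (?, ?)", (1, "Brothers"))
demo_conn.execute("INSERT INTO Song (Name, Album_Id) VALUES (?, ?)",
                  ("Howlin' For You", 1))

# Album 99 does not exist, so this insert should be rejected
try:
    demo_conn.execute("INSERT INTO Song (Name, Album_Id) VALUES (?, ?)",
                      ("Hypothetical Song", 99))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

song_count = demo_conn.execute("SELECT COUNT(*) FROM Song").fetchone()[0]
demo_conn.close()
```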

#### Playlist Song

In [40]:
#query and retrieve name and id field from song table
c.execute('SELECT Id, Name FROM Song')
song_ids = c.fetchall()

#query and retrieve name and id fields from playlist table
c.execute('SELECT Id, Name FROM Playlist')
playlist_ids = c.fetchall()

In [41]:
#convert query results to dataframes
df_song_id = pd.DataFrame(song_ids, columns=["Song_Id", "Song"])
df_playlist_id = pd.DataFrame(playlist_ids, columns=["Playlist_Id", "Playlist"])

In [42]:
#merge song ids, then playlist ids, onto the combined playlist dataframe
df_playlist_song1 = df_all.merge(df_song_id, on="Song")
df_playlist_song2 = df_playlist_song1.merge(df_playlist_id, on="Playlist")
df_playlist_song2.head()

|   | Song | Artist | Album | Duration | Genre | Playlist | Duration_Delta | Duration_Seconds | Song_Id | Playlist_Id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S.O.B. | [Nathaniel Rateliff & The Night Sweats] | Nathaniel Rateliff & The Night Sweats | 00:04:08 | [Rock, Soul, Blues] | Gritty | 00:04:08 | 248 | 1 | 2 |
| 1 | Lengths | [The Black Keys] | Rubber Factory | 00:04:52 | [Blues, Rock, Folk] | Gritty | 00:04:52 | 292 | 3 | 2 |
| 2 | Don't Wanna Fight | [Alabama Shakes] | Sound & Color | 00:03:53 | [Blues, Soul, Rock] | Gritty | 00:03:53 | 233 | 4 | 2 |
| 3 | Howlin' For You | [The Black Keys] | Brothers | 00:03:12 | [Blues, Rock] | Gritty | 00:03:12 | 192 | 5 | 2 |
| 4 | Twenty Miles | [Deer Tick] | The Black Dirt Sessions | 00:03:44 | [Blues, Rock, Folk] | Gritty | 00:03:44 | 224 | 7 | 2 |


In [43]:
#convert df to list of tuples and reformat each value for entry to database
playlist_song_records = [(int(x["Song_Id"]),
                          int(x["Playlist_Id"]))
                         for x in df_playlist_song2.to_dict("records")]

In [44]:
#creates playlist song table
c.execute('DROP TABLE IF EXISTS Playlist_Song')
c.execute('''CREATE TABLE Playlist_Song (
             Song_Id INTEGER,
             Playlist_Id INTEGER,
             PRIMARY KEY(Song_Id, Playlist_Id),
             FOREIGN KEY(Song_Id) REFERENCES Song (Id),
             FOREIGN KEY(Playlist_Id) REFERENCES Playlist (Id))''')

#inserts values from playlist_song_records list
c.executemany('INSERT INTO Playlist_Song (Song_Id, Playlist_Id) VALUES (?, ?)',
              playlist_song_records)
#commits changes to the database
conn.commit()
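
Once the junction table is populated, the songs in a playlist can be retrieved by joining through it. The sketch below rebuilds a tiny version of the schema in memory so it runs on its own, and reads the result straight into a dataframe with `pd.read_sql_query`; the hard-coded ids and the reduced column set are assumptions for illustration.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE Song (Id INTEGER PRIMARY KEY, Name TEXT, Duration_Seconds INTEGER);
    CREATE TABLE Playlist (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Playlist_Song (
        Song_Id INTEGER REFERENCES Song (Id),
        Playlist_Id INTEGER REFERENCES Playlist (Id),
        PRIMARY KEY (Song_Id, Playlist_Id)
    );
    INSERT INTO Song VALUES (1, 'S.O.B.', 248), (2, 'Lengths', 292);
    INSERT INTO Playlist VALUES (1, 'Gritty'), (2, 'Alt');
    INSERT INTO Playlist_Song VALUES (1, 1), (2, 1);
""")

# Join through the junction table and filter on the playlist name
query = """
    SELECT Song.Name AS Song, Song.Duration_Seconds
    FROM Playlist_Song
    JOIN Song ON Song.Id = Playlist_Song.Song_Id
    JOIN Playlist ON Playlist.Id = Playlist_Song.Playlist_Id
    WHERE Playlist.Name = ?
    ORDER BY Song.Name
"""
df_gritty_songs = pd.read_sql_query(query, demo_conn, params=("Gritty",))
demo_conn.close()
```

The same query shape should work against `AlbumDB.db` by passing `conn` in place of `demo_conn`.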