# Data exploration

- A disclaimer. I know woefully little about SQL at the moment, so much of this may just be playing around with queries. It will evolve with the repo.

In [7]:
import sqlite3
import pandas as pd
import pretty_midi
pd.options.display.max_columns = 999

In [3]:
conn = sqlite3.connect('/Users/tburch/Documents/Datasets/jazz/wjazzd.db')
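
Before browsing, it helps to know which tables exist at all. A minimal sketch, assuming the standard SQLite `sqlite_master` catalog; the `list_tables` helper and the in-memory `demo` database are made up for illustration so the snippet runs without the real file:

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in a SQLite database, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database (hypothetical, two empty tables) instead of wjazzd.db.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE beats (beatid INTEGER, melid INTEGER)")
demo.execute("CREATE TABLE melody (eventid INTEGER, melid INTEGER)")

print(list_tables(demo))
```

Against the real database this would presumably just be `list_tables(conn)`.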

## Let's just kinda browse some tables.
### Starting with "beats"

In [12]:
# Beats table
query = "SELECT * FROM beats"
df_beats = pd.read_sql_query(query, conn)
df_beats.head()


Unnamed: 0,beatid,melid,onset,bar,beat,signature,chord,form,bass_pitch,chorus_id
0,1,1,9.171882,-1,1,,,I1,42.0,0
1,2,1,9.488254,-1,2,,,,42.0,0
2,3,1,9.779955,-1,3,,,,40.0,0
3,4,1,10.052608,-1,4,,,,40.0,0
4,5,1,10.339796,0,1,,Bb6,,50.0,0


Okay, so this will be useful for getting the chord names and roots of songs. It also hints at the general structure of the data (bar = measure; beat is self-explanatory). There's probably more in melody.
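
Pulling just the chord changes out of beats might look like the sketch below. It rebuilds the five rows shown above in an in-memory database (a stand-in, with only some of the columns), and it assumes that "no chord on this beat" is stored as either an empty string or NULL, which I haven't checked against the real file:

```python
import sqlite3

# Stand-in for the beats table, using the rows printed above.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE beats ("
    "beatid INTEGER, melid INTEGER, onset REAL, "
    "bar INTEGER, beat INTEGER, chord TEXT, bass_pitch REAL)"
)
demo.executemany(
    "INSERT INTO beats VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 9.171882, -1, 1, "", 42.0),
        (2, 1, 9.488254, -1, 2, "", 42.0),
        (3, 1, 9.779955, -1, 3, "", 40.0),
        (4, 1, 10.052608, -1, 4, "", 40.0),
        (5, 1, 10.339796, 0, 1, "Bb6", 50.0),
    ],
)

# Keep only beats of one solo (melid) that actually carry a chord symbol.
chord_changes = demo.execute(
    "SELECT bar, beat, chord FROM beats "
    "WHERE melid = ? AND chord IS NOT NULL AND chord != '' "
    "ORDER BY onset",
    (1,),
).fetchall()

print(chord_changes)
```

Passing the `melid` as a `?` parameter rather than formatting it into the string keeps the query reusable for other solos.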

### Looking at melody table

In [13]:
# Melody
query = "SELECT * FROM melody"
df_melody = pd.read_sql_query(query, conn)
df_melody.head(20)

Unnamed: 0,eventid,melid,onset,pitch,duration,period,division,bar,beat,tatum,subtatum,num,denom,beatprops,beatdur,tatumprops,f0_mod,loud_max,loud_med,loud_sd,loud_relpos,loud_cent,loud_s2b,f0_range,f0_freq_hz,f0_med_dev
0,1,1,10.343492,65.0,0.138776,4,1,0,1,1,0,4,4,,0.291746,"(1,)",,0.126209,66.526087,5.541147,0.307692,0.389466,1.056169,37.794261,12.932532,-0.328442
1,2,1,10.637642,63.0,0.171247,4,4,0,2,1,0,4,4,,0.286621,"(1.0, 1.0, 1.0, 1.173)",,0.349751,69.133321,2.912412,0.25,0.468687,1.120317,6.36593,6.956935,11.135423
2,3,1,10.843719,58.0,0.08127,4,4,0,2,4,0,4,4,,0.286621,"(1.0, 1.0, 1.0, 1.173)",,0.094051,66.35213,3.564563,0.428571,0.531354,1.310389,68.010392,,32.366787
3,4,1,10.948209,61.0,0.235102,4,1,0,3,1,0,4,4,,0.298844,"(1,)",,0.521187,66.484173,2.414298,0.818182,0.559333,0.984047,15.443906,5.867151,-3.374696
4,5,1,11.232653,63.0,0.130612,4,1,0,4,1,0,4,4,,0.29712,"(1,)",,0.560737,71.699054,2.185794,0.166667,0.438973,1.061262,11.444363,8.329975,6.377737
5,6,1,11.551927,58.0,0.188662,4,1,1,1,1,0,4,4,,0.310023,"(1,)",,0.534657,67.636708,7.635221,0.411765,0.359536,1.049956,39.36872,6.589582,16.146429
6,7,1,11.859592,58.0,0.481814,4,1,1,2,1,0,4,4,,0.305283,"(1,)",vibrato,0.584914,63.659343,5.51807,0.068182,0.403372,0.983151,39.429103,5.40675,11.239471
7,8,1,14.535692,50.0,0.159637,4,1,3,3,1,0,4,4,,0.269977,"(1,)",,-0.129185,58.507975,5.02034,0.133333,0.368384,0.927912,174.398513,,25.203232
8,9,1,14.799819,57.0,0.145125,4,2,3,4,1,0,4,4,,0.303084,"(1.0, 0.438)",,0.599931,71.17367,2.938194,0.285714,0.551884,1.064195,27.066543,7.758283,25.73643
9,10,1,14.973968,60.0,0.110295,4,2,3,4,2,0,4,4,,0.303084,"(1.0, 0.438)",,0.484532,69.632891,2.325457,0.6,0.508617,1.038483,17.141304,11.184763,15.693739


Okay, lots to unpack here. https://jazzomat.hfm-weimar.de/dbformat/dbformat.html documents every column, but there are still some I'm unfamiliar with.

After thinking about it some, tatum is basically the "minimum number of subdivisions" of a beat, and tatumprops is how those subdivisions break down. E.g. bar 0, beat 2 has 4 tatums, so one beat is divided into 4. The tatumprops here are (1.0, 1.0, 1.0, 1.173), so those 4 break down into 3 that are the same length and one that's a bit longer. The first note in that beat (eventid 2) happens on tatum 1, and the second (eventid 3) happens on tatum 4. So the first note corresponds to (3 \* 1.0) = 3 while the second corresponds to (1 \* 1.173) = 1.173. That makes the first note roughly 2.6 times as long as the second (close to 3:1), and together they fill the beat. So that's basically a dotted-eighth-to-sixteenth rhythm. This can be confirmed by looking at the sheet music of [Anthropology](https://jazzomat.hfm-weimar.de/dbformat/synopsis/scores/ArtPepper_Anthropology_FINAL.pdf).

The extra length must be due to swung eighths? Not entirely sure, to be researched. If that's correct, this is going to be a pain to make into something useful.
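
The arithmetic above can be sketched in a few lines. This assumes my reading of tatumprops is right (relative lengths of the tatums within one beat) and that a note lasts until the next note's tatum; the onset and beatdur numbers are copied from the melody rows shown above and serve as a rough cross-check:

```python
# Values copied from the melody rows for bar 0, beat 2 (eventid 2 and 3).
tatumprops = (1.0, 1.0, 1.0, 1.173)
first_note_tatum, second_note_tatum = 1, 4  # 1-based tatum positions

total = sum(tatumprops)

# Share of the beat taken by each note, assuming the first note lasts
# from its own tatum up to (not including) the second note's tatum.
first_share = sum(tatumprops[first_note_tatum - 1:second_note_tatum - 1]) / total
second_share = sum(tatumprops[second_note_tatum - 1:]) / total
ratio = first_share / second_share

# Cross-check: gap between the two onsets as a fraction of the beat duration.
onset_gap = 10.843719 - 10.637642
beatdur = 0.286621
measured_share = onset_gap / beatdur

print(round(first_share, 3), round(second_share, 3))
print(round(ratio, 2), round(measured_share, 3))
```

The share computed from tatumprops and the share measured from the onsets land within a fraction of a percent of each other, which at least supports the interpretation. An exact dotted eighth plus sixteenth would be 0.75 and 0.25.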

Apparently pitch is in "fractional MIDI," which I'm unfamiliar with.
Luckily, pretty_midi has some converters... Let's see how those work.

In [28]:
# Convert the first few melody pitches (MIDI note numbers) to note names
df_melody.pitch.head().apply(pretty_midi.note_number_to_name)

0     F4
1    D#4
2    A#3
3    C#4
4    D#4
Name: pitch, dtype: object

Okay, so that makes sense. MIDI note numbers count half steps, so the difference between two pitches is the interval between them in half steps (e.g. pitch 1 = F4 = 65 MIDI and pitch 2 = D\#4 = 63 MIDI, which are 65 - 63 = 2 half steps apart). Presumably the "fractional" part allows for pitches that fall between half steps. That form is actually much more convenient than dealing with either pitch names or frequencies.
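
As a small worked example of why the numeric form is handy: intervals are plain subtraction, and a frequency is one formula away. The sketch below assumes the usual convention that MIDI note 69 is A4 at 440 Hz in equal temperament; `midi_to_hz` is a helper written here for illustration (pretty_midi ships its own converters, which would be the thing to use in practice):

```python
def midi_to_hz(note_number):
    """Equal-temperament frequency for a (possibly fractional) MIDI note number.

    Assumes the common reference of note 69 = A4 = 440 Hz.
    """
    return 440.0 * 2.0 ** ((note_number - 69) / 12.0)


# First five melody pitches from the table above.
pitches = [65.0, 63.0, 58.0, 61.0, 63.0]

# Signed interval in half steps from each note to the next.
intervals = [later - earlier for earlier, later in zip(pitches, pitches[1:])]

print(intervals)
print(round(midi_to_hz(65.0), 2))
```

A fractional value such as 65.5 would simply land a quarter tone above F4 under the same formula.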

### Well... what songs are these actually?

In [16]:
# Songs
query = "SELECT * FROM composition_info"
df_songs = pd.read_sql_query(query, conn)
df_songs.head()


Unnamed: 0,compid,title,composer,form,template,tonalitytype,genre
0,1,Anthropology,"Parker, Gillespie",A8A8B8A8,I Got Rhythm,FUNCTIONAL,ORIGINAL
1,2,Blues for Blanche,Art Pepper,A12,Blues,BLUES,ORIGINAL
2,3,Desafinado,Antonio Carlos Jobim,A1612B8C8,,FUNCTIONAL,ORIGINAL
3,4,In a Mellow Tone,Duke Ellington,A16B16,,FUNCTIONAL,GREAT AMERICAN SONGBOOK
4,5,Stardust,"Parish, Carmichael",A16B16,,FUNCTIONAL,GREAT AMERICAN SONGBOOK


Cool. It'd be interesting to see which forms and composers crop up a lot.

In [27]:
print(len(df_songs.form.value_counts()), " unique forms")
df_songs.form.value_counts().head(15)

70  unique forms


A8A8B8A8        70
A12             60
A16B16          28
open            17
A16             13
A8B8            10
A16A16B16A16     6
A8B8A8C8         5
A8A8B8C8         5
A8A8B8C12        5
A8A8B8           4
A16A16           4
A24              4
A8B8C8D8         3
A16B8A8          3
Name: form, dtype: int64
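
Since the point of this notebook is partly to practice SQL, the same tally can be pushed into the query with `GROUP BY`. A minimal sketch on an in-memory stand-in holding only the five compositions shown above (and only three of the columns), so the counts here are not the real ones:

```python
import sqlite3

# Stand-in for composition_info, using the five rows printed above.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE composition_info (compid INTEGER, title TEXT, form TEXT)")
demo.executemany(
    "INSERT INTO composition_info VALUES (?, ?, ?)",
    [
        (1, "Anthropology", "A8A8B8A8"),
        (2, "Blues for Blanche", "A12"),
        (3, "Desafinado", "A1612B8C8"),
        (4, "In a Mellow Tone", "A16B16"),
        (5, "Stardust", "A16B16"),
    ],
)

# Count compositions per form, most common first; ties are broken by form name.
form_counts = demo.execute(
    "SELECT form, COUNT(*) AS n FROM composition_info "
    "GROUP BY form ORDER BY n DESC, form LIMIT 15"
).fetchall()

for form, n in form_counts:
    print(form, n)
```

Wrapping the same query string in `pd.read_sql_query(query, conn)` should give a DataFrame comparable to the `value_counts()` output.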

In [26]:
print(len(df_songs.composer.value_counts()), " unique composers")
df_songs.composer.value_counts().head(15)


195  unique composers


                   8
Parker             8
John Coltrane      8
Dexter Gordon      6
Sonny Rollins      6
Cole Porter        5
Bob Berg           5
Coleman            5
Thelonious Monk    5
Clifford Brown     4
Dave Holland       4
Wayne Shorter      4
Woody Shaw         4
Duke Ellington     4
Michael Brecker    4
Name: composer, dtype: int64

Okay, well, the forms are pretty concentrated, but I guess that's true for jazz: lots of AABA (A8A8B8A8) or 12-bar blues (A12), which are the two largest by far. Lots of diversity in composers though, so that's definitely cool, although 8 entries have a blank composer field.
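
To group forms by shape (AABA, AB, blues) rather than by exact label, the labels can be split into sections. The sketch below assumes each label is a sequence of "section letter + number of bars", which is my guess from labels like A8A8B8A8 and A12; `parse_form` is a hypothetical helper, and labels such as "open" or the odd-looking A1612B8C8 would need special handling:

```python
import re


def parse_form(form):
    """Split a form label like 'A8A8B8A8' into (section, bars) pairs.

    Assumes each section is one capital letter followed by a bar count.
    Labels that don't match this pattern (e.g. 'open') give an empty list.
    """
    return [(letter, int(bars)) for letter, bars in re.findall(r"([A-Z])(\d+)", form)]


for label in ("A8A8B8A8", "A12", "A16B16"):
    sections = parse_form(label)
    shape = "".join(letter for letter, _ in sections)
    total_bars = sum(bars for _, bars in sections)
    print(label, shape, total_bars)
```

Under that assumption A8A8B8A8 comes out as a 32-bar AABA and A12 as a single 12-bar section.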