Bach 2 Part Invention in F Major BWV779

coolbutuseless edited this page Dec 10, 2010 · 5 revisions


Modern musical notation encodes musical symbols for the purpose of communicating with musicians. However, the format is optimized for musical performance rather than pattern analysis. R and ggplot2 can be used to analyze music and highlight patterns in a visual medium that are otherwise somewhat difficult to discern.

Music analysis using R is not a new idea. There is an R package (tuneR) dedicated to the topic, and programs written by faculty at Indiana University Bloomington (Music Informatics) are available online. I have not seen any music analysis done using ggplot2, and thought it would provide an excellent way of visualizing attributes of musical compositions. Many have been intrigued by the symmetry, patterns and repeating structures in compositions by J.S. Bach. The visualization below is based upon a Two Part Invention (No. 8 in F Major, BWV779). It is intended to compactly illustrate the notes as they occur in time, organized by voice.

The color of the notes assists in the visual identification of repeating patterns. The x axis is based upon the time series for the track, with the scale set so that grid lines fall on each measure. The position on the y axis indicates the pitch, with the axis breaks marking the octave boundaries, and the color represents the note (which is the same regardless of the octave in which it appears).
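As a minimal sketch of why color can stay constant across octaves, a MIDI pitch number can be split into a note name and an octave. The helper `pitch_to_note` and the sharp-based spelling in `note_names` are assumptions made for illustration; the only fact taken from this page is that F4 corresponds to pitch 65.

```r
# Pitch-class names spelled with sharps (the spelling is an assumption).
note_names <- c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")

# Hypothetical helper: split a MIDI pitch number into note name and octave,
# using the convention in which pitch 60 is C4 (so pitch 65 is F4).
pitch_to_note <- function(pitch) {
  data.frame(
    Pitch  = pitch,
    Note   = note_names[pitch %% 12 + 1],  # same name in every octave
    Octave = pitch %/% 12 - 1,
    stringsAsFactors = FALSE
  )
}

# Three F's an octave apart share a note name and differ only in octave.
notes <- pitch_to_note(c(53, 65, 77))
print(notes)
```

Mapping color to `Note` and the y position to `Pitch` is what makes the same melodic figure recognizable when it reappears in a different octave.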


The data for the plot was prepared as follows:

  • Located a file at Finale’s site of Bach’s Invention No. 8 in F Major.
  • Exported the file in Finale as a MIDI file.
  • MIDI files contain additional information not relevant to this graph, so a Ruby script is used to extract the relevant data into a semicolon-delimited file. This is read into R as a data frame.
  • The MIDI file in question consisted of a single track containing both voices of the invention. R is used to separate the voices in the data frame. The steps to complete this process:
    1. For each note, identify the total number of notes that occur at the same time.
    2. If two (or more) notes occur at the same time, the note with the highest pitch is deemed voice 1.
    3. If a note occurs by itself, it is grouped with voice 1 if its pitch is greater than or equal to that of the first note in the piece (F4 at pitch 65).
    4. All remaining notes are grouped as voice 2.
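
The four steps above can also be sketched in base R, without SQL. This is an illustration on made-up data rather than the code used for the plot; the data frame `notes` and the names `highest_at_time`, `first_pitch` and `is_upper` are hypothetical.

```r
# Toy data (made up for illustration): Time in MIDI ticks, Pitch as MIDI numbers.
notes <- data.frame(
  Time  = c(0, 256, 512, 512, 768, 768, 1024),
  Pitch = c(65, 69, 65, 53, 72, 57, 60)
)

# Step 1: number of notes that share each onset time.
notes$simultaneous_notes <- ave(notes$Pitch, notes$Time, FUN = length)

# Highest pitch among the notes at each onset time (needed for step 2).
highest_at_time <- ave(notes$Pitch, notes$Time, FUN = max)

# Step 2: with simultaneous notes, the highest one belongs to voice 1.
# Step 3: a lone note belongs to voice 1 if it is at or above the first note.
first_pitch <- 65  # F4, the first note of the piece
is_upper <- ifelse(notes$simultaneous_notes > 1,
                   notes$Pitch == highest_at_time,
                   notes$Pitch >= first_pitch)

# Step 4: everything else is voice 2.
notes$Voice <- ifelse(is_upper, "1", "2")
print(notes)
```

The rule for lone notes is a heuristic: whenever the upper voice dips below F4 while the lower voice rests, that note will be assigned to voice 2.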

## Data Preparation

The sqldf package is used to derive this information from the data frame. Code to complete this work is as follows:


library(sqldf)

# Read the semicolon-delimited export and drop its first column
df=read.csv('bach2part8inF.txt',sep=';', header=TRUE)
df = df[,-1]

# Algorithm to split voices
# 1)  Identify number of simultaneous notes at the time each note occurs
df=sqldf('select df.Time, df.Pitch, df.Note, df.Octave, t2.simultaneous_notes 
            from df 
            join (select Time, count(*) simultaneous_notes from df group by Time) t2 
            on t2.Time = df.Time')

# 2)  Find all of the notes that are the highest when more than one occurs at the same time, and combine them with
# the set of all notes that occur by themselves but are at least as high as the first note in the piece
upper=sqldf('select Time, max(Pitch) Pitch from df where simultaneous_notes >1 group by Time 
             union 
             select Time, Pitch from df where simultaneous_notes=1 and Pitch >= 65')

# 3)  Identify voice 1 (upper voice) and 2 (lower voice).  Note that this is not exactly accurate 
# for the last chord which has 4 simultaneous notes
df=sqldf('select df.*, "1" Voice from df where (Time + Pitch) in (select Time + Pitch from upper) 
 union select df.*, "2" Voice from df where (Time + Pitch) not in (select Time + Pitch from upper)')

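# Octave boundaries for the y axis: one semitone above the highest pitch found in each octave.
# The first column is aliased to Octave so that the plot code can refer to it by name.
octave_boundary=sqldf('select df.Octave + 1 as Octave, max(df.Pitch) + 1 as Pitch from df group by df.Octave having df.Octave < 6')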
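
One caveat about step 3 above: matching rows on the sum `Time + Pitch` can, in principle, confuse two different notes whose sums happen to be equal. Whether that occurs in this particular file depends on its tick resolution and is not checked here. The sketch below uses made-up rows to show the effect, together with a composite key built with `paste` as one possible alternative.

```r
# Made-up rows: the upper voice holds a single note (Time 100, Pitch 65).
upper <- data.frame(Time = 100, Pitch = 65)
notes <- data.frame(Time = c(100, 105), Pitch = c(65, 60))

# Summed key: 100 + 65 and 105 + 60 are both 165, so both rows match.
sum_match <- (notes$Time + notes$Pitch) %in% (upper$Time + upper$Pitch)

# Composite key: pasting the two fields together keeps the pairs distinct.
key_match <- paste(notes$Time, notes$Pitch) %in% paste(upper$Time, upper$Pitch)

print(data.frame(notes, sum_match, key_match))
```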

## Visualization Code

The code specific to creating the plot:


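library(ggplot2)

# One line per voice; color follows the note name, so repeated figures stand out
ggplot(data=df, aes(Time, Pitch, color=Note, group=Voice)) + 
geom_line() + 
geom_point() +
# One break per measure (3072 ticks each), labelled with the measure number
scale_x_continuous('Measure', breaks=seq(0,103000, 3072), labels=1:length(seq(0,103000, 3072))) +
scale_y_continuous("Octave", breaks=as.numeric(octave_boundary$Pitch), labels=octave_boundary$Octave) +
# ggtitle() replaces opts(title=...), which is no longer available in current ggplot2
ggtitle("Bach 2 Part Invention #8 in F Major (BWV779)")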
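
The arithmetic behind the x axis can be checked on its own. This sketch takes the 3072 ticks per measure and the upper limit of 103000 from the plotting code; the helper `measure_of` is hypothetical and simply converts a tick position into a 1-based measure number.

```r
ticks_per_measure <- 3072  # spacing of the breaks in the plotting code

# Break positions and the measure numbers used as their labels.
measure_breaks <- seq(0, 103000, ticks_per_measure)
measure_labels <- seq_along(measure_breaks)

# Hypothetical helper: 1-based measure number for a time given in ticks.
measure_of <- function(time) time %/% ticks_per_measure + 1

print(length(measure_breaks))
print(measure_of(c(0, 3071, 3072, 6144)))
```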


## Additional Analysis

Traditional summarization techniques can be applied easily because the data is readily available in a data frame. This stacked bar chart illustrates the occurrences of notes by octave.

A similar analysis is the occurrences of notes by voice.
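
The counts behind both kinds of chart can be tabulated directly with `table`. The sketch below runs on a small made-up data frame whose columns `Note`, `Octave` and `Voice` mirror the ones used on this page; the values are illustrative and are not taken from the invention.

```r
# Made-up notes for illustration only.
notes <- data.frame(
  Note   = c("F", "A", "F", "C", "F", "A", "C"),
  Octave = c(4, 4, 5, 5, 3, 3, 4),
  Voice  = c("1", "1", "1", "1", "2", "2", "2")
)

# Occurrences of each note, broken down by octave and by voice.
by_octave <- table(notes$Note, notes$Octave)
by_voice  <- table(notes$Note, notes$Voice)

print(by_octave)
print(by_voice)
```

Each column of such a table corresponds to one group of stacked segments in a bar chart of the same data.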

There are numerous repeating patterns throughout the piece (for instance, notice the sequences that are staggered between the voices at the end of measure 2 into the beginning of measure 4). The graph can be modified by limiting the scope of the data frame. For instance, you might choose to view only the first four measures:

df[df$Time <= 18750,]

If you wanted to separate each voice into its own chart, you can use facets:

facet_wrap(~ Voice, ncol=1)

This visualization illustrates one of the limitations in the algorithm used to split the voices: each note is assigned to exactly one voice. An F is shared by both voices, and has been edited in after rendering.

## Name and Affiliation

By Casimir Saternos

All code and data are available via the links above or in a GitHub repository.