# Chord Progression Frequency Analysis
This python notebook will look into the frequency of chord progressions in a composer's repertoire.



## Table of Contents:

* [Preprocessing](#preprocessing)
    * [Sequence Dictionary](#seq-dict)
    * [Loading Datasets](#loading-datasests)
* [Method 1 - Counting Overall Percentage](#method-1)
* [Method 2 - Provide percentages at every chord](#method-2)
* [Method 3 - Kernel Frequency Analysis](#method-3)

# Preprocessing  <a class="anchor" id="preprocessing"></a>

Before we begin the analysis, we will create a couple of helper functions to conduct the rest of the frequency analysis methods.

## 1. Sequence Dictionary <a class="anchor" id="seq-dict"></a>
The first step is to determine the frequency of an n-long sequence in a text file formatted as in the List Generator notebook.

In [1]:
def build_n_sequence_dictionary(n : int, txt : str) -> dict:
    """
    Given a length n, returns a dictionary containing sequences
    in txt of length n mapped to the number of times they appear
    in txt.
    """
    seq_dict = dict()
    
    blocks = txt.split('\n')
    for block in blocks:
        elems = block.split(',')
        for i in range(len(elems) - n):
            seq = elems[i:i+n]
            seq_dict[tuple(seq)] = seq_dict.get(tuple(seq), 0) + 1
    return seq_dict            

In [2]:
# test build_n_sequence_dictionary()

# The following text is the beginning of 'monteverdi/madrigal.3.1.rntxt'
test_txt = "vi,V,I,IV,I,V,V,i,V,i,VI,V,i,V,i,i,i,I,IV,ii,vii,I,vi,I,I,IV,vii"
seq_dict = build_n_sequence_dictionary(2, test_txt)

# Show the most frequently found chord progressions of length 2
print(sorted(seq_dict.items(), key=lambda kv:(-kv[1], kv[0]))[:5])

[(('V', 'i'), 4), (('I', 'IV'), 3), (('i', 'V'), 2), (('i', 'i'), 2), (('I', 'I'), 1)]


## 2. Loading the Datasets  <a class="anchor" id="loading-datasets"></a>

The next helper function will load the datasets that are generated from the `List Generator` Jupyter notebook. Given the desired composer and whether or not we want to use the dataset that encodes inversions, this method will return the contents of the respective file. 

In [3]:
def get_dataset(composer: str, with_inversions=False) -> str:
    '''
    Given a composer and a flag for inversions, returns the correct
    dataset containing the composer's repertoire as a string.
    
    If with_inversions is True, the dataset returned will contain
    inversions as (0), (1), etc. after each roman numeral, where
    (0) is the root position.
    '''
    if composer == 'bach':
        if with_inversions:
            return open("datasets/inv-dataset-bach.txt","r").read()
        else:
            return open("datasets/simple-dataset-bach.txt","r").read()
    elif composer == 'monteverdi':
        if with_inversions:
            return open("datasets/inv-dataset-monteverdi.txt","r").read()
        else:
            return open("datasets/simple-dataset-monteverdi.txt","r").read()
    else:
        raise NotImplementedError("The composer you've entered is not in the database.")

# Method 1 - Counting and getting an overall percentage <a class="anchor" id="method-1"></a>

For this function, we count the number of times the given progression appears in the composer's repertoire and calculate a percentage based on all n-long chord progressions used in their repertoire.

In [4]:
def progression_count_and_percent(prog: str, composer: str, with_inversions=False, as_msg=False):
    '''
    Given a chord progression formatted as roman numerals separated by commas
    (if with_inversions=True, parentheses containing the inversion number, 0 
    being the root position, follow each roman numeral) and a composer 
    (currently 'bach' and 'monteverdi' are the only ones supported),
    returns a tuple containing the number of times the given progression
    appears in the composer's repertoire and the percentage used based on
    all n-long progressions. 
    
    If as_msg is True, returns a string containing information in English instead.
    '''
    # removes any whitespace
    prog = prog.replace(' ', '')

    txt = get_dataset(composer, with_inversions)
        
    # creates the dictionary given the progression
    prog_list = tuple(prog.split(','))
    seq_dict = build_n_sequence_dictionary(len(prog_list), txt)
    total_progs = sum(seq_dict.values())
    num_times = seq_dict.get(prog_list, 0)
    
    if as_msg:
        msg = ''
        if num_times == 0:
            msg = composer + ' never uses [' + prog + '] in their repertoire.'
        if num_times == 1:
            msg = composer + ' uses [' + prog + '] ' + str(num_times) + ' time in their repertoire.\n'
        else: 
            msg = composer + ' uses [' + prog + '] ' + str(num_times) + ' times in their repertoire.\n'
        msg += 'Compared to other length-' + str(len(prog_list)) + ' progressions, [' + prog + '] is used {:.2f}'.format(num_times/total_progs*100) + '% of the time.'
        return msg
    return num_times, num_times/total_progs

In [5]:
# test without inversion
print("Test Without Inversions")
prog = 'V,I'
print(progression_count_and_percent(prog, 'monteverdi', as_msg=True))

# test with inversion
print("\nTest With Inversions")
prog = 'V(0),I(0)'
print(progression_count_and_percent(prog, 'monteverdi', with_inversions=True, as_msg=True))

Test Without Inversions
monteverdi uses [V,I] 446 times in their repertoire.
Compared to other length-2 progressions, [V,I] is used 6.74% of the time.

Test With Inversions
monteverdi uses [V(0),I(0)] 312 times in their repertoire.
Compared to other length-2 progressions, [V(0),I(0)] is used 4.72% of the time.


# Method 2 - Provide percentages at every chord <a class="anchor" id="method-2"></a>
For this function, we output the percentage of times a composer uses each chord in the progression given the previous chords in the progression.

In [6]:
def progression_probability(prog: str, composer: str, with_inversions=False, as_msg=False):
    '''
    Given a chord progression formatted as roman numerals separated by commas
    (if with_inversions=True, parentheses containing the inversion number, 0 
    being the root inversion, follow each roman numeral) and a composer 
    (currently 'bach' and 'monteverdi' are the only ones supported),
    returns a list containing the probability that the i-th chord in the
    progression follows the i-1 chords that precede it using its frequency
    in the composer's repertoire.
    
    If as_msg is True, returns a string containing information in English instead.
    '''
    # removes any whitespace
    prog = prog.replace(' ', '')
    
    txt = get_dataset(composer, with_inversions)    
    
    prog_prob = [] # empty list to contain tuples of chord progression and probabilities
    prog_list = tuple(prog.split(','))
    # creates the dictionary given the progression
    seq_dict = build_n_sequence_dictionary(len(prog_list), txt) 
    
    # values to be modified as i-1 chords in progression are given
    total_progs = sum(seq_dict.values())
    chords_to_check = seq_dict.keys()
    
    # flag used when progression is not used in repertoire
    shortcut = False
    
    for i in range(len(prog_list)):
        if shortcut:
            prog_prob.append((prog_list[:i+1], 0.))
            continue
            
        same_start_chords = list(filter(lambda a: (a[:i+1] == prog_list[:i+1]), chords_to_check))
        
        # count number of same_start_chords in repertoire
        progs = 0
        for chord in same_start_chords:
            progs += seq_dict[chord]
        if progs == 0:
            # shows actual chords that follow
            shortcut = True
        
        # add probability to list
        prog_prob.append((prog_list[:i+1], progs/total_progs))
        # update values assuming i-1 chords are given
        total_progs = progs
        chords_to_check = same_start_chords
    
    if as_msg:
        msg = ''
        
        # msg for first chord in progression
        msg += composer + ' starts a progression with [' + prog_list[0] + \
                '] {:.2f}'.format(prog_prob[0][1]*100) + '% of the time.'
        for i in range(1, len(prog_list)):
            msg += '\n' + composer + ' follows [' + ','.join([chord for chord in prog_list[:i]]) + \
                    '] with [' + prog_list[i] + '] {:.2f}'.format(prog_prob[i][1]*100) + '% of the time.'
        return msg
    return prog_prob

In [7]:
# test without inversion
prog = 'V,I,IV,V,vi,iii,V'
print(progression_probability(prog, 'monteverdi', as_msg=True))

monteverdi starts a progression with [V] 22.73% of the time.
monteverdi follows [V] with [I] 30.12% of the time.
monteverdi follows [V,I] with [IV] 14.19% of the time.
monteverdi follows [V,I,IV] with [V] 29.03% of the time.
monteverdi follows [V,I,IV,V] with [vi] 11.11% of the time.
monteverdi follows [V,I,IV,V,vi] with [iii] 50.00% of the time.
monteverdi follows [V,I,IV,V,vi,iii] with [V] 0.00% of the time.


In [8]:
# test with inversion
prog = 'V(0),I(0)'
print(progression_probability(prog, 'monteverdi', with_inversions=True, as_msg=True))

monteverdi starts a progression with [V(0)] 19.96% of the time.
monteverdi follows [V(0)] with [I(0)] 23.64% of the time.


#  Method 3 - Sliding Kernel <a class="anchor" id="method-3"></a>

This approach is similar to the approach in Method 2, but with a fixed length look back. This method slides a "kernel" across the provided progression and returns the result of the frequencies aas a list. 

Using `look_back` consecutive chords, see if the next chord is followed with more than `threshold` probability in the given composers repertoire.

In [9]:
def progression_kernel(prog: str, composer: str, with_inversions=False, look_back=2, verbose=False):
    '''
    Given a progression as a comma-separated string and the composer we want to compare the
    chord progression to, we will use a sliding kernel to determine the likelihood that a 
    sequence of look_back chords in the composer's repertoire is followed by the next chord 
    in the provided progression. 
    
    If verbose is true, the method will print the likelihood of each kernel progression as we
    loop.
    '''
    percentages = []
    
    # removes any whitespace
    prog = prog.replace(' ', '')

    txt = get_dataset(composer, with_inversions)
        
    # creates the dictionary given the progression
    prog_list = tuple(prog.split(','))
    
    if verbose:
        print("Analyzing progression: [{}]".format(prog))
    
    look_back_dict = build_n_sequence_dictionary(look_back, txt)
    
    seq_dict = build_n_sequence_dictionary(look_back + 1, txt)

    # For every look_back + 1 sequence in the progression
    for i in range(len(prog_list)-look_back):
        
        # Total Occurances of look_back
        look_back_occurances = look_back_dict.get((prog_list[i:i+look_back]),0)
        # Num Occurances look back is followed by the next chord
        look_back_with_next_chord = seq_dict.get(prog_list[i:i+look_back+1], 0)
        
        # Calculate the frequency that a chord follows the n previous
        if look_back_occurances == 0:
            prog_percent = 0
        else:
            prog_percent = look_back_with_next_chord / look_back_occurances
        
        percentages.append(prog_percent)
        
        if verbose:
            print("{} follows [{}] with [{}] {:.2f}% of the time".format(
                composer, 
                ",".join(prog_list[i:i+look_back]), 
                prog_list[i+look_back],
                prog_percent * 100))
        
    return percentages

In [10]:

print("--  Example 1")
progression_kernel('V,I,IV,V,vi,iii,V', "monteverdi",look_back=2,verbose=True)

print("\n--  Example 2")

progression_kernel('V,I,IV,V,vi,iii,V'[::-1], "monteverdi",look_back=2,verbose=True)

--  Example 1
Analyzing progression: [V,I,IV,V,vi,iii,V]
monteverdi follows [V,I] with [IV] 14.13% of the time
monteverdi follows [I,IV] with [V] 20.83% of the time
monteverdi follows [IV,V] with [vi] 7.34% of the time
monteverdi follows [V,vi] with [iii] 8.70% of the time
monteverdi follows [vi,iii] with [V] 9.09% of the time

--  Example 2
Analyzing progression: [V,iii,iv,V,VI,I,V]
monteverdi follows [V,iii] with [iv] 0.00% of the time
monteverdi follows [iii,iv] with [V] 100.00% of the time
monteverdi follows [iv,V] with [VI] 0.00% of the time
monteverdi follows [V,VI] with [I] 0.00% of the time
monteverdi follows [VI,I] with [V] 28.57% of the time


[0.0, 1.0, 0.0, 0.0, 0.2857142857142857]

In [11]:
prog = input("Progression: ")
progression_kernel(prog, "monteverdi",look_back=2,verbose=True)

Progression: I,V,i
Analyzing progression: [I,V,i]
monteverdi follows [I,V] with [i] 6.71% of the time


[0.06713780918727916]

# Simple Likelihood Analysis

The basics of this analysis is to look at the list of frequencies that are given by the kernel method above. If any part of the progression is below some arbitrary threshold, `progression_likelihood` says that the progression is unlikely. If all parts of the progression is above a different threshold, it says the progression is fairly likely. Otherwise, it says that it cannot give a strong opinion one way or the other. 

In [12]:
def progression_likelihood (prog: str, composer: str, with_inversions=False, look_back=2):
    '''
    Given a progression as a comma-separated string and the composer we want to compare the
    chord progression to, determines the similarity of this chord progression to those used 
    by the composer, using a sliding kernel.
    
    If the likelyhood of any kernel is below a specified threshold, it returns that the 
    provided progression is unlikely to be used by the composer.
    
    If the likelyhood of all kernels are above a specified threshold, it returns that the 
    provided progression is pretty likely to be used by the composer.
    
    Otherwise it returns that the provided progression seems somewhat reasonable.
    '''
    
    # Arbitrary Thresholds
    unlikely_threshold = 0.1
    likely_threshold = 0.4
    # Saves the list of frequencies given by the kernel method above
    probabilities = progression_kernel(prog, composer, with_inversions=with_inversions,look_back=look_back)
    
    if min(probabilities) < unlikely_threshold:
        return "Not Likely"
    if min(probabilities) > likely_threshold:
        return "Fairly Likely"
    return "Somewhat reasonable"
    

In [13]:
prog = input("Progression?")
print(progression_likelihood(prog, "monteverdi", with_inversions=True,look_back=2))

Progression?I(0),V(0),iii(3)
Not Likely
