# Background

English-language folk songs have a long tradition and have changed over time. Songs are not easily identifiable by name alone, and lyrics often vary between versions. Steve Roud began indexing his own collection in the 1970s, and his Roud Index has become the standard for grouping together different versions of the same song. He is still indexing as of 2023.

Could a machine learning algorithm hope to match his skill? Given the lyrics, would it choose the same groupings of songs, where the line between "same" and "different" is fuzzy? Could it help with future indexing?

# Data

Although the Roud Index is lyrics-based rather than tune-based, the officially hosted index at vwml.com does not include lyric transcriptions as a standard data field. Some lyrics appear in scanned images of historical collections, others on linked external sites, and others not at all. So the first challenge is to assemble a dataset that pairs enough full lyrics with Roud numbers. The main contenders are Mudcat and The Traditional Ballad Index, both well-established online song databases.

- Mudcat focuses on song lyrics and tunes, but also contains Roud numbers for approximately 300 songs. Data formats:
    - Digitrad/DT database MS-DOS download (last updated in 2002)
    - Song web pages
    - Forum posts containing songs
- The Ballad Index focuses on cataloguing*, but also has supplementary lyrics for approximately 1110 songs. Data formats:
    - The Ballad Index Software Filemaker database download
    - Song web pages (without lyrics)
    - The Ballad Index and The Supplemental Tradition as HTML or TXT lists (the Supplemental Tradition holds the lyrics, which the Ballad Index cites as references beginning ST, alongside DT references)

&ast; This is similar to Roud, but focused on the basic unit of a song rather than its individual instances (e.g. songbook entries or performances), and therefore uses song titles as its main identifiers, with keywords and first lines for disambiguation.


Neither database will open. I need either to convert the Ballad Index and ST txt/html data to CSV and join them, or to find a way to open one of the databases and export the data.
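
The first option, converting both text sources to tables and joining them, might look roughly like the sketch below. It assumes, without confirmation from the source files, that each Supplemental Tradition lyric can be keyed by the same `File` identifier as its Ballad Index entry. The names `index_rows`, `lyric_rows` and `join_on_file`, and all row values, are hypothetical placeholders.

```python
import csv
import io

def join_on_file(index_rows, lyric_rows):
    """Left-join lyric text onto index rows using the shared 'file' key."""
    lyrics_by_file = {row['file']: row['lyrics'] for row in lyric_rows}
    return [
        {**row, 'lyrics': lyrics_by_file.get(row['file'], '')}
        for row in index_rows
    ]

# Placeholder rows standing in for parsed Ballad Index and ST records
# (assumption: both sources expose the same 'file' identifier)
index_rows = [
    {'name': 'Song A', 'roud': '1', 'file': 'AAA001'},
    {'name': 'Song B', 'roud': '2', 'file': 'BBB002'},
]
lyric_rows = [{'file': 'AAA001', 'lyrics': 'first line of Song A'}]

joined = join_on_file(index_rows, lyric_rows)

# Write the joined table as CSV (to a string buffer here; a file works the same way)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['name', 'roud', 'file', 'lyrics'])
writer.writeheader()
writer.writerows(joined)
csv_text = buffer.getvalue()
```

Index entries without a matching lyric keep an empty `lyrics` value, so no song is dropped by the join. With pandas, the equivalent left join would be `index_df.merge(lyrics_df, on='file', how='left')`.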

In [6]:
import re
import pandas as pd

file_path = './Data/BalladIndex/txt/BDIDXTXT/balldidxedited.txt'
with open(file_path, 'r') as file:
    data = file.read()

Best extraction algorithms so far (could be improved)

In [None]:
# Regular expression pattern to find each record's start and end positions
record_pattern = re.compile(r'===\n(.*?)===', re.DOTALL)

# Find all records in the text
records = re.finditer(record_pattern, data)

# Initialize lists to store extracted information
names = []
descriptions = []
earliest_dates = []
found_ins = []
keywords = []
references = []
cross_references = []
roud = []
files = []

# Regular expression patterns to extract individual fields within each record
field_patterns = {
    'name': r'NAME: (.+?)(?:\n|$)',
    'description': r'DESCRIPTION: (.+?)(?:\n|$)',
    'earliest_date': r'EARLIEST_DATE: (.+?)(?:\n|$)',
    'found_in': r'FOUND_IN: (.+?)(?:\n|$)',
    'keywords': r'KEYWORDS: (.+?)(?:\n|$)',
    'references': r'REFERENCES: (.+?)(?:\n|$)',
    'cross_references': r'CROSS_REFERENCES:\n(.+?)\n',
    'roud': r'ROUD: (.+?)(?:\n|$)',
    'file': r'File: (.+?)(?:\n|$)'
}

# Map each field name to the list that collects its values
field_lists = {
    'name': names,
    'description': descriptions,
    'earliest_date': earliest_dates,
    'found_in': found_ins,
    'keywords': keywords,
    'references': references,
    'cross_references': cross_references,
    'roud': roud,
    'file': files
}

# Extract information for each record; a missing field becomes an empty string
for record in records:
    for field, pattern in field_patterns.items():
        match = re.search(pattern, record.group(1), re.DOTALL)
        value = match.group(1).strip() if match else ""
        field_lists[field].append(value)

# Create a dictionary with the extracted information
data_dict = {
    'name': names,
    'description': descriptions,
    'earliest_date': earliest_dates,
    'found_in': found_ins,
    'keywords': keywords,
    'references': references,
    'cross_references': cross_references,
    'roud': roud,
    'file': files
}

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data_dict)

In [None]:
df
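
One thing worth checking about `record_pattern`: each match consumes its closing `===`, so a file in which records are separated by a single `===` line would yield only every second record. Whether that applies depends on the layout of `balldidxedited.txt`, which is not shown here. A minimal sketch of a lookahead variant that leaves the closing delimiter in place for the next match, run on a made-up three-record sample:

```python
import re

# Made-up sample: three records separated by single '===' lines
sample = "===\nNAME: A\n===\nNAME: B\n===\nNAME: C\n===\n"

# Original pattern: the closing '===' is consumed by the match
consuming = re.compile(r'===\n(.*?)===', re.DOTALL)
# Variant: the closing '===' is only looked at, so it can open the next record
lookahead = re.compile(r'===\n(.*?)(?====)', re.DOTALL)

consumed_names = [m.group(1).strip() for m in consuming.finditer(sample)]
lookahead_names = [m.group(1).strip() for m in lookahead.finditer(sample)]

print(consumed_names)   # ['NAME: A', 'NAME: C']
print(lookahead_names)  # ['NAME: A', 'NAME: B', 'NAME: C']
```

If the real file separates records with a doubled delimiter, both patterns should find the same records, so comparing the two counts on `data` is a cheap sanity check.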

In [None]:
# Regular expression pattern to find each record's start and end positions
record_pattern = re.compile(r'===\n(.*?)===', re.DOTALL)

# Find all records in the text
records = re.finditer(record_pattern, data)

# Initialize lists to store extracted information
names = []
descriptions = []
earliest_dates = []
found_ins = []
keywords = []
references = []
cross_references = []
roud = []
files = []

# Regular expression patterns to extract individual fields within each record
field_patterns = {
    'name': r'NAME: (.+?)(?:\n|$)',
    'description': r'DESCRIPTION: ((?:(?!KEYWORDS|FOUND_IN|REFERENCES|ROUD|File:|CROSS_REFERENCES:).)*)(?:\n|$)',
    'earliest_date': r'EARLIEST_DATE: (.+?)(?:\n|$)',
    'found_in': r'FOUND_IN: (.+?)(?:\n|$)',
    'keywords': r'KEYWORDS: (.+?)(?:\n|$)',
    'references': r'REFERENCES: (.+?)(?:\n|$)',
    'cross_references': r'CROSS_REFERENCES:\n(.+?)\n',
    'roud': r'ROUD: (.+?)(?:\n|$)',
    'file': r'File: (.+?)(?:\n|$)'
}

# Extract information for each record
for record in records:
    for field, pattern in field_patterns.items():
        value = re.search(pattern, record.group(1), re.DOTALL)
        if value:
            value = value.group(1).strip()
        else:
            value = ""
        # Special handling for 'description' field
        if field == 'description':
            if ': see' in value:
                description, rest = value.split(': see', 1)
                description = description.strip()
                # Append 'see' to 'description' and retain 'file' value if present
                rest_match = re.search(r'File:\s*(.*?)(\)|$)', rest)
                if rest_match:
                    description += ': see' + rest_match.group(1).strip()
                    file_value = rest_match.group(2)
                else:
                    file_value = ""
                descriptions.append(description)
            else:
                descriptions.append(value)
        elif field == 'file':
            # Remove trailing ')' if present in 'file' value
            value = value.rstrip(')')
            files.append(value)
        else:
            # Append the extracted value to the corresponding list
            if field == 'name':
                names.append(value)
            elif field == 'earliest_date':
                earliest_dates.append(value)
            elif field == 'found_in':
                found_ins.append(value)
            elif field == 'keywords':
                keywords.append(value)
            elif field == 'references':
                references.append(value)
            elif field == 'cross_references':
                cross_references.append(value)
            elif field == 'roud':
                roud.append(value)

# Create a dictionary with the extracted information
data_dict = {
    'name': names,
    'description': descriptions,
    'earliest_date': earliest_dates,
    'found_in': found_ins,
    'keywords': keywords,
    'references': references,
    'cross_references': cross_references,
    'roud': roud,
    'file': files
}

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data_dict)

In [None]:
df
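
The negative-lookahead `description` pattern above stops at a hand-written list of field labels, so a label missing from that list (for example `EARLIEST_DATE`) would be swallowed into the description if it follows it directly. As an alternative sketch, not the approach used in this notebook, the record can be cut at every known label that starts a line, which keeps multi-line values for all fields. The helper name `split_fields` and the sample record are made up for illustration; the labels are the ones that appear in `field_patterns`.

```python
import re

LABELS = ['NAME', 'DESCRIPTION', 'EARLIEST_DATE', 'FOUND_IN', 'KEYWORDS',
          'REFERENCES', 'CROSS_REFERENCES', 'ROUD', 'File', 'DT']

# A label counts only at the start of a line, followed by a colon
LABEL_RE = re.compile(r'^(' + '|'.join(LABELS) + r'):[ \t]*', re.MULTILINE)

def split_fields(record_text):
    """Return {label: value}, where each value runs up to the next label."""
    fields = {}
    matches = list(LABEL_RE.finditer(record_text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(record_text)
        fields[match.group(1)] = record_text[match.end():end].strip()
    return fields

# Invented record, not taken from the Ballad Index
sample_record = (
    "NAME: Example Song\n"
    "DESCRIPTION: First line of the description\n"
    "continues on a second line\n"
    "EARLIEST_DATE: 1900\n"
    "CROSS_REFERENCES:\n"
    "cf. Another Song\n"
    "File: XX001\n"
)
fields = split_fields(sample_record)
```

A label that appears mid-line, such as a `(File: ...)` note inside a name, is deliberately not treated as a field start here and would need separate handling.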

Latest extraction algorithm (testing)

In [9]:
# Regular expression pattern to find each record's start and end positions
record_pattern = re.compile(r'===\n(.*?)===', re.DOTALL)

# Find all records in the text
records = re.finditer(record_pattern, data)

# Initialize lists to store extracted information
names = []
descriptions = []
earliest_dates = []
found_ins = []
keywords = []
references = []
cross_references = []
roud = []
files = []
dt = []

# Regular expression patterns to extract individual fields within each record
field_patterns = {
    'name': r'NAME: (.+?)(?:\n|$)',
    'description': r'DESCRIPTION: ((?:(?!KEYWORDS|FOUND_IN|REFERENCES|ROUD|File:|CROSS_REFERENCES:|DT:).)*)(?:\n|$)',
    'earliest_date': r'EARLIEST_DATE: (.+?)(?:\n|$)',
    'found_in': r'FOUND_IN: (.+?)(?:\n|$)',
    'keywords': r'KEYWORDS: (.+?)(?:\n|$)',
    'references': r'REFERENCES: (.+?)(?:\n|$)',
    'cross_references': r'CROSS_REFERENCES:\n(.+?)(?:(?=\n{2}|===)|\n\Z)',
    'roud': r'ROUD: (.+?)(?:\n|$)',
    'file': r'File: (.+?)(?:\n|$)',
    'dt': r'DT: (.+?)(?:\n|$)'
}

# Extract information for each record
for record in records:
    record_data = {}
    for field, pattern in field_patterns.items():
        value = re.search(pattern, record.group(1), re.DOTALL)
        if value:
            value = value.group(1).strip()
        else:
            value = ""
        # Special handling for 'description' and 'name' fields
        if field == 'description':
            # Stored as-is; ' : see' cross-references are split off in the 'name' branch below
            descriptions.append(value)
        elif field == 'name':
            # Check if ': see' is in 'name' field
            if ' : see' in value:
                cross_ref_idx = value.find(' : see')
                cross_ref_text = value[cross_ref_idx + 6:].strip()
                if cross_ref_text.startswith('File:'):
                    cross_ref_text = cross_ref_text[5:].strip()
                cross_references.append(cross_ref_text)
                value = value[:cross_ref_idx].strip()
            else:
                cross_references.append("")
            names.append(value)
        elif field == 'file':
            # Remove trailing ')' if present in 'file' value
            value = value.rstrip(')')
            files.append(value)
        else:
            # Append the extracted value to the record_data dictionary
            record_data[field] = value
    
    # Append any missing fields with empty strings
    for field in field_patterns.keys():
        if field not in record_data:
            record_data[field] = ""
    
    # Append the record_data to the corresponding lists
    earliest_dates.append(record_data['earliest_date'])
    found_ins.append(record_data['found_in'])
    keywords.append(record_data['keywords'])
    references.append(record_data['references'])
    roud.append(record_data['roud'])
    dt.append(record_data['dt'])

# Fill the 'cross_references' field with empty strings for records with missing 'description' field
while len(cross_references) < len(names):
    cross_references.append("")

# Create a dictionary with the extracted information
data_dict = {
    'name': names,
    'description': descriptions,
    'earliest_date': earliest_dates,
    'found_in': found_ins,
    'keywords': keywords,
    'references': references,
    'cross_references': cross_references,
    'roud': roud,
    'file': files,
    'dt': dt
}

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data_dict)
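
Keeping ten parallel lists in step is fragile: `pd.DataFrame` raises a `ValueError` if they end up with different lengths, and in the loop above the value matched by the `cross_references` pattern is stored in `record_data` but never reaches the `cross_references` list. A possible simplification, sketched here on the assumption that one dictionary per record is acceptable, is to collect rows and let pandas build the columns. `extract_rows` is a hypothetical helper, the pattern table is cut down to two fields for brevity, and the sample text is invented.

```python
import re

# Cut-down pattern table; the full field_patterns dict would be used in practice
patterns = {
    'name': r'NAME: (.+?)(?:\n|$)',
    'cross_references': r'CROSS_REFERENCES:\n(.+?)(?:\n\n|\Z)',
}

def extract_rows(text, patterns):
    """Build one dict per record so every row has exactly the same keys."""
    rows = []
    # Lookahead on the closing '===' so single-line separators work in this sample
    for record in re.finditer(r'===\n(.*?)(?====)', text, re.DOTALL):
        row = {}
        for field, pattern in patterns.items():
            match = re.search(pattern, record.group(1), re.DOTALL)
            row[field] = match.group(1).strip() if match else ""
        rows.append(row)
    return rows

sample = (
    "===\nNAME: Song A\nCROSS_REFERENCES:\ncf. Song B\n"
    "===\nNAME: Song B\n===\n"
)
rows = extract_rows(sample, patterns)
```

`pd.DataFrame(rows)` then yields one column per key, and a field added to `patterns` shows up as a column without any further bookkeeping.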

In [12]:
df.sort_values('dt')

| | name | description | earliest_date | found_in | keywords | references | cross_references | roud | file | dt |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10,000 Years Ago: see I Was Born About Ten Tho... | | | | | | | | R410 | |
| 9991 | Old Rafting Chant | "Thus drifting to sea on a hick of white pine,... | 1831 (Shoemaker-MountainMinstrelsyOfPennsylvania) | US(MA) | work ship | (1 citation) | | 15029 | SHoe234 | |
| 9992 | Old Rattler | Chorus: "Here, Rattler, Here." Rattler is a gr... | 1924 (recording, George Reneau) | US(SE,So) | dog manhunt prison escape captivity worksong c... | (9 citations) | | 6381 | CNFM104 | |
| 9993 | Old Rebel Soldier, The: see The Good Old Rebel... | | | | | | | | Wa193 | |
| 9994 | Old Redskin, The | "Did you ever hear the story of the old Redski... | 1957 (Beck-FolkloreOfMaine) | US(NE) | ship wreck moniker fishing | (1 citation) | | | BeMa165 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15139 | Young Ladies in Town | "Young ladies in town, and those that live 'ro... | 1769 (in the "Boston Newsletter") | US | clothes patriotic commerce | 1767 - Passage of the Townshend Acts. Britain ... | | | SBoA057 | YNGLADIE* |
| 15031 | You Are My Sunshine | The singer dreams his "sunshine" is in his arm... | 1940 (Davis) | US | courting love promise rejection warning nonbal... | (2 citations) | | 18130 | Hopk084A | YOUMYSUN |
| 15196 | Zack, the Mormon Engineer | Zack, the Mormon engineer, has a wife in every... | 1951 (recording, L. M. Hilton; published 1952) | US(Ro) | marriage railroading humorous train | (4 citations) | | 4761 | BRaF444 | ZACKMORM* |
| 4258 | Galway Races, The | On August 17 "half a million" gather at Galway... | 1939 (OLochlainn-IrishStreetBallads); 19C (bro... | Ireland | racing dancing food music Ireland political horse | (2 citations) | | 3031 | OLoc010 | [abbreviation unknown, but it's in there] |
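
Several names in the output above still carry their pointer text (for example `10,000 Years Ago: see I Was Born About Ten Tho...`), which suggests that the `' : see'` test, with a space before the colon, does not match entries written as `': see'`. A hedged sketch of a splitter that tolerates optional whitespace before the colon follows. `split_see_reference` is a hypothetical helper, the trailing `(File: ...)` note is an assumption about the raw name format, and the test strings are invented rather than taken from the index.

```python
import re

# ': see ' with optional whitespace before the colon
SEE_RE = re.compile(r'\s*:\s*see\s+')

def split_see_reference(name):
    """Split 'X: see Y (File: Z)' into (X, Y); return (name, '') if no pointer."""
    parts = SEE_RE.split(name, maxsplit=1)
    if len(parts) == 1:
        return name.strip(), ""
    # Assumption: a trailing '(File: ...)' note may follow the target title
    target = re.sub(r'\s*\(File:.*$', '', parts[1])
    return parts[0].strip(), target.strip()

with_file = split_see_reference("Example Song, The: see Another Song (File: XX001)")
with_space = split_see_reference("Example Song : see Another Song")
plain = split_see_reference("Plain Title")
```

A genuine title that happens to contain `: see ` would be split as well, so the result is best spot-checked against entries that also have a description.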
