# Background

English-language folk songs have a long tradition and have changed over time. Songs are not easily identifiable by name alone, and lyrics often vary between versions. Steve Roud began indexing his own collection in the 1970s, and his Roud Index has become the standard for grouping together different versions of the same song. He is still indexing as of 2023.

Could a machine learning algorithm hope to match his skill? Given the lyrics, would it choose the same groupings of songs, where the line between "same" and "different" is fuzzy? Could it help with future indexing?

# Data

## Sources

Although the Roud index is lyrics-based rather than tune-based, the officially-hosted index at vwml.com does not contain lyric transcriptions as a standard data field. Some lyrics are found in scanned images of historical collections, others on linked external sites, others not at all. So the first challenge is to get a dataset with enough full lyrics and Roud numbers in combination. The main contenders are Mudcat and The Traditional Ballad Index, both well-established online song databases.

- Mudcat focuses on song lyrics and tunes, but also contains Roud numbers for approximately 300 songs. Data formats:
    - Digitrad/DT database MS-DOS download (last updated in 2002)
    - Song web pages
    - Forum posts containing songs
- The Ballad Index focuses on cataloguing*, but also has supplementary lyrics for approximately 1110 songs. Data formats:
    - The Ballad Index Software Filemaker database download
    - Song web pages (without lyrics)
    - The Ballad Index and The Supplemental Tradition (lyrics - found as references beginning ST in the Ballad Index, alongside DT references) as HTML or TXT lists

&ast; This is similar to Roud, but focused on the basic unit of a song rather than its individual instances (eg songbook entries or performances), and therefore uses song titles as its main identifiers, with keywords and first line for disambiguation.


## Acquisition

Neither the Mudcat nor the Ballad Index downloadable database will open.

I therefore need to work with the `.txt` versions of the Ballad Index and the Supplemental Tradition, and join them in order to link Roud numbers to lyrics. Here's a preview of `balldidx.txt`:

Here it is interesting to note that this database also references Mudcat's Digitrad filenames, for example `DT, MASS1913*` above. Although I was not able to open Digitrad, I was able to access some files in the ZIP (see below).

The text version of the file is tricky to work with, as entries are presented as a list with inconsistent columns and mixed data. I used a text editor to place colons inside Roud numbers and DT filenames so that they could be more easily identified. I then wrote a script to import the data into a Pandas DataFrame while doing the following:

- split song records at the marker '==='
- extract only the values for 'name', 'description', 'earliest_date', 'found_in', 'keywords', 'cross_references', 'roud', 'file', and 'dt'
- split and store reference song name and filename information from one-line stub records that only serve to reference a main song
- extract only the earliest year found in the 'EARLIEST_DATE:' field, which contains mixed data
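
The text-editor step described above could also be scripted. The following is a minimal sketch only: it assumes the raw file marks these lines as `DT, NAME` (as in the `DT, MASS1913*` reference) and `Roud #N`, and that the edited file uses `DT: ` and `ROUD: ` prefixes as expected by the field patterns in the cell below. The sample lines, the Roud number and the function name `normalise_markers` are hypothetical.

```python
import re

# Hypothetical raw lines; the real layout of balldidx.txt is an assumption here.
raw = "DT, MASS1913*\nRoud #12345\nFile: XX000\n"

def normalise_markers(text: str) -> str:
    """Rewrite 'DT, NAME' and 'Roud #N' lines as 'DT: NAME' and 'ROUD: N'."""
    text = re.sub(r'^DT, *', 'DT: ', text, flags=re.MULTILINE)
    text = re.sub(r'^Roud #', 'ROUD: ', text, flags=re.MULTILINE)
    return text

print(normalise_markers(raw))
```

Scripting the substitution makes the edit repeatable if a newer version of the text file is downloaded later.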

In [37]:
import re
import numpy as np
import pandas as pd

# load file into memory
file_path = './Data/BalladIndex/txt/BDIDXTXT/balldidxedited.txt'
with open(file_path, 'r') as file:
    data = file.read()
    
# define each record's start and end marker then find each record
record_pattern = re.compile(r'===\n(.*?)===', re.DOTALL)
records = re.finditer(record_pattern, data)

# list to store dicts of extracted information
records_data = []

# regular expression patterns describing possible fields and values for a record
field_patterns = {
    'name': r'NAME: (.*?)(?:\n|$)',
    'description': r'DESCRIPTION: ((?:(?!KEYWORDS|FOUND_IN|REFERENCES|ROUD|File:|CROSS_REFERENCES:|DT:).)*)(?:\n|$)',
    'earliest_date': r'EARLIEST_DATE: (.*?)(?:\n|$)',
    'found_in': r'FOUND_IN: (.*?)(?:\n|$)',
    'keywords': r'KEYWORDS: (.*?)(?:\n|$)',
    'cross_references': r'CROSS_REFERENCES:\n(.*?)(?:(?=\n{2}|===|File:|DT:)|\n\Z)', #TODO: check this needs DT:
    'roud': r'ROUD: (.*?)(?:\n|$)',
    'file': r'File: (.*?)(?:\n|$)',
    'dt': r'DT: (.*?)(?:\n|$)'
}

# loop to extract information for each record
for record in records:
    # dict to store keys and values for each record
    record_data = {}
    # loop to extract information for each field in each record
    for field, pattern in field_patterns.items():
        value = re.search(pattern, record.group(1), re.DOTALL)
        if value:
            value = value.group(1).strip()
            # year handling: find all 4-digit years in field then pick lowest
            if field == 'earliest_date':
                years = re.findall(r'\b\d{4}\b', value)
                if years:
                    value = min(map(int, years))
        else:
            value = ""
        record_data[field] = value

    # stub handling: if 'name' contains ': see' and/or 'File:', store these in 'cross_references', and 'file' fields accordingly
    name = record_data['name']
    if ': see' in name:
        cross_ref_idx = name.find(': see')
        record_data['cross_references'] = 'see ' + name[cross_ref_idx + 5:].strip()
        record_data['name'] = name[:cross_ref_idx].strip()

    file_info = record_data['file']
    if '(File:' in file_info:
        cross_ref_idx = file_info.find('(File:')
        record_data['cross_references'] = file_info[cross_ref_idx + 6:].strip().rstrip(')')
        record_data['file'] = file_info[:cross_ref_idx].strip()

    # remove any brackets from 'file' field
    record_data['file'] = record_data['file'].replace('(', '').replace(')', '') 
    # remove any * from DT filenames 
    record_data['dt'] = record_data['dt'].replace('*', '')

    # append the new record_data to the records_data
    records_data.append(record_data)

# create a DataFrame from the records_data
df = pd.DataFrame(records_data)
df = df.replace('', np.nan)
df.head(40)

Unnamed: 0,name,description,earliest_date,found_in,keywords,cross_references,roud,file,dt
0,"10,000 Years Ago",,,,,see I Was Born About Ten Thousand Years Ago (B...,,R410,
1,13 Highway,"""I went down 13 highway, Down in my baby's doo...",1938.0,US(SE),grief love promise nonballad lover technology,,29487.0,Rc13Hwy,
2,"1861 Anti Confederation Song, An",,,,,see Anti-Confederation Song (File: FJ028),,FJ028,
3,1918 East Broadway,"Counting-out rhyme? ""The people who live acros...",1980.0,US,home food fight floatingverses,"cf. ""Ickie Bickie Soda Cracker"" (lyrcs)\ncf. ""...",,ZiZa073B,
4,23rd Flotilla,"""Up to Kola Inlet, back to Scapa Flow... Why d...",1962.0,Canada Britain(England),navy hardtimes technology,"cf. ""Lili Marlene"" (tune) and references there...",29405.0,Hopk112,
5,'31 Depression Blues,Coal miner tells of hard times in the Depressi...,1968.0,US(Ap),strike mining work hardtimes labor-movement,"cf. ""Bright Sunny South"" (tune)\ncf. ""Sixteen ...",,Rc31DB,
6,417's Lament,"""We are a few Canadians here in Italy, Working...",1979.0,Canada,soldier pride drink clothes flying,"cf. ""Lili Marlene"" (tune, plus cross-reference...",29403.0,Hopk046,
7,692 Song,"""We fly alone, When all the heavies are ground...",1979.0,Canada,war technology travel flying,,29402.0,Hopk042,
8,A Begging We Will Go (I),,,,,see A-Begging I Will Go (File: K217),,K217,
9,A Chaipin-ar-leathuaic A'bhfeacais Na Caoire,Gaelic. A shepherdess meets a young man and as...,1947.0,Ireland,foreignlanguage sheep youth shepherd sex,,,OCC001,
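
The year handling in the cell above can be checked in isolation. This sketch repeats the same rule (find every standalone 4-digit number, keep the lowest) in a hypothetical helper `earliest_year`; the sample field values are made up for illustration and are not taken from the Ballad Index.

```python
import re

def earliest_year(value: str):
    """Return the lowest 4-digit number in value, or value unchanged if none is found."""
    years = re.findall(r'\b\d{4}\b', value)
    return min(map(int, years)) if years else value

# Made-up field values, for illustration only
print(earliest_year('1906 (Belden); c. 1845 (broadside)'))  # 1845
print(earliest_year('unknown'))                             # unknown
```

One limitation of this rule is that any standalone 4-digit number counts as a year, so a catalogue or page number in the same field could be picked up by mistake.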


In [38]:
df.to_csv('BI_df.csv')
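
A note on reloading this file: `to_csv` writes the DataFrame index as an unnamed first column by default, and a plain `read_csv` then brings it back as a column called `Unnamed: 0`. The sketch below shows the effect on a one-row stand-in (`demo` and the in-memory buffer are illustrative, not part of the notebook) and one way to avoid it.

```python
import io
import pandas as pd

# One-row stand-in for the real DataFrame
demo = pd.DataFrame({'name': ['13 Highway'], 'roud': [29487]})

buffer = io.StringIO()
demo.to_csv(buffer)                           # index written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)                   # gains an 'Unnamed: 0' column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # first column read back as the index

print(list(naive.columns))
print(list(restored.columns))
```

Alternatively, `to_csv(..., index=False)` drops the index on writing, at the cost of losing the original row numbers.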

The Mudcat Digitrad download is an AskSam MS-DOS database, which I was also unable to open. I have done my best to extract the data from one of its files using regular expressions, but some titles are still incorrect and some lyrics incomplete. This happens because titles are sometimes recognised in the wrong places, and because some notes on the text were not easy to separate from the lyrics themselves. The extracted records are stored in `df_lyrics`:

In [54]:
with open('./Data/Mudcat/Z02cv4edited.txt', 'r', encoding='latin-1') as file:
    data = file.read()

def extract_records_from_text(text):
    # Define function to split records based on title
    def split_records(text):
        return re.split(r'\n(?=[A-Z0-9\s\'\"\?\!\.\,\(\)\[\]\:\;\–\—\-]+[A-Z0-9][A-Z0-9\s\'\"\?\!\.\,\(\)\\[\]:\;\–\—\-]{4,}(?:\n|$))', text)

    # Split text into records
    records = split_records(text)

    # Initialize lists to store extracted data
    filenames = []
    titles = []
    lyrics = []
    keywords = []

    # Iterate over records to extract data
    i = 0
    while i < len(records):
        record = records[i]

        # Find the title section
        title_match = re.search(r'^\s*([A-Z0-9\s\'\"\?\!\.\,\(\)\[\]\:\;\–\—\-]+[A-Z0-9][A-Z0-9\s\'\"\?\!.\,\(\)\\[\]:\;\–\—\-]{4,})\s*$', record, flags=re.MULTILINE)
        if title_match and not re.match(r'^-+$', title_match.group(1)) and '\n' not in title_match.group(1):
            title = title_match.group(1).strip()
        else:
            title = ''
            i += 1
            continue

        # Skip records whose title is 'OCT98' (not a song title)
        if title == 'OCT98':
            i += 1
            continue

        # Find the keywords section and extract all occurrences of keywords on the same line
        keywords_match = re.search(r'@(.+?)\n', record)
        if keywords_match:
            keywords_line = keywords_match.group(1)
            keywords_list = [keyword.strip('@') for keyword in keywords_line.split() if keyword.strip('@').isalnum()]
        else:
            keywords_list = []

        # Find the lyrics section (everything between title and keywords or filename)
        lyrics_match = re.search(r'(?<=^' + re.escape(title) + r'\n)(.*?)(?=\n@|filename:)', record, flags=re.DOTALL)
        if lyrics_match:
            lyrics_text = lyrics_match.group(1).strip()

            # Don't store the first line of lyrics if it begins and ends with brackets
            first_line_break_idx = lyrics_text.find('\n')
            if first_line_break_idx != -1:
                first_line = lyrics_text[:first_line_break_idx].strip()
                if (first_line.startswith('(') and first_line.endswith(')')) or first_line == '-Traditional':
                    lyrics_text = lyrics_text[first_line_break_idx+1:].strip()
            
            # Cut off lyrics if the line contains '_________________________'
            lyrics_cutoff_idx = lyrics_text.find('_________________________')
            if lyrics_cutoff_idx != -1:
                lyrics_text = lyrics_text[:lyrics_cutoff_idx].strip()

        else:
            lyrics_text = ''
            i += 1
            continue

        # Find the filename section
        filename_match = re.search(r'filename:\s*(.*)', record)
        if filename_match:
            filename = filename_match.group(1).strip() 
        else:
            filename = ''
            i += 1
            continue

        # Append extracted data to lists
        filenames.append(filename)
        titles.append(title)
        lyrics.append(lyrics_text)
        keywords.append(keywords_list)

        # Move to the next record
        i += 1

    # Create a DataFrame from the extracted data
    df = pd.DataFrame({
        'dt': filenames,
        'title': titles,
        'lyrics': lyrics,
        'keywords': keywords
    })

    return df

df_lyrics = extract_records_from_text(data)
df_lyrics

Unnamed: 0,dt,title,lyrics,keywords
0,HARDTAC,'ARD TAC,"1.I'm a shearer, yes I am, and I've shorn 'em...","[Australia, sheep, shearing, drink]"
1,FISHFRY,(I'VE GOT) BIGGER FISH TO FRY,"Sittin' on the bank of that muddy Mississippi,...","[fishing, food]"
2,JULY12,THE 12TH OF JULY,Come pledge again your heart and your hand\n O...,"[Irish, peace]"
3,AVENUE16,16TH AVENUE,"From the corners of the country, from the citi...",[country]
4,MASS1913,THE 1913 MASSACRE,Take a trip with me in nineteen thirteen\nTo C...,"[union, work, death, Xmas]"
...,...,...,...,...
8244,ZEBTURNY,ZEB TOURNEY'S GIRL,"Down in the Tennessee mountains,\nFar from the...",[feud]
8245,ZEBRADUN,ZEBRA DUN,We was camped on the plains at the head of the...,"[cowboy, animal]"
8246,ZENGOSPE,ZEN GOSPEL SINGING,I once was a Baptist and on each Sunday morn\n...,[religion]
8247,ZULIKA,ZULEIKA,"Zuleika was fair to see,\nA fair Persian maide...","[marriage, infidelity]"
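
To illustrate why titles can be recognised in the wrong places, here is a simplified version of the title rule: a line made up only of capitals, digits and punctuation, at least six characters long. The pattern `TITLE_LINE` and the helper `looks_like_title` are assumptions for this sketch, not the exact regular expression used above.

```python
import re

# Simplified character set: capitals, digits, whitespace and common punctuation
CHARS = r"A-Z0-9\s'?!.,()\[\]:;\-"
TITLE_LINE = re.compile(rf"^[{CHARS}]+[A-Z0-9][{CHARS}]{{4,}}$")

def looks_like_title(line: str) -> bool:
    """True if the whole line is upper-case text that the rule would treat as a title."""
    return TITLE_LINE.match(line) is not None

print(looks_like_title('THE 1913 MASSACRE'))                         # True
print(looks_like_title('Take a trip with me in nineteen thirteen'))  # False
print(looks_like_title('CHORUS:'))                                   # True (false positive)
```

Any all-capitals line inside a song, such as a hypothetical `CHORUS:` marker, passes the same test as a real title, which would split one song into two records.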


Now I'll store only the Ballad Index entries with both Roud numbers and DT filenames in `df_roud_dt`:

In [55]:
df_roud_dt = df[(~df.dt.isna()) & (~df.roud.isna())]
df_roud_dt

Unnamed: 0,name,description,earliest_date,found_in,keywords,cross_references,roud,file,dt
24,A-Begging I Will Go,"""Of all the trades in England, The begging is ...",1684,"Britain(England(North,Lond,south),Scotland(Aber))",begging nonballad,"cf. ""Let the Back and Sides Go Bare"" (theme)\n...",286,K217,ABEGGIN
41,Abdul the Bulbul Emir (I),The heroic Moslem Abdul and the gallant Russia...,1877,US(MW),humorous death foreigner,"cf. ""Abdul the Bulbul Emir (II)"" (tune & meter...",4321,LxA341,ABDULBUL
47,Abilene,"""Abilene, Abilene, prettiest town (you) ever s...",1973,,home train nonballad,"cf. ""Ohio River, She's So Deep and Wide"" (floa...",26032,FSWB048,ABILNE
51,"About the Bush, Willy","""Aboot the bush, Willy, aboot the bee-hive, Ab...",1812,Britain(England(North)),clothes nonballad,,3149,StoR097,BUSHWILI
65,Across the Western Ocean,"""Oh, the times are hard and the wages low, Ame...",1927,US,emigration poverty hardtimes,"cf. ""Leave Her, Johnny, Leave Her"" (floating l...",8234,San412,WSTOCEAN
...,...,...,...,...,...,...,...,...,...
15001,"Yellow Rose of Taegu, The",A reluctant soldier meets the Yellow Rose of T...,,US,bawdy sex soldier whore derivative,"cf. ""Yellow Rose of Texas"" (tune)",10405,EM410,YLLOWTX4
15031,You Are My Sunshine,"The singer dreams his ""sunshine"" is in his arm...",1940,US,courting love promise rejection warning nonbal...,,18130,Hopk084A,YOUMYSUN
15062,You Never Miss the Water till the Well Runs Dry,The singer remembers mother's lessons about ec...,1872,Britain(England(South)),youth money,"cf. ""A Motto for Every Man"" (theme of hard wor...",5457,SRW125,WASTENOT
15193,"Yowe Lamb, The (Ca' the Yowes; Lovely Molly)",Molly agrees to marry Willie if her father con...,1899,"Ireland Britain(Scotland(Aber,Bord)) Canada(Mar)",love marriage father trick,"cf. ""The Waukin' o' the Claes"" (tune, per Grei...",857,K124,CALEWE3
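
The same filter can be written with `dropna`, which some readers may find easier to scan. This is a sketch on a three-row stand-in (`demo`) built from values shown earlier; it also uses bracket access for the `dt` column, which avoids any confusion with the `.dt` accessor that pandas defines on Series.

```python
import numpy as np
import pandas as pd

# Three-row stand-in: only the first row has both a Roud number and a DT filename
demo = pd.DataFrame({
    'name': ['A-Begging I Will Go', '13 Highway', '10,000 Years Ago'],
    'roud': [286, 29487, np.nan],
    'dt': ['ABEGGIN', np.nan, np.nan],
})

mask_version = demo[demo['dt'].notna() & demo['roud'].notna()]
dropna_version = demo.dropna(subset=['roud', 'dt'])

print(dropna_version)
```

Both versions keep only rows where neither column is missing.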


Next I'll add the lyrics by merging the two dataframes on 'dt' filename and storing the result as `df_roud_dt_lyrics`:

In [60]:
df_roud_dt_lyrics = df_roud_dt.merge(df_lyrics, on='dt')
df_roud_dt_lyrics

Unnamed: 0,name,description,earliest_date,found_in,keywords_x,cross_references,roud,file,dt,title,lyrics,keywords_y
0,A-Begging I Will Go,"""Of all the trades in England, The begging is ...",1684,"Britain(England(North,Lond,south),Scotland(Aber))",begging nonballad,"cf. ""Let the Back and Sides Go Bare"" (theme)\n...",286,K217,ABEGGIN,A-BEGGIN' I WILL GO,Of all the trades in England the beggin' is th...,[beggar]
1,Abdul the Bulbul Emir (I),The heroic Moslem Abdul and the gallant Russia...,1877,US(MW),humorous death foreigner,"cf. ""Abdul the Bulbul Emir (II)"" (tune & meter...",4321,LxA341,ABDULBUL,ABDUL ABULBUL AMIR,"The sons of the prophet were hardy and bold,\n...","[Russian, fight, soldier]"
2,Abilene,"""Abilene, Abilene, prettiest town (you) ever s...",1973,,home train nonballad,"cf. ""Ohio River, She's So Deep and Wide"" (floa...",26032,FSWB048,ABILNE,ABILENE,"Abilene, Abilene\nPrettiest town I ever seen.\...","[home, place]"
3,"About the Bush, Willy","""Aboot the bush, Willy, aboot the bee-hive, Ab...",1812,Britain(England(North)),clothes nonballad,,3149,StoR097,BUSHWILI,"ABOUT THE BUSH, WILLY","About the bush, Willy,\nAbout the beehive,\n...",[kids]
4,Across the Western Ocean,"""Oh, the times are hard and the wages low, Ame...",1927,US,emigration poverty hardtimes,"cf. ""Leave Her, Johnny, Leave Her"" (floating l...",8234,San412,WSTOCEAN,ACROSS THE WESTERN OCEAN,Oh the times are hard and the wages low\nAmel...,[sailor]
...,...,...,...,...,...,...,...,...,...,...,...,...
590,"Yellow Rose of Taegu, The",A reluctant soldier meets the Yellow Rose of T...,,US,bawdy sex soldier whore derivative,"cf. ""Yellow Rose of Texas"" (tune)",10405,EM410,YLLOWTX4,THE YELLOW ROSE OF TAEGU,"She's the yellow Rose of Taegu, the girl that ...","[parody, army, America, Korea, bawdy, whore]"
591,You Are My Sunshine,"The singer dreams his ""sunshine"" is in his arm...",1940,US,courting love promise rejection warning nonbal...,,18130,Hopk084A,YOUMYSUN,YOU ARE MY SUNSHINE,"The other night dear, as I lay sleeping\nI dre...",[]
592,You Never Miss the Water till the Well Runs Dry,The singer remembers mother's lessons about ec...,1872,Britain(England(South)),youth money,"cf. ""A Motto for Every Man"" (theme of hard wor...",5457,SRW125,WASTENOT,"WASTE NOT, WANT NOT",(You Never Miss the Water Till the Well Runs D...,[]
593,"Yowe Lamb, The (Ca' the Yowes; Lovely Molly)",Molly agrees to marry Willie if her father con...,1899,"Ireland Britain(Scotland(Aber,Bord)) Canada(Mar)",love marriage father trick,"cf. ""The Waukin' o' the Claes"" (tune, per Grei...",857,K124,CALEWE3,THE YOWE LAMB,"As Molly was milking her yowes one day,\nWilli...",[]
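
Because `merge` defaults to an inner join, Ballad Index entries whose DT filename has no match in `df_lyrics` are dropped silently. The sketch below, on small stand-in frames, shows one way to see what is lost: a left join with `indicator=True`. The `NOSUCH` row, its Roud number, the placeholder lyrics and the suffixes `_bi` and `_dt` are hypothetical.

```python
import pandas as pd

# Stand-ins for df_roud_dt and df_lyrics; 'NOSUCH' is a made-up unmatched filename
index_side = pd.DataFrame({
    'dt': ['ABEGGIN', 'ABILNE', 'NOSUCH'],
    'roud': [286, 26032, 0],
    'keywords': ['begging nonballad', 'home train nonballad', ''],
})
lyrics_side = pd.DataFrame({
    'dt': ['ABEGGIN', 'ABILNE'],
    'lyrics': ['(lyrics text)', '(lyrics text)'],
    'keywords': [['beggar'], ['home', 'place']],
})

merged = index_side.merge(
    lyrics_side,
    on='dt',
    how='left',                 # keep every index entry, matched or not
    suffixes=('_bi', '_dt'),    # clearer than the default keywords_x / keywords_y
    indicator=True,             # adds a '_merge' column: 'both' or 'left_only'
    validate='many_to_one',     # raises if a DT filename repeats on the lyrics side
)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['dt'].tolist())
```

Counting the `left_only` rows gives a quick measure of how many Roud-numbered entries still lack lyrics after the Digitrad extraction.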


## EDA

# Data cleaning and preprocessing

Extract only records with lyrics and number

Cleaning

Transformation?

Tokenisation

# Clustering

Set up model

Tune model

Evaluate clusters

Add features

# Cluster Analysis

# Classification?

# Pipeline?