# Background

English-language folk songs have a long tradition and have changed over time. Songs are not easily identifiable by name alone, and lyrics often vary between versions. Steve Roud began indexing his own collection in the 1970s, and his Roud Index has become the standard for grouping together different versions of the same song. He is still indexing as of 2023.

Could a machine learning algorithm hope to match his skill? Given the lyrics, would it choose the same groupings of songs, where the line between "same" and "different" is fuzzy? Could it help with future indexing?

# Data

## Sources

Although the Roud Index is a lyrics-based classification system (rather than tune-based), the officially-hosted index at vwml.com does not contain lyric transcriptions as a standard data field. Some lyrics are accessible online as scanned images of historical collections, others on linked external sites, and others not at all.

So the first challenge is to get a dataset with enough full lyrics and Roud numbers in combination. The main contenders for the source of this data are Mudcat and The Traditional Ballad Index, both well-established online song databases.

### Mudcat 
- Project focuses on song lyrics and tunes, but also contains Roud numbers for approximately 300 songs.
- Data formats:
    - Digitrad (DT) database MS-DOS download (last updated in 2002)
    - Song web pages
    - Forum posts containing songs

### The Ballad Index 
- Project focuses on cataloguing*, but also has supplementary lyrics for approximately 1110 songs.
- Data formats:
    - The Ballad Index Software Filemaker database download
    - Song web pages (without lyrics)
    - The Ballad Index (BI) and The Supplemental Tradition (ST) (lyrics) as HTML or TXT lists

&ast; This is a similar approach to Roud's, but it focuses on the basic unit of a song rather than its individual instances (e.g. songbook entries or performances), and therefore uses song titles as its main identifiers, with keywords and first lines for disambiguation.


## Extraction

Neither the Mudcat DT nor the Ballad Index (including Supplementary Tradition) downloadable databases will open.

I therefore need to work with the `.txt` versions of the Ballad Index and Supplementary Tradition and join them in order to link Roud numbers to lyrics.

### Targets (BI, ST, DT)

Based on text editor finds I estimate I can extract approximately the following data:
- BI: 30445 song record files, of which (in combination):
    - 14213 are stubs for variants that only refer to other songs
    - 2623 refer to DT files (lyrics)
    - 1180 refer to ST files (lyrics)
    - 12126 of these contain Roud index numbers
- ST: 1229 lyrics referencing 1136 BI files
    - Note: 404 of the ST filenames seem to be modified DT filenames, eg 'DTwarovr' in ST and BI is the same as 'WAROVR' in DT
- DT: 8932 song record files (lyrics)
    - Note: 793 records also contain a 'DT #', but I don't yet know what this is. Contrary to my assumption, it does not correspond to the SongID in URLs on the Mudcat website, which are formatted like this example: http://mudcat.org/@displaysong.cfm?SongID=329
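
The note above about ST filenames that look like modified DT filenames can be sketched as a small normalising function. This is only a guess at the convention, based on the single example `DTwarovr` / `WAROVR`; the function name `st_to_dt_filename` is hypothetical, and the mapping is probably not always exact (a later output pairs `DTbcksid` with `BACK&SID`, which this rule would not recover).

```python
def st_to_dt_filename(st_file):
    """Guess the DT filename behind an ST/BI filename such as 'DTwarovr'.

    Assumption: a leading 'DT' marks a borrowed Digitrad name, and the
    rest is the DT filename in lower case. Returns None otherwise.
    """
    if st_file.startswith('DT') and len(st_file) > 2:
        return st_file[2:].upper()
    return None


print(st_to_dt_filename('DTwarovr'))  # WAROVR
print(st_to_dt_filename('LO32'))      # None
```

Any matches produced this way would still need checking against the real DT filenames before being trusted.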


### BI (Ballad Index)

Below is a preview of `balldidx.txt`. The text version of the Ballad Index file is tricky to work with as entries are presented as a list with inconsistent columns and mixed data. 

I first used a text editor to place colons before Roud numbers and DT filenames, so that they could be matched more easily. (This could perhaps have been done more cleanly with regex, but because these values were formatted inconsistently I decided to save myself a step to begin with.)

Here it is interesting to note that the BI database also references Mudcat's DT filenames, for example `DT, MASS1913*` above. This means we can also supplement lyrics by cross-referencing this data.

I then used a script with regular expressions to import while doing the following:
- split song records at the marker '==='
- extract only the values for 'name', 'description', 'earliest_date', 'found_in', 'keywords', 'cross_references', 'roud', 'bi_file', and 'dt_file'
- split and store reference song name and filename information in one-line stub records that only serve to reference a main song
- make stubs inherit Roud number and file references from their parent entries
- extract only the earliest year found in the 'EARLIEST_FOUND:' field which contained mixed data
- replace empty fields with NumPy `NaN` to allow for better data manipulation

These are stored in `df_bi`.
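
The import cell that follows depends on `records` and `field_patterns`, which are not shown in this notebook. As a rough sketch of what they might look like, the example below uses a made-up two-record sample in an assumed layout (labelled lines, records ending in `===`). The field labels and patterns are assumptions for illustration, not the ones actually used.

```python
import re

# Made-up sample in an ASSUMED layout; the real file is less regular
sample = (
    "NAME: 13 Highway\n"
    "EARLIEST DATE: 1938 (recording)\n"
    "KEYWORDS: grief love promise\n"
    "Roud #29487\n"
    "File: Rc13Hwy\n"
    "===\n"
    "NAME: 151 Days: see Hundred and Fifty-One Days (File: Colq060)\n"
    "===\n"
)

# Hypothetical patterns: one capture group per field, as the loop expects
field_patterns = {
    'name': r'NAME: (.*?)(?:\n|$)',
    'earliest_date': r'EARLIEST DATE: (.*?)(?:\n|$)',
    'keywords': r'KEYWORDS: (.*?)(?:\n|$)',
    'roud': r'Roud #(\w+)',
    'bi_file': r'File: (\w+)',
}

# Each match's group(1) holds the text of one record, without the '===' marker
records = list(re.finditer(r'(.+?)\n===\n', sample, re.DOTALL))

parsed = []
for record in records:
    row = {}
    for field, pattern in field_patterns.items():
        match = re.search(pattern, record.group(1), re.DOTALL)
        row[field] = match.group(1).strip() if match else ""
    parsed.append(row)

print(len(parsed), parsed[0]['roud'], parsed[1]['bi_file'])
```

Keeping one pattern per field in a dictionary means a new field can be added without touching the loop.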

Target: 30445 file records |
Output: 30418 file records

In [94]:
# loop to extract information for each record
for record in records:
    # dict to store key and value for each record
    record_data = {}
    # loop to extract information for each field in each record
    for field, pattern in field_patterns.items():
        value = re.search(pattern, record.group(1), re.DOTALL)
        # if a value matching a search pattern has been found, replace `value`
        # to store the right bit: group(1)
        if value:
            value = value.group(1).strip()
            # extra year handling: find all 4-digit years in field then pick lowest
            if field == 'earliest_date':
                years = re.findall(r'\b\d{4}\b', value)
                if years:
                    value = min(map(int, years))
        else:
            value = ""
        record_data[field] = value

    # stub handling: if 'name' line contains ': see' and/or 'File:',
    # store these in 'cross_references' and 'bi_file' fields accordingly
    name = record_data['name']
    if ': see' in name:
        cross_ref_idx = name.find(': see')
        record_data['cross_references'] = 'see ' + name[cross_ref_idx + 5:].strip()
        record_data['name'] = name[:cross_ref_idx].strip()

    bi_file_info = record_data['bi_file']
    if '(File:' in bi_file_info:
        cross_ref_idx = bi_file_info.find('(File:')
        # update the 'cross_references' field only if it is still empty
        # (the key always exists by this point, so test its value, not its presence)
        if not record_data['cross_references']:
            record_data['cross_references'] = bi_file_info[cross_ref_idx + 6:].strip().rstrip(')')
        record_data['bi_file'] = bi_file_info[:cross_ref_idx].strip()

    # remove any brackets from 'file' field
    record_data['bi_file'] = record_data['bi_file'].replace('(', '').replace(')', '')
    # remove any * from DT filenames
    record_data['dt_file'] = record_data['dt_file'].replace('*', '')

    # append the new record_data to the records_data
    records_data.append(record_data)

# create a DataFrame from the records_data
df_bi = pd.DataFrame(records_data)

# fill NaNs
df_bi = df_bi.replace('', np.nan)

# check for empty rows
# df_bi[df_bi.isnull().all(axis=1)]
# remove empty rows (there was only one at the end)
df_bi.dropna(how='all', inplace=True)

df_bi


Unnamed: 0,name,description,earliest_date,found_in,keywords,cross_references,roud,bi_file,st_file,dt_file
0,"10,000 Years Ago",,,,,see I Was Born About Ten Thousand Years Ago (B...,,R410,,
1,10th MTB Flotilla Song,,,,,see Fred Karno's Army (File: NeFrKaAr),,NeFrKaAr,,
2,13 Highway,"""I went down 13 highway, Down in my baby's doo...",1938,US(SE),grief love promise nonballad lover technology,,29487,Rc13Hwy,,
3,151 Days,,,,,see Hundred and Fifty-One Days (File: Colq060),,Colq060,,
4,"1861 Anti Confederation Song, An",,,,,see Anti-Confederation Song (File: FJ028),,FJ028,,
...,...,...,...,...,...,...,...,...,...,...
30413,Zula,"""Thou lov'st another, Zula, Thou lovest him al...",1952,US(So),love rejection separation travel,,11330,Brne049,,
30414,"Zulu Warrior, The","""I-kama zimba zimba zayo I-kama zimba zimba ze...",1946,,nonballad nonsense campsong,,,ACFF061A,,
30415,Zum Gali Gali,"Hebrew. ""Zum, gali-gali-gali, Zum gali-gali, Z...",1956,,foreignlanguage campsong,,,ACSF314Z,,
30416,Zutula Dead,A nice girl gave Zutula bitter casava to eat a...,1939,West Indies(Trinidad),death poison food,,,RcALZuDe,,
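
In the preview above, the stub rows (for example `151 Days`) still show no Roud number. One way the inheritance step might be sketched is to look up each stub's parent through the shared `bi_file`. The toy frame, the made-up Roud number, and the rule that a stub is any row whose `cross_references` starts with `see ` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one stub and its parent share the same bi_file
toy = pd.DataFrame({
    'name': ['151 Days', 'Hundred and Fifty-One Days'],
    'cross_references': ['see Hundred and Fifty-One Days (File: Colq060)', np.nan],
    'roud': [np.nan, '12345'],  # made-up Roud number
    'bi_file': ['Colq060', 'Colq060'],
})

# Assumed stub marker: cross_references beginning with 'see '
is_stub = toy['cross_references'].str.startswith('see ', na=False)

# One parent row per bi_file, indexed by bi_file for lookup
parents = toy[~is_stub].drop_duplicates('bi_file').set_index('bi_file')

# Fill missing Roud numbers from the parent with the same bi_file
toy['roud'] = toy['roud'].fillna(toy['bi_file'].map(parents['roud']))
print(toy[['name', 'roud', 'bi_file']])
```

The same lookup could fill `dt_file` and `st_file`, so that variant titles carry their parent's lyrics references.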


In [93]:
# experiments with stub handling... eventually delete and combine above what works


In [88]:
df_bi.to_csv('df_bi.csv') #save to CSV

Target: 2623 DT file references | Output: 2602 DT file references

In [80]:
df_bi.query("dt_file.notna()").dt_file.count()

2602

Target: 1180 ST file references | Output: 1166 ST file references

In [81]:
df_bi.query("st_file.notna()").st_file.count()

1166

The following query shows that I would have 3152 main songs with both Roud numbers and lyrics if I joined up the data now and all the referenced lyrics files could be extracted.

In [82]:
df_lyrics_available = df_bi.query("(st_file.notna() | dt_file.notna()) & roud.notna()")
df_lyrics_available[['name', 'roud', 'bi_file', 'st_file', 'dt_file']].sort_values('roud')

Unnamed: 0,name,roud,bi_file,st_file,dt_file
9721,"Gypsy Laddie, The [Child 200]",1,C200,,"200, GYPDAVY GYPLADD GYPLADD2* GYPLADD3 GYPLAD..."
15901,Lord Randal [Child 12],10,C012,,"12, LORDRAN1* LORDRNLD* EELHENRY* EELHENR2"
2763,Bonny Baby Livingston [Child 222],100,C222,,BABLIVST*
2598,"Bold Privateer, The [Laws O32]",1000,LO32,LO32 (Full),"486, BOLDPRIV BLDPRIV2*"
7379,Fair Fanny Moore [Laws O38],1001,LO38,,"337, FANMOORE FANMOOR2"
...,...,...,...,...,...
15852,Lord Cornwallis's Surrender,V50597,SBoA088,,LRDCRNWL*
17128,"Memory of the Dead, The",V5143,PGa039,,MEMRYDED*
25278,"Star-Spangled Banner, The",V5200,SRW008 the source song,,STARSPAN
13901,Jolly Good Ale and Old (Back and Sides Go Bare),V7039,DTbcksid,,BACK&SID*


This number of songs may even increase if:
1. I can match variant lyrics from the other data sources to the variant titles and multiple file references listed here, in order to get more song records
2. by chance, some backwards file references to BI files are found in the two lyrics data sources which were not found inside the Ballad Index

However, this is still unlikely to constitute enough data to cluster lyrics into Roud-sized clusters and compare systems, as our available data currently averages only about 1.04 songs per unique Roud number (3152 songs across the 3030 unique Roud numbers counted below).

In [48]:
# Number of unique Roud numbers amongst songs that now have lyrics matched:
df_lyrics_available.roud.nunique()

3030

In [49]:
# Number of entries with lyrics from DT:
df_lyrics_available.dt_file.dropna().count()

2303

In [50]:
# Number of DT entries on songs with Roud numbers:
def word_count(series):
    text = ' '.join(series.dropna().astype(str))
    files = text.split()
    return len(files)

word_count(df_lyrics_available.dt_file)

4141

Even allowing for the multiple entries per BI row for DT files and assuming we can use all of them, that would leave us with a maximum of 4990 lyrics as things stand, giving a song-to-Roud ratio of only 1.6.
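
The ratios quoted in this section can be checked with a line of arithmetic each. The sketch below simply divides the counts reported above (3152 matched songs, at most 4990 lyrics, 3030 unique Roud numbers); the helper name `songs_per_roud` is hypothetical.

```python
def songs_per_roud(n_lyrics, n_unique_roud):
    """Average number of lyric records per unique Roud number."""
    return n_lyrics / n_unique_roud


current = songs_per_roud(3152, 3030)    # songs matched so far
best_case = songs_per_roud(4990, 3030)  # if every referenced DT file is usable

print(f"{current:.2f} {best_case:.2f}")  # prints "1.04 1.65"
```

Even in the best case, most Roud numbers would be represented by only one or two lyrics.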

### ST (Supplementary Tradition of BI)

The Supplementary Tradition is the lyrics index of the Ballad Index. Again, I must use regular expressions to extract the data, this time from `supptrad.txt`. This has a different format to the BI.

The main song title is listed at the head of each record, followed by the type of lyrics (`Complete text(s)` or `Partial text(s)`), followed by the different versions of the lyrics marked `*** A ***`, `*** B ***`, `*** C ***`, and so on. Each version is often preceded by an alternate title and notes about the story and/or provenance of the lyrics.

Due to the aforementioned song-based classification system of the BI, multiple alternate versions are often linked to one BI record file and key title. Later I may want to split the files into different versions, so I will treat the main song record as a parent (`key_`...) and treat the versions as children, which will stand as individual records but inherit some values from their parents. Some of the alternate versions do not have their own names.

I want to extract: `key_name`, `key_full_part`, `version_in_key`, `name`, `provenance` [detected in order to exclude from lyrics], `lyrics`, `bi_file` [this belongs to key/parent but I want to name consistently for later data combinations]
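
The parent/child idea can be shown on a tiny made-up record before reading the full parser. The sketch below assumes the layout described above (title line, text-type line, then `*** A ***` style markers) and uses placeholder lyrics; it is a simplified illustration, not the parser used in this notebook.

```python
import re

# Made-up record in the ASSUMED layout, with placeholder lyrics
record = (
    "Example Song\n"
    "Complete text(s)\n"
    "\n"
    "          *** A ***\n"
    "\n"
    "From Example Collection, p. 1\n"
    "\n"
    "First line of version A\n"
    "Second line of version A\n"
    "\n"
    "          *** B ***\n"
    "\n"
    "First line of version B\n"
    "\n"
    "File: Ex001\n"
)

# Splitting on a pattern with a capture group keeps the version letters:
# [header, 'A', body_a, 'B', body_b]
parts = re.split(r'^[ \t]*\*\*\* ([A-Z]) \*\*\*[ \t]*$', record, flags=re.MULTILINE)

# The parent values come from the first two non-empty header lines
header_lines = [line.strip() for line in parts[0].splitlines() if line.strip()]
key_name, key_full_part = header_lines[:2]

# Each child version inherits the parent values
versions = [
    {'key_name': key_name, 'key_full_part': key_full_part,
     'version_in_key': letter, 'text': body.strip()}
    for letter, body in zip(parts[1::2], parts[2::2])
]

for version in versions:
    print(version['version_in_key'], '|', version['key_name'])
```

Separating provenance from lyrics inside each `text` is the harder part, which is what most of the full function below deals with.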

In [52]:
with open('./Data/BalladIndex/txt/supptradedited.txt', 'r') as file:
    data = file.read()
    
def parse_lyric_information(data):
    outer_records = data.split("\n===\n")  # split into outer records
    records_list = []

    for record in outer_records:
        outer_lines = record.strip().split('\n')
        if len(outer_lines) < 2:
            continue  # skip 'records' with insufficient lines

        key_name = None
        key_full_part = None
        bi_file = None

        inner_records = record.strip().split('          *** ')[1:]  # split into inner records

        for i, line in enumerate(outer_lines):
            if line.startswith("==="):
                if i > 0:
                    break  # Stop looking for key_name and key_full_part after the first record
            elif not key_name:
                key_name = line.strip()
            elif not key_full_part:
                key_full_part = line.strip()
            elif not bi_file:
                bi_file_match = re.search(r"File: (.+)", line)
                if bi_file_match:
                    bi_file = bi_file_match.group(1).strip()

        for inner_record in inner_records:
            lines = inner_record.strip().split('\n')
            version_in_key = None
            name = None
            provenance = None
            lyrics = None

            is_in_provenance = False
            provenance_lines = []
            is_in_lyrics = False
            lyrics_lines = []

            for line in lines:
                if not version_in_key and line.strip() and line.strip()[0].isupper():
                    version_in_key = line.strip()[0]
                elif not name and line.strip() and not line.strip().startswith((
                        "From ", "Text ", "Derived ", "As printed ", "Supplied ", "Lyrics ",
                        "As found in ", "As recorded ", "Also from ", "Also supplied")):
                    if name is None:
                        name = line.strip()
                elif not provenance and re.match(r"^(From |Text |Derived |As printed |Supplied |Lyrics |As found in |As recorded |Also from |Also supplied|Derived from )", line):
                    is_in_provenance = True
                elif not lyrics and not is_in_provenance and not is_in_lyrics and version_in_key:
                    is_in_lyrics = True

                if is_in_provenance:
                    if line.strip():
                        provenance_lines.append(line.strip())
                    elif not line.strip() and provenance_lines:
                        is_in_provenance = False
                        provenance = "\n".join(provenance_lines)
                        provenance_lines = []
                elif is_in_lyrics:
                    if line.strip() and name is not None and name not in line and not line.startswith('File: '):
                        if line.strip() == "===":
                            is_in_lyrics = False  # Stop capturing lyrics at the demarcating line
                        else:
                            if lyrics_lines and not lyrics_lines[-1].endswith(('.', '?', '!', ',', ';', ':',)):
                                lyrics_lines[-1] += ', ' + line.strip()
                            else:
                                lyrics_lines.append(line.strip())

            if provenance_lines:
                provenance = "\n".join(provenance_lines)

            if name is not None:
                if name != "" and name in lines:
                    name_index = lines.index(name)
                    if name_index == 0 or lines[name_index - 1] == "" and (name_index == len(lines) - 1 or lines[name_index + 1] == ""):
                        name = name.strip()
                    else:
                        name = ""

            # join the collected lyrics lines from the list into a string
            if lyrics_lines:
                lyrics = " ".join(lyrics_lines)

            # append the extracted data to the records list
            records_list.append([key_name, key_full_part, bi_file, version_in_key, provenance, name, lyrics])

    # create a DataFrame from the records list 
    columns = ["key_name", "key_full_part", "bi_file", "version_in_key", "provenance", "name", "lyrics"]
    df = pd.DataFrame(records_list, columns=columns)
    return df

df_st = parse_lyric_information(data)
df_st = df_st.replace('', np.nan)
df_st


Unnamed: 0,key_name,key_full_part,bi_file,version_in_key,provenance,name,lyrics
0,"A Robin, Jolly Robin",Complete text(s),Perc1185,A,"From Percy/Wheatley, I.ii.4, pp. 186-187",A Robyn Jolly Robyn,"""[F]rom what appears to be the most ancient of..."
1,"A Robin, Jolly Robin",Complete text(s),Perc1185,B,"From Shakespeare, ""Twelfth Night"" Act IV, scen...",(No title),"71 'Hey, Robin, jolly Robin, 72 Tell me how..."
2,"A, U, Hinny Bird",Partial text(s),StoR160,A,"From Stokoe/Reay, Songs and Ballads of Norther...",,"A, U, hinny burd; The bonny lass o' Benwell, A..."
3,Adieu to Erin (The Emigrant),Complete text(s),SWMS255,A,"As found in Gale Huntington, Songs the Whaleme...",Adieu to Erin,"Oh, when I breathed a last adieu, To Erin's an..."
4,"Agincourt Carol, The",Complete text(s),MEL51,A,"From the Bodleian Library (Cambridge), MS. Sel...",The Song of Agincourt,"Deo gracias anglia, Redde pro victoria, 1 Owre..."
...,...,...,...,...,...,...,...
1224,Young Strongbow,Partial text(s),FlNG210,A,"From Helen Hartness Flanders, Elizabeth Flande...",,"In olden times there came, A likely youth who ..."
1225,Young Waters [Child 94],Complete text(s),C094,A,"From Percy/Wheatley, II.ii.18, pp. 229-231",,"one sheet 8vo."", About Yule, quhen the wind bl..."
1226,Zeb Tourney's Girl [Laws E18],Complete text(s),LE18,A,"As recorded by Vernon Dalhart, 1926. Transcrib...",,"Down in the Tennessee mountains, Away from the..."
1227,Zek'l Weep,Complete text(s),San449,A,"From Carl Sandburg, The American Songbag, pp. ...",,"1 Zek'l weep, Zek'l moan, Flesh come a-creepin..."


Target: 1136 records | Output: 1229 records

### DT (Mudcat's Digitrad)

The only Mudcat Digitrad file available to download is an AskSam 32-bit MS-DOS database, which I was not able to open. I was, however, able to access a database file in the ZIP where lyrics were visible in plain text.

The lack of consistent record delimiters and field labels/delimiters, and the presence of many (often invisible) unicode control characters, made extraction challenging and unreliable. I extracted data using regular expressions, after using a text editor to add some line breaks and spaces in place of some errant unicode characters in the source (itself a marginally more human-readable side-effect of a failed attempt to open the database in a newer version of AskSam for Windows).
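
The control-character clean-up described above was done by hand in a text editor. As an alternative, it could be scripted; the minimal sketch below replaces control characters (other than newline and tab) with spaces. The function name and the exact character ranges are assumptions, and the real file may need different handling.

```python
import re

# C0 and C1 control characters, excluding tab (\x09) and newline (\x0a)
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b-\x1f\x7f-\x9f]')


def strip_control_chars(text):
    """Replace control characters with spaces, keeping newlines and tabs."""
    return CONTROL_CHARS.sub(' ', text)


print(repr(strip_control_chars("ZEBRA DUN\x00\x1fWe was camped\n")))
```

Replacing with a space rather than deleting avoids gluing neighbouring fields together where a control character was acting as a separator.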

Due to the aforementioned challenges, there are still some issues with the data:
- some titles are incorrect 
- some lyrics are incomplete due to titles being recognised in the wrong places 
- some lyrics still include notes on the text which were not easy to separate from the lyrics themselves

This data is stored in `df_dt`:

Target: 8932 file records |
Output: 8249 file records

In [53]:
with open('./Data/Mudcat/Z02cv4edited.txt', 'r', encoding='latin-1') as file:
    data = file.read()

def extract_records_from_text(text):
    # split records based on name detection
    records = re.split(r'\n(?=[A-Z0-9\s\'\"\?\!\.\,\(\)\[\]\:\;\–\—\-]+[A-Z0-9][A-Z0-9\s\'\"\?\!\.\,\(\)\\[\]:\;\–\—\-]{4,}(?:\n|$))', text)

    # lists to store extracted data
    filenames = []
    names = []
    lyrics = []
    keywords = []

    # iterate over records to extract data
    i = 0
    while i < len(records):
        record = records[i]

        # find and store the name
        name_match = re.search(r'^\s*([A-Z0-9\s\'\"\?\!\.\,\(\)\[\]\:\;\–\—\-]+[A-Z0-9][A-Z0-9\s\'\"\?\!.\,\(\)\\[\]:\;\–\—\-]{4,})\s*$', record, flags=re.MULTILINE)
        if name_match and not re.match(r'^-+$', name_match.group(1)) and '\n' not in name_match.group(1):
            name = name_match.group(1).strip()
        else:
            name = ''
            i += 1
            continue

        # reject name if it's one of the other strings that produces false matches - TODO: combine this above?
        if name == 'OCT98':
            i += 1
            continue

        # find and store keywords (each starting with @ and all on the same line)
        keywords_match = re.search(r'@(.+?)\n', record)
        if keywords_match:
            keywords_line = keywords_match.group(1)
            keywords_list = [keyword.strip('@') for keyword in keywords_line.split() if keyword.strip('@').isalnum()]
        else:
            keywords_list = []

        # find and store the lyrics section (everything between name and keywords or filename)
        lyrics_match = re.search(r'(?<=^' + re.escape(name) + r'\n)(.*?)(?=\n@|filename:)', record, flags=re.DOTALL)
        if lyrics_match:
            lyrics_text = lyrics_match.group(1).strip()

            # don't store the first line of the section if it's likely a note
            first_line_break_idx = lyrics_text.find('\n')
            if first_line_break_idx != -1:
                first_line = lyrics_text[:first_line_break_idx].strip()
                if first_line.startswith('(') and first_line.endswith(')') or first_line == '-Traditional':
                    lyrics_text = lyrics_text[first_line_break_idx+1:].strip()
            
            # cut off lyrics if there is a line of underscores
            lyrics_cutoff_idx = lyrics_text.find('_________________________')
            if lyrics_cutoff_idx != -1:
                lyrics_text = lyrics_text[:lyrics_cutoff_idx].strip()

        else:
            lyrics_text = ''
            i += 1
            continue

        # find and store the filename based on 'filename: '
        filename_match = re.search(r'filename:\s*(.*)', record)
        if filename_match:
            filename = filename_match.group(1).strip() 
        else:
            filename = ''
            i += 1
            continue

        # append extracted data to lists
        filenames.append(filename)
        names.append(name)
        lyrics.append(lyrics_text)
        keywords.append(keywords_list)

        # Move to the next record
        i += 1

    # create a DataFrame from the extracted data
    df = pd.DataFrame({
        'dt_file': filenames,
        'name': names,
        'lyrics': lyrics,
        'keywords': keywords
    })

    return df

df_dt = extract_records_from_text(data)
df_dt

Unnamed: 0,dt_file,name,lyrics,keywords
0,HARDTAC,'ARD TAC,"1.I'm a shearer, yes I am, and I've shorn 'em...","[Australia, sheep, shearing, drink]"
1,FISHFRY,(I'VE GOT) BIGGER FISH TO FRY,"Sittin' on the bank of that muddy Mississippi,...","[fishing, food]"
2,JULY12,THE 12TH OF JULY,Come pledge again your heart and your hand\n O...,"[Irish, peace]"
3,AVENUE16,16TH AVENUE,"From the corners of the country, from the citi...",[country]
4,MASS1913,THE 1913 MASSACRE,Take a trip with me in nineteen thirteen\nTo C...,"[union, work, death, Xmas]"
...,...,...,...,...
8244,ZEBTURNY,ZEB TOURNEY'S GIRL,"Down in the Tennessee mountains,\nFar from the...",[feud]
8245,ZEBRADUN,ZEBRA DUN,We was camped on the plains at the head of the...,"[cowboy, animal]"
8246,ZENGOSPE,ZEN GOSPEL SINGING,I once was a Baptist and on each Sunday morn\n...,[religion]
8247,ZULIKA,ZULEIKA,"Zuleika was fair to see,\nA fair Persian maide...","[marriage, infidelity]"


## Combine BI with ST and DT

Next I'll add the lyrics to the Ballad Index data by merging the other two dataframes on filenames and storing the result as `df_all_plus_lyrics`.
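
One complication for this merge is that a single BI `dt_file` cell can hold several filenames plus a leading number (for example `486, BOLDPRIV BLDPRIV2` in an earlier output), so a plain merge on that column would only match rows with exactly one filename. A minimal sketch of splitting the cell into one row per filename before merging, using toy frames with placeholder lyrics:

```python
import pandas as pd

# Toy stand-ins for df_bi and df_dt (placeholder lyrics)
bi = pd.DataFrame({
    'name': ['Bold Privateer, The [Laws O32]', 'Bonny Baby Livingston [Child 222]'],
    'roud': ['1000', '100'],
    'dt_file': ['486, BOLDPRIV BLDPRIV2', 'BABLIVST'],
})
dt = pd.DataFrame({
    'dt_file': ['BOLDPRIV', 'BLDPRIV2', 'BABLIVST'],
    'lyrics': ['lyrics A', 'lyrics B', 'lyrics C'],
})

# Split each cell on whitespace and give every token its own row
bi_long = bi.assign(dt_file=bi['dt_file'].str.split()).explode('dt_file')

# Drop purely numeric tokens such as '486,' (assumed not to be filenames)
bi_long = bi_long[~bi_long['dt_file'].str.rstrip(',').str.isdigit()]

# Each BI song now picks up every DT lyric it references
merged = bi_long.merge(dt, how='left', on='dt_file')
print(merged[['name', 'roud', 'dt_file', 'lyrics']])
```

This yields one row per lyric rather than one per song, which is the shape needed for counting lyrics per Roud number.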

Viewing the header names gives me an overview of the columns to match:

In [54]:
display('BI: ', df_bi.columns,
    'ST: ', df_st.columns,
    'DT: ', df_dt.columns)

'BI: '

Index(['name', 'description', 'earliest_date', 'found_in', 'keywords',
       'cross_references', 'roud', 'bi_file', 'st_file', 'dt_file'],
      dtype='object')

'ST: '

Index(['key_name', 'key_full_part', 'bi_file', 'version_in_key', 'provenance',
       'name', 'lyrics'],
      dtype='object')

'DT: '

Index(['dt_file', 'name', 'lyrics', 'keywords'], dtype='object')

In [55]:
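# merge on the shared `dt_file` column (the original on='dt' raised KeyError: 'dt',
# as neither frame has a column of that name); suffixes keep the clashing
# 'name' and 'keywords' columns apart
df_all_plus_lyrics = df_bi.merge(df_dt, how='outer', on='dt_file', suffixes=('_bi', '_dt'))
df_all_plus_lyrics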

# REDO THIS SECTION as it excludes variant stubs

Now I'll store only the Ballad Index entries with both Roud numbers and lyrics in `df_roud_lyrics`:

In [None]:
# the lyrics reference column in df_bi is `dt_file` (there is no `dt` column)
df_roud_lyrics = df_bi[(~df_bi.dt_file.isna()) & (~df_bi.roud.isna())]

Next I have to check how many lyrics and how many Roud numbers I have, to see if there are enough entries per number to enable comparisons.

In [None]:
df_roud_lyrics.roud.nunique()
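
Beyond the count of unique Roud numbers, what matters for clustering is how the lyrics are distributed across them. A sketch of that check on a toy frame (the Roud numbers and lyrics here are placeholders, and the frame would need one row per lyric for the counts to be meaningful):

```python
import pandas as pd

# Toy data: six lyrics spread over three Roud numbers
toy = pd.DataFrame({
    'roud': ['1', '1', '1', '10', '100', '100'],
    'lyrics': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Number of lyrics available for each Roud number
per_roud = toy.groupby('roud').size()

print(per_roud.mean())        # average lyrics per Roud number: 2.0
print((per_roud >= 2).sum())  # Roud numbers with at least two lyrics: 2
```

Only Roud numbers with at least two lyrics can show whether a clustering puts versions of the same song together.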

## EDA

# Data cleaning and preprocessing

Extract only records with lyrics and number

Cleaning

Transformation?

Tokenisation

# Clustering

Set up model

Tune model

Evaluate clusters

Add features

# Cluster Analysis

# Classification?

# Pipeline?