# Background

English-language folk songs have a long tradition and have changed over time. Songs are not easily identifiable by name alone, and lyrics often vary between versions. Steve Roud began indexing his own collection in the 1970s, and his Roud Index has become the standard for grouping together different versions of the same song. He is still indexing as of 2023.

Could a machine learning algorithm hope to match his skill? Given the lyrics, would it choose the same groupings of songs, where the line between "same" and "different" is fuzzy? Could it help with future indexing?

# Data

## Sources

Although the Roud index is a lyrics-based classification system (rather than tune-based), the officially-hosted index at vwml.com does not contain lyric transcriptions as a standard data field. Some lyrics are accessible online as scanned images of historical collections, some only on linked external sites, and others not at all.

So the first challenge is to get a dataset with enough full lyrics and Roud numbers in combination. The main contenders for the source of this data are Mudcat and The Traditional Ballad Index, both well-established online song databases.

### Mudcat 
- Project focuses on song lyrics and tunes, but also contains Roud numbers for approximately 300 songs.
- Data and formats:
    - Digitrad (DT) download: askSam MS-DOS database (last updated in 2002)
    - Song web pages
    - Forum posts containing songs

### The Traditional Ballad Index 
- Project focuses on cataloguing*, but also has supplementary lyrics for approximately 1110 songs.
- Data and formats:
    - The Ballad Index Software download: Claris Filemaker database
    - Song web pages (without lyrics)
    - The Ballad Index (BI) and The Supplemental Tradition (ST) (lyrics) as HTML or TXT lists

&ast; This is a similar approach to Roud's, but focused on the basic unit of a song rather than its individual instances (e.g. variations, songbook entries or performances), and therefore uses song titles as its main identifiers, with keywords and first line for disambiguation.


## Extraction

Neither the Ballad Index (which would have included ST lyrics) nor the Mudcat Digitrad downloadable databases will open. 

In order to link Roud numbers to lyrics, I therefore need to work with the `.txt` version of the Ballad Index (which does not include ST) as my base for a new database, extract the records from it, then join ST and DT's lyrics to these records using the various references provided in each data source.
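
A minimal sketch of the intended join, assuming each lyrics source can be reduced to a table keyed by its filename. The column names `st_lyrics` and `dt_lyrics` and the placeholder lyrics are hypothetical; the filenames and Roud numbers are copied from rows that appear in the outputs further down.

```python
import pandas as pd

# A few base records in the shape of the Ballad Index table.
bi = pd.DataFrame({
    "bi_file": ["Rc13Hwy", "FSWB306A", "San449"],
    "roud": ["29487", "17663", "12174"],
    "dt_file": [None, "MASS1913", None],
})

# Hypothetical lyrics tables, keyed by their own filenames (placeholder text only).
st = pd.DataFrame({"bi_file": ["San449"], "st_lyrics": ["<ST lyrics for San449>"]})
dt = pd.DataFrame({"dt_file": ["MASS1913"], "dt_lyrics": ["<DT lyrics for MASS1913>"]})

# Left joins keep every BI record and add lyrics only where a reference resolves.
joined = (
    bi.merge(st, on="bi_file", how="left")
      .merge(dt, on="dt_file", how="left")
)
print(joined)
```

Left joins are used here so that records without lyrics survive the join and can still be counted against the targets.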

### Linking data: Filenames as keys

To link the lyrics correctly to the main data of the BI, I need fields that act as identifiers/keys:

#### BI filename
Alphanumeric filename serving as an identifier for all BI records, also referenced by ST lyrics where they exist.

#### DT filename
8.3 filename (all-caps without extension) serving as an identifier for all DT records, also sometimes referenced in BI. 
* Note: in a minority of cases, modified DT filenames also appear to be used as the main BI filenames ('DT' + first six characters in lower or title case); e.g. 'DToatsbe' in BI is the same as 'OATSBEAN' in DT. However, this occasionally disagrees with the stated DT filename for the BI record.
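
The naming rule in the note can be checked mechanically. The helper below is a hypothetical sketch (the function name is mine), assuming the rule is exactly 'DT' plus the first six characters of the DT filename, compared case-insensitively.

```python
def bi_file_matches_dt(bi_file: str, dt_file: str) -> bool:
    """Return True if bi_file looks like 'DT' + the first six characters of dt_file."""
    if not bi_file.startswith("DT"):
        return False
    # Compare case-insensitively, since BI uses lower or title case and DT uses caps.
    return bi_file[2:].lower() == dt_file[:6].lower()

print(bi_file_matches_dt("DToatsbe", "OATSBEAN"))  # the example from the note
print(bi_file_matches_dt("DTbcksid", "BACK&SID"))  # a pair that appears together later in BI
```

The second pair does not satisfy the rule, which illustrates why the derived name cannot be trusted on its own as a key into DT.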

#### Other numbers and references:
**DT number:** Many records in DT and BI also contain a 'DT #'. This number is not the same as the DT file, and, contrary to my first assumption, nor does it correspond to the SongID in Mudcat URLs (e.g. http://mudcat.org/@displaysong.cfm?SongID=329). It appears to be another grouping system developed by Mudcat and intended to extend Child numbers (see below): "*Francis J. Child only went up to 305--since there are ballads he didn't include, you may notice some numbers like DT #510 . Not to worry--it just helps locate variants*".

**Roud number:** Found in BI only (at least as far as downloadable data is concerned - song lyrics on Mudcat's website do often include this).

**Child number:** The Child Ballads were the first large collection of songs of English and Scottish origin collected by Francis James Child in the 1800s. Many songs contained multiple versions. Child numbers (1-305) are often referenced in folk song sources.

**Laws number:** George Malcolm Laws and the American Folklore Society published a collection of traditional songs in 1957. Laws numbers contain an initial letter which indicates the song's theme, e.g. 'M: Ballads of Family Opposition to Lovers'. Laws numbers are also commonly referenced.

**Other collections:** References to other collections are sometimes found, and some of these also have their own numbers for songs.
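
BI titles often carry these numbers in a bracketed tag, as in "Gypsy Laddie, The [Child 200]" or "Bold Privateer, The [Laws O32]". The sketch below pulls them out; the function name and the assumption that tags always look like `[Child N]` or `[Laws XN]` are mine.

```python
import re

# Matches tags such as "[Child 200]" or "[Laws O32]" anywhere in a title.
COLLECTION_TAG = re.compile(r"\[(Child|Laws) ([A-Za-z]?\d+[A-Za-z]?)\]")

def collection_number(name):
    """Return (collection, number) for a tagged title, or None if there is no tag."""
    match = COLLECTION_TAG.search(name)
    return (match.group(1), match.group(2)) if match else None

print(collection_number("Gypsy Laddie, The [Child 200]"))
print(collection_number("Bold Privateer, The [Laws O32]"))
print(collection_number("13 Highway"))
```

For Laws numbers, the first character of the returned number is the theme letter.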

### Extraction quantity targets (BI, ST, DT)

Based on text-editor searches, I estimate that I can extract approximately the following data (figures in square brackets give, for comparison, the counts from a Google domain search of the online versions):
- BI: 30445 song record files, of which (in combination):
    - 14213 are stubs for variants that only refer to other songs
    - 2623 refer to DT files (lyrics) [compare: Google search: 357]; 356 have BI filenames referring to a DT filename
    - 1180 refer to ST files (lyrics) [compare: Google search: 395]
    - 12126 of these contain Roud index numbers [compare: Google search: 2700]
- ST: 1229 lyrics referencing 1136 BI files [no separate online version]
- DT: 8932 song record files (lyrics)
    - only 1 contains a Roud number [compare: Google search of newer web version: 435]
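
The same kind of count can be reproduced in Python by counting marker strings, much as a text editor's find does. This is only a sketch: the sample text is invented, and the marker spellings are assumed to be those used by the field patterns in the extraction script.

```python
def count_markers(text, markers=("NAME: ", ": see ", "ROUD: ", "File: ", "DT: ", "ST: ")):
    """Count raw occurrences of each marker string, like a text-editor search."""
    return {marker: text.count(marker) for marker in markers}

# Invented two-record sample: one full record and one stub.
sample = (
    "===\n"
    "NAME: Example Song\n"
    "DESCRIPTION: Placeholder description\n"
    "ROUD: 123\n"
    "File: Ex001\n"
    "===\n"
    "NAME: Another Title: see Example Song (File: Ex001)\n"
    "===\n"
)

print(count_markers(sample))
```

Raw counts like these are upper bounds: a marker that also occurs inside free text is counted too, which may explain small gaps between targets and extracted totals.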

### BI (Ballad Index)

Below is a preview of `balldidx.txt`. The text version of the Ballad Index file is tricky to work with as entries are presented as a list with inconsistent columns and mixed data. 

I first used a text editor to place colons before Roud numbers and DT filenames so that they could be matched more easily. (Regular expressions might have done this better, but because the references were formatted inconsistently I chose to save myself a step at the start.)

Here it is interesting to note that the BI database also references Mudcat's DT filenames, for example `DT, MASS1913*` above. This means we can also supplement lyrics by cross-referencing this data.
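
For reference, the manual colon edit could be approximated with `re.sub`. This is a sketch under assumptions, not the edit actually performed: only the `DT, MASS1913*` form is quoted in this notebook, and the `Roud #...` input form is my guess at the raw file's formatting.

```python
import re

def add_field_colons(text):
    """Rewrite reference markers into 'KEY: value' form for easier matching."""
    # "DT, MASS1913*" -> "DT: MASS1913*"
    text = re.sub(r"\bDT, (?=[A-Z0-9])", "DT: ", text)
    # ASSUMPTION: raw Roud references look like "Roud #17663" or "Roud #V39177".
    text = re.sub(r"\bRoud #(?=\w)", "ROUD: ", text)
    return text

print(add_field_colons("DT, MASS1913*"))
print(add_field_colons("Roud #17663"))
```

Inconsistently formatted references would each need their own pattern, which is the cost that made the manual edit attractive.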

I then used a script with regular expressions to import while doing the following:
- split song records at the marker '==='
- extract only the values for 'name', 'description', 'earliest_date', 'found_in', 'keywords', 'cross_references', 'roud', 'bi_file', 'st_file', and 'dt_file'
- split and store reference song name and filename information in one-line stub records that only serve to reference a main song
- make stubs inherit Roud number and file references from their parent entries
- extract only the earliest year found in the 'EARLIEST_DATE:' field, which contains mixed data
- replace empty fields with NumPy `NaN` to allow for better data manipulation

These are stored in `df_bi`.
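
Two of the steps above, splitting at `===` and keeping only the earliest year, can be seen in isolation on a small invented sample. The record pattern is the one used in the script; the sample text and the helper name `earliest_year` are hypothetical.

```python
import re

def earliest_year(text):
    """Return the lowest 4-digit year found in text, or None if there is none."""
    years = re.findall(r"\b\d{4}\b", text)
    return min(map(int, years)) if years else None

# Invented sample with two records separated by '===' lines.
sample = (
    "===\n"
    "NAME: Song A\n"
    "EARLIEST_DATE: 1938 (recording), 1925 (manuscript)\n"
    "===\n"
    "NAME: Song B\n"
    "EARLIEST_DATE: before 1850\n"
    "===\n"
)

# Each match is the text between one '===' line and the next.
records = re.findall(r"===\n(.*?)(?=\n===)", sample, flags=re.DOTALL)
for record in records:
    print(earliest_year(record))
```

Because the lookahead does not consume the closing `===`, that same line can open the next record.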

Target: 30445 file records | Output: 30418 file records

In [2]:
import re
import numpy as np
import pandas as pd

# load file into memory
file_path = './Data/BalladIndex/txt/BDIDXTXT/balldidxedited.txt'
with open(file_path, 'r') as file:
    data = file.read()
    
# define each record's start (0) and end (1) marker 
record_pattern = re.compile(r'===\n(.*?)(?=\n===)', re.DOTALL)
# create an iterator that walks over `data` and yields each record match
records = re.finditer(record_pattern, data)

# list to store dicts of extracted `record_data`
records_data = []

# dict of possible fields in a record, and regular expression patterns to find them with
field_patterns = {
    'name': r'NAME: (.*?)(?:\n|$)',
    'description': r'DESCRIPTION: ((?:(?!KEYWORDS|FOUND_IN|REFERENCES|ROUD|File:|CROSS_REFERENCES:|DT:).)*)(?:\n|$)',
    'earliest_date': r'EARLIEST_DATE: (.*?)(?:\n|$)',
    'found_in': r'FOUND_IN: (.*?)(?:\n|$)',
    'keywords': r'KEYWORDS: (.*?)(?:\n|$)',
    'cross_references': r'CROSS_REFERENCES:\n(.*?)(?:(?=\n{2}|===|File:|DT:)|\n\Z)', #TODO: check this needs DT:
    'roud': r'ROUD: (.*?)(?:\n|$)',
    'bi_file': r'File: (.*?)(?:\n|$)',
    'st_file': r'ST: (.*?)(?:\n|$)',
    'dt_file': r'DT: (.*?)(?:\n|$)'
}

# loop to extract information for each record
for record in records:
    # dict to store key and value for each record
    record_data = {}
    # loop to extract information for each field in each record
    for field, pattern in field_patterns.items():
        value = re.search(pattern, record.group(1), re.DOTALL)
        # if a value matching a search pattern has been found, replace `value`
        # to store the right bit: group(1)
        if value:
            value = value.group(1).strip()
            # extra year handling: find all 4-digit years in field then pick lowest
            if field == 'earliest_date':
                years = re.findall(r'\b\d{4}\b', value)
                if years:
                    value = min(map(int, years))
        else:
            value = ""
        record_data[field] = value

    name = record_data['name']

    # stub handling: if 'name' line contains ': see' and/or 'File:',
    # store these in 'cross_references' and 'bi_file' fields accordingly
    if ': see' in name:
        cross_ref_idx = name.find(': see')
        record_data['cross_references'] = 'see ' + name[cross_ref_idx + 5:].strip()
        record_data['name'] = name[:cross_ref_idx].strip()
        record_data['description'] = 'stub'

    bi_file_info = record_data['bi_file']

    # stub handling:
    if '(File:' in bi_file_info:
        cross_ref_idx = bi_file_info.find('(File:')
        # update the 'cross_references' field only if there's no existing value
        # (the key always exists by this point, so test for an empty value instead)
        if not record_data['cross_references']:
            record_data['cross_references'] = bi_file_info[cross_ref_idx + 6:].strip().rstrip(')')
        record_data['bi_file'] = bi_file_info[:cross_ref_idx].strip()

    # remove any brackets from 'file' field
    record_data['bi_file'] = record_data['bi_file'].replace('(', '').replace(')', '')
    # remove any * from DT filenames
    record_data['dt_file'] = record_data['dt_file'].replace('*', '')

    # append the new record_data to the records_data
    records_data.append(record_data)

# create a DataFrame from the records_data
df_bi = pd.DataFrame(records_data)

# fill NaNs
df_bi = df_bi.replace('', np.nan)

# check for empty rows
# df_bi[df_bi.isnull().all(axis=1)]
# remove empty rows (there was only one at the end)
df_bi.dropna(how='all', inplace=True)

df_bi

Unnamed: 0,name,description,earliest_date,found_in,keywords,cross_references,roud,bi_file,st_file,dt_file
0,"10,000 Years Ago",stub,,,,see I Was Born About Ten Thousand Years Ago (B...,,R410,,
1,10th MTB Flotilla Song,stub,,,,see Fred Karno's Army (File: NeFrKaAr),,NeFrKaAr,,
2,13 Highway,"""I went down 13 highway, Down in my baby's doo...",1938,US(SE),grief love promise nonballad lover technology,,29487,Rc13Hwy,,
3,151 Days,stub,,,,see Hundred and Fifty-One Days (File: Colq060),,Colq060,,
4,"1861 Anti Confederation Song, An",stub,,,,see Anti-Confederation Song (File: FJ028),,FJ028,,
...,...,...,...,...,...,...,...,...,...,...
30413,Zula,"""Thou lov'st another, Zula, Thou lovest him al...",1952,US(So),love rejection separation travel,,11330,Brne049,,
30414,"Zulu Warrior, The","""I-kama zimba zimba zayo I-kama zimba zimba ze...",1946,,nonballad nonsense campsong,,,ACFF061A,,
30415,Zum Gali Gali,"Hebrew. ""Zum, gali-gali-gali, Zum gali-gali, Z...",1956,,foreignlanguage campsong,,,ACSF314Z,,
30416,Zutula Dead,A nice girl gave Zutula bitter casava to eat a...,1939,West Indies(Trinidad),death poison food,,,RcALZuDe,,


In [3]:
df_bi_stubs = df_bi[(df_bi['description'] == 'stub')]

In [15]:
#TODO: handle stub inheritance
#make a lookup table where bi_file points to st_file, dt_file and roud
df_file_lookup = df_bi.loc[:,'roud':'dt_file'].dropna(subset=['roud', 'st_file', 'dt_file'], how='all')
#check for duplicates in bi_file (should be unique index)
df_lookup_duplicates = df_file_lookup[df_file_lookup.duplicated(subset=['bi_file'], keep=False)]
df_lookup_duplicates.sort_values('bi_file')
# but there are 301 duplicates. sanity check in source file shows regex is pulling from mid-line 'File: '. 
# added newline-detecting lookahead to field pattern -> fixed: 0 duplicates. but broke stub bi_file handling

Unnamed: 0,roud,bi_file,st_file,dt_file
15682,37845,ACSF125L,,
9987,25468,ACSF125L,,
18149,12793,ACSF166Y,,SILKHAT
8441,,ACSF166Y,,(FUNICUL)
11875,15704 and 37844,ACSF228I,,
...,...,...,...,...
28919,16402,Wa094,Wa094 (Partial),
2681,4769,Wa156,R214 (Full),BONBLUE
25064,7484,Wa156,,STHREPLY
424,V39177,WhBA0M2,,
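
The mid-line `File: ` problem, and the stub breakage caused by the newline fix, can both be handled with a two-step lookup. This is a sketch rather than the fix I applied: the helper name is hypothetical, the first record is invented, and the stub line is reconstructed from the `df_bi` output above.

```python
import re

def extract_bi_file(record):
    """Prefer a 'File:' field on its own line; fall back to a stub's '(File: ...)' suffix."""
    match = re.search(r"^File: (.*)$", record, flags=re.MULTILINE)
    if match:
        return match.group(1).strip()
    match = re.search(r"^NAME: .*\(File: ([^)]+)\)\s*$", record, flags=re.MULTILINE)
    return match.group(1).strip() if match else None

full_record = (
    "NAME: Example Song\n"
    "DESCRIPTION: Mentions another entry (File: Other01) in passing\n"
    "File: Ex001\n"
)
stub_record = "NAME: 10th MTB Flotilla Song: see Fred Karno's Army (File: NeFrKaAr)"

# The unanchored pattern picks up the mid-line reference instead of the real field.
loose = re.search(r"File: (.*?)(?:\n|$)", full_record, re.DOTALL).group(1)
print(loose)
print(extract_bi_file(full_record))
print(extract_bi_file(stub_record))
```

With `re.MULTILINE`, `^` matches at the start of every line, so a `File: ` inside a description is ignored while the stub form is still caught by the fallback.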


best test so far...

In [150]:
import re
import pandas as pd
import numpy as np

# get file data
file_path = './Data/BalladIndex/txt/BDIDXTXT/balldidxedited.txt'
with open(file_path, 'r') as file:
    data = file.read()

# define record boundary pattern
record_pattern = re.compile(r'===\n(.*?)(?=\n===)', re.DOTALL)
# feed pattern and data to a callable iterator to get `records` matches
records = re.finditer(record_pattern, data)

# define the fields and regex patterns to find them, as a dict of key-value pairs: TODO: clean up
field_patterns = {
    'name': r'NAME: (.*?)(?:: see |\n|$)',
    'description': r'DESCRIPTION: ((?:(?!KEYWORDS|FOUND_IN|REFERENCES|ROUD|File:|CROSS_REFERENCES:|DT:).)*)(?:\n|$)',
    'earliest_date': r'EARLIEST_DATE: (.*?)(?:\n|$)',
    'found_in': r'FOUND_IN: (.*?)(?:\n|$)',
    'keywords': r'KEYWORDS: (.*?)(?:\n|$)',
    'cross_references': r'CROSS_REFERENCES:\n((?:(?!KEYWORDS|FOUND_IN|REFERENCES|ROUD|File:|DT:).)*)(?:\n|$)', # handle cross_references
    'alternate_titles': r'ALTERNATE_TITLES: ((?:(?!KEYWORDS|FOUND_IN|REFERENCES|ROUD|File:|CROSS_REFERENCES:|DT:).)*)(?:\n|$)',
    'key_name': r'NAME: .+?: see (.*?)(?:\()',  # extract main song for stubs
    'bi_file': r'File: (.*?)(?:\n|$|^NAME: |^File: |: see |\).*?)',
    'st_file': r'ST: (.*?)(?:\n|$)',
    'dt_file': r'DT: (.*?)(?:[*]|\n|$)',
    'roud': r'ROUD: (.*?)(?:\n|$)'
}

# initialise a list to store dicts of extracted `record_data`
records_data = []

# loop over each of the `records` from the iterator
for record in records:
    # initialise a new dict for each record's data fields
    record_data = {}
    # iterate over the patterns and `search` them, storing match group 1 with its field in `record_data`
    for field, pattern in field_patterns.items():
        value = re.search(pattern, record.group(1), re.DOTALL)
        if value:
            value = value.group(1).strip()
            # for dates: get earliest year and dump the rest
            if field == 'earliest_date':
                years = re.findall(r'\b\d{4}\b', value)
                if years:
                    value = min(map(int, years))
        else:
            value = ""
        record_data[field] = value
    # add each finished record to the list
    records_data.append(record_data)

# make the data into a df, fill empty fields with `NaN`s, drop any empty rows
df_bi_test = pd.DataFrame(records_data)
df_bi_test.replace('', np.nan, inplace=True)
df_bi_test.dropna(how='all', inplace=True)

df_bi_test.head(25)

Unnamed: 0,name,description,earliest_date,found_in,keywords,cross_references,alternate_titles,key_name,bi_file,st_file,dt_file,roud
0,"10,000 Years Ago",,,,,,,I Was Born About Ten Thousand Years Ago,R410,,,
1,10th MTB Flotilla Song,,,,,,,Fred Karno's Army,NeFrKaAr,,,
2,13 Highway,"""I went down 13 highway, Down in my baby's doo...",1938.0,US(SE),grief love promise nonballad lover technology,,,,Rc13Hwy,,,29487.0
3,151 Days,,,,,,,Hundred and Fifty-One Days,Colq060,,,
4,"1861 Anti Confederation Song, An",,,,,,,Anti-Confederation Song,FJ028,,,
5,1913 Massacre,"In Calumet, Michigan, striking copper miners a...",1945.0,US,lie strike death labor-movement mining disaste...,"cf. ""One Morning in May (To Hear the Nightinga...",,,FSWB306A,,MASS1913,17663.0
6,1918 East Broadway,"Counting-out rhyme? ""The people who live acros...",1980.0,US,home food fight floatingverses,"cf. ""Ickie Bickie Soda Cracker"" (lyrcs)\ncf. ""...",,,ZiZa073B,,,
7,2 Y's U R (Too Wise You Are),"""2 Y's U R, 2 Y's U B, I C U R, 2 Y's 4 me."" S...",1831.0,New Zealand US,wordplay,,,,SuSm138A,,,
8,23rd Flotilla,"""Up to Kola Inlet, back to Scapa Flow... Why d...",1962.0,Canada Britain(England),navy hardtimes technology,"cf. ""Lili Marlene"" (tune) and references there...",,,Hopk112,,,29405.0
9,"3, 6, 9, The Goose Drank Wine",,,,,,,"Three, Six, Nine",OpGa135,,,


In [93]:
# tried to do something with match groups but it didn't work
# import re
# import pandas as pd
# import numpy as np

# file_path = './Data/BalladIndex/txt/BDIDXTXT/balldidxedited.txt'
# with open(file_path, 'r') as file:
#     data = file.read()

# record_pattern = re.compile(r'===\n(.*?)(?=\n===)', re.DOTALL)
# records = re.finditer(record_pattern, data)

# field_patterns = {
#     'name': r'NAME: (.*?)(?:(: see )|\n|$)',
#     'description': r'DESCRIPTION: ((?:(?!KEYWORDS|FOUND_IN|REFERENCES|ROUD|File:|CROSS_REFERENCES:|DT:).)*)(?:\n|$)',
#     'earliest_date': r'EARLIEST_DATE: (.*?)(?:\n|$)',
#     'found_in': r'FOUND_IN: (.*?)(?:\n|$)',
#     'keywords': r'KEYWORDS: (.*?)(?:\n|$)',
#     'cross_references': r'CROSS_REFERENCES:\n((?:(?!KEYWORDS|FOUND_IN|REFERENCES|ROUD|File:|DT:).)*)(?:\n|$)', # handle cross_references
#     'alternate_titles': r'ALTERNATE_TITLES: ((?:(?!KEYWORDS|FOUND_IN|REFERENCES|ROUD|File:|CROSS_REFERENCES:|DT:).)*)(?:\n|$)',
#     'bi_file': r'File: (.*?)(?:\n|$|^NAME: |^File: |: see |\).*?)',
#     'st_file': r'ST: (.*?)(?:\n|$)',
#     'dt_file': r'DT: (.*?)(?:[*]|\n|$)',
#     'roud': r'ROUD: (.*?)(?:\n|$)'
# }

# records_data = []

# for record in records:
#     record_data = {}
#     for field, pattern in field_patterns.items():
#         get = re.search(pattern, record.group(1), re.DOTALL)
#         if get:
#             value = get.group(1).strip()
#             if field == 'earliest_date':
#                 years = re.findall(r'\b\d{4}\b', value)
#                 if years:
#                     value = min(map(int, years))
#             #store nonmatching group to trigger stub storage later?
#             if field == 'name':
#                 check_stub_name = get.group(2)
#         else:
#             value = ""
#         record_data[field] = value

#     # Handle 'cross_references' for both stubs and normal records
#     cross_ref_match = re.search(r'CROSS_REFERENCES:\s*(.*?)\n', record.group(1))
#     if cross_ref_match:
#         record_data['cross_references'] = cross_ref_match.group(1).strip()

#     # Handle stubs and store fields in 'cross_references' and 'bi_file' fields accordingly
#     if ': see ' in check_stub_name:
#         record_data['bi_file'] = re.search(r'File: (.*?)(?:\n|$)', record.group(1)).group(1)
#         record_data['description'] = 'stub' #not working

#     records_data.append(record_data)

# df_bi_test = pd.DataFrame(records_data)
# df_bi_test.replace('', np.nan, inplace=True)
# df_bi_test.dropna(how='all', inplace=True)

# df_bi_test.head(25)


TypeError: argument of type 'NoneType' is not iterable
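
The likely cause of this error is that an optional regex group which takes no part in a match returns `None`, so the later `': see ' in check_stub_name` test fails on non-stub records. A sketch of a guarded version follows; the function name `is_stub` is hypothetical and the name pattern is the one from the commented-out cell.

```python
import re

# Group 2 is only set when the NAME line contains ': see '.
NAME_PATTERN = re.compile(r"NAME: (.*?)(?:(: see )|\n|$)")

def is_stub(record):
    """Return True when the NAME line contains a ': see ' cross-reference."""
    match = NAME_PATTERN.search(record)
    # group(2) is None for ordinary records, so test truthiness rather than using `in`.
    return bool(match and match.group(2))

print(is_stub("NAME: 151 Days: see Hundred and Fifty-One Days (File: Colq060)"))
print(is_stub("NAME: 13 Highway\nDESCRIPTION: placeholder"))
```

Testing the group's truthiness also covers records with no `NAME:` line at all, where the original variable would never have been assigned.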

In [127]:
#make a lookup table where bi_file points to st_file, dt_file and roud
df_file_lookup_test = df_bi_test.loc[:,'bi_file':'roud'].dropna(subset=['roud', 'st_file', 'dt_file'], how='all')
#check for duplicates in bi_file (should be unique index but is not due to 'bi_file' regex problems)
df_lookup_duplicates = df_file_lookup_test[df_file_lookup_test.duplicated(subset=['bi_file'], keep=False)]
df_lookup_duplicates.sort_values('bi_file')

Unnamed: 0,bi_file,st_file,dt_file,roud
9987,ACSF125L,,,25468
15682,ACSF125L,,,37845
18149,ACSF166Y,,SILKHAT,12793
8441,ACSF166Y,,(FUNICUL),
11875,ACSF228I,,,15704 and 37844
...,...,...,...,...
2681,Wa156,R214 (Full),BONBLUE,4769
424,WhBA0M2,,,V39177
9612,WhBA0M2,ChWI239 (Full),GRNSLVS,V19581
29234,Zimm075,,,7596


Despite the error in the data, I want to try an initial merge using the `_test` versions of my DataFrames.

In [113]:
df_file_lookup_test

Unnamed: 0,bi_file,st_file,dt_file,roud
2,Rc13Hwy,,,29487
5,FSWB306A,,MASS1913,17663
8,Hopk112,,,29405
11,Hopk039,,,29404
12,Hopk046,,,29403
...,...,...,...,...
30400,San449,San449 (Full),,12174
30404,SuSm091B,,,20694
30406,Dett196,,,15233
30407,Fus214,Fus214 (Partial),,16373


In [144]:
df_file_lookup_test.drop_duplicates(subset='bi_file', inplace=True) #hack to account for broken data
df_file_lookup_test

Unnamed: 0,bi_file,st_file,dt_file,roud
0,Rc13Hwy,,,29487
1,FSWB306A,,MASS1913,17663
2,Hopk112,,,29405
3,Hopk039,,,29404
4,Hopk046,,,29403
...,...,...,...,...
12421,San449,San449 (Full),,12174
12422,SuSm091B,,,20694
12423,Dett196,,,15233
12424,Fus214,Fus214 (Partial),,16373


In [145]:
# set 'bi_file' as index for lookup table 
df_file_lookup_test.set_index(['bi_file'], inplace=True)
df_file_lookup_test


Unnamed: 0_level_0,st_file,dt_file,roud
bi_file,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Rc13Hwy,,,29487
FSWB306A,,MASS1913,17663
Hopk112,,,29405
Hopk039,,,29404
Hopk046,,,29403
...,...,...,...
San449,San449 (Full),,12174
SuSm091B,,,20694
Dett196,,,15233
Fus214,Fus214 (Partial),,16373


In [146]:
# fill missing values from lookup table
df_bi_test['roud'] = df_bi_test['roud'].fillna(df_bi_test['bi_file'].map(df_file_lookup_test['roud']))
df_bi_test['st_file'] = df_bi_test['st_file'].fillna(df_bi_test['bi_file'].map(df_file_lookup_test['st_file']))
df_bi_test['dt_file'] = df_bi_test['dt_file'].fillna(df_bi_test['bi_file'].map(df_file_lookup_test['dt_file']))

In [147]:
df_bi_test

Unnamed: 0,name,description,earliest_date,found_in,keywords,cross_references,alternate_titles,key_name,bi_file,st_file,dt_file,roud
0,"10,000 Years Ago",,,,,,,I Was Born About Ten Thousand Years Ago,R410,,,
1,10th MTB Flotilla Song,,,,,,,Fred Karno's Army,NeFrKaAr,,,10533
2,13 Highway,"""I went down 13 highway, Down in my baby's doo...",1938,US(SE),grief love promise nonballad lover technology,,,,Rc13Hwy,,,29487
3,151 Days,,,,,,,Hundred and Fifty-One Days,Colq060,,,
4,"1861 Anti Confederation Song, An",,,,,,,Anti-Confederation Song,FJ028,FJ028 (Partial),,4518
...,...,...,...,...,...,...,...,...,...,...,...,...
30413,Zula,"""Thou lov'st another, Zula, Thou lovest him al...",1952,US(So),love rejection separation travel,,,,Brne049,,,11330
30414,"Zulu Warrior, The","""I-kama zimba zimba zayo I-kama zimba zimba ze...",1946,,nonballad nonsense campsong,,,,ACFF061A,,,
30415,Zum Gali Gali,"Hebrew. ""Zum, gali-gali-gali, Zum gali-gali, Z...",1956,,foreignlanguage campsong,,,,ACSF314Z,,,
30416,Zutula Dead,A nice girl gave Zutula bitter casava to eat a...,1939,West Indies(Trinidad),death poison food,,,,RcALZuDe,,,


In [143]:
df_bi_test[df_bi_test.duplicated(subset=['bi_file'], keep=False)].head(60)

Unnamed: 0,name,description,earliest_date,found_in,keywords,cross_references,alternate_titles,key_name,bi_file,st_file,dt_file,roud
0,"10,000 Years Ago",,,,,,,I Was Born About Ten Thousand Years Ago,R410,,,
1,10th MTB Flotilla Song,,,,,,,Fred Karno's Army,NeFrKaAr,,,
3,151 Days,,,,,,,Hundred and Fifty-One Days,Colq060,,,
4,"1861 Anti Confederation Song, An",,,,,,,Anti-Confederation Song,FJ028,,,
8,23rd Flotilla,"""Up to Kola Inlet, back to Scapa Flow... Why d...",1962.0,Canada Britain(England),navy hardtimes technology,"cf. ""Lili Marlene"" (tune) and references there",,,Hopk112,,,29405.0
9,"3, 6, 9, The Goose Drank Wine",,,,,,,"Three, Six, Nine",OpGa135,,,
10,'31 Depression Blues,Coal miner tells of hard times in the Depressi...,1968.0,US(Ap),strike mining work hardtimes labor-movement,"cf. ""Bright Sunny South"" (tune)",,,Rc31DB,,,
13,500 Miles,,,,,,,Nine Hundred Miles,LxU073,,,
15,900 Miles,,,,,,,Nine Hundred Miles,LxU073,,,
16,A Begging We Will Go (I),,,,,,,A-Begging I Will Go,K217,,,


Later I will repeat the process with the fixed data:

In [7]:
# set 'bi_file' as index for lookup table 
df_file_lookup.set_index('bi_file', inplace=True)
df_file_lookup

# fill missing values from lookup table
df_bi['roud'] = df_bi['roud'].fillna(df_bi['bi_file'].map(df_file_lookup['roud']))
df_bi['st_file'] = df_bi['st_file'].fillna(df_bi['bi_file'].map(df_file_lookup['st_file']))
df_bi['dt_file'] = df_bi['dt_file'].fillna(df_bi['bi_file'].map(df_file_lookup['dt_file']))


InvalidIndexError: Reindexing only valid with uniquely valued Index objects
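
`Series.map` needs a lookup with a unique index, which is why the duplicated `bi_file` values trigger this error. The sketch below makes the lookup unique before mapping. Keeping the first row per `bi_file` is an arbitrary assumption, and the real remedy is still to fix the `bi_file` extraction; the toy values are copied from the duplicate listing above.

```python
import pandas as pd

lookup = pd.DataFrame({
    "bi_file": ["Wa156", "Wa156", "Rc13Hwy"],
    "roud": ["4769", "7484", "29487"],
})
songs = pd.DataFrame({
    "bi_file": ["Wa156", "Rc13Hwy", "Colq060"],
    "roud": [None, "29487", None],
})

# Inspect the conflicting rows before discarding any of them.
dupes = lookup[lookup.duplicated("bi_file", keep=False)]
print(dupes)

# ASSUMPTION: keep the first row for each bi_file so the index becomes unique.
unique_lookup = lookup.drop_duplicates("bi_file", keep="first").set_index("bi_file")

# Fill only the missing Roud numbers from the lookup.
songs["roud"] = songs["roud"].fillna(songs["bi_file"].map(unique_lookup["roud"]))
print(songs)
```

Rows whose `bi_file` is absent from the lookup simply stay empty.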

In [None]:
df_bi.loc[(df_bi['description'] == 'stub') & ~df_bi['roud'].isna()]

Unnamed: 0,name,description,earliest_date,found_in,keywords,cross_references,roud,bi_file,st_file,dt_file


In [None]:
#df_bi.to_csv('df_bi.csv') #save to CSV

Target: 2623 DT file references | Output: 2605 DT file references

In [None]:
df_bi.query("dt_file.notna()").dt_file.count()

2605

Target: 1180 ST file references | Output: 1166 ST file references

In [None]:
df_bi.query("st_file.notna()").st_file.count()

1166

The following query shows that, if I joined the data now and every referenced lyrics file could be extracted, I would have 3154 main songs with both Roud numbers and lyrics.

In [None]:
df_lyrics_available = df_bi.query("(st_file.notna() | dt_file.notna()) & roud.notna()")
df_lyrics_available[['name', 'roud', 'bi_file', 'st_file', 'dt_file']].sort_values('roud')

Unnamed: 0,name,roud,bi_file,st_file,dt_file
9721,"Gypsy Laddie, The [Child 200]",1,C200,,"200, GYPDAVY GYPLADD GYPLADD2 GYPLADD3 GYPLADX..."
15901,Lord Randal [Child 12],10,C012,,"12, LORDRAN1 LORDRNLD EELHENRY EELHENR2"
2763,Bonny Baby Livingston [Child 222],100,C222,,BABLIVST
2598,"Bold Privateer, The [Laws O32]",1000,LO32,LO32 (Full),"486, BOLDPRIV BLDPRIV2"
7379,Fair Fanny Moore [Laws O38],1001,LO38,,"337, FANMOORE FANMOOR2"
...,...,...,...,...,...
15852,Lord Cornwallis's Surrender,V50597,SBoA088,,LRDCRNWL
17128,"Memory of the Dead, The",V5143,PGa039,,MEMRYDED
25278,"Star-Spangled Banner, The",V5200,SRW008 the source song,,STARSPAN
13901,Jolly Good Ale and Old (Back and Sides Go Bare),V7039,DTbcksid,,BACK&SID


This number of songs may even increase if:
1. I can match variant lyrics from the other data sources to the variant titles and multiple file references listed here, in order to get more song records
2. by chance, some backwards file references to BI files are found in the two lyrics data sources which were not found inside the Ballad Index

However, this is still unlikely to constitute enough data to cluster lyrics into Roud-sized clusters and compare systems, as our available data currently only averages 1.03 songs per unique Roud number.

In [None]:
# Number of unique Roud numbers amongst songs that now have lyrics matched:
df_lyrics_available.roud.nunique()

3032

In [None]:
# Number of entries with lyrics from DT:
df_lyrics_available.dt_file.dropna().count()

2305

In [None]:
# Number of DT entries (including multiples) on songs with Roud numbers:
def word_count(series):
    text = ' '.join(series.dropna().astype(str))
    files = text.split()
    return len(files)

word_count(df_lyrics_available.dt_file)

4143
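
Splitting on whitespace also counts leading numeric tokens such as `200,` or `486,` in the `dt_file` field, so the figure above may overstate the number of files. The sketch below skips them, assuming those tokens are DT numbers rather than filenames; the function name is hypothetical and the sample cells are copied from the table above.

```python
import re
import pandas as pd

def count_dt_files(series):
    """Count filename tokens in a dt_file column, skipping purely numeric tokens."""
    total = 0
    for cell in series.dropna().astype(str):
        for token in cell.split():
            # ASSUMPTION: tokens like "486," are DT numbers, not filenames.
            if re.fullmatch(r"\d+,?", token):
                continue
            total += 1
    return total

sample = pd.Series(["486, BOLDPRIV BLDPRIV2", "337, FANMOORE FANMOOR2", "BABLIVST", None])
print(count_dt_files(sample))
```

On this sample a plain whitespace split would count seven tokens, two more than the number of filenames.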

Even allowing for the multiple entries per BI row for DT files and assuming we can use all of them, that would leave us with a maximum of 4990 lyrics, giving a song-to-Roud ratio of only 1.6.

### ST (Supplemental Tradition of BI)

The Supplemental Tradition is the lyrics index of the Ballad Index. Again, I must use regular expressions to extract the data, this time from `supptrad.txt`. This has a different format to the BI.

The main song title is listed at the head of the records, followed by the type of lyrics (`Complete text(s)` or `Partial text(s)`), followed by different versions of the lyrics marked `*** A ***`, `*** B ***`, `*** C ***`, and so on, often preceded by an alternate title and notes about the story and/or provenance of the lyrics.

Due to the aforementioned song-based classification system of the BI, multiple alternate versions are often linked to one BI record file and key title. Later I may want to split the files into different versions, so I will treat the main song record as a parent (`key_`...) and treat the versions as children, which will stand as individual records but inherit some values from their parents. Some of the alternate versions do not have their own names.

I want to extract: `key_name`, `key_full_part`, `version_in_key`, `name`, `provenance` [detected to exclude from lyrics], `lyrics`, `bi_file` [this belongs to key/parent but I want to name consistently for later data combinations]
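
The parent/child split can be sketched with a single `re.split` on the version markers. This is a simplified illustration under assumptions, not the parser used here: the record text is invented and the helper name is hypothetical.

```python
import re

# A version marker such as "*** A ***" on a line of its own, possibly indented.
VERSION_MARKER = re.compile(r"^[ \t]*\*\*\* ([A-Z]) \*\*\*[ \t]*$", flags=re.MULTILINE)

def split_versions(record):
    """Split one ST record into its header and a list of (version letter, text) pairs."""
    parts = VERSION_MARKER.split(record)
    # With one capturing group, parts alternate: header, 'A', text, 'B', text, ...
    header = parts[0]
    versions = list(zip(parts[1::2], (text.strip() for text in parts[2::2])))
    return header, versions

# Invented record in the layout described above.
record = (
    "Example Key Title\n"
    "Complete text(s)\n"
    "\n"
    "          *** A ***\n"
    "\n"
    "First Version Title\n"
    "\n"
    "From a hypothetical songbook, p. 1\n"
    "\n"
    "placeholder lyric line one\n"
    "          *** B ***\n"
    "\n"
    "placeholder lyric line two\n"
    "File: Ex001\n"
)

header, versions = split_versions(record)
print([letter for letter, _ in versions])
```

The header holds the key title and text type, and each child text still needs its title, provenance and lyrics separated.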

In [None]:
with open('./Data/BalladIndex/txt/supptradedited.txt', 'r') as file:
    data = file.read()
    
def parse_lyric_information(data):
    outer_records = data.split("\n===\n")  # split into outer records
    records_list = []

    for record in outer_records:
        outer_lines = record.strip().split('\n')
        if len(outer_lines) < 2:
            continue  # skip 'records' with insufficient lines

        key_name = None
        key_full_part = None
        bi_file = None

        inner_records = record.strip().split('          *** ')[1:]  # split into inner records

        for i, line in enumerate(outer_lines):
            if line.startswith("==="):
                if i > 0:
                    break  # Stop looking for key_name and key_full_part after the first record
            elif not key_name:
                key_name = line.strip()
            elif not key_full_part:
                key_full_part = line.strip()
            elif not bi_file:
                bi_file_match = re.search(r"File: (.+)", line)
                if bi_file_match:
                    bi_file = bi_file_match.group(1).strip()

        for inner_record in inner_records:
            lines = inner_record.strip().split('\n')
            version_in_key = None
            name = None
            provenance = None
            lyrics = None

            is_in_provenance = False
            provenance_lines = []
            is_in_lyrics = False
            lyrics_lines = []

            for line in lines:
                if not version_in_key and line.strip() and line.strip()[0].isupper():
                    version_in_key = line.strip()[0]
                elif not name and line.strip() and not line.strip().startswith("From ") and not line.strip().startswith("Text ") and \
                        not line.strip().startswith("Derived ") and not line.strip().startswith("As printed ") and \
                        not line.strip().startswith("Supplied ") and not line.strip().startswith("Lyrics ") and \
                        not line.strip().startswith("As found in ") and not line.strip().startswith("As recorded ") and \
                        not line.strip().startswith("Also from ") and not line.strip().startswith("Also supplied") and \
                        not line.strip().startswith("Derived from "):
                    if name is None:
                        name = line.strip()
                elif not provenance and re.match(r"^(From |Text |Derived |As printed |Supplied |Lyrics |As found in |As recorded |Also from |Also supplied|Derived from )", line):
                    is_in_provenance = True
                elif not lyrics and not is_in_provenance and not is_in_lyrics and version_in_key:
                    is_in_lyrics = True

                if is_in_provenance:
                    if line.strip():
                        provenance_lines.append(line.strip())
                    elif not line.strip() and provenance_lines:
                        is_in_provenance = False
                        provenance = "\n".join(provenance_lines)
                        provenance_lines = []
                elif is_in_lyrics:
                    if line.strip() and name is not None and name not in line and not line.startswith('File: '):
                        if line.strip() == "===":
                            is_in_lyrics = False  # Stop capturing lyrics at the demarcating line
                        else:
                            if lyrics_lines and not lyrics_lines[-1].endswith(('.', '?', '!', ',', ';', ':',)):
                                lyrics_lines[-1] += ', ' + line.strip()
                            else:
                                lyrics_lines.append(line.strip())

            if provenance_lines:
                provenance = "\n".join(provenance_lines)

            if name is not None:
                if name != "" and name in lines:
                    name_index = lines.index(name)
                    # explicit parentheses: 'and' binds tighter than 'or', so this is how the check is evaluated
                    if name_index == 0 or (lines[name_index - 1] == "" and (name_index == len(lines) - 1 or lines[name_index + 1] == "")):
                        name = name.strip()
                    else:
                        name = ""

            # join the collected lyrics lines from the list into a string
            if lyrics_lines:
                lyrics = " ".join(lyrics_lines)

            # append the extracted data to the records list
            records_list.append([key_name, key_full_part, bi_file, version_in_key, provenance, name, lyrics])

    # create a DataFrame from the records list 
    columns = ["key_name", "key_full_part", "bi_file", "version_in_key", "provenance", "name", "lyrics"]
    df = pd.DataFrame(records_list, columns=columns)
    return df

df_st = parse_lyric_information(data)
df_st = df_st.replace('', np.nan)
df_st


Unnamed: 0,key_name,key_full_part,bi_file,version_in_key,provenance,name,lyrics
0,"A Robin, Jolly Robin",Complete text(s),Perc1185,A,"From Percy/Wheatley, I.ii.4, pp. 186-187",A Robyn Jolly Robyn,"""[F]rom what appears to be the most ancient of..."
1,"A Robin, Jolly Robin",Complete text(s),Perc1185,B,"From Shakespeare, ""Twelfth Night"" Act IV, scen...",(No title),"71 'Hey, Robin, jolly Robin, 72 Tell me how..."
2,"A, U, Hinny Bird",Partial text(s),StoR160,A,"From Stokoe/Reay, Songs and Ballads of Norther...",,"A, U, hinny burd; The bonny lass o' Benwell, A..."
3,Adieu to Erin (The Emigrant),Complete text(s),SWMS255,A,"As found in Gale Huntington, Songs the Whaleme...",Adieu to Erin,"Oh, when I breathed a last adieu, To Erin's an..."
4,"Agincourt Carol, The",Complete text(s),MEL51,A,"From the Bodleian Library (Cambridge), MS. Sel...",The Song of Agincourt,"Deo gracias anglia, Redde pro victoria, 1 Owre..."
...,...,...,...,...,...,...,...
1224,Young Strongbow,Partial text(s),FlNG210,A,"From Helen Hartness Flanders, Elizabeth Flande...",,"In olden times there came, A likely youth who ..."
1225,Young Waters [Child 94],Complete text(s),C094,A,"From Percy/Wheatley, II.ii.18, pp. 229-231",,"one sheet 8vo."", About Yule, quhen the wind bl..."
1226,Zeb Tourney's Girl [Laws E18],Complete text(s),LE18,A,"As recorded by Vernon Dalhart, 1926. Transcrib...",,"Down in the Tennessee mountains, Away from the..."
1227,Zek'l Weep,Complete text(s),San449,A,"From Carl Sandburg, The American Songbag, pp. ...",,"1 Zek'l weep, Zek'l moan, Flesh come a-creepin..."


Target: 1136 records | Output: 1229 records

### DT (Mudcat's Digitrad)

The only Mudcat Digitrad file available to download is an askSam 32-bit MS-DOS database, which I was not able to open. However, one database file in the ZIP showed the lyrics in plain text. A lack of consistent record delimiters and field labels/delimiters, together with many (often invisible) Unicode control characters, made extraction challenging and unreliable. I extracted the data using regular expressions, after using a text editor to replace some errant Unicode characters in the source with line breaks and spaces. (The source file is itself a marginally more human-readable side-effect of a failed attempt to open the database in a newer version of askSam for Windows.)
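
The manual clean-up of invisible characters described above could in principle be scripted. The following is only a minimal sketch, not the process actually used here: the helper name `replace_control_chars` and the sample string are hypothetical, and it assumes that every Unicode control (`Cc`) or format (`Cf`) character other than newline and tab can safely become a space.

```python
import unicodedata


def replace_control_chars(text, replacement=" "):
    """Replace Unicode control/format characters, keeping newlines and tabs."""
    keep = {"\n", "\t"}
    return "".join(
        ch if ch in keep or unicodedata.category(ch) not in ("Cc", "Cf") else replacement
        for ch in text
    )


# Hypothetical sample: a NUL byte and a zero-width space hidden in the text
sample = "ZEBRA DUN\x00We was camped\u200bon the plains\n"
print(repr(replace_control_chars(sample)))
```

Replacing rather than deleting the characters keeps words from running together, at the cost of some doubled spaces that a later whitespace-normalising step would need to collapse.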

Due to these challenges, some issues remain in the data:
- some titles are incorrect 
- some lyrics are incomplete due to titles being recognised in the wrong places 
- some lyrics still include notes on the text which were not easy to separate from the lyrics themselves

This data is stored in `df_dt`:

Target: 8932 file records | Output: 8249 file records

In [None]:
with open('./Data/Mudcat/Z02cv4edited.txt', 'r', encoding='latin-1') as file:
    data = file.read()

def extract_records_from_text(text):
    # split records based on name detection
    records = re.split(r'\n(?=[A-Z0-9\s\'\"\?\!\.\,\(\)\[\]\:\;\–\—\-]+[A-Z0-9][A-Z0-9\s\'\"\?\!\.\,\(\)\\[\]:\;\–\—\-]{4,}(?:\n|$))', text)

    # lists to store extracted data
    filenames = []
    names = []
    lyrics = []
    keywords = []

    # iterate over records to extract data
    i = 0
    while i < len(records):
        record = records[i]

        # find and store the name
        name_match = re.search(r'^\s*([A-Z0-9\s\'\"\?\!\.\,\(\)\[\]\:\;\–\—\-]+[A-Z0-9][A-Z0-9\s\'\"\?\!.\,\(\)\\[\]:\;\–\—\-]{4,})\s*$', record, flags=re.MULTILINE)
        if name_match and not re.match(r'^-+$', name_match.group(1)) and '\n' not in name_match.group(1):
            name = name_match.group(1).strip()
        else:
            name = ''
            i += 1
            continue

        # reject name if it's one of the other strings that produce false matches - TODO: combine this with the check above?
        if name == 'OCT98':
            i += 1
            continue

        # find and store keywords (each starting with @ and all on the same line)
        keywords_match = re.search(r'@(.+?)\n', record)
        if keywords_match:
            keywords_line = keywords_match.group(1)
            keywords_list = [keyword.strip('@') for keyword in keywords_line.split() if keyword.strip('@').isalnum()]
        else:
            keywords_list = []

        # find and store the lyrics section (everything between name and keywords or filename)
        lyrics_match = re.search(r'(?<=^' + re.escape(name) + r'\n)(.*?)(?=\n@|filename:)', record, flags=re.DOTALL)
        if lyrics_match:
            lyrics_text = lyrics_match.group(1).strip()

            # don't store the first line of the section if it's likely a note
            first_line_break_idx = lyrics_text.find('\n')
            if first_line_break_idx != -1:
                first_line = lyrics_text[:first_line_break_idx].strip()
                if first_line.startswith('(') and first_line.endswith(')') or first_line == '-Traditional':
                    lyrics_text = lyrics_text[first_line_break_idx+1:].strip()
            
            # cut off lyrics if there is a line of underscores
            lyrics_cutoff_idx = lyrics_text.find('_________________________')
            if lyrics_cutoff_idx != -1:
                lyrics_text = lyrics_text[:lyrics_cutoff_idx].strip()

        else:
            lyrics_text = ''
            i += 1
            continue

        # find and store the filename based on 'filename: '
        filename_match = re.search(r'filename:\s*(.*)', record)
        if filename_match:
            filename = filename_match.group(1).strip() 
        else:
            filename = ''
            i += 1
            continue

        # append extracted data to lists
        filenames.append(filename)
        names.append(name)
        lyrics.append(lyrics_text)
        keywords.append(keywords_list)

        # Move to the next record
        i += 1

    # create a DataFrame from the extracted data
    df = pd.DataFrame({
        'dt_file': filenames,
        'name': names,
        'lyrics': lyrics,
        'keywords': keywords
    })

    return df

df_dt = extract_records_from_text(data)
df_dt

Unnamed: 0,dt_file,name,lyrics,keywords
0,HARDTAC,'ARD TAC,"1.I'm a shearer, yes I am, and I've shorn 'em...","[Australia, sheep, shearing, drink]"
1,FISHFRY,(I'VE GOT) BIGGER FISH TO FRY,"Sittin' on the bank of that muddy Mississippi,...","[fishing, food]"
2,JULY12,THE 12TH OF JULY,Come pledge again your heart and your hand\n O...,"[Irish, peace]"
3,AVENUE16,16TH AVENUE,"From the corners of the country, from the citi...",[country]
4,MASS1913,THE 1913 MASSACRE,Take a trip with me in nineteen thirteen\nTo C...,"[union, work, death, Xmas]"
...,...,...,...,...
8244,ZEBTURNY,ZEB TOURNEY'S GIRL,"Down in the Tennessee mountains,\nFar from the...",[feud]
8245,ZEBRADUN,ZEBRA DUN,We was camped on the plains at the head of the...,"[cowboy, animal]"
8246,ZENGOSPE,ZEN GOSPEL SINGING,I once was a Baptist and on each Sunday morn\n...,[religion]
8247,ZULIKA,ZULEIKA,"Zuleika was fair to see,\nA fair Persian maide...","[marriage, infidelity]"


## Combine BI with ST and DT

Next I'll add the lyrics to the Ballad Index data by merging the other two dataframes on filenames and storing the result as `df_all_plus_lyrics`.

Viewing the column names gives me an overview of the columns to match:

In [None]:
display('BI: ', df_bi.columns,
    'ST: ', df_st.columns,
    'DT: ', df_dt.columns)

'BI: '

Index(['name', 'description', 'earliest_date', 'found_in', 'keywords',
       'cross_references', 'roud', 'bi_file', 'st_file', 'dt_file'],
      dtype='object')

'ST: '

Index(['key_name', 'key_full_part', 'bi_file', 'version_in_key', 'provenance',
       'name', 'lyrics'],
      dtype='object')

'DT: '

Index(['dt_file', 'name', 'lyrics', 'keywords'], dtype='object')

In [None]:
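# merge on the shared 'dt_file' column; suffixes distinguish the overlapping 'name' and 'keywords' columns
df_all_plus_lyrics = df_bi.merge(df_dt, how='outer', on='dt_file', suffixes=('_bi', '_dt'))
df_all_plus_lyrics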
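
Attaching both lyric sources to the Ballad Index means joining on two different keys: `bi_file` for ST and `dt_file` for DT. The sketch below illustrates one way this might look on toy data. All values are invented, the frames `bi`, `st` and `dt` are hypothetical stand-ins for `df_bi`, `df_st` and `df_dt`, and the left joins and renamed columns are assumptions rather than the approach settled on in this notebook.

```python
import pandas as pd

# Toy stand-ins for df_bi, df_st and df_dt (invented values)
bi = pd.DataFrame({
    "name": ["Song One", "Song Two", "Song Three"],
    "roud": [1.0, 2.0, None],
    "bi_file": ["AAA001", "BBB002", "CCC003"],
    "dt_file": ["SONGONE", None, "SONGTHRE"],
})
st = pd.DataFrame({
    "bi_file": ["AAA001", "AAA001"],
    "version_in_key": ["A", "B"],
    "lyrics": ["first version", "second version"],
})
dt = pd.DataFrame({
    "dt_file": ["SONGONE", "SONGTHRE"],
    "name": ["SONG ONE", "SONG THREE"],
    "lyrics": ["digitrad text one", "digitrad text three"],
})

# Rename overlapping columns up front so the merged frame stays unambiguous
st = st.rename(columns={"lyrics": "lyrics_st"})
dt = dt.rename(columns={"name": "name_dt", "lyrics": "lyrics_dt"})

# Left joins keep every index entry, with NaN where no lyrics were found
merged = bi.merge(st, how="left", on="bi_file").merge(dt, how="left", on="dt_file")
print(merged[["name", "bi_file", "version_in_key", "lyrics_st", "lyrics_dt"]])
```

Because ST can hold several versions of one song, an index entry may appear on more than one row after the first join, which is worth keeping in mind when counting songs later.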

TODO: redo this section, as it excludes variant stubs.

Now I'll store only the Ballad Index entries with both Roud numbers and lyrics in `df_roud_lyrics`:

In [None]:
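# keep only entries with both a Digitrad file and a Roud number (the column is 'dt_file', not 'dt')
# NOTE: this filter still excludes variant stubs, as flagged in the note above
df_roud_lyrics = df_bi[df_bi['dt_file'].notna() & df_bi['roud'].notna()]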

Next I have to check how many lyrics and how many Roud numbers I have, to see if there are enough entries per number to enable comparisons.

In [None]:
df_roud_lyrics.roud.nunique()
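
The number of distinct Roud numbers alone does not show whether any number has enough versions to compare. A rough sketch of a per-number count follows, using a toy frame with invented values in place of `df_roud_lyrics`; the threshold of two versions is an assumption chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for df_roud_lyrics (invented values)
toy = pd.DataFrame({
    "roud": [54, 54, 54, 12, 12, 301],
    "lyrics": ["text a", "text b", "text c", "text d", "text e", "text f"],
})

# Count how many lyric entries exist for each Roud number
versions_per_roud = toy.groupby("roud")["lyrics"].count()

# Keep only the numbers with at least two versions to compare
comparable = versions_per_roud[versions_per_roud >= 2]

print("distinct Roud numbers:", toy["roud"].nunique())
print("numbers with 2+ versions:", len(comparable))
print("entries usable for comparison:", int(comparable.sum()))
```

Roud numbers with a single entry cannot be checked against another version of the same song, so the last figure gives a more realistic picture of the usable data than the raw count.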

## EDA

# Data cleaning and preprocessing

Extract only records with lyrics and number

Cleaning

Transformation?

Tokenisation

# Clustering

Set up model

Tune model

Evaluate clusters

Add features

# Cluster Analysis

# Classification?

# Pipeline?