# Background

English-language folk songs have a long tradition and have changed over time. Songs are not easily identifiable by name alone, and lyrics often vary between versions. Steve Roud began indexing his own collection in the 1970s, and his Roud Index has become the standard for grouping together different versions of the same song. He is still indexing as of 2023.

Could a machine learning algorithm hope to match his skill? Given the lyrics, would it choose the same groupings of songs, where the line between "same" and "different" is fuzzy? Could it help with future indexing?

# Data

## Sources

Although the Roud index is lyrics-based rather than tune-based, the officially-hosted index at vwml.com does not contain lyric transcriptions as a standard data field. Some lyrics are found in scanned images of historical collections, others on linked external sites, others not at all. So the first challenge is to get a dataset with enough full lyrics and Roud numbers in combination. The main contenders are Mudcat and The Traditional Ballad Index, both well-established online song databases.

- Mudcat focuses on song lyrics and tunes, but also contains Roud numbers for approximately 300 songs. Data formats:
    - Digitrad (DT) database MS-DOS download (last updated in 2002)
    - Song web pages
    - Forum posts containing songs
- The Ballad Index focuses on cataloguing*, but also has supplementary lyrics for approximately 1110 songs. Data formats:
    - The Ballad Index Software Filemaker database download
    - Song web pages (without lyrics)
    - The Ballad Index (BI) and The Supplemental Tradition (ST) (lyrics) as HTML or TXT lists

&ast; This is a similar approach to Roud's, but focused on the basic unit of a song rather than its individual instances (e.g. songbook entries or performances). It therefore uses song titles as its main identifiers, with keywords and first lines for disambiguation.


Neither the Mudcat DT nor the Ballad Index (including the Supplemental Tradition) downloadable databases will open.

I therefore need to work with the `.txt` versions of the Ballad Index and Supplemental Tradition and join them in order to link Roud numbers to lyrics.

## Extraction

### Targets (BI, ST, DT)

Based on text editor searches I can hope to extract the following data:
- BI: 30445 song record files
    - 14213 of which are stubs for variants that only refer to other songs
    - 1906 of which refer to DT files (lyrics)
    - 1180 of which refer to ST files (lyrics)
    - 12126 of these contain Roud index numbers
- ST: 1229 lyrics referencing 1136 BI files
    - Note: 404 of the ST filenames seem to be modified DT filenames, eg 'DTwarovr' in ST and BI is the same as 'WAROVR' in DT
- DT: 8932 song record files (lyrics)
    - 793 records also contain a 'DT #' but I don't know what this is. It does not correspond to the file numbers on the Mudcat website.
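
The ST-to-DT filename correspondence noted above could be normalised with a small helper. This is a minimal sketch that generalises from the single `DTwarovr` / `WAROVR` example: it assumes the modification is always a `DT` prefix followed by the lowercased DT filename, which I have not verified across all 404 cases. The function name `st_to_dt_filename` is hypothetical.

```python
from typing import Optional


def st_to_dt_filename(st_name: str) -> Optional[str]:
    """Map an ST/BI filename such as 'DTwarovr' to its DT form 'WAROVR'.

    ASSUMPTION: modified names are 'DT' + the DT filename in lower case.
    Returns None when the name does not carry the 'DT' prefix.
    """
    if st_name.startswith("DT") and len(st_name) > 2:
        return st_name[2:].upper()
    return None


examples = ["DTwarovr", "Perc1185"]
mapped = {name: st_to_dt_filename(name) for name in examples}
print(mapped)
```

Returning `None` for unprefixed names keeps ordinary BI filenames (such as `Perc1185`) from being mistaken for DT references.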


### BI (Ballad Index)

In `balldidx.txt` it is interesting to note that the BI database also references Mudcat's DT filenames, for example `DT, MASS1913*`. This means we can also supplement lyrics by cross-referencing this data.

The text version of the Ballad Index file is tricky to work with, as entries are presented as a list with inconsistent columns and mixed data. I used a text editor to place colons inside Roud numbers and DT filenames so that they could be more easily identified.
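
The manual text-editor step could probably be reproduced with two substitutions, which would make the edit repeatable. This is only a sketch: the `DT, MASS1913*` form is the one quoted in the text, but the raw `Roud #<digits>` form is an assumption on my part, as are the target labels `ROUD: ` and `DT: ` (inferred from the field patterns used in the import script). The function name `tag_references` is hypothetical.

```python
import re


def tag_references(text: str) -> str:
    """Rewrite 'Roud #123' as 'ROUD: 123' and 'DT, NAME' as 'DT: NAME'."""
    # ASSUMPTION: Roud numbers appear as 'Roud #<digits>' in the raw file.
    text = re.sub(r"\bRoud #(\d+)", r"ROUD: \1", text)
    # 'DT, ' followed by an upper-case letter or digit starts a DT filename.
    text = re.sub(r"\bDT, (?=[A-Z0-9])", "DT: ", text)
    return text


# Hypothetical two-line sample, not copied from the real file.
sample = "DT, MASS1913*\nRoud #17663"
print(tag_references(sample))
```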

I then used a script with regular expressions to import while doing the following:
- split song records at the marker '==='
- extract only the values for 'name', 'description', 'earliest_date', 'found_in', 'keywords', 'cross_references', 'roud', 'file', and 'dt'
- split and store reference song name and filename information from one-line stub records that only serve to reference a main song
- extract only the earliest year found in the 'EARLIEST_FOUND:' field which contained mixed data
- replace empty fields with NumPy `NaN` to allow for better data manipulation

These are stored in `df_bi`.

Target: 30445 file records |
Output: 30418 file records

In [8]:
import re
import numpy as np
import pandas as pd

# load file into memory
file_path = './Data/BalladIndex/txt/BDIDXTXT/balldidxedited.txt'
with open(file_path, 'r') as file:
    data = file.read()
    
# define each record's start and end marker then find each record
record_pattern = re.compile(r'===\n(.*?)(?=\n===)', re.DOTALL)
records = re.finditer(record_pattern, data)

# list to store dicts of extracted information
records_data = []

# regular expression patterns describing possible fields and values for a record
field_patterns = {
    'name': r'NAME: (.*?)(?:\n|$)',
    'description': r'DESCRIPTION: ((?:(?!KEYWORDS|FOUND_IN|REFERENCES|ROUD|File:|CROSS_REFERENCES:|DT:).)*)(?:\n|$)',
    'earliest_date': r'EARLIEST_DATE: (.*?)(?:\n|$)',
    'found_in': r'FOUND_IN: (.*?)(?:\n|$)',
    'keywords': r'KEYWORDS: (.*?)(?:\n|$)',
    'cross_references': r'CROSS_REFERENCES:\n(.*?)(?:(?=\n{2}|===|File:|DT:)|\n\Z)', #TODO: check this needs DT:
    'roud': r'ROUD: (.*?)(?:\n|$)',
    'file': r'File: (.*?)(?:\n|$)',
    'dt': r'DT: (.*?)(?:\n|$)'
}

# loop to extract information for each record
for record in records:
    # dict to store keys and values for each record
    record_data = {}
    # loop to extract information for each field in each record
    for field, pattern in field_patterns.items():
        value = re.search(pattern, record.group(1), re.DOTALL)
        if value:
            value = value.group(1).strip()
            # year handling: find all 4-digit years in field then pick lowest
            if field == 'earliest_date':
                years = re.findall(r'\b\d{4}\b', value)
                if years:
                    value = min(map(int, years))
        else:
            value = ""
        record_data[field] = value

    # stub handling: if 'name' contains ': see' and/or 'File:', store these in 'cross_references', and 'file' fields accordingly
    name = record_data['name']
    if ': see' in name:
        cross_ref_idx = name.find(': see')
        record_data['cross_references'] = 'see ' + name[cross_ref_idx + 5:].strip()
        record_data['name'] = name[:cross_ref_idx].strip()

    file_info = record_data['file']
    if '(File:' in file_info:
        cross_ref_idx = file_info.find('(File:')
        record_data['cross_references'] = file_info[cross_ref_idx + 6:].strip().rstrip(')')
        record_data['file'] = file_info[:cross_ref_idx].strip()

    # remove any brackets from 'file' field
    record_data['file'] = record_data['file'].replace('(', '').replace(')', '') 
    # remove any * from DT filenames 
    record_data['dt'] = record_data['dt'].replace('*', '')

    # append the new record_data to the records_data
    records_data.append(record_data)

# create a DataFrame from the records_data
df_bi = pd.DataFrame(records_data)
df_bi = df_bi.replace('', np.nan)
#check for empty rows
#df_bi[df_bi.isnull().all(axis=1)]
# remove 1 empty row
df_bi.dropna(how='all', inplace=True)
df_bi

Unnamed: 0,name,description,earliest_date,found_in,keywords,cross_references,roud,file,dt
0,"10,000 Years Ago",,,,,see I Was Born About Ten Thousand Years Ago (B...,,R410,
1,10th MTB Flotilla Song,,,,,see Fred Karno's Army (File: NeFrKaAr),,NeFrKaAr,
2,13 Highway,"""I went down 13 highway, Down in my baby's doo...",1938,US(SE),grief love promise nonballad lover technology,,29487,Rc13Hwy,
3,151 Days,,,,,see Hundred and Fifty-One Days (File: Colq060),,Colq060,
4,"1861 Anti Confederation Song, An",,,,,see Anti-Confederation Song (File: FJ028),,FJ028,
...,...,...,...,...,...,...,...,...,...
30414,"Zulu Warrior, The","""I-kama zimba zimba zayo I-kama zimba zimba ze...",1946,,nonballad nonsense campsong,,,ACFF061A,
30415,Zum Gali Gali,"Hebrew. ""Zum, gali-gali-gali, Zum gali-gali, Z...",1956,,foreignlanguage campsong,,,ACSF314Z,
30416,Zutula Dead,A nice girl gave Zutula bitter casava to eat a...,1939,West Indies(Trinidad),death poison food,,,RcALZuDe,
30417,"Zwei Soldaten, Die","German. ""Es war einmal zwei Bauersohn, Die hat...",1923,US(MW),foreignlanguage soldier food homicide suicide ...,,,RDL056,


In [15]:
df_bi.to_csv('df_bi.csv') #save to CSV

### ST (Supplemental Tradition of BI)

The Supplemental Tradition is the lyrics index of the Ballad Index. Again, I must use regular expressions to extract the data, this time from `supptrad.txt`, which has a different format from the BI. The main song title is listed at the head of each record, followed by the type of lyrics (`Complete text(s)` or `Partial text(s)`), then the different versions of the lyrics marked `*** A ***`, `*** B ***`, `*** C ***`, and so on. Each version is often preceded by an alternate title and notes about the story and/or provenance of the lyrics.

Due to the aforementioned song-based classification system of the BI, multiple alternate titles and lyrics are linked to one BI file and key title. Later I may want to split the key files into different versions.

For now I want to extract: key_name, key_full_part, key_version, name, [ignore: provenance, notes], lyrics, key_file
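
Before the full parser, the core idea of separating a record into its header and its lettered versions can be sketched with `re.split`. The sample record below is invented and much simpler than the real file (no provenance notes, no `File:` line), and `split_versions` is a hypothetical name; the parser further down handles the messier cases.

```python
import re

# Matches a version marker on its own line, e.g. '          *** A ***'.
VERSION_MARKER = re.compile(r"^[ \t]*\*\*\* ([A-Z]) \*\*\*[ \t]*$", re.MULTILINE)


def split_versions(record: str):
    """Return (header_lines, {version_letter: body_text}) for one ST record."""
    # With a capturing group, re.split keeps the letters:
    # [header, 'A', body_a, 'B', body_b, ...]
    parts = VERSION_MARKER.split(record)
    header_lines = parts[0].strip().split("\n")
    letters, bodies = parts[1::2], parts[2::2]
    versions = {letter: body.strip() for letter, body in zip(letters, bodies)}
    return header_lines, versions


# Hypothetical miniature record (all text invented).
record = """Example Song, The
Complete text(s)
          *** A ***
First Title
Line one of A
          *** B ***
Line one of B"""

header_lines, versions = split_versions(record)
print(header_lines)
print(versions)
```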


Target: 1136 records | Output: 

In [212]:
with open('./Data/BalladIndex/txt/supptradedited.txt', 'r') as file:
    data = file.read()

In [231]:
import re
import pandas as pd

def parse_lyric_information(data):
    outer_records = data.split("\n===\n")  # split into outer records
    records_list = []

    for record in outer_records:
        outer_lines = record.strip().split('\n')
        if len(outer_lines) < 2:
            continue  # skip 'records' with insufficient lines

        key_name = None
        key_full_part = None
        key_file = None

        inner_records = record.strip().split('          *** ')[1:]  # split into inner records

        for i, line in enumerate(outer_lines):
            if line.startswith("==="):
                if i > 0:
                    break  # Stop looking for key_name and key_full_part after the first record
            elif not key_name:
                key_name = line.strip()
            elif not key_full_part:
                key_full_part = line.strip()
            elif not key_file:
                key_file_match = re.search(r"File: (.+)", line)
                if key_file_match:
                    key_file = key_file_match.group(1).strip()

        for inner_record in inner_records:
            lines = inner_record.strip().split('\n')
            key_version = None
            name = None
            provenance = None
            lyrics = None

            is_in_provenance = False
            provenance_lines = []
            is_in_lyrics = False
            lyrics_lines = []

            for line in lines:
                if not key_version and line.strip() and line.strip()[0].isupper():
                    key_version = line.strip()[0]
                elif not name and line.strip() and not line.strip().startswith("From ") and not line.strip().startswith("Text ") and \
                        not line.strip().startswith("Derived ") and not line.strip().startswith("As printed ") and \
                        not line.strip().startswith("Supplied ") and not line.strip().startswith("Lyrics ") and \
                        not line.strip().startswith("As found in ") and not line.strip().startswith("As recorded ") and \
                        not line.strip().startswith("Also from ") and not line.strip().startswith("Also supplied") and \
                        not line.strip().startswith("Derived from "):
                    if name is None:
                        name = line.strip()
                elif not provenance and re.match(r"^(From |Text |Derived |As printed |Supplied |Lyrics |As found in |As recorded |Also from |Also supplied|Derived from )", line):
                    is_in_provenance = True
                elif not lyrics and not is_in_provenance and not is_in_lyrics and key_version:
                    is_in_lyrics = True

                if is_in_provenance:
                    if line.strip():
                        provenance_lines.append(line.strip())
                    elif not line.strip() and provenance_lines:
                        is_in_provenance = False
                        provenance = "\n".join(provenance_lines)
                        provenance_lines = []
                elif is_in_lyrics:
                    if line.strip() and name is not None and name not in line and not line.startswith('File: '):
                        if line.strip() == "===":
                            is_in_lyrics = False  # Stop capturing lyrics at the demarcating line
                        else:
                            if lyrics_lines and not lyrics_lines[-1].endswith(('.', '?', '!', ',', ';', ':',)):
                                lyrics_lines[-1] += ', ' + line.strip()
                            else:
                                lyrics_lines.append(line.strip())

            if provenance_lines:
                provenance = "\n".join(provenance_lines)

            if name is not None:
                if name != "" and name in lines:
                    name_index = lines.index(name)
                    if name_index == 0 or lines[name_index - 1] == "" and (name_index == len(lines) - 1 or lines[name_index + 1] == ""):
                        name = name.strip()
                    else:
                        name = ""

            # join the collected lyrics lines from the list into a string
            if lyrics_lines:
                lyrics = " ".join(lyrics_lines)

            # append the extracted data to the records list
            records_list.append([key_name, key_full_part, key_file, key_version, provenance, name, lyrics])

    # create a DataFrame from the records list 
    columns = ["key_name", "key_full_part", "key_file", "key_version", "provenance", "name", "lyrics"]
    df = pd.DataFrame(records_list, columns=columns)
    return df

df = parse_lyric_information(data)
df.head(60)


Unnamed: 0,key_name,key_full_part,key_file,key_version,provenance,name,lyrics
0,"A Robin, Jolly Robin",Complete text(s),Perc1185,A,"From Percy/Wheatley, I.ii.4, pp. 186-187",A Robyn Jolly Robyn,"""[F]rom what appears to be the most ancient of..."
1,"A Robin, Jolly Robin",Complete text(s),Perc1185,B,"From Shakespeare, ""Twelfth Night"" Act IV, scen...",(No title),"71 'Hey, Robin, jolly Robin, 72 Tell me how..."
2,"A, U, Hinny Bird",Partial text(s),StoR160,A,"From Stokoe/Reay, Songs and Ballads of Norther...",,"A, U, hinny burd; The bonny lass o' Benwell, A..."
3,Adieu to Erin (The Emigrant),Complete text(s),SWMS255,A,"As found in Gale Huntington, Songs the Whaleme...",Adieu to Erin,"Oh, when I breathed a last adieu, To Erin's an..."
4,"Agincourt Carol, The",Complete text(s),MEL51,A,"From the Bodleian Library (Cambridge), MS. Sel...",The Song of Agincourt,"Deo gracias anglia, Redde pro victoria, 1 Owre..."
5,All Is Well,Partial text(s),FlBr078,A,"From Helen Hartness Flanders & George Brown, V...",,"Oh, what is this that steals upon my frame? Is..."
6,All Night Long (I),Complete text(s),San448,A,"From Carl Sandburg, The American Songbag, pp. ...",,"Paul and Silas, bound in jail, All night long...."
7,All Quiet Along the Potomac Tonight,Complete text(s),RJ19002,A,From sheet music published 1863 by Miller & Be...,All Quiet Along the Potomac To-Night,"""All quiet along the Potomac to-night,"", Excep..."
8,Alone on the Shamrock Shore (Shamrock Shore III),Partial text(s),Pea418,A,"From Kenneth Peacock, Songs of the Newfoundlan...",,"Come all you fair maids take a warning, With a..."
9,Alonzo the Brave and Fair Imogene,Partial text(s),RcAtBaFI,A,"From Kenneth Peacock, Songs of the Newfoundlan...",,"A warrior so bold and a virgin so bright, COnv..."
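
The goal stated under Sources, linking Roud numbers to lyrics, would then amount to joining this DataFrame's `key_file` column to the `file` column of `df_bi`. A minimal sketch with invented stand-in data (the names `bi`, `st`, `joined` and `labelled`, and every value in them, are hypothetical):

```python
import pandas as pd

# Hypothetical stand-ins for df_bi and the ST DataFrame (all values invented).
bi = pd.DataFrame({
    "file": ["X001", "X002", "X003"],
    "roud": ["101", None, "103"],
})
st = pd.DataFrame({
    "key_file": ["X001", "X001", "X002", "X999"],
    "key_version": ["A", "B", "A", "A"],
    "lyrics": ["la la", "la la la", "fa la", "tra la"],
})

# A left join keeps every lyric; indicator=True adds a '_merge' column
# recording whether a matching BI row was found.
joined = st.merge(bi, how="left", left_on="key_file", right_on="file", indicator=True)

# Keep only the lyrics that gained a Roud number.
labelled = joined[joined["roud"].notna()]

print(joined["_merge"].value_counts())
print(labelled[["key_file", "key_version", "roud"]])
```

Using a left join with `indicator=True` rather than the default inner join makes it possible to count how many ST lyrics have no BI match, and how many match a BI record that lacks a Roud number, before discarding them.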


### DT (Mudcat's Digitrad)

The Mudcat Digitrad file available to download is an AskSam MS-DOS database. Although I was not able to open the database, and no other download formats are provided, I was able to access a database file in the ZIP in which the lyrics appear in plain text, but with inconsistent delimiters and many Unicode control characters. I have done my best to extract the data from one of the files using regular expressions. There are still some issues with the data:
- some titles are incorrect 
- some lyrics are incomplete due to titles being recognised in the wrong places 
- some lyrics still include notes on the text which were not easy to separate from the lyrics themselves
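
One way to deal with the control characters before any regular expressions run is to filter on Unicode category. This is a sketch rather than the pre-processing actually applied to `Z02cv4edited.txt`; `strip_control_chars` is a hypothetical helper and the sample string is invented.

```python
import unicodedata


def strip_control_chars(text: str, keep: str = "\n\t") -> str:
    """Drop characters in Unicode category 'Cc' (control), except those in `keep`."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


# Invented sample containing NUL, SUB and BEL control characters.
sample = "THE 1913 MASSACRE\x00\x1a\nTake a trip with me\x07"
print(repr(strip_control_chars(sample)))
```

Because the file is decoded as `latin-1`, bytes in the range 0x80 to 0x9F become C1 control characters, which also fall in category `Cc` and would be removed by the same filter.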

The extracted records are stored in `df_lyrics`:

In [54]:
with open('./Data/Mudcat/Z02cv4edited.txt', 'r', encoding='latin-1') as file:
    data = file.read()

def extract_records_from_text(text):
    # Define function to split records based on title
    def split_records(text):
        return re.split(r'\n(?=[A-Z0-9\s\'\"\?\!\.\,\(\)\[\]\:\;\–\—\-]+[A-Z0-9][A-Z0-9\s\'\"\?\!\.\,\(\)\\[\]:\;\–\—\-]{4,}(?:\n|$))', text)

    # Split text into records
    records = split_records(text)

    # Initialize lists to store extracted data
    filenames = []
    titles = []
    lyrics = []
    keywords = []

    # Iterate over records to extract data
    i = 0
    while i < len(records):
        record = records[i]

        # Find the title section
        title_match = re.search(r'^\s*([A-Z0-9\s\'\"\?\!\.\,\(\)\[\]\:\;\–\—\-]+[A-Z0-9][A-Z0-9\s\'\"\?\!.\,\(\)\\[\]:\;\–\—\-]{4,})\s*$', record, flags=re.MULTILINE)
        if title_match and not re.match(r'^-+$', title_match.group(1)) and '\n' not in title_match.group(1):
            title = title_match.group(1).strip()
        else:
            title = ''
            i += 1
            continue

        # Skip the record whose title is 'OCT98'
        if title == 'OCT98':
            i += 1
            continue

        # Find the keywords section and extract all occurrences of keywords on the same line
        keywords_match = re.search(r'@(.+?)\n', record)
        if keywords_match:
            keywords_line = keywords_match.group(1)
            keywords_list = [keyword.strip('@') for keyword in keywords_line.split() if keyword.strip('@').isalnum()]
        else:
            keywords_list = []

        # Find the lyrics section (everything between title and keywords or filename)
        lyrics_match = re.search(r'(?<=^' + re.escape(title) + r'\n)(.*?)(?=\n@|filename:)', record, flags=re.DOTALL)
        if lyrics_match:
            lyrics_text = lyrics_match.group(1).strip()

            # Don't store the first line of lyrics if it begins and ends with brackets
            first_line_break_idx = lyrics_text.find('\n')
            if first_line_break_idx != -1:
                first_line = lyrics_text[:first_line_break_idx].strip()
                if first_line.startswith('(') and first_line.endswith(')') or first_line == '-Traditional':
                    lyrics_text = lyrics_text[first_line_break_idx+1:].strip()
            
            # Cut off lyrics if the line contains '_________________________'
            lyrics_cutoff_idx = lyrics_text.find('_________________________')
            if lyrics_cutoff_idx != -1:
                lyrics_text = lyrics_text[:lyrics_cutoff_idx].strip()

        else:
            lyrics_text = ''
            i += 1
            continue

        # Find the filename section
        filename_match = re.search(r'filename:\s*(.*)', record)
        if filename_match:
            filename = filename_match.group(1).strip() 
        else:
            filename = ''
            i += 1
            continue

        # Append extracted data to lists
        filenames.append(filename)
        titles.append(title)
        lyrics.append(lyrics_text)
        keywords.append(keywords_list)

        # Move to the next record
        i += 1

    # Create a DataFrame from the extracted data
    df = pd.DataFrame({
        'dt': filenames,
        'title': titles,
        'lyrics': lyrics,
        'keywords': keywords
    })

    return df

df_lyrics = extract_records_from_text(data)
df_lyrics

Unnamed: 0,dt,title,lyrics,keywords
0,HARDTAC,'ARD TAC,"1.I'm a shearer, yes I am, and I've shorn 'em...","[Australia, sheep, shearing, drink]"
1,FISHFRY,(I'VE GOT) BIGGER FISH TO FRY,"Sittin' on the bank of that muddy Mississippi,...","[fishing, food]"
2,JULY12,THE 12TH OF JULY,Come pledge again your heart and your hand\n O...,"[Irish, peace]"
3,AVENUE16,16TH AVENUE,"From the corners of the country, from the citi...",[country]
4,MASS1913,THE 1913 MASSACRE,Take a trip with me in nineteen thirteen\nTo C...,"[union, work, death, Xmas]"
...,...,...,...,...
8244,ZEBTURNY,ZEB TOURNEY'S GIRL,"Down in the Tennessee mountains,\nFar from the...",[feud]
8245,ZEBRADUN,ZEBRA DUN,We was camped on the plains at the head of the...,"[cowboy, animal]"
8246,ZENGOSPE,ZEN GOSPEL SINGING,I once was a Baptist and on each Sunday morn\n...,[religion]
8247,ZULIKA,ZULEIKA,"Zuleika was fair to see,\nA fair Persian maide...","[marriage, infidelity]"


Now I'll store only the Ballad Index entries with both Roud numbers and DT filenames in `df_roud_dt`:

In [16]:
df_roud_dt = df_bi[(~df_bi.dt.isna()) & (~df_bi.roud.isna())]
df_roud_dt

Unnamed: 0,name,description,earliest_date,found_in,keywords,cross_references,roud,file,dt
5,1913 Massacre,"In Calumet, Michigan, striking copper miners a...",1945,US,lie strike death labor-movement mining disaste...,"cf. ""One Morning in May (To Hear the Nightinga...",17663,FSWB306A,MASS1913
48,A-Begging I Will Go,"""Of all the trades in England, The begging is ...",1684,"Britain(England(North,Lond,south),Scotland(Aber))",begging nonballad,"cf. ""Let the Back and Sides Go Bare"" (theme)\n...",286,K217,ABEGGIN
59,A-Rovin',"In this cautionary tale, a sailor meets an Ams...",1876,"Britain(England,Scotland(Aber)) US(MA,NE,So,SW...",bawdy disease sailor warning whore,"cf. ""The Fire Ship"" (plot) and references ther...",649,EM064,AROVIN1 AROVIN2
82,Abdul the Bulbul Emir (I),The heroic Moslem Abdul and the gallant Russia...,1877,US(MW),humorous death foreigner,"cf. ""Abdul the Bulbul Emir (II)"" (tune & meter...",4321,LxA341,ABDULBUL
83,Abdul the Bulbul Emir (II),Abdul the Bulbul Emir and Ivan Stavinsky Stava...,1877,"Australia Canada England New Zealand US(NE,SW)",bawdy parody humorous sex contest homosexuality,"cf. ""Abdul the Bulbul Emir (I)"" (tune & meter)...",4321,EM210,ABDULBL2
...,...,...,...,...,...,...,...,...,...
30115,You Might Easy Know a Doffer,"""You might easy know a doffer"" by her yellow h...",1978,Ireland,sex bragging hair weaving humorous nonballad,,20420,Leyd013,EASYDOFF
30124,You Never Miss the Water till the Well Runs Dry,The singer remembers mother's lessons about ec...,1872,Britain(England(South)),youth money,"cf. ""A Motto for Every Man"" (theme of hard wor...",5457,SRW125,WASTENOT
30386,"Yowe Lamb, The (Ca' the Yowes; Lovely Molly)",Molly agrees to marry Willie if her father con...,1899,"Ireland Britain(Scotland(Aber,Bord)) Canada(Mar)",love marriage father trick,"cf. ""The Waukin' o' the Claes"" (tune, per Grei...",857,K124,CALEWE3
30389,Ythanside,"""As I cam in by Ythanside, Where swiftly flows...",1905,Britain(Scotland(Aber)),love courting marriage,,3783,Ord032,BONYTHAN


In [18]:
df_roud_dt.roud.nunique()

1527

TODO: redo this section, as it excludes variant stubs.

Next I'll add the lyrics by merging the two dataframes on 'dt' filename and storing the result as `df_roud_dt_lyrics`:

In [62]:
df_roud_dt_lyrics = df_roud_dt.merge(df_lyrics, on='dt')
df_roud_dt_lyrics

Unnamed: 0,name,description,earliest_date,found_in,keywords_x,cross_references,roud,file,dt,title,lyrics,keywords_y
0,A-Begging I Will Go,"""Of all the trades in England, The begging is ...",1684,"Britain(England(North,Lond,south),Scotland(Aber))",begging nonballad,"cf. ""Let the Back and Sides Go Bare"" (theme)\n...",286,K217,ABEGGIN,A-BEGGIN' I WILL GO,Of all the trades in England the beggin' is th...,[beggar]
1,Abdul the Bulbul Emir (I),The heroic Moslem Abdul and the gallant Russia...,1877,US(MW),humorous death foreigner,"cf. ""Abdul the Bulbul Emir (II)"" (tune & meter...",4321,LxA341,ABDULBUL,ABDUL ABULBUL AMIR,"The sons of the prophet were hardy and bold,\n...","[Russian, fight, soldier]"
2,Abilene,"""Abilene, Abilene, prettiest town (you) ever s...",1973,,home train nonballad,"cf. ""Ohio River, She's So Deep and Wide"" (floa...",26032,FSWB048,ABILNE,ABILENE,"Abilene, Abilene\nPrettiest town I ever seen.\...","[home, place]"
3,"About the Bush, Willy","""Aboot the bush, Willy, aboot the bee-hive, Ab...",1812,Britain(England(North)),clothes nonballad,,3149,StoR097,BUSHWILI,"ABOUT THE BUSH, WILLY","About the bush, Willy,\nAbout the beehive,\n...",[kids]
4,Across the Western Ocean,"""Oh, the times are hard and the wages low, Ame...",1927,US,emigration poverty hardtimes,"cf. ""Leave Her, Johnny, Leave Her"" (floating l...",8234,San412,WSTOCEAN,ACROSS THE WESTERN OCEAN,Oh the times are hard and the wages low\nAmel...,[sailor]
...,...,...,...,...,...,...,...,...,...,...,...,...
590,"Yellow Rose of Taegu, The",A reluctant soldier meets the Yellow Rose of T...,,US,bawdy sex soldier whore derivative,"cf. ""Yellow Rose of Texas"" (tune)",10405,EM410,YLLOWTX4,THE YELLOW ROSE OF TAEGU,"She's the yellow Rose of Taegu, the girl that ...","[parody, army, America, Korea, bawdy, whore]"
591,You Are My Sunshine,"The singer dreams his ""sunshine"" is in his arm...",1940,US,courting love promise rejection warning nonbal...,,18130,Hopk084A,YOUMYSUN,YOU ARE MY SUNSHINE,"The other night dear, as I lay sleeping\nI dre...",[]
592,You Never Miss the Water till the Well Runs Dry,The singer remembers mother's lessons about ec...,1872,Britain(England(South)),youth money,"cf. ""A Motto for Every Man"" (theme of hard wor...",5457,SRW125,WASTENOT,"WASTE NOT, WANT NOT",(You Never Miss the Water Till the Well Runs D...,[]
593,"Yowe Lamb, The (Ca' the Yowes; Lovely Molly)",Molly agrees to marry Willie if her father con...,1899,"Ireland Britain(Scotland(Aber,Bord)) Canada(Mar)",love marriage father trick,"cf. ""The Waukin' o' the Claes"" (tune, per Grei...",857,K124,CALEWE3,THE YOWE LAMB,"As Molly was milking her yowes one day,\nWilli...",[]


In [64]:
df_roud_dt_lyrics.dt.nunique()

591
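
One possible reason the merge keeps only 591 DT filenames is that some `dt` fields hold several filenames separated by spaces, such as `AROVIN1 AROVIN2` for "A-Rovin'", and that row does not appear in the merged preview above. A sketch of splitting such fields before merging, using invented lyrics text and hypothetical names (`roud_dt`, `lyrics`, `direct`, `full`):

```python
import pandas as pd

# Miniatures of df_roud_dt and df_lyrics; the lyrics text is invented.
roud_dt = pd.DataFrame({
    "roud": ["649", "286"],
    "dt": ["AROVIN1 AROVIN2", "ABEGGIN"],
})
lyrics = pd.DataFrame({
    "dt": ["AROVIN1", "AROVIN2", "ABEGGIN"],
    "lyrics": ["text one", "text two", "text three"],
})

# A direct merge drops the multi-filename row, because the string
# 'AROVIN1 AROVIN2' equals neither 'AROVIN1' nor 'AROVIN2'.
direct = roud_dt.merge(lyrics, on="dt")

# Split on whitespace, then give each filename its own row before merging.
exploded = roud_dt.assign(dt=roud_dt["dt"].str.split()).explode("dt")
full = exploded.merge(lyrics, on="dt")

print(len(direct), len(full))
```

This would not explain every missing match, so it is worth checking the unmatched rows directly (for example with `how='left', indicator=True`) before relying on it.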

## EDA

# Data cleaning and preprocessing

Extract only records with lyrics and number

Cleaning

Transformation?

Tokenisation

# Clustering

Set up model

Tune model

Evaluate clusters

Add features

# Cluster Analysis

# Classification?

# Pipeline?