In [2]:
import pandas as pd

# Read the CSV file
df = pd.read_csv('./bob dylan corpus.csv')

# Show the first few rows to understand the structure
df.head()

Unnamed: 0,release_year,album,title,lyrics
0,1961,"The Bootleg Series, Vol 1-3: Rare & Unreleased...",Hard Times In New York Town,"Come you ladies and you gentlemen, a-listen to..."
1,1961,"The Bootleg Series, Vol 1-3: Rare & Unreleased...",Man on the street,"’ll sing you a song, ain’t very long\n\n’Bout ..."
2,1962,"The Bootleg Series, Vol 1-3: Rare & Unreleased...",Talkin’ Bear Mountain Picnic Massacre Blues,I saw it advertised one day\n\nBear Mountain p...
3,1962,"The Bootleg Series, Vol 1-3: Rare & Unreleased...",Let Me Die in My Footsteps,I will not go down under the ground\n\n’Cause ...
4,1962,"The Bootleg Series, Vol 1-3: Rare & Unreleased...","Rambling, Gambling Willie",Come around you rovin’ gamblers and a story I ...


In [3]:
# Combine all lyrics into a single string
all_lyrics = "\n".join(df['lyrics'].dropna())
# Collapse each pair of newlines into one (a single pass, so longer runs still leave a blank line)
all_lyrics = all_lyrics.replace("\n\n","\n")
# Check the first 500 characters to see a snippet
all_lyrics[:500]

'Come you ladies and you gentlemen, a-listen to my song\nSing it to you right, but you might think it’s wrong\nJust a little glimpse of a story I’ll tell\n’Bout an East Coast city that you all know well\nIt’s hard times in the city\nLivin’ down in New York town\n\nOld New York City is a friendly old town\nFrom Washington Heights to Harlem on down\nThere’s a-mighty many people all millin’ all around\nThey’ll kick you when you’re up and knock you when you’re down\nIt’s hard times in the city\nLivin’ down in Ne'
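
The snippet above still shows blank lines between stanzas, so the raw lyrics evidently contain newline runs longer than a single replace pass removes. If single newlines throughout are wanted, a regular expression can collapse runs of any length. A minimal sketch; the helper name collapse_newlines is hypothetical and not part of the original notebook:

```python
import re

def collapse_newlines(text: str) -> str:
    """Replace every run of two or more newlines with a single newline."""
    return re.sub(r"\n{2,}", "\n", text)

# Toy input with a run of four newlines and a run of two
sample = "line one\n\n\n\nline two\n\nline three"
print(repr(collapse_newlines(sample)))
```

Whether to collapse at all is a judgment call: keeping one blank line between stanzas may be preferable if the model should learn verse boundaries.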

In [4]:
# Write as UTF-8 so the curly apostrophes in the lyrics survive on any platform
with open("all_lyrics.txt", "w", encoding="utf-8") as text_file:
    text_file.write(all_lyrics)

In [5]:
# Required Imports
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling
from transformers import TrainingArguments, Trainer

# Load a pre-trained GPT-2 model and tokenizer
model_name = "gpt2-medium"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Tokenize the lyrics and prepare dataset
# (TextDataset is deprecated in recent transformers releases but still works here)
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="./all_lyrics.txt",  # Lyrics file written in the previous cell
    block_size=128
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

# Train
trainer.train()

***** Running training *****
  Num examples = 1268
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 40
  Number of trainable parameters = 354823168
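
The counts in this log follow from simple arithmetic. Assuming TextDataset tokenizes the whole file and cuts the token stream into non-overlapping blocks of block_size tokens, dropping a short final remainder, the Trainer then needs one optimization step per batch. A rough sketch of both steps on toy data; chunk_tokens and steps_per_epoch are illustrative names, not transformers APIs:

```python
import math

def chunk_tokens(token_ids, block_size):
    """Split a token list into full, non-overlapping blocks; drop the short tail."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)

# Ten toy tokens in blocks of four: two full blocks, the tail [8, 9] is dropped
print(chunk_tokens(list(range(10)), block_size=4))

# 1268 examples at batch size 32, as reported in the log above
print(steps_per_epoch(1268, 32))
```

With 1268 examples and a batch size of 32, the last batch holds only 20 examples, which is why 39.625 rounds up to the 40 steps reported.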


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=40, training_loss=3.0125118255615235, metrics={'train_runtime': 557.9066, 'train_samples_per_second': 2.273, 'train_steps_per_second': 0.072, 'total_flos': 294398120165376.0, 'train_loss': 3.0125118255615235, 'epoch': 1.0})
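
The reported numbers can be cross-checked by hand. For a causal language model trained with cross-entropy, exp(loss) gives a perplexity; note that this is the average training loss over the epoch, not a held-out evaluation, so it says little about generalization. A small sketch using only the values printed above:

```python
import math

# Values copied from the TrainOutput and the training log above
train_loss = 3.0125118255615235
train_runtime = 557.9066  # seconds
num_examples = 1268
global_step = 40

# Perplexity is the exponential of the mean cross-entropy loss
perplexity = math.exp(train_loss)

# Throughput figures, to compare against the reported metrics
samples_per_second = num_examples / train_runtime
steps_per_second = global_step / train_runtime

print(f"perplexity ~ {perplexity:.2f}")
print(f"samples/s  = {samples_per_second:.3f}")
print(f"steps/s    = {steps_per_second:.3f}")
```

The two throughput values should agree with train_samples_per_second and train_steps_per_second in the metrics dictionary.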

In [6]:
# Generate text
input_text = "In a cosmic sort of way"
# Calling the tokenizer returns both input_ids and the attention mask;
# move them to the same device as the fine-tuned model
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

# Sample five continuations; GPT-2 has no pad token, so EOS is reused for padding
output = model.generate(
    **inputs, max_length=100, num_return_sequences=5, temperature=0.9, do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

for i, text in enumerate(output):
    print(f"Generated Text {i+1}: {tokenizer.decode(text)}")
    print()


Generated Text 1: In a cosmic sort of way
Well, I'll just make sure everybody's just fine
They'll be fine just about anything
They can make the world a better place too
Well, I'll just make sure everybody's just fine

They'll be fine just about anything
They can make the world a better place too
The only thing I know
This is me and I don't know
I feel really bad
Well, I'll just make sure everybody's just fine


Generated Text 2: In a cosmic sort of way, I don’t know, what are you thinking’t doing here?
There’s no one here to help
I’m the one who has taken the place of all
And I’m the one who was born on the bottom of the world
The sun just set, it was just kind of a sad thing
What did you do, you wonder, that you’d put up that wall and you made

Generated Text 3: In a cosmic sort of way, maybe I can keep everything the same, but I can't keep everything the same
I could always say I can go to heaven, but I can't bring myself to believe
Can't be the same for me, can't be the same for you