<a href="https://colab.research.google.com/github/rouyu0405/NLP/blob/main/Nov.21.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [14]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Task
Fine-tune a GPT2 model for emotional and topical lyric generation using the `labeled_lyrics_cleaned.csv` dataset from `/content/drive/MyDrive/IAT Courses/IAT 360/360_Project/NLP_Dataset/`.

## Load Dataset

### Subtask:
Load the 'labeled_lyrics_cleaned.csv' dataset from '/content/drive/MyDrive/IAT Courses/IAT 360/360_Project/NLP_Dataset/' into a pandas DataFrame.


**Reasoning**:
To load the dataset, I need to import the pandas library and then use its `read_csv` function to load the specified file from Google Drive into a DataFrame.



In [15]:
import pandas as pd

file_path = '/content/drive/MyDrive/IAT Courses/IAT 360/360_Project/NLP_Dataset/labeled_lyrics_cleaned.csv'
df = pd.read_csv(file_path)

print("Dataset loaded successfully. Displaying the first 5 rows:")
print(df.head())

Dataset loaded successfully. Displaying the first 5 rows:
   Unnamed: 0        artist  \
0           0  Elijah Blake   
1           1  Elijah Blake   
2           2  Elijah Blake   
3           3  Elijah Blake   
4           4  Elijah Blake   

                                                 seq                song  \
0  No, no\r\nI ain't ever trapped out the bando\r...            Everyday   
1  The drinks go down and smoke goes up, I feel m...    Live Till We Die   
2  She don't live on planet Earth no more\r\nShe ...       The Otherside   
3  Trippin' off that Grigio, mobbin', lights low\...               Pinot   
4  I see a midnight panther, so gallant and so br...  Shadows & Diamonds   

   label  
0  0.626  
1  0.630  
2  0.240  
3  0.536  
4  0.371  
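
The `Unnamed: 0` column in the output above looks like a row index that was written into the CSV when it was saved. A minimal sketch of one way to avoid carrying it around, assuming that is indeed its origin: pass `index_col=0` to `read_csv`. The tiny in-memory CSV below is a hypothetical stand-in for the real file, with values copied from the first rows shown above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV: the first header field is empty,
# which is what makes pandas name the column "Unnamed: 0".
csv_text = (
    ",artist,seq,song,label\n"
    "0,Elijah Blake,\"No, no\",Everyday,0.626\n"
    "1,Elijah Blake,The drinks go down,Live Till We Die,0.630\n"
)

# index_col=0 uses the unnamed first column as the DataFrame index
# instead of exposing it as a data column.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample_df.columns.tolist())
```

With the real file, the same argument would be added to the existing `pd.read_csv(file_path)` call.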


## Inspect and Preprocess Data

### Subtask:
Inspect the loaded dataset to understand its structure and content, then preprocess the text for GPT2 fine-tuning. This includes handling any missing values or inconsistencies in the lyrics (`seq`) and emotion (`label`) columns, tokenizing the text, and formatting it into a suitable input format for the model.


**Reasoning**:
I will start by displaying the DataFrame's information to check data types and non-null counts, followed by descriptive statistics and then missing values, as these are crucial initial inspection steps to understand the dataset's structure and completeness.



In [16]:
print("DataFrame Info:")
df.info()

print("\nDescriptive Statistics:")
print(df.describe())

print("\nMissing values per column:")
print(df.isnull().sum())

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158353 entries, 0 to 158352
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Unnamed: 0  158353 non-null  int64  
 1   artist      158353 non-null  object 
 2   seq         158353 non-null  object 
 3   song        158351 non-null  object 
 4   label       158353 non-null  float64
dtypes: float64(1), int64(1), object(3)
memory usage: 6.0+ MB

Descriptive Statistics:
          Unnamed: 0          label
count  158353.000000  158353.000000
mean    79176.000000       0.491052
std     45712.717926       0.249619
min         0.000000       0.000000
25%     39588.000000       0.286000
50%     79176.000000       0.483000
75%    118764.000000       0.691000
max    158352.000000       0.998000

Missing values per column:
Unnamed: 0    0
artist        0
seq           0
song          2
label         0
dtype: int64


**Reasoning**:
Based on the previous output, the 'song' column has 2 missing values, while 'seq' and 'label' have none. To further inspect the data as per the subtask, I will examine the unique values and their counts for 'artist', 'song', and 'label' to understand their distribution and potential inconsistencies.



In [17]:
print("\nUnique values and their counts for 'artist' (top 10):")
print(df['artist'].value_counts().head(10))

print("\nUnique values and their counts for 'song' (top 10):")
print(df['song'].value_counts().head(10))

print("\nUnique values and their counts for 'label' (top 10):")
print(df['label'].value_counts().head(10))


Unique values and their counts for 'artist' (top 10):
artist
Elvis Presley      821
Frank Sinatra      779
Ella Fitzgerald    751
Johnny Cash        598
Willie Nelson      587
George Jones       509
Neil Diamond       434
Elvis Costello     411
Prince             391
Dean Martin        380
Name: count, dtype: int64

Unique values and their counts for 'song' (top 10):
song
Have Yourself a Merry Little Christmas    162
Over the Rainbow                          161
Intro                                     154
Body and Soul                             152
St. Louis Blues                           132
You Don't Know What Love Is               130
Stardust                                  120
My Foolish Heart                          115
Embraceable You                           114
The Nearness of You                       110
Name: count, dtype: int64

Unique values and their counts for 'label' (top 10):
label
0.961    403
0.962    341
0.963    301
0.964    276
0.356    260
0.347    254
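
Because `label` is a continuous score between 0 and 1, counting exact values says little about its overall shape. A rough sketch of an alternative view, assuming three equal-width buckets are a reasonable first cut (the bucket edges and the names `low`, `mid`, `high` are illustrative choices, not part of the dataset):

```python
import pandas as pd

# The five label values shown in the first rows of the dataset.
scores = pd.Series([0.626, 0.630, 0.240, 0.536, 0.371], name="label")

# Assumed bucket edges: three equal-width ranges over [0, 1].
bins = [0.0, 1 / 3, 2 / 3, 1.0]
names = ["low", "mid", "high"]

# include_lowest=True keeps a score of exactly 0.0 inside the first bucket.
buckets = pd.cut(scores, bins=bins, labels=names, include_lowest=True)

print(buckets.value_counts().sort_index())
```

On the full DataFrame the same call would take `df['label']` in place of `scores`.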


**Reasoning**:
Since 'seq' and 'label' columns have no missing values, I will proceed to rename 'seq' to 'lyrics' and 'label' to 'emotion' to align with the task requirements and improve readability. After renaming, I will display the first few rows to confirm the changes.



In [18]:
print("No missing values in 'seq' or 'label' columns, as previously confirmed.")

df = df.rename(columns={'seq': 'lyrics', 'label': 'emotion'})

print("\nColumns renamed. Displaying the first 5 rows with new column names:")
print(df.head())

No missing values in 'seq' or 'label' columns, as previously confirmed.

Columns renamed. Displaying the first 5 rows with new column names:
   Unnamed: 0        artist  \
0           0  Elijah Blake   
1           1  Elijah Blake   
2           2  Elijah Blake   
3           3  Elijah Blake   
4           4  Elijah Blake   

                                              lyrics                song  \
0  No, no\r\nI ain't ever trapped out the bando\r...            Everyday   
1  The drinks go down and smoke goes up, I feel m...    Live Till We Die   
2  She don't live on planet Earth no more\r\nShe ...       The Otherside   
3  Trippin' off that Grigio, mobbin', lights low\...               Pinot   
4  I see a midnight panther, so gallant and so br...  Shadows & Diamonds   

   emotion  
0    0.626  
1    0.630  
2    0.240  
3    0.536  
4    0.371  


**Reasoning**:
Only 2 of the 158,353 rows are missing a value in the 'song' column, so dropping those rows is a simple fix with a negligible effect on the dataset. This ensures the data is complete before proceeding with text preprocessing.



In [19]:
df.dropna(subset=['song'], inplace=True)
print("Rows with missing values in 'song' column have been dropped.")
print("Missing values after dropping:")
print(df.isnull().sum())

Rows with missing values in 'song' column have been dropped.
Missing values after dropping:
Unnamed: 0    0
artist        0
lyrics        0
song          0
emotion       0
dtype: int64


**Reasoning**:
To preprocess the text data for GPT2 fine-tuning, the first step is to install the Hugging Face Transformers library, which provides the necessary tools for tokenization and model handling.



In [20]:
%pip install transformers datasets



**Reasoning**:
Now that the `transformers` library is installed, I will load the pre-trained GPT2 tokenizer to begin preparing the text data for fine-tuning. I will use the `gpt2` model as a base for tokenization.



In [21]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

# GPT2 does not have a padding token by default, which is needed for batch processing
# We will set the eos_token as the pad_token for consistency during padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("GPT2 tokenizer loaded successfully and pad token set.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


GPT2 tokenizer loaded successfully and pad token set.
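
Reusing the end-of-text token as the padding token has a side effect worth keeping in mind. `DataCollatorForLanguageModeling` with `mlm=False` copies `input_ids` into `labels` and replaces every position equal to the pad token id with `-100`, which the loss ignores. When pad and EOS share an id, genuine EOS tokens are masked as well. The pure-Python sketch below imitates that masking rule on toy token ids; `pad_and_mask` is a hypothetical helper written for illustration, not a library function.

```python
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss
EOS_ID = 50256        # GPT-2's <|endoftext|> token id
PAD_ID = EOS_ID       # same id, because pad_token was set to eos_token


def pad_and_mask(sequences, pad_id=PAD_ID):
    """Right-pad token id lists and build labels the way the collator's masking rule does."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, labels = [], []
    for seq in sequences:
        padded = seq + [pad_id] * (max_len - len(seq))
        input_ids.append(padded)
        # Every position holding the pad id is excluded from the loss,
        # including a real EOS token that happens to share that id.
        labels.append([IGNORE_INDEX if tok == pad_id else tok for tok in padded])
    return input_ids, labels


# The first toy sequence ends with a real EOS token; the second is shorter and gets padded.
batch_ids, batch_labels = pad_and_mask([[10, 11, 12, EOS_ID], [20, 21]])
print(batch_ids)
print(batch_labels)
```

The first sequence's real EOS is masked along with the padding in the second, so a model trained this way gets no direct signal about where a lyric ends. Whether that matters depends on how generation is stopped later.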


**Reasoning**:
The tokenizer was successfully loaded in the previous step. The next logical step in preparing the data for GPT2 fine-tuning is to tokenize the 'lyrics' and 'emotion' columns and format them into a suitable input structure for the model.



In [22]:
from datasets import Dataset

# Convert DataFrame to Hugging Face Dataset
hf_dataset = Dataset.from_pandas(df)

# Function to combine emotion and lyrics, then tokenize
def tokenize_function(examples):
    # Format the emotion score into a string prefix for the lyrics
    # The format 'Emotion: X.XXX Lyrics: ' is common for conditional generation
    emotion_prefixes = [f"Emotion: {e:.3f} Lyrics: " for e in examples["emotion"]]
    combined_texts = [prefix + lyric for prefix, lyric in zip(emotion_prefixes, examples["lyrics"])]

    # Tokenize the combined text. truncation=True cuts texts longer than max_length tokens.
    # max_length=512 is half of GPT-2's 1024-token context window; adjust it based on the lyric length distribution.
    tokenized_inputs = tokenizer(combined_texts, truncation=True, max_length=512)
    return tokenized_inputs

# Apply the tokenization function to the dataset
tokenized_dataset = hf_dataset.map(tokenize_function, batched=True, num_proc=4)

print("Dataset tokenized successfully.")
print("First tokenized input:")
print(tokenized_dataset[0])

Map (num_proc=4):   0%|          | 0/158351 [00:00<?, ? examples/s]

Dataset tokenized successfully.
First tokenized input:
{'Unnamed: 0': 0, 'artist': 'Elijah Blake', 'lyrics': "No, no\r\nI ain't ever trapped out the bando\r\nBut oh Lord, don't get me wrong\r\nI know a couple niggas that do\r\nI'm from a place where everybody knows your name\r\nThey say I gotta watch my attitude\r\nWhen they see money, man they all start actin' strange\r\nSo fuck with the ones that fuck with you\r\nThey can never say I'm brand new\r\n\r\nIt's everyday, everyday\r\nEveryday, everyday, everyday\r\nEveryday, everyday\r\nEveryday, everyday\r\nI've been talkin' my shit, nigga that's regular\r\nI've been lovin' 'em thick, life is spectacular\r\nI spend like I'ma die rich, nigga I'm flexin', yeah\r\nEveryday, that's everyday\r\nThat's everyday\r\nThat's everyday\r\nThat's everyday, everyday\r\n\r\nI see all of these wanna-be hot R&B singers\r\nI swear you all sound the same\r\nThey start from the bottom, so far from the motto\r\nYou niggas'll never be Drake\r\nShout out to OV
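
The text that reaches the tokenizer is easier to check in isolation. The sketch below rebuilds the same `Emotion: X.XXX Lyrics: ` prefix as `tokenize_function` in a small standalone helper. The helper name `build_training_text` and the optional newline normalization are assumptions added for illustration; the original cell passes the `\r\n` sequences through unchanged.

```python
def build_training_text(emotion, lyrics, normalize_newlines=True):
    """Prepend the formatted emotion score to the lyrics, as tokenize_function does."""
    if normalize_newlines:
        # Assumption: Windows-style line endings carry no useful signal,
        # so they are collapsed to plain "\n" before tokenization.
        lyrics = lyrics.replace("\r\n", "\n")
    return f"Emotion: {emotion:.3f} Lyrics: {lyrics}"


sample = build_training_text(0.626, "No, no\r\nI ain't ever trapped out the bando")
print(repr(sample))
```

The same prefix, for example `Emotion: 0.900 Lyrics: `, could later serve as the prompt when sampling from the fine-tuned model.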

## Configure GPT2 Model for Fine-tuning

### Subtask:
Set up the GPT2 model and its tokenizer from the Hugging Face Transformers library. Configure training arguments such as batch size, learning rate, and the number of epochs specifically for the lyric generation task.


**Reasoning**:
To configure the GPT2 model for fine-tuning, I need to load the pre-trained model, define the training arguments, and set up the data collator for language modeling, which are all essential components for using the Hugging Face Trainer.



In [23]:
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer, DataCollatorForLanguageModeling

# 1. Load a pre-trained GPT2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')
print("GPT2LMHeadModel loaded successfully.")

# 2. Define Training Arguments
training_args = TrainingArguments(
    output_dir='./results',              # output directory
    num_train_epochs=3,                  # total number of training epochs
    per_device_train_batch_size=8,       # batch size per device during training
    learning_rate=5e-5,                  # learning rate
    save_steps=500,                      # number of update steps between checkpoint saves
    logging_steps=500,                   # log every 500 update steps
    save_total_limit=2,                  # keep at most 2 checkpoints on disk
    prediction_loss_only=True,           # return only the loss during evaluation and prediction
)
print("TrainingArguments defined successfully.")

# 3. Initialize Data Collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False      # mlm=False for Causal Language Modeling (GPT-2)
)
print("DataCollatorForLanguageModeling initialized successfully.")

print("Model, TrainingArguments, and DataCollator are ready for fine-tuning.")

GPT2LMHeadModel loaded successfully.
TrainingArguments defined successfully.
DataCollatorForLanguageModeling initialized successfully.
Model, TrainingArguments, and DataCollator are ready for fine-tuning.
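
To get a feel for what these arguments imply, the number of optimizer steps can be estimated by hand. The sketch below assumes a single device and no gradient accumulation (neither is set explicitly in the cell above) and uses the 158,351 rows left after the two incomplete rows were dropped.

```python
import math

num_examples = 158_351          # rows remaining after dropping the 2 incomplete ones
per_device_batch_size = 8       # per_device_train_batch_size
num_epochs = 3                  # num_train_epochs
save_every = 500                # save_steps

num_devices = 1                 # assumption: one GPU
grad_accum_steps = 1            # assumption: no gradient accumulation

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
# The last, smaller batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
checkpoints_written = total_steps // save_every

print(f"steps per epoch:     {steps_per_epoch}")
print(f"total steps:         {total_steps}")
print(f"checkpoints written: {checkpoints_written} (only the last 2 are kept)")
```

Under these assumptions the run takes close to 60,000 steps, which helps explain why a session can be lost before training finishes.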


## Fine-tune GPT2 Model

### Subtask:
Train the GPT2 model on the preprocessed lyrics data using the configured training arguments.


**Reasoning**:
To fine-tune the GPT2 model, I need to initialize the `Trainer` object with the model, training arguments, data collator, and the tokenized dataset, and then call the `train()` method to begin the training process.



In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

print("Trainer initialized. Starting model training...")

trainer.train()

print("Model training completed.")

Trainer initialized. Starting model training...


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: elizabeth4-four (elizabeth4-four-simon-fraser-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


**Reasoning**:
The previous execution paused due to a `wandb` login prompt. To avoid this interruption and allow training to proceed without interactive input, I will disable `wandb` reporting in the `TrainingArguments` by setting `report_to="none"`.



In [1]:
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer, DataCollatorForLanguageModeling

# 1. Load a pre-trained GPT2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')
print("GPT2LMHeadModel loaded successfully.")

# 2. Define Training Arguments
training_args = TrainingArguments(
    output_dir='./results',              # output directory
    num_train_epochs=3,                  # total number of training epochs
    per_device_train_batch_size=8,       # batch size per device during training
    learning_rate=5e-5,                  # learning rate
    save_steps=500,                      # number of update steps between checkpoint saves
    logging_steps=500,                   # log every 500 update steps
    save_total_limit=2,                  # keep at most 2 checkpoints on disk
    prediction_loss_only=True,           # return only the loss during evaluation and prediction
    report_to="none"                     # Disable W&B logging to prevent interactive login prompt
)
print("TrainingArguments defined successfully.")

# 3. Initialize Data Collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False      # mlm=False for Causal Language Modeling (GPT-2)
)
print("DataCollatorForLanguageModeling initialized successfully.")

print("Model, TrainingArguments, and DataCollator are ready for fine-tuning.")


GPT2LMHeadModel loaded successfully.
TrainingArguments defined successfully.


NameError: name 'tokenizer' is not defined

**Reasoning**:
The previous cell failed because `tokenizer` was not defined when `DataCollatorForLanguageModeling` was initialized. The execution counter restarting at `In [1]` suggests the runtime was reset, which clears every variable created in the earlier cells. I need to re-initialize the tokenizer before passing it to the data collator.



In [2]:
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer, DataCollatorForLanguageModeling, AutoTokenizer

# Re-initialize tokenizer as it was not defined in the current scope
tokenizer = AutoTokenizer.from_pretrained('gpt2')

# GPT2 does not have a padding token by default, which is needed for batch processing
# We will set the eos_token as the pad_token for consistency during padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
print("GPT2 tokenizer re-loaded successfully and pad token set.")

# 1. Load a pre-trained GPT2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')
print("GPT2LMHeadModel loaded successfully.")

# 2. Define Training Arguments
training_args = TrainingArguments(
    output_dir='./results',              # output directory
    num_train_epochs=3,                  # total number of training epochs
    per_device_train_batch_size=8,       # batch size per device during training
    learning_rate=5e-5,                  # learning rate
    save_steps=500,                      # number of update steps between checkpoint saves
    logging_steps=500,                   # log every 500 update steps
    save_total_limit=2,                  # keep at most 2 checkpoints on disk
    prediction_loss_only=True,           # return only the loss during evaluation and prediction
    report_to="none"                     # Disable W&B logging to prevent interactive login prompt
)
print("TrainingArguments defined successfully.")

# 3. Initialize Data Collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False      # mlm=False for Causal Language Modeling (GPT-2)
)
print("DataCollatorForLanguageModeling initialized successfully.")

print("Model, TrainingArguments, and DataCollator are ready for fine-tuning.")

GPT2 tokenizer re-loaded successfully and pad token set.
GPT2LMHeadModel loaded successfully.
TrainingArguments defined successfully.
DataCollatorForLanguageModeling initialized successfully.
Model, TrainingArguments, and DataCollator are ready for fine-tuning.


**Reasoning**:
The model, tokenizer, training arguments, and data collator are now correctly initialized. The next step is to re-run the training process using the `Trainer` object to fine-tune the GPT2 model on the preprocessed lyrics data, ensuring `wandb` logging is disabled.



In [3]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset
)

print("Trainer initialized. Starting model training...")

trainer.train()

print("Model training completed.")

NameError: name 'tokenized_dataset' is not defined

**Reasoning**:
The previous attempt to initialize the Trainer failed because `tokenized_dataset` was not in scope. I need to re-run the dataset tokenization step to ensure `tokenized_dataset` is available before initializing the `Trainer` and starting the fine-tuning process.



In [4]:
from datasets import Dataset

# Convert DataFrame to Hugging Face Dataset (assuming df is still available)
hf_dataset = Dataset.from_pandas(df)

# Function to combine emotion and lyrics, then tokenize
def tokenize_function(examples):
    emotion_prefixes = [f"Emotion: {e:.3f} Lyrics: " for e in examples["emotion"]]
    combined_texts = [prefix + lyric for prefix, lyric in zip(emotion_prefixes, examples["lyrics"])]
    tokenized_inputs = tokenizer(combined_texts, truncation=True, max_length=512)
    return tokenized_inputs

# Apply the tokenization function to the dataset
tokenized_dataset = hf_dataset.map(tokenize_function, batched=True, num_proc=4)

print("Dataset re-tokenized successfully.")

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset
)

print("Trainer initialized. Starting model training...")

trainer.train()

print("Model training completed.")

NameError: name 'df' is not defined

**Reasoning**:
The previous attempt to tokenize the dataset failed because the `df` DataFrame was not defined in the current execution context. I need to re-load the dataset from the CSV, apply the necessary renames and drops, and then proceed with tokenization and model training. I will combine all these steps in a single cell to ensure all variables are in scope.
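
One way to keep the reload, rename, and drop steps together is a small function that can be called again after any runtime reset. This is a sketch under stated assumptions: `load_lyrics_frame` is a hypothetical name, the `reset_index` call is an addition that keeps row positions contiguous after the drop, and the in-memory CSV stands in for the real file on Drive.

```python
import io
import pandas as pd


def load_lyrics_frame(source):
    """Load the lyrics CSV, rename the columns, and drop rows without a song title."""
    frame = pd.read_csv(source)
    frame = frame.rename(columns={"seq": "lyrics", "label": "emotion"})
    # Returning a new frame avoids in-place mutation; reset_index keeps positions contiguous.
    frame = frame.dropna(subset=["song"]).reset_index(drop=True)
    return frame


# Hypothetical two-row CSV; the second row has no song title and should be dropped.
demo_csv = io.StringIO(
    "artist,seq,song,label\n"
    "A,first line,Song One,0.5\n"
    "B,second line,,0.7\n"
)
demo = load_lyrics_frame(demo_csv)
print(demo)
```

With the real data, `load_lyrics_frame(file_path)` would take the Drive path used in the cells above.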



In [None]:
import pandas as pd
from datasets import Dataset
from transformers import Trainer # Already imported in a previous cell, but re-importing for clarity in this combined block

# Re-load DataFrame (df was not defined in the previous execution context)
file_path = '/content/drive/MyDrive/IAT Courses/IAT 360/360_Project/NLP_Dataset/labeled_lyrics_cleaned.csv'
df = pd.read_csv(file_path)

# Re-apply column renames and handle missing values (as done in previous successful steps)
df = df.rename(columns={'seq': 'lyrics', 'label': 'emotion'})
df.dropna(subset=['song'], inplace=True)
print("DataFrame re-loaded, columns renamed, and missing 'song' values dropped.")

# Convert DataFrame to Hugging Face Dataset
hf_dataset = Dataset.from_pandas(df)

# Function to combine emotion and lyrics, then tokenize
def tokenize_function(examples):
    emotion_prefixes = [f"Emotion: {e:.3f} Lyrics: " for e in examples["emotion"]]
    combined_texts = [prefix + lyric for prefix, lyric in zip(emotion_prefixes, examples["lyrics"])]
    tokenized_inputs = tokenizer(combined_texts, truncation=True, max_length=512)
    return tokenized_inputs

# Apply the tokenization function to the dataset
tokenized_dataset = hf_dataset.map(tokenize_function, batched=True, num_proc=4)

print("Dataset re-tokenized successfully.")

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset
)

print("Trainer initialized. Starting model training...")

trainer.train()

print("Model training completed.")

DataFrame re-loaded, columns renamed, and missing 'song' values dropped.


Map (num_proc=4):   0%|          | 0/158351 [00:00<?, ? examples/s]

Dataset re-tokenized successfully.
Trainer initialized. Starting model training...


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


**Reasoning**:
The previous steps successfully set up the model, tokenizer, data collator, and training arguments, and prepared the `tokenized_dataset`. Now, to fine-tune the GPT2 model, I will initialize the `Trainer` object with these components and then call the `train()` method to begin the training process, ensuring `wandb` logging is disabled as previously configured.



In [1]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset
)

print("Trainer initialized. Starting model training...")

trainer.train()

print("Model training completed.")

NameError: name 'Trainer' is not defined

**Reasoning**:
The previous code block failed because `Trainer` was not defined in the current execution context. I need to explicitly import the `Trainer` class from the `transformers` library to make it available before initializing the `Trainer` object and starting the fine-tuning process.



In [2]:
from transformers import Trainer # Explicitly import Trainer to ensure it's in scope

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset
)

print("Trainer initialized. Starting model training...")

trainer.train()

print("Model training completed.")

NameError: name 'model' is not defined

**Reasoning**:
The `NameError: name 'model' is not defined` occurred because none of `model`, `training_args`, `data_collator`, or `tokenized_dataset` exists in the current execution context. To make sure every component is available and correctly configured, I need to re-run the full setup in a single block: re-load the data, re-initialize the tokenizer, model, training arguments, and data collator, tokenize the dataset, and only then initialize and run the `Trainer`.
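
The string of `NameError`s above surfaced one missing object at a time. A quick check before building the `Trainer` can report all of them at once. The sketch below is a hypothetical helper, not part of any library; it only looks names up in a namespace dictionary such as `globals()`.

```python
def missing_names(required, namespace):
    """Return the required variable names that are absent from the namespace dict."""
    return [name for name in required if name not in namespace]


# Objects the training cell depends on, taken from the cells above.
REQUIRED = ["df", "tokenizer", "model", "training_args", "data_collator", "tokenized_dataset"]

missing = missing_names(REQUIRED, globals())
if missing:
    print("Re-run the setup cells first; undefined:", ", ".join(missing))
else:
    print("All required objects are defined; safe to build the Trainer.")
```

Running this at the top of the training cell turns a chain of separate failures into one message.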



In [None]:
import pandas as pd
from datasets import Dataset
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer, DataCollatorForLanguageModeling, AutoTokenizer

# Re-load DataFrame (df was not defined in previous execution contexts)
file_path = '/content/drive/MyDrive/IAT Courses/IAT 360/360_Project/NLP_Dataset/labeled_lyrics_cleaned.csv'
df = pd.read_csv(file_path)

# Re-apply column renames and handle missing values
df = df.rename(columns={'seq': 'lyrics', 'label': 'emotion'})
df.dropna(subset=['song'], inplace=True)
print("DataFrame re-loaded, columns renamed, and missing 'song' values dropped.")

# Re-initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
print("GPT2 tokenizer re-loaded successfully and pad token set.")

# Re-initialize model
model = GPT2LMHeadModel.from_pretrained('gpt2')
print("GPT2LMHeadModel re-loaded successfully.")

# Define Training Arguments (with wandb disabled)
training_args = TrainingArguments(
    output_dir='./results',              # output directory
    num_train_epochs=3,                  # total number of training epochs
    per_device_train_batch_size=8,       # batch size per device during training
    learning_rate=5e-5,                  # learning rate
    save_steps=500,                      # number of update steps between checkpoint saves
    logging_steps=500,                   # log every 500 update steps
    save_total_limit=2,                  # keep at most 2 checkpoints on disk
    prediction_loss_only=True,           # return only the loss during evaluation and prediction
    report_to="none"                     # Disable W&B logging to prevent interactive login prompt
)
print("TrainingArguments defined successfully.")

# Initialize Data Collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False      # mlm=False for Causal Language Modeling (GPT-2)
)
print("DataCollatorForLanguageModeling initialized successfully.")

# Convert DataFrame to Hugging Face Dataset and tokenize
hf_dataset = Dataset.from_pandas(df)
def tokenize_function(examples):
    emotion_prefixes = [f"Emotion: {e:.3f} Lyrics: " for e in examples["emotion"]]
    combined_texts = [prefix + lyric for prefix, lyric in zip(emotion_prefixes, examples["lyrics"])]
    tokenized_inputs = tokenizer(combined_texts, truncation=True, max_length=512)
    return tokenized_inputs
tokenized_dataset = hf_dataset.map(tokenize_function, batched=True, num_proc=4)
print("Dataset re-tokenized successfully.")

# Initialize Trainer and start training
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset
)

print("Trainer initialized. Starting model training...")

trainer.train()

print("Model training completed.")

DataFrame re-loaded, columns renamed, and missing 'song' values dropped.



GPT2 tokenizer re-loaded successfully and pad token set.
GPT2LMHeadModel re-loaded successfully.
TrainingArguments defined successfully.
DataCollatorForLanguageModeling initialized successfully.


Map (num_proc=4):   0%|          | 0/158351 [00:00<?, ? examples/s]

Process ForkPoolWorker-3:
Process ForkPoolWorker-4:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.12/dist-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.12/dist-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/multiprocess/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/usr/local/lib/python3.12/dist-packages/multiprocess/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/usr/local/lib/python3.12/dist-packages/multiprocess/queues.py", line 385, in get
    res = self._reader.recv_bytes()
  