# Dataset Loader: Genius Song Lyrics (Hugging Face)

**Data Source:** https://huggingface.co/datasets/sebastiandizon/genius-song-lyrics

**Description:** The original Genius Song Lyrics dataset is a ~9 GB CSV file containing roughly 5.1 million songs (the 1% slice loaded below has 51,349 rows). To enable lightweight experimentation, this script loads a smaller subset (e.g., 1%) and saves it locally as CSV. The subset is the leading slice of the `train` split, not a random sample.

---

# 1. Preparations
## 1.1 Import Packages

In [94]:
from datasets import load_dataset
import pandas as pd
import os

## 1.2 Configuration

* Define the subset percentage to load (e.g., 1%, 5%, 10%). Note: despite its name, `subset_fraction` holds a whole-number percentage, because the Hugging Face split-slicing notation only accepts integer percentages.
* Define output directory and file name.
* Create directory if it doesn't exist.

In [95]:
subset_fraction = 1

output_dir = "data"
output_path = f"{output_dir}/lyrics_subset_{subset_fraction}pct.csv"

os.makedirs(output_dir, exist_ok=True)
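
The integer-percentage constraint can be checked before the split string is built. The following is a minimal sketch, not part of the original notebook; the helper name `build_split` is hypothetical, and the accepted range of 1 to 100 is an assumption chosen for this use case.

```python
def build_split(percent: int, split: str = "train") -> str:
    """Return a slice string such as 'train[:1%]' for an integer percentage.

    `build_split` is an illustrative helper, not a Hugging Face API.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(percent, bool) or not isinstance(percent, int):
        raise TypeError(f"percent must be an integer, got {percent!r}")
    # Assumed valid range for a leading-slice percentage.
    if not 1 <= percent <= 100:
        raise ValueError(f"percent must be between 1 and 100, got {percent}")
    return f"{split}[:{percent}%]"


print(build_split(1))   # train[:1%]
print(build_split(10))  # train[:10%]
```

The returned string can be passed as the `split` argument of `load_dataset` in place of the inline f-string.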

# 2. Load and Save Dataset
## 2.1 Load Subset of Dataset from Hugging Face

In [96]:
print(f"Loading Subset: {subset_fraction}% of total dataset\n")

dataset = load_dataset("sebastiandizon/genius-song-lyrics", split=f"train[:{subset_fraction}%]")

print(f"Dataset loaded successfully with {len(dataset):,} entries.")

Loading Subset: 1% of total dataset

Dataset loaded successfully with 51,349 entries.
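
The slice `train[:1%]` returns the first 1% of rows in stored order, so it is not a random sample. If a random subset is wanted, one option is to draw row indices reproducibly and pass them to `Dataset.select`. The sketch below only computes the indices with the standard library; the helper name `random_subset_indices` and the default seed are assumptions for illustration, and the approach presumes the full `train` split has been loaded first.

```python
import random


def random_subset_indices(n_rows: int, percent: int, seed: int = 42) -> list[int]:
    """Pick `percent`% of the row indices in [0, n_rows) without replacement."""
    sample_size = n_rows * percent // 100
    rng = random.Random(seed)  # local generator, so the global state is untouched
    # Sorting keeps the selected rows in their original order.
    return sorted(rng.sample(range(n_rows), sample_size))


indices = random_subset_indices(n_rows=1000, percent=1)
print(len(indices))  # 10

# Hypothetical use with a fully loaded split (not executed here):
# full = load_dataset("sebastiandizon/genius-song-lyrics", split="train")
# subset = full.select(random_subset_indices(len(full), subset_fraction))
```

This trades the convenience of the slice notation for a sample that is not biased toward the beginning of the file.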


## 2.2 Convert to pandas DataFrame

In [97]:
df = dataset.to_pandas()

print(f"DataFrame shape: {df.shape}")
print(f"Number of Songs: {len(df):,} | Artists: {df['artist'].nunique():,} | Genres: {df['tag'].nunique():,}")

DataFrame shape: (51349, 11)
Number of Songs: 51,349 | Artists: 5,333 | Genres: 6


## 2.3 Save Subset Locally

In [98]:
df.to_csv(output_path, index=False)

print(f"Subset saved to: {output_path}")

Subset saved to: data/lyrics_subset_1pct.csv
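
Because the `lyrics` column contains line breaks and quotation marks, one record in the saved CSV can span many physical lines, so the file should be read back with a CSV parser rather than split line by line. The sketch below illustrates this with the standard-library `csv` module and a made-up row; it assumes the default minimal quoting, which is also the default of `DataFrame.to_csv`.

```python
import csv
import io

# A made-up record with embedded newlines and quotes, similar to the lyrics column.
rows = [{"title": "Example Song", "lyrics": '[Intro]\nSo they ask me\n"Young boy"'}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "lyrics"])
writer.writeheader()
writer.writerows(rows)

# Count physical lines in the CSV text (header + one multi-line record).
physical_lines = buffer.getvalue().count("\n")

# Parse the same text back; the quoted field is reassembled into one value.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

print(f"physical lines: {physical_lines}, records: {len(restored)}")
print(restored[0]["lyrics"] == rows[0]["lyrics"])
```

The same reasoning explains why counting lines in the saved file will not match the number of songs.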


# 3. Preview of Dataset

In [99]:
df.head()

|   | title | tag | artist | year | views | features | lyrics | id | language_cld3 | language_ft | language |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Killa Cam | rap | Cam'ron | 2004 | 173166 | `{"Cam\\'ron","Opera Steve"}` | `[Chorus: Opera Steve & Cam'ron]\nKilla Cam, Ki...` | 1 | en | en | en |
| 1 | Can I Live | rap | JAY-Z | 1996 | 468624 | `{}` | `[Produced by Irv Gotti]\n\n[Intro]\nYeah, hah,...` | 3 | en | en | en |
| 2 | Forgive Me Father | rap | Fabolous | 2003 | 4743 | `{}` | `Maybe cause I'm eatin\nAnd these bastards fien...` | 4 | en | en | en |
| 3 | Down and Out | rap | Cam'ron | 2004 | 144404 | `{"Cam\\'ron","Kanye West","Syleena Johnson"}` | `[Produced by Kanye West and Brian Miller]\n\n[...` | 5 | en | en | en |
| 4 | Fly In | rap | Lil Wayne | 2005 | 78271 | `{}` | `[Intro]\nSo they ask me\n"Young boy\nWhat you ...` | 6 | en | en | en |
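
In the preview, the `features` column is stored as a brace-delimited string (for example `{}` or a comma-separated list of quoted names) rather than as a Python list. The following sketch converts such a string into a list of names. The format is inferred only from the rows shown in the preview, so the parsing rules, including the removal of backslashes before apostrophes, are assumptions, and `parse_features` is a hypothetical helper.

```python
import csv
import re


def parse_features(raw: str) -> list[str]:
    """Turn a string like '{"A","B"}' into ['A', 'B']; '{}' gives []."""
    inner = raw.strip()
    # Drop the surrounding braces if present.
    if inner.startswith("{") and inner.endswith("}"):
        inner = inner[1:-1]
    if not inner:
        return []
    # The csv module handles quoted names that may themselves contain commas.
    names = next(csv.reader([inner], quotechar='"', skipinitialspace=True))
    # Assumed escaping: one or more backslashes directly before an apostrophe.
    return [re.sub(r"\\+'", "'", name) for name in names]


print(parse_features("{}"))
print(parse_features(r'''{"Cam\\'ron","Opera Steve"}'''))
```

Applied to the DataFrame, this could be used as `df["features"].apply(parse_features)`, provided the column follows the same pattern throughout the dataset.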
