<a href="https://colab.research.google.com/github/sfs0126/Lyric-Generator-fine-tuned-GPT-2/blob/main/Data_Preprocessing_SS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lyric Generation for Different Music Genres
## Fine-Tuning GPT-2 and Evaluating with Perplexity
### Text Generation of Popular Music Lyrics

##### Data Preprocessing

In [248]:
!pip install transformers



In [249]:
pip install git+https://github.com/huggingface/transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-j006h3bo
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-j006h3bo
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done


In [251]:
import os
import time
import datetime
from google.colab import drive

import torch
from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler
torch.manual_seed(42)

from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import AdamW, get_linear_schedule_with_warmup

import pandas as pd
from sklearn.model_selection import train_test_split
import re

In [252]:
gpu_info = !nvidia-smi -L
gpu_info = "\n".join(gpu_info)
if "failed" in gpu_info:
    print("Not connected to a GPU")
else:
    print(gpu_info)

GPU 0: Tesla K80 (UUID: GPU-50e33cfd-9889-7e91-eab9-7292a39a0b64)
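
The `!nvidia-smi` syntax in the cell above only works inside IPython or Colab. Outside a notebook, roughly the same check could be sketched with `subprocess`. The helper name `list_gpus` is hypothetical and is not part of the original notebook.

```python
import shutil
import subprocess


def list_gpus():
    """Return the output of `nvidia-smi -L`, or None if no GPU is reachable.

    Hypothetical helper, not from the original notebook.
    """
    # nvidia-smi is absent on machines without the NVIDIA driver installed.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    # A non-zero exit code means the tool could not talk to a GPU.
    if result.returncode != 0:
        return None
    return result.stdout.strip()


gpu_info = list_gpus()
print(gpu_info if gpu_info else "Not connected to a GPU")
```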


### Data Preprocessing

In [253]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [254]:
lyrics = pd.read_csv('/content/drive/MyDrive/Topics in Computing Notebooks/Data/lyrics-data.csv')
artists = pd.read_csv('/content/drive/MyDrive/Topics in Computing Notebooks/Data/artists-data.csv')

In [255]:
lyrics.head()

# Keep only the songs tagged as English.
lyrics = lyrics[lyrics['Idiom']=='ENGLISH']
lyrics['Idiom'].value_counts()

ENGLISH    114723
Name: Idiom, dtype: int64

In [256]:
artists.head()

# Keep Rock, Pop, and Hip Hop artists with a popularity score above 5.
artists = artists[(artists['Genre'].isin(['Rock', 'Pop', 'Hip Hop'])) & (artists['Popularity']>5)]
artists['Genre'].value_counts()

Pop        81
Rock       76
Hip Hop    21
Name: Genre, dtype: int64

In [257]:
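# Attach artist name and genre to each song by joining on the artist link.
# The inner join drops songs whose artist is not in the filtered artists table.
full_df = lyrics.merge(artists[['Artist', 'Genre', 'Link']], left_on='ALink', right_on='Link', how='inner')
# Drop the link and language columns, which are no longer needed after the join.
full_df = full_df.drop(columns=['ALink','SLink','Idiom','Link'])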

#full_df = full_df[full_df['Lyric'].apply(lambda x: len(x.split(' ')) < 350)]

full_df.head()

| | SName | Lyric | Artist | Genre |
|---|---|---|---|---|
| 0 | What's Up | Twenty-five years and my life is still. Trying... | 4 Non Blondes | Rock |
| 1 | Spaceman | Starry night bring me down. Till I realize the... | 4 Non Blondes | Rock |
| 2 | Pleasantly Blue | Every time you wake in the mornin'. And you st... | 4 Non Blondes | Rock |
| 3 | Train | What ya gonna do child. When your thoughts are... | 4 Non Blondes | Rock |
| 4 | Calling All The People | How can you tell, when your wellness is not we... | 4 Non Blondes | Rock |


##### Split full data set by genre

In [258]:
rock_df = full_df[full_df['Genre'] == 'Rock']
rock_df.head()

| | SName | Lyric | Artist | Genre |
|---|---|---|---|---|
| 0 | What's Up | Twenty-five years and my life is still. Trying... | 4 Non Blondes | Rock |
| 1 | Spaceman | Starry night bring me down. Till I realize the... | 4 Non Blondes | Rock |
| 2 | Pleasantly Blue | Every time you wake in the mornin'. And you st... | 4 Non Blondes | Rock |
| 3 | Train | What ya gonna do child. When your thoughts are... | 4 Non Blondes | Rock |
| 4 | Calling All The People | How can you tell, when your wellness is not we... | 4 Non Blondes | Rock |


In [259]:
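# Hold out 10% of the songs for testing.
train_test_ratio = 0.9
# Keep 7/9 of the remaining 90% for training, which gives roughly
# 70% train, 20% validation, and 10% test overall.
train_valid_ratio = 7/9
rock_train_full, rock_test = train_test_split(rock_df, train_size = train_test_ratio, random_state = 1)
rock_train, rock_val = train_test_split(rock_train_full, train_size = train_valid_ratio, random_state = 1)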
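
The two ratios are applied one after the other, so the overall share of each split is a product rather than the ratio itself. A small check of that arithmetic, using exact fractions instead of the floats in the cell above (actual row counts will differ slightly because they are rounded to whole rows):

```python
from fractions import Fraction

train_test_ratio = Fraction(9, 10)   # train+validation share of all rows
train_valid_ratio = Fraction(7, 9)   # train share of the train+validation rows

train_share = train_test_ratio * train_valid_ratio
valid_share = train_test_ratio * (1 - train_valid_ratio)
test_share = 1 - train_test_ratio

print(train_share, valid_share, test_share)  # 7/10 1/5 1/10
```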

In [260]:
pop_df = full_df[full_df['Genre'] == 'Pop']
pop_df.head()

| | SName | Lyric | Artist | Genre |
|---|---|---|---|---|
| 4925 | Careless Whisper | I feel so unsure. As I take your hand and lead... | George Michael | Pop |
| 4927 | Freedom '90 | I won't let you down. I will not give you up. ... | George Michael | Pop |
| 4929 | One More Try | I've had enough of danger. And people on the s... | George Michael | Pop |
| 4931 | Father Figure | That's all I wanted. Something special, someth... | George Michael | Pop |
| 4933 | Heal The Pain | Let me tell you a secret. Put it in your heart... | George Michael | Pop |


In [261]:
pop_train_full, pop_test = train_test_split(pop_df, train_size = train_test_ratio, random_state = 1)
pop_train, pop_val = train_test_split(pop_train_full, train_size = train_valid_ratio, random_state = 1)

In [262]:
hiphop_df = full_df[full_df['Genre'] == 'Hip Hop']
hiphop_df.head()

| | SName | Lyric | Artist | Genre |
|---|---|---|---|---|
| 15257 | In da Club | Go, go, go, go. Go, go, go shawty. It's your b... | 50 Cent | Hip Hop |
| 15258 | 21 Questions | (50 Cent). New York City!. You are now rapping... | 50 Cent | Hip Hop |
| 15259 | P.I.M.P. | [Chorus]. I don't know what you heard about me... | 50 Cent | Hip Hop |
| 15260 | Candy Shop | Yeah.... Uh huh. So seductive. I'll take you t... | 50 Cent | Hip Hop |
| 15261 | Just A Lil Bit | Yeah... Shady... Aftermath... G-Unit. Damn bab... | 50 Cent | Hip Hop |


In [263]:
hiphop_train_full, hiphop_test = train_test_split(hiphop_df, train_size = train_test_ratio, random_state = 1)
hiphop_train, hiphop_val = train_test_split(hiphop_train_full, train_size = train_valid_ratio, random_state = 1)

In [264]:
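def build_dataset(df, dest_path):
    """Write one lyric per line to dest_path, wrapped in <BOS>/<EOS> markers."""
    bos_token = '<BOS>'
    eos_token = '<EOS>'
    lines = []
    for lyric in df['Lyric'].tolist():
        # Replace every whitespace character (newlines, tabs) with a space
        # so that each song occupies exactly one line of the output file.
        lyric = re.sub(r"\s", " ", str(lyric).strip())
        lines.append(bos_token + ' ' + lyric + ' ' + eos_token + '\n')

    # The context manager closes the file, so all data is flushed to disk.
    with open(dest_path, 'w', encoding='utf-8') as f:
        f.write(''.join(lines))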
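
To see what a single line of the output file looks like, the per-lyric formatting can be pulled out into a small standalone function. This is only an illustration: `format_lyric` is a hypothetical name, not part of the notebook, and the sample strings are made up.

```python
import re


def format_lyric(lyric, bos_token="<BOS>", eos_token="<EOS>"):
    """Format one lyric the way build_dataset does (hypothetical helper)."""
    lyric = str(lyric).strip()
    # Each whitespace character becomes a single space; runs are not collapsed.
    lyric = re.sub(r"\s", " ", lyric)
    return bos_token + " " + lyric + " " + eos_token + "\n"


print(repr(format_lyric("Go, go, go\nshawty\n\n")))
# '<BOS> Go, go, go shawty <EOS>\n'
print(repr(format_lyric("line one\n\nline two")))
# '<BOS> line one  line two <EOS>\n'
```

Because `r"\s"` matches one character at a time, a blank line inside a lyric leaves two consecutive spaces. Also note that `str()` turns a missing value (`NaN`) into the literal text `nan`, so rows without lyrics are worth checking before writing the files.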

In [266]:
build_dataset(rock_train, '/content/drive/MyDrive/Topics in Computing Notebooks/Data/rock_train.txt')
build_dataset(rock_val, '/content/drive/MyDrive/Topics in Computing Notebooks/Data/rock_valid.txt')
build_dataset(rock_test, '/content/drive/MyDrive/Topics in Computing Notebooks/Data/rock_test.txt')

In [267]:
build_dataset(pop_train, '/content/drive/MyDrive/Topics in Computing Notebooks/Data/pop_train.txt')
build_dataset(pop_val, '/content/drive/MyDrive/Topics in Computing Notebooks/Data/pop_valid.txt')
build_dataset(pop_test, '/content/drive/MyDrive/Topics in Computing Notebooks/Data/pop_test.txt')

In [268]:
build_dataset(hiphop_train, '/content/drive/MyDrive/Topics in Computing Notebooks/Data/hiphop_train.txt')
build_dataset(hiphop_val, '/content/drive/MyDrive/Topics in Computing Notebooks/Data/hiphop_valid.txt')
build_dataset(hiphop_test, '/content/drive/MyDrive/Topics in Computing Notebooks/Data/hiphop_test.txt')
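
The nine calls above follow one naming pattern, `<genre>_<split>.txt`, with the validation split saved as `valid`. As a sketch of how the paths could be generated instead of typed out, assuming the same Drive folder as above (`DATA_DIR`, `GENRES`, `SPLITS`, and `dataset_path` are hypothetical names, not part of the notebook):

```python
# Assumed layout, copied from the paths used in the cells above.
DATA_DIR = "/content/drive/MyDrive/Topics in Computing Notebooks/Data"
GENRES = ["rock", "pop", "hiphop"]
SPLITS = ["train", "valid", "test"]


def dataset_path(genre, split, data_dir=DATA_DIR):
    """Return the text-file path for one genre and split (hypothetical helper)."""
    return f"{data_dir}/{genre}_{split}.txt"


# One path per genre and split, in the same order as the calls above.
paths = [dataset_path(genre, split) for genre in GENRES for split in SPLITS]
for path in paths:
    print(path)
```

Generating the paths in one place makes a typo in a single call, such as a misspelled variable or file name, easier to spot.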