<a href="https://www.kaggle.com/code/tommyadams/taylor-swift-transformer-language-model?scriptVersionId=141677538" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# <center> Taylor Swift Transformer Language Model </center>

<center><img src='https://media.glamour.com/photos/59df542c017bd0228acd8003/4:3/w_2455,h_1841,c_limit/TAYLOR-SWIFT.jpg' height=600px width=600px>
<img src='https://www.hearai.pl/post/13-slt/image5.png' height=600px width=600px></center>

## Project Summary

Since the publication of the 2017 paper "Attention Is All You Need", transformer language models have proven very effective across a variety of natural language processing tasks, such as machine translation, text summarization, and question answering, and they underpin chatbots such as ChatGPT. A transformer is a type of neural network known for its ability to learn long-range dependencies between words, which makes it well suited to tasks that require understanding the context of a sentence or paragraph.

In this project, I will build a transformer language model in PyTorch to generate lyrics in the style of Taylor Swift. I will do this by training the model on a dataset of lyrics from Taylor Swift's songs. The model will learn the patterns of Taylor Swift's writing style, and it will be able to use this knowledge to generate new lyrics that are similar to hers.

The project will be divided into the following steps:

- Collect a dataset of lyrics from Taylor Swift's songs.
- Prepare the dataset for training.
- Build the transformer language model.
- Train the model.
- Evaluate the model's performance.
- Generate new lyrics using the model.

## Model Architecture

My model architecture will be based on the paper "Attention Is All You Need", which can be seen in the image above. However, there will be some differences between my model and the model in the paper:

- My model has no encoder and no cross-attention, since I am not translating from one language to another as in the paper. As a result, each transformer block performs only masked self-attention.
- I apply the normalization before the multi-head attention and feed forward portions of each block, as opposed to afterwards as in the paper. My hope was that normalizing the inputs to these layers would keep gradients better behaved and make training more stable.
- The paper uses fixed sine and cosine positional encodings, which its authors suggested might help the model generalize to context lengths longer than those it was trained on. I chose a learned embedding indexed by a simple ascending numbering of the positions, since my model is much smaller and I do not expect comprehension of long context lengths to be a limiting factor.
- Because I am limited by the computational power of my CPU, I chose a model with fewer layers, smaller embedding groups, and smaller training batch sizes. I also chose to have the model predict the next character rather than the next word, which reduces the vocabulary of the model and its computational cost.
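
To make the positional-embedding difference concrete, here is a minimal sketch comparing the paper's fixed sine/cosine table with a learned position embedding of the kind used in this notebook. The function name `sinusoidal_positions` and the sizes (30 positions, 60 dimensions) are illustrative assumptions, not part of the model code below.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(context_len: int, embed_dim: int) -> torch.Tensor:
    """Fixed sine/cosine position table as described in "Attention Is All You Need".

    Hypothetical helper for illustration; assumes embed_dim is even.
    """
    positions = torch.arange(context_len, dtype=torch.float32).unsqueeze(1)  # (context_len, 1)
    # Frequencies 1 / 10000^(2i / embed_dim), one per pair of dimensions
    freqs = torch.exp(
        torch.arange(0, embed_dim, 2, dtype=torch.float32) * (-math.log(10000.0) / embed_dim)
    )
    table = torch.zeros(context_len, embed_dim)
    table[:, 0::2] = torch.sin(positions * freqs)  # even dimensions use sine
    table[:, 1::2] = torch.cos(positions * freqs)  # odd dimensions use cosine
    return table

# Fixed table: no trainable parameters, defined for any position
fixed_table = sinusoidal_positions(context_len=30, embed_dim=60)

# Learned alternative: one trainable vector per position index 0..29
learned_table = nn.Embedding(30, 60)
learned_positions = learned_table(torch.arange(30))

print(fixed_table.shape, learned_positions.shape)
```

The learned table only has rows for the positions it was created with, which is why generation later in the notebook trims the context to the most recent characters.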

In [1]:
import pandas as pd
import numpy as np
import re
import torch
import torch.nn as nn
import torch.nn.functional as F
import glob

In [2]:
# Importing the csv files and concatenating the data
# Data set is from https://www.kaggle.com/datasets/thespacefreak/taylor-swift-song-lyrics-all-albums
path = '/kaggle/input/taylor-swift-song-lyrics-all-albums/'
csv_files = glob.glob(path + "/*.csv")
df_list = (pd.read_csv(i) for i in csv_files)
df = pd.concat(df_list, ignore_index=True)
lyrics = '\n'.join(df.loc[:,'lyric']) 

In [3]:
print(lyrics[:500])

Knew he was a killer first time that I saw him
Wondered how many girls he had loved and left haunted
But if he's a ghost, then I can be a phantom
Holdin' him for ransom, some
Some boys are tryin' too hard, he don't try at all though
Younger than my exes, but he act like such a man, so
I see nothing better, I keep him forever
Like a vendetta-ta
I, I, I see how this is gon' go
Touch me and you'll never be alone
I-Island breeze and lights down low
No one has to know
In the middle of the night, in m


In [4]:
# List of all unique characters
' '.join(sorted(set(lyrics)))

'\n   ! " & \' ( ) , - . 0 1 2 3 4 5 6 7 8 9 : ; ? A B C D E F G H I J K L M N O P Q R S T U V W X Y [ ] a b c d e f g h i j k l m n o p q r s t u v w x y z | \xa0 é í ï ó е \u2005 \u200b – — ‘ ’ ” … \u205f'

In [5]:
# Cleaning the file by removing/replacing unnecessary characters and removing sections that are not lyrics
replace_with_space = ['\u2005', '\u200b', '\u205f', '\xa0', '-']
replace_letters = {'í':'i', 'é':'e', 'ï':'i', 'ó':'o', ';':',', '‘':'\'', '’':'\'', ':':',', 'е':'e'} 
remove_list = [r'\)', r'\(', '–', '"', '”', r'\[.*\]', r'.*\|.*', '—'] # Raw strings keep the regex escapes intact

cleaned_lyrics = lyrics

for old, new in replace_letters.items():
    cleaned_lyrics = cleaned_lyrics.replace(old, new)
for string in remove_list:
    cleaned_lyrics = re.sub(string,'',cleaned_lyrics)
for string in replace_with_space:
    cleaned_lyrics = re.sub(string,' ',cleaned_lyrics)
print(''.join(sorted(set(cleaned_lyrics))))


 !',.0123456789?ABCDEFGHIJKLMNOPQRSTUVWXYabcdefghijklmnopqrstuvwxyz…


In [6]:
print(len(lyrics), len(cleaned_lyrics))

296805 293121


In [7]:
# Creating an encoder and decoder to convert each character (char) to a number to feed into the model
vocab = sorted(set(cleaned_lyrics))
int_to_char = {i: ch for i, ch in enumerate(vocab)}
char_to_int = {ch: i for i, ch in enumerate(vocab)}
encoder = lambda string: [char_to_int[ch] for ch in string]
decoder = lambda indices: ''.join(int_to_char[i] for i in indices)

print(decoder(encoder("She's cheer captain")))

She's cheer captain


In [8]:
# Setting aside 90% of the data for training and holding out the last 10% for testing, so that overfitting to the training data can be detected
lyric_tensor = torch.tensor(encoder(cleaned_lyrics), dtype=torch.long)
split_point = int(len(lyric_tensor)*0.9)
train = lyric_tensor[:split_point]
test = lyric_tensor[split_point:]

In [9]:
# Creating a basic language model with only an embedding matrix. This model only references the most recent char to generate the next char
class BasicModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.char_embeddings = nn.Embedding(vocab_size, vocab_size)

    def forward(self, context):
        predictions = self.char_embeddings(context)
        return predictions

    def generate(self, context, length):
        for i in range(length):
            predictions = self(context)
            predictions = predictions[-1, :] # Only referencing most recent char
            probabilities = F.softmax(predictions, dim=-1) # Normalize across the embedding dimension (aka vocab_size) so that they all add up to 1.00
            next_char = torch.multinomial(probabilities, num_samples=1) # Samples randomly from the prob distribution of the embedding dimension
            context = torch.cat((context, next_char))
        return context

# Selecting a random batch of text
torch.manual_seed(400)
vocab_size = len(set(cleaned_lyrics))
batch_size = 20
index = torch.randint(low=0, high=len(train) - batch_size, size=(1,))
context = train[index:(index+batch_size)]

# Feeding the batch (context) into the model and asking it to generate text following it 
model = BasicModel(vocab_size)
predictions = model(context)
print(' prediction dimensions:', predictions.shape) # Should have dimensions (batch_size, vocab_size)
print('\n context input:', decoder(context.tolist())) # Context input
print('\n context + response:', decoder( model.generate(context, length=30).tolist()))

 prediction dimensions: torch.Size([20, 69])

 context input: away
I should say, h

 context + response: away
I should say, h.JH7trIIIL
 pn?TL00l4T0HaMBML



In [10]:
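# Adding ability for model to process multiple sequences at once to improve training efficiency, and adding targets to measure loss
# Note on naming: batch_size is the number of chars in each sequence (the context length) and batches is the number of sequences drawn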
def create_batches(data, batch_size, batches):
    index = torch.randint(low=0, high=len(data) - batch_size, size=(batches,))
    context = torch.stack([data[row:(row+batch_size)] for row in index])
    target = torch.stack([data[row+1:(row+batch_size+1)] for row in index]) # Target is just the context shifted one char to the right
    return context, target

create_batches(train, 10, 5)

(tensor([[60,  1, 61, 49, 46,  1, 59, 46, 42, 60],
         [56, 61,  1, 43, 42, 45,  1, 43, 53, 56],
         [59, 46,  0, 41, 56, 62,  1, 62, 55, 45],
         [50, 60, 50, 43, 53, 46,  1, 60, 61, 59],
         [ 0, 41, 56, 62,  1, 52, 55, 46, 64,  1]]),
 tensor([[ 1, 61, 49, 46,  1, 59, 46, 42, 60, 56],
         [61,  1, 43, 42, 45,  1, 43, 53, 56, 56],
         [46,  0, 41, 56, 62,  1, 62, 55, 45, 46],
         [60, 50, 43, 53, 46,  1, 60, 61, 59, 50],
         [41, 56, 62,  1, 52, 55, 46, 64,  1, 66]]))

In [11]:
# Expanding the basic model so that it can handle multiple batches at the same time. Also, added ability for model to calculate the loss function (cross_entropy).
class BatchModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.char_embeddings = nn.Embedding(vocab_size, vocab_size)

    def forward(self, context, target=None):
        predictions = self.char_embeddings(context) # Context is the context output of create_batches and has dimensions (batches, batch_size)

        if target is None:
            loss = None
        else:
            # Resizing the shapes of predictions and target to meet requirements for cross_entropy loss function
            A, B, C = predictions.shape
            predictions = predictions.view(A * B, C)
            target = target.view(A * B)
            loss = F.cross_entropy(predictions, target)
        
        return predictions, loss

    def generate(self, context, length):
        for i in range(length):
            predictions, loss = self(context)
            predictions = predictions[:, -1, :] # Only referencing most recent char
            probabilities = F.softmax(predictions, dim=-1) # Scale data across the embedding dimension of vocab_size so that they all add up to 1.00
            next_char = torch.multinomial(probabilities, num_samples=1) # Samples randomly from the prob distribution of the embedding dimension
            context = torch.cat((context, next_char), dim=1)
        return context

vocab_size = len(set(cleaned_lyrics))
context, target = create_batches(train, batch_size=20, batches=2)
model = BatchModel(vocab_size)
predictions, loss = model(context, target)
output = model.generate(context, length=30)
print(' prediction dimensions:', predictions.shape) # Should have dimensions (batches * batch_size, vocab_size)
print('loss:', loss)
print('\n batch1:\n', decoder(output[0].tolist()))
print('\n batch2:\n', decoder(output[1].tolist()))

 prediction dimensions: torch.Size([40, 69])
loss: tensor(4.8738, grad_fn=<NllLossBackward0>)

 batch1:
 at you'll never findrJL
Un.YH7NO3f6iAy6aY?yREsE0Q4

 batch2:
 e if I want to try aCMS5z!oYkH j1z.5FuKnflpr5slfH 
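
As a sanity check on the initial loss above: a model that assigned equal probability to all 69 characters would score a cross-entropy of ln(69), about 4.23. The untrained model's 4.87 is higher, presumably because `nn.Embedding` initializes its weights from a standard normal distribution, so the starting logits are random rather than uniform. A minimal sketch of that arithmetic, where the all-zero logits and random targets are made up for illustration:

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 69  # matches the prediction dimensions printed above

# Expected cross-entropy when every character is equally likely
uniform_loss = math.log(vocab_size)

# All-zero logits give a uniform distribution after softmax,
# so the loss is the same whatever the targets are
logits = torch.zeros(40, vocab_size)
targets = torch.randint(0, vocab_size, (40,))
checked_loss = F.cross_entropy(logits, targets).item()

print(uniform_loss, checked_loss)
```

Any trained model should end up well below this baseline, which the training runs later in the notebook do.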


In [12]:
# Using an optimizer to train the model. 
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
for epoch in range(10):
    for step in range(100):
        context, target = create_batches(train, batch_size=20, batches=30)
        predictions, loss = model(context, target)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
    print(loss.item())

print('\n context + response:\n', decoder( model.generate(context, length=150)[0].tolist()))

3.553962469100952
2.957204580307007
2.612980365753174
2.465458393096924
2.4347734451293945
2.439598321914673
2.463512659072876
2.4112160205841064
2.344435930252075
2.381838321685791

 context + response:
  look what you made Sooqure
And witld ck, au'me clndasalit weVm8ryolyoftrcesickn's ake hath meto5Kay I anet widn apSou t'3d youromyooheadOowing ly wis y ysh
So I'th tit, 


In [13]:
# Creating a head of attention so that the model can see past chars when predicting the next char
class Attention(nn.Module):

    def __init__(self, batch_size, embed_groups, head_groups):
        super().__init__()
        self.query = nn.Linear(embed_groups, head_groups, bias=False) # Bias is omitted, as is common for the query, key, and value projections
        self.key = nn.Linear(embed_groups, head_groups, bias=False)
        self.value = nn.Linear(embed_groups, head_groups, bias=False)
        self.register_buffer('mask', torch.tril(torch.ones(batch_size, batch_size))) 

    def forward(self, x):
        A,B,C = x.shape
        query = self.query(x)
        key = self.key(x)   

        # This code results in dimensions of (A, B, B), which maps each char to each other char in the context. 
        # The scores are multiplied by C**-0.5 (C = embed_groups) to prevent the softmax layer from sharpening too much in response to high dot product values.
        # Note: the paper scales by the key dimension instead, which would be head_groups here.
        att = query @ key.transpose(-1,-2) * C**-0.5 

        # Apply mask so that each char cannot 'see' future chars that come after it
        att = att.masked_fill(self.mask[:B, :B] == 0, float('-inf')) 

        # Softmax turns each row of scores into weights between 0 and 1 that sum to 1
        att = F.softmax(att, dim=-1)
        
        value = self.value(x)
        output = att @ value # Results in dimensions of (A, B, head_groups)
        return output

example = Attention(30, 60, 20)
sample = torch.rand(size=(15, 30, 60)) # (batches, batch_size, embed_groups); renamed to avoid shadowing the built-in input()
print(example(sample).shape)


torch.Size([15, 30, 20])
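
To see what the mask and softmax steps do inside the attention head, here is a minimal sketch on a made-up 4-character context. The random scores are an assumption standing in for `query @ key.transpose(-1,-2)`; only the masking and normalization mirror the class above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(4, 4)           # stand-in attention scores for a 4-char context
mask = torch.tril(torch.ones(4, 4))  # lower triangular: position t may see positions <= t

# Future positions get -inf so that softmax assigns them exactly zero weight
masked = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(masked, dim=-1)

print(weights)
```

Each row of `weights` sums to 1, the upper triangle is zero, and the first row puts all of its weight on the first character because it has nothing earlier to attend to.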


In [14]:
# Creating multiple heads of attention, adding feed forward linear layers, stacking multiple transformers

# Separated out the general parameters to make them easier to adjust
vocab_size = len(set(cleaned_lyrics))
batches = 15 
batch_size = 30 
embed_groups = 60
num_heads = 5
head_groups = embed_groups // num_heads
layers = 6  # Number of transformers
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'

def create_batches(data, batch_size, batches):
    index = torch.randint(low=0, high=len(data) - batch_size, size=(batches,))
    context = torch.stack([data[row:(row+batch_size)] for row in index])
    target = torch.stack([data[row+1:(row+batch_size+1)] for row in index]) 
    context, target = context.to(device), target.to(device) # Added ability to run on cuda
    return context, target

class Attention(nn.Module):

    def __init__(self):
        super().__init__()
        self.query = nn.Linear(embed_groups, head_groups, bias=False)
        self.key = nn.Linear(embed_groups, head_groups, bias=False)
        self.value = nn.Linear(embed_groups, head_groups, bias=False)
        self.register_buffer('mask', torch.tril(torch.ones(batch_size, batch_size)))

    def forward(self, x):
        A,B,C = x.shape
        query = self.query(x)
        key = self.key(x)   

        att = query @ key.transpose(-1,-2) * C**-0.5 
        att = att.masked_fill(self.mask[:B, :B] == 0, float('-inf')) 
        att = F.softmax(att, dim=-1)
        
        value = self.value(x)
        output = att @ value
        return output

class MultipleAttention(nn.Module):

    def __init__(self):
        super().__init__()
        self.att_heads = nn.ModuleList([Attention() for i in range(num_heads)])
        self.att_reader = nn.Linear(embed_groups, embed_groups)

    def forward(self, x):
        combined_att = torch.cat([i(x) for i in self.att_heads], dim=-1)
        output = self.att_reader(combined_att)
        return output

class FeedForward(nn.Module):

    def __init__(self):
        super().__init__()
        # ReLU is a non-linear function, which allows the model to learn more complex relationships
        self.ff_network = nn.Sequential(
            nn.Linear(embed_groups, 5 * embed_groups),
            nn.ReLU(),
            nn.Linear(5 * embed_groups, embed_groups))
    
    def forward(self, x):
        return self.ff_network(x)

class Transformer(nn.Module):

    def __init__(self):
        super().__init__()
        self.matt = MultipleAttention()
        self.ff = FeedForward()
        self.norm1 = nn.LayerNorm(embed_groups) # Normalization applied before attention
        self.norm2 = nn.LayerNorm(embed_groups) # Normalization applied before feed forward

    def forward(self, x):
        # Residuals are added to prevent vanishing gradient
        x = x + self.matt(self.norm1(x)) 
        x = x + self.ff(self.norm2(x)) 
        return x

class BatchModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.char_embeddings = nn.Embedding(vocab_size, embed_groups)
        self.pos_embeddings = nn.Embedding(batch_size, embed_groups) # Adds position embeddings so the model can identify the order of the chars
        self.transformers = nn.Sequential(*[Transformer() for i in range(layers)])
        self.final_norm = nn.LayerNorm(embed_groups)
        self.final_linear = nn.Linear(embed_groups, vocab_size)


    def forward(self, context, target=None):
        A, B = context.shape
        full_embed = self.char_embeddings(context) + self.pos_embeddings(torch.arange(B, device=device))
        x = self.transformers(full_embed)
        x = self.final_norm(x)
        predictions = self.final_linear(x)
        
        
        if target is None:
            loss = None
        else:
            A, B, C = predictions.shape
            predictions = predictions.view(A * B, C)
            target = target.view(A * B)
            loss = F.cross_entropy(predictions, target)
        
        return predictions, loss

    def generate(self, context, length):
        for i in range(length):
            short_context = context[:, -batch_size:] # Reduce context to only focus on last batch_size of chars because positions are embedded
            predictions, loss = self(short_context)
            predictions = predictions[:, -1, :] 
            probabilities = F.softmax(predictions, dim=-1) 
            next_char = torch.multinomial(probabilities, num_samples=1)
            context = torch.cat((context, next_char), dim=1)
        return context

model = BatchModel()
model = model.to(device) # Added ability to run on cuda
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate) # Lowered the learning rate to 1e-3

# Training loop
for i in range(10):
    for j in range(200):
        context, target = create_batches(train, batch_size=batch_size, batches=batches)
        predictions, loss = model(context, target)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
    print('train loss:', loss.item())

    # Added test loss calculation to look for overfitting
    context, target = create_batches(test, batch_size=batch_size, batches=batches)
    predictions, loss = model(context, target)
    print('test loss:', loss.item())

print('\n',decoder(model.generate(context, length=300)[0][batch_size:].tolist()))

train loss: 2.325967788696289
test loss: 2.206486225128174
train loss: 2.240100145339966
test loss: 2.077044725418091
train loss: 2.1006884574890137
test loss: 2.1636483669281006
train loss: 1.9667949676513672
test loss: 1.9528011083602905
train loss: 1.9457050561904907
test loss: 1.7252171039581299
train loss: 1.7470109462738037
test loss: 1.8231513500213623
train loss: 1.647486686706543
test loss: 1.8626803159713745
train loss: 1.6320445537567139
test loss: 1.673216700553894
train loss: 1.427269458770752
test loss: 1.5662214756011963
train loss: 1.7723850011825562
test loss: 1.5276211500167847

 Olly bruttery, brokem at sumpine the to the pitin the's life they shar
And when your night har, be walking mill the knews you
Lave my when, I'll her sain
Be your I knemst be you
But I leave of you
An't gonna creeGr I can creaching cars
And I dall you wisckito thirs brrong. was fholt
Can't it in thre


In [15]:
# More training.

for i in range(10):
    for j in range(200):
        context, target = create_batches(train, batch_size=batch_size, batches=batches)
        predictions, loss = model(context, target)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
    print('train loss:', loss.item())

    context, target = create_batches(test, batch_size=batch_size, batches=batches)
    predictions, loss = model(context, target)
    print('test loss:', loss.item())

print('\n',decoder(model.generate(context, length=300)[0][batch_size:].tolist()))

train loss: 1.46900475025177
test loss: 1.6524217128753662
train loss: 1.3995614051818848
test loss: 1.489622950553894
train loss: 1.5939496755599976
test loss: 1.570263147354126
train loss: 1.534603238105774
test loss: 1.4992766380310059
train loss: 1.5643774271011353
test loss: 1.4966145753860474
train loss: 1.2557839155197144
test loss: 1.5761948823928833
train loss: 1.3989579677581787
test loss: 1.3395514488220215
train loss: 1.4174282550811768
test loss: 1.516241192817688
train loss: 1.2189892530441284
test loss: 1.4019126892089844
train loss: 1.4027663469314575
test loss: 1.3851622343063354

 ame the begging, now, I do
And I love ever
Did will hav ai last the tagloi
The creecher e else anywhem mice
I nothow a forgume, wording you furnt
And all at your funt gorsen, when you go3
Do at you talk is couldn it old in a sirce
And I've never alright, with nights I might me suntil
Never like down


In [16]:
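# More training with reduced learning rate
# Reassigning learning_rate alone does not affect the existing optimizer, so the new value is written into its parameter groups
learning_rate = 3e-4
for group in optimizer.param_groups:
    group['lr'] = learning_rate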

for i in range(10):
    for j in range(200):
        context, target = create_batches(train, batch_size=batch_size, batches=batches)
        predictions, loss = model(context, target)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
    print('train loss:', loss.item())

    context, target = create_batches(test, batch_size=batch_size, batches=batches)
    predictions, loss = model(context, target)
    print('test loss:', loss.item())

print('\n',decoder(model.generate(context, length=300)[0][batch_size:].tolist()))

train loss: 1.3376103639602661
test loss: 1.6179249286651611
train loss: 1.3626505136489868
test loss: 1.3440215587615967
train loss: 1.4894189834594727
test loss: 1.5395950078964233
train loss: 1.1720752716064453
test loss: 1.6247557401657104
train loss: 1.2934796810150146
test loss: 1.4623098373413086
train loss: 1.402515172958374
test loss: 1.5500829219818115
train loss: 1.3286421298980713
test loss: 1.7675071954727173
train loss: 1.2494961023330688
test loss: 1.56464684009552
train loss: 1.2936402559280396
test loss: 1.4446462392807007
train loss: 1.208522081375122
test loss: 1.5539568662643433

 shame
I should've fake, never
'Cause my around of the sed out
And one to the baci tch his you think this is the lits exies the side
I only through me night to had to do
That was night at you're me the porr yoles spolet dip
And it easy apar the nighter the part, he we'll never things in you combatte 


In [17]:
# Printing a longer generation output from the model
print('\n',decoder(model.generate(context, length=1500)[1][batch_size:].tolist()))


 nge
Pats there I sun hate up lican pernectestly
Or he's fake a spoling of in the pict
That this perfel the bother our are to killo
What gue and scorrard it's for faithy
I'm the fake a shaga lin't wanna be with bate
I found at turn you
I hope it's a should, shake me thinnu
But the marver the prines the one you breeathe the again, and I hush, I shake you thoughters and pasken out and me
And when I know is make me shade
Like the one up and I found anyworing
Ssead, I'll never the freemore
It's a little here here phone hunt, but got lost should
And why I'm all where moner
If I could've been a each now a don't know that like bloods up, what I say Oh
Oh, oh, oh oh, oh
Wholed it's fr?
Knew you prozin with home
I had to do 'ema
.., his love man, baby, no get me
When the do way I didn't do
I could dending up come backst home
For a saw you make
She's sure, nothing their again
In I hidn't say what you wanted and rasteer of the ewould fakes back our penterfurade
OnlBy the hand to desk of of the n

## Results

Although the final output of the model still reads like gibberish, it improved greatly from its initial state, where it vomited random characters. The model is able to produce real words, which is impressive given that it is only trained to predict the next character. It has also learned to capitalize the first word of each line, and it produces lines of roughly similar length to the lyrics in the data set. I elected to stop training at this point because the test loss was only making small improvements with each round of training.
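
The train and test losses printed during training each come from a single random batch, so they jump around from one print to the next. One way to get a steadier number for deciding when to stop is to average the loss over many batches with gradients disabled. The sketch below assumes a hypothetical `estimate_loss` helper and uses a tiny stand-in model and random data so that it runs on its own; in the notebook it would be called with `model` and `test` instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def create_batches(data, batch_size, batches):
    # Same sampling scheme as the notebook: random windows plus targets shifted by one char
    index = torch.randint(low=0, high=len(data) - batch_size, size=(batches,))
    context = torch.stack([data[row:(row + batch_size)] for row in index])
    target = torch.stack([data[row + 1:(row + batch_size + 1)] for row in index])
    return context, target

class TinyModel(nn.Module):
    """Hypothetical stand-in with the same (predictions, loss) interface as BatchModel."""
    def __init__(self, vocab_size):
        super().__init__()
        self.char_embeddings = nn.Embedding(vocab_size, vocab_size)

    def forward(self, context, target=None):
        predictions = self.char_embeddings(context)
        if target is None:
            return predictions, None
        loss = F.cross_entropy(predictions.view(-1, predictions.size(-1)), target.view(-1))
        return predictions, loss

@torch.no_grad()  # No gradients are needed for evaluation
def estimate_loss(model, data, batch_size=30, batches=15, eval_steps=50):
    model.eval()
    losses = torch.zeros(eval_steps)
    for step in range(eval_steps):
        context, target = create_batches(data, batch_size, batches)
        _, loss = model(context, target)
        losses[step] = loss.item()
    model.train()
    return losses.mean().item()

torch.manual_seed(0)
toy_data = torch.randint(0, 69, (5000,))  # made-up data with a 69-char vocabulary
toy_model = TinyModel(69)
avg_loss = estimate_loss(toy_model, toy_data)
print(avg_loss)
```

The choice of 50 evaluation steps is arbitrary; more steps give a smoother estimate at the cost of extra compute.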

This model shows the limitations of creating language models with a relatively small amount of compute power. Clearly, there is a lot of room for the model to improve. Using GPUs and distributed training would boost computational power and enable the model to become more complex with additional transformers, more embedding groups, and longer context sizes. Towards the end of the training, the model also started to show signs of overfitting, since the test loss was consistently higher than the training loss. To help reduce overfitting, I could introduce dropout layers that randomly zero some activations during training; however, this would likely also increase the training time to convergence.
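
As a minimal sketch of the dropout idea, here is the feed forward block from the model with an `nn.Dropout` layer appended. The class name `FeedForwardWithDropout` and the rate of 0.1 are assumptions for illustration; this variant was not trained in this notebook.

```python
import torch
import torch.nn as nn

class FeedForwardWithDropout(nn.Module):
    """Hypothetical variant of the feed forward block with dropout on its output."""

    def __init__(self, embed_groups=60, dropout=0.1):
        super().__init__()
        self.ff_network = nn.Sequential(
            nn.Linear(embed_groups, 5 * embed_groups),
            nn.ReLU(),
            nn.Linear(5 * embed_groups, embed_groups),
            nn.Dropout(dropout))  # Randomly zeroes activations, only in training mode

    def forward(self, x):
        return self.ff_network(x)

block = FeedForwardWithDropout()
x = torch.rand(15, 30, 60)  # (batches, batch_size, embed_groups)

block.train()        # Dropout active: outputs vary between calls
train_out = block(x)

block.eval()         # Dropout disabled: outputs are deterministic
eval_a = block(x)
eval_b = block(x)

print(train_out.shape, torch.equal(eval_a, eval_b))
```

With dropout in the model, `model.eval()` would need to be called before measuring test loss or generating lyrics, and `model.train()` before resuming training.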
