# Re-implementing Potash et al.’s GhostWriter: Using an LSTM for Automatic Rap Lyric Generation


This project compares four models for ghostwriting rap lyrics. Each model uses a different architecture: SimpleRNN, GRU, LSTM, and CNN + LSTM.

## 1. Initial Setup

For the data preparation and evaluation of the generated lyrics, we used the following Python libraries: 

- Pronouncing - Used for calculating the rhyme index
- Markovify - Used to create the base Markov model for generating base lyrics
- Textstat - Used to calculate the readability score for a bar
- PyTorch-NLP - Used for tokenization
- spaCy - Provides the `en_core_web_sm` model used by the torchtext tokenizer

In [33]:
!pip3 install pronouncing
!pip3 install markovify
!pip3 install textstat
!pip3 install pytorch-nlp
!pip3 install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
     |████████████████████████████████| 13.9 MB 8.8 MB/s            


Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Data Preparation 

Load an artist's lyric file, remove the byte order mark, and split the file into a list of bars (one bar per line).
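A minimal sketch of why the byte order mark matters, assuming the lyric file was saved as UTF-8 with a BOM. The two-bar temporary file is hypothetical and only stands in for an artist's lyric file. Decoding with `utf8` leaves the mark attached to the first bar, while the `utf-8-sig` codec drops it.

```python
import os
import tempfile

# Hypothetical two-bar file written with a leading BOM (U+FEFF)
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write('\ufeffFirst bar\nSecond bar'.encode('utf-8'))

# Plain UTF-8 decoding keeps the BOM as the first character of the first bar
with open(tmp.name, mode='r', encoding='utf8') as f:
    bars_plain = f.read().split('\n')

# 'utf-8-sig' decodes UTF-8 and strips a leading BOM if one is present
with open(tmp.name, mode='r', encoding='utf-8-sig') as f:
    bars_sig = f.read().split('\n')

os.remove(tmp.name)

print(repr(bars_plain[0]))  # '\ufeffFirst bar'
print(repr(bars_sig[0]))    # 'First bar'
```

A stray BOM would otherwise become part of the first token of the first bar and end up in the vocabulary as a separate word.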

In [55]:
from torch.utils.data import Dataset
import os 
import pandas as pd

#read dataset(s); 'utf-8-sig' strips the byte order mark if a file starts with one
def read_bars(path):
    with open(path, mode='r', encoding='utf-8-sig') as f:
        return f.read().split('\n')

lyrics_data_Drake = read_bars('./data/Drake_lyrics.txt')
lyrics_data_Childish = read_bars('./data/Childish Gambino_lyrics.txt')
lyrics_data_Kanye = read_bars('./data/Kanye West_lyrics.txt')

raw_data_Drake = {'Drake_data': list(lyrics_data_Drake)}
#raw_data_Childish = {'Childish_data': list(lyrics_data_Childish)}
#raw_data_Kanye = {'Kanye_data': list(lyrics_data_Kanye)}

#convert data to dataframe
df = pd.DataFrame(raw_data_Drake, columns = ['Drake_data'])

#shuffle dataset... is this necessary this early?
shuffle_dataframe = df.sample(frac=1)

#Split data : define size of training dataset
train_size = int(0.7*len(df))

#split into training and test set
train_set = shuffle_dataframe[:train_size]
test_set = shuffle_dataframe[train_size:]

#output for torchtext (needs a tabular dataset)
train_set.to_json('train.json', orient = 'records', lines =True)
test_set.to_json('test.json', orient = 'records', lines =True)

train_set.to_csv('train.csv', index=False)
test_set.to_csv('test.csv', index=False)

print(train_set), print(test_set)

                                             Drake_data
3044  Now it's, "Fuck you, I hate you, I'll move out...
3067                     And I finally send you to Rome
1927                                    Put a bib on me
3419  They was hatin' on me then and they hatin' now...
2223                                 Take a shot for me
...                                                 ...
1967                                Rock me real slowly
2018                                      I’m out here…
1830               I don't know who you're referring to
3969                         Steady doin' double shifts
3598               Niggas wouldn't make it on this side

[2971 rows x 1 columns]
                                             Drake_data
4028  This a "fuck-them-boys, forever-hold-a-grudge"...
4211  I bet them shits would have popped if I was wi...
2999                               They loving the crew
1811  For all the stuntin', I'll forever be immortal...
1243                   

(None, None)

### Markov Model 

Create the Markov model that generates the opening words of each bar. Seeding with a Markov model gives the first few words of a bar some coherence before they are fed into the neural network, which generates the rest of the bar.
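The idea can be sketched without any library, assuming a word-level chain in which each state is a tuple of consecutive words. The helper names `build_chain` and `make_seed` and the toy bars are hypothetical. This illustrates the technique only and is not how Markovify is implemented internally.

```python
import random
from collections import defaultdict

def build_chain(bars, state_size=2):
    """Map each tuple of `state_size` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    starts = []  # states that open a bar, used to begin a new seed
    for bar in bars:
        words = bar.split()
        if len(words) <= state_size:
            continue
        starts.append(tuple(words[:state_size]))
        for i in range(len(words) - state_size):
            chain[tuple(words[i:i + state_size])].append(words[i + state_size])
    return chain, starts

def make_seed(chain, starts, n_words=3, rng=random):
    """Walk the chain from a random bar opening until n_words are collected."""
    state = rng.choice(starts)
    words = list(state)
    size = len(state)
    while len(words) < n_words and tuple(words[-size:]) in chain:
        words.append(rng.choice(chain[tuple(words[-size:])]))
    return " ".join(words[:n_words])

# Toy corpus standing in for an artist's bars
toy_bars = ["take a shot for me", "take a look at me now", "put a bib on me"]
chain, starts = build_chain(toy_bars)
seed = make_seed(chain, starts, n_words=3, rng=random.Random(0))
print(seed)
```

Because every transition was observed in the corpus, the seed is always a word sequence the artist has used locally, which is the coherence the neural network then builds on.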

In [56]:
import markovify

#build one Markov model per artist from the full lyric files (one bar per line)
markov_model_Drake = markovify.NewlineText("\n".join(lyrics_data_Drake), well_formed=False, state_size=3)
markov_model_Childish = markovify.NewlineText("\n".join(lyrics_data_Childish))
markov_model_Kanye = markovify.NewlineText("\n".join(lyrics_data_Kanye))

#generate "sen_length" random sentences; make_sentence returns None if none is found within "tries" attempts

sen_length = 4
print("Test Run Drake\n")
for i in range(sen_length):
    print(markov_model_Drake.make_sentence(tries=100))

#print("\n")
#print("Test Run Childish Gambino\n")
#for i in range(sen_length):
#    print(markov_model_Childish.make_sentence(tries=100))

#print("\n")
#print("Test Run Kanye\n")
#for i in range(sen_length):
#    print(markov_model_Kanye.make_sentence(tries=100))
#print("\n")


#test dataset : generate "num_sentences" many random sentences of  "x_chars" length or less
x_chars = 50
num_sentences = 4

print("\n")
print("Drake\n")
for i in range(num_sentences):
    print(markov_model_Drake.make_short_sentence(x_chars))
print("\n")


#print("Childish Gambino\n")
#for i in range(num_sentences):
#    print(markov_model_Childish.make_short_sentence(x_chars))
#print("\n")

#print("Kanye\n")
#for i in range(num_sentences):
#    print(markov_model_Kanye.make_short_sentence(x_chars))
#print("\n")


Test Run Drake

Own it, own it, own it, own it, own it, own it, own it, own it, own it, own it, own it
I don't trust a word you say (I don't trust a word you say (I don't trust a word)
And how I switched it up with a fork and knife for me
'Cause you're a good girl and you know how it goes


Drake

Make everybody have to go to funerals
Pussy so good that you gotta be a thug for her
I just pop up with the jokes, I'm dead, I'm asleep
Yeah, and I could never do no wrong




### Tokenization 

To create the training data, the lyrics first need to be tokenized.

Tokenization splits each bar into words and punctuation marks. Once every token is mapped to a vocabulary index, the bars can be represented as matrices of indices that record which words each bar uses.
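A minimal sketch of that mapping in plain Python, assuming the bars are already split into tokens. The helper names `build_vocab` and `encode` and the `[PAD]`/`[UNK]` tokens are hypothetical. In the notebook itself this job is left to torchtext and PyTorch-NLP.

```python
def build_vocab(tokenized_bars, pad_token="[PAD]", unk_token="[UNK]"):
    """Assign an integer index to every token, reserving 0 for padding and 1 for unknowns."""
    vocab = {pad_token: 0, unk_token: 1}
    for tokens in tokenized_bars:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokenized_bars, vocab, max_len, pad_token="[PAD]", unk_token="[UNK]"):
    """Turn each bar into a fixed-length row of indices, truncating or right-padding as needed."""
    rows = []
    for tokens in tokenized_bars:
        ids = [vocab.get(tok, vocab[unk_token]) for tok in tokens][:max_len]
        rows.append(ids + [vocab[pad_token]] * (max_len - len(ids)))
    return rows

tokenized = [
    ["put", "a", "bib", "on", "me"],
    ["take", "a", "shot", "for", "me"],
    ["they", "know"],
]
vocab = build_vocab(tokenized)
matrix = encode(tokenized, vocab, max_len=5)
for row in matrix:
    print(row)
# [2, 3, 4, 5, 6]
# [7, 3, 8, 9, 6]
# [10, 11, 0, 0, 0]
```

Padding every row to the same length is what lets the bars be stacked into a single tensor for batching.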

In [21]:
import torchtext
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as f
from torch.utils.data import DataLoader
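from torchtext.data.utils import get_tokenizer
#Field, TabularDataset and BucketIterator belong to the legacy torchtext API
#(torchtext.legacy.data in torchtext 0.9-0.11, removed in 0.12)
from torchtext.data import Field, TabularDataset, BucketIterator
#Vectors is needed for the GloVe lookup in the tokenization cell; without this import
#that cell stops with the NameError shown in its output
from torchtext.vocab import Vectors
from torchnlp.encoders.text import StaticTokenizerEncoder, stack_and_pad_tensors, pad_tensor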

In [63]:
#Tokenizer
tokenizer=get_tokenizer('spacy', language='en_core_web_sm')

#def tokenizer(input):
#   return [token.text for token in spacy_en.tokenizer(input)]

#convert data to dataframe
#raw_data_Drake = {'Drake_data': [line for line in lyrics_data_Drake]}
#df = pd.DataFrame(raw_data_Drake, columns = ['Drake_data', label])

#Field is imported directly from torchtext.data above, so it is used without a "data." prefix
TEXT = Field(tokenize=tokenizer, use_vocab=True, lower=True, batch_first=True, include_lengths=True)
#TEXT.build.vocab(train, max_size=25000, vectors="glove.6B.100d")
fields=[('text', TEXT)]

#sequences = lyrics_data_Drake
#encoder = StaticTokenizerEncoder(sequences, tokenize=lambda s: s.split())
#encoded_data = [encoder.encode(example) for example in sequences]

#print(encoded_data)
#print("\n")

#Create Dictionary
#fields = {'label':()}

#pad to get same size sequence (how do we know what size to pad?)
#feed seq to model


#create x and y data for training -- already have train set and test set
training_data = TabularDataset(
    path = 'train.csv', 
    format = 'csv',
    fields = fields,
    skip_header=True
)
for example in training_data.examples:
    print(example.text)

vectors=Vectors(name='glove.6B.100d.txt')
TEXT.build_vocab(training_data, vectors=vectors, max_size=10000, min_freq=1)
#train_X, train_y = seq[:, :-1], 

#train_x == full bar minus last word


#train_y == last word

['now', 'it', "'s", ',', '"', 'fuck', 'you', ',', 'i', 'hate', 'you', ',', 'i', "'ll", 'move', 'out', 'in', 'a', 'heartbeat', '!', '"']
['and', 'i', 'finally', 'send', 'you', 'to', 'rome']
['put', 'a', 'bib', 'on', 'me']
['they', 'was', 'hatin', "'", 'on', 'me', 'then', 'and', 'they', 'hatin', "'", 'now', '(', 'hatin', "'", 'now', ')']
['take', 'a', 'shot', 'for', 'me']
['keep', 'a', 'broad', 'on', 'the', 'floor', 'year', "'", 'round', 'like', 'season', 'tickets']
['i', 'just', 'do', 'it', 'cause', 'i', "'m", "'", 'sposed', 'to', ',', 'nigga']
['wake', 'up', 'with', 'me', 'this', 'weekend', ',', 'we', 'can']
['niggas', 'is', 'frontin', "'", ',', 'that', "'s", 'upside', '-', 'down', 'cake']
['came', 'up', ',', 'that', "'s", 'all', 'me', ',', 'stay', 'true', ',', 'that', "'s", 'all', 'me']
['we', 'ai', 'n’t', 'even', 'have', 'a', 'tour', 'bus']
['got', 'a', 'lot', 'of', 'people', 'tryna', 'drain', 'me', 'of', 'this', 'energy']
['i', "'m", 'acting', 'out', 'in', 'the', 'open', ',', 'it', 

['you', 'know', 'that', 'i', 'do', "n't", 'play']
['how', 'you', 'so', 'high', ',', 'but', 'still', 'so', 'down', 'to', 'earth', ',', 'nigga', '?']
['shout', 'out', 'to', 'all', 'my', 'niggas', 'living', 'tax', 'free']
['back', 'to', 'back', 'for', 'the', 'niggas', 'that', 'did', "n't", 'get', 'the', 'message']
['heard', 'once', 'that', 'in', 'dire', 'times', 'when', 'you', 'need', 'a', 'sign', ',', 'that', "'s", 'when', 'they', 'appear']
['0', 'to', '100', ',', 'nigga', ',', 'real', 'quick']
['nobody', 'else', "'s", ',', 'yeah', ',', 'this', 'shit', 'belong', 'to', 'nobody', ',', 'it', "'s", 'yours']
['cause', 'that', 'night', 'i', 'played', 'her', 'three', 'songs']
['feel', 'a', 'way', ',', 'feel', 'a', 'way', ',', 'young', 'nigga', 'feel', 'a', 'way']
['yeah']
['shit', 'hot', 'up', 'in', 'the', '6', 'right', 'now']
['a', 'lot', 'of', 'niggas', 'cut', 'the', 'check', 'so', 'they', 'can', 'take', 'this', 'flow']
['this', 'is', 'more', 'than', 'just', 'a', 'new', 'lust', 'for', 'you']


['they', 'tryna', 'take', 'the', 'wave', 'from', 'a', 'nigga']
['say', 'my', 'name']
['next', 'time', 'we', 'fuck', ',', 'i', 'do', "n't", 'wanna', 'fuck', ',', 'i', 'wanna', 'make', 'love']
['and', 'i', 'wanna', 'tell', 'you', 'my', 'intentions']
['they', 'know']
['who', "'s", 'not', 'gang', ',', 'bitch', '?', 'let', 'me', 'find', 'out']
['that', 'always', 'let', 'they', 'mouth', 'run']
['i', '’m', 'trying', 'to', 'take', 'the', 'high', 'road']
['she', 'wanna', 'get', 'married', 'tonight']
['(', 'are', 'you', 'drunk', 'right', 'now', '?', ')']
['why', 'the', 'sudden', 'change', '?']
['tell', 'the', 'truth', ',', 'i', 'do', 'n’t', 'listen', 'to', 'you']
['say', ',', '“', 'baby', ',', 'i', 'love', 'you', '”']
['make', 'me', 'call', 'my', 'bro', ',', 'do', "n't"]
['you', 'ai', "n't", 'ever', 'worried', 'cause', 'he', "'s", 'not', 'who', 'he', 'pretends', 'to', 'be']
['got', 'ta', 'get', 'a', 'handle', 'on', 'you']
['i', 'got', "'em", 'worried', ',', 'like', 'make', 'sure', 'you', 'save',

NameError: name 'Vectors' is not defined

In [29]:
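#encoded_data comes from the StaticTokenizerEncoder lines that are commented out in the tokenization cell
#pad_tensor asserts that "length" is at least the sequence length, so pad to the longest bar instead of a fixed 10
max_len = max(len(x) for x in encoded_data)
pad_encoded_data = [pad_tensor(x, length=max_len) for x in encoded_data]
print(stack_and_pad_tensors(pad_encoded_data))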

AssertionError: 


## SimpleRNN Model

In [3]:
#RNN Model
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        #both layers read the current input concatenated with the previous hidden state
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        #softmax over dim 1 (dims count from 0): dim 0 is the batch, dim 1 holds the output scores
        self.softmax = nn.LogSoftmax(dim=1)
    

    def forward (self, input_tensor, hidden_tensor):
        #curr hid state is a combo of prev hid state and curr input
        combined = torch.cat((input_tensor, hidden_tensor), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden 

"""  
#shape check: one input vector of size 100, hidden size 20, 10 output classes
model = RNN(100, 20, 10)
x = torch.randn(1, 100)
h = torch.zeros(1, 20)
output, h = model(x, h)
print(output.shape)
"""

#run on device (GPU or CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


#hyperparameters
#input_size, output_size and hidden_size are placeholders for now
input_size = 512
output_size = 1
hidden_size = 1
learning_rate = 0.001
batch_size = 10
num_epochs = 50

#load data for training
#NOTE: train_set and test_set are still DataFrames of raw text at this point; DataLoader needs a
#Dataset that yields (data, targets) tensor pairs, with data shaped [batch, seq_len, input_size]
#and targets holding the index of the word to predict
train_loader = DataLoader(dataset=train_set, batch_size=batch_size, shuffle = True)
test_loader = DataLoader(dataset=test_set, batch_size=batch_size, shuffle = False)

#initialize the network and move it to the device (GPU or CPU)
model = RNN(input_size, hidden_size, output_size).to(device)

#optimizer and loss function
#the model already ends in LogSoftmax, so use NLLLoss (CrossEntropyLoss would apply log-softmax a second time)
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr = learning_rate)

#Training Step: run one batch through the network and update the weights
def train_step(data, targets):
    data = data.to(device=device)
    targets = targets.to(device=device)

    #start every batch from a zero hidden state
    hidden = torch.zeros(data.shape[0], hidden_size, device=device)

    #forward part of NN: feed the sequence one time step at a time
    for t in range(data.shape[1]):
        output, hidden = model(data[:, t, :], hidden)
    loss = criterion(output, targets)

    #clear the gradients left over from the previous batch, then backpropagate
    optimizer.zero_grad()
    loss.backward()

    #gradient descent, update weights based on loss.backward
    optimizer.step()

    return output, loss.item()

#training loop
all_losses = []
for epoch in range(num_epochs):
    current_loss = 0
    for batch_index, (data, targets) in enumerate(train_loader):
        output, loss = train_step(data, targets)
        current_loss += loss
    all_losses.append(current_loss / len(train_loader))
    print(f"epoch {epoch + 1}: loss {all_losses[-1]:.4f}")

## LSTM Model

In [4]:
class LSTM(nn.Module):
    
    ''' Initialize the network variables '''
    def __init__(self, num_hidden, num_layers, embed_size, drop_prob, lr):
        # call super() on the class
        super().__init__()
        
        # store the constructor variables
        self.drop_prob = drop_prob
        self.num_layers = num_layers
        self.num_hidden = num_hidden
        self.lr = lr
        
        # define the embedding layer
        # (vocab_size is read from the enclosing scope and must be defined before the model is created)
        self.embedded = nn.Embedding(vocab_size, embed_size)

        # define the LSTM
        self.lstm = nn.LSTM(embed_size, num_hidden, num_layers, dropout = drop_prob, batch_first = True)
        
        # define a dropout layer
        self.dropout = nn.Dropout(drop_prob)
        
        # define the fully-connected layer
        self.fc = nn.Linear(num_hidden, vocab_size)      
    
    ''' Forward propagate through the network '''
    def forward(self, x, hidden):
        
        ## pass input through embedding layer
        embedded = self.embedded(x)     
        
        # Obtain the outputs and hidden layer from the LSTM layer
        lstm_output, hidden = self.lstm(embedded, hidden)
        
        # pass through a dropout layer and reshape
        dropout_out = self.dropout(lstm_output).reshape(-1, self.num_hidden) 

        ## put "out" through the fully-connected layer
        out = self.fc(dropout_out)

        # return the final output and the hidden state
        return out, hidden
    
    ''' Initialize the hidden state of the network '''
    def init_hidden(self, batch_size):
        
        # Grab one parameter tensor so the new tensors share its dtype and device
        weight = next(self.parameters()).data

        # initialize the hidden state and the cell state with zeros
        hidden = (weight.new(self.num_layers, batch_size, self.num_hidden).zero_(),
                  weight.new(self.num_layers, batch_size, self.num_hidden).zero_())
        
        # return the (hidden state, cell state) tuple
        return hidden

NameError: name 'nn' is not defined

In [5]:
# create the LSTM model
model = LSTM(num_hidden, num_layers, embed_size, drop_prob, lr)

# selecting an optimizer
optimizer = torch.optim.Adam(model.parameters(), lr = lr)

# selecting a loss function
loss_func = nn.CrossEntropyLoss()

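# switch the model to training mode (enables dropout); as the last expression in the cell,
# this also displays an overview of the model
model.train()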

NameError: name 'LSTM' is not defined

In [6]:
def get_next_batch(x, y, batch_size):
    
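    # iterate over full batches only; the +1 keeps the last full batch
    # when the number of rows is an exact multiple of batch_size
    for itr in range(batch_size, x.shape[0] + 1, batch_size):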
        
        # obtain the indexed x and y values
        batch_x = x[itr - batch_size:itr, :]
        batch_y = y[itr - batch_size:itr, :]
        
        # yield these values
        yield batch_x, batch_y

In [7]:
for epoch in range(num_epochs):

    # initialize hidden state
    hidden_layer = model.init_hidden(batch_size)
        
    for x, y in get_next_batch(x_idx, y_idx, batch_size):
            
        # convert numpy arrays to PyTorch tensors
        inputs = torch.from_numpy(x).type(torch.LongTensor)
        act = torch.from_numpy(y).type(torch.LongTensor)

        # detach the hidden state from the previous batch so gradients do not flow across batches
        hidden_layer = tuple([layer.data for layer in hidden_layer])

        # reset the gradients accumulated from the previous batch
        model.zero_grad()
            
        # get the output from the model and carry the new hidden state into the next batch
        output, hidden_layer = model(inputs, hidden_layer)
            
        # calculate the loss from this prediction
        loss = loss_func(output, act.view(-1))

        # back-propagate to compute the gradients
        loss.backward()

        # prevent exploding gradient problem
        nn.utils.clip_grad_norm_(model.parameters(), 1)

        # update weights using the optimizer
        optimizer.step()           

NameError: name 'num_epochs' is not defined

## Generating Lyrics

In [None]:
import random as rand
import numpy as np

def generate_rap(model, artists_bars, length_of_bar=10, length_of_rap=20, min_score_threshold=-0.2, max_score_threshold=0.2, tries=5):
    # markov_model is read from the enclosing scope, e.g. markov_model = markov_model_Drake
    artists_avg_readability = calc_readability(artists_bars)
    artists_avg_rhyme_idx = calc_rhyme_density(artists_bars)
    fire_rap = []
    cur_tries = 0
    candidate_bars = []

    while len(fire_rap) < length_of_rap:
        # seed each bar with the first three words of a Markov sentence
        sentence = markov_model.make_sentence(tries=500)
        if sentence is None:
            continue
        seed_phrase = " ".join(sentence.split(" ")[:3])
        cur_tries += 1
        bar = generate_bar(seed_phrase, model, rand.randrange(4, length_of_bar))
        bar_score = score_bar(bar, artists_bars, artists_avg_readability, artists_avg_rhyme_idx)
        candidate_bars.append((bar_score, bar))

        # accept the bar if its score falls inside the threshold window
        if min_score_threshold <= bar_score <= max_score_threshold:
            fire_rap.append(bar)
            cur_tries = 0
            candidate_bars = []
            print("Generated Bar: ", len(fire_rap))

        # after too many rejected tries, fall back to the lowest-scoring candidate
        elif cur_tries >= tries:
            lowest_score = np.inf
            best_bar = ""
            for score, candidate in candidate_bars:
                if score < lowest_score:
                    lowest_score = score
                    best_bar = candidate
            candidate_bars = []
            fire_rap.append(best_bar)
            cur_tries = 0
            print("Generated Bar: ", len(fire_rap))

    print("Generated rap with avg rhyme density: ", calc_rhyme_density(fire_rap), "and avg readability of: ", calc_readability(fire_rap))
    return fire_rap

In [None]:
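# NOTE: this helper follows the Keras API (texts_to_sequences, pad_sequences, model.predict);
# it needs a fitted Keras Tokenizer named "tokenizer" and a Keras model,
# not the torchtext tokenizer or the PyTorch modules defined above
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_bar(seed_phrase, model, length_of_bar):
    for i in range(length_of_bar):
        # encode the phrase so far and left-pad it to the model's input length
        seed_tokens = pad_sequences(tokenizer.texts_to_sequences([seed_phrase]), maxlen=29)
        output_p = model.predict(seed_tokens)
        # word_index is 1-based, so subtract 1 to index into its list of items
        output_word = np.argmax(output_p, axis=1)[0]-1
        seed_phrase += " " + str(list(tokenizer.word_index.items())[output_word][0])
    return seed_phrase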

In [None]:
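import math
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import CountVectorizer

def compare_bars(input_bar, artists_bars):
    # Average cosine similarity between the generated bar and each of the artist's bars
    total_similarity = 0
    total_counted = 0
    for bar in artists_bars:
        v = CountVectorizer()
        # Convert both bars to a matrix of token counts
        word_vector = v.fit_transform([input_bar, bar])
        # pdist returns the cosine distance, so 1 - distance is the cosine similarity
        cos_similarity = 1-pdist(word_vector.toarray(), 'cosine')[0]
        # the similarity is NaN when one of the bars has no counted tokens
        if not math.isnan(cos_similarity):
            total_similarity += cos_similarity
            total_counted += 1
    # avoid dividing by zero when no bar could be compared
    return total_similarity/total_counted if total_counted else 0.0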
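The quantity averaged in `compare_bars` is the cosine similarity of two token-count vectors. A minimal pure-Python sketch of that measure, assuming simple whitespace tokenization. The helper name `cosine_similarity` is hypothetical, and the numbers differ from the scikit-learn version because `CountVectorizer` by default ignores punctuation and one-character tokens such as "a".

```python
import math
from collections import Counter

def cosine_similarity(bar_a, bar_b):
    """Cosine similarity between the word-count vectors of two bars."""
    counts_a = Counter(bar_a.lower().split())
    counts_b = Counter(bar_b.lower().split())
    # Dot product over the words of bar_a; words missing from bar_b count as 0
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same = cosine_similarity("take a shot for me", "take a shot for me")
partial = cosine_similarity("take a shot for me", "put a bib on me")
disjoint = cosine_similarity("take a shot", "they know")
print(same)      # about 1.0: identical bars
print(partial)   # about 0.4: the bars share "a" and "me" out of five words each
print(disjoint)  # 0.0: no words in common
```

A generated bar that scores close to 1 against an existing bar is essentially a copy of it, which is why this term is added as a penalty in the bar score.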

In [None]:
import pronouncing

def calc_rhyme_density(bars):
    total_syllables = 0
    rhymed_syllables = 0
    for bar in bars:
        for word in bar.split():
            p = pronouncing.phones_for_word(word)
            # stop scanning this bar at the first word missing from the pronouncing dictionary
            if len(p) == 0:
                break
            syllables = pronouncing.syllable_count(p[0])
            total_syllables += syllables
            has_rhyme = False
            for rhyme in pronouncing.rhymes(word):
                if has_rhyme:
                    break
                # only the first five bars are searched for a rhyme
                for idx, r_bar in enumerate(bars):
                    if idx > 4:
                        break
                    if rhyme in r_bar:
                        rhymed_syllables += syllables
                        has_rhyme = True
                        break
    # avoid dividing by zero when no word could be looked up
    return rhymed_syllables/total_syllables if total_syllables else 0.0

In [None]:
import textstat

def calc_readability(input_bars):
    # average automated readability index over all bars
    avg_readability = 0
    for bar in input_bars:
        avg_readability += textstat.automated_readability_index(bar)
    return avg_readability / len(input_bars)
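For intuition about the readability numbers, here is a sketch of the standard automated readability index formula, 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The helper name `ari` and its counting rules are assumptions made for illustration; Textstat's own counting and rounding may differ slightly.

```python
def ari(text):
    """Automated readability index from the standard formula."""
    words = text.split()
    if not words:
        return 0.0
    # Count letters and digits only; spaces and punctuation are ignored
    chars = sum(ch.isalnum() for ch in text)
    # Treat the bar as a single sentence unless it contains terminal punctuation
    sentences = max(1, sum(text.count(p) for p in ".!?"))
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / sentences) - 21.43

score = ari("Take a shot for me")
print(round(score, 3))  # -5.742
```

Short bars made of short words, like this one, come out negative, so the averages compared in the scoring step can sit well below the grade-level range the index was designed for.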

In [None]:
def score_bar(input_bar, artists_bars, artists_avg_readability, artists_avg_rhyme_idx):
    gen_readability = textstat.automated_readability_index(input_bar)
    # calc_rhyme_density expects a list of bars, so wrap the single bar in a list
    gen_rhyme_idx = calc_rhyme_density([input_bar])
    comp_bars = compare_bars(input_bar, artists_bars)

    # Scores based off readability, rhyme index, and originality. The lower the score the better.
    bar_score = (artists_avg_readability - gen_readability) + (artists_avg_rhyme_idx - gen_rhyme_idx) + comp_bars
    return bar_score
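A worked example of the scoring arithmetic, using made-up readability, rhyme, and similarity values. The helper names `bar_score_from_parts` and `accepted` are hypothetical; the thresholds are the defaults from the `generate_rap` signature.

```python
def bar_score_from_parts(artist_readability, gen_readability, artist_rhyme, gen_rhyme, similarity):
    """Same arithmetic as score_bar, with the three measurements passed in directly."""
    return (artist_readability - gen_readability) + (artist_rhyme - gen_rhyme) + similarity

def accepted(score, min_score_threshold=-0.2, max_score_threshold=0.2):
    """A bar is kept when its score falls inside the threshold window."""
    return min_score_threshold <= score <= max_score_threshold

# Hypothetical bar that is close to the artist's averages and fairly original
close = bar_score_from_parts(artist_readability=2.0, gen_readability=2.1,
                             artist_rhyme=0.30, gen_rhyme=0.25, similarity=0.15)
print(round(close, 2), accepted(close))  # 0.1 True

# Hypothetical bar that reads at a much higher level than the artist's average
far = bar_score_from_parts(artist_readability=2.0, gen_readability=5.0,
                           artist_rhyme=0.30, gen_rhyme=0.25, similarity=0.15)
print(round(far, 2), accepted(far))  # -2.8 False
```

Because the readability and rhyme differences are signed, a bar can be rejected for scoring too low as well as too high, which is why acceptance uses a window around zero.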

In [None]:
rnn = generate_rap(rnn_model, artist_lyrics, length_of_bar = 8, tries=100)

print("Rap Generated with SimpleRNN:")
for line in rnn:
    print(line)
print()

lstm = generate_rap(lstm_model, artist_lyrics, length_of_bar = 8, tries=100)

print("Rap Generated with LSTM:")
for line in lstm:
    print(line)
print()
