# Using Machine Learning to generate raps
The goal of this project is to take a dataset covering a variety of artists, select the rappers, and build a corpus from their lyrics. I will then use TensorFlow to build a neural network that generates new raps.

#### Warning: Many rappers use explicit language. If you are not comfortable with that, do not continue.

Overall, the principles here apply to a variety of text applications. In this case I simply use my model's predictions to generate new raps, but the same approach could be used for missing-word imputation. In other words, it extends to predictive text: the seed is the user's word or sentence, and the model predicts the next word or the next full sentence.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from IPython.display import display, HTML, Markdown, Latex
display(HTML("""
<style>
.output {
    display: flex;
    align-items: center;
    text-align: center;
    
}
div.output_subarea{
    max-width:1200px;
}
div.text_cell_render{
padding: 5em 5em 0.5em 0.5em
}

</style>
"""))

In [2]:
df = pd.read_csv("songdata.csv")

In [3]:
df.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


### There are a lot of artists, so I will broadly select the rappers that I know.

In [4]:
df['artist'].unique()

array(['ABBA', 'Ace Of Base', 'Adam Sandler', 'Adele', 'Aerosmith',
       'Air Supply', 'Aiza Seguerra', 'Alabama', 'Alan Parsons Project',
       'Aled Jones', 'Alice Cooper', 'Alice In Chains', 'Alison Krauss',
       'Allman Brothers Band', 'Alphaville', 'America', 'Amy Grant',
       'Andrea Bocelli', 'Andy Williams', 'Annie', 'Ariana Grande',
       'Ariel Rivera', 'Arlo Guthrie', 'Arrogant Worms', 'Avril Lavigne',
       'Backstreet Boys', 'Barbie', 'Barbra Streisand', 'Beach Boys',
       'The Beatles', 'Beautiful South', 'Beauty And The Beast',
       'Bee Gees', 'Bette Midler', 'Bill Withers', 'Billie Holiday',
       'Billy Joel', 'Bing Crosby', 'Black Sabbath', 'Blur', 'Bob Dylan',
       'Bob Marley', 'Bob Rivers', 'Bob Seger', 'Bon Jovi', 'Boney M.',
       'Bonnie Raitt', 'Bosson', 'Bread', 'Britney Spears',
       'Bruce Springsteen', 'Bruno Mars', 'Bryan White', 'Cake',
       'Carly Simon', 'Carol Banawa', 'Carpenters', 'Cat Stevens',
       'Celine Dion', 'Chaka Khan

In [5]:
artists = [
     'Drake', 'Eminem', 'Fabolous', 
       'Fatboy Slim', 
       'Flo-Rida',
       'Gucci Mane', 
       'Ice Cube',
       'J Cole',  'Kanye West', 'Lil Wayne', 'LL Cool J','Mc Hammer', 
        'Migos', 'Nicki Minaj',
       'Notorious B.I.G.', 'Puff Daddy', 'Q-Tip', 'Snoop Dogg', 
       'The Weeknd','Will Smith',
        'Wiz Khalifa', 'Wu-Tang Clan', 'Yelawolf',
       'Ying Yang Twins', 'Yo Gotti', 'Young Buck', 'Young Dro', 'Young Jeezy',
       'Youngbloodz', 'Yung Joc'
]

In [6]:
df_rap = df[df['artist'].isin(artists)]

In [7]:
# mode='w' overwrites the file; mode='a' would append a duplicate corpus on every re-run
df_rap['text'].to_csv(r'df_rap.txt', header=None, index=None, sep=' ', mode='w')



In [19]:
# mode='w' overwrites the file; mode='a' would append a duplicate corpus on every re-run
df['text'].to_csv(r'df_all.txt', header=None, index=None, sep=' ', mode='w')



In [37]:
import tensorflow as tf

In [8]:
path_to_file = "df_rap.txt"

In [9]:
with open(path_to_file, 'r') as f:  # close the file handle once the corpus is read
    text = f.read()

### This is part of the finished corpus that will go into the model.

In [10]:
print(text[:500])

"[Verse 1]  
Boomin' out in South Gwinnett like Lou Will  
6 man like Lou Will, 2 girls and they get along like I'm...  
Like I'm Lou Will, I just got the new deal  
I am in the Matrix and I just took the blue pill  
No ho shit, no fuckin' ho shit, just save that for your shit  
I don't need no fuckin' body, I run my own shit  
Like Soulja, I thought I told yah, you didn't listen  
Fieri, I'm in the kitchen, I'm a magician  
I'm on it, I'm like Macgyver, I'm Michael Meyers  
I kill careers and c


In [28]:
vocab = sorted(set(text))

### 76 unique characters in the entire rap corpus (the model works at the character level, so the vocabulary is made of characters, not words)

In [29]:
len(vocab)

76

### The model needs every character in the vocabulary mapped to an integer index.

In [30]:
char_to_ind = {char:ind for ind,char in enumerate(vocab)}

In [31]:
char_to_ind

{'\n': 0,
 ' ': 1,
 '!': 2,
 '"': 3,
 "'": 4,
 '(': 5,
 ')': 6,
 ',': 7,
 '-': 8,
 '.': 9,
 '0': 10,
 '1': 11,
 '2': 12,
 '3': 13,
 '4': 14,
 '5': 15,
 '6': 16,
 '7': 17,
 '8': 18,
 '9': 19,
 ':': 20,
 '?': 21,
 'A': 22,
 'B': 23,
 'C': 24,
 'D': 25,
 'E': 26,
 'F': 27,
 'G': 28,
 'H': 29,
 'I': 30,
 'J': 31,
 'K': 32,
 'L': 33,
 'M': 34,
 'N': 35,
 'O': 36,
 'P': 37,
 'Q': 38,
 'R': 39,
 'S': 40,
 'T': 41,
 'U': 42,
 'V': 43,
 'W': 44,
 'X': 45,
 'Y': 46,
 'Z': 47,
 '[': 48,
 ']': 49,
 'a': 50,
 'b': 51,
 'c': 52,
 'd': 53,
 'e': 54,
 'f': 55,
 'g': 56,
 'h': 57,
 'i': 58,
 'j': 59,
 'k': 60,
 'l': 61,
 'm': 62,
 'n': 63,
 'o': 64,
 'p': 65,
 'q': 66,
 'r': 67,
 's': 68,
 't': 69,
 'u': 70,
 'v': 71,
 'w': 72,
 'x': 73,
 'y': 74,
 'z': 75}

In [32]:
ind_to_char = np.array(vocab) ## index loc to grab char

In [33]:
encoded_text = np.array([char_to_ind[c] for c in text]) ## corpus as np.array
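
The encoding step above can be checked on a toy string. This is a minimal sketch in plain Python (a list stands in for the NumPy array, and the toy text is an assumption made for illustration); it shows that mapping characters to indices and back is lossless.

```python
# Toy corpus standing in for the real lyrics (hypothetical example text)
toy_text = "yo yo, shawty"

# Same recipe as above: sorted unique characters form the vocabulary
toy_vocab = sorted(set(toy_text))
toy_char_to_ind = {char: ind for ind, char in enumerate(toy_vocab)}
toy_ind_to_char = toy_vocab  # position in the list is the index

# Encode every character as an integer, then decode it back
toy_encoded = [toy_char_to_ind[c] for c in toy_text]
toy_decoded = "".join(toy_ind_to_char[i] for i in toy_encoded)

print(toy_vocab)
print(toy_encoded)
print(toy_decoded == toy_text)  # True: the round trip loses nothing
```

Because the vocabulary is sorted, the same corpus always yields the same indices, which matters when weights trained in one session are loaded in another.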

### Batching: The sequence length must be long enough to capture the structure of the lyrics. I think a length of about 180 characters is a good starting place. Since most raps are built from couplets, I will also shorten it to 80 in another model.

In [17]:
print(text[18000:19000])

n the low  
I still been plottin' on the low  
Schemin' on the low  
The furthest thing from perfect  
Like everyone I know  
  
I just been drinkin' on the low  
Mobbin' on the low  
Fuckin' on the low  
Smokin' on the low  
I just been plottin' on the low  
Schemin' on the low  
The furthest thing from perfect  
Like everyone I know  
  
Drinkin', smokin', fuckin', plottin'  
Schemin', plottin', schemin', gettin' money  
Drinkin', fuckin', smokin', plottin', schemin'  
Plottin', schemin', getting money  
  
This the life for me  
My mama told me this was right for me  
I got em worried, like make sure you save a slice for me  
I should have Spoons, serve you up with a fork and knife for me  
Your actions make us doubt you  
Your lack of effort got me rapping different  
This the shit I wanna go out to  
Play this shit at my funeral if they catch me slippin'  
Naked women swimming that's just how I'm living  
Donate a million to some children, that's just how I'm feeling  
A nigga fil

In [None]:
# len('''
# Trees like me weren't meant to live  
# (Oh Lord I lay me down)  
# If all this earth can give  
# (My branches to the ground)  
# Is pollution and slow death  
# (There's nothing left for me) 
# ''')

In [34]:
seq_len = 180

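### At a sequence length of 180, the corpus divides into about 23,145 sequences (the batching below uses chunks of seq_len+1 characters, so the actual count is slightly lower)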

In [35]:
total_num_seq = len(text) // seq_len + 1
total_num_seq

23145

In [38]:
char_dataset = tf.data.Dataset.from_tensor_slices(encoded_text)

In [39]:
sequences = char_dataset.batch(seq_len+1, drop_remainder=True) ## drop incomplete seqs

In [40]:
### Split each chunk of seq_len+1 characters into an input and a target shifted one character ahead
def create_seq_target(seq):
    input_txt = seq[:-1] # Hello my nam
    target_txt = seq[1:] # ello my name
    return input_txt, target_txt
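
To make the chunking and the one-character shift concrete, here is a rough plain-Python sketch of the same idea without `tf.data`. The helper name `make_sequences` and the toy string are assumptions for illustration only.

```python
def make_sequences(encoded, seq_len):
    """Cut `encoded` into chunks of seq_len+1 and split each into (input, target).

    Mirrors batch(seq_len+1, drop_remainder=True) followed by create_seq_target:
    any trailing characters that do not fill a whole chunk are dropped.
    """
    chunk = seq_len + 1
    n_full = len(encoded) // chunk
    pairs = []
    for k in range(n_full):
        seq = encoded[k * chunk:(k + 1) * chunk]
        pairs.append((seq[:-1], seq[1:]))  # target is the input shifted by one
    return pairs

# A string is used here instead of integer indices so the shift is easy to read
pairs = make_sequences("Hello my name is", seq_len=4)
for input_txt, target_txt in pairs:
    print(repr(input_txt), "->", repr(target_txt))
# First pair: 'Hell' -> 'ello'
```

Each chunk needs one extra character so that every input position has a "next character" to predict, which is why the batching uses `seq_len+1`.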

In [41]:
dataset = sequences.map(create_seq_target)

### Buffer: Rather than shuffling the entire dataset at once, I shuffle within a buffer of 10,000 sequences at a time, so the model does not see the songs in their original order and is more robust.

In [42]:
batch_size = 128

In [43]:
buffer_size = 10000
dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)
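
As an illustration of what a shuffle buffer does, here is a plain-Python sketch of the behavior the `tf.data` documentation describes: fill a buffer, draw a random element from it, and replace it with the next incoming element. The function `buffered_shuffle` is a hypothetical stand-in, not TensorFlow's implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)       # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]             # emit a random buffered element
        buffer[idx] = item            # the new element takes its slot
    rng.shuffle(buffer)               # drain whatever is left at the end
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=3))
print(shuffled)
```

With a buffer smaller than the dataset, an element can only move a limited distance forward, so the shuffle is local rather than uniform. A buffer of 10,000 against roughly 23,000 sequences mixes well but not perfectly.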

### Model

In [44]:
vocab_size = len(vocab)
embed_dim = 64 ## embedding size; kept close to the number of unique characters (76)
rnn_neurons = 1024

In [45]:
from tensorflow.keras.losses import sparse_categorical_crossentropy

In [46]:
### Targets are integer indices, not one-hot vectors, so the sparse loss is needed; from_logits=True because the final Dense layer has no softmax
def sparse_cat_loss(y_true, y_pred):
    return sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
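
For intuition, the loss for a single character can be worked out by hand. The sketch below is a plain-Python version of sparse categorical cross-entropy on raw logits (the function name is an assumption; Keras does the same calculation over whole batches): the negative log of the softmax probability assigned to the true index.

```python
import math

def sparse_cat_crossentropy_from_logits(target_index, logits):
    """Return -log(softmax(logits)[target_index]) for one prediction."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

# An untrained model that rates all 76 characters equally
uniform_loss = sparse_cat_crossentropy_from_logits(0, [0.0] * 76)
print(round(uniform_loss, 2))  # 4.33, i.e. ln(76)

# A model that strongly favors the correct character
confident_logits = [0.0] * 76
confident_logits[5] = 10.0
confident_loss = sparse_cat_crossentropy_from_logits(5, confident_logits)
print(confident_loss < uniform_loss)  # True
```

The uniform case gives a useful baseline: with 76 characters, a training loss near 4.33 means the model has learned nothing yet, and anything well below that reflects real structure picked up from the lyrics.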

In [47]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GRU

In [48]:
def create_model(vocab_size, embed_dim, rnn_neurons, batch_size):
    model = Sequential()
    model.add(Embedding(vocab_size, embed_dim, batch_input_shape=[batch_size, None]))
    model.add(GRU(rnn_neurons, return_sequences=True,
                 stateful=True, recurrent_initializer='glorot_uniform'))
    model.add(Dense(vocab_size))
    
    model.compile('adam', loss=sparse_cat_loss)
    
    model.summary()
    
    return model

In [49]:
model = create_model(vocab_size=vocab_size,
                    embed_dim=embed_dim,
                    rnn_neurons=rnn_neurons,
                    batch_size=batch_size)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (128, None, 64)           4864      
_________________________________________________________________
gru (GRU)                    (128, None, 1024)         3348480   
_________________________________________________________________
dense (Dense)                (128, None, 76)           77900     
Total params: 3,431,244
Trainable params: 3,431,244
Non-trainable params: 0
_________________________________________________________________
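
The parameter counts in the summary above can be reproduced by hand. This is a small sketch with hypothetical helper names; the GRU formula assumes Keras' default `reset_after=True`, which keeps two bias vectors per gate.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def gru_params(input_dim, units):
    # Three gates (update, reset, candidate), each with an input kernel,
    # a recurrent kernel, and two bias vectors.
    # Assumption: two biases per gate matches Keras' default reset_after=True.
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

def total_params(vocab_size, embed_dim, rnn_neurons):
    return (embedding_params(vocab_size, embed_dim)
            + gru_params(embed_dim, rnn_neurons)
            + dense_params(rnn_neurons, vocab_size))

print(embedding_params(76, 64))    # 4864
print(gru_params(64, 1024))        # 3348480
print(dense_params(1024, 76))      # 77900
print(total_params(76, 64, 1024))  # 3431244
```

Almost all of the parameters sit in the GRU's recurrent kernel, which grows with the square of `rnn_neurons`. That is why doubling the units roughly quadruples the model size.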


In [50]:
# epochs = 30
# model.fit(dataset, epochs=epochs)

### Evaluation: Given the number of parameters, I trained the model on a Google Colab GPU. Here I rebuild it with a batch size of 1 and load the saved weights to generate text.

In [51]:
from tensorflow.keras.models import load_model

In [52]:
model = create_model(vocab_size, embed_dim, rnn_neurons, batch_size=1)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 64)             4864      
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1024)           3348480   
_________________________________________________________________
dense_1 (Dense)              (1, None, 76)             77900     
Total params: 3,431,244
Trainable params: 3,431,244
Non-trainable params: 0
_________________________________________________________________


In [57]:
model.load_weights('NLP_songs_fixed.h5')

In [60]:
model.build(tf.TensorShape([1, None]))

In [55]:
def generate_text(model, start_seed, gen_size=500, temp=1.0):
    num_generate= gen_size
    
    input_eval = [char_to_ind[s] for s in start_seed]
    
    input_eval = tf.expand_dims(input_eval,0)
    
    text_generated = []
    
    temperature = temp
    
    model.reset_states()
    
    for i in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0) ## drop the batch dimension added above
        predictions = predictions / temperature ## scale the logits: temp < 1 sharpens the
        ## distribution (safer picks), temp > 1 flattens it (more surprising picks)
        
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
        
        input_eval = tf.expand_dims([predicted_id],0)
        
        text_generated.append(ind_to_char[predicted_id])
        
    return (start_seed + "".join(text_generated))
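
The effect of `temp` is easiest to see on a tiny example. The sketch below is plain Python with made-up logits, not the trained model's output. It computes the probabilities that sampling from `logits / temperature` corresponds to.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Probabilities implied by sampling from logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
probs_by_temp = {t: softmax_with_temperature(toy_logits, t) for t in (0.5, 1.0, 2.0)}
for temp, probs in probs_by_temp.items():
    print(temp, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring character, which tends to give more repetitive but more word-like text. Higher temperatures spread it out, which gives more variety and more made-up words.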

### This is the first rap from model 1, which has a sequence length of 180. The model predicts everything after the start seed.
The made-up words are to be expected because the model predicts one character at a time.

I also trained two other models after this one. I think the results from the second model are the best. Some of the sentences are clearly nonsense, but the results are still quite impressive. Even on a Google Colab Nvidia GPU, these models took a while to train. In the future I would like to repeat this project on better hardware, which would let me train for far more epochs and use several layers.

In [102]:
model = create_model(vocab_size, embed_dim, rnn_neurons, batch_size=1)
model.load_weights('NLP_songs_fixed.h5')
model.build(tf.TensorShape([1, None]))

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (1, None, 64)             4864      
_________________________________________________________________
gru_10 (GRU)                 (1, None, 1024)           3348480   
_________________________________________________________________
dense_10 (Dense)             (1, None, 76)             77900     
Total params: 3,431,244
Trainable params: 3,431,244
Non-trainable params: 0
_________________________________________________________________


In [104]:
terms = ['yo ', 'shawty', 'hommie', 'rachet']

for item in terms:
    
    print(f"The start seed is '{item}'\n")
    print(generate_text(model, 
                    start_seed=item,
                   gen_size=500,
                   temp=1.0))
    print('\n\n')
    print('''
    ################################################
    ################################################
    ################################################
    ''')
    print('\n\n')

The start seed is 'yo '

yo Good but why shoulda say hah hah!  
(Don't get it cause there's right)  
We gettin' chedical endy or the 40's  
The Loudest place that we talkin' to myself  
I feel I'm bounce you know my money  
Muthafuckas went to H4M  
King color groupies on the pool conversation  
We back to the bomb, buddy body and  
Back like a splinter, want you to see if he continued the kid  
Hard to take the shit is where you're doing too much, but it best white,  
Soon tell the pretty get tolike stealing my face again




    ################################################
    ################################################
    ################################################
    



The start seed is 'shawty'

shawty  
Caugar to hit me out like rain  
Bitches he act like it's slow and ballers  
That's why I be the pain just to get a good for me je  
Everybody tryin' to hear the same ring the alley  
Shit go for forgiveness, so clean up hard as hell  
Watchinaters in the club th

### Model 2: sequence length reduced to 80 and embedding dimension (embed_dim) reduced to 50

In [105]:
model = create_model(vocab_size, 50, rnn_neurons, batch_size=1)

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (1, None, 50)             3800      
_________________________________________________________________
gru_11 (GRU)                 (1, None, 1024)           3305472   
_________________________________________________________________
dense_11 (Dense)             (1, None, 76)             77900     
Total params: 3,387,172
Trainable params: 3,387,172
Non-trainable params: 0
_________________________________________________________________


In [106]:
model.load_weights('NLP_songs_fixed2.h5')

In [107]:
model.build(tf.TensorShape([1, None]))

In [110]:
terms = ['yo ', 'shawty', 'hommie', 'rachet']

for item in terms:
    
    print(f"The start seed is '{item}'\n")
    print(generate_text(model, 
                    start_seed=item,
                   gen_size=500,
                   temp=1.0))
    print('\n\n')
    print('''
    ################################################
    ################################################
    ################################################
    ''')
    print('\n\n')

The start seed is 'yo '

yo people as to four I'll be my friend a bad, my money getcha  
I said, I'm gone off that pound of dank (that sticky, that icky) 
Richand trying to plenty to be the dirty Charast (se if you wanna get right  
Let's get right  
Let's get right to A bitch she step  
Now how the fuck would go for me  
Even though ya rut thout five to my legs  
What's your sugher fat lil shit that you see it?s all the fingers?  
Could be back for the future [?] Plue Dogg I just outld A-N, D, Robay Rover  
Bustin' of the




    ################################################
    ################################################
    ################################################
    



The start seed is 'shawty'

shawty got a 9Uhow  
I see my only one was gone, thou shoo shook up  
And all you need is shady thin swent, show her out  
  
Got plenty for a while far  for none shit  
She say she sweet budging they had a double really cared  
Mcs Deville,  
Cause that you've always bee

### Model 3: GRU units (rnn_neurons) increased to 2048

In [100]:
model = create_model(vocab_size, 50, 2048 , batch_size=1)
model.load_weights('NLP_songs_fixed3.h5')
model.build(tf.TensorShape([1, None]))

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (1, None, 50)             3800      
_________________________________________________________________
gru_9 (GRU)                  (1, None, 2048)           12902400  
_________________________________________________________________
dense_9 (Dense)              (1, None, 76)             155724    
Total params: 13,061,924
Trainable params: 13,061,924
Non-trainable params: 0
_________________________________________________________________


In [101]:
terms = ['yo ', 'shawty', 'hommie', 'rachet']

for item in terms:
    
    print(f"The start seed is '{item}'\n")
    print(generate_text(model, 
                    start_seed=item,
                   gen_size=800,
                   temp=1.0))
    print('\n\n')
    print('''
    ################################################
    ################################################
    ################################################
    ''')
    print('\n\n')

The start seed is 'yo '

yo  
Ain't no way on her ass roader than that part from hom and oh party again ax we just hold accord!  
Ya'll wanna be argelard your dudest eat  
'cause she mean not, murda-be ness (All my  is? You d 
Fuck them  
Get lifting  
I don't speak to me  
What you can't skept feel  
Cause your lyin' leave  
Come with my mental ou know it's a nine or the old my night  
Rock-and who that little libbatch the Couse and this sumier contact  
And I don't wanna fin'  
I want that I be sick  
But the same one of years, we felt russy literal shit, pull it up, rollin' it up  
  
Your nasty got a nigga  
But but I was roight to hold my stale  
You know what your striend and com  
Turn in Snips with thave,  
I said, I'm sincy  
No yatrilocause to drop tough with fat five hard  
Felons it was youngin'  
You too




    ################################################
    ################################################
    ################################################
    



