In [3]:
import requests
import time

from bs4 import BeautifulSoup

**Scraping Song Lyrics**

There is no ready-made dataset for this, so we have to gather the training data ourselves.

We will scrape the lyrics from songlyrics.com.

In [4]:
response = requests.get("https://www.songlyrics.com/red-hot-chili-peppers-lyrics/")

In [5]:
soup = BeautifulSoup(response.content, "html.parser")

In [6]:
tables = soup.find_all("table")
table = tables[0]

In [7]:
lyrics = []

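for song in table.find_all("tr")[0:500]:
    link = song.find("a", href=True)
    if link is None:  # skip rows that carry no song link
        continue
    new_response = requests.get(link["href"])
    new_soup = BeautifulSoup(new_response.content, "html.parser")
    words = new_soup.find('p', class_='songLyricsV14 iComment-text')
    if words is None:  # skip pages that have no lyrics block
        continue
    lyrics.append(words.get_text())
    time.sleep(0.2)  # short pause between requests to go easy on the server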

In [8]:
print(lyrics[0]) # The first song in the RHCP directory - Otherside.

How long, how long will I slide?
Separate my side
I don't, I don't believe it's bad
Slit my throat, it's all I ever

I heard your voice through a photograph
I thought it up, it brought up the past
Once you know you can never go back
I've got to take it on the otherside

Centuries are what it meant to me
A cemetery where I marry the sea
Stranger things could never change my mind
I've got to take it on the otherside
Take it on the otherside
Take it on, take it on

How long, how long will I slide?
Separate my side
I don't, I don't believe it's bad
Slit my throat, it's all I ever

Pour my life into a paper cup
The ashtray's full and I'm spillin' my guts
She wants to know am I still a slut
I've got to take it on the otherside

Scarlet starlet and she's in my bed
A candidate for my soul mate bled
Push the trigger and pull the thread
I've got to take it on the otherside
Take it on the otherside
Take it on, take it on

How long, how long will I slide?
Separate my side
I don't, I don't believe 

Some links are broken or don't retrieve an actual song. The scraped text for those reads: "We do not have the lyrics".

In [10]:
lyrics = [lyric for lyric in lyrics if "We do not have the lyrics" not in lyric]
len(lyrics) # Left with 441 songs.

441

In [11]:
import pickle
with open("lyrics.pkl", "wb") as f:  # the with-block closes the file for us
    pickle.dump(lyrics, f)

**Unigram Markov Chain Model**

Practice on "Otherside"

In [12]:
sample = lyrics[0]
sample = sample.replace('\n\n', ' <N> ')
sample = sample.replace('\n', ' <N> ')
sample = sample.replace(',', '')
sample = sample.replace('?', '')
sample = sample.split()
chain = {"<START>": []}
chain["<START>"].append(sample[0])
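
To make the target data structure concrete, here is a minimal sketch on a made-up two-line lyric. The toy string and the helper name `build_unigram_chain` are illustrative assumptions, not part of the notebook. It uses the same newline-to-`<N>` replacement as the cell above and records, for each word, every word that follows it.

```python
def build_unigram_chain(text):
    """Map each token to the list of tokens that follow it (hypothetical helper)."""
    # Treat line breaks as their own token, as in the practice cell.
    # Assumes `text` contains at least one word.
    tokens = text.replace("\n", " <N> ").split()
    chain = {"<START>": [tokens[0]]}
    for current_word, next_word in zip(tokens, tokens[1:]):
        chain.setdefault(current_word, []).append(next_word)
    # Mark the final token as a possible end of song.
    chain.setdefault(tokens[-1], []).append("<END>")
    return chain


toy_chain = build_unigram_chain("take it on\ntake it away")
print(toy_chain["it"])    # the two words seen after "it"
print(toy_chain["take"])  # "it" appears twice, so it is twice as likely to be drawn
```

Keeping duplicates in the lists means `random.choice` later samples each follower in proportion to how often it was seen, without storing explicit probabilities.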

In [15]:
def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: a list of strings, where each string represents
                the lyrics of one song by an artist.

    Returns:
      A dict that maps a single word ("unigram") to a list of
      words that follow that word, representing the Markov
      chain trained on the lyrics.
    """
    chain = {"<START>": []}
    for lyric in lyrics:
        sample = lyric.replace('\n\n', ' <N> ')
        sample = sample.replace('\n', ' <N> ')
        sample = sample.replace(',', '')
        sample = sample.replace('?', '')
        words = sample.split()

        if not words:  # skip songs with no text so words[0] cannot fail
            continue

        chain["<START>"].append(words[0])

        for i in range(len(words) - 1):
            current_word = words[i]
            next_word = words[i + 1]

            if current_word not in chain:
                chain[current_word] = []

            chain[current_word].append(next_word)

        # mark the final word of the song as a possible ending
        last_word = words[-1]
        if last_word not in chain:
            chain[last_word] = []
        chain[last_word].append("<END>")

    return chain

In [16]:
import pickle
with open("lyrics.pkl", "rb") as f:  # Reload the pickled lyrics from earlier
    lyrics = pickle.load(f)

chain = train_markov_chain(lyrics) # Function call

print(chain["<START>"]) # The first word of every RHCP song; repeated entries are the most common openers

print(chain["<N>"][:20]) # These words tend to begin a new line

['How', 'Psychic', "Standin'", 'Sometimes', 'Psychic', "Gettin'", 'What', 'Psychic', "I've", 'Scar', "Here's", 'People', 'Standing', 'They', 'Blood', 'All', 'Can', 'I', 'What', 'This', 'I', 'Deep', 'Hustle', 'Yeah', 'There', 'I', 'Looks', 'My', 'Life', 'Say', 'Something', 'Waking', 'Forty', 'Bells', "Won't", 'A', 'Hu', 'All', 'I', 'I', "She's", 'Easily', 'Does', 'Can', 'What', 'Oh', 'I', 'Red', 'Readymade', 'Get', 'This', 'My', "She's", 'Because', 'All', 'Oh', 'Bear', 'My', 'One', 'This', 'I', 'Things', 'Cabron', 'Blood', 'Give', 'Time', 'I', 'They', 'To', 'Warlocks', 'Everything', 'Swing', 'Next', 'Porcelain', 'My', 'Shiver', 'Close', 'The', "Drivin'", "I've", 'You', 'Throw', 'I', 'Dusting', 'All', 'Some', "Standin'", 'You', 'Never', 'Never', 'Some', "Sittin'", 'Catholic', 'I', 'My', 'Sat', 'Yo', 'Lipstick', 'I', 'Red', 'The', 'Hair', "History's", 'I', 'It', 'I’m', 'I', 'There', 'Me', "I'm", 'Red', 'Rock', 'You', 'Flat', 'My', 'Red', 'If', 'L.A.', 'Look', 'Red', 'Death', 'I', 'I', 'St
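
Because `chain["<START>"]` holds one entry per song, the duplicates show which openers are most common. A small sketch of counting them with `collections.Counter` follows; the `start_words` list is just the first ten entries of the output above, copied in so the snippet runs on its own.

```python
from collections import Counter

# First ten entries of chain["<START>"] from the output above (a sample, not the full list).
start_words = ['How', 'Psychic', "Standin'", 'Sometimes', 'Psychic',
               "Gettin'", 'What', 'Psychic', "I've", 'Scar']

counts = Counter(start_words)
print(counts.most_common(3))  # most frequent opening words in this sample
```

On the real data the same idea would be `Counter(chain["<START>"]).most_common(10)`.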

In [18]:
import random

def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()

    Returns:
      A string representing the randomly generated song.
    """

    # a list for storing the generated words
    words = []
    # generate the first word
    words.append(random.choice(chain["<START>"]))

    while words[-1] != "<END>":
        current_word = words[-1]

        if current_word not in chain:
            break

        next_word = random.choice(chain[current_word])

        # append the next word to the list
        words.append(next_word)

    # join the words together into a string with line breaks
    lyrics = " ".join(words[:-1])
    return "\n".join(lyrics.split("<N>"))

In [20]:
print(generate_new_lyrics(chain))

Standing in time of Angels 
 Strike the Laysarium 
 Walkabout in a hole right 
 Upside down get me shine talk 
 One that cat to write it 
 Hop along I ever we did kid because you ass 
 If you really cares 
 Heavy glow 
 
 I can't hide behind your skin on tryin' till it away now 
 Sing and my life 
 Give it to 
 Bless your dream that the sun gets paid 
 Funny how to give it for long 
 Hollywood 
 PT-boat on the ground 
 I used to the one has certain someone... That's my hour 
 I came to strong for the pretty west the fire 
 But I love chug-a-lug me and the chin 
 Keystone cops they call my love now 
 A simple pain 
 And I know each other I missed ya 
 Others will give it wide 
 I'll make a sidewinder I'm falling into your kisses the sky 
 Yeah yeah 
 Can be too easily 
 'Cause my life is the nova is no 
 Aren't you 
 And don't hide behind your story 
 The always wonder if it's somethin' that is the microphone 
 And hold 
 Bob Marley taught me what it girl what you're above land 
 I won'

**Bigram Markov Chain Model**

Now, we are using the last two words to predict the next word.
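
As a rough illustration of what changes, the sketch below builds bigram states from a made-up two-line lyric. The toy string and the helper name `build_bigram_chain` are hypothetical, and the helper skips the special `(None, "<START>")` entry to stay short. Each key is now a pair of consecutive tokens instead of a single word.

```python
def build_bigram_chain(text):
    """Map each pair of consecutive tokens to the tokens that follow it (hypothetical helper)."""
    # Assumes `text` contains at least one word.
    tokens = ["<START>"] + text.replace("\n", " <N> ").split()
    chain = {}
    for first, second, following in zip(tokens, tokens[1:], tokens[2:]):
        chain.setdefault((first, second), []).append(following)
    # The last pair of tokens can end the song.
    chain.setdefault((tokens[-2], tokens[-1]), []).append("<END>")
    return chain


toy_bigrams = build_bigram_chain("take it on\ntake it away")
print(toy_bigrams[("take", "it")])   # both continuations seen after "take it"
print(toy_bigrams[("<N>", "take")])  # only one continuation seen at the start of line two
```

With two words of context, most states have far fewer recorded followers than a single word would, which is why the output tends to read more coherently but also stays closer to the source songs.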

In [21]:
def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: a list of strings, where each string represents
                the lyrics of one song by an artist.

    Returns:
      A dict that maps a tuple of 2 words ("bigram") to a list of
      words that follow that bigram, representing the Markov
      chain trained on the lyrics.
    """
    chain = {(None, "<START>"): []}
    for lyric in lyrics:
        sample = lyric.replace('\n\n', ' <N> ')
        sample = sample.replace('\n', ' <N> ')
        sample = sample.replace(',', '')
        sample = sample.replace('?', '')
        words = sample.split()
        words.insert(0, "<START>")

        # skip songs with no text (only the "<START>" marker is present)
        if len(words) < 2:
            continue

        chain[(None, "<START>")].append(words[1])

        for i in range(len(words) - 2):
            current_bigram = (words[i], words[i + 1])
            next_word = words[i + 2]

            if current_bigram not in chain:
                chain[current_bigram] = []

            chain[current_bigram].append(next_word)

        # mark the final bigram of the song as a possible ending
        last_bigram = (words[-2], words[-1])
        if last_bigram not in chain:
            chain[last_bigram] = []
        chain[last_bigram].append("<END>")

    return chain


In [23]:
import pickle
with open("lyrics.pkl", "rb") as f:
    lyrics = pickle.load(f)

chain = train_markov_chain(lyrics)

In [24]:
import random

def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()

    Returns:
      A string representing the randomly generated song.
    """

    # a list for storing the generated words
    words = []
    # generate the first word
    words.append(random.choice(chain[(None, "<START>")]))
    # generate the second word from the ("<START>", first word) state
    words.append(random.choice(chain[("<START>", words[-1])]))

    while words[-1] != "<END>":
        current_bigram = (words[-2], words[-1])
        if current_bigram not in chain:
            break
        # append the next word to the list
        words.append(random.choice(chain[current_bigram]))

    # drop the "<END>" marker so it does not appear in the song
    if words[-1] == "<END>":
        words.pop()

    # join the words together into a string with line breaks
    lyrics = " ".join(words)
    return "\n".join(lyrics.split("<N>"))

In [25]:
print(generate_new_lyrics(chain))

We're all a bunch of bad chicks 
 Well I really want to be found in the U.S.A 
 Let's go get lost 
 Let's make peace 
 The story of a man named A.C. Green 
 Slam so hard break your TV screen 
 Worthy's hot with his tomahawk 
 Take it down I tear it down the road 
 Can give so much to be 
 And showed me what do to you 
 Swim for your smile in a flash ray a mash of DNA 
 Another reason why I love it kickin' back and 
 I say oh 
 The body of water 
 What ever happened to humanity 
 What I've got tapes I've got you've got to give it away now 
 Give it away now 
 Give it away now 
 My love is my aeroplane 
 It's a hollywood jam 
 The California flower is poppy child 
 Drifting and floating and fading away 
 Finding what you're looking for 
 Look a golden day 
 Take a piece and pass it on come get to play it out loud for everyone to hear it 
 If you see me getting high 
 Knock on wood we all live in 
 We don't ask we demand 
 That we know each other better 
 Than Larry Holmes 
 Come again so

**Quick Takeaways**

Both models generated funny lyrics that were at least somewhat plausible. The bigram model's lyrics made more sense, with noticeably better sentence structure than the unigram model's. In the bigram model, my favorite line is "Oh ah kissed ya then I missed ya"; it sounds like it is straight out of a Red Hot Chili Peppers song. That is also the weakness of the bigram model: once it follows a path through a song that uses a word rarely found in any other song (such as "magik"), it becomes too reliant on that one song. The unigram model doesn't get locked into phrases the way the bigram model does with "She's magik sex magik sex magik". However, the unigram model often fails to make sense; examples include "He's a California" and "It's not my me".

In conclusion, the bigram model produces much better sentence structure because it conditions on the previous two words when choosing the third. Its weakness is that it gets locked into patterns and phrases from specific Red Hot Chili Peppers songs. The unigram model writes more original songs, but they don't always make much sense.
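
One way to put a number on the "locked in" effect is to check how many states offer only one possible next word. The sketch below is an illustration under stated assumptions: `fraction_forced` is a hypothetical metric, and `toy_chain` is a hand-written stand-in in the same dict format that `train_markov_chain` returns. The values for the real chains are not shown here.

```python
def fraction_forced(chain):
    """Share of states whose recorded followers are all the same word (hypothetical metric)."""
    forced = sum(1 for followers in chain.values() if len(set(followers)) == 1)
    return forced / len(chain)


# Hand-written toy chain, not taken from the scraped data.
toy_chain = {
    "she's": ["magik"],          # forced: only one follower ever seen
    "magik": ["sex", "sex"],     # forced: seen twice, but always the same word
    "sex": ["magik", "<END>"],   # free: two different followers
    "it": ["on", "away"],        # free: two different followers
}

print(fraction_forced(toy_chain))
```

Running the same function on the unigram and bigram chains would give a quick comparison: a higher share of forced states means the generator has fewer real choices and is more likely to replay a source song.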