# Song Lyrics Generator

In this lab, you will scrape a website to get lyrics of songs by your favorite artist. Then, you will train a model called a Markov chain on these lyrics so that you can generate a song in the style of your favorite artist.

# Question 1. Scraping Song Lyrics

Find a web site that has lyrics for several songs by your favorite artist. Scrape the lyrics into a Python list called `lyrics`, where each element of the list represents the lyrics of one song.

**Tips:**
- Find a web page that has links to all of the songs, like [this one](https://www.azlyrics.com/s/stevemillerband.html). [_Note:_ It appears that `azlyrics.com` blocks web scraping, so you may have to find a different lyrics web site.] Then, you can scrape this page, extract the hyperlinks, and issue new HTTP requests to each hyperlink to get each song. 
- Use `time.sleep()` to stagger your HTTP requests so that you do not get banned by the website for making too many requests.
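
The two tips above can be sketched without touching the network. This is a minimal illustration, not the lab's required approach. `LinkCollector`, `extract_links`, and `fetch_politely` are hypothetical names, and the link extraction uses the standard library's `html.parser` instead of BeautifulSoup so that the block has no third-party dependencies. `fetch` stands in for whatever function performs the HTTP request.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links such as "/song.html" into full URLs.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def fetch_politely(urls, fetch, delay=0.5):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)
        pages.append(fetch(url))
    return pages


# Toy index page (made up for illustration).
index_html = '<a href="/song-one.html">One</a> <a href="/song-two.html">Two</a>'
song_urls = extract_links(index_html, "http://example.com/artist/")
```

With `requests` installed, something like `fetch_politely(song_urls, lambda url: requests.get(url).text)` would download each page with a half-second pause between requests.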

In [None]:
import requests
import time

response = requests.get("http://www.songlyrics.com/lauv-lyrics/")

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

In [None]:
lyrics = []
tables = soup.find_all("table")
len(tables)

3

In [None]:
table = tables[0]

In [None]:
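for song in table.find_all("a", href=True):
  link = song["href"]

  # Fetch the song page, then pause so that the requests are staggered.
  response2 = requests.get(link)
  time.sleep(0.5)

  # The lyrics are in the element with id "songLyricsDiv";
  # skip any linked page that does not have one.
  soup2 = BeautifulSoup(response2.text, "html.parser")
  lyrics_div = soup2.find(id="songLyricsDiv")
  if lyrics_div is not None:
    lyrics.append(lyrics_div.text)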

In [None]:
# Print out the lyrics to the first song.
print(lyrics[0])

Remembering what you said, I'm laying alone in bed
Tryina wrap up this feeling, falling apart instead

'Cause baby we hit the top
Sweeter than sugar rocks
Holding onto a moment
Let go, I'm falling off

I just need a little of it,
Need a little not a lot to get into it. Oh
Keep calling me home

Can we go back to adrenaline?
Can we go back to adrenaline?
Girl you know that we've been settling, settling
But I need to feel, need to feel it
Adrenaline

Remembering all those nights
With your body straight to the sky
Work me until the morning
Fall asleep half past nine (am)

So crazy in love
Enough was never enough
Tell me our touch ain't dull now
Tell me our touch ain't dull now

And I don't know where the ceiling is anymore
No I don't know where the feeling is anymore


`pickle` is a Python library that serializes Python objects to disk so that you can load them in later.

In [None]:
import pickle

# Use a context manager so that the file is closed after writing.
with open("lyrics.pkl", "wb") as f:
    pickle.dump(lyrics, f)
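
A full save-and-load round trip can be sketched as follows. The data and the file name `sample_lyrics.pkl` are made up for illustration, and the file is written to a temporary directory so that an existing `lyrics.pkl` is left untouched.

```python
import os
import pickle
import tempfile

sample_lyrics = ["first song\nsecond line", "another song"]  # made-up data

# Write to a temporary directory so that no real file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "sample_lyrics.pkl")
with open(path, "wb") as f:
    pickle.dump(sample_lyrics, f)

# Loading the file gives back an equal Python object.
with open(path, "rb") as f:
    restored = pickle.load(f)
```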

# Question 2. Unigram Markov Chain Model

You will build a Markov chain for the artist whose lyrics you scraped in Question 1. Your model will process the lyrics and store the word transitions for that artist. The transitions will be stored in a dict called `chain`, which maps each word to a list of "next" words.

For example, if your song was ["The Joker" by the Steve Miller Band](https://www.youtube.com/watch?v=dV3AziKTBUo), `chain` might look as follows:

```
chain = {
    "some": ["people", "call", "people"],
    "call": ["me", "me", "me"],
    "the": ["space", "gangster", "pompitous", ...],
    "me": ["the", "the", "Maurice"],
    ...
}
```

Besides words, you should include a few additional states in your Markov chain. You should have `"<START>"` and `"<END>"` states so that we can keep track of how songs are likely to begin and end. You should also include a state called `"<N>"` to denote line breaks so that you can keep track of where lines begin and end. It is up to you whether you want to normalize case and strip punctuation.

So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
chain = {
    "<START>": ["Some", ...],
    "Some": ["people", ...],
    "people": ["call", ...],
    "call": ["me", ...],
    "me": ["the", ...],
    "the": ["space", ...],
    "space": ["cowboy,", ...],
    "cowboy,": ["yeah", ...],
    "yeah": ["<N>", ...],
    "<N>": ["Some", ..., "Come"],
    ...,
    "Come": ["on", ...],
    "on": ["baby", ...],
    "baby": ["and", ...],
    "and": ["I'll", ...],
    "I'll": ["show", ...],
    "show": ["you", ...],
    "you": ["a", ...],
    "a": ["good", ...],
    "good": ["time", ...],
    "time": ["<END>", ...],
}
```

Your chain will be trained not on just one song, but on all songs by your artist.
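
One possible way to turn a song into tagged tokens and record its transitions is sketched below. It is only an illustration under stated assumptions: `tokenize` and `add_song` are hypothetical helper names, case and punctuation are left untouched, and the two-line input is a toy example rather than real scraped lyrics.

```python
from collections import defaultdict


def tokenize(lyric):
    """Split one song into words, with <START>, <N>, and <END> tags added."""
    tokens = ["<START>"]
    for i, line in enumerate(lyric.strip().splitlines()):
        if i > 0:
            tokens.append("<N>")  # line break between consecutive lines
        tokens.extend(line.split())
    tokens.append("<END>")
    return tokens


def add_song(chain, lyric):
    """Record every (word -> next word) transition from one song."""
    tokens = tokenize(lyric)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)


toy_chain = defaultdict(list)
add_song(toy_chain, "Some people call me\nCome on baby")  # toy two-line input
```

Calling `add_song` once per song on the same `chain` accumulates transitions across all songs, which is what training on the whole catalogue amounts to.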

In [None]:
import re

def get_bigrams(list_words):
  # Pair each word with the word that follows it.
  return zip(list_words, list_words[1:])

In [None]:
def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: a list of strings, where each string represents
                the lyrics of one song by an artist.
    
    Returns:
      A dict that maps a single word ("unigram") to a list of
      words that follow that word, representing the Markov
      chain trained on the lyrics.
    """
    chain = {"<START>": []}
    for lyric in lyrics:
        list_words = []
        words = re.sub(r'[^\w\s]', '', lyric).replace('\n',' <N> ').split(" ")

        # make everything lowercase
        for word in words:
          if word == "<N>":
            list_words.append(word)
          elif word != "":
            list_words.append(word.lower())
        # skip songs that contain no usable words
        if not list_words:
          continue
        chain["<START>"].append(list_words[0])

        # mark the end of the song without discarding transitions
        # that were already recorded for the final word
        chain.setdefault(list_words[-1], []).append("<END>")
        for first, second in get_bigrams(list_words):
          chain.setdefault(first, []).append(second)
    return chain

In [None]:
# Load the pickled lyrics object that you created in Question 1.
import pickle
lyrics = pickle.load(open("lyrics.pkl", "rb"))

# Call the function you wrote above.
chain = train_markov_chain(lyrics)

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(chain["<START>"])

# What words tend to begin a line (i.e., what words follow the line break tag?)
print(chain["<N>"][:20])

['remembering', 'like', 'i', 'smile', 'didnt', 'i', 'we', 'stick', 'i', 'i', 'i', 'i', 'like', 'its', 'we', 'running', 'running', 'doubt', 'to', 'stick', 'didnt', 'wait', 'didnt', 'like', 'ive', 'like', 'like', 'to', 'you', 'all', 'you', 'didnt']
['tryina', '\r', 'cause', 'sweeter', 'holding', 'let', '\r', 'i', 'need', 'keep', '\r', 'can', 'can', 'girl', 'but', 'adrenaline', '\r', 'remembering', 'with', 'work']


Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `"<START>"` state and randomly sample a word from the list of words that follow `"<START>"`. Then, at each step, we'll randomly sample the next word from the list of words that followed the current word. We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!

You may find the `random.choice()` function helpful for this question.
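
Because the lists in `chain` keep duplicates, `random.choice()` picks each follower in proportion to how often it appeared. The sketch below illustrates this with the followers of `"me"` from the example chain above; the seed is only there so that the sketch is repeatable.

```python
import random
from collections import Counter

# Words seen after "me" in the example chain.
followers = ["the", "the", "Maurice"]

rng = random.Random(0)  # seeded only so that the sketch is repeatable
samples = Counter(rng.choice(followers) for _ in range(3000))

# "the" appears twice in the list, so it should be drawn about twice as often.
print(samples)
```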

In [None]:
import random

def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()
    
    Returns:
      A string representing the randomly generated song.
    """
    
    # a list for storing the generated words
    words = []
    # generate the first word
    word = random.choice(chain["<START>"])
    
    while word != "<END>":
      words.append(word)
      word = random.choice(chain[word])
    
    # join the words together into a string with line breaks
    # ("<END>" is never added to words, so nothing needs to be trimmed)
    lyrics = " ".join(words)
    return "\n".join(lyrics.split("<N>"))

In [None]:
print(generate_new_lyrics(chain))

wait in the night 
 you 
 in the hours talking 
 what night 
 id rather lay let go that you 
 i can reforget 
 i can reforget 
 and set them to not like dancing when i stay awhile stay 
 i know who wrote the book on fire 
 two more im feelin guilty 
 
 wonder who i like me happy 
 no one knows knows knows no other way 
 work me better when im so far 
 do anything to say that no good way yeah 
 and set the book on mine 
 cause anywhere we have this bed next to say 
 put my all that 
 when theres nothing quite wrong but it no other on fire 
 and maybe im with wine on my coffee more hilarious misheard lyrics about to not to be watching i am running from the book on a wrap up the ceiling 
 
 and


# Question 3. Bigram Markov Chain Model

Now you'll build a more complex Markov chain that uses the last _two_ words (or bigram) to predict the next word. Your dict `chain` should now map a _tuple_ of words to a list of words that appear after it.

As before, you should also include tags that indicate the beginning and end of a song, as well as line breaks. That is, a tuple might contain tags like `"<START>"`, `"<END>"`, and `"<N>"`, in addition to regular words. So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
chain = {
    (None, "<START>"): ["Some", ...],
    ("<START>", "Some"): ["people", ...],
    ("Some", "people"): ["call", ...],
    ("people", "call"): ["me", ...],
    ("call", "me"): ["the", ...],
    ("me", "the"): ["space", ...],
    ("the", "space"): ["cowboy,", ...],
    ("space", "cowboy,"): ["yeah", ...],
    ("cowboy,", "yeah"): ["<N>", ...],
    ("yeah", "<N>"): ["Some", ...],
    ("time", "<N>"): ["Come"],
    ...,
    ("<N>", "Come"): ["on", ...],
    ("Come", "on"): ["baby", ...],
    ("on", "baby"): ["and", ...],
    ("baby", "and"): ["I'll", ...],
    ("and", "I'll"): ["show", ...],
    ("I'll", "show"): ["you", ...],
    ("show", "you"): ["a", ...],
    ("you", "a"): ["good", ...],
    ("a", "good"): ["time", ...],
    ("good", "time"): ["<END>", ...],
}
```
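
The tuple keys above can be produced by sliding a three-token window over a song. The sketch below assumes the song has already been split into tokens with the tags included; `bigram_chain` and the short token list are illustrative only.

```python
from collections import defaultdict

# One toy song, already split into tokens (tags included).
tokens = [None, "<START>", "Some", "people", "call", "me", "<END>"]

bigram_chain = defaultdict(list)
# Slide a window of three tokens: the first two form the key,
# and the third is the word that follows that bigram.
for first, second, following in zip(tokens, tokens[1:], tokens[2:]):
    bigram_chain[(first, second)].append(following)
```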

In [None]:
def train_markov_chain(lyrics):
    """
    Args:
      - lyrics: a list of strings, where each string represents
                the lyrics of one song by an artist.
    
    Returns:
      A dict that maps a tuple of 2 words ("bigram") to a list of
      words that follow that bigram, representing the Markov
      chain trained on the lyrics.
    """
    chain = {(None, "<START>"): []}
    for lyric in lyrics:
        lines = lyric.replace('\n', ' <N> ').split()
        bigram = (None, "<START>")
        for word in lines:
            chain[bigram].append(word)
            bigram = (bigram[1], word)
            if bigram not in chain:
                chain[bigram] = []
        chain[bigram].append("<END>")
    return chain

In [None]:
# Load the pickled lyrics object that you created in Question 1.
import pickle
lyrics = pickle.load(open("lyrics.pkl", "rb"))

# Call the function you wrote above.
chain = train_markov_chain(lyrics)

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(chain[(None, "<START>")])

['Remembering', 'Like', 'I', 'Smile', "Didn't", 'I', 'We', 'Stick,', 'I', 'I', 'I', 'I', 'Like', "It's", 'We', 'Running', 'Running', 'Doubt,', 'To', 'Stick,', "Didn't", 'Wait', "Didn't", 'Like', "I've", 'Like', 'Like', 'To', 'You', 'All', 'You', "Didn't"]


Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `(None, "<START>")` state and randomly sample a word from the list of words that follow this bigram. Then, at each step, we'll randomly sample the next word from the list of words that followed the current bigram (i.e., the last two words). We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!
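
The sampling loop described above can be sketched on a tiny hand-written chain. This is one possible shape, not the required solution: `generate` is a hypothetical name, and the `max_words` cap is an added assumption that stops generation if `"<END>"` is not sampled for a long time.

```python
import random


def generate(chain, rng=random, max_words=500):
    """Walk the bigram chain from (None, "<START>") until "<END>" is sampled."""
    state = (None, "<START>")
    words = []
    while len(words) < max_words:
        word = rng.choice(chain[state])
        if word == "<END>":
            break
        words.append(word)
        # The new state is the last two tokens seen.
        state = (state[1], word)
    # Turn "<N>" tags into line breaks and trim the spaces around them.
    lines = " ".join(words).split("<N>")
    return "\n".join(line.strip() for line in lines)


# Hand-written chain in which every state has exactly one successor.
toy_chain = {
    (None, "<START>"): ["hello"],
    ("<START>", "hello"): ["<N>"],
    ("hello", "<N>"): ["world"],
    ("<N>", "world"): ["<END>"],
}
song = generate(toy_chain)
```

Since every state in the toy chain has a single successor, the result here is always the two lines `hello` and `world`; a chain trained on real lyrics would give different output on each call.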

In [None]:
import random

def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()
    
    Returns:
      A string representing the randomly generated song.
    """
    
    # a list for storing the generated words
    words = []
    # generate the first word
    words.append(random.choice(chain[(None, "<START>")]))
    bigram = ("<START>", words[0])
    word = random.choice(chain[bigram])

    while word != "<END>":
      words.append(word)
      bigram = (words[-2], words[-1])
      word = random.choice(chain[bigram])
      
    # join the words together into a string with line breaks
    # ("<END>" is never added to words, so nothing needs to be trimmed)
    lyrics = " ".join(words)
    return "\n".join(lyrics.split("<N>"))

In [None]:
print(generate_new_lyrics(chain))

I wanna feel you 
 
 I'd have it no other way 
 No one knows (knows, oh, oh) 
 No one knows, knows, knows. 
 Promoted Content 
 This Game Can Train Your Brain to Think Strategically (No Joke!) 
 featured video 
 What's That Line? 
 We all know 
 That I thought we said we're good 
 Yeah there's no good 
 Yeah yeah yeah 
 
 Lost in the rain 
 I could be with someone, making me feel so far? 
 
 Ooh why do we 
 We fell from the first time, I'd stay for a long time cause 
 I let you go, but baby, I'm gonna wear it 
 I was driving you home in the rain, Paris in the rain 
 We have to be, enemies, enemies? 
 
 And I don't know how they're still in their frames 
 There's never been a way to make this easy 
 When there's nothing quite wrong but it don't feel right 
 Either your head or your heart, you set the other on fire) 
 Either your head or your heart, you set the other on fire 
 No one knows, knows, knows. No one knows, knows, knows. No one knows (knows, oh, oh) 
 
 Two more footsteps on t

# Question 4. Analysis

Compare the quality of the lyrics generated by the unigram model (in Question 2) and the bigram model (in Question 3). Which model seems to generate more reasonable lyrics? Can you explain why? What do you see as the advantages and disadvantages of each model?

**The unigram model chooses each word based only on the single word before it, so its output is plausible from one word to the next but rarely holds together across a whole line. The bigram model seems to generate more reasonable lyrics because it conditions on the previous two words, so the chance that consecutive words make sense together is higher. An advantage of the unigram model is that each word usually has many recorded successors, so it produces more varied lines even from a small set of songs. A disadvantage of the unigram model is that it provides very little context. An advantage of the bigram model is that the extra context yields more coherent phrases. A disadvantage of the bigram model is that it needs more data: with only a few songs, many word pairs have a single recorded successor, so the model tends to reproduce lines from the original lyrics.**
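
One way to probe the data-sparsity point is to measure how many states leave the sampler no real choice. The sketch below is an illustrative diagnostic, not part of the lab: `fraction_single_successor` is a hypothetical name, and the chain is a tiny made-up example. It works on either kind of chain, since it only looks at the lists of successors.

```python
def fraction_single_successor(chain):
    """Fraction of states whose recorded successors are all the same word."""
    if not chain:
        return 0.0
    forced = sum(1 for followers in chain.values() if len(set(followers)) == 1)
    return forced / len(chain)


# Tiny made-up chain: "b" and "c" each have only one distinct successor.
toy = {"a": ["b", "c"], "b": ["a", "a"], "c": ["<END>"]}
share = fraction_single_successor(toy)
```

Running this on the unigram and bigram chains trained on the same lyrics would show how much more often the bigram model is forced to copy the training text.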

## Submission Instructions

- Restart this notebook and run the cells from beginning to end:
  - Go to Runtime > Restart and Run All.
- Download the notebook:
  - Go to File > Download > Download .ipynb.
- Submit your notebook file to the assignment on Canvas.
