Grant White, Johnson Merrell, Walker Hughes, Alex Shaw

# Introduction

Can we train a computer to write new church hymns? Our project seeks to answer this question by generating both lyrics and music. In this draft we focus on text generation.

Text generation has received a great deal of attention over the past decade and has seen notable success in recent years. Applying these methods to hymn lyrics is the new aspect we explore. Music generation has not seen the same level of success as text generation, but it remains an active field of research.

# Data Cleaning

We will use the lyrics from all the Hymns of the Church of Jesus Christ of Latter-day Saints as well as the instrumental tracks. Both of these are available at www.churchofjesuschrist.org. There are 341 hymns available in English.

The first stage is processing our data, which we do in the following five steps:

1. Scrape the data from the website.
1. Split each word onto its own line.
1. Remove verse numbers.
1. Replace sentence-ending punctuation and end of songs with stop words.
1. Remove all remaining punctuation.

The code for web scraping and data cleaning is given below.

In [None]:
import requests
from bs4 import BeautifulSoup

# initialize url
url = 'https://www.churchofjesuschrist.org'
hymn = '/study/manual/hymns/the-morning-breaks?lang=eng'

#open file
with open('lyrics.txt', 'a') as file:
    for i in range(341):
        # get html code
        page_source = requests.get(url + hymn).text
        soup = BeautifulSoup(page_source, "html.parser")
 
        # print hymn number
        print(soup.find(class_='title-number').string)

        # get text
        current = soup.find_all(class_="stanza")
        for verse in current:
            lines = verse.strings
            for line in lines:
                file.write(line + '\n')
            
        # update url
        if i < 340:
            hymn = soup.find(class_='traversalLink-1QVq2 nextLink-1V6GZ').a['href']

In [None]:
# Read in our file of lyrics (all songs from the hymn book)
filename = 'lyrics.txt'
with open(filename) as file:
    lyrics = file.read().split()

In [None]:
# Get rid of all the empty strings
not_empty = lambda x: len(x) != 0
lyrics = list(filter(not_empty, lyrics))

# Replace '1.' with end of song word
end_of_song = 'END'
lyrics = [lyric if lyric != '1.' else end_of_song for lyric in lyrics]
lyrics = lyrics[1:]
lyrics.append(end_of_song)

# Strip out all the verse numbers
not_verse_num = lambda x: not (x[0].isnumeric() and x.endswith('.'))
lyrics = list(filter(not_verse_num, lyrics))

In [None]:
# Replace sentence-ending punctuation (. ; ! ?) with stop words
period = 'PERIOD'
exclamation = 'EXCLAMATION'
question = 'QUESTION'
semi_colon = 'SEMI'

stop_lyrics = []

for lyric in lyrics:
    if lyric.endswith('.'):
        lyric = lyric.replace('.', '')
        stop_lyrics.append(lyric)
        stop_lyrics.append(period)
    elif lyric.endswith(';'):
        lyric = lyric.replace(';', '')
        stop_lyrics.append(lyric)
        stop_lyrics.append(semi_colon)
    elif lyric.endswith('!'):
        lyric = lyric.replace('!', '')
        stop_lyrics.append(lyric)
        stop_lyrics.append(exclamation)
    elif lyric.endswith('?'):
        lyric = lyric.replace('?', '')
        stop_lyrics.append(lyric)
        stop_lyrics.append(question)
    else:
        stop_lyrics.append(lyric)

lyrics = stop_lyrics

In [None]:
def process_lyric(lyric):
    """
    Gets rid of punctuation and non-latin characters
    """
    apostrophe = 'â\x80\x99'
    a_hat = 'â'
    lyric = lyric.replace(apostrophe, '')
    lyric = lyric.replace(a_hat, '') 
    lyric = ''.join([char for char in lyric if char.isalpha()])

    return lyric

In [None]:
# Get rid of punctuation and non-latin characters
lyrics = list(map(process_lyric, lyrics))

In [None]:
# Write the cleaned lyrics
filename = 'clean_lyrics.txt'
with open(filename, 'w') as file:
    for lyric in lyrics:
        lyric += '\n'
        file.write(lyric)
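
As a quick sanity check, cleaning steps 2 through 5 can also be sketched as a single function applied to one song held in memory. This is only an illustrative sketch, not the code we ran: `clean_song` and `STOP_WORDS` are hypothetical names, the function handles one song at a time instead of detecting song boundaries from `1.`, and the sample input is a toy string.

```python
import re

# Hypothetical marker table, chosen to match the stop words in the cells above
STOP_WORDS = {'.': 'PERIOD', ';': 'SEMI', '!': 'EXCLAMATION', '?': 'QUESTION'}
END_OF_SONG = 'END'


def clean_song(raw_text):
    """Turn the raw text of one song into a list of cleaned tokens."""
    tokens = []
    for word in raw_text.split():
        # Drop verse numbers such as '1.' or '12.'
        if re.fullmatch(r'\d+\.', word):
            continue
        # Remember sentence-ending punctuation before stripping it
        marker = STOP_WORDS.get(word[-1])
        # Keep letters only (removes commas, apostrophes, and so on)
        letters = ''.join(ch for ch in word if ch.isalpha())
        if letters:
            tokens.append(letters)
        if marker:
            tokens.append(marker)
    # Mark the end of the song
    tokens.append(END_OF_SONG)
    return tokens


sample = "1. The morning breaks, the shadows flee;\n2. Lo, Zion's standard is unfurled!"
print(clean_song(sample))
```

Keeping the punctuation as separate tokens lets a word-level model learn where sentences and songs end.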

# Methods

To produce a hymn we generate lyrics and music through distinct processes. For lyrics we use a Markov chain, transformer, and recurrent neural network (RNN). For music, we experiment with a traditional RNN and generative adversarial network (GAN).

## Lyrics
A Markov chain is a natural first model for generating random text that resembles a given corpus. Markov chains are simple to implement, quick to train on small corpora, and easy to explain to others. These benefits come at a cost: because our model is _Markov_, it considers only the current word (lyric) when predicting the next one, so it cannot capture longer-range structure. We implemented our own model using a dictionary in which each key is a word and each value is the list of words, repeats included, that follow that key in the corpus. We include this code below:

```
import numpy as np

filename = 'clean_lyrics.txt'
with open(filename) as file:
    lyrics = file.read().split()

# Map each stop token back to the punctuation it replaced
stop_words = {
    'PERIOD': '.',
    'EXCLAMATION': '!',
    'QUESTION': '?',
    'SEMI': ';',
    'END': '',
}

# Build the transition dictionary from consecutive word pairs
pairs = zip(lyrics, lyrics[1:])
word_dict = {}
for word1, word2 in pairs:
    if word1 in word_dict:
        word_dict[word1].append(word2)
    else:
        word_dict[word1] = [word2]

# Find all potential start words (capitalized and not a stop token)
start_words = [word for word in lyrics if (word[0].upper() == word[0]) and word not in stop_words]

def generate_text(start_word):
    """Generates text until the end-of-song token is drawn."""
    sentence = [start_word]
    word = start_word
    while True:
        # Draw the next word from the words that followed this one in the corpus
        word = np.random.choice(word_dict[word])
        if word in stop_words:
            sentence.append(stop_words[word])
            if word == 'END':
                break
        else:
            sentence.append(' ' + word)

    return ''.join(sentence)

# Pick a random start word and generate a song
start_word = np.random.choice(start_words)
generate_text(start_word)
```
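
To make the role of the repeated entries concrete, here is a minimal sketch on a toy corpus. The corpus and the names `build_chain` and `rng` are hypothetical and are not part of our data or code. Because every observed follower is stored, a uniform draw from the list reproduces the empirical transition frequencies.

```python
import random
from collections import defaultdict


def build_chain(tokens):
    """Map each token to the list of tokens that follow it (repeats kept)."""
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return dict(chain)


# Toy corpus (hypothetical, not taken from the hymn data)
tokens = "praise the Lord PERIOD praise the King PERIOD praise him".split()
chain = build_chain(tokens)

print(chain['praise'])  # ['the', 'the', 'him']
print(chain['the'])     # ['Lord', 'King']

# A uniform draw from the list picks 'the' after 'praise' about 2/3 of the time
rng = random.Random(0)
draws = [rng.choice(chain['praise']) for _ in range(3000)]
print(draws.count('the') / len(draws))
```

Storing repeats costs memory but avoids keeping explicit counts or probabilities.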

We also use `markovify`, a package built for constructing Markov chains for text generation. It makes it easy to build and use Markov chains trained on small and medium-sized corpora.

```
import markovify

with open("./cleaned_lyrics.txt") as f:
    text = f.read()

text_model = markovify.Text(text)
    
print(text_model.make_short_sentence(100))
```

Note that the `.make_short_sentence(N)` method generates a sentence of at most `N` characters.


For our recurrent neural network (RNN) we used the char-rnn originally developed by Andrej Karpathy. This model is trained to predict the next character given the characters that precede it, so words and phrases emerge from character-level predictions. Once trained, the RNN can be sampled to generate new sequences of text that are similar to the text it was trained on.
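
As an illustration of character-level training data, the sketch below encodes a chunk of text as integer indices and pairs each character with the one that follows it. It is a minimal sketch under stated assumptions: `make_training_pair`, `chunk`, and `vocab` are hypothetical names, the vocabulary is built from the chunk itself, and plain lists stand in for the tensors a real model would use.

```python
def make_training_pair(text, vocab):
    """Encode text as indices; the target is the input shifted by one."""
    char_to_index = {ch: i for i, ch in enumerate(vocab)}
    indices = [char_to_index[ch] for ch in text]
    # Input is every character but the last; target is every character but the first
    return indices[:-1], indices[1:]


chunk = "praise him"        # hypothetical training chunk
vocab = sorted(set(chunk))  # assumption: vocabulary built from the data
inp, target = make_training_pair(chunk, vocab)

# Show the first few (input character -> target character) pairs
for i, t in zip(inp[:3], target[:3]):
    print(repr(vocab[i]), '->', repr(vocab[t]))
```

At each position the model sees one input character and is scored on how well it predicts the next one.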

RNNs are well suited to time-series and other sequential data, which fits our purposes because each lyric depends on the lyrics that precede it. We trained this model on the lyrics from the LDS Hymnbook and then sampled text from it to generate new lyrics.

```
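def train(inp, target):
    """Run one training step on a chunk and return the mean per-character loss.

    Relies on `decoder`, `criterion`, and `decoder_optimizer` defined elsewhere.
    """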
    hidden = decoder.init_hidden()
    decoder.zero_grad()
    loss = 0

    # iterate through input characters and target characters 
    for input_char, target_char in zip(inp, target):
        target_char = target_char.unsqueeze(0) # correct the size 
        out_char, hidden = decoder(input_char, hidden) # decode 
        out_char = out_char.squeeze(0) # reshape 
        loss += criterion(out_char, target_char) # update loss 

    loss.backward()
    decoder_optimizer.step()
    # return the loss 
    return loss.item() / len(inp) 
```
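
After training, text is generated by sampling one character at a time from the model's output scores. A common way to do this is temperature sampling, sketched below in plain Python. This is an assumption about the sampling step, not code taken from our model: `sample_index`, the scores, and the three-character vocabulary are all hypothetical.

```python
import math
import random


def sample_index(scores, temperature=0.8, rng=random):
    """Sample an index from softmax(scores / temperature)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


# Hypothetical output scores over a three-character vocabulary
vocab = ['a', 'b', 'c']
scores = [2.0, 1.0, 0.1]
rng = random.Random(0)
print(''.join(vocab[sample_index(scores, 0.8, rng)] for _ in range(20)))
```

Lower temperatures make the output more predictable, while higher temperatures make it more varied and more error-prone.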


## Music

# Results
We achieve mixed results from our different approaches.

## Lyrics
Our own implementation of a Markov model worked surprisingly well. Below are 10 sentences it generated.

```
Which with lively praise Our souls with my toilsome way.
For joys and pure in our shield Is far abroad To thee.
Half its firmrooted bulwarks outstand the angels say Alleluia!
Oh wilt thou art so pleasant hours defend me? 
Ever guard us to bestow. 
Now Ill sing to his holy name be Nearer nearer That which he suffered grief is before him home.
Guide me an endless day. 
Go search more will be our minds with love at my breast. 
Rise ye heavnly fire and wear; Come cast the gospels sound. 
His precious grain and sing Alleluia!
```

The `markovify` implementation produced qualitatively better output: its sentences were more cohesive than those from our own model. One likely contributor is that `markovify` conditions on the previous two words by default (`state_size=2`) rather than one. Below are 10 sentences it generated.

```
Jesus, our only joy be thou, As thou our glory now, And thru eternity.
In mem'ry of thy providence more?
Angels of glory unto thee.
Come unto Jesus; He'll surely hear you, Jew and Gentile greet the sound.
We love thy glad tidings by sea and main.
Help us all to thy keep.
How vast is our mission, If we now at parting One more strain of praise.
I kneel upon the windy hill; Seeds that live and glory to his name!
Savior, may I repose And wake with praises to his name.
Glorious things of thee are led In rev'rence sweet.
```