# TesseracT Lyric Generation

In [0]:
!pip install markovify

Collecting markovify
  Downloading https://files.pythonhosted.org/packages/de/c3/2e017f687e47e88eb9d8adf970527e2299fb566eba62112c2851ebb7ab93/markovify-0.8.0.tar.gz
Collecting unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Building wheels for collected packages: markovify
  Building wheel for markovify (setup.py) ... done
  Created wheel for markovify: filename=markovify-0.8.0-cp36-none-any.whl size=10694 sha256=897614651c65865920b2e8a41f316d14af129455a9575280d52aca4a6fd4ba92
  Stored in directory: /root/.cache/pip/wheels/5d/a8/92/35e2df870ff15a65657679dca105d190ec3c854a9f75435e40
Successfully built markovify
Installing collected packages: unidecode, markovify
Successfully installed markovify-0.8.0 unidecode-1.1.1


In [0]:
import numpy as np
import pandas as pd
from time import time
import re
import spacy
import markovify
import warnings
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
warnings.filterwarnings("ignore")
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


# Introduction

Inspired by the Text Generation checkpoint, I wanted to see what would happen if I fed a text generator all the lyrics from my favorite progressive metal band, TesseracT. I copied the lyrics into .txt files by hand, since there were too few songs to justify writing a scraper.

# Cleaning and exploration

Create a folder in this Colab named "tesseract" and put all the song lyric .txt files inside it.

In [0]:
DOC_PATTERN = r'.*\.txt'
corpus = PlaintextCorpusReader('/content/tesseract', DOC_PATTERN)

In [0]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

To decide how many lines a generated song should have, I treat the corpus's sentences as a stand-in for lines. Dividing the number of sentences by the number of documents in the corpus gives the average number of sentences per song.

In [0]:
len(corpus.sents()) / len(corpus.fileids())

7.0
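
The figure above counts sentences as found by NLTK's punkt tokenizer. Lyrics carry little punctuation, so one "sentence" may span several printed lines. As a rough cross-check, the sketch below counts non-empty text lines instead. The function name and the sample strings are made up for illustration and are not part of the notebook or the real lyrics.

```python
def average_lines_per_song(song_texts):
    """Return the mean number of non-empty lines per song."""
    counts = [
        sum(1 for line in text.splitlines() if line.strip())
        for text in song_texts
    ]
    return sum(counts) / len(counts)


# Placeholder songs, not real lyrics: 3 lines and 2 lines.
toy_songs = ["first line\nsecond line\n\nthird line", "only\ntwo lines"]
print(average_lines_per_song(toy_songs))
```

With the real corpus, the input could be the list built from `corpus.raw(fileid)` for each file.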

Join all the songs together as one long string for spaCy to use.

In [0]:
songs = [corpus.raw(fileid) for fileid in corpus.fileids()]

In [0]:
songs = " ".join(songs)

Next, estimate a character limit for each generated line. Dividing the total character count of all songs by the number of sentences gives the average sentence length, which I use later as the maximum length.

In [0]:
len(songs) / len(corpus.sents())

113.66917293233082
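
The ratio above is an average, even though it is used as a cap later on. To see how far the average sits from the longest line, a small sketch like the following reports both. The helper name and sample input are assumptions for illustration only.

```python
def line_length_stats(song_texts):
    """Return (mean length, max length) over all non-empty lines."""
    lengths = [
        len(line.strip())
        for text in song_texts
        for line in text.splitlines()
        if line.strip()
    ]
    return sum(lengths) / len(lengths), max(lengths)


# Placeholder input: line lengths are 4, 2 and 6 characters.
mean_len, max_len = line_length_stats(["abcd\nef", "ghijkl"])
print(mean_len, max_len)
```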

Clean the text before parsing it with spaCy, just in case.

In [0]:
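def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'. Replace it with a space.
    text = re.sub(r'--', ' ', text)
    # Drop anything enclosed in square brackets.
    text = re.sub(r"\[.*?\]", "", text)
    # Drop standalone numbers (integers and decimals).
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    # Collapse every run of whitespace, including newlines, into one space.
    text = ' '.join(text.split())
    return text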
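
To make the effect of each substitution concrete, here is a small worked example. It restates the cleaner so the block runs on its own, and the input string is invented for illustration (it is not a real lyric).

```python
import re


def text_cleaner(text):
    text = re.sub(r'--', ' ', text)                 # double dash -> space
    text = re.sub(r"\[.*?\]", "", text)             # drop [bracketed] text
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)  # drop numbers
    return ' '.join(text.split())                   # collapse whitespace


sample = "[Chorus] Run away -- from me 2 times"
print(text_cleaner(sample))  # prints: Run away from me times
```

Note that the final step also removes newlines, so the original line breaks of the lyrics do not survive cleaning.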

In [0]:
songs = text_cleaner(songs)

In [0]:
nlp = spacy.load('en')
songs = nlp(songs)

Fusing it all back together into one string so markovify can make use of it.

In [0]:
song_lines = " ".join([sent.text for sent in songs.sents if len(sent.text) > 1])

# Converting to Numerical Vectors

Since my task involves markov chains and neural network-based text generation, I will not benefit from converting my data into numerical vectors. So I will skip straight to the text generation.

# Generating Lyrics!

In [0]:
def tesseract_generator(generator, num_lines, state_size, character_length):
    markovifier = generator(song_lines, state_size = state_size)

    return "\n".join([markovifier.make_short_sentence(character_length) for i in range(num_lines)])

## State Size = 3

In [0]:
tesseract_generator(markovify.Text, 7, 3, 114)

None
None
Machinery dredge the sea All that's left is memory All the time they're suffering So when will it end?
None
None
Run away from me Don't you come near with those eyes I hate them , why do they lie?
None
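
The `None` entries appear because markovify's `make_short_sentence` returns `None` when it cannot build a sufficiently original sentence within its allotted tries. One way to cope, sketched below under the assumption that skipping failures is acceptable, is to keep sampling until enough lines exist or an attempt budget runs out. `generate_lines` and `max_attempts` are hypothetical names, not part of markovify.

```python
def generate_lines(make_line, num_lines, max_attempts=50):
    """Call make_line() until num_lines non-None results are collected.

    make_line is any zero-argument callable that returns a string or None.
    Returns fewer than num_lines lines if max_attempts is exhausted.
    """
    lines = []
    attempts = 0
    while len(lines) < num_lines and attempts < max_attempts:
        attempts += 1
        candidate = make_line()
        if candidate is not None:
            lines.append(candidate)
    return lines


# Stand-in sampler that fails every other call, for demonstration only.
results = iter([None, "line one", None, "line two"])
print("\n".join(generate_lines(lambda: next(results), 2)))
```

In the notebook this could be used as `generate_lines(lambda: markovifier.make_short_sentence(114), 7)`.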


## State Size = 2

In [0]:
tesseract_generator(markovify.Text, 7, 2, 114)

Disturbed, when I get the feeling I've been chasing shadows Change.
This structured raw submission, such a complex rage inside us all.
All the time they lie to me I'm not to reprimand I'm here to help you through Is nothing like it seems?
This structured raw submission, such a complex rage inside us all.
This is another one of his ways To control me I feel dead inside Disturbed; will I fall?
Will I disappear with a vision of her oh the feelings of pain And the vision of tomorrow Or will I fall?
You walk through the furrows deep I sense the strain No one seems to know I can't feel the light?


## State Size = 1

In [0]:
tesseract_generator(markovify.Text, 7, 1, 114)

I won't be loved in Watching over Talking in mystery It seems to us, teary eyed History hexes us all.
I'm full of pain And your waist lest you show?
Don’t you sleep at me this sequence, a word of it costs All the prisoner You walk through the back of tomorrow?
I so much we face a crevice in torn.
We cannot forgive me?
Take this world This structured raw submission, such a part of the world.
History hexes us I feel the peace?


So far the best state_size seems to be 2. A state size of 3 produces little that is original, mostly mashups of recognizable lines from existing songs (plus several failed attempts, shown as None), while 1 produces nonsense. Next, I'll see whether the POSifiedText class from the checkpoint improves things.
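
Why does a larger state size reproduce the source so closely? A toy word-level chain, written here with the standard library only (this is not markovify's implementation), shows the trend: the longer the state, the fewer states have more than one possible next word, so the chain has fewer places to diverge from the training text. The sample sentence is invented.

```python
from collections import defaultdict


def build_chain(words, state_size):
    """Map each tuple of state_size consecutive words to its followers."""
    chain = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        chain[state].append(words[i + state_size])
    return chain


def branching_states(chain):
    """Count states that are followed by more than one distinct word."""
    return sum(1 for followers in chain.values() if len(set(followers)) > 1)


# Placeholder text, not real lyrics.
words = "the sea is deep the sky is cold the sea was deep".split()
for n in (1, 2, 3):
    print(n, branching_states(build_chain(words, n)))
# prints:
# 1 3
# 2 1
# 3 0
```

With a corpus as small as a few dozen songs, a state size of 3 leaves very few branching points, which fits the copied lines seen above.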

# POSified Lyric Generation

In [0]:
class POSifiedText(markovify.Text):
    
    def word_split(self, sentence):
        return ["::".join((word.orth_, word.pos_)) for word in nlp(sentence)]

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

## State Size = 3

In [0]:
tesseract_generator(POSifiedText, 7, 3, 114)

This structured raw submission , such a complex rage inside us all .
I choose to never let go So take your time , prove your worth Do n't look .
Machinery dredge the sea All that 's left is memory All the time they 're suffering So when will it end ?
Disturbed , when I get the feeling I 've been chasing shadows Change .
None
This structured raw submission , such a complex rage inside us all .
This structured raw submission , such a complex rage inside us all .


## State Size = 2

In [0]:
tesseract_generator(POSifiedText, 7, 2, 114)

Crawling through the crowd Lost in the sun You radiate for me And it all comes to life right before your eyes .
You 're alive ; it 's too late ...
Do n’t you know I ca n't feel whole ?
Nascent , nascent , nascent , nascent , nascent , nascent .
Nascent , nascent , nascent , nascent , nascent , nascent , nascent .
Crawling through the wildest night Given to the sky .
How will I fall ?


## State Size = 1

In [0]:
tesseract_generator(POSifiedText, 7, 1, 114)

The feelings we all But the storm I 'm a lie And now while I 'm still feel the sudden urges for you ’re feeling ?
Do n't a child sleeping near his ways I know how you 'll be here before your hands and itinerant I long enough ?
None
Can we see Hopelessly I get the rules ...
Do n't you 'll soothe you believe that you fall ?
You trust me Do n't think Desperately opiate , such defiant menaces are born .
Can you ; when will be ?


A state size of 2 still seems best, and the POSified class gives better results than plain markovify.Text, just as in the checkpoint.
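
The POSified output splits contractions and punctuation into separate tokens ("Do n't", "it 's", "late ..."), because `word_join` puts a space between every spaCy token. A possible post-processing step, sketched here as an assumption rather than as part of the checkpoint code, glues those pieces back together with two regular expressions. The function name `detokenize` is hypothetical.

```python
import re


def detokenize(sentence):
    """Reattach contraction pieces and closing punctuation to the previous word."""
    # " n't", " 's", " 're", " 've", " 'll", " 'd", " 'm" (straight or curly apostrophe)
    sentence = re.sub(r"\s+(n['’]t|['’](?:s|re|ve|ll|d|m))\b", r"\1", sentence)
    # Remove the space before , . ; : ? !
    sentence = re.sub(r"\s+([,.;:?!])", r"\1", sentence)
    return sentence


print(detokenize("You 're alive ; it 's too late ..."))
# prints: You're alive; it's too late...
```

This only covers the common English contractions listed in the pattern; anything else is left untouched.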

This is a fun novelty, but it has its limits, and it doesn't play nicely with contractions. After reading about OpenAI's GPT-2, I wanted to see what its 345-million-parameter model would do with the lyrics. It was the easiest model to find a fine-tuning example for that takes the lyric data directly, so I will clone nshepperd's repo and follow his well-written instructions.

# GPT-2 345M Neural Network

In [0]:
!git clone https://github.com/nshepperd/gpt-2.git

Cloning into 'gpt-2'...
remote: Enumerating objects: 366, done.
remote: Total 366 (delta 0), reused 0 (delta 0), pack-reused 366
Receiving objects: 100% (366/366), 4.42 MiB | 15.71 MiB/s, done.
Resolving deltas: 100% (199/199), done.


In [0]:
cd gpt-2

/content/gpt-2


Install requirements

In [0]:
!pip3 install -r requirements.txt

Collecting fire>=0.1.3
  Downloading https://files.pythonhosted.org/packages/d9/69/faeaae8687f4de0f5973694d02e9d6c3eb827636a009157352d98de1129e/fire-0.2.1.tar.gz (76kB)
Collecting regex==2017.4.5
  Downloading https://files.pythonhosted.org/packages/36/62/c0c0d762ffd4ffaf39f372eb8561b8d491a11ace5a7884610424a8b40f95/regex-2017.04.05.tar.gz (601kB)
Collecting tqdm==4.31.1
  Downloading https://files.pythonhosted.org/packages/6c/4b/c38b5144cf167c4f52288517436ccafefe9dc01b8d1c190e18a6b154cd4a/tqdm-4.31.1-py2.py3-none-any.whl (48kB)
Collecting toposort==1.5
  Downloading https://files.pythonhosted.org/packages/e9/8a/321cd8ea5f4a22a06e3ba30ef31ec33bea11a3443eeb1d89807640ee6ed4/toposort-1.5-py2.py3-none-any.whl
Building wheels for collected packages: fire, regex
  Building wheel for fire (setu

Mount Google Drive so checkpoints can be saved now and loaded later. This requires logging in to your Google account.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


Download the model data

In [0]:
!python3 download_model.py 345M

Fetching checkpoint: 1.00kit [00:00, 1.19Mit/s]
Fetching encoder.json: 1.04Mit [00:00, 52.7Mit/s]
Fetching hparams.json: 1.00kit [00:00, 1.28Mit/s]
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:17, 82.0Mit/s]
Fetching model.ckpt.index: 11.0kit [00:00, 10.3Mit/s]
Fetching model.ckpt.meta: 927kit [00:00, 66.5Mit/s]
Fetching vocab.bpe: 457kit [00:00, 59.3Mit/s]


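Set the I/O encoding to UTF-8

In [0]:
# "!export" only affects a throwaway subshell; %env sets the variable for the notebook process and the commands it launches.
%env PYTHONIOENCODING=UTF-8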

Fetch checkpoints if you have them saved in Google Drive

In [0]:
!cp -r /content/drive/My\ Drive/checkpoint/ /content/gpt-2/


Start training. Add --model_name '345M' to use the 345M model.

*Riley's additional commentary:* Whatever text, lyrics, or poetry you want your samples to be inspired by, supply it as one long text file. I separated the songs with two newlines.
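
A minimal sketch of how such a file could be assembled, assuming the songs are already available as a list of strings. The helper name `write_training_file` is hypothetical, and the sample strings are placeholders rather than real lyrics.

```python
import os
import tempfile
from pathlib import Path


def write_training_file(song_texts, out_path):
    """Write all songs to one file, separated by a blank line.

    Returns the number of songs written (empty entries are skipped).
    """
    cleaned = [text.strip() for text in song_texts if text.strip()]
    Path(out_path).write_text("\n\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


# Demonstration in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "all_lines.txt")
    print(write_training_file(["song one\nline two\n", "song two"], target))
```

In the notebook, `out_path` would be the path passed to `--dataset`, and `song_texts` could be the per-file strings from `corpus.raw(fileid)`.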

In [0]:
!PYTHONPATH=src ./train.py --dataset /content/gpt-2/all_lines.txt --model_name '345M'


The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.




2020-01-19 01:42:58.064016: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2020-01-19 01:42:58.064226: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x19fcf40 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-01-19 01:42:58.064262: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-01-19 01:42:58.066326: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-19 01:42:58.224577: I tensorflow/stream_executor/cu

*Riley:* I decided to stop the training at 900 epochs, since the average loss had not changed for more than 100 epochs. Reading through the samples, the model picks up the lyrical structure of the songs I provided very well and produces some very interesting passages of prose. I recognize a few lines from the training data, but I will not pass judgment until I provide my own priming line and see whether similar behavior continues.
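
The stopping rule described here was applied by eye. If you wanted to automate it, one possible check (an assumption, not something train.py provides) compares the best average loss in the most recent window against the best one seen before it. The function name, the tolerance value, and the toy loss lists are illustrative only.

```python
def has_plateaued(avg_losses, window=100, tolerance=0.01):
    """Return True if the last `window` values improved by less than `tolerance`."""
    if len(avg_losses) <= window:
        return False  # not enough history to compare against
    earlier_best = min(avg_losses[:-window])
    recent_best = min(avg_losses[-window:])
    return earlier_best - recent_best < tolerance


# Toy histories with a short window so the behaviour is easy to follow.
print(has_plateaued([3.0, 2.0, 1.5, 1.5, 1.5, 1.5], window=3))  # prints: True
print(has_plateaued([3.0, 2.0, 1.5, 1.2, 1.0, 0.8], window=3))  # prints: False
```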

Save the checkpoints to start training again later

In [0]:
!cp -r /content/gpt-2/checkpoint/ /content/drive/My\ Drive/

Load the trained model for use in sampling below


In [0]:
!cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/345M/

To check flag descriptions, use:

In [0]:
!python3 src/interactive_conditional_samples.py -- --help

Generate conditional samples from the model given a prompt you provide. Change the top_k hyperparameter if desired (40 is used here).

In [0]:
!python3 src/interactive_conditional_samples.py --top_k 40 --model_name "345M"



2020-01-19 02:34:55.576823: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-19 02:34:55.646476: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-19 02:34:55.647107: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
2020-01-19 02:34:55.650985: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-01-19 02:34:55.660636: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-19 02:34:55.667520: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
202

Very interesting. I still recognize some notable lines, and even entire sections of songs appear. Perhaps the model has overfit, but it is still "creating" interesting segments of prose based on a word or concept. It would be interesting to experiment with the temperature and top_k hyperparameters to see how they affect the output.
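
For intuition about what those two knobs do, here is a small pure-Python sketch of top-k sampling with temperature. It is a simplified illustration, not the GPT-2 repo's code: `sample_top_k` and the toy logits are made up. top_k limits the choice to the k highest-scoring tokens, and temperature rescales the scores before they become probabilities (lower values sharpen the distribution, higher values flatten it).

```python
import math
import random


def sample_top_k(logits, top_k=40, temperature=1.0, rng=random):
    """Sample an index from the top_k highest logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Indices sorted from highest to lowest logit; keep the first top_k.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:top_k]
    scaled = [logits[i] / temperature for i in kept]
    # Softmax weights, shifted by the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, 0.5]
print(sample_top_k(toy_logits, top_k=1))  # prints: 1
```

With `top_k=1` the sampler always returns the single most likely token, which is the extreme case of the memorized output described above.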

Since the 345M model, OpenAI has released 774M and 1.5B models. It would be interesting to see how these larger models affect the generated text. I would also consider adding the lyrics of other bands in the same sub-genre, to broaden the structural style and to reduce how often exact lines and sections from the training data reappear.