# Vocabulary Reduction

After some initial attempts to build our model using our song lyrics, we discovered that we cannot build and train a network on my pathetic excuse for a PC with such a large vocabulary space. There are just too many possible words that our networks have to account for. This makes the model too large, because the output layer has one node for each possible word, which is ~100,000 at this point. Technically my PC can handle constructing a model this large, barely, but it does not have the memory required to evaluate training examples against the model. Thus, we can make the model, but it is too resource-intensive to train. I was able to reduce the size of some layers in the model to the point where I could train on short sequences, 1 sample at a time. However, it would take too long to train such a model, and we would have no guarantee the model would work with these sub-optimal parameters.
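
To get a feel for why the output layer dominates, here is a back-of-the-envelope estimate. It is only a sketch: the hidden size, batch size and sequence length below are made-up values rather than the settings of our actual model, and `output_layer_cost` is a helper written just for this illustration. It assumes 4-byte floats and counts only the final dense layer and the scores it produces.

```python
def output_layer_cost(vocab_size, hidden_size, batch_size, seq_len, bytes_per_value=4):
    """Rough size of a dense output layer with one node per vocabulary word."""
    # weights plus biases of the final dense layer
    n_params = hidden_size * vocab_size + vocab_size
    # one score per vocabulary word, per time step, per sample in the batch
    n_activations = batch_size * seq_len * vocab_size
    to_gb = bytes_per_value / 1e9
    return n_params, n_params * to_gb, n_activations * to_gb

# hidden_size, batch_size and seq_len are hypothetical values for illustration only
for vocab_size in (100_000, 10_000):
    n_params, weights_gb, activations_gb = output_layer_cost(
        vocab_size, hidden_size=256, batch_size=32, seq_len=300)
    print('Vocabulary: {: <7}  Output Params: {: <9}  Weights: {:.3f} GB  Output Scores: {:.2f} GB'
          .format(vocab_size, n_params, weights_gb, activations_gb))
```

Under these assumptions the scores produced for a single batch cost far more memory than the weights themselves, and both shrink in proportion to the vocabulary size. Gradients and optimizer state add further copies on top of this during training.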

As such, we need to come up with a solution. The two possible solutions I could think of are the following:

    1) Reduce vocabulary to more reasonable size
    2) Convert model from sequence labeling to sequence regression
        - Train word embedding (word2vec) layer separately
        - Train song model on embedded words and have output be vector of same shape
        - Pick "closest" word to model output using word embedding
        
Each of these potential solutions has pros and cons. For solution 1), the pro is that we don't have to change anything about our models. Everything stays the same; we are just reducing their size and computational effort by reducing the space of words we work on. The biggest con of this solution, however, is that it could significantly alter the integrity of our data. I think we could get a working model if the vocabulary size is reduced from ~100,000 to ~10,000. That would involve removing 90% of the possible words used in the songs, which could severely impact our songs, and therefore our model.

For solution 2), the pros and cons are exactly the opposite. The pro is that we don't have to alter our dataset of songs (and that it makes training the lyrics model orders of magnitude faster). However, the con is that we will be completely altering our models and will need to build and train an entirely separate word embedding model for vectorizing our vocabulary. Furthermore, we have no guarantee that converting our model from a categorical output to a continuous output vector will work. It makes sense that we should be able to train the network(s) to output a vector as close to each word label as possible, but I have no empirical evidence that this actually works. Finally, there is still the matter of training the word vectorizer. It has to be trained over the whole vocabulary, which will be almost as computationally intensive as training our original model.
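
To make the last step of solution 2) concrete, picking the "closest" word could look roughly like the sketch below. Nothing here is part of our pipeline yet: `closest_word`, the toy vocabulary and the 2-d embedding values are all made up for illustration, and cosine similarity is just one reasonable choice of distance (Euclidean distance would be another).

```python
import numpy as np

def closest_word(vector, embedding_matrix, int2text):
    """Return the word whose embedding has the highest cosine similarity to `vector`."""
    # dividing the dot products by both norms turns them into cosine similarities
    matrix_norms = np.linalg.norm(embedding_matrix, axis=1)
    vector_norm = np.linalg.norm(vector)
    similarities = embedding_matrix @ vector / (matrix_norms * vector_norm + 1e-12)
    return int2text[int(np.argmax(similarities))]

# toy 3-word vocabulary with hand-picked 2-d embeddings (hypothetical values)
toy_int2text = np.array(['love', 'baby', 'blues'])
toy_embeddings = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [-1.0, 0.0]])

# pretend this vector is what the song model produced for one time step
model_output = np.array([0.9, 0.2])
print(closest_word(model_output, toy_embeddings, toy_int2text))
```

Note that this lookup still compares the output against every word in the vocabulary, but it is a single matrix product at prediction time rather than a trainable layer.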

Given the pros and cons of each solution, we decided to approach solution 1) first. This is the simpler solution and if it were to fail, we could always apply the reduced vocabulary when approaching solution 2). In the sections below we attempt to reduce the number of distinct words in our song lyrics as much as possible, without compromising the integrity of the songs themselves.

## Reload Data

First let's load our pre-transformed song lyrics and convert them from their integer representation back to text. After doing so, let's also separate the artist lines from each song since we do not want to alter these at all.

In [1]:
import json
import numpy as np

# define function to load in datasets
def load_data(filename, load_mappings = False):
    print('Reloading Pre-Transformed Training and Validation Data from Filename "{}"'\
          .format(filename))
    
    # load data from file
    with open(filename, 'r') as f:
        data = json.load(f)
    
    # Build datasets from file contents
    training_dataset = [np.array(song) for song in data["training_dataset"]]
    validation_dataset = [np.array(song) for song in data["validation_dataset"]]
    
    # If load_mappings = True then build vectorizing and inverse mappings
    if load_mappings:
        int2text = np.array(data["unique_values"])
        text2int = {word:i for i, word in enumerate(int2text)}
        output = (training_dataset, validation_dataset, int2text, text2int)
    else:
        output = (training_dataset, validation_dataset)        
    
    # Print that we are done
    print('Loading Data Complete')
    
    return output

In [2]:
# load in training and validation datasets as well as mappings for vectorization
filename = 'all_song_lyrics_1.json'
training_dataset, validation_dataset, int2text, text2int = load_data(filename, load_mappings = True)

# combine training and validation datasets
# (the songs have different lengths, so we join the lists rather than stacking arrays)
combined_dataset = training_dataset + validation_dataset

# convert dataset back to text
dataset = [int2text[song] for song in combined_dataset]

Reloading Pre-Transformed Training and Validation Data from Filename "all_song_lyrics_1.json"
Loading Data Complete


In [20]:
# Separate artist from lyrics
artist_end = 'xxx011'
artist_dataset = []
lyrics_dataset = []
for song in dataset:
    artist_end_index = np.where(song == artist_end)[0][0]
    artist_dataset.append(song[:artist_end_index+1])
    lyrics_dataset.append(song[artist_end_index+1:])

artist_dataset

[array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<U106'),
 array(['xxx000', 'xxx010', 'abba', 'xxx011'], dtype='<

## Word Frequencies

Before we start removing words willy-nilly, let's get an idea of how frequently each distinct word in our vocabulary is used. This will help us determine which words can and should be removed, since removing infrequent words will have the smallest impact on our data. Furthermore, we can look for patterns in the infrequent words. Where a pattern exists, we may be able to fix its cause or develop a rule to handle it.

In [21]:
# Initialize dictionary for storing word frequencies
word_dict = {}

# Iterate over songs and count word frequencies
for song in dataset:
    for word in song:
        try:
            word_dict[word] += 1
        except KeyError:
            word_dict[word] = 1
            
# Sort dictionary just cuz
word_dict = {k: v for k, v in sorted(word_dict.items(), key=lambda item: item[1], reverse = True)}

# get total # of words and unique words
total_words = sum(word_dict.values())
total_unique_words = len(word_dict)

In [22]:
word_dict['____']

KeyError: '____'

In [4]:
# Print first few most common words in dataset
num_words = 20
counter = 0
for word in word_dict:
    print('Word: {: <10}  Frequency: {: <10}  % of Total: {:4.2f}%'.format(word, word_dict[word], word_dict[word]/total_words*100))
    counter += 1
    if counter >= num_words:
        break

Word: xxx110      Frequency: 1661692     % of Total: 11.11%
Word: xxx111      Frequency: 1661692     % of Total: 11.11%
Word: ,           Frequency: 453042      % of Total: 3.03%
Word: the         Frequency: 426719      % of Total: 2.85%
Word: i           Frequency: 355792      % of Total: 2.38%
Word: you         Frequency: 355053      % of Total: 2.37%
Word: to          Frequency: 250833      % of Total: 1.68%
Word: and         Frequency: 246573      % of Total: 1.65%
Word: a           Frequency: 215877      % of Total: 1.44%
Word: me          Frequency: 167686      % of Total: 1.12%
Word: .           Frequency: 148190      % of Total: 0.99%
Word: in          Frequency: 142672      % of Total: 0.95%
Word: my          Frequency: 141729      % of Total: 0.95%
Word: it          Frequency: 123429      % of Total: 0.83%
Word: of          Frequency: 121716      % of Total: 0.81%
Word: your        Frequency: 100927      % of Total: 0.67%
Word: that        Frequency: 95107       % of Total: 0

It just occurred to me that we probably want to keep track of the words used in our artist and song names, and probably not alter them, so let's create separate dictionaries for these.

In [5]:
# Initialize dictionaries
artist_word_dict = {}
title_word_dict = {}

# indicators for finding artist and song names
artist_start = 'xxx010'
artist_end = 'xxx011'
title_start = 'xxx100'
title_end = 'xxx101'

# Iterate over songs to grab the unique words used in each artist and song name
for song in dataset:
    artist_start_index = np.where(song == artist_start)[0][0]
    artist_end_index = np.where(song == artist_end)[0][0]
    title_start_index = np.where(song == title_start)[0][0]
    title_end_index = np.where(song == title_end)[0][0]
    for word in song[artist_start_index+1:artist_end_index]:
        try:
            artist_word_dict[word] += 1
        except KeyError:
            artist_word_dict[word] = 1
    for word in song[title_start_index+1:title_end_index]:
        try:
            title_word_dict[word] += 1
        except KeyError:
            title_word_dict[word] = 1

# Sort dictionaries just cuz
artist_word_dict = {k: v for k, v in sorted(artist_word_dict.items(), key=lambda item: item[1], reverse = True)}
title_word_dict = {k: v for k, v in sorted(title_word_dict.items(), key=lambda item: item[1], reverse = True)}

# get total # of words and unique words
total_artist_words = sum(artist_word_dict.values())
total_artist_unique_words = len(artist_word_dict)
total_title_words = sum(title_word_dict.values())
total_title_unique_words = len(title_word_dict)

In [6]:
# Print first few most common words in artist and song names
num_words = 10

print('Artist Word Examples')
counter = 0
for word in artist_word_dict:
    print('Word: {: <10}  Frequency: {: <10}  % of Total: {:4.2f}%'.format(word, artist_word_dict[word], artist_word_dict[word]/total_artist_words*100))
    counter += 1
    if counter >= num_words:
        break

print('\nTitle Word Examples')
counter = 0
for word in title_word_dict:
    print('Word: {: <10}  Frequency: {: <10}  % of Total: {:4.2f}%'.format(word, title_word_dict[word], title_word_dict[word]/total_title_words*100))
    counter += 1
    if counter >= num_words:
        break

Artist Word Examples
Word: .           Frequency: 1355        % of Total: 1.52%
Word: john        Frequency: 820         % of Total: 0.92%
Word: the         Frequency: 723         % of Total: 0.81%
Word: williams    Frequency: 569         % of Total: 0.64%
Word: michael     Frequency: 532         % of Total: 0.60%
Word: george      Frequency: 514         % of Total: 0.58%
Word: band        Frequency: 510         % of Total: 0.57%
Word: bob         Frequency: 434         % of Total: 0.49%
Word: kenny       Frequency: 412         % of Total: 0.46%
Word: boys        Frequency: 406         % of Total: 0.45%

Title Word Examples
Word: the         Frequency: 6756        % of Total: 4.49%
Word: you         Frequency: 3500        % of Total: 2.33%
Word: i           Frequency: 3095        % of Total: 2.06%
Word: of          Frequency: 2875        % of Total: 1.91%
Word: a           Frequency: 2726        % of Total: 1.81%
Word: love        Frequency: 2435        % of Total: 1.62%
Word: to      

Now that we know how frequent all of our words are, let's get an idea of how much our vocabulary shrinks if we remove the more infrequent words. To measure this, the function below counts the unique words that occur in our dataset at most a given number of times, along with the total number of occurrences they account for.

In [7]:
# define function for finding # of unique words that appear at most L times in dataset
def infrequent_measure(dictionary, L):
    number_of_words = 0
    number_of_unique_words = 0
    for word in dictionary:
        value = dictionary[word]
        if value <= L:
            number_of_unique_words += 1
            number_of_words += value
    
    return number_of_unique_words, number_of_words

In [8]:
# Determine % of words and dataset made up of these infrequent words
for frequency in range(1,31):
    number_of_unique_words, number_of_words = infrequent_measure(word_dict, frequency)
    print('Word Frequency: <={: <3}  , % of Words: {:4.2f}%  , % of Dataset: {:4.2f}%'.\
          format(frequency, number_of_unique_words/total_unique_words * 100, number_of_words/total_words * 100))

Word Frequency: <=1    , % of Words: 45.13%  , % of Dataset: 0.28%
Word Frequency: <=2    , % of Words: 58.14%  , % of Dataset: 0.44%
Word Frequency: <=3    , % of Words: 64.93%  , % of Dataset: 0.57%
Word Frequency: <=4    , % of Words: 69.43%  , % of Dataset: 0.68%
Word Frequency: <=5    , % of Words: 72.41%  , % of Dataset: 0.77%
Word Frequency: <=6    , % of Words: 74.89%  , % of Dataset: 0.86%
Word Frequency: <=7    , % of Words: 76.77%  , % of Dataset: 0.94%
Word Frequency: <=8    , % of Words: 78.34%  , % of Dataset: 1.02%
Word Frequency: <=9    , % of Words: 79.71%  , % of Dataset: 1.10%
Word Frequency: <=10   , % of Words: 80.86%  , % of Dataset: 1.17%
Word Frequency: <=11   , % of Words: 81.79%  , % of Dataset: 1.23%
Word Frequency: <=12   , % of Words: 82.75%  , % of Dataset: 1.30%
Word Frequency: <=13   , % of Words: 83.50%  , % of Dataset: 1.36%
Word Frequency: <=14   , % of Words: 84.17%  , % of Dataset: 1.42%
Word Frequency: <=15   , % of Words: 84.80%  , % of Dataset: 1

Above we see that most of the distinct words in our dataset are very infrequent. In fact, almost half of our vocabulary is only used once. Furthermore, we see that we could potentially remove most of these infrequent words without affecting too much of our overall dataset. For instance, we could reduce our vocabulary down to ~10,000 words by removing all words that appear 30 or fewer times in our dataset. It turns out that this only accounts for about 2% of the words in all the songs, so with an average song length of about 312 words, roughly 6 words would be changed per song.
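
Instead of reading the cutoff off the table, we could also compute it directly for a target vocabulary size. The helper below is a sketch, not something we have run on the real data: `frequency_cutoff` is a hypothetical name, the toy counts are invented, and it uses the same "appears `L` or fewer times" convention as `infrequent_measure`.

```python
def frequency_cutoff(word_counts, target_vocab_size):
    """Smallest L such that dropping every word used <= L times leaves
    at most target_vocab_size words. Returns (L, words kept, share of dataset dropped)."""
    counts = sorted(word_counts.values(), reverse=True)
    if len(counts) <= target_vocab_size:
        return 0, len(counts), 0.0
    # the most frequent word that does not fit must go, along with everything tied with it
    cutoff = counts[target_vocab_size]
    kept = [count for count in counts if count > cutoff]
    dropped_share = 1 - sum(kept) / sum(counts)
    return cutoff, len(kept), dropped_share

# toy counts for illustration only
toy_counts = {'the': 50, 'love': 20, 'baby': 20, "a-rollin'": 2, 'dieee': 1}
for target in (3, 2):
    cutoff, kept, dropped_share = frequency_cutoff(toy_counts, target)
    print('Target: {: <3}  Cutoff: <={: <3}  Words Kept: {: <3}  % of Dataset Dropped: {:4.2f}%'
          .format(target, cutoff, kept, dropped_share * 100))
```

Because tied words are dropped together, the resulting vocabulary can end up smaller than the target, as the second toy case shows.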

## Problematic Words

Above we showed that we could just remove infrequent words indiscriminately without too much of an impact on our dataset. However, let's try to approach the process a little more intelligently before we recklessly slash our vocabulary. The first thing we are going to try is finding certain types of nonsense words that show up in the vocabulary. As shown before, roughly half of the vocabulary is made up of single-occurrence words, and chances are a lot of these are just misspelled or entirely invalid. As such, let's see if we can find patterns in these cases, or even their causes, so we can remove them.

Below we simply print a random selection of these infrequent words to see if we can identify any patterns. Note that we repeated this routine many times to be as thorough as possible.

In [9]:
# shuffle the vocabulary so that the infrequent words we look at are a random selection
keys = list(word_dict.keys())
np.random.shuffle(keys)

# Choose word frequencies to select and how many random words to look at
max_frequency = 3
num_of_words = 1000

# Grab words
lookin_words = {}
counter = 0
for word in keys:
    frequency = word_dict[word]
    if frequency <= max_frequency:
        lookin_words[word] = frequency
        counter += 1
        if counter >= num_of_words:
            break

# print words
#lookin_words = {k: v for k, v in sorted(lookin_words.items(), key=lambda item: item[1], reverse = True)}
for word in sorted(lookin_words):
    print('Word: {: <20}  , Frequency: {}'.format(word, lookin_words[word]))

Word: '90                   , Frequency: 2
Word: 'couse                , Frequency: 1
Word: 'ez                   , Frequency: 1
Word: 'freedom'             , Frequency: 1
Word: 'hoes                 , Frequency: 3
Word: 'kong                 , Frequency: 1
Word: 'leven                , Frequency: 3
Word: 'original'            , Frequency: 1
Word: 'rocinante'           , Frequency: 1
Word: 'thunderbird'         , Frequency: 1
Word: 'years'               , Frequency: 1
Word: -----------           , Frequency: 1
Word: 008                   , Frequency: 1
Word: 2-step                , Frequency: 1
Word: 2123                  , Frequency: 1
Word: 2day                  , Frequency: 1
Word: 4-h                   , Frequency: 1
Word: 409                   , Frequency: 3
Word: 44s                   , Frequency: 1
Word: 5'2                   , Frequency: 2
Word: 7927                  , Frequency: 1
Word: 82                    , Frequency: 2
Word: a'ight                , Frequency: 2
Word: a-bea

After looking at a lot of words, below are the three most common causes of infrequent words I could see:

    1) Multiple words hyphenated together: Example - "dumb-di-dumb-di-dumb" and "co-starred"
    2) Words that start or end with apostrophe: Example - "'lectronic" and "glitterin'"
    3) Possessive words: Example - "behavior's" and "carol's"
    
We will handle each of these separately, which will hopefully reduce our vocabulary significantly and leave us with words that make more sense.
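
As a first pass, the three patterns above can be detected with a few string checks. This is a rough sketch: `word_pattern` is a hypothetical helper, each word only gets the first label that matches, and the possessive check will also catch contractions such as "it's", so its counts should be read as an upper bound.

```python
from collections import Counter

def word_pattern(word):
    """Label a word with the first of the three patterns it matches, else 'other'."""
    if '-' in word:
        return 'hyphenated'
    if word.startswith("'") or word.endswith("'"):
        return 'apostrophe'
    if word.endswith("'s"):
        return 'possessive'
    return 'other'

# the example words quoted above, plus one ordinary infrequent word
examples = ['dumb-di-dumb-di-dumb', 'co-starred', "'lectronic", "glitterin'",
            "behavior's", "carol's", 'pittle']
pattern_counts = Counter(word_pattern(word) for word in examples)
for pattern, count in pattern_counts.items():
    print('Pattern: {: <12}  Words: {}'.format(pattern, count))
```

Running the same tally over `word_dict` would show how much of the vocabulary each pattern is responsible for before we decide how to handle it.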

### Hyphenated Words

Firstly, let's address the large number of hyphenated words. Right off the bat, I noticed a large number of words that started with "a-". I'm guessing there is a song or two in the dataset where most or all of the words in the song are like this. Let's try and find them and remove them.

In [10]:
# initialize array for storing # of 'a-' words in each song
a_word_values = np.zeros((len(dataset),2))

# iterate over songs to find # of words that start with 'a-' and ratio of total words that start with 'a-'
for i in range(len(dataset)):
    song = dataset[i]
    n = len(song)
    a_words = 0
    for word in song:
        if word[:2] == 'a-':
            a_words += 1
    a_word_values[i,0] = a_words
    a_word_values[i,1] = a_words / n
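
With the counts and ratios stored per song, the songs that rely most heavily on "a-" words can be found by sorting on the ratio column. The sketch below uses a small made-up array in place of `a_word_values` so it runs on its own, and `top_songs_by_ratio` is a hypothetical helper name.

```python
import numpy as np

def top_songs_by_ratio(values, k=5):
    """Return the indices of the k songs with the largest ratio (column 1), highest first."""
    order = np.argsort(values[:, 1])[::-1]
    return order[:k]

# toy stand-in for a_word_values: column 0 = # of 'a-' words, column 1 = ratio of 'a-' words
toy_values = np.array([[0, 0.00],
                       [12, 0.40],
                       [3, 0.01],
                       [7, 0.15]])

for index in top_songs_by_ratio(toy_values, k=2):
    print('Song Index: {: <5}  a- Words: {: <5}  Ratio: {:4.2f}'
          .format(int(index), int(toy_values[index, 0]), toy_values[index, 1]))
```

Passing the real `a_word_values` instead of `toy_values` would give the song indices to inspect with `dataset[index]`.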

In [11]:
a_words = {}
for word in word_dict.keys():
    if word[:2] == 'a-':
        a_words[word] = word_dict[word]
        
a_words

{'a-gonna': 92,
 "a-rollin'": 39,
 'a-rolling': 34,
 'a-ha': 31,
 'a-coming': 30,
 'a-running': 26,
 'a-baby': 23,
 "a-runnin'": 22,
 'a-me': 20,
 "a-comin'": 19,
 "a-goin'": 19,
 'a-town': 18,
 "a-rockin'": 16,
 "a-walkin'": 15,
 'a-honey': 15,
 'a-riddle-i-day': 15,
 'a-way': 14,
 'a-amen': 14,
 'a-waiting': 13,
 "a-changin'": 13,
 'a-rocking': 12,
 "a-leavin'": 12,
 "a-livin'": 12,
 'a-calling': 11,
 'a-': 11,
 "a-lookin'": 11,
 "a-thinkin'": 10,
 'a-just': 9,
 "a-courtin'": 9,
 'a-yo': 9,
 'a-g-l-e-t': 9,
 "a-slidin'": 9,
 'a-u-t-o-matic': 9,
 "a-knockin'": 8,
 "a-waitin'": 8,
 "a-movin'": 8,
 'a-glow': 8,
 'a-ready': 8,
 "a-sittin'": 8,
 "a-gettin'": 8,
 'a-come': 8,
 'a-breaking': 8,
 'a-singing': 8,
 'a-courting': 8,
 'a-happening': 8,
 'a-a-and': 8,
 'a-knocking': 7,
 "a-blowin'": 7,
 "a-shakin'": 7,
 'a-t-l': 7,
 'a-reeling': 7,
 'a-laying': 7,
 'a-long': 7,
 'a-a-a-a-a-': 7,
 'a-ah': 7,
 "a-hidin'": 7,
 "a-turnin'": 6,
 'a-fighting': 6,
 'a-standing': 6,
 'a-scared': 6,
 'a-t

In [12]:
len(a_words)

612

In [13]:
dataset[3729]

array(['xxx000', 'xxx010', 'eddie', 'cochran', 'xxx011', 'xxx100',
       'goodbye', 'bye', 'bye', 'bye', 'xxx101', 'xxx110', 'goodbye',
       'bye', 'bye', 'bye', 'forever', 'xxx111', 'xxx110', 'goodbye',
       'bye', 'bye', 'bye', 'my', 'love', 'xxx111', 'xxx110', 'how',
       'can', 'you', 'say', 'you', 'still', 'love', 'me', 'xxx111',
       'xxx110', 'when', 'seen', 'in', 'the', 'arms', 'of', 'a', 'friend',
       'xxx111', 'xxx110', 'the', 'flame', 'that', 'burns', 'for', 'me',
       'is', 'dying', 'slowly', 'xxx111', 'xxx110', "it's", 'time', 'we',
       'both', 'wake', 'up', ',', "let's", 'not', 'pretend', 'xxx111',
       'xxx110', 'goodbye', 'bye', 'bye', 'bye', 'forever', 'xxx111',
       'xxx110', 'goodbye', 'bye', 'bye', 'bye', 'my', 'love', 'xxx111',
       'xxx110', "we're", 'fooling', 'ourselves', 'just', 'by',
       'thinking', 'xxx111', 'xxx110', "let's", 'not', 'pretend',
       'anymore', 'xxx111', 'xxx110', 'you', 'can', 'stop', 'writing',
       'me', 'lette

In [14]:
title_word_dict

{'the': 6756,
 'you': 3500,
 'i': 3095,
 'of': 2875,
 'a': 2726,
 'love': 2435,
 'to': 2329,
 'in': 2218,
 'me': 2188,
 'my': 1803,
 'on': 1329,
 'it': 1242,
 'and': 1171,
 'for': 1022,
 'is': 986,
 ',': 959,
 'your': 948,
 'all': 902,
 "don't": 899,
 '.': 866,
 'be': 843,
 'time': 670,
 'one': 657,
 'heart': 629,
 "i'm": 604,
 'man': 602,
 'down': 573,
 'with': 535,
 'up': 522,
 'no': 517,
 'song': 513,
 'go': 509,
 'this': 499,
 'like': 488,
 'that': 486,
 'little': 485,
 'if': 471,
 'do': 461,
 'night': 438,
 'world': 434,
 'out': 432,
 'day': 426,
 "it's": 418,
 'life': 415,
 'back': 409,
 'way': 407,
 'girl': 405,
 'baby': 404,
 'good': 394,
 'get': 392,
 'just': 390,
 "can't": 390,
 'let': 379,
 'we': 375,
 'never': 362,
 'what': 355,
 '?': 355,
 'away': 354,
 'come': 351,
 'got': 347,
 'know': 339,
 'so': 330,
 'are': 328,
 'home': 327,
 'not': 325,
 'from': 322,
 'want': 318,
 'again': 315,
 'blues': 310,
 'blue': 303,
 'have': 299,
 'when': 299,
 'now': 299,
 'can': 294,
 'chr

In [15]:
dataset[0]

array(['xxx000', 'xxx010', 'abba', 'xxx011', 'xxx100', "she's", 'my',
       'kind', 'of', 'girl', 'xxx101', 'xxx110', 'look', 'at', 'her',
       'face', ',', "it's", 'a', 'wonderful', 'face', 'xxx111', 'xxx110',
       'and', 'it', 'means', 'something', 'special', 'to', 'me', 'xxx111',
       'xxx110', 'look', 'at', 'the', 'way', 'that', 'she', 'smiles',
       'when', 'she', 'sees', 'me', 'xxx111', 'xxx110', 'how', 'lucky',
       'can', 'one', 'fellow', 'be', '?', 'xxx111', 'xxx110', "she's",
       'just', 'my', 'kind', 'of', 'girl', ',', 'she', 'makes', 'me',
       'feel', 'fine', 'xxx111', 'xxx110', 'who', 'could', 'ever',
       'believe', 'that', 'she', 'could', 'be', 'mine', '?', 'xxx111',
       'xxx110', "she's", 'just', 'my', 'kind', 'of', 'girl', ',',
       'without', 'her', "i'm", 'blue', 'xxx111', 'xxx110', 'and', 'if',
       'she', 'ever', 'leaves', 'me', 'what', 'could', 'i', 'do', ',',
       'what', 'could', 'i', 'do', '?', 'xxx111', 'xxx110', 'and', 'when',
    

In [16]:
total_words / len(dataset)

311.8359702332562

In [17]:
keys = list(word_dict.keys())
np.random.shuffle(keys)
keys

['dieee',
 'denicd',
 'hi-de-hey',
 'pittle',
 "shadowin'",
 'beethoven',
 'tag-hirap',
 'me-oh',
 'amongst',
 "nose's",
 'congealing',
 "motors'",
 'wafa',
 'dickens',
 'half-alive',
 'smokin',
 'earning',
 'fuma',
 'oj',
 'ant',
 'yip-yip-yip-yip-yip-yip-yip-yip',
 "banging'",
 'tank',
 'easton',
 'deviner',
 'sim',
 "ass'd",
 'cupcakes',
 'ways',
 'kinski',
 'continnetial',
 'hayah',
 'sinisisi',
 'franernity',
 'behave',
 'teenagers',
 'kaisi',
 'tuder',
 'dominique',
 'hande',
 'delightshake',
 'portofino',
 're-hee-hee',
 'matanto',
 "m'ama",
 'skooldaze',
 'setup',
 '3rd',
 'serge',
 "sundown's",
 'drabs',
 "maybe--it's",
 'humor',
 'corral',
 'umpisa',
 "fiddlin'",
 'unsung',
 'roles',
 'atomizer',
 "drooppin'",
 'sickness',
 'washukru',
 "fitzgerald's",
 'bonavie',
 'blissfully',
 'entomb',
 'foreign',
 "'howdy",
 'show-ow-ow',
 'breezeway',
 'michi',
 'seizes',
 'coal-trolley',
 'junped',
 'anarchists',
 "a-movin'",
 'adagio',
 "ve'omeret",
 'titie',
 'disbeliever',
 '56',
 '

In [18]:
print(len(artist_word_dict))
print(len(title_word_dict))

987
14532
