# Leonard Cohen vs Bob Dylan Machine Learning Project

## Background
Bob Dylan and Leonard Cohen are two of the most verbose and prolific singer-poets of the 20th century, each with a corpus of songs spanning more than 50 years. Although both emerged during the folk revival of the 1960s, their poetic styles are highly distinct, and both stand head and shoulders above their contemporaries in quality, consistency, and longevity. I thought it would therefore be an interesting project to see whether I could train a machine learning (ML) algorithm to tell their styles apart, and even to label new words and phrases as more Dylan-esque or more Cohen-like.

## Methodology
Every song from every STUDIO ALBUM by either Bob Dylan or Leonard Cohen was saved as a .txt file in separate directories.

Only ORIGINAL COMPOSITIONS were included in the analysis.


Note: Songs from the 1976 album 'Desire' were included in the Dylan corpus, even though they were largely co-written with Jacques Levy.
Likewise for the tracks on Side 2 of the 1986 album 'Knocked Out Loaded'.
Cohen's final album 'You Want It Darker' was collaborative; however, Wikipedia states "All lyrics written by Leonard Cohen", so all tracks were included in the Cohen corpus.

Note: The Dylan corpus currently only goes up to the 1990s; the remaining albums will be added after testing the model.

Note: The album 'Death Of A Ladies' Man' was included in the Cohen corpus even though it was co-written with Phil Spector. The lyrics were most likely predominantly written by Cohen.
Likewise for the 2001 album 'Ten New Songs' (co-written with Sharon Robinson).
On the 2004 album 'Dear Heather', the songs adapted from poems were not included in the Cohen corpus.

## Notes and Considerations
Dylan's corpus is significantly larger than Cohen's (204 songs versus 126 in the parsed data below); I am not sure how this may affect the algorithm and its output.
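
One simple way to gauge the effect of the imbalance is to compare the classifier against a majority-class baseline, i.e. always guessing the more common artist. A minimal sketch (Python 3 syntax; the counts are the ones reported when the lyrics are parsed later in this notebook, and `majority_baseline` is a hypothetical helper, not part of the project code):

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# 0 = Bob, 1 = Leonard (same encoding as which_artist_data)
labels = [0] * 204 + [1] * 126
baseline = majority_baseline(labels)
print("Majority-class baseline accuracy: {:.3f}".format(baseline))
```

Any accuracy score is probably best read against this figure rather than against 50%.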

## Core Development Notes
**_(see file 'project-methodology.txt' for development notes)_**

# First steps

Import the word stemmer object from the nltk package

In [1]:
from nltk.stem.snowball import SnowballStemmer
from textblob import TextBlob
import string

In [2]:
### helper function for stripping any non-ascii characters from the txt files
def strip_non_ascii(n):
    ''' Returns the string without non-ASCII characters '''
    stripped = (c for c in n if 0 < ord(c) < 127)

    return ''.join(stripped)

### Main lyrics parsing function

In [3]:
### main parsing function
def parseOutLyrics(f):
    """
    Takes an open file 'f' and reads it into a string, then splits the string into individual
    words, performs 'stemming' on them and returns a string that contains all the stemmed
    words in the song (space-separated).

    example use case:
    f = open("song_name.txt", "r")
    text = parseOutLyrics(f)

    """

    all_lyrics = f.read() # reads file into a single string


    words = ""

    if len(all_lyrics) > 1:
        ### remove punctuation
        ## In Python 2, str.translate() takes two arguments: a translation table (here an identity
        ## table built with the maketrans() helper) and a string of characters to delete.
        lyrics_string = all_lyrics.translate(string.maketrans("", ""), string.punctuation)

        # strip non-ascii characters
        lyrics_string = strip_non_ascii(lyrics_string)

        ### Split the text string into individual words, stem the word, and append each stemmed word
        ### to a new list called stemmed_lyrics (make sure there is a single space between each
        ### stemmed word)

        split_lyrics = lyrics_string.split() # split on 'default' whitespace. Returns a list of words.
            
        stemmer = SnowballStemmer("english") # create 'stemmer' obj

        stemmed_lyrics = []
        for word in split_lyrics:
            st_word = stemmer.stem(word)
            stemmed_lyrics.append(st_word)

        words = " ".join(stemmed_lyrics) # re-join the newly-created list of words...

    return words
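
The two-argument `str.translate` call above only works in Python 2. For readers on Python 3, a rough equivalent of the cleaning steps (without the stemming, which would still need `SnowballStemmer`) might look like the sketch below; `clean_lyrics` is a hypothetical name, not part of the project code.

```python
import string

# Translation table that deletes every ASCII punctuation character (Python 3 API)
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_lyrics(text):
    """Strip punctuation and non-ASCII characters, then collapse whitespace."""
    no_punct = text.translate(_PUNCT_TABLE)
    ascii_only = "".join(c for c in no_punct if 0 < ord(c) < 127)
    return " ".join(ascii_only.split())

print(clean_lyrics("Don't think twice, it's all right!"))
```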

## More pre-processing options

In [4]:
### Use 'Part-of-speech (PoS)' recognition via TextBlob
from textblob import TextBlob

test_song = "I have only one thing to do with you"

def split_into_tokens(song):
    """
        PoS assigns each word to a 'word-type' (e.g. "Noun, plural")
        
        This function was adapted from the blog post: https://radimrehurek.com/data_science_python/
        
    """
    # song = unicode(song, 'utf8')  # convert bytes into proper unicode
    return TextBlob(song).tags

list_of_tups =  split_into_tokens(test_song)

# unzip the tuple list to create two lists: one with the words, and one with the PoS's
new_data = [list(t) for t in zip(*list_of_tups)]
print new_data

[[u'I', u'have', u'only', u'one', u'thing', u'to', u'do', u'with', u'you'], [u'PRP', u'VBP', u'RB', u'CD', u'NN', u'TO', u'VBP', u'IN', u'PRP']]
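
If the PoS tags were to be used as extra features, one option would be to count how often each tag occurs in a song. A small sketch using the tagged output shown above (the counting step is an assumption about how the tags might be used, not something the project currently does, and `tag_counts` is a hypothetical helper):

```python
from collections import Counter

# (word, tag) pairs copied from the TextBlob output above
tagged = [("I", "PRP"), ("have", "VBP"), ("only", "RB"), ("one", "CD"),
          ("thing", "NN"), ("to", "TO"), ("do", "VBP"), ("with", "IN"),
          ("you", "PRP")]

def tag_counts(tagged_words):
    """Count occurrences of each part-of-speech tag."""
    return Counter(tag for _, tag in tagged_words)

counts = tag_counts(tagged)
print(counts.most_common(3))
```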


## Code to 'vectorize' the parsed lyrics data

In [5]:
import os
import pickle
import re
import sys

sys.path.append( "../code/" )

"""
    Starter code to process the lyric files from Bob and Leonard to extract
    the features and get the documents ready for classification.

    The data is stored in lists and packed away in pickle files at the end.
"""

which_artist_data = [] # will be a bunch of 0s and 1s associated with the lyric data
lyrics_data = [] # ALL of the lyrics data, with who wrote the lyric stored in the which_artist_data list

"""
### Loop through all the song files in the album sub-directories
### This first one is a test... testing, testing, 1, 2, 3
test_dir = '/Users/TBD/Documents/LeonardBobProj/Test/'
for root, dirs, filenames in os.walk(test_dir):
    for f in filenames:
        if f.startswith('.DS'):
            pass
        else:
            print os.path.join(root, f)
            open_test_song = open(os.path.join(root, f), "r")
            parsed_test_song = parseOutLyrics(open_test_song)

            ## Remove some words from test file, try their names (?)
            words_to_remove = ["bob", "leonard"]
            for word in words_to_remove:
                parsed_test_song = parsed_test_song.replace(word, "")

            ### append the text to the lyrics_data list         
            lyrics_data.append(parsed_test_song)

            ### append a 0 to which_artist_data list to indicate lyrics are from Bob!
            which_artist_data.append(0)

            #print parsed_test_song

            # close song file
            open_test_song.close()
"""

#********************************PARSE SONGS INTO LISTS OF STEMMED WORDS******************************#

### Loop through all the song files in the album sub-directories
### This first one is for Bob!
bob_dir = '/Users/TBD/Documents/LeonardBobProj/Bob/'
for root, dirs, filenames in os.walk(bob_dir):
    for f in filenames:
        if f.startswith('.DS'):
            pass
        else:
            # print os.path.join(root, f)
            open_bob_song = open(os.path.join(root, f), "r")    # had to add full dir path as was getting "IOError: [Errno 2]"
            parsed_bob_song = parseOutLyrics(open_bob_song)

            ## Remove some words from bob's songs. Mainly just his name I guess!
            words_to_remove = ["bob", "dylan"]
            for word in words_to_remove:
                parsed_bob_song = parsed_bob_song.replace(word, "")

            ### append the text to the lyrics_data list         
            lyrics_data.append(parsed_bob_song)

            ### append a 0 to which_artist_data list to indicate lyrics are from Bob!
            which_artist_data.append(0)

            # close song file
            open_bob_song.close()

### Loop through all the song files in the album sub-directories
### This one is for Leonard!
leonard_dir = '/Users/TBD/Documents/LeonardBobProj/Leonard/'
for root, dirs, filenames in os.walk(leonard_dir):
    for f in filenames:
        if f.startswith('.DS'):
            pass
        else:
            # print os.path.join(root, f)
            open_leonard_song = open(os.path.join(root, f), "r")
            parsed_leonard_song = parseOutLyrics(open_leonard_song)

            ## Remove some words from leonard's songs. Mainly just his name as well!
            words_to_remove = ["leonard", "cohen"]
            for word in words_to_remove:
                parsed_leonard_song = parsed_leonard_song.replace(word, "")

            ### append the text to the lyrics_data list         
            lyrics_data.append(parsed_leonard_song)

            ### append a 1 to which_artist_data list to indicate lyrics are from Leonard!
            which_artist_data.append(1)

            # close song file
            open_leonard_song.close()
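
One caveat with `str.replace` is that it also removes the names when they occur inside longer words (for example, "bob" inside "bobbin"). If that turns out to matter, a whole-word removal could be sketched with a regular expression; `remove_words` is a hypothetical helper, not part of the original code:

```python
import re

def remove_words(text, words_to_remove):
    """Remove whole-word matches only, then tidy up the leftover whitespace."""
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words_to_remove) + r")\b"
    without = re.sub(pattern, "", text)
    return " ".join(without.split())

print(remove_words("bob saw the bobbin spin", ["bob", "dylan"]))
```

Unlike the loops above, this version also collapses the gaps left behind, which should not matter to the vectorizer since it tokenizes on word boundaries anyway.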


Check newly-created lists 'lyrics_data' and 'which_artist_data' are populated correctly

In [10]:
from random import randrange

# print lyrics_data
print "No. of songs parsed:", len(lyrics_data)
print "Length of which_artist_data:", len(which_artist_data)

num_bob_songs = 0
num_leonard_songs = 0

for n in which_artist_data:
    if n == 0:
        num_bob_songs += 1
    elif n == 1:
        num_leonard_songs += 1

print "No. of Bob songs:", num_bob_songs
print "No. of Leonard songs:", num_leonard_songs

print "Random song:", lyrics_data[randrange(0, len(lyrics_data))]

### 'Dump' the generated lists into 'pickle' files

# open the files in binary mode so the pickles are written correctly on every platform
pickle.dump( lyrics_data, open("full_lyrics_data.pkl", "wb") )
pickle.dump( which_artist_data, open("artist_identifier_data.pkl", "wb") )


print "Lyrics all processed."

No. of songs parsed: 330
Length of which_artist_data: 330
No. of Bob songs: 204
No. of Leonard songs: 126
Random song: the wick messeng   there was a wick messeng from eli he did come with a mind that multipli the smallest matter when question who had sent for him he answer with his thumb for his tongu it could not speak but onli flatter he stay behind the assembl hall it was there he made his bed oftentim he could be seen return until one day he just appear with a note in his hand which read the sole of my feet i swear theyr burn oh the leav began to fallin and the sea began to part and the peopl that confront him were mani and he was told but these few word which open up his heart if you cannot bring good news then dont bring ani
Lyrics all processed.


In [11]:
# print lyrics_data

# Machine Learning Code

In [12]:
import pickle
import numpy as np 
import scipy as sp
from time import time
from sklearn import cross_validation
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.feature_extraction.text import TfidfVectorizer

np.random.seed(42)



Re-load the pickled list data created above

In [13]:
#************************************LOAD DATA FROM PICKLE FILES****************************************#
### The lyrics (features) and artists (labels) are already largely processed by 'vectorize_lyrics.py'.
### These pickle files should have been created during that pre-processing
lyrics_file = "/Users/TBD/Documents/LeonardBobProj/code/full_lyrics_data.pkl" 
artists_file = "/Users/TBD/Documents/LeonardBobProj/code/artist_identifier_data.pkl"
lyric_data = pickle.load( open(lyrics_file, "rb") )
artists = pickle.load( open(artists_file, "rb") )

Split the data into 'training' and 'testing' subsets (with 80% used for training)

In [14]:
#*******************************TEST/TRAIN SPLIT (SINGLE HOLD-OUT SET)***********************************#
### test_size is the fraction of songs assigned to the test set (the remainder go into training)
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(lyric_data, artists, test_size=0.2, random_state=42)
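
Because the two classes are unbalanced (204 Bob songs versus 126 Leonard songs), it may be worth keeping the same artist ratio in both subsets. The idea of a stratified split can be sketched in plain Python as follows (Python 3 syntax); `stratified_split` is a hypothetical helper written for illustration, and recent scikit-learn versions offer the same behaviour through the `stratify` argument of `train_test_split`:

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices), sampling test_size from each class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        n_test = int(round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 204 + [1] * 126   # 0 = Bob, 1 = Leonard
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```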

In [17]:
features_test

[u'i shall be free   well i took me a woman late last night is threefourth drunk she look all right till she start peelin off her onion gook she took off her wig said how do i look is high flyin bare nake out the window well sometim i might get drunk walk like a duck and smell like a skunk dont hurt me none dont hurt my pride caus i got my littl ladi right by my side shes atryin to hide pretendin she dont know me is out there paintin on the old wood shed when a can oblack paint it fell on my head i went down to scrub and rub but i had to sit in back of the tub cost a quarter half price well my telephon rang it would not stop it presid kennedi callin me up he said my friend  what do we need to make the countri grow i said my friend john brigitt bardot anita ekberg sophia loren countryl grow well i got a woman five feet short she yell and holler and squeal and snort she tickl my nose pat me on the head blow me over and kick me out of bed shes a man eater meat grinder bad loser oh there a

Using a 'bag of words' approach, vectorize the lyrics data into word frequencies.

Also, use TF-IDF weighting to give more 'weight' to words that appear in only a few songs of the total corpus.
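
To make the weighting concrete, here is a small pure-Python sketch of the TF-IDF idea. It uses the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` and the sublinear term frequency `1 + ln(tf)`, which, as far as I understand, is what scikit-learn applies with its default smoothing and `sublinear_tf=True`. It leaves out the final length normalisation, and the three toy "songs" are made up for illustration.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Return one {term: weight} dict per document (no length normalisation)."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    weights = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        weights.append({
            term: (1 + math.log(tf)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in term_freq.items()
        })
    return weights

# toy documents, invented for this example
songs = ["dark dark night", "dark road", "night train"]
weights = tfidf_weights(songs)
print(weights[1])
```

In the second toy song, the rarer term ("road") ends up with a larger weight than the more common one ("dark").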

In [18]:
#************************************TEXT VECTORIZER USING TFIDF****************************************#
### First need to use TfIDf to convert all the text data to weighted term frequencies
### i.e. "vectorize" or 'extract' the features (words)
# Note: 'ngram_range=(1,3)' means that it includes in the list individual words, word pairs, and trios of consecutive
# words - i.e. the min is 1, and the max is 3 words.
# Note: 'max_df=0.2' drops any term that appears in more than 20% of the songs.
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,3), sublinear_tf=True, max_df=0.2, stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test)

### Look-up words in the vectorizer vocab dictionary ###
vocab = vectorizer.get_feature_names()
print vocab[2411]

away tomorrow
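
As an illustration of what `ngram_range=(1,3)` and `max_df=0.2` do, the sketch below builds word n-grams by hand and drops the ones that occur in more than a given fraction of documents. It is a simplified stand-in for the vectorizer's behaviour (no stop-word removal or weighting), written in Python 3 with hypothetical function names and made-up input.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """All runs of min_n..max_n consecutive words, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

def filter_by_max_df(documents, max_df):
    """Keep n-grams that appear in at most max_df (a fraction) of the documents."""
    gram_sets = [set(word_ngrams(doc.split())) for doc in documents]
    n_docs = len(documents)
    vocabulary = set().union(*gram_sets)
    return sorted(g for g in vocabulary
                  if sum(g in s for s in gram_sets) / n_docs <= max_df)

print(word_ngrams("far away tomorrow".split()))

# toy documents, invented for this example: "far" occurs in every one of them
docs = ["far away tomorrow", "far from home", "far gone", "so far", "far out"]
kept = filter_by_max_df(docs, max_df=0.2)
print(kept)
```

Here "far" is discarded because it appears in every toy document, while rarer n-grams such as "away tomorrow" survive.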


Check out some features of the 'vectorizer', and the number of words/word-grams in our 'vocabulary' (note: taken from the blogpost: http://www.ultravioletanalytics.com/2016/11/18/tf-idf-basics-with-pandas-scikit-learn/)

In [20]:
from itertools import islice
import pandas as pd

print list(islice(vectorizer.vocabulary_.items(), 20))
print "Number of unique word-grams in all documents:", len(vectorizer.vocabulary_)

### After 'transformation' into the bag-of-words representation, let's look at the top 20 terms
### as 'weighted' by the TFiDF transformer
weight = np.asarray(features_train.mean(axis=0)).ravel().tolist()
counts_features_train = pd.DataFrame({'Term': vectorizer.get_feature_names(), 'TFiDF weight': weight})
print counts_features_train.sort_values(by='TFiDF weight', ascending=False).head(20)

[(u'fawn', 15411), (u'day sweet', 10523), (u'defi solitud', 10968), (u'woodi', 54936), (u'shame dark doe', 42222), (u'like metal belt', 27925), (u'everybodi went straight', 14176), (u'say say', 41266), (u'forget veri forget', 16759), (u'mood anymor rememb', 32442), (u'thorn oh', 48556), (u'bridg travel goe', 5562), (u'say wasnt noth', 41330), (u'fetus dont', 15745), (u'troubl nothin', 50526), (u'road men gone', 39556), (u'goe life pleasur', 18066), (u'commenc doin befor', 8647), (u'wed meet', 53488), (u'theyr just lie', 48072)]
Number of unique word-grams in all documents: 56168
       TFiDF weight   Term
26610      0.006981   leav
2305       0.006472   away
1083       0.006371  alway
33167      0.006287   need
20602      0.006254   hear
55108      0.006175  world
18187      0.006053   gone
10174      0.006027   dark
13978      0.005971  everi
48181      0.005925  thing
53851      0.005922    whi
27268      0.005887   life
54636      0.005877  woman
22603      0.005841     id
12589    

Perform feature selection and fitting.

In [21]:
#************************************FEATURE SELECTION****************************************#
### feature selection, because the text corpus is super high dimensional (lots of 'features', i.e. words) 
### and may need to be condensed into a 'dense' array for some algorithms
### NOTE: only need to perform on 'features', as the 'labels' array is inherently dense
selector = SelectPercentile(f_classif, percentile=18)
selector.fit(features_train, labels_train)
features_train_transformed = selector.transform(features_train).toarray()
features_test_transformed  = selector.transform(features_test).toarray()

print len(features_train_transformed)
print len(features_test_transformed)

264
66
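
`f_classif` scores each feature with a one-way ANOVA F-statistic, and `SelectPercentile` keeps the top-scoring share. The sketch below computes that statistic for a single feature in plain Python (Python 3 syntax) to show what is being ranked; the numbers are toy values, and `anova_f` is a hypothetical helper rather than the scikit-learn implementation.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic for one feature across the label groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n_total = len(values)
    grand_mean = sum(values) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for group in groups.values():
        mean = sum(group) / len(group)
        ss_between += len(group) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in group)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

labels = [0, 0, 0, 1, 1, 1]
separating = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # differs between the classes
uninformative = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]   # same distribution in both
print(anova_f(separating, labels), anova_f(uninformative, labels))
```

A feature whose values differ between the two artists gets a high score, while one that looks the same for both scores zero and would be among the first to be dropped.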


Perform ML algorithm classifier 'fitting' and output metrics to gauge success.

In [22]:
#************************************CLASSIFIER TESTING****************************************#

from sklearn.metrics import confusion_matrix

### Try a Decision Tree algorithm
'''
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features_train_transformed, labels_train)
ypred = clf.predict(features_test_transformed)
accuracy = accuracy_score(labels_test, ypred)

print accuracy
'''

### Try a Gaussian Naive-Bayes

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()

fitted_clf = clf.fit(features_train_transformed, labels_train)
ypred = fitted_clf.predict(features_test_transformed)
accuracy = accuracy_score(labels_test, ypred)
precision = precision_score(labels_test, ypred)
recall = recall_score(labels_test, ypred)
cm = confusion_matrix(labels_test, ypred)


### Try a Multinomial Naive-Bayes
'''
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()

fitted_clf = clf.fit(features_train, labels_train)
ypred = fitted_clf.predict(features_test)
accuracy = accuracy_score(labels_test, ypred)
precision = precision_score(labels_test, ypred)
recall = recall_score(labels_test, ypred)
cm = confusion_matrix(labels_test, ypred)
'''

### Try a logistic regression model
'''
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

fitted_clf = clf.fit(features_train_transformed, labels_train)
ypred = fitted_clf.predict(features_test_transformed)
accuracy = accuracy_score(labels_test, ypred)
precision = precision_score(labels_test, ypred)
recall = recall_score(labels_test, ypred)
cm = confusion_matrix(labels_test, ypred)
'''

print "Accuracy:", accuracy
print "Precision:", precision
print "Recall:", recall
print cm

Accuracy: 0.80303030303
Precision: 0.730769230769
Recall: 0.76
[[34  7]
 [ 6 19]]
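
The three scores can be recomputed by hand from the confusion matrix, which is a useful sanity check. In scikit-learn's layout the rows are the true labels and the columns the predicted labels, with Leonard (label 1) treated as the positive class. A short sketch in Python 3 syntax:

```python
# confusion matrix printed above: rows = true label, columns = predicted label
cm = [[34, 7],
      [6, 19]]

true_neg, false_pos = cm[0]   # songs that are really Bob's
false_neg, true_pos = cm[1]   # songs that are really Leonard's

total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print("Accuracy: {:.4f}".format(accuracy))
print("Precision: {:.4f}".format(precision))
print("Recall: {:.4f}".format(recall))
```

With 41 of the 66 test songs being Dylan's, always guessing "Bob" would already score about 0.62, so the classifier improves on that by roughly 18 percentage points.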


In [44]:
print features_train_transformed

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]


Initial scores:
Accuracy = 0.7879
Precision = 0.7037
Recall = 0.76

Updated 'SelectPercentile' to 18%:
Accuracy = 0.8030
Precision = 0.7308
Recall = 0.76

# Front end connections

Need to make a function that takes an input query (a string) and performs the classification (i.e. ypred) on it, giving an output of 'Bob' or 'Leonard'.

In [23]:
from nltk.stem.snowball import SnowballStemmer
import string

def parseOutQuery(q):
    """
    Takes a query string 'q', strips punctuation and non-ASCII characters, splits it into
    individual words, performs 'stemming' on them and returns a string that contains all
    the stemmed words in the query (space-separated).
    
    Adapted from the 'parseOutLyrics' code.
    
    """

    words = ""

    if len(q) > 1:
        ### remove punctuation
        ## In Python 2, str.translate() takes two arguments: a translation table (here an identity
        ## table built with the maketrans() helper) and a string of characters to delete.
        query_string = q.translate(string.maketrans("", ""), string.punctuation)

        # strip non-ascii characters
        query_string = strip_non_ascii(query_string)

        ### Split the text string into individual words, stem the word, and append each stemmed word
        ### to a new list called stemmed_query (make sure there is a single space between each
        ### stemmed word)

        split_query = query_string.split() # split on 'default' whitespace. Returns a list of words.

        stemmer = SnowballStemmer("english") # create 'stemmer' obj

        stemmed_query = []
        for word in split_query:
            st_word = stemmer.stem(word)
            stemmed_query.append(st_word)

        
        words = " ".join(stemmed_query) # re-join the newly-created list of words...
        #print stemmed_query
        

    return words

In [24]:
def bob_or_leonard():
    query = raw_input("What lyric you got? ")
    
    # need to stem/pre-process this 'raw_input' string
    parsed_query = [parseOutQuery(query)]   # wrapped in a list because the vectorizer expects an iterable of documents
    #print parsed_query
    
    # use previously defined 'vectorizer' object to transform the query (list) into the TfiDF representation/weighting
    vectorized_query  = vectorizer.transform(parsed_query)
    #print vectorized_query
    
    query_transformed  = selector.transform(vectorized_query).toarray()
    #print query_transformed
    
    ### NOTE: THE SAME FEATURE SELECTOR MUST BE APPLIED TO THE QUERY, BECAUSE THE CLASSIFIER
    ### WAS TRAINED ON THE REDUCED (SELECTED) FEATURE SET
    
    # now pass this 'processed' query object to the ML classifier predictor
    ypred = fitted_clf.predict(query_transformed)[0]
    
    # get probability of prediction
    prob = fitted_clf.predict_proba(query_transformed)[0]
    
    print "ypred: ", ypred
    print "prob: ", prob
    
    if ypred == 1:
        print "Leonard"
        print "With {}% confidence".format(prob[1] * 100)
    elif ypred == 0:
        print "Bob"
        print "With {}% confidence".format(prob[0] * 100)

In [33]:
# call the above function to get our 'Leonard or Bob' predictor!
bob_or_leonard()

What lyric you got? riddle damn thumb blues
ypred:  0
prob:  [ 1.  0.]
Bob
With 100.0% confidence


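Okay, the above seems to be working, UNLESS the input words in the query are not in the vectorizer vocab!

For example, "Marianne" shows up as "Bob". Note that 'max_df=0.2' is a maximum document frequency cut-off (terms that show up in more than 20% of the song corpus are dropped from the TFiDF vectorizer vocab), not a minimum, so it is unlikely to be what removes a rare word. More likely the stemmed word is either missing from the vocab built from the training songs or was discarded by the 18% feature selection, which leaves the query as an all-zero vector that the classifier still assigns to one artist. At least that's what I THINK is happening. May just have to add an exception case to say 'If word(s) not recognized, return "Unsure, could be either" as the output'.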

When provided with sufficient input data (i.e. a few lines from a Bob or Leonard song) the classifier seems to be working quite well!
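
The 'unsure' exception case mentioned above could be sketched roughly as follows. To keep the example self-contained it works on a plain set of known terms and a stand-in `predict` callable; in the notebook the equivalent check would be whether the transformed query contains any non-zero entries. All names here are hypothetical.

```python
def classify_or_unsure(query, known_terms, predict):
    """Return 'Bob'/'Leonard', or an 'unsure' message if no query word is known."""
    recognised = [word for word in query.lower().split() if word in known_terms]
    if not recognised:
        return "Unsure, could be either"
    return "Leonard" if predict(recognised) == 1 else "Bob"

# toy vocabulary and a dummy predictor, purely for illustration
known_terms = {"thumb", "riddl", "blue"}
always_bob = lambda words: 0

print(classify_or_unsure("Marianne", known_terms, always_bob))
print(classify_or_unsure("riddl thumb", known_terms, always_bob))
```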