## Objectives:
- compute the cosine similarity between a number of song lyrics (as vectors)
- with the largest cosine similarity of OVER 50%, PROVE that the author of "Penny Lane," Paul McCartney (Lennon/McCartney) also wrote "In My Life" from the hit 1980 musical *LES MISERABLES*
- further, with analysis of Ringo Starr's "Octopus' Garden," and George Harrison's "Within You Without You," identify which Beatle played Marius and which Eponine in the now-lost Beatles concept album of the hit Les Mis quartet

#All You Need is Cosine Similarity

##Background Math

With cosine similarity, we can quantify the relationship between two variables. A vector is a quantity with both magnitude and direction. Similarity between 2D vectors, graphed as arrows, can be measured by the angle between them. If they're exactly the same, there is no angle between them (angle = 0 degrees.) The degree can be determined by calculating the cosine. 

**CLARIFY THIS: (Recall that a cosine of 0 graphs to 1, so hereafter, when measuring a cosine as a percentage, the higher the cosine simliarity (on a scale of 0 to 1,) the higher the similarity. ** *italicized text* When the two vectors are exactly the same $\theta=0$ and $\cos\theta=1$ (
When the two vectors are perpendicular to each other $\theta=90$ and $\cos\theta=0$ )

MARKED FOR DELETION--

Think about one of our vectors as the adjacent side of a triangle and one as the hypotenuse.  The cosine of the angle measures the relationship of the hypotenuse and the adjacent side.  

$$\cos(\theta) = \frac{adj}{hyp}$$

Because we're often dealing with multi-dimensional vectors, we need a more general formula to determine the cosine of the angle between the vectors.

The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:

$$\mathbf {A} \cdot \mathbf {B} =\left\|\mathbf {A} \right\|\left\|\mathbf {B} \right\|\cos \theta$$

Solving for $\cos\theta$ we get

$$\cos \theta= \frac{\mathbf {A} \cdot \mathbf {B} }{\left\|\mathbf {A} \right\|\left\|\mathbf {B} \right\|}$$

$\qquad$

The numerator is the dot product of the vectors $\mathbf {A}$ and $\mathbf {B}$

The denominator is the norm of $\mathbf {A}$ times the norm of $\mathbf {B}$


###Let's look at an example

The song writing collaboration between John Lennon and Paul McCartney was one of the most productive in music history.  Unlike many other partnerships where one individual wrote lyrics and one wrote music, Lennon and McCartney composed both, and it was decided that any song that was written would be credited to both.  In the beginning of their relationship, many of their songs were collaborations.  However, later on, they often worked separately with little to no input from the other.    

Because of extensive reporting on the Beatles over the years, it is generally known if a Lennon-McCartney song was a true collaboration, primarily written by Lennon, or primarily written by McCartney.  

However, there are several disputed songs where both Lennon and McCartney at times claimed to be the sole (or primary) composer.

We will use cosine similarity to determine if *In My Life* (disputed) is most similar to *From Me to You* (collaboration, not disputed), *Strawberry Fields* (Lennon, not disputed) or *Penny Lane* (McCartney, not disputed).

Let's start by looking at the text of Strawberry Fields, which we know was written by John Lennon.  We can actually copy the lyrics to the entire song (removing punctuation and capitals from the first words of sentences) as a string and then convert that string into a data frame.

---



In [1]:
Les_Mis = "How strange This feeling that my lifes begun at last This change Can people really fall in love so fast Whats the matter with you Cosette Have you been to much on your own So many things unclear So many things unknown In my life There are so many questions and answers That somehow seem wrong In my life There are times when I catch in the silence The sigh of a faraway song And it sings Of a world that I long to see Out of reach Just a whisper away waiting for me Does he know Im alive Do I know if hes real Does he see what I see Does he feel what I feel In my life Im no longer alone Now the love in my life is so near Find me now Find me here Dear Cosette Youre such a a lonely child How pensive how sad you seem to me Believe me Were it within my power Id fill each passing hour How quiet it must be I can see With only me for company Theres so little I know that Im longing to know Of the man that you were in a time long ago Please Cosette Theres so little you say of the life you have known Why you keep to yourself Why were always alone So dark so dark and deep The secrets that you keep In my life Please forgive what I say You are loving and gentle and good But Papa Dear Papa In your eyes I am still like the child Whos lost in a wood No more words No more words its a time that is dead There are words That are better unheard Better unsaid In my life Im no longer a child And I yearn for the truth that you know Of the years Years ago You will learn Truth is given by God to us all in our time In our turn"
import pandas as pd

Les_Mis = Les_Mis.lower()

Les_Mis_df = pd.DataFrame({'words' : Les_Mis.split()})
Les_Mis_df.head()
Les_Mis_freq = pd.crosstab(index = Les_Mis_df['words'], columns = 'count')

In [3]:
import pandas as pd

#Strawberry Fields - John Lennon (not disputed)
  
Strawberry_ = "let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever living is easy with eyes closed misunderstanding all you see its getting hard to be someone but it all works out it doesnt matter much to me let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever no one I think is in my tree I mean it must be high or low that is you cant you know tune in but its all right that is I think its not too bad let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever always no sometimes think but you know I know when it's a dream I think er no I mean er yes but its all wrong that is I think I disagree let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever Strawberry Fields forever Strawberry Fields forever"
Strawberry_ = Strawberry_.lower()
Strawberry_df = pd.DataFrame({'words' : Strawberry_.split()})


The way we are going to determine if two songs are similar is by comparing how frequently words appear in each song. We can make a frequency table to determine how many times each word appears in Strawberry Fields.

In [4]:
straw_freq = pd.crosstab(index= Strawberry_df['words'], columns = 'count')

Now let's do the same with Penny Lane - a song we know was written by McCartney

In [5]:
import pandas as pd

#Penny Lane - Paul McCartney (not disputed)

Lane_ = "in Penny Lane there is a barber showing photographs of every head hes had the pleasure to know and all the people that come and go stop and say hello on the corner is a banker with a motorcar and little children laugh at him behind his back and the banker never wears a mac in the pouring rain very strange Penny Lane is in my ears and in my eyes there beneath the blue suburban skies I sit and meanwhile back in Penny Lane there is a fireman with an hourglass and in his pocket is a portrait of the Queen he likes to keep his fire engine clean its a clean machine Penny Lane is in my ears and in my eyes a four of fish and finger pies in summer meanwhile back behind the shelter in the middle of the roundabout the pretty nurse is selling poppies from a tray and though she feels as if shes in a play ahe is anyway in Penny Lane the barber shaves another customer we see the banker sitting waiting for a trim and then the fireman rushes in from the pouring rain very strange Penny Lane is in my ears and in my eyes there beneath the blue suburban skies I sit and meanwhile back Penny Lane is in my ears and in my eyes there beneath the blue suburban skies Penny Lane"
Lane_ = Lane_.lower()
Lane_df = pd.DataFrame({'words' : Lane_.split()})
Lane_df.head()
Lane_freq = pd.crosstab(index = Lane_df['words'], columns = 'count')

In [7]:
#cleaned up by me: 
import pandas as pd

Penny_Lane_clean = "In Penny Lane there is a barber showing photographs Of every head hes had the pleasure to know And all the people that come and go Stop and say hello On the corner is a banker with a motorcar And little children laugh at him behind his back And the banker never wears a mac in the pouring rain Very strange Penny Lane is in my ears and in my eyes Wet beneath the blue suburban skies I sit and meanwhile back in Penny Lane there is a fireman with an hourglass And in his pocket is a portrait of the Queen He likes to keep his fire engine clean Its a clean machine Penny Lane is in my ears and in my eyes A four of fish and finger pies In summer meanwhile back Behind the shelter in the middle of a roundabout A pretty nurse is selling poppies from a tray And though she feels as if shes in a play She is anyway Penny Lane the barber shaves another customer We see the banker sitting waiting for a trim And then the fireman rushes in from the pouring rain Very strange Penny Lane is in my ears and in my eyes There beneath the blue suburban skies I sit and meanwhile back Penny Lane is in my ears and in my eyes There beneath the blue suburban skies Penny Lane"
Penny_Lane_clean = Lane_.lower()
Lane_df = pd.DataFrame({'words' : Penny_Lane_clean.split()})
Lane_df.head()
Lane_freq = pd.crosstab(index = Lane_df['words'], columns = 'count')

Now we are going to concatenate the two data sets so that there is one row for each word that appears in either song and one column for each song that counts how many times that word appears in the song's lyrics.

In [8]:
# Compare Strawberry Fields to Penny Lane

from numpy import dot
from numpy.linalg import norm

dfs = [straw_freq, Lane_freq]
all_words = pd.concat(dfs, axis= 1)

In [9]:
LM_straw = [straw_freq, Les_Mis_freq]
LM_straw_words = pd.concat(LM_straw, axis= 1)
LM_penny = [Lane_freq, Les_Mis_freq]
LM_penny_words = pd.concat(LM_penny, axis= 1)

We want to rename the first column so we know it is the word count from Strawberry Fields and the second column so we know it is the word count from Penny Lane.  

Also, we want to change the NaNs present to 0s because they indicate that a word that was in one song was not included in the other song.

In [10]:
all_words = all_words.fillna(0)
all_words.columns = ['Strawberry', 'Penny_Lane']
#all_words

In [11]:
#LM = Les Mis not Lennon/McCartney. Build the columns to compare lyrics:
LM_straw_words = LM_straw_words.fillna(0)
LM_penny_words = LM_penny_words.fillna(0)
LM_straw_words.columns = ['Strawberry', 'Les_Mis']
LM_penny_words.columns = ['Penny_Lane', 'Les_Mis']
LM_straw_words
LM_penny_words

Unnamed: 0_level_0,Penny_Lane,Les_Mis
words,Unnamed: 1_level_1,Unnamed: 2_level_1
a,11.0,9.0
ahe,1.0,0.0
all,1.0,1.0
an,1.0,0.0
and,15.0,6.0
...,...,...
years,0.0,2.0
you,0.0,11.0
your,0.0,2.0
youre,0.0,1.0


Now we can have two numeric vectors that represent the lyric frequency of each song, and we an compare them using the cosine similarity.  

In [12]:
#cos_sim = dot product Strawberry Fields and Penny Lane / norm(Strawberry Fields) * norm (Penny Lane)
dot(all_words['Strawberry'], all_words["Penny_Lane"]) / (norm(all_words['Strawberry']) * norm(all_words["Penny_Lane"]))

0.21590157172853788

In [13]:
dot(LM_straw_words['Strawberry'], LM_straw_words["Les_Mis"]) / (norm(LM_straw_words['Strawberry']) * norm(LM_straw_words["Les_Mis"]))

0.3924427854637147

In [14]:
dot(LM_penny_words['Penny_Lane'], LM_penny_words["Les_Mis"]) / (norm(LM_penny_words['Penny_Lane']) * norm(LM_penny_words["Les_Mis"]))

0.5425218908918277

In [15]:
marius = 'In my life She has burst like the music of angels The light of the sun And my life Seems to stop as if something is over And something has scarcely begun eponine Youre the friend who has brought me here Thanks to you I am one with the Gods And heaven is near And I saw through a world that is new That is free In my life There is someone who touches my life Waiting near'
eponine = 'Every word that he says Is a dagger in me In my life Theres been no one like him anywhere Anywhere Where he is If he asked Id be his In my life There is someone who touches my life Waiting here'
marius = marius.lower()
eponine = eponine.lower()
marius_df = pd.DataFrame({'words' : marius.split()})
eponine_df = pd.DataFrame({'words' : eponine.split()}) 
marius_freq = pd.crosstab(index = marius_df['words'], columns = 'count')
eponine_freq = pd.crosstab(index = eponine_df['words'], columns = 'count')


In [16]:
octopus = 'Id like to be Under the sea In an octopus garden In the shade Hed let us in Knows where weve been In his octopus garden In the shade Id ask my friends To come and see An octopus garden With me Id like to be Under the sea In an octopus garden In the shade We would be warm Below the storm In our little hideaway Beneath the waves Resting our head On the seabed In an octopus garden Near a cave We would sing And dance around Because we know We cant be found Id like to be Under the sea In an octopus garden In the shade We would shout And swim about The coral that lies Beneath the waves Lies beneath the ocean waves Oh what joy For every girl and boy Knowing theyre happy And theyre safe Happy and theyre safe We would be so happy You and me No one there to tell us What to do Id like to be Under the sea In an octopus garden With you In an octopus garden With you In an octopus garden With you'
octopus = octopus.lower()
octopus_df = pd.DataFrame({'words' : octopus.split()}) 
octopus_freq = pd.crosstab(index = octopus_df['words'], columns = 'count')

In [17]:
within = 'We were talking about the space between us all And the people who hide themselves behind a wall of illusion Never glimpse the truth Then its far too late When they pass away We were talking about the love we all could share When we find it to try our best to hold it there with our love With our love we could save the world if they only knew Try to realise its all within yourself No one else can make you change And to see youre really only very small And life flows on within you and without you We were talking about the love thats gone so cold And the people who gain the world and lose their soul They dont know They cant see Are you one of them When youve seen beyond yourself then you may find Peace of mind is waiting there And the time will come when you see were all one And life flows on within you and without you'
within = within.lower()
within_df = pd.DataFrame({'words' : within.split()}) 
within_freq = pd.crosstab(index = within_df['words'], columns = 'count')

In [18]:
marius_octopus = [marius_freq, octopus_freq]
marius_octopus_words = pd.concat(marius_octopus, axis= 1)
marius_within = [marius_freq, within_freq]
marius_within_words = pd.concat(marius_within, axis = 1)
marius_octopus_words = marius_octopus_words.fillna(0)
marius_within_words = marius_within_words.fillna(0)
marius_octopus_words.columns = ['marius', 'octopus']
marius_within_words.columns = ['marius', 'within']
print('Ringo Starr as Marius: ')
print(dot(marius_octopus_words['marius'], marius_octopus_words["octopus"]) / (norm(marius_octopus_words['marius']) * norm(marius_octopus_words["octopus"])))
print('George Harrison as Marius: ')
print(dot(marius_within_words['marius'], marius_within_words["within"]) / (norm(marius_within_words['marius']) * norm(marius_within_words["within"])))

Ringo Starr as Marius: 
0.38419080719855575
George Harrison as Marius: 
0.42904118721439827


In [19]:
eponine_octopus = [eponine_freq, octopus_freq]
eponine_octopus_words = pd.concat(eponine_octopus, axis= 1)
eponine_within = [eponine_freq, within_freq]
eponine_within_words = pd.concat(eponine_within, axis= 1)
eponine_octopus_words = eponine_octopus_words.fillna(0)
eponine_within_words = eponine_within_words.fillna(0)
eponine_octopus_words.columns = ['eponine', 'octopus']
eponine_within_words.columns = ['eponine', 'within']
print('Ringo Starr as Eponine: ')
print(dot(eponine_octopus_words['eponine'], eponine_octopus_words["octopus"]) / (norm(eponine_octopus_words['eponine']) * norm(eponine_octopus_words["octopus"])))
print('George Harrison as Eponine: ')
print(dot(eponine_within_words['eponine'], eponine_within_words["within"]) / (norm(eponine_within_words['eponine']) * norm(eponine_within_words["within"])))

Ringo Starr as Eponine: 
0.2585449118153828
George Harrison as Eponine: 
0.09949879346007116


In [20]:
cosette = 'How strange This feeling that my lifes begun at last This change Can people really fall in love so fast Whats the matter with you Cosette Have you been to much on your own So many things unclear So many things unknown In my life There are so many questions and answers That somehow seem wrong In my life There are times when I catch in the silence The sigh of a faraway song And it sings Of a world that I long to see Out of reach Just a whisper away waiting for me Does he know Im alive Do I know if hes real Does he see what I see Does he feel what I feel In my life Im no longer alone Now the love in my life is so near Find me now Find me here Theres so little I know that Im longing to know Of the man that you were in a time long ago Theres so little you say of the life you have known Why you keep to yourself Why were always alone So dark so dark and deep The secrets that you keep In my life Please forgive what I say You are loving and gentle and good But Papa Dear Papa In your eyes I am still like the child Whos lost in a wood In my life Im no longer a child And I yearn for the truth that you know Of the years Years ago'
cosette = cosette.lower()
valjean = 'Dear Cosette Youre such a lonely child How pensive how sad you seem to me Believe me Were it within my power Id fill each passing hour How quiet it must be I can see With only me for company Please Cosette No more words No more words its a time that is dead There are words That are better unheard Better unsaid You will learn Truth is given by God to us all in our time In our turn'
valjean = valjean.lower()
cosette_df = pd.DataFrame({'words' : cosette.split()})
valjean_df = pd.DataFrame({'words' : valjean.split()}) 
cosette_freq = pd.crosstab(index = cosette_df['words'], columns = 'count')
valjean_freq = pd.crosstab(index = valjean_df['words'], columns = 'count')

In [21]:
cosette_straw = [cosette_freq, straw_freq]
cosette_straw_words = pd.concat(cosette_straw, axis= 1)
cosette_penny = [cosette_freq, Lane_freq]
cosette_penny_words = pd.concat(cosette_penny, axis= 1)
valjean_straw = [valjean_freq, straw_freq]
valjean_straw_words = pd.concat(valjean_straw, axis= 1)
valjean_penny = [valjean_freq, Lane_freq]
valjean_penny_words = pd.concat(valjean_penny, axis= 1)

In [22]:
cosette_straw_words = cosette_straw_words.fillna(0)
cosette_penny_words = cosette_penny_words.fillna(0)
cosette_straw_words.columns = ['cosette', 'Strawberry']
cosette_penny_words.columns = ['cosette', 'Penny_Lane']
valjean_straw_words = valjean_straw_words.fillna(0)
valjean_penny_words = valjean_penny_words.fillna(0)
valjean_straw_words.columns = ['valjean','Strawberry']
valjean_penny_words.columns = ['valjean','Penny_Lane']

In [23]:
print('Lennon as Cosette: ') 
print(dot(cosette_straw_words['cosette'], cosette_straw_words["Strawberry"]) / (norm(cosette_straw_words['cosette']) * norm(cosette_straw_words["Strawberry"])))
print('Lennon as Valjean: ')
print(dot(valjean_straw_words['valjean'], valjean_straw_words["Strawberry"]) / (norm(valjean_straw_words['valjean']) * norm(valjean_straw_words["Strawberry"])))
print('McCartney as Cosette: ')
print(dot(cosette_penny_words['cosette'], cosette_penny_words["Penny_Lane"]) / (norm(cosette_penny_words['cosette']) * norm(cosette_penny_words["Penny_Lane"])))
print('McCartney as Valjean: ')
print(dot(valjean_penny_words['valjean'], valjean_penny_words["Penny_Lane"]) / (norm(valjean_penny_words['valjean']) * norm(valjean_penny_words["Penny_Lane"])))

Lennon as Cosette: 
0.34935207024108916
Lennon as Valjean: 
0.32860898603038086
McCartney as Cosette: 
0.5517627043468121
McCartney as Valjean: 
0.23755684052207143


"stop words"

{‘ourselves’, ‘hers’, ‘between’, ‘yourself’, ‘but’, ‘again’, ‘there’, ‘about’, ‘once’, ‘during’, ‘out’, ‘very’, ‘having’, ‘with’, ‘they’, ‘own’, ‘an’, ‘be’, ‘some’, ‘for’, ‘do’, ‘its’, ‘yours’, ‘such’, ‘into’, ‘of’, ‘most’, ‘itself’, ‘other’, ‘off’, ‘is’, ‘s’, ‘am’, ‘or’, ‘who’, ‘as’, ‘from’, ‘him’, ‘each’, ‘the’, ‘themselves’, ‘until’, ‘below’, ‘are’, ‘we’, ‘these’, ‘your’, ‘his’, ‘through’, ‘don’, ‘nor’, ‘me’, ‘were’, ‘her’, ‘more’, ‘himself’, ‘this’, ‘down’, ‘should’, ‘our’, ‘their’, ‘while’, ‘above’, ‘both’, ‘up’, ‘to’, ‘ours’, ‘had’, ‘she’, ‘all’, ‘no’, ‘when’, ‘at’, ‘any’, ‘before’, ‘them’, ‘same’, ‘and’, ‘been’, ‘have’, ‘in’, ‘will’, ‘on’, ‘does’, ‘yourselves’, ‘then’, ‘that’, ‘because’, ‘what’, ‘over’, ‘why’, ‘so’, ‘can’, ‘did’, ‘not’, ‘now’, ‘under’, ‘he’, ‘you’, ‘herself’, ‘has’, ‘just’, ‘where’, ‘too’, ‘only’, ‘myself’, ‘which’, ‘those’, ‘i’, ‘after’, ‘few’, ‘whom’, ‘t’, ‘being’, ‘if’, ‘theirs’, ‘my’, ‘against’, ‘a’, ‘by’, ‘doing’, ‘it’, ‘how’, ‘further’, ‘was’, ‘here’, ‘than’} 

Note: You can even modify the list by adding words of your choice in the english .txt. file in the stopwords directory. 

In [None]:
#To remove "stop" words according to the list above
'''
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
print(stopwords.words('english'))
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
 
example_sent = """This is a sample sentence,
                  showing off the stop words filtration."""
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]
filtered_sentence = []
 
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
 
print(word_tokens)
print(filtered_sentence)

import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
 
# word_tokenize accepts
# a string as an input, not a file.
stop_words = set(stopwords.words('english'))
file1 = open("text.txt")
 
# Use this to read file content as a stream:
line = file1.read()
words = line.split()
for r in words:
    if not r in stop_words:
        appendFile = open('filteredtext.txt','a')
        appendFile.write(" "+r)
        appendFile.close()
'''

We can use this value (cosine similarity = 0.22) as a baseline.  This is the similarity between two songs that were written by close collaborators but we know were not written by the same individual.

Let's load in two more songs: From Me to You (collaboration, not disputed) and In My Life (the disputed song)

Let's compare In My Life to Penny Lane (McCartney), Strawberry Fields (Lennon) and From Me to You (the undisputed collaboration)

The cosine similarity between In My Life to all three other songs is higher than the cosine similarity between Strawberry Fields and Penny Lane.

It is highest between In My Life and Penny Lane, followed by From Me to You.  In My Life is least Similar to Strawberry Fields.

From the Wikipedia article about the Lennon-McCartney collaboration: In 1977, when shown a list of songs Lennon claimed writing on for the magazine Hit Parader, McCartney disputed only "In My Life". Lennon said that McCartney helped only with "the middle eight" (a short section) of the song. McCartney said that he wrote the entire melody, taking inspiration from Smokey Robinson songs.
