# Text Similarity Measures Exercises #

## Introduction ##

We will use [a song lyric dataset from Kaggle](https://www.kaggle.com/mousehead/songlyrics) to identify songs with similar lyrics. The dataset contains artists, songs, and lyrics for more than 55,000 songs, but today we will focus on songs by one group in particular: The Beatles.

The following code will help you load in the data and get set up for this exercise.

In [1]:
import nltk
import pandas as pd

In [2]:
data = pd.read_csv('../data/songdata.csv')
data.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \r\nA..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \r\nTouch me gen..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \r\nWhy I had...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


In [3]:
data.shape

(57650, 4)

In [4]:
data_copy=data.copy()

## Question 1 ##

* Filter the lyrics data set to only select songs by The Beatles.
* How many songs are there in total by The Beatles?
* Take a look at the first song's lyrics.

In [5]:
# 1.1

data['artist'].unique()

array(['ABBA', 'Ace Of Base', 'Adam Sandler', 'Adele', 'Aerosmith',
       'Air Supply', 'Aiza Seguerra', 'Alabama', 'Alan Parsons Project',
       'Aled Jones', 'Alice Cooper', 'Alice In Chains', 'Alison Krauss',
       'Allman Brothers Band', 'Alphaville', 'America', 'Amy Grant',
       'Andrea Bocelli', 'Andy Williams', 'Annie', 'Ariana Grande',
       'Ariel Rivera', 'Arlo Guthrie', 'Arrogant Worms', 'Avril Lavigne',
       'Backstreet Boys', 'Barbie', 'Barbra Streisand', 'Beach Boys',
       'The Beatles', 'Beautiful South', 'Beauty And The Beast',
       'Bee Gees', 'Bette Midler', 'Bill Withers', 'Billie Holiday',
       'Billy Joel', 'Bing Crosby', 'Black Sabbath', 'Blur', 'Bob Dylan',
       'Bob Marley', 'Bob Rivers', 'Bob Seger', 'Bon Jovi', 'Boney M.',
       'Bonnie Raitt', 'Bosson', 'Bread', 'Britney Spears',
       'Bruce Springsteen', 'Bruno Mars', 'Bryan White', 'Cake',
       'Carly Simon', 'Carol Banawa', 'Carpenters', 'Cat Stevens',
       'Celine Dion', 'Chaka Khan

In [6]:
data=data[data['artist']=='The Beatles']
data.head()

Unnamed: 0,artist,song,link,text
1198,The Beatles,A Shot Of Rhythm And Blues,/b/beatles/a+shot+of+rhythm+blues_20014867.html,"Well, if your hands start a-clappin' \r\nAnd ..."
1199,The Beatles,Across The Universe,/b/beatles/across+the+universe_10026507.html,Words are flowing out like \r\nEndless rain i...
1200,The Beatles,All I've Got To Do,/b/beatles/all+ive+got+to+do_10026646.html,"Whenever I want you around, yeah \r\nAll I go..."
1201,The Beatles,And I Love Her,/b/beatles/and+i+love+her_10026463.html,I give her all my love \r\nThat's all I do \...
1202,The Beatles,And Your Bird Can Sing,/b/beatles/and+your+bird+can+sing_10026364.html,You tell me that you've got everything you wan...


In [7]:
# 1.2

len(data['song'].unique())

178

In [8]:
# 1.3

data.head(1)

Unnamed: 0,artist,song,link,text
1198,The Beatles,A Shot Of Rhythm And Blues,/b/beatles/a+shot+of+rhythm+blues_20014867.html,"Well, if your hands start a-clappin' \r\nAnd ..."


## Question 2 ##

Apply the following preprocessing steps:
* Note the '\n' (new line) characters in the lyrics. Remove them using regular expressions.
* Remove all words with numbers using regular expressions.
* Create a document-term matrix using Count Vectorizer, with each row as a song and each column as a word in the lyrics. Have the Count Vectorizer remove all stop words as well.

Note: Count Vectorizer automatically removes punctuation and makes all characters lowercase.
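
The first two preprocessing steps can be tried on a single string before applying them to the whole column. The sketch below is only an illustration: the `clean_lyrics` helper and the sample string are made up, and unlike the exercise it also strips `\r` and collapses repeated spaces.

```python
import re

def clean_lyrics(text):
    """Replace line breaks with spaces and drop any word that contains a digit."""
    text = re.sub(r'[\r\n]+', ' ', text)      # line-break characters -> one space
    text = re.sub(r'\w*\d\w*', ' ', text)     # any token containing a digit -> space
    return re.sub(r'\s+', ' ', text).strip()  # collapse the leftover whitespace

# Hypothetical sample text, made up for illustration.
sample = "She loves you \r\nyeah yeah 4ever \r\nback in 1964"
cleaned = clean_lyrics(sample)
print(cleaned)
```

The pattern `\w*\d\w*` matches a whole token as soon as it contains one digit, so `4ever` and `1964` are both removed.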

In [9]:
# 2.1

import re

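# Replace newline characters with spaces in every row, without hard-coding the row count.
# assign() returns a new DataFrame, which avoids chained assignment on a filtered slice.
data=data.assign(text=data['text'].str.replace(r'\n',' ',regex=True))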
    
data.head()

Unnamed: 0,artist,song,link,text
1198,The Beatles,A Shot Of Rhythm And Blues,/b/beatles/a+shot+of+rhythm+blues_20014867.html,"Well, if your hands start a-clappin' \r And y..."
1199,The Beatles,Across The Universe,/b/beatles/across+the+universe_10026507.html,Words are flowing out like \r Endless rain in...
1200,The Beatles,All I've Got To Do,/b/beatles/all+ive+got+to+do_10026646.html,"Whenever I want you around, yeah \r All I got..."
1201,The Beatles,And I Love Her,/b/beatles/and+i+love+her_10026463.html,I give her all my love \r That's all I do \r...
1202,The Beatles,And Your Bird Can Sing,/b/beatles/and+your+bird+can+sing_10026364.html,You tell me that you've got everything you wan...


In [10]:
# 2.2

# Remove every word that contains a digit; the raw string keeps \w and \d as regex escapes.
data=data.assign(text=data['text'].str.replace(r'\w*\d\w*',' ',regex=True))
    
data.head()

Unnamed: 0,artist,song,link,text
1198,The Beatles,A Shot Of Rhythm And Blues,/b/beatles/a+shot+of+rhythm+blues_20014867.html,"Well, if your hands start a-clappin' \r And y..."
1199,The Beatles,Across The Universe,/b/beatles/across+the+universe_10026507.html,Words are flowing out like \r Endless rain in...
1200,The Beatles,All I've Got To Do,/b/beatles/all+ive+got+to+do_10026646.html,"Whenever I want you around, yeah \r All I got..."
1201,The Beatles,And I Love Her,/b/beatles/and+i+love+her_10026463.html,I give her all my love \r That's all I do \r...
1202,The Beatles,And Your Bird Can Sing,/b/beatles/and+your+bird+can+sing_10026364.html,You tell me that you've got everything you wan...


In [16]:
# 2.3

from sklearn.feature_extraction.text import CountVectorizer

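cv=CountVectorizer(stop_words='english')
# Note: this encodes the song titles in data['song']; pass data['text'] to encode the lyrics instead.
x=cv.fit_transform(data['song'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
# `columns` must be passed by keyword, because the second positional argument of pd.DataFrame is `index`.
one_hot=pd.DataFrame(x.toarray(),columns=cv.get_feature_names_out())
one_hot.head(10)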

Unnamed: 0,act,ain,anna,ask,baby,bad,banana,bells,benefit,besame,...,walrus,wanna,want,warm,way,week,weight,woman,won,wood
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
#one_hot.loc[0]
x.toarray()[177]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
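
Columns such as `ain` and `won` in the matrix above come from the way Count Vectorizer tokenizes text: by default it lowercases the input and keeps runs of two or more word characters, so `Ain't` becomes `ain` and the single `t` is dropped. The plain-Python sketch below imitates that behaviour to show how a document-term matrix is assembled. It is an approximation, not scikit-learn's implementation: stop-word removal is omitted, and the titles are made up.

```python
import re
from collections import Counter

# Same shape as CountVectorizer's default token_pattern: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and keep only tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

# Hypothetical titles, made up for illustration.
titles = ["Ain't It Good", "I Won't Go"]

token_counts = [Counter(tokenize(title)) for title in titles]
vocabulary = sorted(set().union(*token_counts))  # one column per term, alphabetical
matrix = [[counts[term] for term in vocabulary] for counts in token_counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary term, which is the same layout as the `one_hot` DataFrame.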

## Question 3 ##

* Take a look at the lyrics for the song "Imagine".
* Which song is the most similar to the song "Imagine"?
     * Use cosine similarity to calculate the similarity
     * Use Count Vectorizer to numerically encode the lyrics
* Find the most similar song using the TF-IDF Vectorizer.

Compare the most similar song of the outputs of both the Count Vectorizer and the TF-IDF Vectorizer.

In [13]:
# 3.1

data1=data[data['song']=='Imagine']
data1

Unnamed: 0,artist,song,link,text
24783,The Beatles,Imagine,/b/beatles/imagine_20254326.html,Imagine there's no heaven \r It's easy if you...


In [15]:
# 3.2

from numpy import dot
from numpy.linalg import norm

In [None]:
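# Encode the lyrics (data['text']) rather than the song titles, so similarity reflects lyric content.
cv1=CountVectorizer(stop_words='english')
x1=cv1.fit_transform(data['text'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
one_hot=pd.DataFrame(x1.toarray(),columns=cv1.get_feature_names_out())
one_hot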

In [None]:
data[data['song']=='Imagine'].index
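
The cells for this question stop before any similarity is computed. As a minimal sketch of the remaining step, the plain-Python example below scores every document against one query using cosine similarity on word counts, the same quantity that `dot(a, b) / (norm(a) * norm(b))` gives for NumPy vectors. The toy lyrics, the dictionary keys, and the `cosine_similarity_counts` helper are made up for illustration; this is not the notebook's solution.

```python
import math
from collections import Counter

def cosine_similarity_counts(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical toy lyrics, made up for illustration.
lyrics = {
    "query": "dream of peace and dream of love",
    "first": "peace for everyone",
    "second": "drive my car down the road",
    "third": "love love love is all we dream",
}

counts = {title: Counter(text.split()) for title, text in lyrics.items()}

# Score every other document against the query and keep the best one.
scores = {title: cosine_similarity_counts(counts["query"], vec)
          for title, vec in counts.items() if title != "query"}
best_title = max(scores, key=scores.get)
print(best_title, round(scores[best_title], 3))
```

The query itself is excluded from the candidates, since any document has a similarity of 1 with itself.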

## Question 4 ##

Which two Beatles songs are the most similar?
   * Using Count Vectorizer
   * Using TF-IDF Vectorizer
     
Compare the results. Which Vectorizer seems to do a better job?

In [18]:
# 4.1

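# Count Vectorizer first
# Note: this encodes the song titles in data['song']; pass data['text'] to compare lyrics instead.

cv2=CountVectorizer(stop_words='english')
sim2=cv2.fit_transform(data['song'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
a2=pd.DataFrame(sim2.toarray(),columns=cv2.get_feature_names_out())
a2.head()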

Unnamed: 0,act,ain,anna,ask,baby,bad,banana,bells,benefit,besame,...,walrus,wanna,want,warm,way,week,weight,woman,won,wood
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [35]:
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

comb_pair=list(combinations(range(a2.shape[0]),2))

comb=[(data['song'].iloc[a_index],data['song'].iloc[b_index]) for (a_index,b_index) in comb_pair]

cos_sim=[cosine_similarity([a2.iloc[a_index]],[a2.iloc[b_index]]) for (a_index,b_index) in comb_pair]

result=sorted(zip(cos_sim,comb),reverse=True)

In [36]:
result

[(array([[1.]]), ('Love Me Do', 'Love You To')),
 (array([[1.]]), ("It's Only Love", 'Love You To')),
 (array([[1.]]), ("It's Only Love", 'Love Me Do')),
 (array([[1.]]), ("I'll Be Back", "I'll Get You")),
 (array([[1.]]), ('Come And Get It', 'Come Together')),
 (array([[1.]]), ("Baby, It's You", 'Cry Baby Cry')),
 (array([[1.]]), ('Another Girl', 'Girl')),
 (array([[1.]]), ('And I Love Her', 'Love You To')),
 (array([[1.]]), ('And I Love Her', 'Love Me Do')),
 (array([[1.]]), ('And I Love Her', "It's Only Love")),
 (array([[0.81649658]]), ("All I've Got To Do", "If You've Got Trouble")),
 (array([[0.81649658]]), ("All I've Got To Do", "I've Got A Feeling")),
 (array([[0.70710678]]), ("Nobody's Child", 'Child Of Nature')),
 (array([[0.70710678]]), ('Love Of The Loved', 'Love You To')),
 (array([[0.70710678]]), ('Love Me Do', 'Love Of The Loved')),
 (array([[0.70710678]]), ('Little Child', "Nobody's Child")),
 (array([[0.70710678]]), ("It's Only Love", 'Love Of The Loved')),
 (array([[0

In [None]:
# 4.2

# Now TF-IDF

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

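tf=TfidfVectorizer(stop_words='english')
# Note: this encodes the song titles in data['song']; pass data['text'] to compare lyrics instead.
p1=tf.fit_transform(data['song'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
s1=pd.DataFrame(p1.toarray(),columns=tf.get_feature_names_out())
s1.head()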

Unnamed: 0,act,ain,anna,ask,baby,bad,banana,bells,benefit,besame,...,walrus,wanna,want,warm,way,week,weight,woman,won,wood
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [39]:
sim_cos=[cosine_similarity([s1.iloc[a_index]],[s1.iloc[b_index]]) for (a_index,b_index) in comb_pair]

result1=sorted(zip(sim_cos,comb),reverse=True)
result1

[(array([[1.]]), ('Love Me Do', 'Love You To')),
 (array([[1.]]), ("It's Only Love", 'Love You To')),
 (array([[1.]]), ("It's Only Love", 'Love Me Do')),
 (array([[1.]]), ("I'll Be Back", "I'll Get You")),
 (array([[1.]]), ('Come And Get It', 'Come Together')),
 (array([[1.]]), ("Baby, It's You", 'Cry Baby Cry')),
 (array([[1.]]), ('Another Girl', 'Girl')),
 (array([[1.]]), ('And I Love Her', 'Love You To')),
 (array([[1.]]), ('And I Love Her', 'Love Me Do')),
 (array([[1.]]), ('And I Love Her', "It's Only Love")),
 (array([[0.80861869]]),
  ('If I Needed Someone', 'If I Needed Someone To Love')),
 (array([[0.80861869]]), ('All You Need Is Love', 'I Need You')),
 (array([[0.76815958]]), ("Don't Let Me Down", 'Let It Be')),
 (array([[0.75982537]]), ('Got To Get It Into My Life', 'In My Life')),
 (array([[0.74579195]]), ("All I've Got To Do", "If You've Got Trouble")),
 (array([[0.74579195]]), ("All I've Got To Do", "I've Got A Feeling")),
 (array([[0.73303808]]), ('Cry Baby Cry', 'If Yo

### Note:
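The TF-IDF results look better than the Count Vectorizer results. Below the exact matches (1.0), the TF-IDF cosine similarity scores decrease gradually (0.809, 0.768, 0.760, 0.746, 0.733), whereas the Count Vectorizer scores drop in large steps (1.0, 0.816, 0.707). Both comparisons were computed on the song titles in `data['song']`, not on the lyrics in `data['text']`.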
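
One likely reason the two vectorizers rank pairs differently is that TF-IDF gives less weight to words that occur in many documents. The plain-Python sketch below illustrates this on three made-up documents that all share one word. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, which is scikit-learn's documented default, and omits stop-word removal. The document names and texts are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy documents, made up for illustration; "love" occurs in all of them.
docs = {
    "song_a": "love love me do",
    "song_b": "love you to",
    "song_c": "yellow submarine love",
}

def cosine(u, v):
    """Cosine similarity between two term -> weight mappings."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Raw term counts per document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Document frequency and smoothed inverse document frequency per term.
n_docs = len(docs)
df = Counter(term for c in counts.values() for term in c)
idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

# TF-IDF weights: term count multiplied by the term's IDF.
tfidf = {name: {t: tf * idf[t] for t, tf in c.items()} for name, c in counts.items()}

for a, b in combinations(docs, 2):
    print(a, b, round(cosine(counts[a], counts[b]), 3), round(cosine(tfidf[a], tfidf[b]), 3))
```

Because every pair here shares only the common word, each TF-IDF score comes out lower than the corresponding count-based score.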