# spaCy POS tagger: is it good for us?
In this notebook we take a quick look at spaCy's built-in POS tagger to see whether it works well in our context (song lyrics). We will read some lyrics, parse them with spaCy, and inspect the output.

Note that spaCy's POS tagger provides two tags per token:
 - `pos`: coarse-grained part-of-speech, e.g. `VERB`
 - `tag`: fine-grained part-of-speech, e.g. `VBD` (verb, past tense)
 
We will analyze both tags.

In [42]:
import spacy

In [4]:
nlp = spacy.load('en_core_web_lg')

We will begin with Nirvana's song "Polly" because... I like it ([link](https://www.youtube.com/watch?v=atx9X4s6_zg)).

In [129]:
path = 'ml_lyrics/sad_Nirvana_Polly'

In [130]:
with open(path, 'r') as f:
    lyrics = f.read()
    print(lyrics)

Polly wants a cracker
Think I should get off her first
Think she wants some water
To put out the blow torch

Isn't me, have a seed
Let me clip, dirty wings
Let me take a ride, cut yourself
Want some help, please myself
Got some rope, have been told
Promise you, have been true
Let me take a ride, cut yourself
Want some help, please myself

Polly wants a cracker
Maybe she would like some food
She asks me to untie her
Chase would be nice for a few

Isn't me, have a seed
Let me clip, dirty wings
Let me take a ride, cut yourself
Want some help, please myself
Got some rope, have been told
Promise you, have been true
Let me take a ride, cut yourself
Want some help, please myself

Polly said

Polly says her back hurts
She's just as bored as me
She caught me off my guard
'Mazes me the will of instinct

Isn't me, have a seed
Let me clip, dirty wings
Let me take a ride, cut yourself
Want some help, please myself
Got some rope, have been told
Promise you, have been true
Let me take a ride, cut you

There are some duplicate parts. I will drop them.

In [131]:
parts = lyrics.split('\n\n')
print('We had {} parts'.format(len(parts)))
parts = list(set(parts))
print('Now we have {} parts'.format(len(parts)))

We had 7 parts
Now we have 5 parts
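Note that `list(set(parts))` does not keep the original order, which is why the stanzas are printed in a different order below. If the order mattered, an order-preserving variant could look like the following minimal sketch. The helper name `drop_duplicate_parts` and the sample text are hypothetical and not part of the notebook.

```python
def drop_duplicate_parts(text, sep="\n\n"):
    """Split text on sep and keep the first occurrence of each part, in order."""
    parts = text.split(sep)
    # dict keys keep insertion order (Python 3.7+), so duplicates are
    # dropped without shuffling the remaining parts
    return list(dict.fromkeys(parts))


# Hypothetical sample with a repeated chorus
sample = "verse A\n\nchorus\n\nverse B\n\nchorus"
unique_parts = drop_duplicate_parts(sample)
print(unique_parts)
```

For the line-by-line analysis done in this notebook the order is irrelevant, so the `set`-based version is sufficient here.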


In [132]:
lyrics = '\n\n'.join(parts)
lyrics = nlp(lyrics)
print(lyrics)

Polly wants a cracker
Think I should get off her first
Think she wants some water
To put out the blow torch

Polly said

Polly says her back hurts
She's just as bored as me
She caught me off my guard
'Mazes me the will of instinct

Isn't me, have a seed
Let me clip, dirty wings
Let me take a ride, cut yourself
Want some help, please myself
Got some rope, have been told
Promise you, have been true
Let me take a ride, cut yourself
Want some help, please myself

Polly wants a cracker
Maybe she would like some food
She asks me to untie her
Chase would be nice for a few


In [133]:
def get_lines():
    return lyrics.text.split('\n')

In [134]:
lines = get_lines()

In [135]:
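def describe_line(line):
    """Parse one lyric line and print its coarse tags, then each token with both tags."""
    doc = nlp(line)
    print(doc)
    # Coarse-grained tags of the whole line, on a single row
    for token in doc:
        print(token.pos_, end=' ')
    print('\n')
    # One row per token: text, coarse tag, fine tag and the fine tag's description
    for token in doc:
        print(token.text, '=', token.pos_, token.tag_, '->', spacy.explain(token.tag_))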

First, let's look at the first line and see what we get.

In [120]:
# Let's have a look at the first line
describe_line(lines[0])

Polly wants a cracker
PROPN VERB DET NOUN 

Polly = PROPN NNP -> noun, proper singular
wants = VERB VBZ -> verb, 3rd person singular present
a = DET DT -> determiner
cracker = NOUN NN -> noun, singular or mass


Makes sense! Problems may still show up over the whole song, so let's check.

In [121]:
for line in lines:
    describe_line(line)
    print('-'*30)

Polly wants a cracker
PROPN VERB DET NOUN 

Polly = PROPN NNP -> noun, proper singular
wants = VERB VBZ -> verb, 3rd person singular present
a = DET DT -> determiner
cracker = NOUN NN -> noun, singular or mass
------------------------------
Think I should get off her first
VERB PRON VERB VERB ADP ADJ ADJ 

Think = VERB VBP -> verb, non-3rd person singular present
I = PRON PRP -> pronoun, personal
should = VERB MD -> verb, modal auxiliary
get = VERB VB -> verb, base form
off = ADP IN -> conjunction, subordinating or preposition
her = ADJ PRP$ -> pronoun, possessive
first = ADJ JJ -> adjective
------------------------------
Think she wants some water
VERB PRON VERB DET NOUN 

Think = VERB VB -> verb, base form
she = PRON PRP -> pronoun, personal
wants = VERB VBZ -> verb, 3rd person singular present
some = DET DT -> determiner
water = NOUN NN -> noun, singular or mass
------------------------------
To put out the blow torch
PART VERB PART DET NOUN NOUN 

To = PART TO -> infinitival to
put

The tagger seems to do quite well. That is probably because we are looking at the song line by line (sentence by sentence). It is worth checking what happens when we process a song as a whole. We do this only for completeness: in our feature engineering we never consider entire songs, only single lines, one at a time.

In [122]:
for token in lyrics:
    print(token.text, '=', token.pos_, '->', spacy.explain(token.pos_))

Polly = PROPN -> proper noun
wants = VERB -> verb
a = DET -> determiner
cracker = NOUN -> noun

 = SPACE -> space
Think = VERB -> verb
I = PRON -> pronoun
should = VERB -> verb
get = VERB -> verb
off = ADP -> adposition
her = ADJ -> adjective
first = ADJ -> adjective

 = SPACE -> space
Think = VERB -> verb
she = PRON -> pronoun
wants = VERB -> verb
some = DET -> determiner
water = NOUN -> noun

 = SPACE -> space
To = PART -> particle
put = VERB -> verb
out = PART -> particle
the = DET -> determiner
blow = NOUN -> noun
torch = NOUN -> noun


 = SPACE -> space
Polly = PROPN -> proper noun
said = VERB -> verb


 = SPACE -> space
Polly = PROPN -> proper noun
says = VERB -> verb
her = ADJ -> adjective
back = NOUN -> noun
hurts = VERB -> verb

 = SPACE -> space
She = PRON -> pronoun
's = VERB -> verb
just = ADV -> adverb
as = ADV -> adverb
bored = ADJ -> adjective
as = ADP -> adposition
me = PRON -> pronoun

 = SPACE -> space
She = PRON -> pronoun
caught = VERB -> verb
me = PRON -> pronoun
o

The results appear to be the same as those obtained above. We can therefore conclude that analyzing a song as a whole (which we do not do, by the way) raises no particular issues compared with analyzing it line by line.
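The comparison above was made by eye. As a minimal sketch, the same check could be done programmatically by comparing two sequences of `(token, tag)` pairs. The helper `tag_mismatches` is hypothetical. The sample pairs are copied from the two outputs above for the line "Think I should get off her first".

```python
def tag_mismatches(tagged_a, tagged_b):
    """Compare two (token, tag) sequences and list the positions whose tags differ."""
    tokens_a = [tok for tok, _ in tagged_a]
    tokens_b = [tok for tok, _ in tagged_b]
    if tokens_a != tokens_b:
        raise ValueError("token sequences differ; the tags cannot be aligned")
    return [
        (i, tok, tag_a, tag_b)
        for i, ((tok, tag_a), (_, tag_b)) in enumerate(zip(tagged_a, tagged_b))
        if tag_a != tag_b
    ]


# Coarse tags from the line-by-line output.
# With spaCy, such pairs could be built as
# [(t.text, t.pos_) for t in doc if not t.is_space].
per_line = [("Think", "VERB"), ("I", "PRON"), ("should", "VERB"), ("get", "VERB"),
            ("off", "ADP"), ("her", "ADJ"), ("first", "ADJ")]
# Coarse tags for the same tokens from the whole-song output
whole_song = [("Think", "VERB"), ("I", "PRON"), ("should", "VERB"), ("get", "VERB"),
              ("off", "ADP"), ("her", "ADJ"), ("first", "ADJ")]

mismatches = tag_mismatches(per_line, whole_song)
print(mismatches)  # an empty list means the two runs agree on this line
```

Whitespace tokens are excluded because the whole-song parse contains `SPACE` tokens for the line breaks, which the per-line parse does not.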

## Maybe not with a 'weirder' song?

Analyzing Nirvana's "Polly" is not enough on its own, as it contains no particularly strange words, slang, or abbreviations. Among the songs available in MoodyLyrics we found one that is particularly long and contains plenty of slang and abbreviations: "Kiss You Back" by Digital Underground ([link](https://www.youtube.com/watch?v=szjoplYtqW4)).

We'll repeat the same steps as before (reading the file and eliminating duplicate parts).

In [142]:
path = 'ml_lyrics/happy_Digital Underground_Kiss You Back'

In [143]:
with open(path, 'r') as f:
    lyrics = f.read()
    print(lyrics)

Shimmy shimmy cocoa-pop
Shimmy shimmy cocoa-pop
Shimmy shimmy cocoa-pop
Shimmy shimmy cocoa-pop
We-we chocolate cross-over
Yeah, we chocolate cross-over
See me cocoa might go pop
I'm cocoa and I might go pop

Now it's about time that I cleared this
So pardon me miss
But I'd like for you to hear this
If you kiss me then I'll kiss you back
You see, I feel real good inside, and it's just from your nearness
There's no need for you to fear this
Kiss me, I'll kiss you back
Mmmm-<kiss>

Well ya look kinda cute to me
I think we can achieve this
Plus you act like you need this
Kiss me and I'll kiss you back
You act real fly
Money-B's not buyin' it
Quit denyin' it
You're better off tryin' it
Freak me girl and I'll freak you back
(Duh nuh nuh nuh nuh nuh nuuuuh)

Through any kinda weather
Will me and you forever
Stay together
Well I just don't know
But I'll tell ya what though
If you kiss me then I'll kiss you back
(Kiss you back)
And I guess you wanna know if I'm gonna be around
I ain't sure but

In [144]:
parts = lyrics.split('\n\n')
print('We had {} parts'.format(len(parts)))
parts = list(set(parts))
print('Now we have {} parts'.format(len(parts)))

We had 22 parts
Now we have 22 parts


Nothing changed this time, so we can keep using the lyrics exactly as we read them.

In [145]:
lyrics = nlp(lyrics)
lines = get_lines()

In [146]:
# Let's have a look at the first line
describe_line(lines[0])

Shimmy shimmy cocoa-pop
VERB VERB NOUN PUNCT NOUN 

Shimmy = VERB VB -> verb, base form
shimmy = VERB VB -> verb, base form
cocoa = NOUN NN -> noun, singular or mass
- = PUNCT HYPH -> punctuation mark, hyphen
pop = NOUN NN -> noun, singular or mass


The only thing to note is that "cocoa-pop" is not treated as a single token as we would expect. Instead, the '-' is treated as a standalone token, which splits 'cocoa-pop' into three tokens.
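If single tokens for hyphenated words were needed, one option would be to glue the pieces back together after tokenization. The following is a minimal sketch that works on plain token strings. The helper `merge_hyphenated` is hypothetical and not used in this notebook, and it leaves open which tag the merged token should carry.

```python
def merge_hyphenated(tokens):
    """Join 'left', '-', 'right' token triples back into a single 'left-right' token."""
    merged = []
    i = 0
    while i < len(tokens):
        # A hyphen with a token on both sides glues its neighbours together
        if tokens[i] == "-" and merged and i + 1 < len(tokens):
            merged[-1] = merged[-1] + "-" + tokens[i + 1]
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Token texts as printed for the first line above
first_line = merge_hyphenated(["Shimmy", "shimmy", "cocoa", "-", "pop"])
print(first_line)
```

A hyphen at the very start or end of the list has no neighbour on one side, so it is kept as a separate token.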

Apart from this, the first line poses no problems. Let's analyze all the lines one by one and see whether any issues come up. Since we will analyze this song only line by line (we saw that analyzing it as a whole does not change anything), we first remove all duplicate lines.

In [151]:
lines = list(set(lines))
# Also drop empty lines
lines = [line for line in lines if len(line) > 0]

In [152]:
for line in lines:
    describe_line(line)
    print('-'*30)

If you hold my nuts I'll
ADP PRON VERB ADJ NOUN PRON VERB 

If = ADP IN -> conjunction, subordinating or preposition
you = PRON PRP -> pronoun, personal
hold = VERB VBP -> verb, non-3rd person singular present
my = ADJ PRP$ -> pronoun, possessive
nuts = NOUN NNS -> noun, plural
I = PRON PRP -> pronoun, personal
'll = VERB MD -> verb, modal auxiliary
------------------------------
You act real fly
PRON VERB ADV NOUN 

You = PRON PRP -> pronoun, personal
act = VERB VBP -> verb, non-3rd person singular present
real = ADV RB -> adverb
fly = NOUN NN -> noun, singular or mass
------------------------------
Yeah, we chocolate cross-over
INTJ PUNCT PRON VERB NOUN PUNCT PART 

Yeah = INTJ UH -> interjection
, = PUNCT , -> punctuation mark, comma
we = PRON PRP -> pronoun, personal
chocolate = VERB VBP -> verb, non-3rd person singular present
cross = NOUN NN -> noun, singular or mass
- = PUNCT HYPH -> punctuation mark, hyphen
over = PART RP -> adverb, particle
------------------------------
<Clea

See me? I'm cocoa, and I might go pop
VERB PRON PUNCT PRON VERB NOUN PUNCT CCONJ PRON VERB VERB VERB 

See = VERB VB -> verb, base form
me = PRON PRP -> pronoun, personal
? = PUNCT . -> punctuation mark, sentence closer
I = PRON PRP -> pronoun, personal
'm = VERB VBP -> verb, non-3rd person singular present
cocoa = NOUN NN -> noun, singular or mass
, = PUNCT , -> punctuation mark, comma
and = CCONJ CC -> conjunction, coordinating
I = PRON PRP -> pronoun, personal
might = VERB MD -> verb, modal auxiliary
go = VERB VB -> verb, base form
pop = VERB VB -> verb, base form
------------------------------
(Kiss you back)
PUNCT VERB PRON PART PUNCT 

( = PUNCT -LRB- -> left round bracket
Kiss = VERB VB -> verb, base form
you = PRON PRP -> pronoun, personal
back = PART RP -> adverb, particle
) = PUNCT -RRB- -> right round bracket
------------------------------
Kiss me then I'll kiss you back
VERB PRON ADV PRON VERB VERB PRON PART 

Kiss = VERB VB -> verb, base form
me = PRON PRP -> pronoun, pers

We chocolate might cross over
PRON NOUN VERB VERB PART 

We = PRON PRP -> pronoun, personal
chocolate = NOUN NN -> noun, singular or mass
might = VERB MD -> verb, modal auxiliary
cross = VERB VB -> verb, base form
over = PART RP -> adverb, particle
------------------------------
Money-B's not buyin' it
NOUN PUNCT NOUN PART ADV NOUN PUNCT PRON 

Money = NOUN NN -> noun, singular or mass
- = PUNCT HYPH -> punctuation mark, hyphen
B = NOUN NN -> noun, singular or mass
's = PART POS -> possessive ending
not = ADV RB -> adverb
buyin = NOUN NN -> noun, singular or mass
' = PUNCT '' -> closing quotation mark
it = PRON PRP -> pronoun, personal
------------------------------
Come one ladies, one more time, kick it
VERB NUM NOUN PUNCT NUM ADJ NOUN PUNCT VERB PRON 

Come = VERB VB -> verb, base form
one = NUM CD -> cardinal number
ladies = NOUN NNS -> noun, plural
, = PUNCT , -> punctuation mark, comma
one = NUM CD -> cardinal number
more = ADJ JJR -> adjective, comparative
time = NOUN NN -> noun

I'm hopin' that you hear me 'cause I love it when you're near me
PRON VERB NOUN PUNCT ADP PRON VERB PRON ADP PRON VERB PRON ADV PRON VERB ADP PRON 

I = PRON PRP -> pronoun, personal
'm = VERB VBP -> verb, non-3rd person singular present
hopin = NOUN NN -> noun, singular or mass
' = PUNCT '' -> closing quotation mark
that = ADP IN -> conjunction, subordinating or preposition
you = PRON PRP -> pronoun, personal
hear = VERB VBP -> verb, non-3rd person singular present
me = PRON PRP -> pronoun, personal
'cause = ADP IN -> conjunction, subordinating or preposition
I = PRON PRP -> pronoun, personal
love = VERB VBP -> verb, non-3rd person singular present
it = PRON PRP -> pronoun, personal
when = ADV WRB -> wh-adverb
you = PRON PRP -> pronoun, personal
're = VERB VBP -> verb, non-3rd person singular present
near = ADP IN -> conjunction, subordinating or preposition
me = PRON PRP -> pronoun, personal
------------------------------
If you play with my tummy I'll tickle your feet
ADP PRON VERB 

You know I know you knew this so I guess that we could do this
PRON VERB PRON VERB PRON VERB DET ADP PRON VERB ADP PRON VERB VERB DET 

You = PRON PRP -> pronoun, personal
know = VERB VBP -> verb, non-3rd person singular present
I = PRON PRP -> pronoun, personal
know = VERB VBP -> verb, non-3rd person singular present
you = PRON PRP -> pronoun, personal
knew = VERB VBD -> verb, past tense
this = DET DT -> determiner
so = ADP IN -> conjunction, subordinating or preposition
I = PRON PRP -> pronoun, personal
guess = VERB VBP -> verb, non-3rd person singular present
that = ADP IN -> conjunction, subordinating or preposition
we = PRON PRP -> pronoun, personal
could = VERB MD -> verb, modal auxiliary
do = VERB VB -> verb, base form
this = DET DT -> determiner
------------------------------
Quit denyin' it
VERB VERB PUNCT PRON 

Quit = VERB VB -> verb, base form
denyin = VERB VB -> verb, base form
' = PUNCT '' -> closing quotation mark
it = PRON PRP -> pronoun, personal
----------------------

me = PRON PRP -> pronoun, personal
cocoa = NOUN NN -> noun, singular or mass
might = VERB MD -> verb, modal auxiliary
go = VERB VB -> verb, base form
pop = VERB VB -> verb, base form
go = NOUN NN -> noun, singular or mass
pop = NOUN NN -> noun, singular or mass
------------------------------
If you kiss me girl I'll kiss you back
ADP PRON VERB PRON NOUN PRON VERB VERB PRON PART 

If = ADP IN -> conjunction, subordinating or preposition
you = PRON PRP -> pronoun, personal
kiss = VERB VBP -> verb, non-3rd person singular present
me = PRON PRP -> pronoun, personal
girl = NOUN NN -> noun, singular or mass
I = PRON PRP -> pronoun, personal
'll = VERB MD -> verb, modal auxiliary
kiss = VERB VB -> verb, base form
you = PRON PRP -> pronoun, personal
back = PART RP -> adverb, particle
------------------------------
Let's just keep it cool, you know what I'm sayin'?
VERB PRON ADV VERB PRON ADJ PUNCT PRON VERB NOUN PRON VERB VERB PUNCT PUNCT 

Let = VERB VB -> verb, base form
's = PRON PRP -> pro

An interesting thing is that the POS tagger properly recognizes contractions such as "'ll", which it tags as a modal auxiliary.

Another important behaviour shows up in the line "Yeah, we chocolate cross-over". Here the word "chocolate" is used as a verb (even though dictionaries do not list it as one), and the POS tagger recognizes that. This matters because such usages are very common in songs.

Let's now consider a very long line and its analysis from above:
```
Jus't havin' fun with it, man, know what I'm sayin'?
NOUN VERB NOUN ADP PRON PUNCT INTJ PUNCT VERB NOUN PRON VERB VERB PUNCT PUNCT 

Jus't = NOUN NNS -> noun, plural
havin' = VERB VBG -> verb, gerund or present participle
fun = NOUN NN -> noun, singular or mass
with = ADP IN -> conjunction, subordinating or preposition
it = PRON PRP -> pronoun, personal
, = PUNCT , -> punctuation mark, comma
man = INTJ UH -> interjection
, = PUNCT , -> punctuation mark, comma
know = VERB VB -> verb, base form
what = NOUN WP -> wh-pronoun, personal
I = PRON PRP -> pronoun, personal
'm = VERB VBP -> verb, non-3rd person singular present
sayin = VERB VBG -> verb, gerund or present participle
' = PUNCT '' -> closing quotation mark
? = PUNCT . -> punctuation mark, sentence closer
```

Two lines stand out here: `havin' = VERB VBG -> verb, gerund or present participle` and `sayin = VERB VBG -> verb, gerund or present participle`. Even though the final "g" is dropped, the tagger still recognizes both words as present participles.

Another impressive result in the analysis above is that "man" is recognized as an interjection. An interjection is a part of speech that shows the emotion or feeling of the author. These words or phrases can stand alone or be placed before or after a sentence. An interjection is often followed by a punctuation mark, frequently an exclamation point ([reference](http://examples.yourdictionary.com/examples-of-interjections.html)). This description fits the usage of the word "man" in this context perfectly, yet it was not trivial to detect. This behaviour is certainly very useful because songs tend to use such expressions quite often.

Most of the rest looks fine, although forms with a dropped "g" are not always handled: "hopin" and "buyin" are tagged as nouns in the output above.
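Forms with a dropped "g" appear throughout the output above (`havin'`, `sayin'`, `hopin'`, `buyin'`). One possible pre-processing step, which is not applied in this notebook, would be to restore the "g" before tagging. The following is a minimal sketch using a regular expression. The helper `restore_dropped_g` is hypothetical, and whether it actually improves the tags would still have to be checked.

```python
import re

# One or more word characters ending in "in", followed by an apostrophe
# that is not itself followed by a word character (so "'cause" is untouched)
_DROPPED_G = re.compile(r"\b(\w+in)'(?!\w)")


def restore_dropped_g(line):
    """Rewrite forms such as "buyin'" as "buying"."""
    return _DROPPED_G.sub(r"\1g", line)


# Lines taken from the lyrics analyzed above
restored = [
    restore_dropped_g("Money-B's not buyin' it"),
    restore_dropped_g("I'm hopin' that you hear me 'cause I love it when you're near me"),
]
for line in restored:
    print(line)
```

This rule is deliberately simple and can misfire: a word that really ends in "in" and is followed by a closing single quote would also be rewritten.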

## Still good with a vulgar song?

**DISCLAIMER**: we do not mean to be impolite. We are discussing a vulgar song only because we want to see whether our POS tagger also performs well in such situations. Vulgar songs are in fact quite common, especially in rap.

For this purpose we analyzed "The Ballad Of Chasey Lain" by the Bloodhound Gang.

This song is not contained in MoodyLyrics so we need to download the lyrics.

In [154]:
import lyricwikia
lyrics = lyricwikia.get_lyrics('Bloodhound Gang', 'The Ballad Of Chasey Lain')

In [155]:
print(lyrics)

Dear Chasey Lain
I wrote to explain
I'm your biggest fan
I just wanted to ask
Could I eat your ass?
Write back as soon as you can

You've had a lotta dick
Had a lotta dick
I've had a lotta time
Had a lotta time
You've had a lotta dick Chasey
But you ain't had mine

Dear Chasey Lain
I wrote to complain
Ya never wrote me back
How could I ever eat
Your ass when ya treat
Your biggest fan like that?

You've had a lotta dick
Had a lotta dick
I've had a lotta time
Had a lotta time
You've had a lotta dick Chasey
But you ain't had mine

Dear Chasey Lain
I wrote to constrain
This letter is my last
As your biggest fan
I must demand
You let me eat your ass

You've had a lotta dick
Had a lotta dick
I've had a lotta time
Had a lotta time
You've had a lotta dick Chasey
But you ain't had mine

P. S
Mom and Dad this is Chasey
Chasey this is my mom and dad
Now show 'em them titties
Now show 'em them titties
P. S
Mom and Dad this is Chasey
Chasey this is my mom and dad
Now show 'em them titties
Now show 

Let's get straight to the point and analyze the whole lyrics after dropping duplicates.

In [156]:
parts = lyrics.split('\n\n')
print('We had {} parts'.format(len(parts)))
parts = list(set(parts))
print('Now we have {} parts'.format(len(parts)))

We had 8 parts
Now we have 6 parts


In [157]:
lyrics = nlp('\n\n'.join(parts))  # parse the de-duplicated parts, not the raw text
lines = get_lines()

In [158]:
lines = list(set(lines))
# Also drop empty lines
lines = [line for line in lines if len(line) > 0]

In [159]:
for line in lines:
    describe_line(line)
    print('-'*30)

Had a lotta dick
VERB DET NOUN NOUN 

Had = VERB VBD -> verb, past tense
a = DET DT -> determiner
lotta = NOUN NN -> noun, singular or mass
dick = NOUN NN -> noun, singular or mass
------------------------------
You've had a lotta dick Chasey
PRON VERB VERB DET NOUN NOUN PROPN 

You = PRON PRP -> pronoun, personal
've = VERB VB -> verb, base form
had = VERB VBN -> verb, past participle
a = DET DT -> determiner
lotta = NOUN NN -> noun, singular or mass
dick = NOUN NN -> noun, singular or mass
Chasey = PROPN NNP -> noun, proper singular
------------------------------
I just wanted to ask
PRON ADV VERB PART VERB 

I = PRON PRP -> pronoun, personal
just = ADV RB -> adverb
wanted = VERB VBD -> verb, past tense
to = PART TO -> infinitival to
ask = VERB VB -> verb, base form
------------------------------
Could I eat your ass?
VERB PRON VERB ADJ NOUN PUNCT 

Could = VERB MD -> verb, modal auxiliary
I = PRON PRP -> pronoun, personal
eat = VERB VB -> verb, base form
your = ADJ PRP$ -> pronoun, 

We will not go into much detail here. We will just say that, again, we do not notice any problems in the tagging.

## Conclusion

The POS tagger included in the spaCy language model we are using is quite impressive. We did not find a specific case in which it performed really badly. It is certainly good enough for our purpose.