# Song lyrics parsing
Song lyrics are some of the easiest linguistic data to access, and the tooling (scrapers, parsers and so on) is readily available. But how reliable are these tools? This brief study takes one song and checks how well standard parsers cope with AAVE. The main focus is the valency of verbs in AAVE. My hypothesis is that it differs significantly from valency in Standard English.

## What is AAVE
AAVE, or African American Vernacular English, is a full-fledged variety of English spoken by most Black Americans. The variety originated among the enslaved people brought to America from Africa. The specifics of its formation are murky, but there seems to be some overlap with the original African languages (Green 2002: 9).

## Valency in English
Valency is not a terribly popular framework in English linguistics, which may explain why it has received little real consideration in AAVE research. This is a shame, as valency changes seem to be quite common and are an unexplored facet of the variety. One of the premier resources for English valency is the [Erlangen Valency Patternbank](http://www.patternbank.uni-erlangen.de/cgi-bin/patternbank.cgi), created and still maintained by the University of Erlangen-Nürnberg. I will take the verbs found in the lyrics and check them against the Patternbank to see how different AAVE valency is. This will also let us see how a standard dependency parser deals with AAVE sentences.

## Researching valency in AAVE
I will clean up the data with the Python script below. I will then annotate the valency by hand, following the system used by the Erlangen Valency Patternbank. I will leave out subjects: Erlangen always lists them, but they are quite often missing in the text. The text, "Gotta Have It" by Kanye West and Jay-Z, was chosen because the [AAVE Wikipedia page](https://en.wikipedia.org/wiki/African-American_Vernacular_English) mentions it as showing AAVE characteristics.

In [13]:
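# Third-party packages: requests, regex and beautifulsoup4
import requests
import regex as re
from bs4 import BeautifulSoup

# Download the lyrics page and reduce it to plain text
url = "https://www.azlyrics.com/lyrics/kanyewest/gottahaveit.html"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
plain_text = soup.get_text()

# Turn line breaks into full stops so that each lyric line can be split off later
plain_text = re.sub(r'\n', '.', plain_text)
# Cut everything before the quoted song title and everything after the lyrics
plain_text = re.sub(r'^.*?Lyrics....."', '"', plain_text, count=2)
plain_text = re.sub(r'Submit Corrections.*$', '', plain_text)
# Remove bracketed tags
plain_text = re.sub(r'\[.*?\]', '', plain_text)
# Collapse runs of full stops and drop the whitespace in front of them
plain_text = re.sub(r'\.+', '. ', plain_text)
plain_text = re.sub(r'\s\.', '.', plain_text)
# Remove parenthesised passages
plain_text = re.sub(r'\(.*?\)', '', plain_text)

# Print one lyric line per row
plain_lista = plain_text.split(".")
for x in plain_lista:
    print(x)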

"Gotta Have It"
 Turn my headphones up, louder, uh-huh, uh-huh
 What you need, what–what you need
 I got what you need, what–what you need 
 What you need, what–what you need
 I got what you need
 Hello, hello, hello, hello, white America, assassinate my character
 Money matrimony, yeah, they tryna break the marriage up
 Who gon' act phony, or who gon' try to embarrass ya?
 I'ma need a day off, I think I'll call Ferris up
 Bueller had a Muller but I switched it for a Mille
 'Cause I'm richer, and prior to this shit, was movin' free base
 Had a conference with the DJs , Puerto Rico three days
 Poli with the PDs, now they got our shit on replay
 Sorry I'm in pajamas but I just got off the PJ
 And last party we had, they shut down Prive
 Ain't that where the Heat play? 
 Niggas hate ballers these days 
 Ain't that like LeBron James?
 Ain't that just like D-Wade? Wait
 What you need, what–what you need
 I got what you need, what–what you need 
 What you need, what–what you need
 I got what
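
The two central cleaning steps, removing bracketed tags and parenthesised passages, can also be tried out offline on a short string. The following is a minimal sketch using only the standard-library `re` module; the helper name `clean_lyrics` and the sample string are made up for illustration and are not part of the script above.

```python
import re


def clean_lyrics(raw: str) -> list[str]:
    """Return one cleaned lyric line per non-empty input line."""
    text = re.sub(r"\[.*?\]", "", raw)   # drop bracketed tags
    text = re.sub(r"\(.*?\)", "", text)  # drop parenthesised passages
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line]


# Hypothetical sample input, not taken from the scraped page
sample = "[Jay-Z:]\nWhat you need (uh-huh)\n\nI got what you need\n"
cleaned = clean_lyrics(sample)
print(cleaned)
```

Splitting on the original line breaks, rather than converting them to full stops first, avoids cutting a lyric line at any full stop it may contain.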

Here is the annotated text:

"Gotta Have It", **Have (got) + to_INF, have + NP - in Erlangen**
 Turn my headphones up, louder, uh-huh, uh-huh, **Turn + NP + up - in Erlangen**
 What you need, what–what you need, **Need + NP - in Erlangen**
 I got what you need, what–what you need, **Have (got) + wh_CL (inconclusive, there is a pattern: wh_CL + VHCact + NP),   Need + NP - in Erlangen**
 What you need, what–what you need, **Need + NP - in Erlangen**
 I got what you need, **Have (got) + NP, Need + NP - in Erlangen**
 Hello, hello, hello, hello, white America, assassinate my character, **assassinate + NP - not in Erlangen, but seemingly in the standard** 
 Money matrimony, yeah, they tryna break the marriage up, **try + to_INF, break + NP + up - in Erlangen**
 Who gon' act phony, or who gon' try to embarrass ya? **act + AdjP, try + to_INF, embarrass + NP - act and try in Erlangen, embarrass not in Erlangen, but seems to be typical**
 I'ma need a day off, I think I'll call Ferris up, **need + NP, think + that_CL, call + NP + up - in Erlangen**
 Bueller had a Muller but I switched it for a Mille, **have + NP, switch + NP + for_NP - first one in Erlangen, second one not, but pretty standard**
 'Cause I'm richer, and prior to this shit, was movin' free base, **be + AdjP, move + NP - in Erlangen**
 Had a conference with the DJs , Puerto Rico three days. **have + NP - in Erlangen**
 Poli with the PDs, now they got our shit on replay, **have + NP - in Erlangen**
 Sorry I'm in pajamas but I just got off the PJ, **be + in_NP, get + off + NP - interestingly both are not in Erlangen**
 And last party we had, they shut down Prive, **have + NP, shut + down + NP - in Erlangen**
 Ain't that where the Heat play? **be + wh_CL, play - in Erlangen**
 Niggas hate ballers these days, **hate + NP - in Erlangen**
 Ain't that like LeBron James? **be + NP - in Erlangen**
 Ain't that just like D-Wade? Wait, **be + NP, wait - in Erlangen**
 What you need, what–what you need, **Need + NP - in Erlangen**
 I got what you need, what–what you need, **Need + NP - in Erlangen**
 What you need, what–what you need, **Need + NP - in Erlangen**
 I got what you need **Have (got) + wh_CL (inconclusive, there is a pattern: wh_CL + VHCact + NP), Need + NP - in Erlangen**
 Wussup, wussup, wussup, wussup
 Wussup, muh'fucka? Where my money at? **a skipped be, also a very new construction with at, not in Erlangen**
 You gon' make me come down to your house where yo' mommy at, **make + NP + INF, be + at - first in Erlangen, second not**
 Mummy wrap the kids, have 'em cryin' for they mommy back, **wrap + NP, have + NP_V-ing (it is in Erlangen, just with a different sense), cry for_NP_back (the version with back not in Erlangen)**
 Dummy that your daddy is, tell 'em I just want my racks, **be + NP, tell + NP + that_CL, want + NP - in Erlangen**
 Racks on racks on racks 
 Maybachs on bachs on bachs on bachs on bachs
 Who in that? Oh shit, it's just blacks on blacks on blacks **be + in_NP, be + NP - first one not in Erlangen, second one yes**
 Hunnid stack–How you get it? Nigga layin' raps on tracks, **get + NP, lay + NP + on_NP - both in Erlangen**
 I wish I could give you this feeling, I'm planking on a million, **wish + that_CL, give + NP + NP - both in Erlangen**
 I'm riding through yo' hood, you can bank I ain't got no ceiling, **ride + through_NP (it is not in Erlangen, but that depends on whether we see through_NP as compulsory), bank + that_CL (not in the Patternbank, not a typical usage), have + NP - in Erlangen**
 Made a left on Nostrand Ave, **make + NP - in Erlangen, although not in this sense**
 we in Bed Stuy, **again skipped be, be + in - not in Erlangen**
 Made a right on 79th, I'm coming down South Shore Drive, **make + NP - in Erlangen, but again in a different sense, come + down_NP - not in Erlangen**
 I remain Chi-town, Brooklyn 'til I die, **remain + NP, die - in Erlangen**
 Take 'em on home, take 'em on home, **take + NP + on + NP (not in Erlangen, depending on whether one takes "home" here as compulsory)**
 I got what you need, what–what you need, **Have (got) + wh_CL (inconclusive, there is a pattern: wh_CL + VHCact + NP), Need + NP - in Erlangen**
 Tryna hurt my name, huh? **try + to_INF, hurt + NP - first in Erlangen, second not, but pretty typical**

There are quite a few lexical differences between the text and Standard English, and many other features of AAVE are present as well: negative concord, skipped "be", many contractions, "ain't" and so on.
When it comes to valency, however, there are not many non-standard features. As the annotation above shows, almost all of the patterns can be found in Erlangen. Some of the most interesting cases are:
1. Sorry I'm in pajamas but I just got off the PJ, **be + in_NP, get + off + NP - interestingly both are not in Erlangen**
2. Where my money at? **a skipped be, also a very new construction with at, not in Erlangen**
3. Mummy wrap the kids, have 'em cryin' for they mommy back, **wrap + NP, have + NP_V-ing (it is in Erlangen, just with a different sense), cry for_NP_back (the version with back not in Erlangen)**
4. I'm riding through yo' hood, you can bank I ain't got no ceiling, **ride + through_NP (it is not in Erlangen, but that depends on whether we see through_NP as compulsory), bank + that_CL (not in the Patternbank, not a typical usage), have + NP - in Erlangen**
As we can see, there are some minor differences from the standard, but establishing their exact extent would need a bigger study. In this sample, then, there are no significant differences in valency patterns apart from the skipped subjects.
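
To make statements such as "almost all of them" easier to check, the hand annotation could be stored as data and tallied. The sketch below is only an illustration: the `Annotation` class and the `tally` helper are hypothetical names, and the sample holds just five patterns copied from the annotated lines above, not the full annotation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    verb: str
    pattern: str       # complement pattern in Erlangen-style notation
    in_erlangen: bool  # True if the pattern is listed in the Patternbank


# A small sample copied from the annotation above (not the complete list)
sample = [
    Annotation("turn", "NP + up", True),
    Annotation("assassinate", "NP", False),
    Annotation("hate", "NP", True),
    Annotation("be", "in_NP", False),
    Annotation("get", "off + NP", False),
]


def tally(annotations):
    """Count the patterns that are attested / unattested in the Patternbank."""
    return Counter("attested" if a.in_erlangen else "unattested" for a in annotations)


counts = tally(sample)
print(counts)
# List the unattested patterns so they can be inspected by hand
print([f"{a.verb} + {a.pattern}" for a in sample if not a.in_erlangen])
```

With the whole annotation entered in this way, the share of patterns missing from Erlangen could be reported as a figure rather than an impression.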

## Parsers
In spaCy one can find four different pipelines for English, each of which includes a parser:
1. en_core_web_md
2. en_core_web_sm
3. en_core_web_lg
4. en_core_web_trf
These four pipelines will be tested on the following four sentences:
1. Sorry I'm in pajamas but I just got off the PJ.
2. Where my money at?
3. Mummy wrap the kids, have 'em cryin' for they mommy back.
4. I'm riding through yo' hood, you can bank I ain't got no ceiling.
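
Since the same steps are repeated for every pipeline and sentence, the comparison can also be written as one loop that prints each token with its dependency label and head. This is a sketch rather than the procedure used in the cells below, which render each parse with displacy; the helper names `dependency_rows` and `compare_pipelines` are my own, and the code assumes that spaCy and the four models are installed, skipping any model that cannot be loaded.

```python
PIPELINES = ["en_core_web_md", "en_core_web_sm", "en_core_web_lg", "en_core_web_trf"]
SENTENCES = [
    "Sorry I'm in pajamas but I just got off the PJ.",
    "Where my money at?",
    "Mummy wrap the kids, have 'em cryin' for they mommy back.",
    "I'm riding through yo' hood, you can bank I ain't got no ceiling.",
]


def dependency_rows(doc):
    """Return (token, dependency label, head) triples for a parsed doc."""
    return [(token.text, token.dep_, token.head.text) for token in doc]


def compare_pipelines(pipelines=PIPELINES, sentences=SENTENCES):
    """Parse every sentence with every available pipeline and print the triples."""
    import spacy  # imported here so that dependency_rows works without spaCy

    for name in pipelines:
        try:
            nlp = spacy.load(name)
        except OSError as err:  # raised when the model is not installed
            print(f"Skipping {name}: {err}")
            continue
        for sentence in sentences:
            print(f"\n{name}: {sentence}")
            for text, dep, head in dependency_rows(nlp(sentence)):
                print(f"  {text:<10} {dep:<10} {head}")
```

Calling `compare_pipelines()` prints one block per pipeline and sentence, which makes it easier to compare the labels side by side than four separate diagrams.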

In [3]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_md")
doc = nlp('Sorry I am in pajamas but I just got off the PJ.')
displacy.render(doc, style="dep")

In [4]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('Sorry I am in pajamas but I just got off the PJ.')
displacy.render(doc, style="dep")

In [6]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_lg")
doc = nlp('Sorry I am in pajamas but I just got off the PJ.')
displacy.render(doc, style="dep")

In [5]:
import spacy
import spacy_transformers
from spacy import displacy
nlp = spacy.load("en_core_web_trf")
doc = nlp('Sorry I am in pajamas but I just got off the PJ.')
displacy.render(doc, style="dep")

Sentence 1 caused few problems; the main one was the conjunction "but". Only the two latter parsers (3 and 4) were able to recognize the relation between the two clauses. This is not specific to AAVE, however, but something that is generally problematic for parsers.

In [7]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_md")
doc = nlp('Where my money at?')
displacy.render(doc, style="dep")

In [8]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('Where my money at?')
displacy.render(doc, style="dep")

In [9]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_lg")
doc = nlp('Where my money at?')
displacy.render(doc, style="dep")

In [10]:
import spacy
import spacy_transformers
from spacy import displacy
nlp = spacy.load("en_core_web_trf")
doc = nlp('Where my money at?')
displacy.render(doc, style="dep")

Sentence 2 has a dropped auxiliary. This is something that parsers have always had problems with, and this case is no exception.

In [11]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_md")
doc = nlp('Mummy wrap the kids, have \'em cryin\' for they mommy back.')
displacy.render(doc, style="dep")

In [12]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('Mummy wrap the kids, have \'em cryin\' for they mommy back.')
displacy.render(doc, style="dep")

In [13]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_lg")
doc = nlp('Mummy wrap the kids, have \'em cryin\' for they mommy back.')
displacy.render(doc, style="dep")

In [18]:
import spacy
import spacy_transformers
from spacy import displacy
nlp = spacy.load("en_core_web_trf")
doc = nlp('Mummy wrap the kids, have \'em cryin\' for they mommy back.')
displacy.render(doc, style="dep")

This sentence seems especially problematic for parsers because of "they" in place of "their". None of the parsers was able to parse the sentence correctly. Parsers 1 and 2 thought that "mommy" was a verb, while parser 3 saw "for" as a dependent of "have". It also attached "back" to "have" as an adverb, which is simply wrong. The final parser did not succeed either, identifying "for" as a marker and "'em" as the subject of "cryin'". It did, however, identify "they" as a nominal modifier, which can express a relation similar to the genitive. This is not the right relation (that would be "poss"), but it is the closest.

In [19]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_md")
doc = nlp('I\'m riding through yo\' hood, you can bank I ain\'t got no ceiling')
displacy.render(doc, style="dep")

In [20]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('I\'m riding through yo\' hood, you can bank I ain\'t got no ceiling')
displacy.render(doc, style="dep")

In [21]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_lg")
doc = nlp('I\'m riding through yo\' hood, you can bank I ain\'t got no ceiling')
displacy.render(doc, style="dep")

In [22]:
import spacy
import spacy_transformers
from spacy import displacy
nlp = spacy.load("en_core_web_trf")
doc = nlp('I\'m riding through yo\' hood, you can bank I ain\'t got no ceiling')
displacy.render(doc, style="dep")

All of the parsers parsed the beginning of the sentence correctly. Problems begin with the second verb. Parsers 1, 2 and 4 identified the relation between "riding" and "bank" as a ccomp (clausal complement), which is wrong. The relation is probably closer to a conj with a skipped conjunction. Parser 3 identified the relation as an unspecified dependency, which is probably the closest to the truth. The next part was parsed correctly by all the parsers; they did not have any trouble with "ain't" or the negative concord. There was, however, some trouble with the relation between "bank" and "got", which should have been a ccomp with "that" dropped. Parsers 1, 2 and 4 all labelled this relation correctly, while parser 3 saw "got" as a dependent of "riding".
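
The observations in this paragraph can be summarised as data to show where the pipelines diverge. The sketch below encodes only what is reported above for "bank" and "got", assuming that the "unspecified dependency" corresponds to spaCy's `dep` label; the label that en_core_web_lg gave to "got" is not stated, so it is left as `None`. The `disagreements` helper is a hypothetical name, and the dictionary is typed in by hand rather than produced by spaCy.

```python
from collections import defaultdict

# token -> pipeline -> (dependency label, head), as described in the text
observed = {
    "bank": {
        "en_core_web_md": ("ccomp", "riding"),
        "en_core_web_sm": ("ccomp", "riding"),
        "en_core_web_lg": ("dep", "riding"),
        "en_core_web_trf": ("ccomp", "riding"),
    },
    "got": {
        "en_core_web_md": ("ccomp", "bank"),
        "en_core_web_sm": ("ccomp", "bank"),
        "en_core_web_lg": (None, "riding"),  # label not reported in the text
        "en_core_web_trf": ("ccomp", "bank"),
    },
}


def disagreements(observations):
    """Group pipelines by analysis and keep tokens with more than one analysis."""
    result = {}
    for token, by_pipeline in observations.items():
        groups = defaultdict(list)
        for pipeline, analysis in by_pipeline.items():
            groups[analysis].append(pipeline)
        if len(groups) > 1:
            result[token] = dict(groups)
    return result


for token, groups in disagreements(observed).items():
    print(token)
    for (label, head), pipelines in groups.items():
        print(f"  {label} <- {head}: {', '.join(pipelines)}")
```

For both tokens the odd one out is en_core_web_lg, which matches the description of parser 3 above.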

## Conclusion
This very brief study suggests that valency relations have not shifted majorly in AAVE. There are some small differences, and a bigger study should be conducted to check this properly. Atypical valency relations did not seem to pose much trouble for the parsers; the problems were mainly connected to complicated relations between multiple verbs, the substitution of words that would be expected in the standard variety, and the dropping of an auxiliary.

## Bibliography
Green, Lisa J. 2002. African American English: A Linguistic Introduction. Cambridge University Press. Accessed June 12, 2023. https://search-1ebscohost-1com-18znubvqk0c3d.hansolo.bg.ug.edu.pl/login.aspx?direct=true&db=e000xww&AN=125093&lang=pl&site=eds-live

Erlangen Valency Patternbank. Accessed June 14, 2023. http://www.patternbank.uni-erlangen.de/cgi-bin/patternbank.cgi?do=imp