# **You can make your own grammar!!**

Chapter 07 of the NLTK book discusses applications of NLP related to "information extraction." The point of this section is to have you reflect upon the ways one can computationally represent relationships among different entities, such as locations and people (i.e., can we query where a given person is at a given time based on text data?). When it comes to completely unstructured text, such as text read in raw from the internet, this task is quite difficult. However, when some degree of structure is imposed on the information, some form of sense can be obtained. 

That's all well and good, but we will also use this chapter to think about patterns of language above the word level. This is where we dip our toes into *syntax* using some of the techniques we've played with in the prior chapters. 

This is where **chunking** comes into the picture. Chunking is related to larger patterns in language above the word level, now at the *phrase* level. 

Noun phrases, verb phrases, prepositional phrases, relative clauses — all of these can be chunks. However, as the book makes clear, these things are not to be equated. 

Noun phrase chunks (NP chunks) are "smaller" noun phrases within a larger noun phrase. As the book explains, a noun phrase could be quite long, such as 

`the politician who paid the woman who paid the clerk who sold the car`

Within that (odd) sentence, there are a lot of smaller NP chunks: 
  - the politician who paid the woman
  - the politician
  - the woman who paid the clerk
  - the woman
  - the clerk who sold the car
  - the clerk
  - the car

The entire sentence itself is also a noun phrase - `the politician` is the head noun and the rest of the sentence is a relative clause working to provide more information about the noun. 


Finding all of the smaller Noun Phrases in a document/text is one strategy used for the purpose of information extraction (if you've worked with other NLP libraries you may have noticed they almost always have a NP chunk method). When we string together phrases into larger sentences, we begin to unravel the syntax or grammar of a text. 



## **Creating a chunk/phrase parser**

We've worked so far with regular expressions and part of speech tags to parse text for specific patterns. The next step is to use NLTK's method for parsing larger "chunks", which are specific sequences of words. To do so, you get to define your own grammar — cool! 

Note that the book provides a sentence, which has been tagged, and then manually defines a grammar in the form of a string. 

Instead of walking through each thing the NLTK authors do, I'd like to focus on the gist, so that you can walk away from this notebook/workshop (and from reading/skimming the NLTK book) with an idea of how to write your own grammar parser. 

Let's begin!

> *please note that you will not be able to use the .draw() function to see the syntax trees in Colab. Yeah they are really cool but they don't work unless you can run Python on your local machine. We'll be able to draw dependency trees when we work with spaCy*

In [None]:
# download the resources we will need
import nltk
nltk.download(['punkt', 'averaged_perceptron_tagger', 'tagsets'])

## Defining a noun phrase

Let's start by using NLTK to define a grammar which can identify simple noun phrases in a text. In linguistics, the definitions of different phrases follow what are known as [phrase structure rules](https://en.wikipedia.org/wiki/Phrase_structure_rules). We can represent a noun phrase using "NP", and define it using a system of rules.

For example, the most basic definition of a NP is any noun preceded by a determiner/article. We will start with that as a definition and express it in terms of Part of Speech tags:

```
NP —> (DET) + NN
```

The definition above says that a noun phrase is any noun preceded by an optional determiner (the parentheses around (DET) mean it is optional). 

We are going to use NLTK's regex parser to define and parse for noun phrases using text which is tagged for part of speech. The "grammar" is written for the NLTK regex parser, which means we can use `<tag>` patterns, a more readable notation than plain regex. In the tagset used by `nltk.pos_tag()`, determiners are tagged `DT` rather than `DET`. To convert the above rule into the NLTK regex, we can use the following:


In [None]:
# define a simple NP
grammar1 = "NP: {<DT>?<NN>}"

In the cell above, I saved the regex pattern to a variable named `grammar1`. The regex pattern itself is encased in the curly braces `{}`, and I have provided a name for that pattern using "`NP:`" before the pattern. The NLTK parser will be able to use this information. To do so, we need to create a parser with `nltk.RegexpParser()`, and pass our grammar as an argument. 

We will save the parser to a variable.

In [None]:
# create a parser with our grammar from the cell above
# note that our grammar1 pattern is being fed to the parser
np_parser = nltk.RegexpParser(grammar1)

Crucially, we must use our parser on text which has part of speech tags applied. We know how to do that! Let's create a tagged text and use our parser on it. 

I'll create a short text below to get started. 


In [None]:
# create a short tokenized and tagged text
sent = "My friends, the sea was angry that day. Like an old man sending back soup in a deli."
sent_tagged = nltk.pos_tag(nltk.word_tokenize(sent))

# do you see any POS tag combinations which should be matched by our parser? 
sent_tagged

Now that we have a tagged text, we can parse the text for noun chunks using the parser we created. To do so, we simply pass the tokenized/tagged text to the `.parse()` method from the parser we made.


In [None]:
# parse the example for noun chunks
sent_parsed = np_parser.parse(sent_tagged)

The parser will create an `nltk.Tree` object. You can see the full tree by printing it, using either `print()` or the object's `.pprint()` method. 

In [None]:
# inspect the output (you can also use print())
sent_parsed.pprint()

### Inspecting the output... 

- the "S" means the "top" of the sentence
- any nested brackets within the "S" denote different chunks we have asked for. 
  -  Each open paren `(` has the chunk label as the first thing. This is a special NLTK object (including the 'S')
  - Note that NPs match based on the tags, so that determiners such as `the` and `that` are both being matched by `<DT>`.
- in this case, we see that our grammar has found *five* noun chunks. Nice!
- All of the other words are listed separately with their POS tags, they were not part of the chunks we defined. 
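If you want to check this structure without running the tagger, here is a minimal sketch that uses hand-written `(word, tag)` pairs (the tags are my assumption of what `pos_tag()` would return for this fragment). It walks the top level of the tree and separates the chunks from the words that were left alone.

```python
import nltk

# Hand-written (word, tag) pairs, so no tagger or downloads are needed.
tagged = [("the", "DT"), ("sea", "NN"), ("was", "VBD"),
          ("angry", "JJ"), ("that", "DT"), ("day", "NN")]

parser = nltk.RegexpParser("NP: {<DT>?<NN>}")
tree = parser.parse(tagged)

# Top-level children are either Tree objects (chunks)
# or plain (word, tag) tuples (words outside any chunk).
chunks = [child for child in tree if isinstance(child, nltk.Tree)]
unchunked = [child for child in tree if not isinstance(child, nltk.Tree)]

print("root label:", tree.label())
for chunk in chunks:
    print("chunk:", chunk.label(), chunk.leaves())
print("left alone:", unchunked)
```

The root label `S` is just the parser's default name for the top of the tree.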

We can interact with the `Tree` object in more ways than printing, but before doing so, let's update our noun grammar. 







## Adding to our noun phrase

Our noun phrase in the grammar above is quite simple and doesn't capture the range of other possible ways a noun phrase can exist. For example, and as pointed out in the NLTK book, in addition to an optional determiner, a noun can also have optional adjectives before it. 

NLTK uses this pattern to define a possible NP chunk:

`"NP: {<DT>?<JJ>*<NN>}"`

This pattern says a noun phrase is:
  - zero or one determiners (`<DT>?`), 
  - followed by zero or more adjectives (`<JJ>*`)
  - followed by a required noun (`<NN>`).

Note the difference between the rules for `DT` and `JJ`. Why would we restrict this rule to only one determiner but effectively infinite adjectives? Remember that determiners are words like `the` and `a`. Based on your knowledge of English, is it common for more than one determiner to come before a noun? (No, it's not.)

Then consider that `JJ` means adjectives, so words like `big`, `red`, etc. Can you place more than one adjective before nouns in English? Of course! The phrase `the small blue bird` has the tags `DT JJ JJ NN`, whereas the phrase `the bird` has the tags `DT NN`. The pattern defined above allows for these possibilities. 
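Here is a small sketch of those possibilities, again using hand-written tags rather than the tagger (so the tags are assumptions on my part). The last example has two determiners in a row, which the pattern should not swallow whole.

```python
import nltk

np_parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")

# Hand-tagged phrases; the keys are only labels for printing.
examples = {
    "the small blue bird": [("the", "DT"), ("small", "JJ"),
                            ("blue", "JJ"), ("bird", "NN")],
    "the bird": [("the", "DT"), ("bird", "NN")],
    "bird": [("bird", "NN")],
    "the the bird": [("the", "DT"), ("the", "DT"), ("bird", "NN")],
}

results = {}
for text, tagged in examples.items():
    tree = np_parser.parse(tagged)
    # keep only the NP subtrees, and store the words in each one
    results[text] = [
        [word for word, tag in subtree.leaves()]
        for subtree in tree.subtrees(lambda t: t.label() == "NP")
    ]
    print(f"{text!r} -> {results[text]}")
```

The first three phrases each come back as a single NP. In the last one, only one `the` can join the NP, so the other is left outside the chunk.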

So, let's update our NP grammar with this new pattern:

In [None]:
# save the new np grammar rules
grammar2 = "NP: {<DT>?<JJ>*<NN>}"

In [None]:
# we need to make a new parser with our new rule
np_parser2 = nltk.RegexpParser(grammar2)

In [None]:
# parse our sentence again with the new parser
sent_parsed2 = np_parser2.parse(sent_tagged)

Let's compare the outputs from the two different parsers. 

What is different in our second example compared to the first? 

In [None]:
print(sent_parsed)
print('\n') 
print(sent_parsed2)

Here is an explanation for all the matches returned from the second iteration of our parser. 

`"NP: {<DT>?<JJ>*<NN>}"`

Let's look through the matches:

- `the sea` is a `<DT><NN>`, so it is matched, because the adjective (`<JJ>`) is optional
- `that day` is also a `<DT><NN>` and matches for the same reasons
- `an old man` is matched, and represents a full match from our pattern – `<DT><JJ><NN>`!
- `soup` is matched because both the `<DT>` and `<JJ>` tags are optional, whereas `<NN>` is the minimum required to form a noun phrase.
- `a deli` is matched because it is a `<DT><NN>`

## Navigating the Tree structure

The NLTK `Tree` object is made up of smaller subtrees, which we can navigate using the `.subtrees()` method. The subtrees include the entire `S` tree and then each of the parsed chunks within it. Because our grammar only creates NP chunks, the remaining subtrees are all NPs. 

In the cell below, I first collect the subtrees into a list using a list comprehension. I then loop through that list, but I skip the first subtree, because that is the whole tree (i.e., the whole sentence, and I just don't want to see it again!). Using this method, I can get all of the NP chunks from my parsed text (we will come to find out there are other, perhaps easier methods to do this). 

In [None]:
# list comprehension to find the chunks
subtrees = [subtree for subtree in sent_parsed2.subtrees()]
for subtree in subtrees[1:]: # skip the first subtree because it's the "S" tree
  print(f'Chunk: {subtree}') # we can see each chunk from the Tree

Each tree also has a `.label()` method, which returns the chunk type. The label is whatever name we gave the chunk in the grammar *before* the curly brackets `{}`. In this case, we specified "NP:".

In the cell below, I again loop through, but ask for the label as well as the tree:



In [None]:
# now do the same using the .label() method
subtrees = [subtree for subtree in sent_parsed2.subtrees()]
for subtree in subtrees[1:]:
  print(f'Chunk type: {subtree.label()}, Chunk: {subtree}')

Finally, the `.leaves()` method of a tree gives us back the original input for that chunk, which is the tagged text as a list of `(word, tag)` pairs. 

In [None]:
# Using .leaves() will provide you with the original input. 
subtrees = [subtree for subtree in sent_parsed2.subtrees()]
for subtree in subtrees[1:]:
  print(f'Chunk type: {subtree.label()}, Chunk: {subtree}, leaves: {subtree.leaves()}')
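
Since `.leaves()` returns `(word, tag)` pairs, it only takes a little unpacking to get plain phrases back out. The following is a minimal sketch on a hand-tagged example (the sentence and its tags are made up for illustration).

```python
import nltk

# Hand-written tags so the example runs without the tagger.
tagged = [("an", "DT"), ("old", "JJ"), ("man", "NN"),
          ("ate", "VBD"), ("soup", "NN")]
tree = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(tagged)

phrases = []
for subtree in tree.subtrees():
    if subtree is tree:  # skip the whole "S" tree
        continue
    # drop the tags and glue the words back together
    words = [word for word, tag in subtree.leaves()]
    phrases.append(" ".join(words))

print(phrases)
```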

### **Your Turn**

Create some sample sentences which contain chunks that do and do not follow our grammar above. 

- Practice parsing these sentences and inspect the outputs 
- Play around - how many adjectives can you chain together?
- Are you able to give the NPs a different chunk label?
- Are you able to search for chunks based on the label?
- What other NPs are you noticing that are *not* being captured in our parser grammar? 
  - What could be done to capture these NPs?

## Adding a verb phrase to our rules

All right, now let's think about how we could further parse our sentence so that it might contain all of the chunks in our sentence. To do so, we need to update our grammar!


Let's introduce a rule to make a verb phrase or `VP`. Just like a noun phrase must contain a noun, a verb phrase must contain a verb. What part of speech tags are associated with verbs? We can find out by asking the nltk help and passing in `V.` as a regex pattern to our help:





In [None]:
# find all POS tags which start with "V" and are followed by anything
nltk.help.upenn_tagset('V.')

We want to write a rule which will allow for any of these types of verbs to be parsed as a VP. We can do that with regex quite easily, using the wildcard `.` and zero or more quantifier `*`. 

Using the NLTK syntax, our VP rule could thus be:

```
VP: {<V.*>}
```

We can now add our two rules together.

To do so, I will use triple quotes and put each rule on a new line.

In [None]:
# define a grammar to find NP and VPs
np_vp_grammar = """
NP: {<DT>?<JJ>*<NN>} 
VP: {<V.*>}
"""

We need to make a new parser with our new grammar:

In [None]:
# update the parser and create a new version of our sentence
np_vp_parser = nltk.RegexpParser(np_vp_grammar)

Let's test our new parser out on our same example:

In [None]:
# parse the same sentence from above
sent_parsed3 = np_vp_parser.parse(sent_tagged)

Inspect the output – note that printing the labels helps us to see the type of chunk as well as the associated phrases. 

Again, I'm skipping the first index of `.subtrees()` to avoid printing out the entire tree again. 

In [None]:
# look at each Chunk - we are getting almost the entire text back
subtrees = [subtree for subtree in sent_parsed3.subtrees()]
for subtree in subtrees[1:]:
  print(f'Chunk type: {subtree.label()}, Chunk: {subtree}')

What do you notice about the output? Adding in a rule for finding a VP now returns two more chunks, but those two chunks start to clarify the nature of our example. In other words, we can see how important NP and VP chunks are for constructing sentences and utterances. 
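To see why the wildcard matters, here is a quick sketch comparing `<V.*>` against a pattern that names a single verb tag. The sentence and tags are hand-written assumptions, not tagger output.

```python
import nltk

# Three different verb tags: VBD (past), VBG (gerund), VBZ (3rd person singular)
tagged = [("she", "PRP"), ("was", "VBD"), ("singing", "VBG"),
          ("and", "CC"), ("he", "PRP"), ("sings", "VBZ")]

any_verb = nltk.RegexpParser("VP: {<V.*>}")   # any tag starting with V
base_only = nltk.RegexpParser("VP: {<VB>}")   # only the exact tag VB

def vp_words(parser):
    """Collect the words inside every VP chunk."""
    tree = parser.parse(tagged)
    return [word
            for subtree in tree.subtrees(lambda t: t.label() == "VP")
            for word, tag in subtree.leaves()]

wildcard_hits = vp_words(any_verb)
exact_hits = vp_words(base_only)
print(wildcard_hits)
print(exact_hits)
```

None of the verbs here carry the bare `VB` tag, so the exact pattern finds nothing, while the wildcard version picks up all three.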

### Make a chunk printer function

We're at the point where we are repeating ourselves over and over when we print out the chunks and subtrees. We should thus create a function which helps print out our parsed input. This way we don't have to keep copying and pasting our cells over and over.

In the cell below, I make a function which loops through the subtrees and prints the chunks and their labels, just as I have been doing above. I'll also add a flag, `print_full`, which means that I can ask for the full sentence to be printed out or not. The default will be `False`, meaning that the user will have to explicitly ask for the full sentence to be printed. 

In [None]:
def chunk_printer(chunk, print_full = False):
  """
  prints all chunks with tags, 
  and optionally prints entire text
  """
  # gather subtrees
  subtrees = [subtree for subtree in chunk.subtrees()]

  # check if print_full is set to True
  # we could also just type `if print_full` right?
  if print_full == True:
    print(f'Full sentence is: {chunk}')

  print('=============================')
  print('The chunks are:')
  for subtree in subtrees[1:]:
    print(f'Chunk type: {subtree.label()}, \nChunk: {subtree}\n')


Let's test it out on our parsed sentence. 

In [None]:
# print default, without the full chunk
chunk_printer(sent_parsed3)

In [None]:
# ask for the full chunk (it's quite ugly tho)
chunk_printer(sent_parsed3, print_full = True)

### **Your Turn**

Continue parsing some examples, but now with the VP and NP grammar. You can use the `chunk_printer()` function to inspect the output.

- Just as you considered the nouns above, are there any verbs being missed?
- Are there any additional words you think should be added to the verb phrases?
- We used `<V.*>` to find all the verbs, but our pattern for nouns is `<NN>`. Should we do something about that? 

In [None]:
# what different types of nouns exist in our tagset?
nltk.help.upenn_tagset('N.')

## Adding a full clause to our grammar

Let's now look at the definition of a (simple) *clause*:

```
Clause: NP + VP
```

That's it!! Well, we've already defined our NP and VP chunks, so now we just need to add a rule to our grammar which combines these chunks into a larger chunk. The rule would follow the same format, but we can use our chunk labels as the search pattern:

```
Clause: {<NP><VP>}
```

This will work because the NLTK parser will parse our data *in the order of the rules*, so that chunks made as part of the first rule can then be used in subsequent rules — neat-o. 
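A rough way to convince yourself that rule order matters is to write the same three rules in two different orders. In this sketch (hand-written tags, and a `count_label` helper I made up for the comparison), the clause rule only succeeds when the NP and VP rules have already run.

```python
import nltk

tagged = [("the", "DT"), ("sea", "NN"), ("was", "VBD"), ("angry", "JJ")]

# NP and VP are built first, so CLAUSE has something to combine
in_order = nltk.RegexpParser("""
NP: {<DT>?<JJ>*<NN>}
VP: {<V.*>}
CLAUSE: {<NP><VP>}
""")

# CLAUSE runs first, before any NP or VP chunk exists
clause_first = nltk.RegexpParser("""
CLAUSE: {<NP><VP>}
NP: {<DT>?<JJ>*<NN>}
VP: {<V.*>}
""")

def count_label(tree, label):
    """Count the subtrees carrying a given chunk label (hypothetical helper)."""
    return sum(1 for _ in tree.subtrees(lambda t: t.label() == label))

clauses_in_order = count_label(in_order.parse(tagged), "CLAUSE")
clauses_clause_first = count_label(clause_first.parse(tagged), "CLAUSE")
print(clauses_in_order, clauses_clause_first)
```

`RegexpParser` also takes a `loop` argument which runs the whole set of rules more than once. That is another way around the ordering problem, but writing the rules in a sensible order is easier to reason about.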

Let's add the clause rule to our grammar

In [None]:
# update our grammar to include a clause. 
clause_grammar =  """
  NP: {<DT>?<JJ>*<NN>} # NOUN PHRASE
  VP: {<V.*>} # VERB PHRASE
  CLAUSE: {<NP><VP>} # CLAUSE!
"""

In [None]:
# update the parser
clause_parser = nltk.RegexpParser(clause_grammar)

In [None]:
# parse the example again 
sent_parsed4 = clause_parser.parse(sent_tagged)

In [None]:
# inspect the output
chunk_printer(sent_parsed4)

### Update chunk printer function

Take a moment to look at the output and note the nested nature of the chunks. We are retaining the smaller constituents of the clauses (i.e., the separate NP and VP chunks), as well as the larger clauses. It does become a bit tricky to read though eh? 

Let's update `chunk_printer()` so that we can control the types of chunks the parser returns. We do that with a new argument, `chunks`, which can take a list of chunk tags. If the list is empty, it will just return all the chunks. 

Otherwise, we can supply a list of chunks to control the output. 

In [None]:
# a third argument is added, chunks, set with an empty list as default. 
def chunk_printer(chunk, print_full = False, chunks = []):
  """
  prints all chunks with tags, 
  and optionally prints entire text. 
  lets us pass a list of chunk labels to be printed
  """

  subtrees = [subtree for subtree in chunk.subtrees()]

  if print_full == True:
    print(f'Full sentence is: {chunk}')

  # if chunks exists (has a value)
  if chunks:
    print('=============================')
    print(f'Your requested chunks {chunks}:\n')
    for subtree in subtrees[1:]:
      if subtree.label() in chunks: # using the labels to match the input to any existing chunks
        # if the user puts in a chunk that doesn't exist it won't print it - something to worry about later
        print(f'Chunk type: {subtree.label()}, \nChunk: {subtree}\n')

  else:
    print('=============================')
    print('The chunks are:\n')
    for subtree in subtrees[1:]:
      print(f'Chunk type: {subtree.label()}, Chunk: {subtree}')

We can now control our program to give us the chunks of our choice. Below, I am asking for NP and VP only. If I pass chunk labels that don't exist, the program won't raise an error; it will just print nothing for them without telling us why. Since we're the only ones using this so far, we'll keep this secret between us. 

In [None]:
# So we could find just the NPs and VPs...
chunk_printer(sent_parsed4, chunks = ['NP', 'VP'])

In [None]:
# Or just the clauses
chunk_printer(sent_parsed4, chunks = ['CLAUSE'])
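
If you did want to worry about labels that don't exist, one possible approach is to compare the requested labels against the labels actually present in the tree. This is only a sketch: `unknown_labels()` is a hypothetical helper of my own, not part of NLTK, and the tags are hand-written.

```python
import nltk

def unknown_labels(tree, requested):
    """Return the requested labels that never occur in the tree (hypothetical helper)."""
    present = {subtree.label() for subtree in tree.subtrees()}
    return sorted(set(requested) - present)

tagged = [("the", "DT"), ("sea", "NN"), ("was", "VBD"), ("angry", "JJ")]
parser = nltk.RegexpParser("""
NP: {<DT>?<JJ>*<NN>}
VP: {<V.*>}
""")
tree = parser.parse(tagged)

# this grammar never builds PP or CLAUSE chunks
missing = unknown_labels(tree, ["NP", "PP", "CLAUSE"])
print(f"No chunks found with these labels: {missing}")
```

A check like this could be called at the top of `chunk_printer()` to warn the user before printing anything.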

## Turn it up to 11

Right now our grammar is still relatively simple, and it does not capture all the possible ways nouns, verbs, noun phrases, and verb phrases could be defined (and honestly, it probably never will). 

Rather than go through a bunch of iterative updates, I'm going to dump a few updates to the grammar on you in the next cell. 

This grammar is heavily influenced by the example I've been using as well as phrase structure rules gleaned from websites [like this one](http://faculty.washington.edu/cicero/370syntax.htm). 

One of the things I'm doing in the grammar below is defining smaller nouns before then defining larger NPs. Same thing with verbs. This, for instance, allows us to find noun-noun combinations and put them together as one noun phrase. 

Another thing I'm doing is accounting for prepositional phrases, which can nest within sentences. We have two prepositions in our sentence: `like` and `in` (although `like` is probably a bit iffy in this category, but that's what `pos_tag()` gives us, so we'll go with it). They are both tagged with `IN`.

We also define our NP to now be anything we tag with an N, which is now expanded in its scope:

In [None]:
# update our grammar to find smaller nouns/verbs before NP/VP
# definition of noun has also been expanded. 
grammar1 = """
N: {<DT>?<JJ>*<N.*><IN>*} # optional determiner, followed by any number of adjectives, followed by noun, then preps
NP: {<N><N>*} # we can use the smaller pattern of N to stand for nouns now
V: {<V.*>}
VP: {<V>}
CLAUSE: {<NP><VP>} # any adjacent NP + VP is a clause
"""

# initialise the parser
cp1 = nltk.RegexpParser(grammar1)

In [None]:
# Look at the NP and VPs, which will contain the smaller Ns and Vs
chunk_printer(cp1.parse(sent_tagged), chunks = ['NP', 'VP', 'CLAUSE'])

In [None]:
# Look at the clauses, which gives a nice picture of the nested nature of our design. 
chunk_printer(cp1.parse(sent_tagged), chunks = ['CLAUSE'])
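
The noun-noun idea is easier to see on a phrase that actually stacks nouns. Here is a minimal sketch with the same rules and a made-up, hand-tagged example (`the coffee shop owner smiled`).

```python
import nltk

grammar = """
N: {<DT>?<JJ>*<N.*><IN>*}
NP: {<N><N>*}
V: {<V.*>}
VP: {<V>}
CLAUSE: {<NP><VP>}
"""
parser = nltk.RegexpParser(grammar)

# Hand-written tags: three nouns in a row, then a verb
tagged = [("the", "DT"), ("coffee", "NN"), ("shop", "NN"),
          ("owner", "NN"), ("smiled", "VBD")]
tree = parser.parse(tagged)

def words_by_label(tree, label):
    """List the words of each subtree with the given label."""
    return [[word for word, tag in subtree.leaves()]
            for subtree in tree.subtrees(lambda t: t.label() == label)]

small_nouns = words_by_label(tree, "N")
noun_phrases = words_by_label(tree, "NP")
print("N: ", small_nouns)
print("NP:", noun_phrases)
```

Each noun first becomes its own small `N` (the determiner rides along with the first one), and the NP rule then glues the three adjacent `N` chunks into one noun phrase.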

### Add adverbs

Our grammar now finds the main constituents of the test sentence, but could still be improved. For example, the words "angry" and "back" are not part of any chunk, so we are not yet capturing "was angry that day" or "sending back soup" as verb phrases (here is the sentence again in case you forgot the whole thing):

```
My friends, the sea was angry that day. Like an old man sending back soup in a deli.
```

We need to expand what a VP can be, since we're still missing out on the words `angry` and `back`. Let's check the "basic" phrase structure rules for English, [taken from here](http://faculty.washington.edu/cicero/370syntax.htm):

```
S → NP  VP 

NP → (det)(adv)* (adj)*N

VP → V (NP) (ADV)*
```

This means we can define a verb phrase as a verb followed by an optional noun phrase and zero or more adverbial phrases. The word `back` in our sentence is tagged as a particle (`RP`), which is doing adverb-like work here, so we can add that tag to our VP. 




Note that depending on which site I've linked to, the phrase structure rules are defined slightly differently, fun ;). For instance, this site defines NP + VP as sentence, whereas I've been using the word "clause". 

Anyhow, let's update our grammar so that a VP can be a verb followed by a noun or an adverbial (`<RP>`), followed by yet another noun. Our example actually has the adverb before the noun phrase, so we already know this might not be 100% accurate at parsing our data. I'll tweak the pattern accordingly. 

In [None]:
# update our grammar to increase scope of what a VP is
grammar2 = """
N: {<DT>?<JJ>*<N.*><IN>*} # create smaller nouns first
V: {<V.*>}                # then create smaller verbs
NP: {<N><N>*}         # then group one or more adjacent Ns into an NP
VP: {<V><N>*<RP>*<NP>*}  # create our VPs
CLAUSE: {<NP><VP>}        # any adjacent NP + VP is a clause
"""
cp2 = nltk.RegexpParser(grammar2)

In [None]:
# Look at the NP and VPs
chunk_printer(cp2.parse(sent_tagged), chunks = ['NP', 'VP', 'CLAUSE'])

### Add adjectives

Holy crap, did you see that? We got a VP which follows the pattern `V + Adverb + N + N`, nice!

The word `angry` is still not being captured by our rule, because we have not allowed the smaller V to capture adjacent adjectives/adverbs - let's update it again. (The word `like` is also not being captured, but let's leave that alone for now. )

In [None]:
# update our grammar to let Vs take adjectives/adverbs
grammar3 = """
N: {<DT>?<JJ>*<N.*><IN>*} # create smaller nouns first
V: {<V.*><JJ|RP>*}        # then create smaller verbs (can have adjectives/adverbs afterwards)
VP: {<V><N>*<RP>*<NP>*}  # create our VPs
NP: {<N><N>*}         # create NPs after VPs are created - any N that did not get put into VP will become NP

CLAUSE: {<NP><VP>}        # any adjacent NP + VP is a clause
"""
cp3 = nltk.RegexpParser(grammar3)

In [None]:
# Look at the NPs, VPs, and clauses - what changed? 
chunk_printer(cp3.parse(sent_tagged), chunks = ['NP', 'VP', "CLAUSE"])

In [None]:
# clause level looking good
chunk_printer(cp3.parse(sent_tagged), chunks = ['CLAUSE'])
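
For reference, here is a sketch of the same grammar run on a hand-tagged version of the example (minus "My friends,"). The tags are my best guess at what `pos_tag()` returns, so treat them as assumptions; if your tagger output differs, your chunks will too.

```python
import nltk

# Hand-written tags (an assumption; nltk.pos_tag output may differ slightly).
tagged = [("the", "DT"), ("sea", "NN"), ("was", "VBD"), ("angry", "JJ"),
          ("that", "DT"), ("day", "NN"), (".", "."),
          ("Like", "IN"), ("an", "DT"), ("old", "JJ"), ("man", "NN"),
          ("sending", "VBG"), ("back", "RP"), ("soup", "NN"),
          ("in", "IN"), ("a", "DT"), ("deli", "NN"), (".", ".")]

grammar = """
N: {<DT>?<JJ>*<N.*><IN>*}
V: {<V.*><JJ|RP>*}
VP: {<V><N>*<RP>*<NP>*}
NP: {<N><N>*}
CLAUSE: {<NP><VP>}
"""
tree = nltk.RegexpParser(grammar).parse(tagged)

# turn each CLAUSE subtree back into a plain string
clauses = [" ".join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees(lambda t: t.label() == "CLAUSE")]

for clause in clauses:
    print(clause)
```

With these tags, the two clauses cover everything except the punctuation and the word "Like".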

### Add personal possessive pronouns

We're getting pretty close. We are still missing a NP at the start of the sentence: `my friends`. 
 
The word "my" has been tagged with `PRP$`. What does that tag mean? 





In [None]:
nltk.help.upenn_tagset('PRP$')

The `PRP$` tag is associated with personal possessive pronouns, which commonly pattern before nouns in English. We can easily integrate this into our existing grammar by allowing an optional `PRP$` in the same slot where we ask for determiners (`DT`).

Because `PRP$` includes a regex metacharacter (`$`), we need to *escape* the character with a `\` so that it is read literally, rather than as a special character. The grammar below is written as a raw string (`r"""..."""`) so that Python passes the backslash through to NLTK unchanged. 

In [None]:
# update our grammar to let Ns start with a possessive pronoun
grammar4 = r"""
N: {<DT|PRP\$>?<JJ>*<N.*><IN>*} # adding PRP$ to DT, escaping the $
V: {<V.*><JJ|RP>*}        # then create smaller verbs (can have adjectives/adverbs afterwards)
VP: {<V><N>*<RP>*<NP>*}  # create our VPs
NP: {<N><N>*<IN>*}         # create NPs after VPs are created - any N that did not get put into VP will become NP
CLAUSE: {<NP><VP>}        # any adjacent NP + VP is a clause
"""
cp4 = nltk.RegexpParser(grammar4)

In [None]:
# what is going on now? How well is this parsed? 
chunk_printer(cp4.parse(sent_tagged), chunks = ["NP", "VP", 'CLAUSE'])
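
To see what the escape is buying us, here is a small sketch comparing an escaped and an unescaped version of the pattern on a hand-tagged `my friends`. As far as I can tell, the unescaped `$` doesn't raise an error; it just never matches the `PRP$` tag, which makes it an easy mistake to miss.

```python
import nltk

# Hand-written tags, so this runs without the tagger.
tagged = [("my", "PRP$"), ("friends", "NNS")]

# Raw strings keep the backslash exactly as typed.
escaped = nltk.RegexpParser(r"N: {<DT|PRP\$>?<JJ>*<N.*>}")
unescaped = nltk.RegexpParser(r"N: {<DT|PRP$>?<JJ>*<N.*>}")

def first_chunk_words(parser):
    """Return the words of the first N chunk the parser finds."""
    tree = parser.parse(tagged)
    chunk = next(tree.subtrees(lambda t: t.label() == "N"))
    return [word for word, tag in chunk.leaves()]

with_escape = first_chunk_words(escaped)
without_escape = first_chunk_words(unescaped)
print("escaped:  ", with_escape)
print("unescaped:", without_escape)
```

With the escape, `my` joins the chunk; without it, only `friends` is chunked.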

## **Your Turn**

We have fine-tuned a grammar based on our knowledge of phrase structure rules as well as an example sentence. 

- Try our grammar on some other examples. How accurate is it?
- Do you see any problems with the rules? Any tweaks that should be made?
- We didn't actually finish parsing the example above, as we missed out on the word "like". How important do you think it is to capture? 

## Does our grammar generalise?

What next? We could continue this exploration by writing more functions which would let us "ask" our parser which nouns are associated within verb phrases, or check if certain verbs are in our text and then ask for the program to give us the nouns associated with those verbs, and so on. The book shows you some of this, and this is effectively what information extraction is all about. 

However, let's see how well our parser performs on some other text. How about a State of the Union address? 

In [None]:
# download then import the state_union resource
nltk.download('state_union')
from nltk.corpus import state_union

In [None]:
# check out Clinton's final state of the union speech
clinton = state_union.sents('2000-Clinton.txt')

In [None]:
# let's pick an arbitrary sentence from the clinton speech and parse it. 
clinton[10]

Look at the tagged text. What predictions can we make about the accuracy of our grammar?  


In [None]:
clinton_tagged = nltk.pos_tag(clinton[10])

In [None]:
clinton_tagged

What happens when we try to parse this with our grammar?

 

In [None]:
chunk_printer(cp4.parse(clinton_tagged), chunks = ['VP', 'NP', "CLAUSE"])

We have found no clauses, only VPs and NPs. This means our clause rule is not meeting its condition. 

Why is that? Inspect the fully parsed version - basically, we have to work harder to connect our prepositional phrases between the VP and NPs.

In [None]:
print(cp4.parse(clinton_tagged))

### **Open Challenge**

Can you update our grammar to parse the above sentence more accurately? 

You'll probably find the best strategy is to separately define a prepositional phrase, then update other rules to incorporate those phrases into NPs, then VPs, then clauses. There are likely some other tricks you can think of. 



## Yet another generalisation attempt

*(this is more bonus challenge - can you extend the grammar to work for this text?)*

I'll do the same thing as above, but this time with `grasshopper.txt`. I'll read in the grasshopper short story from the LING 226 GitHub. I use the `requests` module, which lets me read in files from the internet. I point the `.get()` function at the direct URL for my text and then use the `.text` attribute to return the raw text. 

In [None]:
# import requests library
import requests

# use .get() to target direct URL and download the raw text
grasshopper = requests.get('https://raw.githubusercontent.com/scskalicky/LING-226-vuw/main/texts/grasshopper.txt').text

# inspect the results
grasshopper

In [None]:
# tokenize then tag
grasshopper_tagged = nltk.pos_tag(nltk.word_tokenize(grasshopper))

# inspect a slice of the text - do you see any NPs?
grasshopper_tagged[:20]

In [None]:
# parse our text with the cp4 grammar
grasshopper_parsed = cp4.parse(grasshopper_tagged)

# inspect the results: 
chunk_printer(grasshopper_parsed, chunks = ['CLAUSE'])

# Limitless Possibilities

We've used the chunk parser to create rules adhering to the syntactic constraints of English. These analyses are facilitated by the fact we have an automatic tagger we can use in NLTK. 

But, you *could* decide to make your own tags and then build a parser to read those tags. 

Remember, you could use the `nltk.str2tuple()` function to convert a string in the form of `string/tag` into a `(string, tag)` tuple, with tags of any meaning that you like. Note that `str2tuple()` uppercases the tag, so `cool/en` becomes `('cool', 'EN')`. 

Then, using `nltk.RegexpParser()`, you could develop a parser that reads those tags to make chunks. For example, you could take a text which code-switches and chunk it into stretches of each language.

Because strings are parsed left-to-right, and chunk rules are applied in the order they are written, I could exploit this to find contiguous stretches of a given language in the text. 

> *This is just an idea off the top of my head which could use more refinement. The point is to show you that you can leverage this parser to work with tags other than POS tags.*

In [None]:
code_switching = '她/zh 是/zh 一/zh 个/zh 很/zh cool/en 的/zh 人/zh'

code_switching_tagged = [nltk.str2tuple(w) for w in code_switching.split()]

code_switching_tagged

In [None]:
code_switch_grammar= """
All_English: {<EN><EN>*} # one required English token followed by any number of English tokens
All_Chinese: {<ZH><ZH>*}
"""

code_switch_parser = nltk.RegexpParser(code_switch_grammar)

In [None]:
# my goal was to find contiguous stretches of English and Mandarin
print(code_switch_parser.parse(code_switching_tagged))

In [None]:
# try it on another text
another_example = [nltk.str2tuple(w) for w in 'I/EN 懂/ZH 您/ZH 的/ZH meaning/EN'.split()]
print(code_switch_parser.parse(another_example))
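
If you'd rather have the stretches as data than as a printed tree, something like the following sketch would do it. It uses lowercase tags in the input to show that `str2tuple()` uppercases them, and it writes the rules with `+` (one or more), which should be equivalent to the `<EN><EN>*` form above.

```python
import nltk

# lowercase tags on purpose: str2tuple() uppercases them
pairs = [nltk.str2tuple(w) for w in "I/en 懂/zh 您/zh 的/zh meaning/en".split()]

parser = nltk.RegexpParser("""
All_English: {<EN>+}
All_Chinese: {<ZH>+}
""")
tree = parser.parse(pairs)

# collect (label, words) for every chunk, skipping the root tree
stretches = [(subtree.label(), [word for word, tag in subtree.leaves()])
             for subtree in tree.subtrees()
             if subtree is not tree]

for label, words in stretches:
    print(label, words)
```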