# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [2]:
# Enter your code here:
with open('../TextFiles/owlcreek.txt', 'r') as f:
    text = f.read()
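
The Project Gutenberg link above ends in `.utf-8`, so it may be worth passing the encoding explicitly instead of relying on the platform default. The sketch below shows the idea on a small stand-in file; the temporary path and sample content are assumptions made so the example runs anywhere, not the real `owlcreek.txt`.

```python
import tempfile
from pathlib import Path

# Stand-in for '../TextFiles/owlcreek.txt' (hypothetical content and path).
sample = "AN OCCURRENCE AT OWL CREEK BRIDGE\n\nby Ambrose Bierce\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "owlcreek.txt"
    path.write_text(sample, encoding="utf-8")

    # Explicit encoding: do not rely on the platform default.
    with open(path, encoding="utf-8") as f:
        text = f.read()

print(text.splitlines()[0])
```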

In [3]:
# Run this cell to verify it worked:
doc = nlp(text)
doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [None]:
len(doc)

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [9]:
doc_list = list(doc.sents)
print(len(doc_list))

204


**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [10]:
print(doc_list[1])

The man's hands were behind
his back, the wrists bound with a cord.  


**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`**<br>
CHALLENGE: Have values line up in columns in the print output.

In [11]:
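# NORMAL SOLUTION:
# Note: token.tag_ is the fine-grained POS tag; use token.dep_ to print the dependency label the question asks for.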
for token in doc_list[1]:
    print(token.text, token.pos_, token.tag_, token.lemma_)

The DET DT the
man NOUN NN man
's PART POS 's
hands NOUN NNS hand
were AUX VBD be
behind ADP IN behind

 SPACE _SP 

his PRON PRP$ his
back NOUN NN back
, PUNCT , ,
the DET DT the
wrists NOUN NNS wrist
bound VERB VBN bind
with ADP IN with
a DET DT a
cord NOUN NN cord
. PUNCT . .
  SPACE _SP  


In [64]:
# CHALLENGE SOLUTION:
for token in doc_list[1]:
    print(f'{token.text:15} {token.pos_:15} {token.dep_:15} {token.lemma_:15}')

The             DET             det             the            
man             NOUN            poss            man            
's              PART            case            's             
hands           NOUN            nsubj           hand           
were            AUX             ROOT            be             
behind          ADP             prep            behind         

               SPACE           dep             
              
his             PRON            poss            his            
back            NOUN            pobj            back           
,               PUNCT           punct           ,              
the             DET             det             the            
wrists          NOUN            appos           wrist          
bound           VERB            acl             bind           
with            ADP             prep            with           
a               DET             det             a              
cord            NOUN            pobj    
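
The table above breaks at the newline token, because its `text` and `lemma` are printed literally. One possible workaround, sketched here in plain Python on a few rows copied from that output, is to show whitespace-only values in escaped form. The helper name `show` is hypothetical.

```python
# Rows copied from the output above as (text, pos, dep, lemma);
# the newline token is the one that breaks the table.
rows = [
    ("were", "AUX", "ROOT", "be"),
    ("behind", "ADP", "prep", "behind"),
    ("\n", "SPACE", "dep", "\n"),
    ("his", "PRON", "poss", "his"),
]

def show(value):
    """Return whitespace-only strings in escaped form so they stay on one line."""
    return repr(value) if value.isspace() else value

lines = [
    f"{show(text):15} {pos:15} {dep:15} {show(lemma):15}"
    for text, pos, dep, lemma in rows
]
print("\n".join(lines))
```

In the notebook, the same helper could wrap `token.text` and `token.lemma_` inside the loop.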

**6. Write a matcher called 'SwimmingVigorously' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [13]:
# Import the Matcher library:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [14]:
# Create a pattern and add it to matcher:
pattern1 = [{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP':'*'}, {'LOWER': 'vigorously'}]
matcher.add('SwimmingVigorously', [pattern1])


In [15]:
# Create a list of matches called "found_matches" and print the list:
found_matches = matcher(doc)
found_matches


[(13245044497498710760, 1274, 1277), (13245044497498710760, 3609, 3612)]
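
Each tuple in the list above is `(match_id, start, end)`, where `start` and `end` are token indices and `end` is exclusive. The sketch below unpacks the two tuples copied from that output; the `Match` named tuple is a hypothetical convenience, not part of spaCy.

```python
from collections import namedtuple

# Hypothetical helper type for readability.
Match = namedtuple("Match", ["match_id", "start", "end"])

# Values copied from the found_matches output above.
found = [
    Match(13245044497498710760, 1274, 1277),
    Match(13245044497498710760, 3609, 3612),
]

for m in found:
    # end is exclusive, so the span length is end - start.
    print(m.start, m.end, m.end - m.start)
```

Both spans are three tokens long, which is consistent with a whitespace token sitting between the two words. In the notebook, `nlp.vocab.strings[match_id]` should resolve the hash back to the pattern name.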

**7. Print the text surrounding each found match**

In [54]:
# Number of context tokens to show on each side of a match:
token_size = 10
start_index = found_matches[0][1]-token_size
end_index = found_matches[0][2]+token_size
doc[start_index:end_index]

 By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and

In [56]:
start_index = found_matches[1][1]-token_size
end_index = found_matches[1][2]+token_size
doc[start_index:end_index]


saw all this over his shoulder; he was now swimming
vigorously with the current.  His brain was as energetic
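
Subtracting `token_size` from the match start works here because both matches sit far from the start of the document. For a match near the beginning, the start index would go negative, and Python-style slicing reads a negative index as counting from the end. A minimal sketch of a clamped window follows; the helper name `context_window` and the list of fake tokens are assumptions for illustration. The helper only needs `len()` and slicing, both of which a `Doc` supports.

```python
def context_window(seq, start, end, size=10):
    """Return seq[start - size : end + size], clamped to the sequence bounds."""
    lo = max(start - size, 0)
    hi = min(end + size, len(seq))
    return seq[lo:hi]

# Hypothetical stand-in for a Doc: a plain list of 20 "tokens".
tokens = [f"t{i}" for i in range(20)]

# A match at tokens 2..4 with size=10: the unclamped start of -8 is read
# as "8 from the end", so the slice returns the wrong tokens.
print(tokens[2 - 10:4 + 10])         # unclamped
print(context_window(tokens, 2, 4))  # clamped
```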

**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [57]:
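# Sentences are in document order, so the first one that ends after the
# match start is the sentence containing the match.
for sent in doc_list:
    if found_matches[0][1] < sent.end: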
        print(sent)
        break

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [58]:
for sent in doc_list:
    if found_matches[1][1] < sent.end:
        print(sent)
        break

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
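
The loops above scan the sentences in order and stop at the first one whose `end` lies past the match start. The same lookup can be sketched as a binary search over the sentence end offsets. The boundaries below are hypothetical values chosen for illustration, not offsets from the story.

```python
from bisect import bisect_right

# Hypothetical (start, end) token boundaries for four consecutive sentences;
# end is exclusive, as with spaCy spans.
sent_bounds = [(0, 8), (8, 27), (27, 40), (40, 55)]
sent_ends = [end for _, end in sent_bounds]

def sentence_index(token_index):
    """Index of the sentence whose half-open range contains token_index."""
    return bisect_right(sent_ends, token_index)

for match_start in (0, 26, 27, 54):
    i = sentence_index(match_start)
    start, end = sent_bounds[i]
    print(match_start, "->", i, (start, end))
```

spaCy spans also expose a `.sent` attribute, so `doc[start:end].sent` is probably the shortest route in the notebook once sentence boundaries are set.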
