# NLP Basics

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>

In [3]:
!wget https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/owlcreek.txt

--2021-04-01 15:28:18--  https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/owlcreek.txt
Resolving frenzy86.s3.eu-west-2.amazonaws.com (frenzy86.s3.eu-west-2.amazonaws.com)... 52.95.150.86
Connecting to frenzy86.s3.eu-west-2.amazonaws.com (frenzy86.s3.eu-west-2.amazonaws.com)|52.95.150.86|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21719 (21K) [text/plain]
Saving to: ‘owlcreek.txt’


2021-04-01 15:28:19 (168 KB/s) - ‘owlcreek.txt’ saved [21719/21719]



In [4]:
with open('owlcreek.txt') as f:
    doc = nlp(f.read())

In [5]:
# Slicing a Doc returns a Span; show the first 36 tokens
doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [None]:
len(doc)

4835

**3. How many sentences are contained in the file?**<br>

In [7]:
sents = list(doc.sents)  # materialize the sentence generator so it can be indexed
len(sents)

249

**4. Print the sentence at index 2 (the opening sentence of the story)**<br>

In [8]:
print(sents[2].text)

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  


**5. For each token in the sentence at index 4, print its `text`, `POS` tag, `dep` tag and `lemma`**<br>

In [None]:
# NORMAL SOLUTION:
for token in sents[4]:
    print(token.text, token.pos_, token.dep_, token.lemma_)

his DET poss -PRON-
back NOUN ROOT back
, PUNCT punct ,
the DET det the
wrists NOUN appos wrist
bound VERB acl bind
with ADP prep with
a DET det a
cord NOUN pobj cord
. PUNCT punct .
  SPACE   


In [None]:
# CHALLENGE SOLUTION:
for token in sents[4]:
    print(f'{token.text:{15}} {token.pos_:{5}} {token.dep_:{10}} {token.lemma_:{15}}')

his             DET   poss       -PRON-         
back            NOUN  ROOT       back           
,               PUNCT punct      ,              
the             DET   det        the            
wrists          NOUN  appos      wrist          
bound           VERB  acl        bind           
with            ADP   prep       with           
a               DET   det        a              
cord            NOUN  pobj       cord           
.               PUNCT punct      .              
                SPACE                           
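
The `{token.text:{15}}` syntax in the challenge solution nests a second pair of braces inside the format spec, so the inner value is used as the minimum field width. The sketch below shows the same mechanism in plain Python, using a few rows copied from the output above. The names `rows`, `widths` and `format_row` are illustrative and not part of the exercise.

```python
# Rows copied from the output above: (text, POS, dep, lemma).
rows = [
    ("his", "DET", "poss", "-PRON-"),
    ("wrists", "NOUN", "appos", "wrist"),
    ("bound", "VERB", "acl", "bind"),
]

# Column widths, matching the 15/5/10/15 used in the challenge solution.
widths = (15, 5, 10, 15)

def format_row(row, widths):
    """Pad each string field to its width, as f'{value:{width}}' does."""
    # Strings are left-aligned by default, so shorter values get trailing spaces.
    return " ".join(f"{value:{width}}" for value, width in zip(row, widths))

lines = [format_row(row, widths) for row in rows]
for line in lines:
    print(line)
```

Keeping the widths in a variable makes it easy to widen a column if a longer token shows up.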


**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [None]:
# Import the Matcher library:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [None]:
# Create a pattern and add it to matcher:

pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP':'*'}, {'LOWER': 'vigorously'}]

# Pass the patterns as a list: spaCy v3 rejects the older add(name, None, pattern) form
matcher.add('Swimming', [pattern])

In [None]:
# Create a list of matches called "found_matches" and print the list:

found_matches = matcher(doc)
print(found_matches)

[(12881893835109366681, 1274, 1277), (12881893835109366681, 3609, 3612)]
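
Each entry in `found_matches` is a `(match_id, start, end)` tuple: `match_id` is the hash of the pattern name, and `start` and `end` are token indices, so `doc[start:end]` is the matched span. As a small sketch that needs no model, the snippet below unpacks the two tuples printed above. The `describe` helper is an illustrative name, not part of spaCy.

```python
# The match tuples printed above: (match_id, start, end).
found_matches = [
    (12881893835109366681, 1274, 1277),
    (12881893835109366681, 3609, 3612),
]

def describe(match):
    """Return the token range and length of one match tuple."""
    match_id, start, end = match
    # `end` is exclusive, so the span covers tokens start .. end - 1.
    return {"match_id": match_id, "start": start, "end": end, "n_tokens": end - start}

summaries = [describe(m) for m in found_matches]
for s in summaries:
    print(f"tokens {s['start']}..{s['end'] - 1} ({s['n_tokens']} tokens)")
```

Both spans are three tokens long: `swimming`, the whitespace token for the line break, and `vigorously`. With the notebook's `nlp` object loaded, `nlp.vocab.strings[match_id]` should return the pattern name `'Swimming'`.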


**7. Print the text surrounding each found match**

In [None]:
print(doc[1265:1290])

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home


In [None]:
print(doc[3600:3615])

all this over his shoulder; he was now swimming
vigorously with the current


**8. Print the *sentence* that contains each found match**

In [None]:
for sent in sents:
    if found_matches[0][1] < sent.end:
        print(sent)
        break

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [None]:
for sent in sents:
    if found_matches[1][1] < sent.end:
        print(sent)
        break

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
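
The loops above pick the first sentence whose `end` lies beyond the match's start token, which works because `sents` is in document order. The same lookup can be sketched on token offsets alone, without loading a model. The sentence boundaries below are hypothetical values chosen for illustration (they are not taken from `owlcreek.txt`); only the match start `1274` comes from the match list printed earlier.

```python
from bisect import bisect_right

# HYPOTHETICAL sentence boundaries as (start, end) token offsets, end exclusive.
# In the notebook these would be [(s.start, s.end) for s in sents].
sentence_bounds = [(0, 400), (400, 1250), (1250, 1290), (1290, 2000)]

def sentence_index(bounds, token_index):
    """Return the index of the sentence containing token_index.

    Assumes bounds are sorted, contiguous, and cover token_index.
    """
    ends = [end for _, end in bounds]
    # Position of the first sentence whose exclusive end is greater than token_index.
    return bisect_right(ends, token_index)

match_start = 1274  # start token of the first match
idx = sentence_index(sentence_bounds, match_start)
print(idx, sentence_bounds[idx])
```

A binary search avoids rescanning the sentence list for every match. In spaCy itself, `doc[start:end].sent` returns the sentence containing a span directly.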


In [None]:
### Exercise: find other matches of your choice in the text (replace the placeholder words below)
pattern = [{'LOWER': 'zzzzzzzzzz'}, {'IS_SPACE': True, 'OP':'*'}, {'LOWER': 'kkkkkkkkkk'}]