## Chapter 1: Finding words, phrases, names and concepts

#### Part 1: English

Import the English class from spacy.lang.en and create the nlp object.

Create a doc and print its text.

In [1]:
# Import the English language class
from spacy.lang.en import English
# Create the nlp object
nlp = English()
# Process a text
doc = nlp("This is a sentence.")
# Print the document text
print(doc.text)

This is a sentence.


#### Part 2: German

    Import the German class from spacy.lang.de and create the nlp object.
    Create a doc and print its text.

In [3]:
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


#### Part 3: Spanish

    Import the Spanish class from spacy.lang.es and create the nlp object.
    Create a doc and print its text.

In [5]:
# Import the Spanish language class
from spacy.lang.es import Spanish

# Create the nlp object
nlp = Spanish()

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(doc.text)

¿Cómo estás?
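The three parts above follow the same recipe, so the language class can also be selected by its language code. This is a minimal sketch, assuming `spacy.blank` is available in the installed spaCy version; the `samples` dictionary is only an illustrative name.

```python
import spacy
from spacy.lang.de import German

# Language codes mapped to the sample texts used above
samples = {
    "en": "This is a sentence.",
    "de": "Liebe Grüße!",
    "es": "¿Cómo estás?",
}

for code, text in samples.items():
    # spacy.blank(code) creates an empty pipeline for that language,
    # much like importing the class and calling it, e.g. German()
    nlp = spacy.blank(code)
    doc = nlp(text)
    print(nlp.lang, "->", doc.text)

# The blank German pipeline is an instance of the German class
print(isinstance(spacy.blank("de"), German))
```

A blank pipeline only tokenizes, which is all these three exercises need.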


### Documents, Spans and Tokens

In [6]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


**Step 2:**

    Import the English language class and create the nlp object.
    Process the text and instantiate a Doc object in the variable doc.
    Create a slice of the Doc for the tokens “tree kangaroos” and “tree kangaroos and narwhals”.

In [15]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:-1]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals
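To see why `doc[2:4]` and `doc[2:-1]` select those words, it can help to print the token list and the span boundaries. The sketch below reuses the same sentence and only relies on the tokenizer.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I like tree kangaroos and narwhals.")

# Tokens are indexed from 0; the final "." is a token of its own
tokens = [token.text for token in doc]
print(tokens)

# doc[2:4] covers tokens 2 and 3; the end index 4 is exclusive
span = doc[2:4]
print(span.text, span.start, span.end, len(span))

# A negative end index counts from the end, so -1 drops the final "."
print(doc[2:-1].text)
```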


### Lexical Attributes

In this example, you’ll use spaCy’s Doc and Token objects, together with lexical attributes, to find percentages in a text. You’ll be looking for two consecutive tokens: a number and a percent sign.

    1. Use the like_num token attribute to check whether a token in the doc resembles a number.
    2. Get the token following the current token in the document. The index of the next token in the doc is token.i + 1.
    3. Check whether the next token’s text attribute is a percent sign “%”.

In [14]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i+1]
        # Check if the next token's text equals '%'
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4
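The loop above reads `doc[token.i + 1]` unconditionally, which would raise an `IndexError` if the text ended in a number. A hedged variant with a bounds check might look like this; the sample sentence and the `percentages` list are invented for illustration.

```python
from spacy.lang.en import English

nlp = English()
# Illustrative text whose last token is a number
doc = nlp("Prices rose 5% in May and 7% in June, up from 3")

percentages = []
for token in doc:
    # Only look ahead when a next token actually exists
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            percentages.append(token.text)

print("Percentages found:", percentages)
```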


### Statistical Models

Statistical models enable spaCy to predict linguistic attributes in context:

    Part-of-speech tags
    Syntactic dependencies
    Named entities

Models are trained on labeled example texts:

    Can be updated with more examples to fine-tune predictions
    
### Model Packages:

    $ python -m spacy download en_core_web_sm

    import spacy

    nlp = spacy.load('en_core_web_sm')

A model package contains:

    1. Binary weights
    2. Vocabulary
    3. Meta information (language, pipeline)
    
### Predicting Part of Speech:
    
    import spacy

    # Load the small English model
    nlp = spacy.load('en_core_web_sm')

    # Process a text
    doc = nlp("She ate the pizza")

    # Iterate over the tokens
    for token in doc:
        # Print the text and the predicted part-of-speech tag
        print(token.text, token.pos_)
        
### Predicting Syntactic Dependencies

    for token in doc:
        print(token.text, token.pos_, token.dep_, token.head.text)

<img src="dep_example.png">
<table align='center'>
<tr>
<td align='center'><b>Label</b></td>	<td align='center'><b>Description</b></td>	<td align='center'><b>Example</b></td>
</tr>
<tr>
<td align='center'>nsubj</td>	<td align='center'>nominal subject</td>	<td align='center'>She</td>
</tr>
<tr>
<td align='center'>dobj</td>	<td align='center'>direct object</td>	<td align='center'>pizza</td>
</tr>
<tr>
<td align='center'>det</td>	<td align='center'>determiner (article)</td>	<td align='center'>the</td>
</tr>
</table>


### Predicting Named Entities

    # Process a text
    doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

    # Iterate over the predicted entities
    for ent in doc.ents:
        # Print the entity text and its label
        print(ent.text, ent.label_)

### Tip: the explain method

Get quick definitions of the most common tags and labels.

    spacy.explain('GPE')
    # 'Countries, cities, states'

    spacy.explain('NNP')
    # 'noun, proper singular'

    spacy.explain('dobj')
    # 'direct object'
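`spacy.explain` is a glossary lookup, so it works without loading a model. As a small sketch, the labels that appear in this chapter can be explained in one loop; the `labels` list is just an illustrative selection.

```python
import spacy

# Labels taken from the examples in this chapter
labels = ["nsubj", "dobj", "det", "GPE", "NNP"]

for label in labels:
    # spacy.explain looks the label up in spaCy's built-in glossary
    print(f"{label:<6} {spacy.explain(label)}")
```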

## Let's practice!

### Loading Packages:

The models we’re using in this course are pre-installed. For more details on spaCy’s statistical models and how to install them on your machine, see the documentation.

    Use spacy.load to load the small English model 'en_core_web_sm'.
    Process the text and print the document text.

In [15]:
import spacy

# Load the 'en_core_web_sm' model
nlp = spacy.load('en_core_web_sm')

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


### Predicting linguistic annotations

You’ll now get to try one of spaCy’s pre-trained model packages and see its predictions in action. Feel free to try it out on your own text! To find out what a tag or label means, you can call spacy.explain in the loop. For example: spacy.explain('PROPN') or spacy.explain('GPE').

#### Part 1

   1. Process the text with the nlp object and create a doc.
   2. For each token, print the token text, the token’s <code>.pos_</code> (part-of-speech tag) and the token’s <code>.dep_</code> (dependency label).

In [16]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))

It          PRON      nsubj     
’s          PROPN     ROOT      
official    NOUN      acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          VERB      ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


#### Part 2

   1. Process the text and create a <code>doc</code> object.
   2. Iterate over <code>doc.ents</code> and print the entity text and <code>label_</code> attribute.

In [18]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


### Predicting named entities in context

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

   1. Process the text with the <code>nlp</code> object.
   2. Iterate over the entities and print the entity text and label.
   3. Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.

In [20]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X
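One way to record the missed entity is to build a labelled `Span` and append it to `doc.ents`. This is a minimal sketch, not part of the original exercise: the label "PRODUCT" is an assumption (the source does not name one), and a blank English pipeline is used so no model download is needed.

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()  # blank pipeline: tokenizer only, so doc.ents starts empty
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

# Tokens 1 and 2 are "iPhone" and "X"; "PRODUCT" is an assumed label
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Append the manual span to whatever entities are already on the doc
doc.ents = list(doc.ents) + [iphone_x]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a trained model the same assignment keeps the predicted entities, as long as the new span does not overlap any of them.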


## Rule-based matching

### Why not just regular expressions?
    
    Match on Doc objects, not just strings
    Match on tokens and token attributes
    Use the model's predictions
    Example: "duck" (verb) vs. "duck" (noun)

### Match patterns

Match patterns are lists of dictionaries, one per token.

Match exact token texts

<code>
    [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
</code>

Match lexical attributes

<code>
    [{'LOWER': 'iphone'}, {'LOWER': 'x'}]
</code>

Match any token attributes

<code>
    [{'LEMMA': 'buy'}, {'POS': 'NOUN'}]
</code>
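The difference between `TEXT` and `LOWER` shows up as soon as the casing varies. The sketch below compares both patterns on an invented sentence. The helper `find` is hypothetical, and the call uses the list-of-patterns form of `matcher.add`, which assumes spaCy v2.2.2 or later rather than the three-argument form shown elsewhere in these notes.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
doc = nlp("IPHONE X and iPhone X")

def find(pattern):
    """Return the texts of all spans matched by a single pattern."""
    matcher = Matcher(nlp.vocab)
    # List-of-patterns form of add; assumed spaCy v2.2.2 or later
    matcher.add("DEMO", [pattern])
    return [doc[start:end].text for _, start, end in matcher(doc)]

# TEXT compares the exact token text, LOWER the lowercased form
exact = find([{"TEXT": "iPhone"}, {"TEXT": "X"}])
lowercase = find([{"LOWER": "iphone"}, {"LOWER": "x"}])
print("TEXT pattern: ", exact)
print("LOWER pattern:", lowercase)
```

Both attributes are lexical, so the comparison runs on a blank pipeline. Patterns using `LEMMA` or `POS` need a loaded model.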

### Using the Matcher (1)

<code>
    import spacy
</code>
<code>
    # Import the Matcher
    from spacy.matcher import Matcher
</code>
<code>
    # Load a model and create the nlp object
    nlp = spacy.load('en_core_web_sm')
</code>
<code>
    # Initialize the matcher with the shared vocab
    matcher = Matcher(nlp.vocab)
</code>
<code>
    # Add the pattern to the matcher
    pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
    matcher.add('IPHONE_PATTERN', None, pattern)
</code>
<code>
    # Process some text
    doc = nlp("New iPhone X release date leaked")
</code>
<code>
    # Call the matcher on the doc
    matches = matcher(doc)
</code>

### Using the Matcher (2)
<code>
    # Call the matcher on the doc
    doc = nlp("New iPhone X release date leaked")
    matches = matcher(doc)
    # Iterate over the matches
    for match_id, start, end in matches:
        # Get the matched span
        matched_span = doc[start:end]
        print(matched_span.text)
    # Output: iPhone X
</code>

Each match is a tuple of three values:

    match_id: hash value of the pattern name
    start: start index of matched span
    end: end index of matched span (exclusive)
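Because `match_id` is a hash, it is not readable on its own. The sketch below looks the pattern name up again through the vocabulary's string store. It runs on a blank English pipeline and uses the list-of-patterns form of `matcher.add`, which assumes spaCy v2.2.2 or later.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
# One name can hold several patterns, hence the outer list
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("New iPhone X release date leaked")
matches = matcher(doc)

for match_id, start, end in matches:
    # The string store maps the hash back to the pattern name
    name = nlp.vocab.strings[match_id]
    print(name, start, end, doc[start:end].text)
```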

### Matching lexical attributes
<code>
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]
doc = nlp("2018 FIFA World Cup: France won!")
# Matches: 2018 FIFA World Cup:
</code>
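The fragment above leaves out the matcher setup. A complete version could look like the sketch below. Since the pattern uses only lexical attributes, a blank English pipeline is enough. The pattern name `FIFA_PATTERN` is made up for this example, and the `matcher.add` call assumes spaCy v2.2.2 or later.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_DIGIT": True},   # a token made of digits, e.g. "2018"
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},   # any punctuation token, here ":"
]
# "FIFA_PATTERN" is an illustrative name
matcher.add("FIFA_PATTERN", [pattern])

doc = nlp("2018 FIFA World Cup: France won!")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```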

### Matching other token attributes
<code>
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
doc = nlp("I loved dogs but now I love cats more.")
# Matches:
# loved dogs
# love cats
</code>

### Using operators and quantifiers (1)
<code>
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
doc = nlp("I bought a smartphone. Now I'm buying apps.")
# Matches:
# bought a smartphone
# buying apps
</code>

### Using operators and quantifiers (2)

| Example     | Description                  |
|-------------|------------------------------|
| {'OP': '!'} | Negation: match 0 times      |
| {'OP': '?'} | Optional: match 0 or 1 times |
| {'OP': '+'} | Match 1 or more times        |
| {'OP': '*'} | Match 0 or more times        |
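To get a feel for the operators, the sketch below compares `?` and `+` on an invented sentence. The helper `match_texts` is hypothetical and collects the distinct matched texts, since quantified patterns can return several overlapping matches. It assumes the list-of-patterns form of `matcher.add` (spaCy v2.2.2 or later).

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
doc = nlp("This is good, that is very good and that is very very good.")

def match_texts(pattern):
    """Return the distinct texts matched by one pattern, sorted."""
    matcher = Matcher(nlp.vocab)
    matcher.add("DEMO", [pattern])
    return sorted({doc[start:end].text for _, start, end in matcher(doc)})

# "?" allows zero or one "very" before "good"
optional = match_texts([{"LOWER": "very", "OP": "?"}, {"LOWER": "good"}])
# "+" requires at least one "very" before "good"
one_or_more = match_texts([{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}])

print("OP '?':", optional)
print("OP '+':", one_or_more)
```

The `?` pattern never yields "very very good" because it accepts at most one "very", while the `+` pattern never yields a bare "good".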

## Let's practice!


### Using the Matcher

Let’s try spaCy’s rule-based Matcher. You’ll use the example from the previous exercise and write a pattern that matches the phrase “iPhone X” in the text.

    1. Import the Matcher from spacy.matcher.
    2. Initialize it with the nlp object’s shared vocab.
    3. Create a pattern that matches the 'TEXT' values of two tokens: "iPhone" and "X".
    4. Use the matcher.add method to add the pattern to the matcher.
    5. Call the matcher on the doc and store the result in the variable matches.
    6. Iterate over the matches and get the matched span from the start to the end index.

In [25]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{'TEXT':'iPhone'},{'TEXT':'X'}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']


### Writing Match Patterns:
    
In this exercise, you’ll practice writing more complex match patterns using different token attributes and operators.

#### Part 1

    Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.

In [3]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": 'iOS'}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


#### Part 2

Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag 'PROPN' (proper noun).

In [7]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": 'download'}, {"POS": 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


#### Part 3

Write one pattern that matches adjectives ('ADJ') followed by one or two 'NOUN's (one noun and one optional noun).

In [8]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": 'ADJ'}, {"POS": 'NOUN'}, {"POS": 'NOUN', "OP": '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 4
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice responses


### CHAPTER 1 COMPLETE