In [None]:
import spacy

In [None]:
nlp_en = spacy.blank("en")
nlp_en

<spacy.lang.en.English at 0x7f34f51a4850>

**Documents, Spans and Tokens**

In [None]:
doc = nlp_en("This is a sentence")

In [None]:
doc.text

'This is a sentence'

In [None]:
first_token = doc[0]
first_token

This

In [None]:
type(first_token)

spacy.tokens.token.Token

In [None]:
first_token.text

'This'

In [None]:
doc = nlp_en("I like tree kangaroos and narwhals.")

In [None]:
tree_kangaroos = doc[2:4]

In [None]:
tree_kangaroos.text

'tree kangaroos'

In [None]:
tree_kangaroos_and_narwhal = doc[2:]

In [None]:
tree_kangaroos_and_narwhal.text

'tree kangaroos and narwhals.'
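A slice of a Doc is a Span, and it keeps a reference to its position in the parent document. The sketch below assumes only a blank English pipeline (tokenizer only) and collects a few Span attributes into a dictionary named `span_info`, which is a made-up name used here for illustration.

```python
import spacy

# A blank pipeline contains only the tokenizer, which is all slicing needs
nlp_en = spacy.blank("en")
doc = nlp_en("I like tree kangaroos and narwhals.")

# Slicing follows Python conventions: start is inclusive, end is exclusive
span = doc[2:4]

# span_info is an illustrative helper dictionary, not part of spaCy
span_info = {
    "text": span.text,
    "start": span.start,   # index of the first token in the span
    "end": span.end,       # index one past the last token in the span
    "length": len(span),   # number of tokens in the span
    "tokens": [token.text for token in span],
}
print(span_info)
```

The `start` and `end` values are token indices into the parent Doc, not character offsets.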

**Lexical Attributes**

In [None]:
doc = nlp_en(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

In [None]:
print([(token, token.idx) for token in doc])
print([(token, token.i) for token in doc])

[(In, 0), (1990, 3), (,, 7), (more, 9), (than, 14), (60, 19), (%, 21), (of, 23), (people, 26), (in, 33), (East, 36), (Asia, 41), (were, 46), (in, 51), (extreme, 54), (poverty, 62), (., 69), (Now, 71), (less, 75), (than, 80), (4, 85), (%, 86), (are, 88), (., 91)]
[(In, 0), (1990, 1), (,, 2), (more, 3), (than, 4), (60, 5), (%, 6), (of, 7), (people, 8), (in, 9), (East, 10), (Asia, 11), (were, 12), (in, 13), (extreme, 14), (poverty, 15), (., 16), (Now, 17), (less, 18), (than, 19), (4, 20), (%, 21), (are, 22), (., 23)]
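The two outputs above differ because `token.idx` is a character offset into `doc.text`, while `token.i` is the token's position in the Doc. A small sketch of that relationship, assuming a blank English pipeline; the variable names `offsets_ok` and `numbers` are illustrative only.

```python
import spacy

nlp_en = spacy.blank("en")
doc = nlp_en(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# token.idx is where the token starts in the original string, so slicing
# the text from idx to idx + len(token.text) should give the token back
offsets_ok = all(
    doc.text[token.idx : token.idx + len(token.text)] == token.text
    for token in doc
)
print(offsets_ok)

# token.i is the token's index in the Doc, so doc[token.i] is the token itself
positions_ok = all(doc[token.i].text == token.text for token in doc)
print(positions_ok)

# like_num is a lexical attribute: it needs no trained model
numbers = [token.text for token in doc if token.like_num]
print(numbers)
```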


In [None]:
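for token in doc:
    # Only look ahead when the number-like token is not the last token
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        # A number followed by "%" is a percentage
        if next_token.text == "%":
            print("%age Found", token.text)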

%age Found 60
%age Found 4


**Trained Pipelines**

    * Trained pipeline components have statistical models that enable spaCy to make predictions in context.
    * This usually includes part-of-speech tags, syntactic dependencies and named entities.
    * Pipelines are trained on large datasets of labeled example texts.
    * They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

**spaCy provides a number of trained pipeline packages you can download using the spacy download command.**

    For example, the "en_core_web_sm" package is a small English pipeline that supports all core capabilities and is trained on web text.   
    The package provides the binary weights that enable spaCy to make predictions.
    It also includes the vocabulary, meta information about the pipeline and the configuration file used to train it. 
    The configuration tells spaCy which language class to use and how to configure the processing pipeline.
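One way to see the difference between a blank and a trained pipeline is to inspect their components and meta information. This is a rough sketch: it guards the `spacy.load` call with `spacy.util.is_package`, because it assumes the "en_core_web_sm" package may not be installed in every environment.

```python
import spacy
from spacy.util import is_package

# A blank pipeline has a tokenizer but no trained components
blank_nlp = spacy.blank("en")
print(blank_nlp.lang, blank_nlp.pipe_names)

# Load the trained package only if it has been downloaded beforehand
if is_package("en_core_web_sm"):
    nlp = spacy.load("en_core_web_sm")
    # Names of the components that run on every text, in order
    print(nlp.pipe_names)
    # Meta information shipped with the package
    print(nlp.meta["lang"], nlp.meta["name"], nlp.meta["version"])
else:
    print("Run: python -m spacy download en_core_web_sm")
```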

**Predict Part-of-Speech tags**

In [None]:
nlp = spacy.load("en_core_web_sm")
nlp

<spacy.lang.en.English at 0x7f348f0756f0>

In [None]:
doc = nlp("She ate the pizza")

In [None]:
for token in doc:
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN


**Predict Syntactic Dependencies**

    We can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

**To describe syntactic dependencies, spaCy uses a standardized label scheme. Here are some common labels, illustrated with the sentence "She ate the pizza":**

    The pronoun "She" is a nominal subject (nsubj) attached to the verb, in this case "ate".

    The noun "pizza" is a direct object (dobj) attached to the verb "ate". It is eaten by the subject, "She".

    The determiner "the" (det), also known as an article, is attached to the noun "pizza".

In [None]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate
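Because each token points to its head, the dependency tree can be walked to pull out the subject and object of a verb. The sketch below does not call the trained pipeline. Instead it builds a Doc by hand and copies the tags, heads and labels from the output above, so the annotations are an assumption of the example rather than model predictions.

```python
import spacy
from spacy.tokens import Doc

nlp_en = spacy.blank("en")

# Hand-written annotations, copied from the trained pipeline's output above
words = ["She", "ate", "the", "pizza"]
pos = ["PRON", "VERB", "DET", "NOUN"]
heads = [1, 1, 3, 1]  # absolute index of each token's head; the root heads itself
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(nlp_en.vocab, words=words, pos=pos, heads=heads, deps=deps)

# The root of the sentence is the token labeled ROOT
root = [token for token in doc if token.dep_ == "ROOT"][0]

# Children are the tokens directly attached to a given head
subjects = [child.text for child in root.children if child.dep_ == "nsubj"]
objects = [child.text for child in root.children if child.dep_ == "dobj"]
print(root.text, subjects, objects)
```

With a trained pipeline, the same traversal works on `nlp("She ate the pizza")` without constructing the Doc manually.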


**Predicting Named Entities**

**Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.**



In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

In [None]:
for entity in doc.ents:
    print(entity, entity.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


In [None]:
spacy.explain("GPE")

'Countries, cities, states'

**In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.**

In [None]:
text = "It's official: Apple is the first U.S. public company to reach a $1 trillion market value"

In [None]:
doc = nlp(text)

In [None]:
doc.text

"It's official: Apple is the first U.S. public company to reach a $1 trillion market value"

In [None]:
for token in doc:
    print(f'{token.text:<14}{token.pos_:<12}{token.dep_}')

It            PRON        nsubj
's            AUX         ccomp
official      ADJ         acomp
:             PUNCT       punct
Apple         PROPN       nsubj
is            AUX         ROOT
the           DET         det
first         ADJ         amod
U.S.          PROPN       nmod
public        ADJ         amod
company       NOUN        attr
to            PART        aux
reach         VERB        relcl
a             DET         det
$             SYM         quantmod
1             NUM         compound
trillion      NUM         nummod
market        NOUN        compound
value         NOUN        dobj


In [None]:
spacy.explain("nmod")

'modifier of nominal'

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


**Predicting Named Entities in Context**

In [None]:
text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

In [None]:
doc = nlp(text)

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)       # The model did not predict "iPhone X" as an entity

Apple ORG


In [None]:
[token.i for token in doc if token.text in ("iPhone", "X")]

[1, 2]

In [None]:
iphone_x = doc[1:3]
iphone_x

iPhone X
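When the model misses an entity, one possible follow-up is to label the span manually and add it to `doc.ents`. The sketch below assumes a blank English pipeline, so `doc.ents` starts out empty, and it assumes "PRODUCT" is the label wanted for "iPhone X".

```python
import spacy
from spacy.tokens import Span

nlp_en = spacy.blank("en")
doc = nlp_en("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Create a labeled span over tokens 1 and 2 ("iPhone", "X")
# The "PRODUCT" label is an assumption made for this example
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# doc.ents can be overwritten as long as the spans do not overlap
doc.ents = list(doc.ents) + [iphone_x]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With the trained pipeline used in this notebook, `doc.ents` would already contain "Apple", and the manual span would be appended to it.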

**Rule Based Matching**

    Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.
    It's also more flexible: you can search for texts but also other lexical attributes.

    You can even write rules that use a model's predictions.
    For example, find the word "duck" only if it's a verb, not a noun.
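The "duck" case can be sketched with a single token pattern that combines a lexical attribute with a part-of-speech tag. To keep the example independent of a trained model, the Doc below is built by hand with assumed part-of-speech tags; with a real pipeline those tags would come from the tagger.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp_en = spacy.blank("en")

# Hand-written tags for illustration; a trained pipeline would predict these
words = ["They", "duck", "when", "a", "duck", "flies", "by"]
pos = ["PRON", "VERB", "SCONJ", "DET", "NOUN", "VERB", "ADV"]
doc = Doc(nlp_en.vocab, words=words, pos=pos)

# One token whose lowercase form is "duck" AND whose tag is VERB
matcher = Matcher(nlp_en.vocab)
matcher.add("DUCK_VERB", [[{"LOWER": "duck", "POS": "VERB"}]])

verb_hits = [(start, end, doc[start:end].text) for _, start, end in matcher(doc)]
print(verb_hits)
```

Only the first "duck" (token 1) matches; the second one is tagged as a noun and is skipped.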

* Matcher

In [None]:
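# Match the exact token texts "iPhone" followed by "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
# Alternative: case-insensitive match on the lowercase form
# pattern = [{"LOWER": "iphone"}, {"LOWER": "x"}]
# Alternative: any form of "buy" followed by a noun
# pattern = [{"LEMMA": "buy"}, {"POS": "NOUN"}]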

In [None]:
from spacy.matcher import Matcher

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
matcher = Matcher(nlp.vocab)

In [None]:
matcher

<spacy.matcher.matcher.Matcher at 0x7f348f1f3250>

In [None]:
matcher.add("IPHONE_PATTERN", [pattern])

In [None]:
doc = nlp("Upcoming iPhone X release date leaked")

In [None]:
matches = matcher(doc)
matches

[(9528407286733565721, 1, 3)]

    match_id: hash value of the pattern name
    start: start token index of the matched span
    end: end token index of the matched span (exclusive)

In [None]:
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X
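The `match_id` is a hash, but the original pattern name can be looked up in the vocabulary's string store. A minimal sketch, assuming a blank English pipeline since this pattern only uses token texts:

```python
import spacy
from spacy.matcher import Matcher

nlp_en = spacy.blank("en")
matcher = Matcher(nlp_en.vocab)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp_en("Upcoming iPhone X release date leaked")

results = []
for match_id, start, end in matcher(doc):
    # The string store maps the hash back to the name passed to matcher.add
    rule_name = nlp_en.vocab.strings[match_id]
    results.append((rule_name, start, end, doc[start:end].text))
print(results)
```

This is useful once several patterns are registered on the same matcher, because it shows which rule produced each match.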


Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.

In [None]:
doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

In [None]:
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

In [None]:
matcher.add("IOS_VERSION_PATTERN", [pattern])

In [None]:
matches = matcher(doc)

In [None]:
for match_id, start, end in matches:
    matches_span = doc[start:end]
    print(matches_span)


iOS 7
iOS 11
iOS 10


In [None]:
doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)


In [None]:
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

In [None]:
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])

In [None]:
matches = matcher(doc)
matches

[(1475109908168048428, 1, 3),
 (1475109908168048428, 21, 23),
 (1475109908168048428, 52, 54)]

In [None]:
for match_id, start, end in matches:
    matches_span = doc[start:end]
    print(matches_span)

downloaded Fortnite
downloading Minecraft
download Winzip


Write one pattern that matches adjectives ("ADJ") followed by one or two "NOUN"s (one noun and one optional noun).


In [None]:
doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

In [None]:
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

In [None]:
matcher.add("ADJ_NOUN_PATTERN", [pattern])

In [None]:
matches = matcher(doc)
matches

[(5488211386492616699, 6, 8),
 (5488211386492616699, 9, 11),
 (5488211386492616699, 12, 14),
 (5488211386492616699, 15, 17),
 (5488211386492616699, 15, 18)]

In [None]:
for match_id, start, end in matches:
    matches_span = doc[start:end]
    print(matches_span)

beautiful design
smart search
automatic labels
optional voice
optional voice responses
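The optional operator produces overlapping matches: both "optional voice" and "optional voice responses" are returned. If only the longest match is wanted, one option is `spacy.util.filter_spans`. The sketch below uses a hand-built Doc with assumed part-of-speech tags for a shortened version of the sentence, so that it runs without a trained model.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

nlp_en = spacy.blank("en")

# Hand-written tags for illustration; a trained pipeline would predict these
words = ["automatic", "labels", "and", "optional", "voice", "responses"]
pos = ["ADJ", "NOUN", "CCONJ", "ADJ", "NOUN", "NOUN"]
doc = Doc(nlp_en.vocab, words=words, pos=pos)

matcher = Matcher(nlp_en.vocab)
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]
matcher.add("ADJ_NOUN_PATTERN", [pattern])

# Every match, including the overlapping shorter one
all_spans = [doc[start:end] for _, start, end in matcher(doc)]
all_texts = sorted(span.text for span in all_spans)
print(all_texts)

# filter_spans drops overlaps and prefers the longest span
longest_texts = [span.text for span in filter_spans(all_spans)]
print(longest_texts)
```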
