<a href="https://colab.research.google.com/github/timgluz/colab_notebooks/blob/master/SpacyCourse_chapter1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 1 - Introduction to spaCy

spaCy is a popular library for advanced NLP in Python.

source: https://course.spacy.io/en/chapter1

In [None]:
# install spacy

%pip install spacy

In [None]:
# import the English language class
from spacy.lang.en import English

# create the nlp object
# it contains the processing pipeline and
# includes language-specific rules for tokenization
nlp = English()

When you process a text with the `nlp` object, spaCy creates a `Doc` object – short for "document". The `Doc` lets you access information about the text in a structured way, and no information is lost.

The `Doc` also behaves like a normal Python __sequence__: you can iterate over its tokens or get a token by its index. More on that later!


In [None]:
# create Doc object from a string
doc = nlp("Hello world!")

# iterate over tokens in a Doc
for token in doc:
  print(token.text)

print("\n-----")


In [None]:
# index into the Doc to get a single Token
token = doc[1]

# get the token text via .text attribute
print(token.text)

# a slice from the Doc is a Span object
span = doc[1:3]
print(span.text)
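
The sequence behaviour described above can be sketched a little further. This is a minimal example, assuming only the blank `English` pipeline (no trained model); the variable names are illustrative.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")

# A Doc supports len(), indexing (including negative indices) and slicing
num_tokens = len(doc)
last_token = doc[-1]
span = doc[0:2]

# Token.i is the token's position in the Doc;
# Span.start and Span.end are the token offsets of the slice
print(num_tokens)
print(last_token.text, last_token.i)
print(span.text, span.start, span.end)

# doc.text gives back the original string, whitespace included
print(doc.text == "Hello world!")
```

A slice end is exclusive, as with Python lists, so `doc[0:2]` covers the tokens at positions 0 and 1.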

### Lexical attributes

Lexical attributes refer to the entry in the vocabulary and don't depend on the token's context.



In [None]:
doc = nlp("It costs $5.")

# i is the index of the token within the parent document
print("Index:   ", [token.i for token in doc])

# text returns the token text
print("Text:    ", [token.text for token in doc])

# is_alpha indicates whether the token consists of alphabetic characters
print("is_alpha:", [token.is_alpha for token in doc])

# is_punct indicates whether the token is punctuation
print("is_punct:", [token.is_punct for token in doc])

# like_num indicates whether the token resembles a number
print("like_num:", [token.like_num for token in doc])
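
Lexical attributes can be combined with token indices to look at a token's neighbours. The sketch below, which assumes the blank `English` pipeline and an example sentence chosen for illustration, collects every number-like token that is directly followed by a percent sign.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

percentages = []
for token in doc:
    # like_num is a lexical attribute, so no trained model is needed
    if token.like_num and token.i + 1 < len(doc):
        # Look at the next token in the document
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            percentages.append(token.text)

print(percentages)
```

Here "1990" also resembles a number, but it is followed by a comma rather than "%", so it is skipped.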

# Statistical models

Statistical models enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.

Models are trained on large datasets of labeled example texts.

Meaning of POS tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html


In [None]:
# download the model package (uncomment on the first run)

# !python -m spacy download en_core_web_sm

import spacy

# load a model package by name
nlp = spacy.load("en_core_web_sm")

# process a text
doc = nlp("She ate the pizza")

# print out each token and the predicted part-of-speech tag
for token in doc:
  print(token.text, token.pos_)

In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The `.dep_` attribute returns the predicted dependency label.

The `.head` attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.


In [None]:
for token in doc:
  print(token.text, token.pos_, token.dep_, token.head.text)
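
To see how `.dep_` and `.head` fit together without downloading a model, the following sketch builds a `Doc` by hand. The tags, heads and dependency labels here are hand-written assumptions for illustration, not model predictions, and the keyword arguments assume the spaCy v3 `Doc` constructor.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Hand-annotated analysis of "She ate the pizza" (assumed, not predicted).
# heads holds the absolute index of each token's head; the root points to itself.
doc = Doc(
    nlp.vocab,
    words=["She", "ate", "the", "pizza"],
    pos=["PRON", "VERB", "DET", "NOUN"],
    heads=[1, 1, 3, 1],
    deps=["nsubj", "ROOT", "det", "dobj"],
)

# Pick out tokens by their dependency label
subjects = [token.text for token in doc if token.dep_ == "nsubj"]
objects = [token.text for token in doc if token.dep_ == "dobj"]

# The root is the token that is its own head
roots = [token.text for token in doc if token.head.i == token.i]

print(subjects, objects, roots)
print(doc[2].text, "is attached to", doc[2].head.text)
```

With a loaded model package, the same attributes are filled in by the pipeline instead of being passed in by hand.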


#### Predicting named entities

**Named entities** are "real world objects" that are assigned a name – for example, a person, an organization or a country.

The `doc.ents` property lets you access the named entities predicted by the model.

It returns a tuple of **Span** objects, so we can iterate over it and print the entity text and the entity label using the `.label_` attribute.

In [None]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
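
Because entities are ordinary `Span` objects with a label, they can also be set by hand. The sketch below does this on a manually built `Doc`; the entity spans and labels are assumptions written out for illustration, not model output.

```python
from spacy.lang.en import English
from spacy.tokens import Doc, Span

nlp = English()

# Build the Doc from explicit words so the token indices are known
words = ["Apple", "is", "looking", "at", "buying", "U.K.", "startup"]
doc = Doc(nlp.vocab, words=words)

# Hand-written entity spans (assumed labels): token 0 and token 5
doc.ents = [
    Span(doc, 0, 1, label="ORG"),
    Span(doc, 5, 6, label="GPE"),
]

# Each entity exposes its text, label and token offsets
entities = [(ent.text, ent.label_, ent.start, ent.end) for ent in doc.ents]
print(entities)

# doc.ents is a tuple of Span objects
print(type(doc.ents).__name__)
```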

**A quick tip:** To get definitions for the most common tags and labels, you can use the `spacy.explain` helper function.

For example, "GPE" for *geopolitical entity* isn't exactly intuitive – but `spacy.explain` can tell you that it refers to countries, cities and states.

The same works for part-of-speech tags and dependency labels.


In [None]:
print(spacy.explain("GPE"))
# 'Countries, cities, states'

print(spacy.explain("NNP"))
# 'noun, proper singular'

print(spacy.explain("dobj"))
# 'direct object'
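
When looking up several labels at once, it helps to handle labels that have no description. A small sketch, assuming that `spacy.explain` returns `None` for a string it does not know; `NOT_A_LABEL` is a made-up label used only to show that case.

```python
import spacy

labels = ["GPE", "nsubj", "det", "NOT_A_LABEL"]

# Map each label to its description, or None if spaCy has none
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    if description is None:
        print(label, "-> (no description available)")
    else:
        print(label, "->", description)
```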

# Rule-based matching

spaCy's matcher lets you write rules to find words and phrases in text.

Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.

It's also more flexible: you can search for exact text as well as other lexical attributes.

You can even write rules that use the model's predictions.

**Match patterns** are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

In [None]:
# Import the Matcher
from spacy.matcher import Matcher

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
# (spaCy v3 expects the patterns as a list)
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
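
The `match_id` in each result is a hash of the pattern name, which can be looked up in the vocab's string store. A minimal sketch, assuming the blank `English` pipeline and the spaCy v3 `matcher.add` signature; the case-insensitive pattern and the example sentence are illustrative.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# LOWER compares against the lowercased token text, so casing is ignored
pattern = [{"LOWER": "iphone"}, {"LOWER": "x"}]
matcher.add("IPHONE_PATTERN", [pattern])

doc = nlp("Upcoming iPhone X release date leaked. The IPHONE x is rumored too.")

results = []
for match_id, start, end in matcher(doc):
    # Resolve the hash back to the name the pattern was registered under
    rule_name = nlp.vocab.strings[match_id]
    results.append((rule_name, doc[start:end].text))

print(results)
```

Looking up the rule name becomes useful once several patterns are registered on the same matcher.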

In [None]:
# Matching lexical attributes
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]

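# Register the pattern on a fresh matcher
matcher = Matcher(nlp.vocab)
matcher.add("FIFA_PATTERN", [pattern])

doc = nlp("2018 FIFA World Cup: France won!")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches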
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

In [None]:
# Matching other token attributes

pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]

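# Register the pattern on a fresh matcher
matcher = Matcher(nlp.vocab)
matcher.add("LOVE_PATTERN", [pattern])

doc = nlp("I loved dogs but now I love cats more.")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches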
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

In [None]:
# using operators and quantifiers
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"}
]

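# Register the pattern on a fresh matcher
matcher = Matcher(nlp.vocab)
matcher.add("BUY_PATTERN", [pattern])

doc = nlp("I bought a smartphone. Now I'm buying apps.")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches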
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
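
The optional operator can also be tried without a trained model by matching on lexical attributes only. The following is a sketch under that assumption, using the blank `English` pipeline, the spaCy v3 `matcher.add` signature, and a made-up pattern name and sentence.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# "the" is optional: the pattern matches with or without it
pattern = [
    {"LOWER": "buy"},
    {"LOWER": "the", "OP": "?"},
    {"LOWER": "book"},
]
matcher.add("BUY_BOOK", [pattern])

doc = nlp("buy the book or buy book")

# Collect the matched spans and order them by position in the text
spans = sorted(
    (doc[start:end] for _, start, end in matcher(doc)),
    key=lambda span: span.start,
)
matched = [span.text for span in spans]
print(matched)
```

Both the three-token and the two-token phrase are found, because the middle token description may match zero or one time.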