# Introduction
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It's designed specifically for production use and helps you build applications that process and "understand" large volumes of text.

## Introduction to Information Extraction
`Information Extraction (IE)` is a core task in `Natural Language Processing (NLP)` and computational linguistics. It supports applications such as `Question Answering Systems` and `Machine Translation`, and it covers subtasks such as `Entity Extraction`, `Event Extraction`, `Named Entity Linking`, `Coreference Resolution` and `Relation Extraction`.

We can broadly divide Information Extraction into two branches:

- `Traditional Information Extraction`: the relations to be extracted are pre-defined.
- `Open Information Extraction`: the relations are not pre-defined. The system is free to extract any relations it comes across while going through the text data.

## Different Approaches to Information Extraction
There are several approaches to performing information extraction automatically. Let’s go through them one by one:

- **Rule-based Approach:** We define a set of rules for the syntax and other grammatical properties of a natural language and then use these rules to extract information from text.

- **Supervised:** Let’s say we have a sentence S that contains two entities, E1 and E2. A supervised machine learning model has to detect whether there is any relation (R) between E1 and E2, so the task of relation extraction turns into the task of relation detection. The main drawback of this approach is that it needs a lot of labeled data to train a model.

- **Semi-supervised:** When we don’t have enough labeled data, we can use a set of seed examples (triples) to formulate high-precision patterns that can be used to extract more relations from the text.
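
To make the rule-based approach concrete, here is a minimal sketch that uses only Python's `re` module. The rule, the `FOUNDED_RULE` name and the `extract_founded` helper are all hypothetical and exist only for illustration: a single hand-written pattern turns sentences of the form "X founded Y" into (subject, relation, object) triples.

```python
import re

# Hypothetical rule: one or more capitalized words, the word "founded",
# then one or more capitalized words.
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
FOUNDED_RULE = re.compile(rf"(?P<subj>{NAME}) founded (?P<obj>{NAME})")


def extract_founded(text):
    """Return (subject, 'founded', object) triples matched by the rule."""
    return [
        (match.group("subj"), "founded", match.group("obj"))
        for match in FOUNDED_RULE.finditer(text)
    ]


triples = extract_founded("Steve Jobs founded Apple. Bill Gates founded Microsoft.")
print(triples)
```

A rule like this is precise but brittle: it misses paraphrases such as "Apple was founded by Steve Jobs", which is one reason the supervised and semi-supervised approaches exist.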


# Getting started

In [1]:
# !pip install spacy

In [2]:
import spacy

# Statistical models
## Download statistical models

Statistical models predict part-of-speech tags, dependency labels, named entities and more. [See here](https://spacy.io/models) for available models.

In [3]:
# !python -m spacy download en_core_web_sm

## Check that your installed models are up to date

In [4]:
# !python -m spacy validate

# Loading statistical models

In [5]:
# Load the installed model "en_core_web_sm"
nlp = spacy.load("en_core_web_sm")

# Documents, tokens and spans
## Processing text

Processing text with the nlp object returns a Doc object that holds all information about the tokens, their linguistic features and their relationships.

In [6]:
doc = nlp("This is a example text")

### The English language class

In [7]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

# Process a text
doc = nlp("Progress to Contributor to make your voice count!")

# Print the document text
print(doc.text)

Progress to Contributor to make your voice count!


### The German language class

In [8]:
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


### The Spanish language class

In [9]:
# Import the Spanish language class
from spacy.lang.es import Spanish

# Create the nlp object
nlp = Spanish()

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(doc.text)

¿Cómo estás?
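
As an alternative to importing each language class, `spacy.blank` takes a language code and returns an empty pipeline for that language. The sketch below is illustrative only (the `texts_by_lang` and `tokens_by_lang` names are made up here); it tokenizes the same three texts without downloading any model.

```python
import spacy

# Language code -> example text (same texts as in the cells above)
texts_by_lang = {
    "en": "Progress to Contributor to make your voice count!",
    "de": "Liebe Grüße!",
    "es": "¿Cómo estás?",
}

tokens_by_lang = {}
for lang_code, text in texts_by_lang.items():
    # A blank pipeline contains only the language-specific tokenizer
    nlp_blank = spacy.blank(lang_code)
    doc = nlp_blank(text)
    tokens_by_lang[lang_code] = [token.text for token in doc]
    print(lang_code, tokens_by_lang[lang_code])
```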


## Accessing token attributes

In [10]:
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


In [11]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a example text")
# Token texts
[token.text for token in doc]

['This', 'is', 'a', 'example', 'text']
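
Tokens expose more than their text. The following sketch uses a blank English pipeline (an assumption made so that no model download is needed) and prints a few lexical attributes that do not depend on a statistical model: `i`, `is_alpha`, `is_punct` and `like_num`.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only, no statistical model
doc = nlp_blank("I have 5 cats.")

# Collect (index, text, is_alpha, is_punct, like_num) for every token
rows = [
    (token.i, token.text, token.is_alpha, token.is_punct, token.like_num)
    for token in doc
]
for row in rows:
    print(row)
```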

## Spans
### Accessing spans

The end index of a span is exclusive. So `doc[2:5]` is a span that starts at token 2 and runs up to, but does not include, token 5.

In [12]:
doc = nlp("This is a example text")
span = doc[2:4]
span.text

'a example'

In [13]:
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals


## Creating a span manually

In [14]:
# Import the Span object
from spacy.tokens import Span
# Create a Doc object
doc = nlp("I live in Guwahati Assam")
# Span for "Guwahati Assam" (tokens 3 and 4) with label GPE (geopolitical entity)
span = Span(doc, 3, 5, label="GPE")
span.text

'Guwahati Assam'

In [15]:
# Process the text
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == '%':
            print('Percentage found:', token.text)

Percentage found: 60
Percentage found: 4


# Linguistic features

Token attributes such as `token.pos` return integer label IDs. For the string labels, use the same attribute name with a trailing underscore, for example `token.pos_`.
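
The ID/string pairing can be seen without a statistical model. This sketch assumes a blank English pipeline and uses `token.orth` / `token.orth_` (the token's verbatim text), which follow the same convention as `pos` / `pos_`; the shared vocabulary's string store converts in both directions.

```python
import spacy

nlp_blank = spacy.blank("en")
doc = nlp_blank("I like cats")
token = doc[2]

orth_id = token.orth      # integer hash ID
orth_text = token.orth_   # string form: 'cats'
print(orth_id, orth_text)

# The string store maps an ID back to its string, and a string to its ID
id_to_text = nlp_blank.vocab.strings[orth_id]
text_to_id = nlp_blank.vocab.strings["cats"]
print(id_to_text, text_to_id)
```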

# Part-of-speech tags (predicted by statistical model)

In [16]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is an another example text.")
# Coarse-grained part-of-speech tags
[token.pos_ for token in doc]

['DET', 'VERB', 'DET', 'DET', 'NOUN', 'NOUN', 'PUNCT']

In [17]:
# Fine-grained part-of-speech tags
[token.tag_ for token in doc]

['DT', 'VBZ', 'DT', 'DT', 'NN', 'NN', '.']

# Syntactic dependencies (predicted by statistical model)

In [18]:
doc = nlp("This is a simple text example.")
# Dependency labels
[token.dep_ for token in doc]

['nsubj', 'ROOT', 'det', 'amod', 'compound', 'attr', 'punct']

In [19]:
# Syntactic head token (governor)
[token.head.text for token in doc]

['is', 'is', 'example', 'example', 'example', 'is', 'is']

# Named Entities (predicted by statistical model)

In [20]:
doc = nlp("Steve Jobs founded Apple")
# Text and label of named entity span
[(ent.text, ent.label_) for ent in doc.ents]

[('Steve Jobs', 'PERSON'), ('Apple', 'ORG')]

In [21]:
# Load the small English model
nlp = spacy.load('en_core_web_sm')
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


In [22]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print('{:<12}{:<10}{:<10}'.format(token_text, token_pos, token_dep))

It          PRON      nsubj     
’s          PROPN     ROOT      
official    NOUN      acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          VERB      ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


In [23]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


In [24]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # print the entity text and label
    print(ent.text, ent.label_)

Apple ORG


In [25]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print('Missing entity:', iphone_x.text)

Apple ORG
Missing entity: iPhone X
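
One possible way to handle an entity the model missed is to create the span yourself and append it to `doc.ents`. The sketch below is only an illustration: it uses a blank English pipeline (so `doc.ents` starts empty), and the choice of the `PRODUCT` label is an assumption.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")  # no NER component, so doc.ents starts empty
doc = nlp_blank("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

# Tokens 1 and 2 are "iPhone" and "X"; the end index 3 is exclusive
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Keep any existing entities and add the new one
doc.ents = list(doc.ents) + [iphone_x]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Entity spans must not overlap, so with a real model you would first check that the new span does not collide with a predicted entity.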


# Sentences (usually needs the dependency parser)

In [26]:
doc = nlp("This a sentence. This is another one.")
# doc.sents is a generator that yields sentence spans
[sent.text for sent in doc.sents]

['This a sentence.', 'This is another one.']
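
When no dependency parser is available, sentence boundaries can also come from spaCy's rule-based `sentencizer` component. This sketch assumes the spaCy v3 way of adding a built-in component by its string name; the `nlp_rules` variable is just an illustrative name.

```python
import spacy

# Blank pipeline plus the punctuation-based sentence splitter
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")  # spaCy v3-style call (assumption)

doc = nlp_rules("This is a sentence. This is another one.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```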

# Base noun phrases (needs the tagger and parser)

In [27]:
doc = nlp("I have a brown car")
# doc.noun_chunks is a generator that yields spans
[chunk.text for chunk in doc.noun_chunks]

['I', 'a brown car']

# Label explanations

In [28]:
spacy.explain("NN")

'noun, singular or mass'

In [29]:
spacy.explain("GPE")

'Countries, cities, states'

# Visualizing

* For Jupyter notebook, use `displacy.render`. 
* Otherwise, use `displacy.serve` to start a web server and show the visualization in your browser.
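
Outside a notebook, `displacy.render` can also return the markup as a string when called with `jupyter=False`, which is handy if you want to save it to a file rather than start a server. The sketch below is a hedged illustration: it sets the entities by hand on a blank pipeline instead of relying on a model's predictions.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
doc = nlp_blank("Bill Gates founded Microsoft")

# Entities are set manually here because a blank pipeline has no NER
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 3, 4, label="ORG")]

# jupyter=False forces render to return the HTML instead of displaying it
html = displacy.render(doc, style="ent", jupyter=False)
print(len(html), "characters of HTML")
```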

In [30]:
from spacy import displacy

## Visualize dependencies

In [31]:
doc = nlp("I live in Guwahati, Assam")
displacy.render(doc, style="dep")

## Visualize named entities

In [32]:
doc = nlp("Bill Gates founded Microsoft")
displacy.render(doc, style="ent")

# Word vectors and similarity

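- To use word vectors, install one of the larger models whose names end in `md` or `lg`, for example `en_core_web_md`. The small `sm` models do not ship with word vectors, so similarity scores computed with them are less reliable.

## Comparing similarity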

In [33]:
doc1 = nlp("I like cats")
doc2 = nlp("I like dogs")
# Compare 2 documents
doc1.similarity(doc2)

0.9494250928790803

In [34]:
# Compare 2 tokens
doc1[2].similarity(doc2[2])

0.85979897

In [35]:
# Compare tokens and spans
doc1[0].similarity(doc2[1:3])

0.055946942
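
By default, spaCy's `similarity` methods estimate similarity as the cosine of the angle between two vectors (for a `Doc` or `Span`, the average of its token vectors). The sketch below reproduces that computation in plain Python on toy vectors; the vectors and the `cosine_similarity` helper are made up for illustration and are not spaCy output.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))  # L2 norm of vec_a
    norm_b = math.sqrt(sum(b * b for b in vec_b))  # L2 norm of vec_b
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "word vectors" (hypothetical values)
same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 and unrelated (orthogonal) vectors score 0, which is why "cats" and "dogs" score much higher against each other than "I" does against "like dogs".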

# Accessing word vectors

In [36]:
doc = nlp("I like cats")
# Vector as a numpy array
doc[2].vector

array([ 3.612486  , -0.00959456,  0.9075469 , -3.7867062 ,  2.4655213 ,
        0.13715684,  2.4155567 , -2.5730793 , -2.867982  ,  3.2525606 ,
       -1.2344294 ,  1.8056613 , -1.9664278 , -0.7356216 , -2.7180357 ,
       -0.6440704 , -2.8764105 ,  3.0591764 , -1.2157367 ,  1.2292783 ,
       -3.8869352 ,  1.5334756 , -0.42812008, -1.6634891 , -0.71992075,
       -0.9405132 , -0.92217237,  1.4494174 ,  0.8063313 , -1.9848629 ,
        1.463424  , -4.440749  , -2.0363843 ,  1.969531  ,  2.132836  ,
       -0.06996736,  1.0586739 ,  0.72142375, -2.455808  ,  0.6979033 ,
       -3.979558  ,  5.9907084 ,  0.46196795, -1.6138006 ,  6.1429152 ,
       -0.6601538 ,  0.05084178, -1.2918823 , -0.60123396,  0.5269669 ,
       -3.8541064 ,  1.5263548 ,  0.60274506, -0.57671547,  0.90186244,
        1.1323965 , -0.22235966,  0.38049865, -0.29168057,  2.2304847 ,
       -2.0556376 ,  4.193283  ,  4.0678716 , -1.9232148 ,  1.2723014 ,
        8.359107  , -1.0821431 ,  3.8314052 , -1.5646377 ,  2.12

In [37]:
# The L2 norm of the token's vector
doc[2].vector_norm

23.784721

# Pipeline components

Pipeline components are functions that take a `Doc` object, modify it and return it.

## Pipeline information

In [38]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tagger', 'parser', 'ner']

In [39]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7fd3ff2a7358>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7fd3ff403108>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7fd3ff403168>)]

## Custom components

In [40]:
# Function that modifies the doc and returns it
def custom_component(doc):
    print("Do something to the doc here!")
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

Components can be added `first`, `last` (the default), or `before` or `after` an existing component.
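
The cells in this notebook pass the function object straight to `nlp.add_pipe`, which is the spaCy v2 style. In spaCy v3 and later, a component is registered under a name first and then added by that name. The following is a minimal sketch under that assumption; the component names `log_length` and `log_text` are hypothetical.

```python
import spacy
from spacy.language import Language


@Language.component("log_length")
def log_length(doc):
    # Report how many tokens the doc has, then pass it on unchanged
    print("Tokens in doc:", len(doc))
    return doc


@Language.component("log_text")
def log_text(doc):
    print("Processing:", doc.text)
    return doc


nlp_custom = spacy.blank("en")
nlp_custom.add_pipe("log_length")                     # added last (default)
nlp_custom.add_pipe("log_text", before="log_length")  # placed before it

print(nlp_custom.pipe_names)
doc = nlp_custom("I live in Guwahati")
```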

# Extension attributes

Custom attributes are registered on the global `Doc`, `Token` and `Span` classes and become available under the `._` namespace.

In [41]:
from spacy.tokens import Doc, Token, Span
doc = nlp("The sky over Guwahati is blue")

Do something to the doc here!


# Attribute extensions (with default value)

In [42]:
# Register custom attribute on Token class
Token.set_extension("is_color", default=False)
# Overwrite extension attribute with default value
doc[5]._.is_color = True

# Property extensions (with getter & setter)

In [43]:
# Register custom attribute on Doc class
get_reversed = lambda doc: doc.text[::-1]
Doc.set_extension("reversed", getter=get_reversed)
# Compute value of extension attribute with getter
doc._.reversed

'eulb si itahawuG revo yks ehT'

# Method extensions (callable method)

In [44]:
# Register custom attribute on Span class
has_label = lambda span, label: span.label_ == label
Span.set_extension("has_label", method=has_label)
# Compute value of extension attribute with method
doc[3:5]._.has_label("GPE")

False

# Rule-based matching
## Using the Matcher

In [45]:
from spacy.matcher import Matcher
# Matcher is initialized with the shared vocab
matcher = Matcher(nlp.vocab)
# Each dict represents one token and its attributes
pattern = [{"LOWER": "new"}, {"LOWER": "york"}]
# Add with ID, optional callback and pattern(s)
matcher.add('CITIES', None, pattern)
# Match by calling the matcher on a Doc object
doc = nlp("I live in New York")
matches = matcher(doc)
# Matches are (match_id, start, end) tuples
for match_id, start, end in matches:
    # Get the matched span by slicing the Doc
    span = doc[start:end]
    print(span.text)

Do something to the doc here!
New York
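
The `matcher.add(name, None, pattern)` call used in this notebook is the older signature. In spaCy v3 the patterns are passed as a list and the callback is an optional keyword argument. Here is the same match as a hedged sketch in that style, on a blank pipeline (the `LOWER` attribute needs no statistical model):

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")
matcher_v3 = Matcher(nlp_blank.vocab)

# One dict per token; patterns are passed as a list of patterns
pattern = [{"LOWER": "new"}, {"LOWER": "york"}]
matcher_v3.add("CITIES", [pattern])

doc = nlp_blank("I live in New York")
matches = matcher_v3(doc)

# match_id is a hash; the string store turns it back into the rule name
matched = [
    (nlp_blank.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matches
]
print(matched)
```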


In [46]:
doc = nlp("After making the iOS update you won't notice a radical system-wide redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of iOS 11's furniture remains the same as in iOS 10. But you will discover some tweaks once you delve a little deeper.")

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{'TEXT': 'iOS'}, {'IS_DIGIT': True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('IOS_VERSION_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Do something to the doc here!
Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [47]:
doc = nlp("i downloaded Fortnite on my laptop and can't open the game at all. Help? so when I was downloading Minecraft, I got the Windows version where it is the '.zip' folder and I used the default program to unpack it... do I also need to download Winzip?")

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{'LEMMA': 'download'}, {'POS': 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('DOWNLOAD_THINGS_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Do something to the doc here!
Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [48]:
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")

# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Do something to the doc here!
Total matches found: 4
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice responses


## Token patterns

In [49]:
# "love cats", "loving cats", "loved cats"
pattern1 = [{"LEMMA": "love"}, {"LOWER": "cats"}]
# "10 people", "twenty people"
pattern2 = [{"LIKE_NUM": True}, {"TEXT": "people"}]
# "book", "a cat", "the sea" (optional article + noun)
pattern3 = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]

## Operators and quantifiers

Operators and quantifiers can be added to a token dict under the `"OP"` key.

| OP | Description |
|---|---|
| `!` | Negate the pattern and match exactly 0 times. |
| `?` | Make the pattern optional and match 0 or 1 times. |
| `+` | Require the pattern to match 1 or more times. |
| `*` | Allow the pattern to match 0 or more times. |
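
A quantifier can make one pattern match several overlapping spans, and how many are returned has differed between spaCy versions. The sketch below, which assumes the v3-style `matcher.add` call and a blank pipeline, uses `+` and then keeps only the longest non-overlapping span with `spacy.util.filter_spans`.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp_blank = spacy.blank("en")
matcher_op = Matcher(nlp_blank.vocab)

# One or more "very" tokens followed by "good"
pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]
matcher_op.add("VERY_GOOD", [pattern])

doc = nlp_blank("The food was very very good")
spans = [doc[start:end] for _, start, end in matcher_op(doc)]

# Drop overlapping spans, preferring the longest one
longest_texts = [span.text for span in filter_spans(spans)]
print(longest_texts)
```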

## Glossary
| Name | Description |
|---|---|
| Tokenization | Segmenting text into words, punctuation etc. |
| Lemmatization | Assigning the base forms of words, for example: "was" → "be" or "rats" → "rat". |
| Sentence Boundary Detection | Finding and segmenting individual sentences. |
| Part-of-speech (POS) Tagging | Assigning word types to tokens like verb or noun. |
| Dependency Parsing | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| Named Entity Recognition (NER) | Labeling named "real-world" objects, like persons, companies or locations. |
| Text Classification | Assigning categories or labels to a whole document, or parts of a document. |
| Statistical model | Process for making predictions based on examples. |
| Training | Updating a statistical model with new examples. |

Note: This kernel is based on the DataCamp article [spaCy Cheat Sheet: Advanced NLP in Python](https://www.datacamp.com/community/blog/spacy-cheatsheet).
