### What's spaCy?

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What products and companies are mentioned in the text? Which texts are similar to each other?

spaCy is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language processing systems, or to pre-process text for deep learning.

### Installation

spaCy is compatible with 64-bit CPython 2.7/3.5+ and runs on Unix/Linux, macOS/OS X and Windows. The latest version of spaCy is available via pip and conda.

--> Installation with pip on Linux, Windows and macOS/OS X, for Python 2.7 and 3.5+:

 pip install -U spacy

(-U upgrades an existing installation; a plain pip install spacy also works.)

--> Installation with conda on Linux, Windows and macOS/OS X, for Python 2.7 and 3.5+:

 conda install -c conda-forge spacy

## Features

Here, you'll come across mentions of spaCy's features and capabilities. 

### Statistical models

Some of spaCy's features work independently, while others require statistical models to be loaded, which enable spaCy to **predict** linguistic annotations, for example whether a word is a verb or a noun. spaCy currently offers statistical models for a variety of languages, which can be installed as individual Python modules. Models can differ in size, speed, memory usage, accuracy, and the data they include. The model you choose depends on your use case and the texts you're working with. For a general use case, the small, default models are a good place to start. They typically include the following components:

  - **Binary weights** for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.
  
  - **Lexical entries** in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.
  
  - **Data files** like lemmatization rules and lookup tables.
  - **Word vectors**,  i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
  - **Configuration** options, like the language and processing pipeline settings, to put spaCy in the correct state when you load in the model.
  
  
### Linguistic annotations

spaCy provides a variety of linguistic annotations to give you **insights into a text’s grammatical structure.** This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you’re analyzing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether “google” is used as a verb, or refers to the website or company in a specific context.

Once you’ve [downloaded and installed](https://spacy.io/usage/models) a model, you can load it via spacy.load(). This will return a *Language* object containing all components and data needed to process text. We usually call it *nlp*. Calling the *nlp* object on a string of text will return a processed *Doc*:

In [3]:
# https://spacy.io/usage/linguistic-features

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Text: The original word text.
# Lemma: The base form of the word.
# POS: The simple part-of-speech tag.
# Tag: The detailed part-of-speech tag.
# Dep: Syntactic dependency, i.e. the relation between tokens.
# Shape: The word shape – capitalization, punctuation, digits.
# is alpha: Is the token an alpha character?
# is stop: Is the token part of a stop list, i.e. the most common words of the language?

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


Even though a Doc is processed (e.g. split into individual words and annotated), it still holds all information of the original text, like whitespace characters. You can always get the offset of a token into the original string, or reconstruct the original by joining the tokens and their trailing whitespace. This way, you’ll never lose any information when processing text with spaCy.
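
The round trip described here can be sketched without spaCy. The snippet below is a toy illustration, not spaCy's tokenizer: the `tokenize_with_ws` helper and its regular expression are assumptions made for this example. It stores each token's text, character offset and trailing whitespace, which is enough to rebuild the original string exactly. In spaCy itself, the corresponding token attributes are `token.idx` and `token.text_with_ws`.

```python
import re

def tokenize_with_ws(text):
    """Toy tokenizer: return (token_text, start_offset, trailing_whitespace) triples.

    Assumes the text does not start with whitespace.
    """
    # \S+ grabs a run of non-space characters, \s* the whitespace that follows it.
    return [(m.group(1), m.start(1), m.group(2))
            for m in re.finditer(r"(\S+)(\s*)", text)]

text = "Apple is  looking at buying U.K. startup"
tokens = tokenize_with_ws(text)

# Every stored offset points back into the original string.
for tok, start, _ in tokens:
    assert text[start:start + len(tok)] == tok

# Joining each token with its trailing whitespace restores the input,
# including the double space after "is".
rebuilt = "".join(tok + ws for tok, _, ws in tokens)
print(rebuilt == text)   # True
```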

### Tokenization
During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token. Each Doc consists of individual tokens, and we can iterate over them:

In [4]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
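
To make the idea of language-specific rules concrete, here is a minimal sketch of a rule-based tokenizer. It is not spaCy's implementation: the `EXCEPTIONS` set, the `SUFFIXES` string and the `simple_tokenize` helper are hypothetical names chosen for this example. It splits on whitespace, keeps listed exceptions such as “U.K.” intact, and splits trailing punctuation off everything else. It does not handle prefixes, so it would not separate “$” from “1” the way the output above does.

```python
EXCEPTIONS = {"U.K.", "U.S."}   # strings that must stay a single token
SUFFIXES = ".,!?;:"             # punctuation to split off the end of a word

def simple_tokenize(text):
    tokens = []
    for chunk in text.split():
        trailing = []
        # Peel punctuation off the end, one character at a time,
        # stopping as soon as what is left is a known exception.
        while len(chunk) > 1 and chunk[-1] in SUFFIXES and chunk not in EXCEPTIONS:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.append(chunk)
        tokens.extend(trailing)
    return tokens

print(simple_tokenize("Apple is looking at buying a U.K. startup."))
# ['Apple', 'is', 'looking', 'at', 'buying', 'a', 'U.K.', 'startup', '.']
```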



## Part-of-speech (POS) tags and dependencies

After tokenization, spaCy can parse and tag a given *Doc*. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.
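
The intuition that a word after “the” is most likely a noun can be illustrated with simple counting. This is only a toy sketch: the tiny hand-tagged corpus below is made up for this example, `predict_tag_after` is a hypothetical helper, and spaCy's models are far richer than a frequency table over the previous word.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training data: (word, part-of-speech) pairs.
tagged_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"), ("runs", "VERB")],
    [("she", "PRON"), ("sees", "VERB"), ("the", "DET"), ("house", "NOUN")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

# Count which tag follows each word in the training data.
next_tag_counts = defaultdict(Counter)
for sentence in tagged_corpus:
    for (word, _), (_, next_tag) in zip(sentence, sentence[1:]):
        next_tag_counts[word][next_tag] += 1

def predict_tag_after(word):
    """Return the tag most often seen after `word`, or None if the word is unseen."""
    counts = next_tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_tag_after("the"))   # NOUN (3 of the 4 examples)
```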

Linguistic annotations are available as **Token** attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name:

In [5]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Coronavirus Coronavirus PROPN NNP ROOT Xxxxx True False
: : PUNCT : punct : False False
Delhi Delhi PROPN NNP compound Xxxxx True False
resident resident NOUN NN nsubj xxxx True False
tests test NOUN NNS appos xxxx True False
positive positive ADJ JJ amod xxxx True False
for for ADP IN prep xxx True True
coronavirus coronavirus NOUN NN pobj xxxx True False
, , PUNCT , punct , False False
total total ADJ JJ acl xxxx True False
31 31 NUM CD nummod dd False False
people people NOUN NNS dobj xxxx True False
infected infect VERB VBN acl xxxx True False
in in ADP IN prep xx True True
India India PROPN NNP pobj Xxxxx True False


Using spaCy’s built-in **displaCy** visualizer, here’s what our example sentence and its dependencies look like:

### Syntactic Parsing using spaCy

In [7]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google, Apple crack down on fake coronavirus apps")
displacy.serve(doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [15]:
import spacy
from nltk import Tree


en_nlp = spacy.load('en_core_web_sm')

doc = en_nlp("The quick brown fox jumps over the lazy dog.")

def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    else:
        return node.orth_


[to_nltk_tree(sent.root).pretty_print() for sent in doc.sents]

        jumps                    
  ________|______________         
 |        |             over     
 |        |              |        
 |       fox            dog      
 |    ____|_____      ___|____    
 .  The quick brown the      lazy



[None]
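
If NLTK is not available, the same recursion can be tried on plain lists. The sketch below hard-codes head indices read off the tree printed above, so they are an assumption about the parse rather than output from spaCy, and it follows spaCy's convention that the root token is its own head. `to_nested` is a hypothetical helper name.

```python
tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
# heads[i] is the index of token i's syntactic head; the root points to itself.
heads = [3, 3, 3, 4, 4, 4, 8, 8, 5, 4]

def to_nested(i):
    """Return token i as a string (leaf) or as a (token, children) tuple."""
    children = [j for j, h in enumerate(heads) if h == i and j != i]
    if not children:
        return tokens[i]
    return (tokens[i], [to_nested(c) for c in children])

# The root is the only token whose head is itself.
root = next(i for i, h in enumerate(heads) if h == i)
print(to_nested(root))
# ('jumps', [('fox', ['The', 'quick', 'brown']), ('over', [('dog', ['the', 'lazy'])]), '.'])
```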

In [16]:
doc = nlp("Google, Apple crack down on fake coronavirus apps")
[to_nltk_tree(sent.root).pretty_print() for sent in doc.sents]

           crack                           
   __________|______________                
  |     |    |    |         on             
  |     |    |    |         |               
  |     |    |    |        apps            
  |     |    |    |     ____|________       
Google  ,  Apple down fake      coronavirus



[None]

In [18]:
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is a sentence.')
displacy.serve(doc, style='dep')




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [19]:
nlp = spacy.load("en_core_web_sm")
text = """In ancient Rome, some neighbors live in three adjacent houses. In the center is the house of Senex, who lives there with wife Domina, son Hero, and several slaves, including head slave Hysterium and the musical's main character Pseudolus. A slave belonging to Hero, Pseudolus wishes to buy, win, or steal his freedom. One of the neighboring houses is owned by Marcus Lycus, who is a buyer and seller of beautiful women; the other belongs to the ancient Erronius, who is abroad searching for his long-lost children (stolen in infancy by pirates). One day, Senex and Domina go on a trip and leave Pseudolus in charge of Hero. Hero confides in Pseudolus that he is in love with the lovely Philia, one of the courtesans in the House of Lycus (albeit still a virgin)."""
doc = nlp(text)
sentence_spans = list(doc.sents)
displacy.serve(sentence_spans, style="dep")


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


##### spaCy Rule-based Matching using Syntactic Parsing

In [22]:
# sample text
text = "GDP in developing countries such as Vietnam will continue growing at high rate."

# create a spaCy Doc object
doc = nlp(text)

# print each token's text, dependency label and POS tag
for tok in doc:
    print(tok.text,"-->",tok.dep_,"-->",tok.pos_)

GDP --> nsubj --> PROPN
in --> prep --> ADP
developing --> amod --> VERB
countries --> pobj --> NOUN
such --> amod --> ADJ
as --> prep --> SCONJ
Vietnam --> pobj --> PROPN
will --> aux --> VERB
continue --> ROOT --> VERB
growing --> xcomp --> VERB
at --> prep --> ADP
high --> amod --> ADJ
rate --> pobj --> NOUN
. --> punct --> PUNCT


##### Pattern: X such as Y

Have a look around the terms “such” and “as”. They are preceded by a noun (“countries”), and after them we have a proper noun (“Vietnam”) that acts as a hyponym.

So, let’s create the required pattern using the dependency tags and the POS tags:

In [24]:
#define the pattern 
pattern = [{'POS':'NOUN'}, 
           {'LOWER': 'such'}, 
           {'LOWER': 'as'}, 
           {'POS': 'PROPN'}] #proper noun

In [26]:
from spacy.matcher import Matcher 
from spacy.tokens import Span

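# Matcher class object 
matcher = Matcher(nlp.vocab) 
# spaCy v2 signature; in spaCy v3 this call is matcher.add("matching_1", [pattern])
matcher.add("matching_1", None, pattern) 

# each match is a (match_id, start, end) tuple of token indices
matches = matcher(doc) 
span = doc[matches[0][1]:matches[0][2]] 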

print(span.text)

countries such as Vietnam
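
To see what the Matcher does with such a pattern, here is a stripped-down sketch in plain Python. It is not spaCy's implementation: it supports only the `POS` and `LOWER` keys, has no operators, and the `token_matches` and `find_matches` helpers are hypothetical. The token list copies the text and POS columns from the parse printed earlier in this section.

```python
# (text, POS) pairs copied from the parse printed above.
tokens = [("GDP", "PROPN"), ("in", "ADP"), ("developing", "VERB"),
          ("countries", "NOUN"), ("such", "ADJ"), ("as", "SCONJ"),
          ("Vietnam", "PROPN"), ("will", "VERB"), ("continue", "VERB"),
          ("growing", "VERB"), ("at", "ADP"), ("high", "ADJ"),
          ("rate", "NOUN"), (".", "PUNCT")]

pattern = [{"POS": "NOUN"}, {"LOWER": "such"}, {"LOWER": "as"}, {"POS": "PROPN"}]

def token_matches(token, spec):
    """Check one token against one pattern slot (only POS and LOWER are supported)."""
    text, pos = token
    attrs = {"POS": pos, "LOWER": text.lower()}
    return all(attrs[key] == value for key, value in spec.items())

def find_matches(tokens, pattern):
    """Return (start, end) index pairs where every pattern slot matches in order."""
    n = len(pattern)
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if all(token_matches(t, s) for t, s in zip(tokens[i:i + n], pattern))]

for start, end in find_matches(tokens, pattern):
    print(" ".join(text for text, _ in tokens[start:end]))
# countries such as Vietnam
```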


In [27]:

# Matcher class object
matcher = Matcher(nlp.vocab)

#define the pattern
pattern = [{'DEP':'amod', 'OP':"?"}, # adjectival modifier
           {'POS':'NOUN'},
           {'LOWER': 'such'},
           {'LOWER': 'as'},
           {'POS': 'PROPN'}]

matcher.add("matching_1", None, pattern)
matches = matcher(doc)

span = doc[matches[0][1]:matches[0][2]]
print(span.text)

developing countries such as Vietnam


###### Subtree Matching for Relation Extraction

We have to be extremely creative to come up with new rules to capture different patterns. It is difficult to build patterns that generalize well across different sentences.

To enhance the rule-based methods for relation/information extraction, we should try to understand the dependency structure of the sentences at hand.

In [29]:
text = "Tableau was recently acquired by Salesforce." 

# Plot the dependency graph 
doc = nlp(text) 
displacy.render(doc, style='dep',jupyter=True)

In [30]:
def subtree_matcher(doc):
    x = ''
    y = ''

    # iterate through all the tokens in the input sentence
    for tok in doc:
        # extract the passive subject (e.g. nsubjpass);
        # str.find() returns an index, so test membership with "in" instead
        if "subjpass" in tok.dep_:
            y = tok.text

        # extract the object (e.g. pobj, dobj)
        if tok.dep_.endswith("obj"):
            x = tok.text

    return x, y

In [31]:
subtree_matcher(doc)

('Salesforce', 'Tableau')

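Here, the sentence is in the passive voice, so the passive subject (“Tableau”) is the entity that is getting acquired and the object of “by” (“Salesforce”) is the acquirer. The function returns them as (acquirer, acquired). Let’s use the same function, subtree_matcher(), to extract entities related by the same relation (“acquired”):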

In [32]:
text_2 = "Careem, a ride hailing major in middle east, was acquired by Uber." 

doc_2 = nlp(text_2) 
subtree_matcher(doc_2)

('Uber', 'Careem')

###### Named Entities 

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Delhi 13 18 GPE
31 66 68 CARDINAL
India 88 93 GPE
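
The `start_char` and `end_char` values are character offsets into the original string, so slicing the text with them gives back the entity. The check below needs no model: it reuses the sentence and the three rows printed above, so the entity tuples are copied from that output rather than predicted here.

```python
text = ("Coronavirus: Delhi resident tests positive for coronavirus, "
        "total 31 people infected in India")

# (entity text, start_char, end_char, label) copied from the output above.
entities = [("Delhi", 13, 18, "GPE"),
            ("31", 66, 68, "CARDINAL"),
            ("India", 88, 93, "GPE")]

for ent_text, start, end, label in entities:
    # end_char is exclusive, like the end of a Python slice.
    print(repr(text[start:end]), label, text[start:end] == ent_text)
```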


In [21]:
text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Vocab, hashes and lexemes

Whenever possible, spaCy tries to store data in a vocabulary, the Vocab, that will be shared by multiple documents. To save memory, spaCy also encodes all strings to hash values – in this case for example, “coffee” has the hash 3197928453018144401. Entity labels like “ORG” and part-of-speech tags like “VERB” are also encoded. Internally, spaCy only “speaks” in hash values.

<img src=".\Images\20.png">

If you process lots of documents containing the word “coffee” in all kinds of different contexts, storing the exact string “coffee” every time would take up way too much space. So instead, spaCy hashes the string and stores it in the StringStore. You can think of the StringStore as a **lookup table that works in both directions** – you can look up a string to get its hash, or a hash to get its string:



In [11]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'

3197928453018144401
coffee


In [12]:
# Let's check with other words like 'tea'
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love tea, over coffee")
print(doc.vocab.strings["tea"]) # 6041671307218480733
print(doc.vocab.strings[6041671307218480733])

6041671307218480733
tea
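
A two-way lookup table of this kind can be sketched in a few lines. This is a conceptual model only, not spaCy's StringStore: the `ToyStringStore` class is hypothetical, and it derives a 64-bit value with `hashlib.blake2b`, so its hash values will not match the ones printed above.

```python
import hashlib

class ToyStringStore:
    """Maps strings to 64-bit integer hashes and back."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_string(string):
        # Deterministic 64-bit hash (8-byte digest); an arbitrary choice for this sketch.
        digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, string):
        key = self.hash_string(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, item):
        # str -> hash can always be computed; int -> str needs a stored entry.
        if isinstance(item, str):
            return self.hash_string(item)
        return self._by_hash[item]

store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store["coffee"] == coffee_hash)   # True
print(store[coffee_hash])               # coffee
```

In this sketch, a hash can only be turned back into a string if that string was added to the store first; looking up an unknown hash raises a `KeyError`.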


Now that all strings are encoded, the entries in the vocabulary **don’t need to include the word text** themselves. Instead, they can look it up in the StringStore via its hash value. Each entry in the vocabulary, also called *Lexeme*, contains the context-independent information about a word. For example, no matter if “love” is used as a verb or a noun in some context, its spelling and whether it consists of alphabetic characters won’t ever change. Its hash value will also always be the same.

In [13]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love tea, over coffee!")
for word in doc:
    lexeme = doc.vocab[word.text]
    # print(lexeme)
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
            lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)

I 4690420944186131903 X I I True False True en
love 3702023516439754181 xxxx l ove True False False en
tea 6041671307218480733 xxx t tea True False False en
, 2593208677638477497 , , , False False False en
over 5456543204961066030 xxxx o ver True False False en
coffee 3197928453018144401 xxxx c fee True False False en
! 17494803046312582752 ! ! ! False False False en
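
Because these attributes depend only on the string, they can be approximated with plain string operations. The sketch below is a re-creation for illustration, not spaCy's code: `word_shape` and `lexical_attrs` are hypothetical helpers, and the rules (runs of the same shape character capped at four, prefix as the first character, suffix as the last three) are inferred from the output above.

```python
def word_shape(text):
    """Map letters to X/x and digits to d, keeping at most 4 repeats in a row."""
    shape, last, run = [], "", 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        # Track how long the current run of identical shape characters is.
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)

def lexical_attrs(text):
    """Context-independent attributes computed from the string alone."""
    return {"shape": word_shape(text), "prefix": text[0], "suffix": text[-3:],
            "is_alpha": text.isalpha(), "is_digit": text.isdigit(),
            "is_title": text.istitle()}

for word in ["I", "love", "tea", ",", "coffee"]:
    print(word, lexical_attrs(word))
```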
