- Installation
- Download pre-trained Models
- Using spaCy
- Tokenization
- Lemmatization
- Word Frequency
- Token Extracting / Removing / Transforming
- Token Attributes
- Sentence Segmentation
- Part of Speech Tagging
- Named Entity Recognition
- Dependency Parsing
- Word Vectors
- Sentence Similarity

# Installation 

- spaCy is compatible with 64-bit CPython 2.7 / 3.5+ 
- Runs on Unix/Linux, macOS/OS X and Windows
- The latest spaCy releases are available over pip and conda

In [1]:
# !pip install -U spacy

In [2]:
# !conda install -c conda-forge spacy

# Download pre-trained Models

The easiest way to download a model is via spaCy’s download command.  

It takes care of finding the best-matching model compatible with your spaCy installation.

In [3]:
# !python -m spacy download en_core_web_sm

spaCy’s models can also be installed as Python packages.

In [None]:
# With external URL
# !pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz

# With local file
# !pip install /Users/you/en_core_web_sm-2.2.5.tar.gz

Directory structure
└── en_core_web_md-2.2.0.tar.gz       # downloaded archive
    ├── meta.json                     # model meta data
    ├── setup.py                      # setup file for pip installation
    └── en_core_web_md                # 📦 model package
        ├── __init__.py               # init for pip installation
        ├── meta.json                 # model meta data
        └── en_core_web_md-2.2.0      # model data

# Using spaCy

In [4]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [5]:
doc = nlp("This is a sentence.")

In [6]:
print([(w.text, w.pos_) for w in doc])

[('This', 'DET'), ('is', 'AUX'), ('a', 'DET'), ('sentence', 'NOUN'), ('.', 'PUNCT')]


# Tokenization

### Using whitespace

In [7]:
text = "Rare bird’s detection highlights promise of ‘environmental DNA’"

In [8]:
token_list = text.split()
for token in token_list:
    print(token)

Rare
bird’s
detection
highlights
promise
of
‘environmental
DNA’


### Using spaCy

In [9]:
doc = nlp(text)

In [10]:
for token in doc:
    print(token.text)

Rare
bird
’s
detection
highlights
promise
of
‘
environmental
DNA
’
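
The two outputs differ because whitespace splitting leaves quotation marks and clitics attached to the neighbouring word. As a rough illustration only (this is not spaCy's tokenizer, and the regular expression is an assumption chosen for this example), a pattern that separates word characters from punctuation gets closer to the spaCy result:

```python
import re

text = "Rare bird’s detection highlights promise of ‘environmental DNA’"

# \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character.
simple_tokens = re.findall(r"\w+|[^\w\s]", text)
for token in simple_tokens:
    print(token)
```

Unlike spaCy, this pattern splits `’s` into two separate tokens, which is one reason a rule-based tokenizer with language-specific exceptions is preferable in practice.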


# Lemmatization

Lemmatization reduces the inflected forms of a word to a single base form that is itself a valid word in the language (for example, the past tense and participle forms of a verb map to its base form). This base form is called a lemma.
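
To make the idea concrete, here is a toy lookup-based lemmatizer. The `LEMMA_TABLE` dictionary and the `toy_lemmatize` function are hypothetical names invented for this sketch; spaCy's own lemmatizer relies on much larger tables and rules, so treat this only as an illustration of the concept.

```python
# Hypothetical lookup table mapping inflected forms to their lemma.
LEMMA_TABLE = {
    "looks": "look",
    "looked": "look",
    "looking": "look",
    "has": "have",
    "been": "be",
    "bound": "bind",
}

def toy_lemmatize(word):
    """Return the lemma from the table, or the lowercased word if it is unknown."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)

for word in ["look", "looks", "looked", "looking"]:
    print("token:{} -> lemma:{}".format(word, toy_lemmatize(word)))
```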

In [11]:
text = "look, looks, looked and looking"

In [12]:
doc = nlp(text)

In [13]:
for token in doc:
    print("token:{} -> lemma:{}".format(token.text,token.lemma_ ))

token:look -> lemma:look
token:, -> lemma:,
token:looks -> lemma:look
token:, -> lemma:,
token:looked -> lemma:look
token:and -> lemma:and
token:looking -> lemma:looking


In [113]:
text = "Syracuse University has never been bound by convention or the status quo."

In [114]:
doc = nlp(text)

In [115]:
for token in doc:
    print("token:{} -> lemma:{}".format(token.text,token.lemma_ ))

token:Syracuse -> lemma:Syracuse
token:University -> lemma:University
token:has -> lemma:have
token:never -> lemma:never
token:been -> lemma:be
token:bound -> lemma:bind
token:by -> lemma:by
token:convention -> lemma:convention
token:or -> lemma:or
token:the -> lemma:the
token:status -> lemma:status
token:quo -> lemma:quo
token:. -> lemma:.


# Word Frequency

In [116]:
from collections import Counter

In [128]:
text = '''Since 1870, Syracuse University has upheld the idea that “brains and heart shall have a fair chance” at achieving a University degree. We were pioneers in recognizing women’s achievements, the academic validity of the fine arts, the emergence of information studies and the importance of educational opportunities that serve veterans. Our founding values are embedded in 150 years of educational innovation, expansion, discovery and change.

See a timeline of events in Syracuse University's 150 years of history.'''

In [129]:
doc = nlp(text)

In [130]:
words = [token.text for token in doc if not token.is_punct]

In [131]:
word_counter = Counter(words)

In [133]:
word_counter.most_common(5)

[('of', 6), ('the', 5), ('University', 3), ('and', 3), ('a', 3)]

In [134]:
word_counter["Syracuse"]

2
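
The counts above are case-sensitive, so `The` and `the` are tallied separately. A possible variation, sketched here with plain Python strings so that it runs without a model (the short `sample_words` list is made up for illustration), lowercases each word before counting. With spaCy tokens the same idea would use `token.lower_` instead of `str.lower()`.

```python
from collections import Counter

# Made-up word list standing in for the token texts of a Doc.
sample_words = ["The", "University", "the", "university", "of", "Syracuse"]

# Lowercase before counting so that case variants share one entry.
lower_counter = Counter(word.lower() for word in sample_words)
print(lower_counter.most_common(2))
```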

# Token Extracting / Removing / Transforming


| Attribute Name | Type | Description |
|:--------------------:|:---------:|:------------------------------------------------------------------------------------------------------------------------------------:|
| lemma              | int     | Base form of the token, with no inflectional suffixes.                                                                             |
| lemma_             | unicode | Base form of the token, with no inflectional suffixes.                                                                             |
| norm               | int     | The token’s norm, i.e. a normalized form of the token text. Usually set in the language’s tokenizer exceptions or norm exceptions. |
| norm_              | unicode | The token’s norm, i.e. a normalized form of the token text. Usually set in the language’s tokenizer exceptions or norm exceptions. |
| lower              | int     | Lowercase form of the token.                                                                                                       |
| lower_             | unicode | Lowercase form of the token text. Equivalent to Token.text.lower().                                                                |
| shape              | int     | Transform of the token’s string, to show orthographic features. For example, “Xxxx” or “dd”.                                       |
| shape_             | unicode | Transform of the token’s string, to show orthographic features. For example, “Xxxx” or “dd”.                                       |
| prefix             | int     | Hash value of a length-N substring from the start of the token. Defaults to N=1.                                                   |
| prefix_            | unicode | A length-N substring from the start of the token. Defaults to N=1.                                                                 |
| suffix             | int     | Hash value of a length-N substring from the end of the token. Defaults to N=3.                                                     |
| suffix_            | unicode | Length-N substring from the end of the token. Defaults to N=3.                                                                     |
| is_alpha           | bool    | Does the token consist of alphabetic characters? Equivalent to token.text.isalpha().                                               |
| is_ascii           | bool    | Does the token consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in token.text).                                   |
| is_digit           | bool    | Does the token consist of digits? Equivalent to token.text.isdigit().                                                              |
| is_lower           | bool    | Is the token in lowercase? Equivalent to token.text.islower().                                                                     |
| is_upper           | bool    | Is the token in uppercase? Equivalent to token.text.isupper().                                                                     |
| is_title           | bool    | Is the token in titlecase? Equivalent to token.text.istitle().                                                                     |
| is_punct           | bool    | Is the token punctuation?                                                                                                          |
| is_left_punct      | bool    | Is the token a left punctuation mark, e.g. (?                                                                                      |
| is_right_punct     | bool    | Is the token a right punctuation mark, e.g. )?                                                                                     |
| is_space           | bool    | Does the token consist of whitespace characters? Equivalent to token.text.isspace().                                               |
| is_bracket         | bool    | Is the token a bracket?                                                                                                            |
| is_quote           | bool    | Is the token a quotation mark?                                                                                                     |
| is_currency        | bool    | Is the token a currency symbol?                                                                                                    |
| like_url           | bool    | Does the token resemble a URL?                                                                                                     |
| like_num           | bool    | Does the token represent a number? e.g. “10.9”, “10”, “ten”, etc.                                                                  |
| like_email         | bool    | Does the token resemble an email address?                                                                                          |
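
Several of the boolean attributes in the table are described as equivalent to built-in string methods. The sketch below applies those plain-Python equivalents to a few sample strings; `string_flags` is a hypothetical helper written for this example and does not call spaCy.

```python
def string_flags(text):
    """Plain-Python checks that the table lists as equivalents of token flags."""
    return {
        "is_alpha": text.isalpha(),
        "is_ascii": all(ord(c) < 128 for c in text),
        "is_digit": text.isdigit(),
        "is_lower": text.islower(),
        "is_upper": text.isupper(),
        "is_title": text.istitle(),
        "is_space": text.isspace(),
    }

for sample in ["Syracuse", "150", "DNA", "–"]:
    # Keep only the names of the flags that are True for this string.
    true_flags = [name for name, value in string_flags(sample).items() if value]
    print(repr(sample), true_flags)
```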

### Extracting

Get all the punctuation marks

In [14]:
text='''The future opens wide for those who can master our ever-evolving,
data-driven world – come learn that mastery from the experts
at the Syracuse University iSchool!'''

In [15]:
punct_list = []
doc = nlp(text)
# Collect every token that spaCy flags as punctuation
for token in doc:
    if token.is_punct:
        punct_list.append(token)
for punct in punct_list:
    print(punct)

-
,
-
–
!


### Removing

Remove the stop words

In [16]:
from spacy.lang.en.stop_words import STOP_WORDS
spacy_stopwords = STOP_WORDS

In [17]:
len(spacy_stopwords)

326

In [18]:
list(spacy_stopwords)[:20]

['nine',
 'really',
 'anywhere',
 'move',
 'sometime',
 'give',
 'against',
 'per',
 'whereafter',
 'into',
 'nobody',
 'among',
 'ours',
 'neither',
 'everything',
 'third',
 'since',
 'very',
 'other',
 'always']

In [19]:
doc = nlp("The quick brown fox jumps over the lazy dog")

In [20]:
no_stop_words = [token for token in doc if not token.is_stop]

In [21]:
no_stop_words

[quick, brown, fox, jumps, lazy, dog]

### Transforming

Convert text to lowercase

In [140]:
doc = nlp("The quick brown fox jumps over the lazy dog")

In [141]:
lowercased = [ token.lower_ for token in doc]

In [142]:
lowercased

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

# Token Attributes

In [143]:
import pandas as pd

In [155]:
cols = ("text", "lemma_","is_punct", "is_stop", "is_alpha", "is_space", "lower_")

In [156]:
rows = [] 
for t in doc:
    row = [t.text, t.lemma_,  t.is_punct,  t.is_stop,  t.is_alpha,  t.is_space,  t.lower_]
    rows.append(row)
attri_pdf = pd.DataFrame(rows, columns=cols)

In [157]:
attri_pdf

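|   | text | lemma_ | is_punct | is_stop | is_alpha | is_space | lower_ |
|:---:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| 0 | The   | the   | False | True  | True | False | the   |
| 1 | quick | quick | False | False | True | False | quick |
| 2 | brown | brown | False | False | True | False | brown |
| 3 | fox   | fox   | False | False | True | False | fox   |
| 4 | jumps | jump  | False | False | True | False | jumps |
| 5 | over  | over  | False | True  | True | False | over  |
| 6 | the   | the   | False | True  | True | False | the   |
| 7 | lazy  | lazy  | False | False | True | False | lazy  |
| 8 | dog   | dog   | False | False | True | False | dog   |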


# Sentence Segmentation

In [25]:
text = '''In some cases, eDNA analyses are being used to enforce policy. For example, in 2014, the UK government approved the use of eDNA analysis for detecting the endangered great crested newt in land-use surveys that are required by law.'''

In [26]:
text

'In some cases, eDNA analyses are being used to enforce policy. For example, in 2014, the UK government approved the use of eDNA analysis for detecting the endangered great crested newt in land-use surveys that are required by law.'

In [27]:
doc = nlp(text)

In [28]:
for sent in doc.sents:
    print("start_pos={}, end_pos={}, text:{}".format(sent.start, sent.end, sent.text))

start_pos=0, end_pos=13, text:In some cases, eDNA analyses are being used to enforce policy.
start_pos=13, end_pos=46, text:For example, in 2014, the UK government approved the use of eDNA analysis for detecting the endangered great crested newt in land-use surveys that are required by law.
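
In this output, `sent.start` and `sent.end` are token indices, not character offsets. For comparison, a naive splitter that looks only at end-of-sentence punctuation can be sketched with a regular expression. This is a simple baseline under stated assumptions, not how spaCy segments sentences (its default pipeline relies on the dependency parse), and it would break on abbreviations such as "e.g.".

```python
import re

text = ("In some cases, eDNA analyses are being used to enforce policy. "
        "For example, in 2014, the UK government approved the use of eDNA "
        "analysis for detecting the endangered great crested newt in "
        "land-use surveys that are required by law.")

# Split on whitespace that follows ., ! or ? (the lookbehind keeps the mark).
naive_sents = re.split(r"(?<=[.!?])\s+", text)
for i, sent in enumerate(naive_sents):
    print(i, sent)
```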


# Part of Speech Tagging

In [136]:
doc = nlp("The quick brown fox jumps over the lazy dog")

In [137]:
for token in doc:
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

The DT DET determiner
quick JJ ADJ adjective
brown JJ ADJ adjective
fox NN NOUN noun, singular or mass
jumps NNS NOUN noun, plural
over IN ADP conjunction, subordinating or preposition
the DT DET determiner
lazy JJ ADJ adjective
dog NN NOUN noun, singular or mass


# Named Entity Recognition

| TAG | ID | DESCRIPTION                           |
|:-----:|:----:|:---------------------------------------:|
| I   | 1  | Token is inside an entity.            |
| O   | 2  | Token is outside an entity.           |
| B   | 3  | Token begins an entity.               |
|     | 0  | No entity tag is set (missing value). |
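
The IOB scheme in the table can be decoded into entity spans with a short loop. The sketch below is plain Python and does not call spaCy; `iob_to_spans` is a hypothetical helper, and the token list and tags are written by hand for illustration. Joining tokens with a space is also a simplification.

```python
def iob_to_spans(tokens, iob_tags, ent_types):
    """Group tokens into (entity text, label) pairs using B/I/O tags."""
    spans, current, label = [], [], None
    for token, iob, ent_type in zip(tokens, iob_tags, ent_types):
        if iob == "B" or (iob == "I" and not current):
            # A new entity starts; close the previous one if it is still open.
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], ent_type
        elif iob == "I":
            current.append(token)
        else:
            # "O" or a missing tag closes any open entity.
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["of", "Syracuse", "University", "’s", "150", "years", "to", "Chicago"]
iob_tags = ["O", "B", "I", "O", "B", "I", "O", "B"]
ent_types = ["", "ORG", "ORG", "", "DATE", "DATE", "", "GPE"]
print(iob_to_spans(tokens, iob_tags, ent_types))
```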

In [31]:
text='''We’re bringing the celebration of Syracuse University’s 150 years of impact to Chicago'''

In [32]:
print(text)

We’re bringing the celebration of Syracuse University’s 150 years of impact to Chicago


In [33]:
doc = nlp(text)

In [34]:
for ent in doc.ents:
    print("{}, [{},{}), {}".format(ent.text, ent.start_char, ent.end_char, ent.label_))
    for token in ent:  # a Span can be iterated directly, no copy needed
        print("    {} {} {}".format(token, token.ent_iob_, token.ent_type_))

Syracuse University, [34,53), ORG
    Syracuse B ORG
    University I ORG
150 years, [56,65), DATE
    150 B DATE
    years I DATE
Chicago, [79,86), GPE
    Chicago B GPE


- __text__ gives the Unicode text representation of the entity.
- __start_char__ denotes the character offset for the start of the entity.
- __end_char__ denotes the character offset for the end of the entity.
- __label___ gives the label of the entity.
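
The character offsets form half-open intervals `[start_char, end_char)`, so they can be used directly as Python slice bounds on the original text. A minimal check, with the offsets copied by hand from the output above rather than read from a `Doc`:

```python
text = "We’re bringing the celebration of Syracuse University’s 150 years of impact to Chicago"

# (start_char, end_char, label) triples copied from the printed entities.
entities = [(34, 53, "ORG"), (56, 65, "DATE"), (79, 86, "GPE")]

for start_char, end_char, label in entities:
    # Slicing excludes end_char, matching the half-open interval.
    print(text[start_char:end_char], label)
```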

In [35]:
from spacy import displacy

In [36]:
displacy.render(doc, style="ent")

# Dependency Parsing

In [37]:
doc = nlp("The quick brown fox jumps over the lazy dog")

In [38]:
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

The/DT <--det-- jumps/NNS
quick/JJ <--amod-- jumps/NNS
brown/JJ <--amod-- jumps/NNS
fox/NN <--compound-- jumps/NNS
jumps/NNS <--ROOT-- jumps/NNS
over/IN <--prep-- jumps/NNS
the/DT <--det-- dog/NN
lazy/JJ <--amod-- dog/NN
dog/NN <--pobj-- over/IN
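
Each printed line is an arc from a token to its head, and together the arcs form a tree rooted at the token whose head is itself. The sketch below rebuilds that tree in plain Python from triples copied by hand from the output above. Keying by token text works here only because no word form repeats (`The` and `the` differ in case); with a real `Doc`, the token index `token.i` would be a safer key.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the printed arcs.
arcs = [
    ("The", "det", "jumps"), ("quick", "amod", "jumps"),
    ("brown", "amod", "jumps"), ("fox", "compound", "jumps"),
    ("jumps", "ROOT", "jumps"), ("over", "prep", "jumps"),
    ("the", "det", "dog"), ("lazy", "amod", "dog"), ("dog", "pobj", "over"),
]

children = defaultdict(list)
root = None
for token, dep, head in arcs:
    if dep == "ROOT":
        root = token  # the root is its own head
    else:
        children[head].append((token, dep))

def print_tree(node, dep="ROOT", depth=0):
    """Print the tree top-down, indenting each level by two spaces."""
    print("{}{} ({})".format("  " * depth, node, dep))
    for child, child_dep in children[node]:
        print_tree(child, child_dep, depth + 1)

print_tree(root)
```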


In [39]:
from spacy import displacy

In [40]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

# Word Vectors

In [None]:
# !python -m spacy download en_core_web_lg

In [64]:
nlp_lg = spacy.load('en_core_web_lg')

In [65]:
apple = nlp_lg.vocab["apple"]
banana = nlp_lg.vocab["banana"]
car = nlp_lg.vocab["car"]
apple_banana_sim = apple.similarity(banana)
banana_apple_sim = banana.similarity(apple)
apple_car_sim = apple.similarity(car)

In [66]:
apple_banana_sim

0.5831844

In [67]:
banana_apple_sim

0.5831844

In [68]:
apple_car_sim

0.21747091

In [69]:
#from scipy.spatial.distance import cosine

In [84]:
import numpy as np
def cosine(x,y):
    return np.dot(x,y) / (np.sqrt(np.dot(x,x)) * np.sqrt(np.dot(y,y)))

In [80]:
man = nlp_lg.vocab["man"].vector
woman = nlp_lg.vocab["woman"].vector
king = nlp_lg.vocab["king"].vector
queen = nlp_lg.vocab["queen"].vector

In [85]:
cosine(king-man+woman,queen)

0.7880845
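
The `cosine` helper used here computes cosine similarity: the dot product of two vectors divided by the product of their lengths. The same formula can be written with the standard library only. The vectors below are toy 2-D values made up for illustration, not real word vectors, and `cosine_similarity` is a name chosen for this sketch.

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Toy 2-D vectors, made up for illustration.
u, v, w = [1.0, 0.0], [2.0, 0.0], [0.0, 3.0]

print(cosine_similarity(u, v))        # same direction
print(cosine_similarity(u, w))        # orthogonal vectors
print(1.0 - cosine_similarity(u, w))  # the corresponding cosine distance
```

Note that `scipy.spatial.distance.cosine` returns the cosine distance, that is, one minus this similarity, so the two functions are not interchangeable.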

# Sentence Similarity

By default, Token.vector returns the vector for its underlying Lexeme, while Doc.vector and Span.vector return an average of the vectors of their tokens.
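
Because the document vector is an average, word order has no effect on it, so two sentences built from the same words get essentially the same vector. A small sketch with made-up 2-D vectors shows the effect; the `toy_vectors` table and `mean_vector` helper are hypothetical, and real models use vectors with hundreds of dimensions.

```python
# Made-up 2-D word vectors, for illustration only.
toy_vectors = {"fox": [1.0, 0.0], "jumps": [0.0, 1.0], "dog": [1.0, 1.0]}

def mean_vector(words):
    """Average the word vectors component by component."""
    vectors = [toy_vectors[w] for w in words]
    return [sum(component) / len(vectors) for component in zip(*vectors)]

vec_a = mean_vector(["fox", "jumps", "dog"])
vec_b = mean_vector(["dog", "jumps", "fox"])
print(vec_a)
print(vec_b)
print(vec_a == vec_b)
```

With real floating-point vectors, the order of summation can introduce tiny rounding differences, which may be why a similarity between reordered sentences comes out as a value just below 1.0 rather than exactly 1.0.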

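In [60]:
# scipy.spatial.distance.cosine returns the cosine distance (1 - similarity),
# so the cosine() similarity helper defined under Word Vectors is used below.
# from scipy.spatial.distance import cosine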

In [96]:
doc1 = nlp_lg("The quick brown fox jumps over the lazy dog")

In [97]:
doc2 = nlp_lg("The lazy dog jumps over the quick brown fox")

In [98]:
doc1.similarity(doc1)

1.0

In [95]:
doc1.similarity(doc2)

0.9999999448211677

In [99]:
doc1_vec=doc1.vector

In [100]:
doc2_vec=doc2.vector

In [101]:
cosine(doc2_vec,doc1_vec)

1.0

In [103]:
# Note: these two docs use the small model (nlp), not nlp_lg
doc3 = nlp("I like snow")
doc4 = nlp("I hate snow")

In [105]:
cosine(doc3.vector,doc4.vector)

0.85214496