In [1]:
# Import the English language class
from spacy.lang.en import English

In [2]:
# Create the nlp object from the English class
nlp = English()

In [3]:
# A Doc is created by processing a string of text with the nlp object
doc = nlp("Hello World!")

# Iterate over the tokens in the doc
for token in doc:
    print(token.text)

Hello
World
!


#### The Token Object

In [4]:
# !Insert image here

In [5]:
# Index into the Doc to get a single Token
token = doc[1]
print(token)

# Get the token text via the .text attribute
print(token.text)

World
World


In [6]:
# Insert image here

In [7]:
# A slice of a Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

World!
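
As a small sketch of how a slice relates to its parent document, the snippet below inspects a span's boundaries. It assumes only a blank English pipeline (no model download); `span.start` and `span.end` are token indices into the parent doc, and the end index is exclusive, as in ordinary Python slicing.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello World!")

# Slice tokens 1 and 2; the end index 3 is exclusive
span = doc[1:3]

# Token offsets of the span within the parent doc
print(span.start, span.end)

# Each token in the span keeps its index in the parent doc
token_indices = [token.i for token in span]
print(token_indices)

# A span has a length, counted in tokens
print(len(span))
```

Because a span is a view onto the doc rather than a copy, indexing into it returns the same tokens the doc holds.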


#### Lexical Attributes

In [8]:
doc = nlp("This is me!")

In [9]:
print("Index:\t", [token.i for token in doc])
print("Text:\t", [token.text for token in doc])

print("is_alpha:\t", [token.is_alpha for token in doc])
print("is_punct:\t", [token.is_punct for token in doc])
print("is_digit:\t", [token.is_digit for token in doc])
print("is_lower:\t", [token.is_lower for token in doc])

Index:	 [0, 1, 2, 3]
Text:	 ['This', 'is', 'me', '!']
is_alpha:	 [True, True, True, False]
is_punct:	 [False, False, False, True]
is_digit:	 [False, False, False, False]
is_lower:	 [False, True, True, False]
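
Lexical attributes can be combined with token indices to find simple patterns without a trained model. The sketch below, assuming a blank English pipeline and a made-up example sentence, collects every number-like token that is directly followed by a percent sign.

```python
from spacy.lang.en import English

nlp = English()
# Made-up example sentence for illustration
doc = nlp("In 1990, more than 60% of people were affected. Now less than 4% are.")

percentages = []
for token in doc:
    # like_num is a lexical attribute: True for tokens that resemble a number
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        # The tokenizer splits "60%" into "60" and "%"
        if next_token.text == "%":
            percentages.append(token.text)

print(percentages)
```

The year `1990` is also number-like, but it is followed by a comma rather than a percent sign, so it is skipped.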


## Chapter 1
### Statistical models

##### What are statistical models?
* Enable spaCy to predict linguistic attributes in *context*
    * Part-of-speech tags
    * Syntactic dependencies
    * Named entities
    
* Trained on labeled example texts
* Can be updated with more examples to fine-tune predictions

#### Model packages
* spaCy provides a number of pre-trained model packages. For example, ```en_core_web_sm``` is a small English model that supports all core capabilities and is trained on web text.
<br/><br/>To download it, run: ```$ python -m spacy download en_core_web_sm```

In [10]:
# load the model 
import spacy
nlp = spacy.load("en_core_web_sm")

#### Predicting part-of-speech tags

In [11]:
# process the text
doc = nlp("Spacy provides the language supports to build the model.")

# Iterate over the tokens
for token in doc:
    # Attributes ending in an underscore return strings; without it they return integer IDs
    print(token.text, token.pos_)

Spacy PROPN
provides VERB
the DET
language NOUN
supports VERB
to PART
build VERB
the DET
model NOUN
. PUNCT
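
The underscore convention is not specific to `pos_`. As a hedged illustration that runs without a trained model, the sketch below uses the `orth` and `lower` attributes on a blank English pipeline: the plain attribute is an integer ID, and the underscore variant is the string that ID stands for.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello World!")
token = doc[0]

# Without the underscore: an integer ID (a hash of the string)
print(type(token.orth), type(token.lower))

# With the underscore: the human-readable string
print(token.orth_, token.lower_)

# The shared StringStore maps the integer ID back to the string
print(nlp.vocab.strings[token.orth] == token.orth_)
```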


#### Predicting Syntactic dependencies

In [12]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Spacy PROPN nsubj provides
provides VERB ROOT provides
the DET det language
language NOUN compound supports
supports VERB dobj provides
to PART aux build
build VERB advcl provides
the DET det model
model NOUN dobj build
. PUNCT punct provides


#### Label scheme pic

In [20]:
spacy.displacy.render(docs=nlp("This is me! Who are you?"), style='dep')

#### Predicting named entities

In [21]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


In [23]:
spacy.displacy.render(docs=doc, style='ent')

#### spacy.explain method

In [24]:
spacy.explain('GPE')

'Countries, cities, states'

In [25]:
spacy.explain('NNP')

'noun, proper singular'
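
`spacy.explain` is handy for decoding several labels at once. The sketch below looks up a handful of labels that appear in the outputs above, plus one made-up label to show the fallback; it relies on `spacy.explain` returning `None` for terms that are not in its glossary (newer versions may also emit a warning in that case).

```python
import spacy

# Labels taken from the outputs above, plus one made-up label
labels = ["PROPN", "nsubj", "dobj", "GPE", "MONEY", "NOT_A_REAL_LABEL"]

explanations = {}
for label in labels:
    # spacy.explain returns None when the glossary has no entry for the term
    explanations[label] = spacy.explain(label)

for label, description in explanations.items():
    print(f"{label:18}{description or '(no description available)'}")
```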

### Rule-based matching

#### Why not just regular expressions?
* Match on *Doc* objects, not just strings
* Match on tokens and token attributes
* Use the model's predictions
* For example: find "duck" (verb), not "duck" (noun)

#### Match patterns
* List of dictionaries, one per token
* Match exact token texts
```
[{'TEXT':"iphone"}, {'TEXT':"X"}]
```

* Match lexical attributes

```
[{'LOWER':"iphone"}, {'LOWER':"x"}]
```
* Match any token attributes

```
[{'LEMMA':"buy"}, {'POS':"NOUN"}]
```
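
To make the difference between `TEXT` and `LOWER` concrete, here is a minimal sketch that runs two patterns against the same made-up sentence. It assumes spaCy v3, where `Matcher.add` takes a list of patterns, and a blank English pipeline; `find_matches` is a hypothetical helper written for this example, not part of spaCy.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
doc = nlp("Upcoming iPhone X release date leaked")

def find_matches(pattern):
    """Return the texts of all spans in doc that match a single pattern."""
    matcher = Matcher(nlp.vocab)
    # spaCy v3 signature (assumed): patterns are passed as a list
    matcher.add("EXAMPLE", [pattern])
    return [doc[start:end].text for _, start, end in matcher(doc)]

# TEXT is case-sensitive, so lowercase "iphone" does not match "iPhone"
text_matches = find_matches([{"TEXT": "iphone"}, {"TEXT": "X"}])

# LOWER compares against the lowercased token text
lower_matches = find_matches([{"LOWER": "iphone"}, {"LOWER": "x"}])

print(text_matches)
print(lower_matches)
```

The `TEXT` pattern finds nothing, while the `LOWER` pattern matches `iPhone X`. `LOWER` is usually the safer choice when capitalization in the input is unpredictable.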

#### Using the Matcher

In [4]:
import spacy
from spacy.matcher import Matcher

# load the model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the vocab shared with the pipeline
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{'TEXT':"iphone"}, {'TEXT':"X"}]
# spaCy v2 signature; in spaCy v3, use matcher.add("IPHONE_PATTERN", [pattern])
matcher.add("IPHONE_PATTERN", None, pattern) # IPHONE_PATTERN is the unique ID

# Process some text (TEXT is case-sensitive, so the doc must contain "iphone X" exactly)
doc = nlp("Upcoming iphone X date leaked")

# Call the matcher on the doc
matches = matcher(doc) # returns a list of (match_id, start, end) tuples

# Iterate over the matches
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

#### Matching lexical attributes

In [6]:
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': "fifa"},
    {'LOWER': "world"},
    {'LOWER': "cup"},
    {'IS_PUNCT': True}
]

doc = nlp("2018 FIFA world cup: France won!")
matcher.add("FIFA", None, pattern)
matches = matcher(doc)

In [7]:
matches

[(851579294197118795, 0, 5)]

In [9]:
doc[matches[0][1]:matches[0][2]]

2018 FIFA world cup:
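
The large integer in each match tuple is the hash of the pattern name. The sketch below, assuming spaCy v3's `Matcher.add` signature and a blank English pipeline, resolves that hash back to the name through the string store and pairs it with the matched text.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},
]
# spaCy v3 signature (assumed): patterns are passed as a list
matcher.add("FIFA", [pattern])

doc = nlp("2018 FIFA world cup: France won!")

results = []
for match_id, start, end in matcher(doc):
    # match_id is the hash of the pattern name; the StringStore maps it back
    pattern_name = nlp.vocab.strings[match_id]
    results.append((pattern_name, doc[start:end].text))

print(results)
```

Resolving the name this way is useful once several patterns are registered on the same matcher and the matches need to be told apart.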

## Chapter 2

### Data Structures: Vocab, Lexemes, and StringStore

##### spaCy stores strings as hashes

In [12]:
coffee_hash = nlp.vocab.strings["coffee"]
coffee_hash
#coffee_string = nlp.vocab.strings[coffee_hash]

3197928453018144401
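
The lookup works in both directions. As a minimal sketch on a blank English pipeline, the snippet below first processes a text so that its strings are added to the store, then converts a string to its hash and back.

```python
from spacy.lang.en import English

nlp = English()
# Processing the text adds its strings to the shared StringStore
doc = nlp("I love coffee")

coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]
print(coffee_hash, coffee_string)

# The doc exposes the same vocab, so the lookup works from there too
print(doc.vocab.strings[coffee_hash])
```

Hashes cannot be reversed on their own, so the hash-to-string direction only works for strings the store has already seen.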

#### Lexemes: entries in vocabulary
* A *Lexeme* object is an entry in the vocabulary

In [13]:
doc = nlp("I love coffee")
lexeme = nlp.vocab["coffee"]

# print the lexeme attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


In [14]:
# insert pic

### Data Structures: Doc, Span, and Token

#### The Doc object

In [15]:
# create the nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class from spacy.tokens
from spacy.tokens import Doc

# The words to build the doc from, and whether each word is followed by a space
words = ["Hello","World","!"]
spaces = [True, False, False]

# Create the doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
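
The `spaces` list decides how the words are joined back into text. The sketch below builds a doc from a made-up sentence with punctuation and checks that the reconstructed text has spaces only where a `True` value asks for one.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Made-up example: the target text is "Go, get started!"
words = ["Go", ",", "get", "started", "!"]
# True means the word at the same position is followed by a space
spaces = [False, True, True, False, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# The tokens are exactly the words that were passed in
tokens = [token.text for token in doc]
print(tokens)
```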

#### The Span object

In [18]:
from spacy.tokens import Doc, Span

# Create a span manually
span = Span(doc, 0, 2)
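
A manually created span can also carry a label and be registered as an entity. In this sketch, `GREETING` is a made-up label chosen purely for illustration; the span is assigned to `doc.ents`, which is writable.

```python
from spacy.lang.en import English
from spacy.tokens import Doc, Span

nlp = English()
doc = Doc(nlp.vocab, words=["Hello", "World", "!"], spaces=[True, False, False])

# "GREETING" is a made-up label used only for illustration
span_with_label = Span(doc, 0, 2, label="GREETING")

# Register the span as an entity on the doc
doc.ents = [span_with_label]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```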

### Word Vectors and Semantic Similarity

#### Comparing semantic similarity
* spaCy can compare two objects and predict how similar they are
* ```Doc.similarity()```, ```Span.similarity()```, and ```Token.similarity()```
* Each takes another object and returns a similarity score, generally between ```0``` and ```1```
* **Important**: Needs a model that has word vectors included, for example:
    * - [x] ```en_core_web_md``` (medium model)
    * - [x] ```en_core_web_lg``` (large model)
    * - [ ] **NOT** ```en_core_web_sm``` (small model)
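
By default, spaCy computes similarity as the cosine similarity between word vectors. The sketch below reproduces that calculation in plain Python on made-up three-dimensional vectors; the vectors and the `cosine_similarity` helper are illustrative assumptions, not spaCy internals or real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for word vectors (real ones have hundreds of dimensions)
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

sim_same_direction = cosine_similarity(vec_a, vec_b)
sim_orthogonal = cosine_similarity(vec_a, vec_c)
print(sim_same_direction)
print(sim_orthogonal)
```

Vectors pointing the same way score close to 1 and unrelated (orthogonal) vectors score 0, regardless of their lengths. This is why a model without word vectors cannot produce meaningful similarity scores.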

In [1]:
import spacy
# Load the medium model, which includes word vectors
nlp = spacy.load('en_core_web_md')

# Create two docs
doc1 = nlp("I like apple")
doc2 = nlp("I like pizza")

# Doc similarity
print(doc1.similarity(doc2))

token1 = doc1[2]
token2 = doc2[2]

# token similarity
token1.similarity(token2)

0.8459288321499241


0.39022762