### Import and Load spaCy

- first, import spacy and load the `en` language model

In [5]:
import spacy
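# 'en' is a spaCy 2.x shortcut link; in spaCy 3.x, load the model by its full name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')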

### Tokenizing

- then, define the base string to process
- run the language model on the base string and store the output in a variable
    - this variable is a `Doc` object: a sequence of the tokens that make up the base string

In [9]:
# create input string
string_to_process = "Tea is healthy and calming, don't you think?"

# initialize document from the string
doc = nlp(string_to_process)

# check type of doc 
print(type(doc))

<class 'spacy.tokens.doc.Doc'>


***



- the individual tokens in the document can be accessed with a for loop

In [8]:
for token in doc:
    print(token)

Tea
is
healthy
and
calming
,
do
n't
you
think
?


***

### Text Pre-processing

- there are a few pre-processing techniques to improve how we model with words 

#### 1/ Lemmatizing

- the **lemma** of a word is its base form
    - e.g. **talk** is the lemma of the word **talking**
- when you lemmatize the word **walking**, you convert it to **walk**
- with a spaCy token, `token.lemma_` returns the lemma

#### 2/ Stop-Word removal 

- it is common practice to remove stop words
- stop words are words that occur frequently but carry little information, such as "the", "is" and "and"
- with a spaCy token, `token.is_stop` returns a boolean `True` if the token is a stopword (and `False` otherwise)

In [24]:
# let's check out the lemma and is_stop for each token in the document we created

print('Token \t\t\t Lemma \t\t\t Stopword')
print('-'*100)
for token in doc:
    print(f" {str(token)} \t\t\t {str(token.lemma_)} \t\t\t {str(token.is_stop)}")

Token 			 Lemma 			 Stopword
----------------------------------------------------------------------------------------------------
 Tea 			 tea 			 False
 is 			 be 			 True
 healthy 			 healthy 			 False
 and 			 and 			 True
 calming 			 calm 			 False
 , 			 , 			 False
 do 			 do 			 True
 n't 			 not 			 True
 you 			 -PRON- 			 True
 think 			 think 			 False
 ? 			 ? 			 False


- language data has a lot of noise mixed in with informative content

- In the sentence above, the important words are tea, healthy and calming 

- Removing stop words might help the predictive model focus on relevant words

- However, lemmatizing and dropping stopwords might result in your models performing worse
    - So you should treat this preprocessing as part of your hyperparameter optimization process
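
The filtering described above can be sketched in plain Python. This is a minimal illustration, not part of the original notebook: the rows are copied from the table printed earlier, and `is_punctuation` is a hypothetical helper introduced only for this example.

```python
# Rows copied from the table above: (token text, lemma, is_stop).
rows = [
    ("Tea", "tea", False),
    ("is", "be", True),
    ("healthy", "healthy", False),
    ("and", "and", True),
    ("calming", "calm", False),
    (",", ",", False),
    ("do", "do", True),
    ("n't", "not", True),
    ("you", "-PRON-", True),
    ("think", "think", False),
    ("?", "?", False),
]


def is_punctuation(text):
    # Hypothetical helper: a token with no letters or digits counts as punctuation.
    return not any(ch.isalnum() for ch in text)


# Keep the lemma of every token that is neither a stop word nor punctuation.
kept_lemmas = [
    lemma
    for text, lemma, is_stop in rows
    if not is_stop and not is_punctuation(text)
]
print(kept_lemmas)  # ['tea', 'healthy', 'calm', 'think']
```

On a real spaCy document, a comparable filter would be `[t.lemma_ for t in doc if not t.is_stop and not t.is_punct]`. Whether to apply it at all is the kind of choice worth comparing on validation data.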

### Pattern Matching

- a common NLP task is matching tokens or phrases within chunks of text or entire documents

- this can be done with regular expressions, but spaCy's matching capabilities tend to be easier to use

- to match individual tokens, create a `Matcher` 
- to match a list of items, use `PhraseMatcher` 
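
For the token-level case, a minimal `Matcher` sketch might look like the following. It assumes spaCy 2.2.2 or later (for the `add(key, patterns)` call style) and uses `spacy.blank("en")`, a tokenizer-only pipeline, so that no model download is needed. The label `"IPHONE_MODEL"`, the pattern, and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough here, because the
# pattern below only uses lexical attributes (LOWER, IS_DIGIT).
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# A pattern is a list of token descriptions, one dict per token:
# "iphone" in any casing, followed by a token made only of digits.
pattern = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
matcher.add("IPHONE_MODEL", [pattern])  # the label is arbitrary

doc = nlp("I upgraded from an iPhone 8 to the iPhone 11 last year.")

# Sort by start position so the result order is explicit.
found = [
    doc[start:end].text
    for match_id, start, end in sorted(matcher(doc), key=lambda m: m[1])
]
print(found)  # ['iPhone 8', 'iPhone 11']
```

Each dict in the pattern constrains one token, which is what makes `Matcher` suited to flexible token-level rules, while `PhraseMatcher` is the better fit for a fixed list of terms.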

In [25]:
## phrase matcher example
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

- The matcher is created using the vocabulary of your model
    - Here we're using the small English model you loaded earlier 
    - Setting attr='LOWER' will match the phrases on lowercased text
    - This provides case insensitive matching

- Next you create a list of terms to match in the text
    - The phrase matcher needs the patterns as document objects
    - The easiest way to get these is with a list comprehension using the `nlp` model
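
To see what `attr="LOWER"` changes, the sketch below compares a default `PhraseMatcher` (which matches on the exact token text) with a lowercased one. It assumes spaCy 2.2.2 or later and uses a blank English pipeline. The label `"PHONE"` and the sample sentence are illustrative only.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline, no model download

exact = PhraseMatcher(nlp.vocab)                  # default: compare exact token text
lowered = PhraseMatcher(nlp.vocab, attr="LOWER")  # compare lowercased token text

pattern = nlp("Galaxy Note")
exact.add("PHONE", [pattern])
lowered.add("PHONE", [pattern])

doc = nlp("my galaxy note still works")
exact_count = len(exact(doc))
lowered_count = len(lowered(doc))
print(exact_count, lowered_count)  # 0 1
```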

In [26]:
## set terms and create a Doc object for each one
terms = ['Galaxy Note', 'iPhone11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
## register the patterns under a single match id
matcher.add("TerminologyList", patterns)

- Then you create a document from the base-text to search 
    - then use the phrase matcher to find where the terms occur in the text

In [27]:
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.")

In [29]:
matches = matcher(text_doc)
print(matches)

[(3766102292120407359, 22, 24), (3766102292120407359, 30, 32), (3766102292120407359, 33, 35)]


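- Each match is a tuple of the match id and the start and end token positions of the phrase (the end position is exclusive).
- Only three of the four terms are found: the term `iPhone11` (no space) does not match `iPhone 11` in the text, which is split into two tokens.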
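
The raw tuples are easier to read once the match id is turned back into its string label and the token positions are used to slice the document. The sketch below is one way to do that. It assumes spaCy 2.2.2 or later, uses a blank English pipeline instead of the loaded model, and runs on a shortened version of the review text, so the token positions differ from the output above.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline, no model download
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["Galaxy Note", "iPhone11", "iPhone XS", "Google Pixel"]
matcher.add("TerminologyList", [nlp(term) for term in terms])

# Shortened version of the review text used above.
text_doc = nlp(
    "Photography tests pitting the iPhone 11 Pro against the "
    "Galaxy Note 10 Plus and last year's iPhone XS and Google Pixel 3."
)

results = []
for match_id, start, end in sorted(matcher(text_doc), key=lambda m: m[1]):
    label = nlp.vocab.strings[match_id]  # hash -> "TerminologyList"
    span = text_doc[start:end]           # slice of tokens, end exclusive
    results.append((label, span.text))
    print(label, start, end, span.text)
```

Slicing with `text_doc[start:end]` returns a `Span`, so the matched phrase keeps its original casing even though the comparison was done on lowercased text.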