
In [0]:
# Import the spaCy library
import spacy

spaCy relies on language-specific models that come in different sizes. You can load a spaCy model with **spacy.load**.

In [0]:
# Load the English model. The 'en' shortcut is spaCy v2 only; spaCy v3 needs the full package name, e.g. 'en_core_web_sm'.
nlp = spacy.load('en')

In [0]:
# Process the text into a Doc object
doc = nlp("Tea is healthy and calming, don't you think?")

In [5]:
# We can do a lot with the doc object created above, starting with iterating over its tokens
for token in doc:
  print(token)

Tea
is
healthy
and
calming
,
do
n't
you
think
?


In [6]:
for token in doc:
  print(token.lemma_)  # lemma (base form) of each token

tea
be
healthy
and
calm
,
do
not
-PRON-
think
?


In [8]:
for token in doc:
  print(token.is_stop)  # True if the token is a stop word

False
True
False
True
False
False
True
True
True
False
False


In [9]:
# Show token, lemma and stop-word flag side by side
print("Token \t\tLemma \t\tStopword")
print("-"*40)
for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")

Token 		Lemma 		Stopword
----------------------------------------
Tea		tea		False
is		be		True
healthy		healthy		False
and		and		True
calming		calm		False
,		,		False
do		do		True
n't		not		True
you		-PRON-		True
think		think		False
?		?		False


**Lemmatizing** and **dropping stopwords** can make your models perform **worse**, so treat these preprocessing steps as part of your **hyperparameter optimization process** rather than applying them by default.
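
One way to make these steps tunable is to put them behind flags. The cell below is a minimal sketch, not part of the original notebook: `preprocess`, its `lemmatize` and `drop_stop` flags, and the `Tok` stand-in are hypothetical names. `Tok` only mimics the three spaCy token attributes used above (`text`, `lemma_`, `is_stop`), so the cell runs without a model; real spaCy tokens should work the same way, since only those attributes are read.

```python
from collections import namedtuple

# Stand-in for a spaCy token: only the three attributes used in this notebook.
Tok = namedtuple("Tok", ["text", "lemma_", "is_stop"])


def preprocess(tokens, lemmatize=True, drop_stop=True):
    """Return a list of strings; both flags are hypothetical tuning knobs."""
    # Optionally filter out stop words.
    kept = [t for t in tokens if not (drop_stop and t.is_stop)]
    # Optionally replace each token by its lemma.
    return [t.lemma_ if lemmatize else t.text for t in kept]


# Values copied from the token/lemma/stopword table printed above.
sample = [
    Tok("Tea", "tea", False),
    Tok("is", "be", True),
    Tok("healthy", "healthy", False),
    Tok("and", "and", True),
    Tok("calming", "calm", False),
]

# Try every combination, as a grid search over preprocessing options would.
for lemmatize in (True, False):
    for drop_stop in (True, False):
        print(lemmatize, drop_stop, preprocess(sample, lemmatize, drop_stop))
```

Each flag combination then becomes one more setting to compare on validation data, alongside the model's own hyperparameters.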

In [0]:
# Pattern Matching
#------------------------------------------------------------------------------------------------------------------------
# You can do pattern matching with regular expressions, but spaCy's matching capabilities tend to be easier to use.

from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')


In [0]:
# Create a list of terms to match in the text.
# The phrase matcher needs the patterns as Doc objects.
# The easiest way to get these is a list comprehension that runs each term through the nlp model.

terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
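# spaCy v2 signature; spaCy v3 takes the patterns as a list: matcher.add("TerminologyList", patterns)
matcher.add("TerminologyList", None, *patterns)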

In [12]:
# Create a document from the text to search, then use the phrase matcher to find where the terms occur

# Borrowed from https://daringfireball.net/linked/2019/09/21/patel-11-pro

text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.") 
matches = matcher(text_doc)
print(matches)

[(3766102292120407359, 17, 19), (3766102292120407359, 22, 24), (3766102292120407359, 30, 32), (3766102292120407359, 33, 35)]


In [13]:
# Each match is a tuple of the match ID and the start and end token positions of the phrase (end is exclusive).

match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start:end])

TerminologyList iPhone 11
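
The cell above unpacks only the first match. The sketch below shows how all four `(match_id, start, end)` tuples map back to phrases, treating `start:end` as a half-open token slice. It is an illustration that runs without spaCy: `tokens` is a hand-written stand-in for the tokens of `text_doc` (an assumption, chosen to be consistent with the offsets printed above), `labels` stands in for the `nlp.vocab.strings` lookup, and `matched_phrases` is a hypothetical helper name.

```python
# Hand-written stand-in for the tokens of text_doc (assumed tokenization,
# consistent with the match offsets printed above).
tokens = [
    "Glowing", "review", "overall", ",", "and", "some", "really",
    "interesting", "side", "-", "by", "-", "side", "photography", "tests",
    "pitting", "the", "iPhone", "11", "Pro", "against", "the", "Galaxy",
    "Note", "10", "Plus", "and", "last", "year", "’s", "iPhone", "XS",
    "and", "Google", "Pixel", "3", ".",
]

# Match tuples copied from the printed output: (match_id, start, end).
matches = [
    (3766102292120407359, 17, 19),
    (3766102292120407359, 22, 24),
    (3766102292120407359, 30, 32),
    (3766102292120407359, 33, 35),
]

# Stand-in for nlp.vocab.strings, which maps the hash back to the rule name.
labels = {3766102292120407359: "TerminologyList"}


def matched_phrases(tokens, matches):
    """Join the tokens in each half-open [start, end) range into a phrase."""
    return [" ".join(tokens[start:end]) for _match_id, start, end in matches]


for (match_id, start, end), phrase in zip(matches, matched_phrases(tokens, matches)):
    print(labels[match_id], start, end, phrase)
```

With the real objects from this notebook, the equivalent loop would be `for match_id, start, end in matches:` followed by the same `print` call used for the first match.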
