<h1 style=" background-color:lightblue;color:white;font-size:100px;text-align:center">spaCy</h1>

In [1]:
# spaCy 3 Installation

# !pip install pip setuptools wheel
# !pip install spacy[cuda-autodetect]
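# !python -m spacy download en_core_web_sm   # small pipeline, loaded in the next cells
# !python -m spacy download en_core_web_trf  # transformer pipeline, used later for entities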

In [2]:
import spacy

In [3]:
# Language object
nlp = spacy.load("en_core_web_sm")

# Doc object - a sequence of tokens
doc = nlp("Google is looking to rent offices in Bangalore for $10Million a month") # runs the whole pipeline, not only tokenization

print("token.text     token.pos_     token.dep_      spacy.explain(token.tag_)                     token.is_stop\n")
# POS tagging and dependency labels
for token in doc:
    # token attributes: text, coarse POS, dependency label, explained fine-grained tag, stop-word flag
    print(f"{token.text:<16s} {token.pos_:<14s} {token.dep_:<13s} {spacy.explain(token.tag_):<50s} {token.is_stop}") 

token.text     token.pos_     token.dep_      spacy.explain(token.tag_)                     token.is_stop

Google           PROPN          nsubj         noun, proper singular                              False
is               AUX            aux           verb, 3rd person singular present                  True
looking          VERB           ROOT          verb, gerund or present participle                 False
to               PART           aux           infinitival "to"                                   True
rent             VERB           xcomp         verb, base form                                    False
offices          NOUN           dobj          noun, plural                                       False
in               ADP            prep          conjunction, subordinating or preposition          True
Bangalore        PROPN          pobj          noun, proper singular                              False
for              ADP            prep          conjunction, subordinating

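Dependency parsing analyzes the grammatical structure of a sentence by linking each word (the dependent) to the word it modifies (its head) and labeling the relation between them. In the output above, token.dep_ holds that label: "Google" is the nominal subject (nsubj) of the root verb "looking," and "offices" is the direct object (dobj) of "rent."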
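
A dependency parse can be stored as one head index per token. The sketch below is a minimal illustration in plain Python: the dependency labels follow the table above, but the head indices are hand-assigned assumptions, not values copied from spaCy. In spaCy the same information is available through token.head and token.children.

```python
# Hand-written arcs for illustration only; the head indices are assumed.
tokens = ["Google", "is", "looking", "to", "rent", "offices", "in", "Bangalore"]
deps = ["nsubj", "aux", "ROOT", "aux", "xcomp", "dobj", "prep", "pobj"]
heads = [2, 2, 2, 4, 2, 4, 5, 6]  # index of each token's head; the root points to itself


def children(head_index):
    """Return the indices of all tokens attached to head_index."""
    return [i for i, h in enumerate(heads) if h == head_index and i != head_index]


root = deps.index("ROOT")
print("root:", tokens[root])

# Print every arc as dependent --label--> head
for i, token in enumerate(tokens):
    print(f"{token:<10s} --{deps[i]}--> {tokens[heads[i]]}")

print("dependents of 'rent':", [tokens[i] for i in children(4)])
```

Every token has exactly one head, so the arcs form a tree rooted at the main verb.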

# Parts of Speech

Parts of speech are categories that describe the grammatical role a word plays in a sentence. The eight traditional parts of speech are:

1. Noun: A noun is a word that represents a person, place, thing, or idea. Examples include "dog," "city," and "love."

2. Pronoun: A pronoun is a word used to replace a noun. It helps avoid repetition. Examples include "he," "she," "it," and "they."

3. Verb: A verb is a word that describes an action, state, or occurrence. It shows what the subject of the sentence is doing. Examples include "run," "eat," and "sleep."

4. Adjective: An adjective is a word that describes or modifies a noun. It provides more information about the noun. Examples include "happy," "big," and "red."

5. Adverb: An adverb is a word that describes or modifies a verb, adjective, or other adverb. It provides more information about how, when, or where something happens. Examples include "quickly," "very," and "often."

6. Preposition: A preposition is a word that shows a relationship between a noun or pronoun and another word in the sentence. It indicates location, direction, time, or manner. Examples include "in," "on," "at," and "with."

7. Conjunction: A conjunction is a word that connects words, phrases, or clauses in a sentence. It shows relationships between them. Examples include "and," "but," "or," and "because."

8. Interjection: An interjection is a word or phrase used to express strong emotions or surprise. It often stands alone and is followed by an exclamation mark. Examples include "wow," "oh," and "ouch."
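
The token.pos_ values printed earlier (PROPN, AUX, ADP, and so on) are Universal POS tags, which are finer-grained than the eight traditional categories. The mapping below is an approximate, hand-written sketch for orientation only; tags that have no traditional counterpart in this list (for example PART, DET, NUM, PUNCT) fall into "other".

```python
# Approximate mapping from traditional parts of speech to Universal POS tags.
TRADITIONAL_TO_UPOS = {
    "noun": ["NOUN", "PROPN"],
    "pronoun": ["PRON"],
    "verb": ["VERB", "AUX"],
    "adjective": ["ADJ"],
    "adverb": ["ADV"],
    "preposition": ["ADP"],
    "conjunction": ["CCONJ", "SCONJ"],
    "interjection": ["INTJ"],
}

# Reverse lookup: Universal POS tag -> traditional category
UPOS_TO_TRADITIONAL = {
    tag: name for name, tags in TRADITIONAL_TO_UPOS.items() for tag in tags
}


def traditional_category(upos_tag):
    """Return the traditional category for a tag, or 'other' if none applies."""
    return UPOS_TO_TRADITIONAL.get(upos_tag, "other")


for tag in ["PROPN", "AUX", "ADP", "PART"]:
    print(tag, "->", traditional_category(tag))
```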

# Visualizing Entities and the Dependency Parse

In [4]:
from spacy import displacy
displacy.render(doc,style='ent')

In [5]:
options = {"distance":120, "bg":"#0095b6", "color":"#DDDDDD", "font":"Source Sans Pro"}
displacy.render(doc, style='dep', jupyter=True, options=options)

### Entity Vs Token

A <b>token</b> is the smallest unit of text that NLP systems analyze. It can be a word, a number, or even a punctuation mark. Tokens are like individual puzzle pieces that make up a sentence. For example, in the sentence "I love cats," the tokens are "I," "love," and "cats."

An <b>entity</b> is a span of text that refers to a specific real-world object, such as a person, organization, location, or date. One entity can cover several tokens ("New York" is two tokens but one entity). NLP systems identify and classify entities to extract meaningful information from the text.

To illustrate, in the sentence "John works at Google," the tokens are "John," "works," "at," and "Google." The entities are "John," which refers to a person, and "Google," which refers to an organization.

In summary, tokens are the individual units of text used for analysis, while entities are specific named objects or elements within the text that carry meaning and are often classified by NLP systems.
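
The difference can be made concrete by storing entities as labeled ranges over the token list, which is roughly how spaCy exposes them through doc.ents. This is a hand-assigned sketch: the sentence, the spans, and the labels (PERSON, ORG, GPE from spaCy's English label scheme) are written by hand here, not predicted by a model.

```python
tokens = ["John", "works", "at", "Google", "in", "New", "York"]

# Entities as (start, end, label) over token indices; end is exclusive.
entities = [(0, 1, "PERSON"), (3, 4, "ORG"), (5, 7, "GPE")]


def entity_label(index):
    """Return the label of the entity covering this token, or '' if none does."""
    for start, end, label in entities:
        if start <= index < end:
            return label
    return ""


# Every token is listed, but only some belong to an entity
for i, token in enumerate(tokens):
    print(f"{token:<8s} {entity_label(i) or '-'}")

# One entity can span several tokens
for start, end, label in entities:
    print(label, "=", " ".join(tokens[start:end]))
```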

# Sentence Boundary Detection

Sentence boundary detection finds where each sentence starts and ends, so that a paragraph can be split into individual sentences.

In [6]:
para = '''Tap to Pay on ‌iPhone‌ is set to compete against existing ‌iPhone‌ payment solutions for merchants such as Square. It will let small businesses accept NFC contactless payments through supported iOS apps with an ‌iPhone‌ XS or newer. When checking someone out, the merchant will ask the customer to hold their own ‌iPhone‌, Apple Watch, digital wallet, or contactless card up to the merchant's ‌iPhone‌ to complete a payment quickly and easily.'''
doc = nlp(para)

# doc.sents yields each sentence of the paragraph as a Span
sentences = doc.sents
for s in sentences:
    print(s)

Tap to Pay on ‌iPhone‌ is set to compete against existing ‌iPhone‌ payment solutions for merchants such as Square.
It will let small businesses accept NFC contactless payments through supported iOS apps with an ‌iPhone‌ XS or newer.
When checking someone out, the merchant will ask the customer to hold their own ‌iPhone‌, Apple Watch, digital wallet, or contactless card up to the merchant's ‌iPhone‌ to complete a payment quickly and easily.
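
For contrast, here is a naive rule-based splitter in plain Python. It is not what spaCy does (the default English pipelines derive sentence boundaries from the dependency parse); it only illustrates why a fixed punctuation rule is brittle. The sample text and the function name are made up for this sketch.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = "Tap to Pay is coming. It works with an iPhone XS or newer. Dr. Smith approves."
parts = naive_sentences(sample)
for part in parts:
    print(repr(part))
```

The abbreviation "Dr." is wrongly treated as the end of a sentence, which is the kind of case a statistical or parser-based approach handles better.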


# Stopwords

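Stopwords are very frequent English words, such as "the," "is," and "to," that carry little meaning on their own and are often removed before further analysis.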

In [7]:
stopwords = spacy.lang.en.stop_words.STOP_WORDS # returns SET of stopwords 
list(stopwords)[:10]

['another',
 'thus',
 'thereupon',
 'will',
 'was',
 'next',
 'done',
 'whence',
 'few',
 'other']

In [8]:
doc

Tap to Pay on ‌iPhone‌ is set to compete against existing ‌iPhone‌ payment solutions for merchants such as Square. It will let small businesses accept NFC contactless payments through supported iOS apps with an ‌iPhone‌ XS or newer. When checking someone out, the merchant will ask the customer to hold their own ‌iPhone‌, Apple Watch, digital wallet, or contactless card up to the merchant's ‌iPhone‌ to complete a payment quickly and easily.

In [9]:
for sentence in doc.sents:
    for token in sentence:
        print(token)
    break # stop after the first sentence

Tap
to
Pay
on
‌iPhone‌
is
set
to
compete
against
existing
‌iPhone‌
payment
solutions
for
merchants
such
as
Square
.


In [10]:
# Removing stopwords from paragraph
for sentence in doc.sents:
    for token in sentence:
        if token.text not in stopwords: # token.text is the token's string; this check is case-sensitive (token.is_stop is the built-in alternative)
            print(token)
    break

Tap
Pay
‌iPhone‌
set
compete
existing
‌iPhone‌
payment
solutions
merchants
Square
.


In [11]:
sentence # a Span: a slice of the Doc that can be iterated token by token

Tap to Pay on ‌iPhone‌ is set to compete against existing ‌iPhone‌ payment solutions for merchants such as Square.

In [12]:
sentence.text # the sentence as a plain string

'Tap to Pay on \u200ciPhone\u200c is set to compete against existing \u200ciPhone\u200c payment solutions for merchants such as Square.'
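
Comparing token.text against the stop-word set is case-sensitive, so a capitalized stop word such as "It" at the start of a sentence would pass through. The sketch below shows the difference with a small hand-picked set (an assumption for illustration; spaCy's STOP_WORDS is much larger) by lowercasing before the lookup.

```python
# A tiny demo set; not spaCy's actual STOP_WORDS.
STOP_WORDS_DEMO = {"to", "on", "is", "it", "the", "for", "such", "as", "against", "will"}

words = "It will let small businesses accept NFC contactless payments".split()

# Case-sensitive lookup keeps "It" because only "it" is in the set
case_sensitive = [w for w in words if w not in STOP_WORDS_DEMO]

# Lowercasing first removes it
case_insensitive = [w for w in words if w.lower() not in STOP_WORDS_DEMO]

print(case_sensitive)
print(case_insensitive)
```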

# Lemmatization

Lemmatization reduces a word to its base or dictionary form, called the "lemma." It uses the word's grammar and context to pick that form, accounting for tense, plurals, and other inflections. For example, "cars," "car's," and "cars'" share the lemma "car," and the irregular form "ran" maps to "run." Treating these variants as one word makes counting, searching, and comparing text more reliable.

<pre>
1. Maps each word to its base (dictionary) form
2. The result, the lemma, is itself a valid word
3. Generally more accurate than stemming
</pre>

In [13]:
# Token -> Lemmatization -> Lemma(base or root word with similar meaning)
line = nlp("looking look run ran running ")

for token in line:
    print(f"{token.text:<10s} {token.lemma_:<10s}")

looking    look      
look       look      
run        run       
ran        run       
running    run       


# Stemming

Stemming reduces a word to a root form by stripping endings or suffixes with fixed rules, without consulting a dictionary. This groups related words together: "running" and "runs" both become "run." The stem is not always a real word ("studying" becomes "studi" in the output below), and irregular forms such as "ran" are left unchanged because there is no suffix to strip.

In [14]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

line = nlp('study studying studied')

for token in line:
    print(f"{token.text:<10s} {ps.stem(token.text)}")

study      studi
studying   studi
studied    studi
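
The contrast between the two techniques can be sketched without any library. The suffix stripper below is a toy, not the Porter algorithm, and the lemma table is a hand-written stand-in for a real lemmatizer; both are assumptions made for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order


def toy_stem(word):
    """Strip the first matching suffix; no dictionary and no context."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hand-written lookup table standing in for a lemmatizer
LEMMA_TABLE = {"running": "run", "ran": "run", "studies": "study", "studied": "study"}


def toy_lemma(word):
    """Look the word up; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


print(f"{'word':<10s} {'stem':<10s} lemma")
for w in ["running", "ran", "studies", "studied"]:
    print(f"{w:<10s} {toy_stem(w):<10s} {toy_lemma(w)}")
```

The stems "runn" and "studi" are not words, and "ran" is untouched by the stemmer, while the lookup returns a dictionary form in each case.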


# Word Frequency Count

In [15]:
from collections import Counter

In [16]:
Counter(para.split())

Counter({'Tap': 1,
         'to': 5,
         'Pay': 1,
         'on': 1,
         '\u200ciPhone\u200c': 4,
         'is': 1,
         'set': 1,
         'compete': 1,
         'against': 1,
         'existing': 1,
         'payment': 2,
         'solutions': 1,
         'for': 1,
         'merchants': 1,
         'such': 1,
         'as': 1,
         'Square.': 1,
         'It': 1,
         'will': 2,
         'let': 1,
         'small': 1,
         'businesses': 1,
         'accept': 1,
         'NFC': 1,
         'contactless': 2,
         'payments': 1,
         'through': 1,
         'supported': 1,
         'iOS': 1,
         'apps': 1,
         'with': 1,
         'an': 1,
         'XS': 1,
         'or': 2,
         'newer.': 1,
         'When': 1,
         'checking': 1,
         'someone': 1,
         'out,': 1,
         'the': 3,
         'merchant': 1,
         'ask': 1,
         'customer': 1,
         'hold': 1,
         'their': 1,
         'own': 1,
         '\u200ciPho

In [17]:
word_freq = Counter(para.split())
word_freq.most_common(10)

[('to', 5),
 ('\u200ciPhone\u200c', 4),
 ('the', 3),
 ('payment', 2),
 ('will', 2),
 ('contactless', 2),
 ('or', 2),
 ('Tap', 1),
 ('Pay', 1),
 ('on', 1)]
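
Splitting on whitespace leaves punctuation, capitalization, and invisible characters in the keys, which is why 'Square.' and '\u200ciPhone\u200c' appear in the counts above. A minimal sketch of a normalized count follows; the sample sentence and the function name are made up for illustration.

```python
import re
from collections import Counter

sample = "Tap to Pay on \u200ciPhone\u200c is set. The merchant's \u200ciPhone\u200c, the customer's iPhone."


def word_counts(text):
    """Count lowercase words after removing zero-width non-joiners and punctuation."""
    cleaned = text.replace("\u200c", "").lower()
    # Letters and digits, optionally followed by an apostrophe suffix such as 's
    return Counter(re.findall(r"[a-z0-9]+(?:'[a-z]+)?", cleaned))


raw = Counter(sample.split())
counts = word_counts(sample)

print(raw.most_common(3))
print(counts.most_common(3))
```

With normalization, the three spellings of "iPhone" collapse into one key, and "The" and "the" are counted together.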

# Rule Based Matching

In [18]:
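# Getting Alice - Adventures in Wonderland ebook 
import requests
response = requests.get('https://www.gutenberg.org/files/11/11-0.txt')
# If the server sends no charset, requests decodes text as ISO-8859-1, which explains
# garbled tokens such as "ï»¿The" in the listing further down. Setting
# response.encoding = 'utf-8-sig' before reading response.text would avoid that.
response.text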



In [19]:
doc = nlp(response.text)

In [20]:
displacy.render(doc, style = 'ent')

In [21]:
for token in doc:
    print(f"{token.text:<25s}{token.pos_}" )

ï»¿The                   NUM
Project                  PROPN
Gutenberg                PROPN
eBook                    PROPN
of                       ADP
Aliceâs                PROPN
Adventures               PROPN
in                       ADP
Wonderland               PROPN
,                        PUNCT
by                       ADP
Lewis                    PROPN
Carroll                  PROPN


                     SPACE
This                     DET
eBook                    PROPN
is                       AUX
for                      ADP
the                      DET
use                      NOUN
of                       ADP
anyone                   PRON
anywhere                 ADV
in                       ADP
the                      DET
United                   PROPN
States                   PROPN
and                      CCONJ

                       SPACE
most                     ADJ
other                    ADJ
parts                    NOUN
of                       ADP
the       

In [22]:
from spacy.matcher import Matcher
from spacy.tokens import Span

In [23]:
matcher = Matcher(nlp.vocab)

In [24]:
pattern_1 = [{'LOWER':'alice','POS':'PROPN'}]
matcher.add('Alice_PROPN',[pattern_1])

In [25]:
matches = matcher(doc)

In [26]:
# each match is a (match_id, start, end) tuple; start and end are token indices, end exclusive
matches[:10]

[(4167198918391854712, 323, 324),
 (4167198918391854712, 384, 385),
 (4167198918391854712, 472, 473),
 (4167198918391854712, 567, 568),
 (4167198918391854712, 646, 647),
 (4167198918391854712, 690, 691),
 (4167198918391854712, 884, 885),
 (4167198918391854712, 1010, 1011),
 (4167198918391854712, 1085, 1086),
 (4167198918391854712, 1285, 1286)]

In [27]:
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id,string_id,start,end,span.text)

4167198918391854712 Alice_PROPN 323 324 Alice
4167198918391854712 Alice_PROPN 384 385 Alice
4167198918391854712 Alice_PROPN 472 473 Alice
4167198918391854712 Alice_PROPN 567 568 Alice
4167198918391854712 Alice_PROPN 646 647 Alice
4167198918391854712 Alice_PROPN 690 691 Alice
4167198918391854712 Alice_PROPN 884 885 Alice
4167198918391854712 Alice_PROPN 1010 1011 Alice
4167198918391854712 Alice_PROPN 1085 1086 Alice
4167198918391854712 Alice_PROPN 1285 1286 Alice
4167198918391854712 Alice_PROPN 1380 1381 Alice
4167198918391854712 Alice_PROPN 1519 1520 Alice
4167198918391854712 Alice_PROPN 1583 1584 Alice
4167198918391854712 Alice_PROPN 1682 1683 Alice
4167198918391854712 Alice_PROPN 1854 1855 Alice
4167198918391854712 Alice_PROPN 1948 1949 Alice
4167198918391854712 Alice_PROPN 2009 2010 Alice
4167198918391854712 Alice_PROPN 2095 2096 Alice
4167198918391854712 Alice_PROPN 2143 2144 Alice
4167198918391854712 Alice_PROPN 2311 2312 Alice
4167198918391854712 Alice_PROPN 2413 2414 Alice
416719

In [28]:
matcher_2 = Matcher(nlp.vocab)
pattern_2 = [{"LOWER": 'alice', "POS":"PROPN"}, {'POS': {"NOT_IN": ['AUX']}}]
matcher_2.add('Alice_PROPN',[pattern_2])
matches_2 = matcher_2(doc)

for match_id, start, end in matches_2:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id,string_id,start,end,span.text)

4167198918391854712 Alice_PROPN 384 386 Alice

4167198918391854712 Alice_PROPN 472 474 Alice think
4167198918391854712 Alice_PROPN 567 569 Alice started
4167198918391854712 Alice_PROPN 646 648 Alice after
4167198918391854712 Alice_PROPN 690 692 Alice had
4167198918391854712 Alice_PROPN 884 886 Alice to
4167198918391854712 Alice_PROPN 1085 1087 Alice had
4167198918391854712 Alice_PROPN 1285 1287 Alice soon
4167198918391854712 Alice_PROPN 1380 1382 Alice began
4167198918391854712 Alice_PROPN 1583 1585 Alice like
4167198918391854712 Alice_PROPN 1854 1856 Alice opened
4167198918391854712 Alice_PROPN 1948 1950 Alice,
4167198918391854712 Alice_PROPN 2095 2097 Alice,
4167198918391854712 Alice_PROPN 2311 2313 Alice ventured
4167198918391854712 Alice_PROPN 2413 2415 Alice;
4167198918391854712 Alice_PROPN 2509 2511 Alice to
4167198918391854712 Alice_PROPN 2593 2595 Alice!
4167198918391854712 Alice_PROPN 2700 2702 Alice to
4167198918391854712 Alice_PROPN 2802 2804 Alice,
4167198918391854712 Alic

In [29]:
matcher = Matcher(nlp.vocab)

pattern = [{"LEMMA": "begin"}, {"POS":"ADP"}]

matcher.add('lemma_adp', [pattern])
matches = matcher(doc)

for match_id, start, end in (matches):
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

6308555498547242653 lemma_adp 11376 11378 begin with
6308555498547242653 lemma_adp 12231 12233 beginning to
6308555498547242653 lemma_adp 14157 14159 began by
6308555498547242653 lemma_adp 16731 16733 begin with
6308555498547242653 lemma_adp 17146 17148 beginning with
6308555498547242653 lemma_adp 19883 19885 begins with
6308555498547242653 lemma_adp 19952 19954 begins with
6308555498547242653 lemma_adp 20234 20236 began by
6308555498547242653 lemma_adp 20616 20618 began in
6308555498547242653 lemma_adp 23265 23267 begin at
6308555498547242653 lemma_adp 24553 24555 began in
6308555498547242653 lemma_adp 25918 25920 begin with
6308555498547242653 lemma_adp 29526 29528 began in
6308555498547242653 lemma_adp 30601 30603 begins with
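
Conceptually, a token pattern is a list of attribute dictionaries, one per consecutive token. The sketch below re-creates that idea in plain Python to show what a pattern such as [{"LEMMA": "begin"}, {"POS": "ADP"}] asks for. It is a simplified illustration, not spaCy's implementation: it supports only exact attribute values (no NOT_IN, no OP quantifiers), and the token attributes are written by hand.

```python
def token_matches(token, spec):
    """True if the token has every attribute value required by one pattern dict."""
    return all(token.get(attr) == value for attr, value in spec.items())


def find_matches(tokens, pattern):
    """Return (start, end) pairs where the pattern matches consecutive tokens."""
    size = len(pattern)
    return [
        (start, start + size)
        for start in range(len(tokens) - size + 1)
        if all(token_matches(tok, spec)
               for tok, spec in zip(tokens[start:start + size], pattern))
    ]


# Hand-annotated tokens for illustration only
tokens = [
    {"TEXT": "Alice", "LEMMA": "Alice", "POS": "PROPN"},
    {"TEXT": "began", "LEMMA": "begin", "POS": "VERB"},
    {"TEXT": "by", "LEMMA": "by", "POS": "ADP"},
    {"TEXT": "taking", "LEMMA": "take", "POS": "VERB"},
    {"TEXT": "the", "LEMMA": "the", "POS": "DET"},
    {"TEXT": "key", "LEMMA": "key", "POS": "NOUN"},
]

pattern = [{"LEMMA": "begin"}, {"POS": "ADP"}]
matches = find_matches(tokens, pattern)
for start, end in matches:
    print(start, end, " ".join(t["TEXT"] for t in tokens[start:end]))
```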


# Phrase Matching

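Phrase matching finds exact occurrences of a list of terms or multi-word phrases in a document. It is efficient for large term lists, and with attr='LOWER' the comparison ignores case.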

In [30]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab,attr='LOWER')
list_of_phrases = ['little','alice']

# nlp.make_doc only tokenizes, which is all the LOWER attribute needs and is faster than running the full pipeline with nlp(text)
pattern = [nlp.make_doc(text) for text in list_of_phrases]
matcher.add('little_magic',pattern)
matches = matcher(doc)

for match_id, start, end in (matches):
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(string_id, start, end, span.text)

little_magic 242 243 Little
little_magic 323 324 Alice
little_magic 384 385 Alice
little_magic 472 473 Alice
little_magic 567 568 Alice
little_magic 646 647 Alice
little_magic 690 691 Alice
little_magic 884 885 Alice
little_magic 1010 1011 Alice
little_magic 1085 1086 Alice
little_magic 1245 1246 little
little_magic 1285 1286 Alice
little_magic 1380 1381 Alice
little_magic 1519 1520 Alice
little_magic 1583 1584 Alice
little_magic 1682 1683 Alice
little_magic 1725 1726 little
little_magic 1827 1828 little
little_magic 1838 1839 little
little_magic 1854 1855 Alice
little_magic 1948 1949 Alice
little_magic 1955 1956 little
little_magic 2009 2010 Alice
little_magic 2035 2036 little
little_magic 2079 2080 little
little_magic 2095 2096 Alice
little_magic 2142 2143 little
little_magic 2143 2144 Alice
little_magic 2184 2185 little
little_magic 2311 2312 Alice
little_magic 2413 2414 Alice
little_magic 2460 2461 little
little_magic 2493 2494 little
little_magic 2509 2510 Alice
little_magic 2593 
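
Both phrases in the cell above are single words. The plain-Python sketch below shows the same case-insensitive idea with a multi-word phrase. It is an illustration only, assuming whitespace tokenization; the function name and sample sentence are made up, and spaCy's PhraseMatcher is implemented differently.

```python
def find_phrases(words, phrases):
    """Return (start, end, text) for every case-insensitive phrase occurrence."""
    lowered = [w.lower() for w in words]
    hits = []
    for phrase in phrases:
        target = phrase.lower().split()
        if not target:
            continue  # ignore empty phrases
        size = len(target)
        for start in range(len(lowered) - size + 1):
            if lowered[start:start + size] == target:
                hits.append((start, start + size, " ".join(words[start:start + size])))
    return sorted(hits)


words = "Alice saw the White Rabbit and a little door".split()
hits = find_phrases(words, ["white rabbit", "little"])
for start, end, text in hits:
    print(start, end, text)
```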

# Named Entity Recognition

In [31]:
txt = """Here is the first volume in George R. R. Martin’s magnificent cycle of novels that includes A Clash of Kings and A Storm of Swords. As a whole, this series comprises a genuine masterpiece of modern fantasy, bringing together the best the genre has to offer. Magic, mystery, intrigue, romance, and adventure fill these pages and transport us to a world unlike any we have ever experienced. Already hailed as a classic, George R. R. Martin’s stunning series is destined to stand as one of the great achievements of imaginative fiction.

A GAME OF THRONES

Long ago, in a time forgotten, a preternatural event threw the seasons out of balance. In a land where summers can last decades and winters a lifetime, trouble is brewing. The cold is returning, and in the frozen wastes to the north of Winterfell, sinister and supernatural forces are massing beyond the kingdom’s protective Wall. At the center of the conflict lie the Starks of Winterfell, a family as harsh and unyielding as the land they were born to. Sweeping from a land of brutal cold to a distant summertime kingdom of epicurean plenty, here is a tale of lords and ladies, soldiers and sorcerers, assassins and bastards, who come together in a time of grim omens.

Here an enigmatic band of warriors bear swords of no human metal; a tribe of fierce wildlings carry men off into madness; a cruel young dragon prince barters his sister to win back his throne; and a determined woman undertakes the most treacherous of journeys. Amid plots and counterplots, tragedy and betrayal, victory and terror, the fate of the Starks, their allies, and their enemies hangs perilously in the balance, as each endeavors to win that deadliest of conflicts: the game of thrones."""

In [32]:
nlp = spacy.load('en_core_web_trf')

doc = nlp(txt)
displacy.render(doc,style='ent')

In [33]:
spacy.explain('NORP')

'Nationalities or religious or political groups'

In [34]:
from bs4 import BeautifulSoup

In [35]:
url = "https://www.bbc.com/news/business-61589229"
html_content = requests.get(url).text

soup = BeautifulSoup(html_content,'lxml')
txt = soup.body.text

In [36]:
nlp = spacy.load('en_core_web_trf')

doc = nlp(txt)
displacy.render(doc,style='ent')

# Word Vectors and Similarity

In [37]:
#!python -m spacy download en_core_web_lg #685k unique vectors
nlp = spacy.load('en_core_web_lg')

In [41]:
t1 = nlp('i want tea')
t2 = nlp('i want coffee')

t1.similarity(t2)

0.9576329871632303
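
By default, Doc.similarity compares the averaged word vectors of the two documents using cosine similarity. The sketch below reproduces that calculation in plain Python. The three-dimensional vectors are made up for illustration; real models use a few hundred dimensions, so the resulting number will not match the output above.

```python
import math

# Made-up toy vectors; not taken from any spaCy model.
vectors = {
    "i": [0.1, 0.3, 0.5],
    "want": [0.4, 0.1, 0.2],
    "tea": [0.9, 0.2, 0.1],
    "coffee": [0.8, 0.3, 0.1],
}


def average(words):
    """Average the vectors of the given words, dimension by dimension."""
    dims = len(vectors[words[0]])
    return [sum(vectors[w][d] for w in words) / len(words) for d in range(dims)]


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim = cosine(average(["i", "want", "tea"]), average(["i", "want", "coffee"]))
print(round(sim, 4))
```

Because two of the three words are shared, the averaged vectors point in nearly the same direction, which is one reason short sentences with common words score so high.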