INSTALLING AND IMPORTING SPACY

In [107]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 26.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [108]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
  print(token.text,token.pos_,token.dep_)


Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


TOKENIZATION

In [109]:
text = '''"Let's go to N.Y.!"'''

In [110]:
nlp = spacy.load('en_core_web_sm')

In [111]:
doc = nlp(text)

In [112]:
doc

"Let's go to N.Y.!"

In [113]:
for token in doc:
  print(token.text)

"
Let
's
go
to
N.Y.
!
"


In [114]:
text = "Apple is looking at buying U.K. startup for $1 billion"
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for token in doc:
  print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


Part of Speech (POS) Tagging and Dependency parsing

In [115]:
!pip install beautifultable


In [116]:
from beautifultable import BeautifulTable

In [117]:
table = BeautifulTable()

In [118]:
table.columns.header = ['text' , 'POS' ,'TAG' ,'Explain Tag' , 'Dep' , 'Shape' , 'is_alpha' , 'is_stop']
for token in doc:
  table.rows.append([token.text, token.pos_, token.tag_,spacy.explain(token.tag_), token.dep_, token.shape_, token.is_alpha, token.is_stop])

In [119]:
print(table)

+------+---------+-----+--------------------------+------+-------+------+------+
| text |   POS   | TAG |       Explain Tag        | Dep  | Shape | is_a | is_s |
|      |         |     |                          |      |       | lpha | top  |
+------+---------+-----+--------------------------+------+-------+------+------+
| Appl |  PROPN  | NNP |  noun, proper singular   | nsub | Xxxxx |  1   |  0   |
|  e   |         |     |                          |  j   |       |      |      |
+------+---------+-----+--------------------------+------+-------+------+------+
|  is  |   AUX   | VBZ | verb, 3rd person singula | aux  |  xx   |  1   |  1   |
|      |         |     |        r present         |      |       |      |      |
+------+---------+-----+--------------------------+------+-------+------+------+
| look |  VERB   | VBG | verb, gerund or present  | ROOT | xxxx  |  1   |  0   |
| ing  |         |     |        participle        |      |       |      |      |
+------+---------+-----+----

Visualising Dependency parsing with displaCy

In [120]:
from spacy import displacy

In [121]:
options = {"compact":True,"distance":100, "bg":"#BBDFFF", "color": "#DDDDF","font":"Source Sans Pro"}
displacy.render(doc, style = 'dep', jupyter = True,options = options)

Sentence Boundary Detection

In [122]:
para = '''The Rankine cycle is a thermodynamic cycle used in steam power plants to convert heat energy into mechanical work. It is named after Scottish engineer William John Macquorn Rankine, who developed it in the 19th century.

The cycle is comprised of four basic components: a heat source (usually a boiler) that supplies high-pressure steam, a turbine that extracts energy from the steam and converts it into rotational motion, a condenser that removes heat from the low-pressure steam and converts it back into water, and a pump that returns the condensed water to the boiler to be reheated and used again.'''

In [123]:
doc = nlp(para)

In [124]:
sents = list(doc.sents)

In [125]:
len(sents)

3

In [126]:
for sent in sents:
  print(sent)

The Rankine cycle is a thermodynamic cycle used in steam power plants to convert heat energy into mechanical work.
It is named after Scottish engineer William John Macquorn Rankine, who developed it in the 19th century.


The cycle is comprised of four basic components: a heat source (usually a boiler) that supplies high-pressure steam, a turbine that extracts energy from the steam and converts it into rotational motion, a condenser that removes heat from the low-pressure steam and converts it back into water, and a pump that returns the condensed water to the boiler to be reheated and used again.
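
`doc.sents` above relies on the dependency parser in `en_core_web_sm`. When only sentence boundaries are needed, spaCy's rule-based `sentencizer` component is a lighter option. A minimal sketch, assuming a blank English pipeline and a shortened two-sentence sample of the paragraph (both are illustrative choices, not part of the original notebook):

```python
import spacy

# Blank pipeline: tokenizer only, no model download required.
nlp_rules = spacy.blank("en")
# The sentencizer splits on sentence-final punctuation such as ".", "!" and "?".
nlp_rules.add_pipe("sentencizer")

sample = (
    "The Rankine cycle is a thermodynamic cycle used in steam power plants. "
    "It is named after William John Macquorn Rankine."
)
rule_sents = [sent.text for sent in nlp_rules(sample).sents]
for sent in rule_sents:
    print(sent)
```

Because the rules only look at punctuation, this is likely to be less accurate than the parser on text with unusual abbreviations or missing full stops.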


Stop Words

In [127]:
from spacy.lang.en.stop_words import STOP_WORDS  # explicit import; does not rely on spacy.lang.en already being loaded
stopwords = STOP_WORDS

In [128]:
stopwords

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [129]:
for sent in sents:
  print(sent.text)
  for token in sent:
    print(token)

  break

The Rankine cycle is a thermodynamic cycle used in steam power plants to convert heat energy into mechanical work.
The
Rankine
cycle
is
a
thermodynamic
cycle
used
in
steam
power
plants
to
convert
heat
energy
into
mechanical
work
.


Lemmatization

In [130]:
text = 'ram is playing '

In [131]:
doc=nlp(text)

In [132]:
for token in doc:
  print(token.text, token.lemma_)

ram ram
is be
playing play


Word frequency count

In [133]:
text = "The Rankine cycle is commonly used in steam power plants, including coal-fired power plants and nuclear power plants. It is also used in geothermal power plants, where the heat source is naturally occurring steam from underground. The efficiency of the Rankine cycle can be improved by using reheating and regeneration, which involves reheating the steam between stages of the turbine and using feedwater heaters to preheat the water before it enters the boiler."

In [134]:
print(text)

The Rankine cycle is commonly used in steam power plants, including coal-fired power plants and nuclear power plants. It is also used in geothermal power plants, where the heat source is naturally occurring steam from underground. The efficiency of the Rankine cycle can be improved by using reheating and regeneration, which involves reheating the steam between stages of the turbine and using feedwater heaters to preheat the water before it enters the boiler.


In [135]:
from collections import Counter

In [136]:
word_freq = Counter(text.split())

In [137]:
doc = nlp(text)

In [138]:
words = [token.text for token in doc if not token.is_stop and not token.is_punct]

In [139]:
word_freq = Counter(words)

In [140]:
word_freq.most_common(5)

[('power', 4), ('plants', 4), ('steam', 3), ('Rankine', 2), ('cycle', 2)]
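
The first `Counter(text.split())` in this section is overwritten before it is inspected, but it is worth seeing why the token-based count is preferable. Whitespace splitting leaves punctuation attached, so `plants,` and `plants.` are counted as different words. A small standard-library sketch on a shortened sample sentence (the sample and the `strip` cleanup are illustrative assumptions):

```python
import string
from collections import Counter

sample = "power plants, including coal-fired power plants and nuclear power plants."

# Naive count: punctuation stays glued to the neighbouring word.
naive_freq = Counter(sample.split())

# Slightly better: strip leading/trailing punctuation and lowercase each piece.
cleaned_freq = Counter(
    word.strip(string.punctuation).lower() for word in sample.split()
)

print(naive_freq["plants"], cleaned_freq["plants"])  # prints: 1 3
```

The spaCy version above goes further than this cleanup, since it also drops stop words through `token.is_stop`.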

Rule based matching

Token Matcher

In [141]:
from spacy.matcher import Matcher
from spacy.tokens import Span

In [142]:
matcher = Matcher(nlp.vocab)

In [172]:
pattern_verb = [{"LOWER" : 'alice', "POS": "PROPN"},{"POS":'AUX'}]
pattern = [{"LOWER" : 'alice', "POS": "PROPN"},{"POS":{"NOT_IN":['AUX']}}]

In [173]:
matcher.add("Matching",[pattern])

In [174]:
import requests
response = requests.get('https://www.gutenberg.org/files/11/11-0.txt')
text = response.text

In [175]:
text



In [176]:
doc = nlp(text)

In [177]:
matches = matcher(doc)

In [178]:
for match_id, start, end in matches :
  string_id = nlp.vocab.strings[match_id]
  span = doc[start:end]
  print(match_id, string_id, start, end, span.text)

6895354335150655416 Matching 323 325 Alice was
6895354335150655416 Matching 384 386 Alice

6895354335150655416 Matching 472 474 Alice think
6895354335150655416 Matching 567 569 Alice started
6895354335150655416 Matching 646 648 Alice after
6895354335150655416 Matching 690 692 Alice had
6895354335150655416 Matching 884 886 Alice to
6895354335150655416 Matching 1010 1012 Alice had
6895354335150655416 Matching 1085 1087 Alice had
6895354335150655416 Matching 1285 1287 Alice soon
6895354335150655416 Matching 1380 1382 Alice began
6895354335150655416 Matching 1519 1521 Alice was
6895354335150655416 Matching 1583 1585 Alice like
6895354335150655416 Matching 1682 1684 Alice had
6895354335150655416 Matching 1854 1856 Alice opened
6895354335150655416 Matching 1948 1950 Alice,
6895354335150655416 Matching 2009 2011 Alice had
6895354335150655416 Matching 2095 2097 Alice,
6895354335150655416 Matching 2143 2145 Alice was
6895354335150655416 Matching 2311 2313 Alice ventured
6895354335150655416 Mat

Streaming output truncated to the last 5000 lines.
 SPACE
about ADP
the DET
twentieth ADJ
time NOUN
that DET
day NOUN
. PUNCT


 SPACE
âNo PROPN
, PUNCT
no!â PROPN
said VERB
the DET
Queen PROPN
. PUNCT
âSentence PROPN
firstâverdict PROPN
afterwards.â PROPN


 SPACE
âStuff PROPN
and CCONJ
nonsense!â NOUN
said VERB
Alice PROPN
loudly ADV
. PUNCT
âThe NOUN
idea NOUN
of ADP
having VERB
the DET

 SPACE
sentence NOUN
first!â NOUN


 SPACE
âHold NOUN
your PRON
tongue!â NOUN
said VERB
the DET
Queen PROPN
, PUNCT
turning VERB
purple NOUN
. PUNCT


 SPACE
âI NOUN
wonât!â PROPN
said VERB
Alice PROPN
. PUNCT


 SPACE
âOff PROPN
with ADP
her PRON
head!â NOUN
the DET
Queen PROPN
shouted VERB
at ADP
the DET
top NOUN
of ADP
her PRON
voice NOUN
. PUNCT
Nobody PRON

 SPACE
moved VERB
. PUNCT


 SPACE
âWho VERB
cares NOUN
for ADP
you?â NOUN
said VERB
Alice PROPN
, PUNCT
( PUNCT
she PRON
had AUX
grown VERB
to ADP
her PRON
full ADJ
size NOUN
by ADP

 SPACE
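
Tokens such as `âNo` in the output above are consistent with UTF-8 text having been decoded as Latin-1: each curly quote is three bytes in UTF-8, and the first byte shows up as `â`. The sketch below reproduces the effect on a short string from the text and shows two possible repairs. Setting `response.encoding` before reading `response.text` is the usual fix with `requests`; whether it applies here depends on the headers the server actually sent, which the notebook does not show.

```python
raw = "“No, no!” said the Queen.".encode("utf-8")  # bytes as sent over the wire

garbled = raw.decode("latin-1")   # wrong codec: curly quotes become 'â' plus control characters
correct = raw.decode("utf-8")     # right codec

# Text that was already decoded wrongly can be repaired by reversing the step.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled.startswith("â"), repaired == correct)

# With requests, the equivalent fix would be (not executed here):
#   response.encoding = "utf-8"
#   text = response.text
```

With correctly decoded text the tokenizer can split the quotation marks off as separate tokens, so token indices and tags would differ from the output shown above.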


In [179]:
matcher = Matcher(nlp.vocab)

In [180]:
pattern = [{"POS":"ADJ"}, {"POS":"NOUN"}]

In [182]:
matcher.add('adj_noun', [pattern])

In [183]:
matches = matcher(doc)

In [185]:
for match_id, start, end in matches :
  string_id = nlp.vocab.strings[match_id]
  span = doc[start:end]
  print(match_id, string_id, start, end, span.text)

  break

2526562708749592420 adj_noun 30 32 other parts


In [186]:
matcher = Matcher(nlp.vocab)

pattern = [{"LEMMA":'begin'},{"POS":"ADP"}]

In [194]:
matcher.add('lemma_adp', [pattern])
matches = matcher(doc)

In [195]:
index =0
for  match_id, start, end in (matches) :
  string_id = nlp.vocab.strings[match_id]
  span = doc[start:end]
  print(match_id, string_id, start, end, span.text)
  index= index+1


  if index>10 :
    break




914636624706811372 leema_adp 11376 11378 begin with
6308555498547242653 lemma_adp 11376 11378 begin with
914636624706811372 leema_adp 12231 12233 beginning to
6308555498547242653 lemma_adp 12231 12233 beginning to
914636624706811372 leema_adp 14157 14159 began by
6308555498547242653 lemma_adp 14157 14159 began by
914636624706811372 leema_adp 16731 16733 begin with
6308555498547242653 lemma_adp 16731 16733 begin with
914636624706811372 leema_adp 17146 17148 beginning with
6308555498547242653 lemma_adp 17146 17148 beginning with
914636624706811372 leema_adp 19883 19885 begins with


In [196]:
matcher = Matcher(nlp.vocab)

pattern = [{"TEXT":"Alice"}, {"IS_PUNCT":True, "OP":"*"}]

matcher.add('lemma_adp',[pattern])
matches = matcher(doc)

In [197]:
index =0
for  match_id, start, end in (matches) :
  string_id = nlp.vocab.strings[match_id]
  span = doc[start:end]
  print(match_id, string_id, start, end, span.text)
  index= index+1


  if index>10 :
    break




6308555498547242653 lemma_adp 323 324 Alice
6308555498547242653 lemma_adp 384 385 Alice
6308555498547242653 lemma_adp 472 473 Alice
6308555498547242653 lemma_adp 567 568 Alice
6308555498547242653 lemma_adp 646 647 Alice
6308555498547242653 lemma_adp 690 691 Alice
6308555498547242653 lemma_adp 884 885 Alice
6308555498547242653 lemma_adp 1010 1011 Alice
6308555498547242653 lemma_adp 1085 1086 Alice
6308555498547242653 lemma_adp 1285 1286 Alice
6308555498547242653 lemma_adp 1380 1381 Alice


Phrase Matcher

In [198]:
from spacy.matcher import PhraseMatcher

In [200]:
matcher = PhraseMatcher(nlp.vocab, attr = 'LOWER')

In [226]:
phrase_list = ['it!' ,'she had expected',  'No, I’ll look first',"What a curious feeling!", "Curiouser and curiouser!"]

In [227]:
pattern = [nlp.make_doc(phrase) for phrase in phrase_list]

In [228]:
matcher.add("phrase_match", pattern)

In [229]:
doc = nlp(text)

In [230]:
matches = matcher(doc)

In [231]:
for match_id, start,end in matches :
  span = doc[start:end]
  print(start, end, span.text)


8578 8581 she had expected
10974 10976 it!


Entity matcher

In [233]:
from spacy.pipeline import EntityRuler
from spacy.lang.en import English

In [237]:
text = '''Just then she heard something splashing about in the pool a little way
off, and she swam nearer to make out what it was: at first she thought
it must be a walrus or hippopotamus, but then she remembered how small
she was now, and she soon made out that it was only a mouse that had
slipped in like herself.'''

In [238]:
nlp = English()
ruler = nlp.add_pipe('entity_ruler')

In [251]:
pattern = [{"label":"NOUN", "pattern":"hippopotamus"}, {"label":"ORG", "pattern":[{"lower":"then"},{"lower":{"IN":["first","she","third"]}},{"ORTH":"heard"}]}]

In [252]:
ruler.add_patterns(pattern)

In [253]:
doc = nlp(text)

In [254]:
for ent in doc.ents:
  print(ent.text, ent.label_, ent.ent_id_)

then she heard ORG 
hippopotamus NOUN 
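
`ent.ent_id_` prints as an empty string above because none of the patterns carries an `id`. A hedged sketch of the same ruler with an `id` added; the `ANIMAL` label and the `hippo` id are hypothetical names chosen for illustration:

```python
from spacy.lang.en import English

nlp_ruler = English()
ruler = nlp_ruler.add_pipe("entity_ruler")

# "id" is optional; when present it is exposed as ent.ent_id_.
ruler.add_patterns([
    {"label": "ANIMAL", "pattern": "hippopotamus", "id": "hippo"},
])

doc_ruler = nlp_ruler("it must be a walrus or hippopotamus")
found = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_ruler.ents]
print(found)
```

An `id` is handy when several surface forms should map to the same entry, since each pattern can share one identifier.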


Named Entity Recognition

In [255]:
!python -m spacy download en_core_web_lg
!pip install lxml
!pip install beautifulsoup4

Collecting en-core-web-lg==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.6.0/en_core_web_lg-3.6.0-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 2.3 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')

In [256]:
text = '''The ratio, which once saw a higher participation from engineering students, is undergoing a transformation. Two non-engineering students account for every five students at the IIMs in Ahmedabad, Kozhikode, Indore, and Kashipur. And non-engineering students make up for 30% and 17% of the overall count at IIM-Calcutta and IIM-Bangalore, respectively, the report adds.

To attract students from diverse backgrounds, IIM-Calcutta introduced an academic diversity factor for bachelor’s degree. And in the last two batches, the diversity ratio improved from 8% to around 23-25%. The institute has extended this factor for post-graduate and professional qualifications for the 2023-25 batch.'''

In [257]:
nlp = spacy.load('en_core_web_lg')

In [258]:
doc = nlp(text)

In [259]:
for ent in doc.ents:
  print(ent.text, ent.label_)

Two CARDINAL
five CARDINAL
Ahmedabad GPE
Kozhikode GPE
Indore GPE
Kashipur GPE
30% and 17% PERCENT
IIM-Calcutta ORG
IIM-Bangalore ORG
IIM-Calcutta ORG
two CARDINAL
8% PERCENT
23-25% PERCENT
2023-25 DATE


In [261]:
spacy.explain("GPE")

'Countries, cities, states'

In [262]:
from bs4 import BeautifulSoup

In [263]:
import requests
from spacy import displacy

In [264]:
url = "https://www.bbc.com/news/world-asia-66264726"

In [265]:
html_content = requests.get(url).text

In [266]:
html_content



In [267]:
soup = BeautifulSoup(html_content, 'lxml')

In [268]:
body = soup.body.text

In [269]:
body

'BBC HomepageSkip to contentAccessibility HelpYour accountHomeNewsSportReelWorklifeTravelFutureMore menuMore menuSearch BBCHomeNewsSportReelWorklifeTravelFutureCultureMusicTVWeatherSoundsClose menuBBC NewsMenuHomeWar in UkraineClimateVideoWorldUS & CanadaUKBusinessTechScienceMoreEntertainment & ArtsHealthIn PicturesBBC VerifyWorld News TVNewsbeatAsiaChinaIndiaCambodia election: Polls open in vote with no credible oppositionPublished12 hours agoShareclose panelShare pageCopy linkAbout sharingImage source, Getty ImagesImage caption, Hun Sen has ensured that his party faces no strong challenge in the pollsBy Jonathan Head, Lulu Luo & Frances MaoBBC News in Phnom Penh and SingaporeVoting is under way in Cambodia, where the country\'s long-term leader is virtually certain to extend his party\'s rule in an election where there are no serious challengers.People turning up to the polls in Phnom Penh told the BBC they expected the Cambodian People\'s Party (CPP) to sweep all 125 seats in parlia

In [270]:
doc = nlp(body)

In [271]:
for ent in doc.ents:
  print(ent.text, ent.label_)


BBC HomepageSkip ORG
accountHomeNewsSportReelWorklifeTravelFutureMore ORG
menuSearch PERSON
menuBBC NewsMenuHomeWar ORG
UkraineClimateVideoWorldUS & CanadaUKBusinessTechScienceMoreEntertainment & ORG
TVNewsbeatAsiaChinaIndiaCambodia ORG
agoShareclose panelShare PERSON
Getty ImagesImage ORG
Hun Sen PERSON
Jonathan Head PERSON
Lulu Luo & PERSON
MaoBBC News ORG
Phnom Penh GPE
Cambodia GPE
Phnom Penh GPE
BBC ORG
the Cambodian People's Party ORG
CPP ORG
125 CARDINAL
Hun Sen PERSON
38 years DATE
May DATE
"It PERSON
one CARDINAL
Phnom Penh GPE
BBC ORG
earlier this week DATE
US GPE
Asia LOC
this year DATE
Human Rights Watch ORG
May DATE
the Candlelight Party ORG
The National Election Commission ORG
last year DATE
Candlelight PERSON
22% PERCENT
last year DATE
Hun Sen PERSON
Hun Sen PERSON
Sunday DATE
morning TIME
Hun Manet PERSON
weeks DATE
CPP ORG
Hun Sen PERSON
Earlier this year DATE
Kem Sokha PERSON
27 years DATE
Voice of Democracy ORG
second ORDINAL
Hun Sen PERSON
voting day DATE
2018 DATE
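
Entities such as `BBC HomepageSkip` in the output above come from `soup.body.text`, which concatenates the text of adjacent elements with no separator, so menu labels fuse into one long token. Passing a separator to `get_text` keeps element boundaries visible. A minimal sketch on a hypothetical HTML fragment, using the built-in `html.parser` so that `lxml` is not required:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for a page with a navigation bar.
html = "<body><nav><a>Home</a><a>News</a></nav><p>Voting is under way.</p></body>"
soup_demo = BeautifulSoup(html, "html.parser")

glued = soup_demo.body.text                                   # no separator between elements
spaced = soup_demo.body.get_text(separator=" ", strip=True)   # one space between text fragments

print(repr(glued))
print(repr(spaced))
```

This does not remove navigation text; it only keeps words apart so that the tokenizer and the entity recognizer see them separately.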


In [272]:
displacy.render(doc, style='ent', jupyter=True)

Word to Vector (word2vec) and Sentence Similarity

In [273]:
doc = nlp('dog cat bird artreffgd')

In [274]:
for token in doc:
  print(token.text, token.has_vector, token.is_oov)

dog True False
cat True False
bird True False
artreffgd False True


In [275]:
text1 = "i like peri-peri fries and cheese burst pizza"
text2= "junk food tastes amazing"

In [276]:
doc1=nlp(text1)
doc2=nlp(text2)

In [278]:
print(doc1, "<>" , doc2, doc1.similarity(doc2))

i like peri-peri fries and cheese burst pizza <> junk food tastes amazing 0.5240134683422039


In [279]:
text1 = "river bank"
text2= "bank account"

doc1=nlp(text1)
doc2=nlp(text2)

print(doc1, "<>" , doc2, doc1.similarity(doc2))


river bank <> bank account 0.7595661548019256


In [280]:
text1 = "Saloni is my friend"
text2= "my friend lives in agra"

doc1=nlp(text1)
doc2=nlp(text2)

print(doc1, "<>" , doc2, doc1.similarity(doc2))


Saloni is my friend <> my friend lives in agra 0.6314669430119215
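
For pipelines with word vectors such as `en_core_web_lg`, `Doc.similarity` by default compares the documents' vectors with cosine similarity, and a document's vector is the average of its token vectors. That explains why `river bank` and `bank account` score highly even though the two senses of `bank` differ: the shared token pulls both averages in the same direction. A toy illustration with hypothetical 3-dimensional vectors (real vectors have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise average, mimicking how a Doc vector is built from token vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Hypothetical, mutually orthogonal word vectors: the words share nothing.
toy = {
    "river": [1.0, 0.0, 0.0],
    "bank": [0.0, 1.0, 0.0],
    "account": [0.0, 0.0, 1.0],
}

river_bank = mean_vector([toy["river"], toy["bank"]])
bank_account = mean_vector([toy["bank"], toy["account"]])
toy_similarity = cosine(river_bank, bank_account)

print(round(toy_similarity, 3))  # prints: 0.5
```

Even with completely unrelated word vectors, sharing one of two tokens yields a similarity of 0.5, so averaged vectors cannot separate the two meanings of `bank`.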
