<h2 align="center">spaCy Language Processing Pipelines Tutorial</h2>

<h3>Blank nlp pipeline</h3>

In [1]:
pip install spacy

Note: you may need to restart the kernel to use updated packages.


In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

In [3]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [4]:
import spacy

nlp = spacy.blank("en")

doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
    print(token)

Captain
america
ate
100
$
of
samosa
.
Then
he
said
I
can
do
this
all
day
.


The output above contains only tokens, with no further analysis, because we have a blank pipeline as shown below. A pipeline starts with a tokenizer, followed by the components shown in the dotted rectangle below. The rectangle is empty here, hence the blank pipeline.

<img height=300 width=400 src="spacy_blank_pipeline.jpg" />

In [5]:
nlp.pipe_names

[]

nlp.pipe_names is an empty list, which indicates that the pipeline has no components. The tokenizer always runs first and is not listed in pipe_names.
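
As a minimal sketch of what a blank pipeline means in practice, the snippet below checks that the tokenizer still splits the text while no later component has filled in part-of-speech tags or entities. It assumes spaCy v3 and reuses the tutorial's sentence; no trained pipeline is needed.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
doc = nlp("Captain america ate 100$ of samosa.")

tokens = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

print(nlp.pipe_names)                 # empty: nothing runs after the tokenizer
print(type(nlp.tokenizer).__name__)   # the tokenizer lives outside pipe_names
print(tokens)
print(pos_tags)                       # empty strings: no tagger has run
print(doc.ents)                       # empty: no entity recognizer has run
```

Attributes such as pos_ stay empty until a component that sets them is added to the pipeline.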

A more general NLP pipeline may look like the diagram below.

<img height=300 width=400 src="spacy_loaded_pipeline.jpg" />

<h3>Download trained pipeline</h3>

To download a trained pipeline, use a command such as:

python -m spacy download en_core_web_sm

This downloads the small (sm) pipeline for the English language.

Further instructions: https://spacy.io/usage/models#quickstart

https://www.debug.school/rakeshdevcotocus_468/automating-tasks-by-processing-text-using-nlp-pipeline-1kj4

In [6]:
#different kind of pipname
#Token Information like pos ,their explanation ,lemma(base form)
#This part identifies named entities (like company names or amounts) in the sentence.
#Use a Sample Text and List Detected Entity Types
#Get All Entity Types in spaCy’s Language Model
# Visualization with displacy like visually highlights named entities in the text.
# A short explanation of the entity type.
#Custom Pipeline and Entity Recognition
#Adding a component to a blank pipeline

In [7]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [8]:
# nlp.pipeline returns (name, component) pairs instead of just the names
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x25fcda47fb0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x25fcda47bf0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x25fcda4cf90>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x25fcdd72250>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x25fcdd41fd0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x25fcda4cdd0>)]

The sm in en_core_web_sm means small. Other sizes are available as well, such as medium (md) and large (lg). See: https://spacy.io/usage/models#quickstart

In [None]:
#Token Information like pos ,their explanation ,lemma(base form)

In [9]:
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
    print(token, " | ", spacy.explain(token.pos_), " | ", token.lemma_)

Captain  |  proper noun  |  Captain
america  |  proper noun  |  america
ate  |  verb  |  eat
100  |  numeral  |  100
$  |  numeral  |  $
of  |  adposition  |  of
samosa  |  proper noun  |  samosa
.  |  punctuation  |  .
Then  |  adverb  |  then
he  |  pronoun  |  he
said  |  verb  |  say
I  |  pronoun  |  I
can  |  auxiliary  |  can
do  |  verb  |  do
this  |  pronoun  |  this
all  |  determiner  |  all
day  |  noun  |  day
.  |  punctuation  |  .


**Run the same code with a blank pipeline and check what output you see.**

<h3>Named Entity Recognition</h3>

In [None]:
#This part identifies named entities (like company names or amounts) in the sentence.

In [10]:
#Use a Sample Text and List Detected Entity Types
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample text
doc = nlp("Google, based in California, was founded by Larry Page and Sergey Brin. It is worth over $1 trillion.")

# Extract and print unique entity types
unique_entity_labels = set(ent.label_ for ent in doc.ents)
print("Detected Entity Types:", unique_entity_labels)

Detected Entity Types: {'PERSON', 'ORG', 'GPE', 'MONEY'}


In [11]:
#Get All Entity Types in spaCy’s Language Model
print("Possible Named Entities:")
for label in nlp.get_pipe("ner").labels:
    print(label, ":", spacy.explain(label))

Possible Named Entities:
CARDINAL : Numerals that do not fall under another type
DATE : Absolute or relative dates or periods
EVENT : Named hurricanes, battles, wars, sports events, etc.
FAC : Buildings, airports, highways, bridges, etc.
GPE : Countries, cities, states
LANGUAGE : Any named language
LAW : Named documents made into laws.
LOC : Non-GPE locations, mountain ranges, bodies of water
MONEY : Monetary values, including unit
NORP : Nationalities or religious or political groups
ORDINAL : "first", "second", etc.
ORG : Companies, agencies, institutions, etc.
PERCENT : Percentage, including "%"
PERSON : People, including fictional
PRODUCT : Objects, vehicles, foods, etc. (not services)
QUANTITY : Measurements, as of weight or distance
TIME : Times smaller than a day
WORK_OF_ART : Titles of books, songs, etc.


In [12]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Tesla Inc ORG
$45 billion MONEY


In [13]:
from spacy import displacy

displacy.render(doc, style="ent")

<h3>Trained processing pipeline in French</h3>

In [14]:
!python -m spacy download fr_core_news_sm

Collecting fr-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.8.0/fr_core_news_sm-3.8.0-py3-none-any.whl (16.3 MB)

In [15]:
nlp = spacy.load("fr_core_news_sm")

You need to install the processing pipeline for the French language using this command:

python -m spacy download fr_core_news_sm

In [16]:
nlp = spacy.load("fr_core_news_sm")

In [17]:
doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  PER  |  Named person or family.
Twitter  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art


In [18]:
for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Tesla  |  PROPN  |  Tesla
Inc  |  PROPN  |  Inc
va  |  VERB  |  aller
racheter  |  VERB  |  racheter
Twitter  |  VERB  |  twitter
pour  |  ADP  |  pour
$  |  NOUN  |  dollar
45  |  NUM  |  45
milliards  |  NOUN  |  milliard
de  |  ADP  |  de
dollars  |  NOUN  |  dollar


<h3>Adding a component to a blank pipeline</h3>

In [19]:
source_nlp = spacy.load("en_core_web_sm")

nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
nlp.pipe_names

['ner']

In [20]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Tesla Inc ORG
$45 billion MONEY


In [21]:
for token in doc:
    print(token.text, " | ", token.dep_, " | ", token.head.text)

Tesla  |    |  Tesla
Inc  |    |  Inc
is  |    |  is
going  |    |  going
to  |    |  to
acquire  |    |  acquire
twitter  |    |  twitter
for  |    |  for
$  |    |  $
45  |    |  45
billion  |    |  billion


In [22]:
for token in doc:
    print(token.text, " | ", token.is_stop)

Tesla  |  False
Inc  |  False
is  |  True
going  |  False
to  |  True
acquire  |  False
twitter  |  False
for  |  True
$  |  False
45  |  False
billion  |  False
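
Components do not have to come from a trained pipeline. As a hedged sketch, the snippet below adds spaCy's built-in rule-based entity_ruler to a blank pipeline. The two patterns are illustrative assumptions written for this one sentence and are not part of the original notebook.

```python
import spacy

nlp = spacy.blank("en")

# Rule-based entity component; it needs no trained weights
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Phrase pattern: exact match on the text
    {"label": "ORG", "pattern": "Tesla Inc"},
    # Token pattern: "$", a number-like token, then "billion"
    {"label": "MONEY", "pattern": [{"TEXT": "$"}, {"LIKE_NUM": True}, {"LOWER": "billion"}]},
])

doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(nlp.pipe_names)
print(entities)
```

Unlike the sourced ner component, these rules only find what the patterns describe, so they will not generalize to other company names or amounts.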


In [24]:
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("sentencizer")  # Add the sentencizer to the pipeline

doc = nlp("This is the first sentence. This is the second sentence.")
for sent in doc.sents:
    print(sent)

This is the first sentence.
This is the second sentence.
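
The sentencizer is rule-based, so it also works in a blank pipeline that has no parser. The following is a minimal sketch, assuming spaCy v3. Since en_core_web_sm already gets sentence boundaries from its parser, the effect of the sentencizer is easier to see in a blank pipeline.

```python
import spacy

text = "This is the first sentence. This is the second sentence."

nlp = spacy.blank("en")
# Without a parser or sentencizer, no sentence boundaries are set
has_boundaries_before = nlp(text).has_annotation("SENT_START")

nlp.add_pipe("sentencizer")  # punctuation-based sentence splitter
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]

print(has_boundaries_before)
print(doc.has_annotation("SENT_START"))
print(sentences)
```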


In [26]:
import spacy
from spacy.language import Language

# Define and register the custom component
@Language.component("custom_component")
def custom_component(doc):
    print("Custom component")
    return doc

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Add the component to the pipeline using its name
nlp.add_pipe("custom_component", last=True)

# Test the pipeline
doc = nlp("This is a test sentence.")

Custom component
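
A custom component can also store a result on the Doc and be placed at a specific position in the pipeline. The sketch below is one possible illustration, assuming spaCy v3. The component name token_counter and the extension attribute n_tokens are hypothetical names chosen for this example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Custom attribute to hold the result; force=True allows re-running the cell
Doc.set_extension("n_tokens", default=0, force=True)

# "token_counter" is a hypothetical name chosen for this example
@Language.component("token_counter")
def token_counter(doc):
    doc._.n_tokens = len(doc)  # store the token count on the Doc
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("token_counter", first=True)  # run before the sentencizer

doc = nlp("This is a test sentence.")
print(nlp.pipe_names)
print(doc._.n_tokens)
```

Besides first and last, add_pipe also accepts before and after with the name of an existing component.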


In [27]:
doc1 = nlp("cat")
doc2 = nlp("dog")
print(doc1.similarity(doc2))

Custom component
Custom component
0.7422727346420288



In [28]:
for token in doc:
    print(token.text, " | ", token.vector[:5])  # First 5 values

This  |  [-0.32963437 -0.49875662  0.39574748  0.75809145  0.24849215]
is  |  [ 0.8627015   0.79238445 -0.02256067 -0.21371616  0.97954464]
a  |  [ 1.4822263   0.11548758  1.4996794   0.77538466 -0.64772826]
test  |  [ 0.28101185 -0.47633532  1.8650059   0.46909714 -0.1305256 ]
sentence  |  [-0.7220636  -0.30611536  0.2598266  -0.42837867  0.8273482 ]
.  |  [-0.94648755 -0.3767298  -1.0847697  -0.5486675  -0.6191116 ]


In [29]:
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'custom_component']


In [30]:
texts = ["Tesla Inc is acquiring Twitter.", "Microsoft bought GitHub."]
for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])

Custom component
[('Tesla Inc', 'ORG'), ('Twitter', 'PERSON')]
Custom component
[('Microsoft', 'ORG'), ('GitHub', 'PRODUCT')]
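
nlp.pipe can also carry metadata alongside each text. The sketch below assumes spaCy v3 and uses a blank pipeline so that it runs without a trained model. The id values are made-up example metadata.

```python
import spacy

nlp = spacy.blank("en")

# (text, context) pairs; the ids are made-up example metadata
data = [
    ("Tesla Inc is acquiring Twitter.", {"id": 1}),
    ("Microsoft bought GitHub.", {"id": 2}),
]

results = []
# as_tuples=True yields (doc, context) pairs in the input order
for doc, context in nlp.pipe(data, as_tuples=True, batch_size=2):
    results.append((context["id"], len(doc)))

print(results)
```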


In [35]:
pip install spacy-langdetect


Note: you may need to restart the kernel to use updated packages.


In [37]:
from spacy_langdetect import LanguageDetector
import spacy
from spacy.language import Language

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Register the LanguageDetector component with a factory name
@Language.factory("language_detector")
def create_language_detector(nlp, name):
    return LanguageDetector()

# Add the language detector to the pipeline by using its factory name
nlp.add_pipe("language_detector", last=True)

# Test the language detection
doc = nlp("Bonjour tout le monde")
print(doc._.language)  # Output the detected language

{'language': 'fr', 'score': 0.999996474594318}


In the image below you can see the sentencizer component in the pipeline.

<img height=300 width=400 src="sentecizer.jpg" />

<h3>Further reading</h3>

https://spacy.io/usage/processing-pipelines#pipelines