# Natural Language Processing using spaCy

- Pipelining:

    - Get all the available pipeline factory options
    - How to disable preloaded pipeline components to reduce processing time
    - Adding custom pipeline components

- Reading a file and displaying its entities

In [311]:
import spacy as sp
from spacy import displacy # used for data visualization
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.attrs import ORTH # to be used for word count
from spacy.language import Language

In [312]:
sp.__version__

'3.2.1'

In [345]:
# Load the small English model.
# ref: https://spacy.io/models/en
# If the model is not installed yet, download it first:
# !python -m spacy download en_core_web_sm
nlp = sp.load("en_core_web_sm")

In [315]:
txt = """Google LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware. It is considered one of the Big Five companies in the American information technology industry, along with Amazon, Apple, Meta (Facebook) and Microsoft.[10]

Google was founded on September 4, 1998, by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14% of its publicly-listed shares and control 56% of the stockholder voting power through super-voting stock. The company went public via an initial public offering (IPO) in 2004. In 2015, Google was reorganized as a wholly-owned subsidiary of Alphabet Inc.. Google is Alphabet's largest subsidiary and is a holding company for Alphabet's Internet properties and interests. Sundar Pichai was appointed CEO of Google on October 24, 2015, replacing Larry Page, who became the CEO of Alphabet. On December 3, 2019, Pichai also became the CEO of Alphabet"""

In [316]:
print(txt)

Google LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware. It is considered one of the Big Five companies in the American information technology industry, along with Amazon, Apple, Meta (Facebook) and Microsoft.[10]

Google was founded on September 4, 1998, by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14% of its publicly-listed shares and control 56% of the stockholder voting power through super-voting stock. The company went public via an initial public offering (IPO) in 2004. In 2015, Google was reorganized as a wholly-owned subsidiary of Alphabet Inc.. Google is Alphabet's largest subsidiary and is a holding company for Alphabet's Internet properties and interests. Sundar Pichai was appointed CEO of Google on October 24, 2015, replacing Larry P

# Pipelining:

- Get all the available pipeline factory options
- How to disable preloaded pipeline components to reduce processing time
- Adding custom pipeline components

In [317]:
nlp.pipe_factories

{'tok2vec': 'tok2vec',
 'tagger': 'tagger',
 'parser': 'parser',
 'senter': 'senter',
 'attribute_ruler': 'attribute_ruler',
 'lemmatizer': 'lemmatizer',
 'ner': 'ner'}
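
`nlp.pipe_factories` only covers components that are already part of the loaded pipeline. To see every factory that could be added, `nlp.factory_names` is the property to look at. A minimal sketch, assuming a blank English pipeline (`spacy.blank("en")`) so that it runs without downloading a model; the variable names are illustrative only:

```python
import spacy

# Assumption: a blank English pipeline stands in for the loaded model.
nlp = spacy.blank("en")

# Every factory registered for this language, whether or not it is in the pipeline.
available = sorted(nlp.factory_names)
print(available)

# Components actually in the pipeline, mapped to the factory that built them.
# A blank pipeline has none, so this is empty.
in_pipeline = dict(nlp.pipe_factories)
print(in_pipeline)
```

Any name in `available` can be passed to `nlp.add_pipe(...)`.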

In [320]:
nlp.get_pipe("ner")

<spacy.pipeline.ner.EntityRecognizer at 0x7ff3c101ab30>

In [321]:
nlp.pipe_labels

{'tok2vec': [],
 'tagger': ['$',
  "''",
  ',',
  '-LRB-',
  '-RRB-',
  '.',
  ':',
  'ADD',
  'AFX',
  'CC',
  'CD',
  'DT',
  'EX',
  'FW',
  'HYPH',
  'IN',
  'JJ',
  'JJR',
  'JJS',
  'LS',
  'MD',
  'NFP',
  'NN',
  'NNP',
  'NNPS',
  'NNS',
  'PDT',
  'POS',
  'PRP',
  'PRP$',
  'RB',
  'RBR',
  'RBS',
  'RP',
  'SYM',
  'TO',
  'UH',
  'VB',
  'VBD',
  'VBG',
  'VBN',
  'VBP',
  'VBZ',
  'WDT',
  'WP',
  'WP$',
  'WRB',
  'XX',
  '``'],
 'parser': ['ROOT',
  'acl',
  'acomp',
  'advcl',
  'advmod',
  'agent',
  'amod',
  'appos',
  'attr',
  'aux',
  'auxpass',
  'case',
  'cc',
  'ccomp',
  'compound',
  'conj',
  'csubj',
  'csubjpass',
  'dative',
  'dep',
  'det',
  'dobj',
  'expl',
  'intj',
  'mark',
  'meta',
  'neg',
  'nmod',
  'npadvmod',
  'nsubj',
  'nsubjpass',
  'nummod',
  'oprd',
  'parataxis',
  'pcomp',
  'pobj',
  'poss',
  'preconj',
  'predet',
  'prep',
  'prt',
  'punct',
  'quantmod',
  'relcl',
  'xcomp'],
 'senter': ['I', 'S'],
 'attribute_ruler': [],


In [322]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

# Add custom pipeline

In [326]:
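@Language.component("upperCase")
def upperCase(doc):
    # Upper-case every token and join the tokens with single spaces.
    op = [w.text.upper() for w in doc]
    op = " ".join(op)
    # make_doc() only tokenizes, so the returned Doc is new and carries none of
    # the annotations (tags, parse, entities) set by earlier components.
    final_op = nlp.make_doc(op)
    return final_op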

In [327]:
nlp.add_pipe("upperCase")

<function __main__.upperCase(doc)>

In [328]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'upperCase']

# Check the added pipeline functionality

In [329]:
txt

"Google LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware. It is considered one of the Big Five companies in the American information technology industry, along with Amazon, Apple, Meta (Facebook) and Microsoft.[10]\n\nGoogle was founded on September 4, 1998, by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14% of its publicly-listed shares and control 56% of the stockholder voting power through super-voting stock. The company went public via an initial public offering (IPO) in 2004. In 2015, Google was reorganized as a wholly-owned subsidiary of Alphabet Inc.. Google is Alphabet's largest subsidiary and is a holding company for Alphabet's Internet properties and interests. Sundar Pichai was appointed CEO of Google on October 24, 2015, replacing Larr

In [330]:
obj = nlp(txt)
obj

GOOGLE LLC IS AN AMERICAN MULTINATIONAL TECHNOLOGY COMPANY THAT SPECIALIZES IN INTERNET - RELATED SERVICES AND PRODUCTS , WHICH INCLUDE ONLINE ADVERTISING TECHNOLOGIES , A SEARCH ENGINE , CLOUD COMPUTING , SOFTWARE , AND HARDWARE . IT IS CONSIDERED ONE OF THE BIG FIVE COMPANIES IN THE AMERICAN INFORMATION TECHNOLOGY INDUSTRY , ALONG WITH AMAZON , APPLE , META ( FACEBOOK ) AND MICROSOFT.[10 ] 

 GOOGLE WAS FOUNDED ON SEPTEMBER 4 , 1998 , BY LARRY PAGE AND SERGEY BRIN WHILE THEY WERE PH.D. STUDENTS AT STANFORD UNIVERSITY IN CALIFORNIA . TOGETHER THEY OWN ABOUT 14 % OF ITS PUBLICLY - LISTED SHARES AND CONTROL 56 % OF THE STOCKHOLDER VOTING POWER THROUGH SUPER - VOTING STOCK . THE COMPANY WENT PUBLIC VIA AN INITIAL PUBLIC OFFERING ( IPO ) IN 2004 . IN 2015 , GOOGLE WAS REORGANIZED AS A WHOLLY - OWNED SUBSIDIARY OF ALPHABET INC .. GOOGLE IS ALPHABET 'S LARGEST SUBSIDIARY AND IS A HOLDING COMPANY FOR ALPHABET 'S INTERNET PROPERTIES AND INTERESTS . SUNDAR PICHAI WAS APPOINTED CEO OF GOOGLE ON
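
The output above is a brand-new `Doc`, so whatever the earlier components annotated is gone and the original spacing is replaced by single spaces between tokens. One possible alternative, sketched here under assumptions, is to keep the `Doc` untouched and store the upper-cased text in a custom extension attribute. The names `upper_text` and `upper_text_sketch` are hypothetical, and a blank English pipeline is used so the sketch runs without a model download:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical extension attribute; force=True allows re-running the cell.
Doc.set_extension("upper_text", default=None, force=True)


@Language.component("upper_text_sketch")
def upper_text_sketch(doc):
    # Store the upper-cased text on the Doc instead of replacing the Doc,
    # so annotations from earlier components are preserved.
    doc._.upper_text = doc.text.upper()
    return doc


# Assumption: a blank pipeline stands in for en_core_web_sm.
nlp = spacy.blank("en")
nlp.add_pipe("upper_text_sketch", last=True)

doc = nlp("Google was founded in 1998.")
print(doc.text)
print(doc._.upper_text)
```

Because the component returns the same `Doc` it received, it can sit anywhere in the pipeline without affecting the components around it.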

# Remove pipeline

In [331]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'upperCase']

In [336]:
nlp.remove_pipe("ner")

('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7ff3adddbeb0>)

In [337]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'upperCase']

# Add it back

In [338]:
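# source= copies the trained "ner"; add_pipe("ner") alone would create an untrained one.
nlp.add_pipe("ner", source=sp.load("en_core_web_sm"), before="upperCase")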

<spacy.pipeline.ner.EntityRecognizer at 0x7ff3ade32430>

In [339]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'upperCase']
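
The remove-and-restore round trip can be sketched in isolation. `remove_pipe` returns a `(name, component)` tuple, and `add_pipe(..., source=...)` copies a component from another pipeline rather than building a fresh one from its factory. This is only a sketch: it assumes two blank English pipelines and uses the rule-based `sentencizer` as a stand-in for a trained component such as `ner`, and `source_nlp` is an illustrative name:

```python
import spacy

# Assumption: a second pipeline that holds the component we want to copy.
source_nlp = spacy.blank("en")
source_nlp.add_pipe("sentencizer")

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer", source=source_nlp)

# remove_pipe returns the name and the removed component object.
name, component = nlp.remove_pipe("sentencizer")
after_removal = list(nlp.pipe_names)

# Restore it by copying from the source pipeline again.
nlp.add_pipe(name, source=source_nlp)
after_restore = list(nlp.pipe_names)

print(after_removal, after_restore)
```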

# Disable pipeline temporarily using context manager

In [342]:
# select_pipes() replaces disable_pipes(), which is deprecated in spaCy v3.
with nlp.select_pipes(disable=["ner", "upperCase"]):
    obj = nlp(txt)
    print(obj)
    print("Within Context::", nlp.pipe_names)
print("Outside Context::", nlp.pipe_names)

Google LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware. It is considered one of the Big Five companies in the American information technology industry, along with Amazon, Apple, Meta (Facebook) and Microsoft.[10]

Google was founded on September 4, 1998, by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14% of its publicly-listed shares and control 56% of the stockholder voting power through super-voting stock. The company went public via an initial public offering (IPO) in 2004. In 2015, Google was reorganized as a wholly-owned subsidiary of Alphabet Inc.. Google is Alphabet's largest subsidiary and is a holding company for Alphabet's Internet properties and interests. Sundar Pichai was appointed CEO of Google on October 24, 2015, replacing Larry P
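
Components can be switched off either for the duration of a block or until they are explicitly re-enabled. The sketch below shows both, assuming a blank English pipeline with a single `sentencizer` component so that it runs without a model; the variable names are illustrative:

```python
import spacy

# Assumption: a blank pipeline with one rule-based component.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Temporary: the component is skipped only inside the with-block.
with nlp.select_pipes(disable=["sentencizer"]):
    inside = list(nlp.pipe_names)
outside = list(nlp.pipe_names)

# Persistent: the component stays disabled until enable_pipe() is called.
nlp.disable_pipe("sentencizer")
disabled_now = list(nlp.disabled)
nlp.enable_pipe("sentencizer")

print(inside, outside, disabled_now)
```

When a component is never needed, it can also be skipped at load time, for example `spacy.load("en_core_web_sm", disable=["ner"])`, which avoids running it on every document.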

# Reading a file and displaying entities

In [347]:
# Read the file inside a context manager so the handle is closed afterwards.
with open("NER_text.txt") as fo:
    obj = nlp(fo.read())
displacy.render(obj, style="ent")
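
Besides the visual rendering, the recognized entities are available programmatically through `doc.ents`. The sketch below is an illustration under assumptions: to stay runnable without the model download, it uses a blank pipeline with a rule-based `entity_ruler` and two hand-written patterns in place of the statistical `ner` component, and a short inline sentence in place of `NER_text.txt`. The same loop over `doc.ents` should apply to a `Doc` produced by `en_core_web_sm`.

```python
import spacy
from spacy import displacy

# Assumption: rule-based stand-in for the trained "ner" component.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Google"},
    {"label": "PERSON", "pattern": "Larry Page"},
])

doc = nlp("Google was founded by Larry Page and Sergey Brin.")

# Each entity span exposes its text and its label.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)

# Outside a notebook, jupyter=False makes render() return the HTML markup.
html = displacy.render(doc, style="ent", jupyter=False)
```

The returned `html` string can be written to a file and opened in a browser when no notebook is available.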