# spaCy

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')    # load the small English language model

Calling the loaded `nlp` object on an input string creates a processed Doc object, which is a container for accessing linguistic annotations.

In [7]:
introduction_text = ('This space is about Natural'     
                     ' Language Processing in Spacy.')
introduction_doc = nlp(introduction_text)
#Extract tokens for the given doc
print([token.text for token in introduction_doc])

['This', 'space', 'is', 'about', 'Natural', 'Language', 'Processing', 'in', 'Spacy', '.']
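
To see what the tokenizer adds over plain whitespace splitting, the sketch below applies Python's built-in `str.split` to the same string. It does not use spaCy at all; it is only a baseline for comparison.

```python
introduction_text = ('This space is about Natural'
                     ' Language Processing in Spacy.')

# Baseline for comparison only: split on whitespace, no linguistic rules.
whitespace_tokens = introduction_text.split()
print(whitespace_tokens)
print(len(whitespace_tokens))
```

Here the final period stays attached to `Spacy.`, so the list has 9 items, whereas the Doc above yields 10 tokens with the period as a token of its own.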


In [2]:
about_text = ('Apple Inc. is an American multinational technology company headquartered in Cupertino, California, '
            'that designs, develops, and sells consumer electronics, computer software, and online services. '
            'It is considered one of the Big Four technology companies, alongside Amazon, Google, and Microsoft. '
            'It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976')

Sentence detection is the process of locating the start and end of each sentence in a given text.

In [3]:
about_doc = nlp(about_text)
sentences = list(about_doc.sents)
for sentence in sentences:
    print(sentence)
len(sentences)

Apple Inc. is an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, and online services.
It is considered one of the Big Four technology companies, alongside Amazon, Google, and Microsoft.
It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976


3
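
For contrast, here is a minimal sketch of a naive sentence splitter that does not use spaCy. The rule (a sentence ends at any period followed by whitespace) and the `sample_text` string are assumptions made for illustration.

```python
import re

sample_text = "Apple Inc. is a technology company. It was founded in 1976."

# Naive rule (an assumption for illustration): a sentence ends at any
# period that is followed by whitespace.
naive_sentences = re.split(r"(?<=\.)\s+", sample_text)
for sentence in naive_sentences:
    print(sentence)
```

The naive rule breaks after the abbreviation `Inc.` and returns three pieces for two sentences, while the spaCy output above keeps `Apple Inc. is an American ...` together as one sentence.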

In [29]:
token_dict={}
for token in about_doc:
    token_dict[token] = token.idx     # spaCy preserves the starting character index of each token, which is helpful for in-place replacement
token_dict

{Apple: 0,
 Inc.: 6,
 is: 11,
 an: 14,
 American: 17,
 multinational: 26,
 technology: 40,
 company: 51,
 headquartered: 59,
 in: 73,
 Cupertino: 76,
 ,: 85,
 California: 87,
 ,: 97,
 that: 99,
 designs: 104,
 ,: 111,
 develops: 113,
 ,: 121,
 and: 123,
 sells: 127,
 consumer: 133,
 electronics: 142,
 ,: 153,
 computer: 155,
 software: 164,
 ,: 172,
 and: 174,
 online: 178,
 services: 185,
 .: 193,
 It: 195,
 is: 198,
 considered: 201,
 one: 212,
 of: 216,
 the: 219,
 Big: 223,
 Four: 227,
 technology: 232,
 companies: 243,
 ,: 252,
 alongside: 254,
 Amazon: 264,
 ,: 270,
 Google: 272,
 ,: 278,
 and: 280,
 Microsoft: 284,
 .: 293,
 It: 295,
 was: 298,
 founded: 302,
 by: 310,
 Steve: 313,
 Jobs: 319,
 ,: 323,
 Steve: 325,
 Wozniak: 331,
 ,: 338,
 and: 340,
 Ronald: 344,
 Wayne: 351,
 in: 357,
 April: 360,
 1976: 366}
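
The in-place replacement mentioned in the comment above could look roughly like the following sketch. It is plain Python: the `replace_at_offsets` helper and the shortened `text` are hypothetical, and the `(offset, old, new)` triples stand in for values that would come from `token.idx` and `token.text`.

```python
def replace_at_offsets(text, replacements):
    """Replace spans in `text` given (start_index, old, new) triples."""
    # Work from the rightmost span to the leftmost so that earlier
    # offsets stay valid even when a substitution changes the length.
    for start, old, new in sorted(replacements, reverse=True):
        end = start + len(old)
        if text[start:end] != old:
            raise ValueError(f"expected {old!r} at offset {start}")
        text = text[:start] + new + text[end:]
    return text


text = "Apple Inc. is headquartered in Cupertino"
# Hypothetical replacements; each offset plays the role of token.idx.
redactions = [(0, "Apple", "[ORG]"), (31, "Cupertino", "[GPE]")]
redacted = replace_at_offsets(text, redactions)
print(redacted)
```

Processing the spans from right to left is the design choice that matters: replacing `Apple` first would shift every later offset by the difference in length.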

In [39]:
for token in about_doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop, spacy.explain(token.tag_))

Apple Apple PROPN NNP compound Xxxxx True False noun, proper singular
Inc. Inc. PROPN NNP nsubj Xxx. False False noun, proper singular
is be AUX VBZ ROOT xx True True verb, 3rd person singular present
an an DET DT det xx True True determiner
American american ADJ JJ amod Xxxxx True False adjective
multinational multinational ADJ JJ amod xxxx True False adjective
technology technology NOUN NN compound xxxx True False noun, singular or mass
company company NOUN NN attr xxxx True False noun, singular or mass
headquartered headquarter VERB VBN acl xxxx True False verb, past participle
in in ADP IN prep xx True True conjunction, subordinating or preposition
Cupertino Cupertino PROPN NNP pobj Xxxxx True False noun, proper singular
, , PUNCT , punct , False False punctuation mark, comma
California California PROPN NNP appos Xxxxx True False noun, proper singular
, , PUNCT , punct , False False punctuation mark, comma
that that SCONJ IN det xxxx True True conjunction, subordinating or preposit
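
The `shape_` column above (`Xxxxx`, `Xxx.`, `xx`) can be read as a character-class pattern. The sketch below is an approximation written to reproduce the values printed above, not spaCy's own implementation; the `word_shape` function and its `max_run` parameter are hypothetical names.

```python
def word_shape(text, max_run=4):
    """Map uppercase to X, lowercase to x, digits to d; keep other chars.

    Runs of the same symbol are cut off after `max_run` repetitions.
    """
    shape = []
    last_symbol = ""
    run_length = 0
    for ch in text:
        if ch.isalpha():
            symbol = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch
        run_length = run_length + 1 if symbol == last_symbol else 1
        last_symbol = symbol
        if run_length <= max_run:
            shape.append(symbol)
    return "".join(shape)


for word in ["Apple", "Inc.", "is", "multinational", ","]:
    print(word, word_shape(word))
```

The truncation of long runs is why `multinational` maps to `xxxx` rather than thirteen `x` characters.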

In [38]:
from collections import Counter
words = [token.text for token in about_doc
          if not token.is_stop and not token.is_punct]
word_freq = Counter(words)
common_words = word_freq.most_common(5)   # the 5 most common words and their frequencies
print(common_words)
unique_words = [word for (word, freq) in word_freq.items() if freq == 1]
print(unique_words)

[('technology', 2), ('Steve', 2), ('Apple', 1), ('Inc.', 1), ('American', 1)]
['Apple', 'Inc.', 'American', 'multinational', 'company', 'headquartered', 'Cupertino', 'California', 'designs', 'develops', 'sells', 'consumer', 'electronics', 'computer', 'software', 'online', 'services', 'considered', 'Big', 'companies', 'alongside', 'Amazon', 'Google', 'Microsoft', 'founded', 'Jobs', 'Wozniak', 'Ronald', 'Wayne', 'April', '1976']
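
The counts above are keyed on `token.text`, so they are case-sensitive. The sketch below shows the effect with a hypothetical list of strings standing in for the token texts; with spaCy, `token.lower_` or `token.lemma_` could serve as the key instead.

```python
from collections import Counter

# Hypothetical token texts standing in for [token.text for token in doc]
sample_words = ["Technology", "technology", "Steve", "Steve", "company"]

case_sensitive = Counter(sample_words)
case_insensitive = Counter(word.lower() for word in sample_words)

print(case_sensitive["technology"])    # counts only the lowercase spelling
print(case_insensitive["technology"])  # counts both spellings together
```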


In [40]:
from spacy import displacy
displacy.serve(about_doc, style='dep')



Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...



127.0.0.1 - - [31/May/2020 00:40:10] "GET / HTTP/1.1" 200 43443
127.0.0.1 - - [31/May/2020 00:40:12] "GET /favicon.ico HTTP/1.1" 200 43443


Shutting down server on port 5000.


Dependency parsing extracts the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationships between headwords and their dependents.
Graphically, the words are the nodes and the grammatical relationships are the edges.

In [51]:
plain_text = 'Tuba is learning Python'
plain_doc = nlp(plain_text)
for token in plain_doc:
    print(token.text, token.tag_, token.head.text, token.dep_)

Tuba NNP learning nsubj
is VBZ learning aux
learning VBG learning ROOT
Python NNP learning dobj
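
The rows printed above already describe a small graph. As a minimal sketch in plain Python, they can be regrouped into a head-to-dependents mapping; the `rows` list is copied from that output, and the `children` and `root` names are only illustrative.

```python
from collections import defaultdict

# (token, tag, head, dependency label) rows copied from the output above
rows = [
    ("Tuba", "NNP", "learning", "nsubj"),
    ("is", "VBZ", "learning", "aux"),
    ("learning", "VBG", "learning", "ROOT"),
    ("Python", "NNP", "learning", "dobj"),
]

children = defaultdict(list)  # head word -> [(label, dependent word)]
root = None
for text, _tag, head, dep in rows:
    if dep == "ROOT":
        root = text           # the root token is its own head
    else:
        children[head].append((dep, text))

print(root)
for dep, child in children[root]:
    print(f"  {root} --{dep}--> {child}")
```

Keying the mapping on the word text only works because every word in this sentence is unique; for real documents the token position (`token.i`) would be a safer key.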


In [None]:
displacy.serve(plain_doc, style='dep')



Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...



Shallow parsing, or chunking, is the process of extracting phrases from unstructured text. Chunking groups adjacent tokens into phrases on the basis of their POS tags. There are some standard well-known chunks such as noun phrases, verb phrases, and prepositional phrases.

In [5]:
for chunk in about_doc.noun_chunks:
    print(chunk)

Apple Inc.
an American multinational technology company
Cupertino
California
consumer electronics
computer software
online services
It
the Big Four technology companies
Amazon
Google
Microsoft
It
Steve Jobs
Steve Wozniak
Ronald Wayne
April
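
The idea of grouping adjacent tokens by their POS tags can be sketched with a toy rule in plain Python. This is not how `noun_chunks` is implemented; the pattern (optional determiners and adjectives followed by one or more nouns) and the `chunk_noun_phrases` helper are assumptions for illustration. The tagged tokens are taken from the POS output printed earlier.

```python
NOUN_TAGS = {"NOUN", "PROPN"}
MODIFIER_TAGS = {"DET", "ADJ"}


def chunk_noun_phrases(tagged_tokens):
    """Group (word, pos) pairs into phrases of modifiers followed by nouns."""
    chunks = []
    current = []       # words collected for the phrase in progress
    seen_noun = False  # True once the phrase in progress contains a noun

    def flush():
        nonlocal current, seen_noun
        if seen_noun:  # modifier-only runs are discarded
            chunks.append(" ".join(current))
        current, seen_noun = [], False

    for word, pos in tagged_tokens:
        if pos in NOUN_TAGS:
            current.append(word)
            seen_noun = True
        elif pos in MODIFIER_TAGS:
            if seen_noun:  # a modifier after a noun starts a new phrase
                flush()
            current.append(word)
        else:
            flush()
    flush()
    return chunks


tagged = [("Apple", "PROPN"), ("Inc.", "PROPN"), ("is", "AUX"),
          ("an", "DET"), ("American", "ADJ"), ("multinational", "ADJ"),
          ("technology", "NOUN"), ("company", "NOUN"),
          ("headquartered", "VERB"), ("in", "ADP"),
          ("Cupertino", "PROPN"), (",", "PUNCT"), ("California", "PROPN")]
toy_chunks = chunk_noun_phrases(tagged)
print(toy_chunks)
```

On this fragment the toy rule happens to produce the same first four chunks as the output above, but it would not handle pronouns, dates, or nested phrases.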


In [6]:
conference_text = ('There is a developer conference happening on 21 July 2019 in London.')
conference_doc = nlp(conference_text)
for chunk in conference_doc.noun_chunks:
    print(chunk)

a developer conference
21 July
London


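Named Entity Recognition (NER) locates named entities in text, such as organizations, places, and monetary values, and assigns each one a label.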

In [10]:
ent_text = ('Apple is looking at buying U.K. startup for $1 billion')
ent_doc = nlp(ent_text)
for ent in ent_doc.ents:
    print(ent.text, ent.start_char, ent.end_char,
          ent.label_, spacy.explain(ent.label_))

Apple 0 5 ORG Companies, agencies, institutions, etc.
U.K. 27 31 GPE Countries, cities, states
$1 billion 44 54 MONEY Monetary values, including unit
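
The `start_char` and `end_char` values printed above are character offsets into the original string, so they can be used to mark up the text directly. The following is a minimal sketch in plain Python; the `annotate` helper and its bracket notation are hypothetical, and the offsets are copied from the output above.

```python
ent_text = 'Apple is looking at buying U.K. startup for $1 billion'

# (start_char, end_char, label) triples copied from the output above
entities = [(0, 5, 'ORG'), (27, 31, 'GPE'), (44, 54, 'MONEY')]


def annotate(text, spans):
    """Wrap each span as [text](LABEL), assuming spans do not overlap."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])               # text before the entity
        pieces.append(f"[{text[start:end]}]({label})")  # the marked entity
        cursor = end
    pieces.append(text[cursor:])                        # text after the last entity
    return "".join(pieces)


annotated = annotate(ent_text, entities)
print(annotated)
```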
