In [1]:
pip install -U spacy

Note: you may need to restart the kernel to use updated packages.


In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 12.8/12.8 MB 4.2 MB/s eta 0:00:00
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
import spacy

In [4]:
nlp=spacy.load('en_core_web_sm')

In [5]:
with open('wiki_text_usa.txt', "r") as f:    # Open the text file in read mode
    text=f.read()

In [6]:
print (text)

The United States of America (USA or U.S.A.), commonly known as the United States (US or U.S.) or America, is a country primarily located in North America. It is a federal union of 50 states, which also includes its federal capital district of Washington, D.C., and 326 Indian reservations.[j] The 48 contiguous states border Canada to the north and Mexico to the south. The State of Alaska is non-contiguous and lies to the northwest, while the State of Hawaii is an archipelago in the Pacific Ocean. The U.S. also asserts sovereignty over five major unincorporated island territories and various uninhabited islands.[k] The country has the world's third-largest land area,[d] second-largest exclusive economic zone, and third-largest population, exceeding 334 million.[l]

Paleo-Indians migrated across the Bering land bridge more than 12,000 years ago, and went on to form various civilizations and societies. British colonization led to the first settlement of the Thirteen Colonies in Virginia i

In [7]:
doc=nlp(text)                # Run the pipeline to turn the raw text into a Doc object

In [8]:
print (len(text))
print (len(doc))

3308
598


In [9]:
for char in text[0:10]:    # A string is indexed by character, so whitespace and punctuation count too
    print(char)

T
h
e
 
U
n
i
t
e
d


In [10]:
for token in doc[0:10]:    # A Doc is indexed by token, not by character
    print(token)

The
United
States
of
America
(
USA
or
U.S.A.
)


# Sentence Boundary Detection

In [11]:
for sent in doc.sents:
    print(sent)

The United States of America (USA or U.S.A.), commonly known as the United States (US or U.S.) or America, is a country primarily located in North America.
It is a federal union of 50 states, which also includes its federal capital district of Washington, D.C., and 326 Indian reservations.[j]
The 48 contiguous states border Canada to the north and Mexico to the south.
The State of Alaska is non-contiguous and lies to the northwest, while the State of Hawaii is an archipelago in the Pacific Ocean.
The U.S. also asserts sovereignty over five major unincorporated island territories and various uninhabited islands.[k]
The country has the world's third-largest land area,[d] second-largest exclusive economic zone, and third-largest population, exceeding 334 million.[l]

Paleo-Indians migrated across the Bering land bridge more than 12,000 years ago, and went on to form various civilizations and societies.
British colonization led to the first settlement of the Thirteen Colonies in Virginia i

In [12]:
sentence2= list(doc.sents)[0]
print(sentence2)

The United States of America (USA or U.S.A.), commonly known as the United States (US or U.S.) or America, is a country primarily located in North America.


# Token Attributes

In [13]:
for token in doc[0:10]:
    print(token)

The
United
States
of
America
(
USA
or
U.S.A.
)


In [14]:
token1=sentence2[2]
print(token1)

States


In [15]:
token1.text      # the verbatim text of the token

'States'

In [16]:
token1.left_edge  # the leftmost token of this token's syntactic subtree

The

In [17]:
token1.right_edge  # the rightmost token of this token's syntactic subtree

America

In [18]:
token1.ent_type_   # named entity type

'GPE'

In [19]:
token1.ent_iob_    # I = inside an entity, O = outside an entity, B = beginning of an entity

'I'
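
The IOB scheme can be easier to follow with a small worked example. The sketch below is plain Python, not a spaCy API: `group_entities` is a hypothetical helper, and the tags in `tagged` are written by hand for illustration rather than taken from the model's output.

```python
def group_entities(tagged_tokens):
    """Group (text, iob, ent_type) triples into (entity_text, label) pairs."""
    entities = []
    current_words, current_label = [], None
    for text, iob, label in tagged_tokens:
        if iob == "B" or (iob == "I" and not current_words):
            # A new entity starts here; close any entity that was still open.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [text], label
        elif iob == "I":
            # Continue the entity that is currently open.
            current_words.append(text)
        else:
            # "O": the token is outside any entity, so close the open one.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


# Hand-written tags (an assumption, for illustration only).
tagged = [
    ("The", "B", "GPE"),
    ("United", "I", "GPE"),
    ("States", "I", "GPE"),
    ("of", "I", "GPE"),
    ("America", "I", "GPE"),
    ("(", "O", ""),
    ("USA", "B", "GPE"),
    ("or", "O", ""),
]
print(group_entities(tagged))
```

With real spaCy output, the same triples could be read from `token.text`, `token.ent_iob_` and `token.ent_type_`, although `doc.ents` already provides the grouped spans.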

In [20]:
token1.lemma_   # lemma (base form) of the token

'States'

In [21]:
token1.morph   # morphological analysis

Number=Sing

In [22]:
token1.pos_    # coarse-grained part of speech

'PROPN'

In [23]:
token1.dep_    # syntactic dependency relation

'nsubj'

In [24]:
token1.lang_   # language of the parent doc's vocabulary

'en'

# Part-of-Speech Tagging

In [25]:
text='Ronaldo enjoys playing football'

In [26]:
doc2=nlp(text)

In [27]:
print(doc2)

Ronaldo enjoys playing football


In [28]:
for token in doc2:
    print(token.text,token.pos_,token.dep_)

Ronaldo PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj
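
The tag and dependency labels printed above are abbreviations. As a small sketch, `spacy.explain` looks a label up in spaCy's built-in glossary and returns a short description (or `None` for a label it does not know). No trained model is needed for this lookup.

```python
import spacy

# Labels taken from the output above.
labels = ["PROPN", "VERB", "NOUN", "nsubj", "ROOT", "xcomp", "dobj"]

for label in labels:
    # spacy.explain returns a human-readable description of the label.
    print(f"{label:6} {spacy.explain(label)}")
```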


In [29]:
from spacy import displacy
displacy.render(doc2,style='dep')

# Word Vectors and spaCy

Word vectors are numerical representations of words: each word is mapped to a vector in a multidimensional space, so that words used in similar contexts end up close together.

# Why word vectors?
Once a word vector model is trained, we can run similarity matching quickly and reliably, because comparing two words reduces to comparing two vectors.
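
To make "comparing two vectors" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors in `toy_vectors` are invented for illustration (real models use hundreds of dimensions), and `cosine_similarity` and `rank_neighbours` are hypothetical helper names, not spaCy functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, chosen so that "country" and "nation" point the same way.
toy_vectors = {
    "country": [0.9, 0.1, 0.0],
    "nation": [0.8, 0.2, 0.1],
    "hamburger": [0.0, 0.9, 0.4],
}


def rank_neighbours(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for neighbour, score in rank_neighbours("country", toy_vectors):
    print(neighbour, round(score, 3))
```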

In [30]:
! python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     ---------------------------------------- 42.8/42.8 MB 4.2 MB/s eta 0:00:00
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [31]:
nlp=spacy.load('en_core_web_md')

In [32]:
with open("wiki_text_usa.txt",'r') as f:
    text=f.read()

In [33]:
doc=nlp(text)

In [34]:
sentence1=list(doc.sents)[0]
print(sentence1)

The United States of America (USA or U.S.A.), commonly known as the United States (US or U.S.) or America, is a country primarily located in North America.


In [41]:
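import numpy as np

your_word = "country"
# Look up the word's vector, then ask the vector table for its 10 nearest neighbours.
query = np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]])
ms = nlp.vocab.vectors.most_similar(query, n=10)   # returns (keys, best_rows, scores)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
scores = ms[2]   # similarity scores (higher means more similar), not distances
print(words)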

['country—0,467', 'nationâ\x80\x99s', 'countries-', 'continente', 'Carnations', 'pastille', 'бесплатно', 'Argents', 'Tywysogion', 'Teeters']


In [47]:
doc1=nlp("I like fries and hamburgers.")
doc2=nlp("Fast food tastes very good.")
print(doc1,"<->",doc2,doc1.similarity(doc2))

I like fries and hamburgers. <-> Fast food tastes very good. 0.6667528668828602


In [48]:
doc3=nlp("Mt.Everest is in Nepal")
print(doc1,"<->",doc3,doc1.similarity(doc3))

I like fries and hamburgers. <-> Mt.Everest is in Nepal 0.14839816927292906
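
By default, spaCy estimates `Doc.similarity` as the cosine similarity of the documents' averaged word vectors. The sketch below reproduces that idea in plain Python. The two-dimensional vectors are invented for illustration, and `mean_vector` and `text_similarity` are hypothetical helpers, so the numbers will not match the scores above.

```python
import math

# Made-up 2-d vectors: food words point one way, geography words another.
toy_vectors = {
    "fries": [0.1, 0.9],
    "hamburgers": [0.2, 0.8],
    "food": [0.15, 0.85],
    "Everest": [0.9, 0.1],
    "Nepal": [0.8, 0.2],
}


def mean_vector(words):
    """Average the vectors of the given words, component by component."""
    vectors = [toy_vectors[w] for w in words]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def text_similarity(words_a, words_b):
    """Cosine similarity between the averaged vectors of two word lists."""
    a, b = mean_vector(words_a), mean_vector(words_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


food_vs_food = text_similarity(["fries", "hamburgers"], ["food"])
food_vs_place = text_similarity(["fries", "hamburgers"], ["Everest", "Nepal"])
print(food_vs_food, food_vs_place)
```

Averaging discards word order, which is one reason this estimate works better for short, topical texts than for long or nuanced ones.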


# spaCy Pipelines

Input sentences -> Entity Ruler -> Entity Linker -> output sentences with entities annotated
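
As a rough sketch of the first stage of such a pipeline, the example below adds an `entity_ruler` component to a blank English pipeline and annotates entities from hand-written patterns. The patterns, labels, example sentence and the variable name `ruler_nlp` are assumptions made for illustration. The entity linker stage is left out because it needs a knowledge base.

```python
import spacy

# A blank pipeline has only a tokenizer, so no trained model is required.
ruler_nlp = spacy.blank("en")
ruler = ruler_nlp.add_pipe("entity_ruler")

# Illustrative patterns: one exact phrase and one token-level pattern.
ruler.add_patterns([
    {"label": "GPE", "pattern": "Nepal"},
    {"label": "LOC", "pattern": [{"LOWER": "mount"}, {"LOWER": "everest"}]},
])

ruler_doc = ruler_nlp("Mount Everest is in Nepal")
entities = [(ent.text, ent.label_) for ent in ruler_doc.ents]
print(entities)
```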

In [51]:
# Create a blank English pipeline and add a rule-based sentence splitter
nlp=spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.analyze_pipes()

{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []},
  'doc.sents': {'assigns': ['sentencizer'], 'requires': []}}}
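
The report above says the sentencizer assigns `doc.sents`. A minimal usage sketch follows. The example sentences and the variable names `blank_nlp` and `blank_doc` are chosen here for illustration.

```python
import spacy

# Blank English pipeline with only the rule-based sentencizer.
blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")

blank_doc = blank_nlp(
    "The 48 contiguous states border Canada to the north. "
    "Alaska lies to the northwest."
)
sentences = [sent.text for sent in blank_doc.sents]

print(blank_nlp.pipe_names)
print(sentences)
```

Because the sentencizer splits on punctuation rather than on a dependency parse, it is fast but can be less accurate than the parser-based boundaries produced by the trained models.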

In [52]:
# Load the small pretrained English model and inspect its components
nlp2=spacy.load("en_core_web_sm")
nlp2.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att