# SHRENI SHAH-19MAI0038

# SPACY INTRODUCTION

- spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. 
- spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. 
- It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.
- spaCy can perform tokenization, part-of-speech (POS) tagging, lemmatization, sentence boundary detection (SBD), dependency parsing, named entity recognition (NER), entity linking (EL), and more.

In [1]:
import spacy

# Load the English model

In [2]:
nlp=spacy.load("en")

In [3]:
nlp.vocab.length

478


# Create a Doc object

In [4]:
doc = nlp("High Explosive Research was the independent British project to develop atomic bombs after the Second World War")
for token in doc:
    print("{}:{}".format(token, token.vector[:3]))

High:[ 3.379025    0.60030574 -1.8544984 ]
Explosive:[ 3.4047315  1.4757392 -0.5115589]
Research:[-1.2007543   1.8710369   0.80947626]
was:[-3.1372113 -1.5912277 -2.8670561]
the:[-2.4285593 -3.782106   3.7539806]
independent:[2.359838  1.7131125 0.5008454]
British:[3.426497   0.46152067 0.09366858]
project:[-4.5285845  4.971984  -1.6442893]
to:[-3.784699    0.23621476 -2.8847935 ]
develop:[-2.4502208  -0.15967411 -0.39966428]
atomic:[2.0508976  1.114049   0.51341105]
bombs:[ 1.1723139  0.7761539 -1.0135481]
after:[ 1.6361983   0.18040827 -0.8765584 ]
the:[ 0.6448693 -1.2734318  1.8607215]
Second:[3.9483616  0.18734695 0.87496614]
World:[-1.5217814  3.42406   -1.1666031]
War:[ 1.7223792 -1.2408535 -0.860064 ]


In [5]:
doc = nlp("any help?please.")

 - Here I create a document on which the following cells perform some operations.

In [6]:
for token in doc:
    print("{}:{}".format(token, token.vector[:3]))

any:[ 2.573865  -1.0350318  1.1211486]
help?please:[ 0.93766063 -1.5132332   0.07575023]
.:[-0.676624  -2.7627392 -1.8642435]


 - Here we get the first three entries of the vector for each token.

In [7]:
[(token.text,token.pos_) for token in doc]

[('any', 'DET'), ('help?please', 'X'), ('.', 'PUNCT')]

 - This shows part-of-speech tagging: `token.pos_` gives the coarse-grained tag of each token.


# Linguistic annotations

In [8]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Hey,I am shreni. can not believe? yeah me too.")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Hey INTJ intj
, PUNCT punct
I PRON nsubj
am AUX ROOT
shreni ADJ attr
. PUNCT punct
can VERB aux
not PART neg
believe VERB ROOT
? PUNCT punct
yeah INTJ intj
me PRON ROOT
too ADV advmod
. PUNCT punct


 - Each line shows a token with its part-of-speech tag and its dependency label. For example, the comma (,) is tagged PUNCT, which denotes punctuation.
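
The annotations above can also be post-processed with plain Python. The sketch below copies the (text, POS, dependency) triples from the output of cell 8 into a list by hand, as a stand-in for a live `doc` (an assumption made so the example runs without a model). It collects the tokens labelled `ROOT`; since a dependency parse has one root per sentence, three roots suggest the parser saw three sentences.

```python
# Triples copied by hand from the output above (not produced by running spaCy here).
annotations = [
    ("Hey", "INTJ", "intj"), (",", "PUNCT", "punct"), ("I", "PRON", "nsubj"),
    ("am", "AUX", "ROOT"), ("shreni", "ADJ", "attr"), (".", "PUNCT", "punct"),
    ("can", "VERB", "aux"), ("not", "PART", "neg"), ("believe", "VERB", "ROOT"),
    ("?", "PUNCT", "punct"), ("yeah", "INTJ", "intj"), ("me", "PRON", "ROOT"),
    ("too", "ADV", "advmod"), (".", "PUNCT", "punct"),
]

# Tokens whose dependency label is ROOT (one per parsed sentence).
roots = [text for text, pos, dep in annotations if dep == "ROOT"]

# Tokens tagged as punctuation.
punctuation = [text for text, pos, dep in annotations if pos == "PUNCT"]

print(roots)
print(len(punctuation))
```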

# Tokenization

In [9]:
for token in doc:
    print(token.text)

Hey
,
I
am
shreni
.
can
not
believe
?
yeah
me
too
.


 - Tokenization is splitting up a sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
 - Iterating over the doc and reading `token.text` gives the text of each token.
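
To make the idea of splitting text into tokens concrete, here is a minimal sketch that uses a regular expression. It is not spaCy's tokenizer, which applies language-specific rules and exceptions; `simple_tokenize` is a hypothetical helper written only for this illustration.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Hey,I am shreni. can not believe? yeah me too.")
print(tokens)
```

On this sentence the naive rule yields the same 14 tokens as the spaCy output above. On "any help?please." it would split at the question mark, whereas the model used earlier kept `help?please` as one token, so the two approaches do not always agree.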

# Part-of-speech tagging

In [10]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Hey hey INTJ UH intj Xxx True False
, , PUNCT , punct , False False
I -PRON- PRON PRP nsubj X True True
am be AUX VBP ROOT xx True True
shreni shreni ADJ JJ attr xxxx True False
. . PUNCT . punct . False False
can can VERB MD aux xxx True True
not not PART RB neg xxx True True
believe believe VERB VB ROOT xxxx True False
? ? PUNCT . punct ? False False
yeah yeah INTJ UH intj xxxx True False
me -PRON- PRON PRP ROOT xx True True
too too ADV RB advmod xxx True True
. . PUNCT . punct . False False


 - Here we extract the lemma, part of speech, tag, dependency label, shape and two flags for every token.
 - spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name (for example, `token.pos_` instead of `token.pos`).
 

# Named Entities

In [11]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

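 - spaCy can recognize various types of named entities in a document by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, the predictions are not always correct and may need tuning for a given use case. For the document above the loop prints nothing, which suggests the model found no named entities in it.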

# Lemmatization

In [12]:
import lemminflect

In [13]:
doc = nlp('I am testing this example.')
doc[2]._.lemma()         
doc[4]._.inflect('NNS')  

'examples'

In [14]:
doc = nlp('what will you watch with me dude?')
#doc[5]._.lemma()         
doc[3]._.inflect('VBD')



'watched'

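 - Lemmatization gives the dictionary form (lemma) of a single token. lemminflect's `_.inflect` goes the other way: it produces the inflected form of a token for a given Penn Treebank tag, such as `NNS` (plural noun) or `VBD` (past-tense verb).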
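
As a rough illustration of what "dictionary form" means, the sketch below maps a few inflected words to their lemmas with a lookup table. The table and the `lookup_lemma` helper are hypothetical and written by hand; neither spaCy nor lemminflect is claimed to work exactly this way.

```python
# Hypothetical, hand-written lookup table; real lemmatizers use far larger
# tables and rules that depend on the part of speech.
LEMMA_TABLE = {
    "am": "be",
    "testing": "test",
    "examples": "example",
    "watched": "watch",
}


def lookup_lemma(word):
    """Return the table entry for a word, or the lowercased word itself."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


print(lookup_lemma("am"))
print(lookup_lemma("Watched"))
print(lookup_lemma("coffee"))
```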

# Noun chunks

In [15]:
nlp = spacy.load("en")
doc = nlp("Peach emoji is where it has always been. Peach is the superior "
          "emoji. It's outranking eggplant 🍑 ")
print(doc[0].text)          # 'Peach'
print(doc[1].text)          # 'emoji'
print(doc[-1].text)         # '🍑'
print(doc[13:19].text)      # "emoji. It's outranking eggplant"

noun_chunks = list(doc.noun_chunks)
print(noun_chunks[0].text)  # 'Peach' with this model

sentences = list(doc.sents)
assert len(sentences) == 3
print(sentences[1].text)    # 'Peach is the superior emoji.'

Peach
emoji
🍑
emoji. It's outranking eggplant
Peach
Peach is the superior emoji.


- This shows how to take a subset of the given document.
- We can select the part we want by slicing. For example, `doc[13:19]` returns the tokens at positions 13 through 18; the end index is excluded, as in ordinary Python slicing.
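
The slicing behaviour can be reproduced without spaCy. The sketch below writes the tokens out by hand as (text, trailing whitespace) pairs, mirroring the tokenization implied by the output above; both the list and the `span_text` helper are assumptions made for illustration.

```python
# (text, trailing whitespace) pairs written out by hand; an assumption that
# mirrors the printed output, not something produced by running spaCy here.
tokens = [
    ("Peach", " "), ("emoji", " "), ("is", " "), ("where", " "), ("it", " "),
    ("has", " "), ("always", " "), ("been", ""), (".", " "),
    ("Peach", " "), ("is", " "), ("the", " "), ("superior", " "),
    ("emoji", ""), (".", " "),
    ("It", ""), ("'s", " "), ("outranking", " "), ("eggplant", " "), ("🍑", " "),
]


def span_text(tokens, start, end):
    """Join tokens[start:end] and drop the whitespace after the last token."""
    return "".join(text + ws for text, ws in tokens[start:end]).rstrip()


print(span_text(tokens, 13, 19))  # tokens 13 through 18
print(span_text(tokens, 17, 19))  # tokens 17 and 18
```

With this token list, the slice 13:19 starts at the second "emoji", while 17:19 covers only "outranking eggplant".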

In [16]:
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

Peach Peach nsubj emoji
it it nsubj been
Peach Peach nsubj is
It It nsubj 's


# Word vectors and similarity

In [17]:
nlp = spacy.load("en")
tokens = nlp("dog cat bananassaa afskfsd")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 17.056751 False
cat True 18.97264 False
bananassaa True 20.019207 False
afskfsd True 19.65514 False


In [18]:
nlp = spacy.load("en")  
tokens = nlp("dog cat banana")

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

dog dog 1.0
dog cat 0.4192831
dog banana 0.4178361
cat dog 0.4192831
cat cat 1.0
cat banana 0.34277543
banana dog 0.4178361
banana cat 0.34277543
banana banana 1.0


  


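- Word similarity is a score that compares the vectors of two words (cosine similarity by default); it is not a probability. A word compared with itself scores 1.0.
- The nested loop compares each word with every other word.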
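
spaCy's default similarity is the cosine of the angle between two vectors. The sketch below computes it with the standard library on made-up three-dimensional vectors; the numbers are invented for illustration and are not taken from any spaCy model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up vectors, only to exercise the function.
dog = [1.0, 2.0, 0.5]
cat = [0.9, 1.8, 0.7]
banana = [-1.0, 0.2, 2.0]

for name, vec in [("dog", dog), ("cat", cat), ("banana", banana)]:
    print("dog", name, cosine_similarity(dog, vec))
```

The score is symmetric, which is why `dog cat` and `cat dog` print the same value in the output above.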

In [19]:
doc = nlp("Apple and banana are similar. Pasta and hippo aren't.")

apple = doc[0]
banana = doc[2]
pasta = doc[6]
hippo = doc[8]

print("apple <-> banana", apple.similarity(banana))
print("pasta <-> hippo", pasta.similarity(hippo))
print(apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector)

apple <-> banana 0.37282813
pasta <-> hippo 0.3430258
True True True True


  


- It shows the similarity between two given words.
- Pairs with a higher score are more likely to be similar in meaning. Here both scores are close (about 0.37 and 0.34), so this model barely separates the two pairs.

# Vocab, hashes and lexemes

In [20]:
nlp = spacy.load("en")
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'

3197928453018144401
coffee


- Whenever possible, spaCy tries to store data in a vocabulary, the Vocab, that will be shared by multiple documents.
- To save memory, spaCy also encodes all strings to hash values.
- Here, “coffee” has the hash value 3197928453018144401.
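
The two-way lookup shown above (string to hash, hash back to string) can be sketched with a small class. `TinyStringStore` is hypothetical, and it uses `hashlib.blake2b` as a stand-in hash function, so its IDs will not match the values spaCy prints.

```python
import hashlib


class TinyStringStore:
    """Hypothetical two-way mapping between strings and 64-bit integer IDs."""

    def __init__(self):
        self._by_id = {}

    @staticmethod
    def hash_string(text):
        # Stand-in hash: 8 bytes of BLAKE2b, not the function spaCy uses.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, text):
        """Store the string and return its integer ID."""
        key = self.hash_string(text)
        self._by_id[key] = text
        return key

    def __getitem__(self, item):
        # A string maps to its ID; an ID maps back to the stored string.
        if isinstance(item, str):
            return self.hash_string(item)
        return self._by_id[item]


store = TinyStringStore()
coffee_id = store.add("coffee")
print(coffee_id)
print(store[coffee_id])
```

Storing each string once and passing integers around is what lets many documents share one vocabulary cheaply.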

In [21]:
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
            lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)

I 4690420944186131903 X I I True False True en
love 3702023516439754181 xxxx l ove True False False en
coffee 3197928453018144401 xxxx c fee True False False en


- The earlier example showed the hash value of “coffee”; here we get the hash value (`lexeme.orth`) and other lexical attributes for every word in the document.
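
The columns in the output above can be approximated with plain string operations. The sketch below is inferred from the printed values only (for example, runs of the same shape character appear capped at four, and the suffix looks like the last three characters); `word_shape` and `lexical_attributes` are hypothetical helpers, and spaCy's own rules may differ in edge cases.

```python
def word_shape(text):
    """Map letters to X/x and digits to d, capping repeated runs at four."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            symbol = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch
        if symbol == last:
            run += 1
        else:
            run = 1
            last = symbol
        if run <= 4:
            shape.append(symbol)
    return "".join(shape)


def lexical_attributes(text):
    """Context-independent attributes of a word, as a plain dict."""
    return {
        "shape": word_shape(text),
        "prefix": text[:1],   # first character
        "suffix": text[-3:],  # last three characters
        "is_alpha": text.isalpha(),
        "is_digit": text.isdigit(),
        "is_title": text.istitle(),
    }


for word in ["I", "love", "coffee"]:
    print(word, lexical_attributes(word))
```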

# REFERENCES 

- https://www.guru99.com/nltk-tutorial.html
- https://spacy.io/usage/models
- https://monkeylearn.com/blog/natural-language-processing-tools/

# GITHUB LINK
- https://github.com/shrenis/Natural-Language-Processing