<a href="https://colab.research.google.com/github/shreyasd2301/Collection-of-NLP-Code/blob/main/Spacy/Understanding%20Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

spaCy is available in 55+ languages.

# Chapter 1: Finding Words, Phrases and Names

# 1. Getting Started 

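The general syntax for importing a language class is ```from spacy.lang.<code> import <Language>```, where `<code>` is the language code (for example `en`) and `<Language>` is the matching class (for example `English`).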

In [1]:
from spacy.lang.en import English
nlp=English()

In [None]:
doc=nlp("hi how are you i am shreyas")

In [4]:
doc.text

'hi how are you i am shreyas'
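The same import pattern works for other languages. As a minimal sketch, assuming the German language data that ships with spaCy (`spacy.lang.de`), a blank German pipeline can be built the same way. The sample sentence and the variable names are only illustrations.

```python
from spacy.lang.de import German

# A blank pipeline: tokenizer and language data only, no trained components
nlp_de = German()
doc_de = nlp_de("Wie geht es dir?")

# Collect the text of every token
tokens_de = [token.text for token in doc_de]
print(tokens_de)
```

A blank pipeline like this can tokenize text, but it does not predict tags, dependencies or entities.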

# 2. Documents, Spans and Tokens

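When you call `nlp` on a string, spaCy tokenizes it and returns a `Doc`. Indexing the `Doc` gives a single `Token`, and slicing it gives a `Span`.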

In [5]:
doc[0]

hi

In [7]:
doc[1:4]

how are you
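To make the difference between the three objects concrete, here is a small sketch that reuses the sentence from above and prints the type of each result. It assumes only the blank `English` pipeline, so no trained model is needed.

```python
from spacy.lang.en import English
from spacy.tokens import Span, Token

nlp = English()
doc = nlp("hi how are you i am shreyas")

token = doc[0]   # indexing returns a Token
span = doc[1:4]  # slicing returns a Span; the end index is exclusive

print(type(token).__name__, token.i, token.text)
print(type(span).__name__, span.start, span.end, span.text)
print(len(doc))  # number of tokens in the Doc
```

A `Span` is a view onto the `Doc`, so it does not copy the tokens it covers.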

# 3. Lexical Attributes

In [10]:
doc=nlp("our body has 70% water. we should drink 5 litres of water daily")

In [11]:
doc

our body has 70% water. we should drink 5 litres of water daily

In [13]:
for token in doc:
  # like_num is True for tokens that look like numbers, e.g. "70" or "ten"
  if token.like_num and token.i+1 < len(doc):
    next_token=doc[token.i+1]
    if next_token.text=="%":
      print(f"percentage found: {token.text}%")

percentage found: 70%
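Lexical attributes such as `like_num` depend only on the token text, so they are available even on a blank pipeline. The sketch below prints a few of them side by side; the sentence is a made-up example.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I paid 5 dollars for ten apples.")

# Columns: text, is_alpha, is_punct, like_num
for token in doc:
    print(f"{token.text:<10}{token.is_alpha!s:<8}{token.is_punct!s:<8}{token.like_num!s:<8}")

# like_num covers digits as well as spelled-out number words
number_like = [token.text for token in doc if token.like_num]
print(number_like)
```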


# 4. Loading Models

In [14]:
import spacy
nlp=spacy.load("en_core_web_sm")
doc=nlp("i am learning spacy which is widely used in nlp")

In [16]:
doc.text

'i am learning spacy which is widely used in nlp'

# 5. Predicting Linguistic Annotations


In [17]:
for token in doc:
  print(f"{token.text:<12}{token.pos_:<10}{token.dep_:<10}")

i           PRON      nsubj     
am          AUX       aux       
learning    VERB      ROOT      
spacy       NOUN      dobj      
which       DET       nsubjpass 
is          AUX       auxpass   
widely      ADV       advmod    
used        VERB      relcl     
in          ADP       prep      
nlp         PROPN     pobj      
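The tag and dependency labels in this output are abbreviations. As a small sketch, `spacy.explain` can be used to look up a short description for a label; it needs no trained model, and it returns `None` for labels it does not know.

```python
import spacy

# Labels taken from the output above
for label in ["PROPN", "AUX", "nsubj", "dobj", "relcl"]:
    print(f"{label:<8}{spacy.explain(label)}")
```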


## Additional Token Attributes
We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:

|Attribute|Description|Example (token `Tesla`)|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

## Named Entity Recognition (NER)

In [21]:
for ent in doc.ents:
  print(ent.text,ent.label_)

nlp ORG


# 6. Using the Matcher

In [29]:
import spacy
from spacy.matcher import Matcher
nlp=spacy.load("en_core_web_sm")
doc=nlp("upcoming iPhone 10 is better than previous iPhone 6")

In [30]:
matcher=Matcher(nlp.vocab)

In [31]:
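# Match the exact text "iPhone" followed by a token made up of digits
pattern=[{'TEXT':'iPhone'},{'IS_DIGIT':True}]
# spaCy v2 signature; in spaCy v3 this is matcher.add("IPHONE_PATTERN", [pattern])
matcher.add("IPHONE_PATTERN", None, pattern)
matches=matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone 10', 'iPhone 6']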
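Patterns can also match on other token attributes. The sketch below uses `LOWER` so that the casing of "iPhone" does not matter, and it passes the patterns as a list, which is an assumption about the installed version (this form is accepted from spaCy 2.2.2 onwards and is required in v3). Because only lexical attributes are used, a blank pipeline is enough.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# LOWER compares the lowercased token text, so any casing of "iphone" matches
pattern = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
matcher.add("IPHONE_PATTERN", [pattern])  # list-of-patterns form (assumed: spaCy >= 2.2.2)

doc = nlp("upcoming iPhone 10 is better than previous IPHONE 6")
found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```

Each match is a `(match_id, start, end)` tuple, so slicing the `Doc` with `start` and `end` gives the matched `Span`.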


# Chapter 2: Large-Scale Data Analysis

# 1. Strings to Hashes

In [36]:
nlp.vocab.strings['shreyas']

2207083292131419735

In [44]:
nlp.vocab.strings[380]

'PERSON'
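The lookup works in both directions, but a hash can only be turned back into a string if the vocabulary has already seen that string. A minimal sketch, assuming a blank `English` pipeline and an arbitrary example sentence:

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I love coffee")  # processing the text adds its strings to the vocab

# String -> hash
coffee_hash = nlp.vocab.strings["coffee"]
# Hash -> string (works because "coffee" is now in the string store)
coffee_string = nlp.vocab.strings[coffee_hash]

print(coffee_hash, coffee_string)
print(doc[2].orth == coffee_hash)  # tokens store the hash, not the string
```

Storing hashes instead of strings is what lets spaCy keep each string in memory only once.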

# 2. Creating a Doc

In [45]:
from spacy.lang.en import English
from spacy.tokens import Doc
nlp=English()
words = ['spaCy', 'is', 'cool', '!']
spaces = [True, True, False, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


# 3. Docs, Spans and Entities from Scratch

In [46]:
from spacy.lang.en import English
from spacy.tokens import Doc, Span
nlp=English()

words = ['I', 'like', 'David', 'Bowie']
spaces = [True, True, True, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

span = Span(doc, start=2, end=4, label='PERSON')
print(span.text, span.label_)

doc.ents = [span]

print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


In [47]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin looks like a nice city")

# Collect all proper nouns that are followed by a verb
for token in doc:
    if token.pos_ == "PROPN":
        # Guard against a proper noun at the very end of the Doc
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb: ", token.text)

Found proper noun before a verb:  Berlin


# 4. Inspecting Word Vectors

In [48]:
!python3 -m spacy download en_core_web_md

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-py3-none-any.whl size=98051301 sha256=11fdf00ce6838bfa3e6ec93626a43075e3e2344c17aed80af621b1806c905637
  Stored in directory: /tmp/pip-ephem-wheel-cache-iy0qb0_0/wheels/69/c5/b8/4f1c029d89238734311b3269762ab2ee325a42da2ce8edb997
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [49]:
import en_core_web_md
nlp=en_core_web_md.load()

In [50]:
doc=nlp("eat banana before going to gym")

In [52]:
banana_vect=doc[1].vector
banana_vect

array([ 2.0228e-01, -7.6618e-02,  3.7032e-01,  3.2845e-02, -4.1957e-01,
        7.2069e-02, -3.7476e-01,  5.7460e-02, -1.2401e-02,  5.2949e-01,
       -5.2380e-01, -1.9771e-01, -3.4147e-01,  5.3317e-01, -2.5331e-02,
        1.7380e-01,  1.6772e-01,  8.3984e-01,  5.5107e-02,  1.0547e-01,
        3.7872e-01,  2.4275e-01,  1.4745e-02,  5.5951e-01,  1.2521e-01,
       -6.7596e-01,  3.5842e-01, -4.0028e-02,  9.5949e-02, -5.0690e-01,
       -8.5318e-02,  1.7980e-01,  3.3867e-01,  1.3230e-01,  3.1021e-01,
        2.1878e-01,  1.6853e-01,  1.9874e-01, -5.7385e-01, -1.0649e-01,
        2.6669e-01,  1.2838e-01, -1.2803e-01, -1.3284e-01,  1.2657e-01,
        8.6723e-01,  9.6721e-02,  4.8306e-01,  2.1271e-01, -5.4990e-02,
       -8.2425e-02,  2.2408e-01,  2.3975e-01, -6.2260e-02,  6.2194e-01,
       -5.9900e-01,  4.3201e-01,  2.8143e-01,  3.3842e-02, -4.8815e-01,
       -2.1359e-01,  2.7401e-01,  2.4095e-01,  4.5950e-01, -1.8605e-01,
       -1.0497e+00, -9.7305e-02, -1.8908e-01, -7.0929e-01,  4.01

# 5. Comparing Similarities

In [53]:
doc_1=nlp("it is raining heavily outside")
doc_2=nlp("clouds are thundering load and bright")

In [56]:
similarity=doc_1.similarity(doc_2)
similarity

0.711132990403484
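As far as the spaCy documentation describes it, the default `Doc.similarity` is the cosine similarity of the averaged token vectors. The sketch below reproduces that idea in plain Python with hypothetical 3-dimensional vectors; the helper names `cosine_similarity` and `average_vector` are made up for illustration and are not part of spaCy.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]

# Hypothetical 3-dimensional "word vectors" for two tiny documents
doc_a = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
doc_b = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

score = cosine_similarity(average_vector(doc_a), average_vector(doc_b))
print(round(score, 3))
```

Because the vectors are averaged, word order is ignored, which is one reason two loosely related sentences can still receive a fairly high score.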

# Chapter 3: Processing Pipelines

## 1. What happens when you call nlp?
- Calling `nlp` on a string first runs the tokenizer, which turns the text into a `Doc` object. spaCy then applies every component in the pipeline to that `Doc`, in order.
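The flow described above can be sketched in plain Python. This is a simplified illustration, not spaCy's implementation: `ToyDoc`, `toy_tokenizer`, `run_pipeline` and the two components are hypothetical names invented for this example.

```python
class ToyDoc:
    """Stand-in for spaCy's Doc: holds tokens plus annotations added by components."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.annotations = {}

def toy_tokenizer(text):
    # The tokenizer is the only step that receives raw text
    return ToyDoc(text.split())

def count_component(doc):
    doc.annotations["n_tokens"] = len(doc.tokens)
    return doc

def upper_component(doc):
    doc.annotations["has_upper"] = any(t[0].isupper() for t in doc.tokens)
    return doc

def run_pipeline(text, components):
    doc = toy_tokenizer(text)
    # Every component receives the doc, annotates it, and returns it
    for name, component in components:
        doc = component(doc)
    return doc

pipeline = [("count", count_component), ("upper", upper_component)]
doc = run_pipeline("I am learning spaCy", pipeline)
print(doc.tokens, doc.annotations)
```

The key point is that each component takes a document and returns a document, which is what makes the components chainable.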

## 2. Inspecting Pipelines

In [57]:
import spacy
nlp=spacy.load('en_core_web_sm')
print(nlp.pipe_names)
print(nlp.pipeline)

['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fd23c4e0c50>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fd23bfe6ad0>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fd23bfe6b40>)]


## 3. Simple Components

In [60]:
import spacy
def length_component(doc):
  # A custom component receives the Doc, may inspect or modify it, and must return it
  doc_length=len(doc)
  print(f"This document contains {doc_length} tokens")
  return doc

nlp=spacy.load('en_core_web_sm')
# spaCy v2 style: the function is passed directly; spaCy v3 expects a registered component name
nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)
doc=nlp("I am Shreyas currently in my 2nd year")

['length_component', 'tagger', 'parser', 'ner']
This document contains 8 tokens
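The cell above passes the function object to `add_pipe`, which matches the spaCy 2.x version used in this notebook. As a hedged sketch, assuming spaCy v3 is installed instead, the same component would be registered with the `@Language.component` decorator and added by name. A blank pipeline is used here so that no model download is needed.

```python
import spacy
from spacy.language import Language

# spaCy v3 style (assumption): register the function under a string name
@Language.component("length_component")
def length_component(doc):
    print(f"This document contains {len(doc)} tokens")
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("length_component", first=True)  # added by its registered name
print(nlp.pipe_names)

doc = nlp("I am Shreyas currently in my 2nd year")
```

With a trained pipeline such as `en_core_web_sm`, `first=True` would place the component ahead of the model's own components, as in the cell above.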
