# **spaCy Library**

**Introduction:**
* spaCy is an open-source software library for advanced natural language processing, written in Python and Cython.
* The library is published under the MIT license. Its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion.
* It is used to build information extraction systems or to preprocess text for deep learning. Features provided by spaCy include tokenization, part-of-speech (POS) tagging, text classification and named entity recognition.
* Official website: https://spacy.io/

**Features of spaCy**
 * Support for 64+ languages.
 * Pretrained word vectors.
 * State-of-the-art speed.
 * Production-ready training system.
 * Linguistically-motivated tokenization.
 * Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more.
 * Built-in visualizers for syntax and NER.
 * Easy model packaging, deployment and workflow management.
 * Robust, rigorously evaluated accuracy.

In [None]:
# Install or upgrade spaCy
!pip install spacy --upgrade

Collecting spacy
  Downloading spacy-3.2.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.13-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (628 kB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-p

In [None]:
# Import spaCy
import spacy as sp

In [None]:
# Check the version of spacy
sp.__version__

'3.2.1'

In [None]:
# To work with French text, download the small French pipeline
!python -m spacy download fr_core_news_sm

In [None]:
# To work with English text, download the small English pipeline
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
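# Optional: the English model can also be imported as a Python package;
# en.load() is then equivalent to sp.load('en_core_web_sm')
import en_core_web_sm as en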

**Legend**
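(The token attributes used in the rest of this notebook)
1. pos: coarse-grained part-of-speech tag
2. shape: word shape (pattern of uppercase letters, lowercase letters, digits and punctuation)
3. alpha: whether the token consists only of alphabetic characters
4. stop: whether the token is a stop word
5. lemma: base ("root") form of the word
6. tag: fine-grained part-of-speech tag, which also encodes morphological information such as tense
7. dep: syntactic dependency relation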

In [None]:
# Create the nlp object by loading the English pipeline of the spaCy library.
nlp = sp.load('en_core_web_sm')

In [None]:
nlp

<spacy.lang.en.English at 0x7f7b72ea0e90>

In [None]:
text = "I am learning natural language processing. The course is offered by univercity.It is in Mumbai. Ph.d Sunil is Instructor"

In [None]:
# Note: calling the nlp object on the string runs the pipeline and returns a Doc, a sequence of annotated tokens
doc = nlp(text)
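
The attributes listed in the legend can be viewed side by side for a short sentence. The following is a minimal sketch and not part of the original notebook: the helper names `load_english` and `token_table` are hypothetical, and the fallback to `spacy.blank("en")` is an assumption for machines where `en_core_web_sm` is not installed. With the blank pipeline only the lexical columns (text, shape, alpha, stop) are filled; the others stay empty.

```python
import spacy


def load_english():
    """Return the small English pipeline, or a tokenizer-only fallback."""
    try:
        return spacy.load("en_core_web_sm")
    except OSError:
        # Assumption: the model package is missing, so fall back to a
        # blank English pipeline that only tokenizes.
        return spacy.blank("en")


def token_table(doc):
    """Collect the legend's attributes for every token as a list of dicts."""
    return [
        {
            "text": tok.text,
            "pos": tok.pos_,
            "shape": tok.shape_,
            "alpha": tok.is_alpha,
            "stop": tok.is_stop,
            "lemma": tok.lemma_,
            "tag": tok.tag_,
            "dep": tok.dep_,
        }
        for tok in doc
    ]


nlp_demo = load_english()
rows = token_table(nlp_demo("It is in Mumbai."))
for row in rows:
    print(row)
```

Keeping the attributes in plain dictionaries makes it easy to pass them on to other tools, for example to build a table for inspection.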

**1. POS (part-of-speech)**
- In this section we identify the part of speech of each word in the sentences.
- POS (part-of-speech) categories include noun, adjective, verb, auxiliary verb, determiner, punctuation, adposition, etc.
- For more details visit: https://ashutoshtripathi.com/2020/04/13/parts-of-speech-tagging-and-dependency-parsing-using-spacy-nlp/

In [None]:
# Part of speech tagging
for x in doc:
  print(x.text,"--", x.pos_)       # Part of speech

I -- PRON
am -- AUX
learning -- VERB
natural -- ADJ
language -- NOUN
processing -- NOUN
. -- PUNCT
The -- DET
course -- NOUN
is -- AUX
offered -- VERB
by -- ADP
univercity -- NOUN
. -- PUNCT
It -- PRON
is -- AUX
in -- ADP
Mumbai -- PROPN
. -- PUNCT
Ph.d -- PROPN
Sunil -- PROPN
is -- AUX
Instructor -- NOUN
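
The abbreviations in the output above can be looked up with `spacy.explain`, which returns a short description from spaCy's glossary (or `None` for labels it does not know). A small sketch, where the list `coarse_tags` is simply copied by hand from the output above:

```python
import spacy

# Coarse-grained tags that appear in the output above.
coarse_tags = ["PRON", "AUX", "VERB", "ADJ", "NOUN", "PUNCT", "DET", "ADP", "PROPN"]

# Look up a human-readable description for each tag.
descriptions = {tag: spacy.explain(tag) for tag in coarse_tags}
for tag, description in descriptions.items():
    print(f"{tag:6} {description}")
```

No trained pipeline is needed for this lookup, so it also works before any model has been downloaded.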


**2. Shape**

In [None]:
# shape: uppercase letters become X, lowercase letters become x; long runs of the same symbol are shortened
for x in doc:
  print(x,'-', x.shape_) 

I - X
am - xx
learning - xxxx
natural - xxxx
language - xxxx
processing - xxxx
. - .
The - Xxx
course - xxxx
is - xx
offered - xxxx
by - xx
univercity - xxxx
. - .
It - Xx
is - xx
in - xx
Mumbai - Xxxxx
. - .
Ph.d - Xx.x
Sunil - Xxxxx
is - xx
Instructor - Xxxxx
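
To make the pattern in the output above concrete, here is a pure-Python approximation of how such a shape string can be derived. It is a sketch, not spaCy's own implementation: the function name `approximate_shape` and the `max_run` parameter are hypothetical, and edge cases are not covered.

```python
def approximate_shape(text, max_run=4):
    """Approximate a spaCy-style word shape for a single token string."""
    shape = []
    previous = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and other characters are kept as-is
        # Count how many times the same symbol has repeated in a row.
        run_length = run_length + 1 if symbol == previous else 1
        previous = symbol
        # Keep at most max_run copies of a repeated symbol.
        if run_length <= max_run:
            shape.append(symbol)
    return "".join(shape)


for word in ["I", "The", "learning", "Mumbai", "Ph.d"]:
    print(word, "-", approximate_shape(word))
```

Because long runs are shortened, "learning" and "natural" share the shape `xxxx`, which is why the shape is useful as a coarse feature rather than as an exact description of the word.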


**3. Check for alphabetic characters**

In [None]:
# alpha: True if the token consists only of alphabetic characters
for x in doc:
  print(x.text,'-', x.is_alpha)   # "Ph.d" is False because of the period

I - True
am - True
learning - True
natural - True
language - True
processing - True
. - False
The - True
course - True
is - True
offered - True
by - True
univercity - True
. - False
It - True
is - True
in - True
Mumbai - True
. - False
Ph.d - False
Sunil - True
is - True
Instructor - True
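
spaCy documents `Token.is_alpha` as equivalent to calling `isalpha()` on the token text, so the check can be reproduced on plain strings. A short sketch using token texts copied from the output above:

```python
# Token texts copied from the last sentence of the example output above.
tokens = ["Ph.d", "Sunil", "is", "Instructor"]

# str.isalpha() is True only when every character is a letter,
# so the period in "Ph.d" makes it False.
for token in tokens:
    print(token, "-", token.isalpha())

# Keep only the purely alphabetic tokens.
alphabetic_only = [token for token in tokens if token.isalpha()]
print(alphabetic_only)
```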


**4. Check for stop words**

In [None]:
# Check whether each token is a stop word
for x in doc:
  print(x ,'-', x.is_stop)  

I - True
am - True
learning - False
natural - False
language - False
processing - False
. - False
The - True
course - False
is - True
offered - False
by - True
univercity - False
. - False
It - True
is - True
in - True
Mumbai - False
. - False
Ph.d - False
Sunil - False
is - True
Instructor - False
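
A common use of this flag is to drop stop words before further processing. The sketch below does the same thing on plain strings with the English stop list that ships with spaCy (`spacy.lang.en.stop_words.STOP_WORDS`); the word list is copied by hand from the second example sentence.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Words of the second example sentence, without punctuation.
words = ["The", "course", "is", "offered", "by", "univercity"]

# The stop list contains lowercase entries, so compare on the lowercased word.
content_words = [word for word in words if word.lower() not in STOP_WORDS]
print(content_words)
```

With a `Doc` at hand, the equivalent filter would be `[tok for tok in doc if not tok.is_stop]`.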


**5. Find the lemma of each word**

In [None]:
# The lemma is the base (dictionary) form of the word
for x in doc:
  print(x.text,"-", x.lemma_)  

I - I
am - be
learning - learn
natural - natural
language - language
processing - processing
. - .
The - the
course - course
is - be
offered - offer
by - by
univercity - univercity
. - .
It - it
is - be
in - in
Mumbai - Mumbai
. - .
Ph.d - Ph.d
Sunil - Sunil
is - be
Instructor - instructor
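
One reason to lemmatize is that different surface forms of the same word are then counted together. The sketch below illustrates this with a selection of (text, lemma) pairs copied from the output above; it uses only the standard library.

```python
from collections import Counter

# (text, lemma) pairs copied from the output above, punctuation left out.
pairs = [
    ("I", "I"), ("am", "be"), ("learning", "learn"), ("is", "be"),
    ("offered", "offer"), ("is", "be"), ("is", "be"), ("Instructor", "instructor"),
]

surface_counts = Counter(text for text, _ in pairs)
lemma_counts = Counter(lemma for _, lemma in pairs)

# "am" and "is" are counted separately as surface forms ...
print(surface_counts["am"], surface_counts["is"])
# ... but together under their shared lemma "be".
print(lemma_counts["be"])
```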


**6. Morphological information (fine-grained tags)**

In [None]:
# tag: fine-grained part-of-speech tag; it encodes morphological information such as tense and number
for x in doc:
  print(x.text,"-", x.tag_)     

I - PRP
am - VBP
learning - VBG
natural - JJ
language - NN
processing - NN
. - .
The - DT
course - NN
is - VBZ
offered - VBN
by - IN
univercity - NN
. - .
It - PRP
is - VBZ
in - IN
Mumbai - NNP
. - .
Ph.d - NNP
Sunil - NNP
is - VBZ
Instructor - NN


* Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection.
* However, there are clearly many more categories and sub-categories.
* For nouns, the plural, possessive, and singular forms can be distinguished.
* In many languages words are also marked for their "case" (role as subject, object, etc.), grammatical gender, and so on; while verbs are marked for tense, aspect, and other things. 
* In some tagging systems, different inflections of the same root word will get different parts of speech, resulting in a large number of tags.
* For example, the Brown Corpus tag set uses NN for singular common nouns, NNS for plural common nouns and NP for singular proper nouns. The output above uses Penn Treebank style tags instead, where proper nouns are tagged NNP.
* Other tagging systems use a smaller number of tags and ignore fine differences or model them as features somewhat independent from part-of-speech.
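
The relation between the two tag levels can be seen by grouping the fine-grained tags under the coarse-grained tag printed for the same token. The pairs below are copied by hand from the `pos_` and `tag_` outputs above; the variable names are only illustrative.

```python
from collections import defaultdict

# (coarse pos_, fine tag_) pairs as printed for the same tokens above.
observed = [
    ("PRON", "PRP"), ("AUX", "VBP"), ("VERB", "VBG"), ("ADJ", "JJ"),
    ("NOUN", "NN"), ("DET", "DT"), ("AUX", "VBZ"), ("VERB", "VBN"),
    ("ADP", "IN"), ("PROPN", "NNP"),
]

# Group the fine-grained tags under their coarse-grained tag.
fine_by_coarse = defaultdict(set)
for coarse, fine in observed:
    fine_by_coarse[coarse].add(fine)

for coarse, fine_tags in sorted(fine_by_coarse.items()):
    print(coarse, "->", sorted(fine_tags))
```

In this sample, AUX covers both VBP and VBZ, and VERB covers both VBG and VBN: the coarse tag names the word class, while the fine tag additionally distinguishes the inflected form.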

**7. Syntactic dependency**

In [None]:
# dep: syntactic dependency relation
for x in doc:
  print(x.text,"--", x.dep_)    

I -- nsubj
am -- aux
learning -- ROOT
natural -- amod
language -- compound
processing -- dobj
. -- punct
The -- det
course -- nsubjpass
is -- auxpass
offered -- ROOT
by -- agent
univercity -- pobj
. -- punct
It -- nsubj
is -- ROOT
in -- prep
Mumbai -- pobj
. -- punct
Ph.d -- compound
Sunil -- nsubj
is -- ROOT
Instructor -- attr
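
Each dependency label describes the relation between a token and its head, so the labels are most useful together with `token.head` and `token.children`. The sketch below rebuilds the third sentence by hand so that it runs without a trained pipeline. The labels are taken from the output above, but the head positions in `heads` are an assumption (the usual reading of these labels), since the notebook does not print them.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["It", "is", "in", "Mumbai", "."]
deps = ["nsubj", "ROOT", "prep", "pobj", "punct"]  # labels from the output above
# Assumed head positions (index of each token's head within the sentence);
# the root points at itself.
heads = [1, 1, 1, 2, 1]

# Build a Doc with a hand-written dependency tree instead of a parser's output.
sentence = Doc(Vocab(), words=words, heads=heads, deps=deps)

for token in sentence:
    children = [child.text for child in token.children]
    print(f"{token.text:7} {token.dep_:6} head={token.head.text:7} children={children}")
```

With the parsed `doc` from the notebook, the same loop over `doc` would show the heads that the parser actually predicted.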
