### Natural Language Processing With spaCy and Python
+ AI or Artificial Intelligence: building systems that can do intelligent things.
+ NLP or Natural Language Processing: building systems that can understand everyday language. It is a subset of Artificial Intelligence.
+ spaCy by Explosion.ai (Matthew Honnibal)
![alt text](SpaCy_logo.png "Title")



#### Basic Terms
+ Tokenization:	Segmenting text into words, punctuation marks etc.
+ Part-of-speech (POS) Tagging:	Assigning word types to tokens, like verb or noun.
+ Dependency Parsing:	Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
+ Lemmatization:	Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".
+ Named Entity Recognition (NER):	Labelling named "real-world" objects, like persons, companies or locations.
+ Similarity:	Comparing words, text spans and documents and how similar they are to each other.
+ Sentence Boundary Detection (SBD):	Finding and segmenting individual sentences.
+ Text Classification:	Assigning categories or labels to a whole document, or parts of a document.
+ Rule-based Matching:	Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
+ Training:	Updating and improving a statistical model's predictions.
+ Serialization:	Saving objects to files or byte strings.

### Installing the Library on Linux/Unix
+ pip install spacy
+ python -m spacy download en_core_web_sm
+ python -m spacy download fr_core_news_sm

### Installing On Windows using Conda

+ conda install tqdm
+ conda install -c conda-forge spacy (or: conda install spacy)
+ python -m spacy download en_core_web_sm
- - run these commands from a cmd prompt opened as administrator


### Installing using Conda
+ conda install -c conda-forge spacy
+ python -m spacy download en_core_web_sm
+ python -m spacy download fr_core_news_sm




#### Downloading the Models of Other Languages
+ python -m spacy download de_core_news_sm
+ python -m spacy download es_core_news_sm
+ python -m spacy download xx_ent_wiki_sm


### Installing On Windows using Conda (with the conda-forge channel added)
+ conda config --add channels conda-forge
+ conda update anaconda
+ conda install tqdm
+ conda install -c conda-forge spacy
+ python -m spacy download en_core_web_sm

In [54]:
# Loading the package
import spacy
nlp = spacy.load("en_core_web_sm")

#nlp = en_core_web_sm.load()

![alt text](BehindSpacy.jpg "Title")

### Reading  A Document or Text

In [55]:
# Reading the text /tokens
docx = nlp("SpaCy is a cool tool")

In [56]:
docx

SpaCy is a cool tool

In [57]:
docx2 = nlp(u"SpaCy is an amazing tool like nltk")

In [58]:
# Reading a file (the context manager closes it again after reading)
with open("/content/text.txt", encoding="utf-8") as f:
    myfile = f.read()

In [59]:
doc_file = nlp(myfile)

In [60]:
# Calling the file
doc_file

Cambodia has recorded a trade surplus with Vietnam in the first five months of 2024, making Vietnam the second biggest market for Cambodia’s products after the US.

In the past, Cambodia’s exports to Vietnam have been less than Cambodia’s imports from Vietnam. However, the momentum of Cambodia’s exports to Vietnam has increased significantly in the last several months.

Figures from the General Department of Customs and Excise (GDCE) showed on Tuesday that in the first five months this year, Cambodia exported goods worth $1.88 billion to Vietnam, an increase of 42.6 percent, while imports from Vietnam were worth only $1.67 billion, an increase of eight percent compared to the same period last year.

This gave Cambodia a trade surplus with Vietnam to the tune of $216 million.

“Vietnam has significantly increased its agricultural purchases from Cambodia after it opened up a free trade market with the European Union, which boosted Vietnam’s demand for raw materials,” said Penn Sovicheat,

In [61]:
# Decoding the file as UTF-8 (in Python 3, pass the encoding to open())
#myfile2 = open("examplefile.txt", encoding="utf8").read()

#### Sentence Tokens
+ Tokenization == Splitting or segmenting the text into sentences or tokens
+ .sents

In [62]:
# Sentence Tokens
for num,sentence in enumerate(doc_file.sents):
    #print(f'{num}: {sentence}') # For Python 3.6 upwards
    print('{0}: {1}'.format(num,sentence))


0: Cambodia has recorded a trade surplus with Vietnam in the first five months of 2024, making Vietnam the second biggest market for Cambodia’s products after the US.


1: In the past, Cambodia’s exports to Vietnam have been less than Cambodia’s imports from Vietnam.
2: However, the momentum of Cambodia’s exports to Vietnam has increased significantly in the last several months.


3: Figures from the General Department of Customs and Excise (GDCE) showed on Tuesday that in the first five months this year, Cambodia exported goods worth $1.88 billion to Vietnam, an increase of 42.6 percent, while imports from Vietnam were worth only $1.67 billion, an increase of eight percent compared to the same period last year.


4: This gave Cambodia a trade surplus with Vietnam to the tune of $216 million.


5: “Vietnam has significantly increased its agricultural purchases from Cambodia after it opened up a free trade market with the European Union, which boosted Vietnam’s demand for raw materials,
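
For contrast with `doc_file.sents`, here is a minimal plain-Python sketch of a naive rule-based splitter. It is not how spaCy segments sentences; the helper name `naive_sentences` and the sample strings are made up for illustration. It hints at why splitting on punctuation alone tends to break on abbreviations.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "In the past, exports were lower. However, momentum has increased."
for num, sentence in enumerate(naive_sentences(sample)):
    print('{0}: {1}'.format(num, sentence))

# An abbreviation fools the rule: the split lands inside the sentence
print(naive_sentences("The U.S. market grew."))
```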

#### Word Tokens
+ Splitting or segmenting the text into words
+ .text

In [63]:
doc = nlp(u"Spacy is an amazing tool.")

In [64]:
# Word Tokens
for token in doc:
    print(token.text)

Spacy
is
an
amazing
tool
.


In [65]:
# List of Word Tokens
[token.text for token in doc ]

['Spacy', 'is', 'an', 'amazing', 'tool', '.']

In [66]:
# Similar to splitting on spaces
doc.text.split(" ")

['Spacy', 'is', 'an', 'amazing', 'tool.']
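
The difference between the two lists above is the trailing period. A rough plain-Python sketch of that idea, assuming a simple regular expression is an acceptable stand-in (spaCy's tokenizer uses far richer rules; `simple_tokens` is a hypothetical helper):

```python
import re

def simple_tokens(text):
    """Return runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "Spacy is an amazing tool."
print(text.split(" "))       # whitespace split keeps 'tool.' together
print(simple_tokens(text))   # punctuation becomes its own token
```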

### More about words
+ .shape_ ==> shape of the word, e.g. capitals, lowercase, digits
+ .is_alpha ==> returns a boolean (True or False): whether the token consists of alphabetic characters
+ .is_stop ==> returns a boolean (True or False): whether the token is a stop word

In [67]:
docx

SpaCy is a cool tool

In [68]:
# Word Shape As Hash Value
for word in docx:
    print(word.text,word.shape)

SpaCy 14101195205177134206
is 4370460163704169311
a 11123243248953317070
cool 13110060611322374290
tool 13110060611322374290
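
The integers above are IDs for the shape strings; "cool" and "tool" share one because both have the shape `xxxx`. A minimal sketch of such a two-way string/ID mapping follows. The class `ToyStringStore` and its hash function are assumptions for illustration: spaCy uses its own hash, so the IDs here will not match the ones printed above.

```python
import hashlib

class ToyStringStore:
    """Two-way mapping between strings and 64-bit integer IDs (illustrative only)."""

    def __init__(self):
        self._by_id = {}

    def add(self, text):
        # Stand-in hash: spaCy uses its own hash function, so IDs here will differ
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._by_id[key] = text
        return key

    def __getitem__(self, key):
        return self._by_id[key]

store = ToyStringStore()
cool_shape = store.add("xxxx")   # shape of "cool"
tool_shape = store.add("xxxx")   # shape of "tool"
print(cool_shape == tool_shape)  # same string, same ID
print(store[cool_shape])         # back to the readable form
```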


In [69]:
# Word Shape As readable representation
for word in docx:
    print(word.text,word.shape_)

SpaCy XxxXx
is xx
a x
cool xxxx
tool xxxx


In [70]:
ex_doc = nlp("This is 1 pen.")

In [71]:
for word in ex_doc:
    print("Token =>", word.text, "Shape ",word.shape_,word.is_alpha,word.is_stop)

Token => This Shape  Xxxx True True
Token => is Shape  xx True True
Token => 1 Shape  d False False
Token => pen Shape  xxx True False
Token => . Shape  . False False
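
The shape strings shown in this section can be approximated in a few lines of plain Python. This is a sketch under the assumption that uppercase maps to `X`, lowercase to `x`, digits to `d`, and runs of the same shape character are capped at four; `word_shape` is a hypothetical helper, not spaCy's own code.

```python
def word_shape(text):
    """Approximate a shape string: X upper, x lower, d digit, runs capped at 4."""
    shape = []
    last, run = "", 0
    for char in text:
        if char.isalpha():
            mapped = "X" if char.isupper() else "x"
        elif char.isdigit():
            mapped = "d"
        else:
            mapped = char  # punctuation is kept as-is
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= 4:
            shape.append(mapped)
    return "".join(shape)

for word in ["SpaCy", "cool", "This", "1", "amazing"]:
    print(word, word_shape(word), word.isalpha())
```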


### Part of Speech Tagging
+ NB attribute_ ==> Returns readable string representation of attribute
+ .pos
+ .pos_ ==> exposes the Google Universal POS tags (coarse-grained, simple)
+ .tag
+ .tag_ ==> exposes the Treebank tags (fine-grained, detailed; useful for training your own model)
+ Uses
- - Sentiment Analysis, Homonym Disambiguation, Prediction

In [72]:
# Parts of Speech
ex1 = nlp("He drinks a drink.")

In [73]:
# pos_ = Parts of Speech Simplified
for word in ex1:
    print(word.text,word.pos_)


He PRON
drinks VERB
a DET
drink NOUN
. PUNCT


In [74]:
# Parts of Speech Simple Term (.pos_)
ex2 = nlp("I fish a fish")

In [75]:
for word in ex2:
    print(word.text,word.pos_,word.tag_)

I PRON PRP
fish VERB VBP
a DET DT
fish NOUN NN


In [76]:
# Parts of Speech Detailed (.tag_) (good for training your own model, e.g. as features)
for word in ex2:
    print(word.text,word.pos_,word.tag_)

I PRON PRP
fish VERB VBP
a DET DT
fish NOUN NN


##### If you want to know the meaning of the pos abbreviation
+ spacy.explain('DT')

In [77]:
spacy.explain('DT')

'determiner'

In [78]:
exercise1 = nlp(u"All the faith he had had had had no effect on the outcome of his life")
#the first is a modifier while the second is the main verb of the sentence


In [79]:
for word in exercise1:
    print((word.text,word.tag_,word.pos_))

('All', 'PDT', 'DET')
('the', 'DT', 'DET')
('faith', 'NN', 'NOUN')
('he', 'PRP', 'PRON')
('had', 'VBD', 'AUX')
('had', 'VBN', 'AUX')
('had', 'VBN', 'AUX')
('had', 'VBN', 'VERB')
('no', 'DT', 'DET')
('effect', 'NN', 'NOUN')
('on', 'IN', 'ADP')
('the', 'DT', 'DET')
('outcome', 'NN', 'NOUN')
('of', 'IN', 'ADP')
('his', 'PRP$', 'PRON')
('life', 'NN', 'NOUN')


In [80]:
exercise2 = nlp("The man the professor the student has studies Rome.")
#The student has the professor who knows the man who studies ancient Rome

In [81]:
for word in exercise2:
    print((word.text,word.tag_,word.pos_))

('The', 'DT', 'DET')
('man', 'NN', 'NOUN')
('the', 'DT', 'DET')
('professor', 'NN', 'NOUN')
('the', 'DT', 'DET')
('student', 'NN', 'NOUN')
('has', 'VBZ', 'VERB')
('studies', 'NNS', 'NOUN')
('Rome', 'NNP', 'PROPN')
('.', '.', 'PUNCT')


#### Syntactic Dependency
+ It helps us to know the relation between tokens
+ How each word is connected and dependent on each other

In [82]:
ex3 = nlp("I like to stay at home.")

In [83]:
for word in ex3:
    print((word.text,word.tag_,word.pos_,word.dep_))

('I', 'PRP', 'PRON', 'nsubj')
('like', 'VBP', 'VERB', 'ROOT')
('to', 'TO', 'PART', 'aux')
('stay', 'VB', 'VERB', 'xcomp')
('at', 'IN', 'ADP', 'prep')
('home', 'NN', 'NOUN', 'pobj')
('.', '.', 'PUNCT', 'punct')
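
To make "how each word is connected" concrete, here is a small plain-Python sketch that walks the parse as a table. The dependency labels are copied from the output above; the head indices are hand-written assumptions for illustration (in spaCy they would come from `word.head`), and `children` is a hypothetical helper.

```python
# (text, dep, head_index) rows; dep labels come from the output above,
# head indices are hand-written assumptions for illustration
rows = [
    ("I", "nsubj", 1),
    ("like", "ROOT", 1),
    ("to", "aux", 3),
    ("stay", "xcomp", 1),
    ("at", "prep", 3),
    ("home", "pobj", 4),
    (".", "punct", 1),
]

def children(rows, head_index):
    """Return (text, dep) for every token attached directly to the given head."""
    return [(text, dep) for i, (text, dep, head) in enumerate(rows)
            if head == head_index and i != head_index]

# The ROOT token is the one the rest of the sentence hangs from
root_index = next(i for i, (_, dep, _) in enumerate(rows) if dep == "ROOT")
print(rows[root_index][0], "->", children(rows, root_index))
```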


In [84]:
# What does pobj mean?
spacy.explain('pobj')

'object of preposition'

### Visualizing Dependency using displaCy
+ from spacy import displacy
+ displacy.serve()
+ displacy.render(jupyter=True) # for jupyter notebook

In [85]:
# To display the dependencies and any other visualization
from spacy import displacy

In [86]:
# For Jupyter Notebooks you can set jupyter=True to render it properly
displacy.render(ex3,style='dep',jupyter=True)

In [87]:
# Visualizing Named Entity Recognition
#displacy.render(ex1,style='ent',jupyter=True,options={'distance':140})


### Visualizing using displaCy
+ For IDEs
+ For Jupyter notebook


In [88]:
# For IDEs
#from spacy import displacy

In [89]:
docx3 = nlp('I go to school.')

In [90]:
for word in docx3:
    print((word.text,word.tag_,word.pos_,word.dep_))

('I', 'PRP', 'PRON', 'nsubj')
('go', 'VBP', 'VERB', 'ROOT')
('to', 'IN', 'ADP', 'prep')
('school', 'NN', 'NOUN', 'pobj')
('.', '.', 'PUNCT', 'punct')


In [91]:
# Start a server running on your localhost
#displacy.serve(docx3,style='dep')

### Using displaCy in Jupyter notebooks
+ displacy.render(jupyter=True)

In [92]:
displacy.render(docx3,style='dep',jupyter=True)

In [93]:
spacy.explain('ADP')

'adposition'

In [94]:
## Customizing the Displays
# compact: True draws square arrows, False draws curved arrows
# Color:#09a3d5
options = {'compact': True, 'bg': 'cornflowerblue',
           'color': '#fff', 'font': 'Sans Serif'}


In [95]:
displacy.render(docx3,style='dep',options=options,jupyter=True)

In [96]:
# Adding Title
docx3.user_data['title']= 'Buffalo Complex Sentence'

In [97]:
displacy.render(docx3,style='dep',options=options,jupyter=True)

### Rendering HTML
+ Default output is SVG markup
+ Set page=True to render a full HTML page
+ minify=True for minified output

In [98]:
html = displacy.render(docx3,style='dep',page=True)

### Exporting The Rendered Graphic
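
A minimal sketch of the export step, assuming the markup is already available as a string. Inside a notebook, passing `jupyter=False` to `displacy.render` is typically needed to get the markup back instead of having it displayed. The helper `save_markup`, the file name, and the placeholder SVG are hypothetical.

```python
import tempfile
from pathlib import Path

def save_markup(markup, path):
    """Write a markup string (SVG or HTML) to disk as UTF-8 and return the path."""
    output_path = Path(path)
    output_path.write_text(markup, encoding="utf-8")
    return output_path

# Placeholder markup standing in for the string returned by displacy.render
placeholder = "<svg xmlns='http://www.w3.org/2000/svg'></svg>"
target = Path(tempfile.gettempdir()) / "sentence.svg"
saved = save_markup(placeholder, target)
print(saved.read_text(encoding="utf-8") == placeholder)
```

With a real string in place of `placeholder`, the saved file can be opened in a browser.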

### Named Entity Recognition or Detection
+ Locating named entities in a text and classifying them into predefined categories of real-world objects.
+ Takes a string of text (sentence or paragraph) as input and identifies relevant nouns (people, places, and organizations) that are mentioned in that string.

##### Uses
+ Classifying or Categorizing contents by getting the relevant tags
+ Improve search algorithms
+ For content recommendations
+ For info extraction

+ .ents
+ .label_

In [99]:
wikitext = nlp(u"By 2020 the telecom company Orange, will relocate from Turkey to Orange County in the U.S. close to Apple.It will cost them 2 billion dollars.")

In [100]:
for entity in wikitext.ents:
    print(entity.text,entity.label_)

2020 DATE
Orange NORP
Turkey GPE
Orange County GPE
U.S. GPE
Apple ORG
2 billion dollars MONEY
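
One way the "relevant tags" use case could look in practice: group the entity texts by label. This sketch uses the (text, label) pairs printed above as plain tuples rather than calling spaCy, so it illustrates only the grouping step.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above
entities = [
    ("2020", "DATE"), ("Orange", "NORP"), ("Turkey", "GPE"),
    ("Orange County", "GPE"), ("U.S.", "GPE"), ("Apple", "ORG"),
    ("2 billion dollars", "MONEY"),
]

by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```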


In [101]:
spacy.explain('NORP')

'Nationalities or religious or political groups'

In [102]:
# What does GPE mean?
spacy.explain('GPE')

'Countries, cities, states'

In [103]:
# Visualize With displaCy
displacy.render(wikitext,style='ent',jupyter=True)

In [104]:
wikitext2 = nlp(u"I live in Paris.")

In [105]:
# Visualize With displaCy
displacy.render(wikitext2,style='ent',jupyter=True)

In [106]:
spacy.explain('NORP')

'Nationalities or religious or political groups'

In [107]:
doc1 = nlp("Facebook, Explosion.ai, JCharisTech are all internet companies")

In [108]:
# Visualize With displaCy
displacy.render(doc1,style='ent',jupyter=True)

### Text Normalization  and Word Inflection
+ Word inflection == syntactic differences between word forms
+ Reducing a word to its base/root form
+ Lemmatization **
- - Reducing a word to its base form based on its intended meaning
+ Stemming
- - Cutting off the prefixes/suffixes to reduce a word to its base form
+ Word Shape Analysis
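
To illustrate how stemming differs from lemmatization, here is a toy suffix-stripping sketch in plain Python. `naive_stem` is a hypothetical helper and not a real stemming algorithm; it only shows that cutting suffixes can produce non-words, whereas a lemmatizer maps "was" to "be".

```python
def naive_stem(word):
    """Strip one common suffix; a toy rule, not a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)]
    return word

for word in ["walking", "goes", "goods", "was"]:
    print(word, "=>", naive_stem(word))
```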

In [109]:
## Lemmatization
docx_lemma = nlp("He goes to school.")

In [110]:
for word in docx_lemma:
    print("Token=>",word.text,"Lemma=>",word.lemma_,word.pos_)

Token=> He Lemma=> he PRON
Token=> goes Lemma=> go VERB
Token=> to Lemma=> to ADP
Token=> school Lemma=> school NOUN
Token=> . Lemma=> . PUNCT


In [111]:
docx_lemma1 = nlp("I like the goods.")

In [112]:
for word in docx_lemma1:
    print("Token=>",word.text,"Lemma=>",word.lemma_,word.pos_)

Token=> I Lemma=> I PRON
Token=> like Lemma=> like VERB
Token=> the Lemma=> the DET
Token=> goods Lemma=> good NOUN
Token=> . Lemma=> . PUNCT


In [113]:
docx_lemma2 = nlp("I am walking.")

In [114]:
for word in docx_lemma2:
    print("Token=>",word.text,"Lemma=>",word.lemma_,word.pos_)

Token=> I Lemma=> I PRON
Token=> am Lemma=> be AUX
Token=> walking Lemma=> walk VERB
Token=> . Lemma=> . PUNCT


### Semantic Similarity
+ object1.similarity(object2)
+ Uses:
+ - Recommendation systems
+ - Data Preprocessing eg removing duplicates
+ For similarity based on word vectors, download a model that ships them:
- - python -m spacy download en_core_web_lg

In [115]:
# Similarity of object
doc1 = nlp("wolf")
doc2 = nlp("dog")

In [116]:
doc1.similarity(doc2)

0.666785871991148

In [117]:
doc3  = nlp("cat")

In [118]:
doc3.similarity(doc2)

0.6847176149951816

In [119]:
# Synonyms
doc4 = nlp("smart")
doc5 = nlp("clever")

In [120]:
# Similarity of words
doc4.similarity(doc5)

0.8056307834804642
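
spaCy documents the default `similarity` estimate as a cosine similarity over averaged word vectors. The sketch below shows that arithmetic in plain Python. The three-dimensional vectors are made up for illustration (real models use hundreds of dimensions), so the numbers will not match the scores above.

```python
import math

def average(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "word vectors" (assumption, for illustration only)
toy_vectors = {
    "smart": [0.9, 0.1, 0.3],
    "clever": [0.8, 0.2, 0.4],
    "fish": [0.1, 0.9, 0.2],
}

print(cosine(toy_vectors["smart"], toy_vectors["clever"]))
print(cosine(toy_vectors["smart"], toy_vectors["fish"]))
# A multi-word text would be represented by the average of its word vectors
print(average([toy_vectors["smart"], toy_vectors["clever"]]))
```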

#### Noun Chunks
+ noun + word describing the noun
+ noun phrases


In [125]:
# Noun Phrase or Chunks
doc_phrase1 = nlp("The man reading the news is very tall.")

In [126]:
for word in doc_phrase1.noun_chunks:
    print(word.text)

The man
the news


In [127]:
# Root Text
# the Main Noun
for word in doc_phrase1.noun_chunks:
    print(word.root.text)

man
news


### Text Similarity With ML

In [162]:
# Using ML
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


In [130]:
documents = ['wolf','dog','cat','bird','fish']


In [140]:
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(documents).toarray()

In [141]:
print(vectorizer.vocabulary_)


{'wolf': 4, 'dog': 2, 'cat': 1, 'bird': 0, 'fish': 3}


In [142]:
features

array([[0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0]])

In [146]:
cosine_similarity([features[0]],features)

array([[1., 0., 0., 0., 0.]])

### Sentence Similarity


In [155]:
# Using several short example sentences
corpus1 = ["I like that bachelor and bachelor","I like that unmarried man","I don't like the married man"]
corpus2 = ["Jane is very nice.", "Is Jane very nice?"]
corpus3 = ["He is a bachelor","He is an unmarried man"]
corpus4 = ["She is a wife","She is a wife"]
corpus5 = ["He is a king","He is a doctor"]

documents = corpus1 + corpus2 + corpus3 + corpus4 + corpus5

In [156]:
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(documents).toarray()

In [157]:
print(vectorizer.vocabulary_)


{'like': 9, 'that': 14, 'bachelor': 2, 'and': 1, 'unmarried': 16, 'man': 10, 'don': 4, 'the': 15, 'married': 11, 'jane': 7, 'is': 6, 'very': 17, 'nice': 12, 'he': 5, 'an': 0, 'she': 13, 'wife': 18, 'king': 8, 'doctor': 3}


In [158]:
features

array([[0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [159]:
cosine_similarity([features[0]],features)

array([[1.        , 0.37796447, 0.16903085, 0.        , 0.        ,
        0.43643578, 0.        , 0.        , 0.        , 0.        ,
        0.        ]])
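
The count-based scores above can be checked by hand. This sketch assumes `CountVectorizer`'s default behaviour of lowercasing and keeping only tokens of two or more word characters (which is why "I" is missing from the vocabulary); `bag_of_words` and `cosine` are hypothetical helpers. It should give about 0.378 for the first two sentences, in line with the second value in the array.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase and count tokens of two or more word characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

first = bag_of_words("I like that bachelor and bachelor")
second = bag_of_words("I like that unmarried man")
print(first)
print(round(cosine(first, second), 8))
```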

In [163]:
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(documents).toarray()

In [164]:
print(vectorizer.vocabulary_)

{'like': 9, 'that': 14, 'bachelor': 2, 'and': 1, 'unmarried': 16, 'man': 10, 'don': 4, 'the': 15, 'married': 11, 'jane': 7, 'is': 6, 'very': 17, 'nice': 12, 'he': 5, 'an': 0, 'she': 13, 'wife': 18, 'king': 8, 'doctor': 3}


In [165]:
features

array([[0.        , 0.43776435, 0.74837005, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.32907478,
        0.        , 0.        , 0.        , 0.        , 0.37418503,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.46696806,
        0.46696806, 0.        , 0.        , 0.        , 0.530981  ,
        0.        , 0.530981  , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.49205853,
        0.        , 0.        , 0.        , 0.        , 0.36988863,
        0.36988863, 0.49205853, 0.        , 0.        , 0.        ,
        0.49205853, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.29744623, 0.55121857, 0.        , 0.        ,
        0.        , 0.        , 0.55121857, 0.        , 0.        ,
        0.   

In [166]:
cosine_similarity([features[0]],features)

array([[1.        , 0.35235255, 0.12172102, 0.        , 0.        ,
        0.54166087, 0.        , 0.        , 0.        , 0.        ,
        0.        ]])
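
The TF-IDF scores differ from the count-based ones because rarer terms get more weight. A plain-Python sketch of that weighting, assuming the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 that scikit-learn documents as its default; the helper names are hypothetical. Cosine similarity is unaffected by the L2 normalisation step, so it is skipped here.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(docs):
    """Weight each term count by a smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n = len(docs)
    df = Counter(term for count in counts for term in count)  # document frequency
    idf = {term: math.log((1 + n) / (1 + freq)) + 1 for term, freq in df.items()}
    return [{term: tf * idf[term] for term, tf in count.items()} for count in counts]

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

documents = [
    "I like that bachelor and bachelor", "I like that unmarried man",
    "I don't like the married man", "Jane is very nice.", "Is Jane very nice?",
    "He is a bachelor", "He is an unmarried man", "She is a wife",
    "She is a wife", "He is a king", "He is a doctor",
]
vectors = tfidf_vectors(documents)
print(round(cosine(vectors[0], vectors[1]), 4))
print(round(cosine(vectors[0], vectors[5]), 4))
```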