## In this notebook, we cover the basics of NLP with spaCy. Other notebooks in this repository use NLTK.

##### NLP is a subfield of artificial intelligence concerned with enabling computers to process and interpret human language. It involves analyzing, quantifying, understanding, and deriving meaning from natural languages. Currently, the most powerful NLP models are transformer-based; BERT from Google and the GPT family from OpenAI are examples. For more details, see https://realpython.com/natural-language-processing-spacy-python/



In [121]:
#!pip install spacy
import spacy

In [122]:
nlp = spacy.load("en_core_web_sm")
nlp

<spacy.lang.en.English at 0x7f2134eaa310>

In [123]:
introduction_doc = nlp("This tutorial is about Natural Language Processing in spaCy.")
type(introduction_doc)

spacy.tokens.doc.Doc

In [124]:
[token.text for token in introduction_doc]

['This',
 'tutorial',
 'is',
 'about',
 'Natural',
 'Language',
 'Processing',
 'in',
 'spaCy',
 '.']

In [125]:
import pathlib
file_name = "maxwell.txt"
text = pathlib.Path(file_name).read_text(encoding="utf-8")
doc = nlp(text)
print ([token.text for token in doc])

['James', 'Clerk', 'Maxwell', 'FRSE', 'FRS', '(', '13', 'June', '1831', '–', '5', 'November', '1879', ')', 'was', 'a', 'Scottish', 'mathematician', 'and', 'scientist', 'responsible', 'for', 'the', 'classical', 'theory', 'of', 'electromagnetic', 'radiation', ',', 'which', 'was', 'the', 'first', 'theory', 'to', 'describe', 'electricity', ',', 'magnetism', 'and', 'light', 'as', 'different', 'manifestations', 'of', 'the', 'same', 'phenomenon', '.', 'Maxwell', "'s", 'equations', 'for', 'electromagnetism', 'have', 'been', 'called', 'the', '"', 'second', 'great', 'unification', 'in', 'physics"[3', ']', 'where', 'the', 'first', 'one', 'had', 'been', 'realised', 'by', 'Isaac', 'Newton', '.', 'With', 'the', 'publication', 'of', '"', 'A', 'Dynamical', 'Theory', 'of', 'the', 'Electromagnetic', 'Field', '"', 'in', '1865', ',', 'Maxwell', 'demonstrated', 'that', 'electric', 'and', 'magnetic', 'fields', 'travel', 'through', 'space', 'as', 'waves', 'moving', 'at', 'the', 'speed', 'of', 'light', '.', '

In [126]:
sentences = list(doc.sents)
len(sentences)

8

In [127]:
for sentence in sentences:
  print(f"{sentence[:3]} ----")

James Clerk Maxwell ----
Maxwell's equations ----
With the publication ----
He proposed that ----
The unification of ----
Maxwell is also ----
Maxwell helped develop ----
He is also ----


In [128]:
print(len(doc))   # number of tokens in the Doc
print(len(text))  # number of characters in the raw string

225
1288


In [129]:
# Iterating over the raw string yields single characters, not tokens
for char in text[0:11]:
  print(char)

J
a
m
e
s
 
C
l
e
r
k


In [130]:
# Iterating over the Doc yields spaCy tokens
for token in doc[0:11]:
  print(token)

James
Clerk
Maxwell
FRSE
FRS
(
13
June
1831
–
5


In [131]:
for sent in doc.sents:
    print (sent)

James Clerk Maxwell FRSE FRS (13 June 1831 – 5 November 1879) was a Scottish mathematician and scientist responsible for the classical theory of electromagnetic radiation, which was the first theory to describe electricity, magnetism and light as different manifestations of the same phenomenon.
Maxwell's equations for electromagnetism have been called the "second great unification in physics"[3] where the first one had been realised by Isaac Newton.
With the publication of "A Dynamical Theory of the Electromagnetic Field" in 1865, Maxwell demonstrated that electric and magnetic fields travel through space as waves moving at the speed of light.
He proposed that light is an undulation in the same medium that is the cause of electric and magnetic phenomena.[4]
The unification of light and electrical phenomena led to his prediction of the existence of radio waves.
Maxwell is also regarded as a founder of the modern field of electrical engineering.
Maxwell helped develop the Maxwell–Boltzma

In [132]:
sentence1 = list(doc.sents)[0]
print (sentence1)

James Clerk Maxwell FRSE FRS (13 June 1831 – 5 November 1879) was a Scottish mathematician and scientist responsible for the classical theory of electromagnetic radiation, which was the first theory to describe electricity, magnetism and light as different manifestations of the same phenomenon.


In [133]:
token2 = sentence1[2]
print (token2)

Maxwell


In [134]:
token2.ent_type_

'PERSON'

In [135]:
for ent in doc.ents:
    print (ent.text, ent.label_)

James Clerk Maxwell FRSE PERSON
13 June 1831 DATE
November 1879 DATE
Scottish NORP
first ORDINAL
Maxwell ORG
second ORDINAL
first ORDINAL
Isaac Newton PERSON
A Dynamical Theory of the Electromagnetic Field WORK_OF_ART
1865 DATE
Maxwell ORG
Maxwell ORG
Maxwell ORG
Maxwell–Boltzmann ORG
first ORDINAL
1861 DATE


In [146]:
from spacy import displacy
displacy.render(sentence1, style='dep', jupyter=True, options={'distance': 90})

In [137]:
#!python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")  # the small (sm) package ships without static word vectors, so use md or larger for similarity
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.691649353055761
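
According to the spaCy documentation, `Doc.similarity` defaults to the cosine similarity between the two document vectors, and a document vector defaults to the average of its token vectors. The following is a minimal pure-Python sketch of that computation. The three-dimensional vectors are made up for illustration and are not real spaCy embeddings, and the helper names `cosine_similarity` and `average_vector` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for zero vectors
    return dot / (norm_u * norm_v)

def average_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Made-up token vectors for two tiny "documents" (illustrative only)
toy_doc_a = [[1.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
toy_doc_b = [[0.9, 0.1, 0.8], [0.4, 0.6, 0.1]]

score = cosine_similarity(average_vector(toy_doc_a), average_vector(toy_doc_b))
print(round(score, 3))
```

A score close to 1 means the averaged vectors point in nearly the same direction. Because averaging discards word order, two documents with similar vocabulary can score highly even when their meanings differ.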


In [138]:
# Similarity of tokens and spans
salty_fries = doc1[2:4]  # Span covering "salty fries"
burgers = doc1[5]        # Token "hamburgers"
print(salty_fries, "<->", burgers, salty_fries.similarity(burgers))

salty fries <-> hamburgers 0.6938489675521851


In [139]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_EMAIL": True}]
matcher.add("EMAIL_ADDRESS", [pattern])
doc = nlp("This is an email address: wmattingly@aol.com")
matches = matcher(doc)

In [140]:
print (matches)

[(16571425990740197027, 6, 7)]
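
Each match is a tuple `(match_id, start, end)`: `match_id` is the hash of the pattern name, and `start` and `end` are token indices into the `Doc`. The sketch below shows one way to turn such a tuple back into readable values. It assumes a blank English pipeline (`spacy.blank("en")`) is sufficient, since `LIKE_EMAIL` is a lexical attribute that does not need a trained model. The names `nlp_blank`, `email_matcher`, `email_doc` and `results` are introduced only for this illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline is enough here because LIKE_EMAIL is a
# lexical attribute computed by the tokenizer/vocab, not by a trained model.
nlp_blank = spacy.blank("en")
email_matcher = Matcher(nlp_blank.vocab)
email_matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]])

email_doc = nlp_blank("This is an email address: wmattingly@aol.com")

results = []
for match_id, start, end in email_matcher(email_doc):
    label = nlp_blank.vocab.strings[match_id]  # hash -> pattern name
    span = email_doc[start:end]                # token indices -> Span
    results.append((label, span.text))
    print(label, start, end, span.text)
```

Looking the hash up in `vocab.strings` recovers the name passed to `add`, and slicing the `Doc` with the token indices recovers the matched text.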


In [141]:
import spacy
nlp2 = spacy.load("en_core_web_sm")
conference_help_text = ("Gus is helping organize a developer"
" conference on Applications of Natural Language Processing. He keeps organizing local Python meetups"
" and several internal talks at his workplace.")
# Use the pipeline loaded in this cell rather than the earlier `nlp`
conference_help_doc = nlp2(conference_help_text)
for token in conference_help_doc:
  if str(token) != str(token.lemma_):
    print(f"{str(token):>20} : {str(token.lemma_)}")

                  is : be
                  He : he
               keeps : keep
          organizing : organize
             meetups : meetup
               talks : talk


In [142]:
from collections import Counter
nlp3 = spacy.load("en_core_web_sm")
complete_text = ("Raghav Jha is a physicist currently working for a Virginia-based \
national lab. He is interested in learning Natural Language Processing. There is a \
conference happening on 21 July 2023 in London. It is titled Applications of \
Natural")
# Use the pipeline loaded in this cell rather than the earlier `nlp`
complete_doc = nlp3(complete_text)
words = [token.text for token in complete_doc if not token.is_stop and not token.is_punct]
print(Counter(words).most_common(5))


[('Natural', 2), ('Raghav', 1), ('Jha', 1), ('physicist', 1), ('currently', 1)]


In [143]:
import spacy
nlp4 = spacy.load("en_core_web_sm")
about_text = ("Gus Proto is a Python developer currently working for a London-based Fintech \
company. He is interested in learning Natural Language Processing.")
# Use the pipeline loaded in this cell rather than the earlier `nlp`
about_doc = nlp4(about_text)
for token in about_doc:
  print(f"""TOKEN: {str(token)},TAG: {str(token.tag_):10}, POS: {token.pos_}, \
  EXPLANATION: {spacy.explain(token.tag_)}""")

TOKEN: Gus,TAG: NNP       , POS: PROPN,   EXPLANATION: noun, proper singular
TOKEN: Proto,TAG: NNP       , POS: PROPN,   EXPLANATION: noun, proper singular
TOKEN: is,TAG: VBZ       , POS: AUX,   EXPLANATION: verb, 3rd person singular present
TOKEN: a,TAG: DT        , POS: DET,   EXPLANATION: determiner
TOKEN: Python,TAG: NNP       , POS: PROPN,   EXPLANATION: noun, proper singular
TOKEN: developer,TAG: NN        , POS: NOUN,   EXPLANATION: noun, singular or mass
TOKEN: currently,TAG: RB        , POS: ADV,   EXPLANATION: adverb
TOKEN: working,TAG: VBG       , POS: VERB,   EXPLANATION: verb, gerund or present participle
TOKEN: for,TAG: IN        , POS: ADP,   EXPLANATION: conjunction, subordinating or preposition
TOKEN: a,TAG: DT        , POS: DET,   EXPLANATION: determiner
TOKEN: London,TAG: NNP       , POS: PROPN,   EXPLANATION: noun, proper singular
TOKEN: -,TAG: HYPH      , POS: PUNCT,   EXPLANATION: punctuation mark, hyphen
TOKEN: based,TAG: VBN       , POS: VERB,   EXPLANATION: ver

## We now write a preprocessor that applies the following operations: 1) lowercases the text, 2) lemmatizes each token, 3) removes punctuation symbols, 4) removes stop words.

Lemmatization is the process of reducing the inflected forms of a word to a common base form that is itself a valid word in the language. This base form is called a lemma. For example, organizes, organized and organizing are all forms of organize; here, organize is the lemma. Inflection lets a word express different grammatical categories, such as tense (organized vs organize) and number (trains vs train). Lemmatization is useful because it lets the inflected forms of a word be analyzed as a single item, and it also helps normalize the text.
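
To illustrate why grouping inflected forms matters for counting, here is a small pure-Python sketch. The lookup table `TOY_LEMMAS` and the helper `toy_lemmatize` are hypothetical and exist only for this example; spaCy's own lemmatizer is considerably more sophisticated than a fixed dictionary.

```python
from collections import Counter

# Hypothetical lookup table, for illustration only (not how spaCy lemmatizes)
TOY_LEMMAS = {
    "organizes": "organize",
    "organized": "organize",
    "organizing": "organize",
    "trains": "train",
}

def toy_lemmatize(word):
    """Map a word to its lemma if known, otherwise return it lowercased."""
    lowered = word.lower()
    return TOY_LEMMAS.get(lowered, lowered)

words = ["organizes", "organized", "organizing", "organize", "trains", "train"]

surface_counts = Counter(words)                        # every form counted separately
lemma_counts = Counter(toy_lemmatize(w) for w in words)  # forms merged by lemma

print(surface_counts)
print(lemma_counts)
```

Without lemmatization each surface form is its own entry; after mapping to lemmas, the counts collapse onto `organize` and `train`, which is what makes frequency analysis meaningful.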

In [144]:
import spacy
nlp = spacy.load("en_core_web_sm")
complete_text = ("Gus Proto is a Python developer currently working for a London-based \
Fintech company. He is interested in learning Natural Language Processing. There is \
a developer conference happening on 21 July 2019 in London. It is titled Applications \
of Natural Language Processing")
complete_doc = nlp(complete_text)

def is_token_allowed(token):
  # Keep tokens that are non-empty, not stop words, and not punctuation
  return bool(token and str(token).strip() and not token.is_stop and not token.is_punct)

def preprocess_token(token):
  # Reduce the token to its lemma, then strip whitespace and lowercase it
  return token.lemma_.strip().lower()

complete_filtered_tokens = [preprocess_token(token) for token in complete_doc
                            if is_token_allowed(token)]

complete_filtered_tokens

['gus',
 'proto',
 'python',
 'developer',
 'currently',
 'work',
 'london',
 'base',
 'fintech',
 'company',
 'interested',
 'learn',
 'natural',
 'language',
 'processing',
 'developer',
 'conference',
 'happen',
 '21',
 'july',
 '2019',
 'london',
 'title',
 'application',
 'natural',
 'language',
 'processing']