# Natural Language Processing using spaCy

- How to Get Noun chunks
- How to Get root words from Noun chunks
- Lemmatization
- Display tree view of words using displacy.render()
- How to get the meaning of any label used by nlp using explain()
- How to Find out NER (Named Entity Recognition) in a given doc
- How to filter specific NER from raw data
- Display Named Entity in doc using displacy.render
- How to display only specific ENTITIES in displacy
- How to Save Entities in an HTML page and customize it as needed
- Find out the total number of occurrences of ORG entities
- Remove stop_words/punctuation using is_stop & is_punct attribute
- Try at Home
  - Create a list of words/sentences after removing stop_words, then make a sentence
  - Sentence and Word Tokenization

In [1]:
import spacy as sp
from spacy import displacy # used for data visualization
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.attrs import ORTH # to be used for word count

In [2]:
sp.__version__

'3.2.1'

# To load the English model

In [3]:
nlp = sp.load("en_core_web_sm")
# ref: https://spacy.io/models/en
# If the model is not installed yet, run this once in a notebook cell:
# !python -m spacy download en_core_web_sm

## Sample Text about Mr. Obama

In [4]:
txt = """Obama’s father, Barack Obama, Sr., was a teenage goatherd in rural Kenya, won a scholarship to study in the United States, and eventually became a senior economist in the Kenyan government. Obama’s mother, S. Ann Dunham, grew up in Kansas, Texas, and Washington state before her family settled in Honolulu. In 1960 she and Barack Sr. met in a Russian language class at the University of Hawaii and married less than a year later.

When Obama was age two, Barack Sr. left to study at Harvard University; shortly thereafter, in 1964, Ann and Barack Sr. divorced. (Obama saw his father only one more time, during a brief visit when Obama was 10.) Later Ann remarried, this time to another foreign student, Lolo Soetoro from Indonesia, with whom she had a second child, Maya. Obama lived for several years in Jakarta with his half sister, mother, and stepfather. While there, Obama attended both a government-run school where he received some instruction in Islam and a Catholic private school where he took part in Christian schooling.

He returned to Hawaii in 1971 and lived in a modest apartment, sometimes with his grandparents and sometimes with his mother (she remained for a time in Indonesia, returned to Hawaii, and then went abroad again—partly to pursue work on a Ph.D.—before divorcing Soetoro in 1980). For a brief period his mother was aided by government food stamps, but the family mostly lived a middle-class existence. In 1979 Obama graduated from Punahou School, an elite college preparatory academy in Honolulu.

Obama attended Occidental College in suburban Los Angeles for two years and then transferred to Columbia University in New York City, where in 1983 he received a bachelor’s degree in political science. Influenced by professors who pushed him to take his studies more seriously, Obama experienced great intellectual growth during college and for a couple of years thereafter. He led a rather ascetic life and read works of literature and philosophy by William Shakespeare, Friedrich Nietzsche, Toni Morrison, and others. After serving for a couple of years as a writer and editor for Business International Corp., a research, publishing, and consulting firm in Manhattan, he took a position in 1985 as a community organizer on Chicago’s largely impoverished Far South Side. He returned to school three years later and graduated magna cum laude in 1991 from Harvard University’s law school, where he was the first African American to serve as president of the Harvard Law Review. While a summer associate in 1989 at the Chicago law firm of Sidley Austin, Obama had met Chicago native Michelle Robinson, a young lawyer at the firm. The two married in 1992."""

## How to Get Noun chunks

In [5]:
obj = nlp(txt)
obj

Obama’s father, Barack Obama, Sr., was a teenage goatherd in rural Kenya, won a scholarship to study in the United States, and eventually became a senior economist in the Kenyan government. Obama’s mother, S. Ann Dunham, grew up in Kansas, Texas, and Washington state before her family settled in Honolulu. In 1960 she and Barack Sr. met in a Russian language class at the University of Hawaii and married less than a year later.

When Obama was age two, Barack Sr. left to study at Harvard University; shortly thereafter, in 1964, Ann and Barack Sr. divorced. (Obama saw his father only one more time, during a brief visit when Obama was 10.) Later Ann remarried, this time to another foreign student, Lolo Soetoro from Indonesia, with whom she had a second child, Maya. Obama lived for several years in Jakarta with his half sister, mother, and stepfather. While there, Obama attended both a government-run school where he received some instruction in Islam and a Catholic private school where he t

In [6]:
for n in obj.noun_chunks:
    print(n.text)

Obama’s father
Barack Obama
Sr
.
a teenage goatherd
rural Kenya
a scholarship
the United States
a senior economist
the Kenyan government
Obama’s mother
S. Ann Dunham
Kansas
Texas
Washington state
her family
Honolulu
she
Barack Sr
.
a Russian language class
the University
Hawaii
Obama
age
Harvard University
Ann
Barack Sr
Obama
his father
a brief visit
Obama
Ann
another foreign student
Lolo Soetoro
Indonesia
whom
she
a second child
Maya
Obama
several years
Jakarta
his half sister
mother
stepfather
Obama
both a government-run school
he
some instruction
Islam
a Catholic private school
he
part
Christian schooling
He
Hawaii
a modest apartment
his grandparents
his mother
she
a time
Indonesia
Hawaii
work
Soetoro
a brief period
his mother
government food stamps
the family
a middle-class existence
Obama
Punahou School
an elite college preparatory academy
Honolulu
Obama
Occidental College
suburban Los Angeles
two years
Columbia University
New York City
he
a bachelor’s degree
political science
pro

## How to Get root words from Noun chunks

In [7]:
for n in obj.noun_chunks:
    print(n.root)

father
Obama
Sr
.
goatherd
Kenya
scholarship
States
economist
government
mother
Dunham
Kansas
Texas
state
family
Honolulu
she
Sr
.
class
University
Hawaii
Obama
age
University
Ann
Sr
Obama
father
visit
Obama
Ann
student
Soetoro
Indonesia
whom
she
child
Maya
Obama
years
Jakarta
sister
mother
stepfather
Obama
school
he
instruction
Islam
school
he
part
schooling
He
Hawaii
apartment
grandparents
mother
she
time
Indonesia
Hawaii
work
Soetoro
period
mother
stamps
family
existence
Obama
School
academy
Honolulu
Obama
College
Angeles
years
University
City
he
degree
science
professors
who
him
studies
Obama
growth
college
couple
years
He
life
works
literature
philosophy
Shakespeare
Nietzsche
Morrison
others
couple
years
writer
editor
Corp.
research
publishing
firm
Manhattan
he
position
organizer
Chicago
Side
He
school
laude
school
he
American
president
Review
firm
Austin
Obama
Robinson
lawyer
firm


## Lemmatization

In [9]:
t2 = "Obama’s father, Barack Obama, Sr., was a teenage goatherd in rural Kenya, won a scholarship to study in the United States, and eventually became a senior economist in the Kenyan government. Obama’s mother, S. Ann Dunham, grew up in Kansas, Texas, and Washington state before her family settled in Honolulu. In 1960 she and Barack Sr. met in a Russian language class at the University of Hawaii and married less than a year later."
t2

'Obama’s father, Barack Obama, Sr., was a teenage goatherd in rural Kenya, won a scholarship to study in the United States, and eventually became a senior economist in the Kenyan government. Obama’s mother, S. Ann Dunham, grew up in Kansas, Texas, and Washington state before her family settled in Honolulu. In 1960 she and Barack Sr. met in a Russian language class at the University of Hawaii and married less than a year later.'

In [10]:
obj2 = nlp(t2)
obj2

Obama’s father, Barack Obama, Sr., was a teenage goatherd in rural Kenya, won a scholarship to study in the United States, and eventually became a senior economist in the Kenyan government. Obama’s mother, S. Ann Dunham, grew up in Kansas, Texas, and Washington state before her family settled in Honolulu. In 1960 she and Barack Sr. met in a Russian language class at the University of Hawaii and married less than a year later.

In [13]:
for w in obj2:
    print(w.lemma)  # integer hash ID of the lemma, not the readable string

4857242187112322394
614914527630368944
17071697760115891398
2593208677638477497
15388493565120789335
4857242187112322394
2593208677638477497
11009051309222302246
12646065887601541794
2593208677638477497
10382539506755952630
11901859001352538922
14903686835195047779
14155881130933041931
3002984154512732771
132814216812512435
4234016620340650456
2593208677638477497
471204509717844521
11901859001352538922
13838487344555440146
3791531372978436496
4251533498015236010
3002984154512732771
7425985699627899538
13226800834791099135
3278625293875499398
2593208677638477497
2283656566040971221
140282802797073706
12558846041070486771
11901859001352538922
17934676104927248284
705753312367543608
3002984154512732771
7425985699627899538
10833033741108051523
3625794390087546215
12646065887601541794
4857242187112322394
614914527630368944
7963322251145911254
2593208677638477497
14855555303352453619
13706867504515737279
2780745668230242118
2593208677638477497
17623831665190758874
2199259611705938403
3002984
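
The long integers above are hash IDs: spaCy stores every string once in a shared `StringStore` and token attributes without a trailing underscore return the hash, while the underscore variants return the text. The sketch below shows the round trip with `orth` / `orth_`; `lemma` / `lemma_` follow the same convention. It assumes a blank English pipeline (`sp.blank("en")`) so that it runs without a trained model; the sample sentence and variable names are illustrative only.

```python
import spacy as sp

# Assumption: a blank pipeline is enough to demonstrate the string store.
nlp_blank = sp.blank("en")
demo = nlp_blank("Obama grew up in Honolulu")

strings = nlp_blank.vocab.strings  # shared StringStore of the pipeline

for tok in demo:
    # tok.orth is the integer hash, tok.orth_ is the readable text
    assert strings[tok.orth] == tok.orth_   # hash -> string
    assert strings[tok.orth_] == tok.orth   # string -> hash
    print(tok.orth, tok.orth_)
```

With the notebook's `obj2`, the same lookup would be `nlp.vocab.strings[w.lemma]`, which should give the same text as `w.lemma_`.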

In [15]:
for w in obj2:
    print((w.text, w.lemma_))  # trailing underscore returns the lemma as text

('Obama', 'Obama')
('’s', '’s')
('father', 'father')
(',', ',')
('Barack', 'Barack')
('Obama', 'Obama')
(',', ',')
('Sr', 'Sr')
('.', '.')
(',', ',')
('was', 'be')
('a', 'a')
('teenage', 'teenage')
('goatherd', 'goatherd')
('in', 'in')
('rural', 'rural')
('Kenya', 'Kenya')
(',', ',')
('won', 'win')
('a', 'a')
('scholarship', 'scholarship')
('to', 'to')
('study', 'study')
('in', 'in')
('the', 'the')
('United', 'United')
('States', 'States')
(',', ',')
('and', 'and')
('eventually', 'eventually')
('became', 'become')
('a', 'a')
('senior', 'senior')
('economist', 'economist')
('in', 'in')
('the', 'the')
('Kenyan', 'kenyan')
('government', 'government')
('.', '.')
('Obama', 'Obama')
('’s', '’s')
('mother', 'mother')
(',', ',')
('S.', 'S.')
('Ann', 'Ann')
('Dunham', 'Dunham')
(',', ',')
('grew', 'grow')
('up', 'up')
('in', 'in')
('Kansas', 'Kansas')
(',', ',')
('Texas', 'Texas')
(',', ',')
('and', 'and')
('Washington', 'Washington')
('state', 'state')
('before', 'before')
('her', 'her')
('fa

## Display tree view of words using displacy.render()

In [16]:
displacy.render(obj2, style="dep")  # "dep" (dependency tree) is the default style

In [17]:
displacy.render(obj2,jupyter=True)

## How to get the meaning of any label used by nlp using explain()

In [19]:
sp.explain("PROPN")

'proper noun'

In [20]:
sp.explain("DET")

'determiner'

In [21]:
sp.explain("AUX")

'auxiliary'

In [26]:
sp.explain("GPE")

'Countries, cities, states'
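
When several labels need decoding at once, the individual calls above can be folded into a small glossary. This is only a convenience sketch: the `labels` list is an arbitrary selection, and `explain()` gives `None` for a term it does not know.

```python
import spacy as sp

# Arbitrary selection of POS tags and entity labels seen in this notebook
labels = ["PROPN", "DET", "AUX", "GPE", "NORP", "ORG"]

# Map each label to its description from spaCy's built-in glossary
glossary = {label: sp.explain(label) for label in labels}

for label, meaning in glossary.items():
    print(f"{label:6} {meaning}")
```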

## How to Find out NER (Named Entity Recognition) in a given doc

In [22]:
obj

Obama’s father, Barack Obama, Sr., was a teenage goatherd in rural Kenya, won a scholarship to study in the United States, and eventually became a senior economist in the Kenyan government. Obama’s mother, S. Ann Dunham, grew up in Kansas, Texas, and Washington state before her family settled in Honolulu. In 1960 she and Barack Sr. met in a Russian language class at the University of Hawaii and married less than a year later.

When Obama was age two, Barack Sr. left to study at Harvard University; shortly thereafter, in 1964, Ann and Barack Sr. divorced. (Obama saw his father only one more time, during a brief visit when Obama was 10.) Later Ann remarried, this time to another foreign student, Lolo Soetoro from Indonesia, with whom she had a second child, Maya. Obama lived for several years in Jakarta with his half sister, mother, and stepfather. While there, Obama attended both a government-run school where he received some instruction in Islam and a Catholic private school where he t

In [25]:
for ner in obj.ents:
    print((ner,ner.label_))

(Obama’s, 'PERSON')
(Barack Obama, Sr., 'PERSON')
(Kenya, 'GPE')
(the United States, 'GPE')
(Kenyan, 'NORP')
(Obama’s, 'ORG')
(S. Ann Dunham, 'PERSON')
(Kansas, 'GPE')
(Texas, 'GPE')
(Washington, 'GPE')
(Honolulu, 'GPE')
(1960, 'DATE')
(Barack Sr., 'PERSON')
(Russian, 'LANGUAGE')
(the University of Hawaii, 'ORG')
(less than a year later, 'DATE')
(Obama, 'GPE')
(age two, 'DATE')
(Barack Sr., 'PERSON')
(Harvard University, 'ORG')
(1964, 'DATE')
(Barack Sr., 'PERSON')
(Obama, 'PERSON')
(only one, 'CARDINAL')
(Obama, 'GPE')
(10, 'CARDINAL')
(Lolo Soetoro, 'PERSON')
(Indonesia, 'GPE')
(second, 'ORDINAL')
(Maya, 'PERSON')
(Obama, 'PERSON')
(several years, 'DATE')
(Jakarta, 'GPE')
(half, 'CARDINAL')
(Obama, 'PERSON')
(Islam, 'ORG')
(Catholic, 'NORP')
(Christian, 'NORP')
(Hawaii, 'GPE')
(1971, 'DATE')
(Indonesia, 'GPE')
(Hawaii, 'GPE')
(Soetoro, 'PERSON')
(1980, 'DATE')
(1979, 'DATE')
(Obama, 'GPE')
(Punahou School, 'ORG')
(Honolulu, 'GPE')
(Obama, 'PERSON')
(Occidental College, 'ORG')
(Los An

## Another way to get NER

In [31]:
for w in obj:
    print((w.text, w.ent_type_))

('Obama', 'PERSON')
('’s', 'PERSON')
('father', '')
(',', '')
('Barack', 'PERSON')
('Obama', 'PERSON')
(',', 'PERSON')
('Sr', 'PERSON')
('.', 'PERSON')
(',', '')
('was', '')
('a', '')
('teenage', '')
('goatherd', '')
('in', '')
('rural', '')
('Kenya', 'GPE')
(',', '')
('won', '')
('a', '')
('scholarship', '')
('to', '')
('study', '')
('in', '')
('the', 'GPE')
('United', 'GPE')
('States', 'GPE')
(',', '')
('and', '')
('eventually', '')
('became', '')
('a', '')
('senior', '')
('economist', '')
('in', '')
('the', '')
('Kenyan', 'NORP')
('government', '')
('.', '')
('Obama', 'ORG')
('’s', 'ORG')
('mother', '')
(',', '')
('S.', 'PERSON')
('Ann', 'PERSON')
('Dunham', 'PERSON')
(',', '')
('grew', '')
('up', '')
('in', '')
('Kansas', 'GPE')
(',', '')
('Texas', 'GPE')
(',', '')
('and', '')
('Washington', 'GPE')
('state', '')
('before', '')
('her', '')
('family', '')
('settled', '')
('in', '')
('Honolulu', 'GPE')
('.', '')
('In', '')
('1960', 'DATE')
('she', '')
('and', '')
('Barack', 'PERSON')


## How to filter specific NER from raw data

In [33]:
for w in obj:
    if w.ent_type_ == "PERSON":
        print((w.text,w.ent_type_))

('Obama', 'PERSON')
('’s', 'PERSON')
('Barack', 'PERSON')
('Obama', 'PERSON')
(',', 'PERSON')
('Sr', 'PERSON')
('.', 'PERSON')
('S.', 'PERSON')
('Ann', 'PERSON')
('Dunham', 'PERSON')
('Barack', 'PERSON')
('Sr', 'PERSON')
('.', 'PERSON')
('Barack', 'PERSON')
('Sr', 'PERSON')
('.', 'PERSON')
('Barack', 'PERSON')
('Sr', 'PERSON')
('.', 'PERSON')
('Obama', 'PERSON')
('Lolo', 'PERSON')
('Soetoro', 'PERSON')
('Maya', 'PERSON')
('Obama', 'PERSON')
('Obama', 'PERSON')
('Soetoro', 'PERSON')
('Obama', 'PERSON')
('Obama', 'PERSON')
('William', 'PERSON')
('Shakespeare', 'PERSON')
('Friedrich', 'PERSON')
('Nietzsche', 'PERSON')
('Toni', 'PERSON')
('Morrison', 'PERSON')
('Sidley', 'PERSON')
('Austin', 'PERSON')
('Michelle', 'PERSON')
('Robinson', 'PERSON')


In [34]:
for w in obj:
    if w.ent_type_ == "CARDINAL":
        print((w.text,w.ent_type_))

('only', 'CARDINAL')
('one', 'CARDINAL')
('10', 'CARDINAL')
('half', 'CARDINAL')
('two', 'CARDINAL')
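
The token-level filter prints multi-word names one token at a time ("Barack", "Sr", "."). Filtering `doc.ents` instead keeps each entity as one span. The sketch below assumes a blank pipeline with hand-labelled entities so it runs without a trained model; `ents_with_label` is a hypothetical helper name, and with the notebook's data you would pass `obj` to it.

```python
import spacy as sp
from spacy.tokens import Span

nlp_blank = sp.blank("en")
demo = nlp_blank("Lolo Soetoro met Ann Dunham in Honolulu")

# Hand-labelled entities (assumption: stands in for a trained NER model)
demo.ents = [
    Span(demo, 0, 2, label="PERSON"),  # Lolo Soetoro
    Span(demo, 3, 5, label="PERSON"),  # Ann Dunham
    Span(demo, 6, 7, label="GPE"),     # Honolulu
]

def ents_with_label(doc, label):
    """Return the text of every entity span carrying the given label."""
    return [ent.text for ent in doc.ents if ent.label_ == label]

people = ents_with_label(demo, "PERSON")
print(people)
```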


## Display Named Entity in doc using displacy.render

In [37]:
displacy.render(obj,style="ent")

## How to display only specific ENTITIES in displacy

In [42]:
displacy.render(obj,style="ent",options={"ents":["PERSON"]})  # labels written as in ent.label_

In [44]:
displacy.render(obj,style="ent",options={"ents":["PERSON","CARDINAL","ORG"]})

## How to Save Entities in an HTML page and customize it as needed

In [46]:
html = displacy.render(obj,style="ent",jupyter=False)

In [48]:
with open("display_ner.html","w",encoding="utf-8") as fo:  # UTF-8 keeps the curly quotes and dashes intact
    fo.write(html)

In [49]:
!pwd

/Users/pksoni/Documents/{{Documents}}/Python_Prog/for_pycsr/04_Python_for_DataScience/NLP
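
One possible way to customize the saved file is through the `page` argument and the `colors` option of `displacy.render`. The sketch below assumes a blank pipeline with hand-labelled entities so it runs without a trained model; the colour value and the file name `display_ner_custom.html` are arbitrary choices for illustration.

```python
import spacy as sp
from spacy import displacy
from spacy.tokens import Span

nlp_blank = sp.blank("en")
demo = nlp_blank("Obama attended Harvard University")

# Hand-labelled entities (assumption: stands in for a trained NER model)
demo.ents = [
    Span(demo, 0, 1, label="PERSON"),
    Span(demo, 2, 4, label="ORG"),
]

options = {
    "ents": ["PERSON", "ORG"],     # only highlight these labels
    "colors": {"ORG": "#ffd966"},  # custom background colour for ORG
}

# page=True wraps the markup in a complete, standalone HTML document
page_html = displacy.render(
    demo, style="ent", page=True, jupyter=False, options=options
)

with open("display_ner_custom.html", "w", encoding="utf-8") as fo:
    fo.write(page_html)
```

With the notebook's data, `demo` would simply be replaced by `obj`.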


## Find out the total number of occurrences of ORG entities

In [51]:
# Note: this counts tokens that belong to ORG entities, so a two-word entity counts twice
op = [1 for w in obj if w.ent_type_ == "ORG"]
sum(op)

25

In [52]:
op = [1 for w in obj if w.ent_type_ == "PERSON"]
sum(op)

38

In [53]:
op = [1 for w in obj if w.ent_type_ == "CARDINAL"]
sum(op)

5

In [54]:
op = [1 for w in obj if w.ent_type_ == "GPE"]
sum(op)

27
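
The counts above are token counts: an entity such as "Harvard University" contributes two. To count entity occurrences instead, one option is to tally `doc.ents` with `collections.Counter`. The sketch assumes a blank pipeline with hand-labelled entities so it runs without a trained model; with the notebook's data you would iterate over `obj.ents`.

```python
from collections import Counter

import spacy as sp
from spacy.tokens import Span

nlp_blank = sp.blank("en")
demo = nlp_blank("Obama attended Harvard University and Columbia University.")

# Hand-labelled entities (assumption: stands in for a trained NER model)
demo.ents = [
    Span(demo, 0, 1, label="PERSON"),  # Obama
    Span(demo, 2, 4, label="ORG"),     # Harvard University
    Span(demo, 5, 7, label="ORG"),     # Columbia University
]

# Token-level count, as in the cells above
token_count = sum(1 for tok in demo if tok.ent_type_ == "ORG")

# Span-level count: one per entity occurrence, for every label at once
span_counts = Counter(ent.label_ for ent in demo.ents)

print(token_count, span_counts["ORG"])  # 4 tokens, 2 entities
```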

## Remove stop_words/punctuation using is_stop & is_punct attribute

In [56]:
op = [w.text for w in obj if not w.is_stop and not w.is_punct]
print(op)

['Obama', 'father', 'Barack', 'Obama', 'Sr', 'teenage', 'goatherd', 'rural', 'Kenya', 'won', 'scholarship', 'study', 'United', 'States', 'eventually', 'senior', 'economist', 'Kenyan', 'government', 'Obama', 'mother', 'S.', 'Ann', 'Dunham', 'grew', 'Kansas', 'Texas', 'Washington', 'state', 'family', 'settled', 'Honolulu', '1960', 'Barack', 'Sr', 'met', 'Russian', 'language', 'class', 'University', 'Hawaii', 'married', 'year', 'later', '\n\n', 'Obama', 'age', 'Barack', 'Sr', 'left', 'study', 'Harvard', 'University', 'shortly', '1964', 'Ann', 'Barack', 'Sr', 'divorced', 'Obama', 'saw', 'father', 'time', 'brief', 'visit', 'Obama', '10', 'Later', 'Ann', 'remarried', 'time', 'foreign', 'student', 'Lolo', 'Soetoro', 'Indonesia', 'second', 'child', 'Maya', 'Obama', 'lived', 'years', 'Jakarta', 'half', 'sister', 'mother', 'stepfather', 'Obama', 'attended', 'government', 'run', 'school', 'received', 'instruction', 'Islam', 'Catholic', 'private', 'school', 'took', 'Christian', 'schooling', '\n\n'
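
The list above still contains `'\n\n'` entries, because paragraph breaks are tokens too and are neither stop words nor punctuation. A minimal sketch of one way to drop them uses the `is_space` attribute; it assumes a blank English pipeline, which already provides `is_stop`, `is_punct` and `is_space`, and a short sample text.

```python
import spacy as sp

nlp_blank = sp.blank("en")
demo = nlp_blank("He returned to Hawaii.\n\nObama attended Occidental College.")

# Keep tokens that are not stop words, punctuation, or whitespace
words = [
    tok.text
    for tok in demo
    if not tok.is_stop and not tok.is_punct and not tok.is_space
]
print(words)
```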

# Try at Home
###### Create a list of words/sentences after removing stop_words, then make a sentence

###### Sentence and Word Tokenization