
### **Natural Language Processing With spaCy**

In [0]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [0]:
introduction_text = ('This tutorial is about Natural Language Processing in Spacy.')

In [0]:
# create a processed Doc object - a container for accessing linguistic annotations
introduction_doc = nlp(introduction_text)

#### **Sentence Detection**
Sentence Detection is the process of locating the start and end of sentences in a given text.

In [0]:
complete_text = ('Gus Proto is a Python developer currently'
                 'working for a London-based Fintech company. He is'
                 ' interested in learning Natural Language Processing.'
                 ' There is a developer conference happening on 21 July'
                 ' 2019 in London. It is titled "Applications of Natural'
                 ' Language Processing". There is a helpline number '
                 ' available at +1-1234567891. Gus is helping organize it.'
                 ' He keeps organizing local Python meetups and several'
                 ' internal talks at his workplace. Gus is also presenting'
                 ' a talk. The talk will introduce the reader about "Use'
                 ' cases of Natural Language Processing in Fintech".'
                 ' Apart from his work, he is very passionate about music.'
                 ' Gus is learning to play the Piano. He has enrolled '
                 ' himself in the weekend batch of Great Piano Academy.'
                 ' Great Piano Academy is situated in Mayfair or the City'
                 ' of London and has world-class piano instructors.')

In [6]:
complete_doc = nlp(complete_text)
sentences = list(complete_doc.sents)
len(sentences)

13

In [7]:
for sentence in sentences:
  print(sentence)

Gus Proto is a Python developer currentlyworking for a London-based Fintech company.
He is interested in learning Natural Language Processing.
There is a developer conference happening on 21 July 2019 in London.
It is titled "Applications of Natural Language Processing".
There is a helpline number  available at +1-1234567891.
Gus is helping organize it.
He keeps organizing local Python meetups and several internal talks at his workplace.
Gus is also presenting a talk.
The talk will introduce the reader about "Use cases of Natural Language Processing in Fintech".
Apart from his work, he is very passionate about music.
Gus is learning to play the Piano.
He has enrolled  himself in the weekend batch of Great Piano Academy.
Great Piano Academy is situated in Mayfair or the City of London and has world-class piano instructors.
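
To make "locating the start and end of sentences" concrete, here is a minimal plain-Python sketch. It is not how spaCy detects sentences (its trained pipeline uses the dependency parse). It assumes a sentence ends at `.`, `!` or `?`, optionally followed by a closing quote, when whitespace or the end of the text comes next. The names `sentence_spans` and `sample` are made up for illustration.

```python
import re

def sentence_spans(text):
    """Return (start, end) character offsets of naively detected sentences."""
    spans = []
    start = 0
    # Assumed rule: a terminator, an optional closing quote, then whitespace or end of text.
    for match in re.finditer(r'[.!?]["\']?(?=\s|$)', text):
        end = match.end()
        spans.append((start, end))
        # Skip the whitespace that separates this sentence from the next one.
        start = end
        while start < len(text) and text[start].isspace():
            start += 1
    if start < len(text):
        spans.append((start, len(text)))  # trailing text without a terminator
    return spans

sample = ('Gus is helping organize it. It is titled "Applications of NLP". '
          'Gus is also presenting a talk.')
for start, end in sentence_spans(sample):
    print(start, end, sample[start:end])
```

A rule this simple also splits after abbreviations such as `Dr.`, which is one reason to rely on a trained pipeline for real text.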


#### **Tokenization**
Tokenization splits the text into its basic units, called tokens: words, numbers, and punctuation marks.

In [8]:
print([token for token in complete_doc])

[Gus, Proto, is, a, Python, developer, currentlyworking, for, a, London, -, based, Fintech, company, ., He, is, interested, in, learning, Natural, Language, Processing, ., There, is, a, developer, conference, happening, on, 21, July, 2019, in, London, ., It, is, titled, ", Applications, of, Natural, Language, Processing, ", ., There, is, a, helpline, number,  , available, at, +1, -, 1234567891, ., Gus, is, helping, organize, it, ., He, keeps, organizing, local, Python, meetups, and, several, internal, talks, at, his, workplace, ., Gus, is, also, presenting, a, talk, ., The, talk, will, introduce, the, reader, about, ", Use, cases, of, Natural, Language, Processing, in, Fintech, ", ., Apart, from, his, work, ,, he, is, very, passionate, about, music, ., Gus, is, learning, to, play, the, Piano, ., He, has, enrolled,  , himself, in, the, weekend, batch, of, Great, Piano, Academy, ., Great, Piano, Academy, is, situated, in, Mayfair, or, the, City, of, London, and, has, world, -, class, pia
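
As a rough illustration of what a tokenizer produces, the sketch below splits text into word and punctuation tokens with a regular expression and records each token's character offset (spaCy exposes the offset as `token.idx`). The regex is a stand-in, not spaCy's rule set; `TOKEN_RE` and `simple_tokenize` are hypothetical names.

```python
import re

# Assumed rule: a token is either a run of word characters or a single
# non-space, non-word character (punctuation).
TOKEN_RE = re.compile(r'\w+|[^\w\s]')

def simple_tokenize(text):
    """Return (token_text, start_offset) pairs."""
    return [(m.group(), m.start()) for m in TOKEN_RE.finditer(text)]

for text, offset in simple_tokenize('a London-based Fintech company.'):
    print(offset, text)
```

Like spaCy, this splits `London-based` into three tokens, but it would also split `+1` into `+` and `1`, whereas the output above keeps `+1` together. Language-specific rules and exceptions are what a real tokenizer adds.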

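#### **Stop Words**
Stop words are the most common words in a language, such as *the*, *is*, and *a*. They are often removed because they carry little meaning on their own.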

In [9]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)

326

In [10]:
complete_no_stopword_doc = [token for token in complete_doc if not token.is_stop]
print(complete_no_stopword_doc)

[Gus, Proto, Python, developer, currentlyworking, London, -, based, Fintech, company, ., interested, learning, Natural, Language, Processing, ., developer, conference, happening, 21, July, 2019, London, ., titled, ", Applications, Natural, Language, Processing, ", ., helpline, number,  , available, +1, -, 1234567891, ., Gus, helping, organize, ., keeps, organizing, local, Python, meetups, internal, talks, workplace, ., Gus, presenting, talk, ., talk, introduce, reader, ", Use, cases, Natural, Language, Processing, Fintech, ", ., Apart, work, ,, passionate, music, ., Gus, learning, play, Piano, ., enrolled,  , weekend, batch, Great, Piano, Academy, ., Great, Piano, Academy, situated, Mayfair, City, London, world, -, class, piano, instructors, .]


#### **Lemmatization**
Lemmatization reduces the inflected forms of a word to a common base form. The reduced form, or root word, is called a lemma.

In [11]:
print([token.lemma_ for token in complete_no_stopword_doc])

['Gus', 'Proto', 'Python', 'developer', 'currentlyworke', 'London', '-', 'base', 'Fintech', 'company', '.', 'interested', 'learn', 'Natural', 'Language', 'Processing', '.', 'developer', 'conference', 'happen', '21', 'July', '2019', 'London', '.', 'title', '"', 'Applications', 'Natural', 'Language', 'Processing', '"', '.', 'helpline', 'number', ' ', 'available', '+1', '-', '1234567891', '.', 'Gus', 'help', 'organize', '.', 'keep', 'organize', 'local', 'Python', 'meetup', 'internal', 'talk', 'workplace', '.', 'Gus', 'present', 'talk', '.', 'talk', 'introduce', 'reader', '"', 'use', 'case', 'Natural', 'Language', 'Processing', 'Fintech', '"', '.', 'apart', 'work', ',', 'passionate', 'music', '.', 'Gus', 'learn', 'play', 'Piano', '.', 'enrol', ' ', 'weekend', 'batch', 'Great', 'Piano', 'Academy', '.', 'Great', 'Piano', 'Academy', 'situate', 'Mayfair', 'City', 'London', 'world', '-', 'class', 'piano', 'instructor', '.']
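
The simplest way to picture lemmatization is a lookup from inflected form to lemma. The sketch below uses a tiny hand-written table (`LEMMA_TABLE`, a hypothetical name with made-up coverage). It is only meant to show the idea; spaCy's lemmatizer relies on much larger resources and on part-of-speech information.

```python
# Hypothetical, hand-written lookup table covering a few words from the text.
LEMMA_TABLE = {
    'keeps': 'keep',
    'organizing': 'organize',
    'organized': 'organize',
    'meetups': 'meetup',
    'talks': 'talk',
}

def lookup_lemma(word):
    """Return the lemma from the table, or the word unchanged if it is unknown."""
    return LEMMA_TABLE.get(word.lower(), word)

words = ['keeps', 'organizing', 'local', 'Python', 'meetups']
print([lookup_lemma(w) for w in words])
# ['keep', 'organize', 'local', 'Python', 'meetup']
```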


#### **Word Frequency**

In [12]:
from collections import Counter
words = [token.text for token in complete_doc if not token.is_stop and not token.is_punct]
word_freq = Counter(words)
word_freq.most_common(5)

[('Gus', 4), ('London', 3), ('Natural', 3), ('Language', 3), ('Processing', 3)]
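
Counting `token.text` is case-sensitive, so `Piano` and `piano` are tallied as different words. If that is not what you want, normalize the case before counting. The sketch below shows the idea on plain strings; the mini stop list `STOP` and the regex word splitter are assumptions made for the example, not spaCy components.

```python
from collections import Counter
import re

text = ('Gus is learning to play the Piano. Great Piano Academy has '
        'world-class piano instructors.')

STOP = {'is', 'to', 'the', 'has'}  # hypothetical mini stop list

# Lowercase first so that 'Piano' and 'piano' fall into the same bucket.
words = re.findall(r'[a-z]+', text.lower())
freq = Counter(w for w in words if w not in STOP)
print(freq.most_common(1))
# [('piano', 3)]
```

With spaCy tokens, counting `token.lower_` or `token.lemma_` instead of `token.text` has a similar effect.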

#### **Part of Speech Tagging**
There are eight [parts of speech](https://spacy.io/api/annotation#pos-tagging):

- Noun ```NN, NNS, NNP, NNPS, NOUN, PROPN```
- Pronoun ```PRP, PRON```
- Adjective ```JJ, JJR, JJS, ADJ```
- Verb ```VB, VBD, VBG, VBP, VBZ, VERB```
- Adverb ```RB, RBR, RBS, RP, ADV```
- Preposition ```IN, ADP```
- Conjunction ```CC, CCONJ```
- Interjection ```UH, INTJ```

In [13]:
for token in complete_no_stopword_doc[:5]:
  print(token, 
        token.tag_, 
        token.pos_, 
        spacy.explain(token.tag_))

Gus NNP PROPN noun, proper singular
Proto NNP PROPN noun, proper singular
Python NNP PROPN noun, proper singular
developer NN NOUN noun, singular or mass
currentlyworking VBG VERB verb, gerund or present participle
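
Each token carries two tags: a fine-grained one (`tag_`) and a coarse-grained one (`pos_`). The sketch below takes the five rows printed above as plain tuples and groups them both ways. The variable names are made up, and the data is copied from the output rather than recomputed.

```python
from collections import defaultdict

# (text, fine-grained tag, coarse-grained tag) rows copied from the output above
rows = [
    ('Gus', 'NNP', 'PROPN'),
    ('Proto', 'NNP', 'PROPN'),
    ('Python', 'NNP', 'PROPN'),
    ('developer', 'NN', 'NOUN'),
    ('currentlyworking', 'VBG', 'VERB'),
]

words_by_pos = defaultdict(list)  # coarse tag -> words
fine_to_coarse = {}               # fine-grained tag -> coarse tag
for text, tag, pos in rows:
    words_by_pos[pos].append(text)
    fine_to_coarse[tag] = pos

print(dict(words_by_pos))
print(fine_to_coarse)
```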


#### **Rule-Based Matching**
Rule-based matching identifies and extracts tokens and phrases according to lexical patterns (such as the lowercase form) and grammatical features (such as part of speech).

In [17]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

def extract_full_name(nlp_doc):
  pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
  matcher.add('FULL_NAME', [pattern])  # matcher.add(ID key, list of patterns); the older matcher.add(key, None, pattern) form was removed in spaCy v3
  matches = matcher(nlp_doc)   # Find all token sequences matching the supplied patterns on the Doc
  for match_id, start, end in matches:
    span = nlp_doc[start:end]
    print(span.text)

extract_full_name(complete_doc)

Gus Proto
Natural Language
Language Processing
Natural Language
Language Processing
Natural Language
Language Processing
Great Piano
Piano Academy
Great Piano
Piano Academy


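- ```ORTH``` matches the exact text of the token
- ```SHAPE``` matches the token's shape, a transformation of the token string that shows its orthographic features (for example, ```ddd``` for a three-digit token)
- ```OP``` defines an operator that sets how often a token pattern may match; ```'?'``` makes it optional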
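
To see what a shape looks like, here is a small sketch of a shape function: letters become `X` or `x`, digits become `d`, and every other character is kept. Capping runs of the same shape character at four reflects my understanding of spaCy's behaviour and should be treated as an assumption. `word_shape` here is a local helper, not an import from spaCy.

```python
def word_shape(text, max_run=4):
    """Map a token string to a shape string such as 'Xxxxx' or 'ddd'."""
    shape = []
    last = ''
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            shape_char = 'd'
        else:
            shape_char = char  # punctuation and symbols are kept as they are
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        if run <= max_run:  # assumed cap on repeated shape characters
            shape.append(shape_char)
    return ''.join(shape)

for token in ['(', '123', ')', '456', '-', '789', 'London']:
    print(repr(token), '->', repr(word_shape(token)))
```

This is why a pattern of `{'SHAPE': 'ddd'}` matches three-digit tokens such as `123` but not a four-digit token.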

In [19]:
matcher = Matcher(nlp.vocab)
conference_org_text = ('There is a developer conference'
     ' happening on 21 July 2019 in London. It is titled'
     ' "Applications of Natural Language Processing".'
     ' There is a helpline number available'
     ' at (123) 456-789')

def extract_phone_number(nlp_doc):
     pattern = [{'ORTH': '('}, {'SHAPE': 'ddd'},
                {'ORTH': ')'}, {'SHAPE': 'ddd'},
                {'ORTH': '-', 'OP': '?'},
                {'SHAPE': 'ddd'}]
     matcher.add('PHONE_NUMBER', [pattern])  # patterns are passed as a list
     matches = matcher(nlp_doc)
     for match_id, start, end in matches:
         span = nlp_doc[start:end]
         return span.text

conference_org_doc = nlp(conference_org_text)
extract_phone_number(conference_org_doc)

'(123) 456-789'

#### **Dependency Parsing**
Dependency parsing extracts the dependency structure of a sentence: the relationships between headwords and their dependents. The head of the sentence, called the root, is usually the main verb.


- ```nsubj```: nominal subject. Its headword is a verb.
- ```aux```: auxiliary. Its headword is a verb.
- ```dobj```: direct object. Its headword is a verb.
- ```acomp```: adjectival complement. Its headword is a verb.
- ```pcomp```: complement of a preposition. Its headword is a preposition.
- ```prep```: prepositional modifier.
- ```ROOT```: root of the sentence. It is its own head.

In [20]:
about_interest_text = ('He is interested in learning Natural Language Processing.')
about_interest_doc = nlp(about_interest_text)
for token in about_interest_doc[:5]:
  print(token.text, token.head.text, token.dep_)

He is nsubj
is is ROOT
interested is acomp
in interested prep
learning in pcomp


In [21]:
from spacy import displacy
displacy.render(about_interest_doc, style='dep', jupyter=True)

In [22]:
# extract the children of 'is'
print([token.text for token in about_interest_doc[1].children])

['He', 'interested', '.']


In [23]:
# extract the previous neighboring node of 'is'
print(about_interest_doc[1].nbor(-1))

He


In [24]:
# extract the next neighboring node of 'is'
print(about_interest_doc[1].nbor())

interested


In [25]:
# extract all left/right tokens of 'is'
print([token.text for token in about_interest_doc[1].lefts])
print([token.text for token in about_interest_doc[1].rights])

['He']
['interested', '.']


In [26]:
print(list(about_interest_doc[1].subtree))

[He, is, interested, in, learning, Natural, Language, Processing, .]
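
The navigation attributes used above (`children`, `lefts`, `rights`, `subtree`) all follow from one piece of information: each token's head. The sketch below stores the first five tokens and their head indices as plain lists, copied from the printed output, and rebuilds those views by hand. It illustrates the data structure and is not spaCy's implementation.

```python
tokens = ['He', 'is', 'interested', 'in', 'learning']
# Index of each token's head; the root ('is') points at itself.
heads = [1, 1, 1, 2, 3]

def children(i):
    """Indices of the tokens whose head is token i."""
    return [j for j, head in enumerate(heads) if head == i and j != i]

def subtree(i):
    """Token i plus all of its descendants, in sentence order."""
    nodes = [i]
    for child in children(i):
        nodes.extend(subtree(child))
    return sorted(nodes)

root = next(i for i, head in enumerate(heads) if head == i)
print('root:', tokens[root])
print('lefts:', [tokens[j] for j in children(root) if j < root])
print('rights:', [tokens[j] for j in children(root) if j > root])
print('subtree of "interested":', [tokens[j] for j in subtree(2)])
```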


#### **Shallow Parsing/Chunking**
Groups adjacent tokens into phrases on the basis of their POS tags. There are some standard well-known chunks such as noun phrases, verb phrases, and prepositional phrases.

- Noun Phrase: has a noun as its head. Helps infer what is being talked about in the sentence.
- Verb Phrase: has at least one verb. Helps understand the actions of the nouns. Detecting verb phrases requires the `textacy` package.

In [27]:
# noun phrase detection
for chunk in about_interest_doc.noun_chunks:
  print(chunk)

He
Natural Language Processing


In [28]:
!pip3 install textacy --progress-bar off
import textacy

In [29]:
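# verb phrase detection
# note: about_talk_text is defined but not used below; the pattern is matched
# against about_interest_text, which produces the output shown
about_talk_text = ('The talk will introduce reader about Use'
                    ' cases of Natural Language Processing in'
                    ' Fintech')
# an optional verb, any number of adverbs, then one or more verbs
pattern = r'(<VERB>?<ADV>*<VERB>+)'
about_interest_doc = textacy.make_spacy_doc(about_interest_text,
                      lang='en_core_web_sm')
verb_phrases = textacy.extract.pos_regex_matches(about_interest_doc, pattern)
for chunk in verb_phrases:
  print(chunk.text)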

learning
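
A pattern such as `<VERB>?<ADV>*<VERB>+` is a regular expression over POS tags rather than over characters. The sketch below shows one way such a pattern could be evaluated with the standard `re` module. It is an illustration, not textacy's implementation, and the POS tags in `tagged` are assigned by hand for the example.

```python
import re

def pos_pattern_matches(tagged, pattern):
    """tagged: list of (word, POS) pairs; pattern: e.g. '<VERB>?<ADV>*<VERB>+'."""
    # Wrap each <TAG> in a non-capturing group so ?, * and + apply to the whole tag.
    regex = re.sub(r'<(\w+)>', r'(?:<\1>)', pattern)
    tag_string = ''.join('<%s>' % pos for _, pos in tagged)
    # Character offset in tag_string at which each token's tag starts.
    starts = []
    offset = 0
    for _, pos in tagged:
        starts.append(offset)
        offset += len(pos) + 2  # +2 for the angle brackets
    phrases = []
    for m in re.finditer(regex, tag_string):
        if m.start() == m.end():
            continue  # ignore empty matches
        first = starts.index(m.start())
        last = starts.index(m.end()) if m.end() < offset else len(tagged)
        phrases.append(' '.join(word for word, _ in tagged[first:last]))
    return phrases

# Hand-assigned tags, for illustration only.
tagged = [('The', 'DET'), ('talk', 'NOUN'), ('will', 'VERB'),
          ('introduce', 'VERB'), ('the', 'DET'), ('reader', 'NOUN')]
print(pos_pattern_matches(tagged, '<VERB>?<ADV>*<VERB>+'))
# ['will introduce']
```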




#### **Named Entity Recognition (NER)**
Named Entity Recognition locates named entities in text and classifies them into predefined categories, such as person names, organizations, and locations.

In [30]:
piano_class_text = ('Great Piano Academy is situated'
                    ' in Mayfair or the City of London and has'
                    ' world-class piano instructors.')
piano_class_doc = nlp(piano_class_text)
for ent in piano_class_doc.ents:
  print(ent.text, ent.label_, spacy.explain(ent.label_))

Great Piano Academy ORG Companies, agencies, institutions, etc.
Mayfair GPE Countries, cities, states
the City of London GPE Countries, cities, states


In [31]:
displacy.render(piano_class_doc, style='ent', jupyter=True)

In [33]:
# redact people’s names from a text
survey_text = ('Out of 5 people surveyed, James Robert,'
                ' Julie Fuller and Benjamin Brooks like'
                ' apples. Kelly Cox and Matthew Evans'
                ' like oranges.')

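def replace_person_names(token):
     if token.ent_iob != 0 and token.ent_type_ == 'PERSON':
         return '[REDACTED] '
     return token.text_with_ws  # token text plus its trailing whitespace, if any

def redact_names(nlp_doc):
     # merge each entity into a single token so a full name is redacted once
     # (ent.merge() and token.string are no longer available in spaCy v3)
     with nlp_doc.retokenize() as retokenizer:
         for ent in nlp_doc.ents:
             retokenizer.merge(ent)
     tokens = map(replace_person_names, nlp_doc)
     return ''.join(tokens)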

survey_doc = nlp(survey_text)
redact_names(survey_doc)

'Out of 5 people surveyed, [REDACTED] , [REDACTED] and [REDACTED] like apples. [REDACTED] and [REDACTED] like oranges.'
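
Redaction can also be done on character offsets instead of merged tokens, which avoids the stray space before punctuation in the output above. The sketch below works on plain `(start, end)` spans. With spaCy these could come from `ent.start_char` and `ent.end_char` of the `PERSON` entities, but here the spans are built by hand from a list of names, so that part is an assumption of the example.

```python
def redact_spans(text, spans, placeholder='[REDACTED]'):
    """Replace each (start, end) character span in text with a placeholder."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])  # text before the span is kept
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])           # text after the last span
    return ''.join(pieces)

text = 'Kelly Cox and Matthew Evans like oranges.'
names = ['Kelly Cox', 'Matthew Evans']  # stand-in for detected PERSON entities
spans = [(text.index(name), text.index(name) + len(name)) for name in names]
print(redact_spans(text, spans))
# [REDACTED] and [REDACTED] like oranges.
```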