<h3> NLP using spaCy </h3>

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')
nlp, type(nlp)

(<spacy.lang.en.English at 0x7f81e96d15c0>, spacy.lang.en.English)

<h4> Text conversion to tokenized object </h4>

In [2]:
sample_text = 'This tutorial is about Natural Language Processing using spaCY '
sample_doc = nlp(sample_text)
print(sample_doc)
print(type(sample_doc))
print([token.text for token in sample_doc])

This tutorial is about Natural Language Processing using spaCY 
<class 'spacy.tokens.doc.Doc'>
['This', 'tutorial', 'is', 'about', 'Natural', 'Language', 'Processing', 'using', 'spaCY']


<h4> Reading from a file and converting to Doc object </h4>

In [3]:
# Read the file with a context manager so the handle is closed promptly
with open('introduction.txt') as f:
    file_text = f.read()
print(file_text)
file_doc = nlp(file_text)
print(file_doc)
print(type(file_doc))
print([token.text for token in file_doc])

This is an introduction to NLP using spaCY.

This is an introduction to NLP using spaCY.

<class 'spacy.tokens.doc.Doc'>
['This', 'is', 'an', 'introduction', 'to', 'NLP', 'using', 'spaCY', '.', '\n']


<h4> Sentence detection: splitting text into linguistically meaningful units with the 'sents' property of a Doc object </h4>

In [4]:
sample_text = ('ABC is a Python developer currently'
              ' working with a Fin-tech'
              'based in NA. He is interesed in learning '
              'Natural Language Processing.')
print(sample_text)
sample_doc  = nlp(sample_text)
sentences  = list(sample_doc.sents)
print(f'Type->{type(sentences)},  Len->{len(sentences)}')
for sentence in sentences:
    print(sentence)

ABC is a Python developer currently working with a Fin-techbased in NA. He is interesed in learning Natural Language Processing.
Type-><class 'list'>,  Len->2
ABC is a Python developer currently working with a Fin-techbased in NA.
He is interesed in learning Natural Language Processing.


<h4> Using custom boundaries to detect end of a sentence </h4>

In [5]:
def set_custom_boundaries(doc):
    '''Treat '?' as an extra sentence delimiter by marking the token after it as a sentence start.'''
    for token in doc[:-1]:
        if token.text == '?':
            doc[token.i + 1].is_sent_start = True
    return doc

custom_delimiter_text = ('XYZ, can you!?'
                         'never mind i forgot'
                         ' what i was saying. So, do you think'
                         ' we should...?')
custom_nlp = spacy.load('en_core_web_sm')
# spaCy v2 API: the component function is passed directly. In spaCy v3 it must be
# registered with @Language.component and added by its string name instead.
custom_nlp.add_pipe(set_custom_boundaries, before='parser')
custom_doc = custom_nlp(custom_delimiter_text)
custom_sentences = list(custom_doc.sents)
for sentence in custom_sentences:
    print(sentence)

XYZ, can you!?never mind i forgot what i was saying.
So, do you think we should...?
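The boundary rule itself is independent of spaCy: whenever a token equals the delimiter, the next token starts a new sentence. A minimal plain-Python sketch of that rule, assuming a list of strings stands in for the Doc (`split_after` is a hypothetical helper, not part of spaCy):

```python
def split_after(tokens, delimiter="?"):
    """Group tokens into sentences, closing a sentence after each delimiter token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token == delimiter:
            # The delimiter ends the current sentence; the next token starts a new one.
            sentences.append(current)
            current = []
    if current:
        # Keep any trailing tokens that were not closed by a delimiter.
        sentences.append(current)
    return sentences

tokens = ["Can", "you", "?", "Never", "mind", "."]
for sentence in split_after(tokens):
    print(" ".join(sentence))
```

The rule only fires when the delimiter is a token of its own, so the result depends on how the text was tokenized beforehand.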


<h4> Tokenization with the starting character offset of each token </h4>

In [6]:
for token in sample_doc:
    print(token, token.idx)

ABC 0
is 4
a 7
Python 9
developer 16
currently 26
working 36
with 44
a 49
Fin 51
- 54
techbased 55
in 65
NA 68
. 70
He 72
is 75
interesed 78
in 88
learning 91
Natural 100
Language 108
Processing 117
. 127
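The `idx` values above are character offsets into the original string, not token positions. A minimal stdlib sketch of the same idea, assuming a naive regex tokenizer (`simple_tokens` is a hypothetical helper and does not reproduce spaCy's tokenization rules):

```python
import re

def simple_tokens(text):
    """Yield (token, start_offset) pairs: runs of word characters or single punctuation marks."""
    for match in re.finditer(r"\w+|[^\w\s]", text):
        yield match.group(), match.start()

text = "He is learning NLP."
pairs = list(simple_tokens(text))
for token, start in pairs:
    # Slicing the source text at the offset recovers the token itself.
    assert text[start:start + len(token)] == token
    print(token, start)
```

Because the offsets point back into the source string, they can be used to highlight or extract the original text of any token.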


<h4> Various attributes of a token </h4>

In [7]:
import pandas as pd
data = []
for token in sample_doc:
    data.append([token, token.idx, token.text_with_ws, token.is_alpha,
                 token.is_punct, token.is_space, token.shape_, token.is_stop])
columns = ['Token', 'Token_Index', 'Token_with_Whitespace', 'Token_IsAlpha',
           'Token_IsPunctuation', 'Token_IsSpace', 'Token_Shape', 'Token_IsStop']
df = pd.DataFrame(data, columns=columns)
print(df)

         Token  Token_Index Token_with_Whitespace  Token_IsAlpha  \
0          ABC            0                  ABC            True   
1           is            4                   is            True   
2            a            7                    a            True   
3       Python            9               Python            True   
4    developer           16            developer            True   
5    currently           26            currently            True   
6      working           36              working            True   
7         with           44                 with            True   
8            a           49                    a            True   
9          Fin           51                   Fin           True   
10           -           54                     -          False   
11   techbased           55            techbased            True   
12          in           65                   in            True   
13          NA           68                    N
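The `Token_Shape` column comes from `token.shape_`, which abstracts a token into its orthographic pattern (uppercase, lowercase, digit). The sketch below is an approximation written for illustration only: `word_shape` is a hypothetical helper, and the rule of capping repeated characters at four is an assumption about how such shapes are usually shortened, not a statement of spaCy's exact implementation.

```python
def word_shape(text, max_run=4):
    """Map letters to X/x and digits to d, keeping at most max_run repeats of a symbol."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            # Punctuation and other characters are kept as they are.
            mapped = ch
        if mapped == last:
            run += 1
        else:
            run = 1
            last = mapped
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)

for word in ["ABC", "Python", "developer", "123", "-"]:
    print(word, word_shape(word))
```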

<h4> Lemmatization using spaCy </h4>
<h6> E.g., "organizes", "organized" and "organizing" are all forms of "organize"; here, "organize" is the lemma. Lemmatization reduces the inflected forms of a word to that lemma so that they can be analyzed as a single item. </h6>

In [8]:
conference_help_text = ('ABC is helping organize a developer'
                        ' conference on Applications of Natural Language'
                        ' Processing. He keeps organizing local Python meetups'
                        ' and several internal talks at his workplace.')
conference_help_doc = nlp(conference_help_text)
for token in conference_help_doc:
    print(token, token.lemma_)
    

ABC ABC
is be
helping help
organize organize
a a
developer developer
conference conference
on on
Applications Applications
of of
Natural Natural
Language Language
Processing Processing
. .
He -PRON-
keeps keep
organizing organize
local local
Python Python
meetups meetup
and and
several several
internal internal
talks talk
at at
his -PRON-
workplace workplace
. .
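To see why reducing words to their lemma helps, compare counting by surface form with counting by lemma. This is a small stdlib sketch; the (token, lemma) pairs are copied from the output above, and the variable names are illustrative.

```python
from collections import Counter

# (token, lemma) pairs copied from the output above.
token_lemmas = [
    ("helping", "help"),
    ("organize", "organize"),
    ("keeps", "keep"),
    ("organizing", "organize"),
    ("meetups", "meetup"),
    ("talks", "talk"),
]

surface_counts = Counter(token for token, _ in token_lemmas)
lemma_counts = Counter(lemma for _, lemma in token_lemmas)

# Surface forms are counted separately ...
print(surface_counts["organize"], surface_counts["organizing"])
# ... while the lemma groups both forms into a single item.
print(lemma_counts["organize"])
```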


<h4> Word Frequency </h4>
<h6> Find the unique words and the most common words in a sentence or paragraph </h6>

In [9]:
from collections import Counter
complete_text = ('ABC XYZ is a Python developer currently'
     'working for a US-based Fintech company. He is'
     ' interested in learning Natural Language Processing.'
     ' There is a developer conference happening on 21 July'
     ' 2019 in London. It is titled "Applications of Natural'
     ' Language Processing". There is a helpline number '
     ' available at +1-1234567891. ABC is helping organize it.'
     ' He keeps organizing local Python meetups and several'
     ' internal talks at his workplace. Gus is also presenting'
     ' a talk. The talk will introduce the reader about "Use'
     ' cases of Natural Language Processing in Fintech".'
     ' Apart from his work, he is very passionate about music.'
     ' ABC is learning to play the Piano. He has enrolled '
     ' himself in the weekend batch of Great Piano Academy.'
     ' Great Piano Academy is situated in Mayfair or the City'
     ' of London and has world-class piano instructors.')
nlp = spacy.load('en_core_web_sm')
complete_doc = nlp(complete_text)
# Remove stop words and punctuation
words = [token.text for token in complete_doc
         if not token.is_stop and not token.is_punct]
word_frequency = Counter(words)
most_common_words = word_frequency.most_common(5)
print(f'5 most common words-->{most_common_words}')
# Words that occur exactly once
unique_words = [word for word, freq in word_frequency.items() if freq == 1]
print(f'Unique words-->{unique_words}')

5 most common words-->[('ABC', 3), ('Natural', 3), ('Language', 3), ('Processing', 3), ('Piano', 3)]
Unique words-->['XYZ', 'currentlyworking', 'based', 'company', 'interested', 'conference', 'happening', '21', 'July', '2019', 'titled', 'Applications', 'helpline', 'number', 'available', '+1', '1234567891', 'helping', 'organize', 'keeps', 'organizing', 'local', 'meetups', 'internal', 'talks', 'workplace', 'Gus', 'presenting', 'introduce', 'reader', 'Use', 'cases', 'Apart', 'work', 'passionate', 'music', 'play', 'enrolled', 'weekend', 'batch', 'situated', 'Mayfair', 'City', 'world', 'class', 'piano', 'instructors']
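Note that the count is case-sensitive: in the output above 'Piano' appears three times while 'piano' is listed as unique. A minimal sketch of merging such variants by lowercasing before counting, using a small hand-made token list (illustrative data, not the full document):

```python
from collections import Counter

# Hand-made sample; in the notebook these would be the filtered token texts.
words = ["Piano", "Piano", "Piano", "piano", "London", "London"]

case_sensitive = Counter(words)
case_insensitive = Counter(word.lower() for word in words)

print(case_sensitive.most_common(2))
print(case_insensitive.most_common(2))
```

With spaCy tokens, the lowercase form is available as `token.lower_`, so the same normalization can be applied inside the list comprehension that builds `words`.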


<h4> Part-of-speech tagging </h4>
<h6>More explanation <a href="https://spacy.io/api/annotation#pos-tagging">here</a></h6>

In [10]:
for token in sample_doc:
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

ABC NNP PROPN noun, proper singular
is VBZ VERB verb, 3rd person singular present
a DT DET determiner
Python NNP PROPN noun, proper singular
developer NN NOUN noun, singular or mass
currently RB ADV adverb
working VBG VERB verb, gerund or present participle
with IN ADP conjunction, subordinating or preposition
a DT DET determiner
Fin NNP PROPN noun, proper singular
- HYPH PUNCT punctuation mark, hyphen
techbased VBN VERB verb, past participle
in IN ADP conjunction, subordinating or preposition
NA NNP PROPN noun, proper singular
. . PUNCT punctuation mark, sentence closer
He PRP PRON pronoun, personal
is VBZ VERB verb, 3rd person singular present
interesed VBN VERB verb, past participle
in IN ADP conjunction, subordinating or preposition
learning VBG VERB verb, gerund or present participle
Natural NNP PROPN noun, proper singular
Language NNP PROPN noun, proper singular
Processing NNP PROPN noun, proper singular
. . PUNCT punctuation mark, sentence closer


<h4> Using displaCy</h4>

In [11]:
from spacy import displacy

interest_text = 'He is interested in learning Natural Language Processing.'
interest_doc = nlp(interest_text)
# Outside a notebook, displacy.serve(interest_doc, style='dep') starts a local
# web server and shows the visualisation at localhost:5000.
displacy.render(interest_doc, style='dep', jupyter=True)

<h4> Rule-based matching using spaCy </h4>
  <ul>
  <li> Extract a full name (first and last name) from text </li>
  <li> Extract a phone number from text </li>
  </ul>

In [12]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

def extract_full_name(nlp_doc):
    # Two consecutive proper-noun tokens, e.g. a first name followed by a last name
    pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
    # spaCy v2 signature: add(match ID key, optional callback, pattern)
    matcher.add('FULL_NAME', None, pattern)
    matches = matcher(nlp_doc)
    # Return the text of the first match; returns None if nothing matched
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text

extract_full_name(complete_doc)

'ABC XYZ'

In [13]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
conference_org_text = ('There is a developer conference'
    ' happening on 21 July 2019 in London. It is titled'
     ' "Applications of Natural Language Processing".'
     ' There is a helpline number available'
     ' at (123) 456-789')

# ORTH  -> the exact text of the token
# SHAPE -> the token's orthographic shape, e.g. 'ddd' for three digits
# OP    -> operator; '?' makes the token optional (it may match 0 or 1 times)
def extract_phone_number(nlp_doc):
    pattern = [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'},
               {'SHAPE': 'ddd'}, {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'ddd'}]
    matcher.add('PHONE NUMBER', None, pattern)
    matches = matcher(nlp_doc)
    # Return the text of the first match; returns None if nothing matched
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text
    
conference_org_doc = nlp(conference_org_text)
extract_phone_number(conference_org_doc)

'(123) 456-789'
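For comparison, the same phone-number layout can be written as a regular expression over the raw string. This is a rough character-level counterpart to the token pattern above, not an equivalent of the Matcher; `PHONE_RE` is an illustrative name.

```python
import re

# "(ddd)", optional whitespace, "ddd", optional hyphen, "ddd"
PHONE_RE = re.compile(r"\(\d{3}\)\s?\d{3}-?\d{3}")

text = "There is a helpline number available at (123) 456-789"
match = PHONE_RE.search(text)
phone = match.group() if match else None
print(phone)
```

The token-based pattern has the advantage that it can also use linguistic attributes such as `POS`, which a plain regular expression cannot see.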

In [14]:
from spacy import displacy
plain_text = "ABC is learning piano"
plain_doc = nlp(plain_text)
displacy.render(plain_doc, style='dep', jupyter=True)

<h4> Named-Entity Recognition </h4>
<h5> Named-entity recognition is the process of locating named entities in unstructured text and classifying them into predefined categories such as person names, organizations, locations, monetary values, percentages and time expressions. </h5>

In [15]:
import pandas as pd
piano_class_info_text = ('Harmony Piano Academy is situated'
                         ' in Mumbai or the City of Dreams'
                         ' and has world class piano instructors.'
                         'Mr. Fernandes is our head-instructor there and he charges around $10 for each session.'
                         ' The classes will start at sharp 09:30 am on Monday morning.'
                         ' The chances of you reaching on time due to heavy traffic is less than 20%')
piano_class_info_doc = nlp(piano_class_info_text)
cols = ['Text','Label','Explain_Label']
rows = []
# A Doc exposes its named entities through the "ents" property
for entry in piano_class_info_doc.ents:
    rows.append([entry.text, entry.label_, spacy.explain(entry.label_)])
df = pd.DataFrame(rows, columns=cols)
print(df)
    

                    Text    Label                            Explain_Label
0  Harmony Piano Academy      ORG  Companies, agencies, institutions, etc.
1                 Mumbai      GPE                Countries, cities, states
2     the City of Dreams      GPE                Countries, cities, states
3             around $10    MONEY          Monetary values, including unit
4               09:30 am     TIME                 Times smaller than a day
5         Monday morning     TIME                 Times smaller than a day
6          less than 20%  PERCENT                Percentage, including "%"
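Once entities are extracted, a common next step is to group them by label. A minimal stdlib sketch, using the (text, label) pairs copied from the table above; `by_label` is an illustrative name.

```python
from collections import defaultdict

# (text, label) pairs copied from the table above.
entities = [
    ("Harmony Piano Academy", "ORG"),
    ("Mumbai", "GPE"),
    ("the City of Dreams", "GPE"),
    ("around $10", "MONEY"),
    ("09:30 am", "TIME"),
    ("Monday morning", "TIME"),
    ("less than 20%", "PERCENT"),
]

by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```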
