# NLP Examples

# NLTK

## Use Corpus

In [1]:
import nltk
nltk.download('book')

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/rootstrap/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/rootstrap/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/rootstrap/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/rootstrap/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/rootstrap/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/rootstrap/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nltk_data]    | ...

True

In [2]:
from nltk.book import *
print(text1)
print(text2)

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
<Text: Moby Dick by Herman Melville 1851>
<Text: Sense and Sensibility by Jane Austen 1811>


A concordance view shows every occurrence of a given word, together with some surrounding context.

In [3]:
text1.concordance("monstrous")

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
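
To make the idea concrete, here is a minimal, library-free sketch of what a concordance does. It is not NLTK's implementation: the function name `simple_concordance`, the `width` parameter, and the sample sentence are assumptions made for illustration only.

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:  # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Hypothetical sample; in the notebook the tokens would come from text1.
sample = "one was of a most monstrous size and a monstrous bulk".split()
for line in simple_concordance(sample, "monstrous"):
    print(line)
```

The real `concordance` method measures context in characters rather than tokens, which is why its output lines start and end mid-word.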


In [4]:
text1.similar("monstrous")

true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless


In [5]:
text2.collocations()

Colonel Brandon; Sir John; Lady Middleton; Miss Dashwood; every thing;
thousand pounds; dare say; Miss Steeles; said Elinor; Miss Steele;
every body; John Dashwood; great deal; Harley Street; Berkeley Street;
Miss Dashwoods; young man; Combe Magna; every day; next morning


## Classify text

Example taken from https://www.nltk.org/book/ch06.html

Classify first names as male or female, using the last letters of each name as features.

In [6]:
import random

import nltk
from nltk.corpus import names

# Label every name in the corpus with its gender, then shuffle the list.
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

def gender_features(word):
    # Features: the last letter and the last two letters of the name.
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

gender_features('Shrek')  # {'suffix1': 'k', 'suffix2': 'ek'}

# Split: first 500 names for testing, next 1000 for dev-test, the rest for training.
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.775
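
As a rough illustration of what a naive Bayes classifier does with these suffix features, the sketch below trains on a handful of names and picks the label with the highest log score. It is an assumption-laden toy, not NLTK's implementation: the helper names `train` and `classify`, the add-one smoothing, and the six-name dataset are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def gender_features(word):
    # Same features as the notebook cell: last letter and last two letters.
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}


def train(labeled_featuresets):
    """Count labels and, per (label, feature name), how often each value occurs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
    return label_counts, value_counts


def classify(features, label_counts, value_counts):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)
        for name, value in features.items():
            seen = value_counts[(label, name)]
            # Add-one smoothing; the extra +1 reserves mass for unseen values.
            score += math.log((seen[value] + 1) / (count + len(seen) + 1))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, far smaller than the names corpus.
toy_names = [('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
             ('John', 'male'), ('Mark', 'male'), ('Peter', 'male')]
toy_train = [(gender_features(n), g) for n, g in toy_names]
label_counts, value_counts = train(toy_train)
print(classify(gender_features('Sofia'), label_counts, value_counts))
print(classify(gender_features('Erik'), label_counts, value_counts))
```

With so little data the result depends almost entirely on the final letter, which is also the strongest signal in the full corpus.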


# TextBlob

Example from https://textblob.readthedocs.io/en/dev/ 


In [7]:
from textblob import TextBlob
text = '''.Qué ignorante... ¿qué investigación se puede hacer hoy sin recursos? 
Estamos hablando de ciencia y conocimientos, no de labura para mantenerte.
Por esos criterios baratos no se avanza'''

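blob = TextBlob(text)
# Translate the Spanish text to English before analysing it. translate() calls
# an online translation service and may be unavailable in recent TextBlob releases.
blob = blob.translate(to='en')
print(blob.tags)          # part-of-speech tags
print(blob.noun_phrases)  # noun phrases

# Sentiment polarity per sentence, from -1.0 (negative) to 1.0 (positive)
for sentence in blob.sentences:
    print(sentence, sentence.sentiment.polarity)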


[('How', 'WRB'), ('ignorant', 'JJ'), ('what', 'WP'), ('research', 'NN'), ('can', 'MD'), ('be', 'VB'), ('done', 'VBN'), ('today', 'NN'), ('without', 'IN'), ('resources', 'NNS'), ('We', 'PRP'), ('are', 'VBP'), ('talking', 'VBG'), ('about', 'IN'), ('science', 'NN'), ('and', 'CC'), ('knowledge', 'NN'), ('not', 'RB'), ('about', 'IN'), ('work', 'NN'), ('to', 'TO'), ('maintain', 'VB'), ('you', 'PRP'), ('For', 'IN'), ('those', 'DT'), ('cheap', 'JJ'), ('criteria', 'NNS'), ('there', 'EX'), ('is', 'VBZ'), ('no', 'DT'), ('progress', 'NN')]
['ignorant ...', 'cheap criteria']
How ignorant ... what research can be done today without resources? 0.0
We are talking about science and knowledge, not about work to maintain you. 0.0
For those cheap criteria there is no progress 0.4


# spaCy

In [8]:
! python -m spacy download es_core_news_sm


✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_sm')


## Sentence Detection
Sentence detection locates the start and end of each sentence in a text. Example from https://realpython.com/natural-language-processing-spacy-python/

In [36]:
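import spacy
nlp = spacy.load('es_core_news_sm')  # Spanish pipeline downloaded above
about_doc = nlp(text)  # `text` is the Spanish string from the TextBlob example
sentences = list(about_doc.sents)
for sentence in sentences:
    print(sentence)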

.Qué ignorante... ¿qué investigación se puede hacer hoy sin recursos? 

Estamos hablando de ciencia y conocimientos, no de labura para mantenerte.

Por esos criterios baratos no se avanza


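## Tokenization
Tokenization is the process of splitting a text into its basic units, called tokens. Calling `nlp(text)` tokenizes the text and runs the rest of the pipeline; the cell below then renders the resulting dependency parse with displaCy.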

In [28]:
import spacy
from spacy import displacy
nlp = spacy.load('es_core_news_sm')
doc = nlp(text)
displacy.render(doc, style='dep', options={'distance':140}, jupyter=True)
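
For intuition only, tokenization can be approximated with a regular expression. The sketch below is an assumption, not how spaCy tokenizes: the names `TOKEN_PATTERN` and `simple_tokenize` are hypothetical, and the rule is much cruder than spaCy's language-specific rules (for example, it splits `...` into three tokens).

```python
import re

# A token is either a run of word characters (accented letters included)
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("¿qué investigación se puede hacer hoy sin recursos?"))
```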

## Stop words

Stop words are the most common words in a language. They are often removed before analysis because they carry little information. spaCy ships a built-in list for Spanish; a custom list can also be loaded from a file.

In [38]:
spacy_stopwords = spacy.lang.es.stop_words.STOP_WORDS
len(spacy_stopwords)

551

In [42]:
for stop_word in list(spacy_stopwords)[:3]:
    print(stop_word)

mis
esos
cuánta
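
Filtering stop words out of a token list can be sketched as follows. The set `SAMPLE_STOP_WORDS` is a hypothetical hand-picked subset, and `remove_stop_words` is an illustrative helper, not a spaCy function; with spaCy one would typically check `token.is_stop` instead.

```python
# Hypothetical hand-picked subset; spaCy's Spanish list has several hundred entries.
SAMPLE_STOP_WORDS = {"de", "no", "se", "y", "para", "por", "esos"}


def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "Por esos criterios baratos no se avanza".split()
print(remove_stop_words(tokens))
```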


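## Lemmatization
Lemmatization reduces each token to its base (dictionary) form, called the lemma. The small model is not perfect: in the output below, the preposition `para` is lemmatized as `parir`.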

In [40]:
for token in doc:
    print(token, token.lemma_)

.Qué .Qué
ignorante ignorante
... ...
¿ ¿
qué qué
investigación investigación
se se
puede poder
hacer hacer
hoy hoy
sin sin
recursos recurso
? ?

 

Estamos Estamos
hablando hablar
de de
ciencia ciencia
y y
conocimientos conocimiento
, ,
no no
de de
labura labura
para parir
mantenerte mantenerte
. .

 

Por Por
esos ese
criterios criterio
baratos barato
no no
se se
avanza avanzar
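
The simplest form of lemmatization is a dictionary lookup. The sketch below is a toy under stated assumptions: `LEMMA_TABLE` and `lookup_lemma` are hypothetical names, and the table only contains pairs copied from the spaCy output above. Real lemmatizers combine much larger tables with rules or statistical models.

```python
# Toy lookup table; the pairs are copied from the spaCy output above.
LEMMA_TABLE = {
    "puede": "poder", "recursos": "recurso", "hablando": "hablar",
    "conocimientos": "conocimiento", "esos": "ese", "criterios": "criterio",
    "baratos": "barato", "avanza": "avanzar",
}


def lookup_lemma(token):
    """Return the lemma from the table, or the token itself when it is unknown."""
    return LEMMA_TABLE.get(token.lower(), token)


for word in ["criterios", "baratos", "avanza", "hoy"]:
    print(word, lookup_lemma(word))
```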


## Word frequency

In [47]:
from collections import Counter
complete_text = ('Gus Proto is a Python developer currently '
     'working for a London-based Fintech company. He is'
     ' interested in learning Natural Language Processing.'
     ' There is a developer conference happening on 21 July'
     ' 2019 in London. It is titled "Applications of Natural'
     ' Language Processing". There is a helpline number '
     ' available at +1-1234567891. Gus is helping organize it.'
     ' He keeps organizing local Python meetups and several'
     ' internal talks at his workplace. Gus is also presenting'
     ' a talk. The talk will introduce the reader about "Use'
     ' cases of Natural Language Processing in Fintech".'
     ' Apart from his work, he is very passionate about music.'
    ' Gus is learning to play the Piano. He has enrolled '
    ' himself in the weekend batch of Great Piano Academy.'
    ' Great Piano Academy is situated in Mayfair or the City'
    ' of London and has world-class piano instructors.')

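# Note: `nlp` is still the Spanish pipeline loaded earlier, so `token.is_stop`
# checks Spanish stop words. That is why the output is dominated by English
# function words such as "is", "a" and "in". For English text, load an
# English pipeline instead (for example `en_core_web_sm`).
complete_doc = nlp(complete_text)
words = [token.text for token in complete_doc
         if not token.is_stop and not token.is_punct]
word_freq = Counter(words)
common_words = word_freq.most_common(5)
print(common_words)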

[('is', 10), ('a', 5), ('in', 5), ('Gus', 4), ('of', 4)]
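
The output above is dominated by English function words because the Spanish pipeline does not treat them as stop words. As a rough, library-free illustration of the intended behaviour, the sketch below counts words after removing a few English stop words. The set `ENGLISH_STOP_WORDS`, the helper `top_words`, and the shortened sample text are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical hand-picked English stop words; real lists are much longer.
ENGLISH_STOP_WORDS = {"is", "a", "in", "of", "the", "he", "and", "at", "to"}


def top_words(text, n=3):
    """Return the n most frequent words that are not stop words."""
    tokens = re.findall(r"[A-Za-z]+", text)  # letters only, so punctuation is dropped
    kept = [t for t in tokens if t.lower() not in ENGLISH_STOP_WORDS]
    return Counter(kept).most_common(n)


sample = ("Gus is learning to play the Piano. Gus is helping organize a talk. "
          "The talk is in London.")
print(top_words(sample, n=2))
```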


## Part of Speech
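With this Spanish model, the fine-grained `token.tag_` values printed by the cell below combine a coarse tag with morphological features, for example `NOUN__Gender=Fem|Number=Sing`. The following sketch splits such a string into its parts. The helper `parse_tag` is hypothetical and assumes exactly the `TAG__Key=Value|Key=Value` layout seen in this notebook's output; other models or spaCy versions may format tags differently.

```python
def parse_tag(tag):
    """Split a tag like 'NOUN__Gender=Fem|Number=Sing' into (coarse tag, features)."""
    coarse, _, morph = tag.partition("__")
    features = {}
    if morph:
        for pair in morph.split("|"):
            key, _, value = pair.partition("=")
            features[key] = value
    return coarse, features


print(parse_tag("NOUN__Gender=Fem|Number=Sing"))
print(parse_tag("ADV"))
```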

In [51]:
for token in doc:
    print(token, token.tag_, token.pos_)

.Qué PUNCT__PunctType=Peri PUNCT
ignorante ADV ADV
... PUNCT__PunctType=Comm PUNCT
¿ PUNCT__PunctSide=Ini|PunctType=Qest PUNCT
qué DET__PronType=Int,Rel DET
investigación NOUN__Gender=Fem|Number=Sing NOUN
se PRON__Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes PRON
puede AUX__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin AUX
hacer VERB__VerbForm=Inf VERB
hoy ADV ADV
sin ADP__AdpType=Prep ADP
recursos NOUN__Gender=Masc|Number=Plur NOUN
? PUNCT__PunctSide=Fin|PunctType=Qest PUNCT

 _SP SPACE
Estamos AUX__Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin AUX
hablando VERB__VerbForm=Ger VERB
de ADP__AdpType=Prep ADP
ciencia NOUN__Gender=Fem|Number=Sing NOUN
y CCONJ CCONJ
conocimientos NOUN__Gender=Masc|Number=Plur NOUN
, PUNCT__PunctType=Comm PUNCT
no ADV__Polarity=Neg ADV
de ADP__AdpType=Prep ADP
labura NOUN__Gender=Fem|Number=Sing NOUN
para ADP__AdpType=Prep ADP
mantenerte NOUN__Gender=Masc|Number=Sing NOUN
. PUNCT__PunctType=Peri PUNCT

 _SP SPACE
Por ADP__AdpTyp