# Intro to Natural Language Processing with Python

### What are we covering today?
- What is NLP?
- Options for NLP in Python
- Word tokenization
- Sentence tokenization
- Noun phrase extraction
- Part-of-speech tagging
- Find all nouns
- Word transformations (lemmatization, pluralization)
- Look for sentences of particular lengths
- Sentiment analysis


TODO: What is NLP

In [7]:
from textblob import TextBlob
import nltk
nltk.download()  # opens the interactive NLTK downloader so the corpora TextBlob relies on can be fetched

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [8]:
text = "Natural language processing with Python is so much fun!"
blob = TextBlob(text)

In [15]:
blob.words

WordList(['Natural', 'language', 'processing', 'with', 'Python', 'is', 'so', 'much', 'fun'])

In [9]:
blob.tags

[('Natural', 'JJ'),
 ('language', 'NN'),
 ('processing', 'NN'),
 ('with', 'IN'),
 ('Python', 'NNP'),
 ('is', 'VBZ'),
 ('so', 'RB'),
 ('much', 'JJ'),
 ('fun', 'NN')]

In [42]:
for word, pos in blob.tags:
    print(word, pos)

Natural JJ
language NN
processing NN
with IN
Python NNP
is VBZ
so RB
much JJ
fun NN


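These are Penn Treebank part-of-speech tags (for example, NN is a singular noun and VBZ is a third-person singular present-tense verb). For a full list of what the tags mean, see http://www.clips.ua.ac.be/pages/mbsp-tags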
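
The outline lists "Find all nouns", which follows directly from the tags. A minimal sketch, assuming the tag list shown in the output above (copied in by hand here so the snippet runs without TextBlob; with the notebook state available, `tags = blob.tags` should give the same list):

```python
# Tags copied from the blob.tags output above.
tags = [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'),
        ('with', 'IN'), ('Python', 'NNP'), ('is', 'VBZ'), ('so', 'RB'),
        ('much', 'JJ'), ('fun', 'NN')]

# Penn Treebank noun tags all start with "NN": NN, NNS, NNP, NNPS.
nouns = [word for word, pos in tags if pos.startswith('NN')]
print(nouns)  # ['language', 'processing', 'Python', 'fun']
```

Matching on the `NN` prefix keeps singular, plural, and proper nouns in one pass; compare against an exact tag instead if only one kind is wanted.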

In [10]:
blob.noun_phrases


WordList(['natural language processing', 'python'])

In [18]:
para = "Here is my first sentence. I am really sad. People are so angry. Life is fantastic. It's really awesome to have three sentences though."

In [19]:
blob2 = TextBlob(para)

In [20]:
blob2.sentences

[Sentence("Here is my first sentence."),
 Sentence("I am really sad."),
 Sentence("People are so angry."),
 Sentence("Life is fantastic."),
 Sentence("It's really awesome to have three sentences though.")]
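
The outline also mentions looking for sentences of particular lengths. A rough sketch of that filter, assuming a regex split as a stand-in for `blob2.sentences` and a regex word match as a stand-in for `sentence.words` (so counts may differ slightly from TextBlob's tokenizer, for example on contractions). The `min_words` threshold is a hypothetical value chosen for illustration:

```python
import re

para = ("Here is my first sentence. I am really sad. People are so angry. "
        "Life is fantastic. It's really awesome to have three sentences though.")

# Stand-in for blob2.sentences: split on whitespace that follows ., ! or ?
sentences = re.split(r'(?<=[.!?])\s+', para.strip())

def word_count(sentence):
    """Stand-in for len(sentence.words): count runs of letters and apostrophes."""
    return len(re.findall(r"[A-Za-z']+", sentence))

min_words = 5  # hypothetical threshold
long_sentences = [s for s in sentences if word_count(s) >= min_words]

for s in long_sentences:
    print(word_count(s), s)
```

With the notebook's own objects, the same idea would be a comprehension over `blob2.sentences` that keeps sentences whose `len(sentence.words)` meets the threshold.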

In [21]:
blob.sentiment

Sentiment(polarity=0.2375, subjectivity=0.30000000000000004)

In [23]:
blob.sentiment.polarity

0.2375

In [26]:
for sentence in blob2.sentences:
    print(sentence.sentiment)

Sentiment(polarity=0.25, subjectivity=0.3333333333333333)
Sentiment(polarity=-0.5, subjectivity=1.0)
Sentiment(polarity=-0.5, subjectivity=1.0)
Sentiment(polarity=0.4, subjectivity=0.9)
Sentiment(polarity=1.0, subjectivity=1.0)


In [27]:
for sentence in blob2.sentences:
    print(sentence, sentence.sentiment.polarity)

Here is my first sentence. 0.25
I am really sad. -0.5
People are so angry. -0.5
Life is fantastic. 0.4
It's really awesome to have three sentences though. 1.0


In [28]:
blob.words[1].pluralize()

'languages'

In [35]:
sent2 = "Squids are huge with many tentacles."
blob3 = TextBlob(sent2)

In [38]:
blob3.words[1].lemmatize('v')

'be'

In [39]:
blob3.words[-1].lemmatize()

'tentacle'

In [40]:
blob3.parse()

'Squids/NNP/B-NP/O are/VBP/B-VP/O huge/JJ/B-ADJP/O with/IN/B-PP/B-PNP many/JJ/B-NP/I-PNP tentacles/NNS/I-NP/I-PNP ././O/O'

In [43]:
from textblob import Word

In [45]:
w = Word("octopi")
w.lemmatize()

'octopus'

In [48]:
bad_text = "This is poory spelled wrd"
blob4 = TextBlob(bad_text)
for word in blob4.words:
    print(word.spellcheck())

[('His', 0.7117826487905228), ('This', 0.28821735120947717)]
[('is', 1.0)]
[('poor', 0.9626865671641791), ('poorly', 0.03731343283582089)]
[('spelled', 1.0)]
[('word', 0.8948948948948949), ('ward', 0.03903903903903904), ('rd', 0.03003003003003003), ('wry', 0.015015015015015015), ('ord', 0.009009009009009009), ('wad', 0.006006006006006006), ('wid', 0.003003003003003003), ('wed', 0.003003003003003003)]


In [56]:
repetitive = "John likes to drink sweet Soda. Jane also likes soda. John and Jane both like soda."
blob5 = TextBlob(repetitive)
blob5.word_counts['soda']

3

In [57]:
blob5.words.count('soda', case_sensitive=True)

2

In [58]:
blob5.noun_phrases.count('john')

2
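
The two counts for "soda" differ (3 versus 2) because `word_counts` ignores case while `count(..., case_sensitive=True)` does not, so the capitalized "Soda" only matches in the first case. The same distinction can be sketched with the standard library, assuming a simple regex tokenizer in place of `blob5.words`:

```python
import re
from collections import Counter

repetitive = ("John likes to drink sweet Soda. Jane also likes soda. "
              "John and Jane both like soda.")

# Stand-in for blob5.words: runs of letters and apostrophes.
words = re.findall(r"[A-Za-z']+", repetitive)

case_insensitive = Counter(w.lower() for w in words)  # "Soda" folds into "soda"
case_sensitive = Counter(words)                       # "Soda" and "soda" stay separate

print(case_insensitive['soda'])  # 3
print(case_sensitive['soda'])    # 2
```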

TextBlob can also handle some translation and language detection. Do we want to talk about that at all?

In [59]:
blob5.ngrams(n=3)

[WordList(['John', 'likes', 'to']),
 WordList(['likes', 'to', 'drink']),
 WordList(['to', 'drink', 'sweet']),
 WordList(['drink', 'sweet', 'Soda']),
 WordList(['sweet', 'Soda', 'Jane']),
 WordList(['Soda', 'Jane', 'also']),
 WordList(['Jane', 'also', 'likes']),
 WordList(['also', 'likes', 'soda']),
 WordList(['likes', 'soda', 'John']),
 WordList(['soda', 'John', 'and']),
 WordList(['John', 'and', 'Jane']),
 WordList(['and', 'Jane', 'both']),
 WordList(['Jane', 'both', 'like']),
 WordList(['both', 'like', 'soda'])]
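
Note that the n-grams above run across sentence boundaries (for example `['sweet', 'Soda', 'Jane']`). The output is consistent with a plain sliding window over the word list. A minimal sketch of that idea, where `ngrams` is a hypothetical helper written for illustration and not TextBlob's implementation:

```python
def ngrams(words, n=3):
    """Return every run of n consecutive words as a list (sliding window)."""
    if n <= 0:
        return []
    return [words[i:i + n] for i in range(len(words) - n + 1)]

# Word list copied from the text above (punctuation already removed).
words = ['John', 'likes', 'to', 'drink', 'sweet', 'Soda', 'Jane', 'also',
         'likes', 'soda', 'John', 'and', 'Jane', 'both', 'like', 'soda']

trigrams = ngrams(words, n=3)
print(len(trigrams))   # 14
print(trigrams[0])     # ['John', 'likes', 'to']
print(trigrams[-1])    # ['both', 'like', 'soda']
```

A list of 16 words yields 16 - 3 + 1 = 14 trigrams, matching the 14 entries in the output above.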

In [63]:
from textblob.sentiments import NaiveBayesAnalyzer
# No analyzer given, so this uses TextBlob's default PatternAnalyzer (polarity, subjectivity)
blob6 = TextBlob(repetitive)
for sentence in blob6.sentences:
    print(sentence.sentiment)

Sentiment(polarity=0.35, subjectivity=0.65)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)


In [61]:
# NaiveBayesAnalyzer returns a classification plus the probability of each class
blob7 = TextBlob(repetitive, analyzer=NaiveBayesAnalyzer())
for sentence in blob7.sentences:
    print(sentence.sentiment)

Sentiment(classification='neg', p_pos=0.3519319302860903, p_neg=0.6480680697139108)
Sentiment(classification='neg', p_pos=0.43530881093370977, p_neg=0.564691189066291)
Sentiment(classification='neg', p_pos=0.4908744687646318, p_neg=0.5091255312353682)


TODO: Is it worth showing how to use different tokenizers, noun phrase chunkers, POS taggers, etc.? Should we show how to assemble a Blobber for re-use?

In [67]:
sent = "John hates soda"
blob7 = TextBlob(sent)
for sentence in blob7.sentences:
    print(sentence.sentiment)

Sentiment(polarity=0.0, subjectivity=0.0)


In [70]:
sent = "John hates soda"
blob8 = TextBlob(sent, analyzer=NaiveBayesAnalyzer())
for sentence in blob8.sentences:
    print(sentence.sentiment)

Sentiment(classification='neg', p_pos=0.22370657856753343, p_neg=0.7762934214324665)


- Average sentence length
- Richness: number of words / number of distinct words
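
Both measures can be sketched with the standard library alone. This is a rough illustration, assuming regex-based sentence and word splitting instead of TextBlob's tokenizers and using the soda text from earlier as sample input. Richness follows the definition in the note (words divided by distinct words); the inverse ratio is what is commonly called the type-token ratio.

```python
import re

text = ("John likes to drink sweet Soda. Jane also likes soda. "
        "John and Jane both like soda.")

# Rough sentence split: break on whitespace that follows ., ! or ?
sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
# Rough word split: runs of letters and apostrophes.
words = re.findall(r"[A-Za-z']+", text)

avg_sentence_length = len(words) / len(sentences)

# Lowercase before comparing so "Soda" and "soda" count as one word.
distinct_words = {w.lower() for w in words}
richness = len(words) / len(distinct_words)

print(f"average sentence length: {avg_sentence_length:.2f} words")
print(f"richness (words / distinct words): {richness:.2f}")
```

A richness of 1.0 means no word repeats; higher values mean more repetition.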

- Get speeches from Clinton, Bush, Obama, Trump
- Compute the average grade level of each text: Flesch-Kincaid score

https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index

https://en.wikipedia.org/wiki/Automated_readability_index
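
Both indices linked above need only character, word, and sentence counts, so they are easy to compute by hand. A minimal sketch, assuming the standard published formulas and a crude regex tokenizer (abbreviations such as "Dr." will be miscounted as sentence ends). The helper names `text_counts`, `coleman_liau`, and `automated_readability` are hypothetical:

```python
import re

def text_counts(text):
    """Return (letters_and_digits, words, sentences) using rough regex splitting."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    chars = sum(ch.isalnum() for w in words for ch in w)
    return chars, len(words), len(sentences)

def coleman_liau(text):
    """Coleman-Liau index: 0.0588*L - 0.296*S - 15.8."""
    chars, words, sentences = text_counts(text)
    L = chars / words * 100       # letters per 100 words
    S = sentences / words * 100   # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

def automated_readability(text):
    """ARI: 4.71*(characters/words) + 0.5*(words/sentences) - 21.43."""
    chars, words, sentences = text_counts(text)
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43

sample = "Here is my first sentence. I am really sad. Life is fantastic."
print(round(coleman_liau(sample), 2), round(automated_readability(sample), 2))
```

Very short or very simple texts can score below zero on both scales, so the numbers are most meaningful on longer passages such as full speeches.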

https://pypi.python.org/pypi/textstat/

After calculating our own metrics, we could use textstat to get the Flesch-Kincaid grade as a comparison with the word- and sentence-based approaches. Unlike those, Flesch-Kincaid also depends on syllable counts.
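
For reference, the Flesch-Kincaid grade level is 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. The sketch below assumes a very rough vowel-group heuristic for syllables, which will miscount many English words, so treat the result as approximate. The helper names are hypothetical, and textstat (which, to our knowledge, exposes a function of the same name) would be the value to compare against.

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups, dropping a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r'[aeiouy]+', word))
    # "language" -> a, ua, e = 3 groups, minus the silent final 'e' = 2.
    if word.endswith('e') and not word.endswith('le') and count > 1:
        count -= 1
    return max(count, 1)  # every word has at least one syllable

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level from rough word, sentence, and syllable counts."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

sample = "Natural language processing with Python is so much fun!"
print(round(flesch_kincaid_grade(sample), 2))
```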