# Intro to Natural Language Processing with Python

### What are we covering today?
- What is NLP?
- Options for NLP in Python
- Word tokenization
- Sentence tokenization
- Noun phrase extraction
- Part-of-speech tagging
- Find all nouns
- Word transformations (lemmatization, pluralization)
- Look for sentences of particular lengths
- Sentiment analysis


TODO: What is NLP

In [1]:
from textblob import TextBlob
import nltk
nltk.download()  # opens the interactive NLTK downloader for the corpora TextBlob relies on

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [2]:
text = "Natural language processing with Python is so much fun!"
blob = TextBlob(text)

In [3]:
blob.words

WordList(['Natural', 'language', 'processing', 'with', 'Python', 'is', 'so', 'much', 'fun'])

In [4]:
blob.tags

[('Natural', 'JJ'),
 ('language', 'NN'),
 ('processing', 'NN'),
 ('with', 'IN'),
 ('Python', 'NNP'),
 ('is', 'VBZ'),
 ('so', 'RB'),
 ('much', 'JJ'),
 ('fun', 'NN')]

In [5]:
for word, pos in blob.tags:
    print(word, pos)

Natural JJ
language NN
processing NN
with IN
Python NNP
is VBZ
so RB
much JJ
fun NN


To look up what these part-of-speech tags mean, see http://www.clips.ua.ac.be/pages/mbsp-tags
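Finding all nouns then comes down to keeping the words whose tag starts with `NN` (in the Penn Treebank scheme that covers `NN`, `NNS`, `NNP`, and `NNPS`). Here is a minimal sketch that uses the tagged pairs printed above as a plain list, so it runs without TextBlob. With a real blob you would loop over `blob.tags` instead.

```python
# (word, tag) pairs copied from the blob.tags output above
tags = [
    ("Natural", "JJ"), ("language", "NN"), ("processing", "NN"),
    ("with", "IN"), ("Python", "NNP"), ("is", "VBZ"),
    ("so", "RB"), ("much", "JJ"), ("fun", "NN"),
]

# Every noun tag begins with "NN"
nouns = [word for word, pos in tags if pos.startswith("NN")]
print(nouns)  # ['language', 'processing', 'Python', 'fun']
```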

In [6]:
blob.noun_phrases


WordList(['natural language processing', 'python'])

In [7]:
para = "Here is my first sentence. I am really sad. People are so angry. Life is fantastic. It's really awesome to have three sentences though."

In [8]:
blob2 = TextBlob(para)

In [9]:
blob2.sentences

[Sentence("Here is my first sentence."),
 Sentence("I am really sad."),
 Sentence("People are so angry."),
 Sentence("Life is fantastic."),
 Sentence("It's really awesome to have three sentences though.")]
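Once the text is split into sentences, looking for sentences of a particular length is a simple filter. The sketch below works on the sentence strings shown above and counts words by splitting on whitespace, which is only a rough stand-in for TextBlob's tokenizer. The `word_count` helper and the `min_words` threshold are made up for illustration.

```python
# Sentence strings copied from the blob2.sentences output above
sentences = [
    "Here is my first sentence.",
    "I am really sad.",
    "People are so angry.",
    "Life is fantastic.",
    "It's really awesome to have three sentences though.",
]

def word_count(sentence):
    # Rough count: split on whitespace (TextBlob's tokenizer may differ slightly)
    return len(sentence.split())

min_words = 5  # hypothetical threshold, chosen for illustration
long_sentences = [s for s in sentences if word_count(s) >= min_words]
for s in long_sentences:
    print(word_count(s), s)
```

With a real blob, `len(sentence.words)` would give the count using TextBlob's own tokenization.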

In [10]:
blob.sentiment

Sentiment(polarity=0.2375, subjectivity=0.30000000000000004)

In [11]:
blob.sentiment.polarity

0.2375

In [12]:
for sentence in blob2.sentences:
    print(sentence.sentiment)

Sentiment(polarity=0.25, subjectivity=0.3333333333333333)
Sentiment(polarity=-0.5, subjectivity=1.0)
Sentiment(polarity=-0.5, subjectivity=1.0)
Sentiment(polarity=0.4, subjectivity=0.9)
Sentiment(polarity=1.0, subjectivity=1.0)


In [13]:
for sentence in blob2.sentences:
    print(sentence, sentence.sentiment.polarity)

Here is my first sentence. 0.25
I am really sad. -0.5
People are so angry. -0.5
Life is fantastic. 0.4
It's really awesome to have three sentences though. 1.0
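Per-sentence polarity scores are easy to summarize with plain Python. The sketch below hard-codes the (sentence, polarity) pairs from the output above so it runs on its own. The variable names are only illustrative.

```python
from statistics import mean

# (sentence, polarity) pairs copied from the output above
scored = [
    ("Here is my first sentence.", 0.25),
    ("I am really sad.", -0.5),
    ("People are so angry.", -0.5),
    ("Life is fantastic.", 0.4),
    ("It's really awesome to have three sentences though.", 1.0),
]

# Average polarity across sentences
average_polarity = mean(polarity for _, polarity in scored)

# The single most positive sentence, and every sentence scored below zero
most_positive = max(scored, key=lambda pair: pair[1])
negative = [sentence for sentence, polarity in scored if polarity < 0]

print(round(average_polarity, 2))
print(most_positive[0])
print(negative)
```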


In [14]:
blob.words[1].pluralize()

'languages'

In [15]:
sent2 = "Squids are huge with many tentacles."
blob3 = TextBlob(sent2)

In [16]:
blob3.words[1].lemmatize('v')

'be'

In [17]:
blob3.words[-1].lemmatize()

'tentacle'

In [18]:
blob3.parse()

'Squids/NNP/B-NP/O are/VBP/B-VP/O huge/JJ/B-ADJP/O with/IN/B-PP/B-PNP many/JJ/B-NP/I-PNP tentacles/NNS/I-NP/I-PNP ././O/O'
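The parse string packs several annotations into each token, separated by slashes. Assuming the four fields are the word, its part-of-speech tag, its chunk tag, and a prepositional-phrase tag (that is how the output above reads), a short sketch can split it into rows:

```python
# String copied from the blob3.parse() output above
parsed = ("Squids/NNP/B-NP/O are/VBP/B-VP/O huge/JJ/B-ADJP/O with/IN/B-PP/B-PNP "
          "many/JJ/B-NP/I-PNP tentacles/NNS/I-NP/I-PNP ././O/O")

# Tokens are separated by spaces, fields within a token by "/"
rows = [tuple(token.split("/")) for token in parsed.split()]

for word, pos, chunk, pnp in rows:
    print(f"{word:10} {pos:4} {chunk:7} {pnp}")
```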

In [19]:
from textblob import Word

In [20]:
w = Word("octopi")
w.lemmatize()

'octopus'

In [21]:
bad_text = "This is poory spelled wrd"
blob4 = TextBlob(bad_text)
for word in blob4.words:
    print(word.spellcheck())

[('His', 0.7117826487905228), ('This', 0.28821735120947717)]
[('is', 1.0)]
[('poor', 0.9626865671641791), ('poorly', 0.03731343283582089)]
[('spelled', 1.0)]
[('word', 0.8948948948948949), ('ward', 0.03903903903903904), ('rd', 0.03003003003003003), ('wry', 0.015015015015015015), ('ord', 0.009009009009009009), ('wad', 0.006006006006006006), ('wid', 0.003003003003003003), ('wed', 0.003003003003003003)]
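Each `spellcheck()` result is a list of (candidate, confidence) pairs, so a naive auto-correct would just take the highest-confidence candidate for every word. The sketch below does that with the suggestions copied from the output above (the list for "wrd" is shortened to its top three candidates). The `suggestions` dictionary is only a convenience for this example.

```python
# Suggestions copied from the spellcheck() output above
suggestions = {
    "This": [("His", 0.7117826487905228), ("This", 0.28821735120947717)],
    "is": [("is", 1.0)],
    "poory": [("poor", 0.9626865671641791), ("poorly", 0.03731343283582089)],
    "spelled": [("spelled", 1.0)],
    # shortened: only the top three candidates are kept here
    "wrd": [("word", 0.8948948948948949), ("ward", 0.03903903903903904),
            ("rd", 0.03003003003003003)],
}

# Keep the candidate with the highest confidence for each word
best = {word: max(candidates, key=lambda pair: pair[1])[0]
        for word, candidates in suggestions.items()}

corrected = " ".join(best.values())
print(corrected)  # His is poor spelled word
```

The result shows why the confidences matter: the top guess turns the correct "This" into "His" and picks "poor" over "poorly", so blindly accepting the first suggestion can make text worse.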


In [22]:
repetitive = "John likes to drink sweet Soda. Jane also likes soda. John and Jane both like soda."
blob5 = TextBlob(repetitive)
blob5.word_counts['soda']

3

In [23]:
blob5.words.count('soda', case_sensitive=True)

2
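The difference between the two counts comes from case handling: one lookup folds "Soda" and "soda" together, the other does not. The same distinction can be reproduced with `collections.Counter`, sketched here with a rough letters-only regex standing in for TextBlob's tokenizer (an assumption for this example):

```python
import re
from collections import Counter

repetitive = "John likes to drink sweet Soda. Jane also likes soda. John and Jane both like soda."

# Rough tokenizer: runs of letters only (TextBlob's tokenizer is more careful)
words = re.findall(r"[A-Za-z]+", repetitive)

case_sensitive = Counter(words)
case_insensitive = Counter(word.lower() for word in words)

print(case_sensitive["soda"])    # counts only the lowercase form
print(case_insensitive["soda"])  # counts "Soda" and "soda" together
```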

In [24]:
blob5.noun_phrases.count('john')

2

TextBlob can also handle some translation and language detection. Do we want to talk about that at all?

In [25]:
blob5.ngrams(n=3)

[WordList(['John', 'likes', 'to']),
 WordList(['likes', 'to', 'drink']),
 WordList(['to', 'drink', 'sweet']),
 WordList(['drink', 'sweet', 'Soda']),
 WordList(['sweet', 'Soda', 'Jane']),
 WordList(['Soda', 'Jane', 'also']),
 WordList(['Jane', 'also', 'likes']),
 WordList(['also', 'likes', 'soda']),
 WordList(['likes', 'soda', 'John']),
 WordList(['soda', 'John', 'and']),
 WordList(['John', 'and', 'Jane']),
 WordList(['and', 'Jane', 'both']),
 WordList(['Jane', 'both', 'like']),
 WordList(['both', 'like', 'soda'])]
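An n-gram list is just a window of size n slid across the word list one position at a time. This is a plain-Python sketch of the idea, not TextBlob's implementation. The helper name `ngrams` and the hand-split word list are assumptions for illustration.

```python
def ngrams(words, n=3):
    # Slide a window of size n across the list, one position at a time
    return [words[i:i + n] for i in range(len(words) - n + 1)]

# Words of the `repetitive` text, punctuation already removed
words = ("John likes to drink sweet Soda Jane also likes soda "
         "John and Jane both like soda").split()

trigrams = ngrams(words, n=3)
print(len(trigrams))
print(trigrams[0], trigrams[-1])
```

A list of 16 words yields 16 - 3 + 1 = 14 trigrams, matching the output above. Note that the windows run straight across sentence boundaries, as in `['sweet', 'Soda', 'Jane']`.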

In [26]:
from textblob.sentiments import NaiveBayesAnalyzer
blob6 = TextBlob(repetitive)
for sentence in blob6.sentences:
    print(sentence.sentiment)

Sentiment(polarity=0.35, subjectivity=0.65)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)


In [27]:
blob7 = TextBlob(repetitive, analyzer=NaiveBayesAnalyzer())
for sentence in blob7.sentences:
    print(sentence.sentiment)

Sentiment(classification='neg', p_pos=0.3519319302860912, p_neg=0.6480680697139093)
Sentiment(classification='neg', p_pos=0.43530881093370977, p_neg=0.564691189066291)
Sentiment(classification='neg', p_pos=0.4908744687646318, p_neg=0.5091255312353694)


TODO: Is it worth showing how to use different tokenizers, noun phrase chunkers, POS taggers, etc.? Should we show how to assemble a Blobber for re-use?

In [28]:
sent = "John hates soda"
blob7 = TextBlob(sent)
for sentence in blob7.sentences:
    print(sentence.sentiment)

Sentiment(polarity=0.0, subjectivity=0.0)


In [29]:
sent = "John hates soda"
blob8 = TextBlob(sent, analyzer=NaiveBayesAnalyzer())
for sentence in blob8.sentences:
    print(sentence.sentiment)

Sentiment(classification='neg', p_pos=0.22370657856753343, p_neg=0.7762934214324665)


Average sentence length
Richness: number of words / number of different words
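Both measures need only word and sentence counts. Below is a rough sketch using regular expressions in place of TextBlob's tokenizers. The function names and the splitting rules are assumptions, and the sentence splitter will be fooled by abbreviations such as "Mr.".

```python
import re

def tokenize(text):
    # Rough stand-ins for TextBlob's tokenizers (assumption): sentences end at
    # ., ! or ?, and words are runs of letters and apostrophes
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return sentences, words

def avg_sentence_length(text):
    sentences, words = tokenize(text)
    return len(words) / len(sentences)

def richness(text):
    # As defined in the notes: number of words / number of different words
    _, words = tokenize(text)
    return len(words) / len({word.lower() for word in words})

sample = "John likes to drink sweet Soda. Jane also likes soda. John and Jane both like soda."
print(avg_sentence_length(sample))
print(richness(sample))
```

With a blob, `len(blob.words) / len(blob.sentences)` and `len(blob.words) / len(set(w.lower() for w in blob.words))` would compute the same quantities using TextBlob's tokenization.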

Get speeches from Clinton, Bush, Obama, Trump
Compute the average grade level of each text: Flesch-Kincaid score

https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index

https://en.wikipedia.org/wiki/Automated_readability_index

https://pypi.python.org/pypi/textstat/

After calculating our own scores, we could then use textstat to get the Flesch-Kincaid score as a comparison to the word/sentence approaches. Unlike those, Flesch-Kincaid counts syllables.
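To make the syllable dependence concrete, here is a sketch of the Flesch-Kincaid grade level formula with a deliberately naive syllable counter. The vowel-group heuristic and both function names are assumptions for illustration. A library such as textstat counts syllables more carefully.

```python
import re

def count_syllables(word):
    # Naive heuristic (assumption): each run of consecutive vowels is one syllable.
    # It overcounts silent e, e.g. "Life" is counted as 2.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(num_words, num_sentences, num_syllables):
    # Flesch-Kincaid grade level from raw counts
    return (0.39 * (num_words / num_sentences)
            + 11.8 * (num_syllables / num_words)
            - 15.59)

words = ["Life", "is", "fantastic"]
syllables = sum(count_syllables(word) for word in words)
print(syllables)
print(flesch_kincaid_grade(len(words), 1, syllables))
```

Because syllable counting is heuristic, scores from this sketch will not match textstat exactly, which is part of what makes the comparison interesting.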

In [30]:
import requests


In [33]:
url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/nlp/data/trump.txt"
page = requests.get(url)
page.text

'AS FOUND ON http://www.cnn.com/2017/02/28/politics/donald-trump-speech-transcript-full-text/\nAND TO BE USED FOR ACADEMIC WORKSHOP\nMr. Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States, and Citizens of America:\nTonight, as we mark the conclusion of our celebration of Black History Month, we are reminded of our Nation\'s path toward civil rights and the work that still remains. Recent threats targeting Jewish Community Centers and vandalism of Jewish cemeteries, as well as last week\'s shooting in Kansas City, remind us that while we may be a Nation divided on policies, we are a country that stands united in condemning hate and evil in all its forms.\nEach American generation passes the torch of truth, liberty and justice --- in an unbroken chain all the way down to the present.\nThat torch is now in our hands. And we will use it to light up the world. I am here tonight to deliver a message of unity and strength, and it is a message deeply delivere

In [44]:
with open('data/trump.txt', 'r', encoding='utf-8') as f:
    trump_text = f.read().splitlines()
# Skip the two header lines (source attribution) and join the rest into one string
trump_text = ' '.join(trump_text[2:])
trump_text

'Mr. Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States, and Citizens of America: Tonight, as we mark the conclusion of our celebration of Black History Month, we are reminded of our Nation\'s path toward civil rights and the work that still remains. Recent threats targeting Jewish Community Centers and vandalism of Jewish cemeteries, as well as last week\'s shooting in Kansas City, remind us that while we may be a Nation divided on policies, we are a country that stands united in condemning hate and evil in all its forms. Each American generation passes the torch of truth, liberty and justice --- in an unbroken chain all the way down to the present. That torch is now in our hands. And we will use it to light up the world. I am here tonight to deliver a message of unity and strength, and it is a message deeply delivered from my heart. A new chapter of American Greatness is now beginning. A new national pride is sweeping across our Nation. And a new su

In [45]:
trump_blob = TextBlob(trump_text)
num_words = len(trump_blob.words)
num_sentences = len(trump_blob.sentences)
num_words, num_sentences

(4796, 259)

In [46]:
# Total number of characters across all words (punctuation and spaces excluded)
num_chars = sum(len(word) for word in trump_blob.words)
num_chars

22394

In [47]:
automated_readability = (4.71 * (num_chars/num_words)) + (0.5 * (num_words/num_sentences)) - 21.43
automated_readability

9.821126791631379

In [48]:
def calc_auto_readability(chars, words, sentences):
    return (4.71 * (chars/words)) + (0.5 * (words/sentences)) - 21.43
trump_readability = calc_auto_readability(num_chars, num_words, num_sentences)
trump_readability

9.821126791631379

In [49]:
def calc_coleman_liau(letters_per_100, sentences_per_100):
    return (0.0588 * letters_per_100) - (0.296 * sentences_per_100) - 15.8


In [55]:
def letters_per_100(chars, words):
    return (chars/words) * 100
    
def sentences_per_100(sentences, words):
    return (sentences/words) * 100

In [56]:
trump_letters = letters_per_100(num_chars, num_words)
trump_sentences = sentences_per_100(num_sentences, num_words)
trump_coleman_read = calc_coleman_liau(trump_letters, trump_sentences)
trump_coleman_read

10.05703085904921
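Since the plan is to compare several speeches, it may help to bundle both indices into one function that takes the three raw counts. This sketch restates the two formulas used above. The function names are hypothetical, and the counts passed in at the end are the ones computed for the Trump speech.

```python
def automated_readability(chars, words, sentences):
    return (4.71 * (chars / words)) + (0.5 * (words / sentences)) - 21.43

def coleman_liau(chars, words, sentences):
    letters_per_100 = (chars / words) * 100
    sentences_per_100 = (sentences / words) * 100
    return (0.0588 * letters_per_100) - (0.296 * sentences_per_100) - 15.8

def readability_report(chars, words, sentences):
    # Both grade-level estimates side by side, from the same three counts
    return {
        "automated_readability": automated_readability(chars, words, sentences),
        "coleman_liau": coleman_liau(chars, words, sentences),
    }

# Counts computed above: 22394 characters, 4796 words, 259 sentences
report = readability_report(22394, 4796, 259)
for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Both indices put the speech at roughly a tenth-grade level. Running the same function on the other speeches would give directly comparable numbers.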