# TextBlob: An Introduction of Methods

## What is NLP?

* Computer understanding and manipulation of human language
* A way for computers to analyze, understand, and derive meaning from human language in a smart and useful way
* Intersection of computer science, artificial intelligence, and computational linguistics

NLP algorithms are typically based on machine learning. Instead of hand-coding large sets of rules, NLP can rely on machine learning to learn these rules automatically by analyzing a set of examples (a corpus, which can range from a whole book down to a collection of sentences) and making a statistical inference. In general, the more data analyzed, the more accurate the model will be.

## Two Subfields of NLP

There are two common subfields of natural language processing:

* Natural Language Understanding (NLU)
    - A process that converts human language into a computer-interpretable form that captures meaning and context.
    - This is a work in progress in data science
    - Understanding human language is difficult
* Natural Language Generation (NLG)
    - Uses NLU to generate human language that appears natural and relevant.
    - Chat bots and software that automatically generates textual content use NLG.
    
These subfields do not completely make up the space of NLP:

* NLU includes things like
    - relationship extraction
    - sentiment analysis
    - summarization
    - *semantic* parsing
* NLP also includes (not part of NLU)
    - *syntactic* parsing
    - text categorization
    - part of speech tagging
    
While some parts of NLP (e.g. POS tagging) are used in NLU, they are not strictly components of NLU.

## Challenges in NLP

NLP has many challenges, and the field is not yet mature. Some of the challenges currently faced are

* Ambiguity of language
    - syntactic ambiguity: some sentences can have multiple interpretations
    - words with multiple definitions (e.g. patient: to tolerate delays? a hospital patient?)
* Context affects meaning
    - social context
    - time of day
    - content of previous sentences
* Other
    - sarcasm, humor, slang, etc.
    
Most or all of these are tied to NLU in one way or another. Further advancements in AI are needed to create general solutions that can handle the many forms of language encountered. For example, the form of language encountered in a novel is very different from what you would find in a social media feed (e.g. Tweets).

## Some Uses for NLP

The uses for NLP grow as new and creative ideas arise, but some common uses are

* automatic summarization
* translation
* named entity recognition
    - person
    - place
    - organization
    - object
    - etc.
* relationship extraction
* sentiment analysis
* speech recognition
* topic segmentation / text classification
* grammar correction
* chat bots
* automatic tag, keyword, and content generation

Speech recognition is one use that doesn't *require* NLU, although NLU can improve it: a machine can recognize various words and phrases and then take certain actions without actually understanding anything about what was said.

**Some specific use cases for NLP:**

* Analyze social media and forums to gain insight into what customers are saying
    - identify new product opportunities,
    - problems with current products/services,
    - overall user/customer sentiment
* Spam detection
* Financial algorithmic trading
    - extract info from news that impacts trading decisions
* Answering questions (e.g. chat bots)

# Techniques & Tools

## Techniques

* **Tokenization**: split text into sentences, words, and noun-phrases
* **Tagging**: String -> tagged list of pairs `('word', 'POS')`

    Ex: `'This is a string'` -> `[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('string', 'NN')]`
  
  
* **Parsing** (syntactic structure): String -> hierarchical structure with syntax tags

    Ex: `'This is a string'` -> `'This/DT/O/O is/VBZ/B-VP/O a/DT/B-NP/O string/NN/I-NP/O'`

* **Information Extraction**: 
    - named entity extraction: string in -> output text with labeled entities (person, company, location, etc.)
    - relationships between entities: string in -> output entity relationships
* **n-grams**:
    - string in -> output list of n-tuples of successive words
    - n-grams are used as features in machine learning
    
    Ex (2-gram): `'This is a string'` -> `(['This', 'is'], ['is', 'a'], ['a', 'string'])`
    
*These will be discussed in more detail as they come up.*
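
As a rough illustration of the n-gram idea listed above, here is a minimal pure-Python sketch of a sliding window over a list of words. The helper name `make_ngrams` is hypothetical, whitespace splitting stands in for real tokenization, and this is not meant to reflect how any particular library implements n-grams.

```python
def make_ngrams(words, n=2):
    """Return every run of n successive words as a list (hypothetical helper)."""
    # slide a window of width n across the word list, one word at a time
    return [words[i:i + n] for i in range(len(words) - n + 1)]


# naive whitespace tokenization, good enough for this example
words = 'This is a string'.split()
bigrams = make_ngrams(words, n=2)
print(bigrams)
```

If the text has fewer than `n` words, the range is empty and the sketch returns an empty list.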

## Tools

There are many tools available for NLP. Some popular choices are

* Stanford's Core NLP Suite
* Natural Language Toolkit (NLTK)
* Apache OpenNLP
* WordNet
* **TextBlob**

We will be working with TextBlob, which builds off of (and can integrate with) NLTK and WordNet.

# TextBlob: An Introduction of Methods

## Installation

To install TextBlob, open a new Terminal and enter the following:

```
$ pip install -U textblob
$ python -m textblob.download_corpora
```

## Getting Started

From here on, you can follow along with the notebook and create new notes and try out code as you like.

In [1]:
# import what we need
import pandas as pd
from pandas import DataFrame as DF, Series

import numpy as np

from textblob import TextBlob

In [5]:
# read data

# use only the column called 'text'
data = pd.read_csv('data/tweets.csv', usecols=['text'])

In [7]:
data.head(10)

| | text |
|---|---|
| 0 | @VirginAmerica What @dhepburn said. |
| 1 | @VirginAmerica plus you've added commercials t... |
| 2 | @VirginAmerica I didn't today... Must mean I n... |
| 3 | @VirginAmerica it's really aggressive to blast... |
| 4 | @VirginAmerica and it's a really big bad thing... |
| 5 | @VirginAmerica seriously would pay $30 a fligh... |
| 6 | @VirginAmerica yes, nearly every time I fly VX... |
| 7 | @VirginAmerica Really missed a prime opportuni... |
| 8 | @virginamerica Well, I didn't…but NOW I DO! :-D |
| 9 | @VirginAmerica it was amazing, and arrived an ... |


## Create a TextBlob object

`TextBlob` objects are the foundation of everything we will be doing. They take a string as an input and create an object on which we can apply many of the TextBlob methods.

Let's create a blob using a tweet in our data.

In [10]:
# create a blob from the tweet at index 25
tweet = data.text[25]
tweet

"@VirginAmerica status match program.  I applied and it's been three weeks.  Called and emailed with no response."

In [11]:
blob = TextBlob(tweet)
blob

TextBlob("@VirginAmerica status match program.  I applied and it's been three weeks.  Called and emailed with no response.")

# TextBlob Methods: Tokenization

Tokenization allows us to split a string (a paragraph, a page, etc.) into various "tokens" that become useful in further processing and analysis. Tokenization also occurs on the back-end of some methods.

Let's look at some tokenization options.

## Sentences

Using the `sentences` property we get a list of `Sentence` objects, one for each sentence in the string passed to `TextBlob`, in order.

In [12]:
# return list of Sentence objects
blob.sentences

[Sentence("@VirginAmerica status match program."),
 Sentence("I applied and it's been three weeks."),
 Sentence("Called and emailed with no response.")]
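
To make the idea of sentence tokenization concrete, here is a minimal sketch that splits on end-of-sentence punctuation with a regular expression. It is a simplification for illustration only, not the tokenizer TextBlob uses, and the helper name `naive_sentences` is hypothetical.

```python
import re

tweet = ("@VirginAmerica status match program.  I applied and it's been "
         "three weeks.  Called and emailed with no response.")


def naive_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    # the lookbehind keeps the punctuation attached to its sentence
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


sentences = naive_sentences(tweet)
for s in sentences:
    print(s)
```

A rule this simple breaks on abbreviations (for example, it would split after "Dr."), which is one reason real tokenizers are more involved.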

Similar to `TextBlob` objects, we can use various methods with `Sentence` objects.

In [13]:
# get the third (last) sentence
s = blob.sentences[2]
# get tags from this sentence
s.tags[:10]

[('Called', 'VBN'),
 ('and', 'CC'),
 ('emailed', 'VBN'),
 ('with', 'IN'),
 ('no', 'DT'),
 ('response', 'NN')]

## Words

Instead of a list of sentences, we can get a `WordList` object that contains all of the individual words in our string.

In [14]:
# return WordList object (works like a standard list in Python)
blob.words

WordList(['VirginAmerica', 'status', 'match', 'program', 'I', 'applied', 'and', 'it', "'s", 'been', 'three', 'weeks', 'Called', 'and', 'emailed', 'with', 'no', 'response'])

We can access words in a `WordList` just like a regular Python list:

In [15]:
blob.words[7:9]

WordList(['it', "'s"])

**Notice**: TextBlob doesn't do the best job of handling contractions and possessive forms. Ex: "it's" is split into "it" and "'s".
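
If keeping contractions together matters for your analysis, one option is to tokenize with your own rule before further processing. The sketch below uses a regular expression that keeps an apostrophe suffix attached to its word; the helper name `naive_words` is hypothetical, and the pattern only handles ASCII letters, digits, and straight apostrophes.

```python
import re


def naive_words(text):
    """Return word tokens, keeping an apostrophe suffix attached (hypothetical helper)."""
    # letters/digits, optionally followed by an apostrophe and more letters
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


tokens = naive_words("I applied and it's been three weeks.")
print(tokens)
```

Here "it's" stays a single token, at the cost of no longer separating the verb from the pronoun.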

## Word Counts

We can get a dict that contains all the unique words in our string as keys, and counts for each as values.

In [16]:
# returns defaultdict with unique words as keys and counts as values.
blob.word_counts

defaultdict(int,
            {'virginamerica': 1,
             'status': 1,
             'match': 1,
             'program': 1,
             'i': 1,
             'applied': 1,
             'and': 2,
             'it': 1,
             's': 1,
             'been': 1,
             'three': 1,
             'weeks': 1,
             'called': 1,
             'emailed': 1,
             'with': 1,
             'no': 1,
             'response': 1})

In [17]:
# we can get counts for individual words in two ways
# 1. use the count method on a WordList
print(blob.words.count('and'))
# 2. access a key in the word_counts dict
print(blob.word_counts['and'])

2
2


**NOTE!**

If you use `word_counts['some_word']` and that word is not originally in the defaultdict, it will be added with a count of zero:

In [18]:
# example of above
b = TextBlob('a string of words')
b.word_counts

defaultdict(int, {'a': 1, 'string': 1, 'of': 1, 'words': 1})

In [19]:
# get count of word not in dict
b.word_counts['test']

0

In [20]:
# look at contents of dict again
# notice that 'test' is now included
b.word_counts

defaultdict(int, {'a': 1, 'string': 1, 'of': 1, 'words': 1, 'test': 0})
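
To look up a count without adding the key, standard dictionary tools work. The sketch below uses a plain `defaultdict(int)` as a stand-in for `b.word_counts` (an assumption based on the output shown above) so that it runs without any data.

```python
from collections import defaultdict

# stand-in for b.word_counts, assumed to behave like a defaultdict(int)
counts = defaultdict(int, {'a': 1, 'string': 1, 'of': 1, 'words': 1})

# .get() returns a default value without creating the key
missing = counts.get('test', 0)

# a membership test also leaves the dict untouched
has_test = 'test' in counts

print(missing, has_test, len(counts))
```

Only indexing with square brackets triggers the default factory; `.get()` and `in` do not.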

## Noun Phrases

**Noun phrases:** a word or group of words that functions in a sentence as subject, object, or prepositional object.

Examples of __noun phrases__ are underlined in the sentences below. The **head** noun appears in bold.

* __The election-year **politics**__ are annoying for __many **people**__.
* __Almost every **sentence**__ contains __at least one noun **phrase**__.
* __Current economic **weakness**__ may be __a **result** of high energy prices__.

Noun phrases can be identified by the possibility of pronoun substitution, as is illustrated in the examples below.

a. __This **sentence**__ contains __two noun **phrases**__.<br>
b. **It** contains **them**.

We can get a `WordList` containing noun phrases using the `noun_phrases` property on a blob.

In [21]:
blob.sentences

[Sentence("@VirginAmerica status match program."),
 Sentence("I applied and it's been three weeks."),
 Sentence("Called and emailed with no response.")]

In [22]:
# return WordList with noun phrases for tweet at index 11
TextBlob(data.text[11]).noun_phrases

WordList(['virginamerica', 'pretty graphics', 'minimal iconography'])

The algorithm used isn't perfect, but things rarely are in NLP.

<br>

# Practice Problems

1. Create a TextBlob object called blob using tweet at index 41
2. Print each sentence in blob on a separate line
3. Get word counts in descending order (most frequent first)
4. Come up with two ways to get the total word count for blob
5. Get all noun-phrases in blob. What is wrong with the second “phrase” in the results?
6. Select all entries in the data that contain more than 3 noun phrases
7. **Extra:** Using a similar method as in 6, print one tweet that has exactly 3 sentences without creating a list


<br>

# TextBlob Methods: POS & Morphology

Here we will cover all of the following:
    
* **part-of-speech (POS) tagging**: get list of tuples containing each word and its part of speech (e.g. noun)
* **pluralization**: get the plural form of any singular words
* **singularization**: get the singular form of any plural words
* **lemmatization**: get the stripped/unmodified version of a word (e.g. singing -> sing)

## part-of-speech (POS) tagging

Using the `tags` property, we can get a list of tuples, each pairing a word in our string with its part of speech, as determined by the algorithm.

POS tagging (also called grammatical tagging) is useful for understanding context and grammar. Many words can belong to different parts of speech, depending on the context and the words around them. POS tagging attempts to disambiguate a text by determining the most likely part of speech for each word based on that context.

In [23]:
# return list of tuples containing words in a string and the part of speech that each belongs to
blob.tags

[('@', 'NN'),
 ('VirginAmerica', 'NNP'),
 ('status', 'NN'),
 ('match', 'NN'),
 ('program', 'NN'),
 ('I', 'PRP'),
 ('applied', 'VBD'),
 ('and', 'CC'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('been', 'VBN'),
 ('three', 'CD'),
 ('weeks', 'NNS'),
 ('Called', 'VBN'),
 ('and', 'CC'),
 ('emailed', 'VBN'),
 ('with', 'IN'),
 ('no', 'DT'),
 ('response', 'NN')]

The tags each have a unique meaning. For example:
* 'VBX': verb (X indicates type of verb)
* 'DT': determiner

A comprehensive table can be found at http://www.clips.ua.ac.be/pages/mbsp-tags
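
Because the tags come back as ordinary `(word, tag)` tuples, they are easy to filter. The sketch below copies the first 13 pairs from the output above into a plain list and selects words by tag prefix; the helper name `words_with_tag_prefix` is hypothetical.

```python
# (word, tag) pairs copied from the first 13 entries of the blob.tags output above
tags = [('@', 'NN'), ('VirginAmerica', 'NNP'), ('status', 'NN'),
        ('match', 'NN'), ('program', 'NN'), ('I', 'PRP'),
        ('applied', 'VBD'), ('and', 'CC'), ('it', 'PRP'), ("'s", 'VBZ'),
        ('been', 'VBN'), ('three', 'CD'), ('weeks', 'NNS')]


def words_with_tag_prefix(tagged, prefix):
    """Return words whose tag starts with prefix, e.g. 'NN' or 'VB'."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tags, 'NN')  # NN, NNS, NNP, ...
verbs = words_with_tag_prefix(tags, 'VB')  # VBD, VBZ, VBN, ...
print(nouns)
print(verbs)
```

Note that `'@'` shows up among the nouns, which reflects the tagger's output on this tweet rather than a problem with the filter.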

## Pluralization

This is a relatively simple rule-based process that takes the singular form of a word and applies the correct pluralization to it.

In TextBlob we can pluralize a single word (as a `Word` object) or pluralize all words in a `WordList`.

In [24]:
# import
from textblob import Word, WordList
# create a Word object
w = Word('company')
# return the plural of a single word
w.pluralize()

'companies'

In [25]:
# Side note: we can also create WordList objects
wl = WordList(['who','what','when','where','why'])
wl

WordList(['who', 'what', 'when', 'where', 'why'])
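
To give a feel for what "rule-based" means here, the following is a deliberately tiny sketch of English pluralization rules. It is not TextBlob's rule set, and the helper name `naive_pluralize` is hypothetical; a real implementation needs many more rules plus a table of irregular forms.

```python
def naive_pluralize(word):
    """Pluralize a singular English noun with three simple rules (hypothetical helper)."""
    vowels = 'aeiou'
    # consonant + 'y' -> 'ies' (company -> companies, but day -> days)
    if word.endswith('y') and len(word) > 1 and word[-2] not in vowels:
        return word[:-1] + 'ies'
    # sibilant endings take 'es' (box -> boxes, wish -> wishes)
    if word.endswith(('s', 'x', 'z', 'ch', 'sh')):
        return word + 'es'
    # default rule: just add 's'
    return word + 's'


for w in ['company', 'day', 'box', 'word', 'child']:
    print(w, '->', naive_pluralize(w))
```

The last word comes out as "childs", which shows why irregular nouns have to be handled separately.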

## Singularization

The opposite of pluralization: take a word (or words) in plural form and singularize them.

In [26]:
wl = WordList(['agencies', 'octopi', 'words'])
wl.singularize()

WordList(['agency', 'octopus', 'word'])

## Lemmatization

Lemmatization takes a word that has been modified or morphed in some way using proper linguistic rules, and returns the stripped/unmodified version of it.

The `lemmatize()` method has an optional parameter:
* pos – Part of speech to filter upon. If None, defaults to _wordnet.NOUN.
* options: 
    - `'n'` for noun, 
    - `'v'` for verb, 
    - `'a'` for adjective, 
    - `'r'` for adverb.

Note: adverbs don't usually work with the standard `lemmatize` method.

In [27]:
w = Word('singing')
# for some words you have to pass the type
# in this case we pass 'v' for verb (not to be confused with POS tag formats)
w.lemmatize('v')

'sing'

In [28]:
# irregular past-tense verb
w = Word('went')
w.lemmatize('v')

'go'

In [29]:
# it doesn't always work: try an adverb
w = Word('kindly')
w.lemmatize('r')

'kindly'
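
Since `lemmatize()` expects the single-letter codes listed above rather than the tags returned by POS tagging, a small conversion step is handy when lemmatizing tagged text. The mapping below is a sketch based on common tag prefixes; the helper name `penn_to_wordnet` is hypothetical.

```python
def penn_to_wordnet(tag):
    """Map a POS tag such as 'VBZ' to 'n', 'v', 'a' or 'r' (hypothetical helper)."""
    if tag.startswith('VB'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('JJ'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if tag.startswith('RB'):
        return 'r'  # adverbs: RB, RBR, RBS
    # nouns and everything else fall back to 'n', the lemmatizer's default
    return 'n'


pairs = [('Called', 'VBN'), ('weeks', 'NNS'), ('no', 'DT')]
codes = [(word, penn_to_wordnet(tag)) for word, tag in pairs]
print(codes)
```

Each code could then be passed along, for example as `Word(word).lemmatize(code)`.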

# Parsing & n-grams

## Parsing

Parsing gives us the syntactic structure of a string or sentence by appending to each word tags that indicate its place in a hierarchy. See the tree in the PowerPoint slides for a visual example.

Let's parse the sentence shown in the tree:

In [30]:
# return a string containing each word in the text along with its parts of speech hierarchy
b = TextBlob('John loves Mary')
b.parse()

'John/NNP/B-NP/O loves/VBZ/B-VP/O Mary/NNP/B-NP/O'

`John/NNP/B-NP/O` gives the position in the hierarchy of the text for the word "`John`" in our sentence, working from the word to the top of the hierarchy.

In this case (For the word `John`):
* NNP indicates it is a "noun, proper singular"
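* the `B-` in `B-NP` indicates the word begins a chunk: it is inside the chunk, and the preceding word is part of a different chunk (an `I-` prefix would mean the preceding word belongs to the same chunk)
* the `NP` in `B-NP` indicates it is part of a noun phrase
* the final `O` is "not part of chunk" for the last field, which marks prepositional noun phrases (PNP); `John` is not inside one.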

Details can be read on the page that gives detailed parts of speech (link posted under POS tagging).

Parsing and syntactic structure is a complex subject, and is not covered in depth here.
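
The parse result is a single string, so it usually needs to be split before it can be used. Here is a minimal sketch that turns the output shown above into one record per word. The helper name `split_parse` and the field names (`word`, `pos`, `chunk`, `pnp`) are my own labels for the four slash-separated slots, not names defined by TextBlob.

```python
# parse output copied from the cell above
parsed = 'John/NNP/B-NP/O loves/VBZ/B-VP/O Mary/NNP/B-NP/O'


def split_parse(parse_string):
    """Split a 'word/POS/chunk/PNP' string into one dict per word."""
    rows = []
    for token in parse_string.split():
        # split from the right so only the last three slashes are used
        word, pos, chunk, pnp = token.rsplit('/', 3)
        rows.append({'word': word, 'pos': pos, 'chunk': chunk, 'pnp': pnp})
    return rows


rows = split_parse(parsed)
for row in rows:
    print(row)
```

The sketch assumes every token has exactly four fields, as in the example above.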

## n-grams

**n**-grams are groups of n successive words. Quite often n-grams are created by shifting one word at a time through a text, but there are cases where the window skips k words at a time.

The usefulness of n-grams comes in with machine learning, where each n-gram is used as a feature for learning. These will be used more in the next workshop, but for now let's look at getting n-grams from a text using TextBlob:

TextBlob has an `ngrams` method that will take an optional argument `n`, which is the size of n-grams to generate. Default is 3.

The method returns a list of `WordList` objects.

In [31]:
# return list of n-grams (default n=3)
# get only first 5 n-grams
blob.ngrams()[:5]

[WordList(['VirginAmerica', 'status', 'match']),
 WordList(['status', 'match', 'program']),
 WordList(['match', 'program', 'I']),
 WordList(['program', 'I', 'applied']),
 WordList(['I', 'applied', 'and'])]

In [32]:
# get another set with n = 2
blob.ngrams(n=2)[:5]

[WordList(['VirginAmerica', 'status']),
 WordList(['status', 'match']),
 WordList(['match', 'program']),
 WordList(['program', 'I']),
 WordList(['I', 'applied'])]
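
One common way to turn n-grams into features is simply to count them. The sketch below does this with `collections.Counter` on a plain list of words; it is an illustration under simple assumptions (whitespace tokenization, a made-up sentence), and the helper name `ngram_counts` is hypothetical.

```python
from collections import Counter


def ngram_counts(words, n=2):
    """Count each n-gram, using tuples so they can serve as dict keys."""
    grams = (tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


# made-up example sentence, split on whitespace
words = 'the cow jumped over the moon and the cow slept'.split()
features = ngram_counts(words, n=2)
print(features.most_common(1))
```

Tuples are used instead of lists because lists are not hashable and so cannot be keys in a `Counter`.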

# Practice Problems

1. Create and parse blob (using index 25) and print the first 10 pieces on separate lines
2. Singularize all words in blob
3. Pluralize the words ['gallery', 'mouse', 'man']
4. Lemmatize the words ['categories', 'mice', 'better', 'found']
5. Print the first 5 unique POS tags in blob
6. Given the n-grams in the last cell in the notebook, reconstruct the original sentence
7. **Extra:** List all words in blob that are plural (with index of each word)

### For practice problem 6 in the second hour

In [33]:
ngrams = [['The', 'quick', 'brown', 'fox'],
 ['quick', 'brown', 'fox', 'jumps'],
 ['brown', 'fox', 'jumps', 'over'],
 ['fox', 'jumps', 'over', 'the'],
 ['jumps', 'over', 'the', 'lazy'],
 ['over', 'the', 'lazy', 'dog'],
 ['the', 'lazy', 'dog', 'and'],
 ['lazy', 'dog', 'and', 'the'],
 ['dog', 'and', 'the', 'cow'],
 ['and', 'the', 'cow', 'jumped'],
 ['the', 'cow', 'jumped', 'over'],
 ['cow', 'jumped', 'over', 'the'],
 ['jumped', 'over', 'the', 'moon']]