# Text Preprocessing (week 2)

This lab is based on material from the articles [Text Preprocessing in Python: Steps, Tools, and Examples](https://www.kdnuggets.com/2018/11/text-preprocessing-python.html)
and [Text Data Preprocessing: A Walkthrough in Python](https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html).

We outline the basic steps of text preprocessing, which convert human-language text into a machine-readable format for further processing.

## 1. Text data preprocessing: step by step approach

### Convert text to lowercase

In [1]:
input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
input_str = input_str.lower()
print(input_str)

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.


### Remove numbers
Remove numbers if they are not relevant to your analyses. Usually, regular expressions are used to remove numbers.

In [2]:
import re
input_str = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'
result = re.sub(r'\d+', '', input_str)  # \d+ matches one or more consecutive digits
print(result)

Box A contains  red and  white balls, while Box B contains  red and  blue balls.
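Note that `\d+` also strips digits embedded in tokens such as "B2B", and that the removal leaves double spaces behind. If only standalone numbers should go, a word-boundary pattern is one option. The sketch below is an illustration rather than part of the original articles; the helper name `remove_standalone_numbers` is hypothetical.

```python
import re

def remove_standalone_numbers(text):
    """Drop whole-number tokens but keep digits embedded in words (e.g. 'B2B')."""
    without_numbers = re.sub(r'\b\d+\b', '', text)       # \b anchors the match at word boundaries
    return re.sub(r'\s+', ' ', without_numbers).strip()  # collapse the gaps left behind

print(remove_standalone_numbers("Version py3 of the B2B tool shipped 3 fixes in 2017 alone."))
# Version py3 of the B2B tool shipped fixes in alone.
```

Decimals are not handled specially: in "3.88" both digit runs are removed and a lone "." remains, so prices and measurements may need their own pattern.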


### Remove punctuation
The following code replaces every symbol in the set [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~] with a space:

In [3]:
import string
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!" # Sample string
trantab = str.maketrans(string.punctuation, " " * len(string.punctuation))  # map each of the 32 punctuation characters to a space
                                                    # another example: trantab = str.maketrans("{}", " " * 2)

result = input_str.translate(trantab)
print(result)

This  is  an  example   of  string  with   punctuation    


In [4]:
result = re.sub(r'\s+', ' ', result)   # \s+ matches one or more consecutive whitespace characters
print(result)

This is an example of string with punctuation 
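Replacing punctuation with spaces and then collapsing whitespace takes two passes. When the punctuation can simply be deleted, the three-argument form of `str.maketrans` does it in one. This is a small sketch of that alternative; the variable name `delete_table` is our own.

```python
import string

input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"

# With two empty strings, the third argument lists characters to delete outright.
delete_table = str.maketrans('', '', string.punctuation)

print(input_str.translate(delete_table))
# This is an example of string with punctuation

print("well-known, isn't it?".translate(delete_table))
# wellknown isnt it
```

The second call shows the trade-off: deletion glues together the parts of hyphenated words and contractions, whereas replacement with a space keeps them apart.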


### Remove whitespace
To remove leading and trailing whitespace, you can use the string method strip().

In [5]:
input_str = " \t a string example\t "   # \t means tab
input_str = input_str.strip()
input_str

'a string example'

### Tokenization
Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered as tokens.

In [6]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /home/gaoqiang/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/gaoqiang/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/gaoqiang/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /home/gaoqiang/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/gaoqiang/nltk_data...


True

In [7]:
from nltk.tokenize import word_tokenize

input_str = "NLTK is a leading platform for building Python programs to work with human language data."
tokens = word_tokenize(input_str)
print(tokens)

['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']


In [8]:
from nltk.tokenize import TreebankWordTokenizer 
s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks.''' 
print(TreebankWordTokenizer().tokenize(s))  

['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']


In [9]:
print(word_tokenize(s)) # similar to TreebankWordTokenizer, but sentence-final periods are split off: 'York.' -> 'York' and '.'

['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']


In [10]:
s = "They'll save and invest more." 
TreebankWordTokenizer().tokenize(s) 

['They', "'ll", 'save', 'and', 'invest', 'more', '.']

In [11]:
s = "hi, my name can't hello," 
TreebankWordTokenizer().tokenize(s)

['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']

In [12]:
from nltk.tokenize import WordPunctTokenizer, WhitespaceTokenizer
WordPunctTokenizer().tokenize(s)

['hi', ',', 'my', 'name', 'can', "'", 't', 'hello', ',']

In [13]:
WhitespaceTokenizer().tokenize(s)

['hi,', 'my', 'name', "can't", 'hello,']
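The differences between these tokenizers come down to the splitting rule each one applies. As a rough illustration using only the standard library, the pattern below keeps runs of word characters together and treats runs of punctuation as separate tokens; on this input it reproduces the WordPunctTokenizer output shown above. The function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

print(simple_tokenize("hi, my name can't hello,"))
# ['hi', ',', 'my', 'name', 'can', "'", 't', 'hello', ',']
```

A rule this simple cannot keep "can't" together or split it into "ca" and "n't"; that kind of behavior needs the contraction-aware rules of the Treebank tokenizer.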

### Remove stop words
“Stop words” are the most common words in a language, such as “the”, “a”, “on”, “is”, and “all”. They carry little meaning on their own and are often removed from texts. The Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing, provides stop word lists that can be used for this.

In [14]:
from nltk.corpus import stopwords

input_str = "NLTK is a leading platform for building Python programs to work with human language data."
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(input_str)
result = [i for i in tokens if i not in stop_words]
print(result)

['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
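Stop word removal is a plain membership test, so it is case-sensitive. NLTK's English list is lowercase, which means a capitalized "The" at the start of a sentence would survive the filter above. A minimal sketch of a case-insensitive filter follows; the tiny `stop_words` set is a hand-written stand-in for the NLTK list so that the example runs without any downloads.

```python
stop_words = {"the", "is", "a", "for", "to", "with"}  # stand-in for stopwords.words('english')

tokens = ["The", "platform", "is", "a", "toolkit", "for", "Python"]

case_sensitive = [t for t in tokens if t not in stop_words]
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['The', 'platform', 'toolkit', 'Python']
print(case_insensitive)  # ['platform', 'toolkit', 'Python']
```

Lowercasing only for the comparison keeps the original casing of the tokens that remain, which can matter for later steps such as named entity recognition.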


### Stemming
Stemming is the process of reducing words to their stem, base, or root form (for example, books -> book, looked -> look). The two main algorithms are the Porter stemming algorithm (which removes common morphological and inflectional endings from words) and the Lancaster stemming algorithm (a more aggressive stemmer).

In [15]:
# use PorterStemmer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
input_str = "There are several types of stemming algorithms."
input_str = word_tokenize(input_str)
for word in input_str:
    print(stemmer.stem(word))

there
are
sever
type
of
stem
algorithm
.


In [16]:
# use LancasterStemmer
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
input_str = "There are several types of stemming algorithms."
input_str = word_tokenize(input_str)
for word in input_str:
    print(stemmer.stem(word))

ther
ar
sev
typ
of
stem
algorithm
.
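Both stemmers work by applying ordered suffix-rewriting rules. To give a feel for this, the sketch below implements only step 1a of the Porter algorithm (plural endings), as described in Porter's paper. The full algorithm has several further steps with conditions on the remaining stem, so this is not a substitute for PorterStemmer; the function name `porter_step_1a` is our own.

```python
# Step 1a of the Porter algorithm: rules are tried in order and the first
# matching suffix wins.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def porter_step_1a(word):
    """Apply the first matching plural-ending rule, or return the word unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats", "stem"]:
    print(w, "->", porter_step_1a(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
# stem -> stem
```

The "ss" -> "ss" rule looks redundant, but it is what stops "caress" from falling through to the plain "s" rule.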


### Lemmatization
Like stemming, lemmatization aims to reduce inflected forms to a common base form. Unlike stemming, it does not simply chop off inflections; instead, it uses lexical knowledge bases to find the correct base form of each word.

In [17]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
input_str = "been had done languages cities mice"
input_str = word_tokenize(input_str)
for word in input_str:
    print(lemmatizer.lemmatize(word))

been
had
done
language
city
mouse
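In the output above, "been", "had" and "done" come back unchanged because WordNetLemmatizer.lemmatize() treats every word as a noun unless a pos argument is given (the lemmatize_verbs function in section 2 passes pos='v' for this reason). The toy lookup below imitates that behavior with a hand-written lexicon; the lexicon and the `toy_lemmatize` name are illustrative assumptions, not WordNet data.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech).
LEXICON = {
    ("been", "v"): "be", ("had", "v"): "have", ("done", "v"): "do",
    ("languages", "n"): "language", ("cities", "n"): "city", ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if the pair is unknown."""
    return LEXICON.get((word, pos), word)

print([toy_lemmatize(w) for w in ["been", "had", "done", "mice"]])
# ['been', 'had', 'done', 'mouse']
print([toy_lemmatize(w, pos="v") for w in ["been", "had", "done"]])
# ['be', 'have', 'do']
```

In practice the part of speech usually comes from a POS tagger, so that each token is lemmatized with the tag it actually carries in context.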


## 2. Text data preprocessing with a sample text

We need some sample text. We'll start with something very small and artificial in order to easily see the results of what we are doing step by step.

In [18]:
import unicodedata
import contractions  # install with: pip install contractions
import inflect       # install with: pip install inflect
from bs4 import BeautifulSoup
from nltk import sent_tokenize
from nltk.stem import LancasterStemmer

In [19]:
# nltk.download()

In [20]:
sample = """<h1>Title Goes Here</h1>

<b>Bolded Text</b>
<i>Italicized Text</i>

<img src="this should all be gone"/>
<a href="this will be gone, too">But this will still be here!</a>

I run. He ran. She is running. Will they stop running?
I talked. She was talking. They talked to them about running. Who ran to the talking runner?

[Some text we don't want to keep is in here]

¡Sebastián, Nicolás, Alejandro and Jéronimo are going to the store tomorrow morning!

something... is! wrong() with.,; this :: sentence.

I can't do this anymore. I didn't know them. Why couldn't you have dinner at the restaurant?

My favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.

Don't do it.... Just don't. Billy! I know what you're doing. This is a great little house you've got here.

[This is some other unwanted text]

John: "Well, well, well."
James: "There, there. There, there."

&nbsp;&nbsp;

There are a lot of reasons not to do this. There are 101 reasons not to do it. 1000000 reasons, actually.

I have to go get 2 tutus from 2 different stores, too.

22    45   1067   445

{{Here is some stuff inside of double curly braces.}}

{Here is more stuff in single curly braces.}

[DELETE]

</body>
</html>"""

### Noise Removal
 
Let's loosely define noise removal as text-specific normalization tasks which often take place prior to tokenization. While the other 2 major steps of the preprocessing framework (tokenization and normalization) are basically task-independent, noise removal is much more task-specific.
Sample noise removal tasks could include:
- removing text file headers, footers 
- removing HTML, XML, etc. markup and metadata 
- extracting valuable data from other formats, such as JSON 

Many denoising tasks, such as parsing a JSON structure, would need to be implemented prior to tokenization.
In our data preprocessing pipeline, we strip away HTML markup with the help of the BeautifulSoup library and use a regular expression to remove opening and closing square brackets and anything between them (we assume this is necessary based on our sample text).

In [21]:
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text) # [^]]* : match anything except ']'

def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text

sample = denoise_text(sample)
print(sample)

Title Goes Here
Bolded Text
Italicized Text

But this will still be here!

I run. He ran. She is running. Will they stop running?
I talked. She was talking. They talked to them about running. Who ran to the talking runner?



¡Sebastián, Nicolás, Alejandro and Jéronimo are going to the store tomorrow morning!

something... is! wrong() with.,; this :: sentence.

I can't do this anymore. I didn't know them. Why couldn't you have dinner at the restaurant?

My favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.

Don't do it.... Just don't. Billy! I know what you're doing. This is a great little house you've got here.



John: "Well, well, well."
James: "There, there. There, there."

  

There are a lot of reasons not to do this. There are 101 reasons not to do it. 1000000 reasons, actually.

I have to go get 2 tutus from 2 different stores, too.

22    45   1067   445

{{Here is some stuff inside of double curly braces
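If BeautifulSoup is not available, the standard library's html.parser can do a rougher version of the same job. The sketch below collects only text nodes; the class name `TextExtractor` and the helper `strip_html_stdlib` are hypothetical, and the result is not guaranteed to match BeautifulSoup's get_text() on malformed markup.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Accumulate the text content of an HTML fragment, ignoring tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True, so entities are decoded
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html_stdlib(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(strip_html_stdlib('<b>Bolded Text</b> <a href="gone">But this stays!</a>'))
# Bolded Text But this stays!
```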

### Expanding contractions
Expanding contractions is not mandatory at this stage, before tokenization (the ordering of text preprocessing tasks is relatively flexible, so this caveat applies to most steps). It is nonetheless useful here, since our word tokenizer would otherwise split a word like "didn't" into "did" and "n't". This can be remedied at a later stage, but expanding contractions first is easier and more straightforward.

In [22]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

sample = replace_contractions(sample)  # e.g., can't -> cannot; don't -> do not
print(sample)

Title Goes Here
Bolded Text
Italicized Text

But this will still be here!

I run. He ran. She is running. Will they stop running?
I talked. She was talking. They talked to them about running. Who ran to the talking runner?



¡Sebastián, Nicolás, Alejandro and Jéronimo are going to the store tomorrow morning!

something... is! wrong() with.,; this :: sentence.

I cannot do this anymore. I did not know them. Why could not you have dinner at the restaurant?

My favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.

Do not do it.... Just do not. Billy! I know what you are doing. This is a great little house you have got here.



John: "Well, well, well."
James: "There, there. There, there."

  

There are a lot of reasons not to do this. There are 101 reasons not to do it. 1000000 reasons, actually.

I have to go get 2 tutus from 2 different stores, too.

22    45   1067   445

{{Here is some stuff inside of double curl
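Conceptually, contraction expansion is a dictionary lookup driven by a regular expression. The sketch below shows the idea with a deliberately tiny, hand-written mapping; it is not how the contractions library is implemented, and the names `CONTRACTIONS` and `expand_contractions` are hypothetical.

```python
import re

# A small illustrative mapping; a real list would be much longer.
CONTRACTIONS = {
    "can't": "cannot",
    "didn't": "did not",
    "couldn't": "could not",
    "don't": "do not",
    "you're": "you are",
    "you've": "you have",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    """Replace known contractions, preserving a leading capital letter."""
    def replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _PATTERN.sub(replace, text)

print(expand_contractions("Don't do it.... Just don't."))
# Do not do it.... Just do not.
```

This version only recognizes the straight apostrophe ('); text copied from word processors often uses the curly form and would need to be normalized first.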

### Tokenization

In [23]:
words = nltk.word_tokenize(sample)
print(words)  # note that there are noisy words or terms.

['Title', 'Goes', 'Here', 'Bolded', 'Text', 'Italicized', 'Text', 'But', 'this', 'will', 'still', 'be', 'here', '!', 'I', 'run', '.', 'He', 'ran', '.', 'She', 'is', 'running', '.', 'Will', 'they', 'stop', 'running', '?', 'I', 'talked', '.', 'She', 'was', 'talking', '.', 'They', 'talked', 'to', 'them', 'about', 'running', '.', 'Who', 'ran', 'to', 'the', 'talking', 'runner', '?', '¡Sebastián', ',', 'Nicolás', ',', 'Alejandro', 'and', 'Jéronimo', 'are', 'going', 'to', 'the', 'store', 'tomorrow', 'morning', '!', 'something', '...', 'is', '!', 'wrong', '(', ')', 'with.', ',', ';', 'this', ':', ':', 'sentence', '.', 'I', 'can', 'not', 'do', 'this', 'anymore', '.', 'I', 'did', 'not', 'know', 'them', '.', 'Why', 'could', 'not', 'you', 'have', 'dinner', 'at', 'the', 'restaurant', '?', 'My', 'favorite', 'movie', 'franchises', ',', 'in', 'order', ':', 'Indiana', 'Jones', ';', 'Marvel', 'Cinematic', 'Universe', ';', 'Star', 'Wars', ';', 'Back', 'to', 'the', 'Future', ';', 'Harry', 'Potter', '.', '

### Normalization
Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, converting numbers to their word equivalents, and so on. Normalization puts all words on equal footing, and allows processing to proceed uniformly.
Normalizing text can mean performing a number of tasks, but for our framework we will approach normalization in 3 distinct steps: (1) stemming, (2) lemmatization, and (3) everything else. 

In [24]:
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        # Normalise (normalize) unicode data in Python to remove umlauts, accents etc.
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word) # \w: an alphanumeric character; \s: a whitespace character
        if new_word != '':
            new_words.append(new_word)
    return new_words

def replace_numbers(words):
    """Replace all integer occurrences in list of tokenized words with textual representation"""
    p = inflect.engine()
    new_words = []
    for word in words:
        if word.isdigit():
            new_word = p.number_to_words(word)
            new_words.append(new_word)
        else:
            new_words.append(word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    stop_words = set(stopwords.words('english'))  # build the set once instead of reloading the list for every word
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words

def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = replace_numbers(words)  # 22 -> twenty-two
    words = remove_stopwords(words)
    return words

words = normalize(words)

print(words)  # better words or terms

['title', 'goes', 'bolded', 'text', 'italicized', 'text', 'still', 'run', 'ran', 'running', 'stop', 'running', 'talked', 'talking', 'talked', 'running', 'ran', 'talking', 'runner', 'sebastian', 'nicolas', 'alejandro', 'jeronimo', 'going', 'store', 'tomorrow', 'morning', 'something', 'wrong', 'sentence', 'anymore', 'know', 'could', 'dinner', 'restaurant', 'favorite', 'movie', 'franchises', 'order', 'indiana', 'jones', 'marvel', 'cinematic', 'universe', 'star', 'wars', 'back', 'future', 'harry', 'potter', 'billy', 'know', 'great', 'little', 'house', 'got', 'john', 'well', 'well', 'well', 'james', 'lot', 'reasons', 'one hundred and one', 'reasons', 'one million', 'reasons', 'actually', 'go', 'get', 'two', 'tutus', 'two', 'different', 'stores', 'twenty-two', 'forty-five', 'one thousand and sixty-seven', 'four hundred and forty-five', 'stuff', 'inside', 'double', 'curly', 'braces', 'stuff', 'single', 'curly', 'braces']
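The remove_non_ascii step explains why "¡Sebastián" ends up as "sebastian" in the output. NFKD normalization splits an accented letter into a base letter plus a combining mark, and the ASCII encode step with 'ignore' then drops everything outside ASCII. The short sketch below isolates that behavior; the helper name `to_ascii` is our own.

```python
import unicodedata

def to_ascii(word):
    """Decompose accented characters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize('NFKD', word)  # 'á' -> 'a' + combining acute accent
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('¡Sebastián'))  # Sebastian
print(to_ascii('Jéronimo'))    # Jeronimo
print(to_ascii('Straße'))      # Strae  ('ß' has no decomposition, so it is dropped)
```

The last line is a reminder that this approach is lossy: characters without a decomposition disappear entirely, which may be unacceptable for non-English text.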


#### Stemming and lemmatization functions

In [25]:
def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

def stem_and_lemmatize(words):
    stems = stem_words(words)
    lemmas = lemmatize_verbs(words)
    return stems, lemmas

stems, lemmas = stem_and_lemmatize(words)  # use either stemming or lemmatization, but don't use both of them

print('Stemmed:\n', stems)
print('\nLemmatized:\n', lemmas)

Stemmed:
 ['titl', 'goe', 'bold', 'text', 'it', 'text', 'stil', 'run', 'ran', 'run', 'stop', 'run', 'talk', 'talk', 'talk', 'run', 'ran', 'talk', 'run', 'sebast', 'nicola', 'alejandro', 'jeronimo', 'going', 'stor', 'tomorrow', 'morn', 'someth', 'wrong', 'sent', 'anym', 'know', 'could', 'din', 'resta', 'favorit', 'movy', 'franch', 'ord', 'indian', 'jon', 'marvel', 'cinem', 'univers', 'star', 'war', 'back', 'fut', 'harry', 'pot', 'bil', 'know', 'gre', 'littl', 'hous', 'got', 'john', 'wel', 'wel', 'wel', 'jam', 'lot', 'reason', 'one hundred and on', 'reason', 'one million', 'reason', 'act', 'go', 'get', 'two', 'tut', 'two', 'diff', 'stor', 'twenty-two', 'forty-five', 'one thousand and sixty-seven', 'four hundred and forty-five', 'stuff', 'insid', 'doubl', 'cur', 'brac', 'stuff', 'singl', 'cur', 'brac']

Lemmatized:
 ['title', 'go', 'bolded', 'text', 'italicize', 'text', 'still', 'run', 'run', 'run', 'stop', 'run', 'talk', 'talk', 'talk', 'run', 'run', 'talk', 'runner', 'sebastian', 'nicol

## 3. Additional Text Processing

### n-grams
[TextBlob](https://textblob.readthedocs.io/en/dev/) is a Python library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
The TextBlob.ngrams() method returns a list of WordList objects, each containing n successive words.

In [26]:
# use TextBlob; install with: pip install textblob
from textblob import TextBlob
blob = TextBlob("Now is better than never.")
blob.ngrams(n=2)

[WordList(['Now', 'is']),
 WordList(['is', 'better']),
 WordList(['better', 'than']),
 WordList(['than', 'never'])]

In [27]:
blob.ngrams(n=3)

[WordList(['Now', 'is', 'better']),
 WordList(['is', 'better', 'than']),
 WordList(['better', 'than', 'never'])]

In [28]:
# use nltk
from nltk.util import ngrams, bigrams
input_str = "NLTK is a leading platform for building Python programs to work with human language data."
tokens = word_tokenize(input_str)
list(ngrams(tokens, 2))

[('NLTK', 'is'),
 ('is', 'a'),
 ('a', 'leading'),
 ('leading', 'platform'),
 ('platform', 'for'),
 ('for', 'building'),
 ('building', 'Python'),
 ('Python', 'programs'),
 ('programs', 'to'),
 ('to', 'work'),
 ('work', 'with'),
 ('with', 'human'),
 ('human', 'language'),
 ('language', 'data'),
 ('data', '.')]

In [29]:
list(bigrams(tokens))

[('NLTK', 'is'),
 ('is', 'a'),
 ('a', 'leading'),
 ('leading', 'platform'),
 ('platform', 'for'),
 ('for', 'building'),
 ('building', 'Python'),
 ('Python', 'programs'),
 ('programs', 'to'),
 ('to', 'work'),
 ('work', 'with'),
 ('with', 'human'),
 ('human', 'language'),
 ('language', 'data'),
 ('data', '.')]

In [30]:
list(ngrams(tokens, 3))

[('NLTK', 'is', 'a'),
 ('is', 'a', 'leading'),
 ('a', 'leading', 'platform'),
 ('leading', 'platform', 'for'),
 ('platform', 'for', 'building'),
 ('for', 'building', 'Python'),
 ('building', 'Python', 'programs'),
 ('Python', 'programs', 'to'),
 ('programs', 'to', 'work'),
 ('to', 'work', 'with'),
 ('work', 'with', 'human'),
 ('with', 'human', 'language'),
 ('human', 'language', 'data'),
 ('language', 'data', '.')]
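Under the hood, an n-gram is just a sliding window over the token list. As a dependency-free sketch of the same idea, the function below zips n shifted copies of the list; the name `make_ngrams` is hypothetical and not part of NLTK or TextBlob.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] are zipped position by position;
    # zip stops at the shortest slice, which yields exactly len(tokens) - n + 1 windows.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Now is better than never".split()
print(make_ngrams(tokens, 2))
# [('Now', 'is'), ('is', 'better'), ('better', 'than'), ('than', 'never')]
print(len(make_ngrams(tokens, 3)))  # 3
```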

### Find most common ngrams

In [31]:
from collections import Counter
from nltk import ngrams
with open('big.txt') as f:  # the context manager closes the file after reading
    bigtxt = f.read()
ngram_counts = Counter(ngrams(bigtxt.split(), 3))
ngram_counts.most_common(10)

[(('one', 'of', 'the'), 332),
 (('out', 'of', 'the'), 244),
 (('of', 'the', 'United'), 235),
 (('that', 'he', 'was'), 191),
 (('the', 'United', 'States'), 184),
 (('that', 'it', 'was'), 180),
 (('and', 'in', 'the'), 174),
 (('met', 'with', 'in'), 173),
 (('up', 'to', 'the'), 159),
 (('part', 'of', 'the'), 158)]

In [32]:
with open('mbox.txt') as f:  # the context manager closes the file after reading
    bigtxt = f.read()
ngram_counts = Counter(ngrams(bigtxt.split(), 2))
ngram_counts.most_common(10)

[(('Received:', 'from'), 12579),
 (('with', 'ESMTP'), 7188),
 (('ESMTP', 'id'), 7188),
 (('Dec', '2007'), 7063),
 (('Nov', '2007'), 6810),
 (('-0500', 'Received:'), 5843),
 (('for', '<source@collab.sakaiproject.org>;'), 5391),
 (('text/plain;', 'charset=UTF-8'), 5391),
 (('+0000', '(GMT)'), 4932),
 (('from', 'murder'), 3594)]

### Sentence Segmentation

In [33]:
from nltk.tokenize import sent_tokenize

text = "this's a sent tokenize test. this is sent two. is this sent three? sent 4 is cool! Now it's your turn."
sent_tokenize_list = sent_tokenize(text)

print("The length of sentences", " = ", len(sent_tokenize_list))
print(sent_tokenize_list)

The length of sentences  =  5
["this's a sent tokenize test.", 'this is sent two.', 'is this sent three?', 'sent 4 is cool!', "Now it's your turn."]


In [34]:
def sentence_preprocess(document):
    sentences = nltk.sent_tokenize(document) 
    print(sentences, "\n")
    sentences = [nltk.word_tokenize(sent) for sent in sentences] 
    print(sentences, "\n")
    sentences = [nltk.pos_tag(sent) for sent in sentences] 
    print(sentences, "\n")

In [35]:
sample_doc = """I cannot do this anymore. I did not know them. Why could not you have dinner at the restaurant?

My favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.

do not do it.... Just do not. Billy! I know what you are doing. This is a great little house you have got here."""

In [36]:
sentence_preprocess(sample_doc)

['I cannot do this anymore.', 'I did not know them.', 'Why could not you have dinner at the restaurant?', 'My favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.', 'do not do it.... Just do not.', 'Billy!', 'I know what you are doing.', 'This is a great little house you have got here.'] 

[['I', 'can', 'not', 'do', 'this', 'anymore', '.'], ['I', 'did', 'not', 'know', 'them', '.'], ['Why', 'could', 'not', 'you', 'have', 'dinner', 'at', 'the', 'restaurant', '?'], ['My', 'favorite', 'movie', 'franchises', ',', 'in', 'order', ':', 'Indiana', 'Jones', ';', 'Marvel', 'Cinematic', 'Universe', ';', 'Star', 'Wars', ';', 'Back', 'to', 'the', 'Future', ';', 'Harry', 'Potter', '.'], ['do', 'not', 'do', 'it', '....', 'Just', 'do', 'not', '.'], ['Billy', '!'], ['I', 'know', 'what', 'you', 'are', 'doing', '.'], ['This', 'is', 'a', 'great', 'little', 'house', 'you', 'have', 'got', 'here', '.']] 

[[('I', 'PRP'), ('can', 'MD'), ('n
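To see why a trained sentence tokenizer is worth using, it helps to compare it with the obvious rule of splitting after '.', '!' or '?'. The sketch below implements that naive rule with the standard library; the function name `naive_sent_split` is hypothetical. It handles the simple test text from the first cell of this section but breaks on abbreviations, which is the kind of case a trained model such as the one behind sent_tokenize is designed to handle.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

text = "this's a sent tokenize test. this is sent two. is this sent three? sent 4 is cool! Now it's your turn."
print(len(naive_sent_split(text)))  # 5

print(naive_sent_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```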

### Part-of-speech (POS) tagging
Part-of-speech tagging assigns a part of speech (such as noun, verb, or adjective) to each word of a given text, based on the word's definition and its context.

In [37]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/gaoqiang/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/gaoqiang/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /home/gaoqiang/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [38]:
### POS TAGGING using nltk ###
from nltk.tokenize import word_tokenize

text = word_tokenize("Parts of speech examples: an article, to write, interesting, easily, and, of")
print(nltk.pos_tag(text)) # input for pos_tag() is a list of words, not a single word.

[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), (':', ':'), ('an', 'DT'), ('article', 'NN'), (',', ','), ('to', 'TO'), ('write', 'VB'), (',', ','), ('interesting', 'VBG'), (',', ','), ('easily', 'RB'), (',', ','), ('and', 'CC'), (',', ','), ('of', 'IN')]



In [39]:
### POS TAGGING using TextBlob ###
input_str="Parts of speech examples: an article, to write, interesting, easily, and, of"
from textblob import TextBlob
result = TextBlob(input_str)
print(result.tags)

[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]


### Chunking (shallow parsing)
Chunking is a natural language processing technique that identifies the constituent parts of a sentence (nouns, verbs, adjectives, etc.) and links them into higher-order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.).

In [40]:
# The first step is to determine the part of speech for each word:
input_str="A black television and a white stove were bought for the new apartment of John."
result = TextBlob(input_str)
print(result.tags)

[('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]


In [41]:
# extract noun phrases using TextBlob
# need to install necessary data: $> python -m textblob.download_corpora
result.noun_phrases

WordList(['black television', 'white stove', 'new apartment', 'john'])
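Noun phrase chunking can be thought of as pattern matching over the tag sequence. The sketch below groups tokens that fit the pattern "optional determiner, any adjectives, one or more nouns", using the tagged sentence printed above as input. It is a hand-rolled illustration, not TextBlob's extractor: the function name `noun_phrase_chunks` is hypothetical, and unlike noun_phrases it keeps determiners and the original casing.

```python
def noun_phrase_chunks(tagged):
    """Group (word, tag) pairs matching DT? JJ* NN+ into phrases."""
    chunks = []
    current = []        # words of the candidate chunk being built
    seen_noun = False   # a candidate only counts once it contains a noun

    def flush():
        nonlocal current, seen_noun
        if seen_noun:
            chunks.append(" ".join(current))
        current, seen_noun = [], False

    for word, tag in tagged:
        if tag.startswith("NN"):              # NN, NNS, NNP, NNPS
            current.append(word)
            seen_noun = True
        elif tag == "JJ" and not seen_noun:   # adjectives must precede the nouns
            current.append(word)
        elif tag == "DT" and not current:     # a determiner may only open a chunk
            current.append(word)
        else:
            flush()
            if tag in ("DT", "JJ"):           # this token may start the next chunk
                current.append(word)
    flush()
    return chunks

tagged = [('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'),
          ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'),
          ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'),
          ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]
print(noun_phrase_chunks(tagged))
# ['A black television', 'a white stove', 'the new apartment', 'John']
```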

In [42]:
import nltk
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker', quiet=False)
# nltk.download('all')
print(nltk.data.path)

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /home/gaoqiang/nltk_data...


['/home/gaoqiang/nltk_data', '/home/gaoqiang/anaconda3/nltk_data', '/home/gaoqiang/anaconda3/share/nltk_data', '/home/gaoqiang/anaconda3/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']


[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.
[nltk_data] Downloading package words to /home/gaoqiang/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to /home/gaoqiang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/gaoqiang/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Named entity recognition
Named-entity recognition (NER) aims to find named entities in text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.).

In [43]:
# refer to "7.5 Named Entity Recognition" (https://www.nltk.org/book/ch07.html)
from nltk import word_tokenize, pos_tag, ne_chunk
nltk.download('maxent_ne_chunker_tab')
input_str = "Bill works for Apple so he went to Boston for a conference."
print(ne_chunk(pos_tag(word_tokenize(input_str))))   # note: 'Apple' is not labeled as an ORGANIZATION in the output below

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /home/gaoqiang/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!


(S
  (PERSON Bill/NNP)
  works/VBZ
  for/IN
  Apple/NNP
  so/IN
  he/PRP
  went/VBD
  to/TO
  (GPE Boston/NNP)
  for/IN
  a/DT
  conference/NN
  ./.)


### WordNet
WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. We'll begin by looking at synonyms and how they are accessed in WordNet.
Reference: https://www.nltk.org/book/ch02.html and http://www.nltk.org/howto/wordnet.html

In [44]:
from nltk.corpus import wordnet as wn
wn.synsets('motorcar') 

[Synset('car.n.01')]

In [45]:
# The entity car.n.01 is called a synset, or "synonym set", a collection of synonymous words (or "lemmas")
wn.synset('car.n.01').lemma_names()

['car', 'auto', 'automobile', 'machine', 'motorcar']

In [46]:
wn.synset('car.n.01').definition()

'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

In [47]:
wn.synset('car.n.01').examples()

['he needs a car to get to work']

In [48]:
motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hypernyms()  # parent synsets
print(types_of_motorcar)

[Synset('motor_vehicle.n.01')]


In [49]:
wn.synset('tree.n.01').part_meronyms() # from an item to its components (meronyms) 

[Synset('stump.n.01'),
 Synset('crown.n.07'),
 Synset('burl.n.02'),
 Synset('trunk.n.01'),
 Synset('limb.n.02')]

In [50]:
wn.lemma('supply.n.02.supply').antonyms()

[Lemma('demand.n.02.demand')]

### Word Sense Disambiguation

NLTK's lesk function (http://www.nltk.org/howto/wsd.html; https://www.nltk.org/_modules/nltk/wsd.html) implements the classic Lesk algorithm for Word Sense Disambiguation (WSD), which relies on the dictionary definitions of the ambiguous word. Given an ambiguous word and the context in which it occurs, lesk returns the Synset whose definition shares the largest number of words with the context sentence.

In [51]:
from nltk.wsd import lesk
sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
print(lesk(sent, 'bank', 'n'))
print(lesk(sent, 'bank'))

Synset('savings_bank.n.02')
Synset('savings_bank.n.02')


In [52]:
# The definitions for "bank" are:
# online version: http://wordnetweb.princeton.edu/perl/webwn
# "bank" in sent is close to the sense of Synset('depository_financial_institution.n.01') & Synset('bank.n.09'). 
from nltk.corpus import wordnet as wn
for ss in wn.synsets('bank'):
    print(ss, ss.definition())

Synset('bank.n.01') sloping land (especially the slope beside a body of water)
Synset('depository_financial_institution.n.01') a financial institution that accepts deposits and channels the money into lending activities
Synset('bank.n.03') a long ridge or pile
Synset('bank.n.04') an arrangement of similar objects in a row or in tiers
Synset('bank.n.05') a supply or stock held in reserve for future use (especially in emergencies)
Synset('bank.n.06') the funds held by a gambling house or the dealer in some gambling games
Synset('bank.n.07') a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
Synset('savings_bank.n.02') a container (usually with a slot in the top) for keeping money at home
Synset('bank.n.09') a building in which the business of banking transacted
Synset('bank.n.10') a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
Synset('bank.v.01') tip laterally
Sy
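The scoring idea behind Lesk can be reproduced by hand. The sketch below counts the words shared between the context sentence and four of the definitions listed above, assuming a plain whitespace split of each definition; the `overlap` helper is our own and this is a simplification, not NLTK's implementation.

```python
context = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']

# Definitions copied from the wn.synsets('bank') listing above.
glosses = {
    'bank.n.01': "sloping land (especially the slope beside a body of water)",
    'depository_financial_institution.n.01': "a financial institution that accepts deposits and channels the money into lending activities",
    'savings_bank.n.02': "a container (usually with a slot in the top) for keeping money at home",
    'bank.n.09': "a building in which the business of banking transacted",
}

def overlap(context_words, gloss):
    """Count distinct words shared by the context and a definition."""
    return len(set(context_words) & set(gloss.split()))

scores = {name: overlap(context, gloss) for name, gloss in glosses.items()}
print(scores)
# {'bank.n.01': 1, 'depository_financial_institution.n.01': 2, 'savings_bank.n.02': 2, 'bank.n.09': 1}
```

Under this scoring the financial-institution sense and the money-box sense tie, each matching only "the" and "money" ("deposits" does not match "deposit" without stemming). That gives one plausible reason why lesk returns savings_bank.n.02 here, and it suggests that removing stop words and stemming before the comparison could change the result.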

In [53]:
# Test disambiguation of POS tagged "able".
sent = 'people should be able to marry a person of their choice'.split()
print(lesk(sent, 'able'))
print(lesk(sent, 'able', pos='a'))  # restricting the POS to adjective yields the expected synset
# 'a' means ADJECTIVE; 's' means ADJECTIVE SATELLITE
# Certain adjectives carry minimal meaning, e.g. "dry", "good", etc. Each of these is the center of an adjective synset in WordNet.
# Adjective satellites impose additional commitments on top of the meaning of the central adjective,
# e.g. "arid" = "dry" + a particular context (i.e. climates)

Synset('able.s.04')
Synset('able.a.01')


In [54]:
for ss in wn.synsets('able'):
    print(ss, ss.definition())

Synset('able.a.01') (usually followed by `to') having the necessary means or skill or know-how or authority to do something
Synset('able.s.02') have the skills and qualifications to do things well
Synset('able.s.03') having inherent physical or mental ability or capacity
Synset('able.s.04') having a strong healthy body
