# Text Preprocessing II (week 3)

### Word Sense Disambiguation

The Lesk algorithm (http://www.nltk.org/howto/wsd.html; https://www.nltk.org/_modules/nltk/wsd.html) is a classic approach to Word Sense Disambiguation (WSD) that relies on the dictionary definitions of the ambiguous word. Given an ambiguous word and the context in which it occurs, NLTK's lesk function returns the Synset whose definition shares the highest number of words with the context sentence.

In [1]:
from nltk.wsd import lesk
sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
print(lesk(sent, 'bank', 'n'))
print(lesk(sent, 'bank'))

Synset('savings_bank.n.02')
Synset('savings_bank.n.02')


In [2]:
# The definitions for "bank" are:
# online version: http://wordnetweb.princeton.edu/perl/webwn
# "bank" in sent is close to the sense of 
# Synset('depository_financial_institution.n.01'). 
from nltk.corpus import wordnet as wn
for ss in wn.synsets('bank'):
    print(ss, ss.definition())

Synset('bank.n.01') sloping land (especially the slope beside a body of water)
Synset('depository_financial_institution.n.01') a financial institution that accepts deposits and channels the money into lending activities
Synset('bank.n.03') a long ridge or pile
Synset('bank.n.04') an arrangement of similar objects in a row or in tiers
Synset('bank.n.05') a supply or stock held in reserve for future use (especially in emergencies)
Synset('bank.n.06') the funds held by a gambling house or the dealer in some gambling games
Synset('bank.n.07') a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
Synset('savings_bank.n.02') a container (usually with a slot in the top) for keeping money at home
Synset('bank.n.09') a building in which the business of banking transacted
Synset('bank.n.10') a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
Synset('bank.v.01') tip laterally
Sy

In [3]:
# Disambiguate "able" with and without a POS tag; supplying the POS gives a more accurate result
print([(s, s.pos()) for s in wn.synsets('able')])
sent = 'people should be able to marry a person of their choice'.split()
print(lesk(sent, 'able'))
print(lesk(sent, 'able', pos='a'))  # with pos='a' the expected synset is returned
# a means ADJECTIVE; s means ADJECTIVE SATELLITE 
# Certain adjectives carry a minimal, central meaning, e.g. "dry", "good". Each of these is the center of an adjective synset in WordNet.
# Adjective satellites impose additional commitments on top of the meaning of the central adjective, 
# e.g. "arid" = "dry" + a particular context (i.e. climates)

[(Synset('able.a.01'), 'a'), (Synset('able.s.02'), 's'), (Synset('able.s.03'), 's'), (Synset('able.s.04'), 's')]
Synset('able.s.04')
Synset('able.a.01')


In [4]:
for ss in wn.synsets('able'):
    print(ss, ss.definition())

Synset('able.a.01') (usually followed by `to') having the necessary means or skill or know-how or authority to do something
Synset('able.s.02') have the skills and qualifications to do things well
Synset('able.s.03') having inherent physical or mental ability or capacity
Synset('able.s.04') having a strong healthy body
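
The overlap count at the heart of Lesk can be sketched in a few lines of plain Python. This is a simplified illustration, not NLTK's implementation: the sense names and glosses in TOY_GLOSSES are hypothetical and were written for this example only.

```python
# Hypothetical toy glosses, written for this sketch (not WordNet's definitions).
TOY_GLOSSES = {
    'bank_river': 'sloping land beside a body of water',
    'bank_finance': 'an institution that accepts money on deposit',
}

def simplified_lesk(context_tokens, glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context = set(context_tokens)
    # Count the overlap between the context and each gloss, then keep the best.
    return max(glosses, key=lambda sense: len(context & set(glosses[sense].split())))

sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
print(simplified_lesk(sent, TOY_GLOSSES))  # bank_finance ("money" and "deposit" overlap)
```

Because the count is purely lexical, function words such as "the" also contribute to the overlap, so short contexts can favor an unexpected sense.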


### Noise Removal
 
Many denoising tasks, such as removing HTML markup or parsing a JSON structure, need to be carried out prior to tokenization. In our data preprocessing pipeline, we will strip away HTML markup with the help of the BeautifulSoup library.
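
As a small illustration of the two denoising tasks just mentioned, the sketch below parses a JSON record and strips the tags from one of its fields. It uses only the standard library (json and html.parser) rather than BeautifulSoup, and the record with its field names "id" and "body" is hypothetical.

```python
import json
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

# Hypothetical record: the field names "id" and "body" are made up for this sketch.
raw = '{"id": 1, "body": "<p>He <b>ran</b> to the store.</p>"}'
record = json.loads(raw)
clean = strip_tags(record['body'])
print(clean)  # He ran to the store.
```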

In [5]:
sample = """<h1>Title Goes Here</h1>

<b>Bolded Text</b>
<i>Italicized Text</i>

<img src="this should all be gone"/>
<a href="this will be gone, too">But this will still be here!</a>

I run. He ran. She is running. Will they stop running?
I talked. She was talking. They talked to them about running. Who ran to the talking runner?

[Some text we don't want to keep is in here]

¡Sebastián, Nicolás, Alejandro and Jéronimo are going to the store tomorrow morning!

something... is! wrong() with.,; this :: sentence.

I can't do this anymore. I didn't know them. Why couldn't you have dinner at the restaurant?

My favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.

Don't do it.... Just don't. Billy! I know what you're doing. This is a great little house you've got here.

[This is some other unwanted text]

John: "Well, well, well."
James: "There, there. There, there."

&nbsp;&nbsp;

There are a lot of reasons not to do this. There are 101 reasons not to do it. 1000000 reasons, actually.

I have to go get 2 tutus from 2 different stores, too.

22    45   1067   445

{{Here is some stuff inside of double curly braces.}}

{Here is more stuff in single curly braces.}

[DELETE]

</body>
</html>"""

In [6]:
from bs4 import BeautifulSoup

def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

sample = strip_html(sample)
print(sample)

Title Goes Here
Bolded Text
Italicized Text

But this will still be here!

I run. He ran. She is running. Will they stop running?
I talked. She was talking. They talked to them about running. Who ran to the talking runner?

[Some text we don't want to keep is in here]

¡Sebastián, Nicolás, Alejandro and Jéronimo are going to the store tomorrow morning!

something... is! wrong() with.,; this :: sentence.

I can't do this anymore. I didn't know them. Why couldn't you have dinner at the restaurant?

My favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.

Don't do it.... Just don't. Billy! I know what you're doing. This is a great little house you've got here.

[This is some other unwanted text]

John: "Well, well, well."
James: "There, there. There, there."

  

There are a lot of reasons not to do this. There are 101 reasons not to do it. 1000000 reasons, actually.

I have to go get 2 tutus from 2 different stores, t

### Expanding contractions
Expanding contractions before tokenization is not mandatory (the ordering of text preprocessing tasks is relatively flexible), but replacing contractions with their expansions at this point can be beneficial, since our word tokenizer will split words like "didn't" into "did" and "n't". It is possible to remedy this after tokenization, but doing so beforehand is easier and more straightforward.

In [15]:
!pip install textsearch
!pip install contractions



In [16]:
import contractions

text = "I can't go to the movies. We don't want to buy the books."

output = contractions.fix(text)  # e.g., don't -> do not; can't -> can not in the output below
print(output)

I can not go to the movies. We do not want to buy the books.
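
What such an expansion does can be sketched with a dictionary and one regular expression. This is only an illustration under stated assumptions: CONTRACTION_MAP is a hypothetical, deliberately tiny mapping, and the contractions package itself uses a much larger table and its own matching logic.

```python
import re

# Hypothetical, deliberately tiny mapping written for this sketch.
CONTRACTION_MAP = {"can't": "can not", "don't": "do not", "didn't": "did not"}

# One alternation of all keys, longest first, matched case-insensitively on word boundaries.
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(CONTRACTION_MAP, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its (lower-cased) expansion."""
    return _pattern.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

expanded = expand_contractions("I can't go to the movies. We don't want to buy the books.")
print(expanded)  # I can not go to the movies. We do not want to buy the books.
```

Note that this sketch does not preserve capitalization: a sentence-initial "Don't" becomes "do not".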


In [17]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

sample = replace_contractions(sample)  
print(sample)

Title Goes Here
Bolded Text
Italicized Text

But this will still be here!

I run. He ran. She is running. Will they stop running?
I talked. She was talking. They talked to them about running. Who ran to the talking runner?

[Some text we do not want to keep is in here]

¡Sebastián, Nicolás, Alejandro and Jéronimo are going to the store tomorrow morning!

something... is! wrong() with.,; this :: sentence.

I can not do this anymore. I did not know them. Why could not you have dinner at the restaurant?

My favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.

do not do it.... Just do not. Billy! I know what you are doing. This is a great little house you have got here.

[This is some other unwanted text]

John: "Well, well, well."
James: "There, there. There, there."

  

There are a lot of reasons not to do this. There are 101 reasons not to do it. 1000000 reasons, actually.

I have to go get 2 tutus from 2 different

### Regular Expression

Many linguistic processing tasks involve pattern matching. For example, we can find words ending with ed using endswith('ed'). Regular expressions give us a more powerful and flexible method for describing the character patterns we are interested in.
Refer to Section 3.4, Regular Expressions for Detecting Word Patterns (https://www.nltk.org/book/ch03.html).
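
The comparison between endswith('ed') and a regular expression can be made concrete as follows. The word list is made up for this sketch.

```python
import re

words = ['walked', 'talking', 'jumped', 'red', 'editor']

# String method: only one fixed suffix can be tested at a time.
ends_with_ed = [w for w in words if w.endswith('ed')]

# Equivalent regular expression: 'ed' anchored at the end of the word with $.
ends_with_ed_re = [w for w in words if re.search(r'ed$', w)]

# A regex can express more in one pattern: words ending in 'ed' or 'ing'.
ed_or_ing = [w for w in words if re.search(r'(ed|ing)$', w)]

print(ends_with_ed)     # ['walked', 'jumped', 'red']
print(ends_with_ed_re)  # ['walked', 'jumped', 'red']
print(ed_or_ing)        # ['walked', 'talking', 'jumped', 'red']
```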

In [18]:
import re
x = 'groundhog Groundhog Woodchuck woodchuck'
y = re.findall('[gG]roundhog|[Ww]oodchuck',x)
print (y)

['groundhog', 'Groundhog', 'Woodchuck', 'woodchuck']


In [19]:
x = 'My 2 favorite numbers are 19 and 42'
y = re.findall('[0-9]+',x) # [ ] matches only one character, + => one or more
print (y)

['2', '19', '42']


In [20]:
x = '1234567890'
y = re.findall('(456)',x) # ( ) matches all characters as a group
print (y)

['456']


In [24]:
x = 'We just received $10.00 for cookies.'
y = re.findall('$[0-9.]+',x) # an unescaped $ anchors the end of the string/line, so nothing matches
print(y)
y = re.findall(r'\$[0-9.]+',x) # \$ escapes the special character $; the raw string keeps the backslash
print (y)

[]
['$10.00']


In [25]:
x = 'Please contact us at: support@abc.com, sales@ABC.com'
y = re.findall(r'[\w]+@[\w.]+',x) # raw string, so Python does not warn about the \w escape
print (y)

['support@abc.com', 'sales@ABC.com']


In [26]:
# \b is a word boundary in regex but a backspace character in a normal Python string
y = '\b[a-cA-C]+.com\b'
print(y)

# Python converts \b to a backspace before the regex engine sees the pattern,
# so the pattern looks for literal backspace characters => no match
y = re.findall('\b[a-cA-C]+.com\b',x)    
print(y)

# \\b is an escaped backslash, so the regex engine receives \b (word boundary)
y = re.findall('\\b[a-cA-C]+.com\\b',x) 
print (y)

# the r prefix makes the string raw: backslashes are kept as-is, so the regex engine receives \b
y = re.findall(r'\b[a-cA-C]+.com\b',x) 
print (y)


[a-cA-C]+.com
[]
['abc.com', 'ABC.com']
['abc.com', 'ABC.com']
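
One detail of the pattern above is worth a short sketch: the dot in .com is unescaped, so it matches any character, not only a literal period. The test string below is made up to show the difference.

```python
import re

text = 'abc.com abcxcom'

# Unescaped dot: "." matches any character, so "abcxcom" is accepted too.
loose = re.findall(r'\b[a-cA-C]+.com\b', text)

# Escaped dot: "\." matches only a literal period.
strict = re.findall(r'\b[a-cA-C]+\.com\b', text)

print(loose)   # ['abc.com', 'abcxcom']
print(strict)  # ['abc.com']
```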


In [27]:
x = 'From: Using the : character'
y = re.findall('^F.+:', x)
print (y)

['From: Using the :']


In [28]:
# +? is non-greedy: it matches as little as possible, so the match stops at the first :
y = re.findall('^F.+?:', x)  
print (y)

['From:']


### Part of speech tagging (POS)
Part-of-speech tagging assigns a part of speech (such as noun, verb, or adjective) to each word of a given text, based on its definition and its context.

In [29]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/robinloh/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/robinloh/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /Users/robinloh/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [30]:
### POS TAGGING using nltk ###
from nltk.tokenize import word_tokenize

text = word_tokenize("Parts of speech examples: an article, to write, interesting, easily, and, of")
print(nltk.pos_tag(text))

[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), (':', ':'), ('an', 'DT'), ('article', 'NN'), (',', ','), ('to', 'TO'), ('write', 'VB'), (',', ','), ('interesting', 'VBG'), (',', ','), ('easily', 'RB'), (',', ','), ('and', 'CC'), (',', ','), ('of', 'IN')]
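
The tags in this output come from the Penn Treebank tag set. The lookup below is a reading aid only: it covers just the word-level tags that appear in the output above, and the name TAG_DESCRIPTIONS is introduced here for illustration.

```python
# Descriptions for the Penn Treebank tags seen in the output above (not the full tag set).
TAG_DESCRIPTIONS = {
    'NNS': 'noun, plural',
    'NN': 'noun, singular or mass',
    'IN': 'preposition or subordinating conjunction',
    'DT': 'determiner',
    'TO': 'to',
    'VB': 'verb, base form',
    'VBG': 'verb, gerund or present participle',
    'RB': 'adverb',
    'CC': 'coordinating conjunction',
}

tagged = [('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'),
          ('interesting', 'VBG'), ('easily', 'RB')]

for word, tag in tagged:
    # Fall back to a placeholder for tags that this small lookup does not cover.
    print(f'{word:12} {tag:4} {TAG_DESCRIPTIONS.get(tag, "(not in this lookup)")}')
```

Read this way, the output shows that "interesting" was tagged as a verb form (VBG) rather than as an adjective (JJ), which is plausible for a word listed without sentence context.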


[TextBlob](https://textblob.readthedocs.io/en/dev/) is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

In [32]:
### POS TAGGING using TextBlob - punctuation is omitted from the result ### 
input_str="Parts of speech examples: an article, to write, interesting, easily, and, of"
from textblob import TextBlob
result = TextBlob(input_str)
print(result.tags)

[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]


### Chunking (shallow parsing)
Chunking is a natural language process that identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and links them to higher order units that have discrete grammatical meanings (**noun groups or phrases, verb groups**, etc.)

In [33]:
# The first step is to determine the part of speech for each word:
input_str="A black television and a white stove were bought for the new apartment of John."
result = TextBlob(input_str)
print(result.tags)

[('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]


In [35]:
# extract noun phrases using TextBlob
# need to install necessary data: $> python -m textblob.download_corpora
import nltk
nltk.download('brown')
result.noun_phrases

[nltk_data] Downloading package brown to /Users/robinloh/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


WordList(['black television', 'white stove', 'new apartment', 'john'])

In [36]:
# extract noun phrases using NLTK
# refer to "7. Extracting Information from Text" (https://www.nltk.org/book/ch07.html)
reg_exp = "NP: {<DT>?<JJ>*<NN>}"   # define the grammar (optional determiner, any adjectives, noun) before parsing a sentence
rp = nltk.RegexpParser(reg_exp)
result1 = rp.parse(result.tags) #parse words with POS
print(result1)

(S
  (NP A/DT black/JJ television/NN)
  and/CC
  (NP a/DT white/JJ stove/NN)
  were/VBD
  bought/VBN
  for/IN
  (NP the/DT new/JJ apartment/NN)
  of/IN
  John/NNP)
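
To make the rule <DT>?<JJ>*<NN> concrete, here is a hand-written sketch that applies the same pattern to a list of (word, tag) pairs. It is a simplified stand-in written for this example, not how nltk.RegexpParser works internally, and the function name chunk_noun_phrases is hypothetical.

```python
def chunk_noun_phrases(tagged):
    """Greedy left-to-right match of <DT>?<JJ>*<NN> over (word, tag) pairs."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == 'DT':                          # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == 'JJ':   # any number of adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == 'NN':      # mandatory noun
            chunks.append(' '.join(word for word, _ in tagged[i:j + 1]))
            i = j + 1
        else:
            i += 1                                        # no chunk starts here
    return chunks

tagged = [('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'),
          ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'),
          ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'),
          ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]

chunks = chunk_noun_phrases(tagged)
print(chunks)  # ['A black television', 'a white stove', 'the new apartment']
```

As in the parse tree above, John/NNP stays outside any chunk because the rule asks for the tag NN exactly.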


In [None]:
# Define grammar before parsing
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """
rp = nltk.RegexpParser(grammar)
result2 = rp.parse(result.tags)
print(result2)

In [None]:
result2.draw() # result on pop-up window

### n-grams
The TextBlob.ngrams() method returns a list of tuples of n successive words.

In [None]:
# use TextBlob
!pip install TextBlob
from textblob import TextBlob
blob = TextBlob("Now is better than never.")
blob.ngrams(n=2)

In [None]:
blob.ngrams(n=3)

In [None]:
# use nltk
from nltk.util import ngrams, bigrams
input_str = "NLTK is a leading platform for building Python programs to work with human language data."
tokens = word_tokenize(input_str)
list(ngrams(tokens, 2))

In [None]:
list(bigrams(tokens))

In [None]:
list(ngrams(tokens, 3))
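
The sliding window behind these helpers can be sketched in plain Python. The function simple_ngrams is a hypothetical name for this illustration; it is not part of NLTK or TextBlob.

```python
def simple_ngrams(tokens, n):
    """Return every run of n successive tokens as a tuple."""
    # zip over n copies of the token list, each shifted by one more position.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = 'Now is better than never'.split()
print(simple_ngrams(tokens, 2))
# [('Now', 'is'), ('is', 'better'), ('better', 'than'), ('than', 'never')]
print(simple_ngrams(tokens, 3))
# [('Now', 'is', 'better'), ('is', 'better', 'than'), ('better', 'than', 'never')]
```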

### Find most common ngrams

In [None]:
from collections import Counter
from nltk import ngrams
with open('big.txt') as f:  # the with-block closes the file after reading
    bigtxt = f.read()
ngram_counts = Counter(ngrams(bigtxt.split(), 3)) # split the text into tokens on whitespace
ngram_counts.most_common(10)

In [None]:
with open('mbox.txt') as f:  # the with-block closes the file after reading
    bigtxt = f.read()
ngram_counts = Counter(ngrams(bigtxt.split(), 2))
ngram_counts.most_common(10)
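
The same counting pattern can be tried without any input file. The sentence below is made up for this sketch, and zip stands in for nltk's ngrams helper.

```python
from collections import Counter

text = 'the cat sat on the mat and the cat slept'
tokens = text.split()

# Pair each token with its successor to form bigrams, then count them.
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]
```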

### BOW vector generation

### Count Vectors as features
Count Vector (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is a matrix notation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell represents the frequency count of a particular term in a particular document.
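
The matrix described above can be built by hand for a tiny corpus. This is a minimal sketch in plain Python, assuming whitespace tokenization of already lower-cased text; the two documents are made up for the example.

```python
from collections import Counter

corpus = ['duke played unc', 'duke lost the game to duke']

# Vocabulary: every distinct term, sorted, mapped to a column index.
terms = sorted({term for doc in corpus for term in doc.split()})
vocabulary = {term: idx for idx, term in enumerate(terms)}

# One row per document, one column per term, each cell a frequency count.
matrix = []
for doc in corpus:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in terms])

print(vocabulary)
for doc, row in zip(corpus, matrix):
    print(doc, '=', row)
```

CountVectorizer differs in its defaults: it lower-cases the text itself and ignores single-character tokens such as "I" and "a".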

In [None]:
### Create feature vectors using CountVectorizer 
### binary parameter used to indicate word presence

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]
vectorizer = CountVectorizer(binary=True) # record presence (1) or absence (0) instead of counts
# fit_transform returns a CSR (Compressed Sparse Row) matrix; todense() converts it to a dense matrix.
# https://machinelearningmastery.com/sparse-matrices-for-machine-learning/ 
X = vectorizer.fit_transform(corpus).todense() 

print(vectorizer.vocabulary_) # Print the vocabulary: a dict mapping each term to its column index
for i, document in enumerate(corpus):  # Show the dense feature vector of each document
    print(document, '=', X[i])

In [None]:
### CountVectorizer - Creating feature vectors with frequencies of words

corpus = ['The dog ate a sandwich, the wizard transfigured a sandwich, \
and I ate a sandwich']
vectorizer = CountVectorizer(stop_words='english')

### Print the fitted vectorizer with its parameters
print(vectorizer.fit(corpus), '\n')

### Print the vocabulary (term -> column index)
print(vectorizer.vocabulary_, '\n')

### Show the sparse representation: (row, column) positions with their word frequencies
print(vectorizer.transform(corpus), '\n')

### Show the dense version of the feature vectors
print(vectorizer.transform(corpus).todense())

### TF-IDF Vectors as features

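TF-IDF vectors (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) weight each term's frequency in a document (TF) by its inverse document frequency (IDF), so that terms occurring in many documents contribute less than terms concentrated in a few.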

In [None]:
### Creating TF and TF-IDF feature vectors using TfidfTransformer. 
### Note the change in IDF feature values with different number of documents in corpus

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]
vectorizer = CountVectorizer(stop_words='english')
# norm=None disables normalization.
transformer = TfidfTransformer(use_idf=False, norm=None) # TF values only (no IDF weighting)
transformerIDF = TfidfTransformer(use_idf=True, norm=None) # TF-IDF values
X = vectorizer.fit_transform(corpus)
print('Vocabulary:\n', vectorizer.vocabulary_)
print('Count vectors:\n', X.todense())
print('TF vectors:\n', transformer.fit_transform(X).todense())  # same as the counts, since no normalization is applied
print('TF-IDF vectors:\n', transformerIDF.fit_transform(X).todense())

In [None]:
### TfidfVectorizer - combine CountVectorizer and TfidfTransformer

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]
vectorizer = TfidfVectorizer(stop_words='english', norm=None)
print(vectorizer.fit_transform(corpus).todense(), "\n")

In [None]:
idf = vectorizer.idf_
print (dict(zip(vectorizer.get_feature_names_out(), idf)))  # replaces get_feature_names(), which recent scikit-learn releases no longer provide

# formula : E.g., "ate" in the first column of the first document
# df = 1 -- "ate" appeared only in one document
# N = 2 -- total number of documents
# idf = ln((N + 1)/(df + 1)) + 1 = ln(3/2) + 1 = 1.4054651081081644

# tf = 2
# tfidf = tf * idf = 2 * 1.4054651081081644 = 2.81093022
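
The arithmetic in the comments above can be checked with the standard library alone. This sketch assumes the smoothed formula quoted there and uses the two documents after stop-word removal; the helper names smoothed_idf and tfidf are hypothetical.

```python
import math

# The two documents after removing English stop words.
docs = [['dog', 'ate', 'sandwich', 'ate', 'sandwich'],
        ['wizard', 'transfigured', 'sandwich']]
N = len(docs)

def smoothed_idf(term):
    """idf = ln((N + 1) / (df + 1)) + 1, as in the comments above."""
    df = sum(term in doc for doc in docs)   # number of documents containing the term
    return math.log((N + 1) / (df + 1)) + 1

def tfidf(term, doc):
    """Raw term count times smoothed idf (no normalization)."""
    return doc.count(term) * smoothed_idf(term)

print(smoothed_idf('ate'), tfidf('ate', docs[0]))            # about 1.4055 and 2.8109
print(smoothed_idf('sandwich'), tfidf('sandwich', docs[0]))  # 1.0 and 2.0
```

"sandwich" occurs in both documents, so its idf drops to 1 and its weight equals its raw count.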

In [None]:
### Norm='l2' used to normalize term vectors.
### Creating TF and TF-IDF feature vectors using TfidfTransformer. 
### Note the change in IDF feature values with different number of documents in corpus

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]
vectorizer = CountVectorizer(stop_words='english')
transformer = TfidfTransformer(use_idf=False)     # default norm='l2' normalizes each document vector
transformerIDF = TfidfTransformer(use_idf=True)
X = vectorizer.fit_transform(corpus)
print('Vocabulary:\n', vectorizer.vocabulary_)
print('Count vectors:\n', X.todense())
print('TF vectors:\n', transformer.fit_transform(X).todense())  # counts divided by the L2 (Euclidean) norm of each row
print('TF-IDF vectors:\n', transformerIDF.fit_transform(X).todense())

In [None]:
### Norm='l2' used to normalize term vectors.
### TfidfVectorizer - combine CountVectorizer and TfidfTransformer

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]
vectorizer = TfidfVectorizer(stop_words='english')      # default norm='l2' normalizes each document vector
print(vectorizer.fit_transform(corpus).todense(), "\n")
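
What norm='l2' does to a row can be sketched by hand: each weight is divided by the Euclidean length of the row, so every document vector ends up with length 1. The row below holds the unnormalized TF-IDF weights of the first document, computed with the smoothed idf formula from the earlier cell; the helper l2_normalize is a hypothetical name.

```python
import math

def l2_normalize(vector):
    """Divide each component by the vector's Euclidean (L2) norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)

# Columns: ate, dog, sandwich, transfigured, wizard (first document).
idf_rare = math.log(3 / 2) + 1      # idf of a term found in one of the two documents
row = [2 * idf_rare, 1 * idf_rare, 2 * 1.0, 0.0, 0.0]

unit = l2_normalize(row)
print(unit)
print(sum(v * v for v in unit))     # squared length of the normalized row, 1 up to rounding
```

After normalization the weights no longer depend on document length, which makes rows of long and short documents comparable.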