# NLP Text Processing methods




### NLTK

We can use some freely available data sets for educational purposes. One such source is the collection of corpora that accompanies the NLTK book. To download it, run the following:

In [9]:
import nltk
nltk.download('book')

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/materna/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/materna/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/materna/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/materna/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/materna/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/materna/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nltk_data]    |

True

To check which books are available, we import them and print the list as follows:

In [10]:
from nltk.book import *
texts()

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


The NLTK `Text` object has some useful methods, documented in the [NLTK Text Docs](http://www.nltk.org/api/nltk.html#nltk.text.Text). Some of them are used below.

#### Word and sentences similarity

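A word can have different meanings depending on its context, which makes comparison and analysis a bit more complex. `concordance` shows every occurrence of a word together with its surrounding context and ignores case, while `count` returns the number of exact, case-sensitive matches. That is why the two numbers below differ.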

In [11]:
text6.concordance("King")

text6.count("King")

Displaying 25 of 38 matches:
CENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there ! [ clop clop cl
ragon , from the castle of Camelot . King of the Britons , defeator of the Sax
 CART - MASTER : I dunno . Must be a king . CUSTOMER : Why ? CART - MASTER : H
 all over him . SCENE 3 : [ thud ] [ King Arthur music ] [ thud thud thud ] [ 
 Arthur music ] [ thud thud thud ] [ King Arthur music stops ] ARTHUR : Old wo
e an inferior ! ARTHUR : Well , I am king ! DENNIS : Oh king , eh , very nice 
HUR : Well , I am king ! DENNIS : Oh king , eh , very nice . And how d ' you g
o you do , good lady . I am Arthur , King of the Britons . Who ' s castle is t
s . Who ' s castle is that ? WOMAN : King of the who ? ARTHUR : The Britons . 
. We are all Britons , and I am your king . WOMAN : I didn ' t know we had a k
g . WOMAN : I didn ' t know we had a king . I thought we were an autonomous co
ink he is ? Heh . ARTHUR : I am your king ! WOMAN : Well , I didn ' t vote for
 . WOMAN : Well , how d

27

We can also build a corpus from plain text. See the example below.

In [12]:
from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize.regexp import WordPunctTokenizer
from nltk.data import LazyLoader

trump = PlaintextCorpusReader('./datasets/','trump.txt',word_tokenizer=WordPunctTokenizer(),
             sent_tokenizer=LazyLoader(
                 'tokenizers/punkt/english.pickle'),
             encoding='utf8')
print(trump)
trump.words()

OSError: No such file or directory: '/Users/materna/Desktop/ML/workshops/nlp/datasets'

Corpora contain large amounts of text, say multiple books or a couple of hundred articles from various magazines. NLTK includes a small selection of books from the more than 25,000 available in the whole Project Gutenberg archive. We are going to pick the Bible and Shakespeare's Macbeth in order to compare a few statistics between them. Let's take a look at the Gutenberg corpus and check the word distribution:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid') # general display style of seaborn
# how to display plot in notebook, jupyter's magic command
%matplotlib inline 

# Download necessary texts
nltk.download('gutenberg')
nltk.download('stopwords')

print("Books available in gutenberg dataset:")
nltk.corpus.gutenberg.fileids()

In [None]:
nltk.corpus.stopwords.words("english")

In [None]:
from IPython.display import display
%matplotlib inline

def lexical_richness(fileid: str) -> float:
  # Unique words in the text divided by the total number of words
  words = nltk.Text(nltk.corpus.gutenberg.words(fileid))
  return len(set(words))/len(words)

macbeth = "shakespeare-macbeth.txt"
bible = "bible-kjv.txt"

print("Macbeth richness: {}".format(lexical_richness(macbeth)))
print("Bible richness: {}".format(lexical_richness(bible)))

# Build the stopword set once; rebuilding the list for every word is very slow
english_stopwords = set(nltk.corpus.stopwords.words("english"))

# Remove punctuation, numbers and stopwords such as: and, or, the
# (the stopword list is lowercase, so we compare lowercased words)
bible_words = [
    word
    for word in nltk.corpus.gutenberg.words(bible)
    if word.isalpha() and word.lower() not in english_stopwords
]

words_distribution = nltk.probability.FreqDist(bible_words)
print("Words occurring most often in the Bible:")
display(words_distribution.plot(15))

### Regular expressions

The first solution that comes to mind for text processing is regular expressions. They are usually a good first choice in software development, and many of the tools mentioned before use them as well. Let's take an example as a short recap of how to use regular expressions in Python.

In [None]:
example = "JU is a great company with many great developers."

import re

pattern = "\\s+"
words = re.split(pattern, example)
print(words)

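Python strings also have built-in methods that can do the same with less code. Note that `split(' ')` splits on every single space, while the `\s+` pattern above treats any run of whitespace as one separator.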

In [None]:
words = example.split(' ')
print(words)

text = "Well, what can I say about regexes? \
They are quite annoying to be honest, but once you get a grasp you're gonna be fine. \
Yeah, I could use some basic boring text, here you go if that's what you miss: \
Ala ma kota, kot ma Alę. m?ke a m1sta.e and see what happens... \
This sentence: 'Mr. & Mrs. Smith, been looking for easter egg here, failed."
text.split(' ')

That was too naive, let's make something more intelligent:

In [None]:
print("Using regex:")
display(re.split('(?<=[.!?]) +',text))

Our regex splits the text only where one of `.`, `!` or `?` is followed by one or more spaces. This approach is quite powerful, yet limited (as can be seen above). Keep in mind that it has its uses but is far from perfect (later we will see more sophisticated approaches).
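To make the limitation concrete, here is a minimal sketch that applies the same pattern to a short, made-up sentence containing abbreviations. The sample string and the variable names are assumptions chosen only for illustration.

```python
import re

# Hypothetical sample text, chosen because it contains abbreviations
sample_text = "Mr. Smith met Mrs. Jones. They talked! Did it help? Maybe."

# Same pattern as above: split after . ! or ? when followed by spaces
sentences = re.split(r"(?<=[.!?]) +", sample_text)

for sentence in sentences:
    print(repr(sentence))
```

The pattern cannot tell an abbreviation from the end of a sentence, so `Mr.` and `Mrs.` are cut off as if they were sentences of their own.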

All in all, when faced with a choice between doing something with traditional methods and doing it with machine learning, ALWAYS pick the first one.

In [None]:
# Examples of some other regex functions

# substitute a string; the dots are escaped so they match literal periods
text = re.sub(r"Mr\. & Mrs\. Smith", "Now works like a charm", text)
found_text = re.findall(", been .*,", text) # find all occurrences of the pattern
display(found_text)

# findall returns a list, so we need to access its element;
# re.escape makes sure the found text is matched literally, not as a pattern
re.sub(re.escape(found_text[0]), " but life", text)

### Tokenization

NLTK is a more advanced tool than plain regular expressions. __Tokenization__ is the process of splitting a text into separate words (tokens). In NLTK it can be done with `word_tokenize`:

In [None]:
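# tokenizer models required by word_tokenize
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

tokens = nltk.word_tokenize(example)
print("Tokens: " + str(tokens))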

After this little bit of fun, we can use NLTK to split the text into sentences in a 'smarter' fashion:

In [None]:
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

sent_tokenize(text)

__Tokenizers__ are used to divide a string into its logical subsets. Above, we have used the sentence tokenizer. __Tokenization__, on the other hand, is a common name for dividing a text into separate words and can be done in NLTK using __word_tokenize__:

In [None]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer # Yes, this tokenizer was based on Tweets

text = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
print("Normal tokenizer:")
print(word_tokenize(text))

print("This tokenizer is better suited for emoji handling")
twitter_tokenizer = TweetTokenizer()
twitter_tokenizer.tokenize(text)

As you can see, there are other tokenizers available and each is focused on a different 'kind' of tokenization.

They are usually implemented as a set of complicated rules tailored to a certain language (e.g. the rules will be different for Polish and English) and __ARE NOT__ a subset of machine learning (in this part of the course at least).

Here are some other NLTK tokenizers you could use:

- __PunktTokenizer__: used above for sentence tokenization. This one actually uses an unsupervised algorithm to infer sentence boundaries.
- __NISTTokenizer__: follows the NIST tokenization conventions used in machine translation evaluation and offers an international mode for non-ASCII text.
- __SExprTokenizer__: finds parenthesized expressions in strings, see below:

In [None]:
from nltk.tokenize import SExprTokenizer

SExprTokenizer().tokenize('(a b (c d)) e f (g)')

### Tagging

What is great about NLTK is that we can tag each word. A tag tells us the role a word plays in the sentence. We can think of it as the type of the word (its part of speech).

In [None]:
nltk.download('averaged_perceptron_tagger') # model used by pos_tag

tags = nltk.pos_tag(tokens)
print("Tagged: " + str(tags))

Some tags are shown in the table below.

| tag  | description | examples |
|------|-------------|----------|
| DT   | determiner | all an another any both del each either every half la many much nary neither no some such that the them these this those |
| IN   | preposition or conjunction, subordinating | astride among uppon whether out inside pro despite on by throughout below within for towards near behind atop around if like until below next into if beside ... |
| JJ   | adjective or numeral, ordinal | third ill-mannered pre-war regrettable oiled calamitous first separable ectoplasmic battery-powered participatory fourth still-to-be-named multilingual multi-disciplinary ... |
| NN   | noun, common, singular or mass | common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ... |
| NNP  | noun, proper, singular | Motown Venneboerger Czestochowa Ranzer Conchita Trumplane Christos Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA Shannon A.K.C. Meltex Liverpool ... |
| NNS  | noun, common, plural | undergraduates scotches bric-a-brac products bodyguards facets coasts divestitures storehouses designs clubs fragrances averages subjectivists apprehensions muses factory-jobs ... |
| VBZ  | verb, present tense, 3rd person singular | bases reconstructs marks mixes displeases seals carps weaves snatches slumps stretches authorizes smolders pictures emerges stockpiles seduces fizzes uses bolsters slaps speaks pleads ... |

To get the full list of tags, use the following code:

In [None]:
nltk.download('tagsets')
nltk.help.upenn_tagset()

We can access the tokens of NLTK's corpora too:

In [None]:
text6.tokens

Another feature of NLTK is the list of stopwords for each language that NLTK supports:

In [None]:
english_stopwords = set(nltk.corpus.stopwords.words('english'))
print(english_stopwords)

__POS-tagging__ is another method of describing our text. For each tokenized word we can obtain its __Part Of Speech__. This information is frequently used by more advanced machine learning methods.

Up until now everything was easy, fortunately this one is no exception.

In [None]:
# we have already imported nltk.tokenize.word_tokenize

nltk.download('averaged_perceptron_tagger')

# adjacent string literals are concatenated, so each one ends with a space
words = word_tokenize(
    "Another text. You know, this is the hardest part, coming up with all those texts. "
    "Lorem Ipsum or some random stuff is pretty boring IMO. "
    "You may feel like somebody didn't care (or at least I do). "
    "Or maybe it's just an act? I will not address this question."
)

nltk.pos_tag(words)

Look at `act` and `address` towards the end of the output.

POS can be used to disambiguate word meaning: `act` can be both a noun (NN) and a verb (VB), and the same goes for `address`.

This tagger is __based on a simple one-layer neural network__ (an averaged perceptron), hence it takes context into account (the words before and after `act`).

Such context is usually described with __N-grams__, contiguous sequences of N words, and we will make more use of them later in this course.
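As a quick illustration of the idea, here is a minimal sketch that builds N-grams with plain Python. The helper name `make_ngrams` is hypothetical and exists only for this example; NLTK ships a similar utility, `nltk.util.ngrams`.

```python
def make_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Simplified tokenization on whitespace, enough for the illustration
sentence_tokens = "I will not address this question".split()

bigrams = make_ngrams(sentence_tokens, 2)
trigrams = make_ngrams(sentence_tokens, 3)

print(bigrams)
print(trigrams)
```

With N = 2 the word `address` appears together with its left neighbour in `('not', 'address')` and with its right neighbour in `('address', 'this')`, which is exactly the kind of context a tagger can exploit.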

<ADD N-GRAM PHOTO>

This is a more sophisticated approach and __should be__ used when more abstract reasoning is needed (like word disambiguation).

With __NLTK__ we can create and train our own taggers, or choose a pre-made one and train it as seen below (this one uses the tag of the previous word as its context).

In [None]:
import nltk
from nltk.corpus import brown 
nltk.download("brown")
 
brown_tagged_sents = brown.tagged_sents(categories="news")
brown_sents = brown.sents(categories="news") 
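# BigramTagger looks at the current word and the tag of the previous word;
# without a backoff tagger, words in an unseen context are tagged None
bigram_tagger = nltk.BigramTagger(brown_tagged_sents)
bigram_tagger.tag(brown_sents[2007])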

### Unicode

Each alphabet is different and some languages use letters that are not used in any other language. Unicode allows us to represent all of them in a standardized way. Let's see an example of some German and Portuguese text below.

In [None]:
sample = """Minderjährige Schülerinnen. O português é uma língua romântica"""

type(sample)

If we convert it to ASCII with `ascii()`, every non-ASCII character is replaced by an escape sequence:

In [None]:
ascii(sample)

This is not what we expected, and such text is hard to work with. We can go a step further and encode the string as UTF-8 bytes:

In [None]:
sample.encode('utf8')

We could also drop the non-ASCII characters that we cannot do anything with in further processing:

In [None]:
sample.encode('ascii', 'ignore')

Now all the remaining letters are plain ASCII, but the text does not make sense anymore because the accented letters are gone. To solve this, we need to normalize the text first and only then remove the non-ASCII characters.

In [None]:
import unicodedata

unicodedata.normalize('NFKD', sample).encode('ascii','ignore')

More about [Unicode normalization forms](http://unicode.org/reports/tr15/)
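To see why normalizing first helps, here is a minimal sketch on a single word from the sample. It only uses the standard `unicodedata` module; the variable names are chosen for this illustration.

```python
import unicodedata

word = "l\u00edngua"  # "língua", written with an escape to make the code point explicit
decomposed = unicodedata.normalize("NFKD", word)

# After NFKD the accented letter is a base letter followed by a combining mark
for char in decomposed:
    print(repr(char), unicodedata.name(char), unicodedata.combining(char))

# Dropping non-ASCII characters now removes only the combining mark
ascii_bytes = decomposed.encode("ascii", "ignore")
print(ascii_bytes)
```

The decomposed string is one character longer than the original, because `í` has been split into `i` and a combining acute accent. Only the accent is lost in the ASCII step, so the base letter survives.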

### Lemmatization

Lemmatization is the process of reducing the many inflected forms of a word to a single base form, the lemma.

Let's see an example:

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet') # lexical database used by the lemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

print(wordnet_lemmatizer.lemmatize('do',pos='v'))
print(wordnet_lemmatizer.lemmatize('does',pos='v'))
print(wordnet_lemmatizer.lemmatize('doing',pos='v'))

The `pos` argument sets the part of speech and can be one of the following:

ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'

In [None]:
wordnet_lemmatizer.lemmatize('are',pos='v')

### Stemming

Stemming is similar to lemmatization, but the main difference is that it just strips affixes to reduce the word to its root. In many cases it gives different results than lemmatization, as the latter is based on a dictionary and the part of speech.

In [None]:
from nltk import PorterStemmer, LancasterStemmer, word_tokenize
from nltk.stem.snowball import SnowballStemmer

sample = "This is a new training about machine learning usage for chatbots"

tokens = word_tokenize(sample)

porter = PorterStemmer()
p_stem = [porter.stem(t) for t in tokens]
print(p_stem)

lancaster = LancasterStemmer()
l_stem = [lancaster.stem(t) for t in tokens]
print(l_stem)

snowball = SnowballStemmer('english')
s_stem = [snowball.stem(t) for t in tokens]
print(s_stem)

As usual, there are other stemmers and lemmatizers available in NLTK. Some of them may be faster (PorterStemmer), but Snowball is considered the de facto standard.

A few traits of each should be noted:
- Stemming is faster (it doesn't have to consult a dictionary and complicated morphological rules).
- Stemming may not produce a dictionary word, see `machin` (from `machine`) above.
- Lemmatization produces a smaller set of words. It is useful as we do not have to keep track of all the different forms of a word (running, runner, runs etc.), which makes it more memory efficient.
- Lemmatization is not as crude: when it does not find the lemma, it returns the original form. It __may__ produce more words this way, depending on the kind of text we are processing.
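The point about non-dictionary stems can be checked with a short sketch. It uses `PorterStemmer`, which needs no downloaded data; the word list is an assumption picked for illustration.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# A few inflected forms; the stem is not always a dictionary word
stems = {word: porter.stem(word) for word in ["studies", "running", "runs"]}

for word, stem in stems.items():
    print(word, "->", stem)
```

Here `running` and `runs` collapse to the same stem `run`, while `studies` becomes `studi`, which is not an English word.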

### Sentence extraction

In this section we show some features of spaCy. One of the most commonly used is sentence extraction. An example of a Trump speech divided into sentences can be found below.

In [None]:
import spacy

# read the whole speech; the context manager closes the file for us
with open("../datasets/trump.txt", "r", encoding='utf-8') as file:
    trump = file.read()

nlp = spacy.load("en_core_web_sm")
doc = nlp(trump)

for span in doc.sents:
    print("> ", span)

Each sentence can then be split into tokens, and for each token we can get its fine-grained tag (`tag_`) and its coarse part of speech (`pos_`).

In [None]:
for span in doc.sents:
    for i in range(span.start, span.end):
        token = doc[i]
        print(i, token.text, token.tag_, token.pos_)

### Noun chunks

Chunking is the process of extracting only specific kinds of phrases. The most popular is noun chunking, which allows us to get a general idea of what the text is about.

In [None]:
# named `chunk` rather than `np` to avoid clashing with the usual numpy alias
for chunk in doc.noun_chunks:
    print(chunk)

### Named entity recognition

The last feature that we show here is named entity recognition (NER). It returns the entities found in the text together with a label for each of them.

In [None]:
for entity in doc.ents:
    print(entity.label_, entity.text)

### Bag of Words

Many machine learning methods cannot use strings as features, so we have to encode the text using numbers.

We can easily do this using the __Bag Of Words (BOW)__ technique and the marvelous __sklearn__ library:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

corpus = [
     'Bag Of Words is based on counting',
     'words occurences throughout multiple documents.',
     'This is the third document.',
     'As you can see most of the words occur only once.',
     'This gives us a pretty sparse matrix, see below. Really, see below',
]

X = vectorizer.fit_transform(corpus)
print(X.toarray())

Each document is represented by a row. Values ranging from 0 to N represent whether and how many times the word occurred in the document. You can see which word corresponds to which column by calling `get_feature_names_out()` on the vectorizer object (older scikit-learn versions used `get_feature_names()`, which has since been removed).

In [None]:
vectorizer.get_feature_names_out()

This approach allows us to easily describe the whole corpus, but it lacks information crucial for solving some tasks.
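To show what happens behind such a matrix, here is a minimal sketch of the Bag of Words idea using only the standard library. It assumes a simplified tokenization (lowercasing and splitting on whitespace), which is not identical to what `CountVectorizer` does, and the tiny corpus is chosen only for illustration.

```python
from collections import Counter

docs = ["Tom has cat", "Tom has fish", "Tom is polish"]

# Simplified tokenization: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary assigns one matrix column to each distinct word
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one count per vocabulary word
bow = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow:
    print(row)
```

Word order is lost in this representation: each row only records how often every vocabulary word occurs in the document.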

### TF-IDF

Look at the word `is`. It is used in many documents, yet it does not tell us anything about them. Let's think about sentiment analysis: if words like great or awesome occur frequently in one document in comparison with other documents, it may suggest a positive attitude.

TF-IDF is one way to encode this information and I'll walk you through it step by step.

The first part of TF-IDF is, yes, you guessed it, TF, which means Term Frequency. It can be calculated as:

\begin{equation}
tf_{ij}=\frac{n_{ij}}{\sum_{k} n_{kj}},
\end{equation}
where $n_{ij}$ is the number of occurrences of word $i$ in document $j$ and the sum in the denominator runs over all words $k$ in document $j$.

In [None]:
import numpy as np

corpus = [
'Tom has cat',
'Tom has fish',
'Tom is polish',
]

def tf(corpus):
    vec = CountVectorizer()
    bow_representation = vec.fit_transform(corpus)
    # total number of words in each document, kept as a column for broadcasting
    words_per_document = np.asarray(bow_representation.sum(axis=1)).reshape(-1, 1)
    return bow_representation.toarray() / words_per_document


For each document we count how many times each word occurred (the BoW implementation) and divide by the count of all words in this document.

Next part is IDF, which stands for Inverse Document Frequency:

\begin{equation}
idf_{i}=\log\left(\frac{N}{df_{i}}\right),
\end{equation}
where $N$ is the total number of documents and $df_{i}$ is the number of documents containing word $i$.

In [None]:
def idf(corpus):
    document_count = len(corpus)
    bow_representation = CountVectorizer().fit_transform(corpus)
    return np.log(document_count / np.count_nonzero(bow_representation.toarray(), axis=0))

First we calculate the number of documents in the corpus (the number of rows in our case). Next, for each word, we count the documents containing said word at least once.

Taking the logarithm allows us to dampen the effect of idf. For example, the difference between a term occurring in 40 out of 50 documents and one occurring in 45 out of 50 documents will be smaller than the difference between 1/50 and 5/50. This puts a bigger emphasis on rarely occurring terms as they are more informative.
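The numbers from this example can be checked with a few lines of plain Python. This is only a sketch of the arithmetic; the helper name `idf_value` is hypothetical and is kept different from the `idf` function defined above.

```python
import math

N = 50  # total number of documents in the example

def idf_value(df):
    """Inverse document frequency of a term found in df out of N documents."""
    return math.log(N / df)

# Gap between two frequent terms (40 vs 45 documents)
common_gap = idf_value(40) - idf_value(45)

# Gap between two rare terms (1 vs 5 documents)
rare_gap = idf_value(1) - idf_value(5)

print("frequent terms:", round(common_gap, 3))
print("rare terms:", round(rare_gap, 3))
```

The gap between the two rare terms comes out more than ten times larger than the gap between the two frequent ones.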

Finally, for the whole thing to work, we simply multiply both:

In [None]:
def tf_idf(corpus):
    return tf(corpus) * idf(corpus)
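For intuition, the same product can be worked out by hand on the three short "Tom" documents. The sketch below uses only the standard library and assumes simple lowercase, whitespace tokenization; the names `document_frequency` and `tf_idf_row` are hypothetical helpers for this example.

```python
import math
from collections import Counter

docs = ["Tom has cat", "Tom has fish", "Tom is polish"]
tokenized = [doc.lower().split() for doc in docs]

# df: in how many documents does each word appear at least once
document_frequency = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf_row(tokens):
    """tf * idf for every word of one document, following the formulas above."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(len(tokenized) / document_frequency[word])
        for word, count in counts.items()
    }

scores = [tf_idf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, scores):
    print(doc, row)
```

The word `tom` occurs in every document, so its idf is log(1) = 0 and its score vanishes, while `cat`, which occurs in a single document, gets the highest score in the first row.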

Let's calculate it:

In [None]:

corpus = [
     'Bag Of Words is based on counting',
     'words occurences throughout multiple documents.',
     'This is the third document.',
     'As you can see most of the words occur only once.',
     'This gives us a pretty sparse matrix, see below. Really, see below',
]

tfidf_result = tf_idf(corpus)

print(tfidf_result.shape)

In Jupyter it's easier to display results with pandas:

In [None]:
import pandas as pd
pd.DataFrame(tfidf_result).head()

There are many versions of tf-idf: some use different smoothing, some apply an additional logarithm to the tf part, and so on. Each transforms the corpus a little differently, and the appropriate one should be chosen based on the effect we would like to obtain.
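As one example of such a variant, the sketch below compares the plain idf from this section with a smoothed version that adds one to the numerator and the denominator and then adds 1 to the result. This is the formula scikit-learn's tf-idf classes use by default; both helper names are hypothetical.

```python
import math

def plain_idf(n_docs, df):
    """idf as defined in this section."""
    return math.log(n_docs / df)

def smoothed_idf(n_docs, df):
    """Smoothed variant: behaves as if one extra document contained every term."""
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 3
for df in (1, 2, 3):
    print(df, round(plain_idf(n_docs, df), 3), round(smoothed_idf(n_docs, df), 3))
```

With the plain formula a word present in every document gets a weight of exactly 0, whereas the smoothed variant still gives it a weight of 1, so such words are down-weighted but not erased.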

### Summary
This constitutes the first NLP-related part of our course, in which we have gathered some useful and interesting information.
