# iLykei Lecture Series
# Text Analytics (Northwestern University, MLDS 414)
# Notebook: Cleaning Data


## Yuri Balasanov, &copy; iLykei 2021-2023

##### Main text: Deep Learning for Natural Language Processing, © 2019 Jason Brownlee.   
See also the [blog by Jason Brownlee](https://machinelearningmastery.com/about/)

# Metamorphosis by Franz Kafka

For this example, use the text of Metamorphosis by Franz Kafka downloaded from [Project Gutenberg](http://www.gutenberg.org/cache/epub/5200/pg5200.txt). The file is in ASCII format and contains header and footer sections with meta information.

Read the complete text (mode `rt`). Print out the header: everything until the first words "One morning". Print out the footer: everything after "her young body."

In [1]:
# load complete text
filename = 'metamorphosis.txt'
# the with block closes the file automatically
with open(filename, 'rt') as file:
    text = file.read()
print("HEADER:\n\n",text[:831],"\nEND OF HEADER\n")
print("\nFOOTER: \n",text[119994:122000],"\n")
print("------------------\n",text[137000:],"\n END OF FOOTER")

HEADER:

 The Project Gutenberg EBook of Metamorphosis, by Franz Kafka
Translated by David Wyllie.

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net

** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **
**     Please follow the copyright guidelines in this file.     **


Title: Metamorphosis

Author: Franz Kafka

Translator: David Wyllie

Release Date: August 16, 2005 [EBook #5200]
First posted: May 13, 2002
Last updated: May 20, 2012

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK METAMORPHOSIS ***




Copyright (C) 2002 David Wyllie.





  Metamorphosis
  Franz Kafka

Translated by David Wyllie



I


 
END OF HEADER


FOOTER: 
 






End of the Project Gutenberg EBook of Metamorphosis, by Franz Kafka
Translated by David Wyllie.

*** END OF THIS PROJECT GUT

Remove the header and the footer from the text file and save it as "metamorphosis_clean.txt". Load the clean file.
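
The removal can also be done programmatically instead of by hand. The following is a minimal sketch, assuming the story starts at the first occurrence of "One morning" and ends at the last occurrence of "her young body."; the helper name `strip_gutenberg` and the short `sample` string are hypothetical and only illustrate the idea.

```python
def strip_gutenberg(raw, first_words="One morning", last_words="her young body."):
    """Return the part of raw from first_words through last_words, inclusive."""
    start = raw.find(first_words)   # first occurrence of the opening words
    end = raw.rfind(last_words)     # last occurrence of the closing words
    if start == -1 or end == -1:
        raise ValueError("marker text not found")
    return raw[start:end + len(last_words)]

# tiny stand-in for the contents of the downloaded file
sample = ("The Project Gutenberg EBook of Metamorphosis\n\n"
          "One morning, he woke. She stretched out her young body.\n\n"
          "End of the Project Gutenberg EBook")
clean = strip_gutenberg(sample)
print(clean)
```

For the real file, the same function would be applied to the text read from `metamorphosis.txt` and the result written to `metamorphosis_clean.txt`. Searching for marker text avoids hard-coded character offsets, which change whenever the file is updated.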

In [2]:
# load text
filename = 'metamorphosis_clean.txt'
# the with block closes the file automatically
with open(filename, 'rt') as file:
    text = file.read()
text[:500]

'One morning, when Gregor Samsa woke from troubled dreams, he found\nhimself transformed in his bed into a horrible vermin.  He lay on\nhis armour-like back, and if he lifted his head a little he could\nsee his brown belly, slightly domed and divided by arches into stiff\nsections.  The bedding was hardly able to cover it and seemed ready\nto slide off any moment.  His many legs, pitifully thin compared\nwith the size of the rest of him, waved about helplessly as he\nlooked.\n\n"What\'s happened to me?" he'

# Preparing text manually

Before any further cleaning and preparation, it is a good time to explore the text. The book lists the following observations, all of which can be verified by searching the text:


- It’s plain text so there is no markup to parse
- The translation of the original German uses UK English (e.g. travelling)
- The lines are artificially wrapped with new lines at about 70 characters
- There are no obvious typos or spelling mistakes
- There’s punctuation like commas, apostrophes, quotes, question marks, and more
- There are hyphenated descriptions like armour-like
- There’s a lot of use of the em dash (-) to continue sentences (maybe replace with commas?)
- There are names (e.g. Mr. Samsa)
- There do not appear to be numbers that require handling (e.g. 1999)
- There are section markers (e.g. II and III)

The text is already pretty clean. It can be prepared manually by following simple steps.

## Cleaning and tokenization

A common first step of text preparation is converting the whole text into a list of words. This will allow using the simplest models of NLP.

Load the text without header and footer. Split it into words by white space.

In [3]:
# split into words by white space
words = text.split()
print(words[:100])

['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"What\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human']


Punctuation, including end-of-sentence marks, stays attached to the tokens: "wasn't", "thought.", "me?".

Alternatively, tokenization of the text can be done with the regular expression module (`re`), which splits the document into words by selecting strings of alphanumeric characters.

In [4]:
import re
# split based on words only
words = re.split(r'\W+', text)
print(words[:100])

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room']


This time the tokens are formed differently: "armour-like" is now two words, "armour" and "like". Contractions are not handled well either: "What’s" is also divided into two words, "What" and "s". This may not be the most convenient way. So, go back to splitting by white space and remove punctuation.
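
As an aside, a regular expression can also be written to keep internal hyphens and apostrophes. The sketch below is one possible pattern, not the approach used in the rest of this notebook; the `sample` string and the name `token_re` are made up for illustration, and the pattern assumes plain ASCII letters and straight apostrophes.

```python
import re

sample = '"What\'s happened to me?" he thought. He lay on his armour-like back.'

# one or more letters, optionally followed by groups of
# (apostrophe or hyphen) + letters, so "What's" and "armour-like" stay whole
token_re = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")
tokens = token_re.findall(sample)
print(tokens)
```

Here `findall()` collects the matching words directly, whereas `re.split()` cuts the text at every non-word character.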

## Punctuation

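The standard list of punctuation characters in Python is given by `string.punctuation`. The second line of the output shows the same characters escaped for use in a regular expression.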

In [5]:
import string
print(string.punctuation)
print(re.escape(string.punctuation))

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
!"\#\$%\&'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~


Use a regular expression to select punctuation characters and remove them with the `sub()` function.

In [6]:
import string
# split into words by white space
words = text.split()
# prepare regex for char filtering
# match any character that is present in string.punctuation
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
# remove punctuation from each word
stripped = [re_punc.sub('', w) for w in words]
print(stripped[:100])

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'Whats', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasnt', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human']
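
The same result can be obtained without regular expressions. The sketch below uses the standard `str.maketrans()` and `str.translate()` methods with a translation table that deletes every character in `string.punctuation`; the short `words` list is a hand-picked sample, not the full text.

```python
import string

# a few tokens taken from the white-space split above
words = ['morning,', 'armour-like', '"What\'s', 'me?"', "wasn't"]

# the third argument of maketrans lists the characters to delete
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
print(stripped)
```

As with the regular expression version, hyphenated words and contractions are collapsed: "armour-like" becomes "armourlike" and "wasn't" becomes "wasnt".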


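If the text contains non-printable characters, they can be removed similarly by filtering out everything that is not in `string.printable`. Note that "\n" and "\t" count as printable there, and `split()` has already removed them from the words.

`re_print = re.compile('[^%s]' % re.escape(string.printable))`      
`result = [re_print.sub('', w) for w in words]`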
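
A small runnable illustration of this filter follows. The `words` list is invented for the demonstration. Since `string.printable` covers ASCII only, accented letters are removed as well, which may or may not be desirable for a given corpus.

```python
import re
import string

# character class matching anything that is NOT in string.printable
re_print = re.compile('[^%s]' % re.escape(string.printable))

# hypothetical tokens containing a NUL byte, a bell character and a non-ASCII letter
words = ['cafe\x00', 'bell\x07s', 'naïve', 'plain']
result = [re_print.sub('', w) for w in words]
print(result)
```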

## Normalization

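Finally, convert all words to lower case. In the cell below the conversion is applied to `words` rather than `stripped`, so punctuation is still present in the output.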

In [7]:
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])

['one', 'morning,', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'he', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'his', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"what\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'it', "wasn't", 'a', 'dream.', 'his', 'room,', 'a', 'proper', 'human']


# Preparing text using NLTK

When text files are much bigger than Kafka's Metamorphosis, loading them into memory and cleaning them manually may be challenging. NLTK provides good tools for that. See [Workshop, Part 1](https://ilykei.com/api/fileProxy/documents%2FMachineLearning_iLykei%2FNLPIntroduction%2FiLykei_ML_NLP_Intro_1.ipynb) and [Workshop, Part 2](https://ilykei.com/api/fileProxy/documents%2FMachineLearning_iLykei%2FNLPIntroduction%2FiLykei_ML_NLP_Intro_2.ipynb) for a review of NLTK.

In [8]:
import nltk
#nltk.download() # run it if NLTK data have not been downloaded yet

## Tokenization

A useful first step provided by NLTK is splitting the text into sentences. Sentences may then be split into words.

In [1]:
from nltk import sent_tokenize
# split into sentences
sentences = sent_tokenize(text)
print(sentences[0])


The NLTK function `word_tokenize()` splits tokens based on white space and punctuation. For example, commas and periods become separate tokens and contractions are split apart.

In [10]:
from nltk.tokenize import word_tokenize
# split into words
tokens = word_tokenize(text)
print(tokens[:100])

['One', 'morning', ',', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', ',', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', '.', 'He', 'lay', 'on', 'his', 'armour-like', 'back', ',', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', ',', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', '.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', '.', 'His', 'many', 'legs', ',', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', '.', '``', 'What', "'s", 'happened', 'to']


## Punctuation

Uninformative tokens, like punctuation, can now be filtered out.

In [11]:
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 'happened', 'to', 'me', 'he', 'thought', 'It', 'was', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room']


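Note that tokens like "armour-like" are removed, and so are contraction fragments such as "'s" and "n't". As a result, "wasn't" is reduced to "was" in the output.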

## Stop words

Next, remove uninformative words called "stop words", such as "a", "the", "is".   
Print the standard set of stop words.

In [12]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Stop words are all in lower case. The list also contains both contractions, like "wouldn't", and their stripped versions, like "wouldn". As a general pipeline, finish the preparation by lowering the case and removing punctuation from all tokens. Then remove stop words.

In [13]:
import string
# convert to lower case
tokens = [w.lower() for w in tokens]
# prepare regex for char filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
# remove punctuation from each word
stripped = [re_punc.sub('', w) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words[:100])

['one', 'morning', 'gregor', 'samsa', 'woke', 'troubled', 'dreams', 'found', 'transformed', 'bed', 'horrible', 'vermin', 'lay', 'armourlike', 'back', 'lifted', 'head', 'little', 'could', 'see', 'brown', 'belly', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections', 'bedding', 'hardly', 'able', 'cover', 'seemed', 'ready', 'slide', 'moment', 'many', 'legs', 'pitifully', 'thin', 'compared', 'size', 'rest', 'waved', 'helplessly', 'looked', 'happened', 'thought', 'nt', 'dream', 'room', 'proper', 'human', 'room', 'although', 'little', 'small', 'lay', 'peacefully', 'four', 'familiar', 'walls', 'collection', 'textile', 'samples', 'lay', 'spread', 'table', 'samsa', 'travelling', 'salesman', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice', 'gilded', 'frame', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy', 'fur', 'muff', 'covered', 'whole', 'lower', 'arm', 'towards', 'viewer']


Some strange words still remain after removing stop words, for example "nt", which is what is left of the token "n't" once the apostrophe is stripped. Finding and cleaning such words while retaining the words that matter for the given area of application may be a long and tedious process.

## Stemming

Finally, normalize the text by stemming it with the Porter stemming algorithm.

In [14]:
from nltk.stem.porter import PorterStemmer
# split into words
#tokens = word_tokenize(text)
# stemming of words
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words]
print(stemmed[:100])

['one', 'morn', 'gregor', 'samsa', 'woke', 'troubl', 'dream', 'found', 'transform', 'bed', 'horribl', 'vermin', 'lay', 'armourlik', 'back', 'lift', 'head', 'littl', 'could', 'see', 'brown', 'belli', 'slightli', 'dome', 'divid', 'arch', 'stiff', 'section', 'bed', 'hardli', 'abl', 'cover', 'seem', 'readi', 'slide', 'moment', 'mani', 'leg', 'piti', 'thin', 'compar', 'size', 'rest', 'wave', 'helplessli', 'look', 'happen', 'thought', 'nt', 'dream', 'room', 'proper', 'human', 'room', 'although', 'littl', 'small', 'lay', 'peac', 'four', 'familiar', 'wall', 'collect', 'textil', 'sampl', 'lay', 'spread', 'tabl', 'samsa', 'travel', 'salesman', 'hung', 'pictur', 'recent', 'cut', 'illustr', 'magazin', 'hous', 'nice', 'gild', 'frame', 'show', 'ladi', 'fit', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'rais', 'heavi', 'fur', 'muff', 'cover', 'whole', 'lower', 'arm', 'toward', 'viewer']


# Preparing Data with Keras

Text cannot be fed directly into deep learning models. It needs to be encoded in numbers.

## Tokenization

The function `text_to_word_sequence()` splits the text by white space, removes punctuation and converts words to lower case.

In [15]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# tokenize the document
result = text_to_word_sequence(text)
print(result)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']


## Encoding

The Keras function `one_hot()` performs tokenization and integer encoding of the tokens in one step. Despite its name, the function does not do one-hot encoding. Instead it is a wrapper around `hashing_trick()`. The use of a hash function means that not all words will be mapped to unique integer values. Like `text_to_word_sequence()`, the function tokenizes the text, removes punctuation and converts it to lower case. To define the dimension of the hashing space, the function requires the size of the vocabulary as an argument in addition to the text. This size can be the number of unique words in the document or larger.

In [2]:
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)


Apply `one_hot()` with the size of the hashing space set about 30% larger than the number of unique words in the document, which reduces the chance of collisions.

In [17]:
from tensorflow.keras.preprocessing.text import one_hot
# integer encode the document
result = one_hot(text, round(vocab_size*1.3))
print(result)

[6, 5, 4, 3, 1, 1, 6, 3, 6]


A limitation of integer and count-based encodings is that they require maintaining a vocabulary of words and their mapping to integers. As an alternative, a one-way hash function converts words to integers without keeping track of a vocabulary. This is faster and requires less memory.    
The Keras function `hashing_trick()` tokenizes the text and then encodes it with integers, just like the `one_hot()` function. The hash function can be the default `hash`, the built-in `md5` option or any user-defined function. Below is an example of integer encoding a document using the md5 hash function.

In [18]:
from tensorflow.keras.preprocessing.text import hashing_trick
# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)

[6, 4, 1, 2, 7, 5, 6, 2, 6]
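
To make the idea concrete, here is a minimal standard-library sketch of hashing words into a fixed range of integers. The function name `md5_hashing_trick`, the simplified tokenizer and the mapping `digest % (n - 1) + 1` (which leaves 0 unused) are assumptions made for illustration; they are not presented as the Keras implementation.

```python
import hashlib
import re

def md5_hashing_trick(text, n):
    """Map each word of text to an integer in 1..n-1 using the md5 digest."""
    # simplified tokenizer (an assumption): lower-case runs of letters
    words = re.findall(r"[a-z]+", text.lower())
    codes = []
    for w in words:
        digest = int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16)
        codes.append(digest % (n - 1) + 1)  # fold the digest into 1..n-1
    return words, codes

text = 'The quick brown fox jumped over the lazy dog.'
words, codes = md5_hashing_trick(text, 10)
print(codes)
# fewer distinct codes than distinct words means some words collided
print(len(set(words)), "unique words ->", len(set(codes)), "unique codes")
```

Because md5 is deterministic, the same word always receives the same integer across runs, whereas Python's built-in `hash` for strings is randomized per process by default.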


## Tokenizer

Keras also provides a more sophisticated API in the class `Tokenizer`, which is suited to preparing text in larger NLP projects for deep learning modeling.    
The fitted `Tokenizer` exposes 4 attributes:

- word_counts: A dictionary of words and their occurrence counts when the Tokenizer was fit
- word_docs: A dictionary of words and the number of documents each appears in
- word_index: A dictionary of words and their uniquely assigned integers
- document_count: An integer count of the total number of documents used to fit the Tokenizer   

After fitting `Tokenizer` to training data, it can be used to encode documents in both train and test datasets.    
The function `texts_to_matrix()` can be used to create one vector per document. The length of the vectors is the size of the vocabulary plus one, because index 0 is reserved and never assigned to a word.    
This function is commonly used to encode text for standard bag-of-words models. It works in several modes.     
The available modes include:    

- binary: Whether or not each word is present in the document. This is the default
- count: The count of each word in the document
- tfidf: The Term Frequency-Inverse Document Frequency (TF-IDF) scoring for each word in the document
- freq: The frequency of each word as a ratio of words within each document

In [19]:
from tensorflow.keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print("Word counts: \n",t.word_counts)
print("\nDocument count: \n",t.document_count)
print("\nWord index: \n",t.word_index)
print("\nWord docs: \n",t.word_docs)
# integer encode documents
encoded_docs = t.texts_to_matrix(docs, mode='count')
print("\nEncoded docs: count\n",encoded_docs)

Word counts: 
 OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])

Document count: 
 5

Word index: 
 {'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}

Word docs: 
 defaultdict(<class 'int'>, {'well': 1, 'done': 1, 'work': 2, 'good': 1, 'effort': 1, 'great': 1, 'nice': 1, 'excellent': 1})

Encoded docs: count
 [[0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]
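
The count matrix above can be reproduced by hand, which helps to see what the word index and the extra column 0 mean. The following is a plain-Python sketch under simplifying assumptions: the tokenizer is a basic regular expression, and the helper names `tokenize` and `to_matrix` are hypothetical. It is meant to illustrate the bag-of-words encoding, not to replace `Tokenizer`.

```python
import re
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

def tokenize(doc):
    # simplified tokenizer (an assumption): lower-case runs of letters
    return re.findall(r"[a-z]+", doc.lower())

# word counts over all documents, in order of first appearance
counts = Counter(w for d in docs for w in tokenize(d))

# most frequent word gets index 1; ties keep first-appearance order (stable sort)
ranked = sorted(counts, key=counts.get, reverse=True)
word_index = {w: i + 1 for i, w in enumerate(ranked)}

def to_matrix(docs, mode='count'):
    rows = []
    for d in docs:
        row = [0.0] * (len(word_index) + 1)   # column 0 stays unused
        seq = tokenize(d)
        for w in seq:
            row[word_index[w]] += 1.0
        if mode == 'freq':                    # counts as a share of the document length
            row = [c / len(seq) for c in row]
        rows.append(row)
    return rows

matrix = to_matrix(docs, mode='count')
for row in matrix:
    print(row)
```

Calling `to_matrix(docs, mode='freq')` on the same documents gives 0.5 for each of the two words in "Well done!".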
