# Cleaning Text can be Fun. And, painful ;)

The Keras deep learning library provides some basic tools to help you prepare your text data.

My goal with this notebook is to do the following: 
* Split words with text_to_word_sequence
* Encoding with one_hot
* Hash encoding with hashing_trick
* Tokenizer API

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
#taking only the id,excerpt,target
df = pd.read_csv("../input/commonlitreadabilityprize/train.csv",usecols=["id","excerpt","target"])
test_df = pd.read_csv("../input/commonlitreadabilityprize/test.csv",usecols=["id","excerpt"])
print("train shape",df.shape)
df.head()

An often necessary step in NLP is to split text into words. Words are called tokens, and the process of splitting text into tokens is called tokenization. Keras provides the text_to_word_sequence() function that you can use to split text into a list of words.

By default, text_to_word_sequence handles a few bits of preprocessing for you:
1. splits words by space
2. filters out punctuation
3. converts text to lowercase (lower=True)
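
To make those three steps concrete, here is a rough pure-Python approximation of the same behaviour. It is only a sketch, not Keras's actual implementation: the function name simple_word_sequence and the sample sentence are made up for illustration, and the FILTERS string is assumed to match the default punctuation filter in Keras.

```python
# Assumed to mirror the Keras default filter characters (the apostrophe is kept).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Hypothetical stand-in for text_to_word_sequence, for illustration only."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character.
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Split on the separator and drop the empty strings left by repeated separators.
    return [w for w in text.split(split) if w]

sample = "The quick, brown fox -- it's FAST!"
tokens = simple_word_sequence(sample)
print(tokens)  # ['the', 'quick', 'brown', 'fox', "it's", 'fast']
```

Note that "it's" survives as a single token because the apostrophe is not in the filter string.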

# Split words with text_to_word_sequence

In [None]:
from keras.preprocessing.text import text_to_word_sequence

In [None]:
df.loc[0,'excerpt']

In [None]:
words_excerpt = df.loc[0,'excerpt']
words_excerpt

In [None]:
result = text_to_word_sequence(words_excerpt)

Running the example creates a list containing all of the words in the document. The list of words is printed for review.

In [None]:
print(result)

This is a good start; however, we will need to do more pre-processing.

# Encoding with one_hot

Keras provides the one_hot() function to tokenize and integer encode a text document in one step. The name is a bit misleading, as this isn't about one hot encoding. Instead, it is a wrapper around the hashing_trick() function and returns an integer encoded version of the document. The use of a hash function means that there may be collisions, so not all words will be assigned unique integer values.
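
Collisions are easier to believe once you see one. The sketch below is a hand-rolled illustration, not the Keras code: hash_encode is a hypothetical helper, and it assumes the encoding works by taking a hash of each word modulo the vocabulary size, with Python's built-in hash as the default. With only three available slots and five distinct words, at least two words must share an integer.

```python
def hash_encode(words, n, hash_function=hash):
    """Hypothetical helper: map each word to an integer in [1, n - 1]."""
    # Index 0 is left unused, so only n - 1 distinct codes are available.
    return [hash_function(w) % (n - 1) + 1 for w in words]

words = ["apple", "banana", "cherry", "date", "elderberry"]
codes = hash_encode(words, n=4)  # deliberately tiny space to force collisions

print(codes)
print("unique words:", len(set(words)), "unique codes:", len(set(codes)))
```

The exact integers can change between Python sessions because the built-in hash of a string is randomized per process, but the number of unique codes can never exceed three here.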

In [None]:
from keras.preprocessing.text import one_hot

# estimate the size of the vocab (total token count, so an upper bound on unique words)
vocab_size = len(result)
print(vocab_size)
# integer encode the document, with 30% headroom to reduce hash collisions
result_vs = one_hot(words_excerpt, round(vocab_size*1.3))
print(result_vs)

The estimated size of the vocabulary is 181.

The encoded document is printed as a list of integers, one per word.

# Hashing

Why hash? Encodings built on an explicit vocabulary can be useful, but the vocabulary can become very large. That requires large vectors for encoding documents, increases memory requirements, and slows down algorithms.

We can use a one-way hash of words to convert them to integers. The clever part is that no vocabulary is required and you can choose a fixed-length vector of any size you like. This avoids the need to keep track of a vocabulary, which is faster and requires less memory.

A downside is that the hash is a one-way function so there is no way to convert the encoding back to a word.
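
As a rough illustration of the fixed-length idea, the sketch below hashes words straight into a count vector whose size you pick up front. This is a generic feature-hashing example, not what hashing_trick returns (that function gives back a list of integers, one per word). The helper names md5_index and hashed_count_vector, the toy document, and the vector length of 8 are all assumptions made for the example.

```python
import hashlib

def md5_index(word, n):
    """Hypothetical helper: map a word to a slot in [0, n - 1] using md5."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % n

def hashed_count_vector(words, n):
    """Count words into a vector of length n without storing any vocabulary."""
    vec = [0] * n
    for w in words:
        vec[md5_index(w, n)] += 1
    return vec

doc = ["the", "cat", "sat", "on", "the", "mat"]
vec = hashed_count_vector(doc, n=8)
print(vec)
```

The vector is always length 8 no matter how many distinct words show up, and the same word always lands in the same slot. Going from a slot back to a word is not possible, which is the downside mentioned above.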

Below is an example of integer encoding a document using the md5 hash function.

In [None]:
from keras.preprocessing.text import hashing_trick

result_hash = hashing_trick(words_excerpt, round(vocab_size*1.3), hash_function='md5')
print(result_hash)

Using a different hash function still gives consistent integers for each word, but they are different integers from those produced by the one_hot() function in the previous section.

# Tokenizer API

Keras provides a more sophisticated API for preparing text that can be fit and reused to prepare multiple text documents. 

The Tokenizer must be constructed and then fit on either raw text documents or integer encoded text documents.

In [None]:
from keras.preprocessing.text import Tokenizer
# fit our result from above where we retrieved individual words
text = [result]
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(text)

Once fit, the Tokenizer provides four attributes that you can use to query what has been learned about your documents:

1. word_counts: A dictionary of words and their occurrence counts when the Tokenizer was fit.
2. word_docs: A dictionary of words and the number of documents each appears in.
3. word_index: A dictionary of words and their uniquely assigned integers.
4. document_count: An integer count of the total number of documents that were used to fit the Tokenizer.
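
To get a feel for what these four attributes hold, here is a small pure-Python sketch that computes comparable values by hand for two toy documents. It is an approximation for illustration rather than the Tokenizer's own code: the toy documents are made up, and it assumes that word_index numbers words from 1 in order of decreasing frequency.

```python
from collections import Counter

# Two tiny, already-tokenized documents (made up for this example).
docs = [["the", "cat", "sat"], ["the", "dog"]]

# 1. How many times each word occurs across all documents.
word_counts = Counter(w for doc in docs for w in doc)

# 2. How many documents each word appears in (each doc counts a word once).
word_docs = Counter(w for doc in docs for w in set(doc))

# 3. A unique integer per word, most frequent word first, starting at 1.
word_index = {w: i for i, (w, _) in enumerate(word_counts.most_common(), start=1)}

# 4. Total number of documents seen.
document_count = len(docs)

print(dict(word_counts))
print(dict(word_docs))
print(word_index)
print(document_count)
```

Notice that document_count is a single integer, while the other three are dictionaries keyed by word.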

In [None]:
#1 Word Counts
print(t.word_counts)

In [None]:
#2 Word Docs
print(t.word_docs)

In [None]:
#3 Word Index
print(t.word_index)

In [None]:
#4 Document Count
print(t.document_count)

More to come in future notebooks on NLP!