# Text Processing

In [3]:
import os
import re
import random
from collections import Counter

In [31]:
import numpy as np
import tensorflow as tf
import sys
import string
import json
import nltk

* Transform raw data into a format that:
    * can be fed to the model (i.e., `numerical` word vectors or word embeddings, since machine learning algorithms cannot work with raw text directly).
    * loses as little information as possible during the transformation (e.g., a generic stopword list may not be appropriate, and a domain-specific one may work better).
    
    
* Mainly two steps:
    * Preprocess text
    * Vectorize text, i.e., transform text into numerical vectors.

## Text Preprocessing

* Typical text preprocessing:
    * Extract/clean raw text from the web (in the form of XHTML or XML) or other sources
        * For some applications, e.g., `Knowledge Graph` construction, we may exploit the structure of XHTML/XML to learn the relations between entities extracted from web pages.
    * Remove punctuation
    * Tokenization
    * Remove stopwords (may use domain-specific stopwords)
    * [Stemming and Lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
    * Part-of-speech tagging

### Remove Punctuation

In [7]:
def remove_punctuation(samples):
    filtered_samples = []
    for i in samples:
        filtered_samples.append(i.translate(str.maketrans('', '', string.punctuation)))
    return filtered_samples

In [8]:
# test
remove_punctuation(["Today's so beautiful!"])

['Todays so beautiful']

### Tokenization

* Tokenize text into sentences
* Tokenize text into words

* We can use NLTK `word_tokenize` to tokenize text into words

In [39]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [9]:
def tokenize(samples):
    tokenized_samples = []
    for s in samples:
        tokens = word_tokenize(s)
        tokenized_samples.append(tokens)
    return tokenized_samples

In [10]:
# test
tokenize(['Todays so beautiful'])

[['Todays', 'so', 'beautiful']]

* We can use NLTK `sent_tokenize` to tokenize text into sentences

In [17]:
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
print(sent_tokenize(data))

['All work and no play makes jack dull boy.', 'All work and no play makes jack a dull boy.']


### Remove stopwords

* We can use existing stopword resources, such as the English stopwords from `nltk.corpus`.
* For some applications or certain machine learning models, the existing stopwords may not be appropriate.
    * For example, if we want to use a Recurrent Neural Network model for sentiment analysis, we should not treat negation words such as "not", "neither", and "didn't" as stopwords, since they help the model handle negation.
* For some domains, we may define our own domain-specific stopwords.
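
As a minimal sketch of the negation point above, one option is to take the negation words out of the stopword set before filtering. The small `base_stopwords` set below is a hand-written stand-in so the snippet runs without NLTK data; in this notebook it would be `set(stopwords.words('english'))`. The names `negations` and `sentiment_stopwords` are illustrative, and the negation list is deliberately incomplete.

```python
# Stand-in for set(stopwords.words('english')); hard-coded so the sketch has no dependencies.
base_stopwords = {"so", "is", "the", "not", "no", "neither", "didn't", "a", "was"}

# Negation words we want to keep for sentiment analysis (illustrative, not exhaustive).
negations = {"not", "no", "neither", "didn't"}

# Set difference removes the negations from the stopword list.
sentiment_stopwords = base_stopwords - negations

tokens = ["Todays", "so", "not", "beautiful"]
filtered = [t for t in tokens if t.lower() not in sentiment_stopwords]
print(filtered)  # ['Todays', 'not', 'beautiful']
```

With the unmodified stopword set, "not" would be dropped and the sentence would read as positive.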

In [38]:
from nltk.corpus import stopwords

type(stopwords.words('english'))

list

In [11]:
stopWords = set(stopwords.words('english'))
print(stopWords)

{'ourselves', 'had', 'below', 'so', 'is', 'that', 'mightn', 'itself', 'with', 'on', 'as', 'which', "isn't", 'into', "haven't", 'and', "you've", 'off', 'for', 've', "wasn't", 'doing', 'until', 'were', 'an', 'll', 'hadn', 'am', 'now', 'during', 'both', 'here', 'be', "didn't", 'their', 'these', "wouldn't", 'further', 'ours', 'do', 'them', 'yourself', "that'll", 'y', 'against', 'was', 'where', 'up', 'yours', 'or', 'it', 'this', 'don', 'from', 'does', 'at', 'no', 'should', 'yourselves', 'my', 'wasn', 'did', 'whom', 'why', "doesn't", 'been', "don't", 'will', 'under', 'other', 'are', 'but', 'then', 'how', 'not', 'haven', 'its', 's', 't', 'can', 'over', 'there', 'any', 'same', "you'd", 'couldn', "weren't", 'aren', "needn't", 'i', 'more', 'o', 'hasn', 'theirs', "mightn't", "mustn't", "shan't", 'above', 'to', 'weren', 'of', "she's", 'ain', 'they', 'her', 'isn', 'few', 'he', 'myself', 'in', 'm', "it's", 'doesn', "aren't", 'very', 'him', "you're", 'hers', 'only', "hasn't", 'the', 'down', 'while', 

In [14]:
def remove_stopwords(samples):
    # samples is a list of token lists; drop every token found in stopWords
    filtered_samples = []
    for s in samples:
        filtered_tokens = []
        for w in s:
            if w not in stopWords:
                filtered_tokens.append(w)
        filtered_samples.append(filtered_tokens)
    return filtered_samples

In [15]:
# test
remove_stopwords([['Todays', 'so', 'not', 'beautiful']])

[['Todays', 'beautiful']]

### Stemming

* Word stemming is a normalization process. It transforms words into their stem form.
* Another typical normalization step for English words is to lowercase all words.
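
A small sketch of the lowercasing step, using only built-in string methods. The token list is made up for illustration.

```python
tokens = ["The", "Game", "the", "game", "GAMES"]

# Without lowercasing, 'The' and 'the' count as different words.
print(len(set(tokens)))       # 5

# Lowercase every token before counting or stemming.
lowered = [t.lower() for t in tokens]
print(sorted(set(lowered)))   # ['game', 'games', 'the']
```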

<img src="images/word-stem.png" alt=""/>

In [26]:
# There are other stemming algorithms, but Porter (PorterStemmer) is the most popular.
from nltk.stem import PorterStemmer

words = ["game","gaming","gamed","games"]
ps = PorterStemmer()
 
for word in words:
    print(word, "->", ps.stem(word))

game -> game
gaming -> game
gamed -> game
games -> game


### Part-of-speech tagging

For some applications, the part of speech of each word can help us better analyze the structure or semantics of the text.

* For Java, we can use [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/)
* For Python, we can use NLTK `pos_tag` function

In [36]:
document = 'Whether you\'re new to programming or an experienced developer, it\'s easy to learn and use Python.'
pos = nltk.pos_tag(nltk.word_tokenize(document))
print(type(pos))
print(pos)

<class 'list'>
[('Whether', 'IN'), ('you', 'PRP'), ("'re", 'VBP'), ('new', 'JJ'), ('to', 'TO'), ('programming', 'VBG'), ('or', 'CC'), ('an', 'DT'), ('experienced', 'JJ'), ('developer', 'NN'), (',', ','), ('it', 'PRP'), ("'s", 'VBZ'), ('easy', 'JJ'), ('to', 'TO'), ('learn', 'VB'), ('and', 'CC'), ('use', 'VB'), ('Python', 'NNP'), ('.', '.')]
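
One way the tags might be used downstream is to keep only certain word classes. The sketch below copies the tagged output above into a literal list (named `tagged` here, an illustrative name) so it runs without NLTK, and filters on the Penn Treebank tag prefixes `NN` (nouns) and `JJ` (adjectives).

```python
# Tagged tokens copied from the pos_tag output above.
tagged = [('Whether', 'IN'), ('you', 'PRP'), ("'re", 'VBP'), ('new', 'JJ'),
          ('to', 'TO'), ('programming', 'VBG'), ('or', 'CC'), ('an', 'DT'),
          ('experienced', 'JJ'), ('developer', 'NN'), (',', ','), ('it', 'PRP'),
          ("'s", 'VBZ'), ('easy', 'JJ'), ('to', 'TO'), ('learn', 'VB'),
          ('and', 'CC'), ('use', 'VB'), ('Python', 'NNP'), ('.', '.')]

# Noun tags start with 'NN' (NN, NNS, NNP, NNPS); adjective tags start with 'JJ'.
nouns = [word for word, tag in tagged if tag.startswith('NN')]
adjectives = [word for word, tag in tagged if tag.startswith('JJ')]

print(nouns)       # ['developer', 'Python']
print(adjectives)  # ['new', 'experienced', 'easy']
```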


<img src="images/nltk-speech-codes.png" alt=""/>

## Convert text to numerical

Most ML algorithms, including deep learning, require numerical data as input. This means we need to convert the text to numbers.

More specifically, we may want to classify documents, so each document is an “input” and a class label is the “output” for our predictive algorithm. Algorithms take vectors of numbers as input, so we need to convert documents to fixed-length vectors of numbers.

A simple and effective model for representing text documents in machine learning is the Bag-of-Words model, or BoW, which ignores word order and focuses on the occurrence of words in a document.

BoW converts a collection of documents into a matrix: each document is a row whose length equals the size of the vocabulary of known words, each word (or token) is a column, and the value at (row, column) is the frequency of occurrence of that word or token in that document.
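
To make the matrix layout concrete, here is a hand-rolled sketch using only the standard library. It is not how scikit-learn is implemented; the helper `simple_tokens` and its regular expression (lowercase, then runs of two or more word characters) are assumptions chosen for illustration.

```python
import re
from collections import Counter

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

def simple_tokens(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [simple_tokens(d) for d in documents]

# Vocabulary: every distinct token, sorted, mapped to a column index.
vocabulary = {w: i for i, w in enumerate(sorted({t for doc in tokenized for t in doc}))}

# One row per document, one column per vocabulary word, values are counts.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts.get(w, 0) for w in vocabulary])

for row in matrix:
    print(row)
```

Each row has 12 entries, one per vocabulary word, and "win" gets a count of 2 in the second row because it appears twice in that document.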

The scikit-learn library provides several schemes that we can use; we will briefly look at two of them, `CountVectorizer` and `TfidfVectorizer`.

### Implementing Bag of Words Using CountVectorizer of scikit-learn ###


To handle this, we will be using scikit-learn's 
[count vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) class, which does the following:

* It tokenizes a collection of text documents (separates each document into individual words), builds a vocabulary of known words, and encodes new documents using that vocabulary.
* It assigns each word an integer representing how often that word appears in its document.

**More specifically:**

* The CountVectorizer method automatically converts all tokenized words to their lower case form so that it does not treat words like 'He' and 'he' differently. It does this using the `lowercase` parameter which is by default set to `True`.

* It also ignores all punctuation so that words followed by a punctuation mark (for example: 'hello!') are not treated differently than the same words not prefixed or suffixed by a punctuation mark (for example: 'hello'). It does this using the `token_pattern` parameter which has a default regular expression which selects tokens of 2 or more alphanumeric characters.

* CountVectorizer can also ignore all words (from our input text) that are found in scikit-learn's built-in list of English stop words, but only when `stop_words='english'` is passed. The default is `stop_words=None`, which keeps every token. Removing stop words can help because very common words can skew our calculations when we are looking for key words that are indicative of a class (for example, spam).

**To sum up**
* `CountVectorizer` does the following work for us:
    * Tokenization
    * Lowercase tokenized words
    * Remove punctuation
    * Remove stopwords (only when `stop_words` is set)

`CountVectorizer()` has certain parameters which take care of these steps for us. They are:

* `lowercase = True`
    
    The `lowercase` parameter has a default value of `True` which converts all of our text to its lower case form.


* `token_pattern = (?u)\b\w\w+\b`
    
    The `token_pattern` parameter has a default regular expression value of `(?u)\b\w\w+\b` (shown with escaped backslashes, `'(?u)\\b\\w\\w+\\b'`, when the object is printed) which 
    * ignores all punctuation marks and treats them as delimiters, 
    * accepts alphanumeric strings of length greater than or equal to 2 as individual tokens or words.


* `stop_words`

    The `stop_words` parameter, if set to `'english'`, removes all words from our document set that match a list of English stop words defined in scikit-learn. Its default is `None`, which removes nothing. Because the example documents here are short messages rather than larger text sources such as e-mail, we will not set this parameter.
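
To see what the default `token_pattern` does on its own, the sketch below applies the same regular expression with Python's `re` module, written as a raw string with single backslashes. The sample sentence is made up; lowercasing is done by hand here to mimic `lowercase=True`.

```python
import re

# The default token_pattern as a Python raw string.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

text = "Hello, how are you! I am a bot."
tokens = token_pattern.findall(text.lower())
print(tokens)  # ['hello', 'how', 'are', 'you', 'am', 'bot']
```

Punctuation acts as a delimiter, and the single-character words "I" and "a" are dropped because the pattern requires at least two word characters.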

You can take a look at all the parameter values of your `count_vector` object by simply printing out the object as follows:

In [42]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

print(count_vector)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


In [43]:
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

In [46]:
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(documents)
# summarize
print(vectorizer.vocabulary_)
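# get_feature_names() was removed in scikit-learn 1.2; on newer versions call get_feature_names_out() instead
print(vectorizer.get_feature_names())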

{'how': 5, 'money': 7, 'me': 6, 'you': 11, 'from': 2, 'tomorrow': 9, 'now': 8, 'win': 10, 'are': 0, 'hello': 3, 'call': 1, 'home': 4}
['are', 'call', 'from', 'hello', 'home', 'how', 'me', 'money', 'now', 'tomorrow', 'win', 'you']


In [48]:
# encode document
vector = vectorizer.transform(documents)
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

(4, 12)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 0 0 1 0 1 0 0 0 0 0 1]
 [0 0 1 0 1 0 0 1 0 0 2 0]
 [0 1 0 0 0 0 1 0 1 0 0 0]
 [0 1 0 2 0 0 0 0 0 1 0 1]]


A call to `transform()` returns a sparse matrix, which you can convert to a NumPy array by calling its `toarray()` method.

Importantly, the same vectorizer can be used on documents that contain words not included in the vocabulary. These words are ignored and no count is given in the resulting vector.

For example, the cell below uses the vectorizer above to encode a document in which no word is in the vocabulary. Running it prints the array version of the encoded sparse matrix, showing that none of the words in this document appears in the learned vocabulary.

In [49]:
# encode another document
text2 = ["the puppy"]
vector = vectorizer.transform(text2)
print(vector.toarray())

[[0 0 0 0 0 0 0 0 0 0 0 0]]


One potential issue with using this method out of the box is that if our text dataset is extremely large (say, a large collection of news articles or email data), certain words will be more common than others simply due to the structure of the language itself. For example, words like 'is', 'the', 'an', pronouns, and other grammatical constructs could skew our matrix and affect our analysis.

There are a couple of ways to mitigate this. One way is to set the `stop_words` parameter to `'english'`. This will automatically ignore all words (from our input text) that are found in scikit-learn's built-in list of English stop words.

Another way of mitigating this is to use the TF-IDF method.

### Implementing Bag of Words Using TfidfVectorizer of scikit-learn ###

[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is an acronym that stands for “Term Frequency – Inverse Document Frequency”, the two components of the score assigned to each word.

* `Term Frequency`: This summarizes how often a given word appears within a document.
* `Inverse Document Frequency`: This downscales words that appear a lot across documents.

The [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. 

Alternatively, if you already have a learned CountVectorizer, you can use it with a [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) to just calculate the inverse document frequencies and start encoding documents.

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(documents)
# summarize
print("------ Vocabulary --------")
print(vectorizer.vocabulary_)
print("--------- idf ------------")
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([documents[0]])
# summarize encoded vector
print("--------- summarize encoded vector ------------")
print(vector.shape)
print(vector.toarray())

------ Vocabulary --------
{'how': 5, 'money': 7, 'me': 6, 'you': 11, 'from': 2, 'tomorrow': 9, 'now': 8, 'win': 10, 'are': 0, 'hello': 3, 'call': 1, 'home': 4}
--------- idf ------------
[ 1.91629073  1.51082562  1.91629073  1.51082562  1.91629073  1.91629073
  1.91629073  1.91629073  1.91629073  1.91629073  1.91629073  1.51082562]
--------- summarize encoded vector ------------
(1, 12)
[[ 0.55528266  0.          0.          0.43779123  0.          0.55528266
   0.          0.          0.          0.          0.          0.43779123]]


A vocabulary of 12 words is learned from the documents and each word is assigned a unique integer index in the output vector.

The inverse document frequencies are calculated for each word in the vocabulary, assigning the lowest score of 1.51082562 to the words that appear in the most documents: "call" at index 1, “hello” at index 3, and "you" at index 11.

Finally, the first document is encoded as a 12-element sparse array. The scores are normalized to values between 0 and 1, and the encoded document vectors can then be used directly with most machine learning algorithms.
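
The numbers above can be reproduced by hand, which may help show where they come from. The sketch below assumes the smoothed formula scikit-learn documents for its default `smooth_idf=True` setting, idf = ln((1 + n) / (1 + df)) + 1, with raw counts as term frequency and L2 normalization of each row. The tokenizing regular expression and all variable names are illustrative.

```python
import math
import re
from collections import Counter

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in documents]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word.
doc_freq = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency (assumed formula, see the lead-in).
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

# tf-idf for the first document: raw count times idf, then L2-normalize the row.
tf = Counter(tokenized[0])
raw = [tf[w] * idf[w] for w in vocab]
norm = math.sqrt(sum(x * x for x in raw))
tfidf = [x / norm for x in raw]

print([round(idf[w], 8) for w in vocab])
print([round(x, 8) for x in tfidf])
```

Words that occur in two of the four documents get the lower idf, and the L2 normalization is what keeps every entry of the encoded vector between 0 and 1.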

### Reference

* [Tokenizing Words and Sentences with NLTK](https://pythonspot.com/en/tokenizing-words-and-sentences-with-nltk/)
* [NLTK stop words](https://pythonspot.com/nltk-stop-words/)
* [NLTK stemming](https://pythonspot.com/nltk-stemming/)
* [NLTK part of speech tagging](https://pythonspot.com/nltk-speech-tagging/)
* [How to Prepare Text Data for Machine Learning with scikit-learn](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)
* [4.2. Feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html)
* [Working With Text Data](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)