
# Preparing Text Data with scikit-learn

Text data requires special preparation before you can use it for predictive modeling. The text must first be parsed to extract words, a step called tokenization. The words then need to be encoded as integers or floating point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization). The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data.

## The Bag-of-Words Model

We cannot work with text directly when using machine learning algorithms. Instead, we need to convert the text to numbers. For example, we may want to classify documents, so that each document is an input and a class label is the output of our predictive algorithm. Algorithms take vectors of numbers as input, so we need to convert documents to fixed-length vectors of numbers.

A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model. The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document. This can be done by assigning each word a unique number. Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words. The value in each position in the vector can be filled with a count or frequency of the corresponding word in the encoded document.

This is the bag-of-words model, where we are only concerned with encoding schemes that represent which words are present, or the degree to which they are present, in encoded documents, without any information about order. There are many ways to extend this simple method, both by refining what counts as a word and by defining what to encode about each word in the vector.
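
The encoding described above can be sketched in a few lines of plain Python. This is only an illustration, not how scikit-learn is implemented: the helper names `tokenize`, `build_vocabulary`, and `encode` are hypothetical, and the tokenization rule (lowercase, words of two or more characters) is an assumption chosen to resemble common defaults.

```python
import re
from collections import Counter

def tokenize(text):
    # lowercase the text and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(documents):
    # assign each distinct word a unique integer index (alphabetical order)
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def encode(document, vocabulary):
    # fixed-length count vector; words outside the vocabulary are skipped
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector

docs = ['The quick brown fox jumped over the lazy dog.']
vocab = build_vocabulary(docs)
print(vocab)
print(encode(docs[0], vocab))
print(encode('the puppy', vocab))
```

Every encoded vector has the same length as the vocabulary, whatever the length of the document, and word order plays no part in the result.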

The scikit-learn library provides three different schemes that we can use:
* CountVectorizer for Word Counts
* TfidfVectorizer for Word Frequencies
* HashingVectorizer for Hashing

## Word Counts with CountVectorizer

The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
text = ['The quick brown fox jumped over the lazy dog.']

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)

# encode document
vector = vectorizer.transform(text)

# showing shape of vector
print(vector.shape)

# showing type vector
print(type(vector))

# showing a count of occurrence for each word
print(vector.toarray())

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
(1, 8)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 1 1 1 1 2]]


Importantly, the same vectorizer can be used on documents that contain words not included
in the vocabulary. These words are ignored and no count is given in the resulting vector.

In [3]:
# encode another document
text2 = ['the puppy']
vector = vectorizer.transform(text2)
print(vector.toarray())

[[0 0 0 0 0 0 0 1]]


## Word Frequencies with TfidfVectorizer

Word counts are a good starting point, but are very basic. One issue with simple counts is that
some words like "the" will appear many times, and their large counts will not be very meaningful
in the encoded vectors. An alternative is to calculate word frequencies, and by far the most
popular method is called **TF-IDF**. This is an acronym that stands for **Term Frequency - Inverse
Document Frequency**, the two components of the resulting score assigned to each word.

* Term Frequency: This summarizes how often a given word appears within a document.
* Inverse Document Frequency: This downscales words that appear a lot across documents.

**TF-IDF** scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.
The **TfidfVectorizer** will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.

Alternatively, if you already have a learned **CountVectorizer**, you can use it with a TfidfTransformer to just calculate the inverse
document frequencies and start encoding documents. The same create, fit, and transform process
is used as with the **CountVectorizer**.
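
A minimal sketch of the two-step route, assuming a **CountVectorizer** fitted on the same three documents used in the example that follows. The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# list of text documents
text = [
    'The quick brown fox jumped over the lazy dog.',
    'The dog',
    'The fox',
]

# step 1: learn the vocabulary and produce raw word counts
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(text)

# step 2: learn the inverse document frequencies from the counts
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(counts)

# encode the first document: counts first, then TF-IDF weighting
vector = tfidf_transformer.transform(count_vectorizer.transform([text[0]]))
print(tfidf_transformer.idf_)
print(vector.shape)
print(vector.toarray())
```

With default settings this should give the same scores as the single-step **TfidfVectorizer** version that follows.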

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = [
  'The quick brown fox jumped over the lazy dog.',
  'The dog',
  'The fox'
]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

# encode document
vector = vectorizer.transform([text[0]])

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]
(1, 8)
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]]


By default, each encoded document vector is scaled to unit Euclidean length, so the scores fall between 0 and 1. The encoded document vectors can
then be used directly with most machine learning algorithms.
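
The printed numbers can be reproduced by hand, which helps make the two components concrete. The sketch below assumes scikit-learn's documented defaults (smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization). The document frequencies and word counts are typed in manually from the three example documents rather than computed.

```python
import math

n_documents = 3

# number of documents containing each word in the three-document example
document_frequency = {'brown': 1, 'dog': 2, 'fox': 2, 'jumped': 1,
                      'lazy': 1, 'over': 1, 'quick': 1, 'the': 3}

# word counts in the first document
term_count = {'brown': 1, 'dog': 1, 'fox': 1, 'jumped': 1,
              'lazy': 1, 'over': 1, 'quick': 1, 'the': 2}

# smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {word: math.log((1 + n_documents) / (1 + df)) + 1
       for word, df in document_frequency.items()}

# raw TF-IDF score, then scale so the vector has Euclidean length 1
raw = {word: term_count[word] * idf[word] for word in term_count}
length = math.sqrt(sum(value * value for value in raw.values()))
tfidf = {word: value / length for word, value in raw.items()}

for word in sorted(tfidf):
    print(word, round(idf[word], 8), round(tfidf[word], 8))
```

The word "the" appears in every document, so it receives the lowest possible idf of 1.0. It still ends up with the largest score in the first document because it occurs there twice.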

## Hashing with HashingVectorizer

Counts and frequencies can be very useful, but one limitation of these methods is that the
vocabulary can become very large. This, in turn, requires large vectors for encoding
documents, imposes large memory requirements, and slows down algorithms. A clever workaround
is to use a one-way hash of words to convert them to integers. The clever part is that
no vocabulary is required and you can choose a fixed-length vector of arbitrary size. A downside is that the hash is a one-way function, so there is no way to convert the encoding back to a word.

The **HashingVectorizer** class implements this approach: it tokenizes documents and consistently hashes each word to a position in the encoded vector, as needed.
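
The idea can be sketched without scikit-learn. The function below is a hypothetical illustration, not the library's implementation: it uses MD5 from the standard library as a stand-in for a stable hash function and produces plain counts with no sign or normalization step.

```python
import hashlib
import re

def hashed_counts(document, n_features=20):
    # fixed-length vector; no vocabulary is stored anywhere
    vector = [0] * n_features
    for word in re.findall(r"\b\w\w+\b", document.lower()):
        # md5 is a stand-in for a stable hash; Python's built-in hash()
        # of a string can differ from one run to the next
        digest = hashlib.md5(word.encode('utf-8')).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1
    return vector

vector = hashed_counts('The quick brown fox jumped over the lazy dog.')
print(vector)
print(sum(vector))
```

Two different words can land on the same index, in which case their counts are added together and can no longer be told apart. The smaller `n_features` is, the more likely this becomes.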

In the example below, an arbitrary fixed-length vector size of 20 is chosen. This corresponds to the range of the hash function; small values (like 20) may result in hash collisions.

In [5]:
from sklearn.feature_extraction.text import HashingVectorizer

# list of text documents
text = ['The quick brown fox jumped over the lazy dog.']

# create the transform
vectorizer = HashingVectorizer(n_features=20)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(1, 20)
[[ 0.          0.          0.          0.          0.          0.33333333
   0.         -0.33333333  0.33333333  0.          0.          0.33333333
   0.          0.          0.         -0.33333333  0.          0.
  -0.66666667  0.        ]]
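
The fractional and negative values above come from two defaults of **HashingVectorizer**: the vector is L2-normalized, and a sign derived from the hash is applied to each word. As a sketch, and assuming a scikit-learn version that supports the `alternate_sign` parameter, both can be switched off to obtain plain non-negative counts:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# list of text documents
text = ['The quick brown fox jumped over the lazy dog.']

# norm=None keeps raw counts; alternate_sign=False keeps them non-negative
vectorizer = HashingVectorizer(n_features=20, norm=None, alternate_sign=False)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(vector.toarray())
```

With these settings the entries sum to the number of tokens in the document, and any hash collisions show up as counts larger than expected rather than as cancelled values.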
