# Vectorizing and Encoding Data

In this section we will learn how to convert text into numeric data that machine learning models can work with. This process is called **vectorization**: we create numeric representations of words and tokens. There are many ways to do this, and we will try a few techniques hands-on.

## Turning Text into Numbers

Recall this excerpt from the Charles Dickens novel, shown here encoded in UTF-8 (Unicode Transformation Format – 8-bit). UTF-8 works in 8-bit units, and every character in this excerpt is plain ASCII, so the computer stores and reads each one as a single 8-bit byte. Recall that binary is a number system limited to only two digits, 0 and 1.

> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way--in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.

```
01001001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01100010 01100101 01110011 01110100 00100000 01101111 01100110 00100000 01110100 01101001 01101101 01100101 01110011 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01110111 01101111 01110010 01110011 01110100 00100000 01101111 01100110 00100000 01110100 01101001 01101101 01100101 01110011 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01100001 01100111 01100101 00100000 01101111 01100110 00100000 01110111 01101001 01110011 01100100 01101111 01101101 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01100001 01100111 01100101 00100000 01101111 01100110 00100000 01100110 01101111 01101111 01101100 01101001 01110011 01101000 01101110 01100101 01110011 01110011 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01100101 01110000 01101111 01100011 01101000 00100000 01101111 01100110 00100000 01100010 01100101 01101100 01101001 01100101 01100110 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01100101 01110000 01101111 01100011 01101000 00100000 01101111 01100110 00100000 01101001 01101110 01100011 01110010 01100101 01100100 01110101 01101100 01101001 01110100 01111001 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01110011 01100101 01100001 01110011 01101111 01101110 00100000 01101111 01100110 00100000 01001100 01101001 01100111 01101000 01110100 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01110011 01100101 01100001 01110011 01101111 01101110 00100000 01101111 01100110 00100000 01000100 01100001 
01110010 01101011 01101110 01100101 01110011 01110011 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01110011 01110000 01110010 01101001 01101110 01100111 00100000 01101111 01100110 00100000 01101000 01101111 01110000 01100101 00101100 00100000 01101001 01110100 00100000 01110111 01100001 01110011 00100000 01110100 01101000 01100101 00100000 01110111 01101001 01101110 01110100 01100101 01110010 00100000 01101111 01100110 00100000 01100100 01100101 01110011 01110000 01100001 01101001 01110010 00101100 00100000 01110111 01100101 00100000 01101000 01100001 01100100 00100000 01100101 01110110 01100101 01110010 01111001 01110100 01101000 01101001 01101110 01100111 00100000 01100010 01100101 01100110 01101111 01110010 01100101 00100000 01110101 01110011 00101100 00100000 01110111 01100101 00100000 01101000 01100001 01100100 00100000 01101110 01101111 01110100 01101000 01101001 01101110 01100111 00100000 01100010 01100101 01100110 01101111 01110010 01100101 00100000 01110101 01110011 00101100 00100000 01110111 01100101 00100000 01110111 01100101 01110010 01100101 00100000 01100001 01101100 01101100 00100000 01100111 01101111 01101001 01101110 01100111 00100000 01100100 01101001 01110010 01100101 01100011 01110100 00100000 01110100 01101111 00100000 01001000 01100101 01100001 01110110 01100101 01101110 00101100 00100000 01110111 01100101 00100000 01110111 01100101 01110010 01100101 00100000 01100001 01101100 01101100 00100000 01100111 01101111 01101001 01101110 01100111 00100000 01100100 01101001 01110010 01100101 01100011 01110100 00100000 01110100 01101000 01100101 00100000 01101111 01110100 01101000 01100101 01110010 00100000 01110111 01100001 01111001 00101101 00101101 01101001 01101110 00100000 01110011 01101000 01101111 01110010 01110100 00101100 00100000 01110100 01101000 01100101 00100000 01110000 01100101 01110010 01101001 01101111 01100100 00100000 01110111 01100001 01110011 00100000 01110011 
01101111 00100000 01100110 01100001 01110010 00100000 01101100 01101001 01101011 01100101 00100000 01110100 01101000 01100101 00100000 01110000 01110010 01100101 01110011 01100101 01101110 01110100 00100000 01110000 01100101 01110010 01101001 01101111 01100100 00101100 00100000 01110100 01101000 01100001 01110100 00100000 01110011 01101111 01101101 01100101 00100000 01101111 01100110 00100000 01101001 01110100 01110011 00100000 01101110 01101111 01101001 01110011 01101001 01100101 01110011 01110100 00100000 01100001 01110101 01110100 01101000 01101111 01110010 01101001 01110100 01101001 01100101 01110011 00100000 01101001 01101110 01110011 01101001 01110011 01110100 01100101 01100100 00100000 01101111 01101110 00100000 01101001 01110100 01110011 00100000 01100010 01100101 01101001 01101110 01100111 00100000 01110010 01100101 01100011 01100101 01101001 01110110 01100101 01100100 00101100 00100000 01100110 01101111 01110010 00100000 01100111 01101111 01101111 01100100 00100000 01101111 01110010 00100000 01100110 01101111 01110010 00100000 01100101 01110110 01101001 01101100 00101100 00100000 01101001 01101110 00100000 01110100 01101000 01100101 00100000 01110011 01110101 01110000 01100101 01110010 01101100 01100001 01110100 01101001 01110110 01100101 00100000 01100100 01100101 01100111 01110010 01100101 01100101 00100000 01101111 01100110 00100000 01100011 01101111 01101101 01110000 01100001 01110010 01101001 01110011 01101111 01101110 00100000 01101111 01101110 01101100 01111001 00101110
```

While this is how the computer really sees the text behind the scenes, further mathematical transformations have to occur before a model can use it. Thankfully, you will not have to think in 1s and 0s, as Python abstracts that away from you. However, it is still good to be aware of what 8-bit versus 16-bit means. UTF-8 is a variable-length encoding: it stores common ASCII characters in 8 bits (1 byte) and uses up to 32 bits (4 bytes) for other characters. That range is enough to cover every Unicode character, which is why UTF-8 is such a common format and supports internationalization robustly.
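
As a quick check of the byte layout described above, here is a minimal sketch that uses only Python's built-in `str.encode`. The sample characters `é`, `€`, and `😀` are my own picks for illustration and do not come from the excerpt.

```python
# Encode the first few characters of the excerpt as UTF-8 and show each byte in binary.
sample = "It was"
bits = " ".join(f"{byte:08b}" for byte in sample.encode("utf-8"))
print(bits)

# UTF-8 is variable-length: ASCII needs 1 byte, other characters need 2 to 4.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in ["I", "é", "€", "😀"]}
for ch, size in byte_lengths.items():
    print(ch, size, "byte(s)")
```

The first printed line should match the first six groups of the binary dump above.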

When we do natural language processing or work with large language models, we have to convert documents into fixed-length vectors of numbers. A common approach is the **Bag-of-Words (BoW)** model, which will come up often in the remainder of this course. It typically disregards the order in which words occur and focuses on the occurrence of words in a document. Each word in the known vocabulary is assigned a unique index, and a vector the same length as the vocabulary flags at each index whether that word has occurred or not.
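
To make the idea concrete before bringing in a library, here is a minimal hand-rolled sketch. The `tokenize` and `encode` helpers, the letters-only tokenization rule, and the alphabetical index order are assumptions made for this illustration, not part of any library.

```python
import re

def tokenize(text):
    # Lowercase and keep runs of letters; a deliberately simple rule for this sketch.
    return re.findall(r"[a-z]+", text.lower())

corpus = ["The sun is up. The sun is yellow."]

# Assign each distinct word a fixed index (alphabetical order is just one choice).
words = sorted({token for doc in corpus for token in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(words)}

def encode(text):
    # One slot per vocabulary word: 1 if the word occurs, 0 otherwise.
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        if token in vocabulary:
            vector[vocabulary[token]] = 1
    return vector

print(vocabulary)
print(encode("The sun is yellow."))
print(encode("Yellow is the sun!"))  # same words in a different order
```

The last two vectors are identical, which is the "order is disregarded" property in action.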

There are a couple of ways to implement a Bag-of-Words model. Let's start with the `CountVectorizer`. 

## Word Counts

The `CountVectorizer` in scikit-learn will tokenize text in documents and build a vocabulary of known words. Then with new documents you can encode vectors that align to that vocabulary. 

Here is a simple example of using the `CountVectorizer`. Study it and the output closely. 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

text = ["The sun is up. The sun is yellow. The yellow sun is over the house."]
vectorizer = CountVectorizer()

# tokenize 
vectorizer.fit(text)

# show vocabulary
print(vectorizer.vocabulary_)

# encode a new document
vector = vectorizer.transform(["The house is yellow. The sun is over the house."])

# summarize vector
print(vector.toarray())

Notice closely that we start with the text "The sun is up. The sun is yellow. The yellow sun is over the house." We `fit()` the `CountVectorizer` on this text, and the words in that sentence are our entire vocabulary: just the 7 words _the, sun, is, up, yellow, over,_ and _house_. I print `vocabulary_`, which conveniently returns a map showing which index each word occupies in a vector.

When I `transform()` a new document "The house is yellow. The sun is over the house." it turns it into a vector holding the count for each word. Again, notice that each vocabulary word holds a fixed index position in the array.
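
Reading a raw count vector is easier once each position is labeled with its word. The sketch below does this with plain dictionaries. The `vocabulary` and `counts` values are typed in by hand from what I would expect the cell above to print, so treat them as assumptions and rerun the cell to confirm them on your installation.

```python
# Hand-copied values (assumed output of the cell above).
vocabulary = {"the": 4, "sun": 3, "is": 1, "up": 5, "yellow": 6, "over": 2, "house": 0}
counts = [2, 2, 1, 1, 3, 0, 1]

# Invert the word -> index map so each position can be labeled with its word.
index_to_word = {index: word for word, index in vocabulary.items()}
labeled = {index_to_word[i]: count for i, count in enumerate(counts)}
print(labeled)
```

Here "the" occurs 3 times and "up" does not occur at all in the new document.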

What happens if we give it a document with vocabulary it has not seen before? The sentence below contains the words "blue" and "moon", which were not in the text we called `fit()` on previously.

In [None]:
vector = vectorizer.transform(["The house is blue. The moon is over the house."])
print(vector.toarray())

Basically, any unknown words like "blue" and "moon" are ignored. The vectorizer is not going to extend the array for new vocabulary. To do that, I would have to call the `fit()` function again, which rebuilds the vocabulary. Note that previously captured words will not necessarily retain their old positions.
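
The index shift is easy to see in a small hand-rolled sketch. The `build_vocabulary` helper is hypothetical, and it assumes words are lowercased and indexed alphabetically, which is how I understand `CountVectorizer` to order its vocabulary.

```python
import re

def build_vocabulary(docs):
    # Collect distinct lowercase words (two or more characters) and index them alphabetically.
    words = sorted({w for doc in docs for w in re.findall(r"\b\w\w+\b", doc.lower())})
    return {word: index for index, word in enumerate(words)}

first = build_vocabulary(["The sun is up. The sun is yellow. The yellow sun is over the house."])
second = build_vocabulary([
    "The sun is up. The sun is yellow. The yellow sun is over the house.",
    "The house is blue. The moon is over the house.",
])

# Words whose index changed after rebuilding the vocabulary: (old index, new index).
moved = {w: (first[w], second[w]) for w in first if first[w] != second[w]}
print(moved)
```

Because "blue" sorts ahead of every existing word, every old word moves to a new index, so vectors encoded before and after refitting are not comparable position by position.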

In [None]:
text = ["The sun is up. The sun is yellow. The yellow sun is over the house.", "The house is blue. The moon is over the house."]
vectorizer.fit(text)
print(vectorizer.vocabulary_)

vector = vectorizer.transform(["The house is blue. The moon is over the house."])
print(vector.toarray())

We can also observe that the vocabulary was converted to lowercase by default and punctuation was ignored. [scikit-learn provides a lot of options to change these behaviors.](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
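
To see why punctuation disappears, here is a minimal sketch of the tokenization step using Python's `re` module. It assumes the documented default `token_pattern` of `r"(?u)\b\w\w+\b"` together with lowercasing; the `tokenize` helper and the sample sentence are mine.

```python
import re

# Assumed default pattern: keep tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    # Lowercase first, then pull out the matching tokens; punctuation never matches.
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("The sun is up. I saw a yellow house!")
print(tokens)
```

Under this pattern, single-character words such as "I" and "a" are dropped along with the punctuation.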

## Word Frequencies

Counting words can only get us so far. Unhelpful words like _the_ will appear many times and bloat the counts in vectors. A far more popular method that works around this is the **Term Frequency - Inverse Document Frequency (TF-IDF)** technique. It gives higher scores to words that are more _interesting_ and diminishes words that appear a lot across documents.

It will tokenize documents, build the vocabulary, apply inverse document frequency weightings, and then be able to encode new documents. It performs the same steps as the `CountVectorizer` but adds the inverse document frequency weighting on top. Note that splitting text into separate documents gives a different result than putting all the text in a single document, because the formula works at the document level.
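
For reference, the weighting can be written out as follows. This is a sketch that assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`); the symbols $ n $, $ \mathrm{df}(t) $, and $ \mathrm{tf}(t, d) $ are notation introduced here.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Here $ n $ is the number of documents, $ \mathrm{df}(t) $ is the number of documents containing the word $ t $, and $ \mathrm{tf}(t, d) $ is how many times $ t $ occurs in document $ d $. Each document's vector is then divided by its Euclidean length. Because $ \mathrm{df}(t) $ counts documents rather than occurrences, merging all text into one document changes every weight.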

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["The sun is up.", "The sun is yellow.", "The yellow sun is over the house."]
vectorizer = TfidfVectorizer()

# tokenize 
vectorizer.fit(text)

# show vocabulary and scores 
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

# encode a new document
vector = vectorizer.transform(["The house is yellow."])

# summarize vector
print(vector.toarray())

Above, the TF-IDF vectorizer learned 7 words. The words "the", "is", and "sun" appear in every document, so they have the lowest IDF score of 1, while "house", "over", and "up" each appear in only one document and are the highest-scoring words. When we encode a new document, each word's score is normalized to fall between 0 and 1.
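
These scores can be reproduced by hand with the standard library. The sketch below assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by Euclidean (L2) normalization, which is my understanding of the `TfidfVectorizer` defaults. The helper names are mine, so compare the printed numbers against the cell above rather than taking them on faith.

```python
import math
import re

docs = ["The sun is up.", "The sun is yellow.", "The yellow sun is over the house."]

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

n = len(docs)
vocab = sorted({t for d in docs for t in tokenize(d)})

# Document frequency: in how many documents each word appears.
df = {w: sum(w in tokenize(d) for d in docs) for w in vocab}

# Smoothed IDF (assumed to match the library default).
idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}

# Encode a new document: raw count times IDF, then divide by the Euclidean length.
tokens = tokenize("The house is yellow.")
raw = [tokens.count(w) * idf[w] for w in vocab]
length = math.sqrt(sum(x * x for x in raw))
vector = [x / length for x in raw]

for w, score in zip(vocab, vector):
    print(f"{w:>7}  idf={idf[w]:.4f}  tfidf={score:.4f}")
```

Words that occur in all three documents get an IDF of exactly 1, and the encoded vector has length 1, which is why every entry lands between 0 and 1.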

## Word Hashing 

With the previous two techniques, vocabularies can become enormous, and it is useful to downsize them in some algorithmic way. This is where **hashing**, a one-way conversion of words to integers, can be handy. No vocabulary is required and you can choose any vector length you want (the default is $ 2^{20} $). Given that a lot of AI techniques are one-way conversions anyway, hashing might fit the bill here.
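
A minimal sketch of the idea, using only the standard library: each token is hashed straight to a slot in a fixed-length vector. The `hash_vectorize` helper is hypothetical, and `zlib.crc32` is a stand-in hash picked for this illustration. scikit-learn uses its own hash function, sign handling, and normalization, so its output will differ from this one.

```python
import re
import zlib

def hash_vectorize(text, n_features=20):
    # Map each token straight to a slot via a hash; no vocabulary is stored.
    vector = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector

vector = hash_vectorize("The house is yellow.")
print(vector)

# Words never seen before still get a slot, because nothing had to be learned.
unseen = hash_vectorize("The moon is blue.")
print(unseen)
```

With only 20 slots, two different words can land on the same index, and since the mapping is one-way there is no way to tell from the vector which word produced a given count.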

We use the [`HashingVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) class to tokenize documents, hash the words, and encode the result. Choosing the vector length is the biggest decision in this technique. The default value for `n_features` of $ 2^{20} $ should be sufficient in most cases. The vector length has to be large enough to make hash collisions (two different words landing on the same index) unlikely, but a larger vector takes more memory. We will use a very small vector of `20` just to make our toy example easy to observe.

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

text = ["The sun is up.", "The sun is yellow.", "The yellow sun is over the house."]
vectorizer = HashingVectorizer(n_features=20) # default n_features = 2**20

# fit() learns nothing here (hashing is stateless); kept for consistency with the other vectorizers
vectorizer.fit(text)

# encode a new document
vector = vectorizer.transform(["The house is yellow."])

# summarize vector
print(vector.toarray())

Notice that if I use a larger vector, the result is much harder to read by eye simply because of its length.

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

text = ["The sun is up.", "The sun is yellow.", "The yellow sun is over the house."]
vectorizer = HashingVectorizer(n_features=2**10) # default n_features = 2**20

# fit() learns nothing here (hashing is stateless); kept for consistency with the other vectorizers
vectorizer.fit(text)

# encode a new document
vector = vectorizer.transform(["The house is yellow."])

# summarize vector
print(vector.toarray())

## Binary and Other Parameters 

I highly encourage exploring the documentation available for [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer), [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), and [`HashingVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html). Some useful transformations can be added. For example, below we turn off the automatic conversion to lowercase and record binary outcomes (whether a word occurred) instead of counts, which can be useful for certain probability models.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
text = ["The sun is up. The sun is yellow. The yellow sun is over the house."]
vectorizer = CountVectorizer(binary=True, lowercase=False)

# tokenize 
vectorizer.fit(text)

# show vocabulary
print(vectorizer.vocabulary_)

# encode a new document
vector = vectorizer.transform(["The house is yellow. The sun is over the house."])

# summarize vector
print(vector.toarray())

## Exercise

Create a simple vectorizer that flags in binary fashion whether each vocabulary word from *The Legend of Sleepy Hollow* occurred in the new document or not. Complete this code below by replacing the question marks "?". 

In [None]:
from sklearn.feature_extraction.text import ?
import numpy as np

np.set_printoptions(threshold=np.inf) # don't truncate vector outputs 

# bring in document
filename = 'legend_of_sleepy_hollow.txt'
with open(filename, encoding="utf-8") as file:
    text = file.read()

# construct vectorizer 
vectorizer = ?

# tokenize 
vectorizer.fit(?)

# encode a new document
vector = vectorizer.transform(["The headless horseman was a Hessian trooper."])

# summarize vector
print(vector.toarray())

### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

np.set_printoptions(threshold=np.inf) # don't truncate vector outputs 

# bring in document
filename = 'legend_of_sleepy_hollow.txt'
with open(filename, encoding="utf-8") as file:
    text = file.read()

# construct vectorizer 
vectorizer = CountVectorizer(binary=True)

# tokenize 
vectorizer.fit([text])

# encode a new document
vector = vectorizer.transform(["The headless horseman was a Hessian trooper."])

# summarize vector
print(vector.toarray())