<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-research-and-practice/blob/main/applied-text-analysis-with-python/4_text_vectorization_and_transformation_pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Text Vectorization and Transformation Pipeline

In this notebook, we will demonstrate how to use the vectorization process to combine
linguistic techniques from NLTK with machine learning techniques in Scikit-Learn
and Gensim, creating custom transformers that can be used inside repeatable and
reusable pipelines.

To perform machine learning on text, we need to transform our documents into vector
representations so that we can apply numeric machine learning. This process is
called feature extraction or, more simply, vectorization, and it is an essential first step
toward language-aware analysis.

Representing documents numerically gives us the ability to perform meaningful analytics
and also creates the instances on which machine learning algorithms operate.

For this reason, we must now make a critical shift in how we think about language—
from a sequence of words to points that occupy a high-dimensional semantic space.
Points in space can be close together or far apart, tightly clustered or evenly distributed.

By encoding similarity as distance, we can begin to derive the primary components of
documents and draw decision boundaries in our semantic space.
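As a rough illustration of treating similarity as distance, the sketch below compares documents by the angle between their vectors (cosine similarity). The three count vectors and the three-word vocabulary are made up for this example, and cosine similarity is only one of several measures that could be used here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

# Hypothetical word-count vectors over a made-up vocabulary ["bat", "see", "door"].
doc_a = [2, 2, 0]
doc_b = [1, 1, 0]
doc_c = [0, 0, 3]

sim_ab = cosine_similarity(doc_a, doc_b)  # vectors point in the same direction
sim_ac = cosine_similarity(doc_a, doc_c)  # vectors share no nonzero dimension
```

Here `doc_a` and `doc_b` use the same words in the same proportions, so they sit in the same direction in the space, while `doc_c` shares no words with `doc_a` and is orthogonal to it.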

The simplest encoding of semantic space is the bag-of-words model, whose primary
insight is that meaning and similarity are encoded in vocabulary.
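A minimal sketch of the bag-of-words idea, using only the standard library: each document becomes a count of its words, and every document is then laid out over one shared vocabulary. The `simple_tokens` helper is a hypothetical stand-in (lowercasing, stripping punctuation, splitting on whitespace), not the NLTK-based tokenizer used elsewhere in this notebook, and the corpus repeats the notebook's three example sentences so that the block runs on its own.

```python
import string
from collections import Counter

corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]

def simple_tokens(text):
    # Hypothetical stand-in tokenizer: lowercase, remove punctuation, split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

# One bag (word -> count) per document; word order is discarded.
bags = [Counter(simple_tokens(doc)) for doc in corpus]

# A shared, sorted vocabulary gives every document the same dimensions.
vocabulary = sorted(set().union(*bags))

# Each document becomes a vector of counts, one position per vocabulary word.
vectors = [[bag[word] for word in vocabulary] for bag in bags]
```

Because a `Counter` returns 0 for missing keys, words that do not occur in a document simply contribute a zero in that position of its vector.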

##Setup

In [1]:
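import nltk
import string

# word_tokenize needs the Punkt models; newer NLTK releases call the resource "punkt_tab".
nltk.download("punkt", quiet=True)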

In [None]:
corpus = [
  "The elephant sneezed at the sight of potatoes.",
  "Bats can see via echolocation. See the bat sight sneeze!",
  "Wondering, she opened the door to the studio.",
]

##Words in Space

We will look at four types of vector encoding—frequency,
one-hot, TF–IDF, and distributed representations—and discuss their implementations
in Scikit-Learn, Gensim, and NLTK.

To set this up, let’s create a list of our documents and tokenize them for the
vectorization examples that follow.

In [2]:
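def tokenize(text):
  """Yield lowercased, stemmed tokens from text, skipping punctuation tokens."""
  # A new stemmer is built on every call, which is fine for small examples.
  stem = nltk.stem.SnowballStemmer("english")
  text = text.lower()

  for token in nltk.word_tokenize(text):
    # Skip punctuation tokens such as "." or "!"
    if token in string.punctuation:
      continue
    yield stem.stem(token)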

###Frequency Vectors