In this lab, we'll look at how to turn a text into numbers. A numeric representation like this is called a *vector*, so the process of turning language into numbers is called *vectorizing*. We'll start, as always, by loading our environment.

In [None]:
from text_analytics import text_analytics
from sklearn.feature_extraction.text import CountVectorizer
import os
import pandas as pd

ai = text_analytics()
print("Done!")

We'll work with articles about corruption, like last time.

In [None]:
file = "NYT.Corruption.gz"
file = os.path.join(ai.data_dir, file)
df = pd.read_csv(file, index_col = 0)
print(df)
print("Done!")

Next, we'll run some code to initialize the vectorizer. We're using the CountVectorizer from *scikit-learn*. Initializing means that we're creating an instance of the class with our settings, before we use it on any text.

In [None]:
vectorizer = CountVectorizer(
    input = "content", 
    preprocessor = ai.clean_pre,
    analyzer = "word",
    )

print("Done!")
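To make the difference between initializing and using a vectorizer concrete, here is a minimal sketch with plain scikit-learn. The name `toy_vectorizer` and the example sentence are made up for illustration, and the lab's own `ai.clean_pre` preprocessor is left out because it belongs to the course package. The sketch only shows that a freshly created vectorizer has no vocabulary until it has been fit to some text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy example; not part of the lab's dataset
toy_vectorizer = CountVectorizer(analyzer="word")

# Right after initialization, no vocabulary has been learned yet
fitted_before = hasattr(toy_vectorizer, "vocabulary_")
print(fitted_before)

# Fitting is the step that reads text and builds the vocabulary
toy_vectorizer.fit(["the judge fined the mayor"])
fitted_after = hasattr(toy_vectorizer, "vocabulary_")
print(fitted_after)

# The learned vocabulary maps each word to a column index
print(sorted(toy_vectorizer.vocabulary_))
```

In the lab's cells, `fit_transform` does the fitting and the counting in a single call.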

Now, we'll go ahead and use it.

In [None]:
line = ai.print_sample(df)
vector = vectorizer.fit_transform([line])
print(vector)
print("Done!")

So, the first thing we've displayed is the string. And the second thing we've displayed is the vectorized or numeric version of that string.

This is currently in a *sparse* format, which stores only the non-zero elements, so that not every element has to be displayed. We'll show the full version, though.

In [None]:
line = ai.print_sample(df)
vector = vectorizer.fit_transform([line]).todense()
print(vector)
print("Done!")
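The difference between the sparse and the full (dense) version is easier to see with more than one text, because zeros only appear when a word from one text is missing in another. The following is a small sketch under that assumption, using two made-up sentences and plain scikit-learn rather than the lab's helpers. It uses `toarray()`, which returns a regular NumPy array.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, for illustration only
sentences = ["the judge fined the mayor", "the mayor resigned"]

toy_vectorizer = CountVectorizer(analyzer="word")
sparse_vector = toy_vectorizer.fit_transform(sentences)

# Sparse format: only the non-zero counts are stored and printed
print(sparse_vector)
print("Stored values:", sparse_vector.nnz)

# Dense format: every column is written out, zeros included
dense_vector = sparse_vector.toarray()
print(dense_vector)
print("Total cells:", dense_vector.size)
```

Here the sparse version stores 7 values while the dense version holds 10 cells. With thousands of articles and a large vocabulary, most cells are zero, which is why the sparse format is the default.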

Let's finish up by making this into a nice dataframe.

In [None]:
line = ai.print_sample(df)
vector = vectorizer.fit_transform([line])
vector = ai.print_vector(vector, vectorizer)
print(vector)
print("Done!")

Some of this code is a bit dense, but try out a few sentences. The point is that this transforms each sentence into a list of numbers and tells us what each number in that vector corresponds to. So, in this lab we've taken a first look at how to vectorize our texts.

Try it out yourself: use the code block below to make some vectors from this file:
    
    "Gutenberg.1850.Authors.gz"