In [36]:
import pandas as pd

# Vectorizing: turning text into numerical vectors

**Table of contents**

* [Count vectorizer](#1.-Count-Vectorizer)
* [Tuning the vectorizer](#2.-Tuning-the-Vectorizer)
* [TF-IDF vectorizer](#3.-TF-IDF-Vectorizer)

In this notebook, we’ll look at how to represent text as numerical vectors.
This process, called **vectorization**, converts words into numbers so that machine learning models can work with them. 
We’ll go over two main methods: word counts and TF-IDF (Term Frequency-Inverse Document Frequency).

We’ll start with a small training dataset of sentences:

In [23]:
X_train = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The cat climbed the tree.",
    "The dog barked loudly.",
    "A dog and a cat are friends."
]

## 1. Count Vectorizer

Let’s start by using **`CountVectorizer`** from `sklearn` to convert our text data into a matrix of word counts. 
This method creates a matrix where each row represents a document, and each column represents a unique word (token) across all documents.
The values in the matrix show how many times each word appears in each document.

In other words, `CountVectorizer` takes each word in our text dataset and turns it into a numerical feature, based on how frequently it appears.

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
vectorizer = CountVectorizer()

In [25]:
# Fit CountVectorizer to the training data to "learn" the vocabulary
vectorizer.fit(X_train)

In [26]:
# View the learned vocabulary
vectorizer.get_feature_names_out()

array(['and', 'are', 'barked', 'cat', 'chased', 'climbed', 'dog',
       'friends', 'loudly', 'mat', 'on', 'sat', 'the', 'tree'],
      dtype=object)

In [27]:
# Convert documents to a document-term matrix
X_train_dtm = vectorizer.transform(X_train)
X_train_dtm

<5x14 sparse matrix of type '<class 'numpy.int64'>'
	with 22 stored elements in Compressed Sparse Row format>

The resulting matrix is a **sparse matrix**, which stores only the non-zero values to save memory. 
For easier visualization, we can convert it into a dense matrix that shows all values, though this step is typically unnecessary in practical applications.

In [11]:
# convert sparse matrix to a dense matrix
X_train_dtm.toarray()

array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0],
       [0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 2, 0],
       [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1],
       [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
       [1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]], dtype=int64)
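To get a feel for why the sparse format saves memory, we can compare the number of stored values with the total number of cells. This is a small sketch that rebuilds the same toy corpus under the hypothetical names `docs` and `dtm`, so it runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same toy corpus as above, repeated so this cell is self-contained
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The cat climbed the tree.",
    "The dog barked loudly.",
    "A dog and a cat are friends.",
]

dtm = CountVectorizer().fit_transform(docs)

n_rows, n_cols = dtm.shape
total_cells = n_rows * n_cols   # what a dense matrix would have to hold
stored = dtm.nnz                # non-zero entries actually stored

print(f"{stored} of {total_cells} cells are stored ({stored / total_cells:.0%})")
```

On a real corpus with tens of thousands of vocabulary terms, the stored fraction is usually far smaller than in this toy example.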



We can convert the dense matrix to a DataFrame for better readability. In `count_df`, each row represents a document from the dataset, and each column corresponds to a unique word in the vocabulary. The values show how many times each word appears in each document.

In [28]:
# Create a DataFrame to view the vocabulary and document-term matrix together
count_df = pd.DataFrame(X_train_dtm.toarray(), 
                        columns=vectorizer.get_feature_names_out(),
                        index=X_train)
count_df

document,and,are,barked,cat,chased,climbed,dog,friends,loudly,mat,on,sat,the,tree
The cat sat on the mat.,0,0,0,1,0,0,0,0,0,1,1,1,2,0
The dog chased the cat.,0,0,0,1,1,0,1,0,0,0,0,0,2,0
The cat climbed the tree.,0,0,0,1,0,1,0,0,0,0,0,0,2,1
The dog barked loudly.,0,0,1,0,0,0,1,0,1,0,0,0,1,0
A dog and a cat are friends.,1,1,0,1,0,0,1,1,0,0,0,0,0,0


When we use the vectorizer on new text, it only counts words it "learned" from the training data. Words that weren’t in the training set are ignored.

In [29]:
# New unseen text for testing
X_test = ["The cat sat alone."]

# Transform the new text using the same vectorizer
X_test_dtm = vectorizer.transform(X_test)

# Convert to a DataFrame for readability
test_df = pd.DataFrame(X_test_dtm.toarray(), 
                       columns=vectorizer.get_feature_names_out(),
                       index=X_test)
test_df

document,and,are,barked,cat,chased,climbed,dog,friends,loudly,mat,on,sat,the,tree
The cat sat alone.,0,0,0,1,0,0,0,0,0,0,0,1,1,0
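To see which words were ignored, one option is to run the vectorizer's own analyzer on the new text and compare the resulting tokens against the learned vocabulary. The sketch below uses hypothetical names (`docs`, `vec`, `unseen`) and refits a vectorizer so that it runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same toy corpus as above, repeated so this cell is self-contained
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The cat climbed the tree.",
    "The dog barked loudly.",
    "A dog and a cat are friends.",
]

vec = CountVectorizer().fit(docs)

# build_analyzer() returns the preprocessing + tokenizing function
# that the vectorizer applies to each document
analyzer = vec.build_analyzer()
tokens = analyzer("The cat sat alone.")

# vocabulary_ maps each learned term to its column index
unseen = [t for t in tokens if t not in vec.vocabulary_]
print(unseen)
```

Here "alone" never appeared in the training sentences, so it has no column and contributes nothing to the row for the new text.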


## 2. Tuning the Vectorizer

The vectorizer has several parameters that we can adjust to improve how it processes text.

One key parameter is `stop_words`, which allows us to remove common words like “and,” “the,” and “him.” These words are usually not helpful for understanding the content and can add noise to the data.

- **stop_words**: `'english'`, a custom list, or `None` (default)
  - **'english'**: Uses a built-in list of common English stop words.
  - **Custom list**: You can provide your own list of stop words to remove.
  - **None**: No stop words are removed.

By removing stop words, we can focus on the words that add the most meaning to the text.

In [30]:
# Initialize CountVectorizer with English stop words
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(X_train)

# Display the learned vocabulary
vectorizer.get_feature_names_out()

array(['barked', 'cat', 'chased', 'climbed', 'dog', 'friends', 'loudly',
       'mat', 'sat', 'tree'], dtype=object)

Notice that common words such as "and", "are", "on", and "the" have been dropped from the vocabulary.

We can view the complete list of stop words used by scikit-learn.

In [31]:
# Display the list of English stop words used by scikit-learn
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(ENGLISH_STOP_WORDS)

frozenset({'around', 'if', 'was', 'afterwards', 'across', 'side', 'go', 'from', 'call', 'towards', 'here', 'nobody', 'already', 'along', 'alone', 'why', 'whose', 'than', 'ever', 'also', 'be', 'these', 'never', 'except', 'all', 'off', 'sometimes', 'its', 'both', 'that', 'rather', 'found', 'us', 'eight', 'since', 'for', 'nevertheless', 'myself', 'another', 'show', 'he', 'they', 'front', 'above', 'twelve', 'bottom', 'well', 'whence', 'cry', 'toward', 'the', 'keep', 'such', 'become', 'noone', 'an', 'three', 'with', 'throughout', 'were', 'has', 'most', 'none', 'am', 'always', 'formerly', 'find', 'upon', 'mine', 'seems', 'but', 'serious', 'of', 'wherein', 'two', 'very', 'one', 'must', 'out', 'moreover', 'thin', 'twenty', 'six', 'every', 'by', 'as', 'beside', 'elsewhere', 'others', 'becoming', 'under', 'five', 'please', 'without', 'last', 'now', 'each', 'see', 'so', 'first', 'through', 'take', 'whereas', 'even', 'fire', 'hereafter', 'whatever', 'will', 'may', 'themselves', 'con', 'against', '

You might not agree with all the words on this stop word list, so use it thoughtfully. Later, in another notebook, I'll share a version that I think works a bit better.



An **n-gram** is a contiguous sequence of *n* words from a text sample. By setting the `ngram_range` parameter, we can choose the length of these sequences.

- **ngram_range**: `(min_n, max_n)`, default=`(1, 1)`
  - Defines the minimum and maximum *n* values for the n-grams to extract.
  - All n-gram lengths between `min_n` and `max_n` (inclusive) will be considered.

In [32]:
# Include 1-grams, 2-grams, and 3-grams in the vocabulary
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(X_train)

# Display the vocabulary with n-grams
vectorizer.get_feature_names_out()

array(['and', 'and cat', 'and cat are', 'are', 'are friends', 'barked',
       'barked loudly', 'cat', 'cat are', 'cat are friends',
       'cat climbed', 'cat climbed the', 'cat sat', 'cat sat on',
       'chased', 'chased the', 'chased the cat', 'climbed', 'climbed the',
       'climbed the tree', 'dog', 'dog and', 'dog and cat', 'dog barked',
       'dog barked loudly', 'dog chased', 'dog chased the', 'friends',
       'loudly', 'mat', 'on', 'on the', 'on the mat', 'sat', 'sat on',
       'sat on the', 'the', 'the cat', 'the cat climbed', 'the cat sat',
       'the dog', 'the dog barked', 'the dog chased', 'the mat',
       'the tree', 'tree'], dtype=object)



- **max_df**: `float` in the range [0.0, 1.0] or `int`, default=`1.0`
  - Controls which terms to ignore based on their frequency across documents, helping to remove overly common (corpus-specific) stop words.
  - If a `float`, it represents the maximum proportion of documents in which a term can appear. Terms in more documents than this proportion are ignored.
  - If an `int`, it specifies the maximum absolute count of documents in which a term can appear.

In [33]:
# Ignore terms that appear in more than 50% of the documents
vectorizer = CountVectorizer(max_df=0.5)
vectorizer.fit(X_train)

# Display the vocabulary with high-frequency terms removed
vectorizer.get_feature_names_out()

array(['and', 'are', 'barked', 'chased', 'climbed', 'friends', 'loudly',
       'mat', 'on', 'sat', 'tree'], dtype=object)



- **min_df**: `float` in the range [0.0, 1.0] or `int`, default=`1`
  - Sets a threshold to ignore terms that appear in fewer documents than specified. This threshold is sometimes called a "cut-off."
  - If a `float`, it represents the minimum proportion of documents in which a term must appear.
  - If an `int`, it represents the minimum absolute number of documents in which a term must appear.

In [35]:
# Keep only terms that appear in at least 2 documents
vectorizer = CountVectorizer(min_df=2)
vectorizer.fit(X_train)

# Display the vocabulary with low-frequency terms removed
vectorizer.get_feature_names_out()

array(['cat', 'dog', 'the'], dtype=object)

## 3. TF-IDF Vectorizer

Next, we’ll use `TfidfVectorizer` from `sklearn` to transform our text data into a matrix of [**TF-IDF values**](https://scikit-learn.org/1.5/modules/feature_extraction.html#tfidf-term-weighting).

**TF-IDF** stands for **Term Frequency-Inverse Document Frequency**. It’s a technique that not only counts how often each word appears in a document (term frequency) but also considers how unique or important that word is across all documents (inverse document frequency). This way, common words get lower scores, while words that are unique to specific documents get higher scores, helping us focus on what makes each document stand out.

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create an instance of TfidfVectorizer
tfidf_vect = TfidfVectorizer()

# Fit and transform the documents to a TF-IDF matrix
X_train_dtm = tfidf_vect.fit_transform(X_train)

# Convert the document-term matrix to a DataFrame for better readability
tfidf_df = pd.DataFrame(data=X_train_dtm.toarray(), 
                        index=X_train, 
                        columns=tfidf_vect.get_feature_names_out())
tfidf_df

document,and,are,barked,cat,chased,climbed,dog,friends,loudly,mat,on,sat,the,tree
The cat sat on the mat.,0.0,0.0,0.0,0.26305,0.0,0.0,0.0,0.0,0.0,0.466913,0.466913,0.466913,0.526101,0.0
The dog chased the cat.,0.0,0.0,0.0,0.323361,0.573963,0.0,0.38439,0.0,0.0,0.0,0.0,0.0,0.646722,0.0
The cat climbed the tree.,0.0,0.0,0.0,0.297466,0.0,0.528001,0.0,0.0,0.0,0.0,0.0,0.0,0.594933,0.528001
The dog barked loudly.,0.0,0.0,0.601285,0.0,0.0,0.0,0.402688,0.0,0.601285,0.0,0.0,0.0,0.338754,0.0
A dog and a cat are friends.,0.515306,0.515306,0.0,0.290314,0.0,0.0,0.345106,0.515306,0.0,0.0,0.0,0.0,0.0,0.0
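To see where these numbers come from, we can rebuild them by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row), as described in the scikit-learn documentation linked above. The names `docs`, `manual`, and `reference` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Same toy corpus as above, repeated so this cell is self-contained
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The cat climbed the tree.",
    "The dog barked loudly.",
    "A dog and a cat are friends.",
]

# Term frequency: raw word counts per document
counts = CountVectorizer().fit_transform(docs).toarray()

# Document frequency: number of documents containing each term
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed default: smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit length (assumed default: norm='l2')
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against scikit-learn's own result
reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

In the first sentence, "cat" and "the" both appear in four of the five documents and so share the same low IDF. "the" still gets a larger score than "cat" because it occurs twice, while the rarer words "mat", "on", and "sat" each outweigh "cat" despite appearing only once.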
