In [1]:
import pandas as pd

## Vectorizing: representing text as numerical data

In [2]:
# example text for model training
X_train = ['call you tonight', 
           'Call me a cab', 
           'please call me... PLEASE!', 
           'he called the police']

We will use [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to convert text into a matrix of token counts.

In [3]:
# import and initialize CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [4]:
# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(X_train)

CountVectorizer()

In [5]:
# examine the fitted vocabulary
vect.get_feature_names_out()

array(['cab', 'call', 'called', 'he', 'me', 'please', 'police', 'the',
       'tonight', 'you'], dtype=object)

In [6]:
# transform training data into a 'document-term matrix'
X_train_dtm = vect.transform(X_train)
X_train_dtm

<4x10 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>

In [7]:
# convert sparse matrix to a dense matrix
X_train_dtm.toarray()

array([[0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
       [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 2, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 1, 1, 0, 0]], dtype=int64)

In [9]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(X_train_dtm.toarray(), 
             columns=vect.get_feature_names_out(),
             index=X_train)

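| | cab | call | called | he | me | please | police | the | tonight | you |
|---|---|---|---|---|---|---|---|---|---|---|
| call you tonight | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Call me a cab | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| please call me... PLEASE! | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 |
| he called the police | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |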


In this scheme, features and samples are defined as follows:

- Each individual token occurrence frequency (normalized or not) is treated as a feature. The vector of all the token frequencies for a given document is considered a multivariate sample. A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

- We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
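To make the "ignoring position" point concrete, here is a minimal pure-Python sketch of the counting step. The `bag_of_words` helper and the regular expression are assumptions for illustration: they approximate CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters) and are not its actual implementation.

```python
import re
from collections import Counter

# Assumed approximation of CountVectorizer's defaults: lowercase the text,
# then keep runs of two or more word characters as tokens.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def bag_of_words(text):
    """Return a token -> count mapping for one document (hypothetical helper)."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


original = bag_of_words("please call me... PLEASE!")
shuffled = bag_of_words("me please, PLEASE call")

print(original)              # counts for 'please', 'call' and 'me'
print(original == shuffled)  # True: word order does not affect the counts
```

Both strings contain the same tokens with the same frequencies, so they map to the same bag even though the word order differs.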

In [10]:
# example text for model testing
X_test = ["please don't call me"]

In [11]:
# transform testing data into a document-term matrix (using existing vocabulary)
X_test_dtm = vect.transform(X_test)
X_test_dtm.toarray()

array([[0, 1, 0, 0, 1, 1, 0, 0, 0, 0]], dtype=int64)

In [12]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(X_test_dtm.toarray(), columns=vect.get_feature_names_out())

| | cab | call | called | he | me | please | police | the | tonight | you |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
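The test row sums to 3 even though the sentence has more words than that. A short sketch of why, assuming the same `X_train` and `X_test` as above (redeclared here so the block runs on its own): tokens that were not seen during `fit` have no column, so `transform` drops them.

```python
from sklearn.feature_extraction.text import CountVectorizer

X_train = ['call you tonight',
           'Call me a cab',
           'please call me... PLEASE!',
           'he called the police']
X_test = ["please don't call me"]

vect = CountVectorizer()
vect.fit(X_train)

# build_analyzer() returns the callable that preprocesses and tokenizes a document
analyze = vect.build_analyzer()
test_tokens = analyze(X_test[0])

# tokens missing from the fitted vocabulary are ignored by transform()
unseen = [tok for tok in test_tokens if tok not in vect.vocabulary_]

print(test_tokens)  # ['please', 'don', 'call', 'me']
print(unseen)       # ['don']
```

With the default settings, "don't" is split at the apostrophe and the single-character "t" is discarded, leaving "don", which is not in the training vocabulary.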


## Tuning the vectorizer

CountVectorizer has several parameters that you may want to tune:

**Stop words** are words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content of a text, and which may be removed to avoid them being construed as signal for prediction.

- **stop_words**: 'english', list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used.

In [13]:
vect = CountVectorizer(stop_words='english')
vect.fit(X_train)
vect.get_feature_names_out()

array(['cab', 'called', 'police', 'tonight'], dtype=object)

In [14]:
# list of scikit-learn's built-in English stop words
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(ENGLISH_STOP_WORDS)

frozenset({'to', 'whether', 'yours', 'seemed', 'between', 'whereas', 'hence', 'became', 'below', 'otherwise', 'rather', 'once', 'our', 'often', 'us', 'any', 'more', 'they', 'if', 'somewhere', 'couldnt', 'somehow', 'con', 'please', 'together', 'because', 'my', 'over', 'here', 'something', 'before', 'it', 'that', 'meanwhile', 'first', 'than', 'about', 'had', 'without', 'bottom', 'he', 'fill', 'since', 'see', 'un', 'forty', 'may', 'ours', 'into', 'fire', 'across', 'their', 'amongst', 'what', 'full', 'anyway', 'already', 'empty', 'then', 'why', 'an', 'interest', 'other', 'formerly', 'whereby', 'be', 'being', 'beyond', 'onto', 'whence', 'out', 'eight', 'hers', 'while', 'where', 'elsewhere', 'of', 'such', 'well', 'and', 'thin', 'many', 'around', 'mostly', 'thereafter', 'during', 'them', 'under', 'never', 'though', 'few', 'twelve', 'side', 'until', 'those', 'but', 'de', 'some', 'give', 'or', 'hasnt', 'herself', 'seeming', 'do', 'further', 'anyhow', 'hereafter', 'enough', 'inc', 'wherever', 'h

An **n-gram** is a contiguous sequence of n words from a given sample of text.

- **ngram_range**: (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
    - All values of n such that min_n <= n <= max_n will be used.

In [14]:
# include 1-grams, 2-grams and 3-grams
vect = CountVectorizer(ngram_range=(1, 3))
vect.fit(X_train)
# get_feature_names() was removed in newer scikit-learn releases
vect.get_feature_names_out().tolist()

['cab',
 'call',
 'call me',
 'call me cab',
 'call me please',
 'call you',
 'call you tonight',
 'called',
 'called the',
 'called the police',
 'he',
 'he called',
 'he called the',
 'me',
 'me cab',
 'me please',
 'please',
 'please call',
 'please call me',
 'police',
 'the',
 'the police',
 'tonight',
 'you',
 'you tonight']
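How the n-grams for a single document are formed can be sketched in plain Python. The `word_ngrams` helper below is hypothetical and only illustrates the `min_n <= n <= max_n` rule; it is not scikit-learn's implementation.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous n-grams with min_n <= n <= max_n (hypothetical helper)."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens across the document
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = ["he", "called", "the", "police"]
grams = word_ngrams(tokens, 1, 3)
print(grams)
# ['he', 'called', 'the', 'police',
#  'he called', 'called the', 'the police',
#  'he called the', 'called the police']
```

These nine entries are the features that the fourth training document contributes to the vocabulary listed above.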

- **max_df**: decimal number in range [0.0, 1.0] or integer; default=1.0
    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
    - If decimal number, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

In [15]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)
vect.fit(X_train)
# get_feature_names() was removed in newer scikit-learn releases
vect.get_feature_names_out().tolist()

['cab', 'called', 'he', 'me', 'please', 'police', 'the', 'tonight', 'you']
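To see why only "call" is dropped, the document frequencies can be computed by hand. This is a rough sketch: the regular expression is an assumed approximation of the default tokenizer, and `X_train` is redeclared so the block runs on its own.

```python
import re
from collections import Counter

X_train = ['call you tonight',
           'Call me a cab',
           'please call me... PLEASE!',
           'he called the police']

# assumed approximation of the default tokenization
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

doc_freq = Counter()
for doc in X_train:
    # a set counts each term at most once per document
    doc_freq.update(set(TOKEN_PATTERN.findall(doc.lower())))

n_docs = len(X_train)
max_df = 0.5

# keep terms whose document proportion is NOT strictly higher than max_df
kept = sorted(term for term, df in doc_freq.items() if df / n_docs <= max_df)

print(doc_freq['call'] / n_docs)  # 0.75, strictly higher than 0.5, so removed
print(doc_freq['me'] / n_docs)    # 0.5, not strictly higher, so kept
print(kept)
```

Note that "please" occurs twice in one document but has a document frequency of 1, because document frequency counts documents, not occurrences.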

- **min_df**: decimal number in range [0.0, 1.0] or integer; default=1
    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
    - If decimal number, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

In [16]:
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)
vect.fit(X_train)
# get_feature_names() was removed in newer scikit-learn releases
vect.get_feature_names_out().tolist()

['call', 'me']