# Understanding vectorizers

In the following code examples, we will experiment with vectorizers to understand a bit better how they work. Feel free to adjust the code, and try things out yourself.

For now, we will practice with `sklearn`'s vectorizers. However, packages such as `gensim` offer their own built-in functionality to vectorize the data. 
You can also play around [here](https://apps-computational-teaching-jj92xohbgnwnhksxlclr8t.streamlit.app/).


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
import sklearn
print(sklearn.__version__)

1.6.1


## Example 1: Inspect the output of a vectorizer in a dense format

The following code cell will fit and transform four documents using a `Count`-based vectorizer. Next, the output is transformed to a *dense* matrix, and printed.

### 1. Do you understand the output?

**Answer:**  
The output is a matrix of numbers. Each row represents one document from the list `texts`, and each column corresponds to a unique token in the vocabulary (built after the vectorizer lowercases and tokenizes the documents). What the numbers mean depends on the vectorizer:

- With **`CountVectorizer()`**, the numbers will represent the raw **frequency** of each word in the documents.
- With **`TfidfVectorizer()`**, the numbers will represent the **TF-IDF score** (Term Frequency-Inverse Document Frequency). The TF-IDF score reflects how important a word is within a particular document, considering how often it appears and how unique it is across all documents. Words that appear frequently in one document but rarely across the others will have higher values.
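
To see what "processing the documents" means in practice, one can inspect the analyzer that the vectorizer applies to each document. This is a minimal sketch that assumes the default settings (lowercasing, and a token pattern that keeps only tokens of two or more alphanumeric characters); the second example sentence is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function the vectorizer uses to turn a raw
# string into a list of tokens (preprocessing + tokenization).
analyzer = CountVectorizer().build_analyzer()

tokens_a = analyzer("Hello students!")
tokens_b = analyzer("I am a student")  # hypothetical extra sentence

print(tokens_a)  # lowercased, punctuation removed
print(tokens_b)  # single-character tokens such as "I" and "a" are dropped
```

With these defaults, very short tokens never become columns in the matrix, which explains why the vocabulary can be smaller than the number of distinct words you see in the raw text.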


### 2. Is it smart to transform output to a dense format? What will happen if you work with millions of documents, rather than 4 short documents?

**Answer:**  
While this is fine for small datasets (like our four short documents), transforming the output to a dense matrix can be very memory-intensive for large datasets. A dense matrix stores every value, including all the zeros. Because most documents contain only a tiny fraction of the vocabulary, nearly all entries are zero; for millions of documents, the dense matrix may no longer fit in memory.

In practice, we often use sparse matrices for large datasets, which only store non-zero values, making them much more memory-efficient. Scikit-learn, by default, uses sparse matrices when working with `CountVectorizer()` and `TfidfVectorizer()`.
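
The difference in memory use can be estimated without ever building the dense matrix. The sketch below uses a synthetic corpus (an assumption for illustration, not real data) and adds up the sizes of the three arrays that make up a CSR sparse matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Synthetic corpus (an assumption for illustration): 1,000 short documents
# that each contain two fairly rare tokens and one shared token.
docs = [f"word{i} word{i + 1} common" for i in range(1000)]

X = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix in CSR format

# Memory used by the three arrays that make up a CSR matrix
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

# Memory a dense array of the same shape and dtype would need
# (computed from the shape, without actually allocating it)
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize

print("Shape:", X.shape)
print("Stored (non-zero) values:", X.nnz)
print("Sparse size in bytes:", sparse_bytes)
print("Dense size in bytes:", dense_bytes)
```

The gap widens as the corpus grows, because the dense size scales with documents times vocabulary, while the sparse size scales roughly with the number of tokens that actually occur.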


### 3. What happens if you replace `CountVectorizer()` with `TfidfVectorizer()`?

**Answer:**  
Both `CountVectorizer()` and `TfidfVectorizer()` convert text into numeric vectors, but the main difference is in how they treat word importance:

- `CountVectorizer()` simply counts how many times each word appears in each document.
- `TfidfVectorizer()` adjusts the counts by taking into account both the frequency of the word in a document (**TF**) and how common the word is across all documents (**IDF**). Words that are common across all documents (like "the", "and", etc.) get a lower score, while words that are more unique to specific documents get higher scores.

So, if you replace `CountVectorizer()` with `TfidfVectorizer()`, the numbers in the resulting matrix will reflect word importance, rather than just word frequency. This is useful when you want to emphasize the significance of unique terms in a dataset, which can be helpful for many machine learning tasks like classification or clustering.
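
As a rough sketch of what happens under the hood, the cell below recomputes the TF-IDF matrix by hand and compares it with the output of `TfidfVectorizer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`), under which the IDF of a token is `ln((1 + n) / (1 + df)) + 1` and every document vector is scaled to unit length. With other settings the formula differs.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["hello students!", "how are you today?", "what?", "hello hello everybody"]

counts = CountVectorizer().fit_transform(texts).toarray()  # term frequencies (TF)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # in how many documents each token occurs
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF (default setting)

manual = counts * idf  # weight each count by the token's IDF
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize rows

from_sklearn = TfidfVectorizer().fit_transform(texts).toarray()
print("Manual result matches TfidfVectorizer:", np.allclose(manual, from_sklearn))
```

Writing the computation out this way makes the two ingredients visible: the count matrix supplies the TF part, and the per-token document frequency supplies the IDF part.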


In [3]:
texts = ["hello students!", "how are you today?", "what?", "hello hello everybody"]
vect = CountVectorizer()  # initialize the vectorizer
# vect = TfidfVectorizer()  # alternative: uncomment to use TF-IDF scores instead of counts
X = vect.fit_transform(texts)  # fit the vectorizer and transform the documents in one go

In [4]:
print(pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out()).to_string())
df = pd.DataFrame(X.toarray().transpose(), index = vect.get_feature_names_out())

   are  everybody  hello  how  students  today  what  you
0    0          0      1    0         1      0     0    0
1    1          0      0    1         0      1     0    1
2    0          0      0    0         0      0     1    0
3    0          1      2    0         0      0     0    0


## Example 2: Inspect the output of a vectorizer in a sparse format

Internally, `sklearn` represents the data in a *sparse* format, as this is computationally more efficient, and less memory is required.


In [5]:
texts = ["hello students!", "how are you today?", "what?", "hello hello everybody"]
count_vec = CountVectorizer()  # initialize the vectorizer
count_vec_fit = count_vec.fit_transform(texts)  # fit the vectorizer and transform the documents in one go

1. Inspect the shape of the transformed texts. We have a 4x8 sparse matrix: 4 rows (= documents) and 8 columns (= unique tokens, such as words or numbers).


In [6]:
count_vec_fit

<4x8 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

2. Get the feature names. This will return the tokens that are in the vocabulary of the vectorizer

In [7]:
count_vec.get_feature_names_out()

array(['are', 'everybody', 'hello', 'how', 'students', 'today', 'what',
       'you'], dtype=object)

3. Inspect the mapping from each token to its id. The numbers do *not* represent word counts; they give the position (column index) of each token in the matrix.

In [8]:
count_vec.vocabulary_ 

{'hello': 2,
 'students': 4,
 'how': 3,
 'are': 0,
 'you': 7,
 'today': 5,
 'what': 6,
 'everybody': 1}

4. Get sparse representation on document level

In [9]:
for i, document in zip(count_vec_fit, texts):
    print(f"Current document: {document}")
    print(i)
    print()

Current document: hello students!
  (0, 2)	1
  (0, 4)	1

Current document: how are you today?
  (0, 3)	1
  (0, 0)	1
  (0, 7)	1
  (0, 5)	1

Current document: what?
  (0, 6)	1

Current document: hello hello everybody
  (0, 2)	2
  (0, 1)	1



### a. Do you understand the output printed above?

**Answer:**  
For each document, the loop prints the original text followed by its sparse vector (the transformed representation). The sparse format lists only the non-zero entries: the column index of each word that occurs in the document, together with its count.

For the documents `["hello students!", "how are you today?", "what?", "hello hello everybody"]`, the output reads as follows:

- **Current document: `hello students!`**  
  `(0, 2) 1 (0, 4) 1`  
  *(This is a sparse vector representation of the document. It means that the words at index 2 (`hello`) and index 4 (`students`) each appear once.)*

- **Current document: `how are you today?`**  
  `(0, 3) 1 (0, 0) 1 (0, 7) 1 (0, 5) 1`  
  *(For this document, words at indices 0, 3, 5, and 7 appear once.)*

- **Current document: `what?`**  
  `(0, 6) 1`  
  *(This document has one word, located at index 6.)*

- **Current document: `hello hello everybody`**  
  `(0, 2) 2 (0, 1) 1`  
  *(In this document, the word at index 2 appears twice, while the word at index 1 appears once.)*

Each vector is a sparse row holding the word frequencies (here, raw counts) of one document. In the format `(0, x) y`, the first number is the row index, which is always 0 because each printed vector is a single-row matrix; `x` is the column index of the word, and `y` is how often that word occurs in the document. This is a compact way of storing the data, as it only records the positions and values of non-zero elements, which saves space for large datasets.

---

### b. What happens if you change the `count` to a `tfidf` vectorizer?

**Answer:**  
If you replace the `CountVectorizer` with a `TfidfVectorizer`, the transformation of the documents changes as follows:

- **CountVectorizer**: It counts how often each word occurs in the document. The output vector holds these raw counts.
  
  Example:  
  Document: "hello students!"  
  Vector (CountVectorizer): [0, 0, 1, 0, 1, 0, 0, 0]  *(counts for `hello` at index 2 and `students` at index 4)*

- **TfidfVectorizer**: It computes the Term Frequency-Inverse Document Frequency (TF-IDF) for each word. The output vector is adjusted to emphasize the importance of words that are more unique to a specific document (higher TF-IDF value). Words that occur frequently across all documents are given a lower score.

  Example:  
  Document: "hello students!"  
  Vector (TfidfVectorizer): approximately [0, 0, 0.62, 0, 0.79, 0, 0, 0] with default settings  *(`students` scores higher than `hello` because it occurs in only one document, whereas `hello` occurs in two)*

The difference:  
- **CountVectorizer**: Simply counts the word occurrences.  
- **TfidfVectorizer**: Adjusts the word counts by considering the frequency of the word across all documents, making it more sensitive to the importance of words in individual documents.

So, when you replace the `CountVectorizer` with the `TfidfVectorizer`, the numbers in the printed vectors (the rows `i` in the loop above) are TF-IDF scores instead of raw counts. Within a document, words that are common across documents receive lower values, while words unique to that document receive higher values.
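
A small sketch to check this on our four documents, assuming default settings for both vectorizers: the positions of the non-zero entries stay the same, only the values change, and each TF-IDF row is scaled to length 1.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["hello students!", "how are you today?", "what?", "hello hello everybody"]

count_matrix = CountVectorizer().fit_transform(texts).toarray()
tfidf_matrix = TfidfVectorizer().fit_transform(texts).toarray()

# Both vectorizers build the same vocabulary, so the non-zero positions match
same_pattern = np.array_equal(count_matrix > 0, tfidf_matrix > 0)
print("Same non-zero positions:", same_pattern)

# With the default norm="l2", every TF-IDF row has length 1
row_norms = np.linalg.norm(tfidf_matrix, axis=1)
print("Row norms:", row_norms.round(3))

# Compare the two representations of the first document
print("Counts:", count_matrix[0])
print("TF-IDF:", tfidf_matrix[0].round(2))
```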

---

### Summary:

- **CountVectorizer**: Output vector contains raw word counts.  
- **TfidfVectorizer**: Output vector contains TF-IDF scores, which balance word frequency and document uniqueness.


5. Get some final descriptives about the sparse matrix

In [10]:
n_nonzero = count_vec_fit.nnz  # number of stored (non-zero) entries
n_total = count_vec_fit.shape[0] * count_vec_fit.shape[1]
print("Number of non-zero elements:", n_nonzero)
print("Total number of elements:", n_total)

# compute the sparsity of the matrix: the proportion of zero elements in the matrix
# (use the number of non-zero entries, not the sum of the counts)
print("Sparsity:", 1 - n_nonzero / n_total)

Number of non-zero elements: 9
Total number of elements: 32
Sparsity: 0.71875
