# Bag-of-Words Representation with CountVectorizer and TfidfTransformer
In this notebook, we explore the Bag-of-Words (BoW) representation for text data and two popular variants of it:

1. Frequency-based representation (using `CountVectorizer`)
2. Term Frequency-Inverse Document Frequency (TF-IDF) representation (using `TfidfTransformer`)


## Introduction

Bag-of-Words (BoW) is a simple method for representing text data in a numerical format suitable for machine learning algorithms. The basic idea is to represent a text as a "bag" of its words: grammar and word order are discarded, but the number of times each word occurs is kept.

Let's delve into the details with some examples.
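Before turning to scikit-learn, the idea can be sketched by hand in plain Python. The helper name `bag_of_words` is hypothetical, and the regular expression is an assumption chosen to approximate `CountVectorizer`'s default tokenization (lowercased tokens of two or more word characters); it is an illustration, not the library's implementation.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Hypothetical helper: lowercase the text, extract word tokens, count them."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


bow_a = bag_of_words("The sky is blue.")
bow_b = bag_of_words("Blue is the sky.")

print(bow_a)
# Both sentences contain the same words the same number of times,
# so their bags are identical even though the word order differs.
print(bow_a == bow_b)  # True
```

The two sentences map to the same bag, which is exactly the sense in which BoW discards word order while preserving counts.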


In [None]:
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import pandas as pd

## 1. Frequency-based Representation (Using CountVectorizer)

Let's start with the basic frequency representation using `CountVectorizer`. It converts a collection of text documents to a matrix of token counts.

We'll use the following example sentences:

1. "The sky is blue."
2. "Sky is clear today."
3. "Look at the clear blue sky."

In [None]:
# Sample sentences
sentences = ["The sky is blue.", "Sky is clear today.", "Look at the clear blue sky."]

# Initialize CountVectorizer and fit to our sentences
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# Convert to dataframe for better visualization
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

   at  blue  clear  is  look  sky  the  today
0   0     1      0   1     0    1    1      0
1   0     0      1   1     0    1    0      1
2   1     1      1   0     1    1    1      0


Each row in the table above corresponds to one of our example sentences, and each column corresponds to a unique word found across all sentences. Each value is the number of times that word occurs in that sentence.
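The link between words and columns can also be inspected directly. The sketch below refits the same vectorizer so that it runs on its own, then uses the fitted `vocabulary_` attribute (a mapping from each word to its column index) to pull out the column for "sky"; the variable name `sky_col` is only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The sky is blue.", "Sky is clear today.", "Look at the clear blue sky."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# One row per sentence, one column per unique word
print(X.shape)  # (3, 8)

# vocabulary_ maps each word to the index of its column
print(vectorizer.vocabulary_)

# Counts of "sky" in each of the three sentences
sky_col = vectorizer.vocabulary_["sky"]
print(X[:, sky_col].toarray().ravel())  # [1 1 1]
```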

As you can observe, words like "the", "is", and "at" carry little meaning in many contexts and are often called "stop words". Let's see how to exclude a custom list of such words using `CountVectorizer`.

In [None]:
# Initialize CountVectorizer with a custom stop word list and fit to our sentences
vectorizer = CountVectorizer(stop_words=['the', "is"])
X = vectorizer.fit_transform(sentences)

# Convert to dataframe for visualization
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

   at  blue  clear  look  sky  today
0   0     1      0     0    1      0
1   0     0      1     0    1      1
2   1     1      1     1    1      0


As you can see above, the table no longer contains the user-selected stop words 'the' and 'is'. The word 'at' is still present because it was not in our list.

Alternatively, to use the built-in stop word list for English, we simply set `stop_words='english'`.

In [None]:
# Initialize CountVectorizer with the built-in English stop word list and fit to our sentences
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(sentences)

# Convert to dataframe for visualization
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

   blue  clear  look  sky  today
0     1      0     0    1      0
1     0      1     0    1      1
2     1      1     1    1      0
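To see why "the", "is", and "at" disappear while "sky" and "clear" survive, the built-in list can be queried directly. The sketch below assumes the `ENGLISH_STOP_WORDS` set exported by `sklearn.feature_extraction.text` and the vectorizer's `get_stop_words()` method; the choice of words to check is illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Check a few words from our sentences against the built-in English list
for word in ["the", "is", "at", "sky", "clear"]:
    print(word, word in ENGLISH_STOP_WORDS)

# The same list is what the vectorizer uses when stop_words='english'
vectorizer = CountVectorizer(stop_words="english")
stop_words = vectorizer.get_stop_words()
print(len(stop_words))
```

A built-in list is convenient, but it is generic; for a specific domain it may be worth reviewing which words it removes before relying on it.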


## Unigram vs Bigram with CountVectorizer

When dealing with text data, the terms "unigram", "bigram", "trigram", and so on refer to a sequence of consecutive words or tokens taken as a single unit. Specifically:
- **Unigram**: Single words. E.g., "sky", "blue"
- **Bigram**: Two contiguous words. E.g., "sky is", "is blue"
- **Trigram**: Three contiguous words. E.g., "The sky is"

Let's see how `CountVectorizer` can be used to extract unigrams and bigrams from our example sentences.


In [None]:
# Unigram representation
vectorizer_unigram = CountVectorizer(ngram_range=(1, 1), stop_words='english')
X_unigram = vectorizer_unigram.fit_transform(sentences)
df_unigram = pd.DataFrame(X_unigram.toarray(), columns=vectorizer_unigram.get_feature_names_out())

# Bigram representation
vectorizer_bigram = CountVectorizer(ngram_range=(2, 2), stop_words='english')
X_bigram = vectorizer_bigram.fit_transform(sentences)
df_bigram = pd.DataFrame(X_bigram.toarray(), columns=vectorizer_bigram.get_feature_names_out())

print('-----------------------')
print('Unigram Representation:')
print(df_unigram)

print('-----------------------')
print('Bigram Representation:')
print(df_bigram)

-----------------------
Unigram Representation:
   blue  clear  look  sky  today
0     1      0     0    1      0
1     0      1     0    1      1
2     1      1     1    1      0
-----------------------
Bigram Representation:
   blue sky  clear blue  clear today  look clear  sky blue  sky clear
0         0           0            0           0         1          0
1         0           0            1           0         0          1
2         1           1            0           1         0          0


From the output tables, you can observe:
1. The **Unigram** table consists of individual words from the sentences (after excluding stop words).
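2. The **Bigram** table consists of pairs of adjacent words. Because stop words are removed before the n-grams are built, a pair such as "sky blue" (from "The sky is blue.") is adjacent only after "is" has been dropped.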

While unigrams capture individual word occurrences, bigrams can capture some context, like "clear blue" or "blue sky". Depending on the task at hand, you might prefer using unigrams, bigrams, or a combination of both.

## 2. TF-IDF Representation (Using TfidfTransformer)

Term Frequency-Inverse Document Frequency (TF-IDF) is another way to represent text data. It reflects the importance of a term to a document in a corpus. A term has a high TF-IDF score if it occurs frequently in a document, but not in many documents across the corpus.

For a more detailed mathematical description of `TfidfTransformer`, see the [scikit-learn documentation on TF-IDF term weighting](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting).

We can calculate the TF-IDF representation from our frequency-based representation using `TfidfTransformer`.


In [None]:
# Initialize TfidfTransformer and fit to our frequency matrix
# (X is the count matrix built above with stop_words='english')
transformer = TfidfTransformer()
tfidf_matrix = transformer.fit_transform(X)

# Convert to dataframe for visualization
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
df_tfidf

       blue     clear      look       sky     today
0  0.789807  0.000000  0.000000  0.613356  0.000000
1  0.000000  0.547832  0.000000  0.425441  0.720333
2  0.480458  0.480458  0.631745  0.373119  0.000000


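The table above shows the TF-IDF scores. Words with higher scores are considered more important to their respective sentences in the context of the entire corpus. For example, "today" has the highest score in the second sentence because it appears in no other sentence, while "sky", which appears in all three sentences, has the lowest non-zero score in every row.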
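The scores in the table can be reproduced by hand. The sketch below assumes `TfidfTransformer`'s default settings (smoothed IDF, followed by L2 normalization of each row), under which the IDF of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

sentences = ["The sky is blue.", "Sky is clear today.", "Look at the clear blue sky."]

X_counts = CountVectorizer(stop_words="english").fit_transform(sentences)
counts = X_counts.toarray()

# Smoothed inverse document frequency for each term
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts by IDF, then scale each row to unit L2 length
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library result
sklearn_tfidf = TfidfTransformer().fit_transform(X_counts).toarray()
print(np.round(manual_tfidf, 6))
print(np.allclose(manual_tfidf, sklearn_tfidf))  # True
```

Because "sky" occurs in every sentence, its IDF is exactly 1, the minimum under this formula, which is why it is never the top-scoring word in any row.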

In summary, while the count-based BoW representation simply records how often each word occurs, TF-IDF weights those counts, down-weighting words that appear in many documents. This can give a better indication of which words matter most in each document.
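As a closing note, scikit-learn also provides `TfidfVectorizer`, which performs the counting and the TF-IDF weighting in one step. The sketch below checks, under default settings and the same stop word list, that it gives the same matrix as the two-step approach used in this notebook; the names `two_step` and `one_step` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

sentences = ["The sky is blue.", "Sky is clear today.", "Look at the clear blue sky."]

# Two steps: count first, then apply TF-IDF weighting
counts = CountVectorizer(stop_words="english").fit_transform(sentences)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer combines both operations
one_step = TfidfVectorizer(stop_words="english").fit_transform(sentences)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True
```

The two-step form is useful when the raw counts are needed as well; the one-step form is more compact when only the TF-IDF matrix is of interest.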
