## Bag Of Words (BOW)

#### Advantages

* **Simple to understand and implement:** The BoW model is straightforward to understand and implement, making it accessible even to those new to NLP.

* **Scalability:** BoW can scale well to large datasets and vocabularies, especially when using sparse matrix representations.

* **Versatility:** BoW can be used for a wide range of NLP tasks, including sentiment analysis, text classification, and document clustering.

#### Disadvantages

* **Loss of word order and context:** BoW disregards the order of words and their context within the text, treating each document as a "bag" of words. This can lead to a loss of valuable semantic information, especially in tasks where word order and context are important (e.g., language modeling, sequence-to-sequence tasks).

* **Sparsity:** BoW representations are typically sparse, especially when dealing with large vocabularies or documents with many unique words. This can lead to high-dimensional feature spaces and computational challenges.

* **No consideration of word semantics:** BoW treats each word as a separate feature and does not consider the semantic relationships between words. This can result in a lack of semantic understanding in the representation, leading to suboptimal performance in tasks requiring deeper linguistic understanding.

* **Vocabulary size:** BoW representations can grow large with the size of the vocabulary, which may pose challenges in terms of memory and computational resources, especially when dealing with very large datasets or vocabularies.

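* **No weighting by word importance:** BoW records raw counts only, so a word that is frequent in every document (e.g., "the") can dominate the representation even though it carries little information. This may not be suitable for tasks where the relative importance of words plays a significant role (e.g., keyword extraction, summarization); weighting schemes such as TF-IDF address this.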

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Sample collection of text documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

In [2]:
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(documents)

# Convert the sparse matrix to a dense array for easier inspection (not recommended for large datasets)
X_dense = X.toarray()

# Get the feature names (words) corresponding to the columns in the BoW matrix
feature_names = vectorizer.get_feature_names_out()

print(vectorizer.vocabulary_)

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}


In [3]:
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0,1,1,1,0,0,1,0,1
text 2,0,2,0,1,0,1,1,0,1
text 3,1,0,0,1,1,0,1,1,1
text 4,0,1,1,1,0,0,1,0,1
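
To make the counting step concrete, here is a minimal pure-Python sketch that rebuilds the table above. It is not scikit-learn's implementation: the function name `bag_of_words` is hypothetical, and the regular expression only mirrors CountVectorizer's default token pattern (tokens of two or more word characters).

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's default token pattern (2+ word characters)
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def bag_of_words(docs):
    """Return (sorted vocabulary, list of count rows), one row per document."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

sample_docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
bow_vocab, bow_rows = bag_of_words(sample_docs)
for row in bow_rows:
    print(row)
```

Each row has one entry per vocabulary term, in alphabetical order, which is also how the DataFrame columns are laid out.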


### Lowercase

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer_lowercase = CountVectorizer(lowercase=True)
vectorizer_no_lowercase = CountVectorizer(lowercase=False)

vectorizer_lowercase.fit_transform(documents)
vectorizer_no_lowercase.fit_transform(documents)

print(f'lowercase True: {vectorizer_lowercase.get_feature_names_out()}')
print(f'lowercase False: {vectorizer_no_lowercase.get_feature_names_out()}')

lowercase True: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
lowercase False: ['And' 'Is' 'This' 'document' 'first' 'is' 'one' 'second' 'the' 'third'
 'this']


### Preprocessor

* In CountVectorizer, the preprocessor parameter lets you specify a function that is applied to each document before tokenization, for example to clean the raw text. Note that a custom preprocessor overrides the built-in preprocessing step (lowercasing and accent stripping), so the function must handle those itself if they are still wanted.

In [5]:
# Custom preprocessor function to convert text to lowercase
def custom_preprocessor(text):
    return text.lower()

vectorizer_preprocessor = CountVectorizer(preprocessor=custom_preprocessor)

# Fit and transform the documents using CountVectorizer with custom preprocessor
X_preprocessor = vectorizer_preprocessor.fit_transform(documents)
feature_names_preprocessor = vectorizer_preprocessor.get_feature_names_out()

print("Feature names with custom preprocessor:")
print(feature_names_preprocessor)
print()

Feature names with custom preprocessor:
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
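
A preprocessor usually does more than lowercasing. The sketch below is one possible cleaning function, assuming the raw text may contain HTML-like tags and digits that should be dropped; the name `clean_text` and the cleaning rules are illustrative choices, not part of scikit-learn.

```python
import re

def clean_text(text):
    """Hypothetical preprocessor: lowercase, drop tags and digits, tidy whitespace."""
    text = text.lower()                    # lowercasing must be done here explicitly
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML-like tags
    text = re.sub(r"\d+", " ", text)       # remove digit runs
    return " ".join(text.split())          # collapse repeated whitespace

cleaned_example = clean_text("<p>Visit us in 2024!</p>")
print(cleaned_example)
```

A function like this can be passed as `CountVectorizer(preprocessor=clean_text)`; tokenization and n-gram generation then run on its output.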



### Tokenizer

In [6]:
from nltk import word_tokenize

In [7]:
def custom_tokenizer(text):
    return word_tokenize(text)

vectorizer = CountVectorizer(tokenizer=custom_tokenizer)
X = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()

print(feature_names)


['.' '?' 'and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']




### Stopwords

In [8]:
import nltk
# Requires the stop-word corpus: run nltk.download('stopwords') once beforehand
stp = nltk.corpus.stopwords.words('english')

In [9]:
custom_stopwords = ['is', 'the', 'and', 'this']

# vectorizer = CountVectorizer(stop_words='english')
# vectorizer = CountVectorizer(stop_words=custom_stopwords)

vectorizer = CountVectorizer(stop_words=stp)
X = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()

print(feature_names)

['document' 'first' 'one' 'second' 'third']


### Strip Accent

**The strip_accents parameter in the CountVectorizer class of scikit-learn is used to specify whether to remove accents and perform normalization on the text data before tokenization. Accents, also known as diacritical marks, are additional symbols added to letters in various languages to indicate different pronunciations or meanings.**

**Here's why strip_accents is used and its significance:**

* ***Normalization:*** Text data often contains accented characters, especially in languages like French, Spanish, and German. By default, strip_accents is None, so no accent stripping is performed. Setting it to 'ascii' or 'unicode' removes accents during preprocessing: 'ascii' is a fast method that only works on characters with a direct ASCII mapping, while 'unicode' is a slightly slower method that works on any character.

* ***Uniformity:*** Removing accents helps achieve uniformity in the text data. For example, "café" and "cafe" would be treated as the same word after removing accents, which can be important for text processing tasks such as text classification, clustering, or information retrieval.

* ***Reduced Vocabulary Size:*** By mapping accented characters to their unaccented equivalents, the vocabulary size may be reduced, leading to a more compact and efficient representation of the text data.

* ***Improved Model Performance:*** In some cases, removing accents can improve the performance of text-based machine learning models by reducing the complexity of the vocabulary and focusing on the essential information in the text.

**However, stripping accents is not always necessary or desirable, especially in languages where accents carry semantic meaning (e.g., French, Spanish). The choice of whether to use strip_accents, and which value to set it to ('unicode', 'ascii', or None), therefore depends on the specific requirements of the text processing task and the characteristics of the text data.**

In [10]:
documents = [
    "café",
    "élève",
    "rôle",
    "résumé"
]
vectorizer_ascii = CountVectorizer(strip_accents='ascii')
vectorizer_unicode = CountVectorizer(strip_accents='unicode')

X_ascii = vectorizer_ascii.fit_transform(documents)
X_unicode = vectorizer_unicode.fit_transform(documents)

feature_names_ascii = vectorizer_ascii.get_feature_names_out()
feature_names_unicode = vectorizer_unicode.get_feature_names_out()

print("Feature names with strip_accents='ascii':")
print(feature_names_ascii)
print()

print("Feature names with strip_accents='unicode':")
print(feature_names_unicode)
print()


Feature names with strip_accents='ascii':
['cafe' 'eleve' 'resume' 'role']

Feature names with strip_accents='unicode':
['cafe' 'eleve' 'resume' 'role']
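
The two modes give identical results above because every accented character in the sample decomposes into an ASCII letter plus a combining mark. The following standard-library sketch approximates both strategies to show where they can differ; the function names are hypothetical and this is an approximation of the idea, not scikit-learn's code.

```python
import unicodedata

def strip_accents_unicode_sketch(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def strip_accents_ascii_sketch(text):
    """Decompose characters (NFKD) and drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

for word in ["café", "résumé", "straße"]:
    print(word, strip_accents_unicode_sketch(word), strip_accents_ascii_sketch(word))
```

For "straße" the ASCII variant silently drops the "ß" because it has no ASCII decomposition, while the Unicode variant leaves it in place. This is one reason to inspect the vocabulary after choosing a mode.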



### Decode Error

**The decode_error parameter in scikit-learn's CountVectorizer class specifies how decoding errors encountered during text processing are handled. This parameter is relevant when dealing with text data that needs to be decoded from a specific encoding, such as UTF-8.**

**Here's what the decode_error parameter does:**

*    ***decode_error='strict':*** This is the default behavior. It means that if an error occurs during decoding (e.g., invalid byte sequence), a UnicodeDecodeError exception will be raised, and the processing of the text will stop.

*    ***decode_error='ignore':*** If this value is set, decoding errors will be silently ignored, and the problematic characters will be skipped. This can be useful if you want to continue processing the text data even if some parts are not decodable.

*    ***decode_error='replace':*** With this setting, any characters that cannot be decoded will be replaced with a placeholder character (usually the Unicode replacement character U+FFFD). This option ensures that the text can still be processed, albeit with some loss of information.

**Choosing the appropriate value for decode_error depends on your specific use case and the nature of the text data you're working with. If you're confident that your text data is encoded correctly and you want to be notified of any decoding errors, you can stick with the default 'strict' behavior. On the other hand, if your data might contain encoding issues and you want to continue processing it despite potential errors, you can use 'ignore' or 'replace' to handle decoding errors more gracefully.**

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

# To trigger a real decoding error, swap the second document for:
# b'this document has \xff invalid byte'

# Sample collection of byte-string documents
documents = [
    b'this is the first document',            # Valid UTF-8 encoded bytes
    b'this document has invalid byte',        # Also valid UTF-8 as written (no invalid byte present)
    b'and this is the third one'              # Valid UTF-8 encoded bytes
]

# Initialize CountVectorizer with decode_error='strict' (default)
vectorizer_strict = CountVectorizer(decode_error='strict')
X_strict = vectorizer_strict.fit_transform(documents)

# Initialize CountVectorizer with decode_error='ignore'
vectorizer_ignore = CountVectorizer(decode_error='ignore')
X_ignore = vectorizer_ignore.fit_transform(documents)

# Initialize CountVectorizer with decode_error='replace'
vectorizer_replace = CountVectorizer(decode_error='replace')
X_replace = vectorizer_replace.fit_transform(documents)

# Get the feature names (vocabulary) for each vectorizer
feature_names_strict = vectorizer_strict.get_feature_names_out()
feature_names_ignore = vectorizer_ignore.get_feature_names_out()
feature_names_replace = vectorizer_replace.get_feature_names_out()

# Print the feature names and transformed documents for each vectorizer
print("Feature names with decode_error='strict':")
print(feature_names_strict)
print("Transformed documents with decode_error='strict':")
print(X_strict.toarray())
print()

print("Feature names with decode_error='ignore':")
print(feature_names_ignore)
print("Transformed documents with decode_error='ignore':")
print(X_ignore.toarray())
print()

print("Feature names with decode_error='replace':")
print(feature_names_replace)
print("Transformed documents with decode_error='replace':")
print(X_replace.toarray())


Feature names with decode_error='strict':
['and' 'byte' 'document' 'first' 'has' 'invalid' 'is' 'one' 'the' 'third'
 'this']
Transformed documents with decode_error='strict':
[[0 0 1 1 0 0 1 0 1 0 1]
 [0 1 1 0 1 1 0 0 0 0 1]
 [1 0 0 0 0 0 1 1 1 1 1]]

Feature names with decode_error='ignore':
['and' 'byte' 'document' 'first' 'has' 'invalid' 'is' 'one' 'the' 'third'
 'this']
Transformed documents with decode_error='ignore':
[[0 0 1 1 0 0 1 0 1 0 1]
 [0 1 1 0 1 1 0 0 0 0 1]
 [1 0 0 0 0 0 1 1 1 1 1]]

Feature names with decode_error='replace':
['and' 'byte' 'document' 'first' 'has' 'invalid' 'is' 'one' 'the' 'third'
 'this']
Transformed documents with decode_error='replace':
[[0 0 1 1 0 0 1 0 1 0 1]
 [0 1 1 0 1 1 0 0 0 0 1]
 [1 0 0 0 0 0 1 1 1 1 1]]
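
All three outputs above are identical because none of the sample byte strings actually contains an invalid sequence. The sketch below uses Python's built-in `bytes.decode`, whose `errors` argument accepts the same three values, to show how the modes differ on a byte string that really is invalid UTF-8. The variable names are illustrative.

```python
# A byte string containing 0xFF, which is never valid in UTF-8
raw = b"this document has \xff invalid byte"

# 'strict' raises instead of returning text
try:
    raw.decode("utf-8", errors="strict")
    strict_outcome = "decoded"
except UnicodeDecodeError:
    strict_outcome = "UnicodeDecodeError"

# 'ignore' drops the offending byte; 'replace' substitutes U+FFFD
ignored_text = raw.decode("utf-8", errors="ignore")
replaced_text = raw.decode("utf-8", errors="replace")

print(strict_outcome)
print(repr(ignored_text))
print(repr(replaced_text))
```

With the default word tokenizer, the replacement character is not a word character, so 'ignore' and 'replace' would most likely yield the same vocabulary here; the difference matters more for character-level analyzers.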


### Using NGRAM

In [12]:
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

In [13]:
vectorizer = CountVectorizer(ngram_range=(1,2))
X = vectorizer.fit_transform(documents)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
# print(vectorizer.vocabulary_)
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

Unnamed: 0,and,and this,document,document is,first,first document,is,is the,is this,one,...,the,the first,the second,the third,third,third one,this,this document,this is,this the
text 1,0,0,1,0,1,1,1,1,0,0,...,1,1,0,0,0,0,1,0,1,0
text 2,0,0,2,1,0,0,1,1,0,0,...,1,0,1,0,0,0,1,1,0,0
text 3,1,1,0,0,0,0,1,1,0,1,...,1,0,0,1,1,1,1,0,1,0
text 4,0,0,1,0,1,1,1,0,1,0,...,1,1,0,0,0,0,1,0,0,1


In [14]:
vectorizer = CountVectorizer(ngram_range=(2,2))
X = vectorizer.fit_transform(documents)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
# print(vectorizer.vocabulary_)
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

Unnamed: 0,and this,document is,first document,is the,is this,second document,the first,the second,the third,third one,this document,this is,this the
text 1,0,0,1,1,0,0,1,0,0,0,0,1,0
text 2,0,1,0,1,0,1,0,1,0,0,1,0,0
text 3,1,0,0,1,0,0,0,0,1,1,0,1,0
text 4,0,0,1,0,1,0,1,0,0,0,0,0,1


In [15]:
vectorizer = CountVectorizer(ngram_range=(2,3))
X = vectorizer.fit_transform(documents)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
# print(vectorizer.vocabulary_)
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

Unnamed: 0,and this,and this is,document is,document is the,first document,is the,is the first,is the second,is the third,is this,...,the second document,the third,the third one,third one,this document,this document is,this is,this is the,this the,this the first
text 1,0,0,0,0,1,1,1,0,0,0,...,0,0,0,0,0,0,1,1,0,0
text 2,0,0,1,1,0,1,0,1,0,0,...,1,0,0,0,1,1,0,0,0,0
text 3,1,1,0,0,0,1,0,0,1,0,...,0,1,1,1,0,0,1,1,0,0
text 4,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,1
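
The ngram_range=(min_n, max_n) setting controls which contiguous word sequences become features. As a rough illustration (a hypothetical helper, not scikit-learn's implementation), word n-grams can be generated from a token list like this:

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = ["this", "is", "the", "first", "document"]
bigrams = word_ngrams(tokens, 2, 2)
unigrams_and_bigrams = word_ngrams(tokens, 1, 2)
print(bigrams)
print(len(unigrams_and_bigrams))
```

A document with k tokens yields k - n + 1 n-grams of length n, which is why widening the range grows the vocabulary quickly.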


### MIN MAX Document Frequency

**In the CountVectorizer class of scikit-learn, "min_df" and "max_df" are parameters used to control the vocabulary size by specifying the minimum and maximum document frequency of terms (words) in the documents. Each accepts either an integer (an absolute number of documents) or a float between 0.0 and 1.0 (a proportion of documents). Here's what they mean:**

* ***"min_df":*** Terms with a document frequency lower than min_df will be ignored in the vocabulary. For example, setting min_df=2 will exclude terms that appear in fewer than 2 documents.

* ***"max_df":*** This parameter is useful for excluding terms that are too frequent and may not provide useful information for tasks like text classification or clustering. For example, setting max_df=0.8 will exclude terms that appear in more than 80% of the documents.


In [16]:
# Initialize CountVectorizer with different min_df and max_df values
vectorizer_min_df1_max_df1 = CountVectorizer(min_df=1, max_df=1.0)
vectorizer_min_df2_max_df1 = CountVectorizer(min_df=2, max_df=1.0)
vectorizer_min_df1_max_df2 = CountVectorizer(min_df=1, max_df=0.75)  # max_df=0.75 means exclude terms appearing in more than 75% of documents

# Fit and transform the documents using each vectorizer
X_min_df1_max_df1 = vectorizer_min_df1_max_df1.fit_transform(documents)
X_min_df2_max_df1 = vectorizer_min_df2_max_df1.fit_transform(documents)
X_min_df1_max_df2 = vectorizer_min_df1_max_df2.fit_transform(documents)

# Get the feature names (vocabulary)
feature_names_min_df1_max_df1 = vectorizer_min_df1_max_df1.get_feature_names_out()
feature_names_min_df2_max_df1 = vectorizer_min_df2_max_df1.get_feature_names_out()
feature_names_min_df1_max_df2 = vectorizer_min_df1_max_df2.get_feature_names_out()

# Display the feature names for each vectorizer
print("Feature names with min_df=1, max_df=1.0:")
pd.DataFrame(data = X_min_df1_max_df1.toarray(), 
             columns=feature_names_min_df1_max_df1, 
             index=["text 1", "text 2", "text 3", "text 4"])

Feature names with min_df=1, max_df=1.0:


Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0,1,1,1,0,0,1,0,1
text 2,0,2,0,1,0,1,1,0,1
text 3,1,0,0,1,1,0,1,1,1
text 4,0,1,1,1,0,0,1,0,1


In [17]:
print("Feature names with min_df=2, max_df=1.0:")
pd.DataFrame(data = X_min_df2_max_df1.toarray(), 
             columns=feature_names_min_df2_max_df1, 
             index=["text 1", "text 2", "text 3", "text 4"])

Feature names with min_df=2, max_df=1.0:


Unnamed: 0,document,first,is,the,this
text 1,1,1,1,1,1
text 2,2,0,1,1,1
text 3,0,0,1,1,1
text 4,1,1,1,1,1


In [18]:
print("Feature names with min_df=1, max_df=0.75:")
pd.DataFrame(data = X_min_df1_max_df2.toarray(), 
             columns=feature_names_min_df1_max_df2, 
             index=["text 1", "text 2", "text 3", "text 4"])

Feature names with min_df=1, max_df=0.75:


Unnamed: 0,and,document,first,one,second,third
text 1,0,1,1,0,0,0
text 2,0,2,0,0,1,0
text 3,1,0,0,1,0,1
text 4,0,1,1,0,0,0
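
The filtering behind these three tables can be sketched in plain Python. The helper below is a simplified illustration, not scikit-learn's code, and it assumes that integer thresholds are absolute document counts while float thresholds are proportions of the corpus.

```python
import re
from collections import Counter

def filter_by_df(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Assumption: int = absolute count, float = proportion of documents
    min_count = min_df if isinstance(min_df, int) else min_df * n_docs
    max_count = max_df if isinstance(max_df, int) else max_df * n_docs
    doc_freq = Counter()
    for doc in docs:
        # set() so that each term is counted at most once per document
        doc_freq.update(set(re.findall(r"\b\w\w+\b", doc.lower())))
    return sorted(t for t, c in doc_freq.items() if min_count <= c <= max_count)

df_docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
print(filter_by_df(df_docs, min_df=2))
print(filter_by_df(df_docs, max_df=0.75))
```

With four documents, max_df=0.75 corresponds to a ceiling of three documents, so "is", "the", and "this" (present in all four) are removed while "document" (present in three) stays.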


**Advantages of using min_df and max_df in CountVectorizer:**

* ***Control over vocabulary size:*** You can control the size of the vocabulary by excluding terms that appear too rarely (min_df) or too frequently (max_df).

* ***Noise reduction:*** Excluding terms with very low or very high document frequencies can help reduce noise in the data and improve the quality of the features used for modeling.

* ***Improved generalization:*** By excluding terms that are too specific (document frequency below min_df) or too common (document frequency above max_df), you may improve the generalization performance of machine learning models.

**Disadvantages:**

* ***Information loss:*** Excluding terms based on document frequency thresholds may result in the loss of potentially useful information, especially if the thresholds are set too aggressively.

* ***Parameter tuning:*** Choosing appropriate values for min_df and max_df requires experimentation and tuning. Selecting optimal values may depend on the specific dataset and task.

* ***Impact on model performance:*** Incorrectly setting min_df and max_df thresholds may adversely affect the performance of machine learning models, leading to suboptimal results.

***In summary, while min_df and max_df offer control over the vocabulary size and can help improve the quality of features used for modeling, they require careful tuning and consideration to avoid information loss and ensure optimal model performance.***

### Analyzer

**In scikit-learn's CountVectorizer, the analyzer parameter determines whether features are made of word n-grams or character n-grams. It accepts 'word', 'char', 'char_wb', or a callable.**

* ***analyzer='word':*** This is the default value. It analyzes the input as a sequence of words. With the default token pattern, a token is a run of two or more alphanumeric characters; punctuation is ignored and always treated as a token separator.

* ***analyzer='char':*** It analyzes the input as a sequence of characters. It forms features based on character n-grams, where an n-gram is a contiguous sequence of n characters. With the default ngram_range=(1, 1), each feature is a single character, including spaces and punctuation.

In [19]:
# Initialize CountVectorizer with analyzer='word'
vectorizer_word = CountVectorizer(analyzer='word')

# Fit and transform the documents using CountVectorizer with 'word' analyzer
X_word = vectorizer_word.fit_transform(documents)

# Get the feature names (vocabulary)
feature_names_word = vectorizer_word.get_feature_names_out()

print("Feature names with analyzer='word':")
pd.DataFrame(data = X_word.toarray(), 
             columns=feature_names_word, 
             index=["text 1", "text 2", "text 3", "text 4"])

Feature names with analyzer='word':


Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0,1,1,1,0,0,1,0,1
text 2,0,2,0,1,0,1,1,0,1
text 3,1,0,0,1,1,0,1,1,1
text 4,0,1,1,1,0,0,1,0,1


In [20]:
# Initialize CountVectorizer with analyzer='char'
vectorizer_char = CountVectorizer(analyzer='char')

# Fit and transform the documents using CountVectorizer with 'char' analyzer
X_char = vectorizer_char.fit_transform(documents)

# Get the feature names (vocabulary)
feature_names_char = vectorizer_char.get_feature_names_out()

# Print the feature names and the transformed documents with 'char' analyzer
print("Feature names with analyzer='char':")
pd.DataFrame(data = X_char.toarray(), 
             columns=feature_names_char, 
             index=["text 1", "text 2", "text 3", "text 4"])

Feature names with analyzer='char':


Unnamed: 0,' ',.,?,a,c,d,e,f,h,i,m,n,o,r,s,t,u
text 1,4,1,0,0,1,1,2,1,2,3,1,1,1,1,3,4,1
text 2,5,1,0,0,3,3,4,0,2,2,2,3,3,0,3,4,2
text 3,5,1,0,1,0,2,2,0,3,3,0,2,1,1,2,3,0
text 4,4,0,1,0,1,1,2,1,2,3,1,1,1,1,3,4,1


**The text documents are processed using CountVectorizer with both 'word' and 'char' analyzers. With the 'word' analyzer, the input is split into word tokens, while with the 'char' analyzer it is split into character n-grams (single characters here, because ngram_range defaults to (1, 1)). The resulting feature matrices count the occurrences of words and characters in the documents, respectively.**
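
Character n-grams become more informative for n greater than 1. The sketch below is a simplified, hypothetical helper that lowercases the text, tidies whitespace (a simplification of what the library does), and slides a window of n characters across it.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all contiguous character n-grams of the lowercased text."""
    text = " ".join(text.lower().split())  # simplification: normalize whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]

char_counts = Counter(char_ngrams("This is the first document.", 1))
trigrams = char_ngrams("the cat", 3)
print(char_counts[" "], char_counts["t"])
print(trigrams)
```

The single-character counts for the first sentence agree with the first row of the 'char' table above: four spaces and four occurrences of "t".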

### Max Features

**In CountVectorizer, the max_features parameter specifies the maximum number of features (unique tokens or words) to keep from the text data. It controls the vocabulary size by retaining only the most frequent terms once the vocabulary has been built.**

**Here's what max_features does:**

*   ***If max_features is an integer:*** The vocabulary is limited to the top max_features terms, ordered by term frequency across the corpus. This parameter is ignored when a custom vocabulary is supplied.

*    ***If max_features is None (default):*** All features will be considered, and there is no limit on the number of features extracted.

**Setting a value for max_features can be helpful in scenarios where you want to reduce the dimensionality of the feature space or improve computational efficiency by limiting the number of features considered.
However, it's essential to note that specifying max_features may discard less frequent terms that are nonetheless informative.**

In [21]:
# Initialize CountVectorizer with max_features=5
vectorizer_max_features = CountVectorizer(max_features=5)

# Fit and transform the documents using CountVectorizer with max_features
X_max_features = vectorizer_max_features.fit_transform(documents)

# Get the feature names (vocabulary)
feature_names_max_features = vectorizer_max_features.get_feature_names_out()

# Print the feature names and the transformed documents with max_features
print("Feature names with max_features=5:")
pd.DataFrame(data = X_max_features.toarray(), 
             columns=feature_names_max_features, 
             index=["text 1", "text 2", "text 3", "text 4"])

Feature names with max_features=5:


Unnamed: 0,document,first,is,the,this
text 1,1,1,1,1,1
text 2,2,0,1,1,1
text 3,0,0,1,1,1
text 4,1,1,1,1,1
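
The selection shown above can be approximated by ranking terms on their total count across the corpus. This is a hedged sketch with a hypothetical helper name; in particular, the alphabetical tie-breaking used here is an assumption and may not match how the library resolves ties.

```python
import re
from collections import Counter

def top_k_vocabulary(docs, k):
    """Return the k terms with the highest corpus-wide count, sorted alphabetically."""
    totals = Counter()
    for doc in docs:
        totals.update(re.findall(r"\b\w\w+\b", doc.lower()))
    # Assumption: ties are broken alphabetically
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return sorted(term for term, _ in ranked[:k])

mf_docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
print(top_k_vocabulary(mf_docs, 5))
```

Here "document", "is", "the", and "this" each occur four times and "first" twice, so the top five are unambiguous for this corpus.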


### Vocabulary

**In CountVectorizer, the vocabulary parameter allows you to specify the vocabulary that the vectorizer should use when transforming the input data into feature vectors. The vocabulary is a mapping of terms (tokens or words) to feature indices in the resulting feature matrix.**

*    ***If vocabulary is a mapping (e.g., dict):*** This mapping specifies the vocabulary that the vectorizer should use. The keys are the terms (tokens or words), and the values are the corresponding feature indices. Only the terms in the provided vocabulary will be considered during vectorization, and any terms not found in the vocabulary will be ignored.

*    ***If vocabulary is an iterable (e.g., list, set):*** This iterable specifies the terms (tokens or words) that should be included in the vocabulary. The vectorizer will use these terms to create the vocabulary, and the resulting feature indices will be assigned accordingly.

**Specifying a custom vocabulary using the vocabulary parameter can be useful in scenarios where you want to enforce a specific set of terms to be included in the feature matrix or when you want to map certain terms to specific feature indices.
However, terms in the input data that are not present in the specified vocabulary are silently dropped, so the resulting features may miss relevant information.**

In [22]:
# Define a custom vocabulary
custom_vocabulary = {
    'this': 0,
    'is': 1,
    'the': 2,
    'document': 3
}

# Initialize CountVectorizer with custom vocabulary
vectorizer_vocabulary = CountVectorizer(vocabulary=custom_vocabulary)

# Fit and transform the documents using CountVectorizer with custom vocabulary
X_vocabulary = vectorizer_vocabulary.fit_transform(documents)

# Get the feature names (vocabulary)
feature_names_vocabulary = vectorizer_vocabulary.get_feature_names_out()

# Print the feature names and the transformed documents with custom vocabulary
print("Feature names with custom vocabulary:")
pd.DataFrame(data = X_vocabulary.toarray(), 
             columns=feature_names_vocabulary, 
             index=["text 1", "text 2", "text 3", "text 4"])

Feature names with custom vocabulary:


Unnamed: 0,this,is,the,document
text 1,1,1,1,1
text 2,1,1,1,2
text 3,1,1,1,0
text 4,1,1,1,1
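
With a fixed vocabulary, counting reduces to a dictionary lookup per token, and unknown tokens are skipped. A minimal sketch, using a hypothetical helper rather than the library's code:

```python
import re

def count_with_vocabulary(doc, vocabulary):
    """Count only tokens present in `vocabulary` (a term -> column index dict)."""
    row = [0] * len(vocabulary)
    for token in re.findall(r"\b\w\w+\b", doc.lower()):
        index = vocabulary.get(token)
        if index is not None:  # out-of-vocabulary tokens are ignored
            row[index] += 1
    return row

fixed_vocabulary = {"this": 0, "is": 1, "the": 2, "document": 3}
second_row = count_with_vocabulary("This document is the second document.", fixed_vocabulary)
print(second_row)
```

The word "second" does not appear in the mapping, so it contributes nothing to the row.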


### Binary

**In CountVectorizer, the binary parameter is a boolean that specifies whether the feature matrix should be binarized. When binary is set to True, all non-zero counts are set to 1, so the matrix contains only 0 or 1: if a token (word) is present in a document, its feature value is 1; otherwise, it is 0.**

Here's what binary does:

* ***binary=True:*** The feature matrix will be binarized, meaning that it will only contain binary values. This is useful when you only want to represent the presence or absence of a token in a document, rather than its frequency.

*  ***binary=False (default):*** The feature matrix will contain counts of tokens in each document, representing the frequency of each token.

**Setting binary=True can be beneficial in certain scenarios, such as text classification tasks where the frequency of words is less relevant compared to their presence or absence in a document.**

In [23]:
# Initialize CountVectorizer with binary=True
vectorizer_binary = CountVectorizer(binary=True)
vectorizer_nonbinary = CountVectorizer(binary=False)

# Fit and transform the documents using CountVectorizer with binary=True
X_binary = vectorizer_binary.fit_transform(documents)
X_nonbinary = vectorizer_nonbinary.fit_transform(documents)

# Get the feature names (vocabulary)
feature_names_binary = vectorizer_binary.get_feature_names_out()
feature_names_nonbinary = vectorizer_nonbinary.get_feature_names_out()

# Print the feature names and the transformed documents with binary=True
print("Feature names with binary=True:")
pd.DataFrame(data = X_binary.toarray(), 
             columns=feature_names_binary, 
             index=["text 1", "text 2", "text 3", "text 4"])

Feature names with binary=True:


Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0,1,1,1,0,0,1,0,1
text 2,0,1,0,1,0,1,1,0,1
text 3,1,0,0,1,1,0,1,1,1
text 4,0,1,1,1,0,0,1,0,1


In [24]:
print("Feature names with binary=False:")
pd.DataFrame(data = X_nonbinary.toarray(), 
             columns=feature_names_nonbinary, 
             index=["text 1", "text 2", "text 3", "text 4"])

Feature names with binary=False:


Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0,1,1,1,0,0,1,0,1
text 2,0,2,0,1,0,1,1,0,1
text 3,1,0,0,1,1,0,1,1,1
text 4,0,1,1,1,0,0,1,0,1


## TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents. It combines term frequency (TF), which measures the frequency of a term in a document, with inverse document frequency (IDF), which measures how unique or rare a term is across the entire document collection.

#### Advantages

* **Reflects word importance:** TF-IDF assigns higher scores to words that are important in a document while giving lower scores to common words. This helps capture the significance of words within a document.

* **Handles common words:** TF-IDF automatically downweights common words that occur frequently across multiple documents, such as stopwords, by assigning them lower scores. This reduces the impact of noise in the data.

* **Domain independence:** TF-IDF is relatively domain-independent and can be applied to various types of text data and domains without the need for extensive domain-specific knowledge or preprocessing.

* **Simple and efficient:** Implementation of TF-IDF is straightforward, and it can be efficiently computed for large datasets using libraries like scikit-learn.

#### Disadvantages

* **Lack of semantic understanding:** TF-IDF does not consider the semantic meaning of words, treating them as independent units. As a result, it may not capture the semantic relationships between words and could miss important context.

* **Ignores word order:** TF-IDF treats documents as bags of words, ignoring the order in which words appear within a document. This can lead to loss of sequential information, which may be important in certain applications like natural language processing tasks.

* **Vulnerability to document length:** TF-IDF can be sensitive to document length, as longer documents may have higher raw term frequencies. Normalization techniques like sublinear TF scaling or length normalization can help mitigate this issue.

* **Parameter sensitivity:** The effectiveness of TF-IDF can depend on the choice of parameters, such as IDF smoothing (smooth_idf), sublinear TF scaling (sublinear_tf), and the normalization method (norm). Careful tuning of these parameters may be required for optimal performance in different scenarios.

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [26]:
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

In [27]:
vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(documents)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()

print(vectorizer.vocabulary_)

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}


In [28]:
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
text 2,0.0,0.687624,0.0,0.281089,0.0,0.538648,0.281089,0.0,0.281089
text 3,0.511849,0.0,0.0,0.267104,0.511849,0.0,0.267104,0.511849,0.267104
text 4,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
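
The numbers in this table can be reproduced by hand. The sketch below assumes scikit-learn's default settings (smooth_idf=True and norm='l2'), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t. The function name `tfidf_rows` is hypothetical, and this is an illustration rather than the library's implementation.

```python
import math
import re
from collections import Counter

def tfidf_rows(docs):
    """Compute L2-normalized TF-IDF rows using the smoothed IDF formula."""
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = [tf[t] * idf[t] for t in vocabulary]
        length = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / length for v in raw])
    return vocabulary, idf, rows

tfidf_docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
tfidf_vocab, tfidf_idf, tfidf_matrix = tfidf_rows(tfidf_docs)
print([round(v, 6) for v in tfidf_matrix[0]])
```

Terms that occur in every document ("is", "the", "this") receive an IDF of exactly 1, the lowest possible value under this formula, while a term found in a single document such as "and" receives about 1.916.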


### Norm
The norm parameter specifies the type of normalization applied to the TF-IDF matrix after calculating the term frequency-inverse document frequency (TF-IDF) scores.


* **'l2' (default):** Each row of the TF-IDF matrix is normalized to have unit Euclidean norm (L2 norm). This means that the sum of the squared values in each row is equal to 1 after normalization.

* **'l1':** Each row of the TF-IDF matrix is normalized to have unit Manhattan norm (L1 norm). This means that the sum of the absolute values of the elements in each row is equal to 1 after normalization.

* **None:** No normalization is applied to the TF-IDF matrix. Each row of the TF-IDF matrix retains its original values without normalization.

In [29]:
vectorizer = TfidfVectorizer(norm='l2')

X = vectorizer.fit_transform(documents)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
print("TfidfVectorizer with norm l2")
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

TfidfVectorizer with norm l2


Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
text 2,0.0,0.687624,0.0,0.281089,0.0,0.538648,0.281089,0.0,0.281089
text 3,0.511849,0.0,0.0,0.267104,0.511849,0.0,0.267104,0.511849,0.267104
text 4,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085


In [30]:
vectorizer = TfidfVectorizer(norm='l1')
X = vectorizer.fit_transform(documents)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
print("TfidfVectorizer with norm l1")
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

TfidfVectorizer with norm l1


Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0.0,0.213315,0.263487,0.174399,0.0,0.0,0.174399,0.0,0.174399
text 2,0.0,0.33226,0.0,0.135822,0.0,0.260274,0.135822,0.0,0.135822
text 3,0.219033,0.0,0.0,0.1143,0.219033,0.0,0.1143,0.219033,0.1143
text 4,0.0,0.213315,0.263487,0.174399,0.0,0.0,0.174399,0.0,0.174399


In [31]:
vectorizer = TfidfVectorizer(norm=None)
X = vectorizer.fit_transform(documents)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
print("TfidfVectorizer with norm None")
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

TfidfVectorizer with norm None


Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0.0,1.223144,1.510826,1.0,0.0,0.0,1.0,0.0,1.0
text 2,0.0,2.446287,0.0,1.0,0.0,1.916291,1.0,0.0,1.0
text 3,1.916291,0.0,0.0,1.0,1.916291,0.0,1.0,1.916291,1.0
text 4,0.0,1.223144,1.510826,1.0,0.0,0.0,1.0,0.0,1.0
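
The relationship between the three settings can be checked directly. The following is a minimal sketch, assuming the same four-document corpus as above; the variable names (`raw`, `manual_l2`, `manual_l1`) are illustrative only. It divides each unnormalized row by its own L2 or L1 length and compares the result with what TfidfVectorizer returns.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Fit the same corpus once per norm setting
raw = TfidfVectorizer(norm=None).fit_transform(documents).toarray()
l2 = TfidfVectorizer(norm="l2").fit_transform(documents).toarray()
l1 = TfidfVectorizer(norm="l1").fit_transform(documents).toarray()

# Divide every raw row by its own Euclidean / Manhattan length
manual_l2 = raw / np.linalg.norm(raw, ord=2, axis=1, keepdims=True)
manual_l1 = raw / np.abs(raw).sum(axis=1, keepdims=True)

print(np.allclose(manual_l2, l2))   # do the manual L2 rows match?
print(np.allclose(manual_l1, l1))   # do the manual L1 rows match?
print((l2 ** 2).sum(axis=1))        # sum of squares per row under 'l2'
print(l1.sum(axis=1))               # sum of values per row under 'l1'
```

Normalization only rescales each row, so the relative ordering of terms within a document is the same under all three settings.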


### Use IDF

* **When use_idf=True (default):** IDF reweighting is enabled, and the TF-IDF scores are computed as the product of term frequency (TF) and inverse document frequency (IDF). This means that words that are rare across the entire document collection will have higher IDF scores, leading to higher TF-IDF scores for those words in individual documents. IDF reweighting helps to downweight terms that occur frequently across multiple documents and upweight terms that are more specific or unique to individual documents.
<br></br>
* **When use_idf=False:** IDF reweighting is disabled, and the scores are computed from the term frequency (TF) of each term in the document alone. All terms are treated equally, regardless of how many documents in the collection contain them. The rows are still normalized according to the norm parameter (L2 by default).


In [32]:
vectorizer = TfidfVectorizer(use_idf=True)
X = vectorizer.fit_transform(documents)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
print("TfidfVectorizer with use_idf True")
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

TfidfVectorizer with use_idf True


Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
text 2,0.0,0.687624,0.0,0.281089,0.0,0.538648,0.281089,0.0,0.281089
text 3,0.511849,0.0,0.0,0.267104,0.511849,0.0,0.267104,0.511849,0.267104
text 4,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085


In [33]:
vectorizer = TfidfVectorizer(use_idf=False)
X = vectorizer.fit_transform(documents)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
print("TfidfVectorizer with use_idf False")
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

TfidfVectorizer with use_idf False


Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0.0,0.447214,0.447214,0.447214,0.0,0.0,0.447214,0.0,0.447214
text 2,0.0,0.707107,0.0,0.353553,0.0,0.353553,0.353553,0.0,0.353553
text 3,0.408248,0.0,0.0,0.408248,0.408248,0.0,0.408248,0.408248,0.408248
text 4,0.0,0.447214,0.447214,0.447214,0.0,0.0,0.447214,0.0,0.447214


### Smooth IDF

Without smoothing (i.e., when smooth_idf=False), scikit-learn calculates the IDF for term $t$ as:

$$IDF(t) = \log \left( \frac{N}{df(t)} \right) + 1$$

where $N$ is the number of documents, $df(t)$ is the number of documents that contain $t$, and $\log$ is the natural logarithm. The constant 1 is added so that terms occurring in every document, whose logarithmic term is 0, are not ignored entirely.


When smoothing is applied (i.e., when smooth_idf=True, the default), 1 is added to both the numerator and the denominator, as if an extra document had been seen containing every term in the collection exactly once:

$$IDF(t) = \log \left( \frac{N + 1}{df(t) + 1} \right) + 1$$

Adding 1 to the denominator prevents division by zero for a term with zero document frequency, and adding 1 to the numerator keeps the ratio consistent. This agrees with the norm=None output shown earlier: "is", "the", and "this" occur in all four documents, so their IDF is $\log(5/5) + 1 = 1$.

In [34]:
vectorizer = TfidfVectorizer(smooth_idf=True)
X = vectorizer.fit_transform(documents)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
print("TfidfVectorizer with smooth_idf True")
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

TfidfVectorizer with smooth_idf True


Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
text 2,0.0,0.687624,0.0,0.281089,0.0,0.538648,0.281089,0.0,0.281089
text 3,0.511849,0.0,0.0,0.267104,0.511849,0.0,0.267104,0.511849,0.267104
text 4,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085


In [35]:
vectorizer = TfidfVectorizer(smooth_idf=False)
X = vectorizer.fit_transform(documents)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
print("TfidfVectorizer with smooth_idf False")
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

TfidfVectorizer with smooth_idf False


Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0.0,0.469417,0.617227,0.364544,0.0,0.0,0.364544,0.0,0.364544
text 2,0.0,0.657827,0.0,0.255431,0.0,0.609532,0.255431,0.0,0.255431
text 3,0.532485,0.0,0.0,0.223143,0.532485,0.0,0.223143,0.532485,0.223143
text 4,0.0,0.469417,0.617227,0.364544,0.0,0.0,0.364544,0.0,0.364544
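
As a sanity check on the smoothing behaviour, the IDF values can be recomputed by hand. This is a minimal sketch, assuming the same corpus as above and scikit-learn's convention of adding 1 to the logarithm; the names `n_docs`, `df`, `idf_smooth`, and `idf_plain` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Document frequency: in how many documents does each term occur?
counts = CountVectorizer().fit_transform(documents)
n_docs = counts.shape[0]
df = np.asarray((counts > 0).sum(axis=0)).ravel()

# IDF with and without smoothing (natural log, plus the constant 1)
idf_smooth = np.log((n_docs + 1) / (df + 1)) + 1
idf_plain = np.log(n_docs / df) + 1

# Compare against the weights learned by TfidfVectorizer
sk_smooth = TfidfVectorizer(smooth_idf=True).fit(documents).idf_
sk_plain = TfidfVectorizer(smooth_idf=False).fit(documents).idf_
print(np.allclose(idf_smooth, sk_smooth))
print(np.allclose(idf_plain, sk_plain))
print(np.round(idf_plain - idf_smooth, 6))  # per-term effect of smoothing
```

On a corpus this small the smoothing term is large relative to $N$, which is why the two tables above differ noticeably. On a large corpus the difference is usually much smaller.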


### Sublinear TF

* **When sublinear_tf=True:** Sublinear scaling is applied to the TF values. Each non-zero term frequency is replaced by a logarithmic transform, which dampens very high term frequencies and prevents a term repeated many times in one document from dominating that document's TF-IDF scores.


    $$TF_{\text{transformed}}(t) = 1 + \log(TF(t))$$

* **When sublinear_tf=False (default):** $TF(t)$, the original term frequency of term $t$ in the document, is used unchanged.

A term that occurs once keeps a TF of 1 (since $\log 1 = 0$), while larger counts grow only logarithmically. In this corpus only text 2 is affected, because "document", which occurs twice there, is the only term with a count above 1.


In [36]:
vectorizer = TfidfVectorizer(sublinear_tf=True)
X = vectorizer.fit_transform(documents)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
print("TfidfVectorizer with sublinear_tf True")
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

TfidfVectorizer with sublinear_tf True


Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
text 2,0.0,0.625527,0.0,0.302047,0.0,0.578809,0.302047,0.0,0.302047
text 3,0.511849,0.0,0.0,0.267104,0.511849,0.0,0.267104,0.511849,0.267104
text 4,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085


In [37]:
vectorizer = TfidfVectorizer(sublinear_tf=False)
X = vectorizer.fit_transform(documents)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
print("TfidfVectorizer with sublinear_tf False")
pd.DataFrame(data = X_dense, columns=feature_names, index=["text 1", "text 2", "text 3", "text 4"])

TfidfVectorizer with sublinear_tf False


Unnamed: 0,and,document,first,is,one,second,the,third,this
text 1,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
text 2,0.0,0.687624,0.0,0.281089,0.0,0.538648,0.281089,0.0,0.281089
text 3,0.511849,0.0,0.0,0.267104,0.511849,0.0,0.267104,0.511849,0.267104
text 4,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
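
The sublinear transform can be reproduced step by step. The sketch below assumes the same corpus and default settings (smoothed IDF, L2 norm); it applies $1 + \log(TF)$ to the non-zero counts, multiplies by the learned IDF weights, normalizes each row, and compares the result with `TfidfVectorizer(sublinear_tf=True)`. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Raw term counts per document
tf = CountVectorizer().fit_transform(documents).toarray().astype(float)

# Sublinear scaling: replace each non-zero count with 1 + log(count)
nonzero = tf > 0
tf[nonzero] = 1 + np.log(tf[nonzero])

# Multiply by the (smoothed) IDF weights, then L2-normalize each row
idf = TfidfVectorizer().fit(documents).idf_
weighted = tf * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

expected = TfidfVectorizer(sublinear_tf=True).fit_transform(documents).toarray()
print(np.allclose(manual, expected))
```

Zero counts are left at zero rather than passed through the logarithm, so the matrix stays sparse in the real implementation.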
