CountVectorizer is a text-vectorization tool provided by the `scikit-learn` library in Python. It implements the [Bag of Words](Text_Vectorization.md) model: it builds a vocabulary of the unique words found across all the texts it is fitted on, then transforms each text into a vector holding the count of every vocabulary word in that text. This is useful when we have several texts and want to convert each one into a numeric vector for further text analysis.
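
To make the counting idea concrete, here is a minimal pure-Python sketch of a bag-of-words encoder. It is an illustration only, not scikit-learn's implementation: the helper names `tokenize` and `bag_of_words` are made up for this example, and the tokenization (lowercasing, then keeping runs of two or more word characters) is assumed to match CountVectorizer's default settings.

```python
import re
from collections import Counter

# Assumption: mimic the default tokenization (tokens of 2+ word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def bag_of_words(texts):
    """Return (vocabulary, count_matrix) for a list of texts."""
    tokenized = [tokenize(text) for text in texts]
    # Vocabulary: every distinct token, sorted, mapped to a column index.
    unique_words = sorted({word for doc in tokenized for word in doc})
    vocabulary = {word: index for index, word in enumerate(unique_words)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for word, count in Counter(doc).items():
            row[vocabulary[word]] = count
        matrix.append(row)
    return vocabulary, matrix


texts = ["the cat sat on the mat", "the dog sat"]
vocabulary, matrix = bag_of_words(texts)
print(vocabulary)  # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(matrix)      # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each row describes one text and each column one vocabulary word, so "the" (column 5) gets a count of 2 in the first text.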

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

document = [
    "The quick brown fox jumps over the lazy dog.",
    "Python is a great programming language for machine learning.",
    "CountVectorizer converts text into a matrix of token counts.",
    "Natural language processing helps computers understand human language.",
    "Machine learning models require clean and structured data."
]

# Create a vectorizer object
vectorizer = CountVectorizer()

# Learn the vocabulary (the unique words) from the documents
vectorizer.fit(document)

# Print the learned words along with their column indices
print("Vocabulary:", vectorizer.vocabulary_)

# Encode each document as a vector of word counts (a sparse matrix)
vector = vectorizer.transform(document)

# Convert the sparse matrix to a dense array so it can be displayed
print("Encoded Document:\n", vector.toarray())

Vocabulary: {'the': 33, 'quick': 29, 'brown': 1, 'fox': 10, 'jumps': 16, 'over': 25, 'lazy': 18, 'dog': 8, 'python': 28, 'is': 15, 'great': 11, 'programming': 27, 'language': 17, 'for': 9, 'machine': 20, 'learning': 19, 'countvectorizer': 6, 'converts': 4, 'text': 32, 'into': 14, 'matrix': 21, 'of': 24, 'token': 34, 'counts': 5, 'natural': 23, 'processing': 26, 'helps': 12, 'computers': 3, 'understand': 35, 'human': 13, 'models': 22, 'require': 30, 'clean': 2, 'and': 0, 'structured': 31, 'data': 7}
Encoded Document:
 [[0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 2 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0]
 [0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 2 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1]
 [1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0]]
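
The encoded matrix is easier to read once its columns are mapped back to words. The sketch below copies the vocabulary and the first row from the output above and inverts the word-to-index mapping; the variable names `index_to_word` and `decoded` are chosen for this example only.

```python
# Vocabulary copied from the printed output (word -> column index).
vocabulary = {
    'the': 33, 'quick': 29, 'brown': 1, 'fox': 10, 'jumps': 16, 'over': 25,
    'lazy': 18, 'dog': 8, 'python': 28, 'is': 15, 'great': 11,
    'programming': 27, 'language': 17, 'for': 9, 'machine': 20,
    'learning': 19, 'countvectorizer': 6, 'converts': 4, 'text': 32,
    'into': 14, 'matrix': 21, 'of': 24, 'token': 34, 'counts': 5,
    'natural': 23, 'processing': 26, 'helps': 12, 'computers': 3,
    'understand': 35, 'human': 13, 'models': 22, 'require': 30, 'clean': 2,
    'and': 0, 'structured': 31, 'data': 7,
}

# First row of the encoded matrix ("The quick brown fox jumps over the lazy dog.").
first_row = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
             1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0]

# Invert the mapping so a column index can be looked up as a word.
index_to_word = {index: word for word, index in vocabulary.items()}

# Keep only the words that actually occur in this document.
decoded = {index_to_word[i]: count for i, count in enumerate(first_row) if count > 0}
print(decoded)
# {'brown': 1, 'dog': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'over': 1, 'quick': 1, 'the': 2}
```

The count of 2 in column 33 is the word "the", which occurs twice in the first sentence. In recent scikit-learn versions, `vectorizer.get_feature_names_out()` returns the same words in column order, which avoids inverting the dictionary by hand.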


In [2]:
# CountVectorizer with min_df and max_df
# min_df=2: keep only words that appear in at least 2 documents
# max_df=0.8: ignore words that appear in more than 80% of the documents
vectorizer = CountVectorizer(min_df=2, max_df=0.8)

vectorizer.fit(document)

print("Vocabulary with min_df and max_df:", vectorizer.vocabulary_)

vector = vectorizer.transform(document)

print("Encoded Document with min_df and max_df:\n", vector.toarray())

Vocabulary with min_df and max_df: {'language': 0, 'machine': 2, 'learning': 1}
Encoded Document with min_df and max_df:
 [[0 0 0]
 [1 1 1]
 [0 0 0]
 [2 0 0]
 [0 1 1]]
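
Only three words survive because `min_df` and `max_df` filter on document frequency, the number of documents that contain a word, rather than on total occurrences. The pure-Python sketch below recomputes that filter for the same five sentences. It is an approximation for illustration, not scikit-learn's code, and the `tokenize` helper is assumed to match the default tokenization.

```python
import re

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Python is a great programming language for machine learning.",
    "CountVectorizer converts text into a matrix of token counts.",
    "Natural language processing helps computers understand human language.",
    "Machine learning models require clean and structured data.",
]


def tokenize(text):
    """Assumed default tokenization: lowercase, tokens of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# Document frequency: in how many documents does each word appear?
doc_freq = {}
for text in documents:
    for word in set(tokenize(text)):  # set() counts each word once per document
        doc_freq[word] = doc_freq.get(word, 0) + 1

min_count = 2                     # min_df=2 is an absolute document count
max_count = 0.8 * len(documents)  # max_df=0.8 is a proportion of the documents

kept = sorted(w for w, df in doc_freq.items() if min_count <= df <= max_count)
print(kept)             # ['language', 'learning', 'machine']
print(doc_freq["the"])  # 1
```

Note that "the" occurs twice in the first sentence but in only one document, so its document frequency is 1 and `min_df=2` removes it.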
