# Text Vectorization with Scikit-learn

This notebook demonstrates how to vectorize text data using different methods provided by the `scikit-learn` library:

- Count Vectorizer (Bag of Words)
- n-grams
- TF-IDF

## 1. Preparing the Data

We will use a small set of sample text data for this demonstration.


In [None]:
%pip install scikit-learn

# Sample text data
documents = [
    "Natural language processing is fun and exciting.",
    "Machine learning is a part of data science.",
    "Data science involves statistics and programming.",
    "Python is a great programming language for machine learning."
]

# Display the data
for i, doc in enumerate(documents, 1):
    print(f"Document {i}: {doc}")

## 2. Count Vectorizer (Bag of Words)
`CountVectorizer` converts a collection of documents into a sparse matrix of token counts, with one row per document and one column per vocabulary term.



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
X = vectorizer.fit_transform(documents)

# Display the feature names (words)
print("Feature Names:", vectorizer.get_feature_names_out())

# Display the Bag of Words matrix
print("\nBag of Words Matrix:")
print(X.toarray())
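
To make the matrix easier to read, the sketch below looks up a single column through the fitted vectorizer's `vocabulary_` attribute, which maps each term to its column index. It repeats the sample documents so it runs on its own; the names `vocab` and `is_counts` are illustrative and not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Natural language processing is fun and exciting.",
    "Machine learning is a part of data science.",
    "Data science involves statistics and programming.",
    "Python is a great programming language for machine learning.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# vocabulary_ maps each term to its column index in the count matrix
vocab = vectorizer.vocabulary_

# Counts of the term "is" in each of the four documents
is_counts = X.toarray()[:, vocab["is"]]
print("Counts for 'is':", is_counts)

# With default settings, text is lowercased and one-character tokens are dropped
print("'a' in vocabulary:", "a" in vocab)
print("Matrix shape (documents, terms):", X.shape)
```

Note that the word "a" does not get a column: the default `token_pattern` only keeps tokens of two or more alphanumeric characters.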

## 3. Using n-grams
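An n-gram is a contiguous sequence of n tokens (here, words). The `ngram_range=(min_n, max_n)` parameter controls which n-gram lengths are extracted; here, we demonstrate bi-grams (n=2) only.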

In [None]:
# Configure the CountVectorizer for bi-grams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

# Fit and transform the text data
X_bigram = bigram_vectorizer.fit_transform(documents)

# Display the feature names (bi-grams)
print("Bi-gram Feature Names:", bigram_vectorizer.get_feature_names_out())

# Display the bi-gram matrix
print("\nBi-gram Matrix:")
print(X_bigram.toarray())


## 4. TF-IDF Vectorization
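TF-IDF (Term Frequency-Inverse Document Frequency) weights each term by how often it appears in a document (term frequency), scaled down by how many documents in the collection contain it (inverse document frequency). Terms that appear in most documents, such as "is", receive lower weights than terms concentrated in a few documents.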

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the text data
X_tfidf = tfidf_vectorizer.fit_transform(documents)

# Display the feature names (words)
print("TF-IDF Feature Names:", tfidf_vectorizer.get_feature_names_out())

# Display the TF-IDF matrix
print("\nTF-IDF Matrix:")
print(X_tfidf.toarray())
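
To show where these numbers come from, the sketch below recomputes the matrix by hand with NumPy and compares it with the `TfidfVectorizer` output. It assumes the default settings (`smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`), under which scikit-learn uses idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then scales each row to unit Euclidean length. The names `counts`, `doc_freq`, and `manual_tfidf` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "Natural language processing is fun and exciting.",
    "Machine learning is a part of data science.",
    "Data science involves statistics and programming.",
    "Python is a great programming language for machine learning.",
]

# Step 1: raw term counts (term frequency)
counts = CountVectorizer().fit_transform(documents).toarray()
n_docs = counts.shape[0]

# Step 2: document frequency = number of documents containing each term
doc_freq = (counts > 0).sum(axis=0)

# Step 3: smoothed inverse document frequency (scikit-learn's default form)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 4: weight the counts, then L2-normalize each document row
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library result
reference = TfidfVectorizer().fit_transform(documents).toarray()
print("Manual result matches TfidfVectorizer:", np.allclose(manual_tfidf, reference))
```

Changing any of those defaults (for example `smooth_idf=False` or `norm=None`) changes the formula, so this manual version would no longer match.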
