# Text Vectorization with Scikit-learn

This notebook demonstrates how to vectorize text data using different methods provided by the `scikit-learn` library:

- Count Vectorizer (Bag of Words)
- n-grams
- TF-IDF

## 1. Preparing the Data

We will use a small set of sample text data for this demonstration.


In [3]:
# The PyPI package is named scikit-learn; sklearn is only the import name
%pip install scikit-learn

# Sample text data
documents = [
    "Natural language processing is fun and exciting.",
    "Machine learning is a part of data science.",
    "Data science involves statistics and programming.",
    "Python is a great programming language for machine learning."
]

# Display the data
for i, doc in enumerate(documents, 1):
    print(f"Document {i}: {doc}")

## 2. Count Vectorizer (Bag of Words)
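`CountVectorizer` converts a collection of text documents into a matrix of token counts, with one row per document and one column per vocabulary term. With the default settings, text is lowercased and tokens shorter than two characters (such as "a") are dropped.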



In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
X = vectorizer.fit_transform(documents)

# Display the feature names (words)
print("Feature Names:", vectorizer.get_feature_names_out())

# Display the Bag of Words matrix
print("\nBag of Words Matrix:")
print(X.toarray())

Feature Names: ['and' 'data' 'exciting' 'for' 'fun' 'great' 'involves' 'is' 'language'
 'learning' 'machine' 'natural' 'of' 'part' 'processing' 'programming'
 'python' 'science' 'statistics']

Bag of Words Matrix:
[[1 0 1 0 1 0 0 1 1 0 0 1 0 0 1 0 0 0 0]
 [0 1 0 0 0 0 0 1 0 1 1 0 1 1 0 0 0 1 0]
 [1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1]
 [0 0 0 1 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0]]
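
To make the matrix above less opaque, here is a minimal pure-Python sketch that rebuilds the same counts by hand. It is not how scikit-learn is implemented internally; the helper `tokenize` and the names `vocabulary` and `bow_rows` are made up for illustration, and the regular expression is assumed to mirror `CountVectorizer`'s default `token_pattern`.

```python
import re
from collections import Counter

documents = [
    "Natural language processing is fun and exciting.",
    "Machine learning is a part of data science.",
    "Data science involves statistics and programming.",
    "Python is a great programming language for machine learning.",
]

def tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of two or more
    # word characters, so single-character tokens like "a" are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})

# One row of counts per document, one column per vocabulary term
bow_rows = []
for doc in documents:
    counts = Counter(tokenize(doc))
    bow_rows.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_rows:
    print(row)
```

The vocabulary has 19 entries and does not contain "a", which is why the second and fourth documents have no column for it.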


## 3. Using n-grams
An n-gram is a contiguous sequence of n tokens. Here, we demonstrate bi-grams (n=2) by setting `ngram_range=(2, 2)`, which extracts only two-word sequences.

In [4]:
# Configure the CountVectorizer for bi-grams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

# Fit and transform the text data
X_bigram = bigram_vectorizer.fit_transform(documents)

# Display the feature names (bi-grams)
print("Bi-gram Feature Names:", bigram_vectorizer.get_feature_names_out())

# Display the bi-gram matrix
print("\nBi-gram Matrix:")
print(X_bigram.toarray())


Bi-gram Feature Names: ['and exciting' 'and programming' 'data science' 'for machine' 'fun and'
 'great programming' 'involves statistics' 'is fun' 'is great' 'is part'
 'language for' 'language processing' 'learning is' 'machine learning'
 'natural language' 'of data' 'part of' 'processing is'
 'programming language' 'python is' 'science involves' 'statistics and']

Bi-gram Matrix:
[[1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 1 1 0 0 0 0 0]
 [0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1]
 [0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1 0 0]]
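
The bi-gram features above can be reproduced with a sliding window over each document's tokens. The following is a small sketch under the same tokenization assumption as the default `CountVectorizer` (lowercasing, tokens of two or more word characters); `tokenize` and `ngrams` are hypothetical helper names, not scikit-learn functions.

```python
import re

def tokenize(text):
    # Hypothetical helper: lowercase and drop single-character tokens
    return re.findall(r"\b\w\w+\b", text.lower())

def ngrams(tokens, n):
    # Slide a window of width n over the token list and join each window
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Machine learning is a part of data science.")
bigrams = ngrams(tokens, 2)
print(bigrams)
```

Because "a" is removed before the window is applied, "is" and "part" become neighbours, which explains the feature `'is part'` in the output even though that phrase never appears verbatim in the text.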


## 4. TF-IDF Vectorization
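TF-IDF (Term Frequency-Inverse Document Frequency) weights each term by how often it occurs in a document, then scales that weight down for terms that appear in many documents. Words common to the whole corpus (such as "is") therefore contribute less than distinctive ones.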
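
As a rough illustration of that weighting, the sketch below recomputes the TF-IDF weights for the first document in plain Python. It assumes scikit-learn's default settings: raw term counts, the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. The names `tokenize`, `idf`, and `tfidf` are illustrative only.

```python
import math
import re
from collections import Counter

documents = [
    "Natural language processing is fun and exciting.",
    "Machine learning is a part of data science.",
    "Data science involves statistics and programming.",
    "Python is a great programming language for machine learning.",
]

def tokenize(text):
    # Hypothetical helper mirroring the default tokenization
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency (assumed default: smooth_idf=True)
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(tokens):
    # Term count times idf, then scale the row to unit L2 norm
    tf = Counter(tokens)
    weights = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf(tokenized[0])
for term in sorted(first):
    print(f"{term}: {first[term]:.8f}")
```

In the first document, "is" (present in three of the four documents) receives a lower weight than "exciting" (present in only one), even though each occurs once.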

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the text data
X_tfidf = tfidf_vectorizer.fit_transform(documents)

# Display the feature names (words)
print("TF-IDF Feature Names:", tfidf_vectorizer.get_feature_names_out())

# Display the TF-IDF matrix
print("\nTF-IDF Matrix:")
print(X_tfidf.toarray())


TF-IDF Feature Names: ['and' 'data' 'exciting' 'for' 'fun' 'great' 'involves' 'is' 'language'
 'learning' 'machine' 'natural' 'of' 'part' 'processing' 'programming'
 'python' 'science' 'statistics']

TF-IDF Matrix:
[[0.33166972 0.         0.42068099 0.         0.42068099 0.
  0.         0.26851522 0.33166972 0.         0.         0.42068099
  0.         0.         0.42068099 0.         0.         0.
  0.        ]
 [0.         0.35639424 0.         0.         0.         0.
  0.         0.28853185 0.         0.35639424 0.35639424 0.
  0.4520409  0.4520409  0.         0.         0.         0.35639424
  0.        ]
 [0.37222485 0.37222485 0.         0.         0.         0.
  0.47212003 0.         0.         0.         0.         0.
  0.         0.         0.         0.37222485 0.         0.37222485
  0.47212003]
 [0.         0.         0.         0.41191063 0.         0.41191063
  0.         0.26291722 0.32475507 0.32475507 0.32475507 0.
  0.         0.         0.         0.32475507 0.411