
## Bag of Words (BoW):
BoW is a simple and widely used text representation technique in Natural Language Processing (NLP) and machine learning. It represents each document as a vector of word counts (or presence flags) over a shared vocabulary, discarding grammar and word order, which gives models a numerical form they can process and analyze.
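
To make the idea concrete, here is a minimal sketch. The two sentences and the helper name `bow_vector` are illustrative assumptions, not part of the notebook's pipeline. It shows that a BoW vector records only how often each word occurs, so two sentences with the same words in a different order map to the same vector.

```python
from collections import Counter

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in `text` (illustrative helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

sentences = ["dog bites man", "man bites dog"]

# Shared vocabulary: sorted unique words across both sentences
vocab = sorted({word for s in sentences for word in s.lower().split()})

vectors = [bow_vector(s, vocab) for s in sentences]
print(vocab)    # ['bites', 'dog', 'man']
print(vectors)  # [[1, 1, 1], [1, 1, 1]]
```

Both sentences produce identical vectors even though their meanings differ, which is the main limitation to keep in mind when using BoW.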

## BoW Implementation code

### Step 1: Import libraries and read the data

In [1]:
# Import libraries
import re
import numpy as np

In [6]:
# Define the dataset: three short documents
documents = [
    """In the heart of the city, where the streets hum with the rhythm of daily life, a small café stood nestled
    between towering buildings. The aroma of freshly brewed coffee mingled with the crisp morning air, drawing in
    early risers and weary travelers alike. Inside, the café buzzed with quiet conversations, the clinking of cups,
    and the occasional rustle of newspaper pages.""",

    """Among the patrons sat an old man, his eyes filled with the weight of years, gazing out the window as the world
    passed by. Across from him, a young woman typed furiously on her laptop, her brow furrowed in concentration.
    The barista, a cheerful fellow with a knack for remembering names, moved gracefully behind the counter,
    crafting intricate patterns in the frothy tops of cappuccinos.""",

    """As the morning stretched into afternoon, the café remained a sanctuary—a temporary escape from the relentless
    pace of the world outside. The city continued its symphony of honking horns and hurried footsteps, but within
    these walls, time seemed to slow, allowing stories to unfold in whispered exchanges and silent reflections."""
]

documents

['In the heart of the city, where the streets hum with the rhythm of daily life, a small café stood nestled \n    between towering buildings. The aroma of freshly brewed coffee mingled with the crisp morning air, drawing in \n    early risers and weary travelers alike. Inside, the café buzzed with quiet conversations, the clinking of cups, \n    and the occasional rustle of newspaper pages.',
 'Among the patrons sat an old man, his eyes filled with the weight of years, gazing out the window as the world \n    passed by. Across from him, a young woman typed furiously on her laptop, her brow furrowed in concentration. \n    The barista, a cheerful fellow with a knack for remembering names, moved gracefully behind the counter, \n    crafting intricate patterns in the frothy tops of cappuccinos.',
 'As the morning stretched into afternoon, the café remained a sanctuary—a temporary escape from the relentless \n    pace of the world outside. The city continued its symphony of honking horns a

### Tokenization:
- Split documents into words (tokens).

In [8]:
# Tokenize the corpus: lowercase each document and split on whitespace (tokenizer function)
def tokenizer(docs):
    return [doc.lower().split() for doc in docs]

tokenized_doc = tokenizer(documents)
print(tokenized_doc)

[['in', 'the', 'heart', 'of', 'the', 'city,', 'where', 'the', 'streets', 'hum', 'with', 'the', 'rhythm', 'of', 'daily', 'life,', 'a', 'small', 'café', 'stood', 'nestled', 'between', 'towering', 'buildings.', 'the', 'aroma', 'of', 'freshly', 'brewed', 'coffee', 'mingled', 'with', 'the', 'crisp', 'morning', 'air,', 'drawing', 'in', 'early', 'risers', 'and', 'weary', 'travelers', 'alike.', 'inside,', 'the', 'café', 'buzzed', 'with', 'quiet', 'conversations,', 'the', 'clinking', 'of', 'cups,', 'and', 'the', 'occasional', 'rustle', 'of', 'newspaper', 'pages.'], ['among', 'the', 'patrons', 'sat', 'an', 'old', 'man,', 'his', 'eyes', 'filled', 'with', 'the', 'weight', 'of', 'years,', 'gazing', 'out', 'the', 'window', 'as', 'the', 'world', 'passed', 'by.', 'across', 'from', 'him,', 'a', 'young', 'woman', 'typed', 'furiously', 'on', 'her', 'laptop,', 'her', 'brow', 'furrowed', 'in', 'concentration.', 'the', 'barista,', 'a', 'cheerful', 'fellow', 'with', 'a', 'knack', 'for', 'remembering', 'names
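
Note that `split()` leaves punctuation attached, so `'city,'` and `'city'` end up as different tokens. One possible alternative, sketched below with the `re` module imported earlier, extracts runs of word characters instead. The function name `regex_tokenizer` and the pattern are assumptions for illustration; the rest of the notebook keeps using `tokenizer`, so the outputs shown here are unchanged.

```python
import re

def regex_tokenizer(docs):
    """Lowercase each document and keep only runs of word characters."""
    # \w+ matches letters, digits, and underscores; for str patterns in
    # Python 3 it is Unicode-aware, so accented letters stay inside a token.
    return [re.findall(r"\w+", doc.lower()) for doc in docs]

sample = ["In the heart of the city, a small shop stood.", "The city continued."]
tokens = regex_tokenizer(sample)
print(tokens)
```

With this variant, `'city'` appears in both sample documents as the same token, which would shrink the vocabulary and make counts across documents comparable.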

### Build Vocabulary:
- Create a unique set of words from all documents.

In [9]:
# Build a sorted vocabulary of unique tokens (build_vocab function)
def build_vocab(tokenized_docs):
    vocab = set()
    for tokens in tokenized_docs:
        vocab.update(tokens)
    return sorted(vocab)

vocab = build_vocab(tokenized_doc)
print(vocab)

['a', 'across', 'afternoon,', 'air,', 'alike.', 'allowing', 'among', 'an', 'and', 'aroma', 'as', 'barista,', 'behind', 'between', 'brewed', 'brow', 'buildings.', 'but', 'buzzed', 'by.', 'café', 'cappuccinos.', 'cheerful', 'city', 'city,', 'clinking', 'coffee', 'concentration.', 'continued', 'conversations,', 'counter,', 'crafting', 'crisp', 'cups,', 'daily', 'drawing', 'early', 'escape', 'exchanges', 'eyes', 'fellow', 'filled', 'footsteps,', 'for', 'freshly', 'from', 'frothy', 'furiously', 'furrowed', 'gazing', 'gracefully', 'heart', 'her', 'him,', 'his', 'honking', 'horns', 'hum', 'hurried', 'in', 'inside,', 'into', 'intricate', 'its', 'knack', 'laptop,', 'life,', 'man,', 'mingled', 'morning', 'moved', 'names,', 'nestled', 'newspaper', 'occasional', 'of', 'old', 'on', 'out', 'outside.', 'pace', 'pages.', 'passed', 'patrons', 'patterns', 'quiet', 'reflections.', 'relentless', 'remained', 'remembering', 'rhythm', 'risers', 'rustle', 'sanctuary—a', 'sat', 'seemed', 'silent', 'slow,', 'sm

### Vectorization:
- Convert each document into a vector:
  *   Count-based: Number of times each word appears.
  *   Binary-based: Presence (1) or absence (0) of words.
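
The difference between the two representations can be seen on a single toy document. This is a minimal sketch with made-up tokens and a made-up vocabulary, not the notebook's data:

```python
from collections import Counter

# Toy token list and vocabulary (illustrative values)
tokens = ["the", "café", "the", "city", "the"]
vocab = ["café", "city", "coffee", "the"]

counts = Counter(tokens)

# Count-based: number of occurrences of each vocabulary word
count_vector = [counts[word] for word in vocab]

# Binary-based: 1 if the word occurs at least once, else 0
binary_vector = [1 if counts[word] > 0 else 0 for word in vocab]

print(count_vector)   # [1, 1, 0, 3]
print(binary_vector)  # [1, 1, 0, 1]
```

The binary vector discards the fact that "the" occurs three times, which can be useful when presence matters more than frequency.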



In [15]:
# Vectorization: Create BoW matrix
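def vectorize_docs(tokenized_docs, vocab, mode="count"):
    """
    Build a BoW matrix with one row per document and one column per vocabulary word.

    mode: "count" for count-based, "binary" for binary-based representation.
    """
    if mode not in ("count", "binary"):
        raise ValueError("Unsupported mode. Choose 'count' or 'binary'.")
    num_docs = len(tokenized_docs)
    vocab_size = len(vocab)
    # Initialize matrix with zeros
    bow_matrix = np.zeros((num_docs, vocab_size), dtype=int)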

    # Map each word in vocabulary to its index
    vocab_index = {word: idx for idx, word in enumerate(vocab)}

    for i, tokens in enumerate(tokenized_docs):
        for token in tokens:
            idx = vocab_index[token]
            if mode == "count":
                bow_matrix[i, idx] += 1
            elif mode == "binary":
                bow_matrix[i, idx] = 1
    return bow_matrix

# Create count-based BoW matrix
bow_count = vectorize_docs(tokenized_doc, vocab, mode="count")
print("\nCount-based BoW Matrix:")
print(bow_count)


Count-based BoW Matrix:
[[1 0 0 1 1 0 0 0 2 1 0 0 0 1 1 0 1 0 1 0 2 0 0 0 1 1 1 0 0 1 0 0 1 1 1 1
  1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 2 1 0 0 0 0 0 1 0 1 1 0 0
  1 1 1 5 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 9 0 0
  0 0 1 1 0 0 0 1 0 1 0 0 3 0 0 0 0 0]
 [3 1 0 0 0 0 1 1 0 0 1 1 1 0 0 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 1 1 0 0 0 0
  0 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 2 1 1 0 0 0 0 2 0 0 1 0 1 1 0 1 0 0 1 1
  0 0 0 2 1 1 1 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 7 0 0
  0 1 0 0 1 0 0 0 1 0 0 1 2 0 1 1 1 1]
 [1 0 1 0 0 1 0 0 2 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
  0 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0
  0 0 0 2 0 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 0 1 1 1 0 0 1 0 1 1 1 5 1 1
  2 0 0 0 0 1 1 0 0 0 1 0 0 1 0 1 0 0]]
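
The raw matrix is hard to read on its own, because each column only has meaning through its position in the vocabulary. A small sketch of how a row could be mapped back to words follows; the vocabulary and counts are toy values chosen for illustration, not the notebook's output.

```python
# Toy vocabulary and one row of counts (illustrative values)
vocab = ["a", "café", "city", "of", "the"]
row = [1, 2, 0, 5, 9]

# Pair each word with its count, drop zeros, and sort by count (descending)
word_counts = [(word, count) for word, count in zip(vocab, row) if count > 0]
top_words = sorted(word_counts, key=lambda pair: pair[1], reverse=True)[:3]

print(top_words)  # [('the', 9), ('of', 5), ('café', 2)]
```

With the notebook's variables, the same idea would apply to `zip(vocab, bow_count[0])` for the first document.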


In [16]:
# Create binary-based BoW matrix
bow_binary = vectorize_docs(tokenized_doc, vocab, mode="binary")
print("\nBinary-based BoW Matrix:")
print(bow_binary)


Binary-based BoW Matrix:
[[1 0 0 1 1 0 0 0 1 1 0 0 0 1 1 0 1 0 1 0 1 0 0 0 1 1 1 0 0 1 0 0 1 1 1 1
  1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0
  1 1 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0
  0 0 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0]
 [1 1 0 0 0 0 1 1 0 0 1 1 1 0 0 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 1 1 0 0 0 0
  0 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 1 1 1 0 0 0 0 1 0 0 1 0 1 1 0 1 0 0 1 1
  0 0 0 1 1 1 1 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
  0 1 0 0 1 0 0 0 1 0 0 1 1 0 1 1 1 1]
 [1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
  0 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0
  0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 0 1 1 1 0 0 1 0 1 1 1 1 1 1
  1 0 0 0 0 1 1 0 0 0 1 0 0 1 0 1 0 0]]
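
`vectorize_docs` looks each token up with `vocab_index[token]`, so a document containing a word that is not in `vocab` would raise a `KeyError`. One common way to handle this, sketched below as an assumption rather than the notebook's method, is to skip unknown tokens when vectorizing new text. The helper name `vectorize_new_doc` is hypothetical.

```python
def vectorize_new_doc(tokens, vocab):
    """Count-based vector for one token list, ignoring out-of-vocabulary tokens."""
    vocab_index = {word: idx for idx, word in enumerate(vocab)}
    vector = [0] * len(vocab)
    for token in tokens:
        idx = vocab_index.get(token)  # None if the token is not in the vocabulary
        if idx is not None:
            vector[idx] += 1
    return vector

vocab = ["café", "city", "the"]
new_tokens = ["the", "museum", "in", "the", "city"]
new_vector = vectorize_new_doc(new_tokens, vocab)
print(new_vector)  # [0, 1, 2]
```

Here "museum" and "in" are silently dropped because they were never seen when the vocabulary was built.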


In [20]:
# Normalization function to normalize each document vector
def normalize_bow_matrix(bow_matrix, norm='l1'):
    """
    Normalizes the BoW matrix using L1 or L2 normalization.

    Parameters:
      bow_matrix : numpy.ndarray
          The BoW matrix.
      norm : str, optional
          Type of normalization ('l1' or 'l2'). Default is 'l1'.

    Returns:
      normalized_matrix : numpy.ndarray
          The normalized BoW matrix.
    """
    if norm == 'l1':
        # L1 normalization: divide each row by its sum
        row_sums = bow_matrix.sum(axis=1, keepdims=True)
        normalized_matrix = bow_matrix / np.where(row_sums == 0, 1, row_sums)
    elif norm == 'l2':
        # L2 normalization: divide each row by its Euclidean norm
        row_norms = np.linalg.norm(bow_matrix, axis=1, keepdims=True)
        normalized_matrix = bow_matrix / np.where(row_norms == 0, 1, row_norms)
    else:
        raise ValueError("Unsupported normalization type. Choose 'l1' or 'l2'.")
    return normalized_matrix

# Normalize the count-based BoW matrix (using L1 normalization)
normalized_bow = normalize_bow_matrix(bow_count, norm='l1')
print("\nNormalized (L1) Count-based BoW Matrix:")
print(normalized_bow)

# Optionally, you can wrap the entire BoW pipeline into one function that returns the normalized BoW matrix and vocabulary
def bag_of_words_pipeline(documents, mode='count', norm='l1'):
    # Tokenize documents
    tokenized_docs = tokenizer(documents)
    # Build vocabulary
    vocab = build_vocab(tokenized_docs)
    # Vectorize documents
    bow_matrix = vectorize_docs(tokenized_docs, vocab, mode=mode)
    # Normalize the BoW matrix
    normalized_matrix = normalize_bow_matrix(bow_matrix, norm=norm)
    return normalized_matrix, vocab

# Get final normalized BoW matrix and vocabulary
final_normalized_bow, final_vocab = bag_of_words_pipeline(documents, mode="count", norm="l1")
print("\nFinal Normalized BoW Matrix and Vocabulary:")
print(final_normalized_bow)
print(final_vocab)


Normalized (L1) Count-based BoW Matrix:
[[0.01612903 0.         0.         0.01612903 0.01612903 0.
  0.         0.         0.03225806 0.01612903 0.         0.
  0.         0.01612903 0.01612903 0.         0.01612903 0.
  0.01612903 0.         0.03225806 0.         0.         0.
  0.01612903 0.01612903 0.01612903 0.         0.         0.01612903
  0.         0.         0.01612903 0.01612903 0.01612903 0.01612903
  0.01612903 0.         0.         0.         0.         0.
  0.         0.         0.01612903 0.         0.         0.
  0.         0.         0.         0.01612903 0.         0.
  0.         0.         0.         0.01612903 0.         0.03225806
  0.01612903 0.         0.         0.         0.         0.
  0.01612903 0.         0.01612903 0.01612903 0.         0.
  0.01612903 0.01612903 0.01612903 0.08064516 0.         0.
  0.         0.         0.         0.01612903 0.         0.
  0.         0.01612903 0.         0.         0.         0.
  0.01612903 0.01612903 0.01612903 
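
As a small worked example of the two options in `normalize_bow_matrix`, the sketch below normalizes a single toy count vector in plain Python (the values are illustrative). L1 divides by the sum of the entries, so the result sums to 1 and can be read as relative frequencies. L2 divides by the Euclidean length, so the result has unit length.

```python
import math

counts = [3, 4, 0]  # toy count vector (illustrative)

l1_total = sum(abs(c) for c in counts)             # 3 + 4 + 0 = 7
l2_length = math.sqrt(sum(c * c for c in counts))  # sqrt(9 + 16) = 5.0

l1_vector = [c / l1_total for c in counts]
l2_vector = [c / l2_length for c in counts]

print(l1_vector)  # entries sum to 1
print(l2_vector)  # [0.6, 0.8, 0.0]
```

This also explains the values in the output above: the first row of the count matrix sums to 62, so every word that occurs once becomes 1/62 ≈ 0.01612903 after L1 normalization.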