# <center> Bag-of-Words


# <a id= 'b0'> 
<font size = 4>
    
**Table of contents:**<br>
[1. Introduction](#b1)<br>
[2. scikit-learn CountVectorizer](#b2)<br>

## <a id = 'b1'>
    
<font size = 10 color = 'midnightblue'> <b> Introduction

<div class="alert alert-block alert-success">    
<font size = 4> 
    
- A "bag of words" (BoW) is a representation of a text that describes the occurrence of words within a document.
- It is a simple and widely used technique for feature extraction in various NLP tasks.
- The method is called "bag-of-words" because it discards word order entirely and keeps only how often each word occurs.
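
The idea above can be sketched in plain Python before bringing in scikit-learn. This is a minimal illustration, not the library's implementation: the helper name `bag_of_words` and the two example sentences are made up here, and the regular expression is an assumption chosen to resemble `CountVectorizer`'s default tokenization (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word counts for `text`, ignoring word order (illustrative helper)."""
    # Lowercase, then keep tokens made of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


first = bag_of_words("the dog chased the cat")
second = bag_of_words("the cat chased the dog")

print(first)
print(first == second)  # True: both sentences map to the same bag
```

The two sentences mean different things, yet they produce identical counts, which is exactly the information a bag-of-words representation gives up.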

In [None]:
import re
import pandas as pd
from string import punctuation
from sklearn.feature_extraction.text import CountVectorizer

<font size = 5 color = seagreen><b> Create a sample dataset

In [None]:
dataset = [
    "The weather today is fantastic, with clear skies and a gentle breeze.",
    "Reading is a great way to escape reality and immerse oneself in different worlds.",
    "Climate change is a pressing global issue that requires immediate attention.",
    "Exercise is crucial for maintaining good physical and mental health.",
    "Learning a new language can be challenging but incredibly rewarding."
]

[top](#b0)

## <a id = 'b2'>
<font size = 10 color = 'midnightblue'> <b> CountVectorizer 

<div class="alert alert-block alert-success">    
<font size = 4> 

<b>`CountVectorizer` from sklearn is used to fit the bag-of-words model.</b>

<font size = 5 color = seagreen><b>Define a count vectoriser

In [None]:
# Keep at most 1000 terms, lowercase the text, and tokenize at the word level
bow = CountVectorizer(max_features=1000, lowercase=True, analyzer='word')

<font size = 5 color = seagreen><b> Fit the bag-of-words model

In [None]:
# fit() learns the vocabulary and returns the fitted vectorizer itself
bag_of_words = bow.fit(dataset)

<div class="alert alert-block alert-success">    
<font size = 4> 

After fitting, the vectorizer exposes the learned vocabulary through `get_feature_names_out()`. Each feature name corresponds to one column of the transformed matrix.

In [None]:
print(list(bow.get_feature_names_out()))

['and', 'attention', 'be', 'breeze', 'but', 'can', 'challenging', 'change', 'clear', 'climate', 'crucial', 'different', 'escape', 'exercise', 'fantastic', 'for', 'gentle', 'global', 'good', 'great', 'health', 'immediate', 'immerse', 'in', 'incredibly', 'is', 'issue', 'language', 'learning', 'maintaining', 'mental', 'new', 'oneself', 'physical', 'pressing', 'reading', 'reality', 'requires', 'rewarding', 'skies', 'that', 'the', 'to', 'today', 'way', 'weather', 'with', 'worlds']


<div class="alert alert-block alert-success">    
<font size = 4> 

`transform` returns a sparse matrix in which each row represents one sentence of the dataset and each column corresponds to one word in the vocabulary. Calling `.toarray()` converts it to a dense NumPy array.


In [None]:
vector = bow.transform(dataset).toarray()

In [None]:
pd.DataFrame(vector,
             columns=list(bow.get_feature_names_out()),
             index=[f'sent_{i}' for i in range(1, len(dataset) + 1)])

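| | and | attention | be | breeze | but | can | challenging | change | clear | climate | ... | rewarding | skies | that | the | to | today | way | weather | with | worlds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sent_1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| sent_2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| sent_3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent_4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent_5 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |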
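
Most entries in a document-term matrix are zero, which is why the vectorizer stores its result in a sparse format. The sketch below, using two made-up sentences, inspects the sparse result before densifying it. Converting with `.toarray()` is convenient for small demos like this one, but it can use a lot of memory on a large corpus.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Two short sentences, invented for this illustration
docs = ["clear skies and a gentle breeze", "reading is a great escape"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # learn the vocabulary and count in one step

print(issparse(counts))  # the result is a SciPy sparse matrix
print(counts.shape)      # (number of documents, vocabulary size)
print(counts.nnz)        # number of stored non-zero entries

dense = counts.toarray()  # dense NumPy array, suitable for small examples only
print(dense)
```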


<font size = 5 color = seagreen><b> This vectorised data can be used as features (predictors) for any ML model
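
As one possible illustration, the counts can be fed to a classifier through a scikit-learn pipeline. The dataset above has no target column, so the `labels` below are hypothetical and were invented purely for this sketch. `MultinomialNB` is just one choice of model that works with count features, and four sentences are far too few to train anything meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "The weather today is fantastic, with clear skies and a gentle breeze.",
    "Climate change is a pressing global issue that requires immediate attention.",
    "Exercise is crucial for maintaining good physical and mental health.",
    "Learning a new language can be challenging but incredibly rewarding.",
]
# Hypothetical labels, invented for this illustration only.
labels = ["environment", "environment", "lifestyle", "lifestyle"]

# The pipeline vectorizes the text and passes the counts to the classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

predictions = model.predict(["Clear skies and a gentle breeze"])
print(predictions)
```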

[top](#b0)