# 5.2 Bag of Words

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
data = ['Most shark attacks occur about 10 feet from the beach since that is where the people are',
        'the efficiency with which he paired the socks in the drawer was quite admirable',
        'carol drank the blood as if she were a vampire',
        'giving directions that the mountains are to the west only works when you can see them',
        'the sign said there was road work ahead so he decided to speed up',
        'the gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms']

In [3]:
# CountVectorizer turns a collection of texts into a matrix of token counts
countvec = CountVectorizer()
# The vectorizer can also be customized (e.g. stop_words to drop stop words, ngram_range to count n-grams)

In [4]:
# Fit the vectorizer to the data and transform the data into a sparse matrix
# This combines .fit() (learn the vocabulary) and .transform() (apply it to data)
countvec_fit = countvec.fit_transform(data)

In [5]:
# Convert the sparse matrix into a DataFrame for easier viewing
# get_feature_names_out() returns the vocabulary words, used here as column headers
bag_of_words = pd.DataFrame(countvec_fit.toarray(), columns=countvec.get_feature_names_out())

In [6]:
print(bag_of_words)

   10  about  admirable  ahead  are  as  attacks  back  bait  beach  ...  \
0   1      1          0      0    1   0        1     0     0      1  ...   
1   0      0          1      0    0   0        0     0     0      0  ...   
2   0      0          0      0    0   1        0     0     0      0  ...   
3   0      0          0      0    1   0        0     0     0      0  ...   
4   0      0          0      1    0   0        0     0     0      0  ...   
5   0      0          0      0    0   1        0     1     1      0  ...   

   were  west  when  where  which  with  work  works  worms  you  
0     0     0     0      1      0     0     0      0      0    0  
1     0     0     0      0      1     1     0      0      0    0  
2     1     0     0      0      0     0     0      0      0    0  
3     0     1     1      0      0     0     0      1      0    1  
4     0     0     0      0      0     0     1      0      0    0  
5     0     0     0      0      0     0     0      0      1    0 

## Another example

In [2]:
data = ["cats sleep a lot", "dogs run fast"]

countvec = CountVectorizer()
matrix = countvec.fit_transform(data)
df = pd.DataFrame(matrix.toarray(), columns=countvec.get_feature_names_out())

print(df)

   cats  dogs  fast  lot  run  sleep
0     1     0     0    1    0      1
1     0     1     1    0    1      0


## What I Learned

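- Bag-of-Words represents each text as a **word frequency vector**: one column per vocabulary word, one count per text, with word order discarded.

- CountVectorizer() tokenizes text and builds a vocabulary. By default it lowercases the text and ignores single-character tokens such as "a".

- The resulting sparse matrix can be converted into a DataFrame for readability on small examples; for large corpora, keeping it sparse saves memory.

- Bag-of-Words is a common preprocessing step in **text classification and clustering tasks**.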
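
To check the first two points, here is a rough pure-Python sketch of a Bag-of-Words built without scikit-learn. The tokenize helper is hypothetical and only approximates CountVectorizer's default behavior (lowercasing, tokens of two or more characters); it is not the library's actual implementation.

```python
import re
from collections import Counter

docs = ["cats sleep a lot", "dogs run fast"]

def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token across all texts, sorted alphabetically
vocab = sorted({token for doc in docs for token in tokenize(doc)})

# One count vector per text, with one position per vocabulary word
vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    vectors.append([counts[word] for word in vocab])

print(vocab)
print(vectors)
```

For these two texts, the vocabulary and vectors match the table printed in the second example.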