# Text Feature Extraction: Bag-of-Words Model

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
from scipy.sparse import csr_matrix

In [2]:
doc = ["This is a good city",
      "You are good human",
      "This is worth fight for your own worth"]

## Sklearn CountVectorizer

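`token_pattern` is set explicitly because CountVectorizer's default pattern only keeps tokens of two or more characters, so single-character words such as "a" would otherwise be dropped.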
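
The effect of the two patterns can be checked with Python's `re` module alone. This is a minimal sketch: the constant and variable names are illustrative, and `DEFAULT_PATTERN` is the default value documented for `CountVectorizer`'s `token_pattern`.

```python
import re

# Default CountVectorizer pattern: a token needs at least two word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern used in this notebook: a token is one or more word characters.
CUSTOM_PATTERN = r"(?u)\b\w+\b"

sentence = "this is a good city"

default_tokens = re.findall(DEFAULT_PATTERN, sentence)
custom_tokens = re.findall(CUSTOM_PATTERN, sentence)

print(default_tokens)  # ['this', 'is', 'good', 'city']
print(custom_tokens)   # ['this', 'is', 'a', 'good', 'city']
```

With the default pattern the word "a" never reaches the vocabulary, which is why the custom pattern is passed in the next cell.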

In [3]:
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
count_occurrences = cv.fit_transform(doc)

In [4]:
count_occurrences.toarray()

array([[1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 2, 0, 1]], dtype=int64)

In [5]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_vect_df = pd.DataFrame(data=count_occurrences.toarray(), columns=cv.get_feature_names_out())

In [6]:
count_vect_df

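|   | a | are | city | fight | for | good | human | is | own | this | worth | you | your |
|---|---|-----|------|-------|-----|------|-------|----|-----|------|-------|-----|------|
| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 2 | 0 | 1 |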


Now let's implement it from scratch.

## Manually (Scratch)

Convert the text to lowercase so that "this" and "This" are treated as the same word (CountVectorizer also lowercases by default).

In [7]:
doc = list(map(str.lower, doc))

Tokenize the documents and build a vocabulary of the unique words.
Then assign each vocabulary word an index, in sorted order.

In [8]:
unique_words = set(" ".join(doc).split())
index_dict = {}
for ind, i in enumerate(sorted(unique_words)):
    index_dict[i] = ind
index_dict

{'a': 0,
 'are': 1,
 'city': 2,
 'fight': 3,
 'for': 4,
 'good': 5,
 'human': 6,
 'is': 7,
 'own': 8,
 'this': 9,
 'worth': 10,
 'you': 11,
 'your': 12}

Create a sparse matrix:
- Iterate over each doc (sentence) in the corpus and get its word counts.
- To build the sparse matrix, create lists for the row indices, column indices, and values.
    - "row": the index of the document in the corpus
    - "col": the index of the word in index_dict
    - "val": the count of the word from count_dict
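
The row/col/val layout can be illustrated without SciPy. The following is a minimal sketch on made-up toy data (2 documents, 3 vocabulary words); all names prefixed with `toy_` are illustrative and not part of the notebook.

```python
# Toy example: 2 documents (rows) and 3 vocabulary words (columns).
n_rows, n_cols = 2, 3

# Each (row, col, val) triplet records one non-zero count.
toy_row = [0, 0, 1]
toy_col = [0, 2, 1]
toy_val = [1, 2, 1]

# Expand the triplets into a dense list of lists; cells without a triplet stay 0.
dense = [[0] * n_cols for _ in range(n_rows)]
for r, c, v in zip(toy_row, toy_col, toy_val):
    dense[r][c] += v

print(dense)  # [[1, 0, 2], [0, 1, 0]]
```

Only the non-zero cells are stored, which is what keeps the representation compact when the vocabulary is large and each document uses few of its words.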

In [9]:
row, col, val = [], [], []
for idx, text in enumerate(doc):
    # Count each word in the sentence
    count_dict = Counter(text.split())
    for word, count in count_dict.items():
        row.append(idx)               # document index
        col.append(index_dict[word])  # vocabulary index of the word
        val.append(count)             # occurrences of the word in this document

#### Scratch implementation result

In [10]:
print(csr_matrix((val, (row, col)), shape=(len(doc), len(index_dict))).toarray())

[[1 0 1 0 0 1 0 1 0 1 0 0 0]
 [0 1 0 0 0 1 1 0 0 0 0 1 0]
 [0 0 0 1 1 0 0 1 1 1 2 0 1]]


#### Count Vectorizer result

In [11]:
count_occurrences.toarray()

array([[1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 2, 0, 1]], dtype=int64)

The scratch implementation produces the same count matrix as CountVectorizer.
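
The scratch steps can also be collected into one function. This is a rough sketch under a few assumptions: `bag_of_words` is a hypothetical helper name, it returns plain dense lists instead of a sparse matrix, and it tokenizes with the regex passed to CountVectorizer earlier rather than `str.split()`. The two tokenizers agree on this corpus only because the sentences contain no punctuation; `split()` would keep a token such as "city!" intact.

```python
import re
from collections import Counter

# Same pattern that was passed to CountVectorizer earlier in the notebook.
TOKEN_PATTERN = re.compile(r"(?u)\b\w+\b")


def bag_of_words(docs):
    """Return (sorted vocabulary, dense count matrix) for a list of strings."""
    # Lowercase and tokenize every document.
    tokenized = [TOKEN_PATTERN.findall(d.lower()) for d in docs]
    # Sorted unique words define the column order.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for toks in tokenized:
        counts_row = [0] * len(vocabulary)
        for word, count in Counter(toks).items():
            counts_row[index[word]] = count
        matrix.append(counts_row)
    return vocabulary, matrix


vocab, counts = bag_of_words(["This is a good city",
                              "You are good human",
                              "This is worth fight for your own worth"])
print(vocab)
for counts_row in counts:
    print(counts_row)
```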

# END