# Extracting the Frequency of Terms Using a Bag-of-Words Model

#### - Machine learning algorithms need numeric data to work with.
#### - This model extracts a vocabulary from all the words in the documents and builds a model using a document-term matrix.
#### - Each document is treated as a bag of words.
#### - Bag of words === keep track of word counts and disregard grammatical details and word order.

### - A document-term matrix is a table that gives the counts of the various words that occur in each document.
#### - Text document === weighted combination of various words.
#### - The weight is the importance of the word (obtained by tf-idf or deep learning).
#### - Choose the more meaningful words using thresholds.
#### - Feature vector --> text classification
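
The tf-idf weighting mentioned above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy documents are hypothetical, and the formula used (term frequency `count / len(tokens)` times `ln(N / df)`) is one common unsmoothed variant; libraries typically apply smoothed versions.

```python
import math
from collections import Counter

# Hypothetical toy documents, used only for illustration
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Return {term: weight} with tf = count / len(tokens) and idf = ln(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in tokenized]

# "the" occurs in every document, so its weight is 0;
# "mat" occurs in only one document, so it outweighs "cat"
print(weights[0])
```

A word that appears in every document gets zero weight under this variant, which is the intuition behind using thresholds to keep only the more informative words.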

Example:
- sentence 1: The children are playing in the hall
- sentence 2: The hall has a lot of space
- sentence 3: Lots of children like playing in an open space

### Distinct words (ignoring case, with "Lots" counted as "lot"):
    The, children, are, playing, in, hall, has, a, lot, of, space, like, an, open.
    
### Feature Vector:
    - sentence 1: [2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    - sentence 2: [1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
    - sentence 3: [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
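
As a minimal sketch, the feature vectors above can be reproduced by counting words against the fixed vocabulary. It assumes that case is ignored and that "Lots" is folded into "lot", which is what the third vector implies; the helper names `normalize` and `to_feature_vector` are hypothetical.

```python
from collections import Counter

sentences = [
    "The children are playing in the hall",
    "The hall has a lot of space",
    "Lots of children like playing in an open space",
]

# Vocabulary in the same order as the list of distinct words
vocabulary = ["the", "children", "are", "playing", "in", "hall", "has",
              "a", "lot", "of", "space", "like", "an", "open"]

def normalize(word):
    """Lowercase the word; assumption: treat the plural "lots" as "lot"."""
    word = word.lower()
    return "lot" if word == "lots" else word

def to_feature_vector(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(normalize(w) for w in sentence.split())
    return [counts[term] for term in vocabulary]

vectors = [to_feature_vector(s) for s in sentences]
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary word, so word order within the sentence is lost, as expected for a bag of words.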

In [7]:
# import libs
# (the Brown corpus must be available locally, e.g. via nltk.download('brown'))
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import brown

In [8]:
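def chunker(input_data, N):
    """Split input_data into chunks of N words; the last chunk may be shorter."""
    input_words = input_data.split(' ')
    output = []

    curr_chunk = []
    count = 0

    for word in input_words:
        curr_chunk.append(word)
        count += 1
        if count == N:
            output.append(' '.join(curr_chunk))
            count, curr_chunk = 0, []

    # Add the leftover words, if any, as the final chunk
    if curr_chunk:
        output.append(' '.join(curr_chunk))
    return output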
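
For comparison, the same chunking idea can be sketched with list slicing. The name `chunk_by_slicing` is hypothetical, and unlike the function above it splits on any whitespace rather than on single spaces only.

```python
def chunk_by_slicing(text, n):
    """Split text into chunks of at most n whitespace-separated words."""
    words = text.split()
    # Step through the word list n items at a time
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]

sample = "one two three four five six seven"
print(chunk_by_slicing(sample, 3))
# ['one two three', 'four five six', 'seven']
```

Slicing never produces an empty trailing chunk, even when the number of words is an exact multiple of `n`.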

In [9]:
# read the data from the Brown corpus
input_data = ' '.join(brown.words()[:5400])

# number of words in each chunk
chunk_size = 800

In [10]:
text_chunks = chunker(input_data, chunk_size)

In [11]:
# convert each chunk to a dict with its index and text
chunks = [{'index': count, 'text': chunk} for count, chunk in enumerate(text_chunks)]

In [12]:
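# Extract the document-term matrix (count each word)
# min_df=7: keep only words that occur in at least 7 chunks
# max_df=20: drop words that occur in more than 20 chunks
count_vectorizer = CountVectorizer(min_df=7, max_df=20)
document_term_matrix = count_vectorizer.fit_transform([chunk['text'] for chunk in chunks])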
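
Integer values of `min_df` and `max_df` are thresholds on the number of documents a word appears in. The idea can be sketched in plain Python as follows; the toy corpus and the helper `filter_by_doc_freq` are hypothetical, and the sketch does not reproduce `CountVectorizer`'s own tokenization.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the text chunks
documents = [
    "the county said the state was ready",
    "the state said it was not",
    "the county was ready",
]

def filter_by_doc_freq(documents, min_df, max_df):
    """Keep terms whose document frequency lies in [min_df, max_df]."""
    doc_freq = Counter()
    for doc in documents:
        # set() ensures each term is counted once per document
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, df in doc_freq.items() if min_df <= df <= max_df)

kept = filter_by_doc_freq(documents, min_df=2, max_df=2)
print(kept)
# ['county', 'ready', 'said', 'state']
```

Here "the" and "was" are dropped because they occur in all three documents, while "it" and "not" are dropped because they occur in only one.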

In [14]:
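# Extract the vocabulary & display it
# (get_feature_names() was removed in recent scikit-learn releases;
# get_feature_names_out() is its replacement)
vocabulary = np.array(count_vectorizer.get_feature_names_out())
print('\nVocabulary:\n', vocabulary)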


Vocabulary:
 ['and' 'are' 'be' 'by' 'county' 'for' 'in' 'is' 'it' 'of' 'on' 'one'
 'said' 'state' 'that' 'the' 'to' 'two' 'was' 'which' 'with']


In [15]:
# Generate names for chunks
chunk_names = ['Chunk-' + str(i + 1) for i in range(len(text_chunks))]

In [16]:
# Print the document term matrix
print('\nDocument term matrix:')
formatted_text = '{:>12}' * (len(chunk_names) + 1)
print('\n', formatted_text.format('Word', *chunk_names), '\n')
for word, item in zip(vocabulary, document_term_matrix.T.toarray()):
    # item is a dense row of counts, one per chunk (zero counts included,
    # which the sparse .data attribute would silently skip)
    output = [word] + [str(freq) for freq in item]
    print(formatted_text.format(*output))


Document term matrix:

         Word     Chunk-1     Chunk-2     Chunk-3     Chunk-4     Chunk-5     Chunk-6     Chunk-7 

         and          23           9           9          11           9          17          10
         are           2           2           1           1           2           2           1
          be           6           8           7           7           6           2           1
          by           3           4           4           5          14           3           6
      county           6           2           7           3           1           2           2
         for           7          13           4          10           7           6           4
          in          15          11          15          11          13          14          17
          is           2           7           3           4           5           5           2
          it           8           6           8           9           3           1           2
   