# CountVectorizer for the Bag of Words Model

Set up the libraries.

In [1]:
%run setup.ipynb

In [2]:
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Elisabetta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Elisabetta\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Elisabetta\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Define text corpus.

In [3]:
sentences = ["We are reading about Natural Language Processing Here",
            "Natural Language Processing making computers comprehend language data",
            "The field of Natural Language Processing is evolving everyday"]

corpus = pd.Series(sentences)

common_dot_words = ['U.S.', 'Mr.', 'Mrs.', 'D.C.']

In [4]:
%run text_data_preprocessing_steps.ipynb

In [5]:
# Preprocessing with Lemmatization here
preprocessed_corpus = preprocess_list(corpus, keep_list = common_dot_words, stemming = False, stem_type = None,
                                lemmatization = True, remove_stopwords = True)
preprocessed_corpus

['read natural language process',
 'natural language process make computers comprehend language data',
 'field natural language process evolve everyday']

## CountVectorizer for Bag of Words Model

**Documentation**: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

## CountVectorizer

`CountVectorizer` converts a list of text documents into a matrix in which each row corresponds to a document, each column corresponds to a token in the vocabulary, and each entry is the number of times that token occurs in that document.

In [7]:
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(preprocessed_corpus)

### Let's see which features were obtained and the corresponding bag of words matrix

In [8]:
print(vectorizer.get_feature_names_out())
print(bow_matrix.toarray())

['comprehend' 'computers' 'data' 'everyday' 'evolve' 'field' 'language'
 'make' 'natural' 'process' 'read']
[[0 0 0 0 0 0 1 0 1 1 1]
 [1 1 1 0 0 0 2 1 1 1 0]
 [0 0 0 1 1 1 1 0 1 1 0]]


In [9]:
print(bow_matrix.toarray().shape)

(3, 11)


### The matrix is the same as what was obtained after all the hard work in the previous exercise.

Now you know what to use when a basic Bag of Words model is needed.
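For comparison, here is a minimal sketch of how the same matrix could be built by hand with only the standard library. The previous exercise is not shown here, so this is an assumed reconstruction of a manual approach, not necessarily the one used there. The names `docs`, `tokenized`, `vocabulary` and `manual_bow` are illustrative.

```python
from collections import Counter

# Preprocessed corpus copied from the output of cell In [5].
docs = [
    "read natural language process",
    "natural language process make computers comprehend language data",
    "field natural language process evolve everyday",
]

# Whitespace tokenization is enough here because the text is already cleaned.
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct token, sorted alphabetically.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary term.
manual_bow = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing terms count as 0
    manual_bow.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in manual_bow:
    print(row)
```

On this corpus the rows agree with the `CountVectorizer` output shown in cell In [8].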

## Let's see how bigrams and trigrams can be included here

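The `ngram_range` argument sets the lower and upper bounds on the n-gram size. For example, `ngram_range=(1, 3)` includes unigrams, bigrams (2-grams) and trigrams (3-grams).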
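To make the idea of word n-grams concrete, the sketch below generates them in plain Python. The helper `word_ngrams` is a hypothetical function written for illustration; it is not part of scikit-learn and only mimics the idea on this already-cleaned corpus.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the document.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


# Preprocessed corpus copied from the output of cell In [5].
docs = [
    "read natural language process",
    "natural language process make computers comprehend language data",
    "field natural language process evolve everyday",
]

# n-grams of the first document: 4 unigrams, 3 bigrams, 2 trigrams.
first_doc_ngrams = word_ngrams(docs[0].split(), 1, 3)
print(first_doc_ngrams)

# Distinct n-grams over the whole corpus, sorted alphabetically.
ngram_vocabulary = sorted(
    {gram for doc in docs for gram in word_ngrams(doc.split(), 1, 3)}
)
print(len(ngram_vocabulary))
```

Each distinct n-grams becomes one column of the bag of words matrix, which is why the number of features grows quickly as the upper bound of `ngram_range` increases.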

In [10]:
vectorizer_ngram_range = CountVectorizer(analyzer='word', ngram_range=(1,3))
bow_matrix_ngram = vectorizer_ngram_range.fit_transform(preprocessed_corpus)

In [11]:
print(vectorizer_ngram_range.get_feature_names_out())
print(bow_matrix_ngram.toarray())
print(bow_matrix_ngram.toarray().shape)

['comprehend' 'comprehend language' 'comprehend language data' 'computers'
 'computers comprehend' 'computers comprehend language' 'data' 'everyday'
 'evolve' 'evolve everyday' 'field' 'field natural'
 'field natural language' 'language' 'language data' 'language process'
 'language process evolve' 'language process make' 'make' 'make computers'
 'make computers comprehend' 'natural' 'natural language'
 'natural language process' 'process' 'process evolve'
 'process evolve everyday' 'process make' 'process make computers' 'read'
 'read natural' 'read natural language']
[[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1]
 [1 1 1 1 1 1 1 0 0 0 0 0 0 2 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 0 0]
 [0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0]]
(3, 32)


In [12]:
three_grams = vectorizer_ngram_range.get_feature_names_out()
index = np.where(three_grams == 'natural language process')
index[0]

array([23], dtype=int64)

### Inference
As can be seen, the phrase *natural language process* occurs once in every sentence.

The column corresponding to it (index 23) has the entries **1, 1 and 1**.

## Understanding Max Features 

The `max_features` argument caps the size of the vocabulary: only the top `max_features` terms, ranked by their total frequency across the corpus, are kept.

In [13]:
vectorizer_max_features = CountVectorizer(analyzer='word', ngram_range=(1,3), max_features = 6)
bow_matrix_max_features = vectorizer_max_features.fit_transform(preprocessed_corpus)

In [14]:
print(vectorizer_max_features.get_feature_names_out())
print(bow_matrix_max_features.toarray())

['language' 'language process' 'natural' 'natural language'
 'natural language process' 'process']
[[1 1 1 1 1 1]
 [2 1 1 1 1 1]
 [1 1 1 1 1 1]]


### Inference

The vocabulary and the Bag of Words matrix are limited to 6 features because `max_features=6` was passed to `CountVectorizer`.
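The selection can be reproduced by hand as a rough check. The sketch below counts every 1- to 3-gram across the corpus and keeps the six most frequent. The helper `word_ngrams` is hypothetical, and the alphabetical tie-breaking is an assumption of this sketch, not documented scikit-learn behaviour. It does not matter here because the sixth and seventh terms are not tied.

```python
from collections import Counter


def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    return [
        " ".join(tokens[start:start + n])
        for n in range(min_n, max_n + 1)
        for start in range(len(tokens) - n + 1)
    ]


# Preprocessed corpus copied from the output of cell In [5].
docs = [
    "read natural language process",
    "natural language process make computers comprehend language data",
    "field natural language process evolve everyday",
]
max_features = 6

# Total number of occurrences of each n-gram across the whole corpus.
term_counts = Counter(
    gram for doc in docs for gram in word_ngrams(doc.split(), 1, 3)
)

# Rank by descending count; ties are broken alphabetically (sketch assumption).
ranked = sorted(term_counts.items(), key=lambda item: (-item[1], item[0]))

# Keep the top terms and list them alphabetically, like the feature names.
kept_terms = sorted(term for term, _ in ranked[:max_features])
print(kept_terms)
```

Here *language* occurs 4 times and the other five kept terms occur 3 times each, while every remaining term occurs only once.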

## Thresholding using Max_df and Min_df

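The `max_df` argument allows you to ignore terms whose document frequency (the number of documents that contain the term) is strictly higher than the given threshold.

The `min_df` argument allows you to remove rarely occurring terms, that is, terms whose document frequency is strictly lower than the given threshold.

For both arguments, an integer is interpreted as an absolute number of documents and a float in the range [0.0, 1.0] as a proportion of documents.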

In [15]:
vectorizer_max_features = CountVectorizer(analyzer='word', ngram_range=(1,3), max_df = 3, min_df = 2)
bow_matrix_max_features = vectorizer_max_features.fit_transform(preprocessed_corpus)

In [16]:
print(vectorizer_max_features.get_feature_names_out())
print(bow_matrix_max_features.toarray())

['language' 'language process' 'natural' 'natural language'
 'natural language process' 'process']
[[1 1 1 1 1 1]
 [2 1 1 1 1 1]
 [1 1 1 1 1 1]]


### Inference

Only terms with a document frequency of at least 2 (`min_df`) and at most 3 (`max_df`) are present in the vocabulary.

**Note**: *max_features* was not used here
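The thresholds can also be checked by hand. The sketch below computes the document frequency of every 1- to 3-gram and keeps those between `min_df` and `max_df`, both treated as absolute counts. The helper `word_ngrams` is hypothetical and the code only illustrates the idea; it is not scikit-learn's implementation.

```python
from collections import Counter


def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    return [
        " ".join(tokens[start:start + n])
        for n in range(min_n, max_n + 1)
        for start in range(len(tokens) - n + 1)
    ]


# Preprocessed corpus copied from the output of cell In [5].
docs = [
    "read natural language process",
    "natural language process make computers comprehend language data",
    "field natural language process evolve everyday",
]
min_df, max_df = 2, 3  # absolute document counts, as in cell In [15]

# Document frequency: number of documents containing the term at least once.
document_frequency = Counter()
for doc in docs:
    # set() ensures a term is counted at most once per document.
    document_frequency.update(set(word_ngrams(doc.split(), 1, 3)))

kept_terms = sorted(
    term for term, df in document_frequency.items() if min_df <= df <= max_df
)
print(kept_terms)
print(document_frequency["language"])
```

Note that *language* has a document frequency of 3 even though it occurs 4 times in the corpus, because the two occurrences in the second sentence count as one document.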