### Author : Sanjoy Biswas
### Topic : Count Vectorizer
### Email : sanjoy.eee32@gmail.com

Scikit-learn's CountVectorizer transforms a corpus of text into a matrix of term (token) counts, with one row per document. It can also preprocess the text before building that representation, which makes it a flexible feature-extraction module for text.
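As a rough illustration of the preprocessing options mentioned above, the sketch below fits two vectorizers on the same toy corpus: one with default settings and one that drops English stop words and adds bigrams. The corpus and variable names are made up for this example; `stop_words` and `ngram_range` are standard CountVectorizer parameters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for this illustration.
docs = ["The cat sat on the mat", "The dog chased the cat"]

# Default settings: text is lowercased and split into word tokens.
default_cv = CountVectorizer()
default_cv.fit(docs)
default_vocab = sorted(default_cv.vocabulary_)

# Same corpus, but removing English stop words and also counting bigrams.
custom_cv = CountVectorizer(stop_words="english", ngram_range=(1, 2))
custom_cv.fit(docs)
custom_vocab = sorted(custom_cv.vocabulary_)

print(default_vocab)
print(custom_vocab)
```

The second vocabulary should no longer contain common words such as "the", and it should include two-word terms alongside the single words.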

#### Methods
- build_analyzer(): Return a callable that handles preprocessing, tokenization and n-gram generation.
- build_preprocessor(): Return a function to preprocess the text before tokenization.
- build_tokenizer(): Return a function that splits a string into a sequence of tokens.
- decode(doc): Decode the input into a string of unicode symbols.
- fit(raw_documents[, y]): Learn a vocabulary dictionary of all tokens in the raw documents.
- fit_transform(raw_documents[, y]): Learn the vocabulary dictionary and return the document-term matrix.
- get_feature_names_out([input_features]): Get output feature names for transformation. This replaces get_feature_names(), which was deprecated in scikit-learn 1.0 and removed in 1.2.
- get_params([deep]): Get parameters for this estimator.
- get_stop_words(): Build or fetch the effective stop words list.
- inverse_transform(X): Return terms per document with nonzero entries in X.
- set_params(**params): Set the parameters of this estimator.
- transform(raw_documents): Transform documents to a document-term matrix.
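A few of these methods are easier to understand when called directly. The following is a minimal sketch, assuming default CountVectorizer settings; the sample sentences and variable names are chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

# build_analyzer() works before fitting: it returns the callable that
# lowercases, tokenizes and (optionally) builds n-grams for one string.
analyzer = cv.build_analyzer()
tokens = analyzer("Working as Data Scientist!")
print(tokens)

# get_params() exposes the constructor settings as a dict.
params = cv.get_params()
print(params["lowercase"], params["ngram_range"])

# inverse_transform() maps each row of the count matrix back to the
# terms that have a nonzero count in that document.
X = cv.fit_transform(["Working as data scientist",
                      "This is Data Science Course"])
terms_per_doc = cv.inverse_transform(X)
for terms in terms_per_doc:
    print(sorted(terms))
```

Note that inverse_transform() returns the set of terms per document, not their original order or counts.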

#### Word Counts with CountVectorizer
The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

You can use it as follows:

1. Create an instance of the CountVectorizer class.
2. Call the fit() method to learn a vocabulary from one or more documents.
3. Call the transform() method on one or more documents as needed to encode each as a vector.
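The three steps above can be sketched as follows. The training documents are the same ones used later in this notebook; the new document is a made-up example that deliberately contains words the vectorizer has never seen.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['Hey welcome to datascience',
              'This is Data Science Course',
              'Working as data scientist']

# Step 1: create the vectorizer.
cv = CountVectorizer()

# Step 2: learn the vocabulary from the training documents.
cv.fit(train_docs)

# Step 3: encode a new document with the learned vocabulary.
# "python" and "for" were not seen during fit(), so they are ignored.
new_vec = cv.transform(['Python for data science'])

print(new_vec.shape)
print(new_vec.toarray())
```

Words that are not in the learned vocabulary contribute nothing to the encoded vector, so the vector length stays fixed at the vocabulary size.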

Each encoded vector has one entry per vocabulary term, and each entry is an integer count of how many times that term appeared in the document.

Because these vectors contain mostly zeros, we call them sparse. SciPy provides an efficient way of handling sparse data in the scipy.sparse package.

A call to transform() returns a SciPy sparse matrix. To inspect it and better understand what is going on, you can convert it to a dense NumPy array by calling toarray().
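To make the sparse representation concrete, here is a small sketch that inspects the matrix before converting it. It reuses this notebook's three example sentences; `issparse` comes from scipy.sparse, and the variable names are illustrative.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Hey welcome to datascience',
        'This is Data Science Course',
        'Working as data scientist']

X = CountVectorizer().fit_transform(docs)

# The result is a SciPy sparse matrix: only nonzero counts are stored.
print(issparse(X))
print(X.shape)   # (number of documents, vocabulary size)
print(X.nnz)     # number of stored nonzero entries

# Fraction of cells that are nonzero.
density = X.nnz / (X.shape[0] * X.shape[1])
print(round(density, 3))

# Dense copy for inspection; avoid this on large corpora.
dense = X.toarray()
print(type(dense).__name__)
```

On a real corpus the density is usually far lower than in this toy example, which is why the sparse format matters.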

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
dataset = ['Hey welcome to datascience',
          'This is Data Science Course',
          'Working as data scientist']

In [11]:
dataset

['Hey welcome to datascience',
 'This is Data Science Course',
 'Working as data scientist']

In [12]:
cv = CountVectorizer()
x = cv.fit_transform(dataset)

In [13]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cv.get_feature_names_out().tolist()

['as',
 'course',
 'data',
 'datascience',
 'hey',
 'is',
 'science',
 'scientist',
 'this',
 'to',
 'welcome',
 'working']

In [14]:
x.toarray()

array([[0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]], dtype=int64)