This sample code demonstrates how to vectorize text (turn it into numeric feature vectors) using the sklearn library.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()


In [2]:
texts=['NLP is an interesting area of work and NLP is getting popular', 
       'New algortithms are being build day by day', 
       'I am working on accuracy of classification']

Here fit and transform are combined into a single fit_transform call; they can also be run as two separate steps.

In [3]:
cv_fit=cv.fit_transform(texts)
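
The two-step form mentioned above can be sketched as follows. This is a minimal illustration rather than part of the original run: the variable names (cv_two_step, counts_two_step, new_counts) and the extra sentence 'NLP is fun' are assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['NLP is an interesting area of work and NLP is getting popular',
         'New algortithms are being build day by day',
         'I am working on accuracy of classification']

# Step 1: learn the vocabulary from the corpus.
cv_two_step = CountVectorizer()
cv_two_step.fit(texts)

# Step 2: encode documents as count vectors using the learned vocabulary.
counts_two_step = cv_two_step.transform(texts)

# The one-step call should produce the same matrix.
counts_one_step = CountVectorizer().fit_transform(texts)
same = bool((counts_two_step.toarray() == counts_one_step.toarray()).all())
print(same)

# A fitted vectorizer can also encode unseen text (hypothetical sentence);
# words outside the learned vocabulary, such as 'fun', are ignored.
new_counts = cv_two_step.transform(['NLP is fun'])
print(new_counts.sum())
```

Keeping the steps separate is mainly useful when the vocabulary is learned on one set of documents and then reused to encode another.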

In [4]:
# This gives the number of unique words
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print(len(cv.get_feature_names_out()))

22


By default, CountVectorizer does the following:

1) Lowercases your text (set lowercase=False if you don’t want lowercasing)
2) Uses utf-8 encoding
3) Performs tokenization (converts raw text to smaller units of text)
4) Uses word level tokenization (meaning each word is treated as a separate token)
5) Ignores single characters during tokenization (say goodbye to words like ‘a’ and ‘I’)
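
A short sketch of how two of these defaults can be overridden. The sample sentence and variable names are made up for illustration; the token_pattern shown is a relaxed variant of the default pattern (which requires at least two word characters), not a setting used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample sentence containing capitals and single-character words.
sample = ['I am a fan of NLP']

# Defaults: lowercased, and single-character tokens ('I', 'a') are dropped.
default_cv = CountVectorizer()
default_cv.fit(sample)
default_terms = sorted(default_cv.vocabulary_)
print(default_terms)   # ['am', 'fan', 'nlp', 'of']

# Keep the original case and accept tokens of one or more word characters.
custom_cv = CountVectorizer(lowercase=False, token_pattern=r"(?u)\b\w+\b")
custom_cv.fit(sample)
custom_terms = sorted(custom_cv.vocabulary_)
print(custom_terms)    # ['I', 'NLP', 'a', 'am', 'fan', 'of']
```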

In [5]:
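# get_feature_names_out() returns a NumPy array; tolist() prints it as a plain Python list
print(cv.get_feature_names_out().tolist())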

['accuracy', 'algortithms', 'am', 'an', 'and', 'are', 'area', 'being', 'build', 'by', 'classification', 'day', 'getting', 'interesting', 'is', 'new', 'nlp', 'of', 'on', 'popular', 'work', 'working']


In [None]:
# cv_fit is a SciPy sparse matrix; tocoo() returns a COOrdinate-format representation of it.
# The .col and .data attributes of that representation hold the column indices
# and the values of the non-zero entries.

In [32]:
type(cv_fit)

scipy.sparse.csr.csr_matrix

In [37]:
cv_fit[0].tocoo().col

array([16, 14,  3, 13,  6, 17, 20,  4, 12, 19])

In [38]:
cv_fit[0].tocoo().data

array([2, 2, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

In [39]:
tuples=zip(cv_fit[0].tocoo().col,cv_fit[0].tocoo().data)

In [40]:
sorted_tuples=sorted(tuples, key=lambda x: (x[1],x[0]), reverse=True)

In [41]:
sorted_tuples

[(16, 2),
 (14, 2),
 (20, 1),
 (19, 1),
 (17, 1),
 (13, 1),
 (12, 1),
 (6, 1),
 (4, 1),
 (3, 1)]

In [6]:
# show resulting vocabulary; the numbers are not counts, they are the position in the sparse vector.
cv.vocabulary_

{'nlp': 16,
 'is': 14,
 'an': 3,
 'interesting': 13,
 'area': 6,
 'of': 17,
 'work': 20,
 'and': 4,
 'getting': 12,
 'popular': 19,
 'new': 15,
 'algortithms': 1,
 'are': 5,
 'being': 7,
 'build': 8,
 'day': 11,
 'by': 9,
 'am': 2,
 'working': 21,
 'on': 18,
 'accuracy': 0,
 'classification': 10}

In [7]:
# Each row is a document and each column is a vocabulary index (see cv.vocabulary_); the values are the word counts
print(cv_fit.toarray())

[[0 0 0 1 1 0 1 0 0 0 0 0 1 1 2 0 2 1 0 1 1 0]
 [0 1 0 0 0 1 0 1 1 1 0 2 0 0 0 1 0 0 0 0 0 0]
 [1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1]]


In [8]:
# Here 22 is the number of unique words in the input and 3 is the number of documents (texts) used as input
cv_fit.shape

(3, 22)

Below we will show some additional pre-processing steps that can be used by CountVectorizer 


Use of custom stopwords

In [10]:
# texts is passed to fit_transform, not to the constructor
cv2 = CountVectorizer(stop_words=["of","in","the","is","an","by","are"])
cv_fit2=cv2.fit_transform(texts)

In [11]:
#To check the stop words that are being used, access cv2.stop_words
cv2.stop_words

['of', 'in', 'the', 'is', 'an', 'by', 'are']

cv.stop_words_ (with a trailing underscore) holds the terms that CountVectorizer dropped because of its settings:
    terms that occurred in fewer documents than min_df
    terms that occurred in more documents than max_df
    terms that were cut off during feature selection (through the use of max_features)


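The following settings consider document frequency, i.e. the number of documents a word appears in. This can help eliminate words that appear in only 1-2 documents, even if they occur there with a high frequency, as well as words that appear in almost every document.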

1) min_df is a setting used to ignore words that appear in fewer documents than the threshold specified. These words are considered noise. This could be given as an absolute value (e.g. 1, 2, 3, 4) or a value representing a proportion of documents (e.g. 0.1 ignores words that appear in fewer than 10 % of the documents)

2) max_df looks at how many documents contain a term, and if that number exceeds the max_df threshold, the term is eliminated from consideration. This could be given as an absolute value (e.g. 100, 200) or a value representing a proportion of documents (e.g. 0.85 ignores words that appear in more than 85 % of the documents)

In [23]:
# ignore terms that appear in fewer than 2 documents and terms that appear in more than 75 % of the documents
cv3 = CountVectorizer(min_df=2,max_df=0.75)
cv_fit3=cv3.fit_transform(texts)

In [24]:
# We have to be careful with these settings: given the small number of documents we have, they eliminate everything other than 'of' :-)
cv3.stop_words_

{'accuracy',
 'algortithms',
 'am',
 'an',
 'and',
 'are',
 'area',
 'being',
 'build',
 'by',
 'classification',
 'day',
 'getting',
 'interesting',
 'is',
 'new',
 'nlp',
 'on',
 'popular',
 'work',
 'working'}

In [25]:
cv3.vocabulary_

{'of': 0}

Other options to build on:
ngram_range=(1,2) # use word uni-grams and bi-grams as features
ngram_range=(2,2),analyzer='char_wb' # character-level bi-grams, built only from text inside word boundaries
preprocessor=custom_preprocessor # define your own rules for pre-processing
max_features=1000 # limit the feature space by controlling vocabulary size
binary=True # 0 and 1 values instead of counts; the default value is False
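
A brief sketch of what some of these options do on the same three texts. The parameter values (max_features=4, the single word 'nlp' for the character example) and the variable names are picked only to keep the output small; they are not recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['NLP is an interesting area of work and NLP is getting popular',
         'New algortithms are being build day by day',
         'I am working on accuracy of classification']

# Word uni-grams and bi-grams: pairs of adjacent words become features too.
bigram_cv = CountVectorizer(ngram_range=(1, 2))
bigram_cv.fit(texts)
has_bigram = 'nlp is' in bigram_cv.vocabulary_
print(has_bigram)      # True

# Character bi-grams within word boundaries; each word is padded with spaces.
char_cv = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
char_cv.fit(['nlp'])
char_terms = sorted(char_cv.vocabulary_)
print(char_terms)      # [' n', 'lp', 'nl', 'p ']

# Keep only the 4 most frequent words across the corpus.
top_cv = CountVectorizer(max_features=4)
top_cv.fit(texts)
top_terms = sorted(top_cv.vocabulary_)
print(top_terms)       # ['day', 'is', 'nlp', 'of']

# Presence/absence instead of counts: no cell exceeds 1.
binary_max = int(CountVectorizer(binary=True).fit_transform(texts).max())
print(binary_max)      # 1
```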

References to learn other basics on this topic:

Wiki resources:
https://en.wikipedia.org/wiki/Bag-of-words_model

scikit-learn library links:

Useful tutorials:

10+ Examples for Using CountVectorizer. https://kavita-ganesan.com/how-to-use-countvectorizer/#.XzmAkuhKiUl
How to Use Tfidftransformer & Tfidfvectorizer?. https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XzmFsehKiUk
Gensim Word2Vec Tutorial – Full Working Example. https://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.XzmFvOhKiUk