## Word Counts with "CountVectorizer"

In [1]:
import numpy as np
import sklearn

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text1 = ["The quick brown fox jumped over the lazy dog."]
# create the transform
my_vectorizer = CountVectorizer()
# tokenize and build vocab
my_vectorizer.fit(text1)
# summarize
print(my_vectorizer.vocabulary_)

# encode document
my_vector = my_vectorizer.transform(text1)
# summarize encoded vector
print(my_vector.shape)
print(type(my_vector))
print(my_vector.toarray())

{'fox': 2, 'lazy': 4, 'dog': 1, 'quick': 6, 'over': 5, 'the': 7, 'brown': 0, 'jumped': 3}
(1, 8)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 1 1 1 1 2]]


In [3]:
## We can see that all words were lowercased by default and that the punctuation was ignored. \
## These and other aspects of tokenizing can be configured, and I encourage you to review all of the \
## options in the CountVectorizer API documentation.

## Importantly, the same vectorizer can be used on documents that contain words not included in the vocabulary. \
## These words are ignored and no count is given in the resulting vector. Example:

In [4]:
# encode another document

text2 = ["the puppy"]
my_vector2 = my_vectorizer.transform(text2)
print(my_vector2.toarray())

[[0 0 0 0 0 0 0 1]]
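
To make the two results above concrete, here is a rough pure-Python sketch of what `fit` and `transform` do for this example. It is not scikit-learn's implementation: the helper names `tokenize`, `fit_vocabulary` and `count_vector` are hypothetical, and the regular expression is an assumption that mirrors the default `token_pattern` (tokens of two or more word characters).

```python
import re
from collections import Counter

# Assumption: mimics CountVectorizer's default tokenization, which lowercases
# the text and keeps tokens of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    """Lowercase a document and split it into word tokens."""
    return TOKEN_RE.findall(doc.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def count_vector(doc, vocabulary):
    """Count known tokens; tokens missing from the vocabulary are skipped."""
    counts = Counter(tokenize(doc))
    vec = [0] * len(vocabulary)
    for term, idx in vocabulary.items():
        vec[idx] = counts.get(term, 0)
    return vec

vocab = fit_vocabulary(["The quick brown fox jumped over the lazy dog."])
print(vocab)
print(count_vector("The quick brown fox jumped over the lazy dog.", vocab))
print(count_vector("the puppy", vocab))  # 'puppy' is not in the vocabulary
```

Because the vocabulary is fixed at fit time, "puppy" has no column to land in, which is why only the count for "the" survives in the second vector.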


## Word Frequencies with "TfidfVectorizer"

In [5]:
## TF-IDF scores are word frequency scores that try to highlight words that are more interesting, \
## e.g. frequent in a document but not across documents.

## The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document \
## frequency weightings, and allow you to encode new documents. Alternatively, if you already \
## have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate \
## the inverse document frequencies and start encoding documents.
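
The two-step route mentioned in the last paragraph can be sketched as follows. This is a minimal example rather than a recommended pipeline; the variable names (`docs`, `count_vectorizer`, `tfidf_transformer`) are illustrative, and it assumes default settings for both classes.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["The quick brown fox jumped over the lazy dog.",
        "The puppy.",
        "The fox"]

# Step 1: learn the vocabulary and produce raw term counts
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)

# Step 2: learn the inverse document frequencies from the counts
# and rescale the counts into TF-IDF scores
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(counts)

print(tfidf.shape)
print(tfidf_transformer.idf_)
```

With default settings this should give the same matrix as calling TfidfVectorizer directly on the raw text.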

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text3 = ["The quick brown fox jumped over the lazy dog.",
		"The puppy.",
		"The fox"]

text_array = np.array(text3)
print(text_array.shape)
print(text3[0])
print(type(text3[0]))

# create the transform
my_vectorizer_2 = TfidfVectorizer()  ## Note the change in vectorizer: 'CountVectorizer' above, 'TfidfVectorizer' here
# tokenize and build vocab
my_vectorizer_2.fit(text3)
# summarize
print(my_vectorizer_2.vocabulary_)
print(my_vectorizer_2.idf_)

# encode document - one row
my_vector_2 = my_vectorizer_2.transform([text3[0]])
# summarize encoded vector
print(my_vector_2.shape)
print(my_vector_2.toarray())

# encode document - all rows
my_vector_3 = my_vectorizer_2.transform(text3)
print(my_vector_3.shape)
print(my_vector_3.toarray())

(3,)
The quick brown fox jumped over the lazy dog.
<class 'str'>
{'the': 8, 'puppy': 6, 'lazy': 4, 'quick': 7, 'brown': 0, 'jumped': 3, 'over': 5, 'dog': 1, 'fox': 2}
[1.69314718 1.69314718 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.69314718 1.        ]
(1, 9)
[[0.35413578 0.35413578 0.26932939 0.35413578 0.35413578 0.35413578
  0.         0.35413578 0.41831659]]
(3, 9)
[[0.35413578 0.35413578 0.26932939 0.35413578 0.35413578 0.35413578
  0.         0.35413578 0.41831659]
 [0.         0.         0.         0.         0.         0.
  0.861037   0.         0.50854232]
 [0.         0.         0.78980693 0.         0.         0.
  0.         0.         0.61335554]]
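
The numbers printed above can be reproduced by hand. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw counts as term frequency), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The dictionaries are typed in by hand from the three example documents.

```python
import math

n_docs = 3

# Number of documents each term appears in (from the three example documents)
doc_freq = {"brown": 1, "dog": 1, "fox": 2, "jumped": 1, "lazy": 1,
            "over": 1, "puppy": 1, "quick": 1, "the": 3}

# Smoothed inverse document frequency (assumed default: smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Raw term counts for the first document
tf = {"brown": 1, "dog": 1, "fox": 1, "jumped": 1, "lazy": 1,
      "over": 1, "quick": 1, "the": 2}

terms = sorted(doc_freq)  # same alphabetical column order as vocabulary_
raw = [tf.get(t, 0) * idf[t] for t in terms]

# L2-normalize the row so its Euclidean length is 1
norm = math.sqrt(sum(v * v for v in raw))
tfidf_row = [v / norm for v in raw]

print([round(idf[t], 8) for t in terms])
print([round(v, 8) for v in tfidf_row])
```

"the" occurs in every document, so it gets the lowest possible idf of 1.0, yet it still has the largest score in the first row because it appears twice there.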


## Hashing with HashingVectorizer

In [12]:
## Counts and frequencies can be very useful, but one limitation of these methods is that the vocabulary \
## can become very large.

## This, in turn, requires large vectors for encoding documents, which increases memory use \
## and slows down algorithms.

## A clever workaround is to use a one-way hash of words to convert them to integers. The clever part is that \
## no vocabulary is required and you can choose a fixed-length vector of arbitrary size. A downside is that the \
## hash is a one-way function, so there is no way to convert the encoding back to a word (which may not matter \
## for many supervised learning tasks).

## The HashingVectorizer class implements this approach. It can be used to consistently hash words, then \
## tokenize and encode documents as needed.

In [17]:
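## An arbitrary fixed-length vector size of 20 was chosen. This corresponds to the range of the hash function, \
## where small values (like 20) may result in hash collisions.

## The encoded values are not raw counts: by default the vectorizer uses a signed hash (alternate_sign=True) \
## and L2-normalizes each row (norm='l2'), which is why negative and fractional entries appear in the output.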

## Note that this vectorizer does not require a call to fit on the training data documents. Instead, after \
## instantiation, it can be used directly to start encoding documents.

from sklearn.feature_extraction.text import HashingVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = HashingVectorizer(n_features=20)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())


(1, 20)
[[ 0.          0.          0.          0.          0.          0.33333333
   0.         -0.33333333  0.33333333  0.          0.          0.33333333
   0.          0.          0.         -0.33333333  0.          0.
  -0.66666667  0.        ]]
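
The idea behind the hashing trick can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `hashed_counts` is a hypothetical helper, it uses MD5 from `hashlib` as a stand-in hash (scikit-learn uses a signed MurmurHash3), the tokenization is a simple whitespace split, and it produces unsigned, unnormalized counts, so its numbers will not match the output above.

```python
import hashlib

def hashed_counts(doc, n_features=20):
    """Map each token to one of n_features buckets and count occurrences."""
    vec = [0] * n_features
    for token in doc.lower().split():
        token = token.strip(".,!?")  # crude punctuation removal
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Use the first 4 bytes of the digest as an integer, then wrap
        # it into the fixed vector length
        bucket = int.from_bytes(digest[:4], "little") % n_features
        vec[bucket] += 1
    return vec

vec = hashed_counts("The quick brown fox jumped over the lazy dog.")
print(len(vec))
print(vec)
```

No vocabulary is stored anywhere: the bucket index is computed from the token itself, so any new document can be encoded immediately. With only 20 buckets, distinct words may share a bucket, which is the collision risk mentioned earlier.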


In [2]:
A = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

In [3]:
A

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [4]:
b = A.sum(axis = 0)
b

array([ 6,  8, 10, 12])

In [5]:
A.shape


(2, 4)

In [6]:
b.shape

(4,)

In [8]:
type(A)

numpy.ndarray

In [9]:
type(b)

numpy.ndarray

In [10]:
c = A/b

In [11]:
c

array([[0.16666667, 0.25      , 0.3       , 0.33333333],
       [0.83333333, 0.75      , 0.7       , 0.66666667]])

In [12]:
d = A*b

In [13]:
d

array([[ 6, 16, 30, 48],
       [30, 48, 70, 96]])