## HashingVectorizer
* <b>HashingVectorizer</b> combines <b>FeatureHasher</b> and <b>CountVectorizer</b>, i.e., we get the term frequency of words as well as the reduced dimension.
* When we used FeatureHasher, the dimension of the vector was reduced, but the output was not in an interpretable form because hashed column indices cannot be mapped back to words.

#### With a large vocabulary, we can use HashingVectorizer rather than CountVectorizer
* Representing words by hash buckets lets HashingVectorizer scale to large data sets, because no vocabulary has to be stored in memory.
* The main input argument to the vectorizer is the number of hash buckets (n_features).
* Result: a numeric representation of all the words in the documents.
* Word ids range from 0 to (n_features - 1) because there are n_features buckets in total.
* When the vocabulary is larger than the number of buckets, multiple words hash to the same bucket (a collision).
* There is no way to get back to the original word from the hash bucket value.
* By default, the frequencies are represented in normalized form (norm='l2'); pass norm=None to get raw counts.

https://scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick

##### Words are mapped directly to indices with a hashing function
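To make the bucket behaviour described in the bullets above concrete, here is a minimal sketch of the hashing trick using only the standard library. The helper `bucket_of` is hypothetical, and `zlib.crc32` is a stand-in hash chosen for illustration; scikit-learn uses MurmurHash3, so the bucket ids produced here will not match the notebook outputs.

```python
import zlib


def bucket_of(token: str, n_features: int) -> int:
    """Map a token to a bucket id in [0, n_features - 1].

    zlib.crc32 is a stand-in for this sketch only (assumption);
    scikit-learn itself uses MurmurHash3.
    """
    return zlib.crc32(token.encode("utf-8")) % n_features


tokens = ["good", "things", "come", "to", "those", "who", "wait",
          "these", "watches", "cost", "1500"]
n_features = 8

# Group tokens by the bucket they land in
buckets = {}
for token in tokens:
    buckets.setdefault(bucket_of(token, n_features), []).append(token)

for bucket_id in sorted(buckets):
    print(bucket_id, buckets[bucket_id])

# 11 distinct tokens cannot map one-to-one into 8 buckets, so at least
# one bucket must hold more than one token (a collision).
has_collision = any(len(words) > 1 for words in buckets.values())
print("collision:", has_collision)
```

Nothing in `buckets` is needed to vectorize a new document: the bucket id is recomputed from the token each time, which is why no vocabulary has to be kept.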

In [60]:
from sklearn.feature_extraction.text import CountVectorizer

text_array = ["Good things come to those who wait.",
              "These watches cost $1500! ",
              "These are other fish in the sea.",
              "The ball is in your court.",
              "Mr. Smith Goes to Washington ",
              "Doogie Howser M.D."]

count_vectorizer = CountVectorizer()
feature_vector = count_vectorizer.fit_transform(text_array)

feature_vector.shape

(6, 27)

In [61]:
feature_vector.toarray()

array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        1, 0, 0, 1, 0],
       [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0,
        0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
        0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0]], dtype=int64)

In [62]:
count_vectorizer.vocabulary_

{'good': 9,
 'things': 19,
 'come': 3,
 'to': 21,
 'those': 20,
 'who': 25,
 'wait': 22,
 'these': 18,
 'watches': 24,
 'cost': 4,
 '1500': 0,
 'are': 1,
 'other': 14,
 'fish': 7,
 'in': 11,
 'the': 17,
 'sea': 15,
 'ball': 2,
 'is': 12,
 'your': 26,
 'court': 5,
 'mr': 13,
 'smith': 16,
 'goes': 8,
 'washington': 23,
 'doogie': 6,
 'howser': 10}

In [63]:
analyzer = count_vectorizer.build_analyzer()

In [64]:
word_tokens = analyzer(text_array[0])

word_tokens

['good', 'things', 'come', 'to', 'those', 'who', 'wait']
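The analyzer output and the vocabulary indices above can be approximated without scikit-learn. The sketch below assumes the default CountVectorizer settings (lowercasing, and a token pattern that keeps runs of two or more word characters); `simple_analyzer` is a hypothetical helper, not the library's implementation.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def simple_analyzer(text):
    """Lowercase the text, then keep tokens of 2+ word characters."""
    return TOKEN_PATTERN.findall(text.lower())


text_array = ["Good things come to those who wait.",
              "These watches cost $1500! ",
              "These are other fish in the sea.",
              "The ball is in your court.",
              "Mr. Smith Goes to Washington ",
              "Doogie Howser M.D."]

tokenized = [simple_analyzer(text) for text in text_array]

# Vocabulary: unique tokens in sorted order, each mapped to its position
unique_tokens = sorted({token for doc in tokenized for token in doc})
vocabulary = {token: idx for idx, token in enumerate(unique_tokens)}

print(tokenized[0])
print(tokenized[5])       # single-character tokens 'm' and 'd' are dropped
print(len(vocabulary))
```

This also explains the shape (6, 27): six documents and 27 distinct tokens once punctuation and one-character tokens are removed.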

In [65]:
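frequency_list = []

# Build one {token: count} dictionary per document
for i, text in enumerate(text_array):
    tokens = analyzer(text)

    word_frequency = {}

    for token in tokens:
        # Column index of this token in the CountVectorizer matrix
        word_idx = count_vectorizer.vocabulary_[token]

        # Count of the token in document i
        word_frequency[token] = feature_vector[i, word_idx]

    frequency_list.append(word_frequency)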

In [66]:
frequency_list

[{'good': 1, 'things': 1, 'come': 1, 'to': 1, 'those': 1, 'who': 1, 'wait': 1},
 {'these': 1, 'watches': 1, 'cost': 1, '1500': 1},
 {'these': 1, 'are': 1, 'other': 1, 'fish': 1, 'in': 1, 'the': 1, 'sea': 1},
 {'the': 1, 'ball': 1, 'is': 1, 'in': 1, 'your': 1, 'court': 1},
 {'mr': 1, 'smith': 1, 'goes': 1, 'to': 1, 'washington': 1},
 {'doogie': 1, 'howser': 1}]

In [67]:
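from sklearn.feature_extraction import FeatureHasher

# input_type='string' iterates over each dictionary's keys (the tokens) and
# counts every token once; the stored values are ignored. That is harmless
# here because every count is 1.
hasher = FeatureHasher(n_features=8, input_type='string')
hashed_features = hasher.fit_transform(frequency_list)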

hashed_features.shape

(6, 8)

In [68]:
hashed_features.toarray()

array([[ 0.,  3.,  0.,  1., -2., -1.,  0.,  0.],
       [ 0.,  0., -1.,  0., -1.,  0.,  2.,  0.],
       [ 0.,  1., -1.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0., -1.,  0.,  1.,  0., -1.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.]])

In [69]:
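from sklearn.feature_extraction.text import HashingVectorizer

# HashingVectorizer is stateless, so transform() works without fit().
# norm=None returns raw signed counts instead of normalized values.
vectorizer = HashingVectorizer(n_features=8, norm=None)
feature_vector = vectorizer.transform(text_array)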

feature_vector.shape

(6, 8)

In [70]:
feature_vector.toarray()

array([[ 0.,  3.,  0.,  1., -2., -1.,  0.,  0.],
       [ 0.,  0., -1.,  0., -1.,  0.,  2.,  0.],
       [ 0.,  1., -1.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0., -1.,  0.,  1.,  0., -1.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.]])

In [71]:
# norm='l1': each row is divided by the sum of its absolute values
vectorizer = HashingVectorizer(n_features=8, norm='l1')
feature_vector = vectorizer.transform(text_array)

feature_vector.shape

(6, 8)

In [72]:
feature_vector.toarray()

array([[ 0.        ,  0.42857143,  0.        ,  0.14285714, -0.28571429,
        -0.14285714,  0.        ,  0.        ],
       [ 0.        ,  0.        , -0.25      ,  0.        , -0.25      ,
         0.        ,  0.5       ,  0.        ],
       [ 0.        ,  0.33333333, -0.33333333,  0.        ,  0.        ,
         0.        ,  0.        ,  0.33333333],
       [ 0.        ,  0.5       ,  0.        ,  0.        ,  0.        ,
         0.5       ,  0.        ,  0.        ],
       [ 0.        , -0.33333333,  0.        ,  0.33333333,  0.        ,
        -0.33333333,  0.        ,  0.        ],
       [ 0.        ,  0.5       ,  0.        ,  0.        ,  0.5       ,
         0.        ,  0.        ,  0.        ]])

In [73]:
# norm='l2' (the default): each row is divided by its Euclidean length
vectorizer = HashingVectorizer(n_features=8, norm='l2')
feature_vector = vectorizer.transform(text_array)

feature_vector.shape

(6, 8)

In [74]:
feature_vector.toarray()

array([[ 0.        ,  0.77459667,  0.        ,  0.25819889, -0.51639778,
        -0.25819889,  0.        ,  0.        ],
       [ 0.        ,  0.        , -0.40824829,  0.        , -0.40824829,
         0.        ,  0.81649658,  0.        ],
       [ 0.        ,  0.57735027, -0.57735027,  0.        ,  0.        ,
         0.        ,  0.        ,  0.57735027],
       [ 0.        ,  0.70710678,  0.        ,  0.        ,  0.        ,
         0.70710678,  0.        ,  0.        ],
       [ 0.        , -0.57735027,  0.        ,  0.57735027,  0.        ,
        -0.57735027,  0.        ,  0.        ],
       [ 0.        ,  0.70710678,  0.        ,  0.        ,  0.70710678,
         0.        ,  0.        ,  0.        ]])
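The normalized outputs above can be checked by hand from the raw counts. The sketch below takes row 0 of the norm=None output and applies both normalizations in plain Python; `l1_normalize` and `l2_normalize` are hypothetical helpers written for this check, not scikit-learn functions.

```python
import math

# Row 0 of the norm=None output shown earlier in this section
raw_row = [0.0, 3.0, 0.0, 1.0, -2.0, -1.0, 0.0, 0.0]


def l1_normalize(row):
    """Divide each value by the sum of absolute values of the row."""
    total = sum(abs(v) for v in row)
    return [v / total if total else 0.0 for v in row]


def l2_normalize(row):
    """Divide each value by the Euclidean length of the row."""
    length = math.sqrt(sum(v * v for v in row))
    return [v / length if length else 0.0 for v in row]


l1_row = l1_normalize(raw_row)   # divisor is 3 + 1 + 2 + 1 = 7
l2_row = l2_normalize(raw_row)   # divisor is sqrt(9 + 1 + 4 + 1) = sqrt(15)

print([round(v, 8) for v in l1_row])
print([round(v, 8) for v in l2_row])
```

Signs are preserved by both normalizations; only the magnitudes are rescaled so that the absolute values sum to 1 (l1) or the squares sum to 1 (l2).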

In [76]:
# alternate_sign=False: every token adds +1, so colliding tokens no longer cancel out
vectorizer = HashingVectorizer(n_features=8, norm=None, alternate_sign=False)
feature_vector = vectorizer.transform(text_array)

feature_vector.toarray()

array([[0., 3., 0., 1., 2., 1., 0., 0.],
       [0., 0., 1., 0., 1., 0., 2., 0.],
       [2., 1., 1., 0., 0., 0., 2., 1.],
       [0., 1., 0., 0., 2., 1., 2., 0.],
       [0., 1., 0., 3., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0., 0.]])
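Comparing this output with the earlier norm=None output shows what the sign does: in the third document, buckets 0 and 6 were 0 with alternating signs but are 2 here, because a +1 and a -1 had cancelled each other. The sketch below reproduces that mechanism under stated assumptions: `hashed_counts` is a hypothetical helper, `zlib.crc32` stands in for MurmurHash3, and taking the sign from the top bit of the hash is an illustrative rule, so the exact values differ from scikit-learn's.

```python
import zlib


def hashed_counts(tokens, n_features, alternate_sign=True):
    """Accumulate token counts into n_features buckets.

    Assumptions for this sketch: crc32 as the hash, and the top bit of
    the 32-bit hash deciding the sign when alternate_sign is True.
    """
    row = [0] * n_features
    for token in tokens:
        h = zlib.crc32(token.encode("utf-8"))
        index = h % n_features
        sign = -1 if alternate_sign and (h >> 31) & 1 else 1
        row[index] += sign
    return row


tokens = ["these", "are", "other", "fish", "in", "the", "sea"]

signed = hashed_counts(tokens, n_features=8, alternate_sign=True)
unsigned = hashed_counts(tokens, n_features=8, alternate_sign=False)

print("signed:  ", signed)
print("unsigned:", unsigned)
print("tokens:", len(tokens), "unsigned total:", sum(unsigned))
```

Without alternating signs, each row sums to the number of tokens in the document. With them, a bucket's magnitude can only shrink, since opposite signs in the same bucket offset one another.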