## HashingVectorizer
* <b>HashingVectorizer</b> combines the hashing of <b>FeatureHasher</b> with the tokenizing and counting of <b>CountVectorizer</b>, i.e., we get the term frequency of words in a vector of reduced, fixed dimension
* When we used FeatureHasher, the dimension of the vector was reduced, but the output was not in an easily understandable format

#### If we have a large vocabulary of words, we can choose to use the HashingVectorizer rather than the CountVectorizer
* Using hash buckets to represent words lets the HashingVectorizer scale to large data sets, because no vocabulary has to be built or kept in memory.
* The key input argument to the vectorizer is the number of hash buckets (n_features).
* Result: a numeric representation of all the words in the documents.
* Word ids range from 0 to (n_features - 1) because there are n_features buckets in total.
* When the vocabulary is larger than the number of buckets, multiple words necessarily hash to the same bucket (a collision).
* There is no way to get back from a hash bucket value to the original word.
* By default, the frequencies in each document are represented in normalized form (norm='l2'); pass norm=None to get raw counts.
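
The bucket mechanics listed above can be sketched in plain Python. This is an illustration only: the helper names `bucket_of` and `hash_counts` are hypothetical, and MD5 from `hashlib` stands in for the signed 32-bit MurmurHash3 that scikit-learn uses, so the bucket ids produced here will not match those of HashingVectorizer.

```python
import hashlib
import re

def bucket_of(token, n_features):
    """Map a token to a bucket id in [0, n_features - 1] (MD5 is a stand-in hash)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_features

def hash_counts(document, n_features):
    """Count tokens per bucket; no vocabulary is stored, so words cannot be recovered."""
    # Lowercase and keep tokens of 2+ word characters, similar to CountVectorizer's defaults
    tokens = re.findall(r"\b\w\w+\b", document.lower())
    counts = [0] * n_features
    for token in tokens:
        counts[bucket_of(token, n_features)] += 1
    return counts

doc = "We May Encounter Many Defeats But We Must Not Be Defeated."
counts = hash_counts(doc, n_features=8)
print(counts)        # 8 bucket counts
print(sum(counts))   # 11, one per token in the sentence
```

The sentence has 10 distinct words but only 8 buckets, so at least two different words must share a bucket, and the counts alone cannot tell them apart.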

https://scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick

##### Words are mapped directly to indices with a hashing function

In [2]:
from sklearn.feature_extraction.text import HashingVectorizer

text_array = ["The Pessimist Sees Difficulty In Every Opportunity.",
              "The Optimist Sees Opportunity In Every Difficulty.",
              "Don’t Let Yesterday Take Up Too Much Of Today. ",
              "You Learn More From Failure Than From Success.",
              "We May Encounter Many Defeats But We Must Not Be Defeated.",
              "Life Is Either A Daring Adventure Or Nothing."]

#### The HashingVectorizer

In [3]:
from sklearn.feature_extraction.text import HashingVectorizer

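# Hash every token into one of 8 buckets; norm=None keeps the raw (signed) counts
vectorizer = HashingVectorizer(n_features=8,
                               norm=None)

# HashingVectorizer is stateless, so transform() works without a prior fit()
hash_vector = vectorizer.transform(text_array)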

hash_vector.shape

(6, 8)

In [4]:
hash_vector.toarray()

array([[ 0.,  2.,  1., -1., -1.,  1., -1.,  0.],
       [ 0.,  2.,  0., -1., -1.,  1., -1., -1.],
       [-1., -1.,  0.,  0.,  2.,  1., -1., -1.],
       [ 0.,  0.,  0., -1.,  0.,  0.,  0., -1.],
       [ 1.,  1.,  0., -2.,  1.,  1.,  1.,  2.],
       [-1.,  0.,  0.,  0.,  1.,  1., -1.,  1.]])

In [5]:
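# alternate_sign=False turns off the hash-derived +/-1 sign (on by default),
# so every entry is a plain, non-negative token count
vectorizer = HashingVectorizer(n_features=8,
                               norm=None,
                               alternate_sign=False)

hash_vector = vectorizer.transform(text_array)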

hash_vector.toarray()

array([[0., 2., 1., 1., 1., 1., 1., 0.],
       [0., 2., 0., 1., 1., 1., 1., 1.],
       [1., 1., 0., 0., 2., 1., 3., 1.],
       [0., 0., 2., 1., 4., 0., 0., 1.],
       [1., 1., 0., 2., 1., 1., 3., 2.],
       [1., 2., 0., 0., 1., 1., 1., 1.]])
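
The relationship between the signed and unsigned outputs can be checked directly. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; the variable names `docs`, `signed` and `unsigned` are chosen here for illustration. With alternate_sign=True each token contributes +1 or -1 to its bucket, so tokens that share a bucket may cancel, and the magnitude of a signed entry can never exceed the plain count.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Fourth sentence of text_array
docs = ["You Learn More From Failure Than From Success."]

# Default behaviour: each token adds +1 or -1 to its bucket
signed = HashingVectorizer(n_features=8,
                           norm=None).transform(docs).toarray()

# Signs disabled: each token adds +1 to its bucket
unsigned = HashingVectorizer(n_features=8,
                             norm=None,
                             alternate_sign=False).transform(docs).toarray()

print(signed)
print(unsigned)

# Cancellation can only shrink a bucket's magnitude, never grow it
print(np.all(np.abs(signed) <= unsigned))   # True

# The unsigned counts add up to the number of tokens in the sentence
print(unsigned.sum())                       # 8.0
```

This is why the fourth row above shows only two non-zero entries with signs enabled but four without: in two buckets the positive and negative contributions cancel exactly.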

In [6]:
# norm='l1' scales each row so that its absolute values sum to 1
vectorizer = HashingVectorizer(n_features=8,
                               norm='l1')

hash_vector = vectorizer.transform(text_array)

hash_vector.shape

(6, 8)

In [7]:
hash_vector.toarray()

array([[ 0.        ,  0.28571429,  0.14285714, -0.14285714, -0.14285714,
         0.14285714, -0.14285714,  0.        ],
       [ 0.        ,  0.28571429,  0.        , -0.14285714, -0.14285714,
         0.14285714, -0.14285714, -0.14285714],
       [-0.14285714, -0.14285714,  0.        ,  0.        ,  0.28571429,
         0.14285714, -0.14285714, -0.14285714],
       [ 0.        ,  0.        ,  0.        , -0.5       ,  0.        ,
         0.        ,  0.        , -0.5       ],
       [ 0.11111111,  0.11111111,  0.        , -0.22222222,  0.11111111,
         0.11111111,  0.11111111,  0.22222222],
       [-0.2       ,  0.        ,  0.        ,  0.        ,  0.2       ,
         0.2       , -0.2       ,  0.2       ]])

In [8]:
# norm='l2' (the default) scales each row to unit Euclidean length
vectorizer = HashingVectorizer(n_features=8,
                               norm='l2')

hash_vector = vectorizer.transform(text_array)

hash_vector.shape

(6, 8)

In [9]:
hash_vector.toarray()

array([[ 0.        ,  0.66666667,  0.33333333, -0.33333333, -0.33333333,
         0.33333333, -0.33333333,  0.        ],
       [ 0.        ,  0.66666667,  0.        , -0.33333333, -0.33333333,
         0.33333333, -0.33333333, -0.33333333],
       [-0.33333333, -0.33333333,  0.        ,  0.        ,  0.66666667,
         0.33333333, -0.33333333, -0.33333333],
       [ 0.        ,  0.        ,  0.        , -0.70710678,  0.        ,
         0.        ,  0.        , -0.70710678],
       [ 0.2773501 ,  0.2773501 ,  0.        , -0.5547002 ,  0.2773501 ,
         0.2773501 ,  0.2773501 ,  0.5547002 ],
       [-0.4472136 ,  0.        ,  0.        ,  0.        ,  0.4472136 ,
         0.4472136 , -0.4472136 ,  0.4472136 ]])
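
The normalized values can be reproduced by hand from the raw counts. The sketch below uses only the standard library and takes the first row of the norm=None output shown earlier; the variable names are chosen here for illustration.

```python
import math

# First row of the norm=None output
row = [0., 2., 1., -1., -1., 1., -1., 0.]

# L1 norm: sum of absolute values
l1 = sum(abs(v) for v in row)             # 7.0
# L2 norm: square root of the sum of squares
l2 = math.sqrt(sum(v * v for v in row))   # 3.0

# Divide every entry by the norm
row_l1 = [v / l1 for v in row]
row_l2 = [v / l2 for v in row]

print([round(v, 8) for v in row_l1])
print([round(v, 8) for v in row_l2])
```

Dividing by 7 gives the 0.28571429 and 0.14285714 entries of the first norm='l1' row, and dividing by 3 gives the 0.66666667 and 0.33333333 entries of the first norm='l2' row.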