-------------------
## Hashing with HashingVectorizer
---------------------------------

The __count & tf_idf__ vectorization scheme is simple, but it holds an __in-memory__ mapping from the string tokens to the integer feature indices (the `vocabulary_` attribute), which causes several problems when dealing with large datasets:

- the `larger the corpus`, the `larger the vocabulary` grows, and with it the memory use,

- building the word-mapping requires a `full pass over the dataset`, hence it is `not` possible to fit text classifiers in a strictly __online__ manner.
    - It is impossible to do `online` or `out-of-core` / `streaming` learning: the `vocabulary_` needs to be learned from the data, so its size cannot be known before making one pass over the full dataset.
    
- `pickling` and `un-pickling` vectorizers with a large vocabulary_ can be very `slow` (typically much slower than pickling / un-pickling flat data structures such as a NumPy array of the same size),

- it is not easy to split the vectorization work into concurrent sub-tasks, as the `vocabulary_` attribute would have to be shared state behind a fine-grained synchronization barrier: the mapping from token string to feature index depends on the order in which each token first occurs, so it would have to be shared, potentially slowing the concurrent workers to the point of making them slower than the sequential variant.
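The memory and pickling points can be observed on a toy scale. The sketch below is an illustration under stated assumptions, not a benchmark: `make_corpus` and `vocabulary_stats` are hypothetical helpers, and the synthetic corpus is built so that every document contributes two tokens never seen before.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer


def make_corpus(n_docs):
    # Hypothetical synthetic corpus: each document adds two brand-new tokens.
    return [f"shared word tok{i}a tok{i}b" for i in range(n_docs)]


def vocabulary_stats(n_docs):
    # Fit a CountVectorizer and report how much state it had to learn.
    vectorizer = CountVectorizer().fit(make_corpus(n_docs))
    vocab_size = len(vectorizer.vocabulary_)
    pickled_bytes = len(pickle.dumps(vectorizer))
    return vocab_size, pickled_bytes


for n_docs in (100, 1_000, 10_000):
    vocab_size, pickled_bytes = vocabulary_stats(n_docs)
    print(f"{n_docs:>6} docs -> {vocab_size:>6} tokens, {pickled_bytes:>8} pickled bytes")
```

In this artificial setup both numbers keep growing with the corpus, and neither can be known before the full pass made by `fit` has finished.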

> It is possible to overcome those limitations by combining the “hashing trick” (feature hashing) implemented by the `sklearn.feature_extraction.FeatureHasher` class with the text preprocessing and tokenization features of `CountVectorizer`.
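One way to picture that combination is to do it by hand: tokenize with the analyzer of an unfitted `CountVectorizer`, then pass the token lists to `FeatureHasher`. This is a sketch under my own assumptions (the sample `docs`, the value of `n_features`, and the comparison with `norm=None` are illustrative choices), not a description of how the library must be used.

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["The quick brown fox", "jumped over the lazy dog"]
n_features = 2 ** 10

# Reuse CountVectorizer's preprocessing and tokenization without fitting a vocabulary.
analyze = CountVectorizer().build_analyzer()
token_lists = [analyze(doc) for doc in docs]

# Hash the token strings straight into column indices.
hasher = FeatureHasher(n_features=n_features, input_type="string")
X_manual = hasher.transform(token_lists)

# HashingVectorizer with normalization switched off, for comparison.
X_hashing = HashingVectorizer(n_features=n_features, norm=None).transform(docs)

print(token_lists)
print(X_manual.shape, X_hashing.shape)
print("largest difference:", abs(X_manual - X_hashing).sum())
```

With default settings on both sides the two matrices should agree, which suggests that the vectorizer adds only the text handling and the optional normalization on top of the hasher.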

`class sklearn.feature_extraction.text.HashingVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float64'>)`


- Convert a collection of text documents to a matrix of token occurrences

- It turns a collection of text documents into a `scipy.sparse` matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if `norm='l1'` or projected onto the Euclidean unit sphere if `norm='l2'`.

- This text vectorizer implementation uses the __hashing__ trick to map each token string to an integer feature index.
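The idea behind that mapping fits in a few lines of plain Python. The sketch below is a toy, not the library's implementation: `hash_vectorize` is a hypothetical helper, and `zlib.crc32` is used only as a stand-in hash from the standard library (it is not MurmurHash3), so the resulting columns will differ from scikit-learn's.

```python
import re
import zlib


def hash_vectorize(text, n_features=16):
    """Toy hashing-trick vectorizer: signed token counts in a fixed-size list."""
    vector = [0] * n_features
    # Lowercase and keep tokens of two or more word characters.
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        h = zlib.crc32(token.encode("utf-8"))  # stand-in hash, NOT MurmurHash3
        index = h % n_features                 # column chosen by the hash alone
        sign = 1 if (h >> 31) == 0 else -1     # top bit of the hash decides the sign
        vector[index] += sign
    return vector


print(hash_vectorize("This is the first document."))
```

No dictionary is built at any point: the same token always lands in the same column, whichever document or process sees it first.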

This strategy has several advantages:

- it uses very little memory and scales to large datasets, as there is no need to store a vocabulary dictionary in memory

- it is fast to pickle and un-pickle as it holds no state besides the constructor parameters

- it can be used in a streaming (__partial fit__) or parallel pipeline as there is no state computed during fit.

There are also a few cons (vs using a CountVectorizer with an in-memory vocabulary):

- there is no way to compute the inverse transform (from feature indices to string feature names), which can be a problem when trying to introspect which features are most important to a model.

- there can be collisions: distinct tokens can be mapped to the same feature index. However, in practice this is rarely an issue if `n_features` is large enough (e.g. 2 ** 18 for text classification problems).

- there is no IDF weighting, as this would render the transformer stateful.
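The streaming use case mentioned among the advantages can be sketched as follows. Everything about the data is an assumption made for illustration: `batches` is a hypothetical stand-in for chunks read from disk or a message queue, and `SGDClassifier` is just one of the estimators that offer `partial_fit`.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical mini-batches standing in for chunks of a large dataset.
batches = [
    (["cheap pills buy now", "meeting moved to friday"], [1, 0]),
    (["win money now", "lunch on friday"], [1, 0]),
    (["buy cheap money", "project meeting notes"], [1, 0]),
]

vectorizer = HashingVectorizer(n_features=2 ** 18)
classifier = SGDClassifier(random_state=0)

for texts, labels in batches:
    # transform() needs no prior fit: the token -> column mapping is a fixed hash.
    X_batch = vectorizer.transform(texts)
    # Incremental learning needs the full set of class labels declared up front.
    classifier.partial_fit(X_batch, labels, classes=[0, 1])

predictions = classifier.predict(vectorizer.transform(["cheap money now"]))
print(predictions)
```

Only one batch is held in memory at a time, and the vectorizer never acquires a `vocabulary_` attribute, which is also why feature indices cannot be traced back to token strings.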

The hash function employed is the signed 32-bit version of __MurmurHash3__.
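scikit-learn exposes this hash as `sklearn.utils.murmurhash3_32`, which makes it possible to look at the raw values. The rule used in `expected_slot` below (column = absolute hash modulo `n_features`, sign = sign of the hash, seed 0) is my reading of the implementation and should be treated as an assumption; the helper itself is hypothetical.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.utils import murmurhash3_32

n_features = 32
vectorizer = HashingVectorizer(n_features=n_features, norm=None)


def expected_slot(token, n_features):
    """Column and sign predicted from the signed 32-bit hash (assumed rule)."""
    h = murmurhash3_32(token, seed=0)  # signed int in [-2**31, 2**31 - 1]
    return abs(h) % n_features, (1 if h >= 0 else -1)


for token in ["quick", "brown", "fox"]:
    row = vectorizer.transform([token])  # one document holding a single token
    actual = (int(row.indices[0]), int(row.data[0]))
    print(token, expected_slot(token, n_features), actual)
```

If the assumed rule holds, the predicted and actual pairs printed on each line are identical.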

#### Parameters

1. input : string {'filename', 'file', 'content'}, default='content'
2. lowercase : boolean, default=True
3. preprocessor : callable or None (default)
4. tokenizer : callable or None (default)
5. stop_words : string {'english'}, list, or None (default)
6. ngram_range : tuple (min_n, max_n), default=(1, 1)
7. `n_features` : integer, default=(2 ** 20).
   The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions.
8. binary : boolean, default=False.
   If True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
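The effect of `n_features`, `binary` and `ngram_range` is easiest to see with normalization and sign alternation switched off. The snippet below is a small sketch; the choice of `norm=None`, `alternate_sign=False` and `n_features=8` is mine, made only so that the matrices hold plain non-negative counts.

```python
from sklearn.feature_extraction.text import HashingVectorizer

text = ["The quick brown fox jumped over the lazy dog jumped."]

# norm=None and alternate_sign=False keep raw non-negative counts for readability.
counts = HashingVectorizer(n_features=8, norm=None, alternate_sign=False)
flags = HashingVectorizer(n_features=8, norm=None, alternate_sign=False, binary=True)
bigrams = HashingVectorizer(n_features=8, norm=None, alternate_sign=False,
                            ngram_range=(1, 2))

X_counts = counts.transform(text).toarray()
X_flags = flags.transform(text).toarray()
X_bigrams = bigrams.transform(text).toarray()

print(X_counts, X_counts.sum())  # the entries add up to the 10 tokens of the sentence
print(X_flags)                   # same non-zero columns, every value capped at 1
print(X_bigrams.sum())           # 10 unigrams plus 9 bigrams
```

The total count is preserved whatever the collisions, but with only 8 columns several distinct tokens may share a column, which is the information loss that a larger `n_features` is meant to limit.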

In [1]:
from sklearn.feature_extraction.text import HashingVectorizer

In [3]:
# list of text documents
text = ["The quick brown fox jumped over the lazy dog jumped."] 

# create the transform
vectorizer = HashingVectorizer(n_features=6)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(1, 6)
[[-0.53452248  0.53452248  0.         -0.53452248  0.26726124  0.26726124]]


Another example, this time with a small corpus of four documents:

In [5]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

In [6]:
vectorizer = HashingVectorizer(n_features=2**4)

In [9]:
X = vectorizer.fit_transform(corpus)

In [10]:
print(X.shape)

(4, 16)


In [11]:
X.toarray()

array([[-0.57735027,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        , -0.57735027,  0.        ,
         0.        ,  0.        ,  0.        ,  0.57735027,  0.        ,
         0.        ],
       [-0.81649658,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.40824829,  0.        ,  0.40824829,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        , -0.70710678,
         0.70710678,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [-0.57735027,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        , -0.57735027,  0.        ,
         0.        ,  0.        ,  0.        ,  0.57735027,  0.        ,
         0.        ]])
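Two properties of this output can be checked directly: every row has unit Euclidean length because of the default `norm='l2'`, and the first and last rows are identical because those two documents contain the same tokens in a different order. A short check, assuming the same corpus and settings as above:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

X = HashingVectorizer(n_features=2**4).fit_transform(corpus)
dense = X.toarray()

# Each row is scaled to unit Euclidean length by the default norm='l2'.
row_norms = np.linalg.norm(dense, axis=1)
print(row_norms)

# Documents 0 and 3 share the same bag of words, so their rows coincide.
same_rows = np.allclose(dense[0], dense[3])
print(same_rows)
```

Word order is lost in the process, as with any bag-of-words representation, unless n-grams are requested through `ngram_range`.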