### **Hashing Trick**:
<p>  <a style="color:#00FFFF"><b>The hashing trick</b></a> is a dimensionality reduction technique in which each category is hashed into one of a fixed number of bins, so the width of the encoded variable is chosen in advance and does not grow with the number of distinct categories. This is useful when a variable has a large number of categories. The trade-off is hash collisions: different categories can be mapped to the same bin.</p>
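The mapping from category to bin can be sketched in plain Python. This is only an illustration: the helper `bin_index` is a hypothetical name, and MD5 is used as a stand-in for a deterministic hash (scikit-learn's FeatureHasher uses MurmurHash3 instead, so the bin numbers will not match its output).

```python
import hashlib


def bin_index(category: str, n_bins: int) -> int:
    """Map a category name to a bin index in the range [0, n_bins)."""
    # MD5 is an assumption made for this sketch; any stable hash would do.
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bins


categories = ["dog", "cat", "run", "elephant", "bird", "fish"]
n_bins = 4

# Every category gets a bin, no matter how many categories there are.
bins = {category: bin_index(category, n_bins) for category in categories}
print(bins)

# Six categories cannot fit into four bins without sharing,
# so at least one collision is guaranteed here.
print("distinct bins used:", len(set(bins.values())))
```

With more categories than bins, collisions are unavoidable; with more bins than categories they become less likely but are still possible.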

In [64]:
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Create a dataset of three samples using the features 'dog', 'cat', and 'run'
data = [{'dog': 1, 'cat': 2}, {'dog': 2, 'run': 5}, {'dog': 4, 'cat': 2, 'run': 5}]


This code creates a dataset with three features: ‘dog’, ‘cat’, and ‘run’. The dataset contains three samples (dictionaries): the first has values of 1 for ‘dog’ and 2 for ‘cat’, the second has 2 for ‘dog’ and 5 for ‘run’, and the third has 4 for ‘dog’, 2 for ‘cat’, and 5 for ‘run’.

In [65]:
# Use the hashing trick to map each feature to one of 4 bins
hasher = FeatureHasher(n_features=4)  # Creates a FeatureHasher object with 4 bins, so each feature name is mapped to one of 4 column indices.
hashed_data = hasher.transform(data)  # Transforms the dataset into a sparse matrix with 4 columns and 3 rows.
# The columns represent the 4 bins, and the rows represent the three samples in the dataset.

This code uses the hashing trick to map each feature to one of 4 bins. The FeatureHasher class from scikit-learn is used to perform this operation. The n_features parameter specifies the number of bins (output columns) to use. The transform() method applies the hashing trick to the data; no fitting step is needed because the hash function is fixed.

In [66]:
# Print the hashed data
print(hashed_data.toarray())

[[ 0. -1.  0.  2.]
 [-5. -2.  0.  0.]
 [-5. -4.  0.  2.]]


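This code prints the hashed data. The toarray() method converts the sparse matrix into a dense NumPy array. Each row is a sample and each column is a bin: ‘dog’, for example, lands in the second column in every row, with its values 1, 2, and 4 stored with a negative sign.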
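The negative entries in the printed array come from FeatureHasher's default setting `alternate_sign=True`, which multiplies each value by a sign derived from the hash so that colliding features tend to cancel rather than pile up. As a minimal sketch, assuming the same three samples as above (redefined here under the name `samples` so the block is self-contained), the sign can be switched off:

```python
from sklearn.feature_extraction import FeatureHasher

# Same three samples as in the example above.
samples = [{'dog': 1, 'cat': 2}, {'dog': 2, 'run': 5}, {'dog': 4, 'cat': 2, 'run': 5}]

# alternate_sign=False keeps every hashed value non-negative.
unsigned_hasher = FeatureHasher(n_features=4, alternate_sign=False)
unsigned = unsigned_hasher.transform(samples).toarray()

print(unsigned)

# Without sign flipping, each row still sums to the total of its input values.
print(unsigned.sum(axis=1))
```

Non-negative output can be convenient for models that expect counts, at the cost of collisions always adding up instead of sometimes cancelling.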

The next example uses the hashing trick in scikit-learn to encode categorical variables whose values are strings. The FeatureHasher class from the sklearn.feature_extraction module again performs the hashing.

In this example, the data variable is a list of dictionaries where each dictionary represents a row in the dataset. The keys in each dictionary represent the categorical variables and the values represent the categories. Because the values are strings, FeatureHasher hashes each key and value together as a single token of the form key=value (for example, color=red) and assigns it a value of 1.

The FeatureHasher object is created with n_features=2, which means that every row is encoded into two output columns. The input_type parameter is set to ‘dict’ to indicate that each sample is a dictionary.

The transform method of the FeatureHasher object is then called with the data variable as its argument. This method returns a sparse matrix that contains the hashed feature vectors for each row in the dataset.

Finally, the toarray method of the sparse matrix object is called to convert it to a dense numpy array that can be printed.

In [48]:
from sklearn.feature_extraction import FeatureHasher

data = [{'color': 'red', 'fruit': 'apple'},
        {'color': 'blue', 'fruit': 'banana'},
        {'color': 'green', 'fruit': 'pear'}]

h = FeatureHasher(n_features=2, input_type='dict')
f = h.transform(data)

print(f.toarray())
data

[[ 0.  0.]
 [ 1. -1.]
 [ 1.  1.]]


[{'color': 'red', 'fruit': 'apple'},
 {'color': 'blue', 'fruit': 'banana'},
 {'color': 'green', 'fruit': 'pear'}]
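The first row of the output is all zeros even though that sample has two features: both of its tokens fall into the same bin with opposite signs and cancel out. With only two bins this kind of collision is very likely. The sketch below, which reuses the same three rows under the hypothetical name `rows`, counts the non-zero entries per row for a few table sizes; the sizes 8 and 1024 are arbitrary choices for illustration.

```python
from sklearn.feature_extraction import FeatureHasher

rows = [{'color': 'red', 'fruit': 'apple'},
        {'color': 'blue', 'fruit': 'banana'},
        {'color': 'green', 'fruit': 'pear'}]

nonzero_counts = {}
for n in (2, 8, 1024):
    hashed = FeatureHasher(n_features=n, input_type='dict').transform(rows)
    # Each row has two tokens, so at most two entries can be non-zero.
    # Fewer than two means the tokens collided in one bin.
    nonzero_counts[n] = (hashed.toarray() != 0).sum(axis=1).tolist()

for n, counts in nonzero_counts.items():
    print(f"n_features={n}: non-zero entries per row = {counts}")
```

A row that keeps both of its non-zero entries has been encoded without a collision; larger values of n_features make that outcome more likely while increasing the width of the matrix.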

In [53]:
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

data = pd.DataFrame({'type': ['a', 'b', 'a', 'c', 'b'], 'model': ['bab', 'ba', 'ba', 'ce', 'bw']})

h = FeatureHasher(n_features=2, input_type='string')
f = h.transform(data.values.astype(str))

print(f.toarray())


[[ 0.  0.]
 [-1. -1.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]]
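Most rows above are zero because, with input_type='string', each cell is hashed as a bare string into only two bins, so the two tokens of a row often collide and cancel. One possible variation, shown here as a sketch rather than a recommended recipe, is to prefix each value with its column name so that the same string in different columns becomes a different token, and to use more bins. The names `df` and `tokens` and the choice of 8 bins are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({'type': ['a', 'b', 'a', 'c', 'b'],
                   'model': ['bab', 'ba', 'ba', 'ce', 'bw']})

# Build one token per cell, e.g. "type=a", so column identity is part of the hash input.
tokens = [[f"{col}={val}" for col, val in row.items()]
          for row in df.to_dict(orient='records')]

# alternate_sign=False keeps values non-negative, so collisions add up instead of cancelling.
hasher = FeatureHasher(n_features=8, input_type='string', alternate_sign=False)
hashed = hasher.transform(tokens).toarray()

print(tokens[0])
print(hashed)
```

Each row now sums to 2, one unit per token, whether or not the two tokens share a bin.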
