# Module 1: Data Analysis and Data Preprocessing

## Section 3: Encoding categorical variables

### Part 5: Hashing Encoding

Hashing Encoding is a data preprocessing technique that converts categorical variables into a numerical representation using a hash function. It is particularly useful for high-cardinality categorical variables (those with many unique categories) or when memory efficiency is a concern.

### 5.1 Understanding Hashing Encoding

Hashing Encoding applies a hash function to each category and uses the output to assign the category to one of a fixed number of bins (columns). The resulting numerical representation can be used as input to machine learning algorithms.

The key idea is to transform categorical variables into a fixed-dimensional space, regardless of the number of unique categories. Because the bin is computed directly from the category value, no lookup table of categories has to be built or stored, which is what makes the technique attractive for high-cardinality variables.

### 5.2 Using Hashing Encoding

To apply Hashing Encoding, we need a dataset with categorical variables. The encoding process involves applying a hash function to each category and mapping the resulting hash value to a fixed number of bins. The number of bins determines the dimensionality of the encoded representation.
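The process described above can be sketched in plain Python. This is only an illustration of the idea, not how scikit-learn implements it: the helper names `hash_bucket` and `hash_encode` are hypothetical, and MD5 is chosen here simply because it gives the same result on every run (Python's built-in `hash()` is salted per process for strings).

```python
import hashlib


def hash_bucket(category: str, n_bins: int) -> int:
    """Map a category string to a bin index in [0, n_bins)."""
    # Assumption: MD5 is used only as a deterministic hash for illustration.
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bins


def hash_encode(categories, n_bins):
    """Return one row per category, with a single 1 in its hashed bin."""
    rows = []
    for category in categories:
        row = [0] * n_bins
        row[hash_bucket(category, n_bins)] = 1
        rows.append(row)
    return rows


colors = ["Red", "Green", "Blue", "Green"]
encoded = hash_encode(colors, n_bins=8)
for color, row in zip(colors, encoded):
    print(color, row)
```

Every row has exactly `n_bins` entries no matter how many distinct categories appear, and identical categories always land in the same bin.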

scikit-learn's FeatureHasher maps each category to an index in a fixed-size feature vector. Because it stores no vocabulary of categories and returns a sparse matrix, it keeps memory usage low even when the number of categories is very large.

Here's an example of how to use scikit-learn's FeatureHasher:

```python
from sklearn.feature_extraction import FeatureHasher

# Sample data with a categorical feature
data = [{'color': 'Red'}, {'color': 'Green'}, {'color': 'Blue'}, {'color': 'Green'}, {'color': 'Red'}, {'color': 'Blue'}]

# Create the FeatureHasher object
hasher = FeatureHasher(n_features=3, input_type='dict')

# Transform the data using FeatureHasher (returns a SciPy sparse matrix)
hashed_features = hasher.transform(data)

# Convert the hashed features to a dense NumPy array for better visualization
hashed_features_array = hashed_features.toarray()

# Print the original data and the hashed features
print("Original Data:")
print(data)
print("\nHashed Features:")
print(hashed_features_array)
```

In this example, the sample dataset has a single categorical feature, 'color'. The n_features=3 argument sets the width of the resulting feature vector, and input_type='dict' tells FeatureHasher that each sample is a dictionary mapping feature names to values. When a value is a string, FeatureHasher hashes the combined name (for example 'color=Red') and gives it a value of 1.

The output is a dense NumPy array with one row per sample and n_features columns. Each row has a single nonzero entry, located in the column that the sample's category hashed to. That entry is either 1 or -1, because by default (alternate_sign=True) FeatureHasher applies a sign derived from the hash so that colliding features tend to cancel out rather than accumulate.

Note that hashing can produce collisions, where different categories are mapped to the same index. With only three columns, as in this example, collisions are likely; choosing a sufficiently large n_features reduces their likelihood, although it cannot rule them out. Also keep in mind that FeatureHasher is a one-way transformation: it provides no way to map the hashed features back to the original categories. It is therefore generally used in scenarios where interpretability of the transformed features is not a critical concern.

### 5.3 Choosing Parameters

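The most important parameter in Hashing Encoding is the number of features or bins (n_features), which determines the dimensionality of the encoded representation. Choosing an appropriate value depends on the cardinality of the categorical variable and the desired trade-off between memory efficiency and collision risk: too few bins cause many categories to share a column, while too many bins produce a wider feature matrix than necessary. scikit-learn's FeatureHasher defaults to n_features=2**20.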
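One way to get a feel for this trade-off is to hash a column of known cardinality at several values of n_features and count how many output columns are actually used. The sketch below does this with FeatureHasher; the `user_id` column, the 1,000 synthetic IDs, and the helper `count_used_columns` are hypothetical and serve only as an illustration.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical high-cardinality column: 1,000 distinct user IDs.
categories = [f"user_{i}" for i in range(1000)]
rows = [{"user_id": c} for c in categories]


def count_used_columns(n_features):
    """Number of output columns that receive at least one category."""
    hasher = FeatureHasher(n_features=n_features, input_type="dict")
    hashed = hasher.transform(rows)  # sparse matrix, one nonzero per row
    return len(set(hashed.nonzero()[1]))


results = {n: count_used_columns(n) for n in (16, 256, 4096, 65536)}
for n, used in results.items():
    shared = len(categories) - used
    print(f"n_features={n:>6}: {used} columns used, "
          f"{shared} categories share a column with another")
```

When the number of used columns is close to the number of categories, collisions are rare. When it is much smaller, many categories have become indistinguishable to the model.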

### 5.4 Handling High-Cardinality Variables

Hashing Encoding is particularly useful for high-cardinality variables, that is, variables with a large number of unique categories. By mapping categories to a fixed number of bins, it ensures a consistent dimensionality regardless of how many categories exist, including categories that first appear after the encoder was set up. However, collisions can occur, where different categories are hashed to the same bin.
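The fixed output width can be checked directly. In the following sketch, the `city` column and both batches are made-up data; the same FeatureHasher encodes a batch with 10 categories and a batch with 50,000 categories, and both come out with the same number of columns.

```python
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=1024, input_type="dict")

# Hypothetical data: a small batch and a much larger, mostly unseen one.
small_batch = [{"city": f"city_{i}"} for i in range(10)]
large_batch = [{"city": f"city_{i}"} for i in range(50_000)]

# FeatureHasher is stateless, so transform() works without a prior fit().
small_hashed = hasher.transform(small_batch)
large_hashed = hasher.transform(large_batch)

print(small_hashed.shape)  # (10, 1024)
print(large_hashed.shape)  # (50000, 1024)

# The output is a SciPy sparse matrix: only nonzero entries are stored.
print(large_hashed.nnz)
```

Because no vocabulary is learned, the categories in the second batch that were never seen before need no special handling. The price is that, with 50,000 categories and 1,024 columns, many categories necessarily share a column.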

### 5.5 Summary

Hashing Encoding is a data preprocessing technique used to convert categorical variables into a numerical representation using hash functions. It maps categories to a fixed number of bins, allowing for a consistent dimensionality regardless of the number of unique categories. Libraries like category_encoders and scikit-learn provide convenient functions and classes for performing Hashing Encoding in Python. Understanding the concepts, choosing the number of features, and handling high-cardinality variables are crucial for effectively using Hashing Encoding in practice.

In the next part, we will explore other data preprocessing techniques provided by Scikit-Learn.

Feel free to practice implementing Hashing Encoding using libraries like category_encoders or scikit-learn. Experiment with different datasets and observe the effects of the encoding on the categorical variables.