# 1. One Hot Encoder

## 1.1. Overview

In this encoding, we first construct a "dictionary" containing all possible values of a categorical feature. Each value is then encoded as a binary vector that is all zeros except for a single element equal to 1, at the position of that value in the dictionary.

For example, if we have a single column of data with the values `"Sydney", "Paris", "New York"`, we proceed as follows:

1. Build a dictionary. In this case, we can build the dictionary as `["New York", "Paris", "Sydney"]`

2. After building the dictionary, we need to save the index of each item in the dictionary. With the above dictionary, the corresponding index is `"New York": 0, "Paris": 1, "Sydney": 2`.

3. Finally, we encode the initial values as follows:

| Category | One-hot vector |
| --- | --- |
| `"Sydney"` | `[0, 0, 1]` |
| `"Paris"` | `[0, 1, 0]` |
| `"New York"` | `[1, 0, 0]` |
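The three steps above can be sketched in plain Python. This is only an illustration of the idea, not how any particular library implements it; the names `dictionary`, `index` and `one_hot` are chosen for this example.

```python
# Input column from the example above.
cities = ["Sydney", "Paris", "New York"]

# Step 1: build the dictionary (here: the sorted unique values).
dictionary = sorted(set(cities))

# Step 2: remember the index of each value in the dictionary.
index = {value: i for i, value in enumerate(dictionary)}

# Step 3: encode a value as a vector of zeros with a single 1.
def one_hot(value):
    vector = [0] * len(dictionary)
    vector[index[value]] = 1
    return vector

encoded = [one_hot(city) for city in cities]
for city, vector in zip(cities, encoded):
    print(city, vector)
```

Sorting the unique values is one simple way to get a reproducible dictionary order; any fixed order would work as long as it is reused at prediction time.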


Since each item value is encoded as a vector with exactly one element equal to 1, at its position in the dictionary, this vector is called a "one-hot vector". The dimension of this vector is exactly equal to the number of entries in the dictionary. To put it another way, each binary element of the vector indicates whether the item value in question "is" the corresponding dictionary value. Out-of-vocabulary (OOV) values can be encoded as `[0, 0, 0]`, in the sense that they match none of the dictionary values.
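As a small sketch of the all-zero OOV encoding, scikit-learn's `OneHotEncoder` can be asked to ignore unknown categories with `handle_unknown="ignore"`. The value `"Tokyo"` is an assumed unseen city used only for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# Fit the dictionary on the three known cities.
train = [["Sydney"], ["Paris"], ["New York"]]
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# "Tokyo" is not in the dictionary, so its row contains only zeros.
encoded = encoder.transform([["Paris"], ["Tokyo"]]).toarray()
print(encoded)
```

Calling `.toarray()` converts the sparse result to a dense array so the all-zero row is visible when printed.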

Another common way to encode values that are not in the dictionary is to add the word `"unknown"` to the dictionary and group all new values under it. Take care when `"unknown"` is itself a possible value in the dataset: encoding unseen values with the same vector can mislead the model into treating the two as the same. If you know that certain values will appear often in the future, include them in the dictionary explicitly so that each gets its own encoding and does not collide with other values. If such values occur rarely, we can group them under a single code and treat them all as `"rare"`. Encoding every rare value separately costs a lot of memory and produces a more complex model that is more prone to overfitting.
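A minimal sketch of the grouping idea, in plain Python. The threshold `MIN_COUNT`, the token `"<rare>"` and the sample data are assumptions made for this example; in this sketch, values never seen during fitting also fall into the same bucket.

```python
from collections import Counter

# Toy training column: "Oslo" and "Lima" each appear only once.
cities = ["Paris", "Paris", "Sydney", "Sydney", "Sydney", "Oslo", "Lima"]

MIN_COUNT = 2    # hypothetical threshold below which a value counts as rare
RARE = "<rare>"  # hypothetical token, chosen so it is unlikely to be a real value

# Keep frequent values in the dictionary and add one shared slot for the rest.
counts = Counter(cities)
frequent = sorted(value for value, n in counts.items() if n >= MIN_COUNT)
dictionary = frequent + [RARE]
index = {value: i for i, value in enumerate(dictionary)}

def one_hot(value):
    # Rare or unseen values fall back to the shared RARE slot.
    position = index.get(value, index[RARE])
    vector = [0] * len(dictionary)
    vector[position] = 1
    return vector

print(dictionary)
print(one_hot("Paris"), one_hot("Oslo"), one_hot("Tokyo"))
```

Using a token such as `"<rare>"` rather than a plain word reduces the chance of colliding with a genuine value in the data, which is the concern raised above for `"unknown"`.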

## 1.2. Sklearn
scikit-learn provides this encoding as [`sklearn.preprocessing.OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn-preprocessing-onehotencoder).

In [8]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Small example table: one categorical column and one numeric column.
df = pd.DataFrame(
    data={"City": ["Sydney", "Paris", "New York"], "Population": [10, 59, 89]}
)

df

| | City | Population |
| --- | --- | --- |
| 0 | Sydney | 10 |
| 1 | Paris | 59 |
| 2 | New York | 89 |


In [9]:
onehot = OneHotEncoder()

onehot_encoded_city = onehot.fit_transform(df[["City"]])
print(type(onehot_encoded_city))
print(onehot_encoded_city)

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 2)	1.0
  (1, 1)	1.0
  (2, 0)	1.0


There are a few points to note here. First, by default the returned `onehot_encoded_city` is stored as a [`scipy.sparse.csr.csr_matrix`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html), a special type for storing two-dimensional arrays whose elements are mostly zero. This storage is very memory-efficient here because each row vector has exactly one non-zero element. If the dictionary size grows into the millions and we store the matrix in its ordinary dense form, we waste resources storing a large number of zeros that carry little information.

When printing `onehot_encoded_city`, we see two columns. The first column holds the (row, column) coordinates of the non-zero elements, and the second column holds the value at each coordinate, which is always 1 in this case.
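To make the sparse layout more concrete, the sketch below builds the same 3x3 one-hot matrix directly with SciPy and inspects what is actually stored. It assumes NumPy and SciPy are installed; the variable names are chosen for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The one-hot matrix from the example, written out in dense form.
dense = np.array([[0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0]])
sparse = csr_matrix(dense)

# Only the non-zero elements are stored: one per row here.
print(sparse.nnz)

# Coordinates (row, column) of the stored elements and their values.
rows, cols = sparse.nonzero()
coords = list(zip(rows.tolist(), cols.tolist()))
print(coords)
print(sparse.data)

# Converting back to a dense array recovers the original matrix.
print(sparse.toarray())
```

For a dictionary with millions of entries, the dense form needs one number per dictionary entry per row, while the sparse form still stores a single non-zero element per row.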

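To return the result as a standard dense matrix, we can add `sparse=False` at initialization. In scikit-learn 1.2 and later this parameter is named `sparse_output`, and the old `sparse` name was removed in version 1.4:

In [10]:
# On scikit-learn >= 1.2, use OneHotEncoder(sparse_output=False) instead.
onehot = OneHotEncoder(sparse=False)

onehot_encoded_city = onehot.fit_transform(df[["City"]])
print(onehot_encoded_city)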

[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


In [11]:
onehot.categories_

[array(['New York', 'Paris', 'Sydney'], dtype=object)]
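The order of `categories_` is the column order of the one-hot vectors, so it can be used to map a vector back to its original value. A small sketch using the encoder's `inverse_transform` method, refitted here so the snippet is self-contained:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the same three cities as above.
onehot = OneHotEncoder()
onehot.fit([["Sydney"], ["Paris"], ["New York"]])

# Column i of a one-hot vector corresponds to onehot.categories_[0][i].
one_hot_rows = np.array([[0, 0, 1], [1, 0, 0]])
decoded = onehot.inverse_transform(one_hot_rows)
print(decoded)
```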

One-hot encoding is a quick way to convert categorical data to numeric form. With this encoding, we can quickly build simple models such as linear regression or SVM, which require numeric input. Decision-tree-based models (Random Forest, LightGBM, XGBoost, etc.), which are very common for tabular data, do not need one-hot vectors: it is enough to map each value to its index in the dictionary and tell the model that the feature is categorical, and the model will handle it appropriately.
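As a sketch of the mapping to dictionary indices, scikit-learn's `OrdinalEncoder` replaces each value in a feature column with its position in the sorted dictionary. How the column is then flagged as categorical depends on the model library (LightGBM, for instance, accepts a `categorical_feature` argument), so that step is left out here.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"City": ["Sydney", "Paris", "New York"]})

# Each city is replaced by its index in the learned dictionary.
ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(df[["City"]])

print(ordinal.categories_)
print(codes.ravel())
```

Note that the resulting integers carry no meaningful order; they are only identifiers, which is why the model needs to be told the feature is categorical.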

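See also [`sklearn.preprocessing.LabelEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html), which performs this value-to-index mapping. scikit-learn documents it as intended for target labels; `sklearn.preprocessing.OrdinalEncoder` is the counterpart for feature columns.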

# 2. Hashing
Coming soon...

# 3. Crossing
Coming soon...