<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Categorical-Preprocessing-" data-toc-modified-id="Categorical-Preprocessing--1">Categorical Preprocessing </a></span></li><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-2">Learning Outcomes</a></span></li><li><span><a href="#Categorical-Feature-Engineering-" data-toc-modified-id="Categorical-Feature-Engineering--3">Categorical Feature Engineering </a></span></li><li><span><a href="#What-are-the-most-common-ways-to-encode-categorical-features?" data-toc-modified-id="What-are-the-most-common-ways-to-encode-categorical-features?-4">What are the most common ways to encode categorical features?</a></span></li><li><span><a href="#What-is-the-difference-between-pd.get_dummies-and-sklearn.preprocessing.OneHotEncoder?" data-toc-modified-id="What-is-the-difference-between-pd.get_dummies-and-sklearn.preprocessing.OneHotEncoder?-5">What is the difference between pd.get_dummies and sklearn.preprocessing.OneHotEncoder?</a></span></li><li><span><a href="#Embeddings-(e.g.,-word2vec)" data-toc-modified-id="Embeddings-(e.g.,-word2vec)-6">Embeddings (e.g., word2vec)</a></span></li><li><span><a href="#One-hot-encoding-vs-Embedding" data-toc-modified-id="One-hot-encoding-vs-Embedding-7">One-hot encoding vs Embedding</a></span></li><li><span><a href="#One-way-to-use-word-embeddings-in-a-ML-Algorithm" data-toc-modified-id="One-way-to-use-word-embeddings-in-a-ML-Algorithm-8">One way to use word embeddings in a ML Algorithm</a></span></li><li><span><a href="#StarSpace-Package:-Embed-All-The-Things!" 
data-toc-modified-id="StarSpace-Package:-Embed-All-The-Things!-9">StarSpace Package: Embed All The Things!</a></span></li><li><span><a href="#How-to-use-StarSpace-embeddings-in-a-ML-Algorithm" data-toc-modified-id="How-to-use-StarSpace-embeddings-in-a-ML-Algorithm-10">How to use StarSpace embeddings in a ML Algorithm</a></span></li><li><span><a href="#Comparing-Encoding-Methods" data-toc-modified-id="Comparing-Encoding-Methods-11">Comparing Encoding Methods</a></span></li><li><span><a href="#Takeaways" data-toc-modified-id="Takeaways-12">Takeaways</a></span></li><li><span><a href="#Bonus-Material" data-toc-modified-id="Bonus-Material-13">Bonus Material</a></span></li><li><span><a href="#Bin-Counting" data-toc-modified-id="Bin-Counting-14">Bin Counting</a></span></li><li><span><a href="#Bin-Counting" data-toc-modified-id="Bin-Counting-15">Bin Counting</a></span></li><li><span><a href="#Check-for-understanding" data-toc-modified-id="Check-for-understanding-16">Check for understanding</a></span></li></ul></div>

<center><h2>Categorical Preprocessing </h2></center>
<br>
<center><img src="../images/0_iKsDex5fUBQoYTju.png" width="100%"/></center>

<center><h2>Learning Outcomes</h2></center>

__By the end of this session, you should be able to__:

- List the common methods of encoding categorical features.
- Compare and contrast the common methods of encoding categorical features.

<center><h2>Categorical Feature Engineering </h2></center>

Categorical variables need to be transformed into numbers that are amenable to machine learning.

Examples of categorical variables: Products, users, words, …

<center><h2>What are the most common ways to encode categorical features?</h2></center>

- One-Hot Encoding / Dummy Encoding
- Embeddings 

The [category_encoders](https://github.com/scikit-learn-contrib/category_encoders) package has many techniques that all work with scikit-learn.

For the sake of time, I'm only going to cover a couple of techniques to show how the scikit-learn API works.

<center><h2>What is the difference between pd.get_dummies and sklearn.preprocessing.OneHotEncoder?</h2></center>

`pd.get_dummies` creates a DataFrame to manage encoded variables.

`sklearn.preprocessing.OneHotEncoder` creates a `OneHotEncoder` object to manage encoded variables.

In [3]:
reset -fs

In [4]:
import numpy as np

X = np.array(['Positive', 'Positive', 'Negative'])

In [5]:
import pandas as pd

pd.get_dummies(pd.Series(X))

| | Negative | Positive |
|:-------:|:------:|:------:|
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 1 | 0 |


In [6]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit(X.reshape(-1, 1))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
enc.get_feature_names_out()

array(['x0_Negative', 'x0_Positive'], dtype=object)

In [7]:
# The result is similar; however, it is a NumPy array ready for machine learning
enc.transform(X.reshape(-1, 1)).toarray()

array([[0., 1.],
       [0., 1.],
       [1., 0.]])

In [8]:
# Combine fit and transform
enc.fit_transform(X.reshape(-1, 1)).toarray()

array([[0., 1.],
       [0., 1.],
       [1., 0.]])
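
One practical consequence of the difference described above is that a fitted `OneHotEncoder` remembers the categories it saw, so it produces the same columns for new data. The sketch below illustrates this with a made-up test set; the `'Neutral'` value and the choice of `handle_unknown='ignore'` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = np.array(['Positive', 'Positive', 'Negative']).reshape(-1, 1)
# Hypothetical new data containing a category never seen during fit
test = np.array(['Negative', 'Neutral']).reshape(-1, 1)

# The encoder stores the categories learned from the training data
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# Columns stay [Negative, Positive]; the unseen 'Neutral' row is all zeros
encoded_test = enc.transform(test).toarray()

# get_dummies has no fitted state, so its columns depend on the data passed in
dummies_test = pd.get_dummies(pd.Series(test.ravel()))
```

Here `encoded_test` keeps the two training columns, while `dummies_test` has the columns `Negative` and `Neutral`, which no longer line up with the training features.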

<center><h2>Embeddings (e.g., word2vec)</h2></center>

<center><img src="../images/0_A9qy9wS7m-eMKiX6.png" width="45%"/></center>

<center>Project sparse high-dimensional categorical data into a dense, low-dimensional numeric vector.</center>



Source: https://instagram-engineering.com/emojineering-part-1-machine-learning-for-emoji-trendsmachine-learning-for-emoji-trends-7f5f9cb979ad

<center><h2>One-hot encoding vs Embedding</h2></center>

The advantage of an embedding is that it converts categorical variables into numeric variables that carry semantic meaning. Thus mathematical operations on those values, such as the distance or similarity between two vectors, have an interpretable meaning.

One-hot encoding just tells the model that two entities are different. Embedding tells how different the entities are.

[source](https://datascience.stackexchange.com/questions/71808/when-can-embeddings-be-useful-for-small-input-spaces/82517#82517)
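
The contrast can be sketched with cosine similarity. The vectors below are made up for illustration (they are not trained embeddings), and `cosine_similarity` is a small helper defined here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every pair of distinct categories is orthogonal
one_hot = {
    'cat': np.array([1.0, 0.0, 0.0]),
    'dog': np.array([0.0, 1.0, 0.0]),
    'car': np.array([0.0, 0.0, 1.0]),
}

# Hypothetical 2-dimensional embeddings (values chosen by hand)
embedding = {
    'cat': np.array([0.9, 0.1]),
    'dog': np.array([0.8, 0.2]),
    'car': np.array([0.1, 0.9]),
}

one_hot_cat_dog = cosine_similarity(one_hot['cat'], one_hot['dog'])
one_hot_cat_car = cosine_similarity(one_hot['cat'], one_hot['car'])
emb_cat_dog = cosine_similarity(embedding['cat'], embedding['dog'])
emb_cat_car = cosine_similarity(embedding['cat'], embedding['car'])
```

With one-hot vectors both similarities are zero, so the model only learns that the entities differ. With the hand-picked embeddings, `cat` is much closer to `dog` than to `car`.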

<center><h2>One way to use word embeddings in a ML Algorithm</h2></center>

1. Train (or download) word embeddings.
1. Create a document embedding by taking the average of the word vectors for that document.
1. Each dimension is a feature.
1. Train the machine learning algorithm.

For example, a tree-based classifier would learn which parts of the embedding space are associated with a specific label.
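
The steps above can be sketched as follows. This is a minimal illustration: the word vectors, documents, and labels are made up, and `embed_document` is a hypothetical helper, not part of any library.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Step 1: hypothetical 3-dimensional word vectors standing in for trained ones
word_vectors = {
    'good':  np.array([0.9, 0.1, 0.0]),
    'great': np.array([0.8, 0.2, 0.1]),
    'bad':   np.array([-0.9, 0.1, 0.0]),
    'awful': np.array([-0.8, 0.0, 0.2]),
    'movie': np.array([0.0, 0.5, 0.5]),
}

def embed_document(tokens, vectors, dim=3):
    """Step 2: average the vectors of the known words in a document."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)  # no known words: fall back to the zero vector
    return np.mean(known, axis=0)

docs = [['good', 'movie'], ['great', 'movie'], ['bad', 'movie'], ['awful', 'movie']]
labels = [1, 1, 0, 0]

# Step 3: each embedding dimension becomes one feature column
X_docs = np.vstack([embed_document(d, word_vectors) for d in docs])

# Step 4: train the machine learning algorithm
clf = DecisionTreeClassifier(random_state=0).fit(X_docs, labels)
```

Averaging discards word order, which is often acceptable for short documents but is a design choice worth revisiting for longer text.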

<center><h2>StarSpace Package: Embed All The Things!</h2></center>

<center><img src="../images/product_embedding.png" width="75%"/></center>

[StarSpace](https://github.com/facebookresearch/StarSpace) can embed any sequential discrete data.

Examples: Words, emojis, documents, users, products, images, videos, …

<center><h2>How to use StarSpace embeddings in a ML Algorithm</h2></center>

1. Train (or download) embeddings where every entity is embedded into the same space.
1. Each dimension is a feature.
1. Train the machine learning algorithm.

Source: 

- https://research.fb.com/downloads/starspace/
- https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16998


<center><h2>Comparing Encoding Methods</h2></center>

| Method | Size Requirements |  Growth |
|:-------:|:------:|:------:|
| One-Hot Encoding | Cardinality of data| Unbounded |
| Feature Hashing | Hash table size | Fixed | 
| Embeddings | Feature space dimensions | Fixed | 
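
The "Fixed" growth of feature hashing can be illustrated with scikit-learn's `FeatureHasher`. This is a sketch: the user IDs and the table size of 8 are made up for illustration.

```python
from sklearn.feature_extraction import FeatureHasher

# The hash table size is chosen up front and does not depend on the data
hasher = FeatureHasher(n_features=8, input_type='string')

# Each sample is a list of string tokens; here, one hypothetical user ID each
samples = [['user_1'], ['user_2'], ['user_999999']]

# No fitting is needed: the hash function maps any string to one of 8 columns
hashed = hasher.transform(samples)
```

The output always has 8 columns, no matter how many distinct IDs appear later. The trade-off is that different categories can collide in the same column.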

<center><h2>Takeaways</h2></center>

- One-hot encoding works well for a small number of categories.
- Feature hashing works well for a medium number of categories.
- Embeddings are the best choice when they are available (i.e., when you can train or download them).


Bonus Material
-----

How do you encode categorical __targets__?

Use scikit-learn's `LabelEncoder`.

In [12]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit_transform(['cat', 'cat', 'dog', 'llama'])

array([0, 0, 1, 2])

Bin Counting
-------

<center><img src="../images/counts.png" width="75%"/></center>

Rather than using the value of the categorical variable as the feature, instead use the __conditional probability of the target under that value__.

Bin Counting
-----

<center><img src="../images/bin_conts.png" width="75%"/></center>

Use probability of click as a feature (not just raw click or not click).

Turns a large, sparse, binary representation of the categorical variable (e.g., one-hot encoding) into a very small, dense, real-valued numeric representation.

In terms of the comparison table above, bin counting needs a bin table of size _m_, and that size is fixed.
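
A minimal sketch of bin counting with pandas is shown below. The `ad_id` and `clicked` columns and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical click log: one row per impression
clicks = pd.DataFrame({
    'ad_id':   ['a', 'a', 'a', 'b', 'b', 'c'],
    'clicked': [1, 0, 1, 0, 0, 1],
})

# Conditional probability of a click given each ad_id (the "bin" statistics)
click_rate = clicks.groupby('ad_id')['clicked'].mean()

# Replace the categorical value with its click probability as a dense feature
clicks['ad_click_rate'] = clicks['ad_id'].map(click_rate)
```

In practice the statistics are usually computed on data separate from the rows being encoded, since computing them on the same rows leaks the target into the feature.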

Check for understanding
-----

What is the difference between Bin Counting and Naive Bayes?

Naive Bayes always multiplies the conditional probabilities.

Bin counting treats them as features, which can be used in other models such as trees.

Learn more:


- https://towardsdatascience.com/beyond-one-hot-17-ways-of-transforming-categorical-features-into-numeric-features-57f54f199ea4
- https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63
- https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
- https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/
- https://blogs.technet.microsoft.com/machinelearning/2015/02/17/big-learning-made-easy-with-counts/
- https://www.slideshare.net/SessionsEvents/misha-bilenko-principal-researcher-microsoft

<br>