# An Exploration of One-Hot Encoding Techniques

This notebook compares and discusses the following methods:

* Pandas - get_dummies()
* Scikit-Learn - OneHotEncoder()
* Keras - to_categorical()

**Note:** Please ensure that you have Pandas 1.5.0 or later; otherwise some of the methods in this notebook will not run.

## Datasource

The data used in this notebook has been taken from Kaggle:

Aman Chauhan, [Alcohol Effects On Study](https://medium.com/r/?url=https%3A%2F%2Fwww.kaggle.com%2Fdatasets%2Fwhenamancodes%2Falcohol-effects-on-study) (2022), Kaggle

The data taken from Kaggle originates from the following two places:

Paulo Cortez,  [Student Performance Data Set](https://medium.com/r/?url=https%3A%2F%2Farchive.ics.uci.edu%2Fml%2Fdatasets%2Fstudent%2Bperformance) (2014), UCI Machine Learning Repository

P. Cortez and A. Silva, [Using Data Mining to Predict Secondary School Student Performance](https://medium.com/r/?url=http%3A%2F%2Fwww3.dsi.uminho.pt%2Fpcortez%2Fstudent.pdf) (2008), In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology (FUBUTEC) Conference pp. 5–12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978–9077381–39–7

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.utils import to_categorical


# Import and select data

The dataset includes many columns. For this experiment I have chosen a subset that covers a range of numerical and categorical types, as follows:

* sex - binary string ('M' for male and 'F' for female)
* age - standard numerical column (int)
* Medu - Mother's education - Multiclass integer representation (0 - none, 1 - primary education, 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
* Mjob - Mother's job - Multiclass string representation - ('teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
* Dalc - Workday alcohol consumption - Multiclass graduated integer representation (from 1 - very low to 5 - very high)
* Walc - Weekend alcohol consumption - Multiclass graduated integer representation (from 1 - very low to 5 - very high)
* G3 - Final grade (the label) - Multiclass graduated integer representation (numeric: from 0 to 20) 

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/thetestspecimen/notebooks/main/datasets/alcohol_effects_on_study/alcohol-effects-study-maths.csv',
                   usecols=['sex', 'age', 'Medu', 'Mjob','Dalc','Walc','G3'])

In [3]:
data.head()

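| | sex | age | Medu | Mjob | Dalc | Walc | G3 |
|---|---|---|---|---|---|---|---|
| 0 | F | 18 | 4 | at_home | 1 | 1 | 6 |
| 1 | F | 17 | 1 | at_home | 1 | 1 | 6 |
| 2 | F | 15 | 1 | at_home | 2 | 3 | 10 |
| 3 | F | 15 | 4 | health | 1 | 1 | 15 |
| 4 | F | 16 | 3 | other | 1 | 2 | 10 |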


In [4]:
data.describe()

| | age | Medu | Dalc | Walc | G3 |
|---|---|---|---|---|---|
| count | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 |
| mean | 16.696203 | 2.749367 | 1.481013 | 2.291139 | 10.41519 |
| std | 1.276043 | 1.094735 | 0.890741 | 1.287897 | 4.581443 |
| min | 15.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 16.0 | 2.0 | 1.0 | 1.0 | 8.0 |
| 50% | 17.0 | 3.0 | 1.0 | 2.0 | 11.0 |
| 75% | 18.0 | 4.0 | 2.0 | 3.0 | 14.0 |
| max | 22.0 | 4.0 | 5.0 | 5.0 | 20.0 |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   sex     395 non-null    object
 1   age     395 non-null    int64 
 2   Medu    395 non-null    int64 
 3   Mjob    395 non-null    object
 4   Dalc    395 non-null    int64 
 5   Walc    395 non-null    int64 
 6   G3      395 non-null    int64 
dtypes: int64(5), object(2)
memory usage: 21.7+ KB


# Pandas get_dummies()

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

The method:

```python
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
```

You can simply pass a dataframe to get_dummies() and it will work out which columns to one hot encode. However, this is not the best approach, as you will see:

In [6]:
not_recommended = pd.get_dummies(data)
not_recommended.head()

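| | age | Medu | Dalc | Walc | G3 | sex_F | sex_M | Mjob_at_home | Mjob_health | Mjob_other | Mjob_services | Mjob_teacher |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 4 | 1 | 1 | 6 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 17 | 1 | 1 | 1 | 6 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 15 | 1 | 2 | 3 | 10 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 15 | 4 | 1 | 1 | 15 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 16 | 3 | 1 | 2 | 10 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |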


If you review the output above you will see that only columns with type 'object' have been one hot encoded (sex and Mjob). Any integer datatype columns have been ignored, which in our case is not ideal because Medu is a categorical feature stored as integers.

However, you can specify the columns you wish to encode as follows:

In [7]:
pd_data_dum = pd.get_dummies(data, columns=['sex','Medu','Mjob'])
pd_data_dum.head()

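| | age | Dalc | Walc | G3 | sex_F | sex_M | Medu_0 | Medu_1 | Medu_2 | Medu_3 | Medu_4 | Mjob_at_home | Mjob_health | Mjob_other | Mjob_services | Mjob_teacher |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 1 | 1 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 17 | 1 | 1 | 6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 15 | 2 | 3 | 10 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 15 | 1 | 1 | 15 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 16 | 1 | 2 | 10 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |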
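
The difference between the default behaviour and the `columns` argument can be seen on a small frame. The following is a minimal sketch using a hypothetical three-row `toy` dataframe (not the Kaggle data), comparing which columns each call produces:

```python
import pandas as pd

# Hypothetical toy data: one string column, one integer-coded category, one true numeric.
toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "Medu": [4, 1, 3],
    "age": [18, 17, 15],
})

# Default: only the string column 'sex' is expanded; 'Medu' and 'age' pass through.
auto_cols = pd.get_dummies(toy).columns.tolist()

# Explicit: the integer-coded 'Medu' column is expanded too; 'age' passes through.
explicit_cols = pd.get_dummies(toy, columns=["sex", "Medu"]).columns.tolist()

print(auto_cols)
print(explicit_cols)
```

Listing the columns explicitly is what lets integer-coded categories such as Medu be encoded while true numeric columns such as age are left alone.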


There may be specific circumstances where it is advisable, or useful, to drop the first column of each one hot encoded series (for example to avoid multicollinearity). get_dummies() has this ability built in:

In [8]:
pd_data_dum_drop = pd.get_dummies(data, columns=['sex','Medu','Mjob'], drop_first=True)
pd_data_dum_drop.head()

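| | age | Dalc | Walc | G3 | sex_M | Medu_1 | Medu_2 | Medu_3 | Medu_4 | Mjob_health | Mjob_other | Mjob_services | Mjob_teacher |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 1 | 1 | 6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 17 | 1 | 1 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 15 | 2 | 3 | 10 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 15 | 1 | 1 | 15 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | 16 | 1 | 2 | 10 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |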


## Reversing get_dummies() with from_dummies()

Until very recently the Pandas library had no method for reversing the encoding.

However, as of Pandas 1.5.0 there is a new method called from_dummies():

https://pandas.pydata.org/docs/dev/reference/api/pandas.from_dummies.html

```python
pandas.from_dummies(data, sep=None, default_category=None)
```
This allows the reversal to be achieved without writing your own method. It can even handle the reversal of a one hot encoding that utilised 'drop_first' with the use of the 'default_category' parameter as you will see below.



In [9]:
restored_dum = pd.from_dummies(pd_data_dum.iloc[:,4:], sep='_')
restored_dum.head()

| | sex | Medu | Mjob |
|---|---|---|---|
| 0 | F | 4 | at_home |
| 1 | F | 1 | at_home |
| 2 | F | 1 | at_home |
| 3 | F | 4 | health |
| 4 | F | 3 | other |


You can also do this when you have used 'drop_first' on encoding, but you must specify the dropped category for each feature. 'drop_first' removes the first category in sorted order, which here is 'F', '0' and 'at_home':

In [10]:
restored_dum_drop = pd.from_dummies(pd_data_dum_drop.iloc[:,4:], sep='_', default_category={'sex': 'F','Medu': '0','Mjob': 'at_home'})
restored_dum_drop.head()

| | sex | Medu | Mjob |
|---|---|---|---|
| 0 | F | 4 | at_home |
| 1 | F | 1 | at_home |
| 2 | F | 1 | at_home |
| 3 | F | 4 | health |
| 4 | F | 3 | other |
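
A quick way to check that the right default category was chosen is a full round trip. The sketch below uses a hypothetical `toy` frame and assumes Pandas 1.5.0 or later; it encodes with `drop_first=True`, restores with `from_dummies()`, and compares the result with the original column:

```python
import pandas as pd

# Hypothetical toy data; 'health' sorts first, so drop_first removes it.
toy = pd.DataFrame({"Mjob": ["teacher", "health", "other", "health"]})

encoded = pd.get_dummies(toy, columns=["Mjob"], drop_first=True)

# Rows that are all zeros are filled with the default category supplied here.
restored = pd.from_dummies(encoded, sep="_", default_category={"Mjob": "health"})

print(restored["Mjob"].tolist() == toy["Mjob"].tolist())
```

If the wrong default is supplied, the round trip still runs without an error but silently restores the wrong value for every all-zero row, so a check like this is worth keeping.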


# sklearn OneHotEncoder

The OneHotEncoder method from sklearn is probably the most comprehensive of all the available methods for one hot encoding.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

```python
sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error', min_frequency=None, max_categories=None)
```

As the parameters above show, the encoder can:

* automatically pick out the categories of each feature
* drop categories (not just the first, there are more options available)
* produce sparse matrices
* handle categories that appear in future datasets (handle_unknown)
* limit the number of categories returned from the encoding, based on frequency or a maximum

The encoder also follows the fit-transform methodology, which makes it easy to use in input pipelines for machine learning and deep learning.
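
As an illustration of the fit-transform pattern in a pipeline, here is a minimal sketch that wraps the encoder in scikit-learn's `ColumnTransformer`. The `train` and `new_rows` frames are hypothetical, and `sparse_threshold=0` is used only to force a dense array for easy inspection:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: two categorical columns and one numeric column.
train = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "Mjob": ["teacher", "health", "other", "health"],
    "age": [18, 17, 15, 16],
})

preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex", "Mjob"])],
    remainder="passthrough",  # leave 'age' untouched
    sparse_threshold=0,       # always return a dense array
)

# fit learns the categories; transform reuses them on later data.
train_encoded = preprocess.fit_transform(train)

# 'services' was never seen during fit, so its Mjob columns are all zeros.
new_rows = pd.DataFrame({"sex": ["F"], "Mjob": ["services"], "age": [19]})
new_encoded = preprocess.transform(new_rows)

print(train_encoded.shape)
print(new_encoded)
```

Because the categories are learned once in `fit`, later data is always encoded into the same column layout, which is what a downstream model needs.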

In [11]:
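# note: scikit-learn 1.2 renamed this parameter to sparse_output (sparse was removed in later releases)
skencoder = OneHotEncoder(handle_unknown='ignore',sparse=False)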

Let's try the simplest implementation of the method, and just pass the whole dataframe:

In [12]:
sk_data = skencoder.fit_transform(data)
sk_data, sk_data.shape

(array([[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 1., ..., 0., 0., 0.],
        ...,
        [0., 1., 0., ..., 0., 0., 0.],
        [0., 1., 0., ..., 0., 0., 0.],
        [0., 1., 0., ..., 0., 0., 0.]]),
 (395, 48))

As the feature names below show, OneHotEncoder treats every column it is given as categorical and returns the encoded columns as a new array. Here that includes numeric columns such as age and G3, which is why the result has 48 columns.

This is different to get_dummies(), which returns a dataframe that still contains the untouched columns. If you want to keep all your data contained within a dataframe with minimal effort then this is something worth considering.

It should also be noted that OneHotEncoder on 'auto' encodes more input columns than get_dummies() does by default.

Regardless, it is still good practice to specify the columns you wish to target.

In [13]:
skencoder.get_feature_names_out()

array(['sex_F', 'sex_M', 'age_15', 'age_16', 'age_17', 'age_18', 'age_19',
       'age_20', 'age_21', 'age_22', 'Medu_0', 'Medu_1', 'Medu_2',
       'Medu_3', 'Medu_4', 'Mjob_at_home', 'Mjob_health', 'Mjob_other',
       'Mjob_services', 'Mjob_teacher', 'Dalc_1', 'Dalc_2', 'Dalc_3',
       'Dalc_4', 'Dalc_5', 'Walc_1', 'Walc_2', 'Walc_3', 'Walc_4',
       'Walc_5', 'G3_0', 'G3_4', 'G3_5', 'G3_6', 'G3_7', 'G3_8', 'G3_9',
       'G3_10', 'G3_11', 'G3_12', 'G3_13', 'G3_14', 'G3_15', 'G3_16',
       'G3_17', 'G3_18', 'G3_19', 'G3_20'], dtype=object)

For consistency I will encode the same columns as we looked at previously with get_dummies():

In [14]:
sk_data = skencoder.fit_transform(data.loc[:,['sex','Medu','Mjob']])
sk_data

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 1., 0., 0.]])

In [15]:
skencoder.get_feature_names_out()

array(['sex_F', 'sex_M', 'Medu_0', 'Medu_1', 'Medu_2', 'Medu_3', 'Medu_4',
       'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services',
       'Mjob_teacher'], dtype=object)

In [16]:
skencoder.get_params()

{'categories': 'auto',
 'drop': None,
 'dtype': numpy.float64,
 'handle_unknown': 'ignore',
 'max_categories': None,
 'min_frequency': None,
 'sparse': False}

## Reversing OneHotEncoder

There is a very simple method for reversing the encoding. Because the encoder is saved as its own object (in this case 'skencoder'), all the original parameters used to do the one hot encoding are saved within this object. This makes reversal very easy.

In [17]:
skencoder.inverse_transform(sk_data)

array([['F', 4, 'at_home'],
       ['F', 1, 'at_home'],
       ['F', 1, 'at_home'],
       ...,
       ['M', 1, 'other'],
       ['M', 3, 'services'],
       ['M', 1, 'other']], dtype=object)

## Useful information

Another advantage of using OneHotEncoder is that there is a wealth of attributes and helper methods that give you access to the information used in the encoding. I have provided some examples below:

### Attributes

In [18]:
# output the category names stored by the encoder

skencoder.categories_

[array(['F', 'M'], dtype=object),
 array([0, 1, 2, 3, 4]),
 array(['at_home', 'health', 'other', 'services', 'teacher'], dtype=object)]

In [19]:
# the number of columns / features originally encoded

skencoder.n_features_in_

3

In [20]:
# the names of the features / columns that were fed into the encoder. 
# Especially useful if you went with the auto method of encoding

skencoder.feature_names_in_

array(['sex', 'Medu', 'Mjob'], dtype=object)

### Methods

In [21]:
# This is a little bit like the "categories_" attribute, but it gives the encoded features rather
# than the features before encoding

skencoder.get_feature_names_out()

array(['sex_F', 'sex_M', 'Medu_0', 'Medu_1', 'Medu_2', 'Medu_3', 'Medu_4',
       'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services',
       'Mjob_teacher'], dtype=object)

In [22]:
# you can list the parameters that were originally set for the encoder

skencoder.get_params()

{'categories': 'auto',
 'drop': None,
 'dtype': numpy.float64,
 'handle_unknown': 'ignore',
 'max_categories': None,
 'min_frequency': None,
 'sparse': False}

## Advanced features

As mentioned earlier OneHotEncoder has quite a lot of useful features making it a very flexible method to use.

I will touch on some of these methods below.

### Min frequency

This can be used to limit the encoded categories. If you have a feature that is dominated by a few significant categories, but has a lot of smaller categories, then you can effectively group the smaller categories into a single 'other' category.

In [23]:
# As an example if we take the 'Medu' feature, which has 5 categories (0 to 4) we can limit this
# output to 4 categories by setting a frequency limit of 60 which will group categories 0 and 1 
# as they both have less than 60 entries

skencoder.set_params(min_frequency = 60)
sk_data = skencoder.fit_transform(data.loc[:,['Medu']])
sk_data

array([[0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       ...,
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.]])

In [24]:
# see that the parameter was set in the encoder

skencoder.get_params()

{'categories': 'auto',
 'drop': None,
 'dtype': numpy.float64,
 'handle_unknown': 'ignore',
 'max_categories': None,
 'min_frequency': 60,
 'sparse': False}

In [25]:
# check OneHotEncoder assigns the combined categories

skencoder.get_feature_names_out()

array(['Medu_2', 'Medu_3', 'Medu_4', 'Medu_infrequent_sklearn'],
      dtype=object)

In [26]:
# see the names of the categories that were grouped as infrequent

skencoder.infrequent_categories_

[array([0, 1])]

You may find that you don't want to specify an exact number of records for the infrequent categories. In this case you can specify the minimum as a fraction of the total number of records.

In our case there are 395 records, so to achieve the same outcome as specifying exactly 60 records as the limit, we could specify 60 / 395 = 0.152, or for simplicity 0.16. This means that a category needs at least 16% of the total count to be treated as significant.

In [27]:
skencoder.set_params(min_frequency = 0.16)
sk_data = skencoder.fit_transform(data.loc[:,['Medu']])
sk_data

array([[0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       ...,
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.]])

In [28]:
# see that the parameter was set in the encoder

skencoder.get_params()

{'categories': 'auto',
 'drop': None,
 'dtype': numpy.float64,
 'handle_unknown': 'ignore',
 'max_categories': None,
 'min_frequency': 0.16,
 'sparse': False}

In [29]:
# check OneHotEncoder assigns the combined categories

skencoder.get_feature_names_out()

array(['Medu_2', 'Medu_3', 'Medu_4', 'Medu_infrequent_sklearn'],
      dtype=object)

In [30]:
# see the names of the categories that were grouped as infrequent

skencoder.infrequent_categories_

[array([0, 1])]

### Max categories

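Another way to approach the problem is to specify a maximum number of categories. The grouped 'infrequent' category counts towards this maximum, so max_categories = 3 keeps the two most frequent categories and groups the rest.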



In [31]:
skencoder.set_params(min_frequency = None, max_categories = 3)
sk_data = skencoder.fit_transform(data.loc[:,['Medu']])
sk_data[:5,:]

array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [32]:
# see that the parameter was set in the encoder

skencoder.get_params()

{'categories': 'auto',
 'drop': None,
 'dtype': numpy.float64,
 'handle_unknown': 'ignore',
 'max_categories': 3,
 'min_frequency': None,
 'sparse': False}

In [33]:
# check OneHotEncoder assigns the combined categories

skencoder.get_feature_names_out()

array(['Medu_2', 'Medu_4', 'Medu_infrequent_sklearn'], dtype=object)

In [34]:
# see the names of the categories that were grouped as infrequent

skencoder.infrequent_categories_

[array([0, 1, 3])]

### Handle Unknown

Handle unknown is an extremely useful feature, especially when used in a pipeline for machine learning or neural network models.

It essentially allows you to plan for a case in the future where another category may appear without breaking your input pipeline.

For example you may have a feature such as Medu, and in the future for some reason you add a 'PhD' category above the final category of 'higher education'. In theory this additional category would break your input pipeline, as the number of categories has changed.

Handle unknown allows us to avoid this.

Although I'm not going to give a concrete example of this, it is very easy to understand, especially if you have read the previous two sections on max_categories and min_frequency.

Setting options:

* **'error'** : this will raise an error if an additional category appears; this is the default behaviour
* **'ignore'** : this will cause any extra categories to be encoded with all zeros, so if there were originally 3 categories [1,0,0], [0,1,0] and [0,0,1] then the unknown category will be encoded as [0,0,0]. When inverted this will have the value 'None'.
* **'infrequent_if_exist'** : if you have implemented 'max_categories' or 'min_frequency' in your encoder then the additional category will be mapped to 'xxx_infrequent_sklearn' along with any infrequent categories. Otherwise it will be treated exactly the same as 'ignore'.

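**Important note:** take care when combining handle_unknown='ignore' with the drop parameter (e.g. drop='first'). Both encode a category as all zeros, so an unknown category becomes indistinguishable from the dropped one, and inverse_transform() returns the dropped category for both. Older scikit-learn releases rejected this combination with an error for the same reason.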
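
The 'error' and 'ignore' options can be compared in a few lines. This is a minimal sketch in which the `train` and `unseen` frames and their string category names are hypothetical (the real Medu column uses integers):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categories; 'PhD' only appears after the encoder has been fitted.
train = pd.DataFrame({"Medu": ["primary", "secondary", "higher"]})
unseen = pd.DataFrame({"Medu": ["PhD"]})

# 'ignore': the unknown category becomes a row of zeros and inverts to None.
ignoring = OneHotEncoder(handle_unknown="ignore").fit(train)
unseen_row = ignoring.transform(unseen).toarray()
restored = ignoring.inverse_transform(unseen_row)

# 'error' (the default): the same input raises a ValueError instead.
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(unseen)
    raised = False
except ValueError:
    raised = True

print(unseen_row, restored, raised)
```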

### Drop

As with Pandas get_dummies() you have the option to drop categories, although the options are a little more extensive.

Here are the options:

* **None** : retain all features (the default).
* **‘first’** : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
* **‘if_binary’** : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.
* **array** : drop[i] is the category in feature X[:, i] that should be dropped.

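**Important note:** if handle_unknown='ignore' is combined with any of these drop options, unknown categories and the dropped category are both encoded as all zeros and can no longer be told apart. The encoder used below was created with handle_unknown='ignore', so keep this in mind when reading its output.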

#### 'first'

In [None]:
skencoder.set_params(min_frequency = None, max_categories = None, drop= 'first')
skencoder.get_params()

{'categories': 'auto',
 'drop': 'first',
 'dtype': numpy.float64,
 'handle_unknown': 'ignore',
 'max_categories': None,
 'min_frequency': None,
 'sparse': False}

In [None]:
sk_data = skencoder.fit_transform(data.loc[:,['sex','Medu','Mjob']])
sk_data

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [1., 1., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 1., 0.],
       [1., 1., 0., ..., 1., 0., 0.]])

In [None]:
# note that the first category has been dropped from each feature
# 'sex_F', 'Medu_0' and 'Mjob_at_home' are all missing

skencoder.get_feature_names_out()

array(['sex_M', 'Medu_1', 'Medu_2', 'Medu_3', 'Medu_4', 'Mjob_health',
       'Mjob_other', 'Mjob_services', 'Mjob_teacher'], dtype=object)

#### 'if_binary'

In [None]:
skencoder.set_params(drop= 'if_binary')
skencoder.get_params()

{'categories': 'auto',
 'drop': 'if_binary',
 'dtype': numpy.float64,
 'handle_unknown': 'ignore',
 'max_categories': None,
 'min_frequency': None,
 'sparse': False}

In [None]:
sk_data = skencoder.fit_transform(data.loc[:,['sex','Medu','Mjob']])
sk_data

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [1., 0., 1., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 1., 0.],
       [1., 0., 1., ..., 1., 0., 0.]])

In [None]:
# note that the first category has only been dropped from the sex feature, which was a binary category

skencoder.get_feature_names_out()

array(['sex_M', 'Medu_0', 'Medu_1', 'Medu_2', 'Medu_3', 'Medu_4',
       'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services',
       'Mjob_teacher'], dtype=object)

#### array

In [None]:
skencoder.set_params(drop= ['M', 3, 'other'])
skencoder.get_params()

{'categories': 'auto',
 'drop': ['M', 3, 'other'],
 'dtype': numpy.float64,
 'handle_unknown': 'ignore',
 'max_categories': None,
 'min_frequency': None,
 'sparse': False}

In [None]:
sk_data = skencoder.fit_transform(data.loc[:,['sex','Medu','Mjob']])
sk_data

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 1., ..., 0., 0., 0.],
       [1., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

In [None]:
# note that the selected categories have been dropped for each feature

skencoder.get_feature_names_out()

array(['sex_F', 'Medu_0', 'Medu_1', 'Medu_2', 'Medu_4', 'Mjob_at_home',
       'Mjob_health', 'Mjob_services', 'Mjob_teacher'], dtype=object)

# Keras to_categorical

The Keras method is very simple. It can be used for any feature, just like the other methods, but it only handles integer class values. Therefore, if you have string categories you will have to convert them to integers first, something that the other methods take care of automatically.

https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical

Keras to_categorical() is probably most useful for one hot encoding the labels, which is what the example below does.

The method:

```python
tf.keras.utils.to_categorical(y, num_classes=None, dtype='float32')
```
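
For string categories, one way to get the integer codes that to_categorical() expects is `np.unique` with `return_inverse=True`. In the sketch below the `labels` array is hypothetical, and `np.eye(...)[codes]` is a NumPy stand-in for the encoding step so that the example runs without TensorFlow; in practice `codes` is what you would pass to to_categorical():

```python
import numpy as np

# Hypothetical string labels.
labels = np.array(["other", "teacher", "health", "other"])

# classes holds the sorted unique strings; codes holds each label's integer index.
classes, codes = np.unique(labels, return_inverse=True)

# NumPy stand-in for to_categorical(codes): pick rows of an identity matrix.
one_hot = np.eye(len(classes), dtype="uint8")[codes]

# Reverse: argmax gives the code, and classes maps it back to the string.
restored = classes[np.argmax(one_hot, axis=1)]

print(codes)
print(one_hot)
print(restored)
```

Keeping `classes` around is the important part, as it is the only record of which integer stands for which string.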

In [None]:
data['G3'].to_numpy()

array([ 6,  6, 10, 15, 10, 15, 11,  6, 19, 15,  9, 12, 14, 11, 16, 14, 14,
       10,  5, 10, 15, 15, 16, 12,  8,  8, 11, 15, 11, 11, 12, 17, 16, 12,
       15,  6, 18, 15, 11, 13, 11, 12, 18, 11,  9,  6, 11, 20, 14,  7, 13,
       13, 10, 11, 13, 10, 15, 15,  9, 16, 11, 11,  9,  9, 10, 15, 12,  6,
        8, 16, 15, 10,  5, 14, 11, 10, 10, 11, 10,  5, 12, 11,  6, 15, 10,
        8,  6, 14, 10,  7,  8, 18,  6, 10, 14, 10, 15, 10, 14,  8,  5, 17,
       14,  6, 18, 11,  8, 18, 13, 16, 19, 10, 13, 19,  9, 16, 14, 13,  8,
       13, 15, 15, 13, 13,  8, 12, 11,  9,  0, 18,  0,  0, 12, 11,  0,  0,
        0,  0, 12, 15,  0,  9, 11, 13,  0, 11,  0, 11,  0, 10,  0, 14, 10,
        0, 12,  8, 13, 10, 15, 12,  0,  7,  0, 10,  7, 12, 10, 16,  0, 14,
        0, 16, 10,  0,  9,  9, 11,  6,  9, 11,  8, 12, 17,  8, 12, 11, 11,
       15,  9, 10, 13,  9,  8, 10, 14, 15, 16, 10, 18, 10, 16, 10, 10,  6,
       11,  9,  7, 13, 10,  7,  8, 13, 14,  8, 10, 15,  4,  8,  8, 10,  6,
        0, 17, 13, 14,  7

In [None]:
keras_cat = to_categorical(data['G3'], dtype='uint8')
keras_cat

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

In [None]:
keras_cat.shape

(395, 21)

To reverse the one hot encoding you can use argmax. This also works on the output of a softmax layer.

In [None]:
keras_cat[5]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
      dtype=uint8)

In [None]:
np.argmax(keras_cat[5])

15

In [None]:
np.argmax(keras_cat, axis=1)

array([ 6,  6, 10, 15, 10, 15, 11,  6, 19, 15,  9, 12, 14, 11, 16, 14, 14,
       10,  5, 10, 15, 15, 16, 12,  8,  8, 11, 15, 11, 11, 12, 17, 16, 12,
       15,  6, 18, 15, 11, 13, 11, 12, 18, 11,  9,  6, 11, 20, 14,  7, 13,
       13, 10, 11, 13, 10, 15, 15,  9, 16, 11, 11,  9,  9, 10, 15, 12,  6,
        8, 16, 15, 10,  5, 14, 11, 10, 10, 11, 10,  5, 12, 11,  6, 15, 10,
        8,  6, 14, 10,  7,  8, 18,  6, 10, 14, 10, 15, 10, 14,  8,  5, 17,
       14,  6, 18, 11,  8, 18, 13, 16, 19, 10, 13, 19,  9, 16, 14, 13,  8,
       13, 15, 15, 13, 13,  8, 12, 11,  9,  0, 18,  0,  0, 12, 11,  0,  0,
        0,  0, 12, 15,  0,  9, 11, 13,  0, 11,  0, 11,  0, 10,  0, 14, 10,
        0, 12,  8, 13, 10, 15, 12,  0,  7,  0, 10,  7, 12, 10, 16,  0, 14,
        0, 16, 10,  0,  9,  9, 11,  6,  9, 11,  8, 12, 17,  8, 12, 11, 11,
       15,  9, 10, 13,  9,  8, 10, 14, 15, 16, 10, 18, 10, 16, 10, 10,  6,
       11,  9,  7, 13, 10,  7,  8, 13, 14,  8, 10, 15,  4,  8,  8, 10,  6,
        0, 17, 13, 14,  7
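
The same argmax call works on softmax output because argmax only needs the position of the largest value in each row, not an exact 1. A small NumPy sketch with hypothetical `logits` standing in for a model's final layer:

```python
import numpy as np

# Hypothetical raw scores for two samples and three classes.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.1, 3.0]])

# Softmax, with the row maximum subtracted first for numerical stability.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# The predicted class is the index of the largest probability in each row.
predicted = np.argmax(probs, axis=1)

print(probs.round(3))
print(predicted)
```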

## Specify categories

One useful feature is the ability to specify how many unique categories there are. By default the number of categories is the highest number in the array + 1. The +1 accounts for zero.

It is worth noting that this is the minimum value you can specify. However, there may be cases where the data passed does not contain all the categories and you still wish to convert it (like a small set of test labels), in which case you should specify the number of classes. 

Even though the class labels must be whole numbers, the encoded output can use a float datatype (the default is 'float32'), as shown below.

In [None]:
np.unique(data['G3'])

array([ 0,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
       20])

In [None]:
np.unique(data['G3']).size

18

In [None]:
keras_cat_classes = to_categorical(data['G3'], num_classes=30, dtype='float32')
keras_cat_classes

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [None]:
keras_cat_classes.shape

(395, 30)

In [None]:
np.argmax(keras_cat_classes, axis=1)

array([ 6,  6, 10, 15, 10, 15, 11,  6, 19, 15,  9, 12, 14, 11, 16, 14, 14,
       10,  5, 10, 15, 15, 16, 12,  8,  8, 11, 15, 11, 11, 12, 17, 16, 12,
       15,  6, 18, 15, 11, 13, 11, 12, 18, 11,  9,  6, 11, 20, 14,  7, 13,
       13, 10, 11, 13, 10, 15, 15,  9, 16, 11, 11,  9,  9, 10, 15, 12,  6,
        8, 16, 15, 10,  5, 14, 11, 10, 10, 11, 10,  5, 12, 11,  6, 15, 10,
        8,  6, 14, 10,  7,  8, 18,  6, 10, 14, 10, 15, 10, 14,  8,  5, 17,
       14,  6, 18, 11,  8, 18, 13, 16, 19, 10, 13, 19,  9, 16, 14, 13,  8,
       13, 15, 15, 13, 13,  8, 12, 11,  9,  0, 18,  0,  0, 12, 11,  0,  0,
        0,  0, 12, 15,  0,  9, 11, 13,  0, 11,  0, 11,  0, 10,  0, 14, 10,
        0, 12,  8, 13, 10, 15, 12,  0,  7,  0, 10,  7, 12, 10, 16,  0, 14,
        0, 16, 10,  0,  9,  9, 11,  6,  9, 11,  8, 12, 17,  8, 12, 11, 11,
       15,  9, 10, 13,  9,  8, 10, 14, 15, 16, 10, 18, 10, 16, 10, 10,  6,
       11,  9,  7, 13, 10,  7,  8, 13, 14,  8, 10, 15,  4,  8,  8, 10,  6,
        0, 17, 13, 14,  7
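
The reason for fixing the number of classes becomes clear when a small batch of labels does not contain the highest class. The sketch below uses a hypothetical NumPy helper, `one_hot`, that mirrors the default described above (highest label + 1); it is a stand-in for to_categorical(), not the Keras implementation:

```python
import numpy as np

def one_hot(y, num_classes=None):
    """Hypothetical stand-in for to_categorical(): defaults to max label + 1 columns."""
    y = np.asarray(y, dtype=int)
    if num_classes is None:
        num_classes = int(y.max()) + 1
    return np.eye(num_classes, dtype="float32")[y]

# Hypothetical labels: the training set reaches grade 20, the test batch only 10.
train_labels = [0, 6, 10, 20]
test_labels = [6, 10]

train_shape = one_hot(train_labels).shape                 # inferred width: 20 + 1
test_shape = one_hot(test_labels).shape                   # inferred width: 10 + 1
fixed_shape = one_hot(test_labels, num_classes=21).shape  # forced to match training

print(train_shape, test_shape, fixed_shape)
```

Without a fixed number of classes the two encodings would have different widths, so the test labels could not be compared with the output of a model trained on the full label range.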
