https://en.wikipedia.org/wiki/One-hot

https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f

In [7]:
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
from sklearn.preprocessing import LabelBinarizer

In [8]:
categorical_labels = ['red', 'blue', 'green', 'red', 'blue']

In [9]:
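# A set has no guaranteed order, so this list may print differently on each run.
# That is harmless here: LabelBinarizer sorts the classes itself when fitting.
unique_categories = list(set(categorical_labels))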
unique_categories

['green', 'red', 'blue']

In [10]:
binarizer = LabelBinarizer()

# fit() learns the classes and sorts them, so blue will be [1, 0, 0],
# green will be [0, 1, 0], and red will be [0, 0, 1].
binarizer.fit(unique_categories)

# transform() replaces each string with the one-hot row learned for it.
binarizer.transform(categorical_labels)

array([[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])

In [11]:
# classes_ lists the labels in column order: column i of the output above
# is 1 exactly when the label equals classes_[i].
binarizer.classes_

array(['blue', 'green', 'red'],
      dtype='<U5')
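
As a minimal sketch of how `classes_` ties columns back to labels, the cell below builds a label-to-column lookup and decodes the one-hot rows with `inverse_transform`. The names `column_of`, `encoded`, and `decoded` are illustrative and not part of the original notebook.

```python
from sklearn.preprocessing import LabelBinarizer

categorical_labels = ['red', 'blue', 'green', 'red', 'blue']

binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(categorical_labels)

# Map each label to the output column that represents it.
column_of = {str(label): i for i, label in enumerate(binarizer.classes_)}
print(column_of)

# inverse_transform turns the one-hot rows back into the original labels.
decoded = binarizer.inverse_transform(encoded).tolist()
print(decoded)
```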

Now we need to merge these values back into our array of features.

Note that the positions of the various features need to be the same for each data sample.  That is, color can't be the first three elements of one sample and the last three of the next.
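
One way to do that merge by hand is to stack the one-hot columns next to the remaining features with `np.hstack`. This is only a sketch: the `numeric_features` array is made-up data, and putting the color columns last is an arbitrary choice. What matters is that every row uses the same layout.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

categorical_labels = ['red', 'blue', 'green', 'red', 'blue']

# Hypothetical numeric features, one row per sample (not from the original data).
numeric_features = np.array([[1.0, 10.0],
                             [2.0, 20.0],
                             [3.0, 30.0],
                             [4.0, 40.0],
                             [5.0, 50.0]])

one_hot_colors = LabelBinarizer().fit_transform(categorical_labels)

# Every row gets the same layout: two numeric columns, then blue/green/red.
features = np.hstack([numeric_features, one_hot_colors])
print(features.shape)
print(features[0])
```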

Fortunately, this is common enough that sklearn has a built-in mechanism for one-hot encoding some fields of an existing feature matrix while leaving the others alone: the `OneHotEncoder`.
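
In recent scikit-learn releases, encoding only some columns is usually done by wrapping `OneHotEncoder` in a `ColumnTransformer`. The sketch below assumes such a release; the two-column matrix `X` and the transformer name `'color'` are invented for illustration. `sparse_threshold=0` asks for a dense result, and `remainder='passthrough'` keeps the untouched column.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature matrix: column 0 is categorical, column 1 is numeric.
X = np.array([['red', 1.5],
              ['blue', 2.0],
              ['green', 0.5]], dtype=object)

transformer = ColumnTransformer(
    [('color', OneHotEncoder(), [0])],  # one-hot encode column 0 only
    remainder='passthrough',            # keep column 1 unchanged
    sparse_threshold=0,                 # always return a dense array
)

# Encoded columns come first (blue, green, red), then the passthrough column.
encoded = transformer.fit_transform(X).astype(float)
print(encoded)
```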

In [13]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
# OneHotEncoder encodes categorical features as one-hot vectors.
# Use it when the values have no natural order.
enc = OneHotEncoder()

# LabelEncoder maps each string to an integer index (its position in sorted order).
city_enc = LabelEncoder()
country_enc = LabelEncoder()

city_enc.fit(['Atlanta', 'Baltimore', 'Zurich', 'Charlotte'])
country_enc.fit(['USA', 'Switzerland', 'Germany'])

samples = [['Atlanta', 'USA'],
           ['Charlotte', 'USA'],
           ['Baltimore', 'USA'],
           ['Baltimore', 'Germany'],
           ['Baltimore', 'Switzerland'],
           ['Zurich', 'Switzerland']]

# use list comprehensions to access the 0th element of each pair -- the city
city_samples = city_enc.transform([sample[0] for sample in samples])
print('City samples: ', city_samples)

country_samples = country_enc.transform([sample[1] for sample in samples])
print('Country samples: ', country_samples)

City samples:  [0 2 1 1 1 3]
Country samples:  [2 2 2 0 1 1]


https://stackoverflow.com/questions/33592034/active-features-attribute-in-onehotencoder

In [14]:
transformed_samples = list(zip(city_samples, country_samples))
print(transformed_samples)

[(0, 2), (2, 2), (1, 2), (1, 0), (1, 1), (3, 1)]


In [15]:
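enc.fit(transformed_samples)

# n_values_ and feature_indices_ are attributes of older scikit-learn releases.
print('number of values:', enc.n_values_)
print('feature indices:', enc.feature_indices_)

number of values: [4 3]
feature indices: [0 4 7]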
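
The `n_values_` and `feature_indices_` attributes come from older scikit-learn releases and, as far as I know, are no longer available in current ones. A rough equivalent on a recent release, sketched below, fits `OneHotEncoder` on the string samples directly and reads the per-column categories from `categories_`. The names `string_enc` and `first_row` are illustrative.

```python
from sklearn.preprocessing import OneHotEncoder

samples = [['Atlanta', 'USA'],
           ['Charlotte', 'USA'],
           ['Baltimore', 'USA'],
           ['Baltimore', 'Germany'],
           ['Baltimore', 'Switzerland'],
           ['Zurich', 'Switzerland']]

# Recent releases accept string categories, so no LabelEncoder step is needed.
string_enc = OneHotEncoder()
string_enc.fit(samples)

# categories_ holds one sorted array of categories per input column.
n_values = [len(cats) for cats in string_enc.categories_]
print(n_values)

# The default output is sparse, so convert it to a dense array for display.
first_row = string_enc.transform([samples[0]]).toarray()
print(first_row)
```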


In [16]:
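# Encode one sample at a time and show it next to the original strings.
for sample, encoded_sample in zip(samples, transformed_samples):
    feature_vector = enc.transform([encoded_sample]).todense()
    print('%s => %s' % (sample, feature_vector))
#   feature_vector has shape (1, 7), so the slices below select columns, not rows.
#   print('city: %s, country: %s' % (feature_vector[:, enc.feature_indices_[0]:enc.feature_indices_[1]],
#                                    feature_vector[:, enc.feature_indices_[1]:enc.feature_indices_[2]]))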

['Atlanta', 'USA'] => [[ 1.  0.  0.  0.  0.  0.  1.]]
['Charlotte', 'USA'] => [[ 0.  0.  1.  0.  0.  0.  1.]]
['Baltimore', 'USA'] => [[ 0.  1.  0.  0.  0.  0.  1.]]
['Baltimore', 'Germany'] => [[ 0.  1.  0.  0.  1.  0.  0.]]
['Baltimore', 'Switzerland'] => [[ 0.  1.  0.  0.  0.  1.  0.]]
['Zurich', 'Switzerland'] => [[ 0.  0.  0.  1.  0.  1.  0.]]
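
The offsets `[0 4 7]` mark where each original feature starts and ends in the encoded row: columns 0 to 3 are the city and columns 4 to 6 are the country. The sketch below uses only NumPy, rebuilds the offsets from the value counts `[4 3]`, and splits the row printed above for `['Baltimore', 'Germany']`. The variable names are illustrative.

```python
import numpy as np

# Number of distinct values per feature, as printed earlier: 4 cities, 3 countries.
n_values = np.array([4, 3])

# Offsets are the running total of the counts, starting at 0.
feature_indices = np.concatenate([[0], np.cumsum(n_values)])
print(feature_indices)

# Encoded row for ['Baltimore', 'Germany'], copied from the output above.
vector = np.array([0., 1., 0., 0., 1., 0., 0.])

city_part = vector[feature_indices[0]:feature_indices[1]]
country_part = vector[feature_indices[1]:feature_indices[2]]
print('city:', city_part, 'country:', country_part)
```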
