# Encoder
Other encoders and transformers available in scikit-learn's `sklearn.preprocessing` module.


### 1. Binarizer
Binarize data (set feature values to 0 or 1) according to a threshold.

Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.

In [2]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import Binarizer
iris = load_iris(as_frame=True)
X = iris.data

In [3]:
X.head(10)

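|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 |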


In [14]:
transformer = Binarizer(threshold=3.2).fit(X)
X_scaled = transformer.transform(X)
X_scaled[:10,:]

array([[1., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]])
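
The threshold rule described above can be checked on a small hand-made array. This is a minimal sketch; `X_demo` and its values are made up for illustration and are not part of the iris data. It shows that a value exactly equal to the threshold maps to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hand-made data (not from the iris set) chosen to sit on and around the thresholds.
X_demo = np.array([[-1.0, 0.0, 0.5],
                   [2.0, 3.2, 3.3]])

# Default threshold of 0: only strictly positive values become 1.
default_result = Binarizer().fit_transform(X_demo)

# threshold=3.2: the value 3.2 itself is not greater than the threshold, so it maps to 0.
custom_result = Binarizer(threshold=3.2).fit_transform(X_demo)

print(default_result)
print(custom_result)
```

Binarizer is stateless, so `fit` learns nothing from the data; it is called here only to keep the usual fit/transform pattern.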

### 2. KBinsDiscretizer
Bin continuous data into intervals.

- `encode`: {'onehot', 'onehot-dense', 'ordinal'}, default='onehot'. Method used to encode the transformed result.

- `strategy`: {'uniform', 'quantile', 'kmeans'}, default='quantile'. Strategy used to define the widths of the bins.

1. uniform: All bins in each feature have identical widths.

2. quantile: All bins in each feature have the same number of points.

3. kmeans: Values in each bin have the same nearest center of a 1D k-means cluster.
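
One way to see how the three strategies differ is to compare the bin edges they learn on the same data. The sketch below uses a made-up single-feature array with one outlier (`X_demo` is an illustrative name, not part of the iris example) and reads the fitted `bin_edges_` attribute. Recent scikit-learn versions may emit a FutureWarning about quantile defaults, which does not affect this illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy single-feature data with one outlier (illustrative values, not from the iris set).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]])

edges = {}
for strategy in ("uniform", "quantile", "kmeans"):
    est = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy=strategy)
    est.fit(X_demo)
    # bin_edges_ holds one array of edges per feature; this data has a single feature.
    edges[strategy] = est.bin_edges_[0]
    print(strategy, edges[strategy])
```

With `uniform` the inner edge sits halfway between the minimum and maximum, so the outlier stretches the bins. With `quantile` the inner edge is the median, so each bin receives the same number of points.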

In [50]:
from sklearn.preprocessing import KBinsDiscretizer
iris = load_iris(as_frame=True)
X = iris.data
X.head(10)

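|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 |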


In [51]:
transformer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform').fit(X)
X_scaled = transformer.transform(X)
X_scaled[:10,:]

array([[1., 3., 0., 0.],
       [0., 2., 0., 0.],
       [0., 2., 0., 0.],
       [0., 2., 0., 0.],
       [0., 3., 0., 0.],
       [1., 3., 0., 0.],
       [0., 2., 0., 0.],
       [0., 2., 0., 0.],
       [0., 1., 0., 0.],
       [0., 2., 0., 0.]])

In [52]:
transformer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile').fit(X)
X_scaled = transformer.transform(X)
X_scaled[:10,:]

array([[1., 4., 0., 1.],
       [0., 2., 0., 1.],
       [0., 3., 0., 1.],
       [0., 3., 1., 1.],
       [1., 4., 0., 1.],
       [1., 4., 1., 1.],
       [0., 4., 0., 1.],
       [1., 4., 1., 1.],
       [0., 1., 0., 1.],
       [0., 3., 1., 0.]])

In [53]:
transformer = KBinsDiscretizer(n_bins=5, encode='onehot').fit(X)
X_scaled = transformer.transform(X)
X_scaled.shape

(150, 20)

In [54]:
aa = X_scaled.toarray()
aa[0:10, 5:10]  # columns 5-9 are the five one-hot bins of the second feature, sepal width (cm)

array([[0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])

### 3. LabelEncoder
Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode target values, i.e. y, and not the input X.

In [57]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder().fit(["paris", "paris", "tokyo", "amsterdam"])
list(le.classes_)

['amsterdam', 'paris', 'tokyo']

In [58]:
le.transform(["tokyo", "tokyo", "paris"])

array([2, 2, 1])
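
The integer codes can be mapped back to the original labels with `inverse_transform`. A short sketch reusing the same city names (`le_demo`, `codes`, and `labels` are illustrative variable names):

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the same labels as above; classes are stored in sorted order.
le_demo = LabelEncoder().fit(["paris", "paris", "tokyo", "amsterdam"])

# Encode labels to integers, then decode them back.
codes = le_demo.transform(["tokyo", "tokyo", "paris"])
labels = le_demo.inverse_transform(codes)

print(list(codes))
print(list(labels))
```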

### 4. OrdinalEncoder
Encode categorical features as an integer array.
**OrdinalEncoder** and **LabelEncoder** perform the same basic mapping from categories to integers; the difference is in what they are meant to encode. **OrdinalEncoder** converts features, while **LabelEncoder** converts the target variable. Similarly, **LabelBinarizer** is the target-variable counterpart of **OneHotEncoder**, which can fit multiple columns.

That is why OrdinalEncoder fits data of shape (n_samples, n_features), while LabelEncoder only fits data of shape (n_samples,).
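
The shape difference can be illustrated with a small sketch. The arrays `X_demo` and `y_demo` are hypothetical data invented for this example. OrdinalEncoder encodes each column of a 2D feature matrix independently, while LabelEncoder accepts a 1D target and is expected to reject a multi-column input.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical feature matrix with two categorical columns: city and size.
X_demo = np.array([["paris", "small"],
                   ["tokyo", "large"],
                   ["amsterdam", "small"]])
# Hypothetical 1D target.
y_demo = np.array(["yes", "no", "yes"])

# Each column gets its own sorted category list and its own integer codes.
X_encoded = OrdinalEncoder().fit_transform(X_demo)  # shape (3, 2), float codes
y_encoded = LabelEncoder().fit_transform(y_demo)    # shape (3,), integer codes

# LabelEncoder expects a 1D array, so a (3, 2) input raises ValueError.
try:
    LabelEncoder().fit(X_demo)
    label_encoder_accepts_2d = True
except ValueError:
    label_encoder_accepts_2d = False

print(X_encoded)
print(y_encoded)
print(label_encoder_accepts_2d)
```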

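### 5. OneHotEncoder
Encode categorical features as a one-hot numeric array.

By default every input column is treated as categorical, including numeric ones, so the integer column C in the example that follows is one-hot encoded as well.
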
In [75]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
df

|   | A | B | C |
|---|---|---|---|
| 0 | a | b | 1 |
| 1 | b | a | 2 |
| 2 | a | c | 3 |


In [76]:
enc = OneHotEncoder()
df2 = enc.fit_transform(df)

In [78]:
df2.toarray()

array([[1., 0., 0., 1., 0., 1., 0., 0.],
       [0., 1., 1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 0., 1.]])

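Unlike OneHotEncoder, which one-hot encoded the numeric column C above, `pandas.get_dummies` by default converts only categorical columns (object, string, or category dtype) into dummy/indicator variables and leaves numeric columns unchanged.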

In [79]:
pd.get_dummies(df)

|   | C | A_a | A_b | B_a | B_b | B_c |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 2 | 0 | 1 | 1 | 0 | 0 |
| 2 | 3 | 1 | 0 | 0 | 0 | 1 |
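
A comparable result can be obtained within scikit-learn by applying OneHotEncoder only to the string columns and passing the numeric column through. The sketch below is one possible approach, not something the original notebook does; it uses `make_column_transformer` from `sklearn.compose`, and `df_demo`, `ct`, and `out` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Same small frame as in the example above.
df_demo = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})

# One-hot encode only A and B; pass the remaining column (C) through unchanged.
ct = make_column_transformer(
    (OneHotEncoder(), ['A', 'B']),
    remainder='passthrough',
)
out = ct.fit_transform(df_demo)

# The result may be sparse or dense depending on its density; normalize to a dense array.
if hasattr(out, "toarray"):
    out = out.toarray()
out = np.asarray(out)

print(out)
```

Here the encoded columns come first and the passed-through column C comes last, whereas `pd.get_dummies` keeps C in front.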
