## Transforming categorical features into numeric features
Transforming categorical features into numeric ones is a common task in NLP, since many machine learning algorithms accept only numeric input. There are several techniques to tackle this problem. For educational purposes, this tutorial provides some useful information on **preprocessing data**. Many of the examples and ideas come from several works.
The notebook credits the following links:
https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn  
http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
http://pbpython.com/categorical-encoding.html
http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
In addition, chapters 6 and 7 of Python ML by examples contribute to this notebook.   
Here we use a car dataset to illustrate.

### In the first section, we adapt the original code from http://pbpython.com/categorical-encoding.html

In [1]:
import pandas as pd
import numpy as np

# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )
df.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


In [2]:
# Select categorical features 
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
1,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
2,alfa-romero,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi
3,audi,gas,std,four,sedan,fwd,front,ohc,four,mpfi
4,audi,gas,std,four,sedan,4wd,front,ohc,five,mpfi


### Check null values 

In [3]:
obj_df[obj_df.isnull().any(axis=1)]

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
27,dodge,gas,turbo,,sedan,fwd,front,ohc,four,mpfi
63,mazda,diesel,std,,sedan,fwd,front,ohc,four,idi


In [4]:
# Fill missing values with the most frequent values
obj_df["num_doors"]= obj_df.groupby(['make','fuel_type']).num_doors.transform(lambda x: x.fillna(x.mode()[0]))

In [5]:
# check these data points again
obj_df.iloc[[27,63],]

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
27,dodge,gas,turbo,four,sedan,fwd,front,ohc,four,mpfi
63,mazda,diesel,std,four,sedan,fwd,front,ohc,four,idi
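The group-wise mode fill above raises an error if a group contains only missing values, because `x.mode()` is then empty. The following is a minimal sketch of a more defensive variant on a hypothetical toy frame (`toy` and `fill_with_mode` are illustrative names, not part of the original notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two makes, each with one missing num_doors value
toy = pd.DataFrame({"make": ["dodge", "dodge", "dodge", "mazda", "mazda"],
                    "num_doors": ["four", "four", np.nan, "two", np.nan]})

def fill_with_mode(group):
    """Fill NaN with the group's most frequent value, if there is one."""
    modes = group.mode()
    # An all-NaN group has no mode; leave it unchanged instead of failing
    return group.fillna(modes.iloc[0]) if not modes.empty else group

toy["num_doors"] = toy.groupby("make")["num_doors"].transform(fill_with_mode)
print(toy)
```

Groups without any observed value stay missing and can be handled separately, for example with the overall mode.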


## Approach 1: Replace
Using a complete dictionary for cleaning up the num_doors and num_cylinders columns:

In [6]:
cleanup_nums = {"num_doors":     {"four": 4, "two": 2},
                "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8, "two": 2, "twelve": 12, "three":3 }}
obj_df.replace(cleanup_nums, inplace=True)
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi
1,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi
2,alfa-romero,gas,std,2,hatchback,rwd,front,ohcv,6,mpfi
3,audi,gas,std,4,sedan,fwd,front,ohc,4,mpfi
4,audi,gas,std,4,sedan,4wd,front,ohc,5,mpfi


In [7]:
obj_df.dtypes

make               object
fuel_type          object
aspiration         object
num_doors           int64
body_style         object
drive_wheels       object
engine_location    object
engine_type        object
num_cylinders       int64
fuel_system        object
dtype: object

In [8]:
# Door and cylinder counts are small, so a narrower integer type than int64 is enough.
# np.int was removed from recent NumPy releases; use an explicit width instead.
obj_df[['num_doors','num_cylinders']]=obj_df[['num_doors','num_cylinders']].astype(np.int32)
obj_df.dtypes

make               object
fuel_type          object
aspiration         object
num_doors           int32
body_style         object
drive_wheels       object
engine_location    object
engine_type        object
num_cylinders       int32
fuel_system        object
dtype: object
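The saving from a narrower integer type can be checked directly. This is a small sketch on a hypothetical toy column (`doors` is made up for illustration and is not taken from the car dataset); it compares the bytes reported by `Series.memory_usage` for `int64` and `int8`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for num_doors (values are illustrative)
doors = pd.Series([2, 4, 4, 2, 4] * 1000, dtype=np.int64)

# int8 covers -128..127, which is enough for door or cylinder counts
doors_small = doors.astype(np.int8)

# index=False reports only the bytes used by the values themselves
bytes_int64 = doors.memory_usage(index=False)
bytes_int8 = doors_small.memory_usage(index=False)
print(bytes_int64, bytes_int8)
```

The values are unchanged by the cast; only the storage per element shrinks from 8 bytes to 1.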

## Approach 2: Label Encoder
LabelEncoder maps each category to an integer code, which implicitly converts nominal data into ordinal data. This method is therefore better suited to ordinal features.

In [9]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
obj_df['body_style']=le.fit_transform(obj_df['body_style'].astype('str'))
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,2,0,rwd,front,dohc,4,mpfi
1,alfa-romero,gas,std,2,0,rwd,front,dohc,4,mpfi
2,alfa-romero,gas,std,2,2,rwd,front,ohcv,6,mpfi
3,audi,gas,std,4,3,fwd,front,ohc,4,mpfi
4,audi,gas,std,4,3,4wd,front,ohc,5,mpfi


Using apply function with fit_transform on multiple columns

In [10]:
obj_df[['body_style','make']]=obj_df[['body_style','make']].apply(le.fit_transform)
obj_df.head(3)

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,0,gas,std,2,0,rwd,front,dohc,4,mpfi
1,0,gas,std,2,0,rwd,front,dohc,4,mpfi
2,0,gas,std,2,2,rwd,front,ohcv,6,mpfi


### Using LabelBinarizer method 

In [13]:
# Alternatively, a separate example is used to illustrate this method
from sklearn.preprocessing import LabelBinarizer
lb =LabelBinarizer()
label = lb.fit_transform(['yes', 'no', 'no', 'yes'])
label

array([[1],
       [0],
       [0],
       [1]])

In [14]:
# To make one column for each class, append the complement column
label = np.hstack((label, 1 - label))
label

array([[1, 0],
       [0, 1],
       [0, 1],
       [1, 0]])

## Note: With more than 2 classes, LabelBinarizer generalizes to one column per class

In [16]:
# for more than two classes
lb.fit_transform(['yes', 'no', 'no', 'yes', 'maybe'])

array([[0, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])

## Proposed solution 
To always get one column per class, including the two-class case, the following wrapper is recommended by https://stackoverflow.com/questions/31947140/sklearn-labelbinarizer-returns-vector-when-there-are-2-classes

In [17]:
class LabelBinarizer2:

    def __init__(self):
        self.lb = LabelBinarizer()

    def fit(self, X):
        # Convert X to array
        X = np.array(X)
        # Fit X using the LabelBinarizer object
        self.lb.fit(X)
        # Save the classes
        self.classes_ = self.lb.classes_
        # Return self so calls can be chained, following the scikit-learn convention
        return self

    def fit_transform(self, X):
        # Convert X to array
        X = np.array(X)
        # Fit + transform X using the LabelBinarizer object
        Xlb = self.lb.fit_transform(X)
        # Save the classes
        self.classes_ = self.lb.classes_
        if len(self.classes_) == 2:
            Xlb = np.hstack((Xlb, 1 - Xlb))
        return Xlb

    def transform(self, X):
        # Convert X to array
        X = np.array(X)
        # Transform X using the LabelBinarizer object
        Xlb = self.lb.transform(X)
        if len(self.classes_) == 2:
            Xlb = np.hstack((Xlb, 1 - Xlb))
        return Xlb

    def inverse_transform(self, Xlb):
        # Convert Xlb to array
        Xlb = np.array(Xlb)
        if len(self.classes_) == 2:
            X = self.lb.inverse_transform(Xlb[:, 0])
        else:
            X = self.lb.inverse_transform(Xlb)
        return X    

In [18]:
lb = LabelBinarizer2()
label1 = lb.fit_transform(['yes', 'no', 'no', 'yes'])
print(label1)
print(lb.inverse_transform(label1))
label2 = lb.fit_transform(['yes', 'no', 'no', 'yes', 'maybe'])
print(label2)
print(lb.inverse_transform(label2))

[[1 0]
 [0 1]
 [0 1]
 [1 0]]
['yes' 'no' 'no' 'yes']
[[0 0 1]
 [0 1 0]
 [0 1 0]
 [0 0 1]
 [1 0 0]]
['yes' 'no' 'no' 'yes' 'maybe']


In [12]:
# we can inverse transform by slicing the first column
lb.inverse_transform(label[:, 0])

array(['yes', 'no', 'no', 'yes'],
      dtype='<U3')

### Multilabel solution

In [33]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit_transform([set(['sci-fi', 'thriller']), set(['comedy'])])

array([[0, 1, 1],
       [1, 0, 0]])

In [34]:
list(mlb.classes_)

['comedy', 'sci-fi', 'thriller']

### Converting to the category dtype can reduce memory use
The category dtype offers a compact data size, the ability to order values, and plotting support.
> categorize_label = lambda x: x.astype('category')   
> LABELS = ['Function','Use','Sharing','Reporting']   
> df[LABELS] = df[LABELS].apply(categorize_label,axis=0)   


In [46]:
obj_df["body_style"] = obj_df["body_style"].astype('category')
obj_df.head(3)

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,0,gas,std,2,0,rwd,front,dohc,4,mpfi
1,0,gas,std,2,0,rwd,front,dohc,4,mpfi
2,0,gas,std,2,2,rwd,front,ohcv,6,mpfi


In [47]:
obj_df.dtypes

make                  int64
fuel_type            object
aspiration           object
num_doors             int32
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders         int32
fuel_system          object
dtype: object
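To make the memory claim concrete, here is a hedged sketch on a hypothetical column (`styles` is invented for illustration). It compares the `object` and `category` representations of the same strings; the exact byte counts depend on the platform and pandas version, so none are quoted.

```python
import pandas as pd

# Hypothetical toy column with few distinct values repeated many times
styles = pd.Series(["sedan", "hatchback", "wagon", "sedan"] * 2500)

as_object = styles.astype("object")
as_category = styles.astype("category")

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The category version stores each distinct string once plus a small integer code per row, which is why it is smaller when values repeat often.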


### Alternatively, a category column can be encoded as integers with the cat.codes accessor:
Note that this only applies to features with the **category** dtype.

In [50]:
obj_df["body_style"] = obj_df["body_style"].cat.codes
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,0,gas,std,2,0,rwd,front,dohc,4,mpfi
1,0,gas,std,2,0,rwd,front,dohc,4,mpfi
2,0,gas,std,2,2,rwd,front,ohcv,6,mpfi
3,1,gas,std,4,3,fwd,front,ohc,4,mpfi
4,1,gas,std,4,3,4wd,front,ohc,5,mpfi
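By default the codes follow the sorted order of the categories. When the feature has a meaningful order, it can be declared explicitly so that `cat.codes` respects it. The sketch below uses a hypothetical `sizes` column; the order small < medium < large is an assumption made for the example.

```python
import pandas as pd

# Hypothetical sizes column (not part of the car dataset)
sizes = pd.Series(["medium", "small", "large", "small"])

# Default: categories are sorted alphabetically (large, medium, small)
default_codes = sizes.astype("category").cat.codes

# Explicit, ordered categories: small < medium < large
size_dtype = pd.CategoricalDtype(categories=["small", "medium", "large"],
                                 ordered=True)
ordered_codes = sizes.astype(size_dtype).cat.codes

print(default_codes.tolist())
print(ordered_codes.tolist())
```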


## Approach #3 - One Hot Encoding
When the data values are not ordinal, one-hot encoding may be a solution. Sklearn provides an implementation:


In [20]:
data = np.array(['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot'])
print(data)

['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']


In [24]:
data = np.array(['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot'])
integer_encoded = LabelEncoder().fit_transform(data)
print(integer_encoded)

[0 0 2 0 1 1 2 0 2 1]


In [29]:
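# First reshape integer_encoded into a single column (one row per sample)
from sklearn.preprocessing import OneHotEncoder
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
# The default output is a sparse matrix; .toarray() makes it dense and works across
# scikit-learn versions (the `sparse` argument was later renamed `sparse_output`)
ohe = OneHotEncoder()
ohe_encoded = ohe.fit_transform(integer_encoded).toarray()
print(ohe_encoded)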

[[ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 0.  1.  0.]]
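The integer-encoding step is not strictly required. As a sketch, assuming scikit-learn 0.20 or later, `OneHotEncoder` can be fitted on the string values directly, and `handle_unknown="ignore"` encodes a category that was not seen during fitting as an all-zero row. The value `'freezing'` is a made-up unseen category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy string values shaped as one column: (n_samples, 1)
train = np.array(["cold", "warm", "hot", "cold"]).reshape(-1, 1)

# handle_unknown="ignore" encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# "freezing" never appeared in the training values
new = np.array(["hot", "freezing"]).reshape(-1, 1)
encoded = encoder.transform(new).toarray()

print(encoder.categories_)
print(encoded)
```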


###  One Hot Encoding with the get_dummies method from pandas data frames
Credit: http://pbpython.com/categorical-encoding.html, illustrated on drive_wheels. Note that the drive_wheels attribute takes the values 4wd, fwd or rwd, which results in 3 new columns. By default, the new features are named
> drive_wheels_4wd   
> drive_wheels_rwd   
> drive_wheels_fwd       
We may want to use the prefix argument to **shorten the new names**:

In [30]:
pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,engine_location,engine_type,num_cylinders,fuel_system,body_0,body_1,body_2,body_3,body_4,drive_4wd,drive_fwd,drive_rwd
0,0,gas,std,2,front,dohc,4,mpfi,1,0,0,0,0,0,0,1
1,0,gas,std,2,front,dohc,4,mpfi,1,0,0,0,0,0,0,1
2,0,gas,std,2,front,ohcv,6,mpfi,0,0,1,0,0,0,0,1
3,1,gas,std,4,front,ohc,4,mpfi,0,0,0,1,0,0,1,0
4,1,gas,std,4,front,ohc,5,mpfi,0,0,0,1,0,1,0,0
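One practical caveat with `get_dummies` is that a later data set may not contain every level, so its dummy columns can differ from the training columns. A minimal sketch of one way to align them, using hypothetical `train` and `test` frames:

```python
import pandas as pd

# Hypothetical train/test frames; the test set lacks the "4wd" level
train = pd.DataFrame({"drive_wheels": ["rwd", "fwd", "4wd", "fwd"]})
test = pd.DataFrame({"drive_wheels": ["fwd", "rwd"]})

train_dummies = pd.get_dummies(train, columns=["drive_wheels"], prefix=["drive"])
test_dummies = pd.get_dummies(test, columns=["drive_wheels"], prefix=["drive"])

# Reindex the test frame to the training columns; missing levels become 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

Levels that appear only in the test set are dropped by the reindex, which mirrors how an encoder fitted on the training data would ignore them.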


### One-Hot-Encoding approach with Keras using **to_categorical()**


In [None]:
from keras.utils import to_categorical
data = np.array([1, 3, 2, 0, 3, 2, 2, 1, 0, 1])
# one hot encode
encoded = to_categorical(data)
print(encoded)
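The Keras cell above has no recorded output. As a rough illustration of the same idea without the Keras dependency, the sketch below builds one-hot rows in NumPy by indexing an identity matrix. It is not claimed to match the dtype returned by `to_categorical`; the variable names are illustrative.

```python
import numpy as np

data = np.array([1, 3, 2, 0, 3, 2, 2, 1, 0, 1])

# Row k of the identity matrix is the one-hot vector for class k
num_classes = int(data.max()) + 1
one_hot = np.eye(num_classes)[data]
print(one_hot)

# argmax along each row recovers the original integer labels
decoded = one_hot.argmax(axis=1)
print(decoded)
```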

## Approach #4 - Custom Binary Encoding
A combination of label encoding and one-hot encoding can be used to create a binary column for further analysis. E.g., a binary feature can be derived from **engine_type**:


In [54]:
obj_df["engine_type"].value_counts()

ohc      148
ohcf      15
ohcv      13
dohc      12
l         12
rotor      4
dohcv      1
Name: engine_type, dtype: int64

We may create a new feature which indicates whether the engine is an Overhead Cam (OHC) or not. Again from http://pbpython.com/categorical-encoding.html

In [64]:
obj_df["OHC_Code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system,OHC_Code
0,0,gas,std,2,0,rwd,front,dohc,4,mpfi,1
1,0,gas,std,2,0,rwd,front,dohc,4,mpfi,1
2,0,gas,std,2,2,rwd,front,ohcv,6,mpfi,1
3,1,gas,std,4,3,fwd,front,ohc,4,mpfi,1
4,1,gas,std,4,3,4wd,front,ohc,5,mpfi,1


In [13]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'pets':['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 'owner':['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 'location':['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 'New_York']})
le = LabelEncoder()


In [4]:
df.apply(LabelEncoder().fit_transform)

Unnamed: 0,location,owner,pets
0,1,1,0
1,0,2,1
2,0,0,0
3,1,1,2
4,1,3,1
5,0,2,1


For inverse_transform and transform, we use a dictionary to retain the fitted LabelEncoder of every column.


In [6]:
from collections import defaultdict
d = defaultdict(LabelEncoder)
df.head()

Unnamed: 0,location,owner,pets
0,San_Diego,Champ,cat
1,New_York,Ron,dog
2,New_York,Brick,cat
3,San_Diego,Champ,monkey
4,San_Diego,Veronica,dog


In [17]:
df.apply(lambda x: d[x.name])

location    LabelEncoder()
owner       LabelEncoder()
pets        LabelEncoder()
dtype: object

In [7]:
# Encoding the variable
fit = df.apply(lambda x: d[x.name].fit_transform(x))
fit

Unnamed: 0,location,owner,pets
0,1,1,0
1,0,2,1
2,0,0,0
3,1,1,2
4,1,3,1
5,0,2,1


In [8]:
# Inverse the encoded
fit.apply(lambda x: d[x.name].inverse_transform(x))


Unnamed: 0,location,owner,pets
0,San_Diego,Champ,cat
1,New_York,Ron,dog
2,New_York,Brick,cat
3,San_Diego,Champ,monkey
4,San_Diego,Veronica,dog
5,New_York,Ron,dog


In [9]:
d

defaultdict(sklearn.preprocessing.label.LabelEncoder,
            {'location': LabelEncoder(),
             'owner': LabelEncoder(),
             'pets': LabelEncoder()})

In [10]:
df.head()

Unnamed: 0,location,owner,pets
0,San_Diego,Champ,cat
1,New_York,Ron,dog
2,New_York,Brick,cat
3,San_Diego,Champ,monkey
4,San_Diego,Veronica,dog


### Using a dictionary of encoders allows converting a test set later

In [18]:
# Given a new dataset (e.g. testset)
df1 = pd.DataFrame({'pets':['monkey', 'cat', 'dog'], 
                    'owner':['Brick', 'Veronica', 'Brick'], 
                    'location':['San_Diego', 'New_York', 'New_York']})


In [19]:
# Using the dictionary to label future data
df1.apply(lambda x: d[x.name].transform(x))

Unnamed: 0,location,owner,pets
0,1,0,2
1,0,3,0
2,0,0,1


## What if data contains unknown values

In [20]:
# The error in this example is intentional: it illustrates the problem when unknown values (objects) appear in the test data
df1 = pd.DataFrame({'pets':['horse', 'cat', 'dog'], 
                    'owner':['Britt', 'Veronica', 'Brick'], 
                    'location':['San_Diego', 'Denver', 'New_York']})
df1.apply(lambda x:d[x.name].transform(x))
df1

ValueError: ("y contains new labels: ['Denver']", 'occurred at index location')
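One possible workaround for unseen labels, sketched under the assumption that a sentinel code of -1 is acceptable downstream, is to build a lookup from the fitted `classes_` and map anything missing to that sentinel. The names `train_pets`, `test_pets` and `lookup` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_pets = pd.Series(["cat", "dog", "cat", "monkey"])
test_pets = pd.Series(["horse", "cat", "dog"])  # "horse" was never seen

le_pets = LabelEncoder().fit(train_pets)

# Build a label -> code lookup from the fitted classes
lookup = {label: code for code, label in enumerate(le_pets.classes_)}

# Labels missing from the lookup become NaN, then the sentinel -1
# (-1 is an assumed convention, not something LabelEncoder provides)
test_codes = test_pets.map(lookup).fillna(-1).astype(int)
print(test_codes.tolist())
```

Whether a sentinel, a dedicated "other" category, or dropping the rows is appropriate depends on the model that consumes the codes.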

### Note: 
**LabelEncoder** can turn [dog,cat,dog,mouse,cat] into [1,0,1,2,0], but then the imposed ordinality means that the average of cat and mouse is dog. Still, there are algorithms like decision trees and random forests that can work with categorical variables just fine, and LabelEncoder can be used to store values using less disk space.   
**One-Hot-Encoding** has the advantage that the result is binary rather than ordinal and that everything sits in an orthogonal vector space. The disadvantage is that for high cardinality, the feature space can blow up quickly and you start fighting the curse of dimensionality. In these cases, I typically employ one-hot encoding followed by PCA for dimensionality reduction. I find that the judicious combination of one-hot plus PCA can seldom be beaten by other encoding schemes.
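The one-hot plus PCA combination mentioned in the note can be sketched as follows. The data is hypothetical (random category ids), and the choice of 5 components is arbitrary; in practice the number of components would be tuned.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality column: 200 samples drawn from up to 50 category ids
rng = np.random.RandomState(0)
categories = rng.randint(0, 50, size=(200, 1)).astype(str)

# One-hot encode: one column per distinct category
one_hot = OneHotEncoder().fit_transform(categories).toarray()

# Project the wide binary matrix onto a handful of principal components
reduced = PCA(n_components=5).fit_transform(one_hot)
print(one_hot.shape, reduced.shape)
```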

## Suggestion
credit: Python ML by example

In [2]:
from sklearn.feature_extraction import DictVectorizer

X_dict = [{'interest': 'tech', 'occupation': 'professional'},
          {'interest': 'fashion', 'occupation': 'student'},
          {'interest': 'fashion', 'occupation': 'professional'},
          {'interest': 'sports', 'occupation': 'student'},
          {'interest': 'tech', 'occupation': 'student'},
          {'interest': 'tech', 'occupation': 'retired'},
          {'interest': 'sports', 'occupation': 'professional'}]

dict_one_hot_encoder = DictVectorizer(sparse=False)
X_encoded = dict_one_hot_encoder.fit_transform(X_dict)
print(X_encoded)

[[ 0.  0.  1.  1.  0.  0.]
 [ 1.  0.  0.  0.  0.  1.]
 [ 1.  0.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.  1.]
 [ 0.  0.  1.  0.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.]
 [ 0.  1.  0.  1.  0.  0.]]


In [3]:
print(dict_one_hot_encoder.vocabulary_)

{'interest=tech': 2, 'occupation=professional': 3, 'interest=fashion': 0, 'occupation=student': 5, 'interest=sports': 1, 'occupation=retired': 4}


## For new data, reuse the fitted encoder

In [4]:
new_dict = [{'interest': 'sports', 'occupation': 'retired'}]
new_encoded = dict_one_hot_encoder.transform(new_dict)
print(new_encoded)

[[ 0.  1.  0.  0.  1.  0.]]


In [5]:
# Conversely, transform back to the original features
print(dict_one_hot_encoder.inverse_transform(new_encoded))

[{'interest=sports': 1.0, 'occupation=retired': 1.0}]


In [18]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
X_str = np.array([['tech', 'professional'],
                  ['fashion', 'student'],
                  ['fashion', 'professional'],
                  ['sports', 'student'],
                  ['tech', 'student'],
                  ['tech', 'retired'],
                  ['sports', 'professional']])

label_encoder = LabelEncoder()
X_int = label_encoder.fit_transform(X_str.ravel()).reshape(*X_str.shape)
print(X_int)

[[5 1]
 [0 4]
 [0 1]
 [3 4]
 [5 4]
 [5 2]
 [3 1]]


In [27]:
one_hot_encoder = OneHotEncoder()
X_encoded = one_hot_encoder.fit_transform(X_int).toarray()
print(X_encoded)

[[ 0.  0.  1.  1.  0.  0.]
 [ 1.  0.  0.  0.  0.  1.]
 [ 1.  0.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.  1.]
 [ 0.  0.  1.  0.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.]
 [ 0.  1.  0.  1.  0.  0.]]


### DictVectorizer handles a category not seen during training by ignoring it

In [29]:
# new category not encountered before
new_str = np.array([['unknown_interest', 'retired'],
                  ['tech', 'unseen_occupation'],
                  ['unknown_interest', 'unseen_occupation']])

def string_to_dict(columns, data_str):
    data_dict = []
    for sample_str in data_str:
        data_dict.append({column: value for column, value in zip(columns, sample_str)})
    return data_dict

columns = ['interest', 'occupation']
new_encoded = dict_one_hot_encoder.transform(string_to_dict(columns, new_str))
print(new_encoded)

[[ 0.  0.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]]


## Challenging problem
Sometimes data comes as a range rather than an exact value, such as an age range instead of an age in years, e.g. ages (0-17, 18-25, 26-35, 36-45, 46-60, >60). Intuitively, we can transform these with a label encoder. The resulting numeric bins are then treated as levels of a non-numeric feature, which may not be informative. Another way is to create statistical summary features such as the mean or mode of each range.
Conversely, several features can be combined into one. Some ideas come from https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/ 
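Both ideas for range-valued data can be sketched on a hypothetical `ages` column that uses the bins named in the text. The first part encodes the bins as ordered codes; the second derives a numeric summary per bin. Using the lower bound of each bin, and 60 for the open-ended ">60", is an assumption made for the example.

```python
import pandas as pd

# Hypothetical age-range column using the bins named in the text
ages = pd.Series(["18-25", "0-17", ">60", "26-35", "18-25"])

# Declare the bins in their natural order so the codes preserve it
age_dtype = pd.CategoricalDtype(
    categories=["0-17", "18-25", "26-35", "36-45", "46-60", ">60"],
    ordered=True)
age_codes = ages.astype(age_dtype).cat.codes

# Assumed numeric summary: lower bound of each bin (60 for the open-ended ">60")
lower_bound = {"0-17": 0, "18-25": 18, "26-35": 26,
               "36-45": 36, "46-60": 46, ">60": 60}
age_lower = ages.map(lower_bound)

print(age_codes.tolist())
print(age_lower.tolist())
```

Declaring the order explicitly avoids relying on how the bin labels happen to sort as strings.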

### For multi categorical columns 

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

# Create some toy data in a Pandas dataframe
fruit_data = pd.DataFrame({
    'fruit':  ['apple','orange','pear','orange'],
    'color':  ['red','orange','green','green'],
    'weight': [5,6,3,4]
})

In [2]:
fruit_data

Unnamed: 0,color,fruit,weight
0,red,apple,5
1,orange,orange,6
2,green,pear,3
3,green,orange,4


In [2]:
class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.items():  # iteritems() was removed in pandas 2.0
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

Suppose we want to encode our two categorical attributes (fruit and color), while leaving the numeric attribute weight alone. We could do this as follows: 

In [4]:
MultiColumnLabelEncoder(columns = ['fruit','color']).fit_transform(fruit_data)

Unnamed: 0,color,fruit,weight
0,2,0,5
1,1,1,6
2,0,2,3
3,0,1,4


Passing it a dataframe consisting entirely of categorical variables and omitting the columns parameter will result in every column being encoded

In [5]:
MultiColumnLabelEncoder().fit_transform(fruit_data.drop('weight',axis=1))

Unnamed: 0,color,fruit
0,2,0
1,1,1
2,0,2
3,0,1


## In addition, we can use this custom transformer in a pipeline: 

In [3]:
encoding_pipeline = Pipeline([
    ('encoding',MultiColumnLabelEncoder(columns=['fruit','color']))
    # add more pipeline steps as needed
])
encoding_pipeline.fit_transform(fruit_data)

Unnamed: 0,color,fruit,weight
0,2,0,5
1,1,1,6
2,0,2,3
3,0,1,4
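Note that the custom class above refits a fresh `LabelEncoder` inside `transform`, so codes are not guaranteed to be consistent between training and new data. As an alternative sketch, assuming scikit-learn 0.20 or later, the built-in `OrdinalEncoder` combined with `ColumnTransformer` encodes several columns at once and keeps the fitted categories. This is a substitute technique, not the approach from the credited sources.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

fruit_data = pd.DataFrame({
    'fruit':  ['apple', 'orange', 'pear', 'orange'],
    'color':  ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4]
})

# Encode the two categorical columns; pass the numeric column through untouched
transformer = ColumnTransformer(
    [('encode', OrdinalEncoder(), ['fruit', 'color'])],
    remainder='passthrough')

# Output columns are ordered: fruit, color, then the passthrough weight
encoded = transformer.fit_transform(fruit_data)
print(encoded)
```

Calling `transformer.transform` on new rows reuses the categories learned during `fit`.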


## Approach using dictionary

In [8]:
df = pd.DataFrame({'pets':['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 'owner':['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 'location':['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 'New_York']})
df

Unnamed: 0,location,owner,pets
0,San_Diego,Champ,cat
1,New_York,Ron,dog
2,New_York,Brick,cat
3,San_Diego,Champ,monkey
4,San_Diego,Veronica,dog
5,New_York,Ron,dog


In [9]:
df=pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
df

Unnamed: 0,location,owner,pets
0,1,1,0
1,0,2,1
2,0,0,0
3,1,1,2
4,1,3,1
5,0,2,1


### To create a mapping dictionary, you can just enumerate the categories using a dictionary comprehension:

In [10]:
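# Note: df was overwritten with integer codes in the previous cell, so the
# categories enumerated here are those codes; apply the same comprehension to
# the original string columns to get a code -> label mapping.
{col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)} 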
     for col in df}

{'location': {0: 0, 1: 1},
 'owner': {0: 0, 1: 1, 2: 2, 3: 3},
 'pets': {0: 0, 1: 1, 2: 2}}
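Because `df` holds integer codes at this point, the mapping shown above maps each code to itself. A minimal sketch of the intended code-to-label mapping, rebuilt from the original strings (`raw` and `mapping` are illustrative names):

```python
import pandas as pd

# Rebuild the original string frame (same toy data as in the cells above)
raw = pd.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
                    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
                    'location': ['San_Diego', 'New_York', 'New_York',
                                 'San_Diego', 'San_Diego', 'New_York']})

# code -> label for every column, taken from the string categories
mapping = {col: dict(enumerate(raw[col].astype('category').cat.categories))
           for col in raw}
print(mapping)
```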

In [14]:
class MultiColumnLabelEncoder(LabelEncoder):
    """
    Wraps sklearn LabelEncoder functionality for use on multiple columns of a
    pandas dataframe.

    """
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, dframe):
        """
        Fit label encoder to pandas columns.

        Access individual column classes via indexing `self.all_classes_`

        Access individual column encoders via indexing
        `self.all_encoders_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                # fit LabelEncoder to get `classes_` for the column
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                # append this column's encoder
                self.all_encoders_[idx] = le
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
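            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            # the encoder container must also exist before it is indexed below
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)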
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
        return self

    def fit_transform(self, dframe):
        """
        Fit label encoder and return encoded labels.

        Access individual column classes via indexing `self.all_classes_`

        Access individual column encoders via indexing `self.all_encoders_`

        Access individual column encoded labels via indexing `self.all_labels_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            self.all_labels_ = np.ndarray(shape=self.columns.shape,
                                          dtype=object)
            for idx, column in enumerate(self.columns):
                # instantiate LabelEncoder
                le = LabelEncoder()
                # fit and transform labels in the column
                labels = le.fit_transform(dframe.loc[:, column].values)
                dframe.loc[:, column] = labels
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
                # store the encoded labels (not the encoder) as the docstring describes
                self.all_labels_[idx] = labels
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
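            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            # the encoder container must also exist before it is indexed below
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)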
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                dframe.loc[:, column] = le.fit_transform(
                        dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
        return dframe

    def transform(self, dframe):
        """
        Transform labels to normalized encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[
                    idx].transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values

    def inverse_transform(self, dframe):
        """
        Transform labels back to original encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        return dframe
df = pd.DataFrame({'pets':['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 'owner':['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 'location':['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 'New_York']})
df    

Unnamed: 0,location,owner,pets
0,San_Diego,Champ,cat
1,New_York,Ron,dog
2,New_York,Brick,cat
3,San_Diego,Champ,monkey
4,San_Diego,Veronica,dog
5,New_York,Ron,dog


**Note:**

If df and df_copy are mixed-type pandas dataframes, you can apply MultiColumnLabelEncoder() to the dtype=object columns in the following way:

In [16]:
# `df_copy` is not defined earlier in this notebook; a copy of `df` is assumed here
df_copy = df.copy()

# get `object` columns
df_object_columns = df.iloc[:, :].select_dtypes(include=['object']).columns
df_copy_object_columns = df_copy.iloc[:, :].select_dtypes(include=['object']).columns

# instantiate `MultiColumnLabelEncoder`
mcle = MultiColumnLabelEncoder(columns=df_object_columns)

# fit to `df` data
mcle.fit(df)

# transform the `df` data
mcle.transform(df)

### Further approaches are presented in 
http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
* Polynomial: The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced numeric variable. Therefore, this type of encoding is used only for ordered categorical variables with equal spacing.
* Backward Difference: the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.
* Helmert: The mean of the dependent variable for a level is compared to the mean of the dependent variable over all 