# Integer Encoding

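Integer encoding replaces each category with an integer, typically from 0 to n-1, where n is the number of distinct categories. The numbers are assigned arbitrarily. This encoding method allows for quick benchmarking of machine learning models.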


### Advantages

- Straightforward to implement
- Does not expand the feature space


### Limitations

- Does not capture any information about the category labels
- Not suitable for linear models, because the arbitrary integers imply an order between categories that does not exist
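
To make the limitation for linear models concrete, here is a minimal sketch on a made-up column (the `colours` values are an assumption for illustration and are not part of the House Pricing dataset). Integers are assigned in order of first appearance, so the codes suggest that 'green' lies between 'red' and 'blue', which a linear model would treat as meaningful.

```python
import pandas as pd

# hypothetical toy column, not part of the House Pricing dataset
colours = pd.Series(['red', 'green', 'blue', 'green', 'red'])

# assign integers in order of first appearance
mapping = {k: i for i, k in enumerate(colours.unique())}
encoded = colours.map(mapping)

print(mapping)           # {'red': 0, 'green': 1, 'blue': 2}
print(encoded.tolist())  # [0, 1, 2, 1, 0]
```

Changing the order of the rows would change the integers, which is why the encoding carries no information about the categories themselves.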


### Dataset:
- House Pricing dataset


### Content:

1. First Steps:
    - loading the data
    - exploring cardinality
    - train/test split
2. Integer Encoding with Pandas:
    - for a single column
    - with re-usable functions
3. Integer Encoding with Scikit-learn:
    - column by column
    - for the whole data set

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder

## 1. First Steps

### - loading the data

In [2]:
# load dataset

data = pd.read_csv(
    '../houseprice.csv',
    usecols=['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice'])

data.head()

| | Neighborhood | Exterior1st | Exterior2nd | SalePrice |
|---|---|---|---|---|
| 0 | CollgCr | VinylSd | VinylSd | 208500 |
| 1 | Veenker | MetalSd | MetalSd | 181500 |
| 2 | CollgCr | VinylSd | VinylSd | 223500 |
| 3 | Crawfor | Wd Sdng | Wd Shng | 140000 |
| 4 | NoRidge | VinylSd | VinylSd | 250000 |


### - exploring cardinality

In [3]:
# how many labels each variable has

for col in data.columns:
    print(col, ': ', len(data[col].unique()), ' labels')

Neighborhood :  25  labels
Exterior1st :  15  labels
Exterior2nd :  16  labels
SalePrice :  663  labels


In [4]:
# explore the unique categories
data['Neighborhood'].unique()

array(['CollgCr', 'Veenker', 'Crawfor', 'NoRidge', 'Mitchel', 'Somerst',
       'NWAmes', 'OldTown', 'BrkSide', 'Sawyer', 'NridgHt', 'NAmes',
       'SawyerW', 'IDOTRR', 'MeadowV', 'Edwards', 'Timber', 'Gilbert',
       'StoneBr', 'ClearCr', 'NPkVill', 'Blmngtn', 'BrDale', 'SWISU',
       'Blueste'], dtype=object)

In [5]:
data['Exterior1st'].unique()

array(['VinylSd', 'MetalSd', 'Wd Sdng', 'HdBoard', 'BrkFace', 'WdShing',
       'CemntBd', 'Plywood', 'AsbShng', 'Stucco', 'BrkComm', 'AsphShn',
       'Stone', 'ImStucc', 'CBlock'], dtype=object)

In [6]:
data['Exterior2nd'].unique()

array(['VinylSd', 'MetalSd', 'Wd Shng', 'HdBoard', 'Plywood', 'Wd Sdng',
       'CmentBd', 'BrkFace', 'Stucco', 'AsbShng', 'Brk Cmn', 'ImStucc',
       'AsphShn', 'Stone', 'Other', 'CBlock'], dtype=object)

### - train/test split 

The integer assigned to each category must be learned from the train set only. The same mapping is then applied to the test set.
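
As a minimal sketch of this idea, the toy example below (the `train` and `test` series are made up for illustration) learns the mapping from the train set and applies it to both sets. It also shows one consequence worth keeping in mind: with `map`, a category that never appears in the train set ends up as `NaN` in the test set.

```python
import pandas as pd

# hypothetical toy data, for illustration only
train = pd.Series(['A', 'B', 'A', 'C'])
test = pd.Series(['B', 'D'])  # 'D' never appears in train

# learn the mapping from the train set only
mapping = {k: i for i, k in enumerate(train.unique())}

# apply the same mapping to both sets
train_enc = train.map(mapping)
test_enc = test.map(mapping)

# categories absent from the mapping become NaN
print(test_enc.isna().tolist())  # [False, True]
```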

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']],
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

## 2. Integer Encoding with Pandas

- returns a pandas DataFrame
- but does not store the mappings learned from the train data, so they are not propagated to the test data automatically

If we plan to use the encoding in production, we need to capture and save the mappings manually, one variable at a time.
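
One possible way to save a mapping for later use is to serialise it as JSON. The sketch below is only an illustration: the small `ordinal_mapping` dictionary stands in for one learned from the train set, and the JSON string could equally be written to a file.

```python
import json

# hypothetical mapping, standing in for one learned from the train set
ordinal_mapping = {'CollgCr': 0, 'ClearCr': 1, 'BrkSide': 2}

# serialise to a JSON string (this could be written to a file instead)
saved = json.dumps(ordinal_mapping)

# later, e.g. at prediction time, restore the mapping
restored = json.loads(saved)

print(restored == ordinal_mapping)  # True
```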

In [8]:
# create a dictionary with the mappings of categories to numbers

ordinal_mapping = {
    k: i
    for i, k in enumerate(X_train['Neighborhood'].unique(), 0)
}

ordinal_mapping

{'CollgCr': 0,
 'ClearCr': 1,
 'BrkSide': 2,
 'Edwards': 3,
 'SWISU': 4,
 'Sawyer': 5,
 'Crawfor': 6,
 'NAmes': 7,
 'Mitchel': 8,
 'Timber': 9,
 'Gilbert': 10,
 'Somerst': 11,
 'MeadowV': 12,
 'OldTown': 13,
 'BrDale': 14,
 'NWAmes': 15,
 'NridgHt': 16,
 'SawyerW': 17,
 'NoRidge': 18,
 'IDOTRR': 19,
 'NPkVill': 20,
 'StoneBr': 21,
 'Blmngtn': 22,
 'Veenker': 23,
 'Blueste': 24}

In [9]:
# replace the labels with the integers

X_train['Neighborhood'] = X_train['Neighborhood'].map(ordinal_mapping)
X_test['Neighborhood'] = X_test['Neighborhood'].map(ordinal_mapping)

In [10]:
# explore the result

X_train['Neighborhood'].head(10)

64      0
682     1
960     2
1384    3
1100    4
416     5
1034    6
853     7
472     3
1011    3
Name: Neighborhood, dtype: int64

In [11]:
# the same but with re-usable functions


def find_category_mappings(df, variable):
    # map each category to an integer, in order of first appearance
    return {k: i for i, k in enumerate(df[variable].unique(), 0)}


def integer_encode(train, test, variable, ordinal_mapping):
    # replace the labels with the integers, in place, in both data sets
    train[variable] = train[variable].map(ordinal_mapping)
    test[variable] = test[variable].map(ordinal_mapping)

In [12]:
# run a loop over the remaining categorical variables

for variable in ['Exterior1st', 'Exterior2nd']:
    mappings = find_category_mappings(X_train, variable)
    integer_encode(X_train, X_test, variable, mappings)

In [13]:
# the result

X_train.head()

| | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 64 | 0 | 0 | 0 |
| 682 | 1 | 1 | 1 |
| 960 | 2 | 1 | 2 |
| 1384 | 3 | 2 | 3 |
| 1100 | 4 | 1 | 1 |


## 3. Integer Encoding with Scikit-learn

### - column by column

In [14]:
# train_test_split 

X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']],
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [15]:
# create an encoder

le = LabelEncoder()
le.fit(X_train['Neighborhood'])

LabelEncoder()

In [16]:
# the unique classes
le.classes_

array(['Blmngtn', 'Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr',
       'Crawfor', 'Edwards', 'Gilbert', 'IDOTRR', 'MeadowV', 'Mitchel',
       'NAmes', 'NPkVill', 'NWAmes', 'NoRidge', 'NridgHt', 'OldTown',
       'SWISU', 'Sawyer', 'SawyerW', 'Somerst', 'StoneBr', 'Timber',
       'Veenker'], dtype=object)

In [17]:
X_train['Neighborhood'] = le.transform(X_train['Neighborhood'])
X_test['Neighborhood'] = le.transform(X_test['Neighborhood'])

X_train.head()

| | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 64 | 5 | VinylSd | VinylSd |
| 682 | 4 | Wd Sdng | Wd Sdng |
| 960 | 3 | Wd Sdng | Plywood |
| 1384 | 7 | WdShing | Wd Shng |
| 1100 | 18 | Wd Sdng | Wd Sdng |
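
Two properties of `LabelEncoder` are worth noting here, sketched below on made-up labels (the values are an assumption for illustration). The classes are stored sorted, so the integers follow alphabetical order rather than order of appearance, and transforming a label that was not seen during `fit` raises a `ValueError`.

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical toy labels, for illustration only
le = LabelEncoder()
le.fit(['b', 'c', 'a'])

# classes are stored sorted, so codes follow alphabetical order
classes = list(le.classes_)
codes = le.transform(['c', 'a']).tolist()
print(classes)  # ['a', 'b', 'c']
print(codes)    # [2, 0]

# a label that was not seen during fit raises ValueError
try:
    le.transform(['d'])
    unseen_ok = True
except ValueError:
    unseen_ok = False

print(unseen_ok)  # False
```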


### - for the whole data set

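LabelEncoder works on one variable at a time. However, there are ways to automate the encoding for all the categorical variables:<br>
1. `OneHotEncoder().fit_transform(df)` (note that this produces one-hot columns rather than integer codes; `OrdinalEncoder` is the scikit-learn transformer that returns integers)
2. `df.apply(LabelEncoder().fit_transform)` (this does not keep the fitted encoders, so the mappings cannot be reused on the test set)
3. with `collections.defaultdict(LabelEncoder)`

The last option keeps one fitted encoder per column, so the mappings learned on the train set can be applied to the test set and inverted later. It is the one used below.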

In [18]:
# additional import required
from collections import defaultdict

In [19]:
# train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']],
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [20]:
d = defaultdict(LabelEncoder)

In [21]:
# encode the X_train variable
train_transformed = X_train.apply(lambda x: d[x.name].fit_transform(x))

# encode the X_test variable
test_transformed = X_test.apply(lambda x: d[x.name].transform(x))

In [22]:
train_transformed.head()

| | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 64 | 5 | 12 | 13 |
| 682 | 4 | 13 | 14 |
| 960 | 3 | 13 | 10 |
| 1384 | 7 | 14 | 15 |
| 1100 | 18 | 13 | 14 |


In [23]:
test_transformed.head()

| | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 529 | 6 | 13 | 11 |
| 491 | 12 | 13 | 14 |
| 459 | 3 | 8 | 8 |
| 279 | 4 | 9 | 10 |
| 655 | 2 | 6 | 7 |


In [24]:
# inverse transform to recover the original labels

tmp = train_transformed.apply(lambda x: d[x.name].inverse_transform(x))
tmp.head()

| | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 64 | CollgCr | VinylSd | VinylSd |
| 682 | ClearCr | Wd Sdng | Wd Sdng |
| 960 | BrkSide | Wd Sdng | Plywood |
| 1384 | Edwards | WdShing | Wd Shng |
| 1100 | SWISU | Wd Sdng | Wd Sdng |
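
As an alternative that this notebook does not use, scikit-learn's `OrdinalEncoder` can integer encode several columns at once. The sketch below uses made-up frames and assumes scikit-learn 0.24 or later, where the `handle_unknown='use_encoded_value'` option is available. With it, categories not seen during `fit` are encoded as `unknown_value` instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# hypothetical toy frames, for illustration only
train = pd.DataFrame({'x1': ['a', 'b', 'a'], 'x2': ['u', 'v', 'v']})
test = pd.DataFrame({'x1': ['b', 'z'], 'x2': ['u', 'v']})  # 'z' is unseen

# unseen categories are encoded as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(train)

test_enc = enc.transform(test)
print(test_enc.tolist())  # [[1.0, 0.0], [-1.0, 1.0]]
```

Like `LabelEncoder`, it assigns integers by sorted category order, and the fitted encoder exposes `inverse_transform` to recover the original labels.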
