# Dealing with categorical features

In this notebook we cover examples of common tasks for treating **categorical data** prior to modeling. Categorical data needs a lot of attention during data pre-processing. This is because most machine learning algorithms don't deal directly with categorical data. Instead we need to **recode** the data from categorical into numeric, and we will see how we do that in this notebook.

Let's begin by reading some data. We will use a marketing data set of bank customers. You can read more about the data [here](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing). 

In [128]:
import pandas as pd
import numpy as np

bank = pd.read_csv('data/bank-full.csv', sep = ";")
bank.head()

,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


We can see that our data contains many categorical columns, including the target itself. Let's check the data types:

In [129]:
bank.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

We can use the `select_dtypes` method to limit the data to just the categorical columns.

In [130]:
bank.select_dtypes('object').head()

,job,marital,education,default,housing,loan,contact,month,poutcome,y
0,management,married,tertiary,no,yes,no,unknown,may,unknown,no
1,technician,single,secondary,no,yes,no,unknown,may,unknown,no
2,entrepreneur,married,secondary,no,yes,yes,unknown,may,unknown,no
3,blue-collar,married,unknown,no,yes,no,unknown,may,unknown,no
4,unknown,single,unknown,no,no,no,unknown,may,unknown,no


### Exercise

Write a loop to obtain counts of unique values for each categorical column in the data. We can use the `value_counts` method to get counts for a column. 

In [132]:
bank.head()

,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [133]:
# mass convert objects to categorical.
cats = ['marital', 'default', 'housing', 'loan']
bank[cats] = bank[cats].astype('category')
bank.dtypes

age             int64
job            object
marital      category
education      object
default      category
balance         int64
housing      category
loan         category
contact        object
day             int64
month          object
duration        int64
campaign        int64
pdays           int64
previous        int64
poutcome       object
y              object
dtype: object

In [134]:
for i in bank.select_dtypes('category'):
    x = bank[i].value_counts(normalize=True, dropna=False)
    print('------------------------')
    print(x)
    print('Number of categories: ' + str(len(x)))

------------------------
married     0.601933
single      0.282896
divorced    0.115171
Name: marital, dtype: float64
Number of categories: 3
------------------------
no     0.981973
yes    0.018027
Name: default, dtype: float64
Number of categories: 2
------------------------
yes    0.555838
no     0.444162
Name: housing, dtype: float64
Number of categories: 2
------------------------
no     0.839774
yes    0.160226
Name: loan, dtype: float64
Number of categories: 2


In [135]:
#show the cardinality
bank.nunique()

age            77
job            12
marital         3
education       4
default         2
balance      7168
housing         2
loan            2
contact         3
day            31
month          12
duration     1573
campaign       48
pdays         559
previous       41
poutcome        4
y               2
dtype: int64

### End of exercise

It turns out there are **two kinds of data types for categorical data** in `pandas`: `object` and `category`. By default, any string column is given the `object` type, but we can later convert it to the `category` type. Ideally, the `category` type is only used for a column with **a limited number of pre-defined categories**. This is because `category` is a more rigid data type that we can use to impose additional structure on the column, which only makes sense when the categories are known and few. Let's illustrate that by turning some of the columns in our data into `category` columns.

In [136]:
cat_cols = ['marital', 'default', 'housing', 'loan']
bank[cat_cols] = bank[cat_cols].astype('category')
bank.dtypes

age             int64
job            object
marital      category
education      object
default      category
balance         int64
housing      category
loan         category
contact        object
day             int64
month          object
duration        int64
campaign        int64
pdays           int64
previous        int64
poutcome       object
y              object
dtype: object

Why would we want to add additional rigidity? Because this way we can impose some amount of **data integrity**. For example, if `marital` should always be limited to "single", "divorced" or "married" then by converting `marital` into a `category` column we can prevent the data from introducing any other category without first adding it as one of the acceptable categories for this column.

In [137]:
bank['marital'].cat.categories

Index(['divorced', 'married', 'single'], dtype='object')

### Exercise

Try to change `marital` at the second row to the value "widowed". You should get an error. To fix the error, you need to add "widowed" as one of the acceptable categories. Use the `cat.add_categories` method to add "widowed" as a category and then try again to make sure you don't get an error this time.

In [138]:
bank['marital'] = bank['marital'].cat.add_categories('widowed')
bank.loc[1,'marital'] = 'widowed'
bank.loc[1]

age                  44
job          technician
marital         widowed
education     secondary
default              no
balance              29
housing             yes
loan                 no
contact         unknown
day                   5
month               may
duration            151
campaign              1
pdays                -1
previous              0
poutcome        unknown
y                    no
Name: 1, dtype: object
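
The cell above adds the category before assigning, so the error itself never shows up. As a minimal sketch of the failing step, here is the same sequence on a small toy Series (the data is made up for illustration, and the exact exception type depends on the pandas version, so we catch both likely candidates):

```python
import pandas as pd

# Toy stand-in for the `marital` column (values are illustrative only)
marital = pd.Series(['married', 'single', 'divorced'], dtype='category')

try:
    marital.loc[1] = 'widowed'  # 'widowed' is not an accepted category yet
    assignment_failed = False
except (TypeError, ValueError) as err:  # exception type varies by pandas version
    assignment_failed = True
    print('Rejected:', err)

# Register the new category first, then the same assignment goes through
marital = marital.cat.add_categories('widowed')
marital.loc[1] = 'widowed'
print(marital.tolist())
```

The rejected assignment leaves the Series untouched, which is exactly the data-integrity guarantee described earlier.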

Undo your change by reassigning `marital` at the second row to the value "single". Get a count of unique values for `marital` now. Do you notice anything? Explain what and why?

In [139]:
bank.loc[1,'marital'] = 'single'
bank['marital'] = bank['marital'].cat.remove_categories('widowed')

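# note: the `int64` in the output below is the dtype of the counts, not of the column;
# `marital` is still categorical after `remove_categories`, so this cast is only a safeguard
bank['marital'] = bank['marital'].astype('category')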
bank['marital'].value_counts()


married     27214
single      12790
divorced     5207
Name: marital, dtype: int64

Categorical columns have other useful methods, and their names speak for themselves, such as
`as_ordered`, `as_unordered`, `remove_categories`, `remove_unused_categories`, `rename_categories`, `reorder_categories`, and `set_categories`. It is important to be aware of this functionality and use it when it makes sense. Of course an alternative to using these is to convert the column back to `object` and make all the changes we want and then turn it back into `category`.
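
To make the point of the last question concrete, here is a small sketch on toy data (not the bank data) of what happens when a category is accepted but never used: `value_counts` still lists it with a count of zero until it is dropped, for example with `remove_unused_categories`.

```python
import pandas as pd

status = pd.Series(['yes', 'no', 'yes'], dtype='category')
status = status.cat.add_categories('maybe')  # accepted, but never used

counts = status.value_counts()
print(counts)  # 'maybe' is listed even though no row has that value

# Drop every category that does not occur in the data
trimmed = status.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())
```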

### End of exercise

So we saw that a `category` column has pre-defined categories and a set of methods of its own for changing the categories, whereas an `object` column is more of a **free-form** categorical column where the categories can be changed on the fly and no particular structure is imposed. One way this distinction matters is when we need to rename the categories of a categorical column. This is sometimes referred to as **recoding** or **remapping**.

### Exercise

Let's first begin with an example using `job`, which has type `object`. Rename the category "management" to "managerial". HINT: find all rows where `job` is the string `'management'`, and use `loc` to change those rows to the string `'managerial'`.

In [140]:
bank['job'] = bank['job'].replace({'management':'managerial'})
bank['job'].value_counts()

blue-collar      9732
managerial       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
unknown           288
Name: job, dtype: int64
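
The solution cell above uses `replace`, while the hint describes a `loc`-based approach. As a rough sketch of what the hint has in mind, shown on a toy Series rather than the bank data:

```python
import pandas as pd

jobs = pd.Series(['management', 'technician', 'management', 'student'])  # toy data

is_management = jobs == 'management'    # boolean mask of the rows to change
jobs.loc[is_management] = 'managerial'  # overwrite only those rows
print(jobs.tolist())
```

On the full data frame the equivalent statement would be `bank.loc[bank['job'] == 'management', 'job'] = 'managerial'`, with one such line per value we want to rename.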

The above approach works fine, but it's tedious if we have a lot of changes we want to make. The better way to do it is to create a Python dictionary that maps old values (values we want to change) to new values, then use the `replace` method to replace them all at once.

Create such a dictionary and use `replace` to make the following changes in the `job` column:

- rename `'student'` to `'in-school'`
- combine `'housemaid'` and `'services'` into a single group called `'catering'`
- change `'unknown'` to a missing value, i.e. `np.nan` (without quotes)

In [141]:
bank['job'] = bank['job'].replace({'student':'in-school', 'housemaid':'catering','services':'catering', 'unknown': np.nan})
bank['job'].value_counts(dropna=False)

blue-collar      9732
managerial       9458
technician       7597
catering         5394
admin.           5171
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
in-school         938
NaN               288
Name: job, dtype: int64

Get a count of unique values for `job` to make sure everything worked. Note that `value_counts()` does not include missing values in the counts by default. We need to specify `dropna = False` to include missing values.

### End of exercise

The `replace` method also works on a categorical column of type `category`; however, **it can change its type to `object`!** (the exact behavior depends on the pandas version). So either we have to convert it back to `category`, or we can use the `rename_categories` method to replace values, which works very similarly to `replace` in that it accepts a dictionary mapping old values to new ones. Here's an example:

In [142]:
bank['marital'] = bank['marital'].cat.rename_categories({'married': 'taken'})
bank['marital'].value_counts()

taken       27214
single      12790
divorced     5207
Name: marital, dtype: int64
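
As a quick check on toy data, the sketch below confirms that `rename_categories` leaves the column as `category` and only swaps the label, so no conversion back is needed:

```python
import pandas as pd

marital = pd.Series(['married', 'single', 'married'], dtype='category')
renamed = marital.cat.rename_categories({'married': 'taken'})

print(renamed.dtype)                    # still categorical
print(renamed.cat.categories.tolist())  # the label is swapped in place
```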

Categorical columns can also be easily generated from numeric columns. For example, let's say we want to have a column called `high_balance` that is `True` when balance exceeds $2,000 and `False` otherwise. Technically this would be a boolean column, but in practice it acts as a categorical column. Generating such a column is very easy. We refer to such binary columns as **dummy variables** or **flags** because they single out a group.

In [143]:
bank['high_balance'] = bank['balance'] > 2000
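
If more than two groups are needed, one possible extension is `pd.cut`, which bins a numeric column into a `category` column. The sketch below uses toy balances, and the bin edges and labels are assumptions chosen only for illustration:

```python
import numpy as np
import pandas as pd

balance = pd.Series([-50, 0, 1500, 2000, 2143])  # toy balances

# Hypothetical bin edges; each bin is closed on the right, e.g. (0, 2000]
bins = [-np.inf, 0, 2000, np.inf]
labels = ['none', 'medium', 'high']
balance_group = pd.cut(balance, bins=bins, labels=labels)
print(balance_group.tolist())
```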

The process of creating a dummy variable **for each category** of a categorical feature is called **one-hot encoding**. Let's see what happens if we one-hot-encode `marital`.

In [144]:
bank['marital_taken'] = (bank['marital'] == 'taken').astype('int')
bank['marital_single'] = (bank['marital'] == 'single').astype('int')
bank['marital_divorced'] = (bank['marital'] == 'divorced').astype('int')

In [145]:
bank.filter(like = 'marital').head()

,marital,marital_taken,marital_single,marital_divorced
0,taken,1,0,0
1,single,0,1,0
2,taken,1,0,0
3,taken,1,0,0
4,single,0,1,0


One-hot encoding is a common enough task that we don't need to do it manually like we did above. Instead we can use `pd.get_dummies` to do it in one go.

In [146]:
pd.get_dummies(bank['marital'], prefix = 'marital').head()

,marital_divorced,marital_taken,marital_single
0,0,1,0
1,0,0,1
2,0,1,0
3,0,1,0
4,0,0,1
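
`pd.get_dummies` can also be given a whole data frame together with the columns to encode. The following is a minimal sketch on a made-up frame; `dtype=int` is passed so the dummies are 0/1 integers regardless of the pandas version's default:

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [58, 44, 33],
    'marital': ['married', 'single', 'married'],
    'loan': ['no', 'no', 'yes'],
})

# Encode only the listed columns; the other columns are passed through unchanged
encoded = pd.get_dummies(toy, columns=['marital', 'loan'], dtype=int)
print(encoded.columns.tolist())
```

The column name is used as the prefix by default, so there is no need to pass `prefix` for each column.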


There's an even more streamlined way to do one-hot encoding. At first blush it appears less straightforward, but there is a reason it is set up this way, and we will explain that later. Just like normalization, one-hot encoding is a common pre-processing task and we can turn to the `sklearn` library to do the hard part for us.

In [147]:
from sklearn.preprocessing import OneHotEncoder

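bank_cat = bank.select_dtypes('category').copy() # only select columns that have type 'category'
# NOTE: scikit-learn 1.2 renamed `sparse` to `sparse_output` (`sparse` was removed in 1.4)
onehot = OneHotEncoder(sparse = False) # initialize one-hot-encoder, returning a dense array
onehot.fit(bank_cat) # learn the categories of each column
# NOTE: scikit-learn 1.2 removed `get_feature_names`; use `get_feature_names_out` there
col_names = onehot.get_feature_names(bank_cat.columns) # this allows us to properly name columns
bank_onehot = pd.DataFrame(onehot.transform(bank_cat), columns = col_names)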
bank_onehot.head()

,marital_divorced,marital_single,marital_taken,default_no,default_yes,housing_no,housing_yes,loan_no,loan_yes
0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0
1,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
2,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0
3,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0
4,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


So we can see that one-hot encoding created a **binary feature** for **each category of each categorical column** in the data. Although to be more specific, we limited it to columns whose type is `category` and excluded columns whose type is `object`. This is because one-hot encoding can quickly blow up the number of columns in the data if we are not careful and include categorical columns with lots of categories (also called **high-cardinality** categorical columns).
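
One way to guard against high-cardinality columns is to count the distinct values per column first and only encode the columns under some limit. The sketch below uses a toy frame, and `max_levels` is a hypothetical threshold, not a recommended value:

```python
import pandas as pd

toy = pd.DataFrame({
    'marital': ['married', 'single', 'married', 'divorced'],
    'customer_id': ['a1', 'b2', 'c3', 'd4'],  # one distinct value per row
    'age': [58, 44, 33, 47],
})

max_levels = 3  # hypothetical threshold for "few enough categories"
non_numeric = toy.select_dtypes(exclude='number')  # candidate categorical columns
n_levels = non_numeric.nunique()                   # distinct values per column
to_encode = n_levels[n_levels <= max_levels].index.tolist()
print(to_encode)
```

Here `customer_id` is left out because one-hot encoding it would add one column per row.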

What is the point of doing this? Most machine learning algorithms do not work **directly** with categorical data, so we need to encode the categorical data, which turns it into numeric data. One-hot encoding is just one type of encoding, but it is the most common one.

One last note about the `sklearn` pre-processing transformations we learned about in this notebook: If you look at examples online, you may notice that instead of calling `fit` and `transform` separately, you can call `fit_transform` which combines the two steps into one. This may seem reasonable and saves you one extra line of code, but we discourage it. The following exercise will illustrate why, but the main reason will become clear when we talk about machine learning.

### Exercise

Let's return to the data, and once again fit a one-hot encoder on it. This time we run it on `job` and `education`.

In [158]:
bank_cat = bank[['job', 'education']].copy()
onehot = OneHotEncoder(sparse = False) # initialize one-hot-encoder
#onehot.fit(bank_cat) # fitting at this point fails, so it is left commented out

# to see why, look at the value counts, including missing values
bank_cat['job'].value_counts(dropna=False)

# `job` has NaNs (we recoded 'unknown' to NaN earlier), so turn them back into a category
bank_cat['job'] = bank_cat['job'].replace({np.nan:'unknown'})
bank_cat['job'].value_counts()

# with the NaNs replaced, fit works
onehot.fit(bank_cat)

OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='error', sparse=False)

We now introduce a change: We replace the value for `job` at the second row with `'data scientist'` (assuming that's a new category). Note that `job` is of type `object`, so we can set it to anything we want.

In [159]:
bank_cat.loc[1, 'job'] = 'data scientist' # introduce a category unseen when we ran fit
bank_cat.head()

,job,education
0,managerial,tertiary
1,data scientist,secondary
2,entrepreneur,secondary
3,blue-collar,unknown
4,unknown,unknown


The important point here is that we introduce this additional category **after** we ran `fit` on the one-hot encoder above.

Now let's see what happens if we try to run `transform` on the data to one-hot encode the features. If you run the code below you'll notice that we get an error. What is the error for?

In [160]:
col_names = onehot.get_feature_names(bank_cat.columns)
bank_onehot =  pd.DataFrame(onehot.transform(bank_cat), columns = col_names)
bank_onehot.head()

ValueError: Found unknown categories ['data scientist'] in column 0 during transform

Is it a good thing that we got an error? The answer is it depends: 

- If we are okay with letting new categories slip through, we can return to where we initialized `OneHotEncoder` and set `handle_unknown = 'ignore'` (the default value is `'error'`). Make this change and rerun the code. What is the one-hot encoded value for `job` at the row that we changed?
- If we want to make sure that we preserve **data integrity**, so that the data we call `transform` on matches the schema of the data we ran `fit` on, then we want errors like this to stop us in our tracks so we have a chance to see why the data changed.
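
As a hedged sketch of the first option, here is a toy example (made-up data, not the bank data) of what `handle_unknown = 'ignore'` does with a category that was not seen during `fit`. We leave the encoder's output sparse and call `.toarray()`, which avoids the `sparse` argument altogether:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'job': ['admin.', 'technician', 'retired']})
new = pd.DataFrame({'job': ['technician', 'data scientist']})  # second value is unseen

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)                          # categories are learned from `train` only
encoded = enc.transform(new).toarray()  # default output is sparse; make it dense

print(enc.categories_[0].tolist())
print(encoded)
```

The row holding the unseen category is encoded as all zeros, so it passes through silently and belongs to none of the known groups.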

### End of exercise

So we saw that using `fit` and `transform`, we can impose a sort of data integrity at training time and enforce it at transform time, and this is true even if the column is of type `object`. This is very similar to how a column of type `category` works. In fact, if `job` were of type `category` instead of `object`, then we would not have been able to add a new category on the fly, and we would have caught the error pointed out in the above exercise earlier.
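
To tie this back to the earlier remark about `fit_transform`, the sketch below contrasts the two behaviors on toy data (all values are made up). Fitting once and transforming later flags the unseen categories, whereas calling `fit_transform` on the new data quietly learns a different set of columns:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'job': ['admin.', 'technician', 'admin.']})
new = pd.DataFrame({'job': ['technician', 'data scientist', 'retired']})

enc = OneHotEncoder()  # handle_unknown='error' is the default
enc.fit(train)

try:
    enc.transform(new)  # unseen categories are rejected
    caught = False
except ValueError as err:
    caught = True
    print(err)

train_width = len(enc.categories_[0])                    # columns learned from `train`
refit_width = OneHotEncoder().fit_transform(new).shape[1]  # columns learned from `new`
print(train_width, refit_width)
```

The two widths differ, so a model trained on the first encoding could not be applied to the second one.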