In [9]:
import pandas as pd
import numpy as np

## Great Lecture 

Showing how one can do labeling for both ordered and unordered categorical variables, as well as one-hot encoding, very easily.

## Ordered Categorical

In [2]:
ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']

In [5]:
df = pd.DataFrame({'satisfaction': ['Mad', 'Happy', 'Unhappy', 'Neutral']})

In [7]:
# Current pandas expects the categories and ordering to be bundled in a CategoricalDtype
df.satisfaction = df.satisfaction.astype(pd.CategoricalDtype(categories=ordered_satisfaction, ordered=True))

### Category Codes

In [8]:
df.satisfaction.cat.codes

0   -1
1    3
2    1
3    2
dtype: int8

In [10]:
np.mean(df.satisfaction.cat.codes)

1.25
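Note that `'Mad'` is not in `ordered_satisfaction`, so it becomes a missing value and gets the code `-1`, which drags the mean above down. A minimal sketch of averaging only the valid codes follows; the names `s`, `codes`, and `valid_mean` are mine, not from the lecture:

```python
import pandas as pd

ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']
dtype = pd.CategoricalDtype(categories=ordered_satisfaction, ordered=True)

# 'Mad' is not one of the declared categories, so it is stored as NaN
s = pd.Series(['Mad', 'Happy', 'Unhappy', 'Neutral']).astype(dtype)

codes = s.cat.codes                      # -1 marks the missing entry
valid_mean = codes[codes >= 0].mean()    # average over real categories only

print(codes.tolist())   # [-1, 3, 1, 2]
print(valid_mean)       # 2.0
```

Whether to drop or impute the `-1` entries depends on the data; this only shows that the sentinel should not be treated as a real level.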

## Unordered

In [18]:
df = pd.DataFrame({'vertebrates':[
'Bird',
'Bird',
'Mammal',
'Fish',
'Amphibian',
'Reptile',
'Mammal',
], 'mammals':['Dog', 'Dog', 'Cat', 'Horse', 'Cow', 'Buffalo', 'Whale']})


In [19]:
df.vertebrates = df.vertebrates.astype("category")
df.vertebrates.cat.codes

0    1
1    1
2    3
3    2
4    0
5    4
6    3
dtype: int8
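To see which label each of these codes stands for, one can look at `cat.categories`. This is a small sketch using the same vertebrate values; `s` and `code_to_label` are illustrative names of my own:

```python
import pandas as pd

s = pd.Series(
    ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal']
).astype('category')

# The position of each label in cat.categories is its integer code
code_to_label = dict(enumerate(s.cat.categories))

print(code_to_label)
# {0: 'Amphibian', 1: 'Bird', 2: 'Fish', 3: 'Mammal', 4: 'Reptile'}
print(s.cat.ordered)  # False: the codes carry no meaningful order
```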

Notice that this time neither `ordered=True` nor a specific ordering list was passed in. Because of this, Pandas assigns codes to your nominal entries in alphabetical order. This approach is fine for getting your feet wet, but it still imposes an ordering on a list of categories that has none, which may or may not cause problems for you later. If you aren't getting the results you hoped for, or you are but would like to improve their accuracy further, a more precise encoding approach is to separate the distinct values out into individual boolean features:

>NOTE: I added a `mammals` column but did not one-hot encode it, to see whether `pd.get_dummies` keeps the other columns and automatically replaces the one that's being one-hot encoded. Indeed it does. Crap, how much time have I wasted not knowing this!

In [20]:
df = pd.get_dummies(df, columns=['vertebrates'])

In [21]:
df

| | mammals | vertebrates_Amphibian | vertebrates_Bird | vertebrates_Fish | vertebrates_Mammal | vertebrates_Reptile |
|---|---|---|---|---|---|---|
| 0 | Dog | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | Dog | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | Cat | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | Horse | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | Cow | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | Buffalo | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 6 | Whale | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
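The `0.0`/`1.0` floats in this output reflect the pandas version the notebook ran on; other versions may give integer or boolean dummies instead. Here is a sketch that requests floats explicitly, assuming a pandas version whose `pd.get_dummies` accepts the `dtype` parameter (the name `encoded` is mine):

```python
import pandas as pd

df = pd.DataFrame({
    'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal'],
    'mammals': ['Dog', 'Dog', 'Cat', 'Horse', 'Cow', 'Buffalo', 'Whale'],
})

# Only 'vertebrates' is expanded; 'mammals' is carried over untouched
encoded = pd.get_dummies(df, columns=['vertebrates'], dtype=float)

print(encoded.columns.tolist())
print(encoded['vertebrates_Bird'].tolist())  # [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```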


And of course, we can one-hot encode both columns at once:

In [22]:
df = pd.DataFrame({'vertebrates':[
'Bird',
'Bird',
'Mammal',
'Fish',
'Amphibian',
'Reptile',
'Mammal',
], 'mammals':['Dog', 'Dog', 'Cat', 'Horse', 'Cow', 'Buffalo', 'Whale']})

In [23]:
df = pd.get_dummies(df, columns=['vertebrates', 'mammals'])

In [24]:
df

| | vertebrates_Amphibian | vertebrates_Bird | vertebrates_Fish | vertebrates_Mammal | vertebrates_Reptile | mammals_Buffalo | mammals_Cat | mammals_Cow | mammals_Dog | mammals_Horse | mammals_Whale |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
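One thing worth knowing here: in `pd.get_dummies`, the second positional parameter is `prefix`, not `columns`. A list passed positionally therefore only sets the label prefixes of the encoded columns, in DataFrame column order. The sketch below contrasts the two keywords; the prefixes `animal_class` and `pet` are made-up names for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal'],
    'mammals': ['Dog', 'Dog', 'Cat', 'Horse', 'Cow', 'Buffalo', 'Whale'],
})

# columns= picks WHICH columns to encode; labels reuse the column names
by_keyword = pd.get_dummies(df, columns=['vertebrates', 'mammals'])

# prefix= only renames: one prefix per encoded column, matched by position
by_prefix = pd.get_dummies(df, prefix=['animal_class', 'pet'])

print(by_keyword.columns.tolist())
print(by_prefix.columns.tolist())
```

Spelling out `columns=` keeps each dummy label tied to the column it came from, regardless of how the DataFrame's columns happen to be ordered.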
