## My blog post on dealing with categorical data and a hungry GF

In [1]:
#First import necessary libraries
import pandas as pd
import numpy as np

The data records what happened each time I asked my GF what she wanted to eat.
It was gathered from years of experience asking that question.

In [2]:
#Import data into df
df = pd.read_csv('decisiontree.csv')

In [3]:
#Look at initial df
df.head()

Unnamed: 0,Do_you_want_to_eat?,What_do_you_want_to_eat?,Fastfood,Restaurant,Money_spent,Choice
0,n,,,,0,Nothing
1,y,Food,Burger,,15,Food
2,y,IDK,Chicken,,20,Food
3,y,Anything,Tacos,,10,Food
4,y,You_pick,,korean,80,IDLT


As you can see, most of the data is categorical. We need to deal with this before doing any modeling.

In [4]:
#Get info regarding df
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 6 columns):
Do_you_want_to_eat?         11 non-null object
What_do_you_want_to_eat?    10 non-null object
Fastfood                    3 non-null object
Restaurant                  7 non-null object
Money_spent                 11 non-null int64
Choice                      11 non-null object
dtypes: int64(1), object(5)
memory usage: 608.0+ bytes
None


As you can see, 5 of the 6 columns have dtype object, which here means categorical data.

In [5]:
#Select only the object (categorical) columns
print(df.select_dtypes(include=['object']).head())


  Do_you_want_to_eat? What_do_you_want_to_eat? Fastfood Restaurant   Choice
0                   n                      NaN      NaN        NaN  Nothing
1                   y                     Food   Burger        NaN     Food
2                   y                      IDK  Chicken        NaN     Food
3                   y                 Anything    Tacos        NaN     Food
4                   y                 You_pick      NaN     korean     IDLT


4 possible decisions

In [6]:
print(df['Fastfood'].value_counts())  

Burger     1
Chicken    1
Tacos      1
Name: Fastfood, dtype: int64


There are 3 options for fast food, so this column has a lower cardinality than the What_do_you_want_to_eat? column.
We need to encode this categorical data into something we can use for any modeling we wish to do.
I decided to show a few different encoding techniques for dealing with it.

In [7]:
##Check categories for unique values and counts to find out cardinality 
print(df['Do_you_want_to_eat?'].value_counts())

y    10
n     1
Name: Do_you_want_to_eat?, dtype: int64


Since there are only 2 unique values, I can encode this column the old-fashioned way, with a plain pandas mapping.

In [8]:
label = {'y':1, 'n':0}
df['Do_you_want_to_eat?'] = df['Do_you_want_to_eat?'].map(label)
df.head()

Unnamed: 0,Do_you_want_to_eat?,What_do_you_want_to_eat?,Fastfood,Restaurant,Money_spent,Choice
0,0,,,,0,Nothing
1,1,Food,Burger,,15,Food
2,1,IDK,Chicken,,20,Food
3,1,Anything,Tacos,,10,Food
4,1,You_pick,,korean,80,IDLT


Category Encoders is a set of scikit-learn-style transformers for encoding categorical variables as numbers using different techniques.  It is easy to use and works well with modeling.  
Install it using  
```pip install category_encoders```  
or  
```conda install -c conda-forge category_encoders```

In [9]:
#Importing category encoder library
import category_encoders as ce

The next column I will check for unique values and cardinality is What_do_you_want_to_eat?

In [10]:
##Check categories for unique values and counts to find out cardinality 
print(df['What_do_you_want_to_eat?'].value_counts())

Anything    6
IDK         2
You_pick    1
Food        1
Name: What_do_you_want_to_eat?, dtype: int64


As you can see, there are 4 unique values, so I will use one of the classic encoders, one-hot, to encode this column.
One-hot gives each value its own column, which holds a 1 if the row has that value and a 0 if it does not.

In [11]:
encoder = ce.OneHotEncoder(cols=['What_do_you_want_to_eat?'])
df = encoder.fit_transform(df)
df.head()

Unnamed: 0,Do_you_want_to_eat?,What_do_you_want_to_eat?_1,What_do_you_want_to_eat?_2,What_do_you_want_to_eat?_3,What_do_you_want_to_eat?_4,What_do_you_want_to_eat?_5,Fastfood,Restaurant,Money_spent,Choice
0,0,1,0,0,0,0,,,0,Nothing
1,1,0,1,0,0,0,Burger,,15,Food
2,1,0,0,1,0,0,Chicken,,20,Food
3,1,0,0,0,1,0,Tacos,,10,Food
4,1,0,0,0,0,1,,korean,80,IDLT


It made 5 new one-hot encoded columns: one for each of the 4 options, plus a 5th for the null value where no option was chosen.
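
For comparison, plain pandas can produce a similar layout with `pd.get_dummies`. This is only a sketch on a made-up Series (the `answers` data is hypothetical), and it is not what `category_encoders` does internally. The `dummy_na=True` flag is what adds the extra column for missing answers.

```python
import pandas as pd

# Hypothetical answers, including one missing value.
answers = pd.Series(
    ["Food", None, "IDK", "Anything", "Anything"],
    name="what",
    dtype="object",
)

# One column per distinct answer, plus one extra column for missing values.
one_hot = pd.get_dummies(answers, prefix="what", dummy_na=True, dtype=int)
print(one_hot)
```

One visible difference: `get_dummies` names the new columns after the values themselves, while the encoder output above numbers them `_1` to `_5`.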

In [12]:
##Check categories for unique values and counts to find out cardinality 
print(df['Fastfood'].value_counts())

Burger     1
Chicken    1
Tacos      1
Name: Fastfood, dtype: int64


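Another classic encoder is the binary encoder.  It first assigns each category an integer, then writes that integer out in binary digits, with one column per digit.  Categories end up sharing columns, so some information is lost compared with one-hot, but there are fewer dimensions.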
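
To make the mechanics concrete, here is a hand-rolled sketch of the same idea in plain pandas. The `binary_encode` helper is my own hypothetical function, not part of `category_encoders`, and it ignores missing values, so its output will not match the library's column for column.

```python
import pandas as pd

def binary_encode(values: pd.Series) -> pd.DataFrame:
    """Hand-rolled binary encoding: category -> integer -> one column per bit."""
    # Step 1: give each distinct category an integer, starting at 1,
    # numbered in order of first appearance.
    codes = pd.factorize(values)[0] + 1
    # Step 2: work out how many bits the largest integer needs.
    n_bits = int(codes.max()).bit_length()
    # Step 3: peel off one bit per column, most significant bit first.
    bits = {
        f"{values.name}_{i}": (codes >> (n_bits - 1 - i)) & 1
        for i in range(n_bits)
    }
    return pd.DataFrame(bits, index=values.index)

# Hypothetical sample with three distinct categories.
fastfood = pd.Series(["Burger", "Chicken", "Tacos", "Burger"], name="Fastfood")
encoded = binary_encode(fastfood)
print(encoded)
```

Three categories fit in two bit columns here, where one-hot would need three. The saving grows with cardinality, since the column count scales with the logarithm of the number of categories.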

In [13]:
encoder = ce.BinaryEncoder(cols=['Fastfood'])
df = encoder.fit_transform(df)

df.head()

Unnamed: 0,Do_you_want_to_eat?,What_do_you_want_to_eat?_1,What_do_you_want_to_eat?_2,What_do_you_want_to_eat?_3,What_do_you_want_to_eat?_4,What_do_you_want_to_eat?_5,Fastfood_0,Fastfood_1,Fastfood_2,Restaurant,Money_spent,Choice
0,0,1,0,0,0,0,0,0,1,,0,Nothing
1,1,0,1,0,0,0,0,1,0,,15,Food
2,1,0,0,1,0,0,0,1,1,,20,Food
3,1,0,0,0,1,0,1,0,0,,10,Food
4,1,0,0,0,0,1,0,0,1,korean,80,IDLT


Next we will deal with the Restaurant column, starting with its unique values.

In [14]:
##Check categories for unique values and counts to find out cardinality 
print(df['Restaurant'].value_counts())

italian       1
korean        1
mexican       1
french        1
japanese      1
vietnamese    1
american      1
Name: Restaurant, dtype: int64


Another classic encoder is the ordinal encoder.  It uses a single column of integers to represent the classes.  Unless you supply a mapping, the classes are assumed to have no true order and the integers are assigned arbitrarily.

In [15]:
encoder = ce.OrdinalEncoder(cols=['Restaurant'])
df = encoder.fit_transform(df)

df.head()

Unnamed: 0,Do_you_want_to_eat?,What_do_you_want_to_eat?_1,What_do_you_want_to_eat?_2,What_do_you_want_to_eat?_3,What_do_you_want_to_eat?_4,What_do_you_want_to_eat?_5,Fastfood_0,Fastfood_1,Fastfood_2,Restaurant,Money_spent,Choice
0,0,1,0,0,0,0,0,0,1,1,0,Nothing
1,1,0,1,0,0,0,0,1,0,1,15,Food
2,1,0,0,1,0,0,0,1,1,1,20,Food
3,1,0,0,0,1,0,1,0,0,1,10,Food
4,1,0,0,0,0,1,0,0,1,2,80,IDLT


The encoder assigned an integer to each value and kept everything in a single column.
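
If the categories do have a meaningful order, you can supply it yourself instead of accepting arbitrary integers. Below is a minimal pandas-only sketch of both options. The `restaurants` sample and the `price_rank` ordering are invented for illustration and say nothing about the real data.

```python
import pandas as pd

# Hypothetical sample of restaurant types.
restaurants = pd.Series(["korean", "italian", "french", "korean"], name="Restaurant")

# Option 1: arbitrary integers, numbered by first appearance (no order implied).
arbitrary_codes = pd.factorize(restaurants)[0] + 1

# Option 2: an explicit, hand-picked mapping for when the order matters.
# This ranking is made up (say, cheapest to priciest).
price_rank = {"italian": 1, "korean": 2, "french": 3}
ranked_codes = restaurants.map(price_rank)

print(pd.DataFrame({
    "Restaurant": restaurants,
    "arbitrary": arbitrary_codes,
    "ranked": ranked_codes,
}))
```

With either option, a model that reads the integers as magnitudes will treat 3 as "more" than 1, so the explicit mapping is the safer choice whenever a real order exists.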

In [19]:
##Check categories for unique values and counts to find out cardinality 
print(df['Choice'].value_counts())

IDLT       7
Food       3
Nothing    1
Name: Choice, dtype: int64
