In [1]:
%matplotlib inline
import sys
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import feature_extraction

# Turn off HTML table rendering so dataframes print as plain text
pd.set_option('display.notebook_repr_html', False)
sns.set_palette(['#00A99D', '#F5CA0C', '#B6129F', '#76620C', '#095C57'])
sys.version

'3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]'

## Example
We have a dataframe with four columns, of which only the Value column contains continuous data. To use this data for a machine learning task, we need to convert the categorical columns into binary one-hot columns: each categorical value gets its own column holding 1 or 0 to indicate whether that value applies to a particular row.
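
To make the idea concrete, here is a minimal hand-rolled sketch of one-hot encoding for a single column. The `city` series and the `City=<value>` naming are assumptions chosen for illustration; they are not taken from the CSV file.

```python
import pandas as pd

# Hypothetical toy column (values are illustrative, not loaded from the CSV)
city = pd.Series(['New York', 'London', 'Amsterdam', 'Amsterdam'], name='City')

# One indicator column per distinct value:
# 1 where the row has that value, 0 otherwise
categories = sorted(city.unique())
one_hot = pd.DataFrame(
    {f'City={c}': (city == c).astype(int) for c in categories}
)
print(one_hot)
```

Each row ends up with exactly one 1 across the three indicator columns, which is what "one-hot" refers to.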

In [2]:
# Load the original dataframe from csv
df = pd.read_csv('data/feature-engineering.csv')
df.head()

  Gender       City Attended     Value
0   girl   New York      Yes  0.991472
1    boy     London       No  0.980504
2    boy  Amsterdam      Yes  0.969145
3    boy  Amsterdam       No  0.968502
4    man     London      Yes  0.938684

## Binary One-Hot Encoding using DictVectorizer and Pandas
A few simple steps let us extract the categorical data from the dataframe and replace it with binary one-hot encoded columns. For this we use DictVectorizer from scikit-learn's [Feature Extraction](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) module.
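
As a rough, self-contained sketch of what DictVectorizer does, the snippet below encodes two hand-written records. The `records` list and the `vec_demo` name are hypothetical; `sparse=False` is used here only so the result prints as a dense array.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records, shaped like one dict per dataframe row
records = [
    {'City': 'London', 'Attended': 'Yes'},
    {'City': 'Amsterdam', 'Attended': 'No'},
]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix
vec_demo = DictVectorizer(sparse=False)
encoded = vec_demo.fit_transform(records)

# String values become '<feature>=<value>' columns, sorted by name
print(vec_demo.feature_names_)
print(encoded)
```

Every distinct string value gets its own column, so the width of the encoded array grows with the number of distinct values in the data.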

In [3]:
# Create a list of dictionaries, one per row, holding the categorical values
cat_columns = ['Gender', 'City', 'Attended']
cat_dict = df[cat_columns].to_dict(orient='records')
cat_dict[:5]

In [None]:
# Construct a DictVectorizer to transform our list of dictionaries
# into a binary one-hot encoded array for each row
vec = feature_extraction.DictVectorizer()
cat_vector = vec.fit_transform(cat_dict).toarray()
cat_vector[:5]

In [None]:
# Construct a separate dataframe with the one-hot encoded data
# and get the column names by calling get_feature_names_out
# (scikit-learn >= 1.0; older releases use get_feature_names instead)
df_vector = pd.DataFrame(cat_vector)
vector_columns = vec.get_feature_names_out()
vector_columns

In [None]:
# Drop the categorical columns and join the new one-hot 
# dataframe with the original dataframe
df_vector.columns = vector_columns
df_vector.index = df.index

df = df.drop(cat_columns, axis=1)
df = df.join(df_vector)
df.head()

In [None]:
df.describe()
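
For comparison, pandas can produce a similar result in one call with `pd.get_dummies`. This is an alternative sketch, not the approach used in the cells above; the `demo` dataframe is a hypothetical stand-in with illustrative values, and `prefix_sep='='` is chosen only to mimic DictVectorizer's column naming.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's dataframe (values are illustrative)
demo = pd.DataFrame({
    'Gender': ['girl', 'boy', 'boy'],
    'City': ['New York', 'London', 'Amsterdam'],
    'Attended': ['Yes', 'No', 'Yes'],
    'Value': [0.99, 0.98, 0.97],
})

# get_dummies drops the listed columns and appends one indicator column
# per distinct value, named '<column><prefix_sep><value>'
encoded_demo = pd.get_dummies(
    demo, columns=['Gender', 'City', 'Attended'], prefix_sep='=', dtype=int
)
print(encoded_demo.columns.tolist())
```

One difference to keep in mind: a fitted DictVectorizer remembers its columns and can encode new data consistently with `transform`, whereas `get_dummies` derives the columns from whatever values are present in the dataframe it is given.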

### Done!

#### Next: _Decision Tree Classifier_