## Feature engineering - categorical features

The target vector in a classification problem may represent its categories as strings, but categorical features must be converted to numbers before most _scikit-learn_ estimators can use them.

There are two major types of categorical features:
1. Ordinal features: use `OrdinalEncoder`
2. Nominal features: use `OneHotEncoder` (it also encodes non-string features, which `DictVectorizer()` does not)

**Question:** What is the difference between ordinal and nominal features? Can you give examples?

**Answer:** ...

In [27]:
import pandas as pd

In [28]:
data = pd.DataFrame([
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
])

In [29]:
data

| | price | rooms | neighborhood |
|---|---|---|---|
| 0 | 850000 | 4 | Queen Anne |
| 1 | 700000 | 3 | Fremont |
| 2 | 650000 | 3 | Wallingford |
| 3 | 600000 | 2 | Fremont |


### OrdinalEncoder
**IMPORTANT:** Neighborhoods are _nominal_ and should not be encoded with an `OrdinalEncoder`. We do this here only for comparison with `OneHotEncoder`.

In [30]:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
enc.fit(data[['neighborhood']])

In [31]:
enc.categories_

[array(['Fremont', 'Queen Anne', 'Wallingford'], dtype=object)]

In [32]:
enc.transform(data[['neighborhood']])

array([[1.],
       [0.],
       [2.],
       [0.]])

In [33]:
enc.inverse_transform([[1]])

array([['Queen Anne']], dtype=object)
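
For a feature that really is ordinal, the alphabetical order that `OrdinalEncoder` picks by default is usually not the order we want. A minimal sketch, assuming a made-up `size` feature (not part of the housing data), passes the intended order through the `categories` parameter:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature: the levels have a natural order
sizes = pd.DataFrame({'size': ['medium', 'small', 'large', 'small']})

# Without `categories`, the levels would be sorted alphabetically
# (large < medium < small), which does not match their meaning
size_enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
size_codes = size_enc.fit_transform(sizes)
print(size_codes.ravel())
```

The codes now respect small < medium < large, so a model that compares the numbers sees the intended ranking.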

### OneHotEncoder 

In [34]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)
enc.fit(data[['neighborhood']])

In [35]:
enc.get_feature_names_out()

array(['neighborhood_Fremont', 'neighborhood_Queen Anne',
       'neighborhood_Wallingford'], dtype=object)

In [36]:
enc.transform(data[['neighborhood']])

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [37]:
data_enc = pd.DataFrame(enc.transform(data[['neighborhood']]), columns=enc.get_feature_names_out())
data_enc

| | neighborhood_Fremont | neighborhood_Queen Anne | neighborhood_Wallingford |
|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 |


**Question:** Can you add `data_enc` to the original `data`, dropping the `'neighborhood'` column?

In [38]:
data.head()

| | price | rooms | neighborhood |
|---|---|---|---|
| 0 | 850000 | 4 | Queen Anne |
| 1 | 700000 | 3 | Fremont |
| 2 | 650000 | 3 | Wallingford |
| 3 | 600000 | 2 | Fremont |


In [39]:
# Answer: drop 'neighborhood' and append the one-hot columns side by side
data2 = pd.concat([data.drop(columns="neighborhood"), data_enc], axis=1)
data2

| | price | rooms | neighborhood_Fremont | neighborhood_Queen Anne | neighborhood_Wallingford |
|---|---|---|---|---|---|
| 0 | 850000 | 4 | 0.0 | 1.0 | 0.0 |
| 1 | 700000 | 3 | 1.0 | 0.0 | 0.0 |
| 2 | 650000 | 3 | 0.0 | 0.0 | 1.0 |
| 3 | 600000 | 2 | 1.0 | 0.0 | 0.0 |


In practice, we would use `ColumnTransformer` in a pipeline. We will see this later.

## One-hot encoding (cont'd)

We have already seen how one-hot encoding transforms a categorical feature column into multiple output columns containing 0s and 1s.

In [40]:
# create a DataFrame with an integer feature and a categorical string feature
demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
display(demo_df)

| | Integer Feature | Categorical Feature |
|---|---|---|
| 0 | 0 | socks |
| 1 | 1 | fox |
| 2 | 2 | socks |
| 3 | 1 | box |


In [41]:
from sklearn.preprocessing import OneHotEncoder
# Setting sparse_output=False means OneHotEncoder will return a numpy array, not a sparse matrix
ohe = OneHotEncoder(sparse_output=False)
print(ohe.fit_transform(demo_df))
# ------------------------------------------------------------------------
# first three columns of the output:
# one-hot encoding of 'Integer Feature' (values 0, 1, 2)
# last three columns of the output:
# one-hot encoding of 'Categorical Feature' (box, fox, socks)
# ------------------------------------------------------------------------

[[1. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0.]]


In [42]:
print(ohe.get_feature_names_out())

['Integer Feature_0' 'Integer Feature_1' 'Integer Feature_2'
 'Categorical Feature_box' 'Categorical Feature_fox'
 'Categorical Feature_socks']


### One-hot encoding a large dataset

As an example, we will use the adult income dataset, derived from the 1994 United States census database. The task is to predict whether a worker's income is over \$50,000 or not. The features include each worker's age, type of employment (self-employed, private industry employee, government employee, etc.), education, gender, working hours per week, occupation, and more.

In [43]:
import mglearn
import os
# The file has no headers naming the columns, so we pass header=None
# and provide the column names explicitly in "names"
adult_path = os.path.join(mglearn.datasets.DATA_PATH, "adult.data")
data = pd.read_csv(
    adult_path, 
    header=None, 
    index_col=False,
    skipinitialspace=True,  # strip the space that follows each comma
    names=['age', 'workclass', 'fnlwgt', 'education',  'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income'])
# For illustration purposes, we only select some of the columns
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]
# display() renders the DataFrame as a formatted table within the Jupyter notebook
display(data.head())

| | age | workclass | education | gender | hours-per-week | occupation | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |


In this dataset, __age__ and __hours-per-week__ are continuous features, which we know how to treat. The __workclass__, __education__, __gender__, and __occupation__ features, however, are categorical: each takes its value from a fixed list of possibilities rather than a range, and denotes a qualitative property rather than a quantity.

In [44]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   education       32561 non-null  object
 3   gender          32561 non-null  object
 4   hours-per-week  32561 non-null  int64 
 5   occupation      32561 non-null  object
 6   income          32561 non-null  object
dtypes: int64(2), object(5)
memory usage: 1.7+ MB


In [45]:
ohe = OneHotEncoder(sparse_output=False)
encoded_df = ohe.fit_transform(data['education'].values.reshape(-1,1))
print(type(encoded_df))
print(encoded_df.shape)

<class 'numpy.ndarray'>
(32561, 16)


One column goes in and 16 columns come out, which means that the education column has 16 distinct values.

In [46]:
print(ohe.get_feature_names_out(input_features=['education']))

['education_10th' 'education_11th' 'education_12th' 'education_1st-4th'
 'education_5th-6th' 'education_7th-8th' 'education_9th'
 'education_Assoc-acdm' 'education_Assoc-voc' 'education_Bachelors'
 'education_Doctorate' 'education_HS-grad' 'education_Masters'
 'education_Preschool' 'education_Prof-school' 'education_Some-college']


Using `pd.get_dummies()` instead:

In [47]:
encoded_df = pd.get_dummies(data['education'])
print(type(encoded_df))
print(encoded_df.shape)

<class 'pandas.core.frame.DataFrame'>
(32561, 16)


In [48]:
encoded_df.head()

| | 10th | 11th | 12th | 1st-4th | 5th-6th | 7th-8th | 9th | Assoc-acdm | Assoc-voc | Bachelors | Doctorate | HS-grad | Masters | Preschool | Prof-school | Some-college |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False |
| 3 | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False |
