Here are five techniques for encoding categorical features as numbers:

* Mapping Method
* Ordinal Encoding
* Label Encoding
* Pandas Dummies
* OneHot Encoding

In [1]:
# Loading the dataset
import pandas as pd
import seaborn as sns

In [2]:
titanic = sns.load_dataset('titanic')
titanic.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [3]:
titanic.isnull().sum()

survived,0
pclass,0
sex,0
age,177
sibsp,0
parch,0
fare,0
embarked,2
class,0
who,0
adult_male,0
deck,688
embark_town,2
alive,0
alone,0


In [4]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


Some features, such as deck and class, have the category dtype, while others, such as sex and embarked, are stored as object.

In [5]:
titanic['sex'].value_counts()

sex,count
male,577
female,314


In [6]:
titanic['embarked'].value_counts()

embarked,count
S,644
C,168
Q,77


In [7]:
titanic['class'].value_counts()

class,count
Third,491
First,216
Second,184


In [8]:
titanic['who'].value_counts()

who,count
man,537
woman,271
child,83


In [9]:
titanic.adult_male.value_counts()

adult_male,count
True,537
False,354


In [10]:
titanic.embark_town.value_counts()

embark_town,count
Southampton,644
Cherbourg,168
Queenstown,77


In [11]:
titanic.alone.value_counts()

alone,count
True,537
False,354


In [12]:
titanic.deck.value_counts()

deck,count
C,59
B,47
D,33
E,32
A,15
F,13
G,4


# 1. Mapping Method

The mapping method is a straightforward way to encode a categorical feature that has only a few categories. I apply it to the class feature, which has three categories: First, Second, and Third. I create a dictionary whose keys are the categories and whose values are the numeric codes I want, then pass it to map on the column.

In [13]:
map_dict = {
    'First':0,
    'Second':1,
    'Third':2
}

In [14]:
titanic['class'] = titanic['class'].map(map_dict)
titanic.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,2,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,0,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,2,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,0,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,2,man,True,,Southampton,no,True


In [15]:
titanic['class'].value_counts()

class,count
2,491
0,216
1,184


The class feature is encoded.
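
One thing to watch with map: any category that is missing from the dictionary becomes NaN. The sketch below uses a small made-up Series rather than the Titanic data, and the names toy, class_codes, and unmapped are only illustrative. It shows the behaviour and one possible way to check for it.

```python
import pandas as pd

# Toy data for illustration only; 'Crew' is deliberately absent from the mapping
toy = pd.Series(['First', 'Third', 'Second', 'Crew'])

class_codes = {'First': 0, 'Second': 1, 'Third': 2}

encoded = toy.map(class_codes)

# Categories not found in the dictionary come back as NaN
unmapped = toy[encoded.isna()].unique().tolist()

print(encoded.tolist())
print(unmapped)
```

Checking for unmapped values right after the call is a cheap way to catch a typo in the dictionary keys.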

# 2. Ordinal Encoding

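OrdinalEncoder from scikit-learn also converts categories into numbers, and it can encode several columns in one call. By default it numbers the categories of each column in sorted order.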

In [16]:
from sklearn.preprocessing import OrdinalEncoder

# Encoding categorical features
encoder = OrdinalEncoder()
titanic[['alive', 'alone']] = encoder.fit_transform(titanic[['alive', 'alone']])

In [17]:
titanic.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,2,man,True,,Southampton,0.0,0.0
1,1,1,female,38.0,1,0,71.2833,C,0,woman,False,C,Cherbourg,1.0,0.0
2,1,3,female,26.0,0,0,7.925,S,2,woman,False,,Southampton,1.0,1.0
3,1,1,female,35.0,1,0,53.1,S,0,woman,False,C,Southampton,1.0,0.0
4,0,3,male,35.0,0,0,8.05,S,2,man,True,,Southampton,0.0,1.0


In [18]:
encoder.categories_

[array(['no', 'yes'], dtype=object), array([False,  True])]

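Warning: older scikit-learn releases raise an error when OrdinalEncoder meets missing values, while recent releases pass NaN through unchanged by default. The two columns encoded here have no missing values, so the issue does not arise.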
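
Since the default numbering follows sorted order, it can help to spell out the order when the categories have a natural ranking. The following is a minimal sketch on made-up data (the frames train and new and the 'Crew' category are assumptions for illustration). It also shows the handle_unknown and unknown_value parameters, which assign a chosen code to categories that were not seen during fit.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data for illustration only
train = pd.DataFrame({'class': ['Third', 'First', 'Second', 'First']})
new = pd.DataFrame({'class': ['Second', 'Crew']})  # 'Crew' was never seen during fit

# List the categories in the order they should be numbered (Third=0, Second=1, First=2)
# and send anything unseen to -1 instead of raising an error
ordinal = OrdinalEncoder(
    categories=[['Third', 'Second', 'First']],
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)

train_codes = ordinal.fit_transform(train[['class']])
new_codes = ordinal.transform(new[['class']])

print(train_codes.ravel())
print(new_codes.ravel())
```

Fitting on one frame and calling transform on another keeps the codes consistent between training data and later data.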

# 3. Label Encoding

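LabelEncoder works on a single column at a time, and scikit-learn intends it for target labels rather than input features. Missing values complicate label encoding too, so to keep things simple I drop every row that contains a missing value.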

In [19]:
titanic_cleaned = titanic.dropna()

In [20]:
titanic_cleaned.isnull().sum()

survived,0
pclass,0
sex,0
age,0
sibsp,0
parch,0
fare,0
embarked,0
class,0
who,0
adult_male,0
deck,0
embark_town,0
alive,0
alone,0


In [21]:
from sklearn.preprocessing import LabelEncoder

# Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
titanic_cleaned = titanic_cleaned.copy()

encoder = LabelEncoder()
titanic_cleaned['deck'] = encoder.fit_transform(titanic_cleaned['deck'])

titanic_cleaned.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,0,woman,False,2,Cherbourg,1.0,0.0
3,1,1,female,35.0,1,0,53.1,S,0,woman,False,2,Southampton,1.0,0.0
6,0,1,male,54.0,0,0,51.8625,S,0,man,True,4,Southampton,0.0,1.0
10,1,3,female,4.0,1,1,16.7,S,2,child,False,6,Southampton,1.0,0.0
11,1,1,female,58.0,0,0,26.55,S,0,woman,False,2,Southampton,1.0,1.0


In [22]:
encoder.classes_

array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)

In [23]:
titanic_cleaned['deck'].value_counts()

deck,count
2,51
1,43
3,31
4,30
0,12
5,11
6,4
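
The codes are positions in classes_, so they can be translated back to the original labels. Here is a small sketch on a made-up list of deck letters (decks, lookup, and restored are illustrative names) that builds the label-to-code lookup and reverses the encoding with inverse_transform.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values for illustration only
decks = ['C', 'E', 'G', 'C', 'A', 'B', 'D', 'F']

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(decks)

# classes_ holds the sorted original labels; the position of each label is its code
lookup = {label: code for code, label in enumerate(label_encoder.classes_)}

# inverse_transform turns the codes back into the original labels
restored = label_encoder.inverse_transform(codes).tolist()

print(lookup)
print(restored)
```

Keeping the fitted encoder around is what makes the reverse translation possible, so it is worth giving each encoder its own variable name when several columns are encoded.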


# 4. Pandas Dummies

In [24]:
titanic = pd.get_dummies(titanic, columns=['who'], drop_first=True)
titanic.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,adult_male,deck,embark_town,alive,alone,who_man,who_woman
0,0,3,male,22.0,1,0,7.25,S,2,True,,Southampton,0.0,0.0,True,False
1,1,1,female,38.0,1,0,71.2833,C,0,False,C,Cherbourg,1.0,0.0,False,True
2,1,3,female,26.0,0,0,7.925,S,2,False,,Southampton,1.0,1.0,False,True
3,1,1,female,35.0,1,0,53.1,S,0,False,C,Southampton,1.0,0.0,False,True
4,0,3,male,35.0,0,0,8.05,S,2,True,,Southampton,0.0,1.0,True,False
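
To see what drop_first=True does, the sketch below compares the output with and without it on a tiny made-up frame (toy, full, and reduced are illustrative names). The first category in sorted order, child, loses its column and is represented by zeros in the remaining ones. The dtype=int argument is only there to get 0/1 instead of True/False.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({'who': ['man', 'woman', 'child', 'man']})

# One column per category
full = pd.get_dummies(toy, columns=['who'], dtype=int)

# drop_first=True removes the first category's column (here: who_child)
reduced = pd.get_dummies(toy, columns=['who'], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced.iloc[2].tolist())  # the 'child' row
```

Dropping one column removes redundant information, since the dropped category can always be inferred from the others.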


# 5. One Hot Encoding

In [26]:
from sklearn.preprocessing import OneHotEncoder

# Initialize encoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform
town_encoded = encoder.fit_transform(titanic_cleaned[['embark_town']])

# Create a DataFrame with proper column names
town_df = pd.DataFrame(town_encoded, columns=encoder.categories_[0], index=titanic_cleaned.index)

# Drop original column and join encoded columns
titanic_cleaned = titanic_cleaned.drop('embark_town', axis=1).join(town_df)

titanic_cleaned.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,alive,alone,Cherbourg,Queenstown,Southampton
1,1,1,female,38.0,1,0,71.2833,C,0,woman,False,2,1.0,0.0,1.0,0.0,0.0
3,1,1,female,35.0,1,0,53.1,S,0,woman,False,2,1.0,0.0,0.0,0.0,1.0
6,0,1,male,54.0,0,0,51.8625,S,0,man,True,4,0.0,1.0,0.0,0.0,1.0
10,1,3,female,4.0,1,1,16.7,S,2,child,False,6,1.0,0.0,0.0,0.0,1.0
11,1,1,female,58.0,0,0,26.55,S,0,woman,False,2,1.0,1.0,0.0,0.0,1.0
