In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("C:/Users/Karthik.Iyer/Downloads/AccelerateAI/DV_EDA/DV_and_EDA-main/data/insurance.csv")
df.head()

Unnamed: 0,age,gender,bmi,children,smoker,geography,charges,EducationLevel
0,54,female,47.41,0,yes,East,63770.42801,PhD
1,45,male,30.36,0,yes,East,62592.87309,PhD
2,52,male,34.485,3,yes,West,60021.39897,Master
3,31,female,38.095,1,yes,North,58571.07448,PhD
4,33,female,35.53,0,yes,West,55135.40209,PhD


In [2]:
# Check if there are any missing values first
df.isnull().sum()

age               0
gender            0
bmi               0
children          0
smoker            0
geography         0
charges           0
EducationLevel    0
dtype: int64

There are no missing values in any column.

## Categorical Encoding

Categorical features stored as object (text) data types must be encoded as numbers before we apply models, because most machine learning algorithms operate on numeric input rather than text.

Widely used techniques are:
- One-Hot Encoding
- Label Encoding

### One-Hot Encoding

One-Hot Encoding creates one additional feature for each unique value in a categorical feature. These new features are dummy variables, and the get_dummies function from pandas can be used to create them. Each category is represented as a one-hot vector with the value 1 where the row belongs to that category and 0 otherwise.

This method is preferred when the categorical feature has no intrinsic order, that is, when the feature is nominal. In the insurance dataset above, the features 'gender', 'smoker' and 'geography' are nominal categorical features.

In [3]:
# Make a list of categorical nominal features
catvar_nominal = ['gender', 'smoker', 'geography']

# One-Hot Encoding (dtype=int gives 1/0 columns; pandas 2.0+ defaults to True/False)
one_hot_encoded_data = pd.get_dummies(df, columns=catvar_nominal, dtype=int)
one_hot_encoded_data

Unnamed: 0,age,bmi,children,charges,EducationLevel,gender_female,gender_male,smoker_no,smoker_yes,geography_East,geography_North,geography_South,geography_West
0,54,47.410,0,63770.42801,PhD,1,0,0,1,1,0,0,0
1,45,30.360,0,62592.87309,PhD,0,1,0,1,1,0,0,0
2,52,34.485,3,60021.39897,Master,0,1,0,1,0,0,0,1
3,31,38.095,1,58571.07448,PhD,1,0,0,1,0,1,0,0
4,33,35.530,0,55135.40209,PhD,1,0,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1333,18,34.100,0,1137.01100,HighSchool,0,1,1,0,1,0,0,0
1334,18,33.660,0,1136.39940,HighSchool,0,1,1,0,1,0,0,0
1335,18,33.330,0,1135.94070,HighSchool,0,1,1,0,1,0,0,0
1336,18,30.140,0,1131.50660,HighSchool,0,1,1,0,1,0,0,0


This technique works best when a nominal feature has only a few categories, because every additional category adds another column to the encoded data.
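
One way to judge this in advance is to count the unique values per feature, since that count is the number of columns one-hot encoding will create. A minimal sketch on a hand-made frame (toy and its values are assumptions for illustration, not the real data):

```python
import pandas as pd

# Toy frame (hypothetical) with two nominal features
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "geography": ["East", "East", "West", "North"],
})

# One new column is created per unique value in each encoded feature
new_cols_per_feature = toy.nunique()
print(new_cols_per_feature)

encoded = pd.get_dummies(toy, columns=["gender", "geography"])

# The width of the encoded frame equals the sum of the unique counts
print(encoded.shape[1], new_cols_per_feature.sum())
```

Here 'gender' contributes 2 columns and 'geography' contributes 3, so the encoded frame has 5 columns. A feature with hundreds of unique values would add hundreds of columns in the same way.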

### Label Encoding

In Label Encoding, each category in the feature is assigned a unique integer. Unlike One-Hot Encoding, which suits nominal categorical features, Label Encoding is commonly used for ordinal categorical features, where the categories have an intrinsic order. Note that scikit-learn's LabelEncoder assigns the integers in alphabetical order of the category names, so the codes follow the intrinsic order only when the alphabetical order happens to match it.

In the insurance dataset above, the feature 'EducationLevel' is an ordinal categorical feature, so Label Encoding can be applied here.

In [4]:
from sklearn.preprocessing import LabelEncoder

In [5]:
# Make a list of categorical ordinal features
catvar_ordinal = ['EducationLevel']

In [6]:
# Creating instance of labelencoder
labelencoder = LabelEncoder()

# Assigning numerical values and storing in another column
df['EducationLevel_cat'] = labelencoder.fit_transform(df['EducationLevel'])

df.head()

Unnamed: 0,age,gender,bmi,children,smoker,geography,charges,EducationLevel,EducationLevel_cat
0,54,female,47.41,0,yes,East,63770.42801,PhD,3
1,45,male,30.36,0,yes,East,62592.87309,PhD,3
2,52,male,34.485,3,yes,West,60021.39897,Master,2
3,31,female,38.095,1,yes,North,58571.07448,PhD,3
4,33,female,35.53,0,yes,West,55135.40209,PhD,3
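
Because the integers above come from alphabetical sorting, they may not reflect the real ranking of the education levels. When the order matters, one option is to state it explicitly. The sketch below uses an ordered pandas Categorical; the list education_order is an assumption, and "Bachelor" in particular is a hypothetical level that does not appear in the output shown above.

```python
import pandas as pd

# ASSUMPTION: the intended ranking, lowest to highest.
# "Bachelor" is hypothetical; only HighSchool, Master and PhD appear in the output above.
education_order = ["HighSchool", "Bachelor", "Master", "PhD"]

# Toy frame (hypothetical) with one row per level
toy = pd.DataFrame({"EducationLevel": ["PhD", "Master", "HighSchool", "Bachelor"]})

# An ordered Categorical assigns codes by position in education_order
ordered = pd.Categorical(toy["EducationLevel"], categories=education_order, ordered=True)
toy["EducationLevel_ord"] = ordered.codes
print(toy)

# For comparison: alphabetical sorting would place "Bachelor" before "HighSchool"
print(sorted(education_order))
```

scikit-learn's OrdinalEncoder accepts a categories argument that serves the same purpose, if staying within scikit-learn is preferred.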


## Summary:

#### We apply One-Hot Encoding when:

- The categorical feature is nominal (like 'gender', 'smoker', 'geography')
- The feature has only a few categories, so one-hot encoding adds only a few columns

#### We apply Label Encoding when:

- The categorical feature is ordinal (like 'EducationLevel')
- The number of categories is large, so one-hot encoding would create many columns and consume a lot of memory