<h1>Handling categorical data</h1>

In [1]:
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

In [2]:
df = pd.read_csv("Titanic-Dataset.csv")
df.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
183,184,1,2,"Becker, Master. Richard F",male,1.0,2,1,230136,39.0,F4,S
523,524,1,1,"Hippach, Mrs. Louis Albert (Ida Sophia Fischer)",female,44.0,0,1,111361,57.9792,B18,C
432,433,1,2,"Louch, Mrs. Charles Alexander (Alice Adelaide ...",female,42.0,1,0,SC/AH 3085,26.0,,S
189,190,0,3,"Turcin, Mr. Stjepan",male,36.0,0,0,349247,7.8958,,S
741,742,0,1,"Cavendish, Mr. Tyrell William",male,36.0,1,0,19877,78.85,C46,S


<h1>1. One-Hot Encoding</h1>

<p>One-hot encoding converts each category value into its own column and assigns a binary value (0 or 1, shown as <code>False</code>/<code>True</code> by pandas 2.0 and later) that indicates whether the row belongs to that category. It is appropriate when the categories have no natural order or ranking.</p>

<h2>Before encoding</h2>

In [3]:
df.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.05,,S
206,207,0,3,"Backstrom, Mr. Karl Alfred",male,32.0,1,0,3101278,15.85,,S
586,587,0,2,"Jarvis, Mr. John Denzil",male,47.0,0,0,237565,15.0,,S
259,260,1,2,"Parrish, Mrs. (Lutie Davis)",female,50.0,0,1,230433,26.0,,S
428,429,0,3,"Flynn, Mr. James",male,,0,0,364851,7.75,,Q


<h2>After encoding</h2>

In [4]:
# Each listed column is replaced by one indicator column per category.
df_encoded = pd.get_dummies(df, columns=["Sex", "Pclass"])
df_encoded.sample(5)

Unnamed: 0,PassengerId,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_female,Sex_male,Pclass_1,Pclass_2,Pclass_3
365,366,0,"Adahl, Mr. Mauritz Nils Martin",30.0,0,0,C 7076,7.25,,S,False,True,False,False,True
359,360,1,"Mockler, Miss. Helen Mary ""Ellie""",,0,0,330980,7.8792,,Q,True,False,False,False,True
663,664,0,"Coleff, Mr. Peju",36.0,0,0,349210,7.4958,,S,False,True,False,False,True
465,466,0,"Goncalves, Mr. Manuel Estanslas",38.0,0,0,SOTON/O.Q. 3101306,7.05,,S,False,True,False,False,True
334,335,1,"Frauenthal, Mrs. Henry William (Clara Heinshei...",,1,0,PC 17611,133.65,,S,True,False,True,False,False
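
<p>As a minimal sketch of two optional <code>pd.get_dummies</code> parameters, the cell below uses a small made-up frame (the <code>toy</code> data is an assumption for illustration, not part of the Titanic file). <code>dtype=int</code> yields 0/1 instead of <code>False</code>/<code>True</code>, and <code>drop_first=True</code> drops the first category of each column, which is redundant because it can be inferred from the others.</p>

```python
import pandas as pd

# Hypothetical toy data with the same two columns encoded above.
toy = pd.DataFrame({"Sex": ["male", "female", "female"], "Pclass": [3, 1, 2]})

# dtype=int gives 0/1 indicators; drop_first=True removes the first
# (sorted) category of each column: Sex_female and Pclass_1.
encoded = pd.get_dummies(toy, columns=["Sex", "Pclass"], dtype=int, drop_first=True)

print(list(encoded.columns))  # ['Sex_male', 'Pclass_2', 'Pclass_3']
print(encoded)
```

<p>Dropping one column per feature is mainly useful for linear models, where a full set of indicator columns is perfectly collinear; tree-based models generally do not need it.</p>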


<h1>2. Label Encoding</h1>

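<p>Label encoding replaces each category with an integer. scikit-learn's <code>LabelEncoder</code> assigns the integers in sorted (alphabetical) order of the category names and is intended primarily for target labels. It is suitable when that integer order is meaningful or harmless, as with a two-valued column such as <code>Sex</code>. On an unordered column with more categories, it can introduce an ordinal relationship that does not exist in the data.</p>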

<h2>Before encoding</h2>

In [5]:
df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


<h2>After encoding</h2>

In [6]:
label_encoder = LabelEncoder()

In [7]:
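# fit_transform learns the categories (sorted: female -> 0, male -> 1) and returns integer codes.
label_encoded_data = label_encoder.fit_transform(df["Sex"])
# Note: this overwrites the original string column in df.
df["Sex"] = label_encoded_data
df["Sex"].head()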

0    1
1    0
2    0
3    0
4    1
Name: Sex, dtype: int64
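
<p>To check which integer stands for which category, the fitted encoder can be inspected. The sketch below assumes a short hand-written list (<code>sex</code>) that mirrors the first five rows above rather than the full dataset; <code>classes_</code> and <code>inverse_transform</code> are standard <code>LabelEncoder</code> members.</p>

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written values mirroring the first five rows of the Sex column.
sex = ["male", "female", "female", "female", "male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ lists the categories in sorted order; the position is the code.
print(list(encoder.classes_))   # ['female', 'male']
print(codes.tolist())           # [1, 0, 0, 0, 1]

# inverse_transform maps the integer codes back to the original strings.
decoded = encoder.inverse_transform(codes)
print(decoded.tolist())
```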

<h1>3. Ordinal Encoding</h1>

<p><code>OrdinalEncoder</code> is a scikit-learn transformer that encodes categorical features as ordinal integers. It is particularly useful for categorical variables that have a natural order or hierarchy. For instance, with categories like "low", "medium", and "high", where the order is clear, <code>OrdinalEncoder</code> can map them to numerical values while preserving that order, provided the order is passed through the <code>categories</code> parameter. Without it, the categories are sorted alphabetically.</p>
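
<p>The "low", "medium", "high" case can be sketched as follows. The <code>levels</code> frame and its column names are assumptions made for illustration. The first encoder relies on the default alphabetical ordering and the second is given the intended order explicitly.</p>

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data for illustration only.
levels = pd.DataFrame({"level": ["medium", "low", "high", "low"]})

# Default: categories are sorted alphabetically (high=0, low=1, medium=2),
# which does not match the intended ranking.
default_codes = OrdinalEncoder().fit_transform(levels[["level"]]).ravel()

# Explicit order, lowest to highest (low=0, medium=1, high=2).
ordered = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered_codes = ordered.fit_transform(levels[["level"]]).ravel()

print(default_codes.tolist())  # [2.0, 1.0, 0.0, 1.0]
print(ordered_codes.tolist())  # [1.0, 0.0, 2.0, 0.0]
```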

<h2>Before encoding</h2>

In [30]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Performance_Level': ['Average', 'Poor', 'Good', 'Average', 'Excellent']
}
dff = pd.DataFrame(data)
dff.sample(5)

Unnamed: 0,Name,Performance_Level
1,Bob,Poor
2,Charlie,Good
3,David,Average
4,Emma,Excellent
0,Alice,Average


In [31]:
dff["Performance_Level"].value_counts()

Performance_Level
Average      2
Poor         1
Good         1
Excellent    1
Name: count, dtype: int64

In [22]:
from sklearn.preprocessing import OrdinalEncoder

<h2>After encoding</h2>

In [32]:
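# List the categories from lowest to highest so the integer codes follow that order.
# (Named ordinal_encoder to avoid shadowing Python's built-in ord.)
ordinal_encoder = OrdinalEncoder(categories=[['Poor', 'Average', 'Good', 'Excellent']])
# fit_transform expects 2-D input, hence the double brackets; it returns floats.
performance_level_encoded = ordinal_encoder.fit_transform(dff[["Performance_Level"]])
dff["Performance_Level_encoded"] = performance_level_encoded
dff.head()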

Unnamed: 0,Name,Performance_Level,Performance_Level_encoded
0,Alice,Average,1.0
1,Bob,Poor,0.0
2,Charlie,Good,2.0
3,David,Average,1.0
4,Emma,Excellent,3.0
