In [111]:
import pandas as pd
import numpy as np

In [112]:
data=pd.read_csv("students.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74
1,1,female,group C,some college,standard,,married,sometimes,yes,0.0,,5 - 10,69,90,88
2,2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91
3,3,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,,5 - 10,45,56,42
4,4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75


In [113]:
data.isnull().sum()

Unnamed: 0                0
Gender                    0
EthnicGroup            1840
ParentEduc             1845
LunchType                 0
TestPrep               1830
ParentMaritalStatus    1190
PracticeSport           631
IsFirstChild            904
NrSiblings             1572
TransportMeans         3134
WklyStudyHours          955
MathScore                 0
ReadingScore              0
WritingScore              0
dtype: int64

In [114]:
data.dropna(inplace=True)


## Methods of Encoding Categorical Data

### 1.One Hot Encoding

#### One-hot encoding is typically used when a categorical variable has two or more categories and there is no natural ordering between them. It creates a new binary column for each category in the original variable. With drop_first=True, as in the Gender example that follows, the first category's column is omitted, because that category is implied when all remaining columns are 0.

In [115]:
pd.get_dummies(data["Gender"], drop_first=True, dtype=int).head(2)

Unnamed: 0,male
2,0
4,1
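
With only two categories, the effect of one-hot encoding is hard to see. The following is a minimal sketch on a hypothetical three-colour column (the `toy` frame and its values are made up for illustration) that shows the full encoding next to the `drop_first=True` variant.

```python
import pandas as pd

# Hypothetical toy data: a nominal column with three categories
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per category (dtype=int gives 0/1 instead of booleans)
full = pd.get_dummies(toy["color"], dtype=int)

# drop_first=True removes the first category in sorted order ("blue"),
# which is then represented by zeros in all remaining columns
reduced = pd.get_dummies(toy["color"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

Dropping one column avoids a redundant feature, which matters mostly for linear models; for other models keeping all columns is usually fine.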


In [116]:
data.head(2)

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
2,2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91
4,4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75


### 2.One-hot encoding for variables with many categories

#### When a categorical variable has a large number of categories, creating a new binary column for each one can become computationally expensive, especially if many of the categories are rare. In such cases, we may keep only the most frequent categories (e.g., the top 5-10) and one-hot encode those. This reduces the dimensionality of the resulting data and makes it more manageable for downstream analysis.

In [117]:
data["EthnicGroup"].value_counts()

group C    6181
group D    4970
group B    3915
group E    2712
group A    1465
Name: EthnicGroup, dtype: int64

In [118]:
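# Keep only the 3 most frequent groups; rows from the other groups are dropped
top_3 = data['EthnicGroup'].value_counts().head(3).index.tolist()
data = data[data['EthnicGroup'].isin(top_3)]
# drop_first=True omits the column for group B, so group B rows are all zeros
data = pd.get_dummies(data, columns=['EthnicGroup'], drop_first=True, dtype=int)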
data.head(3)

Unnamed: 0.1,Unnamed: 0,Gender,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore,EthnicGroup_group C,EthnicGroup_group D
2,2,female,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91,0,0
4,4,male,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75,1,0
5,5,female,associate's degree,standard,none,married,regularly,yes,1.0,school_bus,5 - 10,73,84,79,0,0
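
The cell above discards every row whose group is outside the top 3. If those rows should be kept, one common alternative is to collapse the rare categories into a single label before encoding. The sketch below assumes a hypothetical `toy` frame and an arbitrary `"other"` label; neither comes from the students dataset.

```python
import pandas as pd

# Hypothetical toy column standing in for a high-cardinality feature
toy = pd.DataFrame({"group": ["C", "D", "C", "B", "E", "A", "D", "C", "B"]})

# Keep the k most frequent categories; k=3 is an arbitrary choice here
k = 3
top_k = toy["group"].value_counts().head(k).index

# Replace every category outside the top k with a single "other" label
toy["group_top"] = toy["group"].where(toy["group"].isin(top_k), "other")

# One-hot encode the reduced column; no rows are dropped
encoded = pd.get_dummies(toy, columns=["group_top"], dtype=int)
print(encoded)
```

This keeps the number of new columns at k + 1 while preserving all observations.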


### 3. Ordinal encoding

#### Ordinal encoding converts categorical data into numerical data in a way that preserves the ordinal relationship between the categories. It assigns a unique integer code to each category, with the lowest integer assigned to the lowest-ranked category and the highest integer assigned to the highest-ranked category.

For example, suppose we have a dataset of shirts with a 'size' column that contains the categories 'small', 'medium', and 'large'. Ordinal encoding would assign a code of 0 to 'small', 1 to 'medium', and 2 to 'large'. This encoding preserves the fact that 'large' is larger than 'medium', which is larger than 'small', and can be useful for machine learning models that handle numerical data but not categorical data.

Ordinal Encoding maps each category to a unique integer based on their order.
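
The shirt-size example can be sketched with an ordered pandas `Categorical`, which makes the intended order explicit instead of relying on the order in which values appear. The `shirts` frame and `size_order` list are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical shirt data matching the example in the text
shirts = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# Declare the order explicitly, from smallest to largest
size_order = ["small", "medium", "large"]

# .codes gives each value's position in size_order
shirts["size_code"] = pd.Categorical(
    shirts["size"], categories=size_order, ordered=True
).codes

print(shirts)
```

A value that is not listed in `size_order` would receive the code -1, which makes unexpected categories easy to spot.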

In [119]:
data["ParentEduc"].value_counts()

some college          3446
high school           2962
associate's degree    2930
some high school      2863
bachelor's degree     1779
master's degree       1086
Name: ParentEduc, dtype: int64

In [120]:
# Manually assign integer codes to the categories, from lowest to highest education level
label_map = {'some high school': 0, 'high school': 1, 'some college': 2, "associate's degree": 3,
            "bachelor's degree": 4, "master's degree": 5}
data['ParentEduc'] = data['ParentEduc'].map(label_map)

In [121]:
data.head(3)

Unnamed: 0.1,Unnamed: 0,Gender,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore,EthnicGroup_group C,EthnicGroup_group D
2,2,female,5,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91,0,0
4,4,male,2,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75,1,0
5,5,female,3,standard,none,married,regularly,yes,1.0,school_bus,5 - 10,73,84,79,0,0


### 4.Label Encoding

#### Label encoding is a simple technique where each category is assigned a unique integer; scikit-learn's LabelEncoder assigns the integers in sorted (alphabetical) order of the labels. The codes are compact, but they are arbitrary: a model that treats them as numbers, such as a linear model, may read an order or distance into them that the categories do not have, which can hurt performance. When the categories do have a natural order, an explicit ordinal mapping is the safer choice, because alphabetical order rarely matches the real ranking.

Label encoding maps each category to a unique integer.

In [122]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data['ParentMaritalStatus'] = label_encoder.fit_transform(data['ParentMaritalStatus'])
data.head(3)

Unnamed: 0.1,Unnamed: 0,Gender,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore,EthnicGroup_group C,EthnicGroup_group D
2,2,female,5,standard,none,2,sometimes,yes,4.0,school_bus,< 5,87,93,91,0,0
4,4,male,2,standard,none,1,sometimes,yes,0.0,school_bus,5 - 10,76,78,75,1,0
5,5,female,3,standard,none,1,regularly,yes,1.0,school_bus,5 - 10,73,84,79,0,0

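To see which integer each category received, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a hypothetical list of marital-status values rather than the dataset column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for illustration
status = ["single", "married", "widowed", "married", "divorced"]

encoder = LabelEncoder()
codes = encoder.fit_transform(status)

# classes_ holds the categories in sorted order; the position is the code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Because the codes follow alphabetical order, they carry no meaning beyond identifying the category.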

### 5.Count or frequency encoding

#### Count or frequency encoding is a technique used for encoding categorical variables where the values of each category are replaced by their frequency in the dataset. This encoding technique is useful when the categories have no intrinsic ordering or when one-hot encoding would result in a very high-dimensional and sparse dataset.

In [123]:
import category_encoders as ce
count_enc = ce.CountEncoder()

# fit and transform the data using CountEncoder
data1 = count_enc.fit_transform(data['Gender'])
data1.head()

Unnamed: 0,Gender
2,7650
4,7416
5,7650
6,7650
7,7416
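
The same idea can be sketched with plain pandas, which also makes the difference between counts and relative frequencies visible. The `toy` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy column
toy = pd.DataFrame({"gender": ["female", "male", "female", "female", "male"]})

# Count encoding: replace each category by how many rows contain it
counts = toy["gender"].value_counts()
toy["gender_count"] = toy["gender"].map(counts)

# Frequency encoding: the same counts expressed as a share of all rows
freqs = toy["gender"].value_counts(normalize=True)
toy["gender_freq"] = toy["gender"].map(freqs)

print(toy)
```

One caveat: two categories that happen to occur equally often receive the same value and can no longer be told apart.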


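### 6.Mean Encoding

#### Mean encoding (also called target encoding) replaces each category with the mean of the target variable over the rows that belong to that category. In the cells below, each LunchType value is replaced by the mean WritingScore of its group.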

In [124]:
df=pd.read_csv("students.csv")
# LunchType and WritingScore have no missing values, so no rows are dropped from df
df.head()

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74
1,1,female,group C,some college,standard,,married,sometimes,yes,0.0,,5 - 10,69,90,88
2,2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91
3,3,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,,5 - 10,45,56,42
4,4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75


In [125]:
mean_cate=df.groupby(['LunchType'])['WritingScore'].mean().to_dict()
print(mean_cate)

{'free/reduced': 62.650521609538, 'standard': 71.52971615172068}


In [126]:
df['mean_ordinal_encode_LunchType']=df['LunchType'].map(mean_cate)
df.head()

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore,mean_ordinal_encode_LunchType
0,0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74,71.529716
1,1,female,group C,some college,standard,,married,sometimes,yes,0.0,,5 - 10,69,90,88,71.529716
2,2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91,71.529716
3,3,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,,5 - 10,45,56,42,62.650522
4,4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75,71.529716
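
The cell above computes the category means on the full dataset. When the encoded column is later used as a model feature, the means are usually learned from the training rows only, so that the target values of the test rows do not leak into the feature. A minimal sketch, assuming a hypothetical train/test split and a fallback to the global mean for unseen categories:

```python
import pandas as pd

# Hypothetical train/test split for illustration
train = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard", "free/reduced"],
    "score": [80, 60, 70, 50],
})
test = pd.DataFrame({"lunch": ["standard", "free/reduced", "unknown"]})

# Learn the category means from the training rows only
category_means = train.groupby("lunch")["score"].mean()
global_mean = train["score"].mean()

# Apply the learned means; unseen categories fall back to the global mean
train["lunch_mean"] = train["lunch"].map(category_means)
test["lunch_mean"] = test["lunch"].map(category_means).fillna(global_mean)

print(test)
```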


### 7.Target-guided ordinal encoding

#### In target-guided ordinal encoding, the categories are encoded based on the target variable. Here, for each category, the mean of the target variable is calculated and then the categories are sorted based on their mean value. Then, the categories are assigned ordinal values based on their position in the sorted list. The idea is to assign ordinal values to the categories based on their relationship with the target variable.

In [127]:
import pandas as pd
import category_encoders as ce

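# df is the DataFrame loaded in the Mean Encoding section

# Rank the categories by their mean ReadingScore (the lowest mean gets rank 1.0)
rank_map = df.groupby('PracticeSport')['ReadingScore'].mean().rank().to_dict()

# Create target-guided ordinal encoder object with that explicit mapping
encoder = ce.OrdinalEncoder(cols=['PracticeSport'],
                            mapping=[{'col': 'PracticeSport', 'mapping': rank_map}])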

# Fit and transform data
df_encoded = encoder.fit_transform(df)

# View encoded data
df_encoded.head()


Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore,mean_ordinal_encode_LunchType
0,0,female,,bachelor's degree,standard,none,married,3.0,yes,3.0,school_bus,< 5,71,71,74,71.529716
1,1,female,group C,some college,standard,,married,2.0,yes,0.0,,5 - 10,69,90,88,71.529716
2,2,female,group B,master's degree,standard,none,single,2.0,yes,4.0,school_bus,< 5,87,93,91,71.529716
3,3,male,group A,associate's degree,free/reduced,none,married,1.0,no,1.0,,5 - 10,45,56,42,62.650522
4,4,male,group C,some college,standard,none,married,2.0,yes,0.0,school_bus,5 - 10,76,78,75,71.529716
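
The steps described in this section (mean per category, sort, assign positions) can also be written out with pandas alone. This is a sketch on a hypothetical `toy` frame, not a replacement for the encoder above.

```python
import pandas as pd

# Hypothetical toy data: a categorical feature and a numeric target
toy = pd.DataFrame({
    "sport": ["never", "sometimes", "regularly", "never", "regularly", "sometimes"],
    "score": [60, 70, 80, 62, 78, 72],
})

# Step 1 and 2: mean target per category, sorted from lowest to highest
ordered = toy.groupby("sport")["score"].mean().sort_values().index

# Step 3: assign rank 1 to the category with the lowest mean, and so on
rank_map = {category: rank for rank, category in enumerate(ordered, start=1)}
toy["sport_rank"] = toy["sport"].map(rank_map)

print(rank_map)
```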


### 8.Binary encoding

#### Binary encoding is a type of categorical encoding that transforms categorical data into binary (0/1) format. This encoding method is particularly useful for nominal data with a large number of unique values, as it reduces the dimensionality of the feature space while still retaining meaningful information.

In binary encoding, each category is assigned a unique integer value, and then the integer value is converted to its corresponding binary representation. Each binary digit is treated as a separate feature, resulting in a binary feature space. The number of binary digits required depends on the number of unique categories. For example, if there are 8 categories numbered 0 to 7, 3 binary digits are enough because 2^3 = 8. If the numbering starts at 1 instead, as in the example output below (red = 01, green = 10, blue = 11), 8 categories would need 4 digits.

Binary encoding is especially useful for machine learning algorithms that require numerical data, such as regression or neural networks. It can also be used to reduce the dimensionality of high-cardinality categorical data, which can be difficult to handle using other encoding methods.
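
The two steps described above (integer code, then binary digits) can be sketched by hand with pandas. This illustration numbers the categories from 0 in order of first appearance, which is an assumption of the sketch; a library encoder may number them differently, so its bit patterns need not match these.

```python
import pandas as pd

# Hypothetical toy column
toy = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

# Step 1: one integer code per category, in order of first appearance
codes, uniques = pd.factorize(toy["color"])

# Step 2: number of bits needed to represent the largest code
n_bits = max(int(codes.max()).bit_length(), 1)

# Step 3: one column per bit, most significant bit first
for bit in range(n_bits):
    shift = n_bits - 1 - bit
    toy[f"color_bit{bit}"] = (codes >> shift) & 1

print(toy)
```

Three categories fit into two bit columns here, compared with three columns for a full one-hot encoding; the saving grows with the number of categories.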

In [130]:
import pandas as pd
import category_encoders as ce

# creating a sample dataframe
df2 = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'red']})

# creating binary encoder object
encoder = ce.BinaryEncoder(cols=['color'])

# fitting the encoder and transforming the data
df_binary = encoder.fit_transform(df2)

# printing the transformed dataframe
print(df_binary)


   color_0  color_1
0        0        1
1        1        0
2        1        1
3        1        0
4        0        1
