### One Hot Encoding

sklearn also provides a class to perform one-hot encoding of a categorical variable, where each category becomes its own 0/1 column. Let us use 'OneHotEncoder' from sklearn to encode the variable 'sex'.

In [25]:
# import pandas and the OneHotEncoder
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# instantiate the encoder
enc = OneHotEncoder()

# load the data
df = pd.read_csv('suicide_data1.csv')

# alternative (not used here): integer codes via the pandas category dtype
# df['sex'] = df['sex'].astype('category')
# df['sex_new'] = df['sex'].cat.codes

# fit the encoder on 'sex'; fit_transform returns a sparse matrix,
# so convert it to a dense array and wrap it in a dataframe of encoded columns
enc_data = pd.DataFrame(enc.fit_transform(
    df[['sex']]).toarray())

# add one column per category ('female', 'male') to the original dataframe
df[enc.categories_[0]] = enc_data
df


Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation,female,male
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X,0.0,1.0
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent,0.0,1.0
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X,1.0,0.0
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation,0.0,1.0
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27815,Uzbekistan,2014,female,35-54 years,107,3620833,2.96,Uzbekistan2014,0.675,63067077179,2309,Generation X,1.0,0.0
27816,Uzbekistan,2014,female,75+ years,9,348465,2.58,Uzbekistan2014,0.675,63067077179,2309,Silent,1.0,0.0
27817,Uzbekistan,2014,male,5-14 years,60,2762158,2.17,Uzbekistan2014,0.675,63067077179,2309,Generation Z,0.0,1.0
27818,Uzbekistan,2014,female,5-14 years,44,2631600,1.67,Uzbekistan2014,0.675,63067077179,2309,Generation Z,1.0,0.0
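
The same steps can be tried on a small, self-contained example. This is only a sketch: the three-row dataframe `toy` is made up for illustration and stands in for the 'sex' column of the dataset above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy data standing in for the 'sex' column
toy = pd.DataFrame({'sex': ['male', 'female', 'male']})

enc = OneHotEncoder()

# fit_transform returns a sparse matrix; toarray() makes it dense
encoded = enc.fit_transform(toy[['sex']]).toarray()

# categories_[0] holds the categories learned for the first (only) column,
# in sorted order, so it can be used to name the encoded columns
onehot = pd.DataFrame(encoded, columns=enc.categories_[0], index=toy.index)
print(onehot)
```

Each row has a 1.0 in exactly one of the new columns, which is why the 'female' and 'male' columns in the output above always sum to 1.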


### Label Encoding

This technique labels each category of the variable with an integer between 0 and (n-1), where 'n' is the number of distinct categories in the variable. Every occurrence of a category in the data gets the same label.
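
The idea can be sketched in plain Python before turning to sklearn. The list `values` is made up for illustration, and numbering the categories in sorted order is a choice made here to mirror how 'LabelEncoder' orders its classes.

```python
# hypothetical sample of a categorical variable
values = ['Silent', 'Boomers', 'Silent', 'Generation X']

# the n distinct categories, in sorted order, get the labels 0 .. n-1
categories = sorted(set(values))
mapping = {cat: label for label, cat in enumerate(categories)}

# a repeated category always receives the same label
labels = [mapping[v] for v in values]

print(mapping)  # {'Boomers': 0, 'Generation X': 1, 'Silent': 2}
print(labels)   # [2, 0, 2, 1]
```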

Use 'LabelEncoder' from sklearn to encode the variable 'generation'.

In [26]:
# check the categories in 'generation'
df.generation.unique()

array(['Generation X', 'Silent', 'G.I. Generation', 'Boomers',
       'Millenials', 'Generation Z'], dtype=object)

In [27]:
# import the LabelEncoder
from sklearn.preprocessing import LabelEncoder

# instantiate the encoder
label_encoder = LabelEncoder()

# fit the encoder on 'generation' and store the encoded labels in a new column
# (equivalent to calling label_encoder.fit(...) and then label_encoder.transform(...))
df['generation_new'] = label_encoder.fit_transform(df['generation'])

# display the dataframe (pandas shows the first and last rows)
df

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation,female,male,generation_new
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X,0.0,1.0,2
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent,0.0,1.0,5
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X,1.0,0.0,2
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation,0.0,1.0,1
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers,0.0,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27815,Uzbekistan,2014,female,35-54 years,107,3620833,2.96,Uzbekistan2014,0.675,63067077179,2309,Generation X,1.0,0.0,2
27816,Uzbekistan,2014,female,75+ years,9,348465,2.58,Uzbekistan2014,0.675,63067077179,2309,Silent,1.0,0.0,5
27817,Uzbekistan,2014,male,5-14 years,60,2762158,2.17,Uzbekistan2014,0.675,63067077179,2309,Generation Z,0.0,1.0,3
27818,Uzbekistan,2014,female,5-14 years,44,2631600,1.67,Uzbekistan2014,0.675,63067077179,2309,Generation Z,1.0,0.0,3


LabelEncoder has encoded the six generations as integers from 0 to 5. This method is not always appropriate, because it imposes an order on the labels that is not present in the original variable. The categories are numbered in sorted (alphabetical) order, so 'Boomers' becomes 0 and 'Silent' becomes 5.
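
If a meaningful order is wanted, one possible workaround is to state it explicitly instead of relying on the alphabetical one. The sketch below compares the two on the six category names shown earlier. The list `order` (oldest generation first) is an assumption made for illustration and is not part of the original analysis; `pd.Categorical` is used here as one way to apply it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# the six categories, spelled as they appear in the data
generations = pd.Series(['Generation X', 'Silent', 'G.I. Generation',
                         'Boomers', 'Millenials', 'Generation Z'])

# LabelEncoder numbers the categories in sorted (alphabetical) order
le = LabelEncoder()
alphabetical = le.fit_transform(generations)
print(list(le.classes_))
print(list(alphabetical))

# assumed chronological order, oldest first (an illustration, not from the source)
order = ['G.I. Generation', 'Silent', 'Boomers',
         'Generation X', 'Millenials', 'Generation Z']

# an ordered Categorical assigns codes by position in 'order'
chronological = pd.Categorical(generations, categories=order, ordered=True).codes
print(list(chronological))
```

With the explicit order, 'G.I. Generation' gets 0 and 'Generation Z' gets 5, so larger codes correspond to younger generations.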