
Discuss One-Hot-Encoding

Introduction:

Datasets often contain columns whose values name a category rather than measure a quantity. The values may be strings, or they may be integers produced by label encoding, in which case the numbers carry no order of preference. A machine learning model can misread such integers as magnitudes, for example by treating category 2 as "greater than" category 1. To avoid this, the column should be one-hot encoded.

One-Hot-Encoding:

One-hot encoding splits a categorical column into several columns, one for each category present in that column. Each new column contains “1” in the rows that belong to its category and “0” in all other rows.
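The idea can be sketched by hand with NumPy before reaching for a library. The colour column below and its integer codes are hypothetical, made up only for illustration.

```python
import numpy as np

# Hypothetical label-encoded column: 0 = red, 1 = green, 2 = blue.
colors = np.array([0, 2, 1, 0])

# One column per category; row i has a 1 only in column colors[i].
n_categories = colors.max() + 1
one_hot = np.eye(n_categories, dtype=int)[colors]

print(one_hot)
```

Each row of `one_hot` sums to 1, which is the property that later leads to the Dummy Variable Trap.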

Older versions of scikit-learn's OneHotEncoder accepted only numerical categorical values, so string values had to be label encoded first. Since scikit-learn 0.20 the encoder accepts strings directly, and the OneHotEncoder code at the end of this notebook relies on that.

The one-hot encoder does not accept a 1-dimensional array or a pandas Series; the input must always be 2-dimensional (samples by features).

Some points to remember:
OneHotEncoder()
• One-hot encoding converts categorical variables into a numeric form that can be provided
to machine learning algorithms, which can improve predictions.
• One-hot encoding is a crucial part of feature engineering for machine learning.
• The input to this transformer should be an array-like of integers or strings, denoting the
values taken on by categorical (discrete) features. The features are encoded using a one-hot
(aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category
and returns a sparse matrix or dense array (depending on the sparse parameter, which is
named sparse_output since scikit-learn 1.2).
• By default, the encoder derives the categories based on the unique values in each feature.
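The points above can be tried on a tiny example. This is a minimal sketch that assumes a recent scikit-learn; the `ranks` array is hypothetical and is reshaped to 2-D because the encoder expects (samples, features) input.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single feature, reshaped to 2-D: (n_samples, 1).
ranks = np.array(["full", "assistant", "associate", "full"]).reshape(-1, 1)

encoder = OneHotEncoder()               # returns a sparse matrix by default
encoded = encoder.fit_transform(ranks)

# Categories are derived from the unique values of the feature.
print(encoder.categories_)

# Convert to a dense array only for inspection.
dense = encoded.toarray()
print(dense)
```

The columns of `dense` follow the order of `encoder.categories_[0]`.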

Discuss Multicollinearity and the Dummy Variable Trap.

Multicollinearity:

Multicollinearity occurs when two or more independent variables (a.k.a. features) in the dataset are correlated with each other. For a pair of variables, the degree and direction of correlation can be measured with a correlation coefficient, while multicollinearity across several variables is generally measured using the Variance Inflation Factor (VIF). In a nutshell, multicollinearity is said to exist in a dataset when the independent variables are (nearly) linearly related to each other.
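As an illustration, the Variance Inflation Factor of feature j can be computed as 1 / (1 - R²), where R² comes from regressing feature j on the remaining features. The sketch below uses only NumPy; the helper name `vif` and the synthetic features are assumptions made for this example, not part of the notebook's data.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D feature array."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ coef
        r2 = 1.0 - residual.var() / y.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic features: x3 is nearly a copy of x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

The factors for `x1` and `x3` come out large because each is almost a linear function of the other, while the factor for `x2` stays close to 1.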

Dummy Variable Trap

The dummy variable trap arises directly from one-hot encoding applied to categorical variables. As discussed earlier, the size of each one-hot vector equals the number of unique values that the categorical column takes, and each vector contains exactly one ‘1’. The columns used to encode the categorical variable are called ‘dummy variables’. Because every row contains exactly one ‘1’, the dummy columns always sum to 1, so any one of them can be computed from the others. We intended to solve the problem of using categorical variables, but got trapped by the problem of multicollinearity. This is called the Dummy Variable Trap.

Some points to remember:
• In the Dummy Variable Trap, the value of one dummy variable can easily be predicted from
the remaining dummy variables.
• It is a scenario in which the variables are highly correlated with each other.
• The Dummy Variable Trap leads to the problem known as multicollinearity, which occurs
when there is a dependency between the independent features.
• Multicollinearity is a serious issue in machine learning models like Linear Regression and
Logistic Regression.
• To overcome the problem of multicollinearity, one of the dummy variables has to be
dropped.
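The trap can be made concrete by checking the rank of the design matrix. In this sketch, which uses a hypothetical three-level column, the full set of dummies plus an intercept column is rank deficient, and dropping one dummy restores full column rank.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with three levels.
rank = pd.Series(["full", "assistant", "associate", "full", "assistant"], name="rk")

full = pd.get_dummies(rank).to_numpy(dtype=float)
reduced = pd.get_dummies(rank, drop_first=True).to_numpy(dtype=float)

intercept = np.ones((len(rank), 1))
X_full = np.hstack([intercept, full])        # intercept + 3 dummies = 4 columns
X_reduced = np.hstack([intercept, reduced])  # intercept + 2 dummies = 3 columns

# The three dummies sum to the intercept column, so one column is redundant.
print(np.linalg.matrix_rank(X_full), "of", X_full.shape[1])
print(np.linalg.matrix_rank(X_reduced), "of", X_reduced.shape[1])
```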

What are Nominal and Ordinal Variables?
Nominal data simply names something without assigning it to an order in relation to other
numbered objects or pieces of data.
• An example of nominal data might be a “pass” or “fail” classification for each student’s
test result. Nominal data provides some information about a group or set of events, even if
that information is limited to mere counts.

Ordinal data, unlike nominal data, involves some order; ordinal numbers stand in relation to
each other in a ranked fashion.
• For example, suppose you receive a survey from your favorite restaurant that asks you to
provide feedback on the service you received. You can rank the quality of service as “1” for
poor, “2” for below average, “3” for average, “4” for very good and “5” for excellent. The
data collected by this survey are examples of ordinal data. Here the numbers assigned have
an order or rank; that is, a ranking of “4” is better than a ranking of “2.”
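The distinction can also be expressed in pandas, which supports both unordered and ordered categorical data. A minimal sketch, assuming the survey labels from the example above; the sample responses are made up.

```python
import pandas as pd

# Nominal: the categories have no order.
result = pd.Categorical(["pass", "fail", "pass"], categories=["fail", "pass"])

# Ordinal: categories are listed from lowest to highest, with ordered=True.
levels = ["poor", "below average", "average", "very good", "excellent"]
service = pd.Categorical(
    ["very good", "below average", "excellent"],
    categories=levels,
    ordered=True,
)

# Integer codes start at 0, so add 1 to get the survey's 1-to-5 ranks.
survey_ranks = service.codes + 1
print(survey_ranks)

# Comparisons are meaningful only for ordered categoricals.
better_than_average = service > "average"
print(better_than_average)
```

Ordinal columns like `service` can often be kept as a single ranked integer column, while nominal columns are the ones that call for one-hot encoding.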

Salary dataset of 52 professors with categorical columns. Apply the dummy variable concept and one-hot encoding to the categorical columns.

In [None]:
import numpy as np
import pandas as pd

In [None]:
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
sal = pd.read_table('/content/salary.dat.txt', sep=r'\s+')
sal.head()

       sx    rk  yr         dg  yd     sl
0    male  full  25  doctorate  35  36350
1    male  full  13  doctorate  22  35350
2    male  full  10  doctorate  23  28200
3  female  full   7  doctorate  27  26775
4    male  full  19    masters  30  33696


In [None]:
sal.dtypes

sx    object
rk    object
yr     int64
dg    object
yd     int64
sl     int64
dtype: object

In [None]:
sal.columns

Index(['sx', 'rk', 'yr', 'dg', 'yd', 'sl'], dtype='object')

In [None]:
sal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   sx      52 non-null     object
 1   rk      52 non-null     object
 2   yr      52 non-null     int64 
 3   dg      52 non-null     object
 4   yd      52 non-null     int64 
 5   sl      52 non-null     int64 
dtypes: int64(3), object(3)
memory usage: 2.6+ KB


get_dummies()
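• pandas.get_dummies() converts categorical columns into dummy (indicator) variables, one
column per category.
• With drop_first=True it drops the first category of each encoded column, which avoids
the Dummy Variable Trap.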

In [None]:
pd.get_dummies(sal, columns=['sx', 'rk', 'dg']).head()

   yr  yd     sl  sx_female  sx_male  rk_assistant  rk_associate  rk_full  dg_doctorate  dg_masters
0  25  35  36350          0        1             0             0        1             1           0
1  13  22  35350          0        1             0             0        1             1           0
2  10  23  28200          0        1             0             0        1             1           0
3   7  27  26775          1        0             0             0        1             1           0
4  19  30  33696          0        1             0             0        1             0           1
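The 0/1 values above reflect the pandas version used when this notebook was run. In pandas 2.0 and later, get_dummies returns boolean columns by default; passing dtype=int is one way to get integers back. A minimal sketch on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical three-row frame with one categorical column.
df = pd.DataFrame({"sx": ["male", "female", "male"], "yr": [25, 7, 19]})

# dtype=int requests 0/1 integers regardless of the pandas default.
dummies = pd.get_dummies(df, columns=["sx"], dtype=int)
print(dummies)
```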


In [None]:
pd.get_dummies(sal,columns = ['sx','rk','dg'],drop_first=True).head()

   yr  yd     sl  sx_male  rk_associate  rk_full  dg_masters
0  25  35  36350        1             0        1           0
1  13  22  35350        1             0        1           0
2  10  23  28200        1             0        1           0
3   7  27  26775        0             0        1           0
4  19  30  33696        1             0        1           1


In [None]:
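from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# One-hot encode columns 0 (sx), 1 (rk) and 3 (dg); pass the numeric columns through unchanged.
columnTransformer = ColumnTransformer(
    [('encoder', OneHotEncoder(), [0, 1, 3])],
    remainder='passthrough')
# The encoded columns come first in the result, followed by yr, yd and sl.
data = np.array(columnTransformer.fit_transform(sal), dtype=str)
print(data[:5])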

[['0.0' '1.0' '0.0' '0.0' '1.0' '1.0' '0.0' '25.0' '35.0' '36350.0']
 ['0.0' '1.0' '0.0' '0.0' '1.0' '1.0' '0.0' '13.0' '22.0' '35350.0']
 ['0.0' '1.0' '0.0' '0.0' '1.0' '1.0' '0.0' '10.0' '23.0' '28200.0']
 ['1.0' '0.0' '0.0' '0.0' '1.0' '1.0' '0.0' '7.0' '27.0' '26775.0']
 ['0.0' '1.0' '0.0' '0.0' '1.0' '0.0' '1.0' '19.0' '30.0' '33696.0']]
