#Encoding Categorical Data
There are three common approaches for converting ordinal and categorical variables to **numerical values.** They are:

###Ordinal Encoding

###One-Hot Encoding

###Dummy Variable Encoding

#Nominal and Ordinal Variables
Numerical data, as its name suggests, involves features that are only composed of numbers, such as **integers or floating-point values.**

**Categorical data** are variables that contain **label** values rather than numeric values.

The number of possible values is often limited to a fixed set.

Categorical variables are often called nominal.

Some examples include:

A “pet” variable with the values: “dog” and “cat”.

A “color” variable with the values: “red”, “green”, and “blue”.

A “place” variable with the values: “first”, “second”, and “third”.

**Each value** represents a **different category.**

Some categories may have a natural relationship to each other, such as a **natural ordering.**

The “place” variable above does have a natural ordering of values. This type of categorical variable is called an **ordinal variable** because the values can be **ordered or ranked.**

A numerical variable can be converted to an ordinal variable by **dividing the range of the numerical variable into bins** and assigning values to each bin. For example, a numerical variable between 1 and 10 can be divided into an ordinal variable with 5 labels with an ordinal relationship: 1-2, 3-4, 5-6, 7-8, 9-10. This is called discretization.
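
The discretization described above can be sketched with pandas. This is only an illustration: the `values` series and the bin edges are made-up inputs, and `pd.cut` is one of several ways to bin a numerical variable.

```python
import pandas as pd

# Hypothetical numerical variable with values from 1 to 10
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Five bins with ordered labels; each bin includes its right edge, e.g. (0, 2]
labels = ["1-2", "3-4", "5-6", "7-8", "9-10"]
binned = pd.cut(values, bins=[0, 2, 4, 6, 8, 10], labels=labels)

print(binned.tolist())
# The resulting categorical is ordered, i.e. it behaves as an ordinal variable
print(binned.cat.ordered)
```

Because the labels carry an order, the binned column can later be passed to an ordinal encoding without losing the ranking.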

**Nominal Variable (Categorical):** Variable comprises a finite set of discrete values with **no relationship between values.**

**Ordinal Variable:** Variable comprises a finite set of discrete values with a **ranked ordering between values.**

-------------------------------------------------------------
Some algorithms can work with categorical data directly.

For example, **a decision tree** can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

------------------------------------------------------------
Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

---------------------------------------------------------
In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

Some implementations of machine learning algorithms require all data to be numerical. For example, scikit-learn has this requirement.

##This means that categorical data must be converted to a numerical form.

If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.
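
A minimal sketch of that round trip, assuming scikit-learn's `LabelEncoder` is used for the output variable. The label values and the `predicted` integers are invented for illustration and do not come from a real model.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical output variable
y = np.array(["cat", "dog", "dog", "cat"])

# Classes are sorted, so cat=0 and dog=1
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Stand-in for integer predictions returned by a model
predicted = np.array([1, 0, 1])

# Map the integers back to the original labels for presentation
predicted_labels = label_encoder.inverse_transform(predicted)
print(y_encoded, predicted_labels)
```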



https://datascience.stackexchange.com/questions/98172/what-is-the-difference-between-one-hot-and-dummy-encoding

https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/

Most machine learning models accept only numerical variables, which is why categorical variables are converted to numbers before training.

What are one-hot encoding and dummy encoding, and how do they differ?

**One-Hot Encoding:** Take a column named Fruit that holds different types of fruit, such as Blackberry, Grape, and Orange. Each category is mapped to its own binary variable containing either 0 or 1. This encoding is widely used when features are nominal.

| Fruit | Price (dollars per pound) |
|---|---|
| Blackberry | 3.82 |
| Grape | 1.2 |
| Orange | 0.64 |

After one-hot encoding, the table looks as shown below.

One-hot encoded table

| Blackberry | Grape | Orange | Price (dollars per pound) |
|---|---|---|---|
| 1 | 0 | 0 | 3.82 |
| 0 | 1 | 0 | 1.2 |
| 0 | 0 | 1 | 0.64 |

**Dummy Encoding:** Similar to one-hot encoding. Whereas one-hot encoding uses N binary variables for the N categories in a variable, dummy encoding uses N-1 binary variables to represent the N categories.

| Fruit | One-hot encoding | Dummy encoding |
|---|---|---|
| Blackberry | 100 | 10 |
| Grape | 010 | 01 |
| Orange | 001 | 00 |
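
The difference can be sketched with `pandas.get_dummies`. This is an illustration on the fruit names from the table, not the only way to build either encoding. Note that `drop_first=True` drops the first category (Blackberry), whereas the table above treats Orange as the all-zero category; which category is dropped is a convention.

```python
import pandas as pd

# Fruit column from the example table
fruit = pd.DataFrame({"Fruit": ["Blackberry", "Grape", "Orange"]})

# One-hot encoding: one binary column per category (N columns)
one_hot = pd.get_dummies(fruit["Fruit"], dtype=int)

# Dummy encoding: the first category is dropped (N-1 columns)
dummy = pd.get_dummies(fruit["Fruit"], drop_first=True, dtype=int)

print(one_hot)
print(dummy)
```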

#Ordinal Encoding
In ordinal encoding, each unique category value is assigned an integer value.

For example, “red” is 1, “green” is 2, and “blue” is 3.

This is called an ordinal encoding or an integer encoding and is easily reversible. Often, integer values starting at zero are used.

For some variables, an ordinal encoding may be enough. The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.

It is a natural encoding for ordinal variables. For categorical variables, it imposes an ordinal relationship where **no such relationship may exist.**

##This can cause problems and a one-hot encoding may be used instead.

This ordinal encoding transform is available in the scikit-learn Python machine learning library via the OrdinalEncoder class.

###By default, it determines the categories from the data and assigns integers to them in sorted order. If a specific order is desired, it can be specified via the “categories” argument as a list, one per column, giving the rank order of all expected labels.

We can demonstrate the usage of this class by converting colors categories “red”, “green” and “blue” into integers. First the categories are sorted then numbers are applied. For strings, this means the labels are sorted alphabetically and that blue=0, green=1 and red=2.

In [None]:
# example of an ordinal encoding
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)

[['red']
 ['green']
 ['blue']]
[[2.]
 [1.]
 [0.]]
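
When the labels have a real rank order, the “categories” argument mentioned earlier can be used to fix that order instead of relying on alphabetical sorting. The sketch below assumes the “place” variable from the start of this notebook; the order of the rows in `data` is arbitrary.

```python
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# define data with a natural ranking
data = asarray([['third'], ['first'], ['second']])

# one list of categories per column, given in rank order
encoder = OrdinalEncoder(categories=[['first', 'second', 'third']])
result = encoder.fit_transform(data)
print(result)

# the encoding is reversible
original = encoder.inverse_transform(result)
print(original)
```

Here “first” maps to 0, “second” to 1, and “third” to 2, regardless of the order in which they appear in the data.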


#One-Hot Encoding
For categorical variables where **no ordinal relationship exists,** the integer encoding may be insufficient at best and misleading to the model at worst.

Forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the ordinal representation. This is where the **integer encoded variable is removed** and **one new binary variable is added for each unique integer value in the variable.**



In [None]:
# example of a one hot encoding
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define one hot encoding (the `sparse` argument was renamed `sparse_output` in scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']]
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]




#Dummy Variable Encoding
The one-hot encoding creates one binary variable for each category.

The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents “blue” and [0, 1, 0] represents “green”, we don’t need another binary variable to represent “red”; instead we could use 0 values for both “blue” and “green” alone, e.g. [0, 0].

This is called a dummy variable encoding, and always represents C categories with C-1 binary variables.

In [None]:
# example of a dummy variable encoding
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
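# define dummy variable encoding: drop='first' drops the first sorted category ('blue'),
# so 'blue' is encoded as [0, 0] (the `sparse` argument was renamed `sparse_output` in scikit-learn 1.2)
encoder = OneHotEncoder(drop='first', sparse_output=False)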
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[['red']
 ['green']
 ['blue']]
[[0. 1.]
 [1. 0.]
 [0. 0.]]




#Use ColumnTransformer in SciKit instead of LabelEncoding and OneHotEncoding for data preprocessing in Machine Learning.

#columnTransformer=labelencoding + oneHotEncoding

The developers of the library might have realised that people use LabelEncoder and OneHotEncoder together very frequently. So scikit-learn includes a class called **ColumnTransformer,** which lets you **combine label encoding and one-hot encoding** of selected columns in just one line of code, with the same result.



In [1]:
import numpy as np
import pandas as pd

In [2]:
dataset = pd.read_csv("/content/sample.csv")
dataset

Unnamed: 0,name1,123,text1,yes
0,name2,456.0,text2,no
1,name3,,text3,yes
2,name2,123.0,text4,yes
3,name3,789.0,text5,no
4,name1,987.0,text6,no
5,name1,753.0,text7,yes


As you can clearly make out, the data makes no sense and is just for this demonstration. The first column is a text field and is categorical, so we’ll have to label encode it and also one-hot encode it to make sure we’re not introducing any hierarchy between its values. For this, we still need to import the OneHotEncoder class. But instead of the LabelEncoder class, we’ll use the ColumnTransformer class. So let’s import these two first:

In [3]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

Next, we have to create an object of the ColumnTransformer class. But before we can do that, we need to understand the constructor signature of the class. The ColumnTransformer constructor takes quite a few arguments, but we’re only interested in two. The first argument, transformers, is a list of tuples. Each tuple has the following elements, in this order:

**name:** a name for the transformer, which makes it easy to set its parameters and to look it up later.

**transformer:** here we’re supposed to provide an estimator. We can also just pass “passthrough” or “drop” if we want. But since we’re encoding the data in this example, we’ll use the OneHotEncoder here. Remember that the estimator you use here needs to support fit and transform.

**column(s):** the list of columns which you want to be transformed. In this case, we’ll only transform the first column.

The second parameter we’re interested in is remainder. This tells the transformer what to do with the other columns in the dataset. By default, only the transformed columns are returned by the transformer and all other columns are dropped. But we have the option to tell the transformer what to do with the other columns. We can either drop them, pass them through unchanged, or specify another estimator if we want to do some more processing.

Now that we (somewhat) understand the signature of the constructor, let’s go ahead and create an object:

In [4]:
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')


As you can see from the snippet above, we’ll name the transformer simply “encoder.” We’re using the OneHotEncoder() constructor to provide a new instance as the estimator. And then we’re specifying that only the first column has to be transformed. We’re also making sure that the remainder columns are passed through without any changes.

Once we have constructed this columnTransformer object, we have to fit and transform the dataset to label encode and one hot encode the column. For this, we’ll use the following simple command:

In [6]:
# np.str was deprecated in NumPy 1.20 and later removed; the built-in str is equivalent here
dataset = np.array(columnTransformer.fit_transform(dataset), dtype=str)
dataset


array([['1.0', '0.0', '1.0', '0.0', '456.0', 'text2', 'no'],
       ['1.0', '0.0', '0.0', '1.0', 'nan', 'text3', 'yes'],
       ['1.0', '0.0', '1.0', '0.0', '123.0', 'text4', 'yes'],
       ['1.0', '0.0', '0.0', '1.0', '789.0', 'text5', 'no'],
       ['0.0', '1.0', '0.0', '0.0', '987.0', 'text6', 'no'],
       ['0.0', '1.0', '0.0', '0.0', '753.0', 'text7', 'yes']],
      dtype='<U32')

In [9]:
df = pd.DataFrame(dataset, columns=['name1', 'name2', 'name3','name4','123','text1','yes'])
df

Unnamed: 0,name1,name2,name3,name4,123,text1,yes
0,1.0,0.0,1.0,0.0,456.0,text2,no
1,1.0,0.0,0.0,1.0,,text3,yes
2,1.0,0.0,1.0,0.0,123.0,text4,yes
3,1.0,0.0,0.0,1.0,789.0,text5,no
4,0.0,1.0,0.0,0.0,987.0,text6,no
5,0.0,1.0,0.0,0.0,753.0,text7,yes


###As you can see, we have easily label encoded and one-hot encoded a column in our dataset using only the ColumnTransformer class. This is much easier and cleaner than using both the LabelEncoder and OneHotEncoder classes.

###Applying data transforms like scaling or encoding categorical variables is straightforward when all input variables are the **same type.**

It can be challenging when you have a dataset with **mixed types** and you want to **selectively apply** data transforms to some, but not all, input features.

Thankfully, the scikit-learn Python machine learning library provides the **ColumnTransformer** that allows you to **selectively apply data transforms to different columns in your dataset.**

In this tutorial, you will discover how to use the ColumnTransformer to selectively apply data transforms to columns in a dataset with mixed data types.

After completing this tutorial, you will know:

The challenge of using data transformations with datasets that have mixed data types.

How to define, fit, and use the ColumnTransformer to selectively apply data transforms to columns.

How to work through a real dataset with mixed data types and use the ColumnTransformer to apply different transforms to **categorical and numerical data columns.**

#Challenge of Transforming Different Data Types
It is important to prepare data prior to modeling.

This may involve **replacing missing values, scaling numerical values, and one hot encoding categorical data.**

Data transforms can be performed using the scikit-learn library:

##EX:  **SimpleImputer class** can be used to **replace missing values.**
##EX:  **MinMaxScaler class** can be used to **scale numerical values.**
##EX:  **OneHotEncoder** can be used to **encode categorical variables.**
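
As a small illustration of the first of these, the sketch below uses `SimpleImputer` with the median strategy. The array `X` is invented for the example and is not part of the house-prices data used later.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numerical data with two missing values
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 40.0]])

# Each missing value is replaced by the median of its column
imputer = SimpleImputer(strategy='median')
X_filled = imputer.fit_transform(X)
print(X_filled)
```

The column medians are 3.0 and 20.0, so those are the values filled in for the missing entries.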



In [4]:
import pandas as pd

df = pd.read_csv('/content/house-prices.csv')
df.head()

Unnamed: 0,Home,Price,SqFt,Bedrooms,Bathrooms,Offers,Brick,Neighborhood
0,1,114300,1790,2,2,2,No,East
1,2,114200,2030,4,2,3,No,East
2,3,114800,1740,3,2,1,No,East
3,4,94700,1980,3,2,3,No,East
4,5,119800,2130,3,3,3,No,East


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    object 
 4   aspiration        205 non-null    object 
 5   doornumber        205 non-null    object 
 6   carbody           205 non-null    object 
 7   drivewheel        205 non-null    object 
 8   enginelocation    205 non-null    object 
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    object 
 15  cylindernumber    205 non-null    object 
 16  enginesize        205 non-null    int64  
 1

In [7]:
X = df.iloc[:,2:-2]
X

Unnamed: 0,SqFt,Bedrooms,Bathrooms,Offers
0,1790,2,2,2
1,2030,4,2,3
2,1740,3,2,1
3,1980,3,2,3
4,2130,3,3,3
...,...,...,...,...
123,1900,3,3,3
124,2160,4,3,3
125,2070,2,2,2
126,2020,3,3,1


In [9]:
y= df.iloc[:,1:2]
y

Unnamed: 0,Price
0,114300
1,114200
2,114800
3,94700
4,119800
...,...
123,119700
124,147900
125,113500
126,149900


In [13]:
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
X_train,y_test


(     SqFt  Bedrooms  Bathrooms  Offers
 0    1790         2          2       2
 1    2030         4          2       3
 2    1740         3          2       1
 3    1980         3          2       3
 4    2130         3          3       3
 ..    ...       ...        ...     ...
 97   2000         2          2       1
 98   2060         3          2       1
 99   2080         3          3       2
 100  2010         3          2       5
 101  2260         3          3       5
 
 [102 rows x 4 columns],
       Price
 102  136800
 103  211200
 104   82300
 105  146900
 106  108500
 107  134000
 108  117000
 109  108700
 110  111600
 111  114900
 112  123600
 113  115700
 114  124500
 115  102500
 116  199500
 117  117800
 118  150200
 119  109700
 120  110400
 121  105600
 122  144800
 123  119700
 124  147900
 125  113500
 126  149900
 127  124600)

In [14]:
from sklearn.preprocessing import MinMaxScaler
# prepare transform
scaler = MinMaxScaler()
# fit transform on training data
scaler.fit(X_train)
# transform training data
X_train = scaler.transform(X_train)

X_train

array([[0.29824561, 0.        , 0.        , 0.2       ],
       [0.50877193, 0.66666667, 0.        , 0.4       ],
       [0.25438596, 0.33333333, 0.        , 0.        ],
       [0.46491228, 0.33333333, 0.        , 0.4       ],
       [0.59649123, 0.33333333, 1.        , 0.4       ],
       [0.28947368, 0.33333333, 0.        , 0.2       ],
       [0.33333333, 0.33333333, 1.        , 0.4       ],
       [0.62280702, 0.66666667, 0.        , 0.2       ],
       [0.57894737, 0.66666667, 0.        , 0.4       ],
       [0.24561404, 0.33333333, 1.        , 0.4       ],
       [0.50877193, 0.33333333, 0.        , 0.4       ],
       [0.36842105, 0.        , 0.        , 0.2       ],
       [0.40350877, 0.33333333, 0.        , 0.6       ],
       [0.61403509, 0.33333333, 1.        , 0.8       ],
       [1.        , 0.66666667, 1.        , 0.6       ],
       [0.28947368, 0.66666667, 0.        , 0.        ],
       [0.64912281, 0.33333333, 1.        , 0.6       ],
       [0.47368421, 0.33333333,

###Sequences of different transforms can also be **chained together using the Pipeline,** such as imputing missing values, then scaling numerical values.

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# define pipeline
pipeline = Pipeline(steps=[('i', SimpleImputer(strategy='median')), ('s', MinMaxScaler())])
# transform training data
X_train = pipeline.fit_transform(X_train)
X_train

array([[0.29824561, 0.        , 0.        , 0.2       ],
       [0.50877193, 0.66666667, 0.        , 0.4       ],
       [0.25438596, 0.33333333, 0.        , 0.        ],
       [0.46491228, 0.33333333, 0.        , 0.4       ],
       [0.59649123, 0.33333333, 1.        , 0.4       ],
       [0.28947368, 0.33333333, 0.        , 0.2       ],
       [0.33333333, 0.33333333, 1.        , 0.4       ],
       [0.62280702, 0.66666667, 0.        , 0.2       ],
       [0.57894737, 0.66666667, 0.        , 0.4       ],
       [0.24561404, 0.33333333, 1.        , 0.4       ],
       [0.50877193, 0.33333333, 0.        , 0.4       ],
       [0.36842105, 0.        , 0.        , 0.2       ],
       [0.40350877, 0.33333333, 0.        , 0.6       ],
       [0.61403509, 0.33333333, 1.        , 0.8       ],
       [1.        , 0.66666667, 1.        , 0.6       ],
       [0.28947368, 0.66666667, 0.        , 0.        ],
       [0.64912281, 0.33333333, 1.        , 0.6       ],
       [0.47368421, 0.33333333,

###Traditionally, applying different transforms to different types of columns would require you to **separate the numerical and categorical data** and then manually apply the transforms on those groups of features before combining the columns back together in order to fit and evaluate a model.

Now, you can use the **ColumnTransformer** to perform this operation for you.


##How to use the ColumnTransformer?
The ColumnTransformer is a class in the scikit-learn Python machine learning library that allows you to selectively apply data preparation transforms.

For example, it allows you to apply a specific transform or sequence of transforms to just the numerical columns, and a separate sequence of transforms to just the categorical columns.

To use the ColumnTransformer, you must specify a list of transformers.

Each transformer is a three-element tuple that defines the name of the transformer, the transform to apply, and the column indices to apply it to:

(Name, Object, Columns)

For example, the ColumnTransformer below applies a OneHotEncoder to columns 0 and 1.

In [23]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [28]:
transformer1 = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [0, 1])])

In [29]:
transformer1

###The example below applies a SimpleImputer with median imputing for numerical columns 0 and 1, and SimpleImputer with most frequent imputing to categorical columns 2 and 3.

In [25]:

t = [('num', SimpleImputer(strategy='median'), [0, 1]), ('cat', SimpleImputer(strategy='most_frequent'), [2, 3])]
transformer2 = ColumnTransformer(transformers=t)

t

[('num', SimpleImputer(strategy='median'), [0, 1]),
 ('cat', SimpleImputer(strategy='most_frequent'), [2, 3])]

In [27]:
transformer2

###Any columns not specified in the list of “transformers” are **dropped from the dataset by default;** this can be changed by setting the “remainder” argument.

Setting **remainder='passthrough'** will mean that all columns not specified in the list of “transformers” will be **passed through without transformation,** instead of being dropped.

###EX: if columns 0 and 1 were numerical and columns 2 and 3 were categorical and we wanted to just transform the categorical data and pass through the numerical columns unchanged, we could define the ColumnTransformer as follows:

In [30]:

transformer3 = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [2, 3])], remainder='passthrough')
transformer3
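
The effect of the “remainder” argument can be checked on a tiny dataset. The `toy` DataFrame below and its column names are made up for this sketch; columns 2 and 3 play the role of the categorical columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: two numerical columns followed by two categorical columns
toy = pd.DataFrame({
    "sqft": [1500, 2000, 1800],
    "beds": [2, 3, 3],
    "brick": ["No", "Yes", "No"],
    "area": ["East", "West", "North"],
})

# Default remainder: the untransformed columns are dropped
dropped = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [2, 3])])

# remainder='passthrough': the untransformed columns are appended unchanged
kept = ColumnTransformer(transformers=[('cat', OneHotEncoder(), [2, 3])], remainder='passthrough')

out_dropped = dropped.fit_transform(toy)
out_kept = kept.fit_transform(toy)
print(out_dropped.shape, out_kept.shape)
```

The one-hot encoding produces 2 + 3 = 5 columns, so the first result has 5 columns and the second has 7, with the two numerical columns placed after the encoded ones.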

##Once the transformer is defined, it can be used to transform a dataset.

In [47]:
import pandas as pd

df = pd.read_csv('/content/house-prices.csv')
X = df.iloc[:,2:-2]
y= df.iloc[:,1:2]
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
X_train,y_test

# one-hot encode columns 0 and 1; categories seen only in the test data are encoded as all zeros
transformer4 = ColumnTransformer(transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), [0, 1])])
# fit on the training data only, then reuse the fitted transformer on the test data
X_train = transformer4.fit_transform(X_train).toarray()
X_test = transformer4.transform(X_test).toarray()

In [67]:
X_train.shape

(102, 60)

##A ColumnTransformer can also be used in a Pipeline to selectively prepare the columns of your dataset before fitting a model on the transformed data.

This is the most likely use case as it ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a test dataset via cross-validation or making predictions on new data in the future.

In [66]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv('/content/house-prices.csv')
X = df.iloc[:,2:-2]
y = df.iloc[:,1:2]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print(X_train.shape, y_train.shape)

# define model (Price is a continuous target, so a regressor is used here)
model = LinearRegression()

# define transform; categories seen only in the test data are encoded as all zeros
transformer4 = ColumnTransformer(transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), [0, 1])], remainder='passthrough')

# define pipeline
pipeline = Pipeline(steps=[('t', transformer4), ('m', model)])

# fit the pipeline on the raw training data; the pipeline applies the transform itself,
# so the data must not be transformed manually beforehand
pipeline.fit(X_train, y_train.values.ravel())

# make predictions on the raw test data
yhat = pipeline.predict(X_test)

(102, 4) (102, 1)
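
The cross-validation use case mentioned above can be sketched as follows. The data here is synthetic and the column names, coefficients, and choice of `LinearRegression` are assumptions made only so that the example is self-contained; it does not use the house-prices file.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic mixed-type data: one numerical and one categorical column
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "sqft": rng.integers(1400, 2600, size=n),
    "brick": rng.choice(["No", "Yes"], size=n),
})
y = 50 * X["sqft"] + 20000 * (X["brick"] == "Yes") + rng.normal(0, 5000, size=n)

# Scale the numerical column and one-hot encode the categorical column
preprocess = ColumnTransformer(transformers=[
    ('num', MinMaxScaler(), ['sqft']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['brick']),
])
pipeline = Pipeline(steps=[('t', preprocess), ('m', LinearRegression())])

# The transforms are re-fit on the training part of every fold
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.shape)
```

Because the preprocessing lives inside the pipeline, each fold fits the scaler and encoder on its own training split only, which avoids leaking information from the held-out split.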