In [None]:
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import TruncatedSVD
from sklearn import preprocessing, model_selection, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

from IPython.display import display # Allows the use of display() for DataFrames

import warnings
warnings.filterwarnings('ignore')

Categorical variables are those whose values are selected from a group of categories or labels. For example, the variable may be “color” and may take on the values “red,” “green,” and “blue.” The variable Gender, with the values male or female, is categorical, and so is the variable marital status, with the values never married, married, divorced, or widowed.

As another example, in a survey about the brand of car that respondents own, the result would be categorical (e.g. Tesla, Toyota, Ford, None, etc.). Responses fall into a fixed set of categories.

You will get an error if you try to plug these variables into most machine learning models without "encoding" them first.

Almost all machine learning algorithms, including deep neural networks, require that input and output variables are numbers, so categorical data must be encoded as numbers before we can use it to fit and evaluate a model.

There are quite a few techniques to encode categorical variables for modeling, although the three most common are as follows:

- Integer Encoding: Where each unique label is mapped to an integer.

- One Hot Encoding: Where each label is mapped to a binary vector.

- Learned Embedding: Where a distributed representation of the categories is learned.

In some categorical variables, the labels have an intrinsic order, for example, in the variable Student's grade, the values of A, B, C, or Fail are ordered, A being the highest grade and Fail the lowest. These are called ordinal categorical variables. Variables in which the categories do not have an intrinsic order are called nominal categorical variables, such as the variable City, with the values of London, Manchester, Bristol, and so on.
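For an ordinal variable, integer encoding can keep the order if we list the categories ourselves. Below is a minimal sketch, assuming a made-up `grade` column (the data and variable names are only for illustration) and using scikit-learn's `OrdinalEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: student grades, an ordinal variable
grades = pd.DataFrame({'grade': ['B', 'A', 'Fail', 'C', 'A']})

# List the categories from lowest to highest so the integer codes respect the order
grade_order = [['Fail', 'C', 'B', 'A']]
ordinal_encoder = OrdinalEncoder(categories=grade_order)

# fit_transform returns a 2D float array, so flatten it into a single column
grades['grade_encoded'] = ordinal_encoder.fit_transform(grades[['grade']]).ravel()
print(grades)
```

Without the explicit `categories` argument the encoder would sort the labels alphabetically, which would not match the grade order. For a nominal variable such as City, any integer order is arbitrary, which is one reason one-hot encoding is often preferred there.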

The values of categorical variables are often encoded as strings. Most scikit-learn estimators do not accept strings as input values, so we need to transform those strings into numbers. The act of replacing strings with numbers is called categorical encoding.

## One-hot Encoding

In one-hot encoding, we represent a categorical variable as a group of binary variables,
where each binary variable represents one category. The binary variable indicates whether
the category is present in an observation (1) or not (0).

One-hot encoding is the most widespread approach, and it works very well unless our categorical variable takes on a large number of values (e.g. more than 20 different values).

![img](https://i.imgur.com/5td19b8.jpg)

Another example with a variable named 'color'. The values in the variable are Red, Yellow and Green. And then we create a separate column for each possible value. Wherever the original value was Red, we put a 1 in the Red column.

![img](https://i.imgur.com/kdltIHI.png)

From the above Gender variable, we can derive the binary variable of Female, which shows the value of 1 for females, or the binary variable of Male, which takes the value of 1 for the males in the dataset.
For the categorical variable of Color with the values of red, green, and blue, we can create three variables called red, green, and blue. These variables will take the value of 1 if the observation is red, green, or blue, respectively, or 0 otherwise.

A categorical variable with k unique categories can be encoded in k-1 binary variables. For Gender, k is 2 as it contains two labels (male and female), so we need to create only one binary variable (k - 1 = 1) to capture all of the information. For the color variable, which has three categories (k=3; red, green, and blue), we need to create two (k - 1 = 2) binary variables to capture all the information, so that the following occurs:

- If the observation is red, it will be captured by the variable red (red = 1, green = 0).

- If the observation is green, it will be captured by the variable green (red = 0, green = 1).

- If the observation is blue, it will be captured by the combination of red and green (red = 0, green = 0).
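The k-1 scheme above can be sketched with a toy color column. This is only an illustration: the dataframe is made up, and I am assuming the third color is blue. pandas drops the first category in sorted order, which here happens to be blue.

```python
import pandas as pd

# Hypothetical toy data with three colors (k = 3)
colors = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# drop_first=True keeps k - 1 = 2 columns; the first sorted label ('blue') is dropped.
# dtype=int gives 0/1 values instead of booleans.
color_dummies = pd.get_dummies(colors['color'], drop_first=True, dtype=int)
print(color_dummies)
```

The blue row ends up with 0 in both remaining columns, so no information is lost by dropping its column.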

There are a few occasions in which we may prefer to encode the categorical variables with k binary variables:

- When training decision trees, as they do not evaluate the entire feature space at the same time
- When selecting features recursively
- When determining the importance of each category within a variable

In [None]:
breast_cancer_df = pd.read_csv('../input/breast-cancer-data/breast-cancer.data')
print('Breast Cancer df number of rows and columns are ', breast_cancer_df.shape)

In [None]:
# Replace the question marks in the dataset with NumPy NaN values:
breast_cancer_df = breast_cancer_df.replace('?', np.nan)

In [None]:
# Create a list with the variable names.
# There are 10 columns, as we know from the shape of the dataframe,
# so create a list of 10 column headings starting with 'A1' and ending with 'A10'.
# That means iterating over range(1, 11).
column_labels = ['A' + str(s) for s in range(1, 11)]
column_labels

In [None]:
# Now assign the above list as the column labels
breast_cancer_df.columns = column_labels

In [None]:
# Make lists with categorical and numerical variables:

category_columns = [c for c in breast_cancer_df.columns if breast_cancer_df[c].dtypes == 'O' ]
numeric_columns = [c for c in breast_cancer_df.columns if breast_cancer_df[c].dtypes != 'O' ]

print('breast_cancer_category_columns ', category_columns)
print('breast_cancer_numeric_columns ', numeric_columns)

From the above we see that column 'A7' is the only numeric column and all the rest are categorical columns.
Now, re-cast the numerical variable to float type:

In [None]:
breast_cancer_df['A7'] = breast_cancer_df['A7'].astype(float)

#### Re-code the target variable as binary:
The target is the column labelled 'A10'. We map each 'yes' to 1 and each 'no' to 0 (zero).

In [None]:
breast_cancer_df['A10'] = breast_cancer_df['A10'].map({'yes':1, 'no':0})
breast_cancer_df.head()

In [None]:
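# Fill in the missing data
breast_cancer_df[numeric_columns] = breast_cancer_df[numeric_columns].fillna(0)
# Use a string placeholder ('Missing' is an arbitrary label) for the categorical columns:
# filling them with the integer 0 would mix strings and numbers, which OneHotEncoder rejects.
breast_cancer_df[category_columns] = breast_cancer_df[category_columns].fillna('Missing')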


In [None]:
# separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_df.drop(labels=['A10'], axis=1), breast_cancer_df['A10'], test_size=0.3, random_state=0)

#  Let's inspect the unique categories of the A8 variable:
X_train['A8'].unique()


In [None]:
#  Let's inspect the unique categories of the A3 variable
X_train['A3'].unique()

So I have the unique values as

array(['ge40', 'premeno', 'lt40'], dtype=object)

---

## One-hot encoding using pandas get_dummies()

Let's encode A3 into k-1 binary variables using pandas and then inspect the first five rows of the resulting dataframe:

In [None]:
tmp_1 = pd.get_dummies(X_train['A3'], drop_first=True)
tmp_1.head()

In [None]:
tmp_2 = pd.get_dummies(X_train['A3'], drop_first=False)
tmp_2.head()

The pandas `get_dummies()` function converts categorical variables into indicator variables. By default it ignores missing data; if we pass `dummy_na=True`, it returns missing data as an additional category.

 To encode the variable into k binaries, use instead `drop_first=False`.
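To see how `get_dummies()` treats missing values, here is a small sketch on a made-up series (the values mimic A3, but the data and variable names are only for illustration). It compares the default behaviour with `dummy_na=True`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with one missing value
s = pd.Series(['ge40', 'premeno', np.nan, 'lt40'])

# Default: the missing value gets no column, so its row is all zeros
default_dummies = pd.get_dummies(s, dtype=int)

# dummy_na=True: an extra indicator column is added for missing values
with_nan_dummies = pd.get_dummies(s, dummy_na=True, dtype=int)

print(default_dummies)
print(with_nan_dummies)
```

In the default output the row with the missing value cannot be told apart from a dropped category when `drop_first=True` is also used, so it may be worth deciding explicitly how missing data should be handled.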

From the output above we can see that each label is now a binary variable. With `drop_first=True` there are two new columns (k - 1 = 2), named after the remaining labels.

To see how `get_dummies()` behaves, take a look at the code below:

In [None]:
df = pd.DataFrame({'country': ['russia', 'germany', 'australia','korea']})
df_get_dummied = pd.get_dummies(df['country'], prefix='country')
df_get_dummied

![img](https://i.imgur.com/DgTHD0B.jpg)

To encode all categorical variables at the same time, let's first make a list with their names. I am excluding:
 - A7, which is numerical data;
 - A10, which is the target variable and was already made binary above;
 - 'A2', 'A4' and 'A5', which hold ranges.

In [None]:
vars_categorical = ['A1', 'A3', 'A6', 'A8', 'A9' ]

# Now, let's encode all of the categorical variables into k-1 binaries each, capturing the result in a new dataframe:

X_train_dummy_encoded_pandas = pd.get_dummies(X_train[vars_categorical], drop_first=True)
X_test_dummy_encoded = pd.get_dummies(X_test[vars_categorical], drop_first=True )

X_train_dummy_encoded_pandas.head()

So as we can see above, the pandas' `get_dummies()` function will create one binary variable per found category. Hence, if there are more categories in the train set than in the test set, get_dummies() will return more columns in the transformed train set than in the transformed test set.
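One common way to deal with this mismatch is to reindex the encoded test set to the train columns. Below is a minimal sketch on made-up data (the dataframes and names are hypothetical), encoding into k binaries so that the comparison between the two sets is unambiguous:

```python
import pandas as pd

# Hypothetical toy train/test sets: 'lt40' never appears in the test set
train = pd.DataFrame({'A3': ['ge40', 'premeno', 'lt40', 'premeno']})
test = pd.DataFrame({'A3': ['premeno', 'ge40']})

train_enc = pd.get_dummies(train, columns=['A3'], dtype=int)
test_enc = pd.get_dummies(test, columns=['A3'], dtype=int)  # only two columns

# Give the test set exactly the train columns; categories absent from test become 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

Note that reindexing also silently drops any category that appears only in the test set. With `drop_first=True` there is an extra risk: the dropped category is the first one found in each set, so it can differ between train and test.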

---

## Now one-hot encoding using scikit-learn

First create a OneHotEncoder transformer that encodes into k-1 binary variables and returns a NumPy array:

In [None]:
# Note: `sparse_output` replaced the older `sparse` argument in scikit-learn 1.2; on earlier versions use sparse=False
encoder_scikit_learn = OneHotEncoder(categories='auto', drop='first', sparse_output=False)

# Now fit the encoder to a slice of the train set with the categorical variables, so it learns the categories to encode:
encoder_scikit_learn.fit(X_train[vars_categorical])

Scikit-learn's `OneHotEncoder` will only encode the categories learned from the train set. If there are new categories in the test set, we can instruct the encoder to ignore them with `handle_unknown='ignore'` or to raise an error with `handle_unknown='error'` (the default).
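The two `handle_unknown` settings can be compared on a toy example. This is only a sketch: the dataframes and the label `'unknown_level'` are made up, and the encoders here keep all k columns for simplicity.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test set contains a label never seen in training
train = pd.DataFrame({'A3': ['ge40', 'premeno', 'lt40']})
test = pd.DataFrame({'A3': ['premeno', 'unknown_level']})

# 'ignore': an unseen category is encoded as a row of zeros
lenient_encoder = OneHotEncoder(handle_unknown='ignore')
lenient_encoder.fit(train)
encoded_test = lenient_encoder.transform(test).toarray()
print(encoded_test)

# 'error': transform raises a ValueError on an unseen category
strict_encoder = OneHotEncoder(handle_unknown='error')
strict_encoder.fit(train)
try:
    strict_encoder.transform(test)
    raised = False
except ValueError as err:
    raised = True
    print('Encoder refused the unseen category:', err)
```

Calling `.toarray()` on the default sparse output avoids depending on the name of the dense-output argument, which changed between scikit-learn versions.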

Now, let's create the NumPy arrays with the binary variables for train and test sets:

In [None]:
X_train_encoded_scikit = encoder_scikit_learn.transform(X_train[vars_categorical])

X_train_encoded_scikit[:5]  # a NumPy array has no .head(); slice the first five rows instead

In [None]:
X_test_encoded_scikit = encoder_scikit_learn.transform(X_test[vars_categorical])

X_test_encoded_scikit[:5]  # a NumPy array has no .head(); slice the first five rows instead

Unfortunately, the feature names are not preserved in the NumPy array, so identifying which feature was derived from which variable is not straightforward.