In [223]:
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn import preprocessing, model_selection, metrics
from sklearn.model_selection import train_test_split

from IPython.display import display # Allows the use of display() for DataFrames

import warnings
warnings.filterwarnings('ignore')

Categorical variables are those whose values are selected from a group of categories or labels. For example, the variable may be “color” and may take on the values “red,” “green,” and “blue.” Likewise, the variable Gender, with the values male or female, is categorical, and so is the variable marital status, with the values never married, married, divorced, or widowed.

As another example, in a survey asking respondents which brand of car they own, the result would be categorical (e.g. Tesla, Toyota, Ford, None, etc.). Responses fall into a fixed set of categories.

You will get an error if you try to plug these variables into most machine learning models without "encoding" them first.

Almost all machine learning algorithms, including deep learning neural networks, require input and output variables to be numbers. Categorical data must therefore be encoded as numbers before we can use it to fit and evaluate a model.

There are quite a few techniques to encode categorical variables for modeling, although the three most common are as follows:

- Integer Encoding: Where each unique label is mapped to an integer.

- One Hot Encoding: Where each label is mapped to a binary vector.

- Learned Embedding: Where a distributed representation of the categories is learned.
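
To make the first two techniques concrete, here is a minimal sketch using pandas. The `colors` series is hypothetical toy data, not one of the datasets loaded in this notebook, and `pd.factorize` and `pd.get_dummies` are only one of several ways to produce these encodings. A learned embedding needs a trained model and is not shown.

```python
import pandas as pd

# Hypothetical toy data, not taken from the datasets used in this notebook.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Integer encoding: each unique label is mapped to an integer code
# (codes are assigned in order of first appearance).
codes, uniques = pd.factorize(colors)
integer_encoded = pd.Series(codes, name="color_code")

# One-hot encoding: each label becomes its own binary column.
one_hot_encoded = pd.get_dummies(colors, prefix="color", dtype=int)

print(list(uniques))
print(integer_encoded.tolist())
print(one_hot_encoded)
```

Note that the integer codes imply an order (0 < 1 < 2) that the colors do not have, which is one reason one-hot encoding is often preferred for nominal variables.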

In some categorical variables, the labels have an intrinsic order, for example, in the variable Student's grade, the values of A, B, C, or Fail are ordered, A being the highest grade and Fail the lowest. These are called ordinal categorical variables. Variables in which the categories do not have an intrinsic order are called nominal categorical variables, such as the variable City, with the values of London, Manchester, Bristol, and so on.
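
For an ordinal variable such as the grade example, the integers should respect the intrinsic order. The sketch below assumes a hypothetical `grades` series and a hand-written `grade_order` mapping; both names are illustrative only.

```python
import pandas as pd

# Hypothetical grades; the order A > B > C > Fail comes from the text above.
grades = pd.Series(["B", "A", "Fail", "C", "A"], name="grade")

# Explicit mapping so that the integers respect the order (higher is better).
grade_order = {"Fail": 0, "C": 1, "B": 2, "A": 3}
grade_encoded = grades.map(grade_order)

print(grade_encoded.tolist())
```

Writing the mapping by hand keeps the order under our control; automatic encoders typically sort labels alphabetically, which would not place Fail below C here.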

The values of categorical variables are often encoded as strings. Most scikit-learn estimators do not accept strings as feature values; therefore, we need to transform those strings into numbers. The act of replacing strings with numbers is called categorical encoding.

## One-hot Encoding

In one-hot encoding, we represent a categorical variable as a group of binary variables,
where each binary variable represents one category. The binary variable indicates whether
the category is present in an observation (1) or not (0).

One-hot encoding is the most widespread approach, and it works very well unless the categorical variable takes on a large number of values (e.g. more than 20 different values).

![img](https://i.imgur.com/5td19b8.jpg)

As another example, take a variable named 'color' whose values are Red, Yellow and Green. We create a separate column for each possible value. Wherever the original value was Red, we put a 1 in the Red column.

![img](https://i.imgur.com/kdltIHI.png)

From the above Gender variable, we can derive the binary variable Female, which takes the value 1 for females, or the binary variable Male, which takes the value 1 for the males in the dataset.
For the categorical variable Color with the values red, blue, and green, we can create three variables called red, blue, and green. These variables take the value 1 if the observation is red, blue, or green, respectively, and 0 otherwise.

A categorical variable with k unique categories can be encoded in k - 1 binary variables. For Gender, k is 2 as it contains two labels (male and female); therefore, we need to create only one binary variable (k - 1 = 1) to capture all of the information. For the color variable, which has three categories (k = 3; red, blue, and green), we need to create two (k - 1 = 2) binary variables to capture all the information, so that the following occurs:

- If the observation is red, it will be captured by the variable red (red = 1, blue = 0).

- If the observation is blue, it will be captured by the variable blue (red = 0, blue = 1).

- If the observation is green, it will be captured by the combination of red and blue (red = 0, blue = 0).
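
The k - 1 scheme can be sketched with `pd.get_dummies(..., drop_first=True)`. This assumes the three colors are red, blue, and green, with green as the reference category; the `color` series is hypothetical, and green is listed first in `categories` only so that `drop_first` removes it.

```python
import pandas as pd

# Hypothetical observations; "green" is listed first so it becomes the dropped reference category.
color = pd.Series(
    pd.Categorical(["red", "blue", "green", "red"], categories=["green", "red", "blue"]),
    name="color",
)

# drop_first=True removes the first category, leaving k - 1 = 2 binary columns.
k_minus_1 = pd.get_dummies(color, prefix="color", drop_first=True, dtype=int)

print(k_minus_1)
```

The green observation appears as a row of zeros, so no information is lost by dropping its column.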

There are a few occasions in which we may prefer to encode the categorical variables with k binary variables:

- When training decision trees, as they do not evaluate the entire feature space at the same time
- When selecting features recursively
- When determining the importance of each category within a variable
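
As a rough illustration of the choice between k and k - 1 variables, scikit-learn's `OneHotEncoder` supports both through its `drop` parameter. The array `X` below is hypothetical toy data; calling `.toarray()` on the result is just a convenient way to inspect the encoded matrix.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column feature matrix.
X = np.array([["red"], ["blue"], ["green"], ["red"]])

# k binary variables: one column per category (categories are sorted alphabetically).
full_encoder = OneHotEncoder()
X_k = full_encoder.fit_transform(X).toarray()

# k - 1 binary variables: the first category ("blue") is dropped.
reduced_encoder = OneHotEncoder(drop="first")
X_k_minus_1 = reduced_encoder.fit_transform(X).toarray()

print(full_encoder.categories_)
print(X_k.shape, X_k_minus_1.shape)
```

With all k columns kept, every category has its own variable, which is what the three cases listed above rely on.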

In [224]:
# santander_train_df = pd.read_csv('../input/santander-value-prediction-challenge/train.csv')
# santander_train_df.head()

In [225]:
credit_screening_df = pd.read_csv('../input/uci-ml-credit-screening/crx.data')
# credit_screening_df.head()

In [226]:
# print('Credit screening df number of rows and columns are ', credit_screening_df.shape)

In [227]:
breast_cancer_df = pd.read_csv('../input/uci-ml-credit-screening/breast-cancer.data')
print('Breast Cancer df number of rows and columns are ', breast_cancer_df.shape)

Breast Cancer df number of rows and columns are  (285, 10)


In [228]:
# Replace the question marks in the dataset with NumPy NaN values:
credit_screening_df = credit_screening_df.replace('?', np.nan)
breast_cancer_df = breast_cancer_df.replace('?', np.nan)
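
An alternative to calling `replace` afterwards is to mark `?` as missing while loading. The sketch below uses a hypothetical two-row sample held in memory rather than the real files. It also passes `header=None`, on the assumption that the file has no header row; without it, pandas would use the first record as column names.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same comma-separated style; not the real file.
raw = io.StringIO("a,30.83,?\nb,?,u\n")

# header=None keeps the first record as data; na_values turns '?' into NaN on load.
sample_df = pd.read_csv(raw, header=None, na_values="?")

print(sample_df.shape)
print(sample_df.isna().sum().tolist())
```

A side effect of handling `?` at load time is that a numeric column containing it is parsed as float directly, rather than as strings.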

In [229]:
# Create a list with the variable names.
# There are 10 columns, as we know from the shape of the dataframe.
column_labels = ['A' + str(s) for s in range(1, 11)]
column_labels

['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10']

In [230]:
# Now assign the above list as the column labels
breast_cancer_df.columns = column_labels

In [231]:
# Make lists with categorical and numerical variables:
#
# category_columns_santander = [c for c in santander_train_df.columns if santander_train_df[c].dtypes == 'O']
# print('Category Columns_santander', category_columns_santander)
# numeric_columns = [c for c in santander_train_df.columns if santander_train_df[c].dtypes != 'O']
# print('Numeric Columns', numeric_columns)

# credit_scoring_category_columns = [c for c in credit_screening_df.columns if credit_screening_df[c].dtypes == 'O']
# credit_scoring_numeric_columns = [c for c in credit_screening_df.columns if credit_screening_df[c].dtypes != 'O']
# print('credit_scoring_category_columns ', credit_scoring_category_columns)
# print('credit_scoring_numeric_columns ', credit_scoring_numeric_columns)

breast_cancer_category_columns = [c for c in breast_cancer_df.columns if breast_cancer_df[c].dtypes == 'O' ]
breast_cancer_numeric_columns = [c for c in breast_cancer_df.columns if breast_cancer_df[c].dtypes != 'O' ]

print('breast_cancer_category_columns ', breast_cancer_category_columns)
print('breast_cancer_numeric_columns ', breast_cancer_numeric_columns)

breast_cancer_category_columns  ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A8', 'A9', 'A10']
breast_cancer_numeric_columns  ['A7']
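
The same split can be sketched with `DataFrame.select_dtypes`, which avoids comparing dtypes by hand. `toy_df` is a hypothetical miniature frame, used here so the example runs without the input files.

```python
import pandas as pd

# Hypothetical miniature frame with two string columns and one numeric column.
toy_df = pd.DataFrame({"A1": ["x", "y"], "A2": ["p", "q"], "A7": [2.0, 3.0]})

# Numeric columns are selected directly; everything else is treated as categorical.
numeric_columns = toy_df.select_dtypes(include="number").columns.tolist()
category_columns = toy_df.select_dtypes(exclude="number").columns.tolist()

print("category columns", category_columns)
print("numeric columns", numeric_columns)
```

Treating every non-numeric column as categorical is an assumption that holds for this toy frame; a real dataset may also contain dates or free text that need separate handling.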


In [232]:
# From the above we see that 'A7' is the only numeric column and
# all the rest are categorical columns.
# Now, re-cast the numerical variable to float type:
breast_cancer_df['A7'] = breast_cancer_df['A7'].astype(float)

In [233]:
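# Re-code the target variable as binary.
# The target is taken to be the column labelled 'A10'.
# Note: the '+'/'-' labels are those of the credit screening data; values
# not present in the mapping (such as 'yes'/'no') become NaN with .map().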
breast_cancer_df['A10'] = breast_cancer_df['A10'].map({'+':1, '-':0})
breast_cancer_df.head()


|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 |
|---|----|----|----|----|----|----|----|----|----|-----|
| 0 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2.0 | right | right_up | no |
| 1 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2.0 | left | left_low | no |
| 2 | no-recurrence-events | 60-69 | ge40 | 15-19 | 0-2 | no | 2.0 | right | left_up | no |
| 3 | no-recurrence-events | 40-49 | premeno | 0-4 | 0-2 | no | 2.0 | right | right_low | no |
| 4 | no-recurrence-events | 60-69 | ge40 | 15-19 | 0-2 | no | 2.0 | left | left_low | no |
