# Feature Selection: Select Categorical Input Features

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

## Overview

Feature selection is the process of identifying and selecting a subset of input features that are
most relevant to the target variable. Feature selection is often straightforward when working
with real-valued data, for example by using Pearson's correlation coefficient, but can be challenging
when working with categorical data. The two most commonly used feature selection methods for
categorical input data when the target variable is also categorical (e.g. classification predictive
modeling) are the **chi-squared statistic** and the **mutual information statistic**.

## Learning Objectives

- Learn about the breast cancer predictive modeling problem with categorical inputs and a binary classification target variable
- Understand how to evaluate the importance of categorical features using the chi-squared and mutual information statistics
- Learn how to perform feature selection for categorical data when fitting and evaluating a classification model

### Tasks to complete

- Evaluate models using different feature selection methods
- Compare performance between full feature set and selected features
- Create visualizations of feature importance scores

## Prerequisites


- A working Python environment and familiarity with Python
- Familiarity with pandas and numpy libraries
- Basic understanding of classification models
- Knowledge of basic statistical concepts

## Get Started

Setup steps:
- Import required libraries (matplotlib, pandas, scikit-learn)
- Download breast cancer dataset
- Prepare data loading and preprocessing functions
- Set up feature selection methods

To start, we install required packages and import the necessary libraries.

### Install packages 


In [None]:
# Install the necessary packages using pip in a Jupyter notebook environment

# 'matplotlib' is a plotting library for creating static, interactive, and animated visualizations
%pip install matplotlib 

# 'numpy' is a fundamental package for scientific computing in Python, providing support for arrays and matrices
%pip install numpy

# 'pandas' is a powerful data manipulation and analysis library, often used for working with structured data
%pip install pandas

# 'scikit-learn' is a machine learning library that provides simple and efficient tools for data mining and data analysis
%pip install scikit-learn


### Import libraries

In [None]:
# Importing the necessary modules and classes

# Importing pyplot from matplotlib for plotting graphs and visualizations
from matplotlib import pyplot

# Importing read_csv from pandas to load CSV data into a DataFrame
from pandas import read_csv

# Importing SelectKBest and feature selection methods for selecting the best features
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Importing LogisticRegression from sklearn to use as a machine learning model for classification
from sklearn.linear_model import LogisticRegression

# Importing accuracy_score from sklearn to evaluate the performance of the model
from sklearn.metrics import accuracy_score

# Importing train_test_split from sklearn to split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# Importing LabelEncoder and OrdinalEncoder for encoding categorical features as numeric values
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

## Breast Cancer Dataset

The breast cancer dataset classifies patient records as either
a recurrence or no recurrence of cancer.

```
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
   1. Class: no-recurrence-events, recurrence-events
   2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
   3. menopause: lt40, ge40, premeno.
   4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
   5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
   6. node-caps: yes, no.
   7. deg-malig: 1, 2, 3.
   8. breast: left, right.
   9. breast-quad: left-up, left-low, right-up, right-low, central.
  10. irradiat: yes, no.
Missing Attribute Values: (denoted by "?")
   Attribute #:  Number of instances with missing values:
   6.             8
   9.             1.
Class Distribution:
    1. no-recurrence-events: 201 instances
    2. recurrence-events: 85 instances 
```

You can learn more about the dataset here:
* Breast Cancer Dataset ([breast-cancer.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv))
* Breast Cancer Dataset Description ([breast-cancer.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names))



### Download Breast Cancer data files

In [None]:
# Define the file path for the breast cancer dataset
# Download breast-cancer.csv from the link above and save it in the 'Data' folder,
# which this path expects two levels above the current working directory
breast_cancer_csv = "../../Data/breast-cancer.csv"  # Path to the CSV file containing the breast cancer dataset

### Loading and encoding the categorical dataset

In [None]:
# example of loading and preparing the breast cancer dataset

# load the dataset
def load_dataset(filename):
    # Load the dataset from the provided CSV file.
    # 'read_csv' reads the file into a DataFrame without headers (header=None)
    dataset = read_csv(filename, header=None)

    # Retrieve the dataset as a NumPy array, including both input and output variables.
    data = dataset.values

    # Split the data into input variables (X) and output variable (y).
    # 'X' contains all columns except the last (input features), while 'y' contains the last column (target/output)
    X = data[:, :-1]  # Input features (all rows, all columns except the last one)
    y = data[:, -1]   # Output (target values, all rows, only the last column)

    # Format all input features (X) as strings.
    # This ensures that the input data is treated as categorical or textual if necessary.
    X = X.astype(str)

    # Return the input features (X) and output (y).
    return X, y


# Function to prepare and encode the input data for training and testing
def prepare_inputs(X_train, X_test):
    # Create an OrdinalEncoder instance to convert categorical data into integer codes
    oe = OrdinalEncoder()  # encode each variable to integers

    # Fit the encoder on the training data to learn the unique categories
    oe.fit(X_train)

    # Transform both the training and testing data using the fitted encoder
    X_train_enc = oe.transform(X_train)  # Encode the training data
    X_test_enc = oe.transform(X_test)    # Encode the test data

    # Return the encoded training and testing data
    return X_train_enc, X_test_enc



# Function to prepare and encode target variables for training and testing datasets
def prepare_targets(y_train, y_test):
    # Initialize the LabelEncoder, which converts categorical labels into numeric labels
    le = LabelEncoder()  # LabelEncoder is designed for encoding a single variable
    
    # Fit the encoder on the training data to learn the mapping of labels
    le.fit(y_train)
    
    # Transform both the training and testing labels into numeric form
    y_train_enc = le.transform(y_train)  # Encode the training labels
    y_test_enc = le.transform(y_test)    # Encode the testing labels
    
    # Return the encoded labels for both training and testing datasets
    return y_train_enc, y_test_enc


# Load the breast cancer dataset (assumed to be in CSV format)
# X represents the input features, and y represents the target labels
X, y = load_dataset(breast_cancer_csv)

# Split the dataset into training and testing sets
# 33% of the data will be used for testing (test_size=0.33), and 67% for training
# random_state ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# Prepare the input data by applying necessary preprocessing to the features
# X_train_enc and X_test_enc are the encoded (transformed) input data for training and testing
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# Prepare the output data by encoding or transforming the target labels
# y_train_enc and y_test_enc are the encoded target data for training and testing
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

# Print the shapes of the training and testing datasets to confirm the splitting and preprocessing
print("Train", X_train_enc.shape, y_train_enc.shape)
print("Test", X_test_enc.shape, y_test_enc.shape)

We can see that we have 191 examples for training and 95 for testing.
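
To make the encoding step more concrete, the following minimal sketch applies `OrdinalEncoder` and `LabelEncoder` to a few hand-written values. The sample values are hypothetical, chosen only to resemble the tumor-size ranges and class labels in the dataset description; they are not rows from the real file.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample values (not rows from the real file)
X_sample = [["0-4"], ["5-9"], ["10-14"], ["5-9"]]
y_sample = [
    "no-recurrence-events",
    "recurrence-events",
    "no-recurrence-events",
    "no-recurrence-events",
]

# OrdinalEncoder learns one sorted list of categories per input column
oe = OrdinalEncoder()
X_codes = oe.fit_transform(X_sample)
print(oe.categories_[0])  # categories in the order used for the integer codes
print(X_codes.ravel())    # integer code assigned to each sample value

# LabelEncoder does the same for a single target column
le = LabelEncoder()
y_codes = le.fit_transform(y_sample)
print(le.classes_)
print(y_codes)
```

Because the categories are sorted as strings, `10-14` is placed before `5-9` in this sketch. The integer codes are therefore better read as arbitrary identifiers than as an ordering of the underlying ranges.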

## Categorical Feature Selection

There are two popular feature selection techniques that can be used for categorical input data
and a categorical (class) target variable. They are:
* Chi-Squared Statistic.
* Mutual Information Statistic.



### Chi-Squared Feature Selection

Pearson's chi-squared statistical hypothesis
test is an example of a test for independence between categorical variables. The results of this
test can be used for feature selection, where those features that are independent of the target
variable can be removed from the dataset.
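
As a rough illustration of the independence test itself, the sketch below runs Pearson's chi-squared test on two small contingency tables using `scipy.stats.chi2_contingency`. The tables are hypothetical counts, and the use of SciPy here is an assumption for illustration (it is installed as a dependency of scikit-learn); it is not the code path used in the rest of this tutorial.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency tables: rows are feature categories,
# columns are class labels, cells are observation counts
tables = {
    "dependent": np.array([[30, 10], [10, 30]]),
    "independent": np.array([[20, 20], [20, 20]]),
}

results = {}
for name, table in tables.items():
    # correction=False disables Yates' continuity correction so the
    # statistic matches the textbook formula sum((O - E)^2 / E)
    stat, p_value, dof, expected = chi2_contingency(table, correction=False)
    results[name] = (stat, p_value, dof)
    print("%s: chi2=%.3f, p=%.6f, dof=%d" % (name, stat, p_value, dof))
```

A large statistic with a small p-value suggests the feature and the class are dependent, while a statistic near zero is consistent with independence. Note that the `chi2` score function in scikit-learn computes its statistic from per-class sums of the feature values, so its scores need not match those of a contingency-table test like this one.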

In [None]:
# example of chi squared feature selection for categorical data
# feature selection
def select_features(X_train, y_train, X_test):
    # Select features according to the k highest scores.
    # k : int or "all", default=10, Number of top features to select.
    # The "all" option bypasses selection, for use in a parameter search.
    fs = SelectKBest(score_func=chi2, k="all")
    # Run score function on (X, y) and get the appropriate features.
    fs.fit(X_train, y_train)
    # Reduce X to the selected features.
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs


# load the dataset
X, y = load_dataset(breast_cancer_csv)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)

# what are scores for the features
for i in range(len(fs.scores_)):
    print("Feature %d: %f" % (i, fs.scores_[i]))


In this case, we can see that the scores are small, and it is hard to tell from the numbers
alone which features are more relevant. Perhaps features 3, 4, 5, and 8 are most relevant.

In [None]:
# Plotting the feature selection scores

# 'fs.scores_' contains the importance or scores of each feature (e.g., from feature selection)
# A bar plot is created using pyplot.bar to visualize the scores of the features
# The x-axis represents the feature indices (e.g., 0, 1, 2, ..., len(fs.scores_)-1)
# The y-axis represents the corresponding scores of each feature
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)

# Display the plot
pyplot.show()  # Show the generated bar plot with feature scores

A bar chart of the feature importance scores for each input feature is created. This suggests
that feature 3 is the most relevant (according to chi-squared) and that perhaps
four of the nine input features are the most relevant. We could set k = 4 when configuring
SelectKBest to select these top four features.
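
A minimal sketch of setting k = 4 and checking which columns survive is shown here. It uses synthetic integer data as a stand-in for the encoded training set (an assumption, so that the example runs without the CSV file); the names `X_demo`, `y_demo`, and `fs_demo` are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the ordinal-encoded training data (an assumption,
# used only so the sketch runs without the CSV file)
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 5, size=(200, 9)).astype(float)
y_demo = rng.integers(0, 2, size=200)

# Keep only the four highest-scoring features
fs_demo = SelectKBest(score_func=chi2, k=4)
X_demo_fs = fs_demo.fit_transform(X_demo, y_demo)

# get_support(indices=True) reports which column indices were kept
selected = fs_demo.get_support(indices=True)
print("Selected feature indices:", selected)
print("Reduced shape:", X_demo_fs.shape)
```

Inspecting the selected indices is a quick way to confirm that the features kept by `SelectKBest` are the ones that stood out in the bar chart.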

### Mutual Information Feature Selection

Mutual information from the field of information theory is the application of information gain
(typically used in the construction of decision trees) to feature selection. Mutual information is
calculated between two variables and measures the reduction in uncertainty for one variable
given a known value of the other variable.
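
The idea of "reduction in uncertainty" can be sketched on toy data with `sklearn.metrics.mutual_info_score`, which computes mutual information (in nats) between two discrete label sequences. The variables below are hypothetical and chosen so the result is easy to check by hand: a feature that fully determines a balanced binary target shares ln(2) nats with it, while an independent feature shares none.

```python
import math
from sklearn.metrics import mutual_info_score

# Hypothetical toy variables
target = [0, 0, 1, 1]
informative = [0, 0, 1, 1]    # knowing this value determines the target
uninformative = [0, 1, 0, 1]  # statistically independent of the target

mi_informative = mutual_info_score(target, informative)
mi_uninformative = mutual_info_score(target, uninformative)

print("informative:   %.4f nats" % mi_informative)
print("uninformative: %.4f nats" % mi_uninformative)
print("ln(2):         %.4f" % math.log(2))
```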

In [None]:
# example of mutual information feature selection for categorical data

# feature selection
# Redefine select_features so that it scores features with mutual information
# instead of chi-squared; load_dataset, prepare_inputs and prepare_targets
# are reused from the cells above.
def select_features(X_train, y_train, X_test):
    # k="all" keeps every feature so that all scores can be inspected
    fs = SelectKBest(score_func=mutual_info_classif, k="all")
    # Estimate the mutual information between each feature and the target
    fs.fit(X_train, y_train)
    # Apply the (pass-through) selection to the train and test inputs
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs

# load the dataset
X, y = load_dataset(breast_cancer_csv)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)

# what are scores for the features
for i in range(len(fs.scores_)):
    print("Feature %d: %f" % (i, fs.scores_[i]))


In this case, we can see that some of the features have a very low score, suggesting that
perhaps they can be removed.

In [None]:
# Plotting the feature scores to visualize their importance

# 'fs.scores_' contains the scores (or importance values) of the features
# We use 'range(len(fs.scores_))' to generate x-axis positions for each feature score
# 'pyplot.bar' creates a bar chart with the feature indices on the x-axis and their corresponding scores on the y-axis
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)  # Create a bar plot for feature scores

# Display the plot
pyplot.show()  # Show the generated bar plot

A bar chart of the feature importance scores for each input feature is created. Importantly,
a different mixture of features is promoted.
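
One detail worth knowing is that `mutual_info_classif` treats dense input columns as continuous by default and uses a nearest-neighbor estimator with a small amount of random noise, so scores can differ slightly between runs. The sketch below shows one possible way to treat every column as discrete and fix the seed by wrapping the score function with `functools.partial`. The synthetic data and the names `mi_discrete`, `fs_a`, and `fs_b` are assumptions for illustration.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for ordinal-encoded categorical inputs (an assumption)
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 4, size=(150, 9)).astype(float)
y_demo = rng.integers(0, 2, size=150)

# Wrap the score function so SelectKBest passes the extra arguments:
# discrete_features=True treats every column as categorical, and
# random_state fixes any randomness inside the estimator
mi_discrete = partial(mutual_info_classif, discrete_features=True, random_state=1)

# Fitting twice should give identical scores
fs_a = SelectKBest(score_func=mi_discrete, k="all").fit(X_demo, y_demo)
fs_b = SelectKBest(score_func=mi_discrete, k="all").fit(X_demo, y_demo)

for i, score in enumerate(fs_a.scores_):
    print("Feature %d: %f" % (i, score))
```

Whether treating the encoded columns as discrete changes the ranking on the real dataset is something to check empirically rather than assume.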

## Modeling With Selected Features

A robust approach is to evaluate models using different
feature selection methods (and numbers of features) and select the method that results in a
model with the best performance. We will evaluate a Logistic Regression model
with all features compared to a model built from features selected by chi-squared and those
features selected via mutual information. Logistic regression is a good model for testing feature
selection methods as it can perform better if irrelevant features are removed from the model.
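
The comparison described above can be sketched compactly by bundling the selector and the model in a `Pipeline` and looping over score functions. This is a hedged illustration on synthetic data (an assumption), not the tutorial's own evaluation; the names `X_demo`, `y_demo`, and `results` are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic ordinal-encoded data (an assumption): 9 columns, where only
# column 0 carries information about the binary target
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 3, size=(300, 9)).astype(float)
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=1
)

results = {}
for name, score_func in [("chi2", chi2), ("mutual_info", mutual_info_classif)]:
    # Bundling selection and model means the selector is fit on training data only
    pipeline = Pipeline([
        ("select", SelectKBest(score_func=score_func, k=4)),
        ("model", LogisticRegression(solver="lbfgs")),
    ])
    pipeline.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, pipeline.predict(X_te))
    print("%s: %.2f" % (name, results[name] * 100))
```

Using a pipeline keeps the selection step and the model together, which makes it harder to accidentally fit the selector on test data.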

### Model Built Using All Features

In [None]:
# evaluation of a model using all input features

# load the dataset
X, y = load_dataset(breast_cancer_csv)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

# fit the model
model = LogisticRegression(solver="lbfgs")
model.fit(X_train_enc, y_train_enc)

# evaluate the model
yhat = model.predict(X_test_enc)

# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print("Accuracy: %.2f" % (accuracy * 100))

In this case, we can see that the model achieves a classification accuracy of about 75 percent.
We would prefer to use a subset of features that achieves a classification accuracy that is as
good or better than this.
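
For context when reading this accuracy, it can help to know what a model that always predicts the majority class would score. The sketch below derives that figure from the class counts in the dataset description (201 and 85) using `DummyClassifier`. It is computed on the whole dataset, so the figure for the particular test split used here may differ somewhat.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the dataset description above
n_no_recurrence, n_recurrence = 201, 85
y_all = np.array([0] * n_no_recurrence + [1] * n_recurrence)
X_placeholder = np.zeros((len(y_all), 1))  # features are ignored by the dummy model

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_all)
baseline_accuracy = baseline.score(X_placeholder, y_all)
print("Majority-class accuracy: %.2f" % (baseline_accuracy * 100))
```

Because the classes are imbalanced, this naive baseline is already around 70 percent, which is worth keeping in mind when judging small differences between models.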

### Model Built Using Chi-Squared Features

In [None]:
# evaluation of a model fit using chi squared input features

# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    dataset = read_csv(filename, header=None)

    # retrieve numpy array
    data = dataset.values

    # split into input (X) and output (y) variables
    X = data[:, :-1]
    y = data[:, -1]

    # format all fields as string
    X = X.astype(str)
    return X, y


# prepare input data
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc


# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc


# feature selection
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=chi2, k=4)
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs


# load the dataset
X, y = load_dataset(breast_cancer_csv)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

# feature selection
X_train_fs, X_test_fs = select_features(X_train_enc, y_train_enc, X_test_enc)

# fit the model
model = LogisticRegression(solver="lbfgs")
model.fit(X_train_fs, y_train_enc)

# evaluate the model
yhat = model.predict(X_test_fs)

# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print("Accuracy: %.2f" % (accuracy * 100))

In this case, we see that the model achieved an accuracy of about 74 percent, a slight drop in
performance. It is possible that some of the features removed are, in fact, adding value directly
or in concert with the selected features. At this stage, we would probably prefer to use all of
the input features.

### Model Built Using Mutual Information Features

In [None]:
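# evaluation of a model fit using mutual information input features

# feature selection
# Redefine select_features to use mutual information; without this, the
# chi-squared version defined in the previous cell would be reused.
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=mutual_info_classif, k=4)
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs
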

# load the dataset
X, y = load_dataset(breast_cancer_csv)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

# feature selection
X_train_fs, X_test_fs = select_features(X_train_enc, y_train_enc, X_test_enc)

# fit the model
model = LogisticRegression(solver="lbfgs")
model.fit(X_train_fs, y_train_enc)

# evaluate the model
yhat = model.predict(X_test_fs)

# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print("Accuracy: %.2f" % (accuracy * 100))

In this case, we can see a drop in classification accuracy. To be sure that
the effect is real, it would be a good idea to repeat each experiment multiple times and compare
the mean performance. It may also be a good idea to explore using k-fold cross-validation
instead of a simple train/test split.
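
One way to follow this suggestion is repeated stratified k-fold cross-validation, sketched below with `RepeatedStratifiedKFold` and `cross_val_score`. The data is synthetic (an assumption so the example runs standalone), and the choices of 10 folds, 3 repeats, and the candidate values of k are illustrative rather than recommendations.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic ordinal-encoded data (an assumption, so the sketch runs standalone)
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 3, size=(300, 9)).astype(float)
y_demo = (X_demo[:, 0] + rng.integers(0, 2, size=300) > 1).astype(int)

# 10-fold cross-validation repeated 3 times with different shuffles
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

mean_scores = {}
for k in [2, 4, "all"]:
    # The selector sits inside the pipeline so it is refit on each training fold
    pipeline = Pipeline([
        ("select", SelectKBest(score_func=chi2, k=k)),
        ("model", LogisticRegression(solver="lbfgs")),
    ])
    scores = cross_val_score(pipeline, X_demo, y_demo, scoring="accuracy", cv=cv)
    mean_scores[k] = scores.mean()
    print("k=%s: mean accuracy %.3f (std %.3f)" % (k, scores.mean(), scores.std()))
```

Comparing the mean and standard deviation across folds gives a more stable picture than a single train/test split, at the cost of fitting the model many more times.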

## Conclusion

We explored different feature selection techniques for categorical data, specifically chi-squared and mutual information statistics. The analysis showed that while feature selection can help identify important variables, using all features may sometimes yield better performance. The choice of feature selection method should be validated through careful model evaluation.

## Clean up

Remember to:
- Close any open plot windows
- Clear any stored variables
- Shutdown the notebook kernel when finished

