# Feature Selection: Select Categorical Input Features

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

## Overview

Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable. While feature selection is often straightforward for **real-valued data** (e.g., using Pearson's correlation coefficient), it can be more challenging when working with **categorical data**.

When both the input features and the target variable are categorical (e.g., in classification tasks), the two most commonly used feature selection methods are:

1. **Chi-Squared Statistic**:
   - Measures the dependence between each feature and the target variable.
   - Suitable for categorical data with a categorical target.

2. **Mutual Information Statistic**:
   - Quantifies the amount of information obtained about the target variable through each feature.
   - Effective for capturing non-linear relationships between features and the target.

These methods help identify the most relevant features, improving model performance and interpretability.

## Learning Objectives

- Learn about the breast cancer predictive modeling problem, which has categorical inputs and a binary classification target variable
- Understand how to evaluate the importance of categorical features using the chi-squared and mutual information statistics
- Learn how to perform feature selection for categorical data when fitting and evaluating a classification model

### Tasks to complete

- Evaluate models using different feature selection methods
- Compare performance between full feature set and selected features
- Create visualizations of feature importance scores

## Prerequisites


- A working Python environment and familiarity with Python
- Familiarity with pandas and numpy libraries
- Basic understanding of classification models
- Knowledge of basic statistical concepts

## Get Started

Setup steps:
- Import required libraries (matplotlib, pandas, scikit-learn)
- Download breast cancer dataset
- Prepare data loading and preprocessing functions
- Set up feature selection methods

### Install required packages 

In [None]:
# Install the necessary packages using pip in a Jupyter notebook environment

# 'matplotlib' is a plotting library for creating static, interactive, and animated visualizations
%pip install matplotlib 

# 'numpy' is a fundamental package for scientific computing in Python, providing support for arrays and matrices
%pip install numpy

# 'pandas' is a powerful data manipulation and analysis library, often used for working with structured data
%pip install pandas

# 'scikit-learn' is a machine learning library that provides simple and efficient tools for data mining and data analysis
%pip install scikit-learn


### Import libraries

In [None]:
# Importing the necessary modules and classes

# Importing pyplot from matplotlib for plotting graphs and visualizations
from matplotlib import pyplot

# Importing read_csv from pandas to load CSV data into a DataFrame
from pandas import read_csv

# Importing SelectKBest and feature selection methods for selecting the best features
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Importing LogisticRegression from sklearn to use as a machine learning model for classification
from sklearn.linear_model import LogisticRegression

# Importing accuracy_score from sklearn to evaluate the performance of the model
from sklearn.metrics import accuracy_score

# Importing train_test_split from sklearn to split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# Importing LabelEncoder and OrdinalEncoder for encoding categorical features as numeric values
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

## Breast Cancer Dataset

The breast cancer dataset records patient attributes and labels each case as either a recurrence or no recurrence of cancer.

```
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
   1. Class: no-recurrence-events, recurrence-events
   2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
   3. menopause: lt40, ge40, premeno.
   4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
   5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
   6. node-caps: yes, no.
   7. deg-malig: 1, 2, 3.
   8. breast: left, right.
   9. breast-quad: left-up, left-low, right-up, right-low, central.
  10. irradiat: yes, no.
Missing Attribute Values: (denoted by "?")
   Attribute #:  Number of instances with missing values:
   6.             8
   9.             1.
Class Distribution:
    1. no-recurrence-events: 201 instances
    2. recurrence-events: 85 instances 
```

You can learn more about the dataset here:
* Breast Cancer Dataset ([breast-cancer.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv))
* Breast Cancer Dataset Description ([breast-cancer.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names))



### Loading and encoding the categorical dataset

In [None]:
# Define the file path for the breast cancer dataset
# The dataset is assumed to be located in the 'Data' folder, relative to the current working directory
breast_cancer_csv = "../../Data/breast-cancer.csv"  # Path to the CSV file containing the breast cancer dataset

# Function to load the dataset
def load_dataset(filename):
    # Load the dataset from the provided CSV file.
    # 'read_csv' reads the file into a DataFrame without headers (header=None)
    dataset = read_csv(filename, header=None)

    # Retrieve the dataset as a NumPy array, including both input and output variables.
    data = dataset.values

    # Split the data into input variables (X) and output variable (y).
    # 'X' contains all columns except the last (input features), while 'y' contains the last column (target/output)
    X = data[:, :-1]  # Input features (all rows, all columns except the last one)
    y = data[:, -1]   # Output (target values, all rows, only the last column)

    # Format all input features (X) as strings.
    # This ensures that the input data is treated as categorical or textual if necessary.
    X = X.astype(str)

    # Return the input features (X) and output (y).
    return X, y


# Function to prepare and encode the input data for training and testing
def prepare_inputs(X_train, X_test):
    # Create an OrdinalEncoder instance to convert categorical data into integer codes
    oe = OrdinalEncoder()  # encode each variable to integers

    # Fit the encoder on the training data to learn the unique categories
    oe.fit(X_train)

    # Transform both the training and testing data using the fitted encoder
    X_train_enc = oe.transform(X_train)  # Encode the training data
    X_test_enc = oe.transform(X_test)    # Encode the test data

    # Return the encoded training and testing data
    return X_train_enc, X_test_enc

# Function to prepare and encode target variables for training and testing datasets
def prepare_targets(y_train, y_test):
    # Initialize the LabelEncoder, which converts categorical labels into numeric labels
    le = LabelEncoder()  # LabelEncoder is designed for encoding a single variable
    
    # Fit the encoder on the training data to learn the mapping of labels
    le.fit(y_train)
    
    # Transform both the training and testing labels into numeric form
    y_train_enc = le.transform(y_train)  # Encode the training labels
    y_test_enc = le.transform(y_test)    # Encode the testing labels
    
    # Return the encoded labels for both training and testing datasets
    return y_train_enc, y_test_enc


# Load the breast cancer dataset (assumed to be in CSV format)
# X represents the input features, and y represents the target labels
X, y = load_dataset(breast_cancer_csv)

# Split the dataset into training and testing sets
# 33% of the data will be used for testing (test_size=0.33), and 67% for training
# random_state ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# Prepare the input data by applying necessary preprocessing to the features
# X_train_enc and X_test_enc are the encoded (transformed) input data for training and testing
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# Prepare the output data by encoding or transforming the target labels
# y_train_enc and y_test_enc are the encoded target data for training and testing
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

# Print the shapes of the training and testing datasets to confirm the splitting and preprocessing
print("Train", X_train_enc.shape, y_train_enc.shape)
print("Test", X_test_enc.shape, y_test_enc.shape)

The split yields 191 training examples and 95 test examples, roughly a 2:1 ratio. This leaves enough data to fit the model while holding out a reasonable test set for evaluation.
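One detail of the encoding step is easy to miss: by default, `OrdinalEncoder` assigns integer codes in sorted string order, so a binned range such as `10-14` sorts before `5-9`. The sketch below uses a toy column built from three of the tumor-size bins listed above. It shows the default codes and one possible way to impose the natural order through the `categories` parameter. The variable names are illustrative and not part of the original tutorial.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy column of tumor-size bins (a small subset of the values listed above)
sizes = np.array([["0-4"], ["5-9"], ["10-14"], ["5-9"]])

# Default behaviour: categories are sorted as strings, so "10-14" precedes "5-9"
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(sizes)
print(default_enc.categories_[0])  # categories in sorted string order
print(default_codes.ravel())       # [0. 2. 1. 2.]

# Assumption for illustration: pass the bins explicitly in their natural order
ordered_enc = OrdinalEncoder(categories=[["0-4", "5-9", "10-14"]])
ordered_codes = ordered_enc.fit_transform(sizes)
print(ordered_codes.ravel())       # [0. 1. 2. 1.]
```

Whether this matters depends on the downstream method. Since the scoring functions and the logistic regression model operate on the integer codes, it seems reasonable to expect the ordering to influence the results.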

## Categorical Feature Selection

When working with categorical input features and a categorical target variable, two statistically grounded feature selection methods are particularly effective:
* Chi-Squared Statistic.
* Mutual Information Statistic.



### Chi-Squared Feature Selection

**Pearson's chi-squared test** is a statistical hypothesis test used to assess the independence between categorical variables. The results of this test can be leveraged for **feature selection** in machine learning. Specifically:

- **Independent Features**: Features that are found to be independent of the target variable can be removed from the dataset.
- **Dependent Features**: Features that show a significant dependence on the target variable are retained for model training.

By applying Pearson's chi-squared test, you can identify and eliminate irrelevant features, improving model efficiency and performance.
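To make the statistic concrete, here is a minimal sketch that computes Pearson's chi-squared statistic, the sum over all cells of (observed - expected)^2 / expected, from a contingency table in plain Python. The counts are invented for illustration, and the function name `chi_squared_statistic` is hypothetical. It assumes every row and column of the table has a non-zero total.

```python
def chi_squared_statistic(table):
    """Pearson's chi-squared statistic for a 2-D table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up counts: rows = feature value (no, yes), columns = class (no recurrence, recurrence)
dependent_table = [[30, 10], [10, 30]]
independent_table = [[20, 20], [20, 20]]

print(chi_squared_statistic(dependent_table))    # 20.0
print(chi_squared_statistic(independent_table))  # 0.0
```

Note that scikit-learn's `chi2` scorer, used in the next cell, works directly on the encoded feature values rather than on an explicit contingency table, so its scores should not be expected to match a hand calculation like this one.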

In [None]:
# Example of chi squared feature selection for categorical data
# feature selection
def select_features(X_train, y_train, X_test):
    # Select features according to the k highest scores.
    # Initialize the SelectKBest feature selection method
    # - score_func=chi2: Use the chi-squared statistic to score features
    # - k="all": Evaluate all features (no feature selection is performed yet)
    fs = SelectKBest(score_func=chi2, k="all")
    
    # Run score function on (X, y) and get the appropriate features.
    fs.fit(X_train, y_train)
    
    # Reduce X to the selected features.
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    
    return X_train_fs, X_test_fs, fs


# Load the dataset
X, y = load_dataset(breast_cancer_csv)

# Split the dataset into training and testing sets
# 33% of the data will be used for testing (test_size=0.33), and 67% for training
# random_state ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# Prepare the input data by applying necessary preprocessing to the features
# X_train_enc and X_test_enc are the encoded (transformed) input data for training and testing
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# Prepare the output data by encoding or transforming the target labels
# y_train_enc and y_test_enc are the encoded target data for training and testing
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

# Feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)

# Print the score for each feature
# fs.scores_ holds the raw chi-squared statistics returned by scikit-learn's chi2 (not normalized importance scores)
# Note: chi2 treats the non-negative encoded feature values as counts when computing the statistic,
# so the scores depend on the integer codes assigned by the encoder
# Higher values indicate stronger dependence between the feature and target
for i in range(len(fs.scores_)):
    print("Feature %d: %f" % (i, fs.scores_[i]))


The scores are raw chi-squared statistics, and most are relatively small, so it is hard to see a pattern from the numbers alone. Ranking them suggests that features 3, 4, 5, and 8 are the most relevant, since they have the highest scores.

In [None]:
# Plotting the feature selection scores

# 'fs.scores_' contains the importance or scores of each feature (e.g., from feature selection)
# A bar plot is created using pyplot.bar to visualize the scores of the features
# The x-axis represents the feature indices (e.g., 0, 1, 2, ..., len(fs.scores_)-1)
# The y-axis represents the corresponding scores of each feature
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)

# Add x-axis and y-axis labels
pyplot.xlabel("Feature Indices")
pyplot.ylabel("Feature Scores")

# Set x-axis ticks to show every integer (0, 1, 2, 3, ...)
pyplot.xticks([i for i in range(len(fs.scores_))])

# Display the plot
pyplot.show()  # Show the generated bar plot with feature scores

#### Feature Importance Visualization

The **bar chart** visualizes the feature importance scores for each input feature. The chart reveals the following insights:

- **Most Relevant Feature**: Feature 3 appears to be the most relevant according to the chi-squared statistic.
- **Top Features**: Approximately **four out of the nine input features** show significantly higher importance scores compared to the others.

##### Actionable Insight
Based on this analysis, we can configure the `SelectKBest` feature selection method to retain only the **top four features** by setting `k=4`. This reduces the number of inputs. Whether it costs any predictive power is tested in the modeling section below.
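As a sketch of how to see which columns `SelectKBest` actually keeps, the example below fits a selector on synthetic integer-coded data (not the breast cancer dataset) in which columns 1 and 4 are constructed to depend on the class. It then reads the retained column indices with `get_support(indices=True)`. The data, the seed, and `k=2` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in data: 200 rows, 6 integer-coded features, binary target
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = rng.integers(0, 3, size=(n, 6))

# Make two columns depend on the class; the remaining columns are pure noise
X[:, 1] = y * 2                           # strongly tied to the class
X[:, 4] = y + rng.integers(0, 2, size=n)  # more weakly tied to the class

# Keep the two highest-scoring features
fs = SelectKBest(score_func=chi2, k=2).fit(X, y)

# Indices of the retained columns, in ascending order
selected = fs.get_support(indices=True)
print("Selected feature indices:", selected)
```

On the real data, the same call with `k=4` would report which four of the nine encoded columns are passed on to the model.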

### Mutual Information Feature Selection

**Mutual information** is a concept from **information theory** that is widely applied in feature selection. It is based on the idea of **information gain**, which is commonly used in the construction of decision trees. Here's how it works:

- **Definition**: Mutual information measures the reduction in uncertainty for one variable when the value of another variable is known.
- **Application**: It quantifies the dependency between two variables, making it a powerful tool for identifying relevant features in a dataset.

By calculating mutual information between each input feature and the target variable, we can determine which features provide the most information for predicting the target, enabling effective feature selection.
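For two discrete variables, mutual information can be written as I(X; Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) * p(y))). The following sketch computes it from counts in plain Python, using natural logarithms so that the result is in nats. The function name `mutual_information` and the toy values are made up for illustration.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)

    mi = 0.0
    for (x, y), count in joint_counts.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y)
        ratio = count * n / (x_counts[x] * y_counts[y])
        mi += p_xy * log(ratio)
    return mi


target = [1, 1, 0, 0]
informative = ["yes", "yes", "no", "no"]    # determines the target exactly
uninformative = ["yes", "no", "yes", "no"]  # independent of the target

print(mutual_information(informative, target))    # log(2), about 0.693
print(mutual_information(uninformative, target))  # 0.0
```

A feature that fully determines a balanced binary target reaches log(2) nats, while an independent feature scores zero.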

In [None]:
# Example of mutual information feature selection for categorical data
# Feature selection
def select_features(X_train, y_train, X_test):
    """
    Perform feature selection using the SelectKBest method with mutual information.

    Args:
        X_train (array-like): Training input features.
        y_train (array-like): Training target variable.
        X_test (array-like): Testing input features.

    Returns:
        X_train_fs (array-like): Transformed training features after feature selection.
        X_test_fs (array-like): Transformed testing features after feature selection.
        fs (object): Fitted SelectKBest object for further analysis.
    """
    # Initialize SelectKBest with mutual information as the scoring function
    # - score_func=mutual_info_classif: Use mutual information for feature scoring
    #   (random_state is not set here, so the scores may vary slightly between runs)
    # - k='all': Score all features and keep them all (no features are dropped yet)
    fs = SelectKBest(score_func=mutual_info_classif, k='all')

    # Fit the feature selector on the training data
    fs.fit(X_train, y_train)

    # Transform the training and testing data using the fitted selector
    X_train_fs = fs.transform(X_train)  # Apply feature selection to training data
    X_test_fs = fs.transform(X_test)    # Apply feature selection to testing data

    # Return the transformed datasets and the fitted feature selector object
    return X_train_fs, X_test_fs, fs


# Function to load a dataset from a file
def load_dataset(filename):
    # Load the dataset from a CSV file as a pandas DataFrame.
    # 'header=None' indicates that the dataset has no header row (column names).
    dataset = read_csv(filename, header=None)

    # Retrieve the data as a numpy array, which includes both inputs and outputs.
    data = dataset.values

    # Split the dataset into input variables (X) and output variable (y).
    # X will contain all columns except the last one (input features).
    # y will contain only the last column (output/target).
    X = data[:, :-1]  # All columns except the last one for inputs
    y = data[:, -1]   # Only the last column for the output

    # Format all fields in the input array (X) as strings.
    # This is useful for certain data types that might need to be handled as categorical strings.
    X = X.astype(str)

    # Return the input variables (X) and output variables (y)
    return X, y

# load the dataset
X, y = load_dataset(breast_cancer_csv)

# Split the dataset into training and testing sets
# 33% of the data will be used for testing (test_size=0.33), and 67% for training
# random_state ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# Prepare the input data by applying necessary preprocessing to the features
# X_train_enc and X_test_enc are the encoded (transformed) input data for training and testing
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# Prepare the output data by encoding or transforming the target labels
# y_train_enc and y_test_enc are the encoded target data for training and testing
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

# Feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)

# what are scores for the features
for i in range(len(fs.scores_)):
    print("Feature %d: %f" % (i, fs.scores_[i]))


Several features have mutual information scores close to zero, which indicates that each carries little information about the target on its own. Because every feature is scored independently, these scores say nothing about redundancy or interactions between features. Removing the low-scoring features could simplify the model, reduce computation, and improve interpretability, and it may help generalization by eliminating noise. However, this should be validated with an ablation test, comparing model performance with and without these features, before finalizing the feature set.
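Since the encoded features here are categorical codes, one option worth considering is to say so explicitly through the `discrete_features` parameter of `mutual_info_classif`. With the default `'auto'`, dense input is treated as continuous. The sketch below uses synthetic integer-coded data, not the breast cancer dataset, and the expectation that repeated calls agree rests on the assumption that no random noise is involved when all inputs are discrete. Switching this setting on the real data would change the scores, and possibly the accuracy figures reported in this notebook.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in data: 300 rows, 3 integer-coded features, binary target
rng = np.random.default_rng(1)
n = 300
y = rng.integers(0, 2, size=n)
X = rng.integers(0, 4, size=(n, 3))
X[:, 0] = y  # column 0 is a copy of the target, so it is maximally informative

# Treat every column as a discrete (categorical) variable
scores_a = mutual_info_classif(X, y, discrete_features=True)
scores_b = mutual_info_classif(X, y, discrete_features=True)

print("Scores, first call: ", scores_a)
print("Scores, second call:", scores_b)
print("Highest-scoring feature:", int(np.argmax(scores_a)))
```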

In [None]:
# Plotting the feature scores to visualize their importance

# 'fs.scores_' contains the scores (or importance values) of the features
# We use 'range(len(fs.scores_))' to generate x-axis positions for each feature score
# 'pyplot.bar' creates a bar chart with the feature indices on the x-axis and their corresponding scores on the y-axis
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)  # Create a bar plot for feature scores

# Add x-axis and y-axis labels
pyplot.xlabel("Feature Indices")
pyplot.ylabel("Feature Scores")

# Set x-axis ticks to show every integer (0, 1, 2, 3, ...)
pyplot.xticks([i for i in range(len(fs.scores_))])

# Display the plot
pyplot.show()  # Show the generated bar plot

The bar chart of mutual information scores shows a different ranking from the chi-squared scores, so the two methods favor different subsets of features. This suggests that:
* The two statistics capture different aspects of the relationship between a feature and the target
* The feature subsets chosen by each method may carry complementary information
* Model performance could benefit from combining several feature selection strategies

## Modeling With Selected Features

A robust approach to feature selection involves evaluating models using different methods and varying numbers of features, then selecting the approach that yields the best model performance. In this analysis, we will compare the performance of a **Logistic Regression model** under three scenarios:

1. **All Features**: The model is trained using all available input features.
2. **Chi-Squared Feature Selection**: The model is trained using features selected by the **chi-squared statistic**.
3. **Mutual Information Feature Selection**: The model is trained using features selected by **mutual information**.

### Why Logistic Regression?
Logistic Regression is an excellent choice for testing feature selection methods because:
- It can **benefit significantly** from the removal of irrelevant features.
- It provides a clear baseline for evaluating the impact of feature selection on model performance.

By comparing the results of these three scenarios, we can determine the most effective feature selection method for the dataset.
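The cells below redefine the feature selection helper once per method. As an alternative sketch, the helper can take the scoring function and `k` as arguments so that all scenarios share one definition. The name `select_k_features` and its signature are suggestions rather than the tutorial's own, and the data is synthetic.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif


def select_k_features(X_train, y_train, X_test, score_func, k="all"):
    """Fit SelectKBest on the training data only, then transform both splits."""
    fs = SelectKBest(score_func=score_func, k=k)
    fs.fit(X_train, y_train)
    return fs.transform(X_train), fs.transform(X_test), fs


# Synthetic stand-in for the encoded data: 9 non-negative integer-coded features
rng = np.random.default_rng(3)
X_train = rng.integers(0, 5, size=(120, 9))
X_test = rng.integers(0, 5, size=(60, 9))
y_train = rng.integers(0, 2, size=120)

scorers = {
    "chi2": chi2,
    # partial fixes random_state so the mutual information scores are repeatable
    "mutual_info": partial(mutual_info_classif, random_state=42),
}

shapes = {}
for name, scorer in scorers.items():
    X_train_fs, X_test_fs, fs = select_k_features(
        X_train, y_train, X_test, score_func=scorer, k=4
    )
    shapes[name] = (X_train_fs.shape, X_test_fs.shape)
    print(name, X_train_fs.shape, X_test_fs.shape)
```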

### Model Built Using All Features

In [None]:
# Load the dataset from the specified CSV file defined by a variable 'breast_cancer_csv'
# - X: Input features
# - y: Target variable
X, y = load_dataset(breast_cancer_csv)

# Split the dataset into training and testing sets
# - test_size=0.33: 33% of the data is used for testing
# - random_state=1: Ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# Prepare the input data by ordinal-encoding the categorical variables
# - X_train_enc: Transformed training input features
# - X_test_enc: Transformed testing input features
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# Prepare the output data (e.g., encoding the target variable)
# - y_train_enc: Encoded training target variable
# - y_test_enc: Encoded testing target variable
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

# Initialize a Logistic Regression model
# - solver="lbfgs": Specifies the optimization algorithm
model = LogisticRegression(solver="lbfgs")

# Train the model on the encoded training data
model.fit(X_train_enc, y_train_enc)

# Use the trained model to make predictions on the encoded test data
yhat = model.predict(X_test_enc)

# Evaluate the model's predictions by calculating accuracy
# - accuracy_score: Compares predicted values (yhat) with actual values (y_test_enc)
accuracy = accuracy_score(y_test_enc, yhat)

# Print the accuracy as a percentage
print("Accuracy: %.2f" % (accuracy * 100))

The model trained on all features achieves a classification accuracy of about 76% on the test set. This is the baseline: the goal of feature selection is to find a smaller subset of features that matches or improves on it.

### Model Built Using Chi-Squared Features

In [None]:
# Load the dataset from a CSV file
def load_dataset(filename):
    """
    Load the dataset from a CSV file and split it into input (X) and output (y) variables.

    Args:
        filename (str): Path to the CSV file.

    Returns:
        X (numpy array): Input features.
        y (numpy array): Target variable.
    """
    # Load the dataset as a pandas DataFrame
    dataset = read_csv(filename, header=None)

    # Retrieve the numpy array from the DataFrame
    data = dataset.values

    # Split into input (X) and output (y) variables
    X = data[:, :-1]  # All columns except the last
    y = data[:, -1]   # Last column

    # Format all fields as strings
    X = X.astype(str)
    return X, y


# Prepare input data by encoding categorical variables
def prepare_inputs(X_train, X_test):
    """
    Encode categorical input features using OrdinalEncoder.

    Args:
        X_train (numpy array): Training input features.
        X_test (numpy array): Testing input features.

    Returns:
        X_train_enc (numpy array): Encoded training input features.
        X_test_enc (numpy array): Encoded testing input features.
    """
    # Initialize the OrdinalEncoder
    oe = OrdinalEncoder()

    # Fit the encoder on the training data and transform both training and testing data
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc


# Prepare target data by encoding labels
def prepare_targets(y_train, y_test):
    """
    Encode the target variable using LabelEncoder.

    Args:
        y_train (numpy array): Training target variable.
        y_test (numpy array): Testing target variable.

    Returns:
        y_train_enc (numpy array): Encoded training target variable.
        y_test_enc (numpy array): Encoded testing target variable.
    """
    # Initialize the LabelEncoder
    le = LabelEncoder()

    # Fit the encoder on the training data and transform both training and testing targets
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc


# Perform feature selection using SelectKBest with chi-squared statistic
def select_features(X_train, y_train, X_test):
    """
    Select the top k features using the chi-squared statistic.

    Args:
        X_train (numpy array): Training input features.
        y_train (numpy array): Training target variable.
        X_test (numpy array): Testing input features.

    Returns:
        X_train_fs (numpy array): Transformed training features with selected features.
        X_test_fs (numpy array): Transformed testing features with selected features.
    """
    # Initialize SelectKBest with chi-squared statistic and k=4 (top 4 features)
    fs = SelectKBest(score_func=chi2, k=4)

    # Fit the feature selector on the training data
    fs.fit(X_train, y_train)

    # Transform the training and testing data using the fitted selector
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs


# Load the dataset from the CSV file path stored in the variable breast_cancer_csv
X, y = load_dataset(breast_cancer_csv)

# Split the dataset into training and testing sets
# - test_size=0.33: 33% of the data is used for testing
# - random_state=1: Ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# Prepare input data by encoding categorical variables
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# Prepare output data by encoding the target variable
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

# Perform feature selection to select the top 4 features
X_train_fs, X_test_fs = select_features(X_train_enc, y_train_enc, X_test_enc)

# Initialize and fit a Logistic Regression model
# - solver="lbfgs": Specifies the optimization algorithm
model = LogisticRegression(solver="lbfgs")
model.fit(X_train_fs, y_train_enc)

# Use the trained model to make predictions on the test data
yhat = model.predict(X_test_fs)

# Evaluate the model's predictions by calculating accuracy
# - accuracy_score: Compares predicted values (yhat) with actual values (y_test_enc)
accuracy = accuracy_score(y_test_enc, yhat)

# Print the accuracy as a percentage
print("Accuracy: %.2f" % (accuracy * 100))

With the four features selected by the chi-squared statistic, accuracy is 75%, slightly below the 76% baseline. On 95 test examples, a one-point difference amounts to roughly a single prediction, so this is weak evidence either way. Still, it suggests that some of the removed features may carry useful signal, either on their own or in combination with other variables. Since the reduction in dimensionality brings no gain in accuracy on this split, keeping all input features is the safer choice at this stage.
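To put a difference of about one percentage point in context, the sketch below computes the approximate binomial standard error of an accuracy estimate, sqrt(p * (1 - p) / n), for a test set of 95 examples. This is a rough normal approximation that assumes independent test examples. The function name is illustrative.

```python
from math import sqrt


def accuracy_standard_error(accuracy, n_test):
    """Approximate standard error of an accuracy measured on n_test examples."""
    return sqrt(accuracy * (1.0 - accuracy) / n_test)


n_test = 95      # size of the test split used in this notebook
baseline = 0.76  # baseline accuracy reported above, as a fraction

se = accuracy_standard_error(baseline, n_test)
one_example = 1.0 / n_test  # accuracy change caused by a single prediction

print("Standard error: %.3f" % se)                              # 0.044
print("Accuracy change per test example: %.3f" % one_example)  # 0.011
```

Under these assumptions the standard error is around four percentage points, so differences of one or two points between runs on this split are well within the expected noise.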

### Model Built Using Mutual Information Features

In [None]:
# Load the dataset from a CSV file
def load_dataset(filename):
    """
    Load the dataset from a CSV file and split it into input (X) and output (y) variables.

    Args:
        filename (str): Path to the CSV file.

    Returns:
        X (numpy array): Input features.
        y (numpy array): Target variable.
    """
    # Load the dataset as a pandas DataFrame
    dataset = read_csv(filename, header=None)

    # Retrieve the numpy array from the DataFrame
    data = dataset.values

    # Split into input (X) and output (y) variables
    X = data[:, :-1]  # All columns except the last
    y = data[:, -1]   # Last column

    # Format all fields as strings
    X = X.astype(str)
    return X, y


# Prepare input data by encoding categorical variables
def prepare_inputs(X_train, X_test):
    """
    Encode categorical input features using OrdinalEncoder.

    Args:
        X_train (numpy array): Training input features.
        X_test (numpy array): Testing input features.

    Returns:
        X_train_enc (numpy array): Encoded training input features.
        X_test_enc (numpy array): Encoded testing input features.
    """
    # Initialize the OrdinalEncoder
    oe = OrdinalEncoder()

    # Fit the encoder on the training data and transform both training and testing data
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc


# Prepare target data by encoding labels
def prepare_targets(y_train, y_test):
    """
    Encode the target variable using LabelEncoder.

    Args:
        y_train (numpy array): Training target variable.
        y_test (numpy array): Testing target variable.

    Returns:
        y_train_enc (numpy array): Encoded training target variable.
        y_test_enc (numpy array): Encoded testing target variable.
    """
    # Initialize the LabelEncoder
    le = LabelEncoder()

    # Fit the encoder on the training data and transform both training and testing targets
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc


# Perform feature selection using SelectKBest with mutual information
def select_features(X_train, y_train, X_test):
    """
    Select the top k features using mutual information.

    Args:
        X_train (numpy array): Training input features.
        y_train (numpy array): Training target variable.
        X_test (numpy array): Testing input features.

    Returns:
        X_train_fs (numpy array): Transformed training features with selected features.
        X_test_fs (numpy array): Transformed testing features with selected features.
    """
    # Initialize SelectKBest with mutual information and k=4 (top 4 features)
    # mutual_info_classif estimates the mutual information between each feature and the target variable.
    # With the default discrete_features='auto', dense input such as X_train is treated as continuous,
    # and a non-parametric estimator based on nearest neighbors is used.
    # A small amount of random noise is added to continuous features before estimation.
    # If random_state is left as None, this noise can lead to slight variations in results across runs.
    # Setting random_state to a fixed integer (e.g., random_state=42) makes the scores reproducible.
    fs = SelectKBest(score_func=lambda X, y: mutual_info_classif(X, y, random_state=42), k=4)

    # Fit the feature selector on the training data
    fs.fit(X_train, y_train)

    # Transform the training and testing data using the fitted selector
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs


# Load the dataset
X, y = load_dataset(breast_cancer_csv)

# Split the dataset into training and testing sets
# - test_size=0.33: 33% of the data is used for testing
# - random_state=1: Ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Prepare input data by encoding categorical variables
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# Prepare output data by encoding the target variable
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

# Perform feature selection to select the top 4 features
X_train_fs, X_test_fs = select_features(X_train_enc, y_train_enc, X_test_enc)

# Initialize and fit a Logistic Regression model
# - solver="lbfgs": Specifies the optimization algorithm
# - random_state=1: Only used by the sag, saga, and liblinear solvers; it has no effect with lbfgs
model = LogisticRegression(solver='lbfgs', random_state=1)
model.fit(X_train_fs, y_train_enc)

# Use the trained model to make predictions on the test data
yhat = model.predict(X_test_fs)

# Evaluate the model's predictions by calculating accuracy
# - accuracy_score: Compares predicted values (yhat) with actual values (y_test_enc)
accuracy = accuracy_score(y_test_enc, yhat)

# Print the accuracy as a percentage
print('Accuracy: %.2f' % (accuracy * 100))

With the four features selected by mutual information, the model reaches an accuracy of 78%, slightly above the 76% baseline obtained with all features. On a test set of 95 examples this gap corresponds to only about two predictions, so it may reflect random variation rather than a genuine improvement. To check, repeat the experiment with different random seeds to obtain a distribution of scores, and use k-fold cross-validation (for example, k=5 or k=10) to estimate performance over multiple train-test partitions. Together these give a more reliable picture of how well the model generalizes.
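One way to set up such an evaluation is to wrap encoding, feature selection, and the model in a scikit-learn `Pipeline` and score it with repeated stratified k-fold cross-validation, so that the selector is refit on the training portion of every fold. The sketch below runs on synthetic categorical data, not the breast cancer dataset. The seeds, `k=2`, and the number of folds and repeats are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in data: 240 rows, 6 categorical features, binary string target
rng = np.random.default_rng(7)
n = 240
y = rng.choice(["no-recurrence-events", "recurrence-events"], size=n)
X = rng.choice(["low", "mid", "high"], size=(n, 6))

# Make column 0 informative: it follows the class except for about 20% of rows
signal = np.where(y == "recurrence-events", "high", "low")
keep_random = rng.random(n) < 0.2
X[:, 0] = np.where(keep_random, X[:, 0], signal)

# Encoding and selection live inside the pipeline, so they are refit per fold
pipeline = Pipeline([
    ("encode", OrdinalEncoder()),
    ("select", SelectKBest(score_func=chi2, k=2)),
    ("model", LogisticRegression(solver="lbfgs")),
])

# 5 folds repeated 3 times gives 15 accuracy estimates
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)

print("Accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

On the real dataset, a category that is missing from a training fold would make the default `OrdinalEncoder` raise an error at transform time, so the encoder's `handle_unknown` option may need attention there.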

## Conclusion

We explored two feature selection techniques for categorical data: the chi-squared statistic and mutual information. On this dataset, logistic regression reached about 76% accuracy with all features, 75% with the four features chosen by chi-squared, and 78% with the four chosen by mutual information. Feature selection can help identify important variables, but it does not always improve performance, so the choice of method and number of features should be validated through careful model evaluation.

## Clean up

Remember to:
- Close any open plot windows
- Clear any stored variables
- Shut down the notebook kernel when finished

