## 0 Introduction
In this paper, I explore data discretization, an important preprocessing step in machine learning and data analysis. First, I describe the discretization techniques included in the scikit-learn library. Second, I implement an alternative approach to discretization based on decision trees, as described in [1].

[1] Niculescu-Mizil, A., Perlich, C., Swirszcz, G., Sindhwani, V., Liu, Y., Melville, P., Wang, D., Xiao, J., Hu, J., Singh, M., Xiong Shang, W., Feng Zhu, Y. Winning the KDD Cup Orange Challenge with Ensemble Selection. In Proceedings of KDD-Cup 2009 Competition, PMLR 7:23-34, 2009.
https://dl--acm--org.us.debiblio.com/doi/10.5555/3000364.3000366

## 1 Discretization in the scikit-learn Library
*Investigate the use of the different discretization strategies included in scikit-learn*

The scikit-learn library provides two discretization techniques: *K-bins discretization* and *feature binarization* (https://scikit-learn.org/stable/modules/preprocessing.html#discretization, accessed on 11.12.23). I will explain how they work and what their use cases are.

#### 1.1 K-Bins Discretization
K-bins discretization is a preprocessing technique that turns continuous variables into categorical ones by dividing the range of each feature into K intervals, called bins. How the bin edges are chosen is decided by the `strategy` parameter of `sklearn.preprocessing.KBinsDiscretizer`, which can be *uniform*, *quantile*, or *kmeans*. The *uniform* strategy uses bins of equal, constant width; the *quantile* strategy uses quantiles so that the bins of each feature contain roughly the same number of samples; the *kmeans* strategy places the bin edges so that all values in the same bin share the same nearest center of a one-dimensional (1D) k-means clustering. The last strategy may sound like Greek at first. Simply put, however, k-means clustering is a method for grouping data points based on their proximity to each other.
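
To make the three strategies concrete, here is a minimal sketch that applies each of them to the same toy feature. The data values and the choice of three bins are made up for illustration; `encode="ordinal"` is used only so that the bin index of each value is easy to read.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy skewed feature; the values are made up for illustration
X = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 6.0, 7.0, 8.0, 20.0]).reshape(-1, 1)

codes, bin_edges = {}, {}
for strategy in ("uniform", "quantile", "kmeans"):
    # encode="ordinal" returns the bin index of each value instead of a one-hot matrix
    discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    codes[strategy] = discretizer.fit_transform(X).ravel().astype(int)
    # bin_edges_ holds one array of edges per feature; there is only one feature here
    bin_edges[strategy] = discretizer.bin_edges_[0]
    print(f"{strategy:>8}: edges={np.round(bin_edges[strategy], 2).tolist()} "
          f"codes={codes[strategy].tolist()}")
```

Printing the edges next to the codes shows how each strategy places its boundaries: equally spaced for *uniform*, and following the data for *quantile* and *kmeans*.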

- K-bins discretization is especially useful for linear models. When the bins are one-hot encoded, the model can fit a separate value per bin instead of a single straight line over the whole range of the feature. In the example shown below (https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization.html#sphx-glr-auto-examples-preprocessing-plot-discretization-py, accessed on 12.12.23), you can see the effect of k-bins discretization on a linear model and on a decision tree, before and after. As is apparent, the linear model becomes much more flexible, and the decision tree less flexible.
- Bin width also affects the risk of overfitting. As the illustration below suggests, it can be important to keep the bins sufficiently wide, since very narrow bins let the model follow individual data points.

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_discretization_001.png" width="600" />

#### 1.2 Feature Binarization
Feature binarization is a preprocessing technique that applies a threshold to continuous data and maps each value to 0 or 1, depending on which side of the threshold it falls.
- Feature binarization has many use cases. A simple one is medical data, where it is often more interesting whether a value exceeds a given threshold than what the exact value is. More generally, binarizing is useful whenever a yes/no distinction relative to a threshold is what matters.
- Feature binarization can also improve computational efficiency, since binary values are smaller and simpler to work with.
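
As a minimal sketch of the thresholding idea, the snippet below uses `sklearn.preprocessing.Binarizer`. The readings and the cut-off of 6.4 are hypothetical values chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical measurements and cut-off, made up for illustration
readings = np.array([[4.8], [5.6], [7.1], [6.4], [9.3]])

# Values strictly greater than the threshold map to 1, all others to 0
binarizer = Binarizer(threshold=6.4)
flags = binarizer.fit_transform(readings).ravel().astype(int)

print(flags.tolist())  # [0, 0, 1, 0, 1]
```

Note that a value exactly equal to the threshold (6.4 here) is mapped to 0.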

---
## 2 Decision-Tree Discretizer

In [4]:
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.model_selection import train_test_split

#### 2.1 Fitting the Model
*In this section I define the function `decisionTreeDiscretizerFit(X_data, y_data, variables)`. Given a dataset and its class labels (`X_data` and `y_data`) and a list of feature indices, `variables`, this function returns a dictionary that maps each feature index in `variables` to a pair `(treeModel, encoding)`. Here, `treeModel` is a decision tree trained only on that feature of `X_data` with class labels `y_data`, and `encoding` maps each distinct classification probability vector (`predict_proba`) that the tree produces on the dataset to a unique numerical value.*

In [5]:
def decisionTreeDiscretizerFit(X_data, y_data, variables: list) -> dict:
    '''
    This method fits a decision tree discretizer for each of the features in `X_data` which are in the `variables` list.

    param X_data: array-like 
        The input features dataset.
    param y_data: array-like
        The target values (class labels).
    param variables: list
        List of indices of features to be discretized.

    return: dict
        A dictionary where keys are feature indices and values are tuples (treeModel, encoding).
    '''
    tree_discretizers = {}

    # If no variables are specified, we take into account all features
    if variables is None:
        print("No variables specified. All features will be discretized.")
        variables = list(range(X_data.shape[1]))
    
    for v in variables:
        # Initializing the best tree and its score 
        best_tree, best_score = None, -np.inf

        # Isolating the feature column for training
        feature_column = np.array(X_data[:, v]).reshape(-1, 1)

        # Training a decision tree for an array of possible depths to find the best one
        for depth in range(1, 10):
            # Training the decision tree with the current depth
            tree = DecisionTreeClassifier(max_depth=depth)
            tree.fit(feature_column, y_data)
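            # Calculating the score (accuracy) of the decision tree on the training data itself.
            # Note: no held-out data is used, so deeper trees tend to score at least as well as shallower ones.
            score = tree.score(feature_column, y_data)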
            # Updating the best tree if the current one is better
            if score > best_score:
                best_tree = tree
                best_score = score            

        # Generating the different possible classification probability vectors
        prob_vectors = best_tree.predict_proba(feature_column)
        unique_prob_vectors = np.unique(prob_vectors, axis=0)
        
        # Associating each unique probability vector with its index
        encoding = { tuple(vector): index for index, vector in enumerate(unique_prob_vectors) }

        # Adding the best tree (treeModel) and its encoding to the result dictionary i.e., { feature_index: (treeModel, encoding) }
        tree_discretizers[v] = (best_tree, encoding)

    return tree_discretizers
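
The encoding step can be seen in isolation on a toy feature. The sketch below is not part of the discretizer itself: the data and labels are made up, and the depth is fixed to 1 (instead of being searched) so that the tree has at most two leaves and therefore at most two probability vectors.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy single-feature dataset; the values and labels are made up for illustration
feature_column = np.arange(1, 10, dtype=float).reshape(-1, 1)
labels = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1])

# A depth-1 tree has at most two leaves, so at most two probability vectors
tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(feature_column, labels)

# Each row is the class-probability vector of the leaf that the sample falls into
prob_vectors = tree.predict_proba(feature_column)
unique_prob_vectors = np.unique(prob_vectors, axis=0)

# Same encoding step as in the fit function: probability vector -> integer code
encoding = {tuple(vector): index for index, vector in enumerate(unique_prob_vectors)}
codes = np.array([encoding[tuple(vector)] for vector in prob_vectors])

for vector, code in encoding.items():
    print(f"probability vector {tuple(round(float(p), 2) for p in vector)} -> code {code}")
print("codes per sample:", codes.tolist())
```

Samples that land in the same leaf share a probability vector and therefore receive the same code, which is what turns the continuous feature into a small number of discrete values.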

#### 2.2 Creating a Transform Method
*In this section I define the function `decisionTreeDiscretizerTransform(X_data, variables, tree_discretizers)`. Given a dataset, `X_data`, a list of feature indices, `variables`, and a dictionary `tree_discretizers` obtained from `decisionTreeDiscretizerFit` (not necessarily for the same dataset `X_data` or the same list `variables`), this function generates a new dataset (i.e., it does not modify the input) that is identical to `X_data` except for the features listed in `variables`. The values of those features are replaced by the numerical codes associated with the classification probability vectors that the corresponding decision trees in `tree_discretizers` predict for them.*

In [6]:
def decisionTreeDiscretizerTransform(X_data, variables: list, tree_discretizers: dict) -> np.ndarray:
    '''
    This method transforms the features in `X_data` using the decision tree discretizers in `tree_discretizers`.

    param X_data: array-like 
        The input features dataset.
    param variables: list
        List of indices of features to be discretized.
    param tree_discretizers: dict
        A dictionary where keys are feature indices and values are tuples (treeModel, encoding).

    return: np.ndarray
        The transformed dataset.
    '''
    # Copying the input dataset
    X_data_discretized = X_data.copy()

    # If no variables are specified, we discretize all features
    if variables is None:
        variables = list(range(X_data.shape[1]))

    # Iterating over the features to be discretized
    for v in variables:
        # Isolating the data for this feature by selecting its column
        feature_column = np.array(X_data_discretized[:, v]).reshape(-1, 1)
        # Looking up the tree and the encoding fitted for this feature
        tree_model, encoding = tree_discretizers[v]
        # Mapping each predicted probability vector to its numerical code
        prob_vectors = tree_model.predict_proba(feature_column)
        transformed_column = np.array([encoding[tuple(vector)] for vector in prob_vectors])
        # Replacing the feature column with the transformed one
        X_data_discretized[:, v] = transformed_column

    return X_data_discretized

In [7]:
# Testing the discretizer
if __name__ == '__main__':
    # Loading the dataset
    from sklearn.datasets import load_breast_cancer
    dataset = load_breast_cancer()
    X_data = dataset.data
    y_data = dataset.target

    variables = [0, 1, 2, 3, 4, 5, 6, 7, 8]

    # Fitting the discretizer
    tree_discretizers = decisionTreeDiscretizerFit(X_data, y_data, variables)
    X_data_discretized = decisionTreeDiscretizerTransform(X_data, variables, tree_discretizers)

    print("#"*10 + " Testing the discretizer " + "#"*10)
    for v in variables: 
        print(f"Feature '{dataset.feature_names[v]}': Discretized from {len(np.unique(X_data[:, v]))} to {len(np.unique(X_data_discretized[:, v]))} values.")
    
    # Testing that all other values are not discretized
    for v in range(X_data.shape[1]):
        if v not in variables:
            # This throws an assertion error if the two arrays are not equal
            assert np.array_equal(X_data[:, v], X_data_discretized[:, v])
    print("\n --> Test passed! All other features are not discretized.")


########## Testing the discretizer ##########
Feature 'mean radius': Discretized from 456 to 10 values.
Feature 'mean texture': Discretized from 479 to 13 values.
Feature 'mean perimeter': Discretized from 522 to 10 values.
Feature 'mean area': Discretized from 539 to 11 values.
Feature 'mean smoothness': Discretized from 474 to 13 values.
Feature 'mean compactness': Discretized from 537 to 11 values.
Feature 'mean concavity': Discretized from 537 to 12 values.
Feature 'mean concave points': Discretized from 542 to 9 values.
Feature 'mean symmetry': Discretized from 432 to 20 values.

 --> Test passed! All other features are not discretized.


---
## 3 Analysis and Comparison of Results
To conclude this short paper, I compare the discretization strategies included in *scikit-learn* with the decision-tree-based implementation. For the comparison I use the *Iris* dataset and discretize some of its features. I split the dataset into `X_train` and `X_test`, fit the discretizers on `X_train`, and transform both `X_train` and `X_test`. Finally, I evaluate the performance of a *LogisticRegression* linear model, both with and without discretization.

In [8]:
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Loading the iris dataset
iris = load_iris()
def get_iris():
    '''
    Splits the iris dataset into random train and test sets (no fixed seed, so each call gives a different split).
    
    return: tuple
        A tuple (X_train, X_test, y_train, y_test) where X_train and X_test are the training and test features, respectively, and y_train and y_test are the training and test targets, respectively.
    '''
    return train_test_split(iris.data, iris.target, test_size=0.33)

print(f"Features in the iris dataset (there are {len(iris.feature_names)}):")
print(iris.feature_names)

Features in the iris dataset (there are 4):
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


#### 3.1 Base Function for Analysing the Performance of Different Models

In [9]:
# Training a logistic regression model on the data
from sklearn.linear_model import LogisticRegression
def trainLogisticRegression(X_train, y_train, X_test, y_test, prefix: str = ""):
    # Training the model
    model = LogisticRegression(max_iter=1500)
    model.fit(X_train, y_train)
    # Predicting and calculating accuracies
    y_pred_test, y_pred_train = model.predict(X_test), model.predict(X_train)
    test_accuracy, train_accuracy = accuracy_score(y_test, y_pred_test), accuracy_score(y_train, y_pred_train)
    print("#"*10 + f" Getting the results for the model, '{prefix}' " + "#"*10)
    print(f"Accuracy for the model on the test-set: {test_accuracy*100:.2f}%")
    print(f"Accuracy for the model on the train-set: {train_accuracy*100:.2f}%")
    print()
    

In [10]:
# The variables to be discretized for the iris dataset
variables = [1,3]

#### 3.2 Results of Decision-Tree Discretizer

*Comparing a LogisticRegression model trained on the discretized data with one trained on the original data.*

In [52]:
X_train, X_test, y_train, y_test = get_iris()

# Discretizing with the decision-tree discretizer
tree_discretizers = decisionTreeDiscretizerFit(X_train, y_train, variables)
X_train_discretized = decisionTreeDiscretizerTransform(X_train, variables=variables, tree_discretizers=tree_discretizers)
X_test_discretized = decisionTreeDiscretizerTransform(X_test, variables=variables, tree_discretizers=tree_discretizers)

# Training a logistic regression model on the discretized data
trainLogisticRegression(X_train_discretized, y_train, X_test_discretized, y_test, prefix="Discretized with decision tree")

# Training a logistic regression model on the original data
trainLogisticRegression(X_train, y_train, X_test, y_test, prefix="Original data")

########## Getting the results for the model, 'Discretized with decision tree' ##########
Accuracy for the model on the test-set: 100.00%
Accuracy for the model on the train-set: 97.00%

########## Getting the results for the model, 'Original data' ##########
Accuracy for the model on the test-set: 100.00%
Accuracy for the model on the train-set: 94.00%



^ Having tested with several different values and shuffled datasets, I can conclude that the model trained on the discretized data is sometimes better than the one trained on the original data. In most cases, however, the model trained on the undiscretized data seems to deliver the most consistent results, with the least deviation. These are at least my observations from testing with the iris dataset.

#### 3.3 Results with K-Bins Discretization

In [12]:
from sklearn.preprocessing import KBinsDiscretizer

In [104]:
# Implement k-bins discretization
X_train, X_test, y_train, y_test = get_iris()

# Discretizing with k-bins (k-means strategy)
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans', subsample=None)
discretizer.fit(X_train[:, variables])

# Transforming the data
X_train_discretized = discretizer.transform(X_train[:, variables])
X_test_discretized = discretizer.transform(X_test[:, variables])

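# Training a logistic regression model on the discretized data
# (note: only the two discretized columns are passed to the model here; the other features are dropped)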
trainLogisticRegression(X_train_discretized, y_train, X_test_discretized, y_test, prefix="Discretized with k-bins")

# Training a logistic regression model on the original data
trainLogisticRegression(X_train, y_train, X_test, y_test, prefix="Original data")


########## Getting the results for the model, 'Discretized with k-bins' ##########
Accuracy for the model on the test-set: 96.00%
Accuracy for the model on the train-set: 94.00%

########## Getting the results for the model, 'Original data' ##########
Accuracy for the model on the test-set: 94.00%
Accuracy for the model on the train-set: 97.00%



^ My impression is that the results here are quite similar to those of the decision-tree discretizer. Neither option (discretized vs. not discretized) has an obvious advantage: both vary between runs, and they take turns outperforming each other.
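
Since single random splits vary this much, one way to make the comparison less dependent on a particular shuffle is to repeat it over many splits and report the mean and standard deviation of the accuracy. The following is a sketch under my own assumptions, not the procedure used above: it uses repeated stratified 3-fold cross-validation with a fixed seed, and it keeps the non-discretized features via `remainder="passthrough"`, which differs from the k-bins cell above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)
variables = [1, 3]  # same feature indices as in the experiments above

# Discretize only the selected columns and pass the other features through unchanged
kbins = ColumnTransformer(
    [("kbins", KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="kmeans"), variables)],
    remainder="passthrough",
)
models = {
    "Original data": LogisticRegression(max_iter=1500),
    "Discretized with k-bins": make_pipeline(kbins, LogisticRegression(max_iter=1500)),
}

# 3 folds repeated 10 times gives 30 train/test splits per model (assumed setup)
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=0)
results = {}
for name, model in models.items():
    # The discretizer sits inside the pipeline, so it is refit on each training fold
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy {scores.mean():.3f}, std {scores.std():.3f}")
```

The standard deviation gives a direct number for the "deviation" discussed above, instead of an impression from a handful of runs.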