# Preprocessing and Feature Engineering
## Decision Tree-Based Discretization
Discretization is a data preprocessing technique that allows replacing a continuous feature with a discrete one by grouping values. We have seen that this technique can be complemented with one-hot encoding to eliminate the linear dependency between the obtained groups.

*Scikit-learn* includes several discretization strategies in the *KBinsDiscretizer* class, which can be applied to selected features through an instance of the *ColumnTransformer* class.
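As a minimal sketch of how these two classes can be combined, the example below discretizes only the first column of a small toy array and leaves the second one untouched. The data, the number of bins, and the transformer name `"bins"` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer

# Toy data with two continuous columns (values chosen only for illustration).
X = np.array([[0.0, 10.0],
              [1.0, 20.0],
              [2.0, 30.0],
              [3.0, 40.0],
              [4.0, 50.0],
              [5.0, 60.0]])

# Discretize column 0 into 3 equal-width bins, encoded as ordinal codes;
# every other column is passed through unchanged.
discretizer = ColumnTransformer(
    transformers=[
        ("bins",
         KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform"),
         [0]),
    ],
    remainder="passthrough",
)

# Transformed columns come first in the output, followed by the remainder.
X_binned = discretizer.fit_transform(X)
print(X_binned)
```

Using `encode="onehot"` or `encode="onehot-dense"` instead of `"ordinal"` would produce one indicator column per bin, which is the combination with one-hot encoding mentioned above.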

The paper [1] describes the use of decision trees for discretization. Given a feature, the method trains a decision tree of limited depth (2, 3, or 4) using only that feature. The classification probability vectors obtained with this tree are then used to recode the original feature. The key observation is that the possible classification probability vectors (i.e., the values of *predict_proba*) produced by a decision tree form a very small finite set: there is at most one per leaf, so no more than $2^d$ for a tree of depth $d$. The method can be outlined as follows:

In the training phase, given a dataset $X$, its classification values $y$, and a feature $v$:

* A decision tree $T$ is trained on the dataset formed only by the values of the feature $v$ from the original dataset $X$, with classification values $y$. This tree can be chosen as the best-performing one among several limited depths (2, 3, or 4).
* All possible classification probability vectors of the decision tree $T$ are obtained for all the data of the feature $v$ in $X$. There is a finite number of these classification vectors, to which we associate a unique numerical value. For example, the classification probability vectors could be $[0,0.5,0.5]$, $[0.3,0.2,0.5]$, and $[0.8,0.1,0.1]$, to which we could associate the values $0$, $1$, and $2$, respectively.

In the transformation phase, given a dataset $X'$ (possibly different from the one used during training) and a feature $v$:

* For each instance $e$ in $X'$, the value of the feature $v$ is passed to the decision tree $T$ obtained during the training phase, which returns the classification probability vector $w$ for that value.
* The value of the feature $v$ in $e$ is replaced by the numerical value associated with the classification probability vector $w$. Following the previous example, if the classification probability vector is $[0.3,0.2,0.5]$, then the value of the feature $v$ will be replaced by $1$.
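The two phases can be sketched for a single feature as follows. This is only an illustration under stated assumptions, not the implementation from [1]: the feature index, the fixed depth of 2, the rounding of probabilities to build dictionary keys, and the helper name `transform_column` are all choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
v = 2  # index of the feature to discretize (an arbitrary choice)

# Training phase: fit a shallow tree using only feature v.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X[:, [v]], y)

# Collect the distinct probability vectors seen on the training data
# and associate each one with a code 0, 1, 2, ...
# Rounding is an assumption made here to obtain stable dictionary keys.
probas = tree.predict_proba(X[:, [v]]).round(6)
unique_vectors = np.unique(probas, axis=0)
encoding = {tuple(vec): code for code, vec in enumerate(unique_vectors)}


def transform_column(column):
    """Map each value of the feature to the code of its probability vector."""
    vectors = tree.predict_proba(column.reshape(-1, 1)).round(6)
    return np.array([encoding[tuple(vec)] for vec in vectors])


# Transformation phase: work on a copy so the input is not modified.
X_new = X.copy()
X_new[:, v] = transform_column(X[:, v])

print(encoding)
print(np.unique(X_new[:, v]))
```

Because every leaf of the fitted tree contains at least one training instance, every probability vector the tree can return already appears in `encoding`, so the lookup also works for data not seen during training.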

[1] Niculescu-Mizil, A., Perlich, C., Swirszcz, G., Sindhwani, V., Liu, Y., Melville, P., Wang, D., Xiao, J., Hu, J., Singh, M., Xiong Shang, W., Feng Zhu, Y. Winning the KDD Cup Orange Challenge with Ensemble Selection. In Proceedings of KDD-Cup 2009 Competition, PMLR 7:23-34, 2009.
https://dl--acm--org.us.debiblio.com/doi/10.5555/3000364.3000366


## Exercise Content
The exercise consists of:

* Investigating the use of the different discretization strategies included in *scikit-learn*, clearly explaining what each one entails and how they are used.
* Defining the function `decisionTreeDiscretizerFit(X_data,y_data,variables)` which, given a dataset and its classification values, `X_data` and `y_data`, and a list of feature indices, `variables`, returns a dictionary that associates each feature index in `variables` with a pair `(treeModel,encoding)` where `treeModel` is a decision tree trained only with that feature of the dataset `X_data` with classification values `y_data`, and `encoding` is an association between the different classification probability vectors (*predict_proba*) obtained with that tree in the dataset and unique numerical values.
* Defining the function `decisionTreeDiscretizerTransform(X_data,variables,dtDiscretizer)` which, given a dataset, `X_data`, a list of feature indices, `variables`, and a dictionary `dtDiscretizer` obtained by the `decisionTreeDiscretizerFit` function (not necessarily for the same dataset `X_data` or the same list of variables `variables`), generates a new dataset (i.e., does not modify the input dataset) identical to `X_data` in which the values of the features whose indices are indicated in `variables` are replaced by the numerical values associated with the classification probability vectors, obtained with the corresponding decision trees associated with each feature in `dtDiscretizer`.
* Evaluating and comparing the different discretization strategies included in *scikit-learn* and the proposed strategy of discretization based on decision trees on the *Iris* dataset, to discretize some of its features. For this, it is proposed to split the dataset into two, `X_train` and `X_test`, train the discretizers with `X_train` and transform both `X_train` and `X_test`, finally evaluating the performance of a linear model (for example, *LogisticRegression*) both with and without discretization.

The **development must be reasoned**, indicating in each section what is being done, **thus demonstrating the knowledge acquired in this module**. What conclusions can you draw about discretization based on decision trees?

---

## Exercise 1
*Investigate the use of the different discretization strategies included in scikit-learn*

For this task I will explore the discretization methods featured in the scikit-learn docs, *K-bins discretization* and *Feature binarization* (https://scikit-learn.org/stable/modules/preprocessing.html#discretization, accessed on 11.12.23).

#### K-bins discretization
K-bins discretization is a preprocessing technique that turns continuous features into categorical ones by dividing the range of each feature into K intervals, called bins. The bins can be of equal width or based on another criterion, which is selected through the `strategy` parameter of `sklearn.preprocessing.KBinsDiscretizer`: *uniform*, *quantile*, or *kmeans*. The *uniform* strategy uses bins of equal, constant width; the *quantile* strategy uses the quantiles of each feature to create bins that contain approximately the same number of samples; the *kmeans* strategy creates the bins from a one-dimensional k-means clustering of each feature, so that the values in each bin share the same nearest cluster center.
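To make the difference between the strategies concrete, the following sketch fits a `KBinsDiscretizer` with each strategy on the same skewed toy feature and prints the resulting bin edges and codes. The data and the choice of three bins are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# One skewed toy feature: eight small values and one large one.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]).reshape(-1, 1)

edges = {}
codes = {}
for strategy in ("uniform", "quantile", "kmeans"):
    kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    # fit_transform returns one ordinal code per sample.
    codes[strategy] = kbd.fit_transform(x).ravel()
    # bin_edges_ holds one array of edges per feature.
    edges[strategy] = kbd.bin_edges_[0]
    print(strategy, edges[strategy], codes[strategy])
```

On data like this, *uniform* places most samples in the first bin because the single large value stretches the range, whereas *quantile* balances the bin populations.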

#### 