# Classification Trees and Forests -- Sales Level Prediction

In this notebook, we will predict the sales level (high or low) of car seats. 

## Review of Decision Tree

### Data

Let's first load the dataset from a csv file, which contains sales volumes of a car seat product at various retail locations, as well as factors that potentially affect sales. 

Consider **7 thousand** units as the benchmark sales volume. 
* We label a sales volume as **high** if it exceeds 7, and as **low** otherwise.


We will:
* **create** a new column to indicate if a retail location's sales volume is **high**.
* drop the *Sales* column, as we will predict the sales level (high or low) instead of the sales number.
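The two steps above might look like the following minimal sketch. The small DataFrame is a hypothetical stand-in for the csv file, and the column names `High`, `Price`, and `ShelveLoc` are assumptions made for illustration; only `Sales` is named in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the car seat data (the notebook loads a csv file instead)
df = pd.DataFrame({
    "Sales": [9.5, 11.2, 4.1, 7.4, 6.9],
    "Price": [120, 83, 128, 97, 131],
    "ShelveLoc": ["Bad", "Good", "Medium", "Medium", "Bad"],
})

# 1 if the sales volume exceeds 7 (thousand units), 0 otherwise
df["High"] = (df["Sales"] > 7).astype(int)

# Drop the raw sales number; the sales level is now the prediction target
df = df.drop(columns="Sales")
print(df)
```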

### Preprocessing (one-hot encoding)

Let's perform one-hot encoding on the categorical variable.

Split the dataset into features and target.

Train-test split:
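A minimal sketch of the three preprocessing steps, assuming a toy DataFrame with a categorical column `ShelveLoc` and a target column `High` (both names, the data, and the 25% test size are illustrative assumptions, not taken from the original dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the car seat dataset
df = pd.DataFrame({
    "Price": [120, 83, 128, 97, 131, 110, 105, 99],
    "ShelveLoc": ["Bad", "Good", "Medium", "Medium", "Bad", "Good", "Medium", "Bad"],
    "High": [1, 1, 0, 1, 0, 0, 1, 0],
})

# One-hot encoding: one indicator column per category level
df_encoded = pd.get_dummies(df, columns=["ShelveLoc"])

# Features (X) and target (y)
X = df_encoded.drop(columns="High")
y = df_encoded["High"]

# Hold out 25% of the rows for testing; stratify keeps the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```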

## Decision tree model

Note that it is not necessary to scale the features for tree-based methods:
* they are not distance-based methods
* they are not trained with gradient-based optimization

When constructing the tree, we may specify the hyperparameters as keyword arguments. If not specified, default values will be used.

See the handout for a list of the hyperparameters for building a decision tree.

We can fit the model to the training data with the `.fit()` method.
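As a rough illustration, the sketch below builds and fits a tree on synthetic data from `make_classification` (an assumption standing in for the car seat features). The specific values `max_depth=4` and `min_samples_leaf=5` are arbitrary choices for the example, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 400 rows, 10 numeric features, binary target
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters are passed as keyword arguments; anything omitted keeps its default
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)

print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```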

Let's make predictions on the test set and get the classification report and the AUC score.
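One possible way to do this is sketched below, again on synthetic data (an assumption). The AUC is computed from the predicted probability of the positive class rather than from the hard labels.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the car seat dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Hard class predictions for the classification report
y_pred = tree.predict(X_test)
print(classification_report(y_test, y_pred))

# Probability of the positive class (column 1) for the AUC score
y_prob = tree.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print("AUC:", auc)
```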

### Exercise: Cross-validation

Perform a 10-fold cross-validation. 

* Define the CV strategy.
* Get the scores for each train-test split in the CV.


The attribute `feature_importances_` of the trained model shows the importance of the features.
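A small sketch of reading the importances, assuming synthetic data and made-up feature names `x0`, `x1`, ... Wrapping the array in a pandas Series is just one convenient way to attach names and sort.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical feature names
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# One importance value per feature; the values are normalized to sum to 1
importances = pd.Series(tree.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```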

## Plot the fitted tree

We may plot the fitted tree using the `plot_tree()` function.
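A minimal plotting sketch, assuming synthetic data, hypothetical feature names, and the class labels `low` and `high`. A shallow tree (`max_depth=2`) is used here only to keep the figure readable.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data with hypothetical feature names
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Draw the tree on an explicit axis; in a notebook the figure is displayed inline
fig, ax = plt.subplots(figsize=(10, 6))
annotations = plot_tree(
    tree,
    feature_names=feature_names,
    class_names=["low", "high"],
    filled=True,  # color each node by its majority class
    ax=ax,
)
```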

The tree shows how each node was split (if not a leaf node), along with other information.

## Random Forest

A random forest is an **ensemble** method built from a number of decision trees. The number of trees, $B$, is a hyperparameter.

To construct each tree in the random forest:
* (Bootstrapping) Sample the same number of data records (i.e., rows; in the current example, 400 rows) from the original dataset, **with replacement**.

**Note**: Bootstrapping is a widely used sampling technique to create multiple samples from one original sample.

<img src='attachment:image-2.png' width='650'></img>

In the above example, the ball labeled "2" is never sampled. Thus it is unseen by the tree built from this sample and can be used as a testing data point for evaluating performance. This is known as "out-of-bag scoring".

* For the bootstrapped sample, construct a decision tree.

* For a new data point not in the training dataset, make a prediction based on each of the $B$ decision trees.

* Average over all the $B$ trees to get the ensemble prediction.
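The steps above can be sketched by hand with NumPy and individual trees. This is an illustration under stated assumptions (synthetic data, $B = 25$, a 0.5 threshold on the averaged probability), not how `sklearn` implements its forest; in particular, `sklearn`'s random forest also considers only a random subset of features at each split, which is omitted here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 350 training rows, 5 "new" rows held back
X, y = make_classification(n_samples=355, n_features=10, random_state=0)
X_train, y_train, X_new = X[:350], y[:350], X[350:]
n = X_train.shape[0]
B = 25  # number of trees

trees, oob_fractions = [], []
for b in range(B):
    # Bootstrapping: draw n row indices with replacement
    idx = rng.integers(0, n, size=n)
    # Rows never drawn are "out of bag" for this tree
    oob_mask = ~np.isin(np.arange(n), idx)
    oob_fractions.append(oob_mask.mean())
    # Construct a decision tree on the bootstrapped sample
    trees.append(DecisionTreeClassifier(random_state=b).fit(X_train[idx], y_train[idx]))

# Predict with each tree, then average over the B trees
avg_prob = np.mean([t.predict_proba(X_new)[:, 1] for t in trees], axis=0)
ensemble_pred = (avg_prob >= 0.5).astype(int)

print("average out-of-bag fraction:", np.mean(oob_fractions))
print("ensemble predictions:", ensemble_pred)
```

On average, roughly a third of the rows end up out of bag for any single tree, which is what makes out-of-bag scoring possible.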

We will need to import the `RandomForestClassifier` class from the `ensemble` module in `sklearn`.

In addition to the hyperparameters for the individual trees, we also have:
* `n_estimators`: total number of trees to construct 
    * default value is 100
* `bootstrap`: if set to False, each tree will be constructed based on the same dataset (the original dataset)
    * default value: True

* `oob_score`: whether to use the 'out of bag' method to evaluate the performance score
    * because each tree is built on a bootstrapped sample from the original dataset, every tree leaves out some data records, i.e., those records are unseen by that tree during training.
    * if `oob_score` is set to True, each record is predicted using only the trees that did not see it, and the performance score is evaluated on those predictions; in other words, the left-out records serve as the test data.
    * default value: False

Now we can fit the random forest model.

Making predictions.

Create the classification report and obtain the AUC score.
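Put together, fitting, predicting, and scoring might look like the sketch below. The data is synthetic (an assumption), and `oob_score=True` is switched on only to show the `oob_score_` attribute, which for a classifier reports accuracy on the out-of-bag records.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the car seat dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 100 trees on bootstrapped samples, with out-of-bag scoring enabled
rf = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
)
rf.fit(X_train, y_train)
print("OOB accuracy:", rf.oob_score_)

# Test-set predictions, classification report, and AUC
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
rf_auc = roc_auc_score(y_test, y_prob)
print("AUC:", rf_auc)
```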

Perform cross-validation. Recall the steps:
* Define the strategy
* Pass the model, features (X), and target (y) as positional arguments
* `scoring=` specifies the performance score
* `cv=` specifies the cross-validation strategy
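A minimal sketch of these steps with `cross_val_score`. The synthetic data, the choice of `StratifiedKFold` with 10 shuffled folds, and the `roc_auc` metric are assumptions for illustration; a plain `KFold` would be used the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the car seat dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# CV strategy: 10 shuffled folds that preserve the class balance
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Model, X, and y are positional; scoring and cv are keyword arguments
rf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(rf, X, y, scoring="roc_auc", cv=cv)

print("score per fold:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```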

## Hyperparameter tuning

To select hyperparameters with good performance, we first define the space of hyperparameters to search over, in the form of a dictionary.

In the following, we will use `RandomizedSearchCV` to randomly select parameter combinations from the search space and evaluate the performance.

It has the following keyword arguments:
* `estimator=` specifies the model to perform the search with
* `param_distributions=` specifies the search space to use
* `n_iter=` is how many combinations we will try out
* `scoring=` specifies the performance metric
* `n_jobs=` specifies the number of CPU cores to use for the search; -1 represents the use of all available cores.
* `random_state=` is for reproducibility

`RandomizedSearchCV` will randomly select `n_iter` combinations of the hyperparameter values defined in `param_grid`.  Define the search:
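One way the search space and the search could be defined is sketched here. The hyperparameter ranges, `n_iter=10`, and the `roc_auc` metric are illustrative assumptions. Values may be given as lists (sampled uniformly) or as `scipy.stats` distributions such as `randint`.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Search space: each key is a hyperparameter name of the estimator
param_grid = {
    "n_estimators": randint(50, 201),    # integers from 50 to 200
    "max_depth": [3, 5, 10, None],       # None lets the trees grow fully
    "min_samples_leaf": randint(1, 11),  # integers from 1 to 10
    "max_features": ["sqrt", "log2"],
}

# The search itself; nothing is fitted until .fit() is called
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_grid,
    n_iter=10,
    scoring="roc_auc",
    n_jobs=-1,
    random_state=0,
)
```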

Conduct the search with `.fit()`:

Note that we used `.fit()` on the training dataset only. This means that during the fitting process:
* we did not make use of the test data
* `X_train` and `y_train` are further split into subsets (training and validation sets).

After fitting the `RandomizedSearchCV`, we may use the `.best_params_` attribute of the search to find the best hyperparameters.

We may use the `.best_estimator_` attribute to find the best model (i.e., the one with the best performing hyperparameters).

With the best model, we can evaluate the performance of the model on the test dataset.

We can use `cv_results_` to inspect the performance of all parameter combinations.
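The whole workflow, from fitting the search to inspecting its results, might look like this sketch. It assumes synthetic data and a deliberately small search (`n_iter=5`, 3-fold CV) so that it runs quickly; the search space is hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (not the car seat dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Small hypothetical search space to keep the example fast
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_grid,
    n_iter=5,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)

# Fit on the training data only; validation splits are made internally
search.fit(X_train, y_train)
print("best hyperparameters:", search.best_params_)

# Best model, refitted on all of X_train, evaluated on the untouched test set
best_model = search.best_estimator_
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print("test AUC:", test_auc)

# One row per sampled combination, sorted from best to worst
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "std_test_score", "rank_test_score"]])
```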