In [1]:
%load_ext watermark
%watermark

%load_ext autoreload
%autoreload 2


# import standard libs
from IPython.display import display
from IPython.core.debugger import set_trace as bp
from pathlib import PurePath, Path
import sys
import time
from collections import OrderedDict as od
import re
import os
import json
import datetime
import pickle


# import python scientific stack
import pandas as pd
import pandas_datareader.data as web
pd.set_option('display.max_rows', 10)
from dask import dataframe as dd
from dask.diagnostics import ProgressBar
from multiprocessing import cpu_count
pbar = ProgressBar()
pbar.register()
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from numba import jit
import math
# import ffn


# import visual tools
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
%matplotlib inline
import seaborn as sns

plt.style.use('seaborn-v0_8-talk')  # 'seaborn-talk' was renamed in matplotlib 3.6
plt.style.use('bmh')
#plt.rcParams['font.family'] = 'DejaVu Sans Mono'
plt.rcParams['font.size'] = 9.5
plt.rcParams['font.weight'] = 'medium'
plt.rcParams['figure.figsize'] = 10,7
blue, green, red, purple, gold, teal = sns.color_palette('colorblind', 6)

RANDOM_STATE = 777

print()

Last updated: 2024-09-06T14:12:31.617466-04:00

Python implementation: CPython
Python version       : 3.8.19
IPython version      : 8.12.2

Compiler    : Clang 16.0.6 
OS          : Darwin
Release     : 23.6.0
Machine     : arm64
Processor   : arm
CPU cores   : 8
Architecture: 64bit






In [2]:
import os

# Run the setup script
%run ../../config/setup_project.py

# Call the function to set up the project path
setup_project_path()

# Now you can import your modules
from utils import helper as h_
import ch_02.code_ch_02 as f_ch2
import ch_03.code_ch_03 as f_ch3
import ch_05.code_ch_05 as f_ch5
import code_ch_06 as f_ch6
import ch_08.code_ch_08 as f_ch8

Project root added to sys.path: /Users/paulkelendji/Desktop/GitHub_paul/ML-Asset_Management
Config path added to sys.path: /Users/paulkelendji/Desktop/GitHub_paul/ML-Asset_Management/config
Current sys.path: ['/Users/paulkelendji/miniconda3/envs/financial_math/lib/python38.zip', '/Users/paulkelendji/miniconda3/envs/financial_math/lib/python3.8', '/Users/paulkelendji/miniconda3/envs/financial_math/lib/python3.8/lib-dynload', '', '/Users/paulkelendji/miniconda3/envs/financial_math/lib/python3.8/site-packages', '/Users/paulkelendji/miniconda3/envs/financial_math/lib/python3.8/site-packages/setuptools/_vendor', '/Users/paulkelendji/Desktop/GitHub_paul/ML-Asset_Management', '/Users/paulkelendji/Desktop/GitHub_paul/ML-Asset_Management/config']


ModuleNotFoundError: No module named 'utils'

In [5]:
df = pd.read_parquet("../../data/IVE_kibot.parq")

In [6]:
# Load the pickled chapter 2 variables from ../../data/variables_ch2.pkl
%run ../ch_02/code_ch_02.py

path = '../../data/variables_ch2.pkl'
import pickle
with open(path, 'rb') as f:
    bars = pickle.load(f)
    bar_time = pickle.load(f)
    
# Use the dollar bars' OHLC frame, without the 'cusum' column
df = bars['Dollar'].df_OLHC.drop(columns=['cusum'])
# For the purpose of this example, keep only the first row for each duplicated time_close
df = df.drop_duplicates(subset='time_close', keep='first')
# Index the frame by 'time_close'
df = df.set_index('time_close')

In [7]:
# CLOSE PRICE AND DAILY VOLATILITY
# Step 1 : get the daily volatility
close = df['close']
dailyVol = f_ch3.getDailyVol(close, span0=100).dropna()

# from series to df
close = pd.DataFrame(close)
dailyVol = pd.DataFrame(dailyVol)

# 6.3.3 Observation Redundancy

### Understanding Out-of-Bag (OOB) Accuracy in Bagging:

**Bagging (Bootstrap Aggregating):**
- Bagging is an ensemble learning technique that involves training multiple models on different subsets of the training data and then aggregating their predictions.
- In bagging, each model is trained on a bootstrap sample, which is created by randomly sampling the original training set **with replacement**. This means that some observations will be repeated in the bootstrap sample, while others might be left out.

**Out-of-Bag (OOB) Observations:**
- For each bootstrap sample, the observations that were **not selected** are called "out-of-bag" (OOB) observations.
- On average, about 63% of the original training data is included in each bootstrap sample, meaning that around 37% of the data is left out as OOB observations.
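
The 63% / 37% split quoted above follows from the probability that a given observation is drawn at least once in `n` draws with replacement, which is `1 - (1 - 1/n)^n` and approaches `1 - 1/e ≈ 0.632` as `n` grows. A minimal sketch that checks this both analytically and by simulation (the function names are illustrative, not from the notebook):

```python
import random


def expected_in_bag_fraction(n):
    """Probability that one observation appears at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_in_bag_fraction(n, n_trials=200, seed=0):
    """Average share of distinct observations across simulated bootstrap samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        sample = rng.choices(range(n), k=n)  # one bootstrap sample of size n
        total += len(set(sample)) / n        # share of distinct observations drawn
    return total / n_trials


n = 1000
analytic = expected_in_bag_fraction(n)
simulated = simulated_in_bag_fraction(n)
print(f"analytic in-bag fraction : {analytic:.4f}")
print(f"simulated in-bag fraction: {simulated:.4f}")
print(f"approximate OOB fraction : {1 - analytic:.4f}")
```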

**Out-of-Bag (OOB) Accuracy:**
- After a model is trained on the bootstrap sample, its performance can be evaluated on the OOB observations. This evaluation is done for each model in the ensemble.
- The OOB accuracy is computed by aggregating the predictions of all the models for their respective OOB observations and comparing them to the true labels.
- **Key Point:** OOB accuracy provides an estimate of the model's performance without the need for a separate validation set or cross-validation.
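
As a minimal sketch of how this looks in scikit-learn, `RandomForestClassifier` exposes the aggregated OOB accuracy as `oob_score_` when it is constructed with `oob_score=True`. The synthetic dataset below is an assumption made purely for illustration; it is not the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic IID data, used only to show where the OOB estimate comes from.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)

# Each row is scored only by the trees whose bootstrap sample did not contain it.
print(f"OOB accuracy: {clf.oob_score_:.3f}")
print("OOB class-probability matrix shape:", clf.oob_decision_function_.shape)
```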

### Understanding Cross-Validation (CV) Accuracy:

**Cross-Validation (CV):**
- Cross-validation is a technique used to assess the performance of a model by partitioning the data into multiple subsets or "folds."
- In **k-fold cross-validation**, the data is divided into `k` folds. The model is trained on `k-1` folds and tested on the remaining fold. This process is repeated `k` times, with each fold serving as the test set exactly once.
- The final performance metric is typically the average accuracy (or another metric) across all `k` folds.
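
The fold mechanics can be sketched without any library. The helper below is hypothetical (not part of the notebook's code) and produces contiguous, unshuffled folds, so each observation lands in exactly one test fold:

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional observation.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```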

### Key Differences:

1. **Source of Data:**
   - **OOB Accuracy:** Evaluated on the observations that were **not included** in the bootstrap samples (about 37% of the data per model).
   - **CV Accuracy:** Evaluated on a specific fold in the data that was **held out** as the test set during cross-validation.

2. **Data Overlap:**
   - **OOB Accuracy:** When observations are redundant (for example, labels built from overlapping time windows), sampling with replacement tends to leave near-duplicates of each OOB observation inside the bootstrap sample. The model has then effectively seen the OOB observations during training, which inflates OOB accuracy.
   - **CV Accuracy:** The test fold in cross-validation is explicitly kept separate from the training data, providing a more robust estimate of model performance.

3. **Purpose:**
   - **OOB Accuracy:** Provides an estimate of model performance directly from the training process in bagging without needing to set aside a separate validation set.
   - **CV Accuracy:** Provides a more robust estimate of model performance by ensuring the test data is independent of the training data.

4. **Potential Issues in Bagging:**
   - Section 6.3.3 of the book highlights that **OOB accuracy can be misleading** in the presence of observation redundancy: the OOB observations are not truly independent of the training data, because redundant copies of them are likely to be in-bag. This can lead to an overestimation of the model's true performance.
   - **Cross-validation without shuffling** is recommended in such cases to get a more accurate assessment of model performance.

### Summary:
- **OOB Accuracy** is a convenient estimate of model performance in bagging but can be biased if the training and OOB sets are not truly independent.
- **Cross-Validation Accuracy** is a more reliable measure of model performance, especially when there is a risk of data redundancy or dependence between training and testing samples.
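
The gap between the two estimates can be illustrated with a deliberately extreme toy case. In the sketch below, the features and labels are pure noise and every observation is repeated five times in a row; exact duplication is an assumption standing in for the overlapping observations found in financial data. With no signal, an honest accuracy estimate should sit near 0.5.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n_unique, n_copies = 200, 5

# Noise features and random labels: there is nothing real to learn.
X_base = rng.normal(size=(n_unique, 5))
y_base = rng.integers(0, 2, size=n_unique)

# Redundancy: each observation appears n_copies times, adjacent in "time".
X = np.repeat(X_base, n_copies, axis=0)
y = np.repeat(y_base, n_copies)

clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)
oob_acc = clf.oob_score_

# Unshuffled folds keep each block of copies on one side of the split
# (the fold size, 200 rows, is a multiple of n_copies).
cv = KFold(n_splits=5, shuffle=False)
cv_acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()

print(f"OOB accuracy              : {oob_acc:.3f}")
print(f"k-fold accuracy (no shuffle): {cv_acc:.3f}")
```

The OOB estimate is inflated because a copy of almost every OOB row is in-bag, while the unshuffled folds keep the copies together and so report accuracy close to chance.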

---

The following snippet shows three different ways to set up a Random Forest (RF) model, each with its own characteristics. Each setting is broken down below, followed by an explanation of the `bootstrap=False` parameter.

### 1. **`clf0`: Standard Random Forest**
```python
clf0 = RandomForestClassifier(
    n_estimators=1000,
    class_weight='balanced_subsample',
    criterion='entropy'
)
```
- **`n_estimators=1000`**: The model consists of 1,000 trees.
- **`class_weight='balanced_subsample'`**: Reweights the classes separately for each tree, with weights inversely proportional to the class frequencies in that tree's bootstrap sample rather than in the full input data.
- **`criterion='entropy'`**: The model uses the entropy (information gain) criterion to split nodes.
- **`bootstrap=True`**: Implicit because this is the default. The model uses bootstrap samples (sampling with replacement) to build each tree.

This is the standard implementation of a Random Forest with 1,000 trees, using bootstrapping to create the datasets for each tree.

### 2. **`clf1`: Decision Tree with Bagging**
```python
clf1 = DecisionTreeClassifier(
    criterion='entropy',
    max_features='sqrt',  # was 'auto' (same meaning here); 'auto' was removed in scikit-learn 1.3
    class_weight='balanced'
)
clf1 = BaggingClassifier(
    base_estimator=clf1,  # this argument was renamed to `estimator` in scikit-learn 1.2
    n_estimators=1000,
    max_samples=avgU
)
```
- **`DecisionTreeClassifier`**: A single decision tree is used as the base model.
  - **`criterion='entropy'`**: Similar to `clf0`, using entropy for node splitting.
  - **`max_features='sqrt'`**: Each split considers `sqrt(n_features)` randomly chosen features. The original snippet used `'auto'`, which meant the same thing for a classification tree.
  - **`class_weight='balanced'`**: Balances class weights based on class frequencies.
- **`BaggingClassifier`**: This creates an ensemble of decision trees using bagging.
  - **`n_estimators=1000`**: 1,000 decision trees are used.
  - **`max_samples=avgU`**: This limits the number of observations drawn for each tree to the fraction `avgU` of the training set. In the book, `avgU` is the average uniqueness of the observations, a value between 0 and 1.

This approach uses Bagging to create an ensemble of decision trees, which is a more flexible form of RF, where you can control the sampling process explicitly.

### 3. **`clf2`: Bagging with a Single Tree per Bootstrap Sample**
```python
clf2 = RandomForestClassifier(
    n_estimators=1,
    criterion='entropy',
    bootstrap=False,
    class_weight='balanced_subsample'
)
clf2 = BaggingClassifier(
    base_estimator=clf2,
    n_estimators=1000,
    max_samples=avgU,
    max_features=1.
)
```
- **`RandomForestClassifier`**: This is configured to build only a single tree, so it acts as the base estimator.
  - **`n_estimators=1`**: Only one tree is built.
  - **`bootstrap=False`**: **This is the key difference**. The inner forest does no resampling of its own; its tree is fit on every observation it receives. Inside the `BaggingClassifier`, that is the subsample drawn by bagging, not the full dataset. The tree is still randomized, because the forest considers a random subset of features at each split.
- **`BaggingClassifier`**: Creates an ensemble of 1,000 models.
  - **`max_samples=avgU`**: Uses a fraction of the data for each tree, determined by `avgU`.
  - **`max_features=1.`**: As a float, this is the fraction of features drawn for each base estimator, so `1.` passes all features to every base estimator. It does not limit splits to a single feature.

This setup is an unusual use of a Random Forest: a one-tree forest, with its own bootstrapping turned off, is wrapped inside a bagging ensemble. Bagging supplies the sample randomness through `max_samples`, while the inner forest supplies the feature randomness at each split.
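
What bagging actually hands to each base estimator can be inspected on a fitted `BaggingClassifier` through its `estimators_features_` and `estimators_samples_` attributes. The sketch below uses a small synthetic dataset and `max_samples=0.5` as a hypothetical stand-in for `avgU`; the base estimator is passed positionally so the call does not depend on the `base_estimator` / `estimator` keyword name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# One-tree forest with its own bootstrapping disabled, as in clf2.
base = RandomForestClassifier(n_estimators=1, criterion="entropy", bootstrap=False)

bag = BaggingClassifier(
    base,
    n_estimators=5,
    max_samples=0.5,   # hypothetical stand-in for avgU
    max_features=1.0,  # fraction of features given to each base estimator
    random_state=0,
)
bag.fit(X, y)

n_feats = [len(f) for f in bag.estimators_features_]
n_draws = [len(s) for s in bag.estimators_samples_]
print("features per base estimator:", n_feats)
print("rows drawn per base estimator:", n_draws)
```

Every base estimator receives all six features and half of the rows, which is consistent with `max_features=1.` meaning "all features" rather than "one feature".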

### **`bootstrap=False` Explained:**
- In a standard Random Forest (`bootstrap=True`), each tree is built on a bootstrapped dataset (a sample with replacement from the original dataset). This introduces variability into the trees, even when trained on the same data, because each tree sees a different subset of the data.
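- **`bootstrap=False`**: When set to `False`, each tree is built on all of the data passed to `fit`, without resampling. In a standalone forest this reduces diversity among the trees, because they are all trained on the same rows. In `clf2` the data passed to `fit` is already a subsample drawn by the `BaggingClassifier`, so the diversity comes from bagging instead.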

### **Distinctions Between the Three Models:**
1. **`clf0`**: Standard Random Forest with bootstrapping, where each tree's bootstrap sample contains as many draws as the training set. This is the most commonly used setup.
2. **`clf1`**: An ensemble of decision trees created through bagging, allowing for more control over the sampling process (e.g., limiting sample size).
3. **`clf2`**: A one-tree Random Forest with `bootstrap=False`, wrapped in Bagging. Bagging draws each subsample (sized by `avgU`), and the inner tree is fit on that whole subsample without resampling it again, using random feature subsets at each split.

In summary, `clf2` deviates the most from a standard Random Forest: it moves the bootstrapping out of the forest and into the `BaggingClassifier`, where the sample size can be set through `max_samples`. This is useful when the forest's default bootstrap sample size is not desirable, for example when observations are redundant.
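
For reference, here is a self-contained sketch that builds and fits all three configurations. Several things are assumptions made for illustration: `avgU = 0.5` is a placeholder rather than a computed average uniqueness, the data is synthetic, the ensembles use 50 trees instead of 1,000 to keep the run short, and the base estimators are passed positionally to sidestep the `base_estimator` / `estimator` rename.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
avgU = 0.5    # hypothetical placeholder for the average uniqueness
n_trees = 50  # reduced from 1000 for a quick run

clf0 = RandomForestClassifier(
    n_estimators=n_trees,
    class_weight="balanced_subsample",
    criterion="entropy",
    random_state=0,
)
clf1 = BaggingClassifier(
    DecisionTreeClassifier(
        criterion="entropy", max_features="sqrt", class_weight="balanced"
    ),
    n_estimators=n_trees,
    max_samples=avgU,
    random_state=0,
)
clf2 = BaggingClassifier(
    RandomForestClassifier(
        n_estimators=1,
        criterion="entropy",
        bootstrap=False,
        class_weight="balanced_subsample",
    ),
    n_estimators=n_trees,
    max_samples=avgU,
    max_features=1.0,
    random_state=0,
)

scores = {}
for name, clf in [("clf0", clf0), ("clf1", clf1), ("clf2", clf2)]:
    clf.fit(X, y)
    scores[name] = clf.score(X, y)  # in-sample accuracy, only a sanity check
    print(f"{name}: in-sample accuracy = {scores[name]:.3f}")
```

The in-sample scores only confirm that each configuration fits and predicts; they say nothing about out-of-sample performance.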

---

## 7.5 Bugs In SKLEARN Cross-Validation

The book lists two bugs in scikit-learn's cross-validation. Their status is summarized below:

1. **Scoring functions not recognizing `classes_` (Issue #6231):** This issue was related to scikit-learn's scoring functions not recognizing the classes when working with certain classifiers, leading to incorrect scoring when using cross-validation. This issue has been acknowledged and discussed within the scikit-learn community, and a solution was implemented. The scorer objects now look at the `classes_` attribute of the estimator before cross-validation, addressing the problem.

2. **`cross_val_score` issue with `log_loss` (Issue #9144):** As the book describes it, `cross_val_score` passes sample weights to the `fit` method but not to the `log_loss` scorer, so the reported score differs from a loss computed with the same weights. A related pitfall is that `log_loss` needs the full set of labels (its `labels` argument) when a test fold does not contain every class.

Both of these issues are reported as addressed in newer versions of scikit-learn, so a recent version of the library should be less affected. It is still good practice to check the documentation and release notes for the version you are using, since the behavior depends on the installed release.
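
Independently of the library version, both pitfalls can be sidestepped by writing the cross-validation loop by hand, so that the weights and the class labels are passed explicitly to the loss. The sketch below is in that spirit but is not the book's implementation: it uses a plain unshuffled `KFold`, and the function name `cv_neg_log_loss`, the synthetic data, and the random weights are all assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold


def cv_neg_log_loss(clf, X, y, sample_weight, n_splits=5):
    """Negative log-loss per fold, using weights in both fit and scoring."""
    scores = []
    for train, test in KFold(n_splits=n_splits, shuffle=False).split(X):
        # Fit a fresh copy of the classifier with the training weights.
        fitted = clone(clf).fit(X[train], y[train], sample_weight=sample_weight[train])
        prob = fitted.predict_proba(X[test])
        # Pass the test weights and the fitted classes to the loss explicitly.
        loss = log_loss(
            y[test], prob, sample_weight=sample_weight[test], labels=fitted.classes_
        )
        scores.append(-loss)
    return np.array(scores)


X, y = make_classification(n_samples=400, n_features=8, random_state=0)
w = np.random.default_rng(0).uniform(0.5, 1.5, size=len(y))  # hypothetical weights

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cv_neg_log_loss(clf, X, y, w)
print("negative log-loss per fold:", np.round(scores, 3))
```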