<a href="https://colab.research.google.com/github/reflectormensah/Financial-Engineering-Data-Science/blob/main/Hyperparameter%20Tuning(Variance%20%26%20Bias).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Optimizing the Trade-off between Variance and Bias

In [None]:
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
# This data concerns credit card applications; good mix of attributes
credit_approval = fetch_ucirepo(id=27)

# data (as pandas dataframes)
X = credit_approval.data.features
y = credit_approval.data.targets

# metadata
print(credit_approval.metadata)

# variable information
print(credit_approval.variables)


Collecting ucimlrepo
  Downloading ucimlrepo-0.0.2-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.2
{'uci_id': 27, 'name': 'Credit Approval', 'repository_url': 'https://archive.ics.uci.edu/dataset/27/credit+approval', 'data_url': 'https://archive.ics.uci.edu/static/public/27/data.csv', 'abstract': 'This data concerns credit card applications; good mix of attributes', 'area': 'Business', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 690, 'num_features': 15, 'feature_types': ['Categorical', 'Integer', 'Real'], 'demographics': [], 'target_col': ['A16'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1987, 'last_updated': 'Wed Aug 23 2023', 'dataset_doi': '10.24432/C5FS30', 'creators': ['J. R. Quinlan'], 'intro_paper': None, 'additional_info': {'summary': 'This file concerns credit card applications.  All attribute names and values have 

In [None]:
# The feature names are anonymized because this type of data is considered personal; the data itself can still be used for this example
X.head()


Unnamed: 0,A15,A14,A13,A12,A11,A10,A9,A8,A7,A6,A5,A4,A3,A2,A1
0,0,202.0,g,f,1,t,t,1.25,v,w,g,u,0.0,30.83,b
1,560,43.0,g,f,6,t,t,3.04,h,q,g,u,4.46,58.67,a
2,824,280.0,g,f,0,f,t,1.5,h,q,g,u,0.5,24.5,a
3,3,100.0,g,t,5,t,t,3.75,v,w,g,u,1.54,27.83,b
4,0,120.0,s,f,0,f,t,1.71,v,w,g,u,5.625,20.17,b


In [None]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A15     690 non-null    int64  
 1   A14     677 non-null    float64
 2   A13     690 non-null    object 
 3   A12     690 non-null    object 
 4   A11     690 non-null    int64  
 5   A10     690 non-null    object 
 6   A9      690 non-null    object 
 7   A8      690 non-null    float64
 8   A7      681 non-null    object 
 9   A6      681 non-null    object 
 10  A5      684 non-null    object 
 11  A4      684 non-null    object 
 12  A3      690 non-null    float64
 13  A2      678 non-null    float64
 14  A1      678 non-null    object 
dtypes: float64(4), int64(2), object(9)
memory usage: 81.0+ KB


In [None]:
# Forward-fill missing values; assigning the result back avoids pandas' SettingWithCopyWarning
X = X.ffill()
X.isna().sum()


A15    0
A14    0
A13    0
A12    0
A11    0
A10    0
A9     0
A8     0
A7     0
A6     0
A5     0
A4     0
A3     0
A2     0
A1     0
dtype: int64

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit, cross_validate
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [None]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(X)
categorical_columns

['A13', 'A12', 'A10', 'A9', 'A7', 'A6', 'A5', 'A4', 'A1']

In [None]:
# now let's identify the numerical variables

num_vars = [var for var in X.columns if var not in categorical_columns]

# number of numerical variables
print(len(num_vars))
print(num_vars)

6
['A15', 'A14', 'A11', 'A8', 'A3', 'A2']


In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

In [None]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    [
        ("one-hot-encoder", categorical_preprocessor, categorical_columns),
        ("standard_scaler", numerical_preprocessor, num_vars),
    ]
)

In [None]:
model = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("regressor", RandomForestClassifier(random_state=42)),
    ]
)

In [None]:
from sklearn import set_config

set_config(display="diagram")
model

In [None]:
# Recode the target labels '+' and '-' as '1' and '0'
y_new = y.copy()
y_new[y_new == '+'] = '1'
y_new[y_new == '-'] = '0'
X_train, X_test, y_train, y_test = train_test_split(
    X,  # predictive variables
    y_new,  # target
    test_size=0.3,  # portion of dataset to allocate to test set
    random_state=0,  # we are setting the seed here
)

X_train.shape, X_test.shape

((483, 15), (207, 15))

In [None]:
model.fit(X_train, y_train)

  self._final_estimator.fit(Xt, y, **fit_params_last_step)


In [None]:
from sklearn.metrics import accuracy_score
target_predicted = model.predict(X_test)

print(
    f"accuracy score on the testing set: "
    f"{accuracy_score(y_test, target_predicted):.3f}"
)

accuracy score on the testing set: 0.865


In [None]:
from pprint import pprint

# Look at parameters used by our current forest
print("Parameters currently in use:\n")
pprint(model.get_params())

Parameters currently in use:

{'memory': None,
 'preprocessor': ColumnTransformer(transformers=[('one-hot-encoder',
                                 OneHotEncoder(handle_unknown='ignore'),
                                 ['A13', 'A12', 'A10', 'A9', 'A7', 'A6', 'A5',
                                  'A4', 'A1']),
                                ('standard_scaler', StandardScaler(),
                                 ['A15', 'A14', 'A11', 'A8', 'A3', 'A2'])]),
 'preprocessor__n_jobs': None,
 'preprocessor__one-hot-encoder': OneHotEncoder(handle_unknown='ignore'),
 'preprocessor__one-hot-encoder__categories': 'auto',
 'preprocessor__one-hot-encoder__drop': None,
 'preprocessor__one-hot-encoder__dtype': <class 'numpy.float64'>,
 'preprocessor__one-hot-encoder__handle_unknown': 'ignore',
 'preprocessor__one-hot-encoder__max_categories': None,
 'preprocessor__one-hot-encoder__min_frequency': None,
 'preprocessor__one-hot-encoder__sparse': 'deprecated',
 'preprocessor__one-hot-encoder__sparse

In [None]:
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=5, stop=50, num=10)]
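# Number of features to consider at every split
# ("auto" is deprecated in newer scikit-learn releases; for classifiers it behaves like "sqrt")
max_features = ["auto", "sqrt"]
# Maximum number of levels in tree
# (truncating 21 evenly spaced values to int produces some repeated depths)
max_depth = [int(x) for x in np.linspace(5, 20, num=21)]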
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 7]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {
    "regressor__n_estimators": n_estimators,
    "regressor__max_features": max_features,
    "regressor__max_depth": max_depth,
    "regressor__min_samples_split": min_samples_split,
    "regressor__min_samples_leaf": min_samples_leaf,
    "regressor__bootstrap": bootstrap,
}
pprint(random_grid)

{'regressor__bootstrap': [True, False],
 'regressor__max_depth': [5,
                          5,
                          6,
                          7,
                          8,
                          8,
                          9,
                          10,
                          11,
                          11,
                          12,
                          13,
                          14,
                          14,
                          15,
                          16,
                          17,
                          17,
                          18,
                          19,
                          20,
                          None],
 'regressor__max_features': ['auto', 'sqrt'],
 'regressor__min_samples_leaf': [1, 2, 4],
 'regressor__min_samples_split': [2, 5, 7],
 'regressor__n_estimators': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]}
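
The grid printed above is large, and a randomized search only tries a handful of its combinations. The sketch below, in plain Python, counts the combinations and draws ten random settings. The `regressor__` prefixes are dropped for readability, and `grid_size` and `sample_candidates` are hypothetical helpers, not scikit-learn functions. It is a simplification: scikit-learn's own sampler differs in detail, for example it avoids repeating a combination when every parameter is given as a list.

```python
import random

# Candidate values copied from the printed random grid
# (the repeated depths are kept exactly as printed).
random_grid = {
    "n_estimators": [5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
    "max_features": ["auto", "sqrt"],
    "max_depth": [5, 5, 6, 7, 8, 8, 9, 10, 11, 11, 12, 13, 14, 14,
                  15, 16, 17, 17, 18, 19, 20, None],
    "min_samples_split": [2, 5, 7],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}


def grid_size(grid):
    """Number of combinations an exhaustive search would have to try."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size


def sample_candidates(grid, n_iter, seed=0):
    """Draw n_iter random settings, picking one value per hyperparameter."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_iter)
    ]


total = grid_size(random_grid)
candidates = sample_candidates(random_grid, n_iter=10)
print(f"{total} combinations in the grid, {len(candidates)} sampled")
```

With `n_iter=10`, the search in the next cell therefore explores only a very small share of the listed combinations.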


In [None]:
model_random_search = RandomizedSearchCV(
    model,
    param_distributions=random_grid,
    n_iter=10,
    cv=5,
    verbose=1,
)
model_random_search.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


  self._final_estimator.fit(Xt, y, **fit_params_last_step)
  warn(

In [None]:
model_random_search.best_params_

{'regressor__n_estimators': 50,
 'regressor__min_samples_split': 2,
 'regressor__min_samples_leaf': 2,
 'regressor__max_features': 'auto',
 'regressor__max_depth': 17,
 'regressor__bootstrap': True}

In [None]:
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    accuracy = accuracy_score(test_labels, predictions)
    print("Model Performance")
    print("Accuracy = {:0.2f}.".format(accuracy))

    return accuracy


base_model = model
base_model.fit(X_train, y_train)
base_accuracy = evaluate(base_model, X_test, y_test)

  self._final_estimator.fit(Xt, y, **fit_params_last_step)


Model Performance
Accuracy = 0.86.


In [None]:
best_random = model_random_search.best_estimator_
random_accuracy = evaluate(best_random, X_test, y_test)

Model Performance
Accuracy = 0.86.


In [None]:
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of random search
param_grid = {
    "regressor__bootstrap": [True],
    "regressor__max_depth": [int(x) for x in np.linspace(2, 10, num=6)],
    "regressor__max_features": [2, 3],
    "regressor__min_samples_leaf": [3, 4, 5],
    "regressor__min_samples_split": [8, 10, 12],
    "regressor__n_estimators": [
        int(x) for x in np.linspace(start=10, stop=50, num=11)
    ],
}
# Instantiate the grid search over the full pipeline (preprocessing + random forest)
grid_search = GridSearchCV(
    estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2
)

In [None]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)
grid_search.best_params_

Fitting 3 folds for each of 1188 candidates, totalling 3564 fits


  self._final_estimator.fit(Xt, y, **fit_params_last_step)


{'regressor__bootstrap': True,
 'regressor__max_depth': 10,
 'regressor__max_features': 3,
 'regressor__min_samples_leaf': 3,
 'regressor__min_samples_split': 8,
 'regressor__n_estimators': 30}
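
The log line "Fitting 3 folds for each of 1188 candidates, totalling 3564 fits" follows directly from the size of the grid. As a quick check, the sketch below enumerates the grid in plain Python. The lists are the values that the `np.linspace` comprehensions evaluate to, written out by hand, and the `regressor__` prefixes are dropped for readability.

```python
from itertools import product

# Values the param_grid comprehensions evaluate to, written out explicitly
param_grid = {
    "bootstrap": [True],
    "max_depth": [2, 3, 5, 6, 8, 10],
    "max_features": [2, 3],
    "min_samples_leaf": [3, 4, 5],
    "min_samples_split": [8, 10, 12],
    "n_estimators": [10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50],
}

# An exhaustive grid search tries every combination of the listed values
candidates = list(product(*param_grid.values()))
n_folds = 3
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

Each extra value in any list multiplies the number of fits, which is why the grid is kept narrow around the region suggested by the random search.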

In [None]:
best_grid = grid_search.best_estimator_
grid_accuracy = evaluate(best_grid, X_test, y_test)

Model Performance
Accuracy = 0.87.


## **Hyperparameter Optimization**

Loading the necessary libraries and data

In [None]:
# loading the sonar dataset
from pandas import read_csv

# grid search logistic regression model on the sonar dataset
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV


# random search logistic regression model on the sonar dataset
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)

# define model
model = LogisticRegression()

# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

(208, 60) (208,)


1. Performing a Randomized Search Optimization

In [None]:
dataframe = read_csv(url, header=None)

# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# define search space
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = loguniform(1e-5, 100)

# define search
search = RandomizedSearchCV(model, space, n_iter=600, scoring='accuracy', n_jobs=-1, cv=cv, random_state=1)

# execute search
result = search.fit(X, y)

# summarize result
print('Best Score: %s' % result.best_score_)


Best Score: 0.7897619047619049


8490 fits failed out of a total of 18000.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1680 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 54, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got elasticnet penalty.

-----------------------

In [None]:
print('Best Hyperparameters: %s' % result.best_params_)

Best Hyperparameters: {'C': 4.878363034905761, 'penalty': 'l2', 'solver': 'newton-cg'}


The randomized search reaches a best cross-validated accuracy of about 79%.
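
The search space draws `C` from `loguniform(1e-5, 100)`, a distribution whose logarithm is uniform. As a rough illustration of what that means, the plain-Python sketch below draws values the same way and counts how many fall in each decade. `sample_loguniform` is a hypothetical helper written for this example, not the SciPy implementation.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(1)
samples = [sample_loguniform(1e-5, 100, rng) for _ in range(7000)]

# Count the draws in each decade [10^k, 10^(k+1)) for k = -5, ..., 1
decades = {k: 0 for k in range(-5, 2)}
for c in samples:
    k = min(max(math.floor(math.log10(c)), -5), 1)  # clamp rounding at the edges
    decades[k] += 1

for k, count in decades.items():
    print(f"1e{k} to 1e{k + 1}: {count}")
```

Each decade receives roughly the same number of draws, so small values such as 1e-4 are explored as often as values near 10. A plain uniform draw over the same range would almost never produce them.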

2. Performing a Grid Search Optimization

In [None]:
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# define search space
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]

# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)

# execute search
result = search.fit(X, y)

# summarize result
print('Best Score: %s' % result.best_score_)


Best Score: 0.7828571428571429


1440 fits failed out of a total of 2880.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
240 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 71, in _check_solver
    raise ValueError("penalty='none' is not supported for the liblinear solver")
ValueError: penalty='none' is not supported for the libl

In [None]:
print('Best Hyperparameters: %s' % result.best_params_)

Best Hyperparameters: {'C': 1, 'penalty': 'l2', 'solver': 'newton-cg'}


The grid search reaches a best score of about 78%, slightly below the roughly 79% from before, so random search performs marginally better here.
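
The failure messages in both searches come from solver and penalty combinations that `LogisticRegression` rejects. One way to avoid them is to pass the search a list of dictionaries in which every combination is valid. The sketch below counts fits for the original space and for such a restricted space. The grouping reflects the solver constraints as I understand them for the scikit-learn version used here, and `n_candidates` is a hypothetical helper. Newer releases spell the `'none'` penalty as `None`.

```python
from math import prod

C_values = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]

# Search space as defined in the notebook: every solver paired with every penalty
full_space = {
    "solver": ["newton-cg", "lbfgs", "liblinear"],
    "penalty": ["none", "l1", "l2", "elasticnet"],
    "C": C_values,
}

# Restricted space: each dictionary only contains compatible pairs
valid_space = [
    {"solver": ["newton-cg", "lbfgs"], "penalty": ["none", "l2"], "C": C_values},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": C_values},
]


def n_candidates(space):
    """Count the parameter combinations in a dict or a list of dicts."""
    grids = space if isinstance(space, list) else [space]
    return sum(prod(len(values) for values in grid.values()) for grid in grids)


folds = 10 * 3  # n_splits * n_repeats of RepeatedStratifiedKFold
full_fits = n_candidates(full_space) * folds
valid_fits = n_candidates(valid_space) * folds
print(full_fits, valid_fits)
```

The difference between the two counts matches the 1440 failed fits reported by the grid search. Passing `valid_space` as the parameter grid of `GridSearchCV` should therefore run only fits that can succeed.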

# How can models be used together?

### Technical

To answer the above question, we can turn to ensemble learning. Ensemble learning involves combining multiple base models to improve the overall predictive performance. We have so far assessed the effectiveness of Decision Trees, PCA, and Elastic Net in dealing with machine learning problems and pointed out a few advantages and areas where they fall short. With any model, the aim is to reduce the error rate without transforming the input data to the point where the original observations are unrecognizable. Combining models is one way to improve the overall outcome of the modeling process. A large body of literature shows that ensembles often yield more accurate results than single models. Models can be combined in a few ways, and we list some of them here:

**Voting:**
Voting is a straightforward ensemble method where the predictions of multiple models are combined to reach a final decision. This often involves a majority vote, where the prediction with the most votes becomes the final prediction. Alternatively, models can be assigned weights based on their performance, giving more influence to the more accurate ones. It's like a democratic process for machine learning, where each model gets a vote, and the majority's decision prevails.
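
As a minimal sketch of the two voting schemes just described, the plain-Python functions below combine class predictions from three hypothetical models for a single observation. The function names and the weights are illustrative only.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight behind it."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical predictions of three models for one observation
votes = [1, 0, 1]
hard_result = majority_vote(votes)

# Hypothetical weights, e.g. derived from validation accuracy
weights = [0.2, 0.7, 0.1]
weighted_result = weighted_vote(votes, weights)

print(hard_result, weighted_result)
```

Here the simple majority picks class 1, while the weighted vote picks class 0 because the single model voting for 0 carries most of the weight.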

**Averaging:**
Averaging is akin to voting but tailored for regression problems. Instead of choosing a single prediction, it takes the average of predictions from multiple models. This approach is particularly useful when you want to predict numerical values. You can use a simple mean or assign weights to models based on their reliability or relevance to the problem. Averaging strikes a balance among various models' opinions to provide a consolidated and often more accurate prediction.

**Stacking:**
Stacking is a more advanced ensemble method that capitalizes on the strengths of multiple models. It begins by training several base models on the data. Then, a meta-model is introduced to learn how to optimally combine the predictions of these base models. This hierarchical approach allows for a sophisticated blending of different models' insights, ultimately leading to more accurate and robust predictions.

**Boosting:**
Boosting is an iterative ensemble technique that focuses on improving the performance of weak models. It does so by assigning more weight to instances that are frequently misclassified by previous models. The final prediction is a weighted combination of these weak models, with the weights adapted during the boosting process. It's like teamwork, where each model corrects the weaknesses of its predecessors, leading to increasingly accurate predictions.
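
To make the reweighting concrete, the sketch below performs one update round of discrete AdaBoost, one common boosting algorithm, in plain Python. It assumes labels coded as -1 and +1, sample weights that sum to 1, and a weak learner whose weighted error lies strictly between 0 and 0.5. `adaboost_round` is a hypothetical helper for illustration.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost update: returns the learner weight and new sample weights."""
    # Weighted error of the weak learner (weights are assumed to sum to 1)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Say of the weak learner in the final vote
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones down
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Hypothetical round: five samples with equal weight, the fourth is misclassified
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, 1, 1]

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.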

**Bagging:**
Bagging, short for Bootstrap Aggregating, involves training multiple base models independently on random subsets of the data. The final prediction is often made by averaging or using majority voting over individual model predictions. It's like conducting several mini-experiments on different parts of the dataset and then combining their results. This approach is effective at reducing the impact of outliers and variability in the data.
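
The bootstrap step can be sketched in a few lines of plain Python. The example below draws bootstrap samples from a hypothetical dataset, treats the mean of each sample as a stand-in "model", and averages those models. The data and the choice of 25 samples are assumptions made for illustration.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Sample len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


rng = random.Random(432)
data = list(range(1000))  # hypothetical observations

samples = [bootstrap_sample(data, rng) for _ in range(25)]

# Share of distinct original observations that appear in each bootstrap sample
unique_fraction = mean(len(set(s)) / len(data) for s in samples)

# Each "model" is just the mean of its sample; bagging averages the models
bagged_prediction = mean(mean(s) for s in samples)

print(round(unique_fraction, 3), round(bagged_prediction, 1))
```

Each bootstrap sample contains only around 63% of the distinct observations, which is what makes the base models differ from one another. Averaging them smooths out that sample-to-sample variation.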

**Random Subspace Method:**
The Random Subspace Method is a technique designed for high-dimensional data. Here, subsets of features are randomly selected for each base model, which is then trained on these feature subsets. The final prediction combines the outputs of the individual models. This method helps to combat overfitting by reducing the complexity of each model, ensuring that no single model is overwhelmed by the dimensionality of the data.
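
A minimal sketch of the feature-sampling step is shown below in plain Python. The feature names are the predictors used later in this notebook, and `random_subspaces` is a hypothetical helper. Each returned subset would be used to train one base model.

```python
import random


def random_subspaces(features, n_models, subset_size, seed=0):
    """Pick a random subset of features (without replacement) for each base model."""
    rng = random.Random(seed)
    return [sorted(rng.sample(features, subset_size)) for _ in range(n_models)]


features = ["ATR", "RSI", "ClgtEMA10", "EMA10gtEMA30", "MACDSIGgtMACD"]
subsets = random_subspaces(features, n_models=4, subset_size=3)

for i, subset in enumerate(subsets):
    print(f"model {i}: {subset}")
```

In scikit-learn, a similar effect can be obtained with `BaggingClassifier` by setting `max_features` below 1.0 and `bootstrap=False`, so that only the features are subsampled.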

**Blending:**
Blending is similar to stacking in that it leverages the power of multiple base models, but the meta-model is trained differently. The base models are trained on the training set, and their predictions on a separate hold-out (validation) set are used to train a meta-model. The fitted meta-model then combines the base models' predictions to make the final prediction on the test data. Blending allows for a strategic combination of model outputs, offering a versatile approach to ensemble learning.

The choice of which method to use will depend on the nature of the problem. We further make the distinction between homogeneous and heterogeneous ensemble methods, which classifies ensembles by how similar or dissimilar their base models are.

In homogeneous ensemble methods, the base models are of the same type or built using the same learning algorithm. They have the same structure and make predictions using the same type of model. The diversity in these ensembles is introduced by training the base models on different subsets of the data or using different random seeds. Homogeneous ensembles tend to be simpler to implement and understand because all base models are of the same type. They are often used when the goal is to reduce overfitting, increase robustness, or improve accuracy by averaging or voting over multiple similar models.

Heterogeneous ensemble methods, on the other hand, incorporate diverse types of base models. These models can come from different learning algorithms, have different structures, or use different features. The diversity introduced by using different models can lead to improved overall performance. Stacking is an example: different types of models are trained and their predictions are then combined by a meta-model. Heterogeneous ensembles are often more complex to build and fine-tune because they involve multiple types of models.

Application: What is the effect of combining models? In GWP1, we saw classification trees applied to 20 years of daily E-mini S&P 500 data from Quandl, with daily returns calculated from the settle price, which served as the closing price. Common technical analysis indicators for trend were used to generate trading signals. Since Quandl has data limits, we use the same instrument here with Yahoo Finance (yfinance) as the source, and the results of the model are presented below.


In [None]:

import pandas as pd
import yfinance as yf

# Define the ticker symbol for E-mini S&P 500 futures (example: ES=F for the continuous front-month contract)
ticker_symbol = "ES=F"

# Define the start and end dates for the data you want to fetch
start_date = "2000-01-01"
end_date = "2020-12-31"

# Use yfinance to fetch the data
data = yf.download(ticker_symbol, start=start_date, end=end_date)
df = pd.DataFrame(data)
# The 'data' DataFrame now contains the E-mini S&P 500 futures data


[*********************100%%**********************]  1 of 1 completed


In [None]:
# 20 years of daily E-mini S&P 500 data from yfinance
# 'Adj Close' is used as the closing (settle) price
# Here we will use common technical analysis indicators for trend to generate trading signals
# 4 indicators, namely EMA, ATR, RSI and MACD

import pandas as pd

# Assumes a DataFrame 'df' with columns 'Adj Close', 'High', and 'Low'

# Calculate the 10- and 30-day moving averages
# (note: rolling().mean() is a simple moving average; the EMA column names are kept because later cells use them)
df['EMA10'] = df['Adj Close'].rolling(window=10).mean()
df['EMA30'] = df['Adj Close'].rolling(window=30).mean()

# Calculate ATR as the 14-day mean of a range built from the previous day's
# high, low and adjusted close
df['TR'] = df[['High', 'Adj Close']].shift(1).max(axis=1) - df[['Low', 'Adj Close']].shift(1).min(axis=1)
df['ATR'] = df['TR'].rolling(window=14).mean()



# Calculate RSI
def calculate_rsi(close, period):
    delta = close.diff()
    gain = delta.where(delta > 0, 0)
    loss = -delta.where(delta < 0, 0)
    avg_gain = gain.rolling(window=period).mean()
    avg_loss = loss.rolling(window=period).mean()
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

df['RSI'] = calculate_rsi(df['Adj Close'], period=14)

# Calculate MACD
short_window = 12
long_window = 26
signal_window = 9
exp_short = df['Adj Close'].ewm(span=short_window, adjust=False).mean()
exp_long = df['Adj Close'].ewm(span=long_window, adjust=False).mean()
macd = exp_short - exp_long
signal = macd.ewm(span=signal_window, adjust=False).mean()

df['MACD'] = macd
df['MACDsignal'] = signal

df = df.drop(['TR'], axis=1)  # Drop the temporary TR column used for ATR calculation
df.dropna(inplace=True)  # Remove rows with NaN values


In [None]:
import numpy as np

# these +1/-1 columns encode the moving-average and MACD signals used as predictors
df['ClgtEMA10'] = np.where(df['Adj Close'] > df['EMA10'], 1, -1)
df['EMA10gtEMA30'] = np.where(df['EMA10'] > df['EMA30'], 1, -1)
df['MACDSIGgtMACD'] = np.where(df['MACDsignal'] > df['MACD'], 1, -1)

In [None]:

# What we have now are possible trading rules that we will introduce in the
# decision tree to help us identify the best combination of these indicators to maximize the result.

# EMA, we are interested in when the price is above average and when the fastest average is above the slowest average.
# ATR(14), we’re interested in the threshold that will trigger the signal.
# RSI(14), we’re interested in the threshold that will trigger the signal.
# MACD, we are interested in when the MACD signal is above MACD.

df['Return'] = df['Adj Close'].pct_change(1).shift(-1)
df['target_cls'] = np.where(df.Return > 0, 1, 0)
# the target classification will be 1 if the next-day return is positive and 0 otherwise


In [None]:
# predictors used as model inputs
predictors_list = ['ATR','RSI', 'ClgtEMA10', 'EMA10gtEMA30', 'MACDSIGgtMACD']
X = df[predictors_list]


In [None]:
# we define the target classifications for each data point to use in our training
y_cls = df.target_cls

In [None]:
# Splitting the data into test and training data
# test_size=0.7 holds out 70% for testing, so 30% is used for training
from sklearn.model_selection import train_test_split
y=y_cls
X_cls_train, X_cls_test, y_cls_train, y_cls_test = train_test_split(X, y, test_size=0.7, random_state=432, stratify=y)


In [None]:
# importing the decision tree classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_leaf=6)


In [None]:
# fit the model; after this call the tree is fully trained
clf = clf.fit(X_cls_train, y_cls_train)


In [None]:
# using the model to make a forecast
y_cls_pred = clf.predict(X_cls_test)


In [None]:
from sklearn.metrics import classification_report
report = classification_report(y_cls_test, y_cls_pred)
print(report)

              precision    recall  f1-score   support

           0       0.41      0.01      0.03      1638
           1       0.54      0.98      0.70      1931

    accuracy                           0.54      3569
   macro avg       0.47      0.50      0.36      3569
weighted avg       0.48      0.54      0.39      3569
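
A useful reference point for this report is the accuracy of a naive rule that always predicts the more frequent class. The sketch below computes it from the `support` column of the report above, using plain Python arithmetic.

```python
# Class counts taken from the support column of the classification report
support = {0: 1638, 1: 1931}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy obtained by predicting the majority class for every observation
baseline_accuracy = support[majority_class] / total

print(f"always predicting {majority_class}: accuracy {baseline_accuracy:.3f}")
```

The baseline comes out at about 0.541, essentially the same as the tree's reported accuracy of 0.54. This is consistent with the very low recall for class 0: the tree predicts class 1 almost every time.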



The results are interpreted as follows:

- Precision measures how reliable the positive predictions are: of all the cases where the model predicts 1 (bullish signal), it is the fraction where the outcome is actually 1. Mathematically, precision is defined as:

$$ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} $$

  - True Positives (TP): The number of positive instances correctly predicted as positive by the model.
  - False Positives (FP): The number of negative instances incorrectly predicted as positive by the model.
  - False Negatives (FN): The number of positive instances incorrectly predicted as negative by the model.

- Recall is a more telling measure of robustness. Also known as sensitivity or the true positive rate, it measures a model's ability to correctly identify positive instances out of all actual positive instances. It is a crucial metric, especially when dealing with imbalanced datasets or when the cost of missing positive cases is high. In essence, recall quantifies the model's ability to avoid false negatives. High recall indicates that the model captures most of the actual positive instances, while low recall suggests that the model is missing a significant portion of them. Mathematically, recall is defined as:

$$ \text{Recall (Sensitivity)} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} $$

- F1 Score: This is the harmonic mean of precision and recall, so it takes into account both false positives and false negatives. It is suitable when you want to balance the trade-off between minimizing false positives (precision) and minimizing false negatives (recall). A higher F1 score indicates that the model achieves a better balance between the two. The F1 score is calculated using the following formula:

$$ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
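
The three formulas can be checked with a few lines of arithmetic. The sketch below evaluates them for hypothetical confusion counts, which are not taken from this notebook's model, and then recomputes the F1 score of class 1 from the rounded precision and recall shown in the classification report.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(precision, 3), round(recall, 3), round(f1, 3))

# F1 for class 1 from the rounded precision (0.54) and recall (0.98) in the report
f1_class1 = 2 * 0.54 * 0.98 / (0.54 + 0.98)
print(round(f1_class1, 2))
```

The second result agrees with the 0.70 reported for class 1. Because F1 is a harmonic mean, it sits closer to the lower of the two inputs than a simple average would.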

     


A model that does better than the one presented above would need better precision, since that is in essence what we want to see: a model that makes good predictions most of the time. A model with better recall and precision on the test set should also perform better on out-of-sample data, and we have seen throughout the course that a core aim of machine learning is building models that perform well on unseen data.


To test the effect of ensemble learning, we apply bagging, boosting and stacking to the data set and compare how those models fare against each other and, most importantly, whether they actually improve on the model produced by a simple classification tree.

      

In [None]:
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Bagging with Decision Tree
# (the `estimator` argument was called `base_estimator` before scikit-learn 1.2)
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(criterion='gini', max_depth=15, min_samples_leaf=16),
                               n_estimators=100, random_state=432)
bagging_clf.fit(X_cls_train, y_cls_train)
bagging_preds = bagging_clf.predict(X_cls_test)
bagging_accuracy = accuracy_score(y_cls_test, bagging_preds)

# AdaBoost with Decision Tree
# (the `estimator` argument was called `base_estimator` before scikit-learn 1.2)
adaboost_clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(criterion='gini', max_depth=15, min_samples_leaf=16),
                                 n_estimators=100, random_state=432)
adaboost_clf.fit(X_cls_train, y_cls_train)
adaboost_preds = adaboost_clf.predict(X_cls_test)
adaboost_accuracy = accuracy_score(y_cls_test, adaboost_preds)

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

# Create a Stacking Classifier with Decision Tree, Bagging, and AdaBoost as base models
base_models = [
    ('Decision Tree', clf),
    ('Bagging', bagging_clf),
    ('AdaBoost', adaboost_clf)
]

# Use a logistic regression meta-estimator for stacking
stacking_model = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression())
stacking_model.fit(X_cls_train, y_cls_train)
stacking_preds = stacking_model.predict(X_cls_test)

# Calculate accuracy for the stacking model
stacking_accuracy = accuracy_score(y_cls_test, stacking_preds)

# Compare results: recall and F1-score for the first class listed in each report
def first_class_scores(y_true, y_pred):
    # output_dict=True avoids parsing the printed report by token position
    report = classification_report(y_true, y_pred, output_dict=True)
    first_class = next(iter(report))  # class labels come first, in sorted order
    return [round(report[first_class][metric], 2) for metric in ('recall', 'f1-score')]

results_table = pd.DataFrame({
    'Decision Tree': first_class_scores(y_cls_test, y_cls_pred),
    'Bagging': first_class_scores(y_cls_test, bagging_preds),
    'AdaBoost': first_class_scores(y_cls_test, adaboost_preds),
    'Stacking': first_class_scores(y_cls_test, stacking_preds)
}, index=['recall', 'f1-score'])

print("Classification Report Comparison:")
print(results_table)


# Round the stacking accuracy computed above to 3 decimal places
stacking_accuracy = round(stacking_accuracy, 3)

# Calculate and round accuracy for other models
decision_tree_accuracy = round(accuracy_score(y_cls_test, y_cls_pred), 3)
bagging_accuracy = round(bagging_accuracy, 3)
adaboost_accuracy = round(adaboost_accuracy, 3)

# Print the accuracy of each model
print("Accuracy of Decision Tree:", decision_tree_accuracy)
print("Accuracy of Bagging:", bagging_accuracy)
print("Accuracy of AdaBoost:", adaboost_accuracy)
print("Accuracy of Stacking:", stacking_accuracy)




Classification Report Comparison:
         Decision Tree Bagging AdaBoost Stacking
recall            0.01    0.34     0.46     0.01
f1-score          0.03    0.39     0.46     0.02
Accuracy of Decision Tree: 0.538
Accuracy of Bagging: 0.528
Accuracy of AdaBoost: 0.505
Accuracy of Stacking: 0.541


The verdict is that:
- Stacking results in a slightly better accuracy score than the single decision tree (0.541 vs 0.538) but a lower F1-score (0.02 vs 0.03)
- Both bagging and boosting result in lower accuracy scores (0.528 and 0.505) but much better recall and F1-scores, which suggests they would be better than the decision tree at identifying this class on out-of-sample data

Overall, ensemble learning produces technically better results on recall and F1-score here and helps improve aspects of simple models.
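Because the comparison above rests on a single train/test split, the ranking of the models may shift with a different split. One way to check this is to average each metric over several cross-validation folds. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the credit approval data, shallow trees and small ensembles chosen for speed rather than the tuned settings above, and the names `X_demo`, `y_demo`, `make_tree` and `cv_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 300 samples with a binary 0/1 target
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, random_state=0)

def make_tree():
    # Shallow tree chosen for speed; not the tuned tree from the notebook
    return DecisionTreeClassifier(max_depth=3, random_state=0)

# Base estimators are passed positionally so the sketch does not depend on
# whether the keyword is called `base_estimator` or `estimator`
models = {
    'Decision Tree': make_tree(),
    'Bagging': BaggingClassifier(make_tree(), n_estimators=20, random_state=0),
    'AdaBoost': AdaBoostClassifier(make_tree(), n_estimators=20, random_state=0),
}
# Stack the three models above under a logistic regression meta-estimator
models['Stacking'] = StackingClassifier(
    estimators=list(models.items()),
    final_estimator=LogisticRegression(max_iter=1000),
)

metrics = ['accuracy', 'recall', 'f1']
cv_scores = {}
for name, model in models.items():
    # cross_validate clones the model and refits it on each of the 3 folds
    scores = cross_validate(model, X_demo, y_demo, cv=3, scoring=metrics)
    cv_scores[name] = {m: scores[f'test_{m}'].mean() for m in metrics}

for name, row in cv_scores.items():
    print(name, {m: round(v, 3) for m, v in row.items()})
```

Averaging over folds does not change what each metric measures; it only makes the comparison less dependent on which rows happened to land in the test set.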

### Non-Technical

Ensemble learning is akin to using the collective wisdom of a group of experts, each with their unique insights and expertise. Just as we seek advice from various individuals when making significant decisions, ensemble learning combines the strengths of different machine learning models to make more accurate predictions. By doing so, ensemble methods often outperform individual models, providing a solid foundation for informed financial decision-making.

Enhanced Robustness:

In the dynamic world of finance, where uncertainty and market volatility are constant companions, robust predictions are paramount. Ensemble methods bolster the robustness of predictive models by mitigating the risks associated with relying on a single model. This is achieved by introducing diversity among models, which means that the potential biases and limitations of any single model are counterbalanced by the collective intelligence of the ensemble. The result is a more resilient approach to financial strategy, capable of weathering unexpected shifts in the market.

Real-World Examples:

The real-world impact of ensemble methods in the realm of finance is profound and far-reaching. They find applications in diverse areas, such as portfolio optimization, fraud detection, and credit scoring. In portfolio optimization, ensemble techniques help in making well-informed investment decisions by combining the forecasts of multiple models. For fraud detection, the ability to identify subtle patterns indicative of fraudulent activity is significantly enhanced. When it comes to credit scoring, ensemble models improve the accuracy of determining creditworthiness, thereby facilitating responsible lending and risk management practices. These practical examples showcase the versatility and efficacy of ensemble methods, making them invaluable tools in the financial industry. In the technical section, we demonstrated how this applies to real-world data.

Reduced Overfitting:

Overfitting, a common pitfall in financial modeling, occurs when a model fits the training data so closely that it fails to generalize effectively to new, unseen data. Ensemble methods act as a safeguard against this issue. By amalgamating the insights of multiple models, the risk of overfitting is reduced. Each model contributes its unique perspective, and through the ensemble, the collective intelligence ensures that predictions remain accurate and reliable across various scenarios. This mitigation of overfitting is especially critical in financial decision-making, where the stakes are high and the consequences of poor predictions can be significant.

Optimizing Profitability:

In the finance sector, profitability is the ultimate goal. Ensemble methods play a pivotal role in achieving this objective. By improving the accuracy and stability of predictions, ensemble techniques contribute to more profitable investment strategies and enhanced risk management. The combination of diverse models, each excelling in different facets of financial analysis, results in more well-informed decisions. This, in turn, translates to better returns on investments and a reduction in the potential for financial losses. As such, ensemble methods become valuable allies in the pursuit of financial success.

Overall, ensemble learning stands as a fundamental and evolving tool in the realm of financial analytics. Its power lies in the synergy it creates by amalgamating the wisdom of multiple models. The significance of ensemble methods in the financial industry is underscored by their ability to enhance robustness, mitigate overfitting, and optimize profitability.
