## Logistic Regression
Logistic regression is one of the most widely used classification algorithms; it maps quantitative data onto categorical variables. Unlike linear regression, where y itself is the outcome variable, we model a function of y called the logit.

The logit can be modelled as a linear function of the predictors

$$\operatorname{logit}(p) = \log(\text{odds}) = \log\left(\frac{p}{1-p}\right) = w_{0}+w_{1}x_{1}+w_{2}x_{2}+w_{3}x_{3}+\dots+w_{q}x_{q}$$

and can be mapped back to a probability which in turn can be mapped to a class.
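The mapping from logit to probability to class can be sketched in a few lines. This is a minimal illustration, not the scikit-learn implementation: the weights `w0` and `w`, the observation `x`, and the 0.5 threshold are hypothetical values chosen for the example.

```python
import numpy as np

def logit_to_probability(logit):
    """Inverse of the logit (the sigmoid): p = 1 / (1 + exp(-logit))."""
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical weights and a single observation, for illustration only
w0 = -0.5
w = np.array([0.8, -0.3, 1.2])
x = np.array([1.0, 0.0, 1.0])

logit = w0 + w @ x                    # linear predictor (log-odds)
prob = logit_to_probability(logit)    # probability of the positive class
label = int(prob >= 0.5)              # class label via a 0.5 threshold

print(f"logit={logit:.2f}, probability={prob:.4f}, class={label}")
```

Taking the log-odds of `prob` recovers `logit`, which is why the two representations are interchangeable.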

## The workflow
We'll use the `LogisticRegression` classifier from scikit-learn to predict the trend of a stock or equity index.
<p>
<table>
<thead>
<tr>
<th style="text-align:left">Steps</th>
<th style="text-align:left">Workflow</th>
<th style="text-align:left">Remarks</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">Step 1</td>
<td style="text-align:left">Ideation</td>
<td style="text-align:left">Predict the trend of the underlying from the given dataset</td>
</tr>
<tr>
<td style="text-align:left">Step 2</td>
<td style="text-align:left">Data Collection</td>
<td style="text-align:left">Load the dataset using the quantmod library</td>
</tr>
<tr>
<td style="text-align:left">Step 3</td>
<td style="text-align:left">Exploratory Data Analysis</td>
<td style="text-align:left">Study summary statistics</td>
</tr>
<tr>
<td style="text-align:left">Step 4</td>
<td style="text-align:left">Cleaning Dataset</td>
<td style="text-align:left">Data already cleaned, no further imputation required</td>
</tr>
<tr>
<td style="text-align:left">Step 5</td>
<td style="text-align:left">Transformation</td>
<td style="text-align:left">Perform feature scaling based on EDA</td>
</tr>
<tr>
<td style="text-align:left">Step 6</td>
<td style="text-align:left">Modeling</td>
<td style="text-align:left">Build and train a logistic regression classifier</td>
</tr>
<tr>
<td style="text-align:left">Step 7</td>
<td style="text-align:left">Metrics</td>
<td style="text-align:left">Validating the model performance using score method</td>
</tr>
</tbody>
</table>

## Problem Statement
The objective is to predict market movement using a classification algorithm. In this lab, we'll use logistic regression to predict market direction and devise a trading strategy based on it.

In [8]:

# install packages
%pip install quantmod quantstats-lumi

Note: you may need to restart the kernel to use updated packages.


### Import libraries

In [16]:
# Base Libraries
import pandas as pd
import numpy as np

# Quantmod
from quantmod.datasets import fetch_historical_data
from quantmod.timeseries import *
from quantmod.indicators import BBands

# Plotting
import matplotlib.pyplot as plt

# Classifier
from sklearn.linear_model import LogisticRegression

# Preprocessing
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import (
                                    train_test_split, 
                                    TimeSeriesSplit,
                                    GridSearchCV
                                    )

# Metrics
from sklearn.metrics import (
                            accuracy_score,
                            f1_score,
                            log_loss,
                            RocCurveDisplay,
                            ConfusionMatrixDisplay,
                            classification_report
                            )

# Analysis (the quantstats-lumi package is imported as quantstats_lumi)
import quantstats_lumi as qs


### Load Data

In [15]:
# load nifty index data
df = fetch_historical_data("nifty")

# set date as index
df = (
    df
    .assign(date=pd.to_datetime(df['date']))  
    .set_index('date', drop=True)             
)

df.head()


In [None]:
# get info
df.info()

In [None]:
# Visualize data
plt.plot(df['close']);

## EDA of Original dataset

In [None]:
# Descriptive statistics
df.describe().T

## Cleaning & Imputation
Data is already cleaned. No further processing or imputation required.

In [None]:
# Check for missing values and the dataset shape
print(df.isnull().sum())
print(df.shape)

## Feature Engineering
Features, or predictors, are also known as independent variables and are used to determine the value of the target variable. We will generate the features and the label from the original dataset.
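As a rough illustration of how binary features can be derived from price bars, the sketch below uses plain pandas rather than the quantmod helpers. The `bars` data and the column names `higher_high` and `higher_low` are hypothetical, and treating a lag as a one-period `shift(1)` is an assumption about what the library's `lag` does.

```python
import numpy as np
import pandas as pd

# Hypothetical high/low bars, for illustration only
bars = pd.DataFrame({
    "high": [101.0, 103.0, 102.0, 105.0],
    "low":  [99.0, 100.0, 100.5, 101.0],
})

features = pd.DataFrame({
    # 1 when today's high exceeds the previous bar's high, else 0
    "higher_high": np.where(bars["high"] > bars["high"].shift(1), 1, 0),
    # 1 when today's low exceeds the previous bar's low, else 0
    "higher_low": np.where(bars["low"] > bars["low"].shift(1), 1, 0),
}, index=bars.index)

print(features)
```

The first row has no previous bar, so the comparison against `NaN` is false and both features default to 0 there.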

In [None]:
# generate features
# use quantmod timeseries functions
def generate_xy(frame):
    df = frame.copy()
    upper = BBands(df.close, lookback=5, multiplier=2)[2]

    # features - X
    data = pd.DataFrame({
        'x1': np.where(-LoCl(df) > HiCl(df), 1, 0),
        'x2': np.where(Hi(df) > lag(Hi(df)), 1, 0),
        'x3': np.where(Lo(df) > lag(Lo(df)), 1, 0),
        'y': np.where(lead(Cl(df)) > upper, 1, 0)
    }, index=df.index)

    return data

## Feature Specification

In [None]:
# extract features
data = generate_xy(df)
X = data.drop(columns=['y'])
X

## Target or Label Definition
The label, or target variable, is also known as the dependent variable. Here, the target is whether the underlying will close above the upper Bollinger band on the next trading day. If tomorrow's closing price is greater than the upper Bollinger band level, we buy the underlying; otherwise we do nothing (hold).

We assign a value of +1 for the buy signal and 0 for the hold signal to the target variable. The target can be described as:

$$y_t =\begin{cases}
    +1, & \text{if } p_{t+1} > ub_{t}\\
    0, & \text{otherwise}
    \end{cases}$$

where $ub_{t}$ is the upper Bollinger band level and $p_{t+1}$ is the 1-day forward closing price of the underlying.
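The labelling rule can be checked on a toy series. This is a sketch using plain pandas; the `close` and `upper` values are made up for illustration, and `shift(-1)` stands in for the one-day forward price.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices and upper-band levels, for illustration only
close = pd.Series([100.0, 104.0, 103.0, 108.0, 107.0])
upper = pd.Series([102.0, 105.0, 106.0, 109.0, 110.0])

next_close = close.shift(-1)             # p_{t+1}: tomorrow's close, seen from day t
y = np.where(next_close > upper, 1, 0)   # 1 = buy, 0 = hold

print(y)
```

The final observation has no next-day close, so the comparison is false and its label falls back to 0; it may be worth dropping that row before training.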


In [None]:
# get labels
y = data['y']
y

# check class imbalance
pd.Series(y).value_counts()

#### Split data

In [None]:
# Splitting the datasets into training and testing data.
# Always keep shuffle = False for financial time series
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Output the train and test data size
print(f"Train and Test Size {len(X_train)}, {len(X_test)}")

## Base Model

We now build a base model using a `Pipeline`, keeping the default parameters apart from `class_weight='balanced'`. Logistic regression in scikit-learn is regularized by default, so the features should be on a similar scale for the model to work properly. We standardize them with the `StandardScaler` transformer.

#### Fit Model

In [None]:
# Use pipeline to fit the basemodel
basemodel = Pipeline([
    ("scaler", StandardScaler()), 
    ("classifier", LogisticRegression(
        class_weight='balanced')) 
]) 

basemodel.fit(X_train, y_train)

## Predict model

In [None]:
# Predict class labels for the test set (0 = hold, 1 = buy)
y_pred = basemodel.predict(X_test)

# Predict the probability of each class label
y_proba = basemodel.predict_proba(X_test)

In [None]:
# verify the class labels
basemodel.classes_

In [None]:
# predict probability
y_proba[-20:]

In [None]:
# predict class labels
y_pred[-20:]

In [None]:
# accuracy of the model
acc_train = accuracy_score(y_train, basemodel.predict(X_train))
acc_test = accuracy_score(y_test, y_pred)

print(f'Train Accuracy: {acc_train:0.4}, Test Accuracy: {acc_test:0.4}')

## Prediction Quality
### Confusion Matrix
A confusion matrix is a table used to describe the performance of a classification model on a set of test data for which the true values are known.

| Outcome | Position |
|:---------------|:------------|
| True Negative | upper-left |
| False Negative | lower-left |
| False Positive | upper-right |
| True Positive | lower-right |

A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.

A false positive is an outcome where the model incorrectly predicts the positive class, and a false negative is an outcome where the model incorrectly predicts the negative class.

Note: In a binary classification task, the terms "positive" and "negative" refer to the classifier's prediction, and the terms "true" and "false" refer to whether that prediction corresponds to the external judgment (sometimes known as the "observation"). The axes of the matrix can be flipped. Refer to the scikit-learn documentation on binary classification for further details.
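The four outcomes can be read directly from scikit-learn's `confusion_matrix`, whose rows are the actual classes and whose columns are the predicted classes. The labels below are hypothetical and serve only to illustrate the layout.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels, for illustration only
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)   # rows: actual, columns: predicted
tn, fp, fn, tp = cm.ravel()                       # row-major order for the binary case

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```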

In [None]:
# Display confusion matrix
disp = ConfusionMatrixDisplay.from_estimator(
        basemodel,
        X_test,
        y_test,
        # display_labels=model.classes_,
        cmap=plt.cm.Blues
    )
plt.title('Confusion matrix')
plt.show()

### Classification Report
A classification report is used to measure the quality of predictions from a classification algorithm.



In [None]:
# Classification Report
print(classification_report(y_test, y_pred))

### Macro Average
The unweighted mean of the per-class precision (or recall or F1-score), so every class counts equally regardless of its size.

### Weighted Average
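The support-weighted mean: (number of actual Class 1 instances × precision of Class 1 + number of actual Class 2 instances × precision of Class 2) divided by the total number of instances. The same calculation applies to recall and F1-score.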
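The difference between the two averages shows up most clearly on imbalanced data. The sketch below computes both by hand for precision, using hypothetical labels, and can be compared against scikit-learn's `average='macro'` and `average='weighted'` options.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical, deliberately imbalanced labels, for illustration only
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

per_class = precision_score(y_true_demo, y_pred_demo, average=None)  # one value per class
support = np.bincount(y_true_demo)                                   # actual instances per class

macro = per_class.mean()                                   # every class counts equally
weighted = np.sum(per_class * support) / support.sum()     # classes weighted by support

print(f"per-class={per_class.round(4)}, macro={macro:.4f}, weighted={weighted:.4f}")
```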

### Receiver Operating Characteristic (ROC) Curve
The area under the ROC curve (AUC) is a measure of how well a model can distinguish between two classes. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds.
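To make the construction concrete, the sketch below builds the ROC points for four hypothetical scores and integrates them. `roc_curve` returns the (FPR, TPR) pair at each threshold, and `auc` applies the trapezoidal rule to those points.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and predicted scores, for illustration only
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true_demo, scores_demo)
area_from_curve = auc(fpr, tpr)                        # trapezoidal area under the points
area_direct = roc_auc_score(y_true_demo, scores_demo)  # same quantity, computed directly

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC={area_direct:.2f}")
```

An AUC of 0.5 corresponds to the diagonal "random" line, and 1.0 to a perfect ranking of positives above negatives.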

In [None]:
# Display ROCCurve 
disp = RocCurveDisplay.from_estimator(
            basemodel, 
            X_test, 
            y_test,
            name='Baseline Model')
plt.title("AUC-ROC Curve \n")
plt.plot([0,1],[0,1],linestyle="--", label='Random 50:50')
plt.legend()
plt.show()

## Hyper-parameter Tuning
Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. It is possible and recommended to search the hyper-parameter space for the best cross validation score. Any parameter provided when constructing an estimator may be optimized in this manner.

## Cross-validation of Time Series
Time series data are sequential in nature and are characterised by the correlation between observations. Classical cross-validation techniques such as KFold assume the samples are independent and identically distributed, and would result in poor estimates when applied on time series data.

To preserve temporal order and ensure that each training set precedes its test set, we use the forward chaining method. The model is first trained and tested on windows of the same size. For each subsequent fold, the training window grows to include both the previous training data and the previous test data, while the new test window follows it and keeps the same length.

<img align='center' style='vertical-align: middle;' width="25%" src="fowardchaining.png">

We will tune the hyperparameters of the logistic regression model using TimeSeriesSplit from scikit-learn as the cross-validator. This is a forward chaining cross-validation method and a variation of KFold. In the kth split, it returns the first k folds as the train set and the (k+1)th fold as the test set. Unlike standard cross-validation methods, successive training sets are supersets of those that come before them.
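The expanding-window behaviour is easy to inspect on a small array. The sketch below uses 12 hypothetical time-ordered samples and three splits; the names `X_demo` and `tscv_demo` are illustrative, and `gap=1` drops one sample between each training window and its test window.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)            # 12 hypothetical, time-ordered samples
tscv_demo = TimeSeriesSplit(n_splits=3, gap=1)

splits = list(tscv_demo.split(X_demo))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each training window ends before its test window begins, and every later training window contains the earlier ones, which is what prevents look-ahead during tuning.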

In [None]:
# # Example: inspect the splits
# tscv = TimeSeriesSplit(n_splits=4, gap=1)
# for train, test in tscv.split(X):
#     print(train, test)

# Cross-validation
tscv = TimeSeriesSplit(n_splits=5, gap=1)

## GridSearch
The conventional way of performing hyperparameter optimization has been a grid search (aka parameter sweep). It is an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a validation set.

GridSearch performs exhaustive search over specified parameter values for an estimator. It implements a “fit” and a “score” method among other methods. The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
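Because the search is exhaustive, its cost is the number of parameter combinations times the number of cross-validation splits. The sketch below counts them with `ParameterGrid` for a hypothetical grid over two `LogisticRegression` hyperparameters; the values in `grid_demo` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid over two pipeline hyperparameters, for illustration only
grid_demo = {
    "classifier__tol": [0.0001, 0.001, 0.01, 0.1, 1],
    "classifier__C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
}

n_candidates = len(ParameterGrid(grid_demo))   # every combination of tol and C
n_cv_splits = 5                                # e.g. TimeSeriesSplit(n_splits=5)
total_fits = n_candidates * n_cv_splits        # model fits during the search

print(f"{n_candidates} candidates x {n_cv_splits} splits = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best candidate on the full training set.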

In [None]:
# Get parameters list
basemodel.get_params()

In [None]:
pipeline = Pipeline([
    ("scaler", StandardScaler()), 
    ("classifier", LogisticRegression(
        class_weight='balanced')) 
]) 

# Define the parameter grid (1e-4 is the scikit-learn default tolerance)
param_grid = {"classifier__tol": [0.0001, 0.001, 0.01, 0.1, 1],
              "classifier__C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
              }

# Perform Gridsearch
gs = GridSearchCV(pipeline, 
                  param_grid, 
                  scoring='roc_auc', 
                  n_jobs=-1, 
                  cv=tscv, 
                  verbose=1)

gs.fit(X_train, y_train)

In [None]:
# predict using the best model
tunedmodel = gs.best_estimator_

# Predict class labels 
y_pred = tunedmodel.predict(X_test)

# Predict Probabilities
# y_proba = tunedmodel.predict_proba(X_test)[:,1]

# Measure Accuracy
acc_train = accuracy_score(y_train, tunedmodel.predict(X_train))
acc_test = accuracy_score(y_test, y_pred)

# Print Accuracy
print(f'\n Training Accuracy \t: {acc_train :0.4} \n Test Accuracy \t\t: {acc_test :0.4}')

In [None]:
# Display confusion matrix
disp = ConfusionMatrixDisplay.from_estimator(
        tunedmodel,
        X_test,
        y_test,
        # display_labels=tunedmodel.classes_,
        cmap=plt.cm.Blues
    )
plt.title('Confusion matrix')
plt.show()

In [None]:
# Display ROCCurve 
disp = RocCurveDisplay.from_estimator(
            tunedmodel, 
            X_test, 
            y_test,
            name='Tuned Logistic')
plt.title("AUC-ROC Curve \n")
plt.plot([0,1],[0,1],linestyle="--", label='Random 50:50')
plt.legend()
plt.show()

In [None]:
# Classification Report
print(classification_report(y_test, y_pred))

## Trading Strategy
Let's now define a trading strategy. We use the predicted signal to buy (1) or hold (0) the underlying. We then compare the result of this strategy with buy-and-hold and evaluate the performance of the logistic regression model.
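The mechanics of turning signals into returns can be sketched on a toy series. The prices and signals below are hypothetical. The key step is lagging the signal by one bar, so that a prediction made at the close of day t only earns the return realized on day t+1.

```python
import numpy as np
import pandas as pd

# Hypothetical prices and model signals, for illustration only
prices = pd.Series([100.0, 102.0, 101.0, 104.0, 103.0])
signal = pd.Series([1, 0, 1, 1, 0])                 # 1 = buy, 0 = hold

returns = np.log(prices).diff().fillna(0)           # benchmark daily log returns
strategy = returns * signal.shift(1).fillna(0)      # earn the return only when long

# Log returns are additive, so cumulative performance is exp(sum) - 1
cum_benchmark = np.exp(returns.cumsum()) - 1
cum_strategy = np.exp(strategy.cumsum()) - 1

print(pd.DataFrame({"benchmark": cum_benchmark, "strategy": cum_strategy}))
```

Without the `shift(1)`, the strategy would trade on information that is not yet available, which inflates backtest results.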

In [None]:
# Subsume into a new dataframe covering the test period
df1 = df[-len(X_test):].copy()
df1['Signal'] = tunedmodel.predict(X_test)

In [None]:
# Daily Returns - Benchmark return
df1['Returns'] = np.log(df1['close']).diff().fillna(0)

# Strategy Returns - Logistic
df1['Strategy'] =  df1['Returns'] * df1['Signal'].shift(1).fillna(0)

df1

## Return Analysis

In [None]:
# performance analysis
qs.reports.html(
    df1['Strategy'],
    df1['Returns'],
    title='Strategy Performance',
    output='output/logistic.html'
)

## References

<ul>
<li><a href="https://kannansingaravelu.com/docs/site/index.html">Quantmod</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html">TimeSeriesSplit</a></li>
<li><a href="https://scikit-learn.org/stable/modules/cross_validation.html">Cross-validation</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">GridSearchCV</a></li>
<li><a href="https://scikit-learn.org/stable/modules/grid_search.html#grid-search">Hyperparameters Tuning</a></li>
<li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">Logistic Regression</a></li>
</ul>