# Introduction

This notebook aims to build an understanding of how forecasting of financial instruments works for a given data set

We will go through the theory and mathematics behind forecasting in some depth before moving on to code

# Parameters of Forecasting Accuracy
**Hit Rate**

This essentially means the percentage of correct predictions : just jargon for accuracy ; no big deal

**Confusion Matrix**

In case of financial instruments, the price may move up or down and our predictions for the same may be correct or incorrect

It is obvious then that there arise 4 possible cases in the testing phase : let us generalize the *up* and *down* to positives and negatives respectively

Therefore we have true positives, true negatives, false positives, and false negatives : the last two make up Type 1 Error and Type 2 Error in jargon respectively

A 2 x 2 matrix represents these values

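Rows correspond to the actual direction and columns to the predicted direction, which is the layout scikit-learn's `confusion_matrix` uses ; with the labels ordered *down* then *up*, the first row contains the true negative and false positive counts , the second row the false negative and true positive counts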
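To make the two measures concrete, here is a minimal sketch in plain Python. The eight daily directions are made up for illustration, and the helper names `confusion_counts` and `hit_rate` are hypothetical ; the row and column layout assumed is the one `sklearn.metrics.confusion_matrix` produces for the labels -1 and +1

```python
def confusion_counts(actual, predicted):
    """Return [[TN, FP], [FN, TP]] for labels -1 (down) and +1 (up)."""
    labels = [-1, 1]
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        # Row = actual direction, column = predicted direction
        matrix[labels.index(a)][labels.index(p)] += 1
    return matrix


def hit_rate(matrix):
    """Fraction of correct predictions: the diagonal over the total."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up directions for eight days, purely for illustration
actual = [1, 1, -1, 1, -1, -1, 1, -1]
predicted = [1, -1, -1, 1, 1, -1, 1, 1]

cm = confusion_counts(actual, predicted)
print(cm)            # [[2, 2], [1, 3]]
print(hit_rate(cm))  # 0.625
```

Here 5 of the 8 predictions sit on the diagonal, which is where the hit rate of 62.5% comes from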

# Choosing Appropriate Factors

Just fitting in some algorithm does not help : we need to find the appropriate factors that help us identify key trends and achieve better results

**Lagged Price Factors**

The prior historical values of the time series are extremely useful in forecasting the time series

A set of p factors can be obtained by creating p lags of the time series close price

We may simply consider a current day x ; our factors would be the historical daily closing values of the financial instrument at days x-1, x-2, and so on till x-p

**Traded Volume**

Traded volume indicates momentum or interest within the financial instrument ; if the volume is low, there is less interest and therefore not much earning potential

Opposite goes for a high volume


Combining these two, we can create a p+1 dimensional feature vector for each day of the time series , which incorporates the p time lags and the traded volume on that day

We therefore have the feature vector for each day , and the direction of the closing price move for each day as the label

Thence we can commence a supervised classification exercise
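A rough sketch of how such a feature matrix could be assembled with pandas follows. The prices and volumes are invented, and `make_features` is a hypothetical helper, not part of any library ; it only illustrates the p lags plus volume layout described above

```python
import numpy as np
import pandas as pd


def make_features(close, volume, p=2):
    """Build p lagged closes plus volume as features, and up/down labels."""
    features = pd.DataFrame(index=close.index)
    for k in range(1, p + 1):
        features[f"Lag{k}"] = close.shift(k)  # close k days earlier
    features["Volume"] = volume
    # Label: +1 if the close rose versus the previous day, -1 otherwise
    direction = np.sign(close.diff()).replace(0, -1)
    # Keep only the days for which all p lags are defined
    valid = features.dropna().index
    return features.loc[valid], direction.loc[valid]


# Hypothetical prices and volumes for six trading days
dates = pd.date_range("2023-01-02", periods=6, freq="B")
close = pd.Series([100.0, 101.0, 99.5, 100.5, 102.0, 101.0], index=dates)
volume = pd.Series([1200, 1500, 900, 1100, 1700, 1300], index=dates)

X, y = make_features(close, volume, p=2)
print(X)
print(y)
```

With p = 2 the first two days are dropped, leaving four rows with three features each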


**External Factors**

There are factors beyond the market which may cause market movements ; these may be natural, political, or even cultural

Poor monsoon might mean a downward trend for agricultural stocks ; a communist party in power might be bad news for financial firms :)

If the relationship between external factors and the time series is significant then we need to consider them for our trading model to be robust

Just a reminder for intuition : what do we classify ? That whether the instrument goes *up* or *down* ; not the value by which it moves

# Classification Models

**Logistic Regression**

The logistic regression model provides the probability that a particular subsequent time period will be categorised as *up* or *down*

We introduce a probability threshold parameter ; if the predicted probability of *up* is greater than the threshold, we predict that the financial instrument will move upwards ; otherwise we predict downwards

In general we take the threshold to be 50%

In essence, this is based on the logistic function, which models the probability of a positive based on continuous factors such as the lagged returns

Let us consider the situation where we are interested in predicting the subsequent time period from the previous two lagged returns

Denote them as L<sub>1</sub> and L<sub>2</sub>

Then the probability of an upward movement can be denoted as :

exp(b<sub>0</sub> + b<sub>1</sub> x L<sub>1</sub> + b<sub>2</sub> x L<sub>2</sub> )/( exp(b<sub>0</sub> + b<sub>1</sub> x L<sub>1</sub> + b<sub>2</sub> x L<sub>2</sub> ) + 1 )


We use this instead of a linear regression because it provides a probability between 0 and 1 for all values of L<sub>1</sub> and L<sub>2</sub>
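The formula can be evaluated by hand to see this. The sketch below uses hypothetical coefficients b<sub>0</sub>, b<sub>1</sub>, b<sub>2</sub> picked only for illustration (they are not fitted to any data), and the function name `prob_up` is an assumption of this example

```python
import math


def prob_up(l1, l2, b0, b1, b2):
    """Logistic probability of an up move given two lagged returns."""
    z = b0 + b1 * l1 + b2 * l2
    return math.exp(z) / (math.exp(z) + 1.0)


# Hypothetical coefficients, chosen only to illustrate the formula
b0, b1, b2 = 0.1, -0.5, 0.2

for l1, l2 in [(0.0, 0.0), (2.0, -1.0), (-3.0, 1.5)]:
    p = prob_up(l1, l2, b0, b1, b2)
    label = "up" if p > 0.5 else "down"  # 50% threshold
    print(f"L1={l1:+.1f}, L2={l2:+.1f} -> P(up)={p:.3f} -> predict {label}")
```

Whatever values the lagged returns take, the output stays strictly between 0 and 1, which a linear regression would not guarantee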

To obtain the appropriate b<sub>i</sub> coefficients, the maximum likelihood method is used

This is, fortunately, directly handled by the Scikit-Learn Library


**Linear Discriminant Analysis**

In logistic regression we modelled the probability of an upward move given the previous two lagged returns

In LDA, the distribution of L<sub>i</sub> variables is modelled separately ; given the state of the system (either positive or negative)

We assume that the predictors are drawn from a multivariate Gaussian distribution ; after calculating estimates for the parameters of the distribution , the parameters are inserted into Bayes' Theorem to make predictions on which class an observation belongs to

Essentially it finds a linear combination of features that best separates two or more classes

By projecting data onto a new axis that maximizes class separation, LDA improves class predictability while preserving class discriminatory information

LDA assumes that all classes share the same covariance matrix

Again, the Scikit-Learn library handles this for us ; we do not need to code up the nitty-gritty of this

**Quadratic Discriminant Analysis**

Similar to LDA, except for the fact that separate classes have separate covariance matrices

QDA performs better when *decision boundaries are non-linear*

LDA performs better when there are fewer training observations and therefore reduction of the variance is a key concern

QDA, on the other hand, performs better when there are a large number of training observations, so that the variance of the classifier is less of a concern
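The difference between the two can be seen on a deliberately extreme synthetic example. In this sketch both classes share the same mean but have very different spreads, so the shared-covariance assumption of LDA fails badly ; the data, seed, and split are all assumptions made for illustration and say nothing about real market data

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)
n = 1000

# Synthetic data: both classes are centred at the origin but have very
# different spreads, so no straight line can separate them
X_tight = rng.normal(0.0, 0.3, size=(n, 2))  # class -1, small covariance
X_wide = rng.normal(0.0, 2.0, size=(n, 2))   # class +1, large covariance
X = np.vstack([X_tight, X_wide])
y = np.array([-1] * n + [1] * n)

# Hold out every other sample for testing
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

lda_score = LinearDiscriminantAnalysis().fit(X_train, y_train).score(X_test, y_test)
qda_score = QuadraticDiscriminantAnalysis().fit(X_train, y_train).score(X_test, y_test)

print(f"LDA hit rate: {lda_score:.3f}")
print(f"QDA hit rate: {qda_score:.3f}")
```

On data like this QDA should score far higher than LDA, because the true boundary is a circle rather than a line ; with few observations or similar class covariances the ranking can easily reverse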

**Support Vector Machines**

In SVMs , we attempt to locate a linear separation boundary in the feature space that correctly classifies most, but of course not all, of the training observations ; the boundary is chosen to leave the widest possible margin between the two classes

We can extend that capability to allow detection of non-linear decision boundaries

This is done by enlarging the feature space, through kernels, so that a linear boundary in the enlarged space corresponds to a non-linear one in the original space

SVMs allow non-linear decision boundaries via many different choices of kernel ; we are free to use kernels beyond linear systems, we may use quadratic or higher order polynomials, or even radial kernels to describe non-linear boundaries
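A small sketch of the effect of the kernel choice, using scikit-learn's `make_circles` toy data set (one class forms a ring around the other). The data set, the split, and the default hyperparameters are assumptions made for illustration only

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

scores = {}
for kernel in ("linear", "poly", "rbf"):
    # degree is only used by the polynomial kernel
    model = SVC(kernel=kernel, degree=2)
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)
    print(f"{kernel:>6}: {scores[kernel]:.3f}")
```

A linear kernel cannot do much better than guessing here, while the radial kernel can wrap a boundary around the inner class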

**Decision Trees**

Decision trees are a supervised classification technique that utilise a tree structure to partition
the feature space into recursive subsets via a "decision" at each node of the tree

Let us consider an example ; suppose the question at the decision node is whether the price of a particular instrument was above or below a certain threshold ; this will divide the feature space into two subsets ; we can further ask the question if the volume was above or below a certain threshold , thus now creating 4 subsets

This is a more intuitive and naturally interpretable classification mechanism as compared to SVMs or Discriminant Analysers
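The price and volume example can be sketched with scikit-learn's `DecisionTreeClassifier`. The data and the rule that generates the labels are invented for illustration ; `export_text` prints the questions the fitted tree asks at each node

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# Synthetic features: a price level and a traded volume for 500 days
price = rng.uniform(50, 150, size=500)
volume = rng.uniform(0, 2_000_000, size=500)
X = np.column_stack([price, volume])

# Hypothetical rule: an "up" day needs both a high price and a high volume
y = np.where((price > 100) & (volume > 1_000_000), 1, -1)

# max_depth=2 allows at most two questions per path, hence at most 4 regions
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["price", "volume"]))
print("Leaves:", tree.get_n_leaves())
print("Training hit rate:", tree.score(X, y))
```

The printed rules can be read directly as threshold questions, which is the interpretability advantage mentioned above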

**Random Forests**

This is the domain where we start getting into ensemble learning

Instead of attacking the problem with a single classifier, we create a large number of classifiers and train each on a different random sample of the data and features

Then combine the individual predictions, by voting or averaging, to obtain a prediction that is usually more accurate and more stable than that of a single constituent

This is the Random Forest ; which takes in multiple decision trees and combines the predictions
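To see the combination step explicitly, the sketch below fits a small forest on synthetic data and reproduces its prediction by averaging the class probabilities of the individual trees by hand. The data, seed, and forest size are assumptions for illustration ; in scikit-learn's implementation every tree carries equal weight

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Synthetic two-feature data with a noisy up/down label
X = rng.normal(size=(600, 2))
noise = rng.normal(scale=0.5, size=600)
y = np.where(X[:, 0] + 0.5 * X[:, 1] + noise > 0, 1, -1)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Each fitted tree votes with class probabilities; average them by hand
tree_probs = np.mean([t.predict_proba(X_test) for t in forest.estimators_], axis=0)
manual_pred = forest.classes_[np.argmax(tree_probs, axis=1)]

print("Trees in the forest:", len(forest.estimators_))
print("Matches forest.predict:", np.array_equal(manual_pred, forest.predict(X_test)))
```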

**Principal Component Analysis**

All of the above techniques belong to the *supervised classification* domain

Alternatively, we can approach the problem not in terms of supervising the training procedure but instead allowing an algorithm to ascertain the relevant features on its own ; such methods are known as unsupervised learning algorithms

The idea remains to reduce the number of dimensions of a problem to the relevant and statistically significant ones and ultimately to discover features that provide predictive power in the time series analysis

One of the techniques to do this is Principal Component Analysis (PCA)

The basic idea of a PCA is to transform a set of possibly correlated variables (such as with
time series autocorrelation) into a set of linearly uncorrelated variables known as the principal
components. Such principal components are ordered according to the amount of variance they
describe, in an orthogonal manner. Thus if we have a very high-dimensional feature space (10+
features), then we could reduce the feature space via PCA to perhaps 2 or 3 principal components
that provide nearly all of the variability in the data, thus leading to a more robust supervised
classifier model when used on this reduced dataset.
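A minimal sketch of this reduction with scikit-learn's `PCA` follows. The 10 features are synthetic, built from 2 hidden factors plus a little noise, so the example is constructed to compress well ; real return series will usually be far less cooperative

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Synthetic data: 10 correlated features driven by 2 hidden factors plus noise
factors = rng.normal(size=(500, 2))
loadings = rng.normal(size=(2, 10))
X = factors @ loadings + rng.normal(scale=0.05, size=(500, 10))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
```

The reduced matrix `X_reduced` could then be passed to any of the supervised classifiers above in place of the original features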



Now we move on to coding up some stuff








In [12]:
from __future__ import print_function
import datetime
import numpy as np
import pandas as pd
import yfinance as yf
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import confusion_matrix
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import LinearSVC, SVC


def create_lagged_series(symbol, start_date, end_date, lags=5):
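    # Download stock information from Yahoo Finance, starting one year early so the lags are defined
    # auto_adjust=False is passed explicitly because the "Adj Close" column used below
    # is only returned when prices are not auto-adjusted
    ts = yf.download(symbol, start=start_date - datetime.timedelta(days=365), end=end_date, auto_adjust=False)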

    # Create the new lagged DataFrame
    tslag = pd.DataFrame(index=ts.index)
    tslag["Today"] = ts["Adj Close"]
    tslag["Volume"] = ts["Volume"]

    # Create the shifted lag series of prior trading period close values
    # This has purely been done with the help of ChatGPT
    for i in range(0, lags):
        tslag["Lag%s" % str(i+1)] = ts["Adj Close"].shift(i+1)

    # Create the returns DataFrame
    tsret = pd.DataFrame(index=tslag.index)
    tsret["Volume"] = tslag["Volume"]
    tsret["Today"] = tslag["Today"].pct_change() * 100.0

    # If any of the values of percentage returns equal zero, set them to
    # a small number (this helps in dealing with issues in the QDA model in scikit-learn)
    # Assign the result back, since an in-place replace on a selected column may not update tsret
    tsret["Today"] = tsret["Today"].replace(0, 0.0001)

    # Create the lagged percentage returns columns
    for i in range(0, lags):
        tsret["Lag%s" % str(i+1)] = tslag["Lag%s" % str(i+1)].pct_change() * 100.0

    # Create the "Direction" column (+1 or -1) indicating an up/down day
    tsret["Direction"] = np.sign(tsret["Today"])
    tsret = tsret[tsret.index >= start_date]

    # Drop rows with NaN values (reassigning avoids modifying a filtered slice in place)
    tsret = tsret.dropna()

    return tsret

if __name__ == "__main__":
    # I like dealing with the AMZN stock
    snpret = create_lagged_series("AMZN", datetime.datetime(2020, 1, 10), datetime.datetime(2023, 12, 31), lags=5)

    # Here we use the prior two days of returns as predictor
    X = snpret[["Lag1", "Lag2"]]
    y = snpret["Direction"]

    # The data is split into two parts: training data before 1st Jan 2023 and test data from then on.
    start_test = datetime.datetime(2023, 1, 1)

    X_train = X[X.index < start_test]
    X_test = X[X.index >= start_test]
    y_train = y[y.index < start_test]
    y_test = y[y.index >= start_test]

    print("Hit Rates/Confusion Matrices:\n")
    models = [
        ("LR", LogisticRegression(max_iter=1000)),
        ("LDA", LDA()),
        ("QDA", QuadraticDiscriminantAnalysis()),
        ("LSVC", LinearSVC(max_iter=10000)),
        ("RSVM", SVC(
            C=1000000.0, cache_size=200, class_weight=None,
            coef0=0.0, degree=3, gamma=0.0001, kernel='rbf',
            max_iter=-1, probability=False, random_state=None,
            shrinking=True, tol=0.001, verbose=False)
        ),
        ("RF", RandomForestClassifier(
            n_estimators=1000, criterion='gini',
            max_depth=None, min_samples_split=2,
            min_samples_leaf=1, max_features='sqrt',
            bootstrap=True, oob_score=False, n_jobs=1,
            random_state=None, verbose=0)
        )
    ]

    # I really do not understand the multiple hyperparameters of RSVM and RF ; need to understand that


    for name, model in models:
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print("%s:\n%0.3f" % (name, model.score(X_test, y_test)))
        print("%s\n" % confusion_matrix(y_test, pred))


Hit Rates/Confusion Matrices:

LR:
0.540
[[ 14  98]
 [ 17 121]]

LDA:
0.540
[[ 14  98]
 [ 17 121]]

QDA:
0.452
[[ 76  36]
 [101  37]]

LSVC:
0.540
[[ 14  98]
 [ 17 121]]

RSVM:
0.548
[[  8 104]
 [  9 129]]



RF:
0.460
[[41 71]
 [64 74]]



# Conclusion

Almost all of the hit rates lie between 45% and 55%

It is clear that lagged time series analysis alone is not a good enough forecasting predictor and we need to improve

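Read with scikit-learn's layout (rows are actual directions, columns are predicted directions, *down* before *up*), LR, LDA, LSVC and RSVM predict *up* on most days : they get 121 to 129 of the 138 up days right but only 8 to 14 of the 112 down days ; QDA leans the other way (76 of 112 down days, 37 of 138 up days) and RF is roughly balanced

Since 138 of the 250 test days were up days, always predicting *up* would already give a hit rate of 55.2% ; none of the models beats that baseline, so these results do not support either a long or a short strategy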
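As a cross-check on how to read the matrices printed above, the sketch below recomputes the per-class rates from four of them. The counts are copied from the output ; the row and column layout assumed is scikit-learn's (rows actual, columns predicted, -1 before +1), and the helper name `summarise` is hypothetical

```python
# Confusion matrices copied from the output above, as [[TN, FP], [FN, TP]]
matrices = {
    "LR": [[14, 98], [17, 121]],
    "QDA": [[76, 36], [101, 37]],
    "RSVM": [[8, 104], [9, 129]],
    "RF": [[41, 71], [64, 74]],
}


def summarise(cm):
    """Hit rate, per-class rates, and the always-up baseline for one matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "hit_rate": (tn + tp) / total,
        "true_negative_rate": tn / (tn + fp),  # down days called correctly
        "true_positive_rate": tp / (tp + fn),  # up days called correctly
        "always_up_baseline": (fn + tp) / total,
    }


summary = {name: summarise(cm) for name, cm in matrices.items()}
for name, stats in summary.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

For LR this gives a true negative rate of 12.5% against a true positive rate of about 87.7%, with the always-up baseline at 55.2%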