# Sequence Model Research

The scope of this notebook is to assess and train different sequence models given the training data generated.

Training data is generated based on financial time series data labeled with potential profits using a buy-sell system.

The goal is to create a sequence model that can choose favourable stock charts equal to or better than a human can via traditional technical analysis.

## Import Libraries and Data

In [7]:
import os
import numpy as np

# Define the data directory relative to the script location
data_dir = 'data'

# Define the file paths
sequences_path = os.path.join(data_dir, 'sequences.npy')
labels_path = os.path.join(data_dir, 'labels.npy')
metadata_path = os.path.join(data_dir, 'metadata.npy')

# Load the data
try:
    data_sequences = np.load(sequences_path)
    data_labels = np.load(labels_path)
    data_metadata = np.load(metadata_path)

    # Number of examples to select
    num_examples = 115000

    # Generate a random permutation of indices
    indices = np.random.permutation(len(data_sequences))

    # Select the first `num_examples` indices
    selected_indices = indices[:num_examples]

    # Use the selected indices to create the random subset
    data_sequences = data_sequences[selected_indices, -84:, :]
    data_labels = data_labels[selected_indices]
    data_metadata = data_metadata[selected_indices]

    # Inspect the shape and size of the loaded data before slicing
    print(f'Loaded sequences shape: {data_sequences.shape}')
    print(f'Loaded sequences size: {data_sequences.size}')
    print(f'Loaded labels shape: {data_labels.shape}')
    print(f'Loaded metadata shape: {data_metadata.shape}')

except FileNotFoundError as e:
    print(f"Error loading files: {e}")
except ValueError as e:
    print(f"Value error: {e}")

# Calculate and print the expected total size
expected_total_size = num_examples * 252 * 15
print(f'Expected total size: {expected_total_size}')

# Define relevant columns and indices for normalization
relevant_columns = [
    'Open', 'High', 'Low', 'Close', 'Volume', 'Turnover', 'Consol_Detected',
    'Consol_Len_Bars', 'Consol_Depth_Percent', 'Close_21_bar_ema',
    'Close_50_bar_sma', 'Close_150_bar_sma', 'Close_200_bar_sma',
    'RSL', 'RSL_NH', 'UpDownVolumeRatio', 'ATR', '%B'
]

price_columns_indices = [0, 1, 2, 3, 9, 10, 11, 12]  # Indices of price-related columns in the sequence data

# Map indices to column names
price_columns = [relevant_columns[i] for i in price_columns_indices]

print("\nPrice-related columns:")
for index, column in zip(price_columns_indices, price_columns):
    print(f"Index: {index}, Column: {column}")


Loaded sequences shape: (115000, 63, 19)
Loaded sequences size: 137655000
Loaded labels shape: (115000,)
Loaded metadata shape: (115000, 2)
Expected total size: 434700000

Price-related columns:
Index: 0, Column: Open
Index: 1, Column: High
Index: 2, Column: Low
Index: 3, Column: Close
Index: 9, Column: Close_21_bar_ema
Index: 10, Column: Close_50_bar_sma
Index: 11, Column: Close_150_bar_sma
Index: 12, Column: Close_200_bar_sma


## Data Preprocessing

### NaN Removal

In [8]:
# Replace all NaNs with 0 due to moving averages having insufficient data to compute anything, leaving blank inputs.
# Check if NaNs exist

# Dictionary to map variable names to their corresponding data arrays
data_dict = {
    'data_sequences': data_sequences,
    'data_labels': data_labels,
}

# Using a dictionary to iterate over variables
for var_name, data in data_dict.items():
    num_nans = np.sum(np.isnan(data))
    print(f"NaNs in {var_name}: {num_nans}")

    # Remove NaNs
    if num_nans > 0:
        data_dict[var_name][:] = np.nan_to_num(data)
        num_nans = np.sum(np.isnan(data))
        print(f"NaNs remaining in {var_name} after removal: {num_nans}")

print(f"Data Seq Min: {np.min(data_sequences[:,:,price_columns_indices])}")
print(f"Data Seq Max: {np.max(data_sequences[:,:,price_columns_indices])}")


NaNs in data_sequences: 1218686
NaNs remaining in data_sequences after removal: 0
NaNs in data_labels: 0
Data Seq Min: 0.0
Data Seq Max: 468223510183936.0


### Corrupted sequence removal

99% of stocks I buy will be below 1000, with a few above 1000, although they are important.

I also noticed quite a few training examples have weird price data, which I filter out below.

I noticed with thresholds above 3e3, the max is the threshold, which is very suspect.

The loss of training examples is insignificant, and the result is better normalization of the data and obviously no corrupted sequences.

In [9]:
# Set the threshold for abnormal values based on domain knowledge
threshold = 3.0e3

# Detect all sequences with abnormally large price data
abnormal_sequences = []

# Iterate through each sequence to check for abnormal values
for sequence_index in range(data_sequences.shape[0]):
    # Extract price-related columns for the current sequence
    price_data = data_sequences[sequence_index, :, price_columns_indices]
    
    # Check if any value in the price_data exceeds the threshold
    if np.any(price_data > threshold):
        abnormal_sequences.append(sequence_index)

# Print the indices of the abnormal sequences
print(f"Abnormal Sequence Count: {len(abnormal_sequences)}")
print(f"Indices of abnormal sequences: {abnormal_sequences}")

# Create a mask for sequences that are not abnormal
mask = np.ones(data_sequences.shape[0], dtype=bool)
mask[abnormal_sequences] = False

# Filter out abnormal sequences from data_sequences and data_labels
filtered_data_sequences = data_sequences[mask]
filtered_data_labels = data_labels[mask]

# Print the shape of the filtered data
print(f"Filtered data_sequences shape: {filtered_data_sequences.shape}")
print(f"Filtered data_labels shape: {filtered_data_labels.shape}")

print(f"Data Seq Min: {np.min(filtered_data_sequences[:,:,price_columns_indices])}")
print(f"Data Seq Max: {np.max(filtered_data_sequences[:,:,price_columns_indices])}")


Abnormal Sequence Count: 1001
Indices of abnormal sequences: [52, 60, 435, 661, 701, 736, 749, 994, 1009, 1051, 1184, 1234, 1275, 1363, 1471, 1505, 1522, 1702, 1869, 1886, 2066, 2237, 2416, 2536, 2609, 2629, 2871, 2885, 3046, 3068, 3479, 3570, 3576, 3611, 3629, 3634, 3673, 3676, 3716, 4270, 4417, 4610, 4627, 4678, 5020, 5180, 5225, 5284, 5332, 5412, 5459, 5583, 5775, 5797, 5871, 5945, 6110, 6391, 6880, 6998, 7097, 7731, 7864, 8021, 8053, 8059, 8106, 8108, 8269, 8437, 8460, 8936, 8991, 9033, 9038, 9110, 9113, 9129, 9164, 9193, 9212, 9351, 9428, 9568, 9758, 9765, 9891, 9923, 9962, 9978, 10272, 10384, 10417, 10429, 10498, 10784, 10865, 10945, 11178, 11209, 11230, 11252, 11275, 11424, 11614, 11873, 11943, 11973, 12306, 12435, 12436, 12558, 12570, 12685, 12780, 12950, 13062, 13064, 13158, 13175, 13609, 13754, 13767, 13776, 13808, 13872, 13956, 13968, 13990, 14057, 14090, 14097, 14265, 14271, 14496, 14501, 14548, 14574, 14669, 14728, 14820, 14961, 15013, 15068, 15164, 15265, 15294, 15392, 15

### Normalization of Training Data

In [10]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Indices of price-related columns
price_columns_indices = [0, 1, 2, 3]
ma_columns_indices = [9, 10, 11, 12]

# Extract data shapes
num_sequences, num_timesteps, num_features = data_sequences.shape

# Calculate log transformation for price-related features
price_data = data_sequences[:, :, price_columns_indices]
log_price_data = np.log(price_data + 1e-8)

# Replace the original price-related features with the log values
data_sequences[:, :, price_columns_indices] = log_price_data

# Calculate percentage away from the Close price for moving averages
close_price_data = data_sequences[:, :, 3].reshape(num_sequences, num_timesteps, 1)  # Close price at index 3
ma_data = data_sequences[:, :, ma_columns_indices]

# Avoid division by zero by adding epsilon
epsilon = 1e-8
percentage_away_from_close = (close_price_data - ma_data) / (ma_data + epsilon)

# Replace the original moving average features with the percentage away values
data_sequences[:, :, ma_columns_indices] = percentage_away_from_close

# Handle infinite values by replacing them with NaNs and then replacing NaNs with zero
data_sequences = np.nan_to_num(data_sequences, nan=0.0, posinf=0.0, neginf=0.0)

# Clip extreme values to avoid large outliers
data_sequences = np.clip(data_sequences, -1e3, 1e3)

# Normalize the price-related features together
price_scaler = MinMaxScaler(feature_range=(-1, 1))

# Reshape the price-related features to fit the scaler
original_shape = data_sequences[:, :, price_columns_indices].shape
reshaped_data = data_sequences[:, :, price_columns_indices].reshape(-1, len(price_columns_indices))

# Fit and transform the price-related features
normalized_price_data = price_scaler.fit_transform(reshaped_data)

# Reshape back to the original shape
normalized_price_data = normalized_price_data.reshape(original_shape)

# Replace the original price-related features with the normalized ones
data_sequences[:, :, price_columns_indices] = normalized_price_data

# Normalize the remaining features individually
for feature_index in range(num_features):
    if feature_index not in price_columns_indices and feature_index not in ma_columns_indices:
        # Initialize a new scaler for each feature
        feature_scaler = MinMaxScaler()

        # Extract the feature data
        feature_data = data_sequences[:, :, feature_index].reshape(-1, 1)

        # Fit and transform the scaler
        normalized_feature_data = feature_scaler.fit_transform(feature_data)

        # Reshape back to the original shape
        normalized_feature_data = normalized_feature_data.reshape(num_sequences, num_timesteps)

        # Replace the original feature with the normalized one
        data_sequences[:, :, feature_index] = normalized_feature_data

# Print normalized data sequences to check
print(f"Normalized Data Seq Min: {np.min(data_sequences)}")
print(f"Normalized Data Seq Max: {np.max(data_sequences)}")

# Make the labels a binary decision, rather than a profit
min_profit = 0.1  # implies a good decision is a breakout that produces more than min_profit (*100 for percent, 0.2 = 20%)

data_labels = (data_labels > min_profit).astype(int)

print(f"Data Labels for > {min_profit*100}% Min: {np.min(data_labels)}")
print(f"Data Labels for > {min_profit*100}% Max: {np.max(data_labels)}")

# Count how many labels are 1 and how many are 0
num_ones = np.sum(data_labels)
num_zeros = len(data_labels) - num_ones

print(f"Number of labels that are 1: {num_ones}")
print(f"Number of labels that are 0: {num_zeros}")
print(f"Probability of randomly selecting a stock making {min_profit*100}% is {num_ones/(num_ones+num_zeros)*100}%")


Normalized Data Seq Min: -1000.0
Normalized Data Seq Max: 1000.0
Data Labels for > 10.0% Min: 0
Data Labels for > 10.0% Max: 1
Number of labels that are 1: 21255
Number of labels that are 0: 93745
Probability of randomly selecting a stock making 10.0% is 18.482608695652175%


## Model -> Hyperparameter Tuning

In [12]:
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import Precision, Recall
from sklearn.utils import class_weight
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, f1_score
import kerastuner as kt
from tensorflow.keras.regularizers import l1_l2

# Print shapes to verify data
print(f"X shape: {data_sequences.shape}")
print(f"y shape: {data_labels.shape}")

# Split the data
X_train, X_test, y_train, y_test = train_test_split(data_sequences, data_labels, test_size=0.2, random_state=42, stratify=data_labels)

# Calculate class weights
class_weights = dict(enumerate(class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)))

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_reshaped = X_train.reshape(X_train.shape[0], -1)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_reshaped, y_train)
X_train_resampled = X_train_resampled.reshape(X_train_resampled.shape[0], X_train.shape[1], X_train.shape[2])

# Get the number of features and sequence length
n_features = data_sequences.shape[2]
sequence_length = data_sequences.shape[1]

def build_model(hp):
    model = Sequential()
    
    # First LSTM layer
    model.add(LSTM(
        units=hp.Int('lstm_1_units', min_value=32, max_value=256, step=32),
        return_sequences=True,
        input_shape=(sequence_length, n_features),
        kernel_regularizer=l1_l2(l1=hp.Choice('l1_1', [1e-5, 1e-4, 1e-3]), l2=hp.Choice('l2_1', [1e-5, 1e-4, 1e-3]))
    ))
    model.add(Dropout(hp.Float('dropout_1', min_value=0.0, max_value=0.5, step=0.1)))
    
    # Second LSTM layer
    model.add(LSTM(
        units=hp.Int('lstm_2_units', min_value=16, max_value=128, step=16),
        kernel_regularizer=l1_l2(l1=hp.Choice('l1_2', [1e-5, 1e-4, 1e-3]), l2=hp.Choice('l2_2', [1e-5, 1e-4, 1e-3]))
    ))
    model.add(Dropout(hp.Float('dropout_2', min_value=0.0, max_value=0.5, step=0.1)))
    
    # Dense layers
    for i in range(hp.Int('num_dense_layers', 1, 3)):
        model.add(Dense(
            units=hp.Int(f'dense_{i}_units', min_value=8, max_value=64, step=8),
            activation='relu',
            kernel_regularizer=l1_l2(l1=hp.Choice(f'l1_dense_{i}', [1e-5, 1e-4, 1e-3]), l2=hp.Choice(f'l2_dense_{i}', [1e-5, 1e-4, 1e-3]))
        ))
        model.add(Dropout(hp.Float(f'dropout_dense_{i}', min_value=0.0, max_value=0.5, step=0.1)))
    
    # Output layer
    model.add(Dense(1, activation='sigmoid'))
    
    # Compile the model
    model.compile(
        optimizer=Adam(hp.Float('learning_rate', min_value=1e-4, max_value=1e-2, sampling='LOG')),
        loss='binary_crossentropy',
        metrics=['accuracy', Precision(), Recall()]
    )
    
    return model

# Initialize the tuner
tuner = kt.Hyperband(
    build_model,
    objective=kt.Objective('val_precision', direction='max'),
    max_epochs=100,
    factor=3,
    directory='keras_tuner',
    project_name='stock_prediction'
)

# Perform the search
tuner.search(X_train_resampled, y_train_resampled,
             epochs=100,
             validation_data=(X_test, y_test),
             class_weight=class_weights)

# Get the best model
best_model = tuner.get_best_models(num_models=1)[0]

# Evaluate the best model
test_loss, test_accuracy, test_precision, test_recall = best_model.evaluate(X_test, y_test, verbose=0)
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test Precision: {test_precision:.4f}")
print(f"Test Recall: {test_recall:.4f}")

# Make predictions
predictions = best_model.predict(X_test)
predicted_classes = (predictions > 0.5).astype(int).flatten()

# Calculate F1 score
f1 = f1_score(y_test, predicted_classes)
print(f"F1 Score: {f1:.4f}")

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, predicted_classes))

# Calculate the percentage of positive predictions
positive_predictions_percentage = (predicted_classes.sum() / len(predicted_classes)) * 100
print(f"\nPercentage of positive predictions: {positive_predictions_percentage:.2f}%")

# Print the best hyperparameters
best_hp = tuner.get_best_hyperparameters(1)[0]
print("\nBest Hyperparameters:")
for param, value in best_hp.values.items():
    print(f"{param}: {value}")

X shape: (115000, 63, 19)
y shape: (115000,)


2024-06-23 06:04:08.864601: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-06-23 06:04:09.127329: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-06-23 06:04:09.127472: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-06-23 06:04:09.132434: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-06-23 06:04:09.132520: I external/local_xla/xla/stream_executor


Search: Running Trial #1

Value             |Best Value So Far |Hyperparameter
224               |224               |lstm_1_units
0.001             |0.001             |l1_1
1e-05             |1e-05             |l2_1
0.3               |0.3               |dropout_1
16                |16                |lstm_2_units
0.001             |0.001             |l1_2
0.0001            |0.0001            |l2_2
0.3               |0.3               |dropout_2
2                 |2                 |num_dense_layers
24                |24                |dense_0_units
1e-05             |1e-05             |l1_dense_0
1e-05             |1e-05             |l2_dense_0
0                 |0                 |dropout_dense_0
0.0016953         |0.0016953         |learning_rate
2                 |2                 |tuner/epochs
0                 |0                 |tuner/initial_epoch
4                 |4                 |tuner/bracket
0                 |0                 |tuner/round

Epoch 1/2


2024-06-23 06:04:13.687016: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8907
2024-06-23 06:04:14.485476: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f3111de34a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-06-23 06:04:14.485519: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3070 Ti Laptop GPU, Compute Capability 8.6
2024-06-23 06:04:14.517881: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1719115454.637240  349919 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


 391/4688 [=>............................] - ETA: 1:04 - loss: 1.0259 - accuracy: 0.4971 - precision: 0.4972 - recall: 0.9944

KeyboardInterrupt: 