# CS 440/540 Machine Learning in Finance: Homework 3

Download data files from LMS. Code/Explain your solution over this `IPython` notebook at required cells, and complete locally.

To submit your assignment, in LMS, upload your solution to LMS as a single notebook with following file name format:

`lastName_firstName_CourseNumber_HW3.ipynb`

where `CourseNumber` is the course in which you're enrolled (CS 440 or CS 540).

Problems on homework assignments are equally weighted.

Any type of plagiarism will not be tolerated. Your submitted codes will be compared with other submissions and also the codes available on internet and violations will have a penalty of -100 points. (In case of copying from
another student both parties will get -100)

Import all libraries here

In [2]:
#Import libraries before starting
import pandas as pd
import numpy as np
from tqdm import tqdm

## Problem 1: XGBoost and Random Forest for Detecting Fraudulent Transactions

In this problem, we will focus on predicting whether a transaction is a fraud or not. All transactions are provided in "transactions.csv". The file contains only numerical input variables which are the result of a PCA transformation. Due to confidentiality issues, original features cannot be provided. Features V1, V2, …, V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-senstive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Analyze the dataset. Do you think dataset is balanced? Evaluate and compare XGBoost and RandomForest algorithms without SMOTE, as well as after balancing the data via SMOTE. You can compare performance by F1 score and use 5-fold cross-validation for hyperparamter optimization.

Among 4 scenarios(Random Forest, XGBoost, Random Forest + SMOTE, XGBoost + SMOTE), which performs the best? Discuss.

In [47]:
#Solution 1
df = pd.read_csv('creditcard.csv')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [48]:
df["Class"].value_counts()
# Dataset is not balanced. There are 492 frauds and 284315 non-fraudulent transactions.

Class
0    284315
1       492
Name: count, dtype: int64

In [49]:
folds: list[tuple[pd.DataFrame, pd.DataFrame]] = []

fold_length = len(df) // 5

for i in range(5):
    test_start = i * fold_length
    test_end = (i + 1) * fold_length
    
    df_test = df.iloc[test_start:test_end]
    df_train = df.drop(df_test.index)
    
    folds.append((df_train, df_test))
    

In [50]:
folds[0][0].head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
56961,47694.0,1.138149,-0.698637,0.332976,0.272394,-0.629432,0.463714,-0.628447,0.096643,-1.013536,...,-0.200924,-0.328477,-0.240835,-0.868956,0.483849,-0.256578,0.045728,0.03495,116.0,0
56962,47695.0,0.017199,-0.148533,-0.095542,-0.923477,1.161514,-0.560818,0.874059,-0.141331,-0.241034,...,0.034412,0.099442,0.785986,-0.291115,-2.653623,0.148533,0.101011,0.124571,55.98,0
56963,47695.0,1.157218,-0.400497,0.224997,-0.662346,-0.058859,0.701721,-0.468211,0.17825,0.147136,...,-0.082828,-0.319259,-0.08415,-1.23833,0.039034,1.321246,-0.076822,-0.002819,69.99,0
56964,47695.0,1.227454,0.052971,0.07456,1.151042,-0.047766,0.09269,-0.109024,0.155374,0.385462,...,-0.156081,-0.366528,-0.180845,-0.542113,0.751469,-0.301263,0.006493,-0.003925,5.37,0
56965,47695.0,1.228671,1.260991,-1.699439,1.450723,1.079511,-1.310171,0.683609,-0.232911,-0.722463,...,-0.177758,-0.43305,-0.265018,-0.320238,0.880613,-0.291865,0.037405,0.08394,0.89,0


In [59]:
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

xg_boost_f1_scores = []
random_forest_f1_scores = []

smote_xg_boost_f1_scores = []
smote_random_forest_f1_scores = []

for df_train, df_test in tqdm(folds):
    train_X = df_train.iloc[:, :-1]
    train_y = df_train.iloc[:, -1]
    
    test_X = df_test.iloc[:, :-1]
    test_y = df_test.iloc[:, -1]
    
    ## XGBoost
    xgboost = XGBClassifier()
    xgboost: XGBClassifier
    xgboost.fit(train_X, train_y)
    
    predictions_test = xgboost.predict(test_X)
    f1s = f1_score(test_y, predictions_test)
    xg_boost_f1_scores.append(f1s) 
    
    ## Random Forest
    random_forest = RandomForestClassifier(n_estimators=10)
    random_forest: RandomForestClassifier
    random_forest.fit(train_X, train_y)
    
    predictions_test = random_forest.predict(test_X)
    f1s = f1_score(test_y, predictions_test)
    random_forest_f1_scores.append(f1s)
    
    ## Smote the data
    
    smote_X_train, smote_y_train = SMOTE().fit_resample(train_X, train_y)
    
    ## XGBoost + Smote
    xgboost = XGBClassifier()
    xgboost: XGBClassifier
    xgboost.fit(smote_X_train, smote_y_train)
    
    predictions_test = xgboost.predict(test_X)
    f1s = f1_score(test_y, predictions_test)
    smote_xg_boost_f1_scores.append(f1s) 
    
    ## Random Forest + Smote
    random_forest = RandomForestClassifier(n_estimators=10)
    random_forest: RandomForestClassifier
    random_forest.fit(smote_X_train, smote_y_train)
    
    predictions_test = random_forest.predict(test_X)
    f1s = f1_score(test_y, predictions_test)
    smote_random_forest_f1_scores.append(f1s)
    

xgboost_f1 = np.mean(xg_boost_f1_scores)
random_forest_f1 = np.mean(random_forest_f1_scores)
smote_xgboost_f1 = np.mean(smote_xg_boost_f1_scores)
smote_random_forest_f1 = np.mean(smote_random_forest_f1_scores)

100%|██████████| 5/5 [08:42<00:00, 104.46s/it]


In [60]:
print(f"XGBoost F1: {xgboost_f1}")
print(f"Random Forest F1: {random_forest_f1}")
print(f"SMOTE XGBoost F1: {smote_xgboost_f1}")
print(f"SMOTE Random Forest F1: {smote_random_forest_f1}")

# XGBoost seems to peform better than other approaches. SMOTE does not seem to help much.

XGBoost F1: 0.8217737463512647
Random Forest F1: 0.7943355609579489
SMOTE XGBoost F1: 0.7722751511568273
SMOTE Random Forest F1: 0.7851228490891498


## Problem 2: MLP for House Price Prediction

Let's focus on the same Real State Price dataset from HW1 and HW2. In this problem, you are provided a single dataset "kaggle_house.csv" which includes both train and test sets. We will now implement four MLPs: 

a- MLP with 1 hidden layer with 8 units in hidden layer

b- MLP with 1 hidden layer with 4 units in hidden layer

c- MLP with 2 hidden layers with 4 units in both first and second layers.

d- MLP with 2 hidden layers with 8 units in first layer and 4 units in the second layer.

We will predict house sale price (last column) by using the following attributes: "SalePrice", "MSSubClass", "MSZoning", "LotFrontage", "LotArea","Street", "YearBuilt", "LotShape", "1stFlrSF", "2ndFlrSF". Report the performance in terms of R2 and RMSE for the test set by applying 5-fold cross-validation.

Note that you need to carefully tune learning rate and number of epochs. 

In [49]:
#Solution 2
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('kaggle_house.csv')
df = df[["MSSubClass", "MSZoning", "LotFrontage", "LotArea","Street", "YearBuilt", "LotShape", "1stFlrSF", "2ndFlrSF", "SalePrice"]]
non_numeric_columns = ["MSZoning", "Street", "LotShape"]
for column in non_numeric_columns:
    df[column] = LabelEncoder().fit_transform(df[column])

df = df.dropna()
len(df)

1201

In [50]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

In [51]:
df_test = df.sample(frac=0.2)
df_train = df.drop(df_test.index)

In [52]:
from torch import nn

class MLP(nn.Module):
    def __init__(self, structure_dict: dict[int, int], input_dim):
        super().__init__()

        layers = []

        prev_layer_units = input_dim
        for layer, n_unit in structure_dict.items():
            layers.append(nn.Linear(prev_layer_units, n_unit))
            layers.append(nn.ReLU())
            prev_layer_units = n_unit

        layers.append(nn.Linear(prev_layer_units, 1))

        self.model = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.model(x)

In [74]:
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(_model: MLP, X_train: torch.Tensor, y_train: torch.Tensor, X_test: torch.Tensor, y_test: torch.Tensor, learning_rate, epochs):
    batch_size = 32
    optimizer = torch.optim.Adam(_model.parameters(), lr=learning_rate)
    loss_function = nn.MSELoss()
    
    dataset = TensorDataset(X_train, y_train)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    
    epoch_bar = tqdm(range(epochs))
    for epoch in epoch_bar:
        for X, y in dataloader:
            pred = _model(X)
            loss = loss_function(pred, y)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        if epoch % 50 == 0:
            epoch_bar.set_description(f"Epoch {epoch}, Test Loss: {loss_function(_model(X_test), y_test)}, Train Loss: {loss_function(_model(X_train), y_train)}")


In [75]:
folds: list[tuple[pd.DataFrame, pd.DataFrame]] = []

fold_length = len(df) // 5

for i in range(5):
    test_start = i * fold_length
    test_end = (i + 1) * fold_length
    
    df_test = df.iloc[test_start:test_end]
    df_train = df.drop(df_test.index)
    
    folds.append((df_train, df_test))

In [76]:
torch.cuda.is_available()
device = "cuda" if torch.cuda.is_available() else "cpu"

In [85]:
from itertools import product

def grid_search(structure):
    epoch_lengts = [100, 500, 1000, 1500]
    learning_rates = [0.001, 0.01, 0.1]
    
    params = product(epoch_lengts, learning_rates)
    
    results = []
    for epoch_length, lr in params:
        r2_scores = []
        rmse_scores = []
        
        for df_train, df_test in folds:
            X_train = torch.tensor(df_train.iloc[:, :-1].values, dtype=torch.float32, device=device)
            y_train = torch.tensor(df_train.iloc[:, -1].values, dtype=torch.float32, device=device).view(-1, 1)
            X_test = torch.tensor(df_test.iloc[:, :-1].values, dtype=torch.float32, device=device)
            y_test = torch.tensor(df_test.iloc[:, -1].values, dtype=torch.float32, device=device).view(-1, 1)
            
            y_test_numpy = df_test.iloc[:, -1].values
            
            model = MLP(structure, X_train.shape[1]).to(device)
            train(model, X_train, y_train, X_test, y_test, lr, epoch_length)
            
            prediction = model(X_test).cpu().detach().numpy()
            rmse_scores.append(np.sqrt(mean_squared_error(y_test_numpy, prediction)))
            r2_scores.append(r2_score(y_test_numpy, prediction))
            
        results.append({
        'epoch_length': epoch_length,
        'learning_rate': lr,
        'r2_mean': np.mean(r2_scores),
        'rmse_mean': np.mean(rmse_scores)
        })
    
    best_result = min(results, key=lambda x: x['r2_mean'])
    best_epochs = best_result['epoch_length']
    best_lr = best_result['learning_rate']
    
    return best_epochs, best_lr
        

In [86]:
structures = [
    {1: 8},
    {1: 4},
    {1: 4, 2: 4},
    {1: 8, 2: 4}
]

lrs = []
epochs = []

for structure in structures:
    best_epochs, best_lr = grid_search(structure)
    lrs.append(best_lr)
    epochs.append(best_epochs)
    
for i, structure in enumerate(structures):
    print(f"Structure: {structure}, Epochs: {epochs[i]}, Learning Rate: {lrs[i]}")

Epoch 50, Test Loss: 0.0033400801476091146, Train Loss: 0.004243063274770975: 100%|██████████| 100/100 [00:05<00:00, 18.87it/s]
Epoch 50, Test Loss: 0.0037708920426666737, Train Loss: 0.004117575008422136: 100%|██████████| 100/100 [00:05<00:00, 18.07it/s]
Epoch 50, Test Loss: 0.004681235644966364, Train Loss: 0.004137230571359396: 100%|██████████| 100/100 [00:05<00:00, 16.73it/s]
Epoch 50, Test Loss: 0.0035494451876729727, Train Loss: 0.00429045595228672: 100%|██████████| 100/100 [00:05<00:00, 19.76it/s]
Epoch 50, Test Loss: 0.006742347031831741, Train Loss: 0.004356927238404751: 100%|██████████| 100/100 [00:05<00:00, 17.70it/s]
Epoch 50, Test Loss: 0.004444521386176348, Train Loss: 0.0049465112388134: 100%|██████████| 100/100 [00:09<00:00, 10.96it/s]
Epoch 50, Test Loss: 0.005799348931759596, Train Loss: 0.003715687897056341: 100%|██████████| 100/100 [00:05<00:00, 17.77it/s]
Epoch 50, Test Loss: 0.0064147282391786575, Train Loss: 0.005354071501642466: 100%|██████████| 100/100 [00:06<0

Structure: {1: 8}, Epochs: 1500, Learning Rate: 0.1
Structure: {1: 4}, Epochs: 1500, Learning Rate: 0.1
Structure: {1: 4, 2: 4}, Epochs: 500, Learning Rate: 0.1
Structure: {1: 8, 2: 4}, Epochs: 100, Learning Rate: 0.1





In [68]:
from sklearn.metrics import r2_score, mean_squared_error
from itertools import product

epoch_lengts = [100, 500, 1000, 1500]
learning_rates = [0.001, 0.01, 0.1]

params = product(epoch_lengts, learning_rates)

r2_mlp8 = []
rmse_mlp8 = []

r2_mlp4 = []
rmse_mlp4 = []

r2_mlp44 = []
rmse_mlp44 = []

r2_mlp84 = []
rmse_mlp84 = []


for df_train, df_test in folds:
    X_train = torch.tensor(df_train.iloc[:, :-1].values, dtype=torch.float32, device=device)
    y_train = torch.tensor(df_train.iloc[:, -1].values, dtype=torch.float32, device=device).view(-1, 1)
    X_test = torch.tensor(df_test.iloc[:, :-1].values, dtype=torch.float32, device=device)
    y_test = torch.tensor(df_test.iloc[:, -1].values, dtype=torch.float32, device=device).view(-1, 1)
    
    y_test_numpy = df_test.iloc[:, -1].values
    
    model = MLP({1: 8}, X_train.shape[1]).to(device)
    train(model, X_train, y_train, X_test, y_test)
    
    prediction = model(X_test).cpu().detach().numpy()
    rmse_mlp8.append(np.sqrt(mean_squared_error(y_test_numpy, prediction)))
    r2_mlp8.append(r2_score(y_test_numpy, prediction))
    
    model = MLP({1: 4}, X_train.shape[1]).to(device)
    train(model, X_train, y_train, X_test, y_test)
    
    prediction = model(X_test).cpu().detach().numpy()
    rmse_mlp4.append(np.sqrt(mean_squared_error(y_test_numpy, prediction)))
    r2_mlp4.append(r2_score(y_test_numpy, prediction))
    
    model = MLP({1: 4, 2: 4}, X_train.shape[1]).to(device)
    train(model, X_train, y_train, X_test, y_test)
    
    prediction = model(X_test).cpu().detach().numpy()
    rmse_mlp44.append(np.sqrt(mean_squared_error(y_test_numpy, prediction)))
    r2_mlp44.append(r2_score(y_test_numpy, prediction))
    
    model = MLP({1: 8, 2: 4}, X_train.shape[1]).to(device)
    train(model, X_train, y_train, X_test, y_test)
    
    prediction = model(X_test).cpu().detach().numpy()
    rmse_mlp84.append(np.sqrt(mean_squared_error(y_test_numpy, prediction)))
    r2_mlp84.append(r2_score(y_test_numpy, prediction))
    
r2_mlp8_mean = np.mean(r2_mlp8)
rmse_mlp8_mean = np.mean(rmse_mlp8)

r2_mlp4_mean = np.mean(r2_mlp4)
rmse_mlp4_mean = np.mean(rmse_mlp4)

r2_mlp44_mean = np.mean(r2_mlp44)
rmse_mlp44_mean = np.mean(rmse_mlp44)

r2_mlp84_mean = np.mean(r2_mlp84)
rmse_mlp84_mean = np.mean(rmse_mlp84)
    

Epoch 950, Test Loss: 0.002307337708771229, Train Loss: 0.0021706109400838614: 100%|██████████| 1000/1000 [00:50<00:00, 19.77it/s]
Epoch 950, Test Loss: 0.003010334214195609, Train Loss: 0.003253083908930421: 100%|██████████| 1000/1000 [00:45<00:00, 21.98it/s] 
Epoch 950, Test Loss: 0.002046388341113925, Train Loss: 0.001738176797516644: 100%|██████████| 1000/1000 [00:54<00:00, 18.31it/s] 
Epoch 950, Test Loss: 0.00218328763730824, Train Loss: 0.0019523411756381392: 100%|██████████| 1000/1000 [01:04<00:00, 15.61it/s] 
Epoch 950, Test Loss: 0.005286112427711487, Train Loss: 0.003327341051772237: 100%|██████████| 1000/1000 [00:54<00:00, 18.47it/s]
Epoch 950, Test Loss: 0.007844577543437481, Train Loss: 0.0026930051390081644: 100%|██████████| 1000/1000 [00:58<00:00, 17.09it/s]
Epoch 950, Test Loss: 0.004448226187378168, Train Loss: 0.002190494444221258: 100%|██████████| 1000/1000 [01:09<00:00, 14.29it/s] 
Epoch 950, Test Loss: 0.004263903945684433, Train Loss: 0.0034298335667699575: 100%|

## Problem 3: Combining Technical Analysis Indicators with 2D CNN on Bitcoin Direction Prediction

In this problem, we will focus on adding technical analysis indicators to original series, which will help us convert it into 2D Image. We will use the following technical indicators from TA-Lib library in Python: MACD, RSI, CMO, MOM, Bollinger Bands, SMA. In general, technical analysis indicators are financial indicators which give trades a guidance about the market. Our train period is 2021-2022 and test period will be 2022-2023.

We will use historical 6 days closing price, build up 6x6 image by calculating technical indicators, and predict the direction for the next day (whether the price will be up or down). We will use a single convolutional layer followed by Fully Connected Layer where kernel size=(2,2) can be set. 

In [202]:
#Solution 3
df_test = pd.read_csv('btc_test.csv')
df_train = pd.read_csv('btc_train.csv')

df = pd.concat([df_train, df_test])
df = df.dropna()
df = df[["Date", "Close"]]
df.set_index("Date", inplace=True)

df = df.dropna()

df

Unnamed: 0_level_0,Close
Date,Unnamed: 1_level_1
2021-01-01,29374.152344
2021-01-02,32127.267578
2021-01-03,32782.023438
2021-01-04,31971.914063
2021-01-05,33992.429688
...,...
2022-12-27,16717.173828
2022-12-28,16552.572266
2022-12-29,16642.341797
2022-12-30,16602.585938


In [203]:
import talib
df["MACD"] = talib.MACD(df["Close"])[0]
df["RSI"] = talib.RSI(df["Close"])
df["CMO"] = talib.CMO(df["Close"])
df["MOM"] = talib.MOM(df["Close"])
df["BBANDS"] = talib.BBANDS(df["Close"])[0]
df["SMA"] = talib.SMA(df["Close"])
df["next_close"] = df.loc[:, "Close"].shift(-1)
df["direction"] = (df["next_close"] > df["Close"]).astype(int)

df.drop(labels=["Close", "next_close"], axis = 1, inplace=True)
df = df.dropna()
df

Unnamed: 0_level_0,MACD,RSI,CMO,MOM,BBANDS,SMA,direction
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2021-02-03,-226.422720,62.098634,24.197269,5182.710938,37925.235103,35262.511524,0
2021-02-04,-15.039748,60.319527,20.639054,4559.673828,38809.440003,35360.299414,1
2021-02-05,247.926798,62.875123,25.750246,5574.458985,39594.868785,35404.297591,1
2021-02-06,540.609933,65.103681,30.207362,8833.464844,39964.391466,35400.796550,0
2021-02-07,734.836297,63.771135,27.542271,5437.343750,39879.156250,35337.657617,1
...,...,...,...,...,...,...,...
2022-12-27,-103.142239,44.860801,-10.278397,-77.917969,16958.022581,16964.124935,0
2022-12-28,-120.315843,42.121070,-15.757859,-205.404297,17034.413436,16975.299935,1
2022-12-29,-125.238723,44.125248,-11.749504,202.662109,16999.776595,16981.878581,0
2022-12-30,-130.839866,43.408369,-13.183263,-303.718750,16943.487726,16963.012565,0


In [204]:
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame(columns=df.columns, index=df.index, data=MinMaxScaler().fit_transform(df))
df

Unnamed: 0_level_0,MACD,RSI,CMO,MOM,BBANDS,SMA,direction
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2021-02-03,0.479715,0.694851,0.694851,0.741884,0.402851,0.402763,0.0
2021-02-04,0.500722,0.665358,0.665358,0.724384,0.419576,0.404884,1.0
2021-02-05,0.526856,0.707724,0.707724,0.752888,0.434433,0.405839,1.0
2021-02-06,0.555943,0.744669,0.744669,0.844428,0.441423,0.405763,0.0
2021-02-07,0.575245,0.722578,0.722578,0.749036,0.439810,0.404393,1.0
...,...,...,...,...,...,...,...
2022-12-27,0.491967,0.409085,0.409085,0.594122,0.006249,0.005795,0.0
2022-12-28,0.490260,0.363666,0.363666,0.590541,0.007694,0.006038,1.0
2022-12-29,0.489771,0.396891,0.396891,0.602003,0.007039,0.006180,0.0
2022-12-30,0.489214,0.385006,0.385006,0.587780,0.005974,0.005771,0.0


In [205]:
df_test = df.loc["2022-01-01":"2023-01-01"]
df_train = df.drop(df_test.index)

In [206]:
def get_formatted_X_y(_df: pd.DataFrame, lag_length):
    X = []
    y = []
    
    for i in range(len(_df) - lag_length):
        endi = i + lag_length
        
        X.append(_df.iloc[i:endi, :-1].values)
        y.append(_df.iloc[endi - 1, -1])
    
    return np.array(X).reshape(-1, 1, lag_length, 6), np.array(y).reshape(-1, 1)

X_train, y_train = get_formatted_X_y(df_train, 6)
X_test, y_test = get_formatted_X_y(df_test, 6)

In [207]:
import torch
from torch import nn

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        
        layers = []
        layers.append(nn.Conv2d(1, 1, kernel_size=(2, 2)))
        layers.append(nn.ReLU())
        layers.append(nn.Flatten())
        layers.append(nn.Linear(25, 1))
        layers.append(nn.Sigmoid())
        
        self.model = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.model(x)

In [208]:
device = "cuda" if torch.cuda.is_available() else "cpu"

X_test = torch.tensor(X_test, dtype=torch.float32, device=device)
y_test = torch.tensor(y_test, dtype=torch.float32, device=device)

X_train = torch.tensor(X_train, dtype=torch.float32, device=device)
y_train = torch.tensor(y_train, dtype=torch.float32, device=device)

In [209]:
y_test

tensor([[0.],
        [1.],
        [1.],
        [0.],
        [1.],
        [1.],
        [0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [1.],
        [0.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.],
        [1.],
        [0.],
        [1.],
        [1.],
        [1.],
        [0.],
        [1.],
        [1.],
        [0.],
        [1.],
        [1.],
        [0.],
        [1.],
        [1.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [1.],
        [0.],
        [0.],
        [1.],
        [0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [1.],
        [0.],
        [0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [1.],
        [0.],
        [1.],
        [0.],
        [1.],
        [0.],
        [1.],
      

In [226]:
from torch.utils.data import DataLoader, TensorDataset

def train(_model: CNN, X_train: torch.Tensor, y_train: torch.Tensor, X_test: torch.Tensor, y_test: torch.Tensor, learning_rate, epochs):
    batch_size = 32
    optimizer = torch.optim.Adam(_model.parameters(), lr=learning_rate)
    loss_function = nn.MSELoss()
    
    dataset = TensorDataset(X_train, y_train)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    
    epoch_bar = tqdm(range(epochs))
    for epoch in epoch_bar:
        for X, y in dataloader:
            pred = _model(X)
            loss = loss_function(pred, y)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        if epoch % 5 == 0:
            acc = torch.sum((torch.round(_model(X_test)) == y_test).float()) / len(y_test)
            
            epoch_bar.set_description(f"Epoch {epoch}, Test Loss: {loss_function(_model(X_test), y_test)}, Train Loss: {loss_function(_model(X_train), y_train)}, Accuracy:"
                                      f"{acc}")

In [228]:
cnn = CNN().to(device)
train(cnn, X_train, y_train, X_test, y_test, 1e-4, 1000)

Epoch 230, Test Loss: 0.25527423620224, Train Loss: 0.2492552399635315, Accuracy:0.470752090215683:  23%|██▎       | 232/1000 [00:04<00:15, 49.60it/s]      


KeyboardInterrupt: 

## Problem 4: Multivariate LSTM for Predicting EPS (Earnings per Share) over Company Fundamentals

In this problem, we will focus on predicting Earnings Per Share (EPS) by jointly modeling historical fundamentals where fundamentals for multiple companies in "fundamentals.csv" file for each year. Number of latent dimension of LSTM can be [5, 10, 30] and the best one can be determined by hyperparameter search. On the other hand, learning rate and number of epochs should be carefully tuned. Our evaluation metric will be MAPE score.

In [55]:
#Solution 4