## Coding test - Flora Y. SUN - 18th June 2023
Note: "we" throughout the following sections is the academic "we". This project was completed solely by Flora Y. SUN.

In [1]:
import pandas as pd
import numpy as np 
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import mean_squared_error
from sklearn import preprocessing
import datetime as dt
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
# from sklearn.model_selection import GridSearchCV
# from sklearn.model_selection import cross_val_score

import warnings
warnings.filterwarnings('ignore')

In [2]:
## import data
df = pd.read_csv('data.csv').dropna()
df.rename(columns={'ticker': 'stock', 'last': 'price'}, inplace=True)
date_format = '%Y-%m-%d'
df['date'] = df['date'].apply(lambda x: dt.datetime.strptime(x, date_format))
df['year'] = df['date'].apply(lambda x: x.year)

## 0 - Introduction

【General Ideas】

1. We used data before 2019 for model training and data from 2019 onward for backtesting.
2. Stock-price prediction: for each stock, we build an ensemble model consisting of **SVR, Random Forest, GBM, LSTM** on the training set, and then apply the model to the testing set to predict the stock price. We used the prices of the last 30 days as predictors; in practice the training step (the window length) could be changed to improve model performance.
3. Backtesting: after obtaining the predicted prices, we moved on to backtesting, where we long the 10% most undervalued stocks and short the 10% most overvalued stocks on every trading day, with equal weights. (Alternatively, with market-value data, positions could be weighted by market value.)
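
The windowing and chronological split described in the list above can be sketched on a small synthetic series. This is only an illustration: `make_lagged_frame`, the 5-day window and the synthetic prices are assumptions made for the example, not part of the project code.

```python
import numpy as np
import pandas as pd


def make_lagged_frame(prices: pd.Series, n_lags: int) -> pd.DataFrame:
    """Return the target price together with its previous n_lags values."""
    frame = pd.DataFrame({"price": prices})
    for lag in range(1, n_lags + 1):
        # price_lastday1 is yesterday's price, price_lastday2 the day before, ...
        frame[f"price_lastday{lag}"] = prices.shift(lag)
    # the first n_lags rows have incomplete history and are dropped
    return frame.dropna()


# synthetic business-day prices (hypothetical data, for illustration only)
dates = pd.bdate_range("2018-12-01", "2019-01-31")
prices = pd.Series(np.linspace(100.0, 120.0, len(dates)), index=dates)

lagged = make_lagged_frame(prices, n_lags=5)

# chronological split: everything before 2019 trains, the rest is held out
train = lagged[lagged.index.year < 2019]
test = lagged[lagged.index.year >= 2019]
```

Splitting by date rather than at random keeps every training row strictly earlier than every test row.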

【Weak Model Selection】
1. SVR: SVR is a classic machine learning model that does not require independence between observations, so our time-series data does not contradict the model's assumptions.
2. Random Forest: Random forest is a tree-based model. It builds multiple deep trees independently of one another. When building each tree, it bootstraps a subset of observations and selects a subset of variables for modelling.
3. GBM: GBM is also a tree-based model. It builds multiple shallow trees one by one. Ideally we would train an XGBoost model. However, an XGBoost model is better built on top of a well-trained basic/stochastic GBM model, so we skipped that step and used a basic GBM.
4. LSTM: we tried LSTM because this neural network can "memorize" historical information and recall it when necessary. It has been widely used for time-series and text data (essentially, data that require the model to "memorize" what came before).
5. Notes
    - We did not try OLS since it requires independence across observations, an assumption that time-series data apparently violate.
    - We did not try naive Bayes since it requires independence across predictors. Because our predictors are historical values of the same stock, this assumption may be violated significantly.

【Ensemble Model】
 - We took the average of the values predicted by the weak models as the final prediction, since some models may overfit while others may underfit. In our case, however, since we did not pay much attention to hyperparameter tuning, it is likely that all models underfit. We still include this step because in practice it would need to be considered.
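
A minimal sketch of this equal-weight averaging, assuming synthetic features and default or near-default scikit-learn settings (the data, the `models` dictionary and `n_estimators=50` are illustrative choices, not the project's configuration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.svm import SVR

# synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + rng.normal(scale=0.1, size=200)

# keep the split chronological: the first 150 rows train, the last 50 test
X_train, X_test, y_train = X[:150], X[150:], y[:150]

models = {
    "SVM": SVR(),
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
    "GBM": GradientBoostingRegressor(random_state=0),
}

# one column of test predictions per weak model
preds = np.column_stack(
    [model.fit(X_train, y_train).predict(X_test) for model in models.values()]
)

# the ensemble prediction is the plain row-wise mean
ensemble_pred = preds.mean(axis=1)
```

By construction the averaged prediction always lies between the smallest and largest individual prediction for each row.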

【Future works along this vein】
 - We did not pay much attention to hyperparameter tuning since that process may take too much time. However, the models could be improved by tuning their hyperparameters.
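
One possible starting point for such tuning is a grid search with time-ordered folds, so that each validation fold lies after its training fold. The sketch below assumes synthetic data and a deliberately small grid; `TimeSeriesSplit` is a choice made for this example and is not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

# synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=120)

# small illustrative grid; a real search would cover more values
param_grid = {"kernel": ["rbf", "linear"], "C": [1, 10]}

# each split trains on an initial segment and validates on the segment after it
cv = TimeSeriesSplit(n_splits=3)

search = GridSearchCV(SVR(), param_grid, cv=cv, scoring="neg_mean_squared_error")
search.fit(X, y)

best = search.best_params_
```

After fitting, `search.best_estimator_` is refitted on all the supplied rows and can be used for prediction directly.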

【Adding more information】
 - Should we have more information (or more time to collect it), we may consider incorporating more factors into our analysis. Those factors should help us predict the stock price. For new factors, we may conduct factor analysis (paying attention to IC, IR, etc.) and test the stock selection performance through backtesting (paying attention to the relative performance of stocks in each factor-value quantile).
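
As an illustration of the factor analysis mentioned above, the sketch below computes a per-day rank IC (the correlation between factor ranks and forward-return ranks across stocks) and an IR defined as mean IC divided by the standard deviation of IC. This is one common definition and is assumed here; the `panel` data, the column names and `daily_rank_ic` are hypothetical.

```python
import numpy as np
import pandas as pd


def daily_rank_ic(panel: pd.DataFrame, factor_col: str, return_col: str) -> pd.Series:
    """Rank IC per date: correlation between factor ranks and forward-return ranks."""
    ics = {
        date: day[factor_col].rank().corr(day[return_col].rank())
        for date, day in panel.groupby("date")
    }
    return pd.Series(ics, name="rank_ic")


# toy cross-sections: four stocks on three dates (hypothetical values)
panel = pd.DataFrame({
    "date": ["2019-01-04"] * 4 + ["2019-01-07"] * 4 + ["2019-01-08"] * 4,
    "factor": [1.0, 2.0, 3.0, 4.0] * 3,
    "fwd_return": [
        0.01, 0.02, 0.03, 0.04,  # same ordering as the factor
        0.04, 0.03, 0.02, 0.01,  # reversed ordering
        0.01, 0.03, 0.02, 0.04,  # two middle stocks swapped
    ],
})

ic = daily_rank_ic(panel, "factor", "fwd_return")
ic_mean = ic.mean()
ic_ir = ic.mean() / ic.std()
```

A factor whose IC is consistently positive (or consistently negative) across days is a candidate for the quantile backtest described above.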

## 1 - Data Preparation

In [3]:
## data preprocessing

### get the stocklist
stocklist = list(set(df.stock.values.flatten()))

### get the training step - use the last 30 days to predict the next day
training_step = 30

### get the training dict and testing dict.
### key: stock name
### value: dataframe containing the training data/testing data
stock_dict = {}

for stock in stocklist:
    stock_df = df[df.stock == stock]
    # only keep stocks with more than 500 data points in both the training and testing sets
    if stock_df[stock_df.year < 2019].shape[0] > 500 and stock_df[stock_df.year >= 2019].shape[0] > 500:
        stock_dict[stock] = stock_df


def get_training_dataset(stock_df, normalize):
    # work on a copy so the caller's frame is never modified
    stock_df = stock_df.copy(deep=True)
    if normalize:
        # the price scaler is kept global so that un_normalize() can invert it;
        # note that both scalers are fitted on the full history, test period included
        global scaler
        scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
        volume_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
        stock_df['price'] = scaler.fit_transform(stock_df[['price']]).reshape(-1)
        stock_df['volume'] = volume_scaler.fit_transform(stock_df[['volume']]).reshape(-1)

    # lagged features: price and volume of each of the previous `training_step` days
    for i in range(1, 1 + training_step):
        stock_df['price_lastday{}'.format(i)] = stock_df.price.shift(i)
        stock_df['volume_lastday{}'.format(i)] = stock_df.volume.shift(i)
    stock_df.dropna(inplace=True)
    stock_df.reset_index(drop=True, inplace=True)

    train_df = stock_df[stock_df.year < 2019]
    test_df = stock_df[stock_df.year >= 2019]
    return train_df, test_df


def un_normalize(normalized_series):
    # map scaled prices back to price units using the most recently fitted price scaler
    un_normalized = scaler.inverse_transform(np.array(normalized_series).reshape(-1, 1))
    return un_normalized
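
Note that `get_training_dataset` fits the `MinMaxScaler` on the full price history, so the scaling range also reflects prices from the test period. A possible alternative, sketched below on hypothetical toy data, fits the scaler on the training rows only and then applies it to every row; test-period values may then fall outside the (0, 1) range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# toy price history (hypothetical values, for illustration only)
prices = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2019],
    "price": [100.0, 110.0, 120.0, 130.0, 90.0],
})
train_mask = prices["year"] < 2019

# fit on the training rows only, so no test-period information leaks in
price_scaler = MinMaxScaler(feature_range=(0, 1))
price_scaler.fit(prices.loc[train_mask, ["price"]])

# apply the same transformation to all rows
prices["price_scaled"] = price_scaler.transform(prices[["price"]]).reshape(-1)

# the inverse transform recovers the original price units
restored = price_scaler.inverse_transform(
    prices["price_scaled"].to_numpy().reshape(-1, 1)
).reshape(-1)
```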


In [4]:
#trytrain = train_dict['8252 JT'].copy()
#trytest = test_dict['8252 JT'].copy()

In [4]:
def ML_model(stock_df): 
    train_df, test_df = get_training_dataset(stock_df, normalize=False)
    train_y = train_df.iloc[:, 2].values
    train_x = train_df.iloc[:, 5:].values
    test_x = test_df.iloc[:, 5:].values

    model_svr = SVR()
    model_svr.fit(train_x, train_y)
    ## Hyperparameter tuning belongs here, but we skip it.
    ## If we were to tune, the following code (which needs GridSearchCV imported) could serve as a starting point:
    
    # n_folds = 5
    # parameters = {'kernel': ('rbf', 'linear', 'poly'), 'C': [1, 5, 10]}
    # clf = GridSearchCV(model_svr, parameters, cv=n_folds)
    # clf.fit(train_x, train_y)
    # pred_y_svm = clf.predict(test_x)

    model_rf = RandomForestRegressor()
    model_rf.fit(train_x, train_y)

    model_gbm = GradientBoostingRegressor()
    model_gbm.fit(train_x, train_y)

    pred_y_svm = model_svr.predict(test_x)
    pred_y_rf = model_rf.predict(test_x)
    pred_y_gbm = model_gbm.predict(test_x)
    
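    # copy first so the new columns are not written into a slice of the lagged frame
    test_df = test_df.copy()
    test_df['pred_price_SVM'] = pred_y_svm
    test_df['pred_price_RF'] = pred_y_rf
    test_df['pred_price_GBM'] = pred_y_gbm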

    results_df = test_df.loc[:, ["date",'pred_price_SVM', 'pred_price_RF', 'pred_price_GBM']] 
    stock_df = pd.merge(stock_df, results_df, on ='date')
    
    return stock_df

In [5]:
def LSTM_model(stock_df):
    train_df, test_df = get_training_dataset(stock_df, normalize=True)
    xtrain_price = []
    xtrain_volume = []
    
    train_price_df = train_df.filter(like='price_', axis=1)
    train_volume_df = train_df.filter(like='volume_', axis=1)
    
    for i in range(train_price_df.shape[0]):
        xtrain_price.append(train_price_df.iloc[i])
    for i in range(train_volume_df.shape[0]):
        xtrain_volume.append(train_volume_df.iloc[i])
    xtrain_price, xtrain_volume = np.array(xtrain_price), np.array(xtrain_volume)
    
    X_train = np.stack([xtrain_price], axis = 2)
    # add xtrain_volume to np.stack list if want to add volume as a feature
    y_train = np.array(train_df.price)
    y_train = np.reshape(y_train,(len(y_train),1))
    
    #return X_train.shape(), y_train.shape()
    
    regressor = Sequential()
    regressor.add(LSTM(units = 100, return_sequences = True, input_shape = (X_train.shape[1],X_train.shape[2])))
    regressor.add(Dropout(0.2))

    regressor.add(LSTM(units = 50, return_sequences = True))
    regressor.add(Dropout(0.2))

    regressor.add(LSTM(units = 50, return_sequences = True))
    regressor.add(Dropout(0.2))

    regressor.add(LSTM(units = 25))
    regressor.add(Dropout(0.2))

    #regressor.add(Dense(units = 50, activation='relu'))
    regressor.add(Dense(units=1))

    regressor.compile(optimizer='adam', loss = 'mean_squared_error')
    regressor.fit(X_train, y_train, epochs = 50, batch_size = 32)

    xtest_price = []
    xtest_volume = []
    
    test_price_df = test_df.filter(like='price_', axis=1)
    test_volume_df = test_df.filter(like='volume_', axis=1)
    
    for i in range(test_price_df.shape[0]):
        xtest_price.append(test_price_df.iloc[i])
    for i in range(test_volume_df.shape[0]):
        xtest_volume.append(test_volume_df.iloc[i])
    xtest_price, xtest_volume = np.array(xtest_price), np.array(xtest_volume)
    
    X_test = np.stack([xtest_price], axis = 2)
    y_prediction = regressor.predict(X_test)

    # map the scaled predictions back to price units
    test_df = test_df.copy()
    test_df['pred_price_LSTM'] = un_normalize(y_prediction).reshape(-1)
    results_df = test_df.loc[:, ["date",'pred_price_LSTM']] 
    stock_df = pd.merge(stock_df, results_df, on ='date')
    
    return stock_df
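
The row-by-row loops in `LSTM_model` build an array of shape (samples, timesteps, features). The sketch below, on a hypothetical two-row frame, shows that the same array can be obtained directly from the DataFrame. It also shows how the time axis could be reversed if oldest-to-newest ordering is wanted, since the `price_lastday1` column (the most recent day) comes first; whether that ordering helps is not tested here.

```python
import numpy as np
import pandas as pd

# two samples with three lagged prices each (hypothetical values)
lags = pd.DataFrame({
    "price_lastday1": [3.0, 4.0],
    "price_lastday2": [2.0, 3.0],
    "price_lastday3": [1.0, 2.0],
})

# loop-based construction, as in the notebook
rows = np.array([lags.iloc[i] for i in range(len(lags))])
X_loop = np.stack([rows], axis=2)

# direct construction: add a trailing feature axis
X_direct = lags.to_numpy()[:, :, np.newaxis]

# optional: reverse the time axis so each sequence runs oldest to newest
X_chrono = X_direct[:, ::-1, :]
```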

In [9]:
# apply ML algorithms on each stock
test_set = pd.DataFrame()

for stock in list(stock_dict.keys()):
    # to try a single random stock instead, comment out the line above and uncomment the line below (needs `import random`)
# for stock in [random.choice(list(stock_dict.keys()))]:
    ML_result = ML_model(stock_dict[stock].copy())
    LSTM_result = LSTM_model(stock_dict[stock].copy())
    test_result = pd.concat([ML_result, LSTM_result], axis=1).drop_duplicates()
    test_set = pd.concat([test_set, test_result], axis=0)


## Note: This part was not completely finished due to time limit. Only part of the results are shown below.

Epoch 1/50
...
Epoch 50/50
Epoch 1/50
...
Epoch 35/5

KeyboardInterrupt: 

In [10]:
test_set = test_set.T.drop_duplicates().T

In [11]:
test_set

| | stock | date | price | volume | year | pred_price_SVM | pred_price_RF | pred_price_GBM | pred_price_LSTM |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6301 JT | 2019-01-04 | 2129.1624 | 8288500 | 2019 | 2043.846443 | 2192.032629 | 2195.132148 | 2174.094971 |
| 1 | 6301 JT | 2019-01-07 | 2275.7284 | 6711000 | 2019 | 2039.949457 | 2155.446811 | 2152.040612 | 2164.656738 |
| 2 | 6301 JT | 2019-01-08 | 2300.854 | 6721500 | 2019 | 2037.450294 | 2294.8768 | 2293.01569 | 2183.822754 |
| 3 | 6301 JT | 2019-01-09 | 2373.9044 | 6645900 | 2019 | 2033.645516 | 2297.973262 | 2288.165875 | 2209.687988 |
| 4 | 6301 JT | 2019-01-10 | 2360.411 | 5910000 | 2019 | 2030.337745 | 2368.14555 | 2368.174291 | 2248.574463 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 530 | 6504 JT | 2021-03-15 | 4800.0 | 563000 | 2021 | 2290.940481 | 4321.67037 | 4213.879559 | 4170.245605 |
| 531 | 6504 JT | 2021-03-16 | 4775.0 | 487100 | 2021 | 2290.530843 | 4330.911916 | 4224.744946 | 4194.691406 |
| 532 | 6504 JT | 2021-03-17 | 4800.0 | 449400 | 2021 | 2288.939472 | 4303.590587 | 4265.959865 | 4214.552734 |
| 533 | 6504 JT | 2021-03-18 | 4790.0 | 733200 | 2021 | 2288.614862 | 4325.470405 | 4246.570936 | 4232.743164 |


## 3 - Backtesting

### 3.1 - Generate the position of each stock on each day

In [16]:
test_set1 = test_set.copy()
# return from buying the stock at the day's close and selling it at the next day's close
test_set1['return'] = (test_set1.groupby('stock')['price'].shift(-1) - test_set1['price'])/test_set1['price']

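# ensemble prediction for each day: mean of the SVM, RF and GBM predictions (pred_price_LSTM is not included here)
test_set1['predicted_price'] = test_set1[['pred_price_SVM', 'pred_price_RF', 'pred_price_GBM']].mean(axis=1)
# shift within each stock so that every row holds the prediction for the next day
test_set1['predicted_price1'] = test_set1.groupby('stock')['predicted_price'].shift(-1)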

# generate the undervalued/overvalued signal
test_set1['under_valued'] = test_set1['predicted_price1'] - test_set1['price']

# select the 10% most undervalued stocks and the 10% most overvalued stocks for each day; generate the long-short position
test_set1['under_valued_rank'] = test_set1.groupby('date')['under_valued'].rank(ascending=False)
test_set1['under_valued_rank'] = test_set1['under_valued_rank'] / test_set1.groupby('date')['under_valued_rank'].transform('max')
test_set1['position'] = test_set1['under_valued_rank'].apply(lambda x: 1 if x <= 0.1 else (-1 if x >= 0.9 else 0))
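
To see how the ranking rule above behaves, the sketch below applies the same three steps to a hypothetical one-day cross-section of ten stocks. With exactly ten stocks the normalized ranks are 0.1, 0.2, ..., 1.0, so the rule selects one long position (rank 0.1) but two short positions (ranks 0.9 and 1.0); the toy data and column names are assumptions made for this example.

```python
import pandas as pd

# ten stocks on one day, S0 the most undervalued (hypothetical values)
toy = pd.DataFrame({
    "date": ["2019-01-04"] * 10,
    "stock": [f"S{i}" for i in range(10)],
    "under_valued": [9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0],
})

# rank 1 = most undervalued, then normalize by the largest rank of the day
toy["rank"] = toy.groupby("date")["under_valued"].rank(ascending=False)
toy["rank_pct"] = toy["rank"] / toy.groupby("date")["rank"].transform("max")

# same thresholds as in the notebook
toy["position"] = toy["rank_pct"].apply(lambda x: 1 if x <= 0.1 else (-1 if x >= 0.9 else 0))

n_long = int((toy["position"] == 1).sum())
n_short = int((toy["position"] == -1).sum())
```

If symmetric legs are wanted, one option would be to use a strict threshold on the short side (for example `x > 0.9`), at the cost of changing the backtest results shown in this notebook.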

### 3.2 - Calculate the daily return of the strategy

In [32]:
return_df = pd.DataFrame(index=test_set1.date.unique(), columns=['return'])

for date in test_set1.date.unique():
    df_long = test_set1[(test_set1.date == date) & (test_set1.position == 1)]
    return_long = np.mean(np.array(df_long['return'])+1)
    df_short = test_set1[(test_set1.date == date) & (test_set1.position == -1)]
    return_short = np.mean(np.array(df_short['return'])+1)
    return_df.loc[date, 'return'] = return_long - return_short
return_df.head()

| | return |
|---|---|
| 2019-01-04 | -0.001884 |
| 2019-01-07 | 0.018005 |
| 2019-01-08 | 0.010172 |
| 2019-01-09 | 0.000998 |
| 2019-01-10 | 0.020964 |
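
The per-date loop above can also be written as a single group-by. The sketch below uses a hypothetical two-day frame with the same `date`, `position` and `return` columns; because the `+1` terms cancel in the difference, the long-short return is simply the mean long return minus the mean short return.

```python
import numpy as np
import pandas as pd

# toy positions and next-day returns (hypothetical values)
toy = pd.DataFrame({
    "date": ["d1"] * 4 + ["d2"] * 4,
    "position": [1, 1, -1, 0, 1, -1, -1, 0],
    "return": [0.02, 0.04, -0.01, 0.50, 0.01, 0.03, 0.01, -0.50],
})

# mean return of each leg per day; rows with position 0 are ignored
active = toy[toy["position"] != 0]
leg_mean = active.groupby(["date", "position"])["return"].mean().unstack("position")

# equal-weight long-short return per day
long_short = leg_mean[1] - leg_mean[-1]
```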


### 3.3 - Evaluation Criteria (to be computed)
- Annualized Return
- Annualized Volatility
- Sharpe Ratio = (Annualized Return - Risk-Free Return) / Annualized Volatility
- Maximum Drawdown
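
A minimal sketch of how these criteria could be computed from a series of daily strategy returns. The function name `performance_summary`, the default of 252 trading days per year and the zero risk-free rate are assumptions made for this example; the notebook itself does not compute these figures.

```python
import numpy as np


def performance_summary(daily_returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized return, annualized volatility, Sharpe ratio and maximum drawdown."""
    r = np.asarray(daily_returns, dtype=float)
    r = r[~np.isnan(r)]  # e.g. the last day has no next-day return

    # equity curve starting from one unit of capital
    equity = np.concatenate([[1.0], np.cumprod(1.0 + r)])

    # geometric annualization of the total compounded return
    ann_return = equity[-1] ** (periods_per_year / len(r)) - 1.0
    ann_vol = r.std(ddof=1) * np.sqrt(periods_per_year)
    sharpe = (ann_return - risk_free_rate) / ann_vol

    # largest peak-to-trough loss, expressed as a negative fraction
    running_max = np.maximum.accumulate(equity)
    max_drawdown = ((equity - running_max) / running_max).min()

    return {
        "annualized_return": ann_return,
        "annualized_volatility": ann_vol,
        "sharpe_ratio": sharpe,
        "max_drawdown": max_drawdown,
    }


# three periods treated as one "year" so the numbers are easy to check by hand
summary = performance_summary([0.10, -0.50, 0.20], periods_per_year=3)
```

With the strategy output, the call would be along the lines of `performance_summary(return_df['return'])`.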