[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/witchapong/build-ai-based-applications/blob/main/tabular/4_make_prediction.ipynb)

# Stock Price Prediction using ML model
In this session, we'll learn how to build a ML model for predicting **%change of stock prices of the next day** of stocks in SET index (Stock Exchange of Thailand). Thus, we should be able to use the prediction to buy stocks that are going up the next day, make profits, and hopefully get rich!

This session is divided into the following 5 notebooks.
1. `1_collect_data.ipynb`
2. `2_eda.ipynb`
3. `3_features_prep.ipynb`
4. `4_make_prediction.ipynb` (current notebook)
5. `5_evaluation.ipynb`

# Make Prediction
So far, we've already prepared the features from the previous step. In this notebook, we'll finally train a ML model and use it to make prediction of return on the next day. However, before we can feed the features to a model, we need to do some more data processing step for converting any non-numerical feature column into number as a ML model only accept numerical representation of data i.e. String needs to be encoded as Interger.

In [89]:
from tqdm.notebook import tqdm

import numpy as np
import pandas as pd
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

idx = pd.IndexSlice

import lightgbm as lgb

import math

# Load and Process Features

In [14]:
data_df = pd.read_csv("data/model_data.csv")

In [15]:
data_df = data_df.set_index(["symbol", "date"])

In [16]:
# encode categorical variables
categoricals = ["year", "month", "weekday", "industry", "sector"]

for feat in categoricals:
    data_df[feat] = pd.factorize(data_df[feat], sort=True)[0]

In [17]:
labels = sorted(data_df.filter(like="_fwd").columns)
features = data_df.columns.difference(labels).tolist()

In [19]:
data_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,open,high,low,close,volume,dividends,stock splits,value,val_rank,rsi,bb_high,bb_low,NATR,ATR,PPO,MACD,industry,sector,r01,r05,r10,r21,r42,r63,r01dec,r05dec,r10dec,r21dec,r42dec,r63dec,r01q_sector,r05q_sector,r10q_sector,r21q_sector,r42q_sector,r63q_sector,r01_fwd,r05_fwd,r21_fwd,income_stmt_taxEffectOfUnusualItems_p1y,income_stmt_taxRateForCalcs_p1y,income_stmt_normalizedEBITDA_p1y,income_stmt_totalUnusualItems_p1y,income_stmt_totalUnusualItemsExcludingGoodwill_p1y,income_stmt_netIncomeFromContinuingOperationNetMinorityInterest_p1y,income_stmt_reconciledDepreciation_p1y,income_stmt_reconciledCostOfRevenue_p1y,income_stmt_eBITDA_p1y,income_stmt_eBIT_p1y,income_stmt_netInterestIncome_p1y,...,income_stmt_normalizedIncome_p3y,income_stmt_netIncomeFromContinuingAndDiscontinuedOperation_p3y,income_stmt_totalExpenses_p3y,income_stmt_dilutedAverageShares_p3y,income_stmt_basicAverageShares_p3y,income_stmt_dilutedEPS_p3y,income_stmt_basicEPS_p3y,income_stmt_dilutedNIAvailtoComStockholders_p3y,income_stmt_netIncomeCommonStockholders_p3y,income_stmt_netIncome_p3y,income_stmt_netIncomeIncludingNoncontrollingInterests_p3y,income_stmt_netIncomeContinuousOperations_p3y,income_stmt_taxProvision_p3y,income_stmt_pretaxIncome_p3y,income_stmt_otherIncomeExpense_p3y,income_stmt_otherNonOperatingIncomeExpenses_p3y,income_stmt_specialIncomeCharges_p3y,income_stmt_gainOnSaleOfPpe_p3y,income_stmt_otherSpecialCharges_p3y,income_stmt_gainOnSaleOfSecurity_p3y,income_stmt_netNonOperatingInterestIncomeExpense_p3y,income_stmt_interestExpenseNonOperating_p3y,income_stmt_interestIncomeNonOperating_p3y,income_stmt_operatingIncome_p3y,income_stmt_operatingExpense_p3y,income_stmt_sellingGeneralAndAdministration_p3y,income_stmt_sellingAndMarketingExpense_p3y,income_stmt_generalAndAdministrativeExpense_p3y,income_stmt_otherGandA_p3y,income_stmt_salariesAndWages_p3y,income_stmt_grossProfit_p3y,income_stmt_costOfRevenue_p3y,income_stmt_totalRevenue_p3y,income_stmt_operatingRevenue_p3y,income_stmt_minorityInterests_p3y,income_stmt_impairmentOfCapitalAssets_p3y,income_stmt_totalOperatingIncomeAsReported_p3y,income_stmt_totalOtherFinanceCost_p3y,income_stmt_otherOperatingExpenses_p3y,income_stmt_depreciationAmortizationDepletionIncomeStatement_p3y,income_stmt_depreciationAndAmortizationInIncomeStatement_p3y,income_stmt_earningsFromEquityInterest_p3y,income_stmt_provisionForDoubtfulAccounts_p3y,income_stmt_rentExpenseSupplemental_p3y,income_stmt_writeOff_p3y,income_stmt_netIncomeDiscontinuousOperations_p3y,income_stmt_gainOnSaleOfBusiness_p3y,year,month,weekday
symbol,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1
24CS,2022-10-03,7.1,10.2,7.1,10.2,559465900,0.0,0.0,5706.552073,1.0,,,,,,,,17,7,,,,,,,,,,,,,,,,,,,-0.29902,-0.509804,-0.592157,0.0,0.241495,35101165.0,0.0,0.0,19455578.0,6959244.0,563704474.0,35101165.0,28141921.0,-2013217.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2,9,0
24CS,2022-10-04,10.7,11.1,7.15,7.15,330707400,0.0,0.0,2364.557942,1.0,,,,,,,,17,7,-0.29902,,,,,,0.0,,,,,,0.0,,,,,,-0.27972,-0.373427,-0.454545,0.0,0.241495,35101165.0,0.0,0.0,19455578.0,6959244.0,563704474.0,35101165.0,28141921.0,-2013217.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2,9,1
24CS,2022-10-05,5.85,6.45,5.05,5.15,361028900,0.0,0.0,1859.298869,1.0,,,,,,,,17,7,-0.27972,,,,,,0.0,,,,,,0.0,,,,,,0.009709,-0.246602,-0.254369,0.0,0.241495,35101165.0,0.0,0.0,19455578.0,6959244.0,563704474.0,35101165.0,28141921.0,-2013217.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2,9,2
24CS,2022-10-06,5.4,5.45,4.7,5.2,232679200,0.0,0.0,1209.931796,2.0,,,,,,,,17,7,0.009709,,,,,,4.0,,,,,,3.0,,,,,,-0.038462,-0.292308,-0.276923,0.0,0.241495,35101165.0,0.0,0.0,19455578.0,6959244.0,563704474.0,35101165.0,28141921.0,-2013217.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2,9,3
24CS,2022-10-07,5.1,5.15,4.76,5.0,131778400,0.0,0.0,658.892,2.0,,,,,,,,17,7,-0.038462,,,,,,0.0,,,,,,0.0,,,,,,0.0,-0.224,-0.268,0.0,0.241495,35101165.0,0.0,0.0,19455578.0,6959244.0,563704474.0,35101165.0,28141921.0,-2013217.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2,9,4


# Define Train and Test Periods
We'll use 240 days of data to train a model and use the model for predicting the next 60 days of data. Then, we progressively roll our training data window by 60 days and again use the updated 240 days of data to train a model and make prediction of the next 60 days and so on.

In [40]:
TRAIN_LENGTH = 240
TEST_LENGTH = 60

trading_dates = np.sort(data_df.index.get_level_values("date").unique())
trading_dates[0], trading_dates[-1]

('2020-01-02', '2024-12-30')

In [39]:
# calculate number of periods we need to train & predict based on our TRAIN_LENGTH and TEST_LENGTH
n_periods = math.ceil((len(data_df.index.get_level_values("date").unique()) - TRAIN_LENGTH) / TEST_LENGTH)
n_periods

17

In [69]:
train_test_dates = []

for i in range(n_periods):
    train_start = i * TEST_LENGTH
    train_dates = trading_dates[train_start: train_start + TRAIN_LENGTH]
    test_dates  = trading_dates[train_start + TRAIN_LENGTH: min(train_start + TRAIN_LENGTH + TEST_LENGTH, len(trading_dates))] 
    train_test_dates.append((train_dates, test_dates))

In [90]:
for i, (train_dates, test_dates) in enumerate(train_test_dates):
    print(f"Period {i+1}: Train dates from {train_dates[0]} to {train_dates[-1]}, Test dates from {test_dates[0]} to {test_dates[-1]}")

Period 1: Train dates from 2020-01-02 to 2020-12-25, Test dates from 2020-12-28 to 2021-03-25
Period 2: Train dates from 2020-03-27 to 2021-03-25, Test dates from 2021-03-26 to 2021-06-29
Period 3: Train dates from 2020-06-26 to 2021-06-29, Test dates from 2021-06-30 to 2021-09-27
Period 4: Train dates from 2020-09-28 to 2021-09-27, Test dates from 2021-09-28 to 2021-12-24
Period 5: Train dates from 2020-12-28 to 2021-12-24, Test dates from 2021-12-27 to 2022-03-23
Period 6: Train dates from 2021-03-26 to 2022-03-23, Test dates from 2022-03-24 to 2022-06-27
Period 7: Train dates from 2021-06-30 to 2022-06-27, Test dates from 2022-06-28 to 2022-09-23
Period 8: Train dates from 2021-09-28 to 2022-09-23, Test dates from 2022-09-26 to 2022-12-23
Period 9: Train dates from 2021-12-27 to 2022-12-23, Test dates from 2022-12-26 to 2023-03-21
Period 10: Train dates from 2022-03-24 to 2023-03-21, Test dates from 2023-03-22 to 2023-06-22
Period 11: Train dates from 2022-06-28 to 2023-06-22, Test 

# Train Model and Make Predictions
Based on the defined train and test periods, we'll train a ML model on the train period and make prediction on the test period. We'll use [LightGBM](https://lightgbm.readthedocs.io/en/stable/Python-API.html#scikit-learn-api) model which is one of the most popular model among ML usecases for tabular data due to its performance and speed.

In [88]:
lgb_params = {
 'learning_rate': 0.01,
 'num_leaves': 4,
 'feature_fraction': 0.95,
 'min_data_in_leaf': 250,
 'boost_rounds': 200}

In [82]:
# for each period, train, make predictions, store predictions for evaluation
prediction_df = pd.DataFrame()
for train_dates, test_dates in tqdm(train_test_dates):

    train_df = data_df.loc[idx[:, train_dates], :]
    X_train, Y_train = train_df[features], train_df["r01_fwd"]

    model = lgb.LGBMRegressor(**lgb_params)
    model.fit(X_train, Y_train)

    test_df = data_df.loc[idx[:, test_dates], :]
    X_test = test_df[features]
    prediction = model.predict(X_test)
    
    prediction_df = pd.concat([prediction_df, pd.DataFrame(prediction, index=test_df.index, columns=["prediction"])], axis=0)

  0%|          | 0/17 [00:00<?, ?it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004784 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5613
[LightGBM] [Info] Number of data points in the train set: 168156, number of used features: 42
[LightGBM] [Info] Start training from score 0.000822
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006443 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5714
[LightGBM] [Info] Number of data points in the train set: 169866, number of used features: 59
[LightGBM] [Info] Start training from score 0.003224
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.015155 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not eno

# Save Predictions

In [85]:
# save predictions
prediction_df = prediction_df.sort_index().reset_index()

In [86]:
prediction_df.head()

Unnamed: 0,symbol,date,prediction
0,24CS,2022-10-03,-0.000981
1,24CS,2022-10-04,0.004074
2,24CS,2022-10-05,0.004074
3,24CS,2022-10-06,0.001674
4,24CS,2022-10-07,0.002785


In [87]:
prediction_df.to_csv("data/prediction.csv", index=False)