# Stock Price Prediction using an ML model
In this session, we'll learn how to build an ML model that predicts the **next-day % change in price** of stocks in the SET index (Stock Exchange of Thailand). The idea is to use the predictions to buy stocks that are expected to go up the next day, make a profit, and hopefully get rich!
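
To make the prediction target concrete, here is a minimal sketch of how a next-day % change could be computed from closing prices. The symbols `AAA` and `BBB`, the prices, and the column name `next_day_pct_change` are made up for illustration; the later notebooks may define the target differently.

```python
import pandas as pd

# Toy closing prices for two hypothetical symbols (illustrative values, not real data)
toy = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"] * 2),
    "Close": [10.0, 11.0, 9.9, 50.0, 50.0, 55.0],
})

# Sort so that "next row" means "next trading day" within each symbol
toy = toy.sort_values(["symbol", "Date"]).reset_index(drop=True)

# Next day's close, looked up within each symbol so values never leak across symbols
next_close = toy.groupby("symbol")["Close"].shift(-1)

# % change from today's close to the next day's close
toy["next_day_pct_change"] = (next_close / toy["Close"] - 1) * 100

print(toy)
```

The last row of each symbol has no next day, so its target is `NaN` and would have to be dropped before training.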

This session is divided into the following 5 notebooks:
1. `1_collect_data.ipynb` (current notebook)
2. `2_eda.ipynb`
3. `3_features_prep.ipynb`
4. `4_make_prediction.ipynb`
5. `5_evaluation.ipynb`

# Load stock data from Yahoo Finance using `yfinance`
In this notebook, we will download and store the following data:
1. Daily price: end-of-day information for each stock, i.e. Open, High, Low, Close, Volume
2. Company information: sector and industry of each stock
3. Annual income statement: profitability of each stock

In [None]:
import pandas as pd
import yfinance as yf
from tqdm.notebook import tqdm
import time

## Load symbol list

In [None]:
# read symbols list within SET index from file
symbols = pd.read_csv("data/SET_symbols_20241230.csv")

## Daily price

In [33]:
# configurations
interval = "1d"       # daily bars
start = "2020-01-01"
end = "2024-12-31"    # yfinance treats `end` as exclusive, so 2024-12-31 itself is not downloaded

In [3]:
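# set to False to skip the download and reuse the saved CSV
if True:

    # collect one DataFrame per symbol and concatenate once at the end
    # (much faster than growing price_df with pd.concat inside the loop)
    price_frames = []
    for symbol in tqdm(symbols["symbol"]):
        yf_ticker = yf.Ticker(f"{symbol}.BK")  # ".BK" is the Yahoo Finance suffix for SET-listed stocks
        history = yf_ticker.history(interval=interval, start=start, end=end)
        price_frames.append(history.assign(symbol=symbol))
    price_df = pd.concat(price_frames)

    # check-point
    price_df.to_csv("data/set_price.csv")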



In [8]:
price_df = pd.read_csv("data/set_price.csv")

In [9]:
price_df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,symbol,Capital Gains
0,2022-10-03 00:00:00+07:00,7.1,10.2,7.1,10.2,559465900,0.0,0.0,24CS,
1,2022-10-04 00:00:00+07:00,10.7,11.1,7.15,7.15,330707400,0.0,0.0,24CS,
2,2022-10-05 00:00:00+07:00,5.85,6.45,5.05,5.15,361028900,0.0,0.0,24CS,
3,2022-10-06 00:00:00+07:00,5.4,5.45,4.7,5.2,232679200,0.0,0.0,24CS,
4,2022-10-07 00:00:00+07:00,5.1,5.15,4.76,5.0,131778400,0.0,0.0,24CS,


In [11]:
price_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 967016 entries, 0 to 967015
Data columns (total 10 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           967016 non-null  object 
 1   Open           966999 non-null  float64
 2   High           966999 non-null  float64
 3   Low            966999 non-null  float64
 4   Close          966999 non-null  float64
 5   Volume         967016 non-null  int64  
 6   Dividends      967016 non-null  float64
 7   Stock Splits   967016 non-null  float64
 8   symbol         967016 non-null  object 
 9   Capital Gains  18098 non-null   float64
dtypes: float64(7), int64(1), object(2)
memory usage: 73.8+ MB
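
The `info()` output above shows that `Date` was read back as `object` (plain strings with a `+07:00` offset) and that 17 rows have no `Open`/`High`/`Low`/`Close` values (967016 - 966999). Below is a minimal sketch of how both could be handled, run on a tiny stand-in frame rather than the real file. The second row and the column name `trade_date` are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for price_df; the second row is invented to show a missing price
sample = pd.DataFrame({
    "Date": ["2022-10-03 00:00:00+07:00", "2022-10-04 00:00:00+07:00"],
    "Open": [7.1, None],
    "Close": [10.2, None],
    "symbol": ["24CS", "24CS"],
})

# Parse the strings as UTC, then convert back to Bangkok time
sample["Date"] = pd.to_datetime(sample["Date"], utc=True).dt.tz_convert("Asia/Bangkok")

# Timezone-free calendar date, convenient for joins with other daily data
sample["trade_date"] = sample["Date"].dt.tz_localize(None).dt.normalize()

# Rows where any of the price columns is missing
missing = sample[sample[["Open", "Close"]].isna().any(axis=1)]

print(sample.dtypes)
print(missing)
```

Whether rows with missing prices should be dropped or filled is a modelling decision; this sketch only locates them.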


## Daily price of SET (benchmark)

In [36]:
# Download the SET index (benchmark) for the same date range and interval as the stock prices
set_data = yf.download("^SET.BK", start=start, end=end, interval=interval)

[*********************100%%**********************]  1 of 1 completed


In [39]:
set_data.head()

Date,Open,High,Low,Close,Adj Close,Volume
2020-01-02,1584.349976,1597.920044,1583.180054,1595.819946,1595.819946,3442200
2020-01-03,1596.949951,1604.430054,1592.900024,1594.969971,1594.969971,3251500
2020-01-06,1584.130005,1585.560059,1565.930054,1568.5,1568.5,4116100
2020-01-07,1578.52002,1585.439941,1570.040039,1585.22998,1585.22998,3201300
2020-01-08,1569.819946,1572.030029,1555.75,1559.27002,1559.27002,3619500


In [42]:
set_data.reset_index(inplace=True)

In [43]:
set_data.to_csv("data/set_price_index.csv", index=False)

## Company information

In [12]:
# set to False to skip the download and reuse the saved CSV
if True:

    # get company info: one dict per symbol, turned into a DataFrame at the end
    company_info_rows = []

    for symbol in tqdm(symbols["symbol"]):
        ticker = yf.Ticker(f"{symbol}.BK")
        try:
            symbol_info = ticker.info
            # skip symbols for which Yahoo Finance returns an (almost) empty info dict
            if len(symbol_info) > 1:
                company_info_rows.append({
                    "symbol": symbol,
                    "industry": symbol_info.get("industry"),
                    "sector": symbol_info.get("sector"),
                })
        except TimeoutError:
            print(f"TimeoutError for symbol:{symbol}")

    company_info_df = pd.DataFrame(company_info_rows, columns=["symbol", "industry", "sector"])
    company_info_df.to_csv("data/set_company_info.csv", index=False)



In [15]:
company_info_df = pd.read_csv("data/set_company_info.csv")

In [16]:
company_info_df.head()

Unnamed: 0,symbol,industry,sector
0,24CS,Building Products & Equipment,Industrials
1,2S,Steel,Basic Materials
2,3BBIF,,
3,A,Real Estate - Development,Real Estate
4,A5,Real Estate - Development,Real Estate


In [19]:
company_info_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 925 entries, 0 to 924
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   symbol    925 non-null    object
 1   industry  858 non-null    object
 2   sector    858 non-null    object
dtypes: object(3)
memory usage: 21.8+ KB
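
According to the `info()` output above, 67 of the 925 symbols (925 - 858) have no `industry` or `sector`, for example `3BBIF` in the preview. The sketch below shows one way to list those symbols and label them explicitly, using only the five preview rows. The `"Unknown"` label is an assumption made for illustration, not something the later notebooks are known to use.

```python
import pandas as pd

# The five rows shown in the preview above
info = pd.DataFrame({
    "symbol": ["24CS", "2S", "3BBIF", "A", "A5"],
    "industry": ["Building Products & Equipment", "Steel", None,
                 "Real Estate - Development", "Real Estate - Development"],
    "sector": ["Industrials", "Basic Materials", None, "Real Estate", "Real Estate"],
})

# Symbols for which Yahoo Finance returned no sector
missing_sector = info.loc[info["sector"].isna(), "symbol"].tolist()

# Hypothetical choice: give missing values their own category instead of leaving NaN
info[["industry", "sector"]] = info[["industry", "sector"]].fillna("Unknown")

print(missing_sector)
print(info["sector"].value_counts())
```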


## Income statement

In [22]:
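# set to False to skip the download and reuse the saved CSV
if True:
    incm_stmt_frames = []
    for symbol in tqdm(symbols["symbol"]):

        ticker = yf.Ticker(f"{symbol}.BK")

        try:
            _incm_stmt_df = ticker.income_stmt

            if len(_incm_stmt_df) > 0:

                # transpose so that each row is one reporting date and each column is one line item
                _incm_stmt_df = _incm_stmt_df.T
                _incm_stmt_df = _incm_stmt_df.assign(symbol=symbol).set_index("symbol", append=True)
                _incm_stmt_df.index.names = ["date", "symbol"]

                incm_stmt_frames.append(_incm_stmt_df)
        except Exception as e:
            # report the failure instead of skipping the symbol silently,
            # then pause briefly before the next request
            print(f"{type(e).__name__} for symbol:{symbol}")
            time.sleep(1)

    # concatenate once at the end; fall back to an empty frame if every request failed
    incm_stmt_df = pd.concat(incm_stmt_frames) if incm_stmt_frames else pd.DataFrame()
    incm_stmt_df.reset_index(inplace=True)

    # store the data
    incm_stmt_df.to_csv("data/set_incm_stmt.csv", index=False)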
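
The loop above pauses for one second when a request fails and then moves on to the next symbol, so a symbol that fails once is never retried. A retry with an increasing delay is one possible alternative. The helper below is a hypothetical sketch (the name `fetch_with_retry` and its parameters are not part of `yfinance`), demonstrated with a fake function instead of a real network call.

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() up to `retries` times, doubling the wait after each failure."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            # wait 1x, 2x, 4x, ... base_delay, but not after the final attempt
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Demo: a fake download that fails twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated failure")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = fetch_with_retry(flaky, sleep=waits.append)
print(result, waits)  # ok [1.0, 2.0]
```

In the notebook, it could be called as `fetch_with_retry(lambda: ticker.income_stmt)` inside the loop, assuming the failures are transient.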


In [29]:
incm_stmt_df = pd.read_csv("data/set_incm_stmt.csv")

In [30]:
incm_stmt_df.head()

Unnamed: 0,date,symbol,Tax Effect Of Unusual Items,Tax Rate For Calcs,Normalized EBITDA,Total Unusual Items,Total Unusual Items Excluding Goodwill,Net Income From Continuing Operation Net Minority Interest,Reconciled Depreciation,Reconciled Cost Of Revenue,...,Amortization,Amortization Of Intangibles Income Statement,Depreciation Income Statement,Insurance And Claims,Preferred Stock Dividends,Net Income From Tax Loss Carryforward,Research And Development,Earnings From Equity Interest Net Of Tax,Net Income Extraordinary,Excise Taxes
0,2023-12-31,24CS,247811.9,0.195597,-44136736.0,1266954.0,1266954.0,-45071044.0,9617938.0,671432400.0,...,,,,,,,,,,
1,2022-12-31,24CS,547957.8,0.223307,41876591.0,2453829.0,2453829.0,24494231.0,7726356.0,870063900.0,...,,,,,,,,,,
2,2021-12-31,24CS,0.0,0.241495,35101165.0,0.0,0.0,19455578.0,6959244.0,563704500.0,...,,,,,,,,,,
3,2020-12-31,24CS,0.0,0.352007,18310000.0,,,6940000.0,6310000.0,354560000.0,...,,,,,,,,,,
4,2023-12-31,2S,-1027415.0,0.037799,248561000.0,-27181000.0,-27181000.0,160083000.0,55027000.0,6351661000.0,...,,,,,,,,,,


In [32]:
incm_stmt_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3499 entries, 0 to 3498
Data columns (total 77 columns):
 #   Column                                                      Non-Null Count  Dtype  
---  ------                                                      --------------  -----  
 0   date                                                        3499 non-null   object 
 1   symbol                                                      3499 non-null   object 
 2   Tax Effect Of Unusual Items                                 3468 non-null   float64
 3   Tax Rate For Calcs                                          3468 non-null   float64
 4   Normalized EBITDA                                           3352 non-null   float64
 5   Total Unusual Items                                         2641 non-null   float64
 6   Total Unusual Items Excluding Goodwill                      2641 non-null   float64
 7   Net Income From Continuing Operation Net Minority Interest  3468 non-null   float64
 8 