# Download and store STOOQ data
Inspired by code from Stefan Jansen's GitHub repository for his book [Machine Learning for Algorithmic Trading - Second Edition](https://github.com/stefan-jansen/machine-learning-for-trading).

As mentioned [here](https://github.com/stefan-jansen/machine-learning-for-trading/issues/82), STOOQ has disabled automatic downloads. The data files therefore have to be downloaded manually from [STOOQ](https://stooq.com/db/h/).

---
This notebook loads the manually downloaded files and stores the data in the [HDF5](https://www.loc.gov/preservation/digital/formats/fdd/fdd000229.shtml) format. Note that the HDF5 store location is used later in the configuration (see also: `config.yaml`).


> NOTE: the `ticker` in STOOQ data files is not the `symbol` that's available elsewhere
> * `ticker`: `KGH`, `PKN`, ..
> * `symbol`: `KGHM`, `PKNORLEN`, ..
>
> This might require an additional conversion (e.g. fetching ticker symbols), but it is not necessary for the `stock-ml` project.

In [1]:
from pathlib import Path
from tqdm import tqdm
from typing import Optional

import numpy as np
import pandas as pd
import zipfile as zip


## Make sure data files exist at given location

The downloaded files are expected to exist in the `DATA_ROOT` folder.

In [2]:

DATA_ROOT = Path('/Data/stooq')
DATA_STORE = DATA_ROOT.joinpath('assets.h5')
DATA_EXTRACT = DATA_ROOT.joinpath('extract')

assert DATA_ROOT.exists(), f'Data folder not found {DATA_ROOT}'
DATA_EXTRACT.mkdir(exist_ok=True)  # create the extraction folder if it is missing

In [3]:
market_data_files = {
    'pl': ['/wse stocks/', '/nc stocks/', '/wse indices/'] 
}
archives = [DATA_ROOT.joinpath(f'd_{market}_txt.zip') for market in market_data_files.keys()]
assert all(archive.exists() for archive in archives), f"Some data files not found: {str([archive for archive in archives if not archive.exists()])}"

In [4]:
def get_data_file(market: str) -> Path:
    """Return the path of the downloaded STOOQ archive for the given market."""
    return DATA_ROOT.joinpath(f'd_{market}_txt.zip')

assert all(get_data_file(market).exists() for market in market_data_files), 'Some data files not found'

In [5]:
for market, folders in market_data_files.items():
    target = DATA_EXTRACT.joinpath(market)
    target.mkdir(exist_ok=True)
    with zip.ZipFile(get_data_file(market)) as zip_file:
        # extract only the members located in the folders selected for this market
        to_extract = [file for file in zip_file.namelist() if any(folder in file for folder in folders)]
        for file in tqdm(to_extract):
            try:
                zip_file.extract(file, target)
            except FileNotFoundError as ex:
                # report and skip members that cannot be created on this file system
                print(f'{ex}')

 30%|██▉       | 263/882 [00:00<00:00, 1315.74it/s]

[Errno 2] No such file or directory: '\\Data\\stooq\\extract\\pl\\data\\daily\\pl\\nc stocks\\aux.txt'


 45%|████▍     | 395/882 [00:00<00:00, 1167.95it/s]

[Errno 2] No such file or directory: '\\Data\\stooq\\extract\\pl\\data\\daily\\pl\\nc stocks\\prn.txt'


100%|██████████| 882/882 [00:00<00:00, 961.06it/s] 
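
The two `FileNotFoundError` messages above concern `aux.txt` and `prn.txt`. `AUX` and `PRN` are reserved device names on Windows, which on most Windows versions cannot be used as file names even with an extension, so these two members are skipped. A minimal sketch of how such archive members could be detected up front; the helper `is_windows_reserved` and the constant `WINDOWS_RESERVED` are hypothetical names, not part of the original notebook, and the member paths are inferred from the error messages:

```python
from pathlib import PurePosixPath

# Device names reserved by Windows (case-insensitive, with or without an extension)
WINDOWS_RESERVED = (
    {'CON', 'PRN', 'AUX', 'NUL'}
    | {f'COM{i}' for i in range(1, 10)}
    | {f'LPT{i}' for i in range(1, 10)}
)

def is_windows_reserved(member: str) -> bool:
    """Return True if the base name of a zip member is a reserved Windows device name."""
    # zip member names use '/' as the path separator
    stem = PurePosixPath(member).name.split('.')[0]
    return stem.upper() in WINDOWS_RESERVED

# member names as they would appear in the archive (inferred, not copied from it)
members = ['data/daily/pl/nc stocks/aux.txt', 'data/daily/pl/nc stocks/01c.txt']
skipped = [member for member in members if is_windows_reserved(member)]
```

Such members could either be left out of `to_extract` or be written under a different name; the notebook as written simply reports and skips them.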


## Data extract and transform

Extract the data from the txt files, transform it into pandas DataFrames with a fixed structure, and load it into the local HDF store.

In [6]:
files = DATA_EXTRACT.glob('**/*.txt')
sample_file = next(files)
with open(sample_file) as f:
    print(sample_file,'\n', f.readlines(5))

\Data\stooq\extract\pl\data\daily\pl\nc stocks\01c.txt 
 ['<TICKER>,<PER>,<DATE>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>\n']
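
The header printed above wraps every column name in angle brackets. As an illustration of how these names relate to plain column names, the brackets can be stripped and the result lower-cased. The helper `normalize_header` and the `RENAMES` overrides are assumptions made for this sketch, chosen so that the result agrees with the `stock` and `volume` names used later in the notebook:

```python
import csv
import io

# header line exactly as printed from the sample file
HEADER = '<TICKER>,<PER>,<DATE>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>\n'

# hypothetical overrides for names that differ from the bare STOOQ name
RENAMES = {'ticker': 'stock', 'vol': 'volume'}

def normalize_header(line: str) -> list:
    """Turn a STOOQ header line into a list of plain, lower-case column names."""
    raw = next(csv.reader(io.StringIO(line)))            # split the comma-separated header
    names = [name.strip('<>').lower() for name in raw]   # '<TICKER>' -> 'ticker'
    return [RENAMES.get(name, name) for name in names]   # apply the overrides

columns = normalize_header(HEADER)
```

The notebook itself uses an explicit mapping instead, which also serves as the list of columns to keep.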


In [7]:

# STOOQ column headers mapped to the column names used in this project
COLUMNS_MAP = {
    '<DATE>': 'date',
    '<TICKER>': 'stock',
    '<OPEN>': 'open',
    '<HIGH>': 'high',
    '<LOW>': 'low',
    '<CLOSE>': 'close',
    '<VOL>': 'volume'
}

def load_data(data_file: Path, columns_map: Optional[dict] = None) -> pd.DataFrame:
    """Load a single STOOQ txt file into a DataFrame indexed by (date, stock)."""
    columns_map = columns_map or COLUMNS_MAP                        # avoid a mutable default argument
    data = pd.read_csv(
        data_file,
        header=0,                                                   # header in first row
        parse_dates=['<DATE>'],                                     # parse the date column
        usecols=list(columns_map.keys()),                           # ignore other columns
        index_col=None                                              # index will be set later
    )
    data = data.rename(columns=columns_map)                         # use well-known column names
    data = data.set_index(['date', 'stock'])                        # set multiindex (for further merge)
    data = data[~data.index.duplicated(keep='first')].sort_index()  # remove duplicates (!)
    data['volume'] = data['volume'].astype(int)                     # no fractional volume / positions

    return data
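
To make the duplicate-removal step in `load_data` concrete, here is a small sketch on made-up rows (the ticker `AAA` and the prices are invented for illustration only). `Index.duplicated(keep='first')` marks every repeated `(date, stock)` key except its first occurrence, and negating the mask keeps one row per key:

```python
import pandas as pd

# made-up rows for illustration; the second row repeats the (date, stock) key of the first
rows = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-02', '2020-01-02', '2020-01-03']),
    'stock': ['AAA', 'AAA', 'AAA'],
    'close': [10.0, 11.0, 12.0],
}).set_index(['date', 'stock'])

# True for every repeated index entry except the first one
mask = rows.index.duplicated(keep='first')

# keep only the first row of each (date, stock) pair, sorted by index
deduplicated = rows[~mask].sort_index()
```

Keeping the first occurrence is a choice; if later rows in a file were corrections of earlier ones, `keep='last'` would be the alternative to consider.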

In [8]:
load_errors = {}
if DATA_STORE.exists(): DATA_STORE.unlink()
# open the store once, so that adding another market does not overwrite earlier data
with pd.HDFStore(DATA_STORE, mode='w') as store:
    for market in market_data_files.keys():
        # glob per market: a single shared generator would be exhausted after the first market
        files = DATA_EXTRACT.joinpath(market).glob('**/*.txt')
        for file in tqdm(files):
            try:
                data = load_data(file)
                store.put(f'{market}/prices', data, format='table', append=True, data_columns=True, min_itemsize={'stock' : 15}) # type: ignore 
            except Exception as e:
                # collect the error and continue with the next file
                load_errors[f'{market}/{file}'] = str(e)
print(f'Data loaded to HDFStore {DATA_STORE}')
if load_errors:
    for file, error in load_errors.items():
        print(file, ': ', error)

877it [01:34,  9.32it/s]

Data loaded to HDFStore.
pl/\Data\stooq\extract\pl\data\daily\pl\wse stocks\c249l.txt :  No columns to parse from file
pl/\Data\stooq\extract\pl\data\daily\pl\wse stocks\c24n3l.txt :  No columns to parse from file
pl/\Data\stooq\extract\pl\data\daily\pl\wse stocks\dbe1.txt :  No columns to parse from file
pl/\Data\stooq\extract\pl\data\daily\pl\wse stocks\iburu.txt :  No columns to parse from file
pl/\Data\stooq\extract\pl\data\daily\pl\wse stocks\invgl.txt :  No columns to parse from file
pl/\Data\stooq\extract\pl\data\daily\pl\wse stocks\invsl.txt :  No columns to parse from file
pl/\Data\stooq\extract\pl\data\daily\pl\wse stocks\ipgpa.txt :  No columns to parse from file
pl/\Data\stooq\extract\pl\data\daily\pl\wse stocks\ipogparpa.pl.txt :  No columns to parse from file
pl/\Data\stooq\extract\pl\data\daily\pl\wse stocks\leb.txt :  No columns to parse from file
pl/\Data\stooq\extract\pl\data\daily\pl\wse stocks\lkburu.txt :  No columns to parse from file
pl/\Data\stooq\extract\pl\dat
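
`No columns to parse from file` is the message pandas gives when `read_csv` is handed an empty file, so the files listed above are presumably empty. One possible way to keep them out of the error report is to filter on file size before loading. The helper `non_empty_files` is hypothetical and not part of the original notebook; the temporary files only serve the demonstration:

```python
import tempfile
from pathlib import Path
from typing import Iterable, Iterator

def non_empty_files(files: Iterable[Path]) -> Iterator[Path]:
    """Yield only the files that contain at least one byte."""
    for file in files:
        if file.stat().st_size > 0:
            yield file

# demonstration on two temporary files: one with a header line, one empty
with tempfile.TemporaryDirectory() as tmp:
    full = Path(tmp, 'full.txt')
    full.write_text('<TICKER>,<PER>,<DATE>\n')
    empty = Path(tmp, 'empty.txt')
    empty.touch()
    kept = [file.name for file in non_empty_files([full, empty])]
```

In the loading loop this would wrap the glob result, e.g. `tqdm(non_empty_files(files))`. A file that holds only whitespace would still pass this check and fail in `read_csv`, so the `try`/`except` remains useful.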



