Feat/New Datasets (ETT and Electricity) #960

gdevos010 · 2022-05-15T23:45:52Z

Fixes part of #617.

The PEMS-SF dataset is in an odd format and I'm not sure how to include it.

Summary

Added the ETT and Electricity datasets.
Added a method for processing other .zip datasets.
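As a rough illustration of what processing a zipped CSV dataset can involve, here is a minimal sketch using only the standard library. It is not the code from this PR: the helper name `read_csv_from_zip`, the member name `demo.csv`, and the in-memory archive are all assumptions made for the example.

```python
import csv
import io
import tempfile
import zipfile
from typing import Dict, List


def read_csv_from_zip(zip_bytes: bytes, member: str) -> List[Dict[str, str]]:
    """Extract one CSV member into a temporary directory and parse its rows.

    Hypothetical helper for illustration; not the loader added in this PR.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
            # extract() returns the path of the file it wrote
            extracted_path = archive.extract(member, path=tmp_dir)
        with open(extracted_path, newline="") as handle:
            return list(csv.DictReader(handle))


# Build a tiny archive in memory so the sketch runs without network access.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo.csv", "date,value\n2022-01-01,1.0\n2022-01-02,2.0\n")

rows = read_csv_from_zip(buffer.getvalue(), "demo.csv")
```

Extracting into a temporary directory keeps the unpacked files out of the dataset cache, which is presumably the motivation behind the later switch to Python's `tempfile` library mentioned in the commit list.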

Other Information

None

gdevos010 · 2022-05-16T18:22:37Z

@hrzn I also need help understanding why this one failed. It's failing on the Electricity dataset but not the ETT ones.
FAILED darts/tests/datasets/test_dataset_loaders.py::DatasetLoaderTestCase::test_ok_dataset

hrzn · 2022-05-17T18:38:37Z

The tests fail because a newer version of some dependency is introducing an issue. It's not due to this PR - it's a separate issue that we'll fix asap.

codecov-commenter · 2022-05-18T12:45:39Z

Codecov Report

Merging #960 (71f6a25) into master (adb66fd) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #960      +/-   ##
==========================================
+ Coverage   92.61%   92.65%   +0.04%     
==========================================
  Files          74       74              
  Lines        7404     7449      +45     
==========================================
+ Hits         6857     6902      +45     
  Misses        547      547

Impacted Files	Coverage Δ
darts/datasets/__init__.py	`100.00% <100.00%> (ø)`
darts/datasets/dataset_loaders.py	`96.59% <100.00%> (+1.27%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

hrzn

Thanks, it's a very nice initiative to add more datasets :)
I have a few comments before we can merge this one.

darts/datasets/__init__.py

darts/datasets/dataset_loaders.py

dennisbader · 2022-05-20T07:55:09Z

Those will be nice additions, thanks for that!

TL;DR

Could we add another dataset for static covariates? The target is monthly beer sales per Agency and SKU (Stock Keeping Unit). With some preprocessing, this dataset can contain categorical static covariates, numerical static covariates, and additional covariates such as weather, historical industry sales, ...

Description

From a quick look at the data, these are multivariate datasets without any static covariates (such as household id or similar), right?

Could we imagine also adding a dataset for static covariates? I'm currently using the one from below:

This longitudinal table from here includes some of Stallion & Co.'s beer sales data from different wholesalers (Agencies) and SKUs (Stock Keeping Units).

Context: Country Beeristan, a high potential market, accounts for nearly 10% of Stallion & Co.’s global beer sales. Stallion & Co. has a large portfolio of products distributed to retailers through wholesalers (agencies). There are thousands of unique wholesaler-SKU/products combinations. In order to plan its production and distribution as well as help wholesalers with their planning, it is important for Stallion & Co. to have an accurate estimate of demand at SKU level for each wholesaler.

I'm particularly interested in the price_sales_promotion.csv: ($/hectoliter) Holds the monthly price, sales & promotion data in dollar value per hectoliter at Agency-SKU-month level from January 2013 - December 2017.

From my TimeSeries.from_logitudinal_dataframe() this results in 350 TimeSeries at Agency-SKU level. They share the same time index so we could use it for both multivariate and multiple TS.
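To make the "one series per Agency-SKU pair" idea concrete, here is a small pandas-only sketch of splitting a long-format table into per-group series. The toy rows and the `date` and `price` column names are assumptions for illustration; only `Agency` and `SKU` come from the dataset description, and this is not the Darts method referred to above.

```python
import pandas as pd

# Toy long-format table; values and the "date"/"price" columns are made up.
long_df = pd.DataFrame(
    {
        "Agency": ["Agency_01", "Agency_01", "Agency_02", "Agency_02"],
        "SKU": ["SKU_01", "SKU_01", "SKU_01", "SKU_01"],
        "date": pd.to_datetime(
            ["2013-01-01", "2013-02-01", "2013-01-01", "2013-02-01"]
        ),
        "price": [10.0, 11.0, 9.5, 9.0],
    }
)

# One time-indexed series per (Agency, SKU) pair.
series_by_group = {
    key: group.set_index("date")["price"].sort_index()
    for key, group in long_df.groupby(["Agency", "SKU"])
}
```

Because every group covers the same dates, the resulting series share one time index, which is what allows the data to be treated either as multiple series or stacked into a single multivariate one.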

Additional Info

Agency and SKU are categorical, so we would have to encode them. There are 58 unique Agencies and 28 unique SKUs. For simplicity and testing if everything works with static covariates I'm just converting them to numerical right now as below (will try later on with one hot encoding).

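import pandas as pd

df = pd.read_csv(fin)  # fin: path to the input CSV
# Replace each categorical column with its integer category codes
for col in ["Agency", "SKU"]:
    df[col] = pd.Categorical(df[col]).codes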
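For comparison, the one-hot alternative mentioned above could look roughly like the following sketch. It uses `pd.get_dummies` on a toy frame; the rows and the `volume` column are assumptions for illustration, not data from the actual CSV.

```python
import pandas as pd

# Toy frame standing in for the real table; values are made up.
toy_df = pd.DataFrame(
    {
        "Agency": ["Agency_01", "Agency_02", "Agency_01"],
        "SKU": ["SKU_01", "SKU_01", "SKU_02"],
        "volume": [1.0, 2.0, 3.0],
    }
)

# One indicator column per category instead of a single integer code.
encoded = pd.get_dummies(toy_df, columns=["Agency", "SKU"], dtype=int)
```

Integer codes impose an arbitrary ordering on the categories, whereas one-hot columns do not, at the cost of one extra column per category (58 Agencies plus 28 SKUs would give 86 indicator columns here).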

We could also add data from the other files in the Kaggle data directory to the CSV, so that it also has numerical static covariates such as the average population per Agency region from demographics.csv, plus other covariates from the remaining files (weather, industry volume, historical volume, industry soda sales).

gdevos010 · 2022-05-20T20:04:23Z

@dennisbader This seems like a separate PR. Can we move this discussion to #966 or #597? I agree that this is an important next step for datasets.

hrzn · 2022-05-22T06:40:00Z

> @dennisbader this seems like a separate PR. Can we move this discussion to #966 or #597? I agree that this is an important next step for datasets.

I also agree with that. @dennisbader could we address it in a separate PR once your static covariates one is merged?

gdevos010 · 2022-05-27T01:32:06Z

@hrzn The dataset tests won't pass because the ETT files are not yet in main. Should we merge or handle it a different way?

hrzn · 2022-05-27T06:51:11Z

> @hrzn The dataset tests won't pass because the ETT files are not yet in main. Should we merge or handle it a different way?

I think we have no other choice but to merge and hope for the best :) We'll open a fix PR later on if it fails for some reason.

gdevos010 added 4 commits May 13, 2022 20:37

adding ETT-small dataset from Informer paper

696f523

added framework for handling zip datasets. Added Electricity dataset.

3143cd6

zip tests

37a1d4c

changelog

3429b5f

gdevos010 requested review from hrzn, tomasvanpottelbergh, dennisbader and brunnedu as code owners May 15, 2022 23:45

gdevos010 changed the title ~~New Datasets (ETT and Electricity)~~ Feat/New Datasets (ETT and Electricity) May 15, 2022

gdevos010 added 2 commits May 16, 2022 10:21

fixed tests. created test.zip to reduce runtime and bandwidth

bae1e75

added debug info

6b51f6b

Greg DeVos and others added 3 commits May 17, 2022 11:52

fixed test. Linux is pickier with file extensions

1fe7c66

Merge branch 'master' into new-datasets

a244bed

Merge branch 'master' into new-datasets

71f6a25

Merge branch 'master' into new-datasets

61eba9b

hrzn reviewed May 19, 2022

gdevos010 added 3 commits May 19, 2022 10:15

PR comments. Added ETT datasets to project. fixed the docstrings

dc0fe98

PR comments. updated zip download and extraction to use python's tempfile library

3663f9d

Merge remote-tracking branch 'origin/new-datasets' into new-datasets

1ee4ffa

# Conflicts: # darts/datasets/dataset_loaders.py

gdevos010 and others added 2 commits May 20, 2022 12:25

docstring update

130e5c0

docstring updates

9b15870

consistent line endings to match hash on windows and linux

03b0401

changed pre_process_fn to pre_process_zipped_csv_fn. added error message if pre_process_zipped_csv_fn is used on csv files

03f2d9b

fix

fc93e7f

hrzn approved these changes May 27, 2022

darts/datasets/__init__.py (outdated, resolved)

Update darts/datasets/__init__.py

3de3b94

Merge branch 'master' into new-datasets

a1bd5d3

hrzn merged commit bf634f7 into unit8co:master May 27, 2022

gdevos010 deleted the new-datasets branch May 27, 2022 19:22
