Feat/New Datasets (ETT and Electricity) #960

Merged: 20 commits merged on May 27, 2022

gdevos010 (Contributor) commented May 15, 2022

Fixes part of #617.

The PEMS-SF dataset is in an odd format and I'm not sure how to include it.

Summary

Added the ETT and Electricity datasets.
Added a method for processing other .zip datasets.
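
As a rough illustration of what processing a zipped dataset can involve, here is a minimal sketch using only the Python standard library. The helper names `read_csv_from_zip` and `make_example_zip`, and the member name `data.csv`, are hypothetical; they are not the functions added in this PR, and the loader in darts/datasets/dataset_loaders.py may work differently.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_bytes: bytes, member: str) -> list:
    """Return the rows of one CSV file stored inside a zip archive.

    Hypothetical helper for illustration only; not part of Darts.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            # ZipFile.open yields bytes, so wrap it to decode text for csv.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


def make_example_zip() -> bytes:
    """Build a tiny in-memory archive so the sketch runs without downloads."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("data.csv", "date,value\n2022-05-01,1.5\n2022-05-02,2.0\n")
    return buffer.getvalue()


rows = read_csv_from_zip(make_example_zip(), "data.csv")
```

Reading the archive from memory avoids leaving extracted files on disk; a real loader would more likely download the archive once and cache the extracted CSV.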

Other Information

None

gdevos010 changed the title from "New Datasets (ETT and Electricity)" to "Feat/New Datasets (ETT and Electricity)" on May 15, 2022

gdevos010 (Contributor Author) commented

@hrzn I also need help understanding why this one failed. It's failing on the Electricity dataset but not on the ETT ones.
FAILED darts/tests/datasets/test_dataset_loaders.py::DatasetLoaderTestCase::test_ok_dataset

hrzn (Contributor) commented May 17, 2022

The tests fail because a newer version of some dependency is introducing an issue. It's not due to this PR - it's a separate issue that we'll fix asap.

codecov-commenter commented

Codecov Report

Merging #960 (71f6a25) into master (adb66fd) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #960      +/-   ##
==========================================
+ Coverage   92.61%   92.65%   +0.04%     
==========================================
  Files          74       74              
  Lines        7404     7449      +45     
==========================================
+ Hits         6857     6902      +45     
  Misses        547      547              
Impacted Files Coverage Δ
darts/datasets/__init__.py 100.00% <100.00%> (ø)
darts/datasets/dataset_loaders.py 96.59% <100.00%> (+1.27%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update adb66fd...71f6a25.

hrzn (Contributor) left a comment

Thanks, it's a very nice initiative to add more datasets :)
I have a few comments before we can merge this one.

darts/datasets/__init__.py (outdated, resolved)
darts/datasets/__init__.py (resolved)
darts/datasets/__init__.py (outdated, resolved)
darts/datasets/dataset_loaders.py (outdated, resolved)
darts/datasets/dataset_loaders.py (outdated, resolved)
darts/datasets/dataset_loaders.py (outdated, resolved)

dennisbader (Collaborator) commented May 20, 2022

Those will be nice additions, thanks for that!

TL;DR

Could we add another dataset for static covariates? The target is monthly beer sales per Agency and SKU (Stock Keeping Unit). With some preprocessing, this dataset can contain categorical static covariates, numerical static covariates, and additional covariates such as weather, historical industry sales, ...

Description

From a quick look at the data, these are multivariate datasets without any static covariates (such as household id or similar), right?

Could we imagine also adding a dataset for static covariates? I'm currently using the one from below:

This longitudinal table from Kaggle includes some of Stallion & Co.'s beer sales data from different wholesalers (Agencies) and SKUs (Stock Keeping Units).

Context: Country Beeristan, a high potential market, accounts for nearly 10% of Stallion & Co.’s global beer sales. Stallion & Co. has a large portfolio of products distributed to retailers through wholesalers (agencies). There are thousands of unique wholesaler-SKU/products combinations. In order to plan its production and distribution as well as help wholesalers with their planning, it is important for Stallion & Co. to have an accurate estimate of demand at SKU level for each wholesaler.

I'm particularly interested in the price_sales_promotion.csv: ($/hectoliter) Holds the monthly price, sales & promotion data in dollar value per hectoliter at Agency-SKU-month level from January 2013 - December 2017.

With my TimeSeries.from_longitudinal_dataframe(), this results in 350 TimeSeries at Agency-SKU level. They share the same time index, so we could use the dataset for both multivariate and multiple TimeSeries.
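
To illustrate the grouping described above, the following sketch splits a long-format table into one series per (Agency, SKU) pair. It uses plain Python rather than Darts. The function name `split_long_table`, the column names `YearMonth` and `Price`, and the sample rows are assumptions made up for this example.

```python
from collections import defaultdict


def split_long_table(rows, group_cols, time_col, value_col):
    """Group long-format rows into one time-ordered series per group key."""
    series = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_cols)
        series[key].append((row[time_col], row[value_col]))
    # Sort each series by its time stamp so values are in chronological order.
    return {key: sorted(points) for key, points in series.items()}


# Made-up sample rows; the real table has many more rows and columns.
rows = [
    {"Agency": "Agency_01", "SKU": "SKU_01", "YearMonth": "2013-02", "Price": 11.0},
    {"Agency": "Agency_01", "SKU": "SKU_01", "YearMonth": "2013-01", "Price": 10.0},
    {"Agency": "Agency_02", "SKU": "SKU_01", "YearMonth": "2013-01", "Price": 12.5},
]

grouped = split_long_table(rows, ["Agency", "SKU"], "YearMonth", "Price")
```

Only key combinations that actually appear in the rows become series, so the number of series can be far smaller than the number of possible Agency-SKU pairs (58 × 28 = 1624 for the data described here, against 350 series).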

Additional Info

Agency and SKU are categorical, so we would have to encode them. There are 58 unique Agencies and 28 unique SKUs. For simplicity, and to test that everything works with static covariates, I'm just converting them to integer codes for now, as below (I will try one-hot encoding later).

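import pandas as pd

df = pd.read_csv(fin)  # fin: path to the input CSV
for col in ["Agency", "SKU"]:
    # Replace each category label with an integer code
    df[col] = pd.Categorical(df[col]).codes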
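
The one-hot alternative mentioned above could look like the following sketch. It relies on `pandas.get_dummies`; the small DataFrame and its `Sales` column are made up for illustration and stand in for the real CSV.

```python
import pandas as pd

# Made-up stand-in for the real table.
df = pd.DataFrame(
    {
        "Agency": ["Agency_01", "Agency_02", "Agency_01"],
        "SKU": ["SKU_01", "SKU_01", "SKU_02"],
        "Sales": [10.0, 12.5, 7.0],
    }
)

# Replace each categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["Agency", "SKU"], dtype=int)
```

With 58 Agencies and 28 SKUs, this approach would produce 86 indicator columns instead of two integer-coded columns, which is worth keeping in mind when comparing the two encodings.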

We could also add data from the other files in the Kaggle data directory to the CSV, which would give us numerical static covariates, such as the average population per Agency region from demographics.csv, and further covariates from the remaining files (weather, industry volume, historical volume, industry soda sales).
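
A join of this kind might be sketched as follows. The two small DataFrames and the column names `Sales` and `avg_population` are assumptions for illustration only; the real column names in the Kaggle files may differ.

```python
import pandas as pd

# Made-up stand-ins for the sales table and the demographics table.
sales = pd.DataFrame(
    {"Agency": ["Agency_01", "Agency_02", "Agency_01"], "Sales": [10.0, 12.5, 7.0]}
)
demographics = pd.DataFrame(
    {"Agency": ["Agency_01", "Agency_02"], "avg_population": [150000, 90000]}
)

# A left join keeps every sales row and repeats the per-agency value on each one,
# which is what makes it a static (time-invariant) covariate.
merged = sales.merge(demographics, on="Agency", how="left", validate="many_to_one")
```

The `validate="many_to_one"` argument makes pandas raise an error if an Agency appears more than once in the demographics table, which would otherwise silently duplicate sales rows.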

gdevos010 (Contributor Author) commented

@dennisbader this seems like a separate PR. Can we move this discussion to #966 or #597 ? I agree that this is an important next step for datasets.

hrzn (Contributor) commented May 22, 2022

> @dennisbader this seems like a separate PR. Can we move this discussion to #966 or #597 ? I agree that this is an important next step for datasets.

I also agree with that. @dennisbader could we address it in a separate PR once your static covariates one is merged?

…age if pre_process_zipped_csv_fn is used on csv files

gdevos010 (Contributor Author) commented

@hrzn The dataset tests won't pass because the ETT files are not yet in main. Should we merge, or handle it a different way?

darts/datasets/__init__.py (outdated, resolved)

hrzn (Contributor) commented May 27, 2022

> @hrzn The dataset tests won't pass because the ETT files are not yet in main. Should we merge or handle it a different way

I think we have no other choice but to merge and hope for the best :) We'll open a fix PR later on if it fails for some reason.

hrzn merged commit bf634f7 into unit8co:master on May 27, 2022
gdevos010 deleted the new-datasets branch on May 27, 2022 19:22
4 participants