Feat/New Datasets (ETT and Electricity) #960
@hrzn I also need help understanding why this one failed. It's failing on the Electricity dataset but not the ETT ones.
The tests fail because a newer version of some dependency is introducing an issue. It's not due to this PR; it's a separate issue that we'll fix ASAP.
Codecov Report
@@            Coverage Diff             @@
##           master     #960      +/-   ##
==========================================
+ Coverage   92.61%   92.65%   +0.04%
==========================================
  Files          74       74
  Lines        7404     7449      +45
==========================================
+ Hits         6857     6902      +45
  Misses        547      547
Continue to review full report at Codecov.
Thanks, it's a very nice initiative to add more datasets :)
I have a few comments before we can merge this one.
# Conflicts: # darts/datasets/dataset_loaders.py
Those will be nice additions, thanks for that!

TL;DR
Could we add another dataset for static covariates? The target is monthly beer sales per Agency and SKU (Stock Keeping Unit). With some preprocessing, this dataset can contain categorical static covariates, numerical static covariates, and additional covariates such as weather, historical industry sales, ...

Description
From a quick look at the data, these are multivariate datasets without any static covariates (such as household id or similar), right? Could we imagine also adding a dataset for static covariates? I'm currently using the one described below.

This longitudinal table from here includes some of Stallion & Co.'s beer sales data from different wholesalers (Agencies) and SKUs (Stock Keeping Units).

Context: Country Beeristan, a high potential market, accounts for nearly 10% of Stallion & Co.’s global beer sales. Stallion & Co. has a large portfolio of products distributed to retailers through wholesalers (agencies). There are thousands of unique wholesaler-SKU/product combinations. In order to plan its production and distribution, as well as to help wholesalers with their planning, it is important for Stallion & Co. to have an accurate estimate of demand at SKU level for each wholesaler.

I'm particularly interested in price_sales_promotion.csv ($/hectoliter), which holds the monthly price, sales, and promotion data in dollar value per hectoliter at Agency-SKU-month level from January 2013 to December 2017.

Additional Info
Agency and SKU are categorical, so we would have to encode them. There are 58 unique Agencies and 28 unique SKUs. For simplicity, and to test whether everything works with static covariates, I'm just converting them to numerical values right now (I will try one-hot encoding later on).
We could also add data from the other files in the Kaggle data directory to the csv, so that we also have numerical static covariates such as the average population per Agency region from demographics.csv, and other covariates from the remaining files (weather, industry volume, historical volume, industry soda sales).
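The numerical conversion of Agency and SKU mentioned above is not shown in the thread. A minimal sketch of one way to do it, using only the standard library, could look like this. The helper `encode_categories` and the sample labels are hypothetical and only for illustration; they are not taken from the PR or from the dataset itself.

```python
def encode_categories(values):
    """Map each distinct label to an integer code.

    Codes are assigned in sorted label order, starting at 0.
    Returns the encoded list and the label -> code mapping.
    """
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Hypothetical sample rows (assumed label format); the real table is said
# to contain 58 unique Agencies and 28 unique SKUs.
agencies = ["Agency_22", "Agency_01", "Agency_22", "Agency_05"]
skus = ["SKU_03", "SKU_01", "SKU_01", "SKU_03"]

agency_codes, agency_map = encode_categories(agencies)
sku_codes, sku_map = encode_categories(skus)

print(agency_codes)  # integer code per row
print(sku_codes)
```

Integer codes like these impose an arbitrary ordering on the categories, which is one reason one-hot encoding is often tried next.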
@dennisbader this seems like a separate PR. Can we move this discussion to #966 or #597? I agree that this is an important next step for datasets.
I also agree with that. @dennisbader could we address it in a separate PR once your static covariates one is merged?
…age if pre_process_zipped_csv_fn is used on csv files
@hrzn The dataset tests won't pass because the ETT files are not yet in main. Should we merge, or handle it a different way?
I think we have no other choice but to merge and hope for the best :) We'll open a fix PR later on if it fails for some reason.
Fixes part of #617.
The PEMS-SF dataset is in an odd format and I'm not sure how to include it.
Summary
Added the ETT and Electricity datasets.
Added a method for processing other .zip datasets.
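As a rough illustration of what processing a zipped CSV dataset can involve, the sketch below reads one CSV member out of an in-memory .zip archive with the standard library. It is not the method added in this PR: the function `read_csv_from_zip`, the helper `make_demo_zip`, and the file name `demo.csv` are all hypothetical.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_bytes, member_name):
    """Return the rows of one CSV file stored inside a zip archive.

    zip_bytes is the raw content of the archive (e.g. a downloaded file),
    member_name is the path of the CSV inside the archive.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member_name) as raw:
            # ZipFile.open yields bytes, so wrap it for text-mode CSV parsing.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


def make_demo_zip():
    """Build a tiny archive in memory so the example runs offline."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo.csv", "date,load\n2012-01-01,3.5\n2012-01-02,4.0\n")
    return buffer.getvalue()


rows = read_csv_from_zip(make_demo_zip(), "demo.csv")
print(rows)
```

In a real loader the bytes would come from a download, and any dataset-specific cleanup would run on the parsed rows before building a time series.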
Other Information
None