Avoid warning in Many-Models Notebook (Azure#1971)
* avoid warning

* update reason for dropping column

* update data_preprocessing_tabular script

Co-authored-by: Rahul Kumar <rahulkuma@microsoft.com>
iamrk04 and Rahul Kumar committed Dec 10, 2022
1 parent cd336db commit 9b4f99c
Showing 3 changed files with 9 additions and 3 deletions.
@@ -368,7 +368,6 @@
 "\n",
 "forecasting_parameters = ForecastingParameters(\n",
 " time_column_name=TIME_COLNAME,\n",
-" drop_column_names=\"Revenue\",\n",
 " forecast_horizon=6,\n",
 " time_series_id_column_names=partition_column_names,\n",
 " cv_step_size=\"auto\",\n",

@@ -433,7 +433,6 @@
 "\n",
 "forecasting_parameters = ForecastingParameters(\n",
 " time_column_name=\"WeekStarting\",\n",
-" drop_column_names=\"Revenue\",\n",
 " forecast_horizon=6,\n",
 " time_series_id_column_names=partition_column_names,\n",
 " cv_step_size=\"auto\",\n",
@@ -469,7 +468,9 @@
 "\n",
 "Reuse of previous results (``allow_reuse``) is key when using pipelines in a collaborative environment since eliminating unnecessary reruns offers agility. Reuse is the default behavior when the ``script_name``, ``inputs``, and the parameters of a step remain the same. When reuse is allowed, results from the previous run are immediately sent to the next step. If ``allow_reuse`` is set to False, a new run will always be generated for this step during pipeline execution.\n",
 "\n",
-"> Note that we only support partitioned FileDataset and TabularDataset without partition when using such output as input."
+"> Note that we only support partitioned FileDataset and TabularDataset without partition when using such output as input.\n",
+"\n",
+"> Note that we **drop column** \"Revenue\" from the dataset in this step to avoid information leak as \"Quantity\" = \"Revenue\" / \"Price\". **Please modify the logic based on your data**."
 ]
 },
 {

@@ -11,6 +11,12 @@ def main(args):
     dataset = run_context.input_datasets["train_10_models"]
     df = dataset.to_pandas_dataframe()

+    # Drop the column "Revenue" from the dataset to avoid information leak as
+    # "Quantity" = "Revenue" / "Price". Please modify the logic based on your data.
+    drop_column_name = "Revenue"
+    if drop_column_name in df.columns:
+        df.drop(drop_column_name, axis=1, inplace=True)
+
     # Apply any data pre-processing techniques here

     df.to_parquet(output / "data_prepared_result.parquet", compression=None)
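
For readers adapting the pre-processing change to their own data, the following is a minimal, self-contained sketch of why "Revenue" leaks the target and how a guarded drop behaves. It is not part of the commit: the toy DataFrame, its values, and the helper name `drop_leaky_columns` are assumptions made for illustration, and it uses the `errors="ignore"` option of pandas `DataFrame.drop` instead of the explicit `in df.columns` check used in the script.

```python
import pandas as pd


def drop_leaky_columns(df: pd.DataFrame, leaky_columns: list[str]) -> pd.DataFrame:
    """Return a copy of df without the listed columns; absent names are skipped."""
    # errors="ignore" makes the call safe when a column is already missing,
    # which mirrors the `if drop_column_name in df.columns` guard in the script.
    return df.drop(columns=leaky_columns, errors="ignore")


# Toy data, invented for illustration: Revenue is exactly Quantity * Price.
df = pd.DataFrame(
    {
        "WeekStarting": pd.to_datetime(["1990-06-14", "1990-06-21", "1990-06-28"]),
        "Price": [2.0, 2.5, 4.0],
        "Quantity": [100.0, 80.0, 50.0],
    }
)
df["Revenue"] = df["Quantity"] * df["Price"]

# The leak: the target "Quantity" can be recovered from two feature columns.
recovered = df["Revenue"] / df["Price"]
leak_detected = bool((recovered == df["Quantity"]).all())

# Drop the leaky column before training; the original frame is left untouched.
prepared = drop_leaky_columns(df, ["Revenue"])

# Running the step again is harmless because absent columns are ignored.
prepared_again = drop_leaky_columns(prepared, ["Revenue"])
```

Returning a new frame rather than dropping in place keeps the input intact for inspection; the in-place form in the commit is equivalent for the pipeline step, since only the prepared result is written out.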
