[BUG] Prediction on multiple gpus return results from only one gpu. #1945
Comments
Hi @heury, and thanks for writing. Unfortunately, we don't have the infrastructure to test on multiple GPUs, so we would need a bit of help with debugging. To make sure that everything is set up as expected:
You can reproduce it with the TFT example code in darts.
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook as tqdm
import torch
import matplotlib.pyplot as plt
from darts import TimeSeries, concatenate
from darts.dataprocessing.transformers import Scaler
from darts.models import TFTModel
from darts.metrics import mape
from darts.utils.statistics import check_seasonality, plot_acf
from darts.datasets import AirPassengersDataset, IceCreamHeaterDataset
from darts.utils.timeseries_generation import datetime_attribute_timeseries
from darts.utils.likelihood_models import QuantileRegression
import warnings
warnings.filterwarnings("ignore")
import logging
logging.disable(logging.CRITICAL)

def get_model():
    quantiles = [
        0.01,
        0.05,
        0.1,
        0.15,
        0.2,
        0.25,
        0.3,
        0.4,
        0.5,
        0.6,
        0.7,
        0.75,
        0.8,
        0.85,
        0.9,
        0.95,
        0.99,
    ]
    input_chunk_length = 24
    forecast_horizon = 12
    my_model = TFTModel(
        input_chunk_length=input_chunk_length,
        output_chunk_length=forecast_horizon,
        hidden_size=64,
        lstm_layers=1,
        num_attention_heads=4,
        dropout=0.1,
        batch_size=16,
        n_epochs=300,
        add_relative_index=False,
        add_encoders=None,
        likelihood=QuantileRegression(
            quantiles=quantiles
        ),  # QuantileRegression is set per default
        # loss_fn=MSELoss(),
        pl_trainer_kwargs={
            "accelerator": "gpu",
            # "gpus": 1,
            "devices": "auto",
            # "devices": [0],
            # "devices": 1,
            # "strategy": 'ddp_notebook',
            # "logger": wandb_logger,
        },
        categorical_embedding_sizes={},
        random_state=42,
    )
    return my_model

def eval_model(model, n, val_series, num_samples):
    pred_series = model.predict(n=n, num_samples=num_samples)
    # plot actual series
    # plot prediction with quantile ranges
    print("MAPE: {:.2f}%".format(mape(val_series, pred_series)))

def main(cmd):
    # before starting, we define some constants
    num_samples = 200
    figsize = (9, 6)
    lowest_q, low_q, high_q, highest_q = 0.01, 0.1, 0.9, 0.99
    label_q_outer = f"{int(lowest_q * 100)}-{int(highest_q * 100)}th percentiles"
    label_q_inner = f"{int(low_q * 100)}-{int(high_q * 100)}th percentiles"
    # Read data
    series = AirPassengersDataset().load()
    # we convert monthly number of passengers to average daily number of passengers per month
    series = series / TimeSeries.from_series(series.time_index.days_in_month)
    series = series.astype(np.float32)
    # Create training and validation sets:
    training_cutoff = pd.Timestamp("19571201")
    train, val = series.split_after(training_cutoff)
    # Normalize the time series (note: we avoid fitting the transformer on the validation set)
    transformer = Scaler()
    train_transformed = transformer.fit_transform(train)
    val_transformed = transformer.transform(val)
    series_transformed = transformer.transform(series)
    # create year, month and integer index covariate series
    covariates = datetime_attribute_timeseries(series, attribute="year", one_hot=False)
    covariates = covariates.stack(
        datetime_attribute_timeseries(series, attribute="month", one_hot=False)
    )
    covariates = covariates.stack(
        TimeSeries.from_times_and_values(
            times=series.time_index,
            values=np.arange(len(series)),
            columns=["linear_increase"],
        )
    )
    covariates = covariates.astype(np.float32)
    # transform covariates (note: we fit the transformer on train split and can then transform the entire covariates series)
    scaler_covs = Scaler()
    cov_train, cov_val = covariates.split_after(training_cutoff)
    scaler_covs.fit(cov_train)
    covariates_transformed = scaler_covs.transform(covariates)
    if cmd == "train":
        # default quantiles for QuantileRegression
        my_model = get_model()
        my_model.fit(train_transformed, future_covariates=covariates_transformed, verbose=True)
        my_model.save("tft_model.pt")
        pred_series = my_model.predict(n=24, num_samples=num_samples)
        print("MAPE: {:.2f}%".format(mape(val_transformed, pred_series)))
    elif cmd == "load":
        my_model = get_model()
        my_model.load_weights(path="tft_model.pt")
        pred_series = my_model.predict(n=12, series=train_transformed, future_covariates=covariates_transformed)
        print("MAPE: {:.2f}%".format(mape(val_transformed, pred_series)))

if __name__ == "__main__":
    torch.multiprocessing.freeze_support()
    main("train")
    # main("load")
Can you try creating the model with … Also, this example should work, as it's only fitting/predicting a single series, right?
I use lightning==2.0.6, and "auto_select_gpus" does not exist in that version. I use multiple time series with the TFT model.
Alright, can you reduce your example to a minimal one that still causes the issue? That makes it easier for others to debug. You can use one of our smaller datasets as the series (and then just make a list of it).
Does the above work for you or not?
I think my description is not enough.
result
code
import torch
from darts.models import TFTModel
from darts.datasets import AirPassengersDataset

if __name__ == "__main__":
    torch.multiprocessing.freeze_support()
    series = AirPassengersDataset().load()
    # series = [series] * 10
    model = TFTModel(
        input_chunk_length=12,
        output_chunk_length=6,
        add_relative_index=True,
        pl_trainer_kwargs={"accelerator": "gpu", "devices": "auto"}
    )
    model.fit(series, epochs=10)
    preds = model.predict(n=6, series=series, num_samples=100)
    print("len(preds)", len(preds), "len(series)", len(series))
result
code
import torch
from darts.models import TFTModel
from darts.datasets import AirPassengersDataset

if __name__ == "__main__":
    torch.multiprocessing.freeze_support()
    series = AirPassengersDataset().load()
    series = [series] * 10
    model = TFTModel(
        input_chunk_length=12,
        output_chunk_length=6,
        add_relative_index=True,
        pl_trainer_kwargs={"accelerator": "gpu", "devices": "auto"}
    )
    model.fit(series, epochs=10)
    preds = model.predict(n=6, series=series, num_samples=100)
    print("len(preds)", len(preds), "len(series)", len(series))
Any news about this? I'm trying to do the same thing with an RNNModel and it gives me the exact same result.
@dennisbader I am also facing a similar issue. I ran the code below. During prediction, the work is distributed across all the GPUs, so I am not able to use "pred" for any further calculation. Please let me know if there is a solution.
if __name__ == "__main__":
----------------------------------------- Output -----------------------------------------------------------------------
I have a problem that is almost exactly the same as this one. If anyone knows how to solve it, please reply. Thank you!
Hello, I am facing almost exactly the same problem. When can this be fixed, or can you suggest a workaround?
Hi, any update on this specific issue?
Hi, a solution seems to have been suggested in #2265. Can you check if it solves the problem?
Describe the bug
My server has four GPUs. When I test prediction on all four, darts seems to return results from only one GPU.
To Reproduce
I trained a TFT model with the following trainer options and saved the model to a file.
pl_trainer_kwargs={
    "accelerator": "gpu",
    "devices": "auto",
}
I loaded the model from the file with the "load_weights" method.
model.load_weights("tft_model.pt")
Then I ran prediction on the test dataset, but the number of results does not match the number of input series.
For example, I get only two results when I pass 10 time series for prediction.
It seems that darts returns the results from only one GPU and the other results are missing.
This does not happen on a single GPU with the following options.
pl_trainer_kwargs={
    "accelerator": "gpu",
    "devices": [0],
}
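Based on the observation above, one possible workaround (not confirmed as a fix anywhere in this thread) is to keep multi-GPU training but rebuild the model with a single device before loading the saved weights for prediction. The sketch below is only an illustration: `single_gpu_trainer_kwargs` and `load_for_prediction` are hypothetical helper names, and `model_factory` is assumed to be any callable that accepts a `pl_trainer_kwargs` keyword and returns an object with a `load_weights` method.

```python
from typing import Any, Callable, Dict


def single_gpu_trainer_kwargs(device_index: int = 0) -> Dict[str, Any]:
    """Trainer options that pin prediction to one GPU (hypothetical helper)."""
    return {"accelerator": "gpu", "devices": [device_index]}


def load_for_prediction(
    model_factory: Callable[..., Any],
    weights_path: str,
    device_index: int = 0,
) -> Any:
    """Build a fresh model restricted to a single GPU and load saved weights.

    Assumption: model_factory accepts a pl_trainer_kwargs keyword and the
    returned object exposes load_weights(path), as in the report above.
    """
    model = model_factory(pl_trainer_kwargs=single_gpu_trainer_kwargs(device_index))
    model.load_weights(weights_path)
    return model
```

With darts this might be called as `load_for_prediction(get_model, "tft_model.pt")`, assuming `get_model` is changed to forward `pl_trainer_kwargs` to `TFTModel`.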
Expected behavior
The number of prediction outputs must match the number of input series.
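This expectation can be turned into an explicit check that runs right after prediction. The following is a minimal sketch with a hypothetical helper name, `check_prediction_count`; it relies only on `len`, and assumes that passing a list of series to `predict` returns a list of forecasts.

```python
from typing import Sequence


def check_prediction_count(inputs: Sequence, predictions: Sequence) -> None:
    """Raise if the number of forecasts differs from the number of input series."""
    if len(predictions) != len(inputs):
        raise ValueError(
            f"expected {len(inputs)} predictions, got {len(predictions)}; "
            "some results may be missing"
        )
```

Calling `check_prediction_count(series, preds)` after `model.predict(...)` would make the mismatch described above (two results for 10 input series) fail loudly instead of passing unnoticed.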
System (please complete the following information):