Training fails (sometimes) when using several GPUs #58
Hi! Thank you for such a detailed report. What I see from the error traceback: this happened because of the too low value of kv_compression_ratio. The purpose of this parameter is to reduce the number of features to make the attention faster. The best scenario is when you don't need compression at all (i.e. kv_compression_ratio=None). Does this help?
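To make the arithmetic concrete, here is a short sketch using the numbers from this thread (152 numerical features plus the CLS token, and the kv_compression_ratio of 0.004 used below); the rounding to zero is the assumed failure mode:

```python
# With 152 numerical features, the Transformer sees 152 + 1 = 153 tokens
# (the extra one is the CLS token appended by FTTransformer).
n_tokens = 152 + 1
kv_compression_ratio = 0.004

# The key/value compression projects n_tokens down to roughly
# n_tokens * kv_compression_ratio tokens; with such a small ratio
# this rounds down to zero, i.e. a degenerate compression layer.
n_compressed_tokens = int(n_tokens * kv_compression_ratio)
print(n_compressed_tokens)  # 0

# With 252 features (253 tokens) the same ratio gives int(1.012) = 1,
# which is consistent with the report that shape (803473, 252) trains fine.
```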
Dear @Yura52,

Thanks a lot for your answer, it does help! Here is the workaround I ended up with:

class FTT(rtdl.FTTransformer):
    def __init__(self, n_num_features=None, cat_cardinalities=None, d_token=16,
                 n_blocks=1, attention_n_heads=4, attention_dropout=0.3,
                 attention_initialization='kaiming', attention_normalization='LayerNorm',
                 ffn_d_hidden=16, ffn_dropout=0.1, ffn_activation='ReGLU',
                 ffn_normalization='LayerNorm', residual_dropout=0.0,
                 prenormalization=True, first_prenormalization=False,
                 last_layer_query_idx=[-1], n_tokens=None, kv_compression_ratio=0.004,
                 kv_compression_sharing='headwise', head_activation='ReLU',
                 head_normalization='LayerNorm', d_out=None):
feature_tokenizer = rtdl.FeatureTokenizer(
n_num_features=n_num_features,
cat_cardinalities=cat_cardinalities,
d_token=d_token
)
transformer = rtdl.Transformer(
d_token=d_token,
n_blocks=n_blocks,
attention_n_heads=attention_n_heads,
attention_dropout=attention_dropout,
attention_initialization=attention_initialization,
attention_normalization=attention_normalization,
ffn_d_hidden=ffn_d_hidden,
ffn_dropout=ffn_dropout,
ffn_activation=ffn_activation,
ffn_normalization=ffn_normalization,
residual_dropout=residual_dropout,
prenormalization=prenormalization,
first_prenormalization=first_prenormalization,
last_layer_query_idx=last_layer_query_idx,
n_tokens=None if int(kv_compression_ratio * n_num_features) == 0 else n_num_features + 1, # Modified line
kv_compression_ratio=None if int(kv_compression_ratio * n_num_features) == 0 else kv_compression_ratio, # Modified line
kv_compression_sharing=None if int(kv_compression_ratio * n_num_features) == 0 else "headwise", # Modified line
head_activation=head_activation,
head_normalization=head_normalization,
d_out=d_out
)
        super(FTT, self).__init__(feature_tokenizer, transformer)

It's clearly not optimal and can be improved (e.g., by automatically setting the value of kv_compression_ratio with respect to the number of input features instead of always using 0.004; see the sketch below), but for the moment it is sufficient, as the code runs. However, I'm having trouble understanding why the code was working when I was running it on a single GPU (1x RTX 2080 Ti) but not on two (2x RTX 2080 Ti). Could you explain this? I understand your answer, but I don't get why the code still runs locally with 152 input features and a kv_compression_ratio of 0.004.
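Here is a rough sketch of the automatic variant I have in mind (the helper name and the target of 8 compressed tokens are arbitrary; it only assumes n_tokens = n_num_features + 1, as in the class above):

```python
def pick_kv_compression(n_num_features, target_compressed_tokens=8):
    """Choose kv_compression settings so the compressed key/value length
    never rounds down to zero (roughly `target_compressed_tokens` tokens)."""
    n_tokens = n_num_features + 1  # + 1 for the CLS token
    if n_tokens <= target_compressed_tokens:
        # Too few tokens for compression to be worthwhile: disable it.
        return None, None, None
    kv_compression_ratio = target_compressed_tokens / n_tokens
    return n_tokens, kv_compression_ratio, 'headwise'

# With 152 features this yields a ratio of about 0.05 instead of 0.004,
# so the compressed length is roughly 8 tokens rather than 0.
n_tokens, kv_compression_ratio, kv_compression_sharing = pick_kv_compression(152)
```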
I should admit I don't have a good explanation for that :)

Feel free to reopen the issue if needed!
Dear maintainers,
I've been using your package for a while now (especially for the FTT model). I've never encountered any trouble, and it helped me boost my performance on a tabular dataset.
Recently, I've been doing some ablation studies, i.e., I've removed some input features to check whether the performance of the model would decrease (and if so, by how much). I've discovered that, when using several GPUs (2x RTX 2080 Ti), training fails for certain numbers of input features (but not always; it really depends on the number of input features).
I'm using torch.nn.DataParallel to implement data parallelism, and for some reason related to my framework I don't wish to use torch.nn.parallel.DistributedDataParallel.

Here's a minimal reproducible example to prove my point:
When my training data (a CSR matrix) has a shape of (803473, 152) (i.e., 803473 samples with 152 features each), this code fails (on multi-GPU). However, with training data of shape (803473, 252) (I just tried a random number), it works smoothly.
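For reference, here is a minimal sketch of the kind of setup I'm describing (not the original script: it uses random dense data in place of the real CSR matrix, assumes a single regression output, and assumes that rtdl.FTTransformer.make_default accepts the kv_compression arguments shown):

```python
import torch
import rtdl

n_features = 152  # fails on 2 GPUs; 252 trains fine
X = torch.randn(803473, n_features)  # dense stand-in for the CSR training matrix
y = torch.randn(len(X))

model = rtdl.FTTransformer.make_default(
    n_num_features=n_features,
    cat_cardinalities=None,
    last_layer_query_idx=[-1],
    kv_compression_ratio=0.004,
    kv_compression_sharing='headwise',
    d_out=1,
)
model = torch.nn.DataParallel(model).cuda()

loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=1024, shuffle=True
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

for x_num, y_batch in loader:
    optimizer.zero_grad()
    # No categorical features, hence x_cat=None.
    prediction = model(x_num.cuda(), None).squeeze(-1)
    loss_fn(prediction, y_batch.cuda()).backward()
    optimizer.step()
    break  # one step is enough to trigger (or not) the failure
```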
Here are the logs:
Some reminders:
Thanks for your help!