Models not converging in image classification problem #1989
After some more investigation, I noted that the following had been added to the training step in the CIFAR-10 example:

```python
scaler = GradScaler(enabled=with_amp)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

This compares with my original example of:

```python
loss.backward()
optimizer.step()
lr_scheduler.step()
```

It appears the gradient scaling is doing the trick.

```python
def train_model(model, criterion, optimizer, scheduler, device, dataloaders, dataset_sizes,
                num_epochs=25, return_history=False, log_history=True, working_dir='output'):

    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    history = {'epoch': [], 'train_loss': [], 'test_loss': [], 'train_acc': [], 'test_acc': []}

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'test']:
            if phase == 'train':
                model.train()   # Set model to training mode
            else:
                model.eval()    # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history only if in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            if phase == 'train':
                scheduler.step()

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            history['epoch'].append(epoch)
            history[phase + '_loss'].append(epoch_loss)
            history[phase + '_acc'].append(epoch_acc)

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'test' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

            if log_history:
                save_pickle(history, os.path.join(working_dir, 'model_history.pkl'))

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    print('Returning object of best model.')
    model.load_state_dict(best_model_wts)

    if return_history:
        return model, history
    else:
        return model
```
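For context, here is a minimal, self-contained sketch of how the `autocast`/`GradScaler` pattern referenced above usually fits together. The names `with_amp`, `model`, `optimizer`, `criterion`, `train_loader` and `device` are assumed from the surrounding script, and the scaler is created once outside the loop in this sketch:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Create the scaler once, outside the training loop
scaler = GradScaler(enabled=with_amp)

for x, y in train_loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()

    # Run the forward pass and loss in mixed precision when enabled
    with autocast(enabled=with_amp):
        y_pred = model(x)
        loss = criterion(y_pred, y)

    # Scale the loss to avoid underflow of small float16 gradients,
    # then step the optimizer through the scaler and update its scale factor
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```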
Thanks for the report @ecm200 !

```python
def initialize(model_func, model_args, criterion=None, optimizer=None, scheduler=None, is_torchvision=True, num_classes=200):

    # Setup the model object
    model = model_func(**model_args)

    if is_torchvision:
        # Alternatively, it can be generalized to nn.Linear(num_ftrs, len(class_names)).
        model.fc = nn.Linear(model.fc.in_features, num_classes)

    # Setup loss criterion and optimizer
    if (optimizer == None):
        optimizer = optim.SGD(params=model.parameters(), lr=0.001, momentum=0.9)
    if criterion == None:
        criterion = nn.CrossEntropyLoss()

    # Setup learning rate scheduler
    if scheduler == None:
        scheduler = StepLR(optimizer=optimizer, step_size=7, gamma=0.1)

    return model, optimizer, criterion, scheduler

optimizer = None
model, optimizer, criterion, lr_scheduler = initialize(
    model_func=model_func,
    model_args=model_args,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    is_torchvision=True,
    num_classes=200
)

model = model.to(device)
```

In Python we usually compare with `is` when checking for `None` (e.g. `if optimizer is None:`). Also, the optimizer should be created after the model has been moved to the device:

```diff
model = ...
- optimizer = optim.SGD(model.parameters())
model = model.to("cuda")
+ optimizer = optim.SGD(model.parameters())
```
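A minimal, self-contained sketch of the ordering being suggested; the toy model and the hyperparameter values are placeholders, not taken from the thread:

```python
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy model standing in for the GoogLeNet used in the thread
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 200))

# Move the model to the device first...
model = model.to(device)

# ...then create the optimizer from the already-moved parameters
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```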
Actually, the grad scaler and AMP are optional and not usually required.
Hi @vfdev-5,

Thanks very much for your observation. I think I just found something else I missed, which was a difference between my simple script and the Ignite examples. In the model update phase of the training step, the batch inference, the loss calculation and the backward propagation of the loss all take place under the `with torch.set_grad_enabled(True)` call. This was not in my original training step for the Ignite training function.

```python
# Iterate over data.
for inputs, labels in dataloaders[phase]:
    inputs = inputs.to(device)
    labels = labels.to(device)

    # zero the parameter gradients
    optimizer.zero_grad()

    # forward
    # track history only if in train
    with torch.set_grad_enabled(phase == 'train'):
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)
        loss = criterion(outputs, labels)

        # backward + optimize only if in training phase
        if phase == 'train':
            loss.backward()
            optimizer.step()
```

I think this means that unless the backward propagation is done under that context, the model parameters cannot be updated, which would tally with the fact that the training up to this point did not appear to be updating the model parameters, as the loss was not reducing.
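For reference, a minimal illustration of how `torch.set_grad_enabled` gates gradient tracking; the tensor shapes and values here are arbitrary:

```python
import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

# With gradient tracking enabled (the default), backward() populates w.grad
with torch.set_grad_enabled(True):
    loss = (w * x).sum()
loss.backward()
print(w.grad)                       # gradient of loss w.r.t. w (equals x)

# With gradient tracking disabled, the result is detached from the graph
with torch.set_grad_enabled(False):
    loss_no_grad = (w * x).sum()
print(loss_no_grad.requires_grad)   # False -> backward() here would raise an error
```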
@vfdev-5 Good spot, sir! I checked in my original simple script and I had indeed pushed the model to the compute device before creating the optimizer, whereas here I did not. I have fixed that by simply adding the `.to(device)` call in the init function.

```python
def initialize(device, model_func, model_args, criterion=None, optimizer=None, scheduler=None, is_torchvision=True, num_classes=200):

    # Setup the model object
    model = model_func(**model_args)

    if is_torchvision:
        # Alternatively, it can be generalized to nn.Linear(num_ftrs, len(class_names)).
        model.fc = nn.Linear(model.fc.in_features, num_classes)

    model.to(device)

    # Setup loss criterion and optimizer
    if (optimizer == None):
        optimizer = optim.SGD(params=model.parameters(), lr=0.001, momentum=0.9)
    if criterion == None:
        criterion = nn.CrossEntropyLoss()

    # Setup learning rate scheduler
    if scheduler == None:
        scheduler = StepLR(optimizer=optimizer, step_size=7, gamma=0.1)

    return model, optimizer, criterion, scheduler
```
Well, let me try to reproduce the issue here with CIFAR-10: https://colab.research.google.com/drive/12tSCyNEvkJXYvEOt82dnCu-KZLWfEQVL?usp=sharing
@vfdev-5 ok, that would be great. I have just tried running the model again, without the GradScaler function, and I have the same issues again.

So without GradScaler I get this behaviour:

And with GradScaler, I get this behaviour:

```python
def initialize(device, model_func, model_args, criterion=None, optimizer=None, scheduler=None, is_torchvision=True, num_classes=200):

    # Setup the model object
    model = model_func(**model_args)

    if is_torchvision:
        # Alternatively, it can be generalized to nn.Linear(num_ftrs, len(class_names)).
        model.fc = nn.Linear(model.fc.in_features, num_classes)

    model.to(device)

    # Setup loss criterion and optimizer
    if (optimizer == None):
        optimizer = optim.SGD(params=model.parameters(), lr=0.001, momentum=0.9)
    if criterion == None:
        criterion = nn.CrossEntropyLoss()

    # Setup learning rate scheduler
    if scheduler == None:
        scheduler = StepLR(optimizer=optimizer, step_size=7, gamma=0.1)

    return model, optimizer, criterion, scheduler


def create_trainer(model, optimizer, criterion, lr_scheduler):

    # Define any training logic for iteration update
    def train_step(engine, batch):

        # Get the images and labels for this batch
        x, y = batch[0].to(device), batch[1].to(device)

        # Set the model into training mode
        model.train()

        # Zero parameter gradients
        optimizer.zero_grad()

        # Update the model
        if with_grad_scale:
            with autocast(enabled=with_amp):
                y_pred = model(x)
                loss = criterion(y_pred, y)
            scaler = GradScaler(enabled=with_amp)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            with torch.set_grad_enabled(True):
                y_pred = model(x)
                loss = criterion(y_pred, y)
                loss.backward()
                optimizer.step()
                lr_scheduler.step()

        return loss.item()

    # Define trainer engine
    trainer = Engine(train_step)

    return trainer
```
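For completeness, a hedged sketch of how this factory might be wired up and run; the loader name `train_loader` and the `max_epochs` value are assumptions, not from the thread, and `device`, `model_func` and `model_args` are assumed to be defined elsewhere in the script:

```python
# Assumed wiring; variable names are illustrative
model, optimizer, criterion, lr_scheduler = initialize(
    device=device,
    model_func=model_func,
    model_args=model_args,
    num_classes=200,
)

trainer = create_trainer(model, optimizer, criterion, lr_scheduler)

# train_loader is the 'train' dataloader built elsewhere in the script
trainer.run(train_loader, max_epochs=5)
```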
OK, I see the issue: the LR scheduler is being called on every iteration instead of every epoch (as it was probably originally intended). This means the LR was driven down to 0 and thus there was no training. You can do the following with the LR scheduler:

```python
trainer.add_event_handler(Events.EPOCH_COMPLETED, lambda _: lr_scheduler.step())
```
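Putting that together, a minimal sketch of what the corrected non-AMP path might look like; this is an illustration of the suggestion above, not code from the thread, and `device`, `model`, `optimizer`, `criterion` and `lr_scheduler` are assumed to come from `initialize`:

```python
from ignite.engine import Engine, Events

def create_trainer(model, optimizer, criterion, lr_scheduler):

    def train_step(engine, batch):
        x, y = batch[0].to(device), batch[1].to(device)

        model.train()
        optimizer.zero_grad()

        y_pred = model(x)
        loss = criterion(y_pred, y)
        loss.backward()
        optimizer.step()
        # NOTE: no lr_scheduler.step() here any more

        return loss.item()

    trainer = Engine(train_step)

    # Step the LR scheduler once per epoch instead of once per iteration
    trainer.add_event_handler(Events.EPOCH_COMPLETED, lambda _: lr_scheduler.step())

    return trainer
```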
That would do it! So, to be clear, take the lr_scheduler.step() call out of train_step and replace it with the event handler above?
Yes, exactly.
Who'd have thought that updating the learning rate by iteration instead of by epoch would cause such an issue... Thanks again, great spot! I've been looking for that for about a day and you nailed it.
@ecm200 Updating the LR by iteration can also be a way to train models with warm-up and ramp-down, as here (see the sketch below): https://pytorch.org/ignite/contrib/handlers.html#piecewise-linear-scheduler But, yes, it can be tricky to check everything in a single run. Maybe we can think of providing a debug mode that would dump all the important things to check: loss, grads, LR, etc.

Glad that could help.
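As an illustration of the per-iteration scheduling mentioned above, a minimal sketch using Ignite's PiecewiseLinear parameter scheduler; the milestone and LR values are made up for the example, and `optimizer` and `trainer` are assumed to exist as in the earlier snippets:

```python
from ignite.contrib.handlers import PiecewiseLinear
from ignite.engine import Events

# Linear warm-up from 0 to 0.01 over the first 1000 iterations,
# then linear decay back to 0 by iteration 10000 (illustrative values)
piecewise_lr = PiecewiseLinear(
    optimizer,
    param_name="lr",
    milestones_values=[(0, 0.0), (1000, 0.01), (10000, 0.0)],
)

# Attach so the LR is updated at the start of every iteration
trainer.add_event_handler(Events.ITERATION_STARTED, piecewise_lr)
```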
@vfdev-5 It is happily training away now with that modification:
The problem
I am trying to train a Torchvision GoogLeNet model on a fine-grained classification problem using Ignite. I have successfully trained this model, and additional models from Torchvision and various other libraries (e.g. pytorchcv, timm), on this dataset using a simple training script that I developed.
However, training with Ignite using all the same parameters as my simple training script, I am getting non-convergence (it appears the network is not even training), and I haven't managed to figure out why yet. It does seem that somehow the model is not being updated, but I have triple-checked my implementation against the examples I have followed and, as far as I can see, it should be working.
Results from simple training script
Here are the metrics during training from the simple training script I developed, which show convergence during the initial epochs:
Results from Ignite
Here are example outputs of the epoch metrics during training for both the training and validation datasets. The initial training loss is similar for both approaches, but the 2nd epoch of the simple training script shows the network has learnt something, as the loss has reduced. However, comparing this to the Ignite version, there is no improvement in loss over 5 epochs.
Ignite Training Script
This is the script I have developed looking at the CIFAR-10 example as a guide:
It has one additional dependency, which is a dataloader creation function: