
Windows 10/Python 3.5 Pytorch 1.1.0 CPU increase versus 1.0.0 mnist #20969

djygithub opened this issue May 26, 2019 · 2 comments



commented May 26, 2019

🐛 Bug

Running the example MNIST application under Windows 10/Python 3.5/CUDA 10 shows a significant increase in CPU time for 1.1.0 versus 1.0.0. This is on an i7-7700, 32 GB RAM, Nvidia GTX 1050 GPU.

From VTune GPU Hotspots for 1.0.0:

Module / Function / Call Stack     Effective Time
python35.dll                       35.5
caffe2.dll                         30.2
ntoskrnl.exe                       18.4
ntdll.dll                          12.0
torch.dll                          6.6
igdkmd64.sys                       5.7
wxmsw28u_vc_amplxe_2.8.1226.dll    5.2
torch_python.dll                   4.3
c10.dll                            3.1
nvlddmkm.sys                       2.7

From VTune GPU Hotspots for 1.1.0:

Module / Function / Call Stack     Effective Time
ntoskrnl.exe                       371.2
igdkmd64.sys                       150.6
libiomp5md.dll                     86.0
python35.dll                       43.3
ntdll.dll                          29.2
caffe2.dll                         28.4
c10.dll                            19.3
symefasi64.sys                     10.5
torch.dll                          10.3
torch_python.dll                   5.7

The preceding video shows CUDA 9.0 plus torch 1.0.0; CUDA 10.0 plus torch 1.0.0 shows the same performance. It seems torch 1.1.0 has introduced a regression?

Package              Version
absl-py              0.6.1
astor                0.7.1
gast                 0.2.0
grpcio               1.17.0
h5py                 2.8.0
Keras-Applications   1.0.6
Keras-Preprocessing  1.0.5
Markdown             3.0.1
mock                 3.0.5
numpy                1.16.3
Pillow               5.4.1
pip                  18.1
protobuf             3.6.1
setuptools           40.6.2
six                  1.12.0
tensorboard          1.13.1
tensorflow-estimator 1.13.0
tensorflow-gpu       1.13.1
termcolor            1.1.0
torch                1.0.0
torchvision          0.2.2.post3
Werkzeug             0.14.1
wheel                0.32.3

Here's the modified MNIST example used for testing:

from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import time
from torchvision import datasets, transforms

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

def test(args, model, device, test_loader):
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss
            pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=512, metavar='N',
                        help='input batch size for training (default: 512)')
    parser.add_argument('--test-batch-size', type=int, default=10000, metavar='N',
                        help='input batch size for testing (default: 10000)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                        help='learning rate (default: 0.01)')
    parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                        help='SGD momentum (default: 0.5)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.test_batch_size, shuffle=True, **kwargs)

    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        train_start = time.time()
        train(args, model, device, train_loader, optimizer, epoch)
        diff_train = time.time() - train_start
        print("training iterations/second %g" % (60000 / diff_train))
        test_start = time.time()
        test(args, model, device, test_loader)
        diff_test = time.time() - test_start
        print("test images/second %g" % (10000 / diff_test))

if __name__ == '__main__':
    main()



commented May 26, 2019

Because of a stupid bug, 1.0.0 did not ship with OpenMP enabled. 1.1.0 shipped with OpenMP enabled and uses multiple cores; this is in line with expectations. MNIST is a really small workload, so OpenMP's multithread optimizations are probably slowing it down because most of the time is simply spent in thread overhead.
This is fixed on master because we moved away from OpenMP to our own thread pool, which has more reasonable characteristics.
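To see why fanning tiny tasks out to threads can be slower than doing them serially, here is a rough, PyTorch-free sketch (not from this issue) that compares a tiny workload run serially against the same workload dispatched across worker threads, paying pool setup and synchronization cost each time:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def tiny_task():
    # A workload too small to amortize thread startup/synchronization cost
    return sum(i * i for i in range(100))

def timed(fn, iters=200):
    # Wall-clock time for `iters` calls of fn
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return time.perf_counter() - start

def serial():
    tiny_task()

def fan_out():
    # Dispatch the same tiny task across 4 threads each iteration,
    # paying pool creation and join overhead every time
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(lambda _: tiny_task(), range(4)))

print("serial:  %.4fs" % timed(serial))
print("fan-out: %.4fs" % timed(fan_out, iters=50))
```

The per-iteration cost of the fan-out version is dominated by thread management, not the work itself, which is the same effect the OpenMP-enabled build hits on a workload as small as MNIST.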

@soumith soumith closed this May 26, 2019



commented May 26, 2019

A workaround for 1.1.0 is to use torch.set_num_threads(1) or the environment variable OMP_NUM_THREADS=1 (you can replace 1 with other values to tune the perf).
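Concretely, either knob can be applied at the top of the training script before any heavy work runs; the placement here is illustrative:

```python
import os

# Option 1: cap OpenMP threads via the environment *before* importing torch,
# since the OpenMP runtime reads OMP_NUM_THREADS at initialization.
os.environ["OMP_NUM_THREADS"] = "1"

import torch

# Option 2: restrict intra-op parallelism at runtime.
torch.set_num_threads(1)

print("intra-op threads:", torch.get_num_threads())
```

Either value can be raised (2, 4, ...) to find the sweet spot for a given workload.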
