
AssertionError: Torch not compiled with CUDA enabled - DETECTRON CPU/LINUX TRAINING ERROR  #41598

@svideloc

Hello,

I recently created a venv and installed PyTorch (CPU only) as follows:

pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

Then I installed the pre-built detectron2 wheel for Linux and CPU with the following command (all other prerequisites are also installed):

python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cpu/torch1.5/index.html
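A quick way to confirm that the wheels installed above really are CPU-only is a short check like the one below. It is a sketch that relies only on the public torch.__version__, torch.version.cuda and torch.cuda.is_available() attributes; the function name torch_build_report is hypothetical.

```python
def torch_build_report() -> dict:
    """Report whether the installed PyTorch wheel was built with CUDA support."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        # The CPU wheel installed above reports "1.5.1+cpu".
        "version": torch.__version__,
        # torch.version.cuda is None on CPU-only builds (USE_CUDA=0).
        "built_with_cuda": torch.version.cuda is not None,
        # False whenever the build lacks CUDA or no usable GPU/driver is present.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in torch_build_report().items():
        print(f"{key}: {value}")
```

On a CPU-only install this should report built_with_cuda: False and cuda_available: False, which matches the "GPU available False" line in the environment report.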

Instructions To Reproduce the Issue:

I am training on a custom dataset, and the trainer.train() line raises the following error:

AssertionError: Torch not compiled with CUDA enabled

  1. Here is my code to get there:
# Some basic setup:
# Setup detectron2 logger
import detectron2
from detectron2.utils.logger import setup_logger
setup_logger()

# import some common libraries
import numpy as np
import cv2
import os
import random
from matplotlib import pyplot as plt

# import some common detectron2 utilities
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog
from detectron2.structures import BoxMode
from detectron2.data.datasets import register_coco_instances
from detectron2.data.catalog import DatasetCatalog
from detectron2.engine import HookBase

register_coco_instances("boat_train", {}, "/home/Documents/train/instances.json", "/home/Documents/train")
register_coco_instances("boat_val", {}, "/home/Documents/val/instances.json", "/home/Documents/val")

from detectron2.engine import DefaultTrainer
from detectron2.engine import TrainerBase

#Specify Model yaml & weights to grab
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_C4_1x.yaml"))
cfg.MODEL.WEIGHTS = "/home/svidelock/source/Detectron/model_final_721ade.pkl" # Let training initialize from model zoo 

# Specify the output dir; if not set, detectron2 writes to "./output"
# cfg.OUTPUT_DIR = '/home/svidelock/source/Detectron/HyperParamDetectron/output2/'

#Specify Datasets
cfg.DATASETS.TRAIN = ("boat_train",)  # names of the registered datasets used for training
cfg.DATASETS.TEST = ("boat_val",) #validation set

#Hyperparams
cfg.SOLVER.IMS_PER_BATCH = 2 #means that in 1 iteration the model sees 2 images 
cfg.SOLVER.BASE_LR = 0.02 #learning rate

#Some other configurable items
cfg.DATALOADER.NUM_WORKERS = 2  # number of data-loading workers; depends on hardware config
# cfg.SOLVER.WARMUP_ITERS = 1000  # linearly ramp the learning rate up over the first 1000 iterations
# cfg.SOLVER.STEPS = (1000, 1500)  # iterations at which the learning rate is multiplied by GAMMA
# cfg.SOLVER.GAMMA = 0.001  # decay factor applied at each iteration listed in SOLVER.STEPS
cfg.SOLVER.MAX_ITER = 500 # Model will stop after this many iterations
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128  # number of RoIs sampled per image to train the RoI head
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # only has one class (boat)

# Specify CPU training
cfg.MODEL.DEVICE = 'cpu'  # CPU training

#Checkpoint/ValidationSet Params
cfg.TEST.EVAL_PERIOD = 20  # evaluates on the validation set every 20 iterations
cfg.SOLVER.CHECKPOINT_PERIOD = cfg.TEST.EVAL_PERIOD #saves a checkpoint model each time we validate 

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

  2. Full logs you observed:
[07/17 13:16:51 d2.engine.train_loop]: Starting training from iteration 0
ERROR [07/17 13:17:05 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 130, in train
    self.run_step()
  File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 227, in run_step
    with torch.cuda.stream(torch.cuda.Stream()):
  File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch/cuda/streams.py", line 21, in __new__
    with torch.cuda.device(device):
  File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 201, in __init__
    self.idx = _get_device_index(device, optional=True)
  File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch/cuda/_utils.py", line 31, in _get_device_index
    return torch.cuda.current_device()
  File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 330, in current_device
    _lazy_init()
  File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 149, in _lazy_init
    _check_driver()
  File "/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 47, in _check_driver
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
[07/17 13:17:05 d2.engine.hooks]: Total training time: 0:00:13 (0:00:00 on hooks)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-7-7c9d1293789c> in <module>
      4 # trainer.register_hooks([early_stoping])
      5 trainer.resume_or_load(resume=False)
----> 6 trainer.train()

~/unineunet-test2/lib/python3.6/site-packages/detectron2/engine/defaults.py in train(self)
    396             OrderedDict of results, if evaluation is enabled. Otherwise None.
    397         """
--> 398         super().train(self.start_iter, self.max_iter)
    399         if len(self.cfg.TEST.EXPECTED_RESULTS) and comm.is_main_process():
    400             assert hasattr(

~/unineunet-test2/lib/python3.6/site-packages/detectron2/engine/train_loop.py in train(self, start_iter, max_iter)
    128                 for self.iter in range(start_iter, max_iter):
    129                     self.before_step()
--> 130                     self.run_step()
    131                     self.after_step()
    132             except Exception:

~/unineunet-test2/lib/python3.6/site-packages/detectron2/engine/train_loop.py in run_step(self)
    225 
    226         # use a new stream so the ops don't wait for DDP
--> 227         with torch.cuda.stream(torch.cuda.Stream()):
    228             metrics_dict = loss_dict
    229             metrics_dict["data_time"] = data_time

~/unineunet-test2/lib/python3.6/site-packages/torch/cuda/streams.py in __new__(cls, device, priority, **kwargs)
     19 
     20     def __new__(cls, device=None, priority=0, **kwargs):
---> 21         with torch.cuda.device(device):
     22             return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
     23 

~/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py in __init__(self, device)
    199 
    200     def __init__(self, device):
--> 201         self.idx = _get_device_index(device, optional=True)
    202         self.prev_idx = -1
    203 

~/unineunet-test2/lib/python3.6/site-packages/torch/cuda/_utils.py in _get_device_index(device, optional)
     29         if optional:
     30             # default cuda device index
---> 31             return torch.cuda.current_device()
     32         else:
     33             raise ValueError('Expected a cuda device with a specified index '

~/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py in current_device()
    328 def current_device():
    329     r"""Returns the index of a currently selected device."""
--> 330     _lazy_init()
    331     return torch._C._cuda_getDevice()
    332 

~/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py in _lazy_init()
    147             raise RuntimeError(
    148                 "Cannot re-initialize CUDA in forked subprocess. " + msg)
--> 149         _check_driver()
    150         if _cudart is None:
    151             raise AssertionError(

~/unineunet-test2/lib/python3.6/site-packages/torch/cuda/__init__.py in _check_driver()
     45 def _check_driver():
     46     if not hasattr(torch._C, '_cuda_isDriverSufficient'):
---> 47         raise AssertionError("Torch not compiled with CUDA enabled")
     48     if not torch._C._cuda_isDriverSufficient():
     49         if torch._C._cuda_getDriverVersion() == 0:

AssertionError: Torch not compiled with CUDA enabled
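For reference, the commented-out solver settings in the training script (SOLVER.WARMUP_ITERS, SOLVER.STEPS, SOLVER.GAMMA) combine into a per-iteration learning rate. The sketch below assumes detectron2's linear warmup with a warmup factor of 1/1000 followed by multi-step decay; lr_at_iter is a hypothetical helper, not a detectron2 function, so compare it against detectron2/solver/lr_scheduler.py in the installed version.

```python
from bisect import bisect_right


def lr_at_iter(it, base_lr, steps=(), gamma=0.1, warmup_iters=0, warmup_factor=0.001):
    """Learning rate at iteration `it` under linear warmup plus multi-step decay.

    Assumed to follow the shape of detectron2's WarmupMultiStepLR; not its code.
    """
    if it < warmup_iters:
        alpha = it / warmup_iters
        # Ramps linearly from warmup_factor (at it=0) up to 1.0 (at it=warmup_iters).
        factor = warmup_factor * (1 - alpha) + alpha
    else:
        factor = 1.0
    # Multiply by gamma once for every milestone in `steps` already reached.
    return base_lr * factor * gamma ** bisect_right(list(steps), it)
```

With the commented values (BASE_LR = 0.02, WARMUP_ITERS = 1000, STEPS = (1000, 1500), GAMMA = 0.001), lr_at_iter(500, 0.02, (1000, 1500), 0.001, 1000) gives about 0.01001; with MAX_ITER = 500, neither step would be reached.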

Expected behavior:

The model should train. I created a virtual environment the same way about a week ago and had no issues, but when I create a new virtual environment (with all CPU installs, and with CPU specified in the config), I receive the above error.
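The traceback shows detectron2's run_step (train_loop.py, line 227) entering torch.cuda.stream(torch.cuda.Stream()), and nothing in the quoted lines checks cfg.MODEL.DEVICE, so setting the device to 'cpu' does not avoid the CUDA call. As an illustration only, a guard of the following shape would skip the CUDA stream on CPU. stream_context is a hypothetical helper, not part of detectron2 or PyTorch, and the nullcontext fallback is included because the environment above runs Python 3.6.

```python
try:
    from contextlib import nullcontext  # Python 3.7+
except ImportError:  # Python 3.6, as in the environment above
    from contextlib import contextmanager

    @contextmanager
    def nullcontext():
        yield


def stream_context(device: str):
    """Hypothetical helper: only touch torch.cuda when a CUDA device is configured."""
    if device.startswith("cuda"):
        import torch  # imported lazily so the CPU path never initializes CUDA

        if torch.cuda.is_available():
            return torch.cuda.stream(torch.cuda.Stream())
    # CPU training (cfg.MODEL.DEVICE = 'cpu') or no usable GPU: a no-op context.
    return nullcontext()
```

In a training step it would wrap the metrics code as with stream_context(cfg.MODEL.DEVICE): ... in place of the unconditional stream.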

Environment:

sys.platform           linux
Python                 3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0]
numpy                  1.19.0
detectron2             0.2 @/home/svidelock/unineunet-test2/lib/python3.6/site-packages/detectron2
Compiler               GCC 7.3
CUDA compiler          not available
DETECTRON2_ENV_MODULE  <not set>
PyTorch                1.5.1+cpu @/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torch
PyTorch debug build    False
GPU available          False
Pillow                 7.2.0
torchvision            0.6.1+cpu @/home/svidelock/unineunet-test2/lib/python3.6/site-packages/torchvision
fvcore                 0.1.1.post20200716
cv2                    4.3.0
---------------------  ----------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=0, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
