
[Q]wandb: ERROR Internal wandb error: file data was not synced wandb: While tearing down the service manager. The following error has occurred: Python int too large to convert to C long #7859

Closed
yangjhui opened this issue Jun 26, 2024 · 8 comments
Labels
a:sdk Area: sdk related issues

Comments

@yangjhui

When I ran my neural network training code on Windows, the results for both the training and validation sets were uploaded to the wandb platform successfully, but the test-set results could not be uploaded, and the following error message appeared at the end.

Thread SenderThread:
Traceback (most recent call last):
File "E:\LeStoreDownload\Anaconda\envs\PPO\lib\site-packages\wandb\sdk\internal\internal_util.py", line 48, in run
self._run()
File "E:\LeStoreDownload\Anaconda\envs\PPO\lib\site-packages\wandb\sdk\internal\internal_util.py", line 99, in _run
self._process(record)
File "E:\LeStoreDownload\Anaconda\envs\PPO\lib\site-packages\wandb\sdk\internal\internal.py", line 327, in _process
self._sm.send(record)
File "E:\LeStoreDownload\Anaconda\envs\PPO\lib\site-packages\wandb\sdk\internal\sender.py", line 385, in send
send_handler(record)
File "E:\LeStoreDownload\Anaconda\envs\PPO\lib\site-packages\wandb\sdk\internal\sender.py", line 407, in send_request
send_handler(record)
File "E:\LeStoreDownload\Anaconda\envs\PPO\lib\site-packages\wandb\sdk\internal\sender.py", line 659, in send_request_defer
self._pusher.join()
File "E:\LeStoreDownload\Anaconda\envs\PPO\lib\site-packages\wandb\sdk\internal\file_pusher.py", line 178, in join
self._tempdir.cleanup()
File "E:\LeStoreDownload\Anaconda\envs\PPO\lib\tempfile.py", line 811, in cleanup
_shutil.rmtree(self.name)
File "E:\LeStoreDownload\Anaconda\envs\PPO\lib\shutil.py", line 516, in rmtree
return _rmtree_unsafe(path, onerror)
File "E:\LeStoreDownload\Anaconda\envs\PPO\lib\shutil.py", line 377, in _rmtree_unsafe
onerror(os.scandir, path, sys.exc_info())
File "E:\LeStoreDownload\Anaconda\envs\PPO\lib\shutil.py", line 374, in _rmtree_unsafe
with os.scandir(path) as scandir_it:
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:\Users\YJH\AppData\Local\Temp\tmpgj9j6mjgwandb'
wandb: ERROR Internal wandb error: file data was not synced
wandb: While tearing down the service manager. The following error has occurred: Python int too large to convert to C long
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "E:\LeStoreDownload\Anaconda\envs\PPO\lib\shutil.py", line 374, in _rmtree_unsafe
with os.scandir(path) as scandir_it:
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:\Users\YJH\AppData\Local\Temp\tmp1zdgid0ewandb-media'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "E:\LeStoreDownload\Anaconda\envs\PPO\lib\shutil.py", line 374, in _rmtree_unsafe
with os.scandir(path) as scandir_it:
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:\Users\YJH\AppData\Local\Temp\tmpi4ryqkddwandb-artifacts'

Process finished with exit code 0

The wandb platform displays the following: [screenshots]

The wandb-summary.json is as follows: [screenshot]

How can I solve this problem?

@paulosabile-wb

Hi @yangjhui Good day and thank you for reporting this to us. Happy to help you with this!

To investigate this issue further, could you share how you are logging to wandb? Could you please share a code snippet that you used for your training?

Also, to get more information about this error, could you share the debug-internal.log and debug.log for the affected run? These files are under your local folder wandb/run-_-/logs in the same directory where you're running your code. Thank you!
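If it helps, here is a small sketch for printing the paths of those files for the most recent run (this assumes the default ./wandb directory layout next to your script; adjust the base path if yours differs):

from pathlib import Path

# find the most recently modified run directory under ./wandb
latest = max(Path('wandb').glob('run-*'), key=lambda p: p.stat().st_mtime)
# debug.log and debug-internal.log live in its logs/ subfolder
for name in ('debug.log', 'debug-internal.log'):
    print(latest / 'logs' / name)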

@yangjhui
Author

yangjhui commented Jun 27, 2024

Hello @paulosabile-wb, thank you for your quick reply.
The content of "train.py" is as follows:

def main():
    # Step 0: Parse Arguments and Setup
    args = argument_parser()  # parse command-line arguments
    run_id = datetime.now().strftime("%Y%m%d") + '-' + str(random.randint(0, 9999))  # unique run ID based on the current date and a random number
    LOG_DIR = 'logs'  # log directory
    SAVE_DIR = 'models'  # model save directory
    TRAIN_LOG_PATH = os.path.join(LOG_DIR, 'train_log/train_log_' + run_id + '.pt')  # training-log file path
    SAVE_LOG_PATH = os.path.join(LOG_DIR, 'save_logs_HBN.json')  # save-log file path
    SAVE_MODEL_PATH = os.path.join(SAVE_DIR, 'model_' + run_id + '.pt')  # saved-model file path
    # model dictionary: keys are model names, values are model classes
    models = {
        'MPN': MPN,
        'MPN_simplenet': MPN_simplenet,
        'SkipMPN': SkipMPN,
        'MaskEmbdMPN': MaskEmbdMPN,
        'MultiConvNet': MultiConvNet,
        'MultiMPN': MultiMPN,
        'MaskEmbdMultiMPN': MaskEmbdMultiMPN
    }
    mixed_cases = ['118v2', '14v2']  # list of mixed cases

    # Training parameters
    data_dir = args.data_dir  # data directory, 'data'
    nomalize_data = not args.disable_normalize  # whether to normalize the data (not False)
    # num_epochs = args.num_epochs  # number of training epochs, 100
    num_epochs = 100  # number of training epochs
    # training loss; if regularization is enabled, add the regularization term with the given coefficient
    loss_fn = Masked_L2_loss(regularize=args.regularize, regcoeff=args.regularization_coeff)  # regularize=args.regularize=True, regcoeff=args.regularization_coeff=1
    eval_loss_fn = Masked_L2_loss(regularize=False)  # evaluation loss, no regularization
    lr = args.lr  # learning rate, 1e-3
    batch_size = args.batch_size  # batch size, 128
    # grid_case = args.case  # grid case, 'case14' initially
    # grid_case = '14'  # grid case
    grid_case = 'originalHBN40000'  # grid case originalHBN40000
    
    # Network parameters
    nfeature_dim = args.nfeature_dim  # input feature dimension: 6 (caseHBN: 7)
    efeature_dim = args.efeature_dim  # edge feature dimension: 6 (caseHBN: 5)
    hidden_dim = args.hidden_dim  # hidden dimension: 128 (caseHBN: 512)
    output_dim = args.output_dim  # output dimension: 6 (caseHBN: 7)
    n_gnn_layers = args.n_gnn_layers  # number of GNN layers: 4 (caseHBN: 5)
    conv_K = args.K  # K value for the conv layers, 3
    dropout_rate = args.dropout_rate  # dropout rate, 0.2
    # model = models[args.model]  # select the model by argument (MPN)
    model = models['MaskEmbdMultiMPN']  # select the model

    log_to_wandb = True  # whether to log to WandB

    # wandb_entity = args.wandb_entity  # WandB entity, PowerFlowNet
    if log_to_wandb:
        wandb.init(project="PowerFlowNet",  # initialize the WandB project
                   # entity=wandb_entity,  # set the WandB entity
                   name=run_id,  # run name
                   config=args)  # pass the argument config to WandB

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  
    # device = torch.device('cpu')
    # print(f"device:{device}")
    torch.manual_seed(1234)  # set the random seed for reproducibility
    np.random.seed(1234)  # set NumPy's random seed
    # torch.backends.cudnn.deterministic = True  # deterministic computation: more stable results, possibly slower
    # torch.backends.cudnn.benchmark = False  # disable benchmark mode so cuDNN re-selects the best conv algorithm when input sizes change

    # Step 1: Load data
    print('Initializing the training dataset:')
    trainset = PowerFlowData(root=data_dir, case=grid_case, split=[.5, .2, .3], task='train', normalize=nomalize_data)
    print('Initializing the validation dataset:')
    valset = PowerFlowData(root=data_dir, case=grid_case, split=[.5, .2, .3], task='val', normalize=nomalize_data)
    print('Initializing the test dataset:')
    testset = PowerFlowData(root=data_dir, case=grid_case, split=[.5, .2, .3], task='test', normalize=nomalize_data)
    # create a directory to store the normalization parameters
    os.makedirs(os.path.join(data_dir, 'params'), exist_ok=True)
    # save the normalization parameters computed on the training set
    torch.save({
        'xymean': trainset.xymean,  # node-feature means
        'xystd': trainset.xystd,  # node-feature standard deviations
        'edgemean': trainset.edgemean,  # edge-feature means
        'edgestd': trainset.edgestd,  # edge-feature standard deviations
    }, os.path.join(data_dir, 'params', f'data_params_{run_id}.pt'))

    # training data loader: batched and shuffled
    train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=True)
    # validation data loader: batched, not shuffled
    val_loader = DataLoader(valset, batch_size=batch_size, shuffle=False)
    # test data loader: batched, not shuffled
    test_loader = DataLoader(testset, batch_size=batch_size, shuffle=False)
    
    ## [Optional] physics-informed loss function
    # choose the training loss function based on the arguments
    print(f'args.train_loss_fn: {args.train_loss_fn}')
    if args.train_loss_fn == 'power_imbalance':
        # power-imbalance loss, initialized with the training set's means and stds, moved to the device
        loss_fn = PowerImbalance(*trainset.get_data_means_stds()).to(device)
    elif args.train_loss_fn == 'masked_l2':
        # masked L2 loss, optionally with a regularization term and coefficient
        loss_fn = Masked_L2_loss(regularize=args.regularize, regcoeff=args.regularization_coeff)
    elif args.train_loss_fn == 'mixed_mse_power_imbalance':
        # mixed MSE + power-imbalance loss, initialized with the training set's means and stds, weight alpha, moved to the device
        loss_fn = MixedMSEPoweImbalance(*trainset.get_data_means_stds(), alpha=0.9).to(device)
    else:
        # default: MSE loss
        loss_fn = torch.nn.MSELoss()

    # Step 2: Create model and optimizer (and scheduler)
    # get the dataset's node input dimension, node output dimension, and edge dimension
    node_in_dim, node_out_dim, edge_dim = trainset.get_data_dimensions()
    # sanity-check the node input dimension
    # assert node_in_dim == 16 # case14
    assert node_in_dim == 18 # caseHBN
    # instantiate the selected model with the feature dims, hidden dim, layer counts, etc., and move it to the device
    model = model(
        nfeature_dim=nfeature_dim,  # input feature dimension (6)
        efeature_dim=efeature_dim,  # edge feature dimension (6)
        output_dim=output_dim,      # output dimension (6)
        hidden_dim=hidden_dim,      # hidden dimension (128)
        n_gnn_layers=n_gnn_layers,  # number of GNN layers (4)
        K=conv_K,                   # K value for the conv layers (3)
        dropout_rate=dropout_rate   # dropout rate (0.2)
    ).to(device)

    # calculate model size (total number of parameters)
    pytorch_total_params = sum(p.numel() for p in model.parameters()) # 4483591
    print("Total number of parameters: ", pytorch_total_params)

    # AdamW optimizer over the model parameters with learning rate lr
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
    #                                                        mode='min',
    #                                                        factor=0.5,
    #                                                        patience=5,
    #                                                        verbose=True)
    # OneCycleLR scheduler: max_lr=lr, steps_per_epoch steps per epoch, num_epochs epochs in total
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=lr, steps_per_epoch=len(train_loader), epochs=num_epochs)

    # Step 3: Train model
    # initialize the best training and validation losses
    best_train_loss = 10000.
    best_val_loss = 10000.
    train_log = {
        'train': {
            'loss': []},
        'val': {
            'loss': []},
    }
    # pbar = tqdm(range(num_epochs), total=num_epochs, position=0, leave=True)
    # loop over the epochs
    for epoch in range(num_epochs):  # num_epochs = 100
        # train the model for this epoch and compute the training loss
        train_loss = train_epoch(
            model, train_loader, loss_fn, optimizer, device)
        # compute this epoch's validation loss
        val_loss = evaluate_epoch(model, val_loader, eval_loss_fn, device) # eval_loss_fn
        # val_loss = evaluate_epoch(model, val_loader, loss_fn, device)
        # update the learning rate
        scheduler.step()
        # record the training and validation losses
        train_log['train']['loss'].append(train_loss)
        train_log['val']['loss'].append(val_loss)

        # if wandb logging is enabled, log the training and validation losses
        if log_to_wandb:
            wandb.log({'train_loss': train_loss,
                       'val_loss': val_loss})

        # update the best training loss
        if train_loss < best_train_loss:
            best_train_loss = train_loss

        # if this epoch's validation loss beats the best so far, update it and save the model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            # save the model if saving is enabled
            if args.save:
                # data to save
                _to_save = {
                    'epoch': epoch,
                    'args': args,
                    'val_loss': best_val_loss,
                    'model_state_dict': model.state_dict(),
                }
                # make sure the model directory exists
                os.makedirs('models', exist_ok=True)
                # save the checkpoint to the target path
                torch.save(_to_save, SAVE_MODEL_PATH)
                # append this model's info to the save log
                append_to_json(
                    SAVE_LOG_PATH,
                    run_id,
                    {
                        'val_loss': f"{best_val_loss: .4f}",
                        # 'test_loss': f"{test_loss: .4f}",
                        'train_log': TRAIN_LOG_PATH,
                        'saved_file': SAVE_MODEL_PATH,
                        'epoch': epoch,
                        'model': args.model,
                        'train_case': args.case,
                        'train_loss_fn': args.train_loss_fn,
                        'args': vars(args)
                    }
                )
                # save the training log to the target path
                torch.save(train_log, TRAIN_LOG_PATH)
        # print this epoch's training loss, validation loss, and best validation loss
        print(f"Epoch {epoch+1} / {num_epochs}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}, best_val_loss={best_val_loss:.4f}")

    # print the completion message and the best validation loss
    print(f"Training Complete. Best validation loss: {best_val_loss:.4f}")

    # Step 4: Evaluate model
    # if saving is enabled, load the saved model and evaluate the test loss
    if args.save:
        _to_load = torch.load(SAVE_MODEL_PATH)
        model.load_state_dict(_to_load['model_state_dict'])
        test_loss = evaluate_epoch(model, test_loader, eval_loss_fn, device)
        # test_loss = evaluate_epoch(model, test_loader, loss_fn, device)
        print(f"Test loss: {best_val_loss:.4f}")
        # 如果开启了wandb日志记录,则记录测试集损失
        if log_to_wandb:
            wandb.log({'test_loss': test_loss})  # logged as a dict

    # Step 5: Save results
    # make sure the training-log directory exists
    os.makedirs(os.path.join(LOG_DIR, 'train_log'), exist_ok=True)
    # if saving is enabled, save the training log to the target path
    if args.save:
        torch.save(train_log, TRAIN_LOG_PATH)

    end = time.perf_counter()  # stop the timer (assumes start = time.perf_counter() earlier in the script)
    runtime = end - start
    s = runtime % 60
    m = runtime // 60 % 60
    h = runtime // 3600
    print(f"Total runtime: {runtime} s")
    print("Runtime: {:0>2} h:{:0>2} min:{:0>2} s".format(h, m, s))

if __name__ == '__main__':
    main()

The debug-internal.log and debug.log are as follows:
debug.log
debug-internal.log

Thank you again for your reply. I would greatly appreciate it.

@kptkin kptkin added the a:sdk Area: sdk related issues label Jun 27, 2024

@luisbergua
Contributor

Hi @yangjhui, thanks for sharing this! I can see "Python int too large to convert to C long", which seems to be a NumPy error, so I would recommend fixing that and trying again; see here.
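For context, this overflow usually comes from NumPy's default integer type: on NumPy 1.x, np.int_ maps to a C long, which is 32-bit on Windows but 64-bit on most Linux builds, so the same code can run fine on Linux and overflow on Windows. A minimal sketch of the failure and the usual fix (the values here are illustrative, not taken from your run):

import numpy as np

big = 2**40  # anything above 2**31 - 1 no longer fits in 32 bits

try:
    # on Windows this raises OverflowError: Python int too large to convert to C long
    np.array([big], dtype=np.int_)
except OverflowError as e:
    print(e)

# requesting a 64-bit integer dtype explicitly sidesteps the overflow
arr = np.array([big], dtype=np.int64)
print(arr)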


Luis Bergua commented:
Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

@yangjhui
Author

yangjhui commented Jul 8, 2024

Hello @luisbergua! I regret to tell you that my issue has not been resolved. Could you give me further assistance?
Thank you for your reply.

@luisbergua
Contributor

Hi @yangjhui! As I shared, "Python int too large to convert to C long" seems to be a NumPy error, so I would recommend fixing that and trying again; see here. Mind letting me know if this makes any difference?
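Separately, since the traceback shows the failure during the atexit teardown of the service manager, one thing that may be worth trying (an assumption on my side, not a confirmed fix) is finishing the run explicitly so all file data is flushed before the interpreter starts cleaning up its temp directories:

import wandb

run = wandb.init(project='PowerFlowNet', name='example-run')  # names illustrative
# ... training and logging ...
run.log({'test_loss': 0.1234})  # placeholder value

# call finish() explicitly at the end of main() instead of relying on
# the atexit hook, so pending uploads are synced before shutdown
run.finish()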

@sydholl

sydholl commented Nov 14, 2024

Due to inactivity, we are closing this request. Please comment to reopen.

@sydholl sydholl closed this as completed Nov 14, 2024