
AttributeError: 'TPUStrategyV2' object has no attribute 'experimental_run_v2' #6

Open
swcrazyfan opened this issue Mar 13, 2022 · 0 comments

swcrazyfan commented Mar 13, 2022

I'm trying to fine-tune in Colab, and I keep getting this error. I'm not sure how to fix it or work around it.

Any advice?

I believe it's related to this:

per_replica_losses = strategy.experimental_run_v2(train_step, args=(x_train, y_train,))
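For what it's worth, I think this API was renamed: in TensorFlow 2.x, Strategy.experimental_run_v2 became Strategy.run (deprecated around TF 2.2 and removed in later releases). A minimal sketch of what I believe the equivalent call would look like, assuming train_step, x_train, and y_train are the same objects already defined in ttt/t2t_trainer.py:

# Sketch only: newer TF exposes Strategy.run instead of the removed
# experimental_run_v2; train_step / x_train / y_train are assumed to be
# the objects used in ttt/t2t_trainer.py.
if hasattr(strategy, "run"):
    per_replica_losses = strategy.run(train_step, args=(x_train, y_train))
else:
    # Fall back for older TF versions that still ship the old name.
    per_replica_losses = strategy.experimental_run_v2(train_step, args=(x_train, y_train))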

This is the full error output:

2022-03-13 04:56:03.212 INFO - run: args: {}
Output directory (tmp/t5-small_t2t_content-train) already exists and is not empty, you wanna remove it before start training? (y/n)y
2022-03-13 04:57:45.139 INFO inputs - get_with_prepare_func: reading cached data from /content/train/t5-small-data.pkl
2022-03-13 04:57:45.142 WARNING inputs - get_with_prepare_func: if you changed the max_seq_length/max_src_length/max_tgt_length, this may not correctly loaded, since the /content/train/t5-small-data.pkl is pickled based on first time loading
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
2022-03-13 04:57:45.222 INFO tpu_strategy_util - initialize_tpu_system: Deallocate tpu buffers before initializing tpu system.
WARNING:tensorflow:TPU system grpc://10.77.192.66 has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
2022-03-13 04:57:45.917 WARNING tpu_strategy_util - initialize_tpu_system: TPU system grpc://10.77.192.66 has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
INFO:tensorflow:Initializing the TPU system: grpc://10.77.192.66
2022-03-13 04:57:45.926 INFO tpu_strategy_util - initialize_tpu_system: Initializing the TPU system: grpc://10.77.192.66
INFO:tensorflow:Finished initializing TPU system.
2022-03-13 04:57:53.909 INFO tpu_strategy_util - initialize_tpu_system: Finished initializing TPU system.
2022-03-13 04:57:53.914 INFO - create_model: All TPU devices:
2022-03-13 04:57:53.916 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU')
2022-03-13 04:57:53.920 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU')
2022-03-13 04:57:53.922 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU')
2022-03-13 04:57:53.925 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU')
2022-03-13 04:57:53.928 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU')
2022-03-13 04:57:53.930 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU')
2022-03-13 04:57:53.933 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU')
2022-03-13 04:57:53.935 INFO - create_model: LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU')
INFO:tensorflow:Found TPU system:
2022-03-13 04:57:53.938 INFO tpu_system_metadata - _query_tpu_system_metadata: Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
2022-03-13 04:57:53.941 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
2022-03-13 04:57:53.945 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
2022-03-13 04:57:53.948 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
2022-03-13 04:57:53.952 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
2022-03-13 04:57:53.958 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
2022-03-13 04:57:53.961 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
2022-03-13 04:57:53.965 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
2022-03-13 04:57:53.968 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
2022-03-13 04:57:53.972 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
2022-03-13 04:57:53.975 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
2022-03-13 04:57:53.979 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
2022-03-13 04:57:53.985 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
2022-03-13 04:57:53.988 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
2022-03-13 04:57:53.992 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
2022-03-13 04:57:53.995 INFO tpu_system_metadata - _query_tpu_system_metadata: *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
Model: "tft5_for_conditional_generation_2"


Layer (type) Output Shape Param #

shared (TFSharedEmbeddings) multiple 16449536

encoder (TFT5MainLayer) multiple 18881280

decoder (TFT5MainLayer) multiple 25175808

=================================================================
Total params: 60,506,624
Trainable params: 60,506,624
Non-trainable params: 0


2022-03-13 04:58:18.877 INFO - create_model: None
/content/ttt/ttt/t2t_trainer.py:56: FutureWarning: Passing inputs as a keyword argument is deprecated. Use train_dataset and eval_dataset instead.
FutureWarning,
2022-03-13 04:58:18.946 INFO t2t_trainer - train: set random seed for everything with 122
2022-03-13 04:58:19.412 INFO utils - write_args_enhance: {
"source_field_name": "source",
"target_field_name": "target",
"use_tpu": true,
"do_train": true,
"use_tb": true,
"model_select": "t5-small",
"data_path": "/content/train",
"task": "t2t",
"log_steps": 400,
"scheduler": "warmuplinear",
"do_eval": false,
"tpu_address": "10.77.192.66",
"output_folder": "t5-small_t2t_content-train",
"output_path": "tmp/t5-small_t2t_content-train",
"is_pretrain": false,
"is_load_from_data_cache": true,
"data_cache_path": "/content/train/t5-small-data.pkl",
"source_sequence_length": 111,
"target_sequence_length": 20,
"num_replicas_in_sync": 8,
"best": -Infinity,
"warmup_steps": 233
}
/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/adam.py:105: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
  super(Adam, self).__init__(name, **kwargs)
epochs: 0%| | 0/6 [00:00<?, ?it/s]2022-03-13 04:58:19.433 INFO t2t_trainer - train: start training at epoch = 0
2022-03-13 04:58:19.440 INFO t2t_trainer - train: global train batch size = 64
2022-03-13 04:58:19.442 INFO t2t_trainer - train: using learning rate scheduler: warmuplinear
2022-03-13 04:58:19.446 INFO t2t_trainer - train: num_train_examples: 24867, total_steps: 2334, steps_per_epoch: 389
2022-03-13 04:58:19.454 INFO t2t_trainer - train: warmup_steps:233

0%| | 0/389 [00:00<?, ?it/s]
epochs: 0%| | 0/6 [00:00<?, ?it/s]

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 run()

3 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in autograph_handler(*args, **kwargs)
1145 except Exception as e: # pylint:disable=broad-except
1146 if hasattr(e, "ag_error_metadata"):
-> 1147 raise e.ag_error_metadata.to_exception(e)
1148 else:
1149 raise

AttributeError: in user code:

File "/content/ttt/ttt/t2t_trainer.py", line 147, in distributed_train_step  *
    per_replica_losses = strategy.experimental_run_v2(train_step, args=(x_train, y_train,))

AttributeError: 'TPUStrategyV2' object has no attribute 'experimental_run_v2'
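For reference, I'm checking which TensorFlow version the Colab runtime has, since the rename appears to be version-dependent:

import tensorflow as tf
print(tf.__version__)  # the old experimental_run_v2 name is gone in recent 2.x releases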