Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/06/19 23:23:40 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 16838@brain003
17/06/19 23:23:40 INFO SignalUtils: Registered signal handler for TERM
17/06/19 23:23:40 INFO SignalUtils: Registered signal handler for HUP
17/06/19 23:23:40 INFO SignalUtils: Registered signal handler for INT
17/06/19 23:23:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/19 23:23:40 INFO SecurityManager: Changing view acls to: braincreator
17/06/19 23:23:40 INFO SecurityManager: Changing modify acls to: braincreator
17/06/19 23:23:40 INFO SecurityManager: Changing view acls groups to: 
17/06/19 23:23:40 INFO SecurityManager: Changing modify acls groups to: 
17/06/19 23:23:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(braincreator); groups with view permissions: Set(); users with modify permissions: Set(braincreator); groups with modify permissions: Set()
17/06/19 23:23:40 INFO TransportClientFactory: Successfully created connection to /192.168.1.103:38370 after 69 ms (0 ms spent in bootstraps)
17/06/19 23:23:41 INFO SecurityManager: Changing view acls to: braincreator
17/06/19 23:23:41 INFO SecurityManager: Changing modify acls to: braincreator
17/06/19 23:23:41 INFO SecurityManager: Changing view acls groups to: 
17/06/19 23:23:41 INFO SecurityManager: Changing modify acls groups to: 
17/06/19 23:23:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(braincreator); groups with view permissions: Set(); users with modify permissions: Set(braincreator); groups with modify permissions: Set()
17/06/19 23:23:41 INFO TransportClientFactory: Successfully created connection to /192.168.1.103:38370 after 1 ms (0 ms spent in bootstraps)
17/06/19 23:23:41 INFO DiskBlockManager: Created local directory at /tmp/spark-21f263c5-ef53-4103-9054-256896da2d8f/executor-b3ac2de4-dcd0-4e21-b76f-3df9a115ced0/blockmgr-1bbfb642-16d5-4a59-83b5-7a40b97a62c0
17/06/19 23:23:41 INFO MemoryStore: MemoryStore started with capacity 399.6 MB
17/06/19 23:23:41 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.1.103:38370
17/06/19 23:23:41 INFO WorkerWatcher: Connecting to worker spark://Worker@brain003:33973
17/06/19 23:23:41 INFO TransportClientFactory: Successfully created connection to brain003/192.168.1.127:33973 after 1 ms (0 ms spent in bootstraps)
17/06/19 23:23:41 INFO WorkerWatcher: Successfully connected to spark://Worker@brain003:33973
17/06/19 23:23:41 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
17/06/19 23:23:41 INFO Executor: Starting executor ID 1 on host brain003
17/06/19 23:23:41 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45011.
17/06/19 23:23:41 INFO NettyBlockTransferService: Server created on brain003:45011
17/06/19 23:23:41 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/06/19 23:23:41 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(1, brain003, 45011, None)
17/06/19 23:23:41 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, brain003, 45011, None)
17/06/19 23:23:41 INFO BlockManager: Initialized BlockManager: BlockManagerId(1, brain003, 45011, None)
17/06/19 23:23:41 INFO CoarseGrainedExecutorBackend: Got assigned task 0
17/06/19 23:23:41 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/06/19 23:23:41 INFO Executor: Fetching spark://192.168.1.103:38370/files/inception.zip with timestamp 1497907419244
17/06/19 23:23:41 INFO TransportClientFactory: Successfully created connection to /192.168.1.103:38370 after 1 ms (0 ms spent in bootstraps)
17/06/19 23:23:41 INFO Utils: Fetching spark://192.168.1.103:38370/files/inception.zip to /tmp/spark-21f263c5-ef53-4103-9054-256896da2d8f/executor-b3ac2de4-dcd0-4e21-b76f-3df9a115ced0/spark-f3ab2108-650b-4789-8df4-a6fd2b6d7680/fetchFileTemp1783030734357594821.tmp
17/06/19 23:23:41 INFO Utils: Copying /tmp/spark-21f263c5-ef53-4103-9054-256896da2d8f/executor-b3ac2de4-dcd0-4e21-b76f-3df9a115ced0/spark-f3ab2108-650b-4789-8df4-a6fd2b6d7680/1410124971497907419244_cache to /home/braincreator/opt/spark/v2.1.1/work/app-20170619232339-0050/1/./inception.zip
17/06/19 23:23:41 INFO Executor: Fetching spark://192.168.1.103:38370/files/imagenet_distributed_train.py with timestamp 1497907419235
17/06/19 23:23:41 INFO Utils: Fetching spark://192.168.1.103:38370/files/imagenet_distributed_train.py to /tmp/spark-21f263c5-ef53-4103-9054-256896da2d8f/executor-b3ac2de4-dcd0-4e21-b76f-3df9a115ced0/spark-f3ab2108-650b-4789-8df4-a6fd2b6d7680/fetchFileTemp5727244748814194412.tmp
17/06/19 23:23:41 INFO Utils: Copying /tmp/spark-21f263c5-ef53-4103-9054-256896da2d8f/executor-b3ac2de4-dcd0-4e21-b76f-3df9a115ced0/spark-f3ab2108-650b-4789-8df4-a6fd2b6d7680/-7488342311497907419235_cache to /home/braincreator/opt/spark/v2.1.1/work/app-20170619232339-0050/1/./imagenet_distributed_train.py
17/06/19 23:23:41 INFO TorrentBroadcast: Started reading broadcast variable 0
17/06/19 23:23:42 INFO TransportClientFactory: Successfully created connection to /192.168.1.103:43615 after 1 ms (0 ms spent in bootstraps)
17/06/19 23:23:42 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 8.6 KB, free 399.6 MB)
17/06/19 23:23:42 INFO TorrentBroadcast: Reading broadcast variable 0 took 227 ms
17/06/19 23:23:42 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 12.3 KB, free 399.6 MB)
2017-06-19 23:23:42,858 INFO (MainThread-16945) connected to server at ('brain001', 37931)
2017-06-19 23:23:42,860 INFO (MainThread-16945) TFSparkNode.reserve: {'authkey': 'C\xbb\xa7\xc5\xcc\x16D\xe3\xb5\x07\xc2z\x1e\xb5\xa8\xf5', 'worker_num': 0, 'host': 'brain003', 'tb_port': 0, 'addr': ('brain003', 37641), 'ppid': 16931, 'task_index': 0, 'job_name': 'ps', 'tb_pid': 0, 'port': 42158}
2017-06-19 23:23:50,877 INFO (MainThread-16945) node: {'addr': ('brain003', 37641), 'task_index': 0, 'job_name': 'ps', 'authkey': 'C\xbb\xa7\xc5\xcc\x16D\xe3\xb5\x07\xc2z\x1e\xb5\xa8\xf5', 'worker_num': 0, 'host': 'brain003', 'ppid': 16931, 'port': 42158, 'tb_pid': 0, 'tb_port': 0}
2017-06-19 23:23:50,877 INFO (MainThread-16945) node: {'addr': '/tmp/pymp-JzzBmC/listener-ZMW31y', 'task_index': 0, 'job_name': 'worker', 'authkey': '\x9dM\x11?\xd2\xfbL\x17\x97\x9a\x94\xe6\x176\xed:', 'worker_num': 1, 'host': 'brain002', 'ppid': 9722, 'port': 38686, 'tb_pid': 0, 'tb_port': 0}
2017-06-19 23:23:50,878 INFO (MainThread-16945) Starting TensorFlow ps:0 on cluster node 0 on background process
argv: ['/home/braincreator/nfs/TensorFlowOnSpark/examples/imagenet/inception/imagenet_distributed_train.py', '--cluster_size', '2', '--data_dir', 'file:///data/braincreator/ImageNet/tfrecords', '--train_dir', 'file:///tmp/imagenet_distributed_train_model', '--max_steps', '1000', '--batch_size=16', '--subset', 'train']
FLAGS: {'subset': 'train', 'learning_rate_decay_factor': 0.94, 'ps_hosts': '', 'worker_hosts': '', 'data_dir': 'file:///data/braincreator/ImageNet/tfrecords', 'save_summaries_secs': 180, 'num_readers': 4, 'job_name': 'ps', 'input_mode': 'tf', 'input_queue_memory_factor': 16, 'num_epochs_per_decay': 2.0, 'batch_size': 16, 'rdma': False, 'image_size': 299, 'save_interval_secs': 600, 'num_preprocess_threads': 4, 'log_device_placement': False, 'task_id': 0, 'num_gpus': 1, 'train_dir': 'file:///tmp/imagenet_distributed_train_model', 'initial_learning_rate': 0.045, 'num_replicas_to_aggregate': -1, 'max_steps': 1000}
2017-06-19 23:23:52,543 INFO (MainThread-17003) 0: ======== ps:0 ========
2017-06-19 23:23:52,543 INFO (MainThread-17003) 0: Cluster spec: {'ps': ['brain003:42158'], 'worker': ['brain002:38686']}
2017-06-19 23:23:52,561 INFO (MainThread-17003) 0: Using GPU: 0
2017-06-19 23:23:52.562292: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-19 23:23:52.562325: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-19 23:23:52.562334: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-06-19 23:23:52.827671: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-06-19 23:23:52.828365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties: name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate (GHz) 1.531 pciBusID 0000:04:00.0 Total memory: 11.90GiB Free memory: 11.10GiB
2017-06-19 23:23:52.828390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-06-19 23:23:52.828399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-06-19 23:23:52.828417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:04:00.0)
2017-06-19 23:23:52.955914: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:42158}
2017-06-19 23:23:52.955956: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> brain002:38686}
2017-06-19 23:23:52.961734: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:240] Started server with target: grpc://localhost:42158