Reduce/Allreduce issues on cyt_rdma #196

Closed
lawirz opened this issue May 14, 2024 · 2 comments
Labels: bug

lawirz commented May 14, 2024

I'm getting errors on repeated runs of Reduce and Allreduce:

...
Pass accl barrier
host measured durationUs:91.724
2th item is incorrect! (1.000000 != 2.000000)
3th item is incorrect! (2.000000 != 4.000000)
4th item is incorrect! (3.000000 != 6.000000)
5th item is incorrect! (4.000000 != 8.000000)
...

The first run succeeds.
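
The mismatch lines follow a regular pattern that is easier to see when summarized. Below is a minimal sketch, not part of the ACCL test suite, that parses lines of the form shown above and reports the ratio between the two printed values. The function name `summarize_mismatches` and the regular expression are assumptions made for illustration. The log does not say which side is the received value and which is the reference, so the code only calls them left and right.

```python
import re

# Matches lines such as "2th item is incorrect! (1.000000 != 2.000000)".
# The pattern is inferred from the log excerpt, not taken from the test source.
MISMATCH = re.compile(r"(\d+)th item is incorrect! \(([-\d.]+) != ([-\d.]+)\)")


def summarize_mismatches(log_text):
    """Return (number of mismatch lines, set of right/left value ratios)."""
    count = 0
    ratios = set()
    for match in MISMATCH.finditer(log_text):
        left = float(match.group(2))
        right = float(match.group(3))
        count += 1
        if left != 0.0:  # avoid dividing by zero for a zero-valued item
            ratios.add(right / left)
    return count, ratios


if __name__ == "__main__":
    sample = (
        "2th item is incorrect! (1.000000 != 2.000000)\n"
        "3th item is incorrect! (2.000000 != 4.000000)\n"
        "4th item is incorrect! (3.000000 != 6.000000)\n"
        "5th item is incorrect! (4.000000 != 8.000000)\n"
    )
    print(summarize_mismatches(sample))
```

For the excerpt above every ratio is 2, which with two ranks would be consistent with one rank's contribution missing from the reduction. That reading is an interpretation of the numbers, not something the log confirms.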

I'm using a slightly modified version of the script https://github.com/Xilinx/ACCL/blob/dev/test/host/Coyote/run_scripts/run.sh on commit a0ba7ea.

Output of the first run (allreduce):

stdout
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '10' '-c' '512' '-l' './accl_log/fpga' '-p' '1' '-n' '1' 
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:512 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 0] rank 0 size 2 alveo-u55c-07.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 0 sending local QP to remote rank 1
Local rank 0 receiving remote QP from remote rank 1
Queue Pair: id: 1
Local Queue: local: QPN 0x000002, PSN 0x13da2b, VADDR 00007fe623800000, SIZE 00200000, IP 0x0afd4a5c,
Remote Queue: remote: QPN 0x000001, PSN 0x46045d, VADDR 00007f4951e00000, SIZE 00200000, IP 0x0afd4a60,
rank: 0 FPGA IP: afd4a5c
Rendezvous Protocol
sw nop time [us]:106.15
hw nop time [ns]:940
Start allreduce test and reduce function 0...
Repetition 0
Pass accl barrier
host measured durationUs:410.125
Test is successful!

ACCL base functionality test completed successfully!

-- STATISTICS - ID: 0
-----------------------------------------------
          Read command FIFO used: 	0
         Write command FIFO used: 	0
                 Host reads sent: 	96
                Host writes sent: 	64
                 Card reads sent: 	64
                Card writes sent: 	64
                 Sync reads sent: 	5
                Sync writes sent: 	0
                     Page faults: 	0


 -- NET STATS QSFP0

RX pkgs: 320
TX pkgs: 132
ARP RX pkgs: 2
ARP TX pkgs: 2
ICMP RX pkgs: 0
ICMP TX pkgs: 0
TCP RX pkgs: 0
TCP TX pkgs: 0
ROCE RX pkgs: 152
ROCE TX pkgs: 130
IBV RX pkgs: 195
IBV TX pkgs: 195
PSN drop cnt: 0
Retrans cnt: 0
TCP session cnt: 0
STRM down: 0

Finalizing MPI...
Done. Terminating...
stderr
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 21032
UID: 500207
[Tue May 14 12:09:44 2024 GMT]
HOST: alveo-u55c-07.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 2696576574 at 0x0
CCLO source commit (first 24b): a0ba7e
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fe622c00000, Size: 64
calling offload: 7fe622c00000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fe622a00000, Size: 64
calling offload: 7fe622a00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe622600000, Size: 4194304
calling offload: 7fe622600000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe622200000, Size: 4194304
calling offload: 7fe622200000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe621e00000, Size: 4194304
calling offload: 7fe621e00000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

Rank 0 passed last barrier before test!
CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7fe622c00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7fe622a00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

CoyoteBuffer contructor called! page_size:2097152, buffer_size:2048,n_pages:1
Allocation successful! Allocated buffer: 7fe621c00000, Size: 2048
CoyoteBuffer contructor called! page_size:2097152, buffer_size:2048,n_pages:1
Allocation successful! Allocated buffer: 7fe621a00000, Size: 2048
Reducing data...
Free user buffer from cProc cPid:0, buffer_size:2048,7fe621c00000
Free user buffer from cProc cPid:0, buffer_size:2048,7fe621a00000
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 64, -> outbound seq number 64

CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7fe622c00000 	 status: ENQUEUED 	 occupancy: 32/64 	 MPI tag: ffffffff 	 seq: 62 	 src: 1
Spare RX Buffer 1:	 address: 0x7fe622a00000 	 status: ENQUEUED 	 occupancy: 32/64 	 MPI tag: ffffffff 	 seq: 63 	 src: 1

Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7fe622c00000
Free user buffer from cProc cPid:0, buffer_size:64,7fe622a00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe622600000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe622200000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe621e00000
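
For cross-checking the queue pair lines against the rank IPs printed in stdout, the hex `IP` fields can be decoded to dotted-quad form. This is a small sketch using only the Python standard library; the helper name `hex_to_dotted` is hypothetical.

```python
import ipaddress


def hex_to_dotted(hex_ip):
    """Convert a 32-bit hex IPv4 value such as '0x0afd4a5c' to dotted-quad form."""
    return str(ipaddress.IPv4Address(int(hex_ip, 16)))


if __name__ == "__main__":
    # Values copied from the "Local Queue" and "Remote Queue" lines of the log.
    print(hex_to_dotted("0x0afd4a5c"))  # 10.253.74.92
    print(hex_to_dotted("0x0afd4a60"))  # 10.253.74.96
```

The decoded values match the two addresses listed after "Testing ACCL base functionality...", so the local and remote QP entries point at the expected ranks in this run.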

Second run:

stdout
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '10' '-c' '512' '-l' './accl_log/fpga' '-p' '1' '-n' '1' 
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:512 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 0] rank 0 size 2 alveo-u55c-07.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 0 sending local QP to remote rank 1
Local rank 0 receiving remote QP from remote rank 1
Queue Pair: id: 1
Local Queue: local: QPN 0x000001, PSN 0x709eba, VADDR 00007fe2c6c00000, SIZE 00200000, IP 0x0afd4a5c,
Remote Queue: remote: QPN 0x000002, PSN 0xd26325, VADDR 00007f9402e00000, SIZE 00200000, IP 0x0afd4a60,
rank: 0 FPGA IP: afd4a5c
Rendezvous Protocol
sw nop time [us]:87.215
hw nop time [ns]:940
Start allreduce test and reduce function 0...
Repetition 0
Pass accl barrier
host measured durationUs:79.17
2th item is incorrect! (1.000000 != 2.000000)
3th item is incorrect! (2.000000 != 4.000000)
4th item is incorrect! (3.000000 != 6.000000)
5th item is incorrect! (4.000000 != 8.000000)
6th item is incorrect! (5.000000 != 10.000000)
7th item is incorrect! (6.000000 != 12.000000)
8th item is incorrect! (7.000000 != 14.000000)
9th item is incorrect! (8.000000 != 16.000000)
10th item is incorrect! (9.000000 != 18.000000)
11th item is incorrect! (10.000000 != 20.000000)
12th item is incorrect! (11.000000 != 22.000000)
13th item is incorrect! (12.000000 != 24.000000)
14th item is incorrect! (13.000000 != 26.000000)
15th item is incorrect! (14.000000 != 28.000000)
16th item is incorrect! (15.000000 != 30.000000)
17th item is incorrect! (16.000000 != 32.000000)
18th item is incorrect! (17.000000 != 34.000000)
19th item is incorrect! (18.000000 != 36.000000)
20th item is incorrect! (19.000000 != 38.000000)
21th item is incorrect! (20.000000 != 40.000000)
22th item is incorrect! (21.000000 != 42.000000)
23th item is incorrect! (22.000000 != 44.000000)
24th item is incorrect! (23.000000 != 46.000000)
25th item is incorrect! (24.000000 != 48.000000)
26th item is incorrect! (25.000000 != 50.000000)
27th item is incorrect! (26.000000 != 52.000000)
28th item is incorrect! (27.000000 != 54.000000)
29th item is incorrect! (28.000000 != 56.000000)
30th item is incorrect! (29.000000 != 58.000000)
31th item is incorrect! (30.000000 != 60.000000)
32th item is incorrect! (31.000000 != 62.000000)
33th item is incorrect! (32.000000 != 64.000000)
34th item is incorrect! (33.000000 != 66.000000)
35th item is incorrect! (34.000000 != 68.000000)
36th item is incorrect! (35.000000 != 70.000000)
37th item is incorrect! (36.000000 != 72.000000)
38th item is incorrect! (37.000000 != 74.000000)
39th item is incorrect! (38.000000 != 76.000000)
40th item is incorrect! (39.000000 != 78.000000)
41th item is incorrect! (40.000000 != 80.000000)
42th item is incorrect! (41.000000 != 82.000000)
43th item is incorrect! (42.000000 != 84.000000)
44th item is incorrect! (43.000000 != 86.000000)
45th item is incorrect! (44.000000 != 88.000000)
46th item is incorrect! (45.000000 != 90.000000)
47th item is incorrect! (46.000000 != 92.000000)
48th item is incorrect! (47.000000 != 94.000000)
49th item is incorrect! (48.000000 != 96.000000)
50th item is incorrect! (49.000000 != 98.000000)
51th item is incorrect! (50.000000 != 100.000000)
52th item is incorrect! (51.000000 != 102.000000)
53th item is incorrect! (52.000000 != 104.000000)
54th item is incorrect! (53.000000 != 106.000000)
55th item is incorrect! (54.000000 != 108.000000)
56th item is incorrect! (55.000000 != 110.000000)
57th item is incorrect! (56.000000 != 112.000000)
58th item is incorrect! (57.000000 != 114.000000)
59th item is incorrect! (58.000000 != 116.000000)
60th item is incorrect! (59.000000 != 118.000000)
61th item is incorrect! (60.000000 != 120.000000)
62th item is incorrect! (61.000000 != 122.000000)
63th item is incorrect! (62.000000 != 124.000000)
64th item is incorrect! (63.000000 != 126.000000)
65th item is incorrect! (64.000000 != 128.000000)
66th item is incorrect! (65.000000 != 130.000000)
67th item is incorrect! (66.000000 != 132.000000)
68th item is incorrect! (67.000000 != 134.000000)
69th item is incorrect! (68.000000 != 136.000000)
70th item is incorrect! (69.000000 != 138.000000)
71th item is incorrect! (70.000000 != 140.000000)
72th item is incorrect! (71.000000 != 142.000000)
73th item is incorrect! (72.000000 != 144.000000)
74th item is incorrect! (73.000000 != 146.000000)
75th item is incorrect! (74.000000 != 148.000000)
76th item is incorrect! (75.000000 != 150.000000)
77th item is incorrect! (76.000000 != 152.000000)
78th item is incorrect! (77.000000 != 154.000000)
79th item is incorrect! (78.000000 != 156.000000)
80th item is incorrect! (79.000000 != 158.000000)
81th item is incorrect! (80.000000 != 160.000000)
82th item is incorrect! (81.000000 != 162.000000)
83th item is incorrect! (82.000000 != 164.000000)
84th item is incorrect! (83.000000 != 166.000000)
85th item is incorrect! (84.000000 != 168.000000)
86th item is incorrect! (85.000000 != 170.000000)
87th item is incorrect! (86.000000 != 172.000000)
88th item is incorrect! (87.000000 != 174.000000)
89th item is incorrect! (88.000000 != 176.000000)
90th item is incorrect! (89.000000 != 178.000000)
91th item is incorrect! (90.000000 != 180.000000)
92th item is incorrect! (91.000000 != 182.000000)
93th item is incorrect! (92.000000 != 184.000000)
94th item is incorrect! (93.000000 != 186.000000)
95th item is incorrect! (94.000000 != 188.000000)
96th item is incorrect! (95.000000 != 190.000000)
97th item is incorrect! (96.000000 != 192.000000)
98th item is incorrect! (97.000000 != 194.000000)
99th item is incorrect! (98.000000 != 196.000000)
100th item is incorrect! (99.000000 != 198.000000)
101th item is incorrect! (100.000000 != 200.000000)
102th item is incorrect! (101.000000 != 202.000000)
103th item is incorrect! (102.000000 != 204.000000)
104th item is incorrect! (103.000000 != 206.000000)
105th item is incorrect! (104.000000 != 208.000000)
106th item is incorrect! (105.000000 != 210.000000)
107th item is incorrect! (106.000000 != 212.000000)
108th item is incorrect! (107.000000 != 214.000000)
109th item is incorrect! (108.000000 != 216.000000)
110th item is incorrect! (109.000000 != 218.000000)
111th item is incorrect! (110.000000 != 220.000000)
112th item is incorrect! (111.000000 != 222.000000)
113th item is incorrect! (112.000000 != 224.000000)
114th item is incorrect! (113.000000 != 226.000000)
115th item is incorrect! (114.000000 != 228.000000)
116th item is incorrect! (115.000000 != 230.000000)
117th item is incorrect! (116.000000 != 232.000000)
118th item is incorrect! (117.000000 != 234.000000)
119th item is incorrect! (118.000000 != 236.000000)
120th item is incorrect! (119.000000 != 238.000000)
121th item is incorrect! (120.000000 != 240.000000)
122th item is incorrect! (121.000000 != 242.000000)
123th item is incorrect! (122.000000 != 244.000000)
124th item is incorrect! (123.000000 != 246.000000)
125th item is incorrect! (124.000000 != 248.000000)
126th item is incorrect! (125.000000 != 250.000000)
127th item is incorrect! (126.000000 != 252.000000)
128th item is incorrect! (127.000000 != 254.000000)
129th item is incorrect! (128.000000 != 256.000000)
130th item is incorrect! (129.000000 != 258.000000)
131th item is incorrect! (130.000000 != 260.000000)
132th item is incorrect! (131.000000 != 262.000000)
133th item is incorrect! (132.000000 != 264.000000)
134th item is incorrect! (133.000000 != 266.000000)
135th item is incorrect! (134.000000 != 268.000000)
136th item is incorrect! (135.000000 != 270.000000)
137th item is incorrect! (136.000000 != 272.000000)
138th item is incorrect! (137.000000 != 274.000000)
139th item is incorrect! (138.000000 != 276.000000)
140th item is incorrect! (139.000000 != 278.000000)
141th item is incorrect! (140.000000 != 280.000000)
142th item is incorrect! (141.000000 != 282.000000)
143th item is incorrect! (142.000000 != 284.000000)
144th item is incorrect! (143.000000 != 286.000000)
145th item is incorrect! (144.000000 != 288.000000)
146th item is incorrect! (145.000000 != 290.000000)
147th item is incorrect! (146.000000 != 292.000000)
148th item is incorrect! (147.000000 != 294.000000)
149th item is incorrect! (148.000000 != 296.000000)
150th item is incorrect! (149.000000 != 298.000000)
151th item is incorrect! (150.000000 != 300.000000)
152th item is incorrect! (151.000000 != 302.000000)
153th item is incorrect! (152.000000 != 304.000000)
154th item is incorrect! (153.000000 != 306.000000)
155th item is incorrect! (154.000000 != 308.000000)
156th item is incorrect! (155.000000 != 310.000000)
157th item is incorrect! (156.000000 != 312.000000)
158th item is incorrect! (157.000000 != 314.000000)
159th item is incorrect! (158.000000 != 316.000000)
160th item is incorrect! (159.000000 != 318.000000)
161th item is incorrect! (160.000000 != 320.000000)
162th item is incorrect! (161.000000 != 322.000000)
163th item is incorrect! (162.000000 != 324.000000)
164th item is incorrect! (163.000000 != 326.000000)
165th item is incorrect! (164.000000 != 328.000000)
166th item is incorrect! (165.000000 != 330.000000)
167th item is incorrect! (166.000000 != 332.000000)
168th item is incorrect! (167.000000 != 334.000000)
169th item is incorrect! (168.000000 != 336.000000)
170th item is incorrect! (169.000000 != 338.000000)
171th item is incorrect! (170.000000 != 340.000000)
172th item is incorrect! (171.000000 != 342.000000)
173th item is incorrect! (172.000000 != 344.000000)
174th item is incorrect! (173.000000 != 346.000000)
175th item is incorrect! (174.000000 != 348.000000)
176th item is incorrect! (175.000000 != 350.000000)
177th item is incorrect! (176.000000 != 352.000000)
178th item is incorrect! (177.000000 != 354.000000)
179th item is incorrect! (178.000000 != 356.000000)
180th item is incorrect! (179.000000 != 358.000000)
181th item is incorrect! (180.000000 != 360.000000)
182th item is incorrect! (181.000000 != 362.000000)
183th item is incorrect! (182.000000 != 364.000000)
184th item is incorrect! (183.000000 != 366.000000)
185th item is incorrect! (184.000000 != 368.000000)
186th item is incorrect! (185.000000 != 370.000000)
187th item is incorrect! (186.000000 != 372.000000)
188th item is incorrect! (187.000000 != 374.000000)
189th item is incorrect! (188.000000 != 376.000000)
190th item is incorrect! (189.000000 != 378.000000)
191th item is incorrect! (190.000000 != 380.000000)
192th item is incorrect! (191.000000 != 382.000000)
193th item is incorrect! (192.000000 != 384.000000)
194th item is incorrect! (193.000000 != 386.000000)
195th item is incorrect! (194.000000 != 388.000000)
196th item is incorrect! (195.000000 != 390.000000)
197th item is incorrect! (196.000000 != 392.000000)
198th item is incorrect! (197.000000 != 394.000000)
199th item is incorrect! (198.000000 != 396.000000)
200th item is incorrect! (199.000000 != 398.000000)
201th item is incorrect! (200.000000 != 400.000000)
202th item is incorrect! (201.000000 != 402.000000)
203th item is incorrect! (202.000000 != 404.000000)
204th item is incorrect! (203.000000 != 406.000000)
205th item is incorrect! (204.000000 != 408.000000)
206th item is incorrect! (205.000000 != 410.000000)
207th item is incorrect! (206.000000 != 412.000000)
208th item is incorrect! (207.000000 != 414.000000)
209th item is incorrect! (208.000000 != 416.000000)
210th item is incorrect! (209.000000 != 418.000000)
211th item is incorrect! (210.000000 != 420.000000)
212th item is incorrect! (211.000000 != 422.000000)
213th item is incorrect! (212.000000 != 424.000000)
214th item is incorrect! (213.000000 != 426.000000)
215th item is incorrect! (214.000000 != 428.000000)
216th item is incorrect! (215.000000 != 430.000000)
217th item is incorrect! (216.000000 != 432.000000)
218th item is incorrect! (217.000000 != 434.000000)
219th item is incorrect! (218.000000 != 436.000000)
220th item is incorrect! (219.000000 != 438.000000)
221th item is incorrect! (220.000000 != 440.000000)
222th item is incorrect! (221.000000 != 442.000000)
223th item is incorrect! (222.000000 != 444.000000)
224th item is incorrect! (223.000000 != 446.000000)
225th item is incorrect! (224.000000 != 448.000000)
226th item is incorrect! (225.000000 != 450.000000)
227th item is incorrect! (226.000000 != 452.000000)
228th item is incorrect! (227.000000 != 454.000000)
229th item is incorrect! (228.000000 != 456.000000)
230th item is incorrect! (229.000000 != 458.000000)
231th item is incorrect! (230.000000 != 460.000000)
232th item is incorrect! (231.000000 != 462.000000)
233th item is incorrect! (232.000000 != 464.000000)
234th item is incorrect! (233.000000 != 466.000000)
235th item is incorrect! (234.000000 != 468.000000)
236th item is incorrect! (235.000000 != 470.000000)
237th item is incorrect! (236.000000 != 472.000000)
238th item is incorrect! (237.000000 != 474.000000)
239th item is incorrect! (238.000000 != 476.000000)
240th item is incorrect! (239.000000 != 478.000000)
241th item is incorrect! (240.000000 != 480.000000)
242th item is incorrect! (241.000000 != 482.000000)
243th item is incorrect! (242.000000 != 484.000000)
244th item is incorrect! (243.000000 != 486.000000)
245th item is incorrect! (244.000000 != 488.000000)
246th item is incorrect! (245.000000 != 490.000000)
247th item is incorrect! (246.000000 != 492.000000)
248th item is incorrect! (247.000000 != 494.000000)
249th item is incorrect! (248.000000 != 496.000000)
250th item is incorrect! (249.000000 != 498.000000)
251th item is incorrect! (250.000000 != 500.000000)
252th item is incorrect! (251.000000 != 502.000000)
253th item is incorrect! (252.000000 != 504.000000)
254th item is incorrect! (253.000000 != 506.000000)
255th item is incorrect! (254.000000 != 508.000000)
256th item is incorrect! (255.000000 != 510.000000)
257th item is incorrect! (256.000000 != 512.000000)
258th item is incorrect! (257.000000 != 514.000000)
259th item is incorrect! (258.000000 != 516.000000)
260th item is incorrect! (259.000000 != 518.000000)
261th item is incorrect! (260.000000 != 520.000000)
262th item is incorrect! (261.000000 != 522.000000)
263th item is incorrect! (262.000000 != 524.000000)
264th item is incorrect! (263.000000 != 526.000000)
265th item is incorrect! (264.000000 != 528.000000)
266th item is incorrect! (265.000000 != 530.000000)
267th item is incorrect! (266.000000 != 532.000000)
268th item is incorrect! (267.000000 != 534.000000)
269th item is incorrect! (268.000000 != 536.000000)
270th item is incorrect! (269.000000 != 538.000000)
271th item is incorrect! (270.000000 != 540.000000)
272th item is incorrect! (271.000000 != 542.000000)
273th item is incorrect! (272.000000 != 544.000000)
274th item is incorrect! (273.000000 != 546.000000)
275th item is incorrect! (274.000000 != 548.000000)
276th item is incorrect! (275.000000 != 550.000000)
277th item is incorrect! (276.000000 != 552.000000)
278th item is incorrect! (277.000000 != 554.000000)
279th item is incorrect! (278.000000 != 556.000000)
280th item is incorrect! (279.000000 != 558.000000)
281th item is incorrect! (280.000000 != 560.000000)
282th item is incorrect! (281.000000 != 562.000000)
283th item is incorrect! (282.000000 != 564.000000)
284th item is incorrect! (283.000000 != 566.000000)
285th item is incorrect! (284.000000 != 568.000000)
286th item is incorrect! (285.000000 != 570.000000)
287th item is incorrect! (286.000000 != 572.000000)
288th item is incorrect! (287.000000 != 574.000000)
289th item is incorrect! (288.000000 != 576.000000)
290th item is incorrect! (289.000000 != 578.000000)
291th item is incorrect! (290.000000 != 580.000000)
292th item is incorrect! (291.000000 != 582.000000)
293th item is incorrect! (292.000000 != 584.000000)
294th item is incorrect! (293.000000 != 586.000000)
295th item is incorrect! (294.000000 != 588.000000)
296th item is incorrect! (295.000000 != 590.000000)
297th item is incorrect! (296.000000 != 592.000000)
298th item is incorrect! (297.000000 != 594.000000)
299th item is incorrect! (298.000000 != 596.000000)
300th item is incorrect! (299.000000 != 598.000000)
301th item is incorrect! (300.000000 != 600.000000)
302th item is incorrect! (301.000000 != 602.000000)
303th item is incorrect! (302.000000 != 604.000000)
304th item is incorrect! (303.000000 != 606.000000)
305th item is incorrect! (304.000000 != 608.000000)
306th item is incorrect! (305.000000 != 610.000000)
307th item is incorrect! (306.000000 != 612.000000)
308th item is incorrect! (307.000000 != 614.000000)
309th item is incorrect! (308.000000 != 616.000000)
310th item is incorrect! (309.000000 != 618.000000)
311th item is incorrect! (310.000000 != 620.000000)
312th item is incorrect! (311.000000 != 622.000000)
313th item is incorrect! (312.000000 != 624.000000)
314th item is incorrect! (313.000000 != 626.000000)
315th item is incorrect! (314.000000 != 628.000000)
316th item is incorrect! (315.000000 != 630.000000)
317th item is incorrect! (316.000000 != 632.000000)
318th item is incorrect! (317.000000 != 634.000000)
319th item is incorrect! (318.000000 != 636.000000)
320th item is incorrect! (319.000000 != 638.000000)
321th item is incorrect! (320.000000 != 640.000000)
322th item is incorrect! (321.000000 != 642.000000)
323th item is incorrect! (322.000000 != 644.000000)
324th item is incorrect! (323.000000 != 646.000000)
325th item is incorrect! (324.000000 != 648.000000)
326th item is incorrect! (325.000000 != 650.000000)
327th item is incorrect! (326.000000 != 652.000000)
328th item is incorrect! (327.000000 != 654.000000)
329th item is incorrect! (328.000000 != 656.000000)
330th item is incorrect! (329.000000 != 658.000000)
331th item is incorrect! (330.000000 != 660.000000)
332th item is incorrect! (331.000000 != 662.000000)
333th item is incorrect! (332.000000 != 664.000000)
334th item is incorrect! (333.000000 != 666.000000)
335th item is incorrect! (334.000000 != 668.000000)
336th item is incorrect! (335.000000 != 670.000000)
337th item is incorrect! (336.000000 != 672.000000)
338th item is incorrect! (337.000000 != 674.000000)
339th item is incorrect! (338.000000 != 676.000000)
340th item is incorrect! (339.000000 != 678.000000)
341th item is incorrect! (340.000000 != 680.000000)
342th item is incorrect! (341.000000 != 682.000000)
343th item is incorrect! (342.000000 != 684.000000)
344th item is incorrect! (343.000000 != 686.000000)
345th item is incorrect! (344.000000 != 688.000000)
346th item is incorrect! (345.000000 != 690.000000)
347th item is incorrect! (346.000000 != 692.000000)
348th item is incorrect! (347.000000 != 694.000000)
349th item is incorrect! (348.000000 != 696.000000)
350th item is incorrect! (349.000000 != 698.000000)
351th item is incorrect! (350.000000 != 700.000000)
352th item is incorrect! (351.000000 != 702.000000)
353th item is incorrect! (352.000000 != 704.000000)
354th item is incorrect! (353.000000 != 706.000000)
355th item is incorrect! (354.000000 != 708.000000)
356th item is incorrect! (355.000000 != 710.000000)
357th item is incorrect! (356.000000 != 712.000000)
358th item is incorrect! (357.000000 != 714.000000)
359th item is incorrect! (358.000000 != 716.000000)
360th item is incorrect! (359.000000 != 718.000000)
361th item is incorrect! (360.000000 != 720.000000)
362th item is incorrect! (361.000000 != 722.000000)
363th item is incorrect! (362.000000 != 724.000000)
364th item is incorrect! (363.000000 != 726.000000)
365th item is incorrect! (364.000000 != 728.000000)
366th item is incorrect! (365.000000 != 730.000000)
367th item is incorrect! (366.000000 != 732.000000)
368th item is incorrect! (367.000000 != 734.000000)
369th item is incorrect! (368.000000 != 736.000000)
370th item is incorrect! (369.000000 != 738.000000)
371th item is incorrect! (370.000000 != 740.000000)
372th item is incorrect! (371.000000 != 742.000000)
373th item is incorrect! (372.000000 != 744.000000)
374th item is incorrect! (373.000000 != 746.000000)
375th item is incorrect! (374.000000 != 748.000000)
376th item is incorrect! (375.000000 != 750.000000)
377th item is incorrect! (376.000000 != 752.000000)
378th item is incorrect! (377.000000 != 754.000000)
379th item is incorrect! (378.000000 != 756.000000)
380th item is incorrect! (379.000000 != 758.000000)
381th item is incorrect! (380.000000 != 760.000000)
382th item is incorrect! (381.000000 != 762.000000)
383th item is incorrect! (382.000000 != 764.000000)
384th item is incorrect! (383.000000 != 766.000000)
385th item is incorrect! (384.000000 != 768.000000)
386th item is incorrect! (385.000000 != 770.000000)
387th item is incorrect! (386.000000 != 772.000000)
388th item is incorrect! (387.000000 != 774.000000)
389th item is incorrect! (388.000000 != 776.000000)
390th item is incorrect! (389.000000 != 778.000000)
391th item is incorrect! (390.000000 != 780.000000)
392th item is incorrect! (391.000000 != 782.000000)
393th item is incorrect! (392.000000 != 784.000000)
394th item is incorrect! (393.000000 != 786.000000)
395th item is incorrect! (394.000000 != 788.000000)
396th item is incorrect! (395.000000 != 790.000000)
397th item is incorrect! (396.000000 != 792.000000)
398th item is incorrect! (397.000000 != 794.000000)
399th item is incorrect! (398.000000 != 796.000000)
400th item is incorrect! (399.000000 != 798.000000)
401th item is incorrect! (400.000000 != 800.000000)
402th item is incorrect! (401.000000 != 802.000000)
403th item is incorrect! (402.000000 != 804.000000)
404th item is incorrect! (403.000000 != 806.000000)
405th item is incorrect! (404.000000 != 808.000000)
406th item is incorrect! (405.000000 != 810.000000)
407th item is incorrect! (406.000000 != 812.000000)
408th item is incorrect! (407.000000 != 814.000000)
409th item is incorrect! (408.000000 != 816.000000)
410th item is incorrect! (409.000000 != 818.000000)
411th item is incorrect! (410.000000 != 820.000000)
412th item is incorrect! (411.000000 != 822.000000)
413th item is incorrect! (412.000000 != 824.000000)
414th item is incorrect! (413.000000 != 826.000000)
415th item is incorrect! (414.000000 != 828.000000)
416th item is incorrect! (415.000000 != 830.000000)
417th item is incorrect! (416.000000 != 832.000000)
418th item is incorrect! (417.000000 != 834.000000)
419th item is incorrect! (418.000000 != 836.000000)
420th item is incorrect! (419.000000 != 838.000000)
421th item is incorrect! (420.000000 != 840.000000)
422th item is incorrect! (421.000000 != 842.000000)
423th item is incorrect! (422.000000 != 844.000000)
424th item is incorrect! (423.000000 != 846.000000)
425th item is incorrect! (424.000000 != 848.000000)
426th item is incorrect! (425.000000 != 850.000000)
427th item is incorrect! (426.000000 != 852.000000)
428th item is incorrect! (427.000000 != 854.000000)
429th item is incorrect! (428.000000 != 856.000000)
430th item is incorrect! (429.000000 != 858.000000)
431th item is incorrect! (430.000000 != 860.000000)
432th item is incorrect! (431.000000 != 862.000000)
433th item is incorrect! (432.000000 != 864.000000)
434th item is incorrect! (433.000000 != 866.000000)
435th item is incorrect! (434.000000 != 868.000000)
436th item is incorrect! (435.000000 != 870.000000)
437th item is incorrect! (436.000000 != 872.000000)
438th item is incorrect! (437.000000 != 874.000000)
439th item is incorrect! (438.000000 != 876.000000)
440th item is incorrect! (439.000000 != 878.000000)
441th item is incorrect! (440.000000 != 880.000000)
442th item is incorrect! (441.000000 != 882.000000)
443th item is incorrect! (442.000000 != 884.000000)
444th item is incorrect! (443.000000 != 886.000000)
445th item is incorrect! (444.000000 != 888.000000)
446th item is incorrect! (445.000000 != 890.000000)
447th item is incorrect! (446.000000 != 892.000000)
448th item is incorrect! (447.000000 != 894.000000)
449th item is incorrect! (448.000000 != 896.000000)
450th item is incorrect! (449.000000 != 898.000000)
451th item is incorrect! (450.000000 != 900.000000)
452th item is incorrect! (451.000000 != 902.000000)
453th item is incorrect! (452.000000 != 904.000000)
454th item is incorrect! (453.000000 != 906.000000)
455th item is incorrect! (454.000000 != 908.000000)
456th item is incorrect! (455.000000 != 910.000000)
457th item is incorrect! (456.000000 != 912.000000)
458th item is incorrect! (457.000000 != 914.000000)
459th item is incorrect! (458.000000 != 916.000000)
460th item is incorrect! (459.000000 != 918.000000)
461th item is incorrect! (460.000000 != 920.000000)
462th item is incorrect! (461.000000 != 922.000000)
463th item is incorrect! (462.000000 != 924.000000)
464th item is incorrect! (463.000000 != 926.000000)
465th item is incorrect! (464.000000 != 928.000000)
466th item is incorrect! (465.000000 != 930.000000)
467th item is incorrect! (466.000000 != 932.000000)
468th item is incorrect! (467.000000 != 934.000000)
469th item is incorrect! (468.000000 != 936.000000)
470th item is incorrect! (469.000000 != 938.000000)
471th item is incorrect! (470.000000 != 940.000000)
472th item is incorrect! (471.000000 != 942.000000)
473th item is incorrect! (472.000000 != 944.000000)
474th item is incorrect! (473.000000 != 946.000000)
475th item is incorrect! (474.000000 != 948.000000)
476th item is incorrect! (475.000000 != 950.000000)
477th item is incorrect! (476.000000 != 952.000000)
478th item is incorrect! (477.000000 != 954.000000)
479th item is incorrect! (478.000000 != 956.000000)
480th item is incorrect! (479.000000 != 958.000000)
481th item is incorrect! (480.000000 != 960.000000)
482th item is incorrect! (481.000000 != 962.000000)
483th item is incorrect! (482.000000 != 964.000000)
484th item is incorrect! (483.000000 != 966.000000)
485th item is incorrect! (484.000000 != 968.000000)
486th item is incorrect! (485.000000 != 970.000000)
487th item is incorrect! (486.000000 != 972.000000)
488th item is incorrect! (487.000000 != 974.000000)
489th item is incorrect! (488.000000 != 976.000000)
490th item is incorrect! (489.000000 != 978.000000)
491th item is incorrect! (490.000000 != 980.000000)
492th item is incorrect! (491.000000 != 982.000000)
493th item is incorrect! (492.000000 != 984.000000)
494th item is incorrect! (493.000000 != 986.000000)
495th item is incorrect! (494.000000 != 988.000000)
496th item is incorrect! (495.000000 != 990.000000)
497th item is incorrect! (496.000000 != 992.000000)
498th item is incorrect! (497.000000 != 994.000000)
499th item is incorrect! (498.000000 != 996.000000)
500th item is incorrect! (499.000000 != 998.000000)
501th item is incorrect! (500.000000 != 1000.000000)
502th item is incorrect! (501.000000 != 1002.000000)
503th item is incorrect! (502.000000 != 1004.000000)
504th item is incorrect! (503.000000 != 1006.000000)
505th item is incorrect! (504.000000 != 1008.000000)
506th item is incorrect! (505.000000 != 1010.000000)
507th item is incorrect! (506.000000 != 1012.000000)
508th item is incorrect! (507.000000 != 1014.000000)
509th item is incorrect! (508.000000 != 1016.000000)
510th item is incorrect! (509.000000 != 1018.000000)
511th item is incorrect! (510.000000 != 1020.000000)
512th item is incorrect! (511.000000 != 1022.000000)
511 errors!

ERROR: ACCL base functionality test failed!

-- STATISTICS - ID: 0
-----------------------------------------------
          Read command FIFO used: 	0
         Write command FIFO used: 	0
                 Host reads sent: 	98
                Host writes sent: 	66
                 Card reads sent: 	65
                Card writes sent: 	64
                 Sync reads sent: 	10
                Sync writes sent: 	0
                     Page faults: 	0


 -- NET STATS QSFP0

RX pkgs: 510
TX pkgs: 138
ARP RX pkgs: 4
ARP TX pkgs: 2
ICMP RX pkgs: 0
ICMP TX pkgs: 0
TCP RX pkgs: 0
TCP TX pkgs: 0
ROCE RX pkgs: 180
ROCE TX pkgs: 136
IBV RX pkgs: 235
IBV TX pkgs: 236
PSN drop cnt: 0
Retrans cnt: 0
TCP session cnt: 0
STRM down: 0

Finalizing MPI...
Done. Terminating...

stderr
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 21422
UID: 500207
[Tue May 14 12:11:34 2024 GMT]
HOST: alveo-u55c-07.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
CCLO HWID: 2696576574 at 0x0
CCLO source commit (first 24b): a0ba7e
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fe2c6000000, Size: 64
calling offload: 7fe2c6000000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fe2c5e00000, Size: 64
calling offload: 7fe2c5e00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe2c5a00000, Size: 4194304
calling offload: 7fe2c5a00000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe2c5600000, Size: 4194304
calling offload: 7fe2c5600000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe2c5200000, Size: 4194304
calling offload: 7fe2c5200000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

Rank 0 passed last barrier before test!
CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7fe2c6000000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7fe2c5e00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

CoyoteBuffer contructor called! page_size:2097152, buffer_size:2048,n_pages:1
Allocation successful! Allocated buffer: 7fe2c5000000, Size: 2048
CoyoteBuffer contructor called! page_size:2097152, buffer_size:2048,n_pages:1
Allocation successful! Allocated buffer: 7fe2c4e00000, Size: 2048
Reducing data...
Free user buffer from cProc cPid:0, buffer_size:2048,7fe2c5000000
Free user buffer from cProc cPid:0, buffer_size:2048,7fe2c4e00000
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 1

CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7fe2c6000000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7fe2c5e00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7fe2c6000000
Free user buffer from cProc cPid:0, buffer_size:64,7fe2c5e00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe2c5a00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe2c5600000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe2c5200000
quetric self-assigned this on May 14, 2024
quetric added the bug (Something isn't working) label on May 14, 2024
@quetric
Collaborator

quetric commented May 21, 2024

@lawirz can you specify what the threshold for Eager transfers is in your ACCL initialization?

@lawirz
Author

lawirz commented May 28, 2024

The errors are fixed on the issue branch.
