Is addition in golang faster than in C? #142

Open
zhangyachen opened this issue Feb 2, 2019 · 2 comments

zhangyachen commented Feb 2, 2019

1.31

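My train home leaves tonight, I have two hours left at the office, and I am in no mood to work. On the principle of not wasting time (an hour and a half left as I write this sentence), I might as well find something to do. I decided to write up a problem a colleague ran into a few days ago: comparing the speed of addition in golang and C. (If you are now thinking that my workload is too light, please boldly put your name in the comments!)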

Operating system information:

$uname -a
Linux 35d4aec21d2e 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

First, an addition loop in C:

#include <stdio.h>

int main() {
    long i, sum = 0;

    for (i = 0; i < 9000000000; i++) {
        sum += i;
    }

    printf("%ld", sum);

    return 0;
}

Execution time:

$time ./a.out
3606511848080896768
real	0m32.353s
user	0m30.963s
sys	0m1.091s

Now the same addition loop in Go:

package main

import (
    "fmt"
)

func main() {
    var i, sum uint64
    for i = 0; i < 9000000000; i++ {
        sum += i
    }

    fmt.Print(sum)
}

Execution time:

$time go run a.go
3606511848080896768
real	0m6.272s
user	0m6.142s
sys	0m0.215s

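As the timings show, the Golang version is more than 5 times faster than the C version (about 6.3 seconds versus 32.4 seconds). That is a jaw-dropping result. A small gap would be understandable, but a 5x difference is large. What causes it?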

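The first reaction is to look at the assembly for each program, because at the language level it is hard to think of anything that could cause such a large performance gap. It is only an addition, after all.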

Assembly generated by gcc (main function only):

0000000000400540 <main>:
  400540:	55                   	push   %rbp
  400541:	48 89 e5             	mov    %rsp,%rbp
  400544:	48 83 ec 10          	sub    $0x10,%rsp
  400548:	48 c7 45 f0 00 00 00 	movq   $0x0,-0x10(%rbp)  -----> sum = 0
  40054f:	00
  400550:	48 c7 45 f8 00 00 00 	movq   $0x0,-0x8(%rbp)     ---->   i = 0
  400557:	00
  400558:	eb 0d                	jmp    400567 <main+0x27>
  40055a:	48 8b 45 f8          	mov    -0x8(%rbp),%rax     
  40055e:	48 01 45 f0          	add    %rax,-0x10(%rbp)       -------> sum = sum + i
  400562:	48 83 45 f8 01       	addq   $0x1,-0x8(%rbp)        -------> i++
  400567:	48 b8 ff 19 71 18 02 	mov    $0x2187119ff,%rax
  40056e:	00 00 00
  400571:	48 39 45 f8          	cmp    %rax,-0x8(%rbp)         ------> i < 9000000000
  400575:	7e e3                	jle    40055a <main+0x1a>
  400577:	48 8b 45 f0          	mov    -0x10(%rbp),%rax
  40057b:	48 89 c6             	mov    %rax,%rsi
  40057e:	bf 6c 06 40 00       	mov    $0x40066c,%edi
  400583:	b8 00 00 00 00       	mov    $0x0,%eax
  400588:	e8 37 fe ff ff       	callq  4003c4 <printf@plt>
  40058d:	b8 00 00 00 00       	mov    $0x0,%eax     -------> return 0
  400592:	c9                   	leaveq

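The important instructions are annotated. Notice that the heavily used variables sum and i live on the stack (in memory), so every iteration loads and stores them there. In this example, though, my guess is that most of these accesses hit the CPU cache rather than going all the way to main memory.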

Now the assembly generated by the Go compiler:

000000000047b660 <main.main>:
  47b660:   64 48 8b 0c 25 f8 ff    mov    %fs:0xfffffffffffffff8,%rcx
  47b667:   ff ff
  47b669:   48 3b 61 10             cmp    0x10(%rcx),%rsp
  47b66d:   0f 86 aa 00 00 00       jbe    47b71d <main.main+0xbd>
  47b673:   48 83 ec 50             sub    $0x50,%rsp
  47b677:   48 89 6c 24 48          mov    %rbp,0x48(%rsp)
  47b67c:   48 8d 6c 24 48          lea    0x48(%rsp),%rbp
  47b681:   31 c0                   xor    %eax,%eax         -----------> i = 0
  47b683:   48 89 c1                mov    %rax,%rcx     ----------->  sum = i =0
  47b686:   48 ba 00 1a 71 18 02    mov    $0x218711a00,%rdx
  47b68d:   00 00 00
  47b690:   48 39 d0                cmp    %rdx,%rax
  47b693:   73 19                   jae    47b6ae <main.main+0x4e>
  47b695:   48 8d 58 01             lea    0x1(%rax),%rbx   ----------->  i++ (1)
  47b699:   48 01 c1                add    %rax,%rcx   -----------> sum = sum + i
  47b69c:   48 89 d8                mov    %rbx,%rax   -----------> i++ (2)
  47b69f:   48 ba 00 1a 71 18 02    mov    $0x218711a00,%rdx
  47b6a6:   00 00 00
  47b6a9:   48 39 d0                cmp    %rdx,%rax
  47b6ac:   72 e7                   jb     47b695 <main.main+0x35>
  47b6ae:   48 89 4c 24 30          mov    %rcx,0x30(%rsp)
  47b6b3:   48 c7 44 24 38 00 00    movq   $0x0,0x38(%rsp)
  47b6ba:   00 00
  47b6bc:   48 c7 44 24 40 00 00    movq   $0x0,0x40(%rsp)
  47b6c3:   00 00
  47b6c5:   48 8d 05 d4 e6 00 00    lea    0xe6d4(%rip),%rax        # 489da0 <type.*+0xdda0>
  47b6cc:   48 89 04 24             mov    %rax,(%rsp)
  47b6d0:   48 8d 44 24 30          lea    0x30(%rsp),%rax
  47b6d5:   48 89 44 24 08          mov    %rax,0x8(%rsp)
  47b6da:   e8 91 02 f9 ff          callq  40b970 <runtime.convT2E>
  47b6df:   48 8b 44 24 10          mov    0x10(%rsp),%rax
  47b6e4:   48 8b 4c 24 18          mov    0x18(%rsp),%rcx
  47b6e9:   48 89 44 24 38          mov    %rax,0x38(%rsp)
  47b6ee:   48 89 4c 24 40          mov    %rcx,0x40(%rsp)
  47b6f3:   48 8d 44 24 38          lea    0x38(%rsp),%rax
  47b6f8:   48 89 04 24             mov    %rax,(%rsp)
  47b6fc:   48 c7 44 24 08 01 00    movq   $0x1,0x8(%rsp)
  47b703:   00 00
  47b705:   48 c7 44 24 10 01 00    movq   $0x1,0x10(%rsp)
  47b70c:   00 00
  47b70e:   e8 4d 90 ff ff          callq  474760 <fmt.Print>
  47b713:   48 8b 6c 24 48          mov    0x48(%rsp),%rbp
  47b718:   48 83 c4 50             add    $0x50,%rsp
  47b71c:   c3                      retq
  47b71d:   e8 ee f7 fc ff          callq  44af10 <runtime.morestack_noctxt>
  47b722:   e9 39 ff ff ff          jmpq   47b660 <main.main>

As you can see, the Go compiler keeps the heavily used sum and i variables in registers.

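As a rough rule of thumb, the CPU accesses registers about 100 times faster than main memory and about 10 times faster than the CPU cache.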

In my machine environment, adding the register keyword to the variables noticeably improves the run time:

#include <stdio.h>

int main() {
    // add register keyword
    register long i, sum = 0;

    for (i = 0; i < 9000000000; i++) {
        sum += i;
    }

    printf("%ld", sum);

    return 0;
}

Execution time:

$time ./a.out
3606511848080896768
real	0m4.650s
user	0m4.645s
sys	0m0.001s

The run time drops from 32.4 seconds to 4.6 seconds, a very clear improvement.
Here is the generated assembly:

0000000000400540 <main>:
  400540:	55                   	push   %rbp
  400541:	48 89 e5             	mov    %rsp,%rbp
  400544:	41 54                	push   %r12
  400546:	53                   	push   %rbx
  400547:	41 bc 00 00 00 00    	mov    $0x0,%r12d    ----------> sum = 0  (writing the lower 32 bits zero-extends into %r12)
  40054d:	bb 00 00 00 00       	mov    $0x0,%ebx     -----------> i = 0
  400552:	eb 07                	jmp    40055b <main+0x1b>
  400554:	49 01 dc             	add    %rbx,%r12     ---------->  sum = sum + i
  400557:	48 83 c3 01          	add    $0x1,%rbx      --------->   i++ 
  40055b:	48 b8 ff 19 71 18 02 	mov    $0x2187119ff,%rax
  400562:	00 00 00
  400565:	48 39 c3             	cmp    %rax,%rbx
  400568:	7e ea                	jle    400554 <main+0x14>
  40056a:	4c 89 e6             	mov    %r12,%rsi
  40056d:	bf 5c 06 40 00       	mov    $0x40065c,%edi
  400572:	b8 00 00 00 00       	mov    $0x0,%eax
  400577:	e8 48 fe ff ff       	callq  4003c4 <printf@plt>
  40057c:	b8 00 00 00 00       	mov    $0x0,%eax
  400581:	5b                   	pop    %rbx
  400582:	41 5c                	pop    %r12
  400584:	5d                   	pop    %rbp
  400585:	c3                   	retq

This time gcc placed both variables in registers.

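I stressed "in my machine environment" just now because on my local Mac, even with register added, gcc still does not put the variables in registers.
I recall K&R saying that compilers are often smarter than people, so we do not need to add the register keyword by hand, and that even when we do, the compiler may not put the variables in registers. In this example, though, keeping the variables in registers is clearly the better choice, so why does gcc not do it? Can some expert explain? (Figured it out: gcc defaults to -O0, and I had always assumed it was -O1. At optimization level -O1 or above the variables are placed in registers.)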

A lightweight filler post to welcome the coming new year.

The end.

@robberphex

You could try adding the -O2 option and see whether the variables then get optimized into registers.

P.S. Indeed, on macOS the two versions run in about the same time, with no big performance gain.

@zhangyachen

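@robberphex It turns out gcc's default optimization level is -O0; I had always thought it was -O1. Tested on Linux, -O1 is enough to put the variables in registers. The same holds on the Mac, but it looks like gcc on the Mac simply computed the result outright.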