Numpy-pyccel #942

Closed
mohamedlaminebabou opened this issue Aug 18, 2021 · 25 comments · Fixed by #938

@mohamedlaminebabou commented Aug 18, 2021

Describe the bug
Hello,

I wrote a program in Fortran, then translated it to Python and used Pyccel to accelerate the Python functions. However, when I use NumPy functions my Pyccel code is slower than the pure Fortran version, and when I don't use NumPy functions Pyccel is faster than my Fortran code.

So is there a possibility of combining the two (Fortran code + Pyccel)?

To Reproduce
Provide code to reproduce the behavior:

from pyccel.decorators import types

@types('float[:,:]', 'float[:,:](order=C)', 'float[:,:](order=C)')
def qr_mgs(A, Q, R):
    from numpy import matmul, norm, zeros
    n, p = A.shape
    Qt = zeros((A.shape[1], A.shape[0]))
    s = zeros((Qt.shape[0], Q.shape[1]))

    for k in range(p):
        Q[:, k] = A[:, k]
        Qt[k, :] = A[:, k]
        s = matmul(Qt, Q)
        for i in range(k):
            R[i, k] = s[k, i]
            Q[:, k] = Q[:, k] - R[i, k] * Q[:, i]
        R[k, k] = norm(Q[:, k])
        Q[:, k] = Q[:, k] / R[k, k]
    return 0
import numpy as np
from pyccel.epyccel import epyccel

qr_f90 = epyccel(qr_mgs)

def qr_fp(a):
    n, p = a.shape
    q = np.zeros((n, p), order='C')
    r = np.zeros((p, p), order='C')
    qr_f90(a, q, r)
    return q, r
""""fortran version using f2py """
%load_ext fortranmagic
%%fortran --fcompiler=gnu95

subroutine  mgs(A,n,p,Q,R)
    integer(8) ,intent(in) ::n,p   
    real(8) ,intent(in) ::A(n,p)    
    real(8), intent(out) :: R(p,p)
    real(8), intent(out) ::   Q(n,p)
    integer(8):: k,i
    Q = 0.0d+00
    R = 0.0d+00
    do k = 1,p
        Q(:,k) = A(:,k)
        do i = 1,k-1
            R(i,k) = dot_product(Q(:,i),Q(:,k))
            Q(:,k) = Q(:,k) - R(i,k)*Q(:,i)
        end do
        R(k,k) = sqrt(dot_product(Q(:,k),Q(:,k)))
        Q(:,k) = Q(:,k)/R(k,k)
    end do
end subroutine  mgs

Provide the generated code, or the error message:

import numpy as np

A = np.random.rand(700, 700)
q, r = qr_fp(A)
Q, R = mgs(A)
np.linalg.norm(A - q@r)   # 2.8754496870175957e-13
np.linalg.norm(A - Q@R)   # 2.870136485142167e-13
%timeit qr_fp(A)
22.9 s ± 242 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit mgs(A)
247 ms ± 2.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


@EmilyBourne (Member)

%timeit qr_fp(A)
22.9 s ± 242 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit mgs(A)
247 ms ± 2.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I am really surprised that Pyccel is so slow that the time is counted in seconds! My first instinct is to ask whether you are certain that you are running the compiled code and not the Python code, and whether you are including the translation time in the timeit measurement.

Could I suggest using pyccel for this instead of epyccel? That way you are certain that the translation is only done once, and the interface between qr_mgs and mgs is handled in the same way. It also makes it easier to inspect the generated code and compare it with your handwritten version.
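
For example, a minimal sketch of that workflow, assuming the decorated function is saved in a file called qr.py:

pyccel qr.py --verbose                  # translate and compile once; prints the generated files and compile commands
python3 -c "from qr import qr_mgs"      # the import now loads the compiled extension module rather than the pure-Python source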

@EmilyBourne (Member) commented Aug 18, 2021

The generated code is:

function qr_mgs(A, Q, R) result(Out_0001)

    implicit none

    integer(C_INT64_T) :: Out_0001
    real(C_DOUBLE), intent(in) :: A(0:,0:)
    real(C_DOUBLE), intent(inout) :: Q(0:,0:)
    real(C_DOUBLE), intent(inout) :: R(0:,0:)
    integer(C_INT64_T) :: n
    integer(C_INT64_T) :: p
    real(C_DOUBLE), allocatable :: Qt(:,:)
    real(C_DOUBLE), allocatable :: s(:,:)
    integer(C_INT64_T) :: k
    integer(C_INT64_T) :: i

    n = size(A, 2_C_INT64_T, C_INT64_T)
    p = size(A, 1_C_INT64_T, C_INT64_T)
    allocate(Qt(0:size(A, 2_C_INT64_T, C_INT64_T) - 1_C_INT64_T, 0:size( &
      A, 1_C_INT64_T, C_INT64_T) - 1_C_INT64_T))
    Qt = 0.0_C_DOUBLE
    allocate(s(0:size(Q, 1_C_INT64_T, C_INT64_T) - 1_C_INT64_T, 0:size(A &
      , 1_C_INT64_T, C_INT64_T) - 1_C_INT64_T))
    s = 0.0_C_DOUBLE
    do k = 0_C_INT64_T, p - 1_C_INT64_T, 1_C_INT64_T
      Q(k, :) = A(k, :)
      Qt(:, k) = A(k, :)
      s = matmul(Q,Qt)
      do i = 0_C_INT64_T, k - 1_C_INT64_T, 1_C_INT64_T
        R(k, i) = s(i, k)
        Q(k, :) = Q(k, :) - R(k, i) * Q(i, :)
      end do
      R(k, k) = Norm2(Q(k, :))
      Q(k, :) = Q(k, :) / R(k, k)
    end do
    Out_0001 = 0_C_INT64_T
    return

  end function qr_mgs

The first major difference that I see is that you use stack arrays in your code, while the temporaries are allocatables in the Pyccel code. You can ask Pyccel to make this optimisation by adding the decorator @stack_arrays('Qt','s') to the top of your function.
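
As a sketch, the decorator sits above the function definition. It is importable from pyccel.decorators (spelled stack_array in current releases; adjust the name to your installed version):

from pyccel.decorators import types, stack_array

@stack_array('Qt', 's')   # ask Pyccel to allocate the local temporaries on the stack
@types('float[:,:]', 'float[:,:](order=C)', 'float[:,:](order=C)')
def qr_mgs(A, Q, R):
    ...                   # body unchanged from the original report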

Beyond this the only differences that I can see aside from the calculation of the norm are algorithmic.

In the fortran version the variable s does not exist. R(k,i) is calculated using a dot product (O(n)) and this is calculated k-1 times (O(pn)). In the python version s is calculated with a matmul (O(p^2n))

In the fortran version the variable Qt does not exist (reduced memory)

In fortran the inner loop has length k-1 while in python it has length k

@EmilyBourne (Member)

The following python file is closer to the original algorithm:

def qr_mgs(A : 'float[:,:]', Q : 'float[:,:]', R : 'float[:,:]'):
    from numpy import norm

    n,p=A.shape

    Q[:,:] = 0
    R[:,:] = 0

    for  k in range(p):
        Q[:,k] = A[:,k]
        for i in range(k-1):
            R[i,k] = Q[:,i] @ Q[:,k]
            Q[:,k] = Q[:,k] - R[i,k]*Q[:,i]
        R[k,k] = norm(Q[:,k])
        Q[:,k] = Q[:,k]/R[k,k]

It generates the following code:

subroutine qr_mgs(A, Q, R)

    implicit none

    real(C_DOUBLE), intent(in) :: A(0:,0:)
    real(C_DOUBLE), intent(inout) :: Q(0:,0:)
    real(C_DOUBLE), intent(inout) :: R(0:,0:)
    integer(C_INT64_T) :: n
    integer(C_INT64_T) :: p
    integer(C_INT64_T) :: k
    integer(C_INT64_T) :: i

    n = size(A, 2_C_INT64_T, C_INT64_T)
    p = size(A, 1_C_INT64_T, C_INT64_T)
    Q(:, :) = 0_C_INT64_T
    R(:, :) = 0_C_INT64_T
    do k = 0_C_INT64_T, p - 1_C_INT64_T, 1_C_INT64_T
      Q(k, :) = A(k, :)
      do i = 0_C_INT64_T, k - 1_C_INT64_T - 1_C_INT64_T, 1_C_INT64_T
        R(k, i) = sum(Q(i, :)*Q(k, :))
        Q(k, :) = Q(k, :) - R(k, i) * Q(i, :)
      end do
      R(k, k) = Norm2(Q(k, :))
      Q(k, :) = Q(k, :) / R(k, k)
    end do

  end subroutine qr_mgs

@mohamedlaminebabou (Author) commented Aug 18, 2021

Hi Emily,
Thank you for the answer. The problem is that dot_product does not exist in Pyccel, so I tried to emulate it with matmul; that is why I used matmul instead of dot_product!
Using your new Python version I now get:
from pyccel.epyccel import epyccel

qr_f90 = epyccel(qr_mgs)

ERROR at parsing (syntax) stage
pyccel: [syntax]
|fatal: mod_adrgpmsh.py [12,21]| Pyccel has encountered syntax that it does not recognise (BinOp(left=Subscript(value=Name(id='Q', ctx=Load()), slice=ExtSlice(dims=[Slice(lower=None, upper=None, step=None), Index(value=Name(id='i', ctx=Load()))]), ctx=Load()), op=MatMult(), right=Subscript(value=Name(id='Q', ctx=Load()), slice=ExtSlice(dims=[Slice(lower=None, upper=None, step=None), Index(value=Name(id='k', ctx=Load()))]), ctx=Load())))

@EmilyBourne (Member)

Pyccel has encountered syntax that it does not recognise

It looks like you need to update your version of Pyccel. I think it is complaining about the matmul operator (@), but this exists in the most up-to-date version of Pyccel (1.3.0). I would suggest updating (unless you are using a Mac, as there is a compiler bug that will only be fixed in version 1.3.1 once #938 is merged).
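
For instance, a standard pip upgrade (nothing Pyccel-specific):

pip3 install --upgrade pyccel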

If you can't/don't want to update then the following should also work:

def qr_mgs(A : 'float[:,:]', Q : 'float[:,:]', R : 'float[:,:]'):
    from numpy import norm, matmul

    n,p=A.shape

    Q[:,:] = 0
    R[:,:] = 0

    for  k in range(p):
        Q[:,k] = A[:,k]
        for i in range(k-1):
            R[i,k] = matmul(Q[:,i], Q[:,k])
            Q[:,k] = Q[:,k] - R[i,k]*Q[:,i]
        R[k,k] = norm(Q[:,k])
        Q[:,k] = Q[:,k]/R[k,k]

@EmilyBourne (Member)

the problem is that dot_product does not exist in Pyccel

Feel free to open an issue for this. I think it was left until last because numpy.dot covers lots of different use cases, but I believe most of those use cases exist in Pyccel now, so it should be relatively easy to add a dot function.
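
In the meantime, a workaround sketch: an explicit loop computing the 1D dot product, which Pyccel can already translate (the helper name dot_1d is purely illustrative):

def dot_1d(u : 'float[:]', v : 'float[:]'):
    # Equivalent to numpy.dot for 1D inputs; lowers to a plain Fortran loop
    s = 0.0
    for j in range(u.shape[0]):
        s += u[j] * v[j]
    return s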

@mohamedlaminebabou (Author) commented Aug 18, 2021

I updated Pyccel:
pyccel.__version__ == '1.3.0'
I ran the same example as at the start and found:
%timeit qr_fp(A)   467 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit mgs(A)     167 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Does this mean that f2py is faster than Pyccel?
Note that with Pyccel we use mpi4py, while in my pure Fortran code I don't use MPI or OpenMP.

@EmilyBourne (Member) commented Aug 18, 2021

I updated Pyccel:
pyccel.__version__ == '1.3.0'
I ran the same example as at the start and found:
%timeit qr_fp(A)   467 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit mgs(A)     167 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Haha, that looks much better!

I have the following set up:
qr.py:

def qr_mgs(A : 'float[:,:]', Q : 'float[:,:]', R : 'float[:,:]'):
    from numpy.linalg import norm

    n,p=A.shape

    Q[:,:] = 0
    R[:,:] = 0

    for  k in range(p):
        Q[:,k] = A[:,k]
        for i in range(k-1):
            R[i,k] = Q[:,i] @ Q[:,k]
            Q[:,k] = Q[:,k] - R[i,k]*Q[:,i]
        R[k,k] = norm(Q[:,k])
        Q[:,k] = Q[:,k]/R[k,k]

test_qr.py:

import numpy as np
import timeit
from qr import qr_mgs
from qr_f import mgs

A=np.random.rand(700,700)

def qr_pyc():
    n,p=A.shape
    q= np.empty((n,p),order='C')
    r= np.empty((p,p),order='C')
    qr_mgs(A,q,r)
    return q,r

def qr_fpy():
    n,p=A.shape
    q,r = mgs(A,n,p)
    return q,r

for f in [qr_pyc, qr_fpy]:
    q,r=f()
    print(f.__name__)
    print(np.linalg.norm(A-q@r))
    timer = timeit.Timer(f)
    print(timer.timeit(100))

Here qr_f is the file passed to f2py. This setup allows me to easily test the pure Python version against the Pyccel version, as well as modifications to the generated code.

I can therefore run:

python3 test_qr.py #pure python
pyccel qr.py --verbose
python3 test_qr.py #pyccelised code

You can then modify the generated file __pyccel__/qr.f90 manually and recompile it using the commands indicated by pyccel (thanks to verbose mode).
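
A sketch of that edit-and-rebuild loop (the exact commands are machine-specific; use the ones printed by --verbose on your system):

pyccel qr.py --verbose    # translate, compile, and print the gfortran/gcc commands used
# edit __pyccel__/qr.f90 by hand, then re-run the printed commands to rebuild the
# shared library without re-translating the Python source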

@EmilyBourne (Member)

In your code you iterate over the last index (e.g. R[i,k] = Q[:,i] @ Q[:,k]), which means the memory access pattern is optimised for Fortran ordering. It is probably worth trying F ordering instead of (the default) C ordering.
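
A sketch of what the F-ordered variant could look like (annotations only; this is not the exact code timed here):

def qr_mgs(A : 'float[:,:](order=F)', Q : 'float[:,:](order=F)', R : 'float[:,:](order=F)'):
    ...   # body unchanged

# and on the calling side, allocate Fortran-ordered buffers to match:
q = np.empty((n, p), order='F')
r = np.empty((p, p), order='F')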

I have tried making this change as well as manually using dot_product but even with these changes I don't get the same speed as f2py. I am not sure what is causing this. I now have nearly identical code to your handwritten code but it is still 3 times slower on my computer.

I also tried increasing the problem size from 700x700 to 1000x1000. This had the same time ratio so I don't think the wrapper is the bottleneck.

Perhaps f2py uses better compiler flags than we do? Or gcc is better at linking than gfortran? (But I think they should both use the same linker in the background)

@mohamedlaminebabou (Author) commented Aug 18, 2021

Let's compare Pyccel and my f2py code graphically, using your suggested code:
[plot: f2py_pyc — execution time of fortran-f2py vs pyccel]

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()
import progressbar
import numpy as np
import timeit
from qr import qr_mgs
from linalgf_babou import mgs


def qr_pyc(A):
    n, p = A.shape
    q = np.empty((n, p), order='C')
    r = np.empty((p, p), order='C')
    qr_mgs(A, q, r)
    return q, r

def qr_fpy(A):
    q, r = mgs(A)
    return q, r


Nrange = (np.arange(100, 700, 100)).astype(int)

t_fortran_qr = []
t_pycqr = []
bar = progressbar.ProgressBar()

for N in bar(Nrange):
    A = np.random.rand(N, N)

    t1 = %timeit -oq C = qr_pyc(A)
    t2 = %timeit -oq C = qr_fpy(A)

    t_pycqr.append(t1.best)
    t_fortran_qr.append(t2.best)

plt.loglog(Nrange, t_fortran_qr, marker=7, label='fortran-f2py')
plt.loglog(Nrange, t_pycqr, marker=3, label='pyccel')

plt.legend(loc='lower right')
plt.xlabel('Number of points')
plt.ylabel('Execution Time (s)');

@mohamedlaminebabou (Author) commented Aug 18, 2021

Let's see what happens if we work with Numba:

import numpy as np
from numba import jit, njit
from numpy.linalg import norm
from numpy import dot

@jit(target='cpu', nogil=True, fastmath=True)
def qr_numba(A : 'float[:,:]', Q : 'float[:,:]', R : 'float[:,:]'):
    n, p = A.shape
    Q[:, :] = 0
    R[:, :] = 0
    for k in range(p):
        Q[:, k] = A[:, k]
        for i in range(k-1):
            R[i, k] = Q[:, i] @ Q[:, k]
            Q[:, k] = Q[:, k] - R[i, k]*Q[:, i]
        R[k, k] = norm(Q[:, k])
        Q[:, k] = Q[:, k]/R[k, k]
    return Q, R

def qr_nby(A):
    n, p = A.shape
    q = np.empty((n, p), order='F')
    r = np.empty((p, p), order='F')
    qr_numba(A, q, r)
    return q, r

[plot: numba1 — execution time comparison including Numba]

That means Numba is faster than Pyccel and matches the speed of f2py. But in the benchmark document we find that Pyccel is the fastest!?
[screenshot from 2021-08-19 00-03-19: benchmark table]

@saidctb (Member) commented Aug 19, 2021

Which optimization flags are you using in f2py?
You can check by running %%fortran --fcompiler=gnu95 -vvv

@EmilyBourne (Member)

Which optimization flags are you using in f2py?
You can check by running %%fortran --fcompiler=gnu95 -vvv

My f2py is using the flag -funroll-loops. If I add this flag to pyccel then pyccel runs as fast or faster than f2py:

pyccel qr.py --flags="-funroll-loops" --verbose

Pyccel uses -O3, while f2py uses -O2, but apparently neither of these flags turns on loop unrolling.

F2PY:

<class 'numpy.distutils.fcompiler.gnu.Gnu95FCompiler'>
version_cmd     = ['/usr/bin/gfortran', '-dumpversion']
compiler_f77    = ['/usr/bin/gfortran', '-Wall', '-g', '-ffixed-form', '-fno-second-underscore', '-fPIC', '-O3', '-funroll-loops']
compiler_f90    = ['/usr/bin/gfortran', '-Wall', '-g', '-fno-second-underscore', '-fPIC', '-O3', '-funroll-loops']
compiler_fix    = ['/usr/bin/gfortran', '-Wall', '-g', '-ffixed-form', '-fno-second-underscore', '-Wall', '-g', '-fno-second-underscore', '-fPIC', '-O3', '-funroll-loops']
linker_so       = ['/usr/bin/gfortran', '-Wall', '-g', '-Wall', '-g', '-shared']
archiver        = ['/usr/bin/gfortran', '-cr']
ranlib          = ['/usr/bin/gfortran']
linker_exe      = ['/usr/bin/gfortran', '-Wall', '-Wall']
version         = LooseVersion ('9')
libraries       = ['gfortran']
library_dirs    = ['/usr/lib/gcc/x86_64-linux-gnu/9', '/usr/lib/gcc/x86_64-linux-gnu/9']
object_switch   = '-o '
compile_switch  = '-c'
include_dirs    = ['/usr/include/python3.8']

Pyccel:

gfortran -O3 -fPIC -c -I /home/emily/Code/test/tmp/__pyccel__ /home/emily/Code/test/tmp/__pyccel__/qr.f90 -o /home/emily/Code/test/tmp/__pyccel__/qr.o -J /home/emily/Code/test/tmp/__pyccel__
gfortran -O3 -fPIC -c -I /home/emily/Code/test/tmp/__pyccel__ /home/emily/Code/test/tmp/__pyccel__/bind_c_qr.f90 -o /home/emily/Code/test/tmp/__pyccel__/bind_c_qr.o -J /home/emily/Code/test/tmp/__pyccel__
gcc -O3 -fPIC -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -pthread -c -I /home/emily/Code/test/tmp/__pyccel__ -I /usr/include/python3.8 -I /home/emily/.local/lib/python3.8/site-packages/numpy/core/include /home/emily/Code/test/tmp/__pyccel__/qr_wrapper.c -o /home/emily/Code/test/tmp/__pyccel__/qr_wrapper.o
gfortran -shared -O3 -fPIC -I /home/emily/Code/test/tmp/__pyccel__ -I /usr/include/python3.8 -I /home/emily/.local/lib/python3.8/site-packages/numpy/core/include -L /usr/lib -L /usr/lib/python3.8/config-3.8-x86_64-linux-gnu -Wl,-rpath /usr/lib -Wl,-rpath /usr/lib/python3.8/config-3.8-x86_64-linux-gnu -lm -lpython3.8 /home/emily/Code/test/tmp/__pyccel__/qr.o /home/emily/Code/test/tmp/__pyccel__/bind_c_qr.o /home/emily/Code/test/tmp/__pyccel__/qr_wrapper.o -o /home/emily/Code/test/tmp/__pyccel__/qr.cpython-38-x86_64-linux-gnu.so
> Shared library has been created: /home/emily/Code/test/tmp/qr.cpython-38-x86_64-linux-gnu.so

(or see $PYCCEL_DIR/compilers/gfortran.json)

I can add the -funroll-loops flag to PR #938 so that the json file is updated with the new release. Or do you think it is better to give it its own PR, @saidctb?

@EmilyBourne (Member)

@mohamedlaminebabou

but in the benchmark document we find that Pyccel is the fastest!?

According to your document, the fastest tool is problem (and compiler) dependent. Assuming you are using a GCC compiler, Numba and Pythran are both faster in the Laplace case.
I would also beware of trusting this document too much. I believe it is at least 2 years old, and Pyccel has evolved quite a lot since then. I would hope we have not degraded the performance, but it is not impossible, especially for anything which relies on compiler options, as the compilation system has been completely rewritten since those tests. When those tests were done we were using f2py to interface between our Fortran code and Python.

@EmilyBourne (Member)

@mohamedlaminebabou Could you redraw your figure using the branch ebourne_mac_compile? I have added -funroll-loops to the release flags, so this should close the gap between Pyccel and f2py.

@saidctb (Member) commented Aug 19, 2021

@EmilyBourne Can you test the Pyccel version with the flags -O3 -march=native -ffast-math -funroll-loops?
On my computer it's 3 times slower than the f2py version, but if I inline the function in the bind_c_qr.f90 file, I get the same performance.

@saidctb (Member) commented Aug 19, 2021

@mohamedlaminebabou You are using the flag fastmath=True; you need to do the same for the Pyccel version by adding the flag -ffast-math.
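
For instance, reusing the flag syntax already shown above:

pyccel qr.py --flags="-O3 -funroll-loops -ffast-math" --verbose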

@EmilyBourne (Member)

@saidctb

can you test the pyccel version with the flags -O3 -march=native -ffast-math -funroll-loops.

On my computer I get:

Optimisation flags                                    Time for 10 runs (s)
Default master (-O3)                                  10.1
Default ebourne_mac_compile (-O3 -funroll-loops)       3.2
-O3 -funroll-loops -ffast-math                         2.5
-O3 -funroll-loops -march=native                       2.8
-O3 -funroll-loops -march=native -ffast-math           1.6

f2py takes approximately 3.2s

@saidctb (Member) commented Aug 19, 2021

@EmilyBourne This is what I have for the flags -O3 -funroll-loops -march=native -ffast-math, for 100 runs:

pyccel version:   30.89082655403763
inlined version:   9.11238155211322
f2py version:      9.146705145947635

@EmilyBourne (Member)

@saidctb

In my computer, it's 3 times slower than the f2py version, but if I inline the function in the bind_c_qr.f90 file , I get the same performance

Is that on the master branch? It's strange that inlining the function would make a difference. The function call itself shouldn't be that costly. The only difference that I can think of is from the declaration of A which uses f70 notation in the binding (known size).

Did you do the inlining manually? Did you copy the function into the bind_c file or did you rewrite the function using n0_A etc?
I don't see any improvement on my computer copying the function into the bind_c file manually.

What compiler are you using? Mine is GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

@saidctb (Member) commented Aug 19, 2021

Yes, I did it manually by replacing the function call with the body of the function.
I have gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04).

@EmilyBourne (Member)

I have gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04).

I can reproduce what you are seeing with this compiler. The behaviour appears to have been improved in gcc between 7.5 and 9.3, although I haven't been able to pinpoint when from the changelog.

@mohamedlaminebabou (Author) commented Aug 19, 2021

@saidctb @EmilyBourne Thanks for your quick responses. Exactly, @saidctb: working with the option '-Ofast' and the flags "-O3 -funroll-loops -march=native -ffast-math", we indeed find that f2py and Pyccel both have the same speed, but Numba fails to keep up! The code used and system info:

import numpy as np
import timeit
from qr import qr_mgs
from qr_f2py import mgs
from numba_qr import qr_numba


def qr_pyc():
    n, p = A.shape
    q = np.empty((n, p), order='F')
    r = np.empty((p, p), order='F')
    qr_mgs(A, q, r)
    return q, r

def qr_fpy():
    q, r = mgs(A)
    return q, r

def qr_nby():
    n, p = A.shape
    q = np.empty((n, p), order='F')
    r = np.empty((p, p), order='F')
    qr_numba(A, q, r)
    return q, r

'''for f in [qr_pyc, qr_fpy, qr_nby]:
    q, r = f()
    print(f.__name__)
    print(np.linalg.norm(A - q@r))
    timer = timeit.Timer(f)
    print(timer.timeit(100))'''


import matplotlib.pyplot as plt

Nrange = (np.arange(100, 2000, 100)).astype(int)

t_fortran_qr = []
t_pycqr = []
t_numba = []

for N in Nrange:
    A = np.random.rand(N, N)

    t1 = timeit.Timer(qr_pyc)
    t2 = timeit.Timer(qr_fpy)
    t3 = timeit.Timer(qr_nby)

    t_pycqr.append(t1.timeit(10))
    t_fortran_qr.append(t2.timeit(10))
    t_numba.append(t3.timeit(10))

plt.loglog(Nrange, t_numba, marker='o', label='numba')
plt.loglog(Nrange, t_fortran_qr, marker=7, label='fortran-f2py')
plt.loglog(Nrange, t_pycqr, marker=3, label='pyccel')

plt.legend(loc='lower right')
plt.xlabel('Number of points')
plt.ylabel('Execution Time (s)')
plt.title('Compare the speed of f2py, pyccel, numba - babou.')
plt.savefig('compare2.png')

# Commands used to build the two extensions:
f2py3 -c --f90flags='-O3 -funroll-loops -march=native -ffast-math' qr_f2py.f90 -m qr_f2py
pyccel qr.py --flags="-O3 -funroll-loops -march=native -ffast-math" --verbose

[plot: compare2.png — speed comparison of f2py, pyccel and numba]
System info:
[screenshot: info_system]
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

@mohamedlaminebabou (Author) commented Aug 19, 2021

@saidctb When I compile with the previous commands in a Jupyter notebook, I get a permission problem when compiling.
It would also be very useful to be able to pass flags to the epyccel function.
PermissionError Traceback (most recent call last)
in
----> 1 from pyccel.epyccel import epyccel
2 rb_f90 = epyccel(uppertri)

/usr/local/lib/python3.8/dist-packages/pyccel/epyccel.py in
17 from importlib.machinery import ExtensionFileLoader
18
---> 19 from pyccel.codegen.pipeline import execute_pyccel
20 from pyccel.errors.errors import PyccelError, ErrorsMode
21

/usr/local/lib/python3.8/dist-packages/pyccel/codegen/pipeline.py in
22
23 from .compiling.basic import CompileObj
---> 24 from .compiling.compilers import Compiler
25
26 __all__ = ['execute_pyccel']

/usr/local/lib/python3.8/dist-packages/pyccel/codegen/compiling/compilers.py in
35
36 compilers_folder = os.path.join(os.path.dirname(__file__),'..','..','compilers')
---> 37 with FileLock(compilers_folder+'.lock'):
38 # TODO: Add an additional search location for user provided compiler files
39 available_compilers = {f[:-5]:json.load(open(os.path.join(compilers_folder,f))) for f in os.listdir(compilers_folder)

/usr/local/lib/python3.8/dist-packages/filelock.py in __enter__(self)
321
322 def __enter__(self):
--> 323 self.acquire()
324 return self
325

/usr/local/lib/python3.8/dist-packages/filelock.py in acquire(self, timeout, poll_intervall)
269 if not self.is_locked:
270 logger().debug('Attempting to acquire lock %s on %s', lock_id, lock_filename)
--> 271 self._acquire()
272
273 if self.is_locked:

/usr/local/lib/python3.8/dist-packages/filelock.py in _acquire(self)
382 def _acquire(self):
383 open_mode = os.O_RDWR | os.O_CREAT | os.O_TRUNC
--> 384 fd = os.open(self._lock_file, open_mode)
385
386 try:

PermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.8/dist-packages/pyccel/codegen/compiling/../../compilers.lock'

""

@EmilyBourne (Member)

@mohamedlaminebabou

When I compile with the previous commands in a Jupyter notebook, I get a permission problem when compiling.

Please could you create an issue for this. It needs fixing and is unrelated to the current issue.
We currently create the json files for the compilers in the same location as Pyccel itself. The lock is there to ensure that nothing goes wrong when running Pyccel in parallel. However, if Pyccel is installed system-wide then in certain conditions you will not have permission to modify files in this location. A temporary fix would be to install Pyccel locally (pip3 install --user pyccel), but long-term we need to pick a more accessible folder (e.g. ~/.pyccel/).

It would also be very useful to be able to pass flags to the epyccel function.

You can do this using the fflags argument. See this file for an example.
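
A minimal sketch, assuming fflags accepts the same string of compiler flags as the command line:

from pyccel.epyccel import epyccel
qr_f90 = epyccel(qr_mgs, fflags="-O3 -funroll-loops -march=native -ffast-math")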

yguclu pushed a commit that referenced this issue Aug 24, 2021
Collect Python library from correct key to fix Mac compiling. Fixes #937 
Add -funroll-loops to release flags. Fixes #942 
Update patch number so compile files are automatically regenerated