common:
- loop unrolling in TRMV has been enabled again.
- A domain error in the thread workload distribution for SYRK
has been fixed.
- gmake builds will now automatically add -fPIC to the build
options if the platform requires it.
- a pthreads key leakage (and associated crash on dlclose) in
the USE_TLS codepath was fixed.
- building of the utest cases on systems that do not provide
an implementation of complex.h was fixed.
x86_64:
- the SkylakeX code was changed to compile on OSX.
- unwanted application of the -march=skylake-avx512 option
to the common code parts of a DYNAMIC_ARCH build was fixed.
- improved performance of SGEMM for small workloads on Skylake X.
- performance of SGEMM and DGEMM was improved on Haswell.
ARMV8:
- a configuration error that broke the CNRM2 kernel was corrected.
- compilation of the GEMM kernels with CMAKE was fixed.
- DYNAMIC_ARCH builds are now available with CMAKE as well.
- using CMAKE for cross-compilation to the new cpu TARGETs
introduced in 0.3.4 now works.
POWER:
- a problem in cpu autodetection for AIX has been corrected.
Version 0.3.4
common:
- the new, experimental thread-local memory allocation had
inadvertently been left enabled for gmake builds in 0.3.3
despite the announcement. It is now disabled by default, and
single-threaded builds will keep using the old allocator even
if the USE_TLS option is turned on.
- OpenBLAS will now provide enough buffer space for at least 50
threads by default.
- The output of openblas_get_config() now contains the version
number.
- A serious thread safety bug in GEMV operation with small M and
large N size has been fixed.
- The code will now automatically call blas_thread_init after a
fork if needed before handling a call to openblas_set_num_threads.
- Accesses to parallelized level3 functions from multiple callers
are now serialized to avoid thread races (unless using OpenMP).
This should provide better performance than the known-threadsafe
(but non-default) USE_SIMPLE_THREADED_LEVEL3 option.
- When building LAPACK with gfortran, -frecursive is now (again)
enabled by default to ensure correct behaviour.
- The OpenBLAS version of cblas.h now supports both CBLAS_ORDER and
CBLAS_LAYOUT as the name of the matrix row/column order option.
- Externally set LDFLAGS are now passed through to the final compile/link
steps to facilitate setting platform-specific linker flags.
- A potential race condition during the build of LAPACK (that would
usually manifest itself as a failure to build TESTING/MATGEN) has been
fixed.
- xHEMV has been changed to stay single-threaded for small input sizes
where the overhead of multithreading exceeds any possible gains.
- CSWAP and ZSWAP have been limited to a single thread except on ARMV8 or
ThunderX hardware with sizable input.
- Linker flags for the PGI compiler have been updated.
- Behaviour of AXPY with zero increments is now handled in the C interface,
correcting the result on at least Intel Atom.
- The result matrix from calling SGELSS with an all-zero input matrix is
now zeroed completely.
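
The AXPY item in this list concerns increments of zero. As a rough illustration only, the sketch below models the strided loop of the reference BLAS routine in pure Python; it is not OpenBLAS code, and the function name `axpy_reference` is made up for this example. When both increments are zero, every iteration reads and updates the same element, so `y[0]` accumulates `alpha * x[0]` a total of `n` times.

```python
def axpy_reference(n, alpha, x, incx, y, incy):
    """Model of y := alpha*x + y with BLAS-style strides (illustrative only)."""
    if n <= 0:
        return y
    # As in reference BLAS, a negative stride starts at the far end of the vector.
    ix = (1 - n) * incx if incx < 0 else 0
    iy = (1 - n) * incy if incy < 0 else 0
    for _ in range(n):
        y[iy] += alpha * x[ix]
        ix += incx
        iy += incy
    return y


# Both increments zero: the single element y[0] is updated n = 3 times.
print(axpy_reference(3, 2.0, [1.0], 0, [10.0], 0))
```

An optimized kernel that assumes distinct elements per iteration can get this case wrong, which is presumably why the case is now dealt with before a kernel is called.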
x86_64:
- Autodetection of AMD Ryzen2 has been fixed (again).
- CMAKE builds now support labeling of an INTERFACE64=1 build of
the library with the _64 suffix.
- AVX512 version of DGEMM has been added and the AVX512 SGEMM kernel
has been sped up by rewriting with C intrinsics
- Fixed compilation on RHEL5/CENTOS5 (issue with typename __WAIT_STATUS)
POWER:
- added support for building on AIX (with gcc and GNU tools from AIX Toolbox).
- CPU type detection has been implemented for AIX.
- CPU type detection has been fixed for NETBSD.
MIPS64:
- AXPY on LOONGSON3A has been corrected to pass "zero increment" utest.
- DSDOT on LOONGSON3A has been fixed.
- the SGEMM microkernel has been hardened against potential data loss.
ARMV8:
- DYNAMIC_ARCH support is now available for 64bit ARM
- cross-compiling for ARMV8 under iOS now works.
- cpu-specific code has been rearranged to make better use of both
hardware commonalities and model-specific compiler optimizations.
- XGENE1 has been removed as a TARGET, superseded by the improved generic
ARMV8 support.
ARMV7:
- Older assembly mnemonics have been converted to UAL form to allow
building with clang 7.0.
- Cross compiling LAPACKE for Android has been fixed again (broken by
update to LAPACK 3.7.0 some while ago).
Version 0.3.3
common:
- thread memory allocation has been switched back to the method
used before version 0.3.1 due to unexpected problems caused by
the new code under some circumstances. A new compile-time option
USE_TLS has been added to allow enabling the new code instead,
and it is hoped that this can become the default again in the next version.
- LAPACK PR272 has been integrated, which fixes spurious errors
in DSYEVR and related functions caused by missing conversion
from ILAENV to ILAENV_2STAGE in several _2stage routines.
- the cmake-generated OpenBLASConfig.cmake now uses correct case
for the name of the library
- added support for Haiku OS
x86_64:
- added AVX512 implementations of SDOT, DDOT, SAXPY, DAXPY,
DSCAL, DGEMVN and DSYMVL
- added a workaround for a cygwin issue that prevented compilation
of AVX512 code
IBM Z:
- added autodetection of Z14
- fixed TRMM errors in the generic target
Version 0.3.2
common:
- fixes for regressions caused by the rewrite of the thread initialization code in 0.3.1
x86_64:
- added autodetection of AMD Ryzen 2
- fixed build with older versions of MSVC
POWER:
- fixed cpu autodetection for the BSDs
MIPS64:
- fixed utest errors in AXPY, DSDOT, ROT and SWAP
Version 0.3.1
common:
- rewritten thread initialization code with significantly reduced overhead
- added CBLAS interfaces to the IxAMIN BLAS extension functions
- fixed the lapack-test target
- CMAKE builds now create an OpenBLASConfig.cmake file
- ZAXPY now uses a single thread for small input sizes
- the LAPACK code was updated from Reference-LAPACK/lapack#253
POWER:
- corrected CROT and ZROT behaviour with zero INC_X
ARMV7:
- corrected xDOT behaviour with zero INC_X or INC_Y
x86_64:
- retired some older targets of DYNAMIC_ARCH builds to a new option DYNAMIC_OLDER;
this affects PENRYN, DUNNINGTON, OPTERON, OPTERON_SSE3, BOBCAT, ATOM and NANO
(which will still be supported via the slower PRESCOTT kernels when this option is not set)
- added an option DYNAMIC_LIST that (used in conjunction with DYNAMIC_ARCH) allows
specifying the list of x86_64 targets to include. Any target not on the list will be supported by
the Sandybridge or Nehalem kernels if available, or by Prescott.
- improved SWITCH_RATIO on Haswell for increased GEMM throughput
- added initial support for Intel Skylake X, including an AVX512 SGEMM kernel
- added autodetection of Intel Cannon Lake series as Skylake X
- added a default L2 cache size for hypervisors that return zero here (Chromebook)
- fixed a name clash with recent Windows10 headers that broke the build with (at least)
recent mingw from MSYS2
- fixed a link error in mixed clang/gfortran builds with OpenMP
- updated the OSX deployment target to 10.8
- switched on parallel make for builds on MS Windows by default
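
The DYNAMIC_LIST fallback described in this list can be pictured with a small model. The sketch below is an assumption-laden illustration in Python, not the library's actual dispatch code: the function name `select_kernel` is hypothetical, the fallback order simply follows the wording of the item (Sandybridge, then Nehalem, then Prescott), and it ignores the check a real implementation would need for whether the running cpu supports the instructions of the fallback kernel.

```python
# Fallback order as worded in the release note (an assumption of this sketch).
FALLBACKS = ("SANDYBRIDGE", "NEHALEM")


def select_kernel(detected_cpu, dynamic_list):
    """Pick the kernel set for detected_cpu given the targets that were built."""
    built = {name.upper() for name in dynamic_list}
    cpu = detected_cpu.upper()
    if cpu in built:
        return cpu              # the cpu has its own kernels in this build
    for candidate in FALLBACKS:
        if candidate in built:
            return candidate    # first fallback that was included in the list
    return "PRESCOTT"           # last resort named in the release note


print(select_kernel("HASWELL", ["HASWELL", "SKYLAKEX"]))
print(select_kernel("ZEN", ["HASWELL", "NEHALEM"]))
print(select_kernel("ATOM", ["HASWELL"]))
```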
x86:
- fixed SSWAP and DSWAP behaviour with zero INC_X and INC_Y
Version 0.3.0
common:
* fixed some more thread race and locking bugs
* added preliminary support for calling an OpenMP build of the library from multiple threads
* removed performance impact of thread locks added in 0.2.20 on OpenMP code
* general code cleanup
* optimized DSDOT implementation
* improved thread distribution for GEMM
* corrected IMATCOPY/OMATCOPY implementation
* fixed out-of-bounds accesses in the multithreaded xBMV/xPMV and SYMV implementations
* cmake build improvements
* pkgconfig file now contains build options
* openblas_get_config() now reports USE_OPENMP and NUM_THREADS settings used for the build
* corrections and improvements for systems with more than 64 cpus
* LAPACK code updated to 3.8.0 including later fixes
* added ReLAPACK, a recursive implementation of several LAPACK functions
* Rewrote ROTMG to handle cases that the netlib code failed to address
* Disabled (broken) multithreading code for xTRMV
* corrected prototypes of complex CBLAS functions to make our cblas.h match the generally accepted standard
* shared memory access failures on startup are now handled more gracefully
* restored utests from earlier releases (and made them pass on all affected systems)
SPARC:
* several fixes for cpu autodetection
POWER:
* corrected vector register overwriting in several Power8 kernels
* optimized additional BLAS functions
ARM:
* added support for CortexA53 and A72
* added autodetection for ThunderX2T99
* made most optimized kernels the default for generic ARMv8 targets
x86_64:
* parallelized DDOT kernel for Haswell
* changed alignment directives in assembly kernels to boost performance on OSX
* fixed register handling in the GEMV microkernels (bug exposed by gcc7)
* added support for building on OpenBSD and Dragonfly
* updated compiler options to work with Intel release 2018
* support fully optimized build with clang/flang on Microsoft Windows
* fixed building on AIX
IBM Z:
* added optimized BLAS 1/2 functions
MIPS:
* fixed cpu autodetection helper code
* added mips32 1004K cpu (Mediatek MT7621 and similar SoC)
* added mips64 I6500 cpu
Version 0.2.20
24-Jul-2017
common:
* Improved CMake support
* Fixed several thread race and locking bugs
* Fixed default LAPACK optimization level
* Updated LAPACK to 3.7.0
* Added ReLAPACK (https://github.com/HPAC/ReLAPACK), make BUILD_RELAPACK=1
POWER:
* Optimizations for Power9
* Fixed several Power8 assembly bugs
ARM:
* New optimized Vulcan and ThunderX2T99 targets
* Support for ARMV7 SOFT_FP ABI (make ARM_SOFTFP_ABI=1)
* Detect all cpu cores including offline ones
* Fix compilation with CLANG
* Support building a shared library for Android
MIPS:
* Fixed several threading issues
* Fix compilation with CLANG
x86_64:
* Detect Intel Bay Trail and Apollo Lake
* Detect Intel Sky Lake and Kaby Lake
* Detect Intel Knights Landing
* Detect AMD A8, A10, A12 and Ryzen
* Support 64bit builds with Visual Studio
* Fix building with Intel and PGI compilers
* Fix building with MINGW and TDM-GCC
* Fix cmake builds for Haswell and related cpus
* Fix building for Sandybridge with CLANG 3.9
* Add support for the FLANG compiler
IBM Z:
* New target z13 with BLAS3 optimizations
[OpenBLAS 0.2.20 version.zip](https://sourceforge.net/projects/openblas/files/v0.2.20/OpenBLAS%200.2.20%20version.zip/download)
Version 0.2.19
1-Sep-2016
common:
* Improved cross compiling.
* Fix the bug on musl libc.
POWER:
* Optimize BLAS on Power8
* Fixed Julia+OpenBLAS bugs on Power8
MIPS:
* Optimize BLAS on MIPS P5600 and I6400 (Thanks, Shivraj Patil, Kaustubh Raste)
ARM:
* Improved performance on ARM Cortex-A57. (Thanks, Ashwin Sekhar T K)
[OpenBLAS 0.2.19 version.zip](https://sourceforge.net/projects/openblas/files/v0.2.19/OpenBLAS%200.2.19%20version.zip/download)
Version 0.2.18
12-Apr-2016
common:
- If the MAKE_NB_JOBS flag is set to a value less than or equal to zero, make will run without -j.
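
A minimal sketch of the MAKE_NB_JOBS rule, written in Python purely for illustration: the function name `make_jobs_flag` is hypothetical, and the handling of positive values (passing them on as `-jN`) is an assumption, since the note only states what happens for values of zero or below.

```python
def make_jobs_flag(make_nb_jobs):
    """Return the -j argument implied by a MAKE_NB_JOBS value (illustrative only)."""
    if make_nb_jobs <= 0:
        return ""                   # zero or negative: make runs without -j
    return f"-j{make_nb_jobs}"      # assumed behaviour for positive values


print(repr(make_jobs_flag(0)))
print(repr(make_jobs_flag(4)))
```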
x86/x86_64:
- Support building Visual Studio static library. (#813, Thanks, theoractice)
- Fix bugs to pass buildbot CI tests (http://build.openblas.net)
ARM:
- Provide DGEMM 8x4 kernel for Cortex-A57 (Thanks, Ashwin Sekhar T K)
POWER:
- Optimize S and C BLAS3 on Power8
- Optimize BLAS2/1 on Power8
[OpenBLAS 0.2.18 version.zip](https://sourceforge.net/projects/openblas/files/v0.2.18/OpenBLAS%200.2.18%20version.zip/download)
Version 0.2.17
20-Mar-2016
common:
- Enable BUILD_LAPACK_DEPRECATED=1 by default.
[OpenBLAS 0.2.17 version.zip](https://sourceforge.net/projects/openblas/files/v0.2.17/OpenBLAS%200.2.17%20version.zip/download)