@martin-frbg martin-frbg released this Dec 31, 2018

common:

  • loop unrolling in TRMV has been enabled again.
  • A domain error in the thread workload distribution for SYRK
    has been fixed.
  • gmake builds will now automatically add -fPIC to the build
    options if the platform requires it.
  • a pthreads key leakage (and associated crash on dlclose) in
    the USE_TLS codepath was fixed.
  • building of the utest cases on systems that do not provide
    an implementation of complex.h was fixed.

x86_64:

  • the SkylakeX code was changed to compile on OSX.
  • unwanted application of the -march=skylake-avx512 option
    to the common code parts of a DYNAMIC_ARCH build was fixed.
  • improved performance of SGEMM for small workloads on Skylake X.
  • performance of SGEMM and DGEMM was improved on Haswell.

ARMV8:

  • a configuration error that broke the CNRM2 kernel was corrected.
  • compilation of the GEMM kernels with CMAKE was fixed.
  • DYNAMIC_ARCH builds are now available with CMAKE as well.
  • using CMAKE for cross-compilation to the new cpu TARGETs
    introduced in 0.3.4 now works.

POWER:

  • a problem in cpu autodetection for AIX has been corrected.

@martin-frbg martin-frbg released this Dec 2, 2018 · 78 commits to release-0.3.0 since this release

common:

  • the new, experimental thread-local memory allocation had
    inadvertently been left enabled for gmake builds in 0.3.3
    despite the announcement. It is now disabled by default, and
    single-threaded builds will keep using the old allocator even
    if the USE_TLS option is turned on.
  • OpenBLAS will now provide enough buffer space for at least 50
    threads by default.
  • The output of openblas_get_config() now contains the version
    number.
  • A serious thread safety bug in GEMV operation with small M and
    large N size has been fixed.
  • The code will now automatically call blas_thread_init after a
    fork if needed before handling a call to openblas_set_num_threads
  • Accesses to parallelized level3 functions from multiple callers
    are now serialized to avoid thread races (unless using OpenMP).
    This should provide better performance than the known-threadsafe
    (but non-default) USE_SIMPLE_THREADED_LEVEL3 option.
  • When building LAPACK with gfortran, -frecursive is now (again)
    enabled by default to ensure correct behaviour.
  • The OpenBLAS version of cblas.h now supports both CBLAS_ORDER and
    CBLAS_LAYOUT as the name of the matrix row/column order option.
  • Externally set LDFLAGS are now passed through to the final compile/link
    steps to facilitate setting platform-specific linker flags.
  • A potential race condition during the build of LAPACK (that would
    usually manifest itself as a failure to build TESTING/MATGEN) has been
    fixed.
  • xHEMV has been changed to stay single-threaded for small input sizes
    where the overhead of multithreading exceeds any possible gains
  • CSWAP and ZSWAP have been limited to a single thread except on ARMV8 or
    ThunderX hardware with sizable input.
  • Linker flags for the PGI compiler have been updated
  • Behaviour of AXPY with zero increments is now handled in the C interface,
    correcting the result on at least Intel Atom.
  • The result matrix from calling SGELSS with an all-zero input matrix is
    now zeroed completely.
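
The zero-increment AXPY case mentioned above can be modelled with a small reference-style loop. This is only an illustrative sketch of the strided update convention used by the reference BLAS, written in plain Python; the function name `axpy` and its list-based signature are assumptions made for this example and are not the OpenBLAS C interface.

```python
def axpy(n, alpha, x, incx, y, incy):
    """Reference-style AXPY: y := alpha * x + y over n strided elements.

    Illustrative model only; indices are 0-based and x, y are Python lists.
    """
    if n <= 0 or alpha == 0:
        return y
    # Negative increments start from the far end, as in the reference loop.
    ix = (1 - n) * incx if incx < 0 else 0
    iy = (1 - n) * incy if incy < 0 else 0
    for _ in range(n):
        y[iy] += alpha * x[ix]
        ix += incx
        iy += incy
    return y


# incx == 0: every one of the n updates reads x[0].
print(axpy(3, 2.0, [1.0, 5.0, 9.0], 0, [0.0, 0.0, 0.0], 1))  # [2.0, 2.0, 2.0]
# incy == 0: all n updates accumulate into y[0].
print(axpy(3, 2.0, [1.0, 2.0, 3.0], 1, [0.0], 0))  # [12.0]
```

An optimized kernel that assumes non-zero strides can plausibly disagree with this loop, which is one reason to handle the special case before dispatching to a kernel.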

x86_64:

  • Autodetection of AMD Ryzen2 has been fixed (again).
  • CMAKE builds now support labeling of an INTERFACE64=1 build of
    the library with the _64 suffix.
  • AVX512 version of DGEMM has been added and the AVX512 SGEMM kernel
    has been sped up by rewriting with C intrinsics
  • Fixed compilation on RHEL5/CENTOS5 (issue with typename __WAIT_STATUS)

POWER:

  • added support for building on AIX (with gcc and GNU tools from AIX Toolbox).
  • CPU type detection has been implemented for AIX.
  • CPU type detection has been fixed for NETBSD.

MIPS64:

  • AXPY on LOONGSON3A has been corrected to pass "zero increment" utest.
  • DSDOT on LOONGSON3A has been fixed.
  • the SGEMM microkernel has been hardened against potential data loss.

ARMV8:

  • DYNAMIC_ARCH support is now available for 64bit ARM
  • cross-compiling for ARMV8 under iOS now works.
  • cpu-specific code has been rearranged to make better use of both
    hardware commonalities and model-specific compiler optimizations.
  • XGENE1 has been removed as a TARGET, superseded by the improved generic
    ARMV8 support.

ARMV7:

  • Older assembly mnemonics have been converted to UAL form to allow
    building with clang 7.0
  • Cross compiling LAPACKE for Android has been fixed again (broken by
    update to LAPACK 3.7.0 some while ago).

@martin-frbg martin-frbg released this Aug 30, 2018 · 253 commits to release-0.3.0 since this release

common:

  • thread memory allocation has been switched back to the method
    used before version 0.3.1 due to unexpected problems caused by
    the new code under some circumstances. A new compile-time option
    USE_TLS has been added to allow enabling the new code instead,
    and it is hoped that this can become the default again in the next version.
  • LAPACK PR272 has been integrated, which fixes spurious errors
    in DSYEVR and related functions caused by missing conversion
    from ILAENV to ILAENV_2STAGE in several _2stage routines.
  • the cmake-generated OpenBLASConfig.cmake now uses correct case
    for the name of the library
  • added support for Haiku OS

x86_64:

  • added AVX512 implementations of SDOT, DDOT, SAXPY, DAXPY,
    DSCAL, DGEMVN and DSYMVL
  • added a workaround for a cygwin issue that prevented compilation
    of AVX512 code

IBM Z:

  • added autodetection of Z14
  • fixed TRMM errors in the generic target

@martin-frbg martin-frbg released this Jul 30, 2018 · 312 commits to release-0.3.0 since this release

common:

  • fixes for regressions caused by the rewrite of the thread initialization code in 0.3.1

x86_64:

  • added autodetection of AMD Ryzen 2
  • fixed build with older versions of MSVC

Power:

  • fixed cpu autodetection for the BSDs

mips64:

  • fixed utest errors in AXPY, DSDOT, ROT and SWAP

@martin-frbg martin-frbg released this Jul 1, 2018 · 348 commits to release-0.3.0 since this release

common:

  • rewritten thread initialization code with significantly reduced overhead
  • added CBLAS interfaces to the IxAMIN BLAS extension functions
  • fixed the lapack-test target
  • CMAKE builds now create an OpenBLASConfig.cmake file
  • ZAXPY now uses a single thread for small input sizes
  • the LAPACK code was updated from Reference-LAPACK/lapack#253
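
The idea behind keeping ZAXPY single-threaded for small inputs can be sketched as a simple cutoff check. The function name `threads_for` and the cutoff value of 10000 elements are hypothetical choices for illustration; the release notes do not state the actual threshold.

```python
def threads_for(n, available_threads, small_n_cutoff=10000):
    """Return how many threads to use for a vector operation of length n.

    small_n_cutoff is a hypothetical value: below it, the cost of waking
    worker threads is assumed to outweigh any parallel speedup.
    """
    if available_threads <= 1 or n <= small_n_cutoff:
        return 1
    return available_threads


print(threads_for(500, 8))        # small input stays single-threaded
print(threads_for(1_000_000, 8))  # large input may use all threads
```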

POWER:

  • corrected CROT and ZROT behaviour with zero INC_X

ARMV7:

  • corrected xDOT behaviour with zero INC_X or INC_Y

x86_64:

  • retired some older targets of DYNAMIC_ARCH builds to a new option DYNAMIC_OLDER.
    This affects PENRYN, DUNNINGTON, OPTERON, OPTERON_SSE3, BOBCAT, ATOM and NANO
    (which will still be supported via the slower PRESCOTT kernels when this option is not set)
  • added an option DYNAMIC_LIST that (used in conjunction with DYNAMIC_ARCH) allows
    specifying the list of x86_64 targets to include. Any target not on the list will be supported by
    the Sandybridge or Nehalem kernels if available, or by Prescott.
  • improved SWITCH_RATIO on Haswell for increased GEMM throughput
  • added initial support for Intel Skylake X, including an AVX512 SGEMM kernel
  • added autodetection of Intel Cannon Lake series as Skylake X
  • added a default L2 cache size for hypervisors that return zero here (Chromebook)
  • fixed a name clash with recent Windows10 headers that broke the build with (at least)
    recent mingw from MSYS2
  • fixed a link error in mixed clang/gfortran builds with OpenMP
  • updated the OSX deployment target to 10.8
  • switched on parallel make for builds on MS Windows by default
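
The DYNAMIC_LIST fallback rule described above might be modelled roughly as follows. This is a sketch under stated assumptions, not the library's runtime dispatch code: the function `select_kernel`, the `included` set (targets named in DYNAMIC_LIST) and the `runnable` set (fallback kernels the detected CPU is able to execute) are all hypothetical names, and "available" is interpreted here as "built into the library and runnable on this CPU".

```python
# Fallback order taken from the release note: Sandybridge, then Nehalem.
FALLBACK_ORDER = ("SANDYBRIDGE", "NEHALEM")


def select_kernel(detected, included, runnable):
    """Pick a kernel set for the detected CPU target (illustrative only)."""
    if detected in included:
        return detected
    for candidate in FALLBACK_ORDER:
        # Assumption: a fallback qualifies only if it was built and the CPU can run it.
        if candidate in included and candidate in runnable:
            return candidate
    # Prescott is the last resort named in the release note.
    return "PRESCOTT"


print(select_kernel("HASWELL", {"HASWELL", "ZEN"}, {"SANDYBRIDGE", "NEHALEM"}))
print(select_kernel("SKYLAKEX", {"HASWELL", "SANDYBRIDGE"}, {"SANDYBRIDGE", "NEHALEM"}))
print(select_kernel("CORE2", {"HASWELL", "SANDYBRIDGE", "NEHALEM"}, set()))
```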

x86:

  • fixed SSWAP and DSWAP behaviour with zero INC_X and INC_Y

@martin-frbg martin-frbg released this May 23, 2018 · 479 commits to release-0.3.0 since this release

common:

* fixed some more thread race and locking bugs
* added preliminary support for calling an OpenMP build of the library from multiple threads
* removed performance impact of thread locks added in 0.2.20 on OpenMP code
* general code cleanup 
* optimized DSDOT implementation
* improved thread distribution for GEMM
* corrected IMATCOPY/OMATCOPY implementation
* fixed out-of-bounds accesses in the multithreaded xBMV/xPMV and SYMV implementations
* cmake build improvements
* pkgconfig file now contains build options
* openblas_get_config() now reports USE_OPENMP and NUM_THREADS settings used for the build
* corrections and improvements for systems with more than 64 cpus
* LAPACK code updated to 3.8.0 including later fixes
* added ReLAPACK, a recursive implementation of several LAPACK functions
* Rewrote ROTMG to handle cases that the netlib code failed to address
* Disabled (broken) multithreading code for xTRMV
* corrected prototypes of complex CBLAS functions to make our cblas.h match the generally accepted standard
* shared memory access failures on startup are now handled more gracefully
* restored utests from earlier releases (and made them pass on all affected systems)

SPARC:

* several fixes for cpu autodetection

POWER:

* corrected vector register overwriting in several Power8 kernels
* optimized additional BLAS functions

ARM:

* added support for CortexA53 and A72 
* added autodetection for ThunderX2T99
* made most optimized kernels the default for generic ARMv8 targets 

x86_64:

* parallelized DDOT kernel for Haswell
* changed alignment directives in assembly kernels to boost performance on OSX
* fixed register handling in the GEMV microkernels (bug exposed by gcc7)
* added support for building on OpenBSD and Dragonfly 
* updated compiler options to work with Intel release 2018
* support fully optimized build with clang/flang on Microsoft Windows
* fixed building on AIX

IBM Z:

* added optimized BLAS 1/2 functions

MIPS:

* fixed cpu autodetection helper code
* added mips32 1004K cpu (Mediatek MT7621 and similar SoC)
* added mips64 I6500 cpu

@xianyi xianyi released this Jul 24, 2017 · 1084 commits to develop since this release

Version 0.2.20
24-Jul-2017

common:

    * Improved CMake support
    * Fixed several thread race and locking bugs
    * Fixed default LAPACK optimization level
    * Updated LAPACK to 3.7.0
    * Added ReLAPACK (https://github.com/HPAC/ReLAPACK), make BUILD_RELAPACK=1

POWER:

    * Optimizations for Power9
    * Fixed several Power8 assembly bugs

ARM:

    * New optimized Vulcan and ThunderX2T99 targets
    * Support for ARMV7 SOFT_FP ABI  (make ARM_SOFTFP_ABI=1)
    * Detect all cpu cores including offline ones
    * Fix compilation with CLANG
    * Support building a shared library for Android

MIPS:

    * Fixed several threading issues
    * Fix compilation with CLANG

x86_64:

    * Detect Intel Bay Trail and Apollo Lake
    * Detect Intel Sky Lake and Kaby Lake
    * Detect Intel Knights Landing
    * Detect AMD A8, A10, A12 and Ryzen
    * Support 64bit builds with Visual Studio
    * Fix building with Intel and PGI compilers
    * Fix building with MINGW and TDM-GCC
    * Fix cmake builds for Haswell and related cpus
    * Fix building for Sandybridge with CLANG 3.9
    * Add support for the FLANG compiler

IBM Z:

    * New target z13 with BLAS3 optimizations

[Download OpenBLAS](https://sourceforge.net/projects/openblas/files/v0.2.20/OpenBLAS 0.2.20 version.zip/download)

@xianyi xianyi released this Sep 1, 2016 · 1478 commits to develop since this release

Version 0.2.19
1-Sep-2016

common:

    * Improved cross compiling.
    * Fixed a bug on musl libc.

POWER:

    * Optimize BLAS on Power8
    * Fixed Julia+OpenBLAS bugs on Power8

MIPS:

    * Optimize BLAS on MIPS P5600 and I6400 (Thanks, Shivraj Patil, Kaustubh Raste)

ARM:

    * Improvements for ARM Cortex-A57. (Thanks, Ashwin Sekhar T K)

[Download OpenBLAS](https://sourceforge.net/projects/openblas/files/v0.2.19/OpenBLAS 0.2.19 version.zip/download)

@xianyi xianyi released this Apr 12, 2016 · 1589 commits to develop since this release

Version 0.2.18
12-Apr-2016

common:

  • If the MAKE_NB_JOBS flag is set to a value less than or equal to zero, make will be run without the -j option.
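
The MAKE_NB_JOBS rule can be pictured as a small decision about the make command line. The helper `make_command` below is hypothetical and only models the behaviour stated in the note; the assumption that a positive value becomes `-j N` goes beyond what the note itself says.

```python
def make_command(nb_jobs):
    """Build a make argument list from a MAKE_NB_JOBS-style value (illustrative)."""
    cmd = ["make"]
    # Zero or negative: leave out -j entirely, as described in the release note.
    if nb_jobs > 0:
        # Assumption: a positive value is passed through as the job count.
        cmd += ["-j", str(nb_jobs)]
    return cmd


print(make_command(0))   # ['make']
print(make_command(4))   # ['make', '-j', '4']
```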

x86/x86_64:

ARM:

  • Provide DGEMM 8x4 kernel for Cortex-A57 (Thanks, Ashwin Sekhar T K)

POWER:

  • Optimize S and C BLAS3 on Power8
  • Optimize BLAS2/1 on Power8

[Download OpenBLAS](https://sourceforge.net/projects/openblas/files/v0.2.18/OpenBLAS 0.2.18 version.zip/download)

@xianyi xianyi released this Mar 21, 2016 · 1663 commits to develop since this release

Version 0.2.17
20-Mar-2016

common:

  • Enable BUILD_LAPACK_DEPRECATED=1 by default.

[Download OpenBLAS](https://sourceforge.net/projects/openblas/files/v0.2.17/OpenBLAS 0.2.17 version.zip/download)
