diff --git a/CMakeLists.txt b/CMakeLists.txt
index e8b99e29e35b3..d6dd64c7bec1e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -8,7 +8,7 @@ project(taichi)
 
 SET(TI_VERSION_MAJOR 0)
 SET(TI_VERSION_MINOR 5)
-SET(TI_VERSION_PATCH 8)
+SET(TI_VERSION_PATCH 9)
 
 execute_process(
   WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
diff --git a/README.md b/README.md
index 14cf760c3368b..8475c9ad92472 100644
--- a/README.md
+++ b/README.md
@@ -34,6 +34,26 @@ python3 -m pip install taichi-nightly-cuda-10-1
 |**PyPI**|[![Build Status](https://travis-ci.com/yuanming-hu/taichi-wheels-test.svg?branch=master)](https://travis-ci.com/yuanming-hu/taichi-wheels-test)|[![Build Status](https://travis-ci.com/yuanming-hu/taichi-wheels-test.svg?branch=master)](https://travis-ci.com/yuanming-hu/taichi-wheels-test)|[![Build status](https://ci.appveyor.com/api/projects/status/39ar9wa8yd49je7o?svg=true)](https://ci.appveyor.com/project/IteratorAdvance/taichi-wheels-test)|
 
 ## Updates
+- (Mar 28, 2020) v0.5.9 released
+   - **CPU backends**
+      - Support `bitmasked` as the leaf block structure for `1x1x1` masks (#676) (by **Yuanming Hu**)
+   - **CUDA backend**
+      - Support `bitmasked` as the leaf block structure for `1x1x1` masks (#676) (by **Yuanming Hu**)
+   - **Documentation**
+      - Updated contributor guideline (#658) (by **Yuanming Hu**)
+   - **Infrastructure**
+      - 6x faster compilation on CPU backends (#673) (by **Yuanming Hu**)
+   - **Language and syntax**
+      - Simplify `dense.bitmasked` to `bitmasked` (#670) (by **Ye Kuang**)
+      - Support `break` in non-parallel `for` statements (#583) (by **彭于斌**)
+   - **Metal backend**
+      - Changes to enable `bitmasked` on Metal! (#661) (by **Ye Kuang**)
+      - Silence compile warning with `[[maybe_unused]]` (#650) (by **Ye Kuang**)
+      - Add bitmasked support in MetalRuntime (#638) (by **Ye Kuang**)
+   - **Optimization**
+      - Merge adjacent `if` statements with identical conditions (#668) (by **xumingkuan**)
+      - Dive into container statements to find local loads/stores for optimization, and optimize loads of new allocas to 0 (#662) (by **xumingkuan**)
+   - [Full log](https://github.com/taichi-dev/taichi/releases/tag/0.5.9)
 - (Mar  24, 2020) v0.5.8 released. Visible/notable changes:
    - **Language features**
       - Access out-of-bound checking on CPU backends (#572) (by **xumingkuan**)
@@ -65,77 +85,8 @@ python3 -m pip install taichi-nightly-cuda-10-1
    - Fixed infinitely looping signal handlers
    - Fixed `ti test` on release mode
    - Doc updated
-- (Mar   3, 2020) v0.5.6 released
-   - Fixed runtime LLVM bitcode loading failure on Linux
-   - Fixed a GUI bug in `ti.GUI.line` (by **Mingkuan Xu [xumingkuan]**)
-   - Fixed frontend syntax error false positive (static range-fors) (by **Mingkuan Xu [xumingkuan]**)
-   - `arch=ti.arm64` is now supported. (Please build from source)
-   - CUDA supported on NVIDIA Jetson. (Please build from source)
-- (Mar   2, 2020) v0.5.5 released: **Experimental CUDA 10.0/10.1 support on Windows. Feedbacks are welcome!**
-- (Mar   1, 2020) v0.5.4 released
-   - Metal backend now supports < 32bit args (#530) (by **Ye Kuang [k-ye]**)
-   - Added `ti.imread/imwrite/imshow` for convenient image IO (by **Yubin Peng [archibate]**)
-   - `ti.GUI.set_image` now takes all numpy unsigned integer types (by **Yubin Peng [archibate]**)
-   - Bug fix: [Make sure KernelTemplateMapper extractors's size is the same as the number of args](https://github.com/taichi-dev/taichi/issues/534) (by **Ye Kuang [k-ye]**)
-   - [Avoid duplicate evaluations in chaining comparison (such as `1 < ti.append(...) < 3 < 4`)](https://github.com/taichi-dev/taichi/issues/540) (by **Mingkuan Xu [xumingkuan]**)
-   - Frontend kernel/function structure checking (#544) (by **Mingkuan Xu [xumingkuan]**)
-   - Throw exception instead of SIGABRT to obtain RuntimeError in Python-scope (by **Yubin Peng [archibate]**)
-   - Mark sync bit only after running a kernel on GPU (by **Ye Kuang [k-ye]**)
-   - `@ti.classkernel` is deprecated. Always use `ti.kernel`, no matter you are decorating a class member function or not (by **Ye Kuang [k-ye]**)
-   - Fix ti.func AST transform (due to locals() not saving compile result) #538, #539 (by **Yubin Peng [archibate]**)
-   - Add a KernelSimplicityASTChecker to ensure grad kernel is compliant (#553) (by **Ye Kuang [k-ye]**)
-   - Fixed MSVC C++ mangling which leads to unsupported characters in LLVM NVPTX ASM printer
-   - CUDA unified memory dependency is now removed. Set `TI_USE_UNIFIED_MEMORY=0` to disable unified memory usage
-   - Improved `ti.GUI.line` performance
-   - (For developers) compiler significantly refactored and folder structure reorganized
-- (Feb  25, 2020) v0.5.3 released
-   - Better error message when try to declare tensors after kernel invocation (by **Yubin Peng [archibate]**)
-   - Logging: `ti.warning` renamed to `ti.warn`
-   - Arch: `ti.x86_64` renamed to `ti.x64`. `ti.x86_64` is deprecated and will be removed in a future release
-   - (For developers) Improved runtime bit code compilation thread safety (by **Yubin Peng [archibate]**)
-   - Improved OS X GUI performance (by **Ye Kuang [k-ye]**)
-   - Experimental support for new integer types `u8, i8, u16, i16, u32` (by **Yubin Peng [archibate]**)
-   - Update doc (by **Ye Kuang [k-ye]**)
-- (Feb  20, 2020) v0.5.2 released
-   - Gradients for `ti.pow` now supported (by **Yubin Peng [archibate]**)
-   - Multi-threaded unit testing (by **Yubin Peng [archibate]**)
-   - Fixed Taichi crashing when starting multiple instances simultaneously (by **Yubin Peng [archibate]**)
-   - Metal backend now supports `ti.pow` (by **Ye Kuang [k-ye]**)
-   - Better algebraic simplification (by **Mingkuan Xu [xumingkuan]**)
-   - `ti.normalized` now optionally takes a argument `eps` to prevent division by zero in differentiable programming
-   - Improved random number generation by decorrelating PRNG streams on CUDA
-   - Set environment variable `TI_LOG_LEVEL` to `trace`, `debug`, `info`, `warn`, `error` to filter out/increase verbosity. Default=`info`
-   - [bug fix] fixed a loud failure on differentiable programming code generation due to a new optimization pass
-   - Added `ti.GUI.triangle` [example](https://github.com/taichi-dev/taichi/blob/master/misc/test_gui.py#L11)
-   - Doc update: added `ti.cross` for 3D cross products
-   - Use environment variable `TI_TEST_THREADS` to override testing threads
-   - [For Taichi developers, bug fix] `ti.init(print_processed=True)` renamed to `ti.init(print_preprocessed=True)`
-   - Various development infrastructure improvements by **Yubin Peng [archibate]**
-   - Official Python3.6 - Python3.8 packages on OS X (by **wYw [Detavern]**)
-- (Feb  16, 2020) v0.5.1 released
-   - Keyboard and mouse events supported in the GUI system. Check out [mpm128.py](https://github.com/taichi-dev/taichi/blob/4f5cc09ae0e35a47ad71fdc582c1ecd5202114d8/examples/mpm128.py) for a interactive demo! (by **Yubin Peng [archibate] and Ye Kuang [k-ye]**)
-   - Basic algebraic simplification passes (by **Mingkuan Xu [xumingkuan]**)
-   - (For developers) `ti` (`ti.exe`) command supported on Windows after setting `%PATH%` correctly (by **Mingkuan Xu [xumingkuan]**)
-   - General power operator `x ** y` now supported in Taichi kernels (by **Yubin Peng [archibate]**)
-   - `.dense(...).pointer()` now abbreviated as `.pointer(...)`. `pointer` now stands for a dense pointer array. This leads to cleaner code and better performance. (by **Kenneth Lozes [KLozes]**)
-   - (Advanced struct-fors only) `for i in X` now iterates all child instances of `X` instead of `X` itself. Skip this if you only use `X=leaf node` such as `ti.f32/i32/Vector/Matrix`.
-   - Fixed cuda random number generator racing conditions
-- (Feb  14, 2020) **v0.5.0 released with a new Apple Metal GPU backend for Mac OS X users!** (by **Ye Kuang [k-ye]**)
-   - Just initialize your program with `ti.init(..., arch=ti.metal)` and run Taichi on your Mac GPUs!
-   - A few takeaways if you do want to use the Metal backend:
-     - For now, the Metal backend only supports `dense` SNodes and 32-bit data types. It doesn't support `ti.random()` or `print()`.
-     - Pre-2015 models may encounter some undefined behaviors under certain conditions (e.g. read-after-write). According to our tests, it seems like the memory order on a single GPU thread could go inconsistent on these models.
-     - The `[]` operator in Python is slow in the current implementation. If you need to do a large number of reads, consider dumping all the data to a `numpy` array via `to_numpy()` as a workaround. For writes, consider first generating the data into a `numpy` array, then copying that to the Taichi variables as a whole.
-     - Do NOT expect a performance boost yet, and we are still profiling and tuning the new backend. (So far we only saw a big performance improvement on a 2015 MBP 13-inch model.)
-- [Full changelog](changelog.md)
+- [Full history](changelog.md)
 
-## Short-term goals
-- (Done) Fully implement the LLVM backend to replace the legacy source-to-source C++/CUDA backends (By Dec 2019)
-  - The only missing features compared to the old source-to-source backends:
-    - Vectorization on CPUs. Given most users who want performance are using GPUs (CUDA), this is given low priority.
-    - Automatic shared memory utilization. Postponed until Feb/March 2020.
-- (Done) Redesign & reimplement (GPU) memory allocator (by the end of Jan 2020)
-- (WIP) Tune the performance of the LLVM backend to match that of the legacy source-to-source backends (Hopefully by Feb, 2020. Current progress: setting up/tuning for final benchmarks)
 
 ## Related papers
 - [**(ICLR 2020) Differentiable Programming for Physical Simulation**](https://arxiv.org/abs/1910.00935) [[Video]](https://www.youtube.com/watch?v=Z1xvAZve9aE) [[BibTex]](https://raw.githubusercontent.com/yuanming-hu/taichi/master/misc/difftaichi_bibtex.txt) [[Code]](https://github.com/yuanming-hu/difftaichi)
diff --git a/changelog.md b/changelog.md
index 0fa22640aa858..a0965e1311d8c 100644
--- a/changelog.md
+++ b/changelog.md
@@ -1,4 +1,66 @@
 # Changelog
+- (Mar   3, 2020) v0.5.6 released
+   - Fixed runtime LLVM bitcode loading failure on Linux
+   - Fixed a GUI bug in `ti.GUI.line` (by **Mingkuan Xu [xumingkuan]**)
+   - Fixed frontend syntax error false positive (static range-fors) (by **Mingkuan Xu [xumingkuan]**)
+   - `arch=ti.arm64` is now supported. (Please build from source)
+   - CUDA supported on NVIDIA Jetson. (Please build from source)
+- (Mar   2, 2020) v0.5.5 released: **Experimental CUDA 10.0/10.1 support on Windows. Feedback is welcome!**
+- (Mar   1, 2020) v0.5.4 released
+   - Metal backend now supports < 32bit args (#530) (by **Ye Kuang [k-ye]**)
+   - Added `ti.imread/imwrite/imshow` for convenient image IO (by **Yubin Peng [archibate]**)
+   - `ti.GUI.set_image` now takes all numpy unsigned integer types (by **Yubin Peng [archibate]**)
+   - Bug fix: [Make sure KernelTemplateMapper extractors's size is the same as the number of args](https://github.com/taichi-dev/taichi/issues/534) (by **Ye Kuang [k-ye]**)
+   - [Avoid duplicate evaluations in chaining comparison (such as `1 < ti.append(...) < 3 < 4`)](https://github.com/taichi-dev/taichi/issues/540) (by **Mingkuan Xu [xumingkuan]**)
+   - Frontend kernel/function structure checking (#544) (by **Mingkuan Xu [xumingkuan]**)
+   - Throw exception instead of SIGABRT to obtain RuntimeError in Python-scope (by **Yubin Peng [archibate]**)
+   - Mark sync bit only after running a kernel on GPU (by **Ye Kuang [k-ye]**)
+   - `@ti.classkernel` is deprecated. Always use `ti.kernel`, whether or not you are decorating a class member function (by **Ye Kuang [k-ye]**)
+   - Fix ti.func AST transform (due to locals() not saving compile result) #538, #539 (by **Yubin Peng [archibate]**)
+   - Add a KernelSimplicityASTChecker to ensure grad kernel is compliant (#553) (by **Ye Kuang [k-ye]**)
+   - Fixed MSVC C++ mangling which leads to unsupported characters in LLVM NVPTX ASM printer
+   - CUDA unified memory dependency is now removed. Set `TI_USE_UNIFIED_MEMORY=0` to disable unified memory usage
+   - Improved `ti.GUI.line` performance
+   - (For developers) compiler significantly refactored and folder structure reorganized
+- (Feb  25, 2020) v0.5.3 released
+   - Better error message when trying to declare tensors after kernel invocation (by **Yubin Peng [archibate]**)
+   - Logging: `ti.warning` renamed to `ti.warn`
+   - Arch: `ti.x86_64` renamed to `ti.x64`. `ti.x86_64` is deprecated and will be removed in a future release
+   - (For developers) Improved runtime bitcode compilation thread safety (by **Yubin Peng [archibate]**)
+   - Improved OS X GUI performance (by **Ye Kuang [k-ye]**)
+   - Experimental support for new integer types `u8, i8, u16, i16, u32` (by **Yubin Peng [archibate]**)
+   - Update doc (by **Ye Kuang [k-ye]**)
+- (Feb  20, 2020) v0.5.2 released
+   - Gradients for `ti.pow` now supported (by **Yubin Peng [archibate]**)
+   - Multi-threaded unit testing (by **Yubin Peng [archibate]**)
+   - Fixed Taichi crashing when starting multiple instances simultaneously (by **Yubin Peng [archibate]**)
+   - Metal backend now supports `ti.pow` (by **Ye Kuang [k-ye]**)
+   - Better algebraic simplification (by **Mingkuan Xu [xumingkuan]**)
+   - `ti.normalized` now optionally takes an argument `eps` to prevent division by zero in differentiable programming
+   - Improved random number generation by decorrelating PRNG streams on CUDA
+   - Set environment variable `TI_LOG_LEVEL` to `trace`, `debug`, `info`, `warn`, `error` to filter out/increase verbosity. Default=`info`
+   - [bug fix] fixed a loud failure on differentiable programming code generation due to a new optimization pass
+   - Added `ti.GUI.triangle` [example](https://github.com/taichi-dev/taichi/blob/master/misc/test_gui.py#L11)
+   - Doc update: added `ti.cross` for 3D cross products
+   - Use environment variable `TI_TEST_THREADS` to override testing threads
+   - [For Taichi developers, bug fix] `ti.init(print_processed=True)` renamed to `ti.init(print_preprocessed=True)`
+   - Various development infrastructure improvements by **Yubin Peng [archibate]**
+   - Official Python 3.6 to Python 3.8 packages on OS X (by **wYw [Detavern]**)
+- (Feb  16, 2020) v0.5.1 released
+   - Keyboard and mouse events supported in the GUI system. Check out [mpm128.py](https://github.com/taichi-dev/taichi/blob/4f5cc09ae0e35a47ad71fdc582c1ecd5202114d8/examples/mpm128.py) for an interactive demo! (by **Yubin Peng [archibate] and Ye Kuang [k-ye]**)
+   - Basic algebraic simplification passes (by **Mingkuan Xu [xumingkuan]**)
+   - (For developers) `ti` (`ti.exe`) command supported on Windows after setting `%PATH%` correctly (by **Mingkuan Xu [xumingkuan]**)
+   - General power operator `x ** y` now supported in Taichi kernels (by **Yubin Peng [archibate]**)
+   - `.dense(...).pointer()` now abbreviated as `.pointer(...)`. `pointer` now stands for a dense pointer array. This leads to cleaner code and better performance. (by **Kenneth Lozes [KLozes]**)
+   - (Advanced struct-fors only) `for i in X` now iterates all child instances of `X` instead of `X` itself. Skip this if you only use `X=leaf node` such as `ti.f32/i32/Vector/Matrix`.
+   - Fixed CUDA random number generator race conditions
+- (Feb  14, 2020) **v0.5.0 released with a new Apple Metal GPU backend for Mac OS X users!** (by **Ye Kuang [k-ye]**)
+   - Just initialize your program with `ti.init(..., arch=ti.metal)` and run Taichi on your Mac GPUs!
+   - A few takeaways if you do want to use the Metal backend:
+     - For now, the Metal backend only supports `dense` SNodes and 32-bit data types. It doesn't support `ti.random()` or `print()`.
+     - Pre-2015 models may exhibit undefined behavior under certain conditions (e.g. read-after-write). According to our tests, memory ordering on a single GPU thread can become inconsistent on these models.
+     - The `[]` operator in Python is slow in the current implementation. If you need to do a large number of reads, consider dumping all the data to a `numpy` array via `to_numpy()` as a workaround. For writes, consider first generating the data into a `numpy` array, then copying that to the Taichi variables as a whole.
+     - Do NOT expect a performance boost yet; we are still profiling and tuning the new backend. (So far we have only seen a big performance improvement on a 2015 MBP 13-inch model.)
 - (Feb  12, 2020) v0.4.6 released.
    - (For compiler developers) An error will be raised when `TAICHI_REPO_DIR` is not a valid path (by **Yubin Peng [archibate]**)
    - Fixed a CUDA backend deadlock bug
diff --git a/docs/version b/docs/version
index 659914ae9416f..416bfb0a2212b 100644
--- a/docs/version
+++ b/docs/version
@@ -1 +1 @@
-0.5.8
+0.5.9
diff --git a/misc/make_changelog.py b/misc/make_changelog.py
index 649e8b0b9a9bb..58c16955396c1 100644
--- a/misc/make_changelog.py
+++ b/misc/make_changelog.py
@@ -25,14 +25,14 @@ def format(c):
     'cuda': 'CUDA backend',
     'doc': 'Documentation',
     'infra': 'Infrastructure',
-    'ir': 'Intermediate Representation',
-    'lang': 'Language and Syntax',
+    'ir': 'Intermediate representation',
+    'lang': 'Language and syntax',
     'metal': 'Metal backend',
     'misc': 'Miscellaneous',
     'opt': 'Optimization',
 }
 
-print(f'-(, 2020) v{ver} released')
+print(f'- (, 2020) v{ver} released')
 for i, c in enumerate(commits):
     s = format(c)
     if s.startswith('[release]'):