[doc] Improve hello.rst #955

Merged 3 commits on May 12, 2020
172 changes: 99 additions & 73 deletions docs/hello.rst
@@ -17,64 +17,67 @@ Now you are ready to run the Taichi code below (``python3 fractal.py``) to compu

.. code-block:: python

    # fractal.py

    import taichi as ti

    ti.init(arch=ti.gpu)

    n = 320
    pixels = ti.var(dt=ti.f32, shape=(n * 2, n))

    @ti.func
    def complex_sqr(z):
        return ti.Vector([z[0]**2 - z[1]**2, z[1] * z[0] * 2])

    @ti.kernel
    def paint(t: ti.f32):
        for i, j in pixels:  # Parallelized over all pixels
            c = ti.Vector([-0.8, ti.cos(t) * 0.2])
            z = ti.Vector([i / n - 1, j / n - 0.5]) * 2
            iterations = 0
            while z.norm() < 20 and iterations < 50:
                z = complex_sqr(z) + c
                iterations += 1
            pixels[i, j] = 1 - iterations * 0.02

    gui = ti.GUI("Julia Set", res=(n * 2, n))

    for i in range(1000000):
        paint(i * 0.03)
        gui.set_image(pixels)
        gui.show()
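
The ``complex_sqr`` helper implements :math:`(a+bi)^2 = (a^2-b^2) + 2ab\,i`. As a plain-Python sanity check (no Taichi required, and not part of the program above), the same formula can be compared against Python's built-in complex arithmetic:

```python
# Plain-Python check that the complex_sqr formula used in the fractal
# program matches squaring a Python complex number.
def complex_sqr(z):
    # z is a pair [re, im]; returns z**2 under complex multiplication
    return [z[0] ** 2 - z[1] ** 2, z[1] * z[0] * 2]

w = complex(-0.7, 0.4)
expected = w * w
got = complex_sqr([w.real, w.imag])
assert abs(got[0] - expected.real) < 1e-12
assert abs(got[1] - expected.imag) < 1e-12
```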

Let's dive into the components of this simple Taichi program.

import taichi as ti
-------------------
Taichi is a domain-specific language (DSL) embedded in Python.
Heavy engineering has been done to make Taichi as easy to use as a plain Python package.

After minimal learning efforts, every Python programmer will be capable of writing Taichi programs.
You can also reuse the Python package management system, Python IDEs, and existing Python packages.

Portability
-----------

Taichi code can run on CPUs or GPUs. Initialize Taichi according to your hardware platform:

.. code-block:: python

    # Run on GPU, automatically detect backend
    ti.init(arch=ti.gpu)
    # Run on GPU, with the NVIDIA CUDA backend
    ti.init(arch=ti.cuda)
    # Run on GPU, with the OpenGL backend
    ti.init(arch=ti.opengl)
    # Run on GPU, with the Apple Metal backend, if you are on OS X
    ti.init(arch=ti.metal)

    # Run on CPU (default)
    ti.init(arch=ti.cpu)

@@ -91,29 +94,29 @@ Taichi code can run on CPUs or GPUs. Initialize Taichi according to your hardware platform:
| Mac OS X | OK | N/A | N/A | OK |
+----------+------+------+--------+-------+

(OK: supported; N/A: not available)

With ``arch=ti.gpu``, Taichi will try to run on CUDA.
If CUDA is not supported on your machine, Taichi will fall back to Metal or OpenGL.
If no GPU backend (CUDA, Metal, or OpenGL) is supported, Taichi will fall back to CPUs.
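
The fallback order described above can be sketched in plain Python. This is a hypothetical illustration of the selection policy, not Taichi's actual implementation, and the order between Metal and OpenGL is assumed here:

```python
# Hypothetical sketch of the ti.gpu fallback described above: try CUDA
# first, then Metal/OpenGL (one possible order), then fall back to CPU.
def pick_backend(available):
    for backend in ("cuda", "metal", "opengl"):
        if backend in available:
            return backend
    return "cpu"

assert pick_backend({"cuda", "opengl"}) == "cuda"
assert pick_backend({"opengl"}) == "opengl"
assert pick_backend(set()) == "cpu"
```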

.. note::

  When using the CUDA backend on Windows systems or ARM devices (e.g. NVIDIA Jetson),
Taichi will by default allocate 1 GB memory for tensor storage. You can override this by initializing with
``ti.init(arch=ti.cuda, device_memory_GB=3.4)`` to allocate ``3.4`` GB GPU memory, or
``ti.init(arch=ti.cuda, device_memory_fraction=0.3)`` to allocate ``30%`` of total available GPU memory.

On other platforms Taichi will make use of its on-demand memory allocator to adaptively allocate memory.

(Sparse) tensors
----------------

Taichi is a data-oriented programming language, where dense or spatially-sparse tensors are first-class citizens.
See :ref:`sparse` for more details on sparse tensors.

In the code above, ``pixels = ti.var(dt=ti.f32, shape=(n * 2, n))`` allocates a 2D dense tensor named ``pixels`` of
size ``(640, 320)`` and element data type ``ti.f32`` (i.e. ``float`` in C).
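
To make the numbers concrete, with ``n = 320`` the shape and element count work out as follows (plain-Python arithmetic; the 4-byte element size follows from ``ti.f32`` being a 32-bit float):

```python
# Shape arithmetic for the pixels tensor above.
n = 320
shape = (n * 2, n)
assert shape == (640, 320)

# Each ti.f32 element is 4 bytes, so the dense tensor holds:
num_elements = shape[0] * shape[1]
assert num_elements == 204800
assert num_elements * 4 == 819200  # bytes, i.e. 800 KiB
```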

Functions and kernels
---------------------
@@ -127,7 +130,7 @@ You can also define Taichi **functions** with ``ti.func``, which can be called a
.. note::

**Taichi-scope v.s. Python-scope**: everything decorated with ``ti.kernel`` and ``ti.func`` is in Taichi-scope, which will be compiled by the Taichi compiler.
  Code outside the Taichi-scopes is simply normal Python code.

.. warning::

@@ -140,84 +143,107 @@ For those who came from the world of CUDA, ``ti.func`` corresponds to ``__device__``


Parallel for-loops
------------------
For loops at the outermost scope in a Taichi kernel are **automatically parallelized**.
For loops can have two forms, i.e. `range-for loops` and `struct-for loops`.

**Range-for loops** are no different from Python for loops, except that they will be parallelized
when used at the outermost scope. Range-for loops can be nested.

.. code-block:: python

    @ti.kernel
    def fill():
        for i in range(10):  # Parallelized
            x[i] += i

            s = 0
            for j in range(5):  # Serialized in each parallel thread
                s += j

            y[i] = s

    @ti.kernel
    def fill_3d():
        # Parallelized for all 3 <= i < 8, 1 <= j < 6, 0 <= k < 9
        for i, j, k in ti.ndrange((3, 8), (1, 6), 9):
            x[i, j, k] = i + j + k
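
The index set visited by ``ti.ndrange((3, 8), (1, 6), 9)`` can be reproduced in plain Python: each argument is either a ``(begin, end)`` pair or a bare end with begin defaulting to 0.

```python
import itertools

# Plain-Python equivalent of the index set of ti.ndrange((3, 8), (1, 6), 9).
indices = list(itertools.product(range(3, 8), range(1, 6), range(9)))
assert indices[0] == (3, 1, 0)
assert indices[-1] == (7, 5, 8)
assert len(indices) == 5 * 5 * 9  # 225 index triples
```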

.. note::

    It is the loop **at the outermost scope** that gets parallelized, not the outermost loop.

.. code-block:: python

    @ti.kernel
    def foo():
        for i in range(10):  # Parallelized :-)
            ...

    @ti.kernel
    def bar(k: ti.i32):
        if k > 42:
            for i in range(10):  # Serial :-(
                ...

**Struct-for loops** are particularly useful when iterating over (sparse) tensor elements.
In the fractal code above, ``for i, j in pixels`` loops over all the pixel coordinates, i.e. ``(0, 0), (0, 1), (0, 2), ..., (0, 319), (1, 0), ..., (639, 319)``.
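
The coordinate sequence quoted above can be checked with a plain-Python enumeration (Taichi's actual parallel iteration order is unspecified; this only lists the index set):

```python
# Enumerate the pixel coordinates of a dense (640, 320) tensor.
n = 320
coords = [(i, j) for i in range(n * 2) for j in range(n)]
assert coords[:3] == [(0, 0), (0, 1), (0, 2)]
assert coords[319] == (0, 319)
assert coords[-1] == (639, 319)
assert len(coords) == 640 * 320
```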

.. note::

Struct-for is the key to :ref:`sparse` in Taichi, as it will only loop over active elements in a sparse tensor. In dense tensors, all elements are active.

.. warning::

    Struct-for's must live at the outer-most scope of kernels.

.. code-block:: python

    @ti.kernel
    def foo():
        for i in x:
            ...

    @ti.kernel
    def bar(k: ti.i32):
        # The outermost scope is an `if` statement, not the struct-for loop!
        if k > 42:
            for i in x:  # Not allowed. Struct-fors must live at the outermost scope.
                ...

.. warning::

    ``break`` **is not supported in parallel loops**:

.. code-block:: python

    @ti.kernel
    def foo():
        for i in x:
            ...
            break  # Error!

        for i in range(10):
            ...
            break  # Error!

    @ti.kernel
    def foo():
        for i in x:
            for j in range(10):
                ...
                break  # OK!


Interacting with Python
------------------------

Everything outside Taichi-scopes (``ti.func`` and ``ti.kernel``) is simply Python. You can use your favorite Python packages (e.g. ``numpy``, ``pytorch``, ``matplotlib``) with Taichi.

In Python-scope, you can access Taichi tensors using plain indexing syntax, and helper functions such as ``from_numpy`` and ``to_torch``:

6 changes: 3 additions & 3 deletions docs/layout.rst
@@ -96,7 +96,7 @@ See? ``x`` first increases the first index (i.e. row-major), while ``y`` first i


Array of Structures (AoS), Structure of Arrays (SoA)
----------------------------------------------------

Tensors of the same size can be placed together.

@@ -158,8 +158,8 @@ A better placement is to place them together:
Then ``vel[i]`` is placed right next to ``pos[i]``, which can increase the cache-hit rate and therefore the performance.
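
The distance argument can be illustrated with plain Python lists (a toy sketch of the two layouts for four particles; real layouts are handled by Taichi's data-structure system):

```python
pos = [0.0, 1.0, 2.0, 3.0]
vel = [10.0, 11.0, 12.0, 13.0]

# SoA: all pos first, then all vel -> pos[i] and vel[i] are n slots apart.
soa = pos + vel
# AoS: interleaved -> vel[i] sits right next to pos[i].
aos = [v for pair in zip(pos, vel) for v in pair]

assert soa.index(vel[0]) - soa.index(pos[0]) == len(pos)  # far apart
assert aos.index(vel[0]) - aos.index(pos[0]) == 1         # adjacent
```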


Flat layouts versus hierarchical layouts
----------------------------------------

By default, when allocating a ``ti.var``, it follows the simplest data layout.

4 changes: 2 additions & 2 deletions docs/overview.rst
@@ -1,5 +1,5 @@
Why new programming language
============================

Taichi is a high-performance programming language for computer graphics applications. The design goals are

@@ -11,7 +11,7 @@ Taichi is a high-performance programming language for computer graphics applications
- Metaprogramming

Design decisions
----------------

- Decouple computation from data structures
- Domain-specific compiler optimizations
5 changes: 3 additions & 2 deletions docs/syntax.rst
@@ -1,8 +1,8 @@
Syntax
======

Kernels
-------

Kernel arguments must be type-hinted. Kernels can have at most 8 parameters, e.g.,

@@ -37,6 +37,7 @@ The return value will be automatically cast into the hinted type, e.g.,
For now, we only support one scalar as the return value. Returning ``ti.Matrix`` or ``ti.Vector`` is not supported. Python-style tuple return is not supported, e.g.:

.. code-block:: python

@ti.kernel
def bad_kernel() -> ti.Matrix:
return ti.Matrix([[1, 0], [0, 1]]) # ERROR!
Expand Down
2 changes: 1 addition & 1 deletion examples/fractal.py
@@ -15,7 +15,7 @@ def complex_sqr(z):
def paint(t: ti.f32):
for i, j in pixels: # Parallelized over all pixels
c = ti.Vector([-0.8, ti.cos(t) * 0.2])
z = ti.Vector([i / n - 1, j / n - 0.5]) * 2
iterations = 0
while z.norm() < 20 and iterations < 50:
z = complex_sqr(z) + c