# 3rd Place solution (part): Simulated Annealing

First of all, I would like to thank the staff who planned and organized this competition, and the competitors who worked on this problem alongside me. Thanks to your efforts, this competition was a lot of fun.

I mainly used simulated annealing for this problem. The algorithm is well known and easy to implement. My (early) solution is implemented below.

### prepare for execution

- import
- setting constants
- forwarding function
- score function

In [None]:
import numpy as np
import pandas as pd
from scipy.signal import convolve2d
from time import perf_counter
from tqdm.auto import tqdm
import os

SIZE = 25
NUM_PROBLEM = 10 # for debug. 50000 for submission
SOLVE_TIME = 10 # must be short enough to submit

conway_kernel = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=np.uint8)
conway_transition = np.array([[0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0, 0, 0, 0]], dtype=np.uint8)

def game_forward(x: np.ndarray):
    adjacent_sum = convolve2d(x, conway_kernel, mode="same", boundary="wrap")
    return np.where(x == 0, conway_transition[0][adjacent_sum], conway_transition[1][adjacent_sum])

def calc_score(gt, pred):
    assert gt["id"] == pred["id"], (gt, pred)
    table = np.copy(pred["start"])
    for _ in range(gt["delta"]):
        table = game_forward(table)
    return np.sum(table == gt["stop"])

### execution

- load dataset
- solve using simulated annealing
- write submission

In [None]:
def load_dataset(csv_file_name):
    with open(csv_file_name, "r") as f:
        df = pd.read_csv(f)
    data = np.zeros(NUM_PROBLEM, dtype=[
        ("id", np.uint64),
        ("delta", np.int64),
        ("stop", np.uint8, (SIZE, SIZE))])
    for i in tqdm(range(NUM_PROBLEM), desc=os.path.basename(csv_file_name)):
        data[i]["id"] = df.iloc[i, 0]
        data[i]["delta"] = df.iloc[i, 1]
        data[i]["stop"] = np.array(df.iloc[i, 2:], dtype=np.uint8).reshape(SIZE, SIZE)
    return data

dataset = load_dataset("../input/conways-reverse-game-of-life-2020/test.csv")

def solve(dataset):
    submission = np.zeros(NUM_PROBLEM, dtype=[
        ("id", np.uint64),
        ("score", np.uint64),
        ("start", np.uint8, (SIZE, SIZE))])

    begin_temperature = 20
    end_temperature = 0.01
    temperature_coeff = np.log(end_temperature/begin_temperature)/SOLVE_TIME
    progress_bar = tqdm(dataset)
    score_sum = 0
    for data_idx, data in enumerate(progress_bar):
        calc_start = perf_counter()
        submission[data_idx]["id"] = data["id"]
        submission[data_idx]["start"] = data["stop"]
        submission[data_idx]["score"] = calc_score(data, submission[data_idx])

        duration = 0
        while duration < SOLVE_TIME and submission[data_idx]["score"] < SIZE * SIZE:
            temperature = begin_temperature * np.exp(temperature_coeff * duration)
            i, j = np.random.randint(SIZE, size=2)
            submission[data_idx]["start"][i][j] = 1 - submission[data_idx]["start"][i][j]
            tmp_score = calc_score(data, submission[data_idx])
            if tmp_score > submission[data_idx]["score"] or np.random.random() < np.exp((tmp_score - submission[data_idx]["score"]) / temperature):
                submission[data_idx]["score"] = tmp_score
            else:
                submission[data_idx]["start"][i][j] = 1 - submission[data_idx]["start"][i][j]
            duration = perf_counter() - calc_start
        score_sum += submission[data_idx]["score"]
        progress_bar.set_description("score %f" % (1 - score_sum/(SIZE*SIZE*(data_idx + 1))))
    return submission

submission = solve(dataset)

writefile = "submission.csv"
with open(writefile, "w") as f:
    print(f"id,{','.join(['start_{0}'.format(i) for i in range(SIZE*SIZE)])}", file=f)
    for v in tqdm(submission, desc=os.path.basename(writefile)):
        print(f"{v['id']},{','.join(np.reshape(v['start'], -1).astype(str))}", file=f)
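To make the temperature schedule in `solve()` concrete, here is a minimal standalone sketch of the same exponential cooling and acceptance rule, using only the standard library. The function names `temperature_at` and `acceptance_probability` are introduced here for illustration and do not appear in the notebook.

```python
import math

BEGIN_TEMPERATURE = 20.0
END_TEMPERATURE = 0.01
SOLVE_TIME = 10.0  # seconds per problem, as in the notebook


def temperature_at(elapsed: float) -> float:
    """Exponential cooling from BEGIN_TEMPERATURE (t=0) to END_TEMPERATURE (t=SOLVE_TIME)."""
    coeff = math.log(END_TEMPERATURE / BEGIN_TEMPERATURE) / SOLVE_TIME
    return BEGIN_TEMPERATURE * math.exp(coeff * elapsed)


def acceptance_probability(score_delta: int, temperature: float) -> float:
    """Metropolis rule when maximizing the score: improvements are always kept."""
    if score_delta > 0:
        return 1.0
    return math.exp(score_delta / temperature)


# Probability of accepting a move that loses one cell, at three points in time.
for elapsed in (0.0, 5.0, 10.0):
    t = temperature_at(elapsed)
    p = acceptance_probability(-1, t)
    print(f"elapsed={elapsed:4.1f}s  T={t:8.4f}  P(accept -1)={p:.4g}")
```

Early on, almost every single-cell loss is accepted, while near the end the search is effectively greedy.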

## optimization

This solver is too slow (about 2.5k to 7k iterations/sec) to achieve a good score, so I tried to make it run as fast as possible.

- **Use C++**. Python programs can be made faster, but C++ is easier to optimize.
- **Partial update and compare**. When one cell of the start table is flipped, the change spreads by at most one cell per generation, so after `delta` steps it can only reach cells within distance `delta` of the flipped cell. By caching every intermediate generation, only that neighborhood needs to be recomputed and compared.
- **Multithreading**. I used 2 PCs (Ryzen 5 3600 (6C12T) and Core i7-8500U (4C8T)). Parallelizing task-wise is easy.
- **Bitwise update and compare**. In theory, each cell can be computed using only 1 bit, and I implemented this in practice. The core function is `pointwise_update()`.

In [None]:
def sum_of_two_is_0(a: int, b: int):
    return 1 - (a | b) # in C++, ~(a|b)

def sum_of_two_is_1(a: int, b: int):
    return a ^ b

def sum_of_two_is_2(a: int, b: int):
    return a & b

def sum_of_three_geq_2(a: int, b: int, c: int):
    return (a & b) | ((a ^ b) & c)

def sum_of_three_mod_2(a: int, b: int, c: int):
    return a ^ b ^ c

def sum_of_four_is_1(a: int, b: int, c: int, d: int):
    return (sum_of_two_is_0(a, b) & sum_of_two_is_1(c, d)) | (sum_of_two_is_1(a, b) & sum_of_two_is_0(c, d))

for a in [0, 1]:
    for b in [0, 1]:
        assert sum_of_two_is_0(a, b) == (1 if a+b==0 else 0)
        assert sum_of_two_is_1(a, b) == (1 if a+b==1 else 0)
        assert sum_of_two_is_2(a, b) == (1 if a+b==2 else 0)
        for c in [0, 1]:
            assert sum_of_three_geq_2(a, b, c) == (1 if a+b+c>=2 else 0)
            assert sum_of_three_mod_2(a, b, c) == (a+b+c)%2
            for d in [0, 1]:
                assert sum_of_four_is_1(a, b, c, d) == (1 if a+b+c+d==1 else 0)

def pointwise_update(x: np.ndarray):
    assert x.shape == (3, 3)
    ul, u, ur = x[0]
    l,  a, r  = x[1]
    dl, d, dr = x[2]
    sum_of_remaining_term_geq_2 = sum_of_three_geq_2(
        sum_of_two_is_1(l, r),
        sum_of_three_mod_2(ul, u, ur),
        sum_of_three_mod_2(dl, d, dr))
    sum_of_neighbor_is_2_or_3 = sum_of_four_is_1(
        sum_of_two_is_2(l, r),
        sum_of_three_geq_2(ul, u, ur),
        sum_of_three_geq_2(dl, d, dr),
        sum_of_remaining_term_geq_2)
    sum_of_neighbor_mod_2 = sum_of_three_mod_2(
        sum_of_two_is_1(l, r),
        sum_of_three_mod_2(ul, u, ur),
        sum_of_three_mod_2(dl, d, dr))
    return sum_of_neighbor_is_2_or_3 & (a | sum_of_neighbor_mod_2)

from itertools import product
for v in product([0, 1], repeat=9):
    x = np.array(v).reshape(3, 3)
    assert pointwise_update(x) == ((sum(v) - x[1][1] == 2 and x[1][1] == 1) or sum(v) - x[1][1] == 3), x

def get_neighbor(x: np.ndarray, i: int, j: int):
    return np.array([[x[(i+di)%SIZE][(j+dj)%SIZE] for dj in [-1, 0, 1]] for di in [-1, 0, 1]])

def game_forward_bitwise(x: np.ndarray):
    return np.array([[pointwise_update(get_neighbor(x, i, j)) for j in range(SIZE)] for i in range(SIZE)])
    
table = np.random.randint(2, size=(SIZE, SIZE))
assert (game_forward(table) == game_forward_bitwise(table)).all()
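The cell above checks the boolean formulas one cell at a time. The same formulas work unchanged when each row is packed into one integer, so a single bitwise operation handles all 25 columns at once. The following is a minimal pure-Python sketch of that idea. The helper names (`rotl`, `rotr`, `step_packed`, and so on) are hypothetical, and the bit layout (column `j` in bit `j`) is an assumption that differs slightly from the C++ layout.

```python
import random

SIZE = 25
MASK = (1 << SIZE) - 1  # one bit per column


def rotl(row: int) -> int:
    """Bit j of the result is bit j-1 of row (wraps around)."""
    return ((row << 1) | (row >> (SIZE - 1))) & MASK


def rotr(row: int) -> int:
    """Bit j of the result is bit j+1 of row (wraps around)."""
    return ((row >> 1) | (row << (SIZE - 1))) & MASK


def maj3(a: int, b: int, c: int) -> int:
    """Per-bit 'sum of three >= 2' (the carry of a full adder)."""
    return (a & b) | ((a ^ b) & c)


def exactly_one_of_four(a: int, b: int, c: int, d: int) -> int:
    return ((~(a | b) & (c ^ d)) | ((a ^ b) & ~(c | d))) & MASK


def step_packed(rows):
    n = len(rows)
    out = []
    for i in range(n):
        up, mid, down = rows[i - 1], rows[i], rows[(i + 1) % n]
        l, r = rotl(mid), rotr(mid)
        # Weight-1 bits of the three partial sums (left+right, upper row, lower row).
        lr_lo, up_lo, down_lo = l ^ r, rotl(up) ^ up ^ rotr(up), rotl(down) ^ down ^ rotr(down)
        # Weight-2 bits (carries) of the same partial sums.
        lr_hi, up_hi, down_hi = l & r, maj3(rotl(up), up, rotr(up)), maj3(rotl(down), down, rotr(down))
        parity = lr_lo ^ up_lo ^ down_lo        # neighbor count mod 2
        lo_carry = maj3(lr_lo, up_lo, down_lo)  # carry out of the weight-1 bits
        # The count is 2 or 3 exactly when one weight-2 term is set.
        two_or_three = exactly_one_of_four(lr_hi, up_hi, down_hi, lo_carry)
        out.append(two_or_three & (mid | parity))
    return out


def step_naive(grid):
    n = len(grid)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            count = sum(grid[(i + di) % n][(j + dj) % n]
                        for di in (-1, 0, 1) for dj in (-1, 0, 1)) - grid[i][j]
            out[i][j] = 1 if count == 3 or (count == 2 and grid[i][j] == 1) else 0
    return out


def pack(grid):
    return [sum(bit << j for j, bit in enumerate(row)) for row in grid]


def unpack(rows):
    return [[(row >> j) & 1 for j in range(SIZE)] for row in rows]


rng = random.Random(0)
grid = [[rng.randint(0, 1) for _ in range(SIZE)] for _ in range(SIZE)]
assert unpack(step_packed(pack(grid))) == step_naive(grid)
```

Because every operation is a plain AND, OR, XOR, or shift, the same code maps directly onto wider registers.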

- Parallelize using 256-bit **SIMD** (AVX2) instead of a 32-bit representation.

Now that the update is bitwise, we can update 8 tables at the same time just by laying the data out appropriately: each 32-bit lane of a 256-bit register holds one 25-cell row of one table. The implementation is below.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h> // AVX2 intrinsics

constexpr size_t MAX_DELTA = 5;
constexpr size_t SIZE = 25;
constexpr size_t NUM_PROC = 32;

struct aligned_table{
    alignas(32) uint32_t data[MAX_DELTA+1][SIZE][NUM_PROC];
};

constexpr uint32_t VALID_BITS = ((1 << SIZE) - 1) << 1; // 00000011 11111111 11111111 11111110
using bitarray_x8_t = __m256i;

// After tables.data[0][i0][proc..proc+7] has been flipped by another function, this function updates tables.data[1..delta][i0-delta..i0+delta][proc..proc+7].
void update_partial_x8(aligned_table &tables, const int i0, const int proc, const int delta){
    const bitarray_x8_t valid_bits_x8 = _mm256_set1_epi32(VALID_BITS);
    const size_t buffer_offset = 6;
    const size_t buffer_size = 2 * buffer_offset + 1;
    bitarray_x8_t last_table[buffer_size];
    bitarray_x8_t l_and_r[buffer_size];
    bitarray_x8_t l_xor_r[buffer_size];
    bitarray_x8_t xor_a_l_r[buffer_size];
    bitarray_x8_t sum_a_l_r_geq_2[buffer_size];

    for(int d=1; d<=delta; ++d){
        for(int di=-d-1; di<=d+1; ++di){
            const auto i_data = (i0 + SIZE + di) % SIZE;
            const auto i = buffer_offset + di;
            last_table[i] = _mm256_load_si256((bitarray_x8_t *)(tables.data[d-1][i_data] + proc));
            // left rotate
            const auto ll = _mm256_slli_epi32(last_table[i], 1), lr = _mm256_srli_epi32(last_table[i], SIZE-1);
            const auto l = _mm256_and_si256(_mm256_or_si256(ll, lr), valid_bits_x8);
            // right rotate
            const auto rl = _mm256_slli_epi32(last_table[i], SIZE-1), rr = _mm256_srli_epi32(last_table[i], 1);
            const auto r = _mm256_and_si256(_mm256_or_si256(rl, rr), valid_bits_x8);
            l_and_r[i] = _mm256_and_si256(l, r);
            l_xor_r[i] = _mm256_xor_si256(l, r);
            xor_a_l_r[i] = _mm256_xor_si256(l_xor_r[i], last_table[i]);
            sum_a_l_r_geq_2[i] = _mm256_or_si256(l_and_r[i], _mm256_and_si256(l_xor_r[i], last_table[i]));
        }
        for(int di=-d; di<=d; ++di){
            const auto i_data = (i0 + SIZE + di) % SIZE;
            const auto i = buffer_offset + di;
            const auto u_xor_d = _mm256_xor_si256(xor_a_l_r[i+1], xor_a_l_r[i-1]);
            const auto u_and_d = _mm256_and_si256(xor_a_l_r[i+1], xor_a_l_r[i-1]);
            const auto next_sum_mod_2 = _mm256_xor_si256(u_xor_d, l_xor_r[i]);
            const auto sum_u_d_lr_geq_2 = _mm256_or_si256(_mm256_and_si256(u_xor_d, l_xor_r[i]), u_and_d);
            const auto next_sum_is_2_or_3 = _mm256_or_si256(
                _mm256_andnot_si256(
                    _mm256_or_si256(sum_a_l_r_geq_2[i+1], sum_a_l_r_geq_2[i-1]),
                    _mm256_xor_si256(sum_u_d_lr_geq_2, l_and_r[i])
                ),
                _mm256_andnot_si256(
                    _mm256_or_si256(sum_u_d_lr_geq_2, l_and_r[i]),
                    _mm256_xor_si256(sum_a_l_r_geq_2[i+1], sum_a_l_r_geq_2[i-1])
                )
            );
            const auto new_array = _mm256_and_si256(next_sum_is_2_or_3, _mm256_or_si256(last_table[i], next_sum_mod_2));
            _mm256_store_si256((bitarray_x8_t *)(tables.data[d][i_data] + proc), new_array);
        }
    }
}
```

At this point, updating the table was no longer a bottleneck, and we needed to speed up other parts such as random sampling.

With these accelerations, my solver can compute about **40M iterations/sec** (summed over all processors, i.e. 2M iter/proc/sec).

(I wonder how fast it would run if this algorithm were implemented on a GPU, but I didn't have enough time to try.)
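Of the optimizations listed in this section, the partial update is the easiest to check in isolation. The sketch below caches every generation and, after one cell flip, recomputes only a `(2d+1) x (2d+1)` window in generation `d`. For `delta = 5` that is 285 cell evaluations instead of 3125. This is a pure-Python illustration with hypothetical function names. The C++ code above recomputes whole packed rows rather than a square window.

```python
import random

SIZE = 25


def next_cell(grid, i, j):
    """Conway rule for one cell on a wrapping board."""
    count = sum(grid[(i + di) % SIZE][(j + dj) % SIZE]
                for di in (-1, 0, 1) for dj in (-1, 0, 1)) - grid[i][j]
    return 1 if count == 3 or (count == 2 and grid[i][j] == 1) else 0


def full_history(start, delta):
    """Generations 0..delta, each computed from scratch."""
    gens = [[row[:] for row in start]]
    for _ in range(delta):
        prev = gens[-1]
        gens.append([[next_cell(prev, i, j) for j in range(SIZE)]
                     for i in range(SIZE)])
    return gens


def flip_and_update(gens, i0, j0):
    """Flip one start cell and repair the cached generations in place."""
    gens[0][i0][j0] ^= 1
    for d in range(1, len(gens)):
        prev, cur = gens[d - 1], gens[d]
        # After d steps the flip can only have reached cells within distance d.
        for di in range(-d, d + 1):
            for dj in range(-d, d + 1):
                i, j = (i0 + di) % SIZE, (j0 + dj) % SIZE
                cur[i][j] = next_cell(prev, i, j)


rng = random.Random(0)
start = [[rng.randint(0, 1) for _ in range(SIZE)] for _ in range(SIZE)]
gens = full_history(start, 5)
flip_and_update(gens, 3, 24)
assert gens == full_history(gens[0], 5)  # matches a full recomputation
```

If the flip is rejected by the annealing step, calling `flip_and_update` again with the same cell restores the cache.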

## time-homogeneous parallel simulated annealing

To reduce the difficulty of temperature control, I used the idea of **time-homogeneous parallel** simulated annealing, described in https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7865&rep=rep1&type=pdf .

I set the temperatures to `np.logspace(np.log10(20), np.log10(0.1), 32)`, ran this solver for about 25 days, and achieved a score of about 0.015.

- `25 * 86400(sec/day) * 20(processors) / 50000 (problems) = 864(cpu-sec/problem)`
- `864 * 2M = 1.728 B(iter/problem)`
    - some problems were solved quickly, so I executed about 3B iterations for each of the remaining problems
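For reference, the temperature ladder above is a geometric sequence, which can be reproduced without NumPy. The second function is a sketch of a replica-exchange style swap rule between two neighboring temperatures. The write-up does not state the exact exchange rule that was used, so `swap_probability` is an assumption for illustration only, and both function names are hypothetical.

```python
import math


def temperature_ladder(hottest=20.0, coldest=0.1, count=32):
    """Geometrically spaced temperatures from hottest down to coldest."""
    ratio = (coldest / hottest) ** (1.0 / (count - 1))
    return [hottest * ratio ** k for k in range(count)]


def swap_probability(score_cold, temp_cold, score_hot, temp_hot):
    """Assumed exchange rule when maximizing score.

    The two states are always swapped when the hotter replica holds the
    better score, and otherwise with an exponentially small probability.
    """
    exponent = (1.0 / temp_cold - 1.0 / temp_hot) * (score_hot - score_cold)
    return 1.0 if exponent >= 0 else math.exp(exponent)


ladder = temperature_ladder()
print([round(t, 3) for t in ladder[:3]], "...", [round(t, 3) for t in ladder[-3:]])
print(swap_probability(score_cold=600, temp_cold=ladder[-1],
                       score_hot=605, temp_hot=ladder[-2]))
```

With a fixed ladder like this, no cooling schedule has to be tuned: each temperature keeps running indefinitely, and good states migrate toward the cold end.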

## combine with other solvers

In addition, I used the z3 solver https://www.kaggle.com/jamesmcguigan/game-of-life-z3-constraint-satisfaction for unsolved delta=2 problems, because the discussion https://www.kaggle.com/c/conways-reverse-game-of-life-2020/discussion/190796 says that such solvers can easily solve low-delta problems. With this, about 800 problems were solved in 4 days, and I achieved a score of 0.01315, which is my best score.

Finally, my submission was combined with the submission of my teammate @markuskarmann, and we achieved the great (and so amusing) score of 0.01234.

His solution is written up in https://www.kaggle.com/markuskarmann/3rd-place-solution-part ; it is very interesting.
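The write-up does not say how the submissions were merged. One natural approach, assumed here, is to keep for each problem the start board with the highest number of matching cells. The sketch below uses hypothetical names and toy data, and `mean_error` is the same quantity that the progress bar in `solve()` displays.

```python
CELLS = 25 * 25


def merge_best(*solver_results):
    """Each argument maps problem id -> (score, start_board); keep the best score per id."""
    merged = {}
    for results in solver_results:
        for problem_id, (score, start) in results.items():
            if problem_id not in merged or score > merged[problem_id][0]:
                merged[problem_id] = (score, start)
    return merged


def mean_error(merged):
    """Fraction of mismatched cells over all problems (lower is better)."""
    total = sum(score for score, _ in merged.values())
    return 1 - total / (CELLS * len(merged))


# Toy data: board contents are placeholders, scores are matching-cell counts.
annealing = {1: (625, "board-a1"), 2: (600, "board-a2")}
other = {2: (625, "board-b2"), 3: (610, "board-b3")}
merged = merge_best(annealing, other)
print(merged)
print(mean_error(merged))
```

Since each problem is scored independently, a per-problem maximum can never be worse than either input submission.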

## statistics

As the following graph shows, many problems remain unsolved. I wonder whether the result would improve if this solver were combined with many other solvers.

In [None]:
from IPython.display import Image
Image("../input/conway-final-submission-and-stats/score_nonsolve_20201130.png")