# Reversing the Game of Life problem
## An NP-hard problem
Reversing the Game of Life (GoL) is an NP-hard problem: finding a predecessor grid can be framed as a SAT problem. As a consequence, it is very hard to solve with deep learning. Many public kernels use various CNNs and reach an MAE of around 0.11 at best.

## The approach of this kernel 👀
This kernel implements a probabilistic, differentiable version of GoL in order to approximate a solution using gradient descent. It can find perfect solutions for simple grids, even with multiple deltas. Most of the time, however, the grids are too complex for the optimization to converge, but the solutions obtained are still better than those of NNs. For the very complex grids, we just submit whichever is better between a blank grid and the given grid.

👉 This simple code can easily reach the 0.09 - 0.10 MAE range with 2 hours of GPU, and even below with hyperparameter tuning and minor modifications.

In [None]:
"""
    Imports
"""

import numpy as np
import pandas as pd
from random import randint
from math import sqrt
import torch
import torch.nn.functional as Func
import matplotlib.pyplot as plt
import time
import gc

In [None]:
# Let's keep track of the time to stay in the 2 hour GPU limit
ts = time.time()

In [None]:
# Standard block of code to use the GPU if it's available
if torch.cuda.is_available(): 
    dev = "cuda:0" 
else:  
    dev = "cpu"
device = torch.device(dev)

In [None]:
"""
    Hyperparameters
"""
# Max number of forward GoL steps
max_delta = 5
# To debug the kernel, set this to True so that it does not run on the whole test set
test_mode = False
# Learning rate
lr = 1.0
# Max steps per sample
max_steps = 1000
# Batch size
batch_size = 5000
# Runtime in seconds with a little margin
t_run = 3600 * 2 - 30 * 60

In [None]:
"""
    Game of life functions (With batch version)
"""

def get_padded_version_n(X):
    """
        Apply circular padding to a batch of grids
    """
    X_pad = np.zeros((X.shape[0], X.shape[-2] + 2, X.shape[-1] + 2), dtype=X.dtype)
    X_pad[:, 1:-1,1:-1] += X
    
    X_pad[:, 0, 1:-1] = X[:, -1, :]
    X_pad[:, -1, 1:-1] = X[:, 0, :]
    
    X_pad[:, 1:-1, 0] = X[:, :, -1]
    X_pad[:, 1:-1, -1] = X[:, :, 0]
    
    X_pad[:, 0, 0] = X[:, -1, -1]
    X_pad[:, 0, -1] = X[:, -1, 0]
    X_pad[:, -1, 0] = X[:, 0, -1]
    X_pad[:, -1, -1] = X[:, 0, 0]
    
    return X_pad

def nConv2d_sw_3x3(X):
    """
        Convolve a batch of grids with a 3x3 filter of ones
    """
    X_pad = get_padded_version_n(X)
    N = np.zeros_like(X_pad)
    
    N[:, 1:, 1:] += X_pad[:,:-1,:-1]
    N[:, 1:, :] += X_pad[:,:-1,:]
    N[:, 1:, :-1] += X_pad[:,:-1,1:]

    N[:, :, 1:] += X_pad[:,:,:-1]
    N[:, :, :] += X_pad[:,:,:]
    N[:, :, :-1] += X_pad[:,:,1:]

    N[:, :-1, 1:] += X_pad[:,1:,:-1]
    N[:, :-1, :] += X_pad[:,1:,:]
    N[:, :-1, :-1] += X_pad[:,1:,1:]
    
    N = N[:,1:-1,1:-1]
    
    return N

def life_step(X):
    """
        Forward iteration of game of life
    """
    N =  nConv2d_sw_3x3(X) - X
    return np.logical_or(N == 3, np.logical_and(X, N==2)).astype(np.uint8)

# Probabilistic Forward Differentiable Game of life
In order to use gradient descent, we need a continuous and differentiable version of the Game of Life step function. A good candidate for this purpose is the probabilistic version of GoL.

## PFDGol formula 🤓
PFDGol takes as input a grid of values in the range [0, 1]. Those values represent the probability of each cell being alive. As the classical version of GoL is computed from the number of alive neighbors and the current state of the cell, the formula of the probabilistic version is:

$$
    P(N = 2) \cdot I + P(N = 3)
$$

With $I$ being the probability of the cell being alive on the initial grid and $N$ being the number of alive neighbors of the cell. As the first one is given, we just have to find a way to compute $P(N=n)$.
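
One way to see where this formula comes from, assuming the state of the cell is independent of the states of its neighbors: a cell is alive at the next step either if it is alive now and has 2 or 3 alive neighbors, or if it is dead now and has exactly 3 alive neighbors. A short derivation sketch:

$$
\begin{aligned}
P(\text{alive at next step}) &= I \cdot \big(P(N = 2) + P(N = 3)\big) + (1 - I) \cdot P(N = 3) \\
&= P(N = 2) \cdot I + P(N = 3)
\end{aligned}
$$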

## Probability of a cell having n alive neighbors
To compute $P(N=n)$, we consider the likelihood of every possible scenario for the neighborhood of the cell. By scenario, I mean a configuration of the neighborhood with exactly n alive neighbors.
Assuming the cells of the initial state are independent, we have:

$$
    P(S^{x,y}) = \prod_{(i, j) \in \{-1, 0, 1\}^2 , (i, j) \neq (0, 0)} I_{x + i, y + j}^{S_{i,j}} \, (1 - I_{x + i, y + j})^{1 - S_{i,j}}
$$

With $S^{x,y}$ being the occurrence of a scenario at cell $x, y$ and $S_{i,j} \in \{0, 1\}$ being the state of the cell in the considered scenario.
Then to obtain $P(N=n)$, just sum over all possible scenarios with n alive neighbors (they are mutually exclusive, so their probabilities add up).
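
As an illustration, the enumeration of scenarios can be sketched in plain Python for a single cell. This is not the implementation used in this kernel (which is vectorized with PyTorch); the function names `p_n_alive` and `p_alive_next` are hypothetical and only serve the example.

```python
from itertools import product

def p_n_alive(neigh_probs, n):
    """Probability that exactly n neighbors are alive, by enumerating scenarios.

    neigh_probs holds the alive probability of each of the 8 neighbors,
    assumed independent.
    """
    total = 0.0
    for scenario in product((0, 1), repeat=len(neigh_probs)):
        if sum(scenario) != n:
            continue
        p = 1.0
        for state, prob in zip(scenario, neigh_probs):
            # Alive in the scenario: use prob; dead in the scenario: use 1 - prob
            p *= prob if state else 1.0 - prob
        total += p
    return total

def p_alive_next(center_prob, neigh_probs):
    """Probabilistic GoL rule for one cell: P(N = 2) * I + P(N = 3)."""
    return p_n_alive(neigh_probs, 2) * center_prob + p_n_alive(neigh_probs, 3)

# With probabilities equal to 0 or 1, the rule reduces to the classical one
survives = p_alive_next(1.0, [1, 1, 0, 0, 0, 0, 0, 0])  # alive cell, 2 neighbors
born = p_alive_next(0.0, [1, 1, 1, 0, 0, 0, 0, 0])      # dead cell, 3 neighbors
dies = p_alive_next(1.0, [1, 0, 0, 0, 0, 0, 0, 0])      # alive cell, 1 neighbor
print(survives, born, dies)
```

The enumeration visits all $2^8 = 256$ neighborhood configurations per cell, which is fine for checking the formula but far too slow for a batch of 25x25 grids.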

## Little trick to go faster 🚀
A considerable speedup is possible by taking advantage of PyTorch's highly optimized conv2d function. The trick consists of convolving 3x3 binary filters with the $\log$ of the probabilities, so that the products above become sums. To do so, we split the calculation into three phases:
1. calculate $\log P(N \geq n)$:

$$
    \log P(N \geq n) = Conv2d(\log(I), f)
$$

With $f$ being a 3x3 filter representing the considered scenario ($f_{i, j} = 1$ when $cell_{i, j}$ is alive in the scenario, and 0 otherwise). Here, you multiply (add in log space) the probabilities that the cells at the locations of the alive cells in the scenario are actually alive.
2. calculate $\log P(N \leq n)$:

$$
    \log P(N \leq n) = Conv2d(\log(1 - I), 1 - f)
$$

Here it is the opposite: you multiply the probabilities that the cells at the locations of the dead cells in the scenario are actually dead. (The center of $1 - f$ is set to 0, since the cell itself is not one of its neighbors.)
3. Merge the two for each scenario, then sum over all scenarios $f$ with $n$ alive cells:
$$
    P(N = n) = \sum_{f} \exp\left(\log P(N \geq n) + \log P(N \leq n)\right)
$$
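
The three phases above can be checked on a single neighborhood with a minimal plain-Python sketch. It only illustrates the log-space identity for one scenario; in the kernel the weighted sums are computed for every cell and every scenario at once by conv2d. The names `scenario_prob_direct` and `scenario_prob_log` and the sample values are assumptions made for this example, and the probabilities must lie strictly between 0 and 1 for the logs to be finite.

```python
import math

def scenario_prob_direct(probs, f):
    """Probability of one scenario f, as a plain product over the 8 neighbors."""
    p = 1.0
    for prob, alive in zip(probs, f):
        p *= prob if alive else 1.0 - prob
    return p

def scenario_prob_log(probs, f):
    """Same probability, computed as weighted sums in log space."""
    # Phase 1: log-probability that the cells alive in the scenario are alive
    log_alive = sum(a * math.log(p) for p, a in zip(probs, f))
    # Phase 2: log-probability that the cells dead in the scenario are dead
    log_dead = sum((1 - a) * math.log(1.0 - p) for p, a in zip(probs, f))
    # Phase 3: merge (a sum in log space is a product of probabilities)
    return math.exp(log_alive + log_dead)

# Hypothetical alive probabilities of the 8 neighbors, and a scenario with 2 alive cells
probs = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.3, 0.8]
f = [1, 0, 1, 0, 0, 0, 0, 0]

direct = scenario_prob_direct(probs, f)
via_log = scenario_prob_log(probs, f)
print(direct, via_log)
```

Both values agree up to floating-point rounding, which is why a convolution over log-probabilities can replace the explicit product.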

## Limitations 😕
While this model is correct for one forward iteration, it is not (I think so, but didn't try to prove it) when you apply it recursively multiple times, as our independence assumptions no longer hold.

Also, as gradient descent relies on partial derivatives, there are a lot of non-optimal local minima.
For example in the case of:
```python
I = [
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0]
]
```
you will obtain $\frac{\partial MAE(PFDGol(I), F)}{\partial I_{x, y}} = 0, \forall F, x, y$, since changing a single cell is not enough: a dead cell needs 3 alive neighbors to become alive in the next step.
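
The vanishing gradient on a blank neighborhood can be illustrated with a finite-difference check on a single cell. This is a plain-Python sketch under the same independence assumption, not the autograd computation used in the kernel; `p_n_alive`, `p_alive_next` and `finite_diff` are hypothetical helper names.

```python
from itertools import product

def p_n_alive(neigh_probs, n):
    """Probability that exactly n neighbors are alive (scenario enumeration)."""
    total = 0.0
    for scenario in product((0, 1), repeat=len(neigh_probs)):
        if sum(scenario) != n:
            continue
        p = 1.0
        for state, prob in zip(scenario, neigh_probs):
            p *= prob if state else 1.0 - prob
        total += p
    return total

def p_alive_next(center_prob, neigh_probs):
    """Probabilistic GoL rule for one cell: P(N = 2) * I + P(N = 3)."""
    return p_n_alive(neigh_probs, 2) * center_prob + p_n_alive(neigh_probs, 3)

def finite_diff(center_prob, neigh_probs, index, eps=1e-6):
    """Finite-difference slope of the output w.r.t. one neighbor probability."""
    bumped = list(neigh_probs)
    bumped[index] += eps
    return (p_alive_next(center_prob, bumped) - p_alive_next(center_prob, neigh_probs)) / eps

# Blank neighborhood: nudging any single neighbor leaves the output at 0
blank_grads = [finite_diff(0.0, [0.0] * 8, k) for k in range(8)]
# Uniform 0.5 neighborhood: the slope is non-zero, so gradient descent can move
mixed_grad = finite_diff(0.5, [0.5] * 8, 0)
print(blank_grads, mixed_grad)
```

This is consistent with initializing the grids randomly rather than with zeros in the gradient descent loop.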

In [None]:
"""
    Probabilistic (differentiable) Game of life functions with pytorch
"""

# All possible configurations of neighborhood (flattened)
bits=np.unpackbits(np.expand_dims(np.arange(2**8, dtype=np.uint8), axis=1), axis=1)

filters, filters_i = {}, {}
for k in [2, 3]:
    # All possible configurations of a neighborhood containing k cells alive
    poss_n = bits[bits.sum(axis=-1) == k]
    # All possible neighborhood for a cell with k neighbors (reshaped on 3x3)
    poss_neigh = np.insert(poss_n, 4, 0, axis=1).reshape(len(poss_n), 3, 3)
    # Save it as a torch tensor
    filters[k] = torch.tensor(poss_neigh, dtype=torch.float32, device=device).view(len(poss_n), 1, 3, 3)
    # Negation of the previous
    poss_neigh_comp = np.insert(1 - poss_n, 4, 0, axis=1).reshape(len(poss_n), 3, 3)
    # Save it as a torch tensor
    filters_i[k] = torch.tensor(poss_neigh_comp, dtype=torch.float32, device=device).view(len(poss_n), 1, 3, 3)

def p_n_f(n, I):
    """
        Return the probability of each cell having n neighbors
    """
    ## Circular padding + log (Trick: in log space, multiplication becomes addition)
    # Log of the complement of the alive probabilities in the initial state
    p_F_0_ln = Func.pad(torch.log(1 - I), pad=(1, 1, 1, 1), mode='circular')
    # Log of the probability of each cell being alive in the initial state
    p_F_1_ln = Func.pad(torch.log(I), pad=(1, 1, 1, 1), mode='circular')
    
    ## As we are in log space, the convolution becomes a likelihood calculation among neighbors
    ## Each filter represents one neighborhood configuration
    ## Filters are NOT in log space, so we are just calculating the likelihood of a partial neighborhood
    # Probability of a cell having n or fewer neighbors (convolve the complement of I with the complement of the neighborhood filters)
    p_Nf_lte_n_ln = Func.conv2d(p_F_0_ln, filters_i[n])
    # Probability of a cell having n or more neighbors (convolve I with the neighborhood filters)
    p_Nf_se_n_ln = Func.conv2d(p_F_1_ln, filters[n])
    # Now merge the two: p(N <= n) * p(N >= n) -> p(N == n)
    p_Nf_n_ln = p_Nf_lte_n_ln + p_Nf_se_n_ln
    # Now sum the likelihoods of all possible n-neighbor configurations to obtain the result
    return torch.sum(torch.exp(p_Nf_n_ln), dim=-3)

def p_life_step(I):
    """
        Probabilistic differentiable forward iteration of game of life
    """
    # I use logs here, so I tried to handle 0s and 1s by replacing them with very small values, but it doesn't seem to improve performance
    n = len(I)
    I = I.view((n, 1, 25, 25))
    p_n_2 = p_n_f(2, I)
    p_n_3 = p_n_f(3, I)
    # The well known formula
    return p_n_2 * I.view(n, 25, 25) + p_n_3


In [None]:
"""
    Test set
"""
# Let's read the dataset
test = pd.read_csv("../input/conways-reverse-game-of-life-2020/test.csv", index_col='id').values
# Those are the deltas
test_delta = test[:,0]
# Those are the states to reach
X_test = test[:, 1:626].reshape((-1, 25, 25))

print(test_delta.shape, X_test.shape)

In [None]:
"""
    Choose the input of the function based on mode of the kernel
"""

if test_mode:
    n = 1000
    sample_idx = np.random.permutation(np.arange(len(X_test)))[:n]
    F = X_test[sample_idx]
    d = test_delta[sample_idx]
else:
    F = X_test
    d = test_delta

In [None]:
"""
    Core of the code (Gradient descent)
"""

# Initial states
If = np.zeros_like(F)
# Selector
selector = np.arange(len(If))
# Score of each saved initial state to keep only if we found a better one
S = np.ones_like(selector, dtype=np.float32)

while 1:
    # As memory is limited by Autograd, we have to execute batch by batch
    # This is a boolean mask to keep track of which grids are in the current batch
    runnings = np.zeros(len(S)).astype(np.bool_)
    # Choose the boards that have the worst score
    # (the quantile is clamped to 0 so that test_mode also works when batch_size > len(S))
    q = max(0.0, 1 - (batch_size / len(S)))
    runnings[np.arange(len(S))[S >= np.quantile(S, q)][:batch_size]] = True
    # Targets of the current batch
    F_step = F[runnings]
    # Initial states (autograd variable) (initialized randomly as zero grid is a local optimum)
    I = torch.rand(F_step.shape, dtype=torch.float32, requires_grad=True, device=device)
    # Targets of the current batch as tensors
    Ft = torch.tensor(F_step, dtype=torch.float32, device=device)
    # Optimizer is plain gradient descent (no momentum is set, and it's not stochastic here)
    optimizer = torch.optim.SGD([I], lr=lr)
    
    for s in range(max_steps):
        Fh = torch.zeros(I.size(), dtype=torch.float32, device=device)
        # Uses sigmoid to squeeze the values in ]0, 1[ as we use log
        current = torch.sigmoid(I)
        # Forward passes
        for i in range(max_delta):
            current = p_life_step(current)
            current_m = d[runnings] == i + 1
            # Save only those that need i + 1 forward steps
            Fh[current_m] = current[current_m]
        ## There are numerous NaNs because of log so we want to backpropagate only the non-NaNs
        calculable_m = (torch.sum(torch.sum(torch.isnan(Fh), dim=-1), dim=-1) == 0).cpu().numpy()
        loss = torch.sum(torch.square(Fh[calculable_m] - Ft[calculable_m]))
        loss.backward()

        with torch.no_grad():
            optimizer.step()
            optimizer.zero_grad()
            #not_calculable = selector[runnings][np.logical_not(calculable_m)]
            #I[not_calculable] = torch.rand((len(not_calculable), F.shape[-2], F.shape[-1]), dtype=torch.float32, device=device)
            # Evaluate our grids at a given frequency
            if s % 100 == 0:
                # Compute absolute error
                AE = torch.abs(Fh[calculable_m]  - Ft[calculable_m] )
                print("Step {}/{} MAE of step: {}".format(s, max_steps, torch.mean(AE).item()))
                # Apply deterministic forward GoL in the same way as before
                Is = torch.sigmoid(I[calculable_m]).cpu().numpy()
                Iv = (Is > 0.5).astype(np.uint8)
                Fv = np.zeros_like(Iv)
                current = Iv
                for i in range(max_delta):
                    current = life_step(current)
                    current_m = d[runnings][calculable_m] == i + 1
                    Fv[current_m] = current[current_m]
                # Compute MAE with deterministic version of GoL
                mae = np.mean(np.abs((Fv - F[runnings][calculable_m])), axis=(-2, -1))
                # Mask of the better ones
                local_m = mae < S[runnings][calculable_m]
                # Uses the selector to compute a mask of a mask
                If[selector[runnings][calculable_m][local_m]] = Iv[local_m]
                S[selector[runnings][calculable_m][local_m]] = mae[local_m]
                print("MAE w/ round: {}".format(np.mean(mae)))
                print("Overall MAE so far: {}".format(S[S < 1].mean()))
        # Some memory management   
        del Fh
    # Stop if it's time
    if time.time() - ts >= t_run:
        break
        
    # Some other memory management   
    del I
    del Ft
    gc.collect()

In [None]:
"""
    Compute various naive submissions
"""

# Score of a blank submission
blank_score = X_test.mean(axis=(-2, -1))

# Stop submission
stop_scores = np.zeros_like(blank_score)
# Forward passes on the final states 
current_step = np.copy(X_test)
for i in range(max_delta):
    current_step = life_step(current_step)
    stop_scores[test_delta==i+1] = 1 - (current_step[test_delta==i+1] == X_test[test_delta==i+1]).mean(axis=(-2, -1))

print(blank_score.mean())
print(stop_scores.mean())

In [None]:
"""
    Merge the naive submissions
"""

naive_submission = np.zeros_like(X_test)
stop_best_mask = stop_scores < blank_score
naive_submission[stop_best_mask] = X_test[stop_best_mask]
naive_scores = np.copy(blank_score)
naive_scores[stop_best_mask] = stop_scores[stop_best_mask]

print(naive_scores.mean())

In [None]:
"""
    Compute the final submission using main and naive ones
"""

def get_optimized_solution_and_score(y, delta, labels, naive_submission, naive_scores):
    """
        Return best submission using ys and naive submission
    """
    print("Scores from prediction")
    res = np.zeros_like(y)
    scores = []
    current_step = np.copy(y)
    for i in range(5):
        current_step = life_step(current_step)
        scores.append(current_step[delta==i+1] == labels[delta==i+1])
        print("Actual score for delta={}: {}".format(i+1, scores[-1].mean()))
        res[delta == i+1] = current_step[delta == i+1]
    print("Actual LB score = ", 1 - (res == labels).mean())
    y_scores = 1 - (labels == res).mean(axis=(-2, -1))
    naive_mask = naive_scores < y_scores
    y_final = np.copy(y)
    y_final[naive_mask] = naive_submission[naive_mask]
    f_score = np.copy(y_scores)
    f_score[naive_mask] = naive_scores[naive_mask]
    print("LB score estimation = ", f_score.mean())
    return y_final, f_score.mean()

In [None]:
"""
    Make the submission
"""
y_final, score = get_optimized_solution_and_score(If, test_delta, X_test, naive_submission, naive_scores)
print(score, naive_scores.mean() - score)

submission = pd.read_csv("../input/conways-reverse-game-of-life-2020/sample_submission.csv", index_col='id')
submission_values = y_final.reshape((len(submission), 625))
# Overwrite every cell column of the sample submission in one assignment
submission.iloc[:, :] = submission_values
submission.to_csv("submission.csv")