# Trying to hack the LB


## Evaluation code published by Kaggle

We can compute the Kendall tau correlation between the predicted cell orders and the ground truth cell orders by counting how many swaps of adjacent cells are needed to sort the predicted order into the ground truth order.

A pair \\(i, j\\) of indices is called an **inversion** within a numeric sequence \\(A\\) when \\(i < j\\) but \\(A[i] > A[j]\\). An inversion indicates that a pair of numbers in the sequence is *out of order*. The number of swaps needed to correctly sort the predictions turns out to be equivalent to the number of inversions in its ranking of the cells relative to the ground-truth ranking.

The following cell shows an intuitive, but rather slow (\\(O(n^2)\\)) way to count inversions in a list of ranks.

In [None]:
def count_inversions_slowly(ranks):
    inversions = 0
    size = len(ranks)
    for i in range(size):
        for j in range(i+1, size):
            if ranks[i] > ranks[j]:
                inversions += 1
    return inversions

This implementation is much faster, though theoretically also \\(O(n^2)\\). (You might enjoy reviewing other inversion counting algorithms from [this StackOverflow post](https://stackoverflow.com/a/47845960).)

In [None]:
from bisect import bisect


# Actually O(N^2), but fast in practice for our data
def count_inversions(a):
    inversions = 0
    sorted_so_far = []
    for i, u in enumerate(a):  # O(N)
        j = bisect(sorted_so_far, u)  # O(log N)
        inversions += i - j
        sorted_so_far.insert(j, u)  # O(N)
    return inversions
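For comparison, here is a minimal sketch of an \\(O(n \log n)\\) counter based on merge sort. It is not part of the Kaggle evaluation code; the name `count_inversions_mergesort` is made up for this illustration. Like the `bisect` version above, it does not count ties as inversions.

```python
def count_inversions_mergesort(a):
    """Count pairs i < j with a[i] > a[j] using merge sort (hypothetical helper)."""

    def sort_and_count(seq):
        # A sequence of length 0 or 1 is sorted and has no inversions.
        if len(seq) <= 1:
            return list(seq), 0
        mid = len(seq) // 2
        left, inv_left = sort_and_count(seq[:mid])
        right, inv_right = sort_and_count(seq[mid:])
        merged = []
        inversions = inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # right[j] is smaller than every remaining element of left,
                # so it forms an inversion with each of them.
                merged.append(right[j])
                j += 1
                inversions += len(left) - i
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inversions

    return sort_and_count(list(a))[1]
```

For example, `count_inversions_mergesort([2, 0, 1])` counts the pairs `(2, 0)` and `(2, 1)`.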

To compute the Kendall tau correlation, we sum up the inversions across all predictions and also the worst-case number of inversions across all predictions, and apply the following formula:
\\[K = 1 - 4 \frac{\sum_i S_{i}}{\sum_i n_i(n_i - 1)}\\]
where \\(S_i\\) is the number of inversions in the predicted ranks and \\(n_i\\) is the number of cells for notebook \\(i\\).
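
As a quick sanity check of the formula (a made-up single notebook, not data from the competition): with \\(n = 4\\) cells and \\(S = 2\\) inversions,
\\[K = 1 - 4 \cdot \frac{2}{4 \cdot 3} = 1 - \frac{8}{12} = \frac{1}{3}\\]
A perfect prediction has \\(S = 0\\), so \\(K = 1\\); a fully reversed prediction has \\(S = n(n-1)/2\\), so \\(K = 1 - 2 = -1\\).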

In [None]:
def kendall_tau(ground_truth, predictions):
    total_inversions = 0  # total inversions in predicted ranks across all instances
    total_2max = 0  # twice the maximum possible inversions across all instances
    for gt, pred in zip(ground_truth, predictions):
        ranks = [gt.index(x) for x in pred]  # rank predicted order in terms of ground truth
        total_inversions += count_inversions(ranks)
        n = len(gt)
        total_2max += n * (n - 1)
    return 1 - 4 * total_inversions / total_2max

## My code

Let's read a small set of notebooks.

In [None]:
import pandas as pd
from pathlib import Path

def read_notebook(path):
    return (
        pd.read_json(
            path,
            dtype={'cell_type': 'category', 'source': 'str'})
            .assign(id=path.stem)
            .rename_axis('cell_id')
    )

data_dir = Path('../input/AI4Code')

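# `squeeze=True` was deprecated in read_csv and removed in pandas 2.0; squeeze the single column explicitly
df_orders = pd.read_csv(data_dir / 'train_orders.csv', index_col='id').squeeze('columns').str.split()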

paths_train = list((data_dir / 'train').glob('*.json'))
paths_train = paths_train[:10]  # 10 notebooks are enough
notebooks_train = [read_notebook(path) for path in paths_train]
df_notebooks = pd.concat(notebooks_train).set_index('id', append=True).swaplevel().sort_index(level='id', sort_remaining=False)

Checking the Kendall tau metric on this set of notebooks:

In [None]:
df_allcells = df_notebooks.reset_index('cell_id').groupby('id')['cell_id'].apply(list).to_frame().join(df_orders,how='inner')
kendall_tau(df_allcells.cell_order, df_allcells.cell_id)

That's OK. But what happens if we keep only the code cells? Remember that the code cells are already given in the correct relative order.

In [None]:
df_code = df_notebooks.query('cell_type == "code"').reset_index('cell_id').groupby('id')['cell_id'].apply(list)
df_testbug = df_code.to_frame().join(df_orders,how='inner')
kendall_tau(df_testbug.cell_order, df_testbug.cell_id)

Oops! The `kendall_tau` function does not check that `ground_truth` and `predictions` have the same length, so it happily scores a prediction that contains only the code cells.
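
To see why the missing check matters, here is a toy example with made-up cell ids. The prediction drops the markdown cells, yet its ranks are still sorted, so the formula sees zero inversions out of the full \\(n(n-1)\\) denominator.

```python
# Toy data (hypothetical cell ids): ground truth has 4 cells, prediction only the 2 code cells.
gt = ["c1", "m1", "c2", "m2"]
pred = ["c1", "c2"]

# Rank the predicted cells by their position in the ground truth.
ranks = [gt.index(x) for x in pred]

# Count inversions with the simple quadratic definition.
inversions = sum(
    1
    for i in range(len(ranks))
    for j in range(i + 1, len(ranks))
    if ranks[i] > ranks[j]
)

n = len(gt)
score = 1 - 4 * inversions / (n * (n - 1))
print(ranks, inversions, score)
```

The truncated prediction gets the same score as a perfect one, which is the behaviour the length check below is meant to catch.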

In [None]:
def my_kendall_tau(ground_truth, predictions):
    total_inversions = 0  # total inversions in predicted ranks across all instances
    total_2max = 0  # twice the maximum possible inversions across all instances
    for gt, pred in zip(ground_truth, predictions):
        assert len(gt) == len(pred) # <-- CHECK length
        ranks = [gt.index(x) for x in pred]  # rank predicted order in terms of ground truth
        total_inversions += count_inversions(ranks)
        n = len(gt)
        total_2max += n * (n - 1)
    return 1 - 4 * total_inversions / total_2max

In [None]:
my_kendall_tau(df_allcells.cell_order, df_allcells.cell_id)

In [None]:
try:
    my_kendall_tau(df_testbug.cell_order, df_testbug.cell_id)
except AssertionError:  # raised by the length check
    print("OK!")


We checked this issue against the LB, and the submission failed. See [this thread](https://www.kaggle.com/code/ryanholbrook/competition-metric-kendall-tau-correlation/comments#1810701).

But what happens if we repeat the last code cell until the end? Like this:

In [None]:
my_kendall_tau([['c1','c2','c3','m1','m2','m3']],[['c1','c2','c3','c3','c3','c3']])
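
The padded prediction passes the length check because its ranks are `[0, 1, 2, 2, 2, 2]`, and repeated values are not counted as inversions. A stricter validation would require each prediction to be a permutation of the ground truth. The sketch below is only an assumption about what such a check could look like; `strict_kendall_tau` is a hypothetical name, and this is not claimed to be what Kaggle's scorer does.

```python
from bisect import bisect


def count_inversions(a):
    # Same bisect-based counter as in the published evaluation code.
    inversions = 0
    sorted_so_far = []
    for i, u in enumerate(a):
        j = bisect(sorted_so_far, u)
        inversions += i - j
        sorted_so_far.insert(j, u)
    return inversions


def strict_kendall_tau(ground_truth, predictions):
    """Hypothetical variant that rejects missing, extra, or duplicated cells."""
    total_inversions = 0
    total_2max = 0
    for gt, pred in zip(ground_truth, predictions):
        # A valid prediction must contain exactly the same cells as the ground truth.
        if sorted(gt) != sorted(pred):
            raise ValueError("prediction is not a permutation of the ground truth")
        ranks = [gt.index(x) for x in pred]
        total_inversions += count_inversions(ranks)
        n = len(gt)
        total_2max += n * (n - 1)
    return 1 - 4 * total_inversions / total_2max
```

With this check, the padded prediction `['c1', 'c2', 'c3', 'c3', 'c3', 'c3']` raises a `ValueError` instead of being scored.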

In [None]:
paths_test = list((data_dir / 'test').glob('*.json'))
notebooks_test = [ read_notebook(path) for path in paths_test]
df_notebooks = pd.concat(notebooks_test).set_index('id', append=True).swaplevel().sort_index(level='id', sort_remaining=False)
df_code = df_notebooks.query('cell_type == "code"').reset_index('cell_id').groupby('id')['cell_id'].apply(list).reset_index()
df_ncell = df_notebooks.groupby('id').source.count().rename('count').reset_index()
df_code['count'] = df_ncell['count']
df_code['cell_padded'] = df_code.apply(lambda r: r.cell_id + [r.cell_id[-1]] * (r['count'] - len(r.cell_id)), axis=1)
df_code['cell_order'] = df_code.cell_padded.apply(lambda x: " ".join(x))

In [None]:
submission = df_code[['id','cell_order']]
submission

Submission **failed** again!!

So the metric on the LB works fine.

Hmm! After my last regular submission, I have the intuition that Kaggle is filtering out perfect submissions. I'm going to force a little error into the submission file.

In [None]:
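df = df_notebooks.reset_index('cell_id').groupby('id')['cell_id'].apply(list).reset_index()
submission = submission.copy()  # work on an explicit copy, not a slice of df_code
# Assigning via `submission.iloc[-1].cell_order = ...` would only change a temporary row; index the cell directly
submission.iloc[-1, submission.columns.get_loc('cell_order')] = " ".join(df.iloc[-1].cell_id)
submission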

In [None]:
submission.to_csv("submission.csv", index=False)

Submission **failed** again!!