# Introduction

This is a fun little baselining experiment, to get a feel for the competition and the Kendall-tau metric.

+ We'll create a **very simple** random generative model of notebook ranks, perturbed in the same way as the notebooks presented to us in the training data.
+ We'll then consider what happens when we take different **markdown cell merging strategies** in order to make our predictions. 

**We'll run 3 experiments:**

1. Predict the orders unchanged from the cell orders we receive in the data.
2. Randomly order the markdown cells and randomly interleave them amongst the ordered code cells.
3. Correctly order the markdown cells and randomly interleave them amongst the ordered code cells.


In [None]:
import pickle
import numpy as np
import pandas as pd
import plotly.express as px
from scipy import stats

# The Kendall-tau (KT) Metric

First, a couple of definitions from the [organisers' notebook](https://www.kaggle.com/code/ryanholbrook/competition-metric-kendall-tau-correlation):

+ The Kendall tau correlation is: $K = 1 - 4 \frac{\sum_i S_{i}}{\sum_i n_i(n_i - 1)}$ where \\(S_i\\) is the number of inversions in the predicted ranks and \\(n_i\\) is the number of cells for notebook \\(i\\),

and where:

+ A pair \\(i, j\\) of indices is called an **inversion** within a numeric sequence \\(A\\) when \\(i < j\\) but \\(A[i] > A[j]\\). The number of adjacent swaps needed to sort the predictions into the correct order is equal to the number of inversions in their ranking of the cells relative to the ground-truth ranking.
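
To make the definition concrete, here is a minimal brute-force sketch on a single hypothetical four-cell notebook. The names `count_inversions_naive` and `kendall_tau_single` are made up for this illustration; they check every pair directly rather than using a faster method.

```python
from itertools import combinations


def count_inversions_naive(ranks):
    # Count pairs (i, j) with i < j but ranks[i] > ranks[j]
    return sum(a > b for a, b in combinations(ranks, 2))


def kendall_tau_single(ranks):
    # K = 1 - 4 * S / (n * (n - 1)) for one notebook
    n = len(ranks)
    return 1 - 4 * count_inversions_naive(ranks) / (n * (n - 1))


predicted_ranks = [2, 0, 1, 3]  # hypothetical: true ranks listed in predicted order
print(count_inversions_naive(predicted_ranks))  # 2, from the pairs (2, 0) and (2, 1)
print(kendall_tau_single([0, 1, 2, 3]))         # 1.0, a perfect ordering
print(kendall_tau_single([3, 2, 1, 0]))         # -1.0, a fully reversed ordering
```

So the metric runs from -1 (every pair inverted) to 1 (no inversions).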

In [None]:
from bisect import bisect

def count_inversions(a):
    inversions = 0
    sorted_so_far = []
    for i, u in enumerate(a):
        j = bisect(sorted_so_far, u)
        inversions += i - j
        sorted_so_far.insert(j, u)
    return inversions


def kendall_tau(ground_truth, predictions):
    total_inversions = 0 
    total_2max = 0
    for gt, pred in zip(ground_truth, predictions):
        ranks = [gt.index(x) for x in pred]
        total_inversions += count_inversions(ranks)
        n = len(gt)
        total_2max += n * (n - 1)
    return 1 - 4 * total_inversions / total_2max

# Train Data Statistics

I processed the training data earlier. It's a list with one entry per notebook; each entry holds the notebook's cell ranks followed by its cell types, with 0 indicating code and 1 indicating markdown. Note that all notebooks have at least one markdown cell and at least one code cell.

In [None]:
with open('../input/ai4code-corpus-ranks/train_ranks.pkl', 'rb') as f:
    train = pickle.load(f)

Here's an example "notebook":

In [None]:
print("The notebook's cell ranks are:", train[2][0], sep="\n")
print("The notebook's cell types are:", train[2][1], sep="\n")

In [None]:
train_length_df = pd.DataFrame({
    "nb_lengths":[len(l[0]) for l in train],
    "md_lengths":[sum(l[1]) for l in train]
})

px.histogram(train_length_df, x="nb_lengths", nbins=500,
             title="Train data distribution of number of cells",
             color_discrete_sequence=['purple'],
             template='plotly_white')

In [None]:
px.histogram(train_length_df, x="md_lengths", nbins=500,
             title="Train data distribution of number of markdown cells",
             color_discrete_sequence=['purple'],
             template='plotly_white')

In [None]:
train_length_df['md_proportions'] = train_length_df['md_lengths']/ train_length_df['nb_lengths'] 

px.histogram(train_length_df, x="md_proportions", nbins=500,
             title="Train data distribution of proportion of markdown cells",
             color_discrete_sequence=['purple'],
             template='plotly_white')

Let's pull out some simple summary statistics to use with our generative corpus:

In [None]:
train_ncell = train_length_df.nb_lengths.mean()
train_ratio = (train_length_df.md_lengths / train_length_df.nb_lengths).mean()

# Corpus Generating Model

Let's define a simple corpus-generating model! We'll make it hierarchical to add a bit of spice, and we'll use the train set sample statistics as the parameters of the model. As can be seen below, the generative model doesn't fit the shape of the train distribution well yet. We could try harder (and could even set up a nice Bayesian model with priors), but... maybe later!

In [None]:
rng = np.random.default_rng(seed=200522)

First, we'll make a "notebook" generating function which outputs:

+ the original cell ranks
+ the perturbed cell ranks* 
+ a list of the code cells
+ a list of the markdown cells

\*with code cells in the correct order, as we'd receive in the dataset

In [None]:
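def generate_notebook(n_cells, n_markdown=None, default_frac=0.2):
    # True cell ranks: 0, 1, ..., n_cells - 1
    original_nb = np.arange(n_cells)
    # Fall back to a fixed fraction of markdown cells when no count is given.
    # Note: `not n_markdown` is also true for n_markdown == 0, so a count of
    # zero triggers the fallback as well.
    if not n_markdown:
        n_markdown = np.floor(n_cells * default_frac).astype('int')
    # Sample which cells are markdown; the sample comes back in random order
    markdown_cells = rng.choice(original_nb, n_markdown, replace=False)
    # The remaining cells are code and keep their original order
    code_cells = np.delete(original_nb, markdown_cells)
    # Perturbed order: code cells in their correct order, then the shuffled markdown cells
    perturbed_nb = np.concatenate([code_cells, markdown_cells])
    notebook = {
        'original' :  list(original_nb),
        'perturbed':  list(perturbed_nb),
        'code'     :  list(code_cells),
        'markdown' :  list(markdown_cells)
    }
    return notebook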

Now we can generate a corpus of notebooks with randomly generated lengths and markdown cell proportions:

In [None]:
def generate_corpus(n_docs, mu=30, prob_md=0.2):
    original_corpus = []
    perturbed_corpus = []
    markdown_corpus = []
    code_corpus = []
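    for i in range(n_docs):
        # Notebook length: a Poisson draw shifted by 2
        mu_i = rng.poisson(mu) + 2
        # Number of markdown cells, given the notebook length
        n_md = rng.binomial(mu_i, prob_md)
        nb_i = generate_notebook(mu_i, n_md)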
        original_corpus.append(nb_i['original'])
        perturbed_corpus.append(nb_i['perturbed'])
        code_corpus.append(nb_i['code'])
        markdown_corpus.append(nb_i['markdown'])
    corpus = {'original' : original_corpus,
              'perturbed': perturbed_corpus,
              'code'     : code_corpus,
              'markdown' : markdown_corpus}
    return corpus
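
Written out, the sampling scheme that `generate_corpus` implements for notebook $i$ is, as far as the code above shows:

$$
n_i = 2 + X_i, \quad X_i \sim \mathrm{Poisson}(\mu), \qquad m_i \mid n_i \sim \mathrm{Binomial}(n_i, p)
$$

Here $n_i$ is the number of cells, $m_i$ the number of markdown cells, $\mu$ is `mu` and $p$ is `prob_md` (a draw of $m_i = 0$ falls through to the default fraction in `generate_notebook`). One consequence is that $\mathbb{E}[n_i] = \mu + 2$, so a corpus generated with `mu = train_ncell` will on average be two cells longer per notebook than the training mean.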

Here's a sample "notebook" of length 15 with ~20% markdown cells:

In [None]:
sample_nb = generate_notebook(15)
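# Print one entry per line (sep has no effect when print gets a single argument)
for key, cells in sample_nb.items():
    print(f"{key}: {cells}")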

Now let's generate a nice big corpus:

In [None]:
corpus = generate_corpus(100000,
                         mu = train_ncell,
                         prob_md = train_ratio)

# Experiments

## Summary of the generated corpus

Right, so we've generated a corpus. First, let's take a look at the distribution of generated notebook lengths:

In [None]:
df = pd.DataFrame({
    "nb_lengths":[len(l) for l in corpus['original']],
    "md_lengths":[len(l) for l in corpus['markdown']]
})

px.histogram(df, x="nb_lengths", nbins=50,
             title="Sample distribution of notebook lengths for the generated corpus")

In [None]:
px.histogram(df, x="md_lengths", nbins=50,
             title="Sample distribution of the number of markdown cells for the generated corpus")

## Experiment 1 - What is the KT for a naive submission?

For this experiment we just predict the cell order as we find it. Our baseline for this corpus is **KT = 0.4490**.

In [None]:
def compute_kendall_stats(ground_truth, predictions):
    kt_pt = []
    for orig, pert in zip(ground_truth, predictions):
        kt_pt.append(kendall_tau([orig], [pert]))
    kt = kendall_tau(ground_truth, predictions)
    return kt, kt_pt
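
One thing worth keeping in mind with these two outputs: the corpus-level score pools inversions and pair counts across notebooks before dividing, so it is generally not the mean of the per-notebook scores, and longer notebooks weigh more. Here is a minimal self-contained sketch with two made-up notebooks, where `kt_pooled` is a hypothetical brute-force stand-in for `kendall_tau`:

```python
from itertools import combinations


def kt_pooled(ground_truth, predictions):
    # Pool inversions and n * (n - 1) over all notebooks, then divide once
    total_inversions = 0
    total_2max = 0
    for gt, pred in zip(ground_truth, predictions):
        ranks = [gt.index(x) for x in pred]
        total_inversions += sum(a > b for a, b in combinations(ranks, 2))
        total_2max += len(gt) * (len(gt) - 1)
    return 1 - 4 * total_inversions / total_2max


# Hypothetical corpus: a short notebook predicted in reverse, a longer one predicted perfectly
truth = [[0, 1], [0, 1, 2, 3, 4]]
preds = [[1, 0], [0, 1, 2, 3, 4]]

pooled = kt_pooled(truth, preds)
pointwise = [kt_pooled([t], [p]) for t, p in zip(truth, preds)]

print("pooled KT:        ", pooled)
print("pointwise KTs:    ", pointwise)
print("mean of pointwise:", sum(pointwise) / len(pointwise))
```

The short notebook scores -1 on its own but contributes only one inversion to the pooled total, so the pooled score stays well above the pointwise mean of 0.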

In [None]:
predictions1 = corpus['perturbed']
kendall_stats_1 = compute_kendall_stats(corpus['original'], predictions1)

print(f"Kendall-tau for the corpus is:\t",
      f"{kendall_stats_1[0]}")
#print(f"{stats.describe(kendall_stats_1[1])}")

df = pd.DataFrame({"kendall_tau": kendall_stats_1[1]})
px.histogram(df, x="kendall_tau", nbins=50,
             title="Sample distribution of pointwise Kendall-tau for experiment 1")
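
Under the toy model, this baseline can also be estimated by hand. For a single notebook with $n$ cells of which $m$ are markdown, code-code pairs are never inverted. Each markdown-markdown pair is inverted with probability 1/2. Each code-markdown pair is also inverted with probability 1/2, because the markdown cell is always predicted after the code cell but is equally likely to sit before or after it in the true order. That gives an expected inversion count of $m(m-1)/4 + m(n-m)/2$, and hence an expected per-notebook KT of $1 - \frac{m(m-1) + 2m(n-m)}{n(n-1)}$, which is roughly $(1-p)^2$ for a markdown share $p = m/n$. The sketch below checks the closed form against a small simulation. The function names, the seed and the sizes are arbitrary choices for illustration, not part of the original notebook.

```python
import random
from itertools import combinations


def expected_kt_naive(n, m):
    # Closed form for the expected per-notebook KT of the unchanged (perturbed) order
    return 1 - (m * (m - 1) + 2 * m * (n - m)) / (n * (n - 1))


def simulate_kt_naive(n, m, trials, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Random subset of markdown cells, returned in random order
        markdown = rng.sample(range(n), m)
        md_set = set(markdown)
        code = [c for c in range(n) if c not in md_set]
        # Naive prediction: ordered code cells followed by shuffled markdown cells
        prediction = code + markdown
        inversions = sum(a > b for a, b in combinations(prediction, 2))
        total += 1 - 4 * inversions / (n * (n - 1))
    return total / trials


print("closed form:", expected_kt_naive(20, 6))
print("simulated:  ", simulate_kt_naive(20, 6, trials=4000))
```

As a rough plausibility check, $(1-p)^2 = 0.449$ corresponds to $p \approx 0.33$. The value of `train_ratio` is not printed above, so this is only indicative.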

## Experiment 2 - What happens to the KT if we randomly interleave the *unordered* markdown cells?

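First we have to make our predictions. Note that this strategy knows **nothing** about the actual contents of the cells, just what type they are! Even so, KT has jumped up to **0.5967**. One caveat: the code cells' sort keys are built from their true ranks (`corpus['code'][i] + 0.5`), so the spacing between code cells carries some information about where the markdown cells originally sat, which a real submission would not have.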

In [None]:
predictions2 = []
for i, nb in enumerate(corpus['original']):
    code_positions = np.array(corpus['code'][i]) + 0.5
    
    md_positions = rng.choice(
        np.arange(len(corpus['original'][i])+1),
        len(corpus['markdown'][i]),
        replace=False
    )
    nb_ranks = np.concatenate([code_positions, md_positions])
    
    pred = np.array(corpus['perturbed'][i])[nb_ranks.argsort()]
    predictions2.append(pred)
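
The sort-key trick used above can be seen on a single toy notebook. This is a minimal sketch that mirrors the loop body; the six-cell notebook, the seed and the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

# Hypothetical 6-cell notebook in which cells 1 and 4 are markdown
code = np.array([0, 2, 3, 5])   # code cells, already in the correct order
markdown = np.array([4, 1])     # markdown cells, shuffled
perturbed = np.concatenate([code, markdown])

# Code cells get half-integer keys (true rank + 0.5); markdown cells get
# distinct integer keys, so no two keys tie and argsort is unambiguous.
code_keys = code + 0.5
md_keys = rng.choice(np.arange(len(perturbed) + 1), len(markdown), replace=False)
keys = np.concatenate([code_keys, md_keys])

# Sorting by key interleaves the markdown cells among the ordered code cells
prediction = perturbed[keys.argsort()]
print("keys:      ", keys)
print("prediction:", prediction)
```

Whatever keys are drawn, the code cells keep their relative order in `prediction`. Only the placement of the markdown cells varies.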

In [None]:
kendall_stats_2 = compute_kendall_stats(corpus['original'], predictions2)

print(f"Kendall-tau for the corpus is:\t",
      f"{kendall_stats_2[0]}")
#print(f"{stats.describe(kendall_stats_2[1])}")

df = pd.DataFrame({"kendall_tau": kendall_stats_2[1]})
px.histogram(df, x="kendall_tau", nbins=50,
             title="Sample distribution of pointwise Kendall-tau for experiment 2")

## Experiment 3 - What happens to KT if we randomly interleave *correctly* ordered markdown cells?

We'll have to do some work in order to correctly order the markdown cells - but look! It **really pays off** in this experiment! KT has rocketed up to **0.9205**.

In [None]:
predictions3 = []
for i, nb in enumerate(corpus['original']):
    code_positions = np.array(corpus['code'][i]) + 0.5
    
    md_positions = rng.choice(
        np.arange(len(corpus['original'][i])+1),
        len(corpus['markdown'][i]),
        replace=False
    )
    
    nb_ranks = np.concatenate([code_positions, sorted(md_positions)])
    
    pred = np.concatenate([ corpus['code'][i], sorted(corpus['markdown'][i]) ])[nb_ranks.argsort()]
    predictions3.append(pred)

In [None]:
kendall_stats_3 = compute_kendall_stats(corpus['original'], predictions3)

print(f"Kendall-tau for the corpus is:\t",
      f"{kendall_stats_3[0]}")
#print(f"{stats.describe(kendall_stats_3[1])}")

df = pd.DataFrame({"kendall_tau": kendall_stats_3[1]})
px.histogram(df, x="kendall_tau", nbins=70,
             title="Sample distribution of pointwise Kendall-tau for experiment 3")

# Summary

1. We've constructed a toy notebook-corpus-generating distribution. 
2. Using a corpus that we sampled from the distribution, we've run three experiments:
    + Experiment 1: 'dummy prediction', i.e. predict the cell orders as they appear in the data (**KT = 0.4490**)
    + Experiment 2: randomly interleave the unordered markdown cells amongst the ordered code cells (**KT = 0.5967**)
    + Experiment 3: randomly interleave the ordered markdown cells amongst the ordered code cells (**KT = 0.9205**)
    
**Hope this takes you somewhere interesting!**