In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm
import os
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.style as style
style.use('fivethirtyeight')
from pathlib import Path

pd.options.display.width = 180
pd.options.display.max_colwidth = 120
pd.options.display.max_rows = None


# Fun start: Find your notebooks and references

Just taking the quick lane by loading Rob Mulla's Parquet files.

In [None]:
%%time
train = pd.read_parquet('../input/ai4code-parquet-tabular/train_all.parquet')
train.head()

So I was curious which of my notebooks are actually in train. Below I made a function that you can use to find them easily.

In [None]:
def find_notebooks(text_piece, head_cells = 8, return_df = True):
    # literal (non-regex) match, so characters like '.' are not treated as wildcards
    df = train[train.source.str.contains(text_piece, regex = False)]
    ids = df.id.unique()
    if len(ids) == 0:
        print('Text not found in any notebook')
        return None
    print(f'This piece of text was found {len(df)} times in {len(ids)} unique notebooks, with ids {ids}')
    if return_df:
        # DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
        notebooks = pd.concat([train.query('id == @i').iloc[:head_cells, :] for i in ids])
        print(f'The dataframe below prints the first {head_cells} cells of each notebook')
        return notebooks
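
A caveat on text matching in general: pandas' `str.contains` interprets its pattern as a regular expression unless `regex=False` is passed, so a search string containing characters such as `.` can match more than intended. A minimal sketch of the difference; the small `cells` frame is made up for illustration and is not competition data.

```python
import pandas as pd

# Hypothetical toy data: three cells from two notebooks
cells = pd.DataFrame({
    "id": ["nb1", "nb1", "nb2"],
    "source": ["df.head()", "print(x)", "dfxhead"],
})

# Default: the pattern is a regex, so '.' matches any character
regex_hits = cells[cells.source.str.contains("df.head")]

# Literal matching: '.' only matches an actual dot
literal_hits = cells[cells.source.str.contains("df.head", regex=False)]

print(len(regex_hits), len(literal_hits))
```

For a sentence of plain text the two usually agree, but literal matching is the safer default when the snippet contains punctuation.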

If you take a piece of text out of your notebook that seems "unique enough", you should find the notebook if it is included in the train set. The piece of text below is from https://www.kaggle.com/code/erikbruin/riiid-comprehensive-eda-baseline

Of course you can still get multiple hits if somebody forked your notebook and made it public (hopefully with significant changes). There actually are 7 public forks, but I guess they forked an older version that did not contain this piece of text yet or the forks are just not included in train.

In [None]:
find_notebooks("So we should realize that example_test.csv really is just an example.")


In [None]:
#just checking if this (bad) Dutch word is found somewhere
find_notebooks('mietje')

Below, I am checking how often I have been referenced ('erikbruin' is always part of the url). By the way, I found 2 more notebooks of mine via this route ('8c34197a9f9c1a' and 'c1fd4f8cb2fd27'). In one of them I found a reference made by me to a dataset that I published, and in the other one I suggest also reading another notebook published by me ;-).

In [None]:
find_notebooks('erikbruin', return_df = False)

# EDA

As you can see, there is a notebook with over 1000 cells! I wonder how tidy this one is (perhaps a new cell for every single line?). Notebooks of just 2 cells make sense to me: this is likely a (long) script put into one cell.

In [None]:
cells = train.groupby('id').size()
print(f"""The average number of cells per notebook is {round(cells.mean(),1)}{os.linesep}
The longest notebook has {cells.max()} cells{os.linesep}
The shortest notebook has {cells.min()} cells""")

Let's now look at the length of cells.

In [None]:
#new column
train['cell_chars'] = train.source.str.len()
train.head()

So what are the longest cells? Well, interestingly, they are all markdown cells where images have been inserted!

In [None]:
train.sort_values(by = "cell_chars", ascending = False).head()

So, what if we just look at code cells?

In [None]:
train.query('cell_type == "code"').sort_values(by = "cell_chars", ascending = False).head()

So the fourth one seems to be the original notebook here, with 3 forks (same ancestor_id). What does this notebook look like? Well, this is what I expected: almost the entire code in one cell ;-)

In [None]:
train.query('id == "fd4199afd8d8b2"')

Let's also have a look at the percentage of markdown vs code cells. As you can see, some notebooks consist almost entirely of markdown. Those seem hard to predict.

In [None]:
type_counts = train.groupby(['id', 'cell_type'], as_index = False)['cell_type'].size()
type_counts = type_counts.pivot_table('size', 'id', 'cell_type')
type_counts['percent_md'] = round(type_counts.markdown/(type_counts.markdown + type_counts.code),2)
type_counts = type_counts.sort_values(by = "percent_md", ascending = False)
type_counts.head()
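
One thing to keep in mind with a pivot like this: a notebook with no cells of one type may end up with a missing value rather than a zero count, which would make `percent_md` missing too. A possible alternative, sketched here on a made-up `toy` frame (not competition data), is `pd.crosstab`, which fills absent combinations with 0.

```python
import pandas as pd

# Hypothetical toy data: nb2 has no markdown cells at all
toy = pd.DataFrame({
    "id": ["nb1", "nb1", "nb1", "nb2", "nb2"],
    "cell_type": ["code", "markdown", "markdown", "code", "code"],
})

# Count cells per notebook and type; missing combinations become 0
counts = pd.crosstab(toy["id"], toy["cell_type"])
counts["percent_md"] = (
    counts["markdown"] / (counts["markdown"] + counts["code"])
).round(2)

print(counts)
```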

Let's now see what the test set looks like. Since it's very small, would Kaggle have taken a few "representative" notebooks? I cannot take the quick lane here, as Rob Mulla only made Parquet files for train (which makes sense, as these are the ones that take long to load).

In [None]:
#taking the code to load json published by the organisers in the getting started notebook

data_dir = Path('../input/AI4Code')

def read_notebook(path):
    return (
        pd.read_json(
            path,
            dtype={'cell_type': 'category', 'source': 'str'})
        .assign(id=path.stem)
        .rename_axis('cell_id')
    )


paths_test = list((data_dir / 'test').glob('*.json'))
notebooks_test = [
    read_notebook(path) for path in tqdm(paths_test, desc='Test NBs')
]
test = (
    pd.concat(notebooks_test)
    .set_index('id', append=True)
    .swaplevel()
    .sort_index(level='id', sort_remaining=False)
)

test = test.reset_index()

The description of the competition states:
*You are challenged to reconstruct the order of markdown cells in a given notebook based on the order of the code cells, demonstrating comprehension of which natural language references which code.*

This confused me a little initially, but in the discussion I found the answer:
*To clarify, all of the cells need to be ordered, both code and markdown. It's just that you are given the correct relative order of the code cells among themselves. You still need to figure out the correct overall order -- where the markdown cells should be placed among the code cells.*

Ok, I finally get it. So in the notebook below with 9 code cells and one markdown cell, we just need to insert the single markdown cell into the ordered list of code cells 1-9!
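
To make this concrete: with the relative order of the code cells fixed, a single markdown cell can only go into one of 10 slots (before the first code cell, between any two, or after the last). A minimal sketch with placeholder cell names (the names `code_1` ... `code_9` and `md_1` are made up):

```python
# Placeholder names for the 9 ordered code cells and the single markdown cell
code_cells = [f"code_{i}" for i in range(1, 10)]
markdown_cell = "md_1"

# Every candidate ordering keeps the code cells in their given relative order
candidates = [
    code_cells[:pos] + [markdown_cell] + code_cells[pos:]
    for pos in range(len(code_cells) + 1)
]

print(len(candidates))
print(candidates[0][:3])
```

So for this notebook the prediction task boils down to picking one slot out of ten; with more markdown cells the number of candidate orderings grows quickly.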

In [None]:
type_counts_test = test.groupby(['id', 'cell_type'], as_index = False)['cell_type'].size()
type_counts_test = type_counts_test.pivot_table('size', 'id', 'cell_type')
type_counts_test

The 3rd one seems most important for the public score. Which notebook is that one with lots of cells in the test set? Well, I don't know, as the function that I made only works the other way around. However, it's easy to see that it's a Titanic notebook when looking at the source column.

# The Metric; Kendall Tau

Below, I am selecting one notebook in train with one code cell and a markdown cell.

In [None]:
nb = type_counts.query('code ==1 and markdown == 1').iloc[[0]].index.values[0]
nb1 = train.query('id == @nb')
nb1

Now, I am putting together a Pandas series, as required as the input of the function given in the Getting Started with AI4Code notebook, and reversing the order in the predictions.

In [None]:
gt1 = nb1.groupby('id')['cell'].apply(list)
gt1

In [None]:
pred1 = gt1.copy()
pred1[0] = ['59c59076', 'd04e6a45'] #reversing the order
pred1

In the code cell below, you can find the kendall_tau function from the organizers. I adjusted the function a little to be able to print more information than just the tau score, and I added a little print function.

In [None]:
from bisect import bisect


# Actually O(N^2), but fast in practice for our data
def count_inversions(a):
    inversions = 0
    sorted_so_far = []
    for i, u in enumerate(a):  # O(N)
        j = bisect(sorted_so_far, u)  # O(log N)
        inversions += i - j
        sorted_so_far.insert(j, u)  # O(N)
    return inversions

def kendall_tau(ground_truth, predictions, print_2max = False):
    total_inversions = 0  # total inversions in predicted ranks across all instances
    total_2max = 0  # maximum possible inversions across all instances
    for gt, pred in zip(ground_truth, predictions):
        ranks = [gt.index(x) for x in pred]  # rank predicted order in terms of ground truth
        total_inversions += count_inversions(ranks)
        n = len(gt)
        total_2max += n * (n - 1)
    tau = 1 - 4 * total_inversions / total_2max
    if print_2max:
        return tau, total_inversions, total_2max
    return tau, total_inversions

def print_tau(gt, pred, print_2max = False):
    if print_2max:
        tau, inversions, total_2max = kendall_tau(gt, pred, print_2max = True)
        print(f' tau is {round(tau, 2)}, number of inversions is {inversions}, maximum possible inversions is {total_2max}')
    else:
        tau, inversions = kendall_tau(gt, pred, print_2max = False)
        print(f' tau is {round(tau, 2)}, number of inversions is {inversions}')
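
As the comment on `count_inversions` notes, the bisect approach is O(N^2) in the worst case because `list.insert` is O(N). If that ever became a bottleneck, one option would be a merge-sort based count, which is O(N log N). The sketch below is my own illustration, not the organizers' code; `count_inversions_mergesort` is a made-up name.

```python
def count_inversions_mergesort(a):
    """Count pairs (i, j) with i < j and a[i] > a[j] using merge sort."""
    def sort_count(seq):
        if len(seq) <= 1:
            return list(seq), 0
        mid = len(seq) // 2
        left, inv_left = sort_count(seq[:mid])
        right, inv_right = sort_count(seq[mid:])
        merged, i, j = [], 0, 0
        inv = inv_left + inv_right
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # right[j] is smaller than every remaining element of left
                merged.append(right[j])
                j += 1
                inv += len(left) - i
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inv

    return sort_count(a)[1]


print(count_inversions_mergesort([1, 0, 2, 4, 3]))
```

For notebooks of a few hundred cells at most, the simpler bisect version is likely fast enough, which is presumably why the organizers kept it.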

Of course, predicting the 2 cells of this notebook in the wrong order gives the worst possible result, as printed below.

In [None]:
print_tau(gt1, pred1)
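
The arithmetic behind this result can be checked by hand: with n = 2 cells and 1 inversion, tau = 1 - 4 × 1 / (2 × 1) = -1. A self-contained sketch that counts inversions by brute force over all pairs (the helper `tau_bruteforce` is my own, written only for this check; the ground-truth order is inferred from the reversed prediction above):

```python
from itertools import combinations


def tau_bruteforce(gt, pred):
    """Kendall tau for one notebook, counting inverted pairs directly."""
    ranks = [gt.index(x) for x in pred]  # predicted order in ground-truth ranks
    inversions = sum(1 for a, b in combinations(ranks, 2) if a > b)
    n = len(gt)
    return 1 - 4 * inversions / (n * (n - 1)), inversions


tau, inversions = tau_bruteforce(
    ['d04e6a45', '59c59076'],  # ground truth
    ['59c59076', 'd04e6a45'],  # reversed prediction
)
print(tau, inversions)
```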

What happens to the score if I take a notebook with one markdown cell followed by 4 code cells, and switch the first 2 cells?

In [None]:
nb = type_counts.query('code ==4 and markdown == 1').iloc[[0]].index.values[0]
nb2 = train.query('id == @nb')
nb2

As you can see, the number of inversions is of course the same, but the effect on tau is much smaller here even though we are making more or less the same mistake.

In [None]:
gt2 = nb2.groupby('id')['cell'].apply(list)
pred2 = gt2.copy()
pred2[0] = ['e2eb8c27','8aa62aa6', 'e1b55174', 'd1dd08ac', '0d9a73d3'] #reversing the order of the first 2 cells
print_tau(gt2, pred2)
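
Why the same mistake costs less in a longer notebook follows directly from the formula: with exactly one inversion, tau = 1 - 4 / (n(n-1)), which approaches 1 as n grows. A small sketch (the helper name `tau_single_swap` is made up for illustration):

```python
def tau_single_swap(n):
    """Tau for a notebook of n cells whose prediction has exactly one inversion."""
    return 1 - 4 * 1 / (n * (n - 1))


for n in (2, 5, 10, 50):
    print(n, round(tau_single_swap(n), 3))
```

For n = 2 this gives -1 and for n = 5 it gives 0.8, matching the two notebooks above.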

What if we put both notebooks through the score function? Is our score now the simple average of the scores of both notebooks? No, fortunately not! If that were the case, getting the short notebooks wrong would be very costly. What the organizers do is pool the inversions of all notebooks and calculate one score. As you can see, total_2max comes out as 22 this time. This is the sum of n(n-1) over both notebooks: (2×1) + (5×4) = 22.

Is this entirely correct? Well, yes and no. The organizers apply the formula correctly, but the comment describing total_2max ("maximum possible inversions across all instances") is not accurate. The maximum possible number of inversions in a notebook of n cells is the binomial coefficient n(n-1)/2, so across all instances it is total_2max/2 (1 + 10 = 11 maximal total inversions in this case).

In [None]:
# Series.append was removed in pandas 2.0; pd.concat does the same job
gt = pd.concat([gt1, gt2])
pred = pd.concat([pred1, pred2])
print_tau(gt, pred, print_2max = True)
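
To see the difference between pooling and averaging, here is the same calculation written out for these two notebooks, using only the cell counts and inversion counts (a sketch of the arithmetic, not the organizers' code):

```python
from math import comb

# (number of cells, inversions in the prediction) for the two notebooks above
notebooks = [(2, 1), (5, 1)]

# Pooled score, as the competition metric computes it
total_inversions = sum(inv for _, inv in notebooks)
total_2max = sum(n * (n - 1) for n, _ in notebooks)
pooled_tau = 1 - 4 * total_inversions / total_2max

# Simple average of the per-notebook scores, for comparison
per_notebook = [1 - 4 * inv / (n * (n - 1)) for n, inv in notebooks]
mean_tau = sum(per_notebook) / len(per_notebook)

# True maximum number of inversions: n choose 2 per notebook
max_inversions = sum(comb(n, 2) for n, _ in notebooks)

print(round(pooled_tau, 2), round(mean_tau, 2), total_2max, max_inversions)
```

The pooled score is about 0.64, whereas a simple average would have been about -0.1, so a mistake in a tiny notebook weighs far less under pooling.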

**Please stay tuned!**