## Notes

First of all, to understand the goal of the competition and the available approaches, read the "[My understanding of the domain - Taking a step back](https://www.kaggle.com/competitions/AI4Code/discussion/328905)" thread by @Allohvk.

Rob Mulla ran a nice [live Twitch stream session](https://www.twitch.tv/videos/1482350967) on this competition, and [this notebook](https://www.kaggle.com/code/robikscube/google-ai4code-data-to-parquet-twitch-stream-eda) is the result.

### Metric

* Wikipedia [Kendall rank correlation coefficient](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient)
* Notebook [Competition Metric - Kendall Tau Correlation](https://www.kaggle.com/code/ryanholbrook/competition-metric-kendall-tau-correlation/notebook)
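
As a quick illustration of the idea behind the metric, here is a minimal sketch of Kendall tau for a single pair of orderings with no ties, computed from the number of inversions. The function names `count_inversions` and `kendall_tau` are my own, and how scores are aggregated across notebooks is not covered here; the linked notebook remains the reference for the competition's exact definition.

```python
from bisect import bisect_right, insort


def count_inversions(ranks):
    """Number of pairs that appear in the wrong relative order."""
    inversions = 0
    seen = []  # ranks visited so far, kept sorted
    for r in ranks:
        # Earlier items with a larger rank are out of order w.r.t. this one.
        inversions += len(seen) - bisect_right(seen, r)
        insort(seen, r)
    return inversions


def kendall_tau(true_order, pred_order):
    """Kendall tau between two orderings of the same distinct ids."""
    n = len(true_order)
    if n < 2:
        raise ValueError("need at least two items")
    # Position of every id in the true ordering.
    rank = {cell_id: i for i, cell_id in enumerate(true_order)}
    inversions = count_inversions([rank[c] for c in pred_order])
    # (concordant - discordant) / total pairs, rewritten in terms of inversions.
    return 1 - 4 * inversions / (n * (n - 1))


# One adjacent swap among four cells: 1 inversion out of 6 pairs, so roughly 0.667.
print(kendall_tau(["a", "b", "c", "d"], ["a", "c", "b", "d"]))
```

A perfect ordering scores 1 and a fully reversed ordering scores -1, so every swapped pair of cells costs the same amount regardless of where it occurs in the notebook.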

### Observations

From the [EDA by Sanskar Hasija](https://www.kaggle.com/code/odins0n/ai4code-detailed-eda) (a good EDA to start with):
* A total of 139256 notebooks are provided in the train set.
* Only 4 notebooks are provided in the test set.
* The train dataframe constructed in that EDA contains a total of 146300 cells, with two values of `cell_type` (code and markdown).
* One third of them are markdown cells.

From the [Comprehensive EDA by ANDREAS PALMGREN](https://www.kaggle.com/code/andreaspalmgren/ai4code-comprehensive-eda):
* Almost all markdown cells are in English, but there are cells in other languages (40 different languages).

From this notebook (see "Markdown cells EDA" below):
* Markdown cells must be preprocessed.
* Some markdown cells are repeated within the same notebook under different ids.

### To Do

* Think about the meaning of "ancestor" in this context
* ~~Open a thread in the discussion forum about markdown cells repeated with different ids~~. See this [thread](https://www.kaggle.com/competitions/AI4Code/discussion/328324).
* Preprocess markdown cells:
    * drop `<img>` HTML tags
    * what to do with markdown cells that are empty after preprocessing?
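
One possible starting point for the preprocessing item above is sketched below. It reads "avoid img html tags" as "remove them", which is an assumption, and the names `IMG_TAG`, `preprocess_markdown` and `is_empty_after_preprocess` are hypothetical helpers, not part of any library. The regular expression is deliberately simple and would miss tags whose attribute values contain a `>` character.

```python
import re

# Hypothetical pattern: matches <img ...> and <img ... /> tags, case-insensitively.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def preprocess_markdown(source):
    """Remove <img> tags and trim surrounding whitespace (illustrative only)."""
    return IMG_TAG.sub("", source).strip()


def is_empty_after_preprocess(source):
    """True when nothing is left of the cell once the tags are removed."""
    return preprocess_markdown(source) == ""


cells = ["# Title <img src='plot.png'>", '<IMG src="a.png" />', "plain text"]
for c in cells:
    print(repr(preprocess_markdown(c)), is_empty_after_preprocess(c))
```

Cells that contain only an image become empty strings here, which is exactly the open question in the list: they could be dropped, or replaced by a placeholder token so that the cell still has a position to predict.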


## Markdown cells EDA

We compute an MD5 hash of every cell's source so that identical markdown cells can be found, both across notebooks and within the same notebook.
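
Before running this over the full train set, the idea can be checked on a toy example that uses only the standard library. The notebook ids, cell ids and sources below are made up for illustration.

```python
from collections import defaultdict
from hashlib import md5

# Hypothetical toy data: (notebook_id, cell_id, markdown source).
cells = [
    ("nb1", "c1", "## Imports"),
    ("nb1", "c2", "## Imports"),
    ("nb1", "c3", "## Model"),
    ("nb2", "c4", "## Imports"),
]

# Group cell ids by (notebook, hash of the source).
groups = defaultdict(list)
for notebook_id, cell_id, source in cells:
    digest = md5(source.encode("utf-8")).hexdigest()
    groups[(notebook_id, digest)].append(cell_id)

# Keep only hashes that occur more than once inside a single notebook.
repeated = {key: ids for key, ids in groups.items() if len(ids) > 1}
for (notebook_id, digest), ids in repeated.items():
    print(notebook_id, ids)
```

Grouping by the hash alone instead of by the (notebook, hash) pair would also report `c4`, since the same text appears in both notebooks.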

In [None]:
import pandas as pd
from hashlib import md5
from pathlib import Path

DATA_DIR = "/kaggle/input/AI4Code"

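# One JSON file per training notebook.
paths = sorted(Path(DATA_DIR, "train").glob("*.json"))
# Load every notebook; the `id` column holds the notebook id (the file stem).
frames = [
    pd.read_json(p, dtype={"cell_type": "category", "source": "str"}).assign(id=p.stem)
    for p in paths
]
allcells = pd.concat(frames)
# Hash the raw source so that identical cells share the same digest.
allcells["hash"] = allcells.source.str.encode("utf-8").apply(lambda b: md5(b).hexdigest())
# Source length in characters (computed for code and markdown cells alike).
allcells["code_len"] = allcells.source.str.len()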

In [None]:
allcells.sample(10)

### Top identical markdown cells 

In [None]:
(
    allcells.query('cell_type == "markdown"')
    .groupby('hash').filter(lambda x: len(x) > 1)
    .groupby('hash').count()
    .nlargest(10, 'id')
)

### Top largest identical markdown cells

In [None]:
(
    allcells.query('cell_type == "markdown"')
    .groupby('hash').filter(lambda x: len(x) > 1)
    .nlargest(20, 'code_len')
    .sort_values(by=['code_len', 'hash'], ascending=False)
)

### Top identical markdown cells in the same notebook

In [None]:
(
    allcells.query('cell_type == "markdown"')
    .groupby(['hash', 'id']).filter(lambda x: len(x) > 1)
    .groupby(['hash', 'id']).count()
    .nlargest(10, 'source')
)