# Sort the Code!

This notebook was created during a live Twitch stream. [Check out my channel here](https://www.twitch.tv/medallionstallion_)

<img src="https://allaboutplanners.com.au/wp-content/uploads/2017/05/how-to-color-code-your-planner-using-Zooms-organized-blog-post-idea-tracking-brain-dumping-use-empty-notebook-use-empty-notes-pages-in-my-planner-min-1024x768.jpg" width="500" height="250" />


The goal of this competition is to understand the relationship between code and comments in Python notebooks.

The task is to create an algorithm that sorts the cells of a notebook into the correct order.
- We are given 130,000 notebooks in the training set, along with their correct cell order
- We need to predict the correct cell order for the notebooks in the test set

Let's go!

# Load the Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json

plt.style.use("ggplot")
my_pal = sns.color_palette()

In [None]:
train = pd.read_csv("../input/AI4Code/train_orders.csv")
ancestors = pd.read_csv("../input/AI4Code/train_ancestors.csv")
ss = pd.read_csv("../input/AI4Code/sample_submission.csv")

# Helper Functions
These functions load the training data from the JSON files. I also saved the combined training data as a parquet file that we can load for quick access.

In [None]:
def load_example(example_id, is_train=True):
    """
    Load the JSON file for a single notebook and return it as a dict.
    """
    filedir = "train" if is_train else "test"
    with open(f"../input/AI4Code/{filedir}/{example_id}.json") as f:
        example = json.load(f)
    return example

In [None]:
# Load an example
example_id = train["id"].sample(1, random_state=529).values[0]
load_example(example_id).keys()

In [None]:
def get_example_df(example_id, train, ancestors):
    """
    Creates a pandas dataframe of the json cells and correct order.
    """
    cell_order = train.query("id == @example_id")["cell_order"].values[0]
    # The JSON maps cell id -> value, so the dataframe index holds the cell ids
    example_df = pd.DataFrame(load_example(example_id))

    # Map each cell id to its position in the correct order
    my_orders = {c: idx for idx, c in enumerate(cell_order.split(" "))}
    example_df["order"] = example_df.index.map(my_orders)

    # Attach the ancestor and parent notebook ids
    ancestor_row = ancestors.query("id == @example_id")
    example_df["ancestor_id"] = ancestor_row["ancestor_id"].values[0]
    example_df["parent_id"] = ancestor_row["parent_id"].values[0]

    # Move the cell ids into a "cell" column and sort into the correct order
    example_df = example_df.reset_index().rename(columns={"index": "cell"})
    example_df = example_df.sort_values("order").reset_index(drop=True)
    example_df["id"] = example_id
    col_order = [
        "id",
        "cell",
        "cell_type",
        "source",
        "order",
        "ancestor_id",
        "parent_id",
    ]
    example_df = example_df[col_order]
    return example_df

Now we have a dataframe with every cell, its type, its contents, and its position in the correct order.
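
To make the mapping step concrete, here is a minimal sketch on a made-up notebook. The cell ids (`aaa`, `bbb`, `ccc`), their contents, and the `toy_` names are hypothetical. Only the `cell_type` / `source` layout and the space-separated order string mirror what the helper above relies on.

```python
import pandas as pd

# Hypothetical notebook payload: each key maps cell id -> value,
# mirroring the cell_type / source layout used by get_example_df.
toy_json = {
    "cell_type": {"aaa": "code", "bbb": "code", "ccc": "markdown"},
    "source": {
        "aaa": "import pandas as pd",
        "bbb": "df.head()",
        "ccc": "# Load data",
    },
}
# Hypothetical ground-truth order, space separated as in train_orders.csv
toy_cell_order = "ccc aaa bbb"

# Building the frame from a dict of dicts puts the cell ids in the index
toy_df = pd.DataFrame(toy_json)

# Map each cell id to its position in the correct order
rank = {cell: idx for idx, cell in enumerate(toy_cell_order.split(" "))}
toy_df["order"] = toy_df.index.map(rank)

# Move the cell ids into a column and sort into the correct order
toy_df = (
    toy_df.reset_index()
    .rename(columns={"index": "cell"})
    .sort_values("order")
    .reset_index(drop=True)
)
print(toy_df[["cell", "cell_type", "order"]])
```

After sorting, the markdown cell `ccc` comes first, followed by the two code cells in the order given by the toy order string.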

In [None]:
# Load the example as a dataframe
example_df = get_example_df(example_id, train, ancestors)
example_df.head()

# Combine Data as Parquet for Fast Loading

The function below was used offline to combine all of the training data into a single dataframe.

[Check out the dataset here](https://www.kaggle.com/datasets/robikscube/ai4code-parquet-tabular)

In [None]:
import os
from functools import partial

import pandas as pd
import json
from tqdm.contrib.concurrent import process_map


def combine_train():
    train = pd.read_csv("../input/AI4Code/train_orders.csv")
    ancestors = pd.read_csv("../input/AI4Code/train_ancestors.csv")

    # Get the list of json files
    train_jsons = os.listdir("../input/AI4Code/train/")
    print(f"There are {len(train_jsons)} training json files")

    all_ids = train["id"].unique()
    # process_map passes one item per call, so bind the shared dataframes up front
    load_fn = partial(get_example_df, train=train, ancestors=ancestors)
    results = process_map(
        load_fn, all_ids, max_workers=32, chunksize=500, total=len(all_ids)
    )
    all_examples = pd.concat(results).reset_index(drop=True)
    all_examples.to_parquet("train_all.parquet")

In [None]:
train_all = pd.read_parquet("../input/ai4code-parquet-tabular/train_all.parquet")

# EDA of the Training Notebook Data

Some questions to answer:
- How many cells are there on average per notebook?
- What is the breakdown of markdown vs. code cells?
- How many notebooks share the same ancestor_id?
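
As a rough sketch of how these questions translate into pandas, the snippet below answers them on a tiny made-up dataset. The `toy` and `toy_anc` frames are hypothetical and only share the `id`, `cell_type` and `ancestor_id` column names with the real data.

```python
import pandas as pd

# Hypothetical stand-in for train_all: one row per cell
toy = pd.DataFrame(
    {
        "id": ["nb1", "nb1", "nb1", "nb2", "nb2"],
        "cell_type": ["code", "markdown", "code", "code", "markdown"],
    }
)
# Hypothetical stand-in for the ancestors table: one row per notebook
toy_anc = pd.DataFrame(
    {"id": ["nb1", "nb2", "nb3"], "ancestor_id": ["a1", "a1", "a2"]}
)

# Average number of cells per notebook
cells_per_notebook = toy.groupby("id").size()
mean_cells = cells_per_notebook.mean()

# Share of each cell type across all notebooks
type_share = toy["cell_type"].value_counts(normalize=True)

# Number of notebooks whose ancestor_id is shared with at least one other notebook
ancestor_counts = toy_anc["ancestor_id"].value_counts()
n_sharing = int(ancestor_counts[ancestor_counts > 1].sum())

print(mean_cells, type_share.to_dict(), n_sharing)
```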

In [None]:
train_all['cell_type'].value_counts() \
    .plot(kind='barh',
          title='Code vs. Markdown Cells in Total',
          color=my_pal[2], figsize=(8, 5))
plt.show()

In [None]:
# Number of Cells per id
train_all['id'].value_counts() \
    .plot(kind='hist',
          bins=50,
          title='Distribution of # of Cells per Notebook')
print('The median number of cells per notebook is:',
      train_all['id'].value_counts().median())
plt.show()

There is a notebook with over 1000 cells!

In [None]:
train_all['id'].value_counts() \
    .head(50).sort_values() \
    .plot(kind='barh', color=my_pal[3], figsize=(8, 10),
         title='Top 50 Notebooks by # of Cells')
plt.show()

## What's the most forked notebook?

In [None]:
# Count forks per parent notebook and plot the 20 most-forked parents
ancestors['parent_id'].value_counts().head(20) \
    .sort_values() \
    .plot(kind='barh', figsize=(8, 8),
          color=my_pal[1], title='Top Forked Notebooks')
plt.show()

## Find the top "forked" notebook
Below we print the first two cells of one of its forks. It's an introduction to machine learning notebook!

In [None]:
top_forked = ancestors['parent_id'].value_counts().index[0]
# This parent id does not appear in our dataset,
# so take one of its forks instead
a_fork = ancestors.query('parent_id == @top_forked')['id'].values[0]

# Print the first two cells of the fork
fork_sources = train_all.query('id == @a_fork')['source'].values
print(fork_sources[0])
print(fork_sources[1])
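
The comment in the cell above notes that the most-forked parent is not itself part of the dataset. A minimal way to check a claim like that is sketched here on hypothetical toy ids rather than the real files:

```python
import pandas as pd

# Hypothetical ancestors table: n1 and n2 are forks of p9, n3 is a fork of n1
toy_ancestors = pd.DataFrame(
    {"id": ["n1", "n2", "n3"], "parent_id": ["p9", "p9", "n1"]}
)

# Most-forked parent (value_counts sorts by count, descending)
top_parent = toy_ancestors["parent_id"].value_counts().index[0]

# Is that parent one of the notebooks we actually have?
in_dataset = top_parent in set(toy_ancestors["id"])
print(top_parent, in_dataset)
```

On the real data, the same membership test against the `id` column of `ancestors` would confirm whether `top_forked` is present.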