## <center>Introduction</center>

This competition, hosted by Google, aims to understand the relationship between code and comments in Python notebooks. The challenge is to reconstruct the order of the markdown cells in a given notebook based on the order of its code cells, demonstrating comprehension of which natural language references which code.

You can find more details about the competition and the dataset on the <a href="https://www.kaggle.com/competitions/AI4Code">competition page</a>.

## <center><span>Imports</span></center>

In [1]:
# Utilities 
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Plotting
import matplotlib.pyplot as plt
import matplotlib.colors
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode

# Progress bars
from tqdm.notebook import tqdm, trange

pd.set_option('display.max_columns', None)

### <span>Data loading:</span>

In [2]:
# Reading in the data

df = pd.read_csv('../data/train.csv', index_col=[0,1])
df.dropna(inplace=True)

df_ancestors = pd.read_csv('../data/train_ancestors.csv', index_col='id')

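# `squeeze=True` was removed from read_csv in pandas 2.0, so squeeze the
# single remaining column into a Series explicitly, then split each
# whitespace-separated order string into a list of cell ids.
df_orders = pd.read_csv('../data/train_orders.csv',
                        index_col='id').squeeze('columns').str.split()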

## <center>EDA</center>

In [3]:
print(f'Training Shape: {df.shape[0]} rows, {df.shape[1]} columns' +
      f'\nTraining Ancestors Shape: {df_ancestors.shape[0]} rows, {df_ancestors.shape[1]} columns' +
      f'\nTraining Orders Shape: {df_orders.shape[0]} rows' )

Training Shape: 6370642 rows, 2 columns
Training Ancestors Shape: 139256 rows, 2 columns
Training Orders Shape: 139256 rows


In [4]:
code_df = df[df['cell_type'] == 'code']
md_df = df[df['cell_type'] == 'markdown']

# Number of cells in each type
print(f'Number of code cells: {len(code_df)}')
print(f'Number of markdown cells: {len(md_df)}')

# Cell distribution
labels = ['Code cells', 'Markdown Cells']
values = [len(code_df), len(md_df)]
colors = ['#38A6A5', '#E1B580']

fig = go.Figure(data=[go.Pie(
    labels=labels,
    values=values,
    pull=[0.1, 0],
    title='Cell Distribution',
    marker=dict(colors=colors,
                line=dict(color='#000000', width=2))
)])
fig.show()

Number of code cells: 4204578
Number of markdown cells: 2166064
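
The same split can also be expressed as proportions rather than raw counts. The sketch below uses an invented toy frame (`toy_cells` is hypothetical and not part of the dataset) to show how `value_counts(normalize=True)` might be used for that:

```python
import pandas as pd

# Invented toy frame standing in for the training data (assumption).
toy_cells = pd.DataFrame(
    {"cell_type": ["code", "code", "markdown", "code", "markdown", "code"]}
)

# Fraction of cells per type; normalize=True divides each count by the total.
type_share = toy_cells["cell_type"].value_counts(normalize=True)
print(type_share.round(3))
```

On the real data, the same call on `df['cell_type']` should give the shares that the pie chart displays.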


In [5]:
# Get an example code cell
print(f'\033[94m')
print(code_df.iloc[0]['source'])

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import uuid
import os
import scipy
import cv2
from tqdm import tqdm
import math
import ast
sns.set()


In [6]:
# Get an example markdown cell
print(f'\033[94m')
print(md_df.iloc[59]['source'])

### Pipeline 

At this stage, it is worth introducing pipeline. In machine learning, it is common to run a sequence of algorithms to process and learn from data. In our example, we performed StringIndexer, VectorAssembler, and ML model. In other cases, the intermediate stages can be standardization, vectorization (for text processing), normalization, etc. These operations have to be performed on a specific order. Spark represents such a workflow as a Pipeline, which consists of a sequence of stages to be run in a specific order. Pipeline chains multiple Transformers and Estimators together to specify an ML workflow. 

Without the pipeline, we have to execute each stage, store the outcome, and feed into the next stage and evaluate, and so on. We prefer pipeline over this manual approach because of the following reasons: 

- The pipeline is less prone to mistake because the processes are automated. 
- In a production environment, this is the only way to do machine learning end to e

In [7]:
# Get an example notebook

ex_notebook_id = df.index.unique('id')[5]
print('Notebook id:', ex_notebook_id)
print('Cell count:', len(df.loc[ex_notebook_id]))
print('The cells:')

ex_notebook = df.loc[ex_notebook_id, :]
display(ex_notebook)

Notebook id: 205788e414c98a
Cell count: 16
The cells:


cell_id,cell_type,source
6d04cec7,code,import numpy as np \nfrom scipy.special import...
e4914ff2,code,"def Cq(score, noise=0.18):\n '''\n score..."
5ab85c99,code,"noise=0.18\n\nscore = np.linspace(0, 1, 100001..."
c555bf44,code,"def improvement(score1, score2, noise):\n #..."
ee77aedf,code,noise = 0.18\ndelta = 0.0001\nscore = np.linsp...
ac56567f,code,"noise = 0.18\n \nscore1, score2 = 0.81800, 0...."
e0690dc5,code,"leaderboard = [\n (1, 'Laurent Pourchot', 0..."
568515b1,code,"import seaborn as sns\n\nfig, ax = plt.subplot..."
9d7fef54,markdown,# How much is the score improvement?\n\n> The ...
9d517c3d,markdown,Let's apply **Cp** function to the September 2...


In [8]:
cell_order = df_orders.loc[ex_notebook_id]

print('The previous notebook, in order:')
display(ex_notebook.loc[cell_order, :])

The previous notebook, in order:


cell_id,cell_type,source
6d04cec7,code,import numpy as np \nfrom scipy.special import...
9d7fef54,markdown,# How much is the score improvement?\n\n> The ...
c1c751ea,markdown,The metric should estimate how near is the sco...
e4914ff2,code,"def Cq(score, noise=0.18):\n '''\n score..."
12249ef5,markdown,"First, we need to estimate dataset noise level..."
5ab85c99,code,"noise=0.18\n\nscore = np.linspace(0, 1, 100001..."
db2cbd22,markdown,The **improvement** function to measure the im...
c555bf44,code,"def improvement(score1, score2, noise):\n #..."
bd5c9d42,markdown,let's see how the improvement with given delta...
ee77aedf,code,noise = 0.18\ndelta = 0.0001\nscore = np.linsp...
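
The reordering above relies on two steps: each whitespace-separated order string is split into a list of cell ids, and that list is then passed to `.loc`. A minimal sketch on toy data follows; the ids and cell contents are invented for illustration and do not come from the dataset:

```python
import pandas as pd

# Hypothetical toy notebook: code cells first, markdown cells after,
# mimicking how the training frame stores cells.
toy_notebook = pd.DataFrame(
    {
        "cell_type": ["code", "code", "markdown", "markdown"],
        "source": ["import numpy as np", "x = np.arange(3)", "# Title", "Build x"],
    },
    index=pd.Index(["c1", "c2", "m1", "m2"], name="cell_id"),
)

# One whitespace-separated string of cell ids, split into a list.
toy_order = "m1 c1 m2 c2".split()

# Label-based indexing with a list returns the rows in the listed order.
toy_ordered = toy_notebook.loc[toy_order]
print(toy_ordered)
```

Note that `.loc` raises a `KeyError` if any id in the list is missing from the index, which can happen if rows were dropped earlier (for example by `dropna`).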


In [9]:
def get_ranks(base, derived):
    """Return the position in `base` (the true order) of each id in `derived`."""
    return [base.index(d) for d in derived]

cell_ranks = get_ranks(cell_order, ex_notebook.index)
ex_notebook.insert(0, 'ranks', cell_ranks)

ex_notebook

cell_id,ranks,cell_type,source
6d04cec7,0,code,import numpy as np \nfrom scipy.special import...
e4914ff2,3,code,"def Cq(score, noise=0.18):\n '''\n score..."
5ab85c99,5,code,"noise=0.18\n\nscore = np.linspace(0, 1, 100001..."
c555bf44,7,code,"def improvement(score1, score2, noise):\n #..."
ee77aedf,9,code,noise = 0.18\ndelta = 0.0001\nscore = np.linsp...
ac56567f,11,code,"noise = 0.18\n \nscore1, score2 = 0.81800, 0...."
e0690dc5,13,code,"leaderboard = [\n (1, 'Laurent Pourchot', 0..."
568515b1,14,code,"import seaborn as sns\n\nfig, ax = plt.subplot..."
9d7fef54,1,markdown,# How much is the score improvement?\n\n> The ...
9d517c3d,12,markdown,Let's apply **Cp** function to the September 2...
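
`list.index` scans the list on every call, so `get_ranks` does quadratic work on long notebooks. One possible alternative builds a position lookup once and reuses it. The helper name `get_ranks_fast` and the ids below are hypothetical, and the sketch assumes cell ids are unique within a notebook:

```python
def get_ranks_fast(base, derived):
    """Hypothetical variant of get_ranks that uses a dict lookup."""
    # Map each cell id to its position in the true order (built once).
    position = {cell_id: rank for rank, cell_id in enumerate(base)}
    return [position[cell_id] for cell_id in derived]


# Invented ids for illustration only.
true_order = ["m1", "c1", "m2", "c2"]
stored_order = ["c1", "c2", "m1", "m2"]

toy_ranks = get_ranks_fast(true_order, stored_order)
print(toy_ranks)  # [1, 3, 0, 2]
```

For a single 16-cell notebook the difference is negligible; it would matter mainly when ranks are computed for every notebook in the training set.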
