<a href="https://colab.research.google.com/github/rahiakela/kaggle-competition-projects/blob/master/google-ai4code/01_exploratory_data_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Google AI4Code: Exploratory Data Analysis



**Exploratory data analysis is the work of a detective. Understanding the possibilities of your data is the first step in laying the groundwork for future modeling. With this notebook, we try to make sense of our data and demonstrate how data can be analyzed. We'll look for trends, limitations, and other characteristics linked to the questions we're interested in as part of our investigation.**

Reference:

https://www.kaggle.com/code/andreaspalmgren/ai4code-comprehensive-eda

##Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
from tqdm import tqdm
from pathlib import Path
import re
import os

pd.options.display.width = 180
pd.options.display.max_colwidth = 100

rc = {
    "axes.spines.left": True,
    "axes.spines.right": False,
    "axes.spines.bottom": True,
    "axes.spines.top": False,
    "xtick.bottom": True,
    "xtick.labelbottom": True,
    "ytick.labelleft": True,
    "ytick.left": True,
    "figure.subplot.hspace": 0.7,
    "figure.titleweight": "bold",
    "axes.titleweight": "bold",
    "font.weight": "bold",
}
plt.rcParams.update(rc)

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
import os
# KAGGLE_CONFIG_DIR must point to the Google Drive folder that contains kaggle.json
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/kaggle-keys"

In [4]:
%%shell

# Download the dataset from Kaggle. URL: https://www.kaggle.com/competitions/AI4Code
kaggle competitions download -c AI4Code

unzip -qq AI4Code.zip
rm -rf AI4Code.zip

Downloading AI4Code.zip to /content
 98% 696M/714M [00:05<00:00, 134MB/s]
100% 714M/714M [00:05<00:00, 135MB/s]




In [11]:
%%shell

mkdir AI4Code
mv train AI4Code
mv test AI4Code
mv train_orders.csv AI4Code
mv train_ancestors.csv AI4Code



##Data Source

**Google and X, the moonshot factory, have supplied the dataset, which contains about 160 000 Jupyter notebooks. It is part of a <a href="https://www.kaggle.com/competitions/AI4Code">Kaggle competition</a> to train a model that can rank a notebook's markdown cells given the order of its code cells.**

**Code cells are written in Python and markdown cells are written in Markdown, the text formatting language used in Jupyter.**

In [5]:
def read_notebook(path):
  return (
      pd.read_json(path, dtype={"cell_type": "category", "source": "str"})
        .assign(id=path.stem)
        .rename_axis("cell_id")
  )
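
To make the input format concrete, here is a minimal sketch of what `read_notebook` does. It assumes each notebook file is a JSON object with a `cell_type` mapping and a `source` mapping, both keyed by cell id. The cell ids, sources, and notebook id below are made up for illustration.

```python
import json
from io import StringIO

import pandas as pd

# Hypothetical notebook content; the two-key layout is an assumption
# based on the columns that read_notebook expects.
toy_notebook = {
    "cell_type": {"aaaa1111": "code", "bbbb2222": "markdown"},
    "source": {"aaaa1111": "import pandas as pd", "bbbb2222": "# Load data"},
}

toy_df = (
    pd.read_json(
        StringIO(json.dumps(toy_notebook)),
        dtype={"cell_type": "category", "source": "str"},
    )
    .assign(id="toy_notebook")  # stands in for path.stem
    .rename_axis("cell_id")     # the JSON keys become the index
)
print(toy_df)
```

Each row is one cell, indexed by its cell id, with the notebook id repeated in the `id` column so that many notebooks can be concatenated later.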

In [14]:
data_dir = Path("./AI4Code")
# Subset of training due to its large size
NUM_TRAIN = 20000

paths_train = list((data_dir / "train").glob("*.json"))[:NUM_TRAIN]

notebooks_train = [read_notebook(path) for path in tqdm(paths_train, desc="Train NBs")]
# Get notebooks
df_notebooks = (
    pd.concat(notebooks_train)
      .set_index("id", append=True)
      .swaplevel()
      .sort_index(level="id", sort_remaining=False)
  )

# Get correct order of cells in notebooks
df_orders = pd.read_csv(data_dir / "train_orders.csv", index_col="id")
df_orders = df_orders.squeeze().str.split(" ").explode().to_frame()
# Position of each cell within its notebook (0-based); cumcount does not depend on the row order of the CSV
df_orders["rank"] = df_orders.groupby("id").cumcount().to_numpy()

df = df_notebooks.reset_index().merge(df_orders.reset_index().rename(columns={"cell_order": "cell_id"}), how="inner", on=["id", "cell_id"])

# Get ancestors for notebooks
df_ancestors = pd.read_csv(data_dir / "train_ancestors.csv", index_col="id")

# Final combined dataframe
df = df.merge(df_ancestors, on="id").sort_values(["id", "rank"]).set_index(["id", "cell_id"])

# Dataframe for count information - Used in EDA
mkd = df[df["cell_type"] == "markdown"].groupby(by=["id"]).count().source
code = df[df["cell_type"] == "code"].groupby(by=["id"]).count().source
df_counts = pd.concat([mkd, code], axis=1)
df_counts.columns = ["markdown_count", "code_count"]
df_counts["tot"] = df_counts.markdown_count + df_counts.code_count
df_counts["ratio"] = df_counts.code_count / df_counts.tot

Train NBs: 100%|██████████| 20000/20000 [02:03<00:00, 161.88it/s]


The training data consists of about 140 000 JSON files, each containing a notebook whose markdown cells have been shuffled. Additional files give the correct cell order and information about forked notebooks. The following table combines all of the given training files, including the correct order and `ancestor_id`/`parent_id`.

* **`id` - Unique identification of notebook.** 
* **`cell_id` - Unique identification of cell within notebooks.** 
* **`cell_type` - Factor specifying cell type, either being a code cell or markdown cell.** 
* **`source` - String with content of cell.**
* **`rank` - Order rank for given cell within notebook.**
* **`ancestor_id` - Identifies sets of notebooks with common origin.**
* **`parent_id` - Some version of the notebook id was forked from some version of the notebook `parent_id`. It may or may not be present (i.e. `parent_id` may be missing due to someone having forked a private notebook).** 
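
As a rough illustration of how `rank` can be derived, the sketch below turns a space-separated cell order into one row per cell with a 0-based position. The notebook ids and cell ids are hypothetical, and the layout of the toy frame is only assumed to mirror `train_orders.csv`.

```python
import pandas as pd

# Hypothetical stand-in for train_orders.csv: one row per notebook,
# with the correctly ordered cell ids joined by spaces.
toy_orders = pd.DataFrame(
    {"cell_order": ["m1 c1 c2", "c9 m7"]},
    index=pd.Index(["nb_a", "nb_b"], name="id"),
)

# One row per cell, keeping the notebook id in the index.
toy_ranks = toy_orders["cell_order"].str.split(" ").explode().to_frame("cell_id")

# Position of each cell inside its own notebook, starting at 0.
toy_ranks["rank"] = toy_ranks.groupby(level="id").cumcount().to_numpy()

print(toy_ranks.reset_index())
```

Merging such a frame with the cell contents on `id` and `cell_id` attaches the correct position to every cell.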

In [15]:
df.head()

| id | cell_id | cell_type | source | rank | ancestor_id | parent_id |
|---|---|---|---|---|---|---|
| 00035108e64677 | 3496fbfe | markdown | `# Import Basic Libraries` | 0 | a41da3f9 | |
| 00035108e64677 | 2fa1f27b | code | `# Basic Libraries\nimport numpy as np\nimport pandas as pd\nimport seaborn as sb\nimport matplot...` | 1 | a41da3f9 | |
| 00035108e64677 | 719854c4 | markdown | `# Read test and train file` | 2 | a41da3f9 | |
| 00035108e64677 | f3c2de19 | code | `#import test and train file\neverything = pd.read_json("../input/whats-cooking/train.json")\ntes...` | 3 | a41da3f9 | |
| 00035108e64677 | d75feb42 | markdown | `# Creating a list consisting of the top 100 ingredients and using them to predict the cuisine la...` | 4 | a41da3f9 | |


**We're in luck because the competition organizers have already cleaned the dataset. It has been stated that:** 

* **Notebooks contain at least one of each `cell_type`, meaning notebooks should have a length of 2 or more.**
* **Any cells with empty `source` have already been removed.**
* **All code is written in Python.**

**Before continuing with the analysis, we run a quick check of our own. `cell_type` is a factor that specifies one of two cell types: markdown or code. Only these two types appear in the data, which is good.**

In [16]:
df["cell_type"].unique()

array(['markdown', 'code'], dtype=object)
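
The other stated guarantees can be checked in a similar spirit. The sketch below runs on a made-up miniature frame rather than on `df`, and treating whitespace-only strings as "empty" is an assumption of this example, not something the organizers specify.

```python
import pandas as pd

# Hypothetical miniature of the combined dataframe (ids and sources are made up).
toy = pd.DataFrame(
    {
        "id": ["nb_a", "nb_a", "nb_a", "nb_b", "nb_b"],
        "cell_type": ["markdown", "code", "code", "code", "markdown"],
        "source": ["# Title", "import os", "print(1)", "x = 1", "Notes"],
    }
)

# Every notebook should contain both cell types.
types_per_notebook = toy.groupby("id")["cell_type"].nunique()
all_have_both = bool((types_per_notebook == 2).all())

# No cell should have an empty (or whitespace-only) source.
no_empty_source = bool(toy["source"].str.strip().ne("").all())

print(all_have_both, no_empty_source)
```

The same two expressions should apply to the real data after `df.reset_index()`, since `id` is part of the index there.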

**We should not have any missing values, except within `parent_id`. Since the data appears to be clean, it is time to start our analysis.**

In [17]:
df.isna().sum()

cell_type           0
source              0
rank                0
ancestor_id         0
parent_id      798755
dtype: int64
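
Note that the count above is per cell, because the merge repeats `parent_id` on every cell of a notebook. A per-notebook view can be more informative. The sketch below shows the difference on a hypothetical toy frame; the ids and values are made up.

```python
import pandas as pd

# Hypothetical cell-level frame: parent_id is repeated for each cell of a notebook.
toy = pd.DataFrame(
    {
        "id": ["nb_a", "nb_a", "nb_b", "nb_c", "nb_c", "nb_c"],
        "parent_id": [None, None, "p1", None, None, None],
    }
)

# Cell-level count, comparable to df.isna().sum().
missing_cells = int(toy["parent_id"].isna().sum())

# parent_id is constant within a notebook, so keep one row per notebook.
per_notebook = toy.drop_duplicates("id").set_index("id")["parent_id"]
missing_notebooks = int(per_notebook.isna().sum())
share_missing = missing_notebooks / len(per_notebook)

print(missing_cells, missing_notebooks, round(share_missing, 2))
```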

##Data Analysis

**Analysis will be based on a subset of the training data (20 000 notebooks).**

###Code vs. Markdown

**We now have enough information to begin our analysis. Because our model aims to order markdown cells based on code cells, the relationship between the two is of primary importance. A good starting point is the proportion of cell types across all notebooks. As seen in the pie chart below, code cells are far more common than markdown cells, which gives us plenty to work with.**