# AI4Code Languages distribution

The competition data consists of ~160k Kaggle notebooks. The Kaggle community is international, and although most notebooks are commented in English, there are plenty of kernels in all kinds of other languages. In this kernel I show the distribution of natural languages in the train set, which I hope helps you choose your models.

Moreover, Kaggle kernels can be written in two programming languages: Python and R. The distribution of programming languages in the train set is also addressed in this kernel.

In [None]:
import json
import os
import re

import fasttext
import pandas as pd
import plotly.express as px
import pygments.lexers
from tqdm.auto import tqdm


tqdm.pandas()

In [None]:
lang_codes = pd.read_csv("../input/wikipedia-language-iso639/lang.csv", index_col=0).squeeze()
train_orders = pd.read_csv("../input/AI4Code/train_orders.csv")

# Natural Languages distribution

In [None]:
nl_detector = fasttext.load_model("../input/fasttext-language-identification/lid.176.bin")

In [None]:
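def detect_nl(nb_id, detector):
    """Return the most frequent fastText language code among a notebook's markdown cells."""
    with open(f"../input/AI4Code/train/{nb_id}.json", 'r', encoding='utf-8') as f_in:
        notebook = json.load(f_in)
    md_cells = [cell_id for cell_id in notebook['cell_type'] if notebook['cell_type'][cell_id] == 'markdown']
    langs = []
    for cell_id in md_cells:
        # fastText predicts one line at a time, so newlines have to be removed first
        cell_content = notebook['source'][cell_id].replace('\n', ' ')
        # predict returns (labels, probabilities); labels look like "__label__en"
        langs.append(detector.predict(cell_content, k=1)[0][0][len("__label__"):])
    # Majority vote over cells; a notebook without markdown cells yields None instead of raising
    return max(set(langs), key=langs.count) if langs else None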

In [None]:
train_nl = train_orders['id'].progress_apply(detect_nl, detector=nl_detector)
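
The majority vote inside `detect_nl` can also be written with `collections.Counter`. The sketch below is only an illustration, not part of the pipeline above: `strip_label` and `majority_label` are hypothetical helper names, and the input labels are made up. One practical difference is that ties are resolved in favour of the label seen first, whereas `max(set(langs), ...)` depends on set iteration order.

```python
from collections import Counter
from typing import List, Optional

LABEL_PREFIX = "__label__"  # prefix fastText puts in front of every predicted label


def strip_label(label: str) -> str:
    """Turn a fastText label such as '__label__en' into the bare code 'en'."""
    return label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label


def majority_label(labels: List[str]) -> Optional[str]:
    """Return the most frequent label, or None for an empty list.

    Counter.most_common keeps first-seen order among equal counts,
    so ties go to the label that appeared first.
    """
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


# Made-up per-cell predictions for a single notebook
cell_labels = ["__label__en", "__label__ru", "__label__ru"]
notebook_lang = majority_label([strip_label(label) for label in cell_labels])
```

Here `notebook_lang` is `"ru"`, because two of the three cells were labelled Russian.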

In [None]:
# Name the columns explicitly: the default names produced by reset_index() differ between pandas versions
lang_counts = train_nl.value_counts(dropna=False).rename_axis('code').reset_index(name='count')
lang_counts['language'] = lang_counts['code'].map(lang_codes).fillna(lang_counts['code'])

fig = px.pie(lang_counts, values='count', names='language', title="Natural Languages distribution in Train data")
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

93% of kernels are commented in English, but 7% are not, and that may be significant if you're fighting for thousandths on the leaderboard!
The plot is interactive: you can turn off English to look more closely at the other languages. Be aware, though, that the percentages shown on the plot are always relative to the currently shown ("enabled") elements.
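
To make the renormalisation concrete, here is a small sketch of how the shares change once a slice is hidden. The counts are invented purely for illustration and are not the real train set numbers; `shares` is a hypothetical helper, not a Plotly function.

```python
from typing import Dict, Iterable


def shares(counts: Dict[str, int], hidden: Iterable[str] = ()) -> Dict[str, float]:
    """Percentage of each visible category relative to the visible total."""
    hidden = set(hidden)
    visible = {name: n for name, n in counts.items() if name not in hidden}
    total = sum(visible.values())
    if total == 0:
        return {}
    return {name: 100 * n / total for name, n in visible.items()}


# Hypothetical counts, chosen only to keep the arithmetic easy to follow
counts = {"English": 930, "Russian": 40, "Japanese": 30}

all_shares = shares(counts)                          # Russian is 4% of all notebooks
non_english = shares(counts, hidden=["English"])     # Russian is 40 / 70 of the rest
```

With English hidden, the Russian slice grows from 4% to about 57%, although the number of Russian notebooks has not changed.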

# Programming Languages distribution

I tried several methods for detecting programming languages: guesslang was too slow and Pygments was highly inaccurate, so I ended up with a simple regex that checks the format of imports. Of course, some notebooks do not import anything at all, but there aren't many of them. I'd be happy to learn about a better method.

In [None]:
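def detect_pl(nb_id):
    """Guess a notebook's programming language from the import style of its code cells."""
    with open(f"../input/AI4Code/train/{nb_id}.json", 'r', encoding='utf-8') as f_in:
        notebook = json.load(f_in)
    src_cells = [cell_id for cell_id in notebook['cell_type'] if notebook['cell_type'][cell_id] == 'code']
    py_regex = r"\bimport \w+"   # matches "import x" and "from x import y"
    r_regex = r"\blibrary\(\w+"  # matches "library(x)"
    langs = []
    for cell_id in src_cells:
        cell_content = notebook['source'][cell_id]
        # A cell can vote for both languages if it matches both patterns
        if re.search(py_regex, cell_content):
            langs.append("Python")
        if re.search(r_regex, cell_content):
            langs.append("R")
    # Majority vote over cells; None means no import was found in any cell
    return max(set(langs), key=langs.count) if langs else None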

In [None]:
train_pl = train_orders['id'].progress_apply(detect_pl)

In [None]:
# Name the columns explicitly: the default names produced by reset_index() differ between pandas versions
pl_counts = train_pl.value_counts(dropna=False).rename_axis('language').reset_index(name='count')

fig = px.pie(pl_counts, values='count', names='language', title="Programming Language distribution in Train data")
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

Only 62 kernels in the train set are written in R (or my regex is not good enough). Nulls on the plot correspond either to simple educational kernels without imports (mostly in Python) or to kernels that contain only bash cells executing some Python scripts.
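
If the regex is indeed the weak spot, one possible refinement is sketched below. It has not been run on the competition data, so whether it changes the counts is an open question. Two things differ from the patterns used above: Python imports are anchored to the start of a line, so an R comment such as `# import data` no longer counts as Python, and the R pattern also accepts `require(...)` and quoted package names such as `library("ggplot2")`. The name `guess_language` is hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional

# "import x" or "from x import y" at the start of a line (leading whitespace allowed)
PY_IMPORT = re.compile(r"^\s*(?:import\s+\w+|from\s+[\w.]+\s+import\b)", re.MULTILINE)
# library(x), library("x"), require(x), require('x')
R_IMPORT = re.compile(r"\b(?:library|require)\(\s*[\"']?\w+")


def guess_language(code_cells: List[str]) -> Optional[str]:
    """Majority vote over cells; None if no cell contains a recognisable import."""
    votes = Counter()
    for cell in code_cells:
        if PY_IMPORT.search(cell):
            votes["Python"] += 1
        if R_IMPORT.search(cell):
            votes["R"] += 1
    return votes.most_common(1)[0][0] if votes else None


# Made-up cells for illustration
r_cells = ['library("ggplot2")\n# import data from csv', "df <- read.csv('x.csv')"]
py_cells = ["import numpy as np", "from os import path"]
```

On these toy inputs `guess_language(r_cells)` gives `"R"` and `guess_language(py_cells)` gives `"Python"`. Notebooks made only of bash cells would still come out as `None`.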