In [1]:
import os
import json
import pandas as pd

In [2]:
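def load_json_folder(folder_path) -> pd.DataFrame:
    """Load every JSON file in folder_path into one DataFrame (one row per file)."""
    rows = []

    # Iterate over all files in the given folder
    for fname in os.listdir(folder_path):
        full_path = os.path.join(folder_path, fname)

        with open(full_path, "r", encoding="utf-8") as f:
            try:
                obj = json.load(f)
            except Exception:
                # Report the offending file, then re-raise with the original traceback
                print("Error in:", fname)
                raise

        # Keep track of the source file of each example
        obj["file_name"] = fname
        rows.append(obj)

    return pd.DataFrame(rows)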


df = load_json_folder("../data/raw/code_classification_dataset/")
print(df.shape)
print(df.columns)

(4982, 21)
Index(['prob_desc_time_limit', 'prob_desc_sample_outputs', 'src_uid',
       'prob_desc_notes', 'prob_desc_description', 'prob_desc_output_spec',
       'prob_desc_input_spec', 'prob_desc_output_to', 'prob_desc_input_from',
       'lang', 'lang_cluster', 'difficulty', 'file_name', 'code_uid',
       'prob_desc_memory_limit', 'prob_desc_sample_inputs', 'exec_outcome',
       'source_code', 'prob_desc_created_at', 'tags', 'hidden_unit_tests'],
      dtype='object')


# Dataset Column Structure

## Problem Description

| Column | Description |
|--------|-------------|
| **prob_desc_description** | Main statement of the problem. |
| **prob_desc_input_spec** | Specification of the input format. |
| **prob_desc_output_spec** | Specification of the output format. |
| **prob_desc_sample_inputs** | Example inputs provided in the problem statement. |
| **prob_desc_sample_outputs** | Example outputs provided in the problem statement. |
| **prob_desc_notes** | Additional notes or explanations related to the problem. |
| **prob_desc_time_limit** | Time limit for the problem. |
| **prob_desc_memory_limit** | Memory limit for the problem. |
| **prob_desc_output_to** | Where the program should write its output (typically "standard output"). |
| **prob_desc_input_from** | From where the program reads its input (typically "standard input"). |
| **prob_desc_created_at** | UNIX timestamp of the problem’s creation date. |

---

## Solution Code

| Column | Description |
|--------|-------------|
| **source_code** | Validated solution code in Python. |
| **exec_outcome** | Execution result (`PASSED`, etc.). |
| **lang** | Language used (e.g., "PyPy 3-64"). |
| **lang_cluster** | Language family (e.g., "Python"). |

---

## Metadata

| Column | Description |
|--------|-------------|
| **src_uid** | Unique identifier of the problem. |
| **code_uid** | Unique identifier of the solution code. |
| **file_name** | Name of the JSON file containing the example. |
| **difficulty** | Problem difficulty rating (Codeforces rating). |
| **hidden_unit_tests** | Hidden test cases (often empty). |

---

## Tags (Target Labels)

| Column | Description |
|--------|-------------|
| **tags** | List of Codeforces tags associated with the problem (multi-label). |


# Exploration & Pre-Processing

In [3]:
# Let's look at the data type of each column in our dataset

df.dtypes

prob_desc_time_limit         object
prob_desc_sample_outputs     object
src_uid                      object
prob_desc_notes              object
prob_desc_description        object
prob_desc_output_spec        object
prob_desc_input_spec         object
prob_desc_output_to          object
prob_desc_input_from         object
lang                         object
lang_cluster                 object
difficulty                  float64
file_name                    object
code_uid                     object
prob_desc_memory_limit       object
prob_desc_sample_inputs      object
exec_outcome                 object
source_code                  object
prob_desc_created_at         object
tags                         object
hidden_unit_tests            object
dtype: object

In [4]:
# Missing values
df.isna().sum()

prob_desc_time_limit           0
prob_desc_sample_outputs       0
src_uid                        0
prob_desc_notes             1350
prob_desc_description          0
prob_desc_output_spec         85
prob_desc_input_spec          33
prob_desc_output_to            1
prob_desc_input_from           1
lang                           0
lang_cluster                   0
difficulty                    39
file_name                      0
code_uid                       0
prob_desc_memory_limit         0
prob_desc_sample_inputs        0
exec_outcome                   0
source_code                    0
prob_desc_created_at           0
tags                           0
hidden_unit_tests              0
dtype: int64

In [5]:
# We have some missing values, mostly in prob_desc_notes; the other columns have only a few
# And they are not in the most meaningful columns
# They only affect additional information (output and input spec, notes, difficulty...)

# If values were missing in the description or code columns, we would need to study them more closely
# That would be even more true if the tags were missing

## Description

This part covers:

- prob_desc_description
- prob_desc_input_spec
- prob_desc_output_spec
- prob_desc_sample_inputs
- prob_desc_sample_outputs
- prob_desc_notes
- prob_desc_time_limit
- prob_desc_memory_limit
- prob_desc_output_to
- prob_desc_input_from
- prob_desc_created_at
- difficulty

In [6]:
# Except for the difficulty, all of these columns are text
# Some are clearly very meaningful, like prob_desc_description

df['prob_desc_description'].sample(3)

3007    While resting on the ship after the "Russian C...
578     Rahul and Tina are looking forward to starting...
3281    Gottfried learned about binary number represen...
Name: prob_desc_description, dtype: object

In [7]:
# Others only carry additional information, like prob_desc_input_from

df['prob_desc_input_from'].value_counts()

prob_desc_input_from
standard input      4933
input.txt             32
стандартный ввод      16
Name: count, dtype: int64

In [9]:
# Only the prob_desc_created_at column does not seem useful
# We can drop it
df = df.drop(columns=["prob_desc_created_at"])

In [10]:
# Some values are in another language (Russian); we will come back to this in a few cells

We will build a single description string containing all this information. The main field is probably `prob_desc_description`, but we also add the other fields to get a complete text. `prob_desc_created_at` is left out because it was dropped above, and empty fields are skipped.

The structure of our new string is:

```text

Problem Description:
<prob_desc_description>

Input Specification:
<prob_desc_input_spec>

Output Specification:
<prob_desc_output_spec>

Notes:
<prob_desc_notes>

Sample Inputs:
<prob_desc_sample_inputs>

Sample Outputs:
<prob_desc_sample_outputs>

Time Limit:
<prob_desc_time_limit>

Memory Limit:
<prob_desc_memory_limit>

I/O Format:
Input From: <prob_desc_input_from>
Output To: <prob_desc_output_to>

Difficulty:
<difficulty>
```

In [11]:
def build_full_description(row):
    parts = []

    def has_content(value):
        # NaN is a truthy float, so missing values need an explicit check;
        # otherwise they would end up in the text as the string "nan"
        if isinstance(value, float) and pd.isna(value):
            return False
        return bool(value) and str(value).strip() != ""

    def add_section(title, content):
        if has_content(content):
            parts.append(f"{title}:\n{content}")

    add_section("Problem Description", row.get("prob_desc_description"))
    add_section("Input Specification", row.get("prob_desc_input_spec"))
    add_section("Output Specification", row.get("prob_desc_output_spec"))
    add_section("Notes", row.get("prob_desc_notes"))
    add_section("Sample Inputs", row.get("prob_desc_sample_inputs"))
    add_section("Sample Outputs", row.get("prob_desc_sample_outputs"))

    add_section("Time Limit", row.get("prob_desc_time_limit"))
    add_section("Memory Limit", row.get("prob_desc_memory_limit"))

    # Input and output channels are grouped in a single section
    io_info = []
    if has_content(row.get("prob_desc_input_from")):
        io_info.append(f"Input From: {row['prob_desc_input_from']}")
    if has_content(row.get("prob_desc_output_to")):
        io_info.append(f"Output To: {row['prob_desc_output_to']}")
    if io_info:
        parts.append("I/O Format:\n" + "\n".join(io_info))

    add_section("Difficulty", row.get("difficulty"))

    # Final concatenation, with a blank line between sections
    return "\n\n".join(parts)
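
One subtlety in a helper like `add_section`: a missing value that pandas stores as `NaN` is truthy in Python and stringifies to `"nan"`, so a plain `if content` test does not filter it out. A minimal standalone sketch of an explicit check follows; the name `is_missing` is an illustrative assumption, not something defined in this notebook.

```python
import math


def is_missing(value) -> bool:
    """Return True for None, a float NaN, or a blank string (hypothetical helper)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


nan = float("nan")
print(bool(nan), str(nan))  # NaN is truthy and prints as 'nan'
print(is_missing(nan), is_missing(""), is_missing(["1 2"]), is_missing(0.0))
```

The `float` check comes first because `math.isnan` would raise a `TypeError` on lists such as the sample inputs.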


In [12]:
# We apply our function to build the full description
df["full_description"] = df.apply(build_full_description, axis=1)

# And we can delete the used columns
df = df.drop(columns=[
    "prob_desc_description",
    "prob_desc_input_spec",
    "prob_desc_output_spec",
    "prob_desc_notes",
    "prob_desc_sample_inputs",
    "prob_desc_sample_outputs",
    "prob_desc_time_limit",
    "prob_desc_memory_limit",
    "prob_desc_input_from",
    "prob_desc_output_to",
    "difficulty"
])


In [13]:
# We can see one example
print(df["full_description"].sample(1).iloc[0])

Problem Description:
Alicia has an array, $$$a_1, a_2, \ldots, a_n$$$, of non-negative integers. For each $$$1 \leq i \leq n$$$, she has found a non-negative integer $$$x_i = max(0, a_1, \ldots, a_{i-1})$$$. Note that for $$$i=1$$$, $$$x_i = 0$$$.For example, if Alicia had the array $$$a = \{0, 1, 2, 0, 3\}$$$, then $$$x = \{0, 0, 1, 2, 2\}$$$.Then, she calculated an array, $$$b_1, b_2, \ldots, b_n$$$: $$$b_i = a_i - x_i$$$.For example, if Alicia had the array $$$a = \{0, 1, 2, 0, 3\}$$$, $$$b = \{0-0, 1-0, 2-1, 0-2, 3-2\} = \{0, 1, 1, -2, 1\}$$$.Alicia gives you the values $$$b_1, b_2, \ldots, b_n$$$ and asks you to restore the values $$$a_1, a_2, \ldots, a_n$$$. Can you help her solve the problem?

Input Specification:
The first line contains one integer $$$n$$$ ($$$3 \leq n \leq 200\,000$$$) – the number of elements in Alicia's array. The next line contains $$$n$$$ integers, $$$b_1, b_2, \ldots, b_n$$$ ($$$-10^9 \leq b_i \leq 10^9$$$). It is guaranteed that for the given array $$$b$

## Metadata

In [14]:
# The metadata columns do not seem relevant at all
# Except the difficulty (which we included in the full description)
# And maybe the hidden unit tests

df['hidden_unit_tests'].unique()

array([''], dtype=object)

In [15]:
# They are all empty, so we can drop all the remaining metadata columns

# We just keep the src_uid as the index of the DataFrame
df = df.set_index("src_uid")


df = df.drop(columns=[
    "code_uid",
    "file_name",
    "hidden_unit_tests"
])

## Source code

In [16]:
# The source code columns are:
# source_code, exec_outcome, lang, lang_cluster


# Let's check the language values

df['lang'].value_counts()

lang
Python 3     2089
PyPy 3        935
PyPy 3-64     716
PyPy 2        654
Python 2      588
Name: count, dtype: int64

In [17]:
df['lang_cluster'].value_counts()

lang_cluster
Python    4982
Name: count, dtype: int64

In [18]:
# The lang_cluster column is useless because it contains a single value ('Python')

# The lang column does not seem relevant either: these are very close variants of the same language, and the user's choice of interpreter does not really affect the tags.


df = df.drop(columns=["lang", "lang_cluster"])

In [19]:
# Let's check the execution outcome values
df['exec_outcome'].value_counts()

exec_outcome
PASSED    4982
Name: count, dtype: int64

In [20]:
# All the solutions passed successfully
# So the exec_outcome column is not useful for prediction

# But it gives us meaningful information:
# since every solution passed the tests, the source code is a reliable signal for predicting the tags

# We could maybe add it later to the global string with the description
# For now, we keep source_code as a separate column

# But we can drop the execution result column
df = df.drop(columns=["exec_outcome"])

In [21]:
# An example of source code
print(df['source_code'].sample(1).iloc[0])

def read_input():
    n = int(input())
    line = input()
    line = line.strip().split()
    b = [int(num) for num in line]

    return n, b


def mishka_and_the_last_exam(n, b):
    l = [0] * n
    half = n // 2
    for i, sum_ in enumerate(b[::-1]):
        if i == 0:
            l[half - 1 - i] = sum_ // 2
            l[half + i] = sum_ - (sum_ // 2)
        elif i < half - 1:
            l[half -1 - i] = l[half - 1 - i + 1]
            l[half + i] = sum_ - l[half -1 - i]

            if l[half + i - 1] > l[half + i]:
                diff = l[half + i - 1] - l[half + i]
                l[half -1 - i] -= diff
                l[half + i] += diff
        else:
            l[0] = 0
            l[-1] = sum_
        
    return l


def print_results(result):
    result = [str(el) for el in result]
    print(" ".join(result))


if __name__ == "__main__":
    n, b = read_input()
    result = mishka_and_the_last_exam(n, b)
    print_results(result)



## Tags

In [22]:
# We now have a DataFrame containing 2 string features: "source_code" and "full_description"
# And the tags (the target column)

df['tags'].sample(3)

src_uid
d6e44bd8ac03876cb03be0731f7dda3d        [implementation, brute force, math]
7ed9265b56ef6244f95a7a663f7860dd                           [implementation]
0200c1ea8d7e429e7e9e6b5dc36e3093    [two pointers, sortings, combinatorics]
Name: tags, dtype: object

In [23]:
# We see that the tags are stored as a list of strings

df["tags"].iloc[0]


['geometry', 'brute force']

In [24]:
# We can look at the distribution of the different tags
# We cannot use a simple value_counts because each row holds a list of tags (multi-label)

# We use explode to turn each list of tags into one row per tag

df['tags'].explode().value_counts()


tags
greedy                       1743
implementation               1597
math                         1409
constructive algorithms      1036
dp                            984
brute force                   837
data structures               783
sortings                      671
binary search                 567
graphs                        542
dfs and similar               508
strings                       422
number theory                 350
trees                         324
two pointers                  320
combinatorics                 273
bitmasks                      256
dsu                           176
geometry                      166
shortest paths                144
interactive                   120
hashing                       108
games                         105
divide and conquer             94
probabilities                  92
*special                       58
flows                          42
matrices                       38
graph matchings                36
ternary s

In [25]:
# Based on our context, we are only interested in:
#  ['math', 'graphs', 'strings', 'number theory', 'trees', 'geometry', 'games', 'probabilities']

# So we can remove all the other tags from our dataset

def filter_tags(tags, relevant_tags):
    return [tag for tag in tags if tag in relevant_tags]

relevant_tags = ['math',  'graphs',  'strings',  'number theory', 'trees', 'geometry', 'games', 'probabilities']

df['tags'] = df['tags'].apply(lambda tags: filter_tags(tags, relevant_tags))

In [26]:
# We can apply again the tag counting to see the new distribution

df['tags'].explode().value_counts()

tags
math             1409
graphs            542
strings           422
number theory     350
trees             324
geometry          166
games             105
probabilities      92
Name: count, dtype: int64

In [27]:
# First observation: the classes are quite imbalanced
# That's a point to take into account for the model training
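
One simple way to quantify the imbalance is to compare each tag with the most frequent one. The sketch below uses the counts printed above; the ratio-based weighting is only one possible heuristic and an assumption here, not the method used later in the project.

```python
# Tag counts copied from the filtered distribution above.
tag_counts = {
    "math": 1409, "graphs": 542, "strings": 422, "number theory": 350,
    "trees": 324, "geometry": 166, "games": 105, "probabilities": 92,
}

# Heuristic (an assumption, not the notebook's method): weight each tag by
# how much rarer it is than the most frequent tag.
max_count = max(tag_counts.values())
tag_weights = {tag: max_count / count for tag, count in tag_counts.items()}

for tag, weight in sorted(tag_weights.items(), key=lambda kv: kv[1]):
    print(f"{tag:<15} {weight:5.2f}")
```

With these counts, `math` gets a weight of 1.0 and `probabilities` a weight of about 15.3, which shows how large the gap between the most and least frequent tags is.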

In [28]:
# Let's run value_counts on the new tag lists

df['tags'].value_counts()

tags
[]                                             2304
[math]                                          928
[strings]                                       353
[graphs]                                        332
[trees]                                         147
                                               ... 
[graphs, probabilities]                           1
[graphs, math, probabilities, trees]              1
[games, graphs, math, trees]                      1
[games, graphs, math, number theory, trees]       1
[games, probabilities]                            1
Name: count, Length: 78, dtype: int64

In [29]:
# Here we have 2 very important pieces of information

# First, 2304 rows now have an empty tag list
# These rows only had tags we are not interested in
# We can drop them

df = df[df['tags'].map(len) > 0]


# Second, we have multi-label cases
# Let's see how frequent they are

num_tags = df['tags'].apply(len)
num_tags.value_counts()

tags
1    2018
2     597
3      55
4       7
5       1
Name: count, dtype: int64
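
The share of rows carrying more than one tag can be derived directly from these counts. A small arithmetic sketch, with the numbers copied from the output above:

```python
# Number of rows per tag-list length, copied from the output above.
rows_per_tag_count = {1: 2018, 2: 597, 3: 55, 4: 7, 5: 1}

total_rows = sum(rows_per_tag_count.values())
multi_label_rows = sum(n for k, n in rows_per_tag_count.items() if k > 1)

print(total_rows, multi_label_rows, f"{multi_label_rows / total_rows:.1%}")
```

This gives 660 multi-label rows out of 2678, that is 24.6% of the remaining dataset.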

In [30]:
# If we had only a few multi-label cases, we could have dropped them
# But here they are quite frequent (660 of 2678 rows, about 25% of the dataset)
# So we have to keep them and use a multi-label classification approach
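
For a multi-label approach, each tag list is usually turned into a fixed-length 0/1 vector with one position per tag. The sketch below is a dependency-free illustration; `encode_tags` and `tag_to_index` are hypothetical names, and scikit-learn's `MultiLabelBinarizer` offers a comparable transformation.

```python
relevant_tags = ["math", "graphs", "strings", "number theory",
                 "trees", "geometry", "games", "probabilities"]
tag_to_index = {tag: i for i, tag in enumerate(relevant_tags)}


def encode_tags(tags):
    """Turn a list of tags into a fixed-length 0/1 vector (multi-hot encoding)."""
    vector = [0] * len(relevant_tags)
    for tag in tags:
        vector[tag_to_index[tag]] = 1
    return vector


print(encode_tags(["math"]))                   # [1, 0, 0, 0, 0, 0, 0, 0]
print(encode_tags(["number theory", "math"]))  # [1, 0, 0, 1, 0, 0, 0, 0]
```

The order of the input tags does not matter, so `[number theory, math]` and `[math, number theory]` map to the same target vector.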

# Language

In [None]:
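# We saw previously that some description elements were in another language
# Let's look at this

# We use langdetect to detect the language of the description
from langdetect import DetectorFactory, detect

# langdetect is non-deterministic by default; fixing the seed makes the results reproducible
DetectorFactory.seed = 0

def detect_language(text):
    try:
        return detect(text)
    except Exception:
        # Detection fails on texts without usable features (e.g. empty strings)
        return "error"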

languages = df['full_description'].apply(detect_language)

In [32]:
languages.value_counts()

full_description
en    2676
ru       2
Name: count, dtype: int64

In [33]:
# We have 2 ru labels (Russian)
# Let's observe these cases

print(df[languages == 'ru']['full_description'].iloc[0])

Problem Description:
Вам задано прямоугольное клетчатое поле, состоящее из n строк и m столбцов. Поле содержит цикл из символов «*», такой что:  цикл можно обойти, посетив каждую его клетку ровно один раз, перемещаясь каждый раз вверх/вниз/вправо/влево на одну клетку;  цикл не содержит самопересечений и самокасаний, то есть две клетки цикла соседствуют по стороне тогда и только тогда, когда они соседние при перемещении вдоль цикла (самокасание по углу тоже запрещено). Ниже изображены несколько примеров допустимых циклов:  Все клетки поля, отличные от цикла, содержат символ «.». Цикл на поле ровно один. Посещать клетки, отличные от цикла, Роботу нельзя.В одной из клеток цикла находится Робот. Эта клетка помечена символом «S». Найдите последовательность команд для Робота, чтобы обойти цикл. Каждая из четырёх возможных команд кодируется буквой и обозначает перемещение Робота на одну клетку:  «U» — сдвинуться на клетку вверх,  «R» — сдвинуться на клетку вправо,  «D» — сдвинуться на клетку 

In [34]:
# We could maybe translate them, but they represent only 2 samples
# Moreover, if we check the tags
df[languages == 'ru']['tags']

src_uid
dea56c6d6536e7efe80d39ebc6b819a8    [graphs]
fa897b774525f038dc2d1b65c4ceda28      [math]
Name: tags, dtype: object

In [35]:
# These are the 2 most frequent tags (math and graphs)
# If they belonged to a rare tag (say, fewer than 10 samples), these samples would have been precious
# and we could have translated them to keep them in the dataset

# But in our case, to avoid adding noise and possibly biased data, we can just drop these 2 samples
df = df[languages != 'ru']
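
As a dependency-free cross-check of the language detection, one can measure the share of Cyrillic letters in a text. This is only a rough heuristic sketch (the function name `cyrillic_ratio` is hypothetical) and not a replacement for `langdetect`; the sample strings are the `prob_desc_input_from` values seen earlier.

```python
def cyrillic_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the basic Cyrillic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04FF")
    return cyrillic / len(letters)


samples = ["standard input", "стандартный ввод", "input.txt"]
for s in samples:
    print(f"{s!r}: {cyrillic_ratio(s):.2f}")
```

A text whose ratio is close to 1 is almost certainly not English, while English statements that only quote a few Cyrillic words stay close to 0.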

# Final Dataset

In [36]:
# Our final dataset :
df

| src_uid | source_code | tags | full_description |
|---------|-------------|------|------------------|
| bb3fc45f903588baf131016bea175a9f | # calculate convex of polygon v.\n# v is list ... | [geometry] | Problem Description:\nIahub has drawn a set of... |
| 7d6faccc88a6839822fa0c0ec8c00251 | s = input().strip();N = len(s)\nif len(s) == 1... | [strings] | Problem Description:\nSome time ago Lesha foun... |
| 891fabbb6ee8a4969b6f413120f672a8 | n = int(input())\nfor _ in range(n):\n k,x = m... | [number theory, math] | Problem Description:\nToday at the lesson of m... |
| 9d46ae53e6dc8dc54f732ec93a82ded3 | temp = list(input())\nm = int(input())\ntrans ... | [math, strings] | Problem Description:\nPasha got a very beautif... |
| 0e0f30521f9f5eb5cff2549cd391da3c | N, B, E = input(), [], 0\nfor a in map(int, ra... | [math] | Problem Description:\nYou are given an array $... |
| ... | ... | ... | ... |
| 981e9991fb5dbd085db8a408c29564d4 | import sys\nsys.setrecursionlimit(10000000)\na... | [graphs] | Problem Description:\nYou are given a connecte... |
| ba27ac62b84705d80fa580567ab64c3b | mas = list(map(int, input().split()))\r\nt = m... | [geometry, math] | Problem Description:\nDiamond Miner is a game ... |
| 28b7e9de0eb583642526c077aa56daba | def main():\n f= [1]\n for i in range(1,... | [math] | Problem Description:\nYou are given an array a... |
| 47129977694cb371c7647cfd0db63d29 | def main():\n from sys import stdin\n fr... | [trees] | Problem Description:\nWriting light novels is ... |


In [37]:
# We have 2 string features and one multilabel target
df.dtypes

source_code         object
tags                object
full_description    object
dtype: object

In [38]:
# We have no missing values
df.isna().sum()

# Actually, the few missing values were only in additional fields, so they only result in some missing sections in the full description

source_code         0
tags                0
full_description    0
dtype: int64

In [39]:
# We can quickly study the string lengths of our 2 features
full_description_lengths = df['full_description'].apply(len)
source_code_lengths = df['source_code'].apply(len)

In [40]:
import plotly.express as px

# Box plot of full description lengths
fig1 = px.box(full_description_lengths, y=full_description_lengths, title="Full Description Lengths")
fig1.show()

# Box plot of source code lengths
fig2 = px.box(source_code_lengths, y=source_code_lengths, title="Source Code Lengths")
fig2.show()

In [41]:
# For the problem description, the distribution looks reasonable
# We have some high values and maybe outliers, but nothing really problematic

# For the source code, we have some very high values
# The median is around 750 characters while some values are above 50k
# It's clearly something to take into account during model training
# We could maybe truncate the source code at a maximum length
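
A minimal sketch of such a truncation is shown below. The helper name `truncate_text` and the limit of 4000 characters are arbitrary assumptions for illustration; a suitable limit would depend on the model and tokenizer used later.

```python
def truncate_text(text: str, max_chars: int = 4000) -> str:
    """Keep at most max_chars characters, cutting from the end."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars]


short_code = "print(input())"
long_code = "x = 1\n" * 10_000          # 60,000 characters

print(len(truncate_text(short_code)))   # 14
print(len(truncate_text(long_code)))    # 4000
```

Cutting from the end keeps the imports and input parsing at the top of a solution; whether the beginning or the end of the code is more informative for the tags is an open question here.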

In [43]:
# We can save the cleaned DataFrame in a new json file

# We set the index as a column, to save it also in the json file
df = df.reset_index()

df.to_json("../data/processed/cleaned_code_classification_dataset.jsonl", orient="records", lines=True)
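
With `orient="records", lines=True`, the file is written in JSON Lines format: one JSON object per row, one row per line. The sketch below illustrates that layout with the standard library only, on two made-up records (the `src_uid` values are placeholders); with pandas, such a file can be read back with `pd.read_json(path, lines=True)`.

```python
import json
import os
import tempfile

# Toy records with the same columns as the cleaned dataset (values are made up).
records = [
    {"src_uid": "a1", "tags": ["math"], "source_code": "print(1)"},
    {"src_uid": "b2", "tags": ["graphs", "trees"], "source_code": "print(2)"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.jsonl")

    # Write one JSON object per line, the layout produced by lines=True.
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back line by line.
    with open(path, "r", encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)  # True
```

Because every line is an independent JSON document, the tag lists survive the round trip as lists and the file can be streamed row by row.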

## Notebook overview

This notebook is dedicated to the exploration and analysis of the raw dataset.  
It was used to:
- inspect the structure of the JSON files,
- clean and normalize fields (including tags),
- build the consolidated `full_description` text input,
- analyse tag distribution and dataset characteristics.

All reusable functions (parsing, preprocessing, text construction, etc.) have been exported into the Python module  
`src/preprocessing.py`, and the cleaned dataset has been saved into `data/processed/`.

This ensures that the data cleaning and feature engineering steps can be reproduced easily and used consistently in the rest of the project (model training, evaluation, and CLI).
