# **PRE-PROCESSING**

**SUMMARY**
1. Build a single text column by combining `description`, `input_description`, and `output_description`.

In [None]:
import pandas as pd

df=pd.read_json("/content/problems_data.jsonl",lines=True)
df.shape

(4112, 8)

In [None]:
# Drop title, url, and sample_io; keep the three text columns plus the two targets
text_cols = ["description","input_description","output_description"]

df_text = df[text_cols + ["problem_class", "problem_score"]]
df_text.loc[0]

Unnamed: 0,0
description,"Unununium (Uuu) was the name of the chemical\n element with atom number 111, until it changed to\n Röntgenium (Rg) in 2004. These heavy elements are very\n unstable and have only been synthesized in a few\n laboratories.\nYou have just been hired by one of these labs to optimize\n the algorithms used in simulations. For example, when\n simulating complicated chemical reactions, it is important to\n keep track of how many particles there are, and this is done by\n counting connected components in a graph.\nCurrently, the lab has some Python code (see attachments)\n that takes an undirected graph and outputs the number of\n connected components. As you can see, this code is based on\n everyone’s favourite data structure union-find1.\nAfter looking at the code for a while, you notice that it\n actually has a bug in it! The code still gives correct answers,\n but the bug could cause it to run inefficiently. Your task is\n to construct a graph with a given number of vertices and edges\n where the code runs very slowly. We will count how many times\n the third line (the one inside the while loop) is visited, and\n your program will get a score according to this number.\n"
input_description,"The input consists of one line with two integers\n $N$ and $M$, the number of vertices and edges\n your graph should have. Apart from the sample, there will be\n only one test case, with $N =\n 100$ and $M =\n 500$."
output_description,"The output consists of $M$ lines where the $i$:th contains two integers\n $u_ i$ and $v_ i$ ($1 \leq u_ i, v_ i \leq N$). This\n indicates that the vertices $u_\n i$ and $v_ i$ are\n connected with an edge in your graph."
problem_class,hard
problem_score,9.7


In [None]:
pd.set_option("display.max_colwidth", None)  # Show full cell contents instead of truncating with "..."; comment out to restore the default

In [None]:
# df_text is the working frame: its text columns are cleaned in place so results can be checked after each step.
# df keeps the raw rows, so any row can be compared before and after cleaning.
# The three text columns are merged into one at the end, after cleaning.

In [None]:
import re
import unicodedata

def clean_text(text):
    # Treat None, NaN, and any other non-string value as missing text
    if not isinstance(text, str):
        return ""

    # 1. Normalize unicode (ö → o, é → e), drop remaining non-ASCII characters, lowercase
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.lower()

    # 2. Replace LaTeX commands with Unicode symbols AFTER the ASCII step,
    #    so symbols such as ≤ survive (they carry meaning in the problem statement)
    replacements = {
        r'\\leq': '≤', r'\\geq': '≥', r'\\lt': '<', r'\\gt': '>',
        r'\\le': '≤', r'\\ge': '≥', r'\\neq': '≠', r'\\approx': '≈', r'\\cdot': ' × '
    }
    for latex, sym in replacements.items():
        # (?![a-z]) keeps \le from matching the start of longer commands such as \left
        text = re.sub(latex + r'(?![a-z])', sym, text)

    # Merge digit groups separated by commas, backslashes, or whitespace (1,000 → 1000).
    # Note: this also merges adjacent but separate numbers such as "7 10".
    text = re.sub(r'(\d+)([,\\\s]+)(\d+)', r'\1\3', text)

    # 3. Replace punctuation and newlines with spaces
    text = re.sub(r"[()\n!,_:'\"$.?{}\\/]", " ", text)

    # 4. Split hyphenated words (union-find → union find); runs after step 2 so the symbols are untouched
    text = re.sub(r"(?<=[a-z][a-z])-(?=[a-z][a-z])", " ", text)

    # 5. Replace every number with the token "num" so numbers do not each take a TF-IDF column
    text = re.sub(r'\d+', ' num ', text)

    # 6. Remove repeated-unit tokens such as aaaa, abab, ababab, which carry no meaning in the problem
    #    (this also removes real words of that shape, e.g. "mama")
    text = re.sub(r'\b([a-z]{1,3})\1+\b', ' ', text)

    # 7. Remove alphabet dummy sequences (abc, abcdef)
    text = re.sub(r'\babc(def)?\b', ' ', text)

    # 8. Collapse whitespace last, because steps 5-7 insert extra spaces
    text = " ".join(text.split())

    return text
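
The Unicode step and the LaTeX substitutions can be checked in isolation. The sketch below is only an illustration: `to_ascii` is a hypothetical helper name, and the `\left( ... \right)` sample string is made up. It shows what NFKD plus ASCII encoding does to accented and typographic characters, and why a bare `\\le` pattern is risky: it also matches the first three characters of `\left`.

```python
import re
import unicodedata

def to_ascii(text: str) -> str:
    """Hypothetical helper: decompose accented characters, then drop anything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Röntgenium"))  # the accent is stripped, the base letter stays
print(to_ascii("everyone’s"))  # the curly apostrophe has no ASCII form, so it is dropped

# Made-up sample: \left and \right are LaTeX delimiters, \le is the operator.
sample = r"\left( a \le b \right)"
naive = re.sub(r"\\le", "≤", sample)                # also rewrites the start of \left
guarded = re.sub(r"\\le(?![a-zA-Z])", "≤", sample)  # lookahead: only the bare \le command

print(naive)
print(guarded)
```

The negative lookahead is one possible guard; sorting the patterns longest-first would not help here, because `\left` is not in the replacement table at all.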


In [None]:
# The next cell applies clean_text to each of the three text columns.
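
Before running the cleaner on the whole frame, it helps to see the number and repeated-token rules on one short string. This is a minimal sketch on a made-up sentence, not the full `clean_text` pipeline. It also shows that whitespace has to be collapsed after these substitutions, because each of them inserts spaces.

```python
import re

# Made-up sample sentence; "mama" is included to show a real word being caught.
s = "n = 100 and m = 500 abab mama"

# Every run of digits becomes the single token "num".
s = re.sub(r"\d+", " num ", s)

# Tokens made of a 1-3 letter unit repeated two or more times are removed.
s = re.sub(r"\b([a-z]{1,3})\1+\b", " ", s)

# Collapse the extra spaces introduced by the two substitutions above.
s = " ".join(s.split())

print(s)
```

Note that `mama` disappears together with `abab`: the repeated-unit rule cannot tell dummy strings from real words with the same shape.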

In [None]:
cols_to_clean = [
    "description",
    "input_description",
    "output_description"
]

# Work on an explicit copy: df_text was created by selecting columns from df,
# and assigning into that selection triggers pandas' SettingWithCopyWarning.
df_text = df_text.copy()

for col in cols_to_clean:
    df_text[col] = df_text[col].apply(clean_text)
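
Assigning into a frame that was produced by selecting columns from another frame is the situation pandas flags with `SettingWithCopyWarning`. A minimal sketch of the usual remedy, on a toy frame with made-up values (`raw` and `subset` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
raw = pd.DataFrame({
    "description": ["Hello World", "Foo"],
    "problem_class": ["easy", "hard"],
})

# .copy() makes the selection an independent DataFrame,
# so assigning into it is unambiguous and leaves raw untouched.
subset = raw[["description"]].copy()
subset["description"] = subset["description"].str.lower()

print(subset["description"].tolist())
print(raw["description"].tolist())  # the original frame is unchanged
```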


In [None]:
df_text.loc[0]

Unnamed: 0,0
description,unununium was the name of the chemical element with atom number num until it changed to rontgenium rg in num these heavy elements are very unstable and have only been synthesized in a few laboratories you have just been hired by one of these labs to optimize the algorithms used in simulations for example when simulating complicated chemical reactions it is important to keep track of how many particles there are and this is done by counting connected components in a graph currently the lab has some python code see attachments that takes an undirected graph and outputs the number of connected components as you can see this code is based on everyones favourite data structure union find num after looking at the code for a while you notice that it actually has a bug in it the code still gives correct answers but the bug could cause it to run inefficiently your task is to construct a graph with a given number of vertices and edges where the code runs very slowly we will count how many times the third line the one inside the while loop is visited and your program will get a score according to this number
input_description,the input consists of one line with two integers n and m the number of vertices and edges your graph should have apart from the sample there will be only one test case with n = num and m = num
output_description,the output consists of m lines where the i th contains two integers u i and v i num ≤ u i v i ≤ n this indicates that the vertices u i and v i are connected with an edge in your graph
problem_class,hard
problem_score,9.7


In [None]:
import numpy as np

In [None]:
# Count nulls, empty strings, and literal 'none'/'nan'/'null' strings in each text column after cleaning
pd.DataFrame({
    'nulls': df_text[['description','input_description','output_description']].isna().sum(),
    'empty_strings': (df_text[['description','input_description','output_description']] == "").sum(),
    'fake_null': (df_text[['description','input_description','output_description']].isin(['none','nan','null'])).sum()
})

Unnamed: 0,nulls,empty_strings,fake_null
description,0,81,0
input_description,0,120,0
output_description,0,131,0


In [None]:
# Select the rows where any of the three text columns is empty or whitespace only
is_empty = df_text[
    (df_text['description'].str.strip() == "") |
    (df_text['input_description'].str.strip() == "") |
    (df_text['output_description'].str.strip() == "")
]

print("Rows with empty strings:")
list_empty=is_empty.index.tolist()
print(list_empty)

# Count the flagged rows per problem class
empty_counts = df_text.loc[list_empty].groupby('problem_class').size()

print("\nCounts by problem class:")
print(empty_counts)

Rows with empty strings:
[2, 5, 6, 13, 14, 15, 26, 37, 39, 47, 55, 56, 58, 66, 95, 105, 111, 112, 164, 213, 217, 317, 322, 370, 387, 392, 398, 416, 447, 455, 497, 507, 523, 540, 541, 550, 558, 603, 668, 686, 687, 731, 770, 793, 798, 841, 855, 868, 939, 971, 990, 1085, 1097, 1128, 1135, 1141, 1149, 1169, 1253, 1280, 1296, 1313, 1333, 1349, 1358, 1367, 1368, 1418, 1446, 1454, 1463, 1496, 1501, 1525, 1557, 1584, 1588, 1594, 1595, 1603, 1630, 1643, 1646, 1656, 1659, 1726, 1737, 1742, 1815, 1816, 1844, 1848, 1870, 1881, 1883, 1899, 1944, 1961, 1963, 1986, 1993, 1994, 2006, 2014, 2020, 2043, 2083, 2102, 2118, 2203, 2211, 2245, 2252, 2263, 2266, 2363, 2374, 2375, 2401, 2408, 2420, 2421, 2448, 2473, 2474, 2503, 2511, 2527, 2557, 2564, 2566, 2568, 2570, 2576, 2580, 2588, 2594, 2598, 2618, 2625, 2712, 2731, 2760, 2764, 2767, 2804, 2832, 2835, 2877, 2880, 2893, 2932, 2938, 2954, 2955, 3001, 3010, 3037, 3038, 3045, 3046, 3062, 3090, 3098, 3117, 3167, 3180, 3193, 3235, 3254, 3255, 3256, 3258, 3338,

In [None]:
# 1. Indices of the rows flagged above (pasted from list_empty)
indices_list = [2, 5, 6, 13, 14, 15, 26, 37, 39, 47, 55, 56, 58, 66, 95, 105, 111, 112, 164, 213, 217, 317, 322, 370, 387, 392, 398, 416, 447, 455, 497, 507, 523, 540, 541, 550, 558, 603, 668, 686, 687, 731, 770, 793, 798, 841, 855, 868, 939, 971, 990, 1085, 1097, 1128, 1135, 1141, 1149, 1169, 1253, 1280, 1296, 1313, 1333, 1349, 1358, 1367, 1368, 1418, 1446, 1454, 1463, 1496, 1501, 1525, 1557, 1584, 1588, 1594, 1595, 1603, 1630, 1643, 1646, 1656, 1659, 1726, 1737, 1742, 1815, 1816, 1844, 1848, 1870, 1881, 1883, 1899, 1944, 1961, 1963, 1986, 1993, 1994, 2006, 2014, 2020, 2043, 2083, 2102, 2118, 2203, 2211, 2245, 2252, 2263, 2266, 2363, 2374, 2375, 2401, 2408, 2420, 2421, 2448, 2473, 2474, 2503, 2511, 2527, 2557, 2564, 2566, 2568, 2570, 2576, 2580, 2588, 2594, 2598, 2618, 2625, 2712, 2731, 2760, 2764, 2767, 2804, 2832, 2835, 2877, 2880, 2893, 2932, 2938, 2954, 2955, 3001, 3010, 3037, 3038, 3045, 3046, 3062, 3090, 3098, 3117, 3167, 3180, 3193, 3235, 3254, 3255, 3256, 3258, 3338, 3350, 3379, 3392, 3413, 3426, 3431, 3471, 3534, 3535, 3550, 3568, 3590, 3627, 3636, 3670, 3671, 3691, 3783, 3813, 3826, 3884, 3885, 3898, 3942, 3965, 3973, 3983, 4008, 4028, 4032, 4049, 4076, 4079, 4088, 4095, 4097, 4103, 4109, 4110]

# 2. Extract the specific rows
detailed_audit = df_text.loc[indices_list].copy()

# 3. Create a function to check individual cell status
def check_cell(val):
    if str(val).strip() == "":
        return "Empty String"
    else:
        return "Has Text"

# 4. Apply the check to each of the three columns
detailed_audit['desc_status'] = detailed_audit['description'].apply(check_cell)
detailed_audit['input_status'] = detailed_audit['input_description'].apply(check_cell)
detailed_audit['output_status'] = detailed_audit['output_description'].apply(check_cell)

# 5. Keep only the columns that matter for the report
# This includes the original index and the problem class
final_df = detailed_audit.reset_index()[['index', 'problem_class', 'desc_status', 'input_status', 'output_status']]

final_df.head(10)

Unnamed: 0,index,problem_class,desc_status,input_status,output_status
0,2,hard,Has Text,Empty String,Empty String
1,5,hard,Has Text,Empty String,Empty String
2,6,hard,Has Text,Has Text,Empty String
3,13,hard,Empty String,Has Text,Has Text
4,14,hard,Has Text,Empty String,Empty String
5,15,hard,Has Text,Empty String,Empty String
6,26,hard,Has Text,Empty String,Empty String
7,37,hard,Has Text,Empty String,Empty String
8,39,hard,Has Text,Has Text,Empty String
9,47,hard,Has Text,Has Text,Empty String


In [None]:
for problem_class_value in final_df['problem_class'].unique():
    print(f"\n--- Problem Class: {problem_class_value.capitalize()} ---")
    display(final_df[final_df['problem_class'] == problem_class_value])


--- Problem Class: Hard ---


Unnamed: 0,index,problem_class,desc_status,input_status,output_status
0,2,hard,Has Text,Empty String,Empty String
1,5,hard,Has Text,Empty String,Empty String
2,6,hard,Has Text,Has Text,Empty String
3,13,hard,Empty String,Has Text,Has Text
4,14,hard,Has Text,Empty String,Empty String
...,...,...,...,...,...
91,1848,hard,Has Text,Empty String,Empty String
92,1870,hard,Has Text,Empty String,Empty String
93,1881,hard,Empty String,Has Text,Has Text
94,1883,hard,Empty String,Has Text,Has Text



--- Problem Class: Medium ---


Unnamed: 0,index,problem_class,desc_status,input_status,output_status
96,1944,medium,Has Text,Empty String,Empty String
97,1961,medium,Has Text,Empty String,Empty String
98,1963,medium,Has Text,Empty String,Empty String
99,1986,medium,Has Text,Empty String,Empty String
100,1993,medium,Has Text,Empty String,Empty String
...,...,...,...,...,...
169,3254,medium,Empty String,Has Text,Has Text
170,3255,medium,Empty String,Has Text,Has Text
171,3256,medium,Empty String,Has Text,Has Text
172,3258,medium,Has Text,Has Text,Empty String



--- Problem Class: Easy ---


Unnamed: 0,index,problem_class,desc_status,input_status,output_status
174,3350,easy,Has Text,Empty String,Empty String
175,3379,easy,Empty String,Has Text,Has Text
176,3392,easy,Has Text,Empty String,Empty String
177,3413,easy,Empty String,Has Text,Has Text
178,3426,easy,Empty String,Has Text,Has Text
179,3431,easy,Has Text,Empty String,Empty String
180,3471,easy,Has Text,Empty String,Empty String
181,3534,easy,Has Text,Empty String,Empty String
182,3535,easy,Empty String,Empty String,Empty String
183,3550,easy,Empty String,Has Text,Has Text


In [None]:
# Several of the flagged easy problems have no description at all, which leaves the model little to learn from,
# so lower recall on the easy class would not be surprising.

In [None]:
# One option: add a binary is_desc_empty column during feature extraction.
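
A minimal sketch of what such a flag could look like, on a toy frame with made-up rows. The column name `is_desc_empty` comes from the note above; everything else here is an assumption, and the notebook itself ends up dropping these rows instead.

```python
import pandas as pd

# Toy frame: one real description, one empty string, one whitespace-only string.
toy = pd.DataFrame({"description": ["count the components", "", "   "]})

# 1 where the description is empty after stripping whitespace, else 0.
toy["is_desc_empty"] = toy["description"].str.strip().eq("").astype(int)

print(toy)
```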

In [None]:
# Alternative (used below): drop these rows, since plenty of complete rows remain.

In [None]:
df_text.shape

(4112, 5)

In [None]:
df_text.problem_class.value_counts()

Unnamed: 0_level_0,count
problem_class,Unnamed: 1_level_1
hard,1941
medium,1405
easy,766


In [None]:
# Create a mask for rows that are NOT empty
is_not_empty = ~(
    (df_text['description'].fillna('').str.strip() == "") |
    (df_text['input_description'].fillna('').str.strip() == "") |
    (df_text['output_description'].fillna('').str.strip() == "")
)

# Keep only the clean rows
df_text = df_text[is_not_empty].copy()

df_text.shape

(3899, 5)

In [None]:
df_text.problem_class.value_counts()

Unnamed: 0_level_0,count
problem_class,Unnamed: 1_level_1
hard,1845
medium,1327
easy,727
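
The two `value_counts()` outputs can be compared directly to see how the filter affected each class. The sketch below only re-uses the counts printed in this notebook and does plain arithmetic on them.

```python
# Class counts copied from the value_counts() outputs before and after the filter.
before = {"hard": 1941, "medium": 1405, "easy": 766}
after = {"hard": 1845, "medium": 1327, "easy": 727}

# Rows removed per class.
dropped = {cls: before[cls] - after[cls] for cls in before}

for cls, n in dropped.items():
    print(f"{cls}: dropped {n} of {before[cls]} ({n / before[cls]:.1%})")

print("total dropped:", sum(dropped.values()))
```

The filter removes roughly 5% of each class (96 hard, 78 medium, 39 easy), so it does not shift the class balance much.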


In [None]:
# Re-run the audit: no nulls, empty strings, or fake-null strings should remain after dropping rows
pd.DataFrame({
    'nulls': df_text[['description','input_description','output_description']].isna().sum(),
    'empty_strings': (df_text[['description','input_description','output_description']] == "").sum(),
    'fake_null': (df_text[['description','input_description','output_description']].isin(['none','nan','null'])).sum()
})


Unnamed: 0,nulls,empty_strings,fake_null
description,0,0,0
input_description,0,0,0
output_description,0,0,0


In [None]:
#cleaned table
df_text.loc[0]

Unnamed: 0,0
description,unununium was the name of the chemical element with atom number num until it changed to rontgenium rg in num these heavy elements are very unstable and have only been synthesized in a few laboratories you have just been hired by one of these labs to optimize the algorithms used in simulations for example when simulating complicated chemical reactions it is important to keep track of how many particles there are and this is done by counting connected components in a graph currently the lab has some python code see attachments that takes an undirected graph and outputs the number of connected components as you can see this code is based on everyones favourite data structure union find num after looking at the code for a while you notice that it actually has a bug in it the code still gives correct answers but the bug could cause it to run inefficiently your task is to construct a graph with a given number of vertices and edges where the code runs very slowly we will count how many times the third line the one inside the while loop is visited and your program will get a score according to this number
input_description,the input consists of one line with two integers n and m the number of vertices and edges your graph should have apart from the sample there will be only one test case with n = num and m = num
output_description,the output consists of m lines where the i th contains two integers u i and v i num ≤ u i v i ≤ n this indicates that the vertices u i and v i are connected with an edge in your graph
problem_class,hard
problem_score,9.7


In [None]:
#uncleaned table
df.loc[0]

Unnamed: 0,0
title,Uuu
description,"Unununium (Uuu) was the name of the chemical\n element with atom number 111, until it changed to\n Röntgenium (Rg) in 2004. These heavy elements are very\n unstable and have only been synthesized in a few\n laboratories.\nYou have just been hired by one of these labs to optimize\n the algorithms used in simulations. For example, when\n simulating complicated chemical reactions, it is important to\n keep track of how many particles there are, and this is done by\n counting connected components in a graph.\nCurrently, the lab has some Python code (see attachments)\n that takes an undirected graph and outputs the number of\n connected components. As you can see, this code is based on\n everyone’s favourite data structure union-find1.\nAfter looking at the code for a while, you notice that it\n actually has a bug in it! The code still gives correct answers,\n but the bug could cause it to run inefficiently. Your task is\n to construct a graph with a given number of vertices and edges\n where the code runs very slowly. We will count how many times\n the third line (the one inside the while loop) is visited, and\n your program will get a score according to this number.\n"
input_description,"The input consists of one line with two integers\n $N$ and $M$, the number of vertices and edges\n your graph should have. Apart from the sample, there will be\n only one test case, with $N =\n 100$ and $M =\n 500$."
output_description,"The output consists of $M$ lines where the $i$:th contains two integers\n $u_ i$ and $v_ i$ ($1 \leq u_ i, v_ i \leq N$). This\n indicates that the vertices $u_\n i$ and $v_ i$ are\n connected with an edge in your graph."
sample_io,"[{'input': '7 10', 'output': '1 2 2 3 1 3 3 4 5 6 6 7 5 7 1 7 7 2 5 1'}]"
problem_class,hard
problem_score,9.7
url,https://open.kattis.com/problems/uuu


In [None]:
# Inspect a random row to see whether any further cleaning is needed
row = df_text.sample(1).index[0]
df_text.loc[row]

Unnamed: 0,2439
description,at work atrebla recently became overwhelmed with orders from people all over his company and is trying to reduce his workload he has realized that a large portion of the orders are from people who dont have the authority to tell him what to do by identifying such orders easily he will be able to quickly discard them at atreblas company all employees have one manager except for the ceo who does not have any manager only the management chain of an employee has the authority give orders to him or her the management chain of the ceo is defined as the empty set and the management chain of any other employee a with manager b is defined as the set of employees consisting of b and bs management chain talking to other employees in the company atrebla realized that many employees are facing the same issue help atrebla and other employees identify the orders they can discard
input_description,the first line contains two integers n and m num ≤ n m ≤ num the number of employees at atreblas company and the number of orders employees are numbered from num to n inclusive and employee num is the ceo next will follow n- num lines containing the manager of employees num to n it is guaranteed that the ceo is in the management chain of all other employees and that an employee is never in their own management chain this is followed by m lines one for each order where the i th line contains two integers a i and b i num ≤ a i b i ≤ n a i ≠ b i meaning that employee a i received an order from employee b i
output_description,output m lines one for each order containing yes if the order should be ignored and no otherwise
problem_class,medium
problem_score,4.5


In [None]:
# Finally, merge the three cleaned columns into a single text column.

df_text['text'] = (df_text['description'].fillna('') + ' | ' +
                         df_text['input_description'].fillna('') + ' | ' +
                         df_text['output_description'].fillna('')).str.strip()

# Replace the ' | ' separators with a single space.
# Note: this also removes any other '|' characters left in the text.
df_text['text'] = df_text['text'].str.replace(r'\s*\|\s*', ' ', regex=True)
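
One detail of this merge is easy to miss: the separator clean-up removes every `|` in the text, including ones that were part of the problem statement (for example absolute-value bars). The sketch below uses a made-up row to compare it with a plain space join; `joined` and `via_pipe` are illustrative names.

```python
import pandas as pd

# Made-up row whose description contains '|' characters of its own.
toy = pd.DataFrame({
    "description": ["find |x| for each x"],
    "input_description": ["one integer x"],
    "output_description": ["print the answer"],
})
cols = ["description", "input_description", "output_description"]

# Joining with a plain space keeps any '|' that was already in the text.
joined = toy[cols].agg(" ".join, axis=1)

# Joining with ' | ' and then replacing the pattern strips all '|' characters.
via_pipe = (toy["description"] + " | " + toy["input_description"]
            + " | " + toy["output_description"])
via_pipe = via_pipe.str.replace(r"\s*\|\s*", " ", regex=True)

print(joined.iloc[0])
print(via_pipe.iloc[0])
```

Which behaviour is preferable depends on whether `|` should count as a token for the downstream features.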

In [None]:
df_text.loc[0]

Unnamed: 0,0
description,unununium was the name of the chemical element with atom number num until it changed to rontgenium rg in num these heavy elements are very unstable and have only been synthesized in a few laboratories you have just been hired by one of these labs to optimize the algorithms used in simulations for example when simulating complicated chemical reactions it is important to keep track of how many particles there are and this is done by counting connected components in a graph currently the lab has some python code see attachments that takes an undirected graph and outputs the number of connected components as you can see this code is based on everyones favourite data structure union find num after looking at the code for a while you notice that it actually has a bug in it the code still gives correct answers but the bug could cause it to run inefficiently your task is to construct a graph with a given number of vertices and edges where the code runs very slowly we will count how many times the third line the one inside the while loop is visited and your program will get a score according to this number
input_description,the input consists of one line with two integers n and m the number of vertices and edges your graph should have apart from the sample there will be only one test case with n = num and m = num
output_description,the output consists of m lines where the i th contains two integers u i and v i num ≤ u i v i ≤ n this indicates that the vertices u i and v i are connected with an edge in your graph
problem_class,hard
problem_score,9.7
text,unununium was the name of the chemical element with atom number num until it changed to rontgenium rg in num these heavy elements are very unstable and have only been synthesized in a few laboratories you have just been hired by one of these labs to optimize the algorithms used in simulations for example when simulating complicated chemical reactions it is important to keep track of how many particles there are and this is done by counting connected components in a graph currently the lab has some python code see attachments that takes an undirected graph and outputs the number of connected components as you can see this code is based on everyones favourite data structure union find num after looking at the code for a while you notice that it actually has a bug in it the code still gives correct answers but the bug could cause it to run inefficiently your task is to construct a graph with a given number of vertices and edges where the code runs very slowly we will count how many times the third line the one inside the while loop is visited and your program will get a score according to this number the input consists of one line with two integers n and m the number of vertices and edges your graph should have apart from the sample there will be only one test case with n = num and m = num the output consists of m lines where the i th contains two integers u i and v i num ≤ u i v i ≤ n this indicates that the vertices u i and v i are connected with an edge in your graph


In [None]:
df_text=df_text[['text','problem_class','problem_score']]

In [None]:
df_text.loc[0]

Unnamed: 0,0
text,unununium was the name of the chemical element with atom number num until it changed to rontgenium rg in num these heavy elements are very unstable and have only been synthesized in a few laboratories you have just been hired by one of these labs to optimize the algorithms used in simulations for example when simulating complicated chemical reactions it is important to keep track of how many particles there are and this is done by counting connected components in a graph currently the lab has some python code see attachments that takes an undirected graph and outputs the number of connected components as you can see this code is based on everyones favourite data structure union find num after looking at the code for a while you notice that it actually has a bug in it the code still gives correct answers but the bug could cause it to run inefficiently your task is to construct a graph with a given number of vertices and edges where the code runs very slowly we will count how many times the third line the one inside the while loop is visited and your program will get a score according to this number the input consists of one line with two integers n and m the number of vertices and edges your graph should have apart from the sample there will be only one test case with n = num and m = num the output consists of m lines where the i th contains two integers u i and v i num ≤ u i v i ≤ n this indicates that the vertices u i and v i are connected with an edge in your graph
problem_class,hard
problem_score,9.7


In [None]:
difficulty_map = {'easy': 0, 'medium': 1, 'hard': 2}
df_text['problem_level'] = df_text['problem_class'].map(difficulty_map)
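
`Series.map` with a dictionary returns `NaN` for any label that is not a key, so it is worth checking for unmapped values after encoding. A minimal sketch with toy labels (the mis-cased `'Medium'` is deliberate and made up):

```python
import pandas as pd

difficulty_map = {"easy": 0, "medium": 1, "hard": 2}

# Toy labels; 'Medium' is deliberately mis-cased to show the failure mode.
labels = pd.Series(["easy", "hard", "Medium"])
levels = labels.map(difficulty_map)

# Labels that did not match any key come back as NaN.
unmapped = labels[levels.isna()].unique().tolist()
print(unmapped)
```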

In [None]:
df_text=df_text[['text','problem_level','problem_score']]

In [None]:
df_text.loc[0]

Unnamed: 0,0
text,unununium was the name of the chemical element with atom number num until it changed to rontgenium rg in num these heavy elements are very unstable and have only been synthesized in a few laboratories you have just been hired by one of these labs to optimize the algorithms used in simulations for example when simulating complicated chemical reactions it is important to keep track of how many particles there are and this is done by counting connected components in a graph currently the lab has some python code see attachments that takes an undirected graph and outputs the number of connected components as you can see this code is based on everyones favourite data structure union find num after looking at the code for a while you notice that it actually has a bug in it the code still gives correct answers but the bug could cause it to run inefficiently your task is to construct a graph with a given number of vertices and edges where the code runs very slowly we will count how many times the third line the one inside the while loop is visited and your program will get a score according to this number the input consists of one line with two integers n and m the number of vertices and edges your graph should have apart from the sample there will be only one test case with n = num and m = num the output consists of m lines where the i th contains two integers u i and v i num ≤ u i v i ≤ n this indicates that the vertices u i and v i are connected with an edge in your graph
problem_level,2
problem_score,9.7


In [None]:
# Check a specific row to compare the cleaned text against the original JSONL record
df_text.loc[9]

Unnamed: 0,9
text,you are given a simple undirected graph with no self loops or multiple edges some of the edges are marked as special your task is to find a simple cycle where for each special edge that edge either belongs to the cycle or neither of its endpoints touch the cycle the cycle is not allowed to repeat vertices output any solution or report that none exist the first line of input contains three integers n num ≤ n ≤ num m num ≤ m ≤ frac n × n- num num and k num ≤ k ≤ m where n is the number of nodes in the graph m is the number of edges and k is the number of edges that are special the nodes are numbered num through n output an integer denoting the length of the found cycle on one line on subsequent lines output the vertices of the cycle in order around the cycle one per line if no such cycle exists simply output - num
problem_level,2
problem_score,9.5


In [None]:
# Confirm the merged text column has no nulls, empty strings, or fake-null strings
pd.DataFrame({
    'nulls': df_text[['text']].isna().sum(),
    'empty_strings': (df_text[['text']] == "").sum(),
    'fake_null': (df_text[['text']].isin(['none','nan','null'])).sum()
})

Unnamed: 0,nulls,empty_strings,fake_null
text,0,0,0


In [None]:
df_text.shape

(3899, 3)

In [None]:
df_text.to_json('problems_data_cleaned.jsonl', orient='records', lines=True)
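
A quick round trip confirms that the JSON Lines export can be read back with the same shape and that non-ASCII symbols such as `≤` survive. The sketch below uses a one-row toy frame and an in-memory buffer instead of the real file, so it does not touch `problems_data_cleaned.jsonl`.

```python
import io
import pandas as pd

# One-row toy frame with the same columns as the cleaned data (made-up values).
toy = pd.DataFrame({"text": ["num ≤ n"], "problem_level": [2], "problem_score": [9.7]})

# With no path argument, to_json returns the JSON Lines text instead of writing a file.
payload = toy.to_json(orient="records", lines=True)

# Read it back from an in-memory buffer.
restored = pd.read_json(io.StringIO(payload), lines=True)

print(restored.shape)
print(restored.loc[0, "text"])
```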

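In conclusion:
1. We lowercased all text, normalised Unicode to ASCII, mapped LaTeX comparison operators to Unicode math symbols, replaced numbers with the token `num`, and dropped the 213 rows with an empty description, input, or output field (4112 → 3899 rows).
2. We dropped the unused columns (`title`, `url`, `sample_io`), merged the three text columns into `text`, and encoded the classes as easy = 0, medium = 1, hard = 2.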