# Crafting a LeetCode dataset

The dataset (A) provided by Abhishek Chaudhary on Kaggle (https://www.kaggle.com/datasets/theabbie/leetcode/data) consists of 2315 Python program files. The files are named so that each name precisely describes the program it contains.

The dataset (B) provided by gzipChrist on Kaggle (https://www.kaggle.com/datasets/gzipchrist/leetcode-problem-dataset/data) consists of a .csv file with 1825 data points. Each data point contains a problem title, a description, and other descriptive information, but no actual solution to the problem.

We combined these two datasets to generate a final dataset that contains problems with both a proper description and a solution. First, the names of the .py files from the dataset provided by Abhishek Chaudhary were normalized and used to build a dictionary with the normalized name as the key and the code as the value. Each key was then matched against the normalized "title" column of the dataset provided by gzipChrist, and the code was added to that dataset wherever a match was found.
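
The matching step can be illustrated on a toy example. This is only a minimal sketch: the file names, titles, and code strings below are made up for illustration and do not come from either Kaggle dataset, and the names `example_files`, `example_titles`, and `example_code_map` are hypothetical.

```python
import re


def normalize(text):
    """Lowercase, collapse runs of non-alphanumerics into '-', trim '-'."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)
    return text.strip("-")


# Hypothetical file names and titles, made up for illustration only.
example_files = {
    "two-sum.py": "# solution for two sum",
    "Add_Two_Numbers.py": "# solution for add two numbers",
}
example_titles = ["Two Sum", "Add Two Numbers", "LRU Cache"]

# Key each solution by its normalized file name (".py" extension removed).
example_code_map = {
    normalize(name[:-3]): code for name, code in example_files.items()
}

# Look up each title by its normalized form; None marks a missing solution.
matches = {
    title: example_code_map.get(normalize(title)) for title in example_titles
}
unmatched = [title for title, code in matches.items() if code is None]
print(unmatched)  # ['LRU Cache']
```

Because both sides go through the same `normalize` function, differences in case, spacing, and punctuation do not prevent a match. Titles that have no corresponding file simply receive no code.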

In [None]:
path_to_leetcode_programs = "/path/to/leetcode/programs"  # path to the folder of .py files downloaded from https://www.kaggle.com/datasets/theabbie/leetcode/data
path_to_leetcode_discription = "/path/to/leetcode/discription"  # path to the .csv file of descriptions downloaded from https://www.kaggle.com/datasets/gzipchrist/leetcode-problem-dataset/data

In [None]:
import os
import re
import pandas as pd

df = pd.read_csv(path_to_leetcode_discription)

# normalize a name: lowercase it, replace each run of non-alphanumeric characters with "-", and strip leading/trailing "-"
def normalize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)
    return text.strip("-")

# dictionary of file name and code
code_map = {}

# read every .py file and store its code in code_map under the normalized file name (without the .py extension)
for filename in os.listdir(path_to_leetcode_programs):
    if filename.endswith(".py"):
        file_path = os.path.join(path_to_leetcode_programs, filename)
        with open(file_path, "r", encoding="utf-8") as f:
            code_map[normalize(filename[:-3])] = f.read()


# add code to df if the title is in code_map
df["code"] = df["title"].apply(
    lambda title: code_map.get(normalize(title))
)

# report the total number of problems and how many were matched with a solution
print("Total rows:", len(df))
print("Code matched:", df["code"].notna().sum())

# save the merged dataset in .parquet and .csv formats (to_parquet needs pyarrow or fastparquet installed)
df.to_parquet("leetcode.parquet", index=False)
df.to_csv("leetcode.csv", index=False)
