# LeetCode Problem Scraper & Dataset Builder

This project builds a **complete dataset of LeetCode problems** directly from LeetCode's own GraphQL endpoint.

### What It Does
- Connects to LeetCode’s GraphQL endpoint through a persistent session that carries a CSRF token (the code does not log in)
- Fetches problem metadata (title, difficulty, tags, stats, etc.)
- Extracts **descriptions and similar problems**, and builds **problem and solution URLs** from each slug
- Saves everything into a structured CSV file for analysis or dashboards

### Goals
- Maintain an up-to-date, queryable dataset for personal analytics and research
- Support future automation for weekly updates
- Enable quick data exploration and ML-based recommendation experiments

> This notebook is designed as a **step-by-step guide** — part scraper, part documentation.

In [11]:
import json
import os
import random
import time
from datetime import datetime

import pandas as pd
import requests
from requests.adapters import HTTPAdapter

try:
    from urllib3.util.retry import Retry
except ImportError:
    # Fallback for older requests releases that bundle urllib3.
    from requests.packages.urllib3.util.retry import Retry

GRAPHQL_URL = "https://leetcode.com/graphql/"
HOMEPAGE = "https://leetcode.com/problemset/"

## Session Setup

We create a persistent `requests.Session()` with:
- Retry mechanism (to handle rate limits and 5xx errors)
- Proper CSRF token handling
- Headers that mimic a browser

This ensures stable and polite communication with LeetCode’s servers.
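
The CSRF step boils down to reading the `csrftoken` cookie set by the server and echoing its value in an `x-csrftoken` header. Below is a minimal offline sketch of that hand-off using only the standard library. The helper name `csrf_headers` and the cookie value are made up for illustration; the real session code lets `requests` manage the cookie jar.

```python
from http.cookies import SimpleCookie

def csrf_headers(set_cookie_value):
    """Build request headers that echo the csrftoken cookie (hypothetical helper)."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_value)
    morsel = cookie.get("csrftoken")
    # Fall back to an empty string when the cookie is missing.
    token = morsel.value if morsel else ""
    return {"Content-Type": "application/json", "x-csrftoken": token}

# Made-up cookie value, not a real token.
headers = csrf_headers("csrftoken=abc123; Path=/")
print(headers["x-csrftoken"])  # abc123
```

If the cookie is absent the header is sent empty, so it is worth checking the token before starting a long run.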


In [12]:
def make_leetcode_session():
    s = requests.Session()
    retry = Retry(
        total=5, connect=5, read=5,
        backoff_factor=1.2,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=frozenset(["GET", "POST"]),
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry, pool_connections=20, pool_maxsize=20)
    s.mount("https://", adapter)
    s.mount("http://", adapter)

    # Browser-like headers, reused for the warm-up request and every later call.
    base_headers = {
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://leetcode.com/",
        "Origin": "https://leetcode.com",
        "Accept": "application/json, text/plain, */*",
    }

    # Load the problem-set page once so the server sets the csrftoken cookie.
    s.get(HOMEPAGE, headers=base_headers, timeout=(10, 30))
    csrftoken = s.cookies.get("csrftoken", "")
    s.headers.update({
        **base_headers,
        "Content-Type": "application/json",
        "x-csrftoken": csrftoken,
    })
    return s

## GraphQL Query Handler

LeetCode exposes a single `/graphql/` endpoint.
All requests — from problem listing to details — go through it.

We define a helper function that:
- Accepts a query + variables
- Retries failed requests
- Returns JSON data directly

This makes all downstream queries clean and reusable.
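
As a rough illustration of what goes over the wire, the request body is a JSON object with a `query` string and an optional `variables` object. The sketch below builds that body without sending anything. `build_payload` is a hypothetical helper name, and the shortened query is only an example.

```python
import json

def build_payload(query, variables=None):
    """Assemble the JSON body for a GraphQL POST (hypothetical helper)."""
    payload = {"query": query}
    if variables:
        payload["variables"] = variables
    return payload

# Shortened example query; the slug is just a sample value.
query = "query questionData($titleSlug: String!) { question(titleSlug: $titleSlug) { title } }"
body = json.dumps(build_payload(query, {"titleSlug": "two-sum"}))
print(body)
```

Leaving `variables` out when it is empty keeps the body minimal for queries that take no arguments.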


In [13]:
def graphql_query(session, query, variables=None, max_retries=4):
    payload = {"query": query}
    if variables:
        payload["variables"] = variables

    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            r = session.post(GRAPHQL_URL, json=payload, timeout=(10, 60))
            js = r.json()  # raises ValueError if the body is not JSON
            if "errors" in js:
                raise RuntimeError(js["errors"][0].get("message", "GraphQL error"))
            if "data" in js:
                return js["data"]
            raise RuntimeError(f"Unexpected response: {r.status_code} {r.text[:300]}")
        except Exception as e:
            last_err = e
        # Exponential backoff with jitter; skip the sleep after the final attempt.
        if attempt < max_retries:
            time.sleep((2 ** (attempt - 1)) + random.uniform(0, 0.6))
    raise last_err or RuntimeError("GraphQL request failed")
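
To make the backoff concrete: after failed attempt `n`, the helper waits `2 ** (n - 1)` seconds plus up to 0.6 seconds of random jitter. The sketch below reproduces that formula on its own so the waits between the four default attempts can be inspected without touching the network. `retry_delay` is a name introduced here for illustration.

```python
import random

def retry_delay(attempt, jitter=0.6):
    """Seconds to wait after the given 1-based failed attempt."""
    return (2 ** (attempt - 1)) + random.uniform(0, jitter)

# Base waits (before jitter) between attempts 1-2, 2-3 and 3-4.
base_delays = [2 ** (attempt - 1) for attempt in range(1, 4)]
print(base_delays)  # [1, 2, 4]
```

The jitter spreads out retries so that several failing requests do not all hit the server again at the same instant.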

## GraphQL Queries

We’ll use two main queries:

1. **PROBLEMSET_QUERY** – Lists problems in paginated chunks
2. **QUESTION_DETAIL_QUERY** – Fetches full details for a single problem

These are defined once and reused throughout the scraper.


In [14]:
PROBLEMSET_QUERY = """
query problemsetQuestionList($categorySlug: String, $limit: Int, $skip: Int, $filters: QuestionListFilterInput) {
  problemsetQuestionList: questionList(categorySlug: $categorySlug, limit: $limit, skip: $skip, filters: $filters) {
    total: totalNum
    questions: data {
      questionFrontendId
      title
      titleSlug
      difficulty
      acRate
      isPaidOnly
      topicTags { name slug }
    }
  }
}
"""

QUESTION_DETAIL_QUERY = """
query questionData($titleSlug: String!) {
  question(titleSlug: $titleSlug) {
    questionId
    questionFrontendId
    title
    titleSlug
    difficulty
    isPaidOnly
    acRate
    content
    stats
    likes
    dislikes
    topicTags { name slug }
    similarQuestions
    discussionCount
  }
}
"""
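
One detail worth noting: the scraper treats `stats` and `similarQuestions` as JSON-encoded strings rather than nested objects, so they need a second `json.loads` pass. The fragment below is a made-up sample shaped like the `question` object (the numbers and titles are illustrative, not real data) and shows the decoding step in isolation.

```python
import json

# Made-up fragment shaped like the `question` object; values are illustrative only.
sample_question = {
    "titleSlug": "two-sum",
    "stats": '{"totalAcceptedRaw": 120, "totalSubmissionRaw": 300}',
    "similarQuestions": '[{"title": "3Sum"}, {"title": "4Sum"}]',
}

# `or` covers both a missing key and an explicit null from the API.
stats = json.loads(sample_question.get("stats") or "{}")
similar = json.loads(sample_question.get("similarQuestions") or "[]")

accepted = stats.get("totalAcceptedRaw")
submitted = stats.get("totalSubmissionRaw")
similar_titles = [s["title"] for s in similar]
print(accepted, submitted)  # 120 300
print(similar_titles)       # ['3Sum', '4Sum']
```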

## Fetching Problem Data

Here’s where the real scraping happens.

- The scraper loops through all problem pages
- For each problem, it fetches detailed stats (likes, tags, discussions, etc.)
- The data is collected and periodically saved to a checkpoint CSV

Problems already present in the checkpoint file are skipped, so an interrupted run can resume without creating duplicates.
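
The paging and skip logic can be sketched offline as follows. This is a simplified stand-in, not the scraper itself: `collect_new_slugs`, `fake_fetch_page`, and the slug list are hypothetical, and the in-memory list plays the role of the real endpoint.

```python
def collect_new_slugs(fetch_page, page_size, seen):
    """Walk pages until an empty batch, returning slugs not already in `seen`."""
    new_slugs = []
    skip = 0
    while True:
        batch = fetch_page(skip, page_size)
        if not batch:
            break
        for slug in batch:
            if slug not in seen:
                seen.add(slug)
                new_slugs.append(slug)
        skip += page_size
    return new_slugs

# Hypothetical in-memory "API" standing in for the real endpoint.
ALL_SLUGS = ["two-sum", "add-two-numbers", "lru-cache", "word-break", "coin-change"]

def fake_fetch_page(skip, limit):
    return ALL_SLUGS[skip:skip + limit]

already_saved = {"two-sum", "lru-cache"}  # as if loaded from a checkpoint
result = collect_new_slugs(fake_fetch_page, page_size=2, seen=already_saved)
print(result)  # ['add-two-numbers', 'word-break', 'coin-change']
```

Passing the page fetcher in as a function keeps the loop testable without network access.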


In [15]:
def fetch_all_problems_df(page_size=50, checkpoint_path=None):
    session = make_leetcode_session()
    all_rows = []
    skip = 0
    total = None

    seen = set()
    if checkpoint_path and os.path.exists(checkpoint_path):
        old_df = pd.read_csv(checkpoint_path)
        seen = set(old_df['titleSlug'])
        all_rows = old_df.to_dict('records')
        print(f"Loaded {len(seen)} existing problems from {checkpoint_path}")

    while True:
        variables = {"categorySlug": "", "limit": page_size, "skip": skip, "filters": {}}
        data = graphql_query(session, PROBLEMSET_QUERY, variables)
        root = data["problemsetQuestionList"]
        if total is None:
            total = root["total"] or 0
        batch = root["questions"] or []
        if not batch:
            break

        for q in batch:
            slug = q["titleSlug"]
            if slug in seen:
                continue

            try:
                detail_data = graphql_query(session, QUESTION_DETAIL_QUERY, {"titleSlug": slug})
                qd = detail_data["question"]

                # Both fields are JSON-encoded strings and may be null, so fall back to empty JSON.
                stats = json.loads(qd.get("stats") or "{}")
                similar = json.loads(qd.get("similarQuestions") or "[]")

                row = {
                    "frontend_id": qd.get("questionFrontendId"),
                    "internal_id": qd.get("questionId"),
                    "title": qd.get("title"),
                    "titleSlug": slug,
                    "difficulty": qd.get("difficulty"),
                    "is_premium": qd.get("isPaidOnly"),
                    "topic_tags": [t["name"] for t in (qd.get("topicTags") or [])],
                    "similar_questions": [s.get("title") for s in similar],
                    "no_similar_questions": len(similar),
                    "acceptance": qd.get("acRate"),
                    "accepted": stats.get("totalAcceptedRaw"),
                    "submission": stats.get("totalSubmissionRaw"),
                    "discussion_count": qd.get("discussionCount"),
                    "likes": qd.get("likes"),
                    "dislikes": qd.get("dislikes"),
                    "description": qd.get("content", ""),
                    "problem_URL": f"https://leetcode.com/problems/{slug}/",
                    "solution_URL": f"https://leetcode.com/problems/{slug}/solution/" if not qd.get("isPaidOnly") else None,
                    "last_updated": datetime.now().strftime("%Y-%m-%d")
                }

                all_rows.append(row)
                seen.add(slug)

            except Exception as e:
                print(f"Error fetching {slug}: {e}")
                continue

        if checkpoint_path:
            pd.DataFrame(all_rows).to_csv(checkpoint_path, index=False)

        skip += page_size
        if len(seen) >= total:
            break
        time.sleep(random.uniform(0.8, 1.5))  # avoid rate-limit

    return pd.DataFrame(all_rows)

## Save & Inspect

Once scraping is complete, `scrape()` writes the dataset to the path it is given (here `../data/raw/leetcode_latest.csv`; the default is `leetcode_full.csv`).
Load the CSV and preview the first few rows to verify integrity and schema.

> Tip: Each record includes both internal and frontend problem IDs.
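
When checking the schema, keep in mind that list-valued columns such as `topic_tags` and `similar_questions` are written to CSV as their string representation and come back as plain strings. The sketch below shows the round trip and one way to restore the lists. It uses the standard-library `csv` module as a stand-in for pandas, and the sample row is made up.

```python
import ast
import csv
import io

# Made-up row with a list-valued column, as the scraper produces.
row = {"titleSlug": "two-sum", "topic_tags": ["Array", "Hash Table"]}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)  # the list is written as "['Array', 'Hash Table']"

buf.seek(0)
loaded = next(csv.DictReader(buf))
print(type(loaded["topic_tags"]).__name__)  # str

# literal_eval safely parses the Python-literal string back into a list.
tags = ast.literal_eval(loaded["topic_tags"])
print(tags)  # ['Array', 'Hash Table']
```

With pandas, the same conversion could be applied to the affected columns after `pd.read_csv`, for example via `df["topic_tags"].apply(ast.literal_eval)`.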


In [16]:
def scrape(file_name="leetcode_full.csv"):
    df = fetch_all_problems_df(page_size=50, checkpoint_path=file_name)
    df.to_csv(file_name, index=False)
    print(f"Scraping complete! {len(df)} problems saved to {file_name}")

In [17]:
if __name__ == "__main__":
    scrape("../data/raw/leetcode_latest.csv")

Scraping complete! 3730 problems saved to leetcode_latest.csv


## Future Plans

- Integrate PostgreSQL for scalable querying
- Add automatic weekly refresh and delta-tracking
- Visualize trends via dashboards (Reflex, Streamlit, or Plotly)
- Extend scraper to fetch company tags and problem frequency stats


---


> **Note:**
> Always cross-check third-party datasets and documentation — LeetCode doesn’t officially expose a public API, so the GraphQL endpoint can change at any time.
