# Handwritten Dataset Extraction (Hard-copy Image to Digital Text)

This notebook converts Cambridge IELTS answer-sheet images into a clean JSONL dataset with the following fields:

- `topic` (printed prompt)
- `band` (printed band score)
- `essay` (handwritten, OCR’d with TrOCR, then corrected)
- `examiner_comment` (printed feedback, OCR’d with robust preprocessing)
- `source` (filename stem)

The pipeline combines computer vision (OpenCV + Tesseract for printed text, TrOCR for handwriting) with NLP correction (SymSpell, then GECToR or Neuspell).
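For orientation, one output record and a round trip through JSONL might look like the sketch below. The field values are placeholders invented for illustration (they are not taken from the dataset); the key names follow the record that the pipeline writes, where the score is stored under `band`.

```python
import io
import json

# Hypothetical record: values are placeholders, keys mirror the pipeline output.
record = {
    "topic": "Placeholder topic prompt.",
    "band": 6.5,
    "essay": "Placeholder essay text.",
    "examiner_comment": "Placeholder examiner feedback.",
    "source": "page_01",
}

# JSONL = one JSON object per line; ensure_ascii=False keeps curly quotes readable.
buffer = io.StringIO()
buffer.write(json.dumps(record, ensure_ascii=False) + "\n")

# Reading back: parse each non-empty line independently.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
```

An in-memory buffer stands in for the output file here; the same read loop should work on a file opened with `encoding="utf-8"`.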

## 0. Setup
Install the required OCR, NLP, and correction libraries.

In [1]:
# OCR + Deep Learning backbone
!apt-get -qq update && apt-get -qq install -y tesseract-ocr tesseract-ocr-eng
!pip -q install pytesseract pillow opencv-python-headless transformers timm accelerate torch torchvision

# Lexical frequency correction
!pip -q install symspellpy wordfreq

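# Contextual correction
# NOTE: the GECToR repository ships no setup.py/pyproject.toml, so this pip command fails
# (see the ERROR in the output). GECToR is therefore unavailable and the fallback path in section 1c is used.
!pip -q install git+https://github.com/grammarly/gector.git allennlp==2.10.1 allennlp-models==2.10.1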
!pip -q install neuspell

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
ERROR: git+https://github.com/grammarly/gector.git does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.

In [2]:
import glob
import json
import os
import re
from pathlib import Path

import cv2
import numpy as np
import pytesseract
import torch
from PIL import Image
from google.colab import drive
from symspellpy import SymSpell
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from wordfreq import top_n_list

In [3]:
# Load datasets
drive.mount("/content/drive")

input_dir = "/content/drive/MyDrive/cambridge_ielts"
out_jsonl = "/content/drive/MyDrive/cambridge_trocr_gector.jsonl"

inputs = sorted(glob.glob(os.path.join(input_dir, "*.png")))
print(f" Found {len(inputs)} PNG files in {input_dir}")

Mounted at /content/drive
 Found 14 PNG files in /content/drive/MyDrive/cambridge_ielts


## 1. Models

**a. TrOCR Handwriting Model**

The `microsoft/trocr-large-handwritten` encoder–decoder model is used for text-line recognition.
It works on cropped line images and outputs Unicode strings.

In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"  # prefer the GPU, fall back to CPU
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
htr = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten").to(device).eval()

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-large-handwritten and are newly initialized: ['encoder.pooler.dense.bias', 'encoder.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**b. SymSpell**

A lightweight spell corrector that uses word frequencies to repair obvious OCR distortions.

In [5]:
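def build_symspell(n_words: int = 50000):
    sym = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
    # top_n_list is ordered most-frequent first; use the rank as a proxy count
    # so SymSpell can prefer common words when candidates tie on edit distance.
    for rank, w in enumerate(top_n_list("en", n=n_words)):
        sym.create_dictionary_entry(w, n_words - rank)
    return sym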

_symspell = build_symspell()

def symspell_correct_sentence(text: str) -> str:
    try:
        res = _symspell.lookup_compound(text, max_edit_distance=2)
        return res[0].term if res else text
    except Exception:
        return text
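The idea of frequency-guided correction can be illustrated without SymSpell. The toy sketch below is not SymSpell's algorithm (SymSpell relies on a precomputed index of deletions for speed); it only shows the selection rule of "closest word first, more frequent word on ties". The frequency table and the names `TOY_FREQ` and `toy_correct` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical counts, for illustration only.
TOY_FREQ = {"home": 900, "hone": 5, "tone": 120, "own": 700, "renting": 60}

def toy_correct(word: str, max_dist: int = 2) -> str:
    """Return the closest dictionary word; ties on distance go to the more frequent word."""
    if word in TOY_FREQ:
        return word
    scored = [(edit_distance(word, w), -count, w) for w, count in TOY_FREQ.items()]
    dist, _, best = min(scored)
    return best if dist <= max_dist else word
```

With this table, `toy_correct("hoxe")` resolves to `"home"` rather than `"hone"`: both are one edit away, and the higher count breaks the tie. The same rule can also replace a rare but correct word with a common one, which is worth keeping in mind when reading corrected OCR output.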

**c. Contextual Corrector (GECToR → Neuspell fallback)**

`GECToR` is a transformer-based grammatical-error-correction model that fixes context-dependent mistakes.
If no checkpoint is available, the code falls back to Neuspell’s BERT-based corrector.

In [6]:
_USE_GECTOR = False
_NEUSPELL_OK = False

GECTOR_VOCAB_PATH  = "/content/gector_vocab"
GECTOR_MODEL_PATHS = ["/content/roberta_base_gector.th"]

try:
    from gector.gec_model import GecBERTModel
    import os
    have_vocab  = os.path.isdir(GECTOR_VOCAB_PATH) and any("labels" in p for p in os.listdir(GECTOR_VOCAB_PATH))
    have_models = all(os.path.isfile(p) for p in GECTOR_MODEL_PATHS)
    if have_vocab and have_models:
        gector_model = GecBERTModel(
            vocab_path=GECTOR_VOCAB_PATH,
            model_paths=GECTOR_MODEL_PATHS,
            max_len=128, min_len=3, iterations=3,
            min_error_probability=0.0, confidence=0, log=False
        )
        _USE_GECTOR = True
except Exception:
    _USE_GECTOR = False

if not _USE_GECTOR:
    try:
        from neuspell import BertChecker
        _checker = BertChecker()
        _checker.from_pretrained("bert-base-uncased")
        _NEUSPELL_OK = True
    except Exception:
        _checker = None
        _NEUSPELL_OK = False

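def contextual_correct(text: str) -> str:
    if not text.strip():
        return text
    if _USE_GECTOR:
        try:
            # handle_batch expects tokenized sentences and returns (predictions, update_count)
            preds, _ = gector_model.handle_batch([text.split()])
            return " ".join(preds[0])
        except Exception:
            pass  # fall through to Neuspell
    if _NEUSPELL_OK:
        try:
            return _checker.correct(text)
        except Exception:
            return text
    return text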

data folder is set to `/usr/local/lib/python3.12/dist-packages/neuspell/../data` script


**d. Text Utilities**

Normalisation and sentence-level cleanup are applied to ensure spacing, punctuation, and capitalisation consistency.

In [7]:
def normalize_spaces(s: str) -> str:
    s = s.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")
    return re.sub(r"\s+", " ", s).strip()

def clean_trocr_output(text: str) -> str:
    # Clean spacing and artifacts
    text = re.sub(r"[#~^_]+", "", text)
    text = re.sub(r"\s*([,.;:!?])\s*", r"\1 ", text)
    return re.sub(r"\s{2,}", " ", text).strip()

def restore_sentence_case(text: str) -> str:
    """Restore capitalization for sentence starts and the pronoun 'I'."""
    t = text.strip()
    if not t:
        return t
    parts = re.split(r'([.?!]\s+)', t)
    rebuilt = ""
    for seg in parts:
        if not seg:
            continue
        if re.match(r'[.?!]\s+', seg):
            rebuilt += seg
        else:
            rebuilt += seg[:1].upper() + seg[1:]
    rebuilt = re.sub(r'\bi\b', 'I', rebuilt)
    rebuilt = re.sub(r'\s*([,.;:!?])\s*', r'\1 ', rebuilt)
    return re.sub(r'\s{2,}', ' ', rebuilt).strip()
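One caveat worth checking: the punctuation-spacing pattern used in `clean_trocr_output` and `restore_sentence_case` also fires inside decimals, so a band such as "6.5" in running text would become "6. 5". The sketch below reproduces that pattern and compares it with a hypothetical variant (`space_punct_keep_decimals`, not part of the notebook) that leaves `.` and `,` alone when they sit between two digits.

```python
import re

def space_punct_naive(text: str) -> str:
    # Same pattern as in the cell above: exactly one space after every punctuation mark.
    return re.sub(r"\s*([,.;:!?])\s*", r"\1 ", text).strip()

def space_punct_keep_decimals(text: str) -> str:
    # Hypothetical variant: '.' and ',' only match when not flanked by digits on both sides.
    return re.sub(r"\s*((?<!\d)[.,]|[.,](?!\d)|[;:!?])\s*", r"\1 ", text).strip()

sample = "a Band 6.5 score ,then"
naive = space_punct_naive(sample)
guarded = space_punct_keep_decimals(sample)
```

Whether the guarded variant is worth adopting depends on how often numbers appear in the essays and comments; it is shown here only to make the behaviour visible.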

## 2. Tesseract Helpers

Tesseract is used for the printed parts (title box, band sentence, examiner comment header).
`image_to_data` returns word-level boxes, which makes it possible to locate anchor lines and crop regions precisely.

In [8]:
def ocr_tesseract_text(img_bgr):
    return pytesseract.image_to_string(img_bgr, config="--oem 3 --psm 6 -l eng")

def ocr_tesseract_data(img_bgr):
    return pytesseract.image_to_data(img_bgr, config="--oem 3 --psm 6 -l eng",
                                     output_type=pytesseract.Output.DICT)

def find_line_bbox(img_bgr, pattern_re):
    """
    Return (x,y,w,h), line_text for the FIRST line matching regex; else (None, None)
    """
    d = ocr_tesseract_data(img_bgr)
    lines = {}
    n = len(d["text"])
    for i in range(n):
        txt = d["text"][i]
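        # skip empty/whitespace tokens and non-word rows (conf == -1); float() also accepts "96.5"-style values
        if not txt.strip() or float(d["conf"][i]) < 0: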
            continue
        key = (d["page_num"][i], d["block_num"][i], d["par_num"][i], d["line_num"][i])
        lines.setdefault(key, {"txt": [], "l": [], "t": [], "w": [], "h": []})
        lines[key]["txt"].append(txt)
        lines[key]["l"].append(d["left"][i])
        lines[key]["t"].append(d["top"][i])
        lines[key]["w"].append(d["width"][i])
        lines[key]["h"].append(d["height"][i])
    for info in lines.values():
        line = " ".join(info["txt"])
        if re.search(pattern_re, line, flags=re.I):
            x1, y1 = min(info["l"]), min(info["t"])
            x2 = max(l + w for l, w in zip(info["l"], info["w"]))
            y2 = max(t + h for t, h in zip(info["t"], info["h"]))
            return (int(x1), int(y1), int(x2 - x1), int(y2 - y1)), line
    return None, None
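The grouping step inside `find_line_bbox` can be tried without running Tesseract. The sketch below feeds a hand-made dictionary, shaped like the `Output.DICT` result of `image_to_data`, through a simplified grouping function. The function name `group_words_into_lines` and all coordinates are invented for illustration, and boxes are returned as `(x1, y1, x2, y2)` rather than `(x, y, w, h)`.

```python
def group_words_into_lines(d: dict) -> dict:
    """Group word boxes by (block, paragraph, line) and merge them into one box per line."""
    lines = {}
    for i, word in enumerate(d["text"]):
        if not word.strip():
            continue
        key = (d["block_num"][i], d["par_num"][i], d["line_num"][i])
        x1, y1 = d["left"][i], d["top"][i]
        x2, y2 = x1 + d["width"][i], y1 + d["height"][i]
        if key not in lines:
            lines[key] = {"words": [word], "box": [x1, y1, x2, y2]}
        else:
            entry = lines[key]
            entry["words"].append(word)
            b = entry["box"]
            # Union of the running line box and the new word box.
            entry["box"] = [min(b[0], x1), min(b[1], y1), max(b[2], x2), max(b[3], y2)]
    return {k: (" ".join(v["words"]), tuple(v["box"])) for k, v in lines.items()}

# Synthetic stand-in for pytesseract.image_to_data(..., output_type=Output.DICT).
fake = {
    "text":      ["Here", "is", "the", "comment:", "Good"],
    "block_num": [1, 1, 1, 1, 2],
    "par_num":   [1, 1, 1, 1, 1],
    "line_num":  [1, 1, 1, 1, 1],
    "left":      [10, 60, 85, 125, 10],
    "top":       [100, 102, 101, 100, 140],
    "width":     [45, 20, 35, 80, 50],
    "height":    [18, 16, 17, 18, 18],
}
grouped = group_words_into_lines(fake)
```

Here the first four words share one line key and merge into a single box, while "Good" sits in a different block and stays separate.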

## 3. Handwritten Essay

The essay region is cropped using the printed anchors, segmented into lines by horizontal projection, and each line is recognised by TrOCR with beam search. Segmentation is repeated for several threshold ratios, and the setting that yields the longest transcription is kept.

In [9]:
def crop_essay_region(img_bgr):
    """Crop between 'Band ... score.' and 'Here is the examiner’s comment:' anchors, with a small bottom pad."""
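    # The printed sentence ends with a full stop (optional here in case OCR drops it), not a question mark
    band_re = r"This\s+is\s+an\s+answer\s+written\s+by\s+a\s+candidate\s+who\s+achieved\s+a\s*Band\s*\d(?:\.\d)?\s*score\.?"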
    exam_re = r"Here\s+is\s+the\s+examiner[’']s\s+comment:"
    H, W = img_bgr.shape[:2]
    band_bbox, _ = find_line_bbox(img_bgr, band_re)
    exam_bbox, _ = find_line_bbox(img_bgr, exam_re)

    top_y = band_bbox[1] + band_bbox[3] + 6 if band_bbox else int(0.33 * H)
    bot_y = exam_bbox[1] - 6 if exam_bbox else int(0.85 * H)
    bot_y = min(H, bot_y + int(0.02 * H))  # small bottom padding

    top_y = max(0, min(H, top_y)); bot_y = max(0, min(H, bot_y))
    if bot_y <= top_y + 20:  # degenerate fallback
        top_y, bot_y = int(0.33*H), int(0.87*H)
    return img_bgr[top_y:bot_y, :].copy()

def preprocess_for_lines(img_bgr):
    g = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    g = cv2.bilateralFilter(g, d=7, sigmaColor=50, sigmaSpace=50)
    th = cv2.adaptiveThreshold(g, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 35, 15)
    return 255 - th  # ink white for projection

def segment_lines(img_bgr, min_h=14, smooth=11, thr_ratio=0.12):
    inv = preprocess_for_lines(img_bgr)
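    # The profile is an H x 1 column, so the kernel must be (width=1, height=smooth) to blur along the rows
    proj = cv2.GaussianBlur(inv.sum(axis=1).astype("float32").reshape(-1, 1), (1, smooth), 0).ravel()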
    mask = proj > (proj.max()*thr_ratio)
    mask = cv2.morphologyEx(mask.astype("uint8")*255, cv2.MORPH_CLOSE, np.ones((3,1), np.uint8)).astype(bool)
    lines, i, H = [], 0, inv.shape[0]
    while i < H:
        if mask[i]:
            j = i
            while j < H and mask[j]: j += 1
            y0, y1 = max(0, i-2), min(H, j+2)
            if y1 - y0 >= min_h: lines.append((y0, y1))
            i = j
        else:
            i += 1
    return lines

@torch.inference_mode()
def trocr_line_beam(img_line_bgr):
    # normalize height for stability
    h = img_line_bgr.shape[0]
    target_h = 64
    scale = max(1.0, target_h / float(h))
    if scale != 1.0:
        img_line_bgr = cv2.resize(img_line_bgr, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

    rgb = cv2.cvtColor(img_line_bgr, cv2.COLOR_BGR2RGB)
    pixel = processor(Image.fromarray(rgb), return_tensors="pt").pixel_values.to(device)

    ids = htr.generate(
        pixel,
        max_length=256,
        num_beams=8,
        early_stopping=True,
        no_repeat_ngram_size=3,
        length_penalty=1.0
    )
    text = processor.batch_decode(ids, skip_special_tokens=True)[0]
    return normalize_spaces(text)

def ocr_essay_with_trocr(img_bgr):
    roi = crop_essay_region(img_bgr)
    if roi.size == 0: return ""
    best = ""
    for thr in (0.14, 0.12, 0.10, 0.08, 0.06):
        txts = []
        for y0, y1 in segment_lines(roi, thr_ratio=thr):
            t = trocr_line_beam(roi[y0:y1, :]).strip()
            if t: txts.append(t)
        joined = " ".join(txts)
        if len(joined) > len(best): best = joined
    return clean_trocr_output(best)
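The projection idea behind `segment_lines` can be seen on a tiny synthetic page. The sketch below is a simplified pure-Python version: it omits the smoothing, the morphological closing, and the 2-pixel padding used above, and the name `find_text_bands` and the 9×6 "page" are made up for illustration.

```python
def find_text_bands(ink_rows, thr_ratio=0.12, min_h=2):
    """Return (start, end) row ranges whose ink count exceeds thr_ratio * max."""
    peak = max(ink_rows)
    if peak == 0:
        return []
    active = [v > peak * thr_ratio for v in ink_rows]
    bands, start = [], None
    for y, on in enumerate(active + [False]):  # sentinel closes a trailing band
        if on and start is None:
            start = y
        elif not on and start is not None:
            if y - start >= min_h:
                bands.append((start, y))
            start = None
    return bands

# Synthetic page: 1 = ink pixel. Two text lines separated by blank rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 1],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
profile = [sum(row) for row in page]  # horizontal projection: ink per row
bands = find_text_bands(profile)
tighter = find_text_bands(profile, thr_ratio=0.5)
```

Raising the threshold ratio trims faint rows from a band (the second line shrinks by one row at `thr_ratio=0.5`), which is why sweeping several ratios and comparing the resulting transcriptions can be useful.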

## 4. Examiner Comment

The examiner comment is printed, not handwritten.
The region below the “Here is the examiner’s comment:” anchor is cropped and split into paragraphs at large blank horizontal bands. Each paragraph is then OCR’d with several preprocessing variants and Tesseract page-segmentation modes (PSMs), and the result with the best heuristic quality score is kept.

In [10]:
def crop_examiner_region(img_bgr):
    exam_re = r"Here\s+is\s+the\s+examiner[’']s\s+comment:"
    H, W = img_bgr.shape[:2]
    exam_bbox, _ = find_line_bbox(img_bgr, exam_re)
    if not exam_bbox:
        y0 = int(0.72 * H)
        return img_bgr[y0:H, :].copy()
    x, y, w, h = exam_bbox
    left, right = int(0.03 * W), int(0.97 * W)
    return img_bgr[min(H, y+h+6):H, left:right].copy()

def split_paragraphs(img_bgr, gap_thr=40):
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    _, th = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    inv = 255 - th
    proj = inv.sum(axis=1)
    H = len(proj)
    mask = proj < proj.max() * 0.15
    gaps, i = [], 0
    while i < H:
        if mask[i]:
            j = i
            while j < H and mask[j]: j += 1
            if j - i >= gap_thr: gaps.append((i, j))
            i = j
        else:
            i += 1
    starts = [0] + [g[1] for g in gaps]
    ends = [g[0] for g in gaps] + [H]
    return [img_bgr[s:e, :] for s, e in zip(starts, ends) if e - s > 15]

def preprocess_variants(img_bgr):
    outs = []
    base = cv2.resize(img_bgr, None, fx=1.6, fy=1.6, interpolation=cv2.INTER_CUBIC); outs.append(base)
    g = cv2.cvtColor(base, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
    outs.append(cv2.cvtColor(clahe.apply(g), cv2.COLOR_GRAY2BGR))
    b = cv2.bilateralFilter(g, 7, 50, 50)
    _, th = cv2.threshold(b, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    outs.append(cv2.cvtColor(th, cv2.COLOR_GRAY2BGR))
    ath = cv2.adaptiveThreshold(g, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 35, 15)
    outs.append(cv2.cvtColor(ath, cv2.COLOR_GRAY2BGR))
    return outs

def score_ocr(txt):
    t = txt.strip()
    if not t: return -1e9
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[.,;:!?]", t)
    if not tokens: return -1e9
    letters = sum(ch.isalpha() for ch in t)
    alpha_ratio = letters / max(1, len(t))
    one_char_runs = len(re.findall(r"(?:\b[a-zA-Z]\b[\s,.;:!?]*){3,}", t))
    punct_runs = len(re.findall(r"[.,;:!?]{2,}", t))
    return 2.0*len(tokens) + 200.0*alpha_ratio - 50.0*one_char_runs - 30.0*punct_runs

def ocr_best_paragraph(img_bgr):
    best_txt, best_score = "", -1e9
    for v in preprocess_variants(img_bgr):
        for psm in (6, 4, 3):
            cfg = f"--oem 1 --psm {psm} -l eng"
            txt = normalize_spaces(pytesseract.image_to_string(v, config=cfg))
            s = score_ocr(txt)
            if s > best_score:
                best_txt, best_score = txt, s
    return best_txt

## 5. Structure Parsing

I run one full-page Tesseract pass to get the printed scaffolding, then slice the text into four fields: `topic`, `band`, `essay`, and `examiner_comment`.
Any anchor phrases that leak into the essay are removed.

In [11]:
def parse_cambridge_text(full_text: str) -> dict:
    txt = normalize_spaces(full_text)
    topic_match = re.search(r"Write about the following topic[:\-]?\s*(.*?)\s*TEST\b", txt, flags=re.I | re.S)
    topic = topic_match.group(1).strip() if topic_match else ""
    band_sent_re = re.compile(r"This is an answer written by a candidate who achieved a\s*Band\s*(?P<band>\d(?:\.\d)?)\s*score\.", flags=re.I)
    band_sent = band_sent_re.search(txt)
    band = float(band_sent.group("band")) if band_sent else None
    exam_anchor = re.compile(r"Here is the examiner[’']s comment:", flags=re.I).search(txt)
    essay = ""
    examiner_comment = ""
    if band_sent and exam_anchor:
        essay = txt[band_sent.end():exam_anchor.start()].strip()
        examiner_comment = txt[exam_anchor.end():].strip()
    elif band_sent:
        essay = txt[band_sent.end():].strip()
    else:
        essay = txt

    # Remove any residual anchor phrase from essay
    essay = re.sub(r"here is the examiner[’']?s comment:.*$", "", essay, flags=re.I).strip()
    # Also remove any leaked header
    essay = re.sub(r"^Write about the following topic[:\-]?.*?TEST\b.*?score\.\s*", "", essay, flags=re.I | re.S).strip()

    return {"topic": topic, "band": band, "essay": essay, "examiner_comment": examiner_comment}
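The anchor-slicing logic can be checked on a hand-written string. In the sketch below, `page_text` is an invented stand-in for the normalised full-page OCR text, and `BAND_RE` is an abbreviated form of the band-sentence pattern used above.

```python
import re

BAND_RE = re.compile(r"achieved a\s*Band\s*(?P<band>\d(?:\.\d)?)\s*score\.", re.I)
EXAM_RE = re.compile(r"Here is the examiner[’']s comment:", re.I)

# Invented stand-in for the normalised full-page OCR text.
page_text = (
    "Write about the following topic: Placeholder prompt? TEST 1 "
    "This is an answer written by a candidate who achieved a Band 6.5 score. "
    "Placeholder essay body. "
    "Here is the examiner's comment: Placeholder feedback."
)

band_m = BAND_RE.search(page_text)
exam_m = EXAM_RE.search(page_text)

band = float(band_m.group("band"))                               # printed score
essay = page_text[band_m.end():exam_m.start()].strip()           # between the two anchors
comment = page_text[exam_m.end():].strip()                       # everything after the second anchor
```

If either anchor is missing from real OCR output, the corresponding match object is `None`, which is why `parse_cambridge_text` branches on both matches before slicing.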


## 6. Main Pipeline (Post-processing & Assembly)

Post-processing order: SymSpell → contextual correction (GECToR or Neuspell) → sentence casing.
The essay uses the TrOCR output when available (otherwise the Tesseract text); the examiner comment uses the printed-OCR pipeline.

In [12]:
def post_process_text(s: str) -> str:
    # SymSpell → Contextual → sentence case
    s = symspell_correct_sentence(s)
    s = contextual_correct(s)
    s = restore_sentence_case(s)
    return s

def process_page(img_path):
    img = cv2.imread(img_path)
    if img is None:
        raise FileNotFoundError(img_path)

    # Printed OCR for structure
    full_text = ocr_tesseract_text(img)
    rec = parse_cambridge_text(full_text)

    # Essay via TrOCR (beam search) → clean → post-process
    essay_trocr = ocr_essay_with_trocr(img)
    essay_text = essay_trocr or rec["essay"]
    essay_text = re.sub(r"here is the examiner[’']?s comment:.*$", "", essay_text, flags=re.I).strip()
    rec["essay"] = post_process_text(essay_text)

    # Examiner comment: crop → split paragraphs → OCR each → join → post-process
    exam_roi = crop_examiner_region(img)
    if exam_roi.size > 0:
        paras = split_paragraphs(exam_roi, gap_thr=40)
        paragraphs = [ocr_best_paragraph(p) for p in paras]
        comment = " ".join(t for t in paragraphs if t.strip())
        if comment.strip():
            rec["examiner_comment"] = comment
    rec["examiner_comment"] = post_process_text(rec["examiner_comment"])

    # Final tidy
    rec["topic"] = restore_sentence_case(normalize_spaces(rec["topic"]))
    rec["source"] = Path(img_path).stem
    return rec

## 7. Run

Point to your images, process them, and write a JSONL file with one record per page.

In [13]:
# Process every page and write one JSON object per line (JSONL)
recs = [process_page(p) for p in inputs]
with open(out_jsonl, "w", encoding="utf-8") as f:
    for r in recs:
        print(r)  # inspect each record
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

{'topic': 'In some countries, owning a home rather than renting one is very important for people. Why might this be the case? Do you think this is a positive or negative situation?', 'band': 7.0, 'essay': "To some countries the ownership of peoples home is an important matter in these countries it is every important to our your own tone rather than routing it might be indifferent for some but for the people's matter in addition to the why is that the case you might wonder I think it is because you supposed to be exactly what it sounds like your home as a human think we long after having stiff to call our own doesn't matter what it is but humans will always want to claim ownership this is nothing has been like this through human history like colonies for example which later once again became the same country as before had by its own inhabitants people will always want to be the tackle what happens to them and then you rent your home you can't even paint it without the if you as a person

In [14]:
for r in recs:
    print(r["band"])

7.0
6.0
7.0
6.5
6.0
4.5
7.0
6.5
6.5
6.5
6.0
6.0
7.0
6.0


In [15]:
for r in recs:
    print(r['topic'])

In some countries, owning a home rather than renting one is very important for people. Why might this be the case? Do you think this is a positive or negative situation?
In the future, nobody will buy printed newspapers or books because they will be able to read everything they want online without paying. To what extent do you agree or disagree with this statement?
Some people say that advertising is extremely successful at persuading us to buy things. Other people think that advertising is so common that we no longer pay attention to it. Discuss both these views and give your own opinion.
In some cultures, children are often told that they can achieve anything if they try hard enough. What are the advantages and disadvantages of giving children this message?
In some countries, more and more people are becoming interested in finding out about the history of the house or building they live in. What are the reasons for this? How can people research this?
In their advertising, businesses 