## Preprocess input

In [4]:
import panflute as pf
import re

In [5]:
def my_filter(doc):
    bad_blocks = [pf.BlockQuote]
    filtered_doc = []
    for x in doc:
        if type(x) in bad_blocks:
            continue
        markdown_output = pf.convert_text(x, input_format='panflute', output_format='markdown')
        # Join hard-wrapped lines within a paragraph; the lookarounds also handle one-character lines
        markdown_output_single_line = re.sub(r'(?<=[^\n])\n(?=[^\n])', ' ', markdown_output)
        # Collapse runs of spaces into a single space
        markdown_output_single_line = re.sub(r'  +', r' ', markdown_output_single_line)
        filtered_doc.append(markdown_output_single_line)
        # filtered_doc += markdown_output_single_line.split('. ')
    return filtered_doc
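
The whitespace clean-up inside the filter is easier to follow on a small example. The sketch below is standalone and does not need panflute; the helper name `to_single_line` and the sample string are made up for illustration. The pattern is written with lookarounds so that consecutive one-character lines are also joined.

```python
import re

def to_single_line(markdown_text):
    # Hypothetical helper, for illustration only.
    # Replace a lone newline with a space; blank-line paragraph breaks are left alone.
    joined = re.sub(r'(?<=[^\n])\n(?=[^\n])', ' ', markdown_text)
    # Collapse runs of spaces into a single space.
    return re.sub(r'  +', ' ', joined)

sample = "A paragraph that was\nhard-wrapped  by pandoc.\n\nA second paragraph."
print(to_single_line(sample))
```

The hard-wrapped first paragraph comes out on one line, while the blank line separating the two paragraphs survives.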

Design choices

I chose to use `max_chars` instead of `max_tokens`, because while tokenization schemes change (and probably should be replaced by pure character-level encoding), characters stay.

Generally, a safe choice is to set `max_chars` so that one chunk fills about 1/4 of the model's context window. You can push it up to 1/2 of the context window. More is not recommended, since the reply could be as long as the input and has to fit in the same window.

According to [Greg Kamradt's tests](https://twitter.com/GregKamradt/status/1722386725635580292), GPT-4 128k has perfect recall up to a context length of about 64k tokens, so we should be fine using 32k token inputs. By the usual rule of thumb of about 4 characters per token for English text, that is over 100k characters, so something like 60k characters is a conservative limit.
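
The budget is simple arithmetic, sketched below. The helper name `max_chars_for` is hypothetical, and the characters-per-token ratio is an assumption: about 4 is a common rule of thumb for English prose, while a lower value such as 2 is more conservative for text that is heavy in markup or math.

```python
def max_chars_for(context_tokens, fraction=0.25, chars_per_token=4):
    """Character budget for one chunk (hypothetical helper; the defaults are rules of thumb)."""
    return int(context_tokens * fraction * chars_per_token)

# A quarter of a 128k-token window, at two different characters-per-token ratios
print(max_chars_for(128_000))
print(max_chars_for(128_000, chars_per_token=2))
```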

In [6]:
def split_into_chunks(paragraphs, max_chars):
    """
    Split a list of paragraphs into chunks.

    :param paragraphs: List of paragraphs (each paragraph is a string).
    :param max_chars: Maximum number of characters allowed in a chunk.
    :return: List of chunks (each chunk is a list of paragraphs).
    """
    
    chunks = []
    current_chunk = []

    current_length = 0
    for para in paragraphs:
        para_length = len(para)

        # Check if the paragraph itself is too long
        if para_length > max_chars:
            raise ValueError(f"Paragraph exceeds the maximum character limit: {para_length} characters")

        # Check if adding this paragraph would exceed the max length
        if current_length + para_length <= max_chars:
            current_chunk.append(para)
            current_length += para_length
        else:
            # Start a new chunk
            chunks.append(current_chunk)
            current_chunk = [para]
            current_length = para_length

    # Add the last chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk)
    return chunks
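
`split_into_chunks` raises a `ValueError` as soon as a single paragraph is longer than `max_chars`. One way around that, sketched below, is to pre-split such paragraphs at sentence ends before chunking. `split_long_paragraph` is a hypothetical helper, not part of the pipeline above, and its sentence detection is deliberately naive (it will also cut after abbreviations such as "e.g.").

```python
import re

def split_long_paragraph(para, max_chars):
    """Hypothetical helper: break one paragraph into pieces of at most max_chars characters."""
    if len(para) <= max_chars:
        return [para]
    # Naive sentence split: cut at spaces that follow '.', '!' or '?'
    sentences = re.split(r'(?<=[.!?]) +', para)
    pieces, current = [], ""
    for sentence in sentences:
        # Hard-cut a single sentence that is itself too long
        while len(sentence) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The piece built so far is full; start a new one with this sentence
            pieces.append(current)
            current = sentence
    if current:
        pieces.append(current)
    return pieces

print(split_long_paragraph("One. Two. Three.", max_chars=10))
```

The pieces can then be flattened back into the paragraph list, for instance with `[piece for para in paragraphs for piece in split_long_paragraph(para, max_chars)]`, before the list is passed to the chunker.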


## Make calls to GPT-4

In [16]:
from openai import OpenAI


def gpt_proofread(text, model="gpt-4-1106-preview"):
    prompt = """
You are a proofreader. The user provides a piece of R-markdown, and you will proofread it. Do not say anything else.
Do not simply perform an active-to-passive voice conversion. Do not improve the style. Restrict yourself to grammatical checks.
You MUST reply in this format:

a: <original sentence>
b: <rewritten sentence>

a: <original sentence>
b: <rewritten sentence>

...

Reply only rewritten sentences. Skip all other sentences. Do not reply anything else. If the text requires no change, return an empty string.
Do not use American-style quotation. Use logical quotation.
""".strip()
    example_user_1 = """
We use the convention putting derivative on the rows. This convention simplifies a lot of equations, and completely avoids transposing any matrix.

In the next section, using the "pebble construction," they studied "Gamba perceptrons." They stated "MLPs are essentially Gamba perceptrons."
    """.strip()
    example_assistant_1 = """
a: We use the convention putting derivative on the rows.
b: We use the convention of putting the derivatives on the rows.

a: In the next section, using the "pebble construction," they studied "Gamba perceptrons."
b: In the next section, using the "pebble construction", they studied "Gamba perceptrons".

a: They stated "MLPs are essentially Gamba perceptrons."
b: They stated "MLPs are essentially Gamba perceptrons.".
    """.strip()
    example_user_2 = r"That is, for any $X \subset R$, we have $\psi(X)=\psi(g(X))$."
    example_assistant_2 = ""
    client = OpenAI()
    response = client.chat.completions.create(model=model, 
                  messages=[{"role": "system", "content": prompt}, 
                            {"role": "user", "content": example_user_1},
                            {"role": "assistant", "content": example_assistant_1},
                            {"role": "user", "content": example_user_2},
                            {"role": "assistant", "content": example_assistant_2},
                            {"role": "user", "content": text}])
    return response

## Postprocess output

The most important problem we need to solve is fuzzy search, which might be necessary if the damned LLM does not return an exact string match.
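
To illustrate the idea with nothing but the standard library, here is a rough sketch that uses `difflib.get_close_matches` instead of a dedicated fuzzy-search package. It is a different technique from substring search: it compares the returned sentence against a list of candidate sentences, so it assumes the original text has already been split into sentences. The helper name `closest_sentence` and the `cutoff` value are assumptions.

```python
import difflib

def closest_sentence(returned, original_sentences, cutoff=0.8):
    """Hypothetical helper: the original sentence most similar to `returned`, or None."""
    matches = difflib.get_close_matches(returned, original_sentences, n=1, cutoff=cutoff)
    return matches[0] if matches else None

original_sentences = [
    "We use the convention putting derivative on the rows.",
    "This convention simplifies a lot of equations.",
]

# The model "quoted" the first sentence with a small change
print(closest_sentence("We use the convention putting derivatives on the rows.", original_sentences))
# A sentence of unknown origin finds no match
print(closest_sentence("Something entirely unrelated", original_sentences))
```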

In [20]:
from fuzzysearch import find_near_matches
from warnings import warn

def process_response(response, original_text="", max_l_dist=10):
    text = response.choices[0].message.content
    # Check that the input sequence satisfies a certain format
    prefix_list = [line.strip()[:2] for line in text.split('\n') if line.strip() != ""]
    prefix_string = ''.join(prefix_list)
    if not re.fullmatch(r"((\?|a):b:(c:)?)*", prefix_string):
        raise ValueError(f"Illegal response sequence:\n{prefix_string}")
        
    fixed_text_tuples = []
    for line in text.split('\n'):
        # Strip first, so that the prefix is read the same way as in the format check above
        line = line.strip()
        if line == "": continue
        # Report the raw line here: stripped_line is not defined yet at this point
        if ":" not in line: raise ValueError(f"Illegal response:\n{line}\n{'-'*80}\nFound in text:\n{text}")
        prefix = line.split(':')[0]
        if prefix not in ["a", "b", "c"]: raise ValueError(f"Illegal response:\n{line}\n{'-'*80}\nFound in text:\n{text}")
            
        stripped_line = line[len(prefix)+1:].strip()
        if prefix == "a" and original_text and stripped_line not in original_text:
            warn(f"The proofreader returned a line of inexact match:\n{stripped_line}", UserWarning)
            fuzzy_search_result = find_near_matches(stripped_line, original_text, max_l_dist=max_l_dist)
            if fuzzy_search_result == []:
                prefix = "?"
                warn(f"The proofreader returned a line of unknown origin:\n{stripped_line}", UserWarning)
            else:
                stripped_line = fuzzy_search_result[0].matched
        fixed_text_tuples.append((prefix, stripped_line))

    fixed_text = ""
    for prefix, stripped_line in fixed_text_tuples:
        if prefix in "?a": fixed_text += "\n"
        fixed_text += f"{prefix} {stripped_line}\n"
    fixed_text = f"\n\n{fixed_text.strip()}"
    return fixed_text

Then a human has to read through the proofread file, annotate the selected changes with square brackets, and apply them all at once with a single `multiple_replace`.


In [18]:
def multiple_replace(text, replacements):
    """
    Replace multiple substrings in a single pass.

    Args:
    text (str): The original string.
    replacements (dict): A dictionary mapping old substrings to new substrings.

    Returns:
    str: The modified string after all replacements.
    """
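    # An empty pattern would match everywhere and then fail the dictionary lookup
    if not replacements:
        return text
    # Try longer keys first, so that a key which is a prefix of another does not shadow it
    keys = sorted(replacements, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    return regex.sub(lambda match: replacements[match.group(0)], text)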
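
As a rough sketch of how the reviewed file could feed a single-pass replace, the code below collects `a`/`b` line pairs into a dictionary and applies them in one go. Both helper names are hypothetical. It assumes the file uses the `a <original>` / `b <rewrite>` line format written by `process_response`, and that the human has simply deleted the pairs they reject. The square-bracket annotation is not specified precisely enough here to implement, so it is left out.

```python
import re

def parse_pairs(proofread_text):
    """Hypothetical helper: collect {original: rewrite} from 'a ...' / 'b ...' line pairs."""
    replacements = {}
    original = None
    for line in proofread_text.splitlines():
        if line.startswith("a "):
            original = line[2:].strip()
        elif line.startswith("b ") and original is not None:
            replacements[original] = line[2:].strip()
            original = None
        else:
            # Blank lines, '?' lines (unmatched originals) and anything else end the current pair
            original = None
    return replacements

def apply_pairs(text, replacements):
    """Hypothetical helper: apply all replacements in a single regex pass."""
    if not replacements:
        return text
    # Longest keys first, so a key that is a prefix of another cannot shadow it
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(map(re.escape, keys)))
    return pattern.sub(lambda m: replacements[m.group(0)], text)

proofread = (
    "\n\na We use the convention putting derivative on the rows.\n"
    "b We use the convention of putting the derivatives on the rows.\n"
    "\n? Unknown sentence.\nb Rewritten unknown sentence.\n"
)
text = "We use the convention putting derivative on the rows. This is fine."
print(apply_pairs(text, parse_pairs(proofread)))
```

The `?` pair is skipped on purpose: its original sentence could not be located in the source, so there is nothing safe to replace.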

## All together

In [21]:
%pdb on
# Step 1: Read input_file, query LLM, then write to proofread_file
def get_proofread_files(input_file, proofread_file):
    with open(input_file, 'r', encoding="utf8") as file:
        input_markdown = file.read()
    doc = pf.convert_text(input_markdown)
    chunks = split_into_chunks(my_filter(doc), max_chars=4_000)
    print(f"{len(chunks)} chunks")
    for chunk in chunks:
        input_string = '\n\n'.join(chunk).strip()
        if input_string == "": continue
        response = gpt_proofread(input_string)
        # Pass the chunk as original_text so that inexact quotes are fuzzy-matched against it
        with open(proofread_file, 'a', encoding="utf8") as file:
            file.write(process_response(response, original_text=input_string))

get_proofread_files('index.qmd', 'index_pr.txt')

Automatic pdb calling has been turned ON
8 chunks


In [11]:
# Step 2: After a human has finished proofreading them, perform the substitutions
def perform_substitution(input_file, proofread_file, output_file):
    # multiple_replace(something)
    # ...
    raise NotImplementedError  # not written yet; a body is needed for the cell to parse

In [41]:
import difflib

print('\n'.join(list(difflib.ndiff(["XZ."], ["XZY."]))))

- XZ.
+ XZY.
?   +

