# Polish your entire paper with GPT for academic writing

First, put your paper content into a text file and place it in the same directory as this notebook. LaTeX format is fine, but there might be some bugs. I recommend using Markdown.

<a href="https://colab.research.google.com/github/user074/paperPolisher/blob/main/paperPolisher.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Setup

In [None]:
# openai.ChatCompletion (used below) only exists in openai releases before 1.0
!pip install "openai<1.0" tiktoken

In [10]:
import os
import openai
import tiktoken

Then put your OpenAI API key below. Do not share it publicly!

In [2]:
openai.api_key = 'PUT_YOUR_API_KEY_HERE'
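
To keep the key out of the notebook entirely, it can be read from an environment variable instead. This is a minimal sketch, not part of the original notebook: the helper `load_api_key` is hypothetical, and it assumes the key has been exported as `OPENAI_API_KEY` before the notebook starts.

```python
import os

def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or fail with a clear message.

    `load_api_key` is a hypothetical helper; `name` is an assumed variable name.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running the notebook.")
    return key

# Usage in place of the hard-coded assignment above:
# openai.api_key = load_api_key()
```

With this approach the notebook can be shared or committed without the key ever appearing in a cell.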

Replace 'YOUR_ORIGIONAL_PAPER_FILE_NAME.txt' with the filename of your paper.

In [1]:
filename = 'YOUR_ORIGIONAL_PAPER_FILE_NAME.txt'

Run all the support functions below. Feel free to change them.

## Support functions

In [2]:

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
    """Returns the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo-0613":  # note: future models may deviate from this
        num_tokens = 0
        for message in messages:
            num_tokens += 4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
            for key, value in message.items():
                num_tokens += len(encoding.encode(value))
                if key == "name":  # if there's a name, the role is omitted
                    num_tokens += -1  # role is always required and always 1 token
        num_tokens += 2  # every reply is primed with <im_start>assistant
        return num_tokens
    else:
        raise NotImplementedError(f"""num_tokens_from_messages() is not presently implemented for model {model}.
    See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")
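
The bookkeeping in the function above (a fixed overhead per message, plus the encoded length of every field, plus a fixed amount for the reply primer) can be checked by hand without downloading a tokenizer. The sketch below is only an illustration: `count_message_tokens` is a hypothetical name, and `str.split` is a stand-in for `encoding.encode`, so the numbers are word counts rather than real token counts.

```python
def count_message_tokens(messages, encode, per_message=4, reply_priming=2):
    """Same arithmetic as the function above, with a caller-supplied tokenizer."""
    total = reply_priming              # fixed cost of priming the reply
    for message in messages:
        total += per_message           # fixed overhead for each message
        for key, value in message.items():
            total += len(encode(value))
            if key == "name":          # a name replaces the role
                total -= 1
    return total

demo_messages = [
    {"role": "system", "content": "You are an editor."},
    {"role": "user", "content": "Polish this text please."},
]

# Stand-in tokenizer: one "token" per whitespace-separated word.
demo_total = count_message_tokens(demo_messages, encode=str.split)
print(demo_total)  # prints 20: 2 + (4 + 1 + 4) + (4 + 1 + 4)
```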

In [3]:
#read text from file and convert to list of messages for openai api

def read_text_from_file(filename):
    with open(filename, 'r') as file:
        text = file.read()
    return text

def convert_text_to_messages(text, user_instructions=None):
    messages = [
    {"role": "system", "content": "You are an expert academic journal editor in CVPR, IEEE, and ACM.\
        You are helping a researcher to polish their paper.\
        You help researchers to improve their papers by correcting grammar, improving readability, and improving the overall quality of the paper.\
        You do not use passive voice, and you would eliminate first person pronouns in the correction process.\
        You try to be objective in your writing and avoid using personal opinions.\
        Your writing is formal and precise. You express concepts succinctly. \
        You are not responsible for the technical content of the paper and you do not change the technical content of the paper."},
    # {"role": "user", "content": "Help me to polish following text."},
    ]
    if user_instructions:
        messages.extend([{"role": "user", "content": user_instructions} ])
        
    messages.extend([{"role": "user", "content": text} ])
    return messages


def convert_messages_to_text(messages):
    # Messages store their body under 'content' (see convert_text_to_messages)
    text = ''
    for message in messages:
        if message['role'] == 'user':
            text += 'Human: ' + message['content'] + '\n'
        elif message['role'] == 'assistant':
            text += 'AI: ' + message['content'] + '\n'
    return text

def write_text_to_file(text, filename):
    with open(filename, 'w') as file:
        file.write(text)

def write_messages_to_file(messages, filename):
    write_text_to_file(convert_messages_to_text(messages), filename)



In [4]:
#count tokens in text
def num_tokens_from_text(text, model = "gpt-3.5-turbo-0613"):
    messages = convert_text_to_messages(text)
    return num_tokens_from_messages(messages, model)

In [5]:
#split the text into a list of chunks, each kept under half of max_tokens
#(the other half is presumably left for the model's reply)
def split_text_into_list(text, max_tokens=4096):
    messages = []
    message = ''
    for line in text.splitlines():
        # Both counts include the system prompt, so the check is conservative
        message_token = num_tokens_from_messages(convert_text_to_messages(message))
        line_token = num_tokens_from_messages(convert_text_to_messages(line))
        if message_token + line_token < max_tokens/2:
            message += line + '\n'
        else:
            if message:  # skip empty chunks so no empty request is sent
                messages.append(message)
            message = line + '\n'
    if message:
        messages.append(message)
    return messages
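
The splitting step is a greedy line-packing loop, and the idea is easier to see in isolation. The following sketch is a simplified variant, not the notebook's exact function: `pack_lines` and `word_count` are hypothetical names, the token counter is passed in as a parameter, and a word count stands in for a real tokenizer.

```python
def pack_lines(text, max_tokens, count_tokens):
    """Greedily pack whole lines into chunks of at most max_tokens tokens.

    A single line that exceeds the budget on its own still becomes one chunk.
    """
    chunks, current, current_tokens = [], [], 0
    for line in text.splitlines():
        line_tokens = count_tokens(line)
        # Start a new chunk when adding this line would exceed the budget
        if current and current_tokens + line_tokens > max_tokens:
            chunks.append("\n".join(current) + "\n")
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += line_tokens
    if current:
        chunks.append("\n".join(current) + "\n")
    return chunks

def word_count(line):
    # Stand-in for a tokenizer: one "token" per whitespace-separated word
    return len(line.split())

sample = "one two three\nfour five\nsix seven eight nine\nten"
chunks = pack_lines(sample, max_tokens=5, count_tokens=word_count)
print(chunks)
```

Keeping a running total avoids re-tokenizing the growing chunk on every line, which matters for long papers.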

In [6]:
#call openai api to polish the text
def polish_text(text, model = "gpt-3.5-turbo"):
    user_instructions = "Please help me to polish the following text for publication. Keep the original LaTeX format."
    messages = convert_text_to_messages(text, user_instructions=user_instructions)
    response = openai.ChatCompletion.create(model=model, messages=messages)
    return response

In [7]:
#Polish an entire text file chunk by chunk and return the combined result
def polish_text_file(filename, model = "gpt-3.5-turbo"):
    text = read_text_from_file(filename)
    paragraphs = split_text_into_list(text, max_tokens=4096)
    polished_chunks = []
    for paragraph in paragraphs:
        result = polish_text(paragraph, model)
        polished_chunks.append(result['choices'][0]["message"]["content"])
    # Replies may not end with a newline, so separate the chunks explicitly
    return '\n'.join(polished_chunks)

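def polish_text_file_to_file(filename, model = "gpt-3.5-turbo"):
    text = polish_text_file(filename, model)
    # Prefix only the file name so that paths with directories keep working
    directory, basename = os.path.split(filename)
    write_text_to_file(text, os.path.join(directory, 'polished_' + basename))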

## Cost Estimation

Estimate your cost for free before running anything. This is a rough estimate, and no API key is required.

In [8]:
def cost_estimation(filename):
    text = read_text_from_file(filename)
    messages = convert_text_to_messages(text)
    model = "gpt-3.5-turbo-0613"
    paragraphs = split_text_into_list(text, max_tokens=4096)
    prompt_tokens = num_tokens_from_messages(messages, model)
    # Rough estimate: assumes the reply is as long as the prompt, priced at
    # $0.0015 (input) + $0.002 (output) per 1,000 tokens. Check current pricing.
    # The system prompt is counted once here although it is sent with every chunk.
    estimated_cost = prompt_tokens * (0.0015 + 0.002) / 1000
    print(f"{prompt_tokens} prompt tokens counted.")
    print(f"Number of paragraphs to process: {len(paragraphs)}")
    print(f"Estimated cost using gpt-3.5-turbo: ${estimated_cost}")

In [11]:
cost_estimation(filename)

12804 prompt tokens counted.
Number of paragraphs to process: 8
Estimated cost using gpt-3.5-turbo: $0.044814
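
The figure in the output above follows from simple arithmetic: the prompt token count is multiplied by the sum of the input and output prices per 1,000 tokens, on the assumption that the reply is about as long as the prompt. A minimal sketch of that calculation, where `estimate_cost` is a hypothetical helper and the default prices are the ones hard-coded in `cost_estimation` (they may be out of date):

```python
def estimate_cost(prompt_tokens, input_price=0.0015, output_price=0.002):
    """Estimated cost in USD; prices are per 1,000 tokens.

    Assumes the model's reply has as many tokens as the prompt.
    """
    return prompt_tokens * (input_price + output_price) / 1000

print(f"${estimate_cost(12804):.6f}")  # prints $0.044814
```

Passing different prices makes it easy to redo the estimate for another model.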


## Run the polisher

Run 'polish_text_file_to_file' with the filename of your paper. The polished version is written to polished_YOUR_ORIGIONAL_PAPER_FILE_NAME.txt. It might take about 3-5 minutes.

In [91]:
polish_text_file_to_file(filename)