# Tokenizing and Translating Markdown with OpenAI

## Introduction

In this notebook, we will implement a pipeline to tokenize a markdown document, split it into chunks at multiple newlines, translate those chunks using OpenAI's LLMs, and reconstruct the full document while retaining original formatting.

Specifically, we will:

- Read in a markdown file
- Tokenize the text into tokens (sub-word pieces, not necessarily whole words)
- Count the number of tokens
- Split the tokens into chunks whenever there are multiple successive newlines
- Translate each chunk into another language using OpenAI's LLMs
- Reconstruct the translated chunks into a full document, preserving original formatting like headers, lists, etc.

This lets us translate the document while keeping code blocks, images, and tables intact.
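
The split-and-reconstruct idea can be sketched in a few lines. This is only a minimal illustration, assuming we split the raw text (rather than the token list) on runs of two or more newlines. The helper names `split_on_blank_lines` and `reconstruct` are hypothetical, and the translation step is left out entirely.

```python
import re


def split_on_blank_lines(text):
    """Split text on runs of 2+ newlines, keeping the separators.

    Hypothetical helper: the capture group makes re.split return the
    newline runs as well, so nothing from the original layout is lost.
    """
    return re.split(r"(\n{2,})", text)


def reconstruct(parts):
    """Join chunks and separators back into a single document."""
    return "".join(parts)


sample = "# Title\n\nPara one.\n\n\n- a\n- b\n"
parts = split_on_blank_lines(sample)

# Chunks sit at even indices, newline separators at odd indices.
chunks = parts[0::2]
separators = parts[1::2]

# With no translation applied, the round trip is lossless.
assert reconstruct(parts) == sample
```

Because the separators are stored alongside the chunks, a translated chunk can replace its original at the same index while the spacing between chunks stays exactly as it was.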

We'll use Python and Jupyter notebooks to implement the pipeline. The notebook will be structured into sections for each step of the process.

To follow along, you'll need a markdown file to process (we'll use a small sample file included with the notebook) and access to OpenAI's API.

Let's get started! First we'll import the modules and set up the notebook. Then we'll define functions for each step: tokenizing, counting tokens, splitting, translating, and reconstructing.

---

### Setup

In [None]:
%pip install --upgrade openai tiktoken

In [3]:
import openai
import tiktoken

### Import Input File

Read the file you want translated from `data/input.txt`.

In [9]:
# Read the whole file as UTF-8 so non-ASCII characters decode consistently
with open("data/input.txt", "r", encoding="utf-8") as f:
    text = f.read()

Length of the text in characters (for our purposes, we're using a small sample text):

In [12]:
len(text)

5667

### Counting the tokens

We'll use OpenAI's `tiktoken`, a fast open-source tokenizer:

> Given a text string (e.g., `"tiktoken is great!"`) and an encoding (e.g., `"cl100k_base"`), a tokenizer can split the text string into a list of tokens (e.g., `["t", "ik", "token", " is", " great", "!"]`).

> Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).

In [20]:
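# Look up the encoding that matches the target model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
# Encode the text into a list of integer token IDs and count them
encoded_text = encoding.encode(text)
input_token_count = len(encoded_text)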

In [21]:
input_token_count

1062
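
With the token count in hand, the two uses mentioned above (checking the length against a model limit and estimating cost) can be sketched as follows. This is a rough illustration, not an official calculation: the function names are hypothetical, and both the context limit and the price per 1,000 tokens are placeholder assumptions to be replaced with the current values for your model. It also assumes the translation is about as long as the input, so the same number of tokens is reserved for the output.

```python
def fits_in_context(token_count, context_limit, reserved_for_output=0):
    """Return True if the input plus the reserved output fits the limit."""
    return token_count + reserved_for_output <= context_limit


def estimate_cost(token_count, price_per_1k_tokens):
    """Estimate cost for a number of tokens at a given price per 1,000."""
    return token_count / 1000 * price_per_1k_tokens


input_token_count = 1062  # value computed in the cell above

# Assumptions for illustration only; check the current model documentation.
ASSUMED_CONTEXT_LIMIT = 4096
ASSUMED_PRICE_PER_1K = 0.001  # made-up price, in dollars

# Reserve as many output tokens as input tokens (a rough guess for translation).
ok = fits_in_context(input_token_count, ASSUMED_CONTEXT_LIMIT,
                     reserved_for_output=input_token_count)
cost = estimate_cost(input_token_count, ASSUMED_PRICE_PER_1K)

print(f"Fits in context: {ok}")
print(f"Estimated input cost: ${cost:.6f}")
```

If the check fails for a larger document, that is the signal to translate it chunk by chunk rather than in a single request.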