<!-- Banner Image -->
<img src="https://uohmivykqgnnbiouffke.supabase.co/storage/v1/object/public/landingpage/ocr2.png?t=2023-11-09T00%3A26%3A25.198Z" width="100%">

<!-- Links -->
<center>
  <a href="https://console.brev.dev" style="color: #06b6d4;">Console</a> •
  <a href="https://brev.dev" style="color: #06b6d4;">Docs</a> •
  <a href="/" style="color: #06b6d4;">Templates</a> •
  <a href="https://discord.gg/NVDyv7TUgJ" style="color: #06b6d4;">Discord</a>
</center>

# OCR + Amazon's MistralLite for a PDF Analysis Chatbot 🤙

Welcome!

In this notebook and tutorial, we'll do long-context PDF analysis using [OCR (Optical Character Recognition)](https://en.wikipedia.org/wiki/Optical_character_recognition) + Amazon's adapted [Mistral 7B](https://github.com/mistralai/mistral-src) model, [MistralLite](https://huggingface.co/amazon/MistralLite?library=true). MistralLite allows for contexts of up to 32K tokens, which is [roughly 24,000 words, or 48 pages of text](https://twitter.com/SullyOmarr/status/1654576775970828293). From the Hugging Face page:

> MistralLite is a fine-tuned Mistral-7B-v0.1 language model, with **enhanced capabilities of processing long context (up to 32K tokens)**. By utilizing an adapted Rotary Embedding and sliding window during fine-tuning, **MistralLite is able to perform significantly better on several long context retrieve and answering tasks**, while keeping the simple model structure of the original model. MistralLite is useful for applications such as long context line and topic retrieval, summarization, question-answering, and etc. 

Disclaimer: Note that [LLMs have had trouble effectively using information from long contexts](https://twitter.com/LouisKnightWebb/status/1683874116410155009), so you may find you'll still want to use [RAG](https://www.promptingguide.ai/techniques/rag), but it's worth a shot to first try without. Based on MistralLite's description - that it is made for long context retrieve and answering tasks - it just may work. It did for me (but I only used about 5400 tokens).

**The text in the PDF I upload in this tutorial is too long to fit into ChatGPT's GPT-4, so this step-by-step guide shows you how to get around that limitation.**

We will load the large model in 4-bit quantization using `bitsandbytes` so that we can load it on a smaller GPU (you can optionally skip this if you have the compute for the full model!).

Note that if you ever have trouble importing something from Hugging Face, you may need to run `huggingface-cli login` in a shell. To open a shell in Jupyter Lab, click on 'Launcher' (or the '+' if it's not there) next to the notebook tab at the top of the screen. Under "Other", click "Terminal" and then run the command.

### Help us make this tutorial better! Please provide feedback on the [Discord channel](https://discord.gg/pnCpkwU3G5) or on [X](https://x.com/harperscarroll).

#### Before we begin: A note on OOM errors

If you get an error like this: `OutOfMemoryError: CUDA out of memory`, tweak your parameters to make the model less computationally intensive. I'll walk you through that in this guide, and if you have any additional questions you can reach out on the [Discord channel](https://discord.gg/pnCpkwU3G5) or on [X](https://x.com/harperscarroll).

To re-try after you tweak your parameters, open a Terminal ('Launcher' or '+' in the nav bar above -> Other -> Terminal) and run the command `nvidia-smi`. Then find the process ID `PID` under `Processes` and run the command `kill [PID]`. You will need to re-start your notebook from the beginning. (There may be a better way to do this... if so please do let me know!)
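One thing that may be worth trying before killing the process: ask Python and PyTorch to release memory they no longer need. This is a minimal sketch, not a guaranteed fix. The helper name `free_gpu_memory` is hypothetical, and it assumes you have already deleted your references to the model and tensors (for example with `del model`). After an OOM error, Jupyter may still hold references in the traceback, in which case the memory will not be released and the `kill [PID]` route above is still the fallback.

```python
import gc


def free_gpu_memory():
    """Best-effort release of cached GPU memory held by this Python process.

    Returns True if the CUDA cache was cleared, False if PyTorch or a GPU
    is not available. It cannot free tensors that are still referenced.
    """
    # Collect unreachable Python objects first so their tensors can be freed
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Return cached, unused blocks from PyTorch's allocator to the GPU
    torch.cuda.empty_cache()
    return True
```

Run `nvidia-smi` afterwards to see whether the memory usage actually dropped.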

## Let's begin!
I used a GPU and dev environment from [brev.dev](https://brev.dev). Provision a pre-configured GPU in one click [here](https://console.brev.dev/environment/new?instance=A10G:g5.xlarge&name=ocr-pdf-analysis) (I used an A10G, with 24GB GPU Memory). Once you've checked out your machine and landed in your instance page, select the specs you'd like (I used Python 3.10 and CUDA 12.0.1) and click the "Build" button to build your autogpu container. Give this a few minutes.

A few minutes after your instance has started Running, click the 'Notebook' button on the top right of your screen once it illuminates (you may need to refresh the screen). You will be taken to a Jupyter Lab environment, where you can upload this Notebook.


Note: You can connect your cloud credits (AWS or GCP) by clicking "Org: " on the top right, and in the panel that slides over, click "Connect AWS" or "Connect GCP" under "Connect your cloud" and follow the instructions linked to attach your credentials.



In [2]:
# You only need to run this once per machine
!pip install -q -U bitsandbytes reportlab scipy
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## 1. OCR: PDF → Text
In this section, we'll use [OCR (Optical Character Recognition)](https://en.wikipedia.org/wiki/Optical_character_recognition) to extract text from our PDF. We will use the open-source tool [pd3f](https://pd3f.com/).

### Install docker-compose and pd3f

In [2]:
import time, requests, os

os.chdir('/home/ubuntu')

!sudo apt update -y -q
!sudo apt-get install ufw -y -q
!sudo apt install docker-compose -y -q
!git clone https://github.com/pd3f/pd3f
!sudo systemctl start docker.service

Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:3 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Reading package lists...
Building dependency tree...
Reading state information...
48 packages can be upgraded. Run 'apt list --upgradable' to see them.
Reading package lists...
Building dependency tree...
Reading state information...
ufw is already the newest version (0.36.1-4ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 48 not upgraded.
Reading package lists...
Building dependency tree...
Reading state information...
docker-compose is already the newest version (1.29.2-1).
0 upgraded, 0 newly installed, 0 to remove and 48 not upgraded.
Cloning into 'pd3f'...
remote: Enumerating objects: 492, done.
remote: Counting objects: 100% (28/28)

### Run pd3f

Now open a Terminal ('+' or 'Launcher' at the top tab section -> 'Terminal') and run this command:

```
cd /home/ubuntu/pd3f && sudo ./dev.sh
```

Wait until you see something like this, repeating:
```
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
```
Leave it running.
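If you'd rather check from the notebook than watch the terminal, a quick port check can tell you whether anything is listening yet. This is a rough sketch: the helper name `is_port_open` is hypothetical, and port 1616 is simply the one the requests later in this tutorial are sent to. An open port only shows that the web service is accepting connections, not that the OCR workers have finished starting, so the log lines above remain the better signal.

```python
import socket


def is_port_open(host="localhost", port=1616, timeout=1.0):
    """Return True if something is accepting TCP connections on host:port."""
    try:
        # create_connection raises OSError if the connection is refused or times out
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage in a notebook cell:
# print("pd3f reachable:", is_port_open())
```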

### Gather PDFs

I put my PDF in a directory called "ocr-example". You can put multiple PDFs in here; just keep your model's maximum context length in mind. In this case, we use a model that allows for 32K tokens, which is [roughly 24,000 words, or 48 pages of text](https://twitter.com/SullyOmarr/status/1654576775970828293). We will want that context to contain:

1. The PDF(s) text you'd like to ask questions about
2. The instructions ("Answer the following questions by referring to the data below...")
3. Optionally (recommended), chat history. What questions have been asked and the responses provided. Good for questions that require context, e.g. "Can you explain that further?". 

Keep this in mind as you gather your dataset and decide how you'd like your model to behave.
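To get a feel for whether your PDFs will fit, you can turn the rule of thumb above into a rough estimate. This is only a sketch: the helper names are hypothetical, the words-per-token ratio is just the one implied by "32K tokens is roughly 24,000 words" (treating 32K as 32,000), and `reserved_tokens` is an arbitrary allowance for instructions and chat history. The real count depends on the tokenizer, so treat the result as a ballpark.

```python
# Ratio implied by the estimate above (assumption: 32K taken as 32,000 tokens)
WORDS_PER_TOKEN = 24_000 / 32_000  # 0.75


def estimate_tokens(text):
    """Rough token estimate from a whitespace word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def fits_in_context(texts, max_tokens=32_000, reserved_tokens=2_000):
    """Return (fits, estimated_total) for a list of document texts.

    reserved_tokens is a hypothetical allowance for the instructions
    and chat history that share the context with the documents.
    """
    total = sum(estimate_tokens(t) for t in texts)
    return total + reserved_tokens <= max_tokens, total
```

For example, `fits_in_context([text_a, text_b])` returns a flag plus the estimated total, which you can compare against the tokenizer's exact count later on.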

My PDF is an old paper I wrote for a required writing class my sophomore year of college. It was a class on celebrity, and I wrote about how and why I thought Kylie Jenner would be the most successful Kardashian. 

In [3]:
# If you'd like to use my example, you can pull it here:
!git clone https://github.com/harper-carroll/ocr-example.git

directory = "ocr-example"

Cloning into 'ocr-example'...
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 8 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (8/8), 836.07 KiB | 28.83 MiB/s, done.


In [4]:
os.chdir('/home/ubuntu/') # The directory that contains the pdf directory (change if necessary)
files = []

for filename in os.listdir(directory):
    if filename.endswith('.pdf'):
        value = (filename, open(directory + '/' + filename, 'rb'))
        files.append({'pdf': value})

### Use the pd3f OCR API! 

POST parameters to put in the `data` dictionary (more info [here](https://pd3f.com/docs/pd3f/usage/)):

- `lang`: set the language (options: 'de', 'en', 'es', 'fr')
- `fast`: whether to check for tables (default: False) (Harper’s note: This seems weird to me, but it’s what the documentation says 🤷‍♀️)
- `tables`: whether to check for tables (default: False)
- `experimental`: whether to extract text in experimental mode (footnotes to endnotes, deduplicate page header / footer) (default: False)
- `check_ocr`: whether to check first if all pages were OCR'd (default: True, cannot be modified in GUI)

In [5]:
import time, requests, os

ocr_texts = {}

for file in files:
    # Upload the PDF; pd3f responds with a job ID
    response = requests.post(
        'http://localhost:1616',
        files=file,
        data={'lang': 'en'}
    )
    job_id = response.json()['id']  # named job_id to avoid shadowing the built-in `id`

    # Poll once per second until the extracted text is available
    while True:
        update = requests.get(f"http://localhost:1616/update/{job_id}").json()
        if 'text' in update:
            break
        time.sleep(1)
    filename = file['pdf'][0]
    ocr_texts[filename] = update['text']

print("The text extracted by pd3f is:", ocr_texts)

The text extracted by pd3f is: {'rba.pdf': 'The Rise of Kylie Jenner\n\nHarper Carroll\nPWR 2: Cultures of Personality\nDr. Maxe Crandall\nWinter 2016\n\n2\n\nKeeping Up With the Kardashians broke into the world of reality television on October 14 th , 2007 (Wikipedia). Since the airing of its first episode, it has had eleven seasons and is renewed for its twelfth season, set to air Spring 2016 (E! Online). Eighteen-year-old Kylie Jenner, the youngest of the Kardashian-Jenner clan, is worth far above $10 million owns a $2.7 million home, and retains 74% of the subscribers of all of her sisters\' apps combined (Celebrity Net Worth, Hollywood Life 2015). In late September, Kylie had 36.5 million Instagram followers. Today, this number has risen a whopping 47% to 53.6 million. "King Kylie" is now the second most followed Kardashian sister, second only to Kim, long the Queen of the clan. But only two years ago, Kylie was almost entirely shadowed by the spotlights that shone on her older si
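The polling loop above waits forever if a job never finishes. Here is a sketch of a small generic helper that gives up after a timeout. The name `poll_until` and its parameters are hypothetical, and the default timeout is an arbitrary choice rather than anything pd3f recommends.

```python
import time


def poll_until(fetch, is_done, interval=1.0, timeout=600.0):
    """Call fetch() repeatedly until is_done(result) is true.

    Raises TimeoutError if the condition is not met within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met after {timeout} seconds")
        time.sleep(interval)


# Example usage with pd3f, where job_id is the ID returned by the POST request:
# update = poll_until(
#     lambda: requests.get(f"http://localhost:1616/update/{job_id}").json(),
#     lambda r: "text" in r,
# )
```

Passing the fetch step in as a function keeps the helper independent of `requests`, so it is easy to try out with fake data first.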

## 2. Load the Model

[MistralLite](https://huggingface.co/amazon/MistralLite?library=true) is Amazon's variation of Mistral that allows for contexts of up to 32K tokens. From the Hugging Face page:

> MistralLite is a fine-tuned Mistral-7B-v0.1 language model, with enhanced capabilities of processing long context (up to 32K tokens). By utilizing an adapted Rotary Embedding and sliding window during fine-tuning, MistralLite is able to perform significantly better on several long context retrieve and answering tasks, while keeping the simple model structure of the original model. MistralLite is useful for applications such as long context line and topic retrieval, summarization, question-answering, and etc. 

In this section, we load a 4-bit quantized version of the model so it will fit on a smaller GPU. You can choose to remove the `bnb_config` if you have the compute to load the full version.

***Quantization*** in the context of deep learning is the process of reducing the numerical precision of a model's tensors, making the model more compact and the operations faster in execution. This is by nature lossy and usually has some negative effect on accuracy. Mistral's tensors were 16-bit, and we load them in 4-bit, which reduces the bit usage by 75%.
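To make the 75% figure concrete, here is a back-of-the-envelope calculation of the memory needed just for the weights. It is a sketch under assumptions: the parameter count is rounded to 7 billion, and it ignores activations, the attention cache for long prompts, and the small overhead of the quantization constants, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7e9  # assumption: "7B" rounded to 7 billion parameters

fp16_gb = weight_memory_gb(num_params, 16)  # 16-bit weights
int4_gb = weight_memory_gb(num_params, 4)   # 4-bit weights
reduction = 1 - int4_gb / fp16_gb

print(f"16-bit: {fp16_gb} GB, 4-bit: {int4_gb} GB, reduction: {reduction:.0%}")
# prints "16-bit: 14.0 GB, 4-bit: 3.5 GB, reduction: 75%"
```

On a 24GB GPU, the difference leaves considerably more room for the long prompt itself.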

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, StoppingCriteriaList, StoppingCriteria

base_model_id = "amazon/MistralLite"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True)

## 3. Get Prompt Ready

In this tutorial, since we're using a model that allows for long context lengths, we'll be inserting the text from our entire PDF into the prompt.
We want to prompt engineer a bit so that we get the functionality we'd like.

Here I've added some code to remove the "Works Cited" section of my essay. It is quite long and I may not need it, so I'd rather reserve that context space for something else, like chat history. If `trim=True`, the code below removes everything from `end_phrase` ("Works Cited" in this case) onward.

In [7]:
end_phrase = "Works Cited"
full_essay = ocr_texts['rba.pdf'] # If you have more than one PDF, you'll need to alter this code
trim = True

index = full_essay.find(end_phrase)
if trim and index != -1:
    # If end_phrase is found, keep only the text before it
    essay = full_essay[:index]
else:
    # If end_phrase is not found (or trim is False), keep the original string
    essay = full_essay
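If you do have several PDFs, one possible way to adapt the cell above is to trim each document and join them under a simple label. This is a sketch, not the approach used in this tutorial: the helper names and the `Document: <filename>` label format are hypothetical choices, and you should still check that the combined text fits in the context.

```python
def trim_after(text, end_phrase):
    """Return text up to (not including) end_phrase, or all of it if absent."""
    index = text.find(end_phrase)
    return text[:index] if index != -1 else text


def combine_documents(ocr_texts, end_phrase=None):
    """Join multiple OCR'd documents into one string, labeled by filename.

    ocr_texts maps filename -> extracted text, like the dict built earlier.
    """
    parts = []
    for filename, text in ocr_texts.items():
        if end_phrase:
            text = trim_after(text, end_phrase)
        parts.append(f"Document: {filename}\n{text.strip()}")
    return "\n\n".join(parts)


# Example usage:
# essay = combine_documents(ocr_texts, end_phrase="Works Cited")
```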

In [8]:
eval_prompt = """Task: Provide detailed answers to questions provided about the following essay, referencing the essay itself.

The essay: """ + essay + "-----"

In [9]:
model_input = tokenizer(eval_prompt, return_tensors="pt", return_attention_mask=False).to("cuda")

In [10]:
f"The initial prompt and essay is {len(model_input[0])} tokens."

'The initial prompt and essay is 5300 tokens.'

## 4. Run the Model!
### Add Stopping Criteria List

Here is where we teach the model to stop if it sees certain tokens. A common issue with chatbots is that the model will answer its question and then generate another one, assuming the role of the user. For example:

You: 
```
Question: How many people live in the United States in 2023? 
```
The model:
```
Answer: Almost 340 million people.
Question: How many people live in Canada in 2023?
Answer: About 38.8 million people.
Question: ....
```

To mitigate this issue, we provide stopping criteria, where each criterion is represented as a list of tokens for the words we'd like the model to stop at. The stopping criterion we will use for this model is the word "Question: ": if the model generates "Question: ", it knows it's gone too far. Another common stopping criterion is the presence of a newline.
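The core idea can be sketched in plain Python before involving the model: after every generated token, check whether the sequence so far ends with any of the stop sequences. The function name is hypothetical, and the token IDs in the usage note are only illustrative values.

```python
def ends_with_stop_sequence(generated_ids, stop_sequences):
    """Return True if generated_ids ends with any of the stop sequences.

    generated_ids: list of token IDs produced so far.
    stop_sequences: list of token-ID lists, one per stop word.
    """
    for stop in stop_sequences:
        # Compare the last len(stop) tokens against the stop sequence
        if stop and generated_ids[-len(stop):] == stop:
            return True
    return False
```

For instance, with a stop sequence of `[22478, 28747, 28705]`, the check is true for `[5, 6, 22478, 28747, 28705]` and false for `[22478, 28747]`, since the full sequence has not been generated yet.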

First, we need to get the tokenized form of our stop words (in this case, just "Question: ").

In [11]:
list_of_stop_words = ["Question: "]

stop_words_ids = [
    tokenizer.encode(stop_word) for stop_word in list_of_stop_words]

print(stop_words_ids)

[[1, 22478, 28747, 28705]]


In [12]:
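# Adapted from https://discuss.huggingface.co/t/implimentation-of-stopping-criteria-list/20040

class StoppingCriteriaSub(StoppingCriteria):

    def __init__(self, stops=None):
        super().__init__()
        # Store each stop sequence as a 1-D tensor of token IDs
        self.stops = [torch.tensor(stop) for stop in (stops or [])]

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs):
        # Stop as soon as the sequence so far ends with any stop sequence
        for stop in self.stops:
            stop = stop.to(input_ids.device)
            if input_ids.shape[1] >= len(stop) and torch.equal(input_ids[0, -len(stop):], stop):
                return True
        return False

# tokenizer.encode prepends the BOS token (the leading 1 above), which never
# appears mid-generation, so drop it before matching
stops = [ids[1:] if ids[0] == tokenizer.bos_token_id else ids for ids in stop_words_ids]
stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stops)])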

Now, let's define the chatbot loop. This loop takes in a user's question, places "\n\nQuestion: " as the prefix and "\nAnswer: " as the suffix, and then tokenizes just that new string to save memory. Then, it concatenates it onto the end of the old Encoding object, i.e. the previous tokenized prompt. 

I noticed that the Encoding's `num_tokens` remains the same - not sure how to fix this. If you know, you'll get $10 of free Brev credits. Just reach out to me. 

The model will stop if you input one of the `exit_terms` as the question.
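Because every question and answer is appended to the context, a long chat will eventually approach the model's limit. Below is a sketch of the turn formatting described above plus a simple budget check you could run before each generation. The helper names are hypothetical, 32,000 is an assumed reading of "32K", and 300 matches the `max_new_tokens` value used in the loop. The current token count can be read from the tensor itself with `model_input["input_ids"].shape[1]`.

```python
MAX_CONTEXT_TOKENS = 32_000  # assumption: "32K" taken as 32,000 tokens
MAX_NEW_TOKENS = 300         # same value as max_new_tokens in the chat loop


def format_turn(question):
    """Wrap a user question in the prefix and suffix used by the chat loop."""
    return "\n\nQuestion: " + question + "\nAnswer: "


def has_room(current_tokens, question_tokens,
             max_context=MAX_CONTEXT_TOKENS, max_new=MAX_NEW_TOKENS):
    """Return True if the context, the new question, and a full-length
    answer all fit within the model's context window."""
    return current_tokens + question_tokens + max_new <= max_context
```

If `has_room` returns False, options include trimming older chat history or starting over from the original essay prompt.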

In [13]:
model.eval()

exit_terms = ["stop", "exit"]

while True:
    q = input("Question: ")
    if q.strip().lower() in exit_terms:
        break
    # Tokenize only the new turn, then append it to the running context
    next_prompt = "\n\nQuestion: " + q + "\nAnswer: "
    next_input_tokenized = tokenizer(next_prompt, return_tensors="pt", return_attention_mask=False).to("cuda")
    model_input["input_ids"] = torch.cat((model_input["input_ids"], next_input_tokenized["input_ids"]), dim=1)
    # ones_like keeps the mask on the same device and dtype as input_ids
    model_input["attention_mask"] = torch.ones_like(model_input["input_ids"])
    prompt_length = model_input["input_ids"].shape[1]
    with torch.no_grad():
        gen_tokens = model.generate(**model_input, stopping_criteria=stopping_criteria, max_new_tokens=300, repetition_penalty=1.05, pad_token_id=tokenizer.eos_token_id)
    # Decode only the newly generated tokens
    out = tokenizer.decode(gen_tokens[0][prompt_length:], skip_special_tokens=True)
    print("Answer: " + out)
    # Keep the full conversation (prompt + answer) as context for the next question
    model_input["input_ids"] = gen_tokens
    model_input["attention_mask"] = torch.ones_like(gen_tokens)

Question:  What specific incidents helped Kylie Jenner rise to fame?


Answer: 1. Kylie's transformation to look like Kim 2. The Kylie Jenner Lip Challenge 3. Kylie's Snapchat story 4. Kylie's website and mobile app


Question:  How did Kylie Jenner use branding strategies to maintain her celebrity status?


Answer: 1. Becoming the "new" Kim 2. Trademarking her lips 3. Connecting with her fans 4. Marketing an alternative, ethereal image of herself


Question:  What celebrities did Kylie draw inspiration from?


Answer: 1. Lady Gaga 2. Jennifer Lopez 3. Greta Garbo


Question:  stop


### Sweet... it worked! Epic!!!

I hope you enjoyed this tutorial on OCR + building a PDF analysis chatbot using MistralLite. Please join the community on [Discord](https://discord.gg/pnCpkwU3G5)! 

If you have any questions, please reach out to me on [X](https://x.com/harperscarroll) or in the [Discord channel](https://discord.gg/pnCpkwU3G5).

🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙 🤙