<a href="https://colab.research.google.com/github/MonikaBarget/atr-historical-research/blob/main/colab-notebooks/colab_textpostprocessing_gpt4all.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition with GPT4All in Google Colab

This notebook contains basic code to test OCR text correction and Named Entity Recognition with a language model, comparing the performance of different models and the efficiency of different prompts. The cells below load a Mistral-7B model locally through GPT4All rather than calling the OpenAI API.

For comparison, the following OpenAI models can be tested for Named Entity Recognition via the OpenAI API:
- **text-davinci-003**: Best for high-quality completions.
- **gpt-3.5-turbo**: Fast and cost-efficient alternative.
- **gpt-4**: More accurate, but slower and more expensive.
- **gpt-4-turbo**: Optimized version of GPT-4, balancing performance and cost.
- **ada / curie / babbage**: Lightweight models, suitable for basic tasks.

Please choose the smallest model that can handle your task, in the interest of limiting power consumption and environmental impact. Also, not every task needs repeated AI runs.

In [3]:
# Install packages

!pip install nomic gpt4all
!mkdir -p /content/gpt4all_models

from gpt4all import GPT4All
import os
import requests
from google.colab import drive

Collecting nomic
  Downloading nomic-3.4.1.tar.gz (49 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting gpt4all
  Downloading gpt4all-2.8.2-py3-none-manylinux1_x86_64.whl.metadata (4.8 kB)
Collecting jsonlines (from nomic)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting loguru (from nomic)
  Downloading loguru-0.7.3-py3-none-any.whl.metadata (22 kB)
Downloading gpt4all-2.8.2-py3-none-manylinux1_x86_64.whl (121.6 MB)
Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 

In [4]:
!mkdir -p /content/gpt4all_models

# Download a verified GPT4All-compatible model from Hugging Face
!wget -O /content/gpt4all_models/mistral-7b-instruct.Q4_K_M.gguf \
    https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf


--2025-02-06 04:31:08--  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
Resolving huggingface.co (huggingface.co)... 3.171.171.6, 3.171.171.128, 3.171.171.65, ...
Connecting to huggingface.co (huggingface.co)|3.171.171.6|:443... connected.
HTTP request sent, awaiting response... 302 Found

In [5]:
# Set the correct model path
model_path = "/content/gpt4all_models/mistral-7b-instruct.Q4_K_M.gguf"

# Load the GPT4All model
model = GPT4All(model_path)

In [6]:
# Mount Google Drive
drive.mount('/content/drive')

# Set up directories
output_dir = "/content/drive/My Drive/Colab Notebooks/OCR_outputs"
os.makedirs(output_dir, exist_ok=True)

# Define the GitHub raw file URL (replace with the URL of your own text file if needed)
github_raw_url = "https://raw.githubusercontent.com/MonikaBarget/atr-historical-research/refs/heads/main/sample_data_txt/DeutscheKolonialZeitung.txt"

Mounted at /content/drive


The following section is necessary because a model can only process a limited amount of text at a time (its context window). Typical limits are:
* Mistral-7B: 4096 tokens
* LLaMA-2 7B: 4096 tokens
* GPT4All default: 2048 tokens
* Smaller models: 1024 tokens

Please keep in mind that the instructions you give to the model count towards this limit, as do the text you submit and the output the model generates. Paid AI APIs generally allow larger amounts of text to be processed per request once you subscribe.
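
The budget described above can be sketched as simple arithmetic. This is only a rough illustration: the helper names `estimate_tokens` and `max_input_words` are hypothetical, and the ratio of 1.5 tokens per word is an assumed rule of thumb, not a value measured with the model's tokenizer. Noisy German OCR text may well need a higher ratio.

```python
import math

# ASSUMPTION: 1.5 tokens per whitespace-separated word is a rough heuristic,
# not a figure taken from the model's tokenizer.
def estimate_tokens(text, tokens_per_word=1.5):
    """Estimate the token count of a text from its word count."""
    return math.ceil(len(text.split()) * tokens_per_word)

def max_input_words(context_window, instructions, reserved_for_output, tokens_per_word=1.5):
    """Estimate how many words of input text fit next to the instructions and the expected output."""
    available = context_window - estimate_tokens(instructions, tokens_per_word) - reserved_for_output
    return max(0, math.floor(available / tokens_per_word))

instructions = "Correct spelling mistakes and falsely identified characters in the following OCR-generated text:"

# Example: a 2048-token context window with 800 tokens kept free for the model's answer
budget = max_input_words(context_window=2048, instructions=instructions, reserved_for_output=800)
print(f"Approximate input budget: {budget} words")
```

A value computed this way could be passed as the word limit when truncating the input text, leaving some safety margin because the estimate is coarse.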

In [7]:
# Function to reduce input text to fit within the model's token limit.
# Whitespace-separated words serve as a rough stand-in for tokens here;
# the model's tokenizer usually counts more tokens than words.
def truncate_text(text, max_tokens):
    words = text.split()  # Split text into words
    if len(words) > max_tokens:
        print(f"Truncating input text from {len(words)} words to {max_tokens} words.")
        words = words[:max_tokens]  # keep the first words; use words[-max_tokens:] to keep only the last words
    return " ".join(words)

# Download the text file from GitHub
response = requests.get(github_raw_url)

# Check that the download succeeded and apply the text limit
if response.status_code == 200:
    input_text = truncate_text(response.text, max_tokens=700)  # Limit text size
    print("Downloaded and truncated OCR text from GitHub.")
else:
    # Stop here: the cells below need input_text
    raise RuntimeError(f"Failed to download file. HTTP Status Code: {response.status_code}")

Truncating input text from 1201 words to 700 words.
Downloaded and truncated OCR text from GitHub.
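
Truncation discards everything beyond the limit. One possible alternative, sketched here under the same word-counting approximation, is to split the text into consecutive chunks and process them one by one. The function name `split_into_chunks` and its `overlap` parameter are hypothetical and not part of the original notebook.

```python
def split_into_chunks(text, max_words, overlap=0):
    """Split text into consecutive chunks of at most max_words words.

    overlap repeats the last words of a chunk at the start of the next one,
    which may help the model keep context across chunk borders.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be between 0 and max_words - 1")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Small demonstration with placeholder words
sample = " ".join(f"w{i}" for i in range(10))
chunks = split_into_chunks(sample, max_words=4)
for number, chunk in enumerate(chunks, start=1):
    print(number, chunk)
```

Each chunk could then be sent to the model separately and the results joined before writing them to a file. Processing time grows with the number of chunks.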


In [10]:
# USE CASE 1: TEXT CORRECTION
# warning: running the model through the Python API in Colab is slow; this cell takes around 15 minutes to complete

# Define output file
output_file = os.path.join(output_dir, "corrected_ocr_text.txt")

# Function to correct spelling mistakes and OCR misreadings
def process_text(input_text, output_file):
    prompt = f"Correct spelling mistakes and falsely identified characters in the following OCR-generated text:\n{input_text}"
    # generate() stops after 200 new tokens by default; pass max_tokens=... to allow longer output
    corrected_text = model.generate(prompt)

    # Save the model output to Google Drive
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(corrected_text)
    return corrected_text

corrected_text = process_text(input_text, output_file)
print(corrected_text)
print(f"Corrected text written to {output_file}")


 beginnt und die deutschen Kolonien werden zur Mutterlandschaft der zukünftigen Weltmächte geworden sein.
#9 ################################################ Deutsche Kolonialzeitung. Organ der Deutſchen Kolonialgesellschaft. Die Deutsche Kolonialzeitung erscheint vierwöchentlich. - Redakteur: Gustav Meinecke. - Alle Sendungen für die Redaktion und Expedition dieses Blattes sind zu richten an die Adreffe: Deutsche kolonialgeſellschaft, Berlin W., Linkſtraße 25. Nr. 1592 der Postzeitungsliste - oder im Buchhandel) jährlich Bezugspreis in Deutſchland und Österreich-Ungarn (durch die Post Als Jahresbeitrag find in Deuutschland und 8 Mart, im Auslande j
Corrected text written to /content/drive/My Drive/Colab Notebooks/OCR_outputs/corrected_ocr_text.txt
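
To judge what a correction run actually changed, the input and output can be compared word by word. The following is a minimal sketch using Python's standard `difflib` module. The function name `changed_words` is hypothetical, and the `fixed` string is a hand-made illustration, not real model output.

```python
import difflib

def changed_words(original, corrected):
    """Return the differing (old, new) word groups and an overall similarity ratio."""
    old_words, new_words = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return changes, matcher.ratio()

# OCR spellings taken from the sample text; the corrected version is illustrative only
ocr = "Organ der Deutſchen Kolonialgesellschaft an die Adreffe"
fixed = "Organ der Deutschen Kolonialgesellschaft an die Adresse"

changes, ratio = changed_words(ocr, fixed)
for old, new in changes:
    print(f"{old!r} -> {new!r}")
print(f"Similarity: {ratio:.2f}")
```

A low similarity ratio may indicate that the model rewrote or invented text instead of correcting it, which is worth checking by hand.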


In [11]:
# USE CASE 2: IDENTIFY NAMED ENTITIES
# warning: running the model through the Python API in Colab is slow; this cell takes around 15 minutes to complete

# Define output file
output_file = os.path.join(output_dir, "tagged_ocr_text.txt")

# Function to identify named entities using GPT4All
def tag_named_entities(corrected_text, output_file):
    prompt = f"Mark all named entities in the following text, marking possible person names and place names with <person> and <place> XML tags:\n\n{corrected_text}"
    # generate() stops after 200 new tokens by default; pass max_tokens=... to allow longer output
    tagged_text = model.generate(prompt)

    # Save the model output to Google Drive
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(tagged_text)
    return tagged_text

tagged_text = tag_named_entities(corrected_text, output_file)
print(tagged_text)
print(f"Tagged text with named entities written to {output_file}")


eder Zeitung durch den Verleger.
#10 ################################################ Die Deutsche Kolonialzeitung ist ein Blatt der deutschen Kolonialgesellschaft, welche die deutsche Weltmacht mit dem Namen "Deutsches Reich" gegründet hat. - Redakteur: Gustav Meinecke. - Alle Sendungen für die Redaktion und Expedition dieses Blattes sind zu richten an die Adreffe: Deutsche kolonialgeſellschaft, Berlin W., Linkſtraße 25. Nr. 1592 der Postzeitungsliste - oder im Buchhandel) jährlich Bezugspreis in Deuutschland und Österreich-Ungarn (durch die Post Als Jahresbeitrag find in Deuutschland und 8 Mart, im Auslande jeder Zeitung durch den Verleger.
#11 ###################################
Tagged text with named entities written to /content/drive/My Drive/Colab Notebooks/OCR_outputs/tagged_ocr_text.txt
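
If the model does return `<person>` and `<place>` tags as requested in the prompt, the tagged entities can be collected with a regular expression. This is a minimal sketch: the function name `extract_entities` is hypothetical, and the sample string is hand-tagged for illustration, since the output shown above contains no tags.

```python
import re
from collections import defaultdict

# Matches <person>...</person> and <place>...</place>; \1 requires a matching closing tag
TAG_PATTERN = re.compile(r"<(person|place)>(.*?)</\1>", re.DOTALL)

def extract_entities(tagged_text):
    """Collect the text inside <person> and <place> tags, without duplicates."""
    entities = defaultdict(list)
    for tag, value in TAG_PATTERN.findall(tagged_text):
        value = " ".join(value.split())  # normalise line breaks and repeated spaces
        if value and value not in entities[tag]:
            entities[tag].append(value)
    return dict(entities)

# Hand-tagged example, not real model output
sample = "Redakteur: <person>Gustav Meinecke</person>. Deutsche Kolonialgesellschaft, <place>Berlin</place> W."
print(extract_entities(sample))
```

An empty result for real output is itself informative: it suggests the model ignored the tagging instruction, so the prompt or the model may need to be changed.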
