## Introduction
In this notebook, we build a very basic pipeline for converting scanned documents to machine-readable text using OCR. The idea is to demonstrate how out-of-the-box OCR services like AWS Textract can be used at scale to quickly and accurately convert large volumes of scanned PDFs to raw text. The converted text can then be fed programmatically to large language models for further analysis.

In [58]:
import boto3
import time
from dataclasses import dataclass
import os
from pdf2image import convert_from_path
from PIL import Image, ImageDraw
from openai import OpenAI
import faiss
import numpy as np

## Preliminary Configurations
Here, we create some preliminary configurations.
1. First we set up an S3 bucket where all the raw input books will be stored. The bucket created for this purpose is `cps-scanned-archival-material`.
2. Textract needs to be able to access the files in this bucket, so we amend the default bucket policy to give Textract read access. We use the following policy:
```
                {
                    "Version": "2012-10-17",
                    "Statement": [
                        {
                            "Effect": "Allow",
                            "Principal": {
                                "Service": "textract.amazonaws.com"
                            },
                            "Action": [
                                "s3:GetObject"
                            ],
                            "Resource": "arn:aws:s3:::cps-scanned-archival-material/*"
                        }
                    ]
                }
```
3. We then set up the S3 client as needed.
4. We will also set up a Textract client.
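
The policy in step 2 can also be attached programmatically instead of through the console. The snippet below is a minimal sketch, assuming the caller is allowed to call `put_bucket_policy` on the bucket. The helper names `build_textract_read_policy` and `apply_bucket_policy` are hypothetical, and the policy built here grants only `s3:GetObject`, which is what read access requires.

```python
import json


def build_textract_read_policy(bucket_name: str) -> dict:
    """Return a bucket policy that lets the Textract service read objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "textract.amazonaws.com"},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


def apply_bucket_policy(s3_client, bucket_name: str) -> None:
    """Attach the policy; s3_client is expected to be boto3.client('s3')."""
    policy = build_textract_read_policy(bucket_name)
    # put_bucket_policy expects the policy as a JSON string, not a dict
    s3_client.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))
```

With a real client this would be called as `apply_bucket_policy(boto3.client('s3'), 'cps-scanned-archival-material')`. Note that it replaces any policy already attached to the bucket.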

In [6]:
@dataclass
class BasicConfigurations:
    BucketName: str

basic_configurations = BasicConfigurations(BucketName='cps-scanned-archival-material')
# Set up the s3 client
s3_client = boto3.client('s3')
textract = boto3.client('textract', region_name='ap-south-1')

## Sample Input Preparation
Before proceeding, we have to prepare some sample inputs. We pick four PDF documents from the CPS website and use them for our analysis. The four documents, each with differing characteristics, are:
1. **BV07 - Indian Rulers and The British Government** - Typewritten pages, some of which are blurred. A total of 105 pages of text. 19.2 MB
2. **BV12 - British Rule in South India** - Slightly better scan quality, but the scan has a slight clockwise tilt. 96 pages. 18.4 MB
3. **CPM03 - Reports of the Agitation in Bihar** - Very poor original document quality: fading print and stains on the page. 50 pages
4. **TS05 - Manufacture of Iron and Steel** - Poor quality typescripts with blurred type and annotations in the margins. 15 pages

The documents were chosen to be as contextually diverse as possible while being subjectively difficult to convert to raw text. They have various sources of error and confusion, ranging from poor scan quality to hand annotations and corrections. These documents were uploaded to the S3 bucket using the script below.


In [None]:
pdf_files = []
folder_path = 'pdf_files'
for file in os.listdir(folder_path):
    if file.endswith('.pdf'):
        pdf_files.append(file)
# Upload these files to S3
for file in pdf_files:
    s3_client.upload_file(f'{folder_path}/{file}', basic_configurations.BucketName, f'{file}')
    print(f'Uploaded {file} to {basic_configurations.BucketName}')
    time.sleep(1)

## We will now use textract to convert documents
To do this, we will create a simple function that performs OCR on these PDF files using the Textract API via the boto3 package. Before building an actual pipeline, we first look at how to interpret the output of the text recognition process.

We start with a simple page: page 45 of 96 in BV12 - British Rule in South India. This should be an easy task. We will use `FeatureTypes=['TABLES']` for the time being. Note that the whole document is analyzed first; the blocks for the page of interest are picked out afterwards.

The code below shows how we can analyze a full document and then use the API response to get the information we need about the different elements in the text. This information is then used to annotate the page with bounding boxes.
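
Each element Textract returns is a "block" carrying a `BlockType`, a `Page` number, and a `Geometry` whose coordinates are expressed as fractions of the page width and height. The sketch below converts one block's polygon to pixel coordinates. The sample block is made up for illustration and `polygon_to_pixels` is a hypothetical helper, not part of boto3.

```python
def polygon_to_pixels(block: dict, width: int, height: int) -> list:
    """Scale a block's normalized polygon (0..1) to pixel coordinates."""
    return [
        (point["X"] * width, point["Y"] * height)
        for point in block["Geometry"]["Polygon"]
    ]


# Illustrative block with the fields used in this notebook (values are invented)
sample_block = {
    "BlockType": "LINE",
    "Page": 1,
    "Text": "An illustrative line of text",
    "Geometry": {
        "Polygon": [
            {"X": 0.125, "Y": 0.25},
            {"X": 0.875, "Y": 0.25},
            {"X": 0.875, "Y": 0.3125},
            {"X": 0.125, "Y": 0.3125},
        ]
    },
}

# For a 1000 x 2000 pixel page image, the corners land at these pixel positions
print(polygon_to_pixels(sample_block, width=1000, height=2000))
```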


In [48]:
document_name =  'CPM-03-Reports-of-Agitation-in-Bihar-1893.pdf'
page_number = '45'
response = textract.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': basic_configurations.BucketName,
            'Name': document_name
        }
    },
    FeatureTypes=['TABLES']
)
# Keep the job ID so that we can poll for the result
job_id = response['JobId']
# Polling the job status
while True:
    job_status = textract.get_document_analysis(JobId=job_id)
    status = job_status['JobStatus']
    if status in ['SUCCEEDED', 'FAILED']:
        break
    print(f"Job status: {status}. Waiting...")
    time.sleep(5)  # Wait for 5 seconds before polling again

# Initialize variables
next_token = None
blocks = []
while True:
    if next_token:
        response = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
    else:
        response = textract.get_document_analysis(JobId=job_id)
    blocks.extend(response['Blocks'])
    if 'NextToken' in response:
        next_token = response['NextToken']
    else:
        break


Job status: IN_PROGRESS. Waiting...
Job status: IN_PROGRESS. Waiting...
Job status: IN_PROGRESS. Waiting...
Job status: IN_PROGRESS. Waiting...
Job status: IN_PROGRESS. Waiting...
Job status: IN_PROGRESS. Waiting...
Job status: IN_PROGRESS. Waiting...
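
The polling loop above waits indefinitely and does not distinguish a failed job from a successful one. A more defensive variant might look like the sketch below. It assumes a boto3 Textract client is passed in; the name `wait_for_analysis` and its `poll_seconds` and `timeout_seconds` parameters are hypothetical.

```python
import time


def wait_for_analysis(textract_client, job_id: str,
                      poll_seconds: float = 5.0,
                      timeout_seconds: float = 900.0) -> str:
    """Poll until the job leaves IN_PROGRESS; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = textract_client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            if status == "FAILED":
                message = response.get("StatusMessage", "no message")
                raise RuntimeError(f"Textract job {job_id} failed: {message}")
            # SUCCEEDED or PARTIAL_SUCCESS: let the caller decide what to do
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Textract job {job_id} still running after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

It would be called as `wait_for_analysis(textract, job_id)` before the pagination loop that collects the blocks.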


In [59]:
# Now that we have the full document analysis, keep only the blocks of the page we annotate
target_page = 1
page_blocks = [block for block in blocks if block['Page'] == target_page]
artefact_path = 'output_artifacts'
# Make sure the output folder exists before saving anything into it
os.makedirs(artefact_path, exist_ok=True)
# Render the same page of the local PDF as an image
images = convert_from_path(f'{folder_path}/{document_name}',
                           first_page=target_page,
                           last_page=target_page)
page_image = images[0]
page_image.save(f'{artefact_path}/cpm03-page1.jpg')
# Draw the outline of every detected line of text on the image
draw = ImageDraw.Draw(page_image)
width, height = page_image.size
for block in page_blocks:
    if block['BlockType'] == 'LINE':
        # Polygon coordinates are fractions of the page size, so scale them to pixels
        polygon_coords = [
            (point['X']*width, point['Y']*height)
            for point in block['Geometry']['Polygon']
        ]
        draw.polygon(polygon_coords, outline='red', width=3)
page_image.save(f'{artefact_path}/cpm03-page1-annotated.jpg')
print(f'Annotated image saved to {artefact_path} folder')

Annotated image saved to output_artifacts folder


## Building an OCR pipeline.

We can now write a simple function that takes a PDF file from the archives, performs OCR on it, and collects all the text into a raw text file after removing watermarks and similar boilerplate. We will also simultaneously store the output of the Textract API as a JSON object in S3 so that it can be referenced later.
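
The text-collection step can be sketched in isolation. The snippet below keeps only `LINE` blocks and drops lines that contain a watermark marker. The marker strings mirror the ones checked in `convert_file`; the helper name `blocks_to_text` and the sample blocks are illustrative assumptions.

```python
WATERMARK_MARKERS = ("cpsindia", "centre for policy", "cps-")


def blocks_to_text(blocks, markers=WATERMARK_MARKERS) -> str:
    """Join the text of LINE blocks, skipping lines that look like watermarks."""
    lines = []
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue  # PAGE, WORD, TABLE, ... blocks are ignored here
        text = block.get("Text", "")
        if any(marker in text.lower() for marker in markers):
            continue
        lines.append(text)
    return " ".join(lines)


# Invented blocks, shaped like the entries of a Textract response
sample_blocks = [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "First line"},
    {"BlockType": "LINE", "Text": "www.cpsindia.org"},
    {"BlockType": "WORD", "Text": "First"},
    {"BlockType": "LINE", "Text": "Second line"},
]
print(blocks_to_text(sample_blocks))
```

Substring matching is deliberately crude: a genuine line that happens to contain one of the markers is dropped as well, so the marker list is worth reviewing per collection.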

In [63]:
import json

def convert_file(file_name: str, configurations: BasicConfigurations, output_key_prefix: str):
    """
    Calls the Textract API and stores the output in S3
    """
    # Create clients
    s3_client = boto3.client('s3')
    textract = boto3.client('textract', region_name='ap-south-1')
    # Start the asynchronous analysis job
    response = textract.start_document_analysis(
        DocumentLocation={
            'S3Object': {
                'Bucket': configurations.BucketName,
                'Name': file_name
            }
        },
        FeatureTypes=['TABLES']
    )
    job_id = response['JobId']
    # Poll the job status until the job is no longer running
    while True:
        job_status = textract.get_document_analysis(JobId=job_id)
        status = job_status['JobStatus']
        if status != 'IN_PROGRESS':
            break
        print(f"Job status: {status}. Waiting...")
        time.sleep(5)  # Wait for 5 seconds before polling again
    if status == 'FAILED':
        raise RuntimeError(f'Textract analysis failed for {file_name}')
    # Collect the blocks from every page of the paginated response
    next_token = None
    blocks = []
    while True:
        if next_token:
            response = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
        else:
            response = textract.get_document_analysis(JobId=job_id)
        blocks.extend(response['Blocks'])
        if 'NextToken' in response:
            next_token = response['NextToken']
        else:
            break
    # Write all the blocks to S3 as a JSON file (str(blocks) would not be valid JSON)
    output_key = f'extraction_detal/{output_key_prefix}/apiResponse.json'
    s3_client.put_object(
        Bucket=configurations.BucketName,
        Key=output_key,
        Body=json.dumps(blocks).encode('utf-8')
    )
    # Append the text of every LINE block, separated by spaces, skipping watermark lines
    raw_text = ''
    for block in blocks:
        if block['BlockType'] == 'LINE':
            text = block['Text']
            lowered = text.lower()
            if 'cpsindia' in lowered or 'centre for policy' in lowered or 'cps-' in lowered:
                continue
            raw_text += text + ' '
    # Write the raw text to S3
    output_key = f'extraction_detal/{output_key_prefix}/rawText.txt'
    s3_client.put_object(
        Bucket=configurations.BucketName,
        Key=output_key,
        Body=raw_text.encode('utf-8')
    )
    print(f'Finished processing {file_name} and wrote the output to S3')


In [None]:
start = time.time()
convert_file('BV-12-British-Rule-in-South-India.pdf', basic_configurations, 'bv12')
end = time.time()
print(f'Time Taken: {end-start} seconds')

## Text Summarization
We can now perform the actual task of text summarization. We will use the OpenAI API, send in the entire text as a single chunk, and read back the response. This is a risky approach, but worth a try.
We will write a function that takes the text from an OCR output and inserts it directly into a prompt, which is then sent to the OpenAI chat completions API.
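
One reason sending the whole document at once can be risky is that the text may not fit into the model's context window. A rough pre-flight check is sketched below. It assumes about four characters per token for English text (a rule of thumb, not a tokenizer count) and a 128,000-token context window; both numbers, the output reserve, and the helper names are assumptions to adjust for the model actually used.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate based on character count (assumed ratio)."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(text: str,
                    context_window: int = 128_000,     # assumed model limit
                    reserved_for_output: int = 2_000   # assumed room for the reply
                    ) -> bool:
    """Return True if the estimated prompt size leaves room for the reply."""
    return estimate_tokens(text) + reserved_for_output <= context_window


sample = "word " * 1000  # 5,000 characters of stand-in text
print(estimate_tokens(sample), fits_in_context(sample))
```

If the check fails, the text would need to be split and summarized in parts rather than sent in one request.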

In [55]:
import json

def summarize_text_without_chunking(text_key: str, configurations: BasicConfigurations, temperature: float = 0.1):
    """
    Summarizes the text from the S3 object
    """
    # Create an s3 client
    s3_client = boto3.client('s3')
    # Create a secrets manager client in ap-south-1 using boto3
    secrets_manager = boto3.client('secretsmanager', region_name='ap-south-1')
    # Get the OpenAI API key from the secrets manager
    secret = secrets_manager.get_secret_value(SecretId='openai/key')
    # The secret is a JSON string; parse it (instead of using eval) and read the api_key field
    api_key = json.loads(secret['SecretString'])['api_key']
    client = OpenAI(api_key=api_key)
    # download the file rawText.txt from the S3 bucket and store the text in the file in a variable raw_text
    response = s3_client.get_object(Bucket=configurations.BucketName, Key=f'extraction_detal/{text_key}/rawText.txt')
    raw_text = response['Body'].read().decode('utf-8')
    # Now call the completions API
    
    try:
        print("Sending request to GPT-4...")
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": """You are a helpful assistant that summarizes text. The text is produced by an OCR system on old documents. 
                 There will be some errors and the text may not be perfect. Take that into account when summarizing the text. Use ONLY the text provided and do not add context from outside. Make sure that the summary is at most 400 words long and at least 300 words long. 
                 I do not want to exceed these limits. Touch upon all the important points in the text.
                 In your response provide the output as a JSON object with the key 'summary' and the value as the summarized text. Use the key 'locations' to provide a list of locations referenced in the text.
                 Use the key 'people' to provide a list of people referenced in the text, I mean people with actual names like John Jopkins, Shivaji. Use the key 'organizations' to provide a list of organizations referenced in the text.
                 Use the key 'groups' to provide a list of caste names referenced in the text such as Brahmins, Paraiyar (try to get a minimum of 5). In enlisting be as expansive as you can. I should be able to convert this to a python dictionary.
                 """},
                {"role": "user", "content": f"Summarize the following text: {raw_text}"}
            ],
            temperature=temperature,
        )
        summary = response.choices[0].message.content
        output_key = f'extraction_detal/{text_key}/summaries.json'
        s3_client.put_object(
                Bucket=configurations.BucketName,
                Key=output_key,
                Body=str(summary).encode('utf-8')
        )
        return summary
    except Exception as e:
        print(f"Error: {e}")
        # `response` may still hold the S3 download here, so signal failure explicitly
        return None

In [56]:
# Call the API
x = summarize_text_without_chunking('bv12', basic_configurations)

Sending request to GPT-4...
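
The prompt asks for a JSON object, but the function returns the model's reply as a plain string, and chat models sometimes wrap JSON in a Markdown code fence. A tolerant parser might look like the sketch below; `parse_summary_response` is a hypothetical helper, not part of the OpenAI SDK, and it only handles the fenced and unfenced cases.

```python
import json
import re

# Three backticks, built programmatically so this example stays easy to embed
FENCE = "`" * 3


def parse_summary_response(reply: str) -> dict:
    """Parse the model reply as JSON, tolerating a surrounding code fence."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence (with an optional language tag) and the closing fence
        text = re.sub(r"^" + FENCE + r"[a-zA-Z]*\s*", "", text)
        text = re.sub(r"\s*" + FENCE + r"$", "", text)
    return json.loads(text)


fenced_reply = FENCE + 'json\n{"summary": "A short summary.", "people": []}\n' + FENCE
print(parse_summary_response(fenced_reply)["summary"])
```

`json.loads` raises `json.JSONDecodeError` if the reply is still not valid JSON, which is a useful signal that the request should be retried or inspected by hand.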


## Interacting with documents
Here, we design a simple system to chunk the raw text, store it in a vector database and use this to perform Retrieval Augmented Generation for answering questions about the document itself.
1. `chunk_text` - splits the raw text of the document into fixed-size chunks with a configurable overlap, so that two adjacent chunks share some amount of text.
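
With a chunk size of 500 characters and an overlap of 100, each new chunk starts 400 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next. The snippet below illustrates this on synthetic text with a standalone sliding-window helper; `sliding_chunks` is a hypothetical name used only for this illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> list:
    """Split text into windows of chunk_size that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # distance between the starts of adjacent chunks
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# 1,200 characters of synthetic text: windows start at 0, 400 and 800
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(text)
print(len(chunks))
# The tail of the first chunk is the head of the second
print(chunks[0][-100:] == chunks[1][:100])
```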

In [66]:
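def chunk_text(text, chunk_size=500, overlap=100):
    # An overlap as large as the chunk size would stop the window from ever advancing
    if not 0 <= overlap < chunk_size:
        raise ValueError('overlap must be non-negative and smaller than chunk_size')
    text = text.replace('\n', ' ')
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # Overlap for continuity
    return chunks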

def create_and_populate_vector_data_base(doc_key:str, configurations: BasicConfigurations):
    """
    Creates and populates a vector database while using the OpenAI embedings API.
    We just specify the document key and the function will take care of the rest.
    """
    # Create an S3 client
    s3_client = boto3.client('s3')
    # Download rawText from S3
    response = s3_client.get_object(Bucket=configurations.BucketName, Key=f'extraction_detal/{doc_key}/rawText.txt')
    raw_text = response['Body'].read().decode('utf-8')
    # chunk the text
    chunks = chunk_text(raw_text)
    # Create an OpenAI client
    import json  # local import so that this cell runs on its own
    secrets_manager = boto3.client('secretsmanager', region_name='ap-south-1')
    # Get the OpenAI API key from the secrets manager
    secret = secrets_manager.get_secret_value(SecretId='openai/key')
    # The secret is a JSON string; parse it (instead of using eval) and read the api_key field
    api_key = json.loads(secret['SecretString'])['api_key']
    client = OpenAI(api_key=api_key)
    # Create a vector database
    dimension = 1536
    index = faiss.IndexFlatL2(dimension)
    # Populate the vector database
    for i, chunk in enumerate(chunks):
        if i % 10 == 0:
            print(f'Processing chunk {i} of {len(chunks)}')
        response = client.embeddings.create(
            input=[chunk],
            model="text-embedding-ada-002",
        )
        vector = response.data[0].embedding
        index.add(np.array([vector]).astype('float32'))
    print(f'Finished creating and populating the vector database for {doc_key}')
    return index

In [None]:
m = create_and_populate_vector_data_base('cpm03', basic_configurations)
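
`faiss.IndexFlatL2` performs an exact (brute-force) search and reports squared Euclidean distances. The pure-Python sketch below mimics that behaviour on toy 2-D vectors to show what the distances and indices returned by a search represent. `brute_force_l2_search` is a hypothetical helper for illustration only, not a FAISS API, and it is far slower than FAISS on real data.

```python
def brute_force_l2_search(vectors, query, k):
    """Return (distances, indices) of the k vectors closest to the query."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # closest first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Three toy "embeddings"; position in the list plays the role of the chunk index
vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = brute_force_l2_search(vectors, [0.0, 0.0], k=2)
print(distances, indices)
```

Because vectors are added to the index in chunk order, the returned positions can be mapped straight back to chunk numbers.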

In [90]:
# Create a chunk map of the raw text
def create_chunk_map(doc_key:str, configurations: BasicConfigurations):
    # Create an S3 client
    s3_client = boto3.client('s3')
    # Download rawText from S3
    response = s3_client.get_object(Bucket=configurations.BucketName, Key=f'extraction_detal/{doc_key}/rawText.txt')
    raw_text = response['Body'].read().decode('utf-8')
    # chunk the text
    chunks = chunk_text(raw_text)
    chunk_map = {i: chunk for i, chunk in enumerate(chunks)}
    return chunk_map

def search_vector_database(query:str, db, chunk_map, k=20):
    """
    Searches the vector database for the most similar vectors to the query
    """
    # Create an OpenAI client
    import json  # local import so that this cell runs on its own
    secrets_manager = boto3.client('secretsmanager', region_name='ap-south-1')
    # Get the OpenAI API key from the secrets manager
    secret = secrets_manager.get_secret_value(SecretId='openai/key')
    # The secret is a JSON string; parse it (instead of using eval) and read the api_key field
    api_key = json.loads(secret['SecretString'])['api_key']
    client = OpenAI(api_key=api_key)
    # Embed the query
    response = client.embeddings.create(
        input=[query],
        model="text-embedding-ada-002",
    )
    vector = response.data[0].embedding
    query_embedding = np.array([vector]).astype('float32')
    # Search the index; the distances are not needed here
    _, indices = db.search(query_embedding, k)
    # FAISS pads the result with -1 when the index holds fewer than k vectors
    return [chunk_map[idx] for idx in indices[0] if idx != -1]

def answer_document_question(doc_key: str,
                             question: str,
                             configurations: BasicConfigurations, 
                             db):
    """
    Answers a question from the document
    """
    # Create an S3 client
    s3_client = boto3.client('s3')
    # Read in the Raw Summary from the document
    response = s3_client.get_object(Bucket=configurations.BucketName, Key=f'extraction_detal/{doc_key}/summaries.json')
    # Store the response as a string
    broad_context = response['Body'].read().decode('utf-8')
    # Build the chunk map (index position -> chunk text)
    chunk_map = create_chunk_map(doc_key, configurations)
    # Search the vector database
    relevant_chunks = search_vector_database(question, db, chunk_map)
    # Concatenate the chunks
    specific_context = ' '.join(relevant_chunks)
    # Master Content
    master_content = f"""
    You are a helpful assistant that will answer some questions given a 
    broad context and a specific context. The broad context given as <<BROAD CONTEXT>>
    is a summary of the document from which you are answering questions. The specific context 
    given as <<SPECIFIC CONTEXT>> are specific chunks of text related to the question
    from the document. The question is specified as <<QUESTION>>. You should answer the 
    question using information only from the broad and specific context. You should not
    provide information from outside the context. You should provide the answer as <<ANSWER>> followed
    by the actual answer and your level of confidence as <<CONFIDENCE>> which can be HIGH, MEDIUM or LOW.
    """
    question_prompt = f"""
    <<QUESTION>>: {question}
    <<BROAD CONTEXT>>: {broad_context}
    <<SPECIFIC CONTEXT>>: {specific_context}
    """
    # Get the OpenAI API key from the secrets manager
    import json  # local import so that this cell runs on its own
    secrets_manager = boto3.client('secretsmanager', region_name='ap-south-1')
    secret = secrets_manager.get_secret_value(SecretId='openai/key')
    # The secret is a JSON string; parse it (instead of using eval) and read the api_key field
    api_key = json.loads(secret['SecretString'])['api_key']
    client = OpenAI(api_key=api_key)
    response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": master_content},
                {"role": "user", "content": question_prompt}
            ],
            temperature=0.2,
        )
    answer = response.choices[0].message.content
    return answer
    


In [99]:
answer_document_question('cpm03', 'What are these agitations about?', 
                         basic_configurations, m)

'<<ANSWER>>: The agitations discussed are about the anti-kine-killing movement, which initially started as a religious and peaceful advocacy for the protection of cows but later escalated into significant civil unrest. This unrest involved coercion and violence, particularly from itinerant propagandists from outside the province. The agitation led to tensions between different community groups, prompting the need for stringent administrative actions to manage and prevent further escalation of communal unrest.\n\n<<CONFIDENCE>>: HIGH'

In [100]:
answer_document_question('cpm03', 'Name a prominent person who participated in these agitations. Just one is enough', 
                         basic_configurations, m)

'<<ANSWER>>: Gopalanand Swami\n\n<<CONFIDENCE>>: HIGH'

In [101]:
answer_document_question('cpm03', 'What role did Gopalanad Swami Play?', 
                         basic_configurations, m)

'<<ANSWER>>: Gopalanand Swami played a significant role as an organizer of Sabhas and a preacher of their doctrines. He was involved in the anti-kine-killing agitation, where his preaching led to significant unrest and eventually to his arrest and imprisonment for two years. He was known for his ability to organize and promote the doctrines of the Sabhas, which contributed to the spread of the movement and the associated disturbances.\n\n<<CONFIDENCE>>: HIGH'

In [102]:
answer_document_question('cpm03', 'What is the approximate time frame and locations of these agitations?', 
                         basic_configurations, m)

'<<ANSWER>>: The approximate time frame of the agitations is primarily around the late 19th century, specifically highlighted around the years 1891, 1892, and 1893. The locations of these agitations include various districts and towns across Bengal and the surrounding areas, such as Gaya, Patna, Darbhanga, Madhubani, Shahabad, Muzaffarpur, and others mentioned in the broad context.\n\n<<CONFIDENCE>>: HIGH'

In [103]:
answer_document_question('cpm03', 'Did any muslims participate in these agitations?', 
                         basic_configurations, m)

'<<ANSWER>>: Yes, some Muslims did participate in the agitations.\n\n<<CONFIDENCE>>: HIGH'

In [104]:
answer_document_question('cpm03', 'Name a few muslims who participated in these agitations and their role?', 
                         basic_configurations, m)

"<<ANSWER>>: The Muslims who participated in the agitations include Moulvi Maniralam, who was described as a rabid anti-Englishman and joined the Hindus in other agitations. Another participant was a Maulvi from the district of Azamgarh, who was involved in a discussion at a fair and was confounded by Pandit Jagat Narain's responses. Additionally, four Muslims named Mohammed Ali, Race, Tiktiq Ali Bakhah of Dumraon, Ali Mohammad of Buzar, and Khuda Bakbah of Balia renounced the use of flesh following the discomfiture of the Maulvi from Azamgarh.\n\n<<CONFIDENCE>>: HIGH"

In [105]:
answer_document_question('cpm03', 'What was the role of Muslims in this agitation?', 
                         basic_configurations, m)

'<<ANSWER>>: The role of Muslims in the agitation was primarily characterized by their involvement in counter-agitations and forming vigilance committees to stay informed and possibly coordinate with Europeans. They were also subjects of rumors and misinformation, which alarmed them and potentially incited further unrest. There were instances where Muslims were reported to be forming groups to enforce their interests, particularly in response to the Hindu-led anti-kine-killing agitation. Additionally, there were mentions of Muslims being encouraged to contribute to the support of additional police due to local tensions, and there were fears among them that the government might not be able to protect them adequately.\n\n<<CONFIDENCE>>: HIGH'

In [106]:
answer_document_question('cpm03', 'Did any Muslims support the Hindus at all?', 
                         basic_configurations, m)

'<<ANSWER>>: Yes, some Muslims did support the Hindus in their agitations. The specific context mentions Moulvi Maniralam, a Muhammadan who is described as a rabid anti-Englishman, joining the Hindus in other agitations.\n\n<<CONFIDENCE>>: HIGH'

In [107]:
answer_document_question('cpm03', 'What castes among Hindus were most active in these agitations?', 
                         basic_configurations, m)

'<<ANSWER>>: The castes among Hindus that were most active in these agitations were the Bunniahs and Kshatriyas (Kniths).\n\n<<CONFIDENCE>>: HIGH'

In [108]:
answer_document_question('cpm03', 'Were Brahmins not active as Banias and Kshatriyas?', 
                         basic_configurations, m)

'<<ANSWER>>: Yes, Brahmins were active as Banias and Kshatriyas. The specific context mentions various activities and roles taken by Brahmins, such as being appointed as accountants and curators of Gaushalas, participating in agitations, and preaching against kine-killing. This indicates their involvement in activities typically associated with the Bania (trader) and Kshatriya (warrior) roles, such as managing funds and leading social movements.\n\n<<CONFIDENCE>>: HIGH'

In [110]:
answer_document_question('cpm03', 'What happened in the Brahmapur Fair?', 
                         basic_configurations, m)

'<<ANSWER>>: At the Brahmapur Fair held on 25th April 1839, a group of preachers set up their tent and began preaching. Notable attendees included Pandit Jagat Narain, Pandit Kishori Lal, Pandit Har Narain, Pandit Mahabir Pershad, and others from various locations such as Benares and Arrah. The fair was significant for the anti-kine-killing movement, as the lives of several cows and bullocks were saved from being sold to butchers. Donations were collected to support the cause, and the event proceeded peacefully under the supervision of a Joint-Magistrate. However, a previous fair in April 1891 at Berhampore saw a violent incident where a large mob of armed Hindus attacked butchers, leading to police intervention and the arrest of a key agitator, Gopalanand Swami.\n\n<<CONFIDENCE>>: HIGH'
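
Because every reply follows the `<<ANSWER>>` / `<<CONFIDENCE>>` convention requested in the prompt, it can be split into fields for downstream use. The sketch below uses a hypothetical helper `parse_answer`. It assumes the model kept to the format and returns `None` for any part that is missing.

```python
import re


def parse_answer(reply: str) -> dict:
    """Split a reply into its answer text and its confidence label."""
    # Everything between the <<ANSWER>> tag and the <<CONFIDENCE>> tag (or the end)
    answer = re.search(r"<<ANSWER>>:?\s*(.*?)(?=<<CONFIDENCE>>|\Z)", reply, re.DOTALL)
    confidence = re.search(r"<<CONFIDENCE>>:?\s*(HIGH|MEDIUM|LOW)", reply)
    return {
        "answer": answer.group(1).strip() if answer else None,
        "confidence": confidence.group(1) if confidence else None,
    }


# One of the replies shown above
reply = '<<ANSWER>>: Gopalanand Swami\n\n<<CONFIDENCE>>: HIGH'
print(parse_answer(reply))
```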