Assignment 1

# Summarization of a large document

* Data source: Arxiv papers for math/AI/physics (5 papers each)

* File path: '../data_source/arxiv.org/...'

In [48]:
import os
from dotenv import load_dotenv
load_dotenv()

True

## Task 1: Data Preparation

* Use Azure Form Recognizer to analyze a PDF file; it returns the contents of each page.

* Save the page contents into a JSON file for later processing.

#### Read the PDF file using Form Recognizer

In [50]:
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient
# import os
# from dotenv import load_dotenv
# load_dotenv()
document_analysis_client = DocumentAnalysisClient(
    endpoint=os.environ["AZURE_FORM_RECOGNIZER_ENDPOINT"], 
    credential=AzureKeyCredential(os.environ["AZURE_FORM_RECOGNIZER_KEY"]))

# pdf_file = './data_source/arxiv.org/AI/2311.05227.pdf'
pdf_file = './data_source/arxiv.org/math/2303.17103.pdf'
with open(pdf_file, "rb") as f:
    poller = document_analysis_client.begin_analyze_document(
        "prebuilt-read", document=f
    )
# Wait for the long-running analysis to finish and fetch the result
result = poller.result()

In [None]:
# result.pages

#### Put page contents into a JSON structure

In [52]:
page_content = []
for page in result.pages:
    # Rebuild each line from its words, then join the lines into one string per page
    all_lines_content = [
        ' '.join(word.content for word in line.get_words())
        for line in page.lines
    ]
    page_content.append({
        'filename': pdf_file,
        'page_number': page.page_number,
        'page_content': ' '.join(all_lines_content),
    })

In [None]:
# page_content

#### Write to a JSON file so that we can summarize it later.

In [53]:
import json

# Save the JSON data to a file so that we do not need to call Form Recognizer again
# Specify the output file path
output_file_path = pdf_file + '_output.json'

# Save the JSON data to a file
with open(output_file_path, 'w') as json_file:
    json.dump(page_content, json_file, indent=4)

print(f"JSON data has been saved to {output_file_path}")

JSON data has been saved to ./data_source/arxiv.org/math/2303.17103.pdf_output.json
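Since the point of saving the JSON file is to avoid calling Form Recognizer a second time, the check can be made explicit. The following is a minimal sketch, not part of the original notebook: `load_or_analyze` and `fake_analyze` are hypothetical names, and `fake_analyze` stands in for the real Form Recognizer call so that the example runs without Azure credentials. It assumes the same `<pdf path>_output.json` naming used above.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def load_or_analyze(pdf_path: str,
                    analyze: Callable[[str], List[Dict]]) -> List[Dict]:
    """Return cached page records if the JSON file exists; otherwise analyze and cache."""
    cache_path = pdf_path + '_output.json'  # same naming scheme as the cell above
    if os.path.exists(cache_path):
        with open(cache_path, 'r') as cache_file:
            return json.load(cache_file)
    pages = analyze(pdf_path)
    with open(cache_path, 'w') as cache_file:
        json.dump(pages, cache_file, indent=4)
    return pages


# Hypothetical stand-in for the Form Recognizer call; it records every invocation.
calls = []


def fake_analyze(path: str) -> List[Dict]:
    calls.append(path)
    return [{'filename': path, 'page_number': 1, 'page_content': 'hello world'}]


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_pdf = os.path.join(tmp_dir, 'demo.pdf')
    first = load_or_analyze(demo_pdf, fake_analyze)   # runs the analyzer and writes the cache
    second = load_or_analyze(demo_pdf, fake_analyze)  # served from the JSON cache

print(len(calls))  # the analyzer ran only once
```

In the notebook, the `analyze` argument would wrap the `begin_analyze_document` call and the page-content loop shown earlier.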


## Task 2: Document Summarization

* Read the document (the previously saved JSON file).

* Use Azure OpenAI 'completion' to summarize the document.

* Summary (append): summarize the first page, then summarize the first-page summary together with the second page's content, then the previous summary together with the third page's content, and so on until all pages are done.
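The summary-append idea can be sketched independently of any model call. This is an illustrative sketch only: `rolling_summary` and `fake_summarize` are hypothetical names, and `fake_summarize` is a toy stand-in for the Azure OpenAI call (it just keeps the first word of each page) so that the control flow can be run and checked offline.

```python
from typing import Callable, Iterable


def rolling_summary(pages: Iterable[str],
                    summarize: Callable[[str, str], str]) -> str:
    """Fold pages into one summary: each step sees the previous summary plus one new page."""
    summary = ""  # nothing has been summarized yet
    for page_text in pages:
        summary = summarize(summary, page_text)
    return summary


# Hypothetical stand-in for the model call: keeps the first word of every page seen so far.
def fake_summarize(previous_summary: str, new_content: str) -> str:
    first_word = new_content.split()[0]
    return (previous_summary + " " + first_word).strip()


demo_pages = ["alpha beta gamma", "delta epsilon", "zeta eta theta"]
demo_summary = rolling_summary(demo_pages, fake_summarize)
print(demo_summary)  # alpha delta zeta
```

Because only the running summary and one page are passed at each step, the prompt size stays roughly bounded by one page plus one summary, regardless of how long the document is.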

#### Use Azure OpenAI 'Completion' to summarize content

In [5]:
# Note: The openai-python library support for Azure OpenAI is in preview. 
# This version is not supported in ChatCompletion.
# import os
import openai

openai.api_type = "azure"
# openai.api_version = "2023-07-01-preview"
openai.api_version = "2023-09-15-preview"
API_KEY = os.getenv("OPENAI_API_KEY","").strip()
assert API_KEY, "ERROR: Azure OpenAI Key is missing"
openai.api_key = API_KEY
RESOURCE_ENDPOINT = os.getenv("OPENAI_API_ENDPOINT","").strip()
assert RESOURCE_ENDPOINT, "ERROR: Azure OpenAI Endpoint is missing"
assert "openai.azure.com" in RESOURCE_ENDPOINT.lower(), "ERROR: Azure OpenAI Endpoint should be in the form: \n\n\t<your unique endpoint identifier>.openai.azure.com"
openai.api_base = RESOURCE_ENDPOINT

COMPLETIONS_MODEL = os.getenv('DEPLOYMENT_NAME')

Try different prompts ...

In [35]:
def openai_completion_summarization(previous_summary, new_content):
    # Construct prompt
    # prompt_text = (
    #     'Provide a summary of the contents below. Note that your summary should consider:\n'
    #     'Summary based on previous content:\n ' +
    #     previous_summary + ' \n' +
    #     'and the new content: \n' + 
    #     new_content
    # )

    prompt_text = 'Provide a summary of the contents below.\n' + \
        previous_summary + ' \n ' + new_content    

    # prompt_text = 'Provide a summary of the text below that captures its main idea.\n\n' + \
    #     previous_summary + ' \n ' + new_content
    
    debug = False
    if debug: print("prompt_text:", prompt_text)

    response = openai.Completion.create(
        engine=COMPLETIONS_MODEL,
        prompt=prompt_text,
        temperature=0,
        max_tokens=2000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        best_of=1,
        stop=None)

    if debug: print(f"\nOpenAI completion summary: [{repr(response['choices'][0]['text'])}]")
    return response['choices'][0]['text']

# previous_summary = ''
# new_content = "At Microsoft, we have been on a quest to advance AI beyond existing techniques, by taking a more holistic, human-centric approach to learning and understanding. As Chief Technology Officer of Azure AI Services, I have been working with a team of amazing scientists and engineers to turn this quest into a reality. In my role, I enjoy a unique perspective in viewing the relationship among three attributes of human cognition: monolingual text (X), audio or visual sensory signals, (Y) and multilingual (Z). At the intersection of all three, there’s magic—what we call XYZ-code as illustrated in Figure 1—a joint representation to create more powerful AI that can speak, hear, see, and understand humans better. We believe XYZ-code will enable us to fulfill our long-term vision: cross-domain transfer learning, spanning modalities and languages. The goal is to have pre-trained models that can jointly learn representations to support a broad range of downstream AI tasks, much in the way humans do today. Over the past five years, we have achieved human performance on benchmarks in conversational speech recognition, machine translation, conversational question answering, machine reading comprehension, and image captioning. These five breakthroughs provided us with strong signals toward our more ambitious aspiration to produce a leap in AI capabilities, achieving multi-sensory and multilingual learning that is closer in line with how humans learn and understand. I believe the joint XYZ-code is a foundational component of this aspiration, if grounded with external knowledge sources in the downstream AI tasks."
# result = openai_completion_summarization(previous_summary, new_content)
# print(result)
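When trying different prompts, it may be easier to keep the variants side by side than to comment them in and out. The sketch below is one possible arrangement, not the notebook's own code: `PROMPT_TEMPLATES` and `build_prompt` are hypothetical names, and the three templates reproduce the prompt strings from the cell above.

```python
# Prompt variants taken from the cell above; the dictionary keys are made-up labels.
PROMPT_TEMPLATES = {
    "plain": "Provide a summary of the contents below.\n{previous} \n {new}",
    "main_idea": (
        "Provide a summary of the text below that captures its main idea.\n\n"
        "{previous} \n {new}"
    ),
    "labelled": (
        "Provide a summary of the contents below. Note that your summary should consider:\n"
        "Summary based on previous content:\n {previous} \n"
        "and the new content: \n{new}"
    ),
}


def build_prompt(variant: str, previous_summary: str, new_content: str) -> str:
    """Fill the chosen template with the running summary and the new page text."""
    template = PROMPT_TEMPLATES[variant]
    # Only the template is parsed for placeholders, so braces in the page text are safe.
    return template.format(previous=previous_summary, new=new_content)


demo_prompt = build_prompt("plain", "Earlier summary.", "New page text.")
print(demo_prompt)
```

With this in place, switching prompts is a one-word change in the call, which makes it easier to compare the summaries each variant produces.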


#### Read the output JSON file and summarize it using the summary-append approach

In [30]:
import json

def load_page_content_from_json_file(json_file_path):
    with open(json_file_path, 'r') as file:
        json_data = json.load(file)
    return json_data


Reading the paper in reverse order seems to work better: the reference pages are most likely at the end of the article, and the most important content is most likely at the beginning, so the final summarization steps see the most important pages last. Alternatively, we could use a Custom Classification Model to find the content pages and use that to guide the summarization toward them.
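As a rough illustration of separating content pages from reference pages, the sketch below uses a crude keyword and citation-marker heuristic. This is an assumption made for illustration and is not the Custom Classification Model mentioned above; `is_reference_page`, `content_pages_in_reverse`, and the `min_markers` threshold are hypothetical, and the heuristic will misclassify body pages that cite heavily.

```python
import re
from typing import Dict, List


def is_reference_page(page_text: str, min_markers: int = 8) -> bool:
    """Crude guess: the page opens with a references heading or is dense in [n] markers."""
    starts_with_heading = bool(
        re.match(r'\s*(references|bibliography)\b', page_text, re.IGNORECASE)
    )
    marker_count = len(re.findall(r'\[\d+\]', page_text))
    return starts_with_heading or marker_count >= min_markers


def content_pages_in_reverse(pages: List[Dict]) -> List[Dict]:
    """Drop pages flagged as references, then return the rest from last to first."""
    kept = [p for p in pages if not is_reference_page(p.get('page_content', ''))]
    return list(reversed(kept))


# Made-up page records in the same shape as the saved JSON file.
demo_pages = [
    {'page_number': 1, 'page_content': 'Abstract. We study a problem [1].'},
    {'page_number': 2, 'page_content': 'Main results follow from [2] and [3].'},
    {'page_number': 3, 'page_content': 'References [1] A. Author. [2] B. Author. [3] C. Author.'},
]
order = [p['page_number'] for p in content_pages_in_reverse(demo_pages)]
print(order)  # [2, 1]
```

A trained classifier would replace `is_reference_page`; the rest of the loop would stay the same.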

In [54]:
import time
# Specify the JSON file path
# json_file_path = './data_source/arxiv.org/AI/2311.05227.pdf_output.json'
json_file_path = './data_source/arxiv.org/math/2303.17103.pdf_output.json'

# Load the page contents from the JSON file
page_content_in_json = load_page_content_from_json_file(json_file_path)

debug = False

# Loop through the pages in reverse order, folding each page into the running summary
previous_summary_text = 'None'
for page in reversed(page_content_in_json):
    new_page_number = page.get('page_number', '')
    new_page_content = page.get('page_content', '')
    if debug:
        print("Page number: ", new_page_number)

    # print("Page content: ", new_page_content)
    summary_result = openai_completion_summarization(previous_summary_text, new_page_content)
    previous_summary_text = summary_result.split('\n\n', 1)[-1]
    # time.sleep(2)
    if debug:
        print("Summary: ", previous_summary_text)
        print("=" * 50)  # Separating page contents for better readability

final_summary = [{"File name": json_file_path,
                  "Summary": previous_summary_text}]


In [55]:
final_summary

[{'File name': './data_source/arxiv.org/math/2303.17103.pdf_output.json',
  'Summary': 'The article discusses the correlation between UFO sightings and meteor showers, and the use of scientific tools and data analysis in researching this topic. It also explores the intersection of science and popular culture in the study of UFO sightings and meteor showers. The article includes a graph showing the distribution of UFO sightings by year and a map showing the concentration of sightings during meteor showers in different regions of the world. It also mentions the use of parameters such as shape and duration in studying UFO sightings and discusses the occurrence of "balloon" incidents in February 2023. The article concludes by summarizing the methods used in the study, including the analysis of over 80,000 UFO sightings and their correlation with reports of high-altitude balloons and meteor showers. The study suggests that there may be a transport pipeline for alien craft from interplanetar

#### Save the final summary to a JSON file

In [56]:
import json

# Save the final summary to a JSON file next to the page-content file
# Derive the summary path by replacing the '_output.json' suffix with '_summary.json'
suffix = '_output.json'
if json_file_path.endswith(suffix):
    base_path = json_file_path[:-len(suffix)]
else:
    # Fall back to the full path so that no characters are cut off by mistake
    base_path = json_file_path
summary_file_path = base_path + '_summary.json'

# Save the JSON data to a file
with open(summary_file_path, 'w') as json_file:
    json.dump(final_summary, json_file, indent=4)

print(f"JSON data has been saved to {summary_file_path}")

JSON data has been saved to ./data_source/arxiv.org/math/2303.17103.pdf_summary.json
