# Summarize the Arxiv papers

 - ☑️ Search for `cs.AI`
 - ☑️ Download PDF. Limit to 10/day for last 3 years _(so far have 3/day done as downloads are taking a very long time)_
 - ☑️ Convert PDF to Markdown text 
 - ✔️ Send to gpt-4o-mini for high quality summary
 - ✔️ Send it via OpenAI's parallel batch script (from their cookbook) that takes care of request/token rate limits and retries
 - ✔️ Deal with some problem/exception related to disallowed tokens coming through.
 - ✔️ Add ID to the Markdown source so the batch result can be tied back to its source.

## Performance gotchas

 - Running them serially: `gpt-4o-mini` takes 6 minutes for 30 summarizations _(I have 3000 to go though)_. This is roughly 12 s per summary.
 - Going to see if parallel invocation while staying under the rate limit is possible.
   - Had to fix their code to add `allowed_special` to the encoder
   - Send tokenizer name specific to gpt-4o-mini
   - Constantly running into TPM limits but it keeps chugging along
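
A back-of-the-envelope estimate for the numbers above, as a minimal sketch. The 12 s per summary and the 3000-paper backlog come from these notes, and the 200,000 TPM / 500 RPM limits are the ones listed in the scratchpad section. `tokens_per_request` is a made-up placeholder (the average paper size is not measured anywhere in these notes), so the parallel figure is illustrative only.

```python
def serial_hours(n_docs: int, seconds_per_doc: float) -> float:
    """Wall-clock hours if requests run one after another."""
    return n_docs * seconds_per_doc / 3600


def rate_limited_minutes(n_docs: int, tokens_per_request: int,
                         tokens_per_minute: int, requests_per_minute: int) -> float:
    """Lower bound on minutes for a parallel run: whichever limit binds first."""
    by_tokens = n_docs * tokens_per_request / tokens_per_minute
    by_requests = n_docs / requests_per_minute
    return max(by_tokens, by_requests)


# From the notes: 30 summaries in 6 minutes -> 12 s each, 3000 to go.
print(serial_hours(3000, 12))  # 10.0

# ASSUMPTION: 15,000 tokens per request is a hypothetical placeholder.
print(rate_limited_minutes(3000, 15_000, 200_000, 500))  # 225.0
```

With these inputs the token limit, not the request limit, is the binding constraint, which would be consistent with constantly hitting TPM rather than RPM.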

Gemini Flash has much higher token and request limits, but I do not have ready-made parallel-call code for it. I did find some examples in the cookbook, but they had no facility for handling backoffs and retries. I don't want to mess with that yet when I have a working OpenAI setup. I took a 30-file copy so I can continue to work on the topic-extraction and UI parts while the full 3k batch completes on OpenAI: I am expecting it to take 5 hours (9pm completion).
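
For reference, the missing backoff-and-retry piece is small. Below is a minimal sketch of exponential backoff with jitter that is not tied to any SDK. `call_with_backoff` and all of its parameters are hypothetical names, and a real version would catch only the client's rate-limit exception rather than the broad `Exception` default used here.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

Usage would look like `call_with_backoff(lambda: summarize(md_text))`, where `summarize` stands in for whatever function makes the actual API call. The `sleep` parameter is injectable so the wrapper can be exercised without real waiting.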

## Prompt for summary

There are many examples

### Trivial prompts

   - give me the TLDR
   - summarize the document

### Comprehensive prompts

```
```

### ChatGPT generated


<pre>
Generate a prompt I can use to summarize technical arxiv papers.  The papers are technical in nature and contain 
AI subject matter. I want the summaries to be long enough to be used successfully in topic modeling. 
</pre>

↪

<pre>
Prompt:

"Summarize the following arXiv paper in a detailed and structured manner, focusing on its key contributions, 
methodology, experimental results, and conclusions. The summary should be long enough (approximately 
300-500 words) to capture important technical details for topic modeling purposes.

Ensure that the summary includes:

 - Title and Authors – Include the paper’s title and author names.
 - Abstract Summary – Provide a high-level summary of the abstract.
 - Problem Statement – Clearly describe the research problem being addressed and its significance.
 - Methodology – Summarize the approach, techniques, and models used, including any mathematical 
   formulations if applicable.
 - Experiments and Results – Describe the datasets used, evaluation metrics, and key findings, 
   including tables or numerical results if relevant.
 - Conclusions and Implications – Explain the broader impact of the findings and possible 
   future directions.
 - Keywords and Topics – Extract key technical terms and concepts for topic modeling.
 - Ensure clarity and coherence while maintaining technical precision. The summary 
   should avoid excessive jargon where possible while still being informative to researchers 
   familiar with AI and machine learning topics."
</pre>

### Other resources
 - Fabric!

## Notebook setup

In [1]:
# Setup paths to our libs
import os
import sys
from pathlib import Path

lib_path = (Path(os.getcwd()) / "lib").resolve()
sys.path.append(str(lib_path))

# Import jupyter utils
import logging
from util import jupyter_util
from util.jupyter_util import DisplayHTML as jh
from util.jupyter_util import DisplayMarkdown as jm

# Init jupyter env. Set to DEBUG if you want to see the gory details
# of schemas and such.
jupyter_util.setup_logging(logging.WARNING)
jupyter_util.ColabEnv.import_api_keys()

In [20]:
import json

DATA_DIR = Path(os.getcwd()) / "data"
FEED_RAW_DATA_DIR = DATA_DIR / "feed" / "raw"

# Test with the small 3 day one first
#metadata_path = FEED_RAW_DATA_DIR / "Arxiv_csAI_API_dailysampled_3d.json"
metadata_path = FEED_RAW_DATA_DIR / "Arxiv_csAI_API_dailysampled_3y.json"

#----------------------------
# Read the JSON in
arxiv_per_day_mtd = {}
with open( str(metadata_path), 'r') as json_data:
    arxiv_per_day_mtd = json.load(json_data)

logging.debug(f"Finished loading JSON from {str(metadata_path)}.\nHave {len(arxiv_per_day_mtd)} days worth of records")

In [21]:
# Suggested by ChatGPT
# Thought I could use JSON, but even if the model sends JSON back, some of the items
# are not legal JSON. Don't want to chase this down. Will just use a Markdown parser
# to extract the topics and keywords separately
summarization_prompt = """
Summarize the following arXiv paper in a detailed and structured manner, focusing on its key contributions, 
methodology, experimental results, and conclusions. The summary should be long enough (approximately 
300-500 words) to capture important technical details for topic modeling purposes.

Ensure that the summary includes:

 - Title and Authors – Include the paper’s title and author names.
 - Abstract Summary – Provide a high-level summary of the abstract.
 - Problem Statement – Clearly describe the research problem being addressed and its significance.
 - Methodology – Summarize the approach, techniques, and models used, including any mathematical formulations if applicable.
 - Experiments and Results – Describe the datasets used, evaluation metrics, and key findings, including tables or numerical results if relevant.
 - Conclusions and Implications – Explain the broader impact of the findings and possible future directions.
 - Keywords and Topics – Extract key technical terms and concepts for topic modeling.

Ensure clarity and coherence while maintaining technical precision. The summary should avoid excessive jargon where possible while still being informative to researchers familiar with AI and machine learning topics.

The paper is included after the separator as a markdown document.

-----------------------------------------------------------------------------------
"""



In [72]:
import json 
from llm.openai_util import get_completion
from dateutil import parser

PAPER_PDF_DIR     = DATA_DIR / "arxiv" / "pdf"
PAPER_SUMMARY_DIR = DATA_DIR / "arxiv" / "summary_md"
BULK_REQ_DIR      = Path(os.getcwd()) / "oai_bulk_req"

def serially_save_summary(per_day_dict):
    # Loop through and build the path of the extract Markdown files
    for day_str, day_items in per_day_dict.items():
        for day_item in day_items:
            pdf_url      = day_item["pdf_url"]
            day_dir      = parser.parse(day_str).strftime("%m_%d_%Y")
            fname_wo_ext = pdf_url.split("/")[-1]
            pdf_path     = str(PAPER_PDF_DIR / day_dir / f"{fname_wo_ext}.pdf")
            summ_path    = str(PAPER_SUMMARY_DIR / day_dir / f"{fname_wo_ext}.md")
            md_path      = pdf_path.replace("pdf", "md")
            #print(md_path)

            # Ensure output dir exists
            os.makedirs(str(PAPER_SUMMARY_DIR / day_dir), exist_ok=True)

            # Convert
            md_text = Path(md_path).read_text()
            summary = get_completion(f"{summarization_prompt}\n{md_text}")

            # Save to 
            Path(summ_path).write_text(summary)
            logging.debug(f"✔️ - {summ_path}")

def gen_batch_requests_for_summary(per_day_dict, req_file_path):    

    num_req = 0
    with open(req_file_path, "w") as f:

        for day_str, day_items in per_day_dict.items():

            day_dir      = parser.parse(day_str).strftime("%m_%d_%Y")

            for day_item in day_items:
                pdf_url      = day_item["pdf_url"]        
                fname_wo_ext = pdf_url.split("/")[-1]
                pdf_fpath     = str(PAPER_PDF_DIR / day_dir / f"{fname_wo_ext}.pdf")            
                md_fpath      = pdf_fpath.replace("pdf", "md")           
                summ_path     = md_fpath.replace("/md/", "/summary/")
                
                # If summary has already been created, don't add to job.
                if Path(summ_path).exists():
                    logging.debug(f"Summary file {summ_path} is already present. Not adding to creation job")
                    continue

                # Convert                
                md_path = Path(md_fpath)
                if not md_path.exists():
                    # Some papers have errors: the PDF itself is missing from the server
                    # and/or the PDF-to-MD conversion failed
                    continue

                # 👉 I also embed a 
                #   $$MD_PATH$$:{md_fpath}
                # to the end so I can later parse it out and know which MD
                # it came from and tie it back in post
                md_text = Path(md_path).read_text()                
                job = {
                    "model"   : "gpt-4o-mini", 
                    "temperature" : 1,
                    "messages": [{
                        "role" : "user",
                        "content" : f"{summarization_prompt}\n{md_text}\n$$MD_PATH$$:{md_fpath}"
                }]}

                json_string = json.dumps(job)
                num_req+= 1
                f.write(json_string + "\n")
    
    logging.debug(f"Wrote {num_req} OpenAI jobs")


# Generate the bulk requests.
# Run these as follows (see Scratchpad below)
#
# > cd oai_bulk_req
# > python ../bin/api_request_parallel_processor.py --token_encoding_name o200k_base --requests_filepath 3d_summarization_req.jsonl --save_filepath 3d_summarizations.jsonl --request_url https://api.openai.com/v1/chat/completions --max_requests_per_minute 500 --max_tokens_per_minute 200000
gen_batch_requests_for_summary(arxiv_per_day_mtd, 
                               #str(BULK_REQ_DIR / "3d_summarization_req.jsonl")
                               str(BULK_REQ_DIR / "3y_summarization_req.jsonl")
                               )

In [None]:
# The batch execution process dies once in a while. Takes forever so I want to 
# save the outputs so that the batch-request process only generates reqs 
# for the remaining files
import pandas as pd
import re

# Read the generated summaries
# While this is jsonl, the format is strange
# Each line is an array of 2 items
#   [ {job}, {result}]
# So shows up as two unnamed columns in pandas
GENERATED_SUMMARIES_FILE = BULK_REQ_DIR / "3y_summarizations.jsonl"
summary_gen_jsonl_df = pd.read_json(str(GENERATED_SUMMARIES_FILE), lines=True)

for index, row in summary_gen_jsonl_df.iterrows():
    # Get the prompt
    # It has an embedded $$MD_PATH$$:{md_fpath} at the end
    # row[0]: json
    #    model
    #    temperature
    #    messages : [
    #         role :
    #         content:
    #    ]
    prompt = row[0]["messages"][0]["content"]
    m = re.findall(r'^\$\$MD_PATH\$\$:(.*?)$',  prompt, re.MULTILINE)
    if m and len(m) > 0:
        assert(len(m) == 1)
        md_fpath = m[0]
        sum_path = md_fpath.replace('/md/', '/summary/')

        # -- row[1]
        # id : 
        # object: "chat.completion"
        # created: 
        # model:
        # choices: [{
        #     index: 
        #     message: {
        #         role: "assistant"
        #         content: 👉 this is the summary 
        #         refusal : 
        #         annotations :
        #     }
        # }]    
        choices = row[1]["choices"]    
        if len(choices) > 0:
            assert(len(choices) == 1)
            choice_msg = choices[0]["message"]        
            if choice_msg.get("refusal"):
                logging.warning(f"Response was refused by OpenAI: {choice_msg['refusal']}")
            else:
                sum_file = Path(sum_path)
                if sum_file.exists():
                    logging.warning(f"Summary file already exists: Not creating: {sum_path}")
                else:
                    logging.debug(f"Writing {sum_path}")
                    os.makedirs(os.path.dirname(sum_file), exist_ok=True)                    
                    sum_file.write_text(choice_msg["content"])
        else:
            logging.warning("Response is missing for request ??")    
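
The `[job, result]` row layout described in the cell above can also be read with plain `json`, which avoids the unnamed pandas columns. This is a minimal sketch, assuming each line is a JSON array whose first item is the request and whose second item is the response. `iter_summaries` is a hypothetical helper, and skipping rows whose second item is not a dict with `choices` is an assumption about how failed requests might be recorded, not something verified against the script.

```python
import json
import re

MD_PATH_RE = re.compile(r'^\$\$MD_PATH\$\$:(.*?)$', re.MULTILINE)


def iter_summaries(lines):
    """Yield (md_path, summary) pairs from '[job, result]' JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        job, result = json.loads(line)[:2]
        match = MD_PATH_RE.search(job["messages"][0]["content"])
        # ASSUMPTION: anything that is not a dict with choices is a failed request.
        if not match or not isinstance(result, dict) or not result.get("choices"):
            continue
        message = result["choices"][0]["message"]
        if message.get("refusal"):
            continue
        yield match.group(1), message["content"]


# Usage with a made-up row (path and text are illustrative only)
row = json.dumps([
    {"messages": [{"role": "user", "content": "prompt\n$$MD_PATH$$:/tmp/md/a.md"}]},
    {"choices": [{"message": {"role": "assistant", "content": "A summary", "refusal": None}}]},
])
print(list(iter_summaries([row])))  # [('/tmp/md/a.md', 'A summary')]
```

Passing an open file handle as `lines` streams the results file one row at a time instead of loading it all into a DataFrame.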

In [68]:
txt = """
blah blah blah
$$MD_PATH$$:/home/vamsi/bitbucket/hillops/nbs/BSL_TakeHome/data/arxiv/md/03_16_2025/2503.12688v1.md
""".strip()

import re
m = re.findall(r'^\$\$MD_PATH\$\$:(.*?)$',  txt, re.MULTILINE)
print(m)    

['/home/vamsi/bitbucket/hillops/nbs/BSL_TakeHome/data/arxiv/md/03_16_2025/2503.12688v1.md']


# Scratchpad - Test OpenAI parallel calls

 - See https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py
 - Takes requests from a `jsonl` file where each row _(a complete json block)_ is a request
 - Their code to generate the req is listed below. Will modify it to tell a joke and make 10 requests and see

 - https://api.openai.com/v1/chat/completions
 - https://platform.openai.com/settings/organization/limits
   - tokens = 200,000 TPM
   - 500 RPM, 10,000 RPD

 - Generated the jsonl file using the cell below
 - Downloaded the file from https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py 
 - Used the following command to run it.

```bash
(jupyter) vamsi@hillops_dev:~/bitbucket/hillops/nbs/BSL_TakeHome/oai_bulk_req$ python ../bin/api_request_parallel_processor.py --requests_filepath jokes_paralell.jsonl --save_filepath jokes.jsonl --request_url https://api.openai.com/v1/chat/completions --max_requests_per_minute 500 --max_tokens_per_minute 200000
```   

**Problems**
 - When running the actual summarization, it gave me an error about encoding `<|endoftext|>` and suggested passing `disallowed_special=()`. This is a tokenizer (`tiktoken`) setting. I replaced the default encoding with the one meant for gpt-4o-mini, `o200k_base`, via `--token_encoding_name o200k_base` and re-ran the batch. Still failed! What worked was passing `allowed_special="all"` to `encoding.encode` in the script's token counter:

```diff
def num_tokens_consumed_from_request(
    request_json: dict,
    api_endpoint: str,
    token_encoding_name: str,
):
    """Count the number of tokens in the request. Only supports completion and embedding requests."""
    encoding = tiktoken.get_encoding(token_encoding_name)
    # if completions request, tokens = prompt + n * max_tokens
    if api_endpoint.endswith("completions"):
        max_tokens = request_json.get("max_tokens", 15)
        n = request_json.get("n", 1)
        completion_tokens = n * max_tokens

        # chat completions
        if api_endpoint.startswith("chat/"):
            num_tokens = 0
            for message in request_json["messages"]:
                num_tokens += 4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
                for key, value in message.items():
-                   num_tokens += len(encoding.encode(value))
+                   num_tokens += len(encoding.encode(value, allowed_special="all"))
```

In [11]:
import json
BULK_REQ_DIR = Path(os.getcwd()) / "oai_bulk_req"

filename = BULK_REQ_DIR / "jokes_paralell.jsonl"
n_requests = 10
jobs = [
    {"model"   : "gpt-4o-mini", 
     "temperature" : 1,
     "messages": [{
         "role" : "user",
         "content" : "Tell me a joke"         
         }],
    } for x in range(n_requests)]

with open(filename, "w") as f:
    for job in jobs:
        json_string = json.dumps(job)
        f.write(json_string + "\n")

In [None]:
import tiktoken

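# get_encoding() only takes the encoding name; allowed_special is an argument of encode()
enc = tiktoken.get_encoding("o200k_base")
print(enc.encode("<|endoftext|>", allowed_special={'<|endoftext|>'}))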