# AI Research Papers

> Turns out, the summary information is also provided by the arXiv RSS feed. However, that feed has a limited set of entries. The arXiv API allows for a much larger collection of data.

Since I am asked to look at AI feeds, I could scan those for links to papers etc. However, the main point was to get papers and then process them. Defaulting to arxiv for now.

Normally folks would keep doing this on a weekly or daily basis to keep track of trends _(which are invariably historical comparisons)_. However, I have to download a bunch of these to build up my history.

The arXiv TOS specifies a rate limit of 1 request per 3 seconds, which applies across all the machines controlled by one entity. In my case that is just one machine, so I will see how long it takes. Maybe cycle through each day/month so that I at least have some data for each month, and keep adding to it.

 - python arxiv package
 - python ratelimit package
 - PDF to text conversion. Save as json.
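The 1 request per 3 seconds limit can be enforced with a small throttle. The sketch below is a standard-library stand-in for what the `ratelimit` package does; the class name `MinIntervalThrottle` and its parameters are hypothetical, not part of any library.

```python
import time


class MinIntervalThrottle:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the throttle can be tested without waiting
        self._sleep = sleep
        self._last_call = None   # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to respect the minimum interval, then record the call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Usage: call throttle.wait() right before every request to arXiv.
throttle = MinIntervalThrottle(min_interval=3.0)
```

At one request every 3 seconds, a pass of 1095 downloads needs at least 1095 × 3 s, which is roughly 55 minutes.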

The result has the following useful attribs (fields/methods) for my use case

 - 'authors'
 - 'links'
 - ✔️'categories'
 - 'comment'
 - 'doi'
 - 'download_pdf' _method_
 - 'download_source' _method_
 - 'entry_id'
 - 'get_short_id' _method_
 - 'pdf_url'
 - 'primary_category'
 - ✔️'published'
 - ✔️'summary'
 - ✔️'title'

If there is a summary, then I don't need to deal with the PDFs, converting them to text etc. yet. Can do that at the end if the end-to-end pipeline gets done!
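Keeping only the ticked fields could look like the sketch below. It assumes the result object exposes them as plain attributes; `to_record` and `KEPT_FIELDS` are hypothetical names, and a `SimpleNamespace` stands in for a real result so the example runs offline.

```python
from datetime import datetime
from types import SimpleNamespace

# The four fields ticked in the list above
KEPT_FIELDS = ("title", "summary", "published", "categories")


def to_record(result, fields=KEPT_FIELDS):
    """Copy the chosen attributes of a result-like object into a JSON-friendly dict."""
    record = {}
    for name in fields:
        value = getattr(result, name, None)
        # Dates are stored as ISO strings so that json.dump accepts the record
        if hasattr(value, "isoformat"):
            value = value.isoformat()
        record[name] = value
    return record


# Stand-in for a real result object (assumption: same attribute names)
fake_result = SimpleNamespace(
    title="An example title",
    summary="An example abstract.",
    published=datetime(2024, 1, 2, 3, 4, 5),
    categories=["cs.AI"],
    doi=None,
)
record = to_record(fake_result)
```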

# What to download

There are tons of papers every day at arxiv. To manage the load but still get enough time-spread, sample daily.
 - 10 papers per day in `cs.AI`
 - Spread over 4 years
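One way to sample per day is to issue one query per calendar day and cap the number of results at 10. The sketch below only builds the query strings. It assumes the arXiv API accepts a `submittedDate:[YYYYMMDDHHMM TO YYYYMMDDHHMM]` range filter combined with `cat:`; check this against the API manual before relying on it. `daily_queries` is a hypothetical helper.

```python
from datetime import date, timedelta


def daily_queries(start, days, category="cs.AI"):
    """Yield (day, query) pairs, one per calendar day starting at `start`."""
    for offset in range(days):
        day = start + timedelta(days=offset)
        stamp = day.strftime("%Y%m%d")
        # Assumed query syntax: one category, restricted to a single day
        query = f"cat:{category} AND submittedDate:[{stamp}0000 TO {stamp}2359]"
        yield day, query


# Three days as a smoke test; a multi-year run would just use a larger `days`
queries = list(daily_queries(date(2024, 2, 28), 3))
```

Each query string would then be handed to the search client together with a limit of 10 results.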

## Notebook setup 

In [1]:
# Setup paths to our libs
import os
import sys
from pathlib import Path

lib_path = (Path(os.getcwd()) / "lib").resolve()
sys.path.append(str(lib_path))

# Import jupyter utils
import logging
from util import jupyter_util
from util.jupyter_util import DisplayHTML as jh
from util.jupyter_util import DisplayMarkdown as jm

# Init jupyter env. Set to DEBUG if you want to see the gory details
# of schemas and such.
jupyter_util.setup_logging(logging.DEBUG)

In [2]:
# Move into arxiv_util?
from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel

# Unused currently. Will simply load the JSON that is provided by feedreader.
class ArxivResultItem(BaseModel):
    authors: List[str]
    title: str
    summary: str
    published: str

    primary_category: str
    categories: List[str]

    pdf_url: str
    entry_url: str

In [23]:
import json

DATA_DIR = Path(os.getcwd()) / "data"
FEED_RAW_DATA_DIR = DATA_DIR / "feed" / "raw"

# Test with the small 3 day one first
#metadata_path = FEED_RAW_DATA_DIR / "Arxiv_csAI_API_dailysampled_3d.json"
metadata_path = FEED_RAW_DATA_DIR / "Arxiv_csAI_API_dailysampled_3y.json"

#----------------------------
# Read the JSON in
arxiv_per_day_mtd = {}
with open( str(metadata_path), 'r') as json_data:
    arxiv_per_day_mtd = json.load(json_data)

logging.debug(f"Finished loading JSON from {str(metadata_path)}.\nHave {len(arxiv_per_day_mtd)} days worth of records")

12:30:04 DEBUG:Finished loading JSON from /home/vamsi/bitbucket/hillops/nbs/BSL_TakeHome/data/feed/raw/Arxiv_csAI_API_dailysampled_3y.json.
Have 1095 days worth of records


In [24]:
import os
import urllib.request  # plain "import urllib" does not reliably load the request submodule
from pathlib import Path
from ratelimit import limits, sleep_and_retry

# arXiv asks for at most 1 request every 3 seconds
@sleep_and_retry
@limits(calls=1, period=3)
def download_url(url, dest_dir):
    fname = url.split("/")[-1]
    if not fname.endswith(".pdf"):
        fname += ".pdf"

    dest_path = Path(dest_dir) / fname
    if dest_path.exists():
        return        

    retry = 3
    for r in range(retry):
        try:
            urllib.request.urlretrieve(url, str(dest_path))
        except Exception as e:
            if r < retry - 1:
                print(f'Failed. Attempt # {r + 1}')
            else:
                print(f"Error encountered at third attempt. Aborting {url}")
                print(e)
        else:
            #print(f"Success downloading {url}")
            break
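A retry loop like the one in `download_url` retries every failure the same way, including an HTTP 404 that will never succeed. The sketch below is one possible variant, not the code that produced the logs in this notebook: it gives up immediately on 404 and backs off between transient failures. `fetch_with_retry` and its parameters are hypothetical; `fetch` is any callable that raises on failure, for example a wrapper around `urllib.request.urlretrieve`.

```python
import time
import urllib.error


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); retry transient failures, stop at once on HTTP 404.

    Returns True on success, False otherwise.
    """
    for attempt in range(1, attempts + 1):
        try:
            fetch(url)
            return True
        except urllib.error.HTTPError as err:
            # HTTPError must be caught before URLError (it is a subclass)
            if err.code == 404:
                print(f"Not found, skipping {url}")
                return False
        except (urllib.error.URLError, TimeoutError):
            pass  # treated as transient; fall through to the backoff below
        if attempt < attempts:
            # Exponential backoff: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    print(f"Giving up on {url} after {attempts} attempts")
    return False
```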

In [None]:
# Prepare download_list for download
from dateutil import parser

PAPER_DATA_DIR = DATA_DIR / "arxiv" / "pdf"
download_list = []

# Each day has 10 entries. To reduce the data load, download only the
# nth entry of every day per pass. Each pass of 1095 PDFs takes 40-70 min.
# Done so far: nth = 0, 1, 2, 3, 4
nth = 5
for k,v in arxiv_per_day_mtd.items():
    date = parser.parse(k)
    date_folder = date.strftime("%m_%d_%Y")
    download_list.append((v[nth]["pdf_url"], str(PAPER_DATA_DIR / date_folder)))

#print(download_list)
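Indexing `v[nth]` raises an `IndexError` for any day that has fewer than `nth + 1` entries. A guarded variant is sketched below. It assumes the day keys are ISO dates such as `2024-01-02`, so it uses `datetime.fromisoformat` in place of `dateutil`; `build_download_list` is a hypothetical helper.

```python
from datetime import datetime
from pathlib import Path


def build_download_list(per_day, nth, root):
    """Return (pdf_url, folder) pairs, skipping days with fewer than nth + 1 entries."""
    pairs = []
    for day_key, entries in per_day.items():
        if nth >= len(entries):
            continue  # this day has no nth entry
        folder = datetime.fromisoformat(day_key).strftime("%m_%d_%Y")
        pairs.append((entries[nth]["pdf_url"], str(Path(root) / folder)))
    return pairs


# Made-up metadata in the same shape: day -> list of entries with a "pdf_url"
sample = {
    "2024-01-02": [{"pdf_url": "http://arxiv.org/pdf/x1"}, {"pdf_url": "http://arxiv.org/pdf/x2"}],
    "2024-01-03": [{"pdf_url": "http://arxiv.org/pdf/y1"}],
}
pairs = build_download_list(sample, nth=1, root="pdf")
```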

In [26]:
from tqdm import tqdm

# After first pass, no need to do os.makedirs. Simply takes up time
DO_MAKE_DIRS = False

for (url, dest_dir) in tqdm(download_list):
    fname = url.split("/")[-1]
    if not fname.endswith(".pdf"):
        fname += ".pdf"

    dest_path = Path(dest_dir) / fname
    if not dest_path.exists():            
        if DO_MAKE_DIRS: 
            os.makedirs(dest_dir, exist_ok=True)
            
        download_url(url, dest_dir)

  0%|          | 0/1095 [00:00<?, ?it/s]

Failed. Attempt # 1
Failed. Attempt # 2


 42%|████▏     | 455/1095 [00:15<00:21, 30.07it/s]

Error encountered at third attempt. Aborting http://arxiv.org/pdf/2306.10409v2
HTTP Error 404: Not Found


 57%|█████▋    | 625/1095 [00:49<01:21,  5.76it/s]

Failed. Attempt # 1


 61%|██████    | 670/1095 [03:21<16:00,  2.26s/it]

Failed. Attempt # 1


 63%|██████▎   | 686/1095 [03:59<22:02,  3.23s/it]

Failed. Attempt # 1
Failed. Attempt # 2
Error encountered at third attempt. Aborting http://arxiv.org/pdf/2402.05951v3
HTTP Error 404: NOT FOUND


 63%|██████▎   | 692/1095 [04:08<08:59,  1.34s/it]

Failed. Attempt # 1


 64%|██████▍   | 702/1095 [04:31<12:33,  1.92s/it]

Failed. Attempt # 1


 65%|██████▍   | 711/1095 [04:57<20:08,  3.15s/it]

Failed. Attempt # 1


 72%|███████▏  | 785/1095 [10:40<04:25,  1.17it/s]  

Failed. Attempt # 1
Failed. Attempt # 2
Error encountered at third attempt. Aborting http://arxiv.org/pdf/2405.08031v2
HTTP Error 404: Not Found


 78%|███████▊  | 858/1095 [13:54<08:43,  2.21s/it]

Failed. Attempt # 1


 79%|███████▉  | 868/1095 [14:37<05:50,  1.54s/it]

Failed. Attempt # 1


 88%|████████▊ | 961/1095 [20:45<04:46,  2.14s/it]  

Failed. Attempt # 1


 91%|█████████ | 996/1095 [22:41<03:25,  2.08s/it]

Failed. Attempt # 1


 92%|█████████▏| 1004/1095 [23:06<02:32,  1.67s/it]

Failed. Attempt # 1


 94%|█████████▍| 1032/1095 [24:20<05:20,  5.08s/it]

Failed. Attempt # 1


 97%|█████████▋| 1059/1095 [25:41<00:21,  1.70it/s]

Failed. Attempt # 1
Failed. Attempt # 2
Error encountered at third attempt. Aborting http://arxiv.org/pdf/2502.07115v3
HTTP Error 404: NOT FOUND


100%|██████████| 1095/1095 [26:26<00:00,  1.45s/it]


In [64]:
# Copied from 
# https://gist.github.com/darwing1210/c9ff8e3af8ba832e38e6e6e347d9047a
# And modified to be per-day
import os
import logging

import nest_asyncio
nest_asyncio.apply()

import asyncio
import aiohttp  # pip install aiohttp
import aiofile  # pip install aiofile

def download_files_from_report(urls_folder_list):

    # Create all dirs needed
    for (_, folder) in urls_folder_list:
        os.makedirs(folder, exist_ok=True)

    sema = asyncio.BoundedSemaphore(5)

    async def fetch_file(session, url, out_dir):
        fname = url.split("/")[-1]
        if not fname.endswith(".pdf"):
            fname += ".pdf"
            
        async with sema:
            logging.debug(f"Queuing {url}")
            async with session.get(url) as resp:
                assert resp.status == 200
                data = await resp.read()

        async with aiofile.async_open(
            os.path.join(out_dir, fname), "wb"
        ) as outfile:
            await outfile.write(data)

    async def main():
        async with aiohttp.ClientSession() as session:
            tasks = [fetch_file(session, url, out_dir) for (url, out_dir) in urls_folder_list]
            await asyncio.gather(*tasks)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    #loop.close()

Success downloading http://arxiv.org/pdf/2503.12688v1
Success downloading http://arxiv.org/pdf/2503.12687v1
Success downloading http://arxiv.org/pdf/2503.12667v1
Success downloading http://arxiv.org/pdf/2503.12651v1
Success downloading http://arxiv.org/pdf/2503.12649v1
Success downloading http://arxiv.org/pdf/2503.12642v1
Success downloading http://arxiv.org/pdf/2503.13554v1
Success downloading http://arxiv.org/pdf/2503.13553v1
Success downloading http://arxiv.org/pdf/2503.12637v1
Success downloading http://arxiv.org/pdf/2503.12635v1
Success downloading http://arxiv.org/pdf/2503.13778v1
Success downloading http://arxiv.org/pdf/2503.13771v1
Success downloading http://arxiv.org/pdf/2503.13754v2
Success downloading http://arxiv.org/pdf/2503.13751v1
Success downloading http://arxiv.org/pdf/2503.13708v1
Success downloading http://arxiv.org/pdf/2503.13690v1
Success downloading http://arxiv.org/pdf/2503.13660v1
Success downloading http://arxiv.org/pdf/2503.13657v1
Success downloading http://a
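In the async version, a single failed `assert` inside `fetch_file` propagates out of `asyncio.gather` and aborts the whole batch. The sketch below shows the same bounded-concurrency pattern using only `asyncio`, with `return_exceptions=True` so that failures come back as values. `run_bounded` and `fake_worker` are hypothetical names; the worker stands in for the real HTTP fetch. Note that a semaphore caps the number of requests in flight, not the request rate, so it does not by itself satisfy the 1 request per 3 seconds limit.

```python
import asyncio


async def run_bounded(jobs, worker, limit=5):
    """Run worker(job) for every job with at most `limit` running at once.

    Results are returned in input order; a failed job yields its exception object.
    """
    sema = asyncio.Semaphore(limit)

    async def guarded(job):
        async with sema:
            return await worker(job)

    return await asyncio.gather(*(guarded(j) for j in jobs), return_exceptions=True)


async def fake_worker(n):
    """Stand-in for a download: job 2 fails, the others succeed."""
    await asyncio.sleep(0)
    if n == 2:
        raise ValueError("boom")
    return n * 10


results = asyncio.run(run_bounded([1, 2, 3], fake_worker, limit=2))
```

Inside a notebook, where an event loop is already running, `await run_bounded(...)` in a cell replaces the `asyncio.run` call.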

In [60]:
fn = "hello.pdf"
if fn.endswith(".pdf"):
    print("Yes")

Yes
