# AI System to Automatically Review and Summarize Research Papers

# MILESTONE 1

Install required packages (run once)

In [30]:

!pip install -q requests pandas tqdm pymupdf nltk scikit-learn gradio sentence-transformers faiss-cpu pytesseract pdf2image



“This cell installs all the required libraries. requests lets me talk to the Semantic Scholar API. pandas helps manage data in tables. PyMuPDF extracts text from PDFs, and pytesseract helps if a PDF is scanned. nltk and scikit-learn are for basic NLP and summarization. sentence-transformers and faiss help with semantic search. gradio lets me build a small UI. These installations ensure the entire pipeline runs smoothly.”

Simple imports

In [31]:

import os, time, json, logging, random, hashlib
from getpass import getpass
from urllib.parse import quote
from functools import wraps
import requests
import pandas as pd
from tqdm import tqdm
import fitz            # PyMuPDF
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer


This cell imports the tools used in the rest of the notebook. The standard-library modules os, time, json, logging, random, and hashlib handle folders, retry delays, saved data, log messages, random jitter, and cache filenames; getpass reads the API key without echoing it, quote URL-encodes strings, and wraps helps write decorators. requests calls the Semantic Scholar API, pandas stores results in tables, and tqdm shows progress bars. PyMuPDF (imported as fitz) reads PDFs, nltk splits text into sentences, and TfidfVectorizer from scikit-learn scores sentences for the summarizer. Optional libraries such as pytesseract, pdf2image, sentence-transformers, and faiss are imported later, in the cells that use them.

NLTK setup

In [32]:
#  Download NLTK resources used by summarizer
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

This cell downloads the NLTK (Natural Language Toolkit) "punkt" tokenizer, a pretrained model that splits text into sentences. The text extracted from research papers later has to be broken into sentences before it can be summarized or analyzed. The punkt model recognizes sentence boundaries (after periods, question marks, and so on) while handling cases such as abbreviations. Without this resource, nltk.sent_tokenize() in the summarizer raises an error, so this cell simply prepares NLTK for the text-processing steps.

Create tidy output folders

In [33]:
#  Setup output folders
OUT_ROOT = "milestone1_output"
PDF_DIR = os.path.join(OUT_ROOT, "pdfs")
TEXT_DIR = os.path.join(OUT_ROOT, "texts")
CACHE_DIR = os.path.join(OUT_ROOT, "cache")
os.makedirs(PDF_DIR, exist_ok=True)
os.makedirs(TEXT_DIR, exist_ok=True)
os.makedirs(CACHE_DIR, exist_ok=True)
print("Folders created:", OUT_ROOT, PDF_DIR, TEXT_DIR, CACHE_DIR)


Folders created: milestone1_output milestone1_output/pdfs milestone1_output/texts milestone1_output/cache


This cell creates the folders where all project files will be saved. The main folder is milestone1_output, and inside it we make three sub-folders: pdfs to store downloaded research papers, texts to store the text extracted from those PDFs, and cache to save temporary data such as API responses. The os.makedirs(..., exist_ok=True) call creates a folder only if it does not already exist, so re-running the cell does not raise an error. Keeping everything in separate folders keeps the project easy to manage and gives every file generated later a fixed place.

Enter API key securely & basic logging

In [36]:
# Cell 5 — Enter Semantic Scholar API key (hidden) and initialize logging
SEMANTIC_SCHOLAR_API_KEY = getpass("Paste your Semantic Scholar API key (hidden): ")
HEADERS = {"x-api-key": SEMANTIC_SCHOLAR_API_KEY} if SEMANTIC_SCHOLAR_API_KEY else {}
API_BASE = "https://api.semanticscholar.org/graph/v1"

# Logging to console and to file
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger("milestone1")
file_handler = logging.FileHandler(os.path.join(OUT_ROOT, "pipeline.log"))
file_handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
logger.addHandler(file_handler)

logger.info("API key set and logging initialized.")


Paste your Semantic Scholar API key (hidden): ··········


In this cell, we enter the Semantic Scholar API key. The getpass() function hides the key as it is typed, so it never appears in the notebook output. If a key is entered, it is stored in the HEADERS dictionary under x-api-key and sent with every API request; if the prompt is left empty, HEADERS stays empty and requests go out without a key, which is subject to stricter rate limits.

Next, we set up logging, which keeps track of what the program does, with messages such as "search started," "download complete," or "error occurred." Messages are printed on the screen and also saved to a file called pipeline.log inside the output folder, which helps with debugging.

Finally, the last line confirms that the API key and logging system are ready to use.

Simple GET with retry/backoff

In [35]:
# simple network GET with retries/backoff
def simple_get(url, headers=None, stream=False, timeout=20, retries=3):
    for attempt in range(1, retries+1):
        try:
            r = requests.get(url, headers=headers, stream=stream, timeout=timeout)
            r.raise_for_status()
            return r
        except Exception as e:
            if attempt == retries:
                logger.error(f"GET failed for {url}: {e}")
                raise
            wait = 1 * (2 ** (attempt-1)) + random.random()
            logger.warning(f"GET attempt {attempt} failed for {url}. Waiting {wait:.1f}s before retry.")
            time.sleep(wait)


This cell creates a function called simple_get() that downloads data from the internet with automatic retries. If a request fails (because of network issues or an HTTP error status), it tries again, up to 3 attempts in total, waiting roughly twice as long before each new attempt plus a small random delay. If the last attempt also fails, it logs an error and re-raises the exception, so the caller decides how to handle it. This makes paper search and PDF downloads more tolerant of temporary failures.
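The waits between attempts follow an exponential schedule. A small sketch of that schedule, where backoff_delays is a hypothetical helper that is not part of the notebook, and the random jitter is left out so the numbers are reproducible:

```python
def backoff_delays(retries=3, base=1.0):
    """Deterministic part of the waits that simple_get() sleeps between attempts.

    Hypothetical helper for illustration only; simple_get() additionally adds
    random.random() (up to one second of jitter) to every wait.
    """
    # No sleep follows the final attempt, so there are retries - 1 waits.
    return [base * (2 ** (attempt - 1)) for attempt in range(1, retries)]


print(backoff_delays(3))  # [1.0, 2.0]
print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0]
```

With the default retries=3, a request that keeps failing therefore costs about 3 to 5 seconds of waiting on top of the request timeouts.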

Simple caching for API responses (so repeated runs don't re-query)

In [37]:
# Cell 7 — simple JSON cache utility
def cache_get(key, fetch_fn, cache_dir=CACHE_DIR):
    h = hashlib.sha1(key.encode()).hexdigest()
    path = os.path.join(cache_dir, f"{h}.json")
    if os.path.exists(path):
        logger.info(f"Loading cached response for key {key[:80]}... -> {path}")
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = fetch_fn()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


This cell creates a cache system so we don’t repeatedly call the API for the same search. When we search for a topic, the API returns JSON data. This function saves that response in a file. The next time we search the same topic, it loads the result from the saved file instead of calling the API again.

This makes the program faster, reduces API usage, and prevents hitting rate limits. It works by creating a unique filename (using SHA-1) for each search key, checking if it already exists, and if not, saving the new response.
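A minimal usage sketch of the cache, assuming a temporary folder and a fake fetch function in place of a real API call. The key format and fake_search are made up for illustration; note that the search cells in this notebook call the search wrapper directly, so the cache only takes effect once a search is wrapped this way.

```python
import hashlib
import json
import os
import tempfile


def cache_get(key, fetch_fn, cache_dir):
    """Same logic as the notebook's cache_get(), minus the logging."""
    name = hashlib.sha1(key.encode()).hexdigest() + ".json"
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = fetch_fn()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


calls = []


def fake_search():
    # Stand-in for a real API call; records how often it actually runs.
    calls.append(1)
    return {"total": 1, "data": [{"title": "Example paper"}]}


with tempfile.TemporaryDirectory() as tmp:
    first = cache_get("search:example topic:limit=5", fake_search, tmp)
    second = cache_get("search:example topic:limit=5", fake_search, tmp)

print(len(calls))       # 1
print(first == second)  # True
```

The second call is served from the JSON file, so the fetch function runs only once. Including the limit in the key keeps searches with different limits from sharing one cache entry.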

Semantic Scholar search wrapper

In [38]:
# Replacement for "Cell 8" — safer Semantic Scholar search wrapper with debug & fallback
import requests

def semantic_scholar_search_safe(query, limit=10):
    """
    Safer search wrapper:
    - uses requests params (handles encoding)
    - requests openAccessPdf.url as a sub-field
    - prints debug info on non-200 responses
    - falls back to minimal request (no fields) if needed
    """
    # fields: include openAccessPdf.url (so we can get the actual PDF link)
    fields = ",".join([
        "paperId","title","authors","year","venue","abstract",
        "citationCount","isOpenAccess","openAccessPdf.url","url","doi"
    ])
    params = {"query": query, "limit": limit, "fields": fields}

    try:
        resp = requests.get(f"{API_BASE}/paper/search", params=params, headers=HEADERS, timeout=30)
    except Exception as e:
        # network-level failure
        print("Network error when calling Semantic Scholar:", e)
        raise

    # If success, return parsed JSON
    if resp.status_code == 200:
        return resp.json()    # typically contains {"total":..., "data":[...]}
    # If bad request or other non-200, show debug info
    print(f"Semantic Scholar returned status {resp.status_code} for query. Response body (first 800 chars):")
    print(resp.text[:800])

    # If we got 400, try a fallback minimal request (no fields) to check if fields caused it
    if resp.status_code == 400:
        print("Received 400. Trying fallback request without fields to isolate the problem...")
        try:
            resp2 = requests.get(f"{API_BASE}/paper/search", params={"query": query, "limit": limit}, headers=HEADERS, timeout=30)
            print("Fallback response status:", resp2.status_code)
            if resp2.status_code == 200:
                print("Fallback succeeded (no fields). The 'fields' parameter likely caused the 400. Try requesting fewer/other fields.")
                return resp2.json()
            else:
                print("Fallback also failed. Response body (first 800 chars):")
                print(resp2.text[:800])
        except Exception as e:
            print("Fallback network error:", e)
    # If still failing, raise an HTTPError with response attached for debugging
    resp.raise_for_status()

# Quick manual test with a hard-coded query
try:
    result = semantic_scholar_search_safe("ai generated model for summarizing research paper models", limit=6)
    # normalize result if needed
    data = result.get("data", result) if isinstance(result, dict) else result
    print("Number of items returned:", len(data) if isinstance(data, list) else "unknown")
    # show first two titles if present
    if isinstance(data, list) and data:
        for i, item in enumerate(data[:2], start=1):
            print(i, "-", item.get("title"))
    else:
        print("No data list in response; printing whole response object (trimmed):")
        print(str(result)[:1000])
except Exception as e:
    print("Search failed with exception:", e)


Semantic Scholar returned status 400 for query. Response body (first 800 chars):
{"error":"Unrecognized or unsupported fields: [openAccessPdf.url, doi]"}

Received 400. Trying fallback request without fields to isolate the problem...
Fallback response status: 429
Fallback also failed. Response body (first 800 chars):
{"message": "Too Many Requests. Please wait and try again or apply for a key for higher rate limits. https://www.semanticscholar.org/product/api#api-key-form", "code": "429"}
Search failed with exception: 400 Client Error: Bad Request for url: https://api.semanticscholar.org/graph/v1/paper/search?query=ai+generated+model+for+summarizing+research+paper+models&limit=6&fields=paperId%2Ctitle%2Cauthors%2Cyear%2Cvenue%2Cabstract%2CcitationCount%2CisOpenAccess%2CopenAccessPdf.url%2Curl%2Cdoi


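This cell creates a safer paper-search function.

It sends your topic to the Semantic Scholar /paper/search endpoint and returns the parsed JSON.

If the API answers 400 (bad request), it retries once without the fields parameter to check whether the requested fields caused the error.

For any non-200 response, it prints the status code and the start of the response body so you can see what went wrong.

At the end, it runs a test search and prints the first two paper titles.

In the recorded run above, the API rejected the fields openAccessPdf.url and doi, and the fallback request was rate-limited (429), so the test search failed.

In short:
 This cell searches for papers and reports API errors clearly instead of failing silently.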
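The 400 response recorded above names the two rejected fields, openAccessPdf.url and doi. A minimal sketch of a field list that avoids them, assuming the Graph API's openAccessPdf and externalIds fields (the DOI is then read from externalIds). SAFE_FIELDS and normalize_paper are hypothetical names, and the sample record is made up for illustration:

```python
# Hypothetical replacement for the "fields" list in the search wrapper.
SAFE_FIELDS = ",".join([
    "paperId", "title", "authors", "year", "venue", "abstract",
    "citationCount", "isOpenAccess", "openAccessPdf", "url", "externalIds",
])


def normalize_paper(p):
    """Flatten one search hit, tolerating missing or null fields."""
    pdf = p.get("openAccessPdf") or {}   # may be absent or null
    ids = p.get("externalIds") or {}     # DOI lives here, not at the top level
    return {
        "paperId": p.get("paperId"),
        "title": p.get("title") or "",
        "pdf_url": pdf.get("url"),
        "doi": ids.get("DOI"),
    }


# Made-up record shaped like one item of the "data" list.
sample = {
    "paperId": "abc123",
    "title": "An example paper",
    "openAccessPdf": {"url": "https://example.org/paper.pdf"},
    "externalIds": {"DOI": "10.0000/example"},
}
print(normalize_paper(sample)["doi"])                      # 10.0000/example
print(normalize_paper({"paperId": "x"})["pdf_url"])        # None
```

Passing SAFE_FIELDS as the fields parameter should let the first request succeed, so the fallback (which returns only paperId and title) is not needed.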

Simple search run & create results DataFrame

In [39]:
# New Cell 9 — run safe search and build df_results
topic = "ai generated model for summarizing research paper models"   # change if desired
limit = 12

# Call the safe search wrapper
resp = semantic_scholar_search_safe(topic, limit=limit)

# Normalize response: sometimes it's {"total":.., "data":[...]} or directly a list
if isinstance(resp, dict) and "data" in resp:
    data = resp["data"]
elif isinstance(resp, list):
    data = resp
else:
    # If unexpected, print resp for debugging
    print("Unexpected response shape (trimmed):", str(resp)[:1000])
    data = []

rows = []
for i, p in enumerate(data, start=1):
    authors = ", ".join([a.get("name","") for a in p.get("authors", [])])
    rows.append({
        "index": i,
        "paperId": p.get("paperId"),
        "title": (p.get("title") or "")[:300],
        "authors": authors,
        "authors_list": p.get("authors", []),
        "year": p.get("year"),
        "venue": p.get("venue"),
        "citationCount": p.get("citationCount") or 0,
        "isOpenAccess": p.get("isOpenAccess"),
        "openAccessPdf": (p.get("openAccessPdf") or {}).get("url"),
        "semanticUrl": p.get("url"),
        "doi": p.get("doi"),
        "abstract": p.get("abstract") or ""   # the API may return null
    })

df_results = pd.DataFrame(rows).set_index("index")
print("Search completed — number of rows:", len(df_results))
df_results.head()



Semantic Scholar returned status 400 for query. Response body (first 800 chars):
{"error":"Unrecognized or unsupported fields: [openAccessPdf.url, doi]"}

Received 400. Trying fallback request without fields to isolate the problem...
Fallback response status: 200
Fallback succeeded (no fields). The 'fields' parameter likely caused the 400. Try requesting fewer/other fields.
Search completed — number of rows: 12


index,paperId,title,authors,authors_list,year,venue,citationCount,isOpenAccess,openAccessPdf,semanticUrl,doi,abstract
1,b2225a8b872f2fc4d39a9f5a3470ff47404d7b2e,Research on Generating Naked-Eye 3D Display Co...,,[],,,0,,,,,
2,be0c1080f11f913ca58279a92db0764dbd97ada8,RIGID: A Training-free and Model-Agnostic Fram...,,[],,,0,,,,,
3,f9819c7ea50007a55a3857cfa42e3b6b65577df3,Multiclass AI-Generated Deepfake Face Detectio...,,[],,,0,,,,,
4,d0b5194032451157f264db4a6da569f03347d1cb,ReviewAgents: Bridging the Gap Between Human a...,,[],,,0,,,,,
5,dd44a086729e962af046aff808385b523fbcd856,Organic or Diffused: Can We Distinguish Human ...,,[],,,0,,,,,


This cell takes the papers found from Semantic Scholar and converts them into a clean, organized table (DataFrame).

What it does:

Runs the safe search function from Cell 8 using your topic.

Extracts the list of papers from the API response.

Creates a table with important details for each paper:

title

authors

year

venue

citation count

DOI

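PDF link

abstract

Puts all the papers into a pandas DataFrame so the next steps are easy.

Shows the first few rows so you can verify everything looks right.

In the recorded run above, the search fell back to a request without fields, so only paperId and title are filled in; the other columns are empty or 0.

In short:

 This cell takes raw API data and converts it into a clean, readable table of papers.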

Small helpers: author stats, DOI-safe filename, APA formatter

In [40]:
# Cell 10 — helper utilities
def author_stats(authors_list):
    names = [a.get("name","").strip() for a in authors_list if a.get("name")]
    return len(names), (names[0] if names else "")

def doi_safe_filename(doi, title, index):
    if doi:
        safe = doi.replace("/", "_").replace(":", "_")
        return f"{index}_DOI_{safe}.pdf"
    t = (title or "paper")[:80].replace("/", "_").replace("\n"," ").replace(" ", "_")
    return f"{index}_{t}.pdf"

def format_authors_apa(authors_list):
    apa = []
    for a in authors_list:
        name = a.get("name","").strip()
        if not name: continue
        parts = name.split()
        last = parts[-1]
        initials = " ".join([p[0].upper() + "." for p in parts[:-1]]) if len(parts) > 1 else ""
        apa.append(f"{last}, {initials}" if initials else last)
    if not apa:
        return ""
    if len(apa) == 1:
        return apa[0]
    if len(apa) <= 7:
        return ", ".join(apa[:-1]) + ", & " + apa[-1]
    return ", ".join(apa[:6]) + ", ... " + apa[-1]

def apa_reference_from_row(r):
    authors_apa = format_authors_apa(r.get("authors_list") or [])
    year = r.get("year") or "n.d."
    title = r.get("title") or ""
    venue = r.get("venue") or ""
    doi = r.get("doi")
    doi_part = f" https://doi.org/{doi}" if doi else ""
    return f"{authors_apa} ({year}). {title}. {venue}.{doi_part}".strip()


This cell creates small helper functions:

author_stats() → counts authors and gets the first author.

doi_safe_filename() → makes a clean, safe PDF filename using DOI or title.

format_authors_apa() → converts author names into APA-style format.

apa_reference_from_row() → builds a full APA reference for each paper.

These helpers are used later for saving PDFs and creating citations.
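A short usage sketch of the APA helpers on a made-up record. The two functions are compact copies of the notebook helpers so the block runs on its own; the sample data and the apa_records list are assumptions. Later cells expect a table named apa_df with an apa_reference column, and a list like this one could be turned into it with pd.DataFrame(apa_records).

```python
def format_authors_apa(authors_list):
    # Compact copy of the notebook helper.
    apa = []
    for a in authors_list:
        parts = a.get("name", "").split()
        if not parts:
            continue
        initials = " ".join(p[0].upper() + "." for p in parts[:-1])
        apa.append(f"{parts[-1]}, {initials}" if initials else parts[-1])
    if len(apa) <= 1:
        return apa[0] if apa else ""
    if len(apa) <= 7:
        return ", ".join(apa[:-1]) + ", & " + apa[-1]
    return ", ".join(apa[:6]) + ", ... " + apa[-1]


def apa_reference_from_row(r):
    # Compact copy of the notebook helper.
    doi = r.get("doi")
    doi_part = f" https://doi.org/{doi}" if doi else ""
    authors = format_authors_apa(r.get("authors_list") or [])
    return (f"{authors} ({r.get('year') or 'n.d.'}). "
            f"{r.get('title') or ''}. {r.get('venue') or ''}.{doi_part}").strip()


# Made-up row, shaped like one row of the selected papers.
rows = [{
    "index": 1,
    "authors_list": [{"name": "Ada Lovelace"}, {"name": "Alan M. Turing"}],
    "year": 2020,
    "title": "An example title",
    "venue": "Example Venue",
    "doi": "10.0000/example",
}]

apa_records = [{"index": r["index"], "apa_reference": apa_reference_from_row(r)}
               for r in rows]
print(apa_records[0]["apa_reference"])
# Lovelace, A., & Turing, A. M. (2020). An example title. Example Venue. https://doi.org/10.0000/example
```

The formatter treats the last word of a name as the surname, so multi-word surnames are split incorrectly.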

Parallel downloader (ThreadPoolExecutor) — controlled concurrency

In [41]:
# Cell 11 — Parallel downloads with limited workers
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_worker(task):
    url, dest = task
    ok, err = False, None
    try:
        ok, err = download_file(url, dest)
    except Exception as e:
        ok, err = False, str(e)
    return ok, err, url, dest

def parallel_download(candidate_list, max_workers=3):
    """
    candidate_list: list of tuples (url, dest_path)
    returns list of (ok, err, url, dest)
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        futures = {ex.submit(download_worker, t): t for t in candidate_list}
        for fut in tqdm(as_completed(futures), total=len(futures), desc="parallel downloading"):
            ok, err, url, dest = fut.result()
            results.append((ok, err, url, dest))
    return results


This cell adds parallel PDF downloading so files download faster instead of one-by-one.

What each part does:

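download_worker() → calls download_file() for one PDF and reports success or error. Note that download_file() is not defined in any cell shown here, so it must be defined before this cell is used.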

parallel_download()

Takes a list of PDF links + save locations.

Uses ThreadPoolExecutor to download multiple PDFs at the same time (default 3 downloads together).

Shows a progress bar using tqdm.

Returns the results of all downloads.

Why it is used:

To speed up downloading research papers and avoid waiting a long time.
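download_worker() calls a download_file(url, dest) function that returns an (ok, err) pair, but no cell shown here defines it. A minimal sketch of what such a function could look like, using only the standard library so it runs on its own; the Content-Type check, the temporary ".part" file, and the User-Agent string are assumptions, not the author's implementation:

```python
import os
import urllib.request


def download_file(url, dest, timeout=30):
    """Download url to dest; return (True, None) or (False, error message)."""
    try:
        req = urllib.request.Request(url, headers={"User-Agent": "milestone1-sketch"})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            ctype = resp.headers.get("Content-Type", "")
            # Landing pages are HTML; only keep responses that claim to be PDFs.
            if "pdf" not in ctype.lower():
                return False, f"not a PDF (Content-Type: {ctype!r})"
            data = resp.read()
        # Write to a temporary name first so a failed write leaves no partial PDF.
        tmp = dest + ".part"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, dest)
        return True, None
    except Exception as e:
        return False, str(e)


# An invalid URL is reported as a failure instead of raising.
ok, err = download_file("not-a-url", "unused.pdf")
print(ok, err)
```

Returning the error as a string instead of raising matches how download_worker() unpacks the result. Inside the notebook, the request could go through simple_get() instead to reuse its retries.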

Select top-N with filters (citations/year) and build candidate download tasks

In [42]:
# Cell 12 — Select top-N and prepare download candidate URLs
TOP_N = 4         # change to number you want to download
MIN_CITATIONS = 0 # set to >0 to filter
YEAR_FROM = 2015  # set to 0 to disable

candidates = []
selected_rows = []
# apply simple client-side filters
filtered = df_results.copy()
if MIN_CITATIONS and MIN_CITATIONS > 0:
    filtered = filtered[filtered['citationCount'] >= MIN_CITATIONS]
if YEAR_FROM and YEAR_FROM > 0:
    filtered = filtered[filtered['year'].notnull() & (filtered['year'] >= YEAR_FROM)]
# pick top by citationCount
filtered = filtered.sort_values(by='citationCount', ascending=False)
selected = filtered.head(TOP_N).copy().reset_index()

for _, row in selected.iterrows():
    idx = row['index']
    filename = doi_safe_filename(row.get('doi'), row.get('title'), idx)
    dest = os.path.join(PDF_DIR, filename)
    # prefer openAccessPdf if present, else semanticUrl
    urls = []
    if row.get('openAccessPdf'):
        urls.append(row.get('openAccessPdf'))
    if row.get('semanticUrl'):
        urls.append(row.get('semanticUrl'))
    selected_rows.append(row)
    # create candidate tuples (first url first)
    for u in urls:
        candidates.append((u, dest))

print(f"Prepared {len(selected_rows)} papers and {len(candidates)} download attempts (fallbacks included).")
selected.head()


Prepared 0 papers and 0 download attempts (fallbacks included).


Unnamed: 0,index,paperId,title,authors,authors_list,year,venue,citationCount,isOpenAccess,openAccessPdf,semanticUrl,doi,abstract


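This cell chooses the best papers and prepares their download links.

It filters papers by year and citations; rows with a missing year are dropped whenever YEAR_FROM is set.

It picks the top 4 most cited papers.

For each selected paper, it creates a safe filename.

Then it collects the candidate URLs (the open-access PDF first, the Semantic Scholar page second). The second URL is a web page rather than a PDF, so it is only a last resort.

These URLs will be used later to download the PDFs.

In the recorded run above, no row had a year (the fallback search returned only paperId and title), so the year filter removed every paper and 0 downloads were prepared.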

Run parallel downloads and record results

In [43]:
# Cell 13 — perform parallel downloads and record outcomes
download_results = parallel_download(candidates, max_workers=3)

# Build a summary map from dest -> (ok, err, url)
dest_map = {}
for ok, err, url, dest in download_results:
    if dest not in dest_map:
        dest_map[dest] = {"ok": ok, "err": err, "url": url}
    else:
        # if already had False and now True, update
        if ok:
            dest_map[dest] = {"ok": ok, "err": err, "url": url}

# Build downloads_df for selected rows
download_records = []
for row in selected_rows:
    idx = row['index']
    filename = doi_safe_filename(row.get('doi'), row.get('title'), idx)
    dest = os.path.join(PDF_DIR, filename)
    rec = dest_map.get(dest, {"ok": False, "err": "not attempted", "url": None})
    num_authors, first_author = author_stats(row['authors_list'])
    download_records.append({
        "index": idx,
        "title": row['title'],
        "doi": row.get('doi'),
        "authors": row['authors'],
        "num_authors": num_authors,
        "first_author": first_author,
        "year": row.get('year'),
        "citationCount": int(row.get('citationCount') or 0),
        "isOpenAccess": row.get('isOpenAccess'),
        "downloaded": rec['ok'],
        "saved_path": dest if rec['ok'] else None,
        "used_url": rec['url'],
        "error": rec['err']
    })

downloads_df = pd.DataFrame(download_records)
downloads_df


parallel downloading: 0it [00:00, ?it/s]


This cell downloads the selected PDFs and keeps track of what happened.

What it does:

Runs the parallel downloader created earlier.

Saves whether each PDF:

downloaded successfully

failed

which URL was used

where the file was saved

Also adds extra metadata like:

number of authors

first author

citation count

year

Finally, it stores everything in a downloads_df table so you can see which papers were downloaded.

Extract text from PDFs with OCR(Optical Character Recognition) fallback (Tesseract) — optional OCR install

In [44]:
# Cell 14 — Extract text; if plain extraction is empty, optionally use OCR (pytesseract/pdf2image)
# Note: OCR steps are slower and may need apt install in Colab; only run if needed.

try:
    from pdf2image import convert_from_path
    import pytesseract
    ocr_available = True
except Exception:
    ocr_available = False

def extract_text_with_ocr_fallback(pdf_path):
    # Try PyMuPDF first
    text = ""
    try:
        doc = fitz.open(pdf_path)
        text = "\n\n".join([p.get_text("text") for p in doc])
    except Exception as e:
        text = ""
    if text and len(text) > 200:
        return text
    # fallback to OCR if available
    if ocr_available:
        try:
            pages = convert_from_path(pdf_path, dpi=200)
            ocr_texts = []
            for p in pages:
                ocr_texts.append(pytesseract.image_to_string(p))
            full = "\n\n".join(ocr_texts)
            return full
        except Exception as e:
            logger.warning(f"OCR failed for {pdf_path}: {e}")
            return text
    return text

# Run extraction for downloaded files
text_records = []
for _, r in downloads_df.iterrows():
    txt_path = None
    txt_len = 0
    if r['downloaded'] and r['saved_path']:
        txt = extract_text_with_ocr_fallback(r['saved_path'])
        if txt:
            txt_path = os.path.join(TEXT_DIR, os.path.basename(r['saved_path']).replace('.pdf','.txt'))
            with open(txt_path, "w", encoding="utf-8") as f:
                f.write(txt)
            txt_len = len(txt)
    text_records.append({
        "index": r['index'],
        "saved_pdf": r['saved_path'],
        "text_path": txt_path,
        "text_len": txt_len
    })

texts_df = pd.DataFrame(text_records)
texts_df


This cell extracts text from every downloaded PDF.

What it does:

Tries normal PDF text extraction using PyMuPDF (fast and accurate).

If the extracted text is empty or very short (200 characters or fewer), which usually means the PDF is scanned:

It uses OCR (Optical Character Recognition) with pytesseract + pdf2image to read text from images inside the PDF.

Saves the extracted text into a .txt file.

Records:

text file path

text length

which PDF it came from

All results are stored in texts_df so you can see which PDFs were successfully extracted.

Extractive summarizer (TF-IDF sentence ranking)

In [45]:
# Cell 15 — extractive summarizer: choose top 3 sentences by TF-IDF
import nltk
def extractive_summary(text, n_sentences=3):
    sents = nltk.sent_tokenize(text)
    if len(sents) <= n_sentences:
        return " ".join(sents)
    try:
        vec = TfidfVectorizer(stop_words='english')
        X = vec.fit_transform(sents)
        scores = X.sum(axis=1).A1
        top_idxs = scores.argsort()[-n_sentences:][::-1]
        top_sorted = sorted(top_idxs)
        return " ".join([sents[i] for i in top_sorted])
    except Exception as e:
        return " ".join(sents[:n_sentences])

# Build summaries for texts
summary_records = []
for _, r in texts_df.iterrows():
    summ = ""
    if r['text_path'] and r['text_len'] > 80:
        with open(r['text_path'], "r", encoding="utf-8") as f:
            txt = f.read()
        summ = extractive_summary(txt, n_sentences=3)
    summary_records.append({"index": r['index'], "summary": summ})
summaries_df = pd.DataFrame(summary_records)
summaries_df


This cell creates a simple extractive summarizer.

What it does:

Splits the paper text into sentences.

Uses TF-IDF to score each sentence: a sentence's score is the sum of the TF-IDF weights of its words.

Picks the 3 highest-scoring sentences and keeps them in their original order.

Joins them together as a summary.

Saves all summaries into summaries_df.

This gives a quick, automatic summary for every extracted research paper.
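The scoring idea can be shown in pure Python. This is a simplified stand-in, not the notebook's scikit-learn code: it uses the smoothed IDF formula that TfidfVectorizer applies by default, but skips stop-word removal and row normalization. score_sentences and top_sentences are hypothetical names, and the sentences are made up.

```python
import math
import re
from collections import Counter


def score_sentences(sentences):
    """Return one summed TF-IDF score per sentence (simplified variant)."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each word occur?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        scores.append(sum(c * (math.log((1 + n) / (1 + df[t])) + 1)
                          for t, c in tf.items()))
    return scores


def top_sentences(sentences, k):
    """Pick the k best-scoring sentences and return them in document order."""
    scores = score_sentences(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


sents = [
    "Cats sleep.",
    "Transformers summarize long research papers well.",
    "Cats purr.",
]
print(top_sentences(sents, 1))
# ['Transformers summarize long research papers well.']
```

Because the score is a sum over words, longer sentences tend to win; dividing each score by the sentence length is one way to reduce that bias.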

APA references & combine metadata into a final CSV

In [46]:
# Fix for KeyError: 'index' during merge — ensures every DF has an 'index' column, then merges.
# Works for: selected, downloads_df, texts_df, summaries_df, apa_df

# 1) Helper to ensure 'index' column exists
def ensure_index_column(df, df_name="<df>"):
    df2 = df.copy()
    if 'index' not in df2.columns:
        # reset_index() will create an 'index' column from the current index
        df2 = df2.reset_index()
        # If reset_index created a column with another name (rare), ensure 'index' exists
        if 'index' not in df2.columns:
            df2['index'] = df2.index + 1  # 1-based fallback
    return df2

# 2) Apply to all DataFrames we plan to merge
selected_e = ensure_index_column(selected, "selected")
downloads_e = ensure_index_column(downloads_df, "downloads_df")
texts_e = ensure_index_column(texts_df, "texts_df")
summaries_e = ensure_index_column(summaries_df, "summaries_df")
apa_e = ensure_index_column(apa_df, "apa_df")

# Optional: show columns for debugging
print("Columns (selected):", selected_e.columns.tolist())
print("Columns (downloads):", downloads_e.columns.tolist())
print("Columns (texts):", texts_e.columns.tolist())
print("Columns (summaries):", summaries_e.columns.tolist())
print("Columns (apa):", apa_e.columns.tolist())

# 3) Merge step-by-step (catch and print any merge mismatches)
meta = selected_e.merge(downloads_e, on='index', how='left', suffixes=('_sel','_dl'))
meta = meta.merge(texts_e, on='index', how='left', suffixes=('','_txt'))
meta = meta.merge(summaries_e, on='index', how='left', suffixes=('','_sum'))
meta = meta.merge(apa_e, on='index', how='left', suffixes=('','_apa'))

# 4) Quick sanity checks
print("\nMerged rows:", len(meta))
print("Sample merged columns:", list(meta.columns)[:20])
display(meta.head(6))

# 5) Save merged CSV
out_csv = os.path.join(OUT_ROOT, "papers_metadata.csv")
meta.to_csv(out_csv, index=False)
print("Saved merged metadata CSV to:", out_csv)



Columns (selected): ['index', 'paperId', 'title', 'authors', 'authors_list', 'year', 'venue', 'citationCount', 'isOpenAccess', 'openAccessPdf', 'semanticUrl', 'doi', 'abstract']
Columns (downloads): ['index']
Columns (texts): ['index']
Columns (summaries): ['index']
Columns (apa): ['index']

Merged rows: 0
Sample merged columns: ['index', 'paperId', 'title', 'authors', 'authors_list', 'year', 'venue', 'citationCount', 'isOpenAccess', 'openAccessPdf', 'semanticUrl', 'doi', 'abstract']


Unnamed: 0,index,paperId,title,authors,authors_list,year,venue,citationCount,isOpenAccess,openAccessPdf,semanticUrl,doi,abstract


Saved merged metadata CSV to: milestone1_output/papers_metadata.csv


What this cell does

It ensures every intermediate table has an index column, then merges the selected papers, download info, extracted texts, summaries and APA references into one final table (meta) and saves it as papers_metadata.csv.

Why it’s needed

Some tables did not have an index column, which made the merge fail with a KeyError, so this cell normalizes them first and then joins them on that shared key. Note that apa_df, the table of APA references, is not created in any cell shown above and must exist before this cell runs.

What to say in the demo

“I made sure each partial table has a common key (index), merged them step-by-step into one dataset, checked the result, and saved the final papers_metadata.csv for further analysis.”

Best-paper detector (explainable heuristic)

In [47]:
# Cell 17 — compute simple "best" score and flag best paper
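def best_score_calc(citations, year, is_open):
    now = pd.Timestamp.now().year
    # Treat missing values (None or NaN from pandas) as "no citations" / "10 years old";
    # a NaN year would otherwise be scored as if the paper were brand new.
    citations = 0 if pd.isna(citations) else citations
    year = (now - 10) if pd.isna(year) else year
    cit_score = min(citations / 100.0, 1.0)
    recency = max(0, now - year)
    recency_score = max(0, 1 - (recency / 10.0))
    open_score = 1 if (not pd.isna(is_open) and is_open) else 0
    return 0.6*cit_score + 0.3*recency_score + 0.1*open_score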

scores = []
for _, r in meta.iterrows():
    # use available fields if present
    citations = r.get('citationCount') if 'citationCount' in r else r.get('citationCount_y', 0)
    year = r.get('year') if 'year' in r else r.get('year_y', None)
    is_open = r.get('isOpenAccess') if 'isOpenAccess' in r else r.get('isOpenAccess_y', False)
    s = best_score_calc(citations, year, is_open)
    scores.append(s)
meta['best_score'] = scores
meta['is_best'] = meta['best_score'] == meta['best_score'].max()
meta[['title','citationCount','best_score','is_best']]


Unnamed: 0,title,citationCount,best_score,is_best


This cell creates a “best paper” score for every research paper.

How it works:

It gives each paper a score based on:

citations (60% weight)

recency (year) (30% weight)

open-access availability (10% weight)

Then it calculates this score for each paper, adds it to the table, and marks the paper with the highest score as is_best = True.

Why this is useful:

It automatically identifies the most impactful + recent + accessible research paper in your dataset.
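A worked example of the weighting, with the current year passed in as a parameter so the numbers are reproducible. The function name best_score, the now argument, and the sample values are assumptions made for illustration:

```python
def best_score(citations, year, is_open, now):
    """Same formula as best_score_calc(), with the current year passed in."""
    cit_score = min((citations or 0) / 100.0, 1.0)   # saturates at 100 citations
    age = max(0, now - year)
    recency_score = max(0.0, 1 - age / 10.0)         # 0 once a paper is 10+ years old
    open_score = 1 if is_open else 0
    return 0.6 * cit_score + 0.3 * recency_score + 0.1 * open_score


# 50 citations, 2 years old, open access:
# 0.6 * 0.5 + 0.3 * 0.8 + 0.1 * 1 = 0.30 + 0.24 + 0.10
print(round(best_score(50, 2023, True, now=2025), 2))    # 0.64

# 500 citations, 15 years old, not open access:
# 0.6 * 1.0 + 0.3 * 0.0 + 0.1 * 0 = 0.60
print(round(best_score(500, 2010, False, now=2025), 2))  # 0.6
```

The cap at 100 citations means a recent, moderately cited open-access paper can outrank a much older paper with far more citations.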

(Optional) Build embeddings + FAISS semantic search (install may have been slow)

In [48]:
# Cell 18 — Embeddings + FAISS semantic search (optional, may be slow)
# Only run if sentence-transformers and faiss installed successfully.
try:
    from sentence_transformers import SentenceTransformer
    import faiss
    emb_model = SentenceTransformer('all-MiniLM-L6-v2')
    texts = meta['abstract'].fillna("").astype(str).tolist()
    # fallback: if an abstract is missing, use the extractive summary (column 'summary' from summaries_df)
    texts = [t if t.strip() else (meta.iloc[i]['summary'] or "") for i,t in enumerate(texts)]
    embs = emb_model.encode(texts, convert_to_numpy=True)
    dim = embs.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(embs)
    print("FAISS index built with dimension:", dim)
except Exception as e:
    print("Embeddings/FAISS not available or failed to build:", e)
    index = None


Embeddings/FAISS not available or failed to build: tuple index out of range


This cell adds an optional AI feature:
semantic search using embeddings + FAISS.

What it does:

Loads the SentenceTransformer model (all-MiniLM-L6-v2).

Creates embeddings for each paper’s:

abstract, or

summary (if abstract is missing)

Stores these embeddings in a FAISS index (a fast vector search engine).

Why it’s useful:

This lets you later search papers by meaning rather than by exact keywords.
For example, the query “papers about transformer summarization” returns the papers whose abstracts are closest to it in embedding space.
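
As a rough picture of what the flat L2 index does: faiss.IndexFlatL2 performs an exact, brute-force comparison of the query vector against every stored vector, using squared L2 distance. The sketch below reproduces that idea in plain Python on toy 3-dimensional vectors. The function name flat_l2_search and the vectors are made up for illustration; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.

```python
def flat_l2_search(vectors, query, k=2):
    """Return (distance, index) pairs for the k stored vectors closest to query."""
    scored = []
    for i, v in enumerate(vectors):
        # Squared L2 distance between the stored vector and the query.
        dist = sum((a - b) ** 2 for a, b in zip(v, query))
        scored.append((dist, i))
    scored.sort()          # smallest distance first
    return scored[:k]

toy_vectors = [
    [1.0, 0.0, 0.0],   # stands in for the embedding of paper 0
    [0.0, 1.0, 0.0],   # paper 1
    [0.9, 0.1, 0.0],   # paper 2
]
hits = flat_l2_search(toy_vectors, [1.0, 0.0, 0.0], k=2)
nearest_ids = [i for _, i in hits]
print(nearest_ids)
```

A brute-force scan like this is exact but grows linearly with the number of papers, which is fine for the handful of papers this notebook handles.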

Small function to query embedding index (if built)

In [49]:
# Cell 19 — semantic query helper (run only if index built)
def semantic_query(q, k=5):
    if index is None:
        print("Index not available. Run the embedding cell first.")
        return []
    q_emb = emb_model.encode([q])
    D, I = index.search(q_emb, k)
    results = []
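    for idx in I[0]:
        # FAISS pads with -1 when fewer than k vectors exist; a negative idx would wrap around in iloc
        if 0 <= idx < len(meta):
            results.append((idx, meta.iloc[idx]['title'], meta.iloc[idx]['apa_reference']))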
    return results

# Example (uncomment to run):
# print(semantic_query("transformer summarization", k=3))


This cell creates a semantic search function that lets you search papers by meaning, not keywords.

What it does:

Takes your search text (example: “transformer summarization”).

Converts it into an embedding using the same model as before.

Searches the FAISS index for the most similar papers.

Returns:

the paper’s row number

the paper title

its APA reference

Why it’s useful:

It lets you type a natural-language query and get back the papers whose embeddings are closest to it.
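
One detail worth handling when reading search results: when the index holds fewer than k vectors, FAISS pads the returned ids with -1. The sketch below shows one way to turn a row of ids and distances into result tuples while skipping that padding. The helper name collect_hits and the sample data are hypothetical; the lists stand in for I[0], D[0], and the title column.

```python
def collect_hits(ids, distances, titles):
    """Pair each valid id with its title and distance, dropping -1 padding."""
    hits = []
    for idx, dist in zip(ids, distances):
        idx = int(idx)
        # Negative ids are padding; without this check, -1 would select the last row.
        if 0 <= idx < len(titles):
            hits.append((idx, titles[idx], float(dist)))
    return hits

titles = ["Paper A", "Paper B"]          # placeholder titles
# Three ids were requested but only two papers exist, so the last id is padding.
hits = collect_hits([1, 0, -1], [0.12, 0.40, 3.4e38], titles)
print(hits)
```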

Save manifest & README (final reproducibility step)

In [50]:
# Replacement Cell 20 — robust manifest + README writer (handles missing 'downloaded' column)

import os, json

def infer_num_downloaded_from_df(df):
    """Try multiple heuristics to compute how many files were downloaded."""
    if df is None:
        return 0
    # 1) direct 'downloaded' boolean column
    for cname in ['downloaded', 'Downloaded', 'is_downloaded', 'success', 'ok']:
        if cname in df.columns:
            try:
                return int(df[cname].astype(bool).sum())
            except Exception:
                pass
    # 2) common alternative names created by merges
    for cname in df.columns:
        if 'download' in cname.lower() and df[cname].dtype == 'bool':
            return int(df[cname].sum())
    # 3) check for saved path column(s)
    for cname in ['saved_path', 'saved_path_x', 'saved_path_y', 'saved', 'path']:
        if cname in df.columns:
            return int(df[cname].notnull().sum())
    # 4) any column that looks like a path (strings containing '.pdf')
    for cname in df.columns:
        if df[cname].dtype == object:
            sample = df[cname].dropna().astype(str)
            if not sample.empty and sample.str.contains(r'\.pdf$', case=False, regex=True).any():
                return int(sample.str.contains(r'\.pdf$', case=False, regex=True).sum())
    # 5) fallback: length 0
    return 0

# Try to detect downloads_df and selected; if not present, fallback to scanning PDF_DIR
try:
    _downloads_df = downloads_df  # may raise NameError if not defined
except Exception:
    _downloads_df = None

try:
    _selected = selected
except Exception:
    _selected = None

# Compute num_downloaded using best available source
num_downloaded = 0
if _downloads_df is not None:
    num_downloaded = infer_num_downloaded_from_df(_downloads_df)
elif os.path.exists(PDF_DIR):
    # fallback: count pdf files in PDF_DIR
    pdf_files = [f for f in os.listdir(PDF_DIR) if f.lower().endswith('.pdf')]
    num_downloaded = len(pdf_files)
else:
    num_downloaded = 0

# Also compute num_selected robustly
num_selected = len(_selected) if _selected is not None else 0

# Build manifest dict
manifest = {
    "topic": topic if 'topic' in globals() else None,
    "date": pd.Timestamp.now().isoformat() if 'pd' in globals() else time.strftime("%Y-%m-%dT%H:%M:%S"),
    "limit": limit if 'limit' in globals() else None,
    "top_n": TOP_N if 'TOP_N' in globals() else None,
    "min_citations": MIN_CITATIONS if 'MIN_CITATIONS' in globals() else None,
    "year_from": YEAR_FROM if 'YEAR_FROM' in globals() else None,
    "num_selected": int(num_selected),
    "num_downloaded": int(num_downloaded)
}

# Save manifest.json and README.txt
os.makedirs(OUT_ROOT, exist_ok=True)
manifest_path = os.path.join(OUT_ROOT, "manifest.json")
with open(manifest_path, "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2)

readme_path = os.path.join(OUT_ROOT, "README.txt")
with open(readme_path, "w", encoding="utf-8") as f:
    f.write("Milestone1 output\n")
    f.write("-----------------\n")
    f.write(f"Topic: {manifest['topic']}\n")
    f.write(f"Date: {manifest['date']}\n")
    f.write(f"Requested limit: {manifest['limit']}\n")
    f.write(f"Top N selected: {manifest['top_n']}\n")
    f.write(f"Min citations filter: {manifest['min_citations']}\n")
    f.write(f"Year from filter: {manifest['year_from']}\n")
    f.write(f"Number selected: {manifest['num_selected']}\n")
    f.write(f"Number downloaded (inferred): {manifest['num_downloaded']}\n")
    f.write("\nFolder contents:\n")
    f.write(" - pdfs/: downloaded pdf files\n")
    f.write(" - texts/: extracted text files\n - papers_metadata.csv: merged metadata\n - manifest.json: run metadata\n - pipeline.log: runtime log (if present)\n")

print("Manifest written to:", manifest_path)
print(json.dumps(manifest, indent=2))
print("README written to:", readme_path)


Manifest written to: milestone1_output/manifest.json
{
  "topic": "ai generated model for summarizing research paper models",
  "date": "2025-12-11T11:59:09.837593",
  "limit": 12,
  "top_n": 4,
  "min_citations": 0,
  "year_from": 2015,
  "num_selected": 0,
  "num_downloaded": 0
}
README written to: milestone1_output/README.txt


Creates a manifest and README that record the run: topic, filters, how many papers were selected, and how many PDFs were downloaded. It also saves these two files into the output folder.
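
To show how the manifest can be used later, here is a small sketch that writes a manifest and reads it back to check a run. It writes to a temporary directory rather than the notebook's output folder, and the values are placeholders, not results from this run.

```python
import json
import os
import tempfile

# Placeholder values for illustration only.
manifest = {"topic": "example topic", "top_n": 4, "num_selected": 4, "num_downloaded": 3}

with tempfile.TemporaryDirectory() as out_dir:
    path = os.path.join(out_dir, "manifest.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
    # Read the file back, as a later notebook or script would.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

missing = loaded["num_selected"] - loaded["num_downloaded"]
print(f"{missing} selected paper(s) have no downloaded PDF")
```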

(Optional) Gradio mini-UI wrapper for interactive demo

In [52]:
# Cell 21 — small Gradio UI to run the pipeline interactively (optional)
import gradio as gr

def gradio_run(topic_input, top_n, min_citations, year_from, limit):
    global topic, TOP_N, MIN_CITATIONS, YEAR_FROM
    topic = topic_input
    TOP_N = int(top_n)
    MIN_CITATIONS = int(min_citations)
    YEAR_FROM = int(year_from)
    df_msg, msg = run_small_pipeline_for_ui(topic, TOP_N, MIN_CITATIONS, YEAR_FROM, int(limit))
    return msg, df_msg

# We'll create a very small wrapper version of the pipeline to keep UI responsive
def run_small_pipeline_for_ui(topic_in, top_n, min_citations, year_from, limit):
    # reuse semantic_scholar_search and selected/top-N building (simplified)
    raw_res = semantic_scholar_search(topic_in, limit=limit)
    data = raw_res.get("data", []) if isinstance(raw_res, dict) else (raw_res or [])
    rows = []
    for i,p in enumerate(data, start=1):
        rows.append({
            "index": i,
            "title": p.get("title"),
            "authors_list": p.get("authors", []),
            "year": p.get("year"),
            "citationCount": p.get("citationCount") or 0,
            "isOpenAccess": p.get("isOpenAccess"),
            "openAccessPdf": (p.get("openAccessPdf") or {}).get("url") if p.get("openAccessPdf") else None,
            "semanticUrl": p.get("url"),
            "doi": p.get("doi"),
            "abstract": p.get("abstract")
        })
    df = pd.DataFrame(rows)
    cols = ['title','year','citationCount','isOpenAccess','doi']
    if df.empty:
        # no results: return an empty table instead of raising KeyError on df['year']
        return pd.DataFrame(columns=cols), f"No papers found for '{topic_in}'"
    if year_from and year_from>0:
        df = df[df['year'].notnull() & (df['year'] >= year_from)]
    if min_citations and min_citations>0:
        df = df[df['citationCount'] >= min_citations]
    df = df.sort_values(by='citationCount', ascending=False).reset_index(drop=True).head(top_n)
    # return a small display dataframe
    disp = df[cols].copy()
    return disp, f"Found {len(disp)} papers for '{topic_in}'"

# Gradio UI layout (simple)
with gr.Blocks() as demo:
    gr.Markdown("### Mini UI — Search Semantic Scholar and preview top-N results")
    with gr.Row():
        topic_box = gr.Textbox(label="Topic", value=topic)
        topn = gr.Slider(minimum=1, maximum=10, value=TOP_N, step=1, label="Top N")
    with gr.Row():
        minc = gr.Number(value=MIN_CITATIONS, label="Min citations")
        yearfrom = gr.Number(value=YEAR_FROM, label="Year from (0 to disable)")
        limit_s = gr.Slider(minimum=5, maximum=50, value=limit, step=1, label="API limit")
    run_btn = gr.Button("Run quick search")
    out_msg = gr.Textbox(label="Status")
    out_table = gr.Dataframe()
    run_btn.click(fn=gradio_run, inputs=[topic_box, topn, minc, yearfrom, limit_s], outputs=[out_msg, out_table])

demo.launch(share=False)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.
* To create a public link, set `share=True` in `launch()`.



Adds a small Gradio UI that lets you type a topic, set filters, run a quick Semantic Scholar search, and preview the top-N papers in a table—without running the full pipeline.
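
The filtering inside the quick search follows a simple pattern: drop papers older than the year cutoff, drop papers below the citation threshold, sort by citations, and keep the top N. Below is a plain-Python sketch of the same logic. The helper name filter_top_n and the sample records are hypothetical and stand in for the API response.

```python
def filter_top_n(papers, top_n, min_citations=0, year_from=0):
    """Filter by year and citations, then return the top_n most cited papers."""
    kept = []
    for p in papers:
        year = p.get("year")
        cites = p.get("citationCount") or 0   # treat a missing count as 0
        # A year_from of 0 disables the year filter; undated papers are dropped otherwise.
        if year_from > 0 and (year is None or year < year_from):
            continue
        if cites < min_citations:
            continue
        kept.append({**p, "citationCount": cites})
    kept.sort(key=lambda p: p["citationCount"], reverse=True)
    return kept[:top_n]

# Placeholder records, not real search results.
sample = [
    {"title": "Old classic", "year": 2010, "citationCount": 900},
    {"title": "Recent survey", "year": 2021, "citationCount": 120},
    {"title": "New preprint", "year": 2024, "citationCount": None},
    {"title": "Undated", "year": None, "citationCount": 50},
]
top = filter_top_n(sample, top_n=2, min_citations=0, year_from=2015)
top_titles = [p["title"] for p in top]
print(top_titles)
```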

Cell 1 — Setup folders & logging

“This cell creates all required folders for PDFs, text, and output. It also sets up logging so we can record everything the pipeline does.”

Cell 2 — Install & import libraries

“This installs the libraries we need, such as PyMuPDF for PDF reading, scikit-learn for TF-IDF summarization, and requests for calling the Semantic Scholar API. Then it imports everything into the notebook.”

Cell 3 — NLTK download

“This downloads the NLTK sentence tokenizer, which allows us to split text into sentences for summarization.”

Cell 4 — API key input

“This cell asks the user to enter their Semantic Scholar API Key securely so the script can make API requests.”

Cell 5 — Search setup functions

“This defines helper functions so we can send queries to the Semantic Scholar API reliably and handle errors or missing data.”
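
One common way to make API calls reliable is a retry decorator with exponential backoff. The sketch below is an assumption about how such a helper could look, not the notebook's actual implementation. The names retry_with_backoff and flaky_search are hypothetical, and the fake function stands in for a real HTTP request.

```python
import time
from functools import wraps

def retry_with_backoff(max_tries=3, base_delay=0.01, exceptions=(Exception,)):
    """Retry the wrapped function, doubling the wait after each failure."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_tries + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == max_tries:
                        raise          # out of attempts: surface the error
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator

calls = {"count": 0}

@retry_with_backoff(max_tries=3, base_delay=0.01, exceptions=(ConnectionError,))
def flaky_search(query):
    # Fails twice, then succeeds, to imitate a temporary network problem.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"data": [{"title": f"Result for {query}"}]}

result = flaky_search("summarization")
print(calls["count"], result)
```

For a real API the base delay would normally be on the order of a second, and rate-limit responses are usually worth retrying as well.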

Cell 6 — Search query execution

“This cell sends the actual topic query to Semantic Scholar and retrieves research papers based on the limit selected.”

Cell 7 — Normalize and structure results

“This converts the API output into a clean dataframe with columns like title, year, citations, DOI, open-access link, etc.”

Cell 8 — Filter papers

“This filters papers based on year, citation count, and sorts them. Then it selects the final top-N papers we want to download.”

Cell 9 — Display selected papers

“This shows the selected top papers in a table so we can verify what will be downloaded.”

Cell 10 — PDF download function

“This cell defines a function that downloads each PDF using the open-access URL provided by Semantic Scholar.”

Cell 11 — Run PDF download loop

“This attempts to download each selected paper and stores download status in a dataframe.”

Cell 12 — Build preliminary APA references

“This generates simple APA-style references using whatever metadata we have (title, year, authors, DOI).”
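
A simplified sketch of how such a reference string might be assembled. The helper name simple_apa and the sample record are hypothetical, and the format is only an approximation of APA style: it does not convert given names to initials or shorten long author lists.

```python
def simple_apa(authors, year, title, doi=None):
    """Build an approximate APA-style reference from whatever fields are present."""
    names = [a for a in (authors or []) if a]
    if not names:
        author_part = "Unknown author"
    elif len(names) == 1:
        author_part = names[0]
    else:
        # Join all but the last author with commas, then add "& last author".
        author_part = ", ".join(names[:-1]) + ", & " + names[-1]
    year_part = f"({year})" if year else "(n.d.)"
    title_part = (title or "Untitled").rstrip(".")
    ref = f"{author_part} {year_part}. {title_part}."
    if doi:
        ref += f" https://doi.org/{doi}"
    return ref

# Placeholder metadata, not a real paper.
ref = simple_apa(["Doe, J.", "Roe, R."], 2021, "An example paper", doi="10.0000/example")
print(ref)
```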

Cell 13 — Prepare text extraction folders

“This ensures the text output folder exists so extracted text from PDFs can be saved.”

Cell 14 — Extract PDF text (with OCR fallback)

“This extracts text from each PDF using PyMuPDF. If the PDF is scanned or unreadable, it tries OCR as a backup.”

Cell 15 — Summaries (extractive)

“This takes the extracted text and produces a short extractive summary using TF-IDF to choose the top 3 most important sentences.”
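
The idea can be sketched without any libraries: score each sentence by the sum of the TF-IDF weights of its words, keep the highest-scoring sentences, and return them in their original order. This is a simplified stand-in, not the notebook's summarizer. It uses a regex sentence splitter instead of NLTK's punkt and a hand-rolled TF-IDF instead of scikit-learn's TfidfVectorizer, and the name tfidf_summary is hypothetical.

```python
import math
import re
from collections import Counter

def tfidf_summary(text, n_sentences=3):
    """Return the n_sentences highest-scoring sentences, in document order."""
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= n_sentences:
        return sentences
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    # Document frequency: number of sentences that contain each word.
    df = Counter(w for words in tokenized for w in set(words))
    n = len(sentences)
    scores = []
    for words in tokenized:
        tf = Counter(words)
        # Smoothed IDF, so words that appear in every sentence keep a small weight.
        scores.append(sum(c * (math.log((1 + n) / (1 + df[w])) + 1) for w, c in tf.items()))
    # Pick the best-scoring sentence indices, then restore document order.
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(best)]

text = ("Short one. "
        "Transformers use attention to model long documents for summarization tasks. "
        "Ok. "
        "Extractive methods select existing sentences while abstractive methods generate new text.")
summary = tfidf_summary(text, n_sentences=2)
print(summary)
```

Because the score is a plain sum, longer sentences tend to win. Dividing by sentence length is one way to reduce that bias.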

Cell 16 — Fix index column and merge everything

“This merges all data sources — selected papers, downloads, extracted text, summaries, APA references — into one master metadata file.”

Cell 17 — Best paper scoring

“This calculates a simple score that ranks papers based on citations, recency, and open-access availability, then flags the best one.”

Cell 18 — Embeddings + FAISS index (optional)

“This computes semantic embeddings from abstracts and builds a FAISS index so we can search papers by meaning, not keywords.”

Cell 19 — Semantic query function

“This provides a function to ask semantic questions like ‘best transformer summarization paper’ and get relevant results.”

Cell 20 — Manifest + README

“This generates metadata files like manifest.json and README.txt which store run details — topic, filters, number downloaded, etc.”

Cell 21 — Gradio UI

“This builds a small interactive user interface so anyone can input a topic and quickly preview top papers without running the whole pipeline.”

# MILESTONE 2

