# Semantic Scholar KG Enrichment: USES_DATASET

### Overview
In this notebook, we will:
1. Fetch ML-related papers from the Semantic Scholar API.
2. Perform Exploratory Data Analysis (EDA).
3. Extract dataset mentions from abstracts with a simple keyword match against a list of known datasets (spaCy is loaded for an optional NER extension).
4. Build a Neo4j knowledge graph with `USES_DATASET` relationships.
5. Query and visualize insights about dataset usage.

**Author**: João Pereira  
**Date**: 2025-03-01


In [None]:
# ============================================
# 1. Imports and Initial Setup
# ============================================

import requests
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json

# For NLP
import spacy

# For Neo4j
from py2neo import Graph, Node, Relationship

# Load spaCy model (English)
nlp = spacy.load("en_core_web_sm")

# Configure display
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)

# Neo4j Credentials
NEO4J_URI = "bolt://localhost:7687"  # or "bolt://neo4j:7687" if in Docker
NEO4J_USER = "neo4j"
NEO4J_PASS = "password123"

try:
    graph = Graph(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASS))
    print("Connected to Neo4j successfully!")
except Exception as e:
    print(f"Error connecting to Neo4j: {e}")

In [None]:
# ============================================
# 2. Fetch Data from Semantic Scholar
# ============================================

import requests
import json

# Semantic Scholar API does NOT require an API Key for basic queries
API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def fetch_papers(query, offset=0, limit=100):
    """
    Fetches papers from the Semantic Scholar API given a query.
    """
    params = {
        "query": query,
        "fields": "paperId,title,abstract,year,authors,referenceCount,citationCount",
        "offset": offset,
        "limit": limit
    }
    
    # No API key required for basic search; the timeout keeps a stalled request from hanging the cell
    resp = requests.get(API_URL, params=params, timeout=30)

    print(f"🔹 Requesting: {resp.url}")  # Debugging: See the full request URL
    print(f"🔹 Status Code: {resp.status_code}")  # Debugging: Print response status
    
    if resp.status_code == 200:
        resp_data = resp.json()
        return resp_data.get("data", [])  # Extract papers from response
    else:
        print(f"❌ API Error {resp.status_code}: {resp.text}")
        return []

# Test with a single request before looping
papers_batch = fetch_papers("machine learning", offset=0, limit=10)
print(f"✅ Papers fetched: {len(papers_batch)}")

In [None]:
import requests
import json
import time

API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def fetch_papers(query, offset=0, limit=100, max_retries=5):
    """
    Fetches papers from the Semantic Scholar API given a query.
    Retries with exponential backoff when rate-limited (HTTP 429)
    or when the request itself fails (timeout, connection error).
    """
    params = {
        "query": query,
        "fields": "paperId,title,abstract,year,authors,referenceCount,citationCount",
        "offset": offset,
        "limit": limit
    }

    wait_time = 5  # Start with 5-second delay for retries

    for _ in range(max_retries):
        try:
            resp = requests.get(API_URL, params=params, timeout=30)
        except requests.RequestException as e:
            print(f"⚠️ Request failed: {e}. Retrying in {wait_time} seconds...")
            time.sleep(wait_time)
            wait_time *= 2
            continue

        print(f"🔹 Requesting: {resp.url}")
        print(f"🔹 Status Code: {resp.status_code}")

        if resp.status_code == 200:
            papers = resp.json().get("data", [])
            print(f"✅ Successfully fetched {len(papers)} papers")
            return papers

        if resp.status_code == 429:
            print(f"⚠️ Rate limit exceeded. Retrying in {wait_time} seconds...")
            time.sleep(wait_time)  # Wait before retrying
            wait_time *= 2  # Exponential backoff
            continue

        # Any other status is treated as non-retryable
        print(f"❌ API Error {resp.status_code}: {resp.text}")
        return []

    print("❌ Maximum retries reached. Skipping batch.")
    return []

# Fetching multiple batches
all_papers = []
num_batches = 10  

for i in range(num_batches):
    offset = i * 100
    papers_batch = fetch_papers("machine learning", offset=offset, limit=100)

    all_papers.extend(papers_batch)
    print(f"✅ Fetched {len(papers_batch)} papers in batch {i+1}, Total: {len(all_papers)}")

    time.sleep(3)  # Add a small delay between successful requests

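# Save results (create the target directory first so the write does not fail on a missing path)
import os
os.makedirs("/data/raw", exist_ok=True)
with open("/data/raw/papers_ml.json", "w", encoding="utf-8") as f:
    json.dump(all_papers, f, indent=4)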

print(f"✅ Total papers fetched: {len(all_papers)}")
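
The retry loop above doubles its wait after every HTTP 429. The sketch below isolates that backoff schedule so it can be checked without touching the network. The names `call_with_backoff`, `do_request`, and `sleep` are illustrative assumptions, not part of the notebook or of `requests`; the request and the sleep function are injected as plain callables.

```python
import time


def call_with_backoff(do_request, max_retries=5, initial_wait=5, sleep=time.sleep):
    """Call do_request() until it returns a status other than 429.

    do_request is any zero-argument callable returning (status_code, payload).
    Returns the payload on HTTP 200 and an empty list otherwise.
    """
    wait_time = initial_wait
    for _ in range(max_retries):
        status, payload = do_request()
        if status == 200:
            return payload
        if status != 429:
            return []        # non-retryable error
        sleep(wait_time)     # rate-limited: wait, then double the delay
        wait_time *= 2
    return []                # retries exhausted


# Simulated server: two rate-limit responses, then success.
responses = iter([(429, None), (429, None), (200, [{"paperId": "abc"}])])
waits = []  # records the requested delays instead of actually sleeping
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Injecting `sleep` keeps the example instant to run while still showing the 5, 10, 20, ... second progression the notebook uses.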

In [None]:
# ============================================
# 3. Exploratory Data Analysis (EDA)
# ============================================

df = pd.DataFrame(all_papers)
print("DataFrame shape:", df.shape)
display(df.head())  # a bare df.head() is only rendered when it is the last expression in a cell

# Basic stats
df.info()
display(df.describe())

# Distribution of years
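# Drop missing years and cast to int so the axis shows 2019 rather than 2019.0
year_counts = df["year"].dropna().astype(int).value_counts().sort_index()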
year_counts.plot(kind="bar", title="Publication Year Distribution")
plt.xlabel("Year")
plt.ylabel("Count of Papers")
plt.show()

# Check some sample abstracts
for idx, row in df.head(3).iterrows():
    print("Title:", row.get("title"))
    print("Abstract:", row.get("abstract"))
    print("-" * 60)

### Observations
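- We have up to 1,000 papers matching "machine learning" (10 batches of 100); the count is lower if any batch hit an API error or exhausted its retries.
- Some papers lack an abstract (the field is missing or null), so the extraction step must skip them.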
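
One way to quantify the missing-abstract problem before extraction is to split the records up front. This is a minimal sketch on hand-written sample records; `split_by_abstract` and `sample_papers` are hypothetical names, and treating whitespace-only abstracts as missing is an assumption.

```python
def split_by_abstract(papers):
    """Split paper dicts into (with_abstract, without_abstract)."""
    with_abs, without_abs = [], []
    for paper in papers:
        abstract = paper.get("abstract")
        # The field can be absent or None; whitespace-only text is treated as missing too.
        if isinstance(abstract, str) and abstract.strip():
            with_abs.append(paper)
        else:
            without_abs.append(paper)
    return with_abs, without_abs


sample_papers = [
    {"paperId": "p1", "abstract": "We train on MNIST."},
    {"paperId": "p2", "abstract": None},
    {"paperId": "p3"},
    {"paperId": "p4", "abstract": "   "},
]
usable, missing = split_by_abstract(sample_papers)
print(f"{len(usable)} usable, {len(missing)} without abstract")
```

In the notebook, the same call on `all_papers` would show how much of the corpus the extraction step can actually see.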

In [None]:
# ============================================
# 4. NLP to Extract Dataset Mentions
# ============================================

known_datasets = [
    "MNIST", "ImageNet", "CIFAR-10", "CIFAR-100", "COCO", "Fashion-MNIST",
    "OpenImages", "Cityscapes", "Pascal VOC"
    # add more known dataset names if needed
]

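def extract_datasets_from_text(abstract_text, dataset_list=None):
    """
    Returns the names from dataset_list that occur in abstract_text.
    Matching is a case-insensitive substring test, so a short name can also
    match inside a longer one (e.g. "MNIST" inside "Fashion-MNIST").
    """
    if not abstract_text:
        return []
    
    found = []
    lower_text = abstract_text.lower()
    for ds in dataset_list or []:
        if ds.lower() in lower_text:
            found.append(ds)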
    # Alternatively, spaCy NER could be used for more advanced detection:
    # doc = nlp(abstract_text)
    # for ent in doc.ents:
    #     # if ent.label_ == "PRODUCT" or something relevant
    #     # maybe cross-check if ent.text is in a dataset dictionary
    
    return list(set(found))

paper_dataset_map = []

for paper in all_papers:
    paper_id = paper.get("paperId", "")
    paper_title = paper.get("title", "")
    abstract = paper.get("abstract", "")
    
    datasets_found = extract_datasets_from_text(abstract, known_datasets)
    for ds in datasets_found:
        paper_dataset_map.append({
            "paperId": paper_id,
            "title": paper_title,
            "dataset": ds
        })

df_map = pd.DataFrame(paper_dataset_map)
print("Rows in paper-dataset mapping:", df_map.shape[0])
df_map.head(15)
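
Plain substring matching reports "MNIST" for an abstract that only mentions "Fashion-MNIST", and "COCO" for one about coconuts. A possible refinement, sketched here as an assumption and not as the notebook's method, is a regex with boundary checks on both sides of the name. `find_dataset_mentions` is a hypothetical helper.

```python
import re


def find_dataset_mentions(text, dataset_names):
    """Return the dataset names that appear in text as standalone tokens."""
    if not text:
        return []
    found = []
    for name in dataset_names:
        # Left side: not preceded by a word character or hyphen (blocks "Fashion-MNIST" -> "MNIST").
        # Right side: not followed by a word character (blocks "CIFAR-100" -> "CIFAR-10").
        pattern = r"(?<![\w-])" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


names = ["MNIST", "Fashion-MNIST", "COCO", "CIFAR-10", "CIFAR-100"]
print(find_dataset_mentions("We evaluate on Fashion-MNIST and CIFAR-100.", names))
print(find_dataset_mentions("Coconut yield prediction with MNIST-style digits.", names))
```

The right-hand check deliberately allows a trailing hyphen, so "MNIST-style" still counts as a mention of MNIST; tightening that is a judgment call that depends on the dataset list.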

### Validate Extraction
- We can do a quick manual check on a few samples to see if the extraction is correct.
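
A manual check is easier when each sampled match comes with the text around it. The helper below is a sketch under assumptions: `sample_mentions_with_context` and `abstracts_by_id` are hypothetical names, and the 40-character window is arbitrary.

```python
import random


def sample_mentions_with_context(mappings, abstracts_by_id, k=5, width=40, seed=0):
    """Pick up to k paper-dataset pairs and attach the text surrounding each match."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-inspected
    chosen = rng.sample(mappings, min(k, len(mappings)))
    rows = []
    for m in chosen:
        abstract = abstracts_by_id.get(m["paperId"]) or ""
        pos = abstract.lower().find(m["dataset"].lower())
        if pos < 0:
            snippet = ""  # no match found: worth flagging during review
        else:
            start = max(0, pos - width)
            end = pos + len(m["dataset"]) + width
            snippet = abstract[start:end]
        rows.append({"paperId": m["paperId"], "dataset": m["dataset"], "snippet": snippet})
    return rows


demo_map = [{"paperId": "p1", "title": "Demo", "dataset": "ImageNet"}]
demo_abstracts = {"p1": "We pretrain a ResNet on ImageNet and fine-tune it on a small target task."}
for row in sample_mentions_with_context(demo_map, demo_abstracts):
    print(row["dataset"], "->", row["snippet"])
```

In the notebook this could be called with `paper_dataset_map` and a dictionary built from `all_papers` that maps each `paperId` to its abstract.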

In [None]:
# ============================================
# 5. Build Knowledge Graph in Neo4j
# ============================================

# Optional: Wipe the existing DB (DANGER: deletes everything!)
# graph.run("MATCH (n) DETACH DELETE n")

# Look up each paper's publication year so the trend query in section 6.2 can use p.year
year_by_id = {p.get("paperId"): p.get("year") for p in all_papers}

for _, row in df_map.iterrows():
    p_id = row["paperId"]
    p_title = row["title"]
    ds_name = row["dataset"]
    
    paper_node = Node("Paper", paperId=p_id, title=p_title, year=year_by_id.get(p_id))
    dataset_node = Node("Dataset", name=ds_name)
    
    # MERGE ensures uniqueness
    graph.merge(paper_node, "Paper", "paperId")
    graph.merge(dataset_node, "Dataset", "name")
    
    rel = Relationship(paper_node, "USES_DATASET", dataset_node)
    graph.merge(rel, "Paper", "paperId")

print("Knowledge Graph construction complete.")
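
Merging node by node costs several round trips per row. For larger mappings, a common alternative is to send rows in batches through a single `UNWIND` statement. The sketch below assumes a graph object exposing `run(cypher, **params)`, as py2neo's `Graph` does; `merge_in_batches` and `RecordingGraph` are hypothetical names, and the stand-in class only records calls so the batching logic can be exercised without a database.

```python
UNWIND_MERGE = """
UNWIND $rows AS row
MERGE (p:Paper {paperId: row.paperId})
SET p.title = row.title
MERGE (d:Dataset {name: row.dataset})
MERGE (p)-[:USES_DATASET]->(d)
"""


def merge_in_batches(graph, rows, batch_size=500):
    """Send rows to the graph in chunks; returns the number of rows sent."""
    sent = 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        graph.run(UNWIND_MERGE, rows=chunk)  # one query per chunk
        sent += len(chunk)
    return sent


class RecordingGraph:
    """Stand-in for a real graph connection: records calls instead of executing them."""

    def __init__(self):
        self.calls = []

    def run(self, cypher, **params):
        self.calls.append((cypher, params))


fake = RecordingGraph()
rows = [{"paperId": f"p{i}", "title": f"Paper {i}", "dataset": "MNIST"} for i in range(5)]
total = merge_in_batches(fake, rows, batch_size=2)
print(total, len(fake.calls))
```

With a live connection, the rows could come from `df_map.to_dict("records")`; the batch size of 500 is an arbitrary starting point.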

In [None]:
# ============================================
# 6. Querying & Insights
# ============================================

# 6.1 Top 5 Datasets
query_top5 = """
MATCH (p:Paper)-[:USES_DATASET]->(d:Dataset)
RETURN d.name AS datasetName, COUNT(p) AS usageCount
ORDER BY usageCount DESC
LIMIT 5
"""
res_top5 = graph.run(query_top5).data()
df_top5 = pd.DataFrame(res_top5)
print("Top 5 Datasets:")
display(df_top5)

# Bar chart
plt.figure(figsize=(8,4))
sns.barplot(data=df_top5, x="datasetName", y="usageCount")
plt.title("Top 5 Datasets Mentioned")
plt.xlabel("Dataset")
plt.ylabel("Number of Papers")
plt.show()

In [None]:
# 6.2 Yearly Trend for a Chosen Dataset
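chosen_dataset = "ImageNet"
# Pass the dataset name as a query parameter instead of interpolating it into the Cypher string
query_trend = """
MATCH (p:Paper)-[:USES_DATASET]->(d:Dataset {name: $dataset_name})
RETURN p.year AS year, COUNT(p) AS usageCount
ORDER BY year
"""
res_trend = graph.run(query_trend, dataset_name=chosen_dataset).data()
# Fix the columns up front so dropna does not raise a KeyError on an empty result
df_trend = pd.DataFrame(res_trend, columns=["year", "usageCount"])
df_trend = df_trend.dropna(subset=["year"])  # drop None years if any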

if not df_trend.empty:
    plt.figure(figsize=(8,4))
    sns.lineplot(data=df_trend, x="year", y="usageCount", marker="o")
    plt.title(f"Yearly Trend of '{chosen_dataset}' Usage")
    plt.xlabel("Year")
    plt.ylabel("Count of Papers")
    plt.show()
else:
    print(f"No data found for dataset {chosen_dataset}.")

## 7. Conclusion

We have:
1. Fetched ~1000 ML papers from Semantic Scholar.
2. Explored the data briefly.
3. Extracted dataset mentions using a simple approach.
4. Built a Neo4j graph to represent `Paper` and `Dataset` nodes, linked by `USES_DATASET`.
5. Queried top datasets and a time-trend of a chosen dataset.

### Next Steps
- Enhance NLP extraction using advanced named-entity recognition or SciBERT.
- Broaden the dataset list (or do an unsupervised approach to detect new dataset names).
- Expand queries and visualizations, possibly integrating a front-end to explore the graph.
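
As a first step toward detecting dataset names that are not in the known list, one cheap heuristic is to collect capitalized phrases that directly precede words such as "dataset", "benchmark", or "corpus". This is only a sketch of that idea, not NER and not SciBERT; the pattern, the stopword list, and the name `candidate_dataset_names` are all assumptions.

```python
import re
from collections import Counter

# One or more capitalized tokens immediately followed by a dataset-like noun.
CANDIDATE = re.compile(r"\b((?:[A-Z][\w-]*\s+)*[A-Z][\w-]*)\s+(?:dataset|benchmark|corpus)\b")
LEADING_STOPWORDS = {"The", "A", "An", "Our", "This", "That"}


def candidate_dataset_names(abstracts):
    """Count capitalized phrases that appear right before 'dataset', 'benchmark', or 'corpus'."""
    counts = Counter()
    for text in abstracts:
        if not text:
            continue
        for match in CANDIDATE.finditer(text):
            tokens = match.group(1).split()
            # Sentence-initial words such as "The" are capitalized but not part of the name.
            while tokens and tokens[0] in LEADING_STOPWORDS:
                tokens.pop(0)
            if tokens:
                counts[" ".join(tokens)] += 1
    return counts


abstracts = [
    "We evaluate on the SQuAD dataset.",
    "The SQuAD dataset is widely used.",
    "Results on the Penn Treebank corpus are reported.",
    None,
    "Our dataset is small.",
]
print(candidate_dataset_names(abstracts).most_common())
```

Candidates found this way still need manual review before they are added to `known_datasets`, since the pattern also picks up capitalized phrases that are not dataset names.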

**End of Notebook**