# ArXiv Metadata Analysis

This notebook analyzes the content and format of the `arxiv-metadata-oai-snapshot.json` file, which stores one JSON record (one paper) per line.


In [10]:
import json
import os
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns

# Set up the file path
file_path = "/work3/s242644/PaperTrail/arxiv-metadata-oai-snapshot.json"

# Check file size and basic info
file_size = os.path.getsize(file_path)
print(f"File size: {file_size / (1024**3):.2f} GB")
print(f"File size: {file_size / (1024**2):.2f} MB")
print(f"File size: {file_size:,} bytes")


File size: 4.54 GB
File size: 4649.80 MB
File size: 4,875,669,363 bytes


In [11]:
# Count total number of papers
with open(file_path, 'r', encoding='utf-8') as f:
    line_count = sum(1 for line in f)

print(f"Total number of papers: {line_count:,}")
print(f"Average size per paper: {file_size / line_count:.0f} bytes")


Total number of papers: 2,840,638
Average size per paper: 1716 bytes
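Because the file is about 4.5 GB, every cell in this notebook reads it line by line rather than loading it whole. That pattern can be wrapped in a small generator. The following is a minimal sketch, not part of the original notebook: the name `iter_papers` and its `limit` parameter are hypothetical, and it assumes the file is UTF-8 encoded with one JSON object per line.

```python
import json
from itertools import islice
from typing import Iterator, Optional


def iter_papers(path: str, limit: Optional[int] = None) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file.

    `limit` caps the number of records yielded; None means no cap.
    """
    with open(path, "r", encoding="utf-8") as f:
        # Skip blank lines so that a trailing newline does not break json.loads
        non_empty = (line for line in f if line.strip())
        for line in islice(non_empty, limit):
            yield json.loads(line)
```

With such a helper, the sampling cell below could be written as `sample_papers = list(iter_papers(file_path, limit=10))`, and memory use stays bounded by a single record at a time.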


In [12]:
# Load and examine a sample of papers
sample_papers = []
with open(file_path, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i >= 10:  # Load first 10 papers
            break
        sample_papers.append(json.loads(line.strip()))

print("Sample paper structure:")
print("=" * 50)
for i, paper in enumerate(sample_papers[:3]):
    print(f"\nPaper {i+1}:")
    print(f"ID: {paper['id']}")
    print(f"Title: {paper['title'][:100]}...")
    print(f"Authors: {paper['authors'][:100]}...")
    print(f"Categories: {paper['categories']}")
    print(f"Abstract length: {len(paper['abstract'])} characters")
    print(f"Versions: {len(paper['versions'])}")
    print(f"Update date: {paper['update_date']}")
    print("-" * 30)


Actual paper structure analysis:

First paper keys: ['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'abstract', 'versions', 'update_date', 'authors_parsed']
Number of fields: 14

Detailed structure of first paper:
----------------------------------------
id (str): 0704.0001
submitter (str): Pavel Nadolsky
authors (str): C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan
title (str): Calculation of prompt diphoton production cross sections at Tevatron and
  LHC energies
comments (str): 37 pages, 15 figures; published version
journal-ref (str): Phys.Rev.D76:013009,2007
doi (str): 10.1103/PhysRevD.76.013009
report-no (str): ANL-HEP-PR-07-12
categories (str): hep-ph
license (NoneType): None
abstract (str):   A fully differential calculation in perturbative quantum chromodynamics is
presented for the produ...
versions (list): [{'version': 'v1', 'created': 'Mon, 2 Apr 2007 19:18:42 GMT'}, {'version': 'v2', 'created': 'Tue, 24

In [13]:
# Analyze the complete structure of one paper
print("Complete structure of a single paper:")
print("=" * 50)
sample_paper = sample_papers[0]
for key, value in sample_paper.items():
    if isinstance(value, str) and len(value) > 100:
        print(f"{key}: {value[:100]}... (length: {len(value)})")
    elif isinstance(value, list):
        print(f"{key}: {value} (length: {len(value)})")
    else:
        print(f"{key}: {value}")


Consistency check across papers:
All unique keys found: ['abstract', 'authors', 'authors_parsed', 'categories', 'comments', 'doi', 'id', 'journal-ref', 'license', 'report-no', 'submitter', 'title', 'update_date', 'versions']
Total unique keys: 14

Key presence across papers:
report-no: present in 5/5 papers
categories: present in 5/5 papers
comments: present in 5/5 papers
authors: present in 5/5 papers
title: present in 5/5 papers
authors_parsed: present in 5/5 papers
license: present in 5/5 papers
versions: present in 5/5 papers
journal-ref: present in 5/5 papers
abstract: present in 5/5 papers
update_date: present in 5/5 papers
doi: present in 5/5 papers
submitter: present in 5/5 papers
id: present in 5/5 papers
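The code that produced the key-presence listing above is not shown in this notebook. One way such a check could be computed is sketched below. The function name `key_presence` is hypothetical, and the sketch assumes each record is a flat `dict`, as the sample output suggests.

```python
from collections import Counter


def key_presence(papers):
    """Count how many records contain each top-level key.

    Returns (presence, total): a Counter mapping key -> number of records
    that have it, and the total number of records inspected.
    """
    presence = Counter()
    total = 0
    for paper in papers:
        total += 1
        presence.update(paper.keys())  # each key counts once per record
    return presence, total


# Usage with two made-up records (illustrative data, not from the snapshot)
demo = [{"id": "a", "doi": None}, {"id": "b"}]
presence, total = key_presence(demo)
for key in sorted(presence):
    print(f"{key}: present in {presence[key]}/{total} papers")
```

A key that is present with a `None` value still counts as present here; missing values are a separate question from missing keys.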


In [14]:
# Analyze categories distribution
categories = []
abstract_lengths = []
years = []

print("Analyzing categories and other metadata...")
with open(file_path, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i >= 1000:  # Analyze only the first 1000 papers for speed (not a random sample)
            break
        paper = json.loads(line.strip())
        
        # Extract categories
        if paper['categories']:
            categories.extend(paper['categories'].split())
        
        # Extract abstract length
        abstract_lengths.append(len(paper['abstract']))
        
        # Extract year from update_date
        if paper['update_date']:
            try:
                year = int(paper['update_date'].split('-')[0])
                years.append(year)
            except ValueError:
                pass  # Skip records whose date does not start with a numeric year

print(f"Analyzed {len(abstract_lengths)} papers")
print(f"Total category mentions: {len(categories)}")
print(f"Unique categories: {len(set(categories))}")
print(f"Average abstract length: {sum(abstract_lengths)/len(abstract_lengths):.0f} characters")
print(f"Year range: {min(years)} - {max(years)}")


Analyzing actual data structure...
All possible keys in dataset: ['abstract', 'authors', 'authors_parsed', 'categories', 'comments', 'doi', 'id', 'journal-ref', 'license', 'report-no', 'submitter', 'title', 'update_date', 'versions']

Data types for each key:
report-no: {'str', 'NoneType'}
categories: {'str'}
comments: {'str', 'NoneType'}
authors: {'str'}
title: {'str'}
authors_parsed: {'list'}
license: {'str', 'NoneType'}
versions: {'list'}
journal-ref: {'str', 'NoneType'}
abstract: {'str'}
update_date: {'str'}
doi: {'str', 'NoneType'}
submitter: {'str'}
id: {'str'}

Null/None value analysis:
report-no: 92/100 papers have null/empty values
categories: 0/100 papers have null/empty values
comments: 13/100 papers have null/empty values
authors: 0/100 papers have null/empty values
title: 0/100 papers have null/empty values
authors_parsed: 0/100 papers have null/empty values
license: 87/100 papers have null/empty values
versions: 0/100 papers have null/empty values
journal-ref: 48/100 pape
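The type and null listings above come from code that does not appear in this notebook. A minimal sketch of how per-field types and empty-value counts could be gathered in one pass follows. The name `profile_fields` is hypothetical, and treating `None`, `""`, and `[]` as "null/empty" is an assumption about what the original analysis counted.

```python
from collections import defaultdict


def profile_fields(papers):
    """Collect observed value types and empty-value counts per field.

    Returns (types, empty, total):
      types: dict mapping key -> set of type names seen for that key
      empty: dict mapping key -> number of records where the value is
             None, an empty string, or an empty list
      total: number of records inspected
    """
    types = defaultdict(set)
    empty = defaultdict(int)
    total = 0
    for paper in papers:
        total += 1
        for key, value in paper.items():
            types[key].add(type(value).__name__)
            if value is None or value == "" or value == []:
                empty[key] += 1
    return dict(types), dict(empty), total


# Usage with two made-up records (illustrative data, not from the snapshot)
demo = [{"doi": None, "title": "A"}, {"doi": "10.1/x", "title": "B"}]
types, empty, total = profile_fields(demo)
for key in sorted(types):
    print(f"{key}: {sorted(types[key])}, {empty.get(key, 0)}/{total} null/empty")
```

Fields that are never empty do not appear in `empty`, which is why the usage example reads them with `empty.get(key, 0)`.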

In [None]:
# Category distribution
category_counts = Counter(categories)
print("\nTop 20 categories:")
print("=" * 30)
for category, count in category_counts.most_common(20):
    print(f"{category}: {count}")

# Year distribution
year_counts = Counter(years)
print("\nYear distribution (first 10 years):")
print("=" * 30)
for year in sorted(year_counts.keys())[:10]:
    print(f"{year}: {year_counts[year]} papers")
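The year counted above comes from `update_date`, which records the last metadata update and not the original submission. In the first sample record (id 0704.0001), `update_date` is 2008-11-26 while version v1 was created on 2 Apr 2007. If the submission year is what matters, it could be read from the `versions` list. The sketch below is one possible approach: the name `submission_year` is hypothetical, and it assumes the first entry of `versions` is the earliest and that `created` uses the RFC 2822 style date format seen in the sample output.

```python
from email.utils import parsedate_to_datetime


def submission_year(paper):
    """Return the year of the first listed version, or None if unavailable.

    Assumes paper['versions'][0] is the earliest version (v1) and that its
    'created' field looks like 'Mon, 2 Apr 2007 19:18:42 GMT'.
    """
    versions = paper.get("versions") or []
    if not versions:
        return None
    try:
        return parsedate_to_datetime(versions[0]["created"]).year
    except (KeyError, TypeError, ValueError):
        # Missing 'created' field or a date string that cannot be parsed
        return None


# Usage with the first sample record's version history
first = {
    "update_date": "2008-11-26",
    "versions": [
        {"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
        {"version": "v2", "created": "Tue, 24 Jul 2007 20:10:27 GMT"},
    ],
}
print(submission_year(first))
```

Using `email.utils` from the standard library avoids hand-written `strptime` patterns for this date format.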




Sample data from first 3 papers:

--- Paper 1 ---
id: 0704.0001
submitter: Pavel Nadolsky
authors: C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan
title: Calculation of prompt diphoton production cross sections at Tevatron and
  LHC e...
comments: 37 pages, 15 figures; published version
journal-ref: Phys.Rev.D76:013009,2007
doi: 10.1103/PhysRevD.76.013009
report-no: ANL-HEP-PR-07-12
categories: hep-ph
license: None
abstract:   A fully differential calculation in perturbative quantum chromodynamics is
pre...
versions: [{'version': 'v1', 'created': 'Mon, 2 Apr 2007 19:18:42 GMT'}, {'version': 'v2', 'created': 'Tue, 24 Jul 2007 20:10:27 GMT'}]
update_date: 2008-11-26
authors_parsed: [['Balázs', 'C.', ''], ['Berger', 'E. L.', ''], ['Nadolsky', 'P. M.', '']]... (length: 4)
------------------------------

--- Paper 2 ---
id: 0704.0002
submitter: Louis Theran
authors: Ileana Streinu and Louis Theran
title: Sparsity-certifying Graph Decompositions
comments: To appear in Graphs and Combin