arXiv hosts more than 1.5 million articles across many fields of study. It was founded by Paul Ginsparg in 1991 and is maintained and operated by Cornell University.

In this kernel I work with metadata information from this dataset: https://www.kaggle.com/Cornell-University/arxiv

It contains metadata of papers and information about citations.

Let's see what interesting insights can be extracted from this data!

*Work is still in progress*

![](https://storage.googleapis.com/kaggle-public-downloads/arXiv.JPG)

In [4]:
pip install plotly

Collecting plotly
  Downloading plotly-5.24.1-py3-none-any.whl.metadata (7.3 kB)
Collecting tenacity>=6.2.0 (from plotly)
  Downloading tenacity-9.0.0-py3-none-any.whl.metadata (1.2 kB)
Downloading plotly-5.24.1-py3-none-any.whl (19.1 MB)


In [5]:
# import libraries

import numpy as np
import pandas as pd
import gc
import os
import json
from collections import Counter, defaultdict
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import re
year_pattern = r'([1-2][0-9]{3})'

In [4]:
import json

# https://arxiv.org/help/api/user-manual
category_map = {'astro-ph': 'Astrophysics',
'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics',
'astro-ph.EP': 'Earth and Planetary Astrophysics',
'astro-ph.GA': 'Astrophysics of Galaxies',
'astro-ph.HE': 'High Energy Astrophysical Phenomena',
'astro-ph.IM': 'Instrumentation and Methods for Astrophysics',
'astro-ph.SR': 'Solar and Stellar Astrophysics',
'cond-mat.dis-nn': 'Disordered Systems and Neural Networks',
'cond-mat.mes-hall': 'Mesoscale and Nanoscale Physics',
'cond-mat.mtrl-sci': 'Materials Science',
'cond-mat.other': 'Other Condensed Matter',
'cond-mat.quant-gas': 'Quantum Gases',
'cond-mat.soft': 'Soft Condensed Matter',
'cond-mat.stat-mech': 'Statistical Mechanics',
'cond-mat.str-el': 'Strongly Correlated Electrons',
'cond-mat.supr-con': 'Superconductivity',
'cs.AI': 'Artificial Intelligence',
'cs.AR': 'Hardware Architecture',
'cs.CC': 'Computational Complexity',
'cs.CE': 'Computational Engineering, Finance, and Science',
'cs.CG': 'Computational Geometry',
'cs.CL': 'Computation and Language',
'cs.CR': 'Cryptography and Security',
'cs.CV': 'Computer Vision and Pattern Recognition',
'cs.CY': 'Computers and Society',
'cs.DB': 'Databases',
'cs.DC': 'Distributed, Parallel, and Cluster Computing',
'cs.DL': 'Digital Libraries',
'cs.DM': 'Discrete Mathematics',
'cs.DS': 'Data Structures and Algorithms',
'cs.ET': 'Emerging Technologies',
'cs.FL': 'Formal Languages and Automata Theory',
'cs.GL': 'General Literature',
'cs.GR': 'Graphics',
'cs.GT': 'Computer Science and Game Theory',
'cs.HC': 'Human-Computer Interaction',
'cs.IR': 'Information Retrieval',
'cs.IT': 'Information Theory',
'cs.LG': 'Machine Learning',
'cs.LO': 'Logic in Computer Science',
'cs.MA': 'Multiagent Systems',
'cs.MM': 'Multimedia',
'cs.MS': 'Mathematical Software',
'cs.NA': 'Numerical Analysis',
'cs.NE': 'Neural and Evolutionary Computing',
'cs.NI': 'Networking and Internet Architecture',
'cs.OH': 'Other Computer Science',
'cs.OS': 'Operating Systems',
'cs.PF': 'Performance',
'cs.PL': 'Programming Languages',
'cs.RO': 'Robotics',
'cs.SC': 'Symbolic Computation',
'cs.SD': 'Sound',
'cs.SE': 'Software Engineering',
'cs.SI': 'Social and Information Networks',
'cs.SY': 'Systems and Control',
'econ.EM': 'Econometrics',
'eess.AS': 'Audio and Speech Processing',
'eess.IV': 'Image and Video Processing',
'eess.SP': 'Signal Processing',
'gr-qc': 'General Relativity and Quantum Cosmology',
'hep-ex': 'High Energy Physics - Experiment',
'hep-lat': 'High Energy Physics - Lattice',
'hep-ph': 'High Energy Physics - Phenomenology',
'hep-th': 'High Energy Physics - Theory',
'math.AC': 'Commutative Algebra',
'math.AG': 'Algebraic Geometry',
'math.AP': 'Analysis of PDEs',
'math.AT': 'Algebraic Topology',
'math.CA': 'Classical Analysis and ODEs',
'math.CO': 'Combinatorics',
'math.CT': 'Category Theory',
'math.CV': 'Complex Variables',
'math.DG': 'Differential Geometry',
'math.DS': 'Dynamical Systems',
'math.FA': 'Functional Analysis',
'math.GM': 'General Mathematics',
'math.GN': 'General Topology',
'math.GR': 'Group Theory',
'math.GT': 'Geometric Topology',
'math.HO': 'History and Overview',
'math.IT': 'Information Theory',
'math.KT': 'K-Theory and Homology',
'math.LO': 'Logic',
'math.MG': 'Metric Geometry',
'math.MP': 'Mathematical Physics',
'math.NA': 'Numerical Analysis',
'math.NT': 'Number Theory',
'math.OA': 'Operator Algebras',
'math.OC': 'Optimization and Control',
'math.PR': 'Probability',
'math.QA': 'Quantum Algebra',
'math.RA': 'Rings and Algebras',
'math.RT': 'Representation Theory',
'math.SG': 'Symplectic Geometry',
'math.SP': 'Spectral Theory',
'math.ST': 'Statistics Theory',
'math-ph': 'Mathematical Physics',
'nlin.AO': 'Adaptation and Self-Organizing Systems',
'nlin.CD': 'Chaotic Dynamics',
'nlin.CG': 'Cellular Automata and Lattice Gases',
'nlin.PS': 'Pattern Formation and Solitons',
'nlin.SI': 'Exactly Solvable and Integrable Systems',
'nucl-ex': 'Nuclear Experiment',
'nucl-th': 'Nuclear Theory',
'physics.acc-ph': 'Accelerator Physics',
'physics.ao-ph': 'Atmospheric and Oceanic Physics',
'physics.app-ph': 'Applied Physics',
'physics.atm-clus': 'Atomic and Molecular Clusters',
'physics.atom-ph': 'Atomic Physics',
'physics.bio-ph': 'Biological Physics',
'physics.chem-ph': 'Chemical Physics',
'physics.class-ph': 'Classical Physics',
'physics.comp-ph': 'Computational Physics',
'physics.data-an': 'Data Analysis, Statistics and Probability',
'physics.ed-ph': 'Physics Education',
'physics.flu-dyn': 'Fluid Dynamics',
'physics.gen-ph': 'General Physics',
'physics.geo-ph': 'Geophysics',
'physics.hist-ph': 'History and Philosophy of Physics',
'physics.ins-det': 'Instrumentation and Detectors',
'physics.med-ph': 'Medical Physics',
'physics.optics': 'Optics',
'physics.plasm-ph': 'Plasma Physics',
'physics.pop-ph': 'Popular Physics',
'physics.soc-ph': 'Physics and Society',
'physics.space-ph': 'Space Physics',
'q-bio.BM': 'Biomolecules',
'q-bio.CB': 'Cell Behavior',
'q-bio.GN': 'Genomics',
'q-bio.MN': 'Molecular Networks',
'q-bio.NC': 'Neurons and Cognition',
'q-bio.OT': 'Other Quantitative Biology',
'q-bio.PE': 'Populations and Evolution',
'q-bio.QM': 'Quantitative Methods',
'q-bio.SC': 'Subcellular Processes',
'q-bio.TO': 'Tissues and Organs',
'q-fin.CP': 'Computational Finance',
'q-fin.EC': 'Economics',
'q-fin.GN': 'General Finance',
'q-fin.MF': 'Mathematical Finance',
'q-fin.PM': 'Portfolio Management',
'q-fin.PR': 'Pricing of Securities',
'q-fin.RM': 'Risk Management',
'q-fin.ST': 'Statistical Finance',
'q-fin.TR': 'Trading and Market Microstructure',
'quant-ph': 'Quantum Physics',
'stat.AP': 'Applications',
'stat.CO': 'Computation',
'stat.ME': 'Methodology',
'stat.ML': 'Machine Learning',
'stat.OT': 'Other Statistics',
'stat.TH': 'Statistics Theory'}

data_file = '../input/arxiv/arxiv-metadata-oai-snapshot.json'

def get_metadata():
    with open(data_file, 'r') as f:
        for line in f:
            yield line
            
titles = []
abstracts = []
years = []
categories = []
metadata = get_metadata()
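for paper in metadata:
    paper_dict = json.loads(paper)
    ref = paper_dict.get('journal-ref')
    try:
        # assume the last four characters of journal-ref are the year
        year = int(ref[-4:])
        if 2000 < year <= 2021:
            # map the primary (first-listed) category to its full name
            categories.append(category_map[paper_dict.get('categories').split(" ")[0]])
            years.append(year)
            titles.append(paper_dict.get('title'))
            abstracts.append(paper_dict.get('abstract'))
    except (TypeError, ValueError, KeyError, AttributeError):
        # skip papers with no journal-ref, no trailing year, or an unmapped category
        pass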

len(titles), len(abstracts), len(years), len(categories)

KeyboardInterrupt: 
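
The loop above was interrupted before it finished. While the parsing logic is still changing, one option is to try it on the first few records only. The sketch below is a minimal illustration, not the notebook's own method: `iter_papers` is a hypothetical helper, and the in-memory `fake_file` with trimmed, partly made-up records stands in for the real snapshot file.

```python
import io
import json
from itertools import islice


def iter_papers(handle, limit=None):
    """Yield parsed records lazily; stop after `limit` records if given."""
    for line in islice(handle, limit):
        yield json.loads(line)


# In-memory stand-in for the snapshot: one JSON object per line.
# The third record is made up for this example.
fake_file = io.StringIO(
    '{"id": "0704.0001", "categories": "hep-ph"}\n'
    '{"id": "0704.0005", "categories": "math.CA math.FA"}\n'
    '{"id": "9999.0001", "categories": "cs.LG"}\n'
)

first_two = list(iter_papers(fake_file, limit=2))
print([paper["id"] for paper in first_two])
```

With a real file, the same helper could be called as `iter_papers(open(data_file), limit=1000)`; leaving `limit` as `None` reads everything.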

In [18]:
import json
import random
import pandas as pd
from collections import defaultdict
from tqdm import tqdm
from datetime import datetime

# Function to clean authorship
def get_clean_authors(authors_parsed):
    cleaned_authors = []
    for author in authors_parsed:
        # Join first name, last name, and other parts of the name, and remove excess spaces
        cleaned_authors.append(" ".join([str(part) for part in author if part]).strip())
    return cleaned_authors

# Function to get metadata (replace with your actual method for loading metadata)
def get_metadata():
    with open("C:/Users/yichg/Downloads/arxiv-metadata-oai-snapshot.json", "r") as f:
        return f.readlines()
# Classification into Theoretical or Application areas
theoretical_categories = [
    # Math-related theoretical fields
    'math.AC', 'math.AG', 'math.AP', 'math.AT', 'math.CA', 'math.CO', 'math.CT', 'math.CV', 
    'math.DG', 'math.DS', 'math.FA', 'math.GN', 'math.GR', 'math.GT', 'math.IT', 'math.KT', 
    'math.LO', 'math.MG', 'math.MP', 'math.NT', 'math.OA', 'math.OC', 'math.PR', 'math.QA', 
    'math.RA', 'math.RT', 'math.SG', 'math.SP', 'math.ST', 'math-ph',
    
    # Condensed matter (theoretical)
    'cond-mat.stat-mech', 'cond-mat.str-el', 'cond-mat.supr-con',

    # High energy and theoretical physics
    'gr-qc', 'hep-th', 'hep-ph', 'nlin.SI', 'quant-ph', 'nlin.CD',

    # Computer science-related theoretical fields
    'cs.CC', 'cs.FL', 'cs.LO', 'cs.DM', 'cs.DS', 'cs.MA', 'cs.SC', 'cs.IT', 'cs.GT', 'cs.CG',

    # Neural networks and evolutionary computing (partly theoretical)
    'cs.NE',

    # Statistics theory
    'stat.TH'
]
application_categories = [
    # Astrophysics-related application fields
    'astro-ph', 'astro-ph.CO', 'astro-ph.EP', 'astro-ph.GA', 'astro-ph.HE', 'astro-ph.IM', 
    'astro-ph.SR',

    # Condensed matter (applied)
    'cond-mat.dis-nn', 'cond-mat.mes-hall', 'cond-mat.mtrl-sci', 'cond-mat.quant-gas', 
    'cond-mat.soft', 'cond-mat.other',

    # High energy physics (experiment and lattice)
    'hep-ex', 'hep-lat',

    # Nuclear physics (theoretical and experimental)
    'nucl-ex', 'nucl-th',

    # Applied physics fields
    'physics.acc-ph', 'physics.ao-ph', 'physics.app-ph', 'physics.atm-clus', 'physics.atom-ph', 
    'physics.bio-ph', 'physics.chem-ph', 'physics.class-ph', 'physics.comp-ph', 
    'physics.data-an', 'physics.ed-ph', 'physics.flu-dyn', 'physics.gen-ph', 'physics.geo-ph', 
    'physics.hist-ph', 'physics.ins-det', 'physics.med-ph', 'physics.optics', 
    'physics.plasm-ph', 'physics.pop-ph', 'physics.soc-ph', 'physics.space-ph',

    # Computer science applied fields
    'cs.AI', 'cs.AR', 'cs.CE', 'cs.CL', 'cs.CR', 'cs.CV', 'cs.CY', 'cs.DB', 'cs.DC', 'cs.DL', 
    'cs.ET', 'cs.GR', 'cs.HC', 'cs.IR', 'cs.LG', 'cs.MM', 'cs.MS', 'cs.NA', 'cs.NI', 'cs.OS', 
    'cs.PF', 'cs.PL', 'cs.RO', 'cs.SE', 'cs.SI', 'cs.SY', 'cs.OH', 'cs.SD', 'cs.AO',

    # Econometrics
    'econ.EM',

    # Electrical engineering and signal processing
    'eess.AS', 'eess.IV', 'eess.SP',

    # Quantitative biology
    'q-bio.BM', 'q-bio.CB', 'q-bio.GN', 'q-bio.MN', 'q-bio.NC', 'q-bio.OT', 'q-bio.PE', 
    'q-bio.QM', 'q-bio.SC', 'q-bio.TO',

    # Quantitative finance
    'q-fin.CP', 'q-fin.EC', 'q-fin.GN', 'q-fin.MF', 'q-fin.PM', 'q-fin.PR', 'q-fin.RM', 
    'q-fin.ST', 'q-fin.TR',

    # Applied statistics
    'stat.AP', 'stat.CO', 'stat.ME', 'stat.ML', 'stat.OT'
]

# Classify a paper as 'Theoretical', 'Application', or 'Other' by majority vote.
# Note: each subcategory is recombined with main_category before the lookup.
def classify_category(main_category, subcategories):
    theoretical_count = 0
    application_count = 0

    # Check main category first
    if main_category in theoretical_categories:
        theoretical_count += 1
    elif main_category in application_categories:
        application_count += 1

    # Check subcategories
    for subcategory in subcategories:
        full_category = f'{main_category}.{subcategory}'  # Full category in form 'stat.TH', etc.
        if full_category in theoretical_categories:
            theoretical_count += 1
        elif full_category in application_categories:
            application_count += 1

    # Return classification based on majority count or 'Other' if no match
    if theoretical_count > application_count:
        return 'Theoretical'
    elif application_count > theoretical_count:
        return 'Application'
    else:
        return 'Other'

# Initialize data structures
metadata = get_metadata()
filtered_papers = []

# Relevant categories (main categories only, without subcategories)
relevant_categories = ['math', 'cs', 'stat', 'physics']

# Sampling parameters and year range (2000-2024)
n_sample = 500  # papers drawn per year
seed = 123
year_range = range(2000, 2025)


# Process metadata
for paper in tqdm(metadata):
    paper = json.loads(paper)
    
    # Use the year of 'update_date' (the last update, which can be later than the first submission)
    year = int(paper['update_date'][:4])
    
    # Skip papers not within the year range
    if year not in year_range:
        continue
    
    # Extract the main categories (before the dot, if present)
    paper_categories = [cat.split('.')[0] for cat in paper['categories'].split(' ')]
    paper_categories_detail = [cat.split('.')[1] if '.' in cat else 'None' for cat in paper['categories'].split(' ')]
    
    # Filter for relevant categories or classify as 'others'
    main_category = 'others'
    for cat in paper_categories:
        if cat in relevant_categories:
            main_category = cat
            break  # Select the first matching relevant category

    # Classify the subcategories (cat_detail) into theoretical/application
    classification = classify_category(main_category, paper_categories_detail)
    
    # Store filtered data
    paper_data = {
        'id': paper['id'],
        'title': paper['title'],
        'abstract': paper['abstract'],
        'year': year,
        'category': main_category,
        'cat_detail': paper_categories_detail,
        'clean_authors': get_clean_authors(paper['authors_parsed']),
        'classification': classification  # Add the classification here
    }
    filtered_papers.append(paper_data)

# Convert the filtered papers to a DataFrame
papers_df = pd.DataFrame(filtered_papers)

# Draw n_sample papers per year; replace=True lets years with fewer than n_sample papers be sampled too (with repeats)
sampled_papers = papers_df.groupby('year').apply(lambda x: x.sample(n_sample, random_state=seed, replace=True)).reset_index(drop=True)

# groupby/apply already returns a DataFrame; this just gives it the name used below
sampled_papers_df = pd.DataFrame(sampled_papers)

# Display the result
print(sampled_papers_df.head())

# Save to CSV or further process
sampled_papers_df.to_csv('sampled_arxiv_papers_2000_2024_500.csv', index=False)


100%|██████████| 2560035/2560035 [00:44<00:00, 58003.31it/s] 
  sampled_papers = papers_df.groupby('year').apply(lambda x: x.sample(n_sample, random_state=seed, replace=True)).reset_index(drop=True)


                 id                                              title  \
0  astro-ph/0109480        Structure and dynamics of disks in galaxies   
1  astro-ph/9808257   Galaxy Clusters in the Hubble Volume Simulations   
2  astro-ph/0211397  Possible mechanism of electrical field origin ...   
3      math/0703727                             On symplectic quandles   
4   physics/0102009   Self-adaptive exploration in evolutionary search   

                                            abstract  year category  \
0    One the most cited papers in astronomy is Ke...  2007   others   
1    We report on analyses of cluster samples obt...  2007   others   
2    Slow magnetic field variations in stars and ...  2007   others   
3    We study the structure of symplectic quandle...  2007     math   
4    We address a primary question of computation...  2007  physics   

               cat_detail                                      clean_authors  \
0                  [None]                       
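
`classify_category` above rebuilds each code as `f'{main_category}.{subcategory}'`. As a result, a paper listed as `math.CA cs.AI` has `AI` looked up as `math.AI`, and archives without a dot such as `hep-th` fall through to `'Other'`. One possible alternative, sketched here under assumptions, is to vote on the full category codes directly. `classify_codes` is a hypothetical name, and the two sets are short excerpts of the lists above rather than the complete lists.

```python
# Short excerpts of theoretical_categories / application_categories from the cell above.
THEORETICAL = {"math.CA", "math.FA", "hep-th", "quant-ph", "stat.TH"}
APPLICATION = {"cs.AI", "cs.LG", "stat.ML", "astro-ph.CO", "hep-ex"}


def classify_codes(categories):
    """Majority vote over the space-separated category codes of one paper."""
    codes = categories.split()
    theoretical = sum(code in THEORETICAL for code in codes)
    application = sum(code in APPLICATION for code in codes)
    if theoretical > application:
        return "Theoretical"
    if application > theoretical:
        return "Application"
    return "Other"  # tie, or no code found in either set


for cats in ["math.CA math.FA", "cs.LG stat.ML", "hep-th", "math.CA cs.AI"]:
    print(f"{cats}: {classify_codes(cats)}")
```

Because the function takes the raw `categories` string, it does not depend on which archive was picked as `main_category`.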

In [5]:
import json
import random
import pandas as pd
from collections import defaultdict, Counter
from tqdm import tqdm
from datetime import datetime

# Function to clean authorship
def get_clean_authors(authors_parsed):
    cleaned_authors = []
    for author in authors_parsed:
        # Join first name, last name, and other parts of the name, and remove excess spaces
        cleaned_authors.append(" ".join([str(part) for part in author if part]).strip())
    return cleaned_authors

# Function to get metadata (replace with your actual method for loading metadata)
def get_metadata():
    with open("C:/Users/yichg/Downloads/arxiv-metadata-oai-snapshot.json", "r") as f:
        return f.readlines()

# Initialize data structures
metadata = get_metadata()
filtered_papers = []

# Relevant categories (main categories only, without subcategories)
relevant_categories = ['math', 'cs', 'stat', 'physics']
nprofessor = 20  # top authors kept per category
npaper = 4       # maximum papers sampled per author per year
# Year range (2010-2024)
year_range = range(2010, 2025)

# Process metadata
for paper in tqdm(metadata):
    paper = json.loads(paper)
    
    # Use the year of 'update_date' (the last update, which can be later than the first submission)
    year = int(paper['update_date'][:4])
    
    # Skip papers not within the year range
    if year not in year_range:
        continue
    
    # Extract the main categories (before the dot, if present)
    paper_categories = [cat.split('.')[0] for cat in paper['categories'].split(' ')]
    paper_categories_detail = [cat.split('.')[1] if '.' in cat else 'None' for cat in paper['categories'].split(' ')]
    
    # Filter for relevant categories or classify as 'others'
    main_category = 'others'
    for cat in paper_categories:
        if cat in relevant_categories:
            main_category = cat
            break  # Select the first matching relevant category

    # Store filtered data
    paper_data = {
        'id': paper['id'],
        'title': paper['title'],
        'abstract': paper['abstract'],
        'year': year,
        'category': main_category,
        'cat_detail':paper_categories_detail,
        'clean_authors': get_clean_authors(paper['authors_parsed'])
    }
    filtered_papers.append(paper_data)

# Convert to DataFrame
papers_df = pd.DataFrame(filtered_papers)

# Flatten authors and create a DataFrame with authors and their papers
papers_df['authors'] = papers_df['clean_authors'].apply(lambda x: '; '.join(x))  # Flatten authors list
authors_df = papers_df.explode('clean_authors')  # Explode the author list so that each row has one author

# Count the number of papers per author per category
author_paper_counts = authors_df.groupby(['clean_authors', 'category']).size().reset_index(name='paper_count')

# Get the top nprofessor authors per category
top_authors = author_paper_counts.groupby('category').apply(lambda x: x.nlargest(nprofessor, 'paper_count')).reset_index(drop=True)

# Select up to npaper random papers per year for each of the top authors (2010 to 2024)
selected_papers = []

for category in top_authors['category'].unique():
    top_category_authors = top_authors[top_authors['category'] == category]['clean_authors'].unique()
    
    for author in top_category_authors:
        # Filter papers for the given author and category
        author_papers = papers_df[(papers_df['clean_authors'].apply(lambda x: author in x)) & 
                                  (papers_df['category'] == category)]
        
        # Randomly select up to npaper papers per year for this author
        for year in year_range:
            papers_per_year = author_papers[author_papers['year'] == year]
            if len(papers_per_year) > 0:
                # Take npaper papers, or all of them if fewer are available
                sampled_papers = papers_per_year.sample(min(len(papers_per_year), npaper), random_state=42)
                
                # Add a column indicating the selected author (assign returns a new DataFrame)
                sampled_papers = sampled_papers.assign(selected_author=author)
                
                # Append the selected papers
                selected_papers.append(sampled_papers)

# Concatenate the selected papers into a final DataFrame
final_papers_df = pd.concat(selected_papers)

# Display the result
print(final_papers_df.head())

# Save the selected papers to a CSV file
final_papers_df.to_csv('selected_arxiv_papers_2010_2024.csv', index=False)


100%|██████████| 2560035/2560035 [00:29<00:00, 85613.81it/s] 
  top_authors = author_paper_counts.groupby('category').apply(lambda x: x.nlargest(nprofessor, 'paper_count')).reset_index(drop=True)


               id                                              title  \
94518   1007.2675  Algorithms for Testing Monomials in Multivaria...   
80953   1005.0806  A New Benchmark For Evaluation Of Graph-Theore...   
136930  1102.2831  The effect of linguistic constraints on the la...   
111811  1010.2818  Duty-Cycle-Aware Minimum-Energy Multicasting i...   
182217  1109.5244  Minimum-Energy All-to-All Multicasting in Mult...   

                                                 abstract  year category  \
94518     This paper is our second step towards develo...  2010       cs   
80953     We propose a new graph-theoretic benchmark i...  2010       cs   
136930    This paper studies the effect of linguistic ...  2011       cs   
111811    In duty-cycled wireless sensor networks, the...  2011       cs   
182217    Designing energy-efficient all-to-all multic...  2012       cs   

       cat_detail                                      clean_authors  \
94518        [CC]  [Chen Zhixiang, Fu 

In [3]:
print(papers_df.head())

          id                                              title  \
0  0704.0005  From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...   
1  0704.0006  Bosonic characters of atomic Cooper pairs acro...   
2  0704.0009  The Spitzer c2d Survey of Large, Nearby, Inste...   
3  0704.0020  Measurement of the Hadronic Form Factor in D0 ...   
4  0704.0025  Spectroscopic Properties of Polarons in Strong...   

                                            abstract  year category  \
0    In this paper we show how to compute the $\L...  2013     math   
1    We study the two-particle wave function of p...  2015   others   
2    We discuss the results from the combined IRA...  2010   others   
3    The shape of the hadronic form factor f+(q2)...  2015   others   
4    We present recent advances in understanding ...  2015   others   

                                       clean_authors  \
0            [Abu-Shammala Wael, Torchinsky Alberto]   
1                            [Pong Y. H., Law C. K.]   
2 

## Looking at the available data

In [6]:
metadata = get_metadata()
for paper in metadata:
    for k, v in json.loads(paper).items():
        print(f'{k}: {v}')
    break

id: 0704.0001
submitter: Pavel Nadolsky
authors: C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan
title: Calculation of prompt diphoton production cross sections at Tevatron and
  LHC energies
comments: 37 pages, 15 figures; published version
journal-ref: Phys.Rev.D76:013009,2007
doi: 10.1103/PhysRevD.76.013009
report-no: ANL-HEP-PR-07-12
categories: hep-ph
license: None
abstract:   A fully differential calculation in perturbative quantum chromodynamics is
presented for the production of massive photon pairs at hadron colliders. All
next-to-leading order perturbative contributions from quark-antiquark,
gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as
all-orders resummation of initial-state gluon radiation valid at
next-to-next-to-leading logarithmic accuracy. The region of phase space is
specified in which the calculation is most reliable. Good agreement is
demonstrated with data from the Fermilab Tevatron, and predictions are made for
more detailed tests w

I probably won't use every available field, but several of them look interesting:
* the authors of the paper
* the title and the abstract
* categories (in the cell below I made a dictionary to help understand the abbreviations)
* `journal-ref` - this field usually contains the publication year.
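
As a quick illustration of the fields listed above, here is a minimal sketch that pulls them out of a single record. The record is a shortened copy of the first entry printed above. `summarize_paper` is a hypothetical helper, `category_names` is a one-entry excerpt of the category dictionary, and splitting the `authors` string on commas is a simplifying assumption that will not hold for every record.

```python
import json

# Shortened copy of the first record printed above (abstract and other fields omitted).
sample_line = json.dumps({
    "id": "0704.0001",
    "authors": "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
    "title": "Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies",
    "journal-ref": "Phys.Rev.D76:013009,2007",
    "categories": "hep-ph",
})

# One-entry excerpt of the category dictionary, enough for this example.
category_names = {"hep-ph": "High Energy Physics - Phenomenology"}


def summarize_paper(line, names):
    """Return the fields of interest from one JSON line."""
    paper = json.loads(line)
    return {
        # naive split: assumes authors are separated by commas only
        "authors": [name.strip() for name in paper["authors"].split(",")],
        # collapse the line break that long titles carry
        "title": " ".join(paper["title"].split()),
        # 'categories' is a space-separated string of codes
        "categories": [names.get(code, code) for code in paper["categories"].split()],
        # may be None when the paper has no journal reference
        "journal_ref": paper.get("journal-ref"),
    }


summary = summarize_paper(sample_line, category_names)
for field, value in summary.items():
    print(f"{field}: {value}")
```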

In [7]:
# https://arxiv.org/help/api/user-manual
category_map = {'astro-ph': 'Astrophysics',
'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics',
'astro-ph.EP': 'Earth and Planetary Astrophysics',
'astro-ph.GA': 'Astrophysics of Galaxies',
'astro-ph.HE': 'High Energy Astrophysical Phenomena',
'astro-ph.IM': 'Instrumentation and Methods for Astrophysics',
'astro-ph.SR': 'Solar and Stellar Astrophysics',
'cond-mat.dis-nn': 'Disordered Systems and Neural Networks',
'cond-mat.mes-hall': 'Mesoscale and Nanoscale Physics',
'cond-mat.mtrl-sci': 'Materials Science',
'cond-mat.other': 'Other Condensed Matter',
'cond-mat.quant-gas': 'Quantum Gases',
'cond-mat.soft': 'Soft Condensed Matter',
'cond-mat.stat-mech': 'Statistical Mechanics',
'cond-mat.str-el': 'Strongly Correlated Electrons',
'cond-mat.supr-con': 'Superconductivity',
'cs.AI': 'Artificial Intelligence',
'cs.AR': 'Hardware Architecture',
'cs.CC': 'Computational Complexity',
'cs.CE': 'Computational Engineering, Finance, and Science',
'cs.CG': 'Computational Geometry',
'cs.CL': 'Computation and Language',
'cs.CR': 'Cryptography and Security',
'cs.CV': 'Computer Vision and Pattern Recognition',
'cs.CY': 'Computers and Society',
'cs.DB': 'Databases',
'cs.DC': 'Distributed, Parallel, and Cluster Computing',
'cs.DL': 'Digital Libraries',
'cs.DM': 'Discrete Mathematics',
'cs.DS': 'Data Structures and Algorithms',
'cs.ET': 'Emerging Technologies',
'cs.FL': 'Formal Languages and Automata Theory',
'cs.GL': 'General Literature',
'cs.GR': 'Graphics',
'cs.GT': 'Computer Science and Game Theory',
'cs.HC': 'Human-Computer Interaction',
'cs.IR': 'Information Retrieval',
'cs.IT': 'Information Theory',
'cs.LG': 'Machine Learning',
'cs.LO': 'Logic in Computer Science',
'cs.MA': 'Multiagent Systems',
'cs.MM': 'Multimedia',
'cs.MS': 'Mathematical Software',
'cs.NA': 'Numerical Analysis',
'cs.NE': 'Neural and Evolutionary Computing',
'cs.NI': 'Networking and Internet Architecture',
'cs.OH': 'Other Computer Science',
'cs.OS': 'Operating Systems',
'cs.PF': 'Performance',
'cs.PL': 'Programming Languages',
'cs.RO': 'Robotics',
'cs.SC': 'Symbolic Computation',
'cs.SD': 'Sound',
'cs.SE': 'Software Engineering',
'cs.SI': 'Social and Information Networks',
'cs.SY': 'Systems and Control',
'econ.EM': 'Econometrics',
'eess.AS': 'Audio and Speech Processing',
'eess.IV': 'Image and Video Processing',
'eess.SP': 'Signal Processing',
'gr-qc': 'General Relativity and Quantum Cosmology',
'hep-ex': 'High Energy Physics - Experiment',
'hep-lat': 'High Energy Physics - Lattice',
'hep-ph': 'High Energy Physics - Phenomenology',
'hep-th': 'High Energy Physics - Theory',
'math.AC': 'Commutative Algebra',
'math.AG': 'Algebraic Geometry',
'math.AP': 'Analysis of PDEs',
'math.AT': 'Algebraic Topology',
'math.CA': 'Classical Analysis and ODEs',
'math.CO': 'Combinatorics',
'math.CT': 'Category Theory',
'math.CV': 'Complex Variables',
'math.DG': 'Differential Geometry',
'math.DS': 'Dynamical Systems',
'math.FA': 'Functional Analysis',
'math.GM': 'General Mathematics',
'math.GN': 'General Topology',
'math.GR': 'Group Theory',
'math.GT': 'Geometric Topology',
'math.HO': 'History and Overview',
'math.IT': 'Information Theory',
'math.KT': 'K-Theory and Homology',
'math.LO': 'Logic',
'math.MG': 'Metric Geometry',
'math.MP': 'Mathematical Physics',
'math.NA': 'Numerical Analysis',
'math.NT': 'Number Theory',
'math.OA': 'Operator Algebras',
'math.OC': 'Optimization and Control',
'math.PR': 'Probability',
'math.QA': 'Quantum Algebra',
'math.RA': 'Rings and Algebras',
'math.RT': 'Representation Theory',
'math.SG': 'Symplectic Geometry',
'math.SP': 'Spectral Theory',
'math.ST': 'Statistics Theory',
'math-ph': 'Mathematical Physics',
'nlin.AO': 'Adaptation and Self-Organizing Systems',
'nlin.CD': 'Chaotic Dynamics',
'nlin.CG': 'Cellular Automata and Lattice Gases',
'nlin.PS': 'Pattern Formation and Solitons',
'nlin.SI': 'Exactly Solvable and Integrable Systems',
'nucl-ex': 'Nuclear Experiment',
'nucl-th': 'Nuclear Theory',
'physics.acc-ph': 'Accelerator Physics',
'physics.ao-ph': 'Atmospheric and Oceanic Physics',
'physics.app-ph': 'Applied Physics',
'physics.atm-clus': 'Atomic and Molecular Clusters',
'physics.atom-ph': 'Atomic Physics',
'physics.bio-ph': 'Biological Physics',
'physics.chem-ph': 'Chemical Physics',
'physics.class-ph': 'Classical Physics',
'physics.comp-ph': 'Computational Physics',
'physics.data-an': 'Data Analysis, Statistics and Probability',
'physics.ed-ph': 'Physics Education',
'physics.flu-dyn': 'Fluid Dynamics',
'physics.gen-ph': 'General Physics',
'physics.geo-ph': 'Geophysics',
'physics.hist-ph': 'History and Philosophy of Physics',
'physics.ins-det': 'Instrumentation and Detectors',
'physics.med-ph': 'Medical Physics',
'physics.optics': 'Optics',
'physics.plasm-ph': 'Plasma Physics',
'physics.pop-ph': 'Popular Physics',
'physics.soc-ph': 'Physics and Society',
'physics.space-ph': 'Space Physics',
'q-bio.BM': 'Biomolecules',
'q-bio.CB': 'Cell Behavior',
'q-bio.GN': 'Genomics',
'q-bio.MN': 'Molecular Networks',
'q-bio.NC': 'Neurons and Cognition',
'q-bio.OT': 'Other Quantitative Biology',
'q-bio.PE': 'Populations and Evolution',
'q-bio.QM': 'Quantitative Methods',
'q-bio.SC': 'Subcellular Processes',
'q-bio.TO': 'Tissues and Organs',
'q-fin.CP': 'Computational Finance',
'q-fin.EC': 'Economics',
'q-fin.GN': 'General Finance',
'q-fin.MF': 'Mathematical Finance',
'q-fin.PM': 'Portfolio Management',
'q-fin.PR': 'Pricing of Securities',
'q-fin.RM': 'Risk Management',
'q-fin.ST': 'Statistical Finance',
'q-fin.TR': 'Trading and Market Microstructure',
'quant-ph': 'Quantum Physics',
'stat.AP': 'Applications',
'stat.CO': 'Computation',
'stat.ME': 'Methodology',
'stat.ML': 'Machine Learning',
'stat.OT': 'Other Statistics',
'stat.TH': 'Statistics Theory'}

### Preparing data

In [77]:
categories = {}
abstract_words = {}
authors = {}
journal = {}
metadata = get_metadata()
year_pub = {}
for ind, paper in tqdm(enumerate(metadata)):
    paper = json.loads(paper)
    # store the last update date (a 'YYYY-MM-DD' string) for each paper
    year_pub[ind] = paper['update_date']
    # per-paper counts of categories and abstract words
    categories[ind] = defaultdict(int)
    abstract_words[ind] = defaultdict(int)
    for cat in paper['categories'].split(' '):
        categories[ind][cat] += 1

    for word in paper['abstract'].replace('\n', ' ').split():
        abstract_words[ind][word] += 1
    authors[ind] = paper['authors_parsed']


In [2]:
authors[1]

NameError: name 'authors' is not defined

In [68]:
paper

{'id': '0704.0005',
 'submitter': 'Alberto Torchinsky',
 'authors': 'Wael Abu-Shammala and Alberto Torchinsky',
 'title': 'From dyadic $\\Lambda_{\\alpha}$ to $\\Lambda_{\\alpha}$',
 'comments': None,
 'journal-ref': 'Illinois J. Math. 52 (2008) no.2, 681-689',
 'doi': None,
 'report-no': None,
 'categories': 'math.CA math.FA',
 'license': None,
 'abstract': '  In this paper we show how to compute the $\\Lambda_{\\alpha}$ norm, $\\alpha\\ge\n0$, using the dyadic grid. This result is a consequence of the description of\nthe Hardy spaces $H^p(R^N)$ in terms of dyadic and special atoms.\n',
 'versions': [{'version': 'v1', 'created': 'Mon, 2 Apr 2007 18:09:58 GMT'}],
 'update_date': '2013-10-15',
 'authors_parsed': [['Abu-Shammala', 'Wael', ''],
  ['Torchinsky', 'Alberto', '']]}

In [22]:
if paper['journal-ref']:
    # note: re.match only matches at the start of the string, so a year that
    # appears later in journal-ref is not found and `year` stays None
    match = re.match(year_pattern, paper['journal-ref'])
    year = match.groups() if match else None
    if year:
        year = [int(i) for i in year if 1991 <= int(i) < 2020]
        if year == []:
            year = None
        else:
            year = min(year)
    print(year)

None
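
The cell above prints `None` because `re.match` only tests the beginning of the string, and `journal-ref` values rarely start with a year. A minimal sketch of one way around this, assuming the same `year_pattern` and the same 1991 to 2019 window, is to scan the whole string with `re.findall` and keep the earliest plausible value. `extract_year` is a hypothetical helper name.

```python
import re

year_pattern = r'([1-2][0-9]{3})'


def extract_year(journal_ref, first=1991, last=2019):
    """Return the earliest plausible year found anywhere in journal_ref, or None."""
    if not journal_ref:
        return None
    candidates = [int(y) for y in re.findall(year_pattern, journal_ref)]
    # the range filter also drops digit runs such as '1300' inside '013009'
    plausible = [y for y in candidates if first <= y <= last]
    return min(plausible) if plausible else None


print(extract_year('Phys.Rev.D76:013009,2007'))                   # 2007
print(extract_year('Illinois J. Math. 52 (2008) no.2, 681-689'))  # 2008
print(extract_year(None))                                         # None
```

The range filter matters: the pattern also matches four-digit runs inside page or article numbers, so the raw matches alone are not reliable.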


In [25]:
# collect counts of various things for the current paper
# 'categories' is a space-separated string, so split it instead of iterating over its characters
if year:
    for c in paper['categories'].split():
        categories[ind][c] += 1
# a defaultdict avoids the KeyError that a plain dict raises on the first `+= 1`
abstract_words[ind] = defaultdict(int)
for word in paper['abstract'].replace('\n', ' ').split():
    abstract_words[ind][word] += 1
authors[ind] = paper['authors_parsed']

In [24]:
abstract_words

{}

In [15]:
paper['journal-ref'], year_pattern

('Phys.Rev.D76:013009,2007', '([1-2][0-9]{3})')

## Number of papers by categories over years

I'll take the 10 most popular categories from each year and plot all of them.

**A warning beforehand!** There is no dedicated field with the publication date of the paper, so I extracted the year from `journal-ref` with a regex. The regex may produce some errors, and some papers don't have a `journal-ref` at all.
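
The plotting cell below expects a `year_categories` mapping whose construction is not shown in this notebook. A minimal sketch of one way to build it, assuming a nested `{year: {category: count}}` layout (which `pd.DataFrame(year_categories)` turns into one column per year). The helper `count_categories_by_year` is hypothetical; the first two records reuse `journal-ref` values printed earlier, and the third is made up.

```python
import re
from collections import defaultdict

year_pattern = r'([1-2][0-9]{3})'

# Records that carry only the two fields the counting needs.
papers = [
    {"journal-ref": "Phys.Rev.D76:013009,2007", "categories": "hep-ph"},
    {"journal-ref": "Illinois J. Math. 52 (2008) no.2, 681-689", "categories": "math.CA math.FA"},
    {"journal-ref": None, "categories": "cs.LG"},  # made-up record without a journal-ref
]


def count_categories_by_year(papers):
    """Count category occurrences per year, skipping papers without a usable year."""
    year_categories = defaultdict(lambda: defaultdict(int))
    for paper in papers:
        ref = paper.get("journal-ref") or ""
        years = [int(y) for y in re.findall(year_pattern, ref) if 1991 <= int(y) < 2020]
        if not years:
            continue  # no usable year: the paper is left out of the counts
        for cat in paper["categories"].split():
            year_categories[min(years)][cat] += 1
    return year_categories


year_categories = count_categories_by_year(papers)
print({year: dict(counts) for year, counts in year_categories.items()})
```

Papers without a `journal-ref` drop out entirely, which is one reason the yearly counts in the plot understate the real number of submissions.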

In [None]:
df = pd.DataFrame(year_categories)
cats = []
for col in df.columns:
    top_cats = [i for i in df[col].fillna(0).sort_values().index][-10:]
    cats.extend(top_cats)
cats = list(set(cats))

df1 = df.T[cats]
df1 = df1.sort_index()
df2 = df1.reset_index().melt(id_vars=['index'])
df2.columns = ['year', 'category', 'count']
fig = px.line(df2, x="year", y="count", color='category')
fig.show()

In [None]:
for c in sorted(cats):
    if c in category_map:
        print(f"{c}: {category_map[c]}")

There are many interesting trends here!
* for example, some fluctuations are due to changes in terminology: at first there were a lot of papers in the `astro-ph` category, but later it was split into multiple categories
* there has been a surge in astrophysics papers since 2010, but since 2014 `Cosmology and Nongalactic Astrophysics` has been less popular than `Astrophysics of Galaxies`
* of course, in the last several years there have been many papers about `Machine Learning`

## Number of papers by authors over years

In [None]:
df = pd.DataFrame(year_authors)
# use separate names so the `authors` dict and `top_authors` DataFrame built earlier are not overwritten
top_author_names = []
for col in df.columns:
    year_top = [i for i in df[col].fillna(0).sort_values().index][-10:]
    top_author_names.extend(year_top)
top_author_names = list(set(top_author_names))

df1 = df.T[top_author_names]
df1 = df1.sort_index()
df2 = df1.reset_index().melt(id_vars=['index'])
df2.columns = ['year', 'author', 'count']
fig = px.line(df2, x="year", y="count", color='author', width=1600, height=600)
fig.show()

We can see some prominent authors from many fields of study!