# arXivr

### arXiv metadata storage and statistics retriever

#### Features:
* Check the daily releases of the Astrophysics section of arXiv
* Retrieve publication metadata, including authors, affiliations, etc.
* Call an LLM API to create a one-line summary of each publication's abstract
* Store all this information as a CSV and retrieve statistics such as the highest-publishing authors and universities
* Create a column of keywords relating to the subject matter of each publication
* Get notified every time a publication matching your input keyword appears
* Automate running the code from VS Code so that this happens every day


In [1]:
import pandas as pd
import numpy as np
import urllib.request as libreq
from groq import Groq
import xml.etree.ElementTree as ET
from IPython.display import display, Latex
from scholarly import ProxyGenerator, scholarly
import certifi
import os

os.environ['SSL_CERT_FILE'] = certifi.where()

pg = ProxyGenerator()
pg.FreeProxies()


True

In [22]:
query = 'search_query=cat:astro-ph*+AND+submittedDate:[20241224+TO+20241225]&start=0&max_results=5&sortBy=submittedDate&sortOrder=ascending'
base_url = 'http://export.arxiv.org/api/query?'

with libreq.urlopen(base_url + query) as url:
    r = url.read()
print(r)


b'<?xml version="1.0" encoding="UTF-8"?>\n<feed xmlns="http://www.w3.org/2005/Atom">\n  <link href="http://arxiv.org/api/query?search_query%3Dcat%3Aastro-ph%2A%20AND%20submittedDate%3A%5B20241224%20TO%2020241225%5D%26id_list%3D%26start%3D0%26max_results%3D5" rel="self" type="application/atom+xml"/>\n  <title type="html">ArXiv Query: search_query=cat:astro-ph* AND submittedDate:[20241224 TO 20241225]&amp;id_list=&amp;start=0&amp;max_results=5</title>\n  <id>http://arxiv.org/api/uLiwtc/WLiRO0ArARC0DwGb29PY</id>\n  <updated>2024-12-26T00:00:00-05:00</updated>\n  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">23</opensearch:totalResults>\n  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>\n  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">5</opensearch:itemsPerPage>\n  <entry>\n    <id>http://arxiv.org/abs/2412.18098v1</id>\n    <updated>2024-12-24T02:15:19Z</updated>\
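
The query string above hard-codes the date range. A minimal sketch of building the same kind of URL for an arbitrary day is shown below; the helper name `build_query_url` and its default arguments are illustrative assumptions, not part of the original notebook. It relies on `urllib.parse.urlencode`, which percent-encodes the brackets and colons and turns spaces into `+`.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query?"


def build_query_url(day, category="astro-ph*", start=0, max_results=5):
    """Return an arXiv API URL for papers submitted on `day` (hypothetical helper)."""
    next_day = day + timedelta(days=1)
    # Same search expression as the hand-written query, with the dates filled in
    search = (
        f"cat:{category} AND "
        f"submittedDate:[{day:%Y%m%d} TO {next_day:%Y%m%d}]"
    )
    params = {
        "search_query": search,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "ascending",
    }
    return BASE_URL + urlencode(params)


url = build_query_url(date(2024, 12, 24))
print(url)
```

For a daily run, `date.today()` could be passed in place of the fixed date.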

In [23]:
# Parse the XML content
root = ET.fromstring(r)

# Function to print the XML in a readable format
def print_readable_xml(element, indent=""):
    for child in element:
        print(f"{indent}{child.tag}: {child.text.strip() if child.text else ''}")
        print_readable_xml(child, indent + "  ")

# Print the XML content in a readable format
print_readable_xml(root)

{http://www.w3.org/2005/Atom}link: 
{http://www.w3.org/2005/Atom}title: ArXiv Query: search_query=cat:astro-ph* AND submittedDate:[20241224 TO 20241225]&id_list=&start=0&max_results=5
{http://www.w3.org/2005/Atom}id: http://arxiv.org/api/uLiwtc/WLiRO0ArARC0DwGb29PY
{http://www.w3.org/2005/Atom}updated: 2024-12-26T00:00:00-05:00
{http://a9.com/-/spec/opensearch/1.1/}totalResults: 23
{http://a9.com/-/spec/opensearch/1.1/}startIndex: 0
{http://a9.com/-/spec/opensearch/1.1/}itemsPerPage: 5
{http://www.w3.org/2005/Atom}entry: 
  {http://www.w3.org/2005/Atom}id: http://arxiv.org/abs/2412.18098v1
  {http://www.w3.org/2005/Atom}updated: 2024-12-24T02:15:19Z
  {http://www.w3.org/2005/Atom}published: 2024-12-24T02:15:19Z
  {http://www.w3.org/2005/Atom}title: Bar instability and formation timescale across Toomre's $Q$ parameter
  and central mass concentration: slow bar formation or true stability
  {http://www.w3.org/2005/Atom}summary: We investigate the bar formation process using $N$-body simu
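
The feed reports `totalResults: 23` while only 5 items are returned per page, so one request does not cover the whole day. The sketch below reads `totalResults` from a feed and lists the `start` offsets still to be fetched. The function name `remaining_starts` and the tiny sample feed are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

OPENSEARCH = {"opensearch": "http://a9.com/-/spec/opensearch/1.1/"}


def remaining_starts(feed_xml, page_size):
    """Return the `start` offsets needed after the first page (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    total = int(root.find("opensearch:totalResults", OPENSEARCH).text)
    # The first page already covers offsets 0 .. page_size - 1
    return list(range(page_size, total, page_size))


# Made-up minimal feed carrying only the element this helper reads
sample_feed = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<feed xmlns="http://www.w3.org/2005/Atom">'
    b'<opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">'
    b"23</opensearch:totalResults></feed>"
)

print(remaining_starts(sample_feed, page_size=5))  # [5, 10, 15, 20]
```

Each offset would then be sent as the `start` parameter of a follow-up request.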

In [35]:
# Extract the abstract from the XML content
namespace = {'atom': 'http://www.w3.org/2005/Atom'}
abstract = root.find('.//atom:entry/atom:summary', namespace).text.strip()

print(abstract)

We investigate the bar formation process using $N$-body simulations across
the Toomre's parameter $Q_{min}$ and central mass concentration (CMC), focusing
principally on the formation timescale. Of importance is that, as suggested by
cosmological simulations, disk galaxies have limited time of $\sim 8$ Gyr in
the Universe timeline to evolve secularly, starting when they became physically
and kinematically steady to prompt the bar instability. By incorporating this
time limit, bar-unstable disks are further sub-divided into those that
establish a bar before and after that time, namely the normal and the slowly
bar-forming disks. Simulations demonstrate that evolutions of bar strengths and
configurations of the slowly bar-forming and the bar-stable cases are nearly
indistinguishable prior to $8$ Gyr, albeit dynamically distinct, while
differences can be noticed afterwards. Differentiating them before $8$ Gyr is
possible by identifying the proto-bar, a signature of bar development visible

In [25]:
# Display the abstract with LaTeX interpretation
display(Latex(abstract))

<IPython.core.display.Latex object>

In [27]:
# Collect one dictionary per <entry> element
entries = []
for entry in root.findall('atom:entry', namespace):
    title = entry.find('atom:title', namespace).text.strip()
    abstract = entry.find('atom:summary', namespace).text.strip()
    authors = ', '.join(
        author.find('atom:name', namespace).text.strip()
        for author in entry.findall('atom:author', namespace)
    )
    published = entry.find('atom:published', namespace).text.strip()
    link = entry.find('atom:link[@rel="alternate"]', namespace).attrib['href']

    entries.append({'title': title, 'abstract': abstract, 'authors': authors, 'published': published, 'link': link})

# Build the frame directly; concatenating onto an empty DataFrame is unnecessary
data = pd.DataFrame(entries, columns=['title', 'abstract', 'authors', 'published', 'link'])

print(data)

                                               title  \
0  Bar instability and formation timescale across...   
1  Exploring variation of double-peak broad-line ...   
2  Extremely luminous optical afterglow of a dist...   
3  Dependence of the estimated electric potential...   
4  Comparison of Relative Magnetic Helicity Flux ...   

                                            abstract  \
0  We investigate the bar formation process using...   
1  The geometry and kinematics of the broad-line ...   
2  Robotic telescope networks play an important r...   
3  A potential difference of 1.3 Giga-Volts (GV) ...   
4  Magnetic helicity is a key geometrical paramet...   

                                             authors             published  \
0                            Tirawut Worrakitpoonpon  2024-12-24T02:15:19Z   
1  Jiancheng Wu, Qingwen Wu, Kaixing Lu, Xinwu Ca...  2024-12-24T04:04:30Z   
2  Rahul Gupta, Judith Racusin, Vladimir Lipunov,...  2024-12-24T04:10:49Z   
3  B. Harihara

In [38]:
data

Unnamed: 0,title,abstract,authors,published,link
0,Bar instability and formation timescale across...,We investigate the bar formation process using...,Tirawut Worrakitpoonpon,2024-12-24T02:15:19Z,http://arxiv.org/abs/2412.18098v1
1,Exploring variation of double-peak broad-line ...,The geometry and kinematics of the broad-line ...,"Jiancheng Wu, Qingwen Wu, Kaixing Lu, Xinwu Ca...",2024-12-24T04:04:30Z,http://arxiv.org/abs/2412.18146v1
2,Extremely luminous optical afterglow of a dist...,Robotic telescope networks play an important r...,"Rahul Gupta, Judith Racusin, Vladimir Lipunov,...",2024-12-24T04:10:49Z,http://arxiv.org/abs/2412.18152v1
3,Dependence of the estimated electric potential...,A potential difference of 1.3 Giga-Volts (GV) ...,"B. Hariharan, S. K. Gupta, Y. Hayashi, P. Jaga...",2024-12-24T04:59:24Z,http://arxiv.org/abs/2412.18167v1
4,Comparison of Relative Magnetic Helicity Flux ...,Magnetic helicity is a key geometrical paramet...,"Shangbin Yang, Suo Liu, Jiangtao Su, Yuanyong ...",2024-12-24T06:17:49Z,http://arxiv.org/abs/2412.18203v1
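
The extraction loop can also be wrapped in a function so it can be reused for every page of results. The version below is a sketch using only the standard library: `parse_entries` and `_text` are hypothetical names, the sample feed is made up, and two choices are assumptions rather than the notebook's behaviour: missing elements become empty strings, and internal line breaks (as in the wrapped title shown earlier) are collapsed to single spaces.

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def _text(node, path):
    """Return whitespace-normalised text of a child element, or '' if absent."""
    child = node.find(path, ATOM)
    if child is None or not child.text:
        return ""
    return " ".join(child.text.split())


def parse_entries(feed_xml):
    """Turn an Atom feed into a list of row dictionaries (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    rows = []
    for entry in root.findall("atom:entry", ATOM):
        link = entry.find('atom:link[@rel="alternate"]', ATOM)
        rows.append({
            "title": _text(entry, "atom:title"),
            "abstract": _text(entry, "atom:summary"),
            "authors": ", ".join(
                _text(a, "atom:name") for a in entry.findall("atom:author", ATOM)
            ),
            "published": _text(entry, "atom:published"),
            "link": link.get("href") if link is not None else "",
        })
    return rows


# Made-up feed with one entry, used only to exercise the parser
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>An example
  title</title>
    <summary>An example abstract.</summary>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
    <published>2024-12-24T02:15:19Z</published>
    <link href="http://example.org/abs/1" rel="alternate" type="text/html"/>
  </entry>
</feed>"""

rows = parse_entries(sample_feed)
print(rows[0]["title"])
```

The resulting list can be passed straight to `pd.DataFrame(rows)`.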


In [39]:
content = "Abstract:" + data.loc[0, 'abstract']

In [46]:
# Read the API key from the environment instead of hard-coding it in the notebook
client_summary = Groq(
    api_key=os.environ["GROQ_API_KEY"],
)

summary = client_summary.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You need to provide a one line summary of the provided abstract. You should not fabricate any information. Please do not include any other text other than the requested content.",
        },
        
        {
            "role": "user",
            "content": content,
        }
    ],
    model="llama3-8b-8192",
    stream=False,
)

print(summary.choices[0].message.content)

A study using N-body simulations found that the formation timescale of bars in disk galaxies is limited to approximately 8 billion years, with two types of bars forming before and after this time, and that the interaction between Toomre's parameter and central mass concentration regulates bar formation.


In [42]:
title = "Title:" + data.loc[0, 'title']

In [49]:
# Read the API key from the environment instead of hard-coding it in the notebook
client_keywords = Groq(
    api_key=os.environ["GROQ_API_KEY"],
)

keywords = client_keywords.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You need to pick out 2-3 keywords from the provided title that are relevant to the topic of the paper. Please do not include any text explaining the response other than the requested content itself. No need to number the keywords.",
        },
        
        {
            "role": "user",
            "content": title,
        }
    ],
    model="llama3-8b-8192",
    stream=False,
)

print(keywords.choices[0].message.content)

Toomre's Q, bar formation, mass concentration


In [50]:
# Note: each assignment broadcasts a single string (the result for row 0) to every row
data['summary'] = summary.choices[0].message.content
data['keywords'] = keywords.choices[0].message.content

In [51]:
data

Unnamed: 0,title,abstract,authors,published,link,summary,keywords
0,Bar instability and formation timescale across...,We investigate the bar formation process using...,Tirawut Worrakitpoonpon,2024-12-24T02:15:19Z,http://arxiv.org/abs/2412.18098v1,A study using N-body simulations found that th...,"Toomre's Q, bar formation, mass concentration"
1,Exploring variation of double-peak broad-line ...,The geometry and kinematics of the broad-line ...,"Jiancheng Wu, Qingwen Wu, Kaixing Lu, Xinwu Ca...",2024-12-24T04:04:30Z,http://arxiv.org/abs/2412.18146v1,A study using N-body simulations found that th...,"Toomre's Q, bar formation, mass concentration"
2,Extremely luminous optical afterglow of a dist...,Robotic telescope networks play an important r...,"Rahul Gupta, Judith Racusin, Vladimir Lipunov,...",2024-12-24T04:10:49Z,http://arxiv.org/abs/2412.18152v1,A study using N-body simulations found that th...,"Toomre's Q, bar formation, mass concentration"
3,Dependence of the estimated electric potential...,A potential difference of 1.3 Giga-Volts (GV) ...,"B. Hariharan, S. K. Gupta, Y. Hayashi, P. Jaga...",2024-12-24T04:59:24Z,http://arxiv.org/abs/2412.18167v1,A study using N-body simulations found that th...,"Toomre's Q, bar formation, mass concentration"
4,Comparison of Relative Magnetic Helicity Flux ...,Magnetic helicity is a key geometrical paramet...,"Shangbin Yang, Suo Liu, Jiangtao Su, Yuanyong ...",2024-12-24T06:17:49Z,http://arxiv.org/abs/2412.18203v1,A study using N-body simulations found that th...,"Toomre's Q, bar formation, mass concentration"
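
Every row in the table above carries the summary and keywords of the first paper, because one string was assigned to the whole column. A per-row version might look like the sketch below. The function `annotate` and the two stand-in callables are assumptions for illustration; in practice the callables would wrap the Groq requests shown in the earlier cells.

```python
import pandas as pd


def annotate(df, summarize, extract_keywords):
    """Add per-row 'summary' and 'keywords' columns (hypothetical helper).

    `summarize` receives one abstract, `extract_keywords` one title.
    """
    out = df.copy()
    out["summary"] = out["abstract"].apply(summarize)
    out["keywords"] = out["title"].apply(extract_keywords)
    return out


# Offline stand-ins for the LLM calls, so the sketch runs without an API key
def fake_summarize(abstract):
    return abstract.split(".")[0] + "."


def fake_keywords(title):
    return ", ".join(title.lower().split()[:2])


# Made-up rows, not taken from arXiv
demo = pd.DataFrame({
    "title": ["Alpha beta gamma", "Delta epsilon zeta"],
    "abstract": ["First sentence. More text.", "Second one. More text."],
})

annotated = annotate(demo, fake_summarize, fake_keywords)
print(annotated[["summary", "keywords"]])
```

Passing the model calls in as functions keeps the looping logic testable without network access.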


In [54]:
data['published'] = pd.to_datetime(data['published'])
data['published'] = data['published'].dt.date

In [55]:
data

Unnamed: 0,title,abstract,authors,published,link,summary,keywords
0,Bar instability and formation timescale across...,We investigate the bar formation process using...,Tirawut Worrakitpoonpon,2024-12-24,http://arxiv.org/abs/2412.18098v1,A study using N-body simulations found that th...,"Toomre's Q, bar formation, mass concentration"
1,Exploring variation of double-peak broad-line ...,The geometry and kinematics of the broad-line ...,"Jiancheng Wu, Qingwen Wu, Kaixing Lu, Xinwu Ca...",2024-12-24,http://arxiv.org/abs/2412.18146v1,A study using N-body simulations found that th...,"Toomre's Q, bar formation, mass concentration"
2,Extremely luminous optical afterglow of a dist...,Robotic telescope networks play an important r...,"Rahul Gupta, Judith Racusin, Vladimir Lipunov,...",2024-12-24,http://arxiv.org/abs/2412.18152v1,A study using N-body simulations found that th...,"Toomre's Q, bar formation, mass concentration"
3,Dependence of the estimated electric potential...,A potential difference of 1.3 Giga-Volts (GV) ...,"B. Hariharan, S. K. Gupta, Y. Hayashi, P. Jaga...",2024-12-24,http://arxiv.org/abs/2412.18167v1,A study using N-body simulations found that th...,"Toomre's Q, bar formation, mass concentration"
4,Comparison of Relative Magnetic Helicity Flux ...,Magnetic helicity is a key geometrical paramet...,"Shangbin Yang, Suo Liu, Jiangtao Su, Yuanyong ...",2024-12-24,http://arxiv.org/abs/2412.18203v1,A study using N-body simulations found that th...,"Toomre's Q, bar formation, mass concentration"
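
The feature list mentions being notified when a publication matches a chosen keyword. With `title` and `keywords` columns available, the matching step could be sketched as follows. The name `match_watchlist` is hypothetical, the demo rows are made up, and how the notification is delivered (email, desktop alert, etc.) is left open.

```python
import re

import pandas as pd


def match_watchlist(df, watchlist):
    """Return rows whose title or keywords contain any watchlist term.

    Matching is case-insensitive substring matching (an assumed policy).
    """
    if not watchlist:
        return df.iloc[0:0]  # nothing to watch for, so return an empty frame
    # Escape each term so characters such as '+' or '(' are matched literally
    pattern = "|".join(re.escape(term) for term in watchlist)
    haystack = df["title"].fillna("") + " " + df["keywords"].fillna("")
    mask = haystack.str.contains(pattern, case=False, regex=True)
    return df[mask]


# Made-up rows for illustration
demo = pd.DataFrame({
    "title": ["Bar formation in disks", "Solar magnetic helicity"],
    "keywords": ["bar formation, Toomre's Q", "magnetic helicity, flux"],
})

hits = match_watchlist(demo, ["helicity"])
for _, row in hits.iterrows():
    print(f"New match: {row['title']}")
```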


In [58]:
# Stats 
# Authors
authors = data['authors'].str.split(', ', expand=True).stack().value_counts()
authors

Tirawut Worrakitpoonpon    1
Vladislav Topolev          1
Bao-Li Lun                 1
Jirong Mao                 1
Xiao-Hong Zhao             1
                          ..
A. Aryan                   1
V. Sharma                  1
S. Iyyani                  1
Shashi B. Pandey           1
Yuanyong Deng              1
Name: count, Length: 78, dtype: int64

In [59]:
print(authors.index[0])

Tirawut Worrakitpoonpon
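
In the run above every author appears once, so `authors.index[0]` is simply the first name encountered rather than a meaningful "top author". A reusable variant of the count using `Series.explode` is sketched below; `author_counts` is a hypothetical name, the demo names are made up, and the sketch assumes that no author name itself contains `', '`.

```python
import pandas as pd


def author_counts(df):
    """Count papers per author from a comma-separated 'authors' column."""
    return (
        df["authors"]
        .str.split(", ")   # one list of names per paper
        .explode()         # one row per (paper, author) pair
        .str.strip()
        .value_counts()
    )


# Made-up author lists for illustration
demo = pd.DataFrame({"authors": ["A. One, B. Two", "B. Two", "C. Three, B. Two"]})

counts = author_counts(demo)
print(counts.head(3))
```

The same pattern would apply to an affiliations column once one is available.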


In [3]:
# This needs to be done only once per session
scholarly.use_proxy(pg)

search_query = scholarly.search_author('Jia Wang')

# Retrieve all the details for the author
author = scholarly.fill(next(search_query))
scholarly.pprint(author)

# Get the author's affiliation
affiliation = author['affiliation']

MaxTriesExceededException: Cannot Fetch from Google Scholar.

In [68]:
# Split on the first comma only, so institutions that contain commas do not raise an error
position, affiliation = affiliation.split(', ', 1)
print(position)
print(affiliation)

Lecturer
Suranaree University of Technology
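
Affiliation strings do not always have the form "position, institution". The sketch below is a more tolerant splitter; `split_affiliation` is a hypothetical helper, and treating a string without a comma as an institution with no position is an assumption, not something the source specifies.

```python
def split_affiliation(text):
    """Split 'position, institution' on the first comma (hypothetical helper).

    Returns (position, institution); position is '' when no comma is present.
    """
    position, sep, institution = text.partition(", ")
    if not sep:
        # No separator found: assume the whole string names the institution
        return "", text.strip()
    return position.strip(), institution.strip()


print(split_affiliation("Lecturer, Suranaree University of Technology"))
print(split_affiliation("Example University"))
```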
