## First Stage
Install the required libraries

In [None]:
#pip install pymongo
#pip install dnspython

## Create the database and the articles collection

In [1]:
from pymongo import MongoClient

# check that MongoDB is reachable
try:
    client = MongoClient(
        'mongodb://localhost:30001,localhost:30002,localhost:30003/?replicaSet=my-replica-set'
    )
    result = client.admin.command('ping')
    print("Ping result:", result)
    print("Successfully connected to MongoDB")
except Exception as e:
    print("Error:", e)

Ping result: {'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1749487575, 1), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}, 'operationTime': Timestamp(1749487575, 1)}
Successfully connected to MongoDB
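
Every cell in this notebook repeats the same connection string. A minimal sketch of building it once from a host list, assuming the three local ports and the replica-set name used above; `build_replica_uri` is a hypothetical helper, not part of pymongo:

```python
def build_replica_uri(hosts, replica_set):
    """Build a MongoDB replica-set connection string from (host, port) pairs."""
    seed_list = ",".join(f"{host}:{port}" for host, port in hosts)
    return f"mongodb://{seed_list}/?replicaSet={replica_set}"


# Hosts and replica-set name copied from the connection cell above.
HOSTS = [("localhost", 30001), ("localhost", 30002), ("localhost", 30003)]
MONGO_URI = build_replica_uri(HOSTS, "my-replica-set")
print(MONGO_URI)
```

The resulting string can be passed to `MongoClient(MONGO_URI)` in the later cells instead of retyping the literal.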


## Insert data

Because inserting the data was saturating the server's RAM, we use the gc module to force memory to be released after each batch.

In [None]:
import json
from pymongo import MongoClient
import gc


client = MongoClient(
    'mongodb://localhost:30001,localhost:30002,localhost:30003/?replicaSet=my-replica-set'
)
db = client["arxiv_db"]
collection = db["articles"]

# Each batch of 500 documents uses approximately 8-9 GB of RAM
batch_size = 500  # Adjust to your available RAM

def batch_insert(file_path, batch_size=500):
    with open(file_path, "r") as f:
        batch = []
        for line in f:
            if line.strip():
                batch.append(json.loads(line))
                if len(batch) == batch_size:
                    collection.insert_many(batch)
                    batch = []
                    gc.collect()
        if batch:
            collection.insert_many(batch)
            gc.collect()

batch_insert("dataset/arxiv-metadata-oai-snapshot.json", batch_size)
print("Batch insertion completed.")

Batch insertion completed.
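
The batching idea above can also be sketched with a generator, so that only one batch of parsed documents is alive at a time. This is an illustrative variant, not the notebook's method: `iter_batches` and `insert_in_batches` are hypothetical names, and an in-memory string plus a plain list stand in for the dataset file and for `collection.insert_many`.

```python
import io
import json
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of parsed JSON documents, at most batch_size per list."""
    docs = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(docs, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(lines, insert_many, batch_size=500):
    """Pass each batch to insert_many and return the number of documents sent."""
    total = 0
    for batch in iter_batches(lines, batch_size):
        insert_many(batch)
        total += len(batch)
    return total


# Stand-ins for the dataset file and for collection.insert_many (both hypothetical).
sample_file = io.StringIO('{"id": "a"}\n\n{"id": "b"}\n{"id": "c"}\n')
received = []
count = insert_in_batches(sample_file, received.append, batch_size=2)
print(count, [len(batch) for batch in received])
```

With the real collection, `lines` would be the open file handle and `insert_many` would be `collection.insert_many`.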


## Queries

The queries requested in the assignment are run below.

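a. First query: articles whose update_date falls in 2025, showing only the title and update_date and limited to the first 20 results.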

In [None]:
from pymongo import MongoClient
from pandas import DataFrame

client = MongoClient(
    'mongodb://localhost:30001,localhost:30002,localhost:30003/?replicaSet=my-replica-set'
)
db = client["arxiv_db"]
collection = db["articles"]

consulta = {"update_date": {"$regex": "^2025"}}
parametros = {"title": 1, "update_date": 1, "_id": 0}

# wrap the cursor in list() so every result is kept; head() would cut some off
docs = list(collection.find(consulta, parametros).limit(20))
df = DataFrame(docs)
print(df)

                                                title update_date
0   Hamiltonian Graphs and the Traveling Salesman ...  2025-02-26
1   Spin-dependent three-nucleon force effects on ...  2025-03-20
2   Adjointability of densely defined closed opera...  2025-04-29
3                         Proof of Riemann Hypothesis  2025-04-11
4   Critical fluctuations of time-dependent magnet...  2025-01-06
5   Gauss--Berezin integral operators and spinors ...  2025-02-11
6                The nature of electromagnetic energy  2025-05-16
7   Periodic relativity: the theory of gravity in ...  2025-01-16
8      Quantization of Atomic and Nuclear Rest Masses  2025-02-08
9   Solution of the equation d/dx(pdu/dx)+qu=cu by...  2025-05-06
10  The role of the quark and gluon GPDs in hard v...  2025-01-22
11  A new weak approximation scheme of stochastic ...  2025-04-28
12  Strict essential extensions of C*-algebras and...  2025-04-29
13     Rationalization of EPR Coincidence Experiments  2025-04-03
14        
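
Since `update_date` is stored as a `YYYY-MM-DD` string, as the output above shows, string comparison follows chronological order, so the `^2025` prefix match can also be written as a range. A minimal sketch; `year_range_filter` and `matches_range` are hypothetical helpers, and the local check only illustrates what the `$gte`/`$lt` bounds select:

```python
def year_range_filter(field, year):
    """Build a MongoDB filter matching 'YYYY-MM-DD' strings that fall in the given year."""
    return {field: {"$gte": f"{year}-01-01", "$lt": f"{year + 1}-01-01"}}


def matches_range(value, bounds):
    """Apply the same $gte/$lt string comparison locally, for illustration only."""
    return bounds["$gte"] <= value < bounds["$lt"]


consulta = year_range_filter("update_date", 2025)
print(consulta)

# Sample values in the format shown in the output above.
for date in ["2024-12-31", "2025-02-26", "2025-05-16", "2026-01-01"]:
    print(date, matches_range(date, consulta["update_date"]))
```

The resulting dictionary can be passed to `collection.find` in place of the regex filter.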

b. Return the titles and authors of articles that belong to the categories "cs.AI"
or "stat.ML" and that have at least three authors. Show only those fields and limit to the
first 10 results.

In [20]:
from pymongo import MongoClient
import pandas as pd
pd.set_option('display.max_colwidth', 50)

client = MongoClient(
    'mongodb://localhost:30001,localhost:30002,localhost:30003/?replicaSet=my-replica-set'
)
db = client["arxiv_db"]
collection = db["articles"]

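# query parameters
# Dots are escaped so they match literally; the authors regex requires at least two commas,
# which is a heuristic for three or more authors.
consulta = {"categories": {"$regex": r"cs\.AI|stat\.ML"}, "authors": {"$regex": ".*,.*,"}}
parametros = {"title": 1, "authors": 1, "_id": 0}

# return only the first 10 results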
resultados = list(collection.find(consulta, parametros).limit(10))
df = pd.DataFrame(resultados)
print(df)

                                             authors  \
0  Tarik Hadzic, Rune Moller Jensen, Henrik Reif ...   
1  Stefano Bistarelli, Ugo Montanari, Francesca R...   
2  Juliana S Bernardes, Alberto Davila, Vitor San...   
3  Giorgio Terracina, Nicola Leone, Vincenzino Li...   
4               S. Mohamed, D. Rubin, and T. Marwala   
5                   J. Uglov, V. Schetinin, C. Maple   
6  Christian Gagn\'e (INFORMATIQUE WGZ INC.), Mic...   
7  Edgar H. de Graaf, Joost N. Kok, Walter A. Kos...   
8  Edgar H. de Graaf, Joost N. Kok, Walter A. Kos...   
9  Nicolas Godzik (INRIA Futurs, INRIA Rocquencou...   

                                               title  
0  Calculating Valid Domains for BDD-Based Intera...  
1  Unicast and Multicast Qos Routing with Soft Co...  
2  A study of structural properties on profiles HMMs  
3  Experimenting with recursive queries in databa...  
4  An Adaptive Strategy for the Classification of...  
5  Comparing Robustness of Pairwise and Multiclas... 
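
The comma-based regex is a heuristic: an author string such as "A, B and C" has three authors but only one comma, so it would not match. A sketch of an alternative, assuming the documents also carry an `authors_parsed` array with one entry per author (this field is an assumption and never appears in the notebook). The helper names and the sample documents are hypothetical.

```python
def at_least_n_authors_filter(n, field="authors_parsed"):
    """Filter matching documents whose array `field` has an element at index n - 1."""
    return {f"{field}.{n - 1}": {"$exists": True}}


def count_matches(docs, n, field="authors_parsed"):
    """Local equivalent of the filter, used here only to illustrate its meaning."""
    return sum(1 for doc in docs if len(doc.get(field, [])) >= n)


# Hypothetical documents; the names are placeholders, not dataset rows.
docs = [
    {"authors": "A. One, B. Two and C. Three",
     "authors_parsed": [["One", "A.", ""], ["Two", "B.", ""], ["Three", "C.", ""]]},
    {"authors": "D. Four and E. Five",
     "authors_parsed": [["Four", "D.", ""], ["Five", "E.", ""]]},
]

print(at_least_n_authors_filter(3))
print(count_matches(docs, 3))
```

If the field exists, the returned dictionary could replace the `authors` regex in the query; it should be verified against a sample document first.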

c. Return the titles, categories, and PDF links of articles that belong to
the category "hep-ph" and have a DOI assigned. Show only those fields and limit to 15
results.

In [None]:
from pymongo import MongoClient
import pandas as pd

pd.set_option('display.max_colwidth', 50)

client = MongoClient(
    'mongodb://localhost:30001,localhost:30002,localhost:30003/?replicaSet=my-replica-set'
)
db = client["arxiv_db"]
collection = db["articles"]

consulta = {"categories":{"$regex":"hep-ph"}, "doi":{"$exists":True, "$ne":None}}
parametros = {"title":1, "categories":1, "pdf_source":1, "_id":0}

resultados = list(collection.find(consulta, parametros).limit(15))

print(pd.DataFrame(resultados))

                                                title  \
0                  Lifetime of doubly charmed baryons   
1   Scalar radius of the pion and zeros in the for...   
2   Calculation of prompt diphoton production cros...   
3   Probing non-standard neutrino interactions wit...   
4   Understanding the Flavor Symmetry Breaking and...   
5   Crystal channeling of LHC forward protons with...   
6                    Strong decays of charmed baryons   
7   Multiple Parton Scattering in Nuclei: Quark-qu...   
8     Strong Phase and $D^0-D^0bar$ mixing at BES-III   
9                       CP violation in beauty decays   
10  Energy density for chiral lattice fermions wit...   
11  Instanton Liquid at Finite Temperature and Che...   
12  The Return of a Static Universe and the End of...   
13       Unravelling the sbottom spin at the CERN LHC   
14  Broadening the Higgs Boson with Right-Handed N...   

                      categories  
0                         hep-ph  
1         hep-ph 
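
If the documents do not carry a `pdf_source` field (the truncated output above shows only `title` and `categories`), the PDF link can be derived from the article identifier instead. A minimal sketch, assuming each document has an `id` field holding the arXiv identifier; `pdf_url` is a hypothetical helper and the sample document is made up:

```python
def pdf_url(doc):
    """Build an arXiv PDF link from a document's `id` field (assumed to exist)."""
    return f"https://arxiv.org/pdf/{doc['id']}"


# Hypothetical document shaped like a projected query result.
sample = {"id": "0704.0001", "title": "Example title", "categories": "hep-ph"}
row = {
    "title": sample["title"],
    "categories": sample["categories"],
    "pdf_url": pdf_url(sample),
}
print(row["pdf_url"])
```

For this to work with the query above, the projection would also need to include `"id": 1`.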

d. Return the titles, author names, and publication reference (journal-ref)
of the articles that have a DOI assigned. Show only those fields and sort the
results alphabetically by title. Limit to the first 20 results.

In [27]:
from pymongo import MongoClient
import pandas as pd

pd.set_option('display.max_colwidth', 50)

client = MongoClient(
    'mongodb://localhost:30001,localhost:30002,localhost:30003/?replicaSet=my-replica-set'
)
db = client["arxiv_db"]
collection = db["articles"]

consulta = {"doi":{"$exists":True,"$ne":None}}
parametros = {"title":1, "authors":1, "journal-ref":1, "_id":0}

resultados = list(collection.find(consulta, parametros).sort("title", 1).limit(20))

print(pd.DataFrame(resultados))

                                              authors  \
0   Aleks Kissinger (University of Oxford), Vladim...   
1              Oscar Henriksson and Christopher Rosen   
2   Olga Zagovora (1), Fabian Fl\"ock (1), Claudia...   
3   Stavros Mouslopoulos and Antonios Papazoglou (...   
4     K.S. Babu, Jogesh C. Pati and Hanns Stremnitzer   
5   Paola Bacigaluppi, R\'emi Abgrall and Svetlana...   
6                                       V. Karimipour   
7                                  Gabriella De Lucia   
8          Julie Nutini, Mark Schmidt and Warren Hare   
9                   Matthieu Gaudet and Steve Arscott   
10  Florentin Millour (MPIFR), Bruno Valat (FIZEAU...   
11  Mo Houtti, Moyan Zhou, Loren Terveen, and Stev...   
12                                    Emilio Elizalde   
13                                       Adan Cabello   
14  C. Cinelli, M. Barbieri, R. Perris, P. Matalon...   
15  Shashank Gupta, Debarshi Das, C. Jebarathinam,...   
16  Daniel Franzen (1), Saskia 

e. Return the titles and the date of the first version (versions.created) of the articles
submitted between 2010 and 2015. Show only those fields and limit to the first 15
results.

In [29]:
from pymongo import MongoClient
import pandas as pd

pd.set_option('display.max_colwidth', 50)

client = MongoClient(
    'mongodb://localhost:30001,localhost:30002,localhost:30003/?replicaSet=my-replica-set'
)
db = client["arxiv_db"]
collection = db["articles"]

# the first version of the article is at index 0 of the versions array
consulta = {"versions.0.created":{"$regex":"2010|2011|2012|2013|2014|2015"}}
parametros = {"title":1, "versions":1,"_id":0}

resultados = list(collection.find(consulta, parametros).limit(15))

print(pd.DataFrame(resultados))

                                                title  \
0   Strong Constraints to the Putative Planet Cand...   
1   A landscape of non-supersymmetric AdS vacua on...   
2   Arrested phase separation in reproducing bacte...   
3               Mu-Tau Production at Hadron Colliders   
4   New identities involving q-Euler polynomials o...   
5   Rigid Symmetries and Conservation Laws in Non-...   
6   A Comprehensive Analysis of Uncertainties Affe...   
7   News on PHOTOS Monte Carlo: gamma^* -> pi^+ pi...   
8                   Nonmeasurability in Banach spaces   
9               Jet Shapes and Jet Algorithms in SCET   
10  Testing product states, quantum Merlin-Arthur ...   
11             Euclid Imaging Consortium Science Book   
12           Bayesian Methods and Universal Darwinism   
13  The quadratic character of 1+\sqrt{2} and an e...   
14  Topological properties of punctual Hilbert sch...   

                                             versions  
0   [{'version': 'v1', 'created
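
The query returns the whole `versions` array, while the task asks only for the date of the first version. A sketch of extracting and checking that date on the client side, assuming `created` is stored as an RFC 2822 style string such as "Tue, 3 Jan 2012 10:00:00 GMT" (the output above is cut off before the value, so the format is an assumption). The helper names and the sample document are hypothetical.

```python
from email.utils import parsedate_to_datetime


def first_version_date(doc):
    """Return the creation datetime of the first entry in doc['versions']."""
    return parsedate_to_datetime(doc["versions"][0]["created"])


def submitted_between(doc, first_year, last_year):
    """True when the first version was created within [first_year, last_year]."""
    return first_year <= first_version_date(doc).year <= last_year


# Hypothetical document; the date format is an assumption about the dataset.
sample = {
    "title": "Example title",
    "versions": [
        {"version": "v1", "created": "Tue, 3 Jan 2012 10:00:00 GMT"},
        {"version": "v2", "created": "Mon, 4 Jan 2016 09:30:00 GMT"},
    ],
}

print(first_version_date(sample).date(), submitted_between(sample, 2010, 2015))
```

Applied to `resultados`, `first_version_date` would give one date per article, which can replace the raw `versions` column in the DataFrame.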

## Drop the collection

In [14]:
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient(
    'mongodb://localhost:30001,localhost:30002,localhost:30003/?replicaSet=my-replica-set'
)
db = client["arxiv_db"]
collection = db["articles"]
collection.drop()


## Third Stage
Query the data

In [4]:
import pprint

doc=collection.find_one()
pprint.pprint(doc)

None


In [6]:
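from pandas import DataFrame

# find_one() returns None on an empty collection, so guard before building the frame
df = DataFrame([doc] if doc is not None else [])
df.head()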