# End-to-End.

# Data Lake.

Initial data load.

```bash
rag minio --path-to-raw-data ../../data/raw/arxiv-metadata-oai-snapshot.json \
    --bucket papers \
    --processes 1 \
    --path-to-env ../../.dev-env
```
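As a rough illustration of what an initial load involves, the sketch below splits a snapshot into fixed-size batches, each of which could become one object in the bucket. It assumes the snapshot holds one JSON record per line. It is not the `rag minio` implementation: `iter_batches`, `to_object`, the `part-NNNNN.json` naming, and the sample records are hypothetical, and the upload step itself is left out.

```python
import io
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def iter_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of parsed JSON records, at most `batch_size` per list."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    # Parse lazily so a large snapshot is never held in memory at once.
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch


def to_object(batch: List[dict]) -> bytes:
    """Serialize one batch back to JSON Lines, ready to store as one object."""
    return "\n".join(json.dumps(record) for record in batch).encode("utf-8")


# Tiny in-memory stand-in for the snapshot file (hypothetical records).
snapshot = io.StringIO(
    '{"id": "0810.4482"}\n{"id": "0704.0001"}\n{"id": "0704.0002"}\n'
)
batches = list(iter_batches(snapshot, batch_size=2))

# Hypothetical object names; the real naming scheme is not shown in this notebook.
objects: Dict[str, bytes] = {
    f"part-{i:05d}.json": to_object(batch) for i, batch in enumerate(batches)
}
print(sorted(objects))
```

Each value in `objects` is a byte string that an S3-compatible client could store under the corresponding key.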

# RAG.

## Preprocessing.

Preprocessing can be applied to all or part of the data in the `papers` bucket.

In [9]:
!rag preprocessing --path-to-data s3a://papers/100k/*.json

INFO:rag.processor.preprocessing:Setting up Spark...
:: loading settings :: url = jar:file:/usr/local/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
com.amazonaws#aws-java-sdk-bundle added as a dependency
org.mongodb.spark#mongo-spark-connector_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-a0b506b5-578f-4d0c-a20f-e0a45d1b1209;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.4 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
	found org.mongodb.spark#mongo-spark-connector_2.12;10.3.0 in central
	found org.mongodb#mongodb-driver-sync;4.8.2 in central
	[4.8.2] org.mongodb#mongodb-driver-sync;[4.8.1,4.8.99)
	found org.mongodb#bson;4.8.2 in cent
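Which part of the bucket gets preprocessed is decided by the glob in `--path-to-data`. The sketch below approximates how a pattern such as `100k/*.json` narrows a list of object keys, with `*` never crossing a `/`. It is only an approximation of Hadoop-style globbing written with the standard library; `select_keys` and the listed keys are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List


def select_keys(keys: List[str], pattern: str) -> List[str]:
    """Return the keys matching `pattern`, comparing one path segment at a time."""
    pattern_parts = pattern.split("/")
    selected = []
    for key in keys:
        key_parts = key.split("/")
        # Same depth, and every segment matches its pattern segment.
        if len(key_parts) == len(pattern_parts) and all(
            fnmatchcase(k, p) for k, p in zip(key_parts, pattern_parts)
        ):
            selected.append(key)
    return selected


# Hypothetical object keys inside the `papers` bucket.
keys = [
    "100k/part-00000.json",
    "100k/part-00001.json",
    "100k/_SUCCESS",
    "full/part-00000.json",
]
subset = select_keys(keys, "100k/*.json")    # only the 100k sample
everything = select_keys(keys, "*/*.json")   # every JSON object one level down
print(subset)
```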

## Embeddings.

Embeddings can be computed for all or part of the data stored in the `arxiv` database, `sentences` collection.

In [11]:
!rag embeddings --model 'sentence-transformers/all-MiniLM-L6-v2' \
    --source s3a://papers/100k

INFO:rag.processor.embeddings:Setting up Spark...
:: loading settings :: url = jar:file:/usr/local/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
com.amazonaws#aws-java-sdk-bundle added as a dependency
org.mongodb.spark#mongo-spark-connector_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-912f5b5d-b98c-4b5c-a9f0-529a1b7d5494;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.4 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
	found org.mongodb.spark#mongo-spark-connector_2.12;10.3.0 in central
	found org.mongodb#mongodb-driver-sync;4.8.2 in central
	[4.8.2] org.mongodb#mongodb-driver-sync;[4.8.1,4.8.99)
	found org.mongodb#bson;4.8.2 in central
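Conceptually, the embeddings step maps each document's text to a vector and stores the vector next to the document. The sketch below shows that shape with a pluggable `encode` callable. It is an assumption-laden illustration, not the `rag embeddings` code: `embed_documents`, the `embedding` field name, and `toy_encode` are hypothetical, and the toy encoder is not a real model. With the `sentence-transformers` package installed, `SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode` could be passed as `encode` instead.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def embed_documents(
    docs: Sequence[Dict[str, str]],
    encode: Callable[[List[str]], Sequence[Sequence[float]]],
    batch_size: int = 2,
) -> List[Dict[str, object]]:
    """Return copies of `docs` with an added `embedding` field (hypothetical name)."""
    embedded = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        # Encode a whole batch at once; real models are far faster this way.
        vectors = encode([doc["full_text"] for doc in batch])
        for doc, vector in zip(batch, vectors):
            embedded.append({**doc, "embedding": [float(x) for x in vector]})
    return embedded


def toy_encode(texts: List[str]) -> List[Vector]:
    """Stand-in encoder returning [character count, word count]. Not a real model."""
    return [[float(len(text)), float(len(text.split()))] for text in texts]


# Hypothetical documents using the `_id` and `full_text` fields seen in this notebook.
docs = [
    {"_id": "a", "full_text": "solar axions"},
    {"_id": "b", "full_text": "magnet bores filled with gas"},
    {"_id": "c", "full_text": "photon"},
]
embedded = embed_documents(docs, toy_encode)
print(embedded[0]["embedding"])
```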

## Most-Similar-Docs.

In [12]:
!rag most-similar-docs --help

Usage: rag most-similar-docs [OPTIONS]

  Get most similar documents.

  Args:     text (str): The input text.     model (str): The name of the model
  to use.     num_docs (int): The number of most similar documents to
  retrieve.     query (str): The query to use for helping to retrieve the most
  similar documents.     path_to_env (str): The path to the environment.

  Returns:     List[str]: A list of most similar documents.

Options:
  --text TEXT         [required]
  --model TEXT
  --num-docs INTEGER
  --query TEXT
  --path-to-env TEXT
  --help              Show this message and exit.


In [15]:
!rag most-similar-docs --text "solar axions or other pseudoscalar particles that couple to two photons" \
    --num-docs 5 \
    --query '[{$match: { _id: {$regex: "0810"} } }]'

INFO:rag.processor.most_similar_docs:Setting up Spark session...
:: loading settings :: url = jar:file:/usr/local/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
com.amazonaws#aws-java-sdk-bundle added as a dependency
org.mongodb.spark#mongo-spark-connector_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5e6ac52b-e9b1-434c-bf07-b8972a1dd438;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.4 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
	found org.mongodb.spark#mongo-spark-connector_2.12;10.3.0 in central
	found org.mongodb#mongodb-driver-sync;4.8.2 in central
	[4.8.2] org.mongodb#mongodb-driver-sync;[4.8.1,4.8.99)
	found org.mongodb#bson;4
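The `--query` value in the command above is a MongoDB aggregation pipeline with a single `$match` stage. Because the `$regex` is not anchored, it matches `0810` anywhere in `_id`, not only at the start. The sketch below emulates just this one stage locally with Python's `re` module to make that visible; `match_ids` and the sample ids other than `0810.4482` are hypothetical.

```python
import re
from typing import Dict, List

# The pipeline from the command above, written as a Python structure.
pipeline = [{"$match": {"_id": {"$regex": "0810"}}}]


def match_ids(ids: List[str], pipeline: List[Dict]) -> List[str]:
    """Emulate a single `$match` on `_id` with `$regex` (this narrow case only)."""
    pattern = re.compile(pipeline[0]["$match"]["_id"]["$regex"])
    # `search` finds the pattern anywhere in the string, like an unanchored `$regex`.
    return [doc_id for doc_id in ids if pattern.search(doc_id)]


ids = ["0810.4482", "0704.0810", "0811.0001"]
unanchored = match_ids(ids, pipeline)

# Anchoring with `^` keeps only ids that start with 0810.
anchored = match_ids(ids, [{"$match": {"_id": {"$regex": "^0810"}}}])
print(unanchored, anchored)
```

If the intent is to restrict retrieval to identifiers beginning with `0810`, the anchored form `"^0810"` is the safer filter.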

In [20]:
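from rag.clients import setup_mongo_client

# Connect to MongoDB using the settings from the dev environment file.
mongo_client = setup_mongo_client(path_to_env="../../.dev-env")
db = mongo_client["arxiv"]
sentences = db["sentences"]
# Fetch one document by its arXiv id and print its stored text.
print(sentences.find_one({"_id": "0810.4482"}).get("full_text"))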

  We have searched for solar axions or other pseudoscalar particles that couple
to two photons by using the CERN Axion Solar Telescope (CAST) setup. Whereas we
previously have reported results from CAST with evacuated magnet bores (Phase
I), setting limits on lower mass axions, here we report results from CAST where
the magnet bores were filled with \hefour gas (Phase II) of variable pressure.
The introduction of gas generated a refractive photon mass $m_\gamma$, thereby
achieving the maximum possible conversion rate for those axion masses \ma that
match $m_\gamma$. With 160 different pressure settings we have scanned \ma up
to about 0.4 eV, taking approximately 2 h of data for each setting. From the
absence of excess X-rays when the magnet was pointing to the Sun, we set a
typical upper limit on the axion-photon coupling of $\gag\lesssim 2.17\times
10^{-10} {\rm GeV}^{-1}$ at 95% CL for $\ma \lesssim 0.4$ eV, the exact result
depending on the pressure setting. The excluded parameter r
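Retrieval of this kind usually comes down to ranking stored vectors by their similarity to the query vector. The notebook does not say which measure `rag most-similar-docs` uses, so the sketch below assumes cosine similarity; `cosine_similarity`, `most_similar`, the two-dimensional vectors, and the ids other than `0810.4482` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    num_docs: int,
) -> List[Tuple[str, float]]:
    """Return the `num_docs` highest-scoring (id, score) pairs, best first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_docs]


# Toy vectors; real embeddings have hundreds of dimensions.
doc_vecs = {
    "0810.4482": [1.0, 0.0],
    "0810.0001": [0.6, 0.8],
    "0810.0002": [0.0, 1.0],
}
top = most_similar([1.0, 0.1], doc_vecs, num_docs=2)
print([doc_id for doc_id, _ in top])
```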

## QA.

In [25]:
!rag qa --text "What was used to search for solar axions or other pseudoscalar particles?" \
    --num-docs 1 \
    --query '[{$match: { _id: {$regex: "0810"} } }]' \
    --path-to-env "../../.dev-env"

INFO:rag.processor.most_similar_docs:Setting up Spark session...
:: loading settings :: url = jar:file:/usr/local/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
com.amazonaws#aws-java-sdk-bundle added as a dependency
org.mongodb.spark#mongo-spark-connector_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-eb4edd1f-bcf5-4b55-bda5-d58095e76cb0;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.4 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
	found org.mongodb.spark#mongo-spark-connector_2.12;10.3.0 in central
	found org.mongodb#mongodb-driver-sync;4.8.2 in central
	[4.8.2] org.mongodb#mongodb-driver-sync;[4.8.1,4.8.99)
	found org.mongodb#bson;4

LLM.

![image](llm.png)
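A common way to turn retrieved documents into an answer is to place them in the prompt ahead of the question. How `rag qa` builds its prompt is not shown here, so the sketch below is a guess at the general pattern only: `build_prompt`, the prompt wording, and the `max_chars` budget are hypothetical.

```python
from typing import List


def build_prompt(question: str, documents: List[str], max_chars: int = 2000) -> str:
    """Assemble a question-answering prompt from retrieved documents."""
    context_parts = []
    used = 0
    for index, doc in enumerate(documents, start=1):
        remaining = max_chars - used
        if remaining <= 0:
            break  # context budget exhausted, drop the remaining documents
        # Collapse newlines and repeated spaces, then trim to the budget.
        text = " ".join(doc.split())[:remaining]
        context_parts.append(f"[{index}] {text}")
        used += len(text)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# The opening of the abstract printed earlier in this notebook.
retrieved = [
    "We have searched for solar axions or other pseudoscalar particles that couple\n"
    "to two photons by using the CERN Axion Solar Telescope (CAST) setup."
]
prompt = build_prompt(
    "What was used to search for solar axions or other pseudoscalar particles?",
    retrieved,
)
print(prompt)
```

Limiting the context with `max_chars` is a crude stand-in for a token budget; a real pipeline would more likely count tokens with the model's tokenizer.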

# Drafts.

## Knowledge Graphs.

In [26]:
...

Ellipsis