## Load OpenAI API Key

### Add OPENAI_API_KEY secret

First, add your OpenAI API key to the Colab secrets.
Open the 'Secrets' tab in the left sidebar.
There, press 'Add new secret' and set the name to `OPENAI_API_KEY`.
Then set the value to your OpenAI API key.
Be sure to turn on the notebook access toggle!

If the code below runs without errors, you are ready to go!

In [None]:
from google.colab import userdata
import os
openai_api_key = userdata.get('OPENAI_API_KEY')
assert bool(openai_api_key), "You have to set OPENAI_API_KEY in Colab secrets."
os.environ["OPENAI_API_KEY"] = openai_api_key
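
The cell above only works inside Colab, because `google.colab` is not available elsewhere. As a minimal sketch for other environments, the key could be read from an environment mapping instead. The helper name `resolve_api_key` is hypothetical and not part of AutoRAG or Colab.

```python
import os


def resolve_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing.

    `env` is any mapping of variable names to values; it defaults to os.environ.
    This helper is a hypothetical illustration, not an AutoRAG function.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set. Add it to your environment or secrets.")
    return key


# Example with an explicit mapping, so no real key is needed to try it out.
print(resolve_api_key({"OPENAI_API_KEY": "sk-example"}))
```

Passing the mapping explicitly keeps the check easy to test without touching the real environment.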

In [None]:
%pip install -Uq AutoRAG chromadb numpy==1.24.4 datasets

In [None]:
import nest_asyncio
nest_asyncio.apply()

## Download data and preprocess

In this tutorial, we will use the `eli5` dataset for evaluation.

In [None]:
import os
# exist_ok=True lets you re-run this cell without a FileExistsError
os.makedirs('/content/eli5_data', exist_ok=True)

In [None]:
import os

from datasets import load_dataset

def load_eli5_dataset(save_path):
    # Hugging Face dataset repository that holds the sampled ELI5 data
    file_path = "MarkrAI/eli5_sample_autorag"

    # refuse to overwrite the files this function writes
    for file_name in ("corpus.parquet", "qa_train.parquet", "qa_test.parquet"):
        if os.path.exists(os.path.join(save_path, file_name)):
            raise ValueError(f"{file_name} already exists")

    # load dataset
    corpus_dataset = load_dataset(file_path, "corpus")['train'].to_pandas()
    qa_dataset = load_dataset(file_path, "qa")
    qa_train_dataset = qa_dataset['train'].to_pandas()
    qa_test_dataset = qa_dataset['test'].to_pandas()

    # save data
    corpus_dataset.to_parquet(os.path.join(save_path, "corpus.parquet"))
    qa_train_dataset.to_parquet(os.path.join(save_path, "qa_train.parquet"))
    qa_test_dataset.to_parquet(os.path.join(save_path, "qa_test.parquet"))

In [None]:
load_eli5_dataset("/content/eli5_data")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some datasets params were ignored: ['splits']. Make sure to use only valid params for the dataset builder and to have a up-to-date version of the `datasets` library.


In [None]:
import pandas as pd

qa_df = pd.read_parquet('/content/eli5_data/qa_train.parquet')
# In this sample code, we optimize the pipeline with only 100 samples.
sample_qa_df = qa_df.sample(100, random_state=42)
# drop=True discards the old index instead of saving it as an extra column
sample_qa_df = sample_qa_df.reset_index(drop=True)
sample_qa_df.to_parquet('/content/eli5_data/qa_sample.parquet')

## Make the config YAML file

In this file, we test three retrieval methods: vectordb, bm25, and hybrid_rrf.
For generation, we use one prompt and the OpenAI gpt-3.5-turbo-16k model.
Generation performance is evaluated with meteor, rouge, and sem_score.

You can learn more about the config YAML file [here](https://marker-inc-korea.github.io/AutoRAG/optimization/custom_config.html).
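
The retrieval node in this config is scored with `retrieval_f1`, `retrieval_recall`, and `retrieval_precision`. As a rough illustration of what those names usually mean, here is a simplified sketch for a single query with a flat set of relevant ids. The function `retrieval_scores` and the document ids are hypothetical, and AutoRAG's own implementation may differ in its details.

```python
def retrieval_scores(retrieved_ids, relevant_ids):
    """Compute precision, recall, and F1 for one query.

    Simplified textbook definitions over sets of ids; this is an
    illustration, not AutoRAG's implementation.
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# With top_k = 3, one of the three retrieved passages is relevant,
# and one of the two relevant passages was found.
scores = retrieval_scores(["d1", "d2", "d3"], ["d1", "d4"])
print(scores)
```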

In [None]:
%%writefile config.yaml

node_lines:
- node_line_name: retrieve_node_line
  nodes:
    - node_type: retrieval
      strategy:
        metrics: [retrieval_f1, retrieval_recall, retrieval_precision]
      top_k: 3
      modules:
        - module_type: vectordb
          embedding_model: openai
        - module_type: bm25
        - module_type: hybrid_rrf
          target_modules: ('bm25', 'vectordb')
          rrf_k: [5, 10, 30]
- node_line_name: post_retrieve_node_line
  nodes:
    - node_type: prompt_maker
      strategy:
        metrics:
          - metric_name: meteor
          - metric_name: rouge
          - metric_name: sem_score
            embedding_model: openai
      modules:
        - module_type: fstring
          prompt: "Read the passages and answer the given question. \n Question: {query} \n Passage: {retrieved_contents} \n Answer : "
    - node_type: generator
      strategy:
        metrics:
          - metric_name: meteor
          - metric_name: rouge
          - metric_name: sem_score
            embedding_model: openai
      modules:
        - module_type: llama_index_llm
          llm: openai
          model: [gpt-3.5-turbo-16k]
          batch: 2

Writing config.yaml
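
The `hybrid_rrf` module above combines the `bm25` and `vectordb` results, and `rrf_k` is the constant it tries at 5, 10, and 30. As a minimal sketch of the general reciprocal rank fusion idea (not AutoRAG's actual code), each document collects `1 / (rrf_k + rank)` from every ranking it appears in. The function name and the document ids below are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, rrf_k):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores the sum of 1 / (rrf_k + rank) over the lists
    it appears in, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first; ties are broken by id to stay deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["d1", "d2", "d3"]      # hypothetical BM25 result
vectordb_ranking = ["d3", "d1", "d4"]  # hypothetical vector search result
fused = reciprocal_rank_fusion([bm25_ranking, vectordb_ranking], rrf_k=5)
print(fused)
```

A smaller `rrf_k` gives top-ranked documents a larger advantage, which is why it is worth trying several values.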


**You must use one project folder per dataset.**
This means that if the dataset changes even slightly, you need to make a new project folder.

In [None]:
# make project folder
import os
os.makedirs('/content/project_dir')
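
One way to follow the one-folder-per-dataset rule is to derive the folder name from the dataset contents, so that any change to the data leads to a new folder. This is only a sketch of that idea: the helper `project_dir_for` and its naming scheme are hypothetical and not something AutoRAG provides.

```python
import hashlib
from pathlib import Path


def project_dir_for(dataset_files, base_dir):
    """Create and return a project folder named after the dataset contents.

    The folder name contains a short SHA-256 digest of the file names and
    bytes, so a changed dataset maps to a different folder.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in dataset_files):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    project_dir = Path(base_dir) / f"project_{digest.hexdigest()[:12]}"
    project_dir.mkdir(parents=True, exist_ok=True)
    return project_dir


# Possible usage with the files from this tutorial:
# project_dir_for(
#     ["/content/eli5_data/qa_sample.parquet", "/content/eli5_data/corpus.parquet"],
#     "/content",
# )
```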

## Run evaluation

In [None]:
from autorag.evaluator import Evaluator
evaluator = Evaluator(qa_data_path='/content/eli5_data/qa_sample.parquet', corpus_data_path='/content/eli5_data/corpus.parquet',
                      project_dir='/content/project_dir')

Running the trial below evaluates the RAG pipeline with your data.

When it finishes, you can check the results in the `project_dir` folder on the left. Open each subfolder and look at the `.csv` files.
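
After a trial has run, the result files can also be listed from code instead of browsing the file panel. This is a minimal sketch using only the standard library. The helper `list_result_csvs` is hypothetical, and it makes no assumption about which file names AutoRAG writes.

```python
from pathlib import Path


def list_result_csvs(project_dir):
    """Return every .csv file below project_dir, sorted by path."""
    root = Path(project_dir)
    if not root.is_dir():
        return []  # nothing to list before the first trial has run
    return sorted(root.rglob("*.csv"))


for csv_path in list_result_csvs("/content/project_dir"):
    print(csv_path)
```

Each listed file can then be opened with `pd.read_csv` for a closer look.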

In [None]:
evaluator.start_trial('/content/config.yaml')


## Extract pipeline

Now, let's deploy the optimal pipeline found during evaluation!

---

First, you need to save the optimal pipeline as a YAML file.

Let's use the `extract_best_config` function to make the `best.yaml` file.

`output_path` must be a `.yaml` or `.yml` file. If it is None, the function does not save a YAML file and only returns the config as a dict.

In [None]:
from autorag.deploy import extract_best_config
extract_best_config(trial_path='/content/project_dir/0', output_path='/content/project_dir/0/best.yaml')

{'node_lines': [{'node_line_name': 'retrieve_node_line',
   'nodes': [{'node_type': 'retrieval',
     'strategy': {'metrics': ['retrieval_f1',
       'retrieval_recall',
       'retrieval_precision']},
     'modules': [{'module_type': 'bm25', 'top_k': 3}]}]},
  {'node_line_name': 'post_retrieve_node_line',
   'nodes': [{'node_type': 'prompt_maker',
     'strategy': {'metrics': [{'metric_name': 'meteor'},
       {'metric_name': 'rouge'},
       {'metric_name': 'sem_score', 'embedding_model': 'openai'}]},
     'modules': [{'module_type': 'fstring',
       'prompt': 'Read the passages and answer the given question. \n Question: {query} \n Passage: {retrieved_contents} \n Answer : '}]},
    {'node_type': 'generator',
     'strategy': {'metrics': [{'metric_name': 'meteor'},
       {'metric_name': 'rouge'},
       {'metric_name': 'sem_score', 'embedding_model': 'openai'}]},
     'modules': [{'module_type': 'llama_index_llm',
       'llm': 'openai',
       'model': 'gpt-3.5-turbo-16k',
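
Since the returned value is a plain dict, the chosen modules can be summarized with a few lines of Python. The sketch below assumes the `node_lines` / `nodes` / `modules` layout visible in the output above. The helper `summarize_best_config` is hypothetical, and the example dict is a trimmed copy of the first node line only.

```python
def summarize_best_config(config):
    """Map each node type to the module types selected for it."""
    summary = {}
    for node_line in config["node_lines"]:
        for node in node_line["nodes"]:
            summary[node["node_type"]] = [
                module["module_type"] for module in node["modules"]
            ]
    return summary


# Trimmed example taken from the retrieval node line shown above.
example_config = {
    "node_lines": [
        {
            "node_line_name": "retrieve_node_line",
            "nodes": [
                {
                    "node_type": "retrieval",
                    "modules": [{"module_type": "bm25", "top_k": 3}],
                }
            ],
        }
    ]
}
print(summarize_best_config(example_config))
```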
      

## Deploy your optimal RAG pipeline

Second, you can deploy it as a `CLI`, `API server`, or `Web Interface`.

### 1. Run as a CLI

You can use the optimal RAG pipeline right away with the extracted YAML file.

In [None]:
from autorag.deploy import Runner
runner = Runner.from_yaml('/content/project_dir/0/best.yaml', project_dir='/content/project_dir')
runner.run('who are you?')

'The passage does not provide any information about the identity of the person.'

### 2. Run as an API server

You can run this pipeline as an API server.

Check out the API endpoint documentation [here](https://marker-inc-korea.github.io/AutoRAG/deploy/api_endpoint.html).

In [None]:
from autorag.deploy import Runner
runner = Runner.from_yaml('/content/project_dir/0/best.yaml', project_dir='/content/project_dir')
runner.run_api_server()

INFO:     Started server process [2221]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [2221]


KeyboardInterrupt: 
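
While the server is running, a client can send it a query over HTTP. The sketch below only builds the request with the standard library. The `/run` path and the `query` and `result_column` fields are assumptions here, so check the endpoint documentation linked above for the exact schema of your AutoRAG version. The helper `build_run_request` is hypothetical.

```python
import json
import urllib.request


def build_run_request(query, base_url="http://localhost:8000"):
    """Build a POST request for the pipeline server.

    The "/run" path and the payload field names are assumptions;
    verify them against the API endpoint documentation.
    """
    payload = {"query": query, "result_column": "generated_texts"}
    return urllib.request.Request(
        url=f"{base_url}/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request("who are you?")
print(request.get_method(), request.full_url)

# To send it while the server is up (from another process or notebook):
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

Note that `runner.run_api_server()` blocks the cell it runs in, so the client has to run elsewhere.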

### 3. Run as a Web Interface

You can run this pipeline as a web interface.

Check out the web interface documentation [here](https://marker-inc-korea.github.io/AutoRAG/deploy/web.html).

In [None]:
!autorag run_web --yaml_path /content/project_dir/0/best.yaml --project_dir /content/project_dir

