To use AutoRAG, you need to convert your raw data into `corpus data` and `qa data` that follow our [data format](https://marker-inc-korea.github.io/AutoRAG/data_creation/data_format.html).

The following guide covers how to use an LLM to create data in a form that AutoRAG can use.

## Load OpenAI API Key

### Add OPENAI_API_KEY secret

First, add your OpenAI API key to the Colab secrets.
Open the 'Secrets' tab in the left sidebar.
There, press 'add new secret' and set the name to `OPENAI_API_KEY`.
Then set the value to your OpenAI API key.
Be sure to turn on the toggle for notebook access!

If the code below runs without errors, you are ready to go!

In [None]:
from google.colab import userdata
import os
openai_api_key = userdata.get('OPENAI_API_KEY')
assert bool(openai_api_key), "You have to set OPENAI_API_KEY at colab secrets."
os.environ["OPENAI_API_KEY"] = openai_api_key
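
Outside Colab, `google.colab.userdata` is not available. The following is a minimal sketch of the same check, assuming the key was exported as an environment variable beforehand. `require_api_key` is a hypothetical helper written for this illustration and is not part of AutoRAG or Colab.

```python
import os


def require_api_key(name="OPENAI_API_KEY", environ=None):
    """Return the key stored under `name`, or raise with a readable message.

    Hypothetical helper for illustration only.
    """
    # Fall back to the real process environment when no mapping is passed in.
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return value


# Demonstration with a stand-in mapping instead of the real environment.
demo_key = require_api_key(environ={"OPENAI_API_KEY": "placeholder-key"})
```

In a real session you would call `require_api_key()` with no arguments so that it reads from `os.environ`.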

In [None]:
%pip install -Uq AutoRAG


In [None]:
import nest_asyncio
nest_asyncio.apply()

## Set raw documents

For this tutorial, I use my favorite document, the MLB rule book (2023), as a PDF file.

To use your own data, just put your raw files in the desired path.


In [None]:
import os
os.makedirs('/content/raw_data', exist_ok=True)  # do not fail if the cell is run twice

Since this is a tutorial, I'll use `wget` to quickly download a single file.

In [None]:
!wget -P /content/raw_data  https://img.mlbstatic.com/mlb-images/image/upload/mlb/atcjzj9j7wrgvsm8wnjq.pdf

--2024-03-29 11:49:44--  https://img.mlbstatic.com/mlb-images/image/upload/mlb/atcjzj9j7wrgvsm8wnjq.pdf
Resolving img.mlbstatic.com (img.mlbstatic.com)... 146.75.93.91
Connecting to img.mlbstatic.com (img.mlbstatic.com)|146.75.93.91|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 959403 (937K) [application/pdf]
Saving to: ‘/content/raw_data/atcjzj9j7wrgvsm8wnjq.pdf’


2024-03-29 11:49:45 (13.7 MB/s) - ‘/content/raw_data/atcjzj9j7wrgvsm8wnjq.pdf’ saved [959403/959403]



## Make `corpus data` from raw documents

First, **load your raw data** as text with a loader from a library such as `llama_index` or `LangChain`.

In this example, I'm going to use `llama_index` to load the documents.

In [None]:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader('/content/raw_data').load_data()

Second, chunk the texts into passages.

In [None]:
from llama_index.core.node_parser import TokenTextSplitter
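# chunk_size and chunk_overlap are splitter settings, so pass them to the constructor
splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=128)
nodes = splitter.get_nodes_from_documents(documents=documents)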
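
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a rough sketch of a sliding window over a list of tokens. It is not how `TokenTextSplitter` is implemented (the real splitter also takes separators and the tokenizer into account); `sliding_window_chunks` is a hypothetical function that only illustrates how the two parameters interact.

```python
def sliding_window_chunks(tokens, chunk_size=512, chunk_overlap=128):
    """Split `tokens` into windows of at most `chunk_size` items.

    Consecutive windows share `chunk_overlap` items. Hypothetical helper,
    for illustration only.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Stand-in "tokens": the integers 0..999 instead of real tokenizer output.
demo_tokens = list(range(1000))
demo_chunks = sliding_window_chunks(demo_tokens, chunk_size=512, chunk_overlap=128)
```

With these values the window advances 384 tokens at a time, so the last 128 tokens of one chunk reappear as the first 128 tokens of the next.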

Third, turn the passages into corpus data with a converter function. There are converter functions for llama index `Document` and `TextNode` objects and for Langchain `Document` objects: `llama_document_to_parquet`, `llama_text_node_to_parquet`, and `langchain_document_to_parquet`, respectively.

In this case, I'm going to use `llama_text_node_to_parquet`.

In [None]:
from autorag.data.corpus import llama_text_node_to_parquet
corpus_df = llama_text_node_to_parquet(nodes, '/content/corpus.parquet')
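
As a rough picture of what a corpus row holds, the sketch below builds plain dictionaries by hand. It assumes the corpus columns are `doc_id`, `contents`, and `metadata`, as I understand the linked data format page; check that page for the authoritative schema. `chunks_to_corpus_rows` and the `file_name` metadata key are hypothetical and are not what the converter functions do internally.

```python
import uuid


def chunks_to_corpus_rows(chunks, file_name):
    """Wrap each passage text in a dict with assumed corpus column names.

    Hypothetical helper; the column names are an assumption based on the
    AutoRAG data format documentation.
    """
    rows = []
    for text in chunks:
        rows.append({
            "doc_id": str(uuid.uuid4()),           # unique id per passage
            "contents": text,                      # the passage text itself
            "metadata": {"file_name": file_name},  # illustrative metadata only
        })
    return rows


# Placeholder passages, not real rule book text.
demo_corpus_rows = chunks_to_corpus_rows(
    ["first placeholder passage", "second placeholder passage"],
    file_name="rulebook.pdf",
)
```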

## Make `qa data` from corpus data

Set the LLM used to create the `qa data`.

In this case, I'm going to use the `gpt-3.5-turbo` model.

In [None]:
from llama_index.llms.openai import OpenAI
llm = OpenAI(model='gpt-3.5-turbo', temperature=1.0)

In [None]:
from autorag.data.qacreation import generate_qa_llama_index, make_single_content_qa
qa_df = make_single_content_qa(corpus_df, 50, generate_qa_llama_index, llm=llm, question_num_per_content=1,
                               output_filepath='/content/qa.parquet')

`generate_qa_llama_index` is a function designed to generate **questions** and their **generation_gt** for each content. You can set the number of questions per content by changing the `question_num_per_content` parameter.

The `make_single_content_qa` function generates the `qa.parquet` file using the function passed to it. It generates 'single content' QA data, also known as 'single-hop' or 'single-document' QA data, which means that each question can be answered using only one passage.
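
The single-content idea can be sketched without calling an LLM. The code below is not AutoRAG's implementation: `stub_generate_qa` stands in for `generate_qa_llama_index`, `make_single_content_rows` is a hypothetical simplification of `make_single_content_qa`, and the column names `qid`, `query`, `retrieval_gt`, and `generation_gt` are assumed from the linked data format page. It also assumes the second argument (50 in the cell above) is the number of passages to sample.

```python
import random


def stub_generate_qa(content, question_num):
    """Stand-in for an LLM call: returns placeholder question/answer pairs."""
    return [
        {"query": f"placeholder question {i + 1}", "generation_gt": f"placeholder answer {i + 1}"}
        for i in range(question_num)
    ]


def make_single_content_rows(corpus_rows, content_size, generate_fn,
                             question_num_per_content=1, seed=42):
    """Sample passages and attach each generated question to exactly one passage."""
    rng = random.Random(seed)
    sampled = rng.sample(corpus_rows, content_size)
    qa_rows = []
    for row in sampled:
        for pair in generate_fn(row["contents"], question_num_per_content):
            qa_rows.append({
                "qid": f"q{len(qa_rows)}",
                "query": pair["query"],
                # Single-hop: the ground truth is one passage id only.
                "retrieval_gt": [[row["doc_id"]]],
                "generation_gt": [pair["generation_gt"]],
            })
    return qa_rows


# Five placeholder passages; sample three and ask two questions about each.
demo_corpus = [{"doc_id": f"doc-{i}", "contents": f"placeholder passage {i}"} for i in range(5)]
demo_qa_rows = make_single_content_rows(demo_corpus, 3, stub_generate_qa,
                                        question_num_per_content=2)
```

Every row in `demo_qa_rows` points at a single `doc_id`, which is what makes the data 'single content': multi-hop data would list several passages per question instead.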