# Use a LlamaIndex loader with Lilac

This notebook shows how to fetch documents with any LlamaIndex loader and load them into Lilac.

LlamaIndex loaders [can be found on LlamaHub](https://llamahub.ai/).

In this example, we'll use the [ArxivReader loader from LlamaHub](https://llamahub.ai/l/papers-arxiv) to load arXiv papers into Lilac.


In [1]:
!pip install pypdf



In [2]:
from llama_index import download_loader

# See: https://llamahub.ai/l/papers-arxiv
ArxivReader = download_loader("ArxivReader")

loader = ArxivReader()
documents = loader.load_data(search_query='au:Karpathy')

In [3]:
import lilac as ll

# Set the project directory for Lilac.
ll.set_project_dir('./data')



In [4]:
# This assumes you already have a Lilac project set up.
# If you don't, use ll.init(project_dir='./data')
ll.create_dataset(
  config=ll.DatasetConfig(
    namespace='local',
    name='arxiv-karpathy',
    source=ll.LlamaIndexDocsSource(
      # `documents` comes from the loader.load_data call in an earlier cell.
      documents=documents,
    ),
  )
)

Reading from source llama_index_docs...: 100%|██████████| 107/107 [00:00<00:00, 10133.46it/s]


Executing:
SELECT COUNT() as count FROM t
Query took 0.001s.
Executing:

        SELECT avg(length(val))
        FROM (SELECT "doc_id" AS val FROM t) USING SAMPLE 1000;
      
Executing:

        SELECT avg(length(val))
        FROM (SELECT "text" AS val FROM t) USING SAMPLE 1000;
      
Executing:

        SELECT avg(length(val))
        FROM (SELECT "page_label" AS val FROM t) USING SAMPLE 1000;
      
Executing:

        SELECT avg(length(val))
        FROM (SELECT "file_name" AS val FROM t) USING SAMPLE 1000;
      
Executing:

        SELECT avg(length(val))
        FROM (SELECT "Title of this paper" AS val FROM t) USING SAMPLE 1000;
      
Executing:

        SELECT avg(length(val))
        FROM (SELECT "Authors" AS val FROM t) USING SAMPLE 1000;
      
Query took 0.001s.
Executing:

        SELECT avg(length(val))
        FROM (SELECT "Date published" AS val FROM t) USING SAMPLE 1000;
      
Executing:
SELECT count(val) FROM (SELECT "doc_id" as val FROM t)
Query took 0.001s.

<lilac.data.dataset_duckdb.DatasetDuckDB at 0x2a843ff10>
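
The query log above touches the columns `doc_id`, `text`, `page_label`, `file_name`, `Title of this paper`, `Authors`, and `Date published`, which suggests that each loaded document becomes one row, with its metadata spread into top-level columns. The following is a minimal sketch of that flattening, not Lilac's actual implementation: `FakeDocument` and `documents_to_rows` are hypothetical names, and the sample values are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeDocument:
  """Hypothetical stand-in for a LlamaIndex document (not the real class)."""
  doc_id: str
  text: str
  metadata: dict[str, Any] = field(default_factory=dict)


def documents_to_rows(docs: list[FakeDocument]) -> list[dict[str, Any]]:
  """Flatten each document into one row: doc_id, text, then metadata keys."""
  rows = []
  for doc in docs:
    row = {'doc_id': doc.doc_id, 'text': doc.text}
    # Metadata keys become top-level columns; doc_id and text win on conflict.
    for key, value in doc.metadata.items():
      row.setdefault(key, value)
    rows.append(row)
  return rows


# Made-up sample data, only to show the resulting row shape.
docs = [
  FakeDocument('doc-1', 'First page text.', {'page_label': '1', 'file_name': 'a.pdf'}),
  FakeDocument('doc-2', 'Second page text.', {'page_label': '2', 'file_name': 'a.pdf'}),
]
rows = documents_to_rows(docs)
print(rows[0])
```

Under this assumption, a paper split into several pages would appear as several rows that share a `file_name` but differ in `page_label`.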

In [5]:
# Print the first few rows:
dataset = ll.get_dataset('local', 'arxiv-karpathy')
for row in dataset.select_rows(['*'], limit=5):
  print(row)

Executing:
SELECT COUNT() as count FROM t
Query took 0.000s.
{'doc_id': 'd882b74b-1c27-4f44-aa5d-4e25683b9f5a', 'text': 'DenseCap: Fully Convolutional Localization Networks for Dense Captioning\nJustin Johnson∗Andrej Karpathy∗Li Fei-Fei\nDepartment of Computer Science, Stanford University\n{jcjohns, karpathy, feifeili }@cs.stanford.edu\nAbstract\nWe introduce the dense captioning task, which requires a\ncomputer vision system to both localize and describe salient\nregions in images in natural language. The dense caption-\ning task generalizes object detection when the descriptions\nconsist of a single word, and Image Captioning when one\npredicted region covers the full image. To address the local-\nization and description task jointly we propose a Fully Con-\nvolutional Localization Network (FCLN) architecture that\nprocesses an image with a single, efﬁcient forward pass, re-\nquires no external regions proposals, and can be trained\nend-to-end with a single round of optimization. The
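
Because the `text` field holds a full page of a paper, printing whole rows produces very long output. One option is to shorten long string values before printing. The helper below is a small sketch using only the standard library; `preview_row` is a hypothetical name, and it assumes each row behaves like a plain dict, as the printed output above suggests.

```python
def preview_row(row: dict, max_chars: int = 80) -> dict:
  """Return a copy of `row` with long string values shortened for printing."""
  short = {}
  for key, value in row.items():
    if isinstance(value, str) and len(value) > max_chars:
      # Keep the first `max_chars` characters and mark the cut.
      short[key] = value[:max_chars] + '...'
    else:
      short[key] = value
  return short


# Made-up sample row, only to show the truncation.
sample = {'doc_id': 'doc-1', 'text': 'x' * 200}
print(preview_row(sample, max_chars=10))
```

In the loop above, `print(preview_row(row))` could then stand in for `print(row)`.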

In [None]:
# You can start a Lilac server with:
ll.start_server(project_dir='./data')

INFO:     Started server process [2276]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:5432 (Press CTRL+C to quit)
