# LlamaParse over PowerPoint Files

In this notebook we show you how to build a RAG pipeline over [our talk at PyData Global](https://docs.google.com/presentation/d/1rFQ0hPyYja3HKRdGEgjeDxr0MSE8wiQ2iu4mDtwR6fc/edit?usp=sharing) in 2023.

We use LlamaParse to load the slides in .pptx format, and LlamaIndex to build a RAG pipeline over them.

**NOTE**: LlamaParse is capable of image extraction through JSON mode, but in this notebook we stick with text.

In [None]:
import nest_asyncio

# Allow nested event loops so LlamaParse's async calls can run inside the notebook's own loop.
nest_asyncio.apply()

from llama_parse import LlamaParse

In [None]:
import os

# Replace the "llx-" placeholder with your own API key.
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-"
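
Before parsing, it can help to confirm that the placeholder above was actually replaced. The following is a minimal sketch: `has_real_api_key` is a hypothetical helper (not part of `llama_parse`), and it assumes only what the cell above shows, namely that the key lives in `LLAMA_CLOUD_API_KEY` and that `"llx-"` on its own is a placeholder.

```python
import os


def has_real_api_key(env=None, name="LLAMA_CLOUD_API_KEY"):
    """Return True if `name` is set to something other than the bare placeholder.

    Hypothetical helper for this notebook; it does not validate the key
    against any service, it only catches the "forgot to fill it in" case.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    return key not in ("", "llx-")


if not has_real_api_key():
    print("LLAMA_CLOUD_API_KEY is missing or still set to the placeholder.")
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.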

## Download Data

First, open the slides at https://docs.google.com/presentation/d/1rFQ0hPyYja3HKRdGEgjeDxr0MSE8wiQ2iu4mDtwR6fc/edit?usp=sharing, export them in .pptx format, and place the file in the folder you're running this notebook from.

Name the file `pydata_global.pptx`.
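
A quick check that the export landed where the notebook expects it can save a failed parsing job. This is a minimal sketch using only the standard library; `slides_ready` is a hypothetical helper name, and the only assumption is the file name given above.

```python
from pathlib import Path


def slides_ready(path="pydata_global.pptx"):
    """Return True if `path` is an existing file with a .pptx extension."""
    p = Path(path)
    return p.is_file() and p.suffix.lower() == ".pptx"


if not slides_ready():
    print("pydata_global.pptx not found in", Path.cwd())
```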

## [Basic] Build a RAG Pipeline over PowerPoint Text

In this example, we use LlamaParse in markdown mode to extract text from the slides, and we build a top-k RAG pipeline over it.

**Notes**: 
- This does not use our `MarkdownElementNodeParser`, which is tailored for documents with tables.
- This also does not parse out images (we show that in the next section).


In [None]:
parser = LlamaParse(result_type="markdown")

In [None]:
docs = parser.load_data("pydata_global.pptx")

Started parsing the file under job_id 9c687e37-4239-4c2f-b2a1-2564bfc98473


Let's take a look at a few slides.

In [None]:
print(docs[0].get_content()[:5000])

# Building and Productionizing RAG

Jerry Liu, LlamaIndex co-founder/CEO
---
|Content|Page Number|
|---|---|
|Document Processing| |
|Tagging & Extraction| |
|Knowledge Base| |
|Knowledge Search & QA| |
|Workflow:| |
|Read latest messages from user A| |
|Send email suggesting next-steps| |
|Document| |
|Human:| |
|Agent:| |
|Topic:| |
|Summary:| |
|Author:| |
|Conversational Agent| |
|Workflow Automation| |
---
Context

- LLMs are a phenomenal piece of technology for knowledge generation and reasoning. They are pre-trained on large amounts of publicly available data.

Use Cases

- Question-Answering
- Text Generation
- Summarization
- Planning

# LLM’s
---
|Context|
|---|
|How do we best augment LLMs with our own private data?|
|Raw Files|API’s|
| |salesforce|?|
| | |Use Cases|
| | |Question-Answering|
| | |Text Generation|
| | |Summarization|
|Vector Stores|SQL DB’s|
| | |Planning|
| |LLM’s|
| |Milvus|
---
Paradigms for inserting knowledge

Retrieval Augmentation - Fix pe model, put c
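
In the output above, slides appear to be separated by lines containing only `---`. If you want to inspect slides one at a time, a minimal sketch like the following splits the markdown on those lines. `split_slides` is a hypothetical helper, not a LlamaParse API, and it assumes the separator convention seen in this particular output; table rows such as `|---|---|` are left alone because they are not bare `---` lines.

```python
def split_slides(markdown_text):
    """Split parsed markdown into per-slide strings on bare '---' lines."""
    slides, current = [], []
    for line in markdown_text.splitlines():
        if line.strip() == "---":
            # A bare separator closes the current slide.
            slides.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    slides.append("\n".join(current).strip())
    # Drop empty chunks produced by leading or trailing separators.
    return [s for s in slides if s]


# Small excerpt shaped like the output above.
sample = (
    "# Building and Productionizing RAG\n\n"
    "Jerry Liu, LlamaIndex co-founder/CEO\n"
    "---\n"
    "|Content|Page Number|\n|---|---|\n|Document Processing| |\n"
    "---\n"
    "Context\n"
)
slides = split_slides(sample)
print(len(slides))
```

With the parsed documents from this notebook, the same function could be called as `split_slides(docs[0].get_content())`.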

## Build a RAG pipeline over these documents

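We now use LlamaIndex to build a RAG pipeline over these PowerPoint slides. With LlamaIndex's default settings, embedding and response generation go through OpenAI, so `OPENAI_API_KEY` needs to be set in your environment.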

In [None]:
from llama_index.core import VectorStoreIndex

In [None]:
index = VectorStoreIndex.from_documents(docs)

In [None]:
query_engine = index.as_query_engine()

In [None]:
response = query_engine.query(
    "What are some response quality challenges with naive RAG?"
)

In [None]:
print(str(response))

Some response quality challenges with naive RAG include issues related to bad retrieval, such as low precision where not all retrieved chunks are relevant, leading to problems like hallucination and being lost in the middle. Additionally, low recall can occur when not all relevant chunks are retrieved, resulting in a lack of sufficient context for the language model to synthesize an answer. Outdated information in the retrieved data can also pose a challenge. On the response generation side, challenges include hallucination where the model generates an answer not present in the context, irrelevance where the answer does not address the question, and toxicity/bias where the answer is harmful or offensive.
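
To make the "top-k" part of the pipeline concrete, here is a toy sketch of the retrieval step: score every chunk against the query and keep the k best. It is not how LlamaIndex implements retrieval; the real index compares embedding vectors, while this stand-in uses word-count cosine similarity so that it runs with the standard library alone. The chunks are short illustrative strings, and `tokenize`, `cosine`, and `top_k` are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Illustrative chunks only; real chunks come from the parsed slides.
chunks = [
    "Response quality challenges: low precision, low recall, outdated information",
    "Use cases: question-answering, text generation, summarization, planning",
    "Jerry Liu, LlamaIndex co-founder/CEO",
]
results = top_k("What are some response quality challenges?", chunks, k=1)
print(results)
```

In the actual pipeline, the retrieved chunks are then passed to the LLM as context for synthesizing the answer.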
