# RAG Podcast File

This notebook builds a RAG pipeline over an episode of the Naruhodo podcast.
RAG stands for Retrieval-Augmented Generation.

Building a RAG pipeline takes several steps:

1. Obtain the data
2. Transform the data into a format suited to retrieval
3. Store the data
4. Use the user input as context to query the data
5. Find the data most meaningful to the context
6. Generate a response that answers the user, based on the context and the retrieved data
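
The six steps can be sketched end to end in plain Python. This is only an illustrative toy, not the implementation used in this notebook: the chunks are placeholder sentences, `embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, and `build_prompt` stops at the point where a language model would be called.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Steps 1-2: obtain the data and split it into retrievable chunks.
# These sentences are placeholders, not lines from the real transcript.
chunks = [
    "The episode discusses blue light and eye health.",
    "Diarization identifies which speaker said each word.",
    "The transcript is stored as a JSON file.",
]

# Step 3: store every chunk next to its vector.
store = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question: str, k: int = 1) -> list[str]:
    """Steps 4-5: embed the user input and return the k most similar chunks."""
    query_vector = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, k: int = 1) -> str:
    """Step 6 (input side): combine the retrieved chunks and the question."""
    context = "\n".join(retrieve(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


print(build_prompt("What does the episode say about blue light?"))
```

In a real pipeline the word counts would be replaced by dense vectors from an embedding model, and the prompt would be sent to a language model to produce the final answer.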

## Data

Let's start with the data. In this example we'll work with an audio file from the Naruhodo Podcast (a Brazilian podcast about science applied to everyday things).

The audio source can be downloaded [here](https://cdn.simplecast.com/audio/ab2964e7-bcad-4f2f-9698-45cb681f0d69/episodes/a5c8e7da-82ad-40c9-a440-0a2e5217ac60/audio/164e593b-633e-4733-9c46-916a4a5ce660/default_tc.mp3?nocache).

This episode discusses the impact of blue light on eye health and the need for blue light filter lenses for people who spend most of their time in front of computer or cellphone screens.

## Transform data to text

To turn audio into text we need a speech-to-text (STT or S2T) model. Many such models are available; we chose the [Deepgram](https://deepgram.com/) API for this job.

This API gives access to several STT models, so you can choose the one that best fits the use case (quality vs. price).

It also supports diarization, which is the ability to identify multiple speakers in a single audio file and include the speaker information in the transcript.
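
To make diarization concrete, the sketch below groups a transcript's words into speaker turns. It assumes the response JSON exposes a word list at `results.channels[0].alternatives[0].words`, with `word`, `punctuated_word` and `speaker` fields on each entry; that shape is an assumption here and should be checked against a real response. The `speaker_turns` helper and the `sample` dictionary are hypothetical, written by hand for illustration.

```python
def speaker_turns(response: dict) -> list[tuple[int, str]]:
    """Group consecutive words spoken by the same speaker into (speaker, text) turns."""
    # Assumed response shape: results -> channels -> alternatives -> words.
    words = response["results"]["channels"][0]["alternatives"][0]["words"]
    turns: list[tuple[int, str]] = []
    for word in words:
        # Prefer the punctuated form when present, else the raw word.
        text = word.get("punctuated_word", word["word"])
        speaker = word.get("speaker", 0)
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append((speaker, text))
    return turns


# Hand-written sample in the assumed shape (not real API output).
sample = {
    "results": {
        "channels": [
            {
                "alternatives": [
                    {
                        "words": [
                            {"word": "bom", "punctuated_word": "Bom", "speaker": 0},
                            {"word": "dia", "punctuated_word": "dia.", "speaker": 0},
                            {"word": "oi", "punctuated_word": "Oi!", "speaker": 1},
                        ]
                    }
                ]
            }
        ]
    }
}

for speaker, text in speaker_turns(sample):
    print(f"Speaker {speaker}: {text}")
```

Turns like these are a natural unit to split on later, because each one keeps a single speaker's statement together.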

Let's start with the transcription code.

In [2]:
!pip install deepgram-sdk
#!pip install python-dotenv

In [4]:
import os
from dotenv import load_dotenv

from deepgram import (
    DeepgramClient,
    PrerecordedOptions,
    FileSource,
)

load_dotenv()

# Path to the audio file
AUDIO_FILE = "data/audio/naruhodo-423.mp3"

API_KEY = os.getenv("DG_API_KEY")


def main():
    try:
        # STEP 1: Create a Deepgram client using the API key
        deepgram = DeepgramClient(API_KEY)

        with open(AUDIO_FILE, "rb") as file:
            buffer_data = file.read()

        payload: FileSource = {
            "buffer": buffer_data,
        }

        # STEP 2: Configure Deepgram options for audio analysis
        options = PrerecordedOptions(
            model="enhanced",
            language="pt-BR",
            smart_format=True,
            diarize=True,
        )

        # STEP 3: Call the transcribe_file method with the audio payload and options
        response = deepgram.listen.prerecorded.v("1").transcribe_file(
            payload,
            options,
            timeout=300,
        )

        # STEP 4: Serialize the response once and print it
        transcript_json = response.to_json(indent=4)
        print(transcript_json)

        # STEP 5: Save to output json file
        with open("data/transcription/naruhodo-424-transcript.json", "w") as file:
            file.write(transcript_json)

    except Exception as e:
        print(f"Exception: {e}")


if __name__ == "__main__":
    main()
