# **Transcribing Audio and Question Answering with ASR, Diarization, and Retrieval-Augmented Generation**

<div class="align-center">
  <a href="https://getindexify.ai/"><img src="https://getindexify.ai/Indexify_Logo_Wordmark.svg" width="145"></a>
  <a href="https://discord.com/invite/kF8UZACA7r"><img src="https://raw.githubusercontent.com/rishiraj/random/main/Discord%20button.png" width="145"></a><br>
  Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/tensorlakeai/indexify">GitHub</a></i> ⭐
</div>

This notebook demonstrates a pipeline for transcribing audio, such as podcasts, and answering questions about it with Retrieval-Augmented Generation (RAG). The pipeline combines Automatic Speech Recognition (ASR), speaker diarization, and speculative decoding to process audio efficiently, then chunks and embeds the transcript so that a language model can answer questions grounded in it.

## Key Components

1. **ASR**: We employ a state-of-the-art ASR model to convert audio into text transcriptions. The ASR pipeline is modularized, allowing flexibility in use cases where diarization may not be required.

2. **Diarization**: Built on top of the ASR outputs, our diarization pipeline utilizes the Pyannote model, currently a leading open-source implementation. Diarization enables speaker identification and separation within the transcribed audio.

3. **Speculative Decoding**: To accelerate inference, we incorporate speculative decoding. This technique uses a smaller, faster assistant model to propose generations that are then validated by the larger main model.

4. **Retrieval-Augmented Generation (RAG)**: By leveraging the transcribed and diarized audio, we apply RAG to perform question answering. RAG combines information retrieval techniques with generation models to produce accurate and contextually relevant responses.
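
The draft-and-verify idea behind speculative decoding (component 3) can be illustrated with a toy sketch. It is not the code used by the ASR extractor: `main_model` and `assistant_model` are hypothetical stand-ins that map a token prefix to the next token, and the sketch assumes greedy decoding, where a drafted token is accepted only if it equals the main model's own choice.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]


def speculative_decode(main_model: NextToken, assistant_model: NextToken,
                       prompt: List[str], max_new_tokens: int,
                       draft_len: int = 4) -> List[str]:
    """Greedy speculative decoding with toy next-token functions."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1. The assistant drafts a few tokens autoregressively.
        draft: List[str] = []
        for _ in range(min(draft_len, target_len - len(tokens))):
            draft.append(assistant_model(tokens + draft))
        # 2. The main model checks each drafted position. A real model scores
        #    all positions in one forward pass; this loop is only illustrative.
        base = list(tokens)
        for i, proposed in enumerate(draft):
            expected = main_model(base + draft[:i])
            tokens.append(expected)  # equals `proposed` whenever the draft is accepted
            if proposed != expected:
                break  # first mismatch: discard the rest of the draft
    return tokens


SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def main_model(prefix: List[str]) -> str:
    # Hypothetical "large" model: always continues the fixed sentence.
    return SENTENCE[len(prefix) % len(SENTENCE)]


def assistant_model(prefix: List[str]) -> str:
    # Hypothetical "small" model: wrong at every fourth position.
    return "<unk>" if len(prefix) % 4 == 3 else main_model(prefix)


result = speculative_decode(main_model, assistant_model, [], max_new_tokens=9)
print(" ".join(result))
```

Because every emitted token is the main model's own choice, the output matches what the main model would produce alone. The speed-up in a real system comes from verifying several drafted tokens in a single forward pass.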

## **Install Indexify, Start the Server & Download the Extractors**

In [1]:
%pip install -q -U indexify indexify-extractor-sdk

# Download Indexify Server
!curl https://getindexify.ai | sh

# Download Extractors
!indexify-extractor download hub://audio/asrdiarization
!indexify-extractor download hub://text/chunking
!indexify-extractor download hub://embedding/arctic

Note: you may need to restart the kernel to use updated packages.


After installing the libraries and downloading the server and the extractors, restart the runtime. Then start the Indexify server and connect the extractors to it.

**Open two terminals and run the following commands:**

```bash
# Terminal 1
./indexify server -d

# Terminal 2
indexify-extractor join-server
```

## **Create a Client, Define Extraction Graph & Ingest Contents**

Instantiate the Indexify Client

In [1]:
from indexify import IndexifyClient
client = IndexifyClient()

1. Import the `ExtractionGraph` class from the `indexify` package.

2. Define the extraction graph specification in YAML format:
   - Set the name of the extraction graph to "transcribe".
   - Define the extraction policies:
     - Use the "tensorlake/asrdiarization" extractor for speech-to-text, specify its parameters, and name it "sttextractor".
     - Use the "tensorlake/chunk-extractor" extractor for text chunking, specify its parameters, name it "chunker", and connect it to "sttextractor".
     - Use the "tensorlake/arctic" extractor for embedding, name it "embedder", and connect it to "chunker".
       - The content source for embedding is therefore the output of "chunker".

3. Create an `ExtractionGraph` object from the YAML specification using `ExtractionGraph.from_yaml()`.

4. Create the extraction graph on the Indexify client using `client.create_extraction_graph()`.

In [2]:
from indexify import ExtractionGraph

extraction_graph_spec = """
name: 'transcribe'
extraction_policies:
   - extractor: 'tensorlake/asrdiarization'
     name: 'sttextractor'
     input_params:
        batch_size: 24
   - extractor: 'tensorlake/chunk-extractor'
     name: 'chunker'
     input_params:
        chunk_size: 1000
        overlap: 100
     content_source: 'sttextractor'
   - extractor: 'tensorlake/arctic'
     name: 'embedder'
     content_source: 'chunker'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
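
To make the `chunk_size` and `overlap` parameters concrete, here is a minimal sketch of overlapping chunking. It assumes plain character-based windows. The actual chunk extractor may split on different boundaries, and `chunk_text` is a hypothetical helper, not part of Indexify.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# A 2500-character dummy transcript, chunked with the values from the graph above.
transcript = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(transcript, chunk_size=1000, overlap=100)
print([len(c) for c in chunks])
```

The overlap means that a sentence cut at a chunk boundary still appears in full in at least one neighbouring chunk, provided it is no longer than the overlap. This helps retrieval when an answer straddles two chunks.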

Download a sample audio file, then upload it to the "transcribe" extraction graph. Any audio file can be used in its place.

In [None]:
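import requests

# Download the sample interview used in this notebook.
req = requests.get("https://raw.githubusercontent.com/tensorlakeai/indexify/main/docs/docs/files/interview.mp3", timeout=60)
req.raise_for_status()  # Stop here if the download failed.

with open('interview.mp3', 'wb') as f:
    f.write(req.content)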

In [3]:
# Upload the file to the "transcribe" graph, then block until extraction finishes.
content_id = client.upload_file("transcribe", "interview.mp3")
print(content_id)
client.wait_for_extraction(content_id)

26c06462ef9ce19b


## **Performing RAG with OpenAI**

In [4]:
def get_context(question: str, index: str, top_k: int = 1) -> str:
    """Return the top_k passages from `index` that best match `question`."""
    results = client.search_index(name=index, query=question, top_k=top_k)
    context = ""
    for result in results:
        context = context + f"content id: {result['content_id']} \n\n passage: {result['text']}\n"
    return context

In [5]:
question = "What does the guy have to say about his familiarity with the fashion world?"
context = get_context(question, "transcribe.embedder.embedding")
context

"content id: 6423bc4f19ad03cd \n\n passage: [{'speaker': 'SPEAKER_00', 'timestamp': (18.0, 22.0), 'text': ' So are you into fashion? Or are you kind of new to the fashion world?'}, {'speaker': 'SPEAKER_01', 'timestamp': (22.0, 24.0), 'text': ' I would consider myself new to the fashion world.'}, {'speaker': 'SPEAKER_01', 'timestamp': (24.0, 27.38), 'text': ' I, you know, this is like Mark said, fish out of water a little bit.'}, {'speaker': 'SPEAKER_01', 'timestamp': (27.38, 29.72), 'text': ' But I couldnt say no to the invitation.'}, {'speaker': 'SPEAKER_01', 'timestamp': (29.72, 32.24), 'text': ' I am opening my eyes about the world of fashion right now.'}]\n"

In [6]:
def create_prompt(question, context):
    return f"Answer the question, based on the context.\n question: {question} \n context: {context}"

prompt = create_prompt(question, context)
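
The retrieved passage is a stringified list of speaker segments, which is hard to read by eye. The sketch below turns it into a plain transcript. It assumes the passage is a valid Python literal in the shape of the output above and that the context holds a single passage (`top_k=1`). `format_passage` is a hypothetical helper, and the sample string is an excerpt of that output.

```python
import ast

sample_context = (
    "content id: 6423bc4f19ad03cd \n\n passage: "
    "[{'speaker': 'SPEAKER_00', 'timestamp': (18.0, 22.0), "
    "'text': ' So are you into fashion? Or are you kind of new to the fashion world?'}, "
    "{'speaker': 'SPEAKER_01', 'timestamp': (22.0, 24.0), "
    "'text': ' I would consider myself new to the fashion world.'}]\n"
)


def format_passage(context: str) -> str:
    """Render one diarized passage as '[start-end] SPEAKER: text' lines."""
    # Everything after the "passage: " marker is the stringified segment list.
    _, _, passage = context.partition("passage: ")
    segments = ast.literal_eval(passage.strip())  # safe parsing of literals only
    lines = []
    for seg in segments:
        start, end = seg["timestamp"]
        lines.append(f"[{start:.1f}s-{end:.1f}s] {seg['speaker']}: {seg['text'].strip()}")
    return "\n".join(lines)


formatted = format_passage(sample_context)
print(formatted)
```

A transcript in this form could be passed to `create_prompt` in place of the raw context if a cleaner prompt is preferred.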

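Instantiate the OpenAI client. `OpenAI()` reads the API key from the `OPENAI_API_KEY` environment variable, so set it before running this cell.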

In [14]:
from openai import OpenAI
client_openai = OpenAI()

Now send the prompt to the model to answer the question about the ingested audio file.

In [15]:
chat_completion = client_openai.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-3.5-turbo",
)
print(chat_completion.choices[0].message.content)

Based on the transcript, the man (SPEAKER_01) says that he is new to the fashion world and considers himself a 'fish out of water' in this context. However, he accepted the invitation despite his unfamiliarity with fashion. He also mentions that he is currently opening his eyes to the world of fashion, suggesting that this experience is exposing him to new insights about the industry.
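
The retrieve, prompt, and generate steps can be wrapped into one reusable function. The following is a sketch rather than part of the original notebook: `answer_question` is a hypothetical helper, and the two stand-in callables exist only so the example runs without an Indexify server or an API key.

```python
from typing import Callable


def answer_question(question: str,
                    retrieve: Callable[[str], str],
                    generate: Callable[[str], str]) -> str:
    """Retrieve context for the question, build the prompt, and generate an answer."""
    context = retrieve(question)
    prompt = (
        "Answer the question, based on the context.\n"
        f" question: {question} \n context: {context}"
    )
    return generate(prompt)


# Stand-ins (assumptions) that mimic the retrieval and generation steps.
def fake_retrieve(question: str) -> str:
    return "passage: I would consider myself new to the fashion world."


def fake_generate(prompt: str) -> str:
    return f"(model reply to a {len(prompt)}-character prompt)"


reply = answer_question("Is he new to fashion?", fake_retrieve, fake_generate)
print(reply)
```

In this notebook, `retrieve` would call `get_context(question, "transcribe.embedder.embedding")`, and `generate` would wrap the `client_openai.chat.completions.create(...)` call shown above and return `choices[0].message.content`.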
