In [23]:
# Installing Libraries
!pip install pytube -q
!pip install git+https://github.com/openai/whisper.git -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [24]:
from pytube import YouTube
import whisper

# Load the Whisper model once so it is not reloaded for every video
model = whisper.load_model('base')

# Download a video's audio track and return its Whisper transcription
def get_transcriptions(url, filename="test.mp4"):
    yt_video = YouTube(url)
    # take the first audio-only stream
    stream = yt_video.streams.filter(only_audio=True).first()
    stream.download(filename=filename)

    output = model.transcribe(filename)
    return output["text"]

In [25]:
# List of 4 video URLs
video_urls = ["https://www.youtube.com/watch?v=lE-VKc2R9L4","https://www.youtube.com/watch?v=P9EnUasQQbE",
              "https://www.youtube.com/watch?v=Ca7B6v-Unec","https://www.youtube.com/watch?v=U6os4i3lu1U"]

transcription = []  # list to store transcriptions

# Loop through each video URL and collect its transcription
for url in video_urls:
    transcription.append(get_transcriptions(url))

# Print the transcriptions list length
print(len(transcription))



4


In [26]:
# Split each long transcript into fixed-size chunks of at most 300 characters

sh_text = []

for text in transcription:
    if len(text) >= 300:
        for start in range(0, len(text), 300):
            sh_text.append(text[start:start + 300])
    else:
        sh_text.append(text)

In [28]:
len(sh_text)

109
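
The fixed-width slicing above can cut a chunk in the middle of a word, which is why some retrieved chunks later in this notebook start or end with a word fragment. Below is a minimal sketch of an alternative that only breaks on whitespace. `chunk_by_words` is a hypothetical helper, not part of the original notebook, and keeping an over-long single word whole is an assumption made for simplicity.

```python
def chunk_by_words(text, max_chars=300):
    """Split text on whitespace into chunks of at most max_chars characters where possible."""
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # assumption: a single word longer than max_chars is kept whole
            current = word
    if current:
        chunks.append(current)
    return chunks

sample = "the quick brown fox jumps over the lazy dog"
print(chunk_by_words(sample, max_chars=15))
# ['the quick brown', 'fox jumps over', 'the lazy dog']
```

Chunks built this way may be shorter than 300 characters, so the total chunk count would likely differ from the 109 shown above.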

# Retrieval Enhanced Generative Question Answering with OpenAI

#### Fixing LLMs that Hallucinate

In this notebook we will learn how to retrieve contexts relevant to our queries from Pinecone and pass them to a generative OpenAI model to generate an answer backed by real data sources. Required installs for this notebook are:

In [29]:
!pip install -qU openai pinecone-client datasets tqdm

In [30]:
import os
import openai

# get API key from top-right dropdown on OpenAI website
# read it from the environment rather than hard-coding a secret in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

openai.Engine.list()  # check we have authenticated

<OpenAIObject list at 0x7f0682b5f3b0> JSON: {
  "data": [
    {
      "created": null,
      "id": "babbage",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "davinci",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "text-davinci-edit-001",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "babbage-code-search-code",
      "object": "engine",
      "owner": "openai-dev",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "text-similarity-babbage-001",
      "object": "engine",
      "owner": "openai-dev",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "code-davinci-edit-001",
      "object": "engine",
      

For many questions *state-of-the-art (SOTA)* LLMs are more than capable of answering correctly.

In [31]:
query = "How is Dell Alienware laptop?"

# now query text-davinci-003 WITHOUT context
res = openai.Completion.create(
    engine='text-davinci-003',
    prompt=query,
    temperature=0,
    max_tokens=400,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)

res['choices'][0]['text'].strip()

'The Dell Alienware laptop is a great choice for gamers and power users. It offers powerful hardware, a sleek design, and a wide range of features. The laptop is well-built and offers excellent performance, making it a great choice for gaming and other intensive tasks. It also has a great battery life and a wide range of ports and connectivity options.'

However, that isn't always the case. Let's first wrap the call above in a simple function so we don't have to repeat it every time.

In [32]:
def complete(prompt):
    # query text-davinci-003
    res = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    return res['choices'][0]['text'].strip()

Now let's ask a more specific question about training a specific type of transformer model called a *sentence-transformer*. The ideal answer we'd be looking for is _"Multiple Negatives Ranking (MNR) loss"_.

Don't worry if this is a new term to you, it isn't required to understand what we're doing or demoing here.

In [33]:
query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)

complete(query)

'If you only have pairs of related sentences, then the best training method to use for sentence transformers is the supervised learning approach. This approach involves providing the model with labeled data, such as pairs of related sentences, and then training the model to learn the relationships between the sentences. This approach is often used for tasks such as natural language inference, semantic similarity, and paraphrase identification.'

One of the common answers I get to this is:

```
The best training method to use for fine-tuning a pre-trained model with sentence transformers is the Masked Language Model (MLM) training. MLM training involves randomly masking some of the words in a sentence and then training the model to predict the masked words. This helps the model to learn the context of the sentence and better understand the relationships between words.
```

This answer seems pretty convincing, right? Yet it's wrong. MLM is typically used in the pretraining step of a transformer model but *cannot* be used to fine-tune a sentence-transformer, and it has nothing to do with having _"pairs of related sentences"_.

An alternative answer I receive says that a `supervised learning approach` is the most suitable. This is true, but it's not specific and doesn't answer the question.

We have two options for enabling our LLM to understand and correctly answer this question:

1. We fine-tune the LLM on text data covering the topic mentioned, likely on articles and papers talking about sentence transformers, semantic search training methods, etc.

2. We use **R**etrieval **A**ugmented **G**eneration (RAG), a technique that adds an information retrieval component to the generation process. This allows us to retrieve relevant information and feed it into the generation model as a *secondary* source of information.

We will demonstrate option **2**.
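
The retrieve-then-generate flow of option 2 can be sketched without any external service. The example below is a toy, not the method used in the rest of this notebook: `toy_retrieve` ranks documents by shared words as a stand-in for vector search, and `toy_build_prompt`, `toy_docs`, and `toy_query` are hypothetical names introduced only for illustration.

```python
def toy_retrieve(query, documents, top_k=1):
    """Rank documents by how many query words they share (a stand-in for vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def toy_build_prompt(query, contexts):
    """Place the retrieved contexts ahead of the question."""
    context_block = "\n\n---\n\n".join(contexts)
    return (
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"
    )

toy_docs = [
    "MNR loss trains sentence transformers from pairs of related sentences.",
    "Masked language modeling is used when pretraining transformer models.",
]
toy_query = "Which loss works with pairs of related sentences?"

toy_prompt = toy_build_prompt(toy_query, toy_retrieve(toy_query, toy_docs))
print(toy_prompt)
```

The generation model then receives `toy_prompt` instead of the bare question, so its answer can draw on the retrieved text.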

---

## Building a Knowledge Base

With option **2**, the retrieval of relevant information requires an external _"Knowledge Base"_, a place where we can store information and efficiently retrieve it. We can think of this as the external _long-term memory_ of our LLM.

We will need to retrieve information that is semantically related to our queries. To do this we need to use _"dense vector embeddings"_. These can be thought of as numerical representations of the *meaning* behind our sentences.
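
To make "semantically related" concrete: vectors for texts with similar meaning should point in similar directions. Cosine similarity is one common way to measure this, and it is the metric chosen for the Pinecone index later in this notebook. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings have far more dimensions and come from an embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumed values, not model output)
laptop_review = [0.9, 0.1, 0.0]
gaming_laptop = [0.8, 0.2, 0.1]
protein_shake = [0.0, 0.1, 0.9]

related = cosine_similarity(laptop_review, gaming_laptop)
unrelated = cosine_similarity(laptop_review, protein_shake)
print(related > unrelated)  # the two laptop vectors are closer to each other
```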

There are many options for creating these dense vectors, like open source [sentence transformers](https://pinecone.io/learn/nlp/) or OpenAI's [ada-002 model](https://youtu.be/ocxq84ocYi0). We will use OpenAI's offering in this example.

We have already authenticated our OpenAI connection, so to create an embedding we just do:

In [34]:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)

In the response `res` we will find a JSON-like object containing our new embeddings within the `'data'` field.

In [35]:
res.keys()

dict_keys(['object', 'data', 'model', 'usage'])

Inside `'data'` we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains `1536` dimensions (the output dimensionality of the `text-embedding-ada-002` model).

In [36]:
len(res['data'])

2

In [37]:
res['data'][0]['embedding']

[-0.0031135426834225655,
 0.011766765266656876,
 -0.00509151816368103,
 -0.027159256860613823,
 -0.01633599027991295,
 0.03237545117735863,
 -0.016160769388079643,
 -0.0010808103252202272,
 -0.02583836019039154,
 -0.006641550455242395,
 0.02012345939874649,
 0.016672953963279724,
 -0.009178885258734226,
 0.02331787347793579,
 -0.010149340145289898,
 0.013458321802318096,
 0.02527226135134697,
 -0.016915567219257355,
 0.012056553736329079,
 -0.01636294648051262,
 -0.004303023684769869,
 -0.006402306258678436,
 -0.00437378603965044,
 0.020810864865779877,
 -0.010567175224423409,
 -0.003726816037669778,
 0.013626803644001484,
 -0.02635054476559162,
 -0.0004172029148321599,
 -0.0021852082572877407,
 0.00576881505548954,
 -0.01012912206351757,
 -0.02817014791071415,
 -0.01622816175222397,
 -0.004255848936736584,
 0.007426674943417311,
 -0.002897885860875249,
 -0.031431954354047775,
 0.023843536153435707,
 -0.03329199180006981,
 -0.0003548646636772901,
 0.013087661936879158,
 0.0071234079077

In [38]:
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])

(1536, 1536)

In [39]:

# Set up the Pinecone environment

import pinecone

index_name = 'openai-youtube-transcriptions'

# initialize connection to pinecone (get API key at app.pinecone.io)
# read the key from the environment rather than hard-coding a secret;
# the variable name PINECONE_API_KEY is an assumption
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="us-west4-gcp"
)

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='cosine',
        # metadata_config={'indexed': ['index']}
    )

# connect to index
index = pinecone.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

We can see the index is currently empty with a `total_vector_count` of `0`. We can begin populating it with OpenAI `text-embedding-ada-002` built embeddings like so:

In [40]:

# Build the upsert list: each entry holds the id, the vector, and the original text as metadata
# This text will be used to build the prompt to query the davinci model

from tqdm.auto import tqdm
from time import sleep

upsert_list = []
for i in tqdm(range(0, len(sh_text))):

    texts = sh_text[i]
    # create embeddings, retrying every 5 seconds on a RateLimitError
    while True:
        try:
            res = openai.Embedding.create(input=texts, engine=embed_model)
            break
        except openai.error.RateLimitError:
            sleep(5)
    embeds = res['data'][0]['embedding']
    metadata = {'text': texts}
    upsert_list.append((str(i), embeds, metadata))


  0%|          | 0/109 [00:00<?, ?it/s]

In [41]:

# Upload vectors
index.upsert(upsert_list)

{'upserted_count': 109}
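
With 109 vectors, a single upsert call is enough, as the output above shows. For larger collections it is common to split the list into smaller requests. The sketch below shows one way to do that; the `batched` helper, the batch size of 100, and the dummy `toy_upsert_list` are all assumptions for illustration, and the actual `index.upsert` call is left as a comment so the example runs without a Pinecone connection.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in for upsert_list: (id, vector, metadata) tuples with dummy vectors
toy_upsert_list = [(str(i), [0.0, 0.0], {"text": f"chunk {i}"}) for i in range(250)]

batch_sizes = []
for batch in batched(toy_upsert_list, batch_size=100):
    # in the notebook this would be: index.upsert(batch)
    batch_sizes.append(len(batch))

print(batch_sizes)
# [100, 100, 50]
```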

In [42]:
# testing the process without any function calls

res = openai.Embedding.create(
    input=["how is the taste of myprotein chocolate brownie whey protein"],
    engine=embed_model
)

# retrieve from Pinecone
xq = res['data'][0]['embedding']

# get relevant contexts (including the questions)
res = index.query(xq, top_k=4, include_metadata=True)
     

In [43]:
res

{'matches': [{'id': '43',
              'metadata': {'text': "ly 1.9 kg, that's 100 grams lighter than "
                                   "Apple's MacBook Pro 16 inches. The "
                                   'keyboard is pleasant to use with Perky RGB '
                                   'lighting if you want it, and it boasts '
                                   'plenty of ports, including two USB-C '
                                   'ports. With a score of just under 7 hours '
                                   'in our video rundown test, its battery '
                                   "life isn't anything to w"},
              'score': 0.724330604,
              'values': []},
             {'id': '85',
              'metadata': {'text': 'up to 23mm thick, and it weighs 7.28 '
                                   "pounds, so it's not easy to carry around. "
                                   "It's a fairly flashy laptop too, with "
                                   'perky RGB 

In [45]:
# The retrieve function takes the query as input and converts it into a vector embedding
# This embedding is used to run a semantic search against the vectors in Pinecone
# In response we get the 4 most similar texts, which we combine with prompt_start and prompt_end to build the complete prompt
# Finally the prompt is returned, ready to be passed to the complete function for the davinci model

def retrieve(query):
    limit = 3750
    res = openai.Embedding.create(
        input=[query],
        engine=embed_model
    )

    # retrieve from Pinecone
    xq = res['data'][0]['embedding']

    # get relevant contexts
    res = index.query(xq, top_k=4, include_metadata=True)
    contexts = [x['metadata']['text'] for x in res['matches']]


    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # append contexts in ranked order until adding another would hit the limit
    selected = []
    for context in contexts:
        if len("\n\n---\n\n".join(selected + [context])) >= limit:
            break
        selected.append(context)
    prompt = (
        prompt_start +
        "\n\n---\n\n".join(selected) +
        prompt_end
    )
    return prompt

In [46]:
query_with_contexts = retrieve("How is Dell Alienware laptop?")
print(query_with_contexts)

Answer the question based on the context below.

Context:
2 USB Type-C 3.1 Gen 2, and 1 Thunderbolt 4 USB 4. Overall, the MSI-G76 Raider is a good choice for gamers who are looking for a high-performance gaming laptop with many gamer-friendly features. At number 2, we got the Alienware X14. The latest round of the Razer Blade 14 with the Alienware X14 for 

---

enware pedigree to speak up. Still, the performance to rival some of the best gaming laptops in the business, and only a few sacrifices to get there for a great price, this is an excellent option for most players. You could consider the X-15 technically the best Alienware gaming laptop on paper alon

---

xpensive X15 and X17 Alienware models, which means this is one of the most visually-arresting gaming laptops on the market, and the build quality is second to none. There are a few downgrades such as the absence of perky RGB lighting and the stadium LED ring light that circles the back of the larger

---

up to 23mm thick, and

In [47]:
# Final output of model with context query

complete(query_with_contexts)

'Dell Alienware laptops are excellent options for most players, offering performance to rival some of the best gaming laptops in the business with only a few sacrifices to get there for a great price. They are visually-arresting gaming laptops with perky RGB lighting and a light strip running across the inside of the fan exhaust, and they are up to 23mm thick and weigh 7.28 pounds.'

---