<a href="https://colab.research.google.com/github/rsamala/Bots/blob/main/Streamlit_Chatbot_for_PDF_and_Youtube.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook has all the code you need to create your own chatbot with a custom knowledge base using GPT-3.

Follow the instructions for each step and then run the code sample. To run the code, press the "play" button next to each code sample.

# Download the data for your custom knowledge base
For demonstration purposes, we are going to use ----- as our knowledge base. You can download the files to your local folder from the GitHub repository by running the code below.
Alternatively, you can put your own custom data into the local folder.

# Install the dependencies
Run the code below to install the dependencies we need for our functions.

In [None]:
!pip install llama-index


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting llama-index
  Downloading llama_index-0.4.40.tar.gz (152 kB)
  Preparing metadata (setup.py) ... done
Collecting dataclasses_json
  Downloading dataclasses_json-0.5.7-py3-none-any.whl (25 kB)
Collecting langchain
  Downloading langchain-0.0.123-py3-none-any.whl (426 kB)
Collecting openai>=0.26.4
  Downloading openai-0.27.2-py3-none-any.whl (70 kB)
Collecting tiktoken
  Downloading tiktoken-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)

In [None]:
!pip install PyPDF2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:
!pip install langchain

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# youtube requirements
!pip install youtube_transcript_api
import sys, re
from youtube_transcript_api import YouTubeTranscriptApi

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting youtube_transcript_api
  Downloading youtube_transcript_api-0.5.0-py3-none-any.whl (23 kB)
Installing collected packages: youtube_transcript_api
Successfully installed youtube_transcript_api-0.5.0


# Define the functions
The following code defines the functions we need to construct the index and to query it.

In [None]:
from llama_index import SimpleDirectoryReader, GPTListIndex, readers, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
from langchain import OpenAI
import sys
import os
from IPython.display import Markdown, display

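def construct_index(directory_path):
    """Build a vector index over every file in directory_path and save it to index.json."""
    # maximum input size of the model's context window (in tokens)
    max_input_size = 4096
    # number of tokens reserved for the model's answer
    num_outputs = 2000
    # maximum overlap between neighbouring chunks (in tokens)
    max_chunk_overlap = 20
    # upper limit on the size of each chunk (in tokens)
    chunk_size_limit = 600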

    # define LLM
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.5, model_name="text-davinci-003", max_tokens=num_outputs))
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
 
    documents = SimpleDirectoryReader(directory_path).load_data()
    
    index = GPTSimpleVectorIndex(
        documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper
    )

    index.save_to_disk('index.json')

    return index

def ask_ai():
    """Load the saved index and answer questions in a loop until interrupted."""
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    while True:
        query = input("What do you want to ask? ")
        response = index.query(query, response_mode="compact")
        display(Markdown(f"Response: <b>{response.response}</b>"))
  

In [None]:
# Define YouTube-related functions
def get_video_transcript(video_id):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        return transcript
    except Exception as e:
        print("Error:", e)
        return None

def save_transcript_to_file(transcript, output_file):
    with open(output_file, "w") as f:
        for entry in transcript:
            caption_text = entry["text"]
            # Remove bracketed or parenthesized annotations such as "[Music]",
            # replace line breaks with spaces, then append a space.
            cleaned_text = re.sub(r'[\n\(\[].*?[\)\]]', '', caption_text).replace('\n', ' ') + ' '
            f.write(cleaned_text)

def extract_video_id(url):
    patterns = [
        r"(?:https?:\/\/)?(?:www\.)?youtube\.com\/watch\?v=([^&]+)",
        r"(?:https?:\/\/)?(?:www\.)?youtu\.be\/([^&]+)",
        r"(?:https?:\/\/)?(?:www\.)?youtube\.com\/embed\/([^&]+)",
        r"(?:https?:\/\/)?(?:www\.)?youtube\.com\/v\/([^&]+)",
    ]

    for pattern in patterns:
        match = re.search(pattern, url)
        if match:
            return match.group(1)

    return None
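
The regular expressions in `extract_video_id` capture everything up to the next `&`, so a short link such as `https://youtu.be/<id>?t=10` would return the ID together with `?t=10`. As an alternative, here is a minimal sketch that parses the URL with the standard library instead. The function name `extract_video_id_from_url` and the extra `m.youtube.com` host are assumptions made for this illustration, not part of the notebook.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id_from_url(url):
    """Return the video ID from a YouTube URL, or None if none is found."""
    # Add a scheme when it is missing so urlparse fills in the host correctly.
    if "://" not in url:
        url = "https://" + url
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    segments = [s for s in parsed.path.split("/") if s]

    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return segments[0] if segments else None
    if host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # The ID is the "v" query parameter, wherever it appears.
            return parse_qs(parsed.query).get("v", [None])[0]
        if len(segments) >= 2 and segments[0] in ("embed", "v"):
            return segments[1]
    return None
```

Because the query string is parsed separately from the path, extra parameters such as `t=10` or `feature=share` do not end up in the returned ID.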

# Set OpenAI API Key
You need an OpenAI API key to run this code.

If you don't have one yet, get it by [signing up](https://platform.openai.com/overview). Then click your account icon on the top right of the screen and select "View API Keys". Create an API key.

Then run the code below and paste your API key into the text input.

In [None]:
os.environ["OPENAI_API_KEY"] = input("Paste your OpenAI key here and hit enter:")
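
Note that `input()` echoes whatever you type into the cell output, so the key stays visible in the notebook. If you would rather keep it hidden, a possible variant uses `getpass` from the standard library. The helper name `set_openai_key` and its `read_secret` parameter are hypothetical and exist only to keep the sketch easy to test.

```python
import os
from getpass import getpass

def set_openai_key(read_secret=getpass):
    """Prompt for the key without echoing it, then export it for this session."""
    key = read_secret("Paste your OpenAI key here and hit enter: ").strip()
    if not key:
        raise ValueError("No API key entered")
    os.environ["OPENAI_API_KEY"] = key

# In the notebook you would simply call: set_openai_key()
```

Stripping whitespace guards against a stray space or newline picked up when pasting the key.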

In [None]:
# Fetch the transcript and save it into /content/sample_data/Data
url = input("Paste YouTube URL: ")
video_id = extract_video_id(url)

# Only fetch when a video ID was found; otherwise transcript stays None.
transcript = get_video_transcript(video_id) if video_id else None

output_file = f"{video_id}_transcript.txt"
FQFN = "/content/sample_data/Data/" + output_file

if transcript:
  # Create the target folder if it does not exist yet.
  os.makedirs("/content/sample_data/Data", exist_ok=True)
  save_transcript_to_file(transcript, FQFN)
  print(f"Transcript saved to {output_file}")
else:
  print("Failed to fetch transcript")

Paste YouTube URL: https://www.youtube.com/watch?v=AWAo4iyNWGc
Transcript saved to AWAo4iyNWGc_transcript.txt
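
To see what ends up in the saved file, here is a small sketch that applies the same cleaning expression as `save_transcript_to_file` to a few made-up caption entries. The entries are invented for illustration; they only mimic the list of dictionaries with `text`, `start`, and `duration` keys that `get_transcript` returns. The helper name `clean_caption` is hypothetical.

```python
import re

def clean_caption(text):
    # Same pattern as in save_transcript_to_file: drop bracketed or
    # parenthesized annotations, flatten line breaks, append a space.
    return re.sub(r'[\n\(\[].*?[\)\]]', '', text).replace('\n', ' ') + ' '

# Made-up entries in the shape returned by the transcript API.
sample = [
    {"text": "[Music]", "start": 0.0, "duration": 2.0},
    {"text": "hello and welcome\nto the channel", "start": 2.0, "duration": 3.5},
    {"text": "today (laughs) we look at Dolly", "start": 5.5, "duration": 4.0},
]

joined = "".join(clean_caption(entry["text"]) for entry in sample)
print(repr(joined))
```

The result is one long line of plain text, which is what the index later splits into chunks. Timestamps are discarded.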


# Construct an index
Now we are ready to construct the index. This will take every file in the folder `/content/sample_data/Data`, split it into chunks, and embed it with OpenAI's embeddings API.
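
To make the "split it into chunks" step more concrete, here is a rough sketch of splitting with overlap, the idea behind the chunk size and chunk overlap settings in `construct_index`. This is not llama-index's implementation: it works on whole words for simplicity, and the function name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(words, chunk_size, overlap):
    """Split a list of words into windows of chunk_size that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = split_into_chunks(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last word of the previous one, so a sentence that straddles a boundary is not cut off from its context entirely.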

**Notice:** running this code will cost you credits on your OpenAI account ($0.02 for every 1,000 tokens). If you've just set up your account, the free credits that you have should be more than enough for this experiment.
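
For a feel of what that rate means in practice, here is a back-of-the-envelope sketch. It uses the price quoted in the notice above; the ratio of four characters per token is only a rough assumption, and the 40,000-character text is a stand-in rather than a real transcript.

```python
PRICE_PER_1K_TOKENS = 0.02  # USD, the rate quoted in the notice above

def estimate_cost(num_tokens, price_per_1k=PRICE_PER_1K_TOKENS):
    """Return the estimated cost in USD for processing num_tokens tokens."""
    return num_tokens / 1000 * price_per_1k

def rough_token_count(text, chars_per_token=4):
    """Very rough token estimate; 4 characters per token is an assumption."""
    return len(text) // chars_per_token

transcript_text = "word " * 8000  # stand-in for a 40,000-character transcript
tokens = rough_token_count(transcript_text)
print(f"~{tokens} tokens, ~${estimate_cost(tokens):.2f}")
```

Under these assumptions the stand-in text comes to about 10,000 tokens, or roughly $0.20.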

In [None]:
construct_index("/content/sample_data/Data")

<llama_index.indices.vector_store.vector_indices.GPTSimpleVectorIndex at 0x7f0a1af5cd00>

# Ask questions
It's time to have fun and test our AI. Run the function that queries GPT and type your question into the input.

If you've used the provided example data for your custom knowledge base, here are a few questions that you can ask:
1. Why do people cook at home? Classify the reasons.
2. Classify what frustrates people about cooking.
3. Brainstorm marketing campaign ideas for an air fryer that would appeal to people who cook at home.
4. Which kitchen appliances do people use most often?
5. What do people like about cooking at home?

In [None]:
ask_ai()

What do you want to ask? summarize


Response: <b>
This context is about a model trained on a clean version of a dataset similar to the original alpaca dataset. The model is not as good as the original alpaca model, as it was trained on fewer tokens. However, it is still able to provide coherent responses to questions and generate code, though it is not always correct.</b>

What do you want to ask? provide synopsis


Response: <b>
In this video, Dolly explains how to use GPT four, a model developed by Google, for inference. She explains that GPT four is capable of generating plausible text, and that it could benefit from more fine tuning and more specific data sets for a specific use. She encourages viewers to try it out and compare it to other models, and invites them to leave questions in the comments.</b>

What do you want to ask? who is the author


Response: <b>
The author of this context information is Dolly.</b>

What do you want to ask? what was the conclusion


Response: <b>
The conclusion was that GPT four should be open source so that its algorithms are accessible to everyone, including researchers who want to study them. It would also benefit from more fine tuning and more fine tuning on a very specific data set for a very specific use.</b>

What do you want to ask? what did databricks anounce last week


Response: <b>
Last week, Databricks announced their new fine tuning called Dolly, which is based on the GPT-J 6B model released by Eluther AI.</b>

What do you want to ask? what is alpaca dataset


Response: <b>
The alpaca dataset is a clean version of a dataset used to train a machine learning model. It is used to fine-tune the model and test its accuracy.</b>

What do you want to ask? what is lora?


Response: <b>
Lora is not mentioned in the context information, so it is not possible to answer the question with the given information.</b>

What do you want to ask? how many tokens were used to train the dolly model


Response: <b>
It is not possible to answer this question with the given context information.</b>

What do you want to ask? how many tokens were used to train the model


Response: <b>
402 billion tokens</b>

What do you want to ask? how about lama model


Response: <b>
The Lama model is the original model that this model is based on. It is trained on a data set of llamas and is used as a base model for fine tuning. It has been trained for a trillion tokens for the 7 billion and 13 billion parameter versions, and 1.4 trillion for the 30 billion and 65 billion parameter versions.</b>

KeyboardInterrupt: ignored
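
The session ends with a `KeyboardInterrupt` because `ask_ai` loops forever and can only be stopped by interrupting the cell. A minimal sketch of a loop that exits cleanly on an empty line, `quit`, or `exit` might look like this. The name `ask_loop` and its parameters are hypothetical; the query function is passed in so the loop itself does not depend on the index.

```python
def ask_loop(query_fn, read_input=input, show=print):
    """Keep asking questions until the user enters an empty line, 'quit' or 'exit'."""
    while True:
        query = read_input("What do you want to ask? ").strip()
        if query.lower() in ("", "quit", "exit"):
            break
        show(f"Response: {query_fn(query)}")

# In the notebook this could be wired to the saved index, for example:
# index = GPTSimpleVectorIndex.load_from_disk('index.json')
# ask_loop(lambda q: index.query(q, response_mode="compact").response)
```

Passing the query function in also makes it easy to try the loop with a dummy function before spending credits on real queries.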