<a href="https://colab.research.google.com/github/rsamala/Bots/blob/main/Chatbot_for_PDF_and_Youtube_Apr19.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook has all the code you need to create your own chatbot with a custom knowledge base using GPT-3.

Follow the instructions for each step and then run the code sample. To run the code, press the "play" button next to each code sample.

# Download the data for your custom knowledge base
For demonstration purposes we are going to use ----- as our knowledge base. You can download the files to your local folder from the GitHub repository by running the code below.
Alternatively, you can put your own custom data into the local folder.

# Install the dependencies
Run the code below to install the dependencies we need for our functions.

In [None]:
!pip install llama-index


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting llama-index
  Downloading llama_index-0.5.18.tar.gz (181 kB)
  Preparing metadata (setup.py) ... done
Collecting dataclasses_json
  Downloading dataclasses_json-0.5.7-py3-none-any.whl (25 kB)
Collecting langchain==0.0.142
  Downloading langchain-0.0.142-py3-none-any.whl (548 kB)
Collecting openai>=0.26.4
  Downloading openai-0.27.4-py3-none-any.whl (70 kB)
Collecting tiktoken
  Downloading ti

In [None]:
!pip install PyPDF2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:
!pip install langchain

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# youtube requirements
!pip install youtube_transcript_api
import sys, re
from youtube_transcript_api import YouTubeTranscriptApi

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting youtube_transcript_api
  Downloading youtube_transcript_api-0.6.0-py3-none-any.whl (23 kB)
Installing collected packages: youtube_transcript_api
Successfully installed youtube_transcript_api-0.6.0


# Define the functions
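The following code defines the two functions we need: `construct_index` builds the index from a folder of documents and saves it to disk, and `ask_ai` loads the saved index and queries it.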

In [None]:
import os

from IPython.display import Markdown, display
from langchain import OpenAI
from llama_index import (
    GPTSimpleVectorIndex,
    LLMPredictor,
    PromptHelper,
    ServiceContext,
    SimpleDirectoryReader,
)

def construct_index(directory_path):
    # set maximum input size
    max_input_size = 4096
    # set number of output tokens
    num_outputs = 2000
    # set maximum chunk overlap
    max_chunk_overlap = 20
    # set chunk size limit
    chunk_size_limit = 600 

    # define LLM
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.5, model_name="text-davinci-003", max_tokens=num_outputs))
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
 
    documents = SimpleDirectoryReader(directory_path).load_data()
    
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

    # persist the index so ask_ai() can reload it without re-embedding the documents
    index.save_to_disk('index.json')

    return index

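def ask_ai():
    # reload the index that construct_index() saved to disk
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    while True:
        query = input("What do you want to ask? ")
        # typing 'exit' or 'quit' leaves the loop instead of running forever
        if query.strip().lower() in {"exit", "quit"}:
            break
        response = index.query(query, response_mode="compact")
        display(Markdown(f"Response: <b>{response.response}</b>"))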
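
The `chunk_size_limit` and `max_chunk_overlap` settings above control how documents are cut up before they are embedded. The block below is only a minimal sketch of the idea of overlapping chunks, not llama-index's own splitter: `split_into_chunks` is a hypothetical helper, and it counts whitespace-separated words, whereas the library counts tokens.

```python
def split_into_chunks(words, chunk_size, overlap):
    """Split a list of words into windows of `chunk_size` that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


words = "one two three four five six seven eight nine ten".split()
chunks = split_into_chunks(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk starts with the last word of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.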
  

In [None]:
# Define YouTube-related functions
def get_video_transcript(video_id):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        return transcript
    except Exception as e:
        print("Error:", e)
        return None

def save_transcript_to_file(transcript, output_file):
    with open(output_file, "w", encoding="utf-8") as f:
        for entry in transcript:
            caption_text = entry["text"]
            # Drop bracketed annotations such as [Music] or (applause), flatten line breaks, then append a space.
            cleaned_text = re.sub(r'\[.*?\]|\(.*?\)', '', caption_text).replace('\n', ' ') + ' '
            f.write(cleaned_text)

def extract_video_id(url):
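    # Stop the ID at '&', '?', '/' or '#' so trailing parameters (e.g. youtu.be/ID?t=30) are not captured.
    patterns = [
        r"(?:https?:\/\/)?(?:www\.)?youtube\.com\/watch\?v=([^&?\/#]+)",
        r"(?:https?:\/\/)?(?:www\.)?youtu\.be\/([^&?\/#]+)",
        r"(?:https?:\/\/)?(?:www\.)?youtube\.com\/embed\/([^&?\/#]+)",
        r"(?:https?:\/\/)?(?:www\.)?youtube\.com\/v\/([^&?\/#]+)",
    ]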

    for pattern in patterns:
        match = re.search(pattern, url)
        if match:
            return match.group(1)

    return None
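
To see what the URL patterns above are meant to accept, here is a small self-contained sketch. It is not the notebook's `extract_video_id`: `find_video_id` is a hypothetical single-regex variant, and it assumes that a video ID is exactly 11 characters from `A-Z`, `a-z`, `0-9`, `_` and `-`, which matches the IDs commonly seen but is an assumption rather than a documented guarantee.

```python
import re

# Hypothetical single-pattern variant covering the same four URL shapes.
# Assumption: IDs are 11 characters drawn from letters, digits, '_' and '-'.
VIDEO_ID_PATTERN = re.compile(
    r"(?:youtube\.com/(?:watch\?v=|embed/|v/)|youtu\.be/)([A-Za-z0-9_-]{11})"
)


def find_video_id(url):
    """Return the video ID found in `url`, or None if no known shape matches."""
    match = VIDEO_ID_PATTERN.search(url)
    return match.group(1) if match else None


example_urls = [
    "https://youtu.be/dsqEzNq9oYY",
    "https://www.youtube.com/watch?v=dsqEzNq9oYY&t=30s",
    "https://www.youtube.com/embed/dsqEzNq9oYY",
    "https://example.com/not-a-video",
]
for example_url in example_urls:
    print(example_url, "->", find_video_id(example_url))
```

Fixing the ID length keeps trailing query parameters out of the result, at the price of rejecting any ID that does not fit the assumed shape.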

# Set OpenAI API Key
You need an OpenAI API key to be able to run this code.

If you don't have one yet, get it by [signing up](https://platform.openai.com/overview). Then click your account icon on the top right of the screen and select "View API Keys". Create an API key.

Then run the code below and paste your API key into the text input.

In [None]:
from getpass import getpass  # hides the key as you type so it is not echoed into the cell output
os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI key here and hit enter:")

Paste your OpenAI key here and hit enter:··········


In [None]:
# Save the transcript into the data folder /content/sample_data/Data
url = input("Paste YouTube URL: ")
video_id = extract_video_id(url)

# Only fetch when the URL yielded a video ID; otherwise transcript stays None
transcript = get_video_transcript(video_id) if video_id else None

data_dir = "/content/sample_data/Data"
os.makedirs(data_dir, exist_ok=True)  # create the folder if it does not exist yet
output_file = f"{video_id}_transcript.txt"
FQFN = os.path.join(data_dir, output_file)

if transcript:
  save_transcript_to_file(transcript, FQFN)
  print(f"Transcript saved to {output_file}")
else:
  print("Failed to fetch transcript")

Paste YouTube URL: https://youtu.be/dsqEzNq9oYY
Transcript saved to dsqEzNq9oYY_transcript.txt


# Construct an index
Now we are ready to construct the index. This takes every file in the data folder (`/content/sample_data/Data`), splits it into chunks, and embeds each chunk with OpenAI's embeddings API.

**Notice:** running this code will cost you credits on your OpenAI account ($0.02 for every 1,000 tokens). If you've just set up your account, your free credits should be more than enough for this experiment.
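
To get a feel for the cost before running the cell, the following sketch turns the $0.02 per 1,000 tokens figure from the notice into a rough estimate. It rests on two assumptions that are not from the notebook: that a token averages about four characters of English text, and that the files are plain text (it will not give meaningful numbers for PDFs). The helper names are hypothetical.

```python
import os

PRICE_PER_1K_TOKENS = 0.02  # USD, the figure quoted in the notice
CHARS_PER_TOKEN = 4         # assumption: rough average for English text


def estimate_tokens(text):
    """Approximate the token count of `text` from its character length."""
    return len(text) // CHARS_PER_TOKEN


def estimate_cost(num_tokens, price_per_1k=PRICE_PER_1K_TOKENS):
    """Convert a token count into an approximate cost in USD."""
    return num_tokens / 1000 * price_per_1k


def estimate_folder_cost(directory_path):
    """Sum the estimated tokens of every plain-text file directly inside a folder."""
    total_tokens = 0
    for name in os.listdir(directory_path):
        path = os.path.join(directory_path, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8", errors="ignore") as f:
                total_tokens += estimate_tokens(f.read())
    return total_tokens, estimate_cost(total_tokens)


data_dir = "/content/sample_data/Data"
if os.path.isdir(data_dir):  # only runs where the notebook's data folder exists
    tokens, cost = estimate_folder_cost(data_dir)
    print(f"~{tokens} tokens, roughly ${cost:.4f}")
```

This is an order-of-magnitude check only. Answering questions afterwards consumes further tokens that this estimate does not cover.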

In [None]:
construct_index("/content/sample_data/Data")

<llama_index.indices.vector_store.vector_indices.GPTSimpleVectorIndex at 0x7fdcd7956280>

# Ask questions
It's time to have fun and test our AI. Run the function that queries GPT and type your question into the input.

If you've used the provided example data for your custom knowledge base, here are a few questions that you can ask:
1. Why do people cook at home? Classify the reasons.
2. Classify what frustrates people about cooking.
3. Brainstorm marketing campaign ideas for an air fryer that would appeal to people who cook at home.
4. Which kitchen appliances do people use most often?
5. What do people like about cooking at home?
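
Before running the query loop, it may help to see roughly what a vector index does with a question: it compares the question's vector against the vectors of the stored chunks and passes the closest chunks to the language model. The sketch below illustrates only that retrieval step. It is a toy under stated assumptions, not llama-index's implementation: `embed` is a hypothetical stand-in that counts words instead of calling OpenAI's embeddings API, and the sample chunks are made up.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunk(question, chunks):
    """Return the chunk whose vector is most similar to the question's vector."""
    question_vector = embed(question)
    return max(chunks, key=lambda c: cosine_similarity(question_vector, embed(c)))


# Made-up chunks for illustration only.
chunks = [
    "the system pulls the next task from the task list",
    "results are stored in a vector database",
    "new tasks are created and the list is reprioritized",
]
best = top_chunk("where are results stored", chunks)
print(best)
```

Real embeddings capture meaning rather than exact word overlap, so the actual index can match a question to a chunk even when they share no words.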

In [None]:
ask_ai()

What do you want to ask? summarize in a paragraph


Response: <b>
Baby AGI is an autonomous AI powered task management system that utilizes Python OpenAI and Pinecone APIs to generate, prioritize, and execute tasks based on predefined objectives. The system works by running an infinite loop and is executed by four steps: pulling from the task list, sending an execution agent to the task, enriching and storing the result in Pinecone, and creating new tasks based on the results. It leverages OpenAI's GPT4, Pinecone's vector search, and Lang Chain's AI frameworks to autonomously create and perform tasks. The system can complete tasks, generate new tasks based on the results, and prioritize tasks in real time, demonstrating the potential of AI powered language models.</b>

What do you want to ask? how does baby agi identify and prioritize tasks based on objective?


Response: <b>
Baby AGI identifies and prioritizes tasks based on objective by using Python OpenAI and Pinecone APIs. It runs an infinite loop that consists of four steps: pulling from the task list, sending an execution agent to the task to complete it, enriching and storing the result into Pinecone, and creating and re-prioritizing new tasks based on the objective and the previous results. The execution agent leverages OpenAI's GPT4, Pinecone's vector search, and Lang Chain's AI frameworks to autonomously create and perform tasks based on an objective. The system maintains a task list for managing and prioritizing tasks, and it autonomously creates new tasks based on completed results and re-prioritizes the task list accordingly.</b>

What do you want to ask? how does langchain create tasks based on objective?


Response: <b>
Langchain uses its framework to create tasks based on an objective by leveraging GPT-4 and Pinecone's Vector Search capabilities. It uses the objective and task description to send a prompt to the OpenAI API, which returns the result of the task as a string. The system then enriches and stores the result in Pinecone, and creates new tasks based on the completed task results and re-prioritizes them using GPT-4. This allows the system to adapt and respond to new information and prioritize future tasks.</b>

What do you want to ask? write a page on how baby agi works


Response: <b>
Baby AGI is an autonomous AI powered task management system that utilizes Python OpenAI and Pinecone APIs to generate, prioritize, and execute tasks based on predefined objectives. The system works by running an infinite loop that is executed in four steps. 

The first step is to pull from the task list, which consists of the objective, the task, and the description of the API system task. The second step is to send an execution agent to the task to complete it based on the context, utilizing OpenAI's API. The third step is to enrich and store the result into Pinecone. The fourth step is to create new tasks and re-prioritize them based on the objective and the previous results. 

The system uses GPT4 and Lang Chain's capabilities to complete tasks, enriching and storing results in Pinecone. Pinecone is used for the storage of the memory and also for the search engine to help find and complete tasks. This integrated approach allows the AI agent to interact with the environment and perform tasks efficiently. 

The system also generates new tasks based on the completed task results and re-prioritizes them using GPT4. This allows the system to adapt and respond to new information and prioritize future</b>