<a href="https://colab.research.google.com/github/wandb/examples/blob/master/colabs/prompts/WandB_LLM_QA_bot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{prompt-qa-bot} -->

# Building an LLM App for Document Retrieval / Extraction
This tutorial runs through [this report](https://wandb.ai/gladiator/gradient_dissent_qabot/reports/Building-a-Q-A-Bot-for-Weights-Biases-Gradient-Dissent-Podcast--Vmlldzo0MTcyMDQz) on how to build a basic LLM App for retrieval-augmented question-answering.
- Track datasets and embeddings as artifacts
- Track prompts and chain executions
- Log token counts and cost

In [None]:
!pip install -qqq wandb langchain pytube tiktoken openai youtube-transcript-api chromadb

### Set up OpenAI API Key

In [None]:
from getpass import getpass
import os

if os.getenv("OPENAI_API_KEY") is None:
  if any(['VSCODE' in x for x in os.environ.keys()]):
    print('Please enter password in the VS Code prompt at the top of your VS Code window!')
  os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI key from: https://platform.openai.com/account/api-keys\n")

assert os.getenv("OPENAI_API_KEY", "").startswith("sk-"), "This doesn't look like a valid OpenAI API key"
print("OpenAI API key configured")

### Set up config and environment variables
- NOTE: set `entity` to your username or team name
- Set wandb [environment variables](https://docs.wandb.ai/guides/track/environment-variables) to change the behavior of logging:
- `WANDB_ENTITY` - username or team where your projects live
- `WANDB_PROJECT` - project where your runs will live
- `LANGCHAIN_WANDB_TRACING` - automatically logs LangChain traces, inputs, and outputs as part of runs in Weights & Biases

In [None]:
from dataclasses import dataclass
from pathlib import Path
import os

project_name = "gradient-dissent-qabot" #@param
entity = "wandb" #@param
TOTAL_EPISODES = 5

playlist_url = "https://www.youtube.com/playlist?list=PLD80i8An1OEEb1jP0sjEyiLG8ULRXFob_"
root_data_dir = Path("/contents/data")
root_artifact_dir = Path("downloaded_artifacts")
yt_podcast_data_artifact = f"{entity}/{project_name}/yt_podcast_transcript:latest"
summarized_data_artifact = f"{entity}/{project_name}/summarized_podcasts:latest"
summarized_que_data_artifact = f"{entity}/{project_name}/summarized_que_podcasts:latest"
transcript_embeddings_artifact = f"{entity}/{project_name}/transcript_embeddings:latest"

os.makedirs("/contents/data", exist_ok=True)
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"
os.environ['WANDB_PROJECT'] = project_name
os.environ['WANDB_ENTITY'] = entity

## Log in to W&B
- You can log in explicitly using `wandb login` or `wandb.login()` (see below)
- Alternatively you can set environment variables. There are several env variables which you can set to change the behavior of W&B logging. The most important are:
    - `WANDB_API_KEY` - find this in your "Settings" section under your profile
    - `WANDB_BASE_URL` - this is the url of the W&B server (You only need this if you are using a private instance)
- Find your API key in "Profile" -> "Settings" in the W&B App

![api_token](https://drive.google.com/uc?export=view&id=1Xn7hnn0rfPu_EW0A_-32oCXqDmpA0-kx)
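
If you prefer the environment-variable route, the variables need to be set before `wandb.login()` or `wandb.init()` is called. Below is a minimal sketch using only the standard library; the helper name `configure_wandb_env` and the placeholder key and URL are hypothetical and only there for illustration.

```python
import os
from typing import Optional


def configure_wandb_env(api_key: str, base_url: Optional[str] = None) -> None:
    """Export W&B credentials as environment variables (hypothetical helper)."""
    # The wandb client reads WANDB_API_KEY, so no interactive prompt is needed.
    os.environ["WANDB_API_KEY"] = api_key
    # WANDB_BASE_URL is only needed when using a private W&B instance.
    if base_url is not None:
        os.environ["WANDB_BASE_URL"] = base_url


# Placeholder values; substitute your own key (and server URL, if any).
configure_wandb_env("your-api-key-here", base_url="https://wandb.example.com")
```

Avoid hard-coding a real key in a shared notebook; reading it with `getpass`, as done for the OpenAI key above, is a safer pattern.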

In [None]:
import wandb

In [None]:
wandb.login()

In [None]:
import time
import pandas as pd
import wandb
from langchain.document_loaders import YoutubeLoader
from pytube import Playlist, YouTube
from tqdm import tqdm


def retry_access_yt_object(url, max_retries=5, interval_secs=5):
    """
    Retries creating a YouTube object with the given URL and accessing its title several times
    with a given interval in seconds, until it succeeds or the maximum number of attempts is reached.
    If the object still cannot be created or the title cannot be accessed after the maximum number
    of attempts, the last exception is raised.
    """
    last_exception = None
    for i in range(max_retries):
        try:
            yt = YouTube(url)
            title = yt.title  # Access the title of the YouTube object.
            return yt  # Return the YouTube object if successful.
        except Exception as err:
            last_exception = err  # Keep track of the last exception raised.
            print(
                f"Failed to create YouTube object or access title. Retrying... ({i+1}/{max_retries})"
            )
            time.sleep(interval_secs)  # Wait for the specified interval before retrying.

    # If the YouTube object still cannot be created or the title cannot be accessed after the maximum number of attempts, raise the last exception.
    raise last_exception

## Log Data Snapshots as Artifacts

W&B is unopinionated about how you track your experiments, so we could log the data in several ways:
* Log a single artifact that holds all the data (training, validation, and test)
* Log several artifacts, one for each of the training, validation, and test data loaders

Which you choose depends on what best suits your needs, workflows, and expectations.

### Anatomy of an artifact

The `Artifact` class will correspond to an entry in the W&B Artifact registry.  The artifact has
* a name
* a type
* metadata
* description
* files, directory of files, or references

Example usage
```
run = wandb.init(project="my-project")
artifact = wandb.Artifact(name="my_artifact", type="data")
artifact.add_file("/path/to/my/file.txt")
run.log_artifact(artifact)
run.finish()
```

In [None]:
run = wandb.init(project=project_name, entity=entity, job_type="dataset")

playlist = Playlist(playlist_url)
playlist_video_urls = playlist.video_urls[0:TOTAL_EPISODES]

print(f"Processing {len(playlist_video_urls)} videos from the playlist.")

video_data = []
for video in tqdm(playlist_video_urls, total=len(playlist_video_urls)):
    try:
        curr_video_data = {}
        yt = retry_access_yt_object(video, max_retries=25, interval_secs=2)
        curr_video_data["title"] = yt.title
        curr_video_data["url"] = video
        curr_video_data["duration"] = yt.length
        curr_video_data["publish_date"] = yt.publish_date.strftime("%Y-%m-%d")
        loader = YoutubeLoader.from_youtube_url(video)
        transcript = loader.load()[0].page_content
        transcript = " ".join(transcript.split())
        curr_video_data["transcript"] = transcript
        curr_video_data["total_words"] = len(transcript.split())
        video_data.append(curr_video_data)
    except Exception as inst:
        print(type(inst))    # the exception type
        print(inst.args)     # arguments stored in .args
        print(inst)
        print(f"Failed to scrape {video}")

print(f"Total podcast episodes scraped: {len(video_data)}")

# save the scraped data to a csv file
df = pd.DataFrame(video_data)
data_path = root_data_dir / "yt_podcast_transcript.csv"
df.to_csv(data_path, index=False)

# upload the scraped data to wandb
artifact = wandb.Artifact("yt_podcast_transcript", type="dataset")
artifact.add_file(data_path)
run.log_artifact(artifact)


### Log a wandb Table to interact with your data
- Here we log the dataframe of metadata about the YouTube transcripts (URLs, durations, transcripts)
- This allows us to interrogate the original data (filtering, grouping, etc.)

In [None]:
# create wandb table
table = wandb.Table(dataframe=df)
run.log({"yt_podcast_transcript": table})
run.finish()

## Summarize YouTube Transcripts
- Here we summarize the transcripts in chunks, summarizing each chunk and then summarizing the summaries using the LangChain `load_summarize_chain`
- We can do this in parallel since each chunk of a transcript can be summarized independently so we employ `map_reduce`
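
The map-reduce idea can be sketched in plain Python. This is an illustration of the pattern, not LangChain's implementation: it splits on words rather than tokens, and `first_and_last_word` is a toy stand-in for the LLM call. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_into_chunks(words: List[str], chunk_size: int) -> List[str]:
    """Split a list of words into consecutive, non-overlapping chunks."""
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def map_reduce_summarize(
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int = 1000,
) -> str:
    """Summarize each chunk independently (map), then summarize the summaries (reduce)."""
    chunks = split_into_chunks(text.split(), chunk_size)
    # Map step: chunks do not depend on each other, so they can run concurrently.
    with ThreadPoolExecutor() as pool:
        partial_summaries = list(pool.map(summarize, chunks))
    # Reduce step: merge the intermediate summaries into one final summary.
    return summarize(" ".join(partial_summaries))


def first_and_last_word(text: str) -> str:
    """Toy stand-in for an LLM call: keep only the first and last word."""
    words = text.split()
    if not words:
        return ""
    return f"{words[0]} {words[-1]}"


transcript = "one two three four five six seven eight nine ten"
print(map_reduce_summarize(transcript, first_and_last_word, chunk_size=5))
# prints "one ten"
```

Because the map step only needs each chunk on its own, swapping the toy function for a real model call keeps the same structure.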

In [None]:
import os
import pandas as pd
from langchain.callbacks import get_openai_callback
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import DataFrameLoader
from langchain.prompts import PromptTemplate
from langchain.text_splitter import TokenTextSplitter
from tqdm import tqdm

import wandb


def get_data(artifact_name: str, total_episodes: int = None):
    podcast_artifact = wandb.use_artifact(artifact_name)
    podcast_artifact_dir = podcast_artifact.download(root_artifact_dir)
    filename = artifact_name.split(":")[0].split("/")[-1]
    df = pd.read_csv(os.path.join(podcast_artifact_dir, f"{filename}.csv"))
    if total_episodes is not None:
        df = df.iloc[:total_episodes]
    return df


def summarize_episode(episode_df: pd.DataFrame):
    # load docs into langchain format
    loader = DataFrameLoader(episode_df, page_content_column="transcript")
    data = loader.load()

    # split the documents
    text_splitter = TokenTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(data)
    print(f"Number of documents for podcast {data[0].metadata['title']}: {len(docs)}")

    # initialize LLM
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

    # define map prompt
    map_prompt = """Write a concise summary of the following short transcript from a podcast.
    Don't add your opinions or interpretations.

    {text}

    CONCISE SUMMARY:"""

    # define combine prompt
    combine_prompt = """You have been provided with summaries of chunks of transcripts from a podcast.
    Your task is to merge these intermediate summaries to create a brief and comprehensive summary of the entire podcast.
    The summary should encompass all the crucial points of the podcast.
    Ensure that the summary is at least 2 paragraphs long and effectively captures the essence of the podcast.
    {text}

    SUMMARY:"""

    map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
    combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

    # initialize the summarizer chain
    chain = load_summarize_chain(
        llm,
        chain_type="map_reduce",
        return_intermediate_steps=True,
        map_prompt=map_prompt_template,
        combine_prompt=combine_prompt_template,
    )

    summary = chain({"input_documents": docs})
    return summary

### Execute Summary Chain and log results
- You can instantiate a `WandbTracer` and pass in additional config about this LangChain run.
- Log the outputs of the chain like tokens used, cost, etc.
- Log the resulting summaries as artifacts
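
The token counts and cost logged in the next cell are running totals over every LLM call made inside the callback context. A rough sketch of that bookkeeping follows; the `UsageTotals` class is hypothetical and the per-1,000-token prices are made-up placeholders, not OpenAI's actual pricing or LangChain's internal logic.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    """Running totals across several LLM calls (hypothetical helper)."""

    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_cost: float = 0.0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    def add_call(
        self,
        prompt_tokens: int,
        completion_tokens: int,
        prompt_price_per_1k: float,
        completion_price_per_1k: float,
    ) -> None:
        """Add one call's usage; prices are in dollars per 1,000 tokens."""
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        self.total_cost += (
            prompt_tokens / 1000 * prompt_price_per_1k
            + completion_tokens / 1000 * completion_price_per_1k
        )


totals = UsageTotals()
# Made-up prices, for illustration only.
totals.add_call(2000, 1000, prompt_price_per_1k=0.5, completion_price_per_1k=1.0)
totals.add_call(4000, 0, prompt_price_per_1k=0.5, completion_price_per_1k=1.0)
print(totals.total_tokens, totals.total_cost)
```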

In [None]:
from langchain.callbacks.tracers import WandbTracer

tracer = WandbTracer(run_args = {"job_type": "summarize"})

# get scraped data
df = get_data(artifact_name=yt_podcast_data_artifact, total_episodes=TOTAL_EPISODES)

summaries = []
with get_openai_callback() as cb:
    for episode in tqdm(df.iterrows(), total=len(df), desc="Summarizing episodes"):
        episode_data = episode[1].to_frame().T

        summary = summarize_episode(episode_data)
        summaries.append(summary["output_text"])

    print("*" * 25)
    print(cb)
    print("*" * 25)

    wandb.log(
        {
            "total_prompt_tokens": cb.prompt_tokens,
            "total_completion_tokens": cb.completion_tokens,
            "total_tokens": cb.total_tokens,
            "total_cost": cb.total_cost,
        }
    )

df["summary"] = summaries

# save data
path_to_save = os.path.join(root_data_dir, "summarized_podcasts.csv")
df.to_csv(path_to_save, index=False)

# log to wandb artifact
artifact = wandb.Artifact("summarized_podcasts", type="dataset")
artifact.add_file(path_to_save)
wandb.log_artifact(artifact)

# create wandb table
table = wandb.Table(dataframe=df)
wandb.log({"summarized_podcasts": table})

tracer.finish()

## Embed the contents of the YouTube transcripts
- Here we use OpenAI embeddings and [ChromaDB](https://www.trychroma.com/) to embed the transcript chunks, making them queryable via vector similarity search when we ask the LLM contextual questions
- Use `wandb.log` and artifacts to log the resulting ChromaDB serialized embeddings.
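
To make "vector similarity search" concrete, here is a small standard-library sketch. It assumes a toy bag-of-words `embed` function in place of OpenAI embeddings and a brute-force scan in place of ChromaDB's index; the function names and sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: a bag-of-words count vector (stand-in for a learned embedding)."""
    return dict(Counter(text.lower().split()))


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_search(query: str, docs: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k documents most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


docs = [
    "langchain makes it easy to build llm applications",
    "artifacts track datasets and models",
    "embeddings turn text into vectors",
]
for score, doc in similarity_search("how do I build llm applications", docs, k=1):
    print(f"{score:.2f}  {doc}")
```

A learned embedding captures meaning rather than exact word overlap, but the ranking step (score every candidate against the query, keep the top `k`) is the same idea.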

In [None]:
import os
from dataclasses import asdict

import pandas as pd
from langchain.callbacks import get_openai_callback
from langchain.document_loaders import DataFrameLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import TokenTextSplitter
from langchain.vectorstores import Chroma
from tqdm import tqdm
from wandb.integration.langchain import WandbTracer

import wandb


def get_data(artifact_name: str, total_episodes=None):
    podcast_artifact = wandb.use_artifact(artifact_name, type="dataset")
    podcast_artifact_dir = podcast_artifact.download(root_artifact_dir)
    filename = artifact_name.split(":")[0].split("/")[-1]
    df = pd.read_csv(os.path.join(podcast_artifact_dir, f"{filename}.csv"))
    if total_episodes is not None:
        df = df.iloc[:total_episodes]
    return df


def create_embeddings(episode_df: pd.DataFrame, index: int):
    # load docs into langchain format
    loader = DataFrameLoader(episode_df, page_content_column="transcript")
    data = loader.load()

    # split the documents
    text_splitter = TokenTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(data)

    title = data[0].metadata["title"]
    print(f"Number of documents for podcast {title}: {len(docs)}")

    # initialize embedding engine
    embeddings = OpenAIEmbeddings()

    db = Chroma.from_documents(
        docs,
        embeddings,
        persist_directory=os.path.join(root_data_dir / "chromadb", str(index)),
    )
    db.persist()

In [None]:
tracer = WandbTracer(run_args = {"job_type": "embed_transcripts"})

# get data
df = get_data(artifact_name=summarized_data_artifact, total_episodes=TOTAL_EPISODES)

# create embeddings
with get_openai_callback() as cb:
    for episode in tqdm(df.iterrows(), total=len(df), desc="Embedding transcripts"):
        episode_data = episode[1].to_frame().T

        create_embeddings(episode_data, index=episode[0])

    print("*" * 25)
    print(cb)
    print("*" * 25)

    wandb.log(
        {
            "total_prompt_tokens": cb.prompt_tokens,
            "total_completion_tokens": cb.completion_tokens,
            "total_tokens": cb.total_tokens,
            "total_cost": cb.total_cost,
        }
    )

# log embeddings to wandb artifact
artifact = wandb.Artifact("transcript_embeddings", type="dataset")
artifact.add_dir(root_data_dir / "chromadb")
wandb.log_artifact(artifact)

tracer.finish()

## Ask Questions Against your Summarized Documents

Finally we tie everything together:
1. We can pull down our ChromaDB embeddings from W&B
2. Pass them along with a prompt template for QA to the `RetrievalQA` chain and start asking questions!
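
With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt alongside the question. The sketch below shows that assembly step in plain Python. It is not LangChain's internal code, and the helper name, shortened template, and sample chunks are hypothetical.

```python
from typing import List

PROMPT_TEMPLATE = """Use the following pieces of context to answer the question.

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""


def build_stuff_prompt(retrieved_chunks: List[str], question: str) -> str:
    """Concatenate all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


chunks = [
    "Chunk A: first retrieved passage.",
    "Chunk B: second retrieved passage.",
]
print(build_stuff_prompt(chunks, "What was discussed?"))
```

Since every chunk ends up in one prompt, the number of retrieved chunks (`k`) and the chunk size together need to fit within the model's context window.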

In [None]:
from langchain.chains import RetrievalQA
from langchain.embeddings.openai import OpenAIEmbeddings

def get_answer(podcast: str, question: str):
  index = df[df["title"] == podcast].index[0]
  db_dir = os.path.join(chromadb_dir, str(index))
  embeddings = OpenAIEmbeddings()
  db = Chroma(persist_directory=db_dir, embedding_function=embeddings)

  prompt_template = """Use the following pieces of context to answer the question.
  If you don't know the answer, just say that you don't know, don't try to make up an answer.
  Don't add your opinions or interpretations. Ensure that you complete the answer.
  If the question is not relevant to the context, just say that it is not relevant.

  CONTEXT:
  {context}

  QUESTION: {question}

  ANSWER:"""

  prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

  retriever = db.as_retriever()
  retriever.search_kwargs["k"] = 2

  qa = RetrievalQA.from_chain_type(
      llm=ChatOpenAI(temperature=0),
      chain_type="stuff",
      retriever=retriever,
      chain_type_kwargs={"prompt": prompt},
      return_source_documents=True,
  )

  with get_openai_callback() as cb:
      result = qa({"query": question})
      print(cb)

  answer = result["result"]
  return answer

In [None]:
# download and read data
api = wandb.Api()
artifact_df = api.artifact(summarized_data_artifact)
artifact_df.download(root_data_dir)

artifact_embeddings = api.artifact(transcript_embeddings_artifact)
chromadb_dir = artifact_embeddings.download(root_data_dir / "chromadb")

df_path = root_data_dir / "summarized_podcasts.csv"
df = pd.read_csv(df_path)

In [None]:
df["title"].tolist()[0:TOTAL_EPISODES]

In [None]:
tracer = WandbTracer(run_args = {"job_type": "retrievalQA"})

answer = get_answer('Enabling LLM-Powered Applications with Harrison Chase of LangChain', "What did Harrison Chase say?")
print(answer)

tracer.finish()