In [1]:
from langchain import hub
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

import os
import shutil
import dotenv

dotenv.load_dotenv()

LOCAL = os.getenv('LOCAL')
SAVE_DIR = os.getenv('SAVE_DIR')
CHROMA_PATH = os.getenv('CHROMA_PATH')
WHISPER_MODEL = os.getenv('WHISPER_MODEL')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
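
`os.getenv` returns `None` for any variable that is absent from the `.env` file, and that tends to surface later as a less obvious error. A minimal sketch of an early check follows. It assumes all five variables are required, which the notebook does not state, and `missing_env_vars` is a hypothetical helper, not part of `scripts.rag`.

```python
import os

# Names read by the notebook; treating all of them as required is an assumption.
REQUIRED_VARS = ["LOCAL", "SAVE_DIR", "CHROMA_PATH", "WHISPER_MODEL", "OPENAI_API_KEY"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dict as `environ` makes the helper easy to try without touching the real environment.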


In [2]:
urls = [
    "https://www.youtube.com/watch?v=FgzM3zpZ55o",
    "https://www.youtube.com/watch?v=E3f2Camj0Is",
    "https://www.youtube.com/watch?v=dRIhrn8cc9w",
]

In [3]:
from scripts.rag import vector_store as vs

In [4]:
vector_store = vs.get_vector_store(urls)

Using the following model:  openai/whisper-tiny


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[youtube] Extracting URL: https://www.youtube.com/watch?v=FgzM3zpZ55o
[youtube] FgzM3zpZ55o: Downloading webpage
[youtube] FgzM3zpZ55o: Downloading ios player API JSON
[youtube] FgzM3zpZ55o: Downloading m3u8 information
[info] FgzM3zpZ55o: Downloading 1 format(s): 140
[download] Destination: ../../data/youtube/Stanford CS234： Reinforcement Learning ｜ Winter 2019 ｜ Lecture 1 - Introduction - Emma Brunskill.m4a
[download] 100% of   61.01MiB in 00:00:02 at 21.54MiB/s    
[FixupM4a] Correcting container of "../../data/youtube/Stanford CS234： Reinforcement Learning ｜ Winter 2019 ｜ Lecture 1 - Introduction - Emma Brunskill.m4a"
[ExtractAudio] Not converting audio ../../data/youtube/Stanford CS234： Reinforcement Learning ｜ Winter 2019 ｜ Lecture 1 - Introduction - Emma Brunskill.m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=E3f2Camj0Is
[youtube] E3f2Camj0Is: Downloading webpage
[youtube] E3f2Camj0Is: Downloading ios player API JSON
[youtube

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


Transcribing part ../../data/youtube/Stanford CS234： Reinforcement Learning ｜ Winter 2019 ｜ Lecture 1 - Introduction - Emma Brunskill.m4a!
Transcribing part ../../data/youtube/Stanford CS234： Reinforcement Learning ｜ Winter 2019 ｜ Lecture 2 - Given a Model of the World.m4a!
Loaded 3 documents
Load videos done !
Number of chunks: 142, for 3 documents
Split documents done !
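
The log reports 142 chunks for 3 documents, but the splitting itself happens inside `vs.get_vector_store` and is not shown here. As a rough illustration of what chunking with overlap means, here is a plain-Python sketch. It is a simplification, not `RecursiveCharacterTextSplitter`: it cuts on fixed character windows rather than on separators, and the `chunk_size` and `chunk_overlap` values are assumptions, not the notebook's settings.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical transcripts standing in for the loaded documents.
transcripts = ["a" * 2500, "b" * 1200]
all_chunks = [chunk for text in transcripts for chunk in split_with_overlap(text)]
print(f"Number of chunks: {len(all_chunks)}, for {len(transcripts)} documents")
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.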


In [5]:
retriever = vector_store.as_retriever()

In [7]:
prompt = hub.pull("rlm/rag-prompt")

In [9]:
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

In [10]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
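
To see what `format_docs` hands to the prompt, it can be run on stand-in objects. The sketch below uses a hypothetical `FakeDoc` dataclass that exposes only the `page_content` attribute, so it runs without a vector store.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in exposing only the attribute format_docs reads."""
    page_content: str


def format_docs(docs):
    # Same body as the notebook cell: join chunk texts with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


docs = [FakeDoc("first chunk"), FakeDoc("second chunk")]
context = format_docs(docs)
print(context)
```

The retrieved chunks therefore reach the prompt as one string, with a blank line between consecutive chunks.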

In [11]:
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [15]:
rag_chain.invoke("Can you explain me what's the sequential decision making?")

'Sequential decision making is a process where an agent takes actions that affect the state of the world, receives observations and rewards, and aims to maximize future rewards. It involves making decisions based on partially observable information, such as in poker or healthcare scenarios. Types of sequential decision-making processes include bandits and planning, which involve dealing with delayed consequences and exploring the environment to learn how to make optimal decisions.'

In [16]:
from langchain_core.runnables import RunnableParallel

In [17]:
rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

In [18]:
rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)
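
The data flow of `rag_chain_with_source` can be mimicked in plain Python: the parallel step builds a dict from the same input, and `.assign` adds one more key while keeping the existing ones. This is only a sketch of the dict shapes; `fake_retriever` and `fake_answer` are hypothetical stand-ins for `retriever` and `rag_chain_from_docs`, and no LangChain code is involved.

```python
def fake_retriever(question):
    # Hypothetical stand-in for `retriever`: returns a list of chunk texts.
    return ["chunk about rewards", "chunk about states"]


def fake_answer(inputs):
    # Hypothetical stand-in for `rag_chain_from_docs`: formats context, then "answers".
    context = "\n\n".join(inputs["context"])
    return f"answer to {inputs['question']!r} from {len(context)} context characters"


def chain_with_source(question):
    # Step 1 (parallel step): each branch receives the same input.
    state = {"context": fake_retriever(question), "question": question}
    # Step 2 (assign step): add a key computed from the current dict, keep the others.
    return {**state, "answer": fake_answer(state)}


result = chain_with_source("What is sequential decision making?")
print(sorted(result))
```

Because the retrieved documents stay under `context`, the caller gets the sources alongside the answer, which the plain `rag_chain` discards.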

In [19]:
rag_chain_with_source.invoke("Can you explain me what's the sequential decision making?")

{'context': [Document(page_content="or sensitive manner, in which case, of course, please feel free to reach out to the course staff directly. And for things like lectures and homeworks and project questions, pretty much all of that should go through Piazza. For late day policy, we have six late days. details you can see the webpage and for collaboration, please see the webpage for some of the details about that. So before we go on to the next part, then we have any questions about logistics for the class. Okay, let's get started. So we're now going to do an introduction to sequential decision making under uncertainty. A number of you guys will have seen some of this content before. We will be going into this in probably more depth than you've seen for some of this stuff including some theory, not theory today, but in other lectures. And then we'll also be moving on to content that will be new to all of you later in the class. So sequential decision making under uncertainty. The fundam