# LangChain QA

All code comes from the [LangChain docs](https://langchain.readthedocs.io).

In [None]:
!pip install langchain openai chromadb tiktoken pypdf pytube youtube-transcript-api aspose-words ffmpeg fpdf speech_recognition
!pip install ibm_watson
!pip install git+https://github.com/openai/whisper.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
import getpass
OPENAI_API_KEY = getpass.getpass("Enter OpenAI API key: ")

Enter OpenAI API key: ··········


#### Get Video and Audio

In [None]:
import subprocess
import whisper
from pytube import YouTube
import aspose.words as aw

def audio2text(video, audio):
  # Video to audio: extract the audio track with ffmpeg
  # (-vn drops the video stream, -y overwrites an existing output file)
  command = ['ffmpeg', '-y', '-i', video, '-ab', '160k', '-ar', '44100', '-vn', audio]
  subprocess.run(command, check=True)

  # Audio to text: transcribe the extracted audio with Whisper's "base" model
  model = whisper.load_model("base")
  result = model.transcribe(audio)

  print(result)
  # Write the transcript to a .txt file
  with open('transcript.txt', 'w') as file:
    file.write(result['text'])

  # TXT to PDF
  doc = aw.Document("transcript.txt")
  doc.save("transcript.pdf", aw.SaveFormat.PDF)
  print("Saved text to PDF.")

def download_youtube_video(url, file_name):
    try:
        youtube = YouTube(url)
        video = youtube.streams.first()
        video.download(filename=file_name)
        print(f"Video downloaded successfully as {file_name}")
    except Exception as e:
        print(f"Error: {str(e)}")



In [None]:
youtube_url = input("Youtube URL: ") # Replace with the YouTube video URL
file_name = "test.mp4"  # Replace with desired file name
download_youtube_video(youtube_url, file_name.split('/')[-1])
audio2text(file_name, "audio.wav")

Youtube URL: https://www.youtube.com/watch?v=HbY51mVKrcE&t=179s
Video downloaded successfully as test.mp4




{'text': " Hey YouTube, in this video I'm going to show you how you can quickly convert any audio into text using the free open source package in Python called whisper. I'm going to show I installed it, show an example of how I ran it and compare it to an existing library. So starting off you'll probably want to go to the whisper get hub repository that we're looking at here and they give instructions on how you can install it. Now one thing to keep in mind when you pip install just the name whisper it's not going to install the right version. We want to install from this get repository. So just take this pip install command and run it in your environment that you're running Python. And they also mentioned here that you need FFN peg installed. There's some instructions to do it, but I already had that installed on my computer. Now that I have whisper install, let's just make some audio that I can test this on. So I'm going to say some idioms idioms are usually hard for models to unders

## Method-2 LangChain

In [None]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains.question_answering import load_qa_chain

In [None]:
## OpenAI Check
import os
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

llm = OpenAI()
print(llm("tell me a joke"))



Q: What did the fish say when it hit the wall?
A: Dam!


# load_qa_chain

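load_qa_chain loads a chain for question answering over a set of documents. It passes ALL of those documents to the LLM; there is no retrieval step that selects only the relevant ones.

chain_type="stuff" will not work here: it puts every document into a single prompt, and for this transcript the number of tokens exceeds the model's limit. Other chain types such as "map_reduce" process the documents in smaller pieces, so we use that instead.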
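
To make the difference between the two chain types concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: `fake_llm`, `stuff_prompt`, and `map_reduce_answer` are hypothetical names introduced for illustration, the prompt wording is made up, and `fake_llm` only records the prompts it receives instead of calling a model. The point is only that the "stuff" approach builds one large prompt, while a map-reduce approach makes one smaller call per document plus one call to combine the partial answers.

```python
from typing import Callable, List

calls: List[str] = []  # every prompt the stand-in model receives


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM: records the prompt, returns a stub."""
    calls.append(prompt)
    return f"stub answer #{len(calls)}"


def stuff_prompt(docs: List[str], question: str) -> str:
    """'stuff' style: every document goes into one prompt."""
    context = "\n\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}"


def map_reduce_answer(docs: List[str], question: str,
                      llm: Callable[[str], str]) -> str:
    """'map_reduce' style: one call per document, then one call to combine."""
    # Map step: ask the question against each document separately.
    partial_answers = [llm(stuff_prompt([doc], question)) for doc in docs]
    # Reduce step: ask the model to merge the partial answers.
    return llm(stuff_prompt(partial_answers, question))


# Three made-up "pages" standing in for the loaded PDF pages.
pages = ["page one text " * 40, "page two text " * 40, "page three text " * 40]
question = "What tools were used?"

single_prompt = stuff_prompt(pages, question)
final_answer = map_reduce_answer(pages, question, fake_llm)

print("stuff prompt length:", len(single_prompt))
print("longest map_reduce prompt:", max(len(p) for p in calls))
print("number of model calls:", len(calls))
```

The trade-off is visible in the printed numbers: map-reduce keeps each prompt small but needs more model calls, which in a real setting means more latency and API cost.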

In [None]:
# load document
loader = PyPDFLoader("transcript.pdf")
documents = loader.load()

### For multiple documents
# loaders = [....]
# documents = []
# for loader in loaders:
#     documents.extend(loader.load())

chain = load_qa_chain(llm=OpenAI(), chain_type="map_reduce")
query = "what all the tools were used?"
chain.run(input_documents=documents, question=query)

' The tools used here are pip install streamlit, Langchain, open AI, Wikipedia, Chroma DB, tick token, Wikipedia API wrapper, title memory buffer, script memory buffer, and wiki pdf research.'

# VectorstoreIndexCreator

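VectorstoreIndexCreator wraps the steps of a retrieval pipeline in one call: it splits the loaded documents into chunks, embeds each chunk, and stores the embeddings in a vector store that can then be queried.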

Source:
- https://python.langchain.com/en/latest/modules/chains/getting_started.html
- https://github.com/hwchase17/langchain/blob/master/langchain/indexes/vectorstore.py#L21-L74
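
As a rough picture of what such a wrapper does, the sketch below implements the split, embed, store, and retrieve steps in plain Python. Everything here is an assumption made for illustration: `split_text`, `embed`, `cosine`, `build_index`, and `retrieve` are hypothetical helpers, the word-count "embedding" stands in for OpenAIEmbeddings, and a plain list stands in for Chroma. The fixed-width splitter is also simpler than CharacterTextSplitter, which splits on a separator rather than at fixed offsets.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[str]:
    """Cut text into fixed-width character chunks that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real embedding is a dense vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(text: str, chunk_size: int = 1000,
                chunk_overlap: int = 0) -> List[Tuple[str, Counter]]:
    """Split the text and store each chunk next to its embedding."""
    return [(chunk, embed(chunk)) for chunk in split_text(text, chunk_size, chunk_overlap)]


def retrieve(index: List[Tuple[str, Counter]], question: str, k: int = 1) -> List[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Two made-up 40-character "documents", so chunk_size=40 yields one chunk each.
first = "whisper converts audio files into text.".ljust(40)
second = "chroma stores vectors for fast search.".ljust(40)
index = build_index(first + second, chunk_size=40)
top = retrieve(index, "what stores vectors?", k=1)
print(top[0].strip())
```

In a real pipeline only the retrieved chunks, not the whole document, are passed to the LLM along with the question, which is what keeps the prompt within the token limit.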

In [None]:
index = VectorstoreIndexCreator(
    # split the documents into chunks
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0),
    # select which embeddings we want to use
    embedding=OpenAIEmbeddings(),
    # use Chroma as the vectorstore to index and search embeddings
    vectorstore_cls=Chroma
).from_loaders([loader])
query = "what is the total number of AI publications?"
index.query(llm=OpenAI(), question=query, chain_type="map_reduce")