## YouTube Extractor
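- Given a YouTube URL, this notebook transcribes the video's audio to text and makes that text searchable.
- Without a GPU, transcription is very slow, so running on a GPU is recommended.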
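
To confirm that the runtime has a GPU before starting a long transcription, a quick check along these lines may help. This is a minimal sketch that assumes an NVIDIA GPU and uses the presence of the `nvidia-smi` command as a heuristic; the helper name `gpu_available` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Heuristic check: True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The NVIDIA driver tools are not on PATH, so assume there is no GPU.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_available():
    print("GPU detected")
else:
    print("No GPU detected; transcription will run on the CPU and may be slow")
```

On Google Colab, the runtime type can be switched to a GPU from the runtime settings if this reports no GPU.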

Setup

In [None]:
!pip install langchain
!pip install openai
!pip install langchain-openai
!pip install pytube
!pip install git+https://github.com/openai/whisper
!pip install faiss-gpu
import os


For Google Colab

In [None]:
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
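
The cell above only works on Google Colab. Outside Colab, one option is to fall back to an interactive prompt when the variable is not already set. The following is a sketch under that assumption; `ensure_api_key` is a hypothetical helper, and the demo uses a stub prompt and a dummy variable name so that it runs without user input.

```python
import os
from getpass import getpass


def ensure_api_key(name="OPENAI_API_KEY", prompt=getpass):
    """Return the key stored in os.environ[name], asking for it only if it is missing."""
    if not os.environ.get(name):
        os.environ[name] = prompt(f"{name}: ")
    return os.environ[name]


# Demo with a stub prompt and a dummy variable name (both are illustrative only).
os.environ.pop("DEMO_API_KEY", None)
value = ensure_api_key("DEMO_API_KEY", prompt=lambda _msg: "sk-demo")
print(value)  # sk-demo
```

In a real session, calling `ensure_api_key()` with no arguments would prompt for `OPENAI_API_KEY` without echoing the input.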

Extract the text and the start time of each text segment from the YouTube video

In [None]:
import whisper
import pytube

url = "https://www.youtube.com/watch?v=gcOlzvwxVw8"
video = pytube.YouTube(url)
audio = video.streams.get_audio_only()
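# `output_path` names a directory, not a file; use `filename` to set the file name.
# `get_audio_only()` returns an MP4 audio stream by default.
path = audio.download(filename="audio.mp4")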
model = whisper.load_model("medium")
transcription = model.transcribe(path)
transcription["segments"][0]

Convert the text and its start times into lists

In [None]:
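from datetime import datetime, timezone
texts = []
start_times = []

for segment in transcription['segments']:
    text = segment['text']
    start = segment['start']  # offset in seconds from the start of the audio

    # Format in UTC so the result does not shift with the machine's local time zone
    start_datetime = datetime.fromtimestamp(start, tz=timezone.utc).strftime('%H:%M:%S')
    texts.append(text)
    start_times.append(start_datetime)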

print(texts[:10])
print(start_times[:10])
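
The cell above formats each offset by passing it through `datetime.fromtimestamp`. An alternative that avoids treating an offset as a calendar date is plain integer arithmetic, which also keeps counting past 24 hours. This is a minimal sketch; `format_timestamp` and the sample segments are hypothetical and only mimic the two keys of Whisper's output used here.

```python
def format_timestamp(seconds: float) -> str:
    """Format an offset in seconds as HH:MM:SS, truncating fractional seconds."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Hypothetical segments shaped like Whisper's output (only the keys used here).
sample_segments = [
    {"start": 0.0, "text": " Hello."},
    {"start": 65.2, "text": " Second segment."},
    {"start": 3725.9, "text": " Third segment."},
]
sample_times = [format_timestamp(s["start"]) for s in sample_segments]
print(sample_times)  # ['00:00:00', '00:01:05', '01:02:05']
```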

Split the text into fixed-size chunks
Create the vector store

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.faiss import FAISS
# Note: `VectorDBQAWithSourcesChain` is deprecated and raises a UserWarning;
# LangChain suggests `from langchain.chains import RetrievalQAWithSourcesChain` instead.
from langchain.chains import VectorDBQAWithSourcesChain
from langchain_openai import OpenAI
import openai
import faiss

docs = []
metadatas = []
text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
for i, d in enumerate(texts):
  splits = text_splitter.split_text(d)
  docs.extend(splits)
  metadatas.extend([{"source": start_times[i]}]*len(splits))

embeddings = OpenAIEmbeddings()
store = FAISS.from_texts(docs, embeddings, metadatas=metadatas)
faiss.write_index(store.index, "docs.index")

chain = VectorDBQAWithSourcesChain.from_llm(llm=OpenAI(), vectorstore=store)
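
Whisper segments are usually short, so a 1500-character splitter will often leave each segment unchanged and every segment becomes its own document. If larger chunks are wanted, one possible approach is to merge consecutive segments up to a character budget and keep the start time of the first segment as the chunk's source. The sketch below assumes that approach; `merge_segments`, `max_chars`, and the demo data are hypothetical and not part of the original notebook.

```python
def merge_segments(texts, start_times, max_chars=200):
    """Group consecutive segments into chunks of at most max_chars characters.

    Each chunk keeps the start time of its first segment as its source.
    A single segment longer than max_chars becomes a chunk on its own.
    """
    chunks, sources = [], []
    current, current_start = "", None
    for text, start in zip(texts, start_times):
        # Close the current chunk if adding this segment would exceed the budget.
        if current and len(current) + len(text) > max_chars:
            chunks.append(current)
            sources.append(current_start)
            current, current_start = "", None
        if current_start is None:
            current_start = start
        current += text
    if current:
        chunks.append(current)
        sources.append(current_start)
    return chunks, sources


demo_texts = ["aaaa", "bbbb", "cccc", "dd"]
demo_times = ["00:00:00", "00:00:04", "00:00:09", "00:00:12"]
demo_chunks, demo_sources = merge_segments(demo_texts, demo_times, max_chars=8)
print(demo_chunks)   # ['aaaabbbb', 'ccccdd']
print(demo_sources)  # ['00:00:00', '00:00:09']
```

The merged chunks and their sources could then be passed to `FAISS.from_texts` in place of `docs` and `metadatas`, with each source wrapped as `{"source": ...}`.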

Run

In [None]:
# Question (in Japanese): "In what ways is Gemini Ultra better than GPT-4?"
result = chain.invoke({"question": "ジェミニウルトラはGPT4と比較して何が優れていますか？"})
print(result['answer'])
print(result['sources'])

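# `sources` is a comma-separated string of start times; skip anything that is
# empty or not a known start time so that `index` cannot raise ValueError.
for source in result['sources'].replace(' ', '').split(","):
  if source in start_times:
    print(texts[start_times.index(source)])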
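
Because each source is a start time in `HH:MM:SS` form, it can also be turned into a link that opens the video at that position. This is a sketch that assumes YouTube's `t` query parameter (an offset in seconds); `to_seconds` and `timestamp_link` are hypothetical helpers.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def to_seconds(hhmmss: str) -> int:
    """Convert an 'HH:MM:SS' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def timestamp_link(video_url: str, hhmmss: str) -> str:
    """Return video_url with a `t` parameter pointing at the given start time."""
    parts = urlparse(video_url)
    query = dict(parse_qsl(parts.query))
    query["t"] = f"{to_seconds(hhmmss)}s"
    return urlunparse(parts._replace(query=urlencode(query)))


link = timestamp_link("https://www.youtube.com/watch?v=gcOlzvwxVw8", "00:01:05")
print(link)  # https://www.youtube.com/watch?v=gcOlzvwxVw8&t=65s
```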