<a href="https://colab.research.google.com/github/tonkatsu7/learnLangChain/blob/main/maciekMorzywo%C5%82ek/Langchain_transcription_with_sources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Query YouTube video transcripts and return timestamps as sources to back up the answers, by [@m_morzywolek](https://twitter.com/m_morzywolek)

In [None]:
# First set runtime to GPU

In [1]:
pip install pytube # For audio downloading

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytube
  Downloading pytube-15.0.0-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-15.0.0


In [2]:
pip install git+https://github.com/openai/whisper.git -q # Whisper from OpenAI transcription model

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for openai-whisper (pyproject.toml) ... done


In [3]:
import whisper 
import pytube 

In [4]:
url = "https://www.youtube.com/watch?v=UO699Szp82M"
video = pytube.YouTube(url)
video.streams.get_highest_resolution().filesize

22935284

In [5]:
audio = video.streams.get_audio_only()
fn = audio.download(output_path="tmp.mp3") # Downloads only the audio stream; output_path is a directory, and fn is the path of the saved file

In [6]:
model = whisper.load_model("base")

100%|████████████████████████████████████████| 139M/139M [00:00<00:00, 176MiB/s]


In [7]:
transcription = model.transcribe(fn) # fn is the file path returned by audio.download above

In [8]:
transcription

{'text': " At the heart of the language model revolution and the chain framework lies the concept of a text embedding. A text embedding is a learned representation of text that takes the form of a vector of numbers. This vector allows us to efficiently prompt and retrieve context from vector storage to extract relevant pieces of information, enhance the language models memory and capabilities and ultimately take the action we want to take to generate value. In this video, we're going to look at this process by means of a real world practical application. We are going to use Langchain to extract information and value from Amazon Review Data. One of the most slam dunk applications of Langchain is custom experience and analytics. I'm going to show you how you can take the unstructured review data and map the reviews into themes and a structure that allows you to act on the data. I'm also going to demonstrate how the review embeddings can form the basis as inputs to other machine learning 

In [9]:
res = transcription['text']

In [10]:
print(res)

 At the heart of the language model revolution and the chain framework lies the concept of a text embedding. A text embedding is a learned representation of text that takes the form of a vector of numbers. This vector allows us to efficiently prompt and retrieve context from vector storage to extract relevant pieces of information, enhance the language models memory and capabilities and ultimately take the action we want to take to generate value. In this video, we're going to look at this process by means of a real world practical application. We are going to use Langchain to extract information and value from Amazon Review Data. One of the most slam dunk applications of Langchain is custom experience and analytics. I'm going to show you how you can take the unstructured review data and map the reviews into themes and a structure that allows you to act on the data. I'm also going to demonstrate how the review embeddings can form the basis as inputs to other machine learning models and

In [11]:
def store_segments(segments):
  texts = []
  start_times = []

  for segment in segments:
    text = segment['text']
    start = segment['start']  # offset from the start of the audio, in seconds

    # Format the offset as "HH:MM:SS" with plain arithmetic.
    # datetime.fromtimestamp would shift the result by the local timezone.
    hours, remainder = divmod(int(start), 3600)
    minutes, seconds = divmod(remainder, 60)
    formatted_start_time = f"{hours:02d}:{minutes:02d}:{seconds:02d}"

    texts.append(text)
    start_times.append(formatted_start_time)

  return texts, start_times
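
The timestamp conversion can be checked without running Whisper. The following is a minimal sketch, assuming each offset is a plain number of seconds from the start of the video. `format_offset` is a hypothetical helper, and the sample segments are invented; they mimic only the `start` and `text` keys read above.

```python
def format_offset(seconds):
    """Format an offset in seconds as HH:MM:SS (hypothetical helper)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

# Invented segments that mimic the keys read by store_segments
sample_segments = [
    {"start": 0.0, "text": " First sentence."},
    {"start": 26.4, "text": " Second sentence."},
    {"start": 3725.0, "text": " Much later."},
]

sample_times = [format_offset(s["start"]) for s in sample_segments]
print(sample_times)
```

Fractional seconds are truncated rather than rounded, so a segment starting at 26.4 s is labelled with second 26.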

In [12]:
segments = transcription['segments']

In [13]:
store_segments(segments)

([' At the heart of the language model revolution and the chain framework lies the concept of a text embedding.',
  ' A text embedding is a learned representation of text that takes the form of a vector of numbers.',
  ' This vector allows us to efficiently prompt and retrieve context from vector storage',
  ' to extract relevant pieces of information, enhance the language models memory and capabilities',
  ' and ultimately take the action we want to take to generate value.',
  " In this video, we're going to look at this process by means of a real world practical application.",
  ' We are going to use Langchain to extract information and value from Amazon Review Data.',
  ' One of the most slam dunk applications of Langchain is custom experience and analytics.',
  " I'm going to show you how you can take the unstructured review data and map the reviews into themes",
  ' and a structure that allows you to act on the data.',
  " I'm also going to demonstrate how the review embeddings ca

In [14]:
texts, start_times = store_segments(segments)

In [15]:
pip install langchain

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Downloading langchain-0.0.183-py3-none-any.whl (938 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting async-timeout<5.0.0,>=4.0.0 (from langchain)
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.5.7-py3-none-any.whl (25 kB)
Collecting openapi-schema-pydantic<2.0,>=1.2 (from langchain)
  Downloading openapi_schema_pydantic-1.2.4-py3-none-any.whl (90 kB)

In [16]:
pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.7-py3-none-any.whl (71 kB)
Installing collected packages: openai
Successfully installed openai-0.27.7


In [17]:
pip install --upgrade faiss-gpu #==1.7.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [18]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.faiss import FAISS
from langchain.chains import VectorDBQAWithSourcesChain
from langchain.llms import AzureOpenAI
import openai
import faiss

In [20]:
from getpass import getpass
OPENAI_API_KEY = getpass("OpenAI API Key: ")

OpenAI API Key: ··········


In [21]:
RG = input("Azure OpenAI resource name: ")  # the endpoint host is built from the resource name, not the resource group
OPENAI_API_BASE = 'https://' + RG + '.openai.azure.com'

Azure OpenAI resource name: silearnai


In [22]:
import os

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

os.environ['OPENAI_API_TYPE'] = 'azure'
# API version to use (Azure has several)
os.environ['OPENAI_API_VERSION'] = '2023-03-15-preview' #or '2022-12-01'
# base URL for your Azure OpenAI resource
os.environ['OPENAI_API_BASE'] = OPENAI_API_BASE

In [23]:
text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
docs = []
metadatas = []
for i, d in enumerate(texts):
    splits = text_splitter.split_text(d)
    docs.extend(splits)
    metadatas.extend([{"source": start_times[i]}] * len(splits))
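
The loop keeps `docs` and `metadatas` index-aligned, so every chunk carries the start time of the segment it came from. The following is a minimal sketch in plain Python; `split_text` is a hypothetical stand-in for the splitter, and the sample data is invented for illustration.

```python
def split_text(text, chunk_size=20):
    """Hypothetical stand-in for a text splitter: fixed-width slices."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

sample_texts = ["a short segment", "a noticeably longer segment of speech"]
sample_start_times = ["00:00:00", "00:00:07"]

sample_docs = []
sample_metadatas = []
for text, start in zip(sample_texts, sample_start_times):
    splits = split_text(text)
    sample_docs.extend(splits)
    # One fresh dict per chunk, so editing one entry cannot affect another
    sample_metadatas.extend({"source": start} for _ in splits)

for doc, meta in zip(sample_docs, sample_metadatas):
    print(meta["source"], repr(doc))
```

The `[{...}] * len(splits)` form repeats a reference to one dict object, which is harmless as long as the metadata is only read. Building a dict per chunk, as in the sketch, avoids surprises if the metadata is modified later.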

In [24]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    deployment='embedding01',
    model='text-embedding-ada-002',
    chunk_size=1  # send one text per embedding request
)

In [25]:
store = FAISS.from_texts(docs, embeddings, metadatas=metadatas)
faiss.write_index(store.index, "docs.index")

In [26]:
llm = AzureOpenAI(  # do not rebind `openai` here; that would shadow the imported module
    deployment_name="gpt301",
    model_name="text-davinci-003"
)

In [27]:
chain = VectorDBQAWithSourcesChain.from_llm(llm=llm, vectorstore=store)



In [28]:
result = chain({"question": "How old was Steve Jobs when started Apple?"})

In [29]:
print(f"Answer: {result['answer']}  Sources: {result['sources']}")

Answer:  I don't know.
  Sources: 00:00:26, 00:10:04, 00:08:22, 00:05:05
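
The run above returns source timestamps even though the answer is "I don't know", which can mislead a reader. One option is to drop the sources in that case. This is a minimal sketch: `format_result` is a hypothetical helper, the sample dictionaries are invented, and the check assumes the model signals a miss with the literal phrase "I don't know".

```python
def format_result(result):
    """Format a chain result, omitting sources when there is no real answer."""
    answer = result["answer"].strip()
    sources = result.get("sources", "").strip()
    # Assumption: a miss is signalled by the literal phrase "I don't know"
    if not sources or "i don't know" in answer.lower():
        return f"Answer: {answer}"
    return f"Answer: {answer}\nSources: {sources}"

# Invented results shaped like the chain output printed above
miss = {"answer": " I don't know.\n", "sources": "00:00:26, 00:10:04"}
hit = {"answer": " Custom experience and analytics.", "sources": "00:00:36"}

print(format_result(miss))
print(format_result(hit))
```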


In [30]:
result = chain({"question": "What are the use cases?"})

In [31]:
print(f"Answer: {result['answer']}  Sources: {result['sources']}")

Answer:  The use cases of Langchain are custom experience and analytics. 
  Sources: 00:00:36


In [32]:
result = chain({"question": "What are possible machine learning models?"})
print(f"Answer: {result['answer']}  Sources: {result['sources']}")

Answer:  Possible machine learning models include a model with embedding vectors as features and overall rating as a target, propensity models, uplift models, and a random forest machine learning model.
  Sources: 00:04:30, 00:04:36, 00:06:02
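
Since each source is a start time, it can be turned into a link that opens the video at that moment. This is a minimal sketch, assuming the sources arrive as a comma-separated string of `HH:MM:SS` values as printed above, and that the video URL already has a query string so the `t` parameter can be appended with `&`. `timestamp_to_seconds` and `source_links` are hypothetical helpers.

```python
def timestamp_to_seconds(timestamp):
    """Convert "HH:MM:SS" to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def source_links(video_url, sources):
    """Turn a comma-separated sources string into timestamped video links."""
    links = []
    for timestamp in sources.split(","):
        timestamp = timestamp.strip()
        if timestamp:  # skip empty entries, e.g. from an empty sources string
            links.append(f"{video_url}&t={timestamp_to_seconds(timestamp)}s")
    return links

video_url = "https://www.youtube.com/watch?v=UO699Szp82M"
links = source_links(video_url, "00:04:30, 00:04:36, 00:06:02")
for link in links:
    print(link)
```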
