# Retrieve content from a YouTube video and summarize

<a target="_blank" href="https://colab.research.google.com/github/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/blob/liors_branch/Chapter9_notebooks/Ch9_Retrieve_Content_from_a_YouTube_Video_and_Summarize.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**The purpose of this notebook:**  
Pick a YouTube video and get an editable summary of it without spending the time to watch it.  
In this notebook we pick a popular TED Talk, summarize it, translate the summary into Russian and German, and reformat the German version as bullet points.  

**Requirements:**  
* When running in Colab, use this runtime notebook setting: `Python 3, CPU`  
* This notebook uses OpenAI's API as its LLM, so a paid **API key** is required.   

Install:

In [1]:
# REMARK:
# If the code below errors out due to a Python package discrepancy, newer package versions may be the cause.
# In that case, set "default_installations" to False to install the original package versions from the requirements file:
default_installations = True
if default_installations:
    !pip -q install --upgrade embedchain
    !pip -q install pytube
    !pip -q install openai
    !pip -q install youtube-transcript-api
else:
    import requests
    text_file_path = "youtube_summarize.txt"
    url = "https://raw.githubusercontent.com/python-devops-sre/nlp/master/requirements/" + text_file_path
    res = requests.get(url)
    with open(text_file_path, "w") as f:
        f.write(res.text)

    !pip install -r youtube_summarize.txt

Imports:

In [2]:
import os
import textwrap
import pandas as pd
import json

from embedchain import App
# from embedchain.config import ChromaDbConfig

### Code Settings

Define OpenAI's API key:  
**You must provide a key and paste it as a string!**  

In [3]:
os.environ["OPENAI_API_KEY"] = "..."
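
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming you would rather be prompted for the key at run time, the hypothetical helper `ensure_api_key` below (not part of the original notebook) reuses a key that is already in the environment and only asks for one when it is missing or still set to the `"..."` placeholder:

```python
import os
from getpass import getpass


def ensure_api_key(env_var="OPENAI_API_KEY", prompt=getpass):
    """Return the API key stored in env_var, prompting for it only if it is missing.

    `ensure_api_key` is a hypothetical helper for illustration; `prompt` is
    injectable so the function can be exercised without user interaction.
    """
    key = os.environ.get(env_var, "").strip()
    # Treat the notebook's "..." placeholder as "not set".
    if not key or key == "...":
        key = prompt(f"Enter {env_var}: ").strip()
        if not key:
            raise ValueError(f"{env_var} is required to call the OpenAI API")
        os.environ[env_var] = key
    return key
```

Calling `ensure_api_key()` in place of the assignment above would leave `OPENAI_API_KEY` set for the rest of the notebook without the key appearing in the saved file.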

Configure the LLM used for prompting and the model used for embedding:

In [4]:
models_config = {
    "llm": {
        "provider": "openai",
        "config": {
            "model": "gpt-3.5-turbo",
            "temperature": 0.5,
            "max_tokens": 1000,
            "top_p": 1,
            "stream": False
        }
    },
    "embedder": {
        "provider": "openai",
        "config": {
            "model": "text-embedding-ada-002"
        }
    }
}
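
If you want to experiment with different generation settings, it can help to derive variants from one base configuration instead of editing the dictionary in place. The sketch below is an illustration only: `with_llm_overrides` is a hypothetical helper, and `base_config` is a trimmed copy of the structure used above. A lower `temperature` tends to make summaries more repeatable between runs.

```python
import copy


def with_llm_overrides(config, **overrides):
    """Return a deep copy of config with selected llm settings replaced.

    Hypothetical helper for illustration; the original dictionary is left untouched.
    """
    updated = copy.deepcopy(config)
    updated["llm"]["config"].update(overrides)
    return updated


# Trimmed copy of the notebook's configuration, repeated here so the sketch is self-contained.
base_config = {
    "llm": {
        "provider": "openai",
        "config": {"model": "gpt-3.5-turbo", "temperature": 0.5},
    },
    "embedder": {
        "provider": "openai",
        "config": {"model": "text-embedding-ada-002"},
    },
}

# Variant with temperature 0 for more repeatable output.
deterministic_config = with_llm_overrides(base_config, temperature=0.0)
```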

#### Pick the YouTube Video and Insert Its URL

In [5]:
video_url = "https://www.youtube.com/watch?v=8KkKuTCFvzI&ab_channel=TED"
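
URLs copied from the browser often carry extra query parameters, such as `ab_channel` above. As a small sanity check before indexing, the sketch below pulls the video ID out of a URL using only the standard library. `extract_video_id` is a hypothetical helper, and it assumes the common `watch?v=...` and `youtu.be/...` URL shapes; other shapes return `None`.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url):
    """Return the video ID from a YouTube URL, or None if no ID is found.

    Hypothetical helper; handles only watch?v=<id> and youtu.be/<id> URLs.
    """
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/") or None
    ids = parse_qs(parsed.query).get("v")
    return ids[0] if ids else None


example_url = "https://www.youtube.com/watch?v=8KkKuTCFvzI&ab_channel=TED"
print(extract_video_id(example_url))  # 8KkKuTCFvzI
```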

### Set Up the Retrieval Mechanism

In [6]:
# from_config is a classmethod, so there is no need to instantiate a default App first.
lecture_RAG = App.from_config(config=models_config)
lecture_RAG.reset()
lecture_RAG.add(data_type="youtube_video", source=video_url)

Inserting batches in chromadb:   0%|          | 0/1 [00:00<?, ?it/s]


'6d9ce5a14285fef40a8afb5268a273ef'
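
To build intuition for what a retrieval step does, here is a toy, self-contained sketch: the text is split into overlapping chunks, each chunk is turned into a vector, and the chunks most similar to the question are returned as context. This is not embedchain's implementation. It substitutes word-count vectors for the OpenAI embeddings, and every function name and parameter here is hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into overlapping windows of chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, stop, step)]


def bag_of_words(text):
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_k=1):
    """Return the top_k chunks ranked by similarity to the question."""
    query_vector = bag_of_words(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, bag_of_words(chunk)),
        reverse=True,
    )
    return ranked[:top_k]
```

In a real pipeline the retrieved chunks are placed into the prompt alongside the question, so the LLM answers from the video's transcript rather than from memory alone.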

### Observe the raw document
In our example we gave the RAG only a single document to use as context.  
Let's observe the first 1000 characters.  

In [None]:
lecture_RAG.db.get()['documents'][0][:1000]
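
When more than one chunk is stored, a quick overview of every chunk can be more useful than a slice of the first one. The sketch below assumes only what the cell above relies on: that the store is a dictionary with a `"documents"` list of strings. `preview_documents` is a hypothetical helper, and `sample_store` is made-up data standing in for `lecture_RAG.db.get()`.

```python
def preview_documents(store, max_chars=1000):
    """Summarize each stored chunk: its index, full length, and a truncated preview.

    Hypothetical helper; `store` is assumed to be a dict with a "documents" list.
    """
    documents = store.get("documents") or []
    return [
        {"index": i, "length": len(doc), "preview": doc[:max_chars]}
        for i, doc in enumerate(documents)
    ]


# Made-up stand-in for the dictionary returned by lecture_RAG.db.get().
sample_store = {"documents": ["first chunk of the transcript", "second chunk"]}
for row in preview_documents(sample_store, max_chars=11):
    print(row["index"], row["length"], repr(row["preview"]))
```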

## Review, summarize, and translate

In [None]:
original_answer = lecture_RAG.query("""Please review the entire content, summarize it to the length of 4 sentences, then translate it to Russian and to German.
Make sure the summary is consistent with the content.
Put the string '\n----\n' between the English part of the answer and the Russian part.
Put the string '\n****\n' between the Russian part of the answer and the German part.""")

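# Wrap the answer to 50 columns, then turn the separator markers into language headings.
wrapped_answer = textwrap.fill(original_answer, width=50, replace_whitespace=True)
print(wrapped_answer.replace("\\n ", "\n\n").replace("----", "\n\nRussian:\n").replace("****", "\n\nGerman:\n"))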
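
The separators requested in the prompt also make it possible to split the answer into one string per language, for example to pass only the German part to a follow-up prompt. This is a minimal sketch assuming the model followed the instructions and emitted `----` and `****` exactly once each. `split_translations` is a hypothetical helper, and `sample_answer` is made-up text, not model output.

```python
def split_translations(answer):
    """Split an answer on the '----' and '****' separators into per-language parts.

    Hypothetical helper; if a separator is missing, the later parts are empty strings.
    """
    english, _, rest = answer.partition("----")
    russian, _, german = rest.partition("****")
    return {
        "English": english.strip(),
        "Russian": russian.strip(),
        "German": german.strip(),
    }


# Made-up example text in the format the prompt asks for.
sample_answer = "Hello.\n----\nПривет.\n****\nHallo."
parts = split_translations(sample_answer)
print(parts["German"])  # Hallo.
```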

In [None]:
print(lecture_RAG.query(f"This is the response from the previous prompt: <{original_answer}> Now take the German response and edit it into 3-5 bullet points. Provide just the German bullet points."))