<a href="https://colab.research.google.com/github/soultanyousif/youtube-video-summarization-and-langchain-rag-for-Q-A-/blob/main/summarization_and_langchain_rag_main_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **YouTube Video Summarization & Q&A with English/Arabic Support**


# **Libraries**

In [52]:

!pip install pytube yt-dlp
!pip install youtube-transcript-api
!pip install openai-whisper
!pip install langdetect langid
!pip install transformers sentencepiece accelerate
!pip install torch torchvision torchaudio
!pip install langchain langchain-community
!pip install faiss-cpu
!pip install streamlit




In [53]:
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound
from urllib.parse import urlparse, parse_qs
import yt_dlp
import whisper
from langdetect import detect
from langchain_core.documents import Document
from transformers import pipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
import os
import math
import warnings

warnings.filterwarnings("ignore")


# **Get the Transcript**

In [55]:


def extract_video_id(url: str) -> str:
    parsed = urlparse(url)
    qs = parse_qs(parsed.query)
    video_ids = qs.get('v')
    if not video_ids:
        raise ValueError(f"No video id found in URL: {url}")
    return video_ids[0]

def extract_transcript(url):
    """
    Extract transcript from YouTube video.
    Uses YouTubeTranscriptApi only (no Whisper).
    Returns full_text and segments (with start, duration, text).
    """
    video_id = extract_video_id(url)
    api = YouTubeTranscriptApi()

    try:
        # Try English first; fall back to Arabic if no English transcript exists
        try:
            fetched = api.fetch(video_id, languages=['en'])
        except NoTranscriptFound:
            fetched = api.fetch(video_id, languages=['ar'])

        full_text = "\n".join(snippet.text for snippet in fetched)
        segments = [
            {"start": snippet.start, "duration": snippet.duration, "text": snippet.text}
            for snippet in fetched
        ]
        print("Transcript fetched via YouTubeTranscriptApi")
        return full_text, segments

    except (TranscriptsDisabled, NoTranscriptFound):
        raise RuntimeError("Transcript not available for this video.")
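`extract_video_id` above only reads the `v` query parameter, so short links such as `https://youtu.be/<id>` raise `ValueError`. The sketch below is one possible extension, not part of the original notebook: the name `extract_video_id_any` and the set of supported path prefixes (`shorts`, `embed`, `live`) are assumptions chosen for illustration.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id_any(url: str) -> str:
    """Return the video id from watch, youtu.be, shorts, embed, or live URLs.

    Hypothetical helper: the supported URL shapes are an assumption.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path_parts = [p for p in parsed.path.split("/") if p]

    # Short links: https://youtu.be/<id>
    if host == "youtu.be" and path_parts:
        return path_parts[0]

    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Standard links: https://www.youtube.com/watch?v=<id>
        ids = parse_qs(parsed.query).get("v")
        if ids:
            return ids[0]
        # Path-based links: /shorts/<id>, /embed/<id>, /live/<id>
        if len(path_parts) >= 2 and path_parts[0] in {"shorts", "embed", "live"}:
            return path_parts[1]

    raise ValueError(f"No video id found in URL: {url}")
```

Extra query parameters such as `list` or `index` are ignored, so the playlist-style URL used later in this notebook still resolves to its video id.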




# **Automatic Language Detection**

In [56]:
def detect_lang(text):
    try:
        lang = detect(text)
        return "ar" if lang.startswith("ar") else "en"
    except Exception:  # langdetect raises on empty or undetectable text
        return "en"
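Since the pipeline only needs to tell Arabic from English, a script-based check can serve as a cross-check or fallback for `detect_lang`. The sketch below is an assumption-laden alternative, not the notebook's method: `arabic_ratio`, `detect_lang_by_script`, and the 0.5 threshold are hypothetical, and the check cannot distinguish Arabic from other languages written in Arabic script (for example Persian or Urdu).

```python
def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the main Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def detect_lang_by_script(text: str, threshold: float = 0.5) -> str:
    """Return "ar" when at least `threshold` of the letters are Arabic, else "en".

    The threshold is an illustrative assumption, not a tuned value.
    """
    return "ar" if arabic_ratio(text) >= threshold else "en"
```

Unlike `langdetect`, this check is deterministic and never raises, at the cost of only separating scripts rather than languages.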


# **Chunking**

In [57]:

def chunk_transcript(segments, chunk_size=500, chunk_overlap=50):
    """
    Chunks transcript segments into LangChain Document objects with metadata.
    """
    documents = []
    current_chunk = ""
    current_start_time = 0

    for i, seg in enumerate(segments):

        if not current_chunk:
            current_start_time = seg['start']


        segment_text = seg['text'].strip()
        if current_chunk:
            current_chunk += " " + segment_text
        else:
            current_chunk = segment_text


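        # Emit a document once the running chunk reaches the target size.
        if len(current_chunk) >= chunk_size:
            documents.append(
                Document(
                    page_content=current_chunk,
                    metadata={'start': current_start_time}
                )
            )

            # Seed the next chunk with trailing segments (walking backwards
            # from the current one) until at least chunk_overlap characters
            # are collected.
            overlap_text = ""
            temp_len = 0

            for j in range(i, -1, -1):
                overlap_segment_text = segments[j]['text'].strip()

                if overlap_text:
                    overlap_text = overlap_segment_text + " " + overlap_text
                else:
                    overlap_text = overlap_segment_text

                temp_len += len(overlap_segment_text) + 1  # +1 for the joining space
                if temp_len >= chunk_overlap:
                    current_chunk = overlap_text
                    current_start_time = segments[j]['start']
                    break
            else:
                # Not enough text for an overlap: start the next chunk empty.
                current_chunk = ""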

    if current_chunk:
        documents.append(
            Document(
                page_content=current_chunk,
                metadata={'start': current_start_time}
            )
        )

    return documents
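To see the overlap behaviour without installing LangChain, the same idea can be sketched with plain dictionaries. This is a simplified variant rather than a copy of `chunk_transcript`: the name `chunk_segments` and the `fresh` counter are assumptions, and unlike the function above it skips a final chunk that would contain nothing but carried-over overlap.

```python
def chunk_segments(segments, chunk_size=500, chunk_overlap=50):
    """Group timed segments into chunks of roughly chunk_size characters.

    Each chunk is a dict with 'text' and 'start'. The trailing segments of a
    chunk (about chunk_overlap characters) are repeated at the start of the
    next chunk.
    """
    chunks = []
    current = []  # segments in the chunk being built
    fresh = 0     # segments in `current` that are not carried-over overlap

    for seg in segments:
        current.append(seg)
        fresh += 1
        text = " ".join(s["text"].strip() for s in current)
        if len(text) < chunk_size:
            continue

        chunks.append({"text": text, "start": current[0]["start"]})

        # Carry the tail of this chunk into the next one as overlap.
        tail, tail_len = [], 0
        for s in reversed(current):
            if tail_len >= chunk_overlap:
                break
            tail.insert(0, s)
            tail_len += len(s["text"].strip()) + 1  # +1 for the joining space
        current, fresh = tail, 0

    # Emit the remainder only if it holds something beyond the overlap.
    if current and fresh > 0:
        chunks.append({
            "text": " ".join(s["text"].strip() for s in current),
            "start": current[0]["start"],
        })
    return chunks


# Toy transcript: six 10-character segments, one second apart.
segments = [{"start": float(i), "text": f"seg{i:02d}xxxxx"} for i in range(6)]
chunks = chunk_segments(segments, chunk_size=25, chunk_overlap=10)
for chunk in chunks:
    print(chunk["start"], chunk["text"])
```

Each chunk after the first begins with the last segment of the previous one, and its `start` value is the timestamp of that repeated segment, which is what lets the source listing later point back into the video.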


# **Summarization**

In [58]:
en_summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
ar_summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")

Device set to use cpu
Device set to use cpu


In [59]:

def summarise_text(text, lang="en"):
    text = text.strip()
    if not text:
        return ""


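    # Keep only the first 900 words so the input stays close to the model's length limit
    words = text.split()
    text = " ".join(words[:900]) if len(words) > 900 else text

    # truncation=True makes the tokenizer cut inputs that still exceed the model's token limit
    summarizer = en_summarizer if lang == "en" else ar_summarizer
    summary = summarizer(text, max_length=200, min_length=50, do_sample=False, truncation=True)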

    return summary[0]['summary_text']
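Because `summarise_text` keeps only the first 900 words, anything later in a long video never reaches the summarizer. One common workaround is to summarise the transcript window by window and then summarise the joined partial summaries. The sketch below illustrates that idea under stated assumptions: `split_words` and `summarise_long` are hypothetical helpers, and `summarise_fn` stands for any callable that maps text to a summary string.

```python
def split_words(text, window=900):
    """Split text into consecutive windows of at most `window` words."""
    words = text.split()
    return [" ".join(words[i:i + window]) for i in range(0, len(words), window)]


def summarise_long(text, summarise_fn, window=900):
    """Summarise each window, then summarise the joined partial summaries.

    `summarise_fn` is an assumed callable: str -> str.
    """
    parts = split_words(text, window)
    if not parts:
        return ""
    if len(parts) == 1:
        return summarise_fn(parts[0])
    partial = [summarise_fn(part) for part in parts]
    return summarise_fn(" ".join(partial))
```

In this notebook it could be called as `summarise_long(full_text, lambda t: summarise_text(t, lang))`. For very long videos the joined partial summaries may themselves exceed the window, in which case the final pass is truncated again.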

# **Building the RAG**

In [60]:

def build_rag(docs, lang="en"):
    """
    Build a LangChain RetrievalQA (RAG) object from a list of Documents.
    """
    if not docs:
        raise ValueError("No documents provided for RAG!")


    if lang == "en":
        embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
    else:
        embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")


    db = FAISS.from_documents(docs, embeddings)


    if lang == "en":
        hf_pipe = pipeline(
            "text2text-generation",
            model="google/flan-t5-base",
            max_length=512,
        )
    else:
        hf_pipe = pipeline(
            "text2text-generation",
            model="UBC-NLP/AraT5-base-title-generation",  # Arabic T5 checkpoint tuned for title generation
            max_length=512,
        )

    llm = HuggingFacePipeline(pipeline=hf_pipe)

    # Custom prompts
    if lang == "en":
        prompt_template = """Use the following pieces of context to answer the question at the end.
        If you don't know the answer, just say that you don't know, don't try to make up an answer.

        Context: {context}
        Question: {question}

        Helpful Answer:"""
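    else:
        # Arabic prompt: "You are a smart assistant. Use the following text
        # to answer the question in Arabic." (fields: Text / Question / Answer)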
        prompt_template = """أنت مساعد ذكي.
        استخدم النص التالي للإجابة على السؤال باللغة العربية.

        النص: {context}
        السؤال: {question}

        الإجابة:"""

    PROMPT = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )
    chain_type_kwargs = {"prompt": PROMPT}

    # Build the RetrievalQA chain
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=db.as_retriever(search_type="similarity", search_kwargs={"k": 3}),
        return_source_documents=True,
        chain_type_kwargs=chain_type_kwargs
    )
    return qa
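The retriever configured above returns the `k = 3` chunks whose embeddings lie closest to the question's embedding. The toy sketch below shows that ranking step on hand-written 2-D vectors. It is an illustration only: `top_k_by_distance` is a hypothetical helper, the vectors are made up, and it assumes a Euclidean-distance index, which is believed to be the default for LangChain's FAISS wrapper.

```python
import math


def top_k_by_distance(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors closest to query_vec (Euclidean)."""
    distances = [
        (math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)
    ]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Made-up 2-D "embeddings" standing in for real sentence embeddings.
doc_vecs = [(0.0, 1.0), (1.0, 0.0), (0.9, 0.1), (-1.0, 0.0)]
query = (1.0, 0.0)
nearest = top_k_by_distance(query, doc_vecs, k=2)
print(nearest)
```

With the "stuff" chain type, the text of the retrieved chunks is concatenated into the `{context}` slot of the prompt, so `k` and `chunk_size` together determine how much text the generator model has to read.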


# Timestamp formatting
def format_time(seconds):
    """Converts seconds to MM:SS format."""
    minutes = math.floor(seconds / 60)
    seconds = math.floor(seconds % 60)
    return f"{minutes:02d}:{seconds:02d}"
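For videos longer than an hour, `format_time` keeps counting minutes past 59 (for example `75:30`). The sketch below is a possible variant that switches to `H:MM:SS` and also builds a link that opens the video at the cited moment. Both `format_timestamp` and `timestamp_url` are hypothetical names, not part of the original notebook.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS, or H:MM:SS once the value reaches one hour."""
    total = int(seconds)  # drop the fractional part
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def timestamp_url(video_id: str, seconds: float) -> str:
    """Build a watch URL that starts playback at the given offset."""
    return f"https://www.youtube.com/watch?v={video_id}&t={int(seconds)}s"
```

Printing `timestamp_url(video_id, doc.metadata['start'])` next to each source would let a reader jump straight to the passage an answer was drawn from.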

In [61]:
def run_full_pipeline(video_url, question):
    print(f"Processing video: {video_url}")
    # 1. Extract Transcript
    full_text, segments = extract_transcript(video_url)

    if not segments:
        print("Could not retrieve transcript segments. Aborting.")
        return

    # 2. Detect Language
    lang = detect_lang(full_text)
    print(f"Detected language: {lang}")

    # 3. Summarise
    print("\n--- Summary ---")
    summary = summarise_text(full_text, lang)
    print(summary)

    # 4. Chunk Transcript with Timestamps
    docs = chunk_transcript(segments)
    print(f"\nTranscript chunked into {len(docs)} documents.")

    # 5. Build RAG
    print("Building RAG system...")
    qa_chain = build_rag(docs, lang)

    # 6. Ask Question
    print("\n--- Question & Answer ---")
    print(f"Question: {question}")
    result = qa_chain.invoke({"query": question})

    print("\nAnswer:")
    print(result['result'])

    # 7. Display Sources with Timestamps
    print("\n--- Sources ---")
    for doc in result['source_documents']:
        start_time = doc.metadata.get('start', 0)
        formatted_time = format_time(start_time)
        print(f"Timestamp: {formatted_time}")
        print(f"Text: {doc.page_content}\n")

In [64]:
# English
YOUTUBE_URL = "https://www.youtube.com/watch?v=wiNXzydta4c&list=PLkDaE6sCZn6FNC6YRfRQc_FbeQrF8BwGI&index=2"
QUESTION = "what is the topic of this video?"

if __name__ == '__main__':
    run_full_pipeline(YOUTUBE_URL, QUESTION)




Processing video: https://www.youtube.com/watch?v=wiNXzydta4c&list=PLkDaE6sCZn6FNC6YRfRQc_FbeQrF8BwGI&index=2
Transcript fetched via YouTubeTranscriptApi
Detected language: en

--- Summary ---
machine learning had grown up as a subfield of AI or artificial intelligence we wanted to build intelligent machines. in this class you learn about the state-of-the-art and also practice implementing machine learning algorithms. You also learn all the important practical tips and tricks for making them perform well and you get to implement them and see how they work for you.

Transcript chunked into 9 documents.
Building RAG system...


Device set to use cpu



--- Question & Answer ---
Question: what is the topic of this video?

Answer:
machine learning algorithms

--- Sources ---
Timestamp: 00:01
Text: in this class you learn about the state-of-the-art and also practice implementing machine learning algorithms yourself you learn about the most important machine learning algorithms some of which are exactly what's being used in large AI or large tech companies today and you get a sense of what is the state of the art in AI Beyond learning the algorithms though in this class you also learn all the important practical tips and tricks for making them perform well and you get to implement them and see how they work for

Timestamp: 03:56
Text: guarantee that you find mastering these skills worthwhile in the next video we'll look at a more formal definition of what is machine learning and we'll begin to talk about the main types of machine learning problems and algorithms you pick up some of the main machine learning terminology and start to get 

In [65]:
# Arabic

YOUTUBE_URL = "https://www.youtube.com/watch?v=gj03qR69mHw"
QUESTION = "ما هو الموضوع المقطع؟ "

if __name__ == '__main__':
    run_full_pipeline(YOUTUBE_URL, QUESTION)


Processing video: https://www.youtube.com/watch?v=gj03qR69mHw
Transcript fetched via YouTubeTranscriptApi
Detected language: ar

--- Summary ---
كيف نعرف القران الكريم؟ سؤال الذي يطرحه كثيرون على موقع تويتر حول الطريقة التي ينبغي استخدامها لتعليم اللغة العربية وكيف نتعامل معها

Transcript chunked into 41 documents.
Building RAG system...


Device set to use cpu



--- Question & Answer ---
Question: ما هو الموضوع المقطع؟ 

Answer:
كل ما تريد معرفته عن (الحروف) (فيديو)

--- Sources ---
Timestamp: 19:03
Text: يبقى اقول مبتدا مرفوع حلو مبتدا مرفوع بالايه هنا بقى مبتدا مرفوع بالالف طب ليه قلنا بالالف بالالف عشان خاطر هي مثنى انت هنا بتكلم واحد ولا بتكلم اثنين لا المعلمان هنا اثنين فبالتالي هنا مثنى فاقول مرفوع بالالف وكمان انت ما بتقولش مرفوعه بالالف وخلاص هي فعلا الالف بتبقى موجوده اللي هي الالف اللي هي قبل النون طب تعال في محبوبان المعلمان مال هم محبوبان ف محبوبان خبر خبر مرفوع وعلامه رفعه الالف ليه قلنا الف عشان مثنى وكمان الالف ونون موجوده امتى اقول واو ونون بقى تقول واو ونون في حاله الجمع المذكر السالم زي المعلمون مجتهدون المعلمون نفس

Timestamp: 16:48
Text: صعد مدام فعل يبقى دي جمله فعليه اصلا وكمان سؤال كيف يعني اجي اقول لك كيف صعد المؤذن على المنبر مبتسما حاله مبتسما اللي هو كان بيبتسم وهو صاعد على المنبر فمبتعد حال واجي اقول لك جاء الرجل سعيدا اهي سعيدا هنا هترب برده حال ليه حال عشان خاطر دي جمله جمله فعليه الحال بيكون جملته جمله فعليه ا ج