<a href="https://colab.research.google.com/github/ying2sun/Capstone-Project-Salifort-Motors/blob/main/Youtube_GenAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 1: Load Dataset



1. Install Required Packages

In [1]:
!pip install sentence-transformers faiss-cpu datasets transformers tqdm nltk pandas




2. Import Libraries

In [2]:
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import numpy as np
import nltk
from tqdm import tqdm
tqdm.pandas()


3. Download NLTK Punkt Tokenizer

In [3]:
import nltk
nltk.download('punkt')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
nltk.download('all')


[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_r

True

4. Define Utility Functions

In [5]:
def concat_sentences(group):
    # Concatenate all sentences for one video
    return ' '.join(group['text'])

def get_first_n_sentences(text, n):
    # Get the first n sentences from a transcript string
    sentences = nltk.sent_tokenize(text)
    sentences = ' '.join(sentences[:n])
    # Hard cap: if the excerpt still exceeds 1000 words, keep only the first 300
    words = sentences.split()
    if len(words) > 1000:
        sentences = ' '.join(words[:300])
    return sentences

def get_summary(row, pipe, first_n_sent=20):
    # Generate a summary from the transcript
    subset = get_first_n_sentences(row['full_transcript'], first_n_sent)
    result = pipe(subset, max_length=50, min_length=10, do_sample=False, truncation=True)[0]['summary_text']
    return {"summary": result}

def get_embeddings(row, model, first_n_sent=10):
    # Get sentence embedding for transcript and summary
    transcript = get_first_n_sentences(row['full_transcript'], first_n_sent)
    summary = row['summary']
    transcript_embedding = model.encode(transcript)
    summary_embedding = model.encode(summary)
    return {
        'transcript_embedding': transcript_embedding,
        'summary_embedding': summary_embedding
    }
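
To make the truncation rule in `get_first_n_sentences` concrete, here is a minimal standalone sketch. It assumes a naive regex splitter (`split_sentences`, a hypothetical stand-in for `nltk.sent_tokenize` so the example runs without NLTK data) and exposes the two thresholds as hypothetical parameters `max_words` and `keep_words`.

```python
import re

def split_sentences(text):
    # Naive stand-in for nltk.sent_tokenize (assumption): split after ., ! or ?
    # when followed by whitespace. Real transcripts need a proper tokenizer.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def first_n_sentences(text, n, max_words=1000, keep_words=300):
    # Same rule as get_first_n_sentences: keep n sentences, then hard-cap
    # excerpts longer than max_words down to keep_words words.
    joined = ' '.join(split_sentences(text)[:n])
    words = joined.split()
    if len(words) > max_words:
        joined = ' '.join(words[:keep_words])
    return joined

sample = "Good morning! We are up to day 14. Today we stretch. Then we rest."
excerpt = first_n_sentences(sample, 2)
print(excerpt)

# A transcript without punctuation counts as a single sentence, so only the word cap applies.
unpunctuated = "word " * 1200
capped = first_n_sentences(unpunctuated, 1)
print(len(capped.split()))
```

The second call shows why the cap matters: auto-generated transcripts often lack punctuation, so "the first n sentences" can be the whole video.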

5. Load and Process Dataset

In [6]:
def load_data(path='jamescalam/youtube-transcriptions', split='train', n_rows=10, summarize_first_n_sents=20, video_ids=None):
    # Load dataset from HuggingFace.
    # `summarize_first_n_sents` is accepted so existing calls keep working, but it is not used here.
    data = load_dataset(path, split=split)
    df = pd.DataFrame(data)
    group_cols = ['title', 'published', 'url', 'video_id', 'channel_id']
    # Group by video, concatenate transcript
    results = []
    for name, group in tqdm(df.groupby(group_cols)):
        full_trans = concat_sentences(group)
        row = {col: name[idx] for idx, col in enumerate(group_cols)}
        row['full_transcript'] = full_trans
        results.append(row)
    df_agg = pd.DataFrame(results)
    # Filter by video ID first (if requested), then cap the number of rows
    if video_ids:
        df_agg = df_agg[df_agg['video_id'].isin(video_ids)]
    if n_rows > 0:
        df_agg = df_agg.head(n_rows)
    # dropna returns a new frame, so later column assignments do not touch a slice
    return df_agg.dropna(subset=['full_transcript'])


6. Main Processing Block

In [7]:
# Load data
df = load_data(n_rows=10, summarize_first_n_sents=20)

# Create summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Generate summaries
df['summary'] = df.apply(lambda row: get_summary(row, summarizer, first_n_sent=20)['summary'], axis=1)

# Generate embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
df['transcript_embedding'] = df.apply(lambda row: model.encode(get_first_n_sentences(row['full_transcript'], 10)), axis=1)
df['summary_embedding'] = df.apply(lambda row: model.encode(row['summary']), axis=1)

# Now df contains all fields we need for downstream RAG/retrieval/analysis
df.head()


100%|██████████| 701/701 [00:00<00:00, 1882.96it/s]
Device set to use cpu
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


| | title | published | url | video_id | channel_id | full_transcript | summary | transcript_embedding | summary_embedding |
|---|---|---|---|---|---|---|---|---|---|
| 0 | $5 MILLION AI for FREE | 2022-08-12 15:18:07 | https://youtu.be/3EjtHs_lXnk | 3EjtHs_lXnk | UCfzlCWGWYyIQ0aLC5w48gBQ | Imagine an AI where all in the same model you ... | A group of over a thousand researchers has bee... | [-0.07433385, -0.13800655, -0.04107823, 0.0329... | [-0.040884223, -0.10034384, -0.05008944, 0.048... |
| 1 | 1 week out from SPARTAN Race + How to bake a s... | 2018-04-26 07:36:40 | https://youtu.be/04HX2zgQNXE | 04HX2zgQNXE | UCr8O8l5cCX85Oem1d18EezQ | What's going on guys welcome back to another S... | Spartan trainer is back with a grip and stabil... | [-0.062081713, -0.05884585, 0.0019795096, -0.0... | [-0.08060527, -0.04789059, -0.0043220487, -0.0... |
| 2 | 10-minute Bodyweight Back & Shoulders Bedr... | 2020-03-25 08:59:18 | https://youtu.be/SBLp0Z4pJko | SBLp0Z4pJko | UCr8O8l5cCX85Oem1d18EezQ | You know in the coming months, leadership at e... | Leadership at every level is going to be like ... | [-0.003295755, 0.05800058, 0.011247904, -0.038... | [-0.09108453, -0.034012754, 0.0183743, -0.0607... |
| 3 | 10-minute Lower Body Bodyweight Workout for Be... | 2020-04-03 07:09:41 | https://youtu.be/HtSeYhm1e7A | HtSeYhm1e7A | UCr8O8l5cCX85Oem1d18EezQ | Oh, I'm building this hat. Look how good it is... | Day 18 of streamlit.io's 10-minute workout ser... | [-0.011118657, -0.014251923, 0.0387789, 0.0440... | [-0.0782866, 0.054833915, 0.013783748, -0.0164... |
| 4 | 10-minute Morning Wake Up Stretching Routine f... | 2020-03-30 08:47:10 | https://youtu.be/wm5QkbQ5LFI | wm5QkbQ5LFI | UCr8O8l5cCX85Oem1d18EezQ | Good morning! We are up to day 14 of Reps For ... | We are up to day 14 of Reps For Rona. Today on... | [-0.0476916, -0.0415211, 0.030855495, 0.031183... | [-0.0937326, -0.009859586, 0.01704017, 0.01686... |


# Week 2: Integrate the retrieved context into the LLM prompt to generate context-aware summaries.

Step 1: Chunk transcript

In [8]:
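def chunk_text(text, chunk_size=100, overlap=20):
    # Split text into chunks of `chunk_size` words; consecutive chunks share `overlap` words
    if not 0 <= overlap < chunk_size:
        # A step of zero or less would make range() fail or walk backwards
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i+chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks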
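
A small worked example may help show how the sliding window behaves. The sketch below uses a hypothetical helper `chunk_words` that applies the same stride (`chunk_size - overlap`) to a toy list of ten tokens; the token names are made up for illustration.

```python
def chunk_words(words, chunk_size, overlap):
    # Same windowing as chunk_text, but on a token list so the windows are easy to inspect.
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]

words = [f"w{i}" for i in range(10)]
windows = chunk_words(words, chunk_size=4, overlap=2)
for window in windows:
    print(' '.join(window))
```

Each window starts two tokens after the previous one and repeats its last two tokens. Note that the final window (`w8 w9`) is fully contained in the one before it, so with this scheme the tail of every transcript is indexed twice.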

Step 2: Build vector database

In [9]:
import faiss

# Get all chunks from all videos
all_chunks = []
chunk_to_video = []
for idx, row in df.iterrows():
    video_chunks = chunk_text(row['full_transcript'])
    all_chunks.extend(video_chunks)
    chunk_to_video.extend([row['video_id']] * len(video_chunks))

# Compute embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = model.encode(all_chunks, show_progress_bar=True)

# Build FAISS index
dimension = chunk_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(chunk_embeddings))


Batches:   0%|          | 0/23 [00:00<?, ?it/s]

Step 3: Define retrieval function

In [10]:
def retrieve_context(query_text, top_k=3):
    # Encode query
    query_emb = model.encode([query_text])
    # Search for top_k similar chunks
    D, I = index.search(np.array(query_emb), top_k)
    # Return the retrieved chunk texts
    return [all_chunks[i] for i in I[0]]
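
`faiss.IndexFlatL2` performs an exhaustive search and ranks vectors by squared Euclidean distance. The following pure-Python sketch reproduces that idea on toy 2-D vectors so the ranking can be checked by hand; `flat_l2_search` and the toy data are illustrative names, not part of FAISS.

```python
def squared_l2(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, top_k):
    # Brute force: score every stored vector, then keep the top_k closest.
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, ids = flat_l2_search(toy_vectors, query=[0.9, 0.1], top_k=2)
print(ids, distances)
```

The returned `ids` play the role of `I[0]` in `retrieve_context`: they are positions in the list that was indexed, which is why `all_chunks` must stay in the same order as the embeddings added to the index.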


Step 4: Integrate retrieved context into LLM prompt

In [11]:
def construct_rag_prompt(main_transcript, retrieved_chunks):
    context_section = "\n".join(retrieved_chunks)
    prompt = f"""You are an expert summarizer. Given the following transcript and relevant context, generate a concise summary.

Transcript:
{main_transcript}

Retrieved context:
{context_section}
"""
    return prompt


Step 5: Run summarization

In [12]:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Example for one video
row = df.iloc[0]
main_text = get_first_n_sentences(row['full_transcript'], 10) # Get the first 10 sentences
retrieved = retrieve_context(main_text, top_k=3)
prompt = construct_rag_prompt(main_text, retrieved)
rag_summary = summarizer(prompt, max_length=60, min_length=10, do_sample=False, truncation=True)[0]['summary_text']

print("RAG summary:", rag_summary)


Device set to use cpu
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


RAG summary: A group of over a thousand researchers has been quietly working on their own version of a 176 billion parameter model trained on the nuclear powered supercomputer, the Jonset. You can download multiple size variants all the way up to 176 billion parameters for free. BLUM is an acronym that


Step 6: Optimize RAG model

In [13]:
all_chunks = []
chunk_to_video = []
for idx, row in df.iterrows():
    video_chunks = chunk_text(row['full_transcript'], chunk_size=40, overlap=20)
    all_chunks.extend(video_chunks)
    chunk_to_video.extend([row['video_id']] * len(video_chunks))

# Get embeddings for all chunks
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = model.encode(all_chunks, show_progress_bar=True)


Batches:   0%|          | 0/91 [00:00<?, ?it/s]

In [14]:
def construct_rag_prompt(main_transcript, retrieved_chunks, max_sentences=3):
    context_section = "\n".join(retrieved_chunks)
    prompt = f"""You are an expert summarizer. Given the following transcript and relevant context, generate a concise summary.

Transcript:
{main_transcript}

Retrieved context:
{context_section}

Limit your summary to {max_sentences} sentences.
"""
    return prompt

In [15]:
def retrieve_context(query_text, top_k=2, exclude_video_id=None):
    """
    Retrieve the top_k context chunks for query_text, excluding chunks from the same video if exclude_video_id is provided.
    """
    # Build candidate pool (exclude chunks from current video)
    candidate_chunks = []
    candidate_embeddings = []
    for i, chunk in enumerate(all_chunks):
        if exclude_video_id is not None and chunk_to_video[i] == exclude_video_id:
            continue
        candidate_chunks.append(chunk)
        candidate_embeddings.append(chunk_embeddings[i])
    if not candidate_chunks:
        return []
    # Compute embedding for query (FAISS expects float32)
    query_emb = np.asarray(model.encode([query_text]), dtype='float32')
    emb_matrix = np.vstack(candidate_embeddings).astype('float32')
    # Build a temp FAISS index for candidates only
    faiss_index = faiss.IndexFlatL2(emb_matrix.shape[1])
    faiss_index.add(emb_matrix)
    # Never request more neighbours than there are candidates; FAISS pads missing hits with -1
    D, I = faiss_index.search(query_emb, min(top_k, len(candidate_chunks)))
    return [candidate_chunks[idx] for idx in I[0] if idx >= 0]

# Example usage for one video
row = df.iloc[0]
main_text = get_first_n_sentences(row['full_transcript'], 3)
retrieved = retrieve_context(main_text, top_k=2, exclude_video_id=row['video_id'])

print("Main text:", main_text)
print("Retrieved context chunks:", retrieved)

prompt = construct_rag_prompt(main_text, retrieved, max_sentences=2)
print("Prompt for LLM:\n", prompt)

rag_summary = summarizer(
    prompt,
    max_length=50,   # Adjust as needed
    min_length=10,
    do_sample=False,
    truncation=True
)[0]['summary_text']

print("RAG summary:", rag_summary)


Main text: Imagine an AI where all in the same model you could translate languages, write code, solve crossword puzzles, be a chatbot, and do a whole bunch of other crazy things. This sort of an AI would certainly require a supercomputer of hundreds of A100 GPUs and months of training, even on all that power. We would need a team of researchers, the best of the best.
Retrieved context chunks: ["you know since we're not doing AI just out of joy I guess maybe the researchers are but ultimately we want to build tools that make our lives better and you do expect from a tool that it works in", "like biology and there's a bunch of tech I mean blockchain there's a lot of funding going into AI but there's also a lot of stuff happening right it's a lot of lot of cool products that we haven't built"]
Prompt for LLM:
 You are an expert summarizer. Given the following transcript and relevant context, generate a concise summary.

Transcript:
Imagine an AI where all in the same model you could trans
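
The `retrieve_context` function in cell 15 builds a temporary index over the candidate pool on every call. One possible alternative, sketched here under stated assumptions, is to query a single shared index for more hits than needed and then drop hits from the excluded video. `filter_ranked_hits`, `toy_owner`, and `toy_ranked` are hypothetical; in the notebook the ranked IDs would come from `index.search` and the owner list would be `chunk_to_video`.

```python
def filter_ranked_hits(ranked_ids, chunk_owner, exclude_video_id, top_k):
    # Walk the ranked hit list and keep the first top_k chunks
    # that do not belong to the excluded video.
    kept = []
    for idx in ranked_ids:
        if idx < 0:
            # Assumption: negative IDs are padding for missing results.
            continue
        if chunk_owner[idx] == exclude_video_id:
            continue
        kept.append(idx)
        if len(kept) == top_k:
            break
    return kept

toy_owner = ["vidA", "vidA", "vidB", "vidC", "vidB"]  # chunk position -> video ID
toy_ranked = [1, 0, 4, 2, 3, -1]                       # nearest first
kept_ids = filter_ranked_hits(toy_ranked, toy_owner, exclude_video_id="vidA", top_k=2)
print(kept_ids)
```

The trade-off is that the over-fetch size has to be large enough to survive the filter; a video with many chunks near the query could otherwise leave fewer than `top_k` results.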

The current summarization pipeline is acceptable at this stage for several reasons.

*   First, the entire pipeline, from transcript chunking, embedding generation, and vector database construction to context retrieval and prompt-based summarization, runs end to end.
*   Because chunks from the query video are excluded, the retrieved context is no longer a repeat of the original transcript; it draws on segments from other videos in the dataset, which is what retrieval augmentation requires.
*   The generated summaries are concise and abstract the key points rather than copying the transcript verbatim.
*   The summary outputs reflect both the main transcript and, where appropriate, the retrieved context, which is the intended goal of a RAG approach. The RAG outputs also differ from the baseline summaries, which suggests that the added context influences the result.
*   The structure is scalable and can be evaluated and improved with quantitative metrics or human evaluation in the next phases.

In [16]:
# Function to generate a RAG summary for one transcript (applied row by row below)
def getRagSummary(transcript, exclude_video_id, n=11, top_k=5, max_sentences=2):
    main_text = get_first_n_sentences(transcript, n=n)
    retrieved = retrieve_context(main_text, top_k=top_k, exclude_video_id=exclude_video_id)

    prompt = construct_rag_prompt(main_text, retrieved, max_sentences=max_sentences)

    rag_summary = summarizer(
        prompt,
        max_length=100,   # Adjust as needed
        min_length=10,
        do_sample=False,
        truncation=True
    )[0]['summary_text']

    return rag_summary




Step 6.5: Generate RAG summaries for entire dataset

In [17]:
summs = df.progress_apply(
    lambda x: getRagSummary(
        x['full_transcript'],
        x['video_id'],
        n=11,
        max_sentences=5
    ),
    axis=1
)

# Save into df
df['rag_summary'] = summs

100%|██████████| 10/10 [03:39<00:00, 21.96s/it]


Step 7: Evaluation: ROUGE + BLEU

In [18]:
!pip install rouge-score



In [19]:
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
import numpy as np

# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(
    ['rouge1', 'rouge2', 'rougeL'],
    use_stemmer=True
)

# Create lists to store scores
rouge1_scores = []
rouge2_scores = []
rougeL_scores = []
bleu_scores = []

# Loop through all rows
for idx, row in df.iterrows():
    reference = row['summary']           # baseline summary
    generated = row['rag_summary']       # rag summary

    # --- ROUGE ---
    scores = scorer.score(reference, generated)
    rouge1_scores.append(scores['rouge1'].fmeasure)
    rouge2_scores.append(scores['rouge2'].fmeasure)
    rougeL_scores.append(scores['rougeL'].fmeasure)

    # --- BLEU ---
    ref = [reference.split()]
    gen = generated.split()
    bleu = sentence_bleu(ref, gen)
    bleu_scores.append(bleu)

# Print average scores
print("Average ROUGE-1:", np.mean(rouge1_scores))
print("Average ROUGE-2:", np.mean(rouge2_scores))
print("Average ROUGE-L:", np.mean(rougeL_scores))
print("Average BLEU:", np.mean(bleu_scores))


Average ROUGE-1: 0.3780841199203349
Average ROUGE-2: 0.28416942269777873
Average ROUGE-L: 0.33932818468841897
Average BLEU: 0.2328225947892693
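
To show what the ROUGE-1 F-measure reported above captures, here is a simplified hand computation. It assumes plain lowercase whitespace tokenization and no stemming, so it will not reproduce `rouge_score` values exactly; `rouge1_f` and the two example sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f(reference, candidate):
    # Unigram overlap with clipped counts, then the harmonic mean of precision and recall.
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

ref_text = "the model was trained on a supercomputer"
cand_text = "the model was trained for months"
score = rouge1_f(ref_text, cand_text)
print(round(score, 3))
```

Here four unigrams overlap, giving precision 4/6 and recall 4/7, so the F-measure is 8/13. In the notebook the "reference" is the baseline BART summary rather than a human-written one, so these scores measure how far the RAG summary moves away from the baseline, not summary quality.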


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
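
The warnings above appear because sentence-level BLEU takes a geometric mean of the 1- to 4-gram precisions, so a single order with zero matches drives the whole score to 0. The sketch below is a simplified pure-Python BLEU (not NLTK's implementation) that makes this visible; the `epsilon` parameter is a hypothetical knob that substitutes a small count for zero matches, similar in spirit to the smoothing the warning recommends.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # Count all contiguous n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4, epsilon=0.0):
    # Single-reference sentence BLEU with uniform weights.
    # With epsilon == 0, any order with zero matches yields a score of 0.
    precisions = []
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        total = sum(hyp.values())
        if total == 0:
            return 0.0  # hypothesis has fewer than n tokens
        matches = sum((hyp & ref).values())
        if matches == 0:
            if epsilon == 0.0:
                return 0.0
            matches = epsilon
        precisions.append(matches / total)
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalize hypotheses that are not longer than the reference.
    if len(hypothesis) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(hypothesis))
    return brevity * math.exp(log_mean)

ref_tokens = "researchers trained a large open model".split()
hyp_tokens = "researchers built a large model".split()
plain = bleu(ref_tokens, hyp_tokens)
smoothed = bleu(ref_tokens, hyp_tokens, epsilon=0.1)
print(plain, smoothed)
```

In this toy pair the hypothesis shares unigrams and one bigram with the reference but no trigram, so the unsmoothed score is 0 while the smoothed score stays positive. Rows that trigger the warning therefore pull the unsmoothed average BLEU down without reflecting their lower-order overlap.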


# Week 3: Human Evaluation
(Check out Final Report for more information)


# Week 4: Model Fine-Tuning

## 1. Basic fine-tuning (T5-small)

In [20]:
!pip install sentencepiece accelerate




Step 1: Build the Training DataFrame

In [21]:
# Prepare the training dataset for fine-tuning, using the RAG summary as the target
# Keep only rows where rag_summary exists
train_df = df.dropna(subset=['rag_summary']).copy()

# Create input and target columns
# We can choose either full_transcript or a truncated version
train_df['input_text'] = train_df['full_transcript'].apply(
    lambda x: get_first_n_sentences(x, 80)
)

train_df['target_text'] = train_df['rag_summary']

# Show example
train_df[['input_text', 'target_text']].head()

| | input_text | target_text |
|---|---|---|
| 0 | Imagine an AI where all in the same model you ... | A group of over a thousand researchers has bee... |
| 1 | What's going on guys welcome back to another S... | You are an expert summarizer. Given the follow... |
| 2 | You know in the coming months, leadership at e... | You are an expert summarizer. Given the follow... |
| 3 | Oh, I'm building this hat. Look how good it is... | You are an expert summarizer. Given the follow... |
| 4 | Good morning! We are up to day 14 of Reps For ... | You are an expert summarizer. Given the follow... |
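
In the preview above, four of the five `target_text` values begin with the prompt's instruction text ("You are an expert summarizer. Given the follow...") rather than with summary content. A minimal sketch of how such rows could be flagged before training is shown below; `split_by_prompt_echo` and the toy targets are hypothetical, and the prefix is taken from `construct_rag_prompt`.

```python
PROMPT_PREFIX = "You are an expert summarizer."

def split_by_prompt_echo(targets, prefix=PROMPT_PREFIX):
    # Separate row positions whose target starts with the instruction text
    # from those that look like actual summaries.
    clean, echoed = [], []
    for i, text in enumerate(targets):
        if text.strip().startswith(prefix):
            echoed.append(i)
        else:
            clean.append(i)
    return clean, echoed

# Toy targets for illustration only.
toy_targets = [
    "A group of over a thousand researchers has been quietly working.",
    "You are an expert summarizer. Given the following transcript and relevant context, generate a concise summary.",
]
clean_ids, echoed_ids = split_by_prompt_echo(toy_targets)
print(clean_ids, echoed_ids)
```

On the notebook's data the same check could be applied to `train_df['target_text']` to decide whether the flagged rows should be regenerated or excluded from the training set.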
