<a href="https://colab.research.google.com/github/surya1604/YouTube-Summarization/blob/main/Model/Text_summarization_BART_%26_Manual_Optimized_V3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**METHOD -1**: PRETRAINED MODEL - BART


In [None]:
%pip install youtube_transcript_api
import nltk
from youtube_transcript_api import YouTubeTranscriptApi
from transformers import BartTokenizer, BartForConditionalGeneration
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # tokenizer data required by word_tokenize

# Function to fetch and process the transcript
def generate_transcript(video_id):
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    script = ""

    for text in transcript:
        t = text["text"]
        if t != '[Music]':
            script += t + " "

    return script

# Function to summarize text using BART model
def summarize_text(input_text):
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

    # BART needs no task prefix (that is a T5 convention); input beyond 1024 tokens is truncated
    inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=1024, truncation=True)
    summary_ids = model.generate(inputs, max_length=300, min_length=80, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Combined function to fetch, process, and summarize a YouTube video's transcript
def extract_summarize(video_id):
    transcript = generate_transcript(video_id)
    summary = summarize_text(transcript)
    original_size = len(word_tokenize(transcript))
    summary_size = len(word_tokenize(summary))

    # Print the comparison
    print("Size of original transcript:", original_size)
    print("Size of summary:", summary_size)
    print("Original Transcript:", transcript)
    return summary

# Example usage
video_id = "LnJwH_PZXnM"  # video ID only, without URL query parameters such as &t=3s
summary = extract_summarize(video_id)
print("\nSummary:\n", summary)


Size of original transcript: 2849
Size of summary: 70
Original Transcript: Transcriber: María Constanza Cuevas
Reviewer: Tanya Cushman (Referee whistle sound) Good evening! Good evening. How are you? Are you good? Great! Welcome, welcome, welcome to this match. This match will take exactly 18 minutes. OK? And you're all part of the same team: Mechelen. OK? Hey guys, I would like to see
fair play on the field, respect and positivity. Is that OK for everyone? Cool. Good luck! One year ago, I decided
I wanted to become a football referee. Not because of the money, though. I only get paid 20 euros per match. So I won't really get rich by it, will I? No. I decided to become a referee 
for two other reasons. One - to stay in good shape. Two - because I wanted to learn
how not to take things personally. [How not to take things personally?] I can see some people nodding. You are probably thinking, "Being a referee
is the perfect environment to learn how not to take 
things personally, isn't it
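
Because `summarize_text` encodes its input with `max_length=1024` and `truncation=True`, a transcript of this size (2849 words) is almost certainly cut off before the model sees all of it. One possible workaround, which is not part of the notebook, is to split the transcript into smaller pieces and summarize each piece separately. The helper below is a minimal sketch: `split_into_chunks` is a hypothetical name, and the default of 400 words per chunk is an assumed value intended to stay under the token limit for typical English text, not a measured one.

```python
def split_into_chunks(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words.

    Hypothetical helper; max_words=400 is an assumed, unverified default.
    """
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]

# Small demonstration with a tiny chunk size
chunks = split_into_chunks("one two three four five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Each chunk could then be passed to `summarize_text` and the partial summaries joined, at the cost of one model call per chunk.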

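**METHOD -2**: Frequency-based extractive summary. Each sentence is scored by the frequencies of the bigrams it contains, and a sentence is kept when its score exceeds the average score multiplied by a threshold multiplier. Changing the average formula or the multiplier sets the size of the summary: the higher the multiplier, the shorter the summary.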
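
The effect of the threshold multiplier can be illustrated with made-up numbers. The sketch below uses a hypothetical `select_sentences` helper and invented scores; it only mirrors the selection rule used in the cell that follows (integer average times multiplier) and is not the notebook's own code.

```python
def select_sentences(scores, multiplier):
    """Return the sentences whose score exceeds average * multiplier."""
    # Integer average over all scored sentences
    average = int(sum(scores.values()) / len(scores))
    threshold = average * multiplier
    return [sentence for sentence, value in scores.items() if value > threshold]

# Hypothetical scores for four sentences (average is 30)
scores = {"A": 10, "B": 20, "C": 40, "D": 50}

for multiplier in (0.5, 1.0, 1.5):
    print(multiplier, select_sentences(scores, multiplier))
# 0.5 ['B', 'C', 'D']
# 1.0 ['C', 'D']
# 1.5 ['D']
```

Raising the multiplier raises the threshold, so fewer sentences pass and the summary shrinks.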

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Download stopwords and punkt only once
nltk.download('stopwords')
nltk.download('punkt')

def preprocess_text(text):
    # Replace newline characters with spaces
    text = text.replace("\n", " ")
    # Remove the specific unwanted string "Two -"
    text = text.replace("Two -", "")
    return text

def summarize(text):
    # Preprocess the text to remove unwanted characters
    text = preprocess_text(text)

    stopWords = set(stopwords.words("english"))
    words = word_tokenize(text)

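    # Use bigrams (2-grams) for the frequency table
    n_grams = nltk.ngrams(words, 2)
    freqTable = dict()
    for gram in n_grams:
        gram = ' '.join(gram).lower()
        # Note: a joined bigram never equals a single stopword, so this check filters nothing
        if gram not in stopWords: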
            if gram in freqTable:
                freqTable[gram] += 1
            else:
                freqTable[gram] = 1

    sentences = sent_tokenize(text)
    sentenceValue = dict()

    for sentence in sentences:
        for gram, freq in freqTable.items():
            if gram in sentence.lower():
                if sentence in sentenceValue:
                    sentenceValue[sentence] += freq
                else:
                    sentenceValue[sentence] = freq

    sumValues = 0
    for sentence in sentenceValue:
        sumValues += sentenceValue[sentence]

    average = int(sumValues / len(sentenceValue))

    # Adjust the threshold for sentence inclusion
    threshold = average * 1.2

    summary1 = ''
    for sentence in sentences:
        if sentence in sentenceValue and sentenceValue[sentence] > threshold:
            summary1 += " " + sentence

    return summary1

def extract_summarize(video_id):
    transcript = generate_transcript(video_id)
    summary1=summarize(transcript)
    # Calculate the sizes of the original transcript and the summary
    original_size = len(word_tokenize(transcript))
    summary_size = len(word_tokenize(summary1))

    # Print the comparison
    print("Size of original transcript:", original_size)
    print("Size of summary:", summary_size)
    print("Original Transcript:", transcript)

    return summary1


# Summarize once, reusing the video_id defined in the first cell
summary1 = extract_summarize(video_id)
print("\nSummary:\n", summary1)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Size of original transcript: 2849
Size of summary: 1393
Original Transcript: Transcriber: María Constanza Cuevas
Reviewer: Tanya Cushman (Referee whistle sound) Good evening! Good evening. How are you? Are you good? Great! Welcome, welcome, welcome to this match. This match will take exactly 18 minutes. OK? And you're all part of the same team: Mechelen. OK? Hey guys, I would like to see
fair play on the field, respect and positivity. Is that OK for everyone? Cool. Good luck! One year ago, I decided
I wanted to become a football referee. Not because of the money, though. I only get paid 20 euros per match. So I won't really get rich by it, will I? No. I decided to become a referee 
for two other reasons. One - to stay in good shape. Two - because I wanted to learn
how not to take things personally. [How not to take things personally?] I can see some people nodding. You are probably thinking, "Being a referee
is the perfect environment to learn how not to take 
things personally, isn't 

**METHOD -3**: Use n-grams for word frequency and match each bigram as a whole string. As in Method 2, changing the average formula or the threshold multiplier sets the size of the summary: the higher the multiplier, the shorter the summary.
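
To make "each bigram as a whole" concrete, here is a small sketch of counting bigrams as joined strings and scoring a sentence by substring match. The helper names `bigram_counts` and `score_sentence` are hypothetical, and plain `str.split` stands in for `word_tokenize` so the example runs without NLTK data.

```python
from collections import Counter

def bigram_counts(words):
    """Count adjacent word pairs, each joined into one lowercase string."""
    pairs = zip(words, words[1:])
    return Counter(" ".join(pair).lower() for pair in pairs)

def score_sentence(sentence, counts):
    """Sum the counts of every bigram that occurs in the sentence as a substring."""
    lowered = sentence.lower()
    return sum(freq for gram, freq in counts.items() if gram in lowered)

# str.split is an assumption here; the notebook uses NLTK's word_tokenize
words = "take things personally do not take things lightly".split()
counts = bigram_counts(words)

print(counts["take things"])                                     # 2
print(score_sentence("Do not take things personally", counts))  # 5
```

The sentence scores 5 because it contains "do not", "not take" and "things personally" (one occurrence each) plus "take things" (two occurrences).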


In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter

# Download stopwords and punkt only once
nltk.download('stopwords')
nltk.download('punkt')

def preprocess_text(text):
    text = text.replace("\n", " ")
    text = text.replace("\\", "")
    text = text.replace("Two -", "")
    return text

def summarize(text):
    text = preprocess_text(text)

    stopWords = set(stopwords.words("english"))
    words = word_tokenize(text)

    # Count each bigram as a whole lowercase string (stopwords are not filtered at this stage)
    n_grams = nltk.ngrams(words, 2)
    freqTable = Counter(' '.join(gram).lower() for gram in n_grams)

    sentences = sent_tokenize(text)
    sentenceValue = Counter()

    # Add a bigram's frequency to every sentence that contains it
    for sentence in sentences:
        for gram, freq in freqTable.items():
            if gram in sentence.lower():
                sentenceValue[sentence] += freq

    sumValues = sum(sentenceValue.values())
    average = int(sumValues / len(sentenceValue))
    # The threshold multiplier controls the summary size: the higher the multiplier, the shorter the summary.
    threshold = average * 1.5

    summary2 = ' '.join(sentence for sentence in sentences if sentence in sentenceValue and sentenceValue[sentence] > threshold)

    return summary2

def extract_summarize(video_id):
    # Assuming you have a function to generate the transcript
    transcript = generate_transcript(video_id)
    summary2 = summarize(transcript)

    # Calculate the sizes of the original transcript and the summary
    original_size = len(word_tokenize(transcript))
    summary_size = len(word_tokenize(summary2))

    # Print the comparison
    print("Size of original transcript:", original_size)
    print("Size of summary:", summary_size)
    print("Original Transcript:", transcript)

    return summary2

# Summarize once, reusing the video_id defined in the first cell
summary2 = extract_summarize(video_id)
print("\nSummary:\n", summary2)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Size of original transcript: 2849
Size of summary: 1110
Original Transcript: Transcriber: María Constanza Cuevas
Reviewer: Tanya Cushman (Referee whistle sound) Good evening! Good evening. How are you? Are you good? Great! Welcome, welcome, welcome to this match. This match will take exactly 18 minutes. OK? And you're all part of the same team: Mechelen. OK? Hey guys, I would like to see
fair play on the field, respect and positivity. Is that OK for everyone? Cool. Good luck! One year ago, I decided
I wanted to become a football referee. Not because of the money, though. I only get paid 20 euros per match. So I won't really get rich by it, will I? No. I decided to become a referee 
for two other reasons. One - to stay in good shape. Two - because I wanted to learn
how not to take things personally. [How not to take things personally?] I can see some people nodding. You are probably thinking, "Being a referee
is the perfect environment to learn how not to take 
things personally, isn't 

**METHOD -4**: Generate bigrams and filter out those containing a stop word. Any bigram with at least one stopword is discarded, which reduces the number of valid bigrams. As before, changing the average formula or the threshold multiplier sets the size of the summary: the higher the multiplier, the shorter the summary.
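
The filtering step can be sketched on a single sentence from the transcript. In this illustration `content_bigrams` is a hypothetical helper and `STOPWORDS` is a small hand-picked set standing in for NLTK's English stopword list, so the result is indicative only.

```python
from collections import Counter

# Hypothetical stand-in for NLTK's English stopword list
STOPWORDS = {"i", "to", "a", "the", "not", "how"}

def content_bigrams(words, stopwords=STOPWORDS):
    """Count lowercase bigrams in which neither word is a stopword."""
    lowered = [word.lower() for word in words]
    return Counter(
        f"{first} {second}"
        for first, second in zip(lowered, lowered[1:])
        if first not in stopwords and second not in stopwords
    )

words = "I wanted to become a football referee".split()
print(content_bigrams(words))  # Counter({'football referee': 1})
```

Of the six bigrams in the sentence, only "football referee" survives, which shows how sharply this filter reduces the frequency table.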



In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
from youtube_transcript_api import YouTubeTranscriptApi

# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

def preprocess_text(text):
    text = text.replace("\n", " ").replace("\\", "").replace("Two -", "")
    return text

def summarize(text):
    text = preprocess_text(text)
    stopWords = set(stopwords.words("english"))
    words = word_tokenize(text)

    # Generate lowercase bigrams and discard any bigram that contains a stopword.
    # Lowercasing is needed because the stopword list and the sentence matching below are lowercase.
    n_grams = nltk.ngrams([word.lower() for word in words], 2)
    filtered_grams = [' '.join(gram) for gram in n_grams if not any(word in stopWords for word in gram)]
    freqTable = Counter(filtered_grams)

    sentences = sent_tokenize(text)
    sentenceValue = Counter()

    # Score sentences based on frequency of each bigram
    for sentence in sentences:
        sentence_lower = sentence.lower()
        for gram in freqTable:
            if gram in sentence_lower:
                sentenceValue[sentence] += freqTable[gram]

    sumValues = sum(sentenceValue.values())
    average = int(sumValues / len(sentenceValue))

    # The threshold multiplier controls the summary size: the higher the multiplier, the shorter the summary.
    threshold = average * 1.5  # dynamic threshold relative to the average sentence score
    summary3 = ' '.join(sentence for sentence in sentences if sentenceValue[sentence] > threshold)

    return summary3

def fetch_transcript(video_id):
    # Fetch the transcript of the video
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    # Join all text items to form a full transcript
    full_text = ' '.join([item['text'] for item in transcript])
    return full_text

def main(video_id):
    transcript = fetch_transcript(video_id)
    summary3 = summarize(transcript)

    # Calculate size of the original and the summary
    original_length = len(word_tokenize(transcript))
    summary_length = len(word_tokenize(summary3))
    reduction_percent = ((original_length - summary_length) / original_length) * 100

    # Output comparison
    print(f"Original Length: {original_length} words")
    print(f"Summary Length: {summary_length} words")
    print(f"Reduction: {reduction_percent:.2f}%")
    print("Original Transcript:", transcript)
    print("\nSummary:\n", summary3)

# Example usage
video_id = "LnJwH_PZXnM"  # video ID only, without URL query parameters such as &t=3s
main(video_id)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Original Length: 2849 words
Summary Length: 330 words
Reduction: 88.42%
Original Transcript: Transcriber: María Constanza Cuevas
Reviewer: Tanya Cushman (Referee whistle sound) Good evening! Good evening. How are you? Are you good? Great! Welcome, welcome, welcome to this match. This match will take exactly 18 minutes. OK? And you're all part of the same team: Mechelen. OK? Hey guys, I would like to see
fair play on the field, respect and positivity. Is that OK for everyone? Cool. Good luck! One year ago, I decided
I wanted to become a football referee. Not because of the money, though. I only get paid 20 euros per match. So I won't really get rich by it, will I? No. I decided to become a referee 
for two other reasons. One - to stay in good shape. Two - because I wanted to learn
how not to take things personally. [How not to take things personally?] I can see some people nodding. You are probably thinking, "Being a referee
is the perfect environment to learn how not to take 
things pe