
### Load and Preprocess Transcript from a TXT File

In [8]:
import re

# Load transcript from a text file
def load_transcript(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()

    # Basic sentence segmentation: split on whitespace that follows "." or "?",
    # while skipping abbreviations such as "e.g." and titles such as "Mr."
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [s.strip() for s in sentences if s.strip()]  # Remove empty sentences

# Example usage
file_path = "/content/text_processing.txt"
sentences = load_transcript(file_path)
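To see what this splitter does and does not handle, here is a small self-contained check on made-up strings (the sample sentences and the helper name `split_sentences` are illustrative, not from the transcript). It suggests that titles such as "Mr." are kept intact, while text ending in "!" or in a closing quotation mark is not split, which may explain why some cluster entries further down contain two sentences.

```python
import re

# Same pattern as in load_transcript above
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

def split_sentences(text):
    """Split text with the notebook's regex and drop empty pieces (hypothetical helper)."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]

# "Mr." is protected by the second lookbehind, so it does not end a sentence
print(split_sentences("Mr. Smith uses NLP. Is it useful? Yes."))

# "!" is not in the (?<=\.|\?) lookbehind, so no split happens here
print(split_sentences("Wow! It works."))

# A closing quote after the period also prevents a split
print(split_sentences("He said “stop.” Then he left."))
```

If exclamation marks or quoted sentence endings matter for a given transcript, a dedicated sentence tokenizer (for example from NLTK or spaCy, both mentioned in the transcript itself) may be a safer choice than extending the regex.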


### Convert Sentences into SBERT Embeddings

In [9]:
from sentence_transformers import SentenceTransformer

# Load SBERT model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert transcript sentences to embeddings
def get_sentence_embeddings(sentences):
    return model.encode(sentences, convert_to_numpy=True)

# Generate embeddings
sentence_vectors = get_sentence_embeddings(sentences)



### Apply Gaussian Mixture Model (GMM) for Clustering

In [10]:
from sklearn.mixture import GaussianMixture

# Number of clusters, fixed at 5 for this run
# (ideally chosen by comparing BIC/AIC across several candidate values)
num_clusters = 5

# Fit GMM
gmm = GaussianMixture(n_components=num_clusters, random_state=42)
gmm.fit(sentence_vectors)

# Predict cluster labels
labels = gmm.predict(sentence_vectors)
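The comment in the cell above mentions BIC/AIC, but the number of clusters is fixed by hand. One possible way to make that choice data-driven is sketched below: fit a GMM for each candidate value and keep the one with the lowest BIC. The helper name `select_num_clusters`, the candidate range, and the synthetic `toy_vectors` are assumptions for illustration. `covariance_type="diag"` is also an assumption, chosen because full covariance matrices over 768-dimensional embeddings have far more parameters than a short transcript can support.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_num_clusters(vectors, candidates=range(1, 7), random_state=42):
    """Fit one GMM per candidate k and return (best_k, {k: BIC}).

    Hypothetical helper: lower BIC is better, so best_k minimizes it.
    """
    bics = {}
    for k in candidates:
        gmm = GaussianMixture(
            n_components=k,
            covariance_type="diag",  # assumption: fewer parameters than "full"
            random_state=random_state,
        )
        gmm.fit(vectors)
        bics[k] = gmm.bic(vectors)
    best_k = min(bics, key=bics.get)
    return best_k, bics

# Synthetic stand-in for sentence_vectors: three well-separated 2-D blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_vectors = np.vstack(
    [center + rng.normal(scale=0.5, size=(40, 2)) for center in centers]
)

best_k, bics = select_num_clusters(toy_vectors)
for k, bic in bics.items():
    print(f"k={k}: BIC={bic:.1f}")
print("Selected number of clusters:", best_k)
```

On the real data one would call `select_num_clusters(sentence_vectors)` instead. Every candidate value must be no larger than the number of sentences, since `GaussianMixture` cannot fit more components than samples.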


### Group Sentences by Cluster

In [11]:
from collections import defaultdict

# Group sentences by their assigned topic
clustered_sentences = defaultdict(list)
for i, label in enumerate(labels):
    clustered_sentences[label].append(sentences[i])

# Print topics and their sentences
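# (the loop variable is named cluster_sentences so it does not overwrite
# the global `sentences` list)
for topic, cluster_sentences in clustered_sentences.items():
    print(f"\n🟢 Topic {topic}:")
    for s in cluster_sentences:
        print(f"  - {s}")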



🟢 Topic 2:
  - Tokenization is a foundational step in text processing where text is divided into smaller units called tokens.
  - These tokens can be individual words or sentences, depending on the granularity required.
  - For example, tokenizing the sentence “I’m learning NLP!” results in the tokens ["I", "'m", "learning", "NLP", "!"].
  - Tokenization is essential for enabling downstream natural language processing (NLP) tasks, such as sentiment analysis and machine translation, as it breaks down complex text into manageable pieces.
  - Libraries like NLTK and spaCy provide efficient tokenization methods, making it easy to prepare text for analysis.

🟢 Topic 0:
  - Stopword removal is the process of filtering out common words that do not carry substantial meaning, such as “the,” “is,” and “and.” Removing these stopwords reduces the dimensionality of the dataset, which can enhance the performance of NLP models by focusing on more meaningful words.
  - For instance, the sentence “She

### Extract Keywords for Each Cluster

To get keywords for each topic, apply TF-IDF to the sentences within each cluster.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Get keywords for each topic
def get_top_keywords(sentences, num_words=5):
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(sentences)
    feature_array = vectorizer.get_feature_names_out()
    tfidf_sorting = X.sum(axis=0).A1.argsort()[::-1]
    return [feature_array[i] for i in tfidf_sorting[:num_words]]

# Print top words per topic
for topic, cluster_sentences in clustered_sentences.items():
    keywords = get_top_keywords(cluster_sentences)
    print(f"\n🔹 Topic {topic} Keywords: {keywords}")



🔹 Topic 2 Keywords: ['text', 'nlp', 'tokens', 'learning', 'tokenization']

🔹 Topic 0 Keywords: ['word', 'lemmatization', 'words', 'stopwords', 'nlp']

🔹 Topic 4 Keywords: ['search', 'text', 'like', 'unstructured', 'tasks']

🔹 Topic 1 Keywords: ['like', 'models', 'musk', 'tesla', 'elon']

🔹 Topic 3 Keywords: ['tagging', 'like', 'pos', 'quick', 'fox']


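Judging by the keywords, the clusters appear to correspond to: tokenization (Topic 2), stopword removal and lemmatization (Topic 0), the importance of text processing (Topic 4), named entity recognition, NER (Topic 1), and part-of-speech (POS) tagging (Topic 3).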

### Summarize by Picking the Most Important Sentence in Each Cluster

In [13]:
!pip install sympy --upgrade



In [22]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load the summarization model once instead of on every call
summary_model = SentenceTransformer('all-MiniLM-L6-v2')

def extractive_summary(sentences, topic):
    # Embed all the sentences in the cluster using SBERT
    embeddings = summary_model.encode(sentences)

    # Compute the pairwise cosine similarity matrix
    similarity_matrix = cosine_similarity(embeddings)

    # Find the most central sentence (highest average similarity to the others)
    avg_sim = np.mean(similarity_matrix, axis=1)
    central_idx = int(np.argmax(avg_sim))  # np.argmax gives the index of the max value

    selected_indices = [central_idx]  # indices of the selected sentences

    # Topic 0 appears to mix two sub-themes (stopword removal and lemmatization),
    # so a second sentence is picked for it. The topic id is hard-coded for this run.
    if topic == 0 and len(sentences) > 1:
        # Mask the sentence already chosen so it cannot be selected again
        similarity_matrix[:, central_idx] = -1
        similarity_matrix[central_idx, :] = -1
        avg_sim = np.mean(similarity_matrix, axis=1)
        selected_indices.append(int(np.argmax(avg_sim)))

    # Join with a space so consecutive sentences do not run together
    return ' '.join(sentences[i] for i in selected_indices)
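The special case for Topic 0 ties the function to one particular clustering result. A more general alternative might take the number of sentences as a parameter and penalize redundancy with sentences already chosen, in the style of maximal marginal relevance (MMR). The sketch below is a swapped-in technique, not the notebook's method. It works on plain NumPy arrays, and the function name `select_sentences`, the `diversity` weight, and the toy vectors are all assumptions for illustration.

```python
import numpy as np

def select_sentences(embeddings, k=2, diversity=0.5):
    """Greedily pick up to k row indices, trading centrality against redundancy.

    Hypothetical MMR-style helper: diversity=0 ranks by centrality alone,
    larger values favor sentences unlike those already selected.
    """
    # Cosine similarity via normalized dot products
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T

    # Centrality: average similarity of each sentence to all sentences
    centrality = sim.mean(axis=1)
    selected = [int(np.argmax(centrality))]

    while len(selected) < min(k, len(embeddings)):
        # Redundancy: highest similarity to any sentence already selected
        redundancy = sim[:, selected].max(axis=1)
        scores = (1 - diversity) * centrality - diversity * redundancy
        scores[selected] = -np.inf  # never pick the same sentence twice
        selected.append(int(np.argmax(scores)))
    return selected

# Toy unit vectors standing in for sentence embeddings
toy_embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
print(select_sentences(toy_embeddings, k=2, diversity=0.5))
print(select_sentences(toy_embeddings, k=2, diversity=0.0))
```

In the notebook, `embeddings` would come from encoding a cluster's sentences with the SBERT model, and `k` could be set per cluster, for instance in proportion to cluster size.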

In [23]:
for topic, cluster_sentences in clustered_sentences.items():
    summary = extractive_summary(cluster_sentences, topic)
    print(summary)

Tokenization is essential for enabling downstream natural language processing (NLP) tasks, such as sentiment analysis and machine translation, as it breaks down complex text into manageable pieces.
Stopword removal is the process of filtering out common words that do not carry substantial meaning, such as “the,” “is,” and “and.” Removing these stopwords reduces the dimensionality of the dataset, which can enhance the performance of NLP models by focusing on more meaningful words. Unlike stemming, which often chops off word endings indiscriminately, lemmatization considers the context and part of speech of a word.
It helps NLP systems comprehend the role of each word in a sentence, leading to better analysis and generation of text.
Named Entity Recognition (NER) is a technique used to identify and categorize entities in text, such as names, locations, dates, and organizations.
POS tagging is crucial for understanding syntactic structures and is applied in tasks like parsing and machine t