
### Load and Preprocess Transcript from a TXT File

In [16]:
import re

# Load transcript from a text file
def load_transcript(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()

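    # Split on whitespace that follows "." or "?". The first lookbehind skips
    # abbreviations such as "e.g.", the second skips titles such as "Mr.".
    # Note: "!" is not treated as a sentence terminator.
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [s.strip() for s in sentences if s.strip()]  # Drop empty fragments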

# Example usage
file_path = "/content/transcribed_text_nptel_video_1.txt"  # Update with actual file path
sentences = load_transcript(file_path)
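
To see what the splitting regex does and does not handle, here is a minimal sketch that applies the same pattern to a short string. `split_sentences` and `sample` are hypothetical names introduced only for this illustration; they are not part of the notebook.

```python
import re

# Same pattern as in load_transcript
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

def split_sentences(text):
    """Hypothetical helper: split a string the same way load_transcript does."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]

sample = "Dr. Smith arrived. Is this e.g. fine? Wow! It works."
for sentence in split_sentences(sample):
    print(repr(sentence))
# 'Dr. Smith arrived.'
# 'Is this e.g. fine?'
# 'Wow! It works.'
```

"Dr." and "e.g." do not trigger a split, but "Wow!" does not either, because the pattern only looks for "." and "?" before the whitespace.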


### Convert Sentences into SBERT Embeddings

In [17]:
from sentence_transformers import SentenceTransformer

# Load SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Convert transcript sentences to embeddings
def get_sentence_embeddings(sentences):
    return model.encode(sentences, convert_to_numpy=True)

# Generate embeddings
sentence_vectors = get_sentence_embeddings(sentences)


### Apply Gaussian Mixture Model (GMM) for Clustering

In [18]:
from sklearn.mixture import GaussianMixture

# Number of mixture components (topics). 6 is a manually chosen value;
# compare BIC/AIC across candidate values to tune it.
num_clusters = 6

# Fit GMM
gmm = GaussianMixture(n_components=num_clusters, random_state=42)
gmm.fit(sentence_vectors)

# Predict cluster labels
labels = gmm.predict(sentence_vectors)
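
The cell above hard-codes `num_clusters` and mentions BIC/AIC without computing them. A minimal sketch of how the comparison could be done is shown here. The helper name `select_num_clusters`, the synthetic `demo_vectors`, and the choice of `covariance_type="diag"` are assumptions made for the illustration (the notebook itself uses the default full covariance); with real data, pass `sentence_vectors` instead.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_num_clusters(vectors, candidate_ks, covariance_type="diag", random_state=42):
    """Hypothetical helper: fit one GMM per candidate k and return (best_k, {k: BIC})."""
    bic_scores = {}
    for k in candidate_ks:
        gmm = GaussianMixture(
            n_components=k,
            covariance_type=covariance_type,  # assumption: diagonal covariance
            random_state=random_state,
        )
        gmm.fit(vectors)
        bic_scores[k] = gmm.bic(vectors)  # lower BIC is better
    best_k = min(bic_scores, key=bic_scores.get)
    return best_k, bic_scores

# Synthetic stand-in for sentence_vectors: three well-separated 2-D blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
demo_vectors = np.vstack([c + rng.normal(scale=0.5, size=(60, 2)) for c in centers])

best_k, bic_scores = select_num_clusters(demo_vectors, range(1, 7))
print("Best k by BIC:", best_k)
for k, score in bic_scores.items():
    print(f"k={k}: BIC={score:.1f}")
```

Replacing `gmm.bic` with `gmm.aic` gives the AIC variant. Diagonal covariance is used here because a full covariance matrix per component has many parameters relative to the number of sentences in a single transcript; that is a design choice for this sketch, not a recommendation from the notebook.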


### Group Sentences by Cluster

In [19]:
from collections import defaultdict

# Group sentences by their assigned topic
clustered_sentences = defaultdict(list)
for label, sentence in zip(labels, sentences):
    clustered_sentences[int(label)].append(sentence)

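# Print topics and their sentences
# (the loop variable is named cluster_sents so the global `sentences` list is not overwritten)
for topic, cluster_sents in clustered_sentences.items():
    print(f"\n🟢 Topic {topic}:")
    for s in cluster_sents:
        print(f"  - {s}")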



🟢 Topic 0:
  - Hello everyone, welcome back to the final lecture of the first week.
  - And we saw that the distribution is not very uniform.
  - And also how they grow with us to each other.
  - So, this you may or may not have to do always and it depends on what is your application.
  - So, now you might feel that this is very trivial task, but let us see is it trivial.
  - Now, do you think there might be certain challenge involved?
  - That is it.
  - Doctor, Mr. MPH.
  - So, again you have numbers 2.44.3 and so on.
  - So, how do we go about solving this.
  - So, any data point that I am seeing, I have to divide into one of these two classes.
  - So, each point you have to divide into one of the two classes.
  - And this in general, this problem in general is called classification problem.
  - You are classifying into one of the two classes.
  - Now, so the idea is very simple.
  - So, you have two classes and each data point you have to divide into one of the two classes.
  - So

### Extract Keywords for Each Cluster

To label each topic, run TF-IDF over the sentences in each cluster and keep the highest-scoring terms.

In [20]:
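import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Get keywords for each topic
def get_top_keywords(sentences, num_words=5):
    # Treat each sentence in the cluster as one document
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(sentences)
    feature_array = vectorizer.get_feature_names_out()
    # Sum each term's TF-IDF weight over all sentences, then rank terms high to low
    term_scores = np.asarray(X.sum(axis=0)).ravel()
    tfidf_sorting = term_scores.argsort()[::-1]
    return [feature_array[i] for i in tfidf_sorting[:num_words]]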

# Print top words per topic
# (cluster_sents avoids overwriting the global `sentences` list)
for topic, cluster_sents in clustered_sentences.items():
    keywords = get_top_keywords(cluster_sents)
    print(f"\n🔹 Topic {topic} Keywords: {keywords}")



🔹 Topic 0 Keywords: ['simple', 'problem', 'classes', 'general', 'rules']

🔹 Topic 3 Keywords: ['features', 'examples', 'various', 'use', 'doing']

🔹 Topic 1 Keywords: ['words', 'word', 'hyphens', 'hyphen', 'different']

🔹 Topic 2 Keywords: ['tokens', 'problem', 'tokenization', 'sentence', 'words']

🔹 Topic 4 Keywords: ['word', 'sanskrit', 'dot', 'problem', 'language']

🔹 Topic 5 Keywords: ['sentence', 'end', 'word', 'say', 'dot']
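
In the output above, `word` and `problem` appear in the keyword lists of several topics, because TF-IDF is fitted separately inside each cluster and has no way to know which terms are common across clusters. One possible variant, which is not what this notebook does, is to treat each cluster as a single document so that terms shared by every cluster are down-weighted. The sketch below uses only the standard library; `cluster_level_keywords`, `STOP_WORDS`, and `demo_clusters` are hypothetical names and data for the illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; far smaller than scikit-learn's 'english' list)
STOP_WORDS = frozenset({"a", "the", "can", "be", "with", "each"})

def cluster_level_keywords(clusters, num_words=5, stop_words=STOP_WORDS):
    """Rank terms per cluster, treating each cluster as a single document.

    clusters: dict mapping topic id -> list of sentences.
    Returns a dict mapping topic id -> list of top terms.
    """
    # Term counts per cluster
    counts = {}
    for topic, sents in clusters.items():
        words = re.findall(r"[a-z]+", " ".join(sents).lower())
        counts[topic] = Counter(w for w in words if w not in stop_words)

    # Cluster frequency: in how many clusters each term occurs
    n_clusters = len(counts)
    cluster_freq = Counter(w for c in counts.values() for w in c)

    keywords = {}
    for topic, c in counts.items():
        total = sum(c.values()) or 1
        # Term frequency within the cluster times log inverse cluster frequency;
        # a term present in every cluster scores 0
        scores = {
            w: (n / total) * math.log(n_clusters / cluster_freq[w])
            for w, n in c.items()
        }
        # Sort by score (high to low), breaking ties alphabetically
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[topic] = ranked[:num_words]
    return keywords

demo_clusters = {
    0: ["Each word becomes a token.", "A token can contain hyphens."],
    1: ["A word can end with a dot.", "The dot can be ambiguous."],
}
demo_keywords = cluster_level_keywords(demo_clusters, num_words=2)
print(demo_keywords)
# {0: ['token', 'becomes'], 1: ['dot', 'ambiguous']}
```

Here `word` occurs in both demo clusters, so it receives a score of zero and drops out of both keyword lists.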
