
### Load and Preprocess Transcript from a TXT File

In [None]:
import re

# Load transcript from a text file
def load_transcript(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()

    # Split text into sentences (basic sentence segmentation)
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [s.strip() for s in sentences if s.strip()]  # Remove empty sentences

# Example usage
file_path = "/content/text_processing.txt"  # Update with actual file path
sentences = load_transcript(file_path)


### Preprocess and Convert Sentences into BERT Embeddings

In [None]:
from sentence_transformers import SentenceTransformer

# Load SBERT model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert transcript sentences to embeddings
def get_sentence_embeddings(sentences):
    return model.encode(sentences, convert_to_numpy=True)

# Generate embeddings
sentence_vectors = get_sentence_embeddings(sentences)


### Apply Gaussian Mixture Model (GMM) for Clustering

In [None]:
from sklearn.mixture import GaussianMixture

# Number of mixture components (hard-coded here; BIC/AIC can be used to choose it)
num_clusters = 5

# Fit GMM
gmm = GaussianMixture(n_components=num_clusters, random_state=42)
gmm.fit(sentence_vectors)

# Predict cluster labels
labels = gmm.predict(sentence_vectors)
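
The comment in the cell above mentions BIC/AIC, but the number of clusters is set by hand. The following is a minimal sketch of how a BIC sweep could look. The helper name `select_n_components_by_bic`, the candidate range, and the synthetic `demo_vectors` are assumptions made for illustration; `covariance_type="diag"` is also an assumption, chosen to keep the parameter count small, and differs from the notebook's default. With only a handful of sentences and high-dimensional embeddings, BIC values should be read with caution.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_n_components_by_bic(X, candidates, covariance_type="diag", random_state=42):
    """Fit one GMM per candidate size and return (best_k, {k: bic}).

    Lower BIC is better. Hypothetical helper, not part of the notebook.
    """
    scores = {}
    for k in candidates:
        gmm = GaussianMixture(
            n_components=k,
            covariance_type=covariance_type,
            random_state=random_state,
        )
        gmm.fit(X)
        scores[k] = gmm.bic(X)
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for sentence_vectors: three well-separated groups
rng = np.random.default_rng(0)
demo_vectors = np.vstack(
    [rng.normal(loc=center, scale=0.3, size=(40, 4)) for center in (-4.0, 0.0, 4.0)]
)

best_k, bic_scores = select_n_components_by_bic(demo_vectors, candidates=range(1, 7))
print("best k:", best_k)
for k, score in bic_scores.items():
    print(f"k={k}: BIC={score:.1f}")
```

In the notebook, `sentence_vectors` would take the place of `demo_vectors`, with the largest candidate kept well below the number of sentences.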


### Group Sentences by Cluster

In [None]:
from collections import defaultdict

# Group sentences by their assigned topic
clustered_sentences = defaultdict(list)
for i, label in enumerate(labels):
    clustered_sentences[label].append(sentences[i])

# Print topics and their sentences
# (loop variable renamed so it does not overwrite the global `sentences` list)
for topic, topic_sentences in clustered_sentences.items():
    print(f"\n🟢 Topic {topic}:")
    for s in topic_sentences:
        print(f"  - {s}")



🟢 Topic 2:
  - Tokenization is a foundational step in text processing where text is divided into smaller units called tokens.
  - These tokens can be individual words or sentences, depending on the granularity required.
  - For example, tokenizing the sentence “I’m learning NLP!” results in the tokens ["I", "'m", "learning", "NLP", "!"].
  - Tokenization is essential for enabling downstream natural language processing (NLP) tasks, such as sentiment analysis and machine translation, as it breaks down complex text into manageable pieces.
  - Libraries like NLTK and spaCy provide efficient tokenization methods, making it easy to prepare text for analysis.

🟢 Topic 0:
  - Stopword removal is the process of filtering out common words that do not carry substantial meaning, such as “the,” “is,” and “and.” Removing these stopwords reduces the dimensionality of the dataset, which can enhance the performance of NLP models by focusing on more meaningful words.
  - For instance, the sentence “She

### Extract Keywords for Each Cluster

To get topic keywords, use TF-IDF on sentences inside each cluster.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Get keywords for each topic
def get_top_keywords(sentences, num_words=5):
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(sentences)
    feature_array = vectorizer.get_feature_names_out()
    tfidf_sorting = X.sum(axis=0).A1.argsort()[::-1]
    return [feature_array[i] for i in tfidf_sorting[:num_words]]

# Print top words per topic
# (loop variable renamed so it does not overwrite the global `sentences` list)
for topic, topic_sentences in clustered_sentences.items():
    keywords = get_top_keywords(topic_sentences)
    print(f"\n🔹 Topic {topic} Keywords: {keywords}")



🔹 Topic 2 Keywords: ['text', 'nlp', 'tokens', 'learning', 'tokenization']

🔹 Topic 0 Keywords: ['word', 'lemmatization', 'words', 'stopwords', 'nlp']

🔹 Topic 4 Keywords: ['search', 'text', 'like', 'unstructured', 'tasks']

🔹 Topic 1 Keywords: ['like', 'models', 'musk', 'tesla', 'elon']

🔹 Topic 3 Keywords: ['tagging', 'like', 'pos', 'quick', 'fox']


The clusters appear to correspond to: tokenization (Topic 2), stopword removal and lemmatization (Topic 0), importance (Topic 4), NER (Topic 1), and POS tagging (Topic 3).

### Trying 6 clusters to see whether stemming and lemmatization are grouped differently

In [None]:
import re

# Load transcript from a text file
def load_transcript(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()

    # Split text into sentences (basic sentence segmentation)
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [s.strip() for s in sentences if s.strip()]  # Remove empty sentences

# Example usage
file_path = "/content/text_processing.txt"  # Update with actual file path
sentences = load_transcript(file_path)

In [None]:
from sentence_transformers import SentenceTransformer

# Load SBERT model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert transcript sentences to embeddings
def get_sentence_embeddings(sentences):
    return model.encode(sentences, convert_to_numpy=True)

# Generate embeddings
sentence_vectors = get_sentence_embeddings(sentences)


In [None]:
### GMM for clustering
from sklearn.mixture import GaussianMixture

# Number of mixture components (hard-coded here; BIC/AIC can be used to choose it)
num_clusters = 6

# Fit GMM
gmm = GaussianMixture(n_components=num_clusters, random_state=42)
gmm.fit(sentence_vectors)

# Predict cluster labels
labels = gmm.predict(sentence_vectors)

In [None]:
from collections import defaultdict

# Group sentences by their assigned topic
clustered_sentences = defaultdict(list)
for i, label in enumerate(labels):
    clustered_sentences[label].append(sentences[i])

# Print topics and their sentences
# (loop variable renamed so it does not overwrite the global `sentences` list)
for topic, topic_sentences in clustered_sentences.items():
    print(f"\n🟢 Topic {topic}:")
    for s in topic_sentences:
        print(f"  - {s}")



🟢 Topic 2:
  - Tokenization is a foundational step in text processing where text is divided into smaller units called tokens.
  - These tokens can be individual words or sentences, depending on the granularity required.
  - For example, tokenizing the sentence “I’m learning NLP!” results in the tokens ["I", "'m", "learning", "NLP", "!"].
  - Tokenization is essential for enabling downstream natural language processing (NLP) tasks, such as sentiment analysis and machine translation, as it breaks down complex text into manageable pieces.
  - Lemmatization involves converting words to their base or dictionary form, known as the lemma.

🟢 Topic 5:
  - Libraries like NLTK and spaCy provide efficient tokenization methods, making it easy to prepare text for analysis.
  - Tools like WordNetLemmatizer in NLTK and spaCy’s lemmatization functions make it straightforward to incorporate this into preprocessing pipelines.
  - This process is vital for tasks like information extraction and organizin

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Get keywords for each topic
def get_top_keywords(sentences, num_words=5):
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(sentences)
    feature_array = vectorizer.get_feature_names_out()
    tfidf_sorting = X.sum(axis=0).A1.argsort()[::-1]
    return [feature_array[i] for i in tfidf_sorting[:num_words]]

# Print top words per topic
# (loop variable renamed so it does not overwrite the global `sentences` list)
for topic, topic_sentences in clustered_sentences.items():
    keywords = get_top_keywords(topic_sentences)
    print(f"\n🔹 Topic {topic} Keywords: {keywords}")



🔹 Topic 2 Keywords: ['text', 'nlp', 'tokens', 'learning', 'words']

🔹 Topic 5 Keywords: ['like', 'text', 'nltk', 'spacy', 'datasets']

🔹 Topic 0 Keywords: ['word', 'reading', 'book', 'stopwords', 'nlp']

🔹 Topic 4 Keywords: ['search', 'word', 'nlp', 'better', 'comprehend']

🔹 Topic 1 Keywords: ['like', 'models', 'musk', 'tesla', 'elon']

🔹 Topic 3 Keywords: ['tagging', 'like', 'pos', 'quick', 'fox']


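This didn't work well: with 6 clusters, the lemmatization definition still lands in the tokenization cluster (Topic 2), and the sentences about libraries are split off into their own cluster (Topic 5). Fine-tuning is likely a better option.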
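
The comparison between the 5-cluster and 6-cluster runs is made by eye. A numeric score can complement that reading. The sketch below uses scikit-learn's `silhouette_score` with cosine distance; the wrapper `cosine_silhouette` and the synthetic demo data are assumptions for illustration, and with very few sentences the score is only a rough indicator.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def cosine_silhouette(vectors, labels):
    """Mean silhouette under cosine distance (range -1 to 1, higher is better).

    Returns NaN when the score is undefined: fewer than 2 clusters,
    or as many clusters as samples.
    """
    labels = np.asarray(labels)
    n_clusters = len(np.unique(labels))
    if n_clusters < 2 or n_clusters >= len(labels):
        return float("nan")
    return float(silhouette_score(vectors, labels, metric="cosine"))


# Synthetic stand-in for embeddings: two groups pointing in different directions
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[1.0, 0.0, 0.0], scale=0.05, size=(10, 3))
group_b = rng.normal(loc=[0.0, 1.0, 0.0], scale=0.05, size=(10, 3))
demo_vectors = np.vstack([group_a, group_b])

good_labels = np.array([0] * 10 + [1] * 10)  # matches the true groups
bad_labels = np.array([0, 1] * 10)           # alternates, ignoring the groups

print("matching labels:   ", cosine_silhouette(demo_vectors, good_labels))
print("alternating labels:", cosine_silhouette(demo_vectors, bad_labels))
```

In the notebook, calling `cosine_silhouette(sentence_vectors, labels)` after each run would give one number per choice of `num_clusters`.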

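### Since the clustering is not working as expected, we make some changes to improve it

1. In the GMM, set covariance_type='full' to sharpen the clusters: each component has its own full covariance matrix, which allows a unique shape, orientation, and size in all dimensions. This is the most flexible option but also the most computationally expensive. Note that 'full' is already the default for scikit-learn's GaussianMixture, so the earlier runs used it as well; setting it explicitly only documents the choice.

2. Normalize the embeddings (done here with StandardScaler, which standardizes each embedding dimension to zero mean and unit variance).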
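
To make the flexibility and cost trade-off of the covariance types concrete, the sketch below counts the free parameters of a GMM for each `covariance_type`. The helper `gmm_param_count` is a hypothetical name; the formulas are the standard counts for means, mixture weights, and covariance entries. The dimension 768 is the embedding size of all-mpnet-base-v2.

```python
def gmm_param_count(n_components, n_features, covariance_type="full"):
    """Number of free parameters in a Gaussian mixture model."""
    k, d = n_components, n_features
    if covariance_type == "full":
        cov = k * d * (d + 1) // 2   # one symmetric d x d matrix per component
    elif covariance_type == "tied":
        cov = d * (d + 1) // 2       # one symmetric matrix shared by all components
    elif covariance_type == "diag":
        cov = k * d                  # one variance per dimension per component
    elif covariance_type == "spherical":
        cov = k                      # one variance per component
    else:
        raise ValueError(f"unknown covariance_type: {covariance_type!r}")
    means = k * d
    weights = k - 1                  # weights sum to 1, so one is determined
    return cov + means + weights


for cov_type in ("full", "tied", "diag", "spherical"):
    print(cov_type, gmm_param_count(5, 768, cov_type))
```

With 5 components and 768 dimensions, 'full' has 1,480,324 free parameters against 7,684 for 'diag', while the transcript provides only 22 sentences. A more constrained covariance type may therefore be worth trying on data this small.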

In [None]:
import re

# Load transcript from a text file
def load_transcript(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()

    # Split text into sentences (basic sentence segmentation)
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [s.strip() for s in sentences if s.strip()]  # Remove empty sentences

# Example usage
file_path = "/content/text_processing.txt"  # Update with actual file path
sentences = load_transcript(file_path)

In [None]:
from sentence_transformers import SentenceTransformer

# Load SBERT model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Generate embeddings
embeddings = model.encode(sentences, convert_to_tensor=False)



### Normalize the embeddings

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
normalized_embeddings = scaler.fit_transform(embeddings)
print(len(normalized_embeddings))

22
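
`StandardScaler` rescales each embedding dimension (column) across sentences, which is different from scaling each sentence vector (row) to unit length. The sketch below contrasts the two on random stand-in data; `demo_embeddings` is made up for illustration, and which variant suits GMM clustering better here is an open question rather than something the notebook establishes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Random stand-in for the (n_sentences, n_dims) embedding matrix
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(loc=2.0, scale=3.0, size=(22, 8))

# Column-wise: every dimension gets zero mean and unit variance
standardized = StandardScaler().fit_transform(demo_embeddings)

# Row-wise: every sentence vector gets Euclidean length 1
unit_length = normalize(demo_embeddings, norm="l2")

print("column means after StandardScaler:", standardized.mean(axis=0).round(6))
print("column stds after StandardScaler: ", standardized.std(axis=0).round(6))
print("row norms after L2 normalize:     ", np.linalg.norm(unit_length, axis=1).round(6))
```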


In [None]:
### GMM for clustering
from sklearn.mixture import GaussianMixture

# Apply Gaussian Mixture Model clustering and get one label per sentence
# (fit_predict fits the model and returns the labels in a single call)
gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=42)
labels = gmm.fit_predict(normalized_embeddings)

In [None]:
from collections import defaultdict

# Group sentences by their assigned topic
clustered_sentences = defaultdict(list)
for i, label in enumerate(labels):
    clustered_sentences[label].append(sentences[i])

# Print topics and their sentences
# (loop variable renamed so it does not overwrite the global `sentences` list)
for topic, topic_sentences in clustered_sentences.items():
    print(f"\n🟢 Topic {topic}:")
    for s in topic_sentences:
        print(f"  - {s}")



🟢 Topic 2:
  - Tokenization is a foundational step in text processing where text is divided into smaller units called tokens.
  - These tokens can be individual words or sentences, depending on the granularity required.
  - For example, tokenizing the sentence “I’m learning NLP!” results in the tokens ["I", "'m", "learning", "NLP", "!"].
  - Tokenization is essential for enabling downstream natural language processing (NLP) tasks, such as sentiment analysis and machine translation, as it breaks down complex text into manageable pieces.
  - This step is particularly useful in applications like search engines, where reducing noise can lead to more relevant search results.
  - Lemmatization involves converting words to their base or dictionary form, known as the lemma.

🟢 Topic 3:
  - Libraries like NLTK and spaCy provide efficient tokenization methods, making it easy to prepare text for analysis.
  - Tools like WordNetLemmatizer in NLTK and spaCy’s lemmatization functions make it straig

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Get keywords for each topic
def get_top_keywords(sentences, num_words=5):
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(sentences)
    feature_array = vectorizer.get_feature_names_out()
    tfidf_sorting = X.sum(axis=0).A1.argsort()[::-1]
    return [feature_array[i] for i in tfidf_sorting[:num_words]]

# Print top words per topic
# (loop variable renamed so it does not overwrite the global `sentences` list)
for topic, topic_sentences in clustered_sentences.items():
    keywords = get_top_keywords(topic_sentences)
    print(f"\n🔹 Topic {topic} Keywords: {keywords}")



🔹 Topic 2 Keywords: ['text', 'tokens', 'nlp', 'words', 'learning']

🔹 Topic 3 Keywords: ['like', 'text', 'tagging', 'pos', 'spacy']

🔹 Topic 0 Keywords: ['word', 'reading', 'book', 'stopwords', 'nlp']

🔹 Topic 1 Keywords: ['like', 'models', 'musk', 'tesla', 'elon']

🔹 Topic 4 Keywords: ['quick', 'jumps', 'fox', 'vbz', 'tags']
