
In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture
from sklearn.metrics.pairwise import cosine_similarity
import re

In [None]:
# Step 1: Load and preprocess the text file
def load_and_preprocess(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    # Split into sentences on whitespace that follows a period or question mark,
    # skipping abbreviations such as "Dr." or "e.g."
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', content)
    # Clean sentences (keep periods, remove all other punctuation, and strip surrounding whitespace)
    sentences = [
        re.sub(r'[^a-zA-Z0-9\s\.]', '', sentence).strip() for sentence in sentences if sentence.strip()
    ]
    return sentences
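
To see what the splitting and cleaning rules in `load_and_preprocess` do, the sketch below applies the same two regular expressions to an in-memory string instead of a file. The helper name `split_and_clean` and the sample text are hypothetical, introduced only for illustration.

```python
import re

def split_and_clean(text):
    # Hypothetical helper: same regexes as load_and_preprocess, applied to a string
    parts = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    # Keep letters, digits, whitespace, and periods; drop everything else
    return [re.sub(r'[^a-zA-Z0-9\s\.]', '', p).strip() for p in parts if p.strip()]

sample = "Dr. Smith built a mid-20th century machine. Was it fast? Yes, very."
cleaned = split_and_clean(sample)
for sentence in cleaned:
    print(sentence)
```

The abbreviation "Dr." does not trigger a split, while the cleaning step drops hyphens, commas, and question marks. That is why the cluster listings contain forms such as "mid20th".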

In [None]:
# Step 2: Preprocess and vectorize sentences
def preprocess_and_vectorize(sentences):
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')
    sentence_vectors = tfidf_vectorizer.fit_transform(sentences).toarray()
    return sentence_vectors
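
As a quick check of what `preprocess_and_vectorize` produces, the sketch below vectorizes three made-up sentences (the toy sentences are an assumption, not taken from the dataset). `TfidfVectorizer` applies L2 normalization by default (`norm='l2'`), so every row that contains at least one non-stop word has unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences, invented for illustration only
toy_sentences = [
    "Computers process data using binary logic.",
    "Solar panels convert sunlight into electricity.",
    "Computers connect through the internet.",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(toy_sentences).toarray()

# One row per sentence, one column per vocabulary term
print(vectors.shape)

# Row lengths: 1.0 for every sentence that kept at least one term
norms = np.linalg.norm(vectors, axis=1)
print(norms)
```

With unit-length rows, the cosine similarity between two sentences equals their dot product, which is what the centroid comparison in Step 4 relies on.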

In [None]:
# Step 3: Fit GMM
def fit_gmm(sentence_vectors, n_clusters):
    gmm = GaussianMixture(n_components=n_clusters, random_state=42)
    gmm.fit(sentence_vectors)
    labels = gmm.predict(sentence_vectors)
    return labels
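
The example cell's comment says the number of clusters was chosen with an elbow method, but the notebook does not show that step. One common way to do this for a Gaussian mixture, sketched here as an assumption rather than the author's procedure, is to fit models for several candidate counts and compare their BIC scores via `GaussianMixture.bic`. The helper `bic_curve`, the toy data, and the choice of `covariance_type="diag"` (used to keep the parameter count small) are all illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bic_curve(vectors, candidate_counts, random_state=42):
    # Hypothetical helper: fit one GMM per candidate count and collect its BIC
    scores = []
    for k in candidate_counts:
        gmm = GaussianMixture(
            n_components=k,
            covariance_type="diag",  # assumption: fewer parameters than the default "full"
            random_state=random_state,
        )
        gmm.fit(vectors)
        scores.append(gmm.bic(vectors))  # lower BIC indicates a better trade-off
    return scores

# Toy data: three well-separated 2-D blobs, invented for illustration
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 2)) for c in (0.0, 3.0, 6.0)])

counts = [1, 2, 3, 4, 5]
scores = bic_curve(toy, counts)
for k, score in zip(counts, scores):
    print(k, round(score, 1))
```

Plotting `scores` against `counts` and looking for the point where the curve flattens (or reaches its minimum) gives a candidate value for `n_clusters`.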

In [None]:
# Step 4: Extract representative sentences and view all sentences in each cluster
def extract_summary_and_view_clusters(sentences, sentence_vectors, labels):
    summary = []
    clusters = {}
    unique_labels = np.unique(labels)

    for label in unique_labels:
        # Get indices of sentences in the current cluster
        cluster_indices = np.where(labels == label)[0]

        # Group sentences in clusters
        clusters[label] = [sentences[i] for i in cluster_indices]

        # Find the most central sentence: the one closest (by cosine similarity) to the cluster mean
        cluster_vectors = sentence_vectors[cluster_indices]
        cluster_center = np.mean(cluster_vectors, axis=0)
        similarities = cosine_similarity([cluster_center], cluster_vectors)[0]
        central_index = cluster_indices[np.argmax(similarities)]
        summary.append(sentences[central_index])

    # Join with a space so that consecutive summary sentences do not run together
    return ' '.join(summary), clusters
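
The centroid rule in Step 4 can be traced by hand on a small example. The sketch below uses invented 2-D vectors and labels, and the helper name `most_central_index` is hypothetical. For each cluster it picks the member whose cosine similarity to the cluster mean is highest.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def most_central_index(vectors, member_indices):
    # Hypothetical helper: index of the member closest to the cluster mean
    members = vectors[member_indices]
    center = members.mean(axis=0)
    similarities = cosine_similarity([center], members)[0]
    return int(member_indices[np.argmax(similarities)])

# Toy vectors and labels, invented for illustration
toy_vectors = np.array([
    [1.0, 0.0], [0.9, 0.1], [0.6, 0.8],  # cluster 0
    [0.0, 1.0], [0.2, 0.8], [0.5, 0.5],  # cluster 1
])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

picks = {
    int(label): most_central_index(toy_vectors, np.where(toy_labels == label)[0])
    for label in np.unique(toy_labels)
}
print(picks)  # {0: 1, 1: 4}
```

Because the loop visits clusters in label order, the summary follows cluster numbering rather than the order in which the sentences appear in the source text.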

In [None]:
# Example
file_path = 'irrelevant_document.txt'
sentences = load_and_preprocess(file_path)
sentence_vectors = preprocess_and_vectorize(sentences)

n_clusters = 11  # the elbow method suggested 6 as the optimal number of clusters for this dataset
labels = fit_gmm(sentence_vectors, n_clusters)
summary, clusters = extract_summary_and_view_clusters(sentences, sentence_vectors, labels)

# Print sentences in each cluster
for cluster_id, cluster_sentences in clusters.items():
    print(f"Cluster {cluster_id}:")
    for sentence in cluster_sentences:
        print(f"  - {sentence}")
    print()

Cluster 0:
  - Over the decades computers have evolved into compact affordable and versatile tools that are integral to our daily lives.
  - One of the significant milestones in computer history was the invention of the internet.
  - The internet transformed computers from standalone devices into interconnected tools of communication and information exchange.
  - Today billions of devices are connected via the internet forming a global network that has fostered innovation and connectivity.
  - In the healthcare sector computers are used for patient record management diagnostic tools and even robotic surgeries.

Cluster 1:
  - Initially these machines were massive expensive and designed for specialized purposes.
  - In conclusion computers are much more than machines they are catalysts of progress and innovation.

Cluster 2:
  - Computers have revolutionized the way we live work and communicate.
  - Their history dates back to the mid20th century when the first electronic computers were

In [None]:
# Save the summarized text to a file
with open("irrelevant_gmm_centroid_method_11_clusters.txt", "w", encoding="utf-8") as file:
    file.write(summary)

print("Summarization completed.")

Summarization completed.


In [None]:
summary

'The internet transformed computers from standalone devices into interconnected tools of communication and information exchange.In conclusion computers are much more than machines they are catalysts of progress and innovation.Looking to the future the potential of computers seems boundless.Addressing these challenges requires a balanced approach that leverages technology while ensuring responsible use.These calculations enable tasks ranging from simple arithmetic to sophisticated artificial intelligence.In education elearning platforms and virtual classrooms have made learning accessible to millions worldwide.Solar panels convert sunlight into electricity making them a popular choice for sustainable energy solutions.At their core computers operate by processing data using binary logic.The Amazon rainforest is often referred to as the lungs of the Earth due to its vast biodiversity and oxygen production.Penguins are flightless birds that primarily inhabit the icy regions of Antarctica.A