# Extractive Summarization

This notebook demonstrates extractive summarization: selecting the most important sentences from a text and using them, unchanged, as the summary. This is useful for condensing large volumes of text. We will split a text into sentences, embed each sentence with a pre-trained model, cluster the embeddings, and pick one representative sentence per cluster.

We begin by importing the libraries we will be using. spaCy is used for sentence tokenization, `SentenceTransformer` for generating sentence embeddings, `numpy` for numerical operations, scikit-learn for standardization and PCA, and `matplotlib` for plotting.

**Run this notebook using the "Torch" Kernel.**

In [None]:
from spacy.lang.en import English
from sentence_transformers import SentenceTransformer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

We now create a blank spaCy pipeline for English and add a `sentencizer` component, which detects sentence boundaries using punctuation rules. This prepares us to split our text into individual sentences.

In [None]:
spacy = English()
spacy.add_pipe("sentencizer")

### Sentence Splitting
We now define a function to split a text into sentences. This step is required because the summarizer selects whole sentences.

In [None]:
def split(text):
  """
  Splits a text into sentences using the Spacy library.

  Parameters
  ----------
  text : str
      The text to be split into sentences.

  Returns
  -------
  list
      The list of sentences in the text.
  """
  processed_text = spacy(text)
  processed_sentences = processed_text.sents

  # TODO: Return a list with all sentences. For a processed_sentence s,
  # you get its text with the attribute s.text
  return []

We now load the text to be summarized from the file `cinderella.txt`. This text will be processed through our summarization pipeline.

In [None]:
text = ""
with open("cinderella.txt", "r", encoding="utf-8") as file:
    text = file.read()

The next cell applies the previously defined `split` function to our text and prints the individual sentences, allowing us to see how the text has been divided.

In [None]:
# TODO: Use the split function you implemented to print all sentences in the text above.
sentences = None

We now load a pre-trained model to generate embeddings for each sentence. These embeddings capture the semantic meaning of sentences in a high-dimensional space.

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')
sentence_embeddings = model.encode(sentences)
print(f'This {sentence_embeddings[0].shape[0]}-dimensional vector is an abstract representation of the sentence:')
print(sentences[0])
print(sentence_embeddings[0])
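To build some intuition for what "semantic meaning in a high-dimensional space" means in practice, the sketch below compares vectors by cosine similarity, a common way to compare sentence embeddings. The three vectors are hypothetical, hand-made stand-ins for real embeddings, and `cosine_similarity` is a helper defined here, not part of the notebook.

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional "embeddings", made up for illustration only.
vec_a = np.array([1.0, 2.0, 0.0])
vec_b = np.array([2.0, 4.0, 0.0])   # same direction as vec_a
vec_c = np.array([0.0, 0.0, 3.0])   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

The same helper can be applied to two rows of `sentence_embeddings` to check whether sentences you consider similar also end up close to each other.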

## Positional Encoding
The sentence embedding model processes each sentence independently. The model has no way of knowing where a sentence occurs in the overall text, so this information is missing from the embeddings. Positional encoding lets us add it back. We start without positional encoding (i.e., "Option 0" below); you can later experiment with two types of positional encoding.

In [None]:
def positional_encoding(seq_length, embedding_dim):
    positions = np.arange(seq_length)[:, np.newaxis]
    div_term = np.exp(np.arange(0, embedding_dim, 2) * (-np.log(10000.0) / embedding_dim))
    
    pe = np.zeros((seq_length, embedding_dim))
    pe[:, 0::2] = np.sin(positions * div_term)  # Apply sin to even indices
    pe[:, 1::2] = np.cos(positions * div_term)  # Apply cos to odd indices
    
    return pe

pos_enc = positional_encoding(seq_length=len(sentences), embedding_dim=sentence_embeddings.shape[1])

# Option 0: no positional encoding (the starting point)
enhanced_embeddings = sentence_embeddings

# Option 1: Concatenation (doubles dimensionality)
# enhanced_embeddings = np.concatenate([sentence_embeddings, pos_enc], axis=1)  # Uncomment to use concatenation

# Option 2: Addition (keeps the same dimensionality)
# enhanced_embeddings = sentence_embeddings + pos_enc  # Uncomment to use addition
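When experimenting with the addition option, it helps to know how large the positional encoding is compared with the sentence embeddings. The sketch below repeats the sinusoidal formula from `positional_encoding` under a hypothetical name (`sinusoidal_encoding`) and measures the length of each row. Because every sine/cosine pair contributes exactly 1 to the squared length, each row has norm `sqrt(embedding_dim / 2)`, regardless of position.

```python
import numpy as np

def sinusoidal_encoding(seq_length, embedding_dim):
    # Same formula as positional_encoding above; assumes an even embedding_dim.
    positions = np.arange(seq_length)[:, np.newaxis]
    div_term = np.exp(np.arange(0, embedding_dim, 2) * (-np.log(10000.0) / embedding_dim))
    pe = np.zeros((seq_length, embedding_dim))
    pe[:, 0::2] = np.sin(positions * div_term)  # even indices
    pe[:, 1::2] = np.cos(positions * div_term)  # odd indices
    return pe

pe_demo = sinusoidal_encoding(seq_length=5, embedding_dim=384)
pe_norms = np.linalg.norm(pe_demo, axis=1)

print(pe_norms)           # one norm per position
print(np.sqrt(384 / 2))   # the value every row norm should match
```

It is worth comparing this with `np.linalg.norm(sentence_embeddings, axis=1)`. If the sentence embeddings turn out to be much shorter than the positional vectors, plain addition lets position dominate meaning, and scaling `pos_enc` by a small factor before adding is one option to try.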

## Dimensionality Reduction
Next, we will use PCA (a linear dimensionality reduction technique) to reduce the number of dimensions. This has computational benefits and may also remove some irrelevant aspects ("noise"), which can yield a better clustering result.

In [None]:
scaler = StandardScaler()
embeddings_scaled = scaler.fit_transform(enhanced_embeddings)

# Apply full PCA (without specifying n_components first - hence a complete decomposition will be computed)
pca = PCA()
pca.fit(embeddings_scaled)

# Cumulative Explained Variance
cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(8,5))
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker='o', linestyle='--')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance vs. Number of Components')
plt.grid()
plt.show()

variance_threshold = 0.95

# Find the number of components that retain at least `variance_threshold` variance
optimal_components = np.sum(cumulative_explained_variance < variance_threshold) + 1

print(f"{optimal_components} components are needed to capture {variance_threshold*100}% of variance.")
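The component count above relies on the cumulative curve being non-decreasing: counting the entries below the threshold and adding one gives the first position at which the threshold is reached. The sketch below checks this on a hypothetical five-value curve (the numbers are made up for illustration) and compares it with `np.searchsorted`, which finds the same position by binary search.

```python
import numpy as np

# Hypothetical cumulative explained-variance curve, for illustration only.
cumulative_demo = np.array([0.50, 0.80, 0.93, 0.97, 1.00])
threshold_demo = 0.95

# Counting approach, as used in the cell above.
n_by_counting = int(np.sum(cumulative_demo < threshold_demo) + 1)

# Binary search for the first entry >= threshold (0-based), then +1 for a count.
n_by_search = int(np.searchsorted(cumulative_demo, threshold_demo) + 1)

print(n_by_counting, n_by_search)
```

Here three entries lie below 0.95, so four components are needed under both approaches.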

Now, we will actually reduce the number of dimensions in the dataset:

In [None]:
# Apply PCA with optimal components
pca = PCA(n_components=optimal_components)
embeddings_pca = pca.fit_transform(embeddings_scaled)

print(f"Reduced Embeddings Shape: {embeddings_pca.shape}") 
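For readers who want to see what PCA computes, here is a minimal NumPy sketch based on the singular value decomposition. It is an illustration on hypothetical random data, not a replacement for scikit-learn's `PCA`; unlike the notebook, it only centers the data and skips the standardization step.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 samples, 4 features with very different spreads.
toy_data = rng.normal(size=(50, 4)) * np.array([5.0, 1.0, 0.5, 0.1])

# PCA works on centered data.
centered = toy_data - toy_data.mean(axis=0)

# Rows of vt are the principal directions; s holds the singular values
# in descending order.
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Share of the total variance captured by each component.
explained_variance_ratio = s**2 / np.sum(s**2)

# Project onto the first n_keep principal directions.
n_keep = 2
projected = centered @ vt[:n_keep].T

print(explained_variance_ratio)
print(projected.shape)
```

The ratios sum to 1, which is why their cumulative sum can be compared against a threshold such as 0.95.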

## K-Means Clustering
This section groups the reduced sentence embeddings with K-Means clustering. Sentences in the same cluster have similar embeddings and therefore tend to have similar meanings, so picking one sentence per cluster gives a summary that covers the different topics in the text.
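As background, K-Means alternates two steps: assign every point to its nearest centroid, then move every centroid to the mean of its assigned points. The sketch below implements this loop in plain NumPy on hypothetical 2-D points. The function `lloyd_kmeans` and the toy data are illustrative assumptions; the notebook itself uses scikit-learn's `KMeans`, which adds better initialization and convergence checks.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (fancy indexing copies).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical data: two well-separated groups of three points each.
toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_labels, toy_centroids = lloyd_kmeans(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

The first three points end up in one cluster and the last three in the other; which cluster gets the label 0 depends on the random initialization.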

In [None]:
from sklearn.cluster import KMeans

k = 6

In [None]:
# TODO: Apply K-Means to the reduced sentence embeddings (embeddings_pca).
kmeans = None
cluster_ids = None

In [None]:
# Construct a dictionary of k lists, where each list contains the sentences and the sentence embeddings of one cluster.
clustered_sentences = {i: [] for i in range(k)}
clustered_embeddings = {i: [] for i in range(k)}

for sentence, embedding, cluster_id in zip(sentences, embeddings_pca, cluster_ids):
    clustered_sentences[cluster_id].append(sentence)
    clustered_embeddings[cluster_id].append(embedding)

In [None]:
# Print the sentences in each cluster.
for cluster_id, sentence_list in clustered_sentences.items():
    print(f"Cluster {cluster_id} consists of the following sentences:")
    for sentence in sentence_list:
        print("   " + sentence)

We now compute the *prototypical sentence* of each cluster. This is the sentence whose embedding is closest to the cluster's center of mass (its centroid). In other words, it is the sentence that best represents the cluster as a whole. We begin by defining two auxiliary functions.

In [None]:
def euclidean_distance(u, v):
    return np.linalg.norm(u - v)

def find_prototypical(centroid, vectors):
    # Returns the **index** of the vector that is closest to the centroid.
    # If several vectors are equally close, the first one wins.
    distances = []
    for vector in vectors:
        d = euclidean_distance(centroid, vector)
        distances.append(d)
    
    index_closest_vector = 0
    closest_dist = distances[0]
    for i, di in enumerate(distances):
        if di < closest_dist:
            index_closest_vector = i
            closest_dist = di
    
    return index_closest_vector
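The same search can be written without explicit loops. The sketch below is an equivalent vectorized variant under a hypothetical name (`closest_to_centroid`), shown on three made-up 2-D vectors; like the loop version, it returns the first index when several vectors are equally close.

```python
import numpy as np

def closest_to_centroid(centroid, vectors):
    # Stack the vectors into one array so all distances are computed at once.
    vectors = np.asarray(vectors)
    distances = np.linalg.norm(vectors - centroid, axis=1)
    return int(np.argmin(distances))

# Hypothetical cluster of three vectors; the centroid is their mean.
toy_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
toy_centroid = toy_vectors.mean(axis=0)

print(closest_to_centroid(toy_centroid, toy_vectors))
```

The centroid is roughly (1.67, 1.67), so the middle vector at index 1 is the closest.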

In [None]:
# Obtain cluster centroids from k-Means
cluster_centroids = kmeans.cluster_centers_

# Create a dictionary that maps each cluster index to the cluster's prototypical sentence.
prototypical_sentence_indices = {}

# Loop through clusters
for cluster_id in range(k):

    # Get cluster centroid
    cluster_centroid = cluster_centroids[cluster_id]
    
    # Get cluster embeddings
    cluster_embeddings = clustered_embeddings[cluster_id]
    
    # Find the prototypical embedding in this cluster
    prototypical_index = find_prototypical(cluster_centroid, cluster_embeddings)
    
    # And store its index
    prototypical_sentence_indices[cluster_id] = prototypical_index


In [None]:
# Now we arrange the sentences accordingly
prototypical_sentences = []
for cluster_id in range(k):

  # Get the sentences in this cluster
  cluster_sentences = clustered_sentences[cluster_id]

  # Get the index of the prototypical sentence
  prototypical_sentence_index = prototypical_sentence_indices[cluster_id]

  # Store the prototypical sentence
  prototypical_sentences.append(
    cluster_sentences[prototypical_sentence_index]
  )

# We now have the k summary sentences that we were looking for...
# but they are shuffled as a result of the clustering
original_indices = []
for sent in prototypical_sentences:
  original_indices.append(sentences.index(sent))

# We re-organize them based on their original order in the document
new_indices = np.argsort(original_indices)
prototypical_sentences = [prototypical_sentences[i] for i in new_indices]

extractive_summary_sentences = prototypical_sentences
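One caveat about the lookup above: `list.index` returns the position of the first matching element, so if the text contains the same sentence twice, a prototypical sentence may be attributed to the earlier occurrence. The sketch below shows the effect on a hypothetical four-sentence text and one possible alternative, which is to keep track of original indices rather than sentence strings. All names and data here are made up for illustration.

```python
# Hypothetical text in which one sentence occurs twice.
toy_sentences = ["She cried.", "The prince came.", "She cried.", "They married."]

# list.index finds the first occurrence, even if we meant the second one.
first_match = toy_sentences.index(toy_sentences[2])
print(first_match)

# Alternative: store the original indices of the chosen sentences
# (hypothetical picks, one per cluster) and sort those directly.
chosen_indices = [2, 3, 1]
ordered_summary = [toy_sentences[i] for i in sorted(chosen_indices)]
print(ordered_summary)
```

For texts without repeated sentences both approaches give the same order.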

## Visualisation of the Story
Finally, we project the reduced sentence embeddings into a 2-dimensional space and color each point according to the cluster it belongs to.

In [None]:
# Import necessary libraries for PCA and plotting
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Initialize PCA to reduce to 2 dimensions
pca = PCA(n_components=2)

# Apply PCA to the sentence embeddings
sentence_embeddings_2d = pca.fit_transform(embeddings_pca)

# Prepare colors; if you have more than 8 clusters, extend this list of colors
colors = ['red', 'green', 'blue', 'cyan', 'magenta', 'yellow', 'black', 'orange']

# Plot each sentence embedding in the 2D space, colored by its cluster
plt.figure(figsize=(10, 8))  # Set the figure size
for i, embedding in enumerate(sentence_embeddings_2d):
    plt.scatter(embedding[0], embedding[1], color=colors[cluster_ids[i]], label=f'Cluster {cluster_ids[i]}')

# Optional: add a legend. This might make the plot crowded if there are many points.
# To improve the legend, we're creating custom legend entries to avoid duplicates
from matplotlib.lines import Line2D
legend_elements = [Line2D([0], [0], marker='o', color='w', markerfacecolor=col, markersize=10, label=f'Cluster {i}') for i, col in enumerate(colors[:k])]
plt.legend(handles=legend_elements, loc='best', title="Clusters")

plt.title('Sentence Embeddings Reduced to 2D by PCA, Colored by Clustering')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.show()


In [None]:
print('Summary sentences:\n')
for i, sent in enumerate(extractive_summary_sentences, start=1):
    print(i, sent, '(Original sentence index: ' + str(original_indices[new_indices[i-1]]) + ')')

**TODO:**
- Try different numbers of clusters and see the resulting summary. In your opinion, which number of clusters is ideal for your text of choice?
- Besides the "result-oriented" way of choosing the right number of clusters, what other techniques do you know? Implement them, run the clustering with the obtained number of clusters, and comment on the findings.
- Experiment with different options for positional encoding.