# Baseline solution

I'm going to use a fairly widespread approach:

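1. Encode the sentences using spaCy's sentence vectors.
2. Calculate pairwise similarity and find `n` centroids with k-means.
3. Keep the sentence closest to each centroid. These should be the least similar to one another, so together they might carry the most diverse information.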


## 1. Encoding

In [1]:
import spacy
import numpy as np

In [2]:
n = 4

In [3]:
nlp = spacy.load('en_core_web_lg')

In [4]:
body = "Resonance is the literary magazine put out by the students of Falmouth Academy, the Massachusetts private school I attended for six years, starting in the seventh grade. During my time at F.A., I had at least one poem published in each issue of Resonance. In high school, I was also a member of the staff. But that wasn’t why I loved it. I loved it — and I swear I am not exaggerating here — because I thought the writing in its pages was more beautiful than anything I’d ever read. I was not a happy or popular adolescent, and the emotional stance I adopted toward most of my peers at F.A. might best be described as a defensive crouch. I was scared of my classmates, and I resented them; I could tell they didn’t like me, but I couldn’t figure out why. To the extent that I was able to lift myself out of my own sodden self-loathing to contemplate their inner worlds, I imagined their minds to be filled, like mine, with a whirlwind of criticism and judgment. But, once a year, at the end of the spring semester, I would open my copy of Resonance and be forced to face the unsettling possibility that my classmates were not the shallow bullies I imagined them to be but actual people, with souls."

In [5]:
doc = nlp(body)

In [287]:
sentence_buffer = []
vector_buffer = []
for sent in doc.sents:
    sentence_buffer.append(sent.text)
    vector_buffer.append(sent.vector)
n_sents = len(vector_buffer)
print(f"No. sents: {n_sents}")
sents_matrix = np.array(vector_buffer) 
print(f"... which were embedded into {sents_matrix.shape[1]}-dimensional space")

No. sents: 9
... which were embedded into 300-dimensional space


In [24]:
pairwise_dot = sents_matrix @ sents_matrix.T
pairwise_norm = np.linalg.norm(sents_matrix, axis=1).reshape(-1, 1)
pairwise_norm = pairwise_norm @ pairwise_norm.T

In [27]:
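# Multiplying by the identity matrix masks the off-diagonal entries, so this
# only checks that every sentence has cosine similarity 1 with itself.
np.round(pairwise_dot / pairwise_norm * np.eye(n_sents))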

array([[1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1.]])
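
To look at the full cosine similarity matrix rather than just its diagonal, the same dot-product-over-norms recipe can be used without the identity mask. The sketch below is only an illustration on made-up data: `toy_matrix`, `toy_dot`, `toy_norms` and `toy_cosine` are hypothetical names, and the three 2-dimensional rows stand in for the real 300-dimensional sentence vectors.

```python
import numpy as np

# Toy stand-in for sents_matrix: 3 "sentences" in a 2-dimensional space.
toy_matrix = np.array([[1.0, 0.0],
                       [0.0, 2.0],
                       [3.0, 3.0]])

# Dot product of every row with every other row.
toy_dot = toy_matrix @ toy_matrix.T

# Column vector of row norms; its outer product with itself is the denominator.
toy_norms = np.linalg.norm(toy_matrix, axis=1, keepdims=True)
toy_cosine = toy_dot / (toy_norms @ toy_norms.T)

print(np.round(toy_cosine, 3))
```

The first two rows are orthogonal, so their similarity is 0, while the third row sits at 45 degrees to both and gets a similarity of about 0.707 with each.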

## 2. Calculating centroids

In [274]:
from sklearn.cluster import KMeans

In [277]:
kmns = KMeans(n_clusters=n)
kmns.fit(sents_matrix)

KMeans(n_clusters=4)

In [278]:
kmns.predict(sents_matrix)

array([3, 1, 3, 2, 2, 0, 2, 0, 0], dtype=int32)

In [281]:
kmns.cluster_centers_.shape

(4, 300)
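
K-means starts from random centroids, so the cluster numbering (and sometimes the clusters themselves) can change between runs. A minimal sketch of pinning that down, assuming reproducibility matters for you: `toy_points`, `toy_kmeans` and `toy_labels` are hypothetical names on made-up data, and the `random_state=0` and `n_init=10` values are arbitrary choices, not something the cells above use.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy blobs standing in for the sentence vectors.
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# random_state fixes the initialisation; n_init reruns k-means several
# times and keeps the run with the lowest inertia.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_points)

print(toy_labels)
print(toy_kmeans.cluster_centers_.shape)
```

Which blob ends up as cluster 0 is still arbitrary, but with a fixed seed it should stay the same from run to run.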

## 3. Finding the closest sentences

In [288]:
the_closest = []
for i in range(n):
    dists_to_centroid = []
    for j in range(n_sents):
        dist = np.linalg.norm(sents_matrix[j]-kmns.cluster_centers_[i])
        dists_to_centroid.append(dist)
    closest = np.argmin(dists_to_centroid)
    the_closest.append(closest)
    print(f"For centroid {i} the closest is {closest}: {dists_to_centroid}")


For centroid 0 the closest is 8: [1.4526037, 1.378992, 1.5280867, 1.8807921, 1.4262528, 0.68644917, 1.4759239, 0.59049946, 0.49980232]
For centroid 1 the closest is 1: [1.5242139, 2.8108515e-08, 1.723594, 2.2791972, 1.751496, 1.4821645, 2.0238597, 1.6517166, 1.3599932]
For centroid 2 the closest is 6: [2.222229, 1.916933, 2.0684254, 0.686803, 0.68368447, 1.6655036, 0.6288617, 1.6213056, 1.44309]
For centroid 3 the closest is 0: [0.68872505, 1.4739945, 0.68872505, 2.3726172, 1.9225903, 1.40461, 2.1002202, 1.6076026, 1.325527]
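
The nested loop above can also be written with NumPy broadcasting. This is just a sketch of the same Euclidean nearest-point search on made-up data; `toy_points`, `toy_centers`, `toy_dists` and `toy_closest` are hypothetical names, not variables from the cells above.

```python
import numpy as np

# Toy stand-ins: 4 points (sentences) and 2 centroids in 2-D.
toy_points = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [5.0, 5.0]])
toy_centers = np.array([[0.9, 0.1], [4.2, 4.1]])

# Broadcasting builds a (n_centers, n_points, n_dims) array of differences;
# the norm over the last axis gives every centroid-to-point distance.
toy_dists = np.linalg.norm(
    toy_points[None, :, :] - toy_centers[:, None, :], axis=2
)

# For each centroid (row), the index of the nearest point.
toy_closest = toy_dists.argmin(axis=1)
print(toy_closest)  # [1 2]
```

Note that the search runs over all points, not only the members of each cluster, so in principle two centroids could pick the same sentence. If that happens, the summary would have fewer than `n` distinct sentences.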


## 4. Finale

Well, we're done here. The sentences chosen above should give a fairly good summary.

In [289]:
for i in sorted(the_closest):
    print(sentence_buffer[i])

Resonance is the literary magazine put out by the students of Falmouth Academy, the Massachusetts private school I attended for six years, starting in the seventh grade.
During my time at F.A., I had at least one poem published in each issue of Resonance.
I was scared of my classmates, and I resented them; I could tell they didn’t like me, but I couldn’t figure out why.
But, once a year, at the end of the spring semester, I would open my copy of Resonance and be forced to face the unsettling possibility that my classmates were not the shallow bullies I imagined them to be but actual people, with souls.
