## The Functions Part

In [25]:
VECTOR_PATH = '../glove.6B.200d.pkl'
ARTICLES_PATH = '../articles/'
GOLD_SUMMARIES_PATH = '../gold_summaries/'

Reads the global vector of words.

In [26]:
import pickle
with open(VECTOR_PATH, 'rb') as f:
    data = pickle.load(f)

This method used to split texts to sentences. Handles some of the special cases which I noticed and researched about. And returns a list of sentensec.

In [28]:
import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"
digits = "([0-9])"
def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text) 
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

This method reads the files which are in article path. This documents are splitted into sentences. The returned list contans two elements. First is document name and the second one is list of sentences.

Takses at most 30 secs if you give same number of documents in article path.

In [31]:
import os
def parse_documents_to_sentences():
    docs_by_sentences = []
    doc_dirs = os.listdir(ARTICLES_PATH)
    for f in doc_dirs:
        fin = open(ARTICLES_PATH + f, 'r', encoding="latin-1")
        title = fin.readline() + "."
        title = split_into_sentences(title)
        doc = fin.read()
        doc = split_into_sentences(doc)
        docs_by_sentences.append([f, title + doc])
    return docs_by_sentences

parsed_docs = parse_documents_to_sentences()

Sentence to vector method. Takes documents as sentences list(doc-name, list-of-sentences). And it firstly do case foldig for each sentence and parses the words. After that for a sentence it averages word vectors for each sentence. And this will be used as vector representation of sentence. Note that, if the word is not in the global vector it just skips that word.

Takes at most 30 secs.

In [33]:
import numpy as np
def sent2vec(docs_by_sentences):
    docs_as_vectors = []
    # SOME CODE HERE
    for doc in docs_by_sentences:
        docName = doc[0]
        sentences = doc[1]
        sentenceVectors = []
        for s in sentences:
            words = re.findall(r'\w+', s.lower())
            word_wectors = []
            for w in words:
                if w not in data:
                    continue
                word_wectors.append(data[w])
            if(len(word_wectors) != 0):
                sentenceVector = np.mean(np.array(word_wectors), axis=0)
                sentenceVectors.append([s,sentenceVector])
        docs_as_vectors.append([docName,sentenceVectors])    
    return docs_as_vectors
docs_as_vectors = np.array(sent2vec(parsed_docs))

### The clustring methods
The training part of the k-means, it takes sentences list and cluster number as k. As return, it returns clusters of trained sentences, and means of those clusters. We will use this means to fit the test data. It selects first means random sentences. It is not complicated it just loops until means are not changed. If you want to run this method seperately you need to give a list of sentences and vectors of those sentences list, for example [ [ [ "sentence",   [vector of that sentence] ],  ...]

This take the most time, in my local the maximum run time for only one call of this method takes at most 3 mins.

In [34]:
def kmeans_train(sentences, k):
    means = []
    for i in range(k):
        means.append(np.random.choice(np.array(sentences)[:,1]))
    while True:
        clusters = [[] for i in range(k)]
        for s in sentences:
            sentence = s[0]
            vector = s[1]
            closest_cluster_index = np.argmin([np.linalg.norm(vector - mean) ** 2 for mean in means])
            clusters[closest_cluster_index].append(s)
        new_cluster_means = [[] for i in range(k)]
        for i in range(len(means)):
            cluster_total = []
            for sent in clusters[i]:
                cluster_total.append(sent[1])
            if len(clusters[i]) == 0:
                new_cluster_means[i] = means[i]
            else:
                new_cluster_means[i] = np.sum(np.array(cluster_total),axis=0) / len(clusters[i])
        if np.array_equal(new_cluster_means, means):
            break
        means = new_cluster_means
    return clusters, means

Cluster fit method. This method takes docs as vectors like above return of sent2vec and means of the trained model. 
It just looks for the closest cluster mean and assigns this sentence to that cluster. After fitting to clusters, method chooses
the most representative sentences as the nearest sentence to mean. Returns list of [docName, summary] pairs. 

In [36]:
def cluster_fit(docs_as_vectors, means):
    extracted_summaries = []
    for doc in docs_as_vectors:
        docName = doc[0]
        sentences = doc[1]
        clusters = [[] for i in range(len(means))]
        # fitting part
        for s in sentences:
            sentence = s[0]
            vector = s[1]
            closest_cluster_index = np.argmin([np.linalg.norm(vector - mean) ** 2 for mean in means])
            clusters[closest_cluster_index].append(s)
            
        #summary part
        extracted_summary = ""
        for i in range(len(means)):
            if len(clusters[i]) == 0:
                continue
            closest_index = np.argmin([np.linalg.norm(s[1] - means[i]) ** 2 for s in clusters[i]])
            extracted_summary = extracted_summary + " " + clusters[i][closest_index][0]
        extracted_summaries.append([docName,extracted_summary])
    return extracted_summaries

Evaluates the rouge scores, it expects only extracted_summaries. It assumes the golden summaries is named same as the articles. Reads the gold summaries and calculates rouge scores thanks to rouge library. Returns the rouge scores as extended list of those scores. 

In [37]:
from rouge import Rouge 
def evaluation(extracted_summaries):
    rouge_scores = []
    for doc in extracted_summaries:
        docName = doc[0]
        summary = doc[1]
        infile = open(GOLD_SUMMARIES_PATH+docName , 'r', encoding="latin-1")
        gold_summary = infile.read()
        rouge = Rouge()
        scores = rouge.get_scores(summary, gold_summary)
        rouge_scores = rouge_scores + scores
    return rouge_scores

Calculates the means and standard deviations of the rouge scores. Returns rouge score means and stds as list for each score.

In [38]:
import numpy as np
def mean_var(scores):
    mean1 = np.mean([ s['rouge-1']['f'] for s in scores])
    std1 = np.std([ s['rouge-1']['f'] for s in scores])
    mean2 = np.mean([ s['rouge-2']['f'] for s in scores])
    std2 = np.std([ s['rouge-2']['f'] for s in scores])
    meanl = np.mean([ s['rouge-l']['f'] for s in scores])
    stdl = np.std([ s['rouge-l']['f'] for s in scores])
    return [[mean1, std1], [mean2, std2], [meanl, stdl]]

### K - folding Part
In this part it partitions the the all docs to K fold, You can change the K value as you wanted. I choose 5 as standard.

Takes about 2 secs.

In [40]:
from rouge import Rouge 

docs = parse_documents_to_sentences()
np.random.shuffle(docs)
train_docs = docs[:int(len(docs)*0.8)]
test_docs = docs[int(len(docs)*0.8):]

K = 5
folds = []
for i in range(K):
    if i == K-1:
        folds.append(train_docs[int(i*len(train_docs)/K):])
    else:
        folds.append(train_docs[int(i*len(train_docs)/K):int((i+1)*len(train_docs)/K)])

The calculations of the first model. The difference of the models is just the clustring number of them, in the first model the k is 5 and in the second model k is 3. They are parted to 2 cells to reduce the runtime of one cell. For k-fold it appends other folds than validation fold and trains those folds and after that it fits the model with validation fold. After that calculates rouge scores and appends to rouge list. 


It takes about 8 minutes for K = 5.

In [41]:
rouges1 = []
for i in range(K):
    val = []
    trains = []
    for k in range(K):
        if i == k:
            val.extend(folds[k])
        else:
            trains.extend(folds[k])
            
    trains_vectors = np.array(sent2vec(trains))
    sentences = []
    for l in np.array(trains_vectors)[:,1]:
        sentences = sentences + l
    clusters1, means1 = kmeans_train(sentences, 5)
    extracted_summaries1 = cluster_fit(sent2vec(val), means1)
    rouge_scores1 = evaluation(extracted_summaries1)
    rouges1.extend(rouge_scores1)

The calculations of the second model. It also takes same time as above cell.

In [43]:
rouges2 = []
for i in range(K):
    val = []
    trains = []
    for k in range(K):
        if i == k:
            val.extend(folds[k])
        else:
            trains.extend(folds[k])
            
    trains_vectors = np.array(sent2vec(trains))
    sentences = []
    for l in np.array(trains_vectors)[:,1]:
        sentences = sentences + l
    clusters2, means2 = kmeans_train(sentences, 3)
    extracted_summaries2 = cluster_fit(sent2vec(val), means2)
    rouge_scores2 = evaluation(extracted_summaries2)
    rouges2.extend(rouge_scores2)


Validation scores of the models

In [44]:
val1_scores = mean_var(rouges1)
val2_scores = mean_var(rouges2)
print("The validation scores for model 1 : \nRouge 1:{}±{}, Rouge 2:{}±{}, Rouge L:{}±{}".format(val1_scores[0][0], val1_scores[0][1], val1_scores[1][0], val1_scores[1][1], val1_scores[2][0], val1_scores[2][1]))
print()

print("The validation scores for model 2 : \nRouge 1:{}±{}, Rouge 2:{}±{}, Rouge L:{}±{}".format(val2_scores[0][0], val2_scores[0][1], val2_scores[1][0], val2_scores[1][1], val2_scores[2][0], val2_scores[2][1]))

The validation scores for model 1 : 
Rouge 1:0.42598107701725674±0.16094879556529504, Rouge 2:0.2889821115675144±0.18860871675946708, Rouge L:0.3649079837571582±0.16739646523337046

The validation scores for model 2 : 
Rouge 1:0.40155678639737297±0.16453773778710803, Rouge 2:0.2714417577069077±0.18997116398463357, Rouge L:0.3308415297873292±0.16516762458107073


Training and tests of the models with all the data 

each cell takes about 3 mins

In [45]:
trains_vectors = np.array(sent2vec(train_docs))
sentences = []
for l in np.array(trains_vectors)[:,1]:
    sentences = sentences + l
clusters1, means1 = kmeans_train(sentences, 5)
extracted_summaries1 = cluster_fit(sent2vec(test_docs), means1)
rouge_score1 = evaluation(extracted_summaries1)

In [47]:
clusters2, means2 = kmeans_train(sentences, 3)
extracted_summaries2 = cluster_fit(sent2vec(test_docs), means2)
rouge_score2 = evaluation(extracted_summaries2)

Test scores of the models

In [48]:
test1 = mean_var(rouge_score1)
test2 = mean_var(rouge_score2)

print("The test score for model 1 : \nRouge 1:{}, Rouge 2:{}, Rouge L:{}".format(val1_scores[0][0], val1_scores[1][0], val1_scores[2][0]))
print()

print("The test score for model 2 : \nRouge 1:{}, Rouge 2:{}, Rouge L:{}".format(val2_scores[0][0], val2_scores[1][0], val2_scores[2][0]))

The test score for model 1 : 
Rouge 1:0.42598107701725674, Rouge 2:0.2889821115675144, Rouge L:0.3649079837571582

The test score for model 2 : 
Rouge 1:0.40155678639737297, Rouge 2:0.2714417577069077, Rouge L:0.3308415297873292


## Conclusion

The first model gives better  scores compared to first model. The only difference between them are number of clusters. First one is with 5 clusters and second one is with 3 clusters. The more cluster means in the all data there are more topics than 3 because 5 clusters gives better scores. With this we can give better results for one topic. But the worst thing for these models are they depend on each document even though the most important document for one summary is the summirized document. 