## Answers to questions

a) Approach to the problem. Summaries describe the main ideas in a text, so I approached the problem with the objective of finding the sentences that best cover the main topics discussed in each speech. With that in mind, topic analysis (latent Dirichlet allocation) was used on each speech to determine its main topics. The most discussed topics (based on the number of sentences each topic came up in) were then selected, and for each of these topics the sentence with the highest probability was found. These sentences were extracted, ordered by where they appeared in the original text, and put together as a summary.
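
The selection step described above can be sketched as follows. This is a minimal illustration, not the code in the `summarize` module: the function name, the `n_topics` default, and the reading of "most discussed" as "dominant topic in the most sentences" are all assumptions. The per-sentence topic probabilities are assumed to come from an already fitted LDA model.

```python
from collections import Counter


def select_summary_sentences(sentences, topic_probs, n_topics=3):
    """Pick one representative sentence for each of the most discussed topics.

    sentences   : list of sentence strings, in their original order
    topic_probs : one row per sentence, one probability per topic
                  (assumed to be the output of a fitted LDA model)
    n_topics    : number of top topics to cover (hypothetical default)
    """
    # Dominant topic of a sentence = the topic with the highest probability.
    dominant = [max(range(len(row)), key=row.__getitem__) for row in topic_probs]

    # Rank topics by how many sentences they dominate.
    top_topics = [topic for topic, _ in Counter(dominant).most_common(n_topics)]

    # For each selected topic, keep the sentence with the highest probability.
    chosen = set()
    for topic in top_topics:
        best = max(range(len(sentences)), key=lambda i: topic_probs[i][topic])
        chosen.add(best)

    # Put the chosen sentences back in the order they appeared in the text.
    return " ".join(sentences[i] for i in sorted(chosen))


example_sentences = ["S0.", "S1.", "S2.", "S3.", "S4."]
example_probs = [
    [0.10, 0.80, 0.10],
    [0.70, 0.20, 0.10],
    [0.90, 0.05, 0.05],
    [0.20, 0.20, 0.60],
    [0.10, 0.85, 0.05],
]
example_summary = select_summary_sentences(example_sentences, example_probs, n_topics=2)
```

Because two sentences can be the best match for different topics, or one sentence can be the best match for several, the summary may contain fewer sentences than `n_topics`.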

b) Measuring performance. Below are possible ways to evaluate performance on this problem. Without extensive resources, I would start with the qualitative approach (perhaps developing a rubric for consistency) on a very small number of randomly selected texts. I would then use the transfer approach in order to have a metric for comparing this summarization method to others.

- Qualitative approach: an individual compares the output to the original text on a small subset of the data.
- Resource intensive: if resources are available, have people create ground truth summaries and combine the BLEU and ROUGE metrics into an F1 metric representing the lexical overlap between the automatically generated summaries and the ground truth created by people.
- Transfer approach: evaluate the summarization method on a different dataset of texts that already have ground truth summaries, using the BLEU and ROUGE metrics. Use this evaluation as a proxy for performance on the UN General Assembly Statements summary problem.
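
To make the combined metric concrete, here is a simplified sketch of an overlap-based F1. It uses clipped unigram precision (the idea behind BLEU-1, without the brevity penalty) and unigram recall (the idea behind ROUGE-1). Real BLEU and ROUGE implementations also use longer n-grams and better tokenization, so the function name `overlap_f1` and the whitespace tokenization are illustrative assumptions only.

```python
from collections import Counter


def overlap_f1(candidate, reference):
    """Unigram-overlap F1 between a generated summary and a reference summary."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it occurs in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())  # BLEU-style: share of candidate words matched
    recall = overlap / sum(ref_counts.values())      # ROUGE-style: share of reference words matched
    return 2 * precision * recall / (precision + recall)


score = overlap_f1("the cat sat", "the cat sat on the mat")
```

In this example the precision is 1.0 (every candidate word appears in the reference) and the recall is 0.5, so the F1 comes out at about 0.67.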

In [1]:
import summarize
import pandas as pd
import time
%load_ext autoreload
%autoreload 2

#### Get Data

In [2]:
speeches = pd.read_csv("../un-general-debates.csv")
print(speeches.info())
display(speeches.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7507 entries, 0 to 7506
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   session  7507 non-null   int64 
 1   year     7507 non-null   int64 
 2   country  7507 non-null   object
 3   text     7507 non-null   object
dtypes: int64(2), object(2)
memory usage: 234.7+ KB
None


|   | session | year | country | text |
|---|---------|------|---------|------|
| 0 | 44 | 1989 | MDV | ﻿It is indeed a pleasure for me and the member... |
| 1 | 44 | 1989 | FIN | ﻿\nMay I begin by congratulating you. Sir, on ... |
| 2 | 44 | 1989 | NER | ﻿\nMr. President, it is a particular pleasure ... |
| 3 | 44 | 1989 | URY | ﻿\nDuring the debate at the fortieth session o... |
| 4 | 44 | 1989 | ZWE | ﻿I should like at the outset to express my del... |


#### Summarize first 10 debates:

In [3]:
ten_debates = speeches[:10].copy()

In [4]:
start = time.time()
ten_debates["summary"] = ten_debates["text"].apply(lambda x: summarize.summarize_pd(x))
end = time.time()
long_sent_time = end-start

In [5]:
print("Total time:", long_sent_time, "seconds")

Total time: 8.982362031936646 seconds


#### Summarize first 10 debates, penalize sentence length

In [6]:
pd.set_option('display.max_colwidth', None)
ten_debates["summary"]

0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       

In [7]:
start = time.time()
ten_debates["summary_short_sentence"] = ten_debates["text"].apply(lambda x: summarize.summarize_pd(x,sent_len_penalty=True))
end = time.time()
short_len_time = end-start

In [8]:
print("Total time:", short_len_time , "seconds")

Total time: 7.034054279327393 seconds


In [9]:
ten_debates["summary_short_sentence"]

0                                                                                                                                                 Aid flows continue to be inadequate. The situation in Lebanon remains volatile. The good office of the UN could be utilized in these peace negotiations. Another issue that needs our attention is the situation in Cyprus. We in the Maldives were close to becoming the victim of such a dastardly attempt last November.
1                                                                                                                                                                                                                                          ﻿\nMay I begin by congratulating you. We hope that lessons have been learned. We can therefore all learn from each other. The cycle of violence has not been broken. Yet the world Organization and its member States can do more.
2                                                                           

#### Conclusion

1. Sentence length penalty: The summaries generated without the sentence length penalty were more coherent and complete. Stronger preprocessing to remove odd punctuation, etc. would improve the shorter summaries, but it is also clear that the penalty as it stands is too strong: shorter sentences get chosen even when they contain very little information.
2. Time: Summarization on a local system is feasible. The 10 speeches above took about 9 seconds without the penalty (roughly 0.9 seconds per speech), which extrapolates to about 112 minutes for all 7,507 speeches.

In [10]:
estimated_time = ((len(speeches)/10) * long_sent_time)/60
print("It will take about", estimated_time, "minutes to summarize all debates.")

It will take about 112.384319622914 minutes to summarize all debates.
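
On the sentence length penalty from the conclusion: one possible way to soften it is to make its strength tunable. The sketch below is hypothetical and does not reflect how `summarize.summarize_pd` applies `sent_len_penalty`; the function name, the `alpha` parameter, and the formula are assumptions chosen only to illustrate the idea.

```python
def penalized_score(topic_prob, n_words, alpha=0.5):
    """Scale a sentence's topic probability down as the sentence gets longer.

    alpha = 0 disables the penalty; larger alpha favors short sentences more.
    (Hypothetical formula, not the one used in the summarize module.)
    """
    return topic_prob / (max(n_words, 1) ** alpha)


# A long, informative sentence versus a short, weakly related one.
long_sentence = (0.9, 25)   # (topic probability, word count)
short_sentence = (0.3, 5)

strong_prefers_short = penalized_score(*short_sentence, alpha=1.0) > penalized_score(*long_sentence, alpha=1.0)
mild_prefers_long = penalized_score(*long_sentence, alpha=0.25) > penalized_score(*short_sentence, alpha=0.25)
```

With `alpha=1.0` the short sentence wins despite its low topic probability, which mirrors the behavior noted in the conclusion; with `alpha=0.25` the more informative sentence is still preferred.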
