
# Human Analysis Notebook

This notebook is for manually analysing the summaries in the dataset. The goal is to identify features that can be used to evaluate summary quality; these features will inform the design of the evaluation metric.

Current ideas:
* collapse all whitespace to single space
* count capitalised words as features
* assess quotation quality by comparing text in quotation marks to text in the prompts
* measure adjective repetition
* compare reference styles
* measure sentence complexity and variation
* measure sentence length summary stats (avg/max)
* measure word complexity / complex word count
* measure list type: asyndeton vs syndeton
* many words joined together -- separate in pre-processing
* Use proportion of text in quotes as a feature
* Feature of sentence starting words (e.g. "And" = bad)
* Feature of summary starting words (e.g. "It" = bad)
* Typo count
* Use typo-corrected versions for some features, and original for others
* Measure use of author's name
* Count number of references
* Count number of citations
* Compare summary features with other summaries of that prompt
* For most features, use the non-quote part of the summary


In [26]:
from summary_eval.data import summary_df, prompts_df

In [4]:
# shuffle rows
summary_df = summary_df.sample(frac=1, random_state=42).reset_index(drop=True)

In [5]:
summary_df["text"].iloc[0]

'The Pharaohs were leaders or believed to be gods in human form. The priest and the Nobles were responsiable for keeping the Pharaoh happy with gifts. All of the people in Egypt had to give gifts to the gods. Slavery was the main social structure. In the text it states, " Slavery became the fate of those captured as prisoners of war. All Egyptians-from pharaohs to farmers-gave gifts to the gods." All of the people in Egypt had their own jobs that were given to them, but everyone was responsiable for giving gifts to the gods.'

Comments:
* Some bad capitalisation
* Some unnecessary spaces
* No spaces around dashes
* IDEA: collapse all whitespace to single space
* IDEA: count capitalised words as features
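
A minimal sketch of the two ideas above, using only the standard library. The function names are hypothetical and not part of `summary_eval`; the word pattern is a simplifying assumption that treats any run of ASCII letters as a word.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with a single space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def count_capitalised_words(text: str) -> int:
    """Count words whose first letter is upper-case (includes sentence-initial words)."""
    words = re.findall(r"[A-Za-z]+", text)
    return sum(1 for word in words if word[0].isupper())
```

Because sentence-initial words are counted too, this feature is probably more useful relative to the number of sentences than as a raw count.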

In [6]:
summary_df["text"].iloc[1]

'     The Egyptian system of government was structured like the pyramids they built. The most important people were at the top of the pyramid and the less important people were at the bottom of the pyramid. Some evidence from the text is, "Egyptian society was structured like a pyramid. At the top were the gods, such as Ra, Osiris, and Isis."'

Comments:
* Whitespace padding at beginning
* Copied directly from text
* IDEA: assess quotation quality by comparing text in quotation marks to text in the prompts
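
One possible way to sketch the quotation check: pull out the text between straight double quotes and test whether each quote appears verbatim in the prompt text. The function names are hypothetical, the prompt text is passed in as a plain string (the relevant column of `prompts_df` is not shown here), and curly quotes or unbalanced quote marks are not handled.

```python
import re


def _normalise(text: str) -> str:
    """Lower-case and collapse whitespace so that spacing differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def extract_quotes(summary: str) -> list:
    """Return the non-empty spans found between pairs of straight double quotes."""
    return [q.strip() for q in re.findall(r'"([^"]+)"', summary) if q.strip()]


def verbatim_quote_rate(summary: str, source: str):
    """Fraction of quotes that occur verbatim in the source, or None if there are no quotes."""
    quotes = extract_quotes(summary)
    if not quotes:
        return None
    source_norm = _normalise(source)
    # Trailing full stops and commas are dropped because students often
    # place their own punctuation inside the closing quote mark.
    hits = sum(1 for q in quotes if _normalise(q).rstrip(".,") in source_norm)
    return hits / len(quotes)
```

A rate close to 1 suggests accurate quoting; a fuzzy comparison (for example with `difflib`) might be a better fit where quotes contain typos.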

In [8]:
summary_df["text"].iloc[2]

'They would use every single piece of meat they had even if it was moldy. They would either cut it up into sausage as cited in paragraph 1.  Or they would put it in chemicals to make it look like its not moldy as cited in paragraph 5. '

Comments:
* Repetition of adjectives
* IDEA: measure adjective repetition
* Double spacing
* Paragraph index references
* Simple sentences only
* IDEA: compare reference styles
* IDEA: measure sentence complexity and variation
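
As a rough sketch of how reference styles could be compared, the snippet below counts matches for a couple of regular-expression patterns. The pattern names and the patterns themselves are assumptions based on the summaries read in this notebook, not an exhaustive list.

```python
import re

# Hypothetical patterns; extend as more reference styles are observed.
REFERENCE_PATTERNS = {
    # "paragraph 1", "paragraph two", "Par. 1"
    "paragraph_number": re.compile(
        r"\bpar(?:agraph|\.)?\s*(?:\d+|one|two|three|four|five)\b", re.IGNORECASE
    ),
    # "The text states", "text says"
    "the_text_states": re.compile(r"\b(?:the\s+)?text\s+(?:states|says)\b", re.IGNORECASE),
}


def count_reference_styles(text: str) -> dict:
    """Count how many times each reference pattern occurs in the text."""
    return {name: len(pattern.findall(text)) for name, pattern in REFERENCE_PATTERNS.items()}
```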

In [9]:
summary_df["text"].iloc[3]

'When meat spoiled, the factory would can the meat or chop it into sausage, continuing to send it off for people to eat.  Often, meat that is found sour would be rubbed with soda to remove the smell, and they also invented a machine that would plunge a needle into the meat to fill ham with pickle, which would eliminate the odor of the ham. Eventually, someone found out that removing the bone and inserting a white-hot iron as another way of using spoiled meat before selling it.'

Comments:
* An excessively long sentence
* IDEA: measure sentence length summary stats (avg/max)
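
A minimal sketch of the sentence length statistics, assuming a naive sentence splitter that breaks after `.`, `!` or `?` followed by whitespace. Abbreviations such as "Par. 1" will be split incorrectly, so a proper sentence tokeniser may be preferable in the real metric. The function name is hypothetical.

```python
import re
from statistics import mean


def sentence_length_stats(text: str) -> dict:
    """Return the sentence count plus mean and max sentence length in words."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if not lengths:
        return {"n_sentences": 0, "mean_words": 0.0, "max_words": 0}
    return {
        "n_sentences": len(lengths),
        "mean_words": mean(lengths),
        "max_words": max(lengths),
    }
```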

In [10]:
summary_df["text"].iloc[4]

'It should be complex that excites pity and fear, the man should not be good or bad and his misfortune is from error of judgment or frailty, the change of fortune should be good to bad.'

Comments:
* Complex words but doesn't make sense.
* IDEA: measure word complexity / complex word count
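
One crude proxy for word complexity is word length. The sketch below counts words of at least `min_length` letters; both the function name and the threshold of 8 are assumptions, and, as the summary above shows, long words do not guarantee that the text makes sense.

```python
import re


def complex_word_stats(text: str, min_length: int = 8) -> dict:
    """Count words with at least `min_length` letters and their share of all words."""
    words = re.findall(r"[A-Za-z]+", text)
    n_complex = sum(1 for word in words if len(word) >= min_length)
    ratio = n_complex / len(words) if words else 0.0
    return {"n_words": len(words), "n_complex": n_complex, "complex_ratio": ratio}
```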

In [11]:
summary_df["text"].iloc[5]

'Pharos and nobels were at the top and the slaves and servants are at the base and the bottom they work hard for the nobles and the Pharos'

Comments:
* Misspelling.
* No punctuation used.
* Overly syndetic ("and"-based) listing.
* Single sentence -- short.
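
Overly syndetic listing could be approximated by the share of words that are conjunctions. This is only a sketch: the function name and the default conjunction list are assumptions, and it does not distinguish list-joining "and" from other uses.

```python
import re


def conjunction_density(text: str, conjunctions=("and",)) -> float:
    """Return the fraction of words in the text that are in `conjunctions`."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(1 for word in words if word in conjunctions) / len(words)
```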

In [12]:
summary_df["text"].iloc[6]

'The students were drawn towards the experiment because it gave them a sense of superiority. They felt a bond because they saluted eachother and were a part of an "exclusive" club, which caused them to deviate from normal behavior. The experiment spread so fast because the students all wanted to be included. The experiment ended because students were too involved in the project and it could have led to a division within the school.'

Comments:
* Words joined together ("eachother"), likely a result of bad OCR.
* Nice mix of sentence structures.
* Mostly summarisation with a small quote used for evidence.

In [13]:
summary_df["text"].iloc[7]

'The factory had multiple ways to cover up spoiled meat. One of them may be to chop it into sausage or to can it. The txt states, "Whenever meat was so spoiled it cold not be used for anything else, either to can it or else to chop it up into sausage." (Sinclair Par. 1).  Another method would be using pickle to take away the bad odor to make it smell good as new. The text states, "They would rub it up with soda to take away the smell, and sell it to eaten..." (Sinclair Par. 2). '

Comments:
* Typo: "txt" instead of "text", "cold" instead of "could"
* Unique referencing style: "(Sinclair Par. 1)"
* Excessive use of quotations
* IDEA: Use proportion of text in quotes as a feature
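
A minimal sketch of the quoted-proportion feature, measured in characters. It assumes straight double quotes that are properly paired; the function name is hypothetical.

```python
import re


def quoted_proportion(text: str) -> float:
    """Fraction of the (trimmed) text's characters that sit inside double quotes."""
    stripped = text.strip()
    if not stripped:
        return 0.0
    # Sum the lengths of the spans between each pair of quote marks.
    quoted_chars = sum(len(q) for q in re.findall(r'"([^"]*)"', stripped))
    return quoted_chars / len(stripped)
```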

In [14]:
summary_df["text"].iloc[8]

'    In the excerpt, "The Jungle" by Upton Sinclair, describes the various ways factories would use and cover up spoiled meat that would be sold to the public.  In paragraph two, it explains a process on how the workers would find sour meat and they would get rid of the smell by rubbing it with soda and then selling it to the public. '

Comments:
* More whitespace padding.
* Introduces the title and author.
* Double spacing.
* Nice summary.

In [15]:
summary_df["text"].iloc[9]

'Some of the ways was to rub it with soda, using chemicals, and to use a pickle machine. Paragraph 2 says "Jonas had told them how the meat that was taken out of pickle would often be found sour, and how they would rub it up with soda to take away the smell". Also, "all the miracles of chemistry which they performed." Lastly, "there would be hams found spoiled, so they pumped a stronger pickle to destroy the odor."'

Comments:
* No introduction.
* Extremely excessive use of quotations.

In [16]:
summary_df["text"].iloc[10]

'On the top of the classes are gods they are the first class. The nobles  are to pleasing the gods. And in the middle class ar eskilled workers. They jobs is to make and sell jewelry, pottery, paprus products, and tools. And the last class are slaves and farmers. They jobs is to build and watch over the animals. How is it involved in the government. Farmers are still need to watch over animals and make sure to feed them. And craftsmen are still need to make tools and jewelry. '

Comments:
* Repeats the same meaning in different words.
* Double spacing.
* Starts sentences with "and".
* Spacing in wrong position: "ar eskilled workers".
* Wrong use of They/Their/They're.

In [17]:
summary_df["text"].iloc[11]

'Aristotle was very clear about 3 elemts of an ideal tradgedy. First, he said that the conflict should occur to an average man, not good nor bad, and it would occur because of a simple mistake. This would make it surprising and more relateable. Then, Aristotle said tradgedies should not end on a good note, but a bad note. This would again be surprising and play with peoples emotions much more. It woud inspire "neither pity nor fear". Finally, Aistotle said that an ideal tradgedy would have one plot with one issue. This would keep focus and lead viewers/readers to become much more invested in the single conflcit.'

Comments:
* Typo: "elemts", "tradgedy", "conflcit"
* Uses numbers instead of words: "3" instead of "three"
* Nice balance of quotation and summarisation
* Use of author's name
* IDEA: Use typo-corrected versions for some features, and original for others

In [18]:
summary_df["text"].iloc[12]

'An ideal tragedy should be arranged on a complex plan,  should imitate actions which excite pity and fear, and should go from good to bad instead of bad to good as described by Aristotle.'

Comments:
* No introduction.
* Double spacing.
* No comma in list.
* Uses Author.
* Single sentence.
* IDEA: Use author's name count as a feature.
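
The author's name count could be sketched as a case-insensitive whole-word match. The list of names per prompt would need to be supplied by hand (it is an assumption here, as is the function name), and misspelt names will not be counted unless typos are corrected first.

```python
import re


def count_name_mentions(text: str, names) -> int:
    """Count whole-word, case-insensitive occurrences of any of the given names."""
    total = 0
    for name in names:
        pattern = rf"\b{re.escape(name)}\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total
```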

In [19]:
summary_df["text"].iloc[13]

'Diffrent social classes were involved in the Egyptian goverment because the pharaohs direct the army in an event of a raid and protect the people, their social classes are shaped like a pyramd, gods on the top, below the pharaoh were the nobles and preists, at the bottom were slaves and farmers. '

Comments:
* Typos: "Diffrent", "pyramd"
* One huge sentence.
* Advanced vocabulary: "social classes", "government", "pharaoh", "nobles".

In [20]:
summary_df["text"].iloc[14]

'    At the top of the list where the gods like ra osris and lsis egyptians belive they controlled the earth. The eyptains belived some human bei'

Comments:
* Summary cut short
* Whitespace padding at beginning
* No capitalisation on proper nouns
* Typos

In [21]:
summary_df["text"].iloc[15]

"In the first paragraph it states that whenever meat is spoiled and couldn't be used for anything else that they would  either can it or chop it into sausage. Also at the beggining of paragraph two it states how the meat that was taken out of pickle would often be found sour, and how they would rub it up with soda to take away the smell, and sell it to be eaten on free-lunch counters."

Comments:
* Double spacing.
* Smooth referencing, but over-referenced

In [22]:
summary_df["text"].iloc[16]

'It developed so fast due to the students who spread it to their friends, it also probably was easily spread because it made the students feel like they were apart of something bigger. But eventually the experiment was ended due to how exstreme the students had become about it.'

Comments:
* Starts summary with "It".
* Never describes what "it" is.
* Typos.
* Short.
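
A sketch of starting-word features, assuming the same naive sentence splitting on `.`, `!` and `?`. The lists of weak openers are assumptions to be tuned against the data, and the function names are hypothetical.

```python
import re


def first_words(text: str) -> list:
    """Return the lower-cased first word of each sentence (sentences with none are skipped)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = []
    for sentence in sentences:
        match = re.match(r"\W*([A-Za-z']+)", sentence)
        if match:
            words.append(match.group(1).lower())
    return words


def starting_word_features(
    text: str,
    weak_summary_openers=("it", "they", "this"),
    weak_sentence_openers=("and", "but"),
) -> dict:
    """Flag a weak opening word for the summary and count weak sentence openers."""
    words = first_words(text)
    return {
        "summary_starts_weak": bool(words) and words[0] in weak_summary_openers,
        "n_weak_sentence_starts": sum(1 for w in words if w in weak_sentence_openers),
    }
```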

In [23]:
summary_df["text"].iloc[17]

'The structure of the ancient Egyptian system was "structured like a pyramid." "At the top were gods", Pharaohs were believed to be gods in human form. Farmers, being at the lowest rank, had to pay grain taxes to the Pharaoh and government, which helped the Egyptians "in the event of a famine."'

Comments:
* Slightly over-referenced.
* Repetition of words from within quotes.
* Nice sentence complexity.

In [24]:
summary_df["text"].iloc[18]

'Tragedy should contain and replicate actions that incite pity and fear towards the subjects. Most tragedies should end in pitiful, unhappy endings. Tragedies should be complex and different from the general storyline.'

Comments:
* Presents the author's ideas as if they were the student's own.
* Slightly too terse.

In [25]:
summary_df["text"].iloc[19]

'"By the fourth day of the experiment, the students became increasingly involved in the project and their discipline and loyalty to the project was so outstanding that Jones felt it was slipping out of control. He decided to terminate the movement, so he lied to students by announcing that the Third Wave was a part of a nationwide movement and that on the next day a presidential candidate of the movement would publicly announce its existence on television,"said Jones. The wave develpoed over a short time because people wantd to join thecrowd.'

Comments:
* Almost entire summary is a quote.
* The part which is not a quote is badly written.
* IDEA: For most features, use the non-quote part of the summary
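
The non-quote part of a summary could be obtained by deleting every span between paired straight double quotes before computing other features. This is a minimal sketch with a hypothetical function name; an unpaired quote mark is left in place, and curly quotes are not handled.

```python
import re


def strip_quotes(text: str) -> str:
    """Remove double-quoted spans and tidy the whitespace that is left behind."""
    without_quotes = re.sub(r'"[^"]*"', " ", text)
    return re.sub(r"\s+", " ", without_quotes).strip()
```

Applied to the summary above, this would leave only the attribution and the final sentence, which is the part the student actually wrote.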