<a href="https://colab.research.google.com/github/sudhang/css-nlp/blob/master/Guessing_Game.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Make it pretty
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

# PARAMS

In [None]:
GITREPOPATH = "https://raw.githubusercontent.com/sudhang/css-nlp/master"
NUM_TEXTS_PERMODEL = 5

# Imports

In [None]:
import pandas as pd
import nltk
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
from nltk.translate.meteor_score import meteor_score
import ipywidgets as widgets
import random
from IPython.display import display, clear_output
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
df_llama_nyt = pd.read_csv(f'{GITREPOPATH}/generated/llama2qlora_nyt.csv')
df_gptneo_nyt = pd.read_csv(f'{GITREPOPATH}/generated/gptneo_nyt.csv')
df_ngrams_nyt = pd.read_csv(f'{GITREPOPATH}/generated/ngram_nyt_3.csv')

## Combine Generated Data

In [None]:
# Add a 'Model' column to each DataFrame
df_ngrams_nyt['Model'] = 'n-grams'
df_llama_nyt['Model'] = 'LLaMa'
df_gptneo_nyt['Model'] = 'GPT-NEO'

# Sample NUM_TEXTS_PERMODEL texts from each model (no fixed seed, so the selection changes on every run)
df_ngrams_nyt = df_ngrams_nyt.sample(n = NUM_TEXTS_PERMODEL)
df_llama_nyt = df_llama_nyt.sample(n = NUM_TEXTS_PERMODEL)
df_gptneo_nyt = df_gptneo_nyt.sample(n = NUM_TEXTS_PERMODEL)


# Concatenate the DataFrames
df_all = pd.concat([df_ngrams_nyt, df_llama_nyt, df_gptneo_nyt])

# Shuffle the DataFrame
df_all = df_all.sample(frac=1).reset_index(drop=True)

# Human Evaluation
For the human evaluation, we implemented a "Guessing Game": the evaluator is shown a real article and a generated one.

For each model, we take the first two sentences of a real article and use them as the seed text (prompt) for the language model. We present the real article and the article generated from this prompt as "Text 1" and "Text 2" in random order, and ask the evaluator to decide which is real and which is generated. The evaluator has to make an active choice, because the dropdown defaults to a blank value; a pair left blank is recorded as `SKIPPED` rather than as a guess.
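The seed construction described above can be sketched as follows. This is a minimal illustration, not the code used to produce the generated CSVs: `naive_sentences` and `make_seed` are hypothetical helpers, and the regex splitter is only a rough stand-in for `nltk.sent_tokenize` so that the example runs without downloading tokenizer data.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace.

    Assumption: a rough stand-in for nltk.sent_tokenize; it will
    mis-split abbreviations such as "Dr." or "J.".
    """
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def make_seed(article, num_seed_sentences=2):
    """Return the first few sentences of an article as one prompt string."""
    return ' '.join(naive_sentences(article)[:num_seed_sentences])

article = "Rain fell on Monday. Roads were closed. Schools stayed open. Trains ran late."
seed = make_seed(article)
print(seed)  # Rain fell on Monday. Roads were closed.
```

The language model would then be asked to continue from `seed`, and its output paired with the full real article.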

Both the real and the generated articles are truncated to the same number of sentences (`num_sentences`, 25 by default; the call below passes 10). The generated articles only run to about 51 sentences, so without truncation an abrupt cutoff would be an easy giveaway that a text is generated. This is a limitation of our evaluation, because the truncation hides a major failing of the generated articles.
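A small sketch of why equal truncation removes the cutoff cue. `first_sentences` and `truncate_pair` are hypothetical names, and the regex split again stands in for `nltk.sent_tokenize`; the notebook itself applies the same slicing idea inside `guessing_game`.

```python
import re

def first_sentences(text, num_sentences):
    """Keep only the first num_sentences sentences (naive regex split)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return ' '.join(parts[:num_sentences])

def truncate_pair(original, generated, num_sentences=25):
    """Truncate both texts to the same sentence budget."""
    return (first_sentences(original, num_sentences),
            first_sentences(generated, num_sentences))

original = "One. Two. Three. Four."
generated = "One. Two. Thr"  # generation stopped mid-sentence

short_original, short_generated = truncate_pair(original, generated, num_sentences=2)
print(short_original)   # One. Two.
print(short_generated)  # One. Two.
```

With a budget of two sentences the broken third sentence never reaches the evaluator, which is exactly the failure the truncation hides.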

Note that the evaluator gets no feedback on whether a guess was correct. This keeps earlier results from influencing later guesses.

In [None]:
from google.colab import files

# Initialize a counter for the current row
current_row = 0

evaluator_name = None

# Function to apply CSS
def set_css():
    display(HTML('''
    <style>
        pre {
            white-space: pre-wrap;
        }
    </style>
    '''))

# Function to display a row from the DataFrame
def guessing_game(df, row_num, num_sentences=25):
    global current_row
    global evaluator_name
    current_row = row_num

    # Clear previous output
    clear_output(wait=True)

    display(f"Turn {current_row} of {len(df)}")

    # Apply CSS for prettification
    set_css()

    # Create the evaluator_name widget
    if evaluator_name is None:
        evaluator_name = widgets.Text(value='', description='Evaluator Name:', placeholder='Enter your name')
        display(evaluator_name)
    else:
        display(evaluator_name.value)

    row = df.iloc[row_num]

    # Get the first num_sentences sentences of the original and generated articles
    original = ' '.join(nltk.sent_tokenize(row['Original Article'])[:num_sentences])
    generated = ' '.join(nltk.sent_tokenize(row['Generated Article'])[:num_sentences])

    # Randomly decide whether to show the original or generated article first
    if random.choice([True, False]):
        text1, text2 = original, generated
        correct_answer = 'Original, Generated'
    else:
        text1, text2 = generated, original
        correct_answer = 'Generated, Original'

    # Create and display the widgets
    print("Text 1:")
    print(text1)
    print("\nText 2:")
    print(text2)

    dropdown = widgets.Dropdown(options=['', 'Original, Generated', 'Generated, Original'], description='Order:')
    notes = widgets.Textarea(value='', description='Notes:', placeholder='Enter your notes')
    next_button = widgets.Button(description='Next Article Pair')

    display(dropdown)
    display(notes)
    display(next_button)

    def on_next_button_clicked(b):
        # Record the guess; a blank dropdown is stored as 'SKIPPED'
        if dropdown.value == '':
            df.loc[current_row, f'EVALUATOR_{evaluator_name.value}_GUESS'] = 'SKIPPED'
        else:
            guess_correct = (dropdown.value == correct_answer)
            df.loc[current_row, f'EVALUATOR_{evaluator_name.value}_GUESS'] = guess_correct
        df.loc[current_row, f'EVALUATOR_{evaluator_name.value}_NOTES'] = notes.value

        # Move on to the next pair, or save and download the results after the last one
        if current_row < len(df) - 1:
            guessing_game(df, current_row + 1, num_sentences)
        else:
            print("You've reached the end of the DataFrame!")
            filename = f'EVALUATOR_{evaluator_name.value}.csv'
            df.to_csv(filename)
            files.download(filename)


    next_button.on_click(on_next_button_clicked)

# Call the function to display the first row
guessing_game(df_all, 0, num_sentences=10)

'Turn 14 of 15'

'Paula'

Text 1:
WASHINGTON — After weeks of uncertainty atop the Department of Veterans Affairs, President Trump dismissed its secretary, David J. Shulkin, on Wednesday and announced he would replace him with the White House physician, Dr. Ronny L. Jackson, a rear admiral in the Navy. If confirmed, Dr. Jackson, a career Naval officer who has no real experience running a large bureaucracy, would inherit a set of challenges that have bedeviled Democratic and Republican administrations alike. : an enormous budget deficit, high unemployment, rising health care costs and a growing number of veterans seeking treatment for post-traumatic stress disorder. “We’re going to get it done,” Mr. Trump told reporters at his golf club here as he prepared to leave Washington after attending meetings of the National Governors Association meeting. He said he was not changing any other policy areas, including raising taxes or cutting spending. The president, who had been considering replacing Secretary Eric H. Hol

Dropdown(description='Order:', options=('', 'Original, Generated', 'Generated, Original'), value='')

Textarea(value='', description='Notes:', placeholder='Enter your notes')

Button(description='Next Article Pair', style=ButtonStyle())

You've reached the end of the DataFrame!
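
Once an evaluator's CSV has been downloaded, the guesses can be summarised per model. The sketch below is one possible way to do that, not part of the original notebook: `accuracy_by_model` is a hypothetical helper, the evaluator name `Alice` and the toy rows are made up for illustration, and it assumes the `Model` and `EVALUATOR_<name>_GUESS` columns written by the game above.

```python
import pandas as pd

def accuracy_by_model(df, guess_column):
    """Share of correct guesses per model, ignoring skipped or missing rows.

    Guesses are compared in string form because a column that mixes
    True/False with 'SKIPPED' is read back from CSV as strings.
    """
    guesses = df[guess_column].astype(str)
    answered = guesses.isin(['True', 'False'])
    scored = pd.DataFrame({
        'Model': df.loc[answered, 'Model'],
        'correct': guesses[answered] == 'True',
    })
    return scored.groupby('Model')['correct'].mean()

# Toy data standing in for pd.read_csv('EVALUATOR_Alice.csv') (hypothetical file)
results = pd.DataFrame({
    'Model': ['n-grams', 'n-grams', 'LLaMa', 'LLaMa', 'GPT-NEO'],
    'EVALUATOR_Alice_GUESS': ['True', 'True', 'False', 'True', 'SKIPPED'],
})

accuracy = accuracy_by_model(results, 'EVALUATOR_Alice_GUESS')
print(accuracy)
```

An accuracy near 0.5 would mean the evaluator could not tell that model's output from real articles; a model whose pairs were all skipped simply does not appear in the result.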

