## Generating Economist Articles with GPT

The model used here was trained in [this notebook](link).

- This was a good opportunity to experiment with `repetition_penalty` and `temperature`, both of which have a significant impact on the generated text.

- This project was a great demonstration of the power of GPT to imitate language style. Even though the articles that this model generates don't necessarily make sense, they do seem to capture the voice of The Economist.

Check out some articles that were generated [here]() or try making your own.

Titles were used as prompts in the hope that the generated text would relate to the title. Unfortunately, this is not the case. Then again, the model could be trained for more epochs and would certainly benefit from significantly more training data.
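
To make the effect of the two sampling parameters mentioned above more concrete, here is a toy sketch of what they do to next-token scores. It is not the `transformers` implementation: the four-token vocabulary, the logit values, and the helper names `softmax` and `apply_repetition_penalty` are all invented for illustration, and the penalty rule shown is the commonly described one (divide positive scores, multiply negative ones).

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Lower the scores of tokens that have already been generated."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Positive scores are divided and negative scores multiplied,
        # so a penalty above 1.0 always pushes the score down.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary

sharp = softmax(toy_logits, temperature=0.7)  # lower temperature: more peaked
flat = softmax(toy_logits, temperature=0.9)   # higher temperature: flatter
penalised = apply_repetition_penalty(toy_logits, seen_token_ids=[0, 3], penalty=1.1)

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(penalised)
```

A lower temperature concentrates probability on the top-scoring token, while a repetition penalty above 1.0 makes already-used tokens less likely to be picked again.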

In [None]:
!pip install transformers

In [2]:
# Only the classes this notebook actually uses are imported.
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


In [4]:
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

Using pad_token, but it is not set yet.


In [5]:
loaded_model = GPT2LMHeadModel.from_pretrained('caffsean/gpt2-the-economist')


In [10]:
from tqdm import tqdm

def generate_article(generator, title, loops, pool=5, lookback=-400):
  """Build an article chunk by chunk, feeding the tail of the text back in as the next prompt."""
  print(f'Writing article: {title}')
  # Prompt header: the title followed by a marker for the generated text.
  article_text = 'ECONOMIST ARTICLE TOPIC: ' + title + '\n\n\nTEXT OUTPUT:'

  for loop in tqdm(range(loops)):
    print(f'Loop {loop + 1}/{loops}')
    # Only the last 400 characters (not tokens) are passed back as context.
    prompt = article_text[lookback:]
    options = generator(prompt, num_return_sequences=pool)
    # Keep the longest of the `pool` candidates; the first one wins on ties.
    best = max(options, key=lambda option: len(option['generated_text']))
    article_text += best['generated_text']

  # Drop the trailing, possibly unfinished sentence (this also drops the final period).
  return '.'.join(article_text.split('.')[:-1])

def create_articles_from_title_list(generator, title_list, article_length):
  # `article_length` is the number of generation loops per article, not a word count.
  for title in tqdm(title_list):
    article_text = generate_article(generator, title, article_length)
    title_path = title.replace(' ', '_').lower()
    with open(f'/content/drive/MyDrive/NLP_2023/generated_economist_articles/{title_path}.txt', 'w') as f:
      f.write(article_text)
    print('Article Saved Successfully!')
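
The chunked generation loop can be dry-run without loading the model. The sketch below is an illustration only: `fake_generator` is a hypothetical stand-in that mimics the list-of-dicts shape returned by the text-generation pipeline, and `rolling_generate` is a stripped-down version of the loop that also records each prompt window so the lookback behaviour is visible.

```python
def fake_generator(prompt, num_return_sequences=1):
    """Hypothetical stand-in for the pipeline: candidate i contains i + 1 words."""
    return [{'generated_text': ' word' * (i + 1) + '.'}
            for i in range(num_return_sequences)]

def rolling_generate(generator, prompt, loops, pool=5, lookback=-400):
    """Append the longest candidate each loop, prompting with the tail of the text."""
    text = prompt
    prompts_seen = []
    for _ in range(loops):
        window = text[lookback:]  # at most the last `-lookback` characters
        prompts_seen.append(window)
        options = generator(window, num_return_sequences=pool)
        best = max(options, key=lambda option: len(option['generated_text']))
        text += best['generated_text']
    return text, prompts_seen

text, prompts_seen = rolling_generate(
    fake_generator, 'TITLE:', loops=3, pool=2, lookback=-10
)
print(text)
print(prompts_seen)
```

Because the window is measured in characters, a small `lookback` quickly pushes the title out of the prompt, which may be one reason the generated text drifts away from its title.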


### Grid Search Parameters for Qualitative Evaluation

In [None]:
top_ks = [10, 20, 50]
temps = [.70, .80, .90]
penalties = [1.0, 1.02, 1.1]
# Labels used in the output file names: A-C for top_k, 1-3 for temperature, X-Z for repetition_penalty.
letters = ['A', 'B', 'C', 'X', 'Y', 'Z']
numbers = [1, 2, 3]

def parameter_grid_search(title, save_title, top_ks, temps, penalties):
  # Generate one short article for every (top_k, temperature, repetition_penalty) combination.
  for x, k in enumerate(top_ks):
    for y, t in enumerate(temps):
      for z, p in enumerate(penalties):
        finetuned_generator = pipeline(
          'text-generation', model=loaded_model, tokenizer=tokenizer,
          return_full_text=False, max_length=400, do_sample=True,
          top_p=0.9, temperature=t, repetition_penalty=p, top_k=k
        )
        article = generate_article(finetuned_generator, title, 4, pool=5, lookback=-400)
        with open(f'/content/drive/MyDrive/NLP_2023/gridsearch_economist/gridsearch-{save_title}-{letters[x]}-{numbers[y]}-{letters[z+3]}-{k}-{t}-{p}.txt', 'w') as f:
          f.write(article)
        print(f'-{k}-{t}-{p} - Saved Successfully!')
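
As a rough alternative to the three nested loops, the same 27 combinations can be enumerated with `itertools.product`. This is a sketch only: the helper `grid_file_names` and its simplified file-name pattern are hypothetical and do not reproduce the letter and number labels used above.

```python
from itertools import product

top_ks = [10, 20, 50]
temps = [0.7, 0.8, 0.9]
penalties = [1.0, 1.02, 1.1]

def grid_file_names(save_title, top_ks, temps, penalties):
    """Return one file name per combination, in the same order as nested loops."""
    names = []
    for k, t, p in product(top_ks, temps, penalties):
        # Simplified, hypothetical naming scheme for illustration.
        names.append(f'gridsearch-{save_title}-k{k}-t{t}-p{p}.txt')
    return names

names = grid_file_names('singapore', top_ks, temps, penalties)
print(len(names))
print(names[0])
```

Listing the combinations up front makes it easy to check how many articles a run will produce before spending time on generation.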


In [None]:
title = 'SINGAPORE'
save_title = 'singapore'
parameter_grid_search(title,save_title,top_ks,temps,penalties)

In [None]:
title = 'ZAMBIA'
save_title = 'zambia'
parameter_grid_search(title,save_title,top_ks,temps,penalties)

In [None]:
title = 'NEW YORK'
save_title = 'new-york'
parameter_grid_search(title,save_title,top_ks,temps,penalties)

In [None]:
title = 'AMSTERDAM'
save_title = 'amsterdam'
parameter_grid_search(title,save_title,top_ks,temps,penalties)

## Generate Text with Chosen Parameters

In [7]:
finetuned_generator = pipeline(
    'text-generation', model=loaded_model, tokenizer=tokenizer,
    return_full_text=False, max_length=400, do_sample=True,
    top_p=0.9, temperature=0.8, repetition_penalty=1.015, top_k=10
)
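
The pipeline above combines `top_k` with `top_p`. The toy sketch below shows one way the two filters can interact, assuming top-k is applied first and nucleus (top-p) filtering second on the renormalised probabilities. The function `top_k_top_p_filter` and the probability values are invented for illustration, and the exact boundary handling inside `transformers` may differ.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Return the token ids that survive top-k, then nucleus (top-p) filtering."""
    # Rank token ids from most to least probable and keep the top_k best.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]
    total = sum(probs[i] for i in ranked)  # renormalise after the top-k cut
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / total
        if cumulative >= top_p:  # smallest set whose mass reaches top_p
            break
    return kept

toy_probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # made-up 5-token distribution

kept_wide = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.9)
kept_narrow = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(kept_wide, kept_narrow)
```

With a small `top_k` such as 10, the nucleus filter only matters when a few tokens already hold most of the probability mass.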

In [None]:
article_list = ['JAKARTA',
                'TOKYO',
                'SYDNEY',
                'BOSTON',
                'TORONTO',
                'ROME',
                'HELSINKI',
                'MOSCOW',
                'THIMPHU',
                'CAIRO',
                'LAGOS',
                'KATHMANDU']


create_articles_from_title_list(finetuned_generator, article_list, 4)