# Pun Generation

#### Members Names:
 
Kevin ioi <br>
Dougall Percival

https://github.com/KevinIoi/Puns_Generation.git

#### Members Emails:
kevin.ioi@ryerson.ca <br>
dougall.percival@ryerson.ca

# Introduction:

#### Problem Description:

The broad topic that will be discussed in this notebook is the puzzle of designing a machine learning algorithm that is able be (or mimic being) *creative*. That is to say, teach a computer to produce creative content. Specifically, the researchers have taken on the problem of automated pun generation.

#### Context of the Problem:

In recent years, the power of current machine learning algorithms has become immense. Researchers have figured out how to design algorithms that can learn and solve incredilbly complex tasks. However, one domain that remains out of reach, artificial creativity. Part of the difficulty is that creative tasks are by definition, not quantifiable. Thus there is no obvious formula to fall back on when trying to frame the problem.

Puns in particular are seemingly good candidates for early research into the area. They are essentially regular sentence that have been tweaked to be humorous. A regular sentence can be converted to a pun by cleverly replacing a word or two with another.

Regular sentence (maybe a little dramatic):
> Yesterday I accidentally swallowed some food colouring. The doctor says I'm OK, but I feel like I've **died** a little inside.

Pun sentence:
> Yesterday I accidentally swallowed some food colouring. The doctor says I'm OK, but I feel like I've **dyed** a little inside.

This of course requires an understanding of human lanuage, which is something that machines are getting quite good at. Again, the primary problem is inserting creativity into the process, in this case with the end goal of being funny. One difficulty with this is that generally text generation requires a large corpus of training data, and there is no such corpus of puns.

#### Limitation About other Approaches:

The first, and fairly major limitation, is that there is not a large dataset of existing puns. SemEval offers a small dataset, but it cannot be considered a full corpus of puns.  
The recent paper by Yu et al., proposed a language model, jointly decoding conditioned on pun/alternate words, adding ambiguity to sentences. However, the paper by Kao et al. [4] already demonstrated that ambiguity is not enough to add humour to a sentence. The meanings must also support one another across the entire phrase. Both of these papers are used as baselines to compare against the SurGen method.

#### Solution:

This paper (Pun Generation with Surprise [6]) approaches the problem by defining the idea of 'global-local surprisal' as the driving factor behind humour. They attempt to alter 'unhumorous' sentences, by swapping out a keyword with a homophone and utilizing multiple language models to maximize this surprisal metric while maintaining a logically fluid sentence 

# Background

Explain the related work using the following table

| Reference |Explanation |  Dataset/Input |Weakness
| --- | --- | --- | --- |
| Yu et al. [1] | Train a sequence-to-sequence neural model on unhumorous text. Then use a decoder conditioned on two wordsense (pun and regular word) to create the pun  | English Wikipedia | The encoded sentence doesn't always have the capcity to merge with both primary and secondary word senses
| Petrovic & Matthews [2] | Apply rules to identify word combinations that will create humor when inserted into a template sentence | Google N-gram data | Very restrictive model with structured output. Lacks the flexibility to be consistenly humorous and gramatically correct. 16% sucess rate
| Fuli et al. [5]| Use and adversarial learning approach to teach a text seq-to-seq generator to make sentences using two word senses and create believable ambiguity| English  Wikipedia, SemEval 2017 | very well done, although it is still effectively just altering provided sentences. Complete sentence generation would be interesting



# Methodology

As previously discussed the motivation for the proposed system is that a good pun sentence is the result of a 'pun' word being more likely given the overall context of the sentence, while the normal word is more likely in the local context. The local context is defined as with in a 3-5 word span of the target word.
<br><br>
They call this metric: 'Global-Local surprise' which is a ratio of the global and local surprisal scores. The surprisal scores are calculated as:<br>
\begin{equation*}
S(c) = -\log \dfrac{P(w^{pun}|c)}{P(w^{alter}|c)} = -\log \dfrac{P(w^{pun},c)}{P(w^{alter},c)}
\end{equation*}
Where the context (c) will change to define local or global context.<br>
Using that they define their Global-Local surprise as :\begin{equation*}
S(c)_{Global-Local} = -\log \dfrac{S(c_{local}}{S(c_{global}}
\end{equation*}<br>
Where a higher score theoretically indicates a better pun


There are 4 primary components to the system:
1. A list of target words and homophones (pun words), as well as a large corpus of unhumorous text
2. The retriever, which retrieves potential sentences from the unhumorous corpus based on our target words
3. The a Skip-gram model, which is used to identify topics related to our new pun
4. A Neural smoother which will 'smooth' the sentence after the new topic is inserted

#### Text Corpuses

- Bookcorpus is used as the text corpus of generally unhumorous sentences, which will be the 'templates' for our puns
- SemEval 2017, was used as a list of target words and homophones


##### Retriever

To begin, a **Retriever** vectorizes the corpus into a TF-IDF matrix, providing a way to identify similar words based on a given keyword, and seed sentences based on pun keywords.
Given an alternative word (word that could be replaced by a pun word), the retriever returns 500 candidates. The top 100 are taken as seed sentences. Seed sentences are sentences with the potential to be transformed into puns. They come from a large, generic corpus (in this case, all of the books). These sentences satisfy three requirements: a strong association between alternative word and the local context; strong association between pun word and distant context; both words are interpretable given local and global context to maintain ambiguity.


In each template retrieved, the target word is swapped out for the pun word 

#### Skip-gram

Then to change the global topic of the sentence a Skip-gram model is used to find words that are similar to the newly added pun word

A Skip-gram model is used to identify relatedness between two words. In the case of pun generation it is used to identify relations between an alternate word, near the end of a seed sentence (word to be replaced in the pun), and k candidate topic words, at the beginning of the sentence. 
Candidate topic words are selected using a “distant” skip-gram model. This model maximizes *pθ(wj|wi)*, for all *wi, wj* in a sentence between *d1* and *d2* words apart. See equation below.  

\begin{equation*}
\sum_{j=1-d_1}^{i-d_2} \log p_0(w_j|w_i) + \sum_{j=1+d_1}^{i+d_2} \log p_0(w_j|w_i) 
\end{equation*}

From this model, the top k predictions from *pθ(w|wp)*, where *wp* is the pun word, and *w* is the candidate topic to be further filtered. In this paper, Skip-gram with Negative Sampling (SGNS) is implemented. This reduces the training time for the Skipgram Model by feeding it many negative examples, rather than the few positive examples that would exist for any word-context pair.   
*This model was trained on d1 = 5, d2 = 10. Embedding size was 300, and 15 epochs were run.*  


A further check is performed on the candidate topic word, to ensure it is type consistent. Since verbs have more chance of causing a nonsense sentence, the swap words are constrained to nouns and pronouns. Wordnet Synsets are then used (with word/POS tag pair) to ensure the candidate word, and its possible replacement are above a certain similarity threshold.

#### Neural Smoother

Given the pun sentence and a new topic, the neural smoother will insert the topic into the sentence 'smoothly'

The Neural Smoother is an encoder-decoder model that is used to 'smooth' the insertion of the new topic word. Both the encoder and decoders are single layer LSTMs with an attention module connecting the two. To train the model, they select target words in each sentence of a large corpus and delete the surrounding words, inserting a placeholder token. The model is then tasked to predict the deleted words based on the context of the target word and the rest of the sentence.
For example:

    Original sentence: "Dave was enjoying the lovely summer weather"
    Target Word: 'enjoying'
    Removed tokens: ['was', 'the']
    
    Input  -> "Dave <placeholder> enjoying <placeholder> lovely summer weather"
    Target output -> ['was', 'the']

In the system, this model is given the location of a token to remove and a topic word to insert. The system swaps out the original token with the new topic and deletes the words surrounding it. It then passes the sentence to the smoother which predicts words that are most likely to surround it, thus 'smoothing' the sentence around the new topic.

Here is an example of the architecture, however the one in the image has multiple RNN layers in the encoder and decoder:
<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/10/Encoder-Decoder-Architecture-for-Neural-Machine-Translation.png" alt="drawing" width="500"/>
Image obtained from:<br>
Denny Britz, Anna Goldie, Minh-Thang Luong, & Quoc Le. (2017). Massive Exploration of Neural Machine Translation Architectures


*********************************************************************************************************************  
*********************************************************************************************************************

# Installation Note
Running this notebook requires a very specific setup, including Python version 3.6, and PyTorch version 0.4.0. Here is the information for setting up the environment properly.  
First, navigate to the downloaded pungen folder.  
If you are using Anaconda, run the following command (where envname is your environment name):  
  
*conda env create --name envname --file=environments.yml*

If you are using venv, you can install packages via pip. After creating a new venv, use:  

*pip install -r requirements.txt*

This will install the needed packages.  

## fairseq Installation
When installing the package fairseq, note that the distribution being installed is not the most up to date version. The paper's original authors actually edited modules in fairseq, and have packaged their version on their github. While the requirements.txt file points to this distribution, to install the followings commands may be used.

*git clone -b pungen https://github.com/hhexiy/fairseq.git  
cd fairseq  
pip install -r requirements.txt  
python setup.py build develop*

## Directory structure/pretrained models
To ensure the directory structure fits for the program implementation and so avoid having to train very large models it is recommended you download our premade folder. The Wikitext model below will have to be aquired separately. https://drive.google.com/a/ryerson.ca/file/d/1Wmh7gbgxZV6GEPEl5F4Net8d4cHE3H7b/view?usp=sharing


## Wikitext Model
One of the required models, Wikitext-103, comes from the actual fairseq site. To setup properly, use these commands:

*curl --create-dirs --output models/wikitext/model https://dl.fbaipublicfiles.com/fairseq/models/wiki103_fconv_lm.tar.bz2  
tar xjf models/wikitext/model -C models/wikitext  
rm models/wikitext/model*  

## Other requirements
We have tried to packages as much of the required material as possible here. Any missing files will be generated at run-time.

## Possible Issues
When trying to set up the required environment on Windows, my computer ran into compiler issues, and was unable to complete the setup. The solution was to set up Ubuntu on a virtual box to set up properly. I gave my Virtual Box 4Gb of RAM and 50Gb of storage, and it was sufficient for this task.  

The original source of data for this project (smashwords.com) no longer allows users to fully download the corpus. As such, only a sample of the original source was available for this implementation.  

*********************************************************************************************************************
*********************************************************************************************************************

# Implementation

### Retriever
The first step is to create a retriever.  

See methodology section for full explanation. The retriever creates a TF-IDF matrix of words in the corpus, and generates a list of "seed sentences", identifying sentences where a pun can likely be created. The retriever also permits querying, based on a keyword, to find similar words. 



In [1]:
import argparse
import os, sys
import numpy as np
import time
import pickle
from functools import total_ordering
from sklearn.feature_extraction.text import TfidfVectorizer
from enum import IntEnum
from utility import sentence_iterator

import logging
logger = logging.getLogger('pungen')

# establish retriever and Template classes
# retriever is essentially a TFIDF vectorized matrix for our book corpus

@total_ordering
class Template(object):
    def __init__(self, tokens, keyword, id_):
        self.id = int(id_)
        self.tokens = tokens
        self.keyword_positions = [i for i, w in enumerate(tokens) if w == keyword]
        self.num_key = len(self.keyword_positions)
        self.keyword_id = None if self.num_key == 0 else max(self.keyword_positions)

    def __len__(self):
        return len(self.tokens)

    def replace_keyword(self, word):
        tokens = list(self.tokens)
        tokens[self.keyword_id] = word
        return tokens

    def __str__(self):
        return ' '.join(['[{}]'.format(w) if i == self.keyword_id else w for i, w in enumerate(self.tokens)])

    def __lt__(self, other):
        # Containing keyword is better
        if self.num_key == 0:
            return True
        # Fewer keywords is better
        if self.num_key > other.num_key:
            return True
        # Later keywords is better
        if self.keyword_id < other.keyword_id:
            return True
        return False

    def __eq__(self, other):
        if self.num_key == 0 and other.num_key == 0:
            return True
        if self.num_key == other.num_key and self.keyword_id == other.keyword_id:
            return True
        return False


class Retriever(object):
    def __init__(self, doc_files, path=None, overwrite=False):
        logger.info('reading retriever docs from {}'.format(' '.join(doc_files)))
        self.docs = [line.strip() for line in open(doc_files[0], 'r')]

        if overwrite or (path is None or not os.path.exists(path)):
            logger.info('building retriever index')
            self.vectorizer = TfidfVectorizer(analyzer=str.split)
            self.tfidf_matrix = self.vectorizer.fit_transform(self.docs)
            if path is not None:
                self.save(path)
        else:
            logger.info('loading retriever index from {}'.format(path))
            with open(path, 'rb') as fin:
                obj = pickle.load(fin)
                self.vectorizer = obj['vectorizer']
                self.tfidf_matrix = obj['tfidf_mat']

    def save(self, path):
        with open(path, 'wb') as fout:
            obj = {
                    'vectorizer': self.vectorizer,
                    'tfidf_mat': self.tfidf_matrix,
                    }
            pickle.dump(obj, fout)

    def query(self, keywords, k=1):
        features = self.vectorizer.transform([keywords])
        scores = self.tfidf_matrix * features.T
        scores = scores.todense()
        scores = np.squeeze(np.array(scores), axis=1)
        ids = np.argsort(scores)[-k:][::-1]
        return ids

    def valid_template(self, template):
        return template.num_key == 1

    def retrieve_pun_template(self, alter_word, len_threshold=10, pos_threshold=0.5, num_cands=500, num_templates=None):
        ids = self.query(alter_word, num_cands)
        templates = [Template(self.docs[id_].split(), alter_word, id_) for id_ in ids]
        templates = [t for t in templates if t.num_key > 0 and len(t.tokens) > len_threshold]
        if len(templates) == 0:
            logger.info('FAIL: no retrieved sentence contains the keyword {}.'.format(alter_word))
            return []

        valid_templates = [t for t in templates if self.valid_template(t)]
        if len(valid_templates) == 0:
            valid_templates = templates
        templates = sorted(valid_templates, reverse=True)[:num_templates]
        return templates
    
# helper function

Word = IntEnum('Word', [(x, i) for i, x in enumerate('TOKEN LEMMA TAG'.split())])   


In [2]:
# create a file of tokenized sentences
input_file = "data/train.txt"
output_file = "data/bookcorpus/raw/sent.tokenized.txt"
ner = False

with open(output_file, 'w') as fout:
    for s in sentence_iterator(input_file):
        if ner:
            ss = []
            [ss.extend(w[Word.TOKEN].split('_')) for w in s]
            ss = ' '.join([x.lower() for x in ss])
        else:
            ss = ' '.join([w[Word.TOKEN] for w in s])
        fout.write(ss + '\n')

In [3]:
# apply retriever
doc_file = "data/bookcorpus/raw/sent.tokenized.txt"
ret_file = "retriever/our_retriever.pkl"
overwrite = True

# generate the retreiver
retriever = Retriever([doc_file], ret_file, overwrite)

## Neural Smoother (Combiner)

The Neural Smoother is an encoder-decoder model that is used to 'smooth' the insertion of the new topic word. Both the encoder and decoders are single layer LSTMs with an attention module connecting the two.

### Preprocess training data

The Training data are sentences that have had 3 sequential words removed and replaced with the token <placeholder\> <br>

Like this:<br>
- Original sentence: "Dave **was enjoying the** lovely summer weather"<br>
- Altered sentence: "Dave **<placeholder\>** lovely summer weather"<br>
- Removed tokens: ['was', 'enjoying', 'the']<br>
<br>

The middle word (enjoying) is saved as the 'target' word 

In [2]:
#
# helper functions
#

import argparse
import random
from spacy.lang.en.stop_words import STOP_WORDS
from enum import IntEnum

# quick way to break the word chunks we created
Word = IntEnum('Word', [(x, i) for i, x in enumerate('TOKEN LEMMA TAG'.split())])

def split_sent(words, delete_frac=0.5, window_size=1):
    """
        Finds a word good candidate word to smooth around
        Removes the candidate word and words within 'window_size' range around it
        Replaces them with <placeholder> token
        
        returns:
        - The altered sentence (template)
        - The key word chosen to smooth around
        - The deleted words
    """

    N = len(words)
    n = max(1, int(delete_frac * N))
    deleted_keyword = None

    # find an appropriate word to delete, in the sentence provided
    for i, w in enumerate(words):
        if i < n and w[Word.TAG] in ('NOUN', 'PROPN', 'PRON') and not w[Word.TOKEN] in STOP_WORDS:
            left, right = i, i+1
            deleted_keyword = w[Word.TOKEN]

            w = window_size
            left = max(0, i - w)
            right = min(N, i + w + 1)
    
    # didn't find any good candidate words
    if not deleted_keyword:
        return None, None, None

    # makes a 'Template' sentence with the target word and one word on either side removed
    template = [w[Word.TOKEN] for w in words[:left]] + ['<placeholder>'] + [w[Word.TOKEN] for w in words[right:]]

    # holds the words that were deleted
    deleted = [w[Word.TOKEN] for w in words[left:right]]

    return template, deleted_keyword, deleted

In [19]:
#
# Example of preprocessing
#raw_sentence_iterator = nlp.pipe([' '.join(t.tokens) for t in templates])
import utility
import spacy
from spacy.symbols import ORTH, LEMMA, POS, TAG
from fairseq import tokenizer
nlp = spacy.load('en_core_web_sm', disaable=['ner'])

In [74]:
sentence = "Dave was enjoying the lovely summer weather"

# we need to pre'preprocess the sentence so each token is in format <token|lemma|pos>
# we will do that using the spaCy package
prepreprocess_sent = []
for word in nlp(sentence):
    prepreprocess_sent.append([word,word.lemma_,word.pos_])

# prepreprocess_sent = " ".join(prepreprocess_sent)
print("Original sentence:\n",sentence)
print("Pre'Preprocessing:\n",prepreprocess_sent)

altered_sentence, target, deleted_words = split_sent(prepreprocess_sent)

print("\n\nAltered sentence:\n",altered_sentence)
print("Target word:\n",target)
print("Deleted words:\n",deleted_words)



Original sentence:
 Dave was enjoying the lovely summer weather
Pre'Preprocessing:
 [[Dave, 'Dave', 'PROPN'], [was, 'be', 'AUX'], [enjoying, 'enjoy', 'VERB'], [the, 'the', 'DET'], [lovely, 'lovely', 'ADJ'], [summer, 'summer', 'NOUN'], [weather, 'weather', 'NOUN']]


Altered sentence:
 ['<placeholder>', enjoying, the, lovely, summer, weather]
Target word:
 Dave
Deleted words:
 [Dave, was]


In [40]:
#
# This cell will preprocess a corpus of data from a file
#

fileName = "data/sampleCorpus.txt"
src_output = "data/1split.src"
tgt_ouput = "data/1split.tgt"

import os
from utility import sentence_iterator

# ensure output folder exists 
if not os.path.isdir('data'):os.mkdir('data')
    
# open file objects to print output
fp_src = open(src_output, 'w')
fp_tgt = open(tgt_ouput, 'w')

# iterate through each line in input file
for words in sentence_iterator(fileName):
    
    # selects target_word
    # removes target_word and words on either side of it replacing them with <placeholder>
    template, target_word, deleted = split_sent(words)
    if not template:#failed to generate (probably empty line)
        continue

    # print target (words that were deleted)
    fp_tgt.write('{}\n'.format(' '.join(deleted)))

    # print target_word, then sentence with <placeholder> on next line
    fp_src.write('{target}\n{temp}\n'.format(
        target=target_word,
        temp=' '.join(template)))


### Neural Smoother Training

The model encodes the altered sentence and then conditioned on the 'Target' word the decoder tries to fill in the placeholder to recreate the sentence.

Altered sentence: "Dave <placeholder> lovely summer weather"<br>
Target token: 'enjoying'

Hopfully the model will be able to learn how to 'smooth' sentences to make words fit into them. This takes a lot of training and a large corpus of text  
    

In [122]:
from fairseq import options, tasks, utils
from fairseq.data import iterators
from fairseq.trainer import Trainer
from argparse import Namespace
import utility
import itertools

# Setup fairseq task, controls/defines the training process
task_args = Namespace(task='edit', data='data/', target_lang='tgt', source_lang='src',
                      left_pad_source=True, left_pad_target=True, combine='token', insert='target',raw_text=True,
                     max_source_positions=400,max_target_positions=400)
task = tasks.setup_task(task_args)

# have the task object load datasets
utility.load_dataset_splits(task, ['split'])

# Construct model
# this is essentially just making an encoder-decoder model with single layer LSTMs and using attention
model_args = Namespace(encoder_embed_dim= 512,encoder_hidden_size= 512,encoder_num_layers= 1,
                    encoder_dropout_in= 0.1,encoder_dropout_out= 0.1,encoder_bidirectional= False,
                    encoder_pretrained_embed= None,decoder_embed_dim= 512,decoder_hidden_size= 512,
                    decoder_out_embed_dim= 512,decoder_num_layers= 1,decoder_dropout_in= 0.1,
                    decoder_dropout_out= 0.1,decoder_attention= True,decoder_encoder_embed_dim= 512,
                    decoder_encoder_output_units= 512,decoder_pretrained_embed= False)
model = task.build_model(model_args)

# essentially creates an object to evaluate loss 
criterion_args = Namespace(criterion='cross_entropy', sentence_avg=False)
criterion = task.build_criterion(criterion_args)

trainer = Trainer(args, task, model, criterion, dummy_batch)

# Initialize dataloader
epoch_itr = task.get_batch_iterator(
    dataset=task.dataset('train'),
    ignore_invalid_inputs=True,
    required_batch_size_multiple=8,
    seed=101)

# train for all epochs
while epoch_itr.epoch < max_epoch:
    train(args, trainer, task, epoch_itr)
    
utils.save_state(
    "checkpoint_best.pt", trainer.args, trainer.get_model(), trainer.criterion, trainer.optimizer,
    trainer.lr_scheduler, trainer._num_updates, trainer._optim_history,)

### Loading a trained smoother from File

In [4]:
from argparse import Namespace
from fairseq.sequence_generator import SequenceGenerator
from fairseq import tasks, utils

def get_Smoother():
    
    # define task model will perform
    task_args = Namespace(task='edit', data='combiner-data/',left_pad_source=True,
                          left_pad_target=False,max_source_positions=1024,max_target_positions=1024,
                          combine='embedding', insert='none', source_lang = 'src',target_lang = 'tgt', shard_id=0 )
    task = tasks.setup_task(task_args)

    # load model 
    models, model_args = utils.load_ensemble_for_inference(["combiner/models/checkpoint_best.pt"], task)
    model = models[0]

    # create sequence generator with model
    generator = SequenceGenerator(
                [model], task.target_dictionary, beam_size=20, stop_early=True,
                normalize_scores=True, len_penalty=1,
                unk_penalty=100.0, sampling=False, sampling_topk=-1,
                sampling_temperature=1,minlen=1)
    
    return generator, task

## Pun Scoring Metric: Global-Local Surprise

The scoring metric used in the paper focuses on maximizing the local surprise of the new pun word, while minimizing the global surprise
<br>
To measure surprise they use the probabilities of spans of tokens. To calculate this they utilize a neural language model from another paper "Pointer sentinel mixture models."(Merity, 2016)

<br>

A basic unigram model is also used to calculate the unconditional probabilities of the tokens

In [5]:
from argparse import Namespace
from fairseq.sequence_scorer import SequenceScorer
from fairseq import options, tasks, utils
from utility import LMScorer
import os

def getLanguageModel():
    ''' Loads a pretrained language model, in a "scorer" wraper
        Will be used to score the probablilty of a sentence
        
        This model is the product of another paper:
            S. Merity, C. Xiong, J. Bradbury, and R. Socher. 2016.Pointer sentinel mixture models.
            arXiv preprint arXiv:1609.07843.
    '''
    
    # define the task for the model
    scorer_task_args = Namespace(data=os.path.dirname("models/wikitext/"), path="models/wikitext/", cpu=True,
    task='language_modeling',output_dictionary_size=-1, self_target=False, future_target=False, past_target=False)
    task = tasks.setup_task(scorer_task_args)
    
    # load dictionary and weights
    models, _ = utils.load_ensemble_for_inference(["models/wikitext/wiki103.pt"], task)

    # regenerate model
    scorer = SequenceScorer(models, task.target_dictionary)
    languageModel = LMScorer(task, scorer)
    
    return languageModel


In [6]:
class UnigramModel(object):
    '''
        Simple model to hold the unconditional probabilities of of words in our corpus
    '''
    
    def __init__(self, counts_path, oov_prob=0.03):
        self.word_counts = self.load_model(counts_path)
        self.total_count = sum(self.word_counts.values())
        self.oov_prob = oov_prob # out-of-vocab probability
        self._oov_smoothing_prob = self.oov_prob * (1. / self.total_count)

    def load_model(self, dict_path):
        counts = {}
        with open(dict_path, 'r') as fin:
            for line in fin:
                ss = line.strip().split()
                counts[ss[0]] = int(ss[1])
        return counts

    def _score(self, token):
        p = self.word_counts.get(token, 0) / float(self.total_count)
        smoothed_p = (1 - self.oov_prob) * p + self._oov_smoothing_prob
        return np.log(smoothed_p)

    def score(self, tokens):
        return [self._score(token) for token in tokens]
    
def getUnigramModel(fileName="word-counts/dict.txt"):
    unigram_model = UnigramModel(fileName)
    return unigram_model

In [7]:
def surprisal_score(pun_sent, pun_word_index, alter_word, lm,um,local_window_size=2):
    '''
        The Metrix this paper uses to score puns
    
        Uses a condiational language model to estimate probability of spans of text
    
        Focuses on the full pun sentence for 'global surprise'
        Focuses on the small 'local_window' around the pun word for 'local surprise'
    '''
    
    # recreate original sentence
    alter_sent = list(pun_sent)
    alter_sent[pun_word_index] = alter_word

    # define local context range
    local_start = pun_word_index - local_window_size
    local_end  = pun_word_index + local_window_size
    local_pun_sent = pun_sent[local_start:local_end]
    local_alter_sent = alter_sent[local_start:local_end]

    # gather the different sentences
    sents = [alter_sent, pun_sent, local_alter_sent, local_pun_sent]
    
    # get the probabilities of the sentences (log prob)
    lm_scores = lm.score_sents(sents, tokenize=lambda x: x)
    
    # get unigram probability of pun sente
    unigram_score = um.score(pun_sent)
    
    # calculate global and local surprise values 
    global_surprisal = np.sum(lm_scores[0]) - np.sum(lm_scores[1])
    local_surprisal = np.sum(lm_scores[2]) - np.sum(lm_scores[3])

    # grammar score, how much more likely the sentence is given context
    grammar = (np.sum(lm_scores[1]) - np.sum(unigram_score)) / len(pun_sent)

    if global_surprisal < 0 or local_surprisal < 0:
        # discount sentences with negative surprisal scores
        r = -1.
    else:
        # calc surprise ration
        r = local_surprisal / global_surprisal  # larger is better
    
    return r+local_surprisal+global_surprisal+grammar

# Pun Generation

In [8]:
from utility import SkipGram
import utility
import spacy
from spacy.symbols import ORTH, LEMMA, POS, TAG
from fairseq import tokenizer
nlp = spacy.load('en_core_web_sm')

#####  To generate a pun we first need a pair of homophones to use as the pun and alter words Then we need some sentences to try to turn into puns

In [9]:
# define words
alter_word = "die"
pun_word = "dye"

# initialize retriever, will get sentences containing our alter_word
retriever = Retriever(["tmp/train.tokenized.txt"], "retriever/retriever.pkl", False)

# pull sentences that contain the alter word
retrieved_templates = retriever.retrieve_pun_template(alter_word, num_templates=20)
pun_word_indicies = [t.keyword_id for t in retrieved_templates]

sentences = []
templates = []
for t in retrieved_templates:
    if t.tokens not in sentences:
        templates.append(t)
        sentences.append(t.tokens)

# keep track of the index of the pun word in each sentence
pun_word_indicies = [t.keyword_id for t in templates]
        
for i, template in enumerate(templates):
    print(template)

He had come alone , and he was going to [die] .
If we do n't , he or she may [die] .
" I 'm not going to let you [die] . "
And I 'm not going to let him [die] . "
" You mean he 's not going to [die] ? "
And that way you would n't have to [die] at all . "
, he asked , " How did he [die] ? "
He whispered , " It 's to [die] for . "
" I 'm going to [die] , are n't I ? "
" And you will probably [die] just like her . "


##### Now to find words to switch out to switch the topic

In [10]:
# Get positions of possible topic words to replace
raw_sentence_iterator = nlp.pipe([' '.join(t.tokens) for t in templates])

min_distance_to_topic = 0.1 # % of sentence length that should separate pun and topic 
topic_word_indicies = [] # locations of topics

for i, sent in enumerate(raw_sentence_iterator):
    topic_word_indicies.append([])
    for j, word in enumerate(sent[:pun_word_indicies[i]-len(sent)*min_distance_to_topic]):
        if word.pos_ in ('NOUN', 'PROPN', 'PRON'):# topics have to be these type of words
            topic_word_indicies[-1].append(j)


In [11]:
# these are the indices of the word that will help switch the topic of the phrase
topic_word_indicies

[[0, 6], [1, 5], [1, 3], [1, 3], [1, 3], [2, 3], [1], [0, 4], [1], [2]]

##### Putting the Puns together

Lets only do once sentence and one replacment topic for now

In [12]:
" ".join(templates[0].tokens)

'He had come alone , and he was going to die .'

In [13]:
# get first topic word suggestion
topic_word_index =topic_word_indicies[0][0]
alter_sent = templates[0].tokens

# replace the alter word with in the pun word
pun_sent = templates[0].replace_keyword(pun_word)
pun_word_index = template.keyword_id
print(pun_sent)

['He', 'had', 'come', 'alone', ',', 'and', 'he', 'was', 'going', 'to', 'dye', '.']


##### Use wordnet word sense to determine the 'type' of the pun word

This will be used to filter potential topic words

In [14]:
# helper to identify the wordnet 'sense' of a word
type_recognizer = utility.TypeRecognizer(threshold=0.2)

# get the lemma and the word sense of the original topic word
old_topic_word_lemma = nlp(alter_sent[topic_word_index])[0].lemma_
if old_topic_word_lemma == "-PRON-":
    old_topic_word_lemma = alter_sent[topic_word_index]
types = type_recognizer.get_type(old_topic_word_lemma, 'noun')

##### Use the skipgram model to find related topic words

In [15]:
# Use the skipgram model to find words that are related to the pun word
skipgram = SkipGram.load_model("skipgram/dict.txt", "skipgram/model.pt", embedding_size=300, cpu=False)
potential_topic_words = skipgram.predict_neighbors(pun_word, k=100, masked_words=[old_topic_word_lemma])
potential_topic_words[:20]

['fishnet',
 'artificially',
 'factionless',
 'creamer',
 'strappy',
 'bangle',
 'daub',
 'frizz',
 'encode',
 'hair',
 'josey',
 'cm',
 'artful',
 'frizzy',
 'magenta',
 'coloring',
 'floppy',
 'tanning',
 'beehive',
 'stylist']

In [16]:
# some of those are better than others to say the least
# lets filter them so they have similar usage to our original word

replacement_topic_words = []

# filter potentil topic words to find those with similar sense as original topic word
for word in potential_topic_words:
    if type_recognizer.is_types(word, types, 'noun'):
        replacement_topic_words.append(word)
replacement_topic_words = [replacement_topic_words[1]]
replacement_topic_words

['colour']

In [17]:
# Now lets try inserting the topic words and 'smoothing' the sentences

# for this we will be using the neural smoother model
smoother, smoother_task = get_Smoother()

# at the same time we will be scoring our puns
scorer = surprisal_score
lm = getLanguageModel()
um = getUnigramModel()

# try to smooth the sentence around the newly inserted topic word
for word in replacement_topic_words:
    
    # remove the original topic word as well as the words around it
    smoothing_input_sent = pun_sent[:max(0,topic_word_index-1)] + ['<placeholder>'] + [word]+['<placeholder>'] + pun_sent[topic_word_index+2:]

    src_dict = smoother_task.source_dictionary
    max_positions = (100000, 100000)
    
    # setting up the 'dataset' for the fairseq model
    for inputs in utility.makeFairseqDataset([smoothing_input_sent], [word], src_dict, max_positions,smoother_task):
        tokens = inputs['net_input']['src_tokens']
        token_lens = inputs['net_input']['src_lengths']
        encoder_input = {'src_tokens': tokens, 'src_lengths': token_lens}
        
        # generate the smoothed sentence
        outputs = smoother.generate(encoder_input, maxlen=200)
        tgt_dict = smoother_task.target_dictionary
        smoothed_tokens,smoothed_sent,_ = utils.post_process_prediction(hypo_tokens=outputs[0][0]['tokens'].int().cpu(),src_str=None,alignment=None,align_dict=None,tgt_dict=tgt_dict,remove_bpe="None")
        smoothed_sent =pun_sent[:max(0,topic_word_index-1)] + smoothed_sent.split() + pun_sent[topic_word_index+2:]
        
    surp_score = scorer(smoothed_sent,topic_word_index+1 ,alter_word,lm,um )
    print("The new sentence:")
    print(" ".join(smoothed_sent))
    print(f"The local-global surprisal score: {surp_score}")
    break
    

| [src] dictionary: 37360 types
| [tgt] dictionary: 18912 types
| dictionary: 267744 types
['<placeholder>', 'colour', '<placeholder>', 'come', 'alone', ',', 'and', 'he', 'was', 'going', 'to', 'dye', '.']




The new sentence:
the woman thing he saw come alone , and he was going to dye .
The local-global surprisal score: -0.5802893262516355


# Making Batches of puns

This function will pull relevent sentences from the corpus to generate multiple puns using a given pun/alter word pair

In [18]:
def makePuns(alter_word, pun_word, retriever, skip_gram, smoother, smoother_task, scorer, lm, um):

    # helper to identify the wordnet 'sense' of a word
    type_recognizer = utility.TypeRecognizer(threshold=0.20)

    # pull sentences that contain the alter word
    retrieved_templates = retriever.retrieve_pun_template(alter_word, num_templates=20)
    
    # remove duplicate sentences
    sentences = []
    templates = []
    for t in retrieved_templates:
        if t.tokens not in sentences:
            templates.append(t)
            sentences.append(t.tokens)
    
    pun_word_indicies = [t.keyword_id for t in templates]
    
    # Get positions of possible topic words to replace
    raw_sentence_iterator = nlp.pipe([' '.join(t.tokens) for t in templates])
    min_distance_to_topic = 0.3 # % of sentence length that should separate pun and topic 
    topic_word_indicies = [] # locations of topics
    for i, sent in enumerate(raw_sentence_iterator):
        topic_word_indicies.append([])
        for j, word in enumerate(sent[:pun_word_indicies[i]-len(sent)*min_distance_to_topic]):
            if word.pos_ in ('NOUN', 'PROPN', 'PRON'):# topics have to be these type of words
                topic_word_indicies[-1].append(j)

    puns = []
    alter_sentences = []
    for i, (template, topic_word_idxs) in enumerate(zip(templates, topic_word_indicies)):
        for topic_word_id in  topic_word_idxs:
            # store the original sentence
            alter_sent = template.tokens

            # swap pun word
            pun_sent = template.replace_keyword(pun_word)
            pun_word_index = template.keyword_id

            # get the lemma and the word sense of the original topic word
            old_topic_word_lemma = nlp(alter_sent[topic_word_id])[0].lemma_
            if old_topic_word_lemma == "-PRON-":
                old_topic_word_lemma = alter_sent[topic_word_id]
            types = type_recognizer.get_type(old_topic_word_lemma, 'noun')

            # find words that are related to the pun word
            potential_topic_words = skip_gram.predict_neighbors(pun_word, k=100, masked_words=[old_topic_word_lemma])

            replacement_topic_words = []
            for word in potential_topic_words:
                if type_recognizer.is_types(word, types, 'noun'):
                    replacement_topic_words.append(word)

            # try to smooth the sentence around the newly inserted topic word
            for word in replacement_topic_words:
                smoothing_input_sent = pun_sent[:max(0,topic_word_id-1)] + ['<placeholder>'] + [word]+['<placeholder>'] + pun_sent[topic_word_id+2:]
   
                src_dict = smoother_task.source_dictionary
                max_positions = (100000, 100000)

                for inputs in utility.makeFairseqDataset([smoothing_input_sent], [word], src_dict, max_positions,smoother_task):
                    tokens = inputs['net_input']['src_tokens']
                    token_lens = inputs['net_input']['src_lengths']
                    encoder_input = {'src_tokens': tokens, 'src_lengths': token_lens}
                    outputs = smoother.generate(encoder_input, maxlen=200)
                    tgt_dict = smoother_task.target_dictionary
                    smoothed_tokens,smoothed_sent,_ = utils.post_process_prediction(hypo_tokens=outputs[0][0]['tokens'].int().cpu(),src_str=None,alignment=None,align_dict=None,tgt_dict=tgt_dict,remove_bpe="None")
                    smoothed_sent =pun_sent[:topic_word_id-1] + smoothed_sent.split() + pun_sent[topic_word_id+3:]

                surp_score = scorer(smoothed_sent,pun_word_index+1 ,alter_word,lm,um )
                puns.append([smoothed_sent,surp_score])
        
    return puns

In [19]:
retriever = Retriever(["tmp/train.tokenized.txt"], "retriever/retriever.pkl", False)
smoother, task = get_Smoother()
skipgram = SkipGram.load_model("skipgram/dict.txt", "skipgram/model.pt", embedding_size=300, cpu=False)
surprisal_scorer = surprisal_score
lm = getLanguageModel()
um = getUnigramModel()

| [src] dictionary: 37360 types
| [tgt] dictionary: 18912 types
| dictionary: 267744 types


In [20]:
puns = makePuns("die", "dye", retriever, skipgram, smoother, task, surprisal_scorer, lm, um)



In [21]:
seen_puns = []
for pun in puns:
    if pun[0] not in seen_puns:
        print(" ".join(pun[0]))
        seen_puns.append(pun[0])

He had come alone , and he was going to dye the woman thing he saw alone , and he was going to dye .
the only thing those , , he or she may dye .
the only thing those in , he or she may dye .
the woman time i saw , he or she may dye .
, and that 's , he or she may dye .
the smile nine that , he or she may dye .
the woman thing i saw , he or she may dye .
the woman thing he saw , he or she may dye .
the only thing he 's , he or she may dye .
the woman thing i have going to let you dye . "
the woman thing i know going to let you dye . "
the woman thing i have going to let him dye . "
, " goodness looking , going to let him dye . "
" You the woman thing i know going to dye ? "
" You the smile thing you have going to dye ? "
" You the smile thing you kept going to dye ? "
" You the smile nine you have going to dye ? "
" You the smile thing that 's going to dye ? "
" You " " field , going to dye ? "
And the way you earth n't have to dye at all . "
And the smile way , n't have to dye at all 

# Conclusion and Future Direction

The local-global surprisal principle demonstrates some improvement over previous attempts, as seen in the 2013 paper from Petrovic & Matthews [2]. This method makes detecting candidate pun sentences more effective, although there is still a long way to go to meet human funniness standards.
Future progress will involve finding ways to have language models separate creative, well-formed material from throw away nonsensical sentences.

# References:

[1]: Yu, Zhiwei & Tan, Jiwei & Wan, Xiaojun. (2018). A Neural Approach to Pun Generation. 1650-1660. 10.18653/v1/P18-1153

[2]:  Petrovic & Matthews. (2013). Unsupervised  joke generation from big  data. In proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, Volume 2: Short Papers, pages 228–232.

[3]: S. Merity, C. Xiong, J. Bradbury, and R. Socher. 2016.Pointer sentinel mixture models.
            arXiv preprint arXiv:1609.07843.
            
[4]: J.T. Kao, R.Levy and N. D. Goodman. 2015. A computational model of linguistic humour in puns. Cognitive Science.

[5] Fuli Luo, Shunyao Li, Pengcheng Yang, Lei li, Baobao Chang, Zhifang Sui, & Xu Sun. (2019). Pun-GAN: Generative Adversarial Network for Pun Generation. 

[6] He He, Nanyun Peng, & Percy Liang (2019). Pun Generation with Surprise. In North American Chapter of the Association for Computational Linguistics (NAACL).