# Dappity Dap

### Characteristics of Puns
* Converging Meanings 
* Sound 
* Association

_Things to try:_
* split words by sound / parsing to increase accuracy of converging meanings hypothesis
    -  e.g. "The soundtrack for Blackfish was **orca**strated."

### Target: Converging Meanings

We have observed that puns often make use of words that have very similar meanings. For example:

'He said I was **average** - but he was just being **mean**.'

where 'average' and 'mean' are synonyms in one sense, while 'mean' also carries a second sense ('unkind') that makes the joke work.

___

In order to test this, we will do the following:

* Step 1: Use Synset to list synonyms of tokens
* Step 2: Find common words in Synsets within a sentence
* Step 3: Determine correlation between converging meanings & whether a sentence is a pun or not

---

Import/Download relevant packages:

In [1]:
from textblob import Word
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\WaThone\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


For this method, we will use NLTK's WordNet corpus to find the synsets of each token in a sentence.

As an example, let's test it out on the word **'plant'** first:

In [2]:
word = Word('plant')
for i in range(3):
    print('Use Case ', i)
    print(word.synsets[i])
    print(word.definitions[i])
    print(word.synsets[i].lemma_names())
    print(' ')

Use Case  0
Synset('plant.n.01')
buildings for carrying on industrial labor
['plant', 'works', 'industrial_plant']
 
Use Case  1
Synset('plant.n.02')
(botany) a living organism lacking the power of locomotion
['plant', 'flora', 'plant_life']
 
Use Case  2
Synset('plant.n.03')
an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
['plant']
 


Through WordNet we can find the **use cases** (synsets) of the word "plant", along with the **definition** and **synonyms** (lemma names) of each.

---
        
           
Let's first eyeball how relevant the lemmas of each significant word in a sentence are to determining whether the sentence is a pun.

**The example we will use is: "The past, the present and the future walked into a bar. It was tense."**



In [3]:
# First, importing relevant packages, etc

import codecs
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import PunktSentenceTokenizer,sent_tokenize, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer, PorterStemmer
import re

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\WaThone\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\WaThone\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


We'll need to process the sentence, which includes lemmatizing, filtering out stop words, stripping punctuation and tokenizing the sentence.

In [4]:
def simpleFilter(sentence):
    
    '''This function strips punctuation, tokenizes, filters out stopwords,
    and lemmatizes the input sentence, then returns a list of
    filtered tokens.'''
    
    filtered_sent = []
    
    # Strip punctuation
    stripped = re.sub("[(.)',=!#@]", '', sentence)
        
    # Build the set of English stopwords (all lowercase)
    stop_words = set(stopwords.words("english"))
    
    # Tokenize
    words = word_tokenize(stripped)
    
    # Lemmatize and Filter out Stopwords
    lemmatizer = WordNetLemmatizer()
    for w in words:
        if w not in stop_words:
            filtered_sent.append(lemmatizer.lemmatize(w))

    return filtered_sent
  
def printLemmas(word):
    
    '''This function prints out all synonyms of a given word.'''
    
    for ss in Word(word).synsets:
        print(ss.lemma_names())
        

# Print 

s = 'The past, the present and the future walked into a bar. It was tense.'

for word in simpleFilter(s):
    print("Filtered word: '" + word + "' and its lemmas:")
    printLemmas(word)
    print()


Filtered word: 'The' and its lemmas:

Filtered word: 'past' and its lemmas:
['past', 'past_times', 'yesteryear']
['past']
['past', 'past_tense']
['past']
['past', 'preceding', 'retiring']
['by', 'past']

Filtered word: 'present' and its lemmas:
['present', 'nowadays']
['present']
['present', 'present_tense']
['show', 'demo', 'exhibit', 'present', 'demonstrate']
['present', 'represent', 'lay_out']
['stage', 'present', 'represent']
['present', 'submit']
['present', 'pose']
['award', 'present']
['give', 'gift', 'present']
['deliver', 'present']
['introduce', 'present', 'acquaint']
['portray', 'present']
['confront', 'face', 'present']
['present']
['salute', 'present']
['present']
['present']

Filtered word: 'future' and its lemmas:
['future', 'hereafter', 'futurity', 'time_to_come']
['future', 'future_tense']
['future']
['future']
['future']
['future', 'next', 'succeeding']
['future']

Filtered word: 'walked' and its lemmas:
['walk']
['walk']
['walk']
['walk']
['walk']
['walk']
['walk']
[
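The output above lists 'The' as a filtered word: the stopword list is lowercase and `simpleFilter` compares tokens case-sensitively, so capitalised stopwords slip through. Below is a minimal sketch of a case-insensitive filter. It is not the notebook's function: `filter_case_insensitive` and `TOY_STOP_WORDS` are hypothetical names, the stopword set is a small hand-picked stand-in for NLTK's list, and whitespace splitting replaces `word_tokenize` so the example runs without any downloads.

```python
import re

# Hypothetical, hand-picked stand-in for NLTK's English stopword list
TOY_STOP_WORDS = {'the', 'and', 'into', 'a', 'it', 'was'}

def filter_case_insensitive(sentence, stop_words=TOY_STOP_WORDS):
    """Strip punctuation, split on whitespace, and drop stopwords
    regardless of their capitalisation."""
    # Same punctuation set as simpleFilter
    stripped = re.sub(r"[(.)',=!#@]", '', sentence)
    # Lowercase only for the comparison; the returned tokens keep their case
    return [w for w in stripped.split() if w.lower() not in stop_words]

s = 'The past, the present and the future walked into a bar. It was tense.'
print(filter_case_insensitive(s))
```

Lowercasing only inside the comparison keeps proper nouns intact for any later step that cares about case.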

---
## **Hypothesis 1: Converging Meaning Pun**

We observe that the word 'tense' appears in the synonyms of 'past', 'present', and 'future' (as 'past_tense', 'present_tense', and 'future_tense'). Since we are exploring puns with converging meanings, **we hypothesise that we are more likely to find words with converging meanings in puns than in non-puns.**

---

To do this, we first produce a list of unique synonyms of a certain word, excluding the word itself.


Let's try this on the word "plant".

In [5]:
def create_lemmas(word):
    lemmas_list = []
    for ss in Word(word).synsets:
        lemmas_list.append(ss.lemma_names())
    return lemmas_list

def process_lemmas(lemmas_list, word):
    '''
    This function processes the lemma lists of all the definitions of a word
    and returns a list of the unique associated words, excluding the word itself
    '''
    all_lemmas = []
    for each_list in lemmas_list:
        for lemma in each_list:
            if lemma != word and lemma not in all_lemmas:
                all_lemmas.append(lemma)
    return all_lemmas


print(process_lemmas(create_lemmas('plant'), 'plant'))

['works', 'industrial_plant', 'flora', 'plant_life', 'set', 'implant', 'engraft', 'embed', 'imbed', 'establish', 'found', 'constitute', 'institute']


Next, we have to find out if synonyms of any word in a sentence can be found in the rest of the sentence, and count the number of times this occurs.

In [6]:
def common_syn(s):
    
    '''
    This function takes in a sentence, processes and tokenizes it and
    prints each significant word and tests if its synonyms can be found
    in the rest of the sentence. It prints the pair and returns the
    number of pairs found.
    '''
    
    count = 0
    
    # Filter the sentence to remove filler words / stopwords
    filtered_words = simpleFilter(s)
    
    for index, word in enumerate(filtered_words):
        if word.isalpha():
            lemma_list_of_term = process_lemmas(create_lemmas(word),word)

            # test if any word in the rest of the sentence appears in the lemma list of current word
            for other_word in filtered_words[index+1:]:
                if other_word in ' '.join(lemma_list_of_term):
                    count += 1
                    print(word, other_word)
    return count
    
    
s = 'The past, the present and the future walked into a bar. It was tense.'
print('The number of synonym pairs in this sentence is',common_syn(s))

past tense
present tense
future tense
The number of synonym pairs in this sentence is 3
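Note that `common_syn` tests `other_word in ' '.join(...)`, which is a substring test. 'tense' matches because it sits inside 'past_tense', but any short token that happens to occur inside a longer lemma will match as well. Below is a minimal sketch of a stricter alternative that splits multi-word lemmas on underscores and compares whole tokens. The function names and the toy lemma lists are illustrative assumptions, not part of the notebook.

```python
def substring_match(other_word, lemmas):
    """Behaviour of the check used in common_syn: substring of the joined lemmas."""
    return other_word in ' '.join(lemmas)

def token_match(other_word, lemmas):
    """Stricter variant: compare against whole tokens of each lemma."""
    tokens = set()
    for lemma in lemmas:
        # 'past_tense' contributes the tokens 'past' and 'tense'
        tokens.update(lemma.lower().split('_'))
    return other_word.lower() in tokens

# Toy lemma lists for illustration only
past_lemmas = ['past_times', 'yesteryear', 'past_tense', 'preceding', 'retiring', 'by']
metal_lemmas = ['metallic_element', 'alloy']

print(substring_match('tense', past_lemmas), token_match('tense', past_lemmas))
print(substring_match('met', metal_lemmas), token_match('met', metal_lemmas))
```

The token-level check still finds the 'past'/'tense' pair but no longer pairs a word with a fragment of a longer lemma.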


In order to see if this method works, we will test it on our list of pre-tagged puns and non-puns, where puns are tagged '0' and non-puns are tagged '1'.

We import the list and apply our function common_syn to it, under the label 'Syn Count'.

In [7]:
import pandas as pd
df = pd.read_csv('puns_final.csv', encoding='latin-1')
df = df.drop('Unnamed: 0', axis=1)

df['Syn Count'] = df['Sentence'].apply(common_syn)
df.head()

tuna fish
I le
bigger le
bed le
I le
pirate high
pirate sea
make hit
I one
Going sound
bed sleep
After ate
said ate
turned around
broke leg
I one
got one
cannibal eat
got make
paid make
reversing back
I one
got one
got back
one I
go work
Cheap u
Thrills u
want u
post office
wear wear
look see
whistle whistle
mad hare
Old go
die go
back second
call phone
Cell phone
mean egg
laying egg
I number
people wash
little light
seems see
door door
take make
fly fly
like like
metal met
I 5
mean end
went low
wardrobe closet
one I
punch punch
went last
Do get
know get
broth stock
cat sick
cheese cheese
Buffalo Bison
Make one
call one
right -
duck put
Thieves steal
dentist tooth
theatrical performance
pun play
pun word
play word
average mean
In I
past tense
present tense
future tense
soda soda
running go
Better go
present tense
past tense
saw ad
happens come
Id I
Id I
know get
alarm clock
Have eat
ever time
tried time
clock time
take make
seasoned veteran
remember back
boomerang back
sign language
I 

| | Sentence | P/NP | Syn Count |
|---|---|---|---|
| 0 | You can tune a guitar, but you can't tuna fish... | 0 | 1 |
| 1 | Two peanuts were walking in a tough neighborho... | 0 | 0 |
| 2 | If I buy a bigger bed will I have more or less... | 0 | 4 |
| 3 | The earth's rotation really makes my day. | 0 | 0 |
| 4 | I told my friend she drew her eyebrows too hig... | 0 | 0 |


To find out if this method is accurate, we use the correlation between whether the sentence is a pun or not and the Syn Count. 

In [8]:
corr = df.corr()
corr

| | P/NP | Syn Count |
|---|---|---|
| P/NP | 1.0 | 0.062171 |
| Syn Count | 0.062171 | 1.0 |
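For reference, `DataFrame.corr()` computes the Pearson coefficient by default. Below is a minimal pure-Python sketch of that coefficient. `pearson_r` is a hypothetical helper and the numbers are made up to illustrate the arithmetic; they are not taken from the pun data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up binary labels and counts, purely illustrative
labels = [0, 0, 1, 1]
counts = [1, 2, 3, 4]
print(round(pearson_r(labels, counts), 4))
```

A value near 0 means the count carries almost no linear information about the label; values near 1 or -1 indicate a strong linear relationship.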


With a correlation of only about 0.06, Syn Count is barely correlated with whether the sentence is a pun or not...

Perhaps we should try a different approach.

---

Other than the ability to find synonyms, WordNet can also find out a range of other details about a word.  

The functions below use WordNet to yield synonyms, hyponyms, antonyms, "similar to" words, and the words that WordNet records under "also see".

In [9]:
from nltk.corpus import wordnet as wn

def get_all_synsets(word, pos=None):
    # Lemmas of every synset of the word itself
    for ss in wn.synsets(word, pos=pos):
        for lemma in ss.lemma_names():
            yield (lemma, ss.name())


def get_all_hyponyms(word, pos=None):
    # Lemmas of the more specific synsets below each synset of the word
    for ss in wn.synsets(word, pos=pos):
        for hyp in ss.hyponyms():
            for lemma in hyp.lemma_names():
                yield (lemma, hyp.name())


def get_all_similar_tos(word, pos=None):
    # Lemmas of synsets marked "similar to" (adjectives)
    for ss in wn.synsets(word, pos=pos):
        for sim in ss.similar_tos():
            for lemma in sim.lemma_names():
                yield (lemma, sim.name())


def get_all_antonyms(word, pos=None):
    # Antonyms are defined on lemmas, not on synsets
    for ss in wn.synsets(word, pos=pos):
        for sslema in ss.lemmas():
            for antlemma in sslema.antonyms():
                yield (antlemma.name(), antlemma.synset().name())


def get_all_also_sees(word, pos=None):
    # Lemmas of synsets listed under "also see"
    for ss in wn.synsets(word, pos=pos):
        for also in ss.also_sees():
            for lemma in also.lemma_names():
                yield (lemma, also.name())


def get_all_synonyms(word, pos=None):
    for x in get_all_synsets(word, pos):
        yield (x[0], x[1], 'ss')
    for x in get_all_hyponyms(word, pos):
        yield (x[0], x[1], 'hyp')
    for x in get_all_similar_tos(word, pos):
        yield (x[0], x[1], 'sim')
    for x in get_all_antonyms(word, pos):
        yield (x[0], x[1], 'ant')
    for x in get_all_also_sees(word, pos):
        yield (x[0], x[1], 'also')
       

Let's use the words 'happy' and 'cutlery' to see what kind of details WordNet can figure out about a word.

In [10]:
print("The following are synonyms of 'happy':")
for x in get_all_synsets('happy'):
    print(x)
print()
print("The following are hyponyms (words that are more specific) of 'cutlery':")
for x in get_all_hyponyms('cutlery'):
    print(x)
print()
print("The following are similar to 'happy':")
for x in get_all_similar_tos('happy'):
    print(x)
print()
print("The following are antonyms (opposite) of 'happy':")
for x in get_all_antonyms('happy'):
    print(x)
print()
print("The following are words that should also be seen with 'happy':")
for x in get_all_also_sees('happy'):
    print(x)

The following are synonyms of 'happy':
('happy', 'happy.a.01')
('felicitous', 'felicitous.s.02')
('happy', 'felicitous.s.02')
('glad', 'glad.s.02')
('happy', 'glad.s.02')
('happy', 'happy.s.04')
('well-chosen', 'happy.s.04')

The following are hyponyms (words that are more specific) of 'cutlery':
('bolt_cutter', 'bolt_cutter.n.01')
('cigar_cutter', 'cigar_cutter.n.01')
('die', 'die.n.03')
('edge_tool', 'edge_tool.n.01')
('glass_cutter', 'glass_cutter.n.03')
('tile_cutter', 'tile_cutter.n.01')
('fork', 'fork.n.01')
('spoon', 'spoon.n.01')
('Spork', 'spork.n.01')
('table_knife', 'table_knife.n.01')

The following are similar to 'happy':
('blessed', 'blessed.s.06')
('blissful', 'blissful.s.01')
('bright', 'bright.s.09')
('golden', 'golden.s.02')
('halcyon', 'golden.s.02')
('prosperous', 'golden.s.02')
('laughing', 'laughing.s.01')
('riant', 'laughing.s.01')
('fortunate', 'fortunate.a.01')
('willing', 'willing.a.01')
('felicitous', 'felicitous.a.01')

The following are antonyms (opposite) 

Let's call all the categories above words that are **related** to the main word.

Now, we want to do the same as we did for the synonym count and define some functions that will find the common related words - not just within the sentence, but also with the related words of the other words in the sentence. 

In [11]:
def related_list(word):
    lemma_list = []
    for x in get_all_synonyms(word):
        lemma_list.append(x)
    return list(set(lemma_list))

def common_related(s):
    filtered = simpleFilter(s)
    count = 0
    for index, word in enumerate(filtered):
        related = related_list(word)
        for r_set in related:
            if r_set[0] in filtered[index+1:]:
                count += 1
    return count
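One property of `common_related` worth keeping in mind: `related_list` returns `(lemma, synset, relation)` tuples, so the same lemma reached through two synsets or two relation types is counted twice. The sketch below contrasts that per-tuple count with a count that keeps each lemma once. `count_related` and the toy tuples are hypothetical and stand in for real WordNet output.

```python
def count_related(filtered, related_by_word, dedupe=False):
    """Count related words that appear later in the sentence.

    related_by_word maps a word to (lemma, synset_name, relation) tuples,
    mirroring the shape returned by related_list.
    """
    count = 0
    for index, word in enumerate(filtered):
        rest = filtered[index + 1:]
        related = related_by_word.get(word, [])
        if dedupe:
            # Keep each matching lemma once, whatever synset or relation it came from
            count += len({lemma for lemma, _synset, _kind in related if lemma in rest})
        else:
            # Same counting as common_related: one hit per tuple
            count += sum(1 for lemma, _synset, _kind in related if lemma in rest)
    return count

# Toy tuples for illustration; the synset names are placeholders
toy_related = {
    'average': [('mean', 'toy_synset_1', 'ss'), ('mean', 'toy_synset_2', 'sim')],
}
tokens = ['said', 'average', 'mean']
print(count_related(tokens, toy_related), count_related(tokens, toy_related, dedupe=True))
```

Whether the duplicated count is desirable depends on whether several routes to the same word should count as stronger evidence of a pun.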


**Example:**

'What do you call a belt with a watch on it? A waist of time.'

In [12]:
s = 'What do you call a belt with a watch on it? A waist of time.'

filtered = simpleFilter(s)
count = 0
print('Sentence:',s)
print('-----' *10)
print()
for index, word in enumerate(filtered):
    related = related_list(word)
    for r_set in related:
        if r_set[0] in filtered[index+1:]:
            print("The word '" + word + "' in the sentence is related to '" + r_set[0] + "' as", r_set, "to mean '" + wordnet.synset(r_set[1]).definition() +"'")
            print()
            count += 1
print('-----' * 10)
print('Number of Related pairs:', count)


Sentence: What do you call a belt with a watch on it? A waist of time.
--------------------------------------------------

--------------------------------------------------
Number of Related pairs: 0


Now we want to apply this to the rest of our data.

In [13]:
df['Length'] = df['Sentence'].apply(len) #added this because it's mysteriously missing, but need to filter the length next time
df['Related Count'] = df['Sentence'].apply(common_related)
df['Rel Count / Len'] = df['Related Count'] / df['Length']
df.sample(5)

| | Sentence | P/NP | Syn Count | Length | Related Count | Rel Count / Len |
|---|---|---|---|---|---|---|
| 128 | Can February March? No, but April May. | 0 | 0 | 38 | 0 | 0.0 |
| 192 | He said I was average - but he was just being ... | 0 | 1 | 51 | 2 | 0.039216 |
| 275 | If you want something you never had, you have ... | 1 | 0 | 83 | 2 | 0.024096 |
| 125 | How does a penguin build it's house? Igloos i... | 0 | 0 | 57 | 0 | 0.0 |
| 32 | Smaller babies may be delivered by stork but t... | 0 | 0 | 75 | 0 | 0.0 |


Here is a description of the values. 

In [14]:
import matplotlib.pyplot as plt

# Histogram of the related-pair counts (matplotlib has no plt.histfit)
r = df['Related Count']
plt.hist(r)
plt.show()

# Summary statistics; kept as the last expression so the notebook displays it
df.describe()

The code below finds the correlation between the different variables in the data frame. 

As can be seen, the correlation between whether a sentence is a pun or not and the number of related count pairs is debatable.

We also divide the related count by the sentence length, since a longer sentence is more likely to contain related pairs.

In [None]:
corr = df.corr()
corr

We'll try to turn this correlation into an actionable "algorithm" to predict if a sentence is a pun or not. 

The following is another data set with 60 puns and 100 non-puns.

In [None]:
test_df = pd.read_csv('puns_test.csv')
test_df.sample(5)

Let's now code the "algorithm".

In [None]:
common_related(s)
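One way such an "algorithm" could look is a simple threshold rule on the related count, sketched below. This is an assumption about a possible next step, not the notebook's method: `predict_label`, `accuracy`, and the default threshold of 1 are hypothetical, and the counts and labels are toy values. The labels follow the notebook's convention (0 = pun, 1 = non-pun).

```python
def predict_label(related_count, threshold=1):
    """Predict 0 (pun) when the related count reaches the threshold, else 1 (non-pun)."""
    return 0 if related_count >= threshold else 1

def accuracy(counts, labels, threshold=1):
    """Fraction of sentences whose predicted label equals the tagged label."""
    correct = sum(predict_label(c, threshold) == y for c, y in zip(counts, labels))
    return correct / len(labels)

# Toy related counts and tags, purely illustrative
toy_counts = [2, 0, 1, 0]
toy_labels = [0, 1, 0, 0]
print(accuracy(toy_counts, toy_labels))
```

On the real data, the counts would come from `common_related` applied to each sentence, and the threshold could be tuned on one data set and checked on the other.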

### Target: Similar Sounds

Other puns rely on homophones: words that sound alike but have different meanings. For example:

'The **pony** had a **raspy** voice. It was **hoarse**.'

where 'hoarse' means the same as 'raspy', but is also related to 'pony' as it sounds like 'horse'. 

___

In order to test this, we will do the following:

* Step 1: Use Synset to list synonyms of tokens
* Step 2: Find matching similar sounding words in Synsets within the sentence
* Step 3: 

Some inspirations:
* https://stackabuse.com/phonetic-similarity-of-words-a-vectorized-approach-in-python/
* https://pypi.org/project/phonetics/#usage
* https://pypi.org/project/jellyfish/


---


Import/Download relevant packages:

In [21]:
import phonetics
import jellyfish

Testing the libraries' different phonetic functions on two similar-sounding words, 'horse' and 'hoarse', and observing the result.

In [33]:
test_words = ['horse', 'hoarse']
print(test_words)
def print_phonetic_index(test_words):
    functions = (phonetics.soundex, phonetics.nysiis, phonetics.metaphone, phonetics.dmetaphone, jellyfish.match_rating_codex)
    for func in functions:
        print(f'{func.__name__}: ' , end='')
        for word in test_words:
            code = func(word)
            print(str(code) + ' ', end='')
        print()
print_phonetic_index(test_words)

['horse', 'hoarse']
soundex: h0620 h0620 
nysiis: HA HA 
metaphone: HRS HRS 
dmetaphone: ('HRS', '') ('HRS', '') 
match_rating_codex: HRS HRS 


For 'horse' and 'hoarse', every function returned the same code for both words. Let's try this with another set of words!

Pun examples:
* A harp which sounds too good to be true is probably a lyre. (lie)
* Religious lions get down to their knees to prey. (pray)
* A big computerized dog needs a megabyte. (mega bite)
* Lions eat their prey fresh and roar. (raw)

In [37]:
test_words_2 = ['lyre', 'lie', 'prey', 'pray', 'roar', 'raw']
print_phonetic_index(test_words_2)

soundex: l060 l000 p600 p600 r060 r000 
nysiis: LA LA PA PA RA RA 
metaphone: LR L PR PR RR R 
dmetaphone: ('LR', '') ('L', '') ('PR', '') ('PR', '') ('RR', '') ('R', 'RF') 
match_rating_codex: LYR L PRY PRY RR RW 
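In the output above, 'prey'/'pray' receive the same code under every function, while 'lyre'/'lie' and 'roar'/'raw' differ under several of them, so an exact code match will miss some homophone puns. The sketch below shows how such a match test could be written. It uses a from-scratch implementation of standard American Soundex so that it runs without extra packages; it is not the `phonetics` package's implementation (whose codes are formatted differently, e.g. 'h0620'), and `soundex` and `same_code` are hypothetical names.

```python
def soundex(word):
    """Standard American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                           ('l', '4'), ('mn', '5'), ('r', '6')):
        for ch in letters:
            codes[ch] = digit
    word = ''.join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ''
    result = word[0].upper()
    prev = codes.get(word[0], '')
    for ch in word[1:]:
        digit = codes.get(ch, '')
        # Skip vowels and repeats of the previous consonant code
        if digit and digit != prev:
            result += digit
        # 'h' and 'w' do not separate two letters with the same code
        if ch not in 'hw':
            prev = digit
    return (result + '000')[:4]

def same_code(a, b, encoder=soundex):
    """True when both words receive the same phonetic code."""
    return encoder(a) == encoder(b)

for a, b in (('horse', 'hoarse'), ('prey', 'pray'), ('lyre', 'lie'), ('roar', 'raw')):
    print(a, b, soundex(a), soundex(b), same_code(a, b))
```

Because `same_code` takes the encoder as a parameter, any of the functions tested above could be passed in its place to compare how many pun pairs each one catches.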
