I used mohameddhiab's model named "humor-no-humor" to detect whether the input sentence is a joke or non-joke: https://huggingface.co/mohameddhiab/humor-no-humor

I also used Kuperman et al.'s Age of Acquisition database to do the AoA assessment: https://osf.io/d7x6q/files/osfstorage

Most of the code was generated with the help of ChatGPT, Deepseek, and Gemini.

**For this Colab notebook to work, first upload the AoA database, AoA_51715_words.csv, to the Files panel. Then click the "Run All" button to run all the code blocks in order.**


https://github.com/jamesturk/jellyfish/blob/main/docs/functions.md

https://pypi.org/project/eng-to-ipa/

https://github.com/zas97/ocr_weighted_levenshtein?tab=readme-ov-file

https://pypi.org/project/weighted-levenshtein/


Things to do


1.   Find homophone & calculate sound similarity
2.   Change code so that it tells you if it's funny because of homograph or homophone
3.   Improve tokenization of words (e.g., "don't" should be tokenized as "do" and "not")
4.   Instead of using humor-no-humor model, use LLM?
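
For item 3, one possible approach is to expand contractions before the text reaches the tokenizer. The sketch below is only an illustration using the standard library: `IRREGULAR` and `expand_contractions` are hypothetical names that do not exist in this notebook, the table is deliberately incomplete, and only negative contractions ending in "n't" are handled (ambiguous forms such as "they'd" are left untouched).

```python
import re

# Hypothetical table of irregular forms that the suffix rule below cannot handle
IRREGULAR = {
    "won't": "will not",
    "can't": "can not",
    "shan't": "shall not",
}

def expand_contractions(text: str) -> str:
    """Expand negative contractions so that "don't" becomes "do not"."""
    def replace(match: re.Match) -> str:
        # Normalize case and curly apostrophes before the lookup
        word = match.group(0).lower().replace("’", "'")
        if word in IRREGULAR:
            return IRREGULAR[word]
        # Regular case: drop the trailing "n't" and append "not"
        return word[:-3] + " not"

    # Match any word that ends in n't (straight or curly apostrophe)
    return re.sub(r"\b\w+n['’]t\b", replace, text)

print(expand_contractions("Why don't eggs tell jokes?"))  # Why do not eggs tell jokes?
```

The expanded string could then be passed to `word_tokenize` as before, so "do" and "not" arrive as separate tokens.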

Stuff for Shreya

Given a joke (string), make Ollama classify the joke as a homograph or homophone joke. If it's a homograph joke, return 1 (literally the number 1). If it's a homophone joke, return 2. This should be in a function
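
A minimal sketch of such a function is shown here, under several assumptions: an Ollama server is running locally on its default port, the model name `llama3` is a placeholder, and the names `ollama_generate` and `classify_joke` are hypothetical. Returning 0 when the reply is unclear is an extra fallback that the task description above does not ask for. The text generator is passed in as a parameter so the parsing logic can be tried without a running server.

```python
import json
import re
import urllib.request
from typing import Callable

PROMPT = (
    "Classify the following joke as a homograph joke or a homophone joke. "
    "Respond with only 'homograph' or 'homophone'.\n\nJoke: {joke}"
)

def ollama_generate(prompt: str, model: str = "llama3",
                    url: str = "http://localhost:11434/api/generate") -> str:
    """Send one non-streaming request to a local Ollama server (assumed setup)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

def classify_joke(joke: str, generate: Callable[[str], str] = ollama_generate) -> int:
    """Return 1 for a homograph joke, 2 for a homophone joke, 0 if unclear."""
    reply = generate(PROMPT.format(joke=joke)).lower()
    match = re.search(r"homograph|homophone", reply)
    if match is None:
        return 0  # Assumed fallback for replies that name neither category
    return 1 if match.group(0) == "homograph" else 2

# Offline check with a stand-in generator instead of a real model
print(classify_joke("A knight at night.", generate=lambda prompt: "Homophone"))  # 2
```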

Questions to ask

1.   When calculating the sound similarity score between the two words, can the score be 0? We're currently calculating similarity between the phonetics of the two words, and if the two words sound exactly the same (even though written differently) then the sound similarity score would be 0.
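
One way to look at this question is to separate distance from similarity: a distance of 0 means the two pronunciations are identical, while a similarity metric gives identical pronunciations its maximum value. The sketch below illustrates this with a plain, unweighted Levenshtein distance computed over phonemes. It is not the weighted metric used in this notebook, and `phoneme_distance` and `phoneme_similarity` are hypothetical helper names. The phoneme lists are written by hand in ARPAbet style.

```python
from typing import Sequence

def phoneme_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Unweighted Levenshtein distance over phonemes instead of characters."""
    previous = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        current = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def phoneme_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Map the distance to [0, 1], where 1.0 means identical pronunciations."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - phoneme_distance(a, b) / longest

knight = ["N", "AY1", "T"]
night = ["N", "AY1", "T"]
print(phoneme_distance(knight, night))    # 0
print(phoneme_similarity(knight, night))  # 1.0
```

Under this framing, a distance of 0 for a perfect homophone pair is expected rather than a problem; it only needs to be converted (or the comparison inverted) if a "higher is more similar" score is wanted.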



In [3]:
!pip install jellyfish
!pip install weighted-levenshtein
!pip install nltk
!pip install minicons
!pip install transformers accelerate sentencepiece

Collecting weighted-levenshtein
  Downloading weighted_levenshtein-0.2.2.tar.gz (9.0 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: weighted-levenshtein
  Building wheel for weighted-levenshtein (pyproject.toml) ... done
  Created wheel for weighted-levenshtein: filename=weighted_levenshtein-0.2.2-cp312-cp312-linux_x86_64.whl size=477986 sha256=fabe09c8b48a753e2fdd3e375125369bb9bdc7374b5ebe91bb1942b812262c13
  Stored in directory: /root/.cache/pip/wheels/04/d5/34/d5be1791d7ff61b3bb32abdf7b176e70bcad70a3ac8f1e86b1
Successfully built weighted-levenshtein
Installing collected packages: weighted-levenshtein
Successfully installed weighted-levenshtein-0.2.2
Collecting minicons
  Downloading minicons-0.3.32-py3-none-any.whl.metadata (10 kB)
Collecting wonderwords>=2.2.0 (from minicons)
  Downloading wonderwords-3.0

In [4]:
# Importing libraries
import torch
import string
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer # Import WordNetLemmatizer
import nltk # Import nltk
from nltk import pos_tag # Explicitly import pos_tag
from nltk.corpus import stopwords # Import stopwords
from transformers import AutoTokenizer, AutoModelForSequenceClassification # Import necessary classes from transformers
from sentence_transformers import SentenceTransformer # Import SentenceTransformer from its correct library
import re
import pandas as pd # Import pandas for AoA DataFrame

# Download required NLTK data

# Check if 'punkt' tokenizer data is available or if it needs to be downloaded
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

# Check if 'wordnet' corpus is available or if it needs to be downloaded
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

# Download 'punkt_tab' and the other NLTK resources used later
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('cmudict')

# Load the Age of Acquisition dataset
try:
    aoa_df = pd.read_csv('AoA_51715_words.csv')
    print("AoA dataset loaded successfully.")
    print("First 5 rows:")
    display(aoa_df.head())
except FileNotFoundError:
    print("Error: AoA_51715_words.csv not found.")
    print("Please upload the file to your Colab environment or provide the correct path.")
except Exception as e:
    print(f"An error occurred while loading the AoA dataset: {e}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.


AoA dataset loaded successfully.
First 5 rows:


Unnamed: 0,Word,AoA
0,a,2.89
1,aardvark,9.89
2,abacus,8.69
3,abacus,8.69
4,abalone,12.23
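
The preview contains a repeated row ("abacus" appears twice), which matters for any lookup keyed on the word. As a rough, dependency-free sketch of the lookup idea, the snippet below builds a word-to-AoA dictionary from the five preview rows and reports the highest AoA among a list of words. `load_aoa` and `highest_aoa` are hypothetical names, and the inline CSV text only mirrors the preview, not the full file.

```python
import csv
import io

# The five preview rows, including the duplicated "abacus" entry
SAMPLE_CSV = """Unnamed: 0,Word,AoA
0,a,2.89
1,aardvark,9.89
2,abacus,8.69
3,abacus,8.69
4,abalone,12.23
"""

def load_aoa(csv_text: str) -> dict:
    """Build a word -> AoA mapping; a repeated word keeps its last value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Word"].lower(): float(row["AoA"]) for row in reader}

def highest_aoa(words, aoa):
    """Return the largest AoA among the words found in the table, or None."""
    known = [aoa[word] for word in words if word in aoa]
    return max(known) if known else None

aoa = load_aoa(SAMPLE_CSV)
print(len(aoa))                                  # 4
print(highest_aoa(["a", "abacus", "zzz"], aoa))  # 8.69
```

A dictionary collapses duplicate words automatically, which is one reason it can be a convenient structure for this kind of lookup.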


In [5]:
# Cosine function
def cosine(a: torch.Tensor, b: torch.Tensor, eps=1e-8) -> torch.Tensor:
    # Compute the length (i.e., the magnitude) of each row vector in a and b
    a_n, b_n = a.norm(dim=1)[:, None], b.norm(dim=1)[:, None]
    # Divide each vector by its length (clamped to eps) so that all vectors have unit length
    a_norm = a / torch.max(a_n, eps * torch.ones_like(a_n))
    b_norm = b / torch.max(b_n, eps * torch.ones_like(b_n))
    sims = torch.mm(a_norm, b_norm.transpose(0, 1))
    return sims
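
To make the computation concrete, here is a dependency-free sketch of the same idea on nested Python lists: normalize every row to unit length, then take all pairwise dot products. `cosine_matrix` is a hypothetical name used only for this illustration and is not a replacement for the tensor version.

```python
import math

def cosine_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    def normalize(rows):
        out = []
        for row in rows:
            # Clamp the length to eps so a zero vector does not divide by zero
            length = max(math.sqrt(sum(x * x for x in row)), eps)
            out.append([x / length for x in row])
        return out

    a_norm, b_norm = normalize(a), normalize(b)
    # Entry [i][j] is the dot product of the unit vectors a[i] and b[j]
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b_norm] for ra in a_norm]

sims = cosine_matrix([[1.0, 0.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]])
print(sims)  # [[1.0, 0.0], [0.6, 0.8]]
```

The result has one row per vector in `a` and one column per vector in `b`, which matches the shape produced by the matrix product in the tensor version.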

In [31]:
def homophone_sound_similarity(setup_keywords, punchline_keywords):
    import jellyfish
    import json
    from weighted_levenshtein import lev
    import numpy as np

    # CMU Pronouncing Dictionary: word -> list of ARPAbet pronunciations
    arpabet = nltk.corpus.cmudict.dict()

    # Load the weighted Levenshtein cost parameters and convert them to NumPy arrays
    with open("params_weighted_leven.json", "r") as f:
        leven_params = json.load(f)
    for k in leven_params.keys():
        leven_params[k] = np.array(leven_params[k])

    for s_keyword in setup_keywords:
        for p_keyword in punchline_keywords:
            # Skip pairs where either word is missing from the dictionary,
            # but keep checking the remaining pairs
            if s_keyword not in arpabet or p_keyword not in arpabet:
                if s_keyword not in arpabet:
                    print(f"'{s_keyword}' not found in CMU dictionary.")
                if p_keyword not in arpabet:
                    print(f"'{p_keyword}' not found in CMU dictionary.")
                continue

            # Use the first pronunciation for simplicity in similarity calculation
            s_phonetic = "".join(arpabet[s_keyword][0])
            p_phonetic = "".join(arpabet[p_keyword][0])
            # Jaro similarity of the two phoneme strings (1.0 = identical)
            jaro = jellyfish.jaro_similarity(s_phonetic, p_phonetic)

            if jaro >= 0.75:
                print(f"The joke uses homophone to create humor since the words {s_keyword} and {p_keyword} have similar utterances.")
                print(f"Setup keyword: {s_keyword} - {s_phonetic}")
                print(f"Punchline_keyword: {p_keyword} - {p_phonetic}")
                print(f"These two words are likely to be homophone since the Jaro Distance between the phonetics are higher than 0.75.")
                print(f"Jaro Distance between two phonetics (higher num = higher similarity): {jaro}")
                print(f"Weighted Levenshtein Distance between two words (higher num = lower similarity): {lev(s_keyword, p_keyword, **leven_params)}\n")
                return [s_keyword], [p_keyword]

    print("no homophone found")
    # Return empty lists so callers can always unpack two values
    return [], []


keywords1 = ["leather", "knight"]
keywords2 = ["paper", "night"]

homophone_sound_similarity(keywords1, keywords2)

The joke uses homophone to create humor since the words knight and night have similar utterances.
Setup keyword: knight - NAY1T
Punchline_keyword: night - NAY1T
These two words are likely to be homophone since the Jaro Distance between the phonetics are higher than 0.75.
Jaro Distance between two phonetics (higher num = higher similarity): 1.0
Weighted Levenshtein Distance between two words (higher num = lower similarity): 0.9314288037569907



(['knight'], ['night'])

In [7]:
def identify_key_homographs_and_definitions(analysis_results, similarity_threshold=0.3):
    """
    Identifies key homographs based on similarity scores and finds the most relevant definition.

    Args:
        analysis_results (dict): The output from analyze_homograph_relevance.
        similarity_threshold (float): A threshold to consider a definition relevant.

    Returns:
        dict: A dictionary of key homographs with their most relevant definition.
    """
    key_homographs_info = {}

    for word, analysis in analysis_results.items():
        # Sort definitions by similarity in descending order
        sorted_analysis = sorted(analysis, key=lambda x: x['similarity_to_sentence'], reverse=True)

        if sorted_analysis:
            most_relevant_definition_info = sorted_analysis[0]
            highest_similarity = most_relevant_definition_info['similarity_to_sentence']

            # Simple heuristic: consider a word a key homograph if its most relevant
            # definition's similarity is above a threshold. This might need tuning
            # or more complex logic for better results.
            # Also consider if there's a significant difference between the top two,
            # which might indicate a clear intended meaning vs. other possibilities.

            # For simplicity here, let's just take the most similar definition
            # and consider it relevant if above a threshold.
            if highest_similarity >= similarity_threshold:
                key_homographs_info[word] = most_relevant_definition_info


    return key_homographs_info

# Analyze the results for the first joke sentence
if 'analysis_results' in locals():
    print("Key homographs and most relevant definitions for the first joke:")
    key_info_1 = identify_key_homographs_and_definitions(analysis_results)

    if key_info_1:
        for word, info in key_info_1.items():
            print(f"- '{word}': {info['definition']} (Similarity: {info['similarity_to_sentence']:.4f})")
    else:
        print("No key homographs found above the similarity threshold for the first joke.")

# Analyze the results for the second joke sentence if available
if 'analysis_results_2' in locals():
    print("\n" + "="*30 + "\n")
    print("Key homographs and most relevant definitions for the second joke:")
    key_info_2 = identify_key_homographs_and_definitions(analysis_results_2)

    if key_info_2:
        for word, info in key_info_2.items():
            print(f"- '{word}': {info['definition']} (Similarity: {info['similarity_to_sentence']:.4f})")
    else:
        print("No key homographs found above the similarity threshold for the second joke.")

In [8]:
import google.generativeai as genai
import re
from google.colab import userdata

# Configure the Gemini API
try:
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
    genai.configure(api_key=GOOGLE_API_KEY)
    print("Gemini API configured.")
except userdata.SecretNotFoundError:
    print("Error: GOOGLE_API_KEY not found in Colab secrets.")
    print("Please add your Google API key to the secrets manager.")
    GOOGLE_API_KEY = None # Set to None to prevent further errors

# Initialize the Generative Model (gemini_model stays None if anything fails)
gemini_model = None
if GOOGLE_API_KEY:
    try:
        # Using gemini-flash-latest which supports generateContent
        gemini_model = genai.GenerativeModel('gemini-flash-latest')
        print("Gemini model initialized.")
    except Exception as e:
        print(f"Error initializing Gemini model: {e}")
        gemini_model = None




Gemini API configured.
Gemini model initialized.


In [9]:
def classify_joke_with_gemini(joke_sentence: str) -> int:
    """
    Classifies a joke as a homograph or homophone joke using the Gemini API.

    Args:
        joke_sentence (str): The joke sentence to classify.

    Returns:
        int: 1 if it's a homograph joke, 2 if it's a homophone joke.
             Returns 0 if classification is not possible or unclear or API key is missing.
    """
    if not GOOGLE_API_KEY or gemini_model is None:
        print("Gemini API not configured or model not initialized. Cannot classify joke.")
        return 0

    prompt = f"""Classify the following joke as either a homograph joke or a homophone joke.

    A homograph joke is a joke that creates humor through two words that are spelled the same but have different meanings.
    A homophone joke is a joke that creates humor through two words that aren't spelled the same but have similar pronunciations.

    Respond with only 'homograph' or 'homophone'. If it's neither or unclear, respond with 'neither'.\n\n

    Joke: {joke_sentence}"""

    try:
        response = gemini_model.generate_content(prompt)
        generated_text = response.text.strip()
        print(f"Generated text from Gemini: {generated_text}")

        # Use regex to find "homograph", "homophone", or "neither" in the generated text
        match = re.search(r'(homograph|homophone|neither)', generated_text.lower())
        classification = match.group(1) if match else None

        if classification == "homograph":
            return 1
        elif classification == "homophone":
            return 2
        else:
            return 0 # Neither or unclear

    except Exception as e:
        print(f"Error calling Gemini API: {e}")
        return 0 # Return 0 in case of error


# Example Usage:
joke1 = "Why don't eggs tell jokes? They'd crack each other up!" # Homograph joke
joke2 = "I told my wife she was drawing her eyebrows too high. She looked surprised." # Homograph joke
joke3 = "This is not a joke." # Neither
joke4 = "She tried to escape the prison, but got distracted by the shiny prism." # Homophone joke

print(f"Joke 4 classification: {classify_joke_with_gemini(joke4)}")

Generated text from Gemini: homophone
Joke 4 classification: 2


In [10]:
# NLTK data downloads, kept commented out for reference (the setup cell already downloads them)
#!nltk.download('wordnet')
#!nltk.download('omw-1.4')
#!nltk.download('punkt')
#!nltk.download('averaged_perceptron_tagger_eng')
#!nltk.download('stopwords')

# Assuming the following functions and models are loaded from previous cells:
# - predict_humor (from cell ff28d65b)
# - get_keywords_from_homograph_joke (from cell 216e3cde)
# - find_potential_homographs (from cell cfebebe5)
# - analyze_homograph_relevance (from cell 70930d80)
# - embedding_model (from cell 70930d80)



# Initialize WordNetLemmatizer and Stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))


# Assuming aoa_df is loaded in a previous cell (cell 0ba15314)
try:
    aoa_df # Check if aoa_df is defined
except NameError:
    print("WARNING: aoa_df (AoA dataset) not found. Please run the cell to load AoA_51715_words.csv")
    aoa_df = None # Set to None to avoid errors later if not loaded


# Ensure models and tokenizers are loaded if not already
try:
    predict_humor # Check if the function exists
except NameError:
    # print("Loading humor prediction model...") # Removed diagnostic
    model_name_humor = "mohameddhiab/humor-no-humor"
    tokenizer_humor = AutoTokenizer.from_pretrained(model_name_humor)
    model_humor = AutoModelForSequenceClassification.from_pretrained(model_name_humor)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_humor.to(device)
    def predict_humor(sentence):
        inputs = tokenizer_humor(sentence, return_tensors="pt", truncation=True, padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = model_humor(**inputs)
            predicted_class_id = outputs.logits.argmax().item()
            labels = ["Not Humor 😐", "Humor 😂"]
            return labels[predicted_class_id]

try:
    embedding_model # Check if the model exists
except NameError:
     # print("Loading sentence embedding model...") # Removed diagnostic
     embedding_model = SentenceTransformer('all-MiniLM-L6-v2')


# Helper function to get POS tag for lemmatization
def get_wordnet_pos(word):
    """Map POS tag to first character used by WordNetLemmatizer."""
    # Use the explicitly imported pos_tag
    tag = pos_tag([word])[0][1][0].upper()
    # Penn Treebank adjective tags start with "J" (JJ, JJR, JJS), not "A"
    tag_dict = {"N": wn.NOUN, "V": wn.VERB, "J": wn.ADJ, "R": wn.ADV}
    return tag_dict.get(tag, wn.NOUN)


def find_potential_homographs(text):
    """
    Finds potential homographs in a given text (sentence or part) using NLTK and WordNet.
    A word is considered a potential homograph if it has more than one synset in WordNet.
    """
    # Tokenize the text and remove punctuation
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word not in string.punctuation]

    potential_homographs = {}
    for word in tokens:
        # Get all synsets (different meanings) for the word
        synsets = wn.synsets(word)
        # If a word has more than one synset, it's a potential homograph
        if len(synsets) > 1:
            potential_homographs[word] = [synset.definition() for synset in synsets]

    return potential_homographs

def analyze_homograph_relevance(sentence, potential_homographs):
    sentence_embedding = embedding_model.encode(sentence, convert_to_tensor=True)
    homograph_analysis = {}
    for word, definitions in potential_homographs.items():
        homograph_analysis[word] = []
        for definition in definitions:
            if definition.strip():
                definition_embedding = embedding_model.encode(definition, convert_to_tensor=True)
                similarity = torch.nn.functional.cosine_similarity(definition_embedding.unsqueeze(0), sentence_embedding.unsqueeze(0))
                homograph_analysis[word].append({
                    'definition': definition,
                    'similarity_to_sentence': similarity.item()
                })
            else:
                 homograph_analysis[word].append({
                    'definition': definition,
                    'similarity_to_sentence': -1.0
                })
    return homograph_analysis

def get_keywords_from_homograph_joke(sentence):
    # 1. Delimiter-based splitting
    delimiters = ['.', '?', '!', ';', ',']
    split_index = -1
    chosen_delimiter = None

    # Try the delimiters in the priority order listed above and split at the
    # first occurrence of the first delimiter type that appears in the sentence
    for delimiter in delimiters:
        idx = sentence.find(delimiter)
        # Ignore a delimiter that only ends the sentence: it would leave no punchline
        if idx != -1 and idx + 1 < len(sentence):
            split_index = idx + 1 # Split after the delimiter
            chosen_delimiter = delimiter
            break # Stop at the first usable delimiter


    if split_index != -1 and split_index < len(sentence):
        setup = sentence[:split_index].strip()
        punchline = sentence[split_index:].strip()
        print(f"Splitting method: Delimiter '{chosen_delimiter}' found at index {split_index-1}") # Kept this print as requested
    else:
        # 2. 60/40 split fallback
        split_index = int(len(sentence) * 0.6)
        # Adjust split_index to not break a word
        while split_index > 0 and sentence[split_index-1].isalnum():
            split_index -= 1
        # If the adjustment made the split_index the very beginning of the sentence,
        # and the original 60% point was not at the start, find the next word break.
        if split_index == 0 and int(len(sentence) * 0.6) > 0:
             split_index = int(len(sentence) * 0.6)
             while split_index < len(sentence) and sentence[split_index].isalnum():
                 split_index += 1


        setup = sentence[:split_index].strip()
        punchline = sentence[split_index:].strip()
        print("Splitting method: 60/40 ratio (word boundary adjusted)") # Kept this print as requested

    # Tokenize setup and punchline, remove punctuation, and remove stop words
    setup_tokens = [word.lower() for word in word_tokenize(setup) if word not in string.punctuation and word.lower() not in stop_words]
    punchline_tokens = [word.lower() for word in word_tokenize(punchline) if word not in string.punctuation and word.lower() not in stop_words]

    return setup_tokens, punchline_tokens

# ========================== THIS PART AND ONWARD ONLY FOR HOMOGRAPH KEYWORD EXTRACTION ======================================
# Move the chunk of code below to a different function?

def homograph_keyword_analysis(setup_tokens, punchline_tokens, sentence):
    setup_lemmas = {lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in setup_tokens if word.strip()}
    punchline_lemmas = {lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in punchline_tokens if word.strip()}

    # 3. Find common lemmas between setup and punchline
    common_lemmas = list(setup_lemmas.intersection(punchline_lemmas))

    punchline_keyword = None
    sentence_embedding = None # Initialize sentence embedding

    if common_lemmas:
        if len(common_lemmas) > 1:
            # Secondary: More than one common lemma, use semantic similarity to whole sentence *among common lemmas only*
            sentence_embedding = embedding_model.encode(sentence, convert_to_tensor=True)
            common_word_embeddings = {
                word: embedding_model.encode(word, convert_to_tensor=True)
                for word in common_lemmas if word.strip() # Use only the common lemmas
            }
            common_word_similarities = {}
            for word, word_embedding in common_word_embeddings.items():
                similarity = torch.nn.functional.cosine_similarity(word_embedding.unsqueeze(0), sentence_embedding.unsqueeze(0))
                common_word_similarities[word] = similarity.item()

            if common_word_similarities:
                sorted_common_words = sorted(common_word_similarities.items(), key=lambda item: item[1], reverse=True)
                punchline_keyword = sorted_common_words[:1][0][0] # Get the top 1 word by similarity from common lemmas

        else: # len(common_lemmas) == 1
            # Exactly one common lemma: use it as the keyword
            punchline_keyword = common_lemmas[0]
    else:
        # No common lemmas found: fall back to the previous logic (prioritize homographs)
        sentence_embedding = embedding_model.encode(sentence, convert_to_tensor=True) # Ensure sentence embedding is calculated

        # --- Punchline Keyword Extraction (Fallback: Prioritize Homographs) ---
        punchline_potential_homographs = find_potential_homographs(" ".join(punchline_tokens)) # Find homographs among non-stopwords

        if punchline_potential_homographs:
            # If homographs are found in the punchline, analyze their similarity to the whole sentence
            homograph_similarities_to_sentence = {}
            for homo_word in punchline_potential_homographs.keys():
                # Get embedding for the homograph word
                homo_word_embedding = embedding_model.encode(homo_word, convert_to_tensor=True)
                # Calculate similarity to the whole sentence
                similarity = torch.nn.functional.cosine_similarity(homo_word_embedding.unsqueeze(0), sentence_embedding.unsqueeze(0))
                homograph_similarities_to_sentence[homo_word] = similarity.item()
                print(f"  '{homo_word}': {similarity.item():.4f}") # Re-added diagnostic: Print each homograph and its score

            print(f"Homograph similarities to sentence: {homograph_similarities_to_sentence}") # Kept this print as requested


            # Select the homograph with the highest similarity to the whole sentence as the keyword
            if homograph_similarities_to_sentence:
                sorted_homographs = sorted(homograph_similarities_to_sentence.items(), key=lambda item: item[1], reverse=True)
                punchline_keyword = sorted_homographs[0][0] # Get the word with the highest similarity

        # Fallback within fallback: If no homographs in punchline or homograph analysis failed
        if punchline_keyword is None and punchline_tokens:
            punchline_word_embeddings = {
                word: embedding_model.encode(word, convert_to_tensor=True)
                for word in punchline_tokens if word.strip()
            }
            punchline_word_similarities = {}
            for word, word_embedding in punchline_word_embeddings.items():
                similarity = torch.nn.functional.cosine_similarity(word_embedding.unsqueeze(0), sentence_embedding.unsqueeze(0))
                punchline_word_similarities[word] = similarity.item()
            print(f"Punchline word similarities to sentence (fallback - general): {punchline_word_similarities}") # Re-added diagnostic
            if punchline_word_similarities:
                sorted_punchline_words = sorted(punchline_word_similarities.items(), key=lambda item: item[1], reverse=True)
                punchline_keyword = sorted_punchline_words[:1][0][0] # Get the top 1 word
                print(f"Selected punchline keyword (fallback - general): '{punchline_keyword}'") # Re-added diagnostic


    # --- Setup Keyword Extraction (still top 2 words most similar to sentence, regardless of punchline logic) ---
    # This part remains the same as before, but operating on non-stopword tokens
    if sentence_embedding is None: # Ensure sentence_embedding is calculated if not already
        sentence_embedding = embedding_model.encode(sentence, convert_to_tensor=True)

    setup_word_embeddings = {}
    if setup_tokens:
        setup_word_embeddings = {
            word: embedding_model.encode(word, convert_to_tensor=True)
            for word in setup_tokens if word.strip()
        }
    setup_word_similarities = {}
    for word, word_embedding in setup_word_embeddings.items():
        similarity = torch.nn.functional.cosine_similarity(word_embedding.unsqueeze(0), sentence_embedding.unsqueeze(0))
        setup_word_similarities[word] = similarity.item()
    sorted_setup_words = sorted(setup_word_similarities.items(), key=lambda item: item[1], reverse=True)
    setup_keywords = [word for word, similarity in sorted_setup_words[:2]] # Get top 2 from the dictionary


    print(f"Punchline's keyword: '{punchline_keyword}'") # Kept this print as requested


    # Return the setup keywords and the single punchline keyword
    return setup_keywords, [punchline_keyword] if punchline_keyword else []

# Helper function to look up AoA for a list of words
def lookup_aoa(words, aoa_df):
    """
    Looks up the Age of Acquisition for a list of words in the AoA DataFrame.

    Args:
        words (list): A list of lemmatized words (strings).
        aoa_df (pd.DataFrame): The DataFrame containing AoA data with 'Word' and 'AoA' columns.

    Returns:
        list: A list of tuples (word, AoA_value), one per row of the merged lookup.
              AoA_value is NaN if the word is not found, or None if aoa_df is not loaded.
    """
    if aoa_df is None:
        print("WARNING: AoA DataFrame not loaded. Cannot perform AoA lookup.")
        return [(word, None) for word in words]

    aoa_values = []
    # Ensure the 'Word' column is treated as strings for merging/lookup
    aoa_df['Word'] = aoa_df['Word'].astype(str)
    # Create a DataFrame from the input words to merge
    words_df = pd.DataFrame({'Word': words})
    # Merge to get AoA values, keeping all input words
    merged_df = pd.merge(words_df, aoa_df[['Word', 'AoA']], on='Word', how='left')

    # Extract results
    for index, row in merged_df.iterrows():
        aoa_values.append((row['Word'], row['AoA']))

    return aoa_values


def explain_joke_with_keywords(sentence, age=None): # Added age parameter
    """
    Explains why a joke is funny or not based on humor classification
    and the definitions of the punchline keyword, and assesses age appropriateness.
    """
    # --- Output: Sentence Analyzed ---
    print(f"Sentence analyzed: '{sentence}'") # Keep this print

    # 1. Classify humor
    humor_prediction = predict_humor(sentence)
    # Correctly check if it's classified as Humor
    is_humor = humor_prediction == "Humor 😂"

    # --- Output: Humor Analysis ---
    # Print the base humor classification statement
    setup_keywords, punchline_keywords = [], [] # Initialize as empty lists
    # Classify whether joke uses homograph, homophone, or neither
    classification = classify_joke_with_gemini(sentence)

    if is_humor:
        print("Humor analysis: The sentence is classified as humor.")
        # Gets list of tokens of setup and punchline of the joke
        setup_tokens, punchline_tokens = get_keywords_from_homograph_joke(sentence)
        if classification == 1:         # homograph
            setup_keywords, punchline_keywords = homograph_keyword_analysis(setup_tokens, punchline_tokens, sentence)

            if not punchline_keywords or punchline_keywords[0] is None: # Also check for None
                # print("--- Could not identify punchline keyword ---") # Re-added diagnostic
                explanation = "Could not identify a punchline keyword for analysis."
                # If no punchline keyword, cannot do AoA analysis based on it
                if is_humor and age is not None:
                    explanation += "\nAge appropriateness could not be assessed due to missing punchline keyword."
                print(explanation) # Print the explanation
                return

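        elif classification == 2:       # homophone
            # Fall back to empty lists if no homophone pair is returned
            result = homophone_sound_similarity(setup_tokens, punchline_tokens)
            setup_keywords, punchline_keywords = result if result else ([], [])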
        else:                           # neither
            print("Joke classification: The sentence is likely a joke that doesn't use homograph or homophone.")

    else:
        print("Humor analysis: The sentence is classified as not humor.")


    # 2. Extract setup and punchline keywords
    # We only need the punchline keyword for the explanation based on definitions,
    # but we need both for AoA analysis.

    raw_punchline_keyword = punchline_keywords[0] if punchline_keywords else None # Get the single punchline keyword before lemmatization

    # 3. Lemmatize the punchline keyword
    # Ensure the raw keyword is not None or empty before lemmatizing
    if raw_punchline_keyword:
        punchline_keyword_lemma = lemmatizer.lemmatize(raw_punchline_keyword, get_wordnet_pos(raw_punchline_keyword))
        # --- Output: Lemmatized Punchline Keyword ---
    else:
        # print("--- Raw punchline keyword is empty, cannot lemmatize ---") # Re-added diagnostic
        explanation = "Identified punchline keyword is empty, cannot perform definition analysis."
        if is_humor and age is not None:
            explanation += "\nAge appropriateness could not be assessed due to empty punchline keyword."
        print(explanation) # Print the explanation
        return # Exit the function if punchline keyword is empty

    # 4. Get all definitions for the lemmatized punchline keyword
    # Now we use the lemmatized keyword to find homographs
    # Note: find_potential_homographs expects a single word as input
    potential_homographs_lemma = find_potential_homographs(punchline_keyword_lemma) # Pass the lemmatized word

    # Prepare for AoA Analysis (Steps 4, 5, 6, 7 from plan) - only if it's humor and age is provided
    highest_aoa = 0
    aoa_analysis_performed = False

    if is_humor and age is not None and aoa_df is not None:
        aoa_analysis_performed = True
        words_for_aoa_lookup = set() # Use a set to avoid duplicate lookups

        # Add lemmatized setup and punchline keywords
        # Ensure keywords are not None before lemmatizing
        words_for_aoa_lookup.update([lemmatizer.lemmatize(kw, get_wordnet_pos(kw)) for kw in setup_keywords + (punchline_keywords if punchline_keywords else []) if kw])

        # Add words from top 2 relevant definitions of the punchline keyword lemma
        top_2_definitions = [] # Initialize here for scope
        if punchline_keyword_lemma in potential_homographs_lemma and potential_homographs_lemma[punchline_keyword_lemma]:
             # Need to analyze relevance first to get top definitions
             analysis_results_keyword = analyze_homograph_relevance(sentence, {punchline_keyword_lemma: potential_homographs_lemma[punchline_keyword_lemma]})
             keyword_analysis = analysis_results_keyword.get(punchline_keyword_lemma, [])
             sorted_definitions = sorted(keyword_analysis, key=lambda x: x['similarity_to_sentence'], reverse=True)
             top_2_definitions = sorted_definitions[:2]

             for def_info in top_2_definitions:
                 # Tokenize and lemmatize words in the definition, exclude punctuation and stopwords
                 def_tokens = [word.lower() for word in word_tokenize(def_info['definition']) if word not in string.punctuation and word.lower() not in stop_words]
                 def_lemmas = {lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in def_tokens if word.strip()}
                 words_for_aoa_lookup.update(def_lemmas)

        # Perform AoA lookup for all collected words
        aoa_results = lookup_aoa(list(words_for_aoa_lookup), aoa_df)


        # Find the highest AoA value
        valid_aoa_values = [aoa for word, aoa in aoa_results if pd.notna(aoa)] # Filter out None/NaN
        if valid_aoa_values:
            highest_aoa = max(valid_aoa_values)
            # print(f"Highest AoA found: {highest_aoa}") # Removed diagnostic
        else: # Re-added diagnostic
             print("No valid AoA values found for relevant words in AoA lookup.")


    # Construct the base explanation based on homographs/definitions (Steps 4, 5, 6 from original plan if not already done for AoA)
    explanation_parts = [] # Use a list to build the explanation

    if classification == 1:
        if punchline_keyword_lemma not in potential_homographs_lemma or not potential_homographs_lemma[punchline_keyword_lemma]:
            print(f"--- No homographs found for '{punchline_keyword_lemma}' for definition analysis ---") # Re-added diagnostic
            if is_humor:
                explanation_parts.append(f"The punchline keyword '{punchline_keyword_lemma}' does not have multiple definitions to analyze using WordNet, so the humor mechanism based on wordplay couldn't be identified.")
            else:
                explanation_parts.append(f"The punchline keyword '{punchline_keyword_lemma}' does not seem to use multiple definitions in a way that creates humor in the given sentence.")
        else:
            # Analyze relevance of definitions (if not already done for AoA)
            if not aoa_analysis_performed:  # the AoA block above already set top_2_definitions when it ran
                analysis_results_keyword = analyze_homograph_relevance(sentence, {punchline_keyword_lemma: potential_homographs_lemma[punchline_keyword_lemma]})
                keyword_analysis = analysis_results_keyword.get(punchline_keyword_lemma, [])
                sorted_definitions = sorted(keyword_analysis, key=lambda x: x['similarity_to_sentence'], reverse=True)
                top_2_definitions = sorted_definitions[:2]
            # Else, top_2_definitions is already calculated from AoA analysis block

            # Construct the base explanation based on humor and definitions
            if is_humor:
                explanation_parts.append(f"The joke is funny because the lemmatized punchline keyword '{punchline_keyword_lemma}' likely \nuses multiple definitions to create humor. ")
                if top_2_definitions:
                    explanation_parts.append(f"\n\nBased on context similarity, here are the top two most relevant definitions: \n")
                    for i, def_info in enumerate(top_2_definitions):
                        explanation_parts.append(f"({i+1}) {def_info['definition']} ")
                else:
                    explanation_parts.append(f"However, no relevant definitions were found for the lemmatized punchline keyword '{punchline_keyword_lemma}' \nto explain the humor based on multiple meanings.")

            else: # Not humor
                explanation_parts.append(f"The joke is likely not funny because the lemmatized punchline keyword '{punchline_keyword_lemma}' \ndoes not seem to use multiple definitions in a way that creates humor in the given sentence.")
                # Optionally, you could still list the top definitions here if desired
                # if top_2_definitions:
                #      explanation_parts.append(" Top definitions found: ")
                #      for i, def_info in enumerate(top_2_definitions):
                #           explanation_parts.append(f"({i+1}) {def_info['definition']} ")
    elif classification == 2:
        # Ensure setup_keywords and punchline_keywords are not empty before accessing
        if setup_keywords and punchline_keywords:
            explanation_parts.append(f"The words '{setup_keywords[0]}' and '{punchline_keywords[0]}' create humor in this sentence because they sound alike.")
        else:
            explanation_parts.append("Could not identify keywords for homophone analysis.")
    else:
        return

    # Add Age Appropriateness Statement (Step 8) - only if performed and valid AoA found
    if aoa_analysis_performed and highest_aoa > 0: # Check if AoA analysis was attempted and valid AoA found
        # Re-added more verbose output based on likely user preference
        explanation_parts.append(f"\n\nAge Appropriateness Analysis (for age {age}):")
        explanation_parts.append(f"\nHighest Age of Acquisition found among relevant words: {highest_aoa:.2f} years.")
        if age is not None: # Ensure age was actually provided
            if age >= highest_aoa:
                explanation_parts.append(f"\nBased on word acquisition age, this joke is likely appropriate for a {age}-year-old.")
            else:
                explanation_parts.append(f"\nBased on word acquisition age, this joke might be too complex for a {age}-year-old (highest word AoA is {highest_aoa:.2f}).")
        else:
             explanation_parts.append("\nCould not compare to input age as age was not provided.")

    elif is_humor and age is not None and aoa_df is None: # If it's humor and age is provided but AoA data is missing
         explanation_parts.append("\n\nAge Appropriateness Analysis: AoA dataset not loaded. Cannot perform analysis.")

    elif is_humor and age is not None and highest_aoa <= 0: # If it's humor and age provided but no valid AoA found after lookup
         explanation_parts.append("\n\nAge Appropriateness Analysis: No valid Age of Acquisition values found for relevant words.")

    # Print the accumulated explanation parts
    print(" ".join(explanation_parts))
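
The age check at the end of the function reduces to a few steps: collect the relevant words, look up each AoA rating, drop missing values, take the maximum, and compare it with the reader's age. Below is a minimal sketch of that logic, assuming the AoA table has been reduced to a plain dict (the notebook itself uses a pandas DataFrame and `lookup_aoa`). `TOY_AOA`, its ratings, and `assess_age` are made up for illustration.

```python
import math

# Hypothetical ratings for illustration only; real values come from the AoA CSV.
TOY_AOA = {"band": 6.5, "drum": 4.9, "drama": 9.1}

def assess_age(words, age, aoa_table):
    """Return (highest_aoa, is_appropriate); both are None when no word has a rating."""
    ratings = [aoa_table.get(w) for w in words]
    # Drop words that are missing from the table or carry a NaN rating
    valid = [r for r in ratings if r is not None and not math.isnan(r)]
    if not valid:
        return None, None
    highest = max(valid)
    return highest, age >= highest

print(assess_age(["band", "drum", "drama", "zzz"], 10, TOY_AOA))
print(assess_age(["band", "drum", "drama"], 8, TOY_AOA))
```

Returning `None` when no rating is found is a choice made for this sketch; the function above uses `highest_aoa = 0` as its "nothing found" marker instead.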

### Analyze Your Own Sentence and Age

Use the input prompts below to enter a sentence (joke or not) and an age for analysis.

In [19]:
# Get input from the user
user_sentence = input("Enter the sentence to analyze: ")
user_age_str = input("Enter the age for appropriateness analysis (optional, press Enter to skip): ")

# Convert age to integer if provided, otherwise set to None
user_age = None
if user_age_str:
    try:
        user_age = int(user_age_str)
    except ValueError:
        print("Invalid age entered. Age appropriateness analysis will be skipped.")

# Run the analysis with user input
print("\n" + "="*30 + "\n")
print("Running analysis for user input:")
# Call the main analysis function with user inputs
explain_joke_with_keywords(user_sentence, age=user_age)

Enter the sentence to analyze: When the band broke up, it was not just drama, it was also drum.
Enter the age for appropriateness analysis (optional, press Enter to skip): 10


Running analysis for user input:
Sentence analyzed: 'When the band broke up, it was not just drama, it was also drum.'


ERROR:tornado.access:503 POST /v1beta/models/gemini-flash-latest:generateContent?%24alt=json%3Benum-encoding%3Dint (::1) 4606.32ms


Generated text from Gemini: homophone
Humor analysis: The sentence is classified as not humor.
Identified punchline keyword is empty, cannot perform definition analysis.
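
In the run above, the Gemini reply (`homophone`) has to become the `classification` code that `explain_joke_with_keywords` branches on (1 for the homograph path, 2 for the homophone path). The notebook's own parsing code is not shown in this section, so the following is only a sketch of one defensive way to do that mapping. `parse_pun_type` is a hypothetical helper.

```python
import re

def parse_pun_type(reply):
    """Map a free-text model reply to 1 (homograph), 2 (homophone), or None."""
    found = set(re.findall(r"homograph|homophone", reply.lower()))
    if found == {"homograph"}:
        return 1
    if found == {"homophone"}:
        return 2
    return None  # empty, ambiguous, or unrelated reply

print(parse_pun_type("homophone"))
print(parse_pun_type("This is a Homograph joke."))
print(parse_pun_type("Could be homograph or homophone"))
```

Returning `None` for an ambiguous reply lets the caller decide whether to retry or skip, instead of silently picking one branch.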


In [12]:
'''
Archived code

from transformers import BertTokenizer, BertForSequenceClassification
from datasets import load_dataset
from transformers import Trainer, TrainingArguments
import pandas as pd

# Load tokenizer (turns text into numbers for BERT)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Load BERT with a classification head (outputs joke / not joke)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Load the train and test splits from dad-a-base.csv and Test.csv
dataset = load_dataset("csv", data_files={"train": "dad-a-base.csv", "test": "Test.csv"})

# Rename columns to match expected format for Trainer for both train and test splits
dataset = dataset.rename_column("Type", "labels")
dataset = dataset.rename_column("Sentence", "text")


def tokenize(batch):
    # Use the 'text' column after renaming
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)

# Apply tokenization to both train and test splits
tokenized_datasets = dataset.map(tokenize, batched=True)
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "labels"])


# Training setup
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch", # Corrected argument name
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"], # Use the tokenized train split
    eval_dataset=tokenized_datasets["test"],   # Use the tokenized test split
    tokenizer=tokenizer,
)

trainer.train()
'''

In [13]:
'''
Archived code
# ============================================================
# 1️⃣  Install required packages
# ============================================================
!pip install -q transformers datasets torch accelerate evaluate

# ============================================================
# 2️⃣  Import libraries
# ============================================================
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
import evaluate
import numpy as np
import torch

# ============================================================
# 3️⃣  Load your dataset (CSV with columns: Sentence,Type)
# ============================================================
# Example CSV structure:
# Sentence,Type
# "Why don't skeletons fight each other? They don't have the guts!",1
# "Why do skeletons fight each other? They don't have the guts!",0

df = pd.read_csv("Jokes.csv", encoding='latin-1')  # path to your file
dataset = Dataset.from_pandas(df)

# Rename columns to match expected format for Trainer
dataset = dataset.rename_column("Type", "labels")
dataset = dataset.rename_column("Sentence", "text")


# Split small dataset into train/test (80/20)
dataset = dataset.train_test_split(test_size=0.2, seed=42)

# ============================================================
# 4️⃣  Load tokenizer and model
# ============================================================
model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Freeze most of DeBERTa layers except classification head
for name, param in model.named_parameters():
    if 'classifier' not in name and 'pooler' not in name:
        param.requires_grad = False


# ============================================================
# 5️⃣  Tokenize data
# ============================================================
def preprocess(examples):
    return tokenizer(examples["text"], truncation=True, padding=False, max_length=128)

encoded = dataset.map(preprocess, batched=True)

# Data collator pads dynamically per batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# ============================================================
# 6️⃣  Define evaluation metric (accuracy + F1)
# ============================================================
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy.compute(predictions=preds, references=labels)
    f1_score = f1.compute(predictions=preds, references=labels)
    return {"accuracy": acc["accuracy"], "f1": f1_score["f1"]}

# ============================================================
# 7️⃣  Training configuration (few-shot friendly)
# ============================================================
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="no",
    learning_rate=5e-6,             # lower LR
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,             # increased epochs
    weight_decay=0.1,
    warmup_ratio=0.1,
    load_best_model_at_end=False,
    logging_steps=5,
    report_to="none",
    fp16=torch.cuda.is_available()  # use mixed precision if GPU
)

# ============================================================
# 8️⃣  Initialize Trainer
# ============================================================
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# ============================================================
# 9️⃣  Train model
# ============================================================
trainer.train()

# ============================================================
# 🔟  Evaluate model
# ============================================================
metrics = trainer.evaluate()
print(metrics)

# ============================================================
# 🔹  Example inference
# ============================================================
def predict(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True)
    # Move inputs to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
        pred = torch.argmax(outputs.logits, dim=-1).item()
    return "Joke 😂" if pred == 1 else "Not a joke 😐"

print(predict("Why don't skeletons fight each other? They don't have the guts!"))
print(predict("Why do skeletons fight each other? They don't have the guts!"))
'''
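
`compute_metrics` in the archived cell above reports accuracy and F1 through the `evaluate` library. With a small test split it can help to see what those two numbers mean, so the sketch below recomputes them in plain Python for binary labels (1 = joke). `accuracy_and_f1` and the sample labels are illustrative and not part of the notebook.

```python
def accuracy_and_f1(preds, labels):
    """Accuracy and binary F1 (positive class = 1) for equal-length label lists."""
    pairs = list(zip(preds, labels))
    tp = sum(p == 1 and y == 1 for p, y in pairs)  # jokes correctly flagged
    fp = sum(p == 1 and y == 0 for p, y in pairs)  # non-jokes flagged as jokes
    fn = sum(p == 0 and y == 1 for p, y in pairs)  # jokes that were missed
    accuracy = sum(p == y for p, y in pairs) / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); reported as 0.0 when the denominator is zero
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

preds = [1, 0, 1, 1, 0]
labels = [1, 0, 0, 1, 1]
print(accuracy_and_f1(preds, labels))
```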



In [14]:
'''
from sentence_transformers import SentenceTransformer
import torch
from nltk.tokenize import word_tokenize
import string

# Reuse the sentence embedding model if cell 70930d80 already loaded it; otherwise load it here
try:
    embedding_model
except NameError:
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')


def get_keywords_from_joke_parts(sentence):
    """
    Separates a joke into setup and punchline and identifies keywords in each.

    Args:
        sentence (str): The input joke sentence.

    Returns:
        tuple: A tuple containing a list of setup keywords and a list of punchline keywords.
    """
    # Split at 60% of the character count (a rough heuristic; the cut can land mid-word)
    split_index = int(len(sentence) * 0.6)
    setup = sentence[:split_index].strip()
    punchline = sentence[split_index:].strip()

    # Get embedding for the whole sentence
    sentence_embedding = embedding_model.encode(sentence, convert_to_tensor=True)

    # Tokenize setup and punchline and remove punctuation
    setup_tokens = [word.lower() for word in word_tokenize(setup) if word not in string.punctuation]
    punchline_tokens = [word.lower() for word in word_tokenize(punchline) if word not in string.punctuation]

    # Get embeddings for individual words in setup and punchline
    # Filter out empty strings that might result from tokenization after punctuation removal
    setup_word_embeddings = {}
    if setup_tokens:
        setup_word_embeddings = {
            word: embedding_model.encode(word, convert_to_tensor=True)
            for word in setup_tokens if word.strip()
        }

    punchline_word_embeddings = {}
    if punchline_tokens:
        punchline_word_embeddings = {
            word: embedding_model.encode(word, convert_to_tensor=True)
            for word in punchline_tokens if word.strip()
        }


    # Calculate similarity of each word embedding to the whole sentence embedding
    setup_word_similarities = {}
    for word, word_embedding in setup_word_embeddings.items():
         similarity = torch.nn.functional.cosine_similarity(word_embedding.unsqueeze(0), sentence_embedding.unsqueeze(0))
         setup_word_similarities[word] = similarity.item()

    punchline_word_similarities = {}
    for word, word_embedding in punchline_word_embeddings.items():
        similarity = torch.nn.functional.cosine_similarity(word_embedding.unsqueeze(0), sentence_embedding.unsqueeze(0))
        punchline_word_similarities[word] = similarity.item()

    # Sort words by similarity and get the top N keywords
    sorted_setup_words = sorted(setup_word_similarities.items(), key=lambda item: item[1], reverse=True)
    setup_keywords = [word for word, similarity in sorted_setup_words[:2]] # Get top 2

    sorted_punchline_words = sorted(punchline_word_similarities.items(), key=lambda item: item[1], reverse=True)
    punchline_keywords = [word for word, similarity in sorted_punchline_words[:1]] # Get top 1


    return setup_keywords, punchline_keywords

# Example usage with a joke sentence
joke_sentence_1 = "Why don't eggs tell jokes? They'd crack each other up!"
setup_kws_1, punchline_kws_1 = get_keywords_from_joke_parts(joke_sentence_1)
print(f"Joke 1: '{joke_sentence_1}'")
print(f"Setup Keywords: {setup_kws_1}")
print(f"Punchline Keywords: {punchline_kws_1}")

print("-" * 30)

joke_sentence_2 = "I told my wife she was drawing her eyebrows too high. She looked surprised."
setup_kws_2, punchline_kws_2 = get_keywords_from_joke_parts(joke_sentence_2)
print(f"Joke 2: '{joke_sentence_2}'")
print(f"Setup Keywords: {setup_kws_2}")
print(f"Punchline Keywords: {punchline_kws_2}")

print("-" * 30)

# Example with a non-joke
non_joke_sentence = "The quick brown fox jumps over the lazy dog."
setup_kws_nj, punchline_kws_nj = get_keywords_from_joke_parts(non_joke_sentence)
print(f"Non-joke: '{non_joke_sentence}'")
print(f"Setup Keywords: {setup_kws_nj}")
print(f"Punchline Keywords: {punchline_kws_nj}")
'''
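
`get_keywords_from_joke_parts` cuts the sentence at 60% of its character count, so the cut can fall inside a word. The sketch below shows that on the first example joke and contrasts it with a variant that moves the cut back to the nearest preceding space. `split_at_ratio` and `split_at_word_boundary` are hypothetical names, and the snapping variant is a suggested alternative, not what the notebook does.

```python
def split_at_ratio(sentence, ratio=0.6):
    """Split exactly at ratio * len(sentence), as the archived cell does."""
    i = int(len(sentence) * ratio)
    return sentence[:i].strip(), sentence[i:].strip()

def split_at_word_boundary(sentence, ratio=0.6):
    """Same idea, but move the cut back to the last space at or before it."""
    i = int(len(sentence) * ratio)
    space = sentence.rfind(" ", 0, i + 1)
    if space != -1:
        i = space
    return sentence[:i].strip(), sentence[i:].strip()

joke = "Why don't eggs tell jokes? They'd crack each other up!"
print(split_at_ratio(joke))          # punchline starts with a stray "d"
print(split_at_word_boundary(joke))  # punchline starts at "They'd"
```

With the raw ratio split, the word "They'd" is cut in two, so a fragment like "d" can end up as a punchline token. Snapping to a space keeps tokens whole, though a split on sentence punctuation might suit two-sentence jokes even better.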

In [15]:
'''
from sentence_transformers import SentenceTransformer
import torch

# Load a pre-trained sentence embedding model
# 'all-MiniLM-L6-v2' is a good balance of speed and performance
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Cosine similarity function (reusing the one defined earlier if available, or defining it here)
# (Assuming the cosine function from cell 1kDFy5Ph1s3d is available)

def analyze_homograph_relevance(sentence, potential_homographs):
    """
    Analyzes the relevance of different homograph definitions to the sentence context
    using sentence embeddings and cosine similarity.
    """
    sentence_embedding = embedding_model.encode(sentence, convert_to_tensor=True)

    homograph_analysis = {}
    for word, definitions in potential_homographs.items():
        homograph_analysis[word] = []
        for definition in definitions:
            # Handle empty definitions or potential issues with encoding
            if definition.strip():
                definition_embedding = embedding_model.encode(definition, convert_to_tensor=True)
                # Compare the two embeddings directly. unsqueeze(0) adds a batch dimension so
                # cosine_similarity returns a single score for this sentence/definition pair.
                similarity = torch.nn.functional.cosine_similarity(sentence_embedding.unsqueeze(0), definition_embedding.unsqueeze(0))

                homograph_analysis[word].append({
                    'definition': definition,
                    'similarity_to_sentence': similarity.item() # .item() to get the scalar value
                })
            else:
                 homograph_analysis[word].append({
                    'definition': definition,
                    'similarity_to_sentence': -1.0 # Assign a low score for empty definitions
                })


    return homograph_analysis

# Example usage with a joke sentence and homographs found earlier (assuming 'homographs_found' is available)
# Using the homographs_found from the "Why don't eggs tell jokes? They'd crack each other up!" example
# If you ran the previous cell, homographs_found should be in memory.

if 'homographs_found' in locals() and homographs_found:
    print("Analyzing homograph relevance for the first joke sentence:")
    analysis_results = analyze_homograph_relevance(joke_sentence, homographs_found)

    # Print the results, potentially highlighting definitions with higher similarity
    for word, analysis in analysis_results.items():
        print(f"\nAnalysis for '{word}':")
        # Sort definitions by similarity for easier analysis
        sorted_analysis = sorted(analysis, key=lambda x: x['similarity_to_sentence'], reverse=True)
        for item in sorted_analysis:
            print(f"  Similarity: {item['similarity_to_sentence']:.4f} - Definition: {item['definition']}")

    # Example for the second joke sentence if homographs_found_2 is available
    if 'homographs_found_2' in locals() and homographs_found_2:
        print("\n" + "="*30 + "\n")
        print("Analyzing homograph relevance for the second joke sentence:")
        analysis_results_2 = analyze_homograph_relevance(joke_sentence_2, homographs_found_2)

        for word, analysis in analysis_results_2.items():
            print(f"\nAnalysis for '{word}':")
            sorted_analysis = sorted(analysis, key=lambda x: x['similarity_to_sentence'], reverse=True)
            for item in sorted_analysis:
                 print(f"  Similarity: {item['similarity_to_sentence']:.4f} - Definition: {item['definition']}")

else:
    print("Please run the previous cell to identify potential homographs first.")
'''
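
The ranking step in `analyze_homograph_relevance` is ordinary cosine similarity followed by a sort, after which the main function keeps the top two definitions. Here is a dependency-free sketch of that step, with made-up 3-dimensional vectors standing in for the `all-MiniLM-L6-v2` embeddings. `cosine`, `top_k_definitions`, the placeholder definition strings, and all numbers are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_definitions(sentence_vec, definition_vecs, k=2):
    """Return the k definitions whose vectors are closest to the sentence vector."""
    scored = [(cosine(sentence_vec, vec), text) for text, vec in definition_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]

# Toy vectors: the sentence leans toward the first two senses, not the third.
sentence_vec = [1.0, 1.0, 0.0]
definition_vecs = {
    "break into pieces": [1.0, 0.0, 0.0],
    "laugh uncontrollably": [0.0, 1.0, 0.1],
    "a narrow opening": [0.0, 0.0, 1.0],
}
print(top_k_definitions(sentence_vec, definition_vecs))
```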

In [16]:
'''
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn
import string

def find_potential_homographs(sentence):
    """
    Finds potential homographs in a sentence using NLTK and WordNet.
    A word is considered a potential homograph if it has more than one synset in WordNet.
    """
    # Tokenize the sentence and remove punctuation
    tokens = word_tokenize(sentence.lower())
    tokens = [word for word in tokens if word not in string.punctuation]

    potential_homographs = {}
    for word in tokens:
        # Get all synsets (different meanings) for the word
        synsets = wn.synsets(word)
        # If a word has more than one synset, it's a potential homograph
        if len(synsets) > 1:
            potential_homographs[word] = [synset.definition() for synset in synsets]

    return potential_homographs

# Example usage with a joke sentence
joke_sentence = "Why don't eggs tell jokes? They'd crack each other up!"
homographs_found = find_potential_homographs(joke_sentence)

if homographs_found:
    print(f"Potential homographs found in the sentence: '{joke_sentence}'")
    for word, definitions in homographs_found.items():
        print(f"- '{word}':")
        for i, definition in enumerate(definitions):
            print(f"  {i+1}. {definition}")
else:
    print(f"No potential homographs found in the sentence: '{joke_sentence}'")

joke_sentence_2 = "I told my wife she was drawing her eyebrows too high. She looked surprised."
homographs_found_2 = find_potential_homographs(joke_sentence_2)

if homographs_found_2:
    print(f"\nPotential homographs found in the sentence: '{joke_sentence_2}'")
    for word, definitions in homographs_found_2.items():
        print(f"- '{word}':")
        for i, definition in enumerate(definitions):
            print(f"  {i+1}. {definition}")
else:
    print(f"\nNo potential homographs found in the sentence: '{joke_sentence_2}'")
'''

In [17]:
'''
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and model from Hugging Face Hub
model_name = "mohameddhiab/humor-no-humor"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Move the model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def predict_humor(sentence):
    """
    Predicts whether a sentence is humorous or not using the loaded model.
    """
    # Tokenize the input sentence
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True)

    # Move inputs to the same device as the model
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the predicted class (0 for not humor, 1 for humor)
    # The model outputs logits, we take the argmax to get the predicted class index
    predicted_class_id = outputs.logits.argmax().item()

    # Map the class ID to a readable label
    # Based on the model card, 0 is not humor and 1 is humor.
    labels = ["Not Humor 😐", "Humor 😂"]
    prediction = labels[predicted_class_id]

    # You can also get the probabilities if needed
    # probabilities = torch.softmax(outputs.logits, dim=1)[0].tolist()
    # print(f"Probabilities: {probabilities}")

    return prediction

# Example usage:
sentence1 = "Why do skeletons fight each other? They don't have the guts!"
sentence2 = "The quick brown fox jumps over the lazy dog."
sentence3 = "I told my wife she was drawing her eyebrows too high. She looked surprised."

print(f"Sentence 1: '{sentence1}' -> {predict_humor(sentence1)}")
print(f"Sentence 2: '{sentence2}' -> {predict_humor(sentence2)}")
print(f"Sentence 3: '{sentence3}' -> {predict_humor(sentence3)}")
'''
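
`predict_humor` returns only the arg-max label, while its commented-out `torch.softmax` line hints at reporting a confidence as well. A plain-Python sketch of that conversion follows. The logit values are invented, and the label order (index 0 = not humor, index 1 = humor) follows the comment in the cell above and has not been verified here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits, labels=("Not Humor", "Humor")):
    """Return the arg-max label and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

label, confidence = label_with_confidence([-0.3, 1.2])
print(f"{label} ({confidence:.2f})")
```

A confidence value would let the main function treat borderline predictions differently from clear ones, for example by flagging sentences whose probability is close to 0.5.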
