<a href="https://colab.research.google.com/github/sundaybest3/Spring2024/blob/main/Corpus/Words_in_context.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🍃 Words in context

## Key methods
Analyzing words in context is fundamental for accurately interpreting and understanding language, whether in human communication, language learning, or computational language processing.

+ Tokenization (splitting text into words)
+ POS (Parts of Speech) Tagging
+ Contextual Word Meaning
+ Bigram, N-gram, Collocation
+ Concordance (how a word is used with the words before and after it)


## {nltk} installation

In [None]:
!pip install nltk

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize #sentence tokenize

## [1] Tokenization

+ Purpose: Breaking down text into individual words (tokens) is the first step in many NLP tasks.
+ Method: Use nltk.word_tokenize() for tokenizing sentences into words.

In [None]:
text = "The quick brown fox jumps over 2 lazy dogs."

Compare **text.split()** and **word_tokenize**

In [None]:
mywords = text.split()
print(mywords)
print(len(mywords))

In [None]:
tokens = word_tokenize(text)
print(tokens)
print(len(tokens))

Filter out tokens that are not alphabetic

In [None]:
# Filter out tokens that are not alphabetic
words = [token for token in tokens if token.isalpha()]
print(words)
print(len(words))
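
The difference between the two counts comes from punctuation: `split()` leaves it attached to the neighbouring word, while a tokenizer separates it. The sketch below shows this without NLTK, using a rough regular expression. It is only an approximation and not the algorithm `word_tokenize` uses.

```python
import re

text = "The quick brown fox jumps over 2 lazy dogs."

# Whitespace splitting keeps punctuation glued to the neighbouring word.
by_whitespace = text.split()

# Rough approximation only: runs of word characters, or single punctuation marks.
# NLTK's word_tokenize applies more rules (contractions, abbreviations, quotes).
by_regex = re.findall(r"\w+|[^\w\s]", text)

# Same filter as in the cell above: keep purely alphabetic tokens.
alphabetic = [tok for tok in by_regex if tok.isalpha()]

print(by_whitespace[-1], len(by_whitespace))
print(by_regex[-2:], len(by_regex))
print(alphabetic, len(alphabetic))
```

The last whitespace token still carries the full stop, and the alphabetic filter drops both the full stop and the number.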

# [2] Part-of-Speech (POS) Tagging

+ Purpose: Assigning parts of speech to each word (like noun, verb, adjective) helps in understanding the grammatical context.
+ Method: Use nltk.pos_tag().
+ [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

In [None]:
nltk.download('averaged_perceptron_tagger')

In [None]:
from nltk import pos_tag
pos_tags = pos_tag(tokens)
print(pos_tags)
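
`pos_tag` returns a list of `(word, tag)` pairs, so ordinary list and dictionary tools are enough to summarise it. The sketch below uses hand-assigned Penn Treebank tags for illustration, so the tagger's actual output may differ in places.

```python
from collections import Counter

# Hand-assigned Penn Treebank tags for illustration; pos_tag's real output may differ.
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"), ("2", "CD"), ("lazy", "JJ"),
    ("dogs", "NNS"), (".", "."),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Collect the words whose tag starts with "NN" (singular, plural, and proper nouns).
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(tag_counts)
print(nouns)
```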

# [3] Contextual Word Meaning (Word Sense Disambiguation)

+ Purpose: Determining the meaning of a word based on the context it appears in.
+ Method: Use an algorithm such as the Lesk algorithm, which NLTK implements as `nltk.wsd.lesk`.

Note: `lesk` takes its candidate senses and their definitions from [WordNet](https://wordnet.princeton.edu).
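
The idea behind Lesk can be sketched in a few lines: choose the sense whose dictionary definition shares the most words with the sentence. The version below is a simplified illustration, not NLTK's implementation. The two-sense inventory, the stopword list, and the names `TOY_SENSES`, `content_words`, and `simplified_lesk` are all made up for this sketch.

```python
# Toy sense inventory written for this sketch; NLTK takes senses and glosses from WordNet.
TOY_SENSES = {
    "bank": {
        "bank.financial": "an institution that accepts deposits of money and lends money",
        "bank.river": "sloping land beside a river or other body of water",
    }
}

# Small hand-picked stopword list (an assumption, not NLTK's list).
STOPWORDS = {"a", "an", "the", "of", "to", "or", "and", "that", "was", "i", "some"}


def content_words(text):
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    words = (w.strip(".,;:!?\"'").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def simplified_lesk(sentence, word):
    """Return the sense whose gloss shares the most words with the sentence."""
    context = content_words(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in TOY_SENSES.get(word, {}).items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("The bank of the river was a peaceful place to relax.", "bank"))
print(simplified_lesk("I need to visit the bank to withdraw some money.", "bank"))
```

Because the choice depends entirely on word overlap, short sentences or short definitions can easily lead to the wrong sense, which is one reason Lesk results should be read with caution.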

In [None]:
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
nltk.download('wordnet')

Each word below has two meanings; Sentence 1 uses the first and Sentence 2 uses the second.

+ Bank (financial institution vs. edge of a river):
  + Sentence 1: I need to visit the bank to withdraw some money.
  + Sentence 2: The bank of the river was a peaceful place to relax.
+ Bat (nocturnal flying mammal vs. sports equipment):
  + Sentence 1: I saw a bat flying in the night sky.
  + Sentence 2: She used a baseball bat to hit the ball out of the park.
+ Book (written or printed work vs. to reserve):
  + Sentence 1: I'm reading a fascinating book about space exploration.
  + Sentence 2: Please book a table for two at the restaurant for tonight.
+ Crane (bird with a long neck vs. lifting machine):
  + Sentence 1: A beautiful crane waded in the shallow water.
  + Sentence 2: They used a crane to lift the heavy machinery onto the truck.
+ Club (social organization vs. golf equipment):
  + Sentence 1: I'm a member of the local chess club.
  + Sentence 2: He used a golf club to hit the ball into the hole.

Example sentence: The bank of the river was a peaceful place to relax.

In [None]:
sentence = input("Paste a sentence: ")
ambiguous = input("Type target word: ").strip().lower()

tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)

found = False  # Flag to indicate whether the target word is found
for word, tag in pos_tags:
    if word.lower() == ambiguous:  # Lowercase both sides for a case-insensitive match
        print(f"POS tag of '{word}':", tag)
        found = True
        break  # Stop at the first occurrence

if not found:
    print(f"'{ambiguous}' was not found in the sentence.")

+ example text (address): In the crowded conference room, as the quarterly meeting commenced, all eyes were on the project manager, Ms. Rivera. The recent challenges faced by the team had sparked concerns among the stakeholders, and it was her responsibility to address these issues comprehensively. She began by acknowledging the setbacks, carefully outlining the underlying causes, and elaborating on the steps taken to mitigate the impacts.
+ example (match; game): The tennis match was intense, with both players demonstrating exceptional skill and endurance. Each set brought the crowd to its feet, cheering for incredible volleys and powerful serves. The excitement built with every point scored, highlighting the competitive spirit of the game.
+ example (match; ignition): As the sun dipped below the horizon, Emma gathered twigs and dry leaves to start a campfire. She struck a match against the side of the box, watching as it flared brightly. Carefully, she touched the flame to the kindling, coaxing the small fire to life, its warm glow soon illuminating the campsite.

In [None]:
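sentence = input("Paste a sentence: ")
ambiguous = input("Type target word: ").strip()

word_sense = lesk(word_tokenize(sentence), ambiguous)

if word_sense is None:
    # lesk returns None when WordNet has no synsets for the word
    print(f"No WordNet sense found for '{ambiguous}'.")
else:
    # Access the name of the disambiguated sense
    print("Disambiguated Sense:", word_sense.name())
    # Access the definition of the disambiguated sense
    print("Sense Definition:", word_sense.definition())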


In [None]:
from nltk.corpus import wordnet
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Alternative example to try (swap it in for the sentence and word defined below):
#   sentence = "You addressed the issue clearly."
#   ambiguous_word = "addressed"

# Define a function to map Penn Treebank POS tags to WordNet POS tags
def penn_to_wordnet_pos(penn_pos):
    if penn_pos.startswith('N'):
        return wordnet.NOUN
    elif penn_pos.startswith('V'):
        return wordnet.VERB
    elif penn_pos.startswith('R'):
        return wordnet.ADV
    elif penn_pos.startswith('J'):
        return wordnet.ADJ
    else:
        return None  # Return None for unknown POS tags

# Define your sentence and ambiguous word
sentence = "The invalid is in the hospital."
ambiguous_word = "invalid"

# Tokenize the sentence and perform POS tagging
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
print(pos_tags)
# Determine the Penn Treebank POS tag for the ambiguous word
ambiguous_word_pos_penn = None

for token, pos in pos_tags:
    if token.lower() == ambiguous_word.lower():
        ambiguous_word_pos_penn = pos
        break

if ambiguous_word_pos_penn is None:
    # Guard: calling penn_to_wordnet_pos(None) would raise an AttributeError
    print(f"'{ambiguous_word}' was not found in the sentence.")
else:
    # Map the Penn Treebank POS tag to WordNet POS tag
    ambiguous_word_pos_wordnet = penn_to_wordnet_pos(ambiguous_word_pos_penn)

    if ambiguous_word_pos_wordnet is None:
        print(f"Cannot determine WordNet POS category for tag '{ambiguous_word_pos_penn}'.")
    else:
        # Disambiguate among the senses that match the part of speech
        word_sense = lesk(tokens, ambiguous_word, pos=ambiguous_word_pos_wordnet)

        if word_sense is None:
            # lesk returns None when no synset matches the word and POS
            print(f"No synsets found for '{ambiguous_word}' in the '{ambiguous_word_pos_wordnet}' category.")
        else:
            print("Disambiguated Sense:", word_sense.name())
            print("Sense Definition:", word_sense.definition())
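
The tag mapping can also be written as a lookup table, which makes it easy to handle a missing tag. This is a sketch of an alternative, not part of NLTK. The names `PENN_PREFIX_TO_WORDNET` and `penn_to_wordnet_pos_safe` are hypothetical, and the single-letter codes are the values behind `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADV`, and `wordnet.ADJ`.

```python
# WordNet's single-letter POS codes; nltk.corpus.wordnet exposes the same values
# as wordnet.NOUN, wordnet.VERB, wordnet.ADV, and wordnet.ADJ.
PENN_PREFIX_TO_WORDNET = {"N": "n", "V": "v", "R": "r", "J": "a"}


def penn_to_wordnet_pos_safe(penn_pos):
    """Map a Penn Treebank tag to a WordNet POS code, or None if there is no match."""
    if not penn_pos:  # handles None and the empty string
        return None
    return PENN_PREFIX_TO_WORDNET.get(penn_pos[0])


for tag in ["NNS", "VBD", "JJR", "RB", "DT", None]:
    print(tag, "->", penn_to_wordnet_pos_safe(tag))
```

As with the `if`/`elif` version, the mapping only looks at the first letter of the tag, so it stays coarse: every tag beginning with `R` is treated as an adverb.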


## Gradio

In [None]:
!pip install gradio

+ example text (address): In the crowded conference room, as the quarterly meeting commenced, all eyes were on the project manager, Ms. Rivera. The recent challenges faced by the team had sparked concerns among the stakeholders, and it was her responsibility to address these issues comprehensively. She began by acknowledging the setbacks, carefully outlining the underlying causes, and elaborating on the steps taken to mitigate the impacts.
+ example (match; game): The tennis match was intense, with both players demonstrating exceptional skill and endurance. Each set brought the crowd to its feet, cheering for incredible volleys and powerful serves. The excitement built with every point scored, highlighting the competitive spirit of the game.
+ example (match; ignition): As the sun dipped below the horizon, Emma gathered twigs and dry leaves to start a campfire. She struck a match against the side of the box, watching as it flared brightly. Carefully, she touched the flame to the kindling, coaxing the small fire to life, its warm glow soon illuminating the campsite.

In [None]:
#@markdown Gradio app to display the ambiguous meaning (Not so reliable)
import gradio as gr
import nltk
from nltk.corpus import wordnet
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download necessary NLTK data
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Define a function to map Penn Treebank POS tags to WordNet POS tags
def penn_to_wordnet_pos(penn_pos):
    if penn_pos.startswith('N'):
        return wordnet.NOUN
    elif penn_pos.startswith('V'):
        return wordnet.VERB
    elif penn_pos.startswith('R'):
        return wordnet.ADV
    elif penn_pos.startswith('J'):
        return wordnet.ADJ
    else:
        return None  # Return None for unknown POS tags

# Define the disambiguation function that uses POS tagging
def disambiguate_word_sense(sentence, ambiguous_word):
    # Tokenize the sentence and perform POS tagging
    tokens = word_tokenize(sentence)
    pos_tags = pos_tag(tokens)

    # Find the POS tag for the ambiguous word in the tokenized sentence
    ambiguous_word_pos_penn = None
    for word, pos in pos_tags:
        if word.lower() == ambiguous_word.lower():
            ambiguous_word_pos_penn = pos
            break

    # If the POS tag is found, convert to WordNet POS tag
    if ambiguous_word_pos_penn:
        ambiguous_word_pos_wordnet = penn_to_wordnet_pos(ambiguous_word_pos_penn)
    else:
        return "The ambiguous word was not found in the sentence."

    if ambiguous_word_pos_wordnet:
        # Perform Word Sense Disambiguation using Lesk algorithm
        word_sense = lesk(tokens, ambiguous_word, pos=ambiguous_word_pos_wordnet)
        if word_sense:
            return f"Disambiguated Sense: {word_sense.name()}\nSense Definition: {word_sense.definition()}"
        else:
            return f"No disambiguated sense found for '{ambiguous_word}'."
    else:
        return f"Cannot determine WordNet POS category for '{ambiguous_word}'."

# Create the Gradio interface
iface = gr.Interface(
    fn=disambiguate_word_sense,
    inputs=[
        gr.Textbox(lines=2, placeholder="Enter a sentence containing the ambiguous word", label="Sentence"),
        gr.Textbox(placeholder="Enter the ambiguous word", label="Ambiguous Word")
    ],
    outputs=gr.Textbox(label="Result"),
    title="Word Sense Disambiguation",
    description="Enter a sentence and an ambiguous word to disambiguate its sense."
)

# Launch the Gradio interface
iface.launch()


## With POS (Just to get an idea)

In [None]:
#@markdown Gradio app
import gradio as gr
import nltk
from nltk.corpus import wordnet
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Ensure NLTK data is available
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')  # Open Multilingual Wordnet
nltk.download('averaged_perceptron_tagger')

# Define the disambiguation function
def disambiguate_word_sense(sentence, ambiguous_word):
    tokens = word_tokenize(sentence)
    tagged_tokens = pos_tag(tokens)

    # Attempt to find a WordNet POS tag for the ambiguous word from the POS tags provided by NLTK
    wordnet_pos = None
    for word, tag in tagged_tokens:
        if word.lower() == ambiguous_word.lower():
            if tag.startswith('N'):
                wordnet_pos = wordnet.NOUN
            elif tag.startswith('V'):
                wordnet_pos = wordnet.VERB
            elif tag.startswith('J'):
                wordnet_pos = wordnet.ADJ
            elif tag.startswith('R'):
                wordnet_pos = wordnet.ADV
            break

    # Use lesk to disambiguate the sense of the word
    disambiguated_sense = lesk(tokens, ambiguous_word, pos=wordnet_pos)

    if disambiguated_sense:
        sense_name = disambiguated_sense.name()
        sense_definition = disambiguated_sense.definition()  # Get the definition of the selected sense
        return f"Disambiguated Sense: {sense_name}\nSense Definition: {sense_definition}"
    else:
        return f"No suitable sense found for '{ambiguous_word}'."

# Create a Gradio interface with a submit button
iface = gr.Interface(
    fn=disambiguate_word_sense,
    inputs=[
        gr.Textbox(label="Sentence", placeholder="Enter a sentence containing the ambiguous word"),
        gr.Textbox(label="Ambiguous Word", placeholder="Enter the ambiguous word"),
    ],
    outputs=gr.Textbox(label="Result"),
    title="Word Sense Disambiguation",
    description="Enter a sentence and an ambiguous word to see its disambiguated sense based on context."
)

# Launch the Gradio interface
iface.launch()


# [4] Bi-gram, N-gram, and Collocation

+ An "n-gram" is a contiguous sequence of n items from a given sample of text or speech. Depending on the application, the items can be phonemes, syllables, letters, words, or base pairs.
+ N-grams are used in applications such as statistical language modeling, where they help estimate the likelihood of a particular sequence of words.
  + For example, in the sentence "The quick brown fox jumps over the lazy dog," the 2-grams (or bigrams) are ("The", "quick"), ("quick", "brown"), ("brown", "fox"), and so on.
+ A "collocation" is a combination of words that occur together more frequently than would be expected by chance. The concept is used in linguistic analysis to identify typical word combinations and patterns of usage.
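
The n-gram definition above amounts to sliding a window of size n over the token list. Here is a minimal sketch in plain Python; `make_ngrams` is a name chosen for this example and is not an NLTK function.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    # zip over n staggered views of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "the quick brown fox jumps over the lazy dog".split()

print(make_ngrams(tokens, 2)[:3])
print(len(make_ngrams(tokens, 2)), len(make_ngrams(tokens, 3)))
```

A list of k tokens yields k - n + 1 n-grams, so the nine tokens here give eight bigrams and seven trigrams.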


## A. Bi-gram

In [None]:
# !pip install nltk

In [None]:
import nltk
nltk.download('punkt')
from nltk import bigrams
from nltk.tokenize import word_tokenize

# Sample text
text = "The quick brown fox jumps over the lazy dog."

# Tokenize the text
tokens = word_tokenize(text)

# Generate bigrams
bigrams_list = list(bigrams(tokens))

# Print the bigrams
for bg in bigrams_list:
    print(bg)


## B. N-gram

In [None]:
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize

# Make sure to download the required NLTK models and data
nltk.download('punkt')

# Define a function to get n-grams from text
def get_ngrams(text, n):
    # Tokenize the text into words
    tokens = word_tokenize(text)
    # Generate n-grams
    n_grams = ngrams(tokens, n)
    # Convert to a list and return
    return list(n_grams)

# Sample text
text = "The quick brown fox jumps over the lazy dog."

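# Get bigrams (2-grams)
# Named bigram_tuples so that the nltk.bigrams function imported earlier is not overwritten
bigram_tuples = get_ngrams(text, 2)
print("Bigrams:", bigram_tuples)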

# Get trigrams (3-grams)
trigrams = get_ngrams(text, 3)
print("Trigrams:", trigrams)

# Get 4-grams
fourgrams = get_ngrams(text, 4)
print("4-grams:", fourgrams)


## C. Collocation

+ Collocation in linguistics refers to the tendency of certain words to occur frequently together in a language. These word combinations often bear a meaning that is not entirely deducible from the individual words' meanings.
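
One common way to put a number on "occur together more often than chance" is pointwise mutual information (PMI), which compares how often a pair is observed with how often it would be expected if the two words were independent. The sketch below computes it by hand on a toy token list written for this example; `pmi` here is a local helper, not the NLTK function.

```python
import math
from collections import Counter

# Toy corpus written for this sketch.
tokens = ("strong tea and strong coffee but weak tea "
          "and strong tea again").split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_unigrams = len(tokens)
n_bigrams = len(tokens) - 1


def pmi(w1, w2):
    """Pointwise mutual information (log base 2) of the adjacent pair (w1, w2)."""
    pair_count = bigram_counts[(w1, w2)]
    if pair_count == 0:
        return float("-inf")  # the pair never occurs together
    p_pair = pair_count / n_bigrams
    p_w1 = unigram_counts[w1] / n_unigrams
    p_w2 = unigram_counts[w2] / n_unigrams
    return math.log2(p_pair / (p_w1 * p_w2))


print(round(pmi("strong", "tea"), 2))
print(pmi("tea", "coffee"))
```

A positive score means the pair appears together more often than independence would predict. NLTK's `BigramAssocMeasures` also provides a `pmi` scorer that can be passed to `nbest` in place of `raw_freq`, which ranks by frequency alone.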

In [None]:
# !pip install nltk   # Install this if you haven't.
nltk.download('punkt')
nltk.download('stopwords')

In [None]:
text = """
Python is an interpreted high-level general-purpose programming language. \
Python's design philosophy emphasizes code readability with its notable use of significant indentation. \
Its language constructs and object-oriented approach aim to help programmers write clear, \
logical code for small and large-scale projects. Python is dynamically-typed and garbage-collected. \
It supports multiple programming paradigms, including structured (particularly procedural), \
object-oriented, and functional programming. \
Python is often described as a "batteries included" language due to its comprehensive standard library. \
Python was created in the late 1980s as a successor to the ABC language. Python 2.0, released in 2000, \
introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles. \
Python 3.0, released in 2008, was a major revision of the language that is not completely backward-compatible. \
The Python 2 series ended with version 2.7 in 2020. Python consistently ranks as one of the most popular programming languages.
"""


In [None]:
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Tokenizing the text
tokens = word_tokenize(text.lower())  # Lowercasing for consistency

# Removing stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words and word.isalnum()]

# Finding bigram collocations
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(filtered_tokens)
finder.apply_freq_filter(2)  # Optional: filter out bigrams that occur fewer than 2 times

# Display the 5 most frequent bigrams
print("Top 5 bigram collocations:")
print(finder.nbest(bigram_measures.raw_freq, 5))

# Finding trigram collocations
trigram_measures = TrigramAssocMeasures()
finder_tri = TrigramCollocationFinder.from_words(filtered_tokens)
finder_tri.apply_freq_filter(2)  # Optional: filter out trigrams that occur fewer than 2 times

# Display the 5 most frequent trigrams
# (on a text this short the list may be empty, because few trigrams repeat)
print("\nTop 5 trigram collocations:")
print(finder_tri.nbest(trigram_measures.raw_freq, 5))


## [5] Concordance (Words in Context)

+ A "concordance" is a list of all occurrences of a particular search term in a corpus, presented together with a certain amount of context. This is often used in linguistic analysis to understand how words are used in different contexts.
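
A concordance is often shown in keyword-in-context (KWIC) form: each hit is printed with a few words on either side. The sketch below builds such a listing in plain Python to show the idea. The function name `kwic` and the `window` parameter are choices made for this example, not part of NLTK.

```python
def kwic(tokens, target, window=3):
    """Return (left, keyword, right) context strings for each match of target."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


sample = "She thought she would enjoy the city but she knew she would miss home".split()
for left, kw, right in kwic(sample, "would"):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>25} [{kw}] {right}")
```

Lining up the keyword in a column makes recurring patterns to its left and right easy to spot.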

In [None]:
# Run these two lines first if the 'punkt' data has not been downloaded in this session:
# import nltk
# nltk.download('punkt')

from nltk.tokenize import word_tokenize  # used inside display_concordance below

from nltk.text import Text

# Define a function to display concordances for a word in a given text
def display_concordance(text, word, width=75, lines=25):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Create an NLTK text object
    nltk_text = Text(tokens)
    # Display concordances
    nltk_text.concordance(word, width=width, lines=lines)


# Display concordances for the word 'Python'
display_concordance(text, 'Python')


+ sample text: Jessica always wondered what life would be like in a different city. She thought she would enjoy the bustling streets and diverse culture, but there was always a part of her that feared the unknown. When her friend Mark suggested they visit New York for a week, she knew this would be her chance to experience city life firsthand. During their trip, she discovered she would indeed love the vibrant energy, although she also realized she would miss the quiet of her small hometown. Mark said he would probably move to the city in the future, but Jessica decided she would return home, now certain of where she belonged.

In [None]:
mytext = input()

In [None]:
display_concordance(mytext, 'would', width=100, lines=25)

---
### The END