
# WordNet Challenge

> *(this was a marked assignment in 2021. We are going to use it as practice now)*

Today I am going to present you with a program or challenge to complete during the workshop. You can work together or alone, up to you. In the instructions below, I will ask you to create two different functions designed to find different pieces of information from WordNet.

Objectives/Skills:

- creating helper functions to serve a larger function
- using one set of global variables for all functions
- practice using parts of speech
- practice using WordNet
- further practice tokenizing and tagging text (using built-in NLTK functions)



# **Sense Finder**

The goal of `sense_finder()` is to do essentially the same thing as the WordNet printer function from a previous notebook. This function is a bit different because it allows you to control the number of senses which will be returned.

1. Define a function named `sense_finder()`.
  - This function will take two arguments: `word` and `senses`.
  - Set the default value of `senses` = 3.

2. In the body of `sense_finder()`, create two variables: `noun_synsets` and `verb_synsets`.
  - Set these variables to be the result of calling `wn.synsets()` on the `word` argument for nouns or verbs, respectively.
  
    - use the `pos = wn.NOUN` or `pos = wn.VERB` arguments in `wn.synsets` to request noun/verb synsets (you could also use `pos = 'n'` or `pos = 'v'`)

3. Then, use an `if` statement to check whether `noun_synsets` is empty.
  - If `noun_synsets` is empty, do nothing.
  - If `noun_synsets` is *not* empty, check whether the length of `noun_synsets` is greater than or equal to the value of `senses`
  - if the length of `noun_synsets` is greater than or equal to the value of `senses`, print a message stating something to the effect of: `"The noun meanings of {word} are {noun_synsets}"`
    - Then, loop through each synset from `noun_synsets` and print the `synset.definition()`. Again, use some sort of informative print statement, such as `"The definition of {synset} is {synset.definition()}"`
      - ***However***, the loop should only cover a number of synsets equal to `senses`. In other words, a word might have 10 noun synsets, but if `senses` == 3, the program should only print the first 3. To do this, the for loop should slice `noun_synsets` using `senses`
        - you could use `noun_synsets[:senses]`
  - if the length of `noun_synsets` is *not* greater than or equal to `senses`, then loop through each synset and print its `synset.definition()` (i.e., without slicing based on `senses`, since the total number is lower than the requested number of senses)
  - finally, include an `else` statement to inform the user if no noun senses exist.

4. After the set of `ifs` and `else` for the noun synsets, create another `if` statement (**not** an `elif`) which does the same thing for `verb_synsets`

5. Create a final `if statement` which triggers if `word` has no noun synsets *and* no verb synsets. This condition should print a message explaining there are no synsets for the given word.

6. Test your function with a variety of words.
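
The slicing in step 3 can be tried on a plain list before touching WordNet. This is a minimal sketch: `first_n` is a hypothetical helper name, and the strings merely stand in for synsets. It also shows that a slice does not raise an error when the list is shorter than the requested number, which may be worth keeping in mind when deciding how many branches your `if` statement really needs.

```python
def first_n(items, senses=3):
    """Return at most `senses` items from the front of `items`."""
    return items[:senses]

# Strings standing in for synsets (illustrative only).
many = ['comb.n.01', 'comb.n.02', 'comb.n.03', 'comb.n.04', 'comb.n.05']
few = ['basketball.n.01', 'basketball.n.02']

print(first_n(many, 1))   # ['comb.n.01']
print(first_n(few, 5))    # both items; slicing past the end is safe
print(first_n([], 3))     # [] (an empty list counts as False in an `if` test)
```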


See my sample output below for what kind of output you might expect.

In [None]:
# download and import wordnet

import nltk
nltk.download(['wordnet', 'omw-1.4'])

from nltk.corpus import wordnet as wn

## Sample output for `sense_finder()`

Your print statements do not need to match mine, and I encourage you to improve upon or tweak my instructions as long as you can get the desired output!


1. The program limits the output to the number of senses passed to the function, even though both the noun and the verb have more than one sense for this word

```
sense_finder('comb', 1)
========================
 the NOUN meanings of comb are :
 [Synset('comb.n.01'), Synset('comb.n.02'), Synset('comb.n.03'), Synset('comb.n.04'), Synset('comb.n.05')]
Synset('comb.n.01')
 a flat device with narrow pointed teeth on one edge; disentangles or arranges hair
========================
 The VERB meanings of comb are :
 [Synset('comb.v.01'), Synset('comb.v.02'), Synset('comb.v.03')]
Synset('comb.v.01')
 straighten with a comb
```

2. Program works even when word has fewer synsets than requested

```
sense_finder('basketball', 5)
========================
 the NOUN meanings of basketball are :
 [Synset('basketball.n.01'), Synset('basketball.n.02')]
Synset('basketball.n.01')
 a game played on a court by two opposing teams of 5 players; points are scored by throwing the ball through an elevated horizontal hoop

Synset('basketball.n.02')
 an inflated ball used in playing basketball
```

3. Program works even when the word has no verb or noun synsets

```
sense_finder('wonderful')
========================
 Sorry, no noun senses for wonderful
========================
 Sorry, no verb senses for wonderful
```

## Reflection for `sense_finder()`


Try these words in your program with the default values for `senses`.

1. google
2. xerox
3. bing
4. apple
5. phone
6. brush

**Discussion**

Test your program on these suggested words and more (choose your own words which straddle the line between noun and verb)

- Do you think the senses provided make...sense?
- How could one use this information for processing text?
  - For example, could this program be adapted to guess the meaning of a word in a sentence?
  - What other information might you want in order to do so?

- What other information could we get from WordNet which might be useful/interesting for text analysis?

- Programmatically, this program repeats print statements twice for nouns and verbs - what could be done to reduce this repetition?

- What other tweaks or efficiencies can you make to the program?



# Nouny Verbs


Now, let's write a function which uses part of speech tags and incorporates our new `sense_finder()` function!


You'll need these resources:



In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download(['punkt', 'averaged_perceptron_tagger', 'stopwords'])

**Code Cell 1**

Refer to this resource: [Penn Treebank POS tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

- create a global variable named `noun_tags` which includes the POS tags for `Noun, singular or mass` and `Noun, plural`
- create a global variable named `verb_tags` which is a list including all Penn Treebank POS tags for verbs.
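
One possible way to set these up is sketched below. The tag strings are the Penn Treebank codes from the linked table; `collapse` is a hypothetical helper used only to show that every tag in these lists starts with `N` or `V`, which a later step relies on.

```python
# Penn Treebank tags for common nouns (singular/mass and plural).
noun_tags = ['NN', 'NNS']

# Penn Treebank tags for the verb forms.
verb_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

def collapse(tag):
    """Keep only the first letter of a tag (hypothetical helper)."""
    return tag[0]

print({collapse(t) for t in noun_tags})  # {'N'}
print({collapse(t) for t in verb_tags})  # {'V'}
```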


In [None]:
# noun_tags and verb_tags here



**Code Cell 2**

Define a function named `calc_tag_frequency()` which takes a single argument, `pair`

```

calc_tag_frequency(pair)

```



- in the body of the function, create a variable named `noun_freq`.
  - Set the value of the `noun_freq` variable to be the result of calling `.freq('N')` on `pair`

- make another variable named `verb_freq` and set the value to be the result of calling `.freq('V')` on `pair`

- using a single line, `return` two values:
  - `noun_freq`
  - `verb_freq`

- the final return statement would look something like this:

    - `return a, b`
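
Returning two values on one line builds a tuple, which the caller can unpack into two names. The sketch below shows the pattern without NLTK: `TagCounts` is a hypothetical stand-in that mimics only the `.freq()` method of `nltk.FreqDist` (the count of one sample divided by the total count), and the tag list is made up for illustration.

```python
from collections import Counter

class TagCounts(Counter):
    """Hypothetical stand-in for nltk.FreqDist: adds a .freq() method."""
    def freq(self, sample):
        total = sum(self.values())
        return self[sample] / total if total else 0.0

def calc_tag_frequency(pair):
    noun_freq = pair.freq('N')
    verb_freq = pair.freq('V')
    return noun_freq, verb_freq   # one line, two values (a tuple)

# Made-up data: a word tagged as a noun twice and as a verb three times.
pair = TagCounts(['N', 'N', 'V', 'V', 'V'])
noun_freq, verb_freq = calc_tag_frequency(pair)
print(noun_freq, verb_freq)   # 0.4 0.6
```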

In [None]:
# define `calc_tag_frequency()` here



**Code Cell 3**

1. Define a function named `nouny_verbs()` which takes two arguments: `text` and `senses`.
  - Set the default value of `senses` to 3.

2. Inside the function:

  - create a conditional statement to check whether `text` is a `str`. If not, print a message to the user asking for a string and exit the function (for example, with `return`).

  - create a variable named `tokens` which is the result of calling `nltk.word_tokenize()` on `text`, taking care to also lowercase the text before it is tokenized.

  - create a variable `tagged` which is the result of calling `nltk.pos_tag()` on `tokens`

  - create a variable `nouns_verbs` which is the result of a list comprehension. This list should include `(word, tag)` pairs from `tagged` that have POS tags in `noun_tags` or `verb_tags`. At the same time, slice out only the first element of `tag` (this will mean the tag will become either 'N' or 'V'). The goal is to only include `(word, tag)` pairs in which the tag is a noun or verb, while retaining only the `N` or `V` of the tag.

  > *hint: use `[(word, tag[0]) for word, tag in ... if tag in ... or tag in ...]`*

- create a `nltk.ConditionalFreqDist()` object named `tagged_cfd` which is a conditional frequency distribution of `nouns_verbs` (*you simply need to pass `nouns_verbs` to `nltk.ConditionalFreqDist()`*)

- now, after creating those objects, initiate a `for loop`, iterating through each word of `tagged_cfd`
  
  - hint: loop through the `.conditions()` method of `tagged_cfd` using something like: `for word in tagged_cfd.conditions():`  


- Inside the loop, first check whether the word has at least one "N" and at least one "V" POS tag as part of its value in `tagged_cfd`
  - To do this, you can write a conditional statement which checks whether the counts `tagged_cfd[word]['N']` and `tagged_cfd[word]['V']` are both greater than 0
  - the goal is that the `if` condition will only evaluate to `True` if the word has at least one `N` tag and at least one `V` tag.

- If the word has both `N` and `V` tags, then create two variables, `noun_freq` and `verb_freq`, and set them to the result of calling `calc_tag_frequency()` on the current word's entry in `tagged_cfd`
  - hint: `noun_freq, verb_freq = calc_tag_frequency(tagged_cfd[word])`
    - this will work because `calc_tag_frequency()` is designed to return two values

- Create a new variable called `total_frequency` which is the sum of `noun_freq` and `verb_freq`

- Create two variables `noun_percent` and `verb_percent`, each of which is the proportion of how nouny and how verby the word is
  - to do this, divide `noun_freq` and `verb_freq` by `total_frequency`

- Now, write a conditional statement which checks whether `noun_freq` and `verb_freq` are both higher than .33.

    - If yes, classify the word as a nouny_verb. Run `sense_finder()` on the word to look at the different senses.

      - Also, print out the percentages of noun/verb for the word


- **Challenge** customize the function to perform different tasks should the word be classified as more nouny, more verby, or equally nouny and verby.
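
The filtering, grouping, and threshold steps above can be rehearsed on hand-tagged data before wiring in the tokenizer and tagger. This is a sketch under stated assumptions, not the full `nouny_verbs()`: the `tagged` list is invented, `find_nouny_verbs` is a hypothetical name, and a `defaultdict` of `Counter` objects stands in for `nltk.ConditionalFreqDist` so the block runs without NLTK.

```python
from collections import Counter, defaultdict

noun_tags = ['NN', 'NNS']
verb_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

def find_nouny_verbs(tagged, threshold=0.33):
    """Return {word: (noun_share, verb_share)} for words used as both."""
    # Keep noun/verb pairs only, collapsing each tag to 'N' or 'V'.
    nouns_verbs = [(word, tag[0]) for word, tag in tagged
                   if tag in noun_tags or tag in verb_tags]

    # Stand-in for nltk.ConditionalFreqDist: word -> Counter of tags.
    tagged_cfd = defaultdict(Counter)
    for word, tag in nouns_verbs:
        tagged_cfd[word][tag] += 1

    results = {}
    for word, counts in tagged_cfd.items():
        # Only consider words seen at least once as N and once as V.
        if counts['N'] > 0 and counts['V'] > 0:
            total = counts['N'] + counts['V']
            noun_share = counts['N'] / total
            verb_share = counts['V'] / total
            if noun_share > threshold and verb_share > threshold:
                results[word] = (noun_share, verb_share)
    return results

# Invented (word, tag) pairs in the shape nltk.pos_tag() returns.
tagged = [('contact', 'NN'), ('contact', 'VB'), ('the', 'DT'),
          ('run', 'VB'), ('run', 'VB'), ('run', 'VB'), ('run', 'NN'),
          ('dog', 'NN')]
print(find_nouny_verbs(tagged))   # {'contact': (0.5, 0.5)}
```

Here `run` is dropped because its noun share (0.25) falls below the threshold, and `dog` is dropped because it never appears as a verb. In the real function, each word that passes would then be handed to `sense_finder()`.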




## Sample output for `nouny_verbs()`


```
found a nouny-verb! contact
Noun Frequency: 0.5
Verb Frequency: 0.5
=========================
 The NOUN meanings of contact are:

Synset('contact.n.01')
 close interaction

Synset('contact.n.02')
 the act of touching physically

Synset('contact.n.03')
 the state or condition of touching or of being in immediate proximity

=========================
 The VERB meanings of contact are:

Synset('reach.v.04')
 be in or establish communication with

Synset('touch.v.05')
 be in direct physical contact with; make contact

found a nouny-verb! thinking
Noun Frequency: 0.4
Verb Frequency: 0.6
=========================
 The NOUN meanings of thinking are:

Synset('thinking.n.01')
 the process of using your mind to consider something carefully

=========================
 The VERB meanings of thinking are:

Synset('think.v.01')
 judge or regard; look upon; judge

Synset('think.v.02')
 expect, believe, or suppose

Synset('think.v.03')
 use or exercise the mind or one's power of reason in order to make inferences, decisions, or arrive at a solution or judgments
```

## Reflection for `nouny_verbs()`

Test `nouny_verbs` on a few texts of your choice. This is a good opportunity to bring in your own texts, but you could also use some built-in nltk texts. Carefully observe the output, considering whether the ambiguity raised by words tagged equally as verbs/nouns is reconciled by WordNet.

- How well is the program performing? Are there some results that don't seem to make sense?

- What sorts of preparation to the text should be done further?

- Would it make sense to limit the WordNet information in some manner?

- Conceptually, what could be done to improve this program?

- What uses might a program like this have for NLP or other fields of research/industry?