<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/03-types-tokens.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **What is a word?**

This might seem like a silly question, but it's a crucial question to ask when we want to think about the computational representation of text and language. We have seen how text can be stored as a single string in Python, and there are no real limits to how long this string can be. But a string is still just a single sequence of characters, so the most basic information we can obtain from it is its length in number of characters.

What if we wanted to know how many words are in a string? What about sentences? Paragraphs?

We need a method to *split* or *segment* the string so that we can extract the individual words. That means we need to create a set of rules or principles by which we can determine individual words in a string.

Hence the question: what is a word? Or, perhaps phrased slightly differently, how do we distinguish words from one another when reading? The answer for written English and many other written forms of language is relatively simple: whitespace. The spaces between words represent simple yet effective boundaries to indicate where words begin and end. (Note that some other languages, such as Chinese, do **not** use whitespace between words, which means many of the methods we use in this notebook require adaptation when applied to those languages.)

Let us proceed for the moment with the knowledge that segmenting words on whitespace could be a helpful way to split a string into words. And it just so happens that Python has a built-in string method which allows us to do this with ease: the `.split()` method.





## Using `.split()` to create words from strings.

`.split()` is a string method, which means it is specific to string types. By default, `.split()` will search a string for whitespace characters (spaces, tabs, and newlines) and then split the string at each run of whitespace. The whitespace will be effectively removed from the string, and the resulting chunks will be placed into a list, with each segment representing a value in the list. Consider the example below:

In [None]:
# define a string and save it to a variable
pretzels = 'these pretzels are making me thirsty'

# use .split() to convert the string into a list of segments split on whitespace
pretzels.split()

Pretty neat, right? What's more is that `.split()` can be used to split a string on **any** character one likes - the default is whitespace but you could choose anything, such as a certain character or letter:

In [None]:
# split at "o", note that the "o" is not included in the output (just like whitespace)
'Melodrama'.split('o')
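
The difference between the default behaviour and an explicit separator is easy to miss, so here is a small sketch. The string `messy` and both variable names are made up for illustration. With no argument, `.split()` treats any run of whitespace (spaces, tabs, newlines) as a single boundary. With an explicit `' '`, it splits on every single space and nothing else.

```python
# a made-up string with a double space, a tab (\t), and a newline (\n)
messy = 'these  pretzels\tare\nmaking me thirsty'

# default: any run of whitespace counts as one boundary
default_split = messy.split()

# explicit single space: only ' ' is a boundary, so the double space
# leaves an empty string behind, and the tab and newline stay inside a segment
space_split = messy.split(' ')

print(default_split)
print(space_split)
```

For counting words, the default (no argument) is usually the safer choice, because it never produces empty strings.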

Can you figure out how to use `.split()` as a means to count the total number of words in a string? Hint: do you remember the `len()` function? Use `.split()` to count the number of words in the following sentence:

In [None]:
# can you use .split() and len() to find the total number of words in this sentence?
# the "\" is used to 'escape' the quote so that python doesn't read it as a delimiter
example = 'what if everything around you isn\'t quite as it seems?'


In [None]:
# the answer...
len(example.split())

Note that we don't actually need to save the string to a variable first - we could call `.split()` directly on the string and wrap `len()` around the result, we just have to keep our brackets in the right place!

In [None]:
len('what if everything around you isn\'t quite as it seems?'.split())

## Types and Tokens

Now that we know how to split a string into different words, we can start using Python to count the number of words in a string. Not only can we find the total number of words in a string, we can also find the total number of *unique* words. The distinction between unique words and total words has specific terminology:

- A ***type*** is a unique word.
- A ***token*** is an individual occurrence of a type.

For example - pretend you had three pets at home: two dogs and a cat. If we sorted our pets into types and tokens, we would have three tokens (three pets total), but only two types: dog and cat.

In Section 1.4 of Chapter 1, the NLTK book asks you to think about types and tokens by introducing you to the `set()` and `len()` functions. You have already used `len()`, but `set()` might be new. Using `set()` returns a data container that only allows one of any value to exist in the container. In other words, it returns an object where repeated values are not allowed.

This means we can use `set()` to ask for unique values in our data.
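
Returning to the pets example, here is a quick sketch of how the two counts could be computed. The list `pets` and the variable names are made up for illustration.

```python
# three pets in total: two dogs and a cat
pets = ['dog', 'dog', 'cat']

# tokens: every item counts, repeats included
pet_tokens = len(pets)

# types: set() drops the repeated 'dog', so only unique values are counted
pet_types = len(set(pets))

print(pet_tokens, pet_types)
```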


In [None]:
# For example, what does set() give you here?
set('aaabbbccc')

In [None]:
# and here?
set(['a', 'a', 'a', 'b', 'b', 'b', 'c','c', 'c'])
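
One thing to keep in mind when running the cells above: a set has no guaranteed order, so the values may not print in the order you expect. A small sketch, assuming you want a predictable listing, is to pass the set to the built-in `sorted()`, which returns a list. The variable names here are made up for illustration.

```python
# build a set of the unique characters in the string
letters = set('aaabbbccc')

# sorted() returns a new list with the values in alphabetical order
ordered_letters = sorted(letters)

print(ordered_letters)
```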

Let's think about how to use `len()` and `set()` to compute the Types and Tokens for a string.

- First we'll define a string
- then we will split the string into a list
- then we will measure the length of the list
- then we will measure the length of unique items in the list

We'll also chain some of these functions together and save them to new variables.

In [None]:
# start with a string
mood_raw = "i can\'t feel a thing, i keep looking at my mood ring"

# create a split version
mood_tokens = mood_raw.split()

# compare the before/after
print(mood_raw, '\n\n', mood_tokens)

In [None]:
# measure the total number of words
total_mood_tokens = len(mood_tokens)
total_mood_tokens

In [None]:
# measure the number of types
# how else could you write this code for the same effect?
total_mood_types = len(set(mood_raw.split()))
total_mood_types
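
To see *which* type is responsible for the gap between the two counts, one option is `Counter` from Python's standard `collections` module, which tallies how often each value occurs. This is a side sketch rather than part of the NLTK exercise, and the names `mood_counts` and `repeated` are made up for illustration.

```python
from collections import Counter

mood_raw = "i can't feel a thing, i keep looking at my mood ring"
mood_tokens = mood_raw.split()

# Counter maps each type to the number of tokens it accounts for
mood_counts = Counter(mood_tokens)

# keep only the types that occur more than once
repeated = [word for word, count in mood_counts.items() if count > 1]

# total tokens, total types, and the repeated types
print(len(mood_tokens), len(mood_counts), repeated)
```

Notice that `'thing,'` keeps its comma: splitting on whitespace leaves punctuation attached to the neighbouring word, so `'thing,'` and `'thing'` would count as two different types.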

## **Your Turn**

Before moving on, make sure you are comfortable with the difference between types and tokens, as well as how to use `len()` and `set()`.

In [None]:
# take this moment to practice with set() and len()

# **Measuring Lexical Diversity**
We can now use this information to assess our text for a very basic measure of sophistication: lexical diversity. This is also known as a type-token ratio, and provides a measure of how many repeated words there are in a text. You can read more about it in [Chapter 1 of NLTK.](https://www.nltk.org/book/ch01.html)

To calculate lexical diversity, we can use the following formula:

> `number of types / number of tokens`

The code cell below creates a user-defined **function** named `lexical_diversity()` to calculate lexical diversity. Running the code cell below will load the function into memory, allowing you to use it just like other functions such as `print()`, `len()`, and `set()`.

The name of the function is `lexical_diversity`, and it expects an input called `tokens`.

The function will `return` (i.e., give back) the result of dividing the number of types (the length of the `set()` of tokens) by the number of tokens (the `len()` of tokens).

Creating functions will be covered in more detail later on.

In [None]:
# define a function to calculate lexical diversity
def lexical_diversity(tokens):
  # return the number of types divided by the number of tokens
  return len(set(tokens))/len(tokens)

To use the function, we just need to type its name and provide a list of tokens as the argument. This works just like `print()`, `len()`, and other Python functions we have been using.

So, what is the lexical diversity or TTR of our example sentence?



In [None]:
# what is the TTR of our example sentence?

# first create a list of tokens using .split()
mood_tokens = mood_raw.split()

# then run the function
lexical_diversity(mood_tokens)

We get a result of about .917 (11 types divided by 12 tokens). In other words, there are roughly 0.92 types for every token: only one word (`i`) is repeated, indicating a very high lexical diversity.

Of course, such measures are relatively meaningless on such a short amount of text - the true use of lexical diversity would be to compare much larger texts against one another.

Nonetheless, try the lexical diversity function on some examples yourself to see how repeating words influence the overall score.

> ***Important!*** You need to feed a list of tokens to `lexical_diversity()`, otherwise you will get the diversity based on **characters** in the string, not words! What could you do to modify this?

In [None]:
# Lexical diversity of 50%
# two types, four tokens
# 2/4 = .5 (50%)
lexical_diversity('hello world hello world'.split())

In [None]:
# Lexical diversity of 100%
# two types, two tokens
# 2/2 = 1 (100%)
lexical_diversity('hello world'.split())
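
The warning above about passing a list of tokens can be made concrete with a short sketch. `lexical_diversity()` is repeated here so the cell runs on its own. `lexical_diversity_any()` is a hypothetical helper (not part of the notebook or NLTK) showing one possible way to accept either a raw string or a list of tokens.

```python
def lexical_diversity(tokens):
    return len(set(tokens)) / len(tokens)

def lexical_diversity_any(text_or_tokens):
    # hypothetical helper: split the input first if it is a raw string
    if isinstance(text_or_tokens, str):
        text_or_tokens = text_or_tokens.split()
    return lexical_diversity(text_or_tokens)

text = 'hello world hello world'

# a raw string is treated as a sequence of characters
char_based = lexical_diversity(text)

# a list of tokens is treated as a sequence of words
word_based = lexical_diversity(text.split())

# the helper gives the word-based value for either kind of input
either_way = lexical_diversity_any(text)

print(char_based, word_based, either_way)
```

The character-based value is much lower because the string contains only a handful of distinct characters, repeated many times.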

## **Your Turn**

Try out the lexical diversity function on some text. The function expects a list of tokens as input, so remember to use `.split()` on your string first.

In [None]:
# Play with lexical diversity on your own examples


# Comparing Lexical Diversity of Two Texts

Now that we know how to calculate the lexical diversity of a single text, let's expand the function to compare the lexical diversity of two texts.

We will also modify the function so that you can input a raw string to the function, rather than a list of tokens. Read the comments to see if you understand how the function works.

Remember, we need to first load the function into memory before we can use it.


In [None]:
# create a new function which takes two texts as arguments:
def compare_lexical_diversity(text1, text2):

  # in one line, calculate LD of text1 using .split()
  text_1_ld = len(set(text1.split()))/len(text1.split())
  # repeat for the second text
  text_2_ld = len(set(text2.split()))/len(text2.split())

  # print out the results
  print(f"{text1}\nLexical Diversity: {text_1_ld} \n\n{text2}\nLexical Diversity: {text_2_ld}")

### string formatting

The `print()` line in the function probably looks a little crazy. It uses something called string formatting (specifically, an f-string), where you can combine variables and text inside a single string.

The `f` in front of the string is all you need to activate this format. Then, inside the string delimiters, you can use curly brackets `{}` to include variables and other Python expressions, such as function calls.

The `\n` is a character which stands for newline, which has the same effect as pressing the enter key on your keyboard to create a new line in a text.
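
As a small standalone sketch of both ideas (the variable names are made up for illustration), the cell below builds a two-line message with an f-string. The `:.2f` inside the second pair of curly brackets is an optional format specifier, not used in the function above, which rounds a number to two decimal places for display.

```python
song = 'baby shark'
n_tokens = len(song.split())

# curly brackets are replaced by the values; \n starts a new line
message = f"Text: {song}\nTokens: {n_tokens}"
print(message)

# optional: limit a decimal number to two places when displaying it
rounded = f"{2 / 3:.2f}"
print(rounded)
```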

Ok, let's create two texts and analyze them using this new function. You'll see that I use triple quotes as a means to make it easier to encase the entire text on multiple lines. This is mainly for aesthetic reasons.

In [None]:
# create two strings as our texts
turtles = """teenage mutant ninja turtles
teenage mutant ninja turtles
teenage mutant ninja turtles
heroes in a halfshell turtle power!"""


baby_shark = """Baby Shark doo-doo doo-doo
Baby Shark doo-doo doo-doo
Baby Shark doo-doo doo-doo
Baby Shark"""


Ok, time to run the function! But, before doing so, can you predict which song *should* be more lexically diverse? Remember, more repetitions of the same word means there will be *lower* lexical diversity. So, which song repeats more words?

In [None]:
# which song is more lexically diverse?
compare_lexical_diversity(turtles, baby_shark)

## **Your Turn**

Spend some time comparing the lexical diversity of different strings.

# Improving Lexical Diversity

Lexical Diversity based on TTR is a very crude measure of vocabulary diversity. The above examples have used very short texts, and the resulting TTR values are somewhat meaningful. However, TTR has been heavily criticized for being a relatively poor measure of lexical diversity. Why is that? At least for English, one reason is that the overall length of a text will directly influence TTR. Longer texts will naturally start to repeat certain words again, such as function words like `the`, `a`, and `an`.

To account for this problem, there are a surprisingly large number of different TTR metrics that have been developed. Many of them try to include some sort of average TTR by moving over and calculating TTR for portions of the text, or using some other sort of function which helps address the effects of text length. Some other research also points out that human measures of lexical diversity are not being captured in measures such as TTR.
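
To give a feel for the "moving" idea, here is a minimal sketch that averages the TTR over every run of `window` consecutive tokens. The function name `moving_average_ttr`, the default window size, and the handling of short or empty inputs are all assumptions made for illustration. This is not a faithful implementation of any particular published metric.

```python
def moving_average_ttr(tokens, window=5):
    """Average the TTR of every run of `window` consecutive tokens."""
    # an empty list has no TTR; returning 0.0 here is an arbitrary choice
    if not tokens:
        return 0.0
    # if the text is no longer than one window, fall back to plain TTR
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)

    window_ttrs = []
    # slide the window along the text one token at a time
    for start in range(len(tokens) - window + 1):
        chunk = tokens[start:start + window]
        window_ttrs.append(len(set(chunk)) / window)
    return sum(window_ttrs) / len(window_ttrs)

short_text = 'hello world'.split()
long_text = ('hello world ' * 10).split()

# plain TTR drops as the same two words keep repeating
plain_short = len(set(short_text)) / len(short_text)
plain_long = len(set(long_text)) / len(long_text)

# the windowed version only ever looks at two tokens at a time
moving_short = moving_average_ttr(short_text, window=2)
moving_long = moving_average_ttr(long_text, window=2)

print(plain_short, plain_long, moving_short, moving_long)
```

In this toy case the plain TTR falls from 1.0 to 0.1 simply because the text got longer, while the windowed value stays the same.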


So, know that other measures exist. To wrap this notebook up, let's try to "prove" how text length might influence TTR while also getting some practice with function writing :)

We are going to write a function which splits a text into segments and then calculates the TTR for each segment, and then averages those TTR values to get an average TTR.

But first, let's consider how we might split a text into sentences. You might have noticed me using the `\n` character in some of the print statements. This character represents a newline, such as when you press the enter/return key to start a new line.

Remember that `.split()` will let you split a string on any character, and this includes newlines! Consider the text below. It is all encoded as a single string, but because I copied and pasted it from a website in this format, it retained its newlines (I also use triple quotes `"""` to allow the string to span across these newlines). Run the cells to see how splitting on newlines can give us sentences:

In [None]:
# Let's read in a longer string
rwib = """See the animal in his cage that you built.
Are you sure what side you're on?
Better not look him too closely in the eye.
Are you sure what side of the glass you are on?
See the safety of the life you have built.
Everything where it belongs
Feel the hollowness inside of your heart
And it's all
Right where it belongs
What if everything around you
Isn't quite as it seems?
What if all the world you think you know
Is an elaborate dream?
And if you look at your reflection
Is it all you want it to be?
What if you could look right
Through the cracks?
Would you find yourself
Find yourself afraid to see?"""


Splitting the string on `\n` means each item in the resulting list will be approximately a sentence (here, one line of the song), rather than a single word. Cool!

In [None]:
rwib.split('\n')
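
One detail worth knowing when you try this on your own texts: if a string ends with a newline or contains a blank line, splitting on `\n` leaves empty strings in the list. The sketch below uses a made-up string, `lyrics`, to show this alongside the built-in string method `.splitlines()`, which does not produce a trailing empty string.

```python
# a made-up string that ends with a newline
lyrics = "first line\nsecond line\n"

# splitting on '\n' keeps an empty string after the final newline
by_newline = lyrics.split('\n')

# .splitlines() splits at line boundaries without the trailing empty string
by_lines = lyrics.splitlines()

print(by_newline)
print(by_lines)
```

An empty string contains zero tokens, so dividing by its token count would raise a `ZeroDivisionError`.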

Now that we have figured out a way to split a string into sentences, we can build our average TTR function. I am going to include a `for` loop inside this function. We will cover these in more detail later on, but you might be able to understand the logic of it now. Basically, a `for` loop will repeat the same operation on each member of a sequence, such as each item in a list.

If you want to explore the function as it works, you can uncomment the print statements.

In [None]:
# define a function named average_ttr
def average_ttr(text):

  # create an empty list to store the TTR values
  ttr_values = []

  # split the raw string input into sentences using the newline character
  # the result is a list with each value being a sentence
  sentences = text.split('\n')

  # loop through each value/sentence of sentences, one at a time.
  for sent in sentences:

    # calculate the TTR of the sentence
    # print(sent)
    sent_ld = len(set(sent.split()))/len(sent.split())

    # add the value to the list using .append()
    ttr_values.append(sent_ld)

  # use sum() to add the TTR values together,
  # then divide by total number of values (using len()) to get average TTR
  # print(ttr_values)
  average_ttr = sum(ttr_values)/len(ttr_values)

  print(f"Average Sentence TTR is: {average_ttr}")


The average TTR is about .97 or 97%, meaning it is a pretty diverse text!

In [None]:
average_ttr(rwib)

What happens if we calculate the lexical diversity of the entire song at once? The TTR is shockingly lower!

In [None]:
len(set(rwib.split())) / len(rwib.split())

# Discussion

Play around with the average TTR function and different texts.
- You might want to explore splitting texts on different things besides newlines, such as periods
- How does the average TTR function compare to the whole text TTR? Compare the results between the two for the same texts.
- Does calculating average sentence TTR create any new problems with TTR?
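
As a starting point for these experiments, here is one possible variation on the average TTR function. The name `average_segment_ttr` and the `delimiter` parameter are made up for illustration. Unlike the version above, this sketch returns the value instead of printing it, and it skips segments with no tokens so that blank lines or a trailing period do not cause a division by zero.

```python
def average_segment_ttr(text, delimiter='\n'):
    """Average the TTR of each segment of text, split on delimiter."""
    ttr_values = []
    for segment in text.split(delimiter):
        tokens = segment.split()
        # skip empty segments (e.g. blank lines or text after a final period)
        if not tokens:
            continue
        ttr_values.append(len(set(tokens)) / len(tokens))

    # if there were no usable segments, return 0.0 (an arbitrary choice)
    if not ttr_values:
        return 0.0
    return sum(ttr_values) / len(ttr_values)

# split on newlines (the default); the blank line is skipped
by_line = average_segment_ttr('a b\n\nc c')

# split on periods instead
by_period = average_segment_ttr('a a. b c. ', delimiter='.')

print(by_line, by_period)
```

Returning the value makes it easier to store results and compare them across texts or delimiters.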