
## **What is a word?**

- or, more precisely, how can we get Python to turn a `string` into a list of words?
- the `str.split()` method is one way
- this method will "split" a string on a given separator. The default is to split on whitespace:

In [1]:
# define a string and save it to a variable
pretzels = 'these pretzels are making me thirsty'

# use .split() to convert the string into a list of segments split on whitespace
pretzels.split()

['these', 'pretzels', 'are', 'making', 'me', 'thirsty']

The resulting object was a `list` of values, which happen to be individual words!
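`.split()` can also be given an explicit separator instead of relying on the whitespace default. A minimal sketch, using a made-up string (`snacks` is a hypothetical example and not part of the workshop data):

```python
# hypothetical example string with items separated by commas
snacks = 'pretzels,chips,crackers'

# pass the separator as an argument to split on it instead of whitespace
snack_list = snacks.split(',')

print(snack_list)
```

The result is still a `list`, so everything that works on the whitespace-split list works here too.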

Do you remember `len()`? How can we use `len()` and `str.split()` to count the number of words in a string?

In [None]:
# how to use len() and .split() to count the total number of words?

## types and tokens

- types: unique words
- tokens: all occurrences of words, including repeats
- type-token ratio (TTR): number of types divided by number of tokens; a lower value means more lexical repetition in the text

#### use `set()` to find unique instances of items in a sequence
- `set(tokens)` = all unique values

#### use `len()` and `set()` to calculate TTR

- `len(set(tokens))` / `len(tokens)`

In [5]:
# load in a text
turtles = """teenage mutant ninja turtles, 
            teenage mutant ninja turtles, 
            teenage mutant ninja turtles, 
            heroes in a halfshell, turtle power!"""

In [6]:
# convert to tokens using `.split()`
turtle_tokens = turtles.split()

In [7]:
# look at the tokens
turtle_tokens

['teenage',
 'mutant',
 'ninja',
 'turtles,',
 'teenage',
 'mutant',
 'ninja',
 'turtles,',
 'teenage',
 'mutant',
 'ninja',
 'turtles,',
 'heroes',
 'in',
 'a',
 'halfshell,',
 'turtle',
 'power!']

In [8]:
# use `len()` to count the number of words
len(turtle_tokens)


18

In [11]:
# use `set()` to get just the unique words
turtle_types = set(turtle_tokens)
turtle_types

{'a',
 'halfshell,',
 'heroes',
 'in',
 'mutant',
 'ninja',
 'power!',
 'teenage',
 'turtle',
 'turtles,'}

In [12]:
# use len() on the set to count the number of types
len(turtle_types)

10

In [13]:
# finally...calculate the TTR!

len(turtle_types) / len(turtle_tokens)

0.5555555555555556
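The steps above can be wrapped into one reusable function. This is only a sketch: `type_token_ratio` is a hypothetical helper name, and the empty-list guard is an added assumption rather than something the workshop requires.

```python
def type_token_ratio(tokens):
    """Return the number of unique tokens divided by the total number of tokens."""
    # guard against dividing by zero when the list is empty
    if len(tokens) == 0:
        return 0.0
    return len(set(tokens)) / len(tokens)


# same text as above, tokenised on whitespace
turtles = """teenage mutant ninja turtles,
            teenage mutant ninja turtles,
            teenage mutant ninja turtles,
            heroes in a halfshell, turtle power!"""

# 10 types / 18 tokens
print(type_token_ratio(turtles.split()))
```

Calling the function on `turtles.split()` should give the same value as the step-by-step calculation.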

## Problems with `.split()`

- punctuation is retained
- not all languages use whitespace

In [15]:
yadda = 'yadda, yadda, yadda!'

set(yadda.split())

{'yadda!', 'yadda,'}

In [16]:
# not all languages use whitespace
zhongwen = '对不起我的中文不好'

zhongwen.split()

['对不起我的中文不好']

## Enter NLTK

- NLTK = [Natural Language ToolKit](https://www.nltk.org/book/)
- we will use a tokenizer function from NLTK
- we need to:
    - import the nltk library
    - download some resources

- then we will use `nltk.word_tokenize()` to obtain tokens

In [17]:
# import the library and download required resources
import nltk
# note: recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/sskalicky/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Compare `str.split()` and `nltk.word_tokenize()` on the same text

- the NLTK function treats each punctuation mark as its own token
- does this create any new challenges for text metrics?

In [25]:
# .split() keeps punctuation attached to the word
split_yadda = yadda.split()

print(split_yadda)

['yadda,', 'yadda,', 'yadda!']


In [26]:
# nltk tokenizer recognises punctuation
nltk_yadda = nltk.word_tokenize(yadda)

print(nltk_yadda)

['yadda', ',', 'yadda', ',', 'yadda', '!']
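One way to see the effect on text metrics is to compute the TTR for both versions. The sketch below copies the two token lists from the outputs above, so it runs without NLTK installed.

```python
# token lists copied from the two outputs above
split_yadda = ['yadda,', 'yadda,', 'yadda!']
nltk_yadda = ['yadda', ',', 'yadda', ',', 'yadda', '!']

# .split(): 2 types ('yadda,' and 'yadda!') out of 3 tokens
split_ttr = len(set(split_yadda)) / len(split_yadda)

# NLTK: 3 types ('yadda', ',' and '!') out of 6 tokens
nltk_ttr = len(set(nltk_yadda)) / len(nltk_yadda)

print(split_ttr, nltk_ttr)
```

Neither value reflects the fact that the text contains one word repeated three times: `.split()` counts punctuated variants as different types, while the NLTK tokens count punctuation marks as types and tokens.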


In [27]:
# implications for contractions

nltk.word_tokenize('You know we\'re living in a society!')

['You', 'know', 'we', "'re", 'living', 'in', 'a', 'society', '!']

### conditional removal of punctuation
- we have a better representation of words
- but we also have punctuation as words!
- we can use conditional expressions and tests to remove things...conditionally  
    - remove anything that is punctuation...
    - only keep things that are words...
    - only keep words of certain lengths...
- we use `if` conditional statements to do this:

>```
>
>if condition is True:
>    do something
>
>```


#### Conditional tests include:

test|syntax
--|--
is `a` the same as `b`?| `a == b`
is `a` greater than `b`?|`a > b`
is `a` less than `b`?|`a < b`
is `a` in `b`?|`a in b`
is `a` not in `b`?|`a not in b`
string-specific tests|string-specific syntax
is `a` made up only of letters?|`a.isalpha()`
is `a` uppercased?|`a.isupper()`
is `a` lowercased?|`a.islower()`
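
Each test in the table evaluates to `True` or `False`. A short sketch with made-up values (`a` and `b` here are hypothetical examples chosen for illustration):

```python
a = 'yadda'
b = ['yadda', ',', '!']

print(a == 'yadda')    # True: the two strings are identical
print(len(a) > 3)      # True: 'yadda' has 5 characters
print(len(a) < 3)      # False
print(a in b)          # True: 'yadda' is an item in the list
print('!' not in b)    # False: '!' is in the list
print(a.isalpha())     # True: every character is a letter
print(','.isalpha())   # False: a comma is not a letter
print(a.islower())     # True: all the letters are lowercase
print(a.isupper())     # False
```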


In [30]:
# loop through each character in the string
for letter in 'New Zealand':
    # apply a test to each item in the string
    if letter.isupper():
        # perform an action if the test is True
        print(letter, end = '')

NZ
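
The same loop-and-test pattern can be applied to a list of tokens to drop punctuation. A minimal sketch, assuming the token list shown in the earlier `nltk.word_tokenize()` output (copied by hand here so the example runs without NLTK); `words_only` is a hypothetical variable name:

```python
# tokens copied from the nltk.word_tokenize() output shown earlier
tokens = ['You', 'know', 'we', "'re", 'living', 'in', 'a', 'society', '!']

# start with an empty list to collect the tokens we want to keep
words_only = []

for token in tokens:
    # keep the token only if every character in it is a letter
    if token.isalpha():
        words_only.append(token)

print(words_only)
```

Note that `.isalpha()` removes `'!'` but also removes `"'re"`, because the apostrophe is not a letter. Whether that is acceptable depends on how contractions should be counted.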