<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/12_digging_deeper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Digging Deeper

This notebook first introduces some new Python functions and data types, then extends the discussion from the previous notebook about how to write more structured programs. We start with `enumerate()`, which makes `for` loops more flexible.

# Enumerate

`enumerate()` is a handy Python function which lets you run a `for` loop alongside a counter. This means that each pass through the loop gives you both the current value *and* its position, which is useful for indexing and other tricks.

To use `enumerate()`, you write a `for` loop as usual but wrap the sequence you are looping over in `enumerate()`. You also need to name two loop variables: one for the count of the loop, and one for the value.


In [None]:
# define a string to use as an example
target = 'soda'

In [None]:
# create an enumerate loop
for index, character in enumerate(target):
  # print the loop counter plus value
  # remember, Python starts at 0
  print(index, character)

We can use this process to do things such as index different locations during a loop. 

In [None]:
# create an enumerate loop
for index, character in enumerate(target):
  # use the loop index to index the target at that spot, which gives the same result as printing character
  print(target[index])

In [None]:
# now print the previous character
for index, character in enumerate(target):
  # we add a conditional to skip the first index: at index 0, index - 1 is -1,
  # which Python reads as the last character (it does not raise an error)
  if index != 0:
    print(target[index - 1], character)
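
The check matters because Python accepts negative indices. At the first position, `index - 1` is `-1`, which points at the last character of the string. A small sketch with the same string shows what happens when the check is left out (the names `wrapped` and `pairs_without_check` are made up for illustration):

```python
# negative indices count backwards from the end of a sequence
target = 'soda'

# at index 0, index - 1 is -1, which is the last character
wrapped = target[0 - 1]
print(wrapped)

# build the pairs without the check to see the wrap-around pair come first
pairs_without_check = [(target[index - 1], character)
                       for index, character in enumerate(target)]
print(pairs_without_check)
```

The first pair joins the last character to the first one, which is not a real neighbouring pair, so the loop above skips index 0.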


Consider how we can use this on a list of tokens to print word pairs:


In [None]:
# use nltk to tokenize
import nltk
nltk.download('punkt')
sentence = nltk.word_tokenize('I overthink your punctuation use. Not my fault, just a thing that my mind do')

In [None]:
# print word pairs
for index, word in enumerate(sentence):
  # make sure index is not at the end of the sentence
  if index != len(sentence) - 1:
    # I am using the enumerate index to index the structure we are looping over
    print(word, sentence[index + 1])

Knowing about `enumerate()` is handy for looking at word combinations because you may want to look at the words before and after the current word during a loop. Keep this function in mind as we explore bigrams and other ways words pattern together. It might help you when you start making your own functions and analyses!
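
As a rough sketch of how the word-pair loop could be wrapped up for reuse, the function below collects each word with the word that follows it. The name `word_pairs` is made up for this example, and a plain `.split()` stands in for `nltk.word_tokenize()` so the cell runs without any downloads. Each pair is stored between parentheses as a tuple, which the next section introduces.

```python
def word_pairs(tokens):
  """return a list of (word, next_word) pairs"""
  pairs = []
  for index, word in enumerate(tokens):
    # stop before the last token, which has no following word
    if index < len(tokens) - 1:
      pairs.append((word, tokens[index + 1]))
  return pairs

# a plain split keeps the example free of nltk
tokens = 'not my fault just a thing'.split()
pairs = word_pairs(tokens)
print(pairs)
```

If you do not need the index for anything else, `zip(tokens, tokens[1:])` builds the same pairs.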

# Tuples

Tuples are another data type in Python. 

In some ways, tuples are similar to lists:
- They are sequences of values (which can be different types)
- The sequences are ordered (and thus can be indexed)

In some ways, tuples are different from lists:
- They are immutable: you cannot alter their values after they are created.
- They use different syntax


You can index tuples, you can slice them, and you can measure their length. 
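
A quick sketch of those three operations, which work the same way as they do on lists (the variable names are only for illustration, and the syntax for creating the tuple is covered in the next section):

```python
band = ('Nine', 'Inch', 'Nails')

# indexing returns a single value
first = band[0]

# slicing returns a new, shorter tuple
last_two = band[1:]

# len() counts the values in the tuple
size = len(band)

print(first, last_two, size)
```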

## Creating Tuples

Tuples are created by separating values with commas, usually placed between parentheses.


In [None]:
# create a tuple with no brackets
nin1 = 'Nine', 'Inch', 'Nails'
nin1

In [None]:
type(nin1)

In [None]:
# create a tuple with brackets
nin2 = ('Trent', 'Reznor')
nin2

In [None]:
type(nin2)

## Tuple assignment

You can use tuple assignment to efficiently generate multiple variables. To do so, you first generate your variable names, separated by commas, and then assign values, also separated by commas:



In [None]:
# we create three variables in one go
past, present, future = 1982, 2022, 2055

In [None]:
past

In [None]:
present

In [None]:
future

In [None]:
# however, the number of names and values has to match - this cell raises a ValueError on purpose
past, present, future = 1982, 2022

The "not enough values to unpack" error is telling us that it expected 3 (because there are three variable names) but only got 2 (because there are only two values on the right side of the variable assignment). 
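
If you want to see the message without stopping the notebook, one option is to catch the `ValueError`. The sketch below tries one tuple with too few values and one with too many, and the list name `messages` is made up for the example. The exact wording of the messages can differ slightly between Python versions.

```python
messages = []

# the first tuple has too few values, the second has too many
for values in [(1982, 2022), (1982, 2022, 2055, 2100)]:
  try:
    past, present, future = values
  except ValueError as error:
    # keep the text of the error instead of crashing
    messages.append(str(error))

for message in messages:
  print(message)
```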

## Using `.split()` and tuple assignment

Knowing that we can assign resulting values to multiple variables, we can also use the results of functions like `.split()` as a means to provide multiple values in tuple assignment. For example, we can split a date into day, month, and year:

In [None]:
# save a date to a string
date = '11.11.22'

In [None]:
# splitting the string on '.' gives us three values
day, month, year = date.split('.')

In [None]:
day

In [None]:
month

In [None]:
year

Why would you use tuples instead of lists? It's not a bad question, and many other people [also wonder](https://stackoverflow.com/questions/1708510/list-vs-tuple-when-to-use-each) about this. The main distinction boils down to mutability: you can change the values in a list as you like, but the values in a tuple cannot change. So, for structures you'd prefer to stay fixed, you may want to use tuples; otherwise, use lists. At the end of the day, there is no one "right" way to do it.
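
One practical consequence of immutability is that a tuple can be used as a dictionary key (as long as the values inside it are themselves immutable), while a list cannot. The sketch below counts some made-up word pairs to show the idea, and `pair_counts` and `list_error` are names invented for the example.

```python
# made-up word pairs, stored as tuples
pairs = [('my', 'mind'), ('my', 'fault'), ('my', 'mind')]

# use each tuple as a dictionary key to count it
pair_counts = {}
for pair in pairs:
  pair_counts[pair] = pair_counts.get(pair, 0) + 1
print(pair_counts)

# the same thing with a list as the key raises a TypeError
try:
  {['my', 'mind']: 1}
except TypeError as error:
  list_error = str(error)
  print(list_error)
```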


In [None]:
# create equivalent lists/tuples
albums_list = ['Pure Heroine', 'Melodrama', 'Solar Power']
albums_tuple = ('Pure Heroine', 'Melodrama', 'Solar Power')

In [None]:
# I can change the values in the list
albums_list[1] = 'PH'
albums_list

In [None]:
# but cannot do so with tuples (this raises a TypeError)
albums_tuple[1] = 'PH'

# Named Tuples

Named tuples are a cool extension of tuples which let you build a factory for creating objects that share the same set of properties. In this way, they are sort of like a dictionary, in that an object holds several named pieces of information, but they differ from a dictionary in that we fix the property names in advance.

To use named tuples we need to import the function from the `collections` module:

In [None]:
from collections import namedtuple

To use `namedtuple`, we first need to define our blueprint. This provides us with a reusable template that we can create new data from. The syntax to do so is:

```
Name = namedtuple('Name', ['attribute1', 'attribute2', ...])
```

The first Name (on the left of the = ) saves our namedtuple to a variable. The second Name (the string inside the brackets) will be the identifier or name of this tuple. By saving the named tuple to a variable, we can then call that variable name to create additional tuples of the same structure. 

The `attributes` inside the list represent the sorts of information we want to store about the tuple. Think of them like dictionary keys.


Maybe a bit confusing, I know, so let's look at an example. Let's create a namedtuple which stores information about a text. We will store the length of the text in words as well as the longest word for a text.


In [None]:
# running this cell creates the factory/blueprint
TextInfo = namedtuple('TextInfo', ['length', 'longest_word'])

Now that we've created the blueprint, we can start making individual versions of this blueprint for different texts. To do so, we choose a new variable name for our individual version and save it to an instance of `TextInfo`. 

We also need to set the values of the attributes, which we can do in much the same way as we set the values of arguments when calling a function:

In [None]:
# now create our first object with fake values
text1 = TextInfo(length = 100, longest_word = 'incontrovertible')

After having created our object, we can see the true value of namedtuple start to shine. We can access information about the object using dot notation. Each attribute can be accessed by typing the name of the namedtuple (in this case, `text1`), followed by a full stop or dot `.`, followed by the name of the attribute:

```
name.attribute
```

So, we can quickly query the length of `text1` by running:


In [None]:
text1.length

As well as the longest word:

In [None]:
text1.longest_word
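
A named tuple is still a tuple underneath, so it can also be indexed, unpacked, and it stays immutable. The sketch below rebuilds the same `TextInfo` example so it runs on its own, and also uses the `_asdict()` and `_replace()` methods that `namedtuple` provides. The other variable names are made up for illustration.

```python
from collections import namedtuple

TextInfo = namedtuple('TextInfo', ['length', 'longest_word'])
text1 = TextInfo(length=100, longest_word='incontrovertible')

# index and unpack it like a regular tuple
first_value = text1[0]
length, longest_word = text1

# convert it to a dictionary keyed by attribute name
as_dict = text1._asdict()

# attributes cannot be reassigned, because tuples are immutable
try:
  text1.length = 200
  reassigned = True
except AttributeError:
  reassigned = False

# _replace() returns a new named tuple and leaves the original alone
text2 = text1._replace(length=200)

print(first_value, as_dict, reassigned, text2)
```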

# Writing a full program

Let's put this together into something more useful using some functions we already know. We'll load in some texts, count the tokens, and collect the most frequent word and store it all in some named tuples, which we can then query and loop through. 

1. make a function to load in a text
2. make a function to tokenize and find most frequent token
3. make a function to store that information in a namedtuple

In [None]:
# our load text function from the prior notebook
def load_txt(path):
  """opens and returns a text"""
  # the with block closes the file once it has been read
  with open(path) as file:
    text = file.read()
  return text

In [None]:
# a function to use nltk.word_tokenize()
def tokenize(x):
  """splits a string into a list of tokens"""
  x = nltk.word_tokenize(x)
  return x

In [None]:
# a function to find the most frequent word
def find_most_frequent(tokens):
  """returns a one-item list holding a (token, count) tuple"""
  fd = nltk.FreqDist(tokens)
  return fd.most_common(1)

In [None]:
# a control function to run a string through the other functions and store results in a named tuple
def process_text(path):
  # first load the text
  txt = load_txt(path)

  # then tokenize
  tokens = tokenize(txt)

  # then find the most frequent word
  most_frequent = find_most_frequent(tokens)

  # return the values - separating them with a comma returns them together as one tuple
  return tokens, most_frequent
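
To see what "returning more than one value" means in practice, here is a small sketch with a made-up function, `min_and_max`. The function hands back a single tuple, which the caller can either keep whole or unpack with tuple assignment.

```python
def min_and_max(numbers):
  """return the smallest and largest value"""
  # the comma packs both values into one tuple
  return min(numbers), max(numbers)

# keep the result whole: it is a tuple
result = min_and_max([3, 1, 2])
print(type(result), result)

# or unpack it into two variables straight away
smallest, largest = min_and_max([3, 1, 2])
print(smallest, largest)
```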


The functions I have just defined are the different bits and pieces of our program. Now we need one final function which will control the whole show: it will define the named tuple and run each text through the other functions. We will call this function `main` and will feed it the folder and a list of filenames for the texts we want to process.

Here is an explanation of the function:


Line 2: I require two arguments: the `root` folder of the texts, and then a list of `files` which are located in that root folder. Doing it this way allows me to separate the filename from the full file path (there are of course other ways of doing this). 

Line 4: I then declare the `TextStats` namedtuple, which includes three attributes: `filename`, `number_of_words`, and `most_frequent_word`. This is the blueprint for our other named tuples.

Line 7: I create an empty list which I will store all of my tuples in. 

Line 10: I then loop through each of the files. In each loop, I use tuple assignment to store the values which are returned by `process_text()` into two variables named `tokens` and `most_frequent`. In the call to `process_text()`, I concatenate the individual filename to the end of root (which is the path to the folder containing the files).

Line 15: I then create a named tuple comprised of the filename, number of tokens, and the most frequent word. This named tuple is appended directly to my output list. 

Line 17: After the loop completes, I return the list of named tuples.


In [None]:
# write control function
def main(root, files):
  # define our named tuple
  TextStats = namedtuple('TextStats', ['filename', 'number_of_words', 'most_frequent_word'])

  # define an empty data container
  output = []

  # feed each file to our other functions
  for file in files:
    # get the tokens and most frequent word
    tokens, most_frequent = process_text(root + file)

    # create a named tuple
    output.append(TextStats(file, len(tokens), most_frequent))
  
  return output

In [None]:
# need our nltk resources since I will use word_tokenize(). 
import nltk
nltk.download('punkt')

In [None]:
# texts are in the same folder, so save that to a variable
root = '/content/drive/MyDrive/'

# my list of two texts
texts = ['mood_ring.txt', 'marine_biologist.txt']

# run the function
analysis = main(root = root, files = texts)

In [None]:
analysis

In [None]:
# loop through analysis
for text in analysis:
  print(text.filename)
  print(text.number_of_words)
  print(text.most_frequent_word)

Looking at the output, we can see that there are probably some additional steps or at least *options* we would want to consider for pre-processing. For instance, we probably don't want punctuation to count as our most frequent word. Regardless, hopefully this notebook can give you some ideas on how to structure your programs.
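
One possible pre-processing step, sketched below under a few assumptions, is to drop any token that contains no letters or digits before counting. The function name `most_frequent_word` and the token list are made up for the example, and `collections.Counter` stands in for `nltk.FreqDist` (which offers the same `most_common()` method) so the cell runs without nltk.

```python
from collections import Counter

def most_frequent_word(tokens):
  """return the most common token that contains a letter or digit"""
  # keep a token only if at least one of its characters is alphanumeric
  words = [token for token in tokens if any(ch.isalnum() for ch in token)]
  return Counter(words).most_common(1)

# a made-up token list where the comma is the most frequent token
tokens = ['I', 'think', ',', 'I', 'think', ',', 'I', 'do', ',', 'really', ',', '.']

unfiltered = Counter(tokens).most_common(1)
filtered = most_frequent_word(tokens)
print(unfiltered, filtered)
```

Whether to also lowercase the tokens or remove very common words is a separate choice that depends on the analysis.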

# Conclusion

In this notebook we have explored a new function (`enumerate`) as well as a new data type (tuples). We also learned about an extension to tuples called `namedtuple`, which provides us with another, more structured way to query attributes of objects.

Then I wrote a structured program which drew from different functions in a pipeline. You should spend some time playing around with each individual function to see if you can make tweaks here and there. For example, could you add more functions to process texts further? What about a function to automatically grab the filenames from a directory so you don't have to manually type them? Extending pre-existing functions, rather than reinventing the wheel each time, is a good way to develop more complex and interesting programs.
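
As one possible starting point for the filename question, the sketch below uses `pathlib` to collect the names of the `.txt` files in a folder. The function name `list_txt_files` is made up, and the example builds a temporary folder with placeholder files so that it runs anywhere. With your own texts you would pass in your own folder instead.

```python
import tempfile
from pathlib import Path

def list_txt_files(root):
  """return the sorted names of the .txt files directly inside root"""
  return sorted(path.name for path in Path(root).glob('*.txt'))

# build a throwaway folder with placeholder files to try the function on
with tempfile.TemporaryDirectory() as folder:
  for name in ['mood_ring.txt', 'marine_biologist.txt', 'notes.md']:
    (Path(folder) / name).write_text('placeholder')
  found = list_txt_files(folder)

print(found)
```

The returned list could be passed as the `files` argument of `main()`. Because `main()` joins `root + file` directly, `root` would still need to end with a slash.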