## Workshop 3: List comprehensions and more useful functions
### What is a list comprehension?
From W3Schools: "List comprehension offers a shorter syntax when you want to create a new list based on the values of an existing list."

List comprehensions can be an alternative to loops when you're trying to create a list.

In [None]:
corpus = "Arma virumque canō, Trōiae quī prīmus ab ōrīs\nĪtaliam, fātō profugus, Lāvīniaque vēnit\nlītora, multum ille et terrīs iactātus et altō\nvī superum saevae memorem Iūnōnis ob īram"
corpus_list = corpus.split()

long_words = [word for word in corpus_list if len(word) > 4]
print(long_words)


long_words = [word for word in corpus_list if len(word) > 4]  
Explanation: For each word in corpus_list, add the word to the new list long_words if that word's length is greater than 4.

You can also apply a function or method to whatever you're adding to the list.

In [None]:
long_words = [word.upper() for word in corpus_list if len(word) > 4]
print(long_words)

long_words = [word.upper() for word in corpus_list if len(word) > 4]  
Explanation:
For each word in corpus_list, add the word in all caps to the new list long_words if that word's length is greater than 4.

Remember that list comprehensions can also be written as loops. This can sometimes be more readable, sometimes not, so you as the programmer can use your own discretion.

In [None]:
long_words = [word.upper() for word in corpus_list if len(word) > 4]

# the above code does the same thing as...

long_words_2 = []
for word in corpus_list:
    if len(word) > 4:
        long_words_2.append(word.upper())

print(long_words)
print(long_words_2)

if long_words == long_words_2:
    print("the lists are the same")

### List slicing
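List slicing lets you take a slice (or slices) of the elements of a list in the format list[start : end : step]. The start index is included, the end index is not, and any of the three parts can be left out.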

In [None]:
print(corpus_list[:]) # all the elements
print(corpus_list[0:5]) # first 5 elements, note that it includes the 0th element and not the 5th element
print(corpus_list[4:10]) # a section in the middle (index 4 up to and including index 9, but not index 10)
print(corpus_list[::3]) # the whole list but only every 3rd element 
print(corpus_list[0:10:2]) # every other of the first 10 elements

# and whatever else you need to do
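
A few more slicing patterns, sketched here on a short made-up list (the list `words` is just an illustration, not part of the corpus above): negative indexes count from the end, a negative step walks backwards, and bounds that run past the end are clipped rather than raising an error.

```python
# A short stand-in list so the results are easy to check by eye
words = ["arma", "virumque", "cano", "troiae", "qui", "primus"]

last_two = words[-2:]         # a negative start counts from the end
all_but_last = words[:-1]     # a negative end stops before the last element
reversed_words = words[::-1]  # a step of -1 walks the list backwards
past_the_end = words[4:100]   # out-of-range bounds are clipped, not an error

print(last_two)
print(all_but_last)
print(reversed_words)
print(past_the_end)
```

Slicing always returns a new list, so none of these change `words` itself.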


### Some list functions
To be honest, a lot of list functions, such as all(), map(), and filter(), can be replaced by or combined with list comprehensions. You're welcome to check those out on your own, but I think that for our purposes it's easier to just use list comprehensions.
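
If you're curious what that looks like, here is a small side-by-side sketch on a made-up word list (the variable names are only for illustration). It builds the same list once with map() and filter() and once with a comprehension, then shows all() being handed a comprehension-style expression.

```python
words = ["Arma", "virumque", "cano", "Troiae", "qui"]

# filter() keeps the items that pass a test, map() applies a function to each one
with_functions = list(map(str.upper, filter(lambda w: len(w) > 4, words)))

# the same result written as a single list comprehension
with_comprehension = [w.upper() for w in words if len(w) > 4]

print(with_functions)
print(with_comprehension)

# all() checks whether a condition holds for every item
all_short = all(len(w) < 10 for w in words)
print(all_short)
```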

In [None]:
# enumerate
corpus_list_nums = enumerate(corpus_list)
for index, word in corpus_list_nums:
    print(index, word)
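
Two details about enumerate that are easy to trip over, shown on a short made-up list: you can pass start=1 if you want the numbering to begin at 1, and the object enumerate returns can only be looped over once.

```python
words = ["Arma", "virumque", "cano"]

# start=1 makes the numbering begin at 1 instead of 0
numbered = list(enumerate(words, start=1))
print(numbered)

# enumerate returns a one-pass iterator: after one loop it is used up
pairs = enumerate(words)
first_pass = list(pairs)
second_pass = list(pairs)

print(first_pass)
print(second_pass)
```

This is why it is usually simplest to call enumerate directly in the for line rather than saving it in a variable first.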

In [None]:
# min and max
word_lengths = [3, 5, 2]
longest_word_length = max(word_lengths)
shortest_word_length = min(word_lengths)

# but this is slightly strange data, let's see a different example:

word_freqs = {"hello": 1, "my": 4, "name": 2, "is": 5}
most_frequent = max(word_freqs.values()) # the highest count (5), not the word that has it

print(most_frequent)
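
If you want the word rather than its count, one option is the key argument of max() and min(). The sketch below reuses the same small dictionary; the new variable names are just for illustration.

```python
word_freqs = {"hello": 1, "my": 4, "name": 2, "is": 5}

# max over the values gives the highest count
highest_count = max(word_freqs.values())

# max over the keys, compared by their counts, gives the word itself
most_frequent_word = max(word_freqs, key=word_freqs.get)
rarest_word = min(word_freqs, key=word_freqs.get)

print(highest_count, most_frequent_word, rarest_word)
```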

How do you programmatically generate something like word_freqs?  
In week 1 of the workshop we talked about doing this with loops, but you can also use Counter from the collections module. A Counter counts how many times each element occurs in an iterable and makes it easy to look up the most common ones. Note that Counter is a subclass of Python's dictionary, so it can be used in most of the same ways, with some extra methods such as most_common().

You can also do arithmetic between Counter objects, which is useful for combining counts from several corpora.

In [None]:

from collections import Counter

corpus = "Arma virumque canō, Trōiae quī prīmus ab ōrīs\nĪtaliam, fātō profugus, Lāvīniaque vēnit\nlītora, multum ille et terrīs iactātus et altō\nvī superum saevae memorem Iūnōnis ob īram"
corpus_list = corpus.split()

word_freqs = Counter(corpus_list)

print(word_freqs) 

corpus2 = "Siquis in hoc artem populo non novit amandi,\nHoc legat et lecto carmine doctus amet\nArte citae veloque rates remoque moventur,\nArte leves currus: arte regendus amor."
corpus2_list = corpus2.split()

word_freqs2 = Counter(corpus2_list)

print(word_freqs + word_freqs2)
print(word_freqs & word_freqs2) # intersection: words in both corpora, keeping the smaller count
print(f"most common word in first corpus {word_freqs.most_common(1)} in second corpus {word_freqs2.most_common(1)}")
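
Because the real corpora produce long printouts, here is the same Counter arithmetic sketched on two tiny made-up token lists, where the results are easy to check by hand.

```python
from collections import Counter

a = Counter(["et", "et", "arma", "amor"])
b = Counter(["et", "amor", "amor", "arte"])

combined = a + b    # counts are added together
shared = a & b      # intersection: the smaller count of each shared word
difference = a - b  # subtraction: only words whose count stays above zero
either = a | b      # union: the larger count of each word

print(combined)
print(shared)
print(difference)
print(either)
```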

In [None]:
# set
tokens = ['this', 'is', 'a', 'test', 'this']
unique_tokens = list(set(tokens)) # a set is an unordered collection of unique values, so duplicates are dropped (the order may change)

print(unique_tokens)
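
Since a set does not keep the original order, here are two common ways to get a predictable result, sketched on the same token list: dict.fromkeys() keeps the order of first appearance, and sorted() puts the unique values in alphabetical order.

```python
tokens = ["this", "is", "a", "test", "this"]

# dictionary keys are unique and keep insertion order
unique_in_order = list(dict.fromkeys(tokens))

# sorted() accepts a set and returns a sorted list
unique_sorted = sorted(set(tokens))

print(unique_in_order)
print(unique_sorted)
```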

Let's look at and explain some examples from last week's class. The code that follows is from the example of calculating the first part of Delta P, towards the end of the Week 4 notebook. This also includes some examples of list functions and list slicing!

In [46]:
node = "temple"
collocate = "a"

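def count_ngram_collocations(x, w1, w2, l_size: int = 1, r_size: int = 1):
    # collect the lemma of every token in x
    lemmata = [t.lemma_ for t in x]

    # positions where the node word (w1) occurs
    indexes = [i for i, lemma in enumerate(lemmata) if lemma == w1]
    cooccurrences = 0

    for i in indexes:
        # clamp the window so it never runs past either end of the list
        left = max(i - l_size, 0)
        right = min(i + r_size + 1, len(lemmata))

        # slice out l_size lemmas to the left and r_size to the right of the node
        window = lemmata[left:right]
        print(window)
        # count this occurrence of the node if the collocate (w2) is in the window
        if w2 in window:
            cooccurrences += 1
            
    return cooccurrences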

lemmata = [t.lemma_ for t in x]  
This takes the lemma_ attribute of each token in x, which in this case is a Pandas dataframe, and collects the results in the list lemmata.

indexes = [i for i, lemma in enumerate(lemmata) if lemma == w1]  
enumerate pairs each lemma with its index. For each (index, lemma) pair, keep the index i if the lemma is our node (w1). The result is a list of every position where the node occurs.
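
To see the whole function at work without loading the Week 4 data, here is a minimal sketch that feeds it made-up tokens. Token is a hypothetical stand-in that only has a lemma_ attribute, the sentence is invented, and the debugging print is left out to keep the output short.

```python
from collections import namedtuple

# Hypothetical stand-in for a token: the function only needs a .lemma_ attribute
Token = namedtuple("Token", ["lemma_"])

def count_ngram_collocations(x, w1, w2, l_size=1, r_size=1):
    lemmata = [t.lemma_ for t in x]
    indexes = [i for i, lemma in enumerate(lemmata) if lemma == w1]
    cooccurrences = 0
    for i in indexes:
        # max/min keep the window inside the list
        left = max(i - l_size, 0)
        right = min(i + r_size + 1, len(lemmata))
        if w2 in lemmata[left:right]:
            cooccurrences += 1
    return cooccurrences

# An invented, already-lemmatized sentence
sentence = "she enter a temple and leave the temple".split()
tokens = [Token(lemma) for lemma in sentence]

# default window: one lemma on each side of "temple"
count = count_ngram_collocations(tokens, "temple", "a")

# wider left window: five lemmas to the left of "temple"
count_wide = count_ngram_collocations(tokens, "temple", "a", l_size=5)

print(count, count_wide)
```

With the default window only the first "temple" has "a" next to it. Widening the left window to five lemmas also reaches "a" from the second "temple".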

