<div style="text-align: right">
    <i>
        LIN 537: Computational Linguistics 1 <br>
        Fall 2019 <br>
        Alëna Aksënova
    </i>
</div>

# Notebook 4: range, zip, enumerate, and useful string methods

This notebook introduces `range`, a way to iterate over numbers within a certain range, which in turn gives access to index-based iteration over containers. It also shows how to use `zip` and `enumerate`. In addition, it introduces several useful string methods such as `split` and `join`, and provides multiple practice exercises. Finally, it introduces $n$-grams and their extraction.

## For-loops: reminder

A _for-loop_ iterates over some object (an **iterable**) and considers the sub-elements of that object in order.

In [None]:
for letter in "apple":
    print(letter)

In [None]:
indices = [0, 1, -1, -4]
word = "linguistics"

for index in indices:
    print(word[index], end="")

In [None]:
cities = ["NYC", "LA", "SF"]
for city in cities:
    print("The current city is", city)
    for ch in city:
        print("\t", ch)

In order to print the indices of items in an iterable, we can implement a **counter**, i.e. a variable that increases every time some condition is met. In this case, we set the counter to $0$ and increase it with every iteration.

In [None]:
index = 0
for letter in "linguistics":
    print(letter, "\t index:", index)
    index += 1

**Example:** Let's say we are given three lists: a list of states (`states`), a list of average temperatures for those states in the same order (`temperatures`), and a list of states that are considered New England (`new_england`).

In [None]:
states = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado",
  "Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho","Illinois",
  "Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland",
  "Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana",
  "Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York",
  "North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania",
  "Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
  "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"]

temperatures = [62.8, 26.6, 60.3, 60.4, 59.4, 45.1, 49, 55.3, 70.7, 63.5,
                70, 44.4, 51.8, 51.7, 47.8, 54.3, 55.6, 66.4, 41, 54.2, 
                47.9, 44.4, 41.2, 63.4, 54.5, 42.7, 48.8, 49.9, 43.8, 52.7, 
                53.4, 45.4, 59, 40.4, 50.7, 59.6, 48.4, 48.8, 50.1, 62.4, 
                45.2, 57.6, 64.8, 48.6, 42.9, 55.1, 48.3, 51.8, 43.1, 42]

new_england = ["Maine", "Vermont", "New Hampshire", "Massachusetts", "Connecticut",
               "Rhode Island"]

The code below prints average temperatures for New England states. The variable `index` stores the index of an item we are currently looking at.

In [None]:
index = 0
for state in states:
    if state in new_england:
        print(state+":", temperatures[index])
    index += 1

**Practice:** A helpful function for the following practice exercise is `sum`, which takes a list as an argument and returns the sum of all numbers in that list. FYI, the functions `min` and `max` are available as well.

In [None]:
numbers = [1, 18.3, 9, 0, 3.14]
print("Sum of those numbers is", sum(numbers))
print("The smallest number is", min(numbers))
print("The largest number is", max(numbers))

Modify the New England code above to print the average temperature in New England. (You can use the `round` function to make the resulting number prettier.)

#### Modifying strings

String indices cannot be reassigned, i.e. the existing parts of a string cannot be modified directly:

In [None]:
string = "hello"
string[-1] = "a"
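The cell above stops with an error. As a minimal sketch, the error can be caught and inspected, and a new string can be built from a slice of the old one instead. The name `new_string` is only illustrative, and `try`/`except` is used here just to keep the example running.

```python
string = "hello"

# Strings are immutable, so assigning to an index raises a TypeError.
try:
    string[-1] = "a"
except TypeError as error:
    print("TypeError:", error)

# Build a new string instead: everything but the last character, plus "a".
new_string = string[:-1] + "a"
print(new_string)  # hella
```

The original `string` is left unchanged; only the new variable holds the modified text.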

If we have a task to "mask" all vowels from a text, we will need to create a new string based on the old one:

In [None]:
vowels = "aoiue"
text = "This is a sentence that should contain no vowels."

masked_text = ""
for char in text:
    if char not in vowels:
        masked_text += char
    else:
        masked_text += "*"
print(masked_text)

**Practice:** You are given a string `alphabet` that contains all English letters, and a string `text`.

In [None]:
alphabet = "abcdefghijklmnopqrstuvwxyz"
text = "A chessboard appeared, but it was triangular, and so big that only the nearest point could be seen."

Write code that makes this string lowercase and removes punctuation from the text.

## Range

A **range** is a numeric iterable defined by three arguments: _start_, _end_, and _step_. These arguments behave exactly as they do in slices: _start_ defines the initial numerical value, _end_ is the first value not included in the range, and _step_ defines the difference between consecutive values.

In [None]:
for value in range(1, 10):
    print(value, end=" ")

In [None]:
for value in range(1, 10, 2):
    print(value, end=" ")
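To make the parallel with slices concrete, here is a small sketch that uses the same three numbers once in a slice and once in a range. The variable names (`digits`, `from_slice`, `from_range`, `countdown`) are only illustrative.

```python
digits = list(range(10))  # the numbers 0 through 9

# The same start, end, and step used in a slice and in a range.
from_slice = digits[1:10:2]
from_range = list(range(1, 10, 2))
print(from_slice)  # [1, 3, 5, 7, 9]
print(from_range)  # [1, 3, 5, 7, 9]

# A negative step counts down; the end value is still excluded.
countdown = list(range(10, 0, -3))
print(countdown)  # [10, 7, 4, 1]
```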

If only one argument is provided, it is considered to be _end_, and the initial value is assumed to be $0$.

In [None]:
for value in range(10):
    print(value, end=" ")

A range object does not display its values when printed, but it can be easily converted to a list using the `list` function.
(If you are curious about the nature of the range object, read [this article](https://treyhunner.com/2018/02/python-range-is-not-an-iterator/), but a safe way is to just call it an iterable, or a range object.)

In [None]:
print("Printing range object:", range(10))
print("Typecasting range to a list:", list(range(10)))

In order to iterate over the indices available in some iterable, we can use the following trick: `range(len(iterable))`.

In [None]:
word = "linguist"
for i in range(len(word)):
    print("index:", i, "\tsymbol:", word[i])

#### N-grams

$n$-grams are sequences of $n$ consecutive symbols.

    word:   banana
    n:      2
    ngrams: ba, an, na, an, na
    
    word:   linguist
    n:      3
    ngrams: lin, ing, ngu, gui, uis, ist

$n$-grams with $n=2$ are called _bigrams_, and those with $n=1$ are called _unigrams_.

For computational linguistics and NLP, **$n$-gram models** are extremely important: symbol-level $n$-gram models define which sequences of characters are (im)possible in a certain language, word-level $n$-gram models tell us which words can be adjacent to each other, and so on.

**Practice:** write code that extracts _bigrams_ from a given word.

## Enumerate and Zip

Two functions that create their own objects and can be very useful are `enumerate` and `zip`.

**`enumerate`** takes a list (or any other iterable) as input and produces _tuples_, where every tuple contains an index and the item of the input list at that index, in that order. Just like `range`, this function creates its own object that can be easily converted into a list.

In [None]:
input_list = ["NY", "CA", "RI", "CO"]
print(list(enumerate(input_list)))
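In practice, `enumerate` is mostly used directly in a for-loop. The sketch below rewrites the earlier counter example this way; the `start` argument shown at the end is an optional parameter of `enumerate`, and the name `numbered` is only illustrative.

```python
word = "linguistics"

# enumerate yields (index, item) pairs that can be unpacked in the loop header,
# so no separate counter variable is needed.
for index, letter in enumerate(word):
    print(letter, "\t index:", index)

# The optional start argument changes the first number of the count.
numbered = list(enumerate(["NY", "CA", "RI"], start=1))
print(numbered)  # [(1, 'NY'), (2, 'CA'), (3, 'RI')]
```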

A **tuple** is another basic data type in Python. While tuples share the majority of their functionality with lists, the main difference is that tuples are immutable: once a tuple is created, its items cannot be reassigned, added, or removed. Tuples can be thought of as "protected lists", but read [here](https://realpython.com/python-lists-tuples/) to learn more.
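A short sketch of the difference between tuples and lists: reading items works the same way, but assigning to an index fails. The names `pair` and `new_pair` are only illustrative.

```python
pair = (0, "NY")

# Items are read by index, exactly as in a list.
print(pair[0], pair[1])  # 0 NY

# Assigning to an index is not allowed for tuples.
try:
    pair[1] = "CA"
except TypeError as error:
    print("TypeError:", error)

# To "change" a tuple, build a new one from the old items.
new_pair = (pair[0], "CA")
print(new_pair)  # (0, 'CA')
```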

**`zip`** takes an arbitrary number of lists (or other iterables) as input and produces tuples, where every tuple is an index-wise combination of items from those lists. Like `enumerate`, it creates its own object that can be converted into a list.

In [None]:
towns = ["Port Jeff", "Stony Brook", "Lake Grove"]
zip_codes = [11777, 11790, 11755]
print(list(zip(towns, zip_codes)))
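Like `enumerate`, `zip` is often used directly in a for-loop, which removes the need for an index when walking through parallel lists. The sketch below also shows what happens when the inputs differ in length; the name `short` is only illustrative.

```python
towns = ["Port Jeff", "Stony Brook", "Lake Grove"]
zip_codes = [11777, 11790, 11755]

# Each tuple produced by zip can be unpacked into two loop variables.
for town, code in zip(towns, zip_codes):
    print(town + ":", code)

# If the inputs differ in length, zip stops at the end of the shortest one.
short = list(zip("abc", [1, 2]))
print(short)  # [('a', 1), ('b', 2)]
```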

## Several useful string methods

There are multiple methods that simplify working with strings and lists. In this section, I illustrate the following ones: `replace`, `split`, `strip`, `join`, `startswith`, and `endswith`.

**`replace`** returns a new string in which every occurrence of one substring is replaced with another.

    string.replace(old_substring, new_substring)

In [None]:
string = "Hi friend. It is very nice to see you, friend!"
string = string.replace("friend", "Alex")
print(string)
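Two details of `replace` are worth seeing in a sketch: the original string is not modified, and an optional third argument limits the number of replacements. The variable names below are only illustrative.

```python
original = "Hi friend. It is very nice to see you, friend!"

# replace returns a new string; the original is left untouched.
changed = original.replace("friend", "Alex")
print(original)
print(changed)

# An optional third argument limits how many occurrences are replaced.
first_only = original.replace("friend", "Alex", 1)
print(first_only)  # Hi Alex. It is very nice to see you, friend!
```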

**Practice:** Using the template provided below, greet everybody whose name is listed in the list `guests`.

In [None]:
template = "Hi, [guest], it is very nice to meet you!"
guests = ["Mary", "Jon", "Aniello"]

# your code

**`split`** takes a string and splits it into a list based on the provided separator. If no argument is provided, `split` splits the string on whitespace.

    string.split(separator)

In [None]:
text = "A chessboard appeared, but it was triangular, and so big that only the nearest point could be seen."
parsed_text = text.split()
print(parsed_text)

In [None]:
names = "Anna and Mary and John and Sebastian"
list_of_names = names.split(" and ")
print(list_of_names)
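The two ways of calling `split` treat repeated separators differently. The sketch below, with made-up example strings `messy` and `row`, illustrates this.

```python
messy = "Anna \t Mary\nJohn"

# Without an argument, any run of whitespace counts as a single separator.
by_whitespace = messy.split()
print(by_whitespace)  # ['Anna', 'Mary', 'John']

row = "Anna,Mary,,John"

# With an explicit separator, every occurrence splits the string,
# so two commas in a row produce an empty string.
by_comma = row.split(",")
print(by_comma)  # ['Anna', 'Mary', '', 'John']
```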

**`strip`** removes invisible symbols from the ends of the string. The invisible symbols that `strip` removes are whitespace characters such as ` `, `\n`, and `\t`. It is an extremely useful method when working with "dirty" user input, or when processing text files.

    string.strip()

In [None]:
string = "\nHello world!   \t"
string = string.strip()
print("-->" + string + "<--")
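`strip` has two one-sided relatives, `lstrip` and `rstrip`, and all three accept an optional string of characters to remove instead of whitespace. A minimal sketch, with illustrative variable names:

```python
line = "\t  Hello world!  \n"

# lstrip cleans only the left end, rstrip only the right end.
left_clean = line.lstrip()
right_clean = line.rstrip()
print("-->" + left_clean + "<--")
print("-->" + right_clean + "<--")

# With an argument, strip removes any of the listed characters from both ends.
token = "...seen!?"
cleaned = token.strip(".!?")
print(cleaned)  # seen
```

Characters in the middle of the string are never touched, which is why the space inside `Hello world!` survives.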

**`startswith`** and **`endswith`** are string methods that return booleans depending on whether the string starts or ends with a certain substring.

    string.startswith(substring)
    string.endswith(substring)

In [None]:
print("'hello' starts with 'hell':", "hello".startswith("hell"))
print("'hello' starts with 'hi':", "hello".startswith("hi"))
print("'hello' starts with 'hello':", "hello".startswith("hello"))

In [None]:
print("'linguistics' ends with 'cs':", "linguistics".endswith("cs"))
print("'linguistics' ends with '':", "linguistics".endswith(""))
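These methods are handy as conditions inside loops. The sketch below collects words with a given ending from a made-up list `words`, and shows that both methods also accept a tuple of alternatives.

```python
words = ["singing", "linguist", "unhappy", "undoing", "ring"]

# Keep only the words that end with "ing".
ing_words = []
for word in words:
    if word.endswith("ing"):
        ing_words.append(word)
print(ing_words)  # ['singing', 'undoing', 'ring']

# A tuple argument checks several alternatives at once.
has_prefix = "unhappy".startswith(("un", "in"))
print(has_prefix)  # True
```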

**`join`** is a string method that takes a list as an argument and, provided that all items within that list are strings, concatenates them using the given string as a separator.

    conjunction_string.join(list_to_concatenate)

In [None]:
names = ['Anna', 'Mary', 'John', 'Sebastian']
print(" and ".join(names))

In [None]:
letters = ['P', 'y', 't', 'h', 'o', 'n']
print("".join(letters))
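Because `join` only works on strings, a list of numbers has to be converted first. The sketch below does this with a plain loop and also shows that `split` and `join` undo each other when the same separator is used; all variable names are illustrative.

```python
zip_codes = [11777, 11790, 11755]

# join raises a TypeError for non-string items, so convert them first.
codes_as_strings = []
for code in zip_codes:
    codes_as_strings.append(str(code))
joined = ", ".join(codes_as_strings)
print(joined)  # 11777, 11790, 11755

# Splitting and joining with the same separator gives back the original string.
sentence = "only the nearest point"
round_trip = " ".join(sentence.split(" "))
print(round_trip == sentence)  # True
```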

# Homework 4

**Due on Tuesday, October 1st, 11.59pm**

Send your notebook (don't forget to save your solutions!) to <alena.aksenova@stonybrook.edu> with the subject **\[CompLing1\] Homework 4**.

**Problem 1. \[3 pts\]** You are given the following list of English vowels.

In [None]:
vowels = ["a", "o", "i", "u", "e"]

Using the idea of a counter, implement code that asks the user for a word and then prints the number of consonants in that word. (For simplicity, we assume that "y" always behaves as a consonant, even though [it is not true](https://www.rd.com/culture/letter-y-vowel-consonant/).)

**Problem 2. \[5 pts\]**
Implement code that asks the user for the value of $n$ and for a word, and extracts $n$-grams from that word for any $n$ provided by a user.

    word:   banana
    n:      2
    ngrams: ba, an, na, an, na
    
    word:   linguist
    n:      3
    ngrams: lin, ing, ngu, gui, uis, ist

**Problem 3. \[7 pts\]** You are given the following text.

In [None]:
text = "It was dark, like the bottom of a well. There was a pattern of skulls and bones around \
the frame, for the sake of appearances; Death could not look himself in the skull in a mirror \
with cherubs and roses around it. The Death of Rats climbed the frame in a scrabble of claws and \
looked at Death expectantly from the top. Quoth fluttered over and pecked briefly at his own \
reflection, on the basis that anything was worth a try. Show me, said Death, show me my thoughts. \
A chessboard appeared, but it was triangular, and so big that only the nearest point could be seen. \
Right on this point was the world - turtle, elephants, the little orbiting sun and all. It was the \
Discworld, which existed only just this side of total improbability and, therefore, in border country. \
In border country the border gets crossed, and sometimes things creep into the universe that have \
rather more on their mind than a better life for their children and a wonderful future in the \
fruit picking and domestic service industries. On every other black or white triangle of the \
chessboard, all the way to infinity, was a small grey shape, rather like an empty hooded robe."

You are also given a string that contains all letters of the English alphabet.

In [None]:
alphabet = "abcdefghijklmnopqrstuvwxyz"

_Part 1._ Create a list `unique_words` with the unique lowercase words from `text`.

You should see the following output (the order can differ!):
    
    ['a', 'infinity', 'reflection', 'with', 'like', 'big', 'briefly', 'into', 'children', 'which', 'fruit', 'picking', 'there', 'try', 'little', 'around', 'appearances', 'appeared', 'all', 'crossed', 'basis', 'improbability', 'their', 'discworld', 'black', 'to', 'death', 'future', 'only', 'my', 'robe', 'things', 'for', 'it', 'existed', 'said', 'sake', 'sometimes', 'right', 'way', 'that', 'country', 'chessboard', 'quoth', 'well', 'domestic', 'skull', 'wonderful', 'hooded', 'or', 'empty', 'bottom', 'mirror', 'himself', 'rather', 'over', 'every', 'triangle', 'roses', 'border', 'orbiting', 'was', 'from', 'show', 'be', 'pecked', 'bones', 'just', 'universe', 'me', 'triangular', 'gets', 'worth', 'have', 'climbed', 'service', 'fluttered', 'top', 'but', 'grey', 'claws', 'at', 'rats', 'creep', 'own', 'pattern', 'point', 'white', 'than', 'dark', 'therefore', 'frame', 'this', 'not', 'the', 'could', 'mind', 'turtle', 'scrabble', 'better', 'industries', 'looked', 'an', 'cherubs', 'life', 'anything', 'more', 'small', 'and', 'of', 'his', 'on', 'skulls', 'elephants', 'in', 'thoughts', 'seen', 'nearest', 'expectantly', 'other', 'side', 'shape', 'total', 'so', 'world', 'look', 'sun']

_Part 2._ Create a list `bigrams` in which you will collect all attested bigrams in `unique_words`. Ignore words that are shorter than $2$ characters. Make sure that the list `bigrams` does not contain duplicates.

In [None]:
bigrams = []

# your code

_Part 3._ Based on the `alphabet`, generate all possible bigrams of English. (Hint: look at the second exercise of the previous homework!)

In [None]:
alphabet = "abcdefghijklmnopqrstuvwxyz"
possible_bigrams = []

# your code

_Part 4._ Collect all unattested bigrams of English in the list `unattested_bigrams`.

In [None]:
unattested_bigrams = []

# your code

Don't be surprised that some bigrams from `unattested_bigrams` are actually present in other English words: the text that we are working with is very small! If you are curious, take a larger text, and run your code on it. :)