# Notebook 4: range, zip, enumerate, and useful string methods

This notebook expands on for-loops. It introduces `range`, which lets us iterate over numbers within a certain range and therefore gives access to index-based iteration over containers. It also shows how to use `zip` and `enumerate`, and discusses several additional string methods such as `split` and `join`.
Finally, the homework will lead you to use what you have learned so far (specifically, for-loops, if statements, and lists) to implement $n$-gram extraction.

## For-loops: reminder

A _for-loop_ iterates over some object (an **iterable**) and considers the sub-elements of that object in order.

In order to print indexes of items in iterables, we can implement a **counter**, i.e. a variable that will increase every time some condition is met. In this case, we will set the counter to $0$ and increase it with every iteration.

**Example:** Let's say we are given three lists: a list of states (`states`), a list of average temperatures for those states in the same order (`temperatures`), and a list of states that are considered New England (`new_england`).

In [None]:
states = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado",
  "Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho","Illinois",
  "Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland",
  "Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana",
  "Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York",
  "North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania",
  "Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
  "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"]

temperatures = [62.8, 26.6, 60.3, 60.4, 59.4, 45.1, 49, 55.3, 70.7, 63.5,
                70, 44.4, 51.8, 51.7, 47.8, 54.3, 55.6, 66.4, 41, 54.2, 
                47.9, 44.4, 41.2, 63.4, 54.5, 42.7, 48.8, 49.9, 43.8, 52.7, 
                53.4, 45.4, 59, 40.4, 50.7, 59.6, 48.4, 48.8, 50.1, 62.4, 
                45.2, 57.6, 64.8, 48.6, 42.9, 55.1, 48.3, 51.8, 43.1, 42]

new_england = ["Maine", "Vermont", "New Hampshire", "Massachusetts", "Connecticut",
               "Rhode Island"]

The code below prints average temperatures for New England states. The variable `index` stores the index of an item we are currently looking at.
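A minimal sketch of such a counter-based loop is given here. It assumes a four-state excerpt of the lists above (same values, same order) so that it runs on its own; with the full `states` and `temperatures` lists the loop itself would stay the same.

```python
# Four-state excerpt of the lists above, so that this sketch is self-contained
states = ["Maine", "Maryland", "Massachusetts", "Michigan"]
temperatures = [41, 54.2, 47.9, 44.4]
new_england = ["Maine", "Vermont", "New Hampshire", "Massachusetts",
               "Connecticut", "Rhode Island"]

index = 0  # counter: position of the current state in `states`
for state in states:
    if state in new_england:
        # the temperature is stored at the same index as the state
        print(state, temperatures[index])
    index += 1  # increase the counter on every iteration
```

For this excerpt the loop prints one line for Maine and one for Massachusetts, since the other two states are not in `new_england`.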

**Practice:** A helpful function for the following practice exercise is `sum`, which takes a list as an argument and returns the sum of all numbers in that list. FYI, the functions `min` and `max` are available as well.

In [None]:
numbers = [1, 18.3, 9, 0, 3.14]
print("Sum of those numbers is", sum(numbers))
print("The smallest number is", min(numbers))
print("The largest number is", max(numbers))

Modify the code above to print the average temperature in New England. (You can use the `round` function to make the resulting number prettier.)

### Modifying strings

Characters at string indexes cannot be reassigned, i.e. the existing parts of a string cannot be modified directly.
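As a small illustration (the variable names `word` and `new_word` are only for this sketch), assigning to a string index raises a `TypeError`, while building a new string from pieces of the old one works:

```python
word = "cat"

try:
    word[0] = "b"  # strings do not support item assignment
except TypeError as error:
    print("Cannot modify a string in place:", error)

# Instead, build a new string out of the parts we want to keep
new_word = "b" + word[1:]
print(new_word)  # bat
print(word)      # cat (the original string is unchanged)
```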

If we have a task to "mask" all vowels from a text, we will need to create a new string based on the old one.

**Practice** Can you think of how to do it?

In [None]:
vowels = "aoiue"
text = "This is a sentence that should contain no vowels."

# try it here by yourself!

**Practice:** You are given a string `alphabet` that contains all English letters, and a string `text`.

In [None]:
alphabet = "abcdefghijklmnopqrstuvwxyz"
text = "A chessboard appeared, but it was triangular, and so big that only the nearest point could be seen."

Write code that makes this string lowercase and removes all punctuation from the text.

## Range

Say that you want to print the word "hello" ten times. How would you do it? The most trivial answer is "I'll write _print("Hello")_ ten times". But how would you do it with a for-loop? Can you think of a way to make the loop iterate exactly $10$ times?

In [None]:
# Try it here!


**Range** is a numeric iterable defined by three arguments: _start_, _end_, and _step_. These arguments behave exactly as they do in slices: _start_ defines the initial numerical value, _end_ is the first value not included in the range, and _step_ defines the difference between each value and the one that follows it.

If only one argument is provided, it is considered to be _end_, and the initial value is assumed to be $0$.

A range object does not show its values when printed, but it can easily be converted to a list using the `list` function.
(If you are curious about the nature of the range object, read [this article](https://treyhunner.com/2018/02/python-range-is-not-an-iterator/), but a safe way is to just call it an iterable, or a range object).

In [None]:
print("Printing range object:", range(10))
print("Typecasting range to a list:", list(range(10)))
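To see how _start_, _end_, and _step_ interact, here is a short sketch that converts a few ranges to lists; the variable names are chosen only for this illustration.

```python
only_end = list(range(5))              # start defaults to 0
start_and_end = list(range(2, 8))      # 8 itself is not included
with_step = list(range(0, 10, 3))      # every third number
counting_down = list(range(5, 0, -1))  # a negative step counts backwards

print(only_end)       # [0, 1, 2, 3, 4]
print(start_and_end)  # [2, 3, 4, 5, 6, 7]
print(with_step)      # [0, 3, 6, 9]
print(counting_down)  # [5, 4, 3, 2, 1]
```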

To loop over all the indexes available in some iterable, we can use the following trick: `range(len(iterable))`.
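A minimal sketch of this trick, assuming a hypothetical list `colors` that is not part of the notebook's data:

```python
colors = ["red", "green", "blue"]  # hypothetical example list

indexes = list(range(len(colors)))
print(indexes)  # [0, 1, 2]

# i takes the values 0, 1, 2, so colors[i] visits every item in order
for i in range(len(colors)):
    print(i, colors[i])
```

This does the same job as the manual counter from the beginning of the notebook, but without having to create and increase the counter ourselves.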

**Practice** OK, now you know all you need to know about for-loops! Can you write code that asks the user for $10$ favorite foods, one at a time? Add those foods to a list, and once the user is done, print them back!

In [None]:
# Try it here!

## N-grams

$n$-gram models are a fundamental concept in computational linguistics!
Intuitively, $n$-grams are sequences of $n$ consecutive symbols.

    word:   banana
    n:      2
    ngrams: ba, an, na
    
    word:   linguist
    n:      3
    ngrams: lin, ing, ngu, gui, uis, ist

The special case of $n$-grams where $n=2$ is called _bigrams_. If $n=1$, they are called _unigrams_.

For computational linguistics and NLP, **$n$-gram models** are extremely important: symbol-level $n$-gram models define which sequences of characters are (im)possible in a certain language, word-level $n$-gram models tell us which words can be adjacent to each other, and so on.

**Practice:** Write code that extracts _bigrams_ from a given word.

In [6]:
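word_input = input("word: ")
# pair every character with its right neighbor; set() drops repeated bigrams
# (note that a set does not preserve the order of the bigrams)
bigrams = list(set([x+y for x, y in zip(word_input[:-1], word_input[1:])]))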
print(bigrams)

['ba', 'na', 'an']


In [8]:
word = "banana"
n = 3
ngrams = []
# the last complete n-gram starts at index len(word) - n
for i in range(len(word) - n + 1):
    ngram = word[i:i+n]
    # keep every n-gram only once
    if ngram not in ngrams:
        ngrams.append(ngram)
print(ngrams)


['ban', 'ana', 'nan']


## Enumerate and Zip

Two built-in functions that are often useful in for-loops are `enumerate` and `zip`.

**`enumerate`** takes a list (or another iterable) as input and produces _tuples_, where every tuple contains an index and the item of the input list found at that index. Just like `range`, this function creates its own object that can easily be converted into a list.

In [None]:
input_list = ["NY", "CA", "RI", "CO"]
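
A short sketch of `enumerate` applied to this list; the names `pairs`, `index`, and `state` are chosen only for this illustration.

```python
input_list = ["NY", "CA", "RI", "CO"]

# Printing the enumerate object itself does not show its contents,
# so we convert it to a list first
pairs = list(enumerate(input_list))
print(pairs)  # [(0, 'NY'), (1, 'CA'), (2, 'RI'), (3, 'CO')]

# In a for-loop, every tuple can be unpacked into two variables
for index, state in enumerate(input_list):
    print(index, state)
```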


In [None]:
z = (0,1,2,3,4,4)


A **tuple** is another basic data type in Python. Tuples share the majority of their functionality with lists; the main difference is that a tuple cannot be modified once it has been created (tuples are _immutable_). Tuples can be thought of as "protected lists", but read [here](https://realpython.com/python-lists-tuples/) to learn more.
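The following sketch uses the tuple `z` from the cell above to show what tuples share with lists (indexing, `len`, `count`) and where they differ (no item assignment):

```python
z = (0, 1, 2, 3, 4, 4)

# Reading from a tuple works just like reading from a list
first_item = z[0]
how_many_fours = z.count(4)
print(first_item, len(z), how_many_fours)  # 0 6 2

# Changing an item is not allowed
try:
    z[0] = 10
except TypeError as error:
    print("Cannot modify a tuple:", error)
```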

**`zip`** takes an arbitrary number of lists (or other iterables) as input and produces tuples, where every tuple is an index-wise combination of items from those lists (i.e. `(list1[0], list2[0])`, `(list1[1], list2[1])`, ...). Like `enumerate`, it creates its own object that can easily be converted into a list.
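As a sketch, here is `zip` applied to the first three entries of the `states` and `temperatures` lists from the beginning of the notebook (repeated here so the example runs on its own):

```python
states = ["Alabama", "Alaska", "Arizona"]
temperatures = [62.8, 26.6, 60.3]

pairs = list(zip(states, temperatures))
print(pairs)  # [('Alabama', 62.8), ('Alaska', 26.6), ('Arizona', 60.3)]

# zip lets us walk through both lists at once, without a counter
for state, temperature in zip(states, temperatures):
    print(state, "has an average temperature of", temperature)
```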

In [None]:
towns = ["Port Jeff", "Stony Brook", "Lake Grove"]


## Several useful string methods

There are multiple methods that simplify working with strings and lists. This section demonstrates the following ones: `replace`, `split`, `strip`, `join`, `startswith`, and `endswith`.

**`replace`** returns a string in which some replacement was performed.

    string.replace(old_substring, new_substring)

In [None]:
string = "Hi friend. It is very nice to see you, friend!"
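
A minimal sketch of `replace` on this string; the replacement word `"neighbor"` is an arbitrary choice for illustration.

```python
string = "Hi friend. It is very nice to see you, friend!"

# replace returns a NEW string in which every occurrence is substituted
new_string = string.replace("friend", "neighbor")
print(new_string)  # Hi neighbor. It is very nice to see you, neighbor!

# the original string is left untouched
print(string)      # Hi friend. It is very nice to see you, friend!
```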


**Practice:** Using the template provided below, greet everybody whose name is listed in the list `guests`.

In [None]:
template = "Hi, [guest], it is very nice to meet you!"
guests = ["Pearl", "Garnet", "Peridot"]

# your code

**`split`** takes a string and splits it into a list of substrings, using the provided argument as the separator. If no argument is provided, `split` splits the string on whitespace.

    string.split(separator)
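
The sketch below contrasts `split` with and without an argument, using two hypothetical strings (`sentence` contains a double space on purpose):

```python
sentence = "the quick  brown fox"  # note the two spaces after "quick"
date = "2024-01-15"

# Without an argument, any run of whitespace counts as one separator
words = sentence.split()
print(words)            # ['the', 'quick', 'brown', 'fox']

# With " " as the separator, the double space yields an empty string
words_by_space = sentence.split(" ")
print(words_by_space)   # ['the', 'quick', '', 'brown', 'fox']

# Any other string can serve as a separator as well
date_parts = date.split("-")
print(date_parts)       # ['2024', '01', '15']
```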

In [None]:
text = "A chessboard appeared, but it was triangular, and so big that only the nearest point could be seen."
# your code

In [None]:
text = "Achessboardappeared"
#code

In [None]:
names = "Anna and Mary and John and Sebastian"
#code

In [None]:
names = "Anna, and , Mary and John and Sebastian"
#code

**`strip`** removes invisible symbols (whitespace) from both ends of the string. By default, the invisible symbols that `strip` removes include the space ` `, the newline `\n`, and the tab `\t`. It is an extremely useful method when working with "dirty" user input or when processing text files.

    string.strip()

In [None]:
string = "\nHello world!   \t"
string = string.strip()
print("-->" + string + "<--")

**`startswith`** and **`endswith`** are string methods that return booleans depending on the string starting or ending with a certain substring.

    string.startswith(substring)
    string.endswith(substring)

In [None]:
print("'hello' starts with 'hell':", "hello".startswith("hell"))
print("'hello' starts with 'hi':", "hello".startswith("hi"))
print("'hello' starts with 'hello':", "hello".startswith("hello"))

In [None]:
print("'linguistics' ends with 'cs':", "linguistics".endswith("cs"))
print("'linguistics' ends with '':", "linguistics".endswith(""))
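
These methods combine naturally with for-loops and if statements. As a sketch, the code below filters an excerpt of the `states` list from the beginning of the notebook (repeated here so it runs on its own); `new_states` is a name introduced only for this example.

```python
states = ["Nevada", "New Hampshire", "New Jersey", "New Mexico",
          "New York", "North Carolina"]

new_states = []
for state in states:
    # the trailing space makes sure we match the separate word "New"
    if state.startswith("New "):
        new_states.append(state)

print(new_states)  # ['New Hampshire', 'New Jersey', 'New Mexico', 'New York']
```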

**`join`** is a string method that takes a list as argument, and, if all items within that list are strings, it concatenates them using the given string.

    conjunction_string.join(list_to_concatenate)

In [None]:
names = ['Anna', 'Mary', 'John', 'Sebastian']
print(" and ".join(names))

In [None]:
letters = ['P', 'y', 't', 'h', 'o', 'n']
print("".join(letters))
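
The condition that all items must be strings matters in practice. The sketch below, which assumes a short list of numbers like the one used earlier in this notebook, shows that `join` raises a `TypeError` for non-string items, and one possible way around it:

```python
numbers = [1, 18.3, 9]

try:
    ", ".join(numbers)  # fails: the items are numbers, not strings
except TypeError as error:
    print("join needs strings:", error)

# Convert every item to a string first, then join
as_strings = []
for number in numbers:
    as_strings.append(str(number))

joined = ", ".join(as_strings)
print(joined)  # 1, 18.3, 9
```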