## Iterators

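An iterable is an object that has an associated `__iter__()` method; passing it to the built-in `iter()` function returns an iterator.

* iterator <br>
  is defined as an object that has an associated `__next__()` method, which the built-in `next()` function calls to produce the consecutive values.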

In [3]:
word = 'Da'
it = iter(word)
next(it)

'D'

In [4]:
next(it)

'a'

As we see, calling `next()` repeatedly on an iterator over the word 'Da' returns 'D' and then 'a'. <br>
We can also print all values of an iterator in one fell swoop using the star operator, referred to as the __splat__ operator in some circles.

In [6]:
word = 'Data'
it = iter(word)
print(*it)

D a t a


The star operator unpacks all elements of an iterator or an iterable. <br>
__Note__: Once an iterator is exhausted, it cannot be reused, as there are no more values to iterate over.
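
To make the note about exhaustion concrete, here is a minimal sketch of what happens when `next()` is called on an iterator that has run out of values. The variable names are only illustrative.

```python
word = 'Da'
it = iter(word)

# Consume both characters.
first = next(it)
second = next(it)

# A third call has nothing left to return, so Python raises StopIteration.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

# next() accepts an optional default that is returned instead of raising.
fallback = next(it, None)

# The iterable itself is not used up: iter() hands out a fresh iterator.
fresh = list(iter(word))

print(first, second, exhausted, fallback, fresh)
```

A `for` loop relies on the same mechanism: it calls `iter()` once, then `next()` repeatedly, and stops quietly when `StopIteration` is raised.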

In [8]:
"""On dictionaries, we have:"""

pythonistas = {'eren': 'sen', 'george': 'michael'}
for key, value in pythonistas.items():
    print(key, value)

eren sen
george michael


In [1]:
"""w.r.t file connections we use:"""

file = open('./datasets/text.txt')
it = iter(file)
print(next(it))

This is the first line.



In [2]:
print(next(it))

This is the second line.
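
The cells above leave the file open. A common alternative, sketched here, is to iterate inside a `with` block so the file is closed automatically. The temporary file is a stand-in created only to keep the example self-contained; it is not the `./datasets/text.txt` file used above.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('This is the first line.\nThis is the second line.\n')
    path = tmp.name

# The file object is its own iterator; each step yields one line,
# including the trailing newline character.
with open(path) as file:
    first = next(file)
    remaining = [line.rstrip('\n') for line in file]

os.remove(path)
print(repr(first), remaining)
```

Because each line keeps its trailing newline and `print()` adds another, printing a line directly leaves a blank line after it, as in the outputs above.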


### Exercise

In [16]:
# Create a range object: values
values = range(10,21)

# Print the range object
print(values)

# Create a list of integers: values_list
values_list = list(values)

# Print values_list
print(values_list)

# Get the sum of values: values_sum
values_sum = sum(values)

# Print values_sum
print(values_sum)


range(10, 21)
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
165
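
The exercise uses `values` twice, once in `list()` and once in `sum()`, and both calls see all the numbers. That works because a range is an iterable rather than an iterator. The following sketch contrasts the two.

```python
values = range(10, 21)

# A range is an iterable, not an iterator: next() cannot be called on it directly.
try:
    next(values)
    is_iterator = True
except TypeError:
    is_iterator = False

# Each call to iter() returns a new, independent iterator over the same range.
it1 = iter(values)
it2 = iter(values)
a = next(it1)   # first value from it1
b = next(it1)   # second value from it1
c = next(it2)   # it2 starts again from the beginning

# An iterator, by contrast, is consumed by the first full pass.
it = iter(values)
first_pass = sum(it)
second_pass = sum(it)

print(is_iterator, a, b, c, first_pass, second_pass)
```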


## Enumerate and zip

### Enumerate

`enumerate` is a built-in function that takes any iterable argument, such as a list, and returns a special enumerate object. This enumerate object consists of pairs of the form (index, value), where index is the item's index in the original iterable, and value is the item itself. <br>
The enumerate object is itself an iterable (more precisely, an iterator), and it can be looped over while unpacking its elements using the clause `for index, value in enumerate(iterable)`.

In [None]:
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
e = enumerate(avengers)
print(type(e))

# Convert e to a list: e_list
e_list = list(e)
print(e_list)

<class 'enumerate'>
[(0, 'hawkeye'), (1, 'iron man'), (2, 'thor'), (3, 'quicksilver')]


In [20]:
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
for index, value in enumerate(avengers):
    print(index, value)

0 hawkeye
1 iron man
2 thor
3 quicksilver


By default, `enumerate` starts counting from 0. However, you can specify a different start value by passing it as an argument to `enumerate`:

In [21]:
for index, value in enumerate(avengers, start=10):
    print(index, value)

10 hawkeye
11 iron man
12 thor
13 quicksilver


### Zip

Zip (a built-in function) accepts an arbitrary number of iterables and returns an iterator of tuples.

In [33]:
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson', 'maximoff']

# Create a zip object from avengers and names: z
z = zip(avengers, names)
print(type(z))

<class 'zip'>


We can turn this zip object into a list and print the list:

In [27]:
z_list = list(z)
print(z_list)

[('hawkeye', 'barton'), ('iron man', 'stark'), ('thor', 'odinson'), ('quicksilver', 'maximoff')]


The first element is a tuple containing the first elements of each list that was zipped. The second element is a tuple containing the second elements of each list, and so on.

Alternatively, we could use a for loop to iterate over the zip object and print the tuples:

In [30]:
for z1, z2 in zip(avengers, names):
    print(z1, z2)

hawkeye barton
iron man stark
thor odinson
quicksilver maximoff


We could also have used the _splat_ operator to print all the elements! 

In [32]:
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson', 'maximoff']
z = zip(avengers, names)
print(*z)

('hawkeye', 'barton') ('iron man', 'stark') ('thor', 'odinson') ('quicksilver', 'maximoff')
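
All the examples above zip lists of equal length. As a short sketch of what happens otherwise, the `names` list below is deliberately one item short, and `'unknown'` is an arbitrary placeholder chosen for illustration.

```python
from itertools import zip_longest

avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson']  # one name short, on purpose

# zip() stops as soon as the shortest input runs out.
short = list(zip(avengers, names))

# zip_longest() keeps going and pads the missing values with fillvalue.
padded = list(zip_longest(avengers, names, fillvalue='unknown'))

print(len(short), padded[-1])
```

With plain `zip()`, the unmatched `'quicksilver'` is dropped silently, so it is worth checking lengths when every item matters.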


### Exercise

In [38]:
mutants = ['charles xavier',
           'bobby drake',
           'kurt wagner',
           'max eisenhardt',
           'kitty pryde']
aliases = ['prof x',
           'iceman',
           'nightcrawler',
           'magneto',
           'shadowcat']
powers = ['telepathy',
           'thermokinesis',
           'teleportation',
           'magnetokinesis',
           'intangibility']

# Create a list of tuples: mutant_data
z = zip(mutants, aliases, powers)
mutant_data = list(z)

# Print the list of tuples
print(mutant_data, '\n')

# Create a zip object using the three lists: mutant_zip
mutant_zip = zip(mutants, aliases, powers)

# Print the zip object
print(mutant_zip, '\n')

# Unpack the zip object and print the tuple values
for value1, value2, value3 in mutant_zip:
    print(value1, value2, value3)

[('charles xavier', 'prof x', 'telepathy'), ('bobby drake', 'iceman', 'thermokinesis'), ('kurt wagner', 'nightcrawler', 'teleportation'), ('max eisenhardt', 'magneto', 'magnetokinesis'), ('kitty pryde', 'shadowcat', 'intangibility')] 

<zip object at 0x103c239c0> 

charles xavier prof x telepathy
bobby drake iceman thermokinesis
kurt wagner nightcrawler teleportation
max eisenhardt magneto magnetokinesis
kitty pryde shadowcat intangibility


In [41]:
# Create a zip object from mutants and powers: z1
z1 = zip(mutants, powers)

# Print the tuples in z1 by unpacking with *
print(*z1, '\n')

"Because the previous print() call would have exhausted the elements in z1, recreate the zip object you defined earlier and assign the result again to z1."
# Re-create a zip object from mutants and powers: z1
z1 = zip(mutants, powers)

# 'Unzip' the tuples in z1 by unpacking with * and zip(): result1, result2
result1, result2 = zip(*z1)

# Compare the unpacked tuples with the original lists; zip yields tuples,
# and a tuple never equals a list, so both comparisons print False
print(result1 == mutants)
print(result2 == powers)

('charles xavier', 'telepathy') ('bobby drake', 'thermokinesis') ('kurt wagner', 'teleportation') ('max eisenhardt', 'magnetokinesis') ('kitty pryde', 'intangibility') 

False
False
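
Both comparisons print `False` even though the unzipping worked: `zip()` always produces tuples, while `mutants` and `powers` are lists. A minimal sketch with shortened lists shows how to make the check meaningful.

```python
mutants = ['charles xavier', 'bobby drake', 'kurt wagner']
powers = ['telepathy', 'thermokinesis', 'teleportation']

# zip(*pairs) regroups the pairs by position, which undoes the original zip.
result1, result2 = zip(*zip(mutants, powers))

# A tuple never compares equal to a list, even with identical contents.
print(type(result1).__name__, result1 == mutants)

# Converting either side to the other type makes the comparison meaningful.
print(result1 == tuple(mutants), list(result2) == powers)
```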


## Using iterators to load large files into memory

Let's say that we are pulling data from a file, database or API and there is so much of it that we can't hold it all in memory.
* One solution is to load the data in chunks, perform the desired operation or operations on each chunk, store the result, discard the chunk, then load the next chunk, and so on. This is where iterators come in handy.
* The `pandas` function `read_csv()` provides a nice option whereby we can load data in chunks and iterate over them.
  * All we need to do is specify the chunk size using the argument `chunksize`.

In [None]:
import pandas as pd

result = []

for chunk in pd.read_csv('data.csv', chunksize=1000):
    result.append(sum(chunk['x']))
total = sum(result)
print(total)

"""This will give an error because the file data.csv does not exist. But the code is correct."""

Within the for loop, that is, on each iteration (meaning on each chunk), we compute the sum of the column of interest and append it to the list `result`. Once the loop has finished, we take the sum of the list `result`, which gives us the total sum of the column of interest.

Another approach to store the result is:

In [None]:
total = 0
for chunk in pd.read_csv('data.csv', chunksize=1000):
    total += sum(chunk['x'])
print(total)

This initializes total to zero before iterating over the file and adds each sum during the iteration procedure.
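
Since `data.csv` does not exist here, the following sketch reproduces the same load, process, discard pattern using only the standard library. It is not how `pandas` implements `chunksize`; the helper `read_in_chunks()` and the in-memory data are hypothetical stand-ins.

```python
import csv
import io
from itertools import islice

def read_in_chunks(rows, chunksize):
    """Yield lists of at most `chunksize` items from the iterable `rows`."""
    rows = iter(rows)
    while True:
        # islice pulls the next `chunksize` items without reading any further.
        chunk = list(islice(rows, chunksize))
        if not chunk:
            return
        yield chunk

# Hypothetical stand-in for data.csv: a single column x holding 1..10.
data = io.StringIO('x\n' + '\n'.join(str(i) for i in range(1, 11)) + '\n')

total = 0
chunk_sizes = []
for chunk in read_in_chunks(csv.DictReader(data), chunksize=4):
    chunk_sizes.append(len(chunk))
    total += sum(int(row['x']) for row in chunk)

print(chunk_sizes, total)
```

Only one chunk is held in memory at a time, which is the point of the approach: the running total is the only state that survives between iterations.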

In [5]:
# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Iterate over the file chunk by chunk
for chunk in pd.read_csv('./datasets/tweets.csv', chunksize=10):

    # Iterate over the column in DataFrame
    for entry in chunk['lang']:
        if entry in counts_dict:
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1

# Print the populated dictionary
print(counts_dict)

{'en': 97, 'et': 1, 'und': 2}


The above code iterates over the file chunk by chunk and, within each chunk, iterates over the column of interest, in this case `'lang'`, updating the count in the dictionary. If the key is already in the dictionary, it adds 1 to the current value; otherwise it initializes the key with the value 1.

The same logic can be wrapped in a function that gives the same output:

In [6]:
# Define count_entries()
def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize=c_size):

        # Iterate over the column in DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict:
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    # Return counts_dict
    return counts_dict

# Call count_entries(): result_counts
result_counts = count_entries('./datasets/tweets.csv', c_size=10, colname='lang')

# Print result_counts
print(result_counts)


{'en': 97, 'et': 1, 'und': 2}


What the above code does is:
* First we define a function called `count_entries()` that takes three arguments: `csv_file`, `c_size`, and `colname`. 
* Then we initialize an empty dictionary called `counts_dict`. 
* Next we iterate over the file chunk by chunk using `pd.read_csv()` function. 
* Then we iterate over the column in DataFrame using a for loop. 
* Then we check if the entry is in the keys of `counts_dict`. If it is, we add 1 to the value of that key. If it is not, we add the entry as a key and 1 as the value.
* We then return `counts_dict`.
* After that we call `count_entries()`, passing the path of the file, the size of the chunk, and the name of the column as arguments, and print the result. The output is the same as before, but wrapping the logic in a function is good practice because it makes the code more readable and reusable.
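
The if/else counting step described above can also be expressed with `collections.Counter` from the standard library. This is a sketch of an alternative, not the method used in the cells above; the nested lists are hypothetical stand-ins for the `'lang'` column of each chunk.

```python
from collections import Counter

# Hypothetical chunks standing in for the 'lang' column of each DataFrame chunk.
chunks = [['en', 'en', 'et'], ['en', 'und'], ['und', 'en']]

counts = Counter()
for chunk in chunks:
    # update() adds one count per element, creating missing keys as needed.
    counts.update(chunk)

print(dict(counts))
```

Inside the chunked `read_csv()` loop, the equivalent call would presumably be `counts.update(chunk[colname])`, since iterating over a DataFrame column yields its values one by one.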