# Python Iterators

# What are iterators in Python?

Iterators are everywhere in Python. They work behind the scenes in for loops, comprehensions, and generators, hidden in plain sight.

An iterator in Python is simply an object that can be iterated upon: an object that returns its data one element at a time.

Technically speaking, a Python iterator object must implement two special methods, __iter__() and __next__(), collectively called the iterator protocol.

An object is called iterable if we can get an iterator from it. Most built-in containers in Python, such as list, tuple, and string, are iterables.

The iter() function (which in turn calls the __iter__() method) returns an iterator from them.
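As a small illustration of the difference between an iterable and an iterator, the sketch below checks which special methods each object exposes. The variable names are chosen only for this example.

```python
my_list = [4, 7, 0, 3]

# A list is an iterable: it has __iter__() but no __next__()
print(hasattr(my_list, "__iter__"))   # True
print(hasattr(my_list, "__next__"))   # False

# iter() hands back a separate iterator object that does have __next__()
my_iter = iter(my_list)
print(hasattr(my_iter, "__next__"))   # True

# Calling iter() on an iterator returns the iterator itself
print(iter(my_iter) is my_iter)       # True
```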

# Iterating Through an Iterator in Python

We use the next() function to manually iterate through all the items of an iterator. When we reach the end and there is no more data to return, it raises StopIteration. Here is an example.

In [1]:
# define a list
my_list = [4, 7, 0, 3]

# get an iterator using iter()
my_iter = iter(my_list)

## iterate through it using next() 

#prints 4
print(next(my_iter))

#prints 7
print(next(my_iter))

## next(obj) is same as obj.__next__()

#prints 0
print(my_iter.__next__())

#prints 3
print(my_iter.__next__())

## This will raise error, no items left
next(my_iter)

4
7
0
3


StopIteration: 

A more elegant way to iterate automatically is to use the for loop. With it, we can iterate over any object that can return an iterator, for example a list, string, or file.

In [2]:
for element in my_list:
    print(element)

4
7
0
3


# How Does the for Loop Actually Work?

As we see in the above example, the for loop was able to iterate automatically through the list.

In fact the for loop can iterate over any iterable. Let's take a closer look at how the for loop is actually implemented in Python.

In [None]:
for element in iterable:
    # do something with element

This is actually implemented as:

In [None]:
# create an iterator object from that iterable
iter_obj = iter(iterable)

# infinite loop
while True:
    try:
        # get the next item
        element = next(iter_obj)
        # do something with element
    except StopIteration:
        # if StopIteration is raised, break from loop
        break

So internally, the for loop creates an iterator object, iter_obj, by calling iter() on the iterable.

Ironically, this for loop is actually an infinite while loop.

Inside the loop, it calls next() to get the next element and executes the body of the for loop with this value. Once all the items are exhausted, StopIteration is raised, which is caught internally, and the loop ends. Note that any other kind of exception will pass through.
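The description above can be tried out with a small runnable sketch. The helper name manual_for and its body parameter are hypothetical, introduced only for this illustration. Here the body is called outside the try block, so only the call to next() is guarded.

```python
def manual_for(iterable, body):
    """Hypothetical helper: call body(element) for each element, like a for loop."""
    iter_obj = iter(iterable)
    while True:
        try:
            # get the next item
            element = next(iter_obj)
        except StopIteration:
            # no items left, leave the loop
            break
        body(element)


collected = []
manual_for("abc", collected.append)
print(collected)  # ['a', 'b', 'c']


def fail(element):
    raise ValueError(element)


# Any exception other than StopIteration passes through to the caller
try:
    manual_for([1, 2], fail)
except ValueError as exc:
    print("propagated:", exc)  # propagated: 1
```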

# Building Your Own Iterator in Python

Building an iterator from scratch is easy in Python. We just have to implement the methods __iter__() and __next__().

The __iter__() method returns the iterator object itself. If required, some initialization can be performed.

The __next__() method must return the next item in the sequence. On reaching the end, and in subsequent calls, it must raise StopIteration.

Here, we show an example that gives us the next power of 2 in each iteration. The exponent starts at zero and goes up to a user-set number.

In [3]:
class PowTwo:
    """Class to implement an iterator
    of powers of two"""

    def __init__(self, max=0):
        self.max = max

    def __iter__(self):
        self.n = 0
        return self

    def __next__(self):
        if self.n <= self.max:
            result = 2 ** self.n
            self.n += 1
            return result
        else:
            raise StopIteration

Now we can create an iterator and iterate through it as follows.

In [4]:
a = PowTwo(4)
i = iter(a)
next(i)

1

In [5]:
next(i)

2

In [6]:
next(i)

4

In [7]:
next(i)

8

In [8]:
next(i)

16

In [9]:
next(i)

StopIteration: 
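Because PowTwo follows the iterator protocol, it should also work anywhere Python expects an iterable, such as a for loop or list(). The sketch below repeats the class definition so that it runs on its own.

```python
class PowTwo:
    """Iterator over powers of two, from 2**0 up to 2**max."""

    def __init__(self, max=0):
        self.max = max

    def __iter__(self):
        self.n = 0
        return self

    def __next__(self):
        if self.n <= self.max:
            result = 2 ** self.n
            self.n += 1
            return result
        raise StopIteration


# list() drives the iterator until StopIteration is raised
powers = list(PowTwo(4))
print(powers)  # [1, 2, 4, 8, 16]

# The for loop handles StopIteration for us
for value in PowTwo(2):
    print(value)  # prints 1, then 2, then 4
```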

# Python Infinite Iterators

The items in an iterator object do not have to be exhausted. There can be infinite iterators (which never end). We must be careful when handling such iterators.

Here is a simple example to demonstrate infinite iterators.

The built-in function iter() can be called with two arguments, where the first argument must be a callable object (function) and the second is the sentinel. The iterator calls this function until the returned value is equal to the sentinel.

In [10]:
int()

0

In [11]:
inf = iter(int,1)
next(inf)

0

We can see that the int() function always returns 0. So passing it as iter(int,1) will return an iterator that calls int() until the returned value equals 1. This never happens and we get an infinite iterator.
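The two-argument form of iter() also produces finite iterators when the callable eventually returns the sentinel. As a sketch, assuming an in-memory text stream as the data source, the example below reads fixed-size chunks until read() returns an empty string. The names stream and read_chunk are made up for this illustration.

```python
import io

stream = io.StringIO("abcdefgh")


def read_chunk():
    # returns '' once the stream is exhausted
    return stream.read(3)


# Call read_chunk() repeatedly until it returns the sentinel ''
chunks = list(iter(read_chunk, ""))
print(chunks)  # ['abc', 'def', 'gh']
```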

We can also build our own infinite iterators. The following iterator will, theoretically, return all the odd numbers.

In [12]:
class InfIter:
    """Infinite iterator to return all
        odd numbers"""

    def __iter__(self):
        self.num = 1
        return self

    def __next__(self):
        num = self.num
        self.num += 2
        return num

A sample run would be as follows.

In [13]:
a = iter(InfIter())
print(next(a))
print(next(a))
print(next(a))
print(next(a))

1
3
5
7


And so on...

Be careful to include a terminating condition when iterating over these types of infinite iterators.
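Two common ways to add such a terminating condition are sketched below: limiting the number of items with itertools.islice, or breaking out of the loop explicitly. The class is repeated here so the example runs on its own.

```python
from itertools import islice


class InfIter:
    """Infinite iterator over the odd numbers."""

    def __iter__(self):
        self.num = 1
        return self

    def __next__(self):
        num = self.num
        self.num += 2
        return num


# Option 1: take only the first five items
first_five = list(islice(InfIter(), 5))
print(first_five)  # [1, 3, 5, 7, 9]

# Option 2: break out of the loop on a condition
below_ten = []
for odd in InfIter():
    if odd >= 10:
        break
    below_ten.append(odd)
print(below_ten)  # [1, 3, 5, 7, 9]
```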

The advantage of using iterators is that they save resources. As shown above, we could get all the odd numbers without storing the entire sequence in memory. We can have infinite items (theoretically) in finite memory.

# Python RegEx

A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. For example,

In [None]:
^a...s$

The above code defines a RegEx pattern. The pattern is: any five-character string starting with a and ending with s.

A pattern defined using RegEx can be used to match against a string.

Python has a module named re to work with RegEx. Here's an example:

In [14]:

import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Search successful.


Here, we used the re.match() function to check whether pattern matches test_string. Note that re.match() only looks for a match at the beginning of the string. The function returns a match object if the match succeeds. If not, it returns None.
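To make the behaviour of re.match() more concrete, the sketch below compares it with re.search() and re.fullmatch() using a pattern without the ^ and $ anchors. The sample text is invented for this illustration.

```python
import re

pattern = r"a...s"   # no ^ or $ anchors
text = "the abyss"

# re.match() only tries to match at the start of the string
print(re.match(pattern, text) is None)          # True

# re.search() scans the whole string for the first match
print(re.search(pattern, text).group())         # abyss

# re.fullmatch() requires the entire string to match
print(re.fullmatch(pattern, text) is None)      # True
print(re.fullmatch(pattern, "abyss").group())   # abyss
```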

There are several other functions defined in the re module to work with RegEx.

# Specify Pattern Using RegEx

To specify regular expressions, metacharacters are used. In the above example, ^ and $ are metacharacters.

# MetaCharacters

Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters:

[] . ^ $ * + ? {} () \ |

# [] - Square brackets

Square brackets specify a set of characters you wish to match.

1. [abc] => a => 1 match
2. [abc] => ac => 2 matches
3. [abc] => Hey Jude => No match
4. [abc] => abc de ca => 5 matches

Here, [abc] will match if the string you are trying to match contains any of a, b or c.

You can also specify a range of characters using - inside square brackets.

1. [a-e] is the same as [abcde].
2. [1-4] is the same as [1234].
3. [0-39] is the same as [01239].

You can complement (invert) the character set by placing the caret symbol ^ at the start of the square brackets.

1. [^abc] means any character except a or b or c.
2. [^0-9] means any non-digit character.
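The character-set examples above can be checked with re.findall(), which returns every non-overlapping match. This is only a quick sketch, and the last two sample strings are made up for illustration.

```python
import re

# Each a, b or c in the string is a separate match
abc_matches = re.findall(r"[abc]", "abc de ca")
print(abc_matches)                            # ['a', 'b', 'c', 'c', 'a']

# No a, b or c anywhere, so the list is empty
print(re.findall(r"[abc]", "Hey Jude"))       # []

# [0-39] is the range 0-3 plus the single character 9
range_matches = re.findall(r"[0-39]", "0123456789")
print(range_matches)                          # ['0', '1', '2', '3', '9']

# A leading ^ inverts the set: every non-digit character
non_digits = re.findall(r"[^0-9]", "a1b2")
print(non_digits)                             # ['a', 'b']
```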

# . - Period

A period matches any single character (except newline '\n').

1. .. => a => No match
2. .. => ac => 1 match
3. .. => acd => 1 match
4. .. => acde => 2 matches (contains 4 characters)
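As a quick check of the rows above, the following sketch runs the two-period pattern through re.findall(). The final line, with a newline in the middle of the string, is an added illustration of the newline exception.

```python
import re

print(re.findall(r"..", "a"))      # []
print(re.findall(r"..", "ac"))     # ['ac']
print(re.findall(r"..", "acd"))    # ['ac']
pairs = re.findall(r"..", "acde")
print(pairs)                       # ['ac', 'de']

# By default the period does not match a newline
print(re.findall(r"..", "a\nb"))   # []
```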

# ^ - Caret

The caret symbol ^ is used to check if a string starts with a certain character.

1. ^a => a => 1 match
2. ^a => abc => 1 match
3. ^a => bac => No match
4. ^ab => abc => 1 match
5. ^ab => acb => No match (starts with a but not followed by b)
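The caret examples can be reproduced with re.search(), as in this short sketch. The results dictionary is just a convenient way to collect the outcomes.

```python
import re

# True where the string starts with a
results = {text: bool(re.search(r"^a", text)) for text in ["a", "abc", "bac"]}
print(results)  # {'a': True, 'abc': True, 'bac': False}

# ^ab needs a at the start, immediately followed by b
print(bool(re.search(r"^ab", "abc")))  # True
print(bool(re.search(r"^ab", "acb")))  # False
```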

# * - Star

The star symbol * matches zero or more occurrences of the pattern to its left.

1. ma*n => mn => 1 match
2. ma*n => man => 1 match
3. ma*n => maan => 1 match
4. ma*n => main => No match (a is not followed by n)
5. ma*n => woman => 1 match

# + - Plus

The plus symbol + matches one or more occurrences of the pattern to its left.

1. ma+n => mn => No match (no a character)
2. ma+n => man => 1 match
3. ma+n => maan => 1 match
4. ma+n => main => No match (a is not followed by n)
5. ma+n => woman => 1 match

# ? - Question Mark

The question mark symbol ? matches zero or one occurrence of the pattern to its left.

1. ma?n => mn => 1 match
2. ma?n => man => 1 match
3. ma?n => maaan => No match (more than one a character)
4. ma?n => main => No match (a is not followed by n)
5. ma?n => woman => 1 match
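To compare the three quantifiers side by side, the sketch below tests each pattern against the same set of words with re.search(). The word list and the matched dictionary are assembled only for this illustration.

```python
import re

words = ["mn", "man", "maan", "main", "woman"]

# For each pattern, keep the words in which it finds a match
matched = {
    pattern: [word for word in words if re.search(pattern, word)]
    for pattern in (r"ma*n", r"ma+n", r"ma?n")
}

print(matched[r"ma*n"])  # ['mn', 'man', 'maan', 'woman']
print(matched[r"ma+n"])  # ['man', 'maan', 'woman']
print(matched[r"ma?n"])  # ['mn', 'man', 'woman']
```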

# {} - Braces

Consider this code: {n,m}. This means at least n, and at most m repetitions of the pattern to its left.
    
1. a{2,3} => abc dat => No match
2. a{2,3} => abc daat => 1 match (aa in daat)
3. a{2,3} => aabc daaat => 2 matches (aa in aabc and aaa in daaat)
4. a{2,3} => aabc daaaat => 2 matches (aa in aabc and aaa in daaaat)

Let's try one more example. The RegEx [0-9]{2,4} matches at least 2 digits but not more than 4 digits.

1. [0-9]{2,4} => ab123csde => 1 match (123)
2. [0-9]{2,4} => 12 and 345673 => 3 matches (12, 3456 and 73)
3. [0-9]{2,4} => 1 and 2 => No match
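The brace quantifier is greedy, so each match takes as many repetitions as it can, up to the upper bound. The sketch below runs a few of the strings above through re.findall(). Note how the six-digit run is split into a four-digit match followed by a two-digit one.

```python
import re

# aa in aabc, then the first three a's of daaaat; the leftover single a is too short
a_runs = re.findall(r"a{2,3}", "aabc daaaat")
print(a_runs)                                      # ['aa', 'aaa']

print(re.findall(r"[0-9]{2,4}", "ab123csde"))      # ['123']

# 345673 yields 3456 first, and the remaining 73 still has two digits
digit_runs = re.findall(r"[0-9]{2,4}", "12 and 345673")
print(digit_runs)                                  # ['12', '3456', '73']

# Single digits never reach the lower bound of 2
print(re.findall(r"[0-9]{2,4}", "1 and 2"))        # []
```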

# | - Alternation

The vertical bar | is used for alternation (the or operator).

1. a|b => cde => No match
2. a|b => ade => 1 match (a in ade)
3. a|b => acdbea => 3 matches (a, b and a in acdbea)

Here, a|b matches any string that contains either a or b.

# () - Group

Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains either a or b or c followed by xz.

1. (a|b|c)xz => ab xz => No match
2. (a|b|c)xz => abxz => 1 match (bxz in abxz)
3. (a|b|c)xz => axz cabxz => 2 matches (axz and bxz)
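Alternation and grouping can be tried out as follows. One detail to keep in mind: when a pattern contains a capturing group, re.findall() returns the group contents rather than the whole match, so this sketch uses re.finditer() with group() to list the full matches.

```python
import re

# Plain alternation: every a or b is a match
alternatives = re.findall(r"a|b", "acdbea")
print(alternatives)     # ['a', 'b', 'a']

# Full matches of the grouped pattern
full_matches = [m.group() for m in re.finditer(r"(a|b|c)xz", "axz cabxz")]
print(full_matches)     # ['axz', 'bxz']

# With one capturing group, findall() returns only what the group matched
group_only = re.findall(r"(a|b|c)xz", "axz cabxz")
print(group_only)       # ['a', 'b']

# The space between b and xz prevents a match
print(re.search(r"(a|b|c)xz", "ab xz") is None)  # True
```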

# Python RegEx

Python has a module named re to work with regular expressions. To use it, we need to import the module.

In [15]:
import re

The module defines several functions and constants to work with RegEx.

# re.findall()

The re.findall() function returns a list of strings containing all matches.

Example 1: re.findall()

In [16]:
# Program to extract numbers from a string

import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'

result = re.findall(pattern, string) 
print(result)

# Output: ['12', '89', '34']

['12', '89', '34']


If the pattern is not found, re.findall() returns an empty list.


# re.split()

The re.split() function splits the string wherever there is a match and returns a list of the resulting pieces.

Example 2: re.split()

In [17]:
import re

string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

# Output: ['Twelve:', ' Eighty nine:', '.']

['Twelve:', ' Eighty nine:', '.']


If the pattern is not found, re.split() returns a list containing the original string.

You can pass the maxsplit argument to re.split(). It's the maximum number of splits that will occur.

In [18]:
import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

# maxsplit = 1
# split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)

# Output: ['Twelve:', ' Eighty nine:89 Nine:9.']

['Twelve:', ' Eighty nine:89 Nine:9.']


By the way, the default value of maxsplit is 0, meaning all possible splits.

# re.sub()

The syntax of re.sub() is:

In [None]:
re.sub(pattern, replace, string)

The function returns a string where matched occurrences are replaced with the content of the replace variable.

Example 3: re.sub()

In [19]:
# Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

# Output: abc12de23f456

abc12de23f456


If the pattern is not found, re.sub() returns the original string.

You can pass count as a fourth parameter to re.sub(). If omitted, it defaults to 0, which replaces all occurrences.

In [20]:

import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'
replace = ''

new_string = re.sub(pattern, replace, string, count=1)
print(new_string)

# Output:
# abc12de 23
# f45 6

abc12de 23 
 f45 6


# re.subn()

The re.subn() function is similar to re.sub() except that it returns a tuple of 2 items containing the new string and the number of substitutions made.

Example 4: re.subn()

In [21]:
# Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

# Output: ('abc12de23f456', 4)

('abc12de23f456', 4)


# re.search()

The re.search() function takes two arguments: a pattern and a string. It looks for the first location where the RegEx pattern produces a match with the string.

If the search is successful, re.search() returns a match object; if not, it returns None.

In [None]:
match = re.search(pattern, string)

Example 5: re.search()

In [22]:
import re

string = "Python is fun"

# check if 'Python' is at the beginning
match = re.search(r'\APython', string)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")

# Output: pattern found inside the string

pattern found inside the string


Here, match contains a match object.

# Match object

You can get the methods and attributes of a match object using the dir() function.

Some of the commonly used methods and attributes of match objects are:

# match.group()

The group() method returns the part of the string where there is a match.

Example 6: Match object

In [23]:
import re

string = '39801 356, 2102 1111'

# Three-digit number, followed by a space, followed by a two-digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
  print(match.group())
else:
  print("pattern not found")

# Output: 801 35

801 35


Here, the match variable contains a match object.

Our pattern (\d{3}) (\d{2}) has two subgroups, (\d{3}) and (\d{2}). You can get the part of the string matched by each of these parenthesized subgroups. Here's how:

In [24]:
print(match.group(1))
print(match.group(2))
print(match.group(1,2))
print(match.groups())

801
35
('801', '35')
('801', '35')


# match.start(), match.end() and match.span()

The start() method returns the index of the start of the matched substring. Similarly, end() returns the end index of the matched substring.

In [25]:
print(match.start())
print(match.end())

2
8


The span() method returns a tuple containing the start and end index of the matched part.

In [26]:
match.span()

(2, 8)
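The indices returned by span() can be used directly to slice the original string, since the end index is exclusive. The sketch below also passes a group number to span() to locate each subgroup. The example repeats the search so that it runs on its own.

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

# Slicing with the span gives back the matched text
start, end = match.span()
print(string[start:end])                    # 801 35
print(string[start:end] == match.group())   # True

# span() also accepts a group number
print(match.span(1))                        # (2, 5)
print(match.span(2))                        # (6, 8)
```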

# match.re and match.string

The re attribute of a match object returns the regular expression object that produced it. Similarly, the string attribute returns the string passed to the search.

In [27]:
match.re

re.compile(r'(\d{3}) (\d{2})', re.UNICODE)

In [28]:
match.string

'39801 356, 2102 1111'