# Iterators part 2: operators
In the previous lecture, we looked at a brief introduction to iterators in Python. Iterators will serve as the basic building block for the systems that we will consider. This lecture will look at "functions" of iterators, i.e., procedures that produce and consume iterators.

## FileScan Iterator 
Let us revisit the `FileScan` iterator from the previous lecture. This iterator loads lines from a file one by one:

In [None]:
class FileScan:
    """Loads a large file into the
    program line-by-line"""
    
    def __init__(self, filename):
        self.filename = filename
        
    def __iter__(self):
        self.file = open(self.filename, 'r')
        self.line = self.file.readline()
        return self
    
    def __next__(self):
        if self.line != "":
            result = int(self.line)
            self.line = self.file.readline()
            return result
        else:
            self.file.close()
            raise StopIteration

We can use this iterator in code to process the lines in a specified file of numbers.

In [None]:
import itertools
file = FileScan('my_file')
print(list(itertools.islice(file, 5)))
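
The cell above assumes that a file named `my_file` with one integer per line already exists. As a minimal sketch for trying the idea without that file, the code below writes a small temporary file and reads the first five values back lazily. The helper names `write_sample_file` and `scan_ints` are hypothetical, and `scan_ints` is a generator stand-in for `FileScan`, not the class itself.

```python
import itertools
import os
import tempfile


def write_sample_file(values):
    """Write one integer per line to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt", text=True)
    with os.fdopen(fd, "w") as f:
        for v in values:
            f.write(f"{v}\n")
    return path


def scan_ints(filename):
    """Generator stand-in for FileScan: yield one int per line, lazily."""
    with open(filename, "r") as f:
        for line in f:
            yield int(line)


sample_path = write_sample_file(range(1, 11))  # lines "1" .. "10"

scanner = scan_ints(sample_path)
first_five = list(itertools.islice(scanner, 5))  # reads only five lines
scanner.close()  # closes the file even though we stopped early

print(first_five)
os.remove(sample_path)
```

Closing the generator explicitly matters here because `islice` stops before the end of the file, so the end-of-file cleanup path is never reached on its own.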

Suppose we wanted to transform every element in this file, e.g., normalize each value by dividing it by 100. We could write code as follows:

In [None]:
for i in FileScan('my_file'):
    print(i/100.0)

In a sense, the transformation `i/100.0` defines another iterator. We can make this explicit with a new iterator class. This iterator class will take *another iterator* as an argument in its constructor, and return each next value transformed. 

In [None]:
class Normalize:
    """Divides each of an iterator of numbers by
       100"""
    
    def __init__(self, iter_in):
        self.iter_in = iter_in
        
    def __iter__(self):
        self.input_state = iter(self.iter_in) 
        #we have to explicitly preserve the input
        #state. 
        return self

    def __next__(self):
        #poll the next value of the input and divide it by 100.
        return next(self.input_state)/100.0

Let us see how we can use `Normalize` to simplify our code. Now, we can simply compose the two iterator classes and we get the right behavior. 

In [None]:
for i in Normalize(FileScan('my_file')):
    print(i)
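
For comparison, the same element-by-element transformation can be sketched with a generator function or with the built-in `map`, both of which are also lazy. The function name `normalize` and its `divisor` parameter are assumptions made for this illustration, and a plain list stands in for the file.

```python
def normalize(iter_in, divisor=100.0):
    """Generator form of Normalize: yield each input value divided by divisor."""
    for value in iter_in:
        yield value / divisor


raw = [5, 250, 1000]  # stand-in for the numbers read from a file

normalized = list(normalize(raw))
mapped = list(map(lambda v: v / 100.0, raw))  # built-in equivalent

print(normalized)
print(mapped == normalized)
```

The explicit class makes the saved input state visible, which is why the lecture uses it; the generator hides that bookkeeping behind `yield`.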

Now, let's consider an example where we want to change the number of elements. Consider a `Filter` iterator that removes all values of the input iterator less than a threshold.

In [None]:
class Filter:
    """Skips elements that are less than
       a given threshold"""
    
    def __init__(self, iter_in, thresh):
        self.iter_in = iter_in
        self.thresh = thresh
        
    def __iter__(self):
        self.input_state = iter(self.iter_in) 
        #we have to explicitly preserve the input
        #state. 
        return self

    def __next__(self):
        #skip elements less than the threshold
        elem = next(self.input_state)
        
        if elem < self.thresh:
            return self.__next__() #Recursive, whoa!
        
        return elem
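
One caveat with the recursive `__next__`: every skipped element adds a stack frame, so a long run of values below the threshold can exceed Python's recursion limit (1000 by default in CPython). A minimal sketch of a loop-based alternative follows; the class name `LoopFilter` is hypothetical, and the intended behavior is the same as `Filter`.

```python
class LoopFilter:
    """Hypothetical variant of Filter that skips small elements with a loop."""

    def __init__(self, iter_in, thresh):
        self.iter_in = iter_in
        self.thresh = thresh

    def __iter__(self):
        self.input_state = iter(self.iter_in)
        return self

    def __next__(self):
        # Keep pulling until an element passes the threshold. When the input
        # is exhausted, its StopIteration propagates and ends this iterator.
        while True:
            elem = next(self.input_state)
            if elem >= self.thresh:
                return elem


# 5000 rejected elements in a row, followed by three more values
data = [0] * 5000 + [7, 3, 9]
kept = list(LoopFilter(data, 5))
print(kept)
```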

We can compose all of the iterator classes together and get a transformed and filtered iterator over the data.

In [None]:
for i in Filter(Normalize(FileScan('my_file')),0.06):
    print(i)

There are several interesting aspects of this programming model. Notice that constructing the expression `Filter(Normalize(FileScan('my_file')),0.06)` returns nearly instantly. Until you explicitly ask that expression for its next element, *it will not evaluate anything*. In programming language theory, this is called lazy evaluation: an evaluation strategy that delays the evaluation of an expression until its value is needed (a form of non-strict evaluation). Lazy evaluation is indispensable when data arrive with delays or with unpredictable timing. Let's consider a variant of the `FileScan` iterator that is "broken", meaning that it has delays in retrieving data. We add an artificial 1-second sleep for each line fetched:

In [None]:
import time

class BrokenFileScan:
    """Loads a large file into the
    program line-by-line, sleeping 1 second per line"""
    
    def __init__(self, filename):
        self.filename = filename
        
    def __iter__(self):
        self.file = open(self.filename, 'r')
        self.line = self.file.readline()
        return self
    
    def __next__(self):
        if self.line != "":
            result = int(self.line)
            self.line = self.file.readline()
            
            time.sleep( 1 ) #sleep for 1 sec
            
            return result
        else:
            self.file.close()
            raise StopIteration

In [None]:
for i in Filter(Normalize(BrokenFileScan('my_file')),0.06):
    print(i)

An iterator model allows you to avoid delays that are unnecessary to your program. Suppose we were interested in only the first 3 elements (a 3-second delay):

In [None]:
import itertools
file = BrokenFileScan('my_file')
print(list(itertools.islice(file, 3)))

In this sense, programming with iterators is self-optimizing: downstream logic consumes only what it needs.
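
To make "consumes only what it needs" concrete, here is a small sketch that counts how many elements are actually requested from a source. `CountingSource` is a hypothetical class invented for this illustration; it stands in for a file scan so that the example runs without any file or sleep.

```python
import itertools


class CountingSource:
    """Hypothetical source that records how many elements were requested."""

    def __init__(self, n):
        self.n = n
        self.pulled = 0  # number of elements handed out so far

    def __iter__(self):
        self.current = 0
        return self

    def __next__(self):
        if self.current >= self.n:
            raise StopIteration
        self.pulled += 1
        self.current += 1
        return self.current


source = CountingSource(1_000_000)

# Building the pipeline does not pull any element from the source.
pipeline = (x / 100.0 for x in source)
pulled_after_build = source.pulled

# Taking three results pulls exactly three elements from the source.
first_three = list(itertools.islice(pipeline, 3))
pulled_after_slice = source.pulled

print(pulled_after_build, first_three, pulled_after_slice)
```

Although the source could produce a million values, only three are ever computed.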

## Operators
`Filter`, `Normalize`, and `FileScan` are special cases of a general concept, the `Operator`. Operators define transformations of iterators, and manipulating iterators in this way is a key tool in the design of data-intensive systems. An operator is an object constructed from a collection of iterators that is itself an iterable object. Maintaining this discipline and programming with operators leads to robust and efficient code. We will show later that many important computations can be expressed simply as a composition of operators.

In [None]:
class Operator:
    """A template for a generic operator"""
    
    def __init__(self, inputs, args):
        self.inputs = inputs
        self.args = args
        
    def __iter__(self):
        self.iterators = [iter(i) for i in self.inputs] #store a list of iterators
        return self
    
    def __next__(self):
        # do something here!!!
        raise NotImplementedError("DO SOMETHING HERE!!!")
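
As one possible way to fill in the template, the sketch below re-expresses the normalization step as a subclass. The template is repeated (using `self.inputs` in `__iter__` and `NotImplementedError` in `__next__`) so that the example runs on its own. The subclass name `Divide` and the choice to pass the divisor through `args` are assumptions made for illustration.

```python
class Operator:
    """A template for a generic operator (repeated so the sketch is runnable)."""

    def __init__(self, inputs, args):
        self.inputs = inputs
        self.args = args

    def __iter__(self):
        self.iterators = [iter(i) for i in self.inputs]  # one iterator per input
        return self

    def __next__(self):
        raise NotImplementedError("subclasses define __next__")


class Divide(Operator):
    """Hypothetical operator: divides each element of its single input by args."""

    def __next__(self):
        # StopIteration from the input propagates and ends this operator too.
        return next(self.iterators[0]) / self.args


scaled = list(Divide([[5, 250, 1000]], 100.0))  # inputs is a list with one iterable
print(scaled)
```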

Let's now consider an example of an operator that consumes multiple input iterators. Consider two iterators `in1` and `in2`, each of which iterates over a stream of numbers. We want to define a `MatchOperator` that iterates over all elements that appear in *both* iterators. The algorithm that we are going to use is called a Nested Loop Join. In pseudo-code, a nested loop join iterates over one of the inputs and, for each element, iterates over all of the other input. Below is an animation of the basic iteration scheme:
![NestedLoopJoin](https://media.giphy.com/media/X7OUYegK1H49Uyl36W/giphy.gif)

The code that makes this work is shown below.

In [None]:
class MatchOperator:
    '''
    A match operator finds equality relationships between
    two iterators.
    Consider the following example where you are given two
    iterators i1,i2:
    >> i1 = [ 1,7,2,4,6, ... ] # iterator
    >> i2 = [ 3,6,7,2,1, ... ] # iterator
    You can construct a MatchOperator object:
    >> m = MatchOperator( (i1,i2) )
    and this operator should return all values that appear in both
    iterators. The order is not important
    >> for i in m:
    ...  print(i)
    1. (2,2)
    2. (1,1)
    3. (6,6)
    Edge cases:
     * Return an error if any of the iterators has 0 values
    '''

    def __init__(self, input):
        '''
        Takes in a tuple of input iterators (i1,i2)
        '''
        self.in1, self.in2 = input
        # a list of iterators
        
    def __iter__(self):
        '''
        Initializes the iterators and fetches the first element
        '''

        self.it1 = iter(self.in1) # initialize the first input
        self.it2 = iter(self.in2) # initialize the second input
        
        self.i = next(self.it1)
        self.j = next(self.it2)
        
        return self


    """
    Below are two helper methods. Conceptually,
    we are running the following pattern:
    for i in it1:
        for j in it2:
            if j == i:
                return (i,j)
    To implement this with iterators, we need two
    helper methods _reset_or_inc2 (go back to the
    beginning of the inner for loop), or _inc1_or_end
    (increment the first for loop or stop)
    """

    def _reset_or_inc2(self):
        try:
            self.j = next(self.it2)

        except StopIteration:
            self.it2 = iter(self.in2)
            self.j = next(self.it2)
            self._inc1_or_end()

    def _inc1_or_end(self):
        try:
            self.i = next(self.it1)
        except StopIteration:
            self.i = None
            self.j = None


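    def __next__(self):
        '''
        The next method fetches the next matching pair
        '''

        # loop instead of recursing, so that a long run of
        # non-matching pairs cannot hit the recursion limit
        while True:
            rtn = (self.i, self.j)

            self._reset_or_inc2()

            # the outer iterator is exhausted
            if rtn[0] is None:
                raise StopIteration()

            # skip non-pairs
            if rtn[0] == rtn[1]:
                return rtn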
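
For comparison, the nested-loop pattern from the comment block can be sketched directly as a generator function, which lets Python keep track of the two loop positions. The function name `nested_loop_match` is hypothetical. The sketch assumes that `in2` can be iterated from the start more than once (a list, or an object like `FileScan` whose `__iter__` reopens the file); a one-shot iterator would be exhausted after the first pass.

```python
def nested_loop_match(in1, in2):
    """Yield (i, j) for every pair with i == j, scanning in2 once per element of in1."""
    for i in in1:
        for j in in2:  # in2 is iterated from the start for each i
            if i == j:
                yield (i, j)


i1 = [1, 7, 2, 4, 6]
i2 = [3, 6, 7, 2, 1]

matches = list(nested_loop_match(i1, i2))
print(matches)
```

The work is proportional to the product of the two input lengths, since every element of `in1` is compared against every element of `in2`.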