# Lecture 3. Systems and Examples
In the last lecture, we covered the concept of an operator. In this lecture, we discuss systems composed of different parametrized families of these operators.

## Recap. Streaming Model of Computation
The key insight of the previous lectures is that data-intensive systems (ones that process a large amount of data) are better suited to a "streaming" model of computation than to a static one. In this paradigm, the data that serves as input to a function or a program is treated as a possibly infinite sequence. To recap:

* An iterator is an object that can initialize a data stream and return each `next` element when it is queried.
* Iterators can be chained, manipulated, filtered, and transformed with operators.
* Iterators can hide latency through lazy execution.
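
As a refresher, the three points above can be sketched in plain Python. The class names `Count` and `Double` are illustrative only and are not taken from the course code.

```python
class Count:
    """Iterator over the integers 0, 1, 2, ... (an unbounded stream)."""

    def __iter__(self):
        self.i = 0  # (re)initialize the stream
        return self

    def __next__(self):
        value = self.i
        self.i += 1
        return value


class Double:
    """Operator that wraps another iterator and doubles each element."""

    def __init__(self, inp):
        self.input = inp

    def __iter__(self):
        self.it = iter(self.input)
        return self

    def __next__(self):
        # Lazy execution: work happens only when the next element is requested.
        return 2 * next(self.it)


stream = iter(Double(Count()))
first_three = [next(stream) for _ in range(3)]
print(first_three)  # [0, 2, 4]
```

Even though `Count` never terminates, only the three requested elements are ever computed.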

How do iterators and operators form the basic building blocks of a data-intensive system?

## Atomic Values and Tuples
We've so far talked about data in the abstract. When we build actual systems, we have to be more precise. In principle, we can iterate over any data type or structure, but imposing strict restrictions helps us better understand the behavior of these systems. We will work with systems that satisfy the following axioms:

* Every system defines *atomic* data values. An atomic data type is one that is indivisible, dereferenced, and stateless in the system's eyes. This is similar to the concept of primitive types in programming languages.
* Every operator iterates over a fixed-size tuple (an immutable array) of atomic values.

Let's try to understand why we have these restrictions through some analogies in Python. Here is an example of an allowed iteration:

In [1]:
lst = [1, 2, 3, 4] #integers are an atomic type
for i in lst:
    print(i)

1
2
3
4


What if we changed this example in the following way:

In [5]:
x = [1,2]
y = [3,4]
lst = [x, y] #iterating over lists (or are we?)
for i in lst:
    print(i)

[1, 2]
[3, 4]


Iterating over object references can sometimes have complicated semantics: 

In [10]:
class Clip2:
    '''
    Iterator that sets all values greater than or equal to 2 to zero
    '''
    
    def __init__(self, inp):
        self.input = inp
    
    def __iter__(self):
        self.it = iter(self.input)
        return self
    
    def __next__(self):
        elem = next(self.it)
        
        for i, v in enumerate(elem):
            if v >= 2:
                elem[i] = 0
        return elem

    
class Add1:
    '''
    Iterator that adds one to each element
    '''
    
    def __init__(self, inp):
        self.input = inp
    
    def __iter__(self):
        self.it = iter(self.input)
        return self
    
    def __next__(self):
        elem = next(self.it)
        
        for i, v in enumerate(elem):
            elem[i] += 1
        return elem

Why is this bad? Both operators modify the lists they receive in place, so two unrelated operators that happen to touch the same data might change each other's results! Thus, we require that every piece of data processed by the operators in our system is dereferenced and atomic. This isolates each operator from unrelated processes.



In [13]:
x = [1,2]
y = [3,4]
lst = [x, y] #iterating over lists (or are we?)

for i in Clip2(lst):
    print(i)

for i in Add1(lst):
    print(i)

[1, 0]
[0, 0]
[2, 1]
[1, 1]


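This is unexpected: the second loop sees the values already clipped by the first loop rather than the original `[1, 2]` and `[3, 4]`. Running two separate for loops over the same data *should* have different semantics from composing the operators, yet the composition prints the same result: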

In [15]:
x = [1,2]
y = [3,4]
lst = [x, y] #iterating over lists (or are we?)

for i in Add1(Clip2(lst)):
    print(i)

[2, 1]
[1, 1]


The moral of the story is that data flow operators should always operate over iterators of atomic data types. Tuples are sometimes called rows or records depending on the context.
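
One way to get this isolation in Python is to iterate over tuples and have each operator build a new tuple instead of modifying its input. The following is a minimal sketch; `ClipTuples` and `Add1Tuples` are hypothetical variants of the classes above, not part of the course code.

```python
class ClipTuples:
    """Sets every value greater than or equal to 2 to zero, without mutation."""

    def __init__(self, inp):
        self.input = inp

    def __iter__(self):
        self.it = iter(self.input)
        return self

    def __next__(self):
        elem = next(self.it)
        # Build a fresh tuple; the input tuple is left untouched.
        return tuple(0 if v >= 2 else v for v in elem)


class Add1Tuples:
    """Adds one to each value, without mutation."""

    def __init__(self, inp):
        self.input = inp

    def __iter__(self):
        self.it = iter(self.input)
        return self

    def __next__(self):
        elem = next(self.it)
        return tuple(v + 1 for v in elem)


rows = [(1, 2), (3, 4)]  # tuples of atomic values

clipped = list(ClipTuples(rows))
added = list(Add1Tuples(rows))
composed = list(Add1Tuples(ClipTuples(rows)))

print(clipped)   # [(1, 0), (0, 0)]
print(added)     # [(2, 3), (4, 5)]
print(composed)  # [(2, 1), (1, 1)]
```

With immutable tuples, the two separate loops and the composed pipeline now give different results, and `rows` itself is never changed.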

## Select, Project, and Join
A database is a collection of information that is organized so that it can be easily accessed, managed and updated. Data is organized into rows, columns and tables, and it is indexed to make it easier to find relevant information. There are three parametrized operators (select, project, and join) that form the basic building blocks of most database systems. 
 
Every tuple in a database system is an array of atomic values (int, float, string, date, etc.). A table consists of a collection of tuples of the same size and types. Typically, each slot of the array has a descriptive attribute name (sometimes called a field) that identifies the index. In other words, tables are rectangular blocks of data. A database consists of one or more tables, and each table is made up of rows and columns. If you think of a table as a grid, the columns run from left to right across the grid and each entry of data is listed as a row.
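
To make the "rectangular block" picture concrete, here is a small sketch of a table as a list of fixed-size tuples plus a tuple of attribute names. The helper `attribute_value` is a hypothetical name introduced for illustration; the two rows are taken from the `rooms.csv` data used in this lecture.

```python
# Attribute names identify each slot (index) of the tuple.
attributes = ('building', 'number', 'capacity', 'board')

# Every row has the same size and the same types in each slot.
rooms_table = [
    ('SHFE', '203', '32', 'black'),
    ('JCL', '243', '4', 'white'),
]


def attribute_value(row, name):
    """Look up a slot of a row by its attribute name."""
    return row[attributes.index(name)]


print(attribute_value(rooms_table[1], 'board'))  # white
```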

We define these operators in `db.py`; let's see how they can be used. We also provide the functionality to load the data:

In [17]:
from db import *
courses = Load('courses.csv')
rooms = Load('rooms.csv')

for course in courses:
    print(course)

for room in rooms:
    print(room)

{'dept': 'cmsc', 'cn': '136', 'c_building': 'SHFE', 'c_number': '203'}
{'dept': 'cmsc', 'cn': '220', 'c_building': 'RY', 'c_number': '277'}
{'dept': 'econ', 'cn': '152', 'c_building': 'RWLD', 'c_number': '161'}
{'building': 'SHFE', 'number': '203', 'capacity': '32', 'board': 'black'}
{'building': 'JCL', 'number': '243', 'capacity': '4', 'board': 'white'}
{'building': 'JCL', 'number': '298', 'capacity': '51', 'board': 'black'}
{'building': 'RWLD', 'number': '161', 'capacity': '52', 'board': 'black'}
{'building': 'RY', 'number': '161a', 'capacity': '4', 'board': 'black'}
{'building': 'RY', 'number': '277', 'capacity': '32', 'board': 'white'}
{'building': 'RY', 'number': '276', 'capacity': '42', 'board': 'black'}


Suppose we were interested in finding "all rooms with a whiteboard":

In [18]:
for room in Select(rooms, lambda x: x['board'] == 'white'):
    print(room)

{'building': 'JCL', 'number': '243', 'capacity': '4', 'board': 'white'}
{'building': 'RY', 'number': '277', 'capacity': '32', 'board': 'white'}
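
The contents of `db.py` are not shown here, so the following is only a guess at how selection and projection could be written as iterators in the style of the earlier classes. `SelectSketch` and `ProjectSketch` are hypothetical names, and rows are assumed to be dictionaries as in the output of `Load` above. Select keeps the rows that satisfy a predicate; project keeps only a chosen subset of attributes.

```python
class SelectSketch:
    """Yields only the rows for which predicate(row) is true."""

    def __init__(self, inp, predicate):
        self.input = inp
        self.predicate = predicate

    def __iter__(self):
        self.it = iter(self.input)
        return self

    def __next__(self):
        # Skip rows until one matches; StopIteration ends the stream.
        while True:
            row = next(self.it)
            if self.predicate(row):
                return row


class ProjectSketch:
    """Keeps only the named attributes of each row."""

    def __init__(self, inp, attrs):
        self.input = inp
        self.attrs = attrs

    def __iter__(self):
        self.it = iter(self.input)
        return self

    def __next__(self):
        row = next(self.it)
        # Build a new row instead of modifying the input row.
        return {a: row[a] for a in self.attrs}


sample_rooms = [
    {'building': 'SHFE', 'number': '203', 'capacity': '32', 'board': 'black'},
    {'building': 'JCL', 'number': '243', 'capacity': '4', 'board': 'white'},
    {'building': 'RY', 'number': '277', 'capacity': '32', 'board': 'white'},
]

white = SelectSketch(sample_rooms, lambda x: x['board'] == 'white')
result = list(ProjectSketch(white, ['building', 'number']))
print(result)
# [{'building': 'JCL', 'number': '243'}, {'building': 'RY', 'number': '277'}]
```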


Now suppose we were interested in finding all rooms with a CS course in them:

In [19]:
for out in Join((courses, rooms), lambda x: (x['building'] == x['c_building']) and (x['c_number'] == x['number'])):
    print(out)
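
A join can be read as a nested loop over both inputs that keeps each combined row satisfying the predicate. The sketch below is an assumption about how such an operator might work, not the `Join` from `db.py`: `JoinSketch` is a hypothetical name, rows are assumed to be dictionaries with distinct attribute names, and the right input is assumed to be re-iterable (for example, a list). Note that Python's boolean operator is the lowercase `and`.

```python
class JoinSketch:
    """Nested-loop join over a pair of inputs (left, right)."""

    def __init__(self, inputs, predicate):
        self.left, self.right = inputs
        self.predicate = predicate

    def __iter__(self):
        for left_row in self.left:
            for right_row in self.right:
                # Combine both rows into one new row, then test it.
                merged = {**left_row, **right_row}
                if self.predicate(merged):
                    yield merged


sample_courses = [
    {'dept': 'cmsc', 'cn': '136', 'c_building': 'SHFE', 'c_number': '203'},
    {'dept': 'econ', 'cn': '152', 'c_building': 'RWLD', 'c_number': '161'},
]
sample_rooms_for_join = [
    {'building': 'SHFE', 'number': '203', 'capacity': '32', 'board': 'black'},
    {'building': 'JCL', 'number': '243', 'capacity': '4', 'board': 'white'},
]

joined = list(JoinSketch(
    (sample_courses, sample_rooms_for_join),
    lambda x: (x['building'] == x['c_building']) and (x['c_number'] == x['number']),
))
print(joined)
```

On this sample, only the `cmsc` `136` course matches a room (`SHFE` `203`), so `joined` holds a single combined row.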