# Class 9 - 1.6.18

# Software Design Principles

The previous class discussed one of the most important aspects of writing software - testing. But testing is a "mechanism" used when writing code; it's not some high-level principle. This class will deal with a few important principles that should be kept in the back of your mind whenever you write a program.

Most of the ideas presented below come from the lectures and textbooks of Robert Martin, AKA Uncle Bob. He's one of the most influential voices in object-oriented design.

## Object Orthogonality

In many cases objects interact with one another. Consider a `ProcessData` class that processes instances of a `Data` class, each of which contains a couple of `Series` and some metadata. `ProcessData` communicates with the data inside the `Data` class and processes it further.

A preliminary design might look like the following:

In [1]:
import numpy as np
import pandas as pd


class Data:
    """ Simple container for two Series and their metadata """
    def __init__(self, arr1: np.ndarray, arr2: np.ndarray, date: float):
        self.ser1 = pd.Series(arr1, dtype=np.uint8)
        self.ser2 = pd.Series(arr2, dtype=np.int16)
        self.metadata = dict(shape1=self.ser1.shape,
                             shape2=self.ser2.shape,
                             total=self.ser1.shape[0] + self.ser2.shape[0],
                             date=date)


class ProcessData:
    """ Pipeline to process twin Data instances """
    def __init__(self, data1: Data, data2: Data):
        self.data1 = data1
        self.data2 = data2
        self.result = []
        # Reaches directly into the attributes of Data
        self.metadata = dict(index1=data1.ser1.index,
                             index2=data2.ser1.index,
                             metadata=data1.metadata)

    def process(self):
        # Relies on the exact attribute names used inside Data
        self.result.extend([self.data1.ser1.sum(), self.data2.ser1.sum()])
        self.result.append([self.data1.ser1.mean() + self.data2.ser2.mean()])
        return self.result

We have here a `Data` class which serves as a container for two `Series` that are logically connected. It also simplifies access to some of the metadata of these `Series`.

We also have a `ProcessData` class that uses the `Data` instances to calculate some statistical properties and keep them for later use.

While this design works (which is important), it's flawed in the sense that the `ProcessData` object is very reliant on the implementation details of the `Data` class. When higher-level objects are dependent on specific attributes of some lower-level module, we need to perform Dependency Inversion. This decoupling process can also be called "object orthogonality".

We'll make a couple of major changes to our design which will solve, step by step, the design issues we encountered.

First we'll create a new `DataContainer` class that holds `Data` instances, and redefine the `Data` class more appropriately:

In [2]:
class Data:
    """ Simple container for two Series and their metadata """
    def __init__(self, arr1: np.ndarray, arr2: np.ndarray, date: float):
        self._ser1 = pd.Series(arr1, dtype=np.uint8)
        self._ser2 = pd.Series(arr2, dtype=np.int16)
        self._metadata = dict(shape1=self._ser1.shape,
                              shape2=self._ser2.shape,
                              total=self._ser1.shape[0] + self._ser2.shape[0],
                              date=date)

    @property
    def data(self):
        """ Returns the actual data variables as an iterable"""
        result = [self._ser1, self._ser2]
        return result
    
    @property
    def metadata(self):
        return self._metadata
    
    def sum(self):
        return [x.sum() for x in self.data]
    
    
class DataContainer:
    """ Holds, in order, instances of Data """
    def __init__(self, datas):
        self._data = []
        self._metadata = {}
        for idx, data in enumerate(datas):
            if not isinstance(data, Data):
                # Fail loudly instead of printing and keeping a partial container
                raise TypeError(f"Item {idx} ({data!r}) isn't a 'Data' instance.")
            self._data.append(data)
            self._metadata[idx] = data.metadata
    
    @property
    def data(self):
        return self._data
    
    @property
    def metadata(self):
        return self._metadata
    
    def sum(self):
        result = []
        for data in self._data:
            result.append(data.sum())
        return result

First note the "new technical term": We introduce here the `@property` decorator. We'll discuss Python's decorators in the next class, but for now we only care about their practical aspect: If we define some method as a property, that method can be accessed like a regular attribute (without brackets), except for the fact that it's read-only - assigning a new value to it raises an error:

In [3]:
class Trial:
    def __init__(self):
        self.two_as_attr = 2
    
    def two_as_method(self):
        return 2
    
    @property
    def two_as_prop(self):
        return 2

tr = Trial()

# Changing attributes is possible:
print(f"The original attribute: {tr.two_as_attr}")
tr.two_as_attr = 3
print(f"Attributes can be changed: {tr.two_as_attr}")
print("------")

# Using the regular method requires brackets
print(f"Using the method: {tr.two_as_method()}")
print("And of course, it can't be changed (immutable).")
print("------")

# Using a property "feels" like using an attribute:
print(f"As a property: {tr.two_as_prop}")  # no brackets
try:
    tr.two_as_prop = 3  # AttributeError
except AttributeError as e:
    print(f"AttributeError: {e} - properties can't be changed.")

The original attribute: 2
Attributes can be changed: 3
------
Using the method: 2
And of course, it can't be changed (immutable).
------
As a property: 2
AttributeError: can't set attribute - properties can't be changed.


Properties are useful for one more reason (setters), which we'll examine in the next class.

But besides this new, exciting feature of Python, what else has changed with the implementation?

#### `Data`:
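1. We redefined `Data`. The new object no longer invites outside code to replace the data it holds; it only offers a "view" of the data. The use of properties ensures that once the object is created, its attributes can't be rebound through the public interface. The single underscore before the variable names signals, by convention, that these attributes are private. Python doesn't enforce this, and the returned `Series` objects themselves remain mutable, so the protection is a matter of interface design rather than a hard guarantee.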

2. Furthermore, if we examine the `sum()` method, we see that it's now bound to the `Data` object itself. If we write it explicitly it makes sense: _The sum of the data is a bound method of our data - an intrinsic property of it._ If we ever decide to change how our data is stored, the `sum()` method should change accordingly, but no other object will be affected.
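
Regarding the first point above, a minimal sketch of what a property does and doesn't protect. The `Holder` class is hypothetical and not part of the design above, and a plain list stands in for the `Series` so the snippet runs without third-party packages.

```python
class Holder:
    """ Hypothetical stand-in for Data; a list replaces the Series """
    def __init__(self, values):
        self._values = list(values)

    @property
    def data(self):
        return self._values


h = Holder([1, 2, 3])

# Rebinding the property fails, since no setter was defined
try:
    h.data = [0]
    rebound = True
except AttributeError:
    rebound = False

# The underscore is only a convention: direct access still works
same_object = h._values is h.data

# The property hands out the real object, so in-place edits go through
h.data[0] = 100
first_value = h._values[0]

print(rebound, same_object, first_value)
```

If stronger protection is needed, one option is to return a copy from the property, at the cost of extra memory.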


#### `DataContainer`:
1. The new `DataContainer` class _doesn't really know_ what it's holding. All it cares about is that they're `Data` instances. It doesn't peek inside the internals of the different `Data` instances.

2. It doesn't expose the `_data` attribute that stores the `Data` instances. Instead, it offers a read-only `data` property which returns the list. If we decide to change the internal implementation of `DataContainer`, users of this class wouldn't care as long as we keep the output of the `data` property similar. Even if the list is empty - it will always return something.

Let's see the redefined implementation of the `ProcessData` class:

In [4]:
class ProcessData:
    """ Pipeline to process twin Data instances """
    def __init__(self, datacont: DataContainer):
        self.datacont = datacont
        self.result = {}
        self.metadata = datacont.metadata
        
    def process(self):
        """ Mock processing pipeline """
        self.result['sum'] = self.datacont.sum()
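        # Each item of datacont.data is a Data instance; go through its own `data` view
        means = [[ser.mean() for ser in data.data] for data in self.datacont.data]
        self.result['mean'] = means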
        return self.result

The code snippet above is now much cleaner than the one we had beforehand. It uses the "API" of the `DataContainer` in two ways - either by calling a fully-featured `sum()` method, or by (securely) accessing the data through the `data` property and running non-standard processing on it - mean calculation in our case.

The downside is the added class - more code to write, more tests, more imports at the top. But the added value is tremendous. Think how easy it is to add new functionality to the pipeline. Everything is flexible: we could, for example, add a new `median()` method to the `DataContainer` class. We can even change the internal structure of the `Data` class and still use the downstream class effectively.
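
As a rough illustration of that flexibility, the sketch below adds a `median()` method to trimmed-down versions of the two classes. The names `MiniData` and `MiniContainer` are hypothetical, and plain lists stand in for the `Series` objects so that the example runs with the standard library alone.

```python
from statistics import median


class MiniData:
    """ Trimmed-down, hypothetical version of Data using plain lists """
    def __init__(self, arr1, arr2):
        self._arrs = [list(arr1), list(arr2)]

    @property
    def data(self):
        return self._arrs

    def sum(self):
        return [sum(x) for x in self.data]

    def median(self):
        # New functionality lives next to the data it describes
        return [median(x) for x in self.data]


class MiniContainer:
    """ Trimmed-down, hypothetical version of DataContainer """
    def __init__(self, datas):
        self._data = list(datas)

    @property
    def data(self):
        return self._data

    def sum(self):
        return [d.sum() for d in self._data]

    def median(self):
        # The container only delegates; it never inspects the arrays
        return [d.median() for d in self._data]


cont = MiniContainer([MiniData([1, 2, 3], [10, 20]),
                      MiniData([4, 5, 6], [7])])
sums = cont.sum()
medians = cont.median()
print(sums)
print(medians)
```

A downstream pipeline would then need a single extra line, along the lines of `self.result['median'] = self.datacont.median()`, and nothing else would have to change.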

## Liskov Substitution Principle

The LSP can be presented in several ways, and we'll choose the more straightforward approach of just showing an example of when the principle is violated.

Say I wish to model a rectangle, just as we did in the first class:

In [14]:
class Rectangle:
    """ A very simple implementation just to prove a point """
    def __init__(self, point, x, y):
        self.corner = point[0], point[1]
        self.x = x
        self.y = y
    
    def move(self, point):
        """ Move the object to the point """
        self.corner = point
        
    def set_width(self, dx):
        """ Change width to dx """
        self.x = dx
    
    def set_height(self, dy):
        """ Change height to dy """
        self.y = dy        

As the docstring says, above is a super-basic implementation of such a Rectangle. Take note of the two mutating functions that present a way to change the shape of the rectangle _independently._ This seems very logical when only dealing with a rectangle - each side truly is independent of the other.

However, if we wish to reuse this class when modeling a Square via inheritance, we'll be facing quite a pickle:

In [8]:
class Square(Rectangle):
    """ Simple square, inheriting from Rectangle """
    def __init__(self, point, x):
        super().__init__(point, x, x)

In [13]:
sq = Square((0, 0), 10)
print(f"Square size: {sq.x, sq.y}")
print(f"Square corner: {sq.corner}")

Square size: (10, 10)
Square corner: (0, 0)


Initially this seems OK. We only require a single `x` input for a square, and we just pass it twice to the `Rectangle` constructor to create a squared rectangle.

Even the `move()` method of the rectangle is helpful - we can move our square around without the need to redefine it.

But the `set_X()` methods are an issue. We can't allow users of our `Square` to modify the height and width of the square independently. If someone changed only the square's height, keeping its current width unchanged, our `Square` would no longer be a true square.

In [16]:
sq.set_height(20)
print(f"New dimensions: {sq.x, sq.y} - not a square.")

New dimensions: (10, 20) - not a square.


Logically, and mathematically, a square _should_ inherit from a rectangle. The simple mental model of the problem at hand is very clear with this inheritance relationship in mind. However, our implementation hits a setback which we might not have been able to predict in advance.

The LSP states that we should be able to replace instances of `Rectangle` with instances of `Square` without changing the correctness of the application. In this case we see that this substitution isn't possible, and so the principle breaks.
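
To see the failed substitution from the point of view of client code, here is a minimal sketch. The `stretch()` function and the `is_valid()` helper are hypothetical additions, and the `corner` attribute is left out for brevity.

```python
class Rectangle:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def set_width(self, dx):
        self.x = dx

    def set_height(self, dy):
        self.y = dy


class Square(Rectangle):
    def __init__(self, x):
        super().__init__(x, x)

    def is_valid(self):
        """ A square must have equal sides """
        return self.x == self.y


def stretch(shape: Rectangle):
    """ Hypothetical client code, written with only Rectangle in mind """
    shape.set_width(4)
    shape.set_height(5)
    return shape.x * shape.y


rect_area = stretch(Rectangle(1, 1))
sq = Square(1)
sq_area = stretch(sq)
print(rect_area, sq_area, sq.is_valid())
```

The function runs without errors for both inputs, which is exactly the problem: the `Square` silently stops being a square once it's passed to code that expects a `Rectangle`.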

### What do we do?

#### 1. Limit the use of inheritance
Only when we're completely positive that the use of inheritance will contribute to our application - by improving readability or reducing code repetitions - only then should we use it. It's an important tool to have as an object-oriented programmer, but one which should be used carefully.

#### 2. Define a higher-level abstraction
We could define a more abstract base class for both the rectangle and square, such as `TwoDShape` (a name like `2DShape` isn't a valid Python identifier). This class can have a `corner` attribute, and a few very basic methods like `move()`. This will change the definition of `Rectangle` to

```python
class Rectangle(TwoDShape):
     # ...
``` 
and `Square` to 
```python
class Square(TwoDShape):
     # ...
```
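
A possible way to fill in these skeletons is sketched below. The `side` attribute and the `set_side()` method of `Square` are assumptions made for the sake of the example.

```python
class TwoDShape:
    """ Hypothetical base class: everything shared by flat shapes """
    def __init__(self, point):
        self.corner = point[0], point[1]

    def move(self, point):
        """ Move the object to the point """
        self.corner = point[0], point[1]


class Rectangle(TwoDShape):
    def __init__(self, point, x, y):
        super().__init__(point)
        self.x = x
        self.y = y

    def set_width(self, dx):
        self.x = dx

    def set_height(self, dy):
        self.y = dy


class Square(TwoDShape):
    def __init__(self, point, side):
        super().__init__(point)
        self.side = side

    def set_side(self, side):
        """ Single mutator, so the sides can never diverge """
        self.side = side


sq = Square((0, 0), 10)
sq.move((3, 4))       # inherited from TwoDShape
sq.set_side(20)
print(sq.corner, sq.side, isinstance(sq, Rectangle))
```

Since `Square` is no longer a `Rectangle`, nobody can pass it to code that calls `set_width()` or `set_height()`, while `move()` is still shared.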

#### 3. Override methods of the base class
We may simply override the implementation of one (or both) of the `set_X()` methods. The new implementation may raise a warning when trying to use it, pointing the user to the appropriate method, or it may raise a simple exception.
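
A minimal sketch of the exception variant, assuming a hypothetical `set_side()` method as the "appropriate method" users are pointed to (the `corner` attribute is omitted for brevity):

```python
class Rectangle:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def set_width(self, dx):
        self.x = dx

    def set_height(self, dy):
        self.y = dy


class Square(Rectangle):
    def __init__(self, x):
        super().__init__(x, x)

    def set_side(self, side):
        """ Hypothetical replacement for the two independent setters """
        self.x = side
        self.y = side

    def set_width(self, dx):
        raise NotImplementedError("Use set_side() to resize a Square.")

    def set_height(self, dy):
        raise NotImplementedError("Use set_side() to resize a Square.")


sq = Square(10)
try:
    sq.set_height(20)
    blocked = False
except NotImplementedError:
    blocked = True

sq.set_side(20)
print(blocked, (sq.x, sq.y))
```

This keeps every `Square` valid, but code written for `Rectangle` still has to cope with the exception, so the substitution problem is contained rather than removed.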

#### 4. Addition of a precondition
We can add to the `Rectangle` class a flag (=boolean attribute) called `stretchable`. Each `set_X()` method then checks this flag, to see if the operation is allowed, before changing the width or height.
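
One way this could look is sketched below. Storing the flag as a class attribute, raising a `ValueError`, and the `try_widen()` helper are all assumptions of this example rather than the only way to do it.

```python
class Rectangle:
    stretchable = True  # sides may change independently

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def set_width(self, dx):
        if not self.stretchable:
            raise ValueError("This shape can't be stretched along one axis.")
        self.x = dx

    def set_height(self, dy):
        if not self.stretchable:
            raise ValueError("This shape can't be stretched along one axis.")
        self.y = dy


class Square(Rectangle):
    stretchable = False

    def __init__(self, x):
        super().__init__(x, x)


def try_widen(shape, dx):
    """ Hypothetical client code that checks the precondition first """
    if shape.stretchable:
        shape.set_width(dx)
        return True
    return False


rect = Rectangle(1, 2)
sq = Square(3)
widened_rect = try_widen(rect, 5)
widened_sq = try_widen(sq, 5)
print(widened_rect, rect.x, widened_sq, (sq.x, sq.y))
```

The precondition is now part of the documented contract of `Rectangle`, so a client that respects it works correctly with either class.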

## Typestates

Typestates are a way to enforce the state of our data/application with strict types.


Let's assume I have 24 human volunteers in a combined fMRI + questionnaire study. I keep them all in a single DataFrame for brevity and ease-of-use, but in effect they're in different stages of my experiment. A few were just recruited last week, and I haven't even set a date for our first meeting. A few others were already scanned in the magnet once, but still have to go through my second questionnaire session.

My application monitors these volunteers, alerts me of upcoming meeting dates, and (of course) analyzes the results of the questionnaires and scans.

The __correctness__ of this application can be enforced in many ways - tests, mock data, daily use - but here I choose to show another mechanism - typestates. The fact that the current status of each volunteer isn't specified with a simple string in a table, but is actually a different class altogether, is another way to make sure that I always receive the expected output from each method call.

In [26]:
import copy
import datetime
import pandas as pd


# Helper types
class Name:
    """ First and last name """
    # Implementation omitted


class Age:
    """ Special age type """
    # Implementation omitted


class FmriResult:
    """ Results from an fMRI scan """
    # Implementation omitted


# Volunteer types
class Volunteer:
    """ Base class for all volunteers in my project """
    def __init__(self, name: Name, age: Age, call_date: datetime.date, vol_id: int):
        self.name = name
        self.age = age
        self.call_date = call_date
        self.id = vol_id
        self.metadata = {}  # filled in by the subclasses

    def __str__(self):
        return f"{self.name}, age {self.age}, first called at {self.call_date}."

    def update_df(self, records: pd.DataFrame):
        """ Return the records with this instance added as a new row """
        record = pd.DataFrame([dict(name=self.name, age=self.age,
                                    call_date=self.call_date, id=self.id,
                                    metadata=self.metadata, type=type(self),
                                    instance=copy.copy(self))])
        # DataFrames have no in-place append, so build and return a new frame
        return pd.concat([records, record], ignore_index=True)

    def remove_from_df(self, records: pd.DataFrame):
        """ Return the records without this instance's row (needs an 'id' column) """
        return records[records['id'] != self.id].reset_index(drop=True)


class PreScanOne(Volunteer):
    """ Volunteer before the first session """
    loc = 0  # ordinal place in hierarchy

    def __init__(self, name: Name, age: Age, call_date: datetime.date, vol_id: int,
                 scan_one_date: datetime.date):
        super().__init__(name, age, call_date, vol_id)
        self.metadata = dict(scan_one_date=scan_one_date)

    def advance(self, result: FmriResult, next_date: datetime.date):
        """ Advance a PreScanOne to a PostScanOne """
        new = PostScanOne(self, result, next_date)
        return new


class PostScanOne(Volunteer):
    """ Volunteer after the first session """
    loc = 1

    def __init__(self, pre_volunteer: PreScanOne, scan_one_data: FmriResult,
                 scan_two_date: datetime.date):
        super().__init__(pre_volunteer.name, pre_volunteer.age,
                         pre_volunteer.call_date, pre_volunteer.id)
        # Copy the dict so that the previous instance stays untouched
        self.metadata = dict(pre_volunteer.metadata)
        self.metadata['scan_one_data'] = scan_one_data
        self.metadata['scan_two_date'] = scan_two_date

    def advance(self, result: FmriResult, next_date: datetime.date):
        """ Advance a PostScanOne to a PreScanTwo """
        # PreScanTwo (the next step) is not shown in this example
        new = PreScanTwo(self, result, next_date)
        return new


# Examples of generic methods that use this interface
def advance_volunteer(old_vol, results: FmriResult, next_date: datetime.date,
                      records: pd.DataFrame):
    """
    Move volunteer to next step in the experiment, returning the new
    instance and records.
    """
    records = old_vol.remove_from_df(records)
    new_vol = old_vol.advance(results, next_date)
    records = new_vol.update_df(records)
    return new_vol, records


def process_data(records: pd.DataFrame):
    """ Run the same processing function over all fMRI data """
    results = []
    for vol in records['instance']:
        try:
            results.append(vol.process_data())
        except AttributeError:  # instance doesn't have data
            pass
    return results

This is long, but interesting, so let's try to break it down.

At the beginning we have a few helper classes which I merely defined, but did not implement. These shouldn't look strange to you. We talked during class of how an `Age` type is an important example of defining our own types in a program, since it's neither an integer nor a floating point number.

The second part is the most interesting. We have a base class called `Volunteer` which contains basic information which is common to all experiment volunteers. But it's actually more than that - it also defines the _interfaces_ between the classes. It forces the classes to have specific attributes that comply with this protocol, linking their behavior together.

The other two classes inherit from `Volunteer` and represent the first two steps in the "Volunteer path". The `loc` class variable signifies that. From phase one (`PreScanOne`) a volunteer can only advance forward (or drop out from the experiment) to step 2. And likewise from step 2 to 3 - you'll always find the same `.advance()` method that takes you to the next step, even though the implementation is slightly different. To handle the variability in the held data, we have the `metadata` attribute which can hold different parameters and datapoints.

The last part shows how to use such an interface. We have a function that advances an instance of a class "one step" to the next phase. We have a function that runs some processing on the data held inside the instances, and we can have as many functions (and classes) as we wish. It's completely extensible since the API is well-defined.
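
The core of the idea can be condensed into a much smaller sketch. The two state classes below, `Recruited` and `Scanned`, are hypothetical simplifications of the classes above, and the names, dates and scan result are made-up placeholder values.

```python
import datetime


class Recruited:
    """ Hypothetical first state: no scan has happened yet """
    def __init__(self, name: str, scan_date: datetime.date):
        self.name = name
        self.scan_date = scan_date

    def advance(self, scan_result: str):
        """ The only way forward is into the next state """
        return Scanned(self, scan_result)


class Scanned:
    """ Hypothetical second state: holds the scan result """
    def __init__(self, previous: Recruited, scan_result: str):
        self.name = previous.name
        self.scan_result = scan_result

    def process_data(self):
        return f"{self.name}: {self.scan_result}"


def process_all(volunteers):
    """ Only states that define process_data() contribute a result """
    results = []
    for vol in volunteers:
        method = getattr(vol, "process_data", None)
        if method is not None:
            results.append(method())
    return results


new = Recruited("Dana", datetime.date(2018, 6, 1))
old = Recruited("Noam", datetime.date(2018, 5, 1)).advance("scan-ok")
results = process_all([new, old])
print(results)
```

A `Recruited` instance simply has no `process_data()` method, so analyzing a scan that never happened isn't something the code can express by accident.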

## Design vs. Productivity

Before we start exercising, one important note to remember: There's a thin line between under- and over-engineering. Very small scripting projects require almost no engineering at all. This might mean that after you gain a few extra months of experience in Python, the structure of code for a small scripting job in Python might be obvious for you right from the get-go. You'll know which data structures you'll have, whether or not you'll need a class or two, and how the user interface might go.

On the other hand, large applications which span at least a few thousand lines of code will always need _some_ form of pre-planning. It would be senseless not to write out a diagram of the main modules in your code and their interfaces. One can consider this to be common knowledge, or a simple programmer's instinct. Just like architects sit down and plan for months in advance the construct they're about to create, programmers should spell out the architecture of their own programs. In no way will this guarantee you'll get the architecture right the first time, but the design might serve as a good building block when you start the refactoring process.

Problems mostly occur when you write medium-sized scripts, up to a couple thousand lines. These scripts usually start out small - a few functions that deal with file I/O and display of data - but can grow quite quickly once you start adding functionality. When the script was short you probably didn't even write tests, since you were sure you're handling some insignificant piece of code, and now it starts biting back at you.

It's hard to write rules for these occasions. When someone asks me for improved functionality on some short script I wrote, I sometimes tell them it will take more time than I think it should, since I want to devote time for refactoring of the code, to make the new functionality feel more natural inside it.

It's also good practice to use classes to bind data and methods, even when you think they might be overkill. It's much easier to expand the functionality of classes than of an assortment of functions.