# 3. Python fundamentals II

As your programs get larger, you'll want to get organized. This module covers the main concepts in Python for grouping and reusing code: functions, objects, and modules.


## Functions

A function is a chunk of code that takes inputs, does some processing on them, then returns an output. The code within the function runs only when it is called.

Functions can be reused without having to write all the code out again, making your software more **reproducible** and **consistent**.

Grouping code into functions also keeps your software **organised**.



### Anatomy of a Python function

In Python, all functions have a **name**, zero or more **input arguments**, and an **output**. 

We're also going to give all our functions a descriptive **docstring** for this course: I recommend you do the same in your code!

You write a function by using the `def` keyword. Here's an example with these four basic components:


In [None]:
def acre_feet_to_m3(volume_acre_feet):
    """Converts volume in US acre-feet to SI m³."""
    volume_m3 = volume_acre_feet * 1_233.482
    return volume_m3

Running this function works just like we've already seen with builtin Python functions like `print` and `str`:

In [None]:
lake_tahoe_volume_acre_feet = 120_000_000
acre_feet_to_m3(lake_tahoe_volume_acre_feet)

Any variables created inside the function are only available inside the function. If you try to use `volume_m3` now, you will see an error. This scoping is one of the benefits of functions: it keeps your workspace clean of temporary variables.


```python
print(volume_m3)
```

```text
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 print(volume_m3)

NameError: name 'volume_m3' is not defined
```
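
The same scoping rule can be checked without crashing the notebook. The sketch below repeats the function from above and wraps the lookup in `try`/`except` purely to catch the expected `NameError`; the names `result_m3` and `leaked` are made up for this illustration.

```python
def acre_feet_to_m3(volume_acre_feet):
    """Converts volume in US acre-feet to SI m³."""
    volume_m3 = volume_acre_feet * 1_233.482  # Local to the function.
    return volume_m3


# The returned value is available once we assign it to a name of our own.
result_m3 = acre_feet_to_m3(2)

try:
    volume_m3  # The function's local name does not exist out here.
    leaked = True
except NameError:
    leaked = False

print(f"{result_m3=} {leaked=}")
```

The value travels out of the function through `return`; the name `volume_m3` stays behind.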

### Default arguments

Input arguments can have **default** values. Normally, running a function without specifying all of its arguments results in an error.

But when default values are given for arguments using an `=` sign in the function definition, Python will use that default for any missing arguments.



In [None]:
def say_hello(user=None):
    if user is None:
        print("Hi!")
    else:
        print(f"Hi {user}!")

In [None]:
say_hello("Andrew")

In [None]:
say_hello()

It's nice to give your arguments defaults when you know what the value will be most of the time, but still allow the possibility of using a different input.
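
Defaults aren't limited to `None`. Here is a small sketch with a numeric default, using a hypothetical `round_flow` helper that is not part of the course code. It also shows that arguments *without* a default are still required.

```python
def round_flow(flow_cfs, decimals=1):
    """Round a flow rate (ft³/s) for display."""
    return round(flow_cfs, decimals)


default_rounding = round_flow(1234.5678)  # decimals falls back to 1
custom_rounding = round_flow(1234.5678, decimals=3)

# Arguments without a default are still required.
try:
    round_flow()
    missing_argument_error = None
except TypeError as error:
    missing_argument_error = str(error)

print(default_rounding, custom_rounding)
print(missing_argument_error)
```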

### Named arguments

A function will receive the arguments in the order you use them when running the function:

In [None]:
def bounds_area(left, bottom, right, top, force_positive=True):
    """Area of rectangular bounding box."""
    height = top - bottom
    width = right - left
    area = height * width
    if force_positive:
        area = abs(area)
    return area


bounds_area(4137299, 606008, 4137399, 606009)

Using argument names lets you control the order of the arguments. Plus, it makes the code much easier to understand when you come back to it later!

In [None]:
bounds_area(left=4137299, right=4137399, bottom=606008, top=606009)

In science and geospatial coding, we frequently switch between different ordering conventions (lat/lon, x/y, row/col), and we often work with very complicated algorithms that take a large number of parameters.

By writing out argument names in full wherever they're not completely obvious, we help others and our future selves to read our code, and make our code more resilient to bugs.
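
To make the lat/lon risk concrete, here is a minimal sketch using a hypothetical `format_location` function (invented for this illustration). Passing the values positionally in the wrong order raises no error, while naming them removes the ambiguity.

```python
def format_location(lat, lon):
    """Format a coordinate pair as a human-readable string."""
    return f"lat={lat}, lon={lon}"


half_dome_lat = 37.745
half_dome_lon = -119.533

# Positional: the values land in the wrong slots and nothing complains.
swapped = format_location(half_dome_lon, half_dome_lat)

# Named: the order at the call site no longer matters.
named = format_location(lon=half_dome_lon, lat=half_dome_lat)

print(swapped)
print(named)
```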




### Outputs

Technically, a Python function can only return a single object as output. But if we want **zero output** we can just return `None` (this is also what happens when we have a function without a `return` statement):


In [None]:
def check_number_is_non_negative(num):
    """Throw an error for negative numbers."""
    assert num >= 0

page_number = 7
result = check_number_is_non_negative(page_number)
print(result)

To stuff **multiple outputs** in one variable, we can use a tuple:

In [None]:
def extract_lat_lon(latlon):
    """Extract lat,lon from a string like '37.364,-122.010'"""
    parts = latlon.split(",")
    lat = float(parts[0])
    lon = float(parts[1])
    return (lat, lon)

result_lat, result_lon = extract_lat_lon("37.364,-122.010")

print(result_lat)

### Type annotations

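Modern versions of Python (like the one we're using in our conda environment!) let you document the expected types of your input arguments and output using a special syntax. Python itself doesn't enforce these types when the code runs, but editors and type-checking tools can use them to flag mistakes.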

For the `extract_lat_lon` function above that would look like this:


```python
def extract_lat_lon(latlon: str) -> tuple[float, float]:
    ...
```

These **type annotations** are slowly being adopted by many new software projects. But their use isn't widespread in scientific computing.

We chose not to use type annotations for this course as many of the core scientific Python packages we'll be using don't support them (yet!). But you may see them when viewing Python code in the future.
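
If you're curious how annotations behave, the sketch below assumes Python 3.9 or newer (needed for the `tuple[float, float]` syntax). The `double` function is a made-up example showing that annotations are stored as metadata but not checked when the code runs.

```python
def extract_lat_lon(latlon: str) -> tuple[float, float]:
    """Extract lat,lon from a string like '37.364,-122.010'"""
    parts = latlon.split(",")
    return (float(parts[0]), float(parts[1]))


def double(value: int) -> int:
    """Annotated for ints, but nothing stops other types at run time."""
    return value * 2


print(extract_lat_lon.__annotations__)
print(double("ab"))  # Runs without complaint despite the annotation.
```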



### When to use functions?

There's a balance to strike here with functions. Not enough functions makes for code that's hard to navigate and prone to bugs. But wrapping every little statement in a function slows development and adds counter-productive complexity.

Here are some suggestions for when it's time to put code in a function:

* To avoid repetition.
  * If you've already written some code to read data from Snowflake, there's no need to re-write it in every part of your code that needs Snowflake data. Be kind to yourself, make a `read_from_snowflake` function!
  * There's no need to force it though. Sometimes it makes sense to copy and modify code, rather than trying to write a single function that handles two slightly different situations.
* When consistency is needed.
  * There are two different definitions of the [acre-foot](https://en.wikipedia.org/wiki/Acre-foot) unit. By using our `acre_feet_to_m3` function instead of multiplying by a hard-coded conversion factor throughout our code, we can ensure that all parts of our program are using a consistent definition of the unit.
* To structure a program.
  * You can use functions to split up your program so that it reads almost like English.
  * A good rule of thumb: all lines of a function should fit on one computer screen.
  * For example, the top-level of this flow rate forecasting program is split into 7 high-level functions: 
    ```python
    def forecast_flow_rate(streamgage_id, forecast_date):
        """Single day ML forecast for flow at a streamgage location."""
        # Inputs.
        historic_flow = load_historic_flow(streamgage_id)
        watershed = load_watershed(streamgage_id)
        dem = load_dem(watershed.bounds)
        precip = load_precipitation_forecast(watershed.bounds, forecast_date)

        # Run model.
        forecast_result = run_lstm_flow_simulation(dem, precip, forecast_date)
        validate_single_gage_forecast(forecast_result)

        # Save result.
        save_result_to_snowflake(forecast_result)
    ```
    Skimming through just that function gives the reader a good overview of how the program works, and offers a clear directory of where to look to resolve an `Invalid Forecast Result` error, for example.



### Library functions

Python comes with a large standard (builtin) library of modules, and the conda environment we installed adds many more. Library modules are accessed using `import`. For example, to use the builtin `statistics` module:



In [None]:
import statistics

states = ["CA", "NV", "NV", "AZ"]
statistics.mode(states)
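
You will also come across a couple of other spellings of `import`. The sketch below shows them side by side using the same `statistics` module; the `flows` list is made-up data for illustration.

```python
import statistics
import statistics as stats  # Same module under a shorter alias.
from statistics import median  # Pull one function into our namespace.

flows = [12.0, 15.5, 11.2, 40.1]

print(statistics.mean(flows))
print(stats.mean(flows))
print(median(flows))
```

All three forms load the same module; they only differ in the name you type afterwards.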

## Errors in functions

By now you've probably seen what happens in Python when you trigger an error!

```python
42 / 0
```

```text
ZeroDivisionError                         Traceback (most recent call last)
Cell In[71], line 1
----> 1 42 / 0

ZeroDivisionError: division by zero
```

In this simple example Python is telling you a few pieces of information:

* The error name: `ZeroDivisionError`. This is the bit you should use for googling more information about the error!
* A plain-English description of the error: `division by zero`.
* The line of code that triggered the error: `42 / 0`.
* You also get a line number: `line 1`. In notebooks this is less helpful, but for Python files this can help you quickly find the problematic line (you also get the filename).


Often in Python our code gets deeply nested: we have a function, that calls another function in a different module, which then calls a different function. In the example below we have `build_bounds` which calls `print_bounds_details` which calls `calculate_aspect_ratio`:



In [None]:
def calculate_aspect_ratio(width, height):
    return height / width

def print_bounds_details(bounds):
    print(f"{bounds=}")
    
    width = bounds[2] - bounds[0]
    height = bounds[3] - bounds[1]
    print(f"{width=} {height=}")
    
    aspect_ratio = calculate_aspect_ratio(width, height)
    print(f"{aspect_ratio=}")

def build_bounds(left, bottom, right, top):
    bounds = (left, bottom, right, top)
    print_bounds_details(bounds)
    return bounds



Look what happens when we try to build a bounding box with zero width:

```python
build_bounds(1, 100, 1, 101)
```


```text
ZeroDivisionError                         Traceback (most recent call last)
Cell In[75], line 20
     16     print_bounds_details(bounds)
     17     return bounds
---> 20 build_bounds(1, 100, 1, 101)

Cell In[75], line 16, in build_bounds(left, bottom, right, top)
     14 def build_bounds(left, bottom, right, top):
     15     bounds = (left, bottom, right, top)
---> 16     print_bounds_details(bounds)
     17     return bounds

Cell In[75], line 11, in print_bounds_details(bounds)
      8 height = bounds[3] - bounds[1]
      9 print(f"{width=} {height=}")
---> 11 aspect_ratio = calculate_aspect_ratio(width, height)
     12 print(f"{aspect_ratio=}")

Cell In[75], line 2, in calculate_aspect_ratio(width, height)
      1 def calculate_aspect_ratio(width, height):
----> 2     return height / width

ZeroDivisionError: division by zero
```

Now Python is giving us a **traceback**: it shows the line that triggered the error at each level.

This is helpful because often the line that caused Python to fail in the *code* isn't the one that is problematic at the *conceptual* level.

In the example above, Python crashed because it was trying to divide by zero when calculating the aspect ratio of the bounds. But we can't fix the mathematics of that line: the proper fix might be at the `build_bounds` level of nesting, to detect and reject zero-width bounding boxes before trying to `print_bounds_details`.

---

One last note on errors: **embrace errors** during development! Except in a few circumstances (writing to a database, deleting files), crashing your code has zero negative consequences, and gives you direct, valuable feedback via the error message. Instead of wondering "will this code work?" it's often faster just to run it and see!

As we move into more complex coding, we'll rely more heavily on errors and tracebacks.






### Triggering errors

You can trigger an error using the `raise` keyword with an error type (there are [many to choose from](https://docs.python.org/3/library/exceptions.html), `ValueError` is the most common) and an error message string:

In [None]:
def build_bounds(left, bottom, right, top):
    # Fail early, before doing any other work.
    if left == right or bottom == top:
        raise ValueError("Zero-area bounds detected!")
    bounds = (left, bottom, right, top)
    print_bounds_details(bounds)
    return bounds

Why would you want to crash your code?! Well now when we try to create a bad bounding box, our code fails early (avoiding running code that's doomed to fail), and we get a clear error message about what went wrong, rather than a long traceback triggered in a utility function:

```python
build_bounds(1, 100, 1, 101)
```

```text
ValueError                                Traceback (most recent call last)
Cell In[76], line 29
     24     print_bounds_details(bounds)
     25     return bounds
---> 29 build_bounds(1, 100, 1, 101)

Cell In[76], line 22, in build_bounds(left, bottom, right, top)
     20 def build_bounds(left, bottom, right, top):
     21     if left == right or bottom == top:
---> 22         raise ValueError("Zero-area bounds detected!")
     23     bounds = (left, bottom, right, top)
     24     print_bounds_details(bounds)

ValueError: Zero-area bounds detected!
```

In general, we want our code to fail as early as possible, with the best description possible.
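
The same idea applies to parsing inputs. Here is a sketch that adds two checks to the `extract_lat_lon` function from earlier; the specific checks and messages are illustrative choices, not the only reasonable ones. The `try`/`except` at the end is only there so the example can show the message without stopping the notebook.

```python
def extract_lat_lon(latlon):
    """Extract lat,lon from a string like '37.364,-122.010'"""
    parts = latlon.split(",")
    if len(parts) != 2:
        raise ValueError(f"Expected 'lat,lon' but got {latlon!r}")
    lat = float(parts[0])
    lon = float(parts[1])
    if not -90 <= lat <= 90:
        raise ValueError(f"Latitude {lat} is outside [-90, 90]")
    return (lat, lon)


try:
    extract_lat_lon("-122.010,37.364")  # lat and lon swapped by mistake
except ValueError as error:
    message = str(error)

print(message)
```

Without the check, the swapped coordinates would flow silently into later calculations and surface as a confusing result far from the real mistake.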



### Function decorators

A decorator (signified with a `@` symbol) modifies the functionality of a function.

Mastering decorators is a more advanced topic. But there's one super useful decorator you should know about: the `cache` decorator in the builtin `functools` library.

Say you have a function that is very slow and always returns the same output when given the same input argument. (This is really common in data science, think loading data from remote servers or performing very complicated but deterministic calculations). 

Adding `@functools.cache` before your slow deterministic function means that the first time it's called, the result is stored by Python. Any subsequent times the function is called with the same input arguments, that stored result is returned instantly!

In the example below, `load_webpage` is our slow function.

In [None]:
import functools
import requests

@functools.cache
def load_webpage(url):
    return requests.get(url)

In [None]:
%%time
load_webpage("https://example.com/")

The first time the function is run, it takes over 50ms to query the server and download the webpage.

(Note the `%%time` notebook syntax here, which prints out the time taken to run a cell).

In [None]:
%%time
load_webpage("https://example.com/about.html")

Running the function a second time with different input also takes over 50ms.

In [None]:
%%time
load_webpage("https://example.com/")

But when run with an input that the decorated function has already seen, the result takes a few μs. That's about as close to instantaneous as you get with Python!
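
If you'd like to try caching without a network connection, here is a minimal offline sketch. The `slow_square` function and its 0.2 s `time.sleep` are stand-ins invented for this illustration, and the timing uses `time.perf_counter` so it also works outside a notebook.

```python
import functools
import time


@functools.cache
def slow_square(number):
    """Stand-in for a slow, deterministic calculation."""
    time.sleep(0.2)  # Pretend this is a remote query.
    return number * number


start = time.perf_counter()
first_result = slow_square(12)
first_duration = time.perf_counter() - start

start = time.perf_counter()
second_result = slow_square(12)  # Same argument: served from the cache.
second_duration = time.perf_counter() - start

print(f"{first_duration=:.3f}s {second_duration=:.6f}s")
print(slow_square.cache_info())
```

Two caveats worth knowing: the arguments must be hashable (so a list won't work as an input), and the cache only makes sense when the function really does return the same output for the same input.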

## Classes

The next step on our organizational journey is classes and objects.

Just like functions group related *statements* together, classes group related *variables* and *functions*.

And just like how organizing your code into functions is optional, so is defining your own classes. The programming technique where most things are organised into classes is called Object Oriented Programming (OOP), but you don't have to go full OOP to selectively enjoy many of the benefits of classes!


### Defining classes


Classes are defined with the `class` keyword. Here's a simple class representing a point on a map.

In [None]:
class Point:
    """A lat/lon point."""

We can create an **instance** of the class using the `()` syntax. An instance of a class is also called an **object** in Python.

In [None]:
point_half_dome = Point()

### Object attributes

Ok so what do we do with classes and instances?

Classes group together variables and functions. Let's start with the variables: you can create variables known as **attributes** on your instances using the `.` syntax. Let's give our point a latitude and a longitude.

In [None]:
point_half_dome.lat = 37.745
point_half_dome.lon = -119.533

We can use these object attributes as variables like any other:

In [None]:
print(point_half_dome.lat)
print(round(point_half_dome.lon))

### The \_\_init\_\_ method

Despite the example above, it's best not to set attributes directly. It's too easy to introduce inconsistency: you might have `point.lat` in one place and `point.latitude` in another, resulting in chaos.

Instead, we let the class define its own attribute names using a function. The functions of a class are called **methods**.

Python gives us a few special method names we can use that have special functionality. The `__init__` method is used to initialize (create) an instance of the class.

Let's rewrite our `Point` class:

In [None]:
class Point:
    """A lat/lon point."""
    def __init__(self, lat, lon):
        self.lat = lat

        # Wrap longitude to range (-180, 180].
        while lon > 180:
            lon = lon - 360
        while lon <= -180:
            lon = lon + 360
        self.lon = lon


point_half_dome = Point(lat=37.745, lon=-119.533)
print(point_half_dome.lat)

The initializer now stores the first argument as an attribute called `lat`: no room for `latitude` ambiguity! The initializer also takes the opportunity to ensure that the longitude is between -180 and 180: by putting this check in a single place, we can be sure of the longitude convention used by all `Point` objects.

All methods of a class receive the instance as their first argument, which by convention is called `self`.
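
To see what `self` is doing, here is a tiny sketch with a made-up `Greeter` class (not part of the course code). Calling a method on an instance is equivalent to calling it on the class and passing the instance in by hand.

```python
class Greeter:
    """Tiny class used only to illustrate `self`."""
    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"Hi {self.name}!"


greeter = Greeter("Andrew")

# These two calls are equivalent: Python fills in `self` for the first form.
via_instance = greeter.greet()
via_class = Greeter.greet(greeter)

print(via_instance, via_class)
```

In practice you'll always use the first form; the second is shown only to demystify where `self` comes from.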


### Custom methods

Pretty much all of your classes are going to start like this:

In [None]:
class SomeClass:
    """What SomeClass does."""
    def __init__(self, arg1, arg2, arg3):
        self.arg1 = arg1
        self.arg2 = arg2
        ...

But as you expand the functionality of each class, you'll add more attributes and methods.

A big feature of methods is that they can modify the object. Let's add a `move` method to our `Point` class. Remember that `self` is always inserted as the first argument.

In [None]:
class Point:
    """A lat/lon point."""
    def __init__(self, lat, lon):
        self.lat = lat

        # Wrap longitude to range (-180, 180].
        while lon > 180:
            lon = lon - 360
        while lon <= -180:
            lon = lon + 360
        self.lon = lon

    def move(self, delta_lat=0, delta_lon=0):
        """Shifts the point."""
        self.lat = self.lat + delta_lat
        self.lon = self.lon + delta_lon

point_half_dome = Point(lat=37.745, lon=-119.533)
print(point_half_dome.lat)


In [None]:
point_half_dome.move(delta_lat=0.1)
print(point_half_dome.lat)

Using the `move` method on the object has permanently modified the instance's `lat` attribute.
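
One thing the `move` method above does not do is re-apply the longitude wrapping from `__init__`, so a large `delta_lon` can push `lon` outside (-180, 180]. The following is one possible variation, not the course's definition of `Point`: it moves the wrapping into a hypothetical `wrap_longitude` helper so both methods can share it.

```python
def wrap_longitude(lon):
    """Wrap a longitude to the range (-180, 180]."""
    while lon > 180:
        lon = lon - 360
    while lon <= -180:
        lon = lon + 360
    return lon


class Point:
    """A lat/lon point."""
    def __init__(self, lat, lon):
        self.lat = lat
        self.lon = wrap_longitude(lon)

    def move(self, delta_lat=0, delta_lon=0):
        """Shifts the point, keeping the longitude wrapped."""
        self.lat = self.lat + delta_lat
        self.lon = wrap_longitude(self.lon + delta_lon)


point_fiji = Point(lat=-17.8, lon=178.0)
point_fiji.move(delta_lon=5)
print(point_fiji.lon)
```

This is the "avoid repetition" and "consistency" advice from the functions section applied inside a class.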

### The \_\_repr\_\_ method

The `__repr__` method is another special method (like all methods whose names begin and end with a double underscore). It lets you control how your instance is represented when you print it.

Python classes have a default `__repr__` method that isn't very user friendly:

In [None]:
print(point_half_dome)

Adding our own method can help with debugging and logging:

In [None]:
class Point:
    """A lat/lon point."""
    def __init__(self, lat, lon):
        self.lat = lat

        # Wrap longitude to range (-180, 180].
        while lon > 180:
            lon = lon - 360
        while lon <= -180:
            lon = lon + 360
        self.lon = lon

    def move(self, delta_lat=0, delta_lon=0):
        """Shifts the point."""
        self.lat = self.lat + delta_lat
        self.lon = self.lon + delta_lon
    
    def __repr__(self):
        return f"{self.lat}, {self.lon}"
    
point_half_dome = Point(lat=37.745, lon=-119.533)
print(point_half_dome)
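
A common convention is to make the representation look like the code that would recreate the object. Here is a sketch of that style using a separate, simplified `Waypoint` class (a made-up name, so it doesn't overwrite `Point`). It also shows that containers such as lists use `__repr__` for their items.

```python
class Waypoint:
    """A simplified lat/lon point, used only to illustrate __repr__."""
    def __init__(self, lat, lon):
        self.lat = lat
        self.lon = lon

    def __repr__(self):
        return f"Waypoint(lat={self.lat}, lon={self.lon})"


waypoints = [Waypoint(37.745, -119.533), Waypoint(36.578, -118.292)]
print(waypoints)
```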

### Inheritance

Inheritance is used to specialise existing classes.

With inheritance you take an existing class and add new attributes, or add/modify existing methods.

Let's make a new class that represents a streamgage. We can add a new attribute by overriding the `__init__` method:



In [None]:
class Streamgage(Point):
    """A USGS streamgage."""
    def __init__(self, lat, lon, usgs_id):
        super().__init__(lat, lon)
        self.usgs_id = usgs_id

The line `super().__init__(lat, lon)` calls the parent's `__init__` method. It's a bit like going `self = Point(lat, lon)`.

And we can also add a brand new method that wouldn't make sense on the generic parent `Point` class.

In [None]:
class Streamgage(Point):
    """A USGS streamgage."""
    def __init__(self, lat, lon, usgs_id):
        super().__init__(lat, lon)
        self.usgs_id = usgs_id

    def current_flowrate(self):
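        """The current flowrate at the gage, in ft³s⁻¹."""
        # NOTE: query_usgs_api is not defined in this notebook; it stands in
        # for whatever helper you use to query the USGS API.
        return query_usgs_api(site=self.usgs_id, field="streamflow")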

Because `Streamgage` inherits from `Point`, and because we called Point's `__init__` method inside of our own, we get to keep all the helpful functionality of `Point` like the longitude wrapping and moving:

In [None]:
gage = Streamgage(lat=35.8494, lon=243.7692, usgs_id="10251300")
gage.move(delta_lat=0.00025)
gage

You can see we've also inherited the `__repr__` method that only displays the coordinates. As an exercise, can you copy the `Point` and `Streamgage` classes, then add a `__repr__` method to `Streamgage` that also displays the id?
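
As a hint for the general pattern (without giving the exercise away), here is a sketch using two unrelated, made-up classes, `Sensor` and `RainGauge`. It shows a child class overriding a method while still reusing the parent's version through `super()`, and confirms that an instance of the child also counts as an instance of the parent.

```python
class Sensor:
    """Generic parent class."""
    def __init__(self, sensor_id):
        self.sensor_id = sensor_id

    def describe(self):
        return f"Sensor {self.sensor_id}"


class RainGauge(Sensor):
    """Child class that overrides one method and reuses the parent's."""
    def __init__(self, sensor_id, units):
        super().__init__(sensor_id)
        self.units = units

    def describe(self):
        # Extend, rather than replace, the parent's behaviour.
        return f"{super().describe()} (rain, {self.units})"


gauge = RainGauge("RG-7", units="mm")
print(gauge.describe())
print(isinstance(gauge, Sensor), isinstance(gauge, RainGauge))
```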

### Dataclasses

Classes are helpful but as you can see, object oriented programming gets complicated fast!

If you just want a simple structure to hold a few variables together, the builtin `dataclasses` module automates away some of the annoying boilerplate. You create a dataclass using the `dataclasses.dataclass` decorator, and you have to specify the type of each attribute:

In [None]:
import dataclasses

@dataclasses.dataclass
class SimulationConfig:
    results_path: str
    num_iterations: int
    num_trials: int = 1

    def total_num_steps(self):
        return self.num_iterations * self.num_trials
    

SimulationConfig("/tmp/results.csv", num_iterations=5)

See how we didn't need to define an `__init__` method, and `__repr__` was also handled for us!

In data science, I often start out with dataclasses, and upgrade them to normal classes when I need control over `__init__` or complex inheritance.
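
A couple of other conveniences come for free with dataclasses. The sketch below repeats a trimmed `SimulationConfig` and shows the generated equality check plus `dataclasses.asdict`, which converts an instance to a plain dictionary; the variable names `config_a` and `config_b` are only for illustration.

```python
import dataclasses


@dataclasses.dataclass
class SimulationConfig:
    results_path: str
    num_iterations: int
    num_trials: int = 1


config_a = SimulationConfig("/tmp/results.csv", num_iterations=5)
config_b = SimulationConfig("/tmp/results.csv", num_iterations=5)

print(config_a)  # Generated __repr__.
print(config_a == config_b)  # Generated __eq__ compares field by field.
print(dataclasses.asdict(config_a))  # Plain dict, handy for saving.
```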

### When to use OOP in data science

OOP is great for representing *complex systems* of different structures of data.

Data science, in contrast, often deals with simple data forms that undergo *complex transformations*.

As a result, functional programming tends to be the dominant approach in data science.

That said, there are lots of places where OOP can help your data science development:

* Productionizing data science analysis as an external tool or service.
* Representing configuration and model inputs.
* Data that's too complex to be a row in a datatable: rasters, experiment results, other domain-specific data structures.

And finally, some of the biggest tools in data science are very object oriented (pandas, scikit-learn). So even if you're not writing custom classes, understanding classes will help.

## Modules

The highest-level tool Python gives us for code organization is modules.

A module is just a Python file, while a package is a collection of modules.
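
To make "a module is just a Python file" concrete, here is a self-contained sketch. It writes a small file called `unit_conversions.py` (a hypothetical name) into a temporary folder and then imports it. Normally you would simply create the file in your editor next to your notebook or script; the temporary folder and the `sys.path` line are only there so the example runs anywhere.

```python
import importlib
import pathlib
import sys
import tempfile

# Write a tiny module to a temporary folder.
module_dir = pathlib.Path(tempfile.mkdtemp())
module_source = '''
ACRE_FOOT_IN_M3 = 1_233.482


def acre_feet_to_m3(volume_acre_feet):
    """Converts volume in acre-feet to cubic metres."""
    return volume_acre_feet * ACRE_FOOT_IN_M3
'''
(module_dir / "unit_conversions.py").write_text(module_source, encoding="utf-8")

# Tell Python where to look, then import the file by its name (minus ".py").
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
import unit_conversions

converted = unit_conversions.acre_feet_to_m3(2)
print(converted)
```

Everything defined at the top level of the file (functions, constants, classes) becomes available as `unit_conversions.<name>`, exactly like `statistics.mode` earlier.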
















1. Functions   
   1. Anatomy of a function  
      1. Arguments (positional, default, keyword, variable)  
      2. Returning outputs  
      3. Variable scope  
      4. Documenting functionality with docstrings  
      5. Type annotations  
   2. Modularization: when to use a function  
   4. Function decorators  
   5. The functools library  
      1. Freezing arguments with functools.partial  
      2. Storing outputs with functools.cache  
3. Object oriented programming  
   1. Objects and classes  
      1. Class syntax  
      2. Instances vs classes  
      3. Methods  
         1. Init  
         2. \_\_str\_\_  
         3. Classmethod and staticmethod  
      4. Property methods  
      5. When to use classes  
      6. OOP for data science  
   2. Dataclasses  
      1. Constructing dataclasses  
      2. Dataclasses vs dictionaries  
   3. Inheritance  
      1. Overriding functionality  
      2. Calling super()  
      3. Extending dataclasses
   3. Debugging nested errors with tracebacks  

2. Modules  
   1. Importing code from other modules  
   2. Module layout  
   3. Auto-reloading imported modules in jupyter  
   4. Packages  
