<a href="https://colab.research.google.com/github/travisormsby/python-tips-tricks/blob/main/docs/PerformanceMemory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Performance and Memory

When you begin to use Python regularly in your work, you'll start noticing bottlenecks in your code. Some workflows may run at lightning speed, while others take hours of processing time to complete, or even crash.

Avoiding bloat is invaluable as you move toward using code for automation, bigger data, and working with APIs. Code efficiency means:
- Less chance of a slowdown or crash: the dreaded MemoryError.
- Quicker response time and fewer bottlenecks for the larger workflow.
- Better scaling.
- Efficient code is often (but not always!) cleaner and more readable.

Let's look at some ways you can reduce bloat in your code.

## Choosing the efficient code alternative

tl;dr
<br>Access and store only what you need, no more.
- __Storage__: avoid a list where you could use a tuple
- __Membership look-up__: avoid a list (or tuple) where you could use a set (or dictionary)
- __Iteration__: avoid a function where you could use a generator
- __Calculation__: avoid a loop where you could use vector operations

### Storage: lists vs. tuples

If you have a collection of values, your first thought may be to store them in a list.

In [None]:
data_list = [17999712, 2015, 'Hawkins Road', 'Linden ', 'NC', 28356]

Lists are nice because they are very flexible. You can change the values in the list, including appending and removing values. But that flexibility comes at a cost. Lists are less efficient than tuples. For example, they use more memory.

In [None]:
import sys

data_tuple = (17999712, 2015, 'Hawkins Road', 'Linden ', 'NC', 28356)

print(sys.getsizeof(data_list))
print(sys.getsizeof(data_tuple))

104
88


If you aren't going to be changing the values in a collection, use a tuple instead of a list.

### Membership look-up: lists vs. sets and dictionaries

When you need to check whether an element _already exists_ in a collection, store that collection in a set or dictionary if possible.

- List and tuple look-up is **sequential**, which takes *O(n): linear time*.
    - Python scans the list from the start until it finds a match (or reaches the end).
    - Worst case: it has to look at every element.
- Set and dictionary look-up is **hash-based**, which takes *O(1): constant time* on average.
    - No matter how big the collection is, a look-up typically checks only one location (a few more if hashes collide).
    - Sets and dictionaries are built on hash tables. Python computes the hash of the element and jumps straight to where it should be stored.
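A minimal sketch of the difference, using the standard-library `timeit` module. The collection size and the looked-up value are arbitrary choices for illustration, and actual timings will vary by machine.

```python
import timeit

# Hypothetical data: 100,000 integers stored two ways.
values_list = list(range(100_000))
values_set = set(values_list)

# Worst case for the list: the target is the last element,
# so the scan has to pass over every item before it.
target = 99_999

list_time = timeit.timeit(lambda: target in values_list, number=100)
set_time = timeit.timeit(lambda: target in values_set, number=100)

print(f"list look-up: {list_time:.6f} s for 100 checks")
print(f"set look-up:  {set_time:.6f} s for 100 checks")
```

Both checks give the same answer; only the time needed to reach it differs.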

The example below shows that a set is over 1000x faster than a list in calculating the first 100,000 values of [Recaman's sequence](https://oeis.org/search?q=recaman&language=english&go=Search).

In [None]:
def recaman_check(cur, i, visited):
    return (cur - i) < 0 or (cur - i) in visited

def recaman_list(n: int) -> list[int]:
    """
    return a list of the first n numbers of the Recaman series
    """

    visited_list = [0]
    current = 0
    for i in range(1, n):
        if recaman_check(current, i, visited_list):
            current += i
        else:
            current -= i
        visited_list.append(current)
    return visited_list

In [None]:
%%timeit
recaman_list(100000)

33.6 s ± 1.62 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
def recaman_set(n: int) -> set[int]:
    """
    return the set of values visited in the first n numbers of the Recaman series
    """

    visited_set = {0}
    current = 0
    for i in range(1, n):
        if recaman_check(current, i, visited_set):
            current += i
        else:
            current -= i
        visited_set.add(current)
    return visited_set

In [None]:
%%timeit
recaman_set(100000)

23.7 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


When you add an element to a set...
1. Python calls the element’s `__hash__()` method to get a hash value (an integer);
1. That hash value determines where the element will be stored in the set's internal structure; and
1. When checking if an element is in the set, Python uses the hash to quickly find it.
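The steps above can be observed with the built-in `hash()` function. This is a small sketch; the example values are arbitrary, and the exact hash integers differ between runs and Python versions, so none are shown.

```python
place = "Dinkytown"

# hash() calls the object's __hash__() method and returns an integer.
place_hash = hash(place)
print(type(place_hash))

# Equal objects always produce equal hashes, which is what lets a set
# find an element without scanning.
same_hash = hash("Dinkytown") == place_hash

# Mutable objects such as lists are unhashable, so they cannot be set members.
try:
    {["Kinshasa", "Duluth"]}
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# A tuple of hashable items is itself hashable.
places = {("Kinshasa", "Duluth")}

print(same_hash)
print(list_is_hashable)
print(("Kinshasa", "Duluth") in places)
```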

### Iteration: functions vs. generators

When working with dataframes, we often use functions to operate on data, but generators can be more memory-efficient and faster for certain tasks.

**Regular functions and comprehensions** typically store outputs into containers, like lists or dictionaries. This can take up unnecessary memory, especially when we're creating multi-step workflows with many intermediate outputs.

In contrast, **generators** only hold one data item in memory at a time. A generator is a type of iterator that produces results on-demand (lazily), maintaining its state between iterations.

On the surface, a generator looks similar to a function. In both cases, you:
- define it with `def`,
- provide the logic, and
- hand back the result, either with a return statement (for functions) or a yield statement (for generators).
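As a minimal sketch of that difference, here are two hypothetical helpers (`squares_list` and `squares_gen` are illustrative names, not part of the lesson) that compute the same values, one with return and one with yield.

```python
def squares_list(n):
    """Regular function: builds the whole list, then returns it."""
    results = []
    for i in range(n):
        results.append(i * i)
    return results

def squares_gen(n):
    """Generator: yields one value at a time, on demand."""
    for i in range(n):
        yield i * i

all_at_once = squares_list(5)
lazy = squares_gen(5)

print(all_at_once)      # the complete list already exists
first = next(lazy)      # computes only the first value
second = next(lazy)     # resumes where it paused and computes the next one
remaining = list(lazy)  # consumes whatever is left
print(first, second, remaining)
```

Calling `squares_gen(5)` runs none of the loop body; the work happens only as values are requested.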

Imagine you have a large dataset containing millions of employee records. You want to calculate the combined hourly rates of all employees on an annual salary.

In [21]:
# For the sake of simplicity, we'll represent the dataset with a small sample.
employeeDatabase = [
  {'lastName': 'Knope', 'rate': 52000, 'pay_class': 'annual'},
  {'lastName': 'Gergich', 'rate': 9, 'pay_class': 'hourly'},
  {'lastName': 'Ludgate', 'rate': 50000, 'pay_class': 'annual'},
  {'lastName': 'Swanson', 'rate': 'redacted', 'pay_class': 'redacted'},
  {'lastName': 'Haverford', 'rate': 42000, 'pay_class': 'annual'}
]

You can use a function for this, but it builds the complete list of results in memory before returning it.

In [33]:
def hourly_rate(payments):
  """Function that returns a list of hourly rates for workers on an annual salary."""
  hourlyRates = []
  for worker in payments:
    if worker.get('pay_class') == 'annual':
      hourly = worker['rate'] / 2080  # 2080 = 40 hours/week * 52 weeks
      hourlyRates.append(hourly)
  return hourlyRates

# Use the function to sum hourly rates for those receiving an annual salary.
salariesPerHour = sum(hourly_rate(employeeDatabase))

print(f"Total disbursements per hour for salaried employees: ${salariesPerHour:.2f}")

Total disbursements per hour for salaried employees: $69.23


If the input dataset is huge, this eats up a ton of space. Instead, what if we process data lazily, storing one row in memory at a time?

In [34]:
def hourly_rate_gen(payments):
  """Generator that yields a worker's hourly rate based on annual salary."""
  for worker in payments:
    if worker.get('pay_class') == 'annual':
      hourly = worker['rate'] / 2080  # 2080 = 40 hours/week * 52 weeks
      yield hourly

# Use the generator to sum hourly rates for those receiving an annual salary.
salariesPerHour = sum(hourly_rate_gen(employeeDatabase))

print(f"Total disbursements per hour for salaried employees: ${salariesPerHour:.2f}")

Total disbursements per hour for salaried employees: $69.23


In a function, the return statement hands back the result only after the function has run from start to finish, so the function must *hold every value in memory* until then.

In a generator, the yield statement hands back values *one at a time*; when yield is executed, the generator pauses, retaining its state until the next value is requested.

Generators are a powerful tool for GIS and remote sensing. You can set up **generator pipelines** to string multiple tasks together lazily. These are hugely helpful for complex spatial analysis workflows, such as raster processing.
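A minimal sketch of a generator pipeline, using plain Python values rather than real raster data. The stage names (`read_values`, `drop_nodata`, `scale`) and the nodata value of -9999 are assumptions made for illustration.

```python
def read_values(rows):
    """Stage 1: yield cell values one at a time from rows of data."""
    for row in rows:
        for value in row:
            yield value

def drop_nodata(values, nodata=-9999):
    """Stage 2: skip cells flagged with the (assumed) nodata value."""
    for value in values:
        if value != nodata:
            yield value

def scale(values, factor):
    """Stage 3: apply a scale factor to each remaining value."""
    for value in values:
        yield value * factor

# Hypothetical stand-in for a small raster band.
band = [
    [10, 20, -9999],
    [30, -9999, 40],
]

# Chaining the stages does no work yet; values flow through
# one at a time only when sum() starts asking for them.
pipeline = scale(drop_nodata(read_values(band)), factor=0.5)
total = sum(pipeline)
print(total)
```

Each stage holds only the value it is currently handling, so no intermediate list of filtered or scaled cells is ever built.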

#### Iteration, continued: List comprehension vs. generator expression

Generator expressions (also known as generator comprehensions) are concise, one-line generators. Generator expressions can be a handy replacement for list comprehensions.

Let's look at how the function above would appear in list comprehension format.

In [48]:
hourlyRates = [worker['rate'] / 2080 for worker in employeeDatabase if worker.get('pay_class') == 'annual']
print(sys.getsizeof(hourlyRates))  # size of the intermediate list

salariesPerHour = sum(hourlyRates)
print(f"${salariesPerHour:.2f}")

88
$69.23


As with the function, the list comprehension constructs a list of n values. Then, we use sum() to add all values in the list together.

A generator expression looks almost identical to a list comprehension: simply swap out square brackets with parentheses.

*Here's a fun tip: When a generator expression is the only argument in a function call (in this case, sum()), you can drop the inner parentheses.*

In [47]:
salariesPerHour = sum(worker['rate'] / 2080 for worker in employeeDatabase if worker.get('pay_class') == 'annual')
print(sys.getsizeof(salariesPerHour))  # size of the float result; no intermediate list is built

print(f"${salariesPerHour:.2f}")

24
$69.23
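To see the memory difference more directly, here is a sketch that compares the containers themselves rather than their summed result. The range of one million values is an arbitrary choice; exact byte counts depend on the Python version, so none are shown.

```python
import sys

n = 1_000_000

squares_list = [i * i for i in range(n)]   # builds all n values up front
squares_gen = (i * i for i in range(n))    # builds nothing until iterated

# getsizeof reports the container only, not the integers it refers to.
size_list = sys.getsizeof(squares_list)
size_gen = sys.getsizeof(squares_gen)
print(f"list comprehension:   {size_list:,} bytes")
print(f"generator expression: {size_gen:,} bytes")

# Both produce the same total.
total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)
print(total_from_list == total_from_gen)
```

The generator's size stays the same no matter how large `n` gets, while the list grows with it.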


### Calculation: loops vs. vector operations

## Profiling: finding bottlenecks

tl;dr
<br>Make time for performance checks.

Resources:
1. __Spot-profile your code.__ Use the `timeit` notebook magic to perform some basic profiling by cell or by line.
1. __Profile your script comprehensively.__ The `cProfile` module breaks execution down call by call, reporting the number of calls to each function and the total time spent in each.

_Note: You also saw `sys.getsizeof()` earlier, which you can use to check memory size of variables. Memory and performance are interrelated._

### Spot-check with `%%timeit`

`%timeit` is a form of _line magic_. Line magic arguments only extend to the end of the current line.

`%%timeit` is a form of _cell magic_. It measures the execution time of the entire notebook cell.

Two parameters to consider:
- `-n` is the number of loops: how many times the statement is executed per run
- `-r` is the number of repeats: how many runs are timed (7 by default)

In line mode, you can time a single-line statement (though multiple statements can be chained using semicolons).

In cell mode, the statement in the first line is used as setup code (executed but not timed) and the body of the cell is timed. The cell body has access to any variables created in the setup code.
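Outside a notebook, the standard-library `timeit` module offers the same controls: its `number` argument plays the role of `-n` and `repeat` plays the role of `-r`. A minimal sketch, with an arbitrary statement and arbitrary counts:

```python
import timeit

# setup runs once per repeat and is not timed, like the setup line in cell mode.
setup = "data = list(range(1000))"
stmt = "sum(data)"

# 5 repeats, each executing the statement 1,000 times.
timings = timeit.repeat(stmt, setup=setup, number=1000, repeat=5)

# Each entry is the total time for one repeat; divide by number for per-loop time.
best_per_loop = min(timings) / 1000
print(f"{len(timings)} runs, best: {best_per_loop * 1e6:.2f} µs per loop")
```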

In [25]:
%timeit sum(hourly_rate(employeeDatabase))
%timeit sum(hourly_rate_gen(employeeDatabase))

1.04 µs ± 245 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
888 ns ± 10.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


### Profile with `cProfile`

`%%timeit` reports only a total time for a line or cell, so it can't tell you which calls inside that code take the longest to execute. `cProfile` can.
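A minimal sketch of call-by-call profiling, using `cProfile` and `pstats` from the standard library. The functions `slow_lookup`, `fast_lookup`, and `workflow` are hypothetical stand-ins for the steps of a real script.

```python
import cProfile
import io
import pstats

def slow_lookup(n):
    """Membership checks against a list: O(n) each."""
    seen = list(range(n))
    hits = 0
    for i in range(n):
        if i in seen:
            hits += 1
    return hits

def fast_lookup(n):
    """Membership checks against a set: O(1) each on average."""
    seen = set(range(n))
    hits = 0
    for i in range(n):
        if i in seen:
            hits += 1
    return hits

def workflow():
    return slow_lookup(2000), fast_lookup(2000)

profiler = cProfile.Profile()
profiler.enable()
result = workflow()
profiler.disable()

# Sort by tottime (time spent inside each function itself) and show the top 5 rows.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime").print_stats(5)
report = stream.getvalue()
print(report)
```

In the printed table, `ncalls` is the number of calls and `tottime` is the time spent in the function itself, excluding the functions it calls.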

---

# Exercises

__Exercises summary__
1. Replace lists with efficient alternatives
    1. Storage: List to tuple
    1. Look-up: List to set
1. Replace sequences with efficient alternatives
    1. Iteration: List comprehension to generator expression
    1. Calculation: Loop to vector math
1. Check for speed bottlenecks
    1. Compare differences in speed with `timeit`
    1. Check for speed bottlenecks in detail with `cProfile`

## 1) Replace lists with efficient alternatives

### 1a) Tuple-based storage

The code below creates a list containing all years in a research study timeframe, from 1900 to 2030.

The values in this collection will not need to be changed because the study will always use this timeframe.

In [None]:
import sys

def listFromRange(r1, r2):
  """Create a list from a range of values"""
  return [item for item in range(r1, r2+1)]

start = 1900
end = 2030

studyYears = listFromRange(start, end)

print(studyYears)
print("Bytes used: ", sys.getsizeof(studyYears))

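[1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027, 2028, 2029, 2030]
Bytes used:  1240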


**Your turn:** For the same timeframe, write a different implementation using a storage option that takes up less memory.

In [None]:
# # # Exercise solution # # #

def tupleFromRange(r1, r2):
  """Create a tuple from a range of values"""
  return tuple(range(r1, r2+1))

start = 1900
end = 2030

studyYears = tupleFromRange(start, end)

print(studyYears)
print("Bytes used: ", sys.getsizeof(studyYears))

(1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027, 2028, 2029, 2030)
Bytes used:  1088


### 1b) Set-based look-up

The code below assigns a collection of placenames to a list. Then, it checks whether a placename is in the list. If not, the placename is reported missing.

If you have 1 million placenames to look up and 6 names in the list, that’s up to 6 million checks.

In [None]:
placeNames_list = ["Kinshasa", "Duluth", "Uruguay", "Doherty Residence", "Dinkytown", "Khazad-dum"]

# List look-up
if "Dinkytown" not in placeNames_list:
    print("Missing.")  # O(n) look-up

**Your turn:** Write a different implementation using a storage option that allows quicker checks for membership.

In [None]:
# # # Exercise solution # # #

placeNames_set = set(placeNames_list)

# Set look-up
if "Dinkytown" not in placeNames_set:
    print("Missing.")  # O(1) look-up

## 2) Replace sequences with efficient alternatives

### 2a) Generator expression

### 2b) Vector math

## 3) Check for speed bottlenecks

### 3a) Compare differences in speed using `timeit`

Using `%%timeit`, compare the time it took to create myDataPaths as a list (original code) versus as a tuple (exercise solution).

In [None]:
%%timeit
print([filePath for filePath in directory])

In [None]:
%%timeit
## Your solution here ##

Use `%%timeit` again to compare list-based lookup to set intersection.

In [None]:
%%timeit
import csv

def meetsCriteria(filePaths):
    """
    Dataframe must have a 'lat' field to be included.
    """
    members = []
    criterion = 'lat'

    for filePath in filePaths:
        with open(filePath) as fPath:
            headerList = csv.DictReader(fPath).fieldnames
            if criterion in headerList:
                members.append(filePath)
    return members


# Print all matching file paths
print(meetsCriteria(myDataPaths))

In [None]:
%%timeit
## Your solution here ##

Finally, compare the second list vs. set change that you made.

In [None]:
%%timeit
import csv

def meetsCriteria(filePaths):
    """
    Dataframe must have a 'lat' field to be included.
    """
    members = []
    criterion = 'lat'

    for filePath in filePaths:
        with open(filePath) as fPath:
            headerList = csv.DictReader(fPath).fieldnames
            if criterion in headerList:
                members.append(filePath)
    return members


# Print all matching file paths
print(meetsCriteria(myDataPaths))

In [None]:
%%timeit
## Your solution here ##

### 3b) Check for speed bottlenecks in detail using `cProfile`

Use `cProfile` to locate the slowest calls in your improved script.

Hint: Sort by `tottime` instead of name to find hotspots more easily.