# Performance and Memory

When you begin to use Python regularly in your work, you'll start noticing bottlenecks in your code. Some workflows may run at lightning speed, while others take hours of processing time to complete, or even crash.

Avoiding bloat is invaluable as you move toward using code for automation, bigger data, and working with APIs. Code efficiency means:
- Less chance of a slowdown or crash: the dreaded MemoryError.
- Quicker response time and fewer bottlenecks for the larger workflow.
- Better scaling.
- Often (but not always!) cleaner, more readable code.

Let's look at some ways you can reduce bloat in your code.

## Memory

tl;dr
<br>Access and store only what you need, no more.
- __Storage__: avoid a list where you could use a tuple
- __Membership look-up__: avoid a list/tuple where you could use a set/dictionary
- __Iteration__: avoid a sequence where you could use a generator
- __Calculation__: avoid a loop where you could use vectorized math

### Storage: lists vs. tuples

If you have a collection of values, your first thought may be to store them in a list.

In [None]:
data_list = [17999712, 2015, 'Hawkins Road', 'Linden ', 'NC', 28356]

Lists are nice because they are very flexible. You can change the values in the list, including appending and removing values. But that flexibility comes at a cost. Lists are less efficient than tuples. For example, they use more memory.

In [None]:
import sys

data_tuple = (17999712, 2015, 'Hawkins Road', 'Linden ', 'NC', 28356)

print(sys.getsizeof(data_list))
print(sys.getsizeof(data_tuple))

104
88


If you aren't going to be changing the values in a collection, use a tuple instead of a list.

### Membership look-up: lists vs. sets and dictionaries

When you want to check whether an element _already exists_ in a collection, store that collection in a set or dictionary if possible.

Lists and tuples require **sequential look-up** to see if an element is a member of the collection. On average, a successful look-up makes about n/2 comparisons for a collection of length n. Sets and dictionaries, by contrast, are built on **hash tables**: the element's hash points directly to where it would be stored. That means that no matter how big the collection is, a look-up typically checks only one location.

Fun fact: A set can use a hash table for look-ups, similar to a dictionary, because every element in a set must be hashable. A set behaves much like a dictionary that stores only keys, which is also why its elements are unique.

- List and tuple look-up takes _O(n): linear time_. Time increases linearly with the number of elements.
    - With lists, Python scans from the start of the list until it finds a match (or reaches the end).
    - Worst case: it has to look at every element.

- Set and dictionary look-up takes _O(1): constant time_ on average. It takes about the same time no matter the size of the data.
    - Sets are built on hash tables. Python computes the hash of the element and jumps straight to where it should be stored.
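
The difference between the two look-up strategies can be measured directly. The snippet below is a minimal sketch using the standard-library `timeit` module; the collection size and the variable names are illustrative choices, and the exact timings will vary by machine.

```python
import timeit

n = 100_000
values_list = list(range(n))
values_set = set(values_list)
target = n - 1  # worst case for the list: the match is the last element

# Both containers give the same answer to the membership question...
found_in_list = target in values_list
found_in_set = target in values_set

# ...but the list is scanned element by element, while the set is hashed into.
list_seconds = timeit.timeit(lambda: target in values_list, number=100)
set_seconds = timeit.timeit(lambda: target in values_set, number=100)

print(f"list: {list_seconds:.6f} s for 100 look-ups")
print(f"set:  {set_seconds:.6f} s for 100 look-ups")
```

Increasing `n` should slow the list look-up roughly in proportion, while the set look-up stays about the same.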

The example below shows that a set is over 1000x faster than a list in calculating the first 100,000 values of [Recaman's sequence](https://oeis.org/search?q=recaman&language=english&go=Search).

In [None]:
def recaman_check(cur, i, visited):
    return (cur - i) < 0 or (cur - i) in visited

def recaman_list(n: int) -> list[int]:
    """
    return a list of the first n numbers of the Recaman series
    """

    visited_list = [0]
    current = 0
    for i in range(1, n):
        if recaman_check(current, i, visited_list):
            current += i
        else:
            current -= i
        visited_list.append(current)
    return visited_list

In [None]:
%%timeit
recaman_list(100000)

33.6 s ± 1.62 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
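def recaman_set(n: int) -> set[int]:
    """
    return the set of values among the first n numbers of the Recaman series
    (a set is unordered and stores repeated values only once)
    """
    visited_set = {0}
    current = 0
    for i in range(1, n):  # use n instead of a hard-coded 100_000
        if recaman_check(current, i, visited_set):
            current += i
        else:
            current -= i
        visited_set.add(current)
    return visited_set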

In [None]:
%%timeit
recaman_set(100000)

23.7 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


When you add an element to a set...
1. Python calls the element’s `__hash__()` method to get a hash value (an integer);
1. That hash value determines where the element will be stored in the set's internal structure; and
1. When checking if an element is in the set, Python uses the hash to quickly find it.
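
The steps above can be observed from regular Python code. This is a small sketch using the built-in `hash()` function (which calls `__hash__()`); the place names are only sample values.

```python
# Equal values produce equal hashes, which is how a set recognizes a duplicate.
same_hash = hash("Duluth") == hash("Duluth")

places = {"Duluth", "Kinshasa"}
places.add("Duluth")  # same hash and equal value, so nothing new is stored
size_after_duplicate = len(places)

# A list is mutable and does not provide a hash, so it cannot be a set element.
error_message = ""
try:
    places.add(["Dinkytown", "MN"])
except TypeError as err:
    error_message = str(err)

# An immutable tuple of hashable values can be hashed and stored.
places.add(("Dinkytown", "MN"))

print(same_hash, size_after_duplicate, error_message, len(places))
```

This is also why only hashable objects (strings, numbers, tuples of hashable values, and so on) can be set elements or dictionary keys.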

### Iteration: sequences vs. generators

Regular functions and comprehensions typically create a container type like a list or a dictionary to store the data that results from the function’s intended computation. All this data is stored in memory at the same time.

In contrast, iterators keep only one data item in memory at a time, generating the next items on demand or lazily.

With iterators and generators, you don’t need to store all the data in your computer’s memory at the same time.

Iterators and generators also allow you to completely decouple iteration from processing individual items. They let you connect multiple data processing stages to create memory-efficient data processing pipelines.
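
As a rough illustration of both points, the sketch below compares a list comprehension with a generator expression and then chains two generator stages into a small pipeline. The size `n` and the variable names are arbitrary choices for the example.

```python
import sys

n = 100_000

squares_list = [i * i for i in range(n)]  # builds all n values up front
squares_gen = (i * i for i in range(n))   # builds nothing until iterated

list_size = sys.getsizeof(squares_list)   # grows with n
gen_size = sys.getsizeof(squares_gen)     # small, independent of n
print(list_size, gen_size)

# Pipeline: each stage pulls one item at a time from the previous stage.
even_squares = (s for s in squares_gen if s % 2 == 0)
total = sum(even_squares)
print(total)
```

One trade-off to keep in mind: a generator can be consumed only once, so after `sum()` finishes, `squares_gen` is exhausted.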

When working with dataframes, we often use functions to operate on data, but generators can be more memory-efficient and faster for certain tasks—especially when you're processing rows one at a time or streaming large datasets.

Let’s say you have a huge CSV that you want to process row by row, applying some logic to each row. Using a generator here helps avoid loading the entire DataFrame into memory.

In [None]:
import pandas as pd

def process_all_rows(filepath):
    df = pd.read_csv(filepath)
    for _, row in df.iterrows():
        process_row(row)

def process_row(row):
    # Imagine some expensive operation here
    if row["value"] > 1000:
        print(row["name"], row["value"])

If the CSV is huge, this can eat up memory. Instead, what if we process data in chunks or rows lazily?

In [None]:
import pandas as pd

def row_generator(filepath, chunksize=1000):
    for chunk in pd.read_csv(filepath, chunksize=chunksize):
        for _, row in chunk.iterrows():
            yield row

def process_large_file(filepath):
    for row in row_generator(filepath):
        if row["value"] > 1000:
            print(row["name"], row["value"])

When to prefer a generator:
- You're dealing with very large datasets that would be cumbersome to load into memory.
- You want to start processing before loading everything.
- You're doing line-by-line processing, not vectorized Pandas ops.
- Streaming data or preprocessing before database insertions.
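
The same lazy pattern also works without pandas. The sketch below streams rows with the standard-library `csv` module; the in-memory sample text stands in for a large file, the rows are made up for illustration, and the `name` and `value` columns follow the example above. The function name `rows_over_threshold` is hypothetical.

```python
import csv
import io


def rows_over_threshold(lines, threshold=1000):
    """Yield (name, value) pairs lazily, reading one CSV row at a time."""
    for row in csv.DictReader(lines):
        value = float(row["value"])
        if value > threshold:
            yield row["name"], value


# Stand-in for open("big_file.csv"); any iterable of text lines works.
sample_file = io.StringIO("name,value\nalpha,500\nbeta,1500\ngamma,2500\n")

matches = list(rows_over_threshold(sample_file))
print(matches)
```

Because `csv.DictReader` reads from the file object line by line, only the current row needs to be held in memory while filtering.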

### Calculation: loop versus intersection

## Performance

tl;dr
<br>Make time for performance checks.

Resources:
1. __Spot-profile your code.__ Use the `timeit` notebook magic to perform some basic profiling by cell or by line.
1. __Profile your script comprehensively.__ The `cProfile` module has the ability to break down call by call to determine the number of calls and the total time spent on each.
1. __Memory:__ You also saw `sys.getsizeof()` earlier, which you can use to check memory size of variables. Memory and performance are interrelated.
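
One caveat when using `sys.getsizeof()` for memory checks: it reports only the size of the container object itself, not the objects it refers to. The sketch below, with made-up sample values, adds the element sizes by hand to get a closer one-level estimate.

```python
import sys

names = ["Hawkins Road", "Linden", "NC"]

# Size of the list object alone (its header and references to the elements).
container_only = sys.getsizeof(names)

# Add the size of each string the list refers to (one level deep only).
with_elements = container_only + sum(sys.getsizeof(name) for name in names)

print(container_only, with_elements)
```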

### Spot-check with `%%timeit`

Recall the two timed cells from the Recaman example: the list-based cell was much slower than the set-based one. Both were timed with `%%timeit`.

`%timeit` is a form of _line magic_. Line magic arguments only extend to the end of the current line.

`%%timeit` is a form of _cell magic_. It measures the execution time of the entire notebook cell.

Two parameters to consider:
- `-n` is the number of loops: how many times the statement is executed per run.
- `-r` is the number of repeats: how many runs are made (7 by default).

In line mode you can time a single-line statement (though multiple ones can be chained using semicolons).


In cell mode, the statement in the first line is used as setup code
(executed but not timed) and the body of the cell is timed.  The cell
body has access to any variables created in the setup code.
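
Outside a notebook, the standard-library `timeit` module offers the same controls. The sketch below is an approximate equivalent of the magics: `number` plays the role of `-n`, `repeat` plays the role of `-r`, and `setup` is executed but not timed. The statement being timed is just an illustrative membership test.

```python
import timeit

setup = "data = list(range(10_000))"  # executed once per run, not timed
stmt = "9_999 in data"                # the statement being timed

# 5 runs (like -r 5), each executing the statement 100 times (like -n 100).
timings = timeit.repeat(stmt, setup=setup, number=100, repeat=5)

best_per_loop = min(timings) / 100
print(f"best of 5 runs: {best_per_loop:.9f} s per loop")
```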

### Profile with `cProfile`

`%%timeit` reports a total time for each cell or line, but it can't tell you which calls inside that cell take the longest to execute. For that call-by-call breakdown, use `cProfile`.
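
Here is a minimal sketch of call-by-call profiling with `cProfile` and `pstats`. The functions `slow_membership`, `fast_membership`, and `workflow` are hypothetical stand-ins for your own code; they reuse the list-versus-set look-up pattern from earlier.

```python
import cProfile
import io
import pstats


def slow_membership(n):
    seen = []
    for i in range(n):
        if i not in seen:  # O(n) look-up on a list
            seen.append(i)
    return seen


def fast_membership(n):
    seen = set()
    for i in range(n):
        if i not in seen:  # O(1) look-up on a set
            seen.add(i)
    return seen


def workflow():
    slow_membership(2_000)
    fast_membership(2_000)


profiler = cProfile.Profile()
profiler.enable()
workflow()
profiler.disable()

# Sort by time spent inside each function itself and show the top 5 entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("tottime").print_stats(5)
print(report.getvalue())
```

In the report, `ncalls` is the number of calls, `tottime` is time spent in the function itself, and `cumtime` includes time spent in the functions it calls.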

## QA Workflow for Performance and Memory
1. Spot-check for instances of unnecessary memory use.
1. Replace those instances with low-memory alternatives.
1. If necessary, create a sample to profile on: same complexity, smaller size.
1. Profile: check for speed bottlenecks at a high level (`%%timeit`).
1. Profile: for the slowest cell from the previous step, check for speed bottlenecks at a granular level (`cProfile`).

---

# Exercises

__Exercises summary__
1. Replace lists with efficient alternatives
    1. Storage: List to tuple
    1. Look-up: List to set
    1. Look-up: List to dictionary
1. Replace sequences with efficient alternatives
    1. Iteration: List comprehension to generator expression
    1. Calculation: Loop to vector math
1. Check for speed bottlenecks
    1. Compare differences in speed with `timeit`
    1. Check for speed bottlenecks in detail with `cProfile`

## 1. Replace lists with efficient alternatives

### 1.1 Tuple-based storage

Scenario: You have CSVs stored in a subdirectory. Some of these CSVs can be converted to point shapefiles. For every CSV in the folder, you need to determine whether it contains a field called 'lat', which would indicate it has point coordinates.

In [None]:
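import csv
from pathlib import Path

# Create list of file paths where the data is stored.
# `directory` is the subdirectory that holds the CSVs; define it before running.
myDataPaths = [str(filePath) for filePath in Path(directory).glob("*.csv")]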


# Define a function for determining if a file meets your criteria.
def meetsCriteria(filePaths):
    """
    Dataframe must have a 'lat' field to be included.
    """
    members = []
    criterium = 'lat'

    for filePath in filePaths:
        with open(filePath) as fPath:
            headerList = csv.DictReader(fPath).fieldnames
            if criterium in headerList:
                members.append(filePath)
    return members


# Print all matching file paths
print(meetsCriteria(myDataPaths))

Change the myDataPaths list to a tuple.

In [None]:
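# Exercise solution
# tuple() is needed here: bare parentheses around a comprehension create a generator.
from pathlib import Path
myDataPaths = tuple(str(filePath) for filePath in Path(directory).glob("*.csv"))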

### 1.2 Set-based look-up

The code below assigns a collection of placenames to a list. Then, it checks whether a placename is in the list. If not, the placename is reported missing.

If you have 1 million placenames to look up and 6 names in the list, that’s up to 6 million checks.

In [None]:
placeNames_list = ["Kinshasa", "Duluth", "Uruguay", "Doherty Residence", "Dinkytown", "Khazad-dum"]

# List look-up
if "Dinkytown" not in placeNames_list:
    print("Missing.")  # O(n) look-up

Write a different implementation that uses a faster-performing storage option for the collection.

In [None]:
# # # Exercise solution # # #

placeNames_set = set(placeNames_list)

# Set look-up
if "Dinkytown" not in placeNames_set:
    print("Missing.")  # O(1) look-up

#### Change a list to a set, alternative example

__Removing duplicate census records by household ID__

Problem: You’re cleaning a census dataset and want to remove duplicate household records based on a unique ID.

First, we'll try using a list to track seen IDs (O(n) look-up each time):

In [None]:
seen_ids = []
cleaned_data = []

for record in census_data:
    if record.household_id not in seen_ids:  # O(n) lookup
        seen_ids.append(record.household_id)
        cleaned_data.append(record)

Now, use a set to track seen IDs instead (O(1) look-up each time):

In [None]:
seen_ids = set()
cleaned_data = []

for record in census_data:
    if record.household_id not in seen_ids:  # O(1) lookup
        seen_ids.add(record.household_id)
        cleaned_data.append(record)
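
To try the set-based version end to end, you need some records to run it on. The sketch below invents a tiny `census_data` sample with a hypothetical `Record` type; only the `household_id` attribute comes from the code above, and the `residents` field is made up for illustration.

```python
from collections import namedtuple

# Hypothetical record type; a real dataset would have many more fields.
Record = namedtuple("Record", ["household_id", "residents"])

census_data = [
    Record(101, 3),
    Record(102, 1),
    Record(101, 3),  # duplicate household
    Record(103, 4),
    Record(102, 1),  # duplicate household
]

seen_ids = set()
cleaned_data = []

for record in census_data:
    if record.household_id not in seen_ids:  # O(1) look-up
        seen_ids.add(record.household_id)
        cleaned_data.append(record)

cleaned_ids = [record.household_id for record in cleaned_data]
print(cleaned_ids)
```

The list `cleaned_data` keeps the first occurrence of each household in its original order, while the set is used only for the membership test.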

### 1.3 Dictionary-based look-up

Adapt the meetsCriteria function to add files to a set instead of appending to a list.
<br><br>Actions:
- Change the variable *members* to a set.
- Modify the line *members.append(filePath)* to use the *.add()* function.

In [None]:
# Exercise solution
import csv

def meetsCriteria(filePaths):
    """
    Dataframe must have a 'lat' field to be included.
    """
    members = set()  # {} would create an empty dictionary, not a set
    criterium = 'lat'

    for filePath in filePaths:
        with open(filePath) as fPath:
            # fieldnames is a list (or None for an empty file), so convert it to a set
            headerSet = set(csv.DictReader(fPath).fieldnames or [])
            if headerSet.intersection({criterium}):
                members.add(filePath)
    return members

## 2. Replace sequences with efficient alternatives

### 2.1 Generator expression

### 2.2 Vector math

## 3. Check for speed bottlenecks

### 3.1 Compare differences in speed using `timeit`

Using `%%timeit`, compare the time it took to create myDataPaths as a list (original code) versus as a tuple (exercise solution).

In [None]:
%%timeit
from pathlib import Path
print([str(filePath) for filePath in Path(directory).glob("*.csv")])

In [None]:
%%timeit
## Your solution here ##

Use `%%timeit` again to compare list-based lookup to set intersection.

In [None]:
%%timeit
def meetsCriteria(filePaths):
    """
    Dataframe must have a 'lat' field to be included.
    """
    members = []
    criterium = 'lat'

    for filePath in filePaths:
        with open(filePath) as fPath:
            headerList = csv.DictReader(fPath).fieldnames
            if criterium in headerList:
                members.append(filePath)
    return members


# Print all matching file paths
print(meetsCriteria(myDataPaths))

In [None]:
%%timeit
## Your solution here ##

Finally, compare the second list vs. set change that you made.

In [None]:
%%timeit
def meetsCriteria(filePaths):
    """
    Dataframe must have a 'lat' field to be included.
    """
    members = []
    criterium = 'lat'

    for filePath in filePaths:
        with open(filePath) as fPath:
            headerList = csv.DictReader(fPath).fieldnames
            if criterium in headerList:
                members.append(filePath)
    return members


# Print all matching file paths
print(meetsCriteria(myDataPaths))

In [None]:
%%timeit
## Your solution here ##

### 3.2 Check for speed bottlenecks in detail using `cProfile`

Use cProfile to locate the slowest calls in your improved script.

Hint: Sort by tottime instead of name to find hotspots more easily.