# Chapter 1: Understanding Performant Python

## The Fundamental Computer System

### Overview
A computer system can be simplified into three basic components:
- **Computing Units**
- **Memory Units**
- **Connections (Communication Layers)**

Each component has specific properties that help us understand its functionality and performance.

### Computing Units
The computing unit is the core of the computer, responsible for executing instructions and performing calculations.

#### Key Properties
- **Instructions Per Cycle (IPC)**: Number of instructions a CPU can execute per clock cycle.
- **Clock Speed**: Number of cycles a CPU can perform per second (measured in Hertz, Hz).

#### Types of Computing Units
- **CPU (Central Processing Unit)**: The primary computing unit for general-purpose tasks.
- **GPU (Graphics Processing Unit)**: Optimized for parallel processing, initially used for graphics but now also for numerical computations.

#### Specialized Operations
- **SIMD (Single Instruction, Multiple Data)**: Allows a single instruction to be applied to multiple data points simultaneously, enhancing performance for specific computations.

#### Advanced Techniques
- **Simultaneous Multithreading**: Presents one physical core as several logical cores (e.g., Intel's Hyper-Threading) and interleaves their instruction streams, keeping the core's execution units busier.
- **Out-of-Order Execution**: Enables execution of instructions as soon as their operands are available, rather than strictly in the order they appear in the program.

#### Trends and Challenges
- **Stagnation in Clock Speed and IPC**: Due to physical limitations in making transistors smaller.
- **Multicore Architectures**: Incorporating multiple cores within a single CPU to increase total capability without increasing the speed of any individual core.
- **Amdahl’s Law**: If only part of a task can be parallelized, the serial remainder caps the achievable speedup, no matter how many cores are added.
- **Global Interpreter Lock (GIL) in Python**: In the standard CPython interpreter, allows only one thread at a time to execute Python bytecode within a process, no matter how many cores are available, which limits multi-threaded performance.
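Amdahl's Law can be made concrete with a few lines of arithmetic. The sketch below assumes the usual formulation, speedup = 1 / ((1 - p) + p / n), where `p` is the fraction of the runtime that can be parallelized and `n` is the number of workers. The function name and the 90% figure are illustrative choices, not taken from the text.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup when `parallel_fraction` of the runtime is spread over `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical task: 90% of the runtime can be parallelized.
for workers in (1, 2, 4, 8, 1024):
    print(f"{workers:>5} workers -> {amdahl_speedup(0.9, workers):.2f}x")
```

With 10% of the work serial, the speedup approaches but never reaches 10x, however many workers are added.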

### Memory Units
Memory units store data and instructions and vary in speed, capacity, and type.

#### Types of Memory Units
- **L1/L2 Cache**: Small, very fast memory inside the CPU, used for frequently accessed data.
- **RAM (Random Access Memory)**: Stores application code and data currently in use. Faster than hard drives but slower than cache.
- **Hard Drives**: Used for long-term storage.
  - **Spinning Hard Drives (HDDs)**: Use physical disks, slower but higher capacity.
  - **Solid-State Drives (SSDs)**: Use flash memory, faster but lower capacity.

#### Memory Characteristics
- **Read/Write Speed**: The speed at which data can be read from or written to memory.
- **Latency**: The time it takes to access data from memory.
- **Sequential vs. Random Access**: Sequential access (large contiguous blocks) is faster than random access (scattered data).
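The difference between sequential and random access can be observed even from Python. The following is a rough sketch, not a rigorous benchmark: it sums the same list twice, once visiting positions in order and once in a shuffled order. The size `N` and the helper name `sum_in_order` are arbitrary choices for illustration.

```python
import random
import time

N = 500_000
data = list(range(N))

sequential_order = list(range(N))
random_order = sequential_order[:]
random.Random(0).shuffle(random_order)  # fixed seed so runs are repeatable


def sum_in_order(values, order):
    """Sum `values`, visiting positions in the sequence given by `order`."""
    total = 0
    for index in order:
        total += values[index]
    return total


for label, order in (("sequential", sequential_order), ("random", random_order)):
    start = time.perf_counter()
    total = sum_in_order(data, order)
    elapsed = time.perf_counter() - start
    print(f"{label:>10}: total={total}, {elapsed:.4f} s")
```

Both passes do the same amount of arithmetic, so any timing gap comes from how memory is visited. Expect the gap to be smaller than in a compiled language, because interpreter overhead dominates, and expect it to vary by machine.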

#### Memory Hierarchy
Data is stored in tiers, ordered here from the largest and slowest to the smallest and fastest:
1. **Hard Drive**: Full data set.
2. **RAM**: Frequently used data.
3. **Cache (L1/L2)**: Most frequently accessed data.

### Communication Layers
Communication layers are pathways that allow data to move between computing and memory units.

#### Types of Buses
- **Frontside Bus**: Connects the CPU to the RAM.
- **Peripheral Component Interconnect (PCI) Bus**: Connects peripheral devices like GPUs to the CPU and memory.

#### Bus Properties
- **Bus Width**: Amount of data transferred in one operation.
- **Bus Frequency**: Number of transfer operations per second.

#### Communication Characteristics
- **Speed**: How fast data can be transferred (combination of bus width and frequency).
- **Latency**: Time taken for a data request to be responded to.
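The relationship between bus width, frequency, and latency can be sketched as simple arithmetic. This assumes a deliberately simplified model in which each request costs a fixed latency plus the time to stream the data at peak bandwidth. The bus parameters below are hypothetical and do not describe any real hardware.

```python
def bandwidth_bytes_per_second(bus_width_bits: int, frequency_hz: float) -> float:
    """Peak bandwidth: bytes moved per transfer times transfers per second."""
    return (bus_width_bits / 8) * frequency_hz


def transfer_time_seconds(size_bytes: float, bandwidth: float, latency_s: float) -> float:
    """Time to satisfy one request: fixed latency plus streaming time."""
    return latency_s + size_bytes / bandwidth


# Hypothetical bus: 64 bits wide, 100 million transfers per second, 100 ns latency.
bw = bandwidth_bytes_per_second(64, 100e6)
print(f"peak bandwidth: {bw / 1e6:.0f} MB/s")

# Moving 1 MB as one large request versus 1000 small requests of 1 kB each.
one_large = transfer_time_seconds(1_000_000, bw, 100e-9)
many_small = 1000 * transfer_time_seconds(1_000, bw, 100e-9)
print(f"one large request  : {one_large * 1e3:.4f} ms")
print(f"1000 small requests: {many_small * 1e3:.4f} ms")
```

Under this model the small requests pay the latency a thousand times, which is one reason fewer, larger transfers are usually preferable.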

#### Network Communication
- **Network as a Communication Block**: Connects to other memory or computing units over a network, generally slower than internal buses.




## Python's Role in High-Performance Programming

#### Misconceptions about Python
Python is often assumed to be unsuitable for high-performance tasks. In practice, its speed of development often compensates for its slower raw execution.

#### Solution
By combining Python's fast development with optimization techniques, we can write efficient code.

### Idealized Computing

#### Ideal Scenario
- In an ideal setup, the CPU quickly accesses data from memory and processes it without delays.
- Keeping data close to the CPU in cache memory minimizes delays caused by moving data around.

### Optimizing Code for Performance

#### Vectorization
Instead of processing one number at a time, the CPU can handle multiple numbers simultaneously. This reduces the time needed for calculations.


#### Examples

```python
# Original Version
import math
import time

def check_prime(number):
    sqrt_number = math.sqrt(number)
    for i in range(2, int(sqrt_number) + 1):
        if (number / i).is_integer():
            return False
    return True

# Start measuring time
start_time = time.perf_counter()

# Run the function
result1 = check_prime(10_000_000)
result2 = check_prime(10_000_019)

# End measuring time
end_time = time.perf_counter()

# Calculate the elapsed time
elapsed_time = end_time - start_time

# Print results and elapsed time
print(f"check_prime(10,000,000) = {result1}")
print(f"check_prime(10,000,019) = {result2}")
print(f"Elapsed time: {elapsed_time} seconds")
```

```
check_prime(10,000,000) = False
check_prime(10,000,019) = True
Elapsed time: 0.0004640949991880916 seconds
```


```python
# Optimized Version
# Tests divisors in batches of five to mimic what a vectorized CPU
# instruction would do. Pure Python still performs each division
# separately, so this illustrates the idea rather than delivering a speedup.

import math
import time

def check_prime(number):
    sqrt_number = math.sqrt(number)
    numbers = range(2, int(sqrt_number)+1)
    for i in range(0, len(numbers), 5):
        # Collect the divisibility checks for one batch of up to five divisors
        results = []
        for j in range(i, min(i+5, len(numbers))):
            results.append((number / numbers[j]).is_integer())
        if any(results):
            return False
    return True

# Start measuring time
start_time = time.perf_counter()

# Run the function
result1 = check_prime(10_000_000)
result2 = check_prime(10_000_019)

# End measuring time
end_time = time.perf_counter()

# Calculate the elapsed time
elapsed_time = end_time - start_time

# Print results and elapsed time
print(f"check_prime(10,000,000) = {result1}")
print(f"check_prime(10,000,019) = {result2}")
print(f"Elapsed time: {elapsed_time} seconds")
```

```
check_prime(10,000,000) = False
check_prime(10,000,019) = True
Elapsed time: 0.0015602540006511845 seconds
```


| Aspect | Original Version | Optimized (Batched) Version |
|---|---|---|
| Functionality | Tests each divisor from 2 up to the square root of the number, one at a time. | Tests the same divisors in batches of five, mimicking how a vectorized CPU instruction would handle several values at once. |
| Performance | One division and check per loop iteration; returns as soon as a divisor is found. | Slower in pure Python (about 1.56 ms versus 0.46 ms in the runs above): the batching adds list building and an `any()` call, and the interpreter still performs each division separately. |
| Memory Usage | Only a few scalar values. | Slightly more, because a temporary `results` list is created for each batch. |


## Python’s Virtual Machine

#### Abstraction by Python Interpreter
The Python interpreter abstracts away low-level computing details, allowing programmers to focus on algorithms rather than memory management or CPU instructions. However, this abstraction can come with a performance cost.

#### Optimizing Code Sequence
The interpreter's individual operations are well optimized; the challenge is getting Python to perform them in an efficient sequence. For example, a function runs faster if it avoids unnecessary computation or exits a loop as soon as the answer is known.


#### Impact of Abstraction on Vectorization
Python's abstraction layer hinders vectorization: as the batched `check_prime` example shows, a plain Python loop still processes one value at a time. External libraries like NumPy provide vectorized mathematical operations.

#### Challenges with Memory Optimization
Python's memory allocation, garbage collection, and dynamic typing may lead to memory fragmentation and inefficient CPU cache utilization. Compounded by the lack of direct memory layout control, this can impact performance.
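The cost of Python's memory layout can be glimpsed with the standard library alone. This is a minimal sketch, assuming CPython (where `sys.getsizeof` is available and reports per-object sizes). It compares a list of floats, which stores references to separately allocated objects, with `array.array("d")`, which packs raw 8-byte doubles into one contiguous buffer. Exact byte counts vary by Python version and platform.

```python
import sys
from array import array

values = [float(i) for i in range(1000)]

# A list holds references; each float is a separate heap object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# array("d") stores the raw 8-byte doubles back to back in one buffer.
packed = array("d", values)
packed_bytes = sys.getsizeof(packed)

print(f"list of floats  : {list_bytes} bytes")
print(f"array of doubles: {packed_bytes} bytes")
```

The contiguous layout is also what makes cache-friendly, vectorized processing possible in libraries such as NumPy.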

#### Dynamic Typing and Compilation
Because Python is dynamically typed and is not compiled ahead of time to machine code, there is no compiler that can apply type-based optimizations automatically. Tools like Cython compile Python code and let the programmer supply type hints that guide the compiler.

#### Global Interpreter Lock (GIL)
The GIL allows only one thread at a time to execute Python bytecode within a process, which limits the performance gains from multiple CPU cores. Workarounds include using multiprocessing instead of multithreading, or moving hot code into Cython or foreign functions that can release the GIL.
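The effect of the GIL on CPU-bound work can be observed with a small experiment. The sketch below is illustrative only: `count_primes` and the workload sizes are made up for the demonstration. It runs the same pure-Python computation serially and then across four threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit: int) -> int:
    """CPU-bound work: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


LIMITS = [20_000] * 4  # four identical, independent tasks

start = time.perf_counter()
serial = [count_primes(limit) for limit in LIMITS]
serial_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(count_primes, LIMITS))
threaded_time = time.perf_counter() - start

print(f"serial  : {serial_time:.3f} s")
print(f"threaded: {threaded_time:.3f} s")
```

On a standard GIL-enabled CPython build, the threaded run typically takes about as long as the serial one, because the threads take turns holding the lock. Swapping in `concurrent.futures.ProcessPoolExecutor` (with the usual `if __name__ == "__main__":` guard) is one way to try the multiprocessing route mentioned above.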

#### Examples of Functions
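The four functions below (`search_fast`, `search_slow`, `search_unknown1`, `search_unknown2`) all test whether `needle` occurs in `haystack`, but they differ in how much work they do. `search_fast` returns as soon as it finds a match, while `search_slow` always scans the entire sequence. Counting bytecode instructions, as done here, measures only the size of the compiled function, not the work performed at run time, so a function with fewer instructions can still be slower. Profiling is needed to identify which regions of code are actually slow.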


```python
import dis

def count_ops(func):
    """Statically count the bytecode instructions in `func`, ignoring loads and stores."""
    bytecode = dis.Bytecode(func)
    ops_count = 0
    for instruction in bytecode:
        if instruction.opname.startswith(('LOAD_', 'STORE_')):
            continue
        ops_count += 1
    return ops_count

# Define the functions
def search_fast(haystack, needle):
    for item in haystack:
        if item == needle:
            return True
    return False

def search_slow(haystack, needle):
    return_value = False
    for item in haystack:
        if item == needle:
            return_value = True
    return return_value

def search_unknown1(haystack, needle):
    return any((item == needle for item in haystack))

def search_unknown2(haystack, needle):
    return any([item == needle for item in haystack])


# Count operations for each function
ops_count_fast = count_ops(search_fast)
ops_count_slow = count_ops(search_slow)
ops_count_unknown1 = count_ops(search_unknown1)
ops_count_unknown2 = count_ops(search_unknown2)

# Print the results
print(f"search_fast: {ops_count_fast} operations")
print(f"search_slow: {ops_count_slow} operations")
print(f"search_unknown1: {ops_count_unknown1} operations")
print(f"search_unknown2: {ops_count_unknown2} operations")
```

```
search_fast: 9 operations
search_slow: 7 operations
search_unknown1: 10 operations
search_unknown2: 10 operations
```
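Because a static instruction count says little about run time, measuring the functions directly is more informative. The sketch below times the same four search functions with `timeit`. The haystack size, the needle position, and the repeat count are arbitrary illustrative choices.

```python
import timeit


def search_fast(haystack, needle):
    for item in haystack:
        if item == needle:
            return True
    return False


def search_slow(haystack, needle):
    return_value = False
    for item in haystack:
        if item == needle:
            return_value = True
    return return_value


def search_unknown1(haystack, needle):
    return any(item == needle for item in haystack)


def search_unknown2(haystack, needle):
    return any([item == needle for item in haystack])


haystack = list(range(100_000))
needle = 10  # hypothetical input: an early match rewards early termination

for func in (search_fast, search_slow, search_unknown1, search_unknown2):
    seconds = timeit.timeit(lambda: func(haystack, needle), number=20)
    print(f"{func.__name__:<16} {seconds:.4f} s for 20 calls")
```

With an early match, the versions that can stop early (`search_fast` and the generator-based `search_unknown1`) should finish far sooner than the two that always process the whole list. Moving the needle to the end of the haystack is a useful follow-up experiment.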


## So Why Use Python?
#### Expressiveness and Ease of Learning
Python is highly expressive and easy to learn, making it accessible to new programmers. This allows programmers to achieve a lot in a short amount of time.

#### Rich Ecosystem
Python has a rich ecosystem of libraries and tools, many of which are built-in ("batteries included"). These libraries cover a wide range of functionalities, from basic mathematical operations to advanced machine learning and data analysis.

#### Efficient Wrapping of Other Languages
Python libraries often wrap tools written in other languages, such as C and Fortran. This allows Python code to leverage the speed and efficiency of these underlying libraries while still benefiting from Python's high-level syntax.

#### Built-in and External Libraries
Python's built-in libraries cover fundamental functionalities like handling Unicode, basic mathematical operations, database interaction, and more. External libraries provide even more specialized functionalities, such as numerical computing (NumPy), scientific computing (SciPy), machine learning (scikit-learn), and natural language processing (NLTK, SpaCy, Gensim), among others.

#### Versatile Deployment Options
Python offers various deployment options, including standard distributions, lightweight virtual environments (e.g., pipenv, virtualenv), containerization (Docker), scientifically focused environments (Anaconda), and interactive shells (IPython, Jupyter Notebook).

#### Fast Prototyping
Python enables fast prototyping of ideas due to its ease of use and extensive library support. This makes it suitable for testing the feasibility of concepts and iterating quickly.

#### Optimization Considerations
While Python offers flexibility and ease of development, it may not always provide optimal performance out of the box. Optimization techniques like using NumPy for mathematical routines or Cython for compiling Python code with C-like types can improve performance but may require additional effort and expertise to maintain.

#### Consideration of Team Dynamics
It's essential to consider the trade-offs between performance optimization and team productivity. While squeezing more performance out of a system is possible, it may lead to complex and brittle optimizations that could hinder team collaboration and long-term maintenance.


## How to Be a Highly Performant Programmer

Writing high-performance code is only one part of being highly performant and delivering successful projects over the long term. Overall team velocity is far more important than speedups and complicated solutions. Several factors are key to this: good structure, documentation, debuggability, and shared standards.

#### Building Structured and Maintainable Code
Creating prototypes without thorough testing and review can lead to unstructured and undocumented code. Lack of maintenance and testing can result in code that becomes hard to support and prone to errors. It's crucial to demonstrate the long-term benefits of tests and documentation to maintain team productivity and convince managers to allocate time for code cleanup.

#### Best Practices
1. **Make it work:** Begin by building a good-enough solution, which acts as a prototype for a better-structured version.
2. **Make it right:** Develop a strong test suite, documentation, and clear reproducibility instructions.
3. **Make it fast:** Focus on profiling, optimization, and parallelization while ensuring the new, faster solution still meets expectations.
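For the "make it fast" step, profiling comes before optimizing. The following is a minimal sketch using the standard library's `cProfile` and `pstats`. The workload (`count_primes` calling a simple `check_prime`) is a stand-in chosen for illustration, not code from the text.

```python
import cProfile
import io
import math
import pstats


def check_prime(number):
    for i in range(2, int(math.sqrt(number)) + 1):
        if number % i == 0:
            return False
    return True


def count_primes(limit):
    return sum(check_prime(n) for n in range(2, limit))


# Profile only the region of interest.
profiler = cProfile.Profile()
profiler.enable()
result = count_primes(20_000)
profiler.disable()

# Report the ten entries with the highest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

The report lists call counts and cumulative time per function, which points to where optimization effort is likely to pay off.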

#### Good Working Practices
- Documentation, good structure, and testing are essential.
- Project-level documentation helps maintain a clean structure and assists future team members.
- Docker simplifies environment setup and deployment.
- Unit tests, run with a framework such as pytest, ensure code reliability and maintainability.
- Docstrings provide useful descriptions of functions, classes, and modules.
- Refactor code to maintain readability and simplicity.
- Follow a test-driven development methodology to help ensure code correctness.
- Use source control for version management and backup.
- Adhere to coding standards like PEP8 and use tools like black and flake8 for code formatting and linting.
- Isolate environments using tools like Anaconda or pipenv coupled with Docker.
- Automate repetitive tasks like build, test, and deployment processes.
- Prioritize readability over cleverness in code design.
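Several of the practices above (docstrings, readable code, unit tests) fit in a few lines. The sketch below is one possible shape: a documented function plus plain `assert`-based tests that pytest would collect if saved in a file such as the hypothetical `test_primes.py`. The function and test names are illustrative.

```python
import math


def check_prime(number: int) -> bool:
    """Return True if `number` is prime, using trial division.

    Values below 2 are not prime by definition.
    """
    if number < 2:
        return False
    for divisor in range(2, math.isqrt(number) + 1):
        if number % divisor == 0:
            return False
    return True


# pytest collects functions whose names start with `test_`.
def test_known_primes():
    for prime in (2, 3, 5, 7, 10_000_019):
        assert check_prime(prime)


def test_known_composites():
    for composite in (4, 9, 10_000_000):
        assert not check_prime(composite)


def test_values_below_two_are_not_prime():
    for value in (-7, 0, 1):
        assert not check_prime(value)
```

Tests like these make it safer to attempt a faster implementation later, since any change in behavior shows up immediately.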

#### Notebook Best Practices
- Extract long functions from Jupyter Notebooks into separate Python modules.
- Prototype code in IPython or QTConsole before moving it to Notebooks.
- Use assert statements in Notebooks for quick data validation, but refactor code into modules with proper unit tests for robustness.
- Avoid using assert statements for data validation in production code; raise appropriate exceptions instead.
- Add sanity checks at the end of Notebooks to validate generated results.
- Utilize tools like nbdime for versioning and collaboration with Notebooks.
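The distinction between assert statements and exceptions can be illustrated as follows. This is a sketch with a hypothetical `validate_ages` helper and made-up data. The key fact is that Python removes `assert` statements when run with the `-O` flag, so they cannot be relied on in production.

```python
def validate_ages(ages):
    """Raise ValueError if any age is missing or outside a plausible range."""
    bad = [age for age in ages if age is None or not 0 <= age <= 130]
    if bad:
        raise ValueError(f"{len(bad)} invalid age value(s), e.g. {bad[0]!r}")
    return ages


# Notebook-style quick check: fine while exploring, but stripped by `python -O`.
ages = [34, 51, 27]
assert all(0 <= age <= 130 for age in ages), "unexpected age value"

# Production-style check: always runs and gives the caller something to catch.
try:
    validate_ages([34, -1, 27])
except ValueError as error:
    print(f"rejected: {error}")
```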

Maintaining structured, well-documented, and tested code, along with adopting best practices, contributes to long-term project success and team productivity.


## Key Takeaways

- **Fundamental Computer System**: Recognize the three basic components: computing units, memory units, and communication layers, each impacting performance differently.
  
- **Python's Role**: Python's raw execution speed is lower than that of compiled languages, but its fast development speed often compensates, especially when combined with optimization techniques.

- **Optimizing Code**: Techniques like vectorization can significantly improve performance by processing several values at once, though in Python this usually requires libraries such as NumPy rather than plain loops.

- **Highly Performant Programming**: Prioritize good structure, documentation, and testing. Follow best practices like making code work, right, and fast, and utilize version control.

- **Notebook Best Practices**: Extract long functions into modules, prototype code before moving it to Notebooks, use assert statements for quick data checks (but raise exceptions in production code), add sanity checks, and use versioning tools such as nbdime.

By adhering to best practices and optimizing code, teams can ensure long-term project success and productivity.
