In [1]:
import numpy as np
import pandas as pd

print(f"Numpy Version: {np.__version__}")
print(f"Pandas Version: {pd.__version__}")

Numpy Version: 1.24.2
Pandas Version: 2.2.0


Ah, a quest for consistency in the random world of NumPy, guided by the light of a seed! Let's break this down, Watson-style:

1. **The Goal**: To create a pandas Series.
2. **The Tool**: NumPy's random number generator (RNG).
3. **The Key**: Set the seed to `43` to ensure reproducibility.

Here's how we'll proceed:

- First, we summon the NumPy RNG.
- Next, we imbue it with the power of the seed `43`.
- We then ask it to conjure a sequence of random numbers.
- Lastly, we transform this mystical sequence into a pandas Series.

In [2]:
# Setting the seed for reproducibility
rng = np.random.default_rng(seed=43)

# Generating random integers
random_integers = rng.integers(low=0, high=101, size=10)  # 10 random integers from 0 to 100

# Creating a pandas Series
integer_series = pd.Series(random_integers)

# Display the Series
integer_series

0    51
1    65
2    40
3     4
4    58
5     2
6    27
7    84
8    46
9    59
dtype: int64
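
The reproducibility claim can be checked directly. The sketch below rebuilds the Series twice with the same seed and once with a different one; `make_series` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd


def make_series(seed: int) -> pd.Series:
    # A fresh generator per call, so the seed alone determines the draw
    rng = np.random.default_rng(seed=seed)
    return pd.Series(rng.integers(low=0, high=101, size=10))


first = make_series(43)
second = make_series(43)
different = make_series(44)

print(first.equals(second))     # True: same seed, same sequence
print(first.equals(different))  # a different seed is expected to give a different sequence
```

Creating a new generator for each call is what makes the two draws match; reusing one generator would advance its internal state between calls.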

In [3]:
string_series = pd.Series(['staple', 'text',
                           'strings', 'tokens', 
                           'natural', 'processes',
                           'language', 'neuro', 
                           'linguistic', 'program'])

string_series

0        staple
1          text
2       strings
3        tokens
4       natural
5     processes
6      language
7         neuro
8    linguistic
9       program
dtype: object

Now let us build the same two Series with the `pyarrow` backend by passing an Arrow-backed `dtype`.

In [4]:
integer_series = pd.Series(random_integers, dtype='int64[pyarrow]')
integer_series 

0    51
1    65
2    40
3     4
4    58
5     2
6    27
7    84
8    46
9    59
dtype: int64[pyarrow]

In [5]:
string_series = pd.Series(['staple', 'text',
                           'strings', 'tokens', 
                           'natural', 'processes',
                           'language', 'neuro', 
                           'linguistic', 'program'],
                         dtype='string[pyarrow]')

string_series

0        staple
1          text
2       strings
3        tokens
4       natural
5     processes
6      language
7         neuro
8    linguistic
9       program
dtype: string

### The Scroll's Contents
This Python script is composed of several interesting elements:

1. **Import Statements**: 
   - `from typing import Callable`: This imports the `Callable` type from the `typing` module, used for type hinting. It's like saying, "Expect a sorcerer (function) who can perform spells (actions)."
   - `from functools import partial`: Here we summon the `partial` function from the `functools` module. Think of `partial` as a spell that lets you prepare another spell (function) with some of its ingredients (arguments) already chosen.
   - `import timeit`: This imports the `timeit` module, a timekeeper for how fast spells (functions) are cast.

2. **Constant Definition**:
   - `NUM_ITERATIONS = 20`: This sets a constant named `NUM_ITERATIONS` to `20`. It is passed to `timeit` as the number of runs, and the total time is then divided by it to get an average per call. It's like saying, "We shall repeat our experiment 20 times."

3. **The `time_function` Spell**:
   - **Function Signature**: 
     - `def time_function(function_to_time: Callable, **kwargs)`: This defines a function, `time_function`, which takes a `Callable` (a function) and an arbitrary number of keyword arguments (`**kwargs`).
   - **The Function's Inner Workings**:
     - `return partial(function_to_time, **kwargs)`: Here, `time_function` returns a `partial` object. What does this mean? It means you are pre-loading the `function_to_time` with some of its arguments (the ones provided in `**kwargs`), but not executing it yet. It's like preparing a magic potion but not drinking it immediately.

### The Code's Purpose in Layman's Terms
Imagine you have a spell (function) that you want to cast (execute), but this spell requires some specific ingredients (arguments). The `time_function` allows you to prepare this spell with some of its ingredients already in place. However, instead of casting it right away, it gives you this pre-prepared spell. Why? So you can cast it repeatedly (perhaps using `timeit` to measure how long it takes) without having to add those ingredients every single time.

In short, `time_function` is a convenience tool for setting up a function with specific arguments for timing purposes, like a rehearsal before the actual performance. Quite clever, isn't it? 🎩✨🔮

In [6]:
from typing import Callable
from functools import partial
import timeit
from typing import List

NUM_ITERATIONS = 20

def time_function(function_to_time: Callable, **kwargs) -> Callable:
    # Bind the keyword arguments now; the returned callable runs the function later
    return partial(function_to_time, **kwargs)
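
As a small illustration of how the prepared callable is meant to be used, here is a self-contained sketch. `sum_of_squares` is a hypothetical toy function, not part of the notebook; `time_function` is repeated so the block runs on its own.

```python
from functools import partial
from typing import Callable
import timeit

NUM_ITERATIONS = 20


def time_function(function_to_time: Callable, **kwargs) -> Callable:
    # Pre-load the keyword arguments without calling the function
    return partial(function_to_time, **kwargs)


def sum_of_squares(n: int) -> int:
    # Hypothetical workload used only for this example
    return sum(i * i for i in range(n))


prepared = time_function(sum_of_squares, n=1_000)

# Nothing has run yet; calling the partial executes sum_of_squares(n=1_000)
print(prepared())  # 332833500

# timeit accepts any zero-argument callable, which is what the partial provides
average_seconds = timeit.timeit(prepared, number=NUM_ITERATIONS) / NUM_ITERATIONS
print(f"~{average_seconds:.6f}s per call")
```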

In [7]:
import string

def generate_random_strings(num_strings: int = 2_500_000, chance_for_empty_string: float = 0.1, min_length: int = 5, max_length: int = 8, seed: int = 43) -> list:

    # Initialize NumPy's random number generator with a seed
    rng = np.random.default_rng(seed)

    chars = np.array(list(string.ascii_lowercase + string.digits))

    # Generate string lengths using NumPy's RNG.
    # `high` is exclusive in rng.integers, so lengths fall in [min_length, max_length - 1]
    string_lengths = rng.integers(min_length, max_length, size=num_strings)

    # Choose random characters using NumPy's RNG
    random_chars = rng.choice(chars, size=(num_strings, max_length))

    # Create strings and handle the chance for empty strings
    empty_string_mask = rng.random(num_strings) < chance_for_empty_string
    random_strings = [
        '' if is_empty else ''.join(row_chars[:length])
        for row_chars, length, is_empty in zip(random_chars, string_lengths, empty_string_mask)
    ]

    return random_strings
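
One detail worth checking in the function above is the upper bound of `rng.integers`: by default `high` is excluded, so with `min_length=5` and `max_length=8` the non-empty strings have 5 to 7 characters. The sketch below compares the default with `endpoint=True`; the sample size of 10,000 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(43)

# Default behaviour: the upper bound is exclusive
exclusive = rng.integers(5, 8, size=10_000)

# endpoint=True makes the upper bound inclusive
inclusive = rng.integers(5, 8, size=10_000, endpoint=True)

print(exclusive.min(), exclusive.max())
print(inclusive.min(), inclusive.max())
```

If strings of exactly `max_length` characters are wanted, passing `endpoint=True` would be one way to get them, at the cost of changing the generated data.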

In [8]:
random_strings = generate_random_strings()
assert len(random_strings) == 2_500_000

**NumPy Backend**

In [10]:
def build_numpy_backend_series(data: List[str]) -> pd.Series:
  return pd.Series(data)

numpy_backend_timed_function = time_function(build_numpy_backend_series, data = random_strings)

numpy_backend_elapsed_time = timeit.timeit(numpy_backend_timed_function, number=NUM_ITERATIONS) / NUM_ITERATIONS

numpy_backend_series = build_numpy_backend_series(data = random_strings)

print(f"It took ~{numpy_backend_elapsed_time:.2f} seconds to run this function.")

It took ~0.10 seconds to run this function.


**PyArrow Backend**

In [11]:
def build_pyarrow_backend_series(data: List[str]) -> pd.Series:
  return pd.Series(data, dtype='string[pyarrow]')

pyarrow_backend_timed_function = time_function(build_pyarrow_backend_series, data = random_strings)

pyarrow_backend_elapsed_time = timeit.timeit(pyarrow_backend_timed_function, number=NUM_ITERATIONS) / NUM_ITERATIONS

pyarrow_backend_series = build_pyarrow_backend_series(data = random_strings)

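# Report the PyArrow timing. The output recorded below was printed from
# numpy_backend_elapsed_time, so it repeats the NumPy figure.
print(f"It took ~{pyarrow_backend_elapsed_time:.2f} seconds to run this function.")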

It took ~0.10 seconds to run this function.


In [12]:
def find_strings_ending_with_char(pandas_series: pd.Series, ending_char: str = 'a') -> pd.Series:
  return pandas_series[pandas_series.str.endswith(ending_char)]

find_string_numpy = time_function(find_strings_ending_with_char, pandas_series = numpy_backend_series)

find_string_numpy_elapsed_time = timeit.timeit(find_string_numpy, number = NUM_ITERATIONS) / NUM_ITERATIONS

numpy_series_str_endswith_a = find_strings_ending_with_char(numpy_backend_series)

print(f"It took ~{find_string_numpy_elapsed_time:.2f}s to find all the strings that ended with 'a'!")

It took ~0.51s to find all the strings that ended with 'a'!


In [13]:
numpy_series_str_endswith_a.head()

11       j5jza
114    js1fkta
161     lxlgpa
206    fosk3ha
207      30bja
dtype: object

In [14]:
find_string_pyarrow = time_function(find_strings_ending_with_char, pandas_series = pyarrow_backend_series)

find_string_pyarrow_elapsed_time = timeit.timeit(find_string_pyarrow, number = NUM_ITERATIONS) / NUM_ITERATIONS

pyarrow_series_str_endswith_a = find_strings_ending_with_char(pyarrow_backend_series)

print(f"It took ~{find_string_pyarrow_elapsed_time:.2f}s to find all the strings that ended with 'a'!")

It took ~0.04s to find all the strings that ended with 'a'!


In [15]:
pyarrow_series_str_endswith_a.head()

11       j5jza
114    js1fkta
161     lxlgpa
206    fosk3ha
207      30bja
dtype: string

In [16]:
print(f"The PyArrow backend ran ~{find_string_numpy_elapsed_time / find_string_pyarrow_elapsed_time:.2f}x faster!")

The PyArrow backend ran ~14.34x faster!
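
Speed-up ratios like this one depend on how noisy each measurement is. A minimal sketch of a common alternative, assuming it suits the workload, is to use `timeit.repeat` and keep the fastest run instead of averaging a single batch. `best_time` and the tiny `words` Series are hypothetical names introduced for this example.

```python
import timeit
from functools import partial
from typing import Callable

import pandas as pd


def best_time(fn: Callable, repeat: int = 5, number: int = 3) -> float:
    # Run `number` calls per batch, `repeat` batches, and keep the fastest batch
    runs = timeit.repeat(fn, repeat=repeat, number=number)
    return min(runs) / number


def find_strings_ending_with_char(pandas_series: pd.Series, ending_char: str = 'a') -> pd.Series:
    return pandas_series[pandas_series.str.endswith(ending_char)]


# Small made-up Series; three of the five words end with 'a'
words = pd.Series(["alpha", "token", "gamma", "text", "omega"] * 1_000)

find_in_words = partial(find_strings_ending_with_char, pandas_series=words)
seconds = best_time(find_in_words)

print(f"Best of 5 batches: ~{seconds:.6f}s per call")
```

Taking the minimum tends to filter out slow runs caused by other activity on the machine, so ratios computed from it are usually more stable.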


**Let's find the mean!**
* We will create an array of random floats.
* We will convert that array into a pandas Series using both the NumPy and the PyArrow backends.

In [17]:
def generate_random_numbers(n: int = 2_500_000, min_val: int = -1_000_000, max_val: int = 1_000_000, seed: int = 43) -> np.ndarray:
    # Initialize NumPy's random number generator with a seed
    rng = np.random.default_rng(seed)

    # Generate random numbers using NumPy's RNG
    return rng.uniform(low = min_val, high = max_val, size = n)


random_float_list = generate_random_numbers()
assert len(random_float_list) == 2_500_000

In [18]:
numpy_backend_float_series = pd.Series(random_float_list)

numpy_backend_float_series.head()

0    304598.525402
1   -912449.352722
2   -959940.826252
3    678425.165022
4    174286.095176
dtype: float64

In [19]:
pyarrow_backend_float_series = pd.Series(random_float_list, dtype="float64[pyarrow]")

pyarrow_backend_float_series.head()

0    304598.525402
1   -912449.352722
2   -959940.826252
3    678425.165022
4    174286.095176
dtype: double[pyarrow]

Upon closer inspection, we observe that the data managed with the PyArrow backend adopts the label `double[pyarrow]` rather than `float64`. However, this distinction is largely semantic rather than functional, as both terminologies describe the same type of data: a double-precision floating-point number.
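
The equivalence of the two names can be illustrated on the NumPy side, where `double` is likewise an alias for the 64-bit float type. This sketch uses only NumPy and does not inspect the Arrow type itself.

```python
import numpy as np

# np.double is NumPy's name for the C double type, which is float64
same_type = np.dtype(np.double) == np.dtype("float64")
print(same_type)  # True

# Both describe an 8-byte, 64-bit floating-point number
print(np.dtype("float64").itemsize)  # 8
print(np.finfo(np.float64).bits)     # 64
```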

Next, we turn our attention to comparing the performance of these two backends, specifically in their ability to compute the mean of these values.

Conveniently, since our data is already organized into `pd.Series` objects, we can employ the same function to calculate the mean for both backends. This will provide us with a direct comparison of their efficiency in executing this particular operation.

In [20]:
def find_mean(data: pd.Series) -> np.float64:
    # Using np.mean to calculate the mean
    return np.mean(data.values)

**NumPy Backend**

In [21]:
timed_numpy_backend_mean_fn = time_function(find_mean, data = numpy_backend_float_series)
numpy_backend_mean_elapsed_time = timeit.timeit(timed_numpy_backend_mean_fn, number = NUM_ITERATIONS) / NUM_ITERATIONS

numpy_backend_mean = find_mean(numpy_backend_float_series)

print(f"The amount of time it took to get the mean was ~{numpy_backend_mean_elapsed_time:.4f}s! And the mean was: {numpy_backend_mean:.2f}")

The amount of time it took to get the mean was ~0.0031s! And the mean was: 116.55


**PyArrow Backend**

In [22]:
timed_pyarrow_backend_mean_fn = time_function(find_mean, data = pyarrow_backend_float_series)
pyarrow_backend_mean_elapsed_time = timeit.timeit(timed_pyarrow_backend_mean_fn, number = NUM_ITERATIONS) / NUM_ITERATIONS

pyarrow_backend_mean = find_mean(pyarrow_backend_float_series)

print(f"The amount of time it took to get the mean was ~{pyarrow_backend_mean_elapsed_time:.4f}s! And the mean was: {pyarrow_backend_mean:.2f}")

The amount of time it took to get the mean was ~0.0031s! And the mean was: 116.55


In [23]:
print(f"The PyArrow backend ran ~{numpy_backend_mean_elapsed_time / pyarrow_backend_mean_elapsed_time:.2f}x faster!")

The PyArrow backend ran ~1.01x faster!
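
One possible reason the two timings are so close is that `find_mean` hands `data.values` to `np.mean`, so the reduction may not be going through each backend's own implementation. A sketch of an alternative, assuming the goal is to time the backend's native reduction, is to call `Series.mean()` directly. `find_mean_native` is a hypothetical name, and the example uses a default NumPy-backed Series so it runs without `pyarrow`.

```python
import numpy as np
import pandas as pd


def find_mean_native(data: pd.Series) -> float:
    # Series.mean() dispatches to whatever array type backs the Series
    return float(data.mean())


values = np.random.default_rng(43).uniform(-1_000_000, 1_000_000, size=100_000)
series = pd.Series(values)

native_mean = find_mean_native(series)
numpy_mean = float(np.mean(values))

print(np.isclose(native_mean, numpy_mean))  # True
```

The same function could be passed a `float64[pyarrow]` Series to compare the backends on equal terms.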


In [24]:
url = "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv"

In [25]:
%%timeit
df = pd.read_csv(
    url, na_values=["NA", "?"]
)

90.6 ms ± 5.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [26]:
%%timeit
# engine="pyarrow" only swaps the CSV parser; pass dtype_backend="pyarrow" to read_csv to also get Arrow-backed columns
df = pd.read_csv(
    url, na_values=["NA", "?"],
    engine = "pyarrow"
)

136 ms ± 41.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
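
Because both cells read from a URL, each timed loop likely includes download time as well as parsing, which may explain the large spread in the second measurement. The sketch below times parsing alone on an in-memory CSV. The rows are made up for illustration and are not taken from `auto-mpg.csv`; `parse_default` and `parse_pyarrow` are hypothetical helper names, and the PyArrow branch is skipped if `pyarrow` is not installed.

```python
import io
import timeit

import pandas as pd

# Made-up sample data, repeated to give the parsers some work
header = b"mpg,horsepower\n"
rows = b"18.0,130\n15.0,?\n16.5,NA\n" * 1_000
csv_bytes = header + rows


def parse_default() -> pd.DataFrame:
    return pd.read_csv(io.BytesIO(csv_bytes), na_values=["NA", "?"])


def parse_pyarrow() -> pd.DataFrame:
    # engine selects the parser; dtype_backend requests Arrow-backed columns
    return pd.read_csv(
        io.BytesIO(csv_bytes),
        na_values=["NA", "?"],
        engine="pyarrow",
        dtype_backend="pyarrow",
    )


default_seconds = timeit.timeit(parse_default, number=5) / 5
print(f"default engine: ~{default_seconds:.4f}s per parse")

try:
    pyarrow_seconds = timeit.timeit(parse_pyarrow, number=5) / 5
    print(f"pyarrow engine: ~{pyarrow_seconds:.4f}s per parse")
except ImportError:
    pyarrow_seconds = None  # pyarrow is not available in this environment
```

On a file as small as the one in the notebook, parser differences may be hard to see either way; larger inputs usually give a clearer picture.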
