# One Billion Row Challenge out of core on multiple GPUs

The [One Billion Row Challenge](https://www.morling.dev/blog/one-billion-row-challenge/) is a programming competition that asks Java developers to write the most efficient code for processing a one-billion-line text file and calculating some metrics. The challenge has inspired solutions in many languages beyond Java, including [Python](https://github.com/gunnarmorling/1brc/discussions/62).

In this notebook we will explore how we can use RAPIDS to build an efficient solution to the problem.

## The Problem

The input data of the challenge is a ~13GB text file containing one billion lines of temperature measurements. The file is structured with one measurement per line, with the name of the weather station and the measurement separated by a semicolon.

```text
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
...
```

Our goal is to calculate the min, mean, and max temperature per weather station sorted alphabetically by station name.
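
To make the target concrete, here is a minimal pure-Python sketch of the aggregation on a handful of rows. The rows are made up for illustration (two stations are repeated so the statistics are not trivial), and `summarize` is a hypothetical helper name. It keeps every reading in memory, so it only illustrates the expected result and is not a viable approach for one billion rows.

```python
from collections import defaultdict

# Illustrative rows in the challenge's "station;measurement" format
sample = """Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
Hamburg;14.0
Bulawayo;9.1"""


def summarize(lines):
    """Return {station: (min, mean, max)} with stations in alphabetical order."""
    readings = defaultdict(list)
    for line in lines:
        station, _, value = line.partition(";")
        readings[station].append(float(value))
    return {
        station: (min(vals), round(sum(vals) / len(vals), 1), max(vals))
        for station, vals in sorted(readings.items())
    }


result = summarize(sample.splitlines())
for station, (lo, mean, hi) in result.items():
    print(f"{station}: min={lo} mean={mean} max={hi}")
```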

## A PyData solution

A solution written with popular PyData tools would likely be something along the lines of the following Pandas code (assuming you have enough RAM).

```python
import pandas as pd

df = pd.read_csv(
    "measurements.txt",
    sep=";",
    header=None,
    names=["station", "measure"],
    engine='pyarrow'
)
df = df.groupby("station").agg(["min", "max", "mean"])
df.columns = df.columns.droplevel()
df = df.sort_values("station")
```

Here we use `pandas.read_csv()` to open the text file and specify the `;` separator and also set some column names. We also set the engine to `pyarrow` to give us some extra performance out of the box.

Then we group the measurements by their station name and calculate the min, max and mean. Finally we sort the grouped dataframe by the station name.

Running this on a workstation with a 12-core CPU completes the task in around **4 minutes**.

## GPU solution with RAPIDS

We can certainly use RAPIDS to speed that up, but if you directly convert the above example from Pandas to cuDF you will run into some [limitations it has with string columns](https://github.com/rapidsai/cudf/issues/13733). Also depending on your GPU you may run into memory limits as cuDF will read the whole dataset into memory.

Therefore, to solve this with RAPIDS we also need to use Dask to partition the dataset and stream it through GPU memory. cuDF can then process each partition in a performant way.
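
Partitioning works for this problem because the statistics we need can be computed per partition and then merged. Minimum and maximum combine directly, while the mean has to be carried as a sum and a count until the end. The following pure-Python sketch shows the idea on two tiny made-up partitions. It is a conceptual illustration only, not how Dask or cuDF implement the reduction internally, and all function names are hypothetical.

```python
def merge(a, b):
    """Combine two (min, max, sum, count) tuples into one."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])


def aggregate_partition(lines):
    """Reduce one partition of 'station;value' lines to per-station partial results."""
    partial = {}
    for line in lines:
        station, _, value = line.partition(";")
        v = float(value)
        one = (v, v, v, 1)
        partial[station] = merge(partial[station], one) if station in partial else one
    return partial


def combine(partials):
    """Merge per-partition partial results into a single per-station result."""
    merged = {}
    for partial in partials:
        for station, stats in partial.items():
            merged[station] = merge(merged[station], stats) if station in merged else stats
    return merged


def finalize(merged):
    """Turn (min, max, sum, count) into (min, mean, max), sorted by station name."""
    return {
        station: (lo, total / count, hi)
        for station, (lo, hi, total, count) in sorted(merged.items())
    }


# Each inner list stands in for one partition that fits in memory on its own
partitions = [
    ["Hamburg;12.0", "Cracow;12.6"],
    ["Hamburg;16.0", "Cracow;10.4", "Hamburg;-2.0"],
]
final = finalize(combine(aggregate_partition(p) for p in partitions))
```

Only one partition plus the small per-station dictionary needs to be held at any time, which is what lets a dataset larger than memory be processed.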

### Deploying Dask

To run this solution we will need a single machine with one or more GPUs. There are many ways you can get this:

- Have a laptop, desktop or workstation with GPUs.
- Run a VM in the cloud using [AWS EC2](/cloud/aws/ec2), [Google Compute Engine](/cloud/gcp/compute-engine/), [Azure VMs](/cloud/azure/azure-vm/), etc.
- Use a managed notebook service like [SageMaker](/cloud/aws/sagemaker/), [Vertex AI](/cloud/gcp/vertex-ai/), [Azure ML](/cloud/azure/azureml/) or [Databricks](/platforms/databricks/).
- Run a container in a [Kubernetes cluster with GPUs](/platforms/kubernetes/).

Once you have a GPU machine you will need to [install RAPIDS](/local/). You can do this with [pip](https://docs.rapids.ai/install#pip), [conda](https://docs.rapids.ai/install#conda) or [docker](https://docs.rapids.ai/install#docker).

Then once you are in your RAPIDS Python environment you can use [dask-cuda](/tools/dask-cuda/) to start a GPU Dask cluster.

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

client = Client(LocalCUDACluster())
```

Creating a `LocalCUDACluster()` inspects the machine and starts one Dask worker for each detected GPU. We then pass the cluster to a Dask client, so that all subsequent Dask code in the notebook runs on the GPU workers.

### Data Generation

Before we get started with our problem we need to generate the input data. The 1BRC repo has a [Java implementation](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CreateMeasurements.java) which takes around 15 minutes to generate the file. We can do this much faster on the GPU using cuDF and CuPy. We don't need Dask for this step because we can generate the data in separate chunks and append each one to the output file.

Based on the `lookup.csv` table of stations and their mean temperatures we want to generate our file containing `n` rows of random temperatures.

To generate each row we choose a random station from the lookup table, then generate a random temperature measurement from a normal distribution around the mean temp. We assume the standard deviation is `10.0` for all stations.
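
As a small CPU-only illustration of that sampling scheme, here is a sketch that uses only the Python standard library. The `lookup` list is a hypothetical stand-in for `lookup.csv` with made-up mean temperatures, and `generate_rows` is an illustrative name; the GPU version in this notebook does the same thing with vectorised operations.

```python
import random

# Hypothetical stand-in for lookup.csv: (station, mean temperature), values are illustrative
lookup = [("Hamburg", 9.7), ("Bulawayo", 18.9), ("Palembang", 27.3)]


def generate_rows(n, std=10.0, seed=0):
    """Generate n 'station;measurement' rows from the lookup table."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # randrange excludes the upper bound, so indices 0..len(lookup)-1 are all possible
        station, mean_temp = lookup[rng.randrange(len(lookup))]
        # Sample a temperature from a normal distribution around the station's mean
        measure = rng.gauss(mean_temp, std)
        # Format with exactly one decimal place, as the challenge expects
        rows.append(f"{station};{measure:.1f}")
    return rows


rows = generate_rows(5)
print("\n".join(rows))
```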

```python
import cupy as cp
import cudf
from pathlib import Path
import time
```

```python
def generate_chunk(filename, chunksize, std, lookup_df):
    """Generate some sample data based on the lookup table."""
    df = cudf.DataFrame(
        {
            # Choose a random station from the lookup table for each row in our output
            # (the upper bound of randint is exclusive, so we pass len(lookup_df))
            "station": cp.random.randint(0, len(lookup_df), int(chunksize)),
            # Generate a normal distribution around zero for each row in our output
            # Because the std is the same for every station we can adjust the mean for each row afterwards
            "measure": cp.random.normal(0, std, int(chunksize)),
        }
    )

    # Offset each measurement by the station's mean value
    df.measure += df.station.map(lookup_df.mean_temp)
    # Round the temperature to one decimal place
    df.measure = df.measure.round(decimals=1)
    # Convert the station index to the station name
    df.station = df.station.map(lookup_df.station)

    # Append this chunk to the output file
    with open(filename, "a") as fh:
        df.to_csv(fh, sep=";", chunksize=10_000_000, header=False, index=False)
```

#### Configuration

```python
n = 1_000_000_000  # Number of rows of data to generate

# Load our lookup table of stations and their mean temperatures
lookup_df = cudf.read_csv("lookup.csv")
std = 10.0  # We assume temperatures are normally distributed with a standard deviation of 10
chunksize = 2e8  # Set the number of rows to generate in one go (reduce this if you run into GPU RAM limits)
filename = Path("measurements.txt")  # Choose where to write to
filename.unlink(missing_ok=True)  # Delete the file if it exists already
```

#### Run the data generation

```python
%%time
# Loop over chunks and generate data
start = time.time()
for i in range(int(n / chunksize)):
    # Generate a chunk
    generate_chunk(filename, chunksize, std, lookup_df)

    # Update the progress bar
    percent_complete = int(((i + 1) * chunksize) / n * 100)
    time_taken = int(time.time() - start)
    time_remaining = int((time_taken / percent_complete) * 100) - time_taken
    print(
        f"Writing {int(n / 1e9)} billion rows to {filename}: {percent_complete}% in {time_taken}s ({time_remaining}s remaining)",
        end="\r",
    )
print()
```

```text
Writing 1 billion rows to data_1b.txt: 100% in 24s (0s remaining)
CPU times: user 8.89 s, sys: 16.7 s, total: 25.6 s
Wall time: 24.6 s
```
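
Note that the loop above runs `int(n / chunksize)` times, so it writes exactly `n` rows only when `n` is a multiple of `chunksize`. If you change either value, a small helper along the following lines can plan the chunks, including a shorter final one. This is a sketch under that assumption; `chunk_sizes` and `eta_seconds` are hypothetical names and are not part of the notebook.

```python
def chunk_sizes(n, chunksize):
    """Yield chunk lengths that sum to exactly n, with a shorter final chunk if needed."""
    n, chunksize = int(n), int(chunksize)
    full, remainder = divmod(n, chunksize)
    yield from [chunksize] * full
    if remainder:
        yield remainder


def eta_seconds(elapsed, rows_done, n):
    """Estimate the remaining time, assuming rows are written at a constant rate."""
    return elapsed * (n - rows_done) / rows_done


# The notebook's configuration divides evenly into five chunks
print(list(chunk_sizes(1_000_000_000, 2e8)))
# A total that does not divide evenly gets a shorter last chunk
print(list(chunk_sizes(10, 4)))
```

Each yielded size could then be passed as the `chunksize` argument of `generate_chunk()`.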


#### Check the files

Now we can verify our dataset is the size we expected and contains rows that follow the format needed by the challenge.

```python
!ls -lh {filename}
```

```text
-rw-r--r-- 1 rapids conda 13G Jan 19 09:23 data_1b.txt
```


```python
!head {filename}
```

```text
Muscat;32.1
Kunming;10.8
Ho Chi Minh City;36.0
Belgrade;19.5
Nicosia;18.0
Lhasa;-4.0
La Paz;5.2
Mek'ele;18.2
Kuopio;8.3
Zagreb;16.5
```


### Dask + cuDF Solution

Now that we have our input data we can write some Dask code that leverages cuDF under the hood to perform the compute operations.

First we need to import `dask.dataframe` and tell it to use the `cudf` backend.

```python
import dask
import dask.dataframe as dd

dask.config.set({"dataframe.backend": "cudf"})
```

Now we can run our Dask code, which is almost identical to the Pandas code we used before.

```python
%%timeit -n 3 -r 4
df = dd.read_csv("measurements.txt", sep=";", header=None, names=["station", "measure"])
df = df.groupby("station").agg(["min", "max", "mean"])
df.columns = df.columns.droplevel()
# At the time of writing we need to switch back to Pandas for the final sort
# due to rapidsai/cudf#14794
df = df.compute().to_pandas()
df = df.sort_values("station")
```

```text
4.46 s ± 37.3 ms per loop (mean ± std. dev. of 4 runs, 3 loops each)
```

Running this notebook on a desktop workstation with two NVIDIA RTX 8000 GPUs completes the challenge in around **4 seconds**.