# Script to estimate the minimum GPU count

This script simulates the GPU memory consumption flow to estimate the minimum number of GPUs required, given parameters such as the row count, the column counts, and the sparsity of the input dataset. The more precise the parameters are, the more accurate the estimate is.

| Parameter | Description |
|-----------------|-----------------------------------------------------------------|
| SINGLE_GPU_MEMORY_SIZE | The memory size of a single GPU on the device, in bytes. You can get it from `nvidia-smi`. |
| NUM_OF_FEATURE_COLUMNS | The total number of feature columns in the input dataset. |
| NUM_OF_WEIGHT_COLUMNS | The total number of weight columns in the input dataset. If there is no weight column, set it to 0. |
| NUM_OF_GROUPS | The number of predictions per instance. Set it to 1 for all tasks except multi-class classification. For multi-class classification, NUM_OF_GROUPS must be set to the number of classes. |
| SPARSITY | The sparsity of the input dataset: 1 - NON_ZERO_COUNT(A) / TOTAL_COUNT(A). |
| MAX_BIN | The maximum number of discrete bins used to bucket continuous features. The default is 16. |

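The SPARSITY formula in the table can be checked on a small example. The sketch below is illustrative only: the helper name `sparsity` and the toy matrix are assumptions, not part of the original script.

```python
def sparsity(matrix):
    """Return 1 - NON_ZERO_COUNT / TOTAL_COUNT for a list of rows."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        raise ValueError("matrix must contain at least one value")
    non_zero = sum(1 for row in matrix for value in row if value != 0)
    return 1 - non_zero / total


# Toy 2 x 4 feature matrix (hypothetical data): 3 of the 8 values are non-zero.
toy = [
    [0.0, 1.5, 0.0, 2.0],
    [0.0, 0.0, 3.0, 0.0],
]
toy_sparsity = sparsity(toy)
print(toy_sparsity)  # 0.625
```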
---

- ROW_STRIDE

ROW_STRIDE is the largest number of non-zero features (items) in any single row of the input dataset.
You can calculate it with:
```shell
# Assumes the last column is the label, so it is not counted.
awk -F, '{n=0; for(i=1;i<NF;i++) if($i!=0) n++; if(n>max) max=n} END {print max+0}' xxxx
```
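
The same quantity can also be computed in Python. This is a minimal sketch that assumes a headerless, comma-separated input whose last column is the label; the function name `row_stride` and the sample rows are hypothetical.

```python
import csv
import io


def row_stride(lines):
    """Largest number of non-zero feature values found in any single row."""
    best = 0
    for fields in csv.reader(lines):
        if not fields:
            continue  # skip blank lines
        features = fields[:-1]  # assumption: the last column is the label
        non_zero = sum(1 for value in features if float(value) != 0)
        best = max(best, non_zero)
    return best


# Hypothetical sample: three rows, four feature columns plus a label column.
sample = io.StringIO("0,1.5,0,2,1\n3,0,0,0,0\n1,2,3,0,1\n")
stride = row_stride(sample)
print(stride)  # 3
```

For a real dataset, pass an open file handle in place of the `io.StringIO` object.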

In [1]:
# Input cell

SINGLE_GPU_MEMORY_SIZE      = 32 * 1024 * 1024 * 1024   # memory size of a single GPU, in bytes

ROW_COUNT       = 72 * 1000 * 1000              # 72 million

NUM_OF_FEATURE_COLUMNS = 3600        # the number of feature columns

NUM_OF_WEIGHT_COLUMNS  = 0           # the number of weight columns, it should be 0 or 1

NUM_OF_GROUPS   = 1           # number of predictions per instance
assert NUM_OF_GROUPS > 0

SPARSITY        = 0.5  # 1 - NON_ZERO_COUNT(A) / TOTAL_COUNT(A)

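`SINGLE_GPU_MEMORY_SIZE` is expressed in bytes, whereas `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits` prints one MiB value per GPU. The sketch below shows the conversion, assuming that output format. The helper name and the sample string are hypothetical, and picking the smallest GPU is a conservative choice made by this sketch, not something the script prescribes.

```python
def smallest_gpu_memory_bytes(query_output):
    """Parse one MiB value per line and return the smallest GPU size in bytes."""
    sizes_mib = [int(line) for line in query_output.splitlines() if line.strip()]
    if not sizes_mib:
        raise ValueError("no GPU memory values found")
    return min(sizes_mib) * 1024 * 1024


# Hypothetical output for two GPUs that each report 32768 MiB.
sample_output = "32768\n32768\n"
gpu_bytes = smallest_gpu_memory_bytes(sample_output)
print(gpu_bytes)  # 34359738368, the same value as 32 * 1024 * 1024 * 1024
```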
In [2]:
# The parameters below can also affect the result; all of them are set to their default values

MAX_BIN         = 16     # default max_bin: 256 in native XGBoost, 16 in XGBoost4J

# ROW_STRIDE: it should be <= NUM_OF_FEATURE_COLUMNS
# Assuming the last column is the label column, ROW_STRIDE can be calculated with the script below:
# awk -F, '{n=0; for(i=1;i<NF;i++) if($i!=0) n++; if(n>max) max=n} END {print max+0}' xxxx
ROW_STRIDE = NUM_OF_FEATURE_COLUMNS

LABEL_COLUMNS   = 1      
assert LABEL_COLUMNS == 1 # Currently, XGBoost only supports 1 label column

TOTAL_COLUMNS = NUM_OF_FEATURE_COLUMNS + NUM_OF_WEIGHT_COLUMNS + LABEL_COLUMNS

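Before running the formula cell, the constraints stated in the comments above can be checked programmatically. This sketch is a convenience and not part of the original script: the function name `check_parameters` is hypothetical, and the range checks on SPARSITY, MAX_BIN, and NUM_OF_FEATURE_COLUMNS are assumptions added here.

```python
def check_parameters(num_feature_columns, num_weight_columns, num_groups,
                     sparsity, max_bin, row_stride):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if num_feature_columns <= 0:  # assumption of this sketch
        problems.append("NUM_OF_FEATURE_COLUMNS must be positive")
    if num_weight_columns not in (0, 1):
        problems.append("NUM_OF_WEIGHT_COLUMNS must be 0 or 1")
    if num_groups <= 0:
        problems.append("NUM_OF_GROUPS must be positive")
    if not 0 <= sparsity < 1:  # assumption of this sketch
        problems.append("SPARSITY must be in [0, 1)")
    if max_bin <= 0:  # assumption of this sketch
        problems.append("MAX_BIN must be positive")
    if not 0 < row_stride <= num_feature_columns:
        problems.append("ROW_STRIDE must be in (0, NUM_OF_FEATURE_COLUMNS]")
    return problems


# The values used in the input cells pass every check.
ok = check_parameters(3600, 0, 1, 0.5, 16, 3600)
# A ROW_STRIDE larger than the feature count is reported.
bad = check_parameters(3600, 0, 1, 0.5, 16, 4000)
print(ok)
print(bad)
```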

**Do not modify the cells below**

In [3]:
# Do not modify this cell
import math

# constant values
SIZE_OF_uint64   = 8
SIZE_OF_float    = 4
SIZE_OF_Entry    = 8
SIZE_OF_uint32_t = 4

N_BINS     = MAX_BIN * NUM_OF_FEATURE_COLUMNS   # maximum value

# Formula

'''
Note:
The cuDF loading and DMatrix construction stages are skipped here because chunked
loading is supported and the DMatrix memory has been moved from GPU to CPU. The chunk
size can therefore be tuned to load a dataset of any size, provided CPU memory is sufficient.

The GPU sketcher stage is also skipped because its memory is temporary.
'''

rows_to_load = ROW_COUNT

if True:
    
    # Step 1 PredictRaw
    mem_PredictRaw  = NUM_OF_GROUPS * rows_to_load * SIZE_OF_float

    # Step 2 GetGradient
    labels  = LABEL_COLUMNS * rows_to_load * SIZE_OF_float
    weights = NUM_OF_WEIGHT_COLUMNS * rows_to_load * SIZE_OF_float
    out_gpair = NUM_OF_GROUPS * rows_to_load * SIZE_OF_float * 2

    mem_GetGradient =  labels + weights + out_gpair

    # Step 3 Ellpack
    n_items = rows_to_load * NUM_OF_FEATURE_COLUMNS * (1-SPARSITY)
    num_symbols = N_BINS + 1
    num_row_symbols = n_items + 1

    ellpack_gidx_row_buffer = 0  # initialized value
    compressed_size_bytes = math.log2(num_symbols) * ROW_STRIDE * rows_to_load  / 8

    is_dense = SPARSITY == 0
    if not is_dense: # it's a CSR DMatrix
        item_compressed_size_bytes = math.log2(num_symbols) * n_items / 8
        row_compressed_size_bytes = math.log2(num_row_symbols) * (rows_to_load + 1) / 8
        if (item_compressed_size_bytes + row_compressed_size_bytes < compressed_size_bytes):
            compressed_size_bytes = item_compressed_size_bytes
            ellpack_gidx_row_buffer = row_compressed_size_bytes

    ellpack_prediction_cache = rows_to_load * SIZE_OF_float
    ellpack_gidx_buffer      = compressed_size_bytes
    mem_Ellpack     =  ellpack_prediction_cache +  ellpack_gidx_row_buffer + ellpack_gidx_buffer

    # Step 4 RowPartitioner
    ridx_a     = rows_to_load * SIZE_OF_uint32_t
    ridx_b     = rows_to_load * SIZE_OF_uint32_t
    position_a = rows_to_load * SIZE_OF_uint32_t
    position_b = rows_to_load * SIZE_OF_uint32_t

    mem_row_partitioner = ridx_a + ridx_b + position_a + position_b

    # Step 5 DeviceHistogram
    mem_device_histogram = 1 * 1024 * 1024 * 1024  # 1 GiB

    mem_total = mem_PredictRaw + mem_GetGradient + mem_Ellpack + mem_row_partitioner + mem_device_histogram
    
    gpu_count = math.ceil(mem_total * 1.0 / SINGLE_GPU_MEMORY_SIZE)
    print("\n GPU count: %d \n" % (gpu_count))



 GPU count: 8 

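To see which stage dominates the estimate, the formula above can be wrapped in a function that also returns the per-stage totals. This is a sketch that mirrors the cell's arithmetic under the same assumptions; the function name `estimate` and the dictionary keys are hypothetical.

```python
import math


def estimate(gpu_bytes, rows, features, weights, groups, sparsity,
             max_bin=16, row_stride=None, labels=1):
    """Return (per-stage bytes, GPU count) using the same formula as the cell above."""
    row_stride = features if row_stride is None else row_stride
    size_float, size_uint32 = 4, 4
    symbol_bits = math.log2(max_bin * features + 1)

    stages = {
        "PredictRaw": groups * rows * size_float,
        "GetGradient": (labels + weights + 2 * groups) * rows * size_float,
        "RowPartitioner": 4 * rows * size_uint32,
        "DeviceHistogram": 1024 ** 3,
    }

    gidx = symbol_bits * row_stride * rows / 8  # dense layout
    gidx_row = 0
    if sparsity != 0:  # the CSR layout is used only when it is smaller
        n_items = rows * features * (1 - sparsity)
        items = symbol_bits * n_items / 8
        row_ptr = math.log2(n_items + 1) * (rows + 1) / 8
        if items + row_ptr < gidx:
            gidx, gidx_row = items, row_ptr
    stages["Ellpack"] = rows * size_float + gidx_row + gidx

    return stages, math.ceil(sum(stages.values()) / gpu_bytes)


stages, gpu_count = estimate(32 * 1024 ** 3, 72_000_000, 3600, 0, 1, 0.5)
for name, size in stages.items():
    print(f"{name}: {size / 1024 ** 3:.1f} GiB")
print("GPU count:", gpu_count)  # 8, matching the output above
```

With the default inputs, the Ellpack stage accounts for nearly all of the estimated memory, so SPARSITY, ROW_STRIDE, and MAX_BIN are the parameters worth measuring most carefully.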
