# Data Generation
In this notebook, we generate three mixed-type, numeric-only DataFrames, each about 1.5 GB:
1. "Coarse Grain Mixed", tall: a DataFrame with 5 `int8`, 5 `int64`, and 5 `float64` columns (15 columns in total)
2. "Fine Grain Mixed", tall: a DataFrame with 3 columns each of `int8`, `int16`, `int32`, `int64`, `float16`, `float32`, and `float64` (21 columns in total)
3. "Coarse Grain Mixed", wide: a DataFrame with 5000 `int8`, 5000 `int64`, and 5000 `float64` columns (15,000 columns in total)

In [1]:
import pandas as pd
import numpy as np

In [2]:
coarse_dtype = [np.int8, np.int64, np.float64]
fine_dtype = [np.int8, np.int16, np.int32, np.int64, np.float16, np.float32, np.float64]

In [3]:
data_dtypes = {
    'coarse_tall': coarse_dtype*5,
    'fine_tall': fine_dtype*3,
    'coarse_wide': coarse_dtype*5000
}

In [4]:
# Width of one row in bytes: the sum of the item sizes of its column dtypes.
bytes_per_row = {
    key: sum(np.dtype(dtype).itemsize for dtype in types)
    for key, types in data_dtypes.items()
}
bytes_per_row

{'coarse_tall': 85, 'fine_tall': 87, 'coarse_wide': 85000}

In [5]:
total_bytes = 1.5e9

In [6]:
num_rows = {
    key: int(total_bytes/per_row) for key, per_row in bytes_per_row.items()
}
num_rows

{'coarse_tall': 17647058, 'fine_tall': 17241379, 'coarse_wide': 17647}
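
The row counts above follow from dividing the byte budget by the width of one row and truncating. The sketch below redoes that arithmetic in one place and also reports the column count per layout; the names `layouts` and `summary` are illustrative and not part of the notebook.

```python
import numpy as np

TOTAL_BYTES = 1.5e9  # same byte budget as in the notebook

# Column layouts restated from the dtype lists defined earlier.
coarse = [np.int8, np.int64, np.float64]
fine = [np.int8, np.int16, np.int32, np.int64,
        np.float16, np.float32, np.float64]
layouts = {
    "coarse_tall": coarse * 5,
    "fine_tall": fine * 3,
    "coarse_wide": coarse * 5000,
}

summary = {}
for name, types in layouts.items():
    row_bytes = sum(np.dtype(t).itemsize for t in types)  # bytes per row
    rows = int(TOTAL_BYTES / row_bytes)                   # truncate to whole rows
    summary[name] = (len(types), row_bytes, rows)
    print(f"{name}: {len(types)} columns, {row_bytes} B/row, {rows} rows")
```

Because the division is truncated, each dataset ends up slightly under the 1.5 GB budget rather than over it.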

Each dataset is first generated as a NumPy matrix of `float64` samples drawn from the standard normal distribution; every column is then cast to its target dtype.
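
As a rough illustration of the cast step, the sketch below converts a few hand-picked floats (stand-ins for random draws, not values from the notebook) to an integer and a half-precision dtype. It assumes NumPy's default `astype` behaviour, where a float-to-integer cast truncates toward zero.

```python
import numpy as np

# Hand-picked stand-ins for floating-point random draws.
floats = np.array([-1.7, -0.2, 0.4, 2.9])

as_int8 = floats.astype(np.int8)        # fractional part is dropped: -1, 0, 0, 2
as_float16 = floats.astype(np.float16)  # fractions kept, at reduced precision

print(as_int8, as_int8.dtype)
print(as_float16, as_float16.dtype)
```

One consequence is that integer columns built from small-magnitude floats contain only a handful of distinct values, which may matter if the data is later used to benchmark compression or encoding.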

In [7]:
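import os
from multiprocessing import Pool

def save_dataframe(key_and_types):
    """Generate one dataset and write it to data/<key>.pq."""
    key, types = key_and_types
    n_rows = num_rows[key]
    # Fresh generator per call: forked workers would otherwise inherit the
    # same global NumPy random state and draw identical values.
    rng = np.random.default_rng()
    matrix = rng.standard_normal((n_rows, len(types)))
    # Parquet needs string column names; cast each column to its target dtype.
    column_names = [str(i) for i in range(len(types))]
    dataframe = pd.DataFrame(matrix, columns=column_names).astype(
        dict(zip(column_names, types))
    )
    dataframe.to_parquet(f"data/{key}.pq", engine='fastparquet', compression=None)

os.makedirs("data", exist_ok=True)  # make sure the output directory exists
# One worker process per dataset.
with Pool(3) as p:
    p.map(save_dataframe, list(data_dtypes.items()))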
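
Before committing to the full 1.5 GB run, it can help to check the approach on a small frame. The sketch below is an in-memory check, not part of the notebook: `make_frame` is a hypothetical helper that mirrors the generation step, and the fixed seed is only there to make the sketch reproducible. It confirms that the cast frame occupies exactly rows × bytes-per-row.

```python
import numpy as np
import pandas as pd

def make_frame(n_rows, types, rng):
    """Hypothetical helper: build a small frame the same way as the full run."""
    matrix = rng.standard_normal((n_rows, len(types)))
    columns = [str(i) for i in range(len(types))]
    return pd.DataFrame(matrix, columns=columns).astype(dict(zip(columns, types)))

rng = np.random.default_rng(0)               # fixed seed, for the sketch only
types = [np.int8, np.int64, np.float64] * 5  # the "coarse tall" layout
small = make_frame(1000, types, rng)

expected_bytes = 1000 * sum(np.dtype(t).itemsize for t in types)
actual_bytes = int(small.memory_usage(index=False).sum())
first_dtypes = [str(d) for d in small.dtypes.iloc[:3]]

print(actual_bytes, expected_bytes)
print(first_dtypes)
```

The same check could be repeated for the other two layouts by swapping in their dtype lists.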

Finally, we create a small sample dataset for scripting:

In [10]:
pd.DataFrame({
    'a': np.random.randint(0, 500, size=500),
    'b': np.random.randn(500)
}).to_parquet('data/sample.pq')