# How to Work with Million-row Datasets Like a Pro
## It is time to take off your training wheels
![](https://cdn-images-1.medium.com/max/1620/1*bUX-lSoJ4VMQz2YnmSwgFA.jpeg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@belart84?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Artem Beliaikin</a>
        on 
        <a href='https://www.pexels.com/photo/aerial-photo-of-woman-standing-in-flower-field-1657974/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels.</a> All images are by the author unless specified otherwise.
    </strong>
</figcaption>

# Setup

In [None]:
import logging
import time

import catboost as cb
import joblib
import lightgbm as lgbm
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import seaborn as sns
import xgboost as xgb
from optuna.samplers import TPESampler
from sklearn.compose import (
    ColumnTransformer,
    make_column_selector,
    make_column_transformer,
)
from sklearn.impute import SimpleImputer
from sklearn.metrics import log_loss, mean_squared_error
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%d-%b-%y %H:%M:%S", level=logging.INFO
)
optuna.logging.set_verbosity(optuna.logging.WARNING)

# Introduction

One of the hardest stages of my learning journey was overcoming my fear of massive datasets. It wasn't easy because working with million-row datasets was nothing like working with the tiny toy datasets that online courses kept giving me.

Today, I am here to share the concepts and tricks I have learned for handling gigabyte-sized datasets with millions or even billions of rows. By the end, they will feel almost as natural to you as working with the Iris or Titanic datasets.

# Read in the massive dataset

Your first worry starts when loading the data - reading the dataset into your working environment can take as long as training a model. At this stage, don't use pandas - there are much faster alternatives available. One of my favorites is the `datatable` package, which can read data up to 10 times faster.

As an example, we will load the ~1M-row Kaggle TPS September 2021 dataset with both `datatable` and `pandas` and compare the speeds:

In [None]:
import datatable as dt  # pip install datatable
import pandas as pd

In [None]:
%%time
tps_dt = dt.fread("../input/tabular-playground-series-sep-2021/train.csv").to_pandas()
tps_dt.head()

In [None]:
%%time
tps_df = pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv")
tps_df.head()

A 7x speedup! The `datatable` API for manipulating data may not be as intuitive as pandas, so call the `to_pandas` method after reading the data to convert it to a DataFrame.

Apart from `datatable`, libraries such as Dask, Vaex, and cuDF can read data several times faster than pandas. If you want to see some of those in action, refer to [this notebook](https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets) on reading large datasets by Kaggle Grandmaster Rohan Rao.

# Reduce the memory size

Next, we have memory issues. Even a 200k-row dataset may exhaust 16 GB of RAM during complex computations.

I experienced this first-hand twice in last month's TPS competition on Kaggle. The first time was when projecting the training data to 2D using UMAP - I ran out of RAM. The second was while computing SHAP values with XGBoost for the test set - I ran out of GPU VRAM. What is shocking is that the training and test sets only had 250k and 150k rows with a hundred features, and I was using Kaggle kernels.

The dataset we are using today has ~960k rows with 120 features, so memory issues are much more likely:

In [None]:
memory_usage = tps_df.memory_usage(deep=True) / 1024 ** 2

memory_usage.head(7)

In [None]:
memory_usage.sum()

Calling the `memory_usage` method on a DataFrame with `deep=True` shows how much RAM each feature consumes - about 7 MB. Overall, the dataset takes up close to 1 GB.

Now, there are certain tricks you can use to decrease memory usage by up to 90%. These tricks have a lot to do with changing the data type of each feature to the smallest subtype possible.

Python represents data with built-in types such as `int`, `float`, and `str`. pandas, in contrast, offers several NumPy-based alternatives for each of them:

![](https://miro.medium.com/max/1050/1*j9CH_6m1XrvuPz2DUGf5tQ.png)
<figcaption style="text-align: center;">
    <strong>
        Source: http://pbpython.com/pandas_dtypes.html
    </strong>
</figcaption>

The number next to each data type refers to how many bits of memory a single value consumes in that format. To reduce memory as much as possible, choose the smallest NumPy type that can still hold your data. Here is a good table to understand this:

![](https://miro.medium.com/max/1050/1*f7kTFcscHI7dstMHZ1_eFg.png)
<figcaption style="text-align: center;">
    <strong>
        Source: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html
    </strong>
</figcaption>
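The ranges in the table can also be queried directly from NumPy. The snippet below is a small sketch that uses `np.iinfo` and `np.finfo` to print the limits of a few subtypes; the particular selection of types is just for illustration.

```python
import numpy as np

# Print the representable range of a few integer subtypes.
for int_type in (np.int8, np.int16, np.int32, np.int64, np.uint8):
    info = np.iinfo(int_type)
    print(f"{int_type.__name__:>8}: {info.min} to {info.max}")

# Floats trade precision as well as range, so look at both.
for float_type in (np.float16, np.float32, np.float64):
    info = np.finfo(float_type)
    print(
        f"{float_type.__name__:>8}: max={float(info.max):.3e}, "
        f"approx. decimal digits={info.precision}"
    )
```

Note how few decimal digits `float16` keeps. That matters whenever you downcast floating-point features.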

In the above table, `uint` refers to unsigned integers, which can only hold non-negative values. I found this handy function that reduces the memory of pandas DataFrames based on the above table (shout out to [this Kaggle kernel](https://www.kaggle.com/somang1418/tuning-hyperparameters-under-10-minutes-lgbm?scriptVersionId=11067143&cellId=10)):

In [None]:
def reduce_memory_usage(df, verbose=True):
    """Downcast numeric columns to the smallest dtype that fits their value range.

    Note: the DataFrame is modified in place and also returned.
    """
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                # Try the integer subtypes from narrowest to widest.
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Only the value range is checked here. float16 keeps roughly
                # 3 significant decimal digits, so some precision may be lost.
                if (
                    c_min >= np.finfo(np.float16).min
                    and c_max <= np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min >= np.finfo(np.float32).min
                    and c_max <= np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

Based on the minimum and maximum value of a *numeric* column and the above table, the function converts it to the smallest subtype possible. Let's use it on our data:

In [None]:
reduced_df = reduce_memory_usage(tps_df, verbose=True)

A 70% memory reduction is pretty impressive. However, please note that memory reduction won't speed up computation in most cases. If memory size is not an issue, you can skip this step.
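pandas also has a built-in route to the same idea: `pd.to_numeric` with the `downcast` argument. Here is a minimal sketch on a made-up frame (`toy_df` and its column names are hypothetical, not part of the TPS data). Unlike the function above, `downcast="float"` never goes below `float32`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a real dataset.
toy_df = pd.DataFrame(
    {
        "small_int": np.arange(100, dtype=np.int64),
        "big_int": np.arange(100, dtype=np.int64) * 1_000_000,
        "ratio": np.arange(100) * 0.25,
    }
)

downcast_df = toy_df.copy()  # keep the original untouched for comparison

# Downcast integer and float columns separately.
for col in downcast_df.select_dtypes(include=np.integer).columns:
    downcast_df[col] = pd.to_numeric(downcast_df[col], downcast="integer")
for col in downcast_df.select_dtypes(include=np.floating).columns:
    downcast_df[col] = pd.to_numeric(downcast_df[col], downcast="float")

print(downcast_df.dtypes)
print(
    toy_df.memory_usage(deep=True).sum(),
    "bytes ->",
    downcast_df.memory_usage(deep=True).sum(),
    "bytes",
)
```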

Regarding non-numeric data types, avoid the `object` data type in pandas wherever you can, as it consumes the most memory. Use `category` if the feature has few unique values, or the dedicated `string` dtype otherwise. In fact, using the `pd.Categorical` data type can speed things up by up to 10 times when using [LightGBM's default categorical](https://towardsdatascience.com/how-to-beat-the-heck-out-of-xgboost-with-lightgbm-comprehensive-tutorial-5eba52195997?source=your_stories_page-------------------------------------) handler.
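To get a feel for the difference, the sketch below builds a made-up low-cardinality text column (the `colors` Series is hypothetical) and measures it as `object` and as `category`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical text feature: 100k values drawn from only three labels.
colors = pd.Series(rng.choice(["red", "green", "blue"], size=100_000), dtype=object)

as_object_mb = colors.memory_usage(deep=True) / 1024 ** 2
as_category_mb = colors.astype("category").memory_usage(deep=True) / 1024 ** 2

print(f"object: {as_object_mb:.2f} MB, category: {as_category_mb:.2f} MB")
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it shrinks as the number of unique values grows.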

For other data types like `datetime` or `timedelta`, use the native formats offered in `pandas` since they enable special manipulation functions.
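As a quick illustration of those special manipulation functions, here is a small sketch with made-up timestamps (the `raw` Series is hypothetical). Once the strings are parsed with `pd.to_datetime`, the `.dt` accessor works on the whole column.

```python
import pandas as pd

# Hypothetical raw timestamps, as they might arrive from a CSV file.
raw = pd.Series(["2021-09-01 08:30:00", "2021-09-15 17:45:00", "2021-10-03 00:05:00"])
stamps = pd.to_datetime(raw)

months = stamps.dt.month.tolist()
weekdays = stamps.dt.day_name().tolist()
# Subtracting datetimes yields timedeltas, which have their own accessor.
days_since_first = (stamps - stamps.min()).dt.days.tolist()

print(months)
print(weekdays)
print(days_since_first)
```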

# Choose a data manipulation library

Up until this point, I have mainly mentioned `pandas`. It might be slow, but its vast range of data manipulation functions gives it a huge advantage over its competitors.

But what can its competitors do? Let's start with datatable (again).

[`datatable`](https://datatable.readthedocs.io/en/latest/start/index-start.html) allows multi-threaded preprocessing of datasets of up to 100 GB. At such scales, `pandas` starts throwing memory errors while `datatable` humbly executes. You can read [this excellent article](https://towardsdatascience.com/an-overview-of-pythons-datatable-package-5d3a97394ee9) by @parulpandey for an intro to the package.

Another alternative is [`cuDF`](https://docs.rapids.ai/api/cudf/stable/), developed by RAPIDS. This package has many dependencies and is best suited to extreme cases (think hundreds of billions of rows). It runs preprocessing functions on one or more GPUs, which many of today's data applications require. Unlike `datatable`, its API is very similar to `pandas`. Read [this article](https://developer.nvidia.com/blog/pandas-dataframe-tutorial-beginners-guide-to-gpu-accelerated-dataframes-in-python/) from the NVIDIA blog for more information.

You can also check out [Dask](https://dask.org/) or [Vaex](https://vaex.io/docs/index.html), which offer similar functionality.

If you are dead set on `pandas`, then read on to the next section.

# Sample the data

Regardless of any speed tricks or packages on GPU steroids, too much data, well, is too much. When you have millions of rows, there is a good chance you can sample them so that all feature distributions are preserved.

This is done mainly to speed up computation. Instead of running experiments, feature engineering, and training baseline models on all the data, take a small sample. Typically, 10–20% is enough. Here is how it is done in pandas:

In [None]:
sample_df = tps_df.sample(int(len(tps_df) * 0.2))
sample_df.shape

As proof, we can plot a histogram of a single feature from both the sample and the original data:

In [None]:
fig, ax = plt.subplots(figsize=(12, 9))

sns.histplot(
    data=tps_df, x="f6", label="Original data", color="red", alpha=0.3, bins=15
)
sns.histplot(
    data=sample_df, x="f6", label="Sample data", color="green", alpha=0.3, bins=15
)

plt.legend()
plt.show();

As you can see, the distributions are roughly the same - you can even compare the variances to check.
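One way to run that check is sketched below on synthetic data (`full_df` and its two columns are hypothetical stand-ins for a large dataset). It also uses `frac` and `random_state`, which make the sample size explicit and the draw repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in for a large dataset: one skewed and one bell-shaped feature.
full_df = pd.DataFrame(
    {
        "skewed": rng.exponential(scale=2.0, size=200_000),
        "bell": rng.normal(loc=10.0, scale=3.0, size=200_000),
    }
)

# Draw a repeatable 20% sample.
sample = full_df.sample(frac=0.2, random_state=42)

# Put the mean and variance of the full data and the sample side by side.
comparison = pd.concat(
    {"full": full_df.agg(["mean", "var"]), "sample": sample.agg(["mean", "var"])},
    axis=1,
)
print(comparison.round(3))
```

If the two sets of numbers differ noticeably for some feature, a larger sample may be needed.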

Now, you can use this sample for rapid prototyping, experimenting, building a model validation strategy, and so on.
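A plain random sample can shift class proportions slightly when the target is imbalanced. One option, shown here as a sketch under the assumption of a binary `target` column (the `data` frame is made up), is to sample within each class using `groupby(...).sample`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: roughly 5% positives.
data = pd.DataFrame(
    {
        "feature": rng.normal(size=100_000),
        "target": (rng.random(100_000) < 0.05).astype(int),
    }
)

# Take 20% of the rows within each class so the class ratio is preserved.
stratified = data.groupby("target").sample(frac=0.2, random_state=0)

print(f"positive rate, full data: {data['target'].mean():.4f}")
print(f"positive rate, sample:    {stratified['target'].mean():.4f}")
```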

# Use vectorization instead of loops

Whenever you find yourself itching to use some looping function like `apply`, `applymap`, or `itertuples` - stop. Use vectorization instead.

First, start thinking about DataFrame columns as giant n-dimensional vectors. As you know, vector operations affect each element in the vector simultaneously, removing the need for explicit loops. Vectorization is the process of executing operations on whole arrays rather than on individual scalars.

Pandas has a large collection of vectorized functions. In fact, virtually any function and operator with the ability to affect each element in the array is vectorized in pandas. These functions are orders of magnitude faster than anything that loops.
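To see the gap for yourself, here is a rough timing sketch on a made-up frame (`df`, `a`, and `b` are placeholder names). It computes the same expression once with a row-wise `apply` and once with column arithmetic. Exact timings will vary by machine.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical numeric columns.
df = pd.DataFrame({"a": rng.random(50_000), "b": rng.random(50_000)})

# Row-by-row: the lambda is called once per row in Python.
start = time.perf_counter()
looped = df.apply(lambda row: row["a"] * 2 + row["b"], axis=1)
loop_seconds = time.perf_counter() - start

# Vectorized: the arithmetic runs on whole columns at once.
start = time.perf_counter()
vectorized = df["a"] * 2 + df["b"]
vector_seconds = time.perf_counter() - start

print(f"apply: {loop_seconds:.3f}s, vectorized: {vector_seconds:.5f}s")
```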

You can also define custom vectorized preprocessing functions that accept whole DataFrame columns as vectors rather than scalars. The hairy details of this are beyond the scope of this article. Why don't you check out [this awesome guide](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6)?
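As a taste of what such a function can look like, the sketch below labels a whole column at once with `np.select` instead of an `if`/`else` applied per row. The `bucket_income` helper and its thresholds are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def bucket_income(income: pd.Series) -> pd.Series:
    """Label every value in the column at once (hypothetical helper)."""
    conditions = [income < 30_000, income < 80_000]
    labels = ["low", "medium"]
    # np.select checks the conditions on the full array and, for each element,
    # picks the label of the first condition that holds.
    return pd.Series(np.select(conditions, labels, default="high"), index=income.index)


incomes = pd.Series([12_000, 45_000, 95_000, 30_000])
buckets = bucket_income(incomes)
print(buckets.tolist())
```

The function receives the column itself, so every comparison inside it runs over all rows in one call.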

# Choose a machine learning library for baseline models or prototypes

Machine learning is an iterative process. When dealing with large datasets, you have to make sure each iteration is as fast as possible. You want to build baselines, develop a validation strategy, check if different feature engineering ideas improve the baseline, and so on.

At this stage, don't use models in Sklearn because they are CPU-only. Choose from XGBoost, LightGBM or CatBoost. And here is the surprising fact - XGBoost is much slower than the other two, even on GPUs.

It is up to [10 times slower than LightGBM](https://towardsdatascience.com/how-to-beat-the-heck-out-of-xgboost-with-lightgbm-comprehensive-tutorial-5eba52195997?source=your_stories_page-------------------------------------). CatBoost beats both libraries, and the speed difference grows rapidly as the dataset size gets bigger. It also regularly outperforms them in terms of accuracy.

These speed differences become much more pronounced when you are running multiple experiments, cross-validating, or hyperparameter tuning.

# Miscellaneous tips

Use Cython, a superset of Python that compiles to C - it can be up to 100 times faster than pure Python. Check out [this section](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#cython-writing-c-extensions-for-pandas) of the pandas documentation.

If you really have to loop, decorate your custom functions with `@numba.jit` after installing Numba. JIT (just-in-time) compilation converts pure Python to native machine instructions, enabling you to achieve C, C++, and Fortran-like speeds. Again, check [this section](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#numba-jit-compilation) from the docs.

Look for alternatives to CSV files for storage. File formats like Feather, Parquet, and Jay are lightning fast - it only takes seconds to load billion-row datasets if they are stored in these.

Read the Pandas documentation on [enhancing performance](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html) and [scaling to large datasets](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html).
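The scaling guide linked above recommends loading fewer columns, using efficient data types, and processing in chunks. Here is a minimal sketch that combines all three with `pd.read_csv`. The in-memory CSV and its column names (`id`, `f1`, `f2`, `claim`) are made up so that the example runs anywhere.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory CSV; swap in a real file path in practice.
csv_text = "id,f1,f2,claim\n" + "\n".join(
    f"{i},{i * 0.5},{i % 7},{i % 2}" for i in range(10_000)
)

reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["f1", "claim"],  # load only the columns that are needed
    dtype={"f1": np.float32, "claim": np.int8},  # compact dtypes from the start
    chunksize=2_500,  # yield the file in pieces instead of all at once
)

total_rows = 0
f1_sum = 0.0
for chunk in reader:
    # Each chunk is a regular DataFrame, so any pandas operation works here.
    total_rows += len(chunk)
    f1_sum += float(chunk["f1"].sum())

print(f"rows: {total_rows}, mean of f1: {f1_sum / total_rows:.2f}")
```

Only one chunk is held in memory at a time, so the peak footprint depends on `chunksize` rather than on the size of the file.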

# Wrapping up...

Here is a brief summary of the article:

1. Load the data using libraries like `datatable`, `cuDF`, or `dask`. They are usually much faster than pandas.
2. Reduce the memory consumption by up to 90% by casting each column to the smallest subtype possible.
3. Choose a data manipulation library you are comfortable with or based on what you need.
4. Take a 10–20% sample of the data for rapid analysis and experimentation.
5. Think in vectors and use vectorized functions.
6. Choose a fast ML library like CatBoost for building baselines and doing feature engineering.

Thank you for reading!
![](https://cdn-images-1.medium.com/max/1080/1*KeMS7gxVGsgx8KC36rSTcg.gif)
