# Modin Pandas

By default, Pandas executes its functions in a single process on a single CPU core. That works fine for smaller datasets, where the difference in speed is hard to notice. With larger datasets, and many more calculations to make, using only a single core takes a major toll on speed: Pandas does one calculation at a time on a dataset that can have millions or even billions of rows.

Yet most modern machines made for data science have at least 2 CPU cores. With 2 cores, that means 50% of your computer’s processing power sits idle by default when using Pandas. The situation gets even worse with 4 cores (modern Intel i5) or 6 cores (modern Intel i7), where 75% or more goes unused. Pandas simply wasn’t designed to use that computing power effectively.
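
To make that arithmetic concrete, here is a minimal sketch that computes the share of cores left idle by a single-process workload. The helper name `idle_fraction` is hypothetical, and the sketch assumes the workload keeps exactly one core busy while the rest do nothing.

```python
import os


def idle_fraction(total_cores: int, busy_cores: int = 1) -> float:
    """Fraction of cores left idle when only `busy_cores` are in use."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    busy = min(busy_cores, total_cores)
    return 1 - busy / total_cores


# Core counts mentioned in the text above.
for cores in (2, 4, 6):
    print(f"{cores} cores: {idle_fraction(cores):.0%} idle")

# os.cpu_count() reports logical cores and may return None on some platforms.
detected = os.cpu_count() or 1
print(f"this machine ({detected} logical cores): {idle_fraction(detected):.0%} idle")
```

For 2, 4, and 6 cores this gives 50%, 75%, and roughly 83% idle capacity, which is the gap Modin aims to close.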

Modin is a new library designed to accelerate Pandas by automatically distributing the computation across all of the system’s available CPU cores. Modin claims that this gives a speedup that is nearly linear in the number of CPU cores on your system, for Pandas DataFrames of any size.

# Install

In [1]:
!pip install "modin[ray]"



**Note:** At the time this tutorial was written, Modin’s Dask backend was still experimental.

# Benchmarks and Examples

# Practical Tips for using Modin

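By default, Modin will use all of the CPU cores available on your machine. There may be cases where you wish to limit the number of CPU cores Modin uses, for example to keep some free for other work. Since Modin runs on Ray here, you can do that by initializing Ray with `num_cpus` before importing Modin: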

In [2]:
import ray
# Start Ray with 4 cores before importing Modin, so Modin uses this Ray instance
ray.init(num_cpus=4)
import modin.pandas as pd

2019-12-09 18:25:25,018	INFO node.py:498 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-12-09_18-25-25_017761_24230/logs.
2019-12-09 18:25:25,125	INFO services.py:409 -- Waiting for redis server at 127.0.0.1:46329 to respond...
2019-12-09 18:25:25,290	INFO services.py:409 -- Waiting for redis server at 127.0.0.1:61395 to respond...
2019-12-09 18:25:25,317	INFO services.py:809 -- Starting Redis shard with 3.34 GB max memory.
2019-12-09 18:25:25,396	INFO node.py:512 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-12-09_18-25-25_017761_24230/logs.
2019-12-09 18:25:25,402	INFO services.py:1475 -- Starting the Plasma object store with 5.01 GB memory using /dev/shm.
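
If you would rather derive the limit from the machine than hard-code `4`, one option is to compute it from `os.cpu_count()`. The sketch below is only an illustration: `pick_num_cpus` and its `reserve` parameter are hypothetical names, not part of Ray or Modin. Only the `num_cpus` argument to `ray.init` comes from the cell above.

```python
import os


def pick_num_cpus(reserve: int = 1) -> int:
    """Return a core count that leaves `reserve` cores free, never below 1."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, available - max(0, reserve))


num_cpus = pick_num_cpus(reserve=1)
print(f"Would start Ray with num_cpus={num_cpus}")

# In a fresh kernel, before importing modin.pandas:
# import ray
# ray.init(num_cpus=num_cpus)
# import modin.pandas as pd
```

Leaving one core free is an arbitrary default chosen for this sketch. Pick whatever reserve suits the other work running on the machine.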


When working with big data, it’s not uncommon for the size of the dataset to exceed the amount of memory (RAM) on your system. Modin has a flag that enables its out-of-core mode when set to true. Out of core means that Modin will use your disk as overflow storage for your memory, allowing you to work with datasets far bigger than your RAM. We can set the following environment variable to enable this functionality:


In [5]:
import os
# `export` is shell syntax and is not valid in a Python cell.
# Set the variable from Python instead, so that Modin sees it when it is imported.
os.environ["MODIN_OUT_OF_CORE"] = "true"