# Tips and Tricks to Process Large Data in Python Pandas

As data scientists, the first and foremost skill we need is the ability to process and analyze data. Python pandas is one of the most popular tools for data wrangling and data analysis. Pandas is efficient at processing small data (usually from 100MB up to 1GB), where performance is rarely a concern. Once the data size grows beyond that, we start to run into memory and performance issues. By understanding how pandas interprets data and by using a few tricks, we can process large files efficiently. In this blog post we will look at some common issues faced while processing large data in pandas and the tricks to overcome them, along with an example.

### TL;DR
- By default pandas interprets numeric values as __`int64`__ or __`float64`__ dtypes and string columns as the __`object`__ dtype, which makes the dataframe use a lot of memory.
- Numeric columns do not always need 8 bytes per value. After analysing the range of each column, we can downcast it to a smaller __`signed`__ or __`unsigned`__ type and achieve a significant reduction in memory.
- For string columns, pandas by default stores every string in the column separately and does not recognise multiple occurrences of the same value. To store each distinct value only once, we can convert the object column to a categorical column using the built-in __`df.astype('category')`__. In the example below this gives around a 10 times reduction in memory.

We will be using the Kaggle dataset for [Instacart market basket analysis](https://www.kaggle.com/c/instacart-market-basket-analysis). The dataset contains data for more than 3 million historical orders, with multiple products per order. Overall the dataset has a size of around 700MB; once merged, it has around 32 million rows and 15 columns.

Let us now load this data into a pandas dataframe and see how pandas interprets the data.

In [1]:
import pandas as pd

df_aisles = pd.read_csv('Datasets/aisles.csv')
df_departments = pd.read_csv('Datasets/departments.csv')
df_order_prior = pd.read_csv('Datasets/order_products__prior.csv')
df_order_train = pd.read_csv('Datasets/order_products__train.csv')
df_orders = pd.read_csv('Datasets/orders.csv')
df_products = pd.read_csv('Datasets/products.csv')

In [2]:
print('aisles: \t',df_aisles.shape) 
print('departments: \t',df_departments.shape) 
print('order_prior: \t',df_order_prior.shape)
print('order_train: \t',df_order_train.shape)
print('orders: \t',df_orders.shape)
print('products: \t',df_products.shape)

aisles: 	 (134, 2)
departments: 	 (21, 2)
order_prior: 	 (32434489, 4)
order_train: 	 (1384617, 4)
orders: 	 (3421083, 7)
products: 	 (49688, 4)


In [3]:
# Creating the historical orders data by merging all the dataframes.
df_his_orders = (df_order_prior
                 .merge(df_orders, on=['order_id'], how='left')
                 .merge(df_products, on=['product_id'], how='left')
                 .merge(df_aisles, on=['aisle_id'], how='left')
                 .merge(df_departments, on=['department_id'], how='left'))

In [4]:
del df_aisles
del df_departments
del df_order_prior
del df_order_train
del df_orders
del df_products

In [5]:
df_his_orders.shape

(32434489, 15)

In [6]:
df_his_orders.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32434489 entries, 0 to 32434488
Data columns (total 15 columns):
order_id                  int64
product_id                int64
add_to_cart_order         int64
reordered                 int64
user_id                   int64
eval_set                  object
order_number              int64
order_dow                 int64
order_hour_of_day         int64
days_since_prior_order    float64
product_name              object
aisle_id                  int64
department_id             int64
aisle                     object
department                object
dtypes: float64(1), int64(10), object(4)
memory usage: 11.4 GB


We can see that by default pandas interprets all integer columns as `int64`, all float columns as `float64`, and all string columns as `object`. Each numeric value in the above dataframe takes 8 bytes. Each string value is stored as a separate Python object, so its size is the length of the string plus a fixed per-object overhead.
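To see where the memory goes column by column, `DataFrame.memory_usage` can be used alongside `info`. The snippet below is a minimal sketch on a small made-up dataframe (`df_demo` and its values are illustrative, not taken from the Instacart data); the string column is created with an explicit `object` dtype to match the behaviour described above.

```python
import pandas as pd

# Small synthetic frame (made-up values) mirroring the kinds of columns above
df_demo = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'days_since_prior_order': [7.0, None, 3.0, 30.0],
    'department': pd.Series(['produce', 'dairy eggs', 'produce', 'produce'],
                            dtype=object),
})

print(df_demo.dtypes)

# deep=False counts only the 8-byte slots of each column;
# deep=True also counts the Python string objects they point to
mem_shallow = df_demo.memory_usage(deep=False)
mem_deep = df_demo.memory_usage(deep=True)
print(mem_shallow)
print(mem_deep)
```

The gap between the two numbers for `department` is the cost of storing every string as its own object, which is why `info(memory_usage='deep')` is used throughout this post.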

Let us now see how to optimize the default pandas interpretation of the various dtypes to reduce the dataframe's size.

## Int64 & Float64 Columns

In [7]:
int_columns = [col for col in df_his_orders.columns if df_his_orders[col].dtype == 'int64']
float_column = [col for col in df_his_orders.columns if df_his_orders[col].dtype == 'float64']

In [8]:
# Printing Min and Max Values of `int_columns`
for column in int_columns:
    print(column, df_his_orders[column].min(),df_his_orders[column].max())

order_id 2 3421083
product_id 1 49688
add_to_cart_order 1 145
reordered 0 1
user_id 1 206209
order_number 1 99
order_dow 0 6
order_hour_of_day 0 23
aisle_id 1 134
department_id 1 21


From the above output we can see that none of the integer columns really needs a 64 bit integer to store its data. Based on the min and max values of each column, these are the smallest types that can hold them:

- `order_id`, `user_id` : __`uint32`__ range(0 to 4,294,967,295)
- `product_id` : __`uint16`__ range(0 to 65,535)
- `add_to_cart_order`, `reordered`, `order_number`, `order_dow`, `order_hour_of_day`, `aisle_id`, `department_id` : __`uint8`__ range(0 to 255)

__Note__: We are choosing unsigned integers as there are no negative values in the data.
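The mapping from a value range to a dtype can be checked against the limits reported by `np.iinfo`. The helper below is a hypothetical illustration (`smallest_unsigned_dtype` is not a pandas or NumPy function), fed with min and max values from the output above.

```python
import numpy as np

def smallest_unsigned_dtype(min_value, max_value):
    """Return the narrowest unsigned NumPy integer dtype that holds the range."""
    if min_value < 0:
        raise ValueError('unsigned types cannot store negative values')
    for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
        if max_value <= np.iinfo(dtype).max:
            return np.dtype(dtype)
    raise ValueError('value too large for uint64')

# Min/max pairs copied from the printed output above
print(smallest_unsigned_dtype(2, 3421083))  # order_id          -> uint32
print(smallest_unsigned_dtype(1, 49688))    # product_id        -> uint16
print(smallest_unsigned_dtype(1, 145))      # add_to_cart_order -> uint8
```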

Pandas has a built-in function __`pd.to_numeric`__ that can downcast a numeric column to the smallest type that fits its values. Downcasting to a type that is too small for the data loses information, hence it is recommended that you analyse the value ranges as above before downcasting to a `signed` or `unsigned` type.
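As a quick illustration of how `pd.to_numeric` chooses the target type, here is a minimal sketch on two made-up series. The values are illustrative only.

```python
import pandas as pd

# All values are non-negative and fit in one byte
s = pd.Series([1, 145, 99], dtype='int64')
small = pd.to_numeric(s, downcast='unsigned')
print(small.dtype)  # uint8

# A negative value cannot be stored as unsigned, so the dtype is left unchanged
with_negative = pd.Series([-1, 145, 99], dtype='int64')
kept = pd.to_numeric(with_negative, downcast='unsigned')
print(kept.dtype)  # int64
```

When a column may contain negative values, `downcast='integer'` is the option that selects a smaller signed type instead.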

Let us now downcast the integer columns and analyse the dataframe size.

In [9]:
# Downcasting the numeric columns to respective unsigned integers
for column in int_columns:
    df_his_orders[column] = pd.to_numeric(df_his_orders[column], downcast='unsigned')

In [16]:
# Float columns
for column in float_column:
    print(column, df_his_orders[column].min(),df_his_orders[column].max())

print(list(df_his_orders['days_since_prior_order'].unique()))

days_since_prior_order 0.0 30.0
[8.0, 12.0, 7.0, 9.0, 30.0, 17.0, 5.0, 23.0, 10.0, 1.0, 3.0, 2.0, 13.0, 6.0, nan, 0.0, 25.0, 14.0, 18.0, 11.0, 21.0, 4.0, 15.0, 20.0, 19.0, 27.0, 16.0, 24.0, 22.0, 26.0, 28.0, 29.0]


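We can see that the float values are actually whole numbers; pandas interprets the column as float only because of the missing value `nan`, which a plain integer dtype cannot hold. Let us fill the missing values with 0 for now and downcast the column. Keep in mind that after this step a missing value can no longer be told apart from a genuine 0.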

In [19]:
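# Filling the missing values with 0 and downcasting the data.
# The result is assigned back, since fillna(inplace=True) on a selected column may not update the dataframe.
df_his_orders['days_since_prior_order'] = df_his_orders['days_since_prior_order'].fillna(value=0)
df_his_orders['days_since_prior_order'] = (pd
                                           .to_numeric(df_his_orders['days_since_prior_order'], 
                                                       downcast='unsigned'))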
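
Filling with 0 is the approach taken in this post. If the missing values need to be preserved, one possible alternative (not used in the rest of this post) is pandas' nullable integer dtype `'UInt8'`, which stores a separate missing-value mask next to the data. A minimal sketch on a made-up series:

```python
import pandas as pd

# Made-up sample with one missing value; pandas reads it as float64
days = pd.Series([8.0, None, 30.0, 0.0])

# 'UInt8' (capital U) is the nullable unsigned 8-bit integer dtype
days_nullable = days.astype('UInt8')
print(days_nullable.dtype)         # UInt8
print(days_nullable.isna().sum())  # 1
```

The trade-off is that the mask takes extra memory per value, so the column is somewhat larger than a plain `uint8` column.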

In [21]:
# Analysing the memory usage of the dataframe after downcasting Int64 and Float64 values
df_his_orders.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32434489 entries, 0 to 32434488
Data columns (total 15 columns):
order_id                  uint32
product_id                uint16
add_to_cart_order         uint8
reordered                 uint8
user_id                   uint32
eval_set                  object
order_number              uint8
order_dow                 uint8
order_hour_of_day         uint8
days_since_prior_order    uint8
product_name              object
aisle_id                  uint8
department_id             uint8
aisle                     object
department                object
dtypes: object(4), uint16(1), uint32(2), uint8(8)
memory usage: 9.3 GB


We can see that by analysing the data values and downcasting them to the smallest type that fits, we have reduced the dataframe size by around 2GB (from 11.4 GB to 9.3 GB).

## Object Columns

`Object` columns contribute most of the size of a pandas dataframe. Every value in an object column is a separate Python string, so the size of the column grows with both the number of rows and the length of each string, even when the same value is repeated many times.

Let us now analyse the data present in the `object` columns and see how we can optimize the size taken.

In [22]:
object_columns = [col for col in df_his_orders.columns if df_his_orders[col].dtype == 'object']

In [24]:
# Number of unique values present in each object column
for column in object_columns:
    print(column, df_his_orders[column].nunique())

eval_set 1
product_name 49677
aisle 134
department 21


From the above we can infer that each `object` column has only a few unique values compared to the 32 million rows, which are repeated across the rows and take up unnecessary memory. It would be useful if we could store each value once and reuse it across the dataframe. In pandas we can achieve this by converting the column to a categorical column using __`df[column_name].astype('category')`__.

Note: __`df[column_name].astype`__ can also be used to convert integer and float columns from `int64` and `float64` to their downcasted versions. The only drawback is that you have to specify the target type yourself in `astype`, whereas `pd.to_numeric` infers it from the data.
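
To make the idea of "store once and reuse" concrete, the sketch below converts a small made-up column and inspects what the categorical stores: the list of unique labels and one small integer code per row. The sample values are illustrative only.

```python
import pandas as pd

# 4,000 rows but only three distinct labels (made-up sample)
departments = pd.Series(['produce', 'dairy eggs', 'produce', 'snacks'] * 1000,
                        dtype=object)
as_category = departments.astype('category')

# The unique labels are stored once...
print(as_category.cat.categories.tolist())  # ['dairy eggs', 'produce', 'snacks']
# ...and each row only holds a small integer code pointing at a label
print(as_category.cat.codes.dtype)          # int8

before = departments.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)
print(before, after)
```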

In [25]:
# Converting the object columns to categorical columns
for column in object_columns:
    df_his_orders[column] = df_his_orders[column].astype('category')

In [26]:
# Analysing the memory usage of the dataframe after converting object to category
df_his_orders.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32434489 entries, 0 to 32434488
Data columns (total 15 columns):
order_id                  uint32
product_id                uint16
add_to_cart_order         uint8
reordered                 uint8
user_id                   uint32
eval_set                  category
order_number              uint8
order_dow                 uint8
order_hour_of_day         uint8
days_since_prior_order    uint8
product_name              category
aisle_id                  uint8
department_id             uint8
aisle                     category
department                category
dtypes: category(4), uint16(1), uint32(2), uint8(8)
memory usage: 1.0 GB


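We can see that by converting the `object` columns to `category` columns we have reduced the dataframe from 9.3 GB to 1.0 GB, which is around a 10 times reduction in size. Compared to the original 11.4 GB, the dataframe is now more than 11 times smaller.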

## Conclusion

By using the simple tricks listed below we can process large data in pandas efficiently.
- Downcast numeric columns to the smallest signed or unsigned type that fits their values using __`pd.to_numeric`__ or __`df.astype`__.
- Convert object columns with repeated values to categorical columns using __`df.astype('category')`__.
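
The two tricks can be combined into one pass over a dataframe. The function below is a hypothetical helper written for illustration (`shrink_dataframe` and its `max_unique_ratio` threshold are assumptions, not part of pandas); it leaves float columns untouched, since deciding how to handle their missing values needs a look at the data first.

```python
import pandas as pd

def shrink_dataframe(df, max_unique_ratio=0.5):
    """Hypothetical helper: downcast integer columns and convert
    low-cardinality object columns to category. Returns a new dataframe."""
    out = df.copy()
    for column in out.columns:
        col = out[column]
        if pd.api.types.is_integer_dtype(col):
            # Use unsigned types only when the column has no negative values
            target = 'unsigned' if col.min() >= 0 else 'integer'
            out[column] = pd.to_numeric(col, downcast=target)
        elif pd.api.types.is_object_dtype(col):
            # Convert only when values repeat often enough to be worth it
            if col.nunique() / max(len(col), 1) <= max_unique_ratio:
                out[column] = col.astype('category')
    return out

# Made-up sample data for demonstration
demo = pd.DataFrame({
    'order_id': [2, 3, 3421083, 5],
    'reordered': [0, 1, 1, 0],
    'department': pd.Series(['produce', 'produce', 'snacks', 'produce'],
                            dtype=object),
})
shrunk = shrink_dataframe(demo)
print(shrunk.dtypes)
```

The threshold on the share of unique values is a judgment call: a column where almost every value is distinct gains little from the categorical conversion.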