In this notebook, I will show you an easy workflow to shrink the historical transactions dataset 
from **13.1GB once loaded in memory** to **about 1GB**. The techniques showcased below apply to a wide range of other datasets as well. 
Let's get started!

In [None]:
from IPython.display import Image
Image("../input/big-to-small-filepng/big_to_small_file.png")

<center><h1>Sometimes, small is better!</h1></center>
source (with some adaptation): https://bulbapedia.bulbagarden.net

# Why should you bother?

First, it is "fun": it is challenging, it is interesting to learn, 
and it can turn out to be useful. 
How could it be useful?
Well, let's see: 

* you need to train a gradient boosted trees model with every possible training dataset you have. Unfortunately, your laptop has only 8GB of RAM. 

* you have access to cloud resources but to get your model working, you need a much bigger instance which costs 3 times more.

* you have access to a large cloud instance that fits everything but the training time is high and you want to reduce it. 

In all cases, your boss will be happy that you have managed to train with the largest amount of data, using the least amount of resources necessary, and finished the training in a reasonable time. 



# Before optimization

In [None]:
# Some imports
import pandas as pd
DATA_PATH = "../input/elo-merchant-category-recommendation/historical_transactions.csv"

Let's start by loading the dataset using the good old `pd.read_csv`; we will time it right after with the `%%timeit` magic command. 

In [None]:
df = pd.read_csv(DATA_PATH)

Let's also time the loading process

In [None]:
%%timeit
df = pd.read_csv(DATA_PATH)

Around **1 minute** to load the historical transactions dataset!
That's not negligible. Alright, let's check how much space it takes on 
disk first. For that, I will issue the following `bash` command: `ls -lh` (the `h` flag is for getting a human-readable output). 

If you don't know it, you can issue `bash` commands right from a [Jupyter notebook](https://jupyter.org/) (for common commands such as `ls`, with or without the `!` sign before the command since `automagic` is turned on by default). Check this great [blog post](https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html) for more details.

In [None]:
ls -lh {DATA_PATH}

**2.7GB** on disk. That's a large dataset! Not yet "big data" but could make older computers flinch. 
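
If you prefer to stay in Python, the same number can be obtained with `os.path.getsize`. Here is a minimal sketch: the `human_size` helper is a hypothetical name introduced for illustration, and a small temporary file stands in for the real CSV.

```python
import os
import tempfile


def human_size(num_bytes):
    """Format a byte count with binary units, in the spirit of `ls -lh`."""
    size = float(num_bytes)
    for unit in ["B", "K", "M", "G", "T"]:
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


# Write a small stand-in file; point `path` at DATA_PATH for the real dataset.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("card_id,purchase_amount\n" * 1000)
    path = tmp.name

size_on_disk = os.path.getsize(path)
print(human_size(size_on_disk))
os.remove(path)
```

The formatting only approximates what `ls -lh` prints; the byte count itself is exact.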

Alright, the next question to ask is: how do we get the size of this dataset once it is loaded into memory?

# Memory footprint

For that, I will be using a pandas method (of course, there is usually a pandas method for almost everything): [`pandas.DataFrame.info`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html). Let's see what we get. 

In [None]:
# verbose is set to False here to avoid the metadata information
df.info(verbose=False)

**3GB**. That's not that bad. 

But wait, is this really the memory size? **Why is there a `+` sign at the end?** That looks suspicious...

# Real memory footprint

The answer to the previous questions is: no, it isn't!
And the `+` sign is here to indicate that the returned value is only an **estimation**, and in fact a lower bound on the real usage. 

Ok, so why is that?

The answer to that could be summed up in one word: `object`.
In fact, when calling the `.info` method, one doesn't get the "real" memory footprint but rather an estimation. 

From the [`pandas.DataFrame.info`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html#pandas.DataFrame.info) documentation (the `memory_usage` parameter), here is how it is computed: 

> True always show memory usage. False never shows memory usage. A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection a memory estimation is made based in column dtype and number of rows assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources.


So what is the correct way to get it? 
It isn't that hard either, just use the same method again but this time setting `memory_usage="deep"`. By doing so, `pandas` will do the real memory usage computation (thus computing how much space `object` data takes). 
Simple!

In [None]:
df.info(memory_usage="deep", verbose=False)

The correct answer is thus **13.1GB** (but you knew it already if you paid attention to the introduction). That's a huge DataFrame loaded into memory, one that is more than 4 times bigger than the original estimate. 
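
To see the difference between the two estimates on something small, here is a minimal sketch using `DataFrame.memory_usage`, which returns the per-column numbers behind `.info`. The toy values are made up, and the string column is forced to the `object` dtype to mimic what `pd.read_csv` produced above.

```python
import pandas as pd

# Toy frame: one numeric column and one string (object) column.
toy = pd.DataFrame({
    "purchase_amount": [0.5, 1.5, 2.5, 3.5],
    "card_id": pd.Series(["C_ID_0000000001", "C_ID_0000000002",
                          "C_ID_0000000003", "C_ID_0000000004"], dtype=object),
})

# Shallow: object columns only count 8 bytes per pointer.
shallow = toy.memory_usage(deep=False).sum()
# Deep: the Python string objects behind the pointers are counted as well.
deep = toy.memory_usage(deep=True).sum()

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

The gap between the two numbers grows with the number and the length of the strings.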

Can we do something about it? Of course, otherwise this notebook wouldn't make sense. ;)

# Dtypes

![df_blocks.png](https://www.dataquest.io/blog/content/images/df_blocks.png)
<center><h1>Pandas block representation</h1></center>
source: https://www.dataquest.io/blog/pandas-big-data/ 

As mentioned earlier, the "heavy" load comes mostly from the `object` type (and the associated `ObjectBlock`). In simpler words, an `object` dtype is how pandas stores strings. For that, it uses Python objects and not NumPy arrays (contrary to all the other types). Check this [thread](https://stackoverflow.com/questions/34881079/pandas-distinction-between-str-and-object-types) for some explanations why.  
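
To get a feeling for why Python strings are so much heavier than NumPy values, here is a small sketch comparing the size of one string object with the size of one `int64` value. The card id is a made-up value, and the exact string size depends on the Python build, so no number is promised here.

```python
import sys

import numpy as np

card_id = "C_ID_0000000001"              # made-up value, 15 ASCII characters
str_bytes = sys.getsizeof(card_id)       # size of the whole Python str object
int_bytes = np.dtype("int64").itemsize   # fixed size of one int64 value

print(f"str object: {str_bytes} bytes, int64 value: {int_bytes} bytes")
```

On top of the string itself, an `object` column also stores one pointer per row, which is the only part the shallow estimate accounts for.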

Let's see which columns have this type and see how much they contribute to the overall memory footprint. 

In [None]:
# Verbose is left to the default True here since we want the columns metadata.
df.select_dtypes('object').info(memory_usage='deep')

Wow, 11.3GB! That's around **86%** of the total memory footprint!
Alright, what can we do to reduce it?

There is probably only one solution I can think of, that is casting the `object` columns to another, **more efficient** type representation (for example, integer) **while preserving the information**. Let's do this. 
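
Before casting anything, it can help to know how the footprint splits across dtypes. Here is a minimal sketch on made-up rows that mimic a few columns of the dataset; it groups the per-column deep memory usage by dtype name.

```python
import pandas as pd

# Made-up rows; string columns are forced to object, as read_csv gave us above.
toy = pd.DataFrame({
    "authorized_flag": pd.Series(["Y", "N", "Y", "Y"], dtype=object),
    "card_id": pd.Series(["C_ID_0000000001", "C_ID_0000000002",
                          "C_ID_0000000001", "C_ID_0000000003"], dtype=object),
    "installments": [0, 1, 1, 3],
    "purchase_amount": [-0.70, -0.73, -0.72, -0.74],
})

# Deep memory usage per column (index excluded), then summed per dtype name.
per_column = toy.memory_usage(deep=True, index=False)
per_dtype = per_column.groupby(toy.dtypes.astype(str)).sum()
object_share = per_dtype["object"] / per_column.sum()

print(per_dtype)
print(f"object share: {object_share:.0%}")
```

The same three lines work on the full `df` and point directly at the columns worth casting first.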

# Exploring the object columns

Before casting to the appropriate type, we need to explore the columns to find out the best 
type for each. In what follows, I will display a few values of each column, count the number of unique values and compare it to the length of the column. 

In [None]:
for col in df.select_dtypes('object'):
    print(df[col].sample(5))
    print(f"{df[col].nunique()} unique values for {col}, which has {len(df[col])} rows.")

# Timestamps anyone?

Alright, the `purchase_date` column contains temporal information, so let's turn it into a `datetime` type using the [`pandas.to_datetime`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function. 

In [None]:
df.purchase_date = pd.to_datetime(df.purchase_date)

Alright, what is the new memory footprint?

In [None]:
df.info(memory_usage="deep", verbose=False)

Not bad for a start!
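
Where does the gain come from? A parsed timestamp takes a fixed 8 bytes per row, while each string is a full Python object. Here is a small sketch on made-up timestamps, stored as `object` on purpose:

```python
import pandas as pd

# Made-up timestamps, stored as Python strings (object dtype).
raw = pd.Series(["2017-06-25 15:33:07", "2017-07-15 12:10:45",
                 "2018-02-28 23:59:59"] * 1000, dtype=object)
parsed = pd.to_datetime(raw)

raw_bytes = raw.memory_usage(deep=True, index=False)
parsed_bytes = parsed.memory_usage(deep=True, index=False)

print(parsed.dtype)
print(f"object: {raw_bytes} bytes, datetime: {parsed_bytes} bytes")
```

As a bonus, the parsed column gives access to the `.dt` accessor (year, month, day of week, and so on).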

# Categorical to the rescue

Next, any column with "textual" information and having more than 3 unique values, 
with a number of unique values below, say, 60% of the column length, should be transformed into the [categorical](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html) 
type. Some background information about this type: it has been part of pandas for a while 
(since version 0.15, if I recall correctly, with `CategoricalDtype` made public in 0.21.0) and is inspired by the R one. 
To do so, I will use the `astype("category")` method. 

In [None]:
CATEGORICAL_COLS = ["card_id", "category_3", "merchant_id"]
for col in CATEGORICAL_COLS:
    df[col] = df[col].astype("category")

In [None]:
df.info(memory_usage="deep", verbose=False)

That's a huge gain!
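
The rule of thumb used to pick these columns can be turned into a small helper. This is only a sketch: `categorical_candidates`, `min_unique` and `max_ratio` are hypothetical names, the thresholds are judgment calls rather than pandas recommendations, and the toy columns (including `transaction_ref`) are made up.

```python
import pandas as pd


def categorical_candidates(frame, min_unique=3, max_ratio=0.6):
    """Return the object columns with more than `min_unique` distinct values
    and fewer than `max_ratio` * number of rows distinct values."""
    candidates = []
    for col in frame.select_dtypes("object"):
        n_unique = frame[col].nunique()
        if min_unique < n_unique < max_ratio * len(frame):
            candidates.append(col)
    return candidates


n = 1000
toy = pd.DataFrame({
    # 4 distinct ids repeated many times: a good candidate.
    "card_id": pd.Series([f"C_ID_{i % 4:010d}" for i in range(n)], dtype=object),
    # Only 2 distinct values: better binarized than made categorical.
    "authorized_flag": pd.Series(["Y" if i % 2 else "N" for i in range(n)], dtype=object),
    # Every value is distinct: a categorical would not save anything.
    "transaction_ref": pd.Series([f"T_{i:06d}" for i in range(n)], dtype=object),
})

candidates = categorical_candidates(toy)
before = toy["card_id"].memory_usage(deep=True, index=False)
after = toy["card_id"].astype("category").memory_usage(deep=True, index=False)

print(candidates)
print(f"card_id: {before} bytes as object, {after} bytes as category")
```

A categorical stores each distinct value once plus one small integer code per row, which is why it pays off when values repeat a lot.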

# Binarize some features

What about the other `object` columns? These aren't timestamps and have only 2 unique textual 
values. So what to do about these? 
Binarize them! Let's see how to do it.

In [None]:
for col in ["authorized_flag", "category_1"]:
    # Each row having "Y" (short for yes) will get the value 1, otherwise, 0.
    df[col] = (df[col] == "Y").astype("int64")

In [None]:
df.info(memory_usage="deep", verbose=False)

We are on a roll!
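
Here is the same binarization on a tiny made-up column, to make the mapping and the gain explicit. Keep in mind that anything that isn't `"Y"` (a missing value included) ends up as 0 with this approach.

```python
import pandas as pd

# Made-up flags stored as Python strings (object dtype).
flags = pd.Series(["Y", "N", "Y", "Y", "N"] * 200, dtype=object)

# The comparison yields booleans; the cast turns them into 1 and 0.
binary = (flags == "Y").astype("int64")

flags_bytes = flags.memory_usage(deep=True, index=False)
binary_bytes = binary.memory_usage(deep=True, index=False)

print(binary.head().tolist())
print(f"object: {flags_bytes} bytes, int64: {binary_bytes} bytes")
```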

# What about other categorical columns?

After more exploration, it appears that other columns aren't of `object` type but could be 
turned into categoricals to save some more space. Let's do it!

In [None]:
df.nunique().sort_values(ascending=True)

In [None]:
# Be careful, even though it is tempting to turn the "purchase_amount" to
# categorical to gain more space, 
# it isn't the best thing to do since we will be using this column to compute
# aggregations!
for col in ["month_lag", "installments", "state_id", "subsector_id", 
            "city_id", "merchant_category_id"]:
    df[col] = df[col].astype("category")

In [None]:
df.info(memory_usage="deep", verbose=False)

In [None]:
df.dtypes

# Integer with NaNs

One last thing before leaving: the `category_2` column is stored as `float64`. Why is that? To see why, let's plot the distribution of its unique values. 

In [None]:
df.category_2.value_counts(dropna=False, normalize=True).plot(kind='bar')

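Alright, all the values are integer ones, except some NaNs. NaN has no integer representation, so casting the column as is gives unpredictable values. Provided 0 isn't one of the existing values (check the plot above), it is possible to store the NaNs as 0 
and then cast to integer. 

In [None]:
df.category_2 = df.category_2.fillna(0).astype(int)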

In [None]:
pd.__version__

There is a new feature in the [0.24](http://pandas-docs.github.io/pandas-docs-travis/whatsnew/v0.24.0.html#optional-integer-na-support) version that allows doing this "natively" but we need to wait 
for Kaggle to update the [Dockerfile](https://github.com/Kaggle/docker-python/blob/master/Dockerfile) ;)
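
For reference, here is a minimal sketch of what that native support looks like, assuming pandas 0.24 or later is installed. The values are made up; note the capital letter in the `"UInt8"` dtype name, which selects the nullable variant.

```python
import numpy as np
import pandas as pd

# float64 only because of the NaN, exactly like category_2.
category_2 = pd.Series([1.0, 3.0, np.nan, 5.0, 1.0])

# Nullable unsigned 8-bit integers: the missing value is kept as <NA>.
nullable = category_2.astype("UInt8")

print(nullable.dtype)
print(nullable.isna().sum())
```

With this dtype there is no need for a sentinel value to stand for "missing".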

In [None]:
df.info(memory_usage="deep", verbose=False)

# Bonus: smaller integer types

No need to use the int64 type for binary columns, the numpy `uint8` or `bool_` types are 
more than enough. So let's do this!

In [None]:
# You can also use the "bool" type (both take one byte for storage).
df.authorized_flag = df.authorized_flag.astype("uint8")
df.category_1 = df.category_1.astype("uint8")

In [None]:
df.info(memory_usage="deep", verbose=False)

Same thing for the `category_2` column, where the `NaN` value can be stored as 0
and the column cast as `np.uint8`.

In [None]:
df.category_2 = df.category_2.astype("uint8")

In [None]:
df.category_2.value_counts(normalize=True, dropna=False).plot(kind='bar')

In [None]:
df.info(memory_usage="deep", verbose=False)

We went from **13.1GB to 1GB**. How awesome is that!
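
To make the integer part of that gain concrete, here is a small sketch comparing the footprint of the same made-up 0/1 column under three dtypes. It also shows `pd.to_numeric` with `downcast="unsigned"`, which picks the smallest unsigned type that fits the data.

```python
import numpy as np
import pandas as pd

flags = pd.Series([0, 1, 1, 0, 1] * 200, dtype="int64")

# Footprint of the very same values under each dtype (index excluded).
sizes = {
    dtype: int(flags.astype(dtype).memory_usage(index=False))
    for dtype in ["int64", "uint8", "bool"]
}
print(sizes)

# Largest value uint8 can hold, handy to check that a column fits before casting.
uint8_max = np.iinfo("uint8").max

# Let pandas choose the smallest unsigned integer type for us.
downcast = pd.to_numeric(flags, downcast="unsigned")
print(downcast.dtype)
```

Casting to a type that is too small wraps values around silently, so checking the column's maximum against `np.iinfo` first is a cheap safety net.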

Remark: I guess it is even possible to get a smaller DataFrame by using smaller integer types for some of the categorical columns. I haven't done it, so let me know in the comments. ;)
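
On that remark: pandas seems to do part of the job already, since the integer codes behind a categorical use the smallest signed integer type able to index all the categories. A quick sketch on made-up data to check it:

```python
import pandas as pd

# 3 categories fit in 8-bit codes, 1000 categories need 16-bit codes.
few = pd.Series(list("ABC") * 100, dtype="category")
many = pd.Series(range(1000), dtype="category")

few_codes_dtype = few.cat.codes.dtype
many_codes_dtype = many.cat.codes.dtype

print(few_codes_dtype, many_codes_dtype)
```

So any remaining gain would have to come from the categories themselves (for example, long strings) rather than from the codes.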

# TL;DR: give me the optimization pipeline

For those only interested in the output and how to generate it, here is a function that you can add to your notebook/script. 

In [None]:
# This function could be made generic to almost any loaded CSV file with
# pandas. Can you see how to do it?

# Some constants
PARQUET_ENGINE = "pyarrow"
DATE_COL = "purchase_date"
CATEGORICAL_COLS = ["card_id", "category_3", "merchant_id", "month_lag", 
                    "installments", "state_id", "subsector_id", 
                    "city_id", "merchant_category_id"]
CATEGORICAL_DTYPES = {col: "category" for col in CATEGORICAL_COLS}
POSITIVE_LABEL = "Y"
INTEGER_WITH_NAN_COL = "category_2"
BINARY_COLS = ["authorized_flag", "category_1"]
INPUT_PATH = "../input/elo-merchant-category-recommendation/historical_transactions.csv"
OUTPUT_PATH = "historical_transactions.parquet"


def smaller_historical_transactions(input_path, output_path):
    # Load the CSV file, parse the datetime column and the categorical ones.
    df = pd.read_csv(input_path, parse_dates=[DATE_COL], 
                    dtype=CATEGORICAL_DTYPES)
    # Binarize some columns: the comparison already yields the boolean type
    for col in BINARY_COLS:
        df[col] = df[col] == POSITIVE_LABEL
    # Store the NaNs of category_2 as 0, then cast to uint8
    df[INTEGER_WITH_NAN_COL] = df[INTEGER_WITH_NAN_COL].fillna(0).astype("uint8")
    # Save as parquet file
    df.to_parquet(output_path, engine=PARQUET_ENGINE)
    return df
    
def load_historical_transactions(path=None):
    if path is None:
        return smaller_historical_transactions(INPUT_PATH, OUTPUT_PATH)
    else: 
        df = pd.read_parquet(path, engine=PARQUET_ENGINE)
        # Categorical columns aren't preserved when doing pandas.to_parquet
        # (or maybe I am missing something?)
        for col in CATEGORICAL_COLS:
            df[col] = df[col].astype('category')
        return df
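
As for making the pipeline generic, here is one possible sketch that works on an already loaded DataFrame. `shrink_dataframe`, `max_ratio` and `binary_positive` are hypothetical names, the thresholds are judgment calls, and date parsing is left out on purpose since guessing which columns hold dates is error-prone.

```python
import pandas as pd


def shrink_dataframe(frame, max_ratio=0.6, binary_positive="Y"):
    """Return a copy of `frame` with cheaper dtypes where it looks safe."""
    out = frame.copy()
    n_rows = len(out)
    for col in out.columns:
        series = out[col]
        if series.dtype == object:
            n_unique = series.nunique(dropna=True)
            is_flag = (n_unique == 2 and not series.isna().any()
                       and binary_positive in set(series.unique()))
            if is_flag:
                # Two-valued flag without missing values: store as boolean.
                out[col] = series == binary_positive
            elif n_unique < max_ratio * n_rows:
                # Repetitive text: store each distinct value only once.
                out[col] = series.astype("category")
        elif pd.api.types.is_integer_dtype(series):
            # Pick the smallest signed integer type that fits the values.
            out[col] = pd.to_numeric(series, downcast="integer")
    return out


# Made-up rows that mimic a few columns of the dataset.
toy = pd.DataFrame({
    "authorized_flag": pd.Series(["Y", "N"] * 50, dtype=object),
    "card_id": pd.Series([f"C_ID_{i % 5:010d}" for i in range(100)], dtype=object),
    "installments": pd.Series([i % 12 for i in range(100)], dtype="int64"),
    "purchase_amount": [i / 10 for i in range(100)],
})

small = shrink_dataframe(toy)
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```

Float columns are left untouched here, in line with the earlier warning about `purchase_amount`: it is needed as a number for aggregations.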


In [None]:
optimized_df = smaller_historical_transactions(INPUT_PATH, OUTPUT_PATH)

In [None]:
optimized_df.info(memory_usage="deep", verbose=False)

Finally, let's time how long it takes to load the dataset from parquet, how much disk space it takes, and how big its memory footprint is. Notice that I need to remove the old DataFrames first, otherwise the kernel dies.

In [None]:
del df
del optimized_df

In [None]:
# TODO: There is a bug when reading the saved parquet file. Check why and fix it!
# Is it related to this issue: https://issues.apache.org/jira/browse/ARROW-2369?
# Note that the parquet file lives at OUTPUT_PATH (INPUT_PATH is the CSV).
# %%timeit 
# parquet_df = load_historical_transactions(OUTPUT_PATH)

In [None]:
# parquet_df.info(memory_usage="deep", verbose=False)

In [None]:
ls -lh {OUTPUT_PATH}

# To go beyond

* I have written a blog post about pandas and there is a section about memory optimization, so check it out [here](https://www.datacamp.com/community/tutorials/pandas-idiomatic).
* Here is [another](http://pbpython.com/pandas_dtypes.html) blog post about pandas dtypes. 
* Check the pandas [internal architecture](https://github.com/pydata/pandas-design/blob/master/source/internal-architecture.rst) document for more details about the `Block` data structure, the `BlockManager`, and their drawbacks. 
* A great [blog post](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/) to understand how Python objects are stored. This explains why the pandas `object` dtype has a variable size that can't be accurately estimated without `memory_usage="deep"`. 



# To wrap up

I hope you have enjoyed reading this memory optimization workflow and have learned something new. Stay tuned for upcoming kernels. ;)