The goal of this notebook is to offer a set of tips and tricks to load the tabular dataset and manipulate it easily and quickly.

In [None]:
import pandas as pd
import numpy as np
import datetime

# Loading The Data Files Efficiently :
## Using Parquet Format

Using a file format other than CSV can help you load data files faster than the classic `read_csv`. There are several formats suited to big datasets:
- HDF
- Feather
- Parquet
- Pickle (used to store any serializable Python object)

In this case we will use the Parquet format. First, let's load the transactions CSV file, which is over 3 GB in size:

In [None]:
%%time
transactions = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv')

This operation takes around 39 seconds on a Kaggle kernel and uses about 3 GB of RAM.

In [None]:
# Save the dataframe in Parquet format so we can reload it from there
transactions.to_parquet('transactions.parquet')

In [None]:
%%time 
transactions_parquet = pd.read_parquet('./transactions.parquet')


This time the file takes only 9 seconds to load, and it is only 786 MB on disk rather than 3.4 GB. That is roughly 4x faster to load and about 4x smaller. Amazing!
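To get a feel for how the formats compare on your own data, here is a minimal sketch on a small synthetic dataframe. The helper name `compare_formats` and the demo columns are made up for illustration. Parquet support in pandas needs either `pyarrow` or `fastparquet`, so the sketch skips that format if neither is installed. The actual sizes depend on the data, so no expected numbers are shown.

```python
import importlib.util
import os
import tempfile

import numpy as np
import pandas as pd


def compare_formats(df, directory):
    """Write df as CSV, pickle and (if an engine is installed) Parquet.

    Returns a dict mapping format name to file size in bytes.
    `compare_formats` is a hypothetical helper, not part of pandas.
    """
    sizes = {}

    csv_path = os.path.join(directory, "demo.csv")
    df.to_csv(csv_path, index=False)
    sizes["csv"] = os.path.getsize(csv_path)

    pkl_path = os.path.join(directory, "demo.pkl")
    df.to_pickle(pkl_path)
    sizes["pickle"] = os.path.getsize(pkl_path)

    # Parquet needs pyarrow or fastparquet; skip it if neither is available.
    has_engine = any(
        importlib.util.find_spec(name) is not None
        for name in ("pyarrow", "fastparquet")
    )
    if has_engine:
        pq_path = os.path.join(directory, "demo.parquet")
        df.to_parquet(pq_path)
        sizes["parquet"] = os.path.getsize(pq_path)

    return sizes


# Synthetic stand-in for the transactions data (assumed columns).
rng = np.random.default_rng(0)
n = 10_000
demo = pd.DataFrame({
    "t_dat": rng.choice(["2020-09-01", "2020-09-02", "2020-09-03"], size=n),
    "price": rng.random(n),
})

with tempfile.TemporaryDirectory() as tmp:
    sizes = compare_formats(demo, tmp)

print(sizes)
```

On the real dataset you would call the same kind of helper on `transactions` before deleting it.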

In [None]:
del transactions

## Using another library : 

You can use other libraries such as cuDF, which uses the GPU to load dataframes faster. I think it's an unnecessary flex in this case and a waste of computing resources, because the data file is not that huge (under 20 GB).

You can also use a distributed library such as Dask or PySpark if you can set up a cluster on Amazon EMR or GCP Dataproc.

# Converting dates to datetime in a blink ! 

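Here is a neat technique that helps you convert the `t_dat` column of the transactions data to Python datetimes quickly.

Sure, `pd.to_datetime` is clean and easy to use, but it can be a bit slow depending on your machine. With 31 million rows of transactions, it is not our best bet. The trick is that the column contains far fewer distinct dates than rows, so we parse each distinct date string once and map the results back onto the column.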

In [None]:
def convert_to_date(s):
    """
    Memoization technique - very fast conversion to pure python dates
    """
    dates = {date:datetime.datetime.strptime(date,'%Y-%m-%d') for date in s.unique()}
    return s.map(dates)

In [None]:
%%time
convert_to_date(transactions_parquet.t_dat)
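As a sanity check, the sketch below runs the same memoization idea on a small made-up Series and compares the result with `pd.to_datetime`. The `sample` data is hypothetical and only stands in for the real `t_dat` column.

```python
import datetime

import pandas as pd


def convert_to_date(s):
    # Parse each distinct string once, then map the parsed values back.
    dates = {d: datetime.datetime.strptime(d, "%Y-%m-%d") for d in s.unique()}
    return s.map(dates)


# Hypothetical sample: 4000 rows but only 3 distinct dates.
sample = pd.Series(["2020-09-01", "2020-09-02", "2020-09-01", "2020-09-03"] * 1000)

fast = convert_to_date(sample)
reference = pd.to_datetime(sample, format="%Y-%m-%d")

# Both approaches should agree on every row.
same = bool((fast == reference).all())
n_parsed = sample.nunique()  # number of strptime calls actually needed
print(same, n_parsed)
```

The gain comes from calling `strptime` only `n_parsed` times instead of once per row, so it is largest when the column has few distinct values.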

# Use Pandas Vectorization :

Instead of looping through the dataframe with `iterrows()`, which will take an eternity, you should use pandas vectorization whenever possible.

Let's say you've built a function to compute some mathematical expression on a column. To show how fast vectorization is, we will take an arbitrary function $f(x)$ and apply it to the price column.

$$ f(x) = \cos(x^2) $$

In [None]:
def f(x):
    return np.cos(x**2)

In [None]:
%%time
transactions_parquet.price.apply(lambda x: f(x))


This took 41s :o Now let's try it with the pandas vectorization superpower.

In [None]:
%%time
f(transactions_parquet.price)

This is amazing, we did the exact same thing **13x faster**!

You can get even faster results with numpy vectorization, because a pandas Series carries some overhead (such as index handling) on top of the underlying array.

In [None]:
%%time
f(transactions_parquet.price.values)

Did you see that! It takes less than half a second to do the same thing with numpy vectorization, versus around 40 seconds with `apply`! **That's a speedup of roughly 100x.**
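If you want to reproduce the comparison without the full dataset, here is a small sketch on synthetic prices. The `timed` helper and the random `price` Series are assumptions made for the demo. Timings vary by machine, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd


def f(x):
    return np.cos(x ** 2)


def timed(func):
    """Run func() and return (result, elapsed seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Synthetic stand-in for the price column.
rng = np.random.default_rng(42)
price = pd.Series(rng.random(200_000), name="price")

by_apply, t_apply = timed(lambda: price.apply(f))        # one Python call per row
by_series, t_series = timed(lambda: f(price))            # vectorized on the Series
by_numpy, t_numpy = timed(lambda: f(price.to_numpy()))   # vectorized on the raw array

# All three approaches should produce the same numbers.
all_equal = bool(
    np.allclose(by_apply, by_series) and np.allclose(by_series, by_numpy)
)
print(f"apply: {t_apply:.4f}s, Series: {t_series:.4f}s, NumPy: {t_numpy:.4f}s")
```

Note that the NumPy version returns a plain array, so you lose the index. Wrap it back into a Series if you need alignment afterwards.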

# Final Tip : Del and gc.collect()

This tip is a bit obvious but very easy to forget: always delete objects that you're no longer using with the `del` keyword.

Also take a look at how the garbage collector works in Python, and try calling `gc.collect()` (after `import gc`) once you have deleted large objects.
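To illustrate what `gc.collect()` adds on top of `del`, here is a minimal sketch. It assumes CPython, where an object is freed as soon as its reference count drops to zero, unless it is part of a reference cycle. The `Node` class is made up for the demo.

```python
import gc
import weakref


class Node:
    """Tiny object that points at itself to form a reference cycle."""

    def __init__(self):
        self.me = self  # the cycle: the object references itself


gc.disable()  # pause automatic collection so the demo is deterministic
try:
    node = Node()
    probe = weakref.ref(node)  # lets us observe whether the object still exists

    del node  # removes the name, but the cycle keeps the object alive
    alive_after_del = probe() is not None

    gc.collect()  # the cycle detector finds and frees the unreachable object
    alive_after_collect = probe() is not None
finally:
    gc.enable()

print(alive_after_del, alive_after_collect)
```

For a plain dataframe with no cycles, `del` on the last reference is usually enough. `gc.collect()` helps when leftover cycles are still holding memory.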

# Conclusion

In this notebook we saw that a few simple techniques available in pandas, numpy and Python can greatly improve loading speed, operation speed and memory usage!

There are several other techniques. If you know some, please leave a comment and I will try to add them to this notebook with credit.

Finally if this notebook was helpful please give it an upvote ! 