In [134]:
import numpy as np
import pandas as pd

import os
print(os.listdir("../input"))

from sklearn import preprocessing
import random

The Avito challenge uses a number of files that, when we are adding features or doing train-test-splits, can consume more memory than some people have available.

In this notebook, we will be looking at a few things we can do to save memory and increase the overall workflow speed from loading data to creating numerical features.

# Loading files and reducing file size
To test how long certain operations take, we can use the `%%time` magic command at the top of the cell.

Loading a csv file into memory can take some time since pandas has to parse the raw text and infer a datatype for every column.

In [135]:
%%time
train = pd.read_csv("../input/train.csv")

Our goal is to reduce the loading time as well as the memory usage while the object is loaded.

With `.info()` we can check datatypes and the total memory usage. For this, we need to set `memory_usage="deep"`, which will take a bit longer but return a more accurate size.

In [136]:
train.info(memory_usage="deep")

Each DataFrame and Series also has a `.memory_usage()` method which returns the memory usage in bytes.

In [137]:
train.memory_usage(deep=True)
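To illustrate why `deep=True` matters, here is a minimal sketch on a made-up toy DataFrame (the `toy` frame and its values are assumptions, not part of the Avito data). Without `deep=True`, pandas only counts the pointers stored in an `object` column, not the Python strings they point to.

```python
import pandas as pd

# Toy frame: one string (object) column and one integer column.
toy = pd.DataFrame({
    "city": pd.Series(["Moscow", "Kazan", "Moscow"] * 1000, dtype="object"),
    "price": range(3000),
})

shallow = toy.memory_usage(deep=False)  # object columns: only the pointers are counted
deep = toy.memory_usage(deep=True)      # object columns: the string objects are counted too

print(shallow)
print(deep)
```

The integer column reports the same size either way, while the string column is noticeably larger in the deep measurement.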

To make this easier to read, we convert all values to megabytes:

In [138]:
train.memory_usage(deep=True) * 1e-6

The total memory usage can be seen like this:

In [139]:
train.memory_usage(deep=True).sum() * 1e-6

We can see that the title and description columns take up the most space, but even region, city, param_1-3 and the activation date use quite a bit of memory.

While 2500 MB is not too large, this can easily grow to 10+ GB with some column combinations, text transformations and new objects like those created during train-test-splits.

## File/object size reduction with correct datatypes

One of the easiest ways to reduce sizes is by converting columns to the right datatype. Currently, almost every column uses the `object` type, which basically means Python strings, and these are very memory-inefficient.

Pandas uses NumPy's datatypes along with a few additions of its own. The most important ones are integers, floats, datetime, boolean, string and categorical types.
More information on them can be found [here](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html).

### datetime
The easiest and most well-known is converting dates to a datetime-type.

In [140]:
print("size before:", train["activation_date"].memory_usage(deep=True) * 1e-6)
train["activation_date"] = pd.to_datetime(train["activation_date"])
print("size after: ", train["activation_date"].memory_usage(deep=True) * 1e-6)

### categorical
Text columns that have a lot of repeating values, for example `region` and `city`, should be converted to categorical columns. This special datatype basically saves all unique values in a dictionary, then stores a memory-efficient integer code in each row and displays the corresponding text values when using the DataFrame. This is similar to overwriting the column with label-encoded values, except that the column stays readable.

Some tools like XGBoost and LightGBM can also use categorical columns directly without converting them to integer labels. Some other libraries have trouble using them, though, and require string or integer values instead.
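The dictionary-plus-integers idea can be inspected through the `.cat` accessor. This is a small sketch on made-up values (the region names here are just toy data):

```python
import pandas as pd

region = pd.Series(["Sverdlovsk", "Tatarstan", "Sverdlovsk", "Samara"], dtype="object")
region_cat = region.astype("category")

print(region_cat.cat.categories)  # the unique values, stored only once
print(region_cat.cat.codes)       # one small integer per row pointing into the categories
```

With only a handful of categories the codes fit into `int8`, so each row costs one byte instead of a full Python string.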

In [34]:
print("size before:", train["region"].memory_usage(deep=True) * 1e-6)
train["region"] = train["region"].astype("category")
print("size after :", train["region"].memory_usage(deep=True) * 1e-6)

In [141]:
print("size before:", train["city"].memory_usage(deep=True) * 1e-6)
train["city"] = train["city"].astype("category")
print("size after :", train["city"].memory_usage(deep=True) * 1e-6)

This only works well for columns with lower cardinality, meaning a lower number of unique values. For columns like `title` which has more than 90% unique values, this does not help.
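One rough way to decide whether a column is worth converting is to look at its share of unique values first. The helper `unique_ratio` below is a hypothetical name used for illustration, and the toy data is made up:

```python
import pandas as pd

def unique_ratio(series):
    """Share of distinct values in a column (hypothetical helper, not from the notebook)."""
    return series.nunique() / len(series)

toy = pd.DataFrame({
    "region": pd.Series(["A", "B"] * 500, dtype="object"),                   # low cardinality
    "title": pd.Series([f"item {i}" for i in range(1000)], dtype="object"),  # all unique
})

for col in toy.columns:
    before = toy[col].memory_usage(deep=True)
    after = toy[col].astype("category").memory_usage(deep=True)
    print(col.ljust(8), "unique ratio:", unique_ratio(toy[col]), " bytes:", before, "->", after)
```

A ratio close to 0 suggests a good candidate for the categorical type, while a ratio close to 1 means the conversion will probably not save anything.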

To make this easier, we can define a function to help with the categorical conversions and print the before/after object sizes:

In [142]:
def convert_columns_to_catg(df, column_list):
    for col in column_list:
        print("converting", col.ljust(30), "size: ", round(df[col].memory_usage(deep=True)*1e-6,2), end="\t")
        df[col] = df[col].astype("category")
        print("->\t", round(df[col].memory_usage(deep=True)*1e-6,2))

In [143]:
convert_columns_to_catg(train, column_list=["param_1", "param_2", "param_3", "parent_category_name", "category_name", "user_type"])

The total memory size has decreased a lot (2500 MB to 1300 MB) now with title and description taking up the majority of the size:

In [144]:
print(train.memory_usage(deep=True)*1e-6)
print("total:", train.memory_usage(deep=True).sum()*1e-6)
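The same conversions can probably also be requested while loading, so the large `object` columns are never created in the first place. This is a sketch using the `dtype` and `parse_dates` parameters of `pd.read_csv` on a tiny in-memory csv; the rows are made up and only the column names follow the Avito file:

```python
import io
import pandas as pd

# Made-up sample rows standing in for train.csv.
csv_text = """region,city,activation_date,price
Sverdlovsk,Yekaterinburg,2017-03-28,400
Samara,Samara,2017-03-26,3000
Sverdlovsk,Yekaterinburg,2017-03-20,4000
"""

toy = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"region": "category", "city": "category"},  # skip the object stage entirely
    parse_dates=["activation_date"],                    # parse dates during loading
)
print(toy.dtypes)
```

For the real file you would pass the path instead of the `io.StringIO` object.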

# Saving objects as pickle-files for faster loading

After these basic improvements, we can save the DataFrame as a so-called pickle file, which is the entire Python object saved to your hard drive with all datatypes intact. Compared to a csv, which just stores raw string values, this will be a lot smaller. The file on disk will not shrink by the same factor as the object in memory, but it will still be smaller than the csv.

Unfortunately we can't save larger files in Kaggle kernels so we will just compare the 1300 MB to the original csv-file:

In [145]:
train.to_pickle("train.pkl")

In [146]:
# size is shown in bytes again and needs to be converted to megabytes
print("train.csv:", os.stat('../input/train.csv').st_size * 1e-6)
print("train.pkl:", os.stat('train.pkl').st_size * 1e-6)

Loading from a .pkl file will also be a lot faster (remember the 20-25s load time for the csv). If you do the datatype improvements once, save everything as a pickle file, then only load from those, you can save time each time you have to load your data again.

In [147]:
del train

In [148]:
%%time
train = pd.read_pickle("train.pkl")

When working with a new dataset, I usually create a first notebook to load all relevant files, convert datatypes and save the DataFrame as a pickle file, and then only load this file in the main feature-engineering notebook. This saves time and memory when I actually start working with the data.
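The reason the pickle round trip is worth it is that the datatypes survive, while a csv round trip throws them away. A minimal sketch on made-up data, written to a temporary directory (the file names are placeholders):

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    "region": pd.Series(["A", "B", "A"], dtype="category"),
    "activation_date": pd.to_datetime(["2017-03-28", "2017-03-26", "2017-03-20"]),
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    csv_path = os.path.join(tmp, "toy.csv")
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

print(from_pickle.dtypes)  # category and datetime are preserved
print(from_csv.dtypes)     # both columns come back as plain text
```

Keep in mind that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.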

In [149]:
# We will remove the file from the Kernels virtual environment.
os.remove("train.pkl")

# Garbage Collector
Python has a library for controlling its garbage collector, a system that manages objects in memory and specifically removes objects that are no longer needed.

After doing larger transformations, object creations/deletions or generally anything else that runs for more than a few seconds, it can help to free up memory by calling the garbage collector directly.

On your own computer you can use htop, the Windows Task Manager and similar tools to monitor the RAM. For demonstration in this notebook we can use `psutil` to show the available RAM.

In [150]:
import gc
import psutil

In [151]:
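# virtual_memory() returns several fields; .available is the free RAM in bytes
print("available RAM (MB):", psutil.virtual_memory().available * 1e-6)

gc.collect()

print("available RAM (MB):", psutil.virtual_memory().available * 1e-6)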
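`gc.collect()` also returns the number of unreachable objects it found, which gives a rough way to check whether a manual call did anything. In CPython most objects are freed immediately through reference counting, and the collector mainly handles reference cycles. The sketch below creates such cycles with a toy `Node` class (a made-up class for illustration):

```python
import gc

class Node:
    """Toy object that can point at another Node."""
    def __init__(self):
        self.other = None

gc.disable()   # pause automatic collection so the demo is deterministic
gc.collect()   # start from a clean state

for _ in range(1000):
    a, b = Node(), Node()
    a.other, b.other = b, a   # reference cycle: a -> b -> a

del a, b
freed = gc.collect()   # number of unreachable objects found in this run
gc.enable()

print("unreachable objects found:", freed)
```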

# Numerical data, label-encoding and high cardinality features

While text columns take up the most memory, having a lot of numerical columns can also add up over time if they don't use optimal datatypes.

We will look at some newly created label columns and explore a way to handle high-cardinality features at the same time.

For tree-based models, we usually apply a label-encoding to categorical columns.

For columns like `region` this works well: even for the region with the fewest occurrences, we still have enough rows in the whole dataset to allow the tree to find patterns.

In [152]:
train["region"].value_counts()

For other columns like `user_id`, there are only a few dozen users with hundreds of rows of data and a lot of users with less than a handful of rows, so a separate label for each unique user is too fine-grained for a tree-based model.

In [153]:
train["user_id"].value_counts().head(5)

In [154]:
train["user_id"].value_counts().tail(5)

## Single column encoding

When applying label-encoding to high-cardinality columns like this, you should only apply these to categories with at least 20/50/100 rows of data, depending on your preferences.

To make this easier, let's define another function to label-encode with a count-threshold:

In [155]:
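def create_label_encoding_with_min_count(df, column, min_count=50):
    # number of rows that share each row's value in `column`
    column_counts = df.groupby([column])[column].transform("count").astype(int)
    # values below the threshold are replaced by "" so they all share one label
    column_values = np.where(column_counts >= min_count, df[column], "")
    # label-encode the thresholded values and store them on the passed DataFrame
    df[column+"_label"] = preprocessing.LabelEncoder().fit_transform(column_values)
    
    return df[column+"_label"]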

In [156]:
train["user_id_label"] = create_label_encoding_with_min_count(train, "user_id", min_count=50)

In [157]:
print("number of unique users      :", len(train["user_id"].unique()))
print("number of unique user labels:", len(train["user_id_label"].unique()))

These 562 labels are much easier for a model to use than 700k distinct user ids, which you probably would not have used at all.
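To see the thresholding on a small example, here is a pandas-only variant of the same idea. This is a sketch, not the function used in this notebook: `label_encode_with_min_count` is a hypothetical name, and it uses `pd.factorize` instead of scikit-learn's `LabelEncoder`.

```python
import pandas as pd

def label_encode_with_min_count(series, min_count=2):
    """Hypothetical pandas-only variant: all rare values share label 0."""
    counts = series.map(series.value_counts())     # per-row count of that row's value
    frequent = series.where(counts >= min_count)   # rare values become NaN
    codes, _ = pd.factorize(frequent, sort=True)   # NaN is coded as -1
    return pd.Series(codes + 1, index=series.index)

# u1 appears 3 times, u2 twice, u3 only once
users = pd.Series(["u1", "u2", "u1", "u3", "u1", "u2"])
labels = label_encode_with_min_count(users, min_count=2)
print(labels.tolist())  # [1, 2, 1, 0, 1, 2]
```

The single occurrence of `u3` falls below the threshold and lands in the shared label 0, while the frequent users keep their own labels.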

## Multi column encoding

In some cases it makes sense to concatenate columns first, then apply a label encoding on the resulting column
* to help the tree-model to find structure 
* or to avoid labelling categories with the same number although they belong to different parent categories

In the Avito dataset, this occurs with the region-city hierarchies. Many kernels apply label encodings to both columns individually which causes information loss. There are a number of cities with the same name that belong to different regions:

In [158]:
train.loc[train["city"]=="Светлый", "region"].value_counts().head()

For this, we should concatenate region and city to assign better labels to the individual cities.

We can't add the values of categorical columns together so we need to use `.apply()` with a join-function which is rather slow and creates a new column with high memory usage.

In [159]:
%%time
train["region_city"] = train.loc[:, ["region", "city"]].apply(lambda s: " ".join(s), axis=1)

In [160]:
print("unique:", len(train["region_city"].unique()))
print("size:  ", train["region_city"].memory_usage(deep=True)*1e-6)

If you want to use multiple column-combinations, the creation alone can take a few minutes.

To speed this up, we can use the groupby-function and apply unique values for each column-combination. Using the `.transform()` method we usually return an aggregated value back to each row. In this case, we will just return one unique random number for each grouped combination.

In [161]:
%%time
train["region_city_2"] = train.groupby(["region", "city"])["region"].transform(lambda x: random.random())

In [162]:
print("unique:", len(train["region_city_2"].unique()))
print("size:  ", train["region_city_2"].memory_usage(deep=True)*1e-6)

This is not only 10 times faster but also creates a much smaller new column, which can then be used in a LabelEncoder().
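A possible alternative to the random-number trick is the groupby method `.ngroup()`, which numbers each group with a consecutive integer. This is a different technique than the one timed above, so the speed-up mentioned there does not automatically apply, but the ids are deterministic and cannot collide. The toy values below are made up:

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["region A", "region B", "region A", "region B"],
    "city":   ["Svetly",   "Svetly",   "Svetly",   "Tomsk"],
})

# One integer per (region, city) combination, numbered in sorted group order.
toy["region_city_id"] = toy.groupby(["region", "city"]).ngroup()
print(toy)
```

The two cities named `Svetly` get different ids because they belong to different regions. With categorical columns you may want to pass `observed=True` to `groupby` so that only combinations that actually occur are considered.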

In [163]:
train["region_city_2_label"] = create_label_encoding_with_min_count(train, "region_city_2", min_count=50)

In [164]:
gc.collect()

## Numerical data size reduction

Let's add a few more numerical columns to see what we can do with their datatypes and sizes.

In [165]:
train["description_len"] = train["description"].fillna("").apply(len)
train["description_count_words"] = train["description"].fillna("").apply(lambda s: len(s.split(" ")))

In [166]:
train.loc[:, ["user_id_label", "region_city_2_label", "description_len", "description_count_words"]
         ].info()

We can see that pandas created all new columns as [int64](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html), meaning they can store values between -9223372036854775808 and 9223372036854775807.

Looking at the min and max values of our new columns, it becomes obvious we can use a smaller datatype as we use smaller min-max-ranges:
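For comparison, the value ranges of the signed integer types can be looked up with `np.iinfo`:

```python
import numpy as np

for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(dtype.__name__.ljust(6), "min:", info.min, " max:", info.max,
          " bytes per value:", np.dtype(dtype).itemsize)
```

A column whose values all fit between -32768 and 32767, for example, only needs `int16`, which takes 2 bytes per value instead of 8.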

In [167]:
for col in ["user_id_label", "region_city_2_label", "description_len", "description_count_words"]:
    print(col.ljust(30), "min:", train[col].min(), "  max:", train[col].max())

The memory usage of these columns is comparatively small for now. If you use the train and test set together and create a few dozen integer columns, this can still quickly add up to hundreds of megabytes of extra RAM usage.

In [168]:
train.loc[:, ["user_id_label", "region_city_2_label", "description_len", "description_count_words"]
         ].memory_usage(deep=True)*1e-6

Pandas offers the `pd.to_numeric()` function to convert columns to numeric values and at the same time downcast them to the smallest datatype that fits the given value range.

In [169]:
train["user_id_label"] = pd.to_numeric(train["user_id_label"], downcast="integer")

In [170]:
train.loc[:, ["user_id_label", "region_city_2_label", "description_len", "description_count_words"]
         ].info()
# note the int16 here

In [171]:
train.loc[:, ["user_id_label", "region_city_2_label", "description_len", "description_count_words"]
         ].memory_usage(deep=True)*1e-6

For dozens or hundreds of integer columns, a 50-75% size reduction can help a lot in the end.
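Which datatype `downcast="integer"` picks depends only on the value range of the column. A small sketch with made-up values:

```python
import pandas as pd

examples = {
    "small":  pd.Series([0, 50, 100], dtype="int64"),      # fits into int8
    "medium": pd.Series([0, 500, 30000], dtype="int64"),   # needs int16
    "large":  pd.Series([0, 500, 100000], dtype="int64"),  # needs int32
}

downcasted = {name: pd.to_numeric(s, downcast="integer") for name, s in examples.items()}

for name, s in downcasted.items():
    print(name.ljust(8), examples[name].dtype, "->", s.dtype)
```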

To downcast all available integer columns in our DataFrame, we can also use this function:

In [174]:
def downcast_df_int_columns(df):
    list_of_columns = list(df.select_dtypes(include=["int32", "int64"]).columns)
        
    if len(list_of_columns)>=1:
        max_string_length = max([len(col) for col in list_of_columns]) # finds max string length for better status printing
        print("downcasting integers for:", list_of_columns, "\n")
        
        for col in list_of_columns:
            print("reduced memory usage for:  ", col.ljust(max_string_length+2)[:max_string_length+2],
                  "from", str(round(df[col].memory_usage(deep=True)*1e-6,2)).rjust(8), "to", end=" ")
            df[col] = pd.to_numeric(df[col], downcast="integer")
            print(str(round(df[col].memory_usage(deep=True)*1e-6,2)).rjust(8))
    else:
        print("no columns to downcast")
    
    gc.collect()
    
    print("done")

In [173]:
downcast_df_int_columns(train)

You can also downcast float values from float64 to float32, but float32 only keeps about 7 significant digits, so this can cause a loss of precision for large values or values with a lot of decimal places.
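The effect of a float32 conversion can be seen on two made-up values, one large whole number and one simple decimal:

```python
import pandas as pd

values = pd.Series([16777217.0, 0.1], dtype="float64")
as_float32 = values.astype("float32")

# Convert back to float64 so the stored float32 values print with full digits.
print(values.tolist())
print(as_float32.astype("float64").tolist())
```

The first value changes from 16777217 to 16777216 because float32 cannot represent every integer above 2**24, and the second value is stored slightly off from the float64 version of 0.1. Whether this matters depends on the feature, so it is worth checking before downcasting float columns.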

Even if your dataset starts out smaller and you don't have any memory issues, it can come in handy if you convert texts to categories, downcast integer values and in general try to use approaches that avoid handling a lot of string-values.

**Do you have other tips and tricks to work with larger datasets?**