# Hello and welcome to my memory usage reduction tutorial!


**In this very short and simple tutorial I will show some insanely easy methods to reduce the memory/RAM usage of pandas dataframes.**

**I will not do any stuff like imputing, encoding, handling missing values or feature selection. NONE of that stuff, ONLY memory usage reduction.**

**For this tutorial I will use the data of the IEEE fraud detection competition:**  https://www.kaggle.com/c/ieee-fraud-detection/submissions

**If you are interested in a complete and detailed tutorial for this competition, feel free to have a look at this:** https://www.kaggle.com/jonas0/ieee-fraud-detection

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Overview
 
## 1. Load the data

## 2. Convert numerical datatypes to smaller ones

## 3. Convert categorical features to 'category' datatype

## 4. Summary

# 1. Load the data

**For this short and simple tutorial we will only load the train_transaction data.**

In [None]:
print("loading data takes about 1 minute....")

train_transaction = pd.read_csv('/kaggle/input/ieee-fraud-detection/train_transaction.csv', index_col='TransactionID')
#test_transaction = pd.read_csv('/kaggle/input/ieee-fraud-detection/test_transaction.csv', index_col='TransactionID')

#train_identity = pd.read_csv('/kaggle/input/ieee-fraud-detection/train_identity.csv', index_col='TransactionID')
#test_identity = pd.read_csv('/kaggle/input/ieee-fraud-detection/test_identity.csv', index_col='TransactionID')

#sample_submission = pd.read_csv('/kaggle/input/ieee-fraud-detection/sample_submission.csv', index_col='TransactionID')

print("loading successful!")

**Let's have a look at our data:**

In [None]:
print("shape of train_transaction: ", train_transaction.shape, "\n")

print("info of train_transaction: \n")

# .info() prints its report itself and returns None, so we do not wrap it in print()
train_transaction.info()

**As we can see this dataframe has 590540 rows and 393 columns.**

**The .info() method of pandas is quite helpful: it shows us which datatypes the 393 columns have: float64(376), int64(3), object(14).**

**And in the last line we can see that our dataframe uses about 1.7 GB of memory/RAM, let's see if we can reduce it :)**
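
**One caveat about that number: by default `.info()` only counts the 8-byte references of `object` columns, not the strings they point to (that is why pandas prints a `+` next to the value). Below is a minimal sketch of a helper that measures the full footprint with `memory_usage(deep=True)`. The function name `memory_in_mb` and the toy dataframe are made up for illustration.**

```python
import numpy as np
import pandas as pd

def memory_in_mb(dataframe, deep=True):
    """Return the memory footprint of a dataframe in MB (1 MB = 1024**2 bytes)."""
    return dataframe.memory_usage(index=True, deep=deep).sum() / 1024**2

# toy dataframe standing in for train_transaction
toy = pd.DataFrame({
    "amount": np.linspace(0.0, 100.0, 1000),                            # float64
    "card_type": pd.Series(["credit", "debit"] * 500, dtype="object"),  # Python strings
})

shallow_mb = memory_in_mb(toy, deep=False)  # object columns: references only
deep_mb = memory_in_mb(toy, deep=True)      # object columns: references plus the strings

print(f"shallow: {shallow_mb:.3f} MB, deep: {deep_mb:.3f} MB")
```

**On the real data you would call `memory_in_mb(train_transaction)` before and after each conversion step to compare.**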

**Before we can reduce the memory usage of our dataframe, we have to detect the numerical and the categorical features first:**

In [None]:
# let's generate some useful lists of columns,
# we want a list of numerical features
# and a list of categorical features

c = (train_transaction.dtypes == 'object')
n = (train_transaction.dtypes != 'object')
cat_cols = list(c[c].index)
num_cols = list(n[n].index) 

print(cat_cols, "\n")
print("number categorical features: ", len(cat_cols), "\n\n")
print(num_cols, "\n")
print("number numerical features: ", len(num_cols))

# 2. Convert numerical datatypes to smaller ones

**Dataframes can contain two types of numerical values: integers and floats. Each of them comes in several sizes: int8/int16/int32/int64 and float16/float32/float64 (float8 does not exist).**

**As you may already know, the number in the datatype name is the number of bits used per value, so the higher the number, the more memory it consumes. Down below is a chart listing the numerical datatypes together with the minimum and maximum value they can hold and the range they cover.**

**When you load a dataframe with pandas, all numerical features are stored as int64/float64 by default.**

**If all values of an integer feature lie between the minimum and maximum of a smaller datatype, you can convert it down without losing any information. Floats are different: a smaller float datatype also has less precision (float16 keeps only about 3 significant decimal digits, float32 about 7), so values may get rounded. Either way we can save a lot of memory.**
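
**You do not have to memorize these limits, NumPy can report them with `np.iinfo` and `np.finfo`. The short sketch below prints them and also shows what happens to an example value (1234.56, picked just for illustration) when it is stored in a smaller float datatype.**

```python
import numpy as np

# exact limits of the integer datatypes
for int_type in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(int_type)
    print(f"{info.dtype}: {info.min} to {info.max}")

# limits and approximate decimal precision of the float datatypes
for float_type in (np.float16, np.float32, np.float64):
    info = np.finfo(float_type)
    print(f"{info.dtype}: max = {info.max}, about {info.precision} decimal digits")

# the magnitude of this value fits into float16, but not all of its digits
original = 1234.56
as_float16 = float(np.float16(original))  # rounded to 1235.0
as_float32 = float(np.float32(original))  # very close to the original
print(original, as_float16, as_float32)
```

**So for float columns, "the value fits into the range" is not the same as "no information is lost".**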

In [None]:
# the int datatypes have the following ranges:

#   int8:  -128 to 127, range = 255

#  int16:  -32,768 to 32,767, range = 65,535

#  int32:  -2,147,483,648 to 2,147,483,647, range = 4,294,967,295

#  int64:  -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807,
#           range = 18,446,744,073,709,551,615


#  The float datatypes have different limits, and they also differ in precision:

#  float16:  -65,504 to 65,504, roughly 3 significant decimal digits

#  float32:  about -3.4e38 to 3.4e38, roughly 7 significant decimal digits

#  float64:  about -1.8e308 to 1.8e308, roughly 15 significant decimal digits


#  By default all numerical columns in pandas are in int64 or float64.
#  This means that when we find a numerical column whose minimum and
#  maximum both lie inside one of the ranges shown above, we can
#  convert it down to a smaller datatype.

**Let's look at our dataframe again before we start reducing the memory usage:**

In [None]:
print("train_transaction.info(): \n")

train_transaction.info()

**The dataframe uses 1.7 GB and has 376 columns in float64, 3 columns in int64 and 14 columns in object.**

**Let's write some code that checks if we can convert a numerical feature to a smaller datatype:**

In [None]:
#  this function detects all the numerical columns
#  that can be converted to a smaller datatype.
#  A column fits a datatype if its minimum AND maximum both lie inside
#  that datatype's limits (np.iinfo for ints, np.finfo for floats).
#  Comparing only the difference max - min is not enough: values from
#  200 to 300 differ by just 100, but they do not fit into int8.

def detect_num_cols_to_shrink(list_of_num_cols, dataframe):
 
    convert_to_int8 = []
    convert_to_int16 = []
    convert_to_int32 = []
    
    #  sadly the datatype float8 does not exist
    convert_to_float16 = []
    convert_to_float32 = []
    
    def fits(minimum, maximum, limits):
        return limits.min <= minimum and maximum <= limits.max
    
    for col in list_of_num_cols:
        
        #  min() and max() ignore missing values (NaN)
        minimum = dataframe[col].min()
        maximum = dataframe[col].max()
        
        if pd.api.types.is_integer_dtype(dataframe[col]):
            if fits(minimum, maximum, np.iinfo(np.int8)):
                convert_to_int8.append(col)
            elif fits(minimum, maximum, np.iinfo(np.int16)):
                convert_to_int16.append(col)
            elif fits(minimum, maximum, np.iinfo(np.int32)):
                convert_to_int32.append(col)
                
        elif pd.api.types.is_float_dtype(dataframe[col]):
            #  careful: this only checks the value range. float16 keeps about
            #  3 significant decimal digits, so converted values can get rounded.
            if fits(minimum, maximum, np.finfo(np.float16)):
                convert_to_float16.append(col)
            elif fits(minimum, maximum, np.finfo(np.float32)):
                convert_to_float32.append(col)
        
    list_of_lists = []
    list_of_lists.append(convert_to_int8)
    list_of_lists.append(convert_to_int16)
    list_of_lists.append(convert_to_int32)
    list_of_lists.append(convert_to_float16)
    list_of_lists.append(convert_to_float32)
    
    return list_of_lists

**Let's call the function and print all the numerical features we can convert:**

In [None]:
num_cols_to_shrink_trans = detect_num_cols_to_shrink(num_cols, train_transaction)

convert_to_int8 = num_cols_to_shrink_trans[0]
convert_to_int16 = num_cols_to_shrink_trans[1]
convert_to_int32 = num_cols_to_shrink_trans[2]

convert_to_float16 = num_cols_to_shrink_trans[3]
convert_to_float32 = num_cols_to_shrink_trans[4]

print("convert_to_int8 :", convert_to_int8, "\n")
print("convert_to_int16 :", convert_to_int16, "\n")
print("convert_to_int32 :", convert_to_int32, "\n")

print("convert_to_float16 :", convert_to_float16, "\n")
print("convert_to_float32 :", convert_to_float32, "\n")

**As we can see, a lot of features can be converted: most of them to float16 and some to float32.**
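
**As a side note, pandas also ships a built-in for this job: `pd.to_numeric` with the `downcast` argument. This is an alternative to the function above, not what this tutorial uses. Below is a small sketch on a made-up dataframe (the column names and values are just examples). Note that `downcast='float'` never goes below float32, so it is more conservative than converting to float16.**

```python
import numpy as np
import pandas as pd

# toy dataframe, everything starts as int64/float64
toy = pd.DataFrame({
    "is_fraud": np.array([0, 1, 0, 1], dtype="int64"),
    "card_id": np.array([1000, 18396, 5000, 7000], dtype="int64"),
    "amount": np.array([68.5, 29.0, np.nan, 50.0], dtype="float64"),
})

shrunk = toy.copy()
for col in shrunk.columns:
    if pd.api.types.is_integer_dtype(shrunk[col]):
        # picks the smallest signed integer datatype that holds all values
        shrunk[col] = pd.to_numeric(shrunk[col], downcast="integer")
    elif pd.api.types.is_float_dtype(shrunk[col]):
        # goes down to float32 at most
        shrunk[col] = pd.to_numeric(shrunk[col], downcast="float")

print(shrunk.dtypes)
```

**The upside of this variant is that pandas does the range checks for you, the downside is that you cannot reach float16 with it.**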

In [None]:
print("starting with converting process....")

# convert the datatypes with .astype() 

for col in convert_to_int8:
    train_transaction[col] = train_transaction[col].astype('int8')

for col in convert_to_int16:
    train_transaction[col] = train_transaction[col].astype('int16')  
    
for col in convert_to_int32:
    train_transaction[col] = train_transaction[col].astype('int32') 

for col in convert_to_float16:
    train_transaction[col] = train_transaction[col].astype('float16')
    
for col in convert_to_float32:
    train_transaction[col] = train_transaction[col].astype('float32')
    
print("successfully converted!")

In [None]:
print("train_transaction.info(): \n")   # now uses 548 MB

train_transaction.info()

**Wow, the memory usage went down from 1.7 GB to 548 MB, so converting the numerical datatypes can really be worth it.**

**In the next chapter we will transform the categorical columns from 'object' datatype to 'category'.**

# 3. Convert categorical features to 'category' datatype

**We have already created our list 'cat_cols' containing all categorical features.**

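**We can simply convert all of the 14 categorical features from 'object' datatype to 'category'. This works well when a column has only a few distinct values compared to its number of rows, which is typical for categorical features. For a column where almost every value is unique, 'category' would not save memory.**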
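
**To see why this helps: a 'category' column stores every distinct string only once and keeps a small integer code per row. Below is a minimal sketch with a made-up column of card brands (not taken from the competition data), measured with `deep=True` so the strings themselves are counted.**

```python
import pandas as pd

n_rows = 100_000

# a text column with only 4 distinct values, stored as Python strings
brands = ["visa", "mastercard", "discover", "american express"]
as_object = pd.Series(brands * (n_rows // 4), dtype="object")
as_category = as_object.astype("category")

object_bytes = as_object.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print("distinct values:", as_object.nunique())
print(f"object:   {object_bytes / 1024**2:.2f} MB")
print(f"category: {category_bytes / 1024**2:.2f} MB")
print("datatype of the codes:", as_category.cat.codes.dtype)
```

**With only 4 distinct values the codes fit into int8, so every row costs a single byte instead of a reference plus a full Python string.**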

In [None]:
for i in cat_cols:
    
    train_transaction[i] = train_transaction[i].astype('category')
    
print("successfully converted all categorical features!")

**Now we can check if the memory usage went down again:**

In [None]:
print("train_transaction.info(): \n")

train_transaction.info()

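**Well ok, according to .info() converting the 14 categorical features from 'object' to 'category' only saved us about 55 MB, but there were only 14 categorical features. Keep in mind that .info() by default counts only the 8-byte references of 'object' columns and not the strings themselves, so the real saving is larger. Calling .info(memory_usage='deep') shows the full picture.**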

# 4. Summary


**Converting the 379 numerical features saved about 1.1 GB, while converting the 14 categorical columns only saved about 55 MB.**

**Considering the few lines of code and the low effort this whole procedure takes, it can really be worth it, since you can save gigabytes of RAM.**

# Thank you for reading this tutorial :)