The goal of this kernel is to show a very simple way to reduce data size without writing kilometers of code full of if/else constructions. I also want to give beginners some explanation of why it works.

So, let's get our hands dirty.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import gc

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [None]:
# Loading data
train_tr = pd.read_csv('../input/ieee-fraud-detection/train_transaction.csv', index_col = 'TransactionID')
train_id = pd.read_csv('../input/ieee-fraud-detection/train_identity.csv', index_col = 'TransactionID')

test_tr = pd.read_csv('../input/ieee-fraud-detection/test_transaction.csv', index_col = 'TransactionID')
test_id = pd.read_csv('../input/ieee-fraud-detection/test_identity.csv', index_col = 'TransactionID')

# Join transaction and identity tables for train and for test
train_df = train_tr.join(train_id)
test_df = test_tr.join(test_id)

# Removing datasets that we don't need anymore
del train_id
del train_tr
del test_id
del test_tr

gc.collect()

print(train_df.shape)
print(test_df.shape)

In [None]:
train_df.head()

After loading the data and joining it into separate train and test datasets, we can start working on its size.

First, let's look at what we have now using the info() method.

In [None]:
train_df.info()

In [None]:
test_df.info()

In [None]:
# Check memory usage of different features
# train_df.memory_usage()

We can see that our train and test datasets weigh about 2GB each. We can also see that both datasets contain only 3 datatypes (float64, int64 and object) and that every column takes 4724320 bytes, even the isFraud feature. This is the most interesting part, because the size of our data depends on the datatypes of our features.
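The per-column figure follows from simple arithmetic: the number of rows times the bytes per value of the dtype. Here is a minimal sketch on a toy frame; the column names and values are made up for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame; names and values are illustrative only
n_rows = 1000
toy = pd.DataFrame({
    'flag': np.zeros(n_rows, dtype='int64'),     # only zeros, yet still 8 bytes per value
    'amount': np.ones(n_rows, dtype='float64'),
})

# memory_usage(index=False) reports bytes per column, excluding the index
per_column = toy.memory_usage(index=False)
print(per_column)

# Each column costs rows * itemsize bytes, regardless of the values stored
expected = n_rows * np.dtype('int64').itemsize
print(expected)
```

Applied to the figure above: 4724320 / 8 = 590540, so that number corresponds to one 8-byte value per row, which is why a 0/1 column costs as much as any other.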

Here is a short description of the dtypes:
* bool type - consumes 1 byte of memory, range True or False

**int types**:
* int8 - consumes 1 byte of memory, range from -128 to 127
* int16 - consumes 2 bytes of memory, range from -32 768 to 32 767
* int32 - consumes 4 bytes of memory, range from -2 147 483 648 to 2 147 483 647
* int64 - consumes 8 bytes of memory, range from -9 223 372 036 854 775 808 to 9 223 372 036 854 775 807

**uint types:**
* uint8 - consumes 1 byte of memory, range from 0 to 255
* uint16 - consumes 2 bytes of memory, range from 0 to 65 535
* uint32 - consumes 4 bytes of memory, range from 0 to 4 294 967 295
* uint64 - consumes 8 bytes of memory, range from 0 to 18 446 744 073 709 551 615

**float types:**
* float16 - consumes 2 bytes of memory, range from -6.55040e+04 to 6.55040e+04, resolution 0.001
* float32 - consumes 4 bytes of memory, range from -3.4028235e+38 to 3.4028235e+38, resolution 1e-06
* float64 - consumes 8 bytes of memory, range from -1.7976931348623157e+308 to 1.7976931348623157e+308, resolution 1e-15

In [None]:
# You can use these commands to see datatypes description
print(np.iinfo('int16'))
print(np.finfo('float64'))

In [None]:
# First I will select only numeric columns
num_cols = [col for col in train_df.columns.values if str(train_df[col].dtype) != 'object']

In [None]:
# To satisfy my curiosity, I'll create a small dataframe with minimum and maximum values
types_df = pd.DataFrame({'Col': num_cols, 
              'min': [train_df[col].min() for col in num_cols],
              'max': [train_df[col].max() for col in num_cols],
              'dtype': [str(train_df[col].dtype) for col in num_cols]})

types_df['dtype_min'] = types_df['dtype'].map({'int64': np.iinfo('int64').min, 'float64': np.finfo('float64').min})
types_df['dtype_max'] = types_df['dtype'].map({'int64': np.iinfo('int64').max, 'float64': np.finfo('float64').max})
types_df.sample(20)

We can see that int64 and float64 each consume 8 bytes of memory per value, and our dataset does not have values big enough to need them, so we can easily reduce the dataset size by downcasting the dtypes of our features.
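To make the idea concrete, here is a small sketch of a manual downcast: check that the observed range fits into the smaller dtype, then cast. The values are illustrative only, and the `fits` variable is just a helper name for this example.

```python
import numpy as np
import pandas as pd

# Illustrative values only; all of them fit comfortably into int8
s64 = pd.Series([0, 1, 0, 1, 1] * 200, dtype='int64')

# Check that the observed range fits into the target dtype before casting
info = np.iinfo('int8')
fits = info.min <= s64.min() and s64.max() <= info.max

s8 = s64.astype('int8') if fits else s64

print(s64.memory_usage(index=False))  # bytes used by the int64 values
print(s8.memory_usage(index=False))   # bytes used by the int8 values
```

The same 1000 values take 8 bytes each as int64 and 1 byte each as int8, so the column becomes eight times smaller without losing any information.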

Also, pandas stores strings as the 'object' type. If the number of unique values in a column is less than 50% of the total number of values, then we can convert it to the 'category' datatype to reduce memory usage.
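A rough sketch of that rule of thumb is shown below. The column content is invented for illustration, and the 50% threshold is simply the one mentioned above, not a pandas requirement.

```python
import pandas as pd

# Illustrative column: 3 distinct strings repeated many times
s = pd.Series(['red', 'green', 'blue'] * 1000)

# Share of unique values among all values in the column
unique_ratio = s.nunique() / len(s)

# Convert only when the column is repetitive enough
as_category = s.astype('category') if unique_ratio < 0.5 else s

# deep=True counts the string payload itself, not only the references to it
print(s.memory_usage(index=False, deep=True))
print(as_category.memory_usage(index=False, deep=True))
```

A category column stores each distinct string once plus a small integer code per row, which is why it pays off only when values repeat a lot.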

So, let's start the data size reduction. First I want to divide numeric features into two groups: to_integer and to_float.

To change the datatype of each feature I'll use the pd.to_numeric function with the 'downcast' parameter ([link to documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html)).

**Downcast**: If not None, and if the data has been successfully cast to a numerical dtype (or if the data was numeric to begin with), downcast that resulting data to the smallest numerical dtype possible according to the following rules:

* ‘integer’ or ‘signed’: smallest signed int dtype (min.: np.int8)
* ‘unsigned’: smallest unsigned int dtype (min.: np.uint8)
* ‘float’: smallest float dtype (min.: np.float32)
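The rules above can be tried out on a few tiny Series. This is only a sketch with made-up values; the variable names are mine and not part of pandas.

```python
import numpy as np
import pandas as pd

# Illustrative values only
small_ints = pd.Series([0, 1, 100], dtype='int64')
large_ints = pd.Series([0, 1, 100_000], dtype='int64')
floats = pd.Series([0.5, 1.5, np.nan], dtype='float64')

small_signed = pd.to_numeric(small_ints, downcast='integer')     # all values fit in int8
large_signed = pd.to_numeric(large_ints, downcast='integer')     # 100000 needs int32
small_unsigned = pd.to_numeric(small_ints, downcast='unsigned')  # non-negative, fits in uint8
small_float = pd.to_numeric(floats, downcast='float')            # float32 is the smallest float offered

print(small_signed.dtype, large_signed.dtype, small_unsigned.dtype, small_float.dtype)
```

Note that pandas picks the dtype from the values actually present, so the same column can end up with different dtypes in train and test if their ranges differ.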

In [None]:
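def reduce_size(dataset):
    """Downcast numeric columns in place to the smallest dtype that fits and return the dataset."""
    for col in dataset.columns:
        dtype_name = str(dataset[col].dtype)
        if dtype_name == 'object':
            # Strings are left untouched; uncomment to convert them to category if needed
            # dataset[col] = dataset[col].astype('category')
            continue
        elif dtype_name.startswith('int'):
            # Integer columns: smallest signed int dtype that holds all values
            dataset[col] = pd.to_numeric(dataset[col], downcast='integer')
        else:
            # Everything else (floats here): float32 at the smallest
            dataset[col] = pd.to_numeric(dataset[col], downcast='float')

    return dataset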

In [None]:
train_df = reduce_size(train_df)
test_df = reduce_size(test_df)

In [None]:
train_df.info()

In [None]:
test_df.info()

So, with a couple of lines of code we managed to decrease the size of our datasets by 2GB. But that's not all we can do here: there are a lot of float32 columns in the datasets, and after some feature engineering and filling of NaN values we can convert them to int or uint types, decreasing the size of our data even more.
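As a rough sketch of that last step, the snippet below fills the NaN values in a float32 column and downcasts it to an integer type. The column values are invented, and using -1 as the fill value is only an assumption for this example; pick a value that is safe for your feature.

```python
import numpy as np
import pandas as pd

# Illustrative column: whole numbers stored as float only because of the NaN
col = pd.Series([1.0, 2.0, np.nan, 4.0], dtype='float32')

# -1 as the fill value is an assumption made for this sketch
filled = col.fillna(-1)

# Cast only when every value is a whole number, otherwise information would be lost
if (filled % 1 == 0).all():
    filled = pd.to_numeric(filled, downcast='integer')

print(filled.dtype)
```

The whole-number check matters: a column with real fractional values should stay a float.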