Thank you to @[ragnar123](https://www.kaggle.com/ragnar123)

In [None]:
import pandas as pd
import os
import re
import numpy as np



The dataset used for this tutorial is **sales_train_validation.csv** from the M5 competition, loaded here as **df_sales_train_validation**.



In [None]:

df_sales_train_validation = pd.read_csv(r'../input/m5-forecasting-accuracy/sales_train_validation.csv')


In [None]:
df_sales_train_validation.head(3)

One of the challenges of working with pandas DataFrames is their size. They often hold millions of records, and DataFrames containing time series data are usually huge. Several libraries and techniques help with this, but there is still a lot to be done. The `pandas.melt` function is one of them: it reshapes a wide frame into a long one, trading columns for rows. This helps with processing, but it is not enough.
Another technique is reducing the memory footprint of the DataFrame itself.
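
As a minimal sketch of the wide-to-long reshaping mentioned above, the toy frame below (made-up values, not the real M5 data) has one column per day, and `pd.melt` turns it into one row per item and day. The column names `day` and `sales` are chosen here only for illustration.

```python
import pandas as pd

# Toy wide frame: one row per item, one column per day (values are made up).
wide = pd.DataFrame({
    'id': ['item_a', 'item_b'],
    'd_1': [0, 3],
    'd_2': [1, 0],
    'd_3': [2, 5],
})

# Keep 'id' as the identifier and stack the day columns into (day, sales) pairs.
long = pd.melt(wide, id_vars='id', var_name='day', value_name='sales')

print(wide.shape)
print(long.shape)
```

The number of cells does not shrink; the frame only becomes narrower and longer, which is why a separate memory reduction step is still useful.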

Most of the time, columns are loaded with the largest available data type (for example, int64 for integers). The more columns a frame has, the more memory this wastes. For example, in the M5 dataset, sales_train_validation.csv holds the number of items sold on each of 1913 days. The number of units sold per item per day is unlikely to be high. The following code returns the maximum value over all 1913 days.


In [None]:
# Select the daily sales columns d_1 ... d_1913
cols = df_sales_train_validation.filter(regex='^d_').columns
# Take the maximum of the per-column maxima (avoids shadowing the built-in max)
max_value = df_sales_train_validation[cols].max().max()
print('The maximum value for columns d_1 to d_1913 is: ', max_value)

So, the maximum value is 763! The interesting point is that all of these columns are stored as int64, which can hold values up to 9,223,372,036,854,775,807 (19 digits). In our case the maximum is a 3-digit number, which fits in int16 (range -32,768 to 32,767), and columns whose values stay within -128 to 127 even fit in int8. So let's reduce each column to the smallest integer type that can hold it. The following code helps to do so easily.

The advantage of this function is that it can be used with any DataFrame: you do not need to know the names of the target columns or whether their data types are integer.
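
Before the function itself, the integer ranges discussed above can be checked with `np.iinfo`. This is only a quick sanity check, using the maximum of 763 reported earlier.

```python
import numpy as np

# Print the representable range of each signed integer type.
for int_type in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(int_type)
    print(int_type.__name__, info.min, info.max)

# Check which types can hold the reported maximum of 763.
fits_int8 = 763 <= np.iinfo(np.int8).max
fits_int16 = 763 <= np.iinfo(np.int16).max
print(fits_int8, fits_int16)
```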


In [None]:
def ReduceSize(df_,  fl = 1):
    intValues = ['int_', 'intc', 'intp', 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64']
    floatValues = ['float_', 'float16', 'float32', 'float64']
    minn, maxx = 0, 0 
    stype = ''
    for c in df_.columns:
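        try:
            # Object columns that hold only integer-like values are converted to int64
            if df_[c].dtypes == 'object':
                df_[c] = df_[c].astype('int64')
                print('Successful conversion of object to integer for column: ', c)
        except (ValueError, TypeError, OverflowError):
            print('Casting object to int64 not possible for column: ', c)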
        stype = df_[c].dtypes
        #Cast to INT
        if stype in intValues:
            minn , maxx = 1, -1
            maxx = df_[c].max()
            minn = df_[c].min()
            # int8 holds -128..127; a value of 128 would overflow
            if (minn >= -128) &  (maxx <= 127):
                df_[c] = df_[c].astype('int8')                   
            else:
                if (minn >= -32767) &  (maxx <= 32767):
                    df_[c] = df_[c].astype('int16')
                else:
                    if (minn >= -2147483647) &  (maxx <= 2147483647):
                        df_[c] = df_[c].astype('int32')
                    else:
                        df_[c] = df_[c].astype('int64')
            #Cast to UINT
            if (fl == 2):
                if (minn >= 0) &  (maxx <= 255):
                    df_[c] = df_[c].astype('uint8')                   
                else:
                    if (minn >= 0) &  (maxx <= 65535):
                        df_[c] = df_[c].astype('uint16')                   
                    else:
                        if (minn >= 0) &  (maxx <= 4294967295):
                            df_[c] = df_[c].astype('uint32')                   
                        else:
                            if (minn >= 0) &  (maxx <= 18446744073709551615):
                                df_[c] = df_[c].astype('uint64')                   
        
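        if stype in floatValues:
            # astype does not raise on overflow (values silently become inf),
            # so pick the smallest float type whose range covers the column.
            # Note: float16 and float32 still reduce precision.
            fmin, fmax = df_[c].min(), df_[c].max()
            for ftype in ('float16', 'float32'):
                finfo = np.finfo(ftype)
                if (fmin >= finfo.min) & (fmax <= finfo.max):
                    df_[c] = df_[c].astype(ftype)
                    break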
        print(c)
    
    return df_

Before we do any operation on our data, let's get some information about the dataset. The **memory_usage** method shows how much memory each column of the dataset occupies.

In [None]:
df_sales_train_validation.memory_usage(index=False)


This method reports memory per column, so to see the total memory occupied, we need to sum the values as follows:


In [None]:

np.sum(df_sales_train_validation.memory_usage(index=False))


In [None]:
ReduceSize(df_sales_train_validation)

The raw data is not big, but after loading, joining, and other operations it becomes huge, and a large amount of RAM is needed.


![](https://gdpr.report/wp-content/uploads/2019/05/graph-3078539_1280-e1557991579520-635x360.png)



Now let's see the memory used after **OPTIMIZATION**:


In [None]:

np.sum(df_sales_train_validation.memory_usage(index=False))



The size before optimization is 468082480 bytes, and after running the function it becomes 98604660 bytes. A simple calculation shows that **the size of the data decreased by almost 80 percent**.
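
The simple calculation can be sketched as follows, using the two byte counts reported above. The toy frame after it (made-up values) illustrates a caveat worth checking on your own data: without `deep=True`, `memory_usage` counts only the pointers of object columns, so string columns take more memory than these totals suggest.

```python
import pandas as pd

before = 468082480  # bytes before running ReduceSize (figure from the text)
after = 98604660    # bytes after running ReduceSize (figure from the text)

reduction = (before - after) / before * 100
print(f'Reduction: {reduction:.1f}%')

# Toy frame with one string column, stored explicitly as object dtype.
toy = pd.DataFrame({'id': ['item_a', 'item_b', 'item_c'], 'd_1': [0, 3, 1]})
toy['id'] = toy['id'].astype(object)

shallow = toy.memory_usage(index=False).sum()          # object column counted as pointers
deep = toy.memory_usage(index=False, deep=True).sum()  # includes the string contents
print(shallow, deep)
```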


![](https://fmad.io/images/blog/20160128_zip.png)

The function takes **TWO arguments**. The **first** is a DataFrame, and the **second** is a number that can be 1 or 2. It defaults to 1, but if you would like to include **unsigned integers** (“uint8”, “uint16”, “uint32”, or “uint64”) among the possible data types, the second argument must be set to 2.
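
The effect of allowing unsigned types can be seen with pandas' built-in `pd.to_numeric(..., downcast=...)`, used here as a stand-in for illustration rather than the `ReduceSize` function itself. The toy series holds made-up non-negative values whose maximum exceeds the int8 limit of 127 but still fits in uint8.

```python
import pandas as pd

# Made-up non-negative values; the maximum (200) is above the int8 limit of 127.
s = pd.Series([0, 17, 200], dtype='int64')

signed = pd.to_numeric(s, downcast='integer')     # smallest signed type that fits
unsigned = pd.to_numeric(s, downcast='unsigned')  # smallest unsigned type that fits

print(signed.dtype, signed.memory_usage(index=False))
print(unsigned.dtype, unsigned.memory_usage(index=False))
```

For columns that are never negative, such as sales counts, the unsigned option can halve the storage again when values fall between 128 and 255.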

# **THIS FUNCTION CAN BE USED FOR ANY DATAFRAME FROM ANY COMPETITION OR CHALLENGE!** 

#  So keep it and USE it ...

I would appreciate it if you tried the function on your own dataset. If you run into an error or inefficiency, please let me know so that I can improve it.
