<div style="color:#006666; padding:0px 10px; border-radius:5px; font-size:18px;"><h1 style='margin:10px 5px'>Optimising DataFrame</h1>
</div>

© Copyright Machine Learning Plus

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Reducing memory</h2>
</div>


In [None]:
import pandas as pd
import numpy as np
pd.set_option("display.precision", 8)

__Read Data__

In [None]:
df = pd.read_csv("Datasets/large_dataset.csv")

In [None]:
df.head()

Unnamed: 0,HasTpm,Census_OSInstallLanguageIdentifier,LocaleEnglishNameIdentifier,EngineVersion,UacLuaenable,Census_MDC2FormFactor,Census_IsSecureBootEnabled,Census_OSVersion,Census_GenuineStateName,Census_InternalPrimaryDisplayResolutionHorizontal,RtpStateBitfield,Census_ActivationChannel,GeoNameIdentifier,Census_OSUILocaleIdentifier,Census_OSEdition,AVProductsInstalled,Census_FirmwareManufacturerIdentifier,Census_OEMNameIdentifier,Census_SystemVolumeTotalCapacity,Census_OSVersion_0
0,1,0.116,0.010284,0.4119,0,1,1,0.039,0,1366.0,0,2,0.01046,0.1167,2,2.0,0.08966,0.10034,475799.0,0
1,1,0.3562,0.2347,0.431,0,1,0,0.1584,0,1366.0,0,0,0.1718,0.3555,1,2.0,0.01384,0.10034,461478.0,0
2,1,0.3562,0.2347,0.02974,0,1,0,0.02313,2,1366.0,0,3,0.1718,0.3555,0,1.0,0.10547,0.1063,476438.0,0
3,1,0.3562,0.2347,0.4119,0,1,0,0.03644,0,1366.0,0,0,0.1718,0.3555,0,1.0,0.10547,0.1063,37974.0,0
4,1,0.116,0.03738,0.4119,0,0,0,0.0012665,0,1280.0,0,0,0.03815,0.1167,1,1.0,0.0433,0.1038,475955.0,0


In [None]:
# df.to_csv("Datasets/large_dataset.csv", index=False)

By default, pandas imports numeric columns as float64 or int64. These are the widest numeric types and use 8 bytes per value, regardless of the range of values a column actually holds. This is safe, but it consumes more memory than necessary and can slow down the code.


1. int8 / uint8 : consumes 1 byte of memory, range -128 to 127 (int8) or 0 to 255 (uint8)
2. bool : consumes 1 byte, True or False
3. float16 / int16 / uint16 : consumes 2 bytes of memory, range -32768 to 32767 (int16) or 0 to 65535 (uint16); float16 reaches about ±65504 with roughly 3 significant decimal digits
4. float32 / int32 / uint32 : consumes 4 bytes of memory, range -2147483648 to 2147483647 (int32) or 0 to 4294967295 (uint32); float32 keeps roughly 6 significant decimal digits
5. float64 / int64 / uint64 : consumes 8 bytes of memory

For columns whose values span only a small range, a narrower datatype stores the same data in less memory, and this often makes the code run faster as well.
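The byte sizes listed above can be checked directly with NumPy. The snippet below is a small sketch, not part of the original notebook; the array of one million zeros is an illustrative stand-in for a real column.

```python
import numpy as np

# Bytes per element for each dtype named in the list above.
names = ["int8", "uint8", "bool", "float16", "int16", "uint16",
         "float32", "int32", "uint32", "float64", "int64", "uint64"]
for name in names:
    print(f"{name:8s} {np.dtype(name).itemsize} byte(s)")

# Illustrative column: one million values stored as int64, then as int8.
n = 1_000_000
big = np.zeros(n, dtype="int64")
small = big.astype("int8")
print(big.nbytes / 1024 ** 2, "MB as int64")
print(small.nbytes / 1024 ** 2, "MB as int8")
```

The same values take one eighth of the space once they are stored as int8.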

__Datatype of Each Column__

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 20 columns):
 #   Column                                             Non-Null Count    Dtype  
---  ------                                             --------------    -----  
 0   HasTpm                                             1000000 non-null  int64  
 1   Census_OSInstallLanguageIdentifier                 993199 non-null   float64
 2   LocaleEnglishNameIdentifier                        1000000 non-null  float64
 3   EngineVersion                                      1000000 non-null  float64
 4   UacLuaenable                                       1000000 non-null  int64  
 5   Census_MDC2FormFactor                              1000000 non-null  int64  
 6   Census_IsSecureBootEnabled                         1000000 non-null  int64  
 7   Census_OSVersion                                   1000000 non-null  float64
 8   Census_GenuineStateName                            1000000 non-

__Memory Consumed__

In [None]:
# Memory required, in MB
memory_init = round(df.memory_usage(deep=True).sum() / 1024 ** 2, 2)
memory_init

152.59
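This figure can be sanity-checked with simple arithmetic, assuming all 20 columns use 8-byte types as the visible part of the `info()` output suggests. The sketch below also shows the per-column breakdown that `memory_usage` returns; `demo` and its column names are hypothetical and only stand in for the real dataset.

```python
import numpy as np
import pandas as pd

# Back-of-the-envelope estimate: rows x columns x bytes per value, in MB.
rows, cols, bytes_per_value = 1_000_000, 20, 8
estimate_mb = round(rows * cols * bytes_per_value / 1024 ** 2, 2)
print(estimate_mb)

# Hypothetical miniature frame to show the per-column breakdown.
demo = pd.DataFrame({
    "flag": np.ones(1000, dtype="int64"),
    "score": np.linspace(0.0, 1.0, 1000),
})
per_column = demo.memory_usage(deep=True)  # bytes per column, plus the index
print(per_column)
```

Each 8-byte column of 1,000 rows reports 8,000 bytes, which is the same rows times bytes-per-value rule applied to a single column.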

__Data Ranges__

In [None]:
df.apply(lambda x: (np.min(x), np.max(x)))

Unnamed: 0,HasTpm,Census_OSInstallLanguageIdentifier,LocaleEnglishNameIdentifier,EngineVersion,UacLuaenable,Census_MDC2FormFactor,Census_IsSecureBootEnabled,Census_OSVersion,Census_GenuineStateName,Census_InternalPrimaryDisplayResolutionHorizontal,RtpStateBitfield,Census_ActivationChannel,GeoNameIdentifier,Census_OSUILocaleIdentifier,Census_OSEdition,AVProductsInstalled,Census_FirmwareManufacturerIdentifier,Census_OEMNameIdentifier,Census_SystemVolumeTotalCapacity,Census_OSVersion_0
0,0,0.0003452,2.4e-07,5e-07,-1,0,0,2.4e-07,0,-1.0,-1,0,5e-07,2.4e-07,0,1.0,2.4e-07,2.4e-07,0.0,0
1,1,0.3562,0.2347,0.431,7,11,1,0.1584,3,12290.0,6,5,0.1718,0.3555,29,7.0,0.3025,0.1444,17169172.0,1


__Max and Min for each Datatype__

In [None]:
print(np.iinfo('int8'))
print(np.iinfo('int16'))
print(np.iinfo('int32'))
print(np.iinfo('int64'))

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------

Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------
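Given these limits, choosing a type for an integer column comes down to finding the narrowest one whose range contains the column's minimum and maximum. A minimal sketch follows; the helper name `smallest_int_dtype` is hypothetical, and it considers signed types only.

```python
import numpy as np

def smallest_int_dtype(lo, hi):
    """Return the narrowest signed integer dtype whose range contains [lo, hi]."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        # Inclusive comparison: values sitting exactly on a limit still fit.
        if info.min <= lo and hi <= info.max:
            return np.dtype(dtype)
    raise OverflowError("range does not fit in int64")

# Ranges taken from the min/max table shown earlier.
print(smallest_int_dtype(-1, 7))   # UacLuaenable
print(smallest_int_dtype(0, 29))   # Census_OSEdition
print(smallest_int_dtype(0, 300))  # illustrative range that needs two bytes
```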



In [None]:
print(np.finfo('float16'))
print(np.finfo('float32'))
print(np.finfo('float64'))

Machine parameters for float16
---------------------------------------------------------------
precision =   3   resolution = 1.00040e-03
machep =    -10   eps =        9.76562e-04
negep =     -11   epsneg =     4.88281e-04
minexp =    -14   tiny =       6.10352e-05
maxexp =     16   max =        6.55040e+04
nexp =        5   min =        -max
---------------------------------------------------------------

Machine parameters for float32
---------------------------------------------------------------
precision =   6   resolution = 1.0000000e-06
machep =    -23   eps =        1.1920929e-07
negep =     -24   epsneg =     5.9604645e-08
minexp =   -126   tiny =       1.1754944e-38
maxexp =    128   max =        3.4028235e+38
nexp =        8   min =        -max
---------------------------------------------------------------

Machine parameters for float64
---------------------------------------------------------------
precision =  15   resolution = 1.0000000000000001e-15
machep =    -52   e
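For floats, the `precision` field matters as much as `max`: a value can lie inside the range of float16 and still be rounded when stored. The short sketch below illustrates this with a few hand-picked values; it is an illustration, not a result from the dataset.

```python
import numpy as np

# float16 has an 11-bit significand, so whole numbers are exact only up to 2048.
print(float(np.float16(1366.0)))   # small enough to be stored exactly
print(float(np.float16(2049.0)))   # rounded to a neighbouring representable value
print(float(np.float16(0.1)))      # close to 0.1, but not equal to it

# float32 keeps whole numbers exact up to 2**24 (16,777,216).
print(float(np.float32(475799.0)))
```

If a column holds identifiers or counts stored as floats, this rounding can silently merge distinct values, so it is worth checking before settling on float16.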

__Reduce Memory Usage__

In [None]:
# Reduce memory usage by downcasting each numeric column to the
# narrowest dtype whose range contains the column's min and max
def reduce_mem_usage(df, verbose=True):
    numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtype
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Bounds are inclusive: a value exactly on the limit still fits
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # Otherwise the column stays int64
            else:
                # Caution: this checks the range only. float16 keeps about
                # 3 significant digits, so stored values are rounded.
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
df = reduce_mem_usage(df)

Mem. usage decreased to 31.47 Mb (79.4% reduction)
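pandas also has a built-in way to downcast a single column: the `downcast` argument of `pd.to_numeric`. The sketch below is an alternative to the function above, not the method used in this notebook. Note that `downcast="float"` never goes below float32. The frame `demo` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the default 8-byte dtypes.
demo = pd.DataFrame({
    "flag": np.array([0, 1, 1, 0], dtype="int64"),
    "width": np.array([1366.0, 1280.0, 12290.0, -1.0]),
})

# Downcast each column to the smallest type pandas considers safe.
demo["flag"] = pd.to_numeric(demo["flag"], downcast="integer")
demo["width"] = pd.to_numeric(demo["width"], downcast="float")
print(demo.dtypes)
```

This saves less memory on float columns than casting to float16, but it avoids the heavier rounding that float16 introduces.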


__New Datatypes__

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 20 columns):
 #   Column                                             Non-Null Count    Dtype  
---  ------                                             --------------    -----  
 0   HasTpm                                             1000000 non-null  int8   
 1   Census_OSInstallLanguageIdentifier                 993199 non-null   float16
 2   LocaleEnglishNameIdentifier                        1000000 non-null  float16
 3   EngineVersion                                      1000000 non-null  float16
 4   UacLuaenable                                       1000000 non-null  int8   
 5   Census_MDC2FormFactor                              1000000 non-null  int8   
 6   Census_IsSecureBootEnabled                         1000000 non-null  int8   
 7   Census_OSVersion                                   1000000 non-null  float16
 8   Census_GenuineStateName                            1000000 non-

__Memory taken by optimized data__

In [None]:
# Memory required, in MB
memory_optim = round(df.memory_usage(deep=True).sum() / 1024 ** 2, 2)
memory_optim

31.47

__Savings__

In [None]:
savings = (memory_init - memory_optim)/memory_init
savings

0.7937610590471198
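Memory savings from float16 come at the cost of rounding, so it can be useful to measure how far the downcast values drift from the originals. The following is a minimal sketch on a hypothetical two-column frame; the values are borrowed from the `head()` output for illustration and the column names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical original data in float64.
original = pd.DataFrame({
    "ratio": [0.116, 0.3562, 0.010284],
    "capacity": [475799.0, 461478.0, 37974.0],
})

# Downcast copy: float16 for the small ratios, float32 for the large values.
reduced = original.astype({"ratio": np.float16, "capacity": np.float32})

# Largest absolute difference per column after converting back to float64.
abs_err = (original - reduced.astype(np.float64)).abs().max()
print(abs_err)
```

Here the float32 column round-trips exactly, while the float16 column shows a small but non-zero error. Whether that error is acceptable depends on how the column is used downstream.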