In this notebook I will show how to save the data in the [HDF5](https://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5) format, which is supported by the pandas library.

This allows faster reads than the common CSV format. It also lets us query the data directly from disk, so we can load only a subset of it.

First we have to read the csv file (hopefully for the last time!). Let's also keep track of the time we spend on each operation.
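
As a side note, the timing pattern used throughout this notebook (`start = time.time()` followed by a subtraction) can be wrapped in a small helper. The sketch below is only one way to do it; the `timed` function and the `timings` dictionary are hypothetical names that are not used elsewhere in this notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print("{}: {:.3f}s".format(label, results[label]))


timings = {}

# Stand-in workload; in the notebook this would be a read_csv or to_hdf call.
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))
```

Collecting the durations in one dictionary makes it easy to compare the CSV and HDF5 steps at the end.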

To reduce the size of the data I'm using the approach suggested in [this great kernel](https://www.kaggle.com/theoviel/load-the-totality-of-the-data).

In [None]:
import pandas as pd
import time
import gc

In [None]:
dtypes = {
        'MachineIdentifier':                                    'category',
        'ProductName':                                          'category',
        'EngineVersion':                                        'category',
        'AppVersion':                                           'category',
        'AvSigVersion':                                         'category',
        'IsBeta':                                               'int8',
        'RtpStateBitfield':                                     'float16',
        'IsSxsPassiveMode':                                     'int8',
        'DefaultBrowsersIdentifier':                            'float16',
        'AVProductStatesIdentifier':                            'float32',
        'AVProductsInstalled':                                  'float16',
        'AVProductsEnabled':                                    'float16',
        'HasTpm':                                               'int8',
        'CountryIdentifier':                                    'int16',
        'CityIdentifier':                                       'float32',
        'OrganizationIdentifier':                               'float16',
        'GeoNameIdentifier':                                    'float16',
        'LocaleEnglishNameIdentifier':                          'int8',
        'Platform':                                             'category',
        'Processor':                                            'category',
        'OsVer':                                                'category',
        'OsBuild':                                              'int16',
        'OsSuite':                                              'int16',
        'OsPlatformSubRelease':                                 'category',
        'OsBuildLab':                                           'category',
        'SkuEdition':                                           'category',
        'IsProtected':                                          'float16',
        'AutoSampleOptIn':                                      'int8',
        'PuaMode':                                              'category',
        'SMode':                                                'float16',
        'IeVerIdentifier':                                      'float16',
        'SmartScreen':                                          'category',
        'Firewall':                                             'float16',
        'UacLuaenable':                                         'float32',
        'Census_MDC2FormFactor':                                'category',
        'Census_DeviceFamily':                                  'category',
        'Census_OEMNameIdentifier':                             'float16',
        'Census_OEMModelIdentifier':                            'float32',
        'Census_ProcessorCoreCount':                            'float16',
        'Census_ProcessorManufacturerIdentifier':               'float16',
        'Census_ProcessorModelIdentifier':                      'float16',
        'Census_ProcessorClass':                                'category',
        'Census_PrimaryDiskTotalCapacity':                      'float32',
        'Census_PrimaryDiskTypeName':                           'category',
        'Census_SystemVolumeTotalCapacity':                     'float32',
        'Census_HasOpticalDiskDrive':                           'int8',
        'Census_TotalPhysicalRAM':                              'float32',
        'Census_ChassisTypeName':                               'category',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float16',
        'Census_InternalPrimaryDisplayResolutionHorizontal':    'float16',
        'Census_InternalPrimaryDisplayResolutionVertical':      'float16',
        'Census_PowerPlatformRoleName':                         'category',
        'Census_InternalBatteryType':                           'category',
        'Census_InternalBatteryNumberOfCharges':                'float32',
        'Census_OSVersion':                                     'category',
        'Census_OSArchitecture':                                'category',
        'Census_OSBranch':                                      'category',
        'Census_OSBuildNumber':                                 'int16',
        'Census_OSBuildRevision':                               'int32',
        'Census_OSEdition':                                     'category',
        'Census_OSSkuName':                                     'category',
        'Census_OSInstallTypeName':                             'category',
        'Census_OSInstallLanguageIdentifier':                   'float16',
        'Census_OSUILocaleIdentifier':                          'int16',
        'Census_OSWUAutoUpdateOptionsName':                     'category',
        'Census_IsPortableOperatingSystem':                     'int8',
        'Census_GenuineStateName':                              'category',
        'Census_ActivationChannel':                             'category',
        'Census_IsFlightingInternal':                           'float16',
        'Census_IsFlightsDisabled':                             'float16',
        'Census_FlightRing':                                    'category',
        'Census_ThresholdOptIn':                                'float16',
        'Census_FirmwareManufacturerIdentifier':                'float16',
        'Census_FirmwareVersionIdentifier':                     'float32',
        'Census_IsSecureBootEnabled':                           'int8',
        'Census_IsWIMBootEnabled':                              'float16',
        'Census_IsVirtualDevice':                               'float16',
        'Census_IsTouchEnabled':                                'int8',
        'Census_IsPenCapable':                                  'int8',
        'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
        'Wdft_IsGamer':                                         'float16',
        'Wdft_RegionIdentifier':                                'float16',
        'HasDetections':                                        'int8'
}

start = time.time()
train = pd.read_csv("../input/train.csv", dtype=dtypes)
test = pd.read_csv("../input/test.csv", dtype=dtypes)
time_to_read_csv = int(time.time() - start)
print("Time to read data using the csv format = {}s".format(time_to_read_csv))

In [None]:
print('Memory Usage (train data) ~ {:.2f} GB'.format(train.memory_usage(deep=True).sum() / 1e+9))
print('Memory Usage (test data) ~ {:.2f} GB'.format(test.memory_usage(deep=True).sum() / 1e+9))
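
To get a feel for why the explicit dtypes matter, here is a minimal sketch on a made-up frame. The column names are borrowed from the competition data, but the values are random and the sizes are arbitrary, so the numbers it prints say nothing about the real files.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Toy data: pandas would pick 64-bit integers and plain strings by default.
toy = pd.DataFrame({
    "Platform": rng.choice(["windows10", "windows8", "windows7"], size=n),
    "IsBeta": rng.integers(0, 2, size=n),
    "CountryIdentifier": rng.integers(1, 223, size=n),
})

# Same values, stored with the narrower dtypes used in the mapping above.
compact = toy.astype({
    "Platform": "category",
    "IsBeta": "int8",
    "CountryIdentifier": "int16",
})

before = toy.memory_usage(deep=True).sum()
after = compact.memory_usage(deep=True).sum()
print("Default dtypes: {} bytes".format(before))
print("Compact dtypes: {} bytes".format(after))
```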

Now we can use the pandas API to convert the files to HDF5. We are going to use pandas' [high level API](https://pandas.pydata.org/pandas-docs/stable/io.html#id3), which is very similar to the CSV API. You can read the [documentation](https://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables) if you want more advanced tips.

Also, we are going to save the data using the [table format](https://pandas.pydata.org/pandas-docs/stable/io.html#table-format), which is slower to read and write than the [fixed format](https://pandas.pydata.org/pandas-docs/stable/io.html#fixed-format). However, since we have columns of type "category", we need to use the table format. Moreover, the table format allows us to query the data directly from disk. For this we set the argument **data_columns=True**, which indexes all the columns so that any of them can be used in a query (you could also pass a list with only the columns you want to be queryable).
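
The sketch below illustrates the list variant of **data_columns** on a tiny made-up frame. It assumes PyTables (the `tables` package) is installed, which pandas needs for HDF5; the file name `toy.h5` and the key `toy` are placeholders.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "Platform": pd.Categorical(["windows10", "windows8", "windows10", "windows7"]),
    "IsBeta": [0, 1, 0, 0],
    "HasDetections": [1, 0, 1, 0],
}).astype({"IsBeta": "int8", "HasDetections": "int8"})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.h5")

    # Only the listed columns become queryable; IsBeta is stored but not indexed.
    toy.to_hdf(path, key="toy", format="table",
               data_columns=["Platform", "HasDetections"])

    # The filter is applied while reading, so only matching rows are loaded.
    detected = pd.read_hdf(path, key="toy", where="HasDetections == 1")

print(detected)
print(detected.dtypes)
```

Listing only the columns you plan to filter on is a possible trade-off when writing with **data_columns=True** turns out to be too slow.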

The **key** argument identifies the group inside the store. Here I will save the train and test data in different files, but you could also save them in a single store and use different keys to tell them apart. The [documentation](https://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables) explains what to do in that case.

First, let's save the data in the new format.
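
For reference, a minimal sketch of the single-store alternative could look like this. The frames, the file name `data.h5` and the column names are made up, and PyTables is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

train_toy = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})
test_toy = pd.DataFrame({"x": [4, 5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.h5")

    # to_hdf opens the file in append mode by default, so the second call
    # adds a new group instead of overwriting the first one.
    train_toy.to_hdf(path, key="train", format="table", data_columns=True)
    test_toy.to_hdf(path, key="test", format="table", data_columns=True)

    # List the groups stored in the file.
    with pd.HDFStore(path, mode="r") as store:
        keys = sorted(store.keys())

    # Each frame is read back through its own key.
    train_back = pd.read_hdf(path, key="train")
    test_back = pd.read_hdf(path, key="test")

print(keys)
```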

In [None]:
start = time.time()
train.to_hdf("train.h5", key="train", format="table", data_columns=True)
test.to_hdf("test.h5", key="test", format="table", data_columns=True)
time_to_write_hdf5 = int(time.time() - start)
print("Time to save data using the HDF5 format = {}s".format(time_to_write_hdf5))

In [None]:
del train
del test
gc.collect()

It takes some time to save, but we only have to do it once, so it is worth it for the fast reads afterwards.

Now let's read the data back to see how much faster it is.

In [None]:
start = time.time()
train = pd.read_hdf("train.h5", key="train")
test = pd.read_hdf("test.h5", key="test")
time_to_read_hdf5 = int(time.time() - start)
print("Time to read data using the HDF5 format = {}s".format(time_to_read_hdf5))

This is more than 2x faster than using CSV!

The memory usage is (almost) the same as before; I am not sure where the small difference comes from.

In [None]:
print('Memory Usage (train data) ~ {:.2f} GB'.format(train.memory_usage(deep=True).sum() / 1e+9))
print('Memory Usage (test data) ~ {:.2f} GB'.format(test.memory_usage(deep=True).sum() / 1e+9))

In [None]:
del train
del test
gc.collect()

Now we can also query the data directly from disk. As a quick example, we will read only the records where **HasDetections == 1**.

In [None]:
start = time.time()
train_hasdetection = pd.read_hdf("train.h5", key="train", where=['HasDetections == 1'])
time_to_read_query = int(time.time() - start)
print("Time to read data using a query = {}s".format(time_to_read_query))

In [None]:
train_hasdetection[["MachineIdentifier", "HasDetections"]].head()
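
Queries can also combine several conditions and load only some of the columns. The following is a minimal sketch on made-up data, assuming PyTables is installed; the file name `toy.h5` is a placeholder and the column names are only borrowed from the training set.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "Platform": pd.Categorical(
        ["windows10", "windows8", "windows10", "windows7", "windows10"]),
    "IsBeta": [0, 1, 1, 0, 0],
    "HasDetections": [1, 0, 1, 0, 1],
}).astype({"IsBeta": "int8", "HasDetections": "int8"})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.h5")
    toy.to_hdf(path, key="toy", format="table", data_columns=True)

    # Both conditions are evaluated on disk; `columns` limits what is loaded.
    subset = pd.read_hdf(
        path,
        key="toy",
        where="(HasDetections == 1) & (IsBeta == 0)",
        columns=["Platform", "HasDetections"],
    )

print(subset)
```

Restricting both rows and columns this way keeps the loaded frame small, which can help when the full table does not fit comfortably in memory.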

That's it, good luck to everyone!