# Feather format for super fast data loading

The original `parquet` files take a long time to load, so I converted them to `feather` format and uploaded the result.<br/>
Loading is about **30 times faster**.
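
For reference, a conversion like this can be sketched as below. This is a minimal sketch, not necessarily the exact script used to build the uploaded dataset. The helper names `feather_path_for`, `convert_parquet_to_feather` and `convert_all` are hypothetical, and it assumes `pyarrow` is installed so that pandas can read parquet and write feather.

```python
from pathlib import Path
from typing import List


def feather_path_for(parquet_path: Path, out_dir: Path) -> Path:
    """Map e.g. train_image_data_0.parquet to <out_dir>/train_image_data_0.feather."""
    return out_dir / parquet_path.with_suffix(".feather").name


def convert_parquet_to_feather(parquet_path: Path, out_dir: Path) -> Path:
    """Read one parquet file and write the same table in feather format."""
    # Imported here so the path helper above works without pandas installed.
    import pandas as pd

    out_path = feather_path_for(parquet_path, out_dir)
    df = pd.read_parquet(parquet_path)
    df.to_feather(out_path)  # needs pyarrow
    return out_path


def convert_all(in_dir: Path, out_dir: Path) -> List[Path]:
    """Convert every *.parquet file in in_dir and return the new feather paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    return [
        convert_parquet_to_feather(path, out_dir)
        for path in sorted(in_dir.glob("*.parquet"))
    ]
```

In a kernel this could be called as `convert_all(Path('/kaggle/input/bengaliai-cv19'), Path('/kaggle/working'))`, assuming the working directory has enough free space for the output.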

You can find the dataset here: [https://www.kaggle.com/corochann/bengaliaicv19feather](https://www.kaggle.com/corochann/bengaliaicv19feather)<br/>
Please upvote both the dataset and this kernel if you like them! :)

This kernel describes how to load this dataset.

# How to add the dataset

When writing a kernel, click the "+ Add Data" button at the top right.<br/>
In the pop-up window, there is a "Search Datasets" text box at the top right.<br/>
Type "bengaliai-cv19-feather" to find this dataset, then press the "Add" button to add it.

In [1]:
import gc
import os
from pathlib import Path
import random
import sys

from tqdm.notebook import tqdm  # `tqdm_notebook` is deprecated in recent tqdm
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import display, HTML

# --- plotly ---
from plotly import tools, subplots
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff

# --- models ---
from sklearn import preprocessing
from sklearn.model_selection import KFold
import lightgbm as lgb
import xgboost as xgb
import catboost as cb

# --- setup ---
pd.set_option('display.max_columns', 50)  # full option name; the short form is ambiguous in newer pandas

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/bengaliai-cv19/test_image_data_0.parquet
/kaggle/input/bengaliai-cv19/sample_submission.csv
/kaggle/input/bengaliai-cv19/test_image_data_3.parquet
/kaggle/input/bengaliai-cv19/train_image_data_0.parquet
/kaggle/input/bengaliai-cv19/test_image_data_1.parquet
/kaggle/input/bengaliai-cv19/train_image_data_2.parquet
/kaggle/input/bengaliai-cv19/train_image_data_3.parquet
/kaggle/input/bengaliai-cv19/test_image_data_2.parquet
/kaggle/input/bengaliai-cv19/train_image_data_1.parquet
/kaggle/input/bengaliai-cv19/class_map.csv
/kaggle/input/bengaliai-cv19/test.csv
/kaggle/input/bengaliai-cv19/train.csv
/kaggle/input/bengaliaicv19feather/train_image_data_3.feather
/kaggle/input/bengaliaicv19feather/test_image_data_3.feather
/kaggle/input/bengaliaicv19feather/test_image_data_0.feather
/kaggle/input/bengaliaicv19feather/train_image_data_0.feather
/kaggle/input/bengaliaicv19feather/train_image_data_2.feather
/kaggle/input/bengaliaicv19feather/test_image_data_1.feather
/kaggle/input/be

In [3]:
%%time
datadir = Path('/kaggle/input/bengaliai-cv19')

# Read in the data CSV files
train = pd.read_csv(datadir/'train.csv')
test = pd.read_csv(datadir/'test.csv')
sample_submission = pd.read_csv(datadir/'sample_submission.csv')
class_map = pd.read_csv(datadir/'class_map.csv')

CPU times: user 194 ms, sys: 32.1 ms, total: 226 ms
Wall time: 231 ms


To load `feather` format, we just need to change `read_parquet` to `read_feather`.

The original `parquet` format takes about 60 seconds to load one file, while the `feather` format takes about **2 to 3 seconds per file!!!**
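
Outside a notebook, where the `%%time` magic is not available, the same comparison can be made with a small timing wrapper. This is only a sketch: `timed` and `speedup` are hypothetical helper names, and they measure wall-clock time only, not CPU time.

```python
import time
from typing import Any, Callable, Tuple


def timed(loader: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call loader(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = loader(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the fast loader is than the slow one."""
    return slow_seconds / fast_seconds
```

For example, `df, sec = timed(pd.read_feather, featherdir/'train_image_data_0.feather')` returns the loaded DataFrame together with the time it took, and `speedup(60.0, 2.0)` gives `30.0`.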

In [4]:
%%time
train_image_df0 = pd.read_parquet(datadir/'train_image_data_0.parquet')

CPU times: user 3min 35s, sys: 8.99 s, total: 3min 44s
Wall time: 59.4 s


In [5]:
%%time
featherdir = Path('/kaggle/input/bengaliaicv19feather')

train_image_df0 = pd.read_feather(featherdir/'train_image_data_0.feather')
train_image_df1 = pd.read_feather(featherdir/'train_image_data_1.feather')
train_image_df2 = pd.read_feather(featherdir/'train_image_data_2.feather')
train_image_df3 = pd.read_feather(featherdir/'train_image_data_3.feather')

CPU times: user 4.08 s, sys: 11.8 s, total: 15.9 s
Wall time: 11 s


For the test files, be careful: this is a **code competition**, and the **test data will change in the actual submission**. <br/>
So I think we need to load the private test data from the original `parquet` files at submission time.
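
The choice between the two formats can also be wrapped in a small helper so that the file list is built in one place. This is a sketch under the assumption that both datasets keep the file naming shown in the directory listing above; `image_paths` and `load_images` are hypothetical names.

```python
from pathlib import Path
from typing import List

DATADIR = Path('/kaggle/input/bengaliai-cv19')
FEATHERDIR = Path('/kaggle/input/bengaliaicv19feather')


def image_paths(split: str, use_parquet: bool, n_files: int = 4) -> List[Path]:
    """Build the image file paths for `split` ('train' or 'test')."""
    if use_parquet:
        base, ext = DATADIR, 'parquet'
    else:
        base, ext = FEATHERDIR, 'feather'
    return [base / f'{split}_image_data_{i}.{ext}' for i in range(n_files)]


def load_images(split: str, use_parquet: bool) -> list:
    """Load every image file of `split` as a list of DataFrames."""
    # Imported here so image_paths() can be used without pandas installed.
    import pandas as pd

    reader = pd.read_parquet if use_parquet else pd.read_feather
    return [reader(path) for path in image_paths(split, use_parquet)]
```

With this, `load_images('test', use_parquet=submission)` reads the original `parquet` files for the actual submission and the `feather` files otherwise.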

In [6]:
%%time
# Set this to `True` for the actual submission
submission = False

if submission:
    test_image_df0 = pd.read_parquet(datadir/'test_image_data_0.parquet')
    test_image_df1 = pd.read_parquet(datadir/'test_image_data_1.parquet')
    test_image_df2 = pd.read_parquet(datadir/'test_image_data_2.parquet')
    test_image_df3 = pd.read_parquet(datadir/'test_image_data_3.parquet')
else:
    test_image_df0 = pd.read_feather(featherdir/'test_image_data_0.feather')
    test_image_df1 = pd.read_feather(featherdir/'test_image_data_1.feather')
    test_image_df2 = pd.read_feather(featherdir/'test_image_data_2.feather')
    test_image_df3 = pd.read_feather(featherdir/'test_image_data_3.feather')

CPU times: user 2.93 s, sys: 1.34 s, total: 4.27 s
Wall time: 2.58 s


In [7]:
train_image_df0.head()

Unnamed: 0,image_id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,...,32307,32308,32309,32310,32311,32312,32313,32314,32315,32316,32317,32318,32319,32320,32321,32322,32323,32324,32325,32326,32327,32328,32329,32330,32331
0,Train_0,254,253,252,253,251,252,253,251,251,253,254,253,253,253,254,253,252,253,253,253,253,252,252,253,...,252,252,252,252,252,252,252,253,253,253,253,253,253,253,253,253,253,253,253,253,253,253,253,253,251
1,Train_1,251,244,238,245,248,246,246,247,251,252,250,250,246,249,248,250,249,251,252,253,253,253,253,253,...,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,254
2,Train_2,251,250,249,250,249,245,247,252,252,252,253,252,252,251,250,251,253,254,251,251,252,252,253,253,...,250,251,251,251,250,250,250,251,252,253,253,253,253,254,254,254,253,252,252,253,253,253,253,251,249
3,Train_3,247,247,249,253,253,252,251,251,250,250,251,250,249,251,251,251,250,252,251,245,245,251,252,251,...,254,254,254,254,254,254,254,253,252,253,254,253,252,253,254,254,254,254,254,254,253,253,252,251,252
4,Train_4,249,248,246,246,248,244,242,242,229,225,231,229,229,228,221,224,226,221,221,220,217,217,218,219,...,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255
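
Each row holds an `image_id` followed by 32332 pixel columns (`0` to `32331`). To view a row as an image, the pixels can be reshaped into a 2D array. This is a minimal sketch: the 137 x 236 size is an assumption that does not appear in the output above (it is consistent with it, since 137 * 236 = 32332), and `row_to_image` is a hypothetical helper name.

```python
import numpy as np

# Assumed image size: 137 * 236 == 32332, the number of pixel columns.
HEIGHT, WIDTH = 137, 236


def row_to_image(pixels) -> np.ndarray:
    """Reshape one flat row of pixel values into a (HEIGHT, WIDTH) uint8 image."""
    arr = np.asarray(pixels, dtype=np.uint8)
    if arr.size != HEIGHT * WIDTH:
        raise ValueError(f'expected {HEIGHT * WIDTH} pixels, got {arr.size}')
    return arr.reshape(HEIGHT, WIDTH)
```

For example, `img = row_to_image(train_image_df0.iloc[0, 1:])` skips the `image_id` column, and `plt.imshow(img, cmap='gray')` then displays the image.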
