# Speed Up Loading The Data By Importing from the Parquet Dataset

Dataset link: https://www.kaggle.com/robikscube/ubiquant-parquet

Read about parquet files here: https://databricks.com/glossary/what-is-parquet

Excerpt from the above website:

**What is Parquet?**

*Parquet is an open source file format available to any project in the Hadoop ecosystem. Apache Parquet is designed for efficient as well as performant flat columnar storage format of data compared to row based files like CSV or TSV files.*

*Parquet uses the record shredding and assembly algorithm which is superior to simple flattening of nested namespaces. Parquet is optimized to work with complex data in bulk and features different ways for efficient data compression and encoding types.  This approach is best especially for those queries that need to read certain columns from a large table. Parquet can only read the needed columns therefore greatly minimizing the IO.*

In [None]:
import pandas as pd
import numpy as np
import gc

# Reading as CSV (Slow)
- **18GB** in size.
- Don't do this. It may cause the Kaggle notebook to crash.

In [None]:
# Guarded with `if False` so this cell is skipped when the notebook runs.
if False:
    train = pd.read_csv('../input/ubiquant-market-prediction/train.csv')

# Reading as Parquet (Fast)
- **5.5GB** in size.
- This is faster and keeps the dtypes of the original dataset.

In [None]:
%%time
train = pd.read_parquet('../input/ubiquant-parquet/train.parquet')

In [None]:
train.info()

In [None]:
train.dtypes

In [None]:
del train
gc.collect()

# Reading as Parquet Low Memory (Fast & Low Mem Use)
- **3.63GB** in size
- Even better! Uses less memory and loads even faster!
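
One common way to get a smaller table is to downcast numeric columns before saving. The sketch below is an assumption about how such a file could be produced, not necessarily how `train_low_mem.parquet` was built; the `downcast` helper and the synthetic columns are hypothetical. Note that `float32` keeps only about 7 significant digits, so this trades precision for memory.

```python
import numpy as np
import pandas as pd


def downcast(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with float64 columns cast to float32 and integer
    columns shrunk to the smallest signed integer type that fits."""
    out = df.copy()
    for col in out.select_dtypes(include="float64").columns:
        out[col] = out[col].astype(np.float32)
    for col in out.select_dtypes(include="integer").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


# Synthetic stand-in for the training table (NOT the real data).
rng = np.random.default_rng(0)
wide = pd.DataFrame({
    "investment_id": rng.integers(0, 4000, size=1000),  # int64 by default
    "f_0": rng.normal(size=1000),                       # float64
    "f_1": rng.normal(size=1000),                       # float64
})

small = downcast(wide)

before = wide.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
print(small.dtypes)
```

Calling `small.to_parquet(...)` afterwards would store the narrower dtypes in the file, so they come back unchanged on the next `pd.read_parquet`.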

In [None]:
%%time
train = pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet')

In [None]:
train.info()

In [None]:
train.dtypes

# Read just a single `investment_id`
- If you only want to work with a single `investment_id`, load its file like this:

In [None]:
%%time
example = pd.read_parquet('../input/ubiquant-parquet/investment_ids/529.parquet')

In [None]:
example.info()

# Reading a Subset of Columns

In [None]:
%%time
col_subset = ['time_id', 'investment_id', 'target', 'f_1', 'f_2', 'f_3']
train = pd.read_parquet('../input/ubiquant-parquet/train.parquet',
                        columns=col_subset)

In [None]:
train.info()

## Thanks!