# Converting CSV files to Parquet

The aim of this notebook is to understand which methods help reduce the memory footprint of the CSV files. It builds on the article [Converting CSVs to Parquet](https://www.confessionsofadataguy.com/converting-csvs-to-parquets-with-python-and-scala/) to see what can be achieved.

Using the correct data type for each feature is a key challenge when working with large datasets. Great work by [SRK within the discussion](https://www.kaggle.com/competitions/amex-default-prediction/discussion/327205), who has shared a dictionary of the feature data types. Having this initialised should greatly improve the data processing. By default, pandas reads every float column as float64 unless it is told otherwise.
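
As a minimal sketch of the effect described above, the cell below reads the same small CSV twice, once with default types and once with an explicit `dtype` mapping. The CSV content and the column names `id`, `x` and `y` are made up for illustration and are not taken from the competition data.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for the competition data; column names are made up.
rows = "\n".join(f"c{i},{i * 0.5},{i * 0.25}" for i in range(1000))
csv_text = "id,x,y\n" + rows

# Default parsing: every float column becomes float64 (8 bytes per value).
default_df = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes: the same columns stored as float16 (2 bytes per value).
typed_df = pd.read_csv(io.StringIO(csv_text), dtype={"x": "float16", "y": "float16"})

# Compare the bytes used by the two float columns only.
default_bytes = default_df[["x", "y"]].memory_usage(index=False).sum()
typed_bytes = typed_df[["x", "y"]].memory_usage(index=False).sum()
print(default_df.dtypes["x"], default_bytes)
print(typed_df.dtypes["x"], typed_bytes)
```

Note that float16 keeps only about three significant decimal digits and cannot represent values above 65504, so whether it is safe depends on the feature.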

In [None]:
# Feature data type dictionary.
# 1. Reviewed the list of categorical features from the data tab ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
#    to understand if they should be adjusted to data type = 'category'
dtype_dict = {
    'customer_id':"object",'S_2':"object",'S_3':'float16','S_5':'float16','S_6':'float16','S_7':'float16','S_8':'float16'
    ,'S_9':'float16','S_11':'float16','S_12':'float16','S_13':'float16','S_15':'float16','S_16':'float16','S_17':'float16'
    ,'S_18':'float16','S_19':'float16','S_20':'float16','S_22':'float16','S_23':'float16','S_24':'float16','S_25':'float16'
    ,'S_26':'float16','S_27':'float16'
    ,'P_2':'float16','P_3':'float16','P_4':'float16'
    ,'R_1':'float16','R_2':'float16','R_3':'float16','R_4':'float16','R_5':'float16','R_6':'float16','R_7':'float16'
    ,'R_8':'float16','R_9':'float16','R_10':'float16','R_11':'float16','R_12':'float16','R_13':'float16','R_14':'float16'
    ,'R_15':'float16','R_16':'float16','R_17':'float16','R_18':'float16','R_19':'float16','R_20':'float16','R_21':'float16'
    ,'R_22':'float16','R_23':'float16','R_24':'float16','R_25':'float16','R_26':'float16','R_27':'float16','R_28':'float16'
    ,'B_1':'float16','B_2':'float16','B_3':'float16','B_4':'float16','B_5':'float16','B_6':'float16','B_7':'float16'
    ,'B_8':'float16','B_9':'float16','B_10':'float16','B_11':'float16','B_12':'float16','B_13':'float16','B_14':'float16'
    ,'B_15':'float16','B_16':'float16','B_17':'float16','B_18':'float16','B_19':'float16','B_20':'float16','B_21':'float16'
    ,'B_22':'float16','B_23':'float16','B_24':'float16','B_25':'float16','B_26':'float16','B_27':'float16','B_28':'float16'
    ,'B_29':'float16','B_30':'float16','B_31':'int64','B_32':'float16','B_33':'float16','B_36':'float16','B_37':'float16'
    ,'B_38':'float16','B_39':'float16','B_40':'float16','B_41':'float16','B_42':'float16'
    ,'D_39':'float16','D_41':'float16','D_42':'float16','D_43':'float16','D_44':'float16','D_45':'float16','D_46':'float16'
    ,'D_47':'float16','D_48':'float16','D_49':'float16','D_50':'float16','D_51':'float16','D_52':'float16','D_53':'float16'
    ,'D_54':'float16','D_55':'float16','D_56':'float16','D_58':'float16','D_59':'float16','D_60':'float16','D_61':'float16'
    ,'D_62':'float16','D_63':'object','D_64':'object','D_65':'float16','D_66':'float16','D_68':'float16','D_69':'float16'
    ,'D_70':'float16','D_71':'float16','D_72':'float16','D_73':'float16','D_74':'float16','D_75':'float16','D_76':'float16'
    ,'D_77':'float16','D_78':'float16','D_79':'float16','D_80':'float16','D_81':'float16','D_82':'float16','D_83':'float16'
    ,'D_84':'float16','D_86':'float16','D_87':'float16','D_88':'float16','D_89':'float16','D_91':'float16','D_92':'float16'
    ,'D_93':'float16','D_94':'float16','D_96':'float16','D_102':'float16','D_103':'float16','D_104':'float16','D_105':'float16'
    ,'D_106':'float16','D_107':'float16','D_108':'float16','D_109':'float16','D_110':'float16','D_111':'float16','D_112':'float16'
    ,'D_113':'float16','D_114':'float16','D_115':'float16','D_116':'float16','D_117':'float16','D_118':'float16','D_119':'float16'
    ,'D_120':'float16','D_121':'float16','D_122':'float16','D_123':'float16','D_124':'float16','D_125':'float16','D_126':'float16'
    ,'D_127':'float16','D_128':'float16','D_129':'float16','D_130':'float16','D_131':'float16','D_132':'float16','D_133':'float16'
    ,'D_134':'float16','D_135':'float16','D_136':'float16','D_137':'float16','D_138':'float16','D_139':'float16','D_140':'float16'
    ,'D_141':'float16','D_142':'float16','D_143':'float16','D_144':'float16','D_145':'float16'
}
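
The comment at the top of the dictionary asks whether the categorical features should use the `category` data type. The sketch below shows what that change can do for a low-cardinality string column. The labels `A`, `B`, `C` are hypothetical and only stand in for columns such as `D_63` and `D_64`.

```python
import pandas as pd

# Hypothetical low-cardinality string column (labels are made up).
values = ["A", "B", "C"] * 10_000
as_object = pd.Series(values, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)
print(object_bytes, category_bytes)
```

A categorical column stores each distinct label once plus a small integer code per row, which is where the saving comes from. `pd.read_csv` also accepts `'category'` as a value in the `dtype` mapping, so the same change could be made directly in `dtype_dict`.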

In [None]:
# Import packages
import numpy as np
import pandas as pd
from glob import glob
from datetime import datetime

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Create a list of the csv file names in the directory
def get_local_files() -> list:
    local_files = glob("/kaggle/input/amex-default-prediction/*.csv")
    return local_files

In [None]:
# Run the method
local_files = get_local_files()
local_files

## Pandas to convert to parquet

In [None]:
# Select the training CSV. glob() does not guarantee file order, so check the path printed below.
train = local_files[1]
train

Testing whether the updated data type dictionary allows a pandas dataframe to be created in memory. Results were good: the dataframe can be created and manipulated with a much reduced memory footprint.

In [None]:
# Test the updated data type dictionary with pandas. Changing the data types
# allows the dataframe to be created in memory.
# df = pd.read_csv(train, dtype = dtype_dict)
# df.head()

In [None]:
# Display the memory footprint of the new dataframe.
# memory_usage takes a bool or 'deep' ('deep' also measures object columns in full).
# df.info(memory_usage=True)

Making use of `chunksize`, pandas reads the CSV in pieces, so only one chunk has to fit in memory at a time.
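
A small sketch of how `chunksize` behaves, using a made-up ten-row CSV held in memory rather than the real training file. In the notebook the first argument would be the path stored in `train`.

```python
import io

import pandas as pd

# Ten-row synthetic CSV; the column names are made up for illustration.
csv_text = "id,x\n" + "\n".join(f"c{i},{i * 0.5}" for i in range(10))

chunk_lengths = []
# With chunksize set, read_csv returns an iterator of DataFrames instead of
# one large DataFrame, so each chunk can be processed and then discarded.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4, dtype={"x": "float16"}):
    chunk_lengths.append(len(chunk))

print(chunk_lengths)
```

Ten rows with `chunksize=4` give chunks of 4, 4 and 2 rows, so the number of output files is the row count divided by the chunk size, rounded up.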

In [None]:
# Work with pandas to convert the train data
# 1. Updated to include chunk size as the file is too large to fit in RAM
# 2. `parquet_file` is the output prefix; files are written as <prefix>_<i>.parquet
# def file_df_to_parquet(local_file: str, parquet_file: str) -> None:
#     for i, chunk in enumerate(pd.read_csv(local_file, chunksize=1000000)):
#         chunk.to_parquet(f"{parquet_file}_{i}.parquet")

# if __name__ == "__main__":
#     t1 = datetime.now()
#     file_df_to_parquet(train, "/kaggle/working/parquet_file")
#     t2 = datetime.now()
#     duration = t2 - t1  # timedelta, printed as H:MM:SS
#     print(f"It took {duration} to process")

In [None]:
# Count the number of output parquet files using this method
# def get_output_files() -> list:
#     local_files = glob("/kaggle/working/*.parquet")
#     return local_files

# output_files = get_output_files()
# len(output_files)

In [None]:
# Understanding the file size of one file
# from humanize import naturalsize
# size = os.stat("/kaggle/working/parquet_file_42.parquet").st_size
# print(size)
# print(naturalsize(size))

In [None]:
# Review folder size
# from pathlib import Path

# def get_size(path: str = '.') -> str:
#     size = 0
#     for file_ in Path(path).rglob('*.parquet'):
#         size += file_.stat().st_size
#     return naturalsize(size)

# print(get_size('/kaggle/working/'))

This method produced 56 parquet files in 8 minutes and 23 seconds. The data size fell from 16.39GB to 8.4GB, roughly half.
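
For reference, the size comparison can be reproduced with the standard library alone. This is a sketch under assumptions: `folder_size_bytes` and `percent_reduction` are hypothetical helper names, and the demo runs on throwaway files rather than on `/kaggle/working/`.

```python
import tempfile
from pathlib import Path


def folder_size_bytes(path: str, pattern: str = "*") -> int:
    """Sum the sizes of all files under `path` whose names match `pattern`."""
    return sum(f.stat().st_size for f in Path(path).rglob(pattern) if f.is_file())


def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, as a percentage."""
    return (1 - after / before) * 100


# Demo on throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.parquet").write_bytes(b"x" * 300)
    (Path(tmp) / "b.parquet").write_bytes(b"x" * 200)
    (Path(tmp) / "notes.txt").write_bytes(b"x" * 50)  # ignored by the pattern
    demo_size = folder_size_bytes(tmp, "*.parquet")

print(demo_size)
# Figures reported above: 16.39GB before, 8.4GB after.
print(round(percent_reduction(16.39, 8.4), 1))
```

Applied to the reported figures, this gives a reduction of about 48.7%.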

## Use Dask to convert csv file to parquet
As Dask has a similar API to pandas, we are able to create dataframes using similar keyword arguments.

In [None]:
# Import packages
# from dask.dataframe import read_csv

In [None]:
# Create the method for storing the dataframe in parquet format
# def dask_df_parquet(local_file: str, parquet_file: str) -> None:
#     # Alternative: dask_df = read_csv(local_file, dtype = dtype_dict)
#     dask_df = read_csv(local_file)
#     # Dask writes a directory of part files at this path
#     dask_df.to_parquet(parquet_file)

# if __name__ == "__main__":
#     t1 = datetime.now()
#     dask_df_parquet(train, "/kaggle/working/train_parquet.parquet")
#     t2 = datetime.now()
#     duration = t2 - t1
#     print(f"It took {duration} to process")