# <b><span style='color:#016FD0;font-size:200%'>1 |</span><span style='color:#016FD0;font-size:200%'> Introduction</span></b>

The dataset for the ["American Express - Default Prediction"][1] competition is split into approximately equal-sized chunks in this kernel so that it can be processed more efficiently (the dataset cannot be loaded into a kernel all at once!!). The code for processing the dataset with [Pandas][2] is simple and easy to understand, so it can help Kaggle, machine learning, and data science beginners learn how to process a large dataset.

[1]: https://www.kaggle.com/competitions/amex-default-prediction
[2]: https://pandas.pydata.org/docs/



<b><div style='color:#9BD4F5;font-size:120%'>NOTE : To save reading time, check the hidden "Table of Contents" (on the right side of the notebook) first. All topics are summarized there, and you can jump to the item you want to check.</div></b>

The kernel may contain bugs or mistakes. I am happy to get your comments. Thank you in advance for your kind advice to make the kernel NICE and to make me a NICE deep learning guy!!

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>List of files created in the kernel</div></b>

The following files will be created in the kernel if the configuration is not changed:
- Chunked train metadata : "/kaggle/working/train_data_chunk_#.parquet",
- Chunked test metadata : "/kaggle/working/test_data_chunk_#.parquet",
- Compressed train labels : "/kaggle/working/train_labels.parquet",
- Grouped names of features : "/kaggle/working/features.json",
- Customer IDs decoding map for train metadata : "/kaggle/working/customer_id_decoding_map_train.json",
- Customer IDs decoding map for test metadata : "/kaggle/working/customer_id_decoding_map_test.json".
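
The `#` in the chunk paths above is replaced with a zero-padded, three-digit chunk id (for example `train_data_chunk_000.parquet`). As a minimal sketch, the hypothetical helper `chunk_paths` below (not part of the kernel) expands a basename into concrete paths, assuming the default of 10 train chunks:

```python
def chunk_paths(basename, num_chunks):
    """Expand a '#'-style basename into one path per chunk (hypothetical helper)."""
    # The kernel formats chunk ids as three zero-padded digits.
    return [basename.replace("#", f"{chunk_id:03d}") for chunk_id in range(num_chunks)]

train_chunk_paths = chunk_paths("/kaggle/working/train_data_chunk_#.parquet", num_chunks=10)
print(train_chunk_paths[0])   # /kaggle/working/train_data_chunk_000.parquet
print(train_chunk_paths[-1])  # /kaggle/working/train_data_chunk_009.parquet
```

Each path can then be read with `pd.read_parquet(path)`.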

# <b><span style='color:#016FD0;font-size:200%'>2 |</span><span style='color:#016FD0;font-size:200%'> Load Competition Dataset to the Kernel</span></b>

Load the ["American Express - Default Prediction"][1] competition dataset into the kernel and check its contents. If you are a beginner and don't know how to add a competition dataset to your kernel (notebook), these other kernels can be useful: ["Preview of Whale and Dolphin Dataset with Plotly/Matplotlib"][2] and its Japanese version, ["Plotly/Matplotlib による Whale&Dolphin データセットのプレビュー"][3] (see Chapter 2, "Preparation of dataset").

[1]: https://www.kaggle.com/competitions/amex-default-prediction
[2]: https://www.kaggle.com/code/acchiko/preview-of-whale-dolphin-dataset-with-plotly-matpl
[3]: https://www.kaggle.com/code/acchiko/plotly-matplotlib-whale-dolphin

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>List of files</div></b>

In [None]:
# Show list of files.
path_to_dir_input = "/kaggle/input/amex-default-prediction"
!ls {path_to_dir_input}

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Train metadata</div></b>

In [None]:
# Show contents of train metadata and number of lines.
path_to_train_metadata = f"{path_to_dir_input}/train_data.csv"
!echo "Contents : "
!head -3 {path_to_train_metadata}
!echo ""
!echo "Number of lines : "
!cat -n {path_to_train_metadata} | tail -1 | cut -f1

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Train labels</div></b>

In [None]:
# Show contents of train labels and number of lines.
path_to_train_labels = f"{path_to_dir_input}/train_labels.csv"
!echo "Contents : "
!head -3 {path_to_train_labels}
!echo ""
!echo "Number of lines : "
!cat -n {path_to_train_labels} | tail -1 | cut -f1

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Test metadata</div></b>

In [None]:
# Show contents of test metadata and number of lines.
path_to_test_metadata = f"{path_to_dir_input}/test_data.csv"
!echo "Contents : "
!head -3 {path_to_test_metadata}
!echo ""
!echo "Number of lines : "
!cat -n {path_to_test_metadata} | tail -1 | cut -f1

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Sample submission</div></b>

In [None]:
# Show contents of sample submission and number of lines.
path_to_sample_submission = f"{path_to_dir_input}/sample_submission.csv"
!echo "Contents : "
!head -3 {path_to_sample_submission}
!echo ""
!echo "Number of lines : "
!cat -n {path_to_sample_submission} | tail -1 | cut -f1

# <b><span style='color:#016FD0;font-size:200%'>3 |</span><span style='color:#016FD0;font-size:200%'> Split Train Metadata</span></b>

Load the train metadata and split it into approximately equal-sized chunks. The configuration for splitting, such as the number of chunks, the path to the chunks, and the data types, can be changed in the class "Config" if required.
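
The kernel splits the list of unique customer IDs with `np.array_split`, so all rows of a customer stay in one chunk. The following is a rough pure-Python sketch of the chunk sizes this produces (the helper name `chunk_sizes` and the customer count are illustrative assumptions): with `n` customers and `k` chunks, the first `n % k` chunks hold one extra customer.

```python
def chunk_sizes(num_items, num_chunks):
    """Sizes of the sections np.array_split produces for num_items items."""
    base, extra = divmod(num_items, num_chunks)
    # The first `extra` chunks get one more item than the rest.
    return [base + 1 if i < extra else base for i in range(num_chunks)]

sizes = chunk_sizes(num_items=103, num_chunks=10)  # 103 is an illustrative count
print(sizes)       # [11, 11, 11, 10, 10, 10, 10, 10, 10, 10]
print(sum(sizes))  # 103
```

The chunks hold nearly equal numbers of customers, so their row counts are only approximately equal when customers have different numbers of rows.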

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Configuration for splitting train metadata</div></b>

In [None]:
# Set configuration for splitting train metadata.
class Config():
    num_chunks = 10
    encode_customer_ID = True
    dtype_numerical = "float32" # "float16" would reduce memory usage further, but it cannot be saved to a parquet file.
    dtype_categorical = "category"
    categorical_features = ["B_30", "B_38", "D_114", "D_116", "D_117", "D_120", "D_126", "D_63", "D_64", "D_66", "D_68"]
    path_to_metadata = path_to_train_metadata # Path to original csv format metadata.
    path_to_chunked_metadata = "/kaggle/working/train_data_chunk_#.parquet" # Basename of path to chunks of metadata. "#" will be replaced with id of chunk.

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Required libraries for splitting metadata</div></b>

In [None]:
# Import libs.
import numpy as np
import pandas as pd
from tqdm import tqdm
import gc
import json

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Number of rows of train metadata</div></b>

In [None]:
# Define utility functions for getting number of rows of train metadata.
def getNumRows(path_to_metadata_csv):
    """Load first column of csv format metadata and extract total number of rows."""
    df = pd.read_csv(path_to_metadata_csv, usecols=[0])
    return len(df)

In [None]:
# Show number of rows of train metadata.
getNumRows(path_to_metadata_csv=Config.path_to_metadata)

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Name of features</div></b>

In [None]:
# Define utility functions for getting name of features.
def getFeatures(path_to_metadata_csv):
    """Load first line of csv format metadata and extract names of features."""
    
    # Load first line of metadata for getting names of features.
    df = pd.read_csv(path_to_metadata_csv, nrows=1)
    
    # Define group of features.
    features = {}
    features["index"] = ["customer_ID", "S_2"]
    features["all"] = [feature for feature in df.columns if feature not in features["index"]]
    
    features["categorical"] = Config.categorical_features
    features["categorical_delinquency"] = [feature for feature in features["categorical"] if feature.startswith("D_")]
    features["categorical_balance"] = [feature for feature in features["categorical"] if feature.startswith("B_")]
    
    features["numerical"] = [feature for feature in features["all"] if feature not in features["categorical"]]
    features["numerical_delinquency"] = [feature for feature in features["numerical"] if feature.startswith("D_")]
    features["numerical_spend"] = [feature for feature in features["numerical"] if feature.startswith("S_")]
    features["numerical_payment"] = [feature for feature in features["numerical"] if feature.startswith("P_")]
    features["numerical_balance"] = [feature for feature in features["numerical"] if feature.startswith("B_")]
    features["numerical_risk"] = [feature for feature in features["numerical"] if feature.startswith("R_")]
    
    return features

In [None]:
# Show grouped name of features.
features = getFeatures(path_to_metadata_csv=Config.path_to_metadata)
for group in features.keys():
    print(f"features[\"{group}\"] ({len(features[group])} features total) :")
    print(f"  {features[group]}")
    print()

In [None]:
# Save grouped name of features as json for future use.
path_to_features_json = "/kaggle/working/features.json"
with open(path_to_features_json, "w") as fout:
    json.dump(features, fout)

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Customer IDs in train metadata</div></b>

In [None]:
# Define utility functions for loading metadata.
def loadMetadata(path_to_metadata_csv, cols=None, index_range=None, encode_customer_ID=False):
    """Load specified row range and columns of csv format metadata with data type conversion."""
    
    # Load first line of metadata (csv) and extract names of features.
    features = getFeatures(path_to_metadata_csv)
    
    # Define dictionaries for data type conversion.
    categorical_features_dtypes = dict.fromkeys(features["categorical"], Config.dtype_categorical)
    numerical_features_dtypes = dict.fromkeys(features["numerical"], Config.dtype_numerical)
    dtypes = dict(**categorical_features_dtypes, **numerical_features_dtypes)
    
    # Prepare args for read_csv().
    kwargs = dict(parse_dates=["S_2"], dtype=dtypes)
    
    if cols is not None:
        # Build a new list instead of extending in place, so the caller's list is not modified.
        kwargs["usecols"] = list(cols) + features["index"]
        
    if index_range is not None:
        skiprows, nrows = _toReadCsvArgs(index_range)
        kwargs["skiprows"] = skiprows
        kwargs["nrows"] = nrows
        
    if encode_customer_ID:
        kwargs["converters"] = {"customer_ID": encodeCustomerID}
    
    # Reload metadata from second line with data type conversion.
    df = pd.read_csv(path_to_metadata_csv, **kwargs)
    
    return df

def _isValidRange(index_range, valid_range):
    first_index, last_index = index_range
    lower_limit, upper_limit = valid_range
    
    if first_index >= last_index:
        return False
    
    if first_index < lower_limit or upper_limit < last_index:
        return False
    
    return True
    
def _toReadCsvArgs(index_range):
    # Convert index range of dataframe to line numbers to skip (skiprows) and line number of rows to load (nrows).
    # skiprows starts from 1 for keeping name of columns.
    first_index, end_index = index_range
    skiprows = range(1, first_index + 1)
    nrows = (end_index - first_index) + 1
    
    return skiprows, nrows

def encodeCustomerID(customer_id):
    return int(customer_id[-16:], 16)
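
`encodeCustomerID` keeps only the last 16 hexadecimal characters of an ID and parses them as an integer, so every encoded ID fits in 64 unsigned bits. Because the leading characters are discarded, the encoding cannot be reversed without a lookup table, and two IDs sharing the same 16-character suffix would collide. Below is a minimal sketch of the encoding plus a collision check, using made-up IDs (assumed to be hexadecimal strings, as the `int(..., 16)` call implies) and a hypothetical helper `hasCollisions`:

```python
def encodeCustomerID(customer_id):
    # Same rule as in the kernel: last 16 hex characters -> integer.
    return int(customer_id[-16:], 16)

def hasCollisions(customer_ids):
    """Return True if two different IDs share an encoded value (hypothetical helper)."""
    unique_ids = set(customer_ids)
    return len({encodeCustomerID(cid) for cid in unique_ids}) < len(unique_ids)

# Made-up IDs for illustration only.
example_ids = ["00aa" + "0" * 15 + "1", "00bb" + "0" * 15 + "2"]
encoded = [encodeCustomerID(cid) for cid in example_ids]
print(encoded)                     # [1, 2]
print(max(encoded) < 2 ** 64)      # True
print(hasCollisions(example_ids))  # False
print(hasCollisions(["aa" + "f" * 16, "bb" + "f" * 16]))  # True: same 16-character suffix
```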

In [None]:
# Show customer IDs (If cols=[] is specified for loadMetadata(), only features["index"] are loaded.).
indices_df = loadMetadata(path_to_metadata_csv=Config.path_to_metadata, cols=[])
customer_ids = indices_df["customer_ID"].unique() # Uniques are returned in order of appearance. This does NOT sort.
customer_ids

<b><div style='color:#9BD4F5;font-size:120%'>Tips : Are all customer IDs in the train metadata the same as the ones in the train labels? And are they in the same order?</div></b>

In [None]:
# Load train labels and compare it with customer IDs in train metadata for answering the question.
train_labels = pd.read_csv(path_to_train_labels, usecols=["customer_ID"])
"Yes" if customer_ids.tolist() == train_labels["customer_ID"].tolist() else "No"

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Encoded customer IDs in train metadata</div></b>

In [None]:
# Show first 3 encoded customer IDs as example.
encoded_customer_ids = [encodeCustomerID(customer_id) for customer_id in customer_ids]
encoded_customer_ids[:3]

In [None]:
# Create map for decoding customer IDs and save it as json for future use.
decoding_map = dict(zip(encoded_customer_ids, customer_ids))

path_to_decoding_map_json = "/kaggle/working/customer_id_decoding_map_train.json"
with open(path_to_decoding_map_json, "w") as fout:
    json.dump(decoding_map, fout)
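
One caveat when reading the map back: JSON object keys are always strings, so `json.dump` writes the integer encoded IDs as strings. Here is a minimal sketch of a round trip that restores integer keys, using a made-up entry and an in-memory string instead of the file:

```python
import json

# Made-up entry for illustration; the real map has one entry per customer.
decoding_map = {1311768467463790320: "abc123"}

serialized = json.dumps(decoding_map)
loaded = json.loads(serialized)
print(list(loaded.keys()))  # ['1311768467463790320'] (a string key)

# Convert keys back to int before looking up encoded customer IDs.
restored = {int(key): value for key, value in loaded.items()}
print(restored == decoding_map)  # True
```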

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Memory cleaning</div></b>

In [None]:
# Clean memory, if it is required.
del indices_df, customer_ids, train_labels, encoded_customer_ids, decoding_map
gc.collect()

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Chunk of train metadata</div></b>

In [None]:
# Define utility functions for splitting metadata.
def splitMetadata(path_to_metadata_csv, num_chunks, path_to_chunked_metadata_basename):
    """Load metadata and split it to approximately equal-sized chunks."""
    
    # Load metadata and extract unique customer IDs.
    indices_df = loadMetadata(path_to_metadata_csv=path_to_metadata_csv, cols=[], encode_customer_ID=Config.encode_customer_ID)
    customer_ids = indices_df["customer_ID"].unique()
    
    # Split customer IDs.
    chunked_customer_ids = np.array_split(ary=customer_ids, indices_or_sections=num_chunks)
    
    # Split metadata.
    for chunk_id, chunked_customer_ids_ in enumerate(tqdm(chunked_customer_ids, desc="Splitting metadata ...")):
        # Load metadata for chunked customers ids.
        start_index = _getFirstRowIndex(df=indices_df, customer_id=chunked_customer_ids_[0])
        end_index = _getLastRowIndex(df=indices_df, customer_id=chunked_customer_ids_[-1])
        index_range = (start_index, end_index)
        chunked_metadata = loadMetadata(path_to_metadata_csv=path_to_metadata_csv, index_range=index_range, encode_customer_ID=Config.encode_customer_ID)
        
        # Save chunked metadata.
        path_to_chunked_metadata = path_to_chunked_metadata_basename.replace("#", f"{chunk_id:03d}")
        chunked_metadata.to_parquet(path_to_chunked_metadata)
        
def _getFirstRowIndex(df, customer_id):
    #return df.query(f"customer_ID == '{customer_id}'").index[0]  # Does NOT work for encoded customer IDs.
    return df[df["customer_ID"] == customer_id].index[0]

def _getLastRowIndex(df, customer_id):
    #return df.query(f"customer_ID == '{customer_id}'").index[-1]  # Does NOT work for encoded customer IDs.
    return df[df["customer_ID"] == customer_id].index[-1]

In [None]:
# Load metadata and split it to approximately equal-sized chunks.
splitMetadata(path_to_metadata_csv=Config.path_to_metadata, num_chunks=Config.num_chunks, path_to_chunked_metadata_basename=Config.path_to_chunked_metadata)
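
The row-range logic inside `splitMetadata` can be illustrated without pandas. The sketch below assumes, as the kernel's approach does, that all rows of one customer are contiguous in the CSV; `chunkRowRanges` and `toReadCsvArgs` are hypothetical stand-ins, and the IDs are made up. For each chunk of customers it takes the first row of the first customer and the last row of the last customer, then converts that inclusive range into the `skiprows`/`nrows` values passed to `read_csv`.

```python
def chunkRowRanges(row_customer_ids, chunked_customer_ids):
    """Inclusive (first_row, last_row) per chunk, assuming contiguous rows per customer."""
    first_row, last_row = {}, {}
    for row_index, customer_id in enumerate(row_customer_ids):
        first_row.setdefault(customer_id, row_index)
        last_row[customer_id] = row_index
    return [(first_row[chunk[0]], last_row[chunk[-1]]) for chunk in chunked_customer_ids]

def toReadCsvArgs(index_range):
    # Same arithmetic as _toReadCsvArgs: line 0 is the header, so skipping starts at line 1.
    first_index, end_index = index_range
    return range(1, first_index + 1), (end_index - first_index) + 1

rows = ["a", "a", "b", "c", "c", "c", "d"]  # customer ID of each data row
ranges = chunkRowRanges(rows, [["a", "b"], ["c", "d"]])
print(ranges)  # [(0, 2), (3, 6)]
for index_range in ranges:
    skiprows, nrows = toReadCsvArgs(index_range)
    print(list(skiprows), nrows)  # [] 3, then [1, 2, 3] 4
```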

# <b><span style='color:#016FD0;font-size:200%'>4 |</span><span style='color:#016FD0;font-size:200%'> Split Test Metadata</span></b>

Load the test metadata and split it into approximately equal-sized chunks. The configuration for splitting, such as the number of chunks, the path to the chunks, and the data types, can be changed in the class "Config" if required.

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Configuration for splitting test metadata</div></b>

In [None]:
# Set configuration for splitting test metadata.
class Config():
    num_chunks = 20
    encode_customer_ID = True
    dtype_numerical = "float32" # "float16" would reduce memory usage further, but it cannot be saved to a parquet file.
    dtype_categorical = "category"
    categorical_features = ["B_30", "B_38", "D_114", "D_116", "D_117", "D_120", "D_126", "D_63", "D_64", "D_66", "D_68"]
    path_to_metadata = path_to_test_metadata # Path to original csv format metadata.
    path_to_chunked_metadata = "/kaggle/working/test_data_chunk_#.parquet" # Basename of path to chunks of metadata. "#" will be replaced with id of chunk.

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Number of rows of test metadata</div></b>

In [None]:
# Show number of rows of test metadata.
getNumRows(path_to_metadata_csv=Config.path_to_metadata)

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Customer IDs in test metadata</div></b>

In [None]:
# Show customer IDs (If cols=[] is specified for loadMetadata(), only features["index"] are loaded.).
indices_df = loadMetadata(path_to_metadata_csv=Config.path_to_metadata, cols=[])
customer_ids = indices_df["customer_ID"].unique()
customer_ids

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Number of customer IDs in test metadata</div></b>

In [None]:
len(customer_ids)

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Encoded customer IDs in test metadata</div></b>

In [None]:
# Show first 3 encoded customer IDs as example.
encoded_customer_ids = [encodeCustomerID(customer_id) for customer_id in customer_ids]
encoded_customer_ids[:3]

In [None]:
# Create map for decoding customer IDs and save it as json for future use.
decoding_map = dict(zip(encoded_customer_ids, customer_ids))

path_to_decoding_map_json = "/kaggle/working/customer_id_decoding_map_test.json"
with open(path_to_decoding_map_json, "w") as fout:
    json.dump(decoding_map, fout)

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Memory cleaning</div></b>

In [None]:
# Clean memory, if it is required.
del indices_df, customer_ids, encoded_customer_ids, decoding_map
gc.collect()

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Chunk of test metadata</div></b>

In [None]:
# Load test metadata and split it to approximately equal-sized chunks.
splitMetadata(path_to_metadata_csv=Config.path_to_metadata, num_chunks=Config.num_chunks, path_to_chunked_metadata_basename=Config.path_to_chunked_metadata)

# <b><span style='color:#016FD0;font-size:200%'>5 |</span><span style='color:#016FD0;font-size:200%'> Compress Train Labels</span></b>

Load the train labels and compress them. The configuration for compressing, such as the path to the compressed labels and the data types, can be changed in the class "Config" if required.

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Configuration for compressing train labels</div></b>

In [None]:
# Set configuration for compressing train labels.
class Config():
    encode_customer_ID = True
    dtype_target = "category"
    path_to_labels = path_to_train_labels # Path to original csv format labels.
    path_to_labels_parquet = "/kaggle/working/train_labels.parquet" # Path to parquet format labels.

# <b><div style='padding:20px;background-color:#636364;color:white;border-radius:5px;font-size:80%'>Compressed train labels</div></b>

In [None]:
# Define utility functions for loading labels.
def loadLabels(path_to_labels_csv, encode_customer_ID=False):
    """Load csv format labels with data type conversion."""
    
    # Define dictionaries for data type conversion.
    dtypes = {"target": Config.dtype_target}
    
    # Prepare args for read_csv().
    kwargs = dict(dtype=dtypes)
    
    if encode_customer_ID:
        kwargs["converters"] = {"customer_ID": encodeCustomerID}
    
    # Load labels with data type conversion.
    df = pd.read_csv(path_to_labels_csv, **kwargs)
    
    return df

In [None]:
# Load train labels with data type conversion.
train_labels = loadLabels(Config.path_to_labels, encode_customer_ID=Config.encode_customer_ID)
train_labels

In [None]:
# Save train labels.
train_labels.to_parquet(Config.path_to_labels_parquet)

<b><div style='padding:20px;background-color:#016FD0;color:white;border-radius:5px;font-size:700%'>Thank you for reading!!</div></b>