# 🏠 Home Credit Risk Model Pipeline

Welcome to our advanced pipeline for building and assessing credit risk models using machine learning! 📊💳

## Overview 📝
This notebook walks through a pipeline for building credit risk models. It trains gradient-boosting models (LightGBM and CatBoost) and combines their predictions in an ensemble. The pipeline emphasizes data integrity, feature relevance, and model stability, which are crucial elements in credit risk assessment. 🛠️💼

## Features 🚀
- **Data Preprocessing**: Begin with cleaning data, handling missing values, and optimizing memory usage for efficient computation.
- **Feature Engineering**: Aggregate the depth-1 and depth-2 tables per `case_id`, then derive date-based features such as day offsets from the decision date.
- **Model Training**: Train multiple machine learning models such as LightGBM and CatBoost to capture complex relationships and patterns.
- **Ensemble Learning**: Combine predictions from various models using our custom Voting Model to achieve higher accuracy and stability. 🤝📈

**🌟 Explore my profile and other public projects, and don't forget to share your feedback!**

## 👉 [Visit my Profile](https://www.kaggle.com/code/zulqarnainalipk) 👈

## Requirements 🛠️
Ensure you have:
- Python 3.7+
- Libraries: NumPy, pandas, polars, seaborn, matplotlib, scikit-learn, lightgbm, imbalanced-learn, joblib, catboost

## Usage 🚀
Follow these steps:
1. **Data Loading**: Ensure required datasets are available in the specified directory (`/kaggle/input/home-credit-credit-risk-model-stability`).
2. **Initialization**: Run initialization code to set up necessary functions and configurations.
3. **Data Preprocessing**: Execute data preprocessing steps to handle missing values and optimize memory usage.
4. **Feature Engineering**: Use provided feature engineering functions to extract relevant features from the dataset.
5. **Model Training**: Train machine learning models like LightGBM and CatBoost using preprocessed data.
6. **Ensemble Learning**: Combine predictions from multiple models using the custom Voting Model for improved performance.
7. **Evaluation**: Assess ensemble model performance and generate submission files for further analysis.

## Note 📌
- **Customization**: Feel free to customize the pipeline by adding or modifying features, adjusting model parameters, or experimenting with different algorithms.
- **Resource Management**: Monitor memory usage and computational resources, especially during data preprocessing and model training, for smooth execution.

## Acknowledgments 🙏
We acknowledge The Home Credit Group organizers for providing the dataset and the competition platform.

Let's dive in! Feel free to reach out if you have any questions or need assistance along the way. 👉 [Visit my Profile](https://www.kaggle.com/zulqarnainalipk) 👈

In [1]:
import sys  # System-specific parameters and functions
import subprocess  # Spawn new processes, connect to their input/output/error pipes, and obtain their return codes
import os  # Operating system dependent functionality
import gc  # Garbage Collector interface
from pathlib import Path  # Object-oriented filesystem paths
from glob import glob  # Unix style pathname pattern expansion

import numpy as np  # Fundamental package for scientific computing with Python
import pandas as pd  # Powerful data structures for data manipulation and analysis
import polars as pl  # Fast DataFrame library implemented in Rust

from datetime import datetime  # Basic date and time types
import seaborn as sns  # Statistical data visualization
import matplotlib.pyplot as plt  # MATLAB-like plotting framework

import joblib  # Save and load Python objects

import warnings  # Warning control
warnings.filterwarnings('ignore')  # Ignore warnings

from sklearn.base import BaseEstimator, RegressorMixin  # Base classes for all estimators in scikit-learn
from sklearn.metrics import roc_auc_score  # ROC AUC score
import lightgbm as lgb  # LightGBM: Gradient boosting framework
from sklearn.model_selection import TimeSeriesSplit, GroupKFold, StratifiedGroupKFold  # Cross-validation strategies
from imblearn.over_sampling import SMOTE  # Oversampling technique for imbalanced datasets
from sklearn.preprocessing import OrdinalEncoder  # Encode categorical features as an integer array
from sklearn.impute import KNNImputer  # Imputation for completing missing values using k-Nearest Neighbors



In [2]:
ROOT = '/kaggle/input/home-credit-credit-risk-model-stability'  # Setting the root directory path


# 🛠️📊 Pipeline for  Data Preprocessing 

Let's create a class named `Pipeline` containing methods that preprocess data with Polars.

**1. `set_table_dtypes(df)`**
- This method iterates through each column in the DataFrame (`df`) and converts the data types based on certain conditions.
- If the column name is one of ["case_id", "WEEK_NUM", "num_group1", "num_group2"], it converts the column to `Int64`.
- If the column name is "date_decision", it converts the column to `Date`.
- If the last character of the column name is "P" or "A", it converts the column to `Float64`.
- If the last character of the column name is "M", it converts the column to `String`.
- If the last character of the column name is "D", it converts the column to `Date`.
- Finally, it returns the DataFrame with modified data types.
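
The suffix rules above can be read as a small lookup. The following is a minimal sketch in plain Python; the helper name `dtype_for_column`, the string labels, and the sample column names are illustrative assumptions and are not part of the notebook.

```python
# Hypothetical helper: maps a column name to the dtype label that the
# rules above would select. Returns None when no rule applies.
ID_COLUMNS = {"case_id", "WEEK_NUM", "num_group1", "num_group2"}

def dtype_for_column(name):
    if name in ID_COLUMNS:
        return "Int64"
    if name == "date_decision":
        return "Date"
    suffix = name[-1]
    if suffix in ("P", "A"):
        return "Float64"
    if suffix == "M":
        return "String"
    if suffix == "D":
        return "Date"
    return None  # the column keeps its original dtype

# Made-up column names, one per rule
for column in ["case_id", "loan_amount_A", "birth_D", "education_M", "flag_L"]:
    print(column, "->", dtype_for_column(column))
```

Columns that match none of the rules (for example those ending in "T" or "L") keep whatever type the Parquet file gave them.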

**2. `handle_dates(df)`**
- This method aims to handle date columns in the DataFrame.
- It iterates through each column, and if the last character of the column name is "D", it performs some operations.
- It subtracts the "date_decision" value from each date in the current column.
- Then it converts the resulting duration to a whole number of days, so dates before the decision become negative numbers.
- After processing, it drops the "date_decision" and "MONTH" columns from the DataFrame.
- Finally, it returns the modified DataFrame.
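
To make the sign convention concrete, here is a small sketch using only the standard library. The code computes `column - date_decision`, so an event before the decision yields a negative day count. The dates and the helper name `days_from_decision` are made up for illustration.

```python
from datetime import date

def days_from_decision(event_date, decision_date):
    # Same direction as the notebook's expression: column minus date_decision
    return (event_date - decision_date).days

decision = date(2020, 1, 31)  # made-up decision date

print(days_from_decision(date(2020, 1, 1), decision))   # -30 (event before decision)
print(days_from_decision(date(2020, 2, 10), decision))  # 10 (event after decision)
```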

**3. `filter_cols(df)`**
- This method filters out columns based on certain conditions.
- It iterates through each column and checks if the column name is not in ["target", "case_id", "WEEK_NUM"] and if the column type is `String`.
- If the number of unique values in the column is either 1 or more than 200, it drops that column.
- Finally, it returns the filtered DataFrame.
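
The cardinality rule can be tried on plain Python lists. This is a minimal sketch, assuming a hypothetical helper `keep_string_column`; the notebook itself applies the rule with Polars `n_unique()`.

```python
def keep_string_column(values, max_unique=200):
    # A column is kept only if it has more than one distinct value
    # and no more than `max_unique` distinct values.
    n_unique = len(set(values))
    return 1 < n_unique <= max_unique

constant = ["a"] * 5                        # one distinct value: dropped
useful = ["a", "b", "c", "a"]               # three distinct values: kept
too_wide = [f"id_{i}" for i in range(201)]  # 201 distinct values: dropped

print(keep_string_column(constant), keep_string_column(useful), keep_string_column(too_wide))
```

Constant columns carry no signal, and very high-cardinality string columns tend to be expensive to encode, which is presumably why both are removed.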

### Study Sources
- For learning Pandas and data preprocessing: [Pandas Documentation](https://pandas.pydata.org/docs/)
- Understanding Pipelines in data preprocessing: [Scikit-Learn Pipeline Documentation](https://scikit-learn.org/stable/modules/compose.html#pipeline)
- Data type conversion and manipulation: [Pandas Data Types and Conversion](https://pandas.pydata.org/pandas-docs/version/1.3/user_guide/basics.html#basics-dtypes)

In [3]:

class Pipeline:

    @staticmethod
    def set_table_dtypes(df):
        # Cast each column according to its name or its one-letter suffix.
        for col in df.columns:
            if col in ["case_id", "WEEK_NUM", "num_group1", "num_group2"]:
                df = df.with_columns(pl.col(col).cast(pl.Int64))
            elif col in ["date_decision"]:
                df = df.with_columns(pl.col(col).cast(pl.Date))
            elif col[-1] in ("P", "A"):
                df = df.with_columns(pl.col(col).cast(pl.Float64))
            elif col[-1] in ("M",):
                df = df.with_columns(pl.col(col).cast(pl.String))
            elif col[-1] in ("D",):
                df = df.with_columns(pl.col(col).cast(pl.Date))
        return df

    @staticmethod
    def handle_dates(df):
        for col in df.columns:
            if col[-1] in ("D",):
                # Replace each date with its offset from the decision date ...
                df = df.with_columns(pl.col(col) - pl.col("date_decision"))
                # ... expressed as a whole number of days.
                df = df.with_columns(pl.col(col).dt.total_days())
        df = df.drop("date_decision", "MONTH")
        return df

    @staticmethod
    def filter_cols(df):
        for col in df.columns:
            if col not in ["target", "case_id", "WEEK_NUM"] and df[col].dtype == pl.String:
                freq = df[col].n_unique()
                # Drop constant columns and columns with more than 200 distinct values.
                if freq == 1 or freq > 200:
                    df = df.drop(col)
        return df




# 🔍 Aggregator for Feature Extraction 

Let's create an `Aggregator` class to aggregate features from a DataFrame.

**1. `num_expr(df)`**
- This method extracts numerical features from the DataFrame (`df`).
- It selects columns whose names end with "P" or "A", indicating some numerical measurements.
- For each selected column, it creates an expression to compute the maximum value and aliases it accordingly.
- Finally, it returns a list of expressions for maximum values of numerical features.

 **2. `date_expr(df)`**
- This method extracts date-related features from the DataFrame (`df`).
- It selects columns whose names end with "D", representing date columns.
- Similar to `num_expr`, it creates expressions to compute the maximum date value for each selected column and aliases them.
- It returns a list of expressions for maximum date values of date features.

 **3. `str_expr(df)`**
- This method extracts string features from the DataFrame (`df`).
- It selects columns whose names end with "M", indicating string type columns.
- It creates expressions to compute the maximum string value for each selected column and aliases them accordingly.
- Returns a list of expressions for maximum string values of string features.

 **4. `other_expr(df)`**
- This method extracts other miscellaneous features from the DataFrame (`df`).
- It selects columns whose names end with "T" or "L".
- Similar to previous methods, it computes the maximum value for each selected column and aliases them.
- Returns a list of expressions for maximum values of miscellaneous features.

 **5. `count_expr(df)`**
- This method extracts count-related features from the DataFrame (`df`).
- It selects columns containing "num_group" in their names.
- It computes the maximum value for each selected column and aliases them.
- Returns a list of expressions for maximum count values of count features.

**6. `get_exprs(df)`**
- This method aggregates all the expressions from the previous methods to get a comprehensive list of feature extraction expressions.
- It calls all the individual feature extraction methods and concatenates the resulting lists.
- Returns a consolidated list of expressions for all types of features.
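
Every method above builds the same kind of expression: the per-group maximum of a column, renamed with a `max_` prefix. The sketch below reproduces that idea in plain Python on made-up rows; `max_per_case` is a hypothetical helper, not the notebook's Polars code, and it does not handle missing values.

```python
from collections import defaultdict

# Made-up depth-1 style rows: several records per case_id
rows = [
    {"case_id": 1, "num_group1": 0, "amount_A": 100.0},
    {"case_id": 1, "num_group1": 1, "amount_A": 250.0},
    {"case_id": 2, "num_group1": 0, "amount_A": 80.0},
]

def max_per_case(rows, key="case_id"):
    grouped = defaultdict(dict)
    for row in rows:
        target = grouped[row[key]]
        for name, value in row.items():
            if name == key:
                continue
            alias = f"max_{name}"  # same naming scheme as the aliases above
            if alias not in target or value > target[alias]:
                target[alias] = value
    return dict(grouped)

aggregated = max_per_case(rows)
print(aggregated)
```

After aggregation there is exactly one record per `case_id`, which is what allows the result to be joined onto the base table later.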

### Study Sources
- For learning about feature extraction and aggregation: [Feature Engineering for Machine Learning](https://www.amazon.com/Feature-Engineering-Machine-Learning-Principles/dp/1491953241)
- Understanding Pandas DataFrame manipulation: [Pandas Documentation](https://pandas.pydata.org/docs/)
- Relational Algebra and Expressions: [Relational Algebra - Wikipedia](https://en.wikipedia.org/wiki/Relational_algebra)

In [4]:



class Aggregator:
    # Add or remove aggregations as needed; keep in mind that too many features will use a lot of memory.
    def num_expr(df):
        cols = [col for col in df.columns if col[-1] in ("P", "A")]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        return expr_max
    
    def date_expr(df):
        cols = [col for col in df.columns if col[-1] in ("D",)]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        return  expr_max
    
    def str_expr(df):
        cols = [col for col in df.columns if col[-1] in ("M",)]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        return  expr_max
    
    def other_expr(df):
        cols = [col for col in df.columns if col[-1] in ("T", "L")]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        return  expr_max 
    
    def count_expr(df):
        cols = [col for col in df.columns if "num_group" in col]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols] 
        return  expr_max
    
    def get_exprs(df):
        exprs = Aggregator.num_expr(df) + \
                Aggregator.date_expr(df) + \
                Aggregator.str_expr(df) + \
                Aggregator.other_expr(df) + \
                Aggregator.count_expr(df)

        return exprs



# File Reading with Data Preprocessing 📄

The function `read_file(path, depth=None)` reads a Parquet file located at the given `path`, performs data preprocessing using the `Pipeline` class, and optionally aggregates features based on the `depth` parameter using the `Aggregator` class. 

1. `read_file(path, depth=None)`
- **Inputs**:
  - `path`: Path to the Parquet file.
  - `depth`: An optional parameter indicating the depth of feature aggregation. Default is `None`.
- **Output**: Returns a processed DataFrame.
- **Process**:
  - Reads the Parquet file located at the given `path` using `pl.read_parquet(path)`.
  - Performs data preprocessing using the `Pipeline` class by applying the `set_table_dtypes` method to ensure proper data types.
  - If `depth` is 1 or 2:
    - It groups the DataFrame by "case_id".
    - It aggregates each group with the expressions returned by `Aggregator.get_exprs`. The same expressions are used for both depth values.
  - Returns the processed DataFrame.

### Study Sources
- For understanding Parquet file format and reading: [Parquet File Format](https://parquet.apache.org/documentation/latest/)
- Data preprocessing with Pandas Pipelines: [Pandas Pipe Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html)
- Feature aggregation and group-by operations: [Pandas GroupBy Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)
- Aggregating features using Pandas: [Pandas Aggregation Documentation](https://pandas.pydata.org/docs/user_guide/groupby.html#aggregation)

In [5]:
def read_file(path, depth=None):
    df = pl.read_parquet(path)
    df = df.pipe(Pipeline.set_table_dtypes)
    if depth in [1,2]:
        df = df.group_by("case_id").agg(Aggregator.get_exprs(df)) 
    return df



# 📄Reading Multiple Files with Data Preprocessing 

Let's create a function `read_files(regex_path, depth=None)` that reads multiple Parquet files matching the specified glob pattern, performs data preprocessing using the `Pipeline` class, optionally aggregates features based on the `depth` parameter using the `Aggregator` class, and concatenates the results.

**1. `read_files(regex_path, depth=None)`**
- **Inputs**:
  - `regex_path`: Glob pattern (shell-style wildcards such as `*`, not a regular expression despite the parameter name) for matching file paths.
  - `depth`: An optional parameter indicating the depth of feature aggregation. Default is `None`.
- **Output**: Returns a concatenated and processed DataFrame.
- **Process**:
  - Initializes an empty list `chunks` to store processed DataFrames.
  - Iterates through each file path matched by the glob pattern using `glob(str(regex_path))`.
    - Reads each Parquet file using `pl.read_parquet(path)`.
    - Performs data preprocessing using the `Pipeline` class by applying the `set_table_dtypes` method.
    - If `depth` is provided and is either 1 or 2:
      - It groups the DataFrame by "case_id".
      - It aggregates features based on the depth using the `Aggregator` class and the `get_exprs` method.
    - Appends the processed DataFrame to the `chunks` list.
  - Concatenates all DataFrames in `chunks` vertically using `pl.concat(chunks, how="vertical_relaxed")`.
  - Removes duplicate rows based on the "case_id" column using `df.unique(subset=["case_id"])`.
  - Returns the concatenated and processed DataFrame.
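
Because `glob` uses shell-style wildcards rather than regular expressions, it can help to check which file names a pattern selects. A minimal sketch using `fnmatch` from the standard library, which implements the same wildcard rules; the numbered file names are illustrative assumptions.

```python
from fnmatch import fnmatchcase

pattern = "train_static_0_*.parquet"

# Illustrative file names; only the first two follow the numbered-chunk scheme
names = [
    "train_static_0_0.parquet",
    "train_static_0_1.parquet",
    "train_static_cb_0.parquet",
    "train_base.parquet",
]

matched = [name for name in names if fnmatchcase(name, pattern)]
print(matched)  # ['train_static_0_0.parquet', 'train_static_0_1.parquet']
```

The `*` matches any run of characters, so one pattern picks up every numbered chunk of a table while leaving other tables untouched.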

### Study Sources
- For understanding file path manipulation and regular expressions: [Python Glob Documentation](https://docs.python.org/3/library/glob.html)
- Concatenating DataFrames in Pandas: [Pandas Concatenation Documentation](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)
- Removing duplicate rows in Pandas: [Pandas Drop Duplicates Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)

In [6]:
def read_files(regex_path, depth=None):
    chunks = []
    
    for path in glob(str(regex_path)):
        df = pl.read_parquet(path)
        df = df.pipe(Pipeline.set_table_dtypes)
        if depth in [1, 2]:
            df = df.group_by("case_id").agg(Aggregator.get_exprs(df))
        chunks.append(df)
    
    df = pl.concat(chunks, how="vertical_relaxed")
    df = df.unique(subset=["case_id"])
    return df



# 🛠️ Feature Engineering Function 

Function `feature_eng(df_base, depth_0, depth_1, depth_2)` performs feature engineering on a base DataFrame (`df_base`) and multiple sets of additional DataFrames (`depth_0`, `depth_1`, `depth_2`). It adds new features, joins additional DataFrames, and handles dates using the `Pipeline` class.

 **1. `feature_eng(df_base, depth_0, depth_1, depth_2)`**
- **Inputs**:
  - `df_base`: Base DataFrame on which feature engineering will be performed.
  - `depth_0`, `depth_1`, `depth_2`: Lists of DataFrames representing additional features of different depths.
- **Output**: Returns the feature-engineered DataFrame.
- **Process**:
  - Adds new features to the base DataFrame:
    - `month_decision`: Extracts the month from the "date_decision" column.
    - `weekday_decision`: Extracts the weekday from the "date_decision" column.
  - Iterates through each set of additional DataFrames (`depth_0`, `depth_1`, `depth_2`):
    - Joins each DataFrame to the base DataFrame using the "case_id" column as the key and left join method.
    - Passes a per-table suffix (`_0`, `_1`, ...) to the join, which is appended to incoming columns whose names clash with columns already present.
  - Performs date handling using the `Pipeline` class by applying the `handle_dates` method.
  - Returns the feature-engineered DataFrame.
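
The two calendar features can be illustrated with the standard library. Polars documents `dt.weekday()` as the ISO weekday (Monday = 1, Sunday = 7), which corresponds to `date.isoweekday()`; the dates below are made up for illustration.

```python
from datetime import date

decisions = [date(2024, 1, 1), date(2024, 1, 7)]  # a Monday and a Sunday

features = [
    {"month_decision": d.month, "weekday_decision": d.isoweekday()}
    for d in decisions
]
print(features)
```

These features are derived before `handle_dates` runs, since that step drops the "date_decision" column.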

### Study Sources
- For understanding feature engineering techniques: [Feature Engineering for Machine Learning](https://www.amazon.com/Feature-Engineering-Machine-Learning-Principles/dp/1491953241)
- Handling dates in Pandas: [Pandas DateTime Documentation](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)
- Joining DataFrames in Pandas: [Pandas Merge Documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html)
- Applying functions to Pandas DataFrame using pipe: [Pandas Pipe Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html)

In [7]:
def feature_eng(df_base, depth_0, depth_1, depth_2):
    df_base = (
        df_base
        .with_columns(
            month_decision = pl.col("date_decision").dt.month(),
            weekday_decision = pl.col("date_decision").dt.weekday(),
        )
    )
    for i, df in enumerate(depth_0 + depth_1 + depth_2):
        df_base = df_base.join(df, how="left", on="case_id", suffix=f"_{i}")
    df_base = df_base.pipe(Pipeline.handle_dates)
    return df_base



# 🐼 DataFrame Conversion to Pandas with Categorical Columns 

The function `to_pandas(df_data, cat_cols=None)` converts a DataFrame (`df_data`) to a Pandas DataFrame and optionally converts specified columns to categorical data type. 

 **1. `to_pandas(df_data, cat_cols=None)`**
- **Inputs**:
  - `df_data`: Input DataFrame to be converted to Pandas.
  - `cat_cols`: Optional list of column names to be converted to categorical data type. Default is `None`.
- **Output**: Returns the converted Pandas DataFrame and the list of categorical column names.
- **Process**:
  - Converts the input DataFrame to Pandas DataFrame using the `.to_pandas()` method.
  - If `cat_cols` is not provided, it selects columns with data type "object" as default categorical columns.
  - Converts the selected categorical columns to the categorical data type using `.astype("category")`.
  - Returns the converted Pandas DataFrame along with the list of categorical column names.
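
The conversion step can be tried on a tiny pandas frame. This sketch builds the frame directly in pandas with an explicit `object` column, so it skips the Polars conversion; the column names and values are made up.

```python
import pandas as pd

frame = pd.DataFrame({
    "case_id": [1, 2, 3],
    "education": pd.Series(["primary", "higher", "primary"], dtype=object),
})

# Same two steps as in to_pandas: find object columns, then cast them
cat_cols = list(frame.select_dtypes("object").columns)
frame[cat_cols] = frame[cat_cols].astype("category")

print(cat_cols)
print(frame.dtypes)
```

Returning `cat_cols` alongside the frame means the same list can be passed back in later, so that another frame gets the identical set of categorical columns.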

### Study Sources
- Converting Dask DataFrame to Pandas: [Dask DataFrame to Pandas](https://docs.dask.org/en/latest/dataframe-best-practices.html#converting-to-pandas)
- Converting column data types in Pandas: [Pandas DataFrame astype](https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.DataFrame.astype.html)
- Handling categorical data in Pandas: [Categorical Data in Pandas](https://pandas.pydata.org/pandas-docs/version/1.3/user_guide/categorical.html)

In [8]:
def to_pandas(df_data, cat_cols=None):
    df_data = df_data.to_pandas()
    if cat_cols is None:
        cat_cols = list(df_data.select_dtypes("object").columns)
    df_data[cat_cols] = df_data[cat_cols].astype("category")
    return df_data, cat_cols



# 🔽 Memory Reduction Function for DataFrames 

The function `reduce_mem_usage(df)` iterates through all columns of a DataFrame and modifies the data types to reduce memory usage. 

 **1. `reduce_mem_usage(df)`**
- **Input**: 
  - `df`: Input DataFrame.
- **Output**: Returns the DataFrame with reduced memory usage.
- **Process**:
  - Calculates the initial memory usage of the DataFrame (`start_mem`) using `df.memory_usage()`.
  - Iterates through each column of the DataFrame:
    - Checks if the column type is a category. If so, skips to the next column.
    - For non-category columns:
      - Determines the minimum and maximum values of the column (`c_min` and `c_max`).
      - If the column type is integer:
        - Checks if the data can be fit into `int8`, `int16`, `int32`, or `int64` and converts the column type accordingly.
      - If the column type is float:
        - Checks if the data can be fit into `float16`, `float32`, or `float64` and converts the column type accordingly.
      - If the column type is object (string), it skips the conversion.
  - Calculates the final memory usage of the DataFrame (`end_mem`) after the modifications.
- **Returns** the DataFrame with reduced memory usage.
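
The integer branch boils down to finding the narrowest type whose range contains the column's minimum and maximum. A minimal sketch with NumPy, assuming a hypothetical helper `smallest_int_dtype`; note that it uses inclusive bounds, whereas the notebook's comparisons are strict.

```python
import numpy as np

def smallest_int_dtype(c_min, c_max):
    # Try the signed integer types from narrowest to widest
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= c_min and c_max <= info.max:
            return candidate
    raise ValueError("range does not fit in a 64-bit signed integer")

print(smallest_int_dtype(0, 100))     # fits in int8
print(smallest_int_dtype(0, 128))     # 128 exceeds int8's maximum of 127
print(smallest_int_dtype(-40000, 5))  # below int16's minimum of -32768
```

The float branch follows the same pattern with `np.finfo`, although fitting within the range of `float16` does not guarantee that the values keep their precision.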

### Study Sources
- Optimizing memory usage in Pandas: [Optimizing Memory Usage in Pandas](https://www.dataquest.io/blog/pandas-big-data/)
- Understanding data types and memory in Pandas: [Pandas Data Types and Memory Usage](https://pbpython.com/pandas_dtypes.html)
- Data type conversion in NumPy: [NumPy Data Types](https://numpy.org/doc/stable/reference/arrays.scalars.html#arrays-scalars-built-in)

In [9]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    
    for col in df.columns:
        col_type = df[col].dtype
        if str(col_type)=="category":
            continue
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            continue
    end_mem = df.memory_usage().sum() / 1024**2    
    return df


 **Definition of Root Directory and Subdirectories**
- `ROOT`: It specifies the root directory path where the dataset is located. The `Path` object is created using the `Path` class from the `pathlib` module.
- `TRAIN_DIR`: It specifies the directory path for training data files. It is derived from the `ROOT` directory by appending the subdirectories "parquet_files" and "train" using the `/` operator.
- `TEST_DIR`: It specifies the directory path for test data files. Similar to `TRAIN_DIR`, it is derived from the `ROOT` directory by appending the subdirectories "parquet_files" and "test" using the `/` operator.
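
As a quick illustration of the `/` operator, the sketch below builds the same paths with `PurePosixPath`, which only manipulates path strings and never touches the filesystem, so it runs anywhere.

```python
from pathlib import PurePosixPath

root = PurePosixPath("/kaggle/input/home-credit-credit-risk-model-stability")
train_dir = root / "parquet_files" / "train"

print(train_dir)
print(train_dir / "train_base.parquet")
```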

### Study Sources
- Working with file paths in Python: [Pathlib Documentation](https://docs.python.org/3/library/pathlib.html)
- Manipulating file paths using Pathlib: [Pathlib Tutorial](https://realpython.com/python-pathlib/)

In [10]:
ROOT            = Path("/kaggle/input/home-credit-credit-risk-model-stability")

TRAIN_DIR       = ROOT / "parquet_files" / "train"
TEST_DIR        = ROOT / "parquet_files" / "test"



**Explanation**

The next cell initializes a dictionary `data_store` containing different sets of DataFrames obtained by reading Parquet files with the `read_file()` and `read_files()` functions.

**1. Data Store Initialization**
- `data_store`: It is a dictionary storing different sets of DataFrames under different keys.
  
**2. Data Read Operations**
- `df_base`: It stores the DataFrame obtained by reading the file "train_base.parquet" located in the `TRAIN_DIR` directory using the `read_file()` function.
- `depth_0`: It stores a list of DataFrames obtained by reading multiple files. The first element is obtained using the `read_file()` function, while the second element is obtained using the `read_files()` function with a wildcard pattern.
- `depth_1`: It stores a list of DataFrames obtained by reading multiple files using the `read_files()` function with specific patterns. Each file is associated with a depth level of 1.
- `depth_2`: It stores a list containing a single DataFrame obtained by reading a specific file associated with a depth level of 2 using the `read_file()` function.

### Study Sources
- Loading data from Parquet files: [Dask Parquet Reader](https://docs.dask.org/en/latest/dataframe-io.html#dask.dataframe.read_parquet)
- Working with dictionaries in Python: [Python Dictionaries](https://realpython.com/python-dicts/)

In [11]:

data_store = {
    "df_base": read_file(TRAIN_DIR / "train_base.parquet"),
    "depth_0": [
        read_file(TRAIN_DIR / "train_static_cb_0.parquet"),
        read_files(TRAIN_DIR / "train_static_0_*.parquet"),
    ],
    "depth_1": [
        read_files(TRAIN_DIR / "train_applprev_1_*.parquet", 1),
        read_file(TRAIN_DIR / "train_tax_registry_a_1.parquet", 1),
        read_file(TRAIN_DIR / "train_tax_registry_b_1.parquet", 1),
        read_file(TRAIN_DIR / "train_tax_registry_c_1.parquet", 1),
        read_files(TRAIN_DIR / "train_credit_bureau_a_1_*.parquet", 1),
        read_file(TRAIN_DIR / "train_credit_bureau_b_1.parquet", 1),
        read_file(TRAIN_DIR / "train_other_1.parquet", 1),
        read_file(TRAIN_DIR / "train_person_1.parquet", 1),
        read_file(TRAIN_DIR / "train_deposit_1.parquet", 1),
        read_file(TRAIN_DIR / "train_debitcard_1.parquet", 1),
    ],
    "depth_2": [
        read_file(TRAIN_DIR / "train_credit_bureau_b_2.parquet", 2),
    ]
}



#  Data Preprocessing and Feature Engineering 🔧

Perform data preprocessing and feature engineering operations on the training data.

**1. Data Preprocessing and Feature Engineering**
- `df_train = feature_eng(**data_store)`: Applies feature engineering to the training data stored in the `data_store` dictionary using the `feature_eng` function. The unpacking operator `**` is used to pass the dictionary as keyword arguments.
- `del data_store`: Deletes the `data_store` dictionary to release memory.
- `gc.collect()`: Manually triggers garbage collection to free up memory space.
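
Unpacking with `**` only works because the dictionary keys match the function's parameter names exactly. A small sketch with a hypothetical stand-in function `describe` and placeholder values:

```python
def describe(df_base, depth_0, depth_1, depth_2):
    # Stand-in for feature_eng: just reports what it received
    tables = len(depth_0) + len(depth_1) + len(depth_2)
    return f"base={df_base}, tables={tables}"

store = {
    "df_base": "base",
    "depth_0": ["a", "b"],
    "depth_1": ["c"],
    "depth_2": [],
}

summary = describe(**store)  # same as describe(df_base="base", depth_0=[...], ...)
print(summary)  # base=base, tables=3
```

A missing or misspelled key would raise a `TypeError`, which makes the naming of the `data_store` keys part of the function's contract.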

 **2. Data Filtering, Conversion, and Memory Reduction**
- `df_train = df_train.pipe(Pipeline.filter_cols)`: Applies column filtering using the `filter_cols` method from the `Pipeline` class to the `df_train` DataFrame using the `pipe` method.
- `df_train, cat_cols = to_pandas(df_train)`: Converts the `df_train` DataFrame to Pandas DataFrame and retrieves the categorical column names. It uses the `to_pandas` function for conversion.
- `df_train = reduce_mem_usage(df_train)`: Reduces memory usage of the `df_train` DataFrame using the `reduce_mem_usage` function to optimize memory consumption.

 **3. Handling Missing Values**
- `nums = df_train.select_dtypes(exclude='category').columns`: Selects numerical columns (excluding categorical columns) from the DataFrame and stores their column names in the `nums` variable.
- `from itertools import combinations, permutations`: Imports the `combinations` and `permutations` functions from the `itertools` module.
- `nans_df = df_train[nums].isna()`: Creates a DataFrame `nans_df` to identify missing values in numerical columns.
- `nans_groups = {}`: Initializes an empty dictionary to store numerical columns grouped by the count of missing values.
- Loops through each numerical column (`col`) and calculates the count of missing values for each column. Then, it groups the columns based on the count of missing values in the `nans_groups` dictionary.
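
The grouping step can be reproduced on a toy frame. This is a sketch with made-up data; it uses `dict.setdefault` rather than `try`/`except`, which is a stylistic choice and not the notebook's exact code.

```python
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": [None, 2.0, None, 4.0],
    "c": [1.0, 2.0, 3.0, None],
})

nan_counts = frame.isna().sum()  # missing values per column

nans_groups = {}
for col, count in nan_counts.items():
    nans_groups.setdefault(int(count), []).append(col)

print(nans_groups)  # {2: ['a', 'b'], 1: ['c']}
```

Columns "a" and "b" land in the same group even though their missing values sit in different rows: the key is only the count, so the grouping is a heuristic for "these columns probably come from the same source".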

 **4. Memory Management**
- `del nans_df`: Deletes the `nans_df` DataFrame to release memory.
- `x = gc.collect()`: Manually triggers garbage collection to free up memory space.

### Study Sources
- Data preprocessing techniques: [Data Preprocessing with Pandas](https://realpython.com/python-data-preprocessing/)
- Feature engineering concepts: [Feature Engineering for Machine Learning](https://www.amazon.com/Feature-Engineering-Machine-Learning-Principles/dp/1491953241)
- Memory management in Python: [Memory Management in Python](https://realpython.com/python-memory-management/)
- Missing data handling: [Handling Missing Data with Pandas](https://pandas.pydata.org/pandas-docs/version/1.3/user_guide/missing_data.html)

In [12]:
df_train = feature_eng(**data_store)
del data_store
gc.collect()
df_train = df_train.pipe(Pipeline.filter_cols)
df_train, cat_cols = to_pandas(df_train)
df_train = reduce_mem_usage(df_train)
nums=df_train.select_dtypes(exclude='category').columns
from itertools import combinations, permutations
nans_df = df_train[nums].isna()
nans_groups={}
for col in nums:
    # Group column names by their number of missing values
    cur_group = nans_df[col].sum()
    nans_groups.setdefault(cur_group, []).append(col)
del nans_df
x = gc.collect()



**Explanation**

Function `reduce_group(grps)` aims to reduce the number of columns within each group by selecting the column with the highest number of unique values. 

**1. `reduce_group(grps)`**
- **Input**:
  - `grps`: List of groups, where each group is represented as a list of column names.
- **Output**: Returns a list of selected columns within each group.
- **Process**:
  - Initializes an empty list `use` to store the selected columns within each group.
  - Iterates through each group `g` in the input list `grps`.
    - Initializes variables `mx` and `vx` to track the maximum number of unique values and the corresponding column name within the group, respectively.
    - Iterates through each column `gg` in the group `g`.
      - Calculates the number of unique values `n` in the column `df_train[gg]`.
      - Updates `mx` and `vx` if `n` is greater than the current maximum number of unique values.
    - Appends the column name `vx` with the highest number of unique values to the `use` list for the current group.
- **Returns** the list `use` containing selected columns within each group.
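
The selection rule can be sketched on a toy frame. The variant below takes the DataFrame as an explicit argument instead of reading the global `df_train`; the name `pick_most_varied` and the data are assumptions made for illustration.

```python
import pandas as pd

def pick_most_varied(frame, groups):
    selected = []
    for group in groups:
        # max() returns the first column with the highest unique count,
        # which matches the strict ">" comparison described above
        best = max(group, key=lambda name: frame[name].nunique())
        selected.append(best)
    return selected

frame = pd.DataFrame({
    "x": [1, 1, 1, 2],  # 2 unique values
    "y": [1, 2, 3, 4],  # 4 unique values
    "z": [0, 0, 1, 1],  # 2 unique values
})

chosen = pick_most_varied(frame, [["x", "y"], ["z"]])
print(chosen)  # ['y', 'z']
```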

### Study Sources
- Working with groups and group operations: [Pandas GroupBy Documentation](https://pandas.pydata.org/pandas-docs/version/1.3/reference/groupby.html)
- Unique values in Pandas Series: [Pandas nunique Documentation](https://pandas.pydata.org/pandas-docs/version/1.3/reference/api/pandas.Series.nunique.html)
- Iterating through lists in Python: [Python List Iteration](https://realpython.com/iterate-through-dictionary-python/)

In [13]:
def reduce_group(grps):
    use = []
    for g in grps:
        mx = 0; vx = g[0]
        for gg in g:
            n = df_train[gg].nunique()
            if n>mx:
                mx = n
                vx = gg
        use.append(vx)
    return use



**Explanation**

The function `group_columns_by_correlation(matrix, threshold=0.8)` groups columns based on their pairwise correlation values.

 **1. `group_columns_by_correlation(matrix, threshold=0.8)`**
- **Inputs**:
  - `matrix`: DataFrame representing the dataset.
  - `threshold`: Threshold value for correlation. Columns with correlation values greater than or equal to this threshold will be grouped together. Default is set to 0.8.
- **Output**: Returns a list of column groups. Each group holds a seed column plus every remaining column whose correlation with that seed is at least the threshold.
- **Process**:
  - Calculates the correlation matrix of the input DataFrame `matrix` using the `.corr()` method.
  - Initializes an empty list `groups` to store the resulting column groups.
  - Initializes a list `remaining_cols` containing all column names from the DataFrame.
  - Loops while `remaining_cols` is not empty:
    - Removes the first column `col` from `remaining_cols` and starts a new group containing it.
    - Initializes a list `correlated_cols` containing `col`.
    - Iterates through each column `c` still in `remaining_cols`:
      - If the correlation between `col` and `c` is greater than or equal to `threshold`, adds `c` to the group and to `correlated_cols`. The signed correlation is compared, so strongly negative correlations are not grouped.
    - Appends the current group to the `groups` list.
    - Rebuilds `remaining_cols` without the columns in `correlated_cols`, so each column ends up in exactly one group.
- **Returns** the list of column groups.

### Study Sources
- Correlation matrix and its computation in Pandas: [Pandas Correlation Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)
- Grouping data based on conditions: [Grouping Data in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)
- Removing elements from a list in Python: [Python List Operations](https://www.programiz.com/python-programming/methods/list/remove)

In [14]:
def group_columns_by_correlation(matrix, threshold=0.8):
    correlation_matrix = matrix.corr()
    groups = []
    remaining_cols = list(matrix.columns)
    while remaining_cols:
        col = remaining_cols.pop(0)
        group = [col]
        correlated_cols = [col]
        for c in remaining_cols:
            if correlation_matrix.loc[col, c] >= threshold:
                group.append(c)
                correlated_cols.append(c)
        groups.append(group)
        remaining_cols = [c for c in remaining_cols if c not in correlated_cols]
    
    return groups
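
The sketch below runs the same greedy grouping on a toy DataFrame to show which columns end up together. The function is repeated under the hypothetical name `group_by_seed_correlation` only so the example is self-contained, and the toy column names are made up.

```python
import pandas as pd

def group_by_seed_correlation(matrix, threshold=0.8):
    """Greedy grouping: each seed column collects the columns correlated with it."""
    corr = matrix.corr()
    groups = []
    remaining = list(matrix.columns)
    while remaining:
        seed = remaining.pop(0)
        group = [seed] + [c for c in remaining if corr.loc[seed, c] >= threshold]
        groups.append(group)
        remaining = [c for c in remaining if c not in group]
    return groups

toy = pd.DataFrame({
    "x":   [1, 2, 3, 4, 5],
    "x2":  [2, 4, 6, 8, 10],   # perfectly correlated with x
    "neg": [5, 4, 3, 2, 1],    # correlation of -1 with x
    "z":   [1, -1, 0, -1, 1],  # uncorrelated with x and neg
})
groups = group_by_seed_correlation(toy)
print(groups)  # [['x', 'x2'], ['neg'], ['z']]
```

Note that `neg` is not grouped with `x` even though it carries the same information, because the comparison uses the signed correlation. Comparing absolute values would be one way to treat inverse relationships as redundant too; that is a design choice, not something the notebook does.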



# 🔄Handling Missing Values and Reducing Columns Based on Correlation

Let's process the `nans_groups` dictionary to reduce the number of columns: within each group of columns that share the same number of missing values, correlated columns are collapsed into a single representative.

### Explanation of Code

1. **Initialization**
   - `uses = []`: Initializes an empty list `uses` to store the final list of selected columns.

2. **Iterate through Groups in `nans_groups`**
   - `for k, v in nans_groups.items()`: Iterates through the `nans_groups` dictionary where `k` is the key (number of missing values) and `v` is the list of column names with that number of missing values.

3. **Processing Each Group**
   - **For Groups with More Than One Column**:
     - `if len(v) > 1`: Checks if the group contains more than one column.
       - `Vs = nans_groups[k]`: Assigns the list of columns `v` to `Vs`.
       - `grps = group_columns_by_correlation(df_train[Vs], threshold=0.8)`: Groups columns in `Vs` based on their correlation using a threshold of 0.8.
       - `use = reduce_group(grps)`: Reduces the groups by selecting columns with the highest number of unique values using the `reduce_group` function.
       - `uses = uses + use`: Appends the selected columns to the `uses` list.
   - **For Groups with a Single Column**:
     - `else`: If the group contains only one column,
       - `uses = uses + v`: Directly appends the column to the `uses` list.

### Study Sources

- **Handling Missing Data in Pandas**: [Pandas Documentation on Handling Missing Data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
- **Correlation in Pandas**: [Pandas DataFrame.corr() Method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html)
- **Grouping Columns by Correlation**: [Correlation and Grouping in Data Analysis](https://machinelearningmastery.com/feature-selection-with-correlation-threshold/)
- **Iterating and Modifying Lists in Python**: [Python List Operations](https://docs.python.org/3/tutorial/datastructures.html)
- **Memory Management in Python**: [Python Memory Management](https://realpython.com/python-memory-management/)



In [15]:
uses = []
for k, v in nans_groups.items():
    if len(v) > 1:
        Vs = nans_groups[k]
        grps = group_columns_by_correlation(df_train[Vs], threshold=0.8)
        use = reduce_group(grps)
        uses = uses + use
    else:
        uses = uses + v

# Subset the DataFrame to keep only the selected columns
df_train = df_train[uses]        
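
How `nans_groups` is built is only partly visible in this section. As an assumption based on the description above (keys are missing-value counts, values are lists of column names), the grouping could be sketched like this. `group_by_nan_count` is a hypothetical helper, not the notebook's code.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

def group_by_nan_count(df):
    """Map each missing-value count to the columns that have exactly that count."""
    groups = defaultdict(list)
    for col in df.columns:
        groups[int(df[col].isna().sum())].append(col)
    return dict(groups)

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],  # 1 missing value
    "b": [2.0, np.nan, 6.0],  # 1 missing value, same rows as "a"
    "c": [1.0, 2.0, 3.0],     # no missing values
})
nan_groups = group_by_nan_count(toy)
print(nan_groups)  # {1: ['a', 'b'], 0: ['c']}
```

Columns that share a missing-value count often come from the same source table, which is presumably why they are good candidates for the correlation check.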

# 📊🚀 Data Preparation for Test Set 🚀📊

Let's prepare the `data_store` dictionary for the test set by reading the required Parquet files. The structure and logic mirror those used for the training set, which keeps preprocessing consistent.

### Explanation of Code

1. **Reading Base and Depth Data for Test Set**:
    - **Base Data**:
      - `df_base`: Reads the base data from the Parquet file located at `TEST_DIR / "test_base.parquet"`.
    - **Depth 0 Data**:
      - `depth_0`: Reads the static credit bureau data from individual and wildcard Parquet files located at `TEST_DIR / "test_static_cb_0.parquet"` and `TEST_DIR / "test_static_0_*.parquet"`.
    - **Depth 1 Data**:
      - `depth_1`: Reads various related data files such as application previous, tax registry, credit bureau, and other related data files, all with depth 1, from their respective Parquet files.
    - **Depth 2 Data**:
      - `depth_2`: Reads the credit bureau data with depth 2 from the Parquet file located at `TEST_DIR / "test_credit_bureau_b_2.parquet"`.


### Study Sources

1. **Reading Parquet Files**:
   - [Pandas Documentation on Reading Parquet Files](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html)
   - [Polars Documentation on Reading Parquet Files](https://pola-rs.github.io/polars/py-polars/html/reference/io.html#polars.read_parquet)
2. **File Handling and Wildcards**:
   - [Python glob Module](https://docs.python.org/3/library/glob.html)
3. **Data Preprocessing Techniques**:
   - [Pandas User Guide on Data Preprocessing](https://pandas.pydata.org/pandas-docs/stable/user_guide/preprocessing.html)
   - [Polars User Guide on DataFrame Operations](https://pola-rs.github.io/polars/py-polars/html/user-guide/index.html)

### Explanation of Concepts

- **Parquet Files**: A columnar storage file format optimized for use with large-scale data processing frameworks.
- **Data Preprocessing**: The process of transforming raw data into an understandable format. Includes reading data, handling missing values, and selecting relevant features.
- **Wildcards in File Paths**: Used to specify patterns in file names. For example, `test_static_0_*.parquet` matches all files starting with `test_static_0_` and ending with `.parquet`.
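
To make the wildcard behaviour concrete, the following sketch creates a few empty files in a temporary directory and expands a pattern with `glob`. The directory and the empty files are stand-ins for illustration, not the competition data.

```python
import tempfile
from glob import glob
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    test_dir = Path(tmp)
    for name in [
        "test_static_0_0.parquet",
        "test_static_0_1.parquet",
        "test_static_cb_0.parquet",
    ]:
        (test_dir / name).touch()  # empty placeholder files

    # glob returns paths in arbitrary order, so sort for a stable result
    pattern = str(test_dir / "test_static_0_*.parquet")
    matched = sorted(Path(p).name for p in glob(pattern))

print(matched)  # ['test_static_0_0.parquet', 'test_static_0_1.parquet']
```

`test_static_cb_0.parquet` is not matched, which is why the notebook reads it with a separate `read_file` call.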



In [16]:
data_store = {
    "df_base": read_file(TEST_DIR / "test_base.parquet"),
    "depth_0": [
        read_file(TEST_DIR / "test_static_cb_0.parquet"),
        read_files(TEST_DIR / "test_static_0_*.parquet"),
    ],
    "depth_1": [
        read_files(TEST_DIR / "test_applprev_1_*.parquet", 1),
        read_file(TEST_DIR / "test_tax_registry_a_1.parquet", 1),
        read_file(TEST_DIR / "test_tax_registry_b_1.parquet", 1),
        read_file(TEST_DIR / "test_tax_registry_c_1.parquet", 1),
        read_files(TEST_DIR / "test_credit_bureau_a_1_*.parquet", 1),
        read_file(TEST_DIR / "test_credit_bureau_b_1.parquet", 1),
        read_file(TEST_DIR / "test_other_1.parquet", 1),
        read_file(TEST_DIR / "test_person_1.parquet", 1),
        read_file(TEST_DIR / "test_deposit_1.parquet", 1),
        read_file(TEST_DIR / "test_debitcard_1.parquet", 1),
    ],
    "depth_2": [
        read_file(TEST_DIR / "test_credit_bureau_b_2.parquet", 2),
    ]
}



# Feature Engineering and Data Preparation for the Test Set 🚀

This step performs feature engineering and data preparation on the test set. It follows the same steps as for the training set, which keeps the two sets consistent.

### Explanation of Code

1. **Feature Engineering**:
    - `df_test = feature_eng(**data_store)`: Applies the `feature_eng` function to the test data stored in `data_store`. This function performs various feature engineering steps, such as creating new features and joining different depth data based on `case_id`.

2. **Memory Management**:
    - `del data_store`: Deletes the `data_store` dictionary to free up memory.
    - `gc.collect()`: Calls the garbage collector to release any unreferenced memory.

3. **Selecting Relevant Columns**:
    - `df_test = df_test.select([col for col in df_train.columns if col != "target"])`: Selects columns in `df_test` that match the columns in `df_train`, excluding the "target" column. This ensures that the test set has the same features as the training set.

4. **Conversion to Pandas DataFrame and Category Data Type**:
    - `df_test, cat_cols = to_pandas(df_test)`: Converts the `df_test` Polars DataFrame to a Pandas DataFrame. Because no `cat_cols` argument is passed, every `object` column is converted to the "category" data type, and the list of those columns is returned as `cat_cols`.

5. **Memory Usage Reduction**:
    - `df_test = reduce_mem_usage(df_test)`: Applies the `reduce_mem_usage` function to reduce the memory footprint of the Pandas DataFrame by converting columns to more efficient data types.

6. **Final Memory Management**:
    - `gc.collect()`: Calls the garbage collector again to release any unreferenced memory after data processing.

### Study Sources

1. **Feature Engineering**:
   - [Feature Engineering for Machine Learning](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)
2. **Memory Management**:
   - [Python Memory Management](https://realpython.com/python-memory-management/)
   - [Garbage Collection in Python](https://docs.python.org/3/library/gc.html)
3. **DataFrame Selection**:
   - [Pandas DataFrame Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
4. **Data Type Conversion in Pandas**:
   - [Pandas Data Types and Memory Usage](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes)

### Explanation of Concepts

- **Feature Engineering**: The process of using domain knowledge to create features (variables) that make machine learning algorithms work. It includes creating new features, transforming existing ones, and joining data from different sources.
- **Garbage Collection**: A form of automatic memory management that reclaims memory occupied by objects no longer in use by the program.
- **Data Type Conversion**: Changing the data type of columns to more efficient types (e.g., from `float64` to `float32`) to save memory and improve performance.
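
Selecting the training columns from the test frame only works if every one of them exists there, so it can help to compare the column lists first. This is a minimal sketch with plain Python lists; `missing_columns` and the column names are made up for illustration.

```python
def missing_columns(train_cols, test_cols, ignore=("target",)):
    """Return training columns (minus ignored ones) that the test set lacks."""
    test_set = set(test_cols)
    return [c for c in train_cols if c not in ignore and c not in test_set]

train_cols = ["case_id", "WEEK_NUM", "target", "feat_a", "feat_b"]
test_cols = ["case_id", "WEEK_NUM", "feat_a"]

missing = missing_columns(train_cols, test_cols)
print(missing)  # ['feat_b']
```

An empty result means the column selection can proceed safely.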


In [17]:
df_test = feature_eng(**data_store)
del data_store
gc.collect()
df_test = df_test.select([col for col in df_train.columns if col != "target"])
df_test, cat_cols = to_pandas(df_test)
df_test = reduce_mem_usage(df_test)
gc.collect()


0



### Explanation of Code

1. **Adding Target Column to Training Set**:
    - `df_train['target'] = 0`: Sets the "target" column of `df_train` to 0 for all rows, replacing any values the column already held. From here on, 0 marks a row as belonging to the training set.

2. **Adding Target Column to Test Set**:
    - `df_test['target'] = 1`: Adds a column named "target" to the `df_test` DataFrame and sets its value to 1 for all rows. This indicates that these rows belong to the test set.

### Study Sources

1. **Adding Columns in Pandas**:
   - [Pandas DataFrame Adding Columns](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#adding-columns)
2. **Combining Train and Test Data for Preprocessing**:
   - [Combining Datasets for Preprocessing](https://machinelearningmastery.com/combining-train-and-test-datasets-for-preparing-machine-learning-data/)



In [18]:
df_train['target'] = 0
df_test['target'] = 1


# Combining and Preparing Data for Modeling 🚀📊

Let's combine the training and test datasets, optimize memory usage, prepare the features and target for modeling, and save the prepared data to a file using `joblib`.

### Explanation of Code

1. **Combining Train and Test Data**:
    - `df_train = pd.concat([df_train, df_test])`: Concatenates the training and test datasets along the rows. This step combines the datasets into one for uniform preprocessing.

2. **Reducing Memory Usage**:
    - `df_train = reduce_mem_usage(df_train)`: Applies the `reduce_mem_usage` function to the combined DataFrame to optimize its memory usage by converting columns to more efficient data types.

3. **Preparing Target Variable**:
    - `y = df_train["target"]`: Extracts the "target" column from the combined DataFrame and stores it in the variable `y`. This will be used as the target variable for modeling.

4. **Dropping Unnecessary Columns**:
    - `df_train = df_train.drop(columns=["target", "case_id", "WEEK_NUM"])`: Drops the "target", "case_id", and "WEEK_NUM" columns from the combined DataFrame. The "case_id" and "WEEK_NUM" columns are likely identifiers and not useful for modeling.

5. **Saving the Prepared Data**:
    - `joblib.dump((df_train, y, df_test), 'data.pkl')`: Uses `joblib` to save the prepared features (`df_train`), target (`y`), and test set (`df_test`) to a file named `data.pkl`. This serialized file can be loaded later for model training and evaluation.

### Study Sources

1. **Pandas Concatenation**:
   - [Pandas Concatenation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)
2. **Memory Optimization in Pandas**:
   - [Pandas Memory Usage](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#scaling-to-large-datasets)
3. **Joblib for Serialization**:
   - [Joblib Documentation](https://joblib.readthedocs.io/en/latest/)
4. **Data Preparation for Machine Learning**:
   - [Data Preparation Techniques](https://machinelearningmastery.com/data-preparation-techniques-for-machine-learning/)

### Explanation of Concepts

- **Data Concatenation**: Combining multiple DataFrames along a particular axis (rows or columns).
- **Memory Optimization**: Techniques to reduce the memory footprint of data structures, crucial for handling large datasets efficiently.
- **Target Variable**: The variable that a model aims to predict. Here, it distinguishes between training (0) and test (1) data.
- **Serialization**: The process of converting a data structure into a format that can be easily saved to disk and later restored.
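
Two side effects of row-wise concatenation are easy to miss, so here is a small pandas sketch on toy frames (the column names and values are invented). It shows that the original index labels are kept, and that a categorical column stops being categorical when the two frames have different categories.

```python
import pandas as pd

train = pd.DataFrame({"case_id": [1, 2], "cat": pd.Categorical(["a", "b"])})
test = pd.DataFrame({"case_id": [3], "cat": pd.Categorical(["c"])})
train["target"] = 0  # 0 marks training rows
test["target"] = 1   # 1 marks test rows

combined = pd.concat([train, test])
y = combined["target"]
X = combined.drop(columns=["target", "case_id"])

print(list(combined.index))  # [0, 1, 0] -> index labels repeat
print(list(y))               # [0, 0, 1]
# The categories differ between the frames, so the result is not categorical
print(isinstance(X["cat"].dtype, pd.CategoricalDtype))  # False
```

If unique index labels matter, `pd.concat(..., ignore_index=True)` is one option, and categorical columns may need to be converted again after combining.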



In [19]:
df_train = pd.concat([df_train, df_test])
df_train = reduce_mem_usage(df_train)

y = df_train["target"]
df_train = df_train.drop(columns=["target", "case_id", "WEEK_NUM"])

joblib.dump((df_train, y, df_test), 'data.pkl')

['data.pkl']

---

# Data Preprocessing with Pipeline Class 🚀

Define a `Pipeline` class with methods to preprocess a DataFrame. The class includes methods for setting data types, handling date columns, and filtering columns based on certain criteria.

### Explanation of Code

1. **set_table_dtypes(df)**:
    - This method sets the appropriate data types for the columns in the DataFrame.
    - **Int64**: Converts specified columns to 64-bit integers.
    - **Date**: Converts specified columns to date type.
    - **Float64**: Converts specified columns to 64-bit floating-point numbers.
    - **String**: Converts specified columns to string type.
  
2. **handle_dates(df)**:
    - This method handles date columns by calculating the difference in days between date columns ending with "D" and a reference date column "date_decision".
    - It drops the "date_decision" and "MONTH" columns after the calculations.

3. **filter_cols(df)**:
    - This method filters out columns based on missing values and unique values.
    - Columns with more than 70% missing values are dropped.
    - String columns with either only one unique value or more than 200 unique values are dropped as they are likely not useful for modeling.
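
The date handling can be pictured with a pandas analogue. The notebook itself does this in Polars, and `birth_D` is an invented column name. Each date column becomes a signed number of days relative to `date_decision`.

```python
import pandas as pd

df = pd.DataFrame({
    "date_decision": pd.to_datetime(["2020-01-10", "2020-03-01"]),
    "birth_D": pd.to_datetime(["2020-01-01", None]),  # second value is missing
})

# Offset from the decision date, in whole days; missing dates stay missing
days = (df["birth_D"] - df["date_decision"]).dt.days
print(days.tolist())  # [-9.0, nan]
```

A negative value means the event happened before the decision date.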

### Study Sources

1. **Polars Documentation**:
   - [Polars User Guide](https://pola-rs.github.io/polars/py-polars/html/index.html)
2. **Handling Missing Data**:
   - [Dealing with Missing Data in Machine Learning](https://machinelearningmastery.com/handle-missing-data-python/)
3. **Data Types in Python**:
   - [Python Data Types](https://docs.python.org/3/library/datatypes.html)
4. **Date and Time Handling**:
   - [Working with Dates and Times in Python](https://realpython.com/python-datetime/)



In [20]:

class Pipeline:

    def set_table_dtypes(df):
        for col in df.columns:
            if col in ["case_id", "WEEK_NUM", "num_group1", "num_group2"]:
                df = df.with_columns(pl.col(col).cast(pl.Int64))
            elif col in ["date_decision"]:
                df = df.with_columns(pl.col(col).cast(pl.Date))
            elif col[-1] in ("P", "A"):
                df = df.with_columns(pl.col(col).cast(pl.Float64))
            elif col[-1] in ("M",):
                df = df.with_columns(pl.col(col).cast(pl.String))
            elif col[-1] in ("D",):
                df = df.with_columns(pl.col(col).cast(pl.Date))
        return df

    def handle_dates(df):
        for col in df.columns:
            if col[-1] in ("D",):
                # Replace each date column with its offset from the decision date...
                df = df.with_columns(pl.col(col) - pl.col("date_decision"))
                # ...expressed as a whole number of days.
                df = df.with_columns(pl.col(col).dt.total_days())
        df = df.drop("date_decision", "MONTH")
        return df

    def filter_cols(df):
        for col in df.columns:
            if col not in ["target", "case_id", "WEEK_NUM"]:
                isnull = df[col].is_null().mean()
                if isnull > 0.7:
                    df = df.drop(col)
        
        for col in df.columns:
            if col not in ["target", "case_id", "WEEK_NUM"] and df[col].dtype == pl.String:
                freq = df[col].n_unique()
                if freq == 1 or freq > 200:
                    df = df.drop(col)
        
        return df
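
As a rough pandas analogue of the filtering rule (the notebook works on Polars frames, and `filter_cols_pd` plus the toy columns are hypothetical), the sketch below drops columns that are mostly missing and string columns that are constant or have too many distinct values.

```python
import numpy as np
import pandas as pd

def filter_cols_pd(df, keep=("target", "case_id", "WEEK_NUM"),
                   max_null=0.7, max_card=200):
    """Drop mostly-null columns and uninformative string (object) columns."""
    to_drop = []
    for col in df.columns:
        if col in keep:
            continue
        if df[col].isna().mean() > max_null:
            to_drop.append(col)
        elif pd.api.types.is_object_dtype(df[col]):
            n = df[col].nunique()
            if n == 1 or n > max_card:
                to_drop.append(col)
    return df.drop(columns=to_drop)

toy = pd.DataFrame({
    "case_id": [1, 2, 3, 4],
    "mostly_null": [np.nan, np.nan, np.nan, 1.0],                # 75% missing
    "constant_M": pd.Series(["x", "x", "x", "x"], dtype=object),  # 1 unique value
    "useful_M": pd.Series(["a", "b", "a", "b"], dtype=object),    # 2 unique values
})
kept = list(filter_cols_pd(toy).columns)
print(kept)  # ['case_id', 'useful_M']
```

The thresholds are exposed as arguments here only to make experimentation easier; the notebook hard-codes 0.7 and 200.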




# Feature Aggregation with Aggregator Class 🔧

Define an `Aggregator` class that builds aggregation expressions for a DataFrame. Aggregation summarizes the rows that belong to one `case_id` into a single row of features for the machine learning models.

### Explanation of Code

1. **num_expr(df)**:
    - This method aggregates numerical columns ending with "P" or "A".
    - **Max**: Maximum value of each column.
    - **Last**: Last value of each column.
    - **Mean**: Average value of each column.
    - **Median** and **Variance** are also defined but not used in this implementation.

2. **date_expr(df)**:
    - This method aggregates date columns ending with "D".
    - **Max**: Latest date in each column.
    - **Last**: Last date in each column.
    - **Mean**: Average date in each column.

3. **str_expr(df)**:
    - This method aggregates string columns ending with "M".
    - **Max**: Lexicographically last string in each column.
    - **Last**: Last string value in each column.

4. **other_expr(df)**:
    - This method aggregates other columns ending with "T" or "L".
    - **Max**: Maximum value in each column.
    - **Last**: Last value in each column.

5. **count_expr(df)**:
    - This method aggregates columns containing "num_group".
    - **Max**: Maximum value in each column.
    - **Last**: Last value in each column.

6. **get_exprs(df)**:
    - This method combines all the aggregation expressions from the above methods.

### Study Sources

1. **Polars Documentation**:
   - [Polars User Guide](https://pola-rs.github.io/polars/py-polars/html/index.html)
2. **Feature Engineering**:
   - [Feature Engineering for Machine Learning](https://www.udacity.com/course/feature-engineering-for-machine-learning--nd025)
3. **Data Aggregation and Group Operations**:
   - [Pandas GroupBy: Your Guide to Grouping Data in Python](https://realpython.com/pandas-groupby/)
4. **Correlation in Python**:
   - [How to Calculate Correlation in Python](https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/)

### Explanation of Concepts

- **Aggregation Functions**: These functions summarize data by calculating statistical measures like max, mean, and last value, which can provide meaningful insights for machine learning models.
- **Feature Engineering**: The process of using domain knowledge to create features that make machine learning algorithms work better.
- **Data Types and Polars**: Understanding and using correct data types and efficient data manipulation libraries like Polars is crucial for handling large datasets effectively.
- **Correlation**: Measures the relationship between two variables, helping in feature selection by identifying highly correlated features that may provide redundant information.
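
The effect of the max, last, and mean aggregations can be shown with a pandas analogue. The notebook builds Polars expressions instead, and `amount_A` is an invented column name. Several rows per `case_id` collapse into one row of features.

```python
import pandas as pd

rows = pd.DataFrame({
    "case_id": [1, 1, 1, 2],
    "amount_A": [10.0, 30.0, 20.0, 5.0],
})

agg = rows.groupby("case_id").agg(
    max_amount_A=("amount_A", "max"),
    last_amount_A=("amount_A", "last"),
    mean_amount_A=("amount_A", "mean"),
)
print(agg.loc[1].tolist())  # [30.0, 20.0, 20.0]
print(agg.loc[2].tolist())  # [5.0, 5.0, 5.0]
```

Each aggregation adds one output column per input column, which is why the class comment warns that enabling more aggregations increases memory use.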



In [21]:
class Aggregator:
    # Please add or subtract features yourself, be aware that too many features will take up too much space.
    def num_expr(df):
        cols = [col for col in df.columns if col[-1] in ("P", "A")]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]

        expr_last = [pl.last(col).alias(f"last_{col}") for col in cols]
        # expr_first = [pl.first(col).alias(f"first_{col}") for col in cols]
        expr_mean = [pl.mean(col).alias(f"mean_{col}") for col in cols]
        expr_median = [pl.median(col).alias(f"median_{col}") for col in cols]
        expr_var = [pl.var(col).alias(f"var_{col}") for col in cols]

        return expr_max + expr_last + expr_mean 

    def date_expr(df):
        cols = [col for col in df.columns if col[-1] in ("D",)]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        # expr_min = [pl.min(col).alias(f"min_{col}") for col in cols]
        expr_last = [pl.last(col).alias(f"last_{col}") for col in cols]
        # expr_first = [pl.first(col).alias(f"first_{col}") for col in cols]
        expr_mean = [pl.mean(col).alias(f"mean_{col}") for col in cols]
        expr_median = [pl.median(col).alias(f"median_{col}") for col in cols]

        return expr_max + expr_last + expr_mean 

    def str_expr(df):
        cols = [col for col in df.columns if col[-1] in ("M",)]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        # expr_min = [pl.min(col).alias(f"min_{col}") for col in cols]
        expr_last = [pl.last(col).alias(f"last_{col}") for col in cols]
        # expr_first = [pl.first(col).alias(f"first_{col}") for col in cols]
        # expr_count = [pl.count(col).alias(f"count_{col}") for col in cols]
        return expr_max + expr_last  # +expr_count

    def other_expr(df):
        cols = [col for col in df.columns if col[-1] in ("T", "L")]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        # expr_min = [pl.min(col).alias(f"min_{col}") for col in cols]
        expr_last = [pl.last(col).alias(f"last_{col}") for col in cols]
        # expr_first = [pl.first(col).alias(f"first_{col}") for col in cols]
        return expr_max + expr_last

    def count_expr(df):
        cols = [col for col in df.columns if "num_group" in col]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        # expr_min = [pl.min(col).alias(f"min_{col}") for col in cols]
        expr_last = [pl.last(col).alias(f"last_{col}") for col in cols]
        # expr_first = [pl.first(col).alias(f"first_{col}") for col in cols]
        return expr_max + expr_last

    def get_exprs(df):
        exprs = Aggregator.num_expr(df) + \
                Aggregator.date_expr(df) + \
                Aggregator.str_expr(df) + \
                Aggregator.other_expr(df) + \
                Aggregator.count_expr(df)

        return exprs


# Data Preparation Functions

These functions prepare and process data for our machine learning pipeline: reading data from files, performing feature engineering, and converting data formats.

### Explanation of Code

1. **read_file(path, depth=None)**:
    - **Purpose**: Reads a Parquet file, sets appropriate data types, and optionally performs aggregation based on `depth`.
    - **Parameters**:
        - `path`: File path to the Parquet file.
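        - `depth`: Controls whether aggregation happens. With `1` or `2` the data is aggregated per `case_id` (both values behave the same here); with the default `None` the rows are returned as read.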
    - **Process**:
        - Reads the Parquet file into a DataFrame.
        - Sets the appropriate data types using a pipeline.
        - If `depth` is 1 or 2, groups the data by `case_id` and aggregates features using the `Aggregator` class.

2. **read_files(regex_path, depth=None)**:
    - **Purpose**: Reads multiple Parquet files matching a glob (wildcard) pattern, sets appropriate data types, and optionally performs aggregation.
    - **Parameters**:
        - `regex_path`: Despite its name, a glob pattern (for example `test_static_0_*.parquet`), not a regular expression; it is expanded with `glob`.
        - `depth`: Controls whether aggregation happens (`1` or `2` aggregate per `case_id`; `None` does not).
    - **Process**:
        - Reads each Parquet file matching the pattern into a DataFrame.
        - Sets the appropriate data types using a pipeline.
        - If `depth` is 1 or 2, groups the data by `case_id` and aggregates features.
        - Concatenates the DataFrames vertically and keeps a single row per `case_id`.

3. **feature_eng(df_base, depth_0, depth_1, depth_2)**:
    - **Purpose**: Performs feature engineering by adding date-related features and joining additional data based on `case_id`.
    - **Parameters**:
        - `df_base`: Base DataFrame.
        - `depth_0`, `depth_1`, `depth_2`: Lists of DataFrames at different depths.
    - **Process**:
        - Adds `month_decision` and `weekday_decision` features based on `date_decision`.
        - Joins additional DataFrames from `depth_0`, `depth_1`, and `depth_2` to the base DataFrame.
        - Handles date columns using a pipeline.

4. **to_pandas(df_data, cat_cols=None)**:
    - **Purpose**: Converts a Polars DataFrame to a Pandas DataFrame and sets categorical data types.
    - **Parameters**:
        - `df_data`: Polars DataFrame to be converted.
        - `cat_cols`: List of columns to be converted to categorical type.
    - **Process**:
        - Converts the Polars DataFrame to a Pandas DataFrame.
        - If `cat_cols` is not provided, identifies columns of object type and converts them to categorical.
        - Returns the converted DataFrame and the list of categorical columns.

### Study Sources

1. **Polars Documentation**:
   - [Polars User Guide](https://pola-rs.github.io/polars/py-polars/html/index.html)
2. **Pandas Documentation**:
   - [Pandas User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)
3. **Feature Engineering for Machine Learning**:
   - [Feature Engineering for Machine Learning](https://www.udacity.com/course/feature-engineering-for-machine-learning--nd025)
4. **Data Aggregation and Group Operations in Pandas**:
   - [Pandas GroupBy: Your Guide to Grouping Data in Python](https://realpython.com/pandas-groupby/)


In [22]:
def read_file(path, depth=None):
    df = pl.read_parquet(path)
    df = df.pipe(Pipeline.set_table_dtypes)
    if depth in [1,2]:
        df = df.group_by("case_id").agg(Aggregator.get_exprs(df)) 
    return df

def read_files(regex_path, depth=None):
    chunks = []
    
    for path in glob(str(regex_path)):
        df = pl.read_parquet(path)
        df = df.pipe(Pipeline.set_table_dtypes)
        if depth in [1, 2]:
            df = df.group_by("case_id").agg(Aggregator.get_exprs(df))
        chunks.append(df)
    
    df = pl.concat(chunks, how="vertical_relaxed")
    df = df.unique(subset=["case_id"])
    return df


def feature_eng(df_base, depth_0, depth_1, depth_2):
    df_base = (
        df_base
        .with_columns(
            month_decision = pl.col("date_decision").dt.month(),
            weekday_decision = pl.col("date_decision").dt.weekday(),
        )
    )
    for i, df in enumerate(depth_0 + depth_1 + depth_2):
        df_base = df_base.join(df, how="left", on="case_id", suffix=f"_{i}")
    df_base = df_base.pipe(Pipeline.handle_dates)
    return df_base

def to_pandas(df_data, cat_cols=None):
    df_data = df_data.to_pandas()
    if cat_cols is None:
        cat_cols = list(df_data.select_dtypes("object").columns)
    df_data[cat_cols] = df_data[cat_cols].astype("category")
    return df_data, cat_cols
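
The optional `cat_cols` argument makes it possible to reuse the column list found on one frame when converting another. The sketch below shows that pattern on toy pandas frames; `to_category` is a hypothetical pandas-only stand-in for the conversion step, and the column names are invented.

```python
import pandas as pd

def to_category(df, cat_cols=None):
    """Convert object columns (or the given columns) to the category dtype."""
    if cat_cols is None:
        cat_cols = list(df.select_dtypes("object").columns)
    for col in cat_cols:
        df[col] = df[col].astype("category")
    return df, cat_cols

train = pd.DataFrame({"city": pd.Series(["a", "b"], dtype=object), "x": [1, 2]})
test = pd.DataFrame({"city": pd.Series(["b", "c"], dtype=object), "x": [3, 4]})

train, cat_cols = to_category(train)   # columns detected from the training frame
test, _ = to_category(test, cat_cols)  # same columns reused for the test frame

print(cat_cols)  # ['city']
print(isinstance(test["city"].dtype, pd.CategoricalDtype))  # True
```

Passing the list explicitly keeps the set of categorical columns identical across frames, even if a column in one frame was inferred with a different dtype.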




# 🛠️ Memory Optimization Function

The `reduce_mem_usage(df)` function takes a DataFrame `df` and iterates through all its columns, adjusting their data types to reduce the DataFrame's memory usage.

**Explanation:**


- The function starts by calculating the initial memory usage of the DataFrame `df`.
- It then iterates through each column of the DataFrame.
- For each column, it checks the data type.
- If the column is categorical or of `object` type, it is left unchanged.
- For the remaining (numeric) columns, it finds the minimum and maximum values.
- Based on that range, it casts the column to the smallest integer or floating-point type whose limits contain both values. The check looks only at the range, so integer columns keep their exact values, while floating-point columns cast to `float16` or `float32` can lose precision.
- Finally, it calculates the memory usage after optimization and prints out the reduction percentage.
- The function returns the optimized DataFrame.

**Study Sources:**
1. Pandas Documentation: https://pandas.pydata.org/docs/
2. NumPy Documentation: https://numpy.org/doc/stable/
3. Data Type Objects (dtype) - NumPy Documentation: https://numpy.org/doc/stable/reference/arrays.dtypes.html
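
The range checks rely on `np.iinfo` and `np.finfo`. The NumPy-only sketch below picks the smallest integer type for a value range and shows why a range check alone does not protect floating-point precision. `smallest_int_dtype` is a hypothetical helper, and unlike the notebook's strict comparisons it treats the type limits as inclusive.

```python
import numpy as np

def smallest_int_dtype(lo, hi):
    """Return the narrowest signed integer dtype that can hold [lo, hi]."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return dtype
    raise OverflowError("range does not fit in int64")

print(smallest_int_dtype(0, 127).__name__)     # int8
print(smallest_int_dtype(-129, 0).__name__)    # int16

# float16 covers a wide range but has few significant bits
print(float(np.finfo(np.float16).max))         # 65504.0
print(float(np.float16(2049.0)))               # 2048.0 -> value inside the range, still rounded
```

If exact values matter for a feature, keeping it at `float32` or wider is the safer choice.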

In [23]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        if str(col_type)=="category":
            continue
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            continue
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df


# Model Information Retrieval 

**Explanation:**
- The code loads information about a specific LightGBM model from a file named 'notebook_info.joblib'.
- It prints out the start time of the notebook that created the models and a brief description of the notebook.
- The code then retrieves details about the columns and categorical columns used in the models.
- It prints out the number of columns and categorical columns.
- Next, it loads the LightGBM models from a file named 'lgb_models.joblib'.
- Finally, it displays the loaded LightGBM models.

- Similarly, the second half of the cell loads metadata about the CatBoost (cat) models and prints the start time and description of the notebook that created them.
- It then loads the CatBoost models from a file named 'cat_models.joblib'.
- Finally, it displays the loaded CatBoost models, a list of five `CatBoostClassifier` objects.
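
To make the save/load round trip concrete, here is a minimal sketch using `joblib.dump` and `joblib.load` on a small dictionary. The keys mirror the ones read in the cell below, but the values and the temporary path are invented for illustration.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import joblib

# Hypothetical metadata; only the key names are taken from the notebook.
notebook_info = {
    "notebook_start_time": datetime(2024, 1, 1, 12, 0, 0),
    "description": "toy example",
    "cols": ["feat_a", "feat_b", "feat_c"],
    "cat_cols": ["feat_c"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notebook_info.joblib"
    joblib.dump(notebook_info, path)   # serialize the dict to disk
    restored = joblib.load(path)       # read it back into memory

print(restored["description"], len(restored["cols"]), len(restored["cat_cols"]))
```

Storing `cols` and `cat_cols` next to the models lets the inference notebook rebuild exactly the feature set the models were trained on.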

**Study Sources:**
1. Joblib Documentation: https://joblib.readthedocs.io/en/latest/
2. LightGBM Documentation: https://lightgbm.readthedocs.io/en/latest/
3. Categorical Features in Machine Learning: https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63

In [24]:
lgb_notebook_info = joblib.load('/kaggle/input/homecredit-models-public/other/lgb/1/notebook_info.joblib')
print(f"- [lgb] notebook_start_time: {lgb_notebook_info['notebook_start_time']}")
print(f"- [lgb] description: {lgb_notebook_info['description']}")

cols = lgb_notebook_info['cols']
cat_cols = lgb_notebook_info['cat_cols']
print(f"- [lgb] len(cols): {len(cols)}")
print(f"- [lgb] len(cat_cols): {len(cat_cols)}")

lgb_models = joblib.load('/kaggle/input/homecredit-models-public/other/lgb/1/lgb_models.joblib')
lgb_models

cat_notebook_info = joblib.load('/kaggle/input/homecredit-models-public/other/cat/1/notebook_info.joblib')
print(f"- [cat] notebook_start_time: {cat_notebook_info['notebook_start_time']}")
print(f"- [cat] description: {cat_notebook_info['description']}")

cat_models = joblib.load('/kaggle/input/homecredit-models-public/other/cat/1/cat_models.joblib')
cat_models


- [lgb] notebook_start_time: 2024-04-17 17:19:35.710340
- [lgb] description: Add notebook info dict to store cols and cat_cols
- [lgb] len(cols): 386
- [lgb] len(cat_cols): 113
- [cat] notebook_start_time: 2024-04-18 00:37:32.864485
- [cat] description: first cat models


[<catboost.core.CatBoostClassifier at 0x7873426f3f70>,
 <catboost.core.CatBoostClassifier at 0x787360c30100>,
 <catboost.core.CatBoostClassifier at 0x78735a6a5d20>,
 <catboost.core.CatBoostClassifier at 0x78735a6c26e0>,
 <catboost.core.CatBoostClassifier at 0x78733eb01210>]

# 📂🔍 Data Loading and Storage Configuration


**Explanation:**

- Sets up the paths for reading the test data of the credit risk competition.
- It defines `ROOT` as the dataset root directory and `TEST_DIR` as the directory containing the test data in parquet format.
- The `data_store` dictionary collects the loaded tables.
- The `df_base` key holds the base table; the `depth_0`, `depth_1` and `depth_2` keys each hold a list of tables for that depth level.
- Single files are read with `read_file` and wildcard patterns (`*`) with `read_files`, both defined earlier in the notebook. For depth 1 and depth 2 tables the depth is passed as a second argument.
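
How `read_files` resolves a wildcard path is not shown in this section, so the helper below is a hypothetical stand-in rather than the notebook's implementation. It sketches one common way to expand a pattern such as `test_static_0_*.parquet` with `pathlib`, using empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path
from typing import List


def expand_pattern(pattern: Path) -> List[Path]:
    """Hypothetical helper: return the sorted files matching a wildcard path."""
    return sorted(pattern.parent.glob(pattern.name))


with tempfile.TemporaryDirectory() as tmp:
    test_dir = Path(tmp) / "parquet_files" / "test"
    test_dir.mkdir(parents=True)
    # Empty placeholder files standing in for real parquet shards.
    for name in ["test_static_0_0.parquet",
                 "test_static_0_1.parquet",
                 "test_static_cb_0.parquet"]:
        (test_dir / name).touch()

    matches = [p.name for p in expand_pattern(test_dir / "test_static_0_*.parquet")]

print(matches)
```

Sorting the matches keeps the shard order deterministic, which matters if the shards are later concatenated.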

**Study Sources:**
1. Pathlib Documentation: https://docs.python.org/3/library/pathlib.html
2. Parquet File Format: https://parquet.apache.org/documentation/latest/

In [25]:
ROOT            = Path("/kaggle/input/home-credit-credit-risk-model-stability")

TEST_DIR        = ROOT / "parquet_files" / "test"

data_store = {
    "df_base": read_file(TEST_DIR / "test_base.parquet"),
    "depth_0": [
        read_file(TEST_DIR / "test_static_cb_0.parquet"),
        read_files(TEST_DIR / "test_static_0_*.parquet"),
    ],
    "depth_1": [
        read_files(TEST_DIR / "test_applprev_1_*.parquet", 1),
        read_file(TEST_DIR / "test_tax_registry_a_1.parquet", 1),
        read_file(TEST_DIR / "test_tax_registry_b_1.parquet", 1),
        read_file(TEST_DIR / "test_tax_registry_c_1.parquet", 1),
        read_files(TEST_DIR / "test_credit_bureau_a_1_*.parquet", 1),
        read_file(TEST_DIR / "test_credit_bureau_b_1.parquet", 1),
        read_file(TEST_DIR / "test_other_1.parquet", 1),
        read_file(TEST_DIR / "test_person_1.parquet", 1),
        read_file(TEST_DIR / "test_deposit_1.parquet", 1),
        read_file(TEST_DIR / "test_debitcard_1.parquet", 1),
    ],
    "depth_2": [
        read_file(TEST_DIR / "test_credit_bureau_b_2.parquet", 2),
        read_files(TEST_DIR / "test_credit_bureau_a_2_*.parquet", 2),
        read_file(TEST_DIR / "test_applprev_2.parquet", 2),
        read_file(TEST_DIR / "test_person_2.parquet", 2)
    ]
}


**Explanation:**
- Performs feature engineering on the test data by unpacking the data store into the `feature_eng` function.
- After feature engineering, it prints the shape of the processed test data.
- Then, it deletes the data store and runs garbage collection to free up memory.
- It keeps only `case_id` plus the columns in `cols`, the features the loaded models expect. In the output below this reduces the frame from 860 to 386 columns once `case_id` becomes the index.
- The test data is converted to a pandas DataFrame with `to_pandas`, which also returns the list of categorical columns.
- Memory usage of the test data is reduced with the `reduce_mem_usage` function.
- Finally, the 'case_id' column is set as the index of the test data, and its shape is printed again.
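
The selection and indexing steps can be sketched in plain pandas. This is a toy illustration under assumptions: the notebook itself starts from a polars frame and uses its own `to_pandas` helper, and the column names and values below are invented.

```python
import pandas as pd

# Toy stand-ins; in the notebook `cols` and `cat_cols` come from notebook_info.
cols = ["amount", "status"]
cat_cols = ["status"]

df = pd.DataFrame({
    "case_id": [101, 102, 103],
    "amount": [1000.0, 2500.0, 400.0],
    "status": ["A", "B", "A"],
    "unused": [0, 1, 0],
})

df = df[["case_id"] + cols]                      # keep only the model's columns
df[cat_cols] = df[cat_cols].astype("category")   # mark categorical features
df = df.set_index("case_id")                     # case_id becomes the row label

print(df.shape)
print(df.dtypes)
```

Moving `case_id` into the index keeps it out of the feature matrix while still letting predictions be aligned by case.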

**Study Sources:**
1. Garbage Collection in Python: https://docs.python.org/3/library/gc.html
2. Pandas DataFrame Selection: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
3. Memory Optimization Techniques in Pandas: https://www.dataquest.io/blog/pandas-big-data/

In [26]:
df_test = feature_eng(**data_store)
print("test data shape:\t", df_test.shape)
del data_store
gc.collect()


df_test = df_test.select(['case_id'] + cols)

df_test, cat_cols = to_pandas(df_test, cat_cols)
df_test = reduce_mem_usage(df_test)
df_test = df_test.set_index('case_id')
print("test data shape:\t", df_test.shape)

gc.collect()


test data shape:	 (10, 860)
Memory usage of dataframe is 0.04 MB
Memory usage after optimization is: 0.02 MB
Decreased by 40.2%
test data shape:	 (10, 386)


0

**Explanation:**
- Defines a custom ensemble model named `VotingModel` that inherits from `BaseEstimator` and `RegressorMixin`.
- The `__init__` method stores the list of estimators (models) whose predictions will be averaged.
- The `fit` method is implemented but does nothing, because the estimators are already fitted.
- The `predict` method averages the outputs of `predict` across all estimators.
- The `predict_proba` method averages predicted class probabilities across all estimators.
- It first collects probabilities from the LightGBM (lgb) models, taken as the first five estimators, and then from the CatBoost (cat) models, taken as the last five. The slices `[:5]` and `[-5:]` are hard-coded, so the method assumes exactly this ordering. Categorical columns are converted to string type before the CatBoost models predict, to ensure compatibility.
- Finally, the probabilities are averaged with equal weight across all ten models.
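
The averaging step can be shown in isolation. A minimal sketch, assuming a hypothetical `ConstantModel` class that stands in for fitted classifiers and returns fixed probabilities:

```python
import numpy as np


class ConstantModel:
    """Hypothetical stand-in for a fitted classifier with predict_proba."""

    def __init__(self, p_positive):
        self.p_positive = p_positive

    def predict_proba(self, X):
        p = np.full(len(X), self.p_positive)
        return np.column_stack([1.0 - p, p])   # columns: P(class 0), P(class 1)


models = [ConstantModel(0.25), ConstantModel(0.5), ConstantModel(0.75)]
X = np.zeros((4, 2))   # 4 rows; the features are ignored by ConstantModel

# Stack to shape (n_models, n_rows, n_classes), then average over the models.
stacked = np.stack([m.predict_proba(X) for m in models])
avg = stacked.mean(axis=0)

print(stacked.shape, avg.shape)
print(avg[:, 1])
```

Averaging along `axis=0` gives every model equal weight, and each row of the result still sums to 1.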

**Study Sources:**
1. Scikit-learn BaseEstimator Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html
2. Scikit-learn RegressorMixin Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.base.RegressorMixin.html

In [27]:
class VotingModel(BaseEstimator, RegressorMixin):
    """
    A custom ensemble model that performs voting aggregation for predictions.

    Parameters:
    ----------
    estimators : list
        List of estimators (models) to be used for voting aggregation.

    Methods:
    --------
    fit(X, y=None):
        Fit the ensemble model. This method does nothing as it's not required for voting aggregation.

    predict(X):
        Perform prediction using voting aggregation on the provided features.

    predict_proba(X):
        Perform prediction with probabilities using voting aggregation on the provided features.

    """
    def __init__(self, estimators):
        """
        Initialize the VotingModel.

        Parameters:
        ----------
        estimators : list
            List of estimators (models) to be used for voting aggregation.
        """
        super().__init__()
        self.estimators = estimators
        
    def fit(self, X, y=None):
        """
        Fit the ensemble model.

        This method does nothing as it's not required for voting aggregation.

        Parameters:
        ----------
        X : array-like or sparse matrix, shape (n_samples, n_features)
            Training data.
        y : array-like, shape (n_samples,) (default=None)
            Target values.

        Returns:
        --------
        self : object
            Returns self.
        """
        return self
    
    def predict(self, X):
        """
        Perform prediction using voting aggregation on the provided features.

        Parameters:
        ----------
        X : array-like or sparse matrix, shape (n_samples, n_features)
            Features to perform prediction on.

        Returns:
        --------
        y_pred : array-like, shape (n_samples,)
            Predicted target values.
        """
        y_preds = [estimator.predict(X) for estimator in self.estimators]
        return np.mean(y_preds, axis=0)
     
    def predict_proba(self, X):      
        """
        Perform prediction with probabilities using voting aggregation on the provided features.

        Parameters:
        ----------
        X : array-like or sparse matrix, shape (n_samples, n_features)
            Features to perform prediction on.

        Returns:
        --------
        y_pred_proba : array-like, shape (n_samples, n_classes)
            Predicted probabilities.
        """
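        # LightGBM: the first five estimators take X with its original dtypes.
        y_preds = [estimator.predict_proba(X) for estimator in self.estimators[:5]]

        # CatBoost: the last five estimators need categorical columns as strings.
        # Work on a copy so the caller's DataFrame is not modified in place.
        # Note: `cat_cols` is a notebook-level global, not an attribute of the model.
        X = X.copy()
        X[cat_cols] = X[cat_cols].astype(str)
        y_preds += [estimator.predict_proba(X) for estimator in self.estimators[-5:]]

        return np.mean(y_preds, axis=0)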


In [28]:
model = VotingModel(lgb_models + cat_models)
len(model.estimators)


10

In [29]:
y_pred = pd.Series(model.predict_proba(df_test)[:, 1], index=df_test.index)
df_subm = pd.read_csv(ROOT / "sample_submission.csv")
df_subm = df_subm.set_index("case_id")
df_subm["score"] = y_pred
df_subm.to_csv("sub.csv")
df_train,y,df_test=joblib.load('/kaggle/working/data.pkl')

---

In [30]:
fitted_models_lgb=[]
model = lgb.LGBMClassifier()
model.fit(df_train,y)
fitted_models_lgb.append(model) 

[LightGBM] [Info] Number of positive: 10, number of negative: 1526659
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 2.088825 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 46717
[LightGBM] [Info] Number of data points in the train set: 1526669, number of used features: 308
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.000007 -> initscore=-11.936007
[LightGBM] [Info] Start training from score -11.936007


**Explanation:**
- Redefines `VotingModel`, replacing the earlier class of the same name.
- It again inherits from `BaseEstimator` and `RegressorMixin`.
- The `__init__` method stores the list of fitted estimators (models) to be averaged.
- The `fit` method is implemented but does nothing, because the estimators are already fitted.
- The `predict` method averages the outputs of `predict` across all fitted estimators.
- The `predict_proba` method averages probabilities across all fitted estimators. Unlike the earlier version, it treats every estimator the same way and has no separate LightGBM and CatBoost branches.
- An instance of `VotingModel` is then created from the list of fitted LightGBM models (`fitted_models_lgb`), which here contains the single model trained in the previous cell.

This simpler version works with any list of fitted models that expose `predict` and `predict_proba`.



In [31]:
class VotingModel(BaseEstimator, RegressorMixin):
    """
    A custom ensemble model for voting aggregation of predictions.

    Parameters:
    ----------
    estimators : list
        List of fitted estimators (models) to be used for voting aggregation.

    Methods:
    --------
    fit(X, y=None):
        Fit the ensemble model. This method does nothing as it's not required for voting aggregation.

    predict(X):
        Perform prediction using voting aggregation on the provided features.

    predict_proba(X):
        Perform prediction with probabilities using voting aggregation on the provided features.

    """
    def __init__(self, estimators):
        """
        Initialize the VotingModel.

        Parameters:
        ----------
        estimators : list
            List of fitted estimators (models) to be used for voting aggregation.
        """
        super().__init__()
        self.estimators = estimators
        
    def fit(self, X, y=None):
        """
        Fit the ensemble model.

        This method does nothing as it's not required for voting aggregation.

        Parameters:
        ----------
        X : array-like or sparse matrix, shape (n_samples, n_features)
            Training data.
        y : array-like, shape (n_samples,) (default=None)
            Target values.

        Returns:
        --------
        self : object
            Returns self.
        """
        return self
    
    def predict(self, X):
        """
        Perform prediction using voting aggregation on the provided features.

        Parameters:
        ----------
        X : array-like or sparse matrix, shape (n_samples, n_features)
            Features to perform prediction on.

        Returns:
        --------
        y_pred : array-like, shape (n_samples,)
            Predicted target values.
        """
        y_preds = [estimator.predict(X) for estimator in self.estimators]
        return np.mean(y_preds, axis=0)
    
    def predict_proba(self, X):
        """
        Perform prediction with probabilities using voting aggregation on the provided features.

        Parameters:
        ----------
        X : array-like or sparse matrix, shape (n_samples, n_features)
            Features to perform prediction on.

        Returns:
        --------
        y_pred_proba : array-like, shape (n_samples, n_classes)
            Predicted probabilities.
        """
        y_preds = [estimator.predict_proba(X) for estimator in self.estimators]
        return np.mean(y_preds, axis=0)

model = VotingModel(fitted_models_lgb)


**Explanation:**
- Drops the columns "WEEK_NUM" and 'target' from the test DataFrame `df_test`, as they are not used for prediction.
- The 'case_id' column is set as the index of `df_test`.
- Predictions are made on the test data using the ensemble model (`model`), and the probabilities for class 1 are extracted.
- A boolean mask `condition` marks the rows whose predicted probability is below 0.98.
- The earlier ensemble submission `/kaggle/working/sub.csv` is read into `df_subm`, and the 'case_id' column is set as its index.
- For the rows selected by `condition`, 0.073 is subtracted from the score, and the result is clipped so that it cannot fall below 0.
- The modified submission DataFrame is saved to a CSV file named "submission.csv".
- Finally, the file 'data.pkl' is removed from the working directory with a shell command.

In short, this cell uses the second model's predictions to decide which scores from the first submission to lower, then writes the final submission file.
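
The masked adjustment can be sketched on toy data. The threshold 0.98 and the offset 0.073 are taken from the cell below; the case ids, probabilities, and scores are invented for illustration.

```python
import pandas as pd

# Toy values sharing one case_id index, as in the notebook.
idx = pd.Index([1, 2, 3], name="case_id")
y_pred = pd.Series([0.10, 0.99, 0.50], index=idx)            # second model's probabilities
subm = pd.DataFrame({"score": [0.05, 0.90, 0.40]}, index=idx)  # earlier submission scores

condition = y_pred < 0.98   # boolean mask aligned on case_id

# Lower the selected scores by 0.073, never going below 0.
subm.loc[condition, "score"] = (subm.loc[condition, "score"] - 0.073).clip(0)

print(subm)
```

In this toy run the first score is floored at 0, the second is left unchanged because its probability is not below 0.98, and the third drops by 0.073.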



In [32]:
df_test = df_test.drop(columns=["WEEK_NUM",'target'])
df_test = df_test.set_index("case_id")

y_pred = pd.Series(model.predict_proba(df_test)[:,1], index=df_test.index)
condition=y_pred<0.98
df_subm = pd.read_csv("/kaggle/working/sub.csv")
df_subm = df_subm.set_index("case_id")

df_subm.loc[condition, 'score'] = (df_subm.loc[condition, 'score'] - 0.073).clip(0)
df_subm.to_csv("submission.csv")
!rm -rf data.pkl

## Keep Exploring! 👀

Thank you for delving into this notebook! If you found it insightful or beneficial, I encourage you to explore more of my projects and contributions on my profile.

👉 [Visit my Profile](https://www.kaggle.com/zulqarnainalipk) 👈

[GitHub](https://github.com/zulqarnainalipk) |
[LinkedIn](https://www.linkedin.com/in/zulqarnainalipk/)

## Share Your Thoughts! 🙏

Your feedback is invaluable! Your insights and suggestions drive our ongoing improvement. If you have any comments, questions, or ideas to contribute, please feel free to reach out.

📬 Contact me via email: [zulqar445ali@gmail.com](mailto:zulqar445ali@gmail.com)

I extend my sincere gratitude for your time and engagement. Your support inspires me to create even more valuable content.
Happy coding and best of luck in your data science endeavors! 🚀
