## Short summary of this notebook

   * In this notebook, I'm using the dataset provided by RADDAR (https://www.kaggle.com/datasets/raddar/amex-data-integer-dtypes-parquet-format).
    
   * EDA on a dataset this large is very difficult when each notebook has only 16GB of RAM available. Even with Dask holding the dataset lazily (or, more precisely, holding a large graph of tasks related to the dataset), it is hard to run every analysis in a single notebook. For that reason, the results are spread across different VERSIONS of this notebook. Up to this version, I think the only interesting things I did were:
    
   * (A) In version 10, Pearson's correlation was calculated for the numeric ("continuous") variables. Since the cardinality of the "continuous" features is enormous, I don't think it makes sense to compare these variables with the int8 and int16 ones. So, in a future version, I'll calculate the Spearman correlation coefficient for all the "low cardinality" features, i.e. the int8 and int16 features.
    
   * (B) In version 10, I made some intra-group correlation plots. The link to the original notebook from which I took the code is given in a comment above the plots.
    
   * (C) In version 11, Spearman's correlation was calculated for the numeric ("continuous") variables. Although some folks have already done some correlation analysis (see, for example: https://www.kaggle.com/competitions/amex-default-prediction/discussion/328885 and https://www.kaggle.com/competitions/amex-default-prediction/discussion/330710 ), I think that a rank-based analysis, which also captures non-linear monotonic relationships, is essential.
    
   * (D) In both versions 10 and 11, I listed the 50 most positively correlated pairs and the 50 most negatively correlated pairs.
    
   * (E) In version 13, I plotted the 20 most positively correlated pairs, both linear and non-linear. Some pairs appear in both lists, so there are fewer non-linear plots. The aim is to recognize patterns that could help deanonymize features and guide feature engineering IN THE FUTURE, in the same way as the winners of the IEEE-CIS Fraud Detection competition. In version 14, I tried to improve the aesthetics of the plots.
   
   * (F) In version 15, I made the same plots for the 20 most negatively correlated pairs, both linear and non-linear.
   
   * (G) In version 16, I investigated some float-type variables that may actually be ordinal/binary.
   
   * (H) In version 17, I made a Spearman's correlation study for the categorical/ordinal/binary variables.
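
To illustrate why items (A) and (C) look at both coefficients, here is a minimal sketch on toy data (not the competition data). It assumes a relationship that is monotonic but non-linear: Spearman works on ranks and should report a perfect association, while Pearson should report a weaker one.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): y grows monotonically but non-linearly with x
x = np.linspace(0.0, 5.0, 200)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association; Spearman measures association between ranks
pearson_xy = toy.corr(method="pearson").loc["x", "y"]
spearman_xy = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson_xy:.3f}")
print(f"Spearman: {spearman_xy:.3f}")
```

A large gap between the two coefficients for a pair of features is a hint that the relationship is monotonic but not linear.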

## Part 1: Imports, reading data, type of column data, transforming S_2 and customer_ID

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import dask.dataframe as dd
import dask
from time import time
#import datetime
import seaborn as sns
import random
import gc # garbage collector
import matplotlib.pyplot as plt
import matplotlib.style as mplstyle #https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html
plt.rcParams['agg.path.chunksize'] = 20000
%matplotlib inline
mplstyle.use(['dark_background', 'ggplot']) #https://matplotlib.org/stable/users/explain/performance.html
gc.enable()

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#The data below came from here: https://www.kaggle.com/datasets/raddar/amex-data-integer-dtypes-parquet-format
pdf = dd.read_parquet('/kaggle/input/amex-data-integer-dtypes-parquet-format/train.parquet', split_row_groups = True)
test = dd.read_parquet('/kaggle/input/amex-data-integer-dtypes-parquet-format/test.parquet')
target = dd.read_csv('/kaggle/input/amex-default-prediction/train_labels.csv')

In [None]:
# Columns dtypes
for col in pdf.columns:
    print(f"{col} \t {pdf[col].dtype}")

## Last Month Dataset

* All the correlation analysis was done for a single fixed month (the last one in the training dataset). We have no reason to believe that the overall correlation between features changes much from month to month, so this choice should not cost generality.
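
The stability assumption above is not tested in this notebook. One possible check, sketched here on synthetic data, is to compare the correlation matrices of the last two months and look at the largest absolute difference. The helper name `corr_shift_between_months` and the toy columns `f1` and `f2` are hypothetical, and the sketch uses plain pandas rather than Dask.

```python
import numpy as np
import pandas as pd

def corr_shift_between_months(df, date_col, features, method="spearman"):
    """Largest absolute change in pairwise correlation between the last two months."""
    months = df[date_col].dt.to_period("M")
    ordered = sorted(months.unique())
    previous, last = ordered[-2], ordered[-1]
    corr_last = df.loc[months == last, features].corr(method=method)
    corr_prev = df.loc[months == previous, features].corr(method=method)
    return (corr_last - corr_prev).abs().to_numpy().max()

# Synthetic data: the same dependence between f1 and f2 in both months
rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=2 * n)
toy = pd.DataFrame({
    "S_2": pd.to_datetime(["2018-02-15"] * n + ["2018-03-15"] * n),
    "f1": base,
    "f2": base + rng.normal(scale=0.5, size=2 * n),
})

shift = corr_shift_between_months(toy, "S_2", ["f1", "f2"])
print(f"Largest correlation change between the last two months: {shift:.3f}")
```

A value close to zero would support working with a single month; a large value would suggest repeating the analysis per month.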

In [None]:
#Changing S_2 to datetime
pdf['S_2'] = dd.to_datetime(pdf['S_2'])

In [None]:
#Getting the month and year of the last month of the training dataset
#(the max is reduced inside Dask, so the whole S_2 column is never held in memory)
max_date = pdf['S_2'].max().compute()
max_month = max_date.month
max_year = max_date.year

In [None]:
#Taking only the last month of data (rows dated on or after the first day of that month)
pdf = pdf[pdf.S_2 >= pd.Timestamp(year=max_year, month=max_month, day=1)]

In [None]:
print(f"{max_year}-{max_month}")

## Separating numeric and categorical columns

In [None]:
features = list(pdf.columns)
for non_feature in ['customer_ID', 'S_2']:
    features.remove(non_feature)

#Note that we made a CHOICE in this notebook: since the features are anonymized, we decided to 
#treat the low-cardinality ones as categorical and the high-cardinality ones as numerical
cat_features = [feature for feature in features if pdf[feature].dtype != 'float32']
cat_features[:5]

In [None]:
num_features = [col for col in features if col not in cat_features]
num_features[:5]
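
The dtype-based split can be tried on a small frame first. This is a minimal sketch in plain pandas; the toy frame, its values and the `toy_*` names are made up for illustration, and only the rule (float32 means numeric, anything else means categorical) comes from the cells above.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of dtypes as the parquet dataset (values are invented)
toy = pd.DataFrame({
    "customer_ID": ["a", "b"],
    "S_2": pd.to_datetime(["2018-03-01", "2018-03-02"]),
    "P_2": np.array([0.5, 0.7], dtype="float32"),
    "D_63": np.array([1, 2], dtype="int8"),
    "B_30": np.array([0, 1], dtype="int16"),
})

toy_features = [c for c in toy.columns if c not in ("customer_ID", "S_2")]

# float32 columns are treated as numeric, everything else as categorical
toy_num = toy[toy_features].select_dtypes(include="float32").columns.tolist()
toy_cat = [c for c in toy_features if c not in toy_num]

print(toy_num, toy_cat)
```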

In [None]:
# REDUCE DTYPE FOR CUSTOMER. Taken from here: https://www.kaggle.com/code/cdeotte/xgboost-starter-0-793
hex_to_int2 = lambda x: int(x[-16:], 16)
for df in [pdf,test,target]:
    # meta tells Dask the name and dtype of the resulting column
    df['customer_ID'] = df['customer_ID'].apply(hex_to_int2, meta = ('customer_ID', 'i8'))
test['S_2'] = dd.to_datetime(test['S_2'])
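
One detail worth keeping in mind: 16 hex digits encode 64 bits, so the parsed value can be as large as 2**64 - 1, which does not fit in a signed 64-bit integer. The sketch below shows one way to map the value into the int64 range by reinterpreting it as two's complement. The helper `hex_tail_to_int64` is hypothetical and is not part of the notebook this cell was taken from.

```python
def hex_tail_to_int64(customer_id: str) -> int:
    """Parse the last 16 hex digits and wrap the result into the signed 64-bit range."""
    unsigned = int(customer_id[-16:], 16)  # value in [0, 2**64 - 1]
    # Values at or above 2**63 are reinterpreted as negative (two's complement)
    return unsigned - 2**64 if unsigned >= 2**63 else unsigned

# Made-up IDs, only to show the two cases
small_id = "00" * 24 + "000000000000002a"
large_id = "00" * 24 + "ffffffffffffffff"

print(hex_tail_to_int64(small_id))
print(hex_tail_to_int64(large_id))
```

The mapping stays one-to-one on the 64-bit tail, so it can still be used as a join key.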

## Correlations - ordinal/categorical features

* Disclaimer

We don't know which of the int columns have a scale that is at least ordinal. For some of them, the calculations in this section may not make sense at all. Still, we think that, for these variables, there is an extremely low probability that a high Spearman's correlation will appear by chance, since they were encoded as integers.
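
The reason for the disclaimer can be seen on toy data: Spearman's coefficient depends on the order of the integer codes, so relabelling the levels of a variable changes the result. The column names and the relabelling below are invented for illustration.

```python
import pandas as pd

# Toy data: "response" follows the ordinal code exactly
levels = [0, 1, 2, 3, 4] * 20
toy = pd.DataFrame({"ordinal_code": levels, "response": levels})

# Arbitrary relabelling of the same five levels, as if the codes had no natural order
relabel = {0: 3, 1: 0, 2: 4, 3: 1, 4: 2}
toy["nominal_code"] = toy["ordinal_code"].map(relabel)

rho = toy.corr(method="spearman")
rho_ordered = rho.loc["ordinal_code", "response"]
rho_relabelled = rho.loc["nominal_code", "response"]

print(f"Codes in natural order: {rho_ordered:.2f}")
print(f"Codes relabelled:       {rho_relabelled:.2f}")
```

The information in both code columns is identical, yet only the ordered one shows a strong rank correlation. A low coefficient for an int column therefore does not rule out a dependence.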

In [None]:
#The general lines of code below were taken from here: https://seaborn.pydata.org/examples/many_pairwise_correlations.html
sns.set_theme(style="white")
inicio = time()
ordinal_correlation = pdf[cat_features].compute().corr(method = 'spearman')
final = time()

print(f"Time to compute the correlation matrix, in seconds: {(final-inicio):.3f}s")


# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(ordinal_correlation, dtype=bool))

In [None]:
#The most correlated pairs, taken from here: https://www.kaggle.com/code/datark1/american-express-eda
unstacked = ordinal_correlation.unstack()
unstacked = unstacked.sort_values(ascending=False, kind="quicksort").drop_duplicates()

In [None]:
unstacked.head(50)

In [None]:
unstacked.tail(50)
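
Note that `drop_duplicates()` on the unstacked series removes entries by value, so two different pairs with exactly the same coefficient collapse into one row. An alternative sketch, assuming a symmetric correlation matrix, keeps only the strict upper triangle so that each pair appears once. The helper name `unique_pairs` is hypothetical.

```python
import numpy as np
import pandas as pd

def unique_pairs(corr: pd.DataFrame) -> pd.Series:
    """Return each off-diagonal pair once, sorted from most positive to most negative."""
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # strict upper triangle
    # Entries outside the upper triangle become NaN and are dropped
    return corr.where(upper).stack().dropna().sort_values(ascending=False)

# Toy data: b rises with a, c falls with a
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})
pairs = unique_pairs(toy.corr(method="spearman"))
print(pairs)
```

With three columns this always returns three pairs, regardless of whether some coefficients coincide.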

* Part 1a: correlations between risk variables

All the code below (from parts 1a,1b,1c,1d and 1e) was taken from: https://www.kaggle.com/code/datark1/american-express-eda


In [None]:
cols_to_show = [c for c in pdf.columns if (c.startswith('R')) and c in cat_features]
#corr=pdf[cols_to_show].compute().corr(method = 'spearman')
corr = ordinal_correlation.loc[cols_to_show, cols_to_show]
mask=np.triu(np.ones_like(corr))[1:,:-1]
corr=corr.iloc[1:,:-1].copy()

fig, ax = plt.subplots(figsize=(30,30))   
sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, center=0, annot=True, fmt='.2f', 
            cmap='coolwarm', annot_kws={'fontsize':10,'fontweight':'bold'}, cbar=False)
ax.tick_params(left=False,bottom=False)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right',fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=12)
plt.title('Correlations between Risk Variables\n', fontsize=16)
plt.show()
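
The slicing with `[1:,:-1]` in the cell above drops the first row and the last column, which would otherwise be fully masked in a lower-triangle plot. A small sketch of that step on a toy matrix, without the plotting part, is shown below. The helper name `trimmed_lower_triangle` is hypothetical.

```python
import numpy as np
import pandas as pd

def trimmed_lower_triangle(corr: pd.DataFrame, cols):
    """Return the sub-matrix and boolean mask used for a lower-triangle heatmap."""
    sub = corr.loc[cols, cols]
    # Upper triangle including the diagonal, then drop the first row and last column
    mask = np.triu(np.ones(sub.shape, dtype=bool))[1:, :-1]
    return sub.iloc[1:, :-1], mask

# Toy 4x4 "correlation" matrix (identity), only to inspect shapes and the mask
labels = list("abcd")
toy_corr = pd.DataFrame(np.eye(4), index=labels, columns=labels)
view, mask = trimmed_lower_triangle(toy_corr, labels)

print(view.shape)
print(mask)
```

The returned `view` and `mask` can be passed to `sns.heatmap(view, mask=mask, ...)`. After trimming, the diagonal of `view` holds the correlations between consecutive columns, and those cells stay visible.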

* Part 1b: correlation between spend variables

In [None]:
cols_to_show = [c for c in pdf.columns if (c.startswith('S')) and c in cat_features]
#corr=pdf[cols_to_show].compute().corr(method = 'spearman')
corr = ordinal_correlation.loc[cols_to_show, cols_to_show]
mask=np.triu(np.ones_like(corr))[1:,:-1]
corr=corr.iloc[1:,:-1].copy()

fig, ax = plt.subplots(figsize=(15,15))   
sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, center=0, annot=True, fmt='.2f', 
            cmap='coolwarm', annot_kws={'fontsize':10,'fontweight':'bold'}, cbar=False)
ax.tick_params(left=False,bottom=False)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right',fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=12)
plt.title('Correlations between Spend Variables\n', fontsize=16)
plt.show()

* Part 1c: correlation between delinquency variables

In [None]:
cols_to_show = [c for c in pdf.columns if (c.startswith('D')) and c in cat_features]
#corr=pdf[cols_to_show].compute().corr(method = 'spearman')
corr = ordinal_correlation.loc[cols_to_show, cols_to_show]
mask=np.triu(np.ones_like(corr))[1:,:-1]
corr=corr.iloc[1:,:-1].copy()

fig, ax = plt.subplots(figsize=(23,23))   
sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, center=0, annot=True, fmt='.2f', 
            cmap='coolwarm', annot_kws={'fontsize':10,'fontweight':'bold'}, cbar=False)
ax.tick_params(left=False,bottom=False)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right',fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=12)
plt.title('Correlations between Delinquency Variables\n', fontsize=16)
plt.show()

* Part 1d: correlation between balance variables

In [None]:
cols_to_show = [c for c in pdf.columns if (c.startswith('B')) and c in cat_features]
#corr=pdf[cols_to_show].compute().corr(method = 'spearman')
corr = ordinal_correlation.loc[cols_to_show, cols_to_show]
mask=np.triu(np.ones_like(corr))[1:,:-1]
corr=corr.iloc[1:,:-1].copy()

fig, ax = plt.subplots(figsize=(15,15))   
sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, center=0, annot=True, fmt='.2f', 
            cmap='coolwarm', annot_kws={'fontsize':10,'fontweight':'bold'}, cbar=False)
ax.tick_params(left=False,bottom=False)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right',fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=12)
plt.title('Correlations between Balance Variables\n', fontsize=16)
plt.show()

* Part 1e: correlation between payment variables

In [None]:
cols_to_show = [c for c in pdf.columns if (c.startswith('P')) and c in cat_features]

In [None]:
cols_to_show

## Part 2: creating aggregated features and constructing a new dataset from them

In [None]:
#pdf = pdf.sort_values(['customer_ID','S_2'])
#test = test.sort_values(['customer_ID','S_2'])
#target = target.sort_values('customer_ID')

* Aggregating train variables

In [None]:
#The code below was taken from here: https://www.kaggle.com/code/huseyincot/amex-agg-data-how-it-created

#train_num_agg = pdf.groupby("customer_ID")[num_features].agg(['mean', 'std', 'min', 'max', 'last'])
#train_num_agg.columns = ['_'.join(x) for x in train_num_agg.columns]
#train_cat_agg = pdf.groupby("customer_ID")[cat_features].agg(['count', 'last'])
#train_cat_agg.columns = ['_'.join(x) for x in train_cat_agg.columns]
#train_target = (target.groupby("customer_ID").tail(1).set_index('customer_ID', drop=True).sort_index()["target"])
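
What the commented aggregation would produce can be checked on a tiny frame. This is a sketch in plain pandas rather than Dask, with invented values; it assumes the rows of each customer are already in chronological order, because `last` simply takes the final row of each group.

```python
import pandas as pd

# Toy statements: customer 1 has two rows, customer 2 has one
toy = pd.DataFrame({
    "customer_ID": [1, 1, 2],
    "P_2": [0.1, 0.3, 0.5],
    "D_63": [0, 1, 1],
})

# Numeric features: summary statistics per customer
num_agg = toy.groupby("customer_ID")[["P_2"]].agg(["mean", "std", "min", "max", "last"])
num_agg.columns = ["_".join(col) for col in num_agg.columns]

# Categorical features: number of observations and most recent value
cat_agg = toy.groupby("customer_ID")[["D_63"]].agg(["count", "last"])
cat_agg.columns = ["_".join(col) for col in cat_agg.columns]

toy_agg = pd.concat([num_agg, cat_agg], axis=1)
print(toy_agg)
```

Customers with a single statement get `NaN` for the `std` columns, which is worth handling before modelling.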


In [None]:
#for col in train_num_agg.columns:
#    train_num_agg[col] = train_num_agg[col].astype('float32')
#for col in train_cat_agg.columns:
#    if col[-5:] == 'count':
#        train_cat_agg[col] = train_cat_agg[col].astype('int8')

In [None]:
#train_num_agg

In [None]:
#train_cat_agg

In [None]:
#target

In [None]:
#target['target'] = target['target'].astype('int8')

In [None]:
#train = dd.concat([train_num_agg, train_cat_agg, target], axis=1, ignore_unknown_divisions = True)

In [None]:
#del train_num_agg, train_cat_agg, target

In [None]:
#train.to_parquet("train_agg.parquet", compression="gzip")

In [None]:
#train

In [None]:
#del train, pdf
#gc.collect()

* Aggregating test variables 

In [None]:
#The code below was taken from here: https://www.kaggle.com/code/huseyincot/amex-agg-data-how-it-created

#test_num_agg = test.groupby("customer_ID")[num_features].agg(['mean', 'std', 'min', 'max', 'last'])
#test_num_agg.columns = ['_'.join(x) for x in test_num_agg.columns]
#test_cat_agg = test.groupby("customer_ID")[cat_features].agg(['count', 'last'])
#test_cat_agg.columns = ['_'.join(x) for x in test_cat_agg.columns]
#train_target = (target.groupby("customer_ID").tail(1).set_index('customer_ID', drop=True).sort_index()["target"])


In [None]:
#for col in test_num_agg.columns:
#    test_num_agg[col] = test_num_agg[col].astype('float32')
#for col in test_cat_agg.columns:
#    if col[-5:] == 'count':
#        test_cat_agg[col] = test_cat_agg[col].astype('int8')

In [None]:
#test = dd.concat([test_num_agg, test_cat_agg], axis=1, ignore_unknown_divisions = True)
#del test_num_agg, test_cat_agg
#gc.collect()

In [None]:
#test.to_parquet("test_agg.parquet", compression="gzip")

In [None]:
#Trying to keep only the last month without groupby. Taken from: https://www.kaggle.com/competitions/amex-default-prediction/discussion/327361
#pdf.drop_duplicates(subset=['customer_ID'], keep='last').drop(['S_2'], axis='columns').compute()
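
The idea in the commented line can be tried on a toy frame. This is a sketch in plain pandas with invented values; it sorts by customer and date first, on the assumption that the rows are not guaranteed to arrive in chronological order, since `keep='last'` only looks at row position.

```python
import pandas as pd

# Toy statements, deliberately out of order
toy = pd.DataFrame({
    "customer_ID": [2, 1, 1, 2],
    "S_2": pd.to_datetime(["2018-03-10", "2018-01-05", "2018-03-01", "2018-02-10"]),
    "P_2": [0.9, 0.1, 0.3, 0.7],
})

last_rows = (
    toy.sort_values(["customer_ID", "S_2"])                   # oldest to newest per customer
       .drop_duplicates(subset=["customer_ID"], keep="last")  # keep the newest row
       .drop(columns=["S_2"])
       .set_index("customer_ID")
)
print(last_rows)
```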