# Ch 01. View Each Separately

When we don't yet understand what our data looks like, it's best to look at it in smaller chunks. In this first chapter, we separate the five column types (D, S, P, B, R) and look at their two-dimensional distribution, using the D type as an example.

> Table of Contents
```
Step 1. Load Sample Data
Step 2. Separate Data
Step 3. Check Outliers
Step 4. PCA (Focus on 'D' type)
```


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Step 1. Load Sample Data


Because the full dataset is too large to load comfortably, this tutorial uses only the first 10,000 rows.

In [None]:
train_data_path = '/kaggle/input/amex-default-prediction/train_data.csv'
train_label_path = '/kaggle/input/amex-default-prediction/train_labels.csv'

In [None]:
train_iter = pd.read_csv(train_data_path, chunksize=10000)
train_iter

In [None]:
train_sample = next(train_iter) # first 10000 rows
train_sample
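
If only the first rows are ever needed, `pd.read_csv` also accepts an `nrows` argument, which avoids creating an iterator. The sketch below uses a small in-memory CSV (hypothetical data, not the competition file) to suggest that both approaches return the same rows.

```python
import io

import pandas as pd

# Hypothetical stand-in for train_data.csv: 25 rows, two columns.
csv_text = "customer_ID,P_2\n" + "\n".join(f"c{i},{i / 10}" for i in range(25))

# Approach used in this chapter: take the first chunk from a chunked reader.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=10)
first_chunk = next(reader)

# Alternative: read only the first 10 rows directly.
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=10)

print(len(first_chunk), len(first_rows))
print(first_chunk.equals(first_rows))
```

The chunked reader is still the better fit when later chunks will be processed too; `nrows` is simply shorter when they will not.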

Check how many rows there are for each customer ID.

In [None]:
train_sample['customer_ID'].value_counts()

When we plot the data as a two-dimensional distribution, the target will be shown by color. So let's load the target labels as well and merge them with our sample data.

In [None]:
train_labels = pd.read_csv(train_label_path)
train_labels

In [None]:
sample_df = pd.merge(left=train_sample, right=train_labels, on=['customer_ID'], how='left')
sample_df
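
A left merge like the one above assumes that each `customer_ID` appears only once in the label table; otherwise rows would be duplicated. As a hedged sketch on toy data (the two small frames below are hypothetical), the `validate` argument of `pd.merge` can make that assumption explicit.

```python
import pandas as pd

# Hypothetical miniature versions of train_sample and train_labels.
statements = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "c"],
    "P_2": [0.9, 0.8, 0.4, 0.7],
})
labels = pd.DataFrame({"customer_ID": ["a", "b", "c"], "target": [0, 1, 0]})

# validate="many_to_one" raises MergeError if a customer_ID is duplicated in labels.
merged = pd.merge(statements, labels, on="customer_ID", how="left",
                  validate="many_to_one")

# A left join keeps every statement row; a missing label would show up as NaN.
print(len(merged) == len(statements))
print(merged["target"].isna().sum())
```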

## Step 2. Separate Data

We will check the data separately for each column type, so let's separate the columns.

In [None]:
d_cols = list(filter(lambda x : x.startswith("D"), train_sample.columns.tolist()))
s_cols = list(filter(lambda x : x.startswith("S"), train_sample.columns.tolist()))
p_cols = list(filter(lambda x : x.startswith("P"), train_sample.columns.tolist()))
b_cols = list(filter(lambda x : x.startswith("B"), train_sample.columns.tolist()))
r_cols = list(filter(lambda x : x.startswith("R"), train_sample.columns.tolist()))

print(f"number of 'd's : {len(d_cols)}\n",
      f"number of 's's : {len(s_cols)}\n", 
      f"number of 'p's : {len(p_cols)}\n", 
      f"number of 'b's : {len(b_cols)}\n", 
      f"number of 'r's : {len(r_cols)}\n")
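
The same grouping can be done in a single pass over the column names. This is only a sketch: the column list below is a hypothetical subset, and it assumes the feature names follow the pattern `<TYPE>_<number>`.

```python
from collections import defaultdict

# Hypothetical column names following the assumed "<TYPE>_<number>" pattern.
columns = ["customer_ID", "S_2", "P_2", "D_39", "B_1", "R_1", "S_3", "D_41"]

cols_by_type = defaultdict(list)
for col in columns:
    prefix, _, suffix = col.partition("_")
    # Keep only the five feature families; customer_ID does not match.
    if prefix in {"D", "S", "P", "B", "R"} and suffix.isdigit():
        cols_by_type[prefix].append(col)

for col_type, names in cols_by_type.items():
    print(col_type, len(names), names)
```

Note that the date column `S_2` also matches the `S` prefix here, just as it does with `startswith("S")`, so it still has to be excluded separately.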

In [None]:
import warnings
warnings.filterwarnings('ignore')

user_info = sample_df[['customer_ID', 'target']]
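ts_info = sample_df[['S_2']].copy() # copy so that we do not modify a view of sample_df
ts_info['S_2'] = pd.to_datetime(ts_info['S_2']) # unit-less astype('datetime64') is rejected by pandas 2.x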

sample_df_d = sample_df.loc[:,d_cols]
sample_df_d = pd.concat([ts_info, user_info, sample_df_d], axis=1)
sample_df_d.set_index('S_2', inplace=True)
sample_df_d.sort_index(inplace=True)

sample_df_s = sample_df.loc[:,s_cols[1:]]
sample_df_s = pd.concat([ts_info, user_info, sample_df_s], axis=1)
sample_df_s.set_index('S_2', inplace=True)
sample_df_s.sort_index(inplace=True)

sample_df_p = sample_df.loc[:,p_cols]
sample_df_p = pd.concat([ts_info, user_info, sample_df_p], axis=1)
sample_df_p.set_index('S_2', inplace=True)
sample_df_p.sort_index(inplace=True)

sample_df_b = sample_df.loc[:,b_cols]
sample_df_b = pd.concat([ts_info, user_info, sample_df_b], axis=1)
sample_df_b.set_index('S_2', inplace=True)
sample_df_b.sort_index(inplace=True)

sample_df_r = sample_df.loc[:,r_cols]
sample_df_r = pd.concat([ts_info, user_info, sample_df_r], axis=1)
sample_df_r.set_index('S_2', inplace=True)
sample_df_r.sort_index(inplace=True)

print(f"[sample_df_d] number of rows : {len(sample_df_d)}\n",
      f"[sample_df_d] number of cols : {len(sample_df_d.columns)}\n",
      
      f"[sample_df_s] number of rows : {len(sample_df_s)}\n",             
      f"[sample_df_s] number of cols : {len(sample_df_s.columns)}\n", 
      
      f"[sample_df_p] number of rows : {len(sample_df_p)}\n", 
      f"[sample_df_p] number of cols : {len(sample_df_p.columns)}\n", 
      
      f"[sample_df_b] number of rows : {len(sample_df_b)}\n", 
      f"[sample_df_b] number of cols : {len(sample_df_b.columns)}\n", 
      
      f"[sample_df_r] number of rows : {len(sample_df_r)}\n",
      f"[sample_df_r] number of cols : {len(sample_df_r.columns)}\n")
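
The five blocks above repeat the same steps, so they could be folded into one helper. The following is a sketch only: `split_by_type` and the toy frame are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import pandas as pd


def split_by_type(df, cols, date_col="S_2", id_cols=("customer_ID", "target")):
    """Return the id columns plus `cols`, indexed and sorted by statement date."""
    out = df[[date_col, *id_cols, *cols]].copy()
    out[date_col] = pd.to_datetime(out[date_col])
    return out.set_index(date_col).sort_index()


# Hypothetical toy frame with one D column and one B column.
toy = pd.DataFrame({
    "customer_ID": ["a", "b", "a"],
    "S_2": ["2017-03-02", "2017-03-01", "2017-03-03"],
    "target": [0, 1, 0],
    "D_39": [0.1, 0.2, 0.3],
    "B_1": [1.0, 2.0, 3.0],
})

toy_d = split_by_type(toy, ["D_39"])
print(toy_d)
```

With such a helper, each per-type frame would be one call, for example `split_by_type(sample_df, d_cols)`.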

Columns of the same type are likely to be correlated with one another rather than independent. When columns are highly correlated, some of them may need to be removed. To check this, let's plot the correlation matrix for each column type. With the `Blues` colormap, the darker the cell, the stronger the positive correlation.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12,12))

plt.subplot(321)
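# keep numeric columns only: D_63 and D_64 hold strings, which corr() rejects on recent pandas
sns.heatmap(sample_df_d.loc[:,d_cols].select_dtypes('number').corr(), cmap='Blues')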
plt.title('col_type : D')

plt.subplot(322)
sns.heatmap(sample_df_s.loc[:,s_cols[1:]].corr(), cmap='Blues')
plt.title('col_type : S')

plt.subplot(323)
sns.heatmap(sample_df_p.loc[:,p_cols].corr(), cmap='Blues')
plt.title('col_type : P')

plt.subplot(324)
sns.heatmap(sample_df_b.loc[:,b_cols].corr(), cmap='Blues')
plt.title('col_type : B')

plt.subplot(325)
sns.heatmap(sample_df_r.loc[:,r_cols].corr(), cmap='Blues')
plt.title('col_type : R')

plt.tight_layout()
plt.show()

Given the number of variables, it is fortunate that few of them are strongly related to each other. Still, several variables are strongly correlated. Keep in mind that these may need to be removed when training a machine learning model later (not in this chapter).
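
Reading strongly correlated pairs off a heatmap is error-prone, so it can help to list them. The sketch below does this on synthetic data; the column names and the 0.9 threshold are assumptions chosen for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "D_a": base,
    "D_b": base * 2.0 + rng.normal(scale=0.05, size=500),  # nearly a copy of D_a
    "D_c": rng.normal(size=500),                            # unrelated column
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# 0.9 is an arbitrary illustrative threshold.
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

For each listed pair, one of the two columns is a candidate for removal before modeling.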

## Step 3. Check Outliers

Since the task is to predict whether the target is 0 or 1, it is worth checking the proportion of 1s for each date. A date on which 1s are unusually frequent or unusually rare can be treated as an outlier, and such outliers may be helpful for our analysis.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,4))
daily_cnt_df = pd.DataFrame(sample_df_d.resample('D')['target'].value_counts())
daily_cnt_df.columns = ['count']
daily_cnt_df.reset_index(inplace=True)

plt.subplot(211)
data_0 = daily_cnt_df[daily_cnt_df['target']==0]
sns.lineplot(data=data_0, x='S_2', y='count', linewidth=0.8, alpha=0.8, label='target:0')
data_1 = daily_cnt_df[daily_cnt_df['target']==1]
sns.lineplot(data=data_1, x='S_2', y='count', linewidth=0.8, alpha=0.8, label='target:1')
plt.title('count of targets')

plt.subplot(212)
data_ratio_1 = data_1.set_index('S_2')['count'] / (data_1.set_index('S_2')['count'] + data_0.set_index('S_2')['count'])
sns.lineplot(data=data_ratio_1, linewidth=0.8, alpha=0.8, color='g')
plt.axhspan(ymin=data_ratio_1.mean()-3*data_ratio_1.std(), ymax=data_ratio_1.mean()+3*data_ratio_1.std(), alpha=0.25, color='pink')
v_idx = data_ratio_1[(data_ratio_1 > data_ratio_1.mean()+3*data_ratio_1.std()) | (data_ratio_1 < data_ratio_1.mean()-3*data_ratio_1.std())]
for i in v_idx.index:
    plt.axvline(x=i, color='r', linestyle='--', alpha=0.7, linewidth=0.95)
plt.ylabel('ratio')
plt.title('ratio of target(1)')

plt.tight_layout()
plt.show()

The pink band on the `ratio of target(1)` graph marks values that fall within 3 standard deviations of the mean. For approximately normally distributed data, this band contains more than 99% of the values, so dates whose ratio falls outside it can be judged as outliers. Those dates are marked with red dashed lines.

In [None]:
v_idx # outliers
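
A more compact way to get the daily ratio is to take the mean of the 0/1 target per day, since the mean of a binary column is the share of 1s. This also sidesteps a subtlety of dividing two count series: a date missing from either series would be aligned to NaN. The sketch below uses synthetic data (60 hypothetical days with one deliberately unusual date) and applies the same 3-standard-deviation rule.

```python
import numpy as np
import pandas as pd

# Hypothetical statements: 60 days, 20 rows per day, roughly 25% positives.
rng = np.random.default_rng(0)
dates = pd.date_range("2017-03-01", periods=60, freq="D").repeat(20)
toy = pd.DataFrame({"target": rng.binomial(1, 0.25, size=len(dates))}, index=dates)

# Force one unusual day: every row on this date has target 1.
unusual_day = pd.Timestamp("2017-03-15")
toy.loc[toy.index == unusual_day, "target"] = 1

# Mean of a 0/1 column = share of 1s; a day with no 1s gives 0.0 rather than NaN.
daily_ratio = toy["target"].resample("D").mean()

# Flag days more than 3 standard deviations away from the mean ratio.
mean, std = daily_ratio.mean(), daily_ratio.std()
outlier_days = daily_ratio[(daily_ratio - mean).abs() > 3 * std]
print(outlier_days)
```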

## Step 4. PCA (Focus on 'D' type)

In this tutorial, we analyze the 'D' type columns as an example of the five column types separated above.

Among the categorical variables specified in the guide, those of type D are as follows.

In [None]:
cat_cols = ['D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

PCA requires data without missing values. However, dropping every row with a missing value would discard too much information, so we fill the missing values with a single constant: the median of the column means of the D-type data.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
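fill_value = sample_df_d[d_cols].mean(numeric_only=True).median() # median of the column means; numeric_only skips string columns
decomposed_data = pca.fit_transform(sample_df_d.drop(cat_cols,axis=1).iloc[:,2:].fillna(fill_value))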
decomposed_df = pd.DataFrame(data=decomposed_data, columns=['x1','y1'])
decomposed_df
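
Before interpreting a two-component projection, it is worth checking how much of the total variance the two components retain. A fitted scikit-learn `PCA` exposes this as `explained_variance_ratio_`. The data below is synthetic and only illustrates the check; it says nothing about the actual D-type columns.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 rows, 5 features driven mostly by two latent directions.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.1, size=(200, 5))

pca = PCA(n_components=2)
coords = pca.fit_transform(X)

# Share of total variance captured by each component, and by both together.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

A low combined ratio would mean the scatter plot hides most of the structure in the data.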

Now all D-type data has been reduced to two-dimensional data. Let's add target information here.

In [None]:
pd.concat([decomposed_df, sample_df_d['target'].reset_index(drop=True)], axis=1)

When this data is spread out on a two-dimensional plane, it looks like the following.

In [None]:
decomposed_2d = pd.concat([decomposed_df, sample_df_d['target'].reset_index(drop=True)], axis=1)
sns.scatterplot(data=decomposed_2d, x='x1', y='y1', hue='target') 
plt.title('decomposed 2d')
plt.show()

The x-axis values appear to collapse onto a few discrete values. This may be because the columns have different scales, so columns with large values dominate the components. Let's standardize the data and run the dimensionality reduction again.

In [None]:
num_d = sample_df_d.drop(cat_cols,axis=1).iloc[:,2:]
scaled_d = (num_d - num_d.mean()) / num_d.std()
scaled_d.describe()
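
To see why scale matters for PCA, consider two independent columns whose spreads differ by a factor of 1000. The sketch below uses synthetic data with hypothetical column names: without scaling, the first component is almost entirely the large column, while after standardization the variance is split roughly evenly.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "small": rng.normal(scale=1.0, size=1000),
    "large": rng.normal(scale=1000.0, size=1000),
})

# PCA on raw values: the large-scale column dominates the first component.
raw_ratio = PCA(n_components=2).fit(toy).explained_variance_ratio_

# Same standardization as in the cell above (pandas std uses ddof=1).
scaled = (toy - toy.mean()) / toy.std()
scaled_ratio = PCA(n_components=2).fit(scaled).explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```

scikit-learn's `StandardScaler` performs a similar transformation but divides by the population standard deviation (ddof=0), so its output differs very slightly from the pandas version.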

In [None]:
pca = PCA(n_components=2)
decomposed_scaled_data = pca.fit_transform(scaled_d.fillna(scaled_d.mean().median()))
decomposed_scaled_df = pd.DataFrame(data=decomposed_scaled_data, columns=['x1','y1'])
decomposed_scaled_df
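
The fill, scale, and reduce steps can also be chained in a scikit-learn pipeline. Note that this sketch swaps in a different imputation choice from the cell above: `SimpleImputer(strategy="median")` fills each column with its own median rather than with one constant for all columns. The toy frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(300, 4)), columns=["D_a", "D_b", "D_c", "D_d"])
# Blank out roughly 10% of the values to mimic missing entries.
toy = toy.mask(rng.random(toy.shape) < 0.1)

pipeline = make_pipeline(
    SimpleImputer(strategy="median"),  # per-column median, unlike the single constant above
    StandardScaler(),
    PCA(n_components=2),
)
coords = pipeline.fit_transform(toy)
print(coords.shape)
```

Keeping the steps in one object makes it easier to apply exactly the same transformation to other data later via `pipeline.transform`.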

In [None]:
decomposed_scaled_2d = pd.concat([decomposed_scaled_df, 
                                  sample_df_d['target'].reset_index(drop=True)], axis=1)
sns.scatterplot(data=decomposed_scaled_2d, x='x1', y='y1', hue='target') 
plt.title('decomposed_scaled_2d')

plt.grid(alpha=0.4)
plt.show()

Now the components are spread out as continuous values. Target 0 and target 1 can be distinguished to some extent using only the D-type variables.

Next, let's check where the outlier dates extracted earlier are located.

In [None]:
decomposed_scaled_2d = pd.concat([decomposed_scaled_df, 
                                  sample_df_d['target'].reset_index(drop=True), 
                                  sample_df_d.reset_index()['S_2'].isin(v_idx.index)], axis=1)
sns.scatterplot(data=decomposed_scaled_2d, x='x1', y='y1', color='grey', alpha=0.2)
sns.scatterplot(data=decomposed_scaled_2d[decomposed_scaled_2d['S_2']==True], x='x1', y='y1', hue='target')
plt.xlim((decomposed_scaled_2d['x1'].min(), decomposed_scaled_2d['x1'].max()))
plt.ylim((decomposed_scaled_2d['y1'].min(), decomposed_scaled_2d['y1'].max()))
plt.title('decomposed_scaled_2d')

plt.grid(alpha=0.4)
plt.show()

The points that fall on outlier dates are scattered irregularly and are not clustered in a specific location. This means that it is difficult to learn anything about the outliers from the D-type data alone.

That concludes Chapter 1 of this tutorial. By analyzing the S-type, P-type, and other columns in the same way, we can determine which type of data best separates the target categories and gradually select the variables we want to focus on. In this tutorial, we only checked the type D variables (columns), but I hope you will check the other types in the same way and share your insights in the comments.

Thanks for joining the tutorial!