<a href="https://colab.research.google.com/github/sankardevisharath/amex-default-prediction/blob/master/notebooks/explore_columns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Explore Dataset Column Wise

## Load Data From Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%mkdir data
%cd data
%mkdir raw
%cd raw

/content/data
/content/data/raw


In [3]:
!cp /content/drive/MyDrive/amex-default-prediction/data/raw/amex-default-prediction.zip .

In [4]:
!unzip amex-default-prediction.zip train_data.csv

Archive:  amex-default-prediction.zip
  inflating: train_data.csv          


In [5]:
!unzip amex-default-prediction.zip train_labels.csv

Archive:  amex-default-prediction.zip
  inflating: train_labels.csv        


## Setup Environment

In [71]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt

In [7]:
TRAIN_DATA_PATH = '/content/data/raw/train_data.csv'
TRAIN_LABELS_PATH = '/content/data/raw/train_labels.csv'

## Explore Column Metadata

Load the train labels into a dataframe.

In [44]:
train_labels = pd.read_csv(TRAIN_LABELS_PATH)

In [45]:
train_labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458913 entries, 0 to 458912
Data columns (total 2 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   customer_ID  458913 non-null  object
 1   target       458913 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 7.0+ MB


The train_labels file has two columns: customer_ID, which identifies the customer, and target, the label, stored as int64. Neither column has missing values.

The utility function below merges the target column into another dataframe.

In [64]:
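def append_label(source_df):
  # Join on customer_ID explicitly; the inner join drops rows that have no matching label.
  return pd.merge(left=source_df, right=train_labels, on='customer_ID', how='inner')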
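
A stricter variant of this helper could ask pandas to verify that the labels side has one row per customer. The sketch below is only an illustration on toy data: `append_label_checked`, `statements`, and `labels` are hypothetical names that do not appear in the notebook, and the values are made up.

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical values).
statements = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "c"],
    "S_2": ["2017-03-09", "2017-04-07", "2017-03-15", "2017-03-20"],
})
labels = pd.DataFrame({"customer_ID": ["a", "b", "c"], "target": [0, 1, 0]})


def append_label_checked(source_df, labels_df):
    """Merge labels onto source_df, failing loudly if labels_df repeats an ID."""
    return pd.merge(
        left=source_df,
        right=labels_df,
        on="customer_ID",
        how="inner",
        validate="many_to_one",  # raises MergeError if labels_df has duplicate IDs
    )


labelled = append_label_checked(statements, labels)
print(labelled.shape)  # row count is unchanged, one column is added
```

With `validate="many_to_one"`, a duplicated customer_ID in the labels frame raises an error instead of silently multiplying rows.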

Read 10 rows from the training data to check the column names and dtypes.

In [30]:
pd.read_csv(TRAIN_DATA_PATH, nrows=10).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Columns: 190 entries, customer_ID to D_145
dtypes: float64(185), int64(1), object(4)
memory usage: 15.0+ KB


There are 185 float64 columns, 4 object columns, and 1 int64 column. Next, store each column's name and dtype in a variable.

In [28]:
columns = pd.read_csv(TRAIN_DATA_PATH, nrows=10).dtypes

In [33]:
columns[columns==object]

customer_ID    object
S_2            object
D_63           object
D_64           object
dtype: object
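
As an alternative to filtering the dtypes Series by hand, `select_dtypes` can split the columns by kind in one step. This is a minimal sketch on a tiny made-up CSV; the column names mirror the dataset, but the values and the `csv_text` and `sample` names are assumptions for illustration only.

```python
import io

import numpy as np
import pandas as pd

# Tiny hypothetical CSV with the same kinds of columns as train_data.csv.
csv_text = (
    "customer_ID,S_2,P_2,D_63,B_31\n"
    "abc,2017-03-09,0.93,CR,1\n"
    "abc,2017-04-07,0.91,CO,1\n"
)
sample = pd.read_csv(io.StringIO(csv_text), nrows=10)

# Non-numeric columns (IDs, dates stored as text, categorical codes).
text_cols = sample.select_dtypes(exclude="number").columns.tolist()
# Integer and floating-point columns, selected via NumPy's abstract types.
int_cols = sample.select_dtypes(include=[np.integer]).columns.tolist()
float_cols = sample.select_dtypes(include=[np.floating]).columns.tolist()

print(text_cols, int_cols, float_cols)
```
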

The columns with dtype object are customer_ID, S_2, D_63, and D_64. Let's check each of them.

## Customer ID, Statement Date & Label


In [54]:
cust_id_stmt_date_df = pd.read_csv(TRAIN_DATA_PATH, usecols=['customer_ID', 'S_2'])

In [58]:
cust_id_stmt_date_df.shape

(5531451, 2)

In [67]:
cust_id_stmt_date_df = append_label(cust_id_stmt_date_df)

In [68]:
cust_id_stmt_date_df.shape

(5531451, 3)

In [89]:
cust_id_stmt_date_df.head(5)

|   | customer_ID | S_2 | target |
|---|---|---|---|
| 0 | 0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f... | 2017-03-09 | 0 |
| 1 | 0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f... | 2017-04-07 | 0 |
| 2 | 0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f... | 2017-05-28 | 0 |
| 3 | 0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f... | 2017-06-13 | 0 |
| 4 | 0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f... | 2017-07-16 | 0 |


In [72]:
cust_id_stmt_date_df['customer_ID'].nunique()

458913

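The training data contains 458,913 unique customers, which matches the 458,913 rows in train_labels. With 5,531,451 statement rows, that is about 12 statements per customer on average.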
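
To look at how statements are distributed across customers rather than only the unique count, a group-by on customer_ID is one option. The sketch below uses a small made-up frame; `df` and `statements_per_customer` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy data: three customers with 3, 2 and 1 statements (hypothetical values).
df = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b", "c"],
    "S_2": ["2017-03-09", "2017-04-07", "2017-05-28",
            "2017-03-15", "2017-04-15", "2017-03-20"],
})

# Number of statement rows per customer.
statements_per_customer = df.groupby("customer_ID").size()

# Summary statistics (min, max, mean, quartiles) of the per-customer counts.
print(statements_per_customer.describe())
```

On the real data, the same two lines applied to `cust_id_stmt_date_df` would show whether every customer has a similar number of statements or whether some histories are shorter.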

In [74]:
cust_id_stmt_date_df.isna().sum()

customer_ID    0
S_2            0
target         0
dtype: int64

None of these three columns contains missing (NaN) values.

In [85]:
cust_id_len = cust_id_stmt_date_df.customer_ID.str.len().max()
print(f'Maximum size of value in customer_id column is {cust_id_len}')
print(f'Total size of column customer_id is {round(cust_id_len * cust_id_stmt_date_df.shape[0]/(1024 * 1024), 2)} MB')

Maximum size of value in customer_id column is 64
Total size of column customer_id is 337.61 MB


In [87]:
s_2_len = cust_id_stmt_date_df.S_2.str.len().max()
print(f'Maximum size of value in S_2 column is {s_2_len}')
print(f'Total size of column S_2 is {round(s_2_len * cust_id_stmt_date_df.shape[0]/(1024 * 1024), 2)} MB')

Maximum size of value in S_2 column is 10
Total size of column S_2 is 52.75 MB


In [88]:
print(f'Total size of column target is {round(8 * cust_id_stmt_date_df.shape[0]/(1024 * 1024), 2)} MB')

Total size of column target is 42.2 MB
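
The figures above multiply the longest string length by the row count, which in effect assumes one byte per character. For object columns, pandas stores a pointer to a separate Python string for every value, so the in-memory footprint is typically larger than that estimate. A rough way to compare the two, sketched on a made-up three-row frame (`df`, `char_estimate`, and `deep_bytes` are hypothetical names):

```python
import pandas as pd

# Toy frame with the same three columns; object dtype is set explicitly
# so the strings are stored the way the notebook's read_csv output shows.
df = pd.DataFrame({
    "customer_ID": pd.Series(["a" * 64, "b" * 64, "c" * 64], dtype=object),
    "S_2": pd.Series(["2017-03-09", "2017-04-07", "2017-05-28"], dtype=object),
    "target": pd.Series([0, 0, 1], dtype="int64"),
})

# Character-count estimate: longest value times number of rows.
char_estimate = df["customer_ID"].str.len().max() * len(df)

# Bytes reported by pandas per column; deep=True also counts the string objects.
deep_bytes = df.memory_usage(index=False, deep=True)

print(f"estimate from characters: {char_estimate} bytes")
print(deep_bytes)
```

For the int64 target column the two approaches agree, since each value occupies exactly 8 bytes.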


In [90]:
pd.to_datetime(cust_id_stmt_date_df["S_2"]).dtype

dtype('<M8[ns]')
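
Since S_2 parses cleanly as a date, one way to shrink these columns is to store S_2 as a datetime and customer_ID as a categorical. This is a sketch under assumptions, not a step the notebook performs: the toy frame `df` and its values are invented, and the `%Y-%m-%d` format is inferred from the sample rows shown earlier.

```python
import pandas as pd

# Toy frame: two customers with repeated 64-character IDs (hypothetical values).
df = pd.DataFrame({
    "customer_ID": ["a" * 64] * 3 + ["b" * 64] * 2,
    "S_2": ["2017-03-09", "2017-04-07", "2017-05-28", "2017-03-15", "2017-04-15"],
})

# Parse the statement date; each datetime value takes 8 bytes.
df["S_2"] = pd.to_datetime(df["S_2"], format="%Y-%m-%d")

# Store each distinct ID once and keep small integer codes per row.
df["customer_ID"] = df["customer_ID"].astype("category")

print(df.dtypes)
print(df.memory_usage(index=False, deep=True))
```

The categorical conversion pays off here because each customer_ID repeats across many statement rows.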

In [37]:
columns[columns==np.int64]

B_31    int64
dtype: object