This notebook provides the code to process the data and obtain the PyTorch tensors for the neural network model.

## Data Processing
The data is in the file "UCI_Credit_Card.csv".

- This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

- 30000 observations for the following 25 variables:
  - `ID`: ID of each client
  - `LIMIT_BAL`: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
  - `SEX`: Gender (1=male, 2=female)
  - `EDUCATION`: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
  - `MARRIAGE`: Marital status (1=married, 2=single, 3=others)
  - `AGE`: Age in years
  - `PAY_0`: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
  - `PAY_2`: Repayment status in August, 2005 (scale same as above)
  - `PAY_3`: Repayment status in July, 2005 (scale same as above)
  - `PAY_4`: Repayment status in June, 2005 (scale same as above)
  - `PAY_5`: Repayment status in May, 2005 (scale same as above)
  - `PAY_6`: Repayment status in April, 2005 (scale same as above)
  - `BILL_AMT1`: Amount of bill statement in September, 2005 (NT dollar)
  - `BILL_AMT2`: Amount of bill statement in August, 2005 (NT dollar)
  - `BILL_AMT3`: Amount of bill statement in July, 2005 (NT dollar)
  - `BILL_AMT4`: Amount of bill statement in June, 2005 (NT dollar)
  - `BILL_AMT5`: Amount of bill statement in May, 2005 (NT dollar)
  - `BILL_AMT6`: Amount of bill statement in April, 2005 (NT dollar)
  - `PAY_AMT1`: Amount of previous payment in September, 2005 (NT dollar)
  - `PAY_AMT2`: Amount of previous payment in August, 2005 (NT dollar)
  - `PAY_AMT3`: Amount of previous payment in July, 2005 (NT dollar)
  - `PAY_AMT4`: Amount of previous payment in June, 2005 (NT dollar)
  - `PAY_AMT5`: Amount of previous payment in May, 2005 (NT dollar)
  - `PAY_AMT6`: Amount of previous payment in April, 2005 (NT dollar)
  - `default.payment.next.month`: Default payment (1=yes, 0=no)


In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.utils.data as Data
import torch.optim as optim
 
# Jupyter command to automatically show figures from matplotlib
%matplotlib inline 

### Import data
We use pandas to read the data in the CSV file into a ``DataFrame``.

In [18]:
df = pd.read_csv('UCI_Credit_Card.csv')
df.head(3)

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0


We delete the "ID" feature and shuffle the rows using the `sample` method.

In [19]:
df = df.drop(["ID"], axis=1)
df = df.sample(frac=1)
df.head(3)

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
25722,210000.0,2,2,1,30,0,0,0,0,0,...,140518.0,70901.0,72539.0,5404.0,6230.0,5696.0,2328.0,2567.0,2531.0,0
14084,200000.0,2,2,2,36,2,2,2,2,2,...,79781.0,81533.0,80243.0,4000.0,0.0,5500.0,3000.0,0.0,6100.0,1
17632,60000.0,2,2,2,30,0,0,0,0,0,...,12469.0,6060.0,5382.0,3160.0,2604.0,2160.0,4060.0,3382.0,3751.0,0
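
Because `sample(frac=1)` draws a new permutation on every run, the rows that later end up in the training and test sets change between runs. The following is a minimal sketch of a reproducible shuffle, assuming that is desired; the toy frame and the `random_state` value of 0 are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the credit-card data (hypothetical values).
toy = pd.DataFrame({"LIMIT_BAL": [20000.0, 120000.0, 90000.0, 50000.0],
                    "default.payment.next.month": [1, 1, 0, 0]})

# random_state fixes the permutation, so repeated calls give the same order.
shuffled_a = toy.sample(frac=1, random_state=0)
shuffled_b = toy.sample(frac=1, random_state=0)

# The original row labels travel with the rows; reset_index(drop=True)
# renumbers them 0..n-1 when the old labels are no longer needed.
shuffled_reset = shuffled_a.reset_index(drop=True)

print(shuffled_a.index.tolist() == shuffled_b.index.tolist())
print(shuffled_reset.index.tolist())
```

The shuffled `head` output in the notebook keeps the original row labels (for example 25722) for the same reason: `sample` reorders rows but does not renumber them.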


There are 6636 (22.1%) "yes" observations and 23364 (77.9%) "no" observations in the dataset, so the two classes are imbalanced.

In [20]:
print("Yes data:", sum(df["default.payment.next.month"] == 1))
print("No data:", sum(df["default.payment.next.month"] == 0))

Yes data: 6636
No data: 23364
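
The class shares follow directly from the two printed counts. A small sketch of the arithmetic, using only those counts (the variable names are illustrative):

```python
# Counts taken from the printed output of the cell above.
n_yes = 6636
n_no = 23364
n_total = n_yes + n_no

# Percentage of each class among all observations.
share_yes = 100 * n_yes / n_total
share_no = 100 * n_no / n_total

print(f"Yes: {share_yes:.1f}%  No: {share_no:.1f}%")  # Yes: 22.1%  No: 77.9%
```

On the DataFrame itself, `df["default.payment.next.month"].value_counts(normalize=True)` returns the same fractions.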


#### Data Standardization
Some feature values are quite large, which can slow down or destabilize the training of a neural network, so we standardize each of these columns 
to have mean 0 and variance 1.

In [21]:
columns_to_standardize = ["LIMIT_BAL", "AGE", 
                          "BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6", 
                          "PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5", "PAY_AMT6",
                          "PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]

df[columns_to_standardize] = (df[columns_to_standardize] - np.mean(df[columns_to_standardize], 0)) / np.std(df[columns_to_standardize], 0)   
# The second argument of np.mean and np.std is axis. axis=0 gives one statistic per column (reducing over rows); axis=1 gives one per row.

In [22]:
df.head(5)

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
25722,0.327685,2,2,1,-0.595102,0.014861,0.111736,0.138865,0.188746,0.234917,...,1.511773,0.50315,0.565331,-0.015672,0.013404,0.026713,-0.15946,-0.146117,-0.151008,0
14084,0.250611,2,2,2,0.055816,1.794564,1.782348,1.809921,1.899436,1.999879,...,0.567652,0.67803,0.694695,-0.10044,-0.25699,0.01558,-0.116564,-0.314136,0.049755,1
17632,-0.828424,2,2,2,-0.595102,0.014861,0.111736,0.138865,0.188746,0.234917,...,-0.478674,-0.563381,-0.562351,-0.151155,-0.143971,-0.17412,-0.048901,-0.092773,-0.082381,0
26255,-0.828424,2,2,1,-0.703588,0.014861,0.111736,0.138865,0.188746,0.234917,...,0.120672,-0.264793,-0.563812,-0.20984,-0.157816,-0.215639,-0.297275,-0.286253,1.357612,0
29291,0.096463,1,1,2,-0.920561,0.014861,0.111736,0.138865,0.188746,-0.647565,...,-0.182133,1.659276,1.679312,2.073079,1.348879,-0.111076,12.138654,0.062417,-0.012122,0
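
Standardization replaces each value $x$ in a column by $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are that column's mean and standard deviation. The sketch below checks this on a small NumPy array; the array values are hypothetical and only illustrate the per-column computation.

```python
import numpy as np

# Hypothetical toy matrix: 4 rows, 2 feature columns on very different scales.
x = np.array([[20000.0, 24.0],
              [120000.0, 26.0],
              [90000.0, 34.0],
              [50000.0, 37.0]])

# axis=0 computes one mean and one standard deviation per column.
col_mean = np.mean(x, axis=0)
col_std = np.std(x, axis=0)   # population standard deviation (ddof=0)

# Broadcasting applies the column statistics to every row.
z = (x - col_mean) / col_std

print(np.allclose(z.mean(axis=0), 0.0))  # True
print(np.allclose(z.std(axis=0), 1.0))   # True
```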


#### One-hot Transform

We need to transform some categorical features to one-hot vectors.

In [23]:
columns_to_one_hot = ["SEX", "EDUCATION", "MARRIAGE"]
one_hot_vectors = []
for column in columns_to_one_hot:
    column_tensor = torch.tensor(df[column].values)
    min_value = column_tensor.min()
    tensor_to_one_hot = column_tensor - min_value
    # print(tensor_to_one_hot.min(), tensor_to_one_hot.max())
    one_hot = torch.nn.functional.one_hot(tensor_to_one_hot)
    print(column, one_hot.shape)
    one_hot_vectors.append(one_hot)

SEX torch.Size([30000, 2])
EDUCATION torch.Size([30000, 7])
MARRIAGE torch.Size([30000, 4])


We combine all the one-hot vectors together.

In [24]:
one_hot_features = torch.cat(one_hot_vectors, axis=1)
print(one_hot_features.shape)

torch.Size([30000, 13])
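
The loop above lets `one_hot` infer the number of classes from the data. The sketch below, on hypothetical integer codes, shows how that inference works and how the `num_classes` argument of `torch.nn.functional.one_hot` can fix the width instead.

```python
import torch
import torch.nn.functional as F

# Hypothetical category codes in the style of MARRIAGE (integer-coded).
codes = torch.tensor([1, 2, 3, 1, 2])

# one_hot expects non-negative integer class indices,
# so shift the codes so that the smallest one becomes 0.
shifted = codes - codes.min()

# With num_classes left at its default, the width is inferred as max index + 1.
inferred = F.one_hot(shifted)
# Passing num_classes explicitly fixes the width regardless of the data seen.
fixed = F.one_hot(shifted, num_classes=4)

print(inferred.shape)  # torch.Size([5, 3])
print(fixed.shape)     # torch.Size([5, 4])
```

Since the width is inferred from the range of observed codes, the shapes printed by the notebook (7 columns for `EDUCATION`, 4 for `MARRIAGE`) suggest that the data contains one code outside the ranges listed in the variable description. This is an inference from the shapes, not something the notebook checks.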


We combine all the features and the label together.

In [30]:
data = torch.cat([torch.tensor(df[columns_to_standardize].values),
                  one_hot_features,
                  torch.tensor(df[["default.payment.next.month"]].values)], axis=1)
feature_num = data.shape[-1] - 1  
# data.shape[-1] shows the number of columns in data. The number of features is the number of columns minus one as we need to exclude the default variable. 

#### Data Loader

In [26]:
N_train = 20000
trainData=data[:N_train,:].float()  # Use the first 20000 observations for training; .float() casts the tensor to float32.
testData=data[N_train:,:].float()   # Use the remaining 10000 observations for testing; .float() casts the tensor to float32.
trainData.shape

torch.Size([20000, 34])

In a standard training procedure, we use ``DataLoader`` to supply one mini-batch of data to each iteration. When the dataset is very large, we should not load all of it at once, or training becomes very slow. To use ``DataLoader``, we first build a ``TensorDataset`` and then feed the dataset into the data loader.

In [27]:
trainset = Data.TensorDataset(trainData[:, 0:-1], trainData[:, -1:])
# trainData[:, 0:-1] gives the inputs (all columns except the last), while trainData[:, -1:] provides the output (the last column).
# To use PyTorch for training NNs, it is very important to match the dimensions of the inputs and the output.
# The input to the NN here is a 2D tensor. 
# The colon after "-1" is very important. It keeps trainData[:, -1:] as a 2D tensor. Otherwise it will be a 1D tensor. 
testset = Data.TensorDataset(testData[:, 0:-1], testData[:, -1:])
trainloader = Data.DataLoader(
    trainset, batch_size=200, shuffle=True, num_workers=0)
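
To see what the loader yields, the sketch below builds the same kind of `TensorDataset` and `DataLoader` on a random stand-in tensor and inspects one mini-batch. The tensor `toy` and its size of 1000 rows are assumptions made for illustration; only the column layout (33 features followed by 1 label) mirrors `trainData`.

```python
import torch
import torch.utils.data as Data

# Hypothetical stand-in for trainData: 1000 rows, 33 features + 1 label column.
toy = torch.randn(1000, 34)

# Slicing with -1 drops the column dimension; slicing with -1: keeps it.
print(toy[:, -1].shape, toy[:, -1:].shape)  # torch.Size([1000]) torch.Size([1000, 1])

dataset = Data.TensorDataset(toy[:, 0:-1], toy[:, -1:])
loader = Data.DataLoader(dataset, batch_size=200, shuffle=True, num_workers=0)

# Each iteration yields one mini-batch as an (inputs, labels) pair.
for inputs, labels in loader:
    print(inputs.shape, labels.shape)  # torch.Size([200, 33]) torch.Size([200, 1])
    break

# Number of mini-batches per epoch: 1000 rows / 200 per batch.
print(len(loader))  # 5
```

With the notebook's 20000 training rows and `batch_size=200`, the same calculation gives 100 mini-batches per epoch.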