# Assignment Chapter 2 - DEEP LEARNING [Case #4]
Startup Campus, Indonesia - `Artificial Intelligence (AI)` (Batch 7)
* Task: **CLASSIFICATION**
* DL Framework: **PyTorch**
* Dataset: Credit Card Fraud 2023
* Libraries: Pandas/cuDF, Scikit-learn/cuML, Numpy/cuPy
* Objective: Classify fraudulent credit card transactions using a Multilayer Perceptron

`PREREQUISITES` All modules (including the matching versions) are installed correctly.
<br>`HOW TO WORK` Complete the lines of code marked with **#TODO**.
<br>`PORTFOLIO TARGET` Participants are able to classify fraudulent transactions using a *Multilayer Perceptron*.

### Import Libraries

In [None]:
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

fatal: destination path 'rapidsai-csp-utils' already exists and is not an empty directory.
Installing RAPIDS remaining 24.6.* libraries
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com

        ***********************************************************************
        The pip install of RAPIDS is complete.
        
        Please do not run any further installation from the conda based installation methods, as they may cause issues!
        
        Please ensure that you're pulling from the git repo to remain updated with the latest working install scripts.

        Troubleshooting:
            - If there is an installation failure, please check back on RAPIDSAI owned templates/notebooks to see how to update your personal files. 
            - If an installation failure persists when using the latest script, please make an issue on https://github.com/rapidsai-community/rapidsai-csp-utils
        *****************************************************************

In [None]:
import shutil
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

<font color="red">**- - - - PLEASE NOTE - - - -**</font>
<br>**Enable the GPU now.** In Google Colab, click **Runtime > Change Runtime Type**, then select **T4 GPU**.

### Dataset Loading (CPU vs. GPU)

In [None]:
from pandas import read_csv as read_by_CPU
from cudf import read_csv as read_by_GPU

In [None]:
# unzip the file
shutil.unpack_archive('dataset_case_04.zip', '/content/sample_data', 'zip')

In [None]:
# TODO: Import the dataset with Pandas, using the "read_by_CPU" function
%time data_cpu = read_by_CPU('/content/sample_data/creditcard_2023.csv')

CPU times: user 6 s, sys: 281 ms, total: 6.28 s
Wall time: 8.64 s


In [None]:
# Import the dataset with cuDF (Pandas on the GPU)
%time data_gpu = read_by_GPU('/content/sample_data/creditcard_2023.csv')
print(data_gpu)

CPU times: user 304 ms, sys: 365 ms, total: 669 ms
Wall time: 764 ms
            id        V1        V2        V3        V4        V5        V6  \
0            0 -0.260648 -0.469648  2.496266 -0.083724  0.129681  0.732898   
1            1  0.985100 -0.356045  0.558056 -0.429654  0.277140  0.428605   
2            2 -0.260272 -0.949385  1.728538 -0.457986  0.074062  1.419481   
3            3 -0.152152 -0.508959  1.746840 -1.090178  0.249486  1.143312   
4            4 -0.206820 -0.165280  1.527053 -0.448293  0.106125  0.530549   
...        ...       ...       ...       ...       ...       ...       ...   
568625  568625 -0.833437  0.061886 -0.899794  0.904227 -1.002401  0.481454   
568626  568626 -0.670459 -0.202896 -0.068129 -0.267328 -0.133660  0.237148   
568627  568627 -0.311997 -0.004095  0.137526 -0.035893 -0.042291  0.121098   
568628  568628  0.636871 -0.516970 -0.300889 -0.144480  0.131042 -0.294148   
568629  568629 -0.795144  0.433236 -0.649140  0.374732 -0.244976 -0.60349

In [None]:
# TODO: Drop the ID column
data_gpu = data_gpu.drop(columns=['id'])
print(data_gpu)

              V1        V2        V3        V4        V5        V6        V7  \
0      -0.260648 -0.469648  2.496266 -0.083724  0.129681  0.732898  0.519014   
1       0.985100 -0.356045  0.558056 -0.429654  0.277140  0.428605  0.406466   
2      -0.260272 -0.949385  1.728538 -0.457986  0.074062  1.419481  0.743511   
3      -0.152152 -0.508959  1.746840 -1.090178  0.249486  1.143312  0.518269   
4      -0.206820 -0.165280  1.527053 -0.448293  0.106125  0.530549  0.658849   
...          ...       ...       ...       ...       ...       ...       ...   
568625 -0.833437  0.061886 -0.899794  0.904227 -1.002401  0.481454 -0.370393   
568626 -0.670459 -0.202896 -0.068129 -0.267328 -0.133660  0.237148 -0.016935   
568627 -0.311997 -0.004095  0.137526 -0.035893 -0.042291  0.121098 -0.070958   
568628  0.636871 -0.516970 -0.300889 -0.144480  0.131042 -0.294148  0.580568   
568629 -0.795144  0.433236 -0.649140  0.374732 -0.244976 -0.603493 -0.347613   

              V8        V9       V10  .

### Standardization (CPU vs. GPU)

In [None]:
from sklearn.preprocessing import StandardScaler as StandardScalerCPU
from cuml.preprocessing import StandardScaler as StandardScalerGPU

In [None]:
ScalerCPU = StandardScalerCPU()
ScalerGPU = StandardScalerGPU()

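# Note: range(27) produces V1..V27 only. X has 29 columns (V1..V28 plus Amount),
# so V28 is left unscaled here; use range(28) to include it.
arbitrary_features = ["V"+str(i+1) for i in range(27)]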

In [None]:
%%time

data_cpu[arbitrary_features] = ScalerCPU.fit_transform(data_cpu[arbitrary_features].values)
data_cpu["Amount"] = ScalerCPU.fit_transform(data_cpu["Amount"].values.reshape(-1, 1)).squeeze()

CPU times: user 237 ms, sys: 143 ms, total: 380 ms
Wall time: 428 ms


In [None]:
%%time

data_gpu[arbitrary_features] = ScalerGPU.fit_transform(data_gpu[arbitrary_features].values)
data_gpu["Amount"] = ScalerGPU.fit_transform(data_gpu["Amount"].values.reshape(-1, 1)).squeeze()

CPU times: user 825 ms, sys: 253 ms, total: 1.08 s
Wall time: 1.12 s
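
As a rough illustration of what both scalers compute, the sketch below standardizes one column with the standard library only: each value has the column mean subtracted and is then divided by the population standard deviation (the convention scikit-learn's `StandardScaler` uses). The `standardize` helper and the sample amounts are made up for this example and are not part of the assignment.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column has no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]

amounts = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical transaction amounts
scaled = standardize(amounts)
print([round(z, 4) for z in scaled])
```

After scaling, the column has mean 0 and standard deviation 1, which keeps features with large raw ranges (such as `Amount`) from dominating the others.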


### Train/Test Split (CPU vs. GPU)

In [None]:
from sklearn.model_selection import train_test_split as splitCPU
from cuml.model_selection import train_test_split as splitGPU

In [None]:
# TODO: Define X (features) and Y (target), using "data_gpu"

X = data_gpu.iloc[:, :-1]
Y = data_gpu.iloc[:, -1]

print("X shape: ", X.shape)
print("Y shape: ", Y.shape)

X shape:  (568630, 29)
Y shape:  (568630,)


In [None]:
%%time

# TODO: Split the dataset into an 80% train set and a 20% test set with the "splitCPU" function
test_size = 0.2
random_state = 0
x_train, x_test, y_train, y_test = splitCPU(X, Y, test_size=test_size, random_state=random_state)

print("x_train shape: ", x_train.shape)
print("x_test shape: ", x_test.shape)

x_train shape:  (454904, 29)
x_test shape:  (113726, 29)
CPU times: user 53.2 ms, sys: 20.7 ms, total: 73.9 ms
Wall time: 83.2 ms


In [None]:
%%time

# TODO: Do the same data splitting, but with the "splitGPU" function
test_size = 0.2
random_state = 0
x_train, x_test, y_train, y_test = splitGPU(X, Y, test_size=test_size, random_state=random_state)

print("x_train shape: ", x_train.shape)
print("x_test shape: ", x_test.shape)

x_train shape:  (454904, 29)
x_test shape:  (113726, 29)
CPU times: user 123 ms, sys: 39.8 ms, total: 163 ms
Wall time: 171 ms
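
The idea behind both split functions can be sketched with the standard library: shuffle the row indices with a fixed seed, then cut off the requested share as the test set. This is only an illustration under that assumption, not how scikit-learn or cuML implement it; `split_indices` is a hypothetical helper.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed and cut them into train/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)
    # The first n_test shuffled indices form the test set, the rest the train set
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state`: running the split again yields the same partition, so results stay comparable between runs.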


### Convert the dataset into Tensor

In [None]:
import cupy # Numpy for GPU

In [None]:
torch.cuda.is_available()

True

In [None]:
# TODO: Use the GPU (CUDA) as the training device
device = torch.device('cuda')

In [None]:
# Convert the cuDF DataFrames/Series into PyTorch tensors
x_train_tensor = torch.tensor(x_train.to_numpy()).to(device)  # to_numpy() copies the cuDF data to host memory first
y_train_tensor = torch.tensor(y_train.to_numpy()).to(device)

x_test_tensor = torch.tensor(x_test.to_numpy()).to(device)
y_test_tensor = torch.tensor(y_test.to_numpy()).to(device)

Train_tensor = TensorDataset(x_train_tensor, y_train_tensor)
Test_tensor = TensorDataset(x_test_tensor, y_test_tensor)

print("Train tensor shape: ", Train_tensor.tensors[0].shape, Train_tensor.tensors[1].shape)
print("Test tensor shape: ", Test_tensor.tensors[0].shape, Test_tensor.tensors[1].shape)

Train tensor shape:  torch.Size([454904, 29]) torch.Size([454904])
Test tensor shape:  torch.Size([113726, 29]) torch.Size([113726])


### Batching the Dataset with PyTorch DataLoader



In [None]:
# TODO: Set the batch size
batch_size = 64

Train_dataset = DataLoader(Train_tensor, batch_size=batch_size, shuffle=True)
Test_dataset = DataLoader(Test_tensor, batch_size=batch_size, shuffle=False)

# Inspect one batch from the DataLoader
for batch in Train_dataset:
    x_batch, y_batch = batch
    print("X batch shape: ", x_batch.shape)  # shape of the input batch
    print("Y batch shape: ", y_batch.shape)  # shape of the target batch
    break  # stop after one batch

X batch shape:  torch.Size([64, 29])
Y batch shape:  torch.Size([64])
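
To see how many batches one epoch contains, the sketch below walks over the training rows in steps of `batch_size` using plain index ranges. It assumes the default `DataLoader` behaviour of keeping a smaller final batch; `iter_batches` is a hypothetical helper written for this illustration.

```python
import math

def iter_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs covering n_samples in order."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_train = 454904   # rows in x_train, taken from the split output
batch_size = 64

batches = list(iter_batches(n_train, batch_size))
num_batches = len(batches)
last_batch_size = batches[-1][1] - batches[-1][0]
print(num_batches, last_batch_size)
```

Because 454,904 is not a multiple of 64, the last batch is smaller than the others, which is why code that reshapes model outputs should not hard-code the batch size.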


### Model Blueprint

In [None]:
class FeedForward(nn.Module):
    def __init__(self, input_dim, num_neurons):
        super(FeedForward, self).__init__()
        self.input_dim = input_dim
        self.num_neurons = num_neurons

        self.net = nn.Sequential(
            nn.Linear(self.input_dim, self.num_neurons),
            nn.ReLU()
        )

    def forward(self, x):
        return self.net(x)

    def to(self, device):
        self.net.to(device)
        return self

class Net(nn.Module):
    def __init__(self, in_features, num_layers, num_neurons):
        super(Net, self).__init__()
        self.in_features = in_features
        self.num_layers = num_layers
        self.num_neurons = num_neurons

        self.fc1 = nn.Linear(self.in_features, self.num_neurons)
        self.relu = nn.ReLU()
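        # Note: a plain Python list does not register these blocks as submodules,
        # so their weights are absent from model.parameters() and are not updated
        # by the optimizer. Wrapping the list in nn.ModuleList would register them.
        self.blocks = [FeedForward(self.num_neurons, self.num_neurons).to(device) \
                       for _ in range(self.num_layers)]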
        self.output_layer = nn.Linear(self.num_neurons, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        output = self.relu(self.fc1(x))

        for block in self.blocks:
            output = block(output)
        output = self.sigmoid(self.output_layer(output))

        return output
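
In `Net`, `self.blocks` is a plain Python list. PyTorch only registers submodules that are assigned as attributes or held in a container such as `nn.ModuleList`, so the difference matters for `model.parameters()`. The sketch below compares the two options on stripped-down classes; `PlainListNet`, `ModuleListNet`, and `count_params` are hypothetical names used only for this illustration.

```python
import torch.nn as nn

def count_params(module):
    """Number of parameters visible through module.parameters()."""
    return sum(p.numel() for p in module.parameters())

class PlainListNet(nn.Module):
    def __init__(self, width=64, depth=3):
        super().__init__()
        # Plain Python list: the layers inside are NOT registered as submodules
        self.blocks = [nn.Linear(width, width) for _ in range(depth)]

class ModuleListNet(nn.Module):
    def __init__(self, width=64, depth=3):
        super().__init__()
        # nn.ModuleList registers every layer, so optimizers and .to() can see them
        self.blocks = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])

print(count_params(PlainListNet()))   # 0
print(count_params(ModuleListNet()))  # 12480
```

With `nn.ModuleList`, a custom `to()` override like the one in `FeedForward` is no longer needed, since `model.to(device)` moves registered submodules automatically.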

### Model Hyperparameters and Parameters

In [None]:
# [ QUESTION ]
# What is the difference between hyperparameters and parameters?

**Parameters**
- Definition: Parameters are values learned from the data during model training. They determine how the model makes predictions.
- Example: In linear regression, the coefficient (slope) and the intercept are parameters that the algorithm determines from the training data.
- How they are set: They are optimized during training by an algorithm such as gradient descent.

**Hyperparameters**
- Definition: Hyperparameters are values set before training begins. They control the training process and the model architecture, but they are not learned from the data.
- Example: The number of epochs, the batch size, the learning rate, and the number of layers and neurons in a neural network.
- How they are set: They are chosen through techniques such as grid search or random search, or from experience.

In short, parameters are learned from the data during training, while hyperparameters are fixed before training and control how the model is trained.
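
The distinction can be made concrete with a tiny linear regression trained by gradient descent. In this sketch (toy data and values invented for the example), `learning_rate` and `epochs` are hyperparameters fixed up front, while `slope` and `intercept` are parameters that the loop learns from the data.

```python
# Hyperparameters: chosen by hand before training starts
learning_rate = 0.05
epochs = 2000

# Toy data generated from y = 2x + 1 (made up for this sketch)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

# Parameters: start at arbitrary values and are learned from the data
slope, intercept = 0.0, 0.0

for _ in range(epochs):
    # Gradients of the mean squared error with respect to each parameter
    errors = [(slope * x + intercept) - y for x, y in zip(xs, ys)]
    grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_intercept = 2 * sum(errors) / len(xs)
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

print(round(slope, 3), round(intercept, 3))
```

Changing `learning_rate` or `epochs` changes how the search proceeds, but the learned `slope` and `intercept` always come from the data itself.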

In [None]:
# TODO: Set the hyperparameters
epochs = 50
num_layers = 3
num_neurons = 64
learning_rate = 0.001

In [None]:
# TODO: Set the input size for the model
num_inputs = X.shape[1]

model = Net(in_features=num_inputs, num_layers=num_layers, num_neurons=num_neurons)
model = model.to(device)

In [None]:
# Set the optimizer and loss function
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.BCELoss()
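
Binary cross-entropy, the quantity `nn.BCELoss` averages over a batch, can be sketched in plain Python. The `eps` clamp below is an assumption of this sketch to keep `log()` finite; it is not a description of PyTorch's internal handling, and `binary_cross_entropy` is a hypothetical helper.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

confident_right = binary_cross_entropy([0.99, 0.01], [1, 0])
confident_wrong = binary_cross_entropy([0.01, 0.99], [1, 0])
print(round(confident_right, 4), round(confident_wrong, 4))
```

Confident correct predictions give a loss near zero, while confident wrong ones are penalized heavily, which is what pushes the sigmoid output toward the true label during training.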

In [None]:
# Check the number of parameters
print("Number of parameters: {:,}".format(sum(p.numel() for p in model.parameters())))
print("Number of trainable parameters: {:,}".format(sum(p.numel() for p in model.parameters() if p.requires_grad)))

Number of parameters: 1,985
Number of trainable parameters: 1,985
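
The reported count can be checked by hand: a fully connected layer has one weight per input/output pair plus one bias per output. The sketch below, using a hypothetical `linear_params` helper, suggests that 1,985 corresponds to `fc1` and `output_layer` alone, and shows what the total would be if the three hidden blocks (kept in a plain Python list in `Net`) were counted as well.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

in_features, num_neurons, num_layers = 29, 64, 3

fc1 = linear_params(in_features, num_neurons)                         # 29*64 + 64
output_layer = linear_params(num_neurons, 1)                          # 64*1 + 1
hidden_blocks = num_layers * linear_params(num_neurons, num_neurons)  # 3 * (64*64 + 64)

print("fc1 + output_layer:", fc1 + output_layer)
print("with hidden blocks:", fc1 + output_layer + hidden_blocks)
```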


In [None]:
# [ QUESTION ]
# Why is the total number of "trainable parameters" equal to the total number of parameters?

The number of trainable parameters equals the total number of parameters when every parameter in the model can be updated during training, that is, when no parameter is frozen or marked as non-trainable. In a neural network, the parameters are typically the weights and biases. If all of them are optimized through backpropagation, all of them are trainable, so the two totals are the same.

If some layers are frozen instead (for example in transfer learning, where the early layers are not updated), the total number of parameters is larger than the number of trainable parameters, because only part of the parameters is updated during training.
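
A minimal sketch of the frozen-layer case, assuming a small stand-in network rather than the assignment's `Net`: setting `requires_grad = False` on a layer's parameters removes them from the trainable count while the total stays the same. The name `model_sketch` is hypothetical.

```python
import torch.nn as nn

model_sketch = nn.Sequential(
    nn.Linear(29, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

# Freeze the first layer: its weights and bias stop receiving gradient updates
for param in model_sketch[0].parameters():
    param.requires_grad = False

total = sum(p.numel() for p in model_sketch.parameters())
trainable = sum(p.numel() for p in model_sketch.parameters() if p.requires_grad)
print(total, trainable)  # 1985 65
```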

### Train the Model

In [None]:
print("Start training ...")
for epoch in range(epochs):
    train_loss = 0.0
    model.train()

    for data, label in Train_dataset:
        # The tensors already live on the GPU; .to(device) is a harmless safeguard
        data = data.to(device)
        label = label.squeeze()
        label = label.to(device)
        optimizer.zero_grad()
        output = model(data.float())

        loss = criterion(output.squeeze(), label.float())
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    # Sum of per-batch mean losses divided by the number of samples
    train_loss = train_loss / len(Train_dataset.dataset)
    if epoch % 10 == 0:
        print('  - Epoch: {} \tTraining_loss: {:.6f}'.format(epoch, train_loss))

Start training ...
  - Epoch: 0 	Training_loss: 0.001621
  - Epoch: 10 	Training_loss: 0.000049
  - Epoch: 20 	Training_loss: 0.000025
  - Epoch: 30 	Training_loss: 0.000016
  - Epoch: 40 	Training_loss: 0.000012
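
A note on reading these numbers: with its default settings, `nn.BCELoss` returns the mean loss of a batch, and the loop divides the sum of those batch means by the number of samples rather than the number of batches. The sketch below, with made-up values and a hypothetical `epoch_loss` helper, shows how the two divisors differ by a factor of roughly `batch_size`.

```python
def epoch_loss(batch_means, divisor):
    """Sum per-batch mean losses and divide by the chosen divisor."""
    return sum(batch_means) / divisor

batch_size = 64
num_batches = 100
num_samples = batch_size * num_batches
batch_means = [0.1] * num_batches  # hypothetical: every batch has mean loss 0.1

per_sample_divisor = epoch_loss(batch_means, num_samples)  # divisor used in the training loop
per_batch_divisor = epoch_loss(batch_means, num_batches)   # average loss per sample

print(per_sample_divisor, per_batch_divisor)
```

The trend across epochs is unaffected, but the absolute values printed during training are smaller than the average per-sample loss.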


### Model ACCURACY should reach >= 95%

In [None]:
# TODO: If the accuracy is still below 95%, please fine-tune the model

In [None]:
correct_preds = 0
total_samples = 0

model.eval()  # switch to evaluation mode before scoring
with torch.no_grad():
    for data, labels in Test_dataset:
        labels = labels.squeeze()
        output = model(data.float())
        output = output.squeeze(1)

        # Threshold the sigmoid output at 0.5 to get a 0/1 prediction
        predictions = (output >= 0.5).float()
        correct_preds += (predictions == labels).sum().item()
        total_samples += labels.numel()

accuracy = correct_preds / total_samples
print("Model accuracy: {:.2f}%".format(accuracy*100))

Model accuracy: 99.95%
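
The accuracy computation itself is simple enough to sketch without tensors: turn each sigmoid output into a 0/1 prediction with a 0.5 threshold, then count how many predictions match the labels. The probabilities and labels below are invented for the example, and `accuracy` is a hypothetical helper.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of samples whose thresholded probability equals the label."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)

probs = [0.91, 0.12, 0.50, 0.47, 0.78]   # made-up sigmoid outputs
labels = [1, 0, 1, 1, 0]                 # made-up ground truth
print("Model accuracy: {:.2f}%".format(accuracy(probs, labels) * 100))
```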


### Scoring
Total `#TODO` = 12
<br>Checklist:

- [ ] Import the dataset with Pandas, using the "read_by_CPU" function
- [ ] Drop the ID column
- [ ] Define X (features) and Y (target), using "data_gpu"
- [ ] Split the dataset into an 80% train set and a 20% test set with the "splitCPU" function
- [ ] Do the same data splitting, but with the "splitGPU" function
- [ ] Use the GPU (CUDA) as the training device
- [ ] Set the batch size
- [ ] QUESTION: What is the difference between hyperparameters and parameters?
- [ ] Set the hyperparameters
- [ ] Set the input size for the model
- [ ] QUESTION: Why is the total number of "trainable parameters" equal to the total number of parameters?
- [ ] If the accuracy is still below 95%, please fine-tune the model

### Additional readings
- N/A

### Copyright © 2024 Startup Campus, Indonesia
* Prepared by **Nicholas Dominic, M.Kom.** [(profile)](https://linkedin.com/in/nicholas-dominic)
* You may **NOT** use this file without written permission from PT. Kampus Merdeka Belajar (Startup Campus).
* Please address your questions to mentors.