# **Homework 1: COVID-19 Cases Prediction (Regression)**

Objectives:
* Solve a regression problem with deep neural networks (DNN).
* Understand basic DNN training tips.
* Familiarize yourself with PyTorch.

If you have any questions, please contact the TAs via TA hours, NTU COOL, or email at mlta-2023-spring@googlegroups.com.

In [113]:
!nvidia-smi

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Thu Jun 15 16:04:21 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    34W / 250W |    821MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                      

# Download data
If the Google Drive links below do not work, you can use the Dropbox links instead, or download the data from [Kaggle](https://www.kaggle.com/t/a339b77fa5214978bfb8dde62d3151fe) and upload it manually to the workspace.

In [114]:
# Google Drive links
# !pip install gdown
# !gdown --id '1BjXalPZxq9mybPKNjF3h5L3NcF7XKTS-' --output covid_train.csv
# !gdown --id '1B55t74Jg2E5FCsKCsUEkPKIuqaY7UIi1' --output covid_test.csv

# Dropbox links
!wget -O covid_train.csv https://www.dropbox.com/s/lmy1riadzoy0ahw/covid.train.csv?dl=0
!wget -O covid_test.csv https://www.dropbox.com/s/zalbw42lu4nmhr2/covid.test.csv?dl=0

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
--2023-06-15 16:04:23--  https://www.dropbox.com/s/lmy1riadzoy0ahw/covid.train.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.65.18, 2620:100:6021:18::a27d:4112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.65.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/lmy1riadzoy0ahw/covid.train.csv [following]
--2023-06-15 16:04:23--  https://www.dropbox.com/s/raw/lmy1riadzoy0ahw/covid.train.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucdea96f6afac435790bcb1cb35f.dl.dropboxusercontent.com/cd/0/inline/B-CWiid1e2BgybpB9miiM1d0jve0_R0Fkz6E3xlZB4FkOKkWC_n-rm4xHOJo9An5QOV93zukyz5mqUXxiqB5fFW0SZ4nSP1lyX81dmPCl-JMYuZU1XkeC2DhS5cO8_0bOHoLEsm4CAuIlstS3hrjLwat7lWZIZa7UiQSdY5nW-yjnA/file# [following]
--2023-06-15 16:04:23--  https://ucdea96f6afac435790bcb1cb35f.dl.

# Import packages

In [115]:
# Numerical Operations
import math
import numpy as np

# Reading/Writing Data
import pandas as pd
import os
import csv

# For Progress Bar
from tqdm import tqdm

# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split

# For plotting learning curve
from torch.utils.tensorboard import SummaryWriter

# Some Utility Functions

You do not need to modify this part.

In [116]:
# Fix the random seeds so that experiments are reproducible.
def same_seed(seed): 
    '''Fixes random number generator seeds for reproducibility.'''
    # Run cuDNN in deterministic mode so that its operations give the same results on every run.
    torch.backends.cudnn.deterministic = True
    # Disable cuDNN benchmark mode. It is normally used to speed up training, but it is turned off here to keep results consistent.
    torch.backends.cudnn.benchmark = False
    # Seed NumPy's random number generator.
    np.random.seed(seed)
    # Seed PyTorch's random number generator.
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        # Seed the random number generators of all GPUs.
        torch.cuda.manual_seed_all(seed)

# Split the training data into a training set and a validation set.
def train_valid_split(data_set, valid_ratio, seed):
    '''Split provided training data into training set and validation set'''
    # Size of the validation set.
    valid_set_size = int(valid_ratio * len(data_set))
    # Size of the training set (the remainder).
    train_set_size = len(data_set) - valid_set_size
    # Use PyTorch's random_split to divide the data, seeded for reproducibility.
    train_set, valid_set = random_split(data_set, [train_set_size, valid_set_size], generator=torch.Generator().manual_seed(seed))
    return np.array(train_set), np.array(valid_set)

# Return the model's predictions on the test set.
def predict(test_loader, model, device):
    model.eval() # Set your model to evaluation mode.
    preds = []
    # tqdm is a fast, extensible progress bar: wrap any iterable in tqdm(iterable) to show progress in a long loop.
    for x in tqdm(test_loader):
        # Move the batch to the target device.
        x = x.to(device)
        # Disable gradient tracking.
        with torch.no_grad():
            # Run the model on the input batch x.
            pred = model(x)
            # Detach the predictions, move them to the CPU, and collect them.
            preds.append(pred.detach().cpu())
    # Concatenate the per-batch predictions into one NumPy array.
    preds = torch.cat(preds, dim=0).numpy()
    return preds
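
As a quick check of the size arithmetic in `train_valid_split`, the sketch below reproduces only the two size computations in plain Python, with no shuffling and no PyTorch. The row count 3009 and the ratio 0.1 are the values used in this notebook; the helper name `split_sizes` is hypothetical.

```python
def split_sizes(n_samples, valid_ratio):
    """Mirror the size arithmetic of train_valid_split (no data is moved here)."""
    valid_size = int(valid_ratio * n_samples)   # int() truncates toward zero
    train_size = n_samples - valid_size         # the remainder goes to training
    return train_size, valid_size

train_size, valid_size = split_sizes(3009, 0.1)
print(train_size, valid_size)
```

Because `int()` truncates, any fractional sample always ends up in the training set rather than the validation set.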

# Dataset

In [117]:
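'''
This cell defines the COVID19Dataset class, which inherits from PyTorch's Dataset class. Dataset is an abstract class
that represents a dataset; subclasses are expected to implement the __getitem__ and __len__ methods.
COVID19Dataset represents the COVID-19 data set, holding the features (x) and the targets (y).
'''
class COVID19Dataset(Dataset):
    '''
    x: Features.
    y: Targets. If None, the dataset is used for prediction.
    '''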
    def __init__(self, x, y=None):
        if y is None:
            self.y = y
        else:
            self.y = torch.FloatTensor(y)
        self.x = torch.FloatTensor(x)

    def __getitem__(self, idx):
        if self.y is None:
            return self.x[idx]
        else:
            return self.x[idx], self.y[idx]

    def __len__(self):
        return len(self.x)

# Neural Network Model
Try out different model architectures by modifying the class below.

In [118]:
class My_Model(nn.Module):
    def __init__(self, input_dim):
        super(My_Model, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 1)
        )

    # Forward pass: run the input x through self.layers, then use squeeze to drop the size-1 dimension.
    def forward(self, x):
        x = self.layers(x)
        x = x.squeeze(1) # (B, 1) -> (B)
        return x
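
The `squeeze(1)` in `forward` matters because the targets have shape `(B,)`. The sketch below is a minimal illustration of the shapes involved; `head` is a hypothetical stand-in for the last layer of `My_Model`, and the batch size of 4 is arbitrary.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the final layer of My_Model: 16 hidden units -> 1 output.
head = nn.Linear(16, 1)

batch = torch.randn(4, 16)   # (B, 16) with B = 4
raw = head(batch)            # (B, 1)
flat = raw.squeeze(1)        # (B,)
print(raw.shape, flat.shape)

# Without the squeeze, subtracting (B,) targets from (B, 1) predictions
# broadcasts to (B, B), which is not the element-wise difference we want.
targets = torch.randn(4)
print((raw - targets).shape)
print((flat - targets).shape)
```

Keeping predictions and targets at the same shape avoids this silent broadcasting when the loss is computed.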

# Feature Selection
Choose features you deem useful by modifying the function below.

In [119]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

def select_feat(train_data, valid_data, test_data, select_all, k):
    '''Selects useful features to perform regression'''
    # Extract target variables from training and validation data.
    y_train, y_valid = train_data[:,-1], valid_data[:,-1]
    # Extract feature variables from training, validation and test data.
    raw_x_train, raw_x_valid, raw_x_test = train_data[:,:-1], valid_data[:,:-1], test_data

    if select_all:
        feat_idx = list(range(raw_x_train.shape[1]))
    else:
        # Create a Linear Regression model
        model = LinearRegression()
        # Create an RFE object
        rfe = RFE(estimator=model, n_features_to_select=k)
        # Fit the RFE object to the training data
        rfe.fit(raw_x_train, y_train)
        # Get the selected feature indices
        feat_idx = [i for i in range(len(rfe.support_)) if rfe.support_[i]]

    # Return selected feature variables and target variables
    return raw_x_train[:,feat_idx], raw_x_valid[:,feat_idx], raw_x_test[:,feat_idx], y_train, y_valid
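
To see what RFE does inside `select_feat`, the sketch below runs it on synthetic data in which only two of five columns drive the target. The data, coefficients, and seed are made up for illustration; only `RFE`, `LinearRegression`, and `support_` are taken from the cell above.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): 200 rows, 5 features,
# and a target that depends only on columns 0 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=200)

# Repeatedly fit the estimator and drop the weakest feature until 2 remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X, y)

# support_ is a boolean mask; flatnonzero converts it to column indices.
feat_idx = np.flatnonzero(rfe.support_).tolist()
print(feat_idx)
```

With a linear estimator, RFE ranks features by coefficient magnitude, so the ranking depends on feature scale; standardizing the columns first may change which features survive.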


# Training Loop

In [120]:
def trainer(train_loader, valid_loader, model, config, device):

    criterion = nn.MSELoss(reduction='mean') # Define your loss function, do not modify this.

    # Define your optimization algorithm.
    # TODO: Please check https://pytorch.org/docs/stable/optim.html to get more available algorithms.
    # TODO: L2 regularization (optimizer(weight decay...) or implement it yourself).
    # Option 1: SGD
    # optimizer = torch.optim.SGD(model.parameters(), lr=config['learning_rate'], momentum=0.7)
    # Option 2: Adam (used here)
    optimizer = torch.optim.Adam(model.parameters(), lr=config['learning_rate'], weight_decay=0.01)
    # Option 3: RMSprop
    # optimizer = torch.optim.RMSprop(model.parameters(), lr=config['learning_rate'], weight_decay=0.01)
    # Learning-rate decay: multiply the learning rate by 0.1 every 800 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=800, gamma=0.1)
    writer = SummaryWriter() # Writer of TensorBoard.

    os.makedirs('./models', exist_ok=True) # Create the directory for saved models if it does not exist.

    n_epochs, best_loss, step, early_stop_count = config['n_epochs'], math.inf, 0, 0

    for epoch in range(n_epochs):
        model.train() # Set your model to train mode.
        loss_record = []

        # tqdm is a package to visualize your training progress.
        # train_pbar = tqdm(train_loader, position=0, leave=True)

        for x, y in train_loader:
            optimizer.zero_grad()               # Set gradient to zero.
            x, y = x.to(device), y.to(device)   # Move your data to device. 
            pred = model(x)
            loss = criterion(pred, y)
            loss.backward()                     # Compute gradients (backpropagation).
            optimizer.step()                    # Update parameters.
            step += 1
            loss_record.append(loss.detach().item())
            
            # Display current epoch number and loss on tqdm progress bar.
            # train_pbar.set_description(f'Epoch [{epoch+1}/{n_epochs}]')
            # train_pbar.set_postfix({'loss': loss.detach().item()})

        mean_train_loss = sum(loss_record)/len(loss_record)
        writer.add_scalar('Loss/train', mean_train_loss, step)
        scheduler.step()  # Update the learning rate (once per epoch).

        model.eval() # Set your model to evaluation mode.
        loss_record = []
        for x, y in valid_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                pred = model(x)
                loss = criterion(pred, y)

            loss_record.append(loss.item())
            
        mean_valid_loss = sum(loss_record)/len(loss_record)
        # print(f'Epoch [{epoch+1}/{n_epochs}]: Train loss: {mean_train_loss:.4f}, Valid loss: {mean_valid_loss:.4f}')
        writer.add_scalar('Loss/valid', mean_valid_loss, step)

        if mean_valid_loss < best_loss:
            best_loss = mean_valid_loss
            torch.save(model.state_dict(), config['save_path']) # Save your best model
            print('Saving model with loss {:.3f}...'.format(best_loss))
            early_stop_count = 0
        else: 
            early_stop_count += 1

        if early_stop_count >= config['early_stop']:
            print('\nModel is not improving, so we halt the training session.')
            return
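
The early-stopping rule in `trainer` can be isolated from the training code. The sketch below replays the same counter logic over a made-up list of validation losses; the function name `epochs_until_stop` and the loss values are hypothetical, and `patience` plays the role of `config['early_stop']`.

```python
import math

def epochs_until_stop(valid_losses, patience):
    """Return the 1-based epoch at which training would halt, or None if it never does."""
    best_loss, stale = math.inf, 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss, stale = loss, 0   # improvement: remember it and reset the counter
        else:
            stale += 1                   # no improvement this epoch
        if stale >= patience:
            return epoch
    return None

losses = [5.0, 4.0, 4.2, 4.1, 4.3, 3.9]
print(epochs_until_stop(losses, patience=3))   # 5
print(epochs_until_stop(losses, patience=4))   # None
```

Note that the counter resets only on a strictly lower loss, so a plateau at the best value still counts toward stopping.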

# Configurations
`config` contains hyper-parameters for training and the path to save your model.

In [121]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
config = {
    'seed': 7,      # Your seed number, you can pick your lucky number. :)
    'select_all': False,   # Whether to use all features.
    'valid_ratio': 0.1,   # validation_size = train_size * valid_ratio
    'n_epochs': 10000,     # Number of epochs.            
    'batch_size': 512, 
    'learning_rate': 1e-3,              
    'early_stop': 1000,    # If model has not improved for this many consecutive epochs, stop training.     
    'save_path': './models/model.ckpt',  # Your model will be saved here.
    'n_features_to_select': 20              # Number of selected features.
}
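
The `learning_rate` above is only the starting value: the trainer wraps the optimizer in `StepLR(step_size=800, gamma=0.1)`. As a rough sketch, the resulting schedule can be written in closed form in plain Python; the helper name `step_lr` is hypothetical, and the formula assumes the scheduler is stepped once per epoch, as in the training loop.

```python
def step_lr(base_lr, epoch, step_size=800, gamma=0.1):
    """Closed form of a step schedule: decay by `gamma` once every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rate in effect after `epoch` completed epochs.
for epoch in (0, 799, 800, 1600, 2400):
    print(epoch, step_lr(1e-3, epoch))
```

Under these settings the rate is already three orders of magnitude below its starting value after 2400 epochs, so it may be worth checking whether the later epochs of a 10000-epoch budget still make progress.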

# Dataloader
Read data from files and set up training, validation, and testing sets. You do not need to modify this part.

In [122]:
# Set seed for reproducibility
same_seed(config['seed'])


# train_data size: 3009 x 89 (35 states + 18 features x 3 days) 
# test_data size: 997 x 88 (without last day's positive rate)
train_data, test_data = pd.read_csv('./covid_train.csv').values, pd.read_csv('./covid_test.csv').values
train_data, valid_data = train_valid_split(train_data, config['valid_ratio'], config['seed'])

# Print out the data size.
print(f"""train_data size: {train_data.shape} 
valid_data size: {valid_data.shape} 
test_data size: {test_data.shape}""")

# Select features
x_train, x_valid, x_test, y_train, y_valid = select_feat(train_data, valid_data, test_data, config['select_all'], config['n_features_to_select'])

# Print out the number of features.
print(f'number of features: {x_train.shape[1]}')

train_dataset, valid_dataset, test_dataset = COVID19Dataset(x_train, y_train), \
                                            COVID19Dataset(x_valid, y_valid), \
                                            COVID19Dataset(x_test)

# A PyTorch DataLoader loads a PyTorch Dataset in batches.
train_loader = DataLoader(train_dataset, batch_size=config['batch_size'], shuffle=True, pin_memory=True)
valid_loader = DataLoader(valid_dataset, batch_size=config['batch_size'], shuffle=True, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=config['batch_size'], shuffle=False, pin_memory=True)

train_data size: (2709, 89) 
valid_data size: (300, 89) 
test_data size: (997, 88)
number of features: 20
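
Given the sizes printed above and a batch size of 512, the number of batches each loader yields per epoch follows from a ceiling division. The sketch below assumes the `DataLoader` default `drop_last=False`; the helper name `n_batches` is hypothetical.

```python
import math

def n_batches(n_samples, batch_size):
    """Batches per epoch when the last, smaller batch is kept (drop_last=False)."""
    return math.ceil(n_samples / batch_size)

for name, n in [('train', 2709), ('valid', 300), ('test', 997)]:
    print(name, n_batches(n, 512))
```

The test loader comes out at 2 batches, which is consistent with the `2/2` shown by the progress bar in the Testing section.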


# Start training!

In [123]:
model = My_Model(input_dim=x_train.shape[1]).to(device) # Put your model and data on the same computation device.
trainer(train_loader, valid_loader, model, config, device)

Saving model with loss 369.260...
Saving model with loss 344.862...
Saving model with loss 315.864...
Saving model with loss 274.671...
Saving model with loss 220.806...
Saving model with loss 153.369...
Saving model with loss 79.295...
Saving model with loss 23.516...
Saving model with loss 15.326...
Saving model with loss 12.061...
Saving model with loss 11.679...
Saving model with loss 11.381...
Saving model with loss 10.908...
Saving model with loss 10.554...
Saving model with loss 10.295...
Saving model with loss 10.011...
Saving model with loss 9.707...
Saving model with loss 9.300...
Saving model with loss 8.824...
Saving model with loss 8.266...
Saving model with loss 7.610...
Saving model with loss 6.961...
Saving model with loss 6.224...
Saving model with loss 5.450...
Saving model with loss 4.666...
Saving model with loss 3.878...
Saving model with loss 3.121...
Saving model with loss 2.509...
Saving model with loss 1.934...
Saving model with loss 1.529...
Saving model with 

# Testing
The predictions of your model on the testing set will be stored in `pred.csv`.

In [124]:
def save_pred(preds, file):
    ''' Save predictions to specified file '''
    with open(file, 'w', newline='') as fp:  # newline='' is recommended by the csv module and avoids blank rows on Windows.
        writer = csv.writer(fp)
        writer.writerow(['id', 'tested_positive'])
        for i, p in enumerate(preds):
            writer.writerow([i, p])

model = My_Model(input_dim=x_train.shape[1]).to(device)
model.load_state_dict(torch.load(config['save_path']))
preds = predict(test_loader, model, device) 
save_pred(preds, 'pred.csv')         

100%|██████████| 2/2 [00:00<00:00, 342.78it/s]
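
To make the layout of `pred.csv` concrete, the sketch below writes two made-up predictions with the same header and row format as `save_pred`, but into an in-memory buffer instead of a file. The helper name `format_preds` and the prediction values are hypothetical.

```python
import csv
import io

def format_preds(preds):
    """Render predictions in the `id,tested_positive` layout and return the lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['id', 'tested_positive'])
    for i, p in enumerate(preds):
        writer.writerow([i, p])          # id is the row index, as in save_pred
    return buffer.getvalue().splitlines()

lines = format_preds([1.5, 2.25])
print(lines)
```

Each row pairs the zero-based position of a test sample with its predicted value, so the row order must match the order of the test loader, which is why that loader uses `shuffle=False`.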


# Download

Run this block, then click the link it displays to download `pred.csv`.

In [125]:
from IPython.display import FileLink
FileLink(r'pred.csv')

# Reference
This notebook uses code written by Heng-Jui Chang @ NTUEE (https://github.com/ga642381/ML2021-Spring/blob/main/HW01/HW01.ipynb)