# **Homework 1: COVID-19 Cases Prediction (Regression)**

Objectives:
* Solve a regression problem with deep neural networks (DNN).
* Understand basic DNN training tips.
* Familiarize yourself with PyTorch.

If you have any questions, please contact the TAs during TA hours, on NTU COOL, or by email at mlta-2023-spring@googlegroups.com.

In [52]:
!nvidia-smi

Thu Jun 15 17:45:13 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3060 ...  WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   50C    P5              18W / 100W |   3222MiB /  6144MiB |     13%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Download data
If the Google Drive links below do not work, use the Dropbox links instead, or download the data from [Kaggle](https://www.kaggle.com/t/a339b77fa5214978bfb8dde62d3151fe) and upload it manually to the workspace.

In [53]:
# google drive link
# !pip install gdown
# !gdown --id '1BjXalPZxq9mybPKNjF3h5L3NcF7XKTS-' --output covid_train.csv
# !gdown --id '1B55t74Jg2E5FCsKCsUEkPKIuqaY7UIi1' --output covid_test.csv

# dropbox link
# !wget -O covid_train.csv https://www.dropbox.com/s/lmy1riadzoy0ahw/covid.train.csv?dl=0
# !wget -O covid_test.csv https://www.dropbox.com/s/zalbw42lu4nmhr2/covid.test.csv?dl=0

# Download the files when running locally (PowerShell)
# Invoke-WebRequest -Uri https://www.dropbox.com/s/lmy1riadzoy0ahw/covid.train.csv?dl=0 -OutFile covid_train.csv
# Invoke-WebRequest -Uri https://www.dropbox.com/s/zalbw42lu4nmhr2/covid.test.csv?dl=0 -OutFile covid_test.csv


# Import packages

In [54]:
# Numerical Operations
import math
import numpy as np

# Reading/Writing Data
import pandas as pd
import os
import csv

# For Progress Bar
from tqdm import tqdm

# Pytorch
import torch 
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split

# For plotting learning curve
from torch.utils.tensorboard import SummaryWriter

# Some Utility Functions

You do not need to modify this part.

In [55]:
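# Fix the random seeds so that experiments are reproducible.
def same_seed(seed):
    '''Fixes random number generator seeds for reproducibility.'''
    # Run the cuDNN backend in deterministic mode so that it picks deterministic algorithms.
    torch.backends.cudnn.deterministic = True
    # Disable cuDNN benchmark mode. It usually speeds up training, but it is turned off here to keep results consistent between runs.
    torch.backends.cudnn.benchmark = False
    # Seed NumPy's random number generator.
    np.random.seed(seed)
    # Seed PyTorch's CPU random number generator.
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        # Seed the random number generators of all available GPUs.
        torch.cuda.manual_seed_all(seed)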
# Split the training data into a training set and a validation set.
def train_valid_split(data_set, valid_ratio, seed):
    '''Split provided training data into training set and validation set'''
    # Size of the validation set.
    valid_set_size = int(valid_ratio * len(data_set))
    # Size of the training set (the remainder).
    train_set_size = len(data_set) - valid_set_size
    # Use PyTorch's random_split with a seeded generator so the split is reproducible.
    train_set, valid_set = random_split(data_set, [train_set_size, valid_set_size], generator=torch.Generator().manual_seed(seed))
    return np.array(train_set), np.array(valid_set)
# Run the model on the test set and return its predictions.
def predict(test_loader, model, device):
    model.eval() # Set your model to evaluation mode.
    preds = []
    # tqdm wraps any iterable and shows a progress bar while looping over it.
    for x in tqdm(test_loader):
        # Move the batch to the target device.
        x = x.to(device)
        # Disable gradient tracking during inference.
        with torch.no_grad():
            # Run the forward pass on the batch.
            pred = model(x)
            # Detach the predictions, move them to the CPU, and collect them.
            preds.append(pred.detach().cpu())
    # Concatenate all batches into a single NumPy array.
    preds = torch.cat(preds, dim=0).numpy()
    return preds

# Dataset

In [56]:
'''
COVID19Dataset subclasses PyTorch's Dataset, an abstract class that represents a dataset.
Subclasses are expected to implement __getitem__ and __len__.
COVID19Dataset holds the COVID-19 features (x) and, when available, the targets (y).
'''
class COVID19Dataset(Dataset):
    '''
    x: Features.
    y: Targets, if none, do prediction.
    '''
    def __init__(self, x, y=None):
        if y is None:
            self.y = y
        else:
            self.y = torch.FloatTensor(y)
        self.x = torch.FloatTensor(x)

    def __getitem__(self, idx):
        if self.y is None:
            return self.x[idx]
        else:
            return self.x[idx], self.y[idx]

    def __len__(self):
        return len(self.x)

# Neural Network Model
Try out different model architectures by modifying the class below.

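**Directions for improvement:**
1. Add or remove hidden layers: change the number of nn.Linear layers to vary the depth of the model. A deeper model may learn more complex representations, but it may also overfit more easily.

2. Change the hidden layer sizes: change the width of the nn.Linear layers, i.e. the number of neurons per layer. More neurons increase the capacity of the model, but they may also lead to overfitting.

3. Use a different activation function: try alternatives such as nn.Sigmoid, nn.Tanh, or nn.LeakyReLU.

4. Add regularization: add nn.Dropout layers or use weight decay (L2 regularization) to reduce overfitting.

5. Use batch normalization: add an nn.BatchNorm1d layer after each nn.Linear layer, which often speeds up training and can improve performance.

6. Change the architecture: besides fully connected networks, you can also try other network types, such as convolutional neural networks (CNN), recurrent neural networks (RNN), or more complex models such as the Transformer.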
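
As one illustration of items 3 to 5 above, the sketch below adds `nn.BatchNorm1d`, `nn.LeakyReLU`, and `nn.Dropout` to a fully connected network. The class name `My_Model_Regularized`, the layer widths, and the dropout rate of 0.2 are assumptions chosen for illustration, not tuned values, and there is no guarantee that this variant scores better on this task.

```python
import torch
import torch.nn as nn

class My_Model_Regularized(nn.Module):
    """Hypothetical variant of My_Model with batch norm, LeakyReLU, and dropout."""
    def __init__(self, input_dim, hidden_dim=64, dropout=0.2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),       # normalize activations over the batch
            nn.LeakyReLU(),
            nn.Dropout(dropout),              # randomly zero activations during training
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.BatchNorm1d(hidden_dim // 2),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim // 2, 1),
        )

    def forward(self, x):
        return self.layers(x).squeeze(1)      # (B, 1) -> (B)

model = My_Model_Regularized(input_dim=53)    # 53 matches the feature count printed by this notebook
model.eval()                                  # eval mode: dropout off, batch norm uses running statistics
with torch.no_grad():
    out = model(torch.randn(8, 53))           # a random batch of 8 samples
print(out.shape)
```

Because dropout and batch normalization behave differently in training and evaluation, the calls to `model.train()` and `model.eval()` in the training loop matter more for a model like this than for the plain version.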

In [57]:
class My_Model(nn.Module):
    def __init__(self, input_dim):
        super(My_Model, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 1)
        )

    # Forward pass: run x through self.layers, then squeeze out the size-1 dimension.
    def forward(self, x):
        x = self.layers(x)
        x = x.squeeze(1) # (B, 1) -> (B)
        return x

# Feature Selection
Choose features you deem useful by modifying the function below.


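The survey indicators are explained below, followed by a suggested selection:

cli: COVID-like illness, i.e. symptoms resembling a COVID-19 infection. This indicator may be related to new positive cases, because people with symptoms are more likely to get tested.

ili: Influenza-like illness, i.e. flu-like symptoms. This indicator may be related to new positive cases, because some COVID-19 symptoms resemble those of influenza.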

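wnohh_cmnty_cli: Weighted estimate of people reporting COVID-like illness in their community, not counting their own household. This indicator may be related to new positive cases, because more symptomatic people in a community may mean more positive cases.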

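wbelief_masking_effective: Weighted estimate of the belief that masks are effective. This indicator may be related to new positive cases, because people who believe masks work may be more willing to wear them, which lowers the risk of infection.

wbelief_distancing_effective: Weighted estimate of the belief that social distancing is effective. This indicator may be related to new positive cases, because people who believe distancing works may be more willing to keep their distance, which lowers the risk of infection.

wcovid_vaccinated_friends: Weighted estimate of friends who have received a COVID-19 vaccine. This indicator may be related to new positive cases, because a person with more vaccinated friends may be more willing to get vaccinated, which lowers the risk of infection.

wlarge_event_indoors: Weighted estimate of attendance at large indoor events. This indicator may be related to new positive cases, because large indoor events may increase the risk of infection.

wothers_masked_public: Weighted estimate of other people wearing masks in public. This indicator may be related to new positive cases, because the risk of infection may fall when people in public places wear masks.

wothers_distanced_public: Weighted estimate of other people keeping their distance in public. This indicator may be related to new positive cases, because the risk of infection may fall when people in public places keep their distance.

wshop_indoors: Weighted estimate of indoor shopping. This indicator may be related to new positive cases, because shopping indoors may increase the risk of infection.

wrestaurant_indoors: Weighted estimate of indoor restaurant dining. This indicator may be related to new positive cases, because dining indoors may increase the risk of infection.

wworried_catch_covid: Weighted estimate of worry about catching COVID-19. This indicator may be related to new positive cases, because people who are more worried about infection may take more precautions, which lowers the risk of infection.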

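hh_cmnty_cli: Estimate of people reporting COVID-like illness in their community, including their own household. This indicator may be related to new positive cases, because more symptomatic people in a community may mean more positive cases.

nohh_cmnty_cli: Estimate of people reporting COVID-like illness in their community, not counting their own household. This indicator may be related to new positive cases, because more symptomatic people outside the household may mean more positive cases.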

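wearing_mask_7d: Estimate of mask wearing over the past 7 days. This indicator may be related to new positive cases, because wearing a mask can lower the risk of COVID-19 infection.

public_transit: Estimate of public transit use. This indicator may be related to new positive cases, because using public transit may increase the risk of COVID-19 infection.

worried_finances: Estimate of worry about finances. This indicator may be related to new positive cases, because financial worries may affect people's health behaviors and decisions.

Suggested indicators for predicting new positive cases: cli, ili, wnohh_cmnty_cli, wbelief_masking_effective, wbelief_distancing_effective, wcovid_vaccinated_friends, wlarge_event_indoors, wothers_masked_public, wothers_distanced_public, wshop_indoors, wrestaurant_indoors, wworried_catch_covid, hh_cmnty_cli, nohh_cmnty_cli, wearing_mask_7d, public_transit. These indicators cover symptoms, protective behavior, community conditions, and individual behavior, so they reflect a broad range of factors that may influence new positive cases. They also capture people's beliefs and behaviors, which may affect the spread of the virus. Selecting them may therefore give more accurate predictions.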
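
To turn a list of indicator names like the one above into column indices, one option is to match on the column names instead of counting positions by hand. The sketch below assumes that each indicator name repeats once per day and that repeated names carry a numeric suffix such as `.1` (as pandas adds for duplicate headers). The helper `feature_indices` and the tiny header used here are hypothetical and are not the real layout of `covid_train.csv`.

```python
import re

def feature_indices(columns, wanted):
    """Return the positions of columns whose base name is in `wanted`."""
    wanted = set(wanted)
    idx = []
    for i, name in enumerate(columns):
        base = re.sub(r'\.\d+$', '', name)    # strip a day suffix: "cli.1" -> "cli"
        if base in wanted:
            idx.append(i)
    return idx

# Hypothetical header: two state columns, then two days of three indicators each.
columns = ['AL', 'AZ',
           'cli', 'ili', 'tested_positive',
           'cli.1', 'ili.1', 'tested_positive.1']

# Drop the last column (the prediction target) before matching.
feat_idx = feature_indices(columns[:-1], ['cli', 'ili', 'tested_positive'])
print(feat_idx)
```

A list built this way could be assigned to `feat_idx` in `select_feat` in place of the fixed `range(35, ...)`, provided the real header is checked first (for example with `pd.read_csv('./covid_train.csv').columns`).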

In [58]:
# Takes four arguments: the training, validation, and test data, plus select_all, a boolean that decides whether to use every feature.
def select_feat(train_data, valid_data, test_data, select_all):
    '''Selects useful features to perform regression'''
    # The target is assumed to be the last column of the training and validation data, so [:, -1] extracts it.
    y_train, y_valid = train_data[:,-1], valid_data[:,-1]
    # The features are every column except the last. The test data has no target column, so it is used as is.
    raw_x_train, raw_x_valid, raw_x_test = train_data[:,:-1], valid_data[:,:-1], test_data

    if select_all:
        feat_idx = list(range(raw_x_train.shape[1]))
    else:
        feat_idx = list(range(35, raw_x_train.shape[1])) # TODO: Select suitable feature columns.
    print(feat_idx)
    # Return the selected features and the targets.
    return raw_x_train[:,feat_idx], raw_x_valid[:,feat_idx], raw_x_test[:,feat_idx], y_train, y_valid

# Training Loop

In [59]:
def trainer(train_loader, valid_loader, model, config, device):

    criterion = nn.MSELoss(reduction='mean') # Define your loss function, do not modify this.

    # Define your optimization algorithm. 
    # TODO: Please check https://pytorch.org/docs/stable/optim.html to get more available algorithms.
    # TODO: L2 regularization (optimizer(weight decay...) or implement it yourself).
    # Option 1: SGD with momentum.
    # optimizer = torch.optim.SGD(model.parameters(), lr=config['learning_rate'], momentum=0.7)
    # Option 2: Adam with weight decay.
    optimizer = torch.optim.Adam(model.parameters(), lr=config['learning_rate'], weight_decay=0.01)
    # Learning-rate decay.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=800, gamma=0.1)  # Multiply the learning rate by 0.1 every 800 epochs.
    # Option 3: RMSprop with weight decay.
    # optimizer = torch.optim.RMSprop(model.parameters(), lr=config['learning_rate'], weight_decay=0.01)
    writer = SummaryWriter() # Writer of tensorboard.

    if not os.path.isdir('./models'):
        os.mkdir('./models') # Create directory of saving models.

    n_epochs, best_loss, step, early_stop_count = config['n_epochs'], math.inf, 0, 0

    for epoch in range(n_epochs):
        model.train() # Set your model to train mode.
        loss_record = []

        # tqdm is a package to visualize your training progress.
        # train_pbar = tqdm(train_loader, position=0, leave=True)

        for x, y in train_loader:
            optimizer.zero_grad()               # Set gradient to zero.
            x, y = x.to(device), y.to(device)   # Move your data to device. 
            pred = model(x)
            loss = criterion(pred, y)
            loss.backward()                     # Compute gradient(backpropagation).
            optimizer.step()                    # Update parameters.
            step += 1
            loss_record.append(loss.detach().item())
            
            # Display current epoch number and loss on tqdm progress bar.
            # train_pbar.set_description(f'Epoch [{epoch+1}/{n_epochs}]')
            # train_pbar.set_postfix({'loss': loss.detach().item()})

        mean_train_loss = sum(loss_record)/len(loss_record)
        writer.add_scalar('Loss/train', mean_train_loss, step)
        scheduler.step()  # Update the learning rate (once per epoch).

        model.eval() # Set your model to evaluation mode.
        loss_record = []
        for x, y in valid_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                pred = model(x)
                loss = criterion(pred, y)

            loss_record.append(loss.item())
            
        mean_valid_loss = sum(loss_record)/len(loss_record)
        # print(f'Epoch [{epoch+1}/{n_epochs}]: Train loss: {mean_train_loss:.4f}, Valid loss: {mean_valid_loss:.4f}')
        writer.add_scalar('Loss/valid', mean_valid_loss, step)

        if mean_valid_loss < best_loss:
            best_loss = mean_valid_loss
            torch.save(model.state_dict(), config['save_path']) # Save your best model
            print('Saving model with loss {:.3f}...'.format(best_loss))
            early_stop_count = 0
        else: 
            early_stop_count += 1

        if early_stop_count >= config['early_stop']:
            print('\nModel is not improving, so we halt the training session.')
            return

# Configurations
`config` contains hyper-parameters for training and the path to save your model.
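
The `learning_rate` set in `config` is only the starting value, because the `StepLR` scheduler in the training loop multiplies it by `gamma` every `step_size` epochs. The sketch below reproduces that arithmetic in plain Python. It assumes the closed form `base_lr * gamma ** (epoch // step_size)`, which holds when nothing else changes the learning rate, and `step_lr` is a hypothetical helper name, not a PyTorch function.

```python
def step_lr(base_lr, epoch, step_size=800, gamma=0.1):
    """Learning rate after `epoch` completed epochs under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)

base_lr = 1e-2  # 'learning_rate' in config
for epoch in (0, 799, 800, 1600, 2400):
    print(epoch, step_lr(base_lr, epoch))
```

With these settings the learning rate stays at 1e-2 for the first 800 epochs, then falls to about 1e-3, and to about 1e-4 after 1600 epochs.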

In [60]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
config = {
    'seed': 5201314,      # Your seed number, you can pick your lucky number. :)
    'select_all': False,   # Whether to use all features.
    'valid_ratio': 0.1,   # validation_size = train_size * valid_ratio
    'n_epochs': 10000,     # Number of epochs.            
    'batch_size': 512, 
    'learning_rate': 1e-2,              
    'early_stop': 800,    # If model has not improved for this many consecutive epochs, stop training.     
    'save_path': './models/model.ckpt'  # Your model will be saved here.
}

# Dataloader
Read data from files and set up training, validation, and testing sets. You do not need to modify this part.
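
The split sizes follow directly from `valid_ratio`. As a quick check, this sketch repeats the size arithmetic of `train_valid_split` in plain Python for 3009 training rows and a batch size of 512. `split_sizes` is a hypothetical helper name used only for illustration.

```python
import math

def split_sizes(n_samples, valid_ratio):
    """Mirror the size computation in train_valid_split."""
    valid_size = int(valid_ratio * n_samples)   # int() truncates the fractional part
    train_size = n_samples - valid_size
    return train_size, valid_size

train_size, valid_size = split_sizes(3009, 0.1)
n_train_batches = math.ceil(train_size / 512)   # the last batch may be smaller than 512
print(train_size, valid_size, n_train_batches)
```

The first two numbers should agree with the `train_data` and `valid_data` row counts printed by the cell below.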

In [61]:
# Set seed for reproducibility
same_seed(config['seed'])


# train_data size: 3009 x 89 (35 states + 18 features x 3 days) 
# test_data size: 997 x 88 (without last day's positive rate)
train_data, test_data = pd.read_csv('./covid_train.csv').values, pd.read_csv('./covid_test.csv').values
train_data, valid_data = train_valid_split(train_data, config['valid_ratio'], config['seed'])

# Print out the data size.
print(f"""train_data size: {train_data.shape} 
valid_data size: {valid_data.shape} 
test_data size: {test_data.shape}""")

# Select features
x_train, x_valid, x_test, y_train, y_valid = select_feat(train_data, valid_data, test_data, config['select_all'])

# Print out the number of features.
print(f'number of features: {x_train.shape[1]}')

train_dataset, valid_dataset, test_dataset = COVID19Dataset(x_train, y_train), \
                                            COVID19Dataset(x_valid, y_valid), \
                                            COVID19Dataset(x_test)

# A PyTorch DataLoader loads a PyTorch Dataset in batches.
train_loader = DataLoader(train_dataset, batch_size=config['batch_size'], shuffle=True, pin_memory=True)
valid_loader = DataLoader(valid_dataset, batch_size=config['batch_size'], shuffle=True, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=config['batch_size'], shuffle=False, pin_memory=True)

train_data size: (2709, 89) 
valid_data size: (300, 89) 
test_data size: (997, 88)
[35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87]
number of features: 53


# Start training!

In [None]:
model = My_Model(input_dim=x_train.shape[1]).to(device) # put your model and data on the same computation device.
trainer(train_loader, valid_loader, model, config, device)

# Testing
The predictions of your model on testing set will be stored at `pred.csv`.
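
The file is expected to hold a header row followed by one `id, tested_positive` pair per test sample. The sketch below writes three made-up predictions to an in-memory buffer with the standard `csv` module, so the layout can be inspected without touching the disk. The values are placeholders, not model output.

```python
import csv
import io

preds = [1.5, 2.25, 3.0]                # placeholder values, not real predictions

buffer = io.StringIO()                  # in-memory stand-in for the output file
writer = csv.writer(buffer)
writer.writerow(['id', 'tested_positive'])
for i, p in enumerate(preds):
    writer.writerow([i, p])             # the id is the row index in the test set

rows = buffer.getvalue().splitlines()
print(rows)
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=''` so that no extra blank lines appear between rows on Windows.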

In [63]:
def save_pred(preds, file):
    ''' Save predictions to specified file '''
    with open(file, 'w', newline='') as fp:  # newline='' avoids blank rows on Windows
        writer = csv.writer(fp)
        writer.writerow(['id', 'tested_positive'])
        for i, p in enumerate(preds):
            writer.writerow([i, p])

model = My_Model(input_dim=x_train.shape[1]).to(device)
model.load_state_dict(torch.load(config['save_path'], map_location=device))
preds = predict(test_loader, model, device) 
save_pred(preds, 'pred.csv')

100%|██████████| 4/4 [00:00<00:00, 444.42it/s]


# Download

Run this block, then click the generated link to download `pred.csv`.

In [64]:
from IPython.display import FileLink
FileLink(r'pred.csv')

# Reference
This notebook uses code written by Heng-Jui Chang @ NTUEE (https://github.com/ga642381/ML2021-Spring/blob/main/HW01/HW01.ipynb)