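# Machine Learning Course HW1

Source: Delphi @ CMU

## Goals

The main purpose is to write everything by hand in order to get familiar with the deep learning workflow, and to revisit common components such as `Dataset` and `DataLoader` as well as the computational building blocks used inside a network.

- Solve a regression problem with deep learning
- Understand basic techniques for training deep neural networks, such as hyperparameter tuning, feature selection, and regularization
- Get familiar with PyTorch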

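# Task description

- COVID-19 case prediction
- Given survey results from the past five days in a US state, predict the percentage of new positive tests on the fifth day
- The data was collected through Facebook

The training set contains 2699 rows, and each row has 118 columns:

- 1 id column
- 37 one-hot state columns (each state is its own dimension; the state a row belongs to is set to 1 and all other states to 0, a common way to encode non-numeric labels)
- 5 consecutive days × 16 public-health features per day (for example healthcare activity, social restrictions, and mental health), where the last feature of each day is that day's positive rate. Together this gives 1 + 37 + 5 × 16 = 118 columns.

The test set contains 1078 rows. Its columns match the training set except that the positive rate of the last of the 5 consecutive days is missing.

The task in this assignment is to train a model that predicts the value of the last attribute of a row from all or some of its first 117 attributes. In concrete terms: a model that predicts the positive rate of the last day from 5 consecutive days of public-health data for any state (excluding the last day's positive rate).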
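The column layout described above can be written down as index ranges. This is a minimal sketch that assumes the columns are ordered id, then states, then one block per day with the positive rate last in each block; the constant and function names are illustrative and do not come from the notebook. The per-day feature count is derived from the totals rather than hard-coded.

```python
# Illustrative constants taken from the task description (names are hypothetical).
N_COLUMNS = 118   # total columns in a training row
N_ID = 1          # leading id column
N_STATES = 37     # one-hot state columns
N_DAYS = 5        # consecutive days per sample

# Whatever is left after the id and state columns is split evenly across the days.
features_per_day = (N_COLUMNS - N_ID - N_STATES) // N_DAYS

def day_columns(day):
    """Return the column indices of one day's block (day is 0-based)."""
    start = N_ID + N_STATES + day * features_per_day
    return list(range(start, start + features_per_day))

state_columns = list(range(N_ID, N_ID + N_STATES))
# Assumption: the positive rate is the last column of each day's block.
positive_rate_columns = [day_columns(d)[-1] for d in range(N_DAYS)]

print(features_per_day)
print(positive_rate_columns)
```

The last entry of `positive_rate_columns` is the label column of the training set, which is exactly the column the test set lacks.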

## Evaluation metric

Predictions are scored with mean squared error (MSE).
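For reference, MSE is the average of the squared differences between predictions and targets. The helper below is a plain-Python sketch with made-up numbers, only meant to make the definition concrete.

```python
def mse(predictions, targets):
    """Mean squared error: the average of the squared differences."""
    assert len(predictions) == len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Made-up numbers purely for illustration: squared errors are 1, 0 and 4.
demo_mse = mse([10.0, 12.0, 9.0], [11.0, 12.0, 7.0])
print(demo_mse)
```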

# Select a kernel with PyTorch installed

Before running the code, select a kernel environment in which PyTorch is installed.

# Import the required Python packages

In [423]:
# File reading and writing: pandas, os, csv
import pandas as pd
import os
import csv

# Numerical libraries
import numpy as np
import math

# PyTorch
import torch
# Neural network building blocks
import torch.nn as nn
# Data preprocessing and loading utilities
from torch.utils.data import DataLoader, Dataset, random_split
# Optimizers
import torch.optim as optim

# Progress bar
from tqdm import tqdm

# Plotting and logging via TensorBoard
from torch.utils.tensorboard import SummaryWriter

# Set seeds so results are reproducible

In [424]:
def set_seed(seed):
    '''Fixes random number generator seeds for reproducibility'''
    # Use deterministic algorithms so that the same input, parameters, and environment
    # produce the same output, which makes training runs reproducible.
    torch.backends.cudnn.deterministic = True
    # On GPU, cuDNN benchmark mode picks the fastest algorithm automatically, which
    # makes results hard to reproduce, so benchmark mode is turned off.
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)   # Seed NumPy's random number generator.
    torch.manual_seed(seed)   # Seed PyTorch's random number generator.
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)
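A quick way to see what seeding buys: resetting the seed replays the same random numbers. This is a small standalone check, separate from the notebook's `set_seed`; the variable names are illustrative.

```python
import torch

torch.manual_seed(42)
first = torch.rand(3)

torch.manual_seed(42)   # reset to the same seed
second = torch.rand(3)

# Both draws come from the same generator state, so they are identical.
print(torch.equal(first, second))
```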

# Load the data from CSV

Read the CSV files into memory as DataFrames so the program can work with them.

In [425]:
train_data_path = 'data/covid.train.csv'
test_data_path = 'data/covid.test.csv'
output_data_path = 'data/covid.test.predict.csv'

# Read the data; pd.read_csv returns a DataFrame
train_data = pd.read_csv(train_data_path)
test_data = pd.read_csv(test_data_path)

# Print the first 5 rows of the training data
print(train_data.head())
# Print the column names
print(train_data.columns)

# .values converts a DataFrame into a NumPy array, which is easier to slice and process
train_data = train_data.values
test_data = test_data.values

# Convert to float32
train_data = train_data.astype(np.float32)
test_data = test_data.astype(np.float32)

# Print the shape of the data
print(train_data.shape)
# Per the task description this should be (2699, 118): 2699 samples with 118 columns each.
# Column 0 is the sample id, columns 1-37 are the one-hot state indicators,
# the last column is the label, and everything in between is a feature.

   id  AL  AK  AZ  AR  CA  CO  CT  FL  GA  ...  work_outside_home.4  \
0   0   0   0   0   0   0   0   0   1   0  ...            31.113209   
1   1   0   0   0   0   0   1   0   0   0  ...            33.920257   
2   2   0   0   0   0   0   0   0   0   0  ...            31.604604   
3   3   0   0   0   0   0   0   0   0   0  ...            35.115738   
4   4   0   0   0   0   0   0   0   0   0  ...            35.129714   

      shop.4  restaurant.4  spent_time.4  large_event.4  public_transit.4  \
0  67.394551     36.674291     40.743132      17.842221          4.093712   
1  64.398380     34.612238     44.035688      17.808103          4.924935   
2  62.101064     26.521875     36.746453      13.903667          7.313833   
3  67.935520     38.022492     48.434809      27.134876          3.101904   
4  69.934592     38.242368     49.095933      22.683709          4.594620   

   anxious.4  depressed.4  worried_finances.4  tested_positive.4  
0  10.440071     8.627117           37.3295

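# Feature selection

In [426]:
# As a first attempt, train on all features

# Take the label column
train_labels = train_data[:, -1]

# Drop the leading id column and the trailing label column
train_data = train_data[:, 1:-1]
# The test set has no label column, so only the id column is dropped
test_data = test_data[:, 1:]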
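Once the all-features baseline works, one common next step is to keep only the columns that correlate most strongly with the label. The sketch below runs on synthetic data; `select_top_k` is a hypothetical helper that is not part of the notebook, and Pearson correlation is only one of several possible selection criteria.

```python
import numpy as np

def select_top_k(features, labels, k):
    """Return indices of the k columns with the largest |Pearson correlation| to labels."""
    centered_x = features - features.mean(axis=0)
    centered_y = labels - labels.mean()
    denom = np.sqrt((centered_x ** 2).sum(axis=0) * (centered_y ** 2).sum())
    # Constant columns have zero variance; give them correlation 0 instead of dividing by 0.
    safe_denom = np.where(denom == 0, 1.0, denom)
    covariance = (centered_x * centered_y[:, None]).sum(axis=0)
    corr = np.where(denom == 0, 0.0, covariance / safe_denom)
    return np.argsort(-np.abs(corr))[:k]

# Synthetic stand-in data: column 2 drives the label, the other columns are noise.
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(200, 5))
demo_labels = 3.0 * demo_features[:, 2] + 0.1 * rng.normal(size=200)

top = select_top_k(demo_features, demo_labels, k=2)
print(top)
```

The returned indices would have to be applied identically to the training and test features, for example `train_data[:, top]` and `test_data[:, top]`, so that both keep the same columns.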


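# Split the training data into training and validation sets

In [427]:
def split_train_val(data, labels, valid_ratio, seed):
    '''Split features and labels into training and validation sets using the same indices'''
    valid_set_size = int(valid_ratio * len(data))
    train_set_size = len(data) - valid_set_size
    # Split index positions rather than the rows themselves, so that every
    # feature row stays paired with its own label after shuffling
    train_subset, valid_subset = random_split(range(len(data)),
                                              [train_set_size, valid_set_size],
                                              generator=torch.Generator().manual_seed(seed))
    train_idx, valid_idx = train_subset.indices, valid_subset.indices
    return data[train_idx], labels[train_idx], data[valid_idx], labels[valid_idx]

train_data, train_labels, val_data, val_labels = split_train_val(train_data, train_labels, 0.2, 1)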
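The raw features are percentages on quite different scales, so a frequently used preprocessing step after splitting is to standardize each column with statistics computed on the training portion only. The notebook does not do this; the block below is a sketch on synthetic data and `standardize` is a hypothetical helper.

```python
import numpy as np

def standardize(train, *others):
    """Scale columns to zero mean and unit variance using statistics from train only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Constant columns have std 0; dividing by 1 instead maps them to all zeros.
    std = np.where(std == 0, 1.0, std)
    return [(x - mean) / std for x in (train, *others)]

# Synthetic stand-in data with a large offset and spread.
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
demo_val = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

demo_train_std, demo_val_std = standardize(demo_train, demo_val)
print(demo_train_std.mean(axis=0).round(6), demo_train_std.std(axis=0).round(6))
```

Reusing the training mean and standard deviation for the validation and test sets keeps information from held-out data out of the preprocessing.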

# Write the Dataset class

In [428]:
class COVID19Dataset(Dataset):
    '''Wraps features and, optionally, targets; without targets it acts as a test set.'''

    def __init__(self, data, target=None, transform=None):
        self.data = data
        self.target = target
        # A dataset with targets is used for training, one without targets for testing
        self.mode = 'train' if target is not None else 'test'
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        if self.transform is not None:
            sample = self.transform(sample)
        if self.mode == 'train':
            return sample, self.target[idx]
        return sample

train_dataset = COVID19Dataset(train_data, target=train_labels)
test_dataset = COVID19Dataset(test_data)

# Set up the DataLoader

The Dataset defined above only gives access to individual samples. To draw mini-batches, shuffle the data, and so on, it is wrapped in a DataLoader.

In [429]:
train_loader = DataLoader(COVID19Dataset(train_data, target=train_labels), batch_size=128, shuffle=True)
test_loader = DataLoader(COVID19Dataset(test_data), batch_size=128, shuffle=False)
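To see what a DataLoader yields, the sketch below iterates over synthetic data and records the batch sizes. The sample and feature counts are illustrative, and `TensorDataset` stands in for the notebook's own Dataset class.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in: 300 samples with 116 features each (sizes are illustrative).
demo_features = torch.randn(300, 116)
demo_labels = torch.randn(300)
demo_loader = DataLoader(TensorDataset(demo_features, demo_labels),
                         batch_size=128, shuffle=True)

# Each iteration yields one (features, labels) mini-batch of tensors.
batch_sizes = [features.shape[0] for features, labels in demo_loader]
print(batch_sizes)  # [128, 128, 44]
```

The last batch is smaller because 300 is not a multiple of 128; passing `drop_last=True` would discard it instead.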

# Design the network

In [430]:
class EZNet(nn.Module):

    def __init__(self,input_size):
        super(EZNet, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )

    def forward(self, x):
        x = self.net(x)
        x = x.squeeze(1)
        return x

eznet = EZNet(train_data.shape[1])

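# Design the loss function

In [431]:
def loss(outputs, labels):
    # Note: this returns 2 x RMSE, not plain MSE. It is monotonic in MSE,
    # but the printed values are not directly comparable to the MSE metric.
    return torch.sqrt(nn.MSELoss()(outputs, labels)) * 2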

# Set up the optimizer

In [432]:
optimizer = torch.optim.SGD(eznet.parameters(), lr=0.05, momentum=0.9, weight_decay=1e-4)

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2100, gamma=0.1)
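A `StepLR` scheduler only changes the learning rate when its `step()` method is called. The sketch below uses a dummy parameter and a short `step_size` (both illustrative, not the notebook's values) to show the learning rate being multiplied by `gamma` every `step_size` calls.

```python
import torch

# Dummy parameter so the optimizer has something to manage.
demo_param = torch.nn.Parameter(torch.zeros(1))
demo_optimizer = torch.optim.SGD([demo_param], lr=0.05)
demo_scheduler = torch.optim.lr_scheduler.StepLR(demo_optimizer, step_size=3, gamma=0.1)

learning_rates = []
for epoch in range(7):
    learning_rates.append(demo_optimizer.param_groups[0]['lr'])
    demo_optimizer.step()    # the optimizer steps first
    demo_scheduler.step()    # without this call the learning rate never changes

print(learning_rates)
```

With `step_size=3` the rate stays at 0.05 for three epochs, then drops by a factor of 10 every three epochs after that.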

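# Start training

In [435]:
# Pick the best available device: CUDA GPU first, then Apple MPS, then CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')
eznet.to(device)
eznet.train()
for epoch in range(5000):
    for feature, label in train_loader:
        # The DataLoader already yields tensors, so only the device transfer is needed
        feature = feature.to(device)
        label = label.to(device)
        outputs = eznet(feature)
        l = loss(outputs, label)
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
    # Advance the learning-rate schedule once per epoch
    # (assumes step_size in StepLR is meant to be counted in epochs)
    scheduler.step()
    # Note: this is the loss of the last batch only, not the epoch average
    print('epoch:', epoch, 'loss:', l.item())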

        

epoch: 0 loss: 12.558199882507324
epoch: 1 loss: 12.074381828308105
epoch: 2 loss: 10.264640808105469
epoch: 3 loss: 11.537510871887207
epoch: 4 loss: 14.754241943359375
epoch: 5 loss: 13.563756942749023
epoch: 6 loss: 11.41231632232666
epoch: 7 loss: 11.634618759155273
epoch: 8 loss: 12.403926849365234
epoch: 9 loss: 13.31664752960205
epoch: 10 loss: 11.919900894165039
epoch: 11 loss: 11.115144729614258
epoch: 12 loss: 11.960451126098633
epoch: 13 loss: 12.067655563354492
epoch: 14 loss: 12.39603042602539
epoch: 15 loss: 11.144936561584473
epoch: 16 loss: 11.669995307922363
epoch: 17 loss: 13.389983177185059
epoch: 18 loss: 11.582047462463379
epoch: 19 loss: 11.333950996398926
epoch: 20 loss: 12.857478141784668
epoch: 21 loss: 13.187426567077637
epoch: 22 loss: 11.426513671875
epoch: 23 loss: 13.675308227539062
epoch: 24 loss: 11.237878799438477
epoch: 25 loss: 11.680055618286133
epoch: 26 loss: 12.732237815856934
epoch: 27 loss: 12.936721801757812
epoch: 28 loss: 12.682701110839844
...
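The value printed per epoch is the loss of a single batch, which is noisy. Evaluating on the held-out validation set after each epoch gives a steadier signal. The sketch below shows one way to do that; `evaluate` is a hypothetical helper that is not part of the notebook, and the model and data in the demo are synthetic.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, device):
    """Return the mean squared error of model over every sample in loader."""
    model.eval()
    total_squared_error, n_samples = 0.0, 0
    with torch.no_grad():                        # no gradients needed for evaluation
        for features, labels in loader:
            features, labels = features.to(device), labels.to(device)
            outputs = model(features).reshape(-1)    # accepts (B,) or (B, 1) outputs
            total_squared_error += ((outputs - labels) ** 2).sum().item()
            n_samples += labels.shape[0]
    model.train()                                # restore training mode for the caller
    return total_squared_error / n_samples

# Synthetic check: a model that always outputs 0, scored against labels that are all 2.
demo_model = nn.Linear(4, 1)
nn.init.zeros_(demo_model.weight)
nn.init.zeros_(demo_model.bias)
demo_loader = DataLoader(TensorDataset(torch.randn(10, 4), torch.full((10,), 2.0)),
                         batch_size=4)
demo_val_mse = evaluate(demo_model, demo_loader, torch.device('cpu'))
print(demo_val_mse)  # 4.0
```

Summing squared errors over all samples and dividing once at the end avoids the bias that averaging per-batch means would introduce when the last batch is smaller.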

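# Testing

In [434]:
eznet.eval()
with torch.no_grad():   # inference only, so no gradients are needed
    for i, feature in enumerate(test_loader, start=1):
        # The DataLoader already yields tensors, so only the device transfer is needed
        feature = feature.to(device)
        outputs = eznet(feature)
        print(outputs)
        print(i)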


tensor([9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535,
        9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535,
        9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535,
        9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535,
        9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535,
        9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535,
        9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535,
        9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535,
        9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535,
        9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535,
        9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535,
        9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535, 9.0535,
        9.0535, 9.0535, 9.0535, 9.0535, 
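The notebook defines `output_data_path` but never writes the predictions out. The sketch below shows one way to save them as CSV. The `id,tested_positive` header is an assumption about the expected submission format, `write_predictions` is a hypothetical helper, and the values in the demo are made up.

```python
import csv
import io

def write_predictions(predictions, file):
    """Write one row per prediction as (id, tested_positive)."""
    writer = csv.writer(file)
    writer.writerow(['id', 'tested_positive'])   # assumed header; adjust to the required format
    for i, value in enumerate(predictions):
        writer.writerow([i, value])

# Demo with an in-memory buffer and made-up values.
# For a real run, open the output path with open(path, 'w', newline='') instead.
buffer = io.StringIO()
write_predictions([9.05, 8.7], buffer)
print(buffer.getvalue())
```

For model outputs, the per-batch tensors would first be moved to the CPU and concatenated into one flat list of floats before being passed in.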