# 全连接神经网络数据分析（最后转成英文）

在之前的章节中，我们介绍了神经网络的基本概念、以及如何对其进行编程。在这一章节中，我们以上市公司税收不遵从行为的分类为例，用```torch```展示神经网络的内部工作机制，并借此简单介绍```torch```的基本用法。

**学习目标**：
* a
* b
* c

## 目录

<a name='1'></a>
## 1-第三方包导入

In [1]:
import pandas as pd

<a name='2'></a>
## 2-探索性数据分析

我们使用的数据包括2015-2024年中国所有上市公司的财报指标和纳税违规公告。由于税收遵从的公司占了绝大部分，我们对其进行下采样。最终得到的数据集data.csv中包含4155条税收遵从企业和831条不遵从企业的观测记录。

下面我们先使用```pandas```读取数据文件，再展示数据集的前5条观测值，以对数据产生初步认识。

In [2]:
train_df = pd.read_csv('train_data.csv') # 读取训练数据
test_df = pd.read_csv('test_data.csv') # 读取测试数据
train_df.head(5) # 展示训练数据前5行

Unnamed: 0,noncompliance,股东权益/负债合计_EquTotLia,股东类别_SHType,每股收益(元/股)_BasicEPS,每股经营活动现金流量(元/股)_OpeCFPS,营业利润/营业总收入()_OpePrTOR,净利润()_NetPrf,有形净值债务率(%)_DbTanEquRt,每股现金及现金等价物余额(元/股)_CCEPS,资产负债率(%)_DbAstRt,经营现金净流量(元)_NOCF,股东总户数(户)_SHNum,利润总额增长率(%)_TotPrfGrRt,营业收入增长率(%)_OpeIncmGrRt,户均持股数(股/户)_AvgHS,每股资本公积金(元/股)_CapSurFdPS,股东权益周转率(次)_EquRat,产权比率(%)_DbEquRt,权益乘数(%)_EquMul,营业收入3年复合增长率(%)_OperaInc3GrRt
0,0.0,174.883,0,-0.38,-0.1315,-57.4452,-195816600.0,78.2335,0.5952,36.8075,-68076470.0,46858.0,-42.8375,-30.1179,11047.0,2.0907,0.202,58.2465,1.5825,-34.051
1,0.0,118.3646,0,0.88,-0.0025,6.3096,113947800.0,93.7349,0.4161,45.795,-354280.5,48.0,48.4912,31.7471,2937500.0,3.2002,3.5688,84.4847,1.8448,11.1478
2,0.0,66.8906,0,0.58,0.2602,9.3053,42845440.0,155.1066,1.2825,59.8207,19201930.0,136.0,5.3126,12.4304,542647.0,0.6208,2.4298,148.8843,2.4888,32.6596
3,1.0,451.539,0,0.27,0.1137,23.8167,413039200.0,33.0603,1.0412,18.0099,176219100.0,33553.0,20.0212,16.2989,46189.0,0.9839,0.4365,21.9659,1.2197,13.7481
4,0.0,86.7559,1,0.103,0.2129,3.6002,129559400.0,119.5386,0.4313,53.5458,267710700.0,46289.0,30.1121,6.8971,27170.0,1.3461,1.2921,115.2659,2.1527,9.3085


在该数据集中，```noncompliance``` 变量是我们关心的Y变量，```noncompliance = 1```代表该企业在当年存在着税收不遵从行为。

我们按照```noncompliance```变量进行分组，在每组中计算样本均值。可以发现税收不遵从的企业和税收遵从的企业在股权结构、偿债能力、盈利能力等方面都可能存在着不同。

In [3]:
train_df.groupby('noncompliance').mean().round(3).T # 按照noncompliance分组计算均值，取3位小数

noncompliance,0.0,1.0
股东权益/负债合计_EquTotLia,238.302,155.16
股东类别_SHType,0.261,0.272
每股收益(元/股)_BasicEPS,0.505,0.323
每股经营活动现金流量(元/股)_OpeCFPS,0.621,0.564
营业利润/营业总收入()_OpePrTOR,-1.869,-2.717
净利润()_NetPrf,887208500.0,1432636000.0
有形净值债务率(%)_DbTanEquRt,98.924,148.144
每股现金及现金等价物余额(元/股)_CCEPS,1.852,2.223
资产负债率(%)_DbAstRt,42.075,51.769
经营现金净流量(元)_NOCF,1891478000.0,1338614000.0


<a name='3'></a>
## 3-建模分析

In [4]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import StandardScaler
torch.manual_seed(42)

<torch._C.Generator at 0x127d58cf0>

In [6]:
# 将data数据框转换为numpy数组
X_train_array = train_df.drop(columns=['noncompliance']).values
y_train_array = train_df['noncompliance'].values
X_test_array = test_df.drop(columns=['noncompliance']).values
y_test_array = test_df['noncompliance'].values

scaler = StandardScaler()
X_train_array = scaler.fit_transform(X_train_array)  # 计算均值和方差并标准化
X_test_array = scaler.transform(X_test_array) # 使用训练集的均值和方差来标准化测试集

# 将 numpy数组转换为 pytorch tensor
X_train_tensor = torch.tensor(X_train_array, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train_array, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_array, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test_array, dtype=torch.float32)

# 将pytorch tensor 转换为 pytorch dataset
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)

In [7]:
class FNN(nn.Module):
    def __init__(self):
        super(FNN, self).__init__()
        self.fc1 = nn.Linear(19, 8) # 输入层：19特征 -> 隐藏层1: 16神经元
        self.fc2 = nn.Linear(8, 4) # 隐藏层1：19特征 -> 隐藏层2: 8神经元
        self.fc3 = nn.Linear(4, 1) # 隐藏层2：8特征 -> 输出概率
        
    def forward(self, x):
        x = torch.relu(self.fc1(x)) # relu激活函数
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x)) # sigmoid激活函数
        return x

In [8]:
def train(model, train_loader, criterion, optimizer, num_epochs):
    model.train()  # 设置模型为训练模式
    for epoch in range(num_epochs):
        total_loss = 0.0  # 用于累计整个 epoch 的 loss
        num_batches = 0

        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output.squeeze(1), target)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            num_batches += 1

        avg_loss = total_loss / num_batches
        if (epoch + 1) % 50 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Average epoch Loss: {avg_loss:.4f}')

In [16]:
from sklearn.metrics import precision_score, recall_score, f1_score

def test(model, test_loader, criterion):
    model.eval()  # 设置模型为评估模式
    test_loss = 0
    correct = 0
    all_preds = []
    all_targets = []
    
    with torch.no_grad():
        for data, target in test_loader:
            output = model(data)
            test_loss += criterion(output.squeeze(1), target).item()
            
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
            
            all_preds.extend(pred.cpu().numpy())
            all_targets.extend(target.cpu().numpy())
    
    test_loss /= len(test_loader.dataset)
    accuracy = 100. * correct / len(test_loader.dataset)
    
    # Precision, Recall, F1
    precision = precision_score(all_targets, all_preds, zero_division=0)
    recall = recall_score(all_targets, all_preds, zero_division=0)
    f1 = f1_score(all_targets, all_preds, zero_division=0)

    print(f'Test Loss: {test_loss:.4f}, Accuracy: {accuracy:.2f}%')
    print(f'Precision: {precision:.4f}, Recall: {recall:.4f}, F1-score: {f1:.4f}')
    print("Number of predicted positives:", sum(all_preds))
    print("Number of actual positives:", sum(all_targets))

In [None]:
# def test(model, test_loader, criterion):
#     model.eval() # 设置模型为评估模式
#     test_loss = 0
#     correct = 0
#     with torch.no_grad():
#         for data, target in test_loader:
#             output = model(data)
#             test_loss += criterion(output.squeeze(1), target).item()
#             pred = output.argmax(dim=1, keepdim=True)
#             correct += pred.eq(target.view_as(pred)).sum().item()
#     test_loss /= len(test_loader.dataset)
#     accuracy = 100. * correct / len(test_loader.dataset)

#     print(f'Test Loss: {test_loss:.4f}, Accuracy: {accuracy:.2f}%')


In [10]:
model = FNN() # 实例化模型
criterion = nn.BCELoss() # 二分类交叉熵损失函数
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) # Adam优化器


In [14]:
train(model, train_loader, criterion, optimizer, num_epochs=500) # 训练模型

Epoch [50/500], Average epoch Loss: 0.1020
Epoch [100/500], Average epoch Loss: 0.1021
Epoch [150/500], Average epoch Loss: 0.1019
Epoch [200/500], Average epoch Loss: 0.1019
Epoch [250/500], Average epoch Loss: 0.1019
Epoch [300/500], Average epoch Loss: 0.1031
Epoch [350/500], Average epoch Loss: 0.1019
Epoch [400/500], Average epoch Loss: 0.1028
Epoch [450/500], Average epoch Loss: 0.1018
Epoch [500/500], Average epoch Loss: 0.1016


In [17]:
test(model, test_loader, criterion)

Test Loss: 0.0009, Accuracy: 97.82%
Precision: 0.0000, Recall: 0.0000, F1-score: 0.0000
Number of predicted positives: [0]
Number of actual positives: 172.0


## 4-模型比较