# Step 1: Install the library
First, activate the Conda environment:
```Bash
conda activate fin-gen
```
Then install the library with pip. Installing `rich` alongside it is recommended, since it makes console output easier to read.
```Bash
pip install pytorch-tabular rich
```
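If the installation runs into dependency conflicts, try installing into a fresh environment, or resolve the conflicts based on the error messages. This is routine when setting up a research environment.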
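
When tracking down a conflict, it can help to see which versions are actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the package list are illustrative choices, not part of pytorch-tabular.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages this walkthrough relies on; extend the list as needed.
report = installed_versions(["pytorch-tabular", "torch", "pandas", "rich"])
for name, ver in report.items():
    print(f"{name:16s} {ver or 'not installed'}")
```

Comparing this report with the versions named in pip's error message usually shows which package needs to be pinned or upgraded.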

# Step 2: Prepare the data

In [1]:
import kagglehub
import os
import pandas as pd

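# 1. Download the dataset into the kagglehub cache and get its local path
print("Downloading the dataset or locating it in the cache...")
dataset_dir = kagglehub.dataset_download("uciml/default-of-credit-card-clients-dataset")
dataset = os.path.join(dataset_dir, "UCI_Credit_Card.csv")

# 2. Load the data
df = pd.read_csv(dataset)

# 3. Initial inspection and cleaning
print("---- Raw data info ----")
df.info()  # info() prints its report itself and returns None

# The ID column carries no predictive signal, so drop it
df = df.drop("ID", axis=1)

# Some column names are awkward, so rename them:
# the target 'default.payment.next.month' becomes 'default', and
# 'PAY_0' becomes 'PAY_1' so the repayment-status columns run PAY_1..PAY_6
df = df.rename(columns={'default.payment.next.month': 'default',
                        'PAY_0': 'PAY_1'})

# Preview the first 5 rows to confirm the changes took effect
print("\n--- Preview after cleaning and renaming ---")
print(df.head())

Downloading the dataset or locating it in the cache...
---- Raw data info ----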
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          30000 non-null  int64  
 1   LIMIT_BAL                   30000 non-null  float64
 2   SEX                         30000 non-null  int64  
 3   EDUCATION                   30000 non-null  int64  
 4   MARRIAGE                    30000 non-null  int64  
 5   AGE                         30000 non-null  int64  
 6   PAY_0                       30000 non-null  int64  
 7   PAY_2                       30000 non-null  int64  
 8   PAY_3                       30000 non-null  int64  
 9   PAY_4                       30000 non-null  int64  
 10  PAY_5                       30000 non-null  int64  
 11  PAY_6                       30000 non-null  int64  
 12  BILL_AMT1                   30000 non-null  float64
 13 

In [2]:
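# --- 1. Define feature types ---
# Decide which features are categorical and which are continuous (numerical).
# As a rule of thumb, a feature with a small set of values that do not express magnitude is categorical.
# Note: BILL_AMT1..BILL_AMT6 are not included in either list.
categorical_cols = ['SEX', 'EDUCATION', 'MARRIAGE'] + [f'PAY_{i}' for i in range(1, 7)]
numerical_cols = ['LIMIT_BAL', 'AGE'] + [f'PAY_AMT{i}' for i in range(1, 7)]

# Target variable
target_col = 'default'
print(f"Categorical features: {categorical_cols}")
print(f"Numerical features: {numerical_cols}")
print(f"Target: {target_col}")


Categorical features: ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
Numerical features: ['LIMIT_BAL', 'AGE', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
Target: default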


In [3]:
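# --- 2. Fix data types ---
# The PAY_* features are stored as integers, but they encode repayment-status categories
# (-2 = card not used, -1 = paid in full, 1 = payment delayed by one month, ...),
# so make sure they are treated as categorical.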

for col in categorical_cols:
    df[col] = df[col].astype('category')

# print(df.describe())

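### Code explanation:
* **Feature typing:** This is a key step. Correctly separating `categorical_cols` from `numerical_cols` is essential for FT-Transformer, because it handles the two kinds of features differently.
* **.astype('category'):** Explicitly tells Pandas and pytorch-tabular that these columns are categories, not numbers whose magnitudes can be compared.
* **train_test_split:** The standard way to split data in machine learning, used in the next cell.
  `test_size=0.2` holds out 20% of the data, which is never used for training and only serves for the final evaluation. `stratify=df[target_col]` is an important argument: it keeps the ratio of default to non-default samples in the training, validation, and test sets the same as in the original dataset.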
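
To make the effect of stratification concrete, here is a hand-rolled sketch in plain Python. It is not scikit-learn's implementation; `stratified_split` and `positive_rate` are hypothetical helpers, and the labels are made up (30,000 samples with an assumed 22% positive rate). It applies the same two-stage split as this walkthrough: 20% for testing, then 10% of the remainder (8% of all samples) for validation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=42):
    """Split indices so every class sends `test_fraction` of its members to the test side."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx


def positive_rate(labels, indices):
    """Fraction of the selected samples whose label is 1."""
    return sum(labels[i] for i in indices) / len(indices)


# Made-up labels, for illustration only.
labels = [1] * 6600 + [0] * 23400

# Stage 1: hold out 20% as the test set.
train_val_idx, test_idx = stratified_split(labels, test_fraction=0.2)
train_val_labels = [labels[i] for i in train_val_idx]

# Stage 2: take 10% of the remainder as the validation set.
# These indices refer to positions in train_val_labels.
train_idx, val_idx = stratified_split(train_val_labels, test_fraction=0.1)

print(len(train_idx), len(val_idx), len(test_idx))  # 21600 2400 6000
print(positive_rate(train_val_labels, train_idx),
      positive_rate(train_val_labels, val_idx),
      positive_rate(labels, test_idx))
```

All three subsets keep the 22% positive rate of the full label list, which is the property `stratify` is meant to provide.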

In [4]:
from sklearn.model_selection import train_test_split

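# --- 3. Split the dataset ---
# Split the data into a training set and a test set:
# the model is trained on the training set, and its generalization is measured on the test set.
# A validation set is then carved out of the training data to monitor and tune training.

# First hold out 20% as the final test set
train_val_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df[target_col])

# Then take 10% of the remaining data (8% of the full dataset) as the validation set
train_df, val_df = train_test_split(train_val_df, test_size=0.1, random_state=42, stratify=train_val_df[target_col])
print("\n--- Dataset split ---")
print(f"Total samples: {len(df)}")
print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"Test samples: {len(test_df)}")



--- Dataset split ---
Total samples: 30000
Training samples: 21600
Validation samples: 2400
Test samples: 6000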


# Step 3: Configure and train the FT-Transformer model

In [5]:
from pytorch_tabular import TabularModel
from pytorch_tabular.models import FTTransformerConfig
from pytorch_tabular.config import DataConfig,OptimizerConfig,TrainerConfig,ExperimentConfig

* [Reference documentation](https://pytorch-tabular.readthedocs.io/en/latest/models/)
* [API documentation](https://pytorch-tabular.readthedocs.io/en/stable/apidocs_model/#pytorch_tabular.models.FTTransformerConfig)

--- 1. Configure the data ---

In [6]:
data_config = DataConfig(
    target = [target_col], # name of the target column
    continuous_cols = numerical_cols,
    categorical_cols = categorical_cols,
)
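
The two column lists matter because FT-Transformer turns each feature into a token in a different way. The sketch below illustrates the feature-tokenizer idea in plain Python: a continuous value scales a vector and adds a bias, while a categorical value selects one vector from a lookup table. It is a simplified illustration, not pytorch-tabular's actual implementation; the class names, the tiny embedding size, and the category list are all assumptions, and random numbers stand in for learned parameters.

```python
import random

EMBED_DIM = 4  # tiny on purpose; real models use a larger embedding size
rng = random.Random(0)


def random_vector(dim):
    """Randomly initialized stand-in for a learned parameter vector."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


class NumericTokenizer:
    """Continuous feature -> token: scale a weight vector by the value, then add a bias."""

    def __init__(self, dim):
        self.weight = random_vector(dim)
        self.bias = random_vector(dim)

    def __call__(self, value):
        return [value * w + b for w, b in zip(self.weight, self.bias)]


class CategoricalTokenizer:
    """Categorical feature -> token: look up one vector per category."""

    def __init__(self, categories, dim):
        self.index = {category: i for i, category in enumerate(categories)}
        self.table = [random_vector(dim) for _ in categories]

    def __call__(self, category):
        return self.table[self.index[category]]


age_tokenizer = NumericTokenizer(EMBED_DIM)
pay_tokenizer = CategoricalTokenizer(categories=[-2, -1, 0, 1, 2], dim=EMBED_DIM)

# One row becomes a sequence of equally sized tokens that attention layers can mix.
tokens = [age_tokenizer(35.0), pay_tokenizer(-1)]
print(len(tokens), len(tokens[0]))  # 2 4
```

With this design, a numeric token changes linearly with the value, whereas the vectors for categories `-1` and `-2` are unrelated. That is why a status code left as a number would be interpreted very differently from one declared categorical.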

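--- 2. Configure the model (FT-Transformer) ---

This is where the model's hyperparameters are set. `num_heads` and `num_attn_blocks` are the number of attention heads and the number of Transformer layers, respectively.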

In [7]:
# num_heads: number of attention heads; num_attn_blocks: number of Transformer layers
model_config = FTTransformerConfig(
    task = "classification",
    num_heads = 4,
    num_attn_blocks = 3,
    learning_rate = 1e-4, # the default is 1e-3
)
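
In the standard multi-head attention formulation, the token embedding is split evenly across heads, so the embedding size has to be divisible by `num_heads`. The sketch below shows that arithmetic with a hypothetical helper, `per_head_dim`, and an illustrative embedding size of 32. How pytorch-tabular sizes its heads internally is not confirmed here, so treat this as background on the hyperparameter, not a description of the library.

```python
def per_head_dim(embed_dim, num_heads):
    """Width of each attention head when the embedding is split evenly across heads."""
    if embed_dim % num_heads != 0:
        raise ValueError(
            f"embed_dim={embed_dim} is not divisible by num_heads={num_heads}"
        )
    return embed_dim // num_heads


EMBED_DIM = 32  # illustrative value, not read from the library
for heads in (4, 8):
    print(f"num_heads={heads}: {per_head_dim(EMBED_DIM, heads)} dims per head")
```

More heads means narrower heads for the same embedding size, so `num_heads` is usually tuned together with the embedding dimension.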

--- 3. Configure the trainer ---

This sets the parameters of the training process, such as whether to use a GPU and how many epochs to run.

In [8]:
trainer_config = TrainerConfig(  
    batch_size = 64,
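    accelerator = 'auto',        # use a GPU automatically if one is available ('gpu', 'cpu', 'auto')
    max_epochs = 10,             # start with a small number of epochs
    early_stopping = "valid_loss", # stop early once validation loss stops improving, to limit overfitting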
    early_stopping_patience = 3,
    progress_bar="simple"
)
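
The interaction between `early_stopping` and `early_stopping_patience` can be sketched in a few lines of plain Python. This is a simplified model of patience-based early stopping, not the library's callback; the function name and the validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None if it never stops early."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses, for illustration only.
losses = [0.50, 0.46, 0.45, 0.451, 0.452, 0.46, 0.40]
print(early_stopping_epoch(losses, patience=3))  # 5
```

With a patience of 3, training halts after three consecutive epochs without a new best loss, so the late improvement to 0.40 in this example is never reached. A larger patience trades extra training time for robustness to such plateaus.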

--- 4. Configure the optimizer ---

In [9]:
optimizer_config = OptimizerConfig()


--- 5. Configure experiment tracking ---

In [10]:
experiment_config = ExperimentConfig(
    project_name="CreditCard_FTTransformer", # experiment project name
    run_name="first_run",                   # name of this run
    log_target="tensorboard",               # log to TensorBoard
    # log_target="wandb", # switch to this if you use WandB
)

--- 6. Initialize TabularModel ---

In [11]:
# The core object that ties all the configurations together
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
    experiment_config=experiment_config,
)

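--- 7. Start training ---
Note that one line of the installed library was modified for this run: line 85 of

`~/miniconda3/envs/fin-gen/lib/python3.10/site-packages/pytorch_tabular/utils/python_utils.py`

was changed as follows:
```python
# before
return torch.load(f, map_location=map_location)
# after
return torch.load(f, map_location=map_location, weights_only=False)
```
Starting with PyTorch 2.6, `torch.load` defaults `weights_only` to `True`. In that mode only tensors and a small allowlist of types may be unpickled, so loading a checkpoint that contains other Python objects raises an exception.

The change is acceptable here because both the data source and the checkpoints written during training are trusted; `weights_only=False` permits arbitrary code execution during unpickling, so it should not be used on files from untrusted sources.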

In [12]:
print("\n--- Starting FT-Transformer training ---")
tabular_model.fit(train=train_df, validation=val_df)
print("\n--- Training complete! ---")

Seed set to 42



--- Starting FT-Transformer training ---


  int(x) + 1 for x in list(self.train[config.categorical_cols].fillna("NA").nunique().values)
  map = Series(unique(X[col].fillna(NAN_CATEGORY)), name=col).reset_index().rename(columns={"index": "value"})
  map = Series(uni

Trainer will use only 1 of 8 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=8)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
/data/home/yuanxiaosong/miniconda3/envs/fin-gen/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:654: Checkpoint directory /data/home/yuanxiaosong/fin-gen/saved_models exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

  | Name             | Type                  | Params | Mode 
-------------------------------------------------------------------
0 | _backbone        | FTTransformerBackbone | 86.5 K | train
1 | _embedding_layer | Embedding2dLayer      | 3.6 K  | train
2 | _head            | LinearHead            | 66     | train
3 | loss             | CrossEntropyLoss   

Sanity Checking: |                                                                          | 0/? [00:00<?, ?i…

/data/home/yuanxiaosong/miniconda3/envs/fin-gen/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=95` in the `DataLoader` to improve performance.
/data/home/yuanxiaosong/miniconda3/envs/fin-gen/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=95` in the `DataLoader` to improve performance.


Training: |                                                                                 | 0/? [00:00<?, ?i…

Validation: |                                                                               | 0/? [00:00<?, ?i…


--- Training complete! ---


--- 8. Evaluate the model on the test set ---

In [13]:
# --- 8. Evaluate the model on the test set ---
print("\n--- Evaluating model performance on the test set ---")
eval_result = tabular_model.evaluate(test_df)
print(eval_result)

  X_encoded[col] = X_encoded[col].fillna(NAN_CATEGORY).map(mapping["value"])
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X_encoded[col].fillna(self._imputed, inplace=True)
 -0.26464071]' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
  data.loc[:, self.config.continuous_cols] = self.scaler.transform(data.loc[:, self.config.continuous_cols])
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]



--- Evaluating model performance on the test set ---


/data/home/yuanxiaosong/miniconda3/envs/fin-gen/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=95` in the `DataLoader` to improve performance.


Testing: |                                                                                  | 0/? [00:00<?, ?i…

[{'test_loss_0': 0.44090536236763, 'test_loss': 0.44090536236763, 'test_accuracy': 0.8176666498184204}]
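
The printed result is a list containing one dictionary of metrics. As a small sketch of how to read it, the snippet below copies the values from the output above and converts the accuracy into counts, using the test-set size of 6000 reported by the split cell.

```python
# Metrics copied from the evaluation output above.
eval_result = [{'test_loss_0': 0.44090536236763,
                'test_loss': 0.44090536236763,
                'test_accuracy': 0.8176666498184204}]
TEST_SET_SIZE = 6000  # size of the test set reported by the split cell

metrics = eval_result[0]
n_correct = round(metrics["test_accuracy"] * TEST_SET_SIZE)
n_wrong = TEST_SET_SIZE - n_correct

print(f"accuracy: {metrics['test_accuracy']:.2%}")          # accuracy: 81.77%
print(f"correct: {n_correct}, misclassified: {n_wrong}")    # correct: 4906, misclassified: 1094
```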


--- 9. Save the model ---

In [14]:
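save_path = "saved_models/ft_transformer_v1"
tabular_model.save_model(save_path)  # save_model returns None, so keep the path in a variable
print(f"\nModel saved to: {save_path}")


Model saved to: saved_models/ft_transformer_v1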
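
A saved model is only useful if it can be loaded again. The sketch below wraps the reload in a hypothetical helper, `reload_and_predict`. It assumes the `TabularModel.load_model` and `predict` methods of the installed pytorch-tabular version behave as documented, and that the directory name matches the one used when saving.

```python
def reload_and_predict(model_dir, frame):
    """Load a saved pytorch-tabular model and return its predictions for `frame`."""
    # Imported lazily so this helper can be defined without the library installed.
    from pytorch_tabular import TabularModel

    loaded_model = TabularModel.load_model(model_dir)
    return loaded_model.predict(frame)


# Example call, assuming the saved directory and the test_df split exist:
#   predictions = reload_and_predict("saved_models/ft_transformer_v1", test_df)
#   print(predictions.head())
```

Running the reloaded model on `test_df` and comparing against the earlier evaluation is a quick check that the saved artifacts are complete.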
