## Approach: combining prefill and decode (PD)

### Replacing the gate and up layers separately

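First, feed in a single prompt and record which neurons are activated during the prefill stage at 70% sparsity. Then collect, for this prompt, the 70% of neurons with the highest activation counts.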
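The counting step can be sketched as follows. This is a minimal illustration under assumptions, not the logic inside `convert_llama`: the helper name `select_frequent_neurons`, the parameter `keep_ratio`, and the use of absolute magnitude to decide which neurons count as active for a token are all hypothetical.

```python
import torch

def select_frequent_neurons(acts: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Return indices of the neurons most often active across prefill tokens.

    acts: (num_tokens, num_neurons) activations recorded during prefill.
    keep_ratio: fraction of neurons kept per token and in the final set
                (hypothetical parameter; the notebook's exact convention is not shown).
    """
    num_tokens, num_neurons = acts.shape
    k = max(1, int(keep_ratio * num_neurons))
    # Per token, mark the k neurons with the largest magnitude as "active".
    topk_idx = acts.abs().topk(k, dim=1).indices          # (num_tokens, k)
    # Count how many tokens activated each neuron.
    counts = torch.bincount(topk_idx.reshape(-1), minlength=num_neurons)
    # Keep the k neurons with the highest activation counts.
    return counts.topk(k).indices

torch.manual_seed(0)
acts = torch.randn(8, 16)      # toy prefill activations: 8 tokens, 16 neurons
acts[:, :4] += 10.0            # make neurons 0..3 consistently strong
selected = select_frequent_neurons(acts, keep_ratio=0.25)
print(sorted(selected.tolist()))
```

The selected index set would then be reused for every decode step of the same prompt, which is what the overlap statistics printed further down appear to evaluate.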

In [None]:
import os
import json
import torch
from transformers import LlamaForCausalLM, AutoTokenizer
import convert_llama
from transformers import GenerationConfig
from datasets import load_dataset

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

### Read the model and dataset paths from path.json
model_name = "Llama3-8b"
dataset_name = "c4"
with open('path.json', 'r') as file:
    paths = json.load(file)
    model_path = paths.get(model_name, '')
    dataset_path = paths.get(dataset_name, '')

c4 = load_dataset(dataset_path)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    device_map='auto',
    use_cache=True,
    torch_dtype=torch.float16,
)
convert_llama.convert_llama_model(model, sparsity=0.1, start_num=14, end_num=16)

tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [2]:
for c4_demo in c4['validation']['text'][:1]:
    input_demo = tokenizer(c4_demo, padding="max_length", truncation=True, max_length=200, return_tensors="pt")
    generate_ids = model.generate(input_demo.input_ids.to('cuda:0'), max_length=230, generation_config=GenerationConfig(do_sample=False), pad_token_id=tokenizer.eos_token_id)
    tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    model.model.layers[15].mlp.gate_proj.coreinfer_recall()
    model.model.layers[15].mlp.up_proj.coreinfer_recall()


[prefill] in gate layer: 15
[prefill] in up layer: 15
in decode, gate layer 15
Overlap count: 1243.8621, Overlap ratio: 0.8680
in decode, up layer 15
Overlap count: 1080.8966, Overlap ratio: 0.7543


In [2]:
### Use greedy decoding
generate_ids = model.generate(input_demo.input_ids.to('cuda:0'), max_length=230, generation_config=GenerationConfig(do_sample=False), pad_token_id=tokenizer.eos_token_id)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

[prefill] in gate layer: 15
[prefill] in up layer: 15


'The woman who died after falling from a bridge over the A21 has been identified as a Sevenoaks mum.\nMarta Kendle, 37, fell from the Gracious Lane bridge on the morning of February 19.\nPolice were called to the carriageway around 6.10am and the road was promptly closed in both directions.\nDespite paramedics best efforts, Marta, who was originally from Poland, was pronounced dead at the scene.\nKent and Medway Coroners office have confirmed an inquest into her death will open on Wednesday (February 27).\nTributes to the mum were left at the scene and on social media.\nFriend, Jodi Cahill posted on Facebook: "I will certainly remember you. I am sorry we did not see how lost and alone you felt.\n"Be at peace dear Marta."\nA floral tribute left at the scene said goodbye to the "beautiful and kind soul".\nIt read: "To a beautiful and kind soul. You will be missed. Rest in peace."\nA spokesman for Kent Police said: "Officers were called to the A21 at Gracious Lane,'

In [3]:
model.model.layers[15].mlp.gate_proj.coreinfer_recall()
model.model.layers[15].mlp.up_proj.coreinfer_recall()

in decode, gate layer 15
Overlap count: 1243.8621, Overlap ratio: 0.8680
in decode, up layer 15
Overlap count: 1080.8966, Overlap ratio: 0.7543
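
The printed ratios are consistent with dividing the mean overlap count by the per-token top-k size: with `sparsity=0.1` and an MLP width of 14336 (the width used by the reshapes later in this notebook), `int(0.1 * 14336)` is 1433. The check below is only arithmetic on the numbers printed above. How `coreinfer_recall` computes them internally is not shown here, so this interpretation is an assumption.

```python
# Arithmetic check on the printed numbers (the interpretation is an assumption).
intermediate_size = 14336                  # MLP width used elsewhere in this notebook
sparsity = 0.1                             # value passed to convert_llama_model
k = int(sparsity * intermediate_size)      # neurons kept per token under this reading

gate_ratio = 1243.8621 / k                 # gate layer 15 overlap count / k
up_ratio = 1080.8966 / k                   # up layer 15 overlap count / k
print(k, round(gate_ratio, 4), round(up_ratio, 4))
```

Under this reading, roughly 87% of the neurons chosen per decode step in the gate layer, and 75% in the up layer, were already in the set recorded during prefill.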


## Loading the saved activations

In [2]:
from torch import nn
import torch.nn.init as init
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    
from torch.utils.tensorboard import SummaryWriter
from torch.cuda.amp import GradScaler, autocast  
from torch.utils.data import DataLoader, Dataset, random_split
import torch.optim as optim
import torch
import json

with open('path.json', 'r') as file:
    paths = json.load(file)
    save_path = paths.get('save_path','')

def load_datasets(layerid = 1, expertid = 0):
    datasets_x = []
    datasets_y = []
    datasets_x1 = []
    for fileid in range(1, 5):
        # print(fileid)
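        # Pass map_location so the saved tensors are loaded onto GPU 0
        d = torch.load(f'{save_path}/{fileid}-{layerid}.pth', map_location=lambda storage, loc: storage.cuda(0))
        datasets_x.append(d[0])
        datasets_x1.append(d[1])
        datasets_y.append(d[2])
    # Concatenate the four files along dim 1, then flatten to one row per token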
    x,x1,y = torch.cat(datasets_x,dim=1), torch.cat(datasets_x1,dim=1), torch.cat(datasets_y,dim=1)
    datasets_x.clear()
    datasets_y.clear()
    datasets_x1.clear()
    x = x.reshape(-1, 4096)
    x1 = x1.reshape(-1, 14336)
    y = y.reshape(-1, 14336)
    # print(x[0].shape)
    return x,x1,y

class CustomDataset(Dataset):
    def __init__(self, layerid = 1, expertid = 0):
        # Load the data (x, x1, y)
        self.data_x, self.data_x1, self.data_y = load_datasets(layerid)
        print(len(self.data_x1),len(self.data_x),len(self.data_y))

    def __len__(self):
        return len(self.data_x)
    
    def __getitem__(self, idx):
        return self.data_x[idx],self.data_x1[idx],self.data_y[idx]

In [3]:
# expertid = 2
layerid = 15
dataset = CustomDataset(layerid)
print(len(dataset), dataset[0][0].shape, dataset[0][1].shape) # per-sample shapes: torch.Size([4096]), torch.Size([14336])
# Split into training and validation sets
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=1024, shuffle=False)


110137 110137 110137
110137 torch.Size([4096]) torch.Size([14336])


## Sparsity predictor

In [7]:
class SimpleLinearModel(nn.Module):
    def __init__(self,input_dim,output_dim,hidden_dim=32):
        super(SimpleLinearModel, self).__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim,bias=False)
        # self.activation = nn.SiLU() # add an activation function
        self.linear2 = nn.Linear(hidden_dim,output_dim,bias=False)  
        init.kaiming_normal_(self.linear1.weight, mode='fan_out', nonlinearity='relu')
        init.kaiming_normal_(self.linear2.weight, mode='fan_out', nonlinearity='relu')
        # self.linear1.bias.data.fill_(0)
        # self.linear2.bias.data.fill_(0)

    def forward(self, x):
        # x= self.activation(x)
        return self.linear2(self.linear1(x))
    
model=SimpleLinearModel(4096,14336,hidden_dim=128)
model.to("cuda")  # assumes a GPU is available
# criterion = nn.MSELoss().to("cuda")
criterion = nn.CrossEntropyLoss().to("cuda")  # receives sigmoid outputs and multi-hot float targets in train_model
# criterion = nn.KLDivLoss(reduction='batchmean').to("cuda")
optimizer = optim.Adam(model.parameters(), lr=5e-4) #lr=5e-5
writer = SummaryWriter('runs/predictor_sparsity')
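
Because the labels built by `generate_label` are multi-hot masks (many active neurons per row), one common alternative to `nn.CrossEntropyLoss` for this kind of target is `nn.BCEWithLogitsLoss`, which treats each neuron as an independent binary decision and takes raw logits rather than sigmoid outputs. The sketch below uses toy tensors and is not what produced the numbers reported in this notebook; swapping the criterion would change them.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10)        # stand-in for predictor outputs (raw, no sigmoid)
targets = torch.zeros(4, 10)
targets[:, :2] = 1.0               # toy multi-hot labels: 2 active neurons per row

# BCEWithLogitsLoss applies the sigmoid internally, so it takes raw logits.
bce = nn.BCEWithLogitsLoss()
loss = bce(logits, targets)

# Equivalent manual form, shown only to make the definition explicit.
manual = nn.functional.binary_cross_entropy(torch.sigmoid(logits), targets)
print(loss.item(), manual.item())
```

With this criterion the training loop would pass `outputs` directly instead of `outputs.sigmoid()`.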

In [8]:
from tqdm import tqdm

def sparse_row(row, keep_ratio=0.1, use_abs = False):
    # Number of entries to keep
    num_to_keep = int(keep_ratio * row.numel())

    # Find the indices of the num_to_keep largest entries
    # (largest by absolute value when use_abs is True)
    if use_abs:
        row = torch.abs(row)
    topk_indices = torch.topk(row, num_to_keep).indices

    # Create an all-zero tensor with the same shape as row
    mask = torch.zeros_like(row)

    # Set the entries at topk_indices to 1
    mask[topk_indices] = 1

    return mask

def generate_label(y, sparsity, use_abs=False):
    # Build a 0/1 mask for each row; note that `sparsity` is passed on as keep_ratio,
    # i.e. it is the fraction of entries kept
    sparse_tensor = torch.stack([sparse_row(row, sparsity, use_abs) for row in y])
    return sparse_tensor

def test_model(model, val_loader, sparsity=0.1):
    model.eval()
    # Initialize the running totals
    total_correct_preds = 0
    total_preds = 0
    total_labels = 0
    total_masks = 0

    with torch.no_grad():
        for batch_idx, (inputs,_, targets) in tqdm(enumerate(val_loader)):
            with autocast():
                outputs = model(inputs)

            preds = generate_label(outputs, sparsity)
            truth = generate_label(targets, 0.2, use_abs = True)
            # truth = targets
            
            # Compute the statistics for the current batch
            dif = truth - preds
            miss = dif > 0.0 # the predictor missed a neuron that is active in the label

            total_correct_preds += (truth.sum(dim=1).float() - miss.sum(dim=1).float()).mean().item()
            total_preds += (preds.sum(dim=1).float()).mean().item()
            total_labels += (truth.sum(dim=1).float()).mean().item()

    # print('Predicted fraction: {:.4f}'.format((total_preds/total_masks).item()))
    # print('Label fraction: {:.4f}'.format((total_labels/total_masks).item()))
    print('Predicted-to-label selection count ratio:',(total_preds / total_labels))
    print('Coverage (recall):',(total_correct_preds / total_labels))

def train_model(model, train_loader, val_loader, criterion, optimizer, writer, epochs=25, layerid=1):
    scaler = GradScaler()  # create the GradScaler
    for epoch in range(epochs):
        if epoch % 2 == 0:
            print(f'---------after training {epoch} epochs---------')
            test_model(model, val_loader, sparsity=0.3)
        model.train()
        for batch_idx, (inputs,_, targets) in enumerate(train_loader):
            inputs, targets = inputs.cuda(), targets.cuda()

            optimizer.zero_grad()

            targets = generate_label(targets, 0.2, use_abs =True)

            # Use autocast for automatic mixed precision
            with autocast():
                outputs = model(inputs)
                probs = outputs.sigmoid()
                # cross_entropy
                loss = criterion(probs, targets)

            # Scale the loss with GradScaler, then backpropagate
            # Note: the backward pass is not inside the autocast() block
            scaler.scale(loss).backward()
            writer.add_scalar('Loss/Train', loss.item(), epoch * len(train_loader) + batch_idx)
            # Call scaler.step() to update the model weights and scaler.update() to prepare for the next step
            scaler.step(optimizer)
            scaler.update()
    print(f'---------after training {epochs} epochs---------')
    test_model(model, val_loader, sparsity=0.3)
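
`generate_label` builds the mask one row at a time in a Python loop. A batched version is sketched below, together with the coverage computation. The names `topk_mask` and `recall` are hypothetical and not part of the notebook. The row sums are taken in float32, since counts above 2048 cannot all be represented exactly in float16.

```python
import torch

def topk_mask(y: torch.Tensor, keep_ratio: float, use_abs: bool = False) -> torch.Tensor:
    """Batched 0/1 mask marking the top keep_ratio fraction of each row."""
    k = int(keep_ratio * y.shape[1])
    scores = y.abs() if use_abs else y
    idx = scores.topk(k, dim=1).indices        # (batch, k) column indices
    mask = torch.zeros_like(y)
    mask.scatter_(1, idx, 1.0)                 # write 1 at the selected columns
    return mask

def recall(pred_mask: torch.Tensor, true_mask: torch.Tensor) -> float:
    """Mean fraction of label neurons that the prediction also selected."""
    hits = (pred_mask.float() * true_mask.float()).sum(dim=1)
    return (hits / true_mask.float().sum(dim=1)).mean().item()

# Toy example: 2 rows, 4 neurons, keep half of them.
y = torch.tensor([[0.1, -3.0, 2.0, 0.5],
                  [4.0, 0.2, -0.3, 1.0]])
scores = torch.tensor([[0.0, 1.0, 0.2, 0.9],
                       [0.9, 0.8, 0.1, 0.0]])
truth = topk_mask(y, 0.5, use_abs=True)
pred = topk_mask(scores, 0.5)
print(truth.tolist(), pred.tolist(), recall(pred, truth))
```

In this toy case each row's prediction hits one of the two label neurons, so the coverage is 0.5.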


In [9]:
### c4_llama3
train_model(model, train_loader, val_loader, criterion, optimizer, writer=writer, epochs=20, layerid=layerid)

---------after training 0 epochs---------
11it [00:03,  3.07it/s]
Predicted-to-label selection count ratio: 1.499302649930265
Coverage (recall): 0.30043170823031573
---------after training 2 epochs---------
11it [00:03,  3.10it/s]
Predicted-to-label selection count ratio: 1.499302649930265
Coverage (recall): 0.5338869594236857
---------after training 4 epochs---------
11it [00:03,  3.09it/s]
Predicted-to-label selection count ratio: 1.499302649930265
Coverage (recall): 0.5561815844420109
---------after training 6 epochs---------
11it [00:03,  3.21it/s]
Predicted-to-label selection count ratio: 1.499302649930265
Coverage (recall): 0.5669374307540791
---------after training 8 epochs---------
11it [00:03,  3.21it/s]
Predicted-to-label selection count ratio: 1.499302649930265
Coverage (recall): 0.5732316396468279
---------after training 10 epochs---------
11it [00:03,  3.16it/s]
Predicted-to-label selection count ratio: 1.499302649930265
Coverage (recall): 0.5770994205506292
---------after training 12 epochs---------
11it [00:03,  3.09it/s]
Predicted-to-label selection count ratio: 1.499302649930265
Coverage (recall): 0.5796713903404633
---------after training 14 epochs---------
11it [00:03,  3.06it/s]
Predicted-to-label selection count ratio: 1.499302649930265
Coverage (recall): 0.5813429851684339
---------after training 16 epochs---------
11it [00:03,  3.20it/s]
Predicted-to-label selection count ratio: 1.499302649930265
Coverage (recall): 0.5825691479489092
---------after training 18 epochs---------
11it [00:03,  3.19it/s]
Predicted-to-label selection count ratio: 1.499302649930265
Coverage (recall): 0.5834638697877045
---------after training 20 epochs---------
11it [00:03,  3.19it/s]
Predicted-to-label selection count ratio: 1.499302649930265
Coverage (recall): 0.5841920779385282
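
Two sanity checks on these numbers, offered as a hedged reading rather than something the notebook states. First, the predictor keeps 30% of the neurons (`sparsity=0.3` in `test_model`) while the labels keep 20%, so the count ratio should be close to 1.5, and a predictor that picks neurons at random would be expected to cover about 30% of the labels, which is in line with the 0.3004 recall before training. Second, the printed ratio equals 4300/2868, whereas `int(0.2 * 14336)` is 2867. Since 2868 is the value 2867 rounds to in float16, the row sums may have been accumulated in half precision (an assumption, not verified here). The simulation below uses only the standard library.

```python
import random

n = 14336                      # MLP width used in this notebook
k_pred = int(0.3 * n)          # neurons kept by the predictor
k_label = int(0.2 * n)         # neurons kept in the label mask

# Ratio obtained if the label count is rounded from 2867 to 2868.
print(k_pred, k_label, k_pred / 2868)

# Expected coverage of a predictor that selects neurons uniformly at random.
random.seed(0)
trials = 20
recalls = []
for _ in range(trials):
    pred = set(random.sample(range(n), k_pred))
    label = set(random.sample(range(n), k_label))
    recalls.append(len(pred & label) / k_label)
baseline = sum(recalls) / trials
print(round(baseline, 3))
```

Against that random baseline of about 0.30, the final coverage of 0.584 shows that the predictor has learned a useful signal, although it still misses roughly 40% of the label neurons while selecting 1.5 times as many.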



