**Table of contents**<a id='toc0_'></a>    
- 1. [Scientific HuggingFace](#toc1_)    
- 2. [datas](#toc2_)    
- 3. [Tokenizer](#toc3_)    
  - 3.1. [Training a new tokenizer](#toc3_1_)    
  - 3.2. [Using a pre-trained tokenizer](#toc3_2_)    
    - 3.2.1. [Loading directly](#toc3_2_1_)    
    - 3.2.2. [Using it in Transformers](#toc3_2_2_)    
      - 3.2.2.1. [Wrapping](#toc3_2_2_1_)    
      - 3.2.2.2. [Loading](#toc3_2_2_2_)    
- 4. [BERT](#toc4_)    
  - 4.1. [tokenizer](#toc4_1_)    
    - 4.1.1. [train a new tokenizer](#toc4_1_1_)    
    - 4.1.2. [make tokenizer to be used in transformers with AutoTokenizer](#toc4_1_2_)    
  - 4.2. [model](#toc4_2_)    
  - 4.3. [datas](#toc4_3_)    
  - 4.4. [trainer](#toc4_4_)    
- 5. [LLaMa](#toc5_)    
  - 5.1. [Tokenizer](#toc5_1_)    
- 6. [DeepSeek](#toc6_)    
  - 6.1. [R1](#toc6_1_)    
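- 7. [What is RAG?](#toc7_)    
  - 7.1. [Text knowledge retrieval](#toc7_1_)    
    - 7.1.1. [Knowledge base construction](#toc7_1_1_)    
    - 7.1.2. [Query construction](#toc7_1_2_)    
    - 7.1.3. [How to retrieve? Text retrieval](#toc7_1_3_)    
    - 7.1.4. [How to feed the LLM? Generation augmentation](#toc7_1_4_)    
  - 7.2. [Multimodal knowledge retrieval](#toc7_2_)    
  - 7.3. [Applications](#toc7_3_)    
- 8. [Deploying large models](#toc8_)    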
  - 8.1. [ollama](#toc8_1_)    
    - 8.1.1. [Install and run model](#toc8_1_1_)    
    - 8.1.2. [API on web port](#toc8_1_2_)    
    - 8.1.3. [Python ollama module](#toc8_1_3_)    
      - 8.1.3.1. [Demo: translating Chinese to English](#toc8_1_3_1_)    
  - 8.2. [ktransformers](#toc8_2_)    
    - 8.2.1. [DeepSeek-R1_Q4_K_M with ktransformers docker container](#toc8_2_1_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# 1. <a id='toc1_'></a>[Scientific HuggingFace](#toc0_)

In [1]:
import os 


os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# 2. <a id='toc2_'></a>[datas](#toc0_)

In [3]:
import datasets 


datas = datasets.load_dataset('dnagpt/dna_promoters')

README.md:   0%|          | 0.00/360 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/11.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/11840 [00:00<?, ? examples/s]

# 3. <a id='toc3_'></a>[Tokenizer](#toc0_)

## 3.1. <a id='toc3_1_'></a>[Training a new tokenizer](#toc0_)

In [5]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

from transformers import AutoTokenizer


# Initialize a BPE model
tokenizer = Tokenizer(models.BPE())
# Configure pre-tokenization
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)  # use_regex=False: whitespace is treated as an ordinary character
# Configure the trainer
trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["<|endoftext|>"])  # 30k-token vocabulary

In [6]:
# Train
tokenizer.train(["data/huggingface/dna_1g.txt"], trainer=trainer)  # list of all training files; takes about 10-20 min






In [8]:
# Encode
encoding = tokenizer.encode("TGGCGTGAACCCGGGATCGGG")
print(encoding.tokens)

['TG', 'GCGTGAA', 'CCCGG', 'GATCGG', 'G']


In [9]:
# Decode
decoding = tokenizer.decode(encoding.ids)
print(decoding)

TG GCGTGAA CCCGG GATCGG G
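
The decoded string above has spaces between the tokens. This is most likely because no decoder was attached to the tokenizer, so the tokens are simply joined with a space. As a library-independent sketch (the helper name `detokenize` is hypothetical, and it assumes the tokens carry no byte-level markers, which holds for plain ASCII DNA letters), the original sequence can be recovered by concatenating the tokens:

```python
def detokenize(tokens):
    """Concatenate sub-word tokens back into one sequence, without separators."""
    return "".join(tokens)


tokens = ['TG', 'GCGTGAA', 'CCCGG', 'GATCGG', 'G']  # tokens printed by the encode cell
sequence = detokenize(tokens)
print(sequence)
```

Setting `tokenizer.decoder = decoders.ByteLevel()` before saving is another option worth checking, since `decoders` is already imported in the training cell.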


In [11]:
# Save
tokenizer.save("data/huggingface/tokenizer.json")
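
To make concrete what the BPE trainer learns, here is a toy sketch of the merge procedure in plain Python. It is an illustration under simplifying assumptions (character-level start, ties broken alphabetically, no special tokens), not the `tokenizers` implementation; the names `learn_bpe`, `merge_pair` and `apply_bpe` are hypothetical.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` by the merged symbol."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    corpus = [list(s) for s in sequences]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        # most frequent pair; ties are broken alphabetically so the result is deterministic
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(toks, best) for toks in corpus]
    return merges


def apply_bpe(sequence, merges):
    """Tokenize a new sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


merges = learn_bpe(["ATATAT", "ATGC"], num_merges=1)
print(merges, apply_bpe("GATTAT", merges))
```

The `vocab_size` passed to the real trainer plays the role of `num_merges` here: a larger budget allows longer, more specific sub-sequences to become single tokens.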

## 3.2. <a id='toc3_2_'></a>[Using a pre-trained tokenizer](#toc0_)

### 3.2.1. <a id='toc3_2_1_'></a>[Loading directly](#toc0_)

In [15]:
from tokenizers import Tokenizer


# Load the custom tokenizer
tokenizer = Tokenizer.from_file("data/huggingface/tokenizer.json")

# Encode
encoding = tokenizer.encode("TGGCGTGAACCCGGGATCGGG")
print(encoding.tokens)

# Decode
decoding = tokenizer.decode(encoding.ids)
print(decoding)

['TG', 'GCGTGAA', 'CCCGG', 'GATCGG', 'G']
TG GCGTGAA CCCGG GATCGG G


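### 3.2.2. <a id='toc3_2_2_'></a>[Using it in Transformers](#toc0_)
So that the tokenizer can be loaded through AutoTokenizer.

#### 3.2.2.1. <a id='toc3_2_2_1_'></a>[Wrapping](#toc0_)
To use this tokenizer in 🤗 Transformers, we have to wrap it in a PreTrainedTokenizerFast class (here, its GPT2TokenizerFast subclass).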

In [18]:
from transformers import GPT2TokenizerFast


dna_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)

dna_tokenizer.save_pretrained("data/huggingface/dna_bpe_dict")

# dna_tokenizer.push_to_hub("dna_bpe_dict_1g", organization="dnagpt", use_auth_token="hf_*****") # push to huggingface

('data/huggingface/dna_bpe_dict/tokenizer_config.json',
 'data/huggingface/dna_bpe_dict/special_tokens_map.json',
 'data/huggingface/dna_bpe_dict/vocab.json',
 'data/huggingface/dna_bpe_dict/merges.txt',
 'data/huggingface/dna_bpe_dict/added_tokens.json',
 'data/huggingface/dna_bpe_dict/tokenizer.json')

#### 3.2.2.2. <a id='toc3_2_2_2_'></a>[Loading](#toc0_)

In [19]:
from transformers import AutoTokenizer 


# Loads successfully
tokenizer = AutoTokenizer.from_pretrained("data/huggingface/dna_bpe_dict")


# 4. <a id='toc4_'></a>[BERT](#toc0_)

## 4.1. <a id='toc4_1_'></a>[tokenizer](#toc0_)

### 4.1.1. <a id='toc4_1_1_'></a>[train a new tokenizer](#toc0_)

In [4]:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast, AutoModelForMaskedLM


# Initialize an empty WordPiece model
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Configure the training parameters
trainer = WordPieceTrainer(
    vocab_size=30000,        # vocabulary size
    min_frequency=2,         # minimum token frequency
    show_progress=True,      # show a progress bar
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)

In [8]:
help(tokenizer.train)

Help on built-in function train:

train(self, files, trainer=None) method of tokenizers.Tokenizer instance
    Train the Tokenizer using the given files.

    Reads the files line by line, while keeping all the whitespace, even new lines.
    If you want to train from data store in-memory, you can check
    :meth:`~tokenizers.Tokenizer.train_from_iterator`

    Args:
        files (:obj:`List[str]`):
            A list of path to the files that we should use for training

        trainer (:obj:`~tokenizers.trainers.Trainer`, `optional`):
            An optional trainer that should be used to train our Model



In [9]:
# Train
tokenizer.train(files=["data/huggingface/dna_1g.txt"], trainer=trainer)

# Save
tokenizer.save("data/huggingface/dna_wordpiece_dict.json")






### 4.1.2. <a id='toc4_1_2_'></a>[make tokenizer to be used in transformers with AutoTokenizer](#toc0_)

In [10]:
new_tokenizer = Tokenizer.from_file("data/huggingface/dna_wordpiece_dict.json")

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=new_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Save
wrapped_tokenizer.save_pretrained("data/huggingface/dna_wordpiece_dict")

('data/huggingface/dna_wordpiece_dict/tokenizer_config.json',
 'data/huggingface/dna_wordpiece_dict/special_tokens_map.json',
 'data/huggingface/dna_wordpiece_dict/tokenizer.json')

In [12]:
from transformers import AutoTokenizer


# Load
tokenizer = AutoTokenizer.from_pretrained("data/huggingface/dna_wordpiece_dict")
#tokenizer.pad_token = tokenizer.eos_token

In [13]:
# Encode
tokenizer("ATCGGATCG")

{'input_ids': [6, 766, 22, 10], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}

In [14]:
tokenizer

PreTrainedTokenizerFast(name_or_path='data/huggingface/dna_wordpiece_dict', vocab_size=30000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

## 4.2. <a id='toc4_2_'></a>[model](#toc0_)

In [15]:
from transformers import BertConfig, BertForMaskedLM 


# Configuration
max_len = 1024 

config = BertConfig(
    vocab_size = len(tokenizer),
    max_position_embeddings=max_len, 
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
) 

# Model
model = BertForMaskedLM(config=config)

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.


## 4.3. <a id='toc4_3_'></a>[datas](#toc0_)

In [17]:
from datasets import load_dataset 


raw_dataset = load_dataset('text', data_files='data/huggingface/dna_1g.txt')
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1079595
    })
})

In [18]:
dataset = raw_dataset["train"].train_test_split(test_size=0.1, shuffle=True)
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 971635
    })
    test: Dataset({
        features: ['text'],
        num_rows: 107960
    })
})

In [19]:
tokenizer._tokenizer.model.max_input_chars_per_word = 10000


def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=max_len)


# Apply the tokenization function to the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=False, remove_columns=['text'], num_proc=50)  # set num_proc to your CPU core count, or adjust as needed


Map (num_proc=50):   0%|          | 0/971635 [00:00<?, ? examples/s]

Map (num_proc=50):   0%|          | 0/107960 [00:00<?, ? examples/s]

In [20]:
from transformers import DataCollatorForLanguageModeling


# Create a data collator for dynamic padding and masking; note mlm=True
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
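
To show what `mlm=True` with `mlm_probability=0.15` roughly does to one example, here is a plain-Python sketch of the BERT masking rule (select about 15% of tokens; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged). It is an assumption-laden illustration, not the collator's code: the function name `mask_tokens` and the `seed` parameter are hypothetical, and the 80/10/10 split is the commonly documented default.

```python
import random


def mask_tokens(input_ids, vocab_size, mask_id, special_ids, mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels); labels are -100 wherever no prediction is required."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # never mask special tokens; otherwise select a token with probability mlm_probability
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


# ids follow the special-token layout printed earlier: [PAD]=0, [CLS]=2, [SEP]=3, [MASK]=4
ids = [2] + list(range(5, 105)) + [3]
masked, labels = mask_tokens(ids, vocab_size=30000, mask_id=4, special_ids={0, 2, 3})
print(sum(label != -100 for label in labels), "positions selected for prediction")
```

The `-100` label is the value that PyTorch's cross-entropy loss ignores by default, which is why only the selected positions contribute to the MLM loss.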

In [21]:
dataset["train"][0]

{'text': 'GAATATTTGTCTATTCTTCTTAACTTTCTCCACTGTAAATTAAATTGCTCCTCAGGGTGCTATATGGCATCCCTTGCTATTTTTGGAGCAAATCTTAAATTCTTCAACAATTTTATCAAGACAAACACAACTTTCAGTAAATTCATTGTTTAAATTTGGTGAAAAGTCAGATTTCTTTACACATAGTAAAGCAAATGTAAAATAATATATCAATGTGATTCTTTTAATAAAATACCATTATTGCCAATGGTTTTTAATAGTTCACTGTTTGAAAGAGACCACAAAATTCATGTGCAAAAATCACAAGCATTCTTATACAACAGTGACAGACAAACAGAGAGCCAAATCAGGAATGAACTTCCATTCACAATTGCTTCAAAGAGAATCAAATACCTAGGAATCCAACTTACAAGGGATGTAAAGGACCTCTTCAAGGAGAACTACAAACCACTGCTCAGTGAAATAAAAGAGGACACAAACAAATGGAAGAACATACCATGCTCATGGATAGGAAGAATCAATATCGTGAAAATGGCCATACTGCCCAAGGTAATTTATAGATTCAATGCCATCCCCATCAAGCTACCAATGAGTTTCTTCACAGAATTGGAAAAAACTGTTTTAAAGTTCATATGGAACCAAAAAAGAACCCACATTGCCAAGACAATCCTAAGTCAAATGAACAAAGCTGGAGGGATCATGCTACCTGACTTCAAACTATACTACAAGGCTACAGTAACCAAAATAGCATGGTACTGGTACCAAAACAGAAATATAGACCAATGGAACAGCATAGAGTCCTCAGAAATAATACCACACATCTACATCTTTGATAAATCTGACAAAAACAAGAAATGGGGAAAGGATTCTCTATATAATAAATGGTGCTGGGAAAATTGGCTAGCCATAAGTAGAAAGCTGAAACTGGATCCTTTCCTTACTCTTTATACGAAAATTAATTCAAGATGGAGTAGAGACTTAAATGTTAGA

In [22]:
tokenizer.tokenize(dataset["train"][0]["text"][:100])

['GAA',
 '##TATTTG',
 '##TCTATT',
 '##CTTCTTAA',
 '##CTTTCTCC',
 '##A',
 '##CTGTAAATT',
 '##AAATT',
 '##GCTCC',
 '##TCAGG',
 '##GTGCTA',
 '##TATGGCA',
 '##TCCCTT',
 '##GCTATTTT',
 '##TGGAGCAA',
 '##A',
 '##TCTTAAA',
 '##T']
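
The `##` pieces above come from WordPiece's greedy longest-match-first lookup. The sketch below reproduces that lookup in plain Python with a tiny hand-made vocabulary (hypothetical, not the trained one). It also shows why `max_input_chars_per_word` was raised earlier: a DNA line contains no whitespace, so the whole line counts as one "word", and words longer than the limit (assumed here to default to 100) collapse to `[UNK]`.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##", max_input_chars_per_word=100):
    """Greedy longest-match-first WordPiece tokenization of a single word."""
    if len(word) > max_input_chars_per_word:
        return [unk]
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate until it is found in the vocabulary
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"GAA", "G", "##TATTTG", "##T", "##A"}  # hypothetical mini-vocabulary
print(wordpiece_tokenize("GAATATTTGA", toy_vocab))
print(wordpiece_tokenize("GAC", toy_vocab))
```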

## 4.4. <a id='toc4_4_'></a>[trainer](#toc0_)

In [29]:
from transformers import TrainingArguments, Trainer


run_path = "cache/bert_run"
train_epochs = 5
batch_size = 2


training_args = TrainingArguments(
        output_dir=run_path,
        overwrite_output_dir=True,
        num_train_epochs=train_epochs,
        per_device_train_batch_size=batch_size,
        save_steps=2000,
        save_total_limit=2,
        prediction_loss_only=True,
        fp16=True,  # author's note: could not be used on a V100
    )


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [None]:
trainer.train()
trainer.save_model("cache/dna_bert_v0")
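
A quick back-of-the-envelope check of how long this run is, using the split size and arguments shown above. The assumptions are a single device, `gradient_accumulation_steps=1`, and the last partial batch being kept; multi-GPU training would divide the step count accordingly.

```python
import math

num_train_examples = 971_635   # train split size printed by train_test_split
per_device_batch_size = 2
num_epochs = 5
save_steps = 2000

# Assumptions: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
checkpoints_written = total_steps // save_steps

print(steps_per_epoch, total_steps, checkpoints_written)
```

With `save_total_limit=2`, only the two most recent of those checkpoints remain on disk at any time.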

# 5. <a id='toc5_'></a>[LLaMa](#toc0_)

## 5.1. <a id='toc5_1_'></a>[Tokenizer](#toc0_)

In [32]:
import sentencepiece as spm


spm.SentencePieceTrainer.train(
    input="data/huggingface/dna_1g.txt,data/huggingface/protein_1g.txt", 
    model_prefix="dna_llama", 
    vocab_size=60000, 
    model_type="bpe", 
    # max_sentence_length=1000000,
    num_threads=50, 
)

In [None]:
tokenizer = spm.SentencePieceProcessor(model_file="dna_llama.model")

tokenizer.encode("ATCGGATCG")


# 6. <a id='toc6_'></a>[DeepSeek](#toc0_)

## 6.1. <a id='toc6_1_'></a>[R1](#toc0_)

In [11]:
# Use a pipeline as a high-level helper
from transformers import pipeline


messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1", trust_remote_code=True)

pipe(messages)

Downloading shards:   0%|          | 0/163 [00:00<?, ?it/s]

model-00007-of-000163.safetensors:  71%|#######1  | 3.07G/4.31G [00:00<?, ?B/s]

Error while downloading from https://cdn-lfs-us-1.hf-mirror.com/repos/e7/f7/e7f7b8810f2020d7ff50a46aef578773eecb7386ccba95924d21eae90685f990/d6f299f7b410b9a7806927b5d2d413fae1f2c1dfa340bb0037d02d220cd8c080?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model-00007-of-000163.safetensors%3B+filename%3D%22model-00007-of-000163.safetensors%22%3B&Expires=1739353834&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczOTM1MzgzNH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zL2U3L2Y3L2U3ZjdiODgxMGYyMDIwZDdmZjUwYTQ2YWVmNTc4NzczZWVjYjczODZjY2JhOTU5MjRkMjFlYWU5MDY4NWY5OTAvZDZmMjk5ZjdiNDEwYjlhNzgwNjkyN2I1ZDJkNDEzZmFlMWYyYzFkZmEzNDBiYjAwMzdkMDJkMjIwY2Q4YzA4MD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSoifV19&Signature=uWn2eRLEE%7EkEPseLsNFZx3nYDabBpqrL2gJZdKix4fRMvtXUj-QBn8R4yfwVaxb%7EzgsgIh2jRpAy6BLf1bEfzJv1SByB3-z4bCnf8OhuOM81SM2u5kO-CDNjGdbPADY6HfMFKRioqgbFlgd6PAIC6eGNUtM6B5jHJxa9yzKxEKU9PRM9O0JDJPH4IvYT-6SmKqEyDG2pZKPAojQm9FJNAytH


In [None]:
# Load model directly
from transformers import AutoModelForCausalLM


model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1", trust_remote_code=True)

# 7. <a id='toc7_'></a>[What is RAG?](#toc0_)

A taxonomy of RAG:

| Model | Retriever fine-tuning | LLM fine-tuning | Examples |
|---|---|---|---|
| Black-box | - | - | e.g. In-Context RALM |
| Black-box | Yes | - | e.g. REPLUG |
| White-box | - | Yes | e.g. REALM, Self-RAG |
| White-box | Yes | Yes | e.g. Atlas |

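## 7.1. <a id='toc7_1_'></a>[Text knowledge retrieval](#toc0_)
A system that retrieves relevant information to help improve the generation quality of a large language model. Knowledge retrieval usually consists of four parts: knowledge base construction, query construction, text retrieval, and reranking of the retrieved results.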

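### 7.1.1. <a id='toc7_1_1_'></a>[Knowledge base construction](#toc0_)
Build a knowledge base of text chunks from sources such as Wikipedia, news, and papers.

Text chunking: split the text into chunks, each containing one or more sentences.
- Fixed-size chunks: split the text into chunks of a fixed size, e.g. 512 characters per chunk.
- Content-based chunks: split the text according to its content, e.g. one sentence per chunk.
  - Split sentences on sentence delimiters.
  - Use an LLM to do the splitting.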
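
The two non-LLM chunking strategies listed above can be sketched in a few lines of Python. The function names `chunk_fixed` and `chunk_sentences`, the `overlap` parameter, and the delimiter set are illustrative assumptions; real pipelines usually count tokens rather than characters.

```python
import re


def chunk_fixed(text, size, overlap=0):
    """Fixed-size chunks of `size` characters; consecutive chunks share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def chunk_sentences(text):
    """Content-based chunks: split after Western or CJK sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p for p in parts if p]


print(chunk_fixed("abcdefghij", size=4, overlap=1))
print(chunk_sentences("One. Two! Three"))
```

A small overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of storing some text twice.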

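Knowledge base enhancement: improve and enrich the content and structure of the knowledge base so that queries have a "handle" to grab onto. There are two methods:
- Pseudo-query generation
- Title generation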

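### 7.1.2. <a id='toc7_1_2_'></a>[Query construction](#toc0_)
Query construction uses query enhancement to expand and enrich the semantics and content of the user query, improving the accuracy and coverage of the retrieved results so that the query can "hook" the matching content. Enhancement is either semantic or content-based.
- Semantic enhancement: several ways of phrasing the same sentence
- Content enhancement: adding background knowledge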

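### 7.1.3. <a id='toc7_1_3_'></a>[How to retrieve? Text retrieval](#toc0_)
`Retriever`: given a knowledge base and a user query, text retrieval aims to find the knowledge texts relevant to the query, while retrieval-efficiency enhancement addresses performance bottlenecks at retrieval time. Both retrieval quality and retrieval efficiency therefore matter. There are three common kinds of retrievers:
- Discriminative retrievers:
  - Sparse retrievers, e.g. TF-IDF
  - Bi-encoder retrievers, e.g. using BERT to encode the text chunks into vectors ahead of time
  - Cross-encoder retrievers
- Generative retrievers: these store the document information of the knowledge base directly in the model parameters. When a query arrives, they directly generate the identifier (Doc ID) of the relevant documents to complete the retrieval.
- Graph retrievers: the knowledge base is a graph database, either an open knowledge graph or a self-built graph, generally made of <subject, predicate, object> triples. This captures the semantic relations between concepts and lets humans and machines jointly understand and reason over the knowledge.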
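
As a concrete instance of the sparse retriever mentioned above, here is a minimal TF-IDF retriever in plain Python. It is a sketch under simple assumptions (whitespace tokenization, smoothed IDF, cosine similarity, index rebuilt on every call); the function names are hypothetical and production systems would use an inverted index or a library instead.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf(tokens, idf):
    """Sparse TF-IDF vector as a dict; tokens unknown to the index are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def build_index(docs):
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}  # smoothed IDF
    return idf, [tfidf(toks, idf) for toks in tokenized]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query as (document, score) pairs."""
    idf, vectors = build_index(docs)
    q = tfidf(tokenize(query), idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(docs[i], score) for score, i in scored[:k]]


docs = [
    "promoter sequence upstream of a gene",
    "training a bpe tokenizer on dna",
    "docker container for model serving",
]
top_doc, top_score = retrieve("bpe tokenizer", docs, k=1)[0]
print(top_doc, round(top_score, 3))
```

A bi-encoder retriever has the same shape: only `tfidf` is swapped for a neural encoder that produces dense vectors.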

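`Reranker`: to keep retrieval fast, the retrieval stage usually sacrifices some quality and may return low-quality documents. Reranking further sorts and filters the retrieved passages. Reranking methods are either cross-encoder based or in-context-learning based.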

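### 7.1.4. <a id='toc7_1_4_'></a>[How to feed the LLM? Generation augmentation](#toc0_)
Comparison of RAG augmentation architectures:

| Architecture | Pros | Cons |
|-|-|-|
| Input side (prompt) | Simple | Too many tokens |
| Intermediate layers | Efficient | GPU-intensive |
| Output side | - | - |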
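
Input-side augmentation, the first row of the table, amounts to pasting the retrieved chunks into the prompt. The sketch below is one possible layout; the function name `build_rag_prompt`, the English instruction text, and the `max_chars` budget are all assumptions. The budget is there to illustrate the "too many tokens" drawback: chunks that do not fit are dropped.

```python
def build_rag_prompt(question, chunks, max_chars=2000):
    """Assemble a prompt from retrieved chunks, stopping once the context budget is used up."""
    context_parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # budget exhausted: remaining (lower-ranked) chunks are dropped
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_rag_prompt(
    "What is a promoter?",
    ["A promoter is a DNA region where transcription starts.", "x" * 5000],
)
print(prompt)
```

Because chunks are consumed in order, passing them in reranked order means the budget cuts the least relevant material first.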

## 7.2. <a id='toc7_2_'></a>[Multimodal knowledge retrieval](#toc0_)
## 7.3. <a id='toc7_3_'></a>[Applications](#toc0_)
Chatbots, knowledge-base question answering, ...

# 8. <a id='toc8_'></a>[Deploying large models](#toc0_)
## 8.1. <a id='toc8_1_'></a>[ollama](#toc0_)
### 8.1.1. <a id='toc8_1_1_'></a>[Install and run model](#toc0_)

In [None]:
# start the server
ollama serve

# list all model images
ollama list 

# run model from image
ollama run model_card

### 8.1.2. <a id='toc8_1_2_'></a>[API on web port](#toc0_)
Communicate with the local model over its HTTP port.

Two endpoints are used here: `generate` and `chat`.

In [41]:
%%bash
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Who are you?",
  "stream": false,
  "options": {
    "temperature": 0.6
  },
  "format": "json"
}'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   456  100   315  100   141    142     64  0:00:02  0:00:02 --:--:--   206


{"model":"deepseek-r1:7b","created_at":"2025-02-18T02:49:32.744934023Z","response":"{\"}\u003cthink\u003e{\"\n\n\n\n\n\n\n\n\n\n:\n\n{\n\n}\n\n}\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n","done":false}

In [40]:
%%bash
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:7b",
  "messages": [
    { "role": "user", "content": "why is the sky blue?" }
  ],
  "stream": false
}'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5034    0  4905  100   129    326      8  0:00:16  0:00:15  0:00:01   719


{"model":"deepseek-r1:7b","created_at":"2025-02-18T02:49:26.56791501Z","message":{"role":"assistant","content":"\u003cthink\u003e\nOkay, so I just read that \"Why is the sky blue?\" and now I'm trying to figure it out myself. Let me think through this step by step.\n\nFirst off, when you look at the sky on a clear day, it's usually blue, especially during the day when the sun is out. But sometimes I've seen it turn other colors too, like red in the evening or during sunrise. So why is it mostly blue?\n\nI know that light travels through the atmosphere, but how does it get colored? I remember learning about something called Rayleigh scattering from my science class. Let me try to recall what that was about. Rayleigh scattering involves light interacting with particles much smaller than the wavelength of light itself. When sunlight enters the Earth's atmosphere, it reaches tiny molecules in the air, like nitrogen and oxygen.\n\nWait, so these small particles scatter the sunlight in all d
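
The same `generate` call can be made from Python with only the standard library. This is a sketch that mirrors the fields of the first curl request (minus `format`); the helper names `build_generate_request` and `generate` are hypothetical, and `generate` only works while `ollama serve` is running locally, so the demo below builds the request without sending it.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint used in the curl call


def build_generate_request(model, prompt, temperature=0.6, url=OLLAMA_URL):
    """Build (but do not send) a non-streaming POST request for the generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, **kwargs):
    """Send the request and return the `response` field; needs a running local server."""
    with urllib.request.urlopen(build_generate_request(model, prompt, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


request = build_generate_request("deepseek-r1:7b", "Who are you?")
print(request.get_method(), request.full_url)
```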

### 8.1.3. <a id='toc8_1_3_'></a>[Python ollama module](#toc0_)

In [None]:
import ollama


texts = '''
Compare in detail the differences between DeepSeek's parent company and OpenAI
'''

# model_card = "deepseek-r1:7b"
model_card = "modelscope.cn/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF"

# Option 1 (non-streaming output):
# outputs = ollama.generate(model=model_card, prompt=texts)
# print(outputs['response'])

# Option 2 (streaming output):
outputs = ollama.generate(
    stream=True,
    model=model_card,
    prompt=texts,
)
for chunk in outputs:
    if not chunk['done']:
        # plain indexing avoids nesting the same quote inside an f-string (a syntax error before Python 3.12)
        print(chunk['response'], end='', flush=True)

#### 8.1.3.1. <a id='toc8_1_3_1_'></a>[Demo: translating Chinese to English](#toc0_)

In [44]:
import ollama 


class zh2en():
    def __init__(self, model_card):
        self.model_card = model_card
        
    def build_prompt(self, texts):
        # with open(prompt_template_path, 'r') as f:
        #     prompt_template = f.read()
        #     # str with replace function
        #     prompt = prompt_template.replace(var, texts)
        # The template stays in Chinese because the input is Chinese. It reads:
        # "Professional translation: --- {Chinese_words} --- As a translation expert,
        # accurately translate the Chinese text above into English."
        prompt_template = """
        专业翻译：\n
        ---\n
        {Chinese_words} \n
        --- \n
        作为翻译专家，将上述中文准确翻译为英文。 \n
        """
        prompt = prompt_template.replace("{Chinese_words}", texts)
        return prompt

    def translate(self, texts):
        prompt = self.build_prompt(texts = texts)
        # key step
        outputs = ollama.generate(
            stream=True,
            model=self.model_card,
            prompt=prompt,
        )
        for chunk in outputs:
            if not chunk['done']:
                # plain indexing avoids nesting the same quote inside an f-string (a syntax error before Python 3.12)
                print(chunk['response'], end='', flush=True)
            else:
                print('⚡')

translater = zh2en(model_card='modelscope.cn/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF')

translater.translate('基于深度学习的对枯草芽胞杆菌芽胞形成相关基因的研究。')
translater.translate('通过宏基因组研究微生物与植物相互作用的机制。')

<think>
嗯，首先我要理解这个题目的意思。“基于深度学习”指的是使用深度学习技术来进行研究。而“对枯草芽胞杆菌芽胞形成相关基因的研究”则是具体的研究内容，涉及到枯草芽胞杆菌在形成芽胞过程中相关的基因。

我需要把这整个句子准确地翻译成英文。首先，“基于深度学习”可以直接翻译为“Based on deep learning”。接下来是“研究”，对应的英文是“study”。然后是“枯草芽胞杆菌”，这个应该是一个专有名词，可能需要查一下正确的英译名称，比如“Bacillus subtilis”。

接着是“芽胞形成相关基因”，这部分可以翻译为“genes related to spore formation”。最后，把整个句子连贯起来，就是“Based on deep learning study of genes related to spore formation in Bacillus subtilis.”

这样组合起来，既准确传达了原意，又符合英文的表达习惯。我觉得这个翻译应该是比较专业和准确的。
</think>

Study of Genes Related to Spore Formation in *Bacillus subtilis* Based on Deep Learning⚡
<think>
好的，首先我要理解用户的需求。他给了一个中英对照的句子，要求专业翻译，并且需要将中文句子“通过宏基因组研究微生物与植物相互作用的机制。”准确地翻译成英文。

接下来，我需要分析原文的意思。句子的主干是“通过宏基因组研究…”，这里的关键词有“宏基因组”、“微生物”、“植物”以及“相互作用的机制”。所以，首先要确定这些术语在英文中的准确对应词。

“宏基因组”通常翻译为“metagenome”或者“meta-genomics”，但更常见的是使用“metagenomics”来表示这一研究领域。因此，这里选择“metagenomics”作为翻译。

然后，“通过…研究…”的结构在英文中可以用“through”或者“by means of”来表达，但为了简洁和专业，直接使用“Through”比较合适。

接下来是“微生物与植物相互作用的机制”。这里需要注意语序和用词。整体结构应该是“the mechanisms underlying the interactio
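
The streamed output above mixes the model's `<think>...</think>` reasoning with the final translation. If only the translation is wanted, the accumulated text can be post-processed. A minimal sketch, assuming the tags appear exactly as printed above (the helper name `split_think` is hypothetical):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text):
    """Split accumulated model output into (reasoning, answer)."""
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>\nCheck the species name first.\n</think>\n\nStudy of Genes Related to Spore Formation"
reasoning, answer = split_think(raw)
print(answer)
```

In the streaming loop this would be applied to the concatenation of all `chunk['response']` pieces once `done` is reached. An unclosed `<think>` block, as in the truncated output above, is left in the answer untouched.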

## 8.2. <a id='toc8_2_'></a>[ktransformers](#toc0_)
### 8.2.1. <a id='toc8_2_1_'></a>[DeepSeek-R1_Q4_K_M with ktransformers docker container](#toc0_)

[https://github.com/kvcache-ai/ktransformers-private/blob/main/doc/en/Docker.md](https://github.com/kvcache-ai/ktransformers-private/blob/main/doc/en/Docker.md)

In [None]:
# pull the image from docker hub 
# about 19 GB
docker pull approachingai/ktransformers:0.1.1

# docker run \
#     --gpus all \
#     -v /path/to/models:/models \
#     -p 10002:10002 \
#     approachingai/ktransformers:v0.1.1 \
#     --port 10002 \
#     --gguf_path /models/path/to/gguf_path \
#     --model_path /models/path/to/model_path \
#     --web True

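# this run may fail with errors
# note: --model_path and --gguf_path are resolved inside the container, so they need to lie under a mounted volume such as /models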
docker run  \
    -v /bmp/backup/zhaosy/ws/ktransformers/models:/models \
    -p 10002:10002 \
    approachingai/ktransformers:0.1.1 \
    --port 10002 \
    --model_path /bmp/backup/zhaosy/ProgramFiles/hf/deepseek-ai/DeepSeek-R1 \
    --gguf_path /bmp/backup/zhaosy/ProgramFiles/hf/deepseek-ai/DeepSeek-R1-Q4_K_M_GGUF \
    --web True