<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/gemma_FAISS_Cosmopedia_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU @ 2.20GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            7
    BogoMIPS:            4400.43
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clf
                         lush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_
                         good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fm
                         a cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hyp
                         ervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd

In [None]:
!nvidia-smi

Tue Mar 26 10:39:10 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              43W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Steps to Build a RAG Application with the Gemma 7B LLM
![img](https://miro.medium.com/v2/resize:fit:1400/0*ireimPdcCcsa4ro0)

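# Introduction

As large language models keep improving, interest in building RAG (retrieval-augmented generation) applications grows by the day. Google recently released an open model: Gemma. RAG combines two basic approaches: retrieval-based techniques and generative models. Retrieval-based techniques fetch relevant information from a large knowledge base or corpus in response to a specific query. Generative models draw on patterns learned from their training data to produce new text or responses. With this release, why not build a RAG pipeline on the new open model and see how it performs?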

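Let's get started. We break the process into the following steps:

1. Load the dataset: Cosmopedia
2. Generate embeddings with Hugging Face
3. Store them in a FAISS DB
4. Gemma: introducing the SOTA model
5. Query the RAG pipeline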
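
The retrieve-then-generate flow described in the introduction can be sketched without any model at all. The toy below is an illustration under loud assumptions, not the pipeline built in this notebook: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the prompt wording in `build_prompt` is invented for the example.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=1):
    """Return the k corpus passages most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

def build_prompt(query, passages):
    """Place the retrieved passages in front of the question for the generator."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "The kitten learned to share her toys with friends.",
    "FAISS is a library for similarity search over dense vectors.",
    "Gemma is an open model released by Google.",
]
query = "Which library does similarity search?"
prompt = build_prompt(query, retrieve(query, corpus, k=1))
print(prompt)
```

In the real pipeline the embedding model, the vector store, and the LLM replace `embed`, `retrieve`, and the final generation step respectively.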

# Building a RAG Application on Gemma 7B

Before we dive in, let's install and import the required dependencies.




In [None]:
%pip install -q -U langchain torch transformers sentence-transformers datasets faiss-cpu



In [None]:
# Integrations are imported from langchain_community (as CSVLoader already was);
# the duplicate AutoTokenizer import is merged into a single transformers import.
import torch
import pandas as pd
from datasets import load_dataset
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.vectorstores import FAISS
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


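## Load the dataset: Cosmopedia

To build the RAG application, we use the Hugging Face dataset [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia). It consists of synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. It contains over 30 million files and 25 billion tokens, which makes it the largest open synthetic dataset to date.

The dataset has 8 subsets. We work with the `stories` subset and load it with the `datasets` library.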

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"


In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) y
Token is valid (permission: write).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'stor

In [None]:
!huggingface-cli download \
  --repo-type dataset HuggingFaceTB/cosmopedia data/stories/train-00000-of-00043.parquet \
  --local-dir dataset/HuggingFaceTB/cosmopedia \
  --local-dir-use-symlinks False

Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/datasets/HuggingFaceTB/cosmopedia/resolve/main/data/stories/train-00000-of-00043.parquet to /root/.cache/huggingface/hub/tmpdngdwu0n
train-00000-of-00043.parquet: 100% 277M/277M [00:01<00:00, 236MB/s]
dataset/HuggingFaceTB/cosmopedia/data/stories/train-00000-of-00043.parquet


In [None]:
from datasets import load_dataset

data = load_dataset("./dataset/HuggingFaceTB/cosmopedia", split="train[:100]")
#data = load_dataset("./dataset/HuggingFaceTB/cosmopedia", split="train")


Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
# https://huggingface.co/docs/datasets/loading
data = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train[:1000]")


Resolving data files:   0%|          | 0/18 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/43 [00:00<?, ?it/s]

Next, we convert it to a Pandas DataFrame and save it as a CSV file.



In [None]:
data = data.to_pandas()
data.to_csv("dataset.csv")
data.head()


| Unnamed: 0 | text | prompt | text_token_length | seed_data | format | audience |
|---|---|---|---|---|---|---|
| 0 | Once upon a time, in a village called Kiwilan... | Write an educational story (3-5 paragraphs) ta... | 520 | ultrachat | story_children | young_children |
| 1 | In a bustling town full of curious creatures ... | Write an educational story (3-5 paragraphs) ta... | 381 | openhermes2.5 | story_children | young_children |
| 2 | Step 3: Embracing an Unconventional Warmup Ro... | Write a real-life story shared by someone in a... | 580 | openhermes2.5 | story_reddit | general |
| 3 | Once upon a time, in a small town named Harmo... | Write an educational story (3-5 paragraphs) ta... | 439 | ultrachat | story_children | young_children |
| 4 | On a bright, sunny day, two best friends, Tim... | Write an educational story (3-5 paragraphs) ta... | 414 | openhermes2.5 | story_children | young_children |


In [None]:
!ls -lh dataset.csv
!wc -l dataset.csv

-rw-r--r-- 1 root root 418K Mar 26 10:41 dataset.csv
2898 dataset.csv


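Now that the dataset is saved locally, we load it with LangChain. First free the memory used by the earlier load. If you are running in Colab, you can also reset the runtime: open the "Runtime" menu and choose "Factory reset runtime". This clears all loaded data and objects and frees the memory. Afterwards, run the imports again.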

In [None]:
loader = CSVLoader(file_path='./dataset.csv',)
data = loader.load()
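
To make the loader's output concrete, here is a minimal sketch of the shape of what `loader.load()` returns: one document per CSV row, with each column rendered as a `name: value` line. It uses only the standard library; `rows_to_documents` and the plain dictionaries are hypothetical stand-ins for LangChain's `Document` objects, and the exact metadata fields are an assumption.

```python
import csv
import io

def rows_to_documents(csv_text, source="dataset.csv"):
    """Mimic the shape of CSVLoader output: one document per CSV row."""
    documents = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # Each column becomes a "name: value" line in the document text.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return documents

sample = 'text,audience\n"Once upon a time, in a village",young_children\n'
docs_sketch = rows_to_documents(sample)
print(docs_sketch[0]["page_content"])
```

Because every column ends up in the document text, fields such as `prompt` and `seed_data` are embedded alongside the story itself unless the CSV is trimmed first.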


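With the data loaded, we split the documents into chunks of 1000 characters, with a 150-character overlap between consecutive chunks. Smaller chunks keep each retrieved passage focused and short enough to fit in the prompt.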



In [None]:
# Split the loaded documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)
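
The effect of `chunk_size` and `chunk_overlap` can be seen in a simplified sketch. Note the swap: `RecursiveCharacterTextSplitter` prefers to cut at separators such as paragraph and line breaks, whereas the hypothetical `split_with_overlap` below is a plain fixed-width sliding window, shown only to illustrate how the two parameters interact.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=150):
    """Fixed-width sliding window: each chunk starts chunk_size - chunk_overlap after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

text = "x" * 2500
chunks = split_with_overlap(text)
print([len(c) for c in chunks])  # [1000, 1000, 800]
```

The overlap means the last 150 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable in one piece.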


## Generate embeddings with Hugging Face

Next, we use Hugging Face Embeddings to generate embeddings with a Sentence Transformers model.

In [None]:
modelPath = "sentence-transformers/all-MiniLM-l6-v2"
model_kwargs = {'device':'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
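
The `normalize_embeddings` flag decides whether vectors are scaled to unit length, which matters when the vector store compares them by Euclidean (L2) distance. The sketch below uses made-up 2-D vectors (not real embeddings) to show that on raw vectors L2 distance is sensitive to vector length, while on unit vectors squared L2 distance equals `2 - 2 * cosine`, so both metrics rank neighbours identically. Whether L2 is the metric in use here is an assumption about the vector store's defaults.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def l2_squared(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query = [1.0, 0.0]
same_direction_but_long = [10.0, 0.0]
nearby_but_rotated = [1.0, 1.0]

# Raw vectors: L2 prefers the rotated vector because the long one is far away.
raw_l2_prefers_rotated = l2_squared(query, nearby_but_rotated) < l2_squared(query, same_direction_but_long)

# Unit vectors: squared L2 distance equals 2 - 2 * cosine, so both metrics rank alike.
q, a, b = normalize(query), normalize(same_direction_but_long), normalize(nearby_but_rotated)
identity_holds = math.isclose(l2_squared(q, b), 2 - 2 * cosine(q, b))
normalized_l2_prefers_same_direction = l2_squared(q, a) < l2_squared(q, b)
print(raw_l2_prefers_rotated, identity_holds, normalized_l2_prefers_same_direction)
```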

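## Store in a FAISS DB

The embeddings are generated, but we still need to store them in a vector database. We save them in a FAISS vector store; FAISS is a library for efficient similarity search and clustering of dense vectors.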

In [None]:
db = FAISS.from_documents(docs, embeddings)
print(db.index.ntotal)


598
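
`db.index.ntotal` above is the number of vectors held in the index. As a rough mental model of how such an index answers a query, here is a sketch of exhaustive ("flat") L2 search in pure Python. It is written in the spirit of FAISS's flat indexes, not taken from FAISS; `flat_l2_search` is a hypothetical helper, and which index type the LangChain wrapper actually builds is an assumption.

```python
import heapq

def flat_l2_search(index_vectors, query, k=2):
    """Exhaustive search: score every stored vector, keep the k smallest squared L2 distances."""
    scored = (
        (sum((x - y) ** 2 for x, y in zip(vec, query)), i)
        for i, vec in enumerate(index_vectors)
    )
    # Returns (distance, position) pairs, nearest first.
    return heapq.nsmallest(k, scored)

index_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
result = flat_l2_search(index_vectors, query=[1.0, 1.0], k=2)
print(result)
```

Exhaustive search costs one distance computation per stored vector, which is perfectly manageable for a few hundred chunks.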


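## Gemma: introducing the SOTA model

Gemma comes in two sizes, with 2 billion and 7 billion parameters, to suit different compute budgets and use cases. Google provides pretrained and fine-tuned checkpoints, along with an open-source codebase for inference and serving. The models were trained on up to 6 trillion tokens of text and use architectures, data, and training recipes similar to those of the Gemini models. Both show strong generalist capabilities across text domains, as well as strong understanding and reasoning skills at scale.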
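
To see what the two sizes mean for hardware, the following back-of-the-envelope sketch estimates the memory needed for the weights alone. It assumes the nominal parameter counts quoted above (the real checkpoints may differ somewhat) and ignores activations, the KV cache, and framework overhead.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

def weight_memory_gib(num_params, dtype):
    """Memory needed for the weights alone, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

for name, params in [("2B", 2e9), ("7B", 7e9)]:
    for dtype in ("float32", "bfloat16"):
        print(f"Gemma {name} in {dtype}: ~{weight_memory_gib(params, dtype):.1f} GiB")
```

Under these assumptions the 7B weights need roughly 26 GiB in float32 and roughly 13 GiB in bfloat16, which is why the data type matters on a 40 GB GPU.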

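The release includes raw pretrained checkpoints as well as checkpoints fine-tuned for dialogue, instruction following, helpfulness, and safety. According to the technical report, the authors evaluated the models comprehensively to assess performance and address shortcomings, with the aim of enabling in-depth research on model tuning regimes and the development of safer, more responsible models. The report states that Gemma outperforms similarly sized open models across domains including question answering, commonsense reasoning, mathematics and science, and coding, as shown by both automated benchmarks and human evaluation. For more on the Gemma models, see the [technical report](https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf).

To start using the [Gemma](https://huggingface.co/google/gemma-7b) models, accept their terms on Hugging Face, then pass your Hugging Face token when logging in.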

In [None]:
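# Base 2B checkpoint. Each model-loading cell overwrites `model` and `tokenizer`, so run only the one you need.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", padding=True, truncation=True, max_length=512)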


Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it", padding=True, truncation=True, max_length=512)

In [None]:
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b", padding=True, truncation=True, max_length=512)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [None]:
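# Instruction-tuned 7B checkpoint, loaded in bfloat16 to halve weight memory relative to float32.
# Padding and truncation are tokenization-time options, so they are not passed to from_pretrained.
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")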


In [None]:
# Issue: https://github.com/langchain-ai/langchain/discussions/19403
# `return_tensors='pt'` is removed so the pipeline returns decoded text,
# which is what the LangChain wrapper reads from the output.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,  # cap on generated tokens; `max_length` is dropped to avoid two conflicting limits
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)


In [None]:
# generate pipeline
llm = HuggingFacePipeline(
 pipeline=pipe,
 model_kwargs={"temperature": 0.7, "max_length": 512},
)


In [None]:
# RAG pipeline
qa = RetrievalQA.from_chain_type(
 llm=llm,
 chain_type="stuff",
 retriever=db.as_retriever()
)
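
With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The sketch below shows the idea in plain Python; `stuff_prompt` and its template wording are hypothetical and only approximate what the chain does internally.

```python
def stuff_prompt(question, documents, template=None):
    """Concatenate every retrieved document into one context block, then add the question."""
    # Hypothetical template for illustration; the real chain ships its own wording.
    template = template or (
        "Use the following context to answer the question.\n\n"
        "{context}\n\nQuestion: {question}\nHelpful Answer:"
    )
    context = "\n\n".join(documents)
    return template.format(context=context, question=question)

retrieved = ["text: Once upon a time, a kitten...", "text: In a bustling town..."]
prompt = stuff_prompt("Write an educational story for young children.", retrieved)
print(prompt)
```

Since everything goes into one prompt, the number and size of retrieved chunks must stay within the model's context window, which is one reason the documents were split into chunks earlier.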


## Query the RAG pipeline

The RAG pipeline is ready; let's pass it a query and see how it performs.

In [None]:
res = qa.invoke("Write an educational story for young children.")
print(res["result"])

In [None]:
#res = qa.invoke("Write an educational story for young children.")
res = qa.invoke("Once upon a time.")

print(res["result"])


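# Final words

The Gemma 7B model performed very well: we read a lovely story about a kitten. The new SOTA model is fun and exciting to use, and with the help of the FAISS vector store we were able to build a RAG pipeline. Thanks for reading!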