<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/gemma_FAISS_Cosmopedia_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU @ 2.20GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            7
    BogoMIPS:            4400.43
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clf
                         lush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_
                         good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fm
                         a cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hyp
                         ervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd

In [None]:
!nvidia-smi

Tue Mar 26 10:39:10 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              43W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Steps to Build a RAG Application with the Gemma LLM
![img](https://miro.medium.com/v2/resize:fit:1400/0*ireimPdcCcsa4ro0)

# Introduction

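As large language models continue to advance, interest in building RAG (retrieval-augmented generation) applications keeps growing. Google recently released an open model: Gemma. RAG combines two basic approaches: retrieval-based techniques and generative models. Retrieval-based techniques fetch relevant information from a large knowledge base or corpus in response to a specific query. Generative models draw on what they learned from training data to produce new content, composing original text or responses. With this release, why not try building a RAG pipeline on the new open model and see how it performs?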

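Let's get started and break the process into the following steps:

1. Load the dataset: Cosmopedia
2. Generate embeddings with Hugging Face
3. Store them in a FAISS DB
4. Gemma: introducing the SOTA model
5. Query the RAG pipeline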

# Building a RAG Application on Gemma 7B

Before getting started, install and import the required dependencies.



In [1]:
%pip install -q -U langchain torch transformers sentence-transformers datasets faiss-cpu



In [2]:
import torch
import pandas as pd
from datasets import load_dataset
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# The embeddings, vector store, and LLM wrappers live in langchain_community
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


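## Load the Dataset: Cosmopedia

To build the RAG application, we chose the Hugging Face dataset [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia). It consists of synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. The dataset contains over 30 million files and 25 billion tokens, making it the largest open synthetic dataset to date.

The dataset contains 8 subsets. We will work with the "stories" subset and load it with the datasets library.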

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"


In [3]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) y
Token is valid (permission: write).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'stor

In [4]:
!huggingface-cli download \
  --repo-type dataset HuggingFaceTB/cosmopedia data/stories/train-00000-of-00043.parquet \
  --local-dir dataset/HuggingFaceTB/cosmopedia \
  --local-dir-use-symlinks False

Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/datasets/HuggingFaceTB/cosmopedia/resolve/main/data/stories/train-00000-of-00043.parquet to /root/.cache/huggingface/hub/tmp_obxi7kb
train-00000-of-00043.parquet: 100% 277M/277M [00:03<00:00, 87.8MB/s]
dataset/HuggingFaceTB/cosmopedia/data/stories/train-00000-of-00043.parquet


In [56]:
from datasets import load_dataset

#data = load_dataset("./dataset/HuggingFaceTB/cosmopedia", split="train[:100]")
data = load_dataset("./dataset/HuggingFaceTB/cosmopedia", split="train")


In [None]:
# https://huggingface.co/docs/datasets/loading
# download all, then choose sample
data = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train[:1000]")


We then convert it to a pandas DataFrame and save it as a CSV file.



In [57]:
data = data.to_pandas()
data.to_csv("dataset.csv")
data.head()


Unnamed: 0,text,prompt,text_token_length,seed_data,format,audience
0,"Once upon a time, in a village called Kiwilan...",Write an educational story (3-5 paragraphs) ta...,520,ultrachat,story_children,young_children
1,In a bustling town full of curious creatures ...,Write an educational story (3-5 paragraphs) ta...,381,openhermes2.5,story_children,young_children
2,Step 3: Embracing an Unconventional Warmup Ro...,Write a real-life story shared by someone in a...,580,openhermes2.5,story_reddit,general
3,"Once upon a time, in a small town named Harmo...",Write an educational story (3-5 paragraphs) ta...,439,ultrachat,story_children,young_children
4,"On a bright, sunny day, two best friends, Tim...",Write an educational story (3-5 paragraphs) ta...,414,openhermes2.5,story_children,young_children


In [58]:
!ls -lh dataset.csv
!wc -l dataset.csv

-rw-r--r-- 1 root root 473M Mar 26 14:24 dataset.csv
3333354 dataset.csv


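Now that the dataset is saved on our system, we will load it with LangChain. First free the memory used by the earlier load. If you are running in Colab, you can also reset the runtime to free memory: open the "Runtime" menu and choose "Factory reset runtime" to restart the Colab runtime, which clears all loaded data and objects and frees the memory. Then re-run the imports.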
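
If a full runtime reset is inconvenient, a lighter option is to drop the references to the large objects and ask Python to collect them. The sketch below is only an illustration: the helper name `free_memory` is hypothetical, and Python may not hand all of the freed memory back to the operating system.

```python
import gc


def free_memory(namespace: dict, *names: str) -> int:
    """Remove the given names from a namespace and run the garbage collector.

    Returns the number of unreachable objects the collector found.
    """
    for name in names:
        # pop() with a default does nothing if the name is already gone
        namespace.pop(name, None)
    return gc.collect()


# Toy namespace standing in for the notebook's globals (assumption for the demo)
session = {"data": list(range(100_000)), "embeddings": "keep me"}
free_memory(session, "data")
print(sorted(session))
```

In the notebook itself the equivalent call would be `free_memory(globals(), "data")` before loading the CSV with LangChain.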

In [59]:
loader = CSVLoader(file_path='./dataset.csv',)
data = loader.load()


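Now that the data is loaded, we need to split the documents. Here we split them into chunks of at most 1000 characters, with 150 characters of overlap between neighboring chunks. Smaller chunks keep retrieval focused and keep the context passed to the model short.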



In [61]:
# Split the loaded documents directly
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)
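
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification rather than LangChain's algorithm: `RecursiveCharacterTextSplitter` also prefers to break on separators such as blank lines and spaces, which this toy version ignores. The helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


chunks = split_with_overlap("a" * 2000)
print([len(c) for c in chunks])  # [1000, 1000, 300]
```

The overlap means the last 150 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still intact in at least one chunk.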


## Embedding Generation with Hugging Face

Next, we use Hugging Face Embeddings to generate embeddings with a Sentence Transformers model.

In [62]:
modelPath = "sentence-transformers/all-MiniLM-l6-v2"
model_kwargs = {'device':'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
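
The `normalize_embeddings` flag matters for how distances between vectors are interpreted. As far as I know, LangChain's FAISS wrapper ranks by L2 distance by default; for unit-length vectors the squared L2 distance equals `2 - 2 * cosine`, so L2 and cosine rankings agree once embeddings are normalized. The sketch below checks that identity on toy 2-D vectors (the vectors and helper names are made up for illustration).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = l2_normalize(a), l2_normalize(b)

# On unit vectors, squared L2 distance and cosine similarity carry the same information
print(squared_l2(ua, ub), 2 - 2 * cosine(a, b))
```

With `normalize_embeddings` left at `False`, as in the cell above, vector length also influences the L2 ranking.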

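## Store in a FAISS DB

The embedding model is set up, but the embeddings still need to be stored in a vector database. We will keep them in a FAISS vector store; FAISS is a library for efficient similarity search and clustering of dense vectors.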

In [None]:
# Embed every chunk and build the FAISS index in memory (slow for a large corpus on CPU)
db = FAISS.from_documents(docs, embeddings)
print(db.index.ntotal)
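
Conceptually, a flat vector index answers a query by comparing it against every stored vector and returning the closest ones. The sketch below shows that idea as a brute-force search in plain Python; it is not how FAISS is implemented internally, and the toy vectors, document ids, and the `search` helper are assumptions made for illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(index: dict, query: list, k: int = 4) -> list:
    """Return the ids of the k stored vectors closest to the query."""
    scored = sorted((squared_l2(vec, query), doc_id) for doc_id, vec in index.items())
    return [doc_id for _, doc_id in scored[:k]]


# Toy index: document id -> embedding vector
toy_index = {
    "turtle_story": [0.0, 0.0],
    "koala_story": [1.0, 1.0],
    "reddit_story": [5.0, 5.0],
}

print(search(toy_index, [0.9, 0.9], k=2))  # ['koala_story', 'turtle_story']
```

The real index does the same job over hundreds-of-dimensions vectors with optimized kernels, which is why building it once and reusing it is worthwhile.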


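## Gemma: Introducing the SOTA Model

Gemma comes in two sizes, with 2 billion and 7 billion parameters, to suit different compute constraints and use cases. Both pretrained and fine-tuned checkpoints are provided, along with an open-source codebase for inference and serving. The models were trained on up to 6 trillion tokens of text and use architectures, data, and training recipes similar to those of the Gemini models. Both show strong generalist capabilities across text domains and do well on understanding and reasoning tasks at scale.

The release includes raw pretrained checkpoints as well as checkpoints fine-tuned for dialogue, instruction following, helpfulness, and safety. According to the technical report, the Gemma team evaluated the models comprehensively to characterize their performance and shortcomings, with the aim of supporting in-depth research into model tuning regimes and safer, more responsible model development. The report states that Gemma outperforms similarly sized open models across domains including question answering, commonsense reasoning, mathematics and science, and coding, on both automated benchmarks and human evaluation. To learn more about the Gemma models, see the [technical report](https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf).

To start using the [Gemma](https://huggingface.co/google/gemma-7b) model, you must accept its terms on Hugging Face, then pass your Hugging Face token when logging in.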

### local LLM

In [None]:
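# Load the weights in bfloat16 to halve memory use compared with the float32 default
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype=torch.bfloat16)
# padding, truncation, and max_length apply when the tokenizer is called, not when it is loaded
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")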


Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.


In [None]:
# Instruction-tuned 2B variant, loaded in bfloat16
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

In [None]:
# Base 7B model, loaded in bfloat16
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")

In [None]:
# Instruction-tuned 7B variant, loaded in bfloat16
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.


In [None]:
# Issue: https://github.com/langchain-ai/langchain/discussions/19403
# Do not pass `return_tensors='pt'`: the pipeline must return generated text for LangChain to consume
pipe = pipeline(
 "text-generation",
 model=model,
 tokenizer=tokenizer,
 max_new_tokens=1024,  # caps generated tokens; also setting max_length only triggers a warning
 model_kwargs={"torch_dtype": torch.bfloat16},
 device="cuda"
)


In [None]:
# generate pipeline
llm = HuggingFacePipeline(
 pipeline=pipe,
 model_kwargs={"temperature": 0.9, "max_length": 1024,"top_k":40, "top_p":0.95},
)


### remote HuggingFace LLM Endpoint

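The Hugging Face Hub is a platform with over 350k models, 75k datasets, and 150k demo apps (Spaces), all open source and publicly available, where people can easily collaborate and build ML together. The Hub works as a central place where anyone can explore, experiment, collaborate, and build technology with machine learning.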

In [40]:
import os
from google.colab import userdata
os.environ["HUGGINGFACEHUB_API_TOKEN"] = userdata.get('HF_TOKEN')

from langchain_community.llms import HuggingFaceEndpoint

#repo_id = "google/gemma-2b-it"
#repo_id = "google/gemma-7b-it"
#repo_id = "google/gemma-2b"
repo_id = "google/gemma-7b"

# max_new_tokens caps the generated length; max_length is not a field of HuggingFaceEndpoint
llm = HuggingFaceEndpoint(
    repo_id=repo_id, max_new_tokens=1024, temperature=0.9, top_k=40, top_p=0.95
)


Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Query the RAG Pipeline

The RAG pipeline is ready; let's pass in a query and see how it performs.

In [36]:
# RAG pipeline
qa = RetrievalQA.from_chain_type(
 llm=llm,
 chain_type="stuff",
 retriever=db.as_retriever()
)
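
The `"stuff"` chain type places all retrieved chunks into a single prompt and sends it to the LLM in one call. The sketch below shows the idea in plain Python; the template wording and the `build_stuff_prompt` helper are illustrative assumptions, not LangChain's exact prompt.

```python
def build_stuff_prompt(question: str, documents: list[str]) -> str:
    """Concatenate retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(documents)  # every retrieved chunk is "stuffed" into the context
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Toy chunks standing in for what the retriever would return
retrieved = [
    "Timmy and Sally wondered who would win a race.",
    "Kiki Koala surprised everyone by swimming against the current.",
]
print(build_stuff_prompt("Write an educational story for young children.", retrieved))
```

Because every chunk lands in one prompt, the number and size of retrieved chunks must fit within the model's context window.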


In [48]:
res = llm.invoke("Write an educational story for young children.")
print(res)

 Maybe your story explains the basics of the solar system, or it introduces children to the idea of saving money.

There are many ways to write an educational story for children. In a book I have written with my wife, <strong>Stories and Lessons</strong>, we introduce 100 Bible stories to children.

Each story comes with a lesson or moral. These lessons are written by my wife, Jennifer, and she has done a brilliant job of distilling the moral of the story into a few words.

I would love to be a part of your children’s educational experience. If you would like to write an educational story for children, then please get in touch.


In [49]:
# Chinese prompt, kept verbatim: "Write an educational story for young children."
res = llm.invoke("为幼儿写一个有教育意义的故事。")
print(res)


要求：
1.故事要有主题，有启发作用
2.故事的构思要有新意，有独创性
3.故事的语言要通顺，生动形象
4.故事的长短适中，最好能达到500字左右

参考答案：
1.你走入校园的每一步，都为下一代铺平了一条路。

2.我读小学时，有一位教师把我的作文批上了“不负我青春，不负韶华”的字样，这令我大受鼓舞，从那时起，我便决定，我要在每一个学生的心中，扎下根，打下台阶，用我的全部努力，帮助学生树立正确的理想、价值观、目标和人生观。

3.我们这群教师，是国家对下一代的厚爱，是下一代对国家的深情。我们就像一棵棵树，每棵树是独立的，但又互相交融为一体。每棵树在不同的季节、不同的光影、不同的角度，都会呈现出不同的颜色，不同的形态。

4.我们这一代的教育工作者是教育界的巨人，是社会发展的先锋，是培养新一代中国人的摇篮。我们这一代的教育工作者，是人类历史发展的转折点，是人类进步史的转折点。我们这一代的教育工作者，是教育界的巨人，是社会发展的先锋，是培养新一代中国人的摇篮。

5.我读大学时，有一位大学教授，每天都把他的心意写在我的身上，让我体会到老师的真情。那一年，我得到了一张全省的奖学金，老师说：“你虽然得到了一张全省的奖学金，但是我给你的情，是全省的奖学金的十倍。”那年的9月份，我离开母校，去我未来的工作岗位，去我的新生活，可是那年10月份，我的母校又来了一位学生，她叫李明，是来自一个山穷水尽的地方的。李明说，她很希望可以来学校读大学，可是她的家境很困难，没有钱。她来到学校，到了一位教授那里，教授给了她一封信，这封信里的内容是：“李明，如果你真的想来这里来读大学，你可以来我的研究


In [55]:
res = qa.invoke("Write an educational story for young children.")
print(res['result'])



Timmy and Sally were curious about who would win a race combining both running and swimming. The wise old turtle suggested organizing a competition among various animals. Nobody expected either Timmy or Sally to win, but instead, Kiki Koala surprised everyone by climbing trees and swimming strongly against the current. This unexpected revelation taught Timmy and Sally that every creature has unique abilities, making each special and valuable.


In [53]:
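# The knowledge base contains no Chinese stories
# Chinese prompt, kept verbatim: "Write an educational story for young children."
res = qa.invoke("为幼儿写一个有教育意义的故事。")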
print(res['result'])



Sarah and Julian's story teaches us about the power of friendship, compassion, and resilience. It also highlights the importance of standing up for what we believe in, even when it means going against the grain.


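# Summary

The Gemma model performed very well. We read a lovely story about small animals. By comparison, telling stories with the LLM alone is limited by what the model knows: if story-type data is scarce in its training data, the generated stories are not diverse enough. Attaching an external knowledge base has its own limit: even with a lot of data, the passages recalled for a given query are nearly identical, so the model needs a stronger ability to generate varied output (this applies to story-like scenarios, not to search scenarios that need deterministic answers). With the FAISS vector store, we were able to build a RAG pipeline. The next step is to find some Chinese story collections (translated ones would also do). Two approaches are feasible: attach them as an external knowledge base, or continue training on the Gemma base model.

An aside: a good model, as a crystallized compression of data, plus prompt engineering generates good data. In the early stage of models, assuming model architectures are public, what matters is whose data quality is better; good data will be learned by models, which feels a bit like genetic iteration...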