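# Using Redis as a Vector Database

## The Unstructured Data Problem

Today, about 80% of the data that organizations generate is unstructured: data that has no well-defined schema or cannot be restructured into a familiar columnar format. Typical examples of unstructured data include free-form text, images, videos, and sound clips. The volume of unstructured data is expected to grow in the coming decades.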

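Unstructured data is high-dimensional and noisy, which makes it harder to analyze and interpret with traditional methods. But it is also packed with information and meaning.

Traditionally, unstructured data is processed to extract specific features, effectively turning it into structured data. Once in the realm of structured data, we can search it with SQL queries (if it is stored in a relational database) or with a text search engine.

Converting unstructured data into structured data has some problems. First, extracting features from unstructured data can be computationally expensive and error-prone, which significantly delays the point at which we can use the data effectively. Second, some fidelity and information may be lost during extraction and transformation, because unique latent features cannot easily be categorized or quantified.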

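## Enter Vector Databases

One way to deal with unstructured data is to vectorize it. By vectorizing, we mean converting something like a text passage, an image, a video, or a song into a flat sequence of numbers that represents that particular piece of data. These vectors are representations of the data in an N-dimensional space. Once data is vectorized, we can use linear algebra techniques to compare, group, and manipulate it. This is the foundation of a vector database: the ability to store and operate on vectors. The approach is not new and has been around for a long time. What is different today is how far the techniques for generating vectors have advanced.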
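As a minimal sketch of what comparing vectors with linear algebra can look like, the snippet below computes cosine similarity, one common similarity measure, using only the standard library. The three vectors and their names are invented for illustration and are not the output of any real model; real embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional vectors, made up for illustration only.
kids_bike = [0.9, 0.1, 0.2]
toddler_bike = [0.8, 0.2, 0.1]
road_bike = [0.1, 0.9, 0.7]

# Vectors pointing in similar directions score closer to 1.
print(cosine_similarity(kids_bike, toddler_bike))
print(cosine_similarity(kids_bike, road_bike))
```

With these toy values, the first pair scores higher than the second, which is the kind of comparison a vector database runs at scale.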

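## Using Machine Learning Embeddings as Vectors

Traditional methods for converting unstructured text data into vector form include Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). For categorical data, one-hot encoding is a common approach. Hashing and feature extraction techniques, such as edge detection, texture analysis, or color histograms, have been used for high-dimensional data such as images.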
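To make the TF-IDF idea concrete, here is a small standard-library sketch. It assumes the plain `tf * log(N / df)` weighting and whitespace tokenization; libraries use several smoothed variants, so treat this as an illustration rather than a reference implementation. The function names and the three sample documents are made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real tokenizers are more careful."""
    return text.lower().split()


def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) using tf * log(N / df) weighting."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        if not tokens:
            # An empty document maps to the zero vector.
            vectors.append([0.0] * len(vocabulary))
            continue
        vectors.append([
            (counts[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Made-up documents for illustration.
documents = ["light kids bike", "light road bike", "electric commuter bike"]
vocabulary, vectors = tf_idf_vectors(documents)
for doc, vector in zip(documents, vectors):
    print(doc, [round(weight, 3) for weight in vector])
```

A term that appears in every document, such as `bike` here, gets weight zero, while rarer terms carry more weight. Note that each vector has one dimension per vocabulary term, so these encodings become sparse and very wide on real corpora.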

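While these methods are powerful in their own right, they show their limits when faced with high-dimensional and complex forms of data such as long text passages, images, and audio. Consider, for example, how a text passage can be restructured by rearranging sentences, using synonyms, or changing the narrative style. Such simple modifications can effectively evade techniques like Bag-of-Words, preventing systems that rely on the resulting encodings from recognizing text passages with similar meaning.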
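The synonym problem can be reproduced in a few lines. The sketch below builds Bag-of-Words count vectors with the standard library and compares them with cosine similarity. The three sentences are invented for the demonstration, and the helper names are illustrative.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine similarity for two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Made-up sentences: a paraphrase using synonyms, and an unrelated sentence
# that happens to share most of its words with the original.
original = "small fast bike for kids"
paraphrase = "tiny quick bicycle for children"
unrelated = "small fast boat for kids"

vocabulary = sorted(set(" ".join([original, paraphrase, unrelated]).split()))
v_original = bag_of_words(original, vocabulary)

sim_paraphrase = cosine_similarity(v_original, bag_of_words(paraphrase, vocabulary))
sim_unrelated = cosine_similarity(v_original, bag_of_words(unrelated, vocabulary))

print("paraphrase:", round(sim_paraphrase, 2))
print("unrelated: ", round(sim_unrelated, 2))
```

Because Bag-of-Words only sees surface tokens, the sentence about a boat scores as more similar to the original than the paraphrase does, even though the paraphrase has the same meaning.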

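This is where advances in machine learning, and deep learning in particular, come in. Machine learning models have driven the rise of embeddings as a widely accepted way to generate dense, low-dimensional vector representations. Given a suitable model, the resulting embeddings can capture the complex patterns and semantics inherent in the data, overcoming the limitations of the traditional encodings described above.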



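## Generating Embeddings for the bikes Dataset

To explore vector similarity, we will use the bikes dataset, a relatively simple synthetic dataset. It contains 11 bicycle records in a JSON file named `bikes.json`, with the fields `model`, `brand`, `price`, `type`, `specs`, and `description`. The `description` field contains a free-form text description of the bike.

This walkthrough uses Python to run Redis vector similarity search (VSS). Install the required dependencies: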

```shell
python3 -m venv .venv
source .venv/bin/activate
pip3 install redis
pip3 install pandas
# Generating vector embeddings for text data using SentenceTransformers (pre-trained NLP models)
pip3 install sentence-transformers
```
# References
1. https://redis.io/docs/interact/search-and-query/search/vectors/
2. https://github.com/RedisVentures/redis-vss-getting-started/blob/main/vector_similarity_with_redis.ipynb

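```python
import json

import requests

# Download the bikes dataset (a JSON array of bike records)
url = 'https://raw.githubusercontent.com/bsbodden/redis_vss_getting_started/main/data/bikes.json'
response = requests.get(url)
response.raise_for_status()  # fail early on HTTP errors
bikes = json.loads(response.text)
```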

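```python
# %pip is a Jupyter/IPython magic; run this cell inside a notebook
%pip -q install pandas
import pandas as pd

# Show the records as a table
pd.DataFrame(bikes)
```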



| | model | brand | price | type | specs | description |
|---|---|---|---|---|---|---|
| 0 | Jigger | Velorim | 270 | Kids bikes | `{'material': 'aluminium', 'weight': '10'}` | Small and powerful, the Jigger is the best rid... |
| 1 | Hillcraft | Bicyk | 1200 | Kids Mountain Bikes | `{'material': 'carbon', 'weight': '11'}` | Kids want to ride with as little weight as pos... |
| 2 | Chook air 5 | Nord | 815 | Kids Mountain Bikes | `{'material': 'alloy', 'weight': '9.1'}` | The Chook Air 5 gives kids aged six years and... |
| 3 | Eva 291 | Eva | 3400 | Mountain Bikes | `{'material': 'carbon', 'weight': '9.1'}` | The sister company to Nord, Eva launched in 20... |
| 4 | Kahuna | Noka Bikes | 3200 | Mountain Bikes | `{'material': 'alloy', 'weight': '9.8'}` | Whether you want to try your hand at XC racing... |
| 5 | XBN 2.1 Alloy | Breakout | 810 | Road Bikes | `{'material': 'alloy', 'weight': '7.2'}` | The XBN 2.1 Alloy is our entry-level road bike... |
| 6 | WattBike | ScramBikes | 2300 | eBikes | `{'material': 'alloy', 'weight': '15'}` | The WattBike is the best e-bike for people who... |
| 7 | Soothe Electric bike | Peaknetic | 1950 | eBikes | `{'material': 'alloy', 'weight': '14.7'}` | The Soothe is an everyday electric bike, from ... |
| 8 | Secto | Peaknetic | 430 | Commuter bikes | `{'material': 'aluminium', 'weight': '10.0'}` | If you struggle with stiff fingers or a kinked... |
| 9 | Summit | nHill | 1200 | Mountain Bike | `{'material': 'alloy', 'weight': '11.3'}` | This budget mountain bike from nHill performs ... |


```python
# Pretty-print the first record to inspect its structure
print(json.dumps(bikes[0], indent=2))
```

```text
{
  "model": "Jigger",
  "brand": "Velorim",
  "price": 270,
  "type": "Kids bikes",
  "specs": {
    "material": "aluminium",
    "weight": "10"
  },
  "description": "Small and powerful, the Jigger is the best ride for the smallest of tikes! This is the tiniest kids\u2019 pedal bike on the market available without a coaster brake, the Jigger is the vehicle of choice for the rare tenacious little rider raring to go. We say rare because this smokin\u2019 little bike is not ideal for a nervous first-time rider, but it\u2019s a true giddy up for a true speedster. The Jigger is a 12 inch lightweight kids bicycle and it will meet your little one\u2019s need for speed. It\u2019s a single speed bike that makes learning to pump pedals simple and intuitive. It even has  a handle in the bottom of the saddle so you can easily help your child during training!  The Jigger is among the most lightweight children\u2019s bikes on the planet. It is designed so that 2-3 year-olds fit comfortably in a mo
```

```python
# Jupyter/IPython magic: install or upgrade sentence-transformers
%pip install -U -q sentence-transformers
```



```python
from sentence_transformers import SentenceTransformer

# Load a pre-trained model; the weights are downloaded on first use
embedder = SentenceTransformer('msmarco-distilbert-base-v4')
```

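A natural next step, sketched here under assumptions, is to turn each bike's `description` into a vector. `SentenceTransformer.encode` accepts a list of strings and returns one vector per string. The helper name `embed_descriptions` and the output key `description_embedding` are hypothetical choices for this example. The encoder is passed in as a parameter so the logic can be tried without downloading a model, and `fake_encode` is only a stand-in, not a real embedding model.

```python
def embed_descriptions(records, encode):
    """Return copies of the records with an embedding of `description` added.

    `encode` is any callable that maps a list of strings to a sequence of
    vectors, for example `embedder.encode` from sentence-transformers.
    """
    descriptions = [record["description"] for record in records]
    vectors = encode(descriptions)
    return [
        # "description_embedding" is a hypothetical field name.
        {**record, "description_embedding": [float(x) for x in vector]}
        for record, vector in zip(records, vectors)
    ]


def fake_encode(texts):
    """Stand-in encoder (NOT a real model): two toy numbers per text."""
    return [[float(len(text)), float(text.count(" "))] for text in texts]


# Shortened, made-up sample records; the real dataset has longer descriptions.
sample = [
    {"model": "Jigger", "description": "Small and powerful"},
    {"model": "Example", "description": "A placeholder description"},
]

for record in embed_descriptions(sample, fake_encode):
    print(record["model"], record["description_embedding"])
```

With the model loaded above, the call would be `embed_descriptions(bikes, embedder.encode)`. The helper returns new dictionaries rather than mutating the input, so the original records stay unchanged.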
