Cloud Computing Presentation

# Cloud Computing Presentation
## From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees
### Paper 
- From OSDI20
- Wisconsin ADSL
- Learn Index (Started in 2017)

### Background
- Index is commonly used in the storage system for the lookup operation.
- Some examples, unsorted liner search, sorted binary search
- **What if the data is huge?**
  - Assume sorted data
  - Traditional solution: build specific data structures for lookups (Record the position of the data). B-Tree/HASH/LSM Tree
  - New Solution: What if we know the data beforehand?
    - SImple Example: f(x)=0.5x - 50. 
      - Time Complexity – O(1) for lookups
      - Space Complexity – O(1) Only 2 floating points – slope(斜率) + intercept(截距)
    - Challenges: 
       - How to efficiently support insertions/updates? (新加入的数据不再符合以前的分布，需要重新训练，不训练准确度降低)
       - How to integrate into production systems? (实际生产环境中的数据分布不一定满足简单的线性模型，还要考虑其他成本)

### Overview of Bourbon
- Paper Chapters and Writing ideas

### LevelDB
- BigTable, RocksDB etc KVS
- Core Concept
- Differences with WiscKey.
  - Key-Value  VS Key-Value Addr

### Guidelines
...

### Design
- WiscKey Introduction
- Lookup with model  VS Original Lookup

### Cost-Benefit Analyzer
- Goal: Minimize the total CPU time. (CPU Bottleneck)
- Factors for benefit and cost.
- Minimal total CPU cost in all scenarios

### Evaluation
- Micro Benchmarks: Different Datasets （均匀随机只读负载）
  - synthetic datasets:  线性分布、正常分布、含1%分段的分布、含10%分段的分布
  - real-world dataset: AmazonReviews(亚马逊商品评论数据集) and OpenStreetMapNY(开放街道地图数据集)
- Micro Benchmarks: Different request distribution （只读负载，初始化的时候随机加载）
  - Sequential, zipfian(一个离散幂律概率分布，也就是常常提到的长尾模型), hotspot, exponential, uniform, and latest
    - uniform: 均匀随机地选择一件物品。例如，在选择一条记录时，数据库中的所有记录都同样有可能被选中。
    - Zipfian:根据Zipfian的分布选择一个项目。例如，在选择唱片时，一些唱片会非常受欢迎(发行版的头部)，而大多数唱片将不受欢迎(尾部)
    - Latest:类似Zipfian发行版，除了最近插入的记录在发行版的头部。
    - exponential: 可以指定每个项目的概率。例如，我们可以为读操作分配0.95的概率，为更新操作分配0.05的概率，为扫描和插入分配0的概率。其结果将是一个读取繁重的工作负载。

- Macro benchmark: YCSB R:W A(50:50) B(95:5 zipfian) C(100:0) D:(95:5 latest) F(Read-modify-write) E(short range)

### Conclusion
#### 一点收获
- 新的思考：
  - AI For System. Learn Index, Detect Workload, Cache Policy
  - 还是需要打破思想桎梏，LSM 历来都被认为是写友好型的，而学习索引其实是更利于只读负载的，这种看似有绝对冲突的 topic 往往仔细推敲之后发现其实大有可为，甚至可能实现互补的效果 1+1>2
- 论文写作：
  - 大量的实验，从一开始问题的提出，首先理论分析，后来用实验证明相应的结论，并总结出 Guidelines。
  - 再基于以上的发现和经验总结，提出自己的设计，设计往往可能不是最重要的，问题可能更为关键
  - 对于自己所提出的每个设计点都进行了测试，测试的组织形式采用了自问自答的方式，从读者的角度可能存在的问题，先大致总结出来，然后带着问题，每个问题进行针对性地测试，同时涵盖设计中的每个点。

#### 问题
- 文中虽然有大量的测试，但其实稍有脱离 LSM 应用最广泛的场景，即面向磁盘的存储，虽然有 Optane SSD 的测试，但是发现效果不理想，虽有收益，但是较小，如果存储设备退化成应用最广泛的普通的 SATA SSD 甚至 HDD，是否会出现完全没收益的情况。
- 结论：算是给 LSM Tree 的学习索引开了个头，但是离实际应用还有一定的距离，不像 WiscKey 直接。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cloud Computing Presentation #19

Cloud Computing Presentation

From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees

Paper

Background

Overview of Bourbon

LevelDB

Guidelines

Design

Cost-Benefit Analyzer

Evaluation

Conclusion

一点收获

问题

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Cloud Computing Presentation #19

Description

Cloud Computing Presentation

From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees

Paper

Background

Overview of Bourbon

LevelDB

Guidelines

Design

Cost-Benefit Analyzer

Evaluation

Conclusion

一点收获

问题

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions