Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coprocessor: v2 row format for decode #5725

Merged
merged 17 commits into from Nov 13, 2019
Merged

Conversation

@niedhui
Copy link
Collaborator

niedhui commented Oct 24, 2019

What have you changed?

a row_slice implementation for v2 decoder.

parts of (https://github.com/pingcap/tidb/blob/master/docs/design/2018-07-19-row-format.md) codec

the encoder part of this PR is only for test

What is the type of the changes?

New feature

How is the PR tested?

Unit test

Does this PR affect documentation (docs) or should it be mentioned in the release notes?

No

Does this PR affect tidb-ansible?

No

Refer to a related PR or issue link (optional)

2018-07-19-row-format.md

#5497

Benchmark result if necessary (optional)

on MBP (2.7 GHz Quad-Core Intel Core i7)

benches::bench_from_bytes_big                             ... bench:          39 ns/iter (+/- 2)
benches::bench_search_in_non_null_ids                     ... bench:          28 ns/iter (+/- 2)
benches::bench_search_in_non_null_ids_big                 ... bench:          48 ns/iter (+/- 1)
benches::bench_search_in_non_null_ids_middle              ... bench:          32 ns/iter (+/- 1)
benches::bench_search_in_null_ids_middle                  ... bench:          31 ns/iter (+/- 7)

}

impl Row {
fn new_row(datums: Vec<Datum>, ids: &[i64]) -> Row {

This comment has been minimized.

Copy link
@iosmanthus

iosmanthus Oct 24, 2019

Contributor

Could we just name it: new ?

This comment has been minimized.

Copy link
@niedhui

niedhui Oct 24, 2019

Author Collaborator

new is better 😃

#[cfg(test)]
mod encoder;

trait BoundedU64 {

This comment has been minimized.

Copy link
@iosmanthus

iosmanthus Oct 24, 2019

Contributor

Maybe this is what you want: https://rust-num.github.io/num/num/trait.Bounded.html.
And we could use a blanket implementation for the BoundedU64 with another trait: https://rust-num.github.io/num/num/cast/trait.NumCast.html.

This comment has been minimized.

Copy link
@niedhui

niedhui Oct 24, 2019

Author Collaborator

the bounded value is used as Range patterns later, so the value should either be literal or const value. So I think num-trait could not help.

I'm thinking maybe using constant directly is more reasonable

mod int_bound {
    pub const MAX_U8: u64 = (1 << 8) - 1;
    pub const MAX_I8: i64 = (1 << 7) - 1;
    .......
}
#[allow(clippy::match_overlapping_arm)]
#[inline]
fn encode_i64(&mut self, v: i64) -> codec::Result<()> {
match v {

This comment has been minimized.

Copy link
@iosmanthus

iosmanthus Oct 24, 2019

Contributor

I think we could use a generic function call is_bounded_by to extract this kind of code. It is worth to have a try.


#[allow(clippy::match_overlapping_arm)]
#[inline]
fn encode_u64(&mut self, v: u64) -> codec::Result<()> {

This comment has been minimized.

Copy link
@iosmanthus

iosmanthus Oct 24, 2019

Contributor

Ditto

@iosmanthus

This comment has been minimized.

Copy link
Contributor

iosmanthus commented Oct 24, 2019

BTW, Thank you a lot! 😄

}

#[inline]
fn flatten(datum: Datum) -> Datum {

This comment has been minimized.

Copy link
@iosmanthus

iosmanthus Oct 24, 2019

Contributor

Maybe we should add an EvalContext to the function, the new version of DateTime need an EvalContext to encode.

This comment has been minimized.

Copy link
@niedhui

niedhui Oct 24, 2019

Author Collaborator

good to know. I'll update this after #5473 merged

Signed-off-by: niedhui <niedhui@gmail.com>
@niedhui niedhui force-pushed the niedhui:niedhui/row_slice_v2 branch from 2019274 to ca71539 Oct 24, 2019
Copy link
Member

breeswish left a comment

Good job! Some advice to make it perfect :)

components/tidb_query/src/codec/row/v2.rs Outdated Show resolved Hide resolved
components/tidb_query/src/codec/row/v2/encoder.rs Outdated Show resolved Hide resolved

use std::{i16, i32, i8, u16, u32, u8};

pub const CODEC_VERSION: u8 = 128;

This comment has been minimized.

Copy link
@breeswish

breeswish Oct 25, 2019

Member

Maybe adding a MAGIC would be better, since v2 conflicts to version: u8 = 128 semantically. By calling it a magic number it makes sense.

This comment has been minimized.

Copy link
@niedhui

niedhui Oct 28, 2019

Author Collaborator

Do you mean MAGIC_CODE_VERSION?

This comment has been minimized.

Copy link
@breeswish

breeswish Oct 29, 2019

Member

Yes maybe.

components/tidb_query/src/codec/row/v2/encoder.rs Outdated Show resolved Hide resolved
let mut value_wtr = Vec::with_capacity(row.value_approx_size);

self.write_u8(super::CODEC_VERSION)?;
self.write_flag(row.is_big)?;

This comment has been minimized.

Copy link
@breeswish

breeswish Oct 25, 2019

Member

Prefer to define the flag using bitflags so that it will be more flexible when we add more flags.

This comment has been minimized.

Copy link
@breeswish

breeswish Oct 25, 2019

Member

According to the spec, is_big == true also when "the total size of the values is greater than 65535"

This comment has been minimized.

Copy link
@niedhui

niedhui Oct 28, 2019

Author Collaborator
  1. Please take a look, am I using bitflags the right way, thanks.

According to the spec, is_big == true also when "the total size of the values is greater than 65535"

Good catch. Code done. I test it by using generated-data on the golang side locally ,the data is a little large(190k), so I did not push it.

// the row format is:
// | version | flag | number_of_non_null_values | number_of_null_values | non_null_value_ids | null_value_ids | value_offsets | values
// short spec of each field, check https://github.com/pingcap/tidb/blob/master/docs/design/2018-07-19-row-format.md for details
// version: 1 byte
// flag: 1 byte, when there's id greater than 255, value is 1, otherwise 0
// number of non-null values: 2 bytes
// number of null values: 2 bytes
// column ids: ids of non-null values + ids of null values, when flag == 1 (big), id is 4 bytes, otherwise 1 byte
// non-null values offset: when big, offset is 4 bytes, otherwise 2 bytes
Comment on lines 16 to 24

This comment has been minimized.

Copy link
@breeswish

breeswish Oct 25, 2019

Member

Prefer to make this part a doc comment for the encoder mod, since it describes a format that the whole module respects.

This comment has been minimized.

Copy link
@niedhui

niedhui Oct 28, 2019

Author Collaborator

done

pub fn encode(datums: Vec<Datum>, ids: &[i64]) -> Result<Vec<u8>> {
let row = Row::new(datums, ids);
let mut wtr = Vec::with_capacity(row.approx_size());
wtr.write_row(row)?;
Ok(wtr)
}
Comment on lines 25 to 30

This comment has been minimized.

Copy link
@breeswish

breeswish Oct 25, 2019

Member

Prefer to remove this function since it does not follow a standard encode style. Some help utils can be introduced to simplify the code at the call site (i.e. the test code of the decoder, but not here).

This comment has been minimized.

Copy link
@niedhui

niedhui Oct 28, 2019

Author Collaborator

done

fn flatten(datum: Datum) -> Datum {
match datum {
Datum::F64(f) => Datum::U64(f.to_bits()),
Datum::Time(t) => Datum::U64(t.to_packed_u64()),
Datum::Dur(d) => Datum::I64(d.to_nanos()),
_ => datum,
}
}
Comment on lines 231 to 238

This comment has been minimized.

Copy link
@breeswish

breeswish Oct 25, 2019

Member

Prefer to remove this flatten function and embed its logic directly in the encoder, since it harms to the readability, i.e. making it not clear that how each data type is encoded in the encoder.

This comment has been minimized.

Copy link
@niedhui

niedhui Oct 28, 2019

Author Collaborator

done

use tikv_util::codec::read_slice;

#[derive(Default, Debug)]
struct RowSlice<'a> {

This comment has been minimized.

Copy link
@breeswish

breeswish Oct 25, 2019

Member

I don't think it is good to derive Default for it, because the default RowSlice is very likely to be in a broken state so that removing the Default will avoid abusing it elsewhere.

This comment has been minimized.

Copy link
@niedhui

niedhui Oct 28, 2019

Author Collaborator

Without derived Default, in from_bytes I need to declare all fields of RowSlice as variables, then pass it to RowSlice, what's your opinion?

    fn from_bytes(mut data: &[u8]) -> Result<RowSlice> {
        assert_eq!(data.read_u8()?, super::CODEC_VERSION);
        //        let mut row = RowSlice::default();
        let is_big = data.read_u8()? == super::Flags::BIG.bits();

        // read ids count
        let non_null_cnt = data.read_u16_le()? as usize;
        let null_cnt = data.read_u16_le()? as usize;
        if is_big {
            let non_null_ids_big = read_ints_le(&mut data, non_null_cnt)?;
            let null_ids_big = read_ints_le(&mut data, null_cnt)?;
            let offsets_big = read_ints_le(&mut data, non_null_cnt)?;
            Ok(RowSlice {
                is_big,
                non_null_ids: &[],
                null_ids: &[],
                offsets: &[],
                offsets_big,
                null_ids_big,
                non_null_ids_big,
                values: data,
            })
        } else {
            let non_null_ids = read_ints_le(&mut data, non_null_cnt)?;
            let null_ids = read_ints_le(&mut data, null_cnt)?;
            let offsets = read_ints_le(&mut data, non_null_cnt)?;
            Ok(RowSlice {
                is_big,
                non_null_ids,
                null_ids,
                offsets,
                offsets_big: &[],
                null_ids_big: &[],
                non_null_ids_big: &[],
                values: data,
            })
        }

This comment has been minimized.

Copy link
@breeswish

breeswish Oct 29, 2019

Member

Looks fine to me. Maybe a less duplication way can be...

let mut non_null_ids = &[];
let mut null_ids = &[];
...
if is_big {
    non_null_ids = ...
} else {
...
}

Ok(RowSlice {
    is_big,
    non_null_ids,
    null_ids,
    ..
})

This comment has been minimized.

Copy link
@niedhui

niedhui Nov 1, 2019

Author Collaborator

done

niedhui and others added 3 commits Oct 25, 2019
Signed-off-by: niedhui <niedhui@gmail.com>
Datum::Bytes(b"abc".to_vec()),
Datum::F64(1.8),
Datum::F64(-1.8),
Datum::Time(Time::parse_utc_datetime("2018-01-19 03:14:07", 0).unwrap()),

This comment has been minimized.

Copy link
@iosmanthus

iosmanthus Oct 29, 2019

Contributor

Please use parse_datetime instead, after:#5473 merged.

This comment has been minimized.

Copy link
@niedhui

niedhui Nov 7, 2019

Author Collaborator

done

Datum::Json(j) => self.write_json(&j),

Datum::F64(f) => self.encode_u64(f.to_bits()).map_err(Error::from),
Datum::Time(t) => self.encode_u64(t.to_packed_u64()).map_err(Error::from),

This comment has been minimized.

Copy link
@iosmanthus

iosmanthus Oct 29, 2019

Contributor

Please add an EvalContext here, after:#5473 merged.

This comment has been minimized.

Copy link
@niedhui

niedhui Nov 7, 2019

Author Collaborator

done

Copy link
Member

breeswish left a comment

The rest seems good to me. @sticnarf @lonng PTAL

value_wtr.write_value(col)?;
offsets.push(value_wtr.len());

if value_wtr.len() > (u16::MAX as usize) {

This comment has been minimized.

Copy link
@breeswish

breeswish Nov 3, 2019

Member

we can check it after the loop

This comment has been minimized.

Copy link
@niedhui

niedhui Nov 3, 2019

Author Collaborator

done. thanks

/// Returns the `start` position and `offset` in `values` field if found, otherwise returns `None`
///
/// # Errors
/// If the id is found with no offset, `Error::ColumnOffset` will be returned.

This comment has been minimized.

Copy link
@breeswish

breeswish Nov 3, 2019

Member

Looks like this means the row data is broken?

This comment has been minimized.

Copy link
@niedhui

niedhui Nov 3, 2019

Author Collaborator

Yes. or we can simply self.offsets.get(idx).expect("id must have offset"), I'm not sure is it too much
to return an Error here for this scenario.

niedhui added 2 commits Nov 3, 2019
refactor
Signed-off-by: niedhui <niedhui@gmail.com>
Copy link
Member

breeswish left a comment

}
ScalarValue::Duration(Some(v)) => self.encode_i64(v.to_nanos()).map_err(Error::from),
ScalarValue::Json(Some(v)) => self.write_json(&v),
_ => unreachable!(),

This comment has been minimized.

Copy link
@breeswish

breeswish Nov 5, 2019

Member

Maybe deserve to handle the NULL case?

This comment has been minimized.

Copy link
@lonng

lonng Nov 6, 2019

Contributor

Seems the NULL has been filtered, see L90.


/// Decodes `len` number of ints from `buf` in little endian
#[inline]
fn read_ints_le<'a, T>(buf: &mut &'a [u8], len: usize) -> Result<&'a [T]>

This comment has been minimized.

Copy link
@breeswish

breeswish Nov 5, 2019

Member

I believe this function is not safe.. It mutates a buffer which is not mutable (notice that the buffer is &'a [u8]).

Since x86 uses little endian, maybe we can simply leave a TODO for it, because reading the buffer directly as an array of numbers conforms little endian.

This comment has been minimized.

Copy link
@niedhui

niedhui Nov 7, 2019

Author Collaborator

I had removed the unsafe code, make it only implemented when target endianness is little.
If someday we compiled it on big endianness target, we can figure it out what we are missing.
PTAL

use tikv_util::codec::read_slice;

#[derive(Debug)]
struct RowSlice<'a> {

This comment has been minimized.

Copy link
@lonng

lonng Nov 6, 2019

Contributor

Prefer to use an enum to define the RowSlice, which can get benefit from the compiler and the compiler will complain if we forget to check the is_big variable.

This comment has been minimized.

Copy link
@lonng

lonng Nov 6, 2019

Contributor

Furthermore, define as enum will make the struct have a smaller size.

niedhui added 2 commits Nov 6, 2019
Signed-off-by: niedhui <niedhui@gmail.com>
Copy link
Contributor

lonng left a comment

LGTM

Copy link
Member

breeswish left a comment

Except for #5725 (comment), the rest LGTM. @sticnarf PTAL

Signed-off-by: niedhui <niedhui@gmail.com>
@winkyao winkyao added this to Backlog in Coprocessor SIG via automation Nov 8, 2019
@winkyao winkyao moved this from Backlog to In Progress in Coprocessor SIG Nov 8, 2019
Copy link
Member

breeswish left a comment

The rest LGTM!

for n in buf.iter_mut() {
*n = n.to_le()
}
let bytes = read_slice(buf, bytes_len)?.as_ptr() as *const T;

This comment has been minimized.

Copy link
@breeswish

breeswish Nov 11, 2019

Member

How about:

if buf.len() < len * std::mem::size_of::<T>() {
    return ...
}
let slice = unsafe { std::slice::from_raw_parts(buf.as_ptr(), len) };
Ok(slice)

This saves one slice creation and is easier to read.

/// This method is only implemented on little endianness currently, since x86 use little endianness.
#[cfg(target_endian = "little")]
#[inline]
fn read_ints_le<'a, T>(buf: &mut &'a [u8], len: usize) -> Result<&'a [T]> {

This comment has been minimized.

Copy link
@breeswish

breeswish Nov 11, 2019

Member

Maybe better to keep restricting T within a numeric type (like PrimInt), to minimize the possibility of making mistakes. Otherwise this function is not safe.

//! This `encoder` module is only used for test, so the implementation is very straightforward.
//!
//! According to https://github.com/pingcap/tidb/blob/master/docs/design/2018-07-19-row-format.md
//! The row format is:

This comment has been minimized.

Copy link
@iosmanthus

iosmanthus Nov 11, 2019

Contributor

Could we format this row with markdown?

niedhui added 2 commits Nov 13, 2019
Signed-off-by: niedhui <niedhui@gmail.com>
Copy link
Member

breeswish left a comment

Awesome!

@lonng
lonng approved these changes Nov 13, 2019
@breeswish

This comment has been minimized.

Copy link
Member

breeswish commented Nov 13, 2019

/merge

@sre-bot sre-bot added the S: CanMerge label Nov 13, 2019
@sre-bot

This comment has been minimized.

Copy link
Collaborator

sre-bot commented Nov 13, 2019

/run-all-tests

@sre-bot sre-bot merged commit 34bebdd into tikv:master Nov 13, 2019
6 checks passed
6 checks passed
DCO All commits are signed off!
Details
idc-jenkins-ci-tikv/integration-common-test Jenkins job succeeded.
Details
idc-jenkins-ci-tikv/integration-compatibility-test Jenkins job succeeded.
Details
idc-jenkins-ci-tikv/integration-copr-test Jenkins job succeeded.
Details
idc-jenkins-ci-tikv/integration-ddl-test Jenkins job succeeded.
Details
idc-jenkins-ci/test Jenkins job succeeded.
Details
Coprocessor SIG automation moved this from In Progress to Done Nov 13, 2019
koushiro added a commit to koushiro/tikv that referenced this pull request Nov 15, 2019
Signed-off-by: niedhui <niedhui@gmail.com>
hawkingrei added a commit to hawkingrei/tikv that referenced this pull request Dec 1, 2019
Signed-off-by: niedhui <niedhui@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
5 participants
You can’t perform that action at this time.