Implement Prototype Rust Backend #6

Closed
brendanzab opened this issue Sep 21, 2017 · 9 comments · Fixed by #38
Comments

brendanzab (Member) commented Sep 21, 2017

In order to start making our push to go end-to-end via Allsorts, we need a backend for generating Rust code. This will probably be similar to how LALRPOP does it, but I would like to generate more readable code than it does.

The unsafe slices branch shows what a resulting decoder might look like. Note that this is a very simple case, however, and we will need to do more trickery in cases where we have anonymous structs, unions, and unknown-sized types in the middle of structs.

mikeday (Contributor) commented Sep 21, 2017

I'm questioning whether generating an actual Rust struct is necessary at all, although it could be helpful for relatively simple structs with fixed field layouts. Complex structs could be represented by an abstract slice with custom accessor methods.
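As a rough illustration of that idea (a minimal sketch; the EncodingRecordRef name, field, and offset are hypothetical, not taken from this project), the generated type could be nothing more than an opaque wrapper over a byte slice, with the layout living entirely in accessor methods:

pub struct EncodingRecordRef<'a> {
    buf: &'a [u8],
}

impl<'a> EncodingRecordRef<'a> {
    pub fn new(buf: &'a [u8]) -> EncodingRecordRef<'a> {
        EncodingRecordRef { buf }
    }

    // Decode a big-endian u16 at a fixed offset; a real implementation would
    // bounds-check the buffer rather than risk a panic on short input.
    pub fn platform_id(&self) -> u16 {
        u16::from_be_bytes([self.buf[0], self.buf[1]])
    }
}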

brendanzab (Member, Author) commented:

Yeah, that may indeed be the case. We also want the output Rust API to be reasonably predictable, so changing the specification slightly should not greatly change the resulting code.

mikeday (Contributor) commented Sep 21, 2017

Even simple structs will need accessors anyway for endian swapping if we want zero-copy access.
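A minimal sketch of that point (the Header type here is hypothetical): even when the field layout is fixed, the raw field holds the bytes exactly as they appear in the file, so every read goes through an endian-swapping accessor.

#[repr(C)]
pub struct Header {
    version: u16, // stored big-endian, exactly as read from the file
}

impl Header {
    // Swap to native endianness on every access.
    pub fn version(&self) -> u16 {
        u16::from_be(self.version)
    }
}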

brendanzab (Member, Author) commented Sep 21, 2017

Note that in the cmap example above I do just that. In Rust, struct fields are private by default, so the only way to interact with this API is via the accessors.

mikeday (Contributor) commented Sep 21, 2017

Right, so the public API doesn't hinge on there being an actual struct, just a type with the appropriate name. (Zero copy would require some lifetime trickery to make references live as long as the main buffer, I suppose).
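A small sketch of the lifetime side of this (hypothetical names): by giving the wrapper an explicit 'data lifetime, accessors can hand out slices that borrow from the original buffer rather than from the wrapper itself, and the borrow checker keeps them from outliving that buffer.

pub struct Table<'data> {
    buf: &'data [u8],
}

impl<'data> Table<'data> {
    pub fn new(buf: &'data [u8]) -> Table<'data> {
        Table { buf }
    }

    // The returned slice carries the 'data lifetime of the backing buffer,
    // not the shorter lifetime of &self. The 4-byte header skipped here is
    // purely illustrative.
    pub fn records(&self) -> &'data [u8] {
        &self.buf[4..]
    }
}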

brendanzab (Member, Author) commented Sep 21, 2017

Yup. The CMap type is defined as:

#[repr(C)]
pub struct CMap {
    version: u16,
    num_tables: u16,
    encoding_records: [EncodingRecord],
}

The [EncodingRecord] field means that this struct is 'dynamically sized', i.e. it can only be accessed via a pointer indirection. The only way it can be constructed is via pub fn from_buf(buf: &[u8]) -> Result<&CMap, ()>, so it will be forever forced to be immutable.
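A hypothetical caller (the version accessor is assumed for illustration, not shown above) would only ever see a borrowed, immutable &CMap tied to the lifetime of the input buffer:

fn print_cmap_version(buf: &[u8]) {
    // from_buf validates the buffer up front and reinterprets it as &CMap.
    match CMap::from_buf(buf) {
        Ok(cmap) => println!("cmap version: {}", cmap.version()),
        Err(()) => println!("invalid cmap table"),
    }
}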

brendanzab (Member, Author) commented Sep 21, 2017

Another alternative might be to do something like:

pub struct CMap<'data> {
    data: &'data [u8],
}

pub fn from_buf<'data>(data: &'data [u8]) -> Result<CMap<'data>, ()> {
    ...
}

Would be interested to see how Nom does deserialization, seeing as it claims to be zero-copy too.

Edit: Seems like nom is zero-copy in the sense that it does not allocate copies of data from the input iterator. But it does require you to copy data into new structs in order to make use of the parsed data. For example, the gif decoder: https://github.com/Geal/gif.rs/blob/master/src/parser.rs

brendanzab (Member, Author) commented Nov 8, 2017

Here are some different methods I've been exploring for the by-ref output:

use std::{mem, ptr, slice};

pub struct Bmp {
    w: u64,
    h: u64,
    data: [u8],
}

impl Bmp {
    pub fn new(buf: &[u8]) -> Result<&Bmp, ()> {
        if buf.len() < Bmp::min_size() {
            Err(())
        } else {
            let bmp = unsafe { mem::transmute::<_, &Bmp>(buf) };

            if buf.len() != bmp.exact_size() {
                Err(())
            } else {
                Ok(bmp)
            }
        }
    }

    fn min_size() -> usize {
        mem::size_of::<u64>() + // w
        mem::size_of::<u64>() // h
    }

    fn exact_size(&self) -> usize {
        Bmp::min_size() +
        mem::size_of::<u8>() * self.w() as usize * self.h() as usize // data
    }

    pub fn w(&self) -> u64 {
        u64::from_be(self.w)
    }

    pub fn h(&self) -> u64 {
        u64::from_be(self.h)
    }

    pub fn data(&self) -> &[u8] {
        unsafe { slice::from_raw_parts(self.data.as_ptr(), self.w() as usize * self.h() as usize) }
    }
}

use std::{mem, ptr, slice};

pub struct Bmp {
    data: [u8],
}

impl Bmp {
    pub fn new(buf: &[u8]) -> Result<&Bmp, ()> {
        if buf.len() < Bmp::min_size() {
            Err(())
        } else {
            let bmp = unsafe { mem::transmute::<_, &Bmp>(buf) };

            if buf.len() != bmp.exact_size() {
                Err(())
            } else {
                Ok(bmp)
            }
        }
    }

    fn min_size() -> usize {
        Bmp::w_size() + Bmp::h_size()
    }

    fn exact_size(&self) -> usize {
        Bmp::min_size() + self.data_size()
    }

    fn w_size() -> usize { mem::size_of::<u64>() }
    fn h_size() -> usize { mem::size_of::<u64>() }
    fn data_size(&self) -> usize { mem::size_of::<u8>() * self.w() as usize * self.h() as usize }

    fn w_offset() -> isize { 0 }
    fn h_offset() -> isize { Bmp::w_offset() + Bmp::w_size() as isize }
    fn data_offset() -> isize { Bmp::h_offset() + Bmp::h_size() as isize }

    pub fn w(&self) -> u64 {
        unsafe {
          let ptr = self.data.as_ptr().offset(Bmp::w_offset()) as *const u64;
          u64::from_be(ptr::read(ptr))
        }
    }

    pub fn h(&self) -> u64 {
        unsafe {
          let ptr = self.data.as_ptr().offset(Bmp::h_offset()) as *const u64;
          u64::from_be(ptr::read(ptr))
        }
    }

    pub fn data(&self) -> &[u8] {
        unsafe {
          let ptr = self.data.as_ptr().offset(Bmp::data_offset()) as *const u8;
          slice::from_raw_parts(ptr, self.data_size())
        }
    }
}

Note that the generated assembly code is identical for both versions. This is very similar to what HarfBuzz is doing.

Advantages:

  • all bounds checking is done up-front
  • no allocation occurs because it reuses the same bytes as the buffer

Disadvantages:

  • requires the full buffer to be persisted in memory
  • requires a great deal of unsafe code under the hood, which could be tricky to audit
  • 'interp' types have to be evaluated on demand, without memoization (e.g. for endian conversions)
  • users might want to persist part of the struct beyond the lifetime of the buffer
  • not compatible with streaming

brendanzab (Member, Author) commented:

I wanted to experiment with the above unsafe APIs because they followed HarfBuzz's technique, but using unsafe in generated code gives me the heebie-jeebies. Don't want to hit something like that Ragel bug that hit Cloudflare…

Here are some more ideas for the 'spidery API', but this time using byte slices as they are intended to be used:

//! Pixel = {
//!     r : u8,
//!     g : u8,
//!     b : u8,
//! };
//!
//! Bmp = struct {
//!     w : u64be,
//!     h : u64be,
//!     data : [Pixel; w * h],
//!     trailer_len : u32be,
//!     trailer_data : [u32be; trailer_len],
//! };

extern crate byteorder;

use byteorder::{BigEndian, ReadBytesExt};
use std::io::Cursor;
use std::mem;

pub struct BmpDataRef<'a> {
    buf: &'a [u8],
}

impl<'a> BmpDataRef<'a> {
    pub fn new(buf: &[u8]) -> BmpDataRef {
        BmpDataRef { buf }
    }
}

pub struct BmpRef<'a> {
    buf: &'a [u8],
}

impl<'a> BmpRef<'a> {
    pub fn new(buf: &[u8]) -> BmpRef {
        BmpRef { buf }
    }

    pub fn w(&self) -> u64 {
        let offset = 0;
        Cursor::new(&self.buf[offset..])
            .read_u64::<BigEndian>()
            .unwrap()
    }

    pub fn h(&self) -> u64 {
        let offset = mem::size_of::<u64>();
        Cursor::new(&self.buf[offset..])
            .read_u64::<BigEndian>()
            .unwrap()
    }

    pub fn data(&self) -> BmpDataRef {
        let offset = mem::size_of::<u64>() * 2;
        let size = mem::size_of::<u8>() * 3 * self.w() as usize * self.h() as usize;
        BmpDataRef::new(&self.buf[offset..offset + size])
    }
}
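A hypothetical caller of this first variant; note that nothing is cached, so each accessor re-reads the buffer through a fresh Cursor (and will panic via unwrap if the buffer is too short):

fn bmp_dimensions(buf: &[u8]) -> (u64, u64) {
    let bmp = BmpRef::new(buf);
    // w() and h() each re-parse their field from the underlying bytes.
    (bmp.w(), bmp.h())
}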

The next one is a 'spidery API' with 'staged copying'. The advantage is that it does up-front verification in the constructor and caches the results:

//! Pixel = {
//!     r : u8,
//!     g : u8,
//!     b : u8,
//! };
//!
//! Bmp = struct {
//!     w : u64be,
//!     h : u64be,
//!     data : [Pixel; w * h],
//!     trailer_len : u32be,
//!     trailer_data : [u32be; trailer_len],
//! };

extern crate byteorder;

use byteorder::{BigEndian, ReadBytesExt};
use std::io::{self, Cursor};
use std::mem;

#[derive(Copy, Clone)]
pub struct BmpRef<'a> {
    w: u64,
    h: u64,
    data: BmpDataCursor<'a>,
}

impl<'a> BmpRef<'a> {
    pub fn new(buf: &[u8]) -> Result<BmpRef, ()> {
        let w = Cursor::new(&buf.get(0..).ok_or(())?)
            .read_u64::<BigEndian>()
            .map_err(|_| ())?;
        let h = Cursor::new(&buf.get(mem::size_of_val(&w)..).ok_or(())?)
            .read_u64::<BigEndian>()
            .map_err(|_| ())?;

        let data = BmpDataCursor::new(&buf
            .get(mem::size_of_val(&w) + mem::size_of_val(&h)..)
            .ok_or(())?);

        Ok(BmpRef { w, h, data })
    }

    pub fn w(&self) -> u64 {
        self.w
    }

    pub fn h(&self) -> u64 {
        self.h
    }

    pub fn data(&self) -> BmpDataCursor<'a> {
        self.data
    }
}

#[derive(Copy, Clone)]
pub struct BmpDataCursor<'a> {
    buf: &'a [u8],
}

impl<'a> BmpDataCursor<'a> {
    pub fn new(buf: &[u8]) -> BmpDataCursor {
        BmpDataCursor { buf }
    }
    
    pub fn get(&self, index: usize) -> Option<Pixel> {
        unimplemented!()
    }
    
    pub fn iter(&self) -> BmpDataIter {
        unimplemented!()
    }
}

pub struct BmpDataIter {
    // TODO
}

impl Iterator for BmpDataIter {
    type Item = Pixel;
    
    fn next(&mut self) -> Option<Pixel> {
        unimplemented!()
    }
}

pub struct Pixel {
    r: u8,
    g: u8,
    b: u8,
}
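A hypothetical usage sketch for the staged-copying variant (assuming the cursor methods above get filled in): validation happens once in new, after which w and h are just cached copies.

fn bmp_size(buf: &[u8]) -> Option<(u64, u64)> {
    let bmp = BmpRef::new(buf).ok()?;
    // These return the values cached during construction; no re-parsing.
    Some((bmp.w(), bmp.h()))
}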

brendanzab added a commit that referenced this issue Aug 30, 2019