Tokenization

rrahn edited this page Apr 6, 2017 · 15 revisions

This document provides technical specifications for the tokenization functions.

Function overview

Functions for reading from input_ranges

read_until

template <typename input_range_t, typename output_iterator_t, typename stop_predicate_t, typename ignore_predicate_t>
    requires input_range_concept<input_range_t> &&
             output_iterator_concept<output_iterator_t> && 
             predicate_concept<stop_predicate_t> &&
             predicate_concept<ignore_predicate_t>
inline void 
read_until(input_range_t &in, output_iterator_t &out, stop_predicate_t &stopFunctor, ignore_predicate_t &ignoreFunctor)
{
    if constexpr (is_chunkable_v<output_iterator_t> && is_chunkable_v<input_range_t>)
    {
        // chunk-wise read
    }
    else
    {
        // element-wise read
    }
    
}
  • [???] Should the output be an output range instead of an output iterator?
  • Can be expressed as a write_until function.
  • Complexity: O(n) in the number of elements in the input.
  • Throws: possibly alloc_error or stream_error?
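
To make the element-wise branch concrete, here is a minimal sketch. It assumes an iterator pair instead of the proposed input_range_t, and the name read_until_sketch and its return value are hypothetical, not part of the specification above.

```cpp
#include <iterator>
#include <string>

// Hypothetical element-wise fallback (assumption: iterator pair instead of a range).
// Copies elements from [first, last) to `out` until `stop` returns true and drops
// every element for which `ignore` returns true.
// Returns the position at which reading stopped.
template <typename input_it_t, typename output_it_t, typename stop_t, typename ignore_t>
input_it_t read_until_sketch(input_it_t first, input_it_t last, output_it_t out,
                             stop_t stop, ignore_t ignore)
{
    for (; first != last && !stop(*first); ++first)
    {
        if (!ignore(*first))
            *out++ = *first;
    }
    return first;
}
```

Called with a stop predicate for ';' and an ignore predicate for blanks, the input "ab cd;ef" yields "abcd" and leaves the returned iterator on the ';'.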

read_line

Shortcut for read_until with is_newline as the delimiter predicate.
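
A possible shape of this shortcut, as a sketch only. The predicate is_newline_sketch, the simplified helper read_until_no_ignore, and the decision to consume the line ending afterwards are all assumptions; the specification above does not say whether the line ending is consumed.

```cpp
#include <iterator>
#include <string>

// Hypothetical stand-in for the is_newline predicate named above.
struct is_newline_sketch
{
    bool operator()(char c) const { return c == '\n' || c == '\r'; }
};

// Simplified element-wise read_until without an ignore predicate (assumption).
template <typename input_it_t, typename output_it_t, typename stop_t>
input_it_t read_until_no_ignore(input_it_t first, input_it_t last, output_it_t out, stop_t stop)
{
    for (; first != last && !stop(*first); ++first)
        *out++ = *first;
    return first;
}

// Reads one line into `out` and returns the position after the line ending.
template <typename input_it_t, typename output_it_t>
input_it_t read_line_sketch(input_it_t first, input_it_t last, output_it_t out)
{
    first = read_until_no_ignore(first, last, out, is_newline_sketch{});
    // Assumption: consume "\n", "\r\n" or a lone "\r" so the next read starts on a new line.
    if (first != last && *first == '\r')
        ++first;
    if (first != last && *first == '\n')
        ++first;
    return first;
}
```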

read

template <typename input_range_t, typename output_iterator_t, typename assert_predicate_t>
    requires input_range_concept<input_range_t> &&
             output_iterator_concept<output_iterator_t> && 
             predicate_concept<assert_predicate_t>
inline void 
read(input_range_t &in, output_iterator_t &out, assert_predicate_t &assertFunctor)
{
     // unspecified
}

Reads a single element.

read_raw_pod

read_number

read_n [???]

  • Reads at most n characters.
  • If n is not specified, reads the whole input range.
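
The two points above could be sketched as follows. The iterator-pair interface, the name read_n_sketch, and modelling "n not specified" as a defaulted maximum are assumptions for illustration.

```cpp
#include <cstddef>
#include <iterator>
#include <limits>
#include <string>

// Hypothetical read_n: copies at most n elements from [first, last) to `out`.
// Assumption: an unspecified n is modelled as the largest std::size_t value,
// so the whole input is read.
template <typename input_it_t, typename output_it_t>
input_it_t read_n_sketch(input_it_t first, input_it_t last, output_it_t out,
                         std::size_t n = std::numeric_limits<std::size_t>::max())
{
    for (; n > 0 && first != last; --n, ++first)
        *out++ = *first;
    return first;
}
```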

Functions for writing to output_ranges

write_until

write_line

write

write_raw_pod

write_number

write_wrapped

Writes a wrapped line. Could this instead be modelled as a simple write with an additional functor?
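
One way the "write plus functor" idea might look, as a sketch under assumptions: write_sketch calls a hook before every element, and the hypothetical functor wrap_at uses that hook to emit a line break after a fixed number of columns. Neither name is part of the specification.

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical generic write: calls `before_each(out)` before writing each element,
// so the functor may emit extra elements such as a line break.
template <typename input_it_t, typename output_it_t, typename functor_t>
output_it_t write_sketch(input_it_t first, input_it_t last, output_it_t out,
                         functor_t && before_each)
{
    for (; first != last; ++first)
    {
        before_each(out);
        *out++ = *first;
    }
    return out;
}

// Hypothetical wrapping functor: inserts '\n' once `width` elements are on the current line.
struct wrap_at
{
    std::size_t width;
    std::size_t column = 0;

    template <typename output_it_t>
    void operator()(output_it_t & out)
    {
        if (column == width)
        {
            *out++ = '\n';
            column = 0;
        }
        ++column;
    }
};
```

With this design write_wrapped would need no separate implementation; it would be write_sketch called with wrap_at{width}.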

write_n

Misc

split_by

template <typename input_t, typename delimiter_t, typename config_t = decltype(std::ignore)>
    requires forward_range_concept<input_t> && predicate_concept<delimiter_t>
inline auto
split_by(input_t const & input,
         delimiter_t && delimiter,
         config_t && config)  // optional parameter
{
    /* implementation detail*/
    return // optional<view<view<sequence_type>>>
}

This function operates on a forward_range and returns an optional view of views. The optional is empty if the sequence could not be split, for example because the input is empty. Otherwise it holds a view of views, so no sequence data is copied until the user explicitly assigns the return value to a container type that holds the data. This is also why input_range_concept is not applicable: there is no guarantee that data already seen during tokenization is still present when iteration through the input continues.
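
The no-copy behaviour can be illustrated with std::string_view. This is a sketch only: split_by_sketch is a hypothetical name, a std::vector of std::string_view stands in for the proposed view of views, and the config parameter is left out.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical split_by over contiguous character data.
// Returns std::nullopt for empty input; otherwise the tokens as non-owning views.
// The views refer into `input`, so the underlying data must outlive the result.
template <typename delimiter_t>
std::optional<std::vector<std::string_view>>
split_by_sketch(std::string_view input, delimiter_t && is_delimiter)
{
    if (input.empty())
        return std::nullopt;

    std::vector<std::string_view> tokens;
    std::size_t token_begin = 0;
    for (std::size_t pos = 0; pos < input.size(); ++pos)
    {
        if (is_delimiter(input[pos]))
        {
            tokens.push_back(input.substr(token_begin, pos - token_begin));
            token_begin = pos + 1;
        }
    }
    tokens.push_back(input.substr(token_begin)); // last token, possibly empty
    return tokens;
}
```

Sequence data is copied only when a caller converts a token explicitly, e.g. std::string{token}.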

crop_outer

namespace seqan3::action
{
constexpr ranges::action< crop_outer_fn > crop_outer { /* unspecified */ }
}

namespace seqan3::view
{
constexpr ranges::view< crop_outer_fn > crop_outer { /* unspecified */ }
}

Modeling these kinds of functions as either views or actions would be desirable. How exactly this should be implemented remains to be seen❗️
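
Until the view/action design is settled, the intended effect can be sketched as a plain function. The semantics used here are an assumption, since this page does not define them: crop_outer removes leading and trailing elements that satisfy a predicate. The name crop_outer_sketch is hypothetical.

```cpp
#include <string_view>

// Hypothetical crop_outer as a free function on a non-owning view.
// Assumption: elements satisfying `p` are removed from both ends.
template <typename predicate_t>
std::string_view crop_outer_sketch(std::string_view input, predicate_t && p)
{
    while (!input.empty() && p(input.front()))
        input.remove_prefix(1);
    while (!input.empty() && p(input.back()))
        input.remove_suffix(1);
    return input;
}
```

Because only the bounds of the view change, this behaves like the view variant; an action variant would instead erase the elements from the underlying container.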

crop_before_last

namespace seqan3::action
{
constexpr ranges::action< crop_before_last_fn > crop_before_last { /* unspecified */ }
}

namespace seqan3::view
{
constexpr ranges::view< crop_before_last_fn > crop_before_last { /* unspecified */ }
}

Similar to crop_outer.

crop_before_first

namespace seqan3::action
{
constexpr ranges::action< crop_before_first_fn > crop_before_first { /* unspecified */ }
}

namespace seqan3::view
{
constexpr ranges::view< crop_before_first_fn > crop_before_first { /* unspecified */ }
}

Similar to crop_outer.

crop_after_last

namespace seqan3::action
{
constexpr ranges::action< crop_after_last_fn > crop_after_last { /* unspecified */ }
}

namespace seqan3::view
{
constexpr ranges::view< crop_after_last_fn > crop_after_last { /* unspecified */ }
}

crop_after_first

namespace seqan3::action
{
constexpr ranges::action< crop_after_first_fn > crop_after_first { /* unspecified */ }
}

namespace seqan3::view
{
constexpr ranges::view< crop_after_first_fn > crop_after_first { /* unspecified */ }
}
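
The one-sided variants could be sketched in the same way. The semantics are assumptions, as this page does not define them: crop_before_first drops everything in front of the first element satisfying the predicate, and crop_after_last drops everything behind the last such element. In both cases the matching element itself is kept, and an input without a match is cropped to empty.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical: keep the first element satisfying `p` and everything after it.
template <typename predicate_t>
std::string_view crop_before_first_sketch(std::string_view input, predicate_t && p)
{
    auto it = std::find_if(input.begin(), input.end(), p);
    input.remove_prefix(static_cast<std::size_t>(it - input.begin()));
    return input;
}

// Hypothetical: keep the last element satisfying `p` and everything before it.
template <typename predicate_t>
std::string_view crop_after_last_sketch(std::string_view input, predicate_t && p)
{
    auto rit = std::find_if(input.rbegin(), input.rend(), p);
    input.remove_suffix(static_cast<std::size_t>(rit - input.rbegin()));
    return input;
}
```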

find_last

template <typename input_t, typename predicate_t>
    requires forward_range_concept<input_t> && predicate_concept<predicate_t>
inline auto
find_last(input_t const & input, 
          predicate_t && p)
{
    /* unspecified */
    return iterator_t<input_t>{begin(input)};
}

find_last is just an algorithm that can be optimised for buffered streams, where working chunk-wise might be more efficient. However, it is currently not used anywhere in SeqAn. For standard containers it could simply be replaced with:

view::find_if(view::reverse(buffer), seqan3::equals_char<','>());
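
For comparison, the same idea using only the standard library. This is a sketch: find_last_sketch is a hypothetical name, and returning the end iterator when nothing matches is an assumption.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Hypothetical find_last for bidirectional containers: searches from the back
// and converts the reverse iterator into a forward iterator to the match.
template <typename container_t, typename predicate_t>
auto find_last_sketch(container_t const & input, predicate_t && p)
{
    auto rit = std::find_if(std::rbegin(input), std::rend(input), p);
    // rit.base() points one past the matching element, so step back by one.
    return rit == std::rend(input) ? std::end(input) : std::prev(rit.base());
}
```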

find_first

template <typename input_t, typename predicate_t>
    requires forward_range_concept<input_t> && predicate_concept<predicate_t>
inline auto
find_first(input_t const & input,
           predicate_t && p)
{
    /* unspecified */
    return iterator_t<input_t>{begin(input)};
}

find_first is just an algorithm that can be optimised for buffered streams, where working chunk-wise might be more efficient. However, it is currently used in only one place in SeqAn, which applies it to a simple CharString buffer. For standard containers it could simply be replaced with:

view::find_if(buffer, seqan3::equals_char<','>());
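
A standard-library sketch of the same call. The predicate equals_char_sketch is a hypothetical stand-in for seqan3::equals_char, and find_first_sketch is a hypothetical name.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Hypothetical stand-in for seqan3::equals_char<c>.
template <char c>
struct equals_char_sketch
{
    bool operator()(char value) const { return value == c; }
};

// Hypothetical find_first: a thin wrapper around std::find_if.
// Returns the end iterator if no element satisfies `p`.
template <typename container_t, typename predicate_t>
auto find_first_sketch(container_t const & input, predicate_t && p)
{
    return std::find_if(std::begin(input), std::end(input), p);
}
```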

skip_until

template <typename iterator_t, typename predicate_t>
    requires input_iterator_concept<iterator_t> && predicate_concept<predicate_t>
inline void
skip_until(iterator_t & it, predicate_t && p)
{
    if constexpr (is_chunkable_v<iterator_t>)
    {
        /* unspecified */
    }
    else
    {
        /* unspecified */
    }
}

Modeled on an input iterator. Can be implemented chunk-wise or element-wise.

skip_line

template <typename iterator_t>
    requires input_iterator_concept<iterator_t>
inline void
skip_line(iterator_t & it)
{
    skip_until(it, is_new_line());
    // consume platform dependent line ending
}

Delegates to skip_until and then consumes the line ending. This must be platform dependent, i.e. it must distinguish between \n and \r\n. should throw if
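
A sketch of the line-ending handling. As a deliberate simplification it accepts "\n", "\r\n" and a lone "\r" on every platform rather than selecting one per platform, and it takes an explicit end sentinel; both choices are assumptions, and error handling is left out.

```cpp
#include <iterator>

// Hypothetical skip_line: skips the rest of the current line and then
// consumes the line ending, so `it` ends up on the first element of the next line.
template <typename iterator_t, typename sentinel_t>
void skip_line_sketch(iterator_t & it, sentinel_t const & end)
{
    while (it != end && *it != '\n' && *it != '\r')
        ++it;
    if (it != end && *it == '\r') // Windows-style "\r\n" or a lone "\r"
        ++it;
    if (it != end && *it == '\n')
        ++it;
}
```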

skip

template <typename iterator_t, typename predicate_t>
    requires input_iterator_concept<iterator_t> && predicate_concept<predicate_t>
inline void
skip(iterator_t & it, predicate_t && p)
{
    // requires p(*it) == true;
    ++it;
}

template <typename iterator_t>
    requires input_iterator_concept<iterator_t>
inline void
skip(iterator_t & it)
{
    skip(it, [](auto const &){ return true; });
}

Should throw parse_error if the predicate is not fulfilled.
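
The throwing behaviour could be sketched as follows. The exception type parse_error_sketch is a hypothetical stand-in for parse_error, whose definition is not given on this page, and the caller is assumed to pass a dereferenceable iterator.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the parse_error mentioned above.
struct parse_error_sketch : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Skips exactly one element, but only if it satisfies `p`.
// On failure the iterator is left unchanged and an exception is thrown.
template <typename iterator_t, typename predicate_t>
void skip_sketch(iterator_t & it, predicate_t && p)
{
    if (!p(*it))
        throw parse_error_sketch("skip: element does not satisfy the predicate");
    ++it;
}

// Unconditional variant: delegates with a predicate that accepts everything.
template <typename iterator_t>
void skip_sketch(iterator_t & it)
{
    skip_sketch(it, [](auto const &) { return true; });
}
```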

skip_n

to_formatted_number

TODO
