[REVIEW] string conversion for duration types (to_durations, from_durations) #5625

Merged 63 commits on Aug 7, 2020
fa3a659
add string conversion API from_durations()
karthikeyann Jul 2, 2020
26f9833
rename variables, enums
karthikeyann Jul 2, 2020
e7c9359
support isoformat single digit for H:M:S
karthikeyann Jul 3, 2020
5b91efb
add durations_tests.cpp
karthikeyann Jul 3, 2020
7eb218c
update doc for format specifiers
karthikeyann Jul 6, 2020
6a6f23e
skip trailing zero for %f specifier, %u of length 6.
karthikeyann Jul 6, 2020
c372d1e
update doc duration type, ISO format string
karthikeyann Jul 6, 2020
9c0ec8f
misc style changes
karthikeyann Jul 6, 2020
e2d008a
add dot as part of %u, %f subsecond specifier (dot not present if zero)
karthikeyann Jul 7, 2020
d61c645
style fix clang-format
karthikeyann Jul 7, 2020
10795ab
style fix spacing
karthikeyann Jul 7, 2020
907843f
add nvtx to from_durations, comments update
karthikeyann Jul 8, 2020
fc04aa6
add to_durations
karthikeyann Jul 8, 2020
7cf538e
add to_durations unit tests
karthikeyann Jul 8, 2020
8713042
style fix clang-format
karthikeyann Jul 8, 2020
be7ea2d
change log entry for PR #5625
karthikeyann Jul 8, 2020
d6899fe
Merge branch 'branch-0.15' of github.com:rapidsai/cudf into fea-durat…
karthikeyann Jul 8, 2020
cb5f473
remove units from format_compiler
karthikeyann Jul 8, 2020
99d92ee
stylefix clang-format
karthikeyann Jul 8, 2020
e087ac6
conda recipe header include
karthikeyann Jul 8, 2020
5bfd55e
add duration strings API microseconds unit test
karthikeyann Jul 8, 2020
dc01137
add converters utilities.cuh
karthikeyann Jul 9, 2020
ae02fd8
macros for cudf::test::expect_* column, table for showing line of fai…
karthikeyann Jul 9, 2020
6aa6a4d
add src/strings/convert/utilities.cuh
karthikeyann Jul 9, 2020
508369b
Apply suggestions from code review
karthikeyann Jul 9, 2020
530a723
Merge branch 'branch-0.15' of github.com:rapidsai/cudf into fea-durat…
karthikeyann Jul 9, 2020
c4fd5bd
Apply suggestions from code review (harrism)
karthikeyann Jul 10, 2020
f639cb8
review comments changes (harrism)
karthikeyann Jul 10, 2020
40309f0
use libcu++ std::chrono methods
karthikeyann Jul 13, 2020
3a13550
Merge branch 'branch-0.15' of github.com:rapidsai/cudf into fea-durat…
karthikeyann Jul 13, 2020
b658f7d
Merge branch 'branch-0.15' of github.com:rapidsai/cudf into fea-durat…
karthikeyann Jul 14, 2020
d573172
Apply suggestions from code review (davidwendt)
karthikeyann Jul 15, 2020
5045f65
style fix clang-format
karthikeyann Jul 15, 2020
5c5dad7
replace device_ptr deref with get_value
karthikeyann Jul 15, 2020
b7d7cc2
use device_uvector for duration format_items
karthikeyann Jul 15, 2020
748c95a
change for loop to transform_reduce
karthikeyann Jul 16, 2020
a1ef104
specifiers to std::format supported spec only
karthikeyann Jul 17, 2020
5e76e45
remove modulo_time
karthikeyann Jul 17, 2020
95ff3a6
remove countTrailingZeros
karthikeyann Jul 17, 2020
7cb80a4
change from_durations to std::format compliant
karthikeyann Jul 17, 2020
55afdc9
update unit tests to std::format compliant
karthikeyann Jul 17, 2020
6c5a951
update ISOFormat tests, other tests to be std::format compliant
karthikeyann Jul 17, 2020
bd1d5c8
update parsing D,H,M,S to be std::format compliant
karthikeyann Jul 17, 2020
d9ca314
update default format args
karthikeyann Jul 17, 2020
6a8b2f1
Merge branch 'branch-0.15' of github.com:rapidsai/cudf into fea-durat…
karthikeyann Jul 17, 2020
0e3bd65
add %p%R%T parse support
karthikeyann Jul 17, 2020
a3b2c47
remove runtime type_id info, use templates
karthikeyann Jul 20, 2020
ee5bf3d
move powers_of_ten to __constant__ memory (to avoid local memory access)
karthikeyann Jul 20, 2020
2e42c68
replace timeparts int32_t array with struct
karthikeyann Jul 20, 2020
0def734
stylefix lowercase struct members
karthikeyann Jul 20, 2020
b7dc79d
add parse_2digit_int, parse_hour, minute, second
karthikeyann Jul 20, 2020
b7045f1
add benchmark DurationsToString, StringToDurations
karthikeyann Jul 21, 2020
151fa49
optimization of integer to string function
karthikeyann Jul 22, 2020
05cbb18
Merge branch 'branch-0.15' of github.com:rapidsai/cudf into fea-durat…
karthikeyann Jul 22, 2020
85d0631
add parse tests for HMSpRT
karthikeyann Jul 27, 2020
9f7dcdb
Merge branch 'branch-0.15' of github.com:rapidsai/cudf into fea-durat…
karthikeyann Jul 27, 2020
77b518c
support all escape Characters, add unit tests
karthikeyann Jul 27, 2020
bbbe2ac
Merge branch 'branch-0.15' of github.com:rapidsai/cudf into fea-durat…
karthikeyann Jul 27, 2020
1ebfb0a
review updates (davidwendt)
karthikeyann Jul 27, 2020
f37af51
review updates (karthikeyann)
karthikeyann Jul 28, 2020
0d84102
doc update: locale -> without sign
karthikeyann Jul 28, 2020
78f94d8
doc update escaping % character
karthikeyann Jul 30, 2020
ef93d42
change benchmark name
karthikeyann Jul 30, 2020
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -40,6 +40,7 @@
- PR #5654 Adding support for `cudf.DataFrame.sample` and `cudf.Series.sample`
- PR #5607 Add Java bindings for duration types
- PR #5612 Add `is_hex` strings API
+- PR #5625 String conversion to and from duration types
- PR #5659 Added support for rapids-compose for Java bindings and other enhancements
- PR #5637 Parameterize Null comparator behaviour in Joins
- PR #5623 Add `is_ipv4` strings API
1 change: 1 addition & 0 deletions conda/recipes/libcudf/meta.yaml
@@ -125,6 +125,7 @@ test:
- test -f $PREFIX/include/cudf/strings/contains.hpp
- test -f $PREFIX/include/cudf/strings/convert/convert_booleans.hpp
- test -f $PREFIX/include/cudf/strings/convert/convert_datetime.hpp
+- test -f $PREFIX/include/cudf/strings/convert/convert_durations.hpp
- test -f $PREFIX/include/cudf/strings/convert/convert_floats.hpp
- test -f $PREFIX/include/cudf/strings/convert/convert_integers.hpp
- test -f $PREFIX/include/cudf/strings/convert/convert_ipv4.hpp
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -527,6 +527,7 @@ add_library(cudf
src/strings/contains.cu
src/strings/convert/convert_booleans.cu
src/strings/convert/convert_datetime.cu
+src/strings/convert/convert_durations.cu
src/strings/convert/convert_floats.cu
src/strings/convert/convert_hex.cu
src/strings/convert/convert_integers.cu
8 changes: 8 additions & 0 deletions cpp/benchmarks/CMakeLists.txt
@@ -245,3 +245,11 @@ set(SUBWORD_TOKENIZER_BENCH_SRC
"${CMAKE_CURRENT_SOURCE_DIR}/text/subword_benchmark.cpp")

ConfigureBench(SUBWORD_TOKENIZER_BENCH "${SUBWORD_TOKENIZER_BENCH_SRC}")

+###################################################################################################
+# - convert to string benchmark -------------------------------------------------------------------
+
+set(DURATION_TO_STRING_BENCH_SRC
+  "${CMAKE_CURRENT_SOURCE_DIR}/string/convert_durations_benchmark.cpp")
+
+ConfigureBench(DURATION_TO_STRING_BENCH "${DURATION_TO_STRING_BENCH_SRC}")
112 changes: 112 additions & 0 deletions cpp/benchmarks/string/convert_durations_benchmark.cpp
@@ -0,0 +1,112 @@
/*
* Copyright (c) 2020, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <benchmark/benchmark.h>

#include <cudf/strings/convert/convert_durations.hpp>
#include <cudf/types.hpp>

#include <tests/utilities/base_fixture.hpp>
#include <tests/utilities/column_utilities.hpp>
#include <tests/utilities/column_wrapper.hpp>
#include <tests/utilities/cudf_gtest.hpp>

#include <algorithm>
#include <random>

#include "../fixture/benchmark_fixture.hpp"
#include "../synchronization/synchronization.hpp"
#include "cudf/column/column_view.hpp"
#include "cudf/wrappers/durations.hpp"

class DurationsToString : public cudf::benchmark {
};
template <class TypeParam>
void BM_convert_from_durations(benchmark::State& state)
{
  const cudf::size_type source_size = state.range(0);

  // Every element is valid
  auto data = cudf::test::make_counting_transform_iterator(
    0, [source_size](auto i) { return TypeParam{i - source_size / 2}; });

  cudf::test::fixed_width_column_wrapper<TypeParam> source_durations(data, data + source_size);

  for (auto _ : state) {
    cuda_event_timer raii(state, true);  // flush_l2_cache = true, stream = 0
    cudf::strings::from_durations(source_durations, "%D days %H:%M:%S");
  }

  state.SetBytesProcessed(state.iterations() * source_size * sizeof(TypeParam));
}

class StringToDurations : public cudf::benchmark {
};
template <class TypeParam>
void BM_convert_to_durations(benchmark::State& state)
{
  const cudf::size_type source_size = state.range(0);

  // Every element is valid
  auto data = cudf::test::make_counting_transform_iterator(
    0, [source_size](auto i) { return TypeParam{i - source_size / 2}; });

  cudf::test::fixed_width_column_wrapper<TypeParam> source_durations(data, data + source_size);
  auto results = cudf::strings::from_durations(source_durations, "%D days %H:%M:%S");
  cudf::strings_column_view source_string(*results);
  auto output_type = cudf::data_type(cudf::type_to_id<TypeParam>());

  for (auto _ : state) {
    cuda_event_timer raii(state, true);  // flush_l2_cache = true, stream = 0
    cudf::strings::to_durations(source_string, output_type, "%D days %H:%M:%S");
  }

  state.SetBytesProcessed(state.iterations() * source_size * sizeof(TypeParam));
}

#define DSBM_BENCHMARK_DEFINE(name, type) \
  BENCHMARK_DEFINE_F(DurationsToString, name)(::benchmark::State & state) \
  { \
    BM_convert_from_durations<type>(state); \
  } \
  BENCHMARK_REGISTER_F(DurationsToString, name) \
    ->RangeMultiplier(1 << 5) \
    ->Range(1 << 10, 1 << 25) \
    ->UseManualTime() \
    ->Unit(benchmark::kMicrosecond);

#define SDBM_BENCHMARK_DEFINE(name, type) \
  BENCHMARK_DEFINE_F(StringToDurations, name)(::benchmark::State & state) \
  { \
    BM_convert_to_durations<type>(state); \
  } \
  BENCHMARK_REGISTER_F(StringToDurations, name) \
    ->RangeMultiplier(1 << 5) \
    ->Range(1 << 10, 1 << 25) \
    ->UseManualTime() \
    ->Unit(benchmark::kMicrosecond);

DSBM_BENCHMARK_DEFINE(from_durations_D, cudf::duration_D);
DSBM_BENCHMARK_DEFINE(from_durations_s, cudf::duration_s);
DSBM_BENCHMARK_DEFINE(from_durations_ms, cudf::duration_ms);
DSBM_BENCHMARK_DEFINE(from_durations_us, cudf::duration_us);
DSBM_BENCHMARK_DEFINE(from_durations_ns, cudf::duration_ns);

SDBM_BENCHMARK_DEFINE(to_durations_D, cudf::duration_D);
SDBM_BENCHMARK_DEFINE(to_durations_s, cudf::duration_s);
SDBM_BENCHMARK_DEFINE(to_durations_ms, cudf::duration_ms);
SDBM_BENCHMARK_DEFINE(to_durations_us, cudf::duration_us);
SDBM_BENCHMARK_DEFINE(to_durations_ns, cudf::duration_ns);
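
Note on the benchmark sizes: both registration macros above use `RangeMultiplier(1 << 5)` with `Range(1 << 10, 1 << 25)`. The following is a minimal host-only sketch of how such a range is assumed to expand into row counts, and of the byte count passed to `SetBytesProcessed` per iteration. The names `expand_range` and `bytes_per_iteration` are hypothetical and are not part of Google Benchmark or cudf. The expansion rule (the low bound, each power of the multiplier strictly between the bounds, then the high bound) is an assumption about the benchmark library's behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: expands (lo, hi, mult) into the list of benchmark sizes,
// assuming the rule "lo, powers of mult strictly between lo and hi, then hi".
std::vector<std::int64_t> expand_range(std::int64_t lo, std::int64_t hi, std::int64_t mult)
{
  assert(lo > 0 && hi >= lo && mult > 1);  // preconditions of this sketch
  std::vector<std::int64_t> sizes{lo};
  for (std::int64_t p = 1; p < hi; p *= mult) {
    if (p > lo) { sizes.push_back(p); }
  }
  if (hi != lo) { sizes.push_back(hi); }
  return sizes;
}

// Hypothetical helper: bytes counted for one iteration, mirroring
// `source_size * sizeof(TypeParam)` in the benchmark bodies above.
std::int64_t bytes_per_iteration(std::int64_t rows, std::size_t element_size)
{
  return rows * static_cast<std::int64_t>(element_size);
}
```

Under that assumption, `expand_range(1 << 10, 1 << 25, 1 << 5)` yields 1024, 32768, 1048576 and 33554432 rows. The byte count reflects only the fixed-width duration elements, so the reported throughput does not include the bytes of the strings column in either direction.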
50 changes: 25 additions & 25 deletions cpp/include/cudf/strings/convert/convert_datetime.hpp
@@ -33,18 +33,18 @@ namespace strings {
*
* | Specifier | Description |
* | :-------: | ----------- |
- * | %%d | Day of the month: 01-31 |
- * | %%m | Month of the year: 01-12 |
- * | %%y | Year without century: 00-99 |
- * | %%Y | Year with century: 0001-9999 |
- * | %%H | 24-hour of the day: 00-23 |
- * | %%I | 12-hour of the day: 01-12 |
- * | %%M | Minute of the hour: 00-59|
- * | %%S | Second of the minute: 00-59 |
- * | %%f | 6-digit microsecond: 000000-999999 |
- * | %%z | UTC offset with format ±HHMM Example +0500 |
- * | %%j | Day of the year: 001-366 |
- * | %%p | Only 'AM', 'PM' or 'am', 'pm' are recognized |
+ * | \%d | Day of the month: 01-31 |
+ * | \%m | Month of the year: 01-12 |
+ * | \%y | Year without century: 00-99 |
+ * | \%Y | Year with century: 0001-9999 |
+ * | \%H | 24-hour of the day: 00-23 |
+ * | \%I | 12-hour of the day: 01-12 |
+ * | \%M | Minute of the hour: 00-59|
+ * | \%S | Second of the minute: 00-59 |
+ * | \%f | 6-digit microsecond: 000000-999999 |
+ * | \%z | UTC offset with format ±HHMM Example +0500 |
+ * | \%j | Day of the year: 001-366 |
+ * | \%p | Only 'AM', 'PM' or 'am', 'pm' are recognized |
*
* Other specifiers are not currently supported.
*
@@ -81,19 +81,19 @@ std::unique_ptr<column> to_timestamps(
*
* | Specifier | Description |
* | :-------: | ----------- |
- * | %%d | Day of the month: 01-31 |
- * | %%m | Month of the year: 01-12 |
- * | %%y | Year without century: 00-99 |
- * | %%Y | Year with century: 0001-9999 |
- * | %%H | 24-hour of the day: 00-23 |
- * | %%I | 12-hour of the day: 01-12 |
- * | %%M | Minute of the hour: 00-59|
- * | %%S | Second of the minute: 00-59 |
- * | %%f | 6-digit microsecond: 000000-999999 |
- * | %%z | Always outputs "+0000" |
- * | %%Z | Always outputs "UTC" |
- * | %%j | Day of the year: 001-366 |
- * | %%p | Only 'AM' or 'PM' |
+ * | \%d | Day of the month: 01-31 |
+ * | \%m | Month of the year: 01-12 |
+ * | \%y | Year without century: 00-99 |
+ * | \%Y | Year with century: 0001-9999 |
+ * | \%H | 24-hour of the day: 00-23 |
+ * | \%I | 12-hour of the day: 01-12 |
+ * | \%M | Minute of the hour: 00-59|
+ * | \%S | Second of the minute: 00-59 |
+ * | \%f | 6-digit microsecond: 000000-999999 |
+ * | \%z | Always outputs "+0000" |
+ * | \%Z | Always outputs "UTC" |
+ * | \%j | Day of the year: 001-366 |
+ * | \%p | Only 'AM' or 'PM' |
*
* No checking is done for invalid formats or invalid timestamp values.
* All timestamps values are formatted to UTC.
128 changes: 128 additions & 0 deletions cpp/include/cudf/strings/convert/convert_durations.hpp
@@ -0,0 +1,128 @@
/*
* Copyright (c) 2020, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/column/column.hpp>
#include <cudf/strings/strings_column_view.hpp>

namespace cudf {
namespace strings {
/**
* @addtogroup strings_convert
* @{
*/

/**
* @brief Returns a new duration column converting a strings column into
* durations using the provided format pattern.
*
* The format pattern can include the following specifiers:
* "%%,%n,%t,%D,%H,%I,%M,%S,%p,%R,%T,%r,%OH,%OI,%OM,%OS"
*
* | Specifier | Description | Range |
* | :-------: | ----------- | ---------------- |
* | %% | A literal % character | % |
* | \%n | A newline character | \\n |
* | \%t | A horizontal tab character | \\t |
* | \%D | Days | -2,147,483,648 to 2,147,483,647 |
* | \%H | 24-hour of the day | 00 to 23 |
* | \%I | 12-hour of the day | 00 to 11 |
* | \%M | Minute of the hour | 00 to 59 |
* | \%S | Second of the minute | 00 to 59.999999999 |
* | \%OH | same as %H but without sign | 00 to 23 |
* | \%OI | same as %I but without sign | 00 to 11 |
* | \%OM | same as %M but without sign | 00 to 59 |
* | \%OS | same as %S but without sign | 00 to 59 |
* | \%p | AM/PM designations associated with a 12-hour clock | 'AM' or 'PM' |
* | \%R | Equivalent to "%H:%M" | |
* | \%T | Equivalent to "%H:%M:%S" | |
* | \%r | Equivalent to "%OI:%OM:%OS %p" | |
*
* Other specifiers are not currently supported.
*
* Invalid formats are not checked. If the string contains unexpected
* or insufficient characters, that output row entry's duration value is undefined.
*
* Any null string entry will result in a corresponding null row in the output column.
*
* The resulting time units are specified by the `duration_type` parameter.
*
* @throw cudf::logic_error if duration_type is not a duration type.
*
* @param strings Strings instance for this operation.
* @param duration_type The duration type used for creating the output column.
* @param format String specifying the duration format in strings.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New duration column.
*/
std::unique_ptr<column> to_durations(
  strings_column_view const& strings,
  data_type duration_type,
  std::string const& format,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

/**
* @brief Returns a new strings column converting a duration column into
* strings using the provided format pattern.
*
* The format pattern can include the following specifiers:
* "%%,%n,%t,%D,%H,%I,%M,%S,%p,%R,%T,%r,%OH,%OI,%OM,%OS"
*
* | Specifier | Description | Range |
* | :-------: | ----------- | ---------------- |
* | %% | A literal % character | % |
* | \%n | A newline character | \\n |
* | \%t | A horizontal tab character | \\t |
* | \%D | Days | -2,147,483,648 to 2,147,483,647 |
* | \%H | 24-hour of the day | 00 to 23 |
* | \%I | 12-hour of the day | 00 to 11 |
* | \%M | Minute of the hour | 00 to 59 |
* | \%S | Second of the minute | 00 to 59.999999999 |
* | \%OH | same as %H but without sign | 00 to 23 |
* | \%OI | same as %I but without sign | 00 to 11 |
* | \%OM | same as %M but without sign | 00 to 59 |
* | \%OS | same as %S but without sign | 00 to 59 |
* | \%p | AM/PM designations associated with a 12-hour clock | 'AM' or 'PM' |
* | \%R | Equivalent to "%H:%M" | |
* | \%T | Equivalent to "%H:%M:%S" | |
* | \%r | Equivalent to "%OI:%OM:%OS %p" | |
*
* No checking is done for invalid formats or invalid duration values. Formatting sticks to
* specifications of `std::formatter<std::chrono::duration>` as much as possible.
*
* Any null input entry will result in a corresponding null entry in the output column.
*
* The time units of the input column influence the number of digits in decimal of seconds.
* It uses 3 digits for milliseconds, 6 digits for microseconds and 9 digits for nanoseconds.
* If duration value is negative, only one negative sign is written to output string. The specifiers
* with signs are "%H,%I,%M,%S,%R,%T".
*
* @throw cudf::logic_error if `durations` column parameter is not a duration type.
*
* @param durations Duration values to convert.
* @param format The string specifying output format.
* Default format is "%D days %H:%M:%S".
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New strings column with formatted durations.
*/
std::unique_ptr<column> from_durations(
  column_view const& durations,
  std::string const& format = "%D days %H:%M:%S",
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

/** @} */ // end of doxygen group
} // namespace strings
} // namespace cudf
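
Note on the default format: the doc comment for `from_durations` states that the input unit determines the number of fractional-second digits (3 for milliseconds, 6 for microseconds, 9 for nanoseconds). The following is a minimal host-side sketch of that rule for the default pattern `"%D days %H:%M:%S"`, written with plain `std::chrono` rather than the cudf device code. The names `subsecond_digits` and `format_days_hms` are hypothetical. The sketch assumes the fraction is always written at full width, as `std::format` does, and it covers non-negative durations only, since the exact placement of the sign for negative values is not reproduced here.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <ratio>
#include <string>

// Hypothetical helper: fractional-second digits implied by the duration's unit
// (3 for milliseconds, 6 for microseconds, 9 for nanoseconds, otherwise 0).
template <typename Duration>
constexpr int subsecond_digits()
{
  using period = typename Duration::period;
  return std::ratio_equal<period, std::milli>::value   ? 3
         : std::ratio_equal<period, std::micro>::value ? 6
         : std::ratio_equal<period, std::nano>::value  ? 9
                                                       : 0;
}

// Hypothetical host-side formatter for the pattern "%D days %H:%M:%S".
template <typename Duration>
std::string format_days_hms(Duration d)
{
  assert(d.count() >= 0);  // this sketch covers non-negative durations only

  using days_t = std::chrono::duration<std::int64_t, std::ratio<86400>>;

  // Peel off each component; the remainder keeps the finer of the two units.
  auto const dd  = std::chrono::duration_cast<days_t>(d);
  auto const hh  = std::chrono::duration_cast<std::chrono::hours>(d - dd);
  auto const mm  = std::chrono::duration_cast<std::chrono::minutes>(d - dd - hh);
  auto const ss  = std::chrono::duration_cast<std::chrono::seconds>(d - dd - hh - mm);
  auto const sub = d - dd - hh - mm - ss;  // sub-second remainder

  char buf[96];
  std::snprintf(buf, sizeof(buf), "%lld days %02lld:%02lld:%02lld",
                static_cast<long long>(dd.count()),
                static_cast<long long>(hh.count()),
                static_cast<long long>(mm.count()),
                static_cast<long long>(ss.count()));
  std::string out(buf);

  // Append the zero-padded fraction only for sub-second units.
  constexpr int digits = subsecond_digits<Duration>();
  if (digits > 0) {
    std::snprintf(buf, sizeof(buf), ".%0*lld", digits, static_cast<long long>(sub.count()));
    out += buf;
  }
  return out;
}
```

With this sketch, `format_days_hms(std::chrono::seconds{93784})` gives `1 days 02:03:04`, and `format_days_hms(std::chrono::milliseconds{1500})` gives `0 days 00:00:01.500`.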