Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a unique function returning only the unique values in a vector. #940 #965

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

demoncoder-crypto
Copy link

Please let me know if this is the right approach

@perazz
Copy link
Member

perazz commented Mar 26, 2025

Linked to #940

return
endif

! Create a temporary copy that may be sorted

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a matter of style here, but you could be more concise with:

 temp_array = array ; if (want_sorted) call sort(temp_array)

! Start with first element always marked as unique
mask(1) = .true.

! Compare each element with previous to mark duplicates
Copy link

@loiseaujc loiseaujc Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The left-hand side assignment can done be in any order. You could use

do concurrent (i=2:n)
    mask(i) = temp_array(i) /= temp_array(i-1)
enddo

I believe. If multithreading is being used, my understanding is that it'll enable the compiler to better optimize and speed-up the calculation (notably for very large arrays).

Copy link

@loiseaujc loiseaujc Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility using array syntax is simply mask(:n-1) = temp_array(:n-1) /= temp_array(2:). Not sure what would be the fastest here.

allocate(unique_values(unique_count))

! Extract unique elements to result array
j = 0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your do loop can be replaced by a one-liner: unique_values = pack(temp_array, mask).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the lines 56-67 could be replaced as follows:

        allocate(unique_values, source = pack(temp_array, mask)

or

        unique_values = pack(temp_array, mask)

want_sorted = optval(sorted, .false.)

n = size(array)
if (n == 0) then

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also treat the case n == 1 here. If n == 1, there is only a single value to return.

@jalvesz
Copy link
Contributor

jalvesz commented Mar 28, 2025

Thanks @demoncoder-crypto for this PR. Before reviewing it, could you please add a short summary of the key points regarding the What(s), Why(s) and/or How(s) of the PR in your first comment? It is very helpful for the reviewing process to have summary of what is being proposed. You can edit your first comment following the three dots to the right.

! Get sorted unique values from a real array
real :: a(8) = [3.1, 2.5, 7.2, 3.1, 2.5, 8.0, 7.2, 9.5]
real, allocatable :: b(:)
b = unique(a, sorted=.true.) ! b will be [2.5, 3.1, 7.2, 8.0, 9.5]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, having the syntax

b = unique(a, sorted=.true.)

would imply that on entry a is already sorted such that some internal logic can be skipped. Clearly, a is not sorted in this example and so I find the syntax a bit counter-intuitive. Maybe you need change this internally.

@demoncoder-crypto
Copy link
Author

@jalvesz for sure going ahead i will put summary of everything to make sure to reduce the load for reviews

@jalvesz
Copy link
Contributor

jalvesz commented Mar 29, 2025

I was looking at the implementation, please consider the following review, I'll state the key points after the code

#:include "common.fypp"

#:set INT_TYPES_ALT_NAME = list(zip(INT_KINDS, INT_TYPES, INT_KINDS))
#:set REAL_TYPES_ALT_NAME = list(zip(REAL_KINDS, REAL_TYPES, REAL_SUFFIX))
#:set COMPLEX_TYPES_ALT_NAME = list(zip(CMPLX_KINDS, CMPLX_TYPES, CMPLX_SUFFIX))
#:set STRING_TYPES_ALT_NAME = list(zip(STRING_TYPES, STRING_TYPES, STRING_KINDS))
#:set CHAR_TYPES_ALT_NAME = list(zip(["character(len=*)"], ["character(len=len(array))"], ["char"]))

#:set IRSC_KINDS_TYPES = INT_TYPES_ALT_NAME + REAL_TYPES_ALT_NAME + COMPLEX_TYPES_ALT_NAME + STRING_TYPES_ALT_NAME + CHAR_TYPES_ALT_NAME

submodule (stdlib_sorting_unique) stdlib_sorting_unique_impl
    use stdlib_kinds
    use stdlib_sorting, only: sort
    implicit none

    integer, parameter :: ilp = int64

contains

#:for k, t, s in IRSC_KINDS_TYPES
    pure module function unique_1d_${s}$(array,sorted) result(unique_values)
        ${t}$, intent(in) :: array(:)
        logical(lk), intent(in), optional :: sorted
        ${t}$, allocatable :: unique_values(:)

        ${t}$, allocatable :: temp(:)
        integer(ilp) :: i, j, n

        n = size(array,kind=ilp)
        if (n == 0) return

        ! Create a temporary copy for sorting purposes
        allocate(temp(n), source = array)
        if(present(sorted))then
            if(.not.sorted) call sort(temp)
        else

        ! Remove duplicates
        j = 0
        do i = 2, n
            if( temp(i) == temp(i-1) ) then
                j = j + 1
            else
                temp(i-j) = temp(i)
            end if
        end do
        n = n - j
        ! Transfer unique values to output array
        allocate(unique_values(n), source = temp(1:n))
    end function unique_1d_${s}$

#:endfor

end submodule stdlib_sorting_unique_impl
  1. Prefere the following naming convention for internal implementations: <name>_<rank>d_<kind_suffix>
  2. The sorted input variable actually should be meant as an attribute for the input array, not the ouput: In order to create a unique array the input array should be sorted, the precedure can give the option to the user to say "is the input sorted? I won't sort it then to save on computational time", the default behaviour: assume it is not sorted.
  3. You should not allocate an array to size 0, this is an error.
  4. In the implementation proposal here, you don't need to allocate a mask array. Duplicates are thrown out in the last loop.

Other point, in the documentation, please put all your "runnable" examples in the example folder and simply import them in the .md file for the documentation.

Once you have integrated these points, let's discuss how to takle rank>1 arrays.

@demoncoder-crypto
Copy link
Author

Yes I am currently working on this issue. I will update you shortly on this. I am understanding every comment made here so it might take some time but I am really grateful for all the responsiveness and guidance. Thanks means a lot

@perazz
Copy link
Member

perazz commented Mar 29, 2025

3. You should not allocate an array to size 0, this is an error.

I agree on the excellent reviews, please keep going! The only objection is on this point: function results with the allocatable attribute MUST always return an allocated value, per Fortran standard 15.6.2.2 "If the function result is not a pointer, its value shall be defined by the function.". It is necessary, (though not sufficient), for an allocatable function result (any object) to be allocated to be defined. see also where I also learned this the hard way.

So when empty, we should return an empy array i.e. allocate(array(0))

Copy link
Member

@jvdp1 jvdp1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @demoncoder-crypto for this PR. It will be quite useful to many users IMO.
Here are some suggestions regarding the API, code and its integration in the stdlib structure

@@ -0,0 +1,176 @@
---
title: unique function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If unique is included in stdlib_sorting, the specs of unique should be added in stdlib_sorting.


The `unique` function is currently in **experimental** status.

## Version History
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same format as the other stdlib specs should be used.

Comment on lines +19 to +21
|Version|Change|
|---|---|
|v0.1.0|Initial functionality in experimental status|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|Version|Change|
|---|---|
|v0.1.0|Initial functionality in experimental status|
Experimental


## Requirements

This function has been designed to handle arrays of different types, including intrinsic numeric types, character arrays, and `string_type` arrays. The function should be efficient while maintaining an easy-to-use interface.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not requirements. The content of these sentences will be included in the description of the API of unique


### `unique` - Returns unique values from an array

#### Interface
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#### Interface
#### Syntax

Comment on lines +172 to +176
## Related Functions

* `sort` - Sorts an array in ascending or descending order
* `sort_index` - Creates index array that would sort an array
* `ord_sort` - Performs a stable sort on an array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## Related Functions
* `sort` - Sorts an array in ascending or descending order
* `sort_index` - Creates index array that would sort an array
* `ord_sort` - Performs a stable sort on an array

@@ -0,0 +1,64 @@
program example_unique
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you split this program in smaller programs and include them in the specs, please?

!! TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
!! SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
module stdlib_sorting_unique
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be a submodule of stdlib_sort more appropriate, as it is suggested to access it through stdlib_sorting in the examples?

Comment on lines 57 to 58
unique_count = count(mask)
allocate(unique_values(unique_count))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines could be combined into a single one.

allocate(unique_values(unique_count))

! Extract unique elements to result array
j = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the lines 56-67 could be replaced as follows:

        allocate(unique_values, source = pack(temp_array, mask)

or

        unique_values = pack(temp_array, mask)

@jalvesz
Copy link
Contributor

jalvesz commented Mar 29, 2025

@jvdp1 maybe you could check out the modified version proposed here #965 (comment) regarding the internal implementation.

@jvdp1
Copy link
Member

jvdp1 commented Mar 29, 2025

@jvdp1 maybe you could check out the modified version proposed here #965 (comment) regarding the internal implementation.

Sorry, I started my review early today and missed your comment. I am puzzled with the aim of sorted (is it for the input (as in the code) or for the output (as in the specs)?). Otherwise, your code LGTM. It probably avoids a temporary array generated by pack. It could include the case n == 1 in addition to n==0.
Also, allocate(temp(n), source = array) could be written as allocate(temp, source = array).

@jalvesz
Copy link
Contributor

jalvesz commented Mar 29, 2025

I am puzzled with the aim of sorted (is it for the input (as in the code) or for the output

I think it only makes sense for the input as in "is the input sorted or not?" Because in any case the output will be sorted. This flag allows controlling whether to apply the sorting or not to the temporary array which only makes sense to skip if one knows the input is already ordered but that contains duplicates.

@demoncoder-crypto
Copy link
Author

After reading all the helpful comments. I have decided to

1)- Clarify the Semantics of the sorted Flag- I will change the interpretation so that the optional sorted flag indicates whether the input is already sorted. If it’s .true., the function will skip sorting the temporary array, thereby saving computational time. Otherwise, it will sort the array before removing duplicates. This way, the output will consistently be in sorted order when duplicates are removed, but the flag serves as an efficiency hint.

2)- Edge Case Handling- Empty Array: When the input array is empty, I will ensure that the function returns an allocated array of size zero, in line with Fortran standard requirements for allocatable function results. Singleton Array: I will explicitly check if the size of the input is one and, if so, return the single element immediately without unnecessary processing.

I will first focus on these two issues and then proceed to make iterative changes over the next few things. I have decided to keep the preview of next commit manageable. I sincerely thanks so much of input I will address them in coming commits when I resolve the errors mentioned. Thank you all for such support

@demoncoder-crypto
Copy link
Author

I have tried to implement changes above, when merging the branch I don't know something weird happened when I merged with head and I tried to fix it hopefully I did Let me know if the points I mentioned above were successfully integrated

@perazz
Copy link
Member

perazz commented Mar 31, 2025

The build fails due to a circular dependency. As @jvdp1 suggested:

Would this be a submodule of stdlib_sort more appropriate, as it is suggested to access it through stdlib_sorting in the examples?

there should be a submodule(stdlib_sorting) stdlib_sorting_unique that only contains the implementation. To achieve that, you could remove the _impl submodule, put its contents into stdlib_sorting_unique, then put the interface directly in module stdlib_sorting

@demoncoder-crypto
Copy link
Author

Yes I will fix it shortly. Thanks for guiding

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants