Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE-REQUEST] Opposite String Startswith Search in VAEX Dataframe #2098

Open
khanfarhan10 opened this issue Jun 24, 2022 · 23 comments
Open

Comments

@khanfarhan10
Copy link

khanfarhan10 commented Jun 24, 2022

Description
Say we have a string to match in a database,

We can accomplish that by a simple select and evaluate in VAEX:

df_vaex.select(df_vaex["name"].str.startswith(search_string))

However, this searches for search_string in Database Entries rather than Database Entries in search_string.

Can this be performed using Vaex?

Is your feature request related to a problem? Please describe.

search_string1 = "ASTHA MAT" 
search_string2 = "ASTHA MATERIALS INDIA" 

df_vaex.select(df_vaex["name"].str.startswith(search_string1))
df_vaex.evaluate(df_vaex["name"], selection=True)
# ASTHA MATERIALS

df_vaex.select(search_string2.startswith(df_vaex["name"]))
# TypeError: startswith first arg must be str or a tuple of str, not Expression

Additional context
Would be great to have a reverse search technology in a vectorized fashion for quick searching as in pandas.Series.isin!

@khanfarhan10
Copy link
Author

@JovanVeljanoski any idea on existing approaches to solve this problem?

@JovanVeljanoski
Copy link
Member

Vaex does have isin ... df.x.isin(my_list) for example.

@khanfarhan10
Copy link
Author

Hmmmph, it's strange I didn't notice that. However, can't find any documentation for it nor can find it by searching on the docs.

Nevertheless, Thanks for the quick response!

@JovanVeljanoski
Copy link
Member

@khanfarhan10

Here it is:
https://vaex.io/docs/api.html#vaex.expression.Expression.isin

@khanfarhan10
Copy link
Author

khanfarhan10 commented Jun 24, 2022

Hey, reopening this as the isin string operation does not achieve the desired results!

I need a similar function like startswith but that operates more like the startswith operator in a swapped way.

Lets say e is an element belonging to the dataframe's column,

What I wish to perform is :

search_string.startswith(e) rather than performing e.startswith(search_string)

for each element e in the df column

@khanfarhan10
Copy link
Author

khanfarhan10 commented Jun 24, 2022

For a minimalistic reproducible example :

dict_data = dict(name=["ASTHA MATERIALS" , "LOREM IPSUM" ], locationID=[5454,6767]) # with other cols as well

dict_data
# {'name': ['ASTHA MATERIALS', 'LOREM IPSUM'], 'locationID': [5454, 6767]}
df_vaex = vaex.from_dict(dict_data)
df_vaex
# name locationID
0 ASTHA MATERIALS 5454
1 LOREM IPSUM 6767
search_string = "ASTHA MAT"
df_vaex.select(df_vaex["name"].str.startswith(search_string))
df_vaex.evaluate(df_vaex["name"], selection=True)
# <pyarrow.lib.StringArray object at 0x7f3f98906910>
# [
#  "ASTHA MATERIALS"
# ]
search_string = "ASTHA MATERIALS INDIA"
df_vaex.select(df_vaex["name"].str.startswith(search_string))
df_vaex.evaluate(df_vaex["name"], selection=True)
# <pyarrow.lib.StringArray object at 0x7f3f99d75520>
# []

This is the search I need.

@khanfarhan10
Copy link
Author

Stuff I tried (and failed) :

df_vaex["name"].isin([search_string])
df_vaex["name"].str.find(search_string)>-1
df_vaex["name"].str.contains(search_string)

@JovanVeljanoski
Copy link
Member

How would you do it with pandas?

@khanfarhan10
Copy link
Author

I can't find a workaround in pandas too.

Apparently, is this one of those sad cases that cannot be done without a extremely slow apply/lambda function?

@maartenbreddels
Copy link
Member

looks that way.. although our apply should be parallel (multiprocessing), but it's a good point, I'll see if we can have the reverse without too many changes.

@khanfarhan10
Copy link
Author

khanfarhan10 commented Jun 24, 2022

search_string = "ASTHA MATERIALS INDIA"
# Try lambda and apply stuff
df_vaex.select(df_vaex.apply(lambda element: search_string.startswith(element), [df_vaex["name"]]))
df_vaex.evaluate(df_vaex["name"], selection=True)
# <pyarrow.lib.StringArray object at 0x7f3f99d33c20>
# [
#   "ASTHA MATERIALS"
# ]

Working with this for now!

@khanfarhan10
Copy link
Author

Also notable to mention here that regexes created dynamically might be of some (but not great) help!

search_string = r"ASTHA MATERIALS\s?(INDIA)?"
df_vaex.select(df_vaex["name"].str.contains(search_string))
df_vaex.evaluate(df_vaex["name"], selection=True)
# <pyarrow.lib.StringArray object at 0x7f3f99d33f30>
# [
#   "ASTHA MATERIALS"
# ]

@khanfarhan10 khanfarhan10 changed the title [FEATURE-REQUEST] String Startswith Search in VAEX Dataframe [FEATURE-REQUEST] Opposite String Startswith Search in VAEX Dataframe Jun 24, 2022
@Ben-Epstein
Copy link
Contributor

@khanfarhan10
I would think regex is your best bet. If you want something a bit more specific you could do a registered function

import vaex
import pyarrow as pa

dict_data = dict(name=["ASTHA MATERIALS" , "LOREM IPSUM" ], locationID=[5454,6767]) # with other cols as well

df = vaex.from_dict(dict_data)
search_string = "ASTHA MATERIALS INDIA"



@vaex.register_function()
def str_contains_col(col_vals, str_search):
    return pa.array([str_search.startswith(v) for v in col_vals.to_pylist()])

df.func.str_contains_col(df["name"], search_string)

@khanfarhan10
Copy link
Author

That works like a rocket @Ben-Epstein !

Feel free to close/add documentation/samples. @maartenbreddels @JovanVeljanoski @djsutherland

@khanfarhan10
Copy link
Author

khanfarhan10 commented Sep 17, 2022

Just for comment this is still 10X slower than the same pandas equivalent, but yeah, does the job.

@vaex.register_function()
def str_contains_col(col_vals, str_search):
    # return pa.array([str_search.find(v) != -1 for v in col_vals.to_pylist()])
    return pa.array([v in str_search for v in col_vals.to_pylist()])

@JovanVeljanoski
Copy link
Member

Why don't you use pandas @khanfarhan10 ?

@khanfarhan10
Copy link
Author

khanfarhan10 commented Sep 17, 2022

Yeah, can use pandas but had migrated to using VAEX as I was consistenly using it for projects. Apparently I believe there is no way to speed this up using Vaex.

@JovanVeljanoski
Copy link
Member

Is it possible to provide some (fake) data to replicate this on our end. This might never be super fast, but I don't see a reason why it should be much slower than pandas. Worst case should be just as fast ( or just as slow).

I could be missing something tho.

@ukanchan
Copy link

Is it possible to use case in contains function in Vaex?

e.g. df.x.str.contains("Abc", case=False) // df is vaex dataframe

I want to get all records which has substring in particular column but without restricted to case sensitive .

@maartenbreddels
Copy link
Member

Could you open a new issue for that, I don't think it's related to this issue right?

@maartenbreddels
Copy link
Member

In case you don't, and I forget, this seemed to work:
image
(?i) makes it case insensitive

@ukanchan
Copy link

Thank you !!

@ukanchan
Copy link

ukanchan commented Jul 4, 2023

How we can use ignore case in startswith and endswith in Vaex? I tried with regular expression like: startswith("(?i)"+val, regex = True). But it didn't work.

Hoping to hear back from the team. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants