[FEATURE-REQUEST] Opposite String Startswith Search in VAEX Dataframe #2098
Comments
@JovanVeljanoski any idea on existing approaches to solve this problem? |
Vaex does have |
Hmmmph, it's strange I didn't notice that. However, I can't find any documentation for it, nor can I find it by searching the docs. Nevertheless, thanks for the quick response! |
Hey, reopening this because I need a function similar to startswith, but with the operands swapped. Let's say e is an element of the dataframe's column. What I want to perform is search_string.startswith(e) rather than e.startswith(search_string) for each element e in the df column. |
For a minimal reproducible example:
import vaex
dict_data = dict(name=["ASTHA MATERIALS", "LOREM IPSUM"], locationID=[5454, 6767])  # with other cols as well
dict_data
# {'name': ['ASTHA MATERIALS', 'LOREM IPSUM'], 'locationID': [5454, 6767]}
df_vaex = vaex.from_dict(dict_data)
df_vaex
search_string = "ASTHA MAT"
df_vaex.select(df_vaex["name"].str.startswith(search_string))
df_vaex.evaluate(df_vaex["name"], selection=True)
# <pyarrow.lib.StringArray object at 0x7f3f98906910>
# [
#   "ASTHA MATERIALS"
# ]
search_string = "ASTHA MATERIALS INDIA"
df_vaex.select(df_vaex["name"].str.startswith(search_string))
df_vaex.evaluate(df_vaex["name"], selection=True)
# <pyarrow.lib.StringArray object at 0x7f3f99d75520>
# []
This second search is the one I need: it should match "ASTHA MATERIALS", because search_string starts with that entry. |
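To make the two directions concrete, here is a minimal plain-Python sketch (no Vaex involved, and the list `names` simply stands in for the dataframe column). It only illustrates the semantics being asked for; it is not a Vaex API.

```python
# Plain-Python illustration of the two search directions (no Vaex required).
names = ["ASTHA MATERIALS", "LOREM IPSUM"]
search_string = "ASTHA MATERIALS INDIA"

# Forward search: which entries start with search_string?
forward = [e for e in names if e.startswith(search_string)]

# Reversed search: which entries are a prefix of search_string?
reverse = [e for e in names if search_string.startswith(e)]

print(forward)  # []
print(reverse)  # ['ASTHA MATERIALS']
```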
Stuff I tried (and failed):
df_vaex["name"].isin([search_string])
df_vaex["name"].str.find(search_string) > -1
df_vaex["name"].str.contains(search_string) |
How would you do it with pandas? |
I can't find a workaround in pandas either. Is this one of those sad cases that cannot be done without an extremely slow apply/lambda function? |
Looks that way, although our apply should be parallel (multiprocessing). It's a good point, though; I'll see if we can have the reverse without too many changes. |
search_string = "ASTHA MATERIALS INDIA"
# Try lambda and apply stuff
df_vaex.select(df_vaex.apply(lambda element: search_string.startswith(element), [df_vaex["name"]]))
df_vaex.evaluate(df_vaex["name"], selection=True)
# <pyarrow.lib.StringArray object at 0x7f3f99d33c20>
# [
# "ASTHA MATERIALS"
# ]
Working with this for now! |
Also worth mentioning here that dynamically created regexes might be of some (but not great) help:
search_string = r"ASTHA MATERIALS\s?(INDIA)?"
df_vaex.select(df_vaex["name"].str.contains(search_string))
df_vaex.evaluate(df_vaex["name"], selection=True)
# <pyarrow.lib.StringArray object at 0x7f3f99d33f30>
# [
# "ASTHA MATERIALS"
# ] |
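One way to create such a regex dynamically, rather than writing it by hand, is to nest an optional group per character so that the pattern matches exactly the non-empty prefixes of the search string. The sketch below uses only Python's `re` module; the helper name `prefix_pattern` is hypothetical, and whether a given dataframe regex engine accepts the same pattern and anchoring is an assumption that would need checking. It is meant for short search strings, since the nesting depth grows with the string length.

```python
import re


def prefix_pattern(search_string: str) -> str:
    """Build a regex that matches exactly the non-empty prefixes of search_string."""
    if not search_string:
        raise ValueError("search_string must be non-empty")
    tail = ""
    # Wrap the characters from last to first in nested optional groups,
    # e.g. "abc" -> "a(?:b(?:c)?)?".
    for ch in reversed(search_string[1:]):
        tail = f"(?:{re.escape(ch)}{tail})?"
    return re.escape(search_string[0]) + tail


pattern = re.compile(prefix_pattern("ASTHA MATERIALS INDIA"))
names = ["ASTHA MATERIALS", "LOREM IPSUM", "ASTHA MATERIALS INDIA LTD"]

# fullmatch anchors the pattern at both ends, so longer strings are rejected.
matches = [n for n in names if pattern.fullmatch(n)]
print(matches)  # ['ASTHA MATERIALS']
```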
@khanfarhan10
|
That works like a rocket @Ben-Epstein ! Feel free to close/add documentation/samples. @maartenbreddels @JovanVeljanoski @djsutherland |
Just a comment: this is still 10X slower than the pandas equivalent, but yeah, it does the job.
import pyarrow as pa
import vaex
@vaex.register_function()
def str_contains_col(col_vals, str_search):
    # Equivalent alternative: pa.array([str_search.find(v) != -1 for v in col_vals.to_pylist()])
    return pa.array([v in str_search for v in col_vals.to_pylist()]) |
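Note that `v in str_search` tests substring containment, which is broader than the prefix test requested earlier in the thread, and that a missing value (`None`) in the column would raise a `TypeError` in that comprehension. The following plain-Python sketch shows both per-element checks with `None` mapped to `False`; the helper names are hypothetical, and either list comprehension could presumably serve as the body of a registered function like the one above.

```python
from typing import List, Optional


def reverse_contains(values: List[Optional[str]], search_string: str) -> List[bool]:
    """For each value, report whether it occurs anywhere in search_string."""
    return [v is not None and v in search_string for v in values]


def reverse_startswith(values: List[Optional[str]], search_string: str) -> List[bool]:
    """For each value, report whether search_string starts with it."""
    return [v is not None and search_string.startswith(v) for v in values]


values = ["ASTHA MATERIALS", "MATERIALS", None, "LOREM IPSUM"]
contains_mask = reverse_contains(values, "ASTHA MATERIALS INDIA")
prefix_mask = reverse_startswith(values, "ASTHA MATERIALS INDIA")

print(contains_mask)  # [True, True, False, False]
print(prefix_mask)    # [True, False, False, False]
```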
Why don't you use pandas @khanfarhan10 ? |
Yeah, I could use pandas, but I had migrated to Vaex as I was consistently using it for projects. Apparently there is no way to speed this up using Vaex. |
Is it possible to provide some (fake) data so we can replicate this on our end? This might never be super fast, but I don't see a reason why it should be much slower than pandas. Worst case, it should be just as fast (or just as slow). I could be missing something, though. |
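As a starting point for such a reproduction, the sketch below generates seeded fake names and times the plain-Python reversed prefix search as a baseline. Everything here is an assumption for illustration: the helper `make_fake_names`, the row count, and the name lengths are made up and do not reflect the original poster's data.

```python
import random
import string
import time


def make_fake_names(n: int, seed: int = 0) -> list:
    """Generate n pseudo-random upper-case names; the seed makes runs repeatable."""
    rng = random.Random(seed)
    alphabet = string.ascii_uppercase + " "
    return ["".join(rng.choices(alphabet, k=rng.randint(3, 20))) for _ in range(n)]


names = make_fake_names(100_000)
# Guarantee at least one hit: the search string extends the first fake name.
search_string = names[0] + " INDIA"

start = time.perf_counter()
hits = [n for n in names if search_string.startswith(n)]
elapsed = time.perf_counter() - start

print(f"{len(names)} rows, {len(hits)} hit(s), {elapsed:.4f}s")
```

The same `names` list could then be loaded into each dataframe library to compare timings on identical input.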
Is it possible to ignore case in the contains function in Vaex? E.g. df.x.str.contains("Abc", case=False), where df is a Vaex dataframe. I want to get all records that have the substring in a particular column, without the match being case-sensitive. |
Could you open a new issue for that? I don't think it's related to this issue, right? |
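A common workaround when a `case=False` option is not available is to normalize the case of both sides before comparing. The plain-Python sketch below shows the idea with `str.casefold`; the helper name is hypothetical, and applying the analogous step to a dataframe column (for example lowercasing the column before calling a contains method) is an assumption not confirmed in this thread.

```python
def contains_ignore_case(values, needle):
    """Case-insensitive substring test; None entries yield False."""
    folded = needle.casefold()
    return [v is not None and folded in v.casefold() for v in values]


values = ["xxAbCxx", "ABC", "abd", None]
mask = contains_ignore_case(values, "Abc")
print(mask)  # [True, True, False, False]
```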
Thank you !! |
How can we ignore case in startswith and endswith in Vaex? I tried a regular expression like startswith("(?i)" + val, regex=True), but it didn't work. Hoping to hear back from the team. Thanks. |
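If the startswith and endswith methods compare their argument literally (as Python's own `str.startswith` does, which is an assumption about the library here), an inline `(?i)` flag would be treated as ordinary text. A sketch of the regex route in plain Python, with hypothetical helper names: escape the literal value, then anchor it at the start (via `match`) or at the end (via `\Z`) with `re.IGNORECASE`.

```python
import re


def startswith_ignore_case(values, prefix):
    """Case-insensitive prefix test; match() anchors at the start of the string."""
    pattern = re.compile(re.escape(prefix), re.IGNORECASE)
    return [v is not None and pattern.match(v) is not None for v in values]


def endswith_ignore_case(values, suffix):
    """Case-insensitive suffix test; \\Z anchors at the very end of the string."""
    pattern = re.compile(re.escape(suffix) + r"\Z", re.IGNORECASE)
    return [v is not None and pattern.search(v) is not None for v in values]


values = ["Astha Materials", "LOREM IPSUM", None]
starts = startswith_ignore_case(values, "astha")
ends = endswith_ignore_case(values, "ipsum")

print(starts)  # [True, False, False]
print(ends)    # [False, True, False]
```

Escaping with `re.escape` matters because the value may contain characters such as `.` or `(` that would otherwise be read as regex syntax.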
Description
Say we have a string to match in a database,
We can accomplish that by a simple select and evaluate in VAEX:
However, this searches for search_string in Database Entries rather than Database Entries in search_string.
Can this be performed using Vaex?
Is your feature request related to a problem? Please describe.
Additional context
It would be great to have a vectorized reverse search for quick lookups, as in pandas.Series.isin.
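One way to get an isin-style reverse prefix search is to enumerate every non-empty prefix of the search string once and then test plain membership against that set. The sketch below is pure Python with a hypothetical helper `all_prefixes`; passing the resulting list to a dataframe's isin method is the intended use, but that step is not shown or benchmarked here.

```python
def all_prefixes(search_string: str) -> list:
    """Return every non-empty prefix of search_string, shortest first."""
    return [search_string[:i] for i in range(1, len(search_string) + 1)]


search_string = "ASTHA MATERIALS INDIA"
prefixes = set(all_prefixes(search_string))

names = ["ASTHA MATERIALS", "LOREM IPSUM"]
# An entry is a prefix of search_string exactly when it is in the prefix set.
matches = [n for n in names if n in prefixes]
print(matches)  # ['ASTHA MATERIALS']
```

The prefix set has only as many members as the search string has characters, so the per-row work reduces to a hash lookup.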