Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Port CRAN's fuzzyjoin Funtions to dplyr #3293

Closed
jtelleriar opened this issue Jan 8, 2018 · 8 comments
Closed

Feature Request: Port CRAN's fuzzyjoin Funtions to dplyr #3293

jtelleriar opened this issue Jan 8, 2018 · 8 comments

Comments

@jtelleriar
Copy link

jtelleriar commented Jan 8, 2018

I would suggest porting some of CRAN's fuzzyjoin package functions to dplyr (With proper Contribution Acknowledgement):

https://cran.r-project.org/web/packages/fuzzyjoin/index.html

Of special interest is "fuzzy_join" function, which allows to do SQL queries such as these:

SELECT
    REPLACE(PATH, 'W:\\', '\\\\naswoav.usr.corp.gamesa.es\\WO_AV_DATA\\') AS PATH
  FROM
    dispoin.t_DISPOIN_DIR AS A
    INNER JOIN dispoin.t_scadas AS B
      ON
        A.PATH LIKE CONCAT('%', ScadaCode, '%')
        AND (A.PATH LIKE '%2016%' OR A.PATH LIKE '%2017%')
        -- También se puede usar  A.PATH REGEXP '2016|2017';
        AND A.PATH NOT LIKE '%Data_10M%'
        AND A.PATH NOT LIKE '%rejectAL%'
        AND A.PATH NOT LIKE '%REPROCESSED_FILES%'
        AND A.PATH NOT LIKE '%GA677_201308_AL.csv.s3542017.zip%';
@krlmlr
Copy link
Member

krlmlr commented Jan 17, 2018

Thanks. Are you suggesting to run the fuzzy join directly on the database?

@jtelleriar
Copy link
Author

jtelleriar commented Jan 19, 2018

Either in database or in R itself.

However, when you query a SQL database I guess you could always do a cheat like this I guess:

t_DISPOIN_DIR <- tbl(con, "t_DISPOIN_DIR")

t_DISPOIN_DIR %>%
sql(
"
SELECT
    REPLACE(PATH, 'W:\\', '\\\\naswoav.usr.corp.gamesa.es\\WO_AV_DATA\\') AS PATH
  FROM
    dispoin.t_DISPOIN_DIR AS A
    INNER JOIN dispoin.t_scadas AS B
      ON
        A.PATH LIKE CONCAT('%', ScadaCode, '%')
        AND (A.PATH LIKE '%2016%' OR A.PATH LIKE '%2017%')
        -- También se puede usar  A.PATH REGEXP '2016|2017';
        AND A.PATH NOT LIKE '%Data_10M%'
        AND A.PATH NOT LIKE '%rejectAL%'
        AND A.PATH NOT LIKE '%REPROCESSED_FILES%'
        AND A.PATH NOT LIKE '%GA677_201308_AL.csv.s3542017.zip%';
")

@krlmlr
Copy link
Member

krlmlr commented Feb 28, 2018

fuzzyjoin seems to be removed from CRAN, are you aware of a successor?

@jtelleriar
Copy link
Author

At the company where I work we have GitHub Webpage Blocked, and I cannot check properly this issue.

The point is that if the package has been removed from CRAN, there is even more reason to fork some of its functions (as they are) and move them to mainstream dplyr.

I checked them out in the past and they are properly written.

The fuzzyjoin GitHub Repository is at the following link:

https://github.com/dgrtwo/fuzzyjoin

I'll check it out when I arrive at home, though these days I am little busy learning the H2O.ai framework, and doing a couple of Cheatsheets for using its machine learning functions.

Juan

@jtelleriar
Copy link
Author

The fuzzyjoin package indeed seems not to be longer actively maintained.

@batpigandme
Copy link
Contributor

Just released a new version on CRAN today:
https://cran.r-project.org/web/packages/fuzzyjoin/index.html

@krlmlr
Copy link
Member

krlmlr commented Mar 2, 2018

Good to see it back. Will think about it in the context of #2240.

@krlmlr krlmlr closed this as completed Mar 2, 2018
@lock
Copy link

lock bot commented Aug 29, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Aug 29, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants