Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parsing fails when protocol is missing #36

Open
schochastics opened this issue Sep 25, 2023 · 6 comments
Open

parsing fails when protocol is missing #36

schochastics opened this issue Sep 25, 2023 · 6 comments
Labels

Comments

@schochastics
Copy link
Member

adaR::ada_url_parse("bit.ly/32G1ciy")
#>             href protocol username password host hostname port pathname search
#> 1 bit.ly/32G1ciy     <NA>     <NA>     <NA> <NA>     <NA> <NA>     <NA>   <NA>
#>   hash
#> 1 <NA>
urltools::url_parse("bit.ly/32G1ciy")
#>   scheme domain port    path parameter fragment
#> 1   <NA> bit.ly <NA> 32G1ciy      <NA>     <NA>

Created on 2023-09-25 with reprex v2.0.2

not sure if/how to handle this

@schochastics schochastics added the bug Something isn't working label Sep 25, 2023
@chainsawriot
Copy link
Collaborator

chainsawriot commented Sep 25, 2023

urltools never checks for url validity.

urltools::url_parse("bas%%%asa/s")
#>   scheme    domain port path parameter fragment
#> 1   <NA> bas%%%asa <NA>    s      <NA>     <NA>

Created on 2023-09-25 with reprex v2.0.2

A thing that we can do is to have two modes. Let's call them strict mode and lax mode. Strict mode is what we are doing. Lax mode is to relax this check:

https://github.com/schochastics/adaR/blob/cd526ec1e32e4666b90a69749043515d1ca30517/src/adaR.cpp#L29

If not a valid URL, attempt to add "https://" and recheck. If it checks, then parse. If it doesn't check, return all NA.

@schochastics schochastics removed the bug Something isn't working label Sep 25, 2023
@schochastics
Copy link
Member Author

Ok this is rather a feature than a bug. I will table this for now (but keep it open), given that the primary use case of the package is dealing with webtracking data which should always have valid urls. But we may get back to your suggestion in case we need to

@chainsawriot
Copy link
Collaborator

Probably YAGNI.

@JBGruber
Copy link

JBGruber commented Oct 5, 2023

Great package! Was hoping to use it as drop-in replacement of urltools for one of my packages. But this behaviour actually breaks some of my functions and it might not be the use case you were thinking, so maybe it's useful to my experience here.

I have two exported functions where I want to work with the domain, even when the user pastes a link with a path. With urltools, I don't have to worry about calling one function from the other:

fun1 <- function(x) {
  urltools::domain(x)
}

fun2 <- function(x) {
  domain <- urltools::domain(x)
  ### Does something ###
  fun1(domain)
}

fun1("https://github.com/schochastics/adaR/issues/36")
#> [1] "github.com"
fun2("https://github.com/schochastics/adaR/issues/36")
#> [1] "github.com"

With adaR, this fails, as getting the domain from a domain returns NA.

fun1 <- function(x) {
  adaR::ada_get_domain(x)
}

fun2 <- function(x) {
  domain <- adaR::ada_get_domain(x)
  ### Does something ###
  fun1(domain)
}

fun1("https://github.com/schochastics/adaR/issues/36")
#> [1] "github.com"
fun2("https://github.com/schochastics/adaR/issues/36")
#> [1] NA

Created on 2023-10-05 with reprex v2.0.2

I'll work around it in the way @chainsawriot suggests, but I found the behaviour of urltools more intuitive.

@schochastics
Copy link
Member Author

without knowing what is happening in ### Does something ### you can always just append
a random protocol. Obviously its a hacky solution but should work. I will think about how to handle this better in the package though.

@chainsawriot
Copy link
Collaborator

lax_get_domain <- function(url) {
    ## not vectorize
    res <- adaR::ada_get_domain(url)
    if (!is.na(res)) {
        return(res)
    }
    return(adaR::ada_get_domain(paste0("https://", url)))
}

lax_get_domain("bit.ly/aasas")
#> [1] "bit.ly"
lax_get_domain("notdomain/notpath")
#> [1] NA
lax_get_domain("https://bit.ly/aasas")
#> [1] "bit.ly"
lax_get_domain("gopher://bit.ly/aasas")
#> [1] "bit.ly"

Created on 2023-10-05 with reprex v2.0.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants