
[BUG] element extraction methods like $, el, element and elements not found #90

Closed
pedramkeyani opened this issue Apr 29, 2020 · 7 comments

@pedramkeyani

Describe the bug
The documentation for extracting data from a website is out of date and does not compile.

Code Sample
```kotlin
import it.skrape.extract
import it.skrape.selects.`$`  // <-- is not in the selects package and doesn't compile
import it.skrape.selects.el   // <-- is not in the selects package and doesn't compile
import it.skrape.skrape

data class MyScrapedData(
        val userName: String,
        val repositoryNames: List<String>
)

fun main() {
    val githubUserData = skrape {
        url = "https://github.com/skrapeit"

        extract {
            MyScrapedData(
                    userName = el(".h-card .p-nickname").text(),
                    repositoryNames = `$`("span.repo").map { it.text() }
            )
        }
    }
    println("${githubUserData.userName}'s repos are ${githubUserData.repositoryNames}")
}
```

Expected behavior
I've tried all but the most basic examples while learning the different components of scraping. selects.element and selects.elements are also used in the examples, but they don't appear to be in the code. This could very well be a problem with how I have (or haven't) configured IntelliJ.

@pedramkeyani pedramkeyani added the bug Something isn't working label Apr 29, 2020
@pedramkeyani
Author

I should also note that I'm using IntelliJ and have tried this with it.skrape:skrapeit-core:RELEASE as well as the 6-alpha, 4-alpha, and 4.1-alpha versions.

@christian-draeger christian-draeger changed the title [BUG] [BUG] e Apr 29, 2020
@christian-draeger christian-draeger changed the title [BUG] e [BUG] element extraction methods like $, el, element and elements not found Apr 29, 2020
@christian-draeger
Collaborator

christian-draeger commented Apr 29, 2020

Hey,
the methods you described were available in really early versions of skrape{it}. They are no longer supported in the versions you mentioned.
Please let me know if there is some example in the docs that still shows the old syntax, so it can be changed.

New syntax (alpha-6):
https://github.com/skrapeit/skrape.it#parse-html-and-extract-it

Important to note:
Please use this dependency:
https://github.com/skrapeit/skrape.it#add-dependency

Please let me know if this worked for you.

@pedramkeyani
Author

Hi Christian, here is the document that has the old style. Also, it allows you to select between all the versions of the API except for the newer ones (6-alpha): https://docs.skrape.it/docs/dsl/extracting-data-from-websites

I also found examples in the documentation of using selects.element and selects.elements but I can't find it right now. I'll look around for it.

@christian-draeger
Collaborator

Ok, thanks for pointing this out. I will update the examples soon. Let me know if the other (newer) example in the README works for you :)

@pedramkeyani
Author

pedramkeyani commented Apr 29, 2020

Thanks for following up. The example in the README worked (from what I remember yesterday). One of the challenges I'm having is how to use the different features to scrape a little more elegantly. I stumbled around the code and figured out how to use withClass and then rawCssSelector, which was more powerful, but I'm still not able to figure out how to iterate through DOM elements (for all the elements of a list, mapping their hrefs to their text). One example of something I need to do is visit DOM elements, grab multiple pieces of information to put into an object, and store that for later. It may be easier to go through the details on Slack, so I messaged you on the kotlinlang channel.


```kotlin
/*
li {
    rawCssSelector = "div.jokes-nav > ul > li"

    links.addAll(findAll { eachText() })
}
*/

a {
    rawCssSelector = "div.jokes-nav > ul > li > a"

    findAll {
        println(eachHref())
        println(eachText())
    }
}
```

@christian-draeger
Collaborator

Ok cool, I will close this one for now.

@skrapeit
Owner

skrapeit commented Apr 29, 2020

Related to #91 - the solution that was discussed on the Kotlin Slack to extract all links, including their text and href, until #91 has been released:

```kotlin
fun main() {
    val allNavLinks = skrape {
        url = "http://www.laughfactory.com/jokes"
        extract {
            htmlDocument {
                ".jokes-nav a" {
                    withAttributeKey = "href"
                    findAll {
                        associate { it.text to it.attribute("href") }
                    }
                }
            }
        }
    }
    println(allNavLinks)
}
```

prints:
{Popular Jokes=http://www.laughfactory.com/jokes/popular-jokes, Latest Jokes=http://www.laughfactory.com/jokes/latest-jokes, Joke of the Day=http://www.laughfactory.com/jokes/joke-of-the-day, Animal Jokes=http://www.laughfactory.com/jokes/animal-jokes, Blonde Jokes=http://www.laughfactory.com/jokes/blonde-jokes, Boycott These Jokes=http://www.laughfactory.com/jokes/boycott-these-jokes, Clean Jokes=http://www.laughfactory.com/jokes/clean-jokes, Family Jokes=http://www.laughfactory.com/jokes/family-jokes, Food Jokes=http://www.laughfactory.com/jokes/food-jokes, Holiday Jokes=http://www.laughfactory.com/jokes/holiday-jokes, How to be Insulting=http://www.laughfactory.com/jokes/how-to-be-insulting, Insult Jokes=http://www.laughfactory.com/jokes/insult-jokes, Miscellaneous Jokes=http://www.laughfactory.com/jokes/miscellaneous-jokes, National Jokes=http://www.laughfactory.com/jokes/national-jokes, Office Jokes=http://www.laughfactory.com/jokes/office-jokes, Political Jokes=http://www.laughfactory.com/jokes/political-jokes, Pop Culture Jokes=http://www.laughfactory.com/jokes/pop-culture-jokes, Racist Jokes=http://www.laughfactory.com/jokes/racist-jokes, Relationship Jokes=http://www.laughfactory.com/jokes/relationship-jokes, Religious Jokes=http://www.laughfactory.com/jokes/religious-jokes, School Jokes=http://www.laughfactory.com/jokes/school-jokes, Science Jokes=http://www.laughfactory.com/jokes/science-jokes, Sex Jokes=http://www.laughfactory.com/jokes/sex-jokes, Sexist Jokes=http://www.laughfactory.com/jokes/sexist-jokes, Sports Jokes=http://www.laughfactory.com/jokes/sports-jokes, Technology Jokes=http://www.laughfactory.com/jokes/technology-jokes, Word Play Jokes=http://www.laughfactory.com/jokes/word-play-jokes, Yo Momma Jokes=http://www.laughfactory.com/jokes/yo-momma-jokes}

What it's doing:
1. defines the HTTP request (skrape method's scope)
2. performs the call (extract method's scope)
3. deserializes the HTML body received from extract's response
4. defines a selector that matches all elements matching the CSS query selector `.jokes-nav a[href]`
5. gets all elements matching the selector as a list (findAll method)
6. extracts the elements' text and href values into a map (using Kotlin's built-in associate function)
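The last step relies only on Kotlin's stdlib `associate` function, which can be shown in isolation. A minimal sketch, assuming a hypothetical `Link` data class standing in for the scraped elements (not part of skrape{it}):

```kotlin
// Stand-in for the elements findAll returns; the Link type is illustrative.
data class Link(val text: String, val href: String)

fun main() {
    val elements = listOf(
            Link("Popular Jokes", "/jokes/popular-jokes"),
            Link("Latest Jokes", "/jokes/latest-jokes")
    )

    // associate builds a Map from the Pair each lambda call returns;
    // if two elements produced the same key, the later one would win.
    val byText: Map<String, String> = elements.associate { it.text to it.href }

    println(byText) // {Popular Jokes=/jokes/popular-jokes, Latest Jokes=/jokes/latest-jokes}
}
```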

:) hope this helps

As an alternative, this will also work if you don't like the string invocation like `".jokes-nav a" {}`:

```kotlin
fun main() {
    val allNavLinks = skrape {
        url = "http://www.laughfactory.com/jokes"
        extract {
            htmlDocument {
                div {
                    withClass = "jokes-nav"
                    a {
                        withAttributeKey = "href"
                        findAll {
                            associate { it.text to it.attribute("href") }
                        }
                    }
                }
            }
        }
    }
    println(allNavLinks)
}
```

Both solutions are perfectly fine to use, and I think it's just a matter of taste. The first one uses a plain CSS selector and invokes it; the second one builds the selector using the DSL, which will be more readable if you have more complex selectors (under the hood, skrape{it} will make a selector string out of it again).
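To illustrate the "selector string under the hood" idea, here is a simplified, self-contained sketch of how a tag/class/attribute DSL block can be reduced to the same CSS string the first variant writes by hand. This is purely illustrative and is not skrape{it}'s actual implementation or API:

```kotlin
// Illustrative selector builder -- not skrape{it}'s real internals.
class SelectorBuilder(private val tag: String) {
    var withClass: String? = null
    var withAttributeKey: String? = null
    private val children = mutableListOf<SelectorBuilder>()

    // Nested blocks become descendant selectors.
    fun child(tag: String, init: SelectorBuilder.() -> Unit): SelectorBuilder =
            SelectorBuilder(tag).apply(init).also { children.add(it) }

    fun toCssSelector(): String {
        val self = buildString {
            append(tag)
            withClass?.let { append(".$it") }
            withAttributeKey?.let { append("[$it]") }
        }
        return if (children.isEmpty()) self
        else children.joinToString(" ") { "$self ${it.toCssSelector()}" }
    }
}

fun main() {
    val selector = SelectorBuilder("div").apply {
        withClass = "jokes-nav"
        child("a") { withAttributeKey = "href" }
    }.toCssSelector()

    println(selector) // div.jokes-nav a[href]
}
```

The DSL form and the plain string `.jokes-nav a[href]`-style selector end up equivalent; the DSL just spells the structure out step by step.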
