
[Scrapper-Service] add g2r scrapper, modify config.json #43

Merged
9 commits merged into rajatkb:master on Apr 6, 2020

Conversation

@ikoala21 (Contributor) commented Mar 14, 2020

Implements #2

@rajatkb (Owner) left a comment

Looks great; a few remarks from my end:

  1. Use the logger against inner errors as well (see the sketch after this list).
  • Check for None before accessing a variable that may be None; log and raise an exception when that happens.
  • That way, if the scrapper starts failing, we will have a detailed trace of what's going wrong.
  • I see you have used the error log in the top-level function. Do so in the inner functions as well.
  2. You are putting the wrong information in the metadata. Metadata is what belongs to the scrapper itself. You have to follow the Metadata datamodel. The extra information all goes with the Conference kwargs. Look into the Conference datamodel; you will find what extra info you should be putting in if you come across it while scrapping. You can also put in your own extra info.

  3. Also, are you going through the top conferences only? You should go through both the regular conferences and the top conferences.
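A minimal sketch of what point 1 could look like inside one of the inner parsing helpers (the helper name, the h4 lookup, and the BeautifulSoup-style calls are illustrative assumptions, not the actual plugin code):

    def _parse_conference_block(self, block, link):
        # hypothetical inner helper: guard against a missing element before using it,
        # log the problem and raise so the top-level handler gets a clear trace
        title_tag = block.find('h4')
        if title_tag is None:
            self.logger.error(f"No title element found while parsing {link}")
            raise ValueError(f"missing title element on {link}")
        self.logger.debug(f"Parsed conference title from {link}")
        return title_tag.get_text(strip=True)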

@rajatkb (Owner) commented Mar 15, 2020

@sagar-sehgal please put in your review of the code.

@thesagarsehgal (Collaborator) left a comment

I wasn't able to run the branch since I got the following error.

2020-03-19 01:37:25,742 - plugins.guide2research - ERROR - Error while parsing page http://www.guide2research.com/?p=12801, find full trace local variable 'content' referenced before assignment

I received the same error for all the links.

I have commented on some of the locations that I think might have caused the error.

3 review comments on Scrapper-Service/plugins/guide2research.py (outdated, resolved)
@rajatkb (Owner) left a comment

  • Use debug logging. It's there for addressing future issues; log wherever needed for debugging issues down the line.
    A few examples (see the sketch below):
  1. page request successful log
  2. page successfully parsed log
  3. data committed in db debug log, etc.
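A rough sketch of those three log points (get_page and _parse_all_conference_base follow the plugin code quoted later in this thread; push_todb and conference.title are hypothetical names):

    # after a successful page fetch
    page = self.get_page(qlink=link, debug_msg=f"Extracted links from {link}")
    self.logger.debug(f"Page request successful: {link}")

    # after parsing the page
    conf_links = self._parse_all_conference_base(content=page.content)
    self.logger.debug(f"Page parsed successfully: {link}")

    # after committing a conference to the database
    self.push_todb(conference)
    self.logger.debug(f"Committed conference '{conference.title}' to db")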

6 review comments on Scrapper-Service/plugins/guide2research.py (resolved)
@rajatkb added labels enhancement, gssoc20, and medium on Mar 20, 2020
@rajatkb linked an issue on Mar 20, 2020 that may be closed by this pull request
@rajatkb added this to To do in Scrapper-Service via automation on Mar 20, 2020
@rajatkb (Owner) left a comment

  • Use of yield. It's not a major issue, but for a large number of links, yield would avoid building a large list in memory. Anywhere a list is being built just to be returned, you can convert it to yield.

  • Approving this from my end, but incorporate the yield change.

1 review comment on Scrapper-Service/plugins/guide2research.py (resolved)
@rajatkb added the hard label and removed the medium label on Mar 22, 2020
@rajatkb (Owner) left a comment

  • Looks fine from my end. If any further issues come into play, I will raise a bug issue for them; do address those 👍 . Otherwise great job 👍 😁 , awaiting a go from @sagar-sehgal.

Scrapper-Service automation moved this from To do to In progress Mar 25, 2020
Quoted from Scrapper-Service/plugins/guide2research.py:

        Conference: object
        """
        self.logger.debug(f'Parsing {link}')
        page = requests.get(link, allow_redirects=True)
@rajatkb (Owner):

  • Use the provided function; this call will block the thread since no timeout is given.

@rajatkb (Owner):

  • Make the same change anywhere you may have missed it. Also post a MongoDB schema layout of a conference captured by your scrapper, one from the top conferences and one from the general list. That should give us an idea in case any change is required.

@ikoala21 (Contributor, Author):

Hey, there's a catch with the get_page method provided: it works fine when the page/route is returned directly. For the individual conferences listed on both the top conferences and all conferences pages there's a redirect happening, and that is not allowed by the provided method. In such a case, using get_page would not work as I wanted it to.

@ikoala21 (Contributor, Author):

I can change the request method to include a timeout, if that's okay.

@thesagarsehgal (Collaborator) left a comment

While running the code I am coming across the following error.

[dill routine]:  Runnable failed in runtime due to error raised HTTPConnectionPool(host='www.guide2research.com', port=80): Read timed out. (read timeout=1)

I come across this when it initially fetches the pages.

Try running the code after deleting the db.

@rajatkb (Owner) commented Mar 26, 2020

> While running the code I am coming across the following error.
>
> [dill routine]:  Runnable failed in runtime due to error raised HTTPConnectionPool(host='www.guide2research.com', port=80): Read timed out. (read timeout=1)
>
> I come across this when it initially fetches the pages.
>
> Try running the code after deleting the db.

This should not happen unless

  1. the error was not handled, or
  2. the already given method for page requests was not used.

Both of these can lead to this. @manasakalaga please address this.

@ikoala21 (Contributor, Author) commented Mar 26, 2020

> The already given method for page requests was not used.

@rajatkb what do you mean by this?

@ikoala21 (Contributor, Author) commented Mar 26, 2020

> Try running the code after deleting the db.

@sagar-sehgal do you mean entries in the db?

EDIT
I've figured out what's happening. As I mentioned above, each of the conference pages redirects to its ultimate destination, and the time that takes is longer than usual. I was able to fix this by adding a timeout parameter to my conference page request method.
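A minimal sketch of that change, assuming the conference pages are still fetched with requests directly rather than through self.get_page (the 10-second timeout is an arbitrary example value):

    import requests

    # follow the guide2research redirect, but never block the thread indefinitely
    page = requests.get(link, allow_redirects=True, timeout=10)
    page.raise_for_status()
    content = page.content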

Quoted from Scrapper-Service/plugins/guide2research.py:

    for link in self.all_conf_page_links:
        content = self.get_page(
            qlink=link, debug_msg=f"Extracted links from {link}").content
        links.append(self._parse_all_conference_base(content=content))
@rajatkb (Owner):

  • Use yield here; you risk pulling a huge number of links into memory.

@ikoala21 (Contributor, Author):

Can you suggest a way of going about it by looking at the flow?
Here's an outline:

  1. First, I'm looking for all pages of all conferences and saving them to all_conf_page_links.
  2. Then I'm parsing each of the items in that list to get links to the individual conferences and adding them to a list.

@rajatkb (Owner) commented Mar 26, 2020

Follow the pattern below:

    def some_function_that_return_list(args):
        while True:
            ## do some processing
            yield result

    def other_func_that_uses_above(args):
        for result in some_function_that_return_list(new_args):
            ## do something with result
            new_result = process(result)
            yield new_result

    for result in other_func_that_uses_above(args):
        print(result)

Wherever you see yourself iterating over a one-time list, you can resort to yield. For example, self.all_conf_page_links can be yielded instead of built up as a long list. Similarly, the processing done on it can be yielded as well. The idea is: don't accumulate into a list just to return it; yield the items instead.
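Applied to the snippet quoted above, that could look roughly like this (a sketch only; the method name _iter_all_conference_links is an assumption, the rest follows the quoted diff):

    def _iter_all_conference_links(self):
        # instead of links.append(...), hand each result to the caller as soon as it is ready
        for link in self.all_conf_page_links:
            content = self.get_page(
                qlink=link, debug_msg=f"Extracted links from {link}").content
            yield self._parse_all_conference_base(content=content)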

@rajatkb (Owner) commented Mar 26, 2020

> The already given method for page requests was not used.
>
> @rajatkb what do you mean by this?

The self.get_page() method already has an adaptive timeout routine.
Also, you need to handle such timeouts. Timeouts will happen and cannot be avoided, so they need to be caught as exceptions, handled, and then you move on to the next page or the next link (see the sketch below).
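A rough sketch of that handling (the exception type is an assumption; it depends on what get_page actually raises on a timeout):

    import requests

    for link in self.all_conf_page_links:
        try:
            content = self.get_page(
                qlink=link, debug_msg=f"Extracted links from {link}").content
        except requests.exceptions.Timeout as err:
            # log the failing page and move on to the next link
            self.logger.error(f"Timed out while fetching {link}, skipping: {err}")
            continue
        yield self._parse_all_conference_base(content=content)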

@ikoala21 (Contributor, Author):

> > The already given method for page requests was not used.
> >
> > @rajatkb what do you mean by this?
>
> The self.get_page() method already has an adaptive timeout routine.
> Also, you need to handle such timeouts. Timeouts will happen and cannot be avoided, so they need to be caught as exceptions, handled, and then you move on to the next page or the next link.

I get this.
But to reach each of the conference links, I need to allow the redirect to happen, which is not supported by self.get_page.

@rajatkb (Owner) commented Mar 26, 2020

To reach each of the conference pages? Are you reaching out to the actual conference URL or the one provided by guide2research? If it's the guide2research one, does it still require a redirect?

@ikoala21 (Contributor, Author) commented Mar 26, 2020

> To reach each of the conference pages? Are you reaching out to the actual conference URL or the one provided by guide2research? If it's the guide2research one, does it still require a redirect?

I'm reaching out to the link provided by Guide 2 Research itself, and yes, it requires a redirect. Look at the bottom left of the image below.
(screenshot: 2020-03-26 12-41-15)

Upon clicking, it takes me to the following page via a redirect:
(screenshot: 2020-03-26 12-44-07)

@rajatkb (Owner) commented Mar 26, 2020

Well, that should not require a redirect, does it? You can simply request this new page that you require. Though adding allow_redirects=True is not a big deal; we can simply add that in the AdaptiveRequest class in the utility package. But I would still like to know why it is required, because if you request the link given in the screenshot, it should download its HTML content. A redirect would be required when that link redirects to a new page (something of the tinyurl sort).

PS: Yes, I visited the site; I get what you are saying. They are not following REST and are bringing in pages from the body arguments. You can make the change for that redirect in the AdaptiveRequest class in the utility package. It's using the requests package internally.
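A hedged sketch of what that change might look like (the class layout and method name here are guesses at the AdaptiveRequest utility, not its actual code):

    import requests

    class AdaptiveRequest:
        def get(self, url, timeout, allow_redirects=False, **kwargs):
            # pass the flag straight through to requests; existing callers keep
            # the old non-redirecting behaviour because the default is False
            return requests.get(url, timeout=timeout,
                                allow_redirects=allow_redirects, **kwargs)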

@ikoala21 (Contributor, Author):

Any resources on how I should go about adding that redirect behavior to the AdaptiveRequest package?

@rajatkb (Owner) commented Mar 26, 2020

> Any resources on how I should go about adding that redirect behavior to the AdaptiveRequest package?

Just visit that class. It's using the requests package; you can add the same allow_redirects argument there as well.

@rajatkb (Owner) commented Apr 1, 2020

@manasakalaga, since your other bug fix is approved, you can now finish this one up.

@ikoala21 (Contributor, Author) commented Apr 4, 2020

Mongo object for:

  1. Top conferences:
    (screenshot: topconf)

  2. All conferences:
    (screenshot: allconf)

@rajatkb (Owner) commented Apr 6, 2020

@manasakalaga looks clean from my end, merging it.

@rajatkb (Owner) left a comment

  • Great job! 👍

@rajatkb rajatkb merged commit b3b33dd into rajatkb:master Apr 6, 2020
Scrapper-Service automation moved this from In progress to Done Apr 6, 2020
Labels
enhancement, gssoc20, hard
Projects
Scrapper-Service

Development

Successfully merging this pull request may close these issues.

[Feat.] add new Scrapper for Guide2Research
3 participants