[Scrapper-Service]add g2r scrapper, modify config.json #43
Conversation
Looks great, a few remarks from my end:
- Use the logger for inner errors as well.
- Check for None before accessing a variable that may be None; log and raise an exception when that happens.
- That way, if the scrapper starts failing we will have a detailed trace of what's going wrong.
- I see you have used the error log in the top-level function. Do so in the inner functions as well (see the sketch after these points).
- You are putting the wrong information in the metadata. Metadata is what belongs to the scrapper, and you have to follow the Metadata datamodel. The extra information all goes in with the Conference kwargs. Look into the Conference datamodel; you will find what extra info you should be putting in if you come across it while scrapping. You can also put in your own extra info.
- Also, are you going through the top conferences only? You can go through both the regular ones and the top conferences.
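For example, an inner helper could look something like this (a minimal sketch with a hypothetical helper name; it assumes a BeautifulSoup `soup` object and the scrapper's `self.logger`):

```python
def _parse_conference_title(self, soup, link):
    # Hypothetical inner helper: check for None, log, and raise instead of failing silently
    title_tag = soup.find("h1")
    if title_tag is None:
        self.logger.error(f"Could not find a title element on {link}")
        raise ValueError(f"Missing title element on {link}")
    try:
        return title_tag.get_text(strip=True)
    except Exception as err:
        # Log inside the inner function too, so the trace shows exactly where things broke
        self.logger.error(f"Error while extracting title from {link}: {err}")
        raise
```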
@sagar-sehgal please put your review for the code.
I wasn't able to run the branch since I got the following error.
2020-03-19 01:37:25,742 - plugins.guide2research - ERROR - Error while parsing page http://www.guide2research.com/?p=12801, find full trace local variable 'content' referenced before assignment
Received the same error for all the links.
I have commented on some of the locations that I feel might have caused the error.
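For reference, a "local variable 'content' referenced before assignment" error typically appears when a variable is only assigned inside a try block and is then used after the except clause swallows the failure. A minimal sketch of that failure mode (not necessarily the exact bug here):

```python
import requests

def parse_page(link):
    try:
        page = requests.get(link, timeout=5)
        content = page.content              # never assigned if the request raises
    except requests.RequestException as err:
        print(f"request failed: {err}")     # failure swallowed instead of re-raised
    return len(content)                     # UnboundLocalError when the request failed
```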
- Use debug logging. It's there for addressing future issues; log wherever needed for debugging issues down the line.
A few examples:
- page request successful log
- page successfully parsed log
- data committed in db debug log, etc.
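A rough sketch of what those debug logs could look like (the `_parse_conference_page` helper and `self.db.insert` call are hypothetical; `get_page` with its `qlink`/`debug_msg` arguments is taken from the existing code):

```python
def parse(self, link):
    # Request the page and note the successful fetch
    page = self.get_page(qlink=link, debug_msg=f"Requesting {link}")
    self.logger.debug(f"Page request successful for {link}")

    # Parse the fetched HTML into a Conference object
    conference = self._parse_conference_page(page.content)
    self.logger.debug(f"Page successfully parsed for {link}")

    # Persist it and note the commit
    self.db.insert(conference)
    self.logger.debug(f"Conference from {link} committed to db")
    return conference
```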
- Use of yield. It's not a major issue, but for a large number of links, yield would help avoid building a large list in memory. Anywhere a list is being returned, you can convert it to yield instead.
- Approving this from my end, but do incorporate the yield change.
- Looks fine from my end. Further, if any issues come into play, I will raise a bug issue for them; do address those 👍. Otherwise, great job 👍 😁, awaiting a go by @sagar-sehgal.
```python
    Conference: object
    """
    self.logger.debug(f'Parsing {link}')
    page = requests.get(link, allow_redirects=True)
```
- Use the provided function; this call will block the thread since no timeout is given.
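If a raw `requests.get` call is kept anywhere, something like the following would at least stop it from blocking indefinitely (the timeout value is illustrative; the provided `get_page` helper is still the preferred route):

```python
import requests

link = "http://www.guide2research.com/"  # example URL from the logs above

# A bare requests.get with no timeout can hang the thread forever;
# an explicit timeout makes a slow or dead page fail fast instead.
page = requests.get(link, allow_redirects=True, timeout=10)
```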
- Make the same change anywhere you may have missed it. Also post a MongoDB schema layout of a conference captured by your scrapper, one from top conferences and one from general. That should give us an idea in case any change is required.
Hey, there's a catch with the `get_page` method provided: it works fine when the page/route is returned directly. For the individual conferences listed on both `top conferences` and `all conferences`, there's a redirect happening, and that is not allowed by the provided method. In such a case, using `get_page` would not work as I wanted it to.
I can change the request method to include a timeout if that's okay.
While running the code I am coming across the following error.
[dill routine]: Runnable failed in runtime due to error raised HTTPConnectionPool(host='www.guide2research.com', port=80): Read timed out. (read timeout=1)
I come across this when it initially fetches the pages.
Try running the code after deleting the db.
This should not happen unless
Both of these can lead to this. @manasakalaga address this.
@rajatkb what do you mean by this?
@sagar-sehgal do you mean entries in the db? EDIT
```python
for link in self.all_conf_page_links:
    content = self.get_page(
        qlink=link, debug_msg=f"Extracted links from {link}").content
    links.append(self._parse_all_conference_base(content=content))
```
- Use yield here; you risk holding a huge number of links in memory.
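A possible generator version of the loop above, assuming `_parse_all_conference_base` returns an iterable of links (the method name `_iter_all_conference_links` is just for illustration):

```python
def _iter_all_conference_links(self):
    # Yield links as each listing page is parsed instead of accumulating one big list
    for link in self.all_conf_page_links:
        content = self.get_page(
            qlink=link, debug_msg=f"Extracted links from {link}").content
        # yield from flattens the per-page results; use a plain `yield`
        # if _parse_all_conference_base returns a single object instead
        yield from self._parse_all_conference_base(content=content)
```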
Can you suggest a way of going about it by looking at the flow?
Here's an outline:
- First, I'm looking for all pages of all conferences and saving them to `all_conf_page_links`.
- Then I'm further parsing each item in that list to get links to the individual conferences and adding them to a list.
Follow the pattern below (processing steps are placeholders):

```python
def process_item(item):      # placeholder processing step
    return item * 2

def process(result):         # placeholder follow-up processing
    return result + 1

def some_function_that_return_list(args):
    for item in args:
        # do some processing, yielding each result instead of building a list
        yield process_item(item)

def other_func_that_uses_above(args):
    for result in some_function_that_return_list(args):
        # do something with each result as it streams through
        new_result = process(result)
        yield new_result

for result in other_func_that_uses_above([1, 2, 3]):
    print(result)
```
Wherever you see yourself iterating over a one-time list, you can resort to yield; `self.all_conf_page_links`, for example, can be yielded instead of built up as a long list. Similarly, the processing done on it can be yielded as well. The idea is: don't accumulate items into a list just to return it, yield them instead.
The `self.get_page()` method already has an adaptive timeout routine.
I get this.
To reach each of the conference pages? Are you reaching out to the actual conference URL or the one provided by guide2research? If it's the guide2research one, does it still require a redirect?
Well, that should not require a redirect, does it? You can simply request the new page that you need. Though adding redirect support is not a big deal; we can simply add that in the AdaptiveRequest class in the utility package. But I would still like to know why it is required, because if you request the link given in the screenshot it should just download its HTML content. A redirect would only be required when that link redirects to a new page (something of the tinyurl sort).
PS: Yes, I visited the site, I get what you are saying. They are not following REST and are bringing in pages from the body arguments. You can make the change for that redirect in the AdaptiveRequest package, in the utility class. It's using the requests package internally.
Any resources on how I should go about adding that behavior to the AdaptiveRequest package?
Just visit that class. It's using the requests package; you can add the same redirect argument there as well.
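Not the repo's actual AdaptiveRequest code, but a rough sketch of the idea being discussed: a thin wrapper over `requests` that retries with a growing timeout and simply passes an `allow_redirects` flag through:

```python
import requests

class AdaptiveRequestSketch:
    """Hypothetical stand-in for the utility class: retries with an increasing timeout."""

    def __init__(self, base_timeout=1, max_retries=4):
        self.base_timeout = base_timeout
        self.max_retries = max_retries

    def get(self, url, allow_redirects=False):
        timeout = self.base_timeout
        last_err = None
        for _ in range(self.max_retries):
            try:
                # allow_redirects is forwarded straight to requests,
                # which is the change being suggested above
                return requests.get(url, timeout=timeout, allow_redirects=allow_redirects)
            except requests.Timeout as err:
                last_err = err
                timeout *= 2  # back off by doubling the read timeout
        raise last_err
```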
@manasakalaga, since your other bug fix is approved, you can now finish up this one.
…change yield in all confs back to list, fix logger messages
…lds links instead of list, minor debug message fix
…sanitisation, modified ranking extraction
@manasakalaga looks clean from my end, merging it.
- Great job! 👍
Implements #2