
[Scrapper-Service] add g2r scrapper, modify config.json #43

Merged
9 commits merged into rajatkb:master on Apr 6, 2020

Conversation

@ikoala21 (Contributor) commented Mar 14, 2020

Implements #2

@rajatkb (Owner) left a comment

Looks great; a few remarks from my end:

  1. Use the logger against inner errors as well (see the sketch after this list).
  • Check for None before accessing a variable that may be None; log and raise an exception when that happens.
  • That way, if the scrapper starts failing, we will have a detailed trace of what's going wrong.
  • I see you have used the error log in the top-level function. Do so in the inner functions as well.
  2. You are putting the wrong information in the metadata. Metadata is what belongs to the scrapper itself. You have to follow the Metadata datamodel. The extra information all goes with the Conference kwargs. Look into the Conference datamodel; you will find what extra info you should be putting in if you come across it while scrapping. You can also put in your own extra info.

  3. Also, are you going through the top conferences only? You should go through both the regular conferences and the top conferences.
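A minimal sketch of what point 1 could look like inside one of the inner parsing helpers (the helper name, the h4 lookup, and the BeautifulSoup-style calls are illustrative assumptions, not the actual plugin code):

    def _parse_conference_block(self, block, link):
        # hypothetical inner helper: guard against a missing element before using it,
        # log the problem and raise so the top-level handler gets a clear trace
        title_tag = block.find('h4')
        if title_tag is None:
            self.logger.error(f"No title element found while parsing {link}")
            raise ValueError(f"missing title element on {link}")
        self.logger.debug(f"Parsed conference title from {link}")
        return title_tag.get_text(strip=True)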

@rajatkb (Owner) commented Mar 15, 2020

@sagar-sehgal please put in your review of the code.

@thesagarsehgal (Collaborator) left a comment

I wasn't able to run the branch since I got the following error.

2020-03-19 01:37:25,742 - plugins.guide2research - ERROR - Error while parsing page http://www.guide2research.com/?p=12801, find full trace local variable 'content' referenced before assignment

I received the same error for all the links.

I have commented on some of the locations that I think might have caused the error.

3 review comments on Scrapper-Service/plugins/guide2research.py (outdated, resolved)
@rajatkb (Owner) left a comment

  • Use debug logging. It's there for addressing future issues; log wherever needed for debugging issues down the line.
    A few examples (see the sketch below):
  1. page request successful log
  2. page successfully parsed log
  3. data committed in db debug log, etc.
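A rough sketch of those three log points (get_page and _parse_all_conference_base follow the plugin code quoted later in this thread; push_todb and conference.title are hypothetical names):

    # after a successful page fetch
    page = self.get_page(qlink=link, debug_msg=f"Extracted links from {link}")
    self.logger.debug(f"Page request successful: {link}")

    # after parsing the page
    conf_links = self._parse_all_conference_base(content=page.content)
    self.logger.debug(f"Page parsed successfully: {link}")

    # after committing a conference to the database
    self.push_todb(conference)
    self.logger.debug(f"Committed conference '{conference.title}' to db")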

6 review comments on Scrapper-Service/plugins/guide2research.py (resolved)
@rajatkb added labels enhancement, gssoc20, and medium on Mar 20, 2020
@rajatkb linked an issue on Mar 20, 2020 that may be closed by this pull request
@rajatkb added this to To do in Scrapper-Service via automation on Mar 20, 2020
@rajatkb (Owner) left a comment

  • Use of yield. It's not a major issue, but for a large number of links, yield would avoid building a large list in memory. Anywhere a list is being built just to be returned, you can convert it to yield.

  • Approving this from my end, but incorporate the yield change.

1 review comment on Scrapper-Service/plugins/guide2research.py (resolved)
@rajatkb added the hard label and removed the medium label on Mar 22, 2020
@rajatkb (Owner) left a comment

  • Looks fine from my end. If any further issues come into play, I will raise a bug issue for them; do address those 👍 . Otherwise great job 👍 😁 , awaiting a go from @sagar-sehgal.

Scrapper-Service automation moved this from To do to In progress Mar 25, 2020
Quoted from Scrapper-Service/plugins/guide2research.py:

        Conference: object
        """
        self.logger.debug(f'Parsing {link}')
        page = requests.get(link, allow_redirects=True)
@rajatkb (Owner):

  • Use the provided function; this call will block the thread since no timeout is given.

@rajatkb (Owner):

  • Make the same change anywhere you may have missed it. Also post a MongoDB schema layout of a conference captured by your scrapper, one from the top conferences and one from the general list. That should give us an idea in case any change is required.

@ikoala21 (Contributor, Author):

Hey, there's a catch with the get_page method provided: it works fine when the page/route is returned directly. For the individual conferences listed on both the top conferences and all conferences pages there's a redirect happening, and that is not allowed by the provided method. In such a case, using get_page would not work as I wanted it to.

@ikoala21 (Contributor, Author):

I can change the request method to include a timeout, if that's okay.

@thesagarsehgal (Collaborator) left a comment

While running the code I am coming across the following error.

[dill routine]:  Runnable failed in runtime due to error raised HTTPConnectionPool(host='www.guide2research.com', port=80): Read timed out. (read timeout=1)

I come across this when it initially fetches the pages.

Try running the code after deleting the db.

@rajatkb (Owner) commented Mar 26, 2020

> While running the code I am coming across the following error.
>
> [dill routine]:  Runnable failed in runtime due to error raised HTTPConnectionPool(host='www.guide2research.com', port=80): Read timed out. (read timeout=1)
>
> I come across this when it initially fetches the pages.
>
> Try running the code after deleting the db.

This should not happen unless

  1. the error was not handled, or
  2. the already given method for page requests was not used.

Both of these can lead to this. @manasakalaga please address this.

@ikoala21 (Contributor, Author) commented Mar 26, 2020

> The already given method for page requests was not used.

@rajatkb what do you mean by this?

@ikoala21 (Contributor, Author) commented Mar 26, 2020

> Try running the code after deleting the db.

@sagar-sehgal do you mean entries in the db?

EDIT
I've figured out what's happening. As I mentioned above, each of the conference pages redirects to its ultimate destination, and the time that takes is longer than usual. I was able to fix this by adding a timeout parameter to my conference page request method.
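A minimal sketch of that change, assuming the conference pages are still fetched with requests directly rather than through self.get_page (the 10-second timeout is an arbitrary example value):

    import requests

    # follow the guide2research redirect, but never block the thread indefinitely
    page = requests.get(link, allow_redirects=True, timeout=10)
    page.raise_for_status()
    content = page.content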

Quoted from Scrapper-Service/plugins/guide2research.py:

    for link in self.all_conf_page_links:
        content = self.get_page(
            qlink=link, debug_msg=f"Extracted links from {link}").content
        links.append(self._parse_all_conference_base(content=content))
@rajatkb (Owner):

  • Use yield here; you risk pulling a huge number of links into memory.

@ikoala21 (Contributor, Author):

Can you suggest a way of going about it by looking at the flow?
Here's an outline:

  1. First, I'm looking for all pages of all conferences and saving them to all_conf_page_links.
  2. Then I'm parsing each of the items in that list to get links to the individual conferences and adding them to a list.

@rajatkb (Owner) commented Mar 26, 2020

Follow the pattern below:

    def some_function_that_return_list(args):
        while True:
            ## do some processing
            yield result

    def other_func_that_uses_above(args):
        for result in some_function_that_return_list(new_args):
            ## do something with result
            new_result = process(result)
            yield new_result

    for result in other_func_that_uses_above(args):
        print(result)

Wherever you see yourself iterating over a one-time list, you can resort to yield. For example, self.all_conf_page_links can be yielded instead of built up as a long list. Similarly, the processing done on it can be yielded as well. The idea is: don't accumulate into a list just to return it; yield the items instead.
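Applied to the snippet quoted above, that could look roughly like this (a sketch only; the method name _iter_all_conference_links is an assumption, the rest follows the quoted diff):

    def _iter_all_conference_links(self):
        # instead of links.append(...), hand each result to the caller as soon as it is ready
        for link in self.all_conf_page_links:
            content = self.get_page(
                qlink=link, debug_msg=f"Extracted links from {link}").content
            yield self._parse_all_conference_base(content=content)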

@rajatkb (Owner) commented Mar 26, 2020

> The already given method for page requests was not used.
>
> @rajatkb what do you mean by this?

The self.get_page() method already has an adaptive timeout routine.
Also, you need to handle such timeouts. Timeouts will happen and cannot be avoided, so they need to be caught as exceptions, handled, and then you move on to the next page or the next link (see the sketch below).
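A rough sketch of that handling (the exception type is an assumption; it depends on what get_page actually raises on a timeout):

    import requests

    for link in self.all_conf_page_links:
        try:
            content = self.get_page(
                qlink=link, debug_msg=f"Extracted links from {link}").content
        except requests.exceptions.Timeout as err:
            # log the failing page and move on to the next link
            self.logger.error(f"Timed out while fetching {link}, skipping: {err}")
            continue
        yield self._parse_all_conference_base(content=content)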

@ikoala21 (Contributor, Author):

> > The already given method for page requests was not used.
> >
> > @rajatkb what do you mean by this?
>
> The self.get_page() method already has an adaptive timeout routine.
> Also, you need to handle such timeouts. Timeouts will happen and cannot be avoided, so they need to be caught as exceptions, handled, and then you move on to the next page or the next link.

I get this.
But to reach each of the conference links, I need to allow the redirect to happen, which is not supported by self.get_page.

@rajatkb (Owner) commented Mar 26, 2020

To reach each of the conference pages? Are you reaching out to the actual conference URL or the one provided by guide2research? If it's the guide2research one, does it still require a redirect?

@ikoala21 (Contributor, Author) commented Mar 26, 2020

> To reach each of the conference pages? Are you reaching out to the actual conference URL or the one provided by guide2research? If it's the guide2research one, does it still require a redirect?

I'm reaching out to the link provided by Guide 2 Research itself, and yes, it requires a redirect. Look at the bottom left of the image below.
(screenshot: 2020-03-26 12-41-15)

Upon clicking, it takes me to the following page via a redirect:
(screenshot: 2020-03-26 12-44-07)

@rajatkb (Owner) commented Mar 26, 2020

Well, that should not require a redirect, does it? You can simply request this new page that you require. Though adding allow_redirects=True is not a big deal; we can simply add that in the AdaptiveRequest class in the utility package. But I would still like to know why it is required, because if you request the link given in the screenshot, it should download its HTML content. A redirect would be required when that link redirects to a new page (something of the tinyurl sort).

PS: Yes, I visited the site; I get what you are saying. They are not following REST and are bringing in pages from the body arguments. You can make the change for that redirect in the AdaptiveRequest class in the utility package. It's using the requests package internally.
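A hedged sketch of what that change might look like (the class layout and method name here are guesses at the AdaptiveRequest utility, not its actual code):

    import requests

    class AdaptiveRequest:
        def get(self, url, timeout, allow_redirects=False, **kwargs):
            # pass the flag straight through to requests; existing callers keep
            # the old non-redirecting behaviour because the default is False
            return requests.get(url, timeout=timeout,
                                allow_redirects=allow_redirects, **kwargs)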

@ikoala21 (Contributor, Author):

Any resources on how I should go about adding that redirect behavior to the AdaptiveRequest package?

@rajatkb (Owner) commented Mar 26, 2020

> Any resources on how I should go about adding that redirect behavior to the AdaptiveRequest package?

Just visit that class. It's using the requests package; you can add the same allow_redirects argument there as well.

@rajatkb (Owner) commented Apr 1, 2020

@manasakalaga, since your other bug fix is approved, you can now finish this one up.

@ikoala21 (Contributor, Author) commented Apr 4, 2020

Mongo object for:

  1. Top conferences:
    (screenshot: topconf)

  2. All conferences:
    (screenshot: allconf)

@rajatkb (Owner) commented Apr 6, 2020

@manasakalaga looks clean from my end, merging it.

@rajatkb (Owner) left a comment

  • Great job! 👍

@rajatkb rajatkb merged commit b3b33dd into rajatkb:master Apr 6, 2020
Scrapper-Service automation moved this from In progress to Done Apr 6, 2020
Labels
enhancement, gssoc20, hard
Projects
Scrapper-Service

Development

Successfully merging this pull request may close these issues.

[Feat.] add new Scrapper for Guide2Research
3 participants