
exceeding maximum number of tries #465

Closed
1 task
gboeing opened this issue Dec 11, 2022 · 22 comments · Fixed by #475
Labels
bug · proxy (Proxy/Network issue. May not be exactly reproducible.)

Comments

@gboeing

gboeing commented Dec 11, 2022

Describe the bug
I have a simple script that runs once a week to collect citation counts. It has always worked until last night, when it started failing with the errors detailed below. I have tried several times over several hours on multiple machines.

To Reproduce

I have two machines. The following code fails with different errors on the different machines.

from scholarly import scholarly
query = scholarly.search_author('james watson')
author = scholarly.fill(next(query), ['publications'])

Error on machine 1 (ubuntu, python 3.9, scholarly 1.7.5):

Traceback (most recent call last):
  File "/home/g/gb/gboeing/apps/citations/app/citations.py", line 15, in <module>
    author = scholarly.fill(next(query), ['publications'])
  File "/home/g/gb/gboeing/apps/citations/lib/python3.9/site-packages/scholarly/_navigator.py", line 237, in search_authors
    soup = self._get_soup(url)
  File "/home/g/gb/gboeing/apps/citations/lib/python3.9/site-packages/scholarly/_navigator.py", line 226, in _get_soup
    html = self._get_page('https://scholar.google.com{0}'.format(url))
  File "/home/g/gb/gboeing/apps/citations/lib/python3.9/site-packages/scholarly/_navigator.py", line 175, in _get_page
    return self._get_page(pagerequest, True)
  File "/home/g/gb/gboeing/apps/citations/lib/python3.9/site-packages/scholarly/_navigator.py", line 177, in _get_page
    raise MaxTriesExceededException("Cannot Fetch from Google Scholar.")
scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar.

Error on machine 2 (ubuntu, python 3.11, scholarly 1.7.5):

Traceback (most recent call last):
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py", line 139, in load
    browsers_dict[browser_name] = get_browser_user_agents(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py", line 123, in get_browser_user_agents
    raise FakeUserAgentError(
fake_useragent.errors.FakeUserAgentError: No browser user-agent strings found for browser: chrome

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/urllib/request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/http/client.py", line 1282, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/http/client.py", line 1328, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/http/client.py", line 1277, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/http/client.py", line 1037, in _send_output
    self.send(msg)
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/http/client.py", line 975, in send
    self.connect()
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/http/client.py", line 1447, in connect
    super().connect()
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/http/client.py", line 941, in connect
    self.sock = self._create_connection(
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/socket.py", line 850, in create_connection
    raise exceptions[0]
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/socket.py", line 835, in create_connection
    sock.connect(sa)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py", line 64, in get
    urlopen(
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/urllib/request.py", line 519, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error timed out>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/geoff/Dropbox/Documents/School/Projects/Code/citations/citations.py", line 1, in <module>
    from scholarly import scholarly
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/scholarly/__init__.py", line 4, in <module>
    scholarly = _Scholarly()
                ^^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/scholarly/_scholarly.py", line 34, in __init__
    self.__nav = Navigator()
                 ^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/scholarly/_navigator.py", line 26, in __call__
    cls._instances[cls] = super(Singleton, cls).__call__(*args,
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/scholarly/_navigator.py", line 42, in __init__
    self.pm1 = ProxyGenerator()
               ^^^^^^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/scholarly/_proxy_generator.py", line 54, in __init__
    self._new_session()
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/scholarly/_proxy_generator.py", line 454, in _new_session
    'User-Agent': UserAgent().random,
                  ^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/fake.py", line 64, in __init__
    self.load()
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/fake.py", line 70, in load
    self.data_browsers = load_cached(
                         ^^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py", line 209, in load_cached
    update(path, browsers, use_cache_server=use_cache_server, verify_ssl=verify_ssl)
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py", line 203, in update
    path, load(browsers, use_cache_server=use_cache_server, verify_ssl=verify_ssl)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py", line 154, in load
    jsonLines = get(
                ^^^^
  File "/home/geoff/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py", line 87, in get
    raise FakeUserAgentError("Maximum amount of retries reached")
fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached

Expected behavior
I expected the code to succeed without error, like it used to.

Screenshots
n/a

Desktop (please complete the following information):
(see my platform and version details above in the reproduction section)

Do you plan on contributing?
Your response below will clarify whether the maintainers can expect you to fix the bug you reported.

  • Yes, I will create a Pull Request with the bugfix.

Additional context
n/a

@gboeing gboeing added the bug label Dec 11, 2022
@loiseaujc

I have precisely the same error message for a similar piece of code. Note, however, that if I run it using a Google Colab notebook, it does work. I hope it helps :)

@arunkannawadi arunkannawadi added the proxy Proxy/Network issue. May not be exactly reproducible. label Dec 12, 2022
@swhussain110

Is there any solution for this?

raise MaxTriesExceededException("Cannot Fetch from Google Scholar.")
scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar.

@arunkannawadi
Collaborator

Have you tried running this with FreeProxy or other proxy services? There are examples in the documentation.

@gboeing
Author

gboeing commented Dec 13, 2022

Yes, it is the same error with this code snippet that uses FreeProxy:

from scholarly import ProxyGenerator, scholarly

pg = ProxyGenerator()
pg.FreeProxies()
scholarly.use_proxy(pg)

query = scholarly.search_author('james watson')
author = scholarly.fill(next(query), ['publications'])
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
File ~/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py:64, in get(url, verify_ssl)
     61     context = None
     63 with contextlib.closing(
---> 64     urlopen(
     65         request,
     66         timeout=settings.HTTP_TIMEOUT,
     67         context=context,
     68     )
     69 ) as response:
     70     return response.read()

File ~/mambaforge/envs/citations/lib/python3.11/urllib/request.py:216, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    215     opener = _opener
--> 216 return opener.open(url, data, timeout)

File ~/mambaforge/envs/citations/lib/python3.11/urllib/request.py:525, in OpenerDirector.open(self, fullurl, data, timeout)
    524     meth = getattr(processor, meth_name)
--> 525     response = meth(req, response)
    527 return response

File ~/mambaforge/envs/citations/lib/python3.11/urllib/request.py:634, in HTTPErrorProcessor.http_response(self, request, response)
    633 if not (200 <= code < 300):
--> 634     response = self.parent.error(
    635         'http', request, response, code, msg, hdrs)
    637 return response

File ~/mambaforge/envs/citations/lib/python3.11/urllib/request.py:563, in OpenerDirector.error(self, proto, *args)
    562 args = (dict, 'default', 'http_error_default') + orig_args
--> 563 return self._call_chain(*args)

File ~/mambaforge/envs/citations/lib/python3.11/urllib/request.py:496, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
    495 func = getattr(handler, meth_name)
--> 496 result = func(*args)
    497 if result is not None:

File ~/mambaforge/envs/citations/lib/python3.11/urllib/request.py:643, in HTTPDefaultErrorHandler.http_error_default(self, req, fp, code, msg, hdrs)
    642 def http_error_default(self, req, fp, code, msg, hdrs):
--> 643     raise HTTPError(req.full_url, code, msg, hdrs, fp)

HTTPError: HTTP Error 403: Forbidden

During handling of the above exception, another exception occurred:

FakeUserAgentError                        Traceback (most recent call last)
File ~/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py:139, in load(browsers, use_cache_server, verify_ssl)
    138         browser_name = browser_name.lower().strip()
--> 139         browsers_dict[browser_name] = get_browser_user_agents(
    140             browser_name,
    141             verify_ssl=verify_ssl,
    142         )
    143 except Exception as exc:

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py:100, in get_browser_user_agents(browser, verify_ssl)
     97 """
     98 Retrieve browser user agent strings
     99 """
--> 100 html = get(
    101     settings.BROWSER_BASE_PAGE.format(browser=quote_plus(browser)),
    102     verify_ssl=verify_ssl,
    103 )
    104 html = html.decode("iso-8859-1")

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py:87, in get(url, verify_ssl)
     86 if attempt == settings.HTTP_RETRIES:
---> 87     raise FakeUserAgentError("Maximum amount of retries reached")
     88 else:

FakeUserAgentError: Maximum amount of retries reached

During handling of the above exception, another exception occurred:

TimeoutError                              Traceback (most recent call last)
File ~/mambaforge/envs/citations/lib/python3.11/urllib/request.py:1348, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1347 try:
-> 1348     h.request(req.get_method(), req.selector, req.data, headers,
   1349               encode_chunked=req.has_header('Transfer-encoding'))
   1350 except OSError as err: # timeout error

File ~/mambaforge/envs/citations/lib/python3.11/http/client.py:1282, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
   1281 """Send a complete request to the server."""
-> 1282 self._send_request(method, url, body, headers, encode_chunked)

File ~/mambaforge/envs/citations/lib/python3.11/http/client.py:1328, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
   1327     body = _encode(body, 'body')
-> 1328 self.endheaders(body, encode_chunked=encode_chunked)

File ~/mambaforge/envs/citations/lib/python3.11/http/client.py:1277, in HTTPConnection.endheaders(self, message_body, encode_chunked)
   1276     raise CannotSendHeader()
-> 1277 self._send_output(message_body, encode_chunked=encode_chunked)

File ~/mambaforge/envs/citations/lib/python3.11/http/client.py:1037, in HTTPConnection._send_output(self, message_body, encode_chunked)
   1036 del self._buffer[:]
-> 1037 self.send(msg)
   1039 if message_body is not None:
   1040 
   1041     # create a consistent interface to message_body

File ~/mambaforge/envs/citations/lib/python3.11/http/client.py:975, in HTTPConnection.send(self, data)
    974 if self.auto_open:
--> 975     self.connect()
    976 else:

File ~/mambaforge/envs/citations/lib/python3.11/http/client.py:1447, in HTTPSConnection.connect(self)
   1445 "Connect to a host on a given (SSL) port."
-> 1447 super().connect()
   1449 if self._tunnel_host:

File ~/mambaforge/envs/citations/lib/python3.11/http/client.py:941, in HTTPConnection.connect(self)
    940 sys.audit("http.client.connect", self, self.host, self.port)
--> 941 self.sock = self._create_connection(
    942     (self.host,self.port), self.timeout, self.source_address)
    943 # Might fail in OSs that don't implement TCP_NODELAY

File ~/mambaforge/envs/citations/lib/python3.11/socket.py:850, in create_connection(address, timeout, source_address, all_errors)
    849 if not all_errors:
--> 850     raise exceptions[0]
    851 raise ExceptionGroup("create_connection failed", exceptions)

File ~/mambaforge/envs/citations/lib/python3.11/socket.py:835, in create_connection(address, timeout, source_address, all_errors)
    834     sock.bind(source_address)
--> 835 sock.connect(sa)
    836 # Break explicitly a reference cycle

TimeoutError: timed out

During handling of the above exception, another exception occurred:

URLError                                  Traceback (most recent call last)
File ~/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py:64, in get(url, verify_ssl)
     61     context = None
     63 with contextlib.closing(
---> 64     urlopen(
     65         request,
     66         timeout=settings.HTTP_TIMEOUT,
     67         context=context,
     68     )
     69 ) as response:
     70     return response.read()

File ~/mambaforge/envs/citations/lib/python3.11/urllib/request.py:216, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    215     opener = _opener
--> 216 return opener.open(url, data, timeout)

File ~/mambaforge/envs/citations/lib/python3.11/urllib/request.py:519, in OpenerDirector.open(self, fullurl, data, timeout)
    518 sys.audit('urllib.Request', req.full_url, req.data, req.headers, req.get_method())
--> 519 response = self._open(req, data)
    521 # post-process response

File ~/mambaforge/envs/citations/lib/python3.11/urllib/request.py:536, in OpenerDirector._open(self, req, data)
    535 protocol = req.type
--> 536 result = self._call_chain(self.handle_open, protocol, protocol +
    537                           '_open', req)
    538 if result:

File ~/mambaforge/envs/citations/lib/python3.11/urllib/request.py:496, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
    495 func = getattr(handler, meth_name)
--> 496 result = func(*args)
    497 if result is not None:

File ~/mambaforge/envs/citations/lib/python3.11/urllib/request.py:1391, in HTTPSHandler.https_open(self, req)
   1390 def https_open(self, req):
-> 1391     return self.do_open(http.client.HTTPSConnection, req,
   1392         context=self._context, check_hostname=self._check_hostname)

File ~/mambaforge/envs/citations/lib/python3.11/urllib/request.py:1351, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1350 except OSError as err: # timeout error
-> 1351     raise URLError(err)
   1352 r = h.getresponse()

URLError: <urlopen error timed out>

During handling of the above exception, another exception occurred:

FakeUserAgentError                        Traceback (most recent call last)
Cell In[1], line 1
----> 1 from scholarly import ProxyGenerator, scholarly
      3 pg = ProxyGenerator()
      4 pg.FreeProxies()

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/scholarly/__init__.py:4
      2 from .data_types import Author, Publication
      3 from ._proxy_generator import ProxyGenerator, DOSException, MaxTriesExceededException
----> 4 scholarly = _Scholarly()

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/scholarly/_scholarly.py:34, in _Scholarly.__init__(self)
     32 load_dotenv(find_dotenv())
     33 self.env = os.environ.copy()
---> 34 self.__nav = Navigator()
     35 self.logger = self.__nav.logger
     36 self._journal_categories = None

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/scholarly/_navigator.py:26, in Singleton.__call__(cls, *args, **kwargs)
     24 def __call__(cls, *args, **kwargs):
     25     if cls not in cls._instances:
---> 26         cls._instances[cls] = super(Singleton, cls).__call__(*args,
     27                                                              **kwargs)
     28     return cls._instances[cls]

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/scholarly/_navigator.py:42, in Navigator.__init__(self)
     38 self._max_retries = 5
     39 # A Navigator instance has two proxy managers, each with their session.
     40 # `pm1` manages the primary, premium proxy.
     41 # `pm2` manages the secondary, inexpensive proxy.
---> 42 self.pm1 = ProxyGenerator()
     43 self.pm2 = ProxyGenerator()
     44 self._session1 = self.pm1.get_session()

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/scholarly/_proxy_generator.py:54, in ProxyGenerator.__init__(self)
     52 self._webdriver = None
     53 self._TIMEOUT = 5
---> 54 self._new_session()

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/scholarly/_proxy_generator.py:454, in ProxyGenerator._new_session(self)
    449 # Suppress the misleading traceback from UserAgent()
    450 with self._suppress_logger('fake_useragent'):
    451     _HEADERS = {
    452         'accept-language': 'en-US,en',
    453         'accept': 'text/html,application/xhtml+xml,application/xml',
--> 454         'User-Agent': UserAgent().random,
    455     }
    456 self._session.headers.update(_HEADERS)
    458 if self._proxy_works:

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/fake.py:64, in FakeUserAgent.__init__(self, cache, use_cache_server, path, fallback, browsers, verify_ssl, safe_attrs)
     61 # initial empty data
     62 self.data_browsers = {}
---> 64 self.load()

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/fake.py:70, in FakeUserAgent.load(self)
     68 with self.load.lock:
     69     if self.cache:
---> 70         self.data_browsers = load_cached(
     71             self.path,
     72             self.browsers,
     73             use_cache_server=self.use_cache_server,
     74             verify_ssl=self.verify_ssl,
     75         )
     76     else:
     77         self.data_browsers = load(
     78             self.browsers,
     79             use_cache_server=self.use_cache_server,
     80             verify_ssl=self.verify_ssl,
     81         )

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py:209, in load_cached(path, browsers, use_cache_server, verify_ssl)
    207 def load_cached(path, browsers, use_cache_server=True, verify_ssl=True):
    208     if not exist(path):
--> 209         update(path, browsers, use_cache_server=use_cache_server, verify_ssl=verify_ssl)
    211     return read(path)

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py:203, in update(path, browsers, use_cache_server, verify_ssl)
    199 def update(path, browsers, use_cache_server=True, verify_ssl=True):
    200     rm(path)
    202     write(
--> 203         path, load(browsers, use_cache_server=use_cache_server, verify_ssl=verify_ssl)
    204     )

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py:154, in load(browsers, use_cache_server, verify_ssl)
    152 try:
    153     data = {}
--> 154     jsonLines = get(
    155         settings.CACHE_SERVER,
    156         verify_ssl=verify_ssl,
    157     ).decode("utf-8")
    158     for line in jsonLines.splitlines():
    159         data.update(json.loads(line))

File ~/mambaforge/envs/citations/lib/python3.11/site-packages/fake_useragent/utils.py:87, in get(url, verify_ssl)
     80 logger.debug(
     81     "Error occurred during fetching %s",
     82     url,
     83     exc_info=exc,
     84 )
     86 if attempt == settings.HTTP_RETRIES:
---> 87     raise FakeUserAgentError("Maximum amount of retries reached")
     88 else:
     89     logger.debug(
     90         "Sleeping for %s seconds",
     91         settings.HTTP_DELAY,
     92     )

FakeUserAgentError: Maximum amount of retries reached

@gboeing
Author

gboeing commented Dec 16, 2022

Tested again today. Same errors persist both with and without using a proxy.

@giswqs

giswqs commented Dec 17, 2022

I just ran into the same error.

@jkbren

jkbren commented Dec 17, 2022

Same error on my end, but I got it running again by upgrading fake-useragent: pip install fake-useragent --upgrade

@arunkannawadi
Collaborator

Thank you @jkbren. I can confirm that I was getting the error and that, after updating fake-useragent, it works. There were a couple of updates to that library in the past few weeks, so this all makes sense.

Closing the issue for now, but if the issue persists even after upgrading fake-useragent, please feel free to reopen it.

@gboeing
Author

gboeing commented Dec 18, 2022

I upgraded to fake-useragent 1.1.1, free-proxy 1.0.6, and scholarly 1.7.6, but the same scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar error persists, unchanged.

@arunkannawadi
Collaborator

Do you still get exactly the same errors on both machines 1 and 2 as you initially posted?

@zhubonan

zhubonan commented Dec 18, 2022

I am also having this. The problem is that the equivalent of requests.get("https://scholar.google.com/citations?hl=en&user=8XFlTFIAAAAJ") gets an HTTP/1.1 429 Too Many Requests response. This happens on multiple machines that I have.
However, the same URL works in my browser and also with the httpx library.

I tried tweaking the User-Agent but had no success. Perhaps Google is blocking certain requests?
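
For reference, a minimal diagnostic sketch of the comparison described above (not part of scholarly itself); the exact status codes will depend on your IP address and on Google's rate limiting:

import requests
import httpx

# Same public profile URL as in the comment above; purely a diagnostic check.
url = "https://scholar.google.com/citations?hl=en&user=8XFlTFIAAAAJ"

r1 = requests.get(url)
print("requests:", r1.status_code, r1.reason)          # e.g. 429 Too Many Requests

r2 = httpx.get(url)
print("httpx:   ", r2.status_code, r2.reason_phrase)   # reportedly 200 OK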

@gboeing
Author

gboeing commented Dec 19, 2022

@arunkannawadi yes, here is the other machine: today I deleted and re-created its virtualenv for scholarly, but the error persists both with and without a proxy. Here's the environment package list:

Package                       Version
----------------------------- -----------
alabaster                     0.7.12
arrow                         1.2.3
async-generator               1.10
attrs                         22.1.0
Babel                         2.11.0
beautifulsoup4                4.11.1
bibtexparser                  1.4.0
certifi                       2022.12.7
charset-normalizer            2.1.1
Deprecated                    1.2.13
docutils                      0.17.1
exceptiongroup                1.0.4
fake-useragent                1.1.1
free-proxy                    1.0.6
h11                           0.14.0
idna                          3.4
imagesize                     1.4.1
importlib-metadata            5.2.0
importlib-resources           5.10.1
Jinja2                        3.1.2
lxml                          4.9.2
MarkupSafe                    2.1.1
numpy                         1.24.0
outcome                       1.2.0
packaging                     22.0
pandas                        1.5.2
pip                           20.3.4
pkg-resources                 0.0.0
Pygments                      2.13.0
pyparsing                     3.0.9
PySocks                       1.7.1
python-dateutil               2.8.2
python-dotenv                 0.21.0
pytz                          2022.7
requests                      2.28.1
scholarly                     1.7.6
selenium                      4.7.2
setuptools                    44.1.1
six                           1.16.0
sniffio                       1.3.0
snowballstemmer               2.2.0
sortedcontainers              2.4.0
soupsieve                     2.3.2.post1
sphinx                        5.3.0
sphinx-rtd-theme              1.1.1
sphinxcontrib-applehelp       1.0.2
sphinxcontrib-devhelp         1.0.2
sphinxcontrib-htmlhelp        2.0.0
sphinxcontrib-jsmath          1.0.1
sphinxcontrib-qthelp          1.0.3
sphinxcontrib-serializinghtml 1.1.5
trio                          0.22.0
trio-websocket                0.9.2
typing-extensions             4.4.0
urllib3                       1.26.13
wrapt                         1.14.1
wsproto                       1.2.0
zipp                          3.11.0

And the error again:

Traceback (most recent call last):
  File "/home/g/gb/gboeing/apps/citations/app/citations.py", line 15, in <module>
    author = scholarly.fill(next(query), ['publications'])
  File "/home/g/gb/gboeing/apps/citations/lib/python3.9/site-packages/scholarly/_navigator.py", line 237, in search_authors
    soup = self._get_soup(url)
  File "/home/g/gb/gboeing/apps/citations/lib/python3.9/site-packages/scholarly/_navigator.py", line 226, in _get_soup
    html = self._get_page('https://scholar.google.com{0}'.format(url))
  File "/home/g/gb/gboeing/apps/citations/lib/python3.9/site-packages/scholarly/_navigator.py", line 175, in _get_page
    return self._get_page(pagerequest, True)
  File "/home/g/gb/gboeing/apps/citations/lib/python3.9/site-packages/scholarly/_navigator.py", line 177, in _get_page
    raise MaxTriesExceededException("Cannot Fetch from Google Scholar.")
scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar.

@gureckis

same problem for me btw!

@arunkannawadi arunkannawadi reopened this Dec 24, 2022
@arunkannawadi
Collaborator

I can confirm that from my home network and with FreeProxies, I get the error. I can run the snippet successfully if I use ScraperAPI. I can only suppose that Google Scholar got better at detecting automated requests. If the following snippet fails, there is little chance that scholarly can fetch your page successfully.

import requests
resp = requests.get("https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=james%20watson")
if resp.status_code != 200:
    print(f"Request failed with {resp.status_code} because {resp.reason}")

I could try to experiment with the httpx library as @zhubonan mentioned, but at the moment, using one of the premium proxies seems to be the only way.
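
For completeness, a minimal sketch of the premium-proxy route mentioned above, assuming the ProxyGenerator.ScraperAPI helper shown in the scholarly proxy documentation; YOUR_SCRAPERAPI_KEY is a placeholder for a real ScraperAPI account key:

from scholarly import ProxyGenerator, scholarly

# Route all scholarly traffic through ScraperAPI instead of free proxies.
pg = ProxyGenerator()
pg.ScraperAPI("YOUR_SCRAPERAPI_KEY")  # placeholder; requires a ScraperAPI subscription
scholarly.use_proxy(pg)

query = scholarly.search_author('james watson')
author = scholarly.fill(next(query), ['publications'])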

@zhubonan

Thanks for looking into this.
Any idea how Google detects the request? When I use the browser, or even just curl, the page loads perfectly fine.

@zhubonan

I have updated to 1.7.10 and the error no longer happens.

@arunkannawadi
Collaborator

Thank you for confirming. This issue must have been fixed since 1.7.8. I'll close this issue now.

To answer your earlier question, I do not understand how Google Scholar detects requests, but it was not responding to any requests sent from Python's requests library, even ones that are not forbidden by their policy. Changing the underlying library helped.

@gboeing
Author

gboeing commented Jan 16, 2023

Confirmed working for me now as well.

@syheliel

The problem still exists when I'm using Colab; here is my code:
https://colab.research.google.com/drive/1cJVvQBGLyKVNBI9YmgMgoW3ecNPYeFZE?usp=sharing

@arunkannawadi
Collaborator

@syheliel please read our documentation. You're running queries that Google Scholar actively blocks without using proxies, which can get your IP address banned temporarily.

@AndreaUnige

Hi @arunkannawadi, @gboeing , @zhubonan ,
I am trying the same query but got the same error:

raise MaxTriesExceededException("Cannot Fetch from Google Scholar.")
scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar.

I am using scholarly 1.7.11, Ubuntu 2022, Python 3.10.

The code I'm trying is the simplest:

from scholarly import ProxyGenerator, scholarly

pg = ProxyGenerator()
pg.FreeProxies()
scholarly.use_proxy(pg)

search_query = scholarly.search_pubs('Perception of physical stability and center of mass of 3D objects')
scholarly.pprint(next(search_query))

Is anyone able to help here?
Many thanks guys!

@abubelinha

abubelinha commented May 2, 2023

Yes, I can confirm the error persists after upgrading both scholarly and fake_useragent.

I am on Windows 7, Python 3.8, scholarly 1.7.11, fake_useragent 1.1.3.

Side question: is it possible to get the scholarly version from code? (I tried scholarly.__version__ but it didn't work.)

PS - FWIW, I think updating fake_useragent might not be that relevant for this issue.
The few times I got a successful FreeProxies() scraping run, I was still using fake-useragent 0.1.11 (see #500).
That success happened yesterday, before I upgraded fake_useragent as suggested above
(and now that I have fake-useragent 1.1.3, I keep getting MaxTriesExceededException).
So I think getting FreeProxies() to work is just a matter of waiting for your lucky moment.
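
On the side question, a generic way to read an installed package's version when it does not expose __version__, using only the standard library's packaging metadata (nothing scholarly-specific):

from importlib.metadata import version  # standard library on Python 3.8+

# Reads the version recorded by pip/conda for the installed distribution.
print(version("scholarly"))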
