
scrapy authentication login with cookies not working as expected #5597

Closed
okoliechykwuka opened this issue Aug 12, 2022 · 15 comments
Comments

@okoliechykwuka

okoliechykwuka commented Aug 12, 2022

Description

Sending a request to a page after log-in does not seem to maintain the user session. I am able to log in, but when I send a request to another page on the same site, the response shows that I am logged out. I can do this seamlessly with Selenium, but replicating the same result in Scrapy has been a hard nut to crack. So far, I have tried three different methods:

  1. Using Scrapy to get cookies from a request and passing them to the next request.

  2. Using a Selenium driver to get cookies from a request and passing them to the next Scrapy request.

  3. Using Scrapy's built-in cookiejar support.
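A minimal sketch of method 2, assuming a logged-in Selenium driver: the helper below only reshapes `driver.get_cookies()` (a list of dicts) into the mapping that `scrapy.Request(cookies=...)` accepts; the spider wiring in the trailing comment is illustrative.

```python
def selenium_cookies_to_dict(selenium_cookies):
    """Reshape driver.get_cookies() output into {name: value} for Scrapy."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}

# Illustrative wiring inside a spider callback, assuming `driver` is a
# Selenium session that has already logged in:
#
#     yield scrapy.Request(
#         url,
#         cookies=selenium_cookies_to_dict(driver.get_cookies()),
#         callback=self.parse_item,
#     )
```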

With Scrapy

Steps to Reproduce

  1. scrapy startproject oddsportal and cd into oddsportal
  2. scrapy genspider oddsportal oddsportal.com
  3. scrapy crawl oddsportal

Expected behavior:

Code.

import logging

import scrapy
from scrapy.http import FormRequest
from scrapy.spiders.init import InitSpider
from scrapy.utils.response import open_in_browser
logger = logging.getLogger()


class OddsportalSpider(InitSpider):

    name = 'oddsportal'
    allowed_domains = ['oddsportal.com']  
    start_urls = ['http://oddsportal.com/results/']
    login_page = 'http://oddsportal.com/login'

    
    # crawler's entry point
    def start_requests(self):
        """Called before crawling starts; try to log in."""
        yield scrapy.Request(
            url=self.login_page,
            callback=self.parse,
            dont_filter=True,
        )

    # parse response
    def parse(self, response):
        """Generate a login request."""
        yield FormRequest.from_response(
            response=response,
            formdata={
                'login-username': 'xxxxx',
                'login-password': 'xxxxxx',
                'login-submit': '',
            },
            clickdata={'type': 'submit'},
            callback=self.after_login,
        )

    def after_login(self, response):
        if b"Wrong username or password" in response.body:
            logger.warning("LOGIN ATTEMPT FAILED")
            return
        logger.info("LOGIN ATTEMPT SUCCESSFUL")
        url = 'http://oddsportal.com/results/'
        # The original passed cookies=self.parse_cookies(raw_cookies), but
        # neither name is defined anywhere in the spider; rely on Scrapy's
        # CookiesMiddleware to carry the session cookies instead.
        yield scrapy.Request(url=url, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        print('Thissssssssss----------------------', response.url)
        open_in_browser(response)

I expect the response to another page within the site to maintain the user session, but it shows that I am logged out.
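One way to verify whether the session cookie is actually attached to the follow-up request is Scrapy's COOKIES_DEBUG setting, which logs every Cookie header sent and Set-Cookie header received; a settings.py fragment:

```python
# settings.py fragment: log all cookies sent and received, so the follow-up
# request to /results/ can be inspected for the session cookie.
COOKIES_DEBUG = True
```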


Scrapy version: 2.6.2

2022-08-14 20:15:12 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: oddsportal_website)
2022-08-14 20:15:12 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.4.0, Python 3.10.5 (tags/v3.10.5:f377153, Jun  6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Windows-10-10.0.22621-SP0
2022-08-14 20:15:12 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'oddsportal_website',
 'NEWSPIDER_MODULE': 'oddsportal_website.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['oddsportal_website.spiders'],
 'USER_AGENT': 'oddsportal_website (+http://www.yourdomain.com)'}
2022-08-14 20:15:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-08-14 20:15:12 [scrapy.extensions.telnet] INFO: Telnet Password: 8b8c5cfdad12fa21
2022-08-14 20:15:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-08-14 20:15:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-08-14 20:15:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-08-14 20:15:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-08-14 20:15:13 [scrapy.core.engine] INFO: Spider opened
2022-08-14 20:15:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-08-14 20:15:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-08-14 20:15:14 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://oddsportal.com/robots.txt> from <GET http://oddsportal.com/robots.txt>
2022-08-14 20:15:14 [py.warnings] WARNING: C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\scrapingenv\lib\site-packages\scrapy\core\engine.py:279: ScrapyDeprecationWarning: Passing a 'spider' argument to ExecutionEngine.download is deprecated
  return self.download(result, spider) if isinstance(result, Request) else result

2022-08-14 20:15:15 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.oddsportal.com/robots.txt> from <GET https://oddsportal.com/robots.txt>
2022-08-14 20:15:15 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.oddsportal.com/robots.txt> from <GET http://www.oddsportal.com/robots.txt>
2022-08-14 20:15:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.oddsportal.com/robots.txt> (referer: None)
2022-08-14 20:15:16 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://oddsportal.com/login> from <GET http://oddsportal.com/login>
2022-08-14 20:15:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.oddsportal.com/login> from <GET https://oddsportal.com/login>
2022-08-14 20:15:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.oddsportal.com/robots.txt> from <GET http://www.oddsportal.com/robots.txt>
2022-08-14 20:15:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.oddsportal.com/robots.txt> (referer: None)
2022-08-14 20:15:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.oddsportal.com/login> from <GET http://www.oddsportal.com/login>
2022-08-14 20:15:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.oddsportal.com/login/> from <GET https://www.oddsportal.com/login>
2022-08-14 20:15:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.oddsportal.com/login/> from <GET http://www.oddsportal.com/login/>
2022-08-14 20:15:19 [filelock] DEBUG: Attempting to acquire lock 2195051637024 on C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\scrapingenv\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-14 20:15:19 [filelock] DEBUG: Lock 2195051637024 acquired on C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\scrapingenv\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-14 20:15:19 [filelock] DEBUG: Attempting to release lock 2195051637024 on C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\scrapingenv\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-14 20:15:19 [filelock] DEBUG: Lock 2195051637024 released on C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\scrapingenv\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-14 20:15:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.oddsportal.com/login/> (referer: None)  
2022-08-14 20:15:19 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.oddsportal.com/login/> from <POST https://www.oddsportal.com/login/>
2022-08-14 20:15:19 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.oddsportal.com/settings/> from <GET https://www.oddsportal.com/login/>
2022-08-14 20:15:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.oddsportal.com/settings/> (referer: https://www.oddsportal.com/login/)
2022-08-14 20:15:20 [root] INFO: LOGIN ATTEMPT SUCCESSFUL
2022-08-14 20:15:20 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://oddsportal.com/results/> from <GET http://oddsportal.com/results/>
2022-08-14 20:15:21 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.oddsportal.com/results/> from <GET https://oddsportal.com/results/>
2022-08-14 20:15:21 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.oddsportal.com/results/> from <GET http://www.oddsportal.com/results/>
2022-08-14 20:15:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.oddsportal.com/results/> (referer: None)
Thissssssssss---------------------- https://www.oddsportal.com/results/
2022-08-14 20:15:26 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-14 20:15:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 5970,
 'downloader/request_count': 19,
 'downloader/request_method_count/GET': 18,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 258376,
 'downloader/response_count': 19,
 'downloader/response_status_count/200': 5,
 'downloader/response_status_count/301': 12,
 'downloader/response_status_count/302': 2,
 'elapsed_time_seconds': 12.83456,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 8, 14, 19, 15, 26, 708672),
 'httpcompression/response_bytes': 2323595,
 'httpcompression/response_count': 5,
 'log_count/DEBUG': 24,
 'log_count/INFO': 11,
 'log_count/WARNING': 1,
 'request_depth_max': 2,
 'response_received_count': 5,
 'robotstxt/request_count': 2,
 'robotstxt/response_count': 2,
 'robotstxt/response_status_count/200': 2,
 'scheduler/dequeued': 13,
 'scheduler/dequeued/memory': 13,
 'scheduler/enqueued': 13,
 'scheduler/enqueued/memory': 13,
 'start_time': datetime.datetime(2022, 8, 14, 19, 15, 13, 874112)}
2022-08-14 20:15:26 [scrapy.core.engine] INFO: Spider closed (finished)
@Gallaecio
Member

Gallaecio commented Aug 12, 2022

#5596

I thought maybe you were passing cookies through a header because of the headers variable, but I see now you have further code doing something else.

Please, refactor your report to make your example minimal.

Gallaecio reopened this Aug 12, 2022
@okoliechykwuka
Author

okoliechykwuka commented Aug 12, 2022

@Gallaecio , I just refactored my code. How can I resolve the issue of getting logged out after I am authenticated by a site?

@Gallaecio
Member

Your example is still not minimal. Remove any code that is not required to reproduce the issue.
Also, I am assuming that you believe there is an issue in Scrapy itself, not in your code; otherwise, please see https://docs.scrapy.org/en/latest/index.html#getting-help

@okoliechykwuka
Author

@Gallaecio I feel this has to do with Scrapy itself, because I have checked similar questions on SO and no solution was provided.

@okoliechykwuka
Author

@Gallaecio Please, I need help authenticating user sessions using Scrapy.

@wRAR
Member

wRAR commented Aug 14, 2022

The provided log shows that the login page wasn't even requested successfully (the log is truncated, but I doubt the missing part contains getting a successful response).

And even if the original code works (we can't test the provided one, because that would need an account), it doesn't actually check that logging in works, only that "Wrong username or password" isn't returned.

I feel this has got to do with scrapy because I have checked similar questions on SO and no solution was provided.

Such as?

Please I need help authenticating user sessions using scrapy.

It's a builtin feature, used by others successfully.

@okoliechykwuka
Author

Hi @wRAR, I just updated the question with the full logs; as you can see in the log, I successfully logged in.

Here are dummy account details for you to test with.

username = chuky

password = A151515a

@wRAR
Member

wRAR commented Aug 14, 2022

http://oddsportal.com/results/ behaves the same in the browser without JS: it doesn't show you are logged in. This is not a bug.

wRAR closed this as not planned Aug 14, 2022
@okoliechykwuka
Author

@wRAR It does not behave the same way in the browser; it shows that I am logged in. Try signing in with the above details and navigating to the link you said behaves the same way.

Selenium maintains the user session after log-in; why does Scrapy log me out when a new request is spawned after log-in?

I have compared my browser cookies with Scrapy's and I can't find any difference, so why am I logged out?

@wRAR
Member

wRAR commented Aug 15, 2022

It does not behave the same way in the browser; it shows that I am logged in.

It does not show that when JS is disabled.

Selenium maintains the user session after log-in; why does Scrapy log me out when a new request is spawned after log-in?

You are yet to prove that Scrapy "logs out"; I believe that it doesn't. When you have a minimal reproducible example for this claim about Scrapy, you can provide it. Until then, this will stay closed as not a Scrapy bug.

And if you still think it's a Scrapy bug, you can try to reproduce this on a different, simpler website.

@okoliechykwuka
Author

okoliechykwuka commented Aug 15, 2022

Hi @wRAR,

Here are minimal examples from both Selenium and Scrapy with their responses after login.

Selenium Example.

from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

# define our URL and credentials
url = 'https://www.oddsportal.com/login/'
username = 'chuky'
password = 'A151515a'
path = r'C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\chromedriver.exe'
webdriver_service = Service(path)
options = ChromeOptions()

browser = Chrome(service=webdriver_service, options=options)

browser.get(url)
browser.implicitly_wait(2)
browser.find_element(By.ID, 'onetrust-accept-btn-handler').click()
browser.find_element(By.ID,'login-username1').send_keys(username)
browser.find_element(By.ID,'login-password1').send_keys(password)
browser.implicitly_wait(10)
browser.find_element(By.XPATH, '//*[@id="col-content"]//button[@class="inline-btn-2"]').click()

print('successful login')
browser.implicitly_wait(10)
browser.get('https://www.oddsportal.com/results/')

[screenshot: browser after the Selenium login]

Scrapy Example

import logging

import scrapy
from scrapy.http import FormRequest
from scrapy.spiders import CrawlSpider
from scrapy.utils.response import open_in_browser

logger = logging.getLogger(__name__)


class OddsportalSpider(CrawlSpider):
    name = 'oddsportal'
    allowed_domains = ['oddsportal.com']
    # start_urls = ['http://oddsportal.com/results/']
    login_page = 'https://www.oddsportal.com/login/'

    def start_requests(self):
        """Called before crawling starts; try to log in."""
        yield scrapy.Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True,
        )

    def login(self, response):
        """Generate a login request."""
        yield FormRequest.from_response(
            response=response,
            formdata={
                'login-username': 'chuky',
                'login-password': 'A151515a',
                'login-submit': '',
            },
            callback=self.after_login,
            dont_filter=True,
        )

    def after_login(self, response):
        """Check whether log-in succeeded, then spawn a request to /results/."""
        if b"Log in to save and share your coupons." in response.body:
            logger.warning("LOGIN ATTEMPT FAILED")
            return
        # Logging in redirects to https://www.oddsportal.com/settings/;
        # spawn a new request to https://www.oddsportal.com/results/
        logger.info("LOGIN ATTEMPT SUCCESSFUL")
        url = 'https://www.oddsportal.com/results/'
        return scrapy.Request(url=url, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        print('Thissssssssss----------------------', response.url)
        open_in_browser(response)

As proof that I logged in, I tried logging in with a wrong username, which resulted in the logger.warning("LOGIN ATTEMPT FAILED") response.

[screenshot: LOGIN ATTEMPT FAILED log output]

But when I switch my request to https://www.oddsportal.com/settings/ after log-in, I get this.

[screenshot: response after switching the request to /settings/]

@wRAR
Member

wRAR commented Aug 15, 2022

You've again missed my point about disabling JS.

If you worry about the "Login" link, disable JS when looking at the page opened by open_in_browser and it will disappear. It's not actually present on the page, which you can check by examining the response you got.
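The suggestion to examine the response itself, rather than the rendered page, can be sketched with a stdlib-only check for a /login/ anchor in the raw HTML; the helper names here are made up:

```python
from html.parser import HTMLParser


class _LoginLinkFinder(HTMLParser):
    """Flags any <a> tag whose href points at /login/."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(
            name == "href" and value and "/login/" in value
            for name, value in attrs
        ):
            self.found = True


def has_login_link(html):
    """True if the server-rendered page still contains a /login/ link."""
    parser = _LoginLinkFinder()
    parser.feed(html)
    return parser.found

# In a Scrapy callback, has_login_link(response.text) returning False would
# suggest the page no longer offers the login link, i.e. the session held.
```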

@okoliechykwuka
Author

I just examined the response generated from my request using SplashRequest instead of scrapy.Request, and it displayed the Login link, which shows that I am logged out.

@wRAR
Member

wRAR commented Aug 15, 2022

Sure, Splash doesn't reuse Scrapy cookies.

This is my last comment here until I see any actual reproducible Scrapy problem.

@elacuesta
Member

You seem to have experimented with scrapy-playwright as well (scrapy-plugins/scrapy-playwright#110); why did you scratch that approach?
