
scrapy authentication login with cookies not working as expected #5597

Closed
okoliechykwuka opened this issue Aug 12, 2022 · 15 comments
Comments

@okoliechykwuka

okoliechykwuka commented Aug 12, 2022

Description

Sending a request to a page after log-in does not seem to maintain the user session. I am able to log in, but when I send a request to another page on the same site, the response shows that I am logged out. I can do this seamlessly with Selenium, but replicating the same result in Scrapy has been a hard nut to crack. So far, I have tried three different methods:

  1. Using Scrapy to get cookies from a request and passing them to the next request.

  2. Using a Selenium driver to get cookies from a request and passing them to the next Scrapy request.

  3. Using Scrapy's built-in cookiejar support.
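A minimal sketch of method 2, assuming a logged-in Selenium driver: the helper below only reshapes `driver.get_cookies()` (a list of dicts) into the mapping that `scrapy.Request(cookies=...)` accepts; the spider wiring in the trailing comment is illustrative.

```python
def selenium_cookies_to_dict(selenium_cookies):
    """Reshape driver.get_cookies() output into {name: value} for Scrapy."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}

# Illustrative wiring inside a spider callback, assuming `driver` is a
# Selenium session that has already logged in:
#
#     yield scrapy.Request(
#         url,
#         cookies=selenium_cookies_to_dict(driver.get_cookies()),
#         callback=self.parse_item,
#     )
```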

With Scrapy

Steps to Reproduce

  1. scrapy startproject oddsportal and cd into oddsportal
  2. scrapy genspider oddsportal oddsportal.com
  3. scrapy crawl oddsportal

Expected behavior:

Code.

import logging

import scrapy
from scrapy.http import FormRequest
from scrapy.spiders.init import InitSpider
from scrapy.utils.response import open_in_browser
logger = logging.getLogger()


class OddsportalSpider(InitSpider):

    name = 'oddsportal'
    allowed_domains = ['oddsportal.com']  
    start_urls = ['http://oddsportal.com/results/']
    login_page = 'http://oddsportal.com/login'

    
    # crawler's entry point
    def start_requests(self):
        """Called before crawling starts; try to log in."""
        yield scrapy.Request(
            url=self.login_page,
            callback=self.parse,
            dont_filter=True,
        )

    # parse response
    def parse(self, response):
        """Generate a login request."""
        yield FormRequest.from_response(
            response=response,
            formdata={
                'login-username': 'xxxxx',
                'login-password': 'xxxxxx',
                'login-submit': '',
            },
            clickdata={'type': 'submit'},
            callback=self.after_login,
        )

    def after_login(self, response):
        if b"Wrong username or password" in response.body:
            logger.warning("LOGIN ATTEMPT FAILED")
            return
        logger.info("LOGIN ATTEMPT SUCCESSFUL")
        url = 'http://oddsportal.com/results/'
        # The original passed cookies=self.parse_cookies(raw_cookies), but
        # neither name is defined anywhere in the spider; rely on Scrapy's
        # CookiesMiddleware to carry the session cookies instead.
        yield scrapy.Request(url=url, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        print('Thissssssssss----------------------', response.url)
        open_in_browser(response)

I expect the response to another page within the site to maintain the user session, but it shows that I am logged out.
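One way to verify whether the session cookie is actually attached to the follow-up request is Scrapy's COOKIES_DEBUG setting, which logs every Cookie header sent and Set-Cookie header received; a settings.py fragment:

```python
# settings.py fragment: log all cookies sent and received, so the follow-up
# request to /results/ can be inspected for the session cookie.
COOKIES_DEBUG = True
```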


Scrapy version: 2.6.2

2022-08-14 20:15:12 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: oddsportal_website)
2022-08-14 20:15:12 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.4.0, Python 3.10.5 (tags/v3.10.5:f377153, Jun  6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Windows-10-10.0.22621-SP0
2022-08-14 20:15:12 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'oddsportal_website',
 'NEWSPIDER_MODULE': 'oddsportal_website.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['oddsportal_website.spiders'],
 'USER_AGENT': 'oddsportal_website (+http://www.yourdomain.com)'}
2022-08-14 20:15:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-08-14 20:15:12 [scrapy.extensions.telnet] INFO: Telnet Password: 8b8c5cfdad12fa21
2022-08-14 20:15:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-08-14 20:15:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-08-14 20:15:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-08-14 20:15:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-08-14 20:15:13 [scrapy.core.engine] INFO: Spider opened
2022-08-14 20:15:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-08-14 20:15:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-08-14 20:15:14 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://oddsportal.com/robots.txt> from <GET http://oddsportal.com/robots.txt>
2022-08-14 20:15:14 [py.warnings] WARNING: C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\scrapingenv\lib\site-packages\scrapy\core\engine.py:279: ScrapyDeprecationWarning: Passing a 'spider' argument to ExecutionEngine.download is deprecated
  return self.download(result, spider) if isinstance(result, Request) else result

2022-08-14 20:15:15 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.oddsportal.com/robots.txt> from <GET https://oddsportal.com/robots.txt>
2022-08-14 20:15:15 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.oddsportal.com/robots.txt> from <GET http://www.oddsportal.com/robots.txt>
2022-08-14 20:15:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.oddsportal.com/robots.txt> (referer: None)
2022-08-14 20:15:16 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://oddsportal.com/login> from <GET http://oddsportal.com/login>
2022-08-14 20:15:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.oddsportal.com/login> from <GET https://oddsportal.com/login>
2022-08-14 20:15:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.oddsportal.com/robots.txt> from <GET http://www.oddsportal.com/robots.txt>
2022-08-14 20:15:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.oddsportal.com/robots.txt> (referer: None)
2022-08-14 20:15:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.oddsportal.com/login> from <GET http://www.oddsportal.com/login>
2022-08-14 20:15:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.oddsportal.com/login/> from <GET https://www.oddsportal.com/login>
2022-08-14 20:15:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.oddsportal.com/login/> from <GET http://www.oddsportal.com/login/>
2022-08-14 20:15:19 [filelock] DEBUG: Attempting to acquire lock 2195051637024 on C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\scrapingenv\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-14 20:15:19 [filelock] DEBUG: Lock 2195051637024 acquired on C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\scrapingenv\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-14 20:15:19 [filelock] DEBUG: Attempting to release lock 2195051637024 on C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\scrapingenv\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-14 20:15:19 [filelock] DEBUG: Lock 2195051637024 released on C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\scrapingenv\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-08-14 20:15:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.oddsportal.com/login/> (referer: None)  
2022-08-14 20:15:19 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.oddsportal.com/login/> from <POST https://www.oddsportal.com/login/>
2022-08-14 20:15:19 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.oddsportal.com/settings/> from <GET https://www.oddsportal.com/login/>
2022-08-14 20:15:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.oddsportal.com/settings/> (referer: https://www.oddsportal.com/login/)
2022-08-14 20:15:20 [root] INFO: LOGIN ATTEMPT SUCCESSFUL
2022-08-14 20:15:20 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://oddsportal.com/results/> from <GET http://oddsportal.com/results/>
2022-08-14 20:15:21 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.oddsportal.com/results/> from <GET https://oddsportal.com/results/>
2022-08-14 20:15:21 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.oddsportal.com/results/> from <GET http://www.oddsportal.com/results/>
2022-08-14 20:15:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.oddsportal.com/results/> (referer: None)
Thissssssssss---------------------- https://www.oddsportal.com/results/
2022-08-14 20:15:26 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-14 20:15:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 5970,
 'downloader/request_count': 19,
 'downloader/request_method_count/GET': 18,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 258376,
 'downloader/response_count': 19,
 'downloader/response_status_count/200': 5,
 'downloader/response_status_count/301': 12,
 'downloader/response_status_count/302': 2,
 'elapsed_time_seconds': 12.83456,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 8, 14, 19, 15, 26, 708672),
 'httpcompression/response_bytes': 2323595,
 'httpcompression/response_count': 5,
 'log_count/DEBUG': 24,
 'log_count/INFO': 11,
 'log_count/WARNING': 1,
 'request_depth_max': 2,
 'response_received_count': 5,
 'robotstxt/request_count': 2,
 'robotstxt/response_count': 2,
 'robotstxt/response_status_count/200': 2,
 'scheduler/dequeued': 13,
 'scheduler/dequeued/memory': 13,
 'scheduler/enqueued': 13,
 'scheduler/enqueued/memory': 13,
 'start_time': datetime.datetime(2022, 8, 14, 19, 15, 13, 874112)}
2022-08-14 20:15:26 [scrapy.core.engine] INFO: Spider closed (finished)
@Gallaecio
Member

Gallaecio commented Aug 12, 2022

#5596

I thought maybe you were passing cookies through a header because of the headers variable, but I see now you have further code doing something else.

Please, refactor your report to make your example minimal.

Gallaecio reopened this Aug 12, 2022
@okoliechykwuka
Author

okoliechykwuka commented Aug 12, 2022

@Gallaecio , I just refactored my code. How can I resolve the issue of getting logged out after I am authenticated by a site?

@Gallaecio
Member

Your example is still not minimal. Remove any code that is not required to reproduce the issue.
Also, I am assuming that you believe there is an issue in Scrapy itself, not in your code; otherwise, please see https://docs.scrapy.org/en/latest/index.html#getting-help

@okoliechykwuka
Author

@Gallaecio I feel this has to do with Scrapy itself, because I have checked similar questions on SO and no solution was provided.

@okoliechykwuka
Author

@Gallaecio Please, I need help authenticating user sessions using Scrapy.

@wRAR
Member

wRAR commented Aug 14, 2022

The provided log shows that the login page wasn't even requested successfully (the log is truncated, but I doubt the missing part contains getting a successful response).

And even if the original code works (we can't test the provided one, because that would need an account), it doesn't actually check that logging in works, only that "Wrong username or password" isn't returned.

I feel this has got to do with scrapy because I have checked similar questions on SO and no solution was provided.

Such as?

Please I need help authenticating user sessions using scrapy.

It's a builtin feature, used by others successfully.

@okoliechykwuka
Author

Hi @wRAR, I just updated the question with the full logs; as you can see in the log, I successfully logged in.

Here are dummy account details for you to test with.

username = chuky

password = A151515a

@wRAR
Member

wRAR commented Aug 14, 2022

http://oddsportal.com/results/ behaves the same in the browser without JS: it doesn't show you are logged in. This is not a bug.

wRAR closed this as not planned Aug 14, 2022
@okoliechykwuka
Author

@wRAR It does not behave the same way in the browser; it shows that I am logged in. Try signing in with the above details and navigating to the link you said behaves the same way.

Selenium maintains the user session after log-in; why does Scrapy log me out when a new request is spawned after log-in?

I have compared my browser cookies with Scrapy's and I can't find any difference, so why am I logged out?

@wRAR
Member

wRAR commented Aug 15, 2022

It does not behave the same way in the browser; it shows that I am logged in.

It does not show that when JS is disabled.

Selenium maintains the user session after log-in; why does Scrapy log me out when a new request is spawned after log-in?

You are yet to prove that Scrapy "logs out"; I believe that it doesn't. When you have a minimal reproducible example for this claim about Scrapy, you can provide it. Until then, this will stay closed as not a Scrapy bug.

And if you still think it's a Scrapy bug, you can try to reproduce this on a different, simpler website.

@okoliechykwuka
Author

okoliechykwuka commented Aug 15, 2022

Hi @wRAR,

Here are minimal examples from both Selenium and Scrapy with their responses after login.

Selenium Example.

from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

# define our URL and credentials
url = 'https://www.oddsportal.com/login/'
username = 'chuky'
password = 'A151515a'
path = r'C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\chromedriver.exe'
webdriver_service = Service(path)
options = ChromeOptions()

browser = Chrome(service=webdriver_service, options=options)

browser.get(url)
browser.implicitly_wait(2)
browser.find_element(By.ID, 'onetrust-accept-btn-handler').click()
browser.find_element(By.ID,'login-username1').send_keys(username)
browser.find_element(By.ID,'login-password1').send_keys(password)
browser.implicitly_wait(10)
browser.find_element(By.XPATH, '//*[@id="col-content"]//button[@class="inline-btn-2"]').click()

print('successful login')
browser.implicitly_wait(10)
browser.get('https://www.oddsportal.com/results/')

[screenshot: browser after the Selenium login]

Scrapy Example

import logging

import scrapy
from scrapy.http import FormRequest
from scrapy.spiders import CrawlSpider
from scrapy.utils.response import open_in_browser

logger = logging.getLogger(__name__)


class OddsportalSpider(CrawlSpider):
    name = 'oddsportal'
    allowed_domains = ['oddsportal.com']
    # start_urls = ['http://oddsportal.com/results/']
    login_page = 'https://www.oddsportal.com/login/'

    def start_requests(self):
        """Called before crawling starts; try to log in."""
        yield scrapy.Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True,
        )

    def login(self, response):
        """Generate a login request."""
        yield FormRequest.from_response(
            response=response,
            formdata={
                'login-username': 'chuky',
                'login-password': 'A151515a',
                'login-submit': '',
            },
            callback=self.after_login,
            dont_filter=True,
        )

    def after_login(self, response):
        """Check whether log-in succeeded, then spawn a request to /results/."""
        if b"Log in to save and share your coupons." in response.body:
            logger.warning("LOGIN ATTEMPT FAILED")
            return
        # Logging in redirects to https://www.oddsportal.com/settings/;
        # spawn a new request to https://www.oddsportal.com/results/
        logger.info("LOGIN ATTEMPT SUCCESSFUL")
        url = 'https://www.oddsportal.com/results/'
        return scrapy.Request(url=url, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        print('Thissssssssss----------------------', response.url)
        open_in_browser(response)

As proof that I logged in, I tried logging in with a wrong username, which resulted in the logger.warning("LOGIN ATTEMPT FAILED") response.

[screenshot: LOGIN ATTEMPT FAILED log output]

But when I switch my request to https://www.oddsportal.com/settings/ after log-in, I get this.

[screenshot: response after switching the request to /settings/]

@wRAR
Member

wRAR commented Aug 15, 2022

You've again missed my point about disabling JS.

If you worry about the "Login" link, disable JS when looking at the page opened by open_in_browser and it will disappear. It's not actually present on the page, which you can check by examining the response you got.
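The suggestion to examine the response itself, rather than the rendered page, can be sketched with a stdlib-only check for a /login/ anchor in the raw HTML; the helper names here are made up:

```python
from html.parser import HTMLParser


class _LoginLinkFinder(HTMLParser):
    """Flags any <a> tag whose href points at /login/."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(
            name == "href" and value and "/login/" in value
            for name, value in attrs
        ):
            self.found = True


def has_login_link(html):
    """True if the server-rendered page still contains a /login/ link."""
    parser = _LoginLinkFinder()
    parser.feed(html)
    return parser.found

# In a Scrapy callback, has_login_link(response.text) returning False would
# suggest the page no longer offers the login link, i.e. the session held.
```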

@okoliechykwuka
Author

I just examined the response generated from my request using SplashRequest instead of scrapy.Request, and it displayed the Login link, which shows that I am logged out.

@wRAR
Member

wRAR commented Aug 15, 2022

Sure, Splash doesn't reuse Scrapy cookies.

This is my last comment here until I see any actual reproducible Scrapy problem.

@elacuesta
Member

You seem to have experimented with scrapy-playwright as well (scrapy-plugins/scrapy-playwright#110); why did you scratch that approach?
