Mongo, improved proxies, and updated scraping logic #49

Merged
merged 35 commits into from Jun 30, 2019


iloveitaly
Collaborator

Lots of improvements!

SW stopped flying to MEX. This was causing false negatives.
Run with `node console.js`
I'm not sure what a 1 second cooldown was supposed to prevent. Duplicate
messages from the same flight?
This isn't ruby
@samyun
Owner

samyun commented Jun 9, 2019

Awesome! I’ll look through this and merge it in by tonight.

puppeteer-extra has the same workarounds wrapped in a package
@razzamatazm

This looks great. I have tried a previously working proxy setup (with both hostname and port) and one through illuminati.io, and I am getting the following errors, along with the price not updating:

Jun 11 13:41:15 swacheck2 app/scheduler.9385: > southwest-price-drop-bot@3.1.4 task:check /app
Jun 11 13:41:15 swacheck2 app/scheduler.9385: > node --trace-warnings tasks/check.js
Jun 11 13:41:16 swacheck2 app/scheduler.9385: (node:23) UnhandledPromiseRejectionWarning: Error: Invalid "proxyUrl" option: the URL must contain both hostname and port.
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Object.anonymizeProxy (/app/node_modules/proxy-chain/build/anonymize_proxy.js:32:15)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at module.exports (/app/lib/browser.js:10:39)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at /app/tasks/check.js:12:23
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Object.<anonymous> (/app/tasks/check.js:75:3)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Module._compile (internal/modules/cjs/loader.js:774:30)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Object.Module._extensions..js (internal/modules/cjs/loader.js:785:10)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Module.load (internal/modules/cjs/loader.js:641:32)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Function.Module._load (internal/modules/cjs/loader.js:556:12)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Function.Module.runMain (internal/modules/cjs/loader.js:837:10)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at internal/main/run_main_module.js:17:11
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at emitWarning (internal/process/promises.js:120:15)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at processPromiseRejections (internal/process/promises.js:168:7)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at processTicksAndRejections (internal/process/task_queues.js:90:32)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: (node:23) Error: Invalid "proxyUrl" option: the URL must contain both hostname and port.
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Object.anonymizeProxy (/app/node_modules/proxy-chain/build/anonymize_proxy.js:32:15)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at module.exports (/app/lib/browser.js:10:39)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at /app/tasks/check.js:12:23
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Object.<anonymous> (/app/tasks/check.js:75:3)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Module._compile (internal/modules/cjs/loader.js:774:30)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Object.Module._extensions..js (internal/modules/cjs/loader.js:785:10)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Module.load (internal/modules/cjs/loader.js:641:32)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Function.Module._load (internal/modules/cjs/loader.js:556:12)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at Function.Module.runMain (internal/modules/cjs/loader.js:837:10)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at internal/main/run_main_module.js:17:11
Jun 11 13:41:16 swacheck2 app/scheduler.9385: (node:23) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at emitDeprecationWarning (internal/process/promises.js:134:13)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at emitWarning (internal/process/promises.js:127:3)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at processPromiseRejections (internal/process/promises.js:168:7)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: at processTicksAndRejections (internal/process/task_queues.js:90:32)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: mongo successfully connected!
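
The trace above shows the proxy error escaping as an unhandled promise rejection, so the failure only surfaces as a warning. A minimal sketch of catching it at the top level and turning it into an exit code; `runTask` is a hypothetical wrapper, not code from `tasks/check.js`:

```javascript
// Hypothetical wrapper (an assumption, not the repo's code): run an async
// task and map success or failure to a process exit code.
async function runTask(task, log = console.error) {
  try {
    await task();
    return 0;
  } catch (err) {
    // Log the failure instead of letting the rejection go unhandled.
    log(`task failed: ${err.message}`);
    return 1;
  }
}
```

A caller could then finish with `runTask(main).then((code) => { process.exitCode = code; })`, where `main` stands in for whatever the task script actually runs.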

@razzamatazm

I was able to get the proxy working by including http:// in front of the url. That being said, now it's having issues scraping. See logs:

Jun 11 16:12:28 swacheck2 app/scheduler.5302: mongo successfully connected!
Jun 11 16:12:29 swacheck2 app/scheduler.5302: found 1 alerts, checking...
Jun 11 16:12:29 swacheck2 app/scheduler.5302: lock has available permits: 5
Jun 11 16:12:29 swacheck2 app/scheduler.5302: Entered lock, available permits: 4
Jun 11 16:12:30 swacheck2 app/scheduler.5302: Retrieving URL: https://www.southwest.com/air/booking/select.html?originationAirportCode=LAX&destinationAirportCode=PVR&returnAirportCode=&departureDate=2019-08-22&departureTimeOfDay=ALL_DAY&returnDate=&returnTimeOfDay=ALL_DAY&adultPassengersCount=1&seniorPassengersCount=0&fareType=USD&passengerType=ADULT&tripType=oneway&promoCode=&reset=true&redirectToVision=true&int=HOMEQBOMAIR&leapfrogRequest=true
Jun 11 16:14:32 swacheck2 app/scheduler.5302: Unable to get flights - trying again
Jun 11 16:14:32 swacheck2 app/scheduler.5302:
Jun 11 16:14:32 swacheck2 app/scheduler.5302: {
Jun 11 16:14:32 swacheck2 app/scheduler.5302: status: '200',
Jun 11 16:14:32 swacheck2 app/scheduler.5302: 'content-type': 'text/html; charset=UTF-8',
Jun 11 16:14:32 swacheck2 app/scheduler.5302: 'x-ion-hop': '1',
Jun 11 16:14:32 swacheck2 app/scheduler.5302: expires: '0',
Jun 11 16:14:32 swacheck2 app/scheduler.5302: 'cache-control': 'no-cache, no-store, must-revalidate',
Jun 11 16:14:32 swacheck2 app/scheduler.5302: pragma: 'no-cache',
Jun 11 16:14:32 swacheck2 app/scheduler.5302: 'content-encoding': 'gzip',
Jun 11 16:14:32 swacheck2 app/scheduler.5302: vary: 'Accept-Encoding',
Jun 11 16:14:32 swacheck2 app/scheduler.5302: 'x-akamai-transformed': '9 64888 0 pmb=mNONE,1',
Jun 11 16:14:32 swacheck2 app/scheduler.5302: date: 'Tue, 11 Jun 2019 23:12:30 GMT',
Jun 11 16:14:32 swacheck2 app/scheduler.5302: 'content-length': '58822',
Jun 11 16:14:32 swacheck2 app/scheduler.5302: 'set-cookie': 'akavpau_prod_fullsite=1560294780~id=ec6648d3009d2cd2e75488f337c82749; ' +
Jun 11 16:14:32 swacheck2 app/scheduler.5302: 'Path=/',
Jun 11 16:14:32 swacheck2 app/scheduler.5302: 'strict-transport-security': 'max-age=600'
Jun 11 16:14:32 swacheck2 app/scheduler.5302: }
Jun 11 16:14:32 swacheck2 app/scheduler.5302: 200
Jun 11 16:16:32 swacheck2 app/scheduler.5302: Error: ERROR! Unknown error! Unable to find flight information on page: https://www.southwest.com/air/booking/select.html?originationAirportCode=LAX&destinationAirportCode=PVR&returnAirportCode=&departureDate=2019-08-22&departureTimeOfDay=ALL_DAY&returnDate=&returnTimeOfDay=ALL_DAY&adultPassengersCount=1&seniorPassengersCount=0&fareType=USD&passengerType=ADULT&tripType=oneway&promoCode=&reset=true&redirectToVision=true&int=HOMEQBOMAIR&leapfrogRequest=true
Jun 11 16:16:32 swacheck2 app/scheduler.5302: html:
Jun 11 16:16:32 swacheck2 app/scheduler.5302: at getPage (/app/lib/bot/get-price.js:212:17)
Jun 11 16:16:32 swacheck2 app/scheduler.5302: at processTicksAndRejections (internal/process/task_queues.js:89:5)
Jun 11 16:16:32 swacheck2 app/scheduler.5302: at async getFlights (/app/lib/bot/get-price.js:47:14)
Jun 11 16:16:32 swacheck2 app/scheduler.5302: at async getPriceForFlight (/app/lib/bot/get-price.js:8:20)
Jun 11 16:16:32 swacheck2 app/scheduler.5302: at async Alert.getLatestPrice (/app/lib/bot/alert.js:172:19)
Jun 11 16:16:32 swacheck2 app/scheduler.5302: at async /app/tasks/check.js:33:9
Jun 11 16:16:32 swacheck2 app/scheduler.5302: at async Promise.all (index 0)
Jun 11 16:16:32 swacheck2 app/scheduler.5302: at async /app/tasks/check.js:69:5
Jun 11 16:16:32 swacheck2 app/scheduler.5302: No flights found!
Jun 11 16:16:32 swacheck2 app/scheduler.5302: Min price: Infinity
Jun 11 16:16:32 swacheck2 app/scheduler.5302: Got price: 8/22/2019|LAX|PVR|110 { time: 1560294749217, price: Infinity }
Jun 11 16:16:32 swacheck2 app/scheduler.5302: 8/22/2019 #110 LAX → PVR not cheaper
Jun 11 16:16:32 swacheck2 heroku/scheduler.5302: State changed from up to complete
Jun 11 16:16:33 swacheck2 heroku/scheduler.5302: Process exited with status 0

@iloveitaly
Collaborator Author

@razzamatazm

> ...previously working proxy setup

How recently was this working? If you revert to your previous setup are you able to scrape successfully?

Since I posted this PR, it looks like SW has started blocking requests (whether from a proxy or from my local connection). It looks like they've updated their bot detection system, and it's gotten much, much better.

@razzamatazm

razzamatazm commented Jun 11, 2019

@iloveitaly It had been working prior to when their bot detection was first implemented. I was able to move past the error from my first post by including "http://" in the proxy var, but now the app is having trouble scraping the price. I was initially searching for an international flight booked with points, so to test I tried a US flight booked with cash, and it's still having issues.
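
For reference, the scheme fix described above could be applied defensively before the value reaches `anonymizeProxy`. A minimal sketch, assuming the proxy setting is a plain string; `normalizeProxyUrl` is a hypothetical helper, not part of this repo or of proxy-chain:

```javascript
// Matches a leading URL scheme such as "http://" or "socks5://".
const SCHEME = /^[a-z][a-z0-9+.-]*:\/\//i;

// Hypothetical helper (an assumption): add "http://" when the scheme is
// missing and fail early if the hostname or an explicit port is absent.
function normalizeProxyUrl(raw) {
  const value = String(raw || '').trim();
  if (value === '') {
    throw new Error('proxy URL is empty');
  }
  const withScheme = SCHEME.test(value) ? value : `http://${value}`;
  const url = new URL(withScheme); // throws on malformed input
  // URL.port is '' when the port equals the scheme default, so the explicit
  // port is checked on the raw authority part instead.
  const authority = withScheme.replace(SCHEME, '').split('/')[0];
  if (!url.hostname || !/:\d+$/.test(authority)) {
    throw new Error(`proxy URL needs both hostname and port: ${value}`);
  }
  return withScheme;
}
```

For example, `normalizeProxyUrl('proxy.example.com:8080')` yields `http://proxy.example.com:8080`, while a value with no port throws before any browser is launched.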

@samyun
Owner

samyun commented Jun 12, 2019

I'm seeing the same thing - looks like an Akamai block.

@razzamatazm

@samyun and @iloveitaly - I set up a proxy server at my homelab and still run into the issues, even though I have no problems accessing the Southwest site through a browser. Not sure if it's Akamai in this case.

@razzamatazm

https://github.com/pyro2927/SouthwestCheckin/ <-- This is working as of now. I wonder if we can pull some of the techniques used. It uses the mobile api.

@iloveitaly
Collaborator Author

@razzamatazm ah, interesting! I didn't realize there was a mobile API. Looks like the flight cost endpoint hasn't been figured out yet. Any ideas on how to hit it?

@samyun I'm pretty sure it's not an Akamai block. Here's why:

  1. Run `curl https://www.southwest.com/air/booking/select.html`. The response contains some analytics code, some obfuscated code, and then a snippet that hits a unique token on the root SW domain once the page has loaded and then reloads the page. swa-common is loaded on this page as well, but I'm not sure whether it's a duplicate of the inline JS (my hunch is that it is).
  2. If you run the obfuscated code through jsnice.org, you'll see they are doing some really fancy obfuscation. Find the var assigned to `Object.create(null)`, find where it's actively used, and add a `debugger` call next to it. You'll need to do some fiddling to find the right place. You can pull the code into a standalone HTML file to fiddle with it locally.
  3. If you do that, you'll see a list of the properties that are being checked. This is helpful, but it's hard to figure out exactly what is being checked and how. I think what happens is that they are checked, serialized into some sort of string, and then added as a header, which is then sent to the southwest.com/TOKEN URL specified in the initial page load. Fancy stuff!
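
As an alternative to hand-placing `debugger` calls in step 2, the property list from step 3 can sometimes be recovered by wrapping an object in a `Proxy` that records every property a script reads. This is a different technique from the one described above, sketched under the assumption that you can substitute the wrapped object where the script uses it; `traceAccess` is a hypothetical helper:

```javascript
// Hypothetical helper (an assumption): wrap an object so that every property
// read or `in` check against it is recorded, in order, in `seen`.
function traceAccess(target, seen = []) {
  const proxy = new Proxy(target, {
    get(obj, prop, receiver) {
      // Record string keys only; symbol lookups are mostly engine internals.
      if (typeof prop === 'string') seen.push(prop);
      return Reflect.get(obj, prop, receiver);
    },
    has(obj, prop) {
      if (typeof prop === 'string') seen.push(prop);
      return Reflect.has(obj, prop);
    },
  });
  return { proxy, seen };
}
```

Reading `proxy.webdriver` or evaluating `'plugins' in proxy` appends the property name to `seen`, so after the script runs the array holds the properties that were checked.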

I went ahead and did this one last time and realized the flags I had set to disable the WebGL/GPU stuff were causing the issue. This is now working again!

@iloveitaly
Collaborator Author

Hmm, now it's not working for me. No idea why. Can you guys try HEAD and see if it works for you?

@razzamatazm

razzamatazm commented Jun 13, 2019 via email

@razzamatazm

New, different, exciting errors :)

Jun 13 11:52:55 swacheck3 heroku/router: at=info method=GET path="/style.css" host=swacheck3.herokuapp.com request_id=1fad41d7-311b-41c0-9876-8bd743dc5526 fwd="67.53.122.46" dyno=web.1 connect=0ms service=6ms status=304 bytes=269 protocol=https
Jun 13 11:52:55 swacheck3 heroku/router: at=info method=GET path="/logo.png" host=swacheck3.herokuapp.com request_id=3af111c9-21b6-4717-b8be-3e059f01326a fwd="67.53.122.46" dyno=web.1 connect=0ms service=10ms status=304 bytes=271 protocol=https
Jun 13 11:52:55 swacheck3 app/web.1: Retrieving URL: https://www.southwest.com/air/booking/select.html?originationAirportCode=LAX&destinationAirportCode=PHX&returnAirportCode=&departureDate=2019-08-22&departureTimeOfDay=ALL_DAY&returnDate=&returnTimeOfDay=ALL_DAY&adultPassengersCount=1&seniorPassengersCount=0&fareType=USD&passengerType=ADULT&tripType=oneway&promoCode=&reset=true&redirectToVision=true&int=HOMEQBOMAIR&leapfrogRequest=true
Jun 13 11:52:58 swacheck3 app/web.1: PAGE LOG: Failed to load resource: net::ERR_FAILED
Jun 13 11:52:58 swacheck3 app/web.1: PAGE LOG: Failed to load resource: the server responded with a status of 403 ()
Jun 13 11:54:57 swacheck3 app/web.1: Unable to get flights - trying again
Jun 13 11:54:57 swacheck3 app/web.1:
Jun 13 11:54:57 swacheck3 app/web.1: {
Jun 13 11:54:57 swacheck3 app/web.1: status: '200',
Jun 13 11:54:57 swacheck3 app/web.1: 'content-type': 'text/html; charset=UTF-8',
Jun 13 11:54:57 swacheck3 app/web.1: 'x-ion-hop': '1',
Jun 13 11:54:57 swacheck3 app/web.1: expires: '0',
Jun 13 11:54:57 swacheck3 app/web.1: 'cache-control': 'no-cache, no-store, must-revalidate',
Jun 13 11:54:57 swacheck3 app/web.1: pragma: 'no-cache',
Jun 13 11:54:57 swacheck3 app/web.1: 'content-encoding': 'gzip',
Jun 13 11:54:57 swacheck3 app/web.1: vary: 'Accept-Encoding',
Jun 13 11:54:57 swacheck3 app/web.1: 'x-akamai-transformed': '9 - 0 pmb=mNONE,1',
Jun 13 11:54:57 swacheck3 app/web.1: date: 'Thu, 13 Jun 2019 18:52:56 GMT',
Jun 13 11:54:57 swacheck3 app/web.1: 'content-length': '58937',
Jun 13 11:54:57 swacheck3 app/web.1: 'set-cookie': 'akavpau_prod_fullsite=1560452006~id=bf704764f98270f44819cda28444db01; ' +
Jun 13 11:54:57 swacheck3 app/web.1: 'Path=/',
Jun 13 11:54:57 swacheck3 app/web.1: 'strict-transport-security': 'max-age=600'
Jun 13 11:54:57 swacheck3 app/web.1: }
Jun 13 11:54:57 swacheck3 app/web.1: 200
Jun 13 11:54:58 swacheck3 app/web.1: PAGE LOG: Failed to load resource: the server responded with a status of 403 ()
Jun 13 11:56:58 swacheck3 app/web.1: Error: ERROR! Unknown error! Unable to find flight information on page: https://www.southwest.com/air/booking/select.html?originationAirportCode=LAX&destinationAirportCode=PHX&returnAirportCode=&departureDate=2019-08-22&departureTimeOfDay=ALL_DAY&returnDate=&returnTimeOfDay=ALL_DAY&adultPassengersCount=1&seniorPassengersCount=0&fareType=USD&passengerType=ADULT&tripType=oneway&promoCode=&reset=true&redirectToVision=true&int=HOMEQBOMAIR&leapfrogRequest=true
Jun 13 11:56:58 swacheck3 app/web.1: html:
Jun 13 11:56:58 swacheck3 app/web.1: at getPage (/app/lib/bot/get-price.js:240:17)
Jun 13 11:56:58 swacheck3 app/web.1: at processTicksAndRejections (internal/process/task_queues.js:89:5)
Jun 13 11:56:58 swacheck3 app/web.1: at async getFlights (/app/lib/bot/get-price.js:51:14)
Jun 13 11:56:58 swacheck3 app/web.1: at async getPriceForFlight (/app/lib/bot/get-price.js:8:20)
Jun 13 11:56:58 swacheck3 app/web.1: at async Alert.getLatestPrice (/app/lib/bot/alert.js:172:19)
Jun 13 11:56:58 swacheck3 app/web.1: at async /app/lib/apps/app.js:72:3
Jun 13 11:56:58 swacheck3 app/web.1: No flights found!
Jun 13 11:56:58 swacheck3 app/web.1: Min price: Infinity
Jun 13 11:56:58 swacheck3 app/web.1: Got price: 8/22/2019|LAX|PHX|1121 { time: 1560451974957, price: Infinity }

@iloveitaly
Collaborator Author

@razzamatazm yup, the 403 is SW blocking us. No idea how to get around this. I think it has something to do with the IP used, but I can't be sure.

@razzamatazm

razzamatazm commented Jun 13, 2019 via email

@iloveitaly
Collaborator Author

@razzamatazm is that using this repo, or by manually accessing it via standard Chrome?

I think what's going on is that SW is associating a browser fingerprint with an IP and then blocking that IP. I know somewhere in the SW code they are checking the `__webdriver_script_fn` var, which is not hidden by the evasions currently implemented.

https://allinonescript.com/index.php/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver?sort=creation
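
To make the kind of check being described concrete, here is a rough sketch of how a page script might look for automation markers. It is not Southwest's code: `navigator.webdriver` is the standard WebDriver flag, `__webdriver_script_fn` is the name mentioned above, and checking it on both `window` and `document` is an assumption.

```javascript
// Marker names to look for. Only the one mentioned in this thread is listed.
const MARKERS = ['__webdriver_script_fn'];

// Hypothetical detector (an assumption, not SW's code): returns the list of
// automation markers found on a window-like object.
function findAutomationMarkers(win) {
  const found = [];
  if (win.navigator && win.navigator.webdriver === true) {
    found.push('navigator.webdriver');
  }
  for (const name of MARKERS) {
    if (name in win) found.push(`window.${name}`);
    if (win.document && name in win.document) found.push(`document.${name}`);
  }
  return found;
}
```

Running `findAutomationMarkers(window)` through `page.evaluate` in the scraper would be one way to see which of these markers the current evasions leave exposed.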

I think the best option is to use the mobile API, but it doesn't look like the price check endpoint has been figured out yet (and I don't have the time to tinker with it).

In any case, this is a huge improvement over what was there, although it doesn't actually work :(

@samyun samyun merged commit 0f48711 into samyun:master Jun 30, 2019
@samyun
Owner

samyun commented Jun 30, 2019

I went ahead and merged this in - I found some other evasion repos I'm going to try to work in. Thanks for your help!

@iloveitaly
Collaborator Author

@samyun awesome! It's worth noting that this is now working locally again. I think there is some sort of IP block triggered by repeated requests for the same flight (or something along those lines... just guessing really). Keep us posted on what you find!

3 participants