## Case Study: Solving reCAPTCHA v2 with the 2Captcha service and requests

Created by [tanyongsheng.net](https://tanyongsheng.net)

----

### Prerequisite
1. Buy 2Captcha credit at https://2captcha.com/?from=22013304 (Note: this is an affiliate link)

### References
1. How to Solve Captcha / ReCaptcha - Python and 2captcha: https://www.youtube.com/watch?v=R6QddZzCOwM and https://github.com/eupendra/2captcha_demo/blob/main/demo_requests.py
2. Bypass captcha in Python: https://2captcha.com/lang/python


----

#### Step 1: Install libraries

In [None]:
# Install Python libraries
%pip install requests
%pip install lxml
%pip install 2captcha-python
%pip install python-dotenv

#### Step 2: Scrape websites with requests

(i) Load environment variables from the .env file (note: env.sample is the template for .env)

In [None]:
import os
from dotenv import load_dotenv

_ = load_dotenv()

# Set up credentials
api_key = os.getenv("2CAPTCHA_API_KEY")
sitekey = os.getenv("INVESTINGNOTE_SITE_KEY")
login_page_url = "https://www.investingnote.com/users/sign_in"
investingnote_username = os.getenv("INVESTINGNOTE_USERNAME")
investingnote_password = os.getenv("INVESTINGNOTE_PASSWORD")
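
`os.getenv` returns `None` for any variable that is missing from `.env`, and the problem then only shows up later as a failed API call or login. A minimal sketch of an early check follows; `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced here, not part of the original notebook.

```python
import os

# Variable names read from .env in the cell above
REQUIRED_VARS = [
    "2CAPTCHA_API_KEY",
    "INVESTINGNOTE_SITE_KEY",
    "INVESTINGNOTE_USERNAME",
    "INVESTINGNOTE_PASSWORD",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the mapping (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Report missing variables before any network request is made
missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing in .env:", ", ".join(missing))
```

Passing the mapping as a parameter keeps the helper easy to check with a plain dictionary instead of the real environment.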

(ii) Find the Google reCAPTCHA v2 site key

- Locate the Google reCAPTCHA v2 link

<img src='../../assets/static/investingnote-recaptcha-part1.png' width=600px><br/>

- Open the Google reCAPTCHA v2 link in a new tab

<img src='../../assets/static/investingnote-recaptcha-part2.png' width=600px><br/>

- Get the Google reCAPTCHA v2 site key from the URL

<img src='../../assets/static/investingnote-recaptcha-part3.png' width=600px>
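
The manual steps above can also be done in code once the link has been copied. The sketch below assumes the site key is carried in the `k` query parameter of the reCAPTCHA link; `extract_sitekey` is a hypothetical helper and the URL is a placeholder, not the real InvestingNote link.

```python
from urllib.parse import urlparse, parse_qs


def extract_sitekey(recaptcha_url):
    """Return the value of the `k` query parameter, or None if absent (hypothetical helper)."""
    query = parse_qs(urlparse(recaptcha_url).query)
    values = query.get("k")
    return values[0] if values else None


# Placeholder URL for illustration only; EXAMPLE_SITE_KEY is not a real site key
example_url = "https://www.google.com/recaptcha/api2/anchor?ar=1&k=EXAMPLE_SITE_KEY&hl=en"
print(extract_sitekey(example_url))
```

The extracted value is what goes into `INVESTINGNOTE_SITE_KEY` in the `.env` file.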

(iii) Log in to the website

- Handle the CSRF token when logging in

In [None]:
import requests
from lxml import html

# Fetch the login page and read the CSRF token from the hidden form field
def get_csrf_token(session, url):
    response = session.get(url=url)
    tree = html.fromstring(response.content)
    csrf_token = tree.xpath("//input[@name='authenticity_token']/@value")[0]
    return csrf_token
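
Because `get_csrf_token` indexes the XPath result with `[0]`, it raises `IndexError` when the login page has no `authenticity_token` field. The sketch below shows one way to get `None` instead. It swaps in the standard-library `html.parser` rather than lxml so it runs offline against a sample string; `TokenParser` and `find_csrf_token` are hypothetical names, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser


class TokenParser(HTMLParser):
    """Collect the value of the first <input name="authenticity_token"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_token_field = tag == "input" and attributes.get("name") == "authenticity_token"
        if is_token_field and self.token is None:
            self.token = attributes.get("value")


def find_csrf_token(page_html):
    """Return the CSRF token found in page_html, or None if the field is missing."""
    parser = TokenParser()
    parser.feed(page_html)
    return parser.token


# Made-up sample markup; the real page is fetched with session.get in the cell above
sample = '<form><input type="hidden" name="authenticity_token" value="abc123"></form>'
print(find_csrf_token(sample))
```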

- Solve Google reCAPTCHA v2 when logging in

In [None]:
from twocaptcha import TwoCaptcha

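solver = TwoCaptcha(api_key)


def solve_recaptcha(sitekey, url):
    # Submit the reCAPTCHA v2 challenge to 2Captcha and return the response token
    try:
        result = solver.recaptcha(sitekey=sitekey, url=url)
    except Exception as e:
        # Stop here: without a token the login request cannot succeed
        raise SystemExit(f"2Captcha could not solve the reCAPTCHA: {e}")
    return result.get('code')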
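
A solving service can fail transiently, so giving up on the first exception may be stricter than necessary. The following is a minimal retry sketch, not part of the original notebook; `call_with_retries` is a hypothetical helper, and it catches `Exception` broadly because no specific exception types are assumed here.

```python
import time


def call_with_retries(func, attempts=3, delay_seconds=0.0):
    """Call func() up to `attempts` times and return the first successful result (hypothetical helper)."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:  # broad on purpose: no specific exception types are assumed
            last_error = error
            if attempt < attempts:
                # Wait before the next attempt; no wait after the final failure
                time.sleep(delay_seconds)
    raise RuntimeError(f"all {attempts} attempts failed") from last_error
```

To use it here, wrap the solver call directly, for example `call_with_retries(lambda: solver.recaptcha(sitekey=sitekey, url=login_page_url), attempts=3, delay_seconds=5)`, and read `'code'` from the returned result as before.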

(iv) Start scraping after login

In [5]:
session = requests.Session()

headers = {"Content-Type": "application/x-www-form-urlencoded",
           "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"}
csrf_token = get_csrf_token(session=session, url=login_page_url)
recaptcha_response = solve_recaptcha(sitekey=sitekey, 
                                     url=login_page_url)

payload = {"utf8": "✓",
           "authenticity_token": csrf_token,
           "user[login]": investingnote_username, 
           "user[password]": investingnote_password,
           "g-recaptcha-response": recaptcha_response, 
           # A dict keeps only the last value of a repeated key, so remember_me is sent once
           "user[remember_me]": 1
           }

response = session.post(
    url=login_page_url,
    data=payload,
    headers=headers
    )
print(response.status_code)
print(response.url)
print("--------------------")
print(response.content)
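
A `200` status code alone does not show that the login worked, since a rejected form can be re-rendered with the same status. One rough heuristic, under the assumption that a failed login leaves the session on the sign-in URL while a successful one redirects elsewhere, is to compare `response.url` with the login page URL. `stayed_on_login_page` is a hypothetical helper, and the assumption about the site's redirect behavior has not been verified here.

```python
from urllib.parse import urlparse


def stayed_on_login_page(final_url, login_url):
    """True if final_url has the same host and path as login_url (hypothetical heuristic)."""
    final, login = urlparse(final_url), urlparse(login_url)
    # Ignore query strings and trailing slashes when comparing
    return (final.netloc, final.path.rstrip("/")) == (login.netloc, login.path.rstrip("/"))
```

With the variables from the cell above, this would be called as `stayed_on_login_page(response.url, login_page_url)`; a `True` result suggests the login was rejected.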


## Computing Environment

In [None]:
%load_ext watermark

%watermark
%watermark  --iversions
%watermark -u -n -t -z