# Part 1: Understanding Web Forms and Submitting Data

- Identify a target website with a search form. 
- Choose a website with a public search form (e.g. a movie database).
- Inspect the form elements using browser developer tools (F12 or right-click -> Inspect). 
- Locate the form's action URL, its input fields (their name and id attributes), and its method (GET or POST).
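The inspection steps above can also be done programmatically. The following is a minimal sketch using Python's standard-library HTML parser; the `FormInspector` class and the sample markup are hypothetical and only illustrate which attributes to look for.

```python
from html.parser import HTMLParser


class FormInspector(HTMLParser):
    """Collect the action, method, and field names of the first form on a page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.input_names = []
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
            # HTML forms default to GET when no method is given
            self.method = (attrs.get("method") or "get").upper()
        elif tag in ("input", "select", "textarea") and self._in_form and attrs.get("name"):
            self.input_names.append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


# Hypothetical markup for a movie search form (not taken from a real site)
sample_html = """
<form action="/search" method="get">
  <input type="text" name="t" id="title">
  <input type="hidden" name="apikey" value="demo">
  <button type="submit">Search</button>
</form>
"""

inspector = FormInspector()
inspector.feed(sample_html)
print(inspector.action, inspector.method, inspector.input_names)
```

The field names collected here are the keys you would use in the dictionary passed to the request.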

In [27]:
import requests

# OMDb API URL
form_url = 'http://www.omdbapi.com/' # The URL that receives the submitted data

# Parameters for the API
form_data = {
    'apikey': '5e3352e7', # Your own OMDb API key
    't': 'Inception', # Movie title to search for
}

# Send the request to the OMDb API
response = requests.get(form_url, params=form_data) # OMDb expects GET, so the parameters go in the query string

# Print the raw response body
print(response.text)

{"Title":"Inception","Year":"2010","Rated":"PG-13","Released":"16 Jul 2010","Runtime":"148 min","Genre":"Action, Adventure, Sci-Fi","Director":"Christopher Nolan","Writer":"Christopher Nolan","Actors":"Leonardo DiCaprio, Joseph Gordon-Levitt, Elliot Page","Plot":"A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster.","Language":"English, Japanese, French","Country":"United States, United Kingdom","Awards":"Won 4 Oscars. 159 wins & 220 nominations total","Poster":"https://m.media-amazon.com/images/M/MV5BMjAxMzY3NjcxNF5BMl5BanBnXkFtZTcwNTI5OTM0Mw@@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"8.8/10"},{"Source":"Rotten Tomatoes","Value":"87%"},{"Source":"Metacritic","Value":"74/100"}],"Metascore":"74","imdbRating":"8.8","imdbVotes":"2,581,982","imdbID":"tt1375666","Type":"movie","DVD":"N/A","BoxOff

**form_url** - This should match the action attribute of the form you're submitting data to. In this case, it is the OMDb API endpoint that receives the search data.<br>
**form_data** - A dictionary whose keys are the name attributes of the form's input fields and whose values are the data you're submitting. In this case, it holds my own API key, obtained from the OMDb API, and the search query, which is passed in the 't' field.<br>
**requests.get** - Submits a GET request to the server, with the form data encoded in the URL's query string.<br>
**response.text** - Contains the body of the server's response as text. For a form this would typically be HTML; the OMDb API returns JSON, as the output above shows.
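Because the body is JSON, it can be turned into a dictionary instead of being read as raw text. The sketch below works on an abridged copy of the output above so that it runs without a network call; with a live response, `response.json()` would give the same kind of dictionary. The `summarize_movie` helper is hypothetical, and the "Response" and "Error" fields are an assumption about how OMDb reports failed lookups, since the printed output above is cut off before that point.

```python
import json

# Abridged copy of the response printed above, plus the "Response" flag
# that OMDb is assumed to include in its replies
sample_body = '{"Title":"Inception","Year":"2010","imdbRating":"8.8","Response":"True"}'


def summarize_movie(body):
    """Return (title, year, rating) from an OMDb-style JSON string, or None."""
    data = json.loads(body)
    # Assumption: failed lookups carry "Response": "False" and an "Error" message
    if data.get("Response") == "False":
        return None
    return data.get("Title"), data.get("Year"), data.get("imdbRating")


print(summarize_movie(sample_body))
print(summarize_movie('{"Response":"False","Error":"Movie not found!"}'))
```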

# Part 2: Handling Login Pages and Sessions

- Identify Login Form Elements 
- Use browser developer tools to inspect the login form
- Note down the form's action URL and input field names for username, password, and any hidden fields 

In [33]:
import requests

# Start a session 
session = requests.Session()

# Login URL 
login_url = 'https://the-internet.herokuapp.com/login'

# Login data 
login_data = {
    'username': 'tomsmith', 
    'password': 'SuperSecretPassword!',
    'submit': 'Login'
}

# Perform login
login_response = session.post(login_url, data=login_data)

# Check login success
if 'Logout' in login_response.text:
    print('Login successful!')
else: 
    print('Login failed.')

Login failed.


The site at *https://the-internet.herokuapp.com/login* was built for practicing Selenium and web scraping techniques.
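One possible cause of the "Login failed." output is that a login form often submits to a different URL than the page it is displayed on. The form's action attribute is resolved relative to the page URL, which can be sketched with the standard library as follows. The action value "/authenticate" is an assumption here and should be confirmed against the form's markup in developer tools.

```python
from urllib.parse import urljoin

# The page that displays the login form
page_url = "https://the-internet.herokuapp.com/login"

# Assumption: the value of the form's action attribute as seen in developer tools
form_action = "/authenticate"

# Resolve the action against the page URL to get the URL the form posts to
submit_url = urljoin(page_url, form_action)
print(submit_url)
```

If the resolved URL differs from the one passed to `session.post`, posting to the resolved URL is worth trying before changing anything else.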

In [32]:
import requests

# Start session
session = requests.Session()

# Login URL (httpbin's basic auth endpoint)
login_url = 'https://httpbin.org/basic-auth/user/passwd'

# Provide basic auth credentials
session.auth = ('user', 'passwd')

# Perform login
response = session.get(login_url)

if response.status_code == 200:
    print('Login successful!')
else:
    print('Login failed.')

Login successful!


*httpbin* is a testing service designed for practicing HTTP requests. It provides endpoints for testing GET and POST requests, HTTP Basic authentication, and more. <br>
You can simulate an authenticated request with the /basic-auth endpoint, as above, or inspect what a plain request sends with the /get endpoint.
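Setting `session.auth` to a (username, password) tuple makes requests use HTTP Basic authentication, which sends the credentials in an Authorization header rather than in a form. As a rough illustration of what that header contains for ASCII credentials, here is a sketch built with the standard library; the `basic_auth_header` helper is hypothetical and not part of requests.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value used by HTTP Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


header = basic_auth_header("user", "passwd")
print(header)
```

The credentials are only encoded, not encrypted, which is why Basic authentication should be used over HTTPS.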

# Part 3: Scraping Data Behind a Login
- Access pages requiring authentication
- Use the authenticated session to access a protected page
- Handle common issues
- Discuss how to deal with common issues like CSRF tokens, JavaScript-rendered pages, and headers. 
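For CSRF tokens, a common pattern is to fetch the login page first, read the token from a hidden input, and send it back together with the credentials. The sketch below shows only the extraction and merge steps on hypothetical markup; the field name "csrf_token", its value, and the `HiddenFieldCollector` class are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect the name/value pairs of every hidden input on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


# Hypothetical login page; the field name "csrf_token" and its value are made up
login_page_html = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

collector = HiddenFieldCollector()
collector.feed(login_page_html)

# Merge the hidden fields with the visible credentials before posting
login_data = {"username": "tomsmith", "password": "SuperSecretPassword!"}
login_data.update(collector.fields)
print(login_data)
```

With a live site, the HTML would come from `session.get(login_url).text`, and the POST would be made with the same session so that the token and the session cookie match.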

In [38]:
import requests
from bs4 import BeautifulSoup

# Start a session
session = requests.Session()

# Headers to mimic a real browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
    'Referer': 'https://the-internet.herokuapp.com/login',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
}

# Login URL and protected page URL
login_url = 'https://the-internet.herokuapp.com/login'
protected_url = 'https://the-internet.herokuapp.com/secure'

# Login credentials
login_data = {
    'username': 'tomsmith',
    'password': 'SuperSecretPassword!'
}

# Perform login with headers
login_response = session.post(login_url, data=login_data, headers=headers)

# Check if login was successful by looking for the 'Logout' link
if 'Logout' in login_response.text:
    print("Login successful!")

    # Now, access the protected page using the same headers
    protected_response = session.get(protected_url, headers=headers)

    # Parse the protected page using BeautifulSoup
    soup = BeautifulSoup(protected_response.content, 'lxml')

    # Extract and display data (in this case, the secure area message)
    flash = soup.find('div', class_='flash success')
    secure_message = flash.get_text(strip=True) if flash else 'Message not found'
    print("Protected Page Message:", secure_message)
else:
    print("Login failed.")


Login failed.


In [35]:
import requests

# Start a session
session = requests.Session()

# Simulate login with a POST request
login_url = 'https://httpbin.org/post'
login_data = {
    'username': 'testuser',
    'password': 'password123'
}

# Perform the login (simulated as a POST request)
login_response = session.post(login_url, data=login_data)

# Display the server's response to the login attempt
print("Server Response:", login_response.json())


Server Response: {'args': {}, 'data': '', 'files': {}, 'form': {'password': 'password123', 'username': 'testuser'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Content-Length': '38', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.31.0', 'X-Amzn-Trace-Id': 'Root=1-66e47040-597fd2ee3fb013b76837f705'}, 'json': None, 'origin': '180.191.146.172', 'url': 'https://httpbin.org/post'}
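The 'form' and 'Content-Type' entries in this response show that the `data` dictionary was sent as a URL-encoded request body. A small sketch with the standard library reproduces that encoding and shows where the 'Content-Length' of 38 comes from:

```python
from urllib.parse import urlencode

login_data = {
    "username": "testuser",
    "password": "password123",
}

# Encode the dictionary the way an HTML form post body is encoded
body = urlencode(login_data)
print(body)
print(len(body))
```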
