### **What is a User-Agent?**

The User-Agent (UA) is a string sent by a client (like a web browser or a bot) to identify itself to the server. It tells the server about the client’s operating system, browser, and device. For example, when you access a website via a browser like Chrome or Firefox, the browser sends a User-Agent string with each request to the website.

In [None]:
import requests
import httpx

In [6]:
"""
Objective: Understanding Request Headers
"""
# TODO: Send request to https://httpbin.org/get using requests and httpx
# TODO: Get the responses from both request
# TODO: Compare the request headers and understand the difference
# %pip install httpx requests
import requests
import httpx
# Using requests
response_requests = requests.get("https://httpbin.org/get")

# Using httpx
response_httpx = httpx.get("https://httpbin.org/get")

# Print the request headers from both responses
print("Request Headers using requests:")
print(response_requests.request.headers)
print("\nRequest Headers using httpx:")
print(response_httpx.request.headers)

Request Headers using requests:
{'User-Agent': 'python-requests/2.32.3', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

Request Headers using httpx:
Headers({'host': 'httpbin.org', 'accept': '*/*', 'accept-encoding': 'gzip, deflate', 'connection': 'keep-alive', 'user-agent': 'python-httpx/0.28.1'})


In [9]:
"""
Objective: Modify request headers
"""
# TODO: Send request to https://httpbin.org/get using requests
# TODO: Get the request headers from the response

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
}

# TODO: Send new request using modified headers by passing headers params in the get method
# TODO: Get the response
# TODO: Compare the and understand the difference
# TODO: Experiment with different user-agents and share your thoughts

response = requests.get("https://httpbin.org/get", headers=headers)
print("Modified Request Headers:")
print(response.request.headers)
print("\nResponse Headers:")
print(response.headers)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
}


response = requests.get("https://httpbin.org/get", headers=headers)
print("\nModified Request Headers:")
print(response.request.headers)
print("\nResponse Headers:")
print(response.headers)


Modified Request Headers:
{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}

Response Headers:
{'Date': 'Tue, 06 May 2025 15:36:26 GMT', 'Content-Type': 'application/json', 'Content-Length': '404', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}

Modified Request Headers:
{'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}

Response Headers:
{'Date': 'Tue, 06 May 2025 15:36:29 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}


In [10]:
"""
Objective: Bypassing User-Agent Blocking
"""
# TODO: Send request to https://gamefaqs.gamespot.com/news using requests with and without custom headers
# TODO: Compare the response, which one is blocked and which one is not

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
}

url = "https://gamefaqs.gamespot.com/news"
# Without custom headers
response = requests.get(url)
print("\nResponse without custom headers:")
print(response.status_code)
print(response.text)


# With custom headers
response = requests.get("https://httpbin.org/get", headers=headers)
print("\nResponse with Request Headers:")
print(response.status_code)
print(response.text)


Response without custom headers:
400
<html><body><h1>Request Blocked</h1><p>We've detected unusual traffic from your current system and have blocked this request. Please use an alternate browser or device to continue.</p></body></html>

Response with Request Headers:
503
<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
</body>
</html>



In [11]:
"""
Objective: Understanding User-Agent Rotation
If all person you've met today using same shirts, what do you think?
"""
# List of common User-Agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0'
]

# TODO: Send request to https://httpbin.org/get using random User-Agent from the list
# TODO: Get the response and print the used User-Agent
# TODO: Try to execute it again, is the User-Agent still the same?
# TODO: Try to loop to send up to 10 request, using different User-Agent from the list and print each user agents used

for user_agent in user_agents:
    headers = {
        'User-Agent': user_agent,
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive'
    }
    response = requests.get("https://httpbin.org/get", headers=headers)
    print(f"Response with User-Agent: {user_agent}")
    print(response.status_code)
    print(response.text)

Response with User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
200
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-681a2d3e-0f91103f75562dbe7cd9f61b"
  }, 
  "origin": "103.168.44.81", 
  "url": "https://httpbin.org/get"
}

Response with User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36
200
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36", 
    "X-Amzn-Trace-Id": "Ro

In [7]:
"""
Objective: Using fake user-agent library
"""
# TODO: Install fake_useragent using pip
# TODO: Create a UserAgent object
# TODO: To get a random User-Agent, use ua.random
# TODO: Send request to https://httpbin.org/get using random User-Agent from the fake_useragent
# TODO: Get the response and print the used User-Agent
# TODO: Try to execute it again, is the User-Agent still the same?
# %pip install fake_useragent
from fake_useragent import UserAgent
import requests

ua = UserAgent()

for i in range(3):
    # Get a random User-Agent
    random_user_agent = ua.random

    print(f"Random User-Agent: {random_user_agent}")
    # Send request to https://httpbin.org/get using random User-Agent
    response = requests.get("https://httpbin.org/get", headers={'User-Agent': random_user_agent})
    print(f"Response with Random User-Agent: {response.status_code}")
    print(f"User Agent: {response.request.headers['User-Agent']}")
    print(response.text)

Random User-Agent: Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Mobile Safari/537.36
Response with Random User-Agent: 200
User Agent: Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Mobile Safari/537.36
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Mobile Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-681b42a2-5d5abb73437b55076a96d251"
  }, 
  "origin": "103.168.44.81", 
  "url": "https://httpbin.org/get"
}

Random User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36 Edg/135.0.0.0
Response with Random User-Agent: 200
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36 Edg/1

In [51]:
"""
Objective: Improve web scraping by using fake_useragent and logging
"""
# TODO: Visit https://www.cmegroup.com/markets/energy/crude-oil/light-sweet-crude.settlements.html
# TODO: Try sending request to that site without custom header
# TODO: If you failed, use random User-Agent using fake_useragent
# TODO: Extract the data table and save it to a json file
# %pip install pandas

import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs
from fake_useragent import UserAgent

json_path = "data.json"
csv_path = "data.csv"

ua = UserAgent()
# Get a random User-Agent
random_user_agent = ua.random

# Set up Selenium with the random User-Agent
options = webdriver.ChromeOptions()
options.add_argument(f"user-agent={random_user_agent}")
driver = webdriver.Chrome(options=options)
driver.get("https://www.cmegroup.com/markets/energy/crude-oil/light-sweet-crude.settlements.html")

# Wait for the page to load
try:
    wait = WebDriverWait(driver, 15)
    button = driver.find_element(By.XPATH, "//button[.//span[text()='Load All']]")
    driver.execute_script("arguments[0].click();", button)
    print("Tombol berhasil diklik.")
except Exception as e:
    print(f"Gagal klik tombol: {e}")

# Get the page source
page_source = driver.page_source
soup = bs(page_source, 'html.parser')

data = []

# Find the table
table = soup.find('table')

row_head = table.find('thead').find_all('th')
th_data = []
for row in row_head:
    th_data.append(row.text.strip())
data.append(th_data)
# Find the rows in the table
rows = table.find('tbody').find_all('tr')
for row in rows:
    # Find the columns in each row
    cols = row.find_all('td')
    col_data = []
    for col in cols:
        col_data.append(col.text.strip())
    data.append(col_data)

# Save the data to a JSON file
df = pd.DataFrame(data[1:], columns=data[0])
df.to_csv(csv_path, index=False)
print(f"Data saved to {csv_path}")

# Close the browser
driver.quit()

Tombol berhasil diklik.
Data saved to data.csv


### **Proxy Rotation**

A proxy acts as an intermediary server between your scraping script (client) and the target website (server). When you send a request through a proxy, the request is routed through the proxy server, which then forwards the request to the destination website. The website sees the request coming from the proxy’s IP address instead of your actual IP address. This makes proxies particularly useful in web scraping, as they help with anonymity, bypassing rate limits, and preventing IP bans.

In [52]:
""" 
Objective: Understanding rate limits
"""
# TODO: Execute this cell

url = "https://api.github.com/users/octocat"

for i in range(1, 66):
    response = requests.get(url)
    print(f"Sending request {i}. Status code: {response.status_code}")

Sending request 1. Status code: 200
Sending request 2. Status code: 200
Sending request 3. Status code: 200
Sending request 4. Status code: 200
Sending request 5. Status code: 200
Sending request 6. Status code: 200
Sending request 7. Status code: 200
Sending request 8. Status code: 200
Sending request 9. Status code: 200
Sending request 10. Status code: 200
Sending request 11. Status code: 200
Sending request 12. Status code: 200
Sending request 13. Status code: 200
Sending request 14. Status code: 200
Sending request 15. Status code: 200
Sending request 16. Status code: 200
Sending request 17. Status code: 200
Sending request 18. Status code: 200
Sending request 19. Status code: 200
Sending request 20. Status code: 200
Sending request 21. Status code: 200
Sending request 22. Status code: 200
Sending request 23. Status code: 200
Sending request 24. Status code: 200
Sending request 25. Status code: 200
Sending request 26. Status code: 200
Sending request 27. Status code: 200
Sending re

In [None]:
""" 
Objective: Understanding Proxy
"""
# This is a sample code
import requests

# Define the URL you want to scrape
url = 'https://httpbin.org/ip'  # httpbin provides your IP address for testing

# Define the proxy to use
proxy = {
    'http': 'http://10.10.1.10:8080',
    'https': 'http://10.10.1.10:8080'
}

# Send a request through the proxy
response = requests.get(url, proxies=proxy, timeout=5)

# Print the response (this will show the IP address the website sees)
print(response.text)

ProxyError: HTTPSConnectionPool(host='httpbin.org', port=443): Max retries exceeded with url: /ip (Caused by ProxyError('Unable to connect to proxy', ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x74499dfe40b0>, 'Connection to 101.255.120.54 timed out. (connect timeout=5)')))

In [None]:
""" 
Objective: Finding free proxies
"""
# TODO: Do a google search for free proxies and share your thoughts


In [58]:
""" 
Objective: Using free proxies
"""

import requests
import time

# Step 1: Get proxy list
proxy_url = "https://api.proxyscrape.com/v4/free-proxy-list/get?request=display_proxies&proxy_format=protocolipport&format=text"
proxy_list = requests.get(proxy_url).text.strip().split('\r\n')
proxy_list = [p for p in proxy_list if 'http' in p]
print(f"Total proxy loaded: {len(proxy_list)}")

url = "https://api.github.com/users/octocat"

# Step 2: Trigger GitHub rate limit (60 req/hour per IP)
print("Triggering rate limit (may take ~1 minute)...")
for i in range(65):
    res = requests.get(url)
    print(f"{i+1}: {res.status_code}")
    if res.status_code == 403:
        print("Rate limit hit!")
        break
    time.sleep(1)

# Step 3: Use proxy rotation to bypass rate limit
for proxy in proxy_list:
    proxies = {
        "http": f"http://{proxy}",
        "https": f"http://{proxy}",
    }
    try:
        print(f"Trying proxy: {proxy}")
        response = requests.get(url, proxies=proxies, timeout=5)
        if response.status_code == 200:
            print("Success with proxy:", proxy)
            print("Response:", response.json())
            break
        else:
            print("Failed with status:", response.status_code)
    except Exception as e:
        print(f"Proxy failed: {proxy}, Error: {e}")


Total proxy loaded: 577
Triggering rate limit (may take ~1 minute)...
1: 403
Rate limit hit!
Trying proxy: http://138.68.60.8:80
Proxy failed: http://138.68.60.8:80, Error: HTTPSConnectionPool(host='api.github.com', port=443): Max retries exceeded with url: /users/octocat (Caused by ProxyError('Unable to connect to proxy', NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x74499df0dcd0>: Failed to resolve 'http' ([Errno -3] Temporary failure in name resolution)")))
Trying proxy: http://162.223.90.150:80
Proxy failed: http://162.223.90.150:80, Error: HTTPSConnectionPool(host='api.github.com', port=443): Max retries exceeded with url: /users/octocat (Caused by ProxyError('Unable to connect to proxy', NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x74499df0d4c0>: Failed to resolve 'http' ([Errno -3] Temporary failure in name resolution)")))
Trying proxy: http://89.58.55.33:80
Proxy failed: http://89.58.55.33:80, Error: HTTPSConnectionPool(host='api.g

### **Reflection**
What user-agent that typically get blocked? Can you mention it?

the user-agent typically get blocked is the user-agent that not the real browser or allowed for the websile/apis that usualy like a bot user-agent or automation from an app 

### **Exploration**
There are a lot of proxy providers out there. Do a research on at least 3 of them and compare it.