### **What is a User-Agent?**

The User-Agent (UA) is a string sent by a client (like a web browser or a bot) to identify itself to the server. It tells the server about the client’s operating system, browser, and device. For example, when you access a website via a browser like Chrome or Firefox, the browser sends a User-Agent string with each request to the website.
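To make this concrete, the sketch below prints the User-Agent that Python's standard-library `urllib` sends by default, then roughly splits a browser User-Agent into its parts. The `describe_user_agent` helper and its regular expressions are illustrative assumptions, not a complete parser.

```python
import re
import urllib.request

# urllib's default opener carries a ("User-agent", "Python-urllib/x.y") header
default_ua = dict(urllib.request.build_opener().addheaders)["User-agent"]
print(default_ua)


def describe_user_agent(ua):
    """Roughly split a browser-style User-Agent into platform and product tokens."""
    platform = re.search(r"\(([^)]*)\)", ua)            # first parenthesised group
    products = re.findall(r"([A-Za-z]+)/([\d.]+)", ua)  # name/version pairs
    return {
        "platform": platform.group(1) if platform else None,
        "products": products,
    }


chrome_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
info = describe_user_agent(chrome_ua)
print(info["platform"])
print(info["products"])
```

A default such as `Python-urllib/...` tells the server immediately that the client is a script rather than a browser.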

In [2]:
import requests
import httpx

In [3]:
"""
Objective: Understanding Request Headers
"""
# TODO: Send request to https://httpbin.org/get using requests and httpx
# TODO: Get the request headers from both responses
# TODO: Compare the request headers and understand the difference

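req = requests.get('https://httpbin.org/get')
r = httpx.get('https://httpbin.org/get')

# Note: .headers on a response holds the headers the server sent back.
# The headers each library actually sent are in req.request.headers and r.request.headers.
print("request", req.headers)
print("httpx", r.headers)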

request {'Date': 'Thu, 10 Apr 2025 01:25:09 GMT', 'Content-Type': 'application/json', 'Content-Length': '305', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
httpx Headers({'date': 'Thu, 10 Apr 2025 01:25:10 GMT', 'content-type': 'application/json', 'content-length': '302', 'connection': 'keep-alive', 'server': 'gunicorn/19.9.0', 'access-control-allow-origin': '*', 'access-control-allow-credentials': 'true'})
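A small helper can make the comparison easier to read. The sketch below is an assumption about how one might compare two header sets, not part of either library: `diff_headers` is a hypothetical function, and the sample values are a subset copied from the two outputs above.

```python
def diff_headers(first, second):
    """Return {name: (first_value, second_value)} for headers that differ.

    Names are compared case-insensitively, because HTTP treats
    'Content-Length' and 'content-length' as the same header.
    """
    a = {k.lower(): v for k, v in first.items()}
    b = {k.lower(): v for k, v in second.items()}
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


# Subset of the values printed above
requests_headers = {'Content-Type': 'application/json', 'Content-Length': '305', 'Server': 'gunicorn/19.9.0'}
httpx_headers = {'content-type': 'application/json', 'content-length': '302', 'server': 'gunicorn/19.9.0'}

difference = diff_headers(requests_headers, httpx_headers)
print(difference)
```

Only `content-length` differs here. One plausible reason is that httpbin echoes the request headers in the response body, and the two libraries send slightly different defaults.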


In [4]:
"""
Objective: Modify request headers
"""
# TODO: Send request to https://httpbin.org/get using requests
# TODO: Get the request headers from the response

r = requests.get('https://httpbin.org/get')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
}

# TODO: Send new request using modified headers by passing headers params in get method
# TODO: Get the request headers from the response
# TODO: Compare the request headers and understand the difference
# TODO: Experiment with different headers and share your thoughts

# Send the request with modified headers
r = requests.get('https://httpbin.org/get', headers=headers)

# Headers that were actually sent with the request
print(r.request.headers)

# r.headers holds the server's response headers, not the original request headers,
# so the two sets are labelled accordingly
response_headers = r.headers
print("Response Headers:")
print(response_headers)
print("\nModified Request Headers:")
print(r.request.headers)

{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}
Response Headers:
{'Date': 'Thu, 10 Apr 2025 01:25:21 GMT', 'Content-Type': 'application/json', 'Content-Length': '402', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}

Modified Request Headers:
{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}


In [5]:
"""
Objective: Bypassing User-Agent Blocking
"""
# TODO: Send request to https://gamefaqs.gamespot.com/news using requests with and without custom headers
# TODO: Compare the response, which one is blocked and which one is not
r = requests.get('https://gamefaqs.gamespot.com/news')
print("without custom header",r.status_code)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
}

rNew = requests.get('https://gamefaqs.gamespot.com/news', headers=headers)
print("with custom headers",rNew.status_code)


without custom header 400
with custom headers 200


In [6]:
"""
Objective: Understanding User-Agent Rotation
If everyone you met today were wearing the same shirt, what would you think?
"""
# List of common User-Agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0'
]

# TODO: Send request to https://httpbin.org/get using random User-Agent from the list
# TODO: Get the response and print the used User-Agent
# TODO: Try to execute it again, is the User-Agent still the same?
# TODO: Loop to send up to 10 requests, using a different User-Agent from the list each time, and print each User-Agent used

import random

for i in range(10):
    # random.choice picks one User-Agent from the list for each request
    headers = {
        'User-Agent': random.choice(user_agents)
    }
    r = requests.get('https://httpbin.org/get', headers=headers)
    print(r.json()['headers']['User-Agent'])

Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
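The output above shows that random selection can repeat the same User-Agent several times in a row. The sketch below contrasts random choice with a round-robin rotation that cycles through the list in order. Both helper names (`random_headers`, `round_robin_headers`) are hypothetical, and no request is sent, so it runs offline.

```python
import itertools
import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0',
]


def random_headers(pool):
    """Pick any User-Agent from the pool; repeats are possible."""
    return {"User-Agent": random.choice(pool)}


def round_robin_headers(pool):
    """Yield headers forever, walking through the pool in order."""
    for ua in itertools.cycle(pool):
        yield {"User-Agent": ua}


rotation = round_robin_headers(user_agents)
first_cycle = [next(rotation)["User-Agent"] for _ in range(len(user_agents))]
for ua in first_cycle:
    print(ua)
```

Each `next(rotation)` result can be passed as `headers=` to `requests.get`, which guarantees that every User-Agent in the list is used once before any repeats.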


In [8]:
"""
Objective: Using fake user-agent library
"""
# TODO: Install fake_useragent using pip
# TODO: Create a UserAgent object
# TODO: To get a random User-Agent, use ua.random
# TODO: Send request to https://httpbin.org/get using random User-Agent from the fake_useragent
# TODO: Get the response and print the used User-Agent
# TODO: Try to execute it again, is the User-Agent still the same?

from fake_useragent import UserAgent

uAgent = UserAgent()

for i in range(10):
    # uAgent.random returns a randomly chosen User-Agent string on each access
    headers = {
        'User-Agent': uAgent.random
    }
    r = requests.get('https://httpbin.org/get', headers=headers)
    print(r.json()['headers']['User-Agent'])


Mozilla/5.0 (iPhone; CPU iPhone OS 18_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.3 Mobile/15E148 Safari/604.1
Mozilla/5.0 (iPhone; CPU iPhone OS 18_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.3 Mobile/15E148 Safari/604.1
Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.3 Safari/605.1.15
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36
Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36
Mozilla/5.0 (iPhone; CPU iPhone OS 18_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.3 Mobile/15E148 Safari/604.1
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, li

In [13]:
"""
Objective: Improve web scraping by using fake_useragent and logging
"""
# TODO: Visit https://www.cmegroup.com/markets/energy/crude-oil/light-sweet-crude.settlements.html
# TODO: Try sending request to that site without custom header
# TODO: If you failed, use random User-Agent using fake_useragent
# TODO: Extract the data table and save it to a json file

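import logging
import time

import pandas as pd
from fake_useragent import UserAgent
# curl_cffi exposes a requests-like API and can impersonate a browser's TLS fingerprint.
# Importing it under the name `requests` replaces the plain requests module in this cell.
from curl_cffi import requests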

logging.basicConfig(level=logging.INFO)
uAgent = UserAgent()
timeout = 60  # Increased timeout to 60 seconds
max_retries = 3  # Maximum number of retries
url = 'https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/425/FUT?strategy=DEFAULT&tradeDate=04/08/2025&pageSize=500&isProtected&_t=1744252567504'

for attempt in range(max_retries):
    try:
        headers = { 'User-Agent': uAgent.random } # Random User-Agent from fake_useragent
        r = requests.get(url, headers=headers, timeout=timeout, impersonate="chrome110")  # Set timeout for the request
        r.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
        break  # If the request is successful, break out of the loop
    except requests.exceptions.RequestException as e:
        logging.error(f"Attempt {attempt + 1} failed: {e}")
        if attempt < max_retries - 1:
            time.sleep(10)  # Wait for 10 seconds before retrying
        else:
            logging.error("Max retries reached. Request failed.")
            raise SystemExit(1)  # Stop here if all retries fail

if r.status_code == 200:
    print("Request succeeded with status code:", r.status_code)
    data = r.json()
    # Extract the data table from the response
    df = pd.DataFrame(data['settlements'])
    # Save the data to a file; lines=True writes one JSON object per line (JSON Lines)
    df.to_json('settlements.json', orient='records', lines=True)
    print("Data saved to settlements.json")
else:
    print("Request failed with status code:", r.status_code)


Request succeeded with status code: 200
Data saved to settlements.json
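The cell above waits a fixed 10 seconds between attempts. A common alternative is exponential backoff, where the wait doubles after every failure. The sketch below is a generic illustration under that assumption: `retry`, `flaky`, and the injected `sleep` argument are hypothetical names, and the failing operation is simulated so the example runs offline.

```python
import logging
import time


def retry(operation, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_retries attempts are used."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as exc:  # a real scraper would catch only network errors
            logging.error("Attempt %d failed: %s", attempt + 1, exc)
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Stand-in operation that fails twice, then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = retry(flaky, sleep=delays.append)
print(result, delays)
```

In the scraping cell, `operation` would wrap the `requests.get(...)` and `raise_for_status()` calls.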


### **Proxy Rotation**

A proxy acts as an intermediary server between your scraping script (client) and the target website (server). When you send a request through a proxy, the request is routed through the proxy server, which then forwards the request to the destination website. The website sees the request coming from the proxy’s IP address instead of your actual IP address. This makes proxies particularly useful in web scraping, as they help with anonymity, bypassing rate limits, and preventing IP bans.
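In `requests`, a proxy is passed as a dictionary that maps the target URL scheme to the proxy URL. The sketch below builds that dictionary from a plain `host:port` or `user:pass@host:port` entry. The `build_proxies` helper is hypothetical, and the addresses are placeholders from the documentation-only 203.0.113.0/24 range.

```python
def build_proxies(entry, scheme="http"):
    """Turn 'host:port' or 'user:pass@host:port' into a requests-style proxies dict.

    Both keys point at the same proxy URL: a plain HTTP proxy usually
    carries HTTPS traffic as well, by tunnelling it with CONNECT.
    """
    entry = entry.strip()
    if "://" not in entry:
        entry = f"{scheme}://{entry}"
    return {"http": entry, "https": entry}


plain = build_proxies("203.0.113.10:8080\n")
with_auth = build_proxies("user:secret@203.0.113.10:8080")
print(plain)
print(with_auth)
```

The result can be passed directly as `requests.get(url, proxies=plain)`.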

In [14]:
""" 
Objective: Understanding rate limits
"""
# TODO: Execute this cell

url = "https://api.github.com/users/octocat"

for i in range(1, 66):
    response = requests.get(url)
    print(f"Sending request {i}. Status code: {response.status_code}")

Sending request 1. Status code: 403
Sending request 2. Status code: 403
Sending request 3. Status code: 403
Sending request 4. Status code: 403
Sending request 5. Status code: 403
Sending request 6. Status code: 403
Sending request 7. Status code: 403
Sending request 8. Status code: 403
Sending request 9. Status code: 403
Sending request 10. Status code: 403
Sending request 11. Status code: 403
Sending request 12. Status code: 403
Sending request 13. Status code: 403
Sending request 14. Status code: 403
Sending request 15. Status code: 403
Sending request 16. Status code: 403
Sending request 17. Status code: 403
Sending request 18. Status code: 403
Sending request 19. Status code: 403
Sending request 20. Status code: 403
Sending request 21. Status code: 403
Sending request 22. Status code: 403
Sending request 23. Status code: 403
Sending request 24. Status code: 403
Sending request 25. Status code: 403
Sending request 26. Status code: 403
Sending request 27. Status code: 403
Sending re
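GitHub allows 60 unauthenticated API requests per hour per IP address and reports the remaining quota in `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. A 403 on every request, as above, suggests the quota may already have been used up. The sketch below shows one way to turn those headers into a wait time. `seconds_until_reset` is a hypothetical helper, and the header values are made up for illustration.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to wait before retrying, or 0 if requests remain."""
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))  # Unix timestamp
    if remaining > 0:
        return 0
    return max(0, reset_at - int(now))


# Illustrative values, not taken from a real response
example = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000600"}
wait = seconds_until_reset(example, now=1700000000)
print(wait)
```

With a real response, `response.headers` can be passed in place of `example`. Its lookups are case-insensitive in `requests`, unlike the plain dictionary used here.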

In [36]:
""" 
Objective: Understanding Proxy
"""
import os

import requests
from dotenv import load_dotenv

# Keep proxy credentials out of the notebook: load them from a .env file
load_dotenv()
HTTP = os.getenv("HTTP")
HTTPS = os.getenv("HTTPS")

# webshare.io proxy taken from the environment
webshare_Proxy = {
    "http": HTTP,
    "https": HTTPS
}

# Open the proxies.txt file
with open('proxies.txt', 'r') as file:
    proxy_list = file.readlines()  # Read all lines into a list

# Iterate over proxies
for proxy in proxy_list:
    proxy = proxy.strip()  # Remove leading/trailing whitespace

    # The same form works for "IP:Port" and "user:pass@IP:Port" entries.
    # Both keys use the http:// scheme, because a plain HTTP proxy usually tunnels HTTPS traffic via CONNECT.
    proxies = {
        "http": f"http://{proxy}",
        "https": f"http://{proxy}"
    }

    # Note: the request below goes through webshare_Proxy, not the per-line `proxies` dict,
    # so the "origin" in the response is the webshare.io exit IP rather than the proxy named in the message.
    # Pass proxies=proxies instead to test each entry from proxies.txt.
    try:
        response = requests.get("https://httpbin.org/ip", proxies=webshare_Proxy, timeout=30)
        print(f"Proxy {proxy} works! Response: {response.text}")
    except requests.exceptions.RequestException as e:
        print(f"Proxy {proxy} failed. Error: {e}")

Proxy 51.44.176.151:20202 works! Response: {
  "origin": "185.199.228.220"
}

Proxy 3.104.88.178:80 works! Response: {
  "origin": "86.38.234.176"
}

Proxy 43.159.152.105:13001 works! Response: {
  "origin": "173.211.0.148"
}

Proxy 170.106.135.2:13001 works! Response: {
  "origin": "86.38.234.176"
}

Proxy 32.223.6.94:80 works! Response: {
  "origin": "86.38.234.176"
}



In [None]:
""" 
Objective: Finding free proxies
"""
# TODO: Do a Google search for free proxies and share your thoughts
# I usually use webshare.io for free proxies; the free plan gives you 10 proxies


In [None]:
""" 
Objective: Using free proxies
"""

proxy_url = "https://api.proxyscrape.com/v4/free-proxy-list/get?request=display_proxies&proxy_format=protocolipport&format=text"

proxy_list = requests.get(proxy_url).text # Get the proxy list
proxy_list = proxy_list.strip().splitlines() # Split the proxy list into lines (handles both \n and \r\n)
proxy_list = [proxy for proxy in proxy_list if 'http' in proxy] # Filter http only
print(len(proxy_list))

url = "https://api.github.com/users/octocat"

# TODO: Trigger blocking by sending 60 or more requests to the URL

# TODO: Once we hit the rate limit, we need to use a proxy
# TODO: A free proxy may not always work, use looping to find a working proxy (response.status_code == 200)
# TODO: Once the request is successful, print the proxy used and exit the loop


809


In [37]:
url = "https://api.github.com/users/octocat"

# TODO: Trigger blocking by sending 60 or more requests to the URL
for i in range(1, 66):
    response = requests.get(url)
    print(f"Sending request {i}. Status code: {response.status_code}")

Sending request 1. Status code: 200
Sending request 2. Status code: 200
Sending request 3. Status code: 200
Sending request 4. Status code: 200
Sending request 5. Status code: 200
Sending request 6. Status code: 200
Sending request 7. Status code: 200
Sending request 8. Status code: 200
Sending request 9. Status code: 200
Sending request 10. Status code: 200
Sending request 11. Status code: 200
Sending request 12. Status code: 200
Sending request 13. Status code: 200
Sending request 14. Status code: 200
Sending request 15. Status code: 200
Sending request 16. Status code: 200
Sending request 17. Status code: 200
Sending request 18. Status code: 200
Sending request 19. Status code: 200
Sending request 20. Status code: 200
Sending request 21. Status code: 200
Sending request 22. Status code: 200
Sending request 23. Status code: 200
Sending request 24. Status code: 200
Sending request 25. Status code: 200
Sending request 26. Status code: 200
Sending request 27. Status code: 200
Sending re

In [40]:
# TODO: Once we hit the rate limit, we need to use a proxy
# TODO: A free proxy may not always work, use looping to find a working proxy (response.status_code == 200)
# TODO: Once the request is successful, print the proxy used and exit the loop
url = "https://api.github.com/users/octocat"
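for i in range(1, 66):
    # The request is routed through the webshare.io proxy defined earlier
    response = requests.get(url, proxies=webshare_Proxy)
    print(f"Sending request {i}. Status code: {response.status_code}")
    if response.status_code == 200:
        # Note: `proxy` is left over from an earlier cell; it is not the proxy used for this request
        print(f"Proxy {proxy} works! Response: {response.text}")
        break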

Sending request 1. Status code: 200
Proxy http://15.235.53.20:28003 works! Response: {"login":"octocat","id":583231,"node_id":"MDQ6VXNlcjU4MzIzMQ==","avatar_url":"https://avatars.githubusercontent.com/u/583231?v=4","gravatar_id":"","url":"https://api.github.com/users/octocat","html_url":"https://github.com/octocat","followers_url":"https://api.github.com/users/octocat/followers","following_url":"https://api.github.com/users/octocat/following{/other_user}","gists_url":"https://api.github.com/users/octocat/gists{/gist_id}","starred_url":"https://api.github.com/users/octocat/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/octocat/subscriptions","organizations_url":"https://api.github.com/users/octocat/orgs","repos_url":"https://api.github.com/users/octocat/repos","events_url":"https://api.github.com/users/octocat/events{/privacy}","received_events_url":"https://api.github.com/users/octocat/received_events","type":"User","user_view_type":"public","site_admin":fals
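The TODOs ask for a loop that tries proxies from the list until one succeeds. The sketch below shows that loop in isolation. `first_working_proxy` and `fake_check` are hypothetical names, the check is simulated so the example runs offline, and the addresses are placeholders.

```python
def first_working_proxy(candidates, check):
    """Return the first proxy for which check(proxy) is truthy, else None."""
    for proxy in candidates:
        try:
            if check(proxy):
                return proxy
        except Exception:  # dead proxies usually raise timeouts or connection errors
            continue
    return None


# Stand-in check: the first proxy "times out", the second is "blocked", the third works
def fake_check(proxy):
    if proxy.endswith(":1"):
        raise TimeoutError("simulated dead proxy")
    return proxy.endswith(":3")


candidates = ["http://203.0.113.1:1", "http://203.0.113.2:2", "http://203.0.113.3:3"]
working = first_working_proxy(candidates, fake_check)
print(working)
```

With `requests`, the check could be something like `lambda p: requests.get(url, proxies={"http": p, "https": p}, timeout=10).status_code == 200`, applied to the `proxy_list` fetched earlier.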

### **Reflection**
Which User-Agents typically get blocked? Can you name some?

1. Web Scraping Bots
- Python-urllib/3.9
- Scrapy/2.5
- Java/1.8.0_181
- curl/7.64.1
- wget/1.20.3

2. Aggressive SEO and Marketing Crawlers
- AhrefsBot
- MJ12bot
- SemrushBot
- DotBot
- BLEXBot

3. Outdated Browsers
- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
- Mozilla/5.0 (Windows NT 5.1; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3

4. Generic or Empty User-Agents
- Mozilla/5.0 (compatible; GenericBot/1.0)
- (empty User-Agent string)
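A server-side filter for the categories above can be as simple as a substring check. The sketch below is a deliberately naive illustration: `BLOCKED_TOKENS` and `looks_blocked` are hypothetical, and real sites typically combine the User-Agent with other signals.

```python
# Lower-cased fragments of library and tool User-Agents from the list above
BLOCKED_TOKENS = ("python-urllib", "scrapy", "java/", "curl/", "wget/")


def looks_blocked(user_agent):
    """Return True if the User-Agent is empty or matches a known tool signature."""
    ua = (user_agent or "").strip().lower()
    if not ua or ua == "-":
        return True
    return any(token in ua for token in BLOCKED_TOKENS)


browser_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
for candidate in ["Python-urllib/3.9", "curl/7.64.1", "", browser_ua]:
    print(repr(candidate[:30]), looks_blocked(candidate))
```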




### **Exploration**
There are many proxy providers out there. Research at least three of them and compare them.

https://medium.com/@rjrizani_66086/analysis-of-recommended-free-proxy-providers-for-web-scraping-776ff914ef6c    