### **Getting HTML Content**
The Requests library is a popular Python tool that allows you to easily make HTTP requests. It is commonly used for web scraping, where you need to gather data from websites. Think of HTTP requests as asking a website to give you information, like a page of text or an image. With Requests, you can send these requests with just a few lines of code and receive responses that contain the information you need.

In [None]:
import requests
import os
import json

"""
Practice Exercise: Requests Basics

Complete each function below by following the TODO instructions. 
Each function includes the objective of the task and the expected output.
"""

In [3]:
# You can change the URL to any URL you like
# Execute this cell before continuing
import requests

example_url = "https://example.com"
response = requests.get(example_url)

In [6]:
"""
Objective: Understand the result of requests.get()
"""
# TODO: Print the response object and its type
print(response)
# TODO: Print the status code of the response
print(response.status_code)
# TODO: Print the reason phrase of the response
print(response.reason)

<Response [200]>
200
OK


In [15]:
"""
Objective: Compare the output of the response.text and response.content attributes.
"""
# TODO: Print the first 100 characters of the response text
print(response.text[:100])
# TODO: Print the first 100 characters of the response content
print(response.content[:100])
# TODO: Print the length of the response text
print(len(response.text))
# TODO: Print the length of the response content
print(len(response.content))    
# TODO: Print the type of the response text
print(type(response.text))
# TODO: Print the type of the response content
print(type(response.content))
# TODO: Save the response text to a file (the file name is an arbitrary choice)
with open("example_text.html", "w", encoding="utf-8") as f:
    f.write(response.text)
# TODO: Save the response content to a file (the file name is an arbitrary choice)
with open("example_content.html", "wb") as f:
    f.write(response.content)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <m
b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <m'
1256
1256
<class 'str'>
<class 'bytes'>
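
The two lengths above are equal, presumably because the page contains only ASCII characters, each of which takes exactly one byte in UTF-8. The sketch below uses a made-up string (not taken from the page) to show how the byte count and the character count diverge once non-ASCII characters appear, which is roughly the relationship between `response.content` and `response.text`:

```python
# Hypothetical payload standing in for response.content (raw bytes from the wire)
raw_bytes = "<p>Café für alle</p>".encode("utf-8")

# Decoding the bytes with the right encoding gives the equivalent of response.text
decoded_text = raw_bytes.decode("utf-8")

print(type(raw_bytes), len(raw_bytes))        # len() counts bytes
print(type(decoded_text), len(decoded_text))  # len() counts characters
```

Use `.content` for binary data such as images, and `.text` when you want a decoded string to search or parse.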

In [6]:
""" 
Objective: Use .content to download an image file
"""
import os
import requests

# image_url = "https://via.placeholder.com/150"
image_url = "https://i.etsystatic.com/19635104/r/il/887285/2591051093/il_570xN.2591051093_f2sk.jpg"
output_file = "image.jpg"
subfolder = "images"

# TODO: Use requests to download the image binary (image_url)    
# You can change the image url to any image you like
# TODO: Write the image binary to a file (output_file)
# TODO: Locate the image file in a sub-folder
# TODO: Print a message indicating the file has been saved
os.makedirs(subfolder, exist_ok=True)
response = requests.get(image_url)
response.raise_for_status() # Check if download was successful

output_path = os.path.join(subfolder, output_file)
with open(output_path, 'wb') as f:
    f.write(response.content)

print(f"Image saved successfully to {output_path}")


Image saved successfully to images\image.jpg


In [8]:
""" 
Objective: Implement looping to see how the website responds to over-requesting
"""

# TODO: Define the URL to fetch data from, you can use example_url variable
# TODO: Iterate 100 times to fetch data from the URL
# TODO: Print the status code and reason phrase of the response of each iteration
url = "https://i.etsystatic.com/19635104/r/il/887285/2591051093/il_570xN.2591051093_f2sk.jpg"

for i in range(100):
    response = requests.get(url)
    print(f"Request {i+1}: Status {response.status_code} - {response.reason}")
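
new_url = "https://api.github.com/users/octocat"
# TODO: Do the same as above but use the new_url variable
for i in range(100):
    response = requests.get(new_url)
    print(f"Request {i+1}: Status {response.status_code} - {response.reason}")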

Request 1: Status 200 - OK
Request 2: Status 200 - OK
Request 3: Status 200 - OK
Request 4: Status 200 - OK
Request 5: Status 200 - OK
Request 6: Status 200 - OK
Request 7: Status 200 - OK
Request 8: Status 200 - OK
Request 9: Status 200 - OK
Request 10: Status 200 - OK
Request 11: Status 200 - OK
Request 12: Status 200 - OK
Request 13: Status 200 - OK
Request 14: Status 200 - OK
Request 15: Status 200 - OK
Request 16: Status 200 - OK
Request 17: Status 200 - OK
Request 18: Status 200 - OK
Request 19: Status 200 - OK
Request 20: Status 200 - OK
Request 21: Status 200 - OK
Request 22: Status 200 - OK
Request 23: Status 200 - OK
Request 24: Status 200 - OK
Request 25: Status 200 - OK
Request 26: Status 200 - OK
Request 27: Status 200 - OK
Request 28: Status 200 - OK
Request 29: Status 200 - OK
Request 30: Status 200 - OK
Request 31: Status 200 - OK
Request 32: Status 200 - OK
Request 33: Status 200 - OK
Request 34: Status 200 - OK
Request 35: Status 200 - OK
Request 36: Status 200 - OK
R
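
When a server does start rejecting rapid requests, it may answer with a status such as 429 (Too Many Requests). A common reaction is to pause between attempts and wait longer after each rejection. The sketch below illustrates that idea without touching the network: `fetch_with_backoff` and the `fake_fetch` stand-in are hypothetical names, and in real use `fetch` would wrap something like `requests.get(url).status_code`.

```python
import time
from typing import Callable, List, Tuple


def fetch_with_backoff(fetch: Callable[[], int],
                       max_attempts: int = 5,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> Tuple[int, List[float]]:
    """Call fetch() until it stops returning 429, doubling the wait each time.

    Returns the last status code and the list of delays that were applied.
    """
    delays = []
    status = fetch()
    for attempt in range(max_attempts - 1):
        if status != 429:
            break
        delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
        delays.append(delay)
        sleep(delay)
        status = fetch()
    return status, delays


# Stand-in for a rate-limited server: rejects twice, then succeeds
responses = iter([429, 429, 200])


def fake_fetch() -> int:
    return next(responses)


# Pass a no-op sleep so the demo finishes instantly
status, delays = fetch_with_backoff(fake_fetch, sleep=lambda seconds: None)
print(status, delays)
```

Injecting `sleep` as a parameter is only a convenience for trying the logic quickly; with the default `time.sleep` the function really waits.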

In [9]:
""" 
Objective: Understanding HTTP error codes
"""
urls = [
    "https://httpstat.us/400", "https://httpstat.us/403", "https://httpstat.us/404",
    "https://httpstat.us/500", "https://httpstat.us/503", "https://httpstat.us/504",
    "https://httpstat.us/521", "https://httpstat.us/522", "https://httpstat.us/200"
]

# TODO: Loop through each URL in the list
# TODO: Use requests to fetch the URL
# TODO: Print the status code and its reason
for url in urls:
    try:
        response = requests.get(url)
        print(f"URL: {url}")
        print(f"Status: {response.status_code} - {response.reason}\n")
    except requests.RequestException as e:
        print(f"Error accessing {url}: {e}\n")

URL: https://httpstat.us/400
Status: 400 - Bad Request

URL: https://httpstat.us/403
Status: 403 - Forbidden

URL: https://httpstat.us/404
Status: 404 - Not Found

URL: https://httpstat.us/500
Status: 500 - Internal Server Error

URL: https://httpstat.us/503
Status: 503 - Service Unavailable

URL: https://httpstat.us/504
Status: 504 - Gateway Timeout

URL: https://httpstat.us/521
Status: 521 - Web Server Is Down

URL: https://httpstat.us/522
Status: 522 - Connection Timed out

URL: https://httpstat.us/200
Status: 200 - OK
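
The first digit of a status code indicates its class: 4xx codes point to a problem with the request, 5xx codes to a problem on the server. As a rough sketch, the standard-library `http.HTTPStatus` enum can look up the registered phrase. `describe_status` is a hypothetical helper, and codes such as 521 and 522 are not in the standard registry (they are Cloudflare-specific), so the lookup falls back to a generic label for them.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code, its standard phrase (if any), and its class."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    category = classes.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Raised for codes that are not part of the standard registry
        phrase = "non-standard code"
    return f"{code} {phrase} ({category})"


for code in (200, 404, 503, 521):
    print(describe_status(code))
```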



In [None]:
"""
Objective: Implement timeout to limit waiting time for a response
"""
import os
import requests

adidas_url = "https://www.adidas.de/api/products/ID9465"

# TODO: Use requests to fetch the adidas URL without a timeout and see how long you have to wait
# TODO: Use requests to fetch the adidas URL with a timeout of 5 seconds
# TODO: Use try-except to handle the timeout error

# Request without timeout
print("Making request without timeout...")
try:
    response = requests.get(adidas_url)
    print(f"Response received: {response.status_code}")
except requests.RequestException as e:
    print(f"Request failed: {e}")

# Request with 5 second timeout
print("\nMaking request with 5 second timeout...")
try:
    response = requests.get(adidas_url, timeout=5)
    print(f"Response received: {response.status_code}")
except requests.Timeout:
    print("Request timed out after 5 seconds")
except requests.RequestException as e:
    print(f"Request failed: {e}")


Making request with 5 second timeout...
Request timed out after 5 seconds


In [None]:
""" 
Objective: Handle failed requests in the middle of a loop
"""

urls = [
    "https://httpstat.us/200", 
    "invalid_url.com",
    "https://httpstat.us/200", 
    "https://www.adidas.de/api/products/ID9465"
]

# TODO: Loop through each URL in the list
# TODO: Use requests to fetch the URL
# TODO: Handle HTTP error and print the error message
# TODO: Handle all other exceptions (RequestException)

for url in urls:
    print(f"\nTrying URL: {url}")
    try:
        if not url.startswith(('http://', 'https://')):
            url = 'http://' + url
            
        response = requests.get(url, timeout=5)  # avoid hanging on unresponsive hosts
        response.raise_for_status() 
        print(f"Success! Status: {response.status_code} - {response.reason}")
        
    except requests.HTTPError as e:
        print(f"HTTP Error: {e}")
    except requests.RequestException as e:
        print(f"Request Failed: {e}")


Trying URL: https://httpstat.us/200
Success! Status: 200 - OK

Trying URL: invalid_url.com
Request Failed: HTTPConnectionPool(host='invalid_url.com', port=80): Max retries exceeded with url: / (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x000001E5B257B200>: Failed to resolve 'invalid_url.com' ([Errno 11001] getaddrinfo failed)"))

Trying URL: https://httpstat.us/200
Success! Status: 200 - OK

Trying URL: https://www.adidas.de/api/products/ID9465


In [2]:
""" 
Objective: Get response in JSON format
"""
import requests

api_url = "https://api.ipify.org?format=json"

# TODO: Use requests to fetch your IP from the API (api_url)
# TODO: Compare the response using both .text and .json()
# TODO: Extract the IP address from the JSON response
# TODO: Print your IP


response = requests.get(api_url)
response.raise_for_status()

# Compare text vs json responses
print("Response as text:", response.text)
print("Response as parsed JSON:", response.json())

# Extract and print IP
ip_data = response.json()
ip_address = ip_data['ip']
print(f"\nYour IP address is: {ip_address}")

Response as text: {"ip":"114.10.42.165"}
Response as parsed JSON: {'ip': '114.10.42.165'}

Your IP address is: 114.10.42.165


In [4]:
""" 
Objective: Write the JSON response to a file
"""
import json
api_url = "https://jsonplaceholder.typicode.com/posts"
output_file = "posts.json"

# TODO: Use requests.get() to fetch JSON data from the API (api_url)
# TODO: Write the JSON response to a file (output_file)
# TODO: Print a message indicating the file has been saved

response = requests.get(api_url)
response.raise_for_status()

# Write JSON response to file
with open(output_file, 'w') as f:
    json.dump(response.json(), f, indent=2)

print(f"JSON data has been saved to {output_file}")

JSON data has been saved to posts.json


In [6]:
""" 
Objective: Send a POST request with payload data
"""
api_url = "https://httpbin.org/post"
payload = {"name": "Rudi Kurniawan", "age": 40}

# TODO: Replace name value with your name in the payload
# TODO: Use requests.post() to send payload (payload) to the API (api_url)
# TODO: Print the JSON response from the POST request and print back your payload data

# Send POST request with payload
response = requests.post(api_url, json=payload)
response.raise_for_status()

# Print response and payload data
print("Server Response:")
print(response.json())
print("\nSent Payload:")
print(payload)

Server Response:
{'args': {}, 'data': '{"name": "Rudi Kurniawan", "age": 40}', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br, zstd', 'Content-Length': '37', 'Content-Type': 'application/json', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.32.3', 'X-Amzn-Trace-Id': 'Root=1-67c77f39-498fea4c5bcacebf1c003850'}, 'json': {'age': 40, 'name': 'Rudi Kurniawan'}, 'origin': '114.10.42.165', 'url': 'https://httpbin.org/post'}

Sent Payload:
{'name': 'Rudi Kurniawan', 'age': 40}
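
In the server response above, the payload shows up under `json` and the `Content-Type` is `application/json` because the request used the `json=` argument. As a small sketch of the alternative, the code below builds the same request with `json=` and with `data=` and inspects both without sending anything; the variable names `as_json` and `as_form` are only illustrative.

```python
import json
from urllib.parse import parse_qs

import requests

api_url = "https://httpbin.org/post"
payload = {"name": "Rudi Kurniawan", "age": 40}

# Build the requests without sending them, to inspect the body and headers
as_json = requests.Request("POST", api_url, json=payload).prepare()
as_form = requests.Request("POST", api_url, data=payload).prepare()

# json= serializes the dict to a JSON document
print(as_json.headers["Content-Type"], as_json.body)
# data= with a dict produces a form-encoded body instead
print(as_form.headers["Content-Type"], as_form.body)
```

Which one to use depends on what the API expects: JSON APIs usually want `json=`, while classic HTML form endpoints usually want `data=`.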


### **Reflection**
What is the difference between sending an HTTP request directly using Requests and using a browser?

Here are the key differences between using Python Requests and a web browser for HTTP requests:

1. Headers and User Agent

   - Browsers automatically send many headers (User-Agent, Accept-Language, cookies, etc.)
   - Requests sends only a few default headers (including a `python-requests` User-Agent); anything else must be set explicitly
2. JavaScript Execution

   - Browsers execute JavaScript and render dynamic content
   - Requests only receives the raw HTML or data and cannot execute JavaScript
3. Cookie Management

   - Browsers store cookies and maintain sessions automatically
   - Requests does not carry cookies between separate calls unless you use a `requests.Session` or pass them yourself
4. Cache

   - Browsers maintain a cache of resources and follow cache headers
   - Requests does not cache by default, so every call goes to the server
5. Security Features

   - Browsers verify HTTPS certificates and enforce policies such as CORS
   - Requests also verifies HTTPS certificates by default, but browser-only policies such as CORS do not apply to it
6. Resource Loading

   - Browsers automatically load related resources (images, CSS, scripts)
   - Requests only fetches the specifically requested URL
7. User Interface

   - Browsers provide visual feedback and rendering
   - Requests is programmatic, so your code has to handle the responses

These differences make browsers better for interactive web browsing, while Requests is better for automated data collection and API interactions.
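
To make the header and cookie points concrete, here is a minimal sketch that inspects what Requests would send, without making any network call. The User-Agent string and the `visited` cookie are illustrative values, not something a particular site requires.

```python
import requests

# Headers Requests sends by default (no network traffic involved)
print(dict(requests.utils.default_headers()))

# A Session keeps headers and cookies across requests, a bit like a browser tab
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (compatible; tutorial-example)",  # illustrative value
    "Accept-Language": "en-US,en;q=0.9",
})
session.cookies.set("visited", "yes", domain="example.com", path="/")  # hypothetical cookie

# Prepare (but do not send) a request to see what would go over the wire
prepared = session.prepare_request(requests.Request("GET", "https://example.com"))
print(prepared.headers["User-Agent"])
print(prepared.headers["Accept-Language"])
print(prepared.headers.get("Cookie"))
```

Changing the headers only changes how the request looks to the server; it does not make Requests execute JavaScript or load linked resources.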

### **Exploration**
Now, you should be able to get HTML content using Requests and parse it using BeautifulSoup inside your program. Once your program terminates, the data will be lost. How can you access the data after the program is closed?

There are several ways to persist the data after your program terminates:

In [None]:
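import requests
from bs4 import BeautifulSoup

# Fetch and parse a page first; `response` and `soup` are used by every snippet below
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Save raw HTML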
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

# Save parsed content
with open('parsed_content.txt', 'w', encoding='utf-8') as f:
    f.write(soup.get_text())
    
import json

# Save to json
# Convert data to structured format
data = {
    'title': soup.title.text,
    'paragraphs': [p.text for p in soup.find_all('p')]
}

# Save to JSON file
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2)
    
# Save to csv
import csv

with open('table_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for row in soup.find_all('tr'):
        writer.writerow([cell.text for cell in row.find_all(['td', 'th'])])

# Save to Database
import sqlite3

# Create/connect to database
conn = sqlite3.connect('scraped_data.db')
cursor = conn.cursor()

# Create table and insert data
cursor.execute('''CREATE TABLE IF NOT EXISTS articles
                 (title TEXT, content TEXT)''')
cursor.execute('INSERT INTO articles VALUES (?, ?)',
              (soup.title.text, soup.get_text()))
conn.commit()
conn.close()

Each method has its advantages:

- Text/HTML files: Simple, human-readable
- JSON: Structured data, easy to parse
- CSV: Good for tabular data
- Database: Queryable, structured, good for large datasets
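
Saving is only half of the answer; a later run of the program has to load the data back. The sketch below walks through both halves for the JSON and SQLite options using only the standard library. The `record` dictionary is a hypothetical stand-in for real scraped data, and a temporary folder is used so the example leaves no files in your project.

```python
import json
import os
import sqlite3
import tempfile

# Hypothetical scraped record, standing in for real parsed data
record = {"title": "Example Domain",
          "paragraphs": ["First paragraph.", "Second paragraph."]}

workdir = tempfile.mkdtemp()
json_path = os.path.join(workdir, "data.json")
db_path = os.path.join(workdir, "scraped_data.db")

# --- First run: persist the data ---
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(record, f, indent=2)

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT, content TEXT)")
conn.execute("INSERT INTO articles VALUES (?, ?)",
             (record["title"], "\n".join(record["paragraphs"])))
conn.commit()
conn.close()

# --- Later run: load it back ---
with open(json_path, encoding="utf-8") as f:
    loaded = json.load(f)

conn = sqlite3.connect(db_path)
rows = conn.execute("SELECT title, content FROM articles WHERE title = ?",
                    (record["title"],)).fetchall()
conn.close()

print(loaded["title"], len(rows))
```

In a real scraper the two halves would live in separate runs or scripts, with fixed file paths instead of a temporary folder.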