# 05. 🌐 Standard Library Reference: `urllib` (URL Handling)

The **`urllib`** library is a package that collects several modules for working with Uniform Resource Locators (URLs).

### 🧐 The "High-Level" Analogy
Unlike the `socket` module (which is like building your own telephone), `urllib` is like using a **web browser** inside your Python script. It operates at the **Application Layer**, meaning it handles most of the messy details of headers, handshakes, and protocols for you.



**Key Topics Covered:**
* **`urllib.request`:** Fetching web pages (The "Browser").
* **`urllib.parse`:** Building and dissecting URLs (The "Translator").
* **`urllib.error`:** Handling HTTP exceptions (The "Troubleshooter").
* **`urllib.robotparser`:** Checking web scraping rules (The "Rulebook").

In [1]:
import urllib.request
import urllib.parse
import urllib.error
import urllib.robotparser
import json

## 1.1 📥 `urllib.request` (Fetching Data)

This module handles sending and retrieving data over HTTP, HTTPS, and FTP.

**The "File" Analogy:**
One of the best features of `urllib` is that it lets you treat a website much like a **file on your hard drive**.
1.  You `open` it.
2.  You `read` it.
3.  You `close` it.

We use the `with` statement (Context Manager) to ensure the connection is automatically closed after we are done, so open sockets are not leaked.
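
The open/read/close pattern can be tried without a network connection. The sketch below assumes a `data:` URL, which `urlopen` resolves locally, as a stand-in for a real website; the payload and the variable names are illustrative only.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a "data:" URL stands in for a real web page so this runs offline.
payload = json.dumps({"title": "demo item"})
demo_url = "data:application/json," + urllib.parse.quote(payload)

# 1. open it, 2. read it, 3. the "with" block closes it for us
with urllib.request.urlopen(demo_url) as response:
    raw_bytes = response.read()        # bytes, as with a real HTTP body
    text = raw_bytes.decode("utf-8")   # decode the bytes into a str

demo_dict = json.loads(text)
print(demo_dict["title"])
```

The same three steps apply when the URL points at a real HTTP server, as in the next cell.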

In [2]:
URL = 'https://jsonplaceholder.typicode.com/todos/1'

# 1. Basic GET request
# urlopen returns a response object (file-like)
try:
    print(f"Attempting to connect to {URL}...")
    
    # "with" automatically closes the socket when the block ends
    with urllib.request.urlopen(URL) as response:
        
        # KEY STEP: The internet sends 'bytes'. We must 'decode' them to string (utf-8).
        # .read() grabs the whole file at once
        data = response.read().decode('utf-8')
        
        # Load the JSON string into a Python dictionary
        data_dict = json.loads(data)
        
        print("\n--- 📩 Data Received ---")
        print(f"Title: {data_dict['title']}")
        print(f"Status Code: {response.status} (200 = OK)")
        
except Exception as e:
    print(f"Could not complete request: {e}")

Attempting to connect to https://jsonplaceholder.typicode.com/todos/1...

--- 📩 Data Received ---
Title: delectus aut autem
Status Code: 200 (200 = OK)


## 1.2 🔨 `urllib.parse` (Building URLs)

Have you ever seen a URL with odd sequences like `%20` or `%3F`? This is **URL encoding** (also called percent-encoding).
URLs may only contain a limited set of characters, and some of them, such as `?`, `&`, and `=`, have special meanings. Spaces and other unsafe symbols must be encoded first. `urllib.parse` is the tool that "packs" your data safely so it can travel across the internet without breaking.



**Why is this useful?**
If you are building a search tool or an API client (like for a weather app), you cannot just paste user input into a string. You must *encode* it first.
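
To make the `%20` and `%3F` sequences concrete, here is a minimal sketch using `urllib.parse.quote` and `urllib.parse.unquote`. The `user_input` string is a made-up example.

```python
import urllib.parse

user_input = "data science?"   # hypothetical text typed by a user

encoded = urllib.parse.quote(user_input)   # space -> %20, '?' -> %3F
decoded = urllib.parse.unquote(encoded)    # reverses the encoding

print(encoded)   # data%20science%3F
print(decoded)   # data science?
```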

In [3]:
base_url = 'https://api.example.com/search'
query_params = {'q': 'data science', 'lang': 'en'}

# 1. Encoding Query Parameters
# urlencode converts the dictionary {'q': 'data science', 'lang': 'en'}
# into the safe string 'q=data+science&lang=en' (spaces become '+')
encoded_params = urllib.parse.urlencode(query_params)
print(f"Encoded Params: {encoded_params}")

# 2. Combining Base URL and Parameters
full_url = f'{base_url}?{encoded_params}'
print(f"Full URL: {full_url}")

# 3. Dissecting a URL
# Sometimes you have a full URL and need to extract just the hostname or path.
parsed = urllib.parse.urlparse(full_url)
print(f"\n--- 🔍 URL Anatomy ---")
print(f"Scheme (Protocol): {parsed.scheme}")
print(f"Path (Location):   {parsed.path}")
print(f"Query (Data):      {parsed.query}")

Encoded Params: q=data+science&lang=en
Full URL: https://api.example.com/search?q=data+science&lang=en

--- 🔍 URL Anatomy ---
Scheme (Protocol): https
Path (Location):   /search
Query (Data):      q=data+science&lang=en
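
Dissecting can go one step further than `scheme`, `path`, and `query`. As a small sketch, `netloc` gives the hostname part and `urllib.parse.parse_qs` turns the query string back into a dictionary; the variable names here are illustrative.

```python
import urllib.parse

full_url = "https://api.example.com/search?q=data+science&lang=en"

parts = urllib.parse.urlparse(full_url)
host = parts.netloc                               # the "api.example.com" part
query_dict = urllib.parse.parse_qs(parts.query)   # values are lists, since a key may repeat

print(host)                # api.example.com
print(query_dict)          # {'q': ['data science'], 'lang': ['en']}
print(query_dict["q"][0])  # data science
```

Note that `parse_qs` also decodes the `+` back into a space, so the round trip recovers the original search text.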


## 1.3 🚨 `urllib.error` (Handling Failures)

Things go wrong on the internet. `urllib` gives you specific errors so you know *why* it failed.

* **`HTTPError` (The Server said "No"):** The server was reached, but it answered with an error status code (e.g., 404 Not Found, 403 Forbidden).
* **`URLError` (The Network failed):** We couldn't even reach the server (e.g., WiFi is off, DNS failed).

In [4]:
# We purposefully try to hit a page that doesn't exist (Status 404)
target_url = 'https://httpbin.org/status/404'

try:
    print(f"Connecting to {target_url}...")
    with urllib.request.urlopen(target_url) as response:
        print("This line should not be reached.")
        
except urllib.error.HTTPError as e:
    # HTTPError is a subclass of URLError, so this clause must come first;
    # otherwise the broader URLError clause below would catch it instead
    print(f"\n❌ Caught HTTP Error!")
    print(f"Code: {e.code} (Server Refused)")
    print(f"Reason: {e.reason}")
    
except urllib.error.URLError as e:
    # Catches non-HTTP errors like a DNS failure
    print(f"\n🚫 Caught URL Error!")
    print(f"Reason: {e.reason} (Network Issue)")

Connecting to https://httpbin.org/status/404...

❌ Caught HTTP Error!
Code: 404 (Server Refused)
Reason: NOT FOUND
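
The relationship between the two exception classes can be checked without touching the network. In the sketch below, `describe_failure` is a hypothetical helper and both exceptions are built by hand purely for illustration.

```python
import email.message
import io
import urllib.error

def describe_failure(exc):
    """Return a short label for an exception (hypothetical helper)."""
    # Check the subclass first: every HTTPError is also a URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP {exc.code}"
    if isinstance(exc, urllib.error.URLError):
        return f"network: {exc.reason}"
    return "other"

# Hand-built exceptions so the sketch runs offline.
http_exc = urllib.error.HTTPError(
    "https://example.com/missing", 404, "Not Found",
    email.message.Message(), io.BytesIO(),
)
url_exc = urllib.error.URLError("name resolution failed")

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(describe_failure(http_exc))   # HTTP 404
print(describe_failure(url_exc))    # network: name resolution failed
```

Swapping the two `isinstance` checks would label the 404 as a network problem, which is why the more specific class is tested first.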


## 1.4 🤖 `urllib.robotparser` (Web Scraping Ethics)

Before you write a script to download thousands of pages from a website, you must check their **`robots.txt`** file. This is the "House Rules" for bots.

`urllib.robotparser` helps you be a "polite" programmer by checking if your bot (`User-Agent`) is allowed to touch a specific page.

In [5]:
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.google.com/robots.txt')

# rp.read() downloads and parses robots.txt from the internet.
# Until the rules have been loaded, can_fetch() returns False for every URL.
try:
    rp.read()
    # Test whether a given bot (User-Agent) is allowed to crawl a path
    allowed = rp.can_fetch('Googlebot', '/search')
    print(f"Is Googlebot allowed to crawl /search? {allowed}")
except urllib.error.URLError as e:
    print(f"Could not load robots.txt: {e.reason}")
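
The rule check itself can also be explored offline by handing the parser some rules directly through `parse()`. The rules, the bot name `MyBot`, and the `example.com` URLs below are all made up for illustration.

```python
import urllib.robotparser

# Assumption: hypothetical rules for an imaginary site, supplied inline.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp_demo = urllib.robotparser.RobotFileParser()
rp_demo.parse(rules.splitlines())   # parse() accepts an iterable of lines

blocked = rp_demo.can_fetch("MyBot", "https://example.com/private/report")
allowed_page = rp_demo.can_fetch("MyBot", "https://example.com/public/page")

print(blocked)        # False
print(allowed_page)   # True
```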

---

## Mini-Challenge: The Query Builder

**Task:** You are building an API request to filter products on an e-commerce site.

1.  Define a dictionary of query parameters: `{'product': 'Laptop Pro', 'sort': 'price', 'limit': 25, 'user': 'Shashika'}`.
2.  Build the final, correctly encoded URL using `urllib.parse.urlencode()` and string formatting.

In [6]:
base_api = "https://data.example.com/api/v1/filter"
params = {'product': 'Laptop Pro', 'sort': 'price', 'limit': 25, 'user': 'Shashika'}

# Write your solution here

encoded = urllib.parse.urlencode(params)
final_url = f"{base_api}?{encoded}"

print(final_url)

https://data.example.com/api/v1/filter?product=Laptop+Pro&sort=price&limit=25&user=Shashika
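
Some APIs expect spaces as `%20` rather than `+`. As an optional variation on the solution, `urlencode` accepts a `quote_via` argument; passing `urllib.parse.quote` switches the style. Whether a given API needs this is an assumption to check against its documentation.

```python
import urllib.parse

params = {'product': 'Laptop Pro', 'sort': 'price', 'limit': 25, 'user': 'Shashika'}

plus_style = urllib.parse.urlencode(params)   # default: space -> '+'
percent_style = urllib.parse.urlencode(params, quote_via=urllib.parse.quote)  # space -> '%20'

print(plus_style)     # product=Laptop+Pro&sort=price&limit=25&user=Shashika
print(percent_style)  # product=Laptop%20Pro&sort=price&limit=25&user=Shashika
```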


---

## 🌟 Core Insight for Your CSE Career

### Parse vs. Request
The split between `urllib.parse` and `urllib.request` is crucial to understand.

1.  **`parse`** deals with the *logic* of the URL (encoding, breaking down query strings, paths).
2.  **`request`** deals with the *network transmission* (sending HTTP requests, managing headers).
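
The split can be sketched in a few lines: `urllib.parse` produces a plain string, and `urllib.request.Request` wraps that string with transmission details. The `User-Agent` value is a made-up example, and nothing is sent over the network here.

```python
import urllib.parse
import urllib.request

# parse: build the URL string (pure text manipulation, no network involved)
query = urllib.parse.urlencode({"q": "data science", "lang": "en"})
url = f"https://api.example.com/search?{query}"

# request: describe how the URL should be fetched (method, headers)
# Assumption: the User-Agent value is an invented example.
req = urllib.request.Request(url, headers={"User-Agent": "demo-client/0.1"}, method="GET")

print(req.full_url)
print(req.get_method())
print(req.get_header("User-agent"))   # header names are stored capitalized

# Nothing is transmitted until the request is passed to urlopen, for example:
# with urllib.request.urlopen(req, timeout=10) as response: ...
```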

Even if you eventually switch to a simpler external library (like `requests`), you will often still reach for `urllib.parse` to correctly format complex URL paths and safe query strings in your Data Science projects.