### **Asynchronous Web Scraping**

### Blocking vs Non-Blocking code

#### 1. Blocking Code Example

In blocking code, each task waits for the previous one to finish before starting.

```python
# Blocking code example with multiple tasks
import time

# Simulate a blocking task
def read_file(file_name):
    print(f"Reading file: {file_name}")
    time.sleep(2)  # Simulate a time delay for reading a file (2 seconds)
    print(f"Finished reading {file_name}")

# Main execution
def main():
    start_time = time.time()
    
    # Simulate multiple blocking tasks
    read_file("file1.txt")
    read_file("file2.txt")
    read_file("file3.txt")
    
    print(f"All tasks finished in {time.time() - start_time} seconds.")

main()
```
---

#### 2. Non-blocking Code Example

In non-blocking code, the program can start a new task without waiting for the previous one to finish. This is achieved using asynchronous programming.

```python
import asyncio

# Non-blocking task using asyncio
async def read_file(file_name):
    print(f"Reading file: {file_name}")
    await asyncio.sleep(2)  # Simulate a time delay without blocking
    print(f"Finished reading {file_name}")

# Main execution
async def main():
    start_time = asyncio.get_running_loop().time()
    
    # Start multiple tasks concurrently
    tasks = [
        read_file("file1.txt"),
        read_file("file2.txt"),
        read_file("file3.txt")
    ]
    
    # Run all tasks concurrently
    await asyncio.gather(*tasks)
    
    print(f"All tasks finished in {asyncio.get_running_loop().time() - start_time:.2f} seconds.")

# Run the async program
asyncio.run(main())
```
---
**Situation:**  
You have 10 kilograms of dirty laundry and go to a self-service laundromat. Each machine holds up to 3 kilograms of clothes and takes 1 hour to finish a load. With just one machine, it takes 4 hours to finish all your laundry. How can you speed up the process?

**Situation:** You have a date in one hour, and you want to impress your crush with a fresh haircut. However, you also need to do laundry because you don’t have any clean clothes to wear for work tomorrow. How would you handle this situation?
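One way to read the two situations in code is the small simulation below. It is only a sketch: the `wash` and `haircut` helpers, the half-hour haircut, and the `HOUR` scale factor are assumptions made for illustration, and `asyncio.sleep` stands in for waiting on a machine.

```python
import asyncio
import math
import time

HOUR = 0.1  # assumption: one simulated hour lasts 0.1 real seconds


async def wash(load_number):
    await asyncio.sleep(1 * HOUR)  # each machine takes one hour per load
    return f"load {load_number} clean"


async def haircut():
    await asyncio.sleep(0.5 * HOUR)  # assumed duration; the text does not give one
    return "haircut done"


async def main():
    loads = math.ceil(10 / 3)  # 10 kg at 3 kg per machine gives 4 loads
    start = time.perf_counter()
    # Situation 1: one machine per load, all running at the same time.
    # Situation 2: the haircut happens while the machines are running.
    results = await asyncio.gather(
        *(wash(n) for n in range(1, loads + 1)),
        haircut(),
    )
    hours = (time.perf_counter() - start) / HOUR
    return loads, results, hours


if __name__ == "__main__":
    loads, results, hours = asyncio.run(main())
    print(f"{loads} loads and a haircut finished in about {hours:.1f} simulated hours")
```

The total is close to one simulated hour rather than four and a half, because nothing waits for anything else to finish before starting.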

---

**Asynchronous web scraping** allows you to send multiple HTTP requests concurrently without blocking the execution of the program. This is ideal for I/O-bound tasks like scraping many web pages, as it enables the program to process multiple requests at once, reducing total scraping time.

### Key Concepts

- **Event Loop**: Manages asynchronous tasks, allowing one task to run while others wait.
- **Non-blocking I/O**: HTTP requests don't block the program; it continues to send more requests or process other tasks while waiting for responses.
- **Coroutines**: Functions defined with `async def` that can be paused and resumed.
- **`await`**: Pauses a coroutine until a result is available, such as the response from an HTTP request.
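The four concepts above fit together as in the following minimal sketch. The `fetch_page` name and the delays are hypothetical, and `asyncio.sleep` stands in for waiting on an HTTP response, so no network access is involved.

```python
import asyncio


async def fetch_page(name, delay):
    """Coroutine: defined with `async def`, so it can be paused and resumed."""
    print(f"{name}: request sent")
    # `await` pauses this coroutine here; the event loop is free to run others.
    await asyncio.sleep(delay)  # stands in for waiting on an HTTP response
    print(f"{name}: response received")
    return f"{name} done"


async def main():
    # Calling a coroutine function only builds a coroutine object.
    # gather() schedules both on the event loop and waits for both results.
    return await asyncio.gather(
        fetch_page("page1", 0.2),
        fetch_page("page2", 0.1),
    )


if __name__ == "__main__":
    # asyncio.run() starts an event loop, runs main() to completion, then closes it.
    print(asyncio.run(main()))
```

Both requests are "sent" before either response arrives, and `gather()` returns the results in the order the coroutines were passed in, not the order they finished. Inside a Jupyter notebook, where an event loop is already running, `await main()` takes the place of `asyncio.run(main())`.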

### Example Code

```python
import aiohttp
import asyncio

async def fetch(session, url):
    # Reuse the shared session so connections are pooled across requests
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ["http://example.com/page1", "http://example.com/page2"]
    # Create one session for the whole run instead of one per request
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    for result in results:
        print(f"Fetched {len(result)} characters")

asyncio.run(main())
```

### Advantages

1. **Efficiency**: Scrapes multiple pages concurrently, reducing total time.
2. **Low Resource Usage**: Runs on a single thread, so it typically needs less memory than a thread-per-request design.
3. **Scalability**: Handles large numbers of requests without the overhead of multithreading.
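The single-thread point can be checked directly. The sketch below uses a hypothetical `worker` coroutine that records which thread it ran on; it is an illustration, not a benchmark.

```python
import asyncio
import threading


async def worker(name):
    await asyncio.sleep(0.05)  # yield to the event loop, as a network wait would
    return name, threading.get_ident()  # ID of the thread running this coroutine


async def main():
    results = await asyncio.gather(*(worker(f"task{i}") for i in range(5)))
    thread_ids = {thread_id for _, thread_id in results}
    return len(results), thread_ids


if __name__ == "__main__":
    count, thread_ids = asyncio.run(main())
    print(f"{count} tasks ran on {len(thread_ids)} thread(s)")
```

All five tasks overlap in time, yet they report the same thread ID, because the event loop switches between them only at `await` points.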

### Considerations

- **Rate Limiting**: Respect website rate limits to avoid getting blocked.
- **Error Handling**: Ensure proper handling for failed requests and timeouts.
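Both considerations can be sketched without touching the network. In the example below, `fake_request`, `polite_fetch`, the URLs, the limit of 2, and the 0.3 second timeout are all assumptions chosen for illustration; a real scraper would call an HTTP client where `fake_request` is used.

```python
import asyncio

MAX_CONCURRENT = 2  # hypothetical limit; choose one that suits the target site


async def fake_request(url):
    """Stand-in for a real HTTP call (hypothetical helper, no network access)."""
    if "slow" in url:
        await asyncio.sleep(1)  # longer than the timeout used below
    elif "broken" in url:
        raise ConnectionError(f"could not reach {url}")
    else:
        await asyncio.sleep(0.05)
    return f"content of {url}"


async def polite_fetch(url, semaphore, timeout=0.3):
    # The semaphore lets at most MAX_CONCURRENT coroutines into this block at a time.
    async with semaphore:
        try:
            return await asyncio.wait_for(fake_request(url), timeout)
        except asyncio.TimeoutError:
            return f"timeout: {url}"
        except ConnectionError as exc:
            return f"error: {exc}"


async def main():
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    urls = [
        "http://example.com/ok",
        "http://example.com/slow",
        "http://example.com/broken",
    ]
    return await asyncio.gather(*(polite_fetch(u, semaphore) for u in urls))


if __name__ == "__main__":
    for line in asyncio.run(main()):
        print(line)
```

Catching the failure inside `polite_fetch` means one bad URL does not cancel the rest of the batch; each URL yields either its content or a short description of what went wrong.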

### Asyncio

In [None]:
"""  
Objective: Defining an Async Function
"""
import asyncio
import time


def greet():
    time.sleep(1)
    print("Hello, World!")

async def async_greet():
    # await asyncio.sleep(1)
    print("Hello, World!")

print(type(greet))
# TODO: Print the type of async_greet()
print(type(async_greet))

print(type(greet()))
# TODO: Print the type of returned value of async_greet()
print(type(async_greet()))


In [None]:
"""  
Objective: Executing async function
"""
import asyncio


async def add_numbers(a, b):
    return a + b

add_numbers(1, 2)
asyncio.run(add_numbers(1, 2))

# TODO: Try to execute add_numbers like a normal function
# TODO: Try to execute add_numbers using asyncio.run()

In [None]:
"""  
Objective: Executing an async function from a regular (synchronous) function
"""
import asyncio


async def add_numbers(a, b):
    return a + b

def main():
    result = None
    # TODO: Change result value by executing add_numbers

    print(result)

# TODO: Execute main() function

In [None]:
"""  
Objective: Executing an async function from another async function
"""
import asyncio


async def add_numbers(a, b):
    return a + b

async def main():
    result = None
    # TODO: Change result value by awaiting add_numbers

    print(result)

# TODO: Execute main() function

In [None]:
"""  
Objective: Error Handling in Async Functions
"""
async def divide_numbers(a, b):
    if b == 0:
        raise ValueError("Cannot divide by zero!")
    return a / b

# TODO: Create main() function to execute divide_numbers(10, 0) asynchronously
# TODO: Add error handling if b is zero
# TODO: Execute main() function

In [None]:
"""  
Objective: Running Multiple Tasks
"""
import asyncio


async def task_1():
    print("Task 1 started...")
    await asyncio.sleep(2)
    print("Task 1 completed!")
    return "Result from Task 1"

# TODO: Create 2 more functions like the one above
# TODO: Create a main function to execute all task functions asynchronously
# TODO: Execute the main function
# TODO: Analyze the execution flow

In [None]:
"""  
Objective: Running Multiple Tasks Concurrently
"""
# TODO: Change the main function using asyncio.gather()

In [None]:
"""  
Objective: Another way to gather many tasks at once
Using *args
"""
# TODO: Improve the previous code by unpacking a list of coroutines with * instead of passing them one by one

In [None]:
"""  
Objective: Using asyncio.create_task()
"""
import asyncio


async def main():
    # Create tasks
    t1 = asyncio.create_task(task_1())
    # TODO: Create task for Task 2
    # TODO: Create task for Task 3
    
    await t1  # Wait for Task 1 to finish
    # TODO: Wait for Task 2 to finish
    # TODO: Wait for Task 3 to finish

asyncio.run(main())

In [None]:
"""  
Objective: Combining create_task() and gather()
"""
import asyncio


# Simulated web scraping task
async def scrape_page(url):
    print(f"Starting {url}")
    await asyncio.sleep(1)  # Simulate network delay
    print(f"Scraped: {url}")
    return {"url": url, "content": f"Dummy content from {url}"}

# Main function
async def main():
    urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net"
    ]
    
    # Simulate creating tasks for scraping pages
    tasks = []
    # TODO: Create a list of tasks by calling scrape_page for each URL in urls
    
    # Wait for all tasks to complete and gather results
    results = await asyncio.gather(*tasks)
    # TODO: print the result here

# TODO: Run the main function

In [None]:
"""  
Objective: Using asyncio.as_completed for Immediate Results
"""
# TODO: Import the necessary package
# TODO: Create a task function that accepts 2 parameters: task name and delay time
# TODO: Create a main function that builds a list of tasks
# TODO: Loop over asyncio.as_completed(list of tasks)
# TODO: Await each yielded object to get its result and print it
# TODO: Execute the main function

In [None]:
"""  
Objective: Simulating web scraping process
"""
import asyncio

async def fetch_data(url):
    await asyncio.sleep(1)  # Simulate fetch
    print(f"Fetched: {url}")
    return f"Data from {url}"

async def process_data(data):
    await asyncio.sleep(0.5)  # Simulate processing
    print(f"Processed: {data}")
    return f"Processed {data}"

async def save_data(data):
    await asyncio.sleep(0.5)  # Simulate saving
    print(f"Saved: {data}")

async def main():
    urls = ["http://example.com", "http://example.org", "http://example.net"]
    
    # TODO: Fetch all data concurrently
    # TODO: Process data concurrently
    # TODO: Save all data concurrently

# TODO: Run the workflow


### HTTPX

httpx is a modern Python library for making HTTP requests. It supports both synchronous and asynchronous programming and offers features like connection pooling and HTTP/2 (HTTP/2 support is optional: install `httpx[http2]` and create the client with `http2=True`). It’s often described as an async-friendly alternative to requests with a similar API.

---

```bash
pip install httpx
```
---

**Basic Usage**

**Synchronous Request**
```python
import httpx

response = httpx.get('https://example.com')
print(response.status_code)
print(response.text)
```
---

### **When to Use `httpx`**
- **Web Scraping**: Make multiple requests concurrently with async support.
- **APIs**: Communicate with RESTful or GraphQL APIs using efficient HTTP/2.
- **Proxies**: Handle requests via proxy servers with ease.
- **Modern HTTP Features**: Use advanced features like HTTP/2, custom transports, and event hooks.

---

In [None]:
"""  
Objective: Sending a simple HTTP request using httpx
"""
import httpx

r = httpx.get('https://httpbin.org/get')
# TODO: Try to manipulate the r object above as you are using requests

In [None]:
"""  
Objective: Sending HTTP request using httpx client
httpx.Client() is what you can use instead of requests.Session()
"""
import httpx
import time


start_time = time.time()

# Send the first request
response_1 = httpx.get("https://httpbin.org/cookies/set?cookie_name=cookie_value", follow_redirects=True)
print("First Request (Set Cookie):", response_1.json())

# Send a second request to check cookies
response_2 = httpx.get("https://httpbin.org/cookies")
print("Second Request (No Session):", response_2.json())

# TODO: Send a third request to check cookies

end_time = time.time()

print(f"Total execution time {end_time - start_time:.2f}")

In [None]:
"""  
Objective: Sending HTTP request using httpx client
httpx.Client() is what you can use instead of requests.Session()
"""
# TODO: Improve code above by using httpx client
# TODO: Analyze the difference

In [None]:
"""  
Objective: Making asynchronous requests using AsyncClient
"""
import httpx

# Top-level `async with` / `await` works in Jupyter/IPython; in a plain script, wrap this in an async function
async with httpx.AsyncClient() as client:
    response_1 = await client.get("https://httpbin.org/get")
    response_2 = await client.get("https://httpbin.org/get")
    # TODO: Add another response object from the same site
    # TODO: Print all response status codes

In [None]:
"""  
Objective: Simulating sending a list of URLs
"""
import httpx
import asyncio


# List of URLs to scrape (use a test URL or public API)
urls = ["https://httpbin.org/get"] * 100  # Sending 100 requests to the same URL

# Function to send requests concurrently
async def fetch(url, client):
    print(f"Sending request to {url}")
    response = await client.get(url)
    return response.status_code  # Return the status code to track success

# Main function to send all requests concurrently
async def send_requests():
    async with httpx.AsyncClient() as client:
        # TODO: Use asyncio.gather to send requests concurrently
        pass  # placeholder so the cell parses; remove once the TODO is done

# TODO: Run the function

In [None]:
"""  
Objective: Monitoring progress as each task completes
"""
# TODO: Improve the previous code to monitor progress using asyncio.as_completed()
# TODO: Add a counter to track how many requests have completed

In [None]:
"""  
Objective: Limiting Concurrent Requests using Semaphore to avoid overloading the server
"""
# TODO: Improve the previous code by limiting it to at most 10 concurrent requests using asyncio.Semaphore()

In [None]:
"""  
Objective: Implement asynchronous requests in your web scraping project
"""
# TODO: Create a new branch from your previous web scraping project
# TODO: Implement asynchronous requests using httpx.AsyncClient
# TODO: Push your branch and paste the GitHub link here for grading

### **Reflection**
By using asynchronous requests, we can send multiple requests at once. What effect do you think this has on the server side?

(answer here)

### **Exploration**
Explore how you can optimize scraping execution time while still keeping control over the number of requests you send.