### **Introduction to `yield` in Python**

The `yield` keyword in Python is used to create **generators**, a type of iterable that produces values **lazily**, one at a time, instead of building them all up front the way a list does.

---

### **Key Features of `yield`:**

1. **State Retention:**
   - Unlike `return`, which exits a function completely, `yield` pauses the function and retains its state. The function can be resumed from where it left off.

2. **Efficient Memory Usage:**
   - Because generators produce items one at a time, they are more memory-efficient than creating and storing all items in memory at once.

3. **Simplifies Iterator Creation:**
   - Generators eliminate the need for implementing `__iter__()` and `__next__()` methods manually.

4. **Use Cases:**
   - Generators are ideal for handling large data streams, infinite sequences, or any scenario where you don't need all the data at once.
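
To make point 3 concrete, here is a minimal sketch that compares a hand-written iterator class with an equivalent generator function. The names `CountdownIterator` and `countdown` are hypothetical, chosen only for illustration.

```python
class CountdownIterator:
    """Hand-written iterator: counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


def countdown(start):
    """Generator function with the same behavior, no class needed."""
    while start > 0:
        yield start
        start -= 1


class_result = list(CountdownIterator(3))
generator_result = list(countdown(3))
print(class_result)      # [3, 2, 1]
print(generator_result)  # [3, 2, 1]
```

Both versions produce the same values; the generator leaves the bookkeeping of `__iter__()`, `__next__()`, and `StopIteration` to Python.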

---

### **How `yield` Works:**

#### **1. Creating a Generator Function:**
   - Any function that contains a `yield` statement automatically becomes a generator function.
   - Instead of returning a single value, the function generates a series of values, pausing after each `yield`.

#### Example:
```python
def count_up_to(n):
    count = 1
    while count <= n:
        yield count
        count += 1

# Using the generator
for num in count_up_to(5):
    print(num)
```

**Output:**
```
1
2
3
4
5
```

**Explanation:**
- The function `count_up_to` pauses at each `yield` and resumes when the next value is requested.
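
The pause-and-resume behavior is easier to see when the generator is driven by hand with `next()` instead of a `for` loop. The sketch below repeats `count_up_to` from the example above so that it runs on its own; the variable names are illustrative.

```python
def count_up_to(n):
    count = 1
    while count <= n:
        yield count
        count += 1

gen = count_up_to(2)   # No code in the function body has run yet
first = next(gen)      # Runs until the first yield
second = next(gen)     # Resumes after the first yield, runs to the second

try:
    next(gen)          # The loop condition fails and the function ends
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, exhausted)  # 1 2 True
```

A `for` loop does the same thing behind the scenes: it calls `next()` repeatedly and stops quietly when `StopIteration` is raised.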

---

#### **2. Comparing `yield` vs `return`:**
- **`return`**: Ends the function and sends back a single value (which may itself be a collection).
- **`yield`**: Pauses the function and can produce multiple values over time.

```python
def using_return():
    return [1, 2, 3]  # Returns all values at once

def using_yield():
    yield 1
    yield 2
    yield 3  # Yields values one at a time
```
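
As a quick check of what each call actually hands back, the sketch below repeats the two functions and inspects their results. The variable names are illustrative.

```python
def using_return():
    return [1, 2, 3]

def using_yield():
    yield 1
    yield 2
    yield 3

returned = using_return()
generated = using_yield()

print(type(returned).__name__)   # list
print(type(generated).__name__)  # generator

print(len(returned))             # 3; the list already holds every value
values = list(generated)         # Drives the generator to completion
print(values)                    # [1, 2, 3]
```

Calling `using_yield()` does not run the function body; it only creates the generator object. The values appear once something iterates over it.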

---

### **When to Use `yield`?**

1. **Large Datasets:**
   - When processing a dataset that is too large to fit in memory, like reading a massive file line by line.
   
   Example:
   ```python
   def read_file(file_name):
       with open(file_name) as file:
           for line in file:
               yield line.strip()
   ```

2. **Infinite Sequences:**
   - When you need to generate a potentially infinite series, such as Fibonacci numbers or prime numbers.
   
   Example:
   ```python
   def infinite_fibonacci():
       a, b = 0, 1
       while True:
           yield a
           a, b = b, a + b
   ```

3. **Pipelines:**
   - When chaining multiple processing steps together, using generators avoids creating intermediate lists.
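
A pipeline of this kind can be sketched by feeding one generator into another. The example reuses `infinite_fibonacci` from above; the stage functions `only_even` and `squared` are hypothetical helpers written for this illustration.

```python
from itertools import islice

def infinite_fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

def only_even(numbers):
    """Pass through even values only (hypothetical helper)."""
    for n in numbers:
        if n % 2 == 0:
            yield n

def squared(numbers):
    """Square each incoming value (hypothetical helper)."""
    for n in numbers:
        yield n * n

# Each stage pulls one value at a time from the stage before it.
pipeline = squared(only_even(infinite_fibonacci()))

# islice stops the otherwise endless pipeline after four results.
first_four = list(islice(pipeline, 4))
print(first_four)  # [0, 4, 64, 1156]
```

No intermediate list is ever built: a Fibonacci number is generated, filtered, and squared before the next one is requested.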


In [1]:
# Example of data lost with return: the error discards the results collected so far

def start_scraping(response_api):
    results = []

    for i in response_api:
        color = i["color"] # Raises KeyError on the third item, which has no "color" key
        results.append(color)
    return results
    # print("End of function")

response_api = [
    {"ID": 1, "item": "Laptop", "color": "black"},
    {"ID": 2, "item": "Smart Watch", "color": "green"},
    {"ID": 3, "item": "Camera"},
]

print(start_scraping(response_api))

KeyError: 'color'

In [2]:
# Example of data retrieved with yield

def start_scraping(response_api):
    for i in response_api:
        yield i["color"] # Raises KeyError on the third item, after two values were already yielded
        # print("End of function")

# Dummy data
response_api = [
    {"ID": 1, "item": "Laptop", "color": "black"},
    {"ID": 2, "item": "Smart Watch", "color": "green"},
    {"ID": 3, "item": "Camera"},
]

# Create a generator object
results = start_scraping(response_api)

for i in results:
    print(i)

black
green


KeyError: 'color'
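
In the cell above, the values produced before the failure were already delivered to the caller, but the generator itself ends at the `KeyError` and cannot be resumed. If the goal is to keep going past a bad record, the lookup has to be guarded inside the generator. The following is a minimal sketch under that assumption; the name `extract_colors` and the fourth record are hypothetical additions for illustration.

```python
def extract_colors(records):
    """Yield each record's color, skipping records that have none."""
    for record in records:
        try:
            color = record["color"]
        except KeyError:
            print(f"Skipping record {record['ID']}: no 'color' key")
            continue
        yield color

response_api = [
    {"ID": 1, "item": "Laptop", "color": "black"},
    {"ID": 2, "item": "Smart Watch", "color": "green"},
    {"ID": 3, "item": "Camera"},
    # Extra record, not in the original data, to show processing continues
    {"ID": 4, "item": "Tablet", "color": "white"},
]

colors = list(extract_colors(response_api))
print(colors)  # ['black', 'green', 'white']
```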

In [None]:
# Compare the size of a list and a generator
import sys

example_list = [i for i in range(1000)]
example_generator = (i for i in range(1000))

print(sys.getsizeof(example_list))
print(sys.getsizeof(example_generator))


8856
192
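
Note that `sys.getsizeof` reports only the size of the container object itself (for a list, the array of references), not the integers it refers to. Exact byte counts also vary by Python version and platform. The sketch below, using the hypothetical helpers `make_list` and `make_generator`, illustrates the trend that matters: the list grows with the number of items, while the generator object stays the same size.

```python
import sys

def make_list(n):
    return [i for i in range(n)]

def make_generator(n):
    return (i for i in range(n))

# Print the container sizes for increasing item counts.
for n in (10, 1_000, 100_000):
    print(n, sys.getsizeof(make_list(n)), sys.getsizeof(make_generator(n)))

small_list_size = sys.getsizeof(make_list(10))
large_list_size = sys.getsizeof(make_list(100_000))
small_gen_size = sys.getsizeof(make_generator(10))
large_gen_size = sys.getsizeof(make_generator(100_000))
```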


In [5]:
""" 
Objective: Understanding the difference between a function and a generator
"""
list_data = [i for i in range(10)]

# TODO: 
# 1. Create a function that reverses a list manually, without the reverse method
# 2. Execute your function using list_data as the input parameter
# 3. Check your function by printing the result
# 4. Print all of the items using a loop

def reverse_list(input_list):
    reversed_list = []
    for i in range(len(input_list) - 1, -1, -1):
        reversed_list.append(input_list[i])
    return reversed_list

# Execute function with list_data
reversed_data = reverse_list(list_data)

# Print the reversed list
print("Reversed list:", reversed_data)

# Print items using loop
print("\nPrinting each item:")
for item in reversed_data:
    print(item)

Reversed list: [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

Printing each item:
9
8
7
6
5
4
3
2
1
0


In [None]:
""" 
Objective: Understanding the difference between a function and a generator
"""
# TODO: 
# 1. Re-create the previous function using yield
# 2. Execute your function using list_data as the input parameter
# 3. Check your function by printing the result
# 4. Print all of the items using a loop
# 5. Analyze the difference between them

def reverse_list_generator(input_list):
    for i in range(len(input_list) - 1, -1, -1):
        yield input_list[i]

# Execute generator with list_data
reversed_gen = reverse_list_generator(list_data)

# Print the generator object
print("Generator object:", reversed_gen)

# Print all items using loop
print("\nPrinting each item:")
for item in reversed_gen:
    print(item)

# Analysis of differences:
print("\nKey differences between function and generator:")
print("1. Memory usage: Generator creates items one at a time")
print("2. State: Generator maintains state between yields")
print("3. Output: Generator returns an iterator instead of a list")
print("4. Execution: Generator pauses at each yield")

Generator object: <generator object reverse_list_generator at 0x000002C46F1D7AE0>

Printing each item:
9
8
7
6
5
4
3
2
1
0

Key differences between function and generator:
1. Memory usage: Generator creates items one at a time
2. State: Generator maintains state between yields
3. Output: Generator returns an iterator instead of a list
4. Execution: Generator pauses at each yield
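
One more difference is worth seeing in code: a list can be looped over as often as needed, but a generator is used up after a single pass. The sketch below repeats `reverse_list_generator` with a shorter input so that it runs on its own.

```python
def reverse_list_generator(input_list):
    for i in range(len(input_list) - 1, -1, -1):
        yield input_list[i]

list_data = [0, 1, 2]

reversed_gen = reverse_list_generator(list_data)
first_pass = list(reversed_gen)   # Consumes every value
second_pass = list(reversed_gen)  # Nothing left to produce

print(first_pass)   # [2, 1, 0]
print(second_pass)  # []

# Calling the generator function again gives a fresh generator.
fresh_pass = list(reverse_list_generator(list_data))
print(fresh_pass)   # [2, 1, 0]
```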


In [7]:
# TODO: Execute this cell and take a look at the CSV file before continuing
import csv

def create_csv(file_name, base_url, num_entries):
    with open(file_name, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        
        # Write header
        writer.writerow(["ID", "URL"])
        
        # Write rows with dynamically generated URLs
        for i in range(1, num_entries + 1):
            # Build the page URL from the current ID
            dynamic_url = base_url + f"/catalogue/page-{i}.html"
            writer.writerow([i, dynamic_url])
    
    print(f"CSV file '{file_name}' with {num_entries} dynamic URLs has been created.")

create_csv(
    file_name="books_urls.csv",
    base_url="https://books.toscrape.com",
    num_entries=1000000
)


CSV file 'books_urls.csv' with 1000000 dynamic URLs has been created.


In [None]:
""" 
Objective: Compare scraping execution speed when reading URLs from a huge CSV file
"""

import requests
import csv

def read_urls_from_csv(file_path):
    """
    Reads a CSV file and returns a list of URLs found in the 'URL' column.
    """
    urls = []  # Initialize an empty list to store URLs
    with open(file_path, mode='r', newline='', encoding='utf-8') as file:
        # Create a CSV reader object to parse the CSV file
        csv_reader = csv.DictReader(file)
        
        # Iterate through each row in the CSV file
        for row in csv_reader:
            # Append the value in the 'URL' column to the urls list
            urls.append(row["URL"])
    
    return urls  # Return the list of URLs

# Read the URLs from the CSV file into the data_csv list
data_csv = read_urls_from_csv('books_urls.csv')

# Iterate through each URL in the list
for url in data_csv:
    print(f"Getting {url}")  # Print a message indicating the URL being fetched
    response = requests.get(url).status_code  # Send a GET request and get the status code
    
    # Halt the program on purpose after the first request (for testing).
    # A bare `raise` with no active exception raises RuntimeError.
    raise

# TODO: Take a look at how long it takes before the error is raised

# 6.7 seconds before raising the error

Getting https://books.toscrape.com/catalogue/page-1.html


RuntimeError: No active exception to reraise

In [9]:
""" 
Objective: Compare scraping execution speed when reading URLs from a huge CSV file
"""
# TODO:
# 1. Re-create the previous function using yield
# 2. Compare the execution time and give your insights

import requests
import csv
import time

def read_urls_generator(file_path):
    """
    Generator function that yields URLs one at a time from the CSV file
    """
    with open(file_path, mode='r', newline='', encoding='utf-8') as file:
        csv_reader = csv.DictReader(file)
        for row in csv_reader:
            yield row["URL"]

# Test regular function timing
start_time = time.time()
data_csv = read_urls_from_csv('books_urls.csv')
regular_load_time = time.time() - start_time
print(f"Regular function load time: {regular_load_time:.2f} seconds")

# Test generator timing
start_time = time.time()
urls_gen = read_urls_generator('books_urls.csv')
gen_init_time = time.time() - start_time
print(f"Generator initialization time: {gen_init_time:.2f} seconds")

# Test URL fetching with generator
start_time = time.time()
for url in urls_gen:
    print(f"Getting {url}")
    response = requests.get(url).status_code
    print(f"Time elapsed: {time.time() - start_time:.2f} seconds")
    break  # Stop after first URL for testing

print("\nInsights on execution time differences:")
print(f"1. Regular function took {regular_load_time:.2f}s to load all URLs")
print(f"2. Generator took only {gen_init_time:.2f}s to initialize")
print("3. Generator starts processing immediately without loading entire file")
print("4. Memory usage is significantly lower with generator")
print("5. Generator allows processing to begin before reading entire file")

Regular function load time: 3.09 seconds
Generator initialization time: 0.00 seconds
Getting https://books.toscrape.com/catalogue/page-1.html
Time elapsed: 1.91 seconds

Insights on execution time differences:
1. Regular function took 3.09s to load all URLs
2. Generator took only 0.00s to initialize
3. Generator starts processing immediately without loading entire file
4. Memory usage is significantly lower with generator
5. Generator allows processing to begin before reading entire file
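
The claim that the generator does not read the whole file can be checked directly. The sketch below is an illustration under stated assumptions: it uses an in-memory `io.StringIO` object as a stand-in for `books_urls.csv` (same header, three rows) and a hypothetical `read_urls` generator that takes a file object instead of a path.

```python
import csv
import io
from itertools import islice

def read_urls(file_obj):
    """Yield the 'URL' column of a CSV file object one row at a time."""
    for row in csv.DictReader(file_obj):
        yield row["URL"]

# In-memory stand-in for books_urls.csv (assumption: same column layout).
fake_csv = io.StringIO(
    "ID,URL\n"
    "1,https://books.toscrape.com/catalogue/page-1.html\n"
    "2,https://books.toscrape.com/catalogue/page-2.html\n"
    "3,https://books.toscrape.com/catalogue/page-3.html\n"
)

# Take only the first two URLs; the generator is never asked for a third.
first_two = list(islice(read_urls(fake_csv), 2))
print(first_two)

# Whatever was not consumed is still waiting in the file object.
remaining = fake_csv.read()
print(repr(remaining))
```

Because rows are parsed only on request, stopping early leaves the rest of the input untouched, which is why the generator version can start fetching the first URL almost immediately.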


In [11]:
""" 
Objective: Using yield for scraping
"""

import requests
from bs4 import BeautifulSoup


# Scrape product data from a list of URLs
def scrape_product_urls(urls):
    """
    Scrape product URLs from a list of pages.
    """
    all_product_urls = []
    for url in urls:
        print(f"Scraping: {url}")

        # TODO: 
        # 1. Get the html response of the page url
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        products = soup.find_all('article', class_='product_pod')
        
        # 2. Extract the items url into all_product_urls
        for product in products:
            # Get the link from h3 > a tag
            link = product.h3.a['href']
            # Convert relative URL to absolute URL
            if not link.startswith('http'):
                link = 'https://books.toscrape.com/catalogue/' + link
            all_product_urls.append(link)

    return all_product_urls

# Main execution
if __name__ == "__main__":
    page_urls = [
        "https://books.toscrape.com/catalogue/page-1.html",
        "https://books.toscrape.com/catalogue/page-10.html",
        "https://books.toscrape.com/catalogue/page-200.html",
        "https://books.toscrape.com/catalogue/page-20.html"
    ]
    product_items = scrape_product_urls(page_urls)

    # Print the extracted product URLs
    for item in product_items:
        print(item)

Scraping: https://books.toscrape.com/catalogue/page-1.html
Scraping: https://books.toscrape.com/catalogue/page-10.html
Scraping: https://books.toscrape.com/catalogue/page-200.html
Scraping: https://books.toscrape.com/catalogue/page-20.html
https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
https://books.toscrape.com/catalogue/soumission_998/index.html
https://books.toscrape.com/catalogue/sharp-objects_997/index.html
https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html
https://books.toscrape.com/catalogue/the-requiem-red_995/index.html
https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html
https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html
https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-

In [12]:
""" 
Objective: Using yield for scraping
"""
# TODO: 
# 1. Update previous code by using yield

def scrape_product_urls(urls):
    """
    Generator function that yields product URLs from a list of pages.
    """
    for url in urls:
        print(f"Scraping: {url}")

        # Get HTML response
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find all product links
        products = soup.find_all('article', class_='product_pod')
        
        # Yield each product URL
        for product in products:
            link = product.h3.a['href']
            if not link.startswith('http'):
                link = 'https://books.toscrape.com/catalogue/' + link
            yield link

# Main execution
if __name__ == "__main__":
    page_urls = [
        "https://books.toscrape.com/catalogue/page-1.html",
        "https://books.toscrape.com/catalogue/page-10.html",
        "https://books.toscrape.com/catalogue/page-200.html",
        "https://books.toscrape.com/catalogue/page-20.html"
    ]
    
    # Create generator object
    product_urls = scrape_product_urls(page_urls)

    # Print URLs as they are generated
    for url in product_urls:
        print(url)

Scraping: https://books.toscrape.com/catalogue/page-1.html
https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
https://books.toscrape.com/catalogue/soumission_998/index.html
https://books.toscrape.com/catalogue/sharp-objects_997/index.html
https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html
https://books.toscrape.com/catalogue/the-requiem-red_995/index.html
https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html
https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html
https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html
https://books.toscrape.com/catalogue/the-black-maria_991/index.html
https://books.toscrape.com/catalogue/starving-hearts-tr

In [13]:
""" 
Objective: Using yield for scraping
"""
# TODO:
# 1. From your last assignment, update your code by using yield
# 2. Create a new branch and push into github
# 3. Put the URL here

def scrape_product_urls(urls):
    """
    Generator function that yields product URLs one at a time from a list of pages.
    """
    for url in urls:
        print(f"Scraping: {url}")
        
        # Get HTML response
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find all product links and yield them one by one
        products = soup.find_all('article', class_='product_pod')
        for product in products:
            link = product.h3.a['href']
            if not link.startswith('http'):
                link = 'https://books.toscrape.com/catalogue/' + link
            yield link

# Main execution
if __name__ == "__main__":
    page_urls = [
        "https://books.toscrape.com/catalogue/page-1.html",
        "https://books.toscrape.com/catalogue/page-10.html",
        "https://books.toscrape.com/catalogue/page-200.html",
        "https://books.toscrape.com/catalogue/page-20.html"
    ]
    
    # Iterate over the generator directly, printing each URL as it is produced
    for item in scrape_product_urls(page_urls):
        print(item)

Scraping: https://books.toscrape.com/catalogue/page-1.html
https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
https://books.toscrape.com/catalogue/soumission_998/index.html
https://books.toscrape.com/catalogue/sharp-objects_997/index.html
https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html
https://books.toscrape.com/catalogue/the-requiem-red_995/index.html
https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html
https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html
https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html
https://books.toscrape.com/catalogue/the-black-maria_991/index.html
https://books.toscrape.com/catalogue/starving-hearts-tr

### **Reflection**
If you have a lot of memory, do you think you still need a generator? Give me your reason!


Yes, generators are still valuable even with abundant memory, for several reasons:

1. Code Readability and Design
   - Generators provide a cleaner, more intuitive way to work with sequences
   - They make the code's intent clearer by explicitly showing the iteration pattern

2. Processing Efficiency
   - Generators start processing immediately without waiting to load all data
   - This reduces latency for the first results, which matters especially in web applications
   - Work on each item can begin as soon as it is produced, so the consumer is not idle while the full dataset loads

3. Resource Management
   - Even with lots of memory, efficient resource usage is good practice
   - Servers often handle multiple concurrent processes/users
   - Memory saved can be used for other operations or caching

4. Scalability
   - Code written with generators scales better as data grows
   - No need to rewrite code if data size increases beyond memory
   - Future-proofs your applications

5. Stream Processing
   - Generators are ideal for real-time data streams
   - They suit live data feeds and continuous inputs, where the total size is unknown
   - They fit naturally into pipeline operations

6. Error Handling
   - Errors can be handled one item at a time
   - Items already yielded are not lost if a later item fails
   - Processing can continue past a bad item, provided the error is caught inside the generator or around the per-item work; an uncaught exception inside a generator ends it

Therefore, generators remain valuable for their design benefits and processing patterns, not just memory efficiency.

### **Exploration**
In Python, generators and iterators are both essential tools for working with sequences of data. However, we only cover generators here. Explore iterators on your own!

Here's an exploration of Python iterators:

In [None]:
# Understanding Iterators in Python

# 1. Basic Iterator Example
class NumberIterator:
    def __init__(self, limit):
        self.limit = limit
        self.counter = 0
    
    def __iter__(self):
        return self
    
    def __next__(self):
        if self.counter < self.limit:
            self.counter += 1
            return self.counter - 1
        raise StopIteration

# Using the iterator
numbers = NumberIterator(5)
for num in numbers:
    print(num)  # Outputs: 0, 1, 2, 3, 4

# 2. Custom String Iterator
class ReverseString:
    def __init__(self, text):
        self.text = text
        self.index = len(text)
    
    def __iter__(self):
        return self
    
    def __next__(self):
        if self.index > 0:
            self.index -= 1
            return self.text[self.index]
        raise StopIteration

# Using string iterator
text = ReverseString("Hello")
for char in text:
    print(char)  # Outputs: o, l, l, e, H

# 3. Built-in Iterator Example
my_list = [1, 2, 3]
iterator = iter(my_list)
print(next(iterator))  # 1
print(next(iterator))  # 2
print(next(iterator))  # 3

Key points about iterators:

1. Protocol
   - Iterators must implement `__iter__()` and `__next__()`
   - `__iter__()` returns the iterator object itself
   - `__next__()` returns the next value or raises `StopIteration`

2. Differences from Generators
   - Generators are themselves iterators; Python supplies `__iter__()` and `__next__()` for them
   - Class-based iterators define those methods explicitly, so they are more verbose to write
   - A class can expose extra methods and attributes, which offers more control over iteration behavior
   - This makes it easier to maintain complex state that other code needs to inspect or change

3. Use Cases
   - Custom iteration patterns
   - Complex object traversal
   - Memory-efficient sequence processing
   - Implementation of container objects

4. Advantages
   - Fine-grained control over iteration
   - Can implement complex iteration logic
   - Reusable iteration behavior
   - State preservation between iterations

5. Common Applications
   - Custom collection types
   - Database cursors
   - File readers
   - Network data streams
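
The two ideas can also be combined. The `ReverseString` class above is a one-shot iterator: once its index reaches zero, a second loop over the same object produces nothing. A common alternative is an iterable container whose `__iter__()` is written as a generator, so that every loop starts a fresh pass. The class name `ReversedText` below is hypothetical, and the sketch is one possible design, not the only one.

```python
class ReversedText:
    """Iterable container: each call to __iter__ starts a fresh pass."""

    def __init__(self, text):
        self.text = text

    def __iter__(self):
        # Written as a generator: Python supplies __next__ and StopIteration.
        for index in range(len(self.text) - 1, -1, -1):
            yield self.text[index]

text = ReversedText("Hello")
first_pass = "".join(text)
second_pass = "".join(text)     # Works again, unlike a one-shot iterator
print(first_pass, second_pass)  # olleH olleH

# A generator object is itself an iterator: iter() returns the same object.
gen = iter(text)
same_object = iter(gen) is gen
print(same_object)  # True
```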