# Unit 4: Extracting and Saving Images from Web Pages

-----

# Overview

Welcome to the lesson on extracting and saving **images from web pages**. In this lesson, you will learn how to use **Python** and the **BeautifulSoup** library to scrape images from web pages and save them locally. By the end of this lesson, you will have a solid understanding of the entire process, from making web requests to locating image elements and saving the images.

-----

# Making Web Requests and Parsing HTML

We start by fetching the HTML content of the website we want to scrape. In this case, we'll use `https://books.toscrape.com/`.

First, import the necessary libraries, make an HTTP GET request to the website, and parse the HTML content using `BeautifulSoup`.

```python
import requests
from bs4 import BeautifulSoup

url = 'https://books.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
```

This fetches the home page of Books to Scrape and parses it into a `soup` object that we can search.

-----

# Locating and Extracting Image URLs

With the parsed HTML content, use **BeautifulSoup** to locate image elements and extract their URLs from the `src` attribute.

```python
images = soup.find_all('img')
image_urls = [img['src'] for img in images]
```

We now have a list of image URLs extracted from the web page.
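
The list comprehension above assumes every `<img>` tag has a `src` attribute; a tag without one raises a `KeyError`. Below is a minimal sketch of a more defensive version. The helper name `extract_srcs` and the sample paths are hypothetical, and plain dictionaries stand in for BeautifulSoup `Tag` objects (which also provide a `.get()` method) so that the snippet runs without fetching a page.

```python
def extract_srcs(tags):
    """Return the non-empty 'src' values of the given tags, skipping tags that lack one."""
    srcs = []
    for tag in tags:
        src = tag.get('src')  # returns None instead of raising KeyError
        if src:
            srcs.append(src)
    return srcs


# Stand-ins for parsed <img> tags; the paths are made up for illustration.
sample_tags = [
    {'src': 'media/cache/example-1.jpg', 'alt': 'First book'},
    {'alt': 'Image without a src attribute'},
    {'src': 'media/cache/example-2.jpg'},
]

print(extract_srcs(sample_tags))
```

With a real page, the equivalent call would be `extract_srcs(soup.find_all('img'))`.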

-----

# Downloading and Saving Images

Finally, we will download and save the extracted images to the local file system.

Let's first ensure the `images` directory exists and create it if it doesn't using the `makedirs` function from the `os` module.

```python
import os
os.makedirs('images', exist_ok=True)
```
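
To see what `exist_ok=True` changes, here is a small sketch that runs inside a temporary directory (an assumption made only so the example leaves no files behind). It shows that repeated calls are harmless with `exist_ok=True`, while the default raises `FileExistsError` when the directory is already there.

```python
import os
import tempfile

# Work inside a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'images')

    os.makedirs(target, exist_ok=True)  # creates the directory
    os.makedirs(target, exist_ok=True)  # second call does nothing and raises no error
    created = os.path.isdir(target)

    try:
        os.makedirs(target)  # exist_ok defaults to False
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```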

Next, we can iterate over the image URLs, send requests to each URL, and save the images.

```python
for src in image_urls:
    full_src = f"https://books.toscrape.com/{src}" if not src.startswith('http') else src
    img_response = requests.get(full_src, stream=True)
    if img_response.status_code == 200:
        img_name = os.path.basename(src) # Extract the image name from the URL
        with open(f"images/{img_name}", 'wb') as f:
            for chunk in img_response.iter_content(1024):
                f.write(chunk)
        print(f"Saved {img_name}")
```

After running the code, all the images will be saved in the `images` directory. Let's understand the code step by step:

  * We iterate over the image URLs extracted from the web page. Note that we construct the full URL by prepending the base URL if the image URL is relative.
  * For each URL, we send an HTTP GET request to download the image.
  * If the request is successful (status code 200), we extract the image name from the URL and save the image to the `images` directory. Notice that we open the file in binary mode (`'wb'`) and, because the request was made with `stream=True`, write the content in chunks of 1024 bytes instead of loading the whole image into memory at once, which is more memory-efficient for large files.
  * Finally, we print a message indicating that the image was saved.
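
Prepending the base URL works for relative paths found on the home page, but it can produce a wrong address when a `src` starts with `/` or `../`, or when the page being scraped is not at the site root. As an alternative to the approach used in the lesson code, here is a minimal sketch based on `urljoin` from the standard library. The helper name `resolve_image_url`, the file names, and the nested page path are assumptions chosen for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://books.toscrape.com/'


def resolve_image_url(page_url, src):
    """Turn a src value into an absolute URL, relative to the page it came from."""
    return urljoin(page_url, src)


# Relative path on the home page (hypothetical file name).
print(resolve_image_url(BASE_URL, 'media/cache/example.jpg'))

# Already-absolute URLs are returned unchanged.
print(resolve_image_url(BASE_URL, 'https://example.com/logo.png'))

# '../' is resolved against the directory of the page (hypothetical nested page).
print(resolve_image_url(BASE_URL + 'catalogue/page-2.html', '../media/cache/example.jpg'))
```

Passing the URL of the page that was actually fetched as `page_url` keeps the helper correct when the scraper moves beyond the home page.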

-----

# Summary and Exercises

In this lesson, you learned how to extract and save images from web pages using **Python** and the **BeautifulSoup** library. You learned how to make web requests, parse HTML content, locate image elements, extract image URLs, and save images to the local file system.

Now it's time to practice what you've learned in the exercises. Good luck!

## Downloading the First Image

Great job so far! Now let's run the code you saw in the lesson.

In this exercise, we will download the first image from the website and save it locally.

Here's a brief recap of the key steps:

First, we fetch the webpage content using a GET request and parse it with BeautifulSoup.

Next, we find all the image tags in the HTML content and extract the URL of the first image.

Finally, we download the image and save it to the `images` directory.

Run the code below to see it in action and observe how the image is saved in the `images` directory.

```python
import requests
from bs4 import BeautifulSoup
import os

# URL of the site to scrape
url = 'https://books.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

images = soup.find_all('img')

# Let's save the first image to the 'images' directory
# First let's create 'images' directory to save the image
os.makedirs('images', exist_ok=True)

# Get the image source from the src attribute and construct the full URL
src = images[0]['src']
full_src = f"https://books.toscrape.com/{src}" if not src.startswith('http') else src

# Send a GET request to the image URL
img_response = requests.get(full_src, stream=True)

# Save the image to the 'images' directory if the request is successful
if img_response.status_code == 200:
    img_name = os.path.basename(src)
    with open(f"images/{img_name}", 'wb') as f:
        for chunk in img_response.iter_content(1024):
            f.write(chunk)
    print(f"Saved {img_name}")

```



## Save the Second Image

Great progress so far! Let's make a slight change to the script so that instead of saving the first image, we save the second image.

This change will help you understand how to navigate different elements in the list of images.

Modify the code to save the second image found on the page instead of the first one.

```python
import requests
from bs4 import BeautifulSoup
import os

# URL of the site to scrape
url = 'https://books.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

images = soup.find_all('img')

# Let's save the first image to the 'images' directory
# First let's create 'images' directory to save the image
os.makedirs('images', exist_ok=True)

# Get the image source from the src attribute and construct the full URL
src = images[0]['src']
full_src = f"https://books.toscrape.com/{src}" if not src.startswith('http') else src

# Send a GET request to the image URL
img_response = requests.get(full_src, stream=True)

# Save the image to the 'images' directory if the request is successful
if img_response.status_code == 200:
    img_name = os.path.basename(src)
    with open(f"images/{img_name}", 'wb') as f:
        for chunk in img_response.iter_content(1024):
            f.write(chunk)
    print(f"Saved {img_name}")

```

The only change needed is to modify the list index from `[0]` to `[1]`, which selects the second image in the list. Here is the updated script:

```python
import requests
from bs4 import BeautifulSoup
import os

# URL of the site to scrape
url = 'https://books.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

images = soup.find_all('img')

# Let's save the second image to the 'images' directory
# First let's create 'images' directory to save the image
os.makedirs('images', exist_ok=True)

# Get the image source from the src attribute and construct the full URL
# Change the index from 0 to 1 to get the second image
src = images[1]['src']
full_src = f"https://books.toscrape.com/{src}" if not src.startswith('http') else src

# Send a GET request to the image URL
img_response = requests.get(full_src, stream=True)

# Save the image to the 'images' directory if the request is successful
if img_response.status_code == 200:
    img_name = os.path.basename(src)
    with open(f"images/{img_name}", 'wb') as f:
        for chunk in img_response.iter_content(1024):
            f.write(chunk)
    print(f"Saved {img_name}")
```
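
Indexing with `[1]` assumes the page contains at least two images; on a page with fewer, it raises an `IndexError`. The following is a minimal sketch of a guarded lookup. The helper name `pick_item` and the sample file names are hypothetical, and a plain list of strings stands in for the list returned by `find_all`.

```python
def pick_item(items, index):
    """Return items[index], or None when the list is too short."""
    if 0 <= index < len(items):
        return items[index]
    return None


# Stand-in for a list of image sources; the names are made up.
image_srcs = ['first.jpg', 'second.jpg']

print(pick_item(image_srcs, 1))  # the second item
print(pick_item(image_srcs, 5))  # None, because the list has only two items
```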

## Complete the Image Downloader

Well done so far! Now, let's practice extracting images from a website. In this task, you will extract all the images from `https://books.toscrape.com/` and save them in a folder named `images`.

Fill in the TODO sections to complete the code.

```python
import requests
from bs4 import BeautifulSoup
import os

# URL of the site to scrape
url = 'https://books.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

images = soup.find_all('img')

os.makedirs('images', exist_ok=True)

for image in images:
    # TODO: Extract the URL of the image from the 'src' attribute

    # TODO: Download the image using requests module. For simplicity, you can use url + src to get the full image URL.
    
    img_name = os.path.basename(src)
    with open(f"images/{img_name}", 'wb') as f:
        for chunk in img_response.iter_content(1024):
            f.write(chunk)
    print(f"Saved {img_name}")


```

One way to fill in the TODO sections:

```python
import requests
from bs4 import BeautifulSoup
import os

# URL of the site to scrape
url = 'https://books.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

images = soup.find_all('img')

os.makedirs('images', exist_ok=True)

for image in images:
    # TODO: Extract the URL of the image from the 'src' attribute
    # Added a check to handle potential errors
    try:
        src = image['src']
    except KeyError:
        # Skip if the image tag doesn't have a 'src' attribute
        continue

    # TODO: Download the image using requests module. For simplicity, you can use url + src to get the full image URL.
    full_src = f"https://books.toscrape.com/{src}" if not src.startswith('http') else src
    img_response = requests.get(full_src, stream=True)

    # Save the image if the request was successful
    if img_response.status_code == 200:
        img_name = os.path.basename(src)
        with open(f"images/{img_name}", 'wb') as f:
            for chunk in img_response.iter_content(1024):
                f.write(chunk)
        print(f"Saved {img_name}")
```
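
The chunked write inside the loop can be tried in isolation, without a network connection. In the sketch below, an in-memory `io.BytesIO` stream stands in for the response body, and the helper name `save_in_chunks` is hypothetical. It reads with `.read(chunk_size)` rather than `iter_content`, so it illustrates the same idea, not the exact `requests` call.

```python
import io
import os
import tempfile


def save_in_chunks(source, path, chunk_size=1024):
    """Copy a binary stream to path in fixed-size chunks; return the number of chunks written."""
    chunks = 0
    with open(path, 'wb') as f:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:  # empty bytes means the stream is exhausted
                break
            f.write(chunk)
            chunks += 1
    return chunks


fake_image = io.BytesIO(b'\x00' * 2500)  # 2500 bytes of placeholder data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.jpg')
    chunk_count = save_in_chunks(fake_image, path)
    saved_size = os.path.getsize(path)

print(chunk_count, saved_size)
```

At no point does the function hold more than `chunk_size` bytes of the source in a single chunk, which is why this pattern suits large files.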

## Putting It All Together

Let's bring everything together! Write a Python script that extracts and saves the first image from a webpage. Follow the instructions in the starter code to complete the task.

```python
import requests
from bs4 import BeautifulSoup
import os

# URL of the site to scrape
url = 'https://books.toscrape.com/'

os.makedirs('images', exist_ok=True)

# TODO: Fetch the webpage content using requests.get() and store in `response`

# TODO: Parse the HTML content using BeautifulSoup

# TODO: Find the first image in the HTML using BeautifulSoup

# TODO: Extract the `src` attribute from the first image element

# TODO: Construct the full image URL if it's not complete. Hint - it should look like this: 'https://books.toscrape.com/<src>'

# TODO: Make a GET request to the image URL to download the image

# TODO: Write the image data to a file in a folder named 'images'

```

A completed version might look like this:

```python
import requests
from bs4 import BeautifulSoup
import os

# URL of the site to scrape
url = 'https://books.toscrape.com/'

os.makedirs('images', exist_ok=True)

# TODO: Fetch the webpage content using requests.get() and store in `response`
response = requests.get(url)

# TODO: Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# TODO: Find the first image in the HTML using BeautifulSoup
first_image = soup.find('img')

# Add a check to ensure an image was found
if first_image:
    # TODO: Extract the `src` attribute from the first image element
    src = first_image['src']

    # TODO: Construct the full image URL if it's not complete. Hint - it should look like this: 'https://books.toscrape.com/<src>'
    full_src = f"https://books.toscrape.com/{src}" if not src.startswith('http') else src

    # TODO: Make a GET request to the image URL to download the image
    img_response = requests.get(full_src, stream=True)

    # TODO: Write the image data to a file in a folder named 'images'
    if img_response.status_code == 200:
        img_name = os.path.basename(src)
        with open(f"images/{img_name}", 'wb') as f:
            for chunk in img_response.iter_content(1024):
                f.write(chunk)
        print(f"Saved {img_name}")
    else:
        print(f"Failed to download image from {full_src}")
else:
    print("No image found on the page.")

```
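
The scripts above take the file name from `os.path.basename(src)`. That works for plain paths, but a `src` carrying a query string (for example `cover.jpg?size=large`) would keep the query in the file name, and a URL ending in `/` would give an empty name. Here is a minimal sketch of one way to handle both cases. The helper name `filename_from_url`, the fallback name, and the sample URLs are assumptions for illustration.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url, default='image'):
    """Return the last path segment of a URL, ignoring any query string or fragment."""
    path = urlparse(url).path       # drops '?query' and '#fragment'
    name = os.path.basename(path)   # last segment of the path
    return name or default          # fall back when the path ends in '/'


print(filename_from_url('https://example.com/media/cover.jpg?size=large'))
print(filename_from_url('media/cache/example.jpg'))
print(filename_from_url('https://example.com/'))
```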