### **WEB SCRAPING AND IMAGE SCRAPING USING PYTHON**

---



### **1. Introduction to Web Scraping**
- **What is Web Scraping?**: Explain that web scraping is a way to extract information from websites automatically. Compare it to copying and pasting text from a website, but faster and done by a program.
- **Ethical Considerations**: Mention the importance of scraping responsibly and respecting website terms of service.
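
One concrete way to show what "scraping responsibly" can mean is to check a site's `robots.txt` rules before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module. The rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, and the `my-scraper` user agent name are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at https://<site>/robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())  # load the rules from the sample text

# Ask whether our (hypothetical) scraper may fetch each URL
print(parser.can_fetch("my-scraper", "https://example.com/news"))          # True
print(parser.can_fetch("my-scraper", "https://example.com/private/data"))  # False
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the real file instead of parsing a string. A `robots.txt` check does not replace reading the site's terms of service.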

### **2. Setting Up the Environment**
- **Python Basics**: If they are new to Python, give a brief introduction. Show them how to install Python and set up an editor such as VS Code or a notebook environment such as Jupyter Notebook.
- **Installing Libraries**:
  - Install the `requests` and `BeautifulSoup` libraries using pip (BeautifulSoup is published on PyPI as `beautifulsoup4`):
    ```bash
    pip install requests beautifulsoup4
    ```

### **3. Basic Example: Scraping a Simple Web Page**
Start with a basic example like scraping the title of a webpage.

- **Explain Each Step**:
  - **Step 1**: Explain what a URL is and how `requests.get()` fetches the webpage.
  - **Step 2**: Introduce HTML and how `BeautifulSoup` helps in parsing it.
  - **Step 3**: Show how to extract specific elements, like the title.
#### **Example: Scraping a Web Page Title**


```python
import requests
from bs4 import BeautifulSoup

# Step 1: Send a request to the website
url = "https://news.ycombinator.com/"
response = requests.get(url)

# Step 2: Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Step 3: Extract the title of the webpage
title = soup.title.string  # text inside the page's <title> tag
print("Title of the webpage:", title)
```

Title of the webpage: Hacker News



### **4. Intermediate Example: Scraping Multiple Items**
Move on to scraping multiple items, like a list of headlines or links from a news website.
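
A minimal sketch of this step is shown below. It pulls every headline and its link out of a page with `find_all()`. To stay runnable without network access it parses a small made-up HTML string (`SAMPLE_HTML`) rather than a live site. The `titleline` class is the one used in this guide's Hacker News examples, and `extract_headlines` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, loosely modelled on a news listing page.
SAMPLE_HTML = """
<ul>
  <li><span class="titleline"><a href="https://example.com/a">First story</a></span></li>
  <li><span class="titleline"><a href="https://example.com/b">Second story</a></span></li>
  <li><span class="other">Not a headline</span></li>
</ul>
"""

def extract_headlines(html):
    """Return a list of (title, link) pairs for every headline in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for span in soup.find_all("span", class_="titleline"):
        link = span.find("a")   # first <a> inside the headline span
        if link is None:
            continue            # skip headlines that carry no link
        items.append((link.get_text(strip=True), link.get("href")))
    return items

for title, href in extract_headlines(SAMPLE_HTML):
    print(f"{title} -> {href}")
```

To run the same function on a real page, pass it `requests.get(url).text` in place of `SAMPLE_HTML`.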


### **5. Advanced Example: Scraping with Pagination**
Show them how to scrape data from multiple pages, such as all the headlines across multiple pages.
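
Before the concrete example, it can help to look at the paging loop on its own, separate from the HTML parsing. The sketch below is one possible shape for it, not the only one. `scrape_all_pages`, `fake_fetch`, the `example.com` URLs, and the one-second default delay are all assumptions made for illustration. In practice `fetch_page` could be a function such as the `scrape_page` function in the example that follows.

```python
import time

def scrape_all_pages(fetch_page, base_url, max_pages=3, delay_seconds=1.0):
    """Collect items from base_url + "1", base_url + "2", ... up to max_pages.

    fetch_page is any function that takes a URL and returns a list of items.
    """
    all_items = []
    for page in range(1, max_pages + 1):
        items = fetch_page(base_url + str(page))
        if not items:
            break                      # an empty page usually means there are no more pages
        all_items.extend(items)
        if page < max_pages:
            time.sleep(delay_seconds)  # pause between requests to go easy on the server
    return all_items

# Stand-in fetcher so the sketch runs without network access.
def fake_fetch(url):
    pages = {
        "https://example.com/news?p=1": ["headline A", "headline B"],
        "https://example.com/news?p=2": ["headline C"],
    }
    return pages.get(url, [])

print(scrape_all_pages(fake_fetch, "https://example.com/news?p=", delay_seconds=0))
```

Passing the fetch function in as a parameter keeps the loop easy to try out with fake data before pointing it at a real site.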

#### **Example: Scraping Multiple Pages**




```python
import requests
from bs4 import BeautifulSoup

# Function to scrape a single page
def scrape_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    headlines = soup.find_all('span', class_='titleline')
    return [headline.text for headline in headlines]

# Scrape headlines page by page
base_url = "https://news.ycombinator.com/news?p="
all_headlines = []

for page in range(1, 2):  # Page 1 only; use range(1, 4) for the first 3 pages
    page_url = base_url + str(page)
    all_headlines.extend(scrape_page(page_url))

# Print all headlines
for index, headline in enumerate(all_headlines):
    print(f"{index + 1}. {headline}")
```

1. Blitz: A lightweight, modular, extensible web renderer (github.com/dioxuslabs)
2. Show HN: OSM – self host the entire planet in ~30 minutes (gist.github.com)
3. Isometric Projection in Game Development (pikuma.com)
4. Verso – web browser built on top of the Servo web engine (github.com/versotile-org)
5. PermitFlow (YC W22) Is Hiring Senior/Staff+ Engineers and Designers in NYC (ashbyhq.com)
6. Generating Simpson's Paradox with Z3 (kevinlynagh.com)
7. Adbfs-rootless – Mount Android phones on Linux with adb. No root required (github.com/spion)
8. Interstellar movie is implemented with Einstein's equations in 40k lines C++ (twitter.com/bitfield)
9. It took my savings and 14 years but I’m about to beat arthritis (thetimes.com)
10. Show HN: Anycast+ – An AI-powered podcast app (anycast.website)
11. Server Mono: A Typeface Inspired by Typewriters, Apple's SF Mono, and CLIs (servermono.com)
12. ASCII 3D Renderer for JavaScript (github.com/kciter)
13. Things I've learned building a modern T

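Let's go through the code line by line. The version walked through here is a single-page headline scraper that also checks the response status and prints progress messages.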


```python
import requests
```
- **Purpose**: This line imports the `requests` library, which is used for making HTTP requests in Python. It allows you to send HTTP/1.1 requests with methods like `GET` and `POST`, enabling you to interact with web pages and APIs.

```python
from bs4 import BeautifulSoup
```
- **Purpose**: This line imports the `BeautifulSoup` class from the `bs4` module, which is part of the BeautifulSoup library. BeautifulSoup is a Python library used for parsing HTML and XML documents and extracting data from them.

```python
print("Fetching the webpage...")
```
- **Purpose**: This line outputs the message "Fetching the webpage..." to the console, letting the user know that the script is about to request the webpage.

```python
url = "https://news.ycombinator.com/"
```
- **Purpose**: This line defines the `url` variable, storing the URL of the webpage you want to scrape. In this case, it's the Hacker News homepage.

```python
response = requests.get(url)
```
- **Purpose**: This line sends a GET request to the URL stored in the `url` variable using the `requests.get()` method. The response from the server (which includes the HTML content of the webpage) is stored in the `response` variable.

```python
if response.status_code == 200:
```
- **Purpose**: This line checks if the request was successful by evaluating the status code of the response. HTTP status code `200` means "OK", indicating that the webpage was successfully fetched.

```python
print("Webpage fetched successfully!")
```
- **Purpose**: If the status code is `200`, this line prints a confirmation message to the console, indicating that the webpage was successfully fetched.

```python
soup = BeautifulSoup(response.text, 'html.parser')
```
- **Purpose**: This line creates a `BeautifulSoup` object named `soup` by parsing the HTML content of the webpage (available in `response.text`). The `'html.parser'` argument specifies that the built-in Python HTML parser should be used to parse the document. The `soup` object now represents the HTML document in a structured format, allowing for easy data extraction.

```python
print("Parsing the headlines...")
```
- **Purpose**: This line outputs the message "Parsing the headlines..." to the console, letting the user know that the script is about to extract the headlines from the parsed HTML.

```python
headlines = soup.find_all('span', class_='titleline')
```
- **Purpose**: This line searches the parsed HTML (`soup`) for all `<span>` elements with the class `titleline`. The `find_all()` method returns a list of all matching elements, which is stored in the `headlines` variable. Each element in this list corresponds to a headline on the webpage.

```python
if headlines:
```
- **Purpose**: This line checks if the `headlines` list is not empty (i.e., it contains at least one element). If it is non-empty, the following block of code (inside the `if` statement) will be executed.

```python
for index, headline in enumerate(headlines):
```
- **Purpose**: This line starts a `for` loop that iterates over each element in the `headlines` list. The `enumerate()` function is used to keep track of the index of each element, starting from 0. The `index` variable stores the current index, and `headline` stores the current headline element.

```python
print(f"{index + 1}. {headline.text}")
```
- **Purpose**: Inside the loop, this line prints the index and the text of the current headline. The `f` before the string indicates an f-string, which allows you to embed expressions inside curly braces `{}` within the string. `index + 1` gives a 1-based index (starting from 1 instead of 0), and `headline.text` extracts the text content from the current `headline` element.

```python
else:
    print("No headlines found.")
```
- **Purpose**: If the `headlines` list is empty (i.e., no matching elements were found), this `else` block is executed, printing the message "No headlines found."

```python
else:
    print(f"Failed to fetch the webpage. Status code: {response.status_code}")
```
- **Purpose**: If the initial check `if response.status_code == 200` fails (meaning the status code is not 200), this `else` block is executed. It prints a message indicating that the webpage could not be fetched, along with the actual status code returned by the server.

### Summary:
- The code fetches the Hacker News homepage, checks if the request was successful, parses the HTML to find all headlines, and prints them. If something goes wrong (e.g., the request fails or no headlines are found), it outputs appropriate error messages.
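
Putting the individual lines from the walkthrough back together gives a script along these lines. Wrapping it in a function named `print_headlines` is an assumption made here so the code can be reused with any URL. The statements inside are the ones explained above.

```python
import requests
from bs4 import BeautifulSoup

def print_headlines(url):
    """Fetch a page and print every headline found on it."""
    print("Fetching the webpage...")
    response = requests.get(url)

    if response.status_code == 200:
        print("Webpage fetched successfully!")
        soup = BeautifulSoup(response.text, 'html.parser')

        print("Parsing the headlines...")
        headlines = soup.find_all('span', class_='titleline')

        if headlines:
            for index, headline in enumerate(headlines):
                print(f"{index + 1}. {headline.text}")
        else:
            print("No headlines found.")
    else:
        print(f"Failed to fetch the webpage. Status code: {response.status_code}")
```

Calling `print_headlines("https://news.ycombinator.com/")` runs it against the Hacker News homepage used throughout this guide.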


### **IMAGE SCRAPING**


```python
import os
import shutil
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Step 1: Define the URL of the webpage to scrape
url = "https://unsplash.com/images/things/toys"  # Replace with the URL you want to scrape

# Step 2: Send a request to the website
response = requests.get(url)

# Step 3: Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Step 4: Create a folder to save the images
os.makedirs('downloaded_images', exist_ok=True)

# Step 5: Find all image tags <img> in the HTML
images = soup.find_all('img')

# Step 6: Loop through the image tags and download the images
for index, img in enumerate(images):
    img_url = img.get('src')

    # Skip tags without a src and inline "data:" images, which cannot be fetched
    if not img_url or img_url.startswith('data:'):
        continue

    # Some images have relative URLs; resolve them against the page URL
    img_url = urljoin(url, img_url)

    # Step 7: Send a request to the image URL
    img_response = requests.get(img_url, stream=True)
    if img_response.status_code != 200:
        print(f"Skipping image {index}: status code {img_response.status_code}")
        continue

    # Step 8: Save the image to the folder with a unique name
    # (every file gets a .jpg extension, whatever the real format is)
    img_path = os.path.join('downloaded_images', f'image_{index}.jpg')
    img_response.raw.decode_content = True  # decode gzip/deflate content encoding, if any
    with open(img_path, 'wb') as img_file:
        shutil.copyfileobj(img_response.raw, img_file)
    print(f"Image {index} downloaded and saved as {img_path}")

# Step 9: Confirm completion
print("All images downloaded and saved successfully!")

# Step 10: Create a readme.txt or description.txt file
with open(os.path.join('downloaded_images', 'readme.txt'), 'w') as readme_file:
    readme_file.write("This folder contains images scraped from the following URL: " + url + "\n")
    readme_file.write("Steps:\n")
    readme_file.write("1. Define the URL of the webpage to scrape.\n")
    readme_file.write("2. Send a request to the website to get the HTML content.\n")
    readme_file.write("3. Parse the HTML content using BeautifulSoup.\n")
    readme_file.write("4. Create a folder to save the images if it doesn't already exist.\n")
    readme_file.write("5. Find all image tags <img> in the HTML content.\n")
    readme_file.write("6. Loop through the image tags and download the images.\n")
    readme_file.write("7. Send a request to each image URL and save the image.\n")
    readme_file.write("8. Save each image to the folder with a unique name.\n")
    readme_file.write("9. Confirm that all images are downloaded and saved successfully.\n")
```


All images downloaded and saved successfully!
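
Relative image URLs are the part of the download loop that most often goes wrong, because simply gluing the `src` onto the page URL produces addresses that do not exist. The sketch below shows how the standard library's `urllib.parse.urljoin` resolves the common kinds of `src` value. `resolve_image_url` is a hypothetical helper name, and every `src` value except the page URL is made up for illustration.

```python
from urllib.parse import urljoin

def resolve_image_url(page_url, src):
    """Return an absolute URL for an <img> src, or None if it cannot be fetched."""
    if not src or src.startswith("data:"):
        return None                # missing src or inline image data
    return urljoin(page_url, src)  # absolute URLs pass through unchanged

page_url = "https://unsplash.com/images/things/toys"

# Hypothetical src values of the kinds an <img> tag may carry.
for src in ["https://images.example.com/a.jpg", "/photos/b.jpg", "c.jpg", "//cdn.example.com/d.jpg"]:
    print(src, "->", resolve_image_url(page_url, src))
```

For comparison, plain concatenation of `page_url` and `"/photos/b.jpg"` would give `https://unsplash.com/images/things/toys/photos/b.jpg`, whereas `urljoin` returns `https://unsplash.com/photos/b.jpg`.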
