# Web Scraping

Web scraping is the process of automatically extracting data from websites. It involves fetching a webpage, parsing its content, and collecting specific information such as text, images, or other structured data.



## How Does Web Scraping Work?

**Send a Request:**
Use an HTTP library (e.g., requests) to send a request to a webpage's URL.
The server responds with the webpage's HTML content.

**Parse the Response:**
Use an HTML parsing library (e.g., BeautifulSoup) to extract specific elements (like text, links, tables).

**Extract Data:**
Target specific HTML elements (e.g., tags, classes, or IDs) to collect the desired information.

**Save the Data:**
Store the scraped data in a structured format such as a CSV file, database, or JSON.
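
The four steps can be sketched end to end. The example below is a minimal sketch that uses only the Python standard library and a made-up inline page (`SAMPLE_HTML`) in place of a live HTTP response, so it runs offline. In a real scraper the HTML would come from an HTTP client such as `requests`, and a parser such as BeautifulSoup would typically replace the hand-written `TitleParser` class, which is a hypothetical name used only for this illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 (stand-in): pretend this HTML came back from an HTTP request.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">Blue Kettle</h2>
  <h2 class="title">Red Toaster</h2>
  <p class="note">Not a product</p>
</body></html>
"""


class TitleParser(HTMLParser):
    """Steps 2-3: parse the HTML and collect the text of <h2 class="title">."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


def titles_to_csv(titles):
    """Step 4: serialize the extracted values as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title"])
    for title in titles:
        writer.writerow([title])
    return buffer.getvalue()


parser = TitleParser()
parser.feed(SAMPLE_HTML)
csv_text = titles_to_csv(parser.titles)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; swapping `io.StringIO()` for an open file handle would produce a CSV file on disk.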


## Why Use Web Scraping?


Web scraping extracts data from websites automatically rather than by hand. It is widely used because websites often hold large amounts of valuable data that is not exposed through an API or published as a downloadable dataset. It is particularly useful when data must be collected at scale for business insights, competitive analysis, or academic research.

Businesses use web scraping to monitor competitors' pricing, product availability, and customer reviews, enabling them to make informed decisions and stay ahead in the market. For instance, an e-commerce retailer might scrape product prices from competitors to dynamically adjust their own pricing. Researchers often leverage web scraping to gather datasets from news websites, social media platforms, or public resources to analyze trends and behavior. It is also extensively used in industries like real estate to scrape property listings, finance to collect stock market data, and travel to compare flight or hotel prices.

The process of web scraping typically involves sending a request to a webpage using an HTTP client, parsing the returned HTML using libraries like BeautifulSoup or lxml, and then extracting the desired data based on tags, attributes, or class names. Tools like Selenium or Playwright are often used for dynamic content that is generated by JavaScript. Once collected, the data is cleaned and saved in structured formats such as CSV, JSON, or databases for further analysis. Web scraping is also a key enabler for machine learning applications, as it allows the collection of large datasets required to train models for tasks like sentiment analysis or recommendation systems.

One of the main reasons web scraping is preferred is its ability to automate repetitive tasks, such as monitoring stock prices or gathering daily job postings. It also helps fill gaps where APIs are unavailable or limited in functionality. For example, many websites do not offer APIs for public data access, so web scraping becomes the only option to collect this information. Another benefit is the scalability of web scraping, as it can handle large amounts of data across multiple websites, saving time and effort compared to manual collection.

Despite its advantages, web scraping has challenges. Dynamic websites that load content through JavaScript make extraction more complex and usually require browser automation tools like Selenium or Puppeteer. Many websites also employ anti-scraping mechanisms, such as CAPTCHAs or rate limiting, to detect and block bots. Legal and ethical considerations matter as well, since scraping some websites may violate their terms of service. It is therefore essential to review a website's terms of service and its `robots.txt` file, and to use an official API where one is available.
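
A `robots.txt` check can be automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up `robots.txt` body (`ROBOTS_TXT`) from a string so that it runs offline; the user-agent name `my-scraper` and the `example.com` URLs are placeholders. Passing this check does not replace reading a site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at https://<site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_robot_rules(ROBOTS_TXT)

# Ask whether a given user agent may fetch a given URL.
allowed_public = rules.can_fetch("my-scraper", "https://example.com/products")
allowed_private = rules.can_fetch("my-scraper", "https://example.com/private/data")

# Crawl-delay, when the site declares one, suggests a pause between requests.
delay = rules.crawl_delay("my-scraper")

print(allowed_public, allowed_private, delay)
```

Against a live site, `set_url()` followed by `read()` on the same class would download the file instead of parsing a string.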

In summary, web scraping is an essential tool for gathering large-scale data efficiently, driving business intelligence, and automating data collection processes. It empowers industries, researchers, and developers to analyze trends, optimize strategies, and build advanced applications. However, it should always be done responsibly, adhering to legal and ethical guidelines.


## What Industries Benefit Most from Scraping?
Industries like e-commerce, travel, real estate, finance, and marketing benefit the most from web scraping due to their reliance on vast and dynamic datasets. It enables these industries to automate data collection, improve decision-making, and stay competitive in a fast-changing market. However, it’s essential to perform web scraping ethically, respecting legal boundaries and the website’s terms of service.

# Best Tools for Web Scraping

Web scraping tools can vary based on complexity, use case, and technical requirements. Here’s a breakdown of the best tools for web scraping:

---

## **1. Python Libraries**

### **a) `BeautifulSoup`**
- **Best For**: Beginners and small-scale scraping of static websites.
- **Description**: A simple library for parsing HTML and XML. Allows easy navigation of HTML elements.
- **Example**:
    ```python
    from bs4 import BeautifulSoup
    import requests

    url = "https://example.com"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract data
    titles = soup.find_all("h1")
    for title in titles:
        print(title.text)
    ```

---

### **b) `Requests`**
- **Best For**: Fetching HTML content from static websites.
- **Description**: A library for making HTTP requests (GET, POST, etc.). Often used with `BeautifulSoup`.

---

### **c) `Scrapy`**
- **Best For**: Large-scale, customizable scraping projects.
- **Description**: A powerful framework for crawling and scraping websites. Handles requests, parsing, and data pipelines.
- **Example**:
    ```bash
    scrapy startproject myproject
    ```

---

### **d) `Selenium`**
- **Best For**: Scraping dynamic websites that load content via JavaScript.
- **Description**: A browser automation tool that simulates user interaction with webpages.
- **Example**:
    ```python
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://example.com")
    print(driver.page_source)
    driver.quit()
    ```

---

### **e) `Playwright`**
- **Best For**: Advanced, modern scraping of dynamic websites.
- **Description**: A newer browser automation library, often faster and more reliable than `Selenium` on JavaScript-heavy pages because it waits for elements automatically.

---

### **f) `Puppeteer`**
- **Best For**: Headless browser scraping.
- **Description**: A Node.js library (not a Python package, listed here for comparison) that controls Chrome or Chromium. Great for dynamic web content.

---

## **2. No-Code Scraping Tools**

### **a) Octoparse**
- **Best For**: Visual, point-and-click scraping.
- **Description**: A user-friendly tool with a drag-and-drop interface. Supports dynamic and JavaScript-heavy websites.

---

### **b) ParseHub**
- **Best For**: Scraping complex websites without code.
- **Description**: A visual scraper that uses machine learning for intelligent data extraction.

---

### **c) DataMiner**
- **Best For**: Browser-based scraping.
- **Description**: A Chrome/Edge browser extension for scraping data directly from webpages.

---

### **d) WebHarvy**
- **Best For**: Easy scraping for small to medium-scale projects.
- **Description**: A visual scraping tool that extracts data based on patterns defined by the user.

---

## **3. Advanced Tools and Frameworks**

### **a) `Scrapy`**
- **Best For**: Enterprise-grade scraping.
- **Features**:
  - Asynchronous requests for faster scraping.
  - Built-in data pipelines for storing data.
  - Extensible with middlewares for anti-scraping measures.

---

### **b) `Crawlera` (now Smart Proxy Manager by Zyte)**
- **Best For**: Avoiding IP blocks and anti-bot mechanisms.
- **Description**: A proxy service that rotates IPs and manages request throttling.

---

### **c) `ScraperAPI`**
- **Best For**: Simplified scraping with proxy management.
- **Description**: A service to handle rotating proxies, CAPTCHAs, and JavaScript rendering.

---

### **d) Apify**
- **Best For**: Cloud-based scraping and automation.
- **Description**: A platform for running scraping scripts in the cloud. Integrates well with popular frameworks.

---

## **4. Headless Browsers**

### **a) Puppeteer**
- **Best For**: Controlling Chrome/Chromium in headless mode.

---

### **b) Playwright**
- **Best For**: Cross-browser automation and scraping.

---

## **5. API-Based Scraping**

Many websites offer APIs that return structured data, which is usually faster and more reliable than scraping HTML:
- **Twitter API**: Accessing tweets and user data.
- **OpenWeather API**: Accessing weather data.
- **Google Maps API**: Accessing geospatial and location-based services.
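
To illustrate why an API is simpler to consume than HTML, the sketch below builds a query URL and reads fields straight out of a JSON payload, with no parsing of tags or classes. The endpoint `api.example.com`, its parameters, and the `SAMPLE_RESPONSE` payload are all hypothetical and do not describe any of the services listed above.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
API_BASE = "https://api.example.com/v1/weather"


def build_api_url(city, units="metric"):
    """Encode query parameters safely instead of concatenating strings."""
    return f"{API_BASE}?{urlencode({'q': city, 'units': units})}"


# Stand-in for the body of an HTTP response from the hypothetical API.
SAMPLE_RESPONSE = '{"city": "Nairobi", "temperature": 21.5, "conditions": ["clouds"]}'


def parse_weather(payload):
    """Pick the fields of interest out of the JSON document."""
    data = json.loads(payload)
    return {"city": data["city"], "temperature": data["temperature"]}


url = build_api_url("Nairobi")
record = parse_weather(SAMPLE_RESPONSE)
print(url)
print(record)
```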

---

## **6. Proxy Services**

To avoid IP bans and bypass rate limits:
- **SmartProxy**: Reliable rotating proxies.
- **Bright Data (formerly Luminati)**: Offers advanced scraping tools and proxy pools.
- **Proxymesh**: Simple proxy service for bypassing geo-restrictions.

---

## **7. Cloud-Based Scraping Platforms**

For large-scale scraping projects:
- **ScrapingBee**: Handles headless browsers, proxies, and CAPTCHAs.
- **Zyte (formerly ScrapingHub)**: Offers scraping frameworks and proxy management tools.
- **Import.io**: Converts webpages into structured data without coding.

---

## **Best Tools Based on Use Cases**

| **Use Case**                     | **Best Tools**                             |
|-----------------------------------|--------------------------------------------|
| Small-scale static scraping       | `BeautifulSoup`, `Requests`                |
| Dynamic website scraping          | `Selenium`, `Playwright`, `Puppeteer`      |
| Large-scale scraping              | `Scrapy`, `Apify`, `Crawlera`              |
| No-code scraping                  | `Octoparse`, `ParseHub`, `DataMiner`       |
| Avoiding IP bans                  | `ScraperAPI`, `SmartProxy`, `Bright Data`  |
| API-based scraping                | APIs (e.g., Twitter API, Google Maps API)  |

---

## **Conclusion**

The best tools for web scraping depend on your requirements:
- For small projects, libraries like `BeautifulSoup` and `Requests` are simple and effective.
- For dynamic content, tools like `Selenium`, `Playwright`, or `Puppeteer` work best.
- For large-scale projects, frameworks like `Scrapy` or platforms like `Apify` ensure scalability.



In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import random
import time

# Rotating User-Agent headers
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
]

headers = {
    "User-Agent": random.choice(user_agents),
    "Accept-Language": "en-US,en;q=0.9",
}

# URL to scrape
url = "https://www.amazon.in/s?bbn=81107433031&rh=n%3A81107433031%2Cp_85%3A10440599031&_encoding=UTF8&content-id=amzn1.sym.58c90a12-100b-4a2f-8e15-7c06f1abe2be&pd_rd_r=eb705f4e-d34b-456d-a496-b52f6602d46b&pd_rd_w=hwFSy&pd_rd_wg=MVPlH&pf_rd_p=58c90a12-100b-4a2f-8e15-7c06f1abe2be&pf_rd_r=DHWY31K2T4Q1ARX3NX8Z&ref=pd_hp_d_atf_unk"

# timeout stops the call from waiting indefinitely on an unresponsive server
response = requests.get(url, headers=headers, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Initializing lists for product data
    product_names = []
    product_prices = []
    product_ratings = []

    # Locating the main slot containing all products
    product_container = soup.find("div", {"class": "s-main-slot s-result-list s-search-results sg-row"})

    # Iterating over individual products
    if product_container:
        for product in product_container.find_all("div", {"data-component-type": "s-search-result"}):
            # Extracting product name
            name = product.find("span", {"class": "a-size-base-plus a-color-base a-text-normal"})
            # Extracting product price
            price = product.find("span", {"class": "a-price-whole"})
            # Extracting product rating
            rating = product.find("span", {"class": "a-icon-alt"})

            product_names.append(name.text.strip() if name else "N/A")
            product_prices.append(price.text.strip() if price else "N/A")
            product_ratings.append(rating.text.strip() if rating else "N/A")

    # Saving fetched data to DataFrame
    data = {
        "Product Name": product_names,
        "Price (₹)": product_prices,
        "Rating": product_ratings,
    }
    df = pd.DataFrame(data)

    # Saving the dataframe to a CSV file
    df.to_csv("amazon_products.csv", index=False)
    print("Scraping completed. Data saved to amazon_products.csv")
    print(df.head())
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")


# Amazon Product Scraper

This code scrapes product details such as names, prices, ratings, and URLs from Amazon and saves them into a CSV file. Below is the explanation of each part of the script:

---

## **Import Libraries**
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
```
1. **`requests`**: Used to send HTTP requests to fetch the HTML content of a webpage.
2. **`BeautifulSoup`**: Parses the HTML content and provides methods for navigating and extracting data from it.
3. **`pandas`**: Used to store the scraped data in a structured format (DataFrame) and save it to a CSV file.
4. **`time`**: Introduces delays between requests to avoid being flagged as a bot.
5. **`random`**: Generates random delays and rotates User-Agent headers to mimic real user behavior.

---

## **Rotate User-Agent Headers**
```python
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
]
```
- A list of different User-Agent strings simulates requests from different browsers and operating systems.
- This helps avoid detection by Amazon's anti-scraping mechanisms.

---

## **Set Base URL**
```python
base_url = "https://www.amazon.in/s?bbn=81107433031&rh=n%3A81107433031%2Cp_85%3A10440599031&page={page}"
```
- The base URL is the template for Amazon's search results.
- The `{page}` placeholder allows fetching multiple pages by substituting the page number.
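
As an alternative to a hand-encoded template, the query string can be assembled with `urllib.parse.urlencode`, which takes care of percent-encoding characters such as `:` and `,`. This is a minimal sketch: `build_search_url` is a hypothetical helper, and the parameter values are simply the decoded forms of the ones in `base_url`.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.amazon.in/s"


def build_search_url(page):
    """Return the search URL for one results page."""
    params = {
        "bbn": "81107433031",
        # Decoded form of n%3A81107433031%2Cp_85%3A10440599031
        "rh": "n:81107433031,p_85:10440599031",
        "page": page,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


print(build_search_url(2))
```

Keeping the parameters in a dictionary makes it easier to change a filter without re-encoding the string by hand.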

---

## **Initialize Lists for Data**
```python
product_names, product_prices, product_ratings, product_urls = [], [], [], []
```
- These empty lists will store the scraped data:
  - **`product_names`**: Names of the products.
  - **`product_prices`**: Prices of the products.
  - **`product_ratings`**: Ratings of the products.
  - **`product_urls`**: URLs for the product detail pages.

---

## **Iterate Over Pages**
```python
for page in range(1, 4):  # Adjust range for more pages
    print(f"Scraping page {page}...")
```
- Loops through the pages to scrape.
- **`range(1, 4)`**: Specifies scraping from page 1 to page 3. Adjust the range for more pages.

---

## **Set Request Headers**
```python
headers = {
    "User-Agent": random.choice(user_agents),
    "Accept-Language": "en-US,en;q=0.9",
}
```
- Rotates the User-Agent header to reduce the chances of being flagged as a bot.
- Adds an "Accept-Language" header to indicate the preferred language for the response.

---

## **Send Request to Amazon**
```python
# timeout stops the call from waiting indefinitely on an unresponsive server
response = requests.get(base_url.format(page=page), headers=headers, timeout=10)
```
- Sends an HTTP GET request to fetch the HTML content of the current page.
- **`base_url.format(page=page)`**: Replaces `{page}` with the current page number.

---

## **Handle Request Failure**
```python
if response.status_code != 200:
    print(f"Failed to fetch page {page}. Status code: {response.status_code}")
    continue
```
- Checks if the server responded successfully (status code `200`).
- If the request fails, logs the error and skips to the next page.
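
Skipping a failed page is the simplest policy. Where transient errors are common, one alternative is to retry with an increasing delay before giving up. The sketch below is one possible shape for that, not part of the script above: `fetch_with_retries` and `fake_fetch` are hypothetical names, and the fetcher is injected as a callable returning `(status_code, body)` so the example runs without network access.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or the attempts run out.

    The wait doubles after each failure (base_delay, 2 * base_delay, ...).
    Returns the body on success, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    return None


# Fake fetcher standing in for an HTTP client: fails twice, then succeeds.
_responses = iter([(503, ""), (429, ""), (200, "<html>ok</html>")])


def fake_fetch(url):
    return next(_responses)


# Record the waits instead of actually sleeping, to keep the demo instant.
waits = []
body = fetch_with_retries(fake_fetch, "https://example.com", sleep=waits.append)
print(body, waits)
```

With `requests`, the injected callable could be a small wrapper that returns `response.status_code` and `response.text`.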

---

## **Parse HTML Content**
```python
soup = BeautifulSoup(response.content, "html.parser")
```
- Uses BeautifulSoup to parse the HTML content into a navigable format.

---

## **Find Product Container**
```python
product_container = soup.find("div", {"class": "s-main-slot s-result-list s-search-results sg-row"})
```
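- Finds the main container holding all the product listings using its class.
- Because the full multi-class string is passed, BeautifulSoup matches only when the element's `class` attribute is exactly this string, in this order, so the lookup fails if Amazon adds, removes, or reorders classes.
- If this container is not found, the scraper skips further processing.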

---

## **Extract Product Details**
```python
for product in product_container.find_all("div", {"data-component-type": "s-search-result"}):
```
- Iterates through all product elements within the container using their `data-component-type`.

---

### **a) Extract Product Name**
```python
name = product.find("span", {"class": "a-size-base-plus a-color-base a-text-normal"})
product_names.append(name.text.strip() if name else "N/A")
```
- Finds the product name using its class and appends it to `product_names`.
- If the name is not found, appends `"N/A"`.

---

### **b) Extract Product Price**
```python
price = product.find("span", {"class": "a-price-whole"})
product_prices.append(price.text.strip() if price else "N/A")
```
- Finds the product price using its class and appends it to `product_prices`.
- If the price is not found, appends `"N/A"`.

---

### **c) Extract Product Rating**
```python
rating = product.find("span", {"class": "a-icon-alt"})
product_ratings.append(rating.text.strip() if rating else "N/A")
```
- Finds the product rating using its class and appends it to `product_ratings`.
- If the rating is not found, appends `"N/A"`.
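
The extracted price and rating are stored as raw strings, so they usually need cleaning before analysis. The sketch below shows one way to do that. It assumes the price text holds only the whole-number part with thousands separators (for example `1,299`) and that the rating text starts with the numeric score (for example `4.2 out of 5 stars`). Both formats are assumptions about the page, and `parse_price` and `parse_rating` are hypothetical helpers.

```python
import re


def parse_price(text):
    """Convert a whole-number price string such as "1,299" to an int, else None."""
    digits = re.sub(r"[^\d]", "", text)  # drop commas, dots, and currency symbols
    return int(digits) if digits else None


def parse_rating(text):
    """Pull the leading number out of text such as "4.2 out of 5 stars"."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


print(parse_price("1,299"), parse_price("N/A"))
print(parse_rating("4.2 out of 5 stars"), parse_rating("N/A"))
```

Returning `None` for the `"N/A"` placeholder lets pandas treat missing values as missing rather than as text.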

---

### **d) Extract Product URL**
```python
url_tag = product.find("a", {"class": "a-link-normal s-no-outline"})
product_url = f"https://www.amazon.in{url_tag['href']}" if url_tag else "N/A"
product_urls.append(product_url)
```
- Finds the product detail page URL and appends it to `product_urls`.
- Prepends the base URL (`https://www.amazon.in`) to make it a complete link.
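
String concatenation works when every `href` is a site-relative path. A slightly more defensive option is `urllib.parse.urljoin`, which also leaves already-absolute links untouched. In this sketch `absolute_url` is a hypothetical helper and the `/dp/EXAMPLE123` path is a made-up value.

```python
from urllib.parse import urljoin

BASE = "https://www.amazon.in"


def absolute_url(href):
    """Resolve href against the site root; fall back to "N/A" when it is missing."""
    return urljoin(BASE, href) if href else "N/A"


print(absolute_url("/dp/EXAMPLE123"))
print(absolute_url("https://www.amazon.in/dp/EXAMPLE123"))
print(absolute_url(None))
```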

---

## **Add Delay Between Requests**
```python
time.sleep(random.uniform(1, 3))
```
- Introduces a random delay (1-3 seconds) between requests to mimic human behavior and avoid being flagged as a bot.

---

## **Save Data to a DataFrame**
```python
df = pd.DataFrame({
    "Product Name": product_names,
    "Price (₹)": product_prices,
    "Rating": product_ratings,
    "Product URL": product_urls,
})
```
- Creates a Pandas DataFrame to store the scraped data in a structured format.

---

## **Save Data to CSV**
```python
df.to_csv("amazon_products_detailed.csv", index=False)
print("Scraping completed. Data saved to amazon_products_detailed.csv")
```
- Saves the DataFrame to a CSV file named `amazon_products_detailed.csv`.
- Logs a message indicating successful completion.

---

# **Summary**
This script efficiently scrapes product data (name, price, rating, and URL) from Amazon using:
1. **HTTP requests** with random User-Agent headers.
2. **HTML parsing** using BeautifulSoup.
3. **Data storage** with Pandas DataFrame and CSV.


In [22]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

# Rotating User-Agent headers
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
]

# Base URL
base_url = "https://www.amazon.in/s?bbn=81107433031&rh=n%3A81107433031%2Cp_85%3A10440599031&page={page}"

# Initializing lists for data
product_names, product_prices, product_ratings, product_urls = [], [], [], []

# Scraping multiple pages
for page in range(1, 4):  # Adjust range for more pages
    print(f"Scraping page {page}...")
    headers = {
        "User-Agent": random.choice(user_agents),
        "Accept-Language": "en-US,en;q=0.9",
    }
    # timeout stops the call from waiting indefinitely on an unresponsive server
    response = requests.get(base_url.format(page=page), headers=headers, timeout=10)
    
    if response.status_code != 200:
        print(f"Failed to fetch page {page}. Status code: {response.status_code}")
        continue

    soup = BeautifulSoup(response.content, "html.parser")
    product_container = soup.find("div", {"class": "s-main-slot s-result-list s-search-results sg-row"})

    if product_container:
        for product in product_container.find_all("div", {"data-component-type": "s-search-result"}):
            # Extracting product name
            name = product.find("span", {"class": "a-size-base-plus a-color-base a-text-normal"})
            # Extracting product price
            price = product.find("span", {"class": "a-price-whole"})
            # Extracting product rating
            rating = product.find("span", {"class": "a-icon-alt"})
            # Extracting product URL
            url_tag = product.find("a", {"class": "a-link-normal s-no-outline"})
            product_url = f"https://www.amazon.in{url_tag['href']}" if url_tag else "N/A"

            product_names.append(name.text.strip() if name else "N/A")
            product_prices.append(price.text.strip() if price else "N/A")
            product_ratings.append(rating.text.strip() if rating else "N/A")
            product_urls.append(product_url)

    # Adding delay to avoid being flagged as a bot
    time.sleep(random.uniform(1, 3))

# Saving fetched data to a DataFrame
df = pd.DataFrame({
    "Product Name": product_names,
    "Price (₹)": product_prices,
    "Rating": product_ratings,
    "Product URL": product_urls,
})

# Saving dataframe to CSV
df.to_csv("amazon_products_detailed.csv", index=False)
print("Scraping completed. Data saved to amazon_products_detailed.csv")


Scraping page 1...
Scraping page 2...
Scraping page 3...
Scraping completed. Data saved to amazon_products_detailed.csv


# Jumia Product Scraper

This script scrapes product information (name, price, rating, and link) from the Jumia website, logs errors, and saves the data into a CSV file. Here's a detailed breakdown of the code:

---

## **Import Libraries**
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import random
import time
import logging
import matplotlib.pyplot as plt
```
1. **`requests`**: Fetches HTML content of the webpage.
2. **`BeautifulSoup`**: Parses and extracts data from the HTML.
3. **`pandas`**: Stores scraped data in a structured DataFrame and saves it to a CSV file.
4. **`random`**: Introduces randomness in delays and User-Agent headers to mimic real user behavior.
5. **`time`**: Adds delays between requests to avoid detection as a bot.
6. **`logging`**: Logs errors encountered during scraping to a file.
7. **`matplotlib.pyplot`**: Imported for later data visualization; the scraping code itself does not use it.

---

## **Configure Logging**
```python
logging.basicConfig(filename="scraper.log", level=logging.ERROR, format="%(asctime)s - %(levelname)s - %(message)s")
```
- Configures logging to record errors in a file named `scraper.log`.
- Logs include timestamps, error levels, and error messages for debugging.
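
The effect of `level=logging.ERROR` is that messages below that severity are dropped. The sketch below demonstrates this with an in-memory stream instead of `scraper.log`, so nothing is written to disk. The logger name `scraper_demo` and the messages are made up for the illustration, and the timestamp is omitted from the format to keep the output stable.

```python
import io
import logging

# Capture log output in memory rather than in a file.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))

logger = logging.getLogger("scraper_demo")  # hypothetical logger name
logger.setLevel(logging.ERROR)
logger.addHandler(handler)
logger.propagate = False  # keep the demo isolated from the root logger

logger.warning("Slow response on page 2")  # below ERROR, so it is dropped
logger.error("Failed to fetch page 3. Status code: 503")

log_text = stream.getvalue()
print(log_text)
```

Lowering the level to `logging.INFO` would also record progress messages, which can help when diagnosing a run that returns no data.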

---

## **User-Agent Rotation**
```python
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
]
```
- A list of different User-Agent headers simulates requests from various browsers and operating systems, helping to avoid detection as a bot.

---

## **Base URL**
```python
base_url = "https://www.jumia.co.ke/catalog/?q=smartphones&page={}"
```
- Defines the base URL for the search results page. The `{}` acts as a placeholder for the page number.

---

## **Get Total Pages**
```python
def get_total_pages(url, headers):
    try:
        # timeout stops the call from waiting indefinitely on an unresponsive server
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            pagination = soup.find("div", class_="pg-w -ptm -pbxl")
            if pagination:
                return int(pagination.find_all("a")[-2].text.strip())  # Extract total pages
        return 1  # Default to 1 if pagination not found
    except Exception as e:
        logging.error(f"Error fetching total pages: {e}")
        return 1
```
1. **Purpose**: Determines the total number of pages in the search results.
2. **Process**:
   - Sends a request to the first page.
   - Reads the text of the second-to-last link in the pagination element, which is assumed to hold the last page number.
3. **Error Handling**:
   - Logs errors in fetching total pages and defaults to 1 page.
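
Relying on the position of a link (`[-2]`) breaks if the pagination layout changes. One alternative is to read the `page` query parameter from every pagination link and take the largest value. The sketch below does this with the standard library on made-up markup (`SAMPLE_PAGINATION`); the real page structure may differ, and `HrefCollector` and `total_pages_from_html` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse

# Made-up pagination markup for illustration.
SAMPLE_PAGINATION = """
<div class="pg-w">
  <a href="/catalog/?q=smartphones&amp;page=1">1</a>
  <a href="/catalog/?q=smartphones&amp;page=2">2</a>
  <a href="/catalog/?q=smartphones&amp;page=2">Next</a>
  <a href="/catalog/?q=smartphones&amp;page=50">Last</a>
</div>
"""


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def total_pages_from_html(html, default=1):
    """Return the largest page number found in any link, or default if none."""
    collector = HrefCollector()
    collector.feed(html)
    pages = []
    for href in collector.hrefs:
        values = parse_qs(urlparse(href).query).get("page", [])
        pages.extend(int(v) for v in values if v.isdigit())
    return max(pages, default=default)


print(total_pages_from_html(SAMPLE_PAGINATION))
print(total_pages_from_html("<div></div>"))
```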

---

## **Scrape Product Data**
```python
def scrape_jumia(num_pages):
    product_data = []
    headers = {
        "User-Agent": random.choice(user_agents),
        "Accept-Language": "en-US,en;q=0.9",
    }
```
- Initializes an empty list (`product_data`) to store scraped data.
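- Sets headers with a randomly chosen User-Agent. The choice is made once per call to `scrape_jumia`, so every page request in that run shares the same User-Agent.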
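
Because `headers` is built before the loop, the User-Agent does not change between pages. If per-request rotation is wanted, one option is to build the headers inside the loop. A minimal sketch, in which `make_headers` is a hypothetical helper and the User-Agent strings are shortened placeholders rather than real browser strings:

```python
import random

# Shortened placeholder strings; real User-Agent values are much longer.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]


def make_headers(rng=random):
    """Build a fresh header dict with a randomly chosen User-Agent."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }


rng = random.Random(0)  # seeded only to make the demo repeatable
chosen = []
for page in range(1, 4):
    headers = make_headers(rng)  # rebuilt on every iteration
    chosen.append(headers["User-Agent"])
    print(page, headers["User-Agent"])
```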

---

### **Iterate Over Pages**
```python
for page in range(1, num_pages + 1):
    print(f"Scraping page {page}...")
    try:
        # timeout stops the call from waiting indefinitely on an unresponsive server
        response = requests.get(base_url.format(page), headers=headers, timeout=10)
        if response.status_code != 200:
            logging.error(f"Failed to fetch page {page}. Status code: {response.status_code}")
            continue
```
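- Iterates from page 1 through `num_pages`.
- Sends a GET request to each page using the formatted URL.
- Logs an error and moves on to the next page if the request fails (non-200 status code).
- The `try` block is paired with an `except` clause in the full script, which logs any exception raised while a page is being processed.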

---

### **Parse HTML Content**
```python
soup = BeautifulSoup(response.content, "html.parser")
products = soup.find_all("article", class_="prd _fb col c-prd")
```
- Parses the HTML response using BeautifulSoup.
- Finds all product containers with the specified class.

---

### **Extract Product Details**
```python
for product in products:
    # Extract product name
    name_tag = product.find("h3", class_="name")
    name = name_tag.text.strip() if name_tag else "N/A"

    # Extract product price
    price_tag = product.find("div", class_="prc")
    price = price_tag.text.strip() if price_tag else "N/A"

    # Extract product link
    link_tag = product.find("a", href=True)
    link = "https://www.jumia.co.ke" + link_tag["href"] if link_tag else "N/A"

    # Extract product rating
    rating_tag = product.find("div", class_="stars")
    rating = rating_tag.get("aria-label") if rating_tag else "N/A"

    product_data.append({
        "Product Name": name,
        "Price": price,
        "Rating": rating,
        "Product Link": link
    })
```
1. **`name`**: Extracts the product name from the `h3` tag.
2. **`price`**: Extracts the product price from the `div` tag.
3. **`link`**: Extracts the product's relative URL and prepends the site's base address (`https://www.jumia.co.ke`) to form a complete link.
4. **`rating`**: Extracts the rating using the `aria-label` attribute.
5. Appends the extracted details as a dictionary to the `product_data` list.

---

### **Add Delay Between Requests**
```python
time.sleep(random.uniform(1, 3))
```
- Introduces a random delay (1-3 seconds) between requests to avoid being flagged as a bot.

---

## **Save Data to CSV**
```python
df = pd.DataFrame(product_data)
df.to_csv("jumia_enhanced_products.csv", index=False)
print("Scraping completed. Data saved to jumia_enhanced_products.csv")
```
1. Creates a Pandas DataFrame from the `product_data` list.
2. Saves the DataFrame to a CSV file named `jumia_enhanced_products.csv`.

---

## **Script Execution**
```python
headers = {
    "User-Agent": random.choice(user_agents),
    "Accept-Language": "en-US,en;q=0.9",
}
total_pages = get_total_pages(base_url.format(1), headers)
print(f"Total pages found: {total_pages}")

df = scrape_jumia(total_pages)
```
1. Calls the `get_total_pages` function to determine the number of pages.
2. Calls the `scrape_jumia` function to scrape data from all pages.

---

## **Summary**
This script:
- Scrapes product names, prices, ratings, and links from Jumia.
- Handles errors gracefully and logs them for debugging.
- Saves the scraped data into a CSV file for further use.


In [92]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import random
import time
import logging
import matplotlib.pyplot as plt

# Configure logging
logging.basicConfig(filename="scraper.log", level=logging.ERROR, format="%(asctime)s - %(levelname)s - %(message)s")

# Rotating User-Agent headers to mimic browser behavior
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
]

# Base URL for Jumia
base_url = "https://www.jumia.co.ke/catalog/?q=smartphones&page={}"

# Function to get total number of pages
def get_total_pages(url, headers):
    try:
        # timeout stops the call from waiting indefinitely on an unresponsive server
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            pagination = soup.find("div", class_="pg-w -ptm -pbxl")
            if pagination:
                return int(pagination.find_all("a")[-2].text.strip())  # Extracting total pages
        return 1  # Default to 1 if pagination not found
    except Exception as e:
        logging.error(f"Error fetching total pages: {e}")
        return 1

# Function to scrape product data
def scrape_jumia(num_pages):
    product_data = []
    headers = {
        "User-Agent": random.choice(user_agents),
        "Accept-Language": "en-US,en;q=0.9",
    }

    for page in range(1, num_pages + 1):
        print(f"Scraping page {page}...")
        try:
            # timeout stops the call from waiting indefinitely on an unresponsive server
            response = requests.get(base_url.format(page), headers=headers, timeout=10)
            if response.status_code != 200:
                logging.error(f"Failed to fetch page {page}. Status code: {response.status_code}")
                continue

            soup = BeautifulSoup(response.content, "html.parser")
            products = soup.find_all("article", class_="prd _fb col c-prd")  # Target product containers
            
            for product in products:
                # Extracting product name
                name_tag = product.find("h3", class_="name")
                name = name_tag.text.strip() if name_tag else "N/A"

                # Extracting product price
                price_tag = product.find("div", class_="prc")
                price = price_tag.text.strip() if price_tag else "N/A"

                # Extracting product link
                link_tag = product.find("a", href=True)
                link = "https://www.jumia.co.ke" + link_tag["href"] if link_tag else "N/A"

                # Extracting product rating
                rating_tag = product.find("div", class_="stars")
                rating = rating_tag.get("aria-label") if rating_tag else "N/A"

                product_data.append({
                    "Product Name": name,
                    "Price": price,
                    "Rating": rating,
                    "Product Link": link
                })
        except Exception as e:
            logging.error(f"Error on page {page}: {e}")

        # Random delay to mimic human behavior
        time.sleep(random.uniform(1, 3))
    
    return pd.DataFrame(product_data)


# Scraping data
headers = {
    "User-Agent": random.choice(user_agents),
    "Accept-Language": "en-US,en;q=0.9",
}
total_pages = get_total_pages(base_url.format(1), headers)
print(f"Total pages found: {total_pages}")

# Scraping all pages
df = scrape_jumia(total_pages)
df.to_csv("jumia_enhanced_products.csv", index=False)
print("Scraping completed. Data saved to jumia_enhanced_products.csv")

# Jumia Product Scraper with Ratings

This script scrapes product information, including names, prices, ratings, and links, from Jumia's website, saves the data into a CSV file, and logs any errors during the process. Here's a detailed breakdown:

---

## **1. Import Libraries**
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
import logging
```
- **`requests`**: Handles HTTP requests to fetch webpage content.
- **`BeautifulSoup`**: Parses and extracts data from HTML content.
- **`pandas`**: Creates a DataFrame and saves data to a CSV file.
- **`selenium` (`webdriver`)**: Drives a real browser to load pages, mimicking normal browsing activity.
- **`logging`**: Logs errors and debugging information to a file.

---

## **2. Configure Logging**
```python
logging.basicConfig(filename="scraper.log", level=logging.ERROR, format="%(asctime)s - %(levelname)s - %(message)s")
```
- Configures logging to write errors into a file named `scraper.log`.
- Logs include:
  - **Timestamp** (`%(asctime)s`)
  - **Error Level** (`%(levelname)s`)
  - **Error Message** (`%(message)s`).

---

## **3. Define Base URL**
```python
base_url = "https://www.jumia.co.ke/catalog/?q=smartphones&page={}"
```
- The base URL contains a query for "smartphones" with pagination indicated by `{}`.
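
To make the placeholder concrete, the short sketch below fills `{}` with a page number using `str.format`. The helper name `page_urls` is hypothetical and not part of the original script; it only illustrates how one URL per page is produced.

```python
base_url = "https://www.jumia.co.ke/catalog/?q=smartphones&page={}"


def page_urls(first_page, last_page):
    """Return the catalog URL for every page in the inclusive range."""
    return [base_url.format(page) for page in range(first_page, last_page + 1)]


urls = page_urls(1, 3)
for url in urls:
    print(url)
```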

---

## **4. Get Total Pages**
```python
def get_total_pages(url):
    driver = None
    try:
        driver = webdriver.Chrome()  # Launch a Chrome session
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        pagination = soup.find("div", class_="pg-w -ptm -pbxl")
        if pagination:
            # The last pagination link points to the final page; keep only the digits of its href
            last_href = pagination.find_all("a")[-1].get("href")
            return int("".join(filter(str.isdigit, last_href)))
        return 1  # Default to 1 if pagination not found
    except Exception as e:
        logging.error(f"Error fetching total pages: {e}")
        return 1
    finally:
        if driver is not None:
            driver.quit()  # Always release the browser, even on errors
```
- **Purpose**: Extracts the total number of pages from the pagination element.
- **Logic**:
  - Opens the first page in Chrome via Selenium and parses the rendered HTML.
  - Finds the pagination container and reads the page number from the `href` of its last link.
  - Returns 1 if no pagination is found.
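
Joining every digit found in the `href` works only while the page number is the sole number in the link; a search term that itself contains digits would corrupt the result. As a possible alternative, the sketch below reads the `page` query parameter explicitly with the standard library. The function name `page_from_href` and the sample links are hypothetical and used for illustration only.

```python
from urllib.parse import urlparse, parse_qs


def page_from_href(href, default=1):
    """Return the integer `page` query parameter of a link, or `default`."""
    query = parse_qs(urlparse(href).query)
    values = query.get("page", [])
    if values and values[0].isdecimal():
        return int(values[0])
    return default


# Hypothetical hrefs for illustration only
print(page_from_href("/catalog/?q=smartphones&page=50#catalog-listing"))  # 50
print(page_from_href("/catalog/?q=galaxy+s23&page=7"))                    # 7
print(page_from_href("/catalog/?q=smartphones"))                          # 1
```

With the digit-joining approach, the second link would yield 237 rather than 7, because the digits of the search term are concatenated with the page number.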

---

## **5. Scrape Product Data**
```python
def scrape_jumia(num_pages):
    product_data = []
```
- Initializes an empty list, `product_data`, to store product information.
- Unlike the `requests`-based version, no User-Agent is set here: Selenium drives a real Chrome browser, which sends its own headers.

---

### **a) Iterate Over Pages**
```python
for page in range(1, num_pages + 1):
    print(f"Scraping page {page}...")
    try:
        driver = webdriver.Chrome()  # One browser session per page
        driver.get(base_url.format(page))
```
- Loops through all pages, opening each one in a Chrome session.

---

### **b) Parse HTML Content**
```python
soup = BeautifulSoup(driver.page_source, "html.parser")
products = soup.find_all("article", class_="prd _fb col c-prd")
```
- Parses the rendered page source using BeautifulSoup.
- Finds all product containers matching the specified class.

---

### **c) Extract Product Details**
```python
for product in products:
    # Extract product name
    name_tag = product.find("h3", class_="name")
    name = name_tag.text.strip() if name_tag else "N/A"

    # Extract product price
    price_tag = product.find("div", class_="prc")
    price = price_tag.text.strip() if price_tag else "N/A"

    # Extract product link
    link_tag = product.find("a", href=True)
    link = "https://www.jumia.co.ke" + link_tag["href"] if link_tag else "N/A"

    # Extract product rating
    rating_tag = product.find("div", class_="stars _s")
    if rating_tag:
        rating = rating_tag.text.strip()  # Extract visible rating
    else:
        rating = "N/A"

    product_data.append({
        "Product Name": name,
        "Price": price,
        "Rating": rating,
        "Product Link": link
    })
```
1. **`name`**: Extracts the product name from the `h3` tag.
2. **`price`**: Extracts the product price from the `div` tag.
3. **`link`**: Constructs the full product URL.
4. **`rating`**: Extracts the rating (if available).
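
The fields above are stored as raw text, with `"N/A"` standing in for missing values. If numeric values are needed for later analysis, a small cleaning step could convert them. The sketch below is one way to do this under two assumptions that the original does not confirm: that the rating text looks like `"4.5 out of 5"` and that the price text looks like `"KSh 12,999"`. The helper names `parse_rating` and `parse_price` are hypothetical.

```python
import re


def parse_rating(text):
    """Return the leading number of a rating string as a float, or None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


def parse_price(text):
    """Return the first number in a price string as a float, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if not match:
        return None
    # Drop thousands separators before converting
    return float(match.group(0).replace(",", ""))


# Sample strings are assumed formats, not captured from the site
print(parse_rating("4.5 out of 5"))  # 4.5
print(parse_price("KSh 12,999"))     # 12999.0
print(parse_rating("N/A"))           # None
```

Returning `None` for unparseable values keeps missing data distinguishable from a real zero when the rows are loaded into a DataFrame.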

---

## **6. Save Data to CSV**
```python
df = pd.DataFrame(product_data)
df.to_csv("jumia_enhanced_with_ratings.csv", index=False)
print("Scraping completed. Data saved to jumia_enhanced_with_ratings.csv")
```
1. Creates a pandas DataFrame from the `product_data` list (in the full script this happens in the `return` statement of `scrape_jumia`).
2. Saves the DataFrame to a CSV file named `jumia_enhanced_with_ratings.csv`.

---

## **7. Main Script Execution**
```python
total_pages = get_total_pages(base_url.format(1))
print(f"Total pages found: {total_pages}")

df = scrape_jumia(total_pages)
```
1. Calls `get_total_pages` to determine the number of pages to scrape.
2. Calls `scrape_jumia` to extract product data from all pages.

---

## **Summary**
1. Scrapes product names, prices, ratings, and URLs from Jumia.
2. Logs errors to `scraper.log` and moves on to the next page instead of crashing.
3. Saves the scraped data into a CSV file.
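
In the script, a page that fails is logged and skipped. If transient failures are common, one option is to retry a few times before giving up. The sketch below is a minimal illustration of that idea using only the standard library; `with_retries` and `flaky_fetch` are hypothetical names, and the stand-in function simulates a page load rather than calling Selenium.

```python
import logging
import time


def with_retries(fetch, attempts=3, delay=1.0):
    """Call `fetch()` until it succeeds or `attempts` is exhausted.

    Each failure is logged; the last exception is re-raised so the
    caller can decide whether to skip the page.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            logging.error(f"Attempt {attempt}/{attempts} failed: {exc}")
            if attempt == attempts:
                raise
            time.sleep(delay)


# Demonstration with a stand-in function that fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = with_retries(flaky_fetch, attempts=3, delay=0)
print(result, calls["count"])
```

Inside `scrape_jumia`, the page-loading step could be wrapped in such a helper while the existing `except` block continues to log pages that still fail after the final attempt.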


In [None]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import random
import time
import logging
import matplotlib.pyplot as plt


# Configure logging
logging.basicConfig(filename="scraper.log", level=logging.ERROR, format="%(asctime)s - %(levelname)s - %(message)s")


# Base URL for Jumia
base_url = "https://www.jumia.co.ke/catalog/?q=smartphones&page={}"

# Function to get total number of pages
def get_total_pages(url):
    driver = None
    try:
        driver = webdriver.Chrome()  # Launch a Chrome session
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        pagination = soup.find("div", class_="pg-w -ptm -pbxl")
        if pagination:
            # The last pagination link points to the final page; keep only the digits of its href
            last_href = pagination.find_all("a")[-1].get("href")
            return int("".join(filter(str.isdigit, last_href)))
        return 1  # Default to 1 if pagination not found
    except Exception as e:
        logging.error(f"Error fetching total pages: {e}")
        return 1
    finally:
        if driver is not None:
            driver.quit()  # Always release the browser, even on errors

# Function to scrape product data
def scrape_jumia(num_pages):
    product_data = []

    for page in range(1, num_pages + 1):
        print(f"Scraping page {page}...")
        driver = None
        try:
            driver = webdriver.Chrome()  # One browser session per page
            driver.get(base_url.format(page))

            soup = BeautifulSoup(driver.page_source, "html.parser")
            products = soup.find_all("article", class_="prd _fb col c-prd")  # Target product containers

            for product in products:
                # Extracting product name
                name_tag = product.find("h3", class_="name")
                name = name_tag.text.strip() if name_tag else "N/A"

                # Extracting product price
                price_tag = product.find("div", class_="prc")
                price = price_tag.text.strip() if price_tag else "N/A"

                # Extracting product link
                link_tag = product.find("a", href=True)
                link = "https://www.jumia.co.ke" + link_tag["href"] if link_tag else "N/A"

                # Extracting product rating
                rating_tag = product.find("div", class_="stars _s")
                if rating_tag:
                    rating = rating_tag.text.strip()  # Extracting visible rating
                else:
                    rating = "N/A"

                product_data.append({
                    "Product Name": name,
                    "Price": price,
                    "Rating": rating,
                    "Product Link": link
                })
        except Exception as e:
            logging.error(f"Error on page {page}: {e}")
        finally:
            if driver is not None:
                driver.quit()  # Release the browser even if the page failed
    
    return pd.DataFrame(product_data)

total_pages = get_total_pages(base_url.format(1))
print(f"Total pages found: {total_pages}")

# Scraping all pages
df = scrape_jumia(total_pages)
df.to_csv("jumia_enhanced_with_ratings.csv", index=False)
print("Scraping completed. Data saved to jumia_enhanced_with_ratings.csv")

Thank You!