# Web Scraping Amazon Reviews

## Project Overview

The goal of this project is to scrape 100 reviews for each of two selected travel bag products on Amazon (10 review pages per product, 10 reviews per page), for 200 reviews in total. For every review, the script extracts the reviewer's name, review title, star rating, review date, and review text. The data is then compiled into a structured table and saved as a CSV file for further analysis.

## Steps Involved

### 1. Setting Up the Environment
- **Libraries Used**: 
  - `requests` for making HTTP requests
  - `pandas` for data manipulation
  - `BeautifulSoup` from `bs4` for parsing HTML
  - `datetime` for handling date formatting
  - `time` and `random` for adding delays between requests to mimic human browsing behavior
  - `logging` for recording errors to a log file
- **Logging Configuration**: 
  - Logging is configured to write any errors raised during scraping, at `ERROR` level with a timestamp, to `scraping_errors.log`.

### 2. Headers Configuration
- **User-Agent Rotation**:
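  - A list of header sets (`headers_list`) is defined, and one is chosen at random for each request to reduce the chance of being blocked by the server. With several entries, requests appear to come from different browsers. The list in the script contains a single entry, so rotation only takes effect once more user agents are added.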
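
A minimal sketch of rotation across several user agents is shown here. The user-agent strings and the `pick_headers` helper are illustrative assumptions, not part of the original script.

```python
import random

# Illustrative user-agent strings (assumptions); substitute current, real browser values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0",
]

def pick_headers():
    """Return a headers dict with a randomly chosen user-agent (hypothetical helper)."""
    return {
        "accept-language": "en-US,en;q=0.9",
        "user-agent": random.choice(USER_AGENTS),
    }

headers = pick_headers()  # call once per request so the user-agent varies
```

Calling the helper once per request, rather than once per run, is what makes the user-agent vary between pages.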

### 3. Extracting HTML Data
- **Function `reviewsHtml(url, len_page)`**:
  - Handles the extraction of HTML content from the review pages.
  - Iterates through the specified number of pages, constructs the URL for each page, sends an HTTP GET request, and parses the HTML response using BeautifulSoup.
  - Logs and prints an error message when a request fails, then stops fetching further pages for that URL and returns the pages collected so far.
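
One way to construct the per-page URL, whether or not the base URL already carries a query string, is with `urllib.parse`. This is only a sketch: the helper name `build_page_url` and the example URL are assumptions for illustration.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def build_page_url(base_url, page_no):
    """Return base_url with its pageNumber query parameter set to page_no."""
    parts = urlparse(base_url)
    query = dict(parse_qsl(parts.query))   # keep any existing query parameters
    query["pageNumber"] = str(page_no)     # add or overwrite the page number
    return urlunparse(parts._replace(query=urlencode(query)))

# Hypothetical review-page URL used only for illustration.
example = "https://www.example.com/product-reviews/ABC123/ref=paging"
page_urls = [build_page_url(example, n) for n in range(1, 4)]
```

Building the query through `urlencode` avoids having to decide by hand whether the parameter should be joined with `?` or `&`.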

### 4. Parsing Review Data
- **Function `getReviews(html_data)`**:
  - Takes the parsed HTML data and extracts the required information from each review.
  - Navigates through the HTML structure to find the reviewer's name, star rating, review title, review date, and review description.
  - Incorporates error handling to ensure that if any piece of data is missing or cannot be parsed, a default value ('N/A') is assigned.
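
The fallback behaviour can be sketched on plain strings, independent of the HTML. The helper names `parse_review_date` and `parse_stars` and the sample input text are assumptions for illustration; the split and format strings mirror the ones used in the script.

```python
from datetime import datetime

def parse_review_date(raw):
    """Convert text ending in ' on <Month> <day>, <year>' to 'dd/mm/yyyy', else 'N/A'."""
    try:
        date_part = raw.strip().split(" on ")[-1]
        return datetime.strptime(date_part, "%B %d, %Y").strftime("%d/%m/%Y")
    except (AttributeError, ValueError):
        # raw was None (element missing) or the text did not match the expected format
        return "N/A"

def parse_stars(raw):
    """Convert e.g. '5.0 out of 5 stars' to '5.0', else 'N/A'."""
    try:
        return raw.strip().split(" out")[0]
    except AttributeError:
        return "N/A"

# Assumed example of the review-date text.
date = parse_review_date("Reviewed in the United States on May 13, 2024")
stars = parse_stars("5.0 out of 5 stars")
```

Catching only `AttributeError` and `ValueError` keeps the default narrow: a missing element or an unexpected date format yields `'N/A'`, while unrelated errors still surface.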

### 5. Main Function
- **Function `main()`**:
  - Orchestrates the entire process.
  - Initializes a list of URLs to be scraped and iterates through each URL, calling the `reviewsHtml` function to get the HTML content for the specified number of pages.
  - For each page's HTML content, the `getReviews` function is called to extract review data, which is then appended to a list.
  - After collecting all reviews, the data is converted into a pandas DataFrame and saved to a CSV file named `amazon_reviews1.csv`.

## Execution

The script starts by calling `main()`, which runs the scraping process for the listed review pages. After each successful request, the script waits a random 2 to 5 seconds to reduce the risk of detection and blocking. The extracted review data is then written to a CSV file, ready for analysis.

## Purpose and Benefits

The primary purpose of this project is to demonstrate web scraping techniques for collecting user reviews from e-commerce sites. This approach can be extended to various applications, such as sentiment analysis, product feature analysis, and customer feedback evaluation. By automating the data collection process, it saves significant time and effort compared to manual data entry.

## Conclusion

This project demonstrates a method for scraping user reviews from Amazon, capturing the essential review details and storing them in a structured format for further use. Error handling for failed requests and missing fields, together with support for user-agent rotation, makes the script more resilient when extracting user-generated content from the web.


In [8]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
import time
import random
import logging

# Configure logging
logging.basicConfig(filename='scraping_errors.log', level=logging.ERROR, format='%(asctime)s - %(message)s')

# Headers list to rotate user-agents
headers_list = [
    {
        'authority': 'www.amazon.com',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
    },
    # Add more user-agents here if needed
]

# Extract Data as HTML object from Amazon review page
def reviewsHtml(url, len_page):
    soups = []

    for page_no in range(1, len_page + 1):
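        # Start the query string with '?' unless the base URL already has one
        separator = '&' if '?' in url else '?'
        page_url = f"{url}{separator}pageNumber={page_no}"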
        try:
            headers = random.choice(headers_list)
            # A timeout stops a stalled connection from hanging the whole run
            response = requests.get(page_url, headers=headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            soups.append(soup)
            print(f"Scraped page {page_no} from URL: {url}")
            time.sleep(random.uniform(2, 5))  # Random delay between 2 and 5 seconds
        except requests.exceptions.RequestException as e:
            logging.error(f"Failed to retrieve page {page_no} from URL {page_url}: {e}")
            print(f"Failed to retrieve page {page_no} from URL {page_url}: {e}")
            break

    return soups

# Grab reviews name, description, date, stars, title from HTML
def getReviews(html_data):
    data_dicts = []

    boxes = html_data.select('div[data-hook="review"]')

    for box in boxes:
        # select_one returns None when an element is missing, so accessing .text
        # raises AttributeError; each field then falls back to 'N/A'.
        try:
            name = box.select_one('[class="a-profile-name"]').text.strip()
        except AttributeError:
            name = 'N/A'

        try:
            # "5.0 out of 5 stars" -> "5.0"
            stars = box.select_one('[data-hook="review-star-rating"]').text.strip().split(' out')[0]
        except AttributeError:
            stars = 'N/A'

        try:
            title = box.select_one('[data-hook="review-title"]').text.strip()
        except AttributeError:
            title = 'N/A'

        try:
            # Keep only the date after " on ", then reformat it as dd/mm/yyyy
            datetime_str = box.select_one('[data-hook="review-date"]').text.strip().split(' on ')[-1]
            date = datetime.strptime(datetime_str, '%B %d, %Y').strftime("%d/%m/%Y")
        except (AttributeError, ValueError):
            date = 'N/A'

        try:
            description = box.select_one('[data-hook="review-body"]').text.strip()
        except AttributeError:
            description = 'N/A'

        data_dict = {
            'Name': name,
            'Stars': stars,
            'Title': title,
            'Date': date,
            'Description': description
        }

        data_dicts.append(data_dict)

    return data_dicts

# Main Function to run the scraping process
def main():
    urls = [
        "https://www.amazon.com/Waterproof-Weekender-Essentials-Hospital-Overnight/product-reviews/B0CGR1XGVX/ref=cm_cr_getr_d_paging_btm_next",
        "https://www.amazon.com/LOVEVOOK-Weekender-Compartment-Toiletry-Hospital/product-reviews/B0C58Q4FM1/ref=cm_cr_getr_d_paging_btm_next"
    ]

    all_reviews = []

    for url in urls:
        soups = reviewsHtml(url, len_page=10)
        for soup in soups:
            reviews = getReviews(soup)
            all_reviews.extend(reviews)

    df = pd.DataFrame(all_reviews)
    df.to_csv('amazon_reviews1.csv', index=False)
    print("Scraping completed and data saved to 'amazon_reviews1.csv'")

if __name__ == "__main__":
    main()


Scraped page 1 from URL: https://www.amazon.com/Waterproof-Weekender-Essentials-Hospital-Overnight/product-reviews/B0CGR1XGVX/ref=cm_cr_getr_d_paging_btm_next
Scraped page 2 from URL: https://www.amazon.com/Waterproof-Weekender-Essentials-Hospital-Overnight/product-reviews/B0CGR1XGVX/ref=cm_cr_getr_d_paging_btm_next
Scraped page 3 from URL: https://www.amazon.com/Waterproof-Weekender-Essentials-Hospital-Overnight/product-reviews/B0CGR1XGVX/ref=cm_cr_getr_d_paging_btm_next
Scraped page 4 from URL: https://www.amazon.com/Waterproof-Weekender-Essentials-Hospital-Overnight/product-reviews/B0CGR1XGVX/ref=cm_cr_getr_d_paging_btm_next
Scraped page 5 from URL: https://www.amazon.com/Waterproof-Weekender-Essentials-Hospital-Overnight/product-reviews/B0CGR1XGVX/ref=cm_cr_getr_d_paging_btm_next
Scraped page 6 from URL: https://www.amazon.com/Waterproof-Weekender-Essentials-Hospital-Overnight/product-reviews/B0CGR1XGVX/ref=cm_cr_getr_d_paging_btm_next
Scraped page 7 from URL: https://www.amazon.co

In [9]:
df = pd.read_csv('amazon_reviews1.csv')

len(df)

200

In [11]:
df.head(200)

| | Name | Stars | Title | Date | Description |
|---|---|---|---|---|---|
| 0 | Stormy | 5.0 | 5.0 out of 5 stars\nIt's sturdy, spacious, and... | 13/05/2024 | I got this bag when it was $14. I honestly was... |
| 1 | IrishEyes4Ever | 5.0 | 5.0 out of 5 stars\nGreat, for an Affordable T... | 16/04/2024 | The price is reasonable for what you get. A ro... |
| 2 | Hailey R. | 5.0 | 5.0 out of 5 stars\nGreat value, great little bag | 17/04/2024 | I needed an easy weekend duffle for my hospita... |
| 3 | Ash | 5.0 | 5.0 out of 5 stars\nAMAZING! Much better than ... | 17/05/2024 | This bag is much better than I was expecting w... |
| 4 | Hannah | 5.0 | 5.0 out of 5 stars\nLove it | 17/05/2024 | Perfect for an overnight bag. Love the size an... |
| ... | ... | ... | ... | ... | ... |
| 195 | Anni | 5.0 | 5.0 out of 5 stars\nGood Quality & Stylish | 16/03/2024 | Bought the bag for my trip to London and it is... |
| 196 | andrea.lee | 5.0 | 5.0 out of 5 stars\nBeautiful, functional and ... | 06/06/2024 | I love this bag! I got it for an upcoming trip... |
| 197 | Laurie | 5.0 | 5.0 out of 5 stars\nLove ytis bag | 11/06/2024 | This bag is the perfect size for an overnight ... |
| 198 | Adrianne D Forrest | 4.0 | 4.0 out of 5 stars\nThe Strap Broke | 15/06/2024 | I bought this bag in January in anticipation o... |
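
The `Title` values in the output above still carry the star-rating text ("5.0 out of 5 stars") on a line before the actual title. A small post-processing step could strip it. This is a sketch: the `clean_title` helper is hypothetical, and it assumes the two parts are separated by a newline character, as the displayed `\n` suggests.

```python
def clean_title(title):
    """Drop a leading 'X out of 5 stars' line and return the remaining title text."""
    if not isinstance(title, str):
        return title  # leave missing values (e.g. NaN) untouched
    return title.split("\n")[-1].strip()

# Sample values taken from the output above, plus the 'N/A' default.
samples = ["5.0 out of 5 stars\nLove it", "4.0 out of 5 stars\nThe Strap Broke", "N/A"]
cleaned = [clean_title(t) for t in samples]
```

Applied to the DataFrame, this would look like `df['Title'] = df['Title'].map(clean_title)`.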
