----
# Data Scraping
----

### Notebook Summary

In this notebook, I scrape Amazon search results to gather product information on headphones.

## Set Up
-----

In [None]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import time

## Web Scraping
-----

I first checked `https://www.amazon.co.uk/robots.txt` to see which paths are allowed or disallowed for scraping.

Unfortunately, individual customer reviews and all review-related pages are restricted, so for this project I will scrape:

- Product Description
- Price
- Overall Rating
- Prime Eligibility
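
The same robots.txt check can also be done programmatically. The following is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` and the product ID in the review URL are hypothetical stand-ins so the example runs offline; they are not copied from Amazon's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt excerpt (an assumption, not Amazon's real rules),
# used so the example runs without a network request.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /product-reviews/
Disallow: /gp/customer-reviews/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A search results page versus a (hypothetical) review page
search_allowed = is_allowed(
    SAMPLE_ROBOTS_TXT, "https://www.amazon.co.uk/s?keywords=adult+headphones"
)
reviews_allowed = is_allowed(
    SAMPLE_ROBOTS_TXT, "https://www.amazon.co.uk/product-reviews/B000000000"
)
print(search_allowed, reviews_allowed)
```

To check the live file instead, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads and parses the rules in one step.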

### 1. Define URL

In [None]:
base_url = "https://www.amazon.co.uk/s?keywords=adult+headphones&i=electronics&page="


-----
**Comment:**

- **keywords=adult+headphones:** Searches for "adult headphones".
- **i=electronics:** Restricts the search to the "Electronics" category.
- **page=:** Sets the page number (appended during scraping).
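
As an alternative to string concatenation, the query string can be assembled from its parts. This is a small sketch using `urllib.parse.urlencode`; the names `SEARCH_ENDPOINT` and `build_search_url` are illustrative and are not used elsewhere in this notebook.

```python
from urllib.parse import urlencode

# Illustrative constant: the search path without any query parameters
SEARCH_ENDPOINT = "https://www.amazon.co.uk/s"


def build_search_url(keywords: str, category: str, page: int) -> str:
    """Build a search URL from the three query parameters described above."""
    # urlencode escapes values and encodes spaces as '+'
    query = urlencode({"keywords": keywords, "i": category, "page": page})
    return f"{SEARCH_ENDPOINT}?{query}"


first_page_url = build_search_url("adult headphones", "electronics", 1)
print(first_page_url)
```

For page 1 this should produce the same URL as `base_url + "1"`, while keeping each parameter visible and editable on its own.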

### 2. Perform Scraping

In [None]:
# To store scraped data
data = []

In [None]:
# Number of pages to scrape (around 30 products per page)
num_pages = 50

In [None]:
# Loop through the number of pages to scrape
for page in range(num_pages):
    # Print the current page number being scraped
    print(f"Scraping page {page + 1}...")

    # Construct the URL for the current page
    url = base_url + str(page + 1)
    print(url)
    # Define headers so the GET request resembles a browser request
    # (requests without a browser-like User-Agent may be rejected)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/117.0.5938.62"
    }
    
    # Send a get request to the URL with the headers
    response = requests.get(url, headers=headers)
    
    # Check if the request was successful (status code = 200)
    if response.status_code == 200:
        # Use BeautifulSoup to 'read' content on page
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find every div that holds a single search result
        all_headphones = soup.find_all('div', {'data-component-type': 's-search-result'})

        # Loop through each product found to extract product details
        for headphone in all_headphones:

            # Get the product ID (ASIN), falling back to a placeholder if absent
            if headphone.has_attr('data-asin'):
                hp_ASIN = headphone['data-asin']
            else:
                hp_ASIN = 'Not Specified'

            # Getting product descriptions
            desc = headphone.find('span', class_='a-size-medium a-color-base a-text-normal')
            hp_desc = desc.get_text(strip=True) if desc else 'N/A'

            # Getting product price
            price_pound = headphone.find('span', class_='a-price-whole')
            price_pennies = headphone.find('span', class_='a-price-fraction')

            if price_pound and price_pennies:
                hp_price = price_pound.get_text(strip=True) + (price_pennies.get_text(strip=True))
            else:
                hp_price = 'Not Specified'
            
            # Get overall rating
            rating = headphone.find('span', class_='a-icon-alt')
            hp_rating = rating.get_text(strip=True) if rating else 'N/A'

            # Check if headphone is prime
            prime = headphone.find('span', class_='aok-relative s-icon-text-medium s-prime')
            is_prime = '1' if prime else '0'

            # Add product info to list
            data.append({
                'Product ID' : hp_ASIN,
                'Description': hp_desc,
                'Price': hp_price,
                'Rating': hp_rating,
                'Is Prime': is_prime
            })
        
        # Delay by 2 seconds after each page to avoid overwhelming the server
        time.sleep(2)  

    else:
        print(f"Failed to access page {page + 1} (status code {response.status_code})")
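
The loop above stores each rating as raw text. As a sketch of how that text could later be turned into a number, the helper below pulls out the first numeric value with `re`. It assumes the rating text looks like "4.5 out of 5 stars", which is an assumption about the page rather than something verified here; `parse_rating` is a hypothetical helper and is not part of the scraping loop.

```python
import re
from typing import Optional


def parse_rating(rating_text: str) -> Optional[float]:
    """Return the first number found in the rating text, or None if there is none."""
    # Matches an integer or decimal such as "4" or "4.5"
    match = re.search(r"\d+(?:\.\d+)?", rating_text)
    return float(match.group()) if match else None


# Assumed rating format, plus the 'N/A' placeholder used in the loop
print(parse_rating("4.5 out of 5 stars"))
print(parse_rating("N/A"))
```

Returning `None` for the 'N/A' placeholder keeps missing ratings distinguishable from real values.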


### 3. Storing results in DataFrame and exporting to CSV

In [None]:
headphones_df = pd.DataFrame(data)

In [None]:
headphones_df.to_csv('../../data/scraped_data.csv')