----
# Data Scraping
----

### Notebook Summary

In this notebook, I scrape Amazon search results to gather information about noise-cancelling headphones priced at up to £300.

## Set Up
-----

In [2]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import time

## Web Scraping
-----

I first checked `https://www.amazon.co.uk/robots.txt` to see which paths are allowed or disallowed for scraping.

Unfortunately, individual customer reviews and all review-related pages are disallowed, so for this project I will scrape the following fields from the search results:

- Product Name
- Price
- Overall Rating
- Prime Eligible 
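
The robots.txt check described above can also be done programmatically. Below is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` and the `example.com` URLs are made up for illustration and are not a copy of Amazon's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Amazon's real robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /product-reviews/
Disallow: /gp/customer-reviews/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that is already available as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A search-results path is not matched by any Disallow rule in the sample
search_allowed = parser.can_fetch("*", "https://www.example.com/s?keywords=headphones")

# A review path is matched by the first Disallow rule in the sample
reviews_allowed = parser.can_fetch("*", "https://www.example.com/product-reviews/B000000000")

print(search_allowed, reviews_allowed)
```

To check a live file instead of a string, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads and parses the file in one step.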

### 1. Define URL

In [3]:
url = "https://www.amazon.co.uk/s?keywords=Noise+Cancelling+Headphones&i=electronics&rh=p_36%3A-30000"

-----
**Comment:**

- **keywords = Noise+Cancelling+Headphones:** Searching for "Noise Cancelling Headphones".
- **i = electronics:** Searching within the "Electronics" category.
- **p_36%3A-30000:** Filtering for products with a price of up to £300 (the price is in pence, so £300 = 30000 pence).
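
The same URL can be assembled from its parts, which makes the pounds-to-pence conversion explicit. This is a rough sketch: `build_search_url` is a hypothetical helper, and the optional `page` parameter is an assumption about how Amazon paginates search results rather than something confirmed in this notebook.

```python
from urllib.parse import urlencode

BASE_SEARCH_URL = "https://www.amazon.co.uk/s"


def build_search_url(keywords, max_price_pounds, category="electronics", page=None):
    """Build a search URL with a maximum-price filter (hypothetical helper)."""
    params = {
        "keywords": keywords,                      # spaces are encoded as "+"
        "i": category,                             # search category
        "rh": f"p_36:-{max_price_pounds * 100}",   # price cap in pence; ":" becomes "%3A"
    }
    if page is not None:
        # Assumption: search results accept a "page" query parameter
        params["page"] = page
    return f"{BASE_SEARCH_URL}?{urlencode(params)}"


search_url = build_search_url("Noise Cancelling Headphones", 300)
second_page_url = build_search_url("Noise Cancelling Headphones", 300, page=2)
print(search_url)
print(second_page_url)
```

Building the query with `urlencode` avoids hand-encoding characters such as the colon in the price filter.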

### 2. Perform Scraping

In [4]:
# To store scraped data
data = []

In [5]:
# Number of pages to scrape (around 30 products per page)
num_pages = 10 

In [6]:
# Loop through the number of pages to scrape
for page in range(num_pages):
    # Print the current page number being scraped
    print(f"Scraping page {page + 1}...")

    # Construct the URL for the current page without modifying the base url
    # (assumes Amazon's search results accept a "page" query parameter)
    page_url = f"{url}&page={page + 1}"
    
    # Define the headers for the GET request to mimic a browser (avoids error)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/117.0.5938.62"
    }
    
    # Send a GET request to the page URL with the headers
    response = requests.get(page_url, headers=headers)
    
    # Check if the request was successful (status code = 200)
    if response.status_code == 200:
        # Use BeautifulSoup to 'read' content on page
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find every div that holds a single search result
        all_headphones = soup.find_all('div', {'data-component-type': 's-search-result'})

        # Loop through each product found to extract product details
        for headphone in all_headphones:

            # Getting product titles
            title = headphone.find('span', class_='a-size-medium a-color-base a-text-normal')
            hp_title = title.get_text(strip=True) if title else 'N/A'

            # Getting product price
            price_pound = headphone.find('span', class_='a-price-whole')
            price_pennies = headphone.find('span', class_='a-price-fraction')

            if price_pound and price_pennies:
                hp_price = price_pound.get_text(strip=True) + price_pennies.get_text(strip=True)
            else:
                # Assign the placeholder so a previous product's price is not reused
                hp_price = 'N/A'
            
            # Get overall rating
            rating = headphone.find('span', class_='a-icon-alt')
            hp_rating = rating.get_text(strip=True) if rating else 'N/A'

            # Check if headphone is prime
            prime = headphone.find('span', class_='aok-relative s-icon-text-medium s-prime')
            is_prime = '1' if prime else '0'

            # Add product info to list
            data.append({
                'Title': hp_title,
                'Price': hp_price,
                'Rating': hp_rating,
                'Is Prime': is_prime
            })
        
        # Delay by 2 seconds after each page to avoid overwhelming the server
        time.sleep(2)  

    else:
        print(f"Failed to access page {page + 1} (status code {response.status_code})")


Scraping page 1...
Scraping page 2...
Scraping page 3...
Scraping page 4...
Scraping page 5...
Scraping page 6...
Scraping page 7...
Scraping page 8...
Scraping page 9...
Scraping page 10...
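
Because the loop above falls back to `'N/A'` whenever a field is not found, it can be useful to tally those placeholders before exporting. The sketch below uses a hypothetical `count_missing` helper and made-up sample records in the same shape as the dictionaries appended to `data`; the values are illustrative, not scraped results.

```python
from collections import Counter


def count_missing(records, placeholder="N/A"):
    """Count, per field, how many records hold the placeholder value."""
    missing = Counter()
    for record in records:
        for field, value in record.items():
            if value == placeholder:
                missing[field] += 1
    return dict(missing)


# Made-up records for illustration only
sample_records = [
    {"Title": "Headphones A", "Price": "59.99", "Rating": "4.5 out of 5 stars", "Is Prime": "1"},
    {"Title": "N/A", "Price": "N/A", "Rating": "N/A", "Is Prime": "0"},
    {"Title": "Headphones C", "Price": "N/A", "Rating": "4.1 out of 5 stars", "Is Prime": "0"},
]

missing_counts = count_missing(sample_records)
print(missing_counts)
```

In the notebook itself, the same function could be called as `count_missing(data)` once the scraping loop has finished.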


### 3. Storing results in DataFrame and exporting to CSV

In [7]:
headphones_df = pd.DataFrame(data)

In [8]:
headphones_df.to_csv('../../data/headphones_data.csv')

## Conclusion
-----

In this notebook, I scraped data for roughly 300 noise-cancelling headphones (10 pages of around 30 products each) from Amazon, focusing on products priced at up to £300. I stored the data in a DataFrame and exported it to a CSV file. The next step is to clean and preprocess the data.
