# Amazon Product Scraper

This notebook demonstrates a web scraping tool designed to extract product data from Amazon Egypt.

## Importing Libraries and Fetching Functions

This section imports the required libraries, including `requests`, `pandas`, and `BeautifulSoup`. It also downloads a custom `functions.py` helper module from my GitHub repository and imports it as `fn`.

In [None]:
import os
import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Download the helper module from GitHub, then import it
response = requests.get("https://raw.githubusercontent.com/ziadsalama95/amazon-web-scraping/main/functions.py", timeout=30)
response.raise_for_status()  # do not save an error page as functions.py
with open("functions.py", "wb") as file:
    file.write(response.content)
import functions as fn

## Setting the Amazon URL

Here, we define the base URL for Amazon Egypt's search page. This URL will be used as the starting point for scraping data.

In [None]:
url = 'https://www.amazon.eg/s'
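
As a rough illustration of how a search term and a page number end up in the request URL, the sketch below builds the query string with the standard library. The parameter names `k` and `page` are the ones used later in this notebook; `build_search_url` is a hypothetical helper written for this example, and `requests` performs an equivalent encoding on its own when it is given a `params` dictionary.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.amazon.eg/s"


def build_search_url(keyword: str, page: int = 1) -> str:
    """Return the search URL for a keyword and page (hypothetical helper)."""
    # urlencode escapes spaces and special characters in the keyword
    query = urlencode({"k": keyword, "page": page})
    return f"{BASE_URL}?{query}"


print(build_search_url("running shoes", page=2))
# https://www.amazon.eg/s?k=running+shoes&page=2
```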

## Defining User Agents and Search Terms

We define a list of user agents so that requests appear to come from a variety of browsers and devices, which lowers the chance that the website flags the scraper as a bot. We also define the list of search terms used to query Amazon's search page, and `max_results`, the maximum number of products to collect for each search term.

In [None]:
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:113.0) Gecko/20100101 Firefox/113.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:112.0) Gecko/20100101 Firefox/112.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Version/13.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:113.0) Gecko/20100101 Firefox/113.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edge/113.0.0.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; Trident/7.0; AS; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 OPR/94.0.0.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 OPR/93.0.0.0",
    "Mozilla/5.0 (Android 13; Mobile; rv:113.0) Gecko/113.0 Firefox/113.0",
    "Mozilla/5.0 (Android 12; Mobile; rv:112.0) Gecko/112.0 Firefox/112.0",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_2 like Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Version/16.2 Mobile/15E148 Safari/537.36",
    "Mozilla/5.0 (iPad; CPU OS 16_0 like Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Version/16.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:112.0) Gecko/20100101 Firefox/112.0 Edge/112.0.0.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:111.0) Gecko/20100101 Firefox/111.0 Edge/111.0.0.0",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:113.0) Gecko/20100101 Firefox/113.0",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:112.0) Gecko/20100101 Firefox/112.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 OPR/91.0.0.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 OPR/90.0.0.0",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Android 12; Mobile; rv:113.0) Gecko/113.0 Firefox/113.0 Edge/113.0.0.0",
    "Mozilla/5.0 (Android 11; Mobile; rv:112.0) Gecko/112.0 Firefox/112.0",
]

list_search = ["best sellers", "hot new releases", "top rated", "electronics", "smartphones", "laptops", "tablets", "smart home devices", "wearable technology", "home gadgets", "kitchen appliances", "home decor", "furniture", "outdoor gear", "fitness equipment", "yoga mats", "sportswear", "running shoes", "health supplements", "beauty products", "skincare", "haircare", "makeup", "fragrances", "organic products", "sustainable products", "eco-friendly", "baby products", "toys", "games", "puzzles", "books", "ebooks", "audiobooks", "office supplies", "stationery", "craft supplies", "art supplies", "musical instruments", "video games", "gaming consoles", "board games", "travel gear", "luggage", "backpacks", "camping gear", "hiking boots", "cycling gear", "pet supplies", "dog toys", "cat toys", "pet food", "car accessories", "automotive tools", "gardening tools", "power tools", "hand tools", "home improvement", "DIY kits", "lighting", "LED lights", "security cameras", "headphones", "bluetooth speakers", "portable chargers", "phone accessories", "camera gear", "drone accessories", "VR headsets", "smart watches", "fitness trackers", "electric scooters", "e-bikes", "3D printers", "robot vacuums", "air purifiers", "humidifiers", "space heaters", "fans", "blenders", "coffee makers", "air fryers", "instant pots", "cookware sets", "bakeware", "cutlery", "dishware", "glasses", "water bottles", "wine glasses", "storage solutions", "organizers", "closet systems", "laundry baskets", "cleaning supplies", "bedding", "mattresses", "pillows", "blankets", "curtains"]

max_results = 1000
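
The rotation of user agents can be sketched as a small function that picks one entry at random for each request. `pick_headers` is a hypothetical name used only for this example, and the two sample strings are copied from the list above; the notebook itself does the same thing inline with `random.choice`.

```python
import random


def pick_headers(user_agents, rng=random):
    """Build request headers with a randomly chosen User-Agent (hypothetical helper)."""
    if not user_agents:
        raise ValueError("user_agents must not be empty")
    return {"User-Agent": rng.choice(user_agents)}


# Two entries taken from the notebook's list, enough to show the idea
sample_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:113.0) Gecko/20100101 Firefox/113.0",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:112.0) Gecko/20100101 Firefox/112.0",
]

headers = pick_headers(sample_agents)
print(headers["User-Agent"])
```

Passing `rng=random.Random(0)` (or any seeded generator) makes the choice reproducible, which can help when debugging a run.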

## Scraping Logic

This code iterates through the search keywords in `list_search`. For each keyword, it requests the search results page, parses the HTML, and extracts each product's name, rating, number of reviews, and price using the helper functions in `fn`. It moves through the result pages until it has collected `max_results` products for that keyword or a page returns no products.

To reduce the chance of being blocked, each request uses a randomly chosen `User-Agent` header, and the script waits 3 seconds between pages. The collected data is stored in `total_products`, with progress updates printed throughout the execution.
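
The notebook uses a fixed pause between requests. One possible variation, shown here only as a sketch and not as what the notebook does, is to add a random extra wait so that requests are not evenly spaced. `polite_sleep`, `base_seconds`, and `jitter_seconds` are hypothetical names and values chosen for this example.

```python
import random
import time


def polite_sleep(base_seconds=3.0, jitter_seconds=2.0, sleep=time.sleep, rng=random):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds."""
    delay = base_seconds + rng.uniform(0.0, jitter_seconds)
    sleep(delay)
    return delay


# Demonstration with a stand-in sleep function so the example returns immediately
waited = polite_sleep(sleep=lambda seconds: None)
print(f"Would have waited {waited:.2f} s")
```

In the scraping loop, a call such as `polite_sleep()` would take the place of the fixed `time.sleep(3)`.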

In [None]:
total_products = []

In [None]:
for search in list_search:
    page = 1
    products = []
    
    while len(products) < max_results:
        
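        try:
            headers = {'User-Agent': random.choice(user_agents)}
            params = {'k': search, 'page': page}
            response = requests.get(url, headers=headers, params=params, timeout=30)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            # Report the error and move on to the next keyword;
            # a bare `continue` here would retry forever on a persistent failure.
            print(f"Request failed for '{search}' (page {page}): {e}")
            break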

        soup = BeautifulSoup(response.content, "html.parser")
        containers = soup.find_all('div', {'class': 'sg-col-4-of-24 sg-col-4-of-12 s-result-item s-asin sg-col-4-of-16 sg-col s-widget-spacing-small sg-col-4-of-20'})
        
        if not containers:
            print("No more products found on this page.")
            break

        for container in containers:
            product_name = fn.get_product_name(container)
            product_rating = fn.get_product_rating(container)
            product_nreviews = fn.get_product_nreviews(container)
            product_price = fn.get_product_price(container)

            products.append({
                'Name': product_name,
                'Price (EGP)': product_price,
                'Rating': product_rating,
                'Reviews': product_nreviews,
                'Keyword': search,
            })

            if len(products) >= max_results:
                break

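        print(f"Collected {len(products)} products so far for: {search} (Page {page})")
        page += 1
        time.sleep(3)

    # Add this keyword's products once, after all pages are scraped,
    # so products from earlier pages are not appended again on every page.
    total_products.extend(products)
    print(f"Finished search for: {search}. Total products found: {len(products)}")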

print(f"Total products collected: {len(total_products)}")
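
The control flow of the loop above can be studied in isolation, without any network access. The following is a minimal sketch under stated assumptions: `collect_products` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the request and parsing steps by returning made-up items. In this sketch each keyword's products are added to the combined list once, after paging has finished.

```python
def collect_products(fetch_page, max_results):
    """Collect items page by page until max_results is reached or a page is empty.

    fetch_page is a hypothetical callable: page number -> list of product dicts.
    """
    products = []
    page = 1
    while len(products) < max_results:
        items = fetch_page(page)
        if not items:
            break  # no more results for this keyword
        remaining = max_results - len(products)
        products.extend(items[:remaining])  # never exceed max_results
        page += 1
    return products


def fake_fetch(page):
    """Stand-in for a real request: three pages of four made-up products each."""
    if page > 3:
        return []
    return [{"Name": f"item-{page}-{i}"} for i in range(4)]


all_products = []
for keyword in ["laptops", "tablets"]:
    found = collect_products(fake_fetch, max_results=10)
    all_products.extend(found)  # extend once per keyword, after paging is done

print(len(all_products))  # 20
```

Separating the paging logic from the fetching step in this way also makes it possible to test the stopping conditions without sending requests.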

In [None]:
df = pd.DataFrame(total_products)
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)  # drop=True avoids adding the old index as a column
df.shape
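
The cell above removes duplicate rows and renumbers the index. A small sketch with made-up rows (not scraped data) shows what the two calls do, including the effect of the `drop` argument of `reset_index`. The names `df_demo`, `deduped`, `with_old_index`, and `clean` exist only in this example.

```python
import pandas as pd

# Made-up rows for illustration; the second row is an exact duplicate of the first
rows = [
    {"Name": "Kettle", "Price (EGP)": 450.0, "Keyword": "kitchen appliances"},
    {"Name": "Kettle", "Price (EGP)": 450.0, "Keyword": "kitchen appliances"},
    {"Name": "Blender", "Price (EGP)": 900.0, "Keyword": "blenders"},
]
df_demo = pd.DataFrame(rows)

# drop_duplicates keeps the first copy of each identical row;
# the surviving rows keep their original labels (0 and 2 here)
deduped = df_demo.drop_duplicates()

with_old_index = deduped.reset_index()   # old labels become an "index" column
clean = deduped.reset_index(drop=True)   # old labels are discarded

print(list(with_old_index.columns))  # ['index', 'Name', 'Price (EGP)', 'Keyword']
print(list(clean.columns))           # ['Name', 'Price (EGP)', 'Keyword']
print(clean.shape)                   # (2, 3)
```

Without `drop=True`, the extra `index` column would also be written to the CSV file when the DataFrame is saved.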

## Saving the Data

In [None]:
os.makedirs('data', exist_ok=True)
df.to_csv('data/products.csv', index=False)
df.head()