## AMAZON SCRAPING PROJECT

In [15]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import time
import random

## Description

This notebook demonstrates how to scrape product search results from Amazon India using Python. It uses `requests` for HTTP requests, `BeautifulSoup` for HTML parsing, and a regular expression to detect sponsored listings. The workflow includes:

- Defining a scraping function to extract product details such as name, price, discount, rating, sponsorship status, and product link.
- Iterating through multiple pages of search results for a given query.
- Aggregating the extracted data into a pandas DataFrame for further analysis.

**Example search URL:**  
`https://www.amazon.in/s?k=motorola&page=4&ref=sr_pg_4`
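
The query string in the example URL can be assembled with the standard library instead of by hand. The following is a minimal sketch: the helper name `build_search_url` is hypothetical (it is not part of this notebook), and it assumes the `k`, `page`, and `ref` parameters shown in the example URL are the only ones needed.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.amazon.in/s"


def build_search_url(query: str, page: int) -> str:
    """Hypothetical helper: build a search URL shaped like the example above."""
    params = {"k": query, "page": page, "ref": f"sr_pg_{page}"}
    # urlencode escapes each value with quote_plus, so spaces become '+'
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("motorola", 4))
print(build_search_url("iphone 15", 1))
```

Encoding the query this way would also cover characters other than spaces (for example `&`), which a plain string replacement does not handle.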

In [16]:
def scrape_(search, pagenumber, headers):
    """Scrape a single page of Amazon search results.
    
    Args:
        search (str): Search query
        pagenumber (int): Page number to scrape
        headers (dict): HTTP headers for request
        
    Returns:
        tuple: Six parallel lists, in order: names, prices, discounts,
            ratings, sponsored flags, links.
    """
    product_names = []
    prices = []
    ratings = []
    links = []
    discount_pcts = []
    is_sponsored = []
    
    url = f"https://www.amazon.in/s?k={search}&page={pagenumber}&ref=sr_pg_{pagenumber}"
    # A timeout keeps the request from hanging indefinitely
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')
    product_containers = soup.find_all('div', attrs={'data-component-type': 's-search-result'})
    print(f"Found {len(product_containers)} products on page {pagenumber}")
    for container in product_containers:
        try:
            # Check if sponsored
            product_sponsored = False
            if container.find(string=re.compile(r'sponsored', re.IGNORECASE)):
                product_sponsored = True
            # print(f"Sponsored: {product_sponsored}")

            name_element = container.select_one('a h2 span')
            product_name = (name_element.text.strip()) if name_element else "N/A"
            # print(product_name)
            
            # Extract the current price
            price_element = container.find('span', class_='a-offscreen')
            if price_element:
                # Kept as displayed text, e.g. "₹59,500" (currency symbol and commas included)
                price_text = price_element.text.strip()
            else:
                price_text = "N/A"
            # print(price_text)
            
            # Extract discount percentage
            discount_element = container.find('span', string=lambda x: '% off' in x if x else False)
            if discount_element:
                # Drop the first and last characters, which wrap the text (e.g. "(15% off)")
                discount = discount_element.string[1:-1]
            else:
                discount = "N/A"
            # print(discount)

            # Extract rating text, e.g. "4.5 out of 5 stars"
            rating_element = container.find('span', class_='a-icon-alt')
            if rating_element:
                # Fall back to "N/A" when the span is present but empty
                rating = rating_element.text.strip() or "N/A"
            else:
                rating = "N/A"
            # print(rating)
            
            # Extract product link
            link_element = container.find('a')
            if link_element and 'href' in link_element.attrs:
                href = link_element['href']
                if href.startswith('/'):
                    product_link = "https://www.amazon.in" + href
                else:
                    product_link = href
            else:
                product_link = "N/A"
            # print(product_link)
                
            is_sponsored.append(product_sponsored)
            product_names.append(product_name)
            prices.append(price_text)
            discount_pcts.append(discount)
            ratings.append(rating)
            links.append(product_link)
        except Exception as e:
            print(f"Error processing a product: {e}")
            # Add placeholder values so the six lists stay the same length
            product_names.append("Error")
            prices.append("Error")
            discount_pcts.append("Error")
            ratings.append("Error")
            is_sponsored.append("Error")
            links.append("Error")
    # Pause once per page, after all products are parsed, before the next request
    time.sleep(random.uniform(2, 4))
    return product_names, prices, discount_pcts, ratings, is_sponsored, links
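
The function returns prices, discounts, and ratings as display text (for example `₹1,62,900`, `15% off`, `4.5 out of 5 stars`), so numeric analysis needs a conversion step. Below is a minimal sketch of one way to do that, assuming the text formats seen in this notebook's sample output. The helper names `parse_price`, `parse_discount`, and `parse_rating` are hypothetical and not part of the notebook.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[float]:
    """Convert text like '₹1,62,900' to 162900.0; None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def parse_discount(text: str) -> Optional[int]:
    """Convert text like '15% off' to 15; None if no percentage is found."""
    match = re.search(r"(\d+)\s*%", text)
    return int(match.group(1)) if match else None


def parse_rating(text: str) -> Optional[float]:
    """Convert text like '4.5 out of 5 stars' to 4.5; None otherwise."""
    match = re.search(r"(\d+(?:\.\d+)?)\s+out of", text)
    return float(match.group(1)) if match else None


print(parse_price("₹1,62,900"))
print(parse_discount("15% off"))
print(parse_rating("4.5 out of 5 stars"))
print(parse_price("N/A"))
```

Returning `None` for the `"N/A"` and `"Error"` placeholders lets pandas treat those cells as missing values rather than as strings.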


## Main Function Overview

The `main()` function orchestrates the complete web scraping process:

1. Sets up request headers to mimic browser behavior
2. Initializes a dictionary of lists to store the scraped data:
    - Product names
    - Prices 
    - Discounts
    - Ratings
    - Sponsorship status
    - Product links

3. Iterates through pages 1-2 of search results
4. For each page:
    - Calls `scrape_()` function
    - Aggregates results into data structure
    
5. Creates pandas DataFrame from collected data
6. Exports results to CSV file named after search query
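
Steps 4 and 5 can also be written row-wise: instead of extending six separate lists, each page's parallel lists are zipped into one dictionary per product. This is an alternative sketch, not the approach used in this notebook. The names `COLUMNS`, `to_records`, and `fake_page` are hypothetical, and the sample values are made-up placeholders standing in for real `scrape_()` output.

```python
from typing import Dict, List, Tuple

# Same order as the tuple returned by scrape_()
COLUMNS = ["Product_Name", "Price", "Discount", "Rating", "Sponsored", "Link"]


def to_records(page_results: Tuple[list, ...]) -> List[Dict[str, object]]:
    """Turn the six parallel lists from one page into one dict per product."""
    return [dict(zip(COLUMNS, row)) for row in zip(*page_results)]


# Placeholder page of results (made-up values, not scraped data)
fake_page = (
    ["Phone A", "Phone B"],
    ["₹10,000", "N/A"],
    ["10% off", "N/A"],
    ["4.0 out of 5 stars", "N/A"],
    [True, False],
    ["https://www.amazon.in/a", "https://www.amazon.in/b"],
)

records: List[Dict[str, object]] = []
for page_results in [fake_page, fake_page]:  # stands in for pages 1 and 2
    records.extend(to_records(page_results))

print(len(records))
print(records[0]["Product_Name"], records[0]["Sponsored"])
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)`, which gives the same columns as the dictionary-of-lists layout while keeping each product's fields together.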

In [17]:
def main(search_query):
    """Main function to execute scraping process.
    
    Args:
        search_query (str): Search term to look for on Amazon
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9"
    }
    
    all_data = {
        'Product_Name': [], 
        'Price': [], 
        'Discount': [], 
        'Rating': [], 
        'Sponsored': [], 
        'Link': []
    }

    for page in range(1, 3):
        results = scrape_(search_query, page, headers)
        
        # Extend lists with page results
        all_data['Product_Name'].extend(results[0])
        all_data['Price'].extend(results[1])
        all_data['Discount'].extend(results[2])
        all_data['Rating'].extend(results[3])
        all_data['Sponsored'].extend(results[4])
        all_data['Link'].extend(results[5])

    df = pd.DataFrame(all_data)
    print("\nFirst 7 results:")
    print(df.head(7))
    df.to_csv(f'{search_query}.csv', index=False)
    print(f"Data saved to {search_query}.csv")

In [18]:
if __name__ == '__main__':
    search_query = input("Enter the search query: ").replace(" ", "+")
    main(search_query) 

Found 22 products on page 1
Found 22 products on page 2

First 7 results:
                                        Product_Name      Price Discount  \
0                    Apple iPhone 15 (128 GB) - Blue    ₹59,500  15% off   
1  iPhone 16 Pro 1 TB: 5G Mobile Phone with Camer...  ₹1,62,900   4% off   
2                    Apple iPhone 15 (128 GB) - Blue    ₹59,500  15% off   
3                    Apple iPhone 15 (128 GB) - Pink    ₹59,500  15% off   
4                   Apple iPhone 15 (128 GB) - Black    ₹59,500  15% off   
5                   Apple iPhone 15 (128 GB) - Green    ₹61,200  12% off   
6  LOXXO® Microfiber Candy Case Compatible for iP...       ₹379  58% off   

               Rating  Sponsored  \
0  4.5 out of 5 stars       True   
1  4.3 out of 5 stars       True   
2  4.5 out of 5 stars      False   
3  4.5 out of 5 stars      False   
4  4.5 out of 5 stars      False   
5  4.5 out of 5 stars      False   
6  3.9 out of 5 stars      False   

                            