# Web Scraping from the Amazon Website

## 1.1 📌 Introduction

E-commerce platforms like Amazon host millions of products, generating a wealth of information that can be valuable for businesses, researchers, and consumers. Product details such as prices, reviews, and ratings change frequently, making it difficult to track trends manually. Web scraping provides a practical way to extract this information in a structured format for analysis.

This project demonstrates how to build an Amazon web scraper in Python that automatically collects product data. Using the Requests, BeautifulSoup, and pandas libraries, the scraper extracts product titles, prices, ratings, review counts, and availability, then stores them in a CSV file for further analysis.

---
## 1.2 🔍 Problem Statement

Manually tracking Amazon product data is time-consuming, inconsistent, and prone to error. For businesses, it’s crucial to monitor competitor pricing, product availability, and customer sentiment to stay competitive. Similarly, researchers and analysts need reliable datasets to study market trends and consumer preferences.

## 1.3 🎯 Objectives

The key objectives of this project are:


1. Data Extraction – Scrape product information (titles, prices, ratings, review counts, and availability) from Amazon product pages.
2. Data Cleaning & Storage – Drop rows with missing titles and organize the extracted data into a structured format (CSV) for easy use.
3. Portfolio Demonstration – Showcase practical web scraping and Python programming skills in a real-world use case.



## 2. Import Libraries

In [57]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

## 3. Extracting the fields before we create the dataset

---
### 3.1 Fetch product title

In [58]:
# Function to extract Product Title
def fetch_product_title(soup):
  try:
    # The title sits in <span id="productTitle">; strip surrounding whitespace
    title_string = soup.find("span", attrs={"id":'productTitle'}).text.strip()

  except AttributeError:
    # soup.find() returned None because the tag is missing from the page
    title_string = ""

  return title_string

### 3.2 Fetch Product Price

In [59]:
# Function to extract Product Price
def fetch_product_price(soup):
  price = "" # Initialize price
  try:
    price = soup.find("span", attrs={'class':"aok-offscreen"}).string.strip()
  except AttributeError: # Catching specific exception
    price = "" # Ensure price is defined even if there's an error
  return price
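
The price comes back as display text, so it has to be converted before any numeric analysis. The following is a minimal sketch of that step, assuming US-style formatting such as `$1,299.99`; the helper name `parse_price` is hypothetical and not part of the original notebook.

```python
import re
from typing import Optional


def parse_price(price_text: str) -> Optional[float]:
    """Convert a scraped price string such as "$1,299.99" to a float.

    Returns None when the text contains no number, which covers the
    empty-string fallback used by fetch_product_price.
    Assumption: "," separates thousands and "." marks the decimals.
    """
    # Take the first run of digits, with optional thousands separators and decimals
    match = re.search(r"\d[\d,]*(?:\.\d+)?", price_text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(parse_price("$1,299.99"))
print(parse_price(""))
```

Returning `None` instead of `0.0` keeps missing prices distinguishable from free items once the values are loaded into a DataFrame.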

### 3.3 Fetch Product Rating

In [60]:
# Function to extract Product Rating
def fetch_product_rating(soup):
  rating = "" # Initialize rating
  try:
    # This exact class string only matches a 4.5-star icon; other ratings rely on the fallback below
    rating = soup.find("i", attrs={'class':'a-icon a-icon-star a-star-4-5'}).string.strip()
  except AttributeError:
    try:
      rating = soup.find("span", attrs={'class':'a-icon-alt'}).string.strip()
    except AttributeError: # Catching specific exception
      rating = "" # Ensure rating is defined even if there's an error
  return rating
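
The rating is also returned as text. As a sketch, assuming the text follows the pattern `4.5 out of 5 stars`, the numeric score could be pulled out as shown here; `parse_rating` is a hypothetical helper, not part of the original notebook.

```python
import re
from typing import Optional


def parse_rating(rating_text: str) -> Optional[float]:
    """Extract the score from text such as "4.5 out of 5 stars".

    Returns None when the text does not follow that pattern, including
    the empty-string fallback used by fetch_product_rating.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of\s+\d+", rating_text)
    return float(match.group(1)) if match else None


print(parse_rating("4.5 out of 5 stars"))
```

Anchoring on the words "out of" makes the helper reject unrelated text that merely happens to contain a number.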

### 3.4 Fetch Number of User Reviews

In [61]:
# Function to extract Number of User Reviews
def fetch_product_review_count(soup):
  review_count = "" # Initialize review_count
  try:
    review_count = soup.find("span", attrs={'id':'acrCustomerReviewText'}).string.strip()
  except AttributeError:
    review_count = "" # Ensure review_count is defined even if there's an error
  return review_count
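
The review count is scraped as a label rather than a number. A small sketch, assuming the label contains a plain integer with optional thousands separators (for example `1,234 ratings`), might convert it like this; `parse_review_count` is a hypothetical name, and abbreviated counts would need extra handling.

```python
import re
from typing import Optional


def parse_review_count(review_text: str) -> Optional[int]:
    """Convert text such as "1,234 ratings" to the integer 1234.

    Returns None when no digits are present, including the empty-string
    fallback used by fetch_product_review_count.
    """
    match = re.search(r"\d[\d,]*", review_text)
    return int(match.group().replace(",", "")) if match else None


print(parse_review_count("1,234 ratings"))
```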

### 3.5 Fetch Availability Status

In [62]:
# Function to extract Availability Status
def fetch_product_availability(soup):
  available = "Not Available" # Initialize available
  try:
    available = soup.find("div", attrs={'id':'availability'})
    available = available.find("span").string.strip()

  except AttributeError:
    available = "Not Available" # Ensure available is defined even if there's an error

  return available

## 4. Dunder Main

This is where we use the requests library to access the Amazon website, extract each field (product title, price, rating, number of reviews, and availability), and store the results in a DataFrame, which is then written to a comma-separated values (CSV) file.
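
Each request in the script is sent once, and its HTTP status is not checked. As a minimal sketch, assuming a non-200 response should simply be retried after a pause, a helper along these lines could wrap each call. The name `fetch_with_retries` and its parameters are hypothetical; `fetch` stands in for something like `lambda u: requests.get(u, headers=HEADERS)`.

```python
import time
from typing import Callable


def fetch_with_retries(fetch: Callable[[str], object], url: str,
                       retries: int = 3, delay_seconds: float = 2.0):
    """Call fetch(url) until it returns status 200 or attempts run out.

    `fetch` is any callable that returns an object with a `status_code`
    attribute. Returns the successful response, or None if every
    attempt failed. Exceptions raised by `fetch` are not caught here.
    """
    for attempt in range(1, retries + 1):
        response = fetch(url)
        if getattr(response, "status_code", None) == 200:
            return response
        if attempt < retries:
            # Wait a little longer after each failed attempt
            time.sleep(delay_seconds * attempt)
    return None
```

Passing the fetch function in as an argument keeps the helper independent of any HTTP library, so it can be exercised with a stub before it is pointed at a live site.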

In [63]:
if __name__ == '__main__':

  # Add your browser's User-Agent string here; requests with an empty value may be blocked
  HEADERS = {'User-Agent':'', 'Accept-Language': 'en-US, en;q=0.5'}

  # The webpage URL
  URL = "https://www.amazon.com/s?k=playstation+4&ref=nb_sb_noss_2"

  # HTTP Request
  webpage = requests.get(URL, headers=HEADERS)

  # Soup Object containing all data
  soup = BeautifulSoup(webpage.content, "html.parser")

  # Fetch links as List of Tag Objects
  links = soup.find_all("a", attrs={'class':'a-link-normal s-no-outline'})

  # Store the links
  links_list = []

  # Loop for extracting links from Tag Objects
  for link in links:
    links_list.append(link.get('href'))

  d = {"title":[], "price":[], "rating":[], "reviews":[], "availability":[]}

  # Loop for extracting product details from each link
  for link in links_list:
    new_webpage = requests.get("https://www.amazon.com" + link, headers=HEADERS)

    page_soup = BeautifulSoup(new_webpage.content, "html.parser")

    # Function calls to collect all necessary product information
    d['title'].append(fetch_product_title(page_soup))
    d['price'].append(fetch_product_price(page_soup))
    d['rating'].append(fetch_product_rating(page_soup))
    d['reviews'].append(fetch_product_review_count(page_soup))
    d['availability'].append(fetch_product_availability(page_soup))

  amazon_df = pd.DataFrame.from_dict(d)
  # Treat empty titles as missing values and drop those rows before saving
  amazon_df['title'] = amazon_df['title'].replace('', np.nan)
  amazon_df = amazon_df.dropna(subset=['title'])
  amazon_df.to_csv("amazon_data.csv", header=True, index=False)
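
The script builds each product URL by concatenating the domain with the scraped `href`, which assumes every link is present, relative, and unique. As a hedged sketch of a more defensive alternative, the standard library's `urljoin` handles both relative and absolute links. The helper `build_product_urls` and the product IDs in the usage lines are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.com"


def build_product_urls(hrefs):
    """Turn raw href values into unique absolute URLs, preserving order."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href:
            # link.get('href') returns None when the attribute is missing
            continue
        # urljoin leaves absolute URLs untouched and resolves relative ones
        url = urljoin(BASE_URL, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample_hrefs = ["/dp/B000TEST01", None,
                "https://www.amazon.com/dp/B000TEST02", "/dp/B000TEST01"]
print(build_product_urls(sample_hrefs))
```

Removing duplicates before the detail loop also avoids requesting the same product page twice.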