## Web scraping tool using the Chrome driver: scraping the Flipkart website

**Import Key Libraries**

In [1]:
# 
from selenium import webdriver                                      # Purpose: Loads the browser driver (e.g., Chrome, Firefox) to automate web browsing tasks
from selenium.webdriver.common.by import By                         # Purpose: Provides convenient ways to locate elements, such as By.ID, By.CLASS_NAME, By.XPATH, etc.
from selenium.webdriver.support.ui import WebDriverWait             # Purpose: Allows you to wait for a certain condition (like element visibility) before proceeding. Used to deal with dynamic content loading.
from selenium.webdriver.support import expected_conditions as EC    # Purpose: Used with WebDriverWait to specify what condition to wait for — like an element being present, clickable, etc.
import time                                                         # Purpose: Provides manual delays using time.sleep() when needed.
from bs4 import BeautifulSoup                                       # Purpose: Parses HTML content fetched from the website. Commonly used to extract structured data (like titles, prices, links).
import lxml                                                         # Purpose: A high-performance XML/HTML parser used by BeautifulSoup for faster and more accurate parsing.
import pandas as pd                                                 # Purpose: Provides powerful data manipulation capabilities. You can store scraped data in a DataFrame and export it to CSV, Excel, etc.
import re                                                           # Purpose: Enables regular expression matching. Used to extract text or clean/validate data.
from datetime import datetime                                       # Purpose: Helps in working with date and time, such as tagging scraped data with a timestamp.
from selenium.webdriver.common.keys import Keys                     # Purpose: Allows you to simulate keyboard keys like ENTER, TAB, ESC, etc., during automation.

| Module                   | Purpose                           |
| ------------------------ | --------------------------------- |
| `selenium.webdriver`     | Automates browsers                |
| `By`                     | Element location strategy         |
| `WebDriverWait` + `EC`   | Dynamic waiting for page elements |
| `time`                   | Static wait or delay              |
| `BeautifulSoup` + `lxml` | Parse and navigate HTML           |
| `pandas`                 | Data storage and export           |
| `re`                     | Text extraction and cleaning      |
| `datetime`               | Timestamps                        |
| `Keys`                   | Simulating key presses            |
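
To make the difference between the static `time.sleep()` delay and the dynamic `WebDriverWait` + `EC` wait concrete, here is a minimal stdlib-only sketch of the polling idea. `wait_until` is a hypothetical helper written for illustration, not Selenium's implementation: it re-checks a condition at a fixed interval and returns as soon as the condition holds, or raises `TimeoutError` once the deadline passes. The simulated "element" is likewise made up.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper for illustration only; it mimics the idea behind an
    explicit wait, not Selenium's actual code.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result          # stop waiting as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulate content that only "appears" on the third check.
attempts = {"count": 0}


def element_is_present():
    attempts["count"] += 1
    return "search-box" if attempts["count"] >= 3 else None


found = wait_until(element_is_present, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

A fixed `time.sleep(120)` would always pay the full delay, whereas a polling wait returns as soon as the element shows up. That is why the cells in this notebook can afford a generous 120-second timeout in `WebDriverWait`.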


### Step 1: Get all product links

In [2]:
# 🧾 Section 1: Setup & Inputs
#Inputs to search
search_box_text = 'sports shoes for women'                                                                                  # search_box_text: The keyword to search on Flipkart.   
website_link = 'https://www.flipkart.com/'                                                                                  # website_link: The Flipkart homepage URL.

#initiating the browser
#session start time
session_start_time = datetime.now().time()                                                                                  # Stores the time the scraping session begins (used for logging).
print(f"Session Start Time: {session_start_time} ---------------------------> ")

# 🚀 Section 2: Browser Initialization
#starting the browser
driver = webdriver.Chrome()                                                                                                 # Launches a new Chrome browser session.
driver.get(website_link)                                                                                                    # Opens Flipkart's website.
driver.maximize_window()                                                                                                    # Maximizes the browser window for better visibility and scraping.

# 🔍 Section 3: Perform the Search
print('Waiting for search input...')                                                                                                 
search_input = WebDriverWait(driver, 120).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[autocomplete="off"]')))  # Waits up to 120 seconds for the search box to appear.
        
print('Typing in search input...') 
search_input.send_keys(search_box_text) 
        
print('Submitting search form...') 
search_input.send_keys(Keys.RETURN)                                                                                         # Sends the search keyword and presses ENTER.
        
print('Waiting for search results...') 
WebDriverWait(driver, 120).until( EC.presence_of_element_located((By.CSS_SELECTOR, '[target="_blank"]')) )                  # Waits for at least one search result to be present after submitting the query.

print('Collecting pagination links...') 

# 📄 Section 4: Handle Pagination
# We want the pagination links of the first 25 result pages (about 1000 products)
# Logic: take the first page's pagination link, swap its trailing page number for 2..25, and store every link in a list
all_pagination_links =[]

first_page = driver.find_elements(By.CSS_SELECTOR, 'nav a')[0]                                                              # Gets the pagination link of the first page of results.
first_page_link = first_page.get_attribute('href')                                                                    
all_pagination_links.append(first_page_link)                                                                                

for i in range(2, 26):                                                                                                      # Constructs pagination links for pages 2 to 25 by replacing the final character of the first page URL (this assumes the URL ends in page=1).
    new_pagination_link = first_page_link[: -1] + str(i)
    all_pagination_links.append(new_pagination_link)

print('Pagination Links Count:', len(all_pagination_links)) 
print("All Pagination Links: ", all_pagination_links)


print("Collecting Product Detail Page Links")
all_product_links = []


# 📦 Section 5: Collect All Product Links
for link in all_pagination_links:
    driver.get(link)                                                                                                         # Visits each pagination page.
    # Wait for the page to load by checking document.readyState
    WebDriverWait(driver, 120).until(lambda d: d.execute_script('return document.readyState') == 'complete')                 # Waits until the page is fully loaded.

    # Wait until at least one product card is present
    WebDriverWait(driver, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'rPDeLR')))

    all_products = driver.find_elements(By.CLASS_NAME, 'rPDeLR')                                                             # Finds all product cards using class name rPDeLR (Flipkart's internal class for product tiles).
    all_links = [element.get_attribute('href') for element in all_products]                                                  # Collects the href link from each product card.

    print(f"{link} Done ------>")

    all_product_links.extend(all_links)
    
print('All Product Detail Page Links Captured: ', len(all_product_links)) 

# 🧾 Section 6: Save the Product Links
# Creating a DataFrame from the list
df_product_links = pd.DataFrame(all_product_links, columns=['product_links'])                                                # Converts the product links into a Pandas DataFrame.
#remove any duplicates
df_product_links = df_product_links.drop_duplicates(subset=['product_links'])                                                # Removes duplicate URLs.

print("Total Product Detail Page Links", len(df_product_links))
df_product_links.to_csv('flipkart_product_links.csv', index = False)                                                         # Saves all links to a CSV file named flipkart_product_links.csv.

# ✅ Section 7: Cleanup
driver.quit()                                                                                                                # Quits the browser and ends the WebDriver session (close() would only close the current window).
session_end_time = datetime.now().time()
print(f"Session End Time: {session_end_time} ---------------------------> ")                                                 # Logs the time when scraping ended.


Session Start Time: 15:55:25.624962 ---------------------------> 
Waiting for search input...
Typing in search input...
Submitting search form...
Waiting for search results...
Collecting pagination links...
Pagination Links Count: 25
All Pagination Links:  ['https://www.flipkart.com/search?q=sports+shoes+for+women&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=off&as=off&page=1', 'https://www.flipkart.com/search?q=sports+shoes+for+women&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=off&as=off&page=2', 'https://www.flipkart.com/search?q=sports+shoes+for+women&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=off&as=off&page=3', 'https://www.flipkart.com/search?q=sports+shoes+for+women&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=off&as=off&page=4', 'https://www.flipkart.com/search?q=sports+shoes+for+women&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=off&as=off&page=5', 'https://www.flipkart.com/search?q=sports+

| Step | Action                                        |
| ---- | --------------------------------------------- |
| 1    | Open Flipkart and search for a keyword        |
| 2    | Wait for results and paginate across 25 pages |
| 3    | Collect all product detail page URLs          |
| 4    | Save the results in a CSV file                |
| 5    | Log session start and end time                |
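
Step 2 in the table depends on the first page's link ending in `page=1`, because the cell above replaces only the last character of that URL. As a hedged alternative, the sketch below sets the `page` query parameter explicitly with `urllib.parse`, so it does not matter where the parameter sits in the URL. `build_page_links` is a hypothetical helper, and the example URL is a shortened form of the one shown in the output.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def build_page_links(first_page_link, last_page):
    """Return links for pages 1..last_page by rewriting the `page` query parameter.

    Hypothetical helper for illustration; assumes the site paginates via `page=N`.
    """
    parts = urlsplit(first_page_link)
    # Keep every query parameter except any existing `page` value.
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "page"]
    links = []
    for page in range(1, last_page + 1):
        query = urlencode(params + [("page", str(page))])
        links.append(urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment)))
    return links


first = "https://www.flipkart.com/search?q=sports+shoes+for+women&marketplace=FLIPKART&page=1"
links = build_page_links(first, 25)
print(len(links))
print(links[1])
```

The trade-off is a few more lines in exchange for not breaking if the site ever appends another parameter after `page`.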


### Step 2: Get individual product information

In [3]:
# 🔍 Goal : Scrape detailed information (brand, title, price, discount, rating, reviews) for individual products from Flipkart using product URLs stored in a CSV file.

# 📌 1. Setup & Inputs
#session start time
session_start_time = datetime.now().time()
print(f"Session Start Time: {session_start_time} ---------------------------> ")                                    # Captures the start time of the scraping session.


#reading the csv file which contain all product links
df_product_links = pd.read_csv("flipkart_product_links.csv")

# Remove the line below to scrape all the products. For demonstration purposes we scrape only 10 products           # Loads product links from the CSV and selects only the top 10 (for demo/testing).
df_product_links = df_product_links.head(10)

all_product_links = df_product_links['product_links'].tolist()                                                      # Converts the DataFrame column to a Python list.
print("Collecting Individual Product Detail Information")

#🚀 2. Start WebDriver
#starting the browser
driver = webdriver.Chrome()                                                                                         # Starts a new instance of the Chrome browser

# 🔄 3. Scraping Loop Initialization                                                                                # Prepares variables to store results, failures, and counters.
complete_product_details = []
unavailable_products = []
successful_parsed_urls_count = 0
complete_failed_urls_count = 0

# 🧪 4. Main Loop: Scrape Each Product Page
for product_page_link in all_product_links:                                                                         # Opens each product link and waits for the page to fully load.                                                    
    try: 
        driver.get(product_page_link)
    
        # Wait for the page to load by checking document.readyState
        WebDriverWait(driver, 120).until(lambda d: d.execute_script('return document.readyState') == 'complete')
    
        WebDriverWait(driver, 120).until( EC.presence_of_element_located((By.CSS_SELECTOR, '[target="_blank"]')))

        # ❌ 5. Check for "Unavailable" Products
        # Looks for a "Sold Out" or "Currently Unavailable" message. If found, logs the URL and skips to the next product.
        try:
            product_status = driver.find_element(By.CLASS_NAME, 'Z8JjpR').text
        except Exception:
            product_status = ''                                                                                     # No status banner found, so the product is treated as available.
        if product_status in ('Currently Unavailable', 'Sold Out'):
            unavailable_products.append(product_page_link)
            successful_parsed_urls_count += 1
            print(f"URL {successful_parsed_urls_count} completed --->")
            continue                                                                                                # Skip field extraction for unavailable products.

# 🏷️ 6. Extract Key Fields    
        #brand
        brand =  driver.find_element(By.CLASS_NAME, 'mEh187').text
    
        #title   ::  Title (with color/extra info removed)   
        title = driver.find_element(By.CLASS_NAME, 'VU-ZEz').text
        title = re.sub(r'\s*\([^)]*\)', '', title)  # removing text within parentheses (color information)
    
        #price      
        price = driver.find_element(By.CLASS_NAME, 'Nx9bqj').text
        price = re.findall(r'\d+', price)
        price = ''.join(price)
    
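        # Discount: digits in the label are read as a percentage (e.g. 55 becomes 0.55); left empty when no discount is shown
        try:
            discount = driver.find_element(By.CLASS_NAME, 'UkUFwK').text
            discount = re.findall(r'\d+', discount)
            discount = ''.join(discount)
            discount = int(discount) / 100
        except Exception:
            discount = ''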
    
        # Rating & Total Ratings: a new product has neither an avg_rating nor total_ratings
        avg_rating = ''                                                                                         # Reset so values from the previous product never carry over.
        total_ratings = ''
        try:
            product_review_status = driver.find_element(By.CLASS_NAME, 'E3XX7J').text
        except Exception:
            product_review_status = ''                                                                          # No "be the first to review" banner on the page.
        if product_review_status != 'Be the first to Review this product':
            avg_rating = driver.find_element(By.CLASS_NAME, 'XQDdHH').text
            total_ratings = driver.find_element(By.CLASS_NAME, 'Wphh3N').text.split(' ')[0]
            # remove the thousands separator before converting to int
            total_ratings = int(total_ratings.replace(',', ''))
    
        successful_parsed_urls_count += 1
        print(f"URL {successful_parsed_urls_count} completed *******")
        complete_product_details.append([product_page_link, title, brand, price, discount, avg_rating, total_ratings])    # ✅ 7. Store Results
    except Exception as e:                                                                                                # ❗ 8. Handle Exceptions
        print(f"Failed to establish a connection for URL {product_page_link}:  {e}")
        unavailable_products.append(product_page_link)
        complete_failed_urls_count += 1
        print(f"Failed URL Count {complete_failed_urls_count}")

# 📊 9. Create DataFrames
#create pandas dataframe 
df = pd.DataFrame(complete_product_details, columns = ['product_link', 'title', 'brand', 'price', 'discount', 'avg_rating', 'total_ratings'])
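#duplicates processing: rows that match on brand, price, discount, avg_rating and total_ratings are treated as the same product; the first occurrence is kept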
df_duplicate_products = df[df.duplicated(subset=['brand', 'price', 'discount', 'avg_rating', 'total_ratings'])]
df = df.drop_duplicates(subset=['brand', 'price', 'discount', 'avg_rating', 'total_ratings'])
#unavailable products
df_unavailable_products = pd.DataFrame(unavailable_products, columns=['link'])


#printing the stats
print("Total product pages scrapped: ", len(all_product_links))
print("Final Total Products: ", len(df))
print("Total Unavailable Products : ", len(df_unavailable_products))
print("Total Duplicate Products: ", len(df_duplicate_products))

# 💾 10. Save to CSV
#saving all the files
df.to_csv('flipkart_product_data.csv', index = False)
df_unavailable_products.to_csv('unavailable_products.csv', index = False)
df_duplicate_products.to_csv('duplicate_products.csv', index = False)

# 🛑 11. Close Driver & Print Stats
driver.quit()                                                                                                       # Quits the browser and ends the WebDriver session.
session_end_time = datetime.now().time()
print(f"Session End Time: {session_end_time} ---------------------------> ")

Session Start Time: 15:57:55.357811 ---------------------------> 
Collecting Individual Product Detail Information
URL 1 completed *******
URL 2 completed *******
URL 3 completed *******
URL 4 completed *******
URL 5 completed *******
URL 6 completed *******
URL 7 completed *******
URL 8 completed *******
URL 9 completed *******
URL 10 completed *******
Total product pages scrapped:  10
Final Total Products:  9
Total Unavailable Products :  0
Total Duplicate Products:  1
Session End Time: 15:58:13.192908 ---------------------------> 
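
The output above reports one duplicate among the ten pages. To illustrate how `duplicated(subset=...)` and `drop_duplicates(subset=...)` split the rows, the stdlib sketch below keeps the first row seen for each key and flags later rows with the same key. The helper `split_duplicates` and the rows are made up for illustration; they are not scraped data.

```python
def split_duplicates(rows, key_fields):
    """Split rows into (unique, duplicates), keeping the first row seen per key.

    Hypothetical stdlib helper that mirrors the default keep='first' behaviour
    of pandas' duplicated()/drop_duplicates() for a column subset.
    """
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            unique.append(row)
    return unique, duplicates


# Made-up rows: the first two differ only in link and title.
rows = [
    {"product_link": "link-a", "title": "Runner", "brand": "X", "price": "999",
     "discount": 0.5, "avg_rating": "4.1", "total_ratings": 120},
    {"product_link": "link-b", "title": "Runner Pro", "brand": "X", "price": "999",
     "discount": 0.5, "avg_rating": "4.1", "total_ratings": 120},
    {"product_link": "link-c", "title": "Walker", "brand": "Y", "price": "799",
     "discount": 0.3, "avg_rating": "3.9", "total_ratings": 45},
]
key_fields = ["brand", "price", "discount", "avg_rating", "total_ratings"]
unique, duplicates = split_duplicates(rows, key_fields)
print(len(unique), len(duplicates))
```

Because `product_link` and `title` are not part of the key, two listings with different URLs still count as one product when the other five fields match.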


| Task       | Description                                 |
| ---------- | ------------------------------------------- |
| 🏁 Start   | Load links & start session                  |
| 🔄 Loop    | Visit each product page                     |
| 📌 Extract | Brand, Title, Price, Discount, Ratings      |
| ❌ Handle  | Unavailable or failed pages                 |
| 📦 Store   | Product data in DataFrame                   |
| ✅ Clean   | Remove duplicates                           |
| 💾 Save    | Output to `flipkart_product_data.csv`, etc. |
| 🛑 Finish  | End session & close browser                 |
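
The extract step boils down to a few text-cleaning rules applied to the scraped strings. The sketch below collects them into small functions so they can be checked without a browser. The function names and sample strings are assumptions for illustration, and the text Flipkart actually renders may differ. Unlike the cell above, which stores the price as a string of digits, `parse_price` returns an integer.

```python
import re


def clean_title(raw):
    """Drop any parenthesised part, such as colour information."""
    return re.sub(r'\s*\([^)]*\)', '', raw)


def parse_price(raw):
    """Join all digit runs, so a currency symbol and separators are ignored."""
    digits = ''.join(re.findall(r'\d+', raw))
    return int(digits) if digits else None


def parse_discount(raw):
    """Turn a percentage label into a fraction between 0 and 1."""
    digits = ''.join(re.findall(r'\d+', raw))
    return int(digits) / 100 if digits else None


def parse_total_ratings(raw):
    """Take the first space-separated token and strip thousands separators."""
    return int(raw.split(' ')[0].replace(',', ''))


# Sample strings are invented for this sketch.
print(clean_title("Running Shoes For Women (Pink, 6)"))
print(parse_price("₹1,299"))
print(parse_discount("55% off"))
print(parse_total_ratings("1,234 Ratings & 56 Reviews"))
```

Keeping the rules in plain functions makes it easy to adjust one of them when a page layout changes, without rerunning the whole scraping loop.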
