### Flipkart Web Scraping

<span style = 'font-size:0.8em;'>
    
#### Imports
- `import os`: This statement imports the Python `os` module, which provides a portable way to interact with the operating system. It can be used for tasks such as file operations, environment variables, and directory manipulation.
- `import time`: This imports the Python `time` module, which provides various time-related functions. It can be used for tasks such as measuring time intervals, delaying execution, or generating timestamps.
- `import requests`: This imports the `requests` library, which is a popular HTTP library for making HTTP requests in Python. It simplifies the process of sending HTTP requests and handling responses.
- `from bs4 import BeautifulSoup as bs`: This imports the `BeautifulSoup` class from the `bs4` module. `BeautifulSoup` is a Python library for parsing HTML and XML documents. It provides a simple interface for extracting data from web pages.
- `from selenium import webdriver`: This imports the `webdriver` module from the `selenium` package. Selenium is a powerful tool for automating web browsers. The `webdriver` module allows you to control web browsers programmatically.
- `from selenium.webdriver.common.by import By`: This imports the `By` class from the `selenium.webdriver.common.by` module. `By` defines the locator strategies (for example `By.ID`, `By.CLASS_NAME`, `By.TAG_NAME`) that are passed to methods such as `find_element` to locate elements on a web page. It is not used in the cells shown here.

This code snippet sets up the imports needed for web scraping and browser automation in Python: `requests` for making HTTP requests, `BeautifulSoup` for parsing HTML, `selenium` for driving a web browser, and the standard-library modules `os` and `time` for operating-system access and timing.
</span>

In [1]:
import os
import time
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By  # Locator strategies for find_element (not used below)

<span style = 'font-size:0.8em;'>
    
#### Selenium WebDriver Initialization
- `from selenium import webdriver`: This imports the `webdriver` module from the `selenium` package, which provides drivers for controlling different web browsers. The import is redundant here because the first cell already performs it.
- `driver = webdriver.Chrome()`: This creates a new Chrome WebDriver instance and launches a new Chrome browser window. Selenium 4.6 and later locate or download a matching ChromeDriver automatically through Selenium Manager; with older releases, ChromeDriver must be installed and available on the `PATH`.

#### Web Page Interaction
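- `driver.get(flipkart_url)`: This navigates the Chrome browser to the specified Flipkart URL (`flipkart_url`), opening the search results page. The query `laptop` is hard-coded in the URL; `searchString` is only used later to label the scraped reviews.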
- `page_text = driver.page_source`: This retrieves the HTML source code of the currently loaded web page and stores it in the `page_text` variable. This allows you to parse and extract data from the page source using BeautifulSoup or any other HTML parsing library.

The provided code initializes a Chrome WebDriver, opens the Flipkart website, and retrieves the HTML source code of the page for further processing.
</span>

In [2]:
from selenium import webdriver

# Define the path to the Chrome WebDriver executable
# DRIVER_PATH = r"chromedriver.exe"

# Initialize the Chrome WebDriver
driver = webdriver.Chrome()
searchString = "Laptop"
flipkart_url = "https://www.flipkart.com/search?q=laptop&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off"

# Open the Flipkart URL
driver.get(flipkart_url)

page_text = driver.page_source
page_text

'<html lang="en" class=" fonts-loaded"><head><script type="text/javascript" async="async" src="https://flipkart.d1.sc.omtrdc.net/id?d_visid_ver=1.5.4&amp;callback=s_c_il%5B0%5D._setAnalyticsFields&amp;mcorgid=17EB401053DAF4840A490D4C%40AdobeOrg&amp;mid=24700370080013094580830784426186765672"></script><script type="text/javascript" async="async" src="https://dpm.demdex.net/id?d_visid_ver=1.5.4&amp;d_rtbd=json&amp;d_ver=2&amp;d_orgid=17EB401053DAF4840A490D4C%40AdobeOrg&amp;d_nsid=0&amp;d_cb=s_c_il%5B0%5D._setMarketingCloudFields"></script><link href="https://rukminim2.flixcart.com" rel="preconnect"><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.c48a12.css"><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.47c551.css"><meta http-equiv="Content-type" content="text/html; charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta property="fb:page_id" conten
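The search URL above hard-codes the query even though `searchString` is defined next to it. As a minimal sketch, the URL could instead be built from the search term with the standard library. The helper name `build_search_url` is hypothetical, and lowercasing the term is an assumption made only to reproduce the URL used in this notebook.

```python
from urllib.parse import urlencode

# Hypothetical helper (not part of the original notebook): builds the
# Flipkart search URL from a search term instead of hard-coding it.
def build_search_url(search_string: str) -> str:
    base = "https://www.flipkart.com/search"
    params = {
        "q": search_string.lower(),  # assumption: lowercase to match the original URL
        "otracker": "search",
        "otracker1": "search",
        "marketplace": "FLIPKART",
        "as-show": "on",
        "as": "off",
    }
    # urlencode escapes spaces and special characters in the values
    return base + "?" + urlencode(params)

flipkart_url = build_search_url("Laptop")
print(flipkart_url)
```

Because `urlencode` escapes the values, a multi-word term such as `"gaming laptop"` would also produce a valid query string.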

<span style = 'font-size:0.8em;'>

#### BeautifulSoup Initialization
This cell creates a BeautifulSoup object named `flipkart_html` by parsing the HTML source (`page_text`) obtained from the Flipkart page. The `'html.parser'` argument selects the HTML parser from Python's standard library, so no extra parser package is needed. The resulting object can be navigated and searched with BeautifulSoup's methods.

#### BeautifulSoup FindAll
The `findAll` method (the older name for `find_all`) searches the `flipkart_html` object for all `<div>` elements whose class attribute is `"cPHDOP col-12-12"`.
- The method returns a list of all matching elements found in the HTML document.
- On the search results page, these `<div>` elements are the top-level containers, most of which hold a product listing. Class names such as `cPHDOP` are generated by the site and may change over time.

- `del bigboxes[0:2]` then removes the first two elements of the `bigboxes` list, because they contain header elements rather than products.
</span>

In [3]:
flipkart_html = bs(page_text, 'html.parser')

In [4]:
bigboxes = flipkart_html.findAll("div", {"class": "cPHDOP col-12-12"})

In [5]:
len(bigboxes)

30

In [6]:
del bigboxes[0:2]
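The class lookup and the positional `del` can be illustrated on a small, made-up page. This is only a sketch: the markup below is invented for the example and is not real Flipkart HTML. It shows that a multi-word class string is compared with the exact attribute value, that a CSS selector matches the classes in any order, and that containers can be filtered by content instead of by position.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for a search results page (made-up markup for illustration).
sample_html = """
<div class="cPHDOP col-12-12"><span>Header</span></div>
<div class="cPHDOP col-12-12"><div><a href="/item-1/p/itm1">Item 1</a></div></div>
<div class="col-12-12 cPHDOP"><div><a href="/item-2/p/itm2">Item 2</a></div></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A multi-word class string is compared with the exact attribute value,
# so the third <div> (same classes, different order) is not matched.
exact = soup.find_all("div", {"class": "cPHDOP col-12-12"})

# A CSS selector matches elements carrying both classes, in any order.
by_selector = soup.select("div.cPHDOP.col-12-12")

# Keep only containers that actually hold a link, instead of deleting by position.
product_boxes = [box for box in by_selector if box.find("a", href=True)]

print(len(exact), len(by_selector), len(product_boxes))
```

Filtering on the presence of a link avoids relying on the assumption that exactly two header containers come first.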

<span style = 'font-size:0.8em;'>

#### Constructing Product Link
- This line constructs the product link by concatenating the base Flipkart URL ("https://www.flipkart.com") with the href attribute value of the anchor (`<a>`) tag within the first element (`bigboxes[0]`) of the `bigboxes` list.
- `bigboxes[0].div.div.a['href']` accesses the href attribute of the anchor tag within the nested `<div>` structure of the first element in `bigboxes`.
- The constructed `productlink` variable holds the URL of the first product listed on the Flipkart webpage.
</span>

In [7]:
productlink = "https://www.flipkart.com" + bigboxes[0].div.div.a['href']
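String concatenation works here because the `href` values are site-relative paths. A slightly more defensive sketch uses `urllib.parse.urljoin`, which also leaves an already absolute link unchanged. The helper name `absolute_link` and the shortened hrefs are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.flipkart.com"

def absolute_link(href: str) -> str:
    """Resolve an href taken from the page against the site root."""
    return urljoin(BASE_URL, href)

# Shortened, made-up hrefs for illustration.
relative = absolute_link("/some-laptop/p/itm123?pid=ABC")
already_absolute = absolute_link("https://www.flipkart.com/other/p/itm456")
print(relative)
print(already_absolute)
```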

In [8]:
box = bigboxes[0]
print(box)

<div class="cPHDOP col-12-12"><div class="_75nlfW"><div data-id="COMGZNMS5AHPF2GP" style="width: 100%;"><div class="tUxRFH" data-tkid="f09b0ebb-98b3-40d2-9496-e790a6a4a6f4.COMGZNMS5AHPF2GP.SEARCH"><a class="CGtC98" href="/lenovo-v15-intel-celeron-dual-core-4th-gen-8-gb-256-gb-ssd-windows-11-home-82qya00min-laptop/p/itmdad173c69b235?pid=COMGZNMS5AHPF2GP&amp;lid=LSTCOMGZNMS5AHPF2GPPMPOX7&amp;marketplace=FLIPKART&amp;q=laptop&amp;store=6bo%2Fb5g&amp;spotlightTagId=BestsellerId_6bo%2Fb5g&amp;srno=s_1_1&amp;otracker=search&amp;otracker1=search&amp;fm=Search&amp;iid=f09b0ebb-98b3-40d2-9496-e790a6a4a6f4.COMGZNMS5AHPF2GP.SEARCH&amp;ppt=sp&amp;ppn=sp&amp;ssid=ppqfjfo6ds0000001714712842354&amp;qH=312f91285e048e09" rel="noopener noreferrer" target="_blank"><div><div class="UzRoYO CmflSf" style="background: rgb(0, 160, 152);">Bestseller</div></div><div class="Otbq5D"><div class="yPq5Io"><div><div class="_4WELSP" style="height: 200px; width: 200px;"><img alt="Lenovo V15 Intel Celeron Dual Core 4th 

<span style = 'font-size:0.8em;'>

#### Iterating Over `bigboxes` and Extracting Product Links
- The code iterates over each element (`i`) in the `bigboxes` list using a for loop.
- Within each iteration, it attempts to extract the product link by accessing the href attribute of the anchor (`<a>`) tag within the nested `<div>` structure of the current element (`i`) in `bigboxes`.
- If the structure or attribute required for extracting the product link is not found (due to missing elements or attributes), the code catches the `AttributeError` or `KeyError` exception and continues to the next iteration using the `continue` statement.
- By using a try-except block, the code handles potential errors gracefully, ensuring that it doesn't crash if an element doesn't have the expected structure or attribute.
- If successful, the product link is constructed by concatenating the base Flipkart URL ("https://www.flipkart.com") with the extracted href attribute value, and it is printed to the console using the `print` function.
</span>


In [81]:
for i in bigboxes:
    try:
        product_link = "https://www.flipkart.com" + i.div.div.div.a['href']
        print(product_link)
    except (AttributeError, KeyError):
        # Skip this element if it doesn't have the expected structure or attribute
        continue

https://www.flipkart.com/lenovo-v15-intel-celeron-dual-core-4th-gen-8-gb-256-gb-ssd-windows-11-home-82qya00min-laptop/p/itmdad173c69b235?pid=COMGZNMS5AHPF2GP&lid=LSTCOMGZNMS5AHPF2GPPMPOX7&marketplace=FLIPKART&q=laptop&store=6bo%2Fb5g&spotlightTagId=BestsellerId_6bo%2Fb5g&srno=s_1_1&otracker=search&otracker1=search&fm=Search&iid=f09b0ebb-98b3-40d2-9496-e790a6a4a6f4.COMGZNMS5AHPF2GP.SEARCH&ppt=sp&ppn=sp&ssid=ppqfjfo6ds0000001714712842354&qH=312f91285e048e09
https://www.flipkart.com/apple-2020-macbook-air-m1-8-gb-256-gb-ssd-mac-os-big-sur-mgn63hn-a/p/itm3c872f9e67bc6?pid=COMFXEKMGNHZYFH9&lid=LSTCOMFXEKMGNHZYFH9P56X45&marketplace=FLIPKART&q=laptop&store=6bo%2Fb5g&spotlightTagId=BestsellerId_6bo%2Fb5g&srno=s_1_2&otracker=search&otracker1=search&fm=Search&iid=f09b0ebb-98b3-40d2-9496-e790a6a4a6f4.COMFXEKMGNHZYFH9.SEARCH&ppt=sp&ppn=sp&ssid=ppqfjfo6ds0000001714712842354&qH=312f91285e048e09
https://www.flipkart.com/lenovo-v15-amd-ryzen-3-quad-core-7320u-8-gb-512-gb-ssd-windows-11-home-g4-amn-1-t

<span style = 'font-size:0.8em;'>

#### Extracting Product Information


This cell:
- Constructs the URL of the first product's page from `box` (`bigboxes[0]`).
- Navigates the WebDriver-controlled browser to that URL.
- Retrieves the HTML source code of the product page.
- Closes the browser with `driver.quit()`, since no further pages are loaded.
- Initializes a BeautifulSoup object (`prod_html`) for parsing the HTML content of the product page.

These actions make the product page available for parsing, so that details such as the price and customer reviews can be extracted in the following cells.
</span>

In [10]:
productLink = "https://www.flipkart.com" + box.div.div.div.a['href']
# Open the product page
driver.get(productLink)
prodRes = driver.page_source
# Close the browser; no further pages are loaded after this point
driver.quit()
prod_html = bs(prodRes, "html.parser")
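If `driver.get` raises an exception, the cell above never reaches `driver.quit()` and the browser window stays open. One possible way to guard against that is a `try`/`finally` wrapper, sketched below. The function `fetch_page_source` and the `FakeDriver` class are hypothetical. The fake stands in for a real browser so the example runs without launching Chrome, and only the `get`, `page_source`, and `quit` members of a Selenium driver are assumed.

```python
def fetch_page_source(driver, url):
    """Load `url` and return its HTML, always closing the browser afterwards."""
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Runs whether or not driver.get() succeeded
        driver.quit()

# Stand-in for a real browser so the sketch runs without launching Chrome.
class FakeDriver:
    def __init__(self):
        self.page_source = ""
        self.closed = False

    def get(self, url):
        self.page_source = f"<html><body>{url}</body></html>"

    def quit(self):
        self.closed = True

fake = FakeDriver()
html = fetch_page_source(fake, "https://www.flipkart.com/example")
print(fake.closed, html)
```

With Selenium, the same function could be called as `fetch_page_source(driver, productLink)`. Note that it closes the browser after a single page, which matches this notebook but would not suit a loop over many products.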

In [11]:
print(productLink)

https://www.flipkart.com/lenovo-v15-intel-celeron-dual-core-4th-gen-8-gb-256-gb-ssd-windows-11-home-82qya00min-laptop/p/itmdad173c69b235?pid=COMGZNMS5AHPF2GP&lid=LSTCOMGZNMS5AHPF2GPPMPOX7&marketplace=FLIPKART&q=laptop&store=6bo%2Fb5g&spotlightTagId=BestsellerId_6bo%2Fb5g&srno=s_1_1&otracker=search&otracker1=search&fm=Search&iid=f09b0ebb-98b3-40d2-9496-e790a6a4a6f4.COMGZNMS5AHPF2GP.SEARCH&ppt=sp&ppn=sp&ssid=ppqfjfo6ds0000001714712842354&qH=312f91285e048e09


In [12]:
print(prod_html)

<html class="fonts-loaded" lang="en"><head><link href="https://rukminim2.flixcart.com" rel="preconnect"/><link href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.c48a12.css" rel="stylesheet"/><link href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.47c551.css" rel="stylesheet"/><meta content="text/html; charset=utf-8" http-equiv="Content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="102988293558" property="fb:page_id"/><meta content="658873552,624500995,100000233612389" property="fb:admins"/><link href="https://static-assets-web.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico" rel="shortcut icon"/><link href="/osdd.xml?v=2" rel="search" type="application/opensearchdescription+xml"/><meta content="website" property="og:type"/><meta content="Flipkart.com" name="og_site_name" property="og:site_name"/><link href="/apple-touch-icon-57x57.png" rel="apple-touch-icon" sizes="57x57

In [13]:
commentboxes = prod_html.find_all('div', {'class': "RcXBOT"})

In [65]:
pricebox = prod_html.find_all('div', {'class': "Nx9bqj CxhGGd"})

In [66]:
pricebox = pricebox[0]

In [73]:
pricebox = str(pricebox)
pricebox

'<div class="Nx9bqj CxhGGd">₹19,990</div>'
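The price is stored as display text such as `₹19,990`. For numeric analysis it may be convenient to convert it to an integer. The sketch below assumes prices are whole rupees with no decimal part; the helper name `parse_price` is hypothetical.

```python
import re

def parse_price(price_text: str) -> int:
    """Convert a displayed price such as '₹19,990' to an integer number of rupees."""
    # Drop the currency symbol, thousands separators, and any other non-digits
    digits = re.sub(r"[^\d]", "", price_text)
    if not digits:
        raise ValueError(f"no digits found in {price_text!r}")
    return int(digits)

price_value = parse_price("₹19,990")
print(price_value)
```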

In [14]:
len(commentboxes)

11

In [19]:
commentboxes[1].div.div.find_all("p",{"class":"_2NsDsF AwS1CA"})[0].text

'Arun Kushwah'

In [80]:
for i in commentboxes:
    try:
        if i.div.div is not None:
            names = i.div.div.find_all("p", {"class": "_2NsDsF AwS1CA"})
            if names:
                print(names[0].text)
    except AttributeError:
        # Skip this element if it doesn't have the expected structure
        continue

Subramanya Urs
Arun Kushwah
manish banerjee
Shubham  Singh
Sumit  Sahu
Md Naseem Ansari
P Lallawm Kima
Dilesh Hanwat
Kailash Singh
Rakesh  Kumar 


In [89]:
print(commentboxes[0])

<div class="RcXBOT"><div class="col"><div class="col EPCmJX"><div class="row"><div class="XQDdHH Ga3i8K">5</div><p class="z9E0IG">Just wow!</p></div><div class="row"><div class="ZmyHeo"><div><div class="">Nice laptop 💻</div><span class="wTYmpv"><span>READ MORE</span></span></div></div></div><div class="xmAgz5 pVVA7t"><div class="Be4x5X d517go" style='background-image: url("https://rukminim1.flixcart.com/blobio/124/124/imr/blobio-imr_36134e11bd0147698f898e90ad1fb0fe.jpg?q=90"), url("data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIiIGhlaWdodD0iMTgiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyI+PGcgZmlsbD0iIzlEOUQ5RCIgZmlsbC1ydWxlPSJub256ZXJvIj48cGF0aCBkPSJNMjAgMEgyQzEgMCAwIDEgMCAyd

<span style = 'font-size:0.8em;'>

#### Accumulating Review Information into the `reviews` List

- Initializes an empty list named `reviews` to store extracted review information.
- Iterates over each `commentbox` in `commentboxes`.
- Attempts to collect the following details for each review:
  - Price: Taken from the `pricebox` string extracted earlier rather than from the `commentbox`, so it is the same for every review.
  - Customer Name: Extracts the name of the customer leaving the review.
  - Rating: Extracts the rating given by the customer.
  - Comment Head: Extracts the heading of the comment.
  - Customer Comment: Extracts the actual comment left by the customer.
- Wraps each extraction in its own `try`/`except` block; if an extraction raises an exception, the field is set to a default message such as `'Name not found: ...'` that includes the error text.
- Constructs a dictionary (`mydict`) containing the extracted information, with the search term (`searchString`) stored as the product.
- Appends the dictionary to the `reviews` list.

This process repeats for each `commentbox`, accumulating review information into the `reviews` list. The resulting list contains dictionaries, with each dictionary representing the details of a single review.
</span>

In [74]:
reviews = []

for commentbox in commentboxes:
    try:
        soup = bs(pricebox, 'html.parser')
        price = soup.div.text
    except Exception as e:    
        price = 'Price not found: ' + str(e)

    try:
        name = commentbox.div.div.find_all('p', {"class":"_2NsDsF AwS1CA"})[0].text
    except Exception as e:
        name = 'Name not found: ' + str(e)

    try:
        rating = commentbox.div.div.div.div.text
    except Exception as e:
        rating = 'Rating not found: ' + str(e)

    try:
        commentHead = commentbox.div.div.div.p.text
    except Exception as e:
        commentHead = 'Comment Head not found: ' + str(e)

    try:
        comtag = commentbox.div.div.find_all('div', {'class': ''})
        custComment = comtag[0].div.text
    except Exception as e:
        custComment = 'Comment not found: ' + str(e)

    mydict = {"Price": price, "Product": searchString, "Customer Name": name, "Rating": rating, "Heading": commentHead, "Comment": custComment}
    reviews.append(mydict)

In [75]:
reviews

[{'Price': '₹19,990',
  'Product': 'Laptop',
  'Customer Name': 'Subramanya Urs',
  'Rating': '5',
  'Heading': 'Just wow!',
  'Comment': 'Nice laptop 💻'},
 {'Price': '₹19,990',
  'Product': 'Laptop',
  'Customer Name': 'Arun Kushwah',
  'Rating': '4',
  'Heading': 'Good quality product',
  'Comment': 'Very good'},
 {'Price': '₹19,990',
  'Product': 'Laptop',
  'Customer Name': 'manish banerjee',
  'Rating': '5',
  'Heading': 'Wonderful',
  'Comment': 'Value For Money 💰💰'},
 {'Price': '₹19,990',
  'Product': 'Laptop',
  'Customer Name': 'Shubham  Singh',
  'Rating': '5',
  'Heading': 'Must buy!',
  'Comment': 'Battery up to 3 hours ok but leptop slow chalta hai looking very nice'},
 {'Price': '₹19,990',
  'Product': 'Laptop',
  'Customer Name': 'Sumit  Sahu',
  'Rating': '5',
  'Heading': 'Terrific',
  'Comment': 'Best in  price'},
 {'Price': '₹19,990',
  'Product': 'Laptop',
  'Customer Name': 'Md Naseem Ansari',
  'Rating': '4',
  'Heading': 'Worth the money',
  'Comment': 'Super pro
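The loop above reaches each field through chains such as `commentbox.div.div.div.div`, which break if the nesting changes. As an alternative sketch, each field can be looked up by a single class name and missing fields returned as `None`. The sample markup is simplified from the printed `commentboxes[0]`; the position of the name element inside the review block is an assumption, and the helper names are hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified review markup modelled on the printed `commentboxes[0]`;
# where the name <p> sits inside the block is an assumption.
sample = """
<div class="RcXBOT"><div class="col"><div class="col EPCmJX">
  <div class="row"><div class="XQDdHH Ga3i8K">5</div><p class="z9E0IG">Just wow!</p></div>
  <div class="row"><div class="ZmyHeo"><div><div class="">Nice laptop</div><span class="wTYmpv"><span>READ MORE</span></span></div></div></div>
  <div class="row"><p class="_2NsDsF AwS1CA">Subramanya Urs</p></div>
</div></div></div>
"""

def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching element, or None."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node else None

def comment_text(box):
    """Return the comment body without the trailing 'READ MORE' label."""
    wrapper = box.find("div", class_="ZmyHeo")
    inner = wrapper.div.div if wrapper and wrapper.div else None
    return inner.get_text(strip=True) if inner else None

def parse_review(box):
    return {
        "Customer Name": text_or_none(box, "p", "_2NsDsF"),
        "Rating": text_or_none(box, "div", "XQDdHH"),
        "Heading": text_or_none(box, "p", "z9E0IG"),
        "Comment": comment_text(box),
    }

box = BeautifulSoup(sample, "html.parser").find("div", class_="RcXBOT")
review = parse_review(box)
print(review)
```

Returning `None` for a missing field keeps error text out of the data, at the cost of losing the exception message that the original loop records.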

<span style = 'font-size:0.8em;'>
This cell writes the review information stored in the `reviews` list to a CSV file named `reviews.csv`. It opens the file in write mode with UTF-8 encoding and uses `csv.DictWriter` to write a header row followed by one row per review. The resulting tabular file can be opened in spreadsheet software or any other tool that supports CSV for further analysis or sharing.
</span>

In [88]:
import csv

with open('reviews.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=reviews[0].keys())
    writer.writeheader()
    for review in reviews:
        writer.writerow(review)
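The cell above takes the column names from `reviews[0]`, which raises an `IndexError` when no reviews were scraped. A small sketch with a fixed column list avoids that. The function name `reviews_to_csv` is hypothetical, and an in-memory buffer is used here so the example does not touch the file system.

```python
import csv
import io

FIELDNAMES = ["Price", "Product", "Customer Name", "Rating", "Heading", "Comment"]

def reviews_to_csv(reviews, out):
    """Write review dictionaries to a file-like object; an empty list yields only the header."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(reviews)

sample_reviews = [
    {"Price": "₹19,990", "Product": "Laptop", "Customer Name": "Arun Kushwah",
     "Rating": "4", "Heading": "Good quality product", "Comment": "Very good"},
]
buffer = io.StringIO()
reviews_to_csv(sample_reviews, buffer)
print(buffer.getvalue())
```

To write a real file, the same function can be passed the handle returned by `open('reviews.csv', 'w', newline='', encoding='utf-8')`.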
