"What are some ways to extract information from a website that doesn't have an export option?"

The reason behind this question was my need to keep a watchful eye on product prices listed on Flipkart and Amazon, and be notified when the prices hit their lowest point.

Since others may also need to monitor product prices on websites, I decided to share a Python program that emulates web browsing behavior, makes it easy to store and retrieve the data, and can even send an email alert when a price reaches its lowest value since the first day of monitoring.

First, let's import the Python modules and packages that will be used in the script:

1. bs4 (BeautifulSoup): a Python library used for web scraping to parse HTML and XML documents.
2. requests: a Python library used to send HTTP requests to websites and retrieve the response data.
3. smtplib: a Python library used for sending email messages using Simple Mail Transfer Protocol (SMTP).
4. csv: a Python library used for reading and writing CSV (Comma Separated Values) files.
5. datetime: a Python library used to work with dates and times.
6. os.path: a Python library used for working with file paths.
7. MIMEText: a class from the standard-library email.mime.text module used to create text email messages in MIME (Multipurpose Internet Mail Extensions) format.

In [None]:
#Importing libraries

from bs4 import BeautifulSoup
import requests
import smtplib
import csv 
import datetime
import os.path
from email.mime.text import MIMEText

We will store the URLs for the product pages on Flipkart and Amazon in variables for easy reference in our code.

In [None]:
Flipkart_URL = 'https://www.flipkart.com/logitech-gamepad-f310/p/itmdyfyfrzch29cf?pid=ACCDYFXZ6QEUZEFZ&lid=LSTACCDYFXZ6QEUZEFZIISO8W&marketplace=FLIPKART&q=controller&store=4rr%2Fkm5%2Fr39%2Fa7g%2Fy7a&spotlightTagId=TrendingId_4rr%2Fkm5%2Fr39%2Fa7g%2Fy7a&srno=s_1_5&otracker=AS_QueryStore_OrganicAutoSuggest_1_9_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_9_na_na_na&fm=search-autosuggest&iid=01b3d6f3-06c0-449b-a2af-9d3fd8e91cbb.ACCDYFXZ6QEUZEFZ.SEARCH&ppt=sp&ppn=sp&ssid=mjbnhasl2o0000001678195519116&qH=594c103f2c6e04c3'
Amazon_URL = 'https://www.amazon.in/Logitech-G-940-000112-F310-Gamepad/dp/B0757QFBRL/ref=sr_1_4?crid=YSVEB0DD6CAB&keywords=logitech+controller+for+pc&qid=1678195661&s=computers&sprefix=logitech+controller+for+pc%2Ccomputers%2C182&sr=1-4'


For this program, I'm using the Requests library ('Requests: HTTP for Humans').
A GET request to a URL sends headers along with it.
Since we are using Python, the default 'User-Agent' header will be something like 'python-requests/2.19.1'.
This tells the website that someone is accessing it with a Python script.
Because some websites block such requests, we will define headers that resemble those sent by a regular browser.
To see which headers your browser sends, open 'https://httpbin.org/get' in it.
Copy the "User-Agent" value from that page and substitute it below.

In [None]:
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

Read more about HTTP and headers at 'https://www.seobility.net/en/wiki/HTTP_headers'.
Read more about the requests library at 'https://requests.readthedocs.io/en/latest/'.
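
To see the effect of custom headers without contacting any website, the sketch below compares the default User-Agent that requests would send with a custom one. It only prepares a request locally; the URL is a placeholder and the shortened headers dictionary is an assumption made for illustration.

```python
import requests

# Default User-Agent used by requests when no headers are supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # something like 'python-requests/2.x.y'

# Shortened, illustrative version of the headers defined above.
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Encoding": "gzip, deflate",
}

# Prepare (but do not send) a GET request so its headers can be inspected.
# 'https://example.com/' is a placeholder URL.
prepared = requests.Request(
    "GET", "https://example.com/", headers=custom_headers
).prepare()
print(prepared.headers["User-Agent"])
```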

To retrieve the content of the website, we'll use the get() method from the requests library, which will allow us to establish a connection to the website and capture its response.

In [None]:
Flipkart_page = requests.get(Flipkart_URL, headers = headers)
Amazon_page = requests.get(Amazon_URL, headers = headers)
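
A request can fail or be refused (for example with a 4xx or 5xx status), so it can help to check the response before parsing it. Below is a minimal sketch, assuming a hypothetical helper fetch_page() that is not part of the original script; the 10-second timeout is likewise an assumption.

```python
import requests


def fetch_page(url, headers, timeout=10, get=requests.get):
    """Fetch a page and raise an error on an HTTP failure status.

    fetch_page is a hypothetical helper. The 'get' argument is injectable
    so the function can be exercised without network access.
    """
    response = get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response
```

With this helper, the calls above could be written as Flipkart_page = fetch_page(Flipkart_URL, headers), so that a blocked request stops the script early instead of producing confusing parsing errors later.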

The page objects now contain all the content of the web pages we retrieved. To parse the HTML in them, we will use the Beautiful Soup (BS4) library. BS4 is a Python library for parsing HTML and XML documents. It provides simple methods for navigating, searching, and modifying the parse tree.

In [None]:
Flipkart_soup = BeautifulSoup(Flipkart_page.content, "html.parser")
Amazon_soup = BeautifulSoup(Amazon_page.content, "html.parser")

BS4 provides a set of useful functions that simplify the process of traversing HTML tags. You can find detailed documentation of all functions on the BS4 website at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Using functions like find(), find_all(), and get_text(), we will grab the product title, availability, and prices from the website.

In [None]:
#Grabbing product title

Flipkart_Title = Flipkart_soup.find_all('span',attrs={"class":"B_NuCI"})[0].get_text()
Flipkart_Title = Flipkart_Title.strip()

Amazon_Title = Amazon_soup.find(id='productTitle').get_text()
Amazon_Title = Amazon_Title.strip()

In the code above, for the Flipkart page, the code searches for all 'span' elements whose 'class' attribute is 'B_NuCI' using the find_all() function and takes the first match.

Similarly, for the Amazon page, the code locates the element whose 'id' attribute is 'productTitle' using the find() function.

The get_text() function is then used to extract the text content of the element.

Finally, the .strip() method removes any leading and trailing whitespace from the text.
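
The difference between find_all() and find() can be tried on a small piece of HTML without any network access. The snippet below is a sketch on made-up markup; the class name 'title' and the product names are assumptions, not values from Flipkart or Amazon.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; real product pages are far larger.
sample_html = """
<html><body>
  <span class="title">  Sample Gamepad  </span>
  <span class="title">Another Product</span>
  <h1 id="productTitle">
      Sample Gamepad (Black)
  </h1>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns a list of every match; [0] picks the first one.
first_title = soup.find_all("span", attrs={"class": "title"})[0].get_text().strip()

# find() returns only the first match.
id_title = soup.find(id="productTitle").get_text().strip()

# When nothing matches, find() returns None, so calling get_text()
# on the result would raise an AttributeError.
missing = soup.find(id="does-not-exist")

print(first_title)
print(id_title)
print(missing)
```

Because a changed page layout makes find() return None and find_all() return an empty list, checking for those cases before calling get_text() makes a scraper easier to debug.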

In [None]:
#Pulling Availability of the Product from website

Flipkart_Availability = Flipkart_soup.find_all('div',attrs={"class":"_16FRp0"})

if (len(Flipkart_Availability)==0):
    Flipkart_Availability = 'In stock'
else:
    Flipkart_Availability = Flipkart_Availability[0].get_text()
Flipkart_Availability = Flipkart_Availability.strip()

Amazon_Availability = Amazon_soup.find(id='availability').get_text()
Amazon_Availability = Amazon_Availability.strip()
Amazon_Availability_Split = Amazon_Availability.lower().split()

In the code above, for the Flipkart page, the code searches for all 'div' elements whose 'class' attribute is '_16FRp0' using the find_all() function.

If the resulting list is empty, i.e. the page contains no such 'div', the product is taken to be in stock, so the 'Flipkart_Availability' variable is assigned the string 'In stock'.

If the list is not empty, the get_text() function extracts the text of the first element in the list, and that text is assigned to the 'Flipkart_Availability' variable.

The .strip() method then removes any leading and trailing whitespace from the text.

Similarly, for the Amazon page, the code locates the element whose 'id' attribute is 'availability' using the find() function, extracts its text with get_text(), and strips the surrounding whitespace with .strip().

The text is then converted to lowercase and split into words using the .split() method.

This allows us to check whether the word 'out' is present in the text, which indicates that the product is out of stock. This check is used when the prices are extracted.
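
Splitting the text into words matters because a plain substring test would also match words such as 'without'. The sketch below illustrates this with a hypothetical helper, is_out_of_stock(), and made-up availability strings.

```python
def is_out_of_stock(availability_text):
    """Return True if the word 'out' appears in the availability text."""
    words = availability_text.strip().lower().split()
    return "out" in words


print(is_out_of_stock("  Currently Out of Stock.  "))       # True
print(is_out_of_stock("In stock"))                          # False
# 'without' contains the letters 'out' but is not the word 'out'.
print(is_out_of_stock("Only 2 left, order without delay"))  # False
```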

In [None]:
#Getting Product prices from the website.

if (Flipkart_Availability == 'In stock'):
    Flipkart_Price = Flipkart_soup.find_all('div',attrs={"class":"_30jeq3 _16Jk6d"})[0].get_text()
    Flipkart_Price = Flipkart_Price.strip()
    Flipkart_Price = Flipkart_Price[1:]
    Flipkart_Price = int(Flipkart_Price.replace(',',''))
else:
    Flipkart_Price = 0


if ('out' not in Amazon_Availability_Split):
    Amazon_Price = Amazon_soup.find(id='corePriceDisplay_desktop_feature_div').get_text()
    Amazon_Price = Amazon_Price.strip()
    Amazon_Price = Amazon_Price.split('₹')[1]
    Amazon_Price = int(Amazon_Price.replace(',',''))
else :
    Amazon_Price = 0

In the code above, if the product is in stock on Flipkart (Flipkart_Availability == 'In stock'), the code searches for all 'div' elements whose 'class' attribute is '_30jeq3 _16Jk6d' using the find_all() function.

The price is then extracted from the text of the first element in the list: the leading currency symbol is dropped, the thousands separators (commas) are removed, and the result is converted to an integer. If the product is not in stock, 'Flipkart_Price' is set to 0.

Similarly, if the product is not in stock on Amazon ('out' in Amazon_Availability_Split), the 'Amazon_Price' variable is set to 0.

Otherwise, the code locates the element whose 'id' attribute is 'corePriceDisplay_desktop_feature_div' using the find() function.

The price is then extracted from the text of this element by taking the text that follows the first currency symbol, removing the commas, and converting the result to an integer.
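
The string clean-up can be tested in isolation. Below is a minimal sketch, assuming a hypothetical helper parse_price() and made-up price strings; dropping a fractional part such as '.00' is an extra assumption that the original code does not make.

```python
def parse_price(price_text, currency_symbol="₹"):
    """Convert text such as '₹1,299' to the integer 1299."""
    cleaned = price_text.strip()
    # Keep the text after the first currency symbol, up to the next one if any.
    if currency_symbol in cleaned:
        cleaned = cleaned.split(currency_symbol)[1]
    # Remove thousands separators, then drop any fractional part.
    cleaned = cleaned.replace(",", "").strip()
    return int(float(cleaned))


print(parse_price("₹1,299"))        # 1299
print(parse_price(" ₹12,345.00 "))  # 12345
print(parse_price("₹1,995₹2,995"))  # 1995 (only the first price is used)
```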





Now we will initialize some variables.

1. The file_path variable stores the path of the CSV file that will be used to store the product information. 
2. We will use the os.path.isfile() function to check whether the CSV file already exists. The 'flag' variable is set to True if it does and to False otherwise.
3. The 'filename' variable stores the name of the CSV file.
4. The 'current_datetime' variable stores the current date and time.
5. The 'Amazon_price_list' and 'Flipkart_price_list' variables are empty lists that will be used to store the prices of the product on Amazon and Flipkart websites, respectively.

In [None]:
file_path = r'D:/ProjectWebScraper/DiscountAlert.csv'

#The r prefix creates a raw string literal, in which backslashes are not treated as escape characters; this is handy for Windows paths written with backslashes.

flag = os.path.isfile(file_path)

filename = "DiscountAlert.csv"
current_datetime = datetime.datetime.now()

Amazon_price_list = []
Flipkart_price_list = []
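
An existence check like the one stored in 'flag' is typically used to decide whether a header row still has to be written. The sketch below shows one way this might look; the helper append_price_row() and the column names are assumptions, and a temporary directory is used so that no real file is touched.

```python
import csv
import datetime
import os.path
import tempfile


def append_price_row(file_path, row, header):
    """Append one row to a CSV file, writing the header only for a new file."""
    file_exists = os.path.isfile(file_path)  # same check as 'flag' above
    # newline="" is what the csv module documentation recommends for files.
    with open(file_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if not file_exists:
            writer.writerow(header)
        writer.writerow(row)


# Hypothetical column names for illustration.
header = ["Timestamp", "Amazon_Price", "Flipkart_Price"]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "DiscountAlert.csv")
    now = datetime.datetime.now().isoformat(timespec="seconds")
    append_price_row(demo_path, [now, 1995, 2099], header)
    append_price_row(demo_path, [now, 1899, 2099], header)
    with open(demo_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

print(len(rows))  # 3: one header row plus two data rows
print(rows[0])
```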

Let's also write a function that will send email notifications to our mail IDs when the prices are low.

In [None]:
def send_email(subject, body, sender, recipients, password):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = sender
    msg['To'] = ', '.join(recipients)
    smtp_server = smtplib.SMTP_SSL('smtp.gmail.com', 465)
    smtp_server.login(sender, password)
    smtp_server.sendmail(sender, recipients, msg.as_string())
    smtp_server.quit()

sender = "vaidya.kare@gmail.com"
recipients = ["vishweshhampali@gmail.com"]
password = "yfvjflmehenezdyl"

In the code above, we define a function 'send_email()' that takes five parameters: 'subject', 'body', 'sender', 'recipients', and 'password'.

Inside the function, an email message is created using the 'MIMEText' class from the email package.

Then, a secure connection is established with Gmail's SMTP server using 'smtplib.SMTP_SSL()'. The function then logs into the sender's account using 'smtp_server.login()'. Finally, the email message is sent using 'smtp_server.sendmail()', and the connection is closed using 'smtp_server.quit()'.
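
The message-building part can be inspected without connecting to any mail server. The sketch below separates it into a hypothetical helper, build_message(); the addresses and the message text are placeholders.

```python
from email.mime.text import MIMEText


def build_message(subject, body, sender, recipients):
    """Build the same kind of message that send_email() creates."""
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = sender
    # The 'To' header is a single comma-separated string.
    msg["To"] = ", ".join(recipients)
    return msg


msg = build_message(
    "Price alert",
    "The price dropped to 1899.",
    "sender@example.com",
    ["a@example.com", "b@example.com"],
)

print(msg["To"])
print(msg.get_content_type())
print(msg.as_string())  # the full text that sendmail() would transmit
```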


Next, we will develop the program logic to store the extracted product prices in a CSV file. This will allow us to keep track of the price changes over time. Additionally, we can set up a notification system that compares the current price with the minimum price stored in the CSV file, and sends an email alert if the current price drops to the minimum or lower.
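
The comparison itself can be sketched independently of the CSV handling. The helper should_alert() below is hypothetical; treating a price of 0 as 'out of stock' follows the price-extraction code, while ignoring such entries and returning False for an empty history are assumptions.

```python
def should_alert(current_price, previous_prices):
    """Return True if current_price is at or below the lowest recorded price.

    A price of 0 marks an out-of-stock reading, so it is ignored both as
    the current price and in the history.
    """
    if current_price <= 0:
        return False
    valid_history = [p for p in previous_prices if p > 0]
    if not valid_history:
        return False  # nothing to compare against yet
    return current_price <= min(valid_history)


print(should_alert(1899, [1995, 0, 2099]))  # True: new lowest price
print(should_alert(1995, [1995, 1899]))     # False: 1899 was lower
print(should_alert(0, [1995]))              # False: out of stock
```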