Question 1: What is Web Scraping? Why is it Used? Give three areas where Web Scraping is used to get data.
Answer :
Web scraping is the process of extracting data from websites by using automated software or tools, which can access the website's HTML code, parse it, and extract the desired information. This technique can be used to collect data such as text, images, URLs, and other structured or unstructured data.
Web scraping is used for a variety of reasons, such as:
Market research: Companies can use web scraping to gather pricing information on products from competitor websites, or to collect customer reviews to improve their own products.

Content aggregation: Web scraping can be used to gather news articles or blog posts from various sources and aggregate them on a single platform.

Data analysis: Researchers or data scientists can use web scraping to collect data for analysis, such as social media data or information on public opinion.

Here are three specific areas where web scraping is commonly used:
E-commerce: Web scraping can be used to gather pricing information and product details from e-commerce websites to gain insights into the market and competitor pricing strategies.

Social media monitoring: Web scraping can be used to monitor social media platforms for mentions of specific brands or products, which can help businesses to manage their online reputation or respond to customer inquiries.

Academic research: Web scraping can be used to gather data for academic research, such as collecting data on public opinion or political sentiment from news websites or social media platforms.

Question 2: What are the different methods used for Web Scraping?
Answer :
MANUAL SCRAPING
COPY-PASTING - Manual scraping means copying and pasting web content by hand. It is effective in one respect: a website's defences are aimed at automated scraping, not at manual copying. Even so, it is rarely used in practice, because it is time-consuming and repetitive, while automated scraping is quicker and cheaper.
AUTOMATED SCRAPING
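HTML PARSING - The scraper downloads a page's HTML and parses it to locate and extract the desired elements. Many tools and libraries are available for this. Popular Python examples include BeautifulSoup (a parsing library), Scrapy (a crawling framework), and Selenium (a browser automation tool).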
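
As a minimal sketch of HTML parsing, the snippet below uses only Python's standard-library html.parser module to collect link targets from a page. The LinkCollector class and the sample markup are hypothetical and exist only for illustration. A real project would more likely use one of the libraries named above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs for the tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical page content used only for illustration
sample_html = """
<html><body>
  <a href="/products/1">Watch A</a>
  <a href="/products/2">Watch B</a>
  <a>No link here</a>
</body></html>
"""

parser = LinkCollector()
parser.feed(sample_html)
print(parser.links)
```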

DOM PARSING - DOM is short for Document Object Model. It represents the structure, style, and content of an HTML or XML document as a tree of nodes. Scrapers use DOM parsers to get an in-depth view of a web page's structure. They can also use a DOM parser to find the nodes that contain information and then use a tool like XPath to scrape them. A full browser such as Internet Explorer or Firefox can be embedded to extract the entire web page or just parts of it.
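
A small sketch of walking a DOM tree, assuming well-formed markup and using the standard-library xml.dom.minidom module. The catalog document and its tag names are made up for this example. Real web pages are rarely well-formed XML and usually need a more tolerant parser.

```python
from xml.dom import minidom

# Hypothetical, well-formed markup used only for illustration
sample = """
<catalog>
  <product id="1"><name>Watch A</name><price>49.99</price></product>
  <product id="2"><name>Watch B</name><price>79.50</price></product>
</catalog>
""".strip()

dom = minidom.parseString(sample)

rows = []
for product in dom.getElementsByTagName("product"):
    # Each <product> element is a node; its children are reached through the tree
    name_node = product.getElementsByTagName("name")[0]
    price_node = product.getElementsByTagName("price")[0]
    rows.append({
        "id": product.getAttribute("id"),
        "name": name_node.firstChild.data,
        "price": float(price_node.firstChild.data),
    })

print(rows)
```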

VERTICAL AGGREGATION - Vertical aggregation platforms are created by companies with access to large-scale computing power to target specific verticals. Some companies run these platforms in the cloud. The platforms create and monitor bots for specific verticals without any human intervention. Because the bots are built from the knowledge base for a specific vertical, their quality is measured by the quality of the data they extract.

XPATH - XPath (XML Path Language) is a query language for XML documents. Because XML documents have a tree-like structure, XPath can navigate them by selecting nodes based on various criteria. XPath can be used together with DOM parsing to scrape an entire web page.
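
The sketch below shows an XPath-style query using the standard-library xml.etree.ElementTree module, which supports only a subset of XPath. The sample document is hypothetical. For real HTML pages, a fuller XPath implementation such as the one in lxml is the more usual choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup used only for illustration
sample = """
<catalog>
  <product category="watch"><name>Watch A</name><price>49.99</price></product>
  <product category="band"><name>Strap B</name><price>9.99</price></product>
  <product category="watch"><name>Watch C</name><price>79.50</price></product>
</catalog>
""".strip()

root = ET.fromstring(sample)

# Select the <name> child of every <product> whose category attribute is "watch"
watch_names = [el.text for el in root.findall("./product[@category='watch']/name")]
print(watch_names)
```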

GOOGLE SHEETS - Google Sheets is a popular tool among web scrapers. From within a sheet, a scraper can use the IMPORTXML(url, xpath_query) function to pull data from a website. This method is only useful when specific data or patterns are required from a website. You can also use this function to check whether your own website is protected against scraping.

TEXT PATTERN MATCHING - This technique matches text against regular expressions, either with the UNIX grep command or with the regular-expression facilities of programming languages such as Perl or Python.
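
A minimal sketch of text pattern matching with Python's re module. The markup and the price format (a dollar sign, optional thousands separators, two decimals) are assumptions made for this example. Patterns like this tend to break when the page layout changes.

```python
import re

# Hypothetical page fragment used only for illustration
sample_html = """
<li class="item">Watch A <span class="price">$49.99</span></li>
<li class="item">Watch B <span class="price">$1,079.50</span></li>
<li class="item">Strap C <span class="price">$9.00</span></li>
"""

# A dollar sign, digits with optional commas, then exactly two decimals
price_pattern = re.compile(r"\$(\d[\d,]*\.\d{2})")

prices = [float(m.replace(",", "")) for m in price_pattern.findall(sample_html)]
print(prices)
```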

Question 3: What is Beautiful Soup? Why is it used?
Answer :
Beautiful Soup is a Python library used for web scraping purposes. It is a popular parsing library that is used to extract data from HTML and XML documents. Beautiful Soup can parse the HTML code of a web page and extract the relevant data, such as links, images, and text, by providing a simple and intuitive interface.
Beautiful Soup is used for web scraping because:
It is easy to learn and use: Beautiful Soup is a beginner-friendly library that is easy to install and use. Its intuitive and flexible interface makes it easy to navigate the HTML code and extract the desired data.

It can handle imperfect HTML code: Many web pages have imperfect HTML code, which can cause errors when parsing. Beautiful Soup can handle such imperfect HTML code and still extract the relevant data.

It supports multiple parsing methods: Beautiful Soup supports multiple parsing methods, including HTML and XML parsing, which makes it versatile and useful for a wide range of web scraping tasks.

It has a large community: Beautiful Soup has a large community of developers who contribute to its development and provide support. This makes it a reliable and stable library for web scraping.

In summary, Beautiful Soup is a powerful and versatile web scraping library that can extract data from HTML and XML documents. Its ease of use, support for imperfect HTML code, and support for multiple parsing methods make it a popular choice among web scrapers.
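
As a small illustration of the points above, the sketch below parses deliberately imperfect HTML (unclosed tags) from an inline string and still recovers the link text and targets. The markup and the id value are hypothetical.

```python
from bs4 import BeautifulSoup

# Deliberately imperfect HTML: the <li> and <p> tags are never closed
messy_html = """
<ul id="results">
  <li><a href="/item/1">Watch A</a>
  <li><a href="/item/2">Watch B</a>
</ul>
<p>End of results
"""

soup = BeautifulSoup(messy_html, "html.parser")

# Locate the list by its id, then collect (text, href) for each link inside it
results = soup.find(id="results")
items = [(a.get_text(strip=True), a["href"]) for a in results.find_all("a")]
print(items)
```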
Example: using Beautiful Soup to get the top 5 results for a product search query on the Amazon website. Here I have used the query "smart watch for men".

In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from fake_useragent import UserAgent

# Setup
ua = UserAgent()
search_query = "smart watch for men".replace(" ", "+")
amazon_url = f"https://www.amazon.com/s?k={search_query}"

headers = {
    "User-Agent": ua.random,
    "Accept-Language": "en-US,en;q=0.9"
}

response = requests.get(amazon_url, headers=headers)

if response.status_code == 200:
    soup = bs(response.text, "html.parser")
    products = soup.find_all("div", {"data-component-type": "s-search-result"})
    print(f"Found {len(products)} products.")

    names = []
    top5_urls = []

    for i, product in enumerate(products[:10]):  # Scan first 10 to increase chances
        try:
            # Title may be in different span depending on layout
            title_element = product.find("h2")
            if title_element and title_element.find("span"):
                title_text = title_element.find("span").get_text(strip=True)
                names.append(title_text)

                # Get product URL from the 'a' tag
                link_tag = title_element.find("a", href=True)
                if link_tag:
                    # Ensure the link is a complete URL
                    full_link = link_tag['href']
                    if full_link.startswith('/'):
                        full_link = "https://www.amazon.com" + full_link
                    top5_urls.append(full_link)
                else:
                    top5_urls.append("N/A")  # In case URL not found

            # Stop after 5 valid results
            if len(names) == 5:
                break
        except Exception as e:
            print(f"Error scraping product {i+1}: {e}")
            continue

    # Check if we got any results
    if names and top5_urls:
        df = pd.DataFrame({'Product_Title': names, 'URL': top5_urls})
        print(df)
    else:
        print("No valid product titles or links found.")
else:
    print(f"Failed to retrieve Amazon page. Status code: {response.status_code}")


Found 16 products.
                                       Product_Title  URL
0  Smart Watch for Men Women, Alexa Built-in Fitn...  N/A
1  Smart Watch, 1.85" Smartwatch for Men Women (A...  N/A
2  Smart Watch(Answer/Make Call), 1.91" Smartwatc...  N/A
3  Smart Watch with Alexa Built-in‌, 1.83'' Touch...  N/A
4  Smart Watch for Men Women, 1.96" Fitness Track...  N/A
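
The N/A values in the URL column show that no `<a>` tag with an href was found inside the `<h2>` element. One possible cause, which is an assumption because the page markup is not shown here, is that the link wraps the heading instead of sitting inside it. The sketch below is a more tolerant lookup that tries several positions. The helper name find_product_link and the sample markup are hypothetical, and it has only been checked against this inline sample, not against the live site.

```python
from bs4 import BeautifulSoup


def find_product_link(product, base_url="https://www.amazon.com"):
    """Return an absolute product URL, or None if the card has no link."""
    heading = product.find("h2")
    candidates = []
    if heading is not None:
        candidates.append(heading.find("a", href=True))         # <a> inside <h2>
        candidates.append(heading.find_parent("a", href=True))  # <a> wrapping <h2>
    candidates.append(product.find("a", href=True))             # any link in the card
    for tag in candidates:
        if tag is not None:
            href = tag["href"]
            # Relative links are completed with the site's base URL
            return base_url + href if href.startswith("/") else href
    return None


# Hypothetical markup covering three layouts: wrapped, nested, and missing link
sample = """
<div data-component-type="s-search-result">
  <a href="/dp/EXAMPLE1"><h2><span>Watch A</span></h2></a>
</div>
<div data-component-type="s-search-result">
  <h2><a href="https://example.com/dp/EXAMPLE2"><span>Watch B</span></a></h2>
</div>
<div data-component-type="s-search-result">
  <h2><span>Watch C</span></h2>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
cards = soup.find_all("div", {"data-component-type": "s-search-result"})
links = [find_product_link(card) for card in cards]
print(links)
```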


In [None]:
# Importing libraries
import random
import time
import pandas as pd
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen, Request
from fake_useragent import UserAgent

# Initialize UserAgent for rotating user-agents
ua = UserAgent()

# URL to scrape (search query for 'smart watch for men' on Amazon)
search_query = "smart watch for men"
search_query = search_query.replace(" ", "+")
amazon_url = "https://www.amazon.com/s?k=" + search_query

# Function to make request with rotating user agent and error handling
def make_request(url):
    headers = {
        'User-Agent': ua.random,  # Using random user-agent
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7'
    }

    req = Request(url, headers=headers)

    # Introduce a random delay (between 20 and 40 seconds)
    delay = random.uniform(20, 40)
    print(f"Waiting for {delay:.2f} seconds...")  # Print delay time
    time.sleep(delay)

    try:
        response = urlopen(req)
        return response.read()
    except Exception as e:
        print(f"Error: {e}")
        return None

# Fetch the page content using the function
amazon_page = make_request(amazon_url)

if amazon_page:
    amazon_html = bs(amazon_page, 'html.parser')

    # Getting all product names and URLs
    products = amazon_html.find_all("div", {"data-component-type": "s-search-result"})

    names = []
    top5_urls = []

    for i, product in enumerate(products[:5]):  # Iterate through the top 5 products
        # Title selector: the <a> tag inside the product's <h2> heading
        heading = product.find("h2", {"class": "a-size-mini a-spacing-none a-color-base s-line-clamp-2"})
        title_element = heading.find("a") if heading else None
        if title_element is None:
            print(f"Title not found for product {i+1}")
            continue

        product_name = title_element.text.strip()

        # Find the URL element within the same product container
        url_element = product.find("a", {"class": "a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal"})
        if url_element:
            product_url = "https://www.amazon.com" + url_element['href']
            print(f"Result {i+1}: {product_name}")
        else:
            product_url = None  # Keep the row so both lists stay the same length
            print(f"URL not found for product {i+1}: {product_name}")

        # Append name and URL together so the DataFrame columns always align
        names.append(product_name)
        top5_urls.append(product_url)

    # Create a DataFrame with Product Titles and URLs
    df = pd.DataFrame({'Product_Title': names, "URL": top5_urls})
    display(df)  # Using display for better formatting in Jupyter
else:
    print("Failed to retrieve the page. Please try again later.")

Waiting for 21.99 seconds...
Title not found for product 1
Title not found for product 2
Title not found for product 3
Title not found for product 4
Title not found for product 5


Unnamed: 0,Product_Title,URL


In [None]:
# Importing libraries
import random
import time
import pandas as pd
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen, Request
from fake_useragent import UserAgent

# Initialize UserAgent for rotating user-agents
ua = UserAgent()

# URL to scrape (search query for 'smart watch for men' on Amazon)
search_query = "smart watch for men"
search_query = search_query.replace(" ", "+")
amazon_url = "https://www.amazon.com/s?k=" + search_query

# Function to make request with rotating user agent and error handling
def make_request(url):
    headers = {
        'User-Agent': ua.random,  # Using random user-agent
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7'
    }

    req = Request(url, headers=headers)

    # Introduce a random delay (between 20 and 40 seconds)
    delay = random.uniform(20, 40)
    print(f"Waiting for {delay:.2f} seconds...")  # Print delay time
    time.sleep(delay)

    try:
        response = urlopen(req)
        return response.read()
    except Exception as e:
        print(f"Error: {e}")
        return None

# Fetch the page content using the function
amazon_page = make_request(amazon_url)

if amazon_page:
    amazon_html = bs(amazon_page, 'html.parser')

    # Getting all product names and URLs
    products = amazon_html.find_all("div", {"data-component-type": "s-search-result"})

    names = []
    top5_urls = []

    for i, product in enumerate(products[:5]):  # Iterate through the top 5 products
        # --- DEBUGGING ---
        print(f"\n--- DEBUG: Product {i + 1} ---")
        #print(product.prettify())  # Print the HTML of the product for inspection (optional)
        # --- END DEBUGGING ---

        # Updated title selector
        title_element = product.find("span", {"class": "a-size-base-plus a-color-base a-text-normal"})

        # Updated URL selector
        url_element = product.find("a", {"class": "a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal"})

        if title_element and url_element:
            names.append(title_element.text.strip())  # Extract text and strip whitespace
            top5_urls.append("https://www.amazon.com" + url_element['href'])
            print(f"Result {i+1}: {title_element.text.strip()}")
        else:
            print(f"Title or URL not found for product {i+1}")

    # Create a DataFrame with Product Titles and URLs
    df = pd.DataFrame({'Product_Title': names, "URL": top5_urls})
    display(df)  # Using display for better formatting in Jupyter
else:
    print("Failed to retrieve the page. Please try again later.")

Waiting for 23.93 seconds...

--- DEBUG: Product 1 ---
Title or URL not found for product 1

--- DEBUG: Product 2 ---
Title or URL not found for product 2

--- DEBUG: Product 3 ---
Title or URL not found for product 3

--- DEBUG: Product 4 ---
Title or URL not found for product 4

--- DEBUG: Product 5 ---
Title or URL not found for product 5


Unnamed: 0,Product_Title,URL


Question 4: Why is Flask used in this Web Scraping project?
Answer:
Flask is a web framework that is commonly used to build web applications and APIs in Python. While it is not required for web scraping, it can be useful in certain scenarios.
In the context of a web scraping project, Flask is used to create a simple web interface that allows users to input URLs or search queries and view the scraped data in a user-friendly format. For example, you could create a Flask app that takes a search query for a product on Flipkart, scrapes the results page, and displays the relevant information (such as product names, prices, and ratings) in a table on the app's homepage.
Using Flask in this way can make it easier to share the results of your web scraping project with others, as they can access the scraped data through a web interface rather than needing to run the scraping code themselves.
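
The idea can be sketched as a minimal Flask app, shown below. It reads a search query from the URL and renders rows in an HTML table. The scrape_products function is a hypothetical stand-in that returns canned data, and the route, template, and field names are assumptions for illustration, not the project's actual code.

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Inline template: a search box plus a table of results
PAGE = """
<form method="get"><input name="q" value="{{ query }}"><button>Search</button></form>
<table>
  <tr><th>Product</th><th>URL</th></tr>
  {% for row in rows %}<tr><td>{{ row["title"] }}</td><td>{{ row["url"] }}</td></tr>{% endfor %}
</table>
"""


def scrape_products(query):
    """Hypothetical stand-in for the real scraper; returns canned rows."""
    if not query:
        return []
    return [{"title": f"Sample result for {query}", "url": "https://example.com/item/1"}]


@app.route("/")
def index():
    # The search query arrives as ?q=... in the URL
    query = request.args.get("q", "")
    return render_template_string(PAGE, query=query, rows=scrape_products(query))
```

Saved as a module, an app like this can be started with the flask command-line tool (for example `flask --app <module_name> run`) and opened in a browser. In a real project, scrape_products would call the scraping code.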



Question 5: Write the names of AWS services used in this project. Also, explain the use of each service.
Answer :
The two AWS services used in this project are:
Elastic Beanstalk
CodePipeline
1. Elastic Beanstalk
Elastic Beanstalk is a fully managed service provided by Amazon Web Services (AWS) that allows developers to easily deploy, manage, and scale web applications and services written in popular programming languages like Java, Python, Node.js, PHP, Ruby, Go, and .NET. With Elastic Beanstalk, developers can focus on writing code without worrying about the underlying infrastructure, as the service handles provisioning and configuration of the resources needed to run the application.

Here are some key features of Elastic Beanstalk:

Platform as a Service (PaaS): Elastic Beanstalk abstracts away the underlying infrastructure and provides a simple interface for developers to deploy their applications. Developers simply upload their application code, and Elastic Beanstalk handles the rest, including provisioning the necessary resources (such as compute instances, load balancers, and databases) and configuring the environment.

Multi-language Support: Elastic Beanstalk supports a wide range of programming languages, frameworks, and platforms, including Java, Python, Node.js, PHP, Ruby, Go, and .NET. It also supports popular web servers like Apache, Nginx, and IIS.

Easy Deployment: Developers can deploy their applications to Elastic Beanstalk using a variety of methods, including the Elastic Beanstalk console, the AWS CLI, or APIs. Elastic Beanstalk supports versioning of deployments, so developers can roll back to a previous version if needed.

Auto Scaling: Elastic Beanstalk automatically scales the application up or down based on demand, ensuring that the application is always available and responsive to users. It can also automatically balance traffic across multiple instances of the application to optimize performance.

Monitoring and Logging: Elastic Beanstalk provides monitoring and logging capabilities that allow developers to monitor the health and performance of their application, and troubleshoot issues if they arise. It also integrates with other AWS services like CloudWatch and Elastic Load Balancing to provide a complete solution for monitoring and managing applications.

Overall, Elastic Beanstalk is a powerful and flexible service that can help developers quickly and easily deploy and manage web applications and services on AWS.

2. CodePipeline
AWS CodePipeline is a fully managed continuous delivery service provided by Amazon Web Services (AWS). It automates the release process for applications, enabling developers to rapidly and reliably build, test, and deploy their code changes.

Here are some key features of AWS CodePipeline:

Pipeline Creation: Developers can create custom pipelines for their applications, specifying the source code repository, build tools, testing frameworks, deployment targets, and other settings. They can also define the stages of the pipeline and the actions that should be performed in each stage.

Source Code Integration: CodePipeline integrates with a wide range of source code repositories, including AWS CodeCommit, GitHub, and Bitbucket. Developers can configure their pipelines to automatically detect code changes in the repository and trigger the build and deployment process.

Build and Test Automation: CodePipeline supports build and test tools such as AWS CodeBuild and Jenkins. Developers can configure their pipelines to run automated tests as part of the build process, so that code changes are checked against quality standards before being deployed.

Deployment Automation: CodePipeline can deploy applications to a wide range of targets, including Amazon EC2 instances, AWS Elastic Beanstalk environments, and AWS Lambda functions. It can also integrate with other AWS services like AWS CodeDeploy and AWS CloudFormation to support more complex deployment scenarios.

Continuous Monitoring: CodePipeline provides continuous monitoring of the pipeline and its stages, giving developers visibility into the progress of each stage and the status of each action. It also integrates with AWS CloudWatch to provide monitoring and alerting capabilities for the pipeline and the application.

Overall, AWS CodePipeline is a powerful tool for automating the release process for applications, enabling developers to deploy changes quickly and reliably while maintaining high quality standards. By eliminating the need for manual intervention and automating many of the tedious and error-prone tasks involved in software deployment, CodePipeline can help teams deliver software faster and with fewer errors.