# Web Scraping a Beauty Website Using Python and Selenium
In the vast landscape of web scraping, some websites share their data willingly, while others keep it behind layers of JavaScript and dynamically loaded content. The cosmetics industry, known for its alluring products and ever-changing trends, is no exception. In this project, I scrape the Tira Beauty website, a treasure trove of skincare and makeup products, and walk through how the scraper works. This project is not just about the beauty of cosmetics but also the beauty of web scraping using Python and Selenium.

In [None]:
!pip install selenium # install selenium for web scraping

In [None]:
# Import the necessary tools and libraries.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
import random
from selenium.webdriver.common.keys import Keys

In [None]:
# List of User-Agent strings to mimic real user behavior
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
    # The trailing comma below matters: without it Python joins this string with the next one
    "Mozilla/5.0 (iPhone14,3; U; CPU iPhone OS 15_0 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Mobile/19A346 Safari/602.1",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    # Add more User-Agent strings as needed
]

In [None]:
#Set up Chrome in headless mode
# The webpage URL
URL = "https://www.tirabeauty.com/collection/skin"
# Set up the WebDriver (in this case, using Chrome)
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--window-size=1920,1080")
random_user_agent = random.choice(user_agents)
chrome_options.add_argument(f"--user-agent={random_user_agent}")
driver = webdriver.Chrome(options=chrome_options)
# Open the webpage in the browser
driver.get(URL)

To load more products, we scroll down the page several times by sending the END key to the page body. After each scroll we pause for five seconds so the newly loaded products can appear, then repeat. Once scrolling is done, we grab the page source, close the browser, and collect the product links with BeautifulSoup.

In [None]:
# Scroll down multiple times to load more products (adjust the number of times as needed)
for _ in range(3):  # You can increase the number of iterations for more scrolling
    driver.find_element(By.TAG_NAME,'body').send_keys(Keys.END)
    time.sleep(5)  # Wait for the page to load (adjust the time as needed)

# Get the page source after scrolling
page_source = driver.page_source

# Close the browser
driver.quit()

# Soup Object containing all data
soup = BeautifulSoup(page_source, "html.parser")

# Fetch links as List of Tag Objects
links = soup.find_all("a", attrs={'class': 'product-wrap'})

# Store the links
links_list = []

# Loop for extracting links from Tag Objects
for link in links:
    links_list.append(link.get('href'))

d = {"title":[], "current_price":[], "old_price":[],"discount":[],"rating":[], "reviews":[]}
    
links_list
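
A fixed count of three scrolls may stop before every product has loaded. As a minimal sketch of an alternative, the helper below keeps scrolling until the page height stops growing. The name `scroll_until_stable` and its parameters are illustrative assumptions, not part of the original notebook; it only relies on the `driver.execute_script` method that Selenium drivers provide.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    Hypothetical helper: returns the number of scrolls performed.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazily loaded products time to render
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded, so we have reached the end
        last_height = new_height
    return rounds
```

It would be called as `scroll_until_stable(driver)` before reading `driver.page_source`. The `max_rounds` cap is there so a page with endless scrolling cannot keep the loop running forever.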

In [None]:
# Install Google Chrome (run this cell before any cell that creates a WebDriver)
!wget https://dl.google.com/linux/linux_signing_key.pub
!sudo apt-key add linux_signing_key.pub
!echo 'deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main' >> /etc/apt/sources.list.d/google-chrome.list
!sudo apt-get -y update
!sudo apt-get install -y google-chrome-stable

# install chromedriver
!wget -O /tmp/chromedriver.zip https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/117.0.5938.92/linux64/chromedriver-linux64.zip
!unzip /tmp/chromedriver.zip chromedriver-linux64/chromedriver -d /usr/local/bin/

# Move the chromedriver executable to a directory on the PATH
!mv /usr/local/bin/chromedriver-linux64/chromedriver /usr/local/bin/chromedriver

In [None]:
# To check Google Chrome's version
!google-chrome --version

# To check Chrome Driver's version
!chromedriver -v

In [None]:
# Function to scrape a web page using Selenium
def get_price_using_selenium(url):
    # Initialize ChromeOptions to configure the WebDriver
    chrome_options = webdriver.ChromeOptions()
    
    # Configure ChromeOptions for headless browsing (without a visible browser window)
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument("--window-size=1920,1080")
    
    # Randomly select a User-Agent to mimic real browser behavior
    random_user_agent = random.choice(user_agents)
    chrome_options.add_argument(f"--user-agent={random_user_agent}")
    
    # Create a WebDriver instance using ChromeOptions
    driver = webdriver.Chrome(options=chrome_options)
    
    try:
        # Navigate to the specified URL
        driver.get(url)

        # Wait 15 seconds to allow the JavaScript content to load
        time.sleep(15)

        # Extract data from the web page using the helper function
        d['title'].append(get_element_using_selenium(driver, 'ID', 'item_name'))
        d['current_price'].append(get_element_using_selenium(driver, 'ID', 'item_price'))
        d['old_price'].append(get_element_using_selenium(driver, 'CLASS_NAME', 'old-amount'))
        d['discount'].append(get_element_using_selenium(driver, 'CLASS_NAME', 'save-amount-per'))
        d['rating'].append(get_element_using_selenium(driver, 'CLASS_NAME', 'average-ratings'))
        d['reviews'].append(get_element_using_selenium(driver, 'CLASS_NAME', 'image-review-heading'))
    finally:
        # Always quit the WebDriver to release resources, even if the page fails to load
        driver.quit()
    
    # Return a value (1 in this case, you can customize it as needed)
    return 1
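
The fixed `time.sleep(15)` spends the full 15 seconds on every product, even when the page is ready sooner. Below is a minimal sketch of polling instead: it retries a read until text appears or a timeout passes. The helper `wait_for_text` is a hypothetical name introduced here for illustration. Selenium also ships its own explicit-wait tools (`WebDriverWait` with `expected_conditions`), which serve the same purpose.

```python
import time


def wait_for_text(read_text, timeout=15.0, poll=0.5):
    """Call read_text() repeatedly until it returns non-empty text.

    Hypothetical helper: returns the text, or "" if nothing appeared in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            text = read_text()
        except Exception:
            text = ""  # element is not in the DOM yet, so keep polling
        if text:
            return text
        if time.monotonic() >= deadline:
            return ""
        time.sleep(poll)


# Possible use inside the scraping function (assumes `driver` and `By` from above):
# title = wait_for_text(lambda: driver.find_element(By.ID, "item_name").text)
```

With this approach a fast page returns in well under a second, while a slow one still gets the same 15-second budget as before.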


In [None]:
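# To check the user agent in use, run this while a browser session is still open,
# that is, before driver.quit() is called in the scrolling cell above.
# After driver.quit() the session is closed and execute_script raises an error.
driver_ua = driver.execute_script("return navigator.userAgent")
print("User agent:")
print(driver_ua)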

In [None]:
# Function to extract an element using specified attributes from a webpage using Selenium
def get_element_using_selenium(driver, attribute, attributeName):
    try:
        # Check if the attribute is 'ID'
        if attribute == 'ID':
            # Find the element by its ID and extract its text content
            value = driver.find_element(By.ID, attributeName).text
        else:
            # Find the element by its CLASS_NAME and extract its text content
            value = driver.find_element(By.CLASS_NAME, attributeName).text
    except Exception as error:
        # If an error occurs (element not found, for example), set the value to an empty string
        value = ""
        # Optionally, you can uncomment the following line to print an error message
        # print('Error occurred while extracting', attributeName, 'Error:', error)
    
    # Return the extracted value
    return value

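We loop through the product links, scraping details such as title, current price, old price, discount, rating, and reviews for each one. This is where the magic happens: for every link, Selenium opens the product page in a fresh headless browser, and the helper function reads each field into the dictionary `d`. A field that cannot be found is stored as an empty string.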

In [None]:
# Loop for extracting product details from each link
for link in links_list:
    # Construct the full URL by appending the link to the base URL
    url = "https://www.tirabeauty.com" + link
    
    # Call the function to scrape product information and store it in the 'd' dictionary
    get_price_using_selenium(url)
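
Plain string concatenation works as long as every `href` is a site-relative path. As a hedged sketch, the helper below builds the URLs with `urllib.parse.urljoin` from the standard library, skips missing links, and removes duplicates so no product page is opened twice. The function name `build_product_urls` and the example paths in the comment are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tirabeauty.com"


def build_product_urls(hrefs, base_url=BASE_URL):
    """Turn scraped href values into unique absolute URLs, keeping their order."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href:
            continue  # link.get('href') returns None when the attribute is missing
        url = urljoin(base_url, href)  # leaves already-absolute URLs unchanged
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Example with made-up paths:
# build_product_urls(["/product/a", None, "/product/a"])
```

The loop above could then iterate over `build_product_urls(links_list)` and pass each URL straight to the scraping function.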


We store our findings in a Pandas DataFrame, giving us a structured view of the beauty secrets we've unveiled. To keep our data pristine, we replace empty titles with NaN and drop those rows, since a missing title means the product page did not load properly. The result is saved to a CSV file.

In [None]:
# Create a DataFrame from the 'd' dictionary
products_df = pd.DataFrame.from_dict(d)

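# Replace empty ('') values in the 'title' column with NaN
# (assign the result back; an in-place replace on a column selection may not update the DataFrame)
products_df['title'] = products_df['title'].replace('', np.nan)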

# Drop rows with NaN values in the 'title' column
products_df = products_df.dropna(subset=['title'])

# Save the DataFrame to a CSV file named "products_csv.csv" (you can change the filename)
products_df.to_csv("products_csv.csv", header=True, index=False)

# Display the DataFrame if needed
print(products_df)
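
The scraped values are still text, so they need to become numbers before any analysis. The sketch below pulls the first number out of a string with a regular expression. The sample formats in the comments (a currency symbol before the price, a percentage in the discount label) are assumptions about how the site renders these fields, and `parse_number` is a hypothetical helper name.

```python
import re


def parse_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    if match is None:
        return None
    # Drop thousands separators before converting
    return float(match.group(0).replace(",", ""))


# Assumed sample strings; the real page text may be formatted differently:
# parse_number("₹1,299")     price with a currency symbol
# parse_number("(20% Off)")  discount label
# parse_number("")           empty field from a failed lookup
```

Applied to a column, this could look like `products_df['current_price'].map(parse_number)`, which leaves `None` wherever the field was empty.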

# Market Research: Uncovering Insights

The data we've collected from Tira Beauty's website isn't just a list of cosmetics products. It's a goldmine of insights for market research. Here's how:

**Product Trends**: By analyzing the product titles, we can identify emerging product trends in the cosmetics industry. Are natural skincare products on the rise? Is there a surge in demand for specific makeup brands?

**Pricing Strategy**: The data includes current and old prices, as well as discounts. This information can help businesses fine-tune their pricing strategies, understand the impact of discounts, and stay competitive.

**Customer Feedback**: Ratings and reviews provide valuable feedback from customers. Brands can use this data to gauge product satisfaction and identify areas for improvement.

**Competitor Analysis**: With data on multiple products, businesses can perform competitive analysis. They can compare their product offerings, prices, and customer feedback against competitors in the market.

**Inventory Management**: Understanding which products are popular and which are not can aid in inventory management. Businesses can stock up on in-demand items and make informed decisions about discontinuing slow-moving products.

**Seasonal Insights**: By repeating the scrape at regular intervals and comparing the results over time, businesses can uncover seasonal trends. For example, do certain skincare products sell better in the summer? Are there holiday-themed makeup collections?
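
As a small illustration of the pricing point above, the sketch below computes the discount percentage from the old and current prices and ranks products by it. It assumes the prices have already been converted to numbers. The rows in the comment are invented placeholders, not scraped results, and both function names are hypothetical.

```python
def discount_percent(current_price, old_price):
    """Percentage saved relative to the old price, or None if it cannot be computed."""
    if current_price is None or old_price is None or old_price <= 0:
        return None
    return round((old_price - current_price) / old_price * 100, 1)


def rank_by_discount(rows):
    """Return (discount %, title) pairs, largest discount first."""
    scored = []
    for row in rows:
        pct = discount_percent(row.get("current_price"), row.get("old_price"))
        if pct is not None:
            scored.append((pct, row["title"]))
    return sorted(scored, reverse=True)


# Invented placeholder rows, for illustration only:
# rows = [
#     {"title": "Product A", "current_price": 400.0, "old_price": 500.0},
#     {"title": "Product B", "current_price": 750.0, "old_price": 1000.0},
# ]
# rank_by_discount(rows)
```

Comparing the computed percentage with the scraped `discount` field is also a quick sanity check on the data.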