## Step 1: Installations

To start scraping, install the following:
- `selenium`: the Python library for browser automation
- `webdriver`: the driver for the Google Chrome browser (ChromeDriver)

#### More Info

- `selenium`: Used for web automation and controlling web browsers programmatically.
- `webdriver`: Provides a way to start and control a web browser using Python.
- `By`: Provides mechanisms for locating elements on a web page.
- `WebDriverWait` and `expected_conditions`: Used for waiting until certain conditions are met before proceeding with code execution.
- `Keys`: Provides special keys (e.g., Enter, Shift) for simulating keyboard actions.
- `BeautifulSoup`: Used for parsing HTML and XML documents and extracting data from web pages.
- `time`: Provides time-related functions, such as adding delays in code execution.
- `requests`: Used for making HTTP requests to retrieve web pages or resources.
- `csv`: Provides functionality for reading and writing CSV files.
- `pandas`: Widely used for data manipulation and analysis, with DataFrames for working with tabular data.
- `pd.set_option('display.max_columns', None)`: Sets an option in pandas to display all columns when printing DataFrames.
- `re`: Provides support for regular expressions, used for pattern matching and text manipulation.

In [None]:
!pip install selenium

In [None]:
# Importing the required libraries and modules

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

from bs4 import BeautifulSoup
import time

import requests
import re
import csv
import pandas as pd

pd.set_option('display.max_columns', None)

## Step 2: Install WebDriver

1. The `chromedriver_path` variable specifies the path to the ChromeDriver executable on your device.
2. The `options` variable sets options for Chrome, such as running it in headless mode.
3. `webdriver.Chrome()` creates a Chrome WebDriver instance using the ChromeDriver executable.
4. The `driver.get()` method opens the target website.
5. `time.sleep(5)` pauses for 5 seconds to allow the page to load.
6. The code then scrolls to the bottom of the page using `driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")`.
7. It waits for 5 seconds again to allow the page to load more items.
8. The code calculates the new height of the page and compares it with the previous height to check if more items were loaded. If the heights are the same, it means there are no more items to load, and the loop breaks.
9. The process continues until all items on the page have been loaded.

In [None]:
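from selenium.webdriver.chrome.service import Service

# Path to ChromeDriver on your device
chromedriver_path = '/Users/mejoal-salem/Downloads/chromedriver-mac-x64/chromedriver'

# Specify Chrome options (optional)
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # Run Chrome in headless mode

# Create the driver from the path and options defined above (Selenium 4 API).
# If chromedriver is already on your PATH, webdriver.Chrome(options=options) is enough.
driver = webdriver.Chrome(service=Service(executable_path=chromedriver_path), options=options)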

# Open the target website
driver.get("https://Enter your link")

# Wait for the page to load
time.sleep(5)

# Scroll down the page to load more items
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for the page to load
    time.sleep(5)

    # Calculate the new height and compare it with the previous height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
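
The scroll loop above runs until the page height stops changing, so it can run for a long time on a page that keeps growing. As a minimal sketch, the same logic can be wrapped in a helper with an upper bound on the number of scrolls. The function name `scroll_until_stable` and its parameters `get_height`, `scroll_to_bottom`, and `max_rounds` are hypothetical and not part of the original notebook; the page is simulated here so the example runs without a browser.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause=5.0, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    get_height and scroll_to_bottom are hypothetical callables; with Selenium
    they would wrap the driver.execute_script(...) calls used in the cell above.
    Returns the number of scrolls performed.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to_bottom()
        time.sleep(pause)  # Give lazily loaded items time to appear
        new_height = get_height()
        if new_height == last_height:
            return rounds  # Height unchanged: nothing more was loaded
        last_height = new_height
    return max_rounds  # Safety cap reached


# Simulated page whose height grows twice and then stays at 3000
heights = iter([1000, 2000, 3000, 3000])
scrolls = scroll_until_stable(lambda: next(heights), lambda: None, pause=0)
print(scrolls)
```

With a real driver, the two callables would be `lambda: driver.execute_script("return document.body.scrollHeight")` and `lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")`.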

In this code, we first initialize the WebDriver instance and open the Parmigiani Fleurier website. Then, we define the `get_watch_urls` function, which takes the WebDriver instance as a parameter.

Inside the function, we read the fully loaded page source from `driver.page_source` and parse it with BeautifulSoup. We find all `<div>` elements with the class `w-grid-item-h`, skip the first six of them (`watch_elements[6:]`), and take the `href` of the first `<a>` element inside each remaining `<div>`. URLs that do not start with `http` are treated as relative, and the base URL is prepended to them.

Finally, we call the `get_watch_urls` function, passing the WebDriver instance, and assign the returned list of watch URLs to the `watch_detail_urls` variable. We can then use the `watch_detail_urls` list to navigate to each watch's detail page and scrape the desired information.

In [None]:
# Function to extract watch URLs from the page source
def get_watch_urls(driver):


    # Get the page source after all content is loaded
    page_source = driver.page_source

    # Parse the page source using BeautifulSoup
    soup = BeautifulSoup(page_source, 'html.parser')

    # Find all <div> elements with the class 'w-grid-item-h'
    watch_elements = soup.find_all('div', class_='w-grid-item-h')
    # Extract watch URLs
    watch_urls = [elem.find('a')['href'] for elem in watch_elements[6:] if elem.find('a')]

    # Check if the URLs are relative and prepend the base URL if needed
    base_url = 'https://www.parmigiani.com/'
    watch_urls = [base_url + url if not url.startswith('http') else url for url in watch_urls]

    # Return the list of watch URLs
    return watch_urls
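
Prepending the base URL by string concatenation works when the relative links have no leading slash, but a link such as `/en/watches/...` would produce a double slash. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves both forms and leaves absolute URLs unchanged. The helper name `absolutize` and the relative sample links below are illustrative assumptions, not values taken from the site.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url; absolute URLs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


base_url = 'https://www.parmigiani.com/'
sample_hrefs = [
    '/en/watches/tonda-pf-skeleton/',   # Root-relative link (illustrative)
    'en/watches/tonda-pf-hijri/',       # Relative link without a leading slash (illustrative)
    'https://www.parmigiani.com/en/watches/tonda-pf-xiali/',  # Already absolute
]
resolved = absolutize(base_url, sample_hrefs)
for url in resolved:
    print(url)
```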

In [None]:
# Call the get_watch_urls function and assign the returned URLs to a variable

watch_detail_urls = get_watch_urls(driver)

In [None]:
# Copy the URLs into a plain list and print them
urls = list(watch_detail_urls)

print(urls)

['https://www.parmigiani.com/en/watches/tonda-pf-split-seconds-chronograph-rose-gold-grey/', 'https://www.parmigiani.com/en/watches/tonda-pf-flying-tourbillon-platinum-blue/', 'https://www.parmigiani.com/en/watches/tonda-pf-skeleton/', 'https://www.parmigiani.com/en/watches/tonda-pf-skeleton-2/', 'https://www.parmigiani.com/en/watches/tonda-pf-xiali/', 'https://www.parmigiani.com/en/watches/tonda-pf-hijri/', 'https://www.parmigiani.com/en/watches/tonda-pf-annual-calendar-brown-alligator/', 'https://www.parmigiani.com/en/watches/tonda-pf-ac-rose-gold/', 'https://www.parmigiani.com/en/watches/tonda-pf-annual-calendar-steel/', 'https://www.parmigiani.com/en/watches/tonda-pf-chrono-rose-gold/', 'https://www.parmigiani.com/en/watches/tonda-pf-chronograph-blue-alligator/', 'https://www.parmigiani.com/en/watches/tonda-pf-chrono-steel/', 'https://www.parmigiani.com/en/watches/tonda-pf-minute-rattrapante/', 'https://www.parmigiani.com/en/watches/tonda-pf-gmt-rattrapante-gold/', 'https://www.par

In [None]:
# Remove duplicate URLs while keeping the original page order
# (the order matters in Step 3, where images are paired with URLs by position)
urls = list(dict.fromkeys(urls))
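
Step 3 pairs each URL with an image by its position in the list, so the order of `urls` matters after deduplication. A `set` makes no ordering guarantee, whereas `dict.fromkeys` keeps the first occurrence of each item in its original position. A small sketch on sample data (the helper name `dedupe_keep_order` is hypothetical):

```python
def dedupe_keep_order(items):
    """Drop repeated items while keeping the first occurrence of each, in order."""
    return list(dict.fromkeys(items))


sample_urls = [
    'https://www.parmigiani.com/en/watches/tonda-pf-skeleton/',
    'https://www.parmigiani.com/en/watches/tonda-pf-xiali/',
    'https://www.parmigiani.com/en/watches/tonda-pf-skeleton/',  # Duplicate of the first entry
    'https://www.parmigiani.com/en/watches/tonda-pf-hijri/',
]
unique_urls = dedupe_keep_order(sample_urls)
print(len(unique_urls))
```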

In [None]:
# number of URLs in the urls list.
print(len(urls))

41


## Step 3: Extracting Watch Data and Creating DataFrame


1. We start by creating an empty list called `watch_data` to store the extracted data for each watch.
2. We loop through each URL in the `urls` list, which contains the URLs of the watch detail pages.
3. For each URL, we send a GET request using the `requests` library to retrieve the HTML content of the watch page.
4. We use BeautifulSoup to parse the HTML content and create a BeautifulSoup object called `soup`.
5. We create an empty dictionary called `watch_fields` to store the extracted data for the current watch.
6. We use various CSS selectors and `soup.find()` or `soup.select_one()` methods to extract specific information from the watch page.
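7. We store each extracted value (reference number, model names, case, movement, and dial specifications, description, features, thickness, and price) under its own key in `watch_fields`.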
8. We append the `watch_fields` dictionary to the `watch_data` list.
9. The loop continues for each URL in the `urls` list until all watches have been processed.
10. At the end of the loop, the `watch_data` list contains dictionaries with the extracted data for each watch.
11. We import the pandas library as `pd`, then define a list called `field_names` that contains the column names for the CSV file. These column names correspond to the keys in the `watch_fields` dictionary.

12. We create a DataFrame called `data` using the `pd.DataFrame()` constructor. We pass `watch_data` as the data argument and `field_names` as the `columns` argument. Each dictionary in `watch_data` becomes a row; keys listed in `field_names` become columns, keys not listed are left out, and listed fields missing from a dictionary are filled with `NaN`.

13. Finally, we can perform further operations on `data` or save it as a CSV file using the `data.to_csv()` method.
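
The way the `columns` argument selects and fills fields can be checked on a tiny example. The records and the shortened field list below are hypothetical stand-ins for the scraped `watch_fields` dictionaries, not real data.

```python
import pandas as pd

# Hypothetical records standing in for the scraped watch_fields dictionaries
sample_watch_data = [
    {'reference_number': 'REF-001', 'brand': 'Parmigiani Fleurier',
     'price': '25,000', 'Image title': 'Sample A'},
    {'reference_number': 'REF-002', 'brand': 'Parmigiani Fleurier'},
]
sample_field_names = ['reference_number', 'brand', 'price', 'weight']

df = pd.DataFrame(sample_watch_data, columns=sample_field_names)

# 'Image title' is not in sample_field_names, so it is left out;
# 'weight' and the second row's 'price' were never set, so they are NaN.
print(df)
```

This is why fields such as `weight` or `bezel_color`, which the loop never fills, still appear as empty columns in the final table.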

In [None]:
image_src = []
image_tit = []

In [None]:
# Extract image sources from the listing page (skipping the first six matches)
image_element = driver.find_elements(By.CSS_SELECTOR, ".attachment-full.size-full.wp-post-image")
for i in image_element[6:]:
    image_srcs = i.get_attribute("src")
    image_src.append(image_srcs)

# Extract the title text of each item
title_element = driver.find_elements(By.CLASS_NAME, "w-vwrapper")
for j in title_element:
    title_text = j.text
    image_tit.append(title_text)

In [None]:
# Extract the desired information from each watch page and store it in a structured format

# Create a list to store the watch data
watch_data = []

# Loop through each watch link, keeping a position index for the image lists
for c, url in enumerate(urls):

    response = requests.get(url)  # Send a request to the watch page
    soup = BeautifulSoup(response.content, 'html.parser')  # Parse the watch page content

    # Store the product data in a dictionary
    watch_fields = {}

    # Pair the listing-page image and title with this URL by position;
    # fall back to None if the lists are shorter than urls
    watch_fields['Image src'] = image_src[c] if c < len(image_src) else None
    watch_fields['Image title'] = image_tit[c] if c < len(image_tit) else None

    ref_span = soup.select('#heroDesktop > div.l-section-h.i-cf > div > div > div > div > div > div.w-post-elm.post_custom_field.subtitle.type_text > span')
    sku_span = soup.find('span', class_='sku')

    if ref_span:
        watch_fields['reference_number'] = ref_span[0].get_text(strip=True)
    elif sku_span:
        watch_fields['reference_number'] = sku_span.get_text(strip=True)
    else:
        watch_fields['reference_number'] = 'N/A'

    # Extract the watch URL
    watch_fields['watch_URL'] = url

    # Extract the specific model

    specific_model = soup.find('h1', class_='w-post-elm post_title us_custom_7ed3deef entry-title color_link_inherit')
    if specific_model:
        watch_fields['specific_model'] = specific_model.get_text(strip=True)
    else:
        watch_fields['specific_model'] = ''


    # Extract the parent model from the fifth item of the ordered list matched by this selector
    parent_watch = soup.select('#page-content > section.l-section.wpb_row.us_custom_faa47779.height_auto > div > div > div > div > div > ol > li:nth-child(5) > a')

    # Extract the text from the matched elements
    parent_model = [element.get_text() for element in parent_watch]

    # Store the parent model (if several elements match, the last one wins)
    for parent in parent_model:
        watch_fields['parent_model'] = parent

    # Extract the brand
    watch_fields['brand'] = "Parmigiani Fleurier"

    # Extract the image URL
    # Using CSS selector to find the img inside the specific div
    img_tag = soup.select_one('.us_custom_5f589ac2 img')


    # Extract the 'src' attribute
    img_src = img_tag['src'] if img_tag else 'Image source not found'
    watch_fields['image_URL'] =img_src

    # Set the country of origin
    made_in = "Switzerland"
    watch_fields["made_in"] = made_in


    # Extract case, movement, and dial specifications

    def extract_following_text(label):
        """Return the text of the element that follows `label`, or None if not found."""
        # re.escape keeps punctuation in the label from being read as regex syntax
        elements = soup.find_all(string=re.compile(re.escape(label)))
        for element in elements:
            # Navigate to the next element in the DOM that likely contains the value
            next_element = element.find_next()
            if next_element and not next_element.text.strip().startswith(label):
                return next_element.text.strip()
        return None


    # Extracting values
    watch_fields['crystal'] = extract_following_text("Glass:")
    watch_fields['case_material'] = extract_following_text("Material:")
    watch_fields['water_resistance'] = extract_following_text("Water-Resistance:")
    watch_fields['caseback'] = extract_following_text("Back:")

    watch_fields['Calibre'] = extract_following_text("Calibre:")
    watch_fields['Frequency'] = extract_following_text("Frequency:")
    watch_fields['movement'] = extract_following_text("Winding:")
    watch_fields['power_reserve'] = extract_following_text("Power Reserve:")
    watch_fields['jewels'] = extract_following_text("NB of jewels:")

    # Extract dial specifications

    watch_fields['dial_color'] = extract_following_text("Color:")
    watch_fields['diameter'] = extract_following_text("Dimensions:")
    watch_fields['case_shape'] = extract_following_text("Hands:")  # TODO: confirm that "Hands:" is the right label for this field
    watch_fields['case_finish'] = extract_following_text("Finishing:")  # TODO: confirm that "Finishing:" is the right label for this field
    watch_fields['numerals'] = extract_following_text("Index:")


    # Extract the description

    description_info = soup.find_all('div', class_='w-post-elm post_content paragraph')
    if description_info:
        watch_fields['description'] = description_info[0].get_text(strip=True)


    # Extract additional features

    elements = soup.find_all('span', class_='w-post-elm-value')

    # The features value is the first entry that reads like "Hours, Minutes, ..."
    pattern = re.compile(r'hours?, minutes?', re.IGNORECASE)

    for element in elements:
        # Extract the text from the current element
        text = element.get_text()
        # Check if the text matches the pattern
        if pattern.search(text):
            # Store the first match under the "features" key and stop searching
            watch_fields['features'] = text
            break

    first_thickness_paragraph = soup.find('p', string=lambda text: 'Thickness:' in text if text else False)

    # Extract the value following "Thickness:"
    if first_thickness_paragraph:
        next_span = first_thickness_paragraph.find_next('span', class_='w-post-elm-value')
        if next_span:
            thickness_value = next_span.text.strip()
            watch_fields['case_thickness']=thickness_value


    # Extract the price: the first token is treated as the currency, the last as the amount

    prices = soup.find_all('div', class_="w-post-elm product_field price")

    for price in prices:
        price_text = price.text.split()
        if not price_text:
            continue  # Skip empty price elements
        watch_fields['currency'] = price_text[0]
        watch_fields['price'] = price_text[-1]

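    # Use the listing-page image as image_URL (this overrides the value found
    # on the detail page), then append the watch data to the list

    watch_fields['image_URL'] = watch_fields['Image src']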
    watch_data.append(watch_fields)
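
The features lookup in the loop above boils down to "return the first specification text that mentions hours and minutes". A minimal standalone sketch of that search follows; the helper name `first_features_text` and the sample strings are hypothetical and only illustrate how the pattern behaves.

```python
import re

# Same idea as the pattern used in the loop: "hour(s), minute(s)", case-insensitive
FEATURES_PATTERN = re.compile(r'hours?, minutes?', re.IGNORECASE)


def first_features_text(texts):
    """Return the first text that contains 'hour(s), minute(s)', or None."""
    for text in texts:
        if FEATURES_PATTERN.search(text):
            return text
    return None


# Hypothetical specification values in the order they might appear on a page
sample_values = ['18ct rose gold', 'Hours, Minutes, Date', 'hour, minute, small seconds']
found = first_features_text(sample_values)
print(found)
```

Because `re.IGNORECASE` is set, the pattern does not need separate upper- and lower-case alternatives.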

In [None]:
field_names = [
    'reference_number', 'watch_URL',  'type', 'brand', 'year_introduced', 'parent_model', 'specific_model',
    'nickname', 'marketing_name', 'style', 'currency', 'price', 'image_URL', 'made_in', 'case_shape',
    'case_material', 'case_finish', 'caseback', 'diameter', 'between_lugs', 'lug_to_lug', 'case_thickness',
    'bezel_material', 'bezel_color', 'crystal', 'water_resistance', 'weight', 'dial_color', 'numerals',
    'bracelet_material', 'bracelet_color', 'clasp_type', 'movement', 'Calibre', 'power_reserve', 'Frequency',
    'jewels', 'features', 'description', 'short_description'
]
data = pd.DataFrame(watch_data, columns=field_names)
print(data)

           reference_number  \
0   pfh916-2010002-200182-3   
1     pfh921-2020002-200182   
2     PFC912-2020001-200182   
3     PFC912-1020001-100182   
4     pfh982-1022401-100182   
5     pfh983-1020001-100182   
6     PFC907-2020001-300182   
7     PFC907-2020001-200182   
8     PFC907-1020001-100182   
9     PFC915-2020001-200182   
10    PFC915-2020001-300182   
11    PFC915-1020001-100182   
12    pfc904-1020001-100182   
13    pfc905-2020001-200182   
14    PFC905-1020001-100182   
15    pfc914-2020002-200182   
16    PFC914-2020001-200182   
17    PFC914-2020001-300182   
18    PFC914-1020001-100182   
19    PFC804-2020001-200182   
20    pfc804-2020001-300182   
21    pfc804-1020003-100182   
22    PFC804-1020001-100182   
23    PFC804-1020002-100182   
24    pfc931-2020001-400182   
25    pfc931-1020001-400182   
26    pfc930-2020001-400182   
27    pfc930-1020001-400182   
28    PFC802-2120002-300181   
29    PFC801-1510020-HC6181   
30    pfc801-1510320-ha3181   
31    PF

In [None]:
# save the extracted data from the website into a CSV file
data.to_csv('Scraping_data.csv', index=False)
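
The notebook imports `csv` but saves through pandas. As an alternative sketch (a swapped-in technique, not what the cell above does), the list of dictionaries could be written directly with the standard library's `csv.DictWriter`. The helper name `write_watch_csv` and the sample rows are hypothetical; an in-memory buffer is used so the example does not touch the disk.

```python
import csv
import io


def write_watch_csv(rows, field_names, file_obj):
    """Write dict rows as CSV, blanking missing fields and ignoring extra keys."""
    writer = csv.DictWriter(file_obj, fieldnames=field_names,
                            restval='', extrasaction='ignore')
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical rows standing in for watch_data
sample_rows = [
    {'reference_number': 'REF-001', 'brand': 'Parmigiani Fleurier', 'Image title': 'Sample A'},
    {'reference_number': 'REF-002', 'brand': 'Parmigiani Fleurier', 'price': '25,000'},
]
buffer = io.StringIO()
write_watch_csv(sample_rows, ['reference_number', 'brand', 'price'], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, open it with `newline=''` as the `csv` module documentation recommends, for example `open('Scraping_data.csv', 'w', newline='')`.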

In [None]:

# Re-run the parent model extraction on the last page parsed by the loop above
parent_watch = soup.select('#page-content > section.l-section.wpb_row.us_custom_faa47779.height_auto > div > div > div > div > div > ol > li:nth-child(5) > a')

# Extract the text from the matched elements
parent_model = [element.get_text() for element in parent_watch]

# Store and print the extracted parent model
for parent in parent_model:
    watch_fields['parent_model'] = parent
    print(parent)

TONDA PF


## Step 4: Data Analysis and Visualization



In this code, the selected columns are analyzed and visualized using various charts.

- The first chart is a bar chart showing the distribution of dial colors.
- The second chart is a pie chart showing the distribution of water resistance.
- The third chart is a bar chart showing the distribution of case shapes.
- The fourth chart is a bar chart showing the distribution of movement by brand.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Select the columns you want to analyze and visualize.

columns_to_analyze = [
    'reference_number', 'watch_URL', 'type', 'brand', 'year_introduced', 'parent_model', 'specific_model',
    'nickname', 'marketing_name', 'style', 'currency', 'price', 'image_URL', 'made_in', 'case_shape',
    'case_material', 'case_finish', 'caseback', 'diameter', 'between_lugs', 'lug_to_lug', 'case_thickness',
    'bezel_material', 'bezel_color', 'crystal', 'water_resistance', 'weight', 'dial_color', 'numerals',
    'bracelet_material', 'bracelet_color', 'clasp_type', 'movement', 'Calibre', 'power_reserve', 'Frequency',
    'jewels', 'features', 'description', 'short_description'
]

# Perform data analysis
data_analysis = data[columns_to_analyze]


In [None]:
# Visualize the distribution of dial colors.

dial_color_counts = data_analysis['dial_color'].value_counts()
plt.figure(figsize=(10, 6))

dial_color_counts.plot(kind='bar')
plt.title('Dial color distribution')
plt.xlabel('Dial color')
plt.ylabel('Number of watches')
plt.show()

In [None]:
# Visualize the distribution of water resistance as a pie chart.

water_resistance_counts = data_analysis['water_resistance'].value_counts()

plt.figure(figsize=(10, 6))
water_resistance_counts.plot(kind="pie", autopct='%1.1f%%')
plt.title("Water resistance distribution")
plt.ylabel("")
plt.show()

In [None]:
# Visualize the distribution of case shapes.
case_shape_counts = data_analysis['case_shape'].value_counts()
plt.figure(figsize=(10, 6))
case_shape_counts.plot(kind='bar')
plt.title('Case shape distribution')
plt.xlabel('Case shape')
plt.ylabel('Number of watches')
plt.show()

In [None]:
# Calculate the distribution of movement by brand.

brand_movement_counts = data_analysis.groupby('brand')['movement'].value_counts()

# Visualize the distribution of movement by brand

plt.figure(figsize=(10, 6))
brand_movement_counts.plot(kind='bar')
plt.title('Distribution of movement by brand')
plt.xlabel('brand - movement')
plt.ylabel('The number of watches')
plt.show()
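
Since Step 3 sets `brand` to the same value for every watch, grouping by brand before counting gives the same numbers as counting `movement` alone; the only difference is a two-level (brand, movement) index. A small sketch on hypothetical rows (the movement labels are placeholders, not scraped values):

```python
import pandas as pd

# Hypothetical rows; in the scraped data the brand column holds a single value
sample = pd.DataFrame({
    'brand': ['Parmigiani Fleurier'] * 4,
    'movement': ['Automatic', 'Automatic', 'Manual', 'Automatic'],
})

grouped = sample.groupby('brand')['movement'].value_counts()
plain = sample['movement'].value_counts()

print(grouped)  # Indexed by (brand, movement)
print(plain)    # Indexed by movement only
```

The grouped form becomes more informative once the dataset contains more than one brand.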