# Background of this research

Before getting into scraping, I need to decide which website I am going to scrape. Early on I discovered that I had to experiment with HOW to scrape, because there are several methods and they don't all work on every website or brand.

Ranking of jewelry brands (founding year; estimated annual revenue):
1. Van Cleef and Arpels 1906 ; 505.7M USD
    - https://www.vancleefarpels.com/us/en/the-maison/timeline/origins.html#:~:text=Born%20in%20the%20wake%20of,debut%20of%20a%20bejeweled%20destiny.
    - https://www.zoominfo.com/c/van-cleef--arpels/170328141
2. Cartier 1847 ; 6.2B
    - https://www.cartier.com/en-sk/maison/the-story/story-and-heritage
    - https://www.zippia.com/cartier-careers-18270/revenue/#
3. CatbirdNYC 2004 ; 60.2M USD
    - https://www.brooklynnavyyard.org/tenants/catbird/
    - https://growjo.com/company/Catbird_NYC#google_vignette
4. Laurie Fleming Jewelry approx. 2014
    - https://www.appmybizaccount.gov.on.ca/onbis/businessnames/viewInstance/view.pub?id=185e6c856f92f2744af3ee5b87a9ed2f11e4b9fc2c846b7587cb497c8c3736b9&_timestamp=4631642856566348
    
**Table 1: Jewelry brands and what methods can be used to scrape from them**

|Company             |Basic Request|Selenium|Proxy|Web Scraping Service|
|:------------------:|:-----------:|:------:|:---:|:------------------:|
|CatbirdNYC          |No           |Yes     |x    |x                   |
|Cartier             |No           |x       |x    |x                   |
|Van Cleef and Arpels|No           |Yes     |x    |x                   |
|Laurie Fleming      |Yes          |x       |x    |x                   |
    
I ranked VCA above Cartier because VCA is known as the more upscale jewelry brand: its pieces are pricier and handmade.


# Intended Audience and Audience Background

Anyone is welcome to read through my "research"! It's for fun and for anyone who shares my interest (or even if they don't)! It's not extremely technical or thorough, to be clear, since I did some basic research and I did not cite sources for the most part. 

If any readers would like to use this code, they need some technical background and basic coding ability. This code and task do not go into deeper topics like memory allocation, but knowledge of coding, scraping, website security, HTML, JavaScript, and more will make it easier to understand (not that what I'm doing is terribly difficult).

# Research Process

Before delving into the code, here is the outline of this notebook. To preface, I haven't scraped websites before besides using Reddit API years ago, so I don't have much knowledge or experience in these regards.

Check a website's robots.txt file before and during this process to get a better idea of what the website does or does not allow.
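A minimal sketch of that robots.txt check, using Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, and the `is_allowed` helper are made up for illustration; a real check would load the site's actual file, for example with `RobotFileParser.set_url(...)` followed by `.read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A product page is not covered by the Disallow rule; the checkout path is.
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/products/charm"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/checkout/cart"))
```

robots.txt only states what a site asks crawlers to do; it does not say anything about whether a request will actually be blocked.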

I am mainly trying to scrape from Van Cleef and Arpels. I started with the basic requests library and realized that it didn't work on all jewelry websites, probably because those sites try to block bots and scalpers.

As Table 1 shows, the basic request works on Laurie Fleming Jewelry's website but not on Van Cleef and Arpels or Cartier. I think the reason is that higher-end luxury companies have more resources for extra features like preventative measures on their websites, and these companies in particular have more incentive to protect against resellers and scalpers, since their items are in high demand and are worth a lot.

When I couldn't scrape from VCA, I used Selenium. 

Selenium capabilities:
- Can interact with web pages the same way a real user would, including content generated dynamically by JavaScript
- Allows you to perform actions on web elements such as clicking buttons, filling out forms, or scrolling
- Can handle client-side validations, pop-ups, alerts, and cookies in real time, replicating a user's session more accurately

The last point is what I noticed was causing issues with the regular requests. My header information, particularly the cookies, was either incorrect or contained session-specific data, so when I sent it with the regular requests library I could not retrieve the HTML.
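One way to probe the cookie theory would be to let a Selenium session collect its cookies and then reuse them in a plain request. The sketch below only covers the conversion step: Selenium's `driver.get_cookies()` returns a list of dictionaries that include `name` and `value` keys, and the hypothetical `cookie_header` helper joins them into a `Cookie` header string. The sample cookie names and values are invented, and there is no guarantee that a site with bot protection accepts the reused cookies.

```python
def cookie_header(selenium_cookies):
    """Build a Cookie header value from the list returned by driver.get_cookies()."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


# Invented sample in the shape Selenium returns (extra keys are ignored).
sample_cookies = [
    {"name": "session_id", "value": "abc123", "domain": "example.com"},
    {"name": "locale", "value": "en_US", "domain": "example.com"},
]

headers = {"cookie": cookie_header(sample_cookies)}
print(headers["cookie"])
```

The resulting dictionary could be merged into the request headers before retrying the basic request.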

# How You Can Use This Code

My logic flow:

1. Get the website/page that has "out of stock" for the item you want
2. Add the link into the selenium code portion
3. Add email alerts (?) for restock notifications
4. Add your code with your specifications to a shell script to execute continuously 

I added code at the bottom that could be used to email yourself an alert when an item comes back in stock. I also added the simple shell script code that you would put this Python code into.
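As an alternative to a shell loop, step 4 could also be sketched in Python. The `poll_until_in_stock` function below is a hypothetical helper, not part of the notebook's code: it calls a stock-check function repeatedly and waits between attempts. The interval and the maximum number of checks are arbitrary defaults, and the `sleep` parameter exists only so the loop can be tried without real waiting.

```python
import time


def poll_until_in_stock(check_stock, interval_seconds=600.0, max_checks=6, sleep=time.sleep):
    """Call check_stock() up to max_checks times, pausing between attempts.

    Returns the 1-based attempt number on which the item was in stock,
    or None if it never was.
    """
    for attempt in range(1, max_checks + 1):
        if check_stock():
            return attempt
        if attempt < max_checks:
            sleep(interval_seconds)
    return None


# Stand-in for a real scrape: out of stock twice, then in stock.
fake_results = iter([False, False, True])
found_on = poll_until_in_stock(lambda: next(fake_results), interval_seconds=0, max_checks=5)
print(f"In stock on attempt: {found_on}")
```

In real use, `check_stock` would wrap the Selenium fetch and parsing code and return `True` when the item is available. A long interval keeps the number of requests to the site low.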

# Questions

1. Can I cause issues by submitting a bunch of requests? Does that give me information on rate limits?
2. Can I get in trouble and/or banned for scraping?
3. What kind of capabilities do scraping APIs offer? Are they used by bots and scalpers? How useful are they for this task?

# Conclusion and Future Work

I learned that Selenium is a must for scraping sites that block basic requests. I also don't know how these companies deal with bots, and I would not want them to block my IP, so I would also use proxies to rotate my IP address. If I were to run these searches on a larger scale, for example for multiple people wanting this information or for multiple items, I would look into using a scraping API, both for the learning experience and for the convenience. I would also use a Google Chrome driver instead of the Safari driver, because it would make the email notification portion easier. I didn't because I didn't know how to downgrade Google Chrome to match the driver (my installed Chrome was newer than the newest available driver).

If I continued this project, I would like to create a website for ease of use, where someone could enter some basic information and sign up for restock alerts. That would require me to, at a minimum:
- Pay for a scraping API service
- Create a website
- Implement a backend database for emails of users




In [125]:
import json
import os

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.safari.service import Service as SafariService

In [None]:
# Basic Request code
# def main():
#     url = "https://www.vancleefarpels.com/us/en/collections/jewelry/alhambra/vcara45900---vintage-alhambra-pendant.html"
    
#     response = requests.get(url)
    
#     soup = BeautifulSoup(response.content, "html.parser")
#     #elements = soup.find_all(class_="comment")
    
#     print(f"Elements: {len(elements)}")
#     print(f"Scraping: {url}")
#     #print(response.content)

In [128]:
# I used the Safari driver because I couldn't downgrade Chrome to match the available driver.
# `url` must already be set (see the URL cells in the CatbirdNYC and VCA sections).
driver = webdriver.Safari()
driver.get(url)

# Grab the rendered page source once, then parse it with BeautifulSoup
page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")
#print(page_source)

driver.quit()

# CatbirdNYC

In [126]:
# Set exactly one URL; the recorded output further down used the CatbirdNYC page.
#url = 'https://laurieflemingjewellery.com/products/hidden-fairy-charm'
#url = 'https://www.cartier.com/en-us/jewelry/rings/love/love-ring-CRB4084600.html'
url = 'https://www.catbirdnyc.com/jewelry/collections/city-exclusive-charms.html'

In [129]:
item_you_want = 'Cherry Blossom Gold Charm'

In [136]:
def main():
    # Each product on the listing page is an <li> with these classes
    elements = soup.find_all('li', class_="item product product-item")
    # The <product-card> tags carry the data-is-sold-out attribute;
    # this assumes they appear in the same order as the <li> elements
    product_cards = soup.find_all('product-card')

    print(f"Elements Found: {len(elements)}")
    for i, element in enumerate(elements):
        #print(element.prettify())  # Pretty-print HTML for debugging

        name_tag = element.find('a', class_='product-item-link')  # name of the item
        name = name_tag.text.strip() if name_tag else 'No name found'
        if name != item_you_want:
            continue

        # Guard against the two lists having different lengths
        sold_out = product_cards[i].get('data-is-sold-out') if i < len(product_cards) else None
        if sold_out == "1":
            print("it is sold out")
        # Uncomment the next two lines if you want your computer to alert you (macOS `say`)
        # else:
        #     os.system('say "Hi, your item is in stock."')

        price_tag = element.find('span', class_='price')
        price = price_tag.text.strip() if price_tag else 'No price found'

        print(f"Product Name: {name}")
        print(f"Price: {price}")
        print('-' * 40)

    print(f"Scraping: {url}")
    

# VCA section


I tried inspecting the web page to see what information my computer sends to VCA, and then entered that information manually in my request headers, but it didn't work. I think the cookie information was important, but maybe the session-specific information didn't match. In any case, when I removed the cookie information I got an immediate access denied, whereas with the cookies my script just hung while attempting the request.


In [None]:
#url = "https://www.vancleefarpels.com/us/en/collections/jewelry/alhambra/vcara45900---vintage-alhambra-pendant.html"
#VCA w/o stock
# url = 'https://www.vancleefarpels.com/us/en/collections/jewelry/alhambra/vcard34900---vintage-alhambra-pendant.html'

In [127]:
headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "accept-encoding": "gzip, deflate, br, zstd",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "max-age=0",
    "sec-ch-ua": '"Not/A)Brand";v="99", "Google Chrome";v="115", "Chromium";v="115"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "Windows",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
}

In [None]:
# For example, VCA alhambra necklace
# Need to specify the material and gold color
# Available example
item_you_want = 'Vintage Alhambra pendant'
material_type = '18K yellow gold, Mother-of-pearl'

# Not available example
# item_you_want = 'Vintage Alhambra pendant'
# material_type = '18K white gold, Chalcedony'

In [None]:
def main():

    elements = soup.find_all('main', class_="vca-main")
    
    print(f"Elements Found: {len(elements)}")
    for i, element in enumerate(elements):
#         print(f"Element {i + 1}:")
#         print("Raw HTML:")
#         print(element.prettify())  # Pretty-print HTML for better readability

        #name_tag = element.find('a', class_='product-item-link') # finds name of item 
        
        #name = name_tag.text.strip() if name_tag else 'No name found'
        availability = element.find('section',class_='vca-product vca-product-v1 vca-component')
        #print(availability)
        if availability:
            data_tracking = availability.get('data-tracking')
            #print(data_tracking)
            json_data_tracking = json.loads(data_tracking)
            name = json_data_tracking[0].get('item_name')
            price = json_data_tracking[0].get('price')
            
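            # item_sellable may be decoded as the string 'true' or the boolean True,
            # depending on how the page encodes it, so normalize before comparing
            item_sellable = json_data_tracking[0].get('item_sellable')
            
            if str(item_sellable).lower() == 'true':
                print("Item is available")
                os.system('say "Hi, your item is in stock."')
            else:
                print("Item is not available")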

            print(f"Product Name: {name}")
            print(f"Price: {price}")
            print('-' * 40)

    print(f"Scraping: {url}")
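The `data-tracking` handling above can be tried without a browser. This sketch parses a made-up attribute value shaped like the one the VCA code reads: a JSON list whose first entry has `item_name`, `price`, and `item_sellable`. The sample values, including the price, are placeholders rather than data from the site, and `parse_tracking` is a hypothetical helper.

```python
import json

# Invented sample; the real attribute value comes from the page's <section> tag.
SAMPLE_DATA_TRACKING = (
    '[{"item_name": "Vintage Alhambra pendant", "price": "1000", "item_sellable": "true"}]'
)


def parse_tracking(data_tracking):
    """Extract name, price, and availability from a data-tracking JSON string."""
    first_item = json.loads(data_tracking)[0]
    # Accept both the string "true" and the JSON boolean true
    sellable = str(first_item.get("item_sellable")).lower() == "true"
    return {
        "name": first_item.get("item_name"),
        "price": first_item.get("price"),
        "sellable": sellable,
    }


info = parse_tracking(SAMPLE_DATA_TRACKING)
print(f"{info['name']} available: {info['sellable']}")
```

Normalizing through `str(...).lower()` means the check does not depend on whether the site encodes the flag as a string or as a boolean.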

In [137]:
if __name__ == "__main__":
    main()

Elements Found: 13
it is sold out
Product Name: Cherry Blossom Gold Charm
Price: $144.00
----------------------------------------
Scraping: https://www.catbirdnyc.com/jewelry/collections/city-exclusive-charms.html


# Email alerts

You have to do some extra configuration if you want to email yourself; your Gmail account settings are the place to look. I didn't set it up here because it was annoying and I don't want to put my email address in this notebook.

In [109]:
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

In [111]:
# Open a plain text file for reading. The file is read as UTF-8
# and its contents become the body of the email.
textfile = 'textfile.txt'
with open(textfile, 'r', encoding='utf-8') as fp:
    # Create a text/plain message
    msg = MIMEText(fp.read(), 'plain', 'utf-8')

# Sender and recipient email addresses
me = "xxx@gmail.com"
you = "xxx@gmail.com"

# Create the email headers
msg['Subject'] = f'The contents of {textfile}'
msg['From'] = me
msg['To'] = you

# Send the message via an SMTP server on localhost.
# This fails with "Connection refused" if no local SMTP server is running.
try:
    with smtplib.SMTP('localhost') as s:
        s.sendmail(me, [you], msg.as_string())
    print('Email sent successfully!')
except Exception as e:
    print(f'Failed to send email: {e}')

Failed to send email: [Errno 61] Connection refused
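The connection was refused because nothing was listening for SMTP on localhost. A minimal sketch of the Gmail route is below, under a few assumptions: it connects to `smtp.gmail.com` on port 465 over SSL, it expects a Gmail app password (as far as I know, Gmail requires one for this kind of login rather than the normal account password), and the `GMAIL_APP_PASSWORD` environment variable name, the addresses, and the helper names `build_alert` and `send_alert` are all placeholders. Only the message-building part runs here; the send call is left commented out.

```python
import os
import smtplib
from email.message import EmailMessage


def build_alert(sender, recipient, item_name, url):
    """Create a plain-text restock alert message."""
    msg = EmailMessage()
    msg["Subject"] = f"Restock alert: {item_name}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"{item_name} appears to be in stock:\n{url}")
    return msg


def send_alert(msg, app_password, host="smtp.gmail.com", port=465):
    """Log in and send the message over an SSL connection."""
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(msg["From"], app_password)
        server.send_message(msg)


alert = build_alert("xxx@gmail.com", "xxx@gmail.com", "Cherry Blossom Gold Charm",
                    "https://www.catbirdnyc.com/jewelry/collections/city-exclusive-charms.html")
print(alert["Subject"])

# To actually send, set the (hypothetical) environment variable and uncomment:
# send_alert(alert, os.environ["GMAIL_APP_PASSWORD"])
```

Reading the password from an environment variable keeps it out of the notebook, which also avoids having to commit credentials alongside the code.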
