## Scraping SHEIN with Beautiful Soup

#### Libraries
* requests: Sends HTTP requests.
* beautifulsoup4: Pulls data out of HTML and XML files.
* lxml: Provides a powerful API for parsing HTML and XML.

In [10]:
import requests
from bs4 import BeautifulSoup

When we send an HTTP request, we send some headers along with the request.

To see the headers your browser sends, open the developer tools (Ctrl + Shift + I), switch to the Network tab, and reload any page. The request headers are listed for each request.

Reference: https://developer.mozilla.org/en-US/docs/Glossary/Request_header

In [11]:
headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Accept-Language': 'en-US,en;q=0.5'
}
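As a quick check of what these headers change, the sketch below compares the default `User-Agent` that `requests` sends with the one from the `headers` dictionary. It only prepares a request and never sends it. The `https://example.com/` URL is a placeholder assumption, and the `headers` dictionary is repeated so the snippet runs on its own.

```python
import requests

# Same header values as the cell above, repeated so this snippet is self-contained
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.5'
}

# Headers requests would send if we supplied none of our own
default_headers = requests.utils.default_headers()
print('Default User-Agent:', default_headers['User-Agent'])

# Build the request without sending it; the session merges our headers
# over its defaults, which is also what happens inside requests.get()
session = requests.Session()
request = requests.Request('GET', 'https://example.com/', headers=headers)  # placeholder URL
prepared = session.prepare_request(request)

for name, value in prepared.headers.items():
    print(f'{name}: {value}')
```

The default `User-Agent` identifies the client as `python-requests`, which is why the dictionary above replaces it with a browser-style value.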

Every browser has an Inspect option for viewing a page's HTML (Ctrl + Shift + I). For example, open [this page](https://ca.shein.com/SHEIN-CURVE-Women-s-Plus-Size-Color-Block-Round-Neck-Sleeveless-Dress-p-28209235.html?src_identifier=uf=caadgpm_PlusSize_20230814_GPM&src_module=ads&mallCode=1&pageListType=4&imgRatio=3-4).

Make sure you are on the Elements tab in the developer tools panel.

* Use the .find() method to find the h1 element that holds the product title.
* Pass the product-intro__head-name class in a dictionary called attrs, which accepts the attributes to match.
* The .get_text() method returns the element's text as a string.
* The .strip() method removes any leading and trailing whitespace.
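The steps above can be tried on a small inline HTML string before touching the live site. This is a minimal sketch: the markup is made up for illustration and only borrows the class names used in `get_product_details`, and it uses Python's built-in `html.parser` (an assumption, to avoid needing lxml for the demo).

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the two elements the scraper looks for
html = """
<div class="product-intro">
  <h1 class="product-intro__head-name">
    Example Sleeveless Dress
  </h1>
  <del class="del-price">$28.99</del>
</div>
"""

soup = BeautifulSoup(html, features="html.parser")

# .find() returns the first matching tag
title_tag = soup.find('h1', attrs={'class': 'product-intro__head-name'})

# .get_text() keeps the surrounding newlines and indentation; .strip() removes them
raw_title = title_tag.get_text()
title = raw_title.strip()

# Keep only the part after the dollar sign
price_text = soup.find('del', attrs={'class': 'del-price'}).get_text().strip()
price = price_text.split('$')[1]

# .find() returns None when nothing matches, so calling .get_text() on it would fail
missing = soup.find('h1', attrs={'class': 'no-such-class'})

print(repr(raw_title))
print(title)
print(price)
print(missing)
```

Because `.find()` returns `None` for a missing element, chaining `.get_text()` directly onto it raises an `AttributeError` whenever the page layout changes.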

In [12]:
# accepts a url and returns a dictionary
def get_product_details(product_url: str) -> dict:
  # Create an empty product details dictionary
  product_details = {}

  # Get the product page content and create a soup
  page = requests.get(product_url, headers=headers)
  soup = BeautifulSoup(page.content, features="lxml")
  try:
    # Scrape the product details
    title = soup.find('h1', attrs={'class': 'product-intro__head-name'}).get_text().strip()
    extracted_price = soup.find('del', attrs={'class': 'del-price'}).get_text().strip()
    price = extracted_price.split('$')[1]

    # Adding it to the product details dictionary
    product_details['title'] = title
    product_details['price'] = price
    product_details['product_url'] = product_url

    # Return the product details dictionary
    return product_details
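  except Exception as e:
    # A missing element makes .find() return None, so .get_text() raises an AttributeError
    print('Could not fetch product details')
    print(f'Failed with exception: {e}')
    # Return the empty dictionary so the result always matches the annotated type
    return product_details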


In [13]:
product_url = input('Enter product url: ')

In [14]:
product_details = get_product_details(product_url)
print(product_details)

{'title': "SHEIN CURVE+ Women's Plus Size Color Block Round Neck Sleeveless Dress", 'price': '28.99', 'product_url': 'https://ca.shein.com/SHEIN-CURVE-Women-s-Plus-Size-Color-Block-Round-Neck-Sleeveless-Dress-p-28209235.html?src_identifier=uf=caadgpm_PlusSize_20230814_GPM&src_module=ads&mallCode=1&pageListType=4&imgRatio=3-4'}
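The `price` value in the output above is a string. If it needs to be compared or summed, one option is to convert it to a `Decimal`. The `parse_price` helper below is hypothetical (it is not part of the notebook), and the sample dictionary is made up for illustration.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(price_text: str) -> Optional[Decimal]:
    """Hypothetical helper: turn a scraped price string such as '28.99' into a Decimal."""
    # Drop thousands separators and surrounding whitespace before converting
    cleaned = price_text.replace(',', '').strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # The text was not a plain number, for example an empty string
        return None


# Made-up sample in the same shape as the scraper's output
product_details = {'title': 'Example Dress', 'price': '28.99'}

price = parse_price(product_details['price'])
print(price)
print(price * 2)
```

`Decimal` is used here instead of `float` so that currency amounts keep their exact decimal representation.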
