### Product data extraction

This notebook shows how to extract product information (title, price, currency, image and description) from web pages into a simple Python dictionary.
Feel free to copy it and use it in your own code.

I wrote this during a hackathon night at https://jibia.nl

Requirements:  
- Python 3.7  
- requests  
- extruct  
- re (part of the Python standard library, no install needed)

In [7]:
import requests
import extruct
import re

In [2]:
# Random url examples from googling where to buy an xbox one x
page_urls = [
    "https://www.wehkamp.nl/microsoft-xbox-one-x-213121/",
    "https://www.bol.com/nl/p/xbox-one-x-console-1-tb-battlefield-v-battlefield-1943/9200000097502294/",
    "https://www.walmart.com/ip/Microsoft-Xbox-One-X-1TB-Console-Black-CYV-00001/276629190",
    "https://www.google.com",
    "https://www.bing.com"
]

In [3]:
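# Match runs of exactly 13 digits. The lookarounds reject longer digit runs
# and still match when the number sits at the very start or end of the page.
gtin13regex = re.compile(r"(?<![0-9])([0-9]{13})(?![0-9])")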
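
Any 13-digit number on the page matches the regex above, including timestamps and internal IDs. One way to narrow the candidates is to check the GTIN-13 check digit. The sketch below is not part of the original notebook; `is_valid_gtin13` and `filter_gtin13` are hypothetical helper names. It assumes the standard GS1 scheme: the first 12 digits are weighted 1, 3, 1, 3, ... from the left, and the 13th digit brings the total up to a multiple of 10.

```python
def is_valid_gtin13(candidate: str) -> bool:
    """Return True if `candidate` is 13 ASCII digits with a correct check digit."""
    if len(candidate) != 13 or not (candidate.isascii() and candidate.isdigit()):
        return False
    digits = [int(ch) for ch in candidate]
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits (left to right).
    weighted_sum = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    # The check digit tops the weighted sum up to the next multiple of 10.
    return (10 - weighted_sum % 10) % 10 == digits[12]


def filter_gtin13(candidates):
    """Keep only the candidates whose check digit is consistent."""
    return {c for c in candidates if is_valid_gtin13(c)}


# The first number is the one embedded in the wehkamp image URL in the output
# further down; the second is an arbitrary 13-digit string used as a counterexample.
candidates = {"0889842208337", "1234567890123"}
valid_candidates = filter_gtin13(candidates)
print(valid_candidates)
```

A passing check digit does not prove that a number is the product's GTIN; it only discards roughly nine out of ten random 13-digit strings.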

In [28]:
extracted = []
for page_url in page_urls:
    # A timeout keeps one unresponsive site from stalling the whole loop.
    page = requests.get(page_url, timeout=10).text
    
    machine_readable_data = extruct.extract(page)

    json_ld_product = [x for x in machine_readable_data['json-ld'] if "@type" in x and x["@type"] == 'Product']
    microdata_product = [x for x in machine_readable_data['microdata'] if "type" in x and x["type"] == 'http://schema.org/Product']

    # Collect every distinct 13-digit number on the page. If exactly one is found, it is most likely the product's GTIN-13.
    gtin13_regexed = set(gtin13regex.findall(page))
    
    # Cannot verify that this is actually a product, skip.
    if not json_ld_product and not microdata_product:
        continue

    # Prefer JSON-LD when both formats are present.
    if json_ld_product:
        json_ld = json_ld_product[0]
        title = json_ld['name']
        price = json_ld['offers']['price']
        currency = json_ld['offers']['priceCurrency']
        # 'image' may be a single object or a list of objects.
        image = json_ld['image'][0]['url'] if isinstance(json_ld['image'], list) else json_ld['image']['url']
        description = json_ld['description']
    else:
        # Only microdata is available at this point.
        microdata = microdata_product[0]
        title = microdata['properties']['name']
        price = microdata['properties']['offers'][0]['properties']['price']
        currency = microdata['properties']['offers'][0]['properties']['priceCurrency']
        image = microdata['properties']['image'][0]
        description = microdata['properties']['description']
    
    extracted.append({
        "json_ld": json_ld_product, 
        "microdata": microdata_product, 
        "gtin13_regexed": gtin13_regexed,
        "sample_product_summary": {
            'url': page_url,
            'title': title,
            'price': price,
            'currency': currency,
            'image': image,
            'description': description
        }
    })
    
# print(extracted)
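
The loop above assumes one particular shape for the JSON-LD data: `offers` is a single object and `image` is an object (or list of objects) with a `url` key. Other sites may publish `offers` as a list, or `image` as a plain URL string or a list of strings. Below is a minimal sketch of how those shapes could be normalized; `first_item`, `image_url` and `offer_fields` are hypothetical helpers that are not in the original notebook, and the sample dictionary is made up for illustration.

```python
def first_item(value):
    """Return the first element of a list, or the value itself if it is not a list."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def image_url(image):
    """Reduce an `image` value (string, object, or list of either) to one URL or None."""
    image = first_item(image)
    if isinstance(image, dict):
        return image.get("url")
    return image  # already a URL string, or None


def offer_fields(offers):
    """Return (price, currency) from an `offers` value, or (None, None) if absent."""
    offer = first_item(offers)
    if not isinstance(offer, dict):
        return None, None
    return offer.get("price"), offer.get("priceCurrency")


# Hand-written sample, not real data from any of the shops above.
sample_json_ld = {
    "@type": "Product",
    "name": "Example console",
    "image": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    "offers": [{"@type": "Offer", "price": "499", "priceCurrency": "EUR"}],
}

sample_price, sample_currency = offer_fields(sample_json_ld.get("offers"))
sample_image = image_url(sample_json_ld.get("image"))
print(sample_price, sample_currency, sample_image)
```

Returning `None` instead of raising `KeyError` lets the loop keep a partially filled summary when a page omits a field.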

In [27]:
for product in extracted:
    summary = product['sample_product_summary']
    print(f"""SUMMARY: {summary['title']}
    price: {summary['price']} {summary['currency']}
    image: {summary['image']}
    url: {summary['url']}
    description: {summary['description']}
""")

SUMMARY: Xbox One X
    price: 499 EUR
    image: https://images.wehkamp.nl/i/wehkamp/213121_pb_01/microsoft-xbox-one-x-zwart-0889842208337.jpg
    url: https://www.wehkamp.nl/microsoft-xbox-one-x-213121/
    description: De Xbox One X is de meest krachtige console met 40% meer vermogen dan de Xbox One S. De console maakt het 4K gaming &eacute;cht mogelijk. Games spelen beter op Xbox One X. Ze zien er fenomenaal uit en laden snel. Geniet van vloeiende 4K gameplay, zelfs op een 1080p scherm. De console heeft 6 teraflops grafische rekenkracht (GPU), 320 GB per seconde geheugensnelheid en 8 CPU cores. Uiteraard speel je ook de beste line-up van Xbox One, Xbox 360 en originele Xbox klassiekers op Xbox One X.<br><br><strong>Verbeterde console</strong><br>De 8-core Custom AMD CPU heeft een snelheid van 2,3GHz voor een verbeterde AI, realistischere werelden en vlottere interacties tijdens het spelen. <br>Een GPU met 6 teraflops zorgt voor 4K-omgevingen en karakters die realistischer zijn dan 