<a href="https://colab.research.google.com/github/simodepth/Structured-data/blob/main/Bulk_Extract_Structured_Data_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#How to extract and compare structured data against competitors in bulk

Structured data is a goldmine for technical SEO: a thorough, detailed implementation of schema markup on your most valuable pages can improve CTR and prompt Google to reward those pages with 'sexy' SERP features.


If you're looking for a quick way to analyze structured data on both your own website and your competitors' sites, this script may help.


#How to optimize Schema Markup?


🚨 **Make it simple** - write in ways that translate easily to structured data. Write your content and title tags in **[Triples](https://schemantra.com/blog/2022/05/22/structured-data-optimization-for-seo-and-for-semantic-seo/)** or `subject —> verb —> object` 

🚨 **Mark up relevant content only** - not every page needs schema markup, especially if it is not designed to bring additional value to the search results

🚨 **Mark up ONLY existing content on your page** - don't mark up information that is not actually on your page, as Google may penalize your rankings

🚨 **Avoid [Schema Stuffing](https://www.searchenginejournal.com/schema-stuffing-spam/449891/)** - adding multiple schemas to your content can do more harm than good if it misleadingly suggests that the content is something other than it is

🚨 **Define your primary entity using `mainEntity`**, and describe its relations to other entities using `url`, `sameAs` and `about`.
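
To make the last tip more concrete, here is a minimal sketch that builds a JSON-LD block in Python. Every name and URL in it is a hypothetical placeholder, not taken from a real site; only the property names (`mainEntity`, `url`, `sameAs`, `about`) come from schema.org.

```python
import json

# All names and URLs below are hypothetical placeholders.
page_markup = {
    "@context": "https://schema.org",
    "@type": "WebPage",
    "url": "https://www.example.com/grey-bunny-soft-toy.html",
    # `about` states the topic of the page
    "about": {
        "@type": "Thing",
        "name": "Stuffed toy",
        # `sameAs` points to a page that unambiguously identifies the topic
        "sameAs": "https://en.wikipedia.org/wiki/Stuffed_toy",
    },
    # `mainEntity` declares the primary entity described by the page
    "mainEntity": {
        "@type": "Product",
        "name": "Grey Bunny Soft Toy",
        "url": "https://www.example.com/grey-bunny-soft-toy.html",
    },
}

# Serialize the dict and wrap it in the script tag used for JSON-LD
json_ld = json.dumps(page_markup, indent=2)
script_tag = f'<script type="application/ld+json">\n{json_ld}\n</script>'
print(script_tag)
```

The resulting `<script>` tag is what you would place in the page's HTML; treat the structure as a starting point and validate it with a rich results testing tool before shipping.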

In [None]:
#@title Install Modules
!pip install extruct
!pip install w3lib


In [None]:
#@title Import Modules
import pandas as pd
import extruct
import requests
from w3lib.html import get_base_url
from urllib.parse import urlparse

#List the competing URLs to scrape (as many as you like)

In [None]:
sites = ['https://www.my1styears.com/personalised-silver-baby-jewellery-gift-set.html',
         'https://www.my1styears.com/personalised-grey-bunny-soft-toy.html',
         'https://www.my1styears.com/personalised-penguin-fleece-robe.html',
         ]
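
When the list grows, pasting URLs by hand gets error-prone. The sketch below is one possible way to turn a pasted block of text into a clean list; the helper name `clean_url_list` and the example URLs are assumptions made for illustration, not part of the original script.

```python
from urllib.parse import urlparse


def clean_url_list(raw_text):
    """Return unique http(s) URLs from a block of text, one URL per line."""
    seen = set()
    urls = []
    for line in raw_text.splitlines():
        url = line.strip()
        parsed = urlparse(url)
        # Skip blank lines and anything that is not an absolute http(s) URL
        if parsed.scheme not in ('http', 'https') or not parsed.netloc:
            continue
        # Keep the first occurrence only, preserving the original order
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical input: one duplicate and one invalid line
pasted = """
https://www.example.com/product-a.html
https://www.example.com/product-a.html
not-a-url
https://www.competitor.example/product-b.html
"""
example_sites = clean_url_list(pasted)
print(example_sites)
```

The returned list has the same shape as `sites`, so it could be assigned to `sites` directly.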

#Extract the metadata for one sample page

In [None]:
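def extract_metadata(url):
    """Fetch a page and return its structured data, keyed by syntax."""

    # A timeout stops one slow server from hanging the whole run
    r = requests.get(url, timeout=30)
    # Fail loudly on 4xx/5xx responses instead of parsing an error page
    r.raise_for_status()
    base_url = get_base_url(r.text, r.url)
    metadata = extruct.extract(r.text, 
                               base_url=base_url,
                               uniform=True,
                               syntaxes=['json-ld',
                                         'microdata',
                                         'opengraph',
                                         'rdfa'])
    return metadata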
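
In a bulk run, a single unreachable page can stop the whole loop. One possible safeguard is a small wrapper that catches the failure and returns empty metadata together with the error message. This is only a sketch: `safe_extract` and the stand-in `failing_extractor` are hypothetical names, and the extractor is passed in as an argument so the example runs without network access.

```python
SYNTAXES = ('json-ld', 'microdata', 'opengraph', 'rdfa')


def safe_extract(url, extractor):
    """Run extractor(url) and return (metadata, error_message_or_None)."""
    try:
        return extractor(url), None
    except Exception as exc:  # deliberately broad: network, HTTP and parsing errors
        empty = {syntax: [] for syntax in SYNTAXES}
        return empty, f'{type(exc).__name__}: {exc}'


# Stand-in extractor, used only to demonstrate the failure path
def failing_extractor(url):
    raise TimeoutError('simulated timeout')


metadata, error = safe_extract('https://www.example.com/', failing_extractor)
print(metadata)
print(error)
```

In the notebook you would call `safe_extract(url, extract_metadata)` and log or store `error` whenever it is not `None`, so failed pages are not mistaken for pages without markup.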

In [None]:
metadata = extract_metadata('https://www.my1styears.com/personalised-grey-bunny-soft-toy.html')
metadata

{'json-ld': [{'@context': 'https://schema.org/',
   '@type': 'WebSite',
   'name': 'My 1st Years',
   'potentialAction': {'@type': 'SearchAction',
    'query-input': 'required name=search_term_string',
    'target': 'https://www.my1styears.com/catalogsearch/result/?q={search_term_string}'},
   'url': 'https://www.my1styears.com/'}],
 'microdata': [{'@context': 'http://schema.org',
   '@type': 'Product',
   'image': 'https://cdn.my1styears.com/media/catalog/product/cache/d8e9c4df3ed3fb9e8c3721796f593e45/s/e/se_50111091_new-core-bunny---grey-a.jpg',
   'name': 'Personalised Light Grey Bunny Soft Toy',
   'offers': {'@type': 'Offer', 'price': '28', 'priceCurrency': 'GBP'},
   'sku': 'SE_MFY5001S'}],
 'opengraph': [{'@context': {'fb': 'http://ogp.me/ns/fb#',
    'og': 'http://ogp.me/ns#',
    'product': 'http://ogp.me/ns/product#'},
   '@type': 'product',
   'og:description': '...',
   'og:image': 'https://cdn.my1styears.com/media/catalog/product/cache/0521aff7518f3552a798ff9cfed7f306/s/e/

#Investigate whether the URL is using a specific metadata type

In [None]:
def uses_metadata_type(metadata, metadata_type):
    """Return True if the page has at least one item in the given syntax."""
    return bool(metadata.get(metadata_type))

In [None]:
uses_metadata_type(metadata, 'opengraph')

True

In [None]:
uses_metadata_type(metadata, 'rdfa')

True

In [None]:
uses_metadata_type(metadata, 'json-ld')

True

In [None]:
uses_metadata_type(metadata, 'microdata')

True
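
The four separate checks can also be collapsed into a single call. The sketch below assumes the same dictionary shape that `extract_metadata` returns (syntax name mapped to a list of items); `summarize_syntaxes` and the sample dictionary are illustrative names and data, not part of the original script.

```python
SYNTAXES = ('json-ld', 'microdata', 'opengraph', 'rdfa')


def summarize_syntaxes(metadata, syntaxes=SYNTAXES):
    """Map each syntax to True when the page has at least one item in it."""
    return {syntax: bool(metadata.get(syntax)) for syntax in syntaxes}


# Hand-written sample: two syntaxes populated, one empty, one missing
sample_metadata = {
    'json-ld': [{'@type': 'WebSite'}],
    'microdata': [{'@type': 'Product'}],
    'opengraph': [],
}
summary = summarize_syntaxes(sample_metadata)
print(summary)
```

An empty list and a missing key are both reported as `False`, which matches the behaviour of `uses_metadata_type`.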

In [None]:
#@title Extract metadata usage for each site
rows = []

for url in sites:    
    metadata = extract_metadata(url)
    urldata = urlparse(url)

    rows.append({
        'url': urldata.netloc, 
        'microdata': uses_metadata_type(metadata, 'microdata'),
        'json-ld': uses_metadata_type(metadata, 'json-ld'),
        'opengraph': uses_metadata_type(metadata, 'opengraph'),
        'rdfa': uses_metadata_type(metadata, 'rdfa')              
    })

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df = pd.DataFrame(rows, columns = ['url', 'microdata', 'json-ld', 'opengraph', 'rdfa'])

df.head(10).sort_values(by='microdata', ascending=False)

| | url | microdata | json-ld | opengraph | rdfa |
|---|---|---|---|---|---|
| 0 | www.my1styears.com | True | True | True | True |
| 1 | www.my1styears.com | True | True | True | True |
| 2 | www.my1styears.com | True | True | True | True |
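
Because each row is labelled with the domain rather than the full URL, several pages from the same site produce identical labels. When comparing against competitors it can be handy to aggregate per domain instead. The following is a rough, standard-library sketch of that idea; `share_by_domain` and the example rows (including the domains) are hypothetical.

```python
from collections import defaultdict

SYNTAXES = ('microdata', 'json-ld', 'opengraph', 'rdfa')


def share_by_domain(rows, syntaxes=SYNTAXES):
    """Return {domain: {syntax: share of that domain's pages using it}}."""
    page_counts = defaultdict(int)
    hit_counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        domain = row['url']
        page_counts[domain] += 1
        for syntax in syntaxes:
            if row[syntax]:
                hit_counts[domain][syntax] += 1
    return {
        domain: {syntax: hit_counts[domain][syntax] / total for syntax in syntaxes}
        for domain, total in page_counts.items()
    }


# Hypothetical rows in the same shape as the table above
example_rows = [
    {'url': 'www.example.com', 'microdata': True, 'json-ld': True, 'opengraph': True, 'rdfa': False},
    {'url': 'www.example.com', 'microdata': False, 'json-ld': True, 'opengraph': True, 'rdfa': False},
    {'url': 'www.competitor.example', 'microdata': True, 'json-ld': False, 'opengraph': True, 'rdfa': False},
]
shares = share_by_domain(example_rows)
for domain, values in shares.items():
    print(domain, values)
```

With the DataFrame from the previous cell, `df.to_dict('records')` would give a list of rows in the expected shape.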


#Examine the specific metadata used

In [None]:
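def key_exists(items, key):
    """Return True if any top-level item has the given schema.org '@type'."""

    for item in items:
        item_type = item.get('@type')
        # '@type' may be missing, a single string, or a list of strings
        if item_type == key or (isinstance(item_type, list) and key in item_type):
            return True
    return False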
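
Note that `key_exists` only looks at top-level items. In the sample output above, the `Offer` sits inside the `offers` property of the `Product`, so a top-level check will not find it. If nested types matter for your comparison, a recursive walk is one option. The sketch below uses a hypothetical helper, `find_types`, and a sample trimmed from the microdata output shown earlier.

```python
def find_types(node):
    """Return the set of every '@type' value found anywhere inside node."""
    found = set()
    if isinstance(node, dict):
        node_type = node.get('@type')
        # '@type' can be a single string or a list of strings
        if isinstance(node_type, str):
            found.add(node_type)
        elif isinstance(node_type, list):
            found.update(t for t in node_type if isinstance(t, str))
        # Recurse into every property value to reach nested items
        for value in node.values():
            found |= find_types(value)
    elif isinstance(node, list):
        for item in node:
            found |= find_types(item)
    return found


# Trimmed from the microdata output shown earlier
sample_microdata = [{
    '@context': 'http://schema.org',
    '@type': 'Product',
    'name': 'Personalised Light Grey Bunny Soft Toy',
    'offers': {'@type': 'Offer', 'price': '28', 'priceCurrency': 'GBP'},
    'sku': 'SE_MFY5001S',
}]
print(sorted(find_types(sample_microdata)))  # ['Offer', 'Product']
```

A check such as `'Offer' in find_types(metadata['microdata'])` would then report nested offers as present.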

#Scrape specific metadata usage per site
---
We loop over the URLs, scrape the HTML, extract the metadata, and then check whether each schema type (Organization, Product, Offer and so on) is implemented in each metadata syntax.

In [None]:
# Schema types to look for, and the syntaxes to check them in
schema_types = ['Organization', 'Product', 'Offer', 'Review',
                'AggregateRating', 'BreadcrumbList']
syntaxes = ['json-ld', 'microdata']

rows = []

for url in sites:    
    metadata = extract_metadata(url)
    urldata = urlparse(url)

    row = {'url': urldata.netloc}
    for schema_type in schema_types:
        for syntax in syntaxes:
            # Column names follow the pattern 'product-json-ld', 'product-microdata', ...
            column = f'{schema_type.lower()}-{syntax}'
            row[column] = key_exists(metadata[syntax], schema_type)

    rows.append(row)

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df_specific = pd.DataFrame(rows)

df_specific.sort_values(by='url', ascending=False).head(3).T


| | 0 | 1 | 2 |
|---|---|---|---|
| url | www.my1styears.com | www.my1styears.com | www.my1styears.com |
| organization-json-ld | False | False | False |
| organization-microdata | False | False | False |
| product-json-ld | False | False | False |
| product-microdata | True | True | True |
| offer-json-ld | False | False | False |
| offer-microdata | False | False | False |
| review-json-ld | False | False | False |
| review-microdata | False | False | False |
| aggregaterating-json-ld | False | False | False |
