<a href="https://colab.research.google.com/github/simodepth/Structured-data/blob/main/Bulk_Extract_Structured_Data_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#How to extract and compare structured data against competitors in bulk

Structured data represent a goldmine in your technical SEO asset as a thorough and detailed implementation of schemas on valuable pages can positively impact  CTR and prompt Google to tip your webpages with 'sexy' SERP features.


If you're looking for a quick win-win method to analyze structured data from both your website and your competitors, this script may be to your help.


#How to optimize Schema Markup ?


🚨 **Make it simple** - write in ways that translate easily to structured data. Write your content and title tags in **[Triples](https://schemantra.com/blog/2022/05/22/structured-data-optimization-for-seo-and-for-semantic-seo/)** or `subject —> verb —> object` 

🚨 **Markup relevant content only** - not all the pages need a schema mark up if they are no not designed to bring additional value to the search

🚨 **Mark up ONLY existing content on your page** - you don’t want to pinpoint a piece of information which is not on your page as Google may penalize your rankings

🚨 **Avoid [Schema Stuffing](https://www.searchenginejournal.com/schema-stuffing-spam/449891/)** - including multiple schemas to your content may do nothing but harm if it results in misleading suggestions that the content is something other than it is

🚨 **Define your primary entity using mainEntity**, and describe its relations to other entities using `URL`, `sameAs` and `About`.

In [1]:
#@title Install Modules
!pip install extruct
!pip install w3lib.htmml

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting extruct
  Downloading extruct-0.13.0-py2.py3-none-any.whl (25 kB)
Collecting jstyleson
  Downloading jstyleson-0.0.2.tar.gz (2.0 kB)
Collecting mf2py
  Downloading mf2py-1.1.2.tar.gz (25 kB)
Collecting rdflib-jsonld
  Downloading rdflib_jsonld-0.6.2-py2.py3-none-any.whl (4.0 kB)
Collecting w3lib
  Downloading w3lib-1.22.0-py2.py3-none-any.whl (20 kB)
Collecting rdflib
  Downloading rdflib-6.2.0-py3-none-any.whl (500 kB)
[K     |████████████████████████████████| 500 kB 48.3 MB/s 
Collecting pyrdfa3
  Downloading pyRdfa3-3.5.3-py3-none-any.whl (121 kB)
[K     |████████████████████████████████| 121 kB 70.9 MB/s 
Collecting html-text>=0.5.1
  Downloading html_text-0.5.2-py2.py3-none-any.whl (7.5 kB)
Collecting isodate
  Downloading isodate-0.6.1-py2.py3-none-any.whl (41 kB)
[K     |████████████████████████████████| 41 kB 636 kB/s 
Building wheels for collected packages: jstyleso

In [2]:
#@title Import Modules
import pandas as pd
import extruct
import requests
from w3lib.html import get_base_url
from urllib.parse import urlparse

#Have an unlimited list of competing URLs to scrape

In [3]:
sites = ['https://www.morrisons.com/help/', 
'https://groceries.morrisons.com/products/pukka-steak-slice-569177011',
'https://www.morrisons-corporate.com/about-us/',
'https://groceries.morrisons.com/webshop/bundle/breaded-chicken-salad-wrap-bundle/1006702395',
'https://groceries.morrisons.com/',
'https://groceries.morrisons.com/content/recipes-by-morrisons-33805?clkInTab=Recipes',
'https://groceries.morrisons.com/on-offer',
'https://my.morrisons.com/storefinder/'
'https://www.morrisons.jobs/'
]
         

#Extract the metadata for one sample page

In [4]:
def extract_metadata(url):

    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    metadata = extruct.extract(r.text, 
                               base_url=base_url,
                               uniform=True,
                               syntaxes=['json-ld',
                                         'microdata',
                                         'opengraph',
                                         'rdfa'])
    return metadata

In [5]:
metadata = extract_metadata('https://www.morrisons-corporate.com/about-us/')
metadata

{'json-ld': [],
 'microdata': [{'@context': 'http://schema.org',
   '@type': 'WebPage',
   'value': "About Us\nStrategy Company History Board Members' Biographies Anti Corruption Property Group Tax Strategy Hong Kong Office Whistleblowing Statement\nInvestor Centre\nAnnual Report Financial Reports Shareholder Information Presentations Investor Contacts Register for Morrisons Q2 results call\nSustainability\nPurpose Planet People Policies Performance Ethical Trading\nMedia Centre\nCorporate News News Archive\nSuppliers\nSupplying Morrisons Supplier Information GSCOP - Contact details and information for suppliers\nhome About Us expand_more\n\nAbout Us From a Bradford market stall to the UK’s 4th largest supermarket chain\n\nLeadership Meet the board\n\narrow_forward\n\nLearn more about our history and key milestones More information\n\narrow_forward\n\nVisit our section dedicated to investors Read more\n\narrow_forward\n\nLatest News\n\nMorrisons unveils new ‘My Morrisons’ app\n\n07 Jul

#Investigate whether the URL is using a specific metadata type

In [6]:
def uses_metadata_type(metadata, metadata_type):
    if (metadata_type in metadata.keys()) and (len(metadata[metadata_type]) > 0):
        return True
    else:
        return False

In [7]:
uses_metadata_type(metadata, 'opengraph')

True

In [8]:
uses_metadata_type(metadata, 'rdfa')

True

In [9]:
uses_metadata_type(metadata, 'json-ld')

False

In [10]:
uses_metadata_type(metadata, 'microdata')

True

In [12]:
#@title Extract metadata usage for each site
df = pd.DataFrame(columns = ['url', 'microdata', 'json-ld', 'opengraph', 'rdfa'])

for url in sites:    
    metadata = extract_metadata(url)
    urldata = urlparse(url)

    row = {
        'url': urldata.netloc + urldata.path, 
        'microdata': uses_metadata_type(metadata, 'microdata'),
        'json-ld': uses_metadata_type(metadata, 'json-ld'),
        'opengraph': uses_metadata_type(metadata, 'opengraph'),
        'rdfa': uses_metadata_type(metadata, 'rdfa')              
    }

    df = df.append(row, ignore_index=True)

df.head(10).sort_values(by='microdata', ascending=False)

Unnamed: 0,url,microdata,json-ld,opengraph,rdfa
2,www.morrisons-corporate.com/about-us/,True,False,True,True
0,www.morrisons.com/help/,False,False,False,False
1,groceries.morrisons.com/products/pukka-steak-s...,False,False,True,True
3,groceries.morrisons.com/webshop/bundle/breaded...,False,False,False,False
4,groceries.morrisons.com/,False,False,False,True
5,groceries.morrisons.com/content/recipes-by-mor...,False,False,False,True
6,groceries.morrisons.com/on-offer,False,False,False,True
7,my.morrisons.com/storefinder/https://www.morri...,False,False,False,False


#Examine the specific metadata used

In [13]:
def key_exists(dict, key):

    if not any(item['@type'] == key for item in dict):
        return False
    else:
        return True   

#Scrape specific metadata usage per site


---
We’re looping over the URLs, scraping the HTML, extracting the metadata, and then checking each key to see whether it is implemented by a given metadata type.

In [14]:
metadata = extract_metadata('https://groceries.morrisons.com/')
metadata

{'json-ld': [],
 'microdata': [],
 'opengraph': [],
 'rdfa': [{'@id': 'https://groceries.morrisons.com/webshop/startWebshop.do#clohp-carousel__slide114017',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#tabpanel'}]},
  {'@id': 'https://groceries.morrisons.com/webshop/startWebshop.do#clohp-carousel__navItem113536',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#tab'}]},
  {'@id': 'https://groceries.morrisons.com/webshop/startWebshop.do#clohp-carousel__navItem114017',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#tab'}]},
  {'@id': 'https://groceries.morrisons.com/webshop/startWebshop.do#clohp-carousel__navItem113625',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#tab'}]},
  {'@id': '_:N3ca5290d50e745cd8f58417dce6603e3',
   'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#b

In [20]:
df_specific = pd.DataFrame(columns = ['url-path', 
                                      'organization-json-ld', 
                                      'organization-microdata',                                   
                                      'product-json-ld', 
                                      'product-microdata',                  
                                      'offer-json-ld', 
                                      'offer-microdata',     
                                      'review-json-ld', 
                                      'review-microdata',   
                                      'aggregaterating-json-ld', 
                                      'aggregaterating-microdata',   
                                      'breadcrumblist-json-ld', 
                                      'breadcrumblist-microdata',            
                                     ])

for url in sites:    
    metadata = extract_metadata(url)
    urldata = urlparse(url)


    row = {
        'url-path': urldata.netloc + urldata.path, 
        'organization-json-ld': key_exists(metadata['json-ld'], 'Organization'),
        'organization-microdata': key_exists(metadata['microdata'], 'Organization'),
        'product-json-ld': key_exists(metadata['json-ld'], 'Product'),
        'product-microdata': key_exists(metadata['microdata'], 'Product'),
        'offer-json-ld': key_exists(metadata['json-ld'], 'Offer'),
        'offer-microdata': key_exists(metadata['microdata'], 'Offer'),
        'review-json-ld': key_exists(metadata['json-ld'], 'Review'),
        'review-microdata': key_exists(metadata['microdata'], 'Review'),
        'aggregaterating-json-ld': key_exists(metadata['json-ld'], 'AggregateRating'),
        'aggregaterating-microdata': key_exists(metadata['microdata'], 'AggregateRating'),
        'breadcrumblist-json-ld': key_exists(metadata['json-ld'], 'BreadcrumbList'),
        'breadcrumblist-microdata': key_exists(metadata['microdata'], 'BreadcrumbList'),
    }

    df_specific = df_specific.append(row, ignore_index=True)

df_specific.sort_values(by='url-path', ascending=False).head(9).T



Unnamed: 0,0,2,7,3,1,6,5,4
url-path,www.morrisons.com/help/,www.morrisons-corporate.com/about-us/,my.morrisons.com/storefinder/https://www.morri...,groceries.morrisons.com/webshop/bundle/breaded...,groceries.morrisons.com/products/pukka-steak-s...,groceries.morrisons.com/on-offer,groceries.morrisons.com/content/recipes-by-mor...,groceries.morrisons.com/
organization-json-ld,False,False,False,False,False,False,False,False
organization-microdata,False,False,False,False,False,False,False,False
product-json-ld,False,False,False,False,False,False,False,False
product-microdata,False,False,False,False,False,False,False,False
offer-json-ld,False,False,False,False,False,False,False,False
offer-microdata,False,False,False,False,False,False,False,False
review-json-ld,False,False,False,False,False,False,False,False
review-microdata,False,False,False,False,False,False,False,False
aggregaterating-json-ld,False,False,False,False,False,False,False,False
