<a href="https://colab.research.google.com/github/simodepth/sitemap/blob/main/%E2%AD%90%EF%B8%8FSitemap_Tech_Audit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automate Sitemap Audit with Python

XML sitemaps are designed to make life easier for search engines by providing an index of a site’s URLs. However, they’re also a useful tool in competitor analysis and allow you to quickly identify all of a site’s pages and the level of importance the site assigns to each page.

The following Python script can save you a good deal of manual technical research while surfacing plenty of useful data insights along the way.

# Requirements and Assumptions

- Run the script on **Google Colab**
- Install the `advertools` and `pandas` packages with `!pip install`
- Remember that **[sitemaps are a recommendation](https://developers.google.com/search/docs/advanced/sitemaps/build-sitemap)** to Google about which pages you think are important; Google does not pledge to crawl every URL in a sitemap.


In [1]:
#@title Install Packages

!pip install advertools
!pip install pandas 


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting advertools
  Downloading advertools-0.13.1-py2.py3-none-any.whl (309 kB)
Collecting scrapy
  Downloading Scrapy-2.6.1-py2.py3-none-any.whl (264 kB)
Collecting twython
  Downloading twython-3.9.1-py3-none-any.whl (33 kB)
Collecting itemadapter>=0.1.0
  Downloading itemadapter-0.6.0-py3-none-any.whl (10 kB)
Collecting PyDispatcher>=2.0.5
  Downloading PyDispatcher-2.0.5.zip (47 kB)
Collecting queuelib>=1.4.2
  Downloading queuelib-1.6.2-py2.py3-none-any.whl (13 kB)
Collecting zope.interface>=4.1.3
  Downloading zope.interface-5.4.0-cp37-cp37m-manylinux2010_x86_64.whl (251 kB)
Collecting w3lib>=1.17.0
  Downloading w3lib-1.22.0-py2.

In [2]:
#@title Import Packages
import advertools as adv

import pandas as pd

import requests

import time

import warnings
warnings.filterwarnings("ignore")

from lxml import etree

from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports

from google.colab import files 

display(HTML("<style>.container { width:100% !important; }</style>"))

In [3]:
#@title 1️⃣ Scrape URLs From The Sitemap
sitemap_url = "https://seodepths.com/sitemap_index.xml"
sitemap = adv.sitemap_to_df(sitemap_url)
sitemap.to_csv("sitemap.csv", index=False)  # skip the index so no "Unnamed: 0" column is written
sitemap_df = pd.read_csv("sitemap.csv")
sitemap_df

2022-07-17 13:43:55,126 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://seodepths.com/post-sitemap.xml
2022-07-17 13:43:55,131 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://seodepths.com/page-sitemap.xml


Unnamed: 0,loc,lastmod,image,image_loc,sitemap,sitemap_size_mb,download_date
0,https://seodepths.com/,,,,https://seodepths.com/post-sitemap.xml,0.005334,2022-07-17 13:43:55.164529+00:00
1,https://seodepths.com/python-for-seo/entity-an...,2022-07-16 16:47:33+00:00,\n\t\t\t,https://seodepths.com/wp-content/uploads/2022/...,https://seodepths.com/post-sitemap.xml,0.005334,2022-07-17 13:43:55.164529+00:00
2,https://seodepths.com/seo-news/how-nlp-nlu-can...,2022-07-16 16:12:36+00:00,\n\t\t\t,https://seodepths.com/wp-content/uploads/2022/...,https://seodepths.com/post-sitemap.xml,0.005334,2022-07-17 13:43:55.164529+00:00
3,https://seodepths.com/python-for-seo/define-se...,2022-07-16 16:06:01+00:00,\n\t\t\t,https://seodepths.com/wp-content/uploads/2022/...,https://seodepths.com/post-sitemap.xml,0.005334,2022-07-17 13:43:55.164529+00:00
4,https://seodepths.com/python-for-seo/how-to-ki...,2022-07-15 07:49:16+00:00,\n\t\t\t,https://seodepths.com/wp-content/uploads/2022/...,https://seodepths.com/post-sitemap.xml,0.005334,2022-07-17 13:43:55.164529+00:00
5,https://seodepths.com/python-for-seo/sitemap-a...,2022-07-15 07:40:27+00:00,\n\t\t\t,https://seodepths.com/wp-content/uploads/2022/...,https://seodepths.com/post-sitemap.xml,0.005334,2022-07-17 13:43:55.164529+00:00
6,https://seodepths.com/python-for-seo/detect-go...,2022-07-13 09:39:20+00:00,\n\t\t\t,https://seodepths.com/wp-content/uploads/2022/...,https://seodepths.com/post-sitemap.xml,0.005334,2022-07-17 13:43:55.164529+00:00
7,https://seodepths.com/seo-news/google-pros-con...,2022-07-12 16:49:21+00:00,\n\t\t\t,https://seodepths.com/wp-content/uploads/2022/...,https://seodepths.com/post-sitemap.xml,0.005334,2022-07-17 13:43:55.164529+00:00
8,https://seodepths.com/,,,,https://seodepths.com/page-sitemap.xml,0.000737,2022-07-17 13:43:55.158998+00:00
9,https://seodepths.com/about/,2022-07-13 10:13:25+00:00,,,https://seodepths.com/page-sitemap.xml,0.000737,2022-07-17 13:43:55.158998+00:00


In [4]:
#@title 2️⃣ Check Tag Usage Within The Sitemap (If Existing)
def check_sitemap_tag_usage(sitemap):
    """Count missing (True) vs. present (False) values for each optional tag."""
    usage = {}
    for tag in ["lastmod", "priority", "changefreq"]:
        # A tag that no sitemap uses has no column at all, so treat it as all-missing
        missing = sitemap[tag].isna() if tag in sitemap.columns else pd.Series(True, index=sitemap.index)
        # reindex keeps both outcomes in the table, so no NaN breaks astype(int)
        usage[tag] = missing.value_counts().reindex([False, True], fill_value=0)
        usage[f"{tag}_perc"] = missing.value_counts(normalize=True).reindex([False, True], fill_value=0) * 100
    return pd.DataFrame(usage).astype(int)

check_sitemap_tag_usage(sitemap_df)
    

# 📓 Sidenote
- ✅ Make sure `<loc>` and `<lastmod>` are implemented
- ❌ If you're submitting your sitemap to Google, you can [leave out `<priority>` and `<changefreq>`](https://developers.google.com/search/docs/advanced/sitemaps/build-sitemap), as Google ignores those values.
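
As a rough cross-check that does not rely on advertools, tag usage can also be counted straight from the sitemap XML. The sketch below uses only the standard library; `SAMPLE_XML` and `count_sitemap_tags` are hypothetical names introduced for illustration, and the sample document is made up rather than taken from the audited site.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sample document, used only for illustration
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/a/</loc><lastmod>2022-07-16</lastmod></url>
  <url><loc>https://example.com/b/</loc><lastmod>2022-07-15</lastmod><priority>0.8</priority></url>
</urlset>"""


def count_sitemap_tags(xml_text):
    """Return how many <url> entries contain each child tag."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    counts = Counter()
    for url in root.iter(f"{SITEMAP_NS}url"):
        # A set avoids counting a tag twice within the same <url> entry
        counts.update({child.tag.replace(SITEMAP_NS, "") for child in url})
    return dict(counts)


tag_counts = count_sitemap_tags(SAMPLE_XML)
print(tag_counts)
```

Comparing each count with the number of `<loc>` entries shows at a glance whether `<lastmod>` is set everywhere and whether `<priority>` or `<changefreq>` are still being emitted.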

In [5]:
#@title 3️⃣ Get a clue about the site-tree of your Website
sitemap_url_df = adv.url_to_df(sitemap_df["loc"])
sitemap_url_df

Unnamed: 0,url,scheme,netloc,path,query,fragment,dir_1,dir_2,last_dir
0,https://seodepths.com/,https,seodepths.com,/,,,,,
1,https://seodepths.com/python-for-seo/entity-an...,https,seodepths.com,/python-for-seo/entity-and-sentiment-analysis-...,,,python-for-seo,entity-and-sentiment-analysis-python,entity-and-sentiment-analysis-python
2,https://seodepths.com/seo-news/how-nlp-nlu-can...,https,seodepths.com,/seo-news/how-nlp-nlu-can-affect-seo/,,,seo-news,how-nlp-nlu-can-affect-seo,how-nlp-nlu-can-affect-seo
3,https://seodepths.com/python-for-seo/define-se...,https,seodepths.com,/python-for-seo/define-seo-search-intent-with-...,,,python-for-seo,define-seo-search-intent-with-python,define-seo-search-intent-with-python
4,https://seodepths.com/python-for-seo/how-to-ki...,https,seodepths.com,/python-for-seo/how-to-kick-off-entity-researc...,,,python-for-seo,how-to-kick-off-entity-research-nlp-python,how-to-kick-off-entity-research-nlp-python
5,https://seodepths.com/python-for-seo/sitemap-a...,https,seodepths.com,/python-for-seo/sitemap-audit-python/,,,python-for-seo,sitemap-audit-python,sitemap-audit-python
6,https://seodepths.com/python-for-seo/detect-go...,https,seodepths.com,/python-for-seo/detect-google-tag-rewriting-se...,,,python-for-seo,detect-google-tag-rewriting-serpapi,detect-google-tag-rewriting-serpapi
7,https://seodepths.com/seo-news/google-pros-con...,https,seodepths.com,/seo-news/google-pros-cons-annotations/,,,seo-news,google-pros-cons-annotations,google-pros-cons-annotations
8,https://seodepths.com/,https,seodepths.com,/,,,,,
9,https://seodepths.com/about/,https,seodepths.com,/about/,,,about,,about


In [52]:
#@title 4️⃣ Get a grip on HTTPS Usage on URLs in the Sitemap
sitemap_url_df["scheme"].value_counts().to_frame()

Unnamed: 0,scheme
https,10


## 🤖 ROBOTS.TXT

In [7]:
#@title 5️⃣ Have a look at the Robots.txt 
import requests
r = requests.get("https://www.seodepths.com/robots.txt")
r.status_code

200

A 200 response status code means that a robots.txt file exists at that address, so crawling can be controlled per user-agent.
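
Once the file is known to exist, its rules can also be evaluated locally. The following is a minimal sketch that uses the standard library's `urllib.robotparser`; the robots.txt content, the example URLs, and the helper name `crawlable` are assumptions made for illustration, not the audited site's real rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
"""


def crawlable(robots_txt, urls, user_agent="*"):
    """Map each URL to True/False depending on the robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


result = crawlable(
    SAMPLE_ROBOTS_TXT,
    [
        "https://example.com/about/",
        "https://example.com/wp-admin/options.php",
    ],
)
print(result)
```

The standard-library parser may resolve overlapping `Allow` and `Disallow` rules differently from a given search engine, so treat its verdicts as a first approximation.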



In [8]:
#@title 6️⃣ Bulk audit Robots.txt of the URLs in the sitemap
sitemap_df_robotstxt_check = adv.robotstxt_test("https://www.seodepths.com/robots.txt", urls=sitemap_df["loc"], user_agents=["*"])
sitemap_df_robotstxt_check["can_fetch"].value_counts()

True    10
Name: can_fetch, dtype: int64

# 📓 SIDENOTES

**user_agents=["*"]** = the audit applies the rules of the wildcard group, which cover every crawler without a dedicated group

**True** = number of URLs that are crawlable

**False** = number of URLs that are disallowed

👇


---
If URLs are being disallowed, run the script below


In [16]:
#@title Identify disallowed URLs
pd.set_option("display.max_colwidth",255)
sitemap_df_robotstxt_check[~sitemap_df_robotstxt_check["can_fetch"]]

Unnamed: 0,robotstxt_url,user_agent,url_path,can_fetch


## STATUS CODE CHECK

In [22]:
#@title Check URLs Status Code within the Sitemap
adv.crawl_headers(sitemap_df["loc"], output_file="sitemap_df_header.jl")
df_headers = pd.read_json("sitemap_df_header.jl", lines=True)
df_headers["status"].value_counts()


200    27
Name: status, dtype: int64

❌ **If any, which URLs in the sitemaps return 404?**

In [23]:
df_headers[df_headers["status"] == 404]

Unnamed: 0,url,crawl_time,status,download_timeout,download_slot,download_latency,depth,protocol,body,resp_headers_content-length,...,resp_headers_x-cache-ctime,resp_headers_content-encoding,resp_headers_x-ac,request_headers_accept,request_headers_accept-language,request_headers_user-agent,request_headers_accept-encoding,resp_headers_last-modified,resp_headers_x-nananana,resp_headers_x-pingback


## ⚠️ NOT COMPULSORY - Check Canonicalization from Response Headers


---

From time to time, **using canonicalization hints in the response headers is beneficial for crawling and indexing**.

**If you include a canonicalization hint in the HTTP header, make sure the HTML canonical tag and the response header canonical point to the same URL**.

The next steps are:
- Checking whether a response header for canonical usage exists.
- Comparing the response header canonical value to the HTML canonical value, if it exists.
- Checking whether the canonical values are self-referential.
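
The steps above can be sketched without any crawling library. A `Link` header may list several targets (preconnect hints, REST API discovery, and so on), so the sketch only keeps the entry whose `rel` contains `canonical`. The function names and the sample header are hypothetical, and the parsing is deliberately simple: it assumes parameter values contain no commas.

```python
import re

# Matches one "<url>; params" entry of a Link header
LINK_ENTRY = re.compile(r"<([^<>]+)>([^,]*)")


def header_canonical(link_header):
    """Return the rel="canonical" target of a Link header, or None."""
    if not isinstance(link_header, str):
        return None  # missing header, e.g. NaN in a DataFrame column
    for url, params in LINK_ENTRY.findall(link_header):
        if re.search(r'rel\s*=\s*"?[^";]*\bcanonical\b', params, flags=re.IGNORECASE):
            return url.strip()
    return None


def is_self_referential(page_url, canonical_url):
    """Compare both URLs while ignoring a trailing-slash difference."""
    return canonical_url is not None and page_url.rstrip("/") == canonical_url.rstrip("/")


# Hypothetical header value, used only for illustration
sample_header = (
    '<https://cdn.example.com>; rel=preconnect; crossorigin, '
    '<https://example.com/page/>; rel="canonical"'
)
canonical = header_canonical(sample_header)
print(canonical, is_self_referential("https://example.com/page", canonical))
```

On a crawl DataFrame, the same helper could be applied with `df_headers["resp_headers_link"].map(header_canonical)`.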




In [24]:
#@title Does a response header for canonical usage exist?
df_headers.columns
# the answer may be "YES" if the output lists 'resp_headers_link'; note that a Link header can carry rel values other than canonical

Index(['url', 'crawl_time', 'status', 'download_timeout', 'download_slot',
       'download_latency', 'depth', 'protocol', 'body',
       'resp_headers_content-length', 'resp_headers_server',
       'resp_headers_date', 'resp_headers_content-type',
       'resp_headers_strict-transport-security', 'resp_headers_vary',
       'resp_headers_x-hacker', 'resp_headers_host-header',
       'resp_headers_cache-control', 'resp_headers_x-nitro-cache',
       'resp_headers_x-nitro-cache-from', 'resp_headers_x-nitro-rev',
       'resp_headers_link', 'resp_headers_x-cache-ctime',
       'resp_headers_content-encoding', 'resp_headers_x-ac',
       'request_headers_accept', 'request_headers_accept-language',
       'request_headers_user-agent', 'request_headers_accept-encoding',
       'resp_headers_last-modified', 'resp_headers_x-nananana',
       'resp_headers_x-pingback'],
      dtype='object')

In [30]:
#@title Compare the response header canonical to the crawled URL
print("Checking any links within the Response Header")

# Note: this pattern captures the first URL of the Link header whatever its rel value,
# so preconnect or wp-json links are picked up as well as rel="canonical" ones
df_headers["response_header_canonical"] = df_headers["resp_headers_link"].str.extract(r"([^<>][a-z:\/0-9-.]*)")
(df_headers["response_header_canonical"] == df_headers["url"]).value_counts()



Checking any links within the Response Header


False    27
dtype: int64

## **True** = the response header canonical equals the crawled URL

## **False** = the response header canonical DOES NOT equal the crawled URL

In [28]:
#@title List the URLs whose Response Header Canonical differs from the URL
df_headers[(df_headers["response_header_canonical"] != df_headers["url"]) & (df_headers["status"] == 200)]

Unnamed: 0,url,crawl_time,status,download_timeout,download_slot,download_latency,depth,protocol,body,resp_headers_content-length,...,resp_headers_content-encoding,resp_headers_x-ac,request_headers_accept,request_headers_accept-language,request_headers_user-agent,request_headers_accept-encoding,resp_headers_last-modified,resp_headers_x-nananana,resp_headers_x-pingback,response_header_canonical
0,https://seodepths.com/,2022-07-17 14:18:57,200,180,seodepths.com,0.152629,0,HTTP/1.1,,0,...,gzip,3.dca _atomic_dca,"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",en,advertools/0.13.1,"gzip, deflate",,,,https://cdn-gbphn.nitrocdn.com
1,https://seodepths.com/python-for-seo/entity-and-sentiment-analysis-python/,2022-07-17 14:18:57,200,180,seodepths.com,0.193086,0,HTTP/1.1,,0,...,gzip,3.dca _atomic_dca,"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",en,advertools/0.13.1,"gzip, deflate","Sun, 17 Jul 2022 14:18:57 GMT",Batcache-Set,,https://cdn-gbphn.nitrocdn.com
2,https://seodepths.com/python-for-seo/sitemap-audit-python/,2022-07-17 14:18:57,200,180,seodepths.com,0.191443,0,HTTP/1.1,,0,...,gzip,3.dca _atomic_dca,"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",en,advertools/0.13.1,"gzip, deflate",,,,https://cdn-gbphn.nitrocdn.com
3,https://seodepths.com/python-for-seo/define-seo-search-intent-with-python/,2022-07-17 14:18:57,200,180,seodepths.com,0.199115,0,HTTP/1.1,,0,...,gzip,3.dca _atomic_dca,"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",en,advertools/0.13.1,"gzip, deflate","Sun, 17 Jul 2022 14:18:57 GMT",Batcache-Set,,https://cdn-gbphn.nitrocdn.com
4,https://seodepths.com/python-for-seo/detect-google-tag-rewriting-serpapi/,2022-07-17 14:18:57,200,180,seodepths.com,0.19627,0,HTTP/1.1,,0,...,gzip,3.dca _atomic_dca,"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",en,advertools/0.13.1,"gzip, deflate","Sun, 17 Jul 2022 14:18:57 GMT",Batcache-Set,https://seodepths.com/xmlrpc.php,https://cdn-gbphn.nitrocdn.com
5,https://seodepths.com/python-for-seo/how-to-kick-off-entity-research-nlp-python/,2022-07-17 14:18:57,200,180,seodepths.com,0.206657,0,HTTP/1.1,,0,...,gzip,3.dca _atomic_dca,"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",en,advertools/0.13.1,"gzip, deflate","Sun, 17 Jul 2022 14:18:57 GMT",Batcache-Set,https://seodepths.com/xmlrpc.php,https://cdn-gbphn.nitrocdn.com
6,https://seodepths.com/seo-news/google-pros-cons-annotations/,2022-07-17 14:18:57,200,180,seodepths.com,0.306869,0,HTTP/1.1,,0,...,gzip,3.dca _atomic_dca,"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",en,advertools/0.13.1,"gzip, deflate",,,https://seodepths.com/xmlrpc.php,https://seodepths.com/wp-json/
7,https://seodepths.com/seo-news/how-nlp-nlu-can-affect-seo/,2022-07-17 14:18:57,200,180,seodepths.com,0.359001,0,HTTP/1.1,,0,...,gzip,3.dca _atomic_dca,"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",en,advertools/0.13.1,"gzip, deflate",,,,https://seodepths.com/wp-json/
8,https://seodepths.com/about/,2022-07-17 14:18:57,200,180,seodepths.com,0.248889,0,HTTP/1.1,,0,...,gzip,3.dca _atomic_dca,"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",en,advertools/0.13.1,"gzip, deflate",,,,https://seodepths.com/wp-json/
9,https://seodepths.com/,2022-07-17 14:24:03,200,180,seodepths.com,0.138918,0,HTTP/1.1,,0,...,gzip,3.dca _atomic_dca,"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",en,advertools/0.13.1,"gzip, deflate","Sun, 17 Jul 2022 14:24:03 GMT",Batcache-Set,,https://cdn-gbphn.nitrocdn.com


In [53]:
#@title Check whether the canonical values are self-referential
df_headers["response_header_canonical"] = df_headers["resp_headers_link"].str.extract(r"([^<>][a-z:\/0-9-.]*)")
(df_headers["response_header_canonical"] == df_headers["url"]).value_counts()

False    27
dtype: int64

## Check for an X-Robots-Tag on the Sitemap URLs


---
You might want to check whether an X-Robots-Tag was added to the response headers, for example as a temporary alternative to amending the robots.txt directives.

In [36]:
#@title Does an X-Robots-Tag exist for the URLs in the Sitemap?
def robots_tag_checker(dataframe: pd.DataFrame):
    # Scan every column name and only give up once all of them have been checked
    for column in dataframe.columns:
        if "robots" in column:
            return column
    return "There is no robots tag"
robots_tag_checker(df_headers)


'There is no robots tag'

## ❌ If there were an X-Robots-Tag, you should use the code below





---

**We can check whether the response headers carry a “noindex” directive**

In the Google Search Console Coverage Report, those URLs appear as “Submitted URL marked ‘noindex’”.

Contradictory indexing and canonicalization signals may lead a search engine to ignore all of them and to place less trust in the signals the site declares.



In [37]:
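# The header is stored as "resp_headers_x-robots-tag", and only when at least one URL returned it
if "resp_headers_x-robots-tag" in df_headers.columns:
    display(df_headers[df_headers["resp_headers_x-robots-tag"].astype(str).str.contains("noindex", case=False)])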
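
An X-Robots-Tag value can bundle several directives and may be scoped to one crawler (for example `googlebot: noindex`), so a plain equality test against `"noindex"` can miss cases. Below is a minimal sketch of a more tolerant check; the function names and the `VALUED_DIRECTIVES` set are assumptions introduced here, and the parsing covers only the common single-line forms.

```python
# Directives that legitimately contain a colon and must not be read as a user agent
VALUED_DIRECTIVES = {"max-snippet", "max-image-preview", "max-video-preview", "unavailable_after"}


def parse_x_robots_tag(value):
    """Split one X-Robots-Tag value into (user_agent or None, set of directives)."""
    head, sep, tail = value.partition(":")
    head = head.strip().lower()
    if sep and "," not in head and head not in VALUED_DIRECTIVES:
        agent, rules = head, tail  # e.g. "googlebot: noindex, nofollow"
    else:
        agent, rules = None, value  # directives apply to every crawler
    directives = {rule.strip().lower() for rule in rules.split(",") if rule.strip()}
    return agent, directives


def blocks_indexing(value):
    """True when the directives include noindex, or none (noindex plus nofollow)."""
    _, directives = parse_x_robots_tag(value)
    return bool(directives & {"noindex", "none"})


print(parse_x_robots_tag("googlebot: noindex, nofollow"))
print(blocks_indexing("index, follow, max-snippet:-1"))
```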

## Audit Meta Tag Robots



---
Even if a web page is not disallowed in robots.txt, it can still be kept out of the index through its HTML meta tags.

Checking the meta robots tags is therefore necessary for healthy crawling and indexation.

The “custom selectors” of advertools (here `xpath_selectors`) make it possible to extract the meta robots tag from every sitemap URL.


---

📓 If "False" is the only returned value, no URL carries a **noindex** or **nofollow** directive in its meta robots tag.


In [43]:
sitemap = adv.sitemap_to_df("https://seodepths.com/sitemap_index.xml")

adv.crawl(
    url_list=sitemap["loc"][:1000],
    output_file="meta_command_audit.jl",
    follow_links=False,
    # extract the robots directives from every URL in the sitemap
    xpath_selectors={"meta_command": "//meta[@name='robots']/@content"},
    # cap the crawl at 1000 pages from the sitemap
    custom_settings={"CLOSESPIDER_PAGECOUNT": 1000},
)

df_meta_check = pd.read_json("meta_command_audit.jl", lines=True)

df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True).value_counts()

2022-07-17 15:51:34,361 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://seodepths.com/page-sitemap.xml
2022-07-17 15:51:34,373 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://seodepths.com/post-sitemap.xml


False    18
Name: meta_command, dtype: int64

## ⚠️ WARNING - If GSC reports **indexing issues** for a URL that appears to declare **index, follow**, inspect the `<body>` section of the page to check whether a **noindex, follow** is in place

In [44]:
#@title View the Meta Tag Robots applied on the Sitemap URLs
df_meta_check[df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True) == False][["url", "meta_command"]]

Unnamed: 0,url,meta_command
0,https://seodepths.com/,"index, follow"
1,https://seodepths.com/python-for-seo/define-seo-search-intent-with-python/,"index, follow, max-snippet:-1, max-image-preview:large"
2,https://seodepths.com/python-for-seo/how-to-kick-off-entity-research-nlp-python/,"index, follow, max-snippet:-1, max-image-preview:large"
3,https://seodepths.com/python-for-seo/detect-google-tag-rewriting-serpapi/,"follow, index, max-snippet:-1, max-video-preview:-1, max-image-preview:large"
4,https://seodepths.com/python-for-seo/sitemap-audit-python/,"index, follow, max-image-preview:large, max-snippet:-1"
5,https://seodepths.com/python-for-seo/entity-and-sentiment-analysis-python/,"index, follow, max-snippet:-1, max-image-preview:large"
6,https://seodepths.com/about/,"follow, index, max-snippet:-1, max-video-preview:-1, max-image-preview:large"
7,https://seodepths.com/seo-news/how-nlp-nlu-can-affect-seo/,"index, follow, max-snippet:-1, max-image-preview:large"
8,https://seodepths.com/seo-news/google-pros-cons-annotations/,"follow, index, max-snippet:-1, max-video-preview:-1, max-image-preview:large"
9,https://seodepths.com/,"index, follow"
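
The table above lists the same directives in different orders ("index, follow" versus "follow, index"), which makes pages look more varied than they are. Here is a small sketch, under the assumption that directive order carries no meaning, that normalizes each value before counting distinct profiles; `normalize_meta_robots` is a hypothetical helper name.

```python
from collections import Counter


def normalize_meta_robots(content):
    """Return the directives as a sorted tuple so that order does not matter."""
    return tuple(sorted(part.strip().lower() for part in content.split(",") if part.strip()))


# Values copied from the meta_command column shown above
meta_values = [
    "index, follow",
    "index, follow, max-snippet:-1, max-image-preview:large",
    "index, follow, max-image-preview:large, max-snippet:-1",
    "follow, index, max-snippet:-1, max-video-preview:-1, max-image-preview:large",
]

profiles = Counter(normalize_meta_robots(value) for value in meta_values)
for directives, pages in profiles.items():
    print(pages, directives)
```

With the crawl output, the same helper could be mapped over `df_meta_check["meta_command"].dropna()` to see how many distinct robots profiles the site actually serves.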



# Check for Duplicate URLs Within the Sitemap

In [46]:
#@title Look for Duplicate URLs Within Sitemap Submissions
sitemap_df["loc"].duplicated().value_counts()

False    9
True     1
Name: loc, dtype: int64

📓 **False** = number of URLs that appear for the first time in the sitemap

❌ **True** = number of repeated URLs caught in the Sitemap

👇

---
Should you have duplicated URLs in your Sitemap, run the script below


In [50]:
#@title ❌ How many duplicated URLs in the Sitemap?
pd.pivot_table(sitemap_df[sitemap_df["loc"].duplicated()], index="sitemap", values="loc", aggfunc="count").sort_values(by="loc", ascending=False)

Unnamed: 0_level_0,loc
sitemap,Unnamed: 1_level_1
https://seodepths.com/page-sitemap.xml,1


In [51]:
#@title 💡Which URLs are caught as Duplicated in the Sitemap?
sitemap_df[sitemap_df["loc"].duplicated()]

Unnamed: 0,loc,lastmod,image,image_loc,sitemap,sitemap_size_mb,download_date
8,https://seodepths.com/,,,,https://seodepths.com/page-sitemap.xml,0.000737,2022-07-17 13:43:55.158998+00:00
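
`duplicated()` flags only the second and later occurrences, so the row above shows where the repeat lives but not where the URL was first listed. A minimal sketch that groups every listing of a repeated URL is given below; the helper name `sitemaps_per_url` is hypothetical, and the sample rows are taken from the sitemap output earlier in this notebook.

```python
from collections import defaultdict


def sitemaps_per_url(rows):
    """Group (loc, sitemap) pairs by URL and keep the URLs listed more than once."""
    seen = defaultdict(list)
    for loc, sitemap in rows:
        seen[loc].append(sitemap)
    return {loc: sitemaps for loc, sitemaps in seen.items() if len(sitemaps) > 1}


# (loc, sitemap) pairs taken from the sitemap output above
rows = [
    ("https://seodepths.com/", "https://seodepths.com/post-sitemap.xml"),
    ("https://seodepths.com/about/", "https://seodepths.com/page-sitemap.xml"),
    ("https://seodepths.com/", "https://seodepths.com/page-sitemap.xml"),
]

duplicates = sitemaps_per_url(rows)
print(duplicates)
```

With the DataFrame at hand, the rows can be supplied as `zip(sitemap_df["loc"], sitemap_df["sitemap"])`; `sitemap_df[sitemap_df["loc"].duplicated(keep=False)]` is a pandas-only alternative that shows every occurrence.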
