<a href="https://colab.research.google.com/github/simodepth/sitemap/blob/main/%E2%AD%90%EF%B8%8FSitemap_Tech_Audit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automate Sitemap Audit with Python

XML sitemaps are designed to make life easier for search engines by providing an index of a site’s URLs. However, they’re also a useful tool in competitor analysis and allow you to quickly identify all of a site’s pages and the level of importance the site assigns to each page.

The following Python script can save you hours of manual technical research and surface useful data about the sitemap at the same time.

# Requirements and Assumptions

- Run the script on **Google Colab**
- Make sure to `!pip install` the advertools and pandas packages
- Remember that **[sitemaps are a recommendation](https://developers.google.com/search/docs/advanced/sitemaps/build-sitemap)** to Google about which pages you think are important; Google does not pledge to crawl every URL in a sitemap.


In [None]:
#@title Install Packages

!pip install advertools
!pip install pandas 



In [None]:
#@title Import Packages
import advertools as adv

import pandas as pd

from lxml import etree

from IPython.display import display, HTML

display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
#@title 1️⃣ Scrape URLs From The Sitemap
sitemap_url = "https://yoursite.com/sitemap.xml"  # replace with your own sitemap URL
sitemap = adv.sitemap_to_df(sitemap_url)
# Save a copy without the index so no "Unnamed: 0" column appears on reload.
sitemap.to_csv("sitemap.csv", index=False)
sitemap_df = pd.read_csv("sitemap.csv")
sitemap_df
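
To make the resulting columns less of a black box, here is a minimal offline sketch that flattens a tiny, made-up sitemap with the standard library only. The sample XML and the helper name `parse_sitemap` are assumptions for illustration; they are not part of advertools, which also handles downloads, sitemap indexes and compression.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample; a real audit would download the sitemap instead.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2022-05-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/</loc>
  </url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; optional tags that are absent become None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    rows = []
    for url in root.findall("sm:url", NS):
        rows.append({
            tag: url.findtext(f"sm:{tag}", default=None, namespaces=NS)
            for tag in ("loc", "lastmod", "changefreq", "priority")
        })
    return rows


rows = parse_sitemap(SAMPLE_SITEMAP)
for row in rows:
    print(row)
```

Each dictionary corresponds to one row of the DataFrame, which is why tags that a sitemap omits show up as missing values later in the audit.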

In [None]:
#@title 2️⃣ Check Tag Usage Within The Sitemap (If Existing)
def check_sitemap_tag_usage(sitemap):
    """Count (and express as a percentage) the URLs missing each optional tag."""
    tags = ["lastmod", "priority", "changefreq"]
    # True = tag missing for that URL; a tag with no column at all is missing everywhere.
    missing = {tag: sitemap[tag].isna() if tag in sitemap.columns
               else pd.Series(True, index=sitemap.index) for tag in tags}
    usage = {tag: missing[tag].value_counts() for tag in tags}
    usage.update({f"{tag}_perc": missing[tag].value_counts(normalize=True) * 100
                  for tag in tags})
    # Rows are indexed by True (missing) / False (present); absent combinations become 0.
    return pd.DataFrame(usage).fillna(0).round().astype(int)

check_sitemap_tag_usage(sitemap_df)

# 📓 Sidenote
- ✅ Make sure `<loc>` and `<lastmod>` are implemented
- ❌ If the sitemap is meant for Googlebot, you can [skip the `<priority>` and `<changefreq>`](https://developers.google.com/search/docs/advanced/sitemaps/build-sitemap) values, as Google ignores them.
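
The tag check boils down to counting, per tag, how many entries carry a value. Below is a rough standard-library sketch on made-up rows; the `entries` data and the `tag_coverage` helper are hypothetical and only meant to show the arithmetic.

```python
def tag_coverage(entries, tags=("loc", "lastmod", "priority", "changefreq")):
    """Return, per tag, the percentage of entries that provide a non-empty value."""
    total = len(entries)
    coverage = {}
    for tag in tags:
        present = sum(1 for entry in entries if entry.get(tag))
        coverage[tag] = round(100 * present / total, 1) if total else 0.0
    return coverage


# Hypothetical sitemap rows; None marks a tag that the entry omits.
entries = [
    {"loc": "https://example.com/", "lastmod": "2022-05-01", "priority": None},
    {"loc": "https://example.com/blog/", "lastmod": None, "priority": None},
    {"loc": "https://example.com/about/", "lastmod": "2022-04-12", "priority": "0.5"},
    {"loc": "https://example.com/contact/", "lastmod": "2022-03-30", "priority": None},
]

coverage = tag_coverage(entries)
print(coverage)
```

A `lastmod` coverage well below 100% is the figure worth acting on, while low `priority` and `changefreq` coverage is harmless for Google.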

In [None]:
#@title 3️⃣ Get a clue about the site-tree of your Website
sitemap_url_df = adv.url_to_df(sitemap_df["loc"])
sitemap_url_df

In [None]:
#@title 4️⃣ Get a grip on HTTPS Usage on URLs in the Sitemap
sitemap_url_df["scheme"].value_counts().to_frame()

Unnamed: 0,scheme
https,286
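
As a rough picture of what the URL breakdown and the scheme count are doing, the sketch below splits a handful of made-up URLs with `urllib.parse` and tallies schemes and first-level directories. The URLs and the `split_url` helper are assumptions for illustration, not advertools internals.

```python
from collections import Counter
from urllib.parse import urlsplit

# Made-up URLs standing in for sitemap_df["loc"].
urls = [
    "https://example.com/",
    "https://example.com/blog/python-seo/",
    "http://example.com/blog/old-post/",
    "https://example.com/services/audit/",
]


def split_url(url):
    """Break a URL into scheme, host, and its non-empty path segments."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    return {"scheme": parts.scheme, "netloc": parts.netloc, "dirs": segments}


rows = [split_url(url) for url in urls]
scheme_counts = Counter(row["scheme"] for row in rows)
first_dir_counts = Counter(row["dirs"][0] if row["dirs"] else "(root)" for row in rows)

print(scheme_counts)
print(first_dir_counts)
```

Any `http` entry in the scheme tally points to a URL that should probably be listed with `https` instead, and the first-directory tally gives a quick view of how the site tree is distributed.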


## 🤖 ROBOTS.TXT

In [None]:
#@title 5️⃣ Have a look at the Robots.txt 
import requests
r = requests.get("https://yoursite.com/robots.txt")  # requests needs the scheme (https://)
r.status_code

If the response status code is 200, the site serves a robots.txt file that can control crawling per user-agent.



In [None]:
#@title 6️⃣ Bulk audit Robots.txt of the URLs in the sitemap
sitemap_df_robotstxt_check = adv.robotstxt_test("https://yoursite.com/robots.txt", urls=sitemap_df["loc"], user_agents=["*"])
sitemap_df_robotstxt_check["can_fetch"].value_counts()

# 📓 SIDENOTES

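**user_agents=["*"]** = the audit tests the wildcard rules, which apply to any crawler that has no group of its own in the robots.txt

**True** = the URL can be crawled; if `True` is the only value returned, every URL in the sitemap is crawlable.

**False** = the URL is disallowed; any count here means some sitemap URLs are blocked.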
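
To illustrate how `can_fetch` depends on the user-agent, here is a small offline sketch using Python's built-in `urllib.robotparser` on a made-up robots.txt. It is an assumption-laden stand-in: advertools relies on its own parser, so edge cases may be resolved differently.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt; the real file would be fetched from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

urls = [
    "https://example.com/",
    "https://example.com/private/report/",
    "https://example.com/drafts/post/",
]

# Test every URL against the wildcard group and against a named crawler.
for agent in ("*", "Googlebot"):
    for url in urls:
        print(agent, url, parser.can_fetch(agent, url))
```

In this sample the wildcard group blocks `/private/`, yet Googlebot follows only its own group and may still fetch it. Auditing with `"*"` alone can therefore miss rules aimed at a specific crawler.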

👇


---
If URLs are being disallowed, run the script below


In [None]:
#@title Identify disallowed URLs
pd.set_option("display.max_colwidth",255)
sitemap_df_robotstxt_check[sitemap_df_robotstxt_check["can_fetch"] == False]

In [None]:
#@title Double-check for "NoIndex NoFollow" attributes within URLs from the Sitemap
sitemap = adv.sitemap_to_df("https://yoursite.com/sitemap_index.xml")

# Crawl up to 1,000 sitemap URLs and extract the content of the robots meta tag.
adv.crawl(url_list=sitemap["loc"][:1000], output_file="meta_command_audit.jl",
          follow_links=False,
          xpath_selectors={"meta_command": "//meta[@name='robots']/@content"},
          custom_settings={"CLOSESPIDER_PAGECOUNT": 1000})

df_meta_check = pd.read_json("meta_command_audit.jl", lines=True)

# Pages without a robots meta tag yield NaN; na=False counts them as "no directive".
df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True, na=False).value_counts()

In [None]:
#@title Audit Meta Tag Robots
# Run this after the crawl above: list the URLs whose robots meta tag contains noindex or nofollow.
df_meta_check[df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True, na=False)][["url", "meta_command"]]

📓 If only "False" is returned by the count, none of the crawled URLs carries a **noindex** or **nofollow** directive
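
For readers who want to see the check in isolation, the sketch below extracts the robots meta directives from a few made-up HTML pages with the standard library. The `RobotsMetaParser` class, the `blocking_directives` helper and the sample pages are hypothetical; the notebook itself relies on the advertools crawl and its XPath selector.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.extend(part.strip().lower() for part in content.split(","))


def blocking_directives(html):
    """Return the noindex/nofollow directives found in the page, if any."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return sorted(set(parser.directives) & {"noindex", "nofollow"})


# Made-up pages standing in for crawled sitemap URLs.
pages = {
    "https://example.com/": '<html><head><meta name="robots" content="index, follow"></head></html>',
    "https://example.com/thanks/": '<html><head><meta name="robots" content="NOINDEX, nofollow"></head></html>',
    "https://example.com/blog/": "<html><head><title>Blog</title></head></html>",
}

flagged = {}
for url, html in pages.items():
    found = blocking_directives(html)
    if found:
        flagged[url] = found

print(flagged)
```

Splitting the content on commas avoids false positives that a plain substring search could produce, at the cost of a little more code.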


## STATUS CODE CHECK

In [None]:
#@title Check URLs Status Code within the Sitemap
adv.crawl_headers(sitemap_df["loc"], output_file="sitemap_df_header.jl")
df_headers = pd.read_json("sitemap_df_header.jl", lines=True)
df_headers["status"].value_counts()

❌ **Which URLs in the sitemap, if any, return a 404?**

In [None]:
df_headers[df_headers["status"] == 404]

Unnamed: 0,url,crawl_time,status,download_timeout,download_slot,download_latency,depth,protocol,body,resp_headers_content-length,...,resp_headers_set-cookie,resp_headers_link,resp_headers_referrer-policy,resp_headers_content-type,request_headers_accept,request_headers_accept-language,request_headers_user-agent,request_headers_accept-encoding,request_headers_cookie,resp_headers_x-pingback
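
Beyond 404s, it can help to look at every status that is not 200. The following sketch groups made-up responses by status class; the `responses` list and the `status_class` helper are assumptions that stand in for `df_headers[["url", "status"]]`.

```python
from collections import Counter

# Made-up (url, status) pairs standing in for the crawl_headers output.
responses = [
    ("https://example.com/", 200),
    ("https://example.com/blog/", 200),
    ("https://example.com/old-page/", 301),
    ("https://example.com/missing/", 404),
    ("https://example.com/broken/", 500),
]


def status_class(status):
    """Map a status code to its class, e.g. 404 -> '4xx'."""
    return f"{status // 100}xx"


class_counts = Counter(status_class(status) for _, status in responses)
# Redirects and server errors in a sitemap are usually worth a look too.
non_200 = [(url, status) for url, status in responses if status != 200]

print(class_counts)
print(non_200)
```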



# Get Response Headers from the URLs listed in the Sitemap

In [None]:
df_headers[(df_headers["response_header_canonical"] != df_headers["url"]) & (df_headers["status"] == 200)]
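
Note that `response_header_canonical` does not appear among the columns listed in the output above, so the filter may raise a `KeyError` on your data. When a canonical is declared over HTTP it is sent in the `Link` header, which the output above exposes as `resp_headers_link`. The sketch below is one possible way to pull the canonical out of such a value; the regular expression, the `canonical_from_link_header` helper and the sample rows are assumptions, and the pattern only covers the common `rel="canonical"` form.

```python
import re

# Matches <target>; rel="canonical" inside an HTTP Link header value.
CANONICAL_LINK = re.compile(r'<([^>]+)>\s*;\s*rel\s*=\s*"?canonical"?', re.IGNORECASE)


def canonical_from_link_header(link_header):
    """Return the canonical target from a Link header value, or None if absent."""
    if not link_header:
        return None
    match = CANONICAL_LINK.search(link_header)
    return match.group(1) if match else None


# Made-up rows standing in for df_headers; "link" mimics the raw Link header.
rows = [
    {"url": "https://example.com/a", "status": 200,
     "link": '<https://example.com/a>; rel="canonical"'},
    {"url": "https://example.com/b?ref=nav", "status": 200,
     "link": '<https://example.com/wp-json/>; rel="https://api.w.org/", '
             '<https://example.com/b>; rel="canonical"'},
    {"url": "https://example.com/c", "status": 200, "link": None},
]

mismatches = []
for row in rows:
    canonical = canonical_from_link_header(row["link"])
    # Flag live pages whose declared canonical differs from the sitemap URL.
    if row["status"] == 200 and canonical is not None and canonical != row["url"]:
        mismatches.append(row["url"])

print(mismatches)
```

Pages that declare their canonical only in the HTML `<head>` would not be caught this way, since a header crawl never downloads the page body.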

In [None]:
#@title Look for Duplicate URLs Within Sitemap Submissions
sitemap_df["loc"].duplicated().value_counts()

False    286
Name: loc, dtype: int64

📓 **False** = the URL is not a duplicate; if only `False` is returned, the sitemap contains no duplicated URLs

❌ **True** = duplicated URLs were found in the sitemap

👇

---
Should you have Duplicated URLs uploaded in your Sitemap, run the script below 


In [None]:
#@title ❌ How many duplicated URLs in the Sitemap?
duplicated_urls = sitemap_df[sitemap_df["loc"].duplicated()]
# groupby returns an empty result instead of failing when there are no duplicates.
duplicated_urls.groupby("sitemap")["loc"].count().sort_values(ascending=False).to_frame()

In [None]:
#@title 💡Which URLs are caught as Duplicated in the Sitemap?
sitemap_df[sitemap_df["loc"].duplicated() == True]
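
Keep in mind that pandas' `duplicated()` flags only the second and later occurrences by default (`keep="first"`), so the first copy of each URL is not listed. A minimal standard-library sketch on made-up URLs shows the difference between the duplicated URLs and the surplus entries:

```python
from collections import Counter

# Made-up URLs standing in for sitemap_df["loc"].
locs = [
    "https://example.com/",
    "https://example.com/blog/",
    "https://example.com/blog/",
    "https://example.com/about/",
    "https://example.com/blog/",
]

counts = Counter(locs)
# Every URL listed more than once, with its number of occurrences.
duplicates = {url: n for url, n in counts.items() if n > 1}
# Extra entries that could be removed from the sitemap.
surplus = sum(n - 1 for n in duplicates.values())

print(duplicates)
print(surplus)
```

Here `surplus` matches the number of `True` values that `duplicated()` would report for the same list, while `duplicates` shows which URLs are affected and how often.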