<a href="https://colab.research.google.com/github/simodepth/website-crawl/blob/main/Crawl_Efficacy_Measurement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> <h1> How to Measure Crawl Efficacy with a few lines of Python </h1>

<p>

Crawl efficacy measures the time between when a page is (re)published and when web spiders start crawling it.

In other words, crawl efficacy assesses how long a new page you submit to Google takes to get crawled.

Unlike crawl budget, crawl efficacy is an actionable metric: as it decreases, more SEO-critical content can be surfaced sooner to your audience across Google.
You can also use it to diagnose SEO issues. Drill down into URL patterns to understand how fast content from various sections of your site is being crawled and whether slow crawling is what is holding back organic performance.


</p>

> <h2>How to Measure Crawl Efficacy?</h2>

You can measure crawl efficacy by comparing the last mod date from an XML sitemap with the last crawl date retrievable from the GSC API. The lower the output value, the higher the crawl efficacy, and therefore the more responsive web crawlers are in crawling your website. A negative value means the page has not been crawled again since it was last modified.

`Last Crawl date - Last Mod date = Crawl Efficacy`
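As a minimal sketch of that subtraction, here is the calculation for a single page using only the standard library. The two dates are made up for illustration and are not taken from the crawl data. The gap is measured from the modification to the crawl, so a page crawled after its update gets a positive number of days.

```python
from datetime import date

# Hypothetical dates, for illustration only
last_mod = date(2022, 10, 21)    # sitemap last mod date
last_crawl = date(2022, 10, 23)  # last crawl date reported by GSC

# Subtracting two dates gives a datetime.timedelta
crawl_efficacy = last_crawl - last_mod
print(crawl_efficacy.days)  # 2
```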

> <h2> Requirements </h2>

- A crawl of the XML sitemap from your target site, saved as an XLSX file with `Address` and `last mod` as columns. You can scrape a sitemap by following my previous post, which walks you through an [XML Sitemap Audit with Python](https://seodepths.com/python-for-seo/sitemap-audit-python/)

- A Screaming Frog crawl with the Google Search Console API enabled. You can export the `search_console_all` XLSX and decide for yourself which variables to keep in the dataset, as long as you keep the `Address` and `Last Crawl` columns.
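Before loading the two files, it may help to confirm that each export actually contains the columns the code cells rely on. The sketch below is one way to do that. `missing_columns` is a hypothetical helper, not part of pandas or Screaming Frog, and the header row is made up.

```python
def missing_columns(columns, required):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

# Column names the code cells in this notebook select
CRAWL_REQUIRED = ["Address", "Last Crawl"]
SITEMAP_REQUIRED = ["Address", "last mod"]

# Hypothetical header row, standing in for the columns of a real export
example_header = ["Address", "Status Code", "Clicks"]
print(missing_columns(example_header, CRAWL_REQUIRED))  # ['Last Crawl']
```

With a real export, the first argument would be the `columns` attribute of the data frame returned by `pd.read_excel`.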

> <h3> Credits </h3>

Huge shoutout to Jes Scholz for inspiring this tiny Python project with her gem of an article on the [misleading value of crawl budget](https://searchengineland.com/crawl-efficacy-optimization-389085) on the Search Engine Land blog.

In [24]:
import pandas as pd
#load the Screaming Frog export that includes the GSC API data
Last_Crawl = pd.read_excel('/content/SF crawl Last Crawl GSC.xlsx')
#keep only the columns needed for the analysis
df = pd.DataFrame(Last_Crawl, columns=['Address','Status Code', 'Title 1', 'Indexability Status', 'Clicks', 'Impressions', 'CTR', 'Summary', 'Coverage', 'Last Crawl'])
df

Unnamed: 0,Address,Status Code,Title 1,Indexability Status,Clicks,Impressions,CTR,Summary,Coverage,Last Crawl
0,https://seodepths.com/,200.0,SEO Research and Python for SEO - SEO Depths,,16.0,196.0,0.0816,URL is on Google,Submitted and indexed,2022-10-23
1,https://seodepths.com/wp-content/plugins/jetpa...,0.0,,Blocked by robots.txt,,,,,,NaT
2,https://seodepths.com/about/,301.0,,Redirected,1.0,37.0,0.027,URL is on Google,Submitted and indexed,2022-07-18
3,https://seodepths.com/seo-research/,301.0,,Redirected,,,,URL is not on Google,Page with redirect,2022-08-27
4,https://seodepths.com/python-for-seo/entity-co...,200.0,Entity Competitor Analysis with NLP in Python ...,,3.0,74.0,0.0405,URL is on Google,Submitted and indexed,2022-10-14
5,https://seodepths.com/wp-content/plugins/jetpa...,0.0,,Blocked by robots.txt,,,,,,NaT
6,https://seodepths.com/wp-content/plugins/jetpa...,0.0,,Blocked by robots.txt,,,,,,NaT
7,https://seodepths.com/python-for-seo/,200.0,Best Python Scripts for SEO 2022 (Updated) - S...,,9.0,812.0,0.0111,URL is on Google,Submitted and indexed,2022-10-28
8,https://seodepths.com/about-2/,200.0,About: Simone De Palma - SEO Depths,,0.0,41.0,0.0,URL is on Google,Submitted and indexed,2022-10-20
9,https://seodepths.com/seo-news/,200.0,SEO News: Loads of Testing & Research - SEO De...,,0.0,26.0,0.0,URL is on Google,Submitted and indexed,2022-10-21


In [25]:
#load the XML sitemap crawl and keep the URL and its last mod date
Last_mod = pd.read_excel('/content/XML sitemap last mod.xlsx')
df2 = pd.DataFrame(Last_mod, columns=['Address','last mod'])
df2

Unnamed: 0,Address,last mod
0,https://seodepths.com/seo-research/google-pros...,2022-08-14
1,https://seodepths.com/python-for-seo/detect-go...,2022-09-21
2,https://seodepths.com/about/,2022-08-19
3,https://seodepths.com/python-for-seo/sitemap-a...,2022-09-21
4,https://seodepths.com/python-for-seo/how-to-ki...,2022-09-21
5,https://seodepths.com/seo-research/how-nlp-nlu...,2022-09-29
6,https://seodepths.com/python-for-seo/entity-an...,2022-09-21
7,https://seodepths.com/python-for-seo/define-se...,2022-09-21
8,https://seodepths.com/seo-research/how-google-...,2022-08-27
9,https://seodepths.com/seo-research/structured-...,2022-08-27
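
The merge in the next cell keeps only URLs that appear in both exports, so it may be worth checking the overlap first. Below is a minimal sketch with plain Python sets. `url_overlap` is a hypothetical helper and the URLs are made up. With the notebook's data, the two arguments would be `df['Address']` and `df2['Address']`.

```python
def url_overlap(crawl_urls, sitemap_urls):
    """Split URLs into those found in both sources and those found in only one."""
    crawl, sitemap = set(crawl_urls), set(sitemap_urls)
    return {
        "both": crawl & sitemap,
        "crawl_only": crawl - sitemap,
        "sitemap_only": sitemap - crawl,
    }

# Hypothetical URLs, for illustration only
overlap = url_overlap(
    ["https://example.com/", "https://example.com/about/"],
    ["https://example.com/about/", "https://example.com/blog/"],
)
for name, urls in overlap.items():
    print(name, sorted(urls))
```

The comparison is an exact string match, so two URLs that differ only by a trailing slash count as different pages here.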


In [27]:
#@title Vlookup to merge the data frames
result = df.merge(df2,  on='Address')
result['Status Code'] = result['Status Code'].astype(int)
#rename the columns by name so that each date column keeps its own label
result = result.rename(columns={'Address':'URL','Title 1':'title','Indexability Status':'Indexability','Coverage':'coverage','last mod':'Last Mod'})
#you can choose to drop unwanted columns to improve readability
ft = result.drop(['title','Indexability','Clicks','Impressions','CTR'], axis=1)
#drop duplicate rows; .copy() lets us add a column later without a SettingWithCopyWarning
sg = ft.drop_duplicates('URL').copy()
sg

Unnamed: 0,URL,Status Code,Summary,coverage,Last Crawl,Last Mod
0,https://seodepths.com/,200,URL is on Google,Submitted and indexed,2022-10-23,2022-10-21
2,https://seodepths.com/about/,301,URL is on Google,Submitted and indexed,2022-07-18,2022-08-19
4,https://seodepths.com/python-for-seo/entity-co...,200,URL is on Google,Submitted and indexed,2022-10-14,2022-09-21
6,https://seodepths.com/python-for-seo/,200,URL is on Google,Submitted and indexed,2022-10-28,2022-08-27
7,https://seodepths.com/about-2/,200,URL is on Google,Submitted and indexed,2022-10-20,2022-08-27
8,https://seodepths.com/seo-news/,200,URL is on Google,Submitted and indexed,2022-10-21,2022-09-13
9,https://seodepths.com/python-for-seo/build-sim...,200,URL is on Google,Submitted and indexed,2022-10-20,2022-10-23
11,https://seodepths.com/python-for-seo/how-to-us...,200,URL is on Google,Submitted and indexed,2022-10-21,2022-09-21
13,https://seodepths.com/python-for-seo/google-au...,200,URL is on Google,Submitted and indexed,2022-10-20,2022-09-21
15,https://seodepths.com/python-for-seo/semantic-...,200,URL is on Google,Submitted and indexed,2022-10-15,2022-10-21


In [28]:
#@title Calculate Crawl Efficacy and Save the data frame
#time elapsed between the sitemap last mod date and the last crawl
sg['Crawl Efficacy'] = sg['Last Crawl'] - sg['Last Mod']
#save the data frame
sg.to_csv('crawl_efficacy.csv',index=False)
sg.head() #remove .head() if you want to view the full results straight away

Unnamed: 0,URL,Status Code,Summary,coverage,Last Crawl,Last Mod,Crawl Efficacy
0,https://seodepths.com/,200,URL is on Google,Submitted and indexed,2022-10-23,2022-10-21,2 days
2,https://seodepths.com/about/,301,URL is on Google,Submitted and indexed,2022-07-18,2022-08-19,-32 days
4,https://seodepths.com/python-for-seo/entity-co...,200,URL is on Google,Submitted and indexed,2022-10-14,2022-09-21,23 days
6,https://seodepths.com/python-for-seo/,200,URL is on Google,Submitted and indexed,2022-10-28,2022-08-27,62 days
7,https://seodepths.com/about-2/,200,URL is on Google,Submitted and indexed,2022-10-20,2022-08-27,54 days
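
To drill down into URL patterns, as suggested in the introduction, one option is to group the results by the first folder of each URL. The sketch below assumes a data frame with `URL` and `Crawl Efficacy` columns like the one produced above. The sample rows, the `first_folder` helper and the `Section` and `Days` column names are hypothetical.

```python
import pandas as pd
from urllib.parse import urlparse

# Hypothetical sample shaped like the final data frame (values are made up)
sample = pd.DataFrame({
    "URL": [
        "https://example.com/",
        "https://example.com/blog/post-a/",
        "https://example.com/blog/post-b/",
        "https://example.com/docs/setup/",
    ],
    "Crawl Efficacy": pd.to_timedelta([2, 10, 20, -5], unit="D"),
})

def first_folder(url):
    """Return the first path segment of a URL, or '/' for the home page."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0] if segments else "/"

sample["Section"] = sample["URL"].map(first_folder)
# Express the timedelta as whole days so it can be aggregated and sorted
sample["Days"] = sample["Crawl Efficacy"].dt.days

# Number of URLs and median crawl efficacy per section, slowest first
by_section = sample.groupby("Section")["Days"].agg(["count", "median"])
print(by_section.sort_values("median", ascending=False))
```

To run this on the notebook's data, `sg` would take the place of `sample`. The median is used here rather than the mean so that a single stale page does not dominate a section.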
