#### What is Sitemap:

A sitemap is a file that provides information about the pages, images, and videos on your website.<br>
Search engines such as Google use sitemaps to crawl a site more efficiently.<br>

Sitemaps come in several formats, such as XML, RSS, and plain text. The XML format is the most widely used, so it is the one demonstrated in the code cells of this notebook.

#### <font color="green">Use of Sitemap:</font>

A sitemap is used to provide specific information about your website, such as its webpages and the images on them.<br>
For example:<br>
A video sitemap provides information such as the video location, title, and rating.<br>
An image sitemap includes the location of the images on a specific webpage.

#### Types of Sitemap:

Following are the 3 types of sitemaps used widely:

1) URL Sitemap (Basic Sitemap that most websites use)<br>
2) Image Sitemap<br>
3) Video Sitemap

<font color="green">XML Tags used in Sitemap:</font>

Following are the available XML Tags for building a Sitemap:

1) \<urlset> (Required Tag): This is the main parent tag of the file, which contains the information for all the URLs. It has an attribute 'xmlns' which defines the schema of the sitemap.<br>
    \<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        
        
2) \<url> (Required Tag): This is a child tag of \<urlset> and holds one URL entry. The tags explained below are child tags of the \<url> tag.
        
3) \<loc> (Required Tag): This contains the URL of the page. It must begin with the protocol (such as http or https), end with a trailing slash if your web server requires one, and be shorter than 2,048 characters. Each URL must be unique; there should not be any duplicate entries.

4) \<lastmod> (Optional Tag): This value specifies the date the page was last modified. It should be in W3C Datetime format, which allows the time portion to be omitted, so a plain YYYY-MM-DD date is valid. It should reflect when the page itself was last modified, not when the sitemap was created.

5) \<changefreq> (Optional Tag): This indicates how frequently the page is likely to change. It is a hint to search engines and may not correlate exactly with how often they actually crawl the page. Valid values are: always, hourly, daily, weekly, monthly, yearly, never

6) \<priority> (Optional Tag): The priority of a URL is a measure of how important the page is relative to other URLs on the same website. Valid values range from 0.0 to 1.0, and the default priority of a page is 0.5. For example, a homepage is often given a priority of 1.0.

Sample XML Sitemap:

\<?xml version="1.0" encoding="UTF-8"?><br>
\<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><br>
    &emsp;\<url><br>
        &emsp;&emsp;\<loc> http://www.abc.com/ \</loc><br>
        &emsp;&emsp;\<lastmod> 2022-09-01 \</lastmod><br>
        &emsp;&emsp;\<changefreq> daily \</changefreq><br>
        &emsp;&emsp;\<priority> 0.9 \</priority>  <br>
    &emsp;\</url><br>
\</urlset>
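
The tag rules listed above can be checked before an entry is written to a sitemap. The following is a minimal sketch, not part of the notebook's own pipeline; the function name `validate_url_entry` and its messages are hypothetical, and it assumes `lastmod` is given as a plain YYYY-MM-DD string and `priority` as a number.

```python
import re
from datetime import date

VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}

def validate_url_entry(loc, lastmod=None, changefreq=None, priority=None):
    """Return a list of problems found in one <url> entry (empty list means none found)."""
    problems = []
    if not re.match(r"^https?://", loc):
        problems.append("loc must start with http:// or https://")
    if len(loc) >= 2048:
        problems.append("loc must be shorter than 2,048 characters")
    if lastmod is not None:
        try:
            # The regex enforces the YYYY-MM-DD shape; fromisoformat rejects impossible dates
            if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", lastmod):
                raise ValueError(lastmod)
            date.fromisoformat(lastmod)
        except ValueError:
            problems.append("lastmod must be a valid YYYY-MM-DD date")
    if changefreq is not None and changefreq not in VALID_CHANGEFREQ:
        problems.append("changefreq is not one of the allowed values")
    if priority is not None and not 0.0 <= float(priority) <= 1.0:
        problems.append("priority must be between 0.0 and 1.0")
    return problems

# The values of the sample sitemap above pass all checks
sample_problems = validate_url_entry("http://www.abc.com/", "2022-09-01", "daily", 0.9)
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with an entry at once.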

#### How to construct Sitemap:

Following are the steps through which you can construct a sitemap for your website:

Step 1: Collection of URLs to be included in the sitemap: This means collecting the URLs that the sitemap should list. These might be stored in a database and retrieved from there. Here, in <i>Part 1</i> of this notebook, we scrape a single page for demonstration purposes. We pre-process all the URLs that are retrieved and store the unique URLs in a list.

Step 2: Creating Sitemap for URLs: From the URLs we stored in Part 1, we construct the sitemap. Depending on the number of URLs, one or more sitemaps are needed. A sitemap can include a <i>maximum of 50,000 URL entries</i> and its size must <i>not exceed 50 MB</i>. This procedure is explained in <i>Part 2</i> of the notebook.

Step 3: Creation of Sitemap Index: If multiple sitemap files are created, you can list each of them as an entry in a Sitemap Index file. A Sitemap Index can include a <i>maximum of 50,000 Sitemap entries</i> and its size must <i>not exceed 50 MB</i>. There can be more than 1 Sitemap Index file. The XML format of a Sitemap Index file is similar to that of a Sitemap. You will get a detailed view of this in <i>Part 2</i> of the notebook.

Each URL entry must be unique in a Sitemap, and each Sitemap entry must be unique in Sitemap Index.
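
The two limits from Steps 2 and 3 can be checked on a generated file. This is a rough sketch under stated assumptions: `within_limits` is a hypothetical helper, the caller already knows the number of entries, and 50 MB is taken as 50 * 1024 * 1024 bytes of uncompressed file size.

```python
import os
import tempfile

MAX_ENTRIES = 50000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, uncompressed

def within_limits(file_path, num_entries):
    """Return True if a sitemap (or sitemap index) file respects both limits."""
    return num_entries <= MAX_ENTRIES and os.path.getsize(file_path) <= MAX_BYTES

# Usage with a tiny throwaway file standing in for a real sitemap
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write("<urlset></urlset>")
ok = within_limits(tmp.name, num_entries=240)
too_many = within_limits(tmp.name, num_entries=60000)
os.remove(tmp.name)
```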

Let's dive into each of these parts.

### PART 1: Web Scraping of URLs

In [1]:
# Importing necessary libraries for web-scraping

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
from datetime import datetime

In [2]:
# Importing libraries for constructing sitemap

from xml.dom import minidom
import os 

In [3]:
# Establishing a request with the website

req = Request("https://www.udemy.com/")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")

In [4]:
# Retrieving all the URLs from the website

links = []
for link in soup.find_all('a'):
    href = link.get('href')
    # <a> tags without an href attribute return None, so skip them
    if href is not None:
        links.append(href)

In [5]:
links

['/',
 '/courses/development/',
 '/courses/development/web-development/',
 '/courses/development/data-science/',
 '/courses/development/mobile-apps/',
 '/courses/development/programming-languages/',
 '/courses/development/game-development/',
 '/courses/development/databases/',
 '/courses/development/software-testing/',
 '/courses/development/software-engineering/',
 '/courses/development/development-tools/',
 '/courses/development/no-code-development/',
 '/courses/business/',
 '/courses/business/entrepreneurship/',
 '/courses/business/communications/',
 '/courses/business/management/',
 '/courses/business/sales/',
 '/courses/business/strategy/',
 '/courses/business/operations/',
 '/courses/business/project-management/',
 '/courses/business/business-law/',
 '/courses/business/analytics-and-intelligence/',
 '/courses/business/human-resources/',
 '/courses/business/industry/',
 '/courses/business/e-commerce/',
 '/courses/business/media/',
 '/courses/business/real-estate/',
 '/courses/busi

Observations:

1) Issue: Some URLs consist only of / <br>
Solution: Such URLs represent the home page, i.e. the Base URL, so they must be removed from the list and a single base URL added instead

2) Issue: Some URLs carry arguments after a ? <br>
Solution: Clean these URLs by removing the ? and everything after it, keeping only the actual page URL

3) Issue: Most of the URLs don't start with the domain https://www.udemy.com <br>
Solution: The domain needs to be added to such URLs

4) Issue: Some URLs don't have an ending / <br>
Solution: / needs to be added at the end of such URLs

5) Issue: There may be duplicate URLs acquired while scraping <br>
Solution: Duplicate URLs must be removed

6) Issue: The URLs must be of the same domain as www.udemy.com, some URLs have their domain as about.udemy.com <br>
Solution: Such URLs must be removed

7) Issue: Udemy has its own sitemap, and we have obtained that link as well while scraping URLs <br>
Solution: The URL indicating Udemy sitemap must be removed

#### <font color="green">Why these observations were made, what is the use of this preprocessing in Sitemap:</font>

For an entry to be included in \<loc> tag, <br>
    
1) It must be a valid URL that starts with http or https and has a proper domain, which covers issues 1 and 3. <br>All the URLs in a sitemap must have the same domain as the sitemap itself. For example, if the sitemap is located at "https://www.abc.com/sitemap.xml", then the URL entries in that sitemap must have "https://www.abc.com/" as their domain, which is issue 6. <br>To handle all these issues, we prepend BASE_URL, i.e. the Udemy domain, to relative URLs. URLs that already have a domain other than BASE_URL are removed, as they do not match.
    
2) Each page should appear in one consistent form. We give every URL a trailing slash (issue 4) and strip the arguments, i.e. the ? part of the URL (issue 2), so that variants of the same page collapse into a single entry.
    
3) We need unique URL entries in a sitemap, which covers issue 5. The scraped links also include Udemy's own sitemap page, and we don't list a sitemap as an entry inside a sitemap, so that URL is removed, which solves issue 7.
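
The rules above can also be expressed with the standard library's `urllib.parse`. The following is an illustrative sketch under stated assumptions, not the notebook's own implementation: `normalize_links` is a hypothetical name, and it deliberately drops anything whose scheme or host differs from the base URL (including `mailto:` links).

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize_links(links, base_url="https://www.udemy.com"):
    """Resolve, clean and de-duplicate raw href values against base_url."""
    base = urlsplit(base_url)
    seen = set()
    result = []
    for href in links:
        if not href:
            continue  # skip None or empty href values
        # urljoin turns relative links into absolute ones (issues 1 and 3)
        parts = urlsplit(urljoin(base_url.rstrip("/") + "/", href))
        if (parts.scheme, parts.netloc) != (base.scheme, base.netloc):
            continue  # other domains or non-web schemes (issue 6)
        # Add the trailing slash (issue 4); query and fragment are dropped below (issue 2)
        path = parts.path if parts.path.endswith("/") else parts.path + "/"
        if path == "/sitemap/":
            continue  # the site's own sitemap page (issue 7)
        url = urlunsplit((parts.scheme, parts.netloc, path, "", ""))
        if url not in seen:  # set lookup keeps de-duplication fast (issue 5)
            seen.add(url)
            result.append(url)
    return result

raw_links = ["/", "/courses/development/?p=1", "teaching", None,
             "https://about.udemy.com/", "/sitemap/", "/", "mailto:x@y.z"]
clean_links = normalize_links(raw_links)
```

Using a set for the membership test avoids rescanning the whole result list for every link, which matters once the list approaches tens of thousands of URLs.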

In [6]:
BASE_URL = 'https://www.udemy.com'

#### Pre-processing of URL Links

In [7]:
def preprocessURLs(links):
    """
    As per the observations and solutions discussed, this method is used for preprocessing of URLs.
    It will take URL Links as input, and output valid URLs that can be included in sitemap.
    """
    
    Urls = []

    for link in links:
        # link.get('href') returns None for <a> tags without an href; skip those
        if link is None:
            continue
        processed_link = link
    
        # Check for valid URL starting with https://
        if not processed_link.startswith("https://"):
            # Check for starting /
            if not processed_link.startswith("/"):
                processed_link = "/" + processed_link
            processed_link = BASE_URL + processed_link
        
        # Check for ? arguments
        if "?" in processed_link:
            processed_link = processed_link.split("?")[0]
        
        # Check for ending /
        if not processed_link.endswith("/"):
            processed_link = processed_link + "/"
        
        # Keep only URLs on BASE_URL, skip Udemy's own sitemap page, and skip duplicates
        if processed_link not in Urls and (processed_link.startswith(BASE_URL) and (not processed_link=='https://www.udemy.com/sitemap/')):
            Urls.append(processed_link)
            
    return Urls

In [8]:
Urls = preprocessURLs(links)
Urls

['https://www.udemy.com/',
 'https://www.udemy.com/courses/development/',
 'https://www.udemy.com/courses/development/web-development/',
 'https://www.udemy.com/courses/development/data-science/',
 'https://www.udemy.com/courses/development/mobile-apps/',
 'https://www.udemy.com/courses/development/programming-languages/',
 'https://www.udemy.com/courses/development/game-development/',
 'https://www.udemy.com/courses/development/databases/',
 'https://www.udemy.com/courses/development/software-testing/',
 'https://www.udemy.com/courses/development/software-engineering/',
 'https://www.udemy.com/courses/development/development-tools/',
 'https://www.udemy.com/courses/development/no-code-development/',
 'https://www.udemy.com/courses/business/',
 'https://www.udemy.com/courses/business/entrepreneurship/',
 'https://www.udemy.com/courses/business/communications/',
 'https://www.udemy.com/courses/business/management/',
 'https://www.udemy.com/courses/business/sales/',
 'https://www.udemy.c

Each Sitemap can contain a maximum of 50,000 \<url> tags. <br>
Each URL from the above list becomes a separate \<url> tag in the sitemap. <br>
So we check this limit first, in order to calculate how many sitemaps to generate when there are more than 50,000 URLs. <br>
If there are more than 50,000 URLs, they are split into blocks of at most 50,000 each and one sitemap is constructed per block. We call this 'Splitting of Sitemap'. <br>
When this produces more than one sitemap, it is better to also create a Sitemap Index.
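
The splitting described above amounts to a ceiling division followed by list slicing. A minimal sketch, assuming the URLs are held in a Python list (`split_into_blocks` is a hypothetical helper, not a function of this notebook):

```python
import math

def split_into_blocks(urls, block_size=50000):
    """Split urls into consecutive blocks of at most block_size items."""
    num_blocks = math.ceil(len(urls) / block_size)
    return [urls[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]

# 120,000 placeholder items stand in for real URLs
blocks = split_into_blocks(list(range(120000)))
block_sizes = [len(block) for block in blocks]
```

Each block can then be written to its own sitemap file, and the number of blocks tells you whether a Sitemap Index is worthwhile.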

In [9]:
# Check for 50,000 limit, else split it in parts

def calculateNumberOfSitemaps(num_Urls):
    """
    This method calculates the number of sitemaps that can be generated, given the number of URLs as an input.
    """
    
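    # Integer division gives the number of full 50,000-URL blocks
    num_sitemaps = num_Urls // 50000
    # Any remainder needs one more sitemap
    if num_Urls % 50000 != 0:
        num_sitemaps += 1

    # Return outside the if, so exact multiples of 50,000 also get a result
    return num_sitemaps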

In [10]:
len(Urls)

240

In [11]:
calculateNumberOfSitemaps(len(Urls))

1

So here, in our case the URLs are only 240, as we have only scraped the Udemy homepage. <br>
If we had scraped more pages, we would have more unique URLs. <br>
So here we have to create only 1 sitemap, and hence there will be no Sitemap Index

<font color="green">There are 2 ways to build Sitemaps and Sitemap Index:</font>

<i>Method 1:</i> <br>
The general way of including all the URLs in Sitemap and splitting sitemaps into 50,000 URL blocks each.<br>
We have implemented this method in Part 2: Building a Combined Sitemap.

<i>Method 2:</i> <br>
Grouping the URLs by the category they belong to, and then creating sitemaps per category.<br>
Here, the number of sitemaps will depend on the number of categories and number of URLs in each category.<br>
Usually each category will have 1 sitemap, provided it has at most 50,000 URLs.<br>
If a category has more than 50,000 URLs, its sitemap is split into multiple sitemaps for that category.<br>
We have implemented this method in Part 3: Building Category-wise Sitemap.

Let's have a look at each of these methods.

### PART 2: Building a Combined Sitemap

In [12]:
# Create method for creating sitemaps, pass file_name and Urls

def createSitemap(file_name, Urls):
    """
    This method takes file_name and Urls list as input and provides a sitemap file as an output.
    The sitemap file will be created on the current directory path.
    """
    
    root = minidom.Document()
    urlset = root.createElement('urlset')
    root.appendChild(urlset)
    urlset.setAttribute("xmlns","http://www.sitemaps.org/schemas/sitemap/0.9")
    
    for url_link in Urls:
        url = root.createElement('url')
        loc = root.createElement("loc")
        loc.appendChild(root.createTextNode(url_link))
        url.appendChild(loc)
        
        urlset.appendChild(url)
        
    xml_str = root.toprettyxml(indent="\t", encoding="UTF-8").decode("utf-8")
        
    # Writing Sitemap.xml to File (explicit UTF-8 so the bytes match the XML declaration)
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(xml_str)
    

Optional tags that can be included inside the \<url> tag: \<changefreq>, \<priority>, \<lastmod> <br>

To add optional tags, replace the for loop above with the one below, which adds each optional tag the same way \<loc> is added.<br>
The loop below includes all the optional tags along with the required \<loc> tag.<br>
Keep the optional tags you need and remove the code for the others.<br>

    for url_link in Urls:
    
        url = root.createElement('url')
        loc = root.createElement("loc")
        loc.appendChild(root.createTextNode(url_link))
        url.appendChild(loc)

        changefreq = root.createElement("changefreq")
        changefreq.appendChild(root.createTextNode("daily"))
        url.appendChild(changefreq)

        priority = root.createElement("priority")
        priority.appendChild(root.createTextNode("1.0"))
        url.appendChild(priority)

        lastmod = root.createElement("lastmod")
        lastmod.appendChild(root.createTextNode("2022-12-19"))
        url.appendChild(lastmod)
    
        urlset.appendChild(url)
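
As a variation on the loop above, the optional tags can be added only when a value is supplied, instead of hard-coding them for every URL. This is a sketch under assumptions: `build_url_element` is a hypothetical helper, and the child tags are emitted in the same order as the sample sitemap earlier in this notebook.

```python
from datetime import date
from xml.dom import minidom

def build_url_element(doc, url_link, lastmod=None, changefreq=None, priority=None):
    """Create a <url> element; optional child tags are added only when a value is given."""
    url = doc.createElement("url")
    fields = [("loc", url_link), ("lastmod", lastmod),
              ("changefreq", changefreq), ("priority", priority)]
    for tag, value in fields:
        if value is None:
            continue  # leave out optional tags that were not supplied
        child = doc.createElement(tag)
        child.appendChild(doc.createTextNode(str(value)))
        url.appendChild(child)
    return url

doc = minidom.Document()
urlset = doc.createElement("urlset")
urlset.setAttribute("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9")
doc.appendChild(urlset)

# date.isoformat() produces the YYYY-MM-DD form expected by <lastmod>
entry = build_url_element(doc, "https://www.udemy.com/",
                          lastmod=date(2022, 12, 19).isoformat(), changefreq="daily")
urlset.appendChild(entry)
xml_str = doc.toxml()
```

In practice the `lastmod` value would come from wherever the page's real modification date is stored, rather than from a fixed date as in this example.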

In [13]:
def createCombinedSitemap():
    """
    This method first calculates the number of sitemaps to be generated, then the file name is decided.
    Later, URLs are sliced into 50,000 blocks each and the corresponding sitemap files are created using these URLs.
    This method returns all the sitemap file names as its output.
    """
    
    sitemap_file_names = []   
    num_sitemaps = calculateNumberOfSitemaps(len(Urls))
    
    sitemap_index = 0
    if num_sitemaps > 1:
        sitemap_index = 1
    
    for part_number in range(num_sitemaps):
    
        if num_sitemaps == 1:
            file_name = "sitemap.xml"
        else:
            file_name = "sitemap_" + str(part_number+1) + ".xml"
    
        min_offset = 50000 * (part_number)
        max_offset = 50000 * (part_number + 1)
        if max_offset > len(Urls):
            max_offset = len(Urls)

        Url_links = Urls[min_offset : max_offset]
        createSitemap(file_name, Url_links)
    
        if sitemap_index:
            sitemap_file_names.append(file_name)

    return sitemap_file_names


In [14]:
sitemap_file_names = createCombinedSitemap()

In [15]:
# Reading the sitemap file we just created

f = open("sitemap.xml", "r")
print(f.read())
f.close()

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url>
		<loc>https://www.udemy.com/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/courses/development/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/courses/development/web-development/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/courses/development/data-science/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/courses/development/mobile-apps/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/courses/development/programming-languages/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/courses/development/game-development/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/courses/development/databases/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/courses/development/software-testing/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/courses/development/software-engineering/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/courses/development/development-tools/</loc>
	

In [16]:
def createSitemapIndex(file_names):
    """
    This method takes all the sitemap file names as input and constructs a sitemap index file.
    The sitemap index file will be created on the current directory path.
    """
    
    root = minidom.Document()
    sitemap_index = root.createElement('sitemapindex')
    root.appendChild(sitemap_index)
    sitemap_index.setAttribute("xmlns","http://www.sitemaps.org/schemas/sitemap/0.9")
    
    for sitemap_file in file_names:
        sitemap = root.createElement("sitemap")
        loc = root.createElement("loc")
        loc.appendChild(root.createTextNode(BASE_URL + "/" + sitemap_file))
        sitemap.appendChild(loc)
        
        sitemap_index.appendChild(sitemap)
        
    xml_str = root.toprettyxml(indent="\t", encoding="UTF-8").decode("utf-8")
    
    # Write as UTF-8 so the file's bytes match the encoding in the XML declaration
    with open("sitemap_index.xml", "w", encoding="utf-8") as f:
        f.write(xml_str)

#### <font color="green">Why we are prepending BASE_URL to file names:</font> 
(BASE_URL + "/" + sitemap_file)

Each \<loc> entry in a Sitemap Index must be the full URL at which the sitemap file can be fetched, not just a local file name.<br>
A sitemap should also be served from the same domain as the URL entries it contains. As all our URLs have BASE_URL as their domain, we prepend BASE_URL to each sitemap file name. <br>
The files generated here are written to the current directory, so they still have to be uploaded to the website and served from these URLs before a search engine can read them.
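
A small sketch of building that location with `urllib.parse.urljoin`, assuming the sitemap files are served from the site root (`sitemap_location` is a hypothetical helper):

```python
from urllib.parse import urljoin

def sitemap_location(base_url, file_name):
    """Return the absolute URL of a sitemap file served from the site root."""
    # Normalising to exactly one trailing slash avoids a missing or doubled /
    return urljoin(base_url.rstrip("/") + "/", file_name)

location = sitemap_location("https://www.udemy.com", "sitemap_1.xml")
```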

In [17]:
# In order to demonstrate the Sitemap Index, we repeat the URL list so that there are more than 50,000 URLs (the repeats are duplicates, used for demonstration only)

Urls = Urls * 250
len(Urls)

60000

#### Why there is need to create Sitemap Index:

In order to keep a record of all the sitemap files created, we create a Sitemap Index file. <br>
A Sitemap Index consists of multiple sitemap entries, each of which points to a sitemap containing URLs and their information. 
A Sitemap Index can include a maximum of 50,000 sitemap entries, and its size must not exceed 50 MB. There can be more than 1 Sitemap Index file.


#### XML Tags used in Sitemap Index:

Following are the available XML Tags for building a Sitemap Index:

1) \<sitemapindex> (Required Tag): This is main parent tag of the file which contains information about sitemap files. It has an attribute 'xmlns' which defines the schema of the sitemap index.
\<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

2) \<sitemap> (Required Tag): This is child tag of \<sitemapindex> and is responsible for each individual sitemap entry. Further explained tags are child tags of \<sitemap> tag.

3) \<loc> (Required Tag): This includes URL location of the sitemap. It must begin with the protocol like http or https. The Sitemap entries must be unique, there should not be any duplicates.

4) \<lastmod> (Optional Tag): This value specifies the last modified date of the corresponding sitemap file. It should be in W3C Datetime format, of which YYYY-MM-DD is a valid form.
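
To see how these tags fit together, the sketch below reads a Sitemap Index back with `minidom`. The sample XML, the abc.com URLs, and the helper name `read_sitemap_index` are made up for illustration.

```python
from xml.dom import minidom

SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://www.abc.com/sitemap_1.xml</loc>
        <lastmod>2022-09-01</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://www.abc.com/sitemap_2.xml</loc>
    </sitemap>
</sitemapindex>"""

def read_sitemap_index(xml_text):
    """Return a list of (loc, lastmod) tuples; lastmod is None when the tag is absent."""
    doc = minidom.parseString(xml_text.encode("utf-8"))
    entries = []
    for sitemap in doc.getElementsByTagName("sitemap"):
        loc = sitemap.getElementsByTagName("loc")[0].firstChild.data.strip()
        lastmod_nodes = sitemap.getElementsByTagName("lastmod")
        lastmod = lastmod_nodes[0].firstChild.data.strip() if lastmod_nodes else None
        entries.append((loc, lastmod))
    return entries

entries = read_sitemap_index(SAMPLE_INDEX)
```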

In [18]:
sitemap_file_names = createCombinedSitemap()

if len(sitemap_file_names) > 0:
    createSitemapIndex(sitemap_file_names)

In [19]:
# Reading the sitemap index file that we just created

f = open("sitemap_index.xml", "r")
print(f.read())
f.close()

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<sitemap>
		<loc>https://www.udemy.com/sitemap_1.xml</loc>
	</sitemap>
	<sitemap>
		<loc>https://www.udemy.com/sitemap_2.xml</loc>
	</sitemap>
</sitemapindex>



### PART 3: Building Category-wise Sitemap

#### <font color="green">Need for Categorizing URLs:</font>

In order to create and maintain sitemaps by category, we first need to categorize the URLs.<br>
Categorizing can improve efficiency across the teams of an organization.<br>
Each team can maintain its own sitemap, which distributes the work of maintenance and monitoring.<br>
Thus, categorizing URLs is a way to organize your sitemaps more efficiently and usefully.

In [20]:
# Categorizing URLs as per each category

categories = ["Course", "Topic", "Platform"]

# categoryUrls is a dictionary where category_name is the key and value is list of Urls of that category
categoryUrls = {}

In [21]:
# We will retrieve the Udemy URLs again, because we added dummy URLs for the Sitemap Index demonstration

links = []
for link in soup.find_all('a'):
    href = link.get('href')
    # <a> tags without an href attribute return None, so skip them
    if href is not None:
        links.append(href)
    
Urls = preprocessURLs(links)

In [22]:
len(Urls)

240

Here, depending on the patterns in URL links, we will categorize them into 3 categories.<br>
This categorization purely depends on business requirements. And the categories should be pre-decided / chosen by the business team, as required.

Pattern-wise Categorization used in our case:<br>

1) Course: URLs having /courses/ or /course/ in their URL path will be categorized into Course category.<br>
2) Topic: URLs having /topic/ in their URL path will have Topic as their category.<br>
3) Platform: URLs that include no such pattern are more likely to be platform URLs.
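
The three patterns above can be written as a small function that looks at whole path segments rather than substrings. This is an illustrative sketch; `categorize` is a hypothetical helper and the `/topic/python/` URL is a made-up example.

```python
from urllib.parse import urlsplit

def categorize(url):
    """Return 'Course', 'Topic' or 'Platform' based on the path segments of url."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if "course" in segments or "courses" in segments:
        return "Course"
    if "topic" in segments:
        return "Topic"
    return "Platform"

examples = {
    "https://www.udemy.com/courses/development/": categorize("https://www.udemy.com/courses/development/"),
    "https://www.udemy.com/topic/python/": categorize("https://www.udemy.com/topic/python/"),
    "https://www.udemy.com/featured-topics/": categorize("https://www.udemy.com/featured-topics/"),
}
```

Matching on whole segments means a path such as `/featured-topics/` is not mistaken for a Topic URL.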

In [23]:
for url_link in Urls:
    if ('/course/' in url_link) or ('/courses/' in url_link):
        category = 'Course'
    elif '/topic/' in url_link:
        category = 'Topic'
    else:
        category = 'Platform'

    # setdefault creates the list the first time a category is seen
    categoryUrls.setdefault(category, []).append(url_link)

In [24]:
categoryUrls

{'Platform': ['https://www.udemy.com/',
  'https://www.udemy.com/featured-topics/',
  'https://www.udemy.com/teaching/',
  'https://www.udemy.com/udemy-business/request-demo-mx/',
  'https://www.udemy.com/udemy-business/',
  'https://www.udemy.com/mobile/',
  'https://www.udemy.com/support/',
  'https://www.udemy.com/affiliate/',
  'https://www.udemy.com/terms/',
  'https://www.udemy.com/terms/privacy/'],
 'Course': ['https://www.udemy.com/courses/development/',
  'https://www.udemy.com/courses/development/web-development/',
  'https://www.udemy.com/courses/development/data-science/',
  'https://www.udemy.com/courses/development/mobile-apps/',
  'https://www.udemy.com/courses/development/programming-languages/',
  'https://www.udemy.com/courses/development/game-development/',
  'https://www.udemy.com/courses/development/databases/',
  'https://www.udemy.com/courses/development/software-testing/',
  'https://www.udemy.com/courses/development/software-engineering/',
  'https://www.udemy.

In [25]:
# Getting count of URLs for each category, in order to tally with the sitemaps later

for category in categoryUrls:
    value = categoryUrls[category]
    print(category,": ",len(value))

Platform :  10
Course :  144
Topic :  86


In [26]:
def buildCategorySitemap(category_name, Urls, partition_number, consider_partition_number):
    """
    This method takes following arguments as its input:
    category_name: Category to which the URLs belong
    Urls: List of URLs in the category category_name (here the category will be Platform/Topic/Course)
    partition_number: Part Number of the sitemap to be generated.
                      If category has more than 50,000 URLs, multiple sitemaps will be generated for that category.
                      This parameter tells which sitemap needs to be generated out of all the multiple sitemaps.
    consider_partition_number: It indicates whether to include partition_number in file_name.
                               If there is more than 1 sitemap to be generated, consider_partition_number=1, Else 0.
    This method constructs a sitemap file for the provided category.
    The sitemap file will be created on the current directory path.
    """
    root = minidom.Document()
    urlset = root.createElement('urlset')
    root.appendChild(urlset)
    urlset.setAttribute("xmlns","http://www.sitemaps.org/schemas/sitemap/0.9")
    
    for url_link in Urls:
        url = root.createElement("url")
        loc = root.createElement("loc")
        loc.appendChild(root.createTextNode(url_link))
        url.appendChild(loc)
        
        urlset.appendChild(url)
        
    xml_str = root.toprettyxml(indent="\t", encoding="UTF-8").decode("utf-8")
    
    file_name = category_name.lower() + "_sitemap.xml"
    if consider_partition_number:
        file_name = category_name.lower() + "_sitemap_" + str(partition_number+1) + ".xml"
        
    # Write as UTF-8 so the file's bytes match the encoding in the XML declaration
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(xml_str)
    
    return file_name

In [27]:
def createPartitionsAndBuildCategorySitemap(category_name, Urls):
    """
    This method takes category name and list of URLs as its input.
    First it calculates the number of sitemaps to be generated, decides whether to consider file part number in file name.
    Later, URLs are sliced into 50,000 blocks each for that specific category.
    Lastly, the corresponding category-specific sitemap files are created.
    This method returns all the sitemap file names for the category as its output.
    """
    
    sitemap_file_names = []
    
    # Check for 50,000 limit, else split it in parts
    num_category_sitemaps = len(Urls) // 50000
    if len(Urls) % 50000 !=0:
        num_category_sitemaps += 1
    
    consider_part_number_in_file_name = 1
    if num_category_sitemaps == 1:
        consider_part_number_in_file_name = 0
    
    for part_number in range(num_category_sitemaps):
        
        min_offset = 50000 * (part_number)
        max_offset = 50000 * (part_number + 1)
        if max_offset > len(Urls):
            max_offset = len(Urls)
        
        Url_links = Urls[min_offset : max_offset]
        file_name = buildCategorySitemap(category_name, Url_links, part_number, consider_part_number_in_file_name)
        sitemap_file_names.append(file_name)
    
    return sitemap_file_names

In [28]:
def createCategorywiseSitemap():
    """
    This method searches whether there are any URLs present for each category from list of pre-decided categories.
    If URLs exist for a category, these URLs are provided for creation of sitemaps.
    This method maintains sitemap file names for all the categories and returns these as its output.
    """
    
    sitemap_file_names = []
    
    for category in categories:
        
        if category in categoryUrls:
            Url_links = categoryUrls[category]
            file_names = createPartitionsAndBuildCategorySitemap(category, Url_links)
            sitemap_file_names.extend(file_names)
            
    return sitemap_file_names

In [29]:
def createSitemapIndex(file_names):
    """
    This method takes all the category-specific sitemap file names as input and constructs a sitemap index file.
    The sitemap index file will be created on the current directory path.
    """
    
    root = minidom.Document()
    sitemap_index = root.createElement('sitemapindex')
    root.appendChild(sitemap_index)
    sitemap_index.setAttribute("xmlns","http://www.sitemaps.org/schemas/sitemap/0.9")
    
    for sitemap_file in file_names:
        sitemap = root.createElement("sitemap")
        loc = root.createElement("loc")
        loc.appendChild(root.createTextNode(BASE_URL + "/" + sitemap_file))
        sitemap.appendChild(loc)
        
        sitemap_index.appendChild(sitemap)
        
    xml_str = root.toprettyxml(indent="\t", encoding="UTF-8").decode("utf-8")
    
    # Write as UTF-8 so the file's bytes match the encoding in the XML declaration
    with open("sitemap_index.xml", "w", encoding="utf-8") as f:
        f.write(xml_str)

In [30]:
sitemap_file_names = createCategorywiseSitemap()

if len(sitemap_file_names) > 0:
    createSitemapIndex(sitemap_file_names)

In [31]:
# We will read Sitemap Index so that we can get the names of category Sitemaps created, and access them further

f = open("sitemap_index.xml", "r")
print(f.read())
f.close()

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<sitemap>
		<loc>https://www.udemy.com/course_sitemap.xml</loc>
	</sitemap>
	<sitemap>
		<loc>https://www.udemy.com/topic_sitemap.xml</loc>
	</sitemap>
	<sitemap>
		<loc>https://www.udemy.com/platform_sitemap.xml</loc>
	</sitemap>
</sitemapindex>



In [32]:
# Reading one of the category sitemaps that we have created

f = open("platform_sitemap.xml", "r")
print(f.read())
f.close()

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url>
		<loc>https://www.udemy.com/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/featured-topics/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/teaching/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/udemy-business/request-demo-mx/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/udemy-business/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/mobile/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/support/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/affiliate/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/terms/</loc>
	</url>
	<url>
		<loc>https://www.udemy.com/terms/privacy/</loc>
	</url>
</urlset>



-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------

### Recap Points: 

1) A Sitemap file consists of a parent tag \<urlset>, which has many \<url> tags inside it, and each \<url> tag contains one mandatory child tag \<loc> that indicates URL location of the webpage.

2) \<urlset> tag has an attribute named xmlns indicating schema of the sitemap. xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"

3) Maximum limit of \<url> tags in a Sitemap File: 50,000<br>
If there are multiple sitemaps, they need to be listed in a Sitemap Index.

4) A Sitemap Index file consists of a parent tag \<sitemapindex>, which has many \<sitemap> entries inside it, and each \<sitemap> tag contains one mandatory child tag \<loc> that indicates URL location of the sitemap.

5) \<sitemapindex> tag has an attribute named xmlns indicating schema of the sitemap index. xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"

6) Maximum limit of \<sitemap> tags in a Sitemap Index File: 50,000<br>
There can be multiple sitemap index files.

7) Reference links: <br>
https://www.sitemaps.org/protocol.html <br>
https://developers.google.com/search/docs/crawling-indexing/sitemaps/overview <br><br>

8) Snapshot of a Sitemap: 
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
9) Snapshot of a Sitemap Index:

<img align="left" src="Screenshots/Sitemap.PNG" height="300" width="450">

<img align="right" src="Screenshots/Sitemap_Index.PNG" height="400" width="500">
<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>

10) Snapshot of robots.txt: <br>
robots.txt is a file, placed at the root of your domain, that tells crawlers which parts of the site they may access. It can also list the path to your Sitemap / Sitemap Index file. Here it is served from the same domain as the sitemap and sitemap index files. <br>
A search engine can obtain the sitemap path from the robots.txt file, and then crawl the sitemap and URL entries.<br><br>
Mentioned below are snapshots for robots.txt (Considering BASE_URL as the domain name): <br>
a) Including your Sitemap file path in robots.txt: <br>
(If there is only 1 Sitemap for your website)

<img align="left" src="Screenshots/robots1.PNG" height="400" width="500">
<br><br><br><br><br>

b) Including your Sitemap Index file path in robots.txt: <br>
(If there are multiple Sitemap files for your website, then you'll have a Sitemap Index file created in this case)

<img align="left" src="Screenshots/robots2.PNG" height="400" width="500">
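
The recap points above can be sketched in Python with the standard library. This is a minimal illustration, not the notebook's own implementation: the function names, the file names (`sitemap.xml`, `sitemap1.xml`, `sitemap_index.xml`) and the `https://www.abc.com` domain are assumptions made for the example. It splits a list of page URLs into sitemap files of at most 50,000 entries, adds a Sitemap Index when more than one file is needed, and returns the `Sitemap:` line that would go into robots.txt.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_ENTRIES = 50_000  # limit from recap points 3 and 6


def build_xml(root_tag, child_tag, locations):
    """Return XML text for a <urlset> or <sitemapindex> with one <loc> per entry."""
    root = ET.Element(root_tag, {"xmlns": SITEMAP_NS})
    for location in locations:
        child = ET.SubElement(root, child_tag)
        ET.SubElement(child, "loc").text = location
    return ET.tostring(root, encoding="unicode")


def build_sitemaps(page_urls, base_url, max_entries=MAX_ENTRIES):
    """Split page_urls into sitemap files; add an index if several are needed.

    Returns ({file name: XML text}, robots.txt line). File names are hypothetical.
    """
    base_url = base_url.rstrip("/")
    chunks = [page_urls[i:i + max_entries]
              for i in range(0, len(page_urls), max_entries)]
    files = {}
    if len(chunks) <= 1:
        # Everything fits in one sitemap, so robots.txt points at it directly.
        files["sitemap.xml"] = build_xml("urlset", "url", page_urls)
        entry = "sitemap.xml"
    else:
        names = [f"sitemap{n}.xml" for n in range(1, len(chunks) + 1)]
        for name, chunk in zip(names, chunks):
            files[name] = build_xml("urlset", "url", chunk)
        # The index lists the full URL of every sitemap file.
        files["sitemap_index.xml"] = build_xml(
            "sitemapindex", "sitemap", [f"{base_url}/{name}" for name in names]
        )
        entry = "sitemap_index.xml"
    return files, f"Sitemap: {base_url}/{entry}"


# Usage with a tiny limit so that the index branch is exercised.
pages = ["https://www.abc.com/a/", "https://www.abc.com/b/", "https://www.abc.com/c/"]
files, robots_line = build_sitemaps(pages, "https://www.abc.com", max_entries=2)
print(sorted(files))
print(robots_line)
```

Passing `max_entries=2` is only for demonstration; with the default value the same call would produce a single `sitemap.xml` and no index.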

-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------
### Information about Image and Video Sitemaps (Extensions of Sitemap):

Image Sitemap:

An Image Sitemap contains information about the images included on your website. In simple terms, an Image Sitemap is a Sitemap that has \<image:image> tag entries. <br>
Each \<url> tag can have \<image:image> tags that describe the images on the corresponding page.<br>
There can be at most <i>1000</i> \<image:image> entries inside a \<url> tag. <br>
\<image:loc> is a required child tag of \<image:image>.
Note that, as in the sample image sitemap below, the \<urlset> tag needs the xmlns:image attribute, which declares the image sitemap schema. <br>

Sample Image Sitemap: <br>
\<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" <br>
    &ensp; xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" > <br>
    &emsp;\<url> <br>
        &emsp;&emsp;\<loc> https://www.abc.com/example/ \</loc><br>
        &emsp;&emsp;\<image:image><br>
              &emsp;&emsp;&emsp;\<image:loc> https://www.abc.com/example/image1.png \</image:loc><br>
        &emsp;&emsp;\</image:image><br>
    &emsp;\</url><br>
\</urlset>
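
A sitemap like the sample above could be generated along the following lines. This is a sketch under assumptions: the function name `build_image_sitemap` and the shape of its `pages` argument (a dict from page URL to a list of image URLs) are made up for illustration. It uses `xml.etree.ElementTree` from the standard library and registers both namespaces so that the output carries the `xmlns` and `xmlns:image` attributes.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"
MAX_IMAGES_PER_URL = 1000  # limit stated in the text above

# Map the namespaces to the prefixes used in the sample ("" = default namespace).
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def build_image_sitemap(pages):
    """Build image sitemap XML; pages maps a page URL to its list of image URLs."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        if len(image_urls) > MAX_IMAGES_PER_URL:
            raise ValueError(f"{page_url} has more than {MAX_IMAGES_PER_URL} images")
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            # Each image gets its own <image:image> with the required <image:loc>.
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_image_sitemap(
    {"https://www.abc.com/example/": ["https://www.abc.com/example/image1.png"]}
)
print(xml_text)
```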

-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------

Video Sitemap:

A Video Sitemap lists the videos that are part of your website. A Video Sitemap can be described as a Sitemap that has \<video:video> tag entries. It works the same way as an Image Sitemap; only the XML tags differ. <br>
Each \<url> tag can have \<video:video> tags that contain information about the videos on the corresponding page.<br>
There can be at most <i>1000</i> \<video:video> entries inside a \<url> tag. <br>
\<video:thumbnail_loc>, \<video:title>, \<video:description> and \<video:content_loc> are the mandatory child tags of \<video:video> (Google's documentation also accepts \<video:player_loc> in place of \<video:content_loc>). There are many optional tags as well that can be included. <br>
The \<urlset> tag needs the xmlns:video attribute, which declares the video sitemap schema.

Sample Video Sitemap: <br>
\<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" <br>
&ensp; xmlns:video="http://www.google.com/schemas/sitemap-video/1.1" ><br>
   &ensp;\<url><br>
     &ensp;&ensp;\<loc> https://www.abc.com/example/ \</loc><br>
     &ensp;&ensp;\<video:video><br>
        &ensp;&ensp; &ensp;\<video:thumbnail_loc> https://www.abc.com/example/video_thumbs/thumb1.jpg \</video:thumbnail_loc><br>
       &ensp;&ensp; &ensp;\<video:title> A Deep Dive into abc.com Services \</video:title><br>
       &ensp;&ensp; &ensp;\<video:description> Video that demonstrates the services provided by abc.com and getting in touch \</video:description><br>
       &ensp;&ensp; &ensp;\<video:content_loc> https://www.abc.com/example/videoIntro.mp4 \</video:content_loc><br>
      &ensp;&ensp;\</video:video><br>
    &ensp;\</url><br>
\</urlset>
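
Because a video entry has several mandatory child tags, it can help to check a video sitemap before publishing it. The sketch below is one possible check, not an official validator: the function name `missing_video_tags` and the shortened sample XML are assumptions for illustration. It treats \<video:thumbnail_loc>, \<video:title> and \<video:description> as required, and accepts either \<video:content_loc> or \<video:player_loc> as the video location, which follows Google's video sitemap documentation.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "video": "http://www.google.com/schemas/sitemap-video/1.1",
}
ALWAYS_REQUIRED = ("thumbnail_loc", "title", "description")
LOCATION_TAGS = ("content_loc", "player_loc")  # at least one must be present


def has_text(video, tag):
    """True if <video:tag> exists under the entry and holds non-blank text."""
    text = video.findtext(f"video:{tag}", default="", namespaces=NS)
    return bool(text and text.strip())


def missing_video_tags(sitemap_xml):
    """Return {page URL: [missing tag names]} for pages with incomplete video entries."""
    root = ET.fromstring(sitemap_xml)
    problems = {}
    for url in root.findall("sm:url", NS):
        page = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        for video in url.findall("video:video", NS):
            missing = [tag for tag in ALWAYS_REQUIRED if not has_text(video, tag)]
            if not any(has_text(video, tag) for tag in LOCATION_TAGS):
                missing.append("content_loc")
            if missing:
                problems.setdefault(page, []).extend(missing)
    return problems


# Hypothetical entry that lacks a description and a video location.
incomplete = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc> https://www.abc.com/example/ </loc>
    <video:video>
      <video:thumbnail_loc> https://www.abc.com/example/video_thumbs/thumb1.jpg </video:thumbnail_loc>
      <video:title> A Deep Dive into abc.com Services </video:title>
    </video:video>
  </url>
</urlset>
"""
print(missing_video_tags(incomplete))
```

An empty dict means every \<video:video> entry in the sitemap carries the tags this sketch checks for.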

-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------

#### References: <br>

Image Sitemap: https://developers.google.com/search/docs/crawling-indexing/sitemaps/image-sitemaps <br>
Video Sitemap: https://developers.google.com/search/docs/crawling-indexing/sitemaps/video-sitemaps

-----------------------------------------------------------------------------------------