**Section 1 : Web crawling from www.timeanddate.com**

This section of the script scrapes holiday data from timeanddate.com for every country in the site's country dropdown and each year from 2020 to 2028, and saves the results in an Excel file.


1.   **Imports Libraries**: It uses requests for making HTTP requests, BeautifulSoup for parsing HTML, pandas for organizing data, tqdm for showing a progress bar, and datetime for parsing and reformatting dates.
2.   **Sets Up URLs**: It defines base URLs to fetch the country list and holiday data for each country and year.
3.   **Gets Country List**: The script sends a request to the website, extracts a list of countries, and stores them in a dictionary with country codes and names.
4.   **Prepares for Data Collection**: It sets the range of years (2020 to 2028) and creates an empty list to hold one row per holiday.
5.   **Scrapes Data**: For each country and year, it sends a request, extracts the date and name of each holiday from the webpage, and appends a row to the list.
6.   **Stores and Saves Data**: The collected data is stored in a pandas DataFrame and saved to an Excel file called holidays.xlsx.
7.   **Completion**: It prints a message when the data is successfully saved.
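
Steps 2 and 5 can be sketched in isolation, without any network access. The snippet below is a minimal illustration of how one URL per country and year is assembled; the helper `build_url`, the names `ENDPOINT` and `FIXED_PARAMS`, and the country codes in `sample_countries` are assumptions made for this example, while the endpoint and query parameters are copied from the script's URL template.

```python
from itertools import product
from urllib.parse import urlencode

# Endpoint and fixed query parameters, copied from the script's URL template
ENDPOINT = "https://www.timeanddate.com/calendar/custom.html"
FIXED_PARAMS = {"cols": 3, "df": 1, "hol": 1, "lang": "en"}

def build_url(year, country_code):
    """Return the calendar URL for one country and one year (hypothetical helper)."""
    params = {"year": year, "country": country_code, **FIXED_PARAMS}
    return f"{ENDPOINT}?{urlencode(params)}"

# Hypothetical country codes, for illustration only
sample_countries = {"1": "Country A", "2": "Country B"}
years = range(2020, 2029)  # 2020 through 2028 inclusive

# One URL per (country, year) pair
urls = [build_url(y, c) for c, y in product(sample_countries, years)]
print(len(urls))  # 2 countries x 9 years = 18
print(urls[0])
```

The number of requests grows as countries times years, so a run over the 235 dropdown entries reported by the progress bar and 9 years issues 2,115 requests.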

In [None]:
pip install requests beautifulsoup4 pandas openpyxl tqdm



In [6]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
from datetime import datetime

# Base URL template
BASE_URL = "https://www.timeanddate.com/calendar/custom.html?year={year}&country={country}&cols=3&df=1&hol=1&lang=en"

# Main URL to get country list
URL = "https://www.timeanddate.com/calendar/custom.html"

# Get the list of countries from the dropdown on the website
resp = requests.get(URL, timeout=30)
resp.raise_for_status()  # Stop early if the country list cannot be fetched
soup = BeautifulSoup(resp.text, "html.parser")
country_select = soup.find("select", {"id": "sf_country"})

# Extract country codes and names, skipping any option without a value
countries = {
    option["value"]: option.text.strip()
    for option in country_select.find_all("option")
    if option.get("value")
}

# Set the years to extract (2020-2028)
years = list(range(2020, 2029))

# List to store extracted holiday data
holiday_data = []

def format_date(dd_mmm, year):
    try:
        # Convert "1 Jan" to a datetime object
        date_obj = datetime.strptime(f"{dd_mmm} {year}", "%d %b %Y")

        # Convert to required formats
        full_date = date_obj.strftime("%Y-%m-%d")  # YYYY-MM-DD
        day_month = date_obj.strftime("%d-%m")  # DD-MM

        return full_date, day_month
    except ValueError:
        return None, None  # Handle unexpected formats

# Iterate over each country and year
for country_code, country_name in tqdm(countries.items(), desc="Scraping holidays"):
    for year in years:
        url = BASE_URL.format(year=year, country=country_code)
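        try:
            resp = requests.get(url, timeout=30)
        except requests.RequestException:
            continue  # Skip this country/year on network errors

        if resp.status_code == 200: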
            soup = BeautifulSoup(resp.text, "html.parser")
            holiday_table = soup.find("table", {"class": "cl1h"})

            if holiday_table:
                for holiday_row in holiday_table.find_all("tr"):
                    date_span = holiday_row.find("span", {"class": "co1"})
                    name_td = holiday_row.find("a")

                    if date_span and name_td:
                        raw_date = date_span.text.strip()
                        full_date, day_month = format_date(raw_date, year)  # Format date
                        holiday_name = name_td.text.strip()

                        if full_date:
                            # Append row to list
                            holiday_data.append([full_date, holiday_name, country_name, year, day_month])

# Convert to DataFrame
df = pd.DataFrame(holiday_data, columns=["Date", "Event", "Country", "Year", "Day-Month"])

# Save to Excel
df.to_excel("holidays.xlsx", index=False)

print("Data successfully saved to holidays.xlsx")

Scraping holidays: 100%|██████████| 235/235 [07:47<00:00,  1.99s/it]


Data successfully saved to holidays.xlsx
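
The date handling in `format_date` can be checked on its own, without scraping. The snippet below restates the same logic and runs it on a few sample inputs; the inputs are chosen for illustration and are not taken from the scraped pages. It assumes an English locale, since `%b` matches locale-dependent month abbreviations.

```python
from datetime import datetime

def format_date(dd_mmm, year):
    """Same logic as the script's helper: ("1 Jan", 2024) -> ISO date and DD-MM."""
    try:
        date_obj = datetime.strptime(f"{dd_mmm} {year}", "%d %b %Y")
    except ValueError:
        return None, None  # Unparseable text or a date that does not exist
    return date_obj.strftime("%Y-%m-%d"), date_obj.strftime("%d-%m")

print(format_date("1 Jan", 2024))   # ('2024-01-01', '01-01')
print(format_date("29 Feb", 2024))  # ('2024-02-29', '29-02'), 2024 is a leap year
print(format_date("29 Feb", 2023))  # (None, None), no 29 Feb in 2023
print(format_date("Jan 1", 2024))   # (None, None), month-first order is not accepted
```

Rows for which the helper returns `None` are silently skipped by the scraping loop, so an unexpected date format on the page shows up as missing holidays rather than as an error.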


**Section 2 : Removing duplicates**

This section loads the holiday list produced in Section 1 and removes exact duplicate rows, that is, rows whose date, event, country, and year all match.

In [7]:
# Load the excel file from Section 1
file_path = "holidays.xlsx"
df = pd.read_excel(file_path)

# Remove duplicate rows
df_cleaned = df.drop_duplicates()

# Save the cleaned data back to an Excel file
cleaned_file_path = "holidays_cleaned.xlsx"
df_cleaned.to_excel(cleaned_file_path, index=False)

print("Duplicates removed. Cleaned data saved to:", cleaned_file_path)

Duplicates removed. Cleaned data saved to: holidays_cleaned.xlsx
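
Calling `drop_duplicates()` with no arguments removes only rows that match in every column. If the goal were instead to keep a single entry per date and country, a `subset` could be passed. The sketch below contrasts the two on made-up rows; the data and the choice of `["Date", "Country"]` as the key are assumptions for illustration, not part of the original script.

```python
import pandas as pd

# Made-up rows for illustration; not real scraped data
df = pd.DataFrame(
    [
        ["2024-01-01", "New Year's Day", "Country A", 2024, "01-01"],
        ["2024-01-01", "New Year's Day", "Country A", 2024, "01-01"],  # exact duplicate
        ["2024-01-01", "Founding Day", "Country A", 2024, "01-01"],    # same date, different event
        ["2024-01-01", "New Year's Day", "Country B", 2024, "01-01"],  # same date, different country
    ],
    columns=["Date", "Event", "Country", "Year", "Day-Month"],
)

# Full-row comparison: only the exact duplicate is dropped
exact = df.drop_duplicates()

# Key-based comparison: one row per (Date, Country), keeping the first seen
per_date = df.drop_duplicates(subset=["Date", "Country"], keep="first")

print(len(df), len(exact), len(per_date))  # 4 3 2
```

The key-based variant discards "Founding Day" here, so it is only appropriate when a country should have at most one holiday per date in the output.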
