# Web Scraping - Get 2018 Calendar
The goal of this project is to build a dataset of the **2018 calendar**: one row per day, with the corresponding day of the week and a flag indicating whether that day is a federal holiday in the United States.

## Import Packages
The process starts by importing the necessary packages.

In [1]:
# Import packages
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import pandas as pd

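Next, define the main variables: the target year, an empty DataFrame that will accumulate the monthly results, and a dictionary that maps abbreviated weekday names to their full names.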

In [2]:
# Define year
year = 2018

# Initialize the calendar dataframe
calendar = pd.DataFrame()

# Define a dictionary to recode days
recode_days = {'Mon':'Monday', 'Tue':'Tuesday', 'Wed':'Wednesday', 'Thu':'Thursday', 'Fri':'Friday', 'Sat':'Saturday', 'Sun':'Sunday'}
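
To make the role of `recode_days` concrete, the following minimal sketch applies it to a hypothetical list of header labels (the sample labels are assumptions, not scraped values). It also shows that the membership filter silently drops any label missing from the dictionary, which would shorten the resulting list.

```python
# Mapping from abbreviated to full weekday names (same content as the notebook's dictionary)
recode_days = {'Mon': 'Monday', 'Tue': 'Tuesday', 'Wed': 'Wednesday', 'Thu': 'Thursday',
               'Fri': 'Friday', 'Sat': 'Saturday', 'Sun': 'Sunday'}

# Hypothetical header labels; 'Wk' stands in for any label that is not a weekday
sample_header = ['Wk', 'Sun', 'Mon', 'Tue']

# Same pattern as the scraping loop: keep and expand only the known abbreviations
recoded = [recode_days[value] for value in sample_header if value in recode_days]
print(recoded)  # ['Sunday', 'Monday', 'Tuesday']
```

If the page ever used a label outside the dictionary, the weekday list would no longer line up with the day cells, so a length check before building the DataFrame may be worthwhile.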

## Web Scraping
Now it's time to extract data from a calendar web page (https://www.timeanddate.com/calendar/custom.html?year=2018&month=1&country=1&typ=1&cols=1&display=1&df=1&hol=1).
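
Only the `month` query parameter changes from one request to the next. As a minimal sketch, the same URL can be assembled from a parameter dictionary with the standard library. The helper name `build_calendar_url` is hypothetical, and the parameter values are copied from the URL above without assuming what each one means.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.timeanddate.com/calendar/custom.html"

def build_calendar_url(year, month):
    """Return the calendar page URL for one month (hypothetical helper)."""
    # Parameter names and values are taken verbatim from the URL used in this notebook
    params = {"year": year, "month": month, "country": 1, "typ": 1,
              "cols": 1, "display": 1, "df": 1, "hol": 1}
    return f"{BASE_URL}?{urlencode(params)}"

print(build_calendar_url(2018, 1))
```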

The data of interest are organized in a table whose header holds the weekday abbreviations and whose cells hold the day numbers. Holidays are shown in red, while all other days are shown in black; the code below picks out the holiday cells through the `ccd co1` class.

The URL above shows January 2018 as an example. The code below requests the same page for every month from 1 to 12 and extracts the data from each one.
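
The table layout described above can be illustrated on a tiny hand-written fragment. This is only a sketch: the HTML below is a hypothetical simplification (the real page contains far more markup), and only the `ccd co1` class is taken from the notebook's code; the plain `ccd` class on the ordinary day is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified fragment of a month table
html = """
<table>
  <thead><tr><td>Sun</td><td>Mon</td><td>Tue</td></tr></thead>
  <tbody>
    <tr>
      <td>&nbsp;</td>
      <td><div class="ccd co1">1</div></td>
      <td><div class="ccd">2</div></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Header cells carry the weekday abbreviations
header = [cell.text for cell in soup.find("thead").find_all("td")]

# Holiday cells are selected by the exact class string used in the notebook
holidays = [int(div.text) for div in soup.find_all("div", class_="ccd co1")]

# Empty slots contain a non-breaking space, which parses to "\xa0"
empty_cell = soup.find("tbody").find("td").text

print(header, holidays, repr(empty_cell))
```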

In [4]:
# Create the list of month numbers (1 to 12)
list_month = list(range(1, 13))

for month in list_month:

    # URL of the webpage to be downloaded
    url_page = f"https://www.timeanddate.com/calendar/custom.html?year={year}&month={month}&country=1&typ=1&cols=1&display=1&df=1&hol=1"

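    # Make a GET request to the webpage (the 30-second timeout is an arbitrary choice)
    response = requests.get(url_page, timeout=30)

    # Stop early if the server returned an error status
    response.raise_for_status()

    # Uncomment to inspect the page's HTML
    # print(response.text)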

    # -------------------------------------------------------------------------------------------------------------------------------------
    # Use BeautifulSoup to parse the HTML content of the webpage
    soup = BeautifulSoup(response.content, "html.parser")

    # -------------------------------------------------------------------------------------------------------------------------------------
    # Identify the content of the table
    table_content = soup.find_all("td", class_="cbm cba cmi")[0]

    # Identify the dimensions
    nrow_table = len(table_content.find_all("tr")) # each row of the table is identified with "tr", including the header
    ncol_table = len(table_content.find_all("thead")[0].find_all("td"))

    # -------------------------------------------------------------------------------------------------------------------------------------
    # Create a list containing the weekdays: the header labels repeated once per data row
    header_cells = table_content.find_all("thead")[0].find_all("td")
    weekdays = [cell.text for cell in header_cells] * (nrow_table - 1)

    # Re-code the weekdays
    weekdays = [recode_days[value] for value in weekdays if value in recode_days]

    # Create a list containing the holidays of the month
    find_holidays = table_content.find_all("div", class_="ccd co1")
    holidays = [int(hol.text) for hol in find_holidays]

    # -------------------------------------------------------------------------------------------------------------------------------------
    # Create a list containing the day numbers of the month (skip the header cells)
    all_cells = table_content.find_all("td")
    numb = [all_cells[idx + ncol_table].text for idx in range((nrow_table - 1) * ncol_table)]

    # Format the values, create the date, and the Flag_Holiday field (True if the day is a holiday)
    date = numb.copy()
    Flag_Holiday = numb.copy()
    for i in range(len(numb)):
        if numb[i] == "\xa0":  # Empty cell (non-breaking space): no day in this slot
            date[i] = None  # Rows with a missing date are dropped when the dataset is built
            Flag_Holiday[i] = None
        else:
            numb[i] = int(numb[i])  # Cell containing the day number
            date[i] = datetime(year=year, month=month, day=numb[i]).strftime('%Y-%m-%d')
            Flag_Holiday[i] = numb[i] in holidays

    # -------------------------------------------------------------------------------------------------------------------------------------
    # Create the Dataset
    calendar_month = pd.DataFrame({'Date': date, 'Day': weekdays, 'Flag_Holiday': Flag_Holiday}).dropna().reset_index(drop=True)

    # Append the month to the full calendar
    if calendar.empty:
        calendar = calendar_month
    else:
        calendar = pd.concat([calendar, calendar_month], ignore_index=True)
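
Because the weekday column is rebuilt from the table header rather than computed from the dates, a cross-check against Python's own calendar arithmetic can catch misaligned rows. The sketch below is an independent check and not part of the original pipeline: it derives every 2018 date and its weekday name with `datetime` alone. The names `expected` and `weekday_names` are hypothetical.

```python
from datetime import date, timedelta

year = 2018

# date.weekday() returns 0 for Monday through 6 for Sunday
weekday_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Walk from January 1 to December 31, mapping ISO date strings to weekday names
expected = {}
current = date(year, 1, 1)
while current.year == year:
    expected[current.strftime('%Y-%m-%d')] = weekday_names[current.weekday()]
    current += timedelta(days=1)

print(len(expected))           # 365
print(expected['2018-01-01'])  # Monday
```

With the scraped DataFrame in memory, a comparison such as `calendar['Day'].eq(calendar['Date'].map(expected)).all()` would indicate whether the two sources agree.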


## Export Data
Export the data to a CSV file.

In [9]:
calendar.to_csv('Calendar.csv', index=False)
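
When the file is read back, the `Date` column arrives as plain strings unless it is parsed explicitly. The following sketch illustrates the round trip on two hypothetical rows with the same columns as the exported dataset; it writes to an in-memory buffer instead of `Calendar.csv` so that it leaves no file behind.

```python
import io
import pandas as pd

# Two hypothetical rows with the same columns as the exported dataset
sample = pd.DataFrame({'Date': ['2018-01-01', '2018-01-02'],
                       'Day': ['Monday', 'Tuesday'],
                       'Flag_Holiday': [True, False]})

# Write to an in-memory buffer, mirroring to_csv(..., index=False)
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the Date strings back to timestamps on load
loaded = pd.read_csv(buffer, parse_dates=['Date'])
print(loaded.dtypes)
```

For the real file, the equivalent call would be `pd.read_csv('Calendar.csv', parse_dates=['Date'])`.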