<h1>Data Exploration for Monthly Weekly Earnings by Industry</h1>

This file explores the data in the Monthly weekly earnings by industry table published by Statistics Canada.

We aim to reuse as much of the code already built in the scrapper library as possible, and to see what changes are necessary to make the scrapper library more robust.

In [1]:
# importing libraries
import pandas as pd
import requests
import json
from bs4 import BeautifulSoup
from sqlalchemy import create_engine
import itertools
import os
# load env
from dotenv import load_dotenv
load_dotenv()
import sys
sys.path.insert(1, os.getenv('LIBRARY_PATH'))
import scrapper

In [None]:
# URL that we will be scraping
url = "https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1410020101&pickMembers%5B0%5D=1.3&pickMembers%5B1%5D=2.1&cubeTimeFrame.startMonth=01&cubeTimeFrame.startYear=2022&cubeTimeFrame.endMonth=10&cubeTimeFrame.endYear=2022&referencePeriods=20220101%2C20221001"
# Names of the filter dimensions selected in the URL (pickMembers). Their selected values are added to the table as columns later on.
filter_names = ["Geography", "Type of employee"]

# Helper function to find the text between two marker strings.
# Returns an empty string if either marker is missing.
def find_between(s, first, last):
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""

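# Helper function to check if a value can be converted to a float.
# TypeError is caught as well so that non-string values such as None return False instead of raising.
def isfloat(num):
    try:
        float(num)
        return True
    except (TypeError, ValueError):
        return False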
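
As a quick check of the two helpers, the sketch below applies them to made-up strings (the sample text is an assumption, not taken from the page). It shows that find_between returns an empty string when a marker is missing, and that isfloat accepts anything float() can parse, including "nan" and scientific notation.

```python
def find_between(s, first, last):
    # Text between the first occurrence of `first` and the next `last`.
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""

def isfloat(num):
    # True when float() can parse the string.
    try:
        float(num)
        return True
    except ValueError:
        return False

sample = 'prepareTable({"rows": []});\n'  # made-up stand-in for the script text
between = find_between(sample, "prepareTable(", "\n")
missing = find_between(sample, "noSuchMarker(", "\n")
float_checks = {text: isfloat(text) for text in ["67888", "1e3", "nan", "Goods producing"]}
```

Because "nan" counts as a float here, a row label that happens to parse as a number would be treated as a value rather than as a key in the later cells.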

In [2]:
# We use the requests library to get the HTML content of the page, and then the BeautifulSoup library to parse it.
# We then use the find_between function to extract the data that we want.
# NOTE: the data is passed as an argument to a function call inside a script tag, so we need to extract it from there.
# After extracting the data from the script tag, we load it into a JSON object.
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
result = find_between(soup.prettify(), 'tableContainerElement = $(".tableContainer").clone();', 'window.addEventListener("resize", function() {') + 'end'
data = find_between(result, 'prepareTable(', '\n')[:-2]
json_data = json.loads(data)
rows = json_data['rows']
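
The `[:-2]` slice above depends on the `prepareTable(` line ending with exactly two trailing characters. As a hedged alternative, `json.JSONDecoder().raw_decode` parses a single JSON value starting at a given index and ignores whatever follows it. The script text below is a made-up stand-in for the real page, and `extract_json_after` is a hypothetical helper name.

```python
import json

def extract_json_after(text, marker):
    """Parse the JSON value that starts immediately after `marker` in `text`."""
    # str.index raises ValueError if the marker is absent.
    start = text.index(marker) + len(marker)
    # raw_decode returns (parsed_object, index_where_parsing_stopped).
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj

# Made-up script text: the JSON argument is followed by extra content on the same line.
script_text = 'var x = 1;\nprepareTable({"rows": [{"values": [{"value": "1.5"}]}]}, "extra");\n'
parsed = extract_json_after(script_text, "prepareTable(")
```

This sketch assumes the JSON value begins right after the opening parenthesis, with no whitespace in between.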

In [3]:
# Get the headers for the data table.
# The "Reference period" header holds the labels for our columns. We extract its values as a list.
headers = next(item for item in json_data['headers']["columnHeaders"] if item["name"] == "Reference period")
header_values = []
for item in headers["values"]:
    header_values.append(item["value"])
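
To make the shape of the headers concrete, here is the same lookup on a small hand-written structure. The sample dictionary is an assumption based only on the keys the code above reads (`columnHeaders`, `name`, `values`, `value`), and its values are made up; the real payload may hold more fields. `header_values_for` is a hypothetical helper. Passing a default to `next` avoids a `StopIteration` when the header is absent.

```python
# Made-up sample with the same nesting the notebook reads from json_data['headers'].
sample_headers = {
    "columnHeaders": [
        {"name": "Geography", "values": [{"value": "Canada"}]},
        {"name": "Reference period",
         "values": [{"value": "January 2022"}, {"value": "February 2022"}]},
    ]
}

def header_values_for(headers, name):
    """Return the list of values for the header called `name`, or [] if it is missing."""
    match = next((item for item in headers["columnHeaders"] if item["name"] == name), None)
    if match is None:
        return []
    return [item["value"] for item in match["values"]]

periods = header_values_for(sample_headers, "Reference period")
absent = header_values_for(sample_headers, "No such header")
```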

In [4]:
# Get the rows for the data table, and then flatten them into a single list of cell values.
rows = json_data['rows']
flattened_rows = list(itertools.chain.from_iterable([row['values'] for row in rows]))
new_rows = []
for row in flattened_rows:
    new_rows.append(row['value'])
keys = []
data = {}
# We iterate through the flattened values and group them into a dictionary.
# The values follow a pattern: a non-numeric item is a row label (the key), and the numeric items that follow it are the values for that key.
# We clean each key with a chain of replace calls, and use the isfloat function to tell values apart from keys.
# We then append the key to the keys list, and the values to the data dictionary.
for row in new_rows:
    if not isfloat(row):
        key = row.replace(" ", "_").replace(",","").replace("(","").replace(")","").replace("-","_").replace("__","_").lower()[:60]
        keys.append(key)
        data[key] = []
    else:
        data[key].append(float(row))
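
The grouping pattern can be seen on a short made-up list, where each label is followed by its numeric values. This sketch is an illustration rather than the notebook's code: `group_by_label` is a hypothetical name, it skips the key cleaning, and it ignores any numeric items that appear before the first label instead of failing.

```python
def isfloat(num):
    # True when float() can parse the string.
    try:
        float(num)
        return True
    except ValueError:
        return False

def group_by_label(items):
    """Group a flat list of strings into {label: [values that follow the label]}."""
    grouped = {}
    current = None
    for item in items:
        if not isfloat(item):
            # A non-numeric item starts a new group.
            current = item
            grouped[current] = []
        elif current is not None:
            # A numeric item belongs to the most recent label.
            grouped[current].append(float(item))
    return grouped

# Made-up input; the leading "9" has no label and is dropped.
flat = ["9", "Goods", "1.0", "2.0", "Services", "3.5"]
grouped = group_by_label(flat)
```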

In [5]:
# We copy the data into a dictionary where the key is the name of the row, and the value is that row's list of values.
rows_values = {key: value for key, value in data.items()}
# We then build a list with one dictionary per row: the row name is stored under "key", and each reference period maps to its value.
final_data = [{"key": name, **{month: value for month, value in zip(header_values, values)}} for name, values in rows_values.items()]
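
One property of `zip` is worth keeping in mind here: it stops at the shorter input, so a row with fewer values than there are reference periods yields a record with missing periods and no error. The sketch below shows this on made-up data; the `sample_` names are hypothetical.

```python
# Made-up reference periods and row values; "services" is one value short.
sample_periods = ["January 2022", "February 2022"]
sample_rows = {"goods": [1.0, 2.0], "services": [3.5]}

sample_records = [
    {"key": name, **dict(zip(sample_periods, values))}
    for name, values in sample_rows.items()
]

# Rows whose length does not match the number of periods.
short_rows = [name for name, values in sample_rows.items()
              if len(values) != len(sample_periods)]
```

A length check like `short_rows` is one possible way to surface such rows before building the DataFrame.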


In [6]:
# We then load the data into a pandas dataframe, transpose it so that each reference period becomes a row, and drop the "key" row.
df = pd.DataFrame(final_data).transpose().drop("key")
# The columns are named after the cleaned row names.
df.columns = keys
# We then add the date column, taken from the page's dcterms.issued meta tag.
df["date"] = soup.find_all('meta', attrs={'name': 'dcterms.issued'})[0]['content']


In [7]:

# We move the reference period from the index into a month column.
df['month'] = df.index
df.reset_index(drop=True, inplace=True)
# We then add one column per filter, named with the cleaned filter name and holding the filter value selected in the URL.
for filter_name in filter_names:
    new_name = filter_name.replace(" ", "_").replace(",","").replace("(","").replace(")","").replace("-","_").replace("__","_").lower()[:60]
    df[new_name] = next(item for item in json_data['headers']["columnHeaders"] if item["name"] == filter_name)["values"][0]["value"]
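
The chain of `replace` calls used for the column names appears twice in this notebook, so it could live in one helper. `clean_name` below is a hypothetical name; it is a sketch that reproduces the inline chain step by step and is not the scrapper library's own function.

```python
def clean_name(name, max_length=60):
    """Turn a label into a lowercase, underscore-separated column name."""
    cleaned = name.replace(" ", "_")
    # Drop commas and parentheses.
    for char in (",", "(", ")"):
        cleaned = cleaned.replace(char, "")
    # Hyphens become underscores, then pairs of underscores are collapsed once.
    cleaned = cleaned.replace("-", "_").replace("__", "_")
    return cleaned.lower()[:max_length]

filter_column = clean_name("Type of employee")
row_column = clean_name("Forestry, logging and support")
```

Keeping the transformation in one place would keep the row keys and the filter column names consistent if the cleaning rules change.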


In [8]:
df.head()

  industrial_aggregate_including_unclassified_businesses  \
0                                            67888.0       
1                                            68106.0       
2                                            69162.0       
3                                            71520.0       
4                                            77444.0       

  industrial_aggregate_excluding_unclassified_businesses  \
0                                            66720.0       
1                                            66916.0       
2                                            67972.0       
3                                            70276.0       
4                                            75991.0       

  goods_producing_industries forestry_logging_and_support  \
0                     9549.0                          0.0   
1                     9300.0                          0.0   
2                     9414.0                          0.0   
3                     9788.0      