# Feb 2025 version - Enrolment by Program, Credential type and Field of Study

## Introduction

**Goal of this workbook:**

Following on from looking at domestic and international enrolment, this notebook examines program- and credential-specific enrolment, including field of study. The aim is a more precise picture of which programs, and how much of their enrolment, contribute to tuition fee revenue that may be at risk.

Source:
[StatCan: Postsecondary enrolments, by detailed field of study, institution, and program and student characteristics](https://www150.statcan.gc.ca/t1/tbl1/en/cv.action?pid=3710027701)

The new federal IRCC rules of 2024 restrict the number of international study permits issued and the availability of Post-Graduation Work Permits (PGWP), according to the program/credential type, length of the program and field of study. The field-of-study requirement further complicates the picture of enrolment and revenue changes for specific schools and adds another dimension to the situation: which programs and fields of study have proliferated, and are they what the government believes we are most in need of?

For clarity and ease of reference I will lay out the current guides & updates with source immediately below.

### IRCC Sources, Notices, definitions and targets (close when not needed)

1. [PGWP: Who can apply, eligible Program types and Field of study requirements, current](https://www.canada.ca/en/immigration-refugees-citizenship/services/study-canada/work/after-graduation/eligibility.html)
    - **Bachelor's, Master's or Doctoral degrees from a university have no field of study requirement**
    - Graduates of all other university programs and college programs must have graduated in an eligible field of study. Fields of study are classified by a six-digit [Classification of Instructional Programs (CIP) code](https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=1420413) with a leading two-digit code followed by a trailing four-digit specifier, e.g. 14 is Engineering, 14.1901 is Mechanical Engineering.
    
    - The PGWP eligible fields of study ([link to CIP codes](https://www.canada.ca/en/immigration-refugees-citizenship/services/study-canada/work/after-graduation/eligibility.html#field-of-study)) are:
        - Agriculture and agri-food (01.x, some other niche codes)
        - Education 
            - Teaching & Counselling 13.x, 
            - Foods, nutrition 19.x
        - Healthcare: 
            - Genetics 26.x, 30.x, 
            - Sports Exercise 31.x, 
            - Science technology 41.x, 
            - Psychology 42.x
            - Health, Medicine, Dentistry, Therapies 51.x - huge subcategory
            - Residencies 60.x, combined specialties 61.x
        - Science, technology, engineering and mathematics / STEM
            - Architecture 04.x
            - Environmental Science 03.x
            - Computer Sciences 11.x - very wide spectrum
            - Engineering (Chemical/Civil/Mech/Bio) 14.x
            - Computer Engineering 15.x
            - Biology 26.x
            - Math 27.x
            - Misc Sciences 30.x
            - Chemistry 40.x
            - Management Science & e-commerce 52.x, digital marketing 52.1404
        - Trades (electrician, HVAC etc)
            - 15.x, 44, 45, 46, 47, 48, 49
        - Transport
            - 44.x

2. [2025 allocations under the international student cap; Jan 24, 2025](https://www.canada.ca/en/immigration-refugees-citizenship/news/notices/2025-provincial-territorial-allocations-under-international-student-cap.html)
    - 437,000 total study permits for 2025, a 10% decrease from the 2024 cap. 
    - Master's & Doctoral students now required to submit a PAL as well but extra space has been made specifically for those programs.

    - "**Considering growth in the graduate international student sector has been sustainable**...2025 graduate student sub-allocation is based on the number of study permits that [provinces] respectively issued to graduate students in 2023". This is an indirect admission that there is specific program-type mismanagement to be identified here.
    - 2025 National Targets for study permits:
        - **73.2k graduate degree students (PAL Required)**
        - **243k permits for remaining PAL/TAL required programs for a total of 316k from PAL required groups**
        - Exempt categories: 72.2k issued to K-12 and 48.5k to other PAL/TAL-exempt applicants, for a grand total of 437k

        - **Provincial/Terr Allocations** below are the maximum number of permits that will be processed, assuming an average approval rate from each province/territory (looks to be around 66%). The anticipated number of study permits issued by IRCC is lower than these maxima and will sum to 316k expected. Both the allocation and approved estimates will be added as data here.
            - (Graduate, All Other, *Total*)
            - AB: 5256, 42082, *47338*
            - BC: 28333, 47754, *76087*
            - MB: 1980, 16611, *18591*
            - NB: 3112, 11673, *14785*
            - NL: 2648, 6534, *9182*
            - NT: 0, 705, *705*
            - NS: 4191, 14411, *18602*
            - NU: 0, 0, *0*
            - ON: 32579, 149011, *181590*
            - PE: 391, 2044, *2435*
            - QC: 38786, 123956, *162742*
            - SK: 2791, 14850, *17641*
            - YT: 1, 463, *464*

3. [Additional information about the International Student Program reforms; February 5, 2024](https://www.canada.ca/en/immigration-refugees-citizenship/news/notices/international-student-program-reform-more-information.html)
    - Most study permit applications to the federal govt must include a Provincial Attestation Letter (PAL) provided by the province (on behalf of their school/PSI, PALs distributed to PSIs by province)
        - The PAL is the accounting metric that ensures new int'l student numbers are accurate
    - PAL Required: most non-degree graduate programs (certificates & diplomas); most post-secondary study permit apps
    - PAL NOT Required: primary & secondary students, master's or doctoral degree students; in-Canada study permit holders
    - Master's degree graduates now get a 3-year PGWP
    - New students at public-private partnership college programs now ineligible for PGWP
        - Check [CBC data](https://www.cbc.ca/news/canada/toronto/international-student-study-permits-data-1.7125827#Full%20data) - schools that have a public/private partnership, check their enrolment; is it disproportionately lower than those without, all else equal/similar?

4. [Original IRCC statement; January 24, 2024](https://www.canada.ca/en/immigration-refugees-citizenship/news/2024/01/canada-to-stabilize-growth-and-decrease-number-of-new-international-student-permits-issued-to-approximately-360000-for-2024.html).
    - **360,000 approved study permits** for 2024, a decrease of 35% from 2023
    - *"Those pursuing master’s and doctoral degrees, and elementary and secondary education are not included in the cap."*
    - Portion of the cap allocated to each province/territory
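
As a quick sanity check on the 2025 allocation figures listed in item 2 above, the sketch below tallies the provincial/territorial maxima and compares them with the national targets for PAL-required groups. The variable names are my own, and reading target ÷ allocation as an "implied approval rate" is an assumption about how the two sets of figures relate, not something the IRCC notice states in this form.

```python
# (graduate, all_other) allocations transcribed from the list in item 2 above
allocations = {
    "AB": (5256, 42082), "BC": (28333, 47754), "MB": (1980, 16611),
    "NB": (3112, 11673), "NL": (2648, 6534), "NT": (0, 705),
    "NS": (4191, 14411), "NU": (0, 0), "ON": (32579, 149011),
    "PE": (391, 2044), "QC": (38786, 123956), "SK": (2791, 14850),
    "YT": (1, 463),
}

graduate_total = sum(grad for grad, _ in allocations.values())
other_total = sum(other for _, other in allocations.values())
grand_total = graduate_total + other_total

# National targets from item 2, in rounded thousands
targets = {"graduate": 73_200, "all_other": 243_000}

# Assumption: expected permits issued / maximum applications processed
implied = {
    "graduate": targets["graduate"] / graduate_total,
    "all_other": targets["all_other"] / other_total,
    "total": sum(targets.values()) / grand_total,
}

print(f"Allocations: {graduate_total:,} graduate + {other_total:,} other = {grand_total:,}")
for group, rate in implied.items():
    print(f"Implied approval rate ({group}): {rate:.1%}")
```

On the figures as transcribed here, the overall ratio works out to roughly 57%, a little below the ~66% eyeballed above, so it may be worth re-checking per province once the allocation data is loaded.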

#### Older releases:
5. [Distance Learning in COVID-19; May 14, 2020](https://www.canada.ca/en/immigration-refugees-citizenship/news/notices/pgwpp-rules-covid19.html)
    - PGWP eligibility was not affected for students whose fall 2020 courses were online due to COVID. Students could begin their classes while outside Canada and complete up to 50% of their program via distance learning if they could not travel to Canada sooner. No time was deducted from the length of a future PGWP for studies outside Canada.
    - Good evidence that any variation in international enrolment in 2020-21 is not work-permit related, with the long term incentive of a path to permanent residence not removed at the federal level.

### **Important notes**

1. Statcan's Classification of program types [is here](https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=1252482&CVD=1252483&CLV=0&MLV=2&D=1). 
    - **Graduate (second cycle) means Master's programs, or programs that otherwise require a Bachelor's degree**
    - **Graduate (third cycle) is PhD**
    - Certificates and Diplomas are inconsistent and have different criteria in different provinces, see pt. 4

2. I'm using **16/17 - 22/23 enrolment data**, and 2023/2024 tuition fee figures, the latest available. From there we can update with live data on enrolment as it becomes official, and project scenarios with hypothetical declines in student enrolment to estimate revenue changes into the future.

3. The easiest distinction is at Program Type, between undergraduate and graduate degrees for their tuition fee costs. However, you need to look at Credential Type for certificates and diplomas (popular at the colleges)

4. There is inconsistency in where graduate diplomas/certificates sit in 'program type'.
    - For example, there were 509,000 students with Credential type 'Diploma' across all of Canada in 22/23, and 386k of them are in the 'Career, Technical or Professional Training Program' Program Type. Another 85k sit under 'Pre-University Program' (of 87k total in the Pre-uni category), which makes me think some PSIs are classifying a high school diploma here, which wouldn't impact tuition fees. The remaining 55k are scattered across various other program types.
    - Certificate credentials are clearer - of 190k in Canada, almost all are captured in the 'Career, Technical or Professional Training Program', 'Post Career, Technical or Professional Training Program' or Undergraduate.
    - *Post career, technical or professional training program* specifically includes **Ontario graduate certificate programs**

5. I only imported student enrolment from full programs. In 22/23 there were around 100,000 enrolments (out of 2.2m total enrolments in all programs) classed as 'non-program': some non-credit, some undergraduate, some graduate. I assume these are students taking individual classes to complete programs at a later date, rather than end-to-end program enrolments on a schedule.
6. The analysis above was done on full-time and part-time students combined. As with my earlier analysis, from here on I am only taking full-time PSI student data (a total of 1.7m in Canada).
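
The kind of check behind point 4 can be sketched as a cross-tabulation of credential type against program type. The counts below are hypothetical and only illustrate the shape of the check; the column names match the StatCan table, but `toy`, `by_program` and `row_share` are names I've made up for this example.

```python
import pandas as pd

# Hypothetical enrolment counts, for illustration only (not StatCan figures)
toy = pd.DataFrame({
    "Credential type": ["Diploma", "Diploma", "Diploma", "Certificate", "Certificate"],
    "Program type": [
        "Career, technical or professional training program",
        "Pre-university program",
        "Undergraduate program",
        "Career, technical or professional training program",
        "Undergraduate program",
    ],
    "Enrolment": [380, 85, 35, 150, 40],
})

# Rows: credential type; columns: program type; cells: summed enrolment
by_program = toy.pivot_table(
    index="Credential type",
    columns="Program type",
    values="Enrolment",
    aggfunc="sum",
    fill_value=0,
)

# Share of each credential's enrolment that sits in each program type
row_share = by_program.div(by_program.sum(axis=1), axis=0).round(2)
print(row_share)
```

A credential whose row is spread over several program types (like the hypothetical Diploma row here) is one where tuition fee assumptions need more care.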


All this is to say the calculations here will be estimates at best, with the heavy lifting being done by the difference between domestic and international tuition fees mostly at the undergraduate and graduate degree level, as these are the most numerous and require somewhat less granularity than fees for certificates/diplomas.
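
To make the scenario idea in point 2 concrete, here is a minimal sketch of how a hypothetical decline in international enrolment could translate into a tuition revenue change. The function name, the `backfill` parameter (the share of lost international seats assumed to be taken by domestic students) and all figures are assumptions for illustration, not values from the StatCan or tuition data.

```python
def revenue_change(intl_enrolment, intl_fee, dom_fee, decline, backfill=0.0):
    """Estimated change in tuition revenue for one program group.

    decline  - fraction of international enrolment lost (0.2 = 20%)
    backfill - assumed fraction of lost seats filled by domestic students
    """
    lost_seats = intl_enrolment * decline
    backfilled_seats = lost_seats * backfill
    return backfilled_seats * dom_fee - lost_seats * intl_fee


# Hypothetical program: 1,000 international students paying $30,000,
# domestic fee of $7,000, 20% decline, half of lost seats backfilled
change = revenue_change(1000, 30_000, 7_000, decline=0.2, backfill=0.5)
print(f"Estimated revenue change: {change:,.0f}")
```

The gap between the two fee levels does most of the work in this calculation, which is why the domestic/international fee difference matters more than fine-grained credential detail.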

## Imports

In [186]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

# for the preprocessing pipeline and variables to verify enrolment numbers align
import import_ipynb
import domestic_intl

## Data for Program type enrolment

Adding enrolment by program types and international/domestic students - [enrolment by program type and status of student in Canada](https://www150.statcan.gc.ca/t1/tbl1/en/cv.action?pid=3710027701). 


### Cell below - audit check on 1.74m students (all FT Domestic/International students enrolled) in 22/23

In [187]:
canada_dom = domestic_intl.canada_dom
canada_intl = domestic_intl.canada_intl

# Point 6 above cites "a total of 1.7m" full-time students in 2022/23. What does the earlier data say? (It covers full-time students only.)
canada_dom[canada_dom['FY Start'] == 2022]['Enrolment'].sum() + canada_intl[canada_intl['FY Start'] == 2022]['Enrolment'].sum()

1741692

The result above confirms 1.74m full-time students in the earlier data (summing enrolment numbers directly), which checks out against the program-level data used here.

## StatCan data on postsecondary enrolments by field of study, institution, and program and student characteristics

[Source](https://www150.statcan.gc.ca/t1/tbl1/en/cv.action?pid=3710027701)

From the source, the following data were removed / not specified in order to not exceed the 2 million data point download limit from StatCan:

- Geography and institutions:
    - Province level enrolment was collected, but not individual institutions. This will initially help keep things generalised until we see a specific trend to examine.

- Field of Study:
    - Enrolment below the two-digit CIP level was not collected, e.g. CIP 01 Agriculture enrolment was collected, but Plant Sciences [01.11] was not, and would only appear within the aggregate CIP 01 enrolment data. There are over 50 CIP codes with a good level of specificity already, though.

- Program Type
    - Basic Education and Skills Program enrolment was not specifically collected
    - non-programs were not collected

- Credential Type
    - GED / High School Diploma specific enrolment was not collected
    - Attestation or other short program credentials were not collected

- Registration Status
    - As before, only Full-time students are collected, not Part-time

- Status of Student in Canada
    - Canadian students, International Students and Not reported were collected as individual categories, but not as a total aggregated number because I'd like to examine differences in program/credential enrolment and growth of such along international/domestic lines

- Reference period
    - Years 2016/17 to years 2022/23 were collected

Once sufficient areas have been eliminated we can drill down more on institutions or specific fields if warranted.
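
Because field-of-study labels in this table carry their two-digit CIP series in square brackets (for example `Education [13.]`), the series can be pulled into its own column for later matching against the PGWP field-of-study lists. This is a small sketch on a hand-made Series; `labels` and `cip_series` are illustrative names, and the regex assumes every coded label ends in the `[NN.]` pattern.

```python
import pandas as pd

# A few labels in the style of the 'Field of study' column
labels = pd.Series([
    "Education [13.]",
    "Social sciences [45.]",
    "Psychology [42.]",
    "Total, field of study",  # aggregate row with no CIP code
])

# Capture the two digits inside "[NN.]"; rows without a code become NaN
cip_series = labels.str.extract(r"\[(\d{2})\.\]", expand=False)
print(cip_series.tolist())
```

Since PGWP eligibility is defined at the six-digit level, a two-digit match can only be a coarse first screen.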

In [188]:
# import the new csv file
progs = pd.read_csv('/Users/thomasdoherty/Desktop/canadian-psi-project/psi_data/statcan_data/progs_credentials_fields.csv')

In [189]:
progs.sample(6)

Unnamed: 0,REF_DATE,GEO,DGUID,Field of study,Program type,Credential type,Institution type,Registration status,Status of student in Canada,Gender,...,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
36084,2018/2019,Quebec,2021A000224,Agricultural and veterinary sciences/services/...,Graduate program (third cycle),Degree (includes applied degree),"Total, institution type",Full-time student,International students,"Total, gender",...,223,units,0,v1609733157,31.2.14.5.1.2.3.1,147,,,,0
65110,2018/2019,Saskatchewan,2021A000247,Social sciences [45.],Graduate program (third cycle),Degree (includes applied degree),"Total, institution type",Full-time student,Canadian students,"Total, gender",...,223,units,0,v1614338198,188.348.14.5.1.2.2.1,21,,,,0
37503,2022/2023,Quebec,2021A000224,Education [13.],Graduate program (second cycle),"Total, credential type","Total, institution type",Full-time student,International students,"Total, gender",...,223,units,0,v1609770563,31.73.13.1.1.2.3.1,243,,,,0
58492,2017/2018,Manitoba,2021A000246,Social sciences [45.],"Career, technical or professional training pro...","Total, credential type","Total, institution type",Full-time student,International students,"Total, gender",...,223,units,0,v1613888297,175.348.4.1.1.2.3.1,18,,,,0
69739,2021/2022,Alberta,2021A000248,English language and literature/letters [23.],Graduate program (second cycle),Degree (includes applied degree),"Total, institution type",Full-time student,Canadian students,"Total, gender",...,223,units,0,v1614673886,199.189.13.5.1.2.2.1,27,,,,0
79838,2019/2020,British Columbia,2021A000259,Security and protective services [43.],Undergraduate program,Associate degree,"Total, institution type",Full-time student,Canadian students,"Total, gender",...,223,units,0,v1615443157,226.335.8.7.1.2.2.1,3,,,,0


In [190]:
# check for columns where records are missing values
print(progs.columns[progs.isnull().all()])
print(progs.columns[progs.isnull().any()])

Index(['STATUS', 'SYMBOL', 'TERMINATED'], dtype='object')
Index(['STATUS', 'SYMBOL', 'TERMINATED'], dtype='object')


In [191]:
# confirm this is all full-time students
progs['Registration status'].value_counts()

Registration status
Full-time student    84506
Name: count, dtype: int64

In [192]:
progs['Status of student in Canada'].value_counts()

Status of student in Canada
Canadian students                            43528
International students                       35897
Not reported, status of student in Canada     5081
Name: count, dtype: int64

## Preprocessing pipeline

Custom Transformers (CTs) defined in utils package

### Mar 1 test utils package - commented out below (keep until sure we don't need)

In [193]:
# # Custom transformer (CT) for dropping unnecessary columns from the dataframe

# DropColumns = domestic_intl.DropColumns

# RenameColumns = domestic_intl.RenameColumns

# FormatDate = domestic_intl.FormatDate

# AddInstitutionAndProvince = domestic_intl.AddInstitutionAndProvince

# AbbreviateInstitutionNames = domestic_intl.AbbreviateInstitutionNames

# # Removing territories from the data

# class RemoveTerritories(BaseEstimator, TransformerMixin):
#     def __init__(self, column, territories):
#         """
#         :param column: The column where we look for territory names
#         :param territories: A list of territory names (e.g. ["Yukon", "Northwest Territories", "Nunavut"])
#         """
#         self.column = column
#         self.territories = territories # territories defined above

#     def fit(self, X, y=None):
#         return self

#     def transform(self, X):
#         """
#         Removes rows where the specified column contains any of the territory names.
#         """
#         X = X.copy()
#         # Build a regex pattern that matches any of the territory strings
#         pattern = '|'.join(self.territories)

#         # Keep rows that do NOT contain any of the territory names (case-insensitive).
#         # If the column is sometimes NaN, na=False ensures we don't error out.
#         return X[~X[self.column].str.contains(pattern, case=False, na=False)]
    

# # reorder the columns for ease of reading

# class ReorderColumns(BaseEstimator, TransformerMixin):
#     def __init__(self, desired_order):
#         """
#         parameter - desired_order: A list specifying the columns in the order desired.
#         e.g. ["FY Start", "Province/Territory", "Institution Name", ...]
#         """
#         self.desired_order = desired_order

#     def fit(self, X, y=None):
#         return self

#     def transform(self, X):
#         """
#         Reorders the columns to the specified order. Any columns not in 'desired_order'
#         are appended at the end in their existing order.
#         """
#         X = X.copy()
        
#         # Columns that are explicitly ordered
#         ordered_cols = [c for c in self.desired_order if c in X.columns]
        
#         # Any remaining columns not in 'desired_order'
#         leftover_cols = [c for c in X.columns if c not in ordered_cols]
        
#         # Final order is the desired columns first, then leftover
#         final_order = ordered_cols + leftover_cols
        
#         return X[final_order]
    
# # pivot Canadian status (domestic/international/unreported) into three columns of the same record, all else being equal

# class PivotCanadianStatus(BaseEstimator, TransformerMixin):
#     def __init__(
#         self, 
#         index_cols=[
#             "FY Start", 
#             "Province/Territory", 
#             "Institution Name", 
#             "Program type", 
#             "Credential type", 
#             "Field of study"
#             ],
#         pivot_col="Canadian Status",
#         values_col="Enrolment"
#     ):
#         """
#         Parameter - index_cols: The columns to keep as index in the pivot (remain in rows).
#         Parameter - pivot_col: The column whose unique values become new columns (e.g., 'Canadian Status').
#         Parameter - values_col: The numeric column to place in new columns (e.g., 'Enrolment').
#         """
#         self.index_cols = index_cols
#         self.pivot_col = pivot_col
#         self.values_col = values_col

#     def fit(self, X, y=None):
#         return self

#     def transform(self, X):
#         """
#         Pivots the dataframe so that 'Canadian Status' becomes columns:
#           -> 'Domestic Enrolment', 'International Enrolment'.

#         After pivoting, renames columns accordingly and returns the wide table.
#         """
#         X = X.copy()

#         # 1. Pivot
#         pivoted = X.pivot_table(
#             index=self.index_cols,
#             columns=self.pivot_col,
#             values=self.values_col,
#             aggfunc='sum'  # If duplicates exist, sum them
#         ).reset_index()

#         # 2. Rename columns from 'Canadian students' -> 'Domestic Enrolment' etc.
#         col_rename_map = {
#             'Canadian students': 'Domestic Enrolment',
#             'International students': 'International Enrolment',
#             'Not reported, status of student in Canada': 'CA Status Unreported Enrolment'
#         }
#         pivoted = pivoted.rename(columns=col_rename_map)

#         # 3. convert new Domestic and International Enrolment columns to integers
#         pivoted['Domestic Enrolment'] = pivoted['Domestic Enrolment'].fillna(0).astype(int)
#         pivoted['International Enrolment'] = pivoted['International Enrolment'].fillna(0).astype(int)
#         pivoted['CA Status Unreported Enrolment'] = pivoted['CA Status Unreported Enrolment'].fillna(0).astype(int)

#         # 4. Reorder columns if desired
#         # Ensure Domestic and International appear last in an expected order
#         final_cols = [c for c in pivoted.columns if c not in col_rename_map.values()]
#         final_cols += ['Domestic Enrolment', 'International Enrolment', 'CA Status Unreported Enrolment']
#         final_cols = [c for c in final_cols if c in pivoted.columns]  # Only keep existing columns

#         return pivoted[final_cols]

### Defined variables (provinces, territories, abbreviations, Francophone institutions etc)

In [194]:
from utils.constants import PROVINCES_TERRITORIES_CA, PROVINCES_CA, TERRITORIES, PROVINCE_CODES, ABBREVIATIONS
from utils.pipeline_transformers import DropColumns, RenameColumns, FormatDate, AddInstitutionAndProvince, AbbreviateInstitutionNames, RemoveTerritories, ReorderColumns, PivotCanadianStatus

In [195]:
# Define the pipeline
from sklearn.pipeline import Pipeline

program_pipeline = Pipeline(steps=[
    # drop columns
    ('drop_columns', DropColumns(columns=[
        'DGUID', 'Registration status', 'Institution type', 'Gender', 'UOM', 'UOM_ID', 'SCALAR_FACTOR', 'SCALAR_ID', 'VECTOR', 'COORDINATE', 'STATUS', 'SYMBOL', 'TERMINATED', 'DECIMALS'
    ])),
    # rename columns
    ('rename_columns', RenameColumns(column_names={
        "GEO": "Province/Territory",
        "REF_DATE": "FY Start",
        "VALUE": "Enrolment",
        "Status of student in Canada": "Canadian Status"
    })),
    # drop record step -  Yukon / NWT / Nunavut
    ('remove_territories', RemoveTerritories(
        column="Province/Territory",
        territories=TERRITORIES
    )),
    # formatting data type steps
    ('format_fy', FormatDate(column="FY Start")),
    # ('format_enrolment', FormatValue(column="Enrolment")),
    
    # Add Institution Name and Province columns
    ('add_institution_and_province', AddInstitutionAndProvince(
        column='Province/Territory',
        institution_col='Institution Name', 
        province_col='Province/Territory', 
        provinces=PROVINCES_TERRITORIES_CA
    )),
    
    # Abbreviate institution names
    ('abbreviate_institution_names', AbbreviateInstitutionNames(
        column='Institution Name', 
        replacements=ABBREVIATIONS
    )),

    #Reorder the columns
    ('reorder_columns', ReorderColumns(desired_order=[
        "FY Start", 
        "Province/Territory", 
        "Institution Name", 
        "Program type",
        "Credential type",
        "Field of study",
        "Canadian Status",
        "Enrolment"
    ])),

    # pivot the Canadian status column
    ('pivot_canadian_status', PivotCanadianStatus())
])
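
To show what the final pivot step does, here is a toy version in plain pandas, loosely based on the commented-out `PivotCanadianStatus` class earlier in this notebook. The version in `utils.pipeline_transformers` may differ. The three rows use the Ontario 2022 totals that appear later in this notebook; `long_rows` and `wide` are illustrative names.

```python
import pandas as pd

# Long format: one row per student status
long_rows = pd.DataFrame({
    "FY Start": [2022, 2022, 2022],
    "Province/Territory": ["Ontario", "Ontario", "Ontario"],
    "Canadian Status": [
        "Canadian students",
        "International students",
        "Not reported, status of student in Canada",
    ],
    "Enrolment": [546672, 236769, 1542],
})

# Wide format: one row per year/province, one column per status
wide = (
    long_rows.pivot_table(
        index=["FY Start", "Province/Territory"],
        columns="Canadian Status",
        values="Enrolment",
        aggfunc="sum",
    )
    .reset_index()
    .rename(columns={
        "Canadian students": "Domestic Enrolment",
        "International students": "International Enrolment",
        "Not reported, status of student in Canada": "CA Status Unreported Enrolment",
    })
)
wide.columns.name = None  # drop the leftover 'Canadian Status' axis label
print(wide)
```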

In [196]:
cleaned_progs = program_pipeline.fit_transform(progs)

In [197]:
cleaned_progs.sample(6)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
11274,2017,Prince Edward Island,Prince Edward Island (total),Undergraduate program,"Total, credential type","Indigenous and foreign languages, literatures,...",0,3,0
33580,2021,New Brunswick,New Brunswick (total),Graduate program (third cycle),"Total, credential type",Psychology [42.],93,3,0
26472,2020,Canada,Canada (total),"Total, program type",Other type of credential associated with a pro...,"Area, ethnic, cultural, gender, and group stud...",3,0,0
14445,2018,Canada,Canada (total),Undergraduate program,Certificate,Mathematics and statistics [27.],18,3,0
27442,2020,New Brunswick,New Brunswick (total),"Post career, technical or professional trainin...",Diploma,"Total, field of study",216,42,0
1054,2016,British Columbia,BC (total),"Total, program type","Total, credential type","Unclassified, field of study",534,480,0


In [198]:
cleaned_progs.columns

Index(['FY Start', 'Province/Territory', 'Institution Name', 'Program type',
       'Credential type', 'Field of study', 'Domestic Enrolment',
       'International Enrolment', 'CA Status Unreported Enrolment'],
      dtype='object')

In [199]:
cleaned_progs['Institution Name'].unique()

array(['Alberta (total)', 'BC (total)', 'Canada (total)',
       'Manitoba (total)', 'New Brunswick (total)',
       'Newfoundland and Labrador (total)', 'Nova Scotia (total)',
       'Ontario (total)', 'Prince Edward Island (total)',
       'Quebec (total)', 'Saskatchewan (total)'], dtype=object)

In [200]:
cleaned_progs['Province/Territory'].unique()

array(['Alberta', 'British Columbia', 'Canada', 'Manitoba',
       'New Brunswick', 'Newfoundland and Labrador', 'Nova Scotia',
       'Ontario', 'Prince Edward Island', 'Quebec', 'Saskatchewan'],
      dtype=object)

Currently we are not looking at the individual institution level, so we can go by Province/Territory rather than Institution name until this changes.

## Cleaning up N/As and Not Reported records

There are several instances of 'Not reported, status of student', 'Not applicable, credential type' and similar categories. These need examining to see whether we can discard them and make space for more valuable records.

### Auditing

In [201]:
progs['Status of student in Canada'].value_counts()

Status of student in Canada
Canadian students                            43528
International students                       35897
Not reported, status of student in Canada     5081
Name: count, dtype: int64

`progs` is the original unprocessed sheet. The 5081 Not reported records will now be in "CA Status Unreported Enrolment".

In [202]:
print(progs[progs['Status of student in Canada'] == 'Not reported, status of student in Canada']['VALUE'].isnull().sum())

0


The output above tells us that all 5081 records of unreported status have a non-null enrolment value.

In the `cleaned_progs` processed data, this enrolment is counted under the 'CA Status Unreported Enrolment' column. Set against domestic and international enrolment, it shows how much this unknown contributes to the data and the extent of uncertainty it introduces.
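
One caveat worth sketching: a null count of zero shows that every record has a value, but not that every value is above zero. The Series below is hypothetical and just illustrates the three counts that separate those cases; on the real data the same expressions would be applied to the `VALUE` column of the filtered `progs` rows.

```python
import pandas as pd

# Hypothetical VALUE column for a handful of 'Not reported' records
values = pd.Series([12, 0, 3, 0], name="VALUE")

n_null = int(values.isnull().sum())    # records with no value at all
n_zero = int((values == 0).sum())      # records reporting zero enrolment
n_positive = int((values > 0).sum())   # records with actual enrolment

print(n_null, n_zero, n_positive)
```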

Starting at the totals across Canada we can check which entries have unreported status each year to see the scale of the problem

In [203]:
# How many across Canada per year - grand totals across all programs, credentials & fields of study

cleaned_progs[(cleaned_progs['Province/Territory'] == 'Canada') &
              (cleaned_progs['CA Status Unreported Enrolment'] > 0) & 
              (cleaned_progs['Program type'] == 'Total, program type') & 
              (cleaned_progs['Credential type'] == 'Total, credential type') & 
              (cleaned_progs['Field of study'] == 'Total, field of study')
            ][['FY Start', 'Institution Name', 'Credential type', 'Program type', 'Field of study', 'CA Status Unreported Enrolment']].sort_values(by='FY Start', ascending=False)

Unnamed: 0,FY Start,Institution Name,Credential type,Program type,Field of study,CA Status Unreported Enrolment
38908,2022,Canada (total),"Total, credential type","Total, program type","Total, field of study",1659
32678,2021,Canada (total),"Total, credential type","Total, program type","Total, field of study",2103
26543,2020,Canada (total),"Total, credential type","Total, program type","Total, field of study",1911
20471,2019,Canada (total),"Total, credential type","Total, program type","Total, field of study",948
14393,2018,Canada (total),"Total, credential type","Total, program type","Total, field of study",987
8269,2017,Canada (total),"Total, credential type","Total, program type","Total, field of study",1401
2139,2016,Canada (total),"Total, credential type","Total, program type","Total, field of study",1458


So we are looking at no more than about 2,100 per year out of roughly 1.7 million, i.e. about 0.12% of total enrolment at most. There's a noticeable jump in unreported enrolment post-Covid, which may be related to programs going fully online, so that requirements and logistics for immigration stats may have been relaxed or changed temporarily.
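
The size of the jump can be quantified directly from the yearly counts in the output above. This sketch hard-codes those counts; the 2022 share uses the 1,741,692 reported full-time total from the audit cell earlier in the notebook. The variable names are my own.

```python
# Unreported-status enrolment per fiscal year start, from the output above
unreported = {
    2016: 1458, 2017: 1401, 2018: 987, 2019: 948,
    2020: 1911, 2021: 2103, 2022: 1659,
}

# Year-over-year relative change (1.0 = a doubling)
years = sorted(unreported)
yoy_change = {
    year: unreported[year] / unreported[prev] - 1
    for prev, year in zip(years, years[1:])
}

# 2022 share of all full-time enrolment (reported + unreported)
reported_2022 = 1_741_692
share_2022_pct = 100 * unreported[2022] / (reported_2022 + unreported[2022])

for year, change in yoy_change.items():
    print(f"{year}: {change:+.0%}")
print(f"2022 unreported share: {share_2022_pct:.3f}%")
```

The 2019 to 2020 step is roughly a doubling, which is the post-Covid jump noted above.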

Any particular trend to where unreported status is coming from?

In [204]:
# Check provinces where unreported enrolment is highest
cleaned_progs[(cleaned_progs['FY Start'] == 2022) & 
              (cleaned_progs['Program type'] == 'Total, program type') & 
              (cleaned_progs['Credential type'] == 'Total, credential type') & 
              (cleaned_progs['Field of study'] == 'Total, field of study')
              ].sort_values(by='CA Status Unreported Enrolment', ascending=False)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
38908,2022,Canada,Canada (total),"Total, program type","Total, credential type","Total, field of study",1320684,421008,1659
41539,2022,Ontario,Ontario (total),"Total, program type","Total, credential type","Total, field of study",546672,236769,1542
42864,2022,Saskatchewan,Saskatchewan (total),"Total, program type","Total, credential type","Total, field of study",35247,7032,90
37180,2022,Alberta,Alberta (total),"Total, program type","Total, credential type","Total, field of study",153537,29139,15
37825,2022,British Columbia,BC (total),"Total, program type","Total, credential type","Total, field of study",123987,53583,6
39551,2022,Manitoba,Manitoba (total),"Total, program type","Total, credential type","Total, field of study",38181,10563,3
39957,2022,New Brunswick,New Brunswick (total),"Total, program type","Total, credential type","Total, field of study",18150,5439,3
40371,2022,Newfoundland and Labrador,Newfoundland and Labrador (total),"Total, program type","Total, credential type","Total, field of study",15288,4725,0
40820,2022,Nova Scotia,Nova Scotia (total),"Total, program type","Total, credential type","Total, field of study",38049,11805,0
41866,2022,Prince Edward Island,Prince Edward Island (total),"Total, program type","Total, credential type","Total, field of study",4992,2121,0


Over 90% of the unreported status comes from Ontario (1,542 of 1,659 in 2022). Spot checking other years (2021, 2020, 2019), Ontario is consistently top, followed by Saskatchewan. SK has a small population, so this is unusual; it may just be down to a reporting error from one school. Ontario generally has larger institutions by enrolment, so the same is likely true there.
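
As a quick sanity check on the "over 90%" figure, here is a minimal sketch that recomputes each province's share of the 2022 unreported count. The numbers are copied by hand from the output above; the names `unreported_2022`, `canada_total` and `share_pct` are just for this illustration and are not part of the pipeline.

```python
import pandas as pd

# 2022 unreported-status counts copied from the provincial totals table above
# (provinces with zero unreported enrolment are left out)
unreported_2022 = pd.Series(
    {
        "Ontario": 1542,
        "Saskatchewan": 90,
        "Alberta": 15,
        "British Columbia": 6,
        "Manitoba": 3,
        "New Brunswick": 3,
    },
    name="CA Status Unreported Enrolment",
)

canada_total = 1659  # 'Canada (total)' row for 2022

# Share of the national unreported count attributable to each province
share_pct = (unreported_2022 / canada_total * 100).round(1)
print(share_pct)
```

With these figures the provincial counts add up exactly to the national total, and Ontario alone accounts for roughly 93% of it.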
Drill down into Program & Credential types...

In [205]:
cleaned_progs[(cleaned_progs['Province/Territory'] == 'Ontario') &
              (cleaned_progs['FY Start'] == 2022) &
              (cleaned_progs['CA Status Unreported Enrolment'] > 0) &  
              (cleaned_progs['Credential type'] == 'Total, credential type') & 
              (cleaned_progs['Field of study'] == 'Total, field of study')
            ][
                ['FY Start', 'Province/Territory', 'Institution Name', 'Credential type', 'Program type', 'Field of study', 'CA Status Unreported Enrolment']
                ].sort_values(by='CA Status Unreported Enrolment', ascending=False)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Credential type,Program type,Field of study,CA Status Unreported Enrolment
41539,2022,Ontario,Ontario (total),"Total, credential type","Total, program type","Total, field of study",1542
41015,2022,Ontario,Ontario (total),"Total, credential type","Career, technical or professional training pro...","Total, field of study",1041
41302,2022,Ontario,Ontario (total),"Total, credential type","Post career, technical or professional trainin...","Total, field of study",321
41649,2022,Ontario,Ontario (total),"Total, credential type",Undergraduate program,"Total, field of study",180


In [206]:
cleaned_progs[(cleaned_progs['Province/Territory'] == 'Ontario') &
              (cleaned_progs['FY Start'] == 2022) &
              (cleaned_progs['CA Status Unreported Enrolment'] > 0) &  
              (cleaned_progs['Program type'] == 'Total, program type') & 
              (cleaned_progs['Field of study'] == 'Total, field of study')
            ][
                ['FY Start', 'Province/Territory', 'Institution Name', 'Credential type', 'Program type', 'Field of study', 'CA Status Unreported Enrolment']
                ].sort_values(by='CA Status Unreported Enrolment', ascending=False)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Credential type,Program type,Field of study,CA Status Unreported Enrolment
41539,2022,Ontario,Ontario (total),"Total, credential type","Total, program type","Total, field of study",1542
41444,2022,Ontario,Ontario (total),Diploma,"Total, program type","Total, field of study",996
41370,2022,Ontario,Ontario (total),Certificate,"Total, program type","Total, field of study",363
41407,2022,Ontario,Ontario (total),Degree (includes applied degree),"Total, program type","Total, field of study",180
41497,2022,Ontario,Ontario (total),Other type of credential associated with a pro...,"Total, program type","Total, field of study",3


When we line up programs against credentials like this, we can see a near-total match on unreported Canadian status between:
- Career/Technical training programs and the Diploma credential (1041 vs 996)
- Post career, technical or professional training programs and the Certificate credential (321 vs 363). This is specifically the Ontario Graduate Certificate, a program type which has grown widely in Ontario colleges in the last 10 years.
- Undergraduate program and Degree, both exactly 180 unreported.

The strong alignment with these values for specific categories suggests to me the unreported status originates from some specific programs at specific schools where reporting wasn't done properly. This can be looked into in Ontario specific analysis later.

### Conclusion of Auditing

The number of students enrolled with unreported Canadian status (unknown if domestic or international) ranges from roughly 950 to 2,100 per year against total PSI full-time enrolment of 1.6 to 1.7 million. That is around 0.1% of enrolment, almost all of it coming from Ontario. At time of writing, without looking at specific schools, it looks likely to stem from reporting errors in specific programs at specific schools, and it is of little overall relevance to the work at this time.

## Workflow for avoiding double counting

The data is reported with a unique record for every combination of program type, credential type and field of study, including the 'Total' row of each category.

Summing naively will therefore double / triple / quadruple count a given student, e.g. an undergraduate degree student will appear in Undergrad / Degree, Programs (total) / Degree, Undergrad / Credential (total) and Programs (total) / Credential (total). This is fixed in this block:

Only working with provincial totals here

In [207]:
cleaned_progs['Province/Territory'].value_counts()

Province/Territory
Canada                       8240
Ontario                      5342
British Columbia             4656
Quebec                       3902
Alberta                      3611
Manitoba                     3549
Saskatchewan                 3264
Nova Scotia                  3162
Newfoundland and Labrador    2792
New Brunswick                2612
Prince Edward Island         1833
Name: count, dtype: int64

In [208]:
# The target for domestic enrolment in 2022 is 1.32m and for international is 420k. This is based on the data from the earlier notebook.

print(f"Total Domestic students in 2022 is {canada_dom[canada_dom['FY Start'] == 2022]['Enrolment'].sum()}")
print(f"Total International students in 2022 is {canada_intl[canada_intl['FY Start'] == 2022]['Enrolment'].sum()}")

Total Domestic students in 2022 is 1320684
Total International students in 2022 is 421008


In [209]:
cleaned_progs[cleaned_progs['FY Start'] == 2022].sort_values(by='Domestic Enrolment', ascending=False).head(5)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
38908,2022,Canada,Canada (total),"Total, program type","Total, credential type","Total, field of study",1320684,421008,1659
38740,2022,Canada,Canada (total),"Total, program type",Degree (includes applied degree),"Total, field of study",860619,208341,204
39096,2022,Canada,Canada (total),Undergraduate program,"Total, credential type","Total, field of study",761976,145479,204
39003,2022,Canada,Canada (total),Undergraduate program,Degree (includes applied degree),"Total, field of study",740001,135174,201
41539,2022,Ontario,Ontario (total),"Total, program type","Total, credential type","Total, field of study",546672,236769,1542


The output above illustrates double counting: an undergraduate Engineering degree student in Ontario in 2022 will be counted in all five of the above records and will also be counted in their Field of Study specific record.
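
To make the mechanism concrete, here is a minimal sketch on made-up data. The `toy` frame, its enrolment numbers and the simplified credential labels are hypothetical (they are not StatCan figures); the point is only that each student shows up once per combination of specific and 'Total' categories, and that pinning one dimension to its total while excluding the other dimension's total recovers the true count.

```python
import pandas as pd

# Hypothetical toy data: 100 undergraduate degree students and
# 40 career/technical diploma students in one province and one year.
toy = pd.DataFrame(
    [
        ("Total, program type", "Total, credential type", 140),
        ("Total, program type", "Degree", 100),
        ("Total, program type", "Diploma", 40),
        ("Undergraduate program", "Total, credential type", 100),
        ("Undergraduate program", "Degree", 100),
        ("Career, technical or professional training program", "Total, credential type", 40),
        ("Career, technical or professional training program", "Diploma", 40),
    ],
    columns=["Program type", "Credential type", "Enrolment"],
)

# Naive sum: every student is counted four times (2 program rows x 2 credential rows)
naive_sum = toy["Enrolment"].sum()

# Keep the specific program types, but only the credential total,
# so each student appears exactly once
by_program = toy[
    (toy["Program type"] != "Total, program type")
    & (toy["Credential type"] == "Total, credential type")
]

print(naive_sum, by_program["Enrolment"].sum())
```

The same pattern (exclude the total of the dimension being analysed, pin every other dimension to its total) is what the filters on `cleaned_progs` do in the cells that follow.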

`all_prov_totals` below isolates the all-totals record for each province (every category set to its 'Total' value). We'll use it as an audit check when we examine programs, credentials and field of study in particular.

In [210]:
all_prov_totals = cleaned_progs[~(cleaned_progs['Province/Territory'].str.contains('Canada')) &
                                  (cleaned_progs['Program type'] == 'Total, program type') & 
                                  (cleaned_progs['Credential type'] == 'Total, credential type') & 
                                  (cleaned_progs['Field of study'] == 'Total, field of study')
                                ]

In [211]:
print(f"Total Domestic enrolment across all provinces combined is {all_prov_totals[all_prov_totals['FY Start'] == 2022]['Domestic Enrolment'].sum()}, compared to earlier national sum of {canada_dom[canada_dom['FY Start'] == 2022]['Enrolment'].sum()}")
print(f"Total International enrolment across all provinces combined is {all_prov_totals[all_prov_totals['FY Start'] == 2022]['International Enrolment'].sum()}, compared to earlier national sum of {canada_intl[canada_intl['FY Start'] == 2022]['Enrolment'].sum()}")

Total Domestic enrolment across all provinces combined is 1319085, compared to earlier national sum of 1320684
Total International enrolment across all provinces combined is 420990, compared to earlier national sum of 421008


See two cells up - we remove double counting from:
1. The Canada whole-country total. The count is now just the provinces combined, without territories as we removed them in the pipeline. We can see one cell above that 18 international students and around 1,600 domestic students are missing from this current dataset - that's the territories. Spot checking years other than 2022 shows similar numbers.
2. All Program totals only - double counting from the specific program and the totals removed
3. All Credential totals only
4. All Fields of study - no double count on this field

We can refer back to this to ensure data hygiene when we want to look at enrolment changes along any of these lines. We've looked at enrolment in different provinces in the first workbook, so we'll go first by program type, then credentials and then Fields of study.

Below I will use the same audit statement (three cells up) to filter for unique programs

## Analysis of Program type enrolment, 16/17 - 22/23

Per auditing above, we can generally ignore the Unreported Enrolment column.

Per the double-counting workflow above, we'll start each section here by splitting off from `cleaned_progs` to avoid duplicate counts of student enrolment.

How has enrolment in different programs changed over the years? Is this different amongst domestic/international students?

Big Question - has the big influx of international students mostly gone to some programs over others e.g. technical, Masters, PhD?

### Program enrolment changes across all fields of study, all credential types, Canada-wide

In [212]:
# use the double-counting method to just toggle off the program total. Then run it against the canada_dom and canada_intl dataframes from last workbook to see if the numbers align.
programs_df = cleaned_progs[(cleaned_progs['Province/Territory'].str.contains('Canada')) &
                                  ~(cleaned_progs['Program type'] == 'Total, program type') & 
                                  (cleaned_progs['Credential type'] == 'Total, credential type') & 
                                  (cleaned_progs['Field of study'] == 'Total, field of study')
                                ]

In [213]:
print(f"Total Domestic enrolment summed across individual program types is {programs_df[programs_df['FY Start'] == 2022]['Domestic Enrolment'].sum()}, compared to earlier national sum of {canada_dom[canada_dom['FY Start'] == 2022]['Enrolment'].sum()}")
print(f"Total International enrolment summed across individual program types is {programs_df[programs_df['FY Start'] == 2022]['International Enrolment'].sum()}, compared to earlier national sum of {canada_intl[canada_intl['FY Start'] == 2022]['Enrolment'].sum()}")

Total Domestic enrolment summed across individual program types is 1294116, compared to earlier national sum of 1320684
Total International enrolment summed across individual program types is 407619, compared to earlier national sum of 421008


We lose about 2-3% of enrolment by summing the individual program types instead of using the program total (about 2.0% domestic, 3.2% international). That is a fair amount more than the <0.5% from the first run of the double-counting check, but given we're working with all credential and all field-of-study totals, double counting seems very unlikely to be an issue here. It is perhaps just some imprecise reporting of specific programs.
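
The size of that gap can be checked directly from the four numbers printed above. This is a minimal sketch; `shortfall_pct` is a hypothetical helper name introduced here, not something defined elsewhere in the workbook.

```python
def shortfall_pct(parts_sum: int, reported_total: int) -> float:
    """Percentage of the reported total that is missing from the summed parts."""
    return (reported_total - parts_sum) / reported_total * 100


# 2022 figures copied from the printed output above
domestic_gap = shortfall_pct(1_294_116, 1_320_684)
international_gap = shortfall_pct(407_619, 421_008)

print(f"Domestic shortfall: {domestic_gap:.2f}%")
print(f"International shortfall: {international_gap:.2f}%")
```

Running the same helper on the earlier provincial audit (1,319,085 against 1,320,684 domestic) gives about 0.12%, which is the baseline the 2-3% is being compared with.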

In [214]:
programs_df['Program type'].value_counts()

Program type
Career, technical or professional training program            7
Graduate program (above the third cycle)                      7
Graduate program (second cycle)                               7
Graduate program (third cycle)                                7
Graduate qualifying program (second cycle)                    7
Graduate qualifying program (third cycle)                     7
Health-related residency program                              7
Other programs                                                7
Post career, technical or professional training program       7
Post-baccalaureate non-graduate program                       7
Pre-university program                                        7
Qualifying program for career, technical or pre-university    7
Undergraduate program                                         7
Undergraduate qualifying program                              7
Name: count, dtype: int64

Above we can see that all 7 years have entries for the same program types, so we can plot a line graph of enrolment for each to see the change over time. We should be mindful that some program types are only used in some provinces, so it is worth plotting each one separately and looking at the change in raw enrolment first.

In [215]:
programs_df.sample(10)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
26006,2020,Canada,Canada (total),Graduate qualifying program (third cycle),"Total, credential type","Total, field of study",3,0,0
7444,2017,Canada,Canada (total),Graduate program (above the third cycle),"Total, credential type","Total, field of study",3,0,0
1588,2016,Canada,Canada (total),Health-related residency program,"Total, credential type","Total, field of study",13581,2058,0
32869,2021,Canada,Canada (total),Undergraduate program,"Total, credential type","Total, field of study",768153,143781,432
38609,2022,Canada,Canada (total),Pre-university program,"Total, credential type","Total, field of study",81402,1311,0
32133,2021,Canada,Canada (total),Graduate qualifying program (second cycle),"Total, credential type","Total, field of study",48,84,0
20669,2019,Canada,Canada (total),Undergraduate program,"Total, credential type","Total, field of study",754848,135645,399
13586,2018,Canada,Canada (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",329589,84993,162
7745,2017,Canada,Canada (total),Health-related residency program,"Total, credential type","Total, field of study",13446,2010,21
20155,2019,Canada,Canada (total),Post-baccalaureate non-graduate program,"Total, credential type","Total, field of study",13020,6324,0


In [216]:
# plot for each unique program type
programs = programs_df['Program type'].unique()

# iterate over each program type and build separate figures
for prog in programs:
    # filter for rows of the current program type
    subset = programs_df[programs_df['Program type'] == prog].copy() # copy to avoid SettingWithCopyWarning

    # sort by FY Start
    subset.sort_values(by='FY Start', ascending=True, inplace=True) # inplace to modify the copy subset dataframe

    # line plot
    fig = px.line(
        subset,
        x='FY Start',
        y=['Domestic Enrolment', 'International Enrolment'],
        title=f"Domestic and International Enrolment for {prog} over time",
        labels={'value': 'Enrolment', 'FY Start': 'Year'},
    )

    fig.update_layout(
        xaxis_title="FY Start",
        yaxis_title="Enrolment"
    )

    fig.show()

**Ignore from here on:** 
- Graduate program above third cycle (above PhD), Qualifying Second Cycle and Qualifying Third Cycle all have tiny enrolments in single digits, max 100 in the country. 
- Other programs: no granularity into the nature of programs and numbers are four figures across the whole country. No interest. 
- Undergraduate Qualifying: <1000 nationwide in both domestic and international. No real significance.

**Quickfire Analysis of key program types:**
- Career / Technical training: Domestic down ~15%, International more than doubled from 50k to ~130k
- Master's (Graduate, second cycle): Domestic has barely moved, International doubled from ~27k to ~50k
- **PhD (Graduate Third Cycle): Good growth for domestic and international, international adding 7k to domestic's 3.5k since 16/17. Investigate more**
- Health Residency: stayed relatively constant, growth in both demographics, twice the growth for domestic students (up 1.5k) than international (~700) over the period. Lower priority to investigate due to small numbers. May be political reasons e.g. international MDs must return to host country where a scholarship was awarded.
- **Post career technical training: Huge international growth from <7k to >40k in 7 years, but domestic enrolment has stayed flat. Big interest**
- **Post-bacc non graduate program: Similar to Post career technical training, higher domestic base rapidly being caught up. Investigate**
- Pre-university: almost no change, perhaps the Quebec pre-university diplomas. Low priority, confirm it's a Quebec thing.
- Qualifying for Tech training or pre-university: No international change, 30% decline in domestic market to only 10k students nationwide. Where/what are these programs and why is dropout happening with no international growth?
- **Undergraduate: Largest cohort by far. Domestic up 10-15k in this time, International up ~46k (~46% growth)**


Below we can present the change since 2016/17 in the programs of interest:

In [217]:
# filter for just the key programs identified above - only programs with more than 5000 enrolment in one of the International/Domestic student categories
key_progs = [
    ptype
    for ptype, grp in programs_df[programs_df['FY Start'] == 2022].groupby('Program type') # df.groupby() returns ptype the program and grp the dataframe of enrolment for that matching program type
    if grp['Domestic Enrolment'].sum() > 5000 or grp['International Enrolment'].sum() > 5000
    ]

print(key_progs)

['Career, technical or professional training program', 'Graduate program (second cycle)', 'Graduate program (third cycle)', 'Health-related residency program', 'Post career, technical or professional training program', 'Post-baccalaureate non-graduate program', 'Pre-university program', 'Qualifying program for career, technical or pre-university', 'Undergraduate program']


### Change in enrolment in programs, 2016/17 to 2022/23

- Bar chart to show % change in enrolment since 16/17

In [218]:
# Just want key program types
df_key = programs_df[programs_df['Program type'].isin(key_progs)]

# Separate data for 2016 and 2022, group by Program type, summing Domestic & Intl.
df_2016 = (
    df_key[df_key['FY Start'] == 2016]
    .groupby('Program type', as_index=False)[['Domestic Enrolment', 'International Enrolment']]
    .sum()
    .rename(columns={
        'Domestic Enrolment': 'Domestic_16',
        'International Enrolment': 'International_16'
    })
)

df_2022 = (
    df_key[df_key['FY Start'] == 2022]
    .groupby('Program type', as_index=False)[['Domestic Enrolment', 'International Enrolment']]
    .sum()
    .rename(columns={
        'Domestic Enrolment': 'Domestic_22',
        'International Enrolment': 'International_22'
    })
)

# Merge 2016 and 2022 data on Program type
df_merged = pd.merge(df_2016, df_2022, on='Program type', how='inner')

# Compute percentage growth = (2022 / 2016 - 1) * 100
# A zero 2016 base yields inf (or NaN for 0/0) rather than an error, so fillna alone
# would miss it; convert inf to NaN first and treat both cases as 0% growth
df_merged['Domestic Growth (%)'] = (
    (df_merged['Domestic_22'] / df_merged['Domestic_16'] - 1) * 100
).replace([float('inf'), float('-inf')], float('nan')).fillna(0)

df_merged['International Growth (%)'] = (
    (df_merged['International_22'] / df_merged['International_16'] - 1) * 100
).replace([float('inf'), float('-inf')], float('nan')).fillna(0)

# Melt so each row has either Domestic or International growth
df_melt = df_merged.melt(
    id_vars='Program type',
    value_vars=['Domestic Growth (%)', 'International Growth (%)'],
    var_name='Category',
    value_name='Growth (%)'
)

# Plot a grouped bar chart
fig = px.bar(
    df_melt,
    x='Program type',
    y='Growth (%)',
    color='Category',
    barmode='group',
    color_discrete_map={
        'Domestic Growth (%)': 'blue',
        'International Growth (%)': 'red'
    },
    title='Change in Enrolment from 2016 to 2022 by Program Type'
)

# Fix the y-axis range so negative growth is visible; note that bars above 200% are clipped
fig.update_layout(
    yaxis=dict(range=[-50, 200]) 
)
fig.show()

Clearly the growth in programs has overwhelmingly come from Post career, technical or professional training, Post-baccalaureate non-graduate and Career/Technical training programs. There are no program types where domestic growth has been larger than international growth in percentage terms.

Pre-university needs to be looked at a bit more carefully because the international numbers were very small initially (693 students in 2016), so a large percentage change here represents only a few hundred students.
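
To show why a small starting base distorts percentage growth, here is a minimal sketch comparing Pre-university with Undergraduate international enrolment. The four figures are copied from the outputs in this workbook (2016 values from `df_2016`, 2022 values from the Canada-level rows); the `intl` frame itself is just for illustration.

```python
import pandas as pd

# International enrolment, Canada-wide, copied from outputs in this workbook
intl = pd.DataFrame(
    {
        "Program type": ["Pre-university program", "Undergraduate program"],
        "International_16": [693, 99_711],
        "International_22": [1_311, 145_479],
    }
)

# Absolute change in student numbers
intl["Absolute change"] = intl["International_22"] - intl["International_16"]

# Percentage growth relative to the 2016 base
intl["Growth (%)"] = (
    (intl["International_22"] / intl["International_16"] - 1) * 100
).round(1)

print(intl)
```

Pre-university shows the larger percentage growth of the two, yet it added only a few hundred students against tens of thousands for Undergraduate, which is why reading the bar chart alongside raw enrolment matters.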



### Pie chart - has the market share of programs changed substantially for Canadian or International students?

- Pie chart to show changing share of the enrolment since 2016/17

To address the skills gap part of the equation here, I want to see if particular programs have been in decline or on the rise relative to the rest of the landscape, e.g. more Master's students or fewer PhD students.

In [219]:
# check if we need - df for the pie charts
# # df_2016

In [220]:
import plotly.graph_objects as go  # needed for go.Pie below
from plotly.subplots import make_subplots

colors = px.colors.qualitative.Plotly  # same color palette for consistency
color_map = {ptype: colors[i % len(colors)] for i, ptype in enumerate(key_progs)}

# Create a 2×2 domain (pie) subplot
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{'type': 'domain'}, {'type': 'domain'}],
           [{'type': 'domain'}, {'type': 'domain'}]],
    subplot_titles=[
        "2016 Domestic", "2022 Domestic",
        "2016 International", "2022 International"
    ]
)

# Helper function to add a donut to the figure
def add_pie(row, col, df, label_col, value_col, subtitle):
    labels = df[label_col]
    values = df[value_col]
    # Map each program type to a consistent color
    slice_colors = [color_map[lbl] for lbl in labels]

    fig.add_trace(go.Pie(
        labels=labels,
        values=values,
        marker=dict(colors=slice_colors),
        hole=0.4,
        # Show percentage on slice, plus raw data on hover
        textinfo='percent',
        hovertemplate="%{label}<br>Enrolment: %{value}<extra></extra>"
    ), row=row, col=col)
    # Optional: rename the subplot title
    fig.layout.annotations[(row-1)*2 + (col-1)].text = subtitle

add_pie(
    row=1, col=1,
    df=df_2016,
    label_col='Program type',
    value_col='Domestic_16',
    subtitle="Domestic students in 2016"
)

add_pie(
    row=1, col=2,
    df=df_2022,
    label_col='Program type',
    value_col='Domestic_22',
    subtitle="Domestic students in 2022"
)

add_pie(
    row=2, col=1,
    df=df_2016,
    label_col='Program type',
    value_col='International_16',
    subtitle="International students in 2016"
)

add_pie(
    row=2, col=2,
    df=df_2022,
    label_col='Program type',
    value_col='International_22',
    subtitle="International students in 2022"
)

# Update overall layout
fig.update_layout(
    width=1000,
    height=800,
    title="Share of Enrolment by Program Type – Domestic vs. International Students (16/17 and 22/23)",
    legend_title="Program Type"
)

fig.show()

In [221]:
df_2016

Unnamed: 0,Program type,Domestic_16,International_16
0,"Career, technical or professional training pro...",336987,49194
1,Graduate program (second cycle),72864,27549
2,Graduate program (third cycle),32391,17475
3,Health-related residency program,13581,2058
4,"Post career, technical or professional trainin...",8646,6966
5,Post-baccalaureate non-graduate program,13440,2103
6,Pre-university program,83595,693
7,"Qualifying program for career, technical or pr...",13857,189
8,Undergraduate program,749376,99711


In [222]:
import pandas as pd
import plotly.express as px

# ----- 1) Build a "long" DataFrame for DOMESTIC students -----
domestic_rows = []

# Collect 2016 Domestic from df_2016
for _, row in df_2016.iterrows():
    domestic_rows.append({
        'Year': '2016',
        'Program type': row['Program type'],
        'Enrolment': row['Domestic_16']
    })

# Collect 2022 Domestic from df_2022
for _, row in df_2022.iterrows():
    domestic_rows.append({
        'Year': '2022',
        'Program type': row['Program type'],
        'Enrolment': row['Domestic_22']
    })

df_domestic = pd.DataFrame(domestic_rows)

# ----- 2) Build the DOMESTIC stacked bar chart -----
fig_dom = px.bar(
    df_domestic,
    x='Enrolment',          # numeric axis
    y='Year',               # two categories: 2016, 2022
    color='Program type',   # stacked by program type
    orientation='h',        # horizontal bars
    barmode='stack',
    title='Domestic Enrolment by Program Type – 2016 vs 2022'
)

fig_dom.update_layout(
    width=900,
    height=600,
    xaxis_title='Enrolment',
    yaxis_title=None
)

fig_dom.show()

# ----- 3) Build a "long" DataFrame for INTERNATIONAL students -----
international_rows = []

# Collect 2016 International from df_2016
for _, row in df_2016.iterrows():
    international_rows.append({
        'Year': '2016',
        'Program type': row['Program type'],
        'Enrolment': row['International_16']
    })

# Collect 2022 International from df_2022
for _, row in df_2022.iterrows():
    international_rows.append({
        'Year': '2022',
        'Program type': row['Program type'],
        'Enrolment': row['International_22']
    })

df_international = pd.DataFrame(international_rows)

# ----- 4) Build the INTERNATIONAL stacked bar chart -----
fig_int = px.bar(
    df_international,
    x='Enrolment',
    y='Year',
    color='Program type',
    orientation='h',
    barmode='stack',
    title='International Enrolment by Program Type – 2016 vs 2022'
)

fig_int.update_layout(
    width=900,
    height=600,
    xaxis_title='Enrolment',
    yaxis_title=None
)

fig_int.show()


Quickfire analysis:
- **Undergrad dominates:** The undergraduate numbers dominate both groups. The domestic landscape has stayed relatively stable, save for a slight increase in Undergrad share at the expense of the Career/Technical programs.
- **Career / Technical programs blow up in the international market.** Career/Technical went from 24% to 32% of international enrolment, adding 80,000 students in six years.
- **Int'l student degrees grow, but pale in comparison to technical training.** Undergraduate program share was almost 50% in 2016 and just over 35% by 2022, yet in isolation UG enrolment itself grew by about 46%, roughly 46,000 students added in six years nationwide. It's extraordinary that the largest program cluster grew by almost half but still lost around 13 percentage points of international market share. The same holds for Master's (Graduate, second cycle), which added over 22,000 students, nearly doubling in isolation, while its share shrank by 1.2 percentage points.


#### Conclusion on Degrees (Undergrad, Masters, PhD) vs Technical/Career training
- **Around 64-65% of domestic students were studying a Degree program in 2016, and in 2022 this was 67.6%.**
- **Around 70% of international students were studying a Degree in 2016, but this had dropped to 54% in 2022.**
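
The 2016 side of these shares can be reproduced from the `df_2016` output shown earlier. This is a minimal sketch: the frame is re-typed by hand under the hypothetical name `df_2016_sketch`, and "Degree" is assumed here to mean the Undergraduate, Graduate (second cycle) and Graduate (third cycle) program types, shares being taken over the nine key programs only.

```python
import pandas as pd

# 2016 enrolment for the nine key programs, copied from the df_2016 output above
df_2016_sketch = pd.DataFrame(
    {
        "Program type": [
            "Career, technical or professional training program",
            "Graduate program (second cycle)",
            "Graduate program (third cycle)",
            "Health-related residency program",
            "Post career, technical or professional training program",
            "Post-baccalaureate non-graduate program",
            "Pre-university program",
            "Qualifying program for career, technical or pre-university",
            "Undergraduate program",
        ],
        "Domestic_16": [336987, 72864, 32391, 13581, 8646, 13440, 83595, 13857, 749376],
        "International_16": [49194, 27549, 17475, 2058, 6966, 2103, 693, 189, 99711],
    }
)

# Assumption: these three program types are what the text calls "Degree" programs
degree_types = [
    "Undergraduate program",
    "Graduate program (second cycle)",
    "Graduate program (third cycle)",
]

cols = ["Domestic_16", "International_16"]
is_degree = df_2016_sketch["Program type"].isin(degree_types)

# Degree enrolment as a percentage of all key-program enrolment
degree_share = df_2016_sketch.loc[is_degree, cols].sum() / df_2016_sketch[cols].sum() * 100
print(degree_share.round(1))
```

Under those assumptions the 2016 degree share comes out at about 64.5% for domestic and 70.3% for international students, in line with the figures quoted above.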

Below I'm making a dataframe for just the 10 provinces, not Canada as a whole, in the 9 'key programs' over the 7 fiscal years from 2016/17 to 2022/23. This should give us 630 records (10 × 9 × 7) if every province has enrolment in all 9 programs in all 7 years. Any fewer than that and we have nulls, or some programs that just don't take place in certain provinces.

In [223]:
# examining just the key programs but including provincial totals now
# should have provincial totals for all creds, all fields, individual programs no totals, only the key programs

key_progs_prov = cleaned_progs[~(cleaned_progs['Province/Territory'].str.contains('Canada')) &
                               ~(cleaned_progs['Program type'] == 'Total, program type') &
                                (cleaned_progs['Program type'].isin(key_progs)) &
                                (cleaned_progs['Credential type'] == 'Total, credential type') & 
                                (cleaned_progs['Field of study'] == 'Total, field of study')
                              ]

In [224]:
print(f"There are {key_progs_prov['Program type'].nunique()} distinct program types in the key programs by province dataframe")
print(f"There are {key_progs_prov['Province/Territory'].nunique()} distinct Provinces in the key programs by province dataframe")
print(f"There are {key_progs_prov['FY Start'].nunique()} distinct years in the key programs by province dataframe")

There are 9 distinct program types in the key programs by province dataframe
There are 10 distinct Provinces in the key programs by province dataframe
There are 7 distinct years in the key programs by province dataframe


In [225]:
key_progs_prov.shape

(513, 9)

Only 513 of the expected 630 records, so there are some 'quirky' programs that don't always appear in every province, or are unique to some provinces.
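
One way to list exactly which province / program / year combinations are absent is to build the full expected grid and subtract what is present. The sketch below uses a tiny hypothetical frame (`toy`) rather than `key_progs_prov`, so the rows are made up; the column names match the workbook.

```python
import pandas as pd

# Hypothetical toy frame standing in for key_progs_prov
toy = pd.DataFrame(
    {
        "Province/Territory": ["Quebec", "Quebec", "Ontario"],
        "Program type": [
            "Pre-university program",
            "Undergraduate program",
            "Undergraduate program",
        ],
        "FY Start": [2022, 2022, 2022],
    }
)

keys = ["Province/Territory", "Program type", "FY Start"]

# Every combination we would expect if all provinces reported all programs in all years
full_grid = pd.MultiIndex.from_product([toy[k].unique() for k in keys], names=keys)

# Combinations that actually appear in the data
present = pd.MultiIndex.from_frame(toy[keys])

# Expected combinations with no record at all
missing = full_grid.difference(present)
print(missing.to_frame(index=False))
```

Applied to the real frame, the same approach would be expected to return the 630 − 513 = 117 absent combinations, which can then be grouped by program type to see which ones are province-specific.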

#### 1. Pre-University Program (Revealed to be QC only, tiny SK enrolment)

In [226]:
print(f"Pre-University programs appear in: \n{key_progs_prov[key_progs_prov['Program type'] == 'Pre-university program']['Province/Territory'].value_counts()}\n\n")

Pre-University programs appear in: 
Province/Territory
Quebec          7
Saskatchewan    5
Name: count, dtype: int64




In [227]:
key_progs_prov[key_progs_prov['Program type'] == 'Pre-university program'].sort_values(by='Domestic Enrolment', ascending=False)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
29777,2020,Quebec,Quebec (total),Pre-university program,"Total, credential type","Total, field of study",84228,1365,0
5380,2016,Quebec,Quebec (total),Pre-university program,"Total, credential type","Total, field of study",83598,693,0
35971,2021,Quebec,Quebec (total),Pre-university program,"Total, credential type","Total, field of study",82683,1308,0
11553,2017,Quebec,Quebec (total),Pre-university program,"Total, credential type","Total, field of study",82302,816,0
17589,2018,Quebec,Quebec (total),Pre-university program,"Total, credential type","Total, field of study",81660,939,0
42184,2022,Quebec,Quebec (total),Pre-university program,"Total, credential type","Total, field of study",81366,1308,0
23676,2019,Quebec,Quebec (total),Pre-university program,"Total, credential type","Total, field of study",80703,1461,0
30285,2020,Saskatchewan,Saskatchewan (total),Pre-university program,"Total, credential type","Total, field of study",66,0,0
36485,2021,Saskatchewan,Saskatchewan (total),Pre-university program,"Total, credential type","Total, field of study",36,3,0
42700,2022,Saskatchewan,Saskatchewan (total),Pre-university program,"Total, credential type","Total, field of study",36,6,0


Saskatchewan's few dozen students are not analytically relevant here. What is relevant is that some records report zero enrolment for a program type, while other province/year combinations are simply absent from the data altogether.

We can audit check program types where international enrolment and domestic enrolment are both 0 to see if other null records can be removed
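A minimal sketch of that audit, assuming the column names shown in the tables above. The helper name `zero_enrolment_records` and the toy frame are illustrative only: the first three rows reuse the Saskatchewan figures printed above, and the fourth row is hypothetical, added so the filter has something to catch.

```python
import pandas as pd

def zero_enrolment_records(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows where both domestic and international enrolment are 0."""
    mask = (df["Domestic Enrolment"] == 0) & (df["International Enrolment"] == 0)
    return df[mask]

# Toy frame: rows 1-3 are the Saskatchewan figures from the output above;
# the 2019 row is HYPOTHETICAL and exists only to exercise the filter.
toy = pd.DataFrame({
    "FY Start": [2020, 2021, 2022, 2019],
    "Province/Territory": ["Saskatchewan"] * 4,
    "Program type": ["Pre-university program"] * 4,
    "Domestic Enrolment": [66, 36, 36, 0],
    "International Enrolment": [0, 3, 6, 0],
})

zero_rows = zero_enrolment_records(toy)

# Count empty records per program type to see where they cluster
zero_counts = zero_rows.groupby("Program type").size()
print(zero_counts)
```

Run against `key_progs_prov`, the same call would list the candidate records. Whether to drop them is a judgment call, since a reported zero is not the same thing as a missing year.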

For all intents and purposes, Pre-university program is a Quebec-only designation. It is the main program offering of CEGEP institutions, which allows students to enter university for an undergraduate degree after two years of study. Enrolment looks fairly static: international numbers show little growth in raw terms, and domestic numbers hover between roughly 80,700 and 84,200, probably reflecting year-to-year demographic ups and downs.

The takeaway is that Quebec offers a two-year program that prepares students to enter university, which is known as a transfer program. Similar programs are offered in other provinces, but they seem to be reported under a different category.

#### 2. Post career, technical or professional training program (all but QC, ON Dominates)

In [228]:
print(f"Post career professional program appears in:\n{key_progs_prov[key_progs_prov['Program type'] == 'Post career, technical or professional training program']['Province/Territory'].value_counts()}\n\n")

Post career professional program appears in:
Province/Territory
Alberta                      7
British Columbia             7
Manitoba                     7
New Brunswick                7
Newfoundland and Labrador    7
Nova Scotia                  7
Ontario                      7
Saskatchewan                 7
Prince Edward Island         3
Name: count, dtype: int64




All provinces but Quebec offer this program, though PEI appears in only three of the seven years. Further digging...
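One way to see which years are missing is to compare each province's reported years against the full 2016 to 2022 range. The sketch below assumes the `FY Start` and `Province/Territory` columns used throughout. `missing_years` is a hypothetical helper, and the PEI years in the toy frame are placeholders, since the output above only tells us that PEI has three records, not which years they cover.

```python
import pandas as pd

def missing_years(df: pd.DataFrame, all_years) -> dict:
    """Map each province to the sorted list of years with no record."""
    gaps = {}
    for prov, grp in df.groupby("Province/Territory"):
        present = set(grp["FY Start"].tolist())
        gaps[prov] = sorted(set(all_years) - present)
    return gaps

years = range(2016, 2023)  # FY Start 2016 through 2022

# Toy frame: Ontario has all seven years (as in the output above);
# the three PEI years are PLACEHOLDERS, not taken from the data.
toy = pd.DataFrame({
    "Province/Territory": ["Ontario"] * 7 + ["Prince Edward Island"] * 3,
    "FY Start": list(years) + [2020, 2021, 2022],
})

gaps = missing_years(toy, years)
print(gaps)
```

Passing the filtered program-type frame instead of `toy` would show the actual gaps for PEI.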

In [229]:
key_progs_prov[key_progs_prov['Program type'] == 'Post career, technical or professional training program'].sort_values(by='Domestic Enrolment', ascending=False).head(20)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
28903,2020,Ontario,Ontario (total),"Post career, technical or professional trainin...","Total, credential type","Total, field of study",8841,17037,105
22784,2019,Ontario,Ontario (total),"Post career, technical or professional trainin...","Total, credential type","Total, field of study",8586,17118,27
16689,2018,Ontario,Ontario (total),"Post career, technical or professional trainin...","Total, credential type","Total, field of study",8013,12366,0
35075,2021,Ontario,Ontario (total),"Post career, technical or professional trainin...","Total, credential type","Total, field of study",7950,26373,294
10635,2017,Ontario,Ontario (total),"Post career, technical or professional trainin...","Total, credential type","Total, field of study",7674,8856,6
4465,2016,Ontario,Ontario (total),"Post career, technical or professional trainin...","Total, credential type","Total, field of study",7329,6414,6
41302,2022,Ontario,Ontario (total),"Post career, technical or professional trainin...","Total, credential type","Total, field of study",6795,40677,321
33165,2021,Manitoba,Manitoba (total),"Post career, technical or professional trainin...","Total, credential type","Total, field of study",660,387,0
13071,2018,British Columbia,BC (total),"Post career, technical or professional trainin...","Total, credential type","Total, field of study",552,714,0
6914,2017,British Columbia,BC (total),"Post career, technical or professional trainin...","Total, credential type","Total, field of study",534,432,0


In [230]:
post_career_df = key_progs_prov[key_progs_prov['Program type'] == 'Post career, technical or professional training program']

# compare 2016/17 to 2022/23 (reuse the filtered frame rather than repeating the filter)
post_career_df_16_22 = post_career_df[post_career_df['FY Start'].isin([2016, 2022])]

post_career_provs = post_career_df['Province/Territory'].unique()

In [231]:
# ----- A) DOMESTIC CHART -----

# Create a stacked bar chart (2 bars for 2016 & 2022, stacked by province)
fig_dom = px.bar(
    post_career_df_16_22,
    x='Domestic Enrolment',
    y='FY Start',
    color='Province/Territory',
    orientation='h',
    barmode='stack',
    title='Domestic Enrolment by Province/Territory – 2016 vs 2022'
)

fig_dom.update_layout(
    width=900,
    height=600,
    xaxis_title='Enrolment',
    yaxis_title=None,
    yaxis=dict(
        tickmode='array',
        tickvals=[2016, 2022],      # Only show 2016 & 2022
        ticktext=['2016', '2022']   # Labels for each tick
    )
)

fig_dom.show()

# ----- B) INTERNATIONAL CHART -----

# Create a stacked bar chart (again 2016 & 2022, stacked by province)
fig_int = px.bar(
    post_career_df_16_22,
    x='International Enrolment',
    y='FY Start',
    color='Province/Territory',
    orientation='h',
    barmode='stack',
    title='International Enrolment by Province/Territory – 2016 vs 2022'
)

fig_int.update_layout(
    width=900,
    height=600,
    xaxis_title='Enrolment',
    yaxis_title=None,
    yaxis=dict(
        tickmode='array',
        tickvals=[2016, 2022],      # Only show 2016 & 2022
        ticktext=['2016', '2022']   # Labels for each tick
    )
)

fig_int.show()

Ontario dominates this list, with over ten times the domestic and international enrolment of any other province in this program type, which includes the Ontario Graduate Certificate.
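To put numbers on how Ontario's mix shifted within this program type, here is a small sketch using only the Ontario rows printed in the table above. It ignores the "CA Status Unreported Enrolment" column, which is an assumption made for simplicity, and the function name is illustrative.

```python
# (domestic, international) enrolment for Ontario, copied from the table above
ontario_post_career = {
    2016: (7329, 6414),
    2017: (7674, 8856),
    2018: (8013, 12366),
    2019: (8586, 17118),
    2020: (8841, 17037),
    2021: (7950, 26373),
    2022: (6795, 40677),
}

def international_share(domestic: int, international: int) -> float:
    """International enrolment as a fraction of domestic + international."""
    return international / (domestic + international)

shares = {
    year: international_share(dom, intl)
    for year, (dom, intl) in ontario_post_career.items()
}
for year in sorted(shares):
    print(f"{year}: {shares[year]:.1%} international")

# Growth factor of international enrolment between the first and last year
intl_growth = ontario_post_career[2022][1] / ontario_post_career[2016][1]
print(f"International enrolment grew {intl_growth:.1f}x from 2016 to 2022")
```

On these figures the international share goes from a little under half in 2016 to well over four fifths in 2022, while domestic enrolment ends slightly below where it started.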

Capturing this in some variables below

#### 3. Career, technical or professional training program (all provinces solid enrolment but ON dominates intl enrolment)

In [232]:
print(f"Career and technical programs appear in:\n{key_progs_prov[key_progs_prov['Program type'] == 'Career, technical or professional training program']['Province/Territory'].value_counts()}\n\n")

Career and technical programs appear in:
Province/Territory
Alberta                      7
British Columbia             7
Manitoba                     7
New Brunswick                7
Newfoundland and Labrador    7
Nova Scotia                  7
Ontario                      7
Prince Edward Island         7
Quebec                       7
Saskatchewan                 7
Name: count, dtype: int64




All ten provinces are present in every year here, which is a good sign.

In [233]:
key_progs_prov[key_progs_prov['Program type'] == 'Career, technical or professional training program'].sort_values(by='Domestic Enrolment', ascending=False).head(20)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
10335,2017,Ontario,Ontario (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",154404,46005,285
4164,2016,Ontario,Ontario (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",150834,31833,576
16404,2018,Ontario,Ontario (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",149544,58581,135
22486,2019,Ontario,Ontario (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",143979,67407,363
28608,2020,Ontario,Ontario (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",132591,66393,1116
34786,2021,Ontario,Ontario (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",124488,73020,1233
41015,2022,Ontario,Ontario (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",115653,95043,1041
5171,2016,Quebec,Quebec (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",93651,4296,0
11342,2017,Quebec,Quebec (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",92664,4401,0
29569,2020,Quebec,Quebec (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",92001,6795,0


Given Canada's population distribution, this looks like a sensible enrolment table for programs that are widely popular with both domestic and international students. Ontario is the most populous province, followed by Quebec, then BC and Alberta.

I anticipate this is where the skilled trades training programs are reported, and we can examine this later with the Field of Study codes to verify.

In [234]:
career_tech_prof_df = key_progs_prov[key_progs_prov['Program type'] == 'Career, technical or professional training program']

# compare 2016/17 to 2022/23 (reuse the filtered frame rather than repeating the filter)
career_tech_prof_df_16_22 = career_tech_prof_df[career_tech_prof_df['FY Start'].isin([2016, 2022])]

career_tech_prof_provs = career_tech_prof_df['Province/Territory'].unique()

In [235]:
# pie chart comparing 2016 to 2022 share of enrolment across provinces

colors = px.colors.qualitative.Plotly  # same color palette for consistency
color_map = {prov: colors[i % len(colors)] for i, prov in enumerate(career_tech_prof_provs)}

# Create a 2×2 domain (pie) subplot
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{'type': 'domain'}, {'type': 'domain'}],
           [{'type': 'domain'}, {'type': 'domain'}]],
    subplot_titles=[
        "2016 Domestic", "2022 Domestic",
        "2016 International", "2022 International"
    ]
)

# Helper function to add a donut to the figure
def add_pie(row, col, df, label_col, value_col, subtitle):
    labels = df[label_col]
    values = df[value_col]
    # Map each province to a consistent color
    slice_colors = [color_map[lbl] for lbl in labels]

    fig.add_trace(go.Pie(
        labels=labels,
        values=values,
        marker=dict(colors=slice_colors),
        hole=0.4,
        # Show percentage on slice, plus raw data on hover
        textinfo='percent',
        hovertemplate="%{label}<br>Enrolment: %{value}<extra></extra>"
    ), row=row, col=col)
    # Optional: rename the subplot title
    fig.layout.annotations[(row-1)*2 + (col-1)].text = subtitle

add_pie(
    row=1, col=1,
    df=career_tech_prof_df_16_22[career_tech_prof_df_16_22['FY Start'] == 2016],
    label_col='Province/Territory',
    value_col='Domestic Enrolment',
    subtitle="2016 Domestic"
)

add_pie(
    row=1, col=2,
    df=career_tech_prof_df_16_22[career_tech_prof_df_16_22['FY Start'] == 2022],
    label_col='Province/Territory',
    value_col='Domestic Enrolment',
    subtitle="2022 Domestic"
)

add_pie(
    row=2, col=1,
    df=career_tech_prof_df_16_22[career_tech_prof_df_16_22['FY Start'] == 2016],
    label_col='Province/Territory',
    value_col='International Enrolment',
    subtitle="2016 International"
)

add_pie(
    row=2, col=2,
    df=career_tech_prof_df_16_22[career_tech_prof_df_16_22['FY Start'] == 2022],
    label_col='Province/Territory',
    value_col='International Enrolment',
    subtitle="2022 International"
)

# Update overall layout
fig.update_layout(
    width=1000,
    height=800,
    title="Career/Technical Training Program Enrolment – Domestic vs. International (16/17 and 22/23)",
    legend_title="Province/Territory"  # slices are provinces, not program types
)

fig.show()

[According to the 2021 Census](https://en.wikipedia.org/wiki/Population_of_Canada_by_province_and_territory), Ontario, Quebec, BC and Alberta have 38.5%, 23%, 13.5% and 11.5% of Canada's population respectively. Domestically, ON and QC are slightly overrepresented in these programs. Look at the raw numbers, though: Ontario accounted for 45% of domestic enrolment in this program type in 2016 and lost about 35,000 students from this category in six years. The other major provinces stayed relatively constant or declined slightly.
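The over- and under-representation can be checked directly. This sketch uses the 2016 domestic figures from the `career_tech_prof_df_16_22` output in this section and the population percentages quoted above. The variable names are illustrative, and the comparison assumes the ten provinces in the table are the whole denominator (territories are not in this slice of the data).

```python
# 2016 domestic enrolment, Career/technical program type (from the table in this section)
domestic_2016 = {
    "Alberta": 31950,
    "British Columbia": 29517,
    "Manitoba": 6705,
    "New Brunswick": 5022,
    "Newfoundland and Labrador": 5307,
    "Nova Scotia": 7764,
    "Ontario": 150834,
    "Prince Edward Island": 1743,
    "Quebec": 93651,
    "Saskatchewan": 4131,
}

# Share of Canada's population in percent, as quoted in the text (2021 Census)
population_share = {
    "Ontario": 38.5,
    "Quebec": 23.0,
    "British Columbia": 13.5,
    "Alberta": 11.5,
}

total = sum(domestic_2016.values())

# Gap in percentage points: positive means overrepresented relative to population
gap = {}
for prov, pop_pct in population_share.items():
    enrol_pct = 100 * domestic_2016[prov] / total
    gap[prov] = enrol_pct - pop_pct
    print(f"{prov}: {enrol_pct:.1f}% of enrolment vs {pop_pct}% of population ({gap[prov]:+.1f} pts)")
```

On these numbers Ontario and Quebec sit several points above their population share, while BC and Alberta sit below theirs.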

For international enrolment the picture is entirely different. Again it's interesting to compare the change in raw numbers against the change in share. All the major provinces added a few thousand international students to this program type, but Ontario tripled its enrolment in this category, from a base already more than four times as large as the next-largest province.

Ontario really is the anomaly here: its Career/Technical programs were roughly 151k/32k Domestic/International in 2016 and about 116k/95k six years later.
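The arithmetic behind these Ontario comparisons is short enough to spell out. The figures are the 2016 and 2022 Ontario rows and the 2016 BC row from the tables in this section; the variable names are illustrative, and unreported-status enrolment is left out as a simplifying assumption.

```python
# Ontario, Career/technical program type (from the tables in this section)
ontario = {
    2016: {"domestic": 150834, "international": 31833},
    2022: {"domestic": 115653, "international": 95043},
}
# BC had the next-largest international enrolment in 2016
bc_international_2016 = 6831

# Absolute change in domestic enrolment over six years
domestic_change = ontario[2022]["domestic"] - ontario[2016]["domestic"]

# Growth factor of international enrolment
intl_factor = ontario[2022]["international"] / ontario[2016]["international"]

# Ontario's 2016 starting point relative to BC
lead_over_bc = ontario[2016]["international"] / bc_international_2016

def intl_share(row: dict) -> float:
    """International enrolment as a fraction of domestic + international."""
    return row["international"] / (row["domestic"] + row["international"])

share_2016 = intl_share(ontario[2016])
share_2022 = intl_share(ontario[2022])

print(f"Domestic change: {domestic_change:+,}")
print(f"International growth: {intl_factor:.2f}x")
print(f"2016 lead over BC: {lead_over_bc:.2f}x")
print(f"International share: {share_2016:.1%} -> {share_2022:.1%}")
```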

In [236]:
career_tech_prof_df_16_22

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
80,2016,Alberta,Alberta (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",31950,3915,30
617,2016,British Columbia,BC (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",29517,6831,0
2436,2016,Manitoba,Manitoba (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",6705,1257,0
2905,2016,New Brunswick,New Brunswick (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",5022,408,0
3293,2016,Newfoundland and Labrador,Newfoundland and Labrador (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",5307,78,0
3690,2016,Nova Scotia,Nova Scotia (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",7764,9,33
4164,2016,Ontario,Ontario (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",150834,31833,576
4905,2016,Prince Edward Island,Prince Edward Island (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",1743,159,0
5171,2016,Quebec,Quebec (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",93651,4296,0
5731,2016,Saskatchewan,Saskatchewan (total),"Career, technical or professional training pro...","Total, credential type","Total, field of study",4131,399,24


In [237]:
# ----- A) DOMESTIC CHART -----

# Create a stacked bar chart (2 bars for 2016 & 2022, stacked by province)
fig_dom = px.bar(
    career_tech_prof_df_16_22,
    x='Domestic Enrolment',
    y='FY Start',
    color='Province/Territory',
    orientation='h',
    barmode='stack',
    title='Domestic Enrolment by Province/Territory – 2016 vs 2022'
)

fig_dom.update_layout(
    width=900,
    height=600,
    xaxis_title='Enrolment',
    yaxis_title=None,
    yaxis=dict(
        tickmode='array',
        tickvals=[2016, 2022],      # Only show 2016 & 2022
        ticktext=['2016', '2022']   # Labels for each tick
    )
)

fig_dom.show()

# ----- B) INTERNATIONAL CHART -----

# Create a stacked bar chart (again 2016 & 2022, stacked by province)
fig_int = px.bar(
    career_tech_prof_df_16_22,
    x='International Enrolment',
    y='FY Start',
    color='Province/Territory',
    orientation='h',
    barmode='stack',
    title='International Enrolment by Province/Territory – 2016 vs 2022'
)

fig_int.update_layout(
    width=900,
    height=600,
    xaxis_title='Enrolment',
    yaxis_title=None,
    yaxis=dict(
        tickmode='array',
        tickvals=[2016, 2022],      # Only show 2016 & 2022
        ticktext=['2016', '2022']   # Labels for each tick
    )
)

fig_int.show()

#### 4. Qualifying Program for Career/Technical or Pre-university (roughly 10-12k annually in QC, tiny elsewhere)

In [238]:
print(f"Qualifying program for career/technical or pre-univ appears in:\n{key_progs_prov[key_progs_prov['Program type'] == 'Qualifying program for career, technical or pre-university']['Province/Territory'].value_counts()}\n\n")

Qualifying program for career/technical or pre-univ appears in:
Province/Territory
Alberta                      7
New Brunswick                7
Newfoundland and Labrador    7
Ontario                      7
Quebec                       7
Manitoba                     2
Name: count, dtype: int64




BC, SK, NS and PEI are absent here.

In [239]:
key_progs_prov[key_progs_prov['Program type'] == 'Qualifying program for career, technical or pre-university'].sort_values(by='Domestic Enrolment', ascending=False).head(20)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
5385,2016,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",12513,129,0
11558,2017,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",12102,147,0
17594,2018,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",11337,201,0
29782,2020,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",10737,87,0
23681,2019,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",10494,195,0
35976,2021,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",10389,105,0
42189,2022,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",9630,132,0
273,2016,Alberta,Alberta (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",873,12,0
6394,2017,Alberta,Alberta (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",852,6,15
18605,2019,Alberta,Alberta (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",531,0,0


This is a strange program type which Quebec seems to dominate, save for a few hundred domestic students in various other provinces. Let's spot check individual provinces...

In [240]:
qual_career_tech_df = key_progs_prov[key_progs_prov['Program type'] == 'Qualifying program for career, technical or pre-university']

qual_career_tech_df[qual_career_tech_df['Province/Territory'] == 'Ontario']

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
4516,2016,Ontario,Ontario (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",12,45,39
10681,2017,Ontario,Ontario (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",381,0,39
16728,2018,Ontario,Ontario (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",63,180,0
22821,2019,Ontario,Ontario (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",78,204,0
28930,2020,Ontario,Ontario (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",57,12,0
35104,2021,Ontario,Ontario (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",87,9,0
41332,2022,Ontario,Ontario (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",36,18,0


In [241]:
qual_career_tech_df.sort_values(by='International Enrolment', ascending=False).head(20)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
22821,2019,Ontario,Ontario (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",78,204,0
17594,2018,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",11337,201,0
23681,2019,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",10494,195,0
16728,2018,Ontario,Ontario (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",63,180,0
11558,2017,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",12102,147,0
42189,2022,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",9630,132,0
5385,2016,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",12513,129,0
35976,2021,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",10389,105,0
29782,2020,Quebec,Quebec (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",10737,87,0
4516,2016,Ontario,Ontario (total),"Qualifying program for career, technical or pr...","Total, credential type","Total, field of study",12,45,39


Ontario, the largest province, has very low numbers here. Outside Quebec, where enrolment has declined fairly steadily by around 23% in six years, this program type reaches a few hundred students at most. International students are virtually absent. I think we can set this program type aside unless we want to do some specific research into QC programming.
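The size and steadiness of Quebec's decline can be checked from the domestic figures in the table above. A minimal sketch, with an illustrative `pct_change` helper:

```python
# Quebec domestic enrolment in the qualifying program type (from the table above)
quebec_domestic = {
    2016: 12513,
    2017: 12102,
    2018: 11337,
    2019: 10494,
    2020: 10737,
    2021: 10389,
    2022: 9630,
}

def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return 100 * (new - old) / old

# Overall change across the six-year window
overall = pct_change(quebec_domestic[2016], quebec_domestic[2022])

# Year-over-year changes, keyed by the later year
years = sorted(quebec_domestic)
yearly = {
    later: pct_change(quebec_domestic[earlier], quebec_domestic[later])
    for earlier, later in zip(years, years[1:])
}

# Years in which enrolment rose rather than fell
increases = [year for year, change in yearly.items() if change > 0]

print(f"Overall 2016 -> 2022: {overall:.1f}%")
for year, change in yearly.items():
    print(f"{year}: {change:+.1f}%")
print("Years with an increase:", increases)
```

The overall drop is about 23%, and the only year-over-year rise is in 2020.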

#### 5. Post-baccalaureate non-graduate program (BC Dominates 5-10k, Domestic only ON)

In [242]:
print(f"Post-Bacc non-graduate programs: \n{key_progs_prov[key_progs_prov['Program type'] == 'Post-baccalaureate non-graduate program']['Province/Territory'].value_counts()}\n\n")

Post-Bacc non-graduate programs: 
Province/Territory
Alberta                      7
British Columbia             7
Manitoba                     7
New Brunswick                7
Newfoundland and Labrador    7
Nova Scotia                  7
Ontario                      7
Prince Edward Island         7
Saskatchewan                 7
Quebec                       6
Name: count, dtype: int64




In [243]:
key_progs_prov[key_progs_prov['Program type'] == 'Post-baccalaureate non-graduate program'].sort_values(by='Domestic Enrolment', ascending=False).head(15)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
4497,2016,Ontario,Ontario (total),Post-baccalaureate non-graduate program,"Total, credential type","Total, field of study",7368,90,0
41329,2022,Ontario,Ontario (total),Post-baccalaureate non-graduate program,"Total, credential type","Total, field of study",7215,93,0
35099,2021,Ontario,Ontario (total),Post-baccalaureate non-graduate program,"Total, credential type","Total, field of study",7104,102,0
28927,2020,Ontario,Ontario (total),Post-baccalaureate non-graduate program,"Total, credential type","Total, field of study",6915,96,0
22810,2019,Ontario,Ontario (total),Post-baccalaureate non-graduate program,"Total, credential type","Total, field of study",6663,99,0
10665,2017,Ontario,Ontario (total),Post-baccalaureate non-graduate program,"Total, credential type","Total, field of study",6570,84,0
16720,2018,Ontario,Ontario (total),Post-baccalaureate non-graduate program,"Total, credential type","Total, field of study",6462,90,0
37612,2022,British Columbia,BC (total),Post-baccalaureate non-graduate program,"Total, credential type","Total, field of study",4344,6624,0
25281,2020,British Columbia,BC (total),Post-baccalaureate non-graduate program,"Total, credential type","Total, field of study",4314,4593,0
19183,2019,British Columbia,BC (total),Post-baccalaureate non-graduate program,"Total, credential type","Total, field of study",4287,5460,0


This is unusual because the program type seems to capture different kinds of programs in different provinces. In Ontario it clearly serves as a mostly domestic program: international enrolment is remarkably stable, staying between 84 and 102 students every year.
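A quick check of that stability, using the Ontario rows from the table above and, for contrast, the 2022 BC row. The variable names are illustrative and unreported-status enrolment is ignored.

```python
# Ontario post-baccalaureate non-graduate enrolment (from the table above)
ontario_domestic = {2016: 7368, 2017: 6570, 2018: 6462, 2019: 6663,
                    2020: 6915, 2021: 7104, 2022: 7215}
ontario_international = {2016: 90, 2017: 84, 2018: 90, 2019: 99,
                         2020: 96, 2021: 102, 2022: 93}

# Spread between the highest and lowest international enrolment
spread = max(ontario_international.values()) - min(ontario_international.values())

# Highest international share of (domestic + international) in any year
max_share = max(
    ontario_international[y] / (ontario_domestic[y] + ontario_international[y])
    for y in ontario_domestic
)

# BC in 2022, for contrast (domestic 4344, international 6624)
bc_share_2022 = 6624 / (4344 + 6624)

print(f"Ontario international spread: {spread} students")
print(f"Ontario max international share: {max_share:.2%}")
print(f"BC 2022 international share: {bc_share_2022:.1%}")
```

Ontario's international share never reaches 1.5%, whereas international students are the majority of BC's 2022 enrolment in the same program type.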

In BC, meanwhile, it appears to be the equivalent of the Ontario Graduate Certificate or Diploma. Post-baccalaureate means after the bachelor's (undergraduate) degree, so this would line up.

In [244]:
post_bacc_df = key_progs_prov[key_progs_prov['Program type'] == 'Post-baccalaureate non-graduate program']

# compare 2016/17 to 2022/23 (reuse the filtered frame rather than repeating the filter)
post_bacc_df_16_22 = post_bacc_df[post_bacc_df['FY Start'].isin([2016, 2022])]

post_bacc_provs = post_bacc_df['Province/Territory'].unique()

In [245]:
colors = px.colors.qualitative.Plotly
color_map = {prov: colors[i % len(colors)] for i, prov in enumerate(post_bacc_provs)}

# Create a 2×2 domain (pie) subplot
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{'type': 'domain'}, {'type': 'domain'}],
           [{'type': 'domain'}, {'type': 'domain'}]],
    subplot_titles=[
        "2016 Domestic", "2022 Domestic",
        "2016 International", "2022 International"
    ]
)

# Helper function to add a donut to the figure
def add_pie(row, col, df, label_col, value_col, subtitle):
    labels = df[label_col]
    values = df[value_col]
    # Map each province to a consistent color
    slice_colors = [color_map[lbl] for lbl in labels]

    fig.add_trace(go.Pie(
        labels=labels,
        values=values,
        marker=dict(colors=slice_colors),
        hole=0.4,
        # Show percentage on slice, plus raw data on hover
        textinfo='percent',
        hovertemplate="%{label}<br>Enrolment: %{value}<extra></extra>"
    ), row=row, col=col)
    # Optional: rename the subplot title
    fig.layout.annotations[(row-1)*2 + (col-1)].text = subtitle

add_pie(
    row=1, col=1,
    df=post_bacc_df_16_22[post_bacc_df_16_22['FY Start'] == 2016],
    label_col='Province/Territory',
    value_col='Domestic Enrolment',
    subtitle="2016 Domestic"
)

add_pie(
    row=1, col=2,
    df=post_bacc_df_16_22[post_bacc_df_16_22['FY Start'] == 2022],
    label_col='Province/Territory',
    value_col='Domestic Enrolment',
    subtitle="2022 Domestic"
)

add_pie(
    row=2, col=1,
    df=post_bacc_df_16_22[post_bacc_df_16_22['FY Start'] == 2016],
    label_col='Province/Territory',
    value_col='International Enrolment',
    subtitle="2016 International"
)

add_pie(
    row=2, col=2,
    df=post_bacc_df_16_22[post_bacc_df_16_22['FY Start'] == 2022],
    label_col='Province/Territory',
    value_col='International Enrolment',
    subtitle="2022 International"
)

# Update overall layout
fig.update_layout(
    width=1000,
    height=800,
    title="Post-baccalaureate non-grad Program Enrolment – Domestic vs. International (16/17 and 22/23)",
    legend_title="Province/Territory"  # slices are provinces, not program types
)

fig.show()

As mentioned, this is a strange category because the domestic and international landscapes look entirely different. It seems to be a categorisation for graduate non-degree programs like certificates and diplomas, but Ontario does not use it in this way. The growth of Nova Scotia in this category, from 33 international students in 2016 to 2,325 in 2022, is also notable: this clearly was not a viable program before and rapidly became one. Given the work on international enrolment in the previous workbook, I lean towards this coming from Cape Breton University, which was very active in growing its international contingent.

#### 6. Health-Related residency program (very stable, closely tied to population in provinces)

In [246]:
print(f"Health Residency: \n{key_progs_prov[key_progs_prov['Program type'] == 'Health-related residency program']['Province/Territory'].value_counts()}\n\n")

Health Residency: 
Province/Territory
Alberta                      7
British Columbia             7
Manitoba                     7
Newfoundland and Labrador    7
Nova Scotia                  7
Ontario                      7
Quebec                       7
Saskatchewan                 7
Name: count, dtype: int64




All but PEI and NB appear here. This almost certainly reflects medical-school-related programming: the absent provinces may not have a medical school of their own.

In [247]:
key_progs_prov[key_progs_prov['Program type'] == 'Health-related residency program'].sort_values(by='Domestic Enrolment', ascending=False).head(20)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
22683,2019,Ontario,Ontario (total),Health-related residency program,"Total, credential type","Total, field of study",5706,1392,0
16582,2018,Ontario,Ontario (total),Health-related residency program,"Total, credential type","Total, field of study",5679,1383,0
28799,2020,Ontario,Ontario (total),Health-related residency program,"Total, credential type","Total, field of study",5592,1380,0
41195,2022,Ontario,Ontario (total),Health-related residency program,"Total, credential type","Total, field of study",5592,1707,0
34974,2021,Ontario,Ontario (total),Health-related residency program,"Total, credential type","Total, field of study",5553,1488,0
4343,2016,Ontario,Ontario (total),Health-related residency program,"Total, credential type","Total, field of study",5148,1224,0
10534,2017,Ontario,Ontario (total),Health-related residency program,"Total, credential type","Total, field of study",4908,1134,0
11534,2017,Quebec,Quebec (total),Health-related residency program,"Total, credential type","Total, field of study",3462,531,0
29757,2020,Quebec,Quebec (total),Health-related residency program,"Total, credential type","Total, field of study",3417,534,0
35954,2021,Quebec,Quebec (total),Health-related residency program,"Total, credential type","Total, field of study",3405,579,0


The breakdown follows population lines quite precisely. The exception is international enrolment, where BC is virtually non-existent, unlike the other major provinces.

In [248]:
health_res_df = key_progs_prov[key_progs_prov['Program type'] == 'Health-related residency program']

# compare 2016/17 to 2022/23 (reuse the filtered frame rather than repeating the filter)
health_res_df_16_22 = health_res_df[health_res_df['FY Start'].isin([2016, 2022])]

health_res_provs = health_res_df['Province/Territory'].unique()

In [249]:
colors = px.colors.qualitative.Plotly  # same color palette for consistency
color_map = {prov: colors[i % len(colors)] for i, prov in enumerate(health_res_provs)}

# Create a 2×2 domain (pie) subplot
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{'type': 'domain'}, {'type': 'domain'}],
           [{'type': 'domain'}, {'type': 'domain'}]],
    subplot_titles=[
        "2016 Domestic", "2022 Domestic",
        "2016 International", "2022 International"
    ]
)

# Helper function to add a donut to the figure
def add_pie(row, col, df, label_col, value_col, subtitle):
    labels = df[label_col]
    values = df[value_col]
    # Map each province to a consistent color
    slice_colors = [color_map[lbl] for lbl in labels]

    fig.add_trace(go.Pie(
        labels=labels,
        values=values,
        marker=dict(colors=slice_colors),
        hole=0.4,
        # Show percentage on slice, plus raw data on hover
        textinfo='percent',
        hovertemplate="%{label}<br>Enrolment: %{value}<extra></extra>"
    ), row=row, col=col)
    # Optional: rename the subplot title
    fig.layout.annotations[(row-1)*2 + (col-1)].text = subtitle

add_pie(
    row=1, col=1,
    df=health_res_df_16_22[health_res_df_16_22['FY Start'] == 2016],
    label_col='Province/Territory',
    value_col='Domestic Enrolment',
    subtitle="2016 Domestic"
)

add_pie(
    row=1, col=2,
    df=health_res_df_16_22[health_res_df_16_22['FY Start'] == 2022],
    label_col='Province/Territory',
    value_col='Domestic Enrolment',
    subtitle="2022 Domestic"
)

add_pie(
    row=2, col=1,
    df=health_res_df_16_22[health_res_df_16_22['FY Start'] == 2016],
    label_col='Province/Territory',
    value_col='International Enrolment',
    subtitle="2016 International"
)

add_pie(
    row=2, col=2,
    df=health_res_df_16_22[health_res_df_16_22['FY Start'] == 2022],
    label_col='Province/Territory',
    value_col='International Enrolment',
    subtitle="2022 International"
)

# Update overall layout
fig.update_layout(
    width=1000,
    height=800,
    title="Medical Residency Enrolment – Domestic vs. International (16/17 and 22/23)",
    legend_title="Province/Territory"  # slices are provinces, not program types
)

fig.show()
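
The four near-identical `add_pie(...)` calls above could be driven by a small loop. The sketch below builds only the panel layout (row, column, year, value column, subtitle) in plain Python. `panel_specs` is a hypothetical helper name that is not part of this notebook, and the commented usage assumes the `add_pie` helper and `health_res_df_16_22` frame defined above.

```python
from itertools import product


def panel_specs(years, value_cols):
    """Return one dict per subplot: rows follow value_cols, columns follow years."""
    specs = []
    for (row, value_col), (col, year) in product(
        enumerate(value_cols, start=1), enumerate(years, start=1)
    ):
        label = value_col.replace(" Enrolment", "")
        specs.append({
            "row": row,
            "col": col,
            "year": year,
            "value_col": value_col,
            "subtitle": f"{year} {label}",
        })
    return specs


specs = panel_specs([2016, 2022], ["Domestic Enrolment", "International Enrolment"])
for s in specs:
    print(s["row"], s["col"], s["subtitle"])

# Hypothetical usage with the notebook's own helper:
# for s in specs:
#     add_pie(row=s["row"], col=s["col"],
#             df=health_res_df_16_22[health_res_df_16_22["FY Start"] == s["year"]],
#             label_col="Province/Territory", value_col=s["value_col"],
#             subtitle=s["subtitle"])
```

The same loop would apply unchanged to the undergraduate, Master's and PhD frames, since only the dataframe differs between those cells.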

The domestic share of students in health residencies is remarkably similar to the [2021 Census](https://en.wikipedia.org/wiki/Population_of_Canada_by_province_and_territory) (a Wikipedia article that references the Census): each major province's share of enrolment is within a single percentage point of its share of the total Canadian population.
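
A minimal sketch of how that comparison could be made numerically. The population shares are the rough 2021 Census figures for the four largest provinces; the enrolment counts are hypothetical placeholders (not StatCan figures) and would be replaced by a province-level sum from `health_res_df_16_22`.

```python
import pandas as pd

# Rough 2021 Census population shares (%), four largest provinces only.
population_share = pd.Series({
    "Ontario": 39.0,
    "Quebec": 23.0,
    "British Columbia": 13.5,
    "Alberta": 11.5,
})

# HYPOTHETICAL enrolment counts, for illustration only.
enrolment = pd.Series({
    "Ontario": 3900,
    "Quebec": 2250,
    "British Columbia": 1400,
    "Alberta": 1150,
    "Other provinces": 1300,
})

# Share of total enrolment per province, in percent.
enrolment_share = (enrolment / enrolment.sum() * 100).round(1)

# Percentage-point gap; provinces without a population figure are dropped.
gap = (enrolment_share - population_share).dropna().round(1)
print(gap)
```

A gap within plus or minus one percentage point for every province is what "within a single percentage point" means here.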

Internationally, the picture is dominated by Ontario, but there isn't much movement beyond Quebec and Ontario each recruiting a few hundred more students.

#### 7. Undergraduate (Unremarkable, ON nearly half of both dom/intl)

Undergraduate degrees will be present in every province (at least they should be!)

In [250]:
key_progs_prov[key_progs_prov['Program type'] == 'Undergraduate program'].sort_values(by='Domestic Enrolment', ascending=False).head(20)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
35423,2021,Ontario,Ontario (total),Undergraduate program,"Total, credential type","Total, field of study",361893,69381,300
41649,2022,Ontario,Ontario (total),Undergraduate program,"Total, credential type","Total, field of study",361329,69291,180
29246,2020,Ontario,Ontario (total),Undergraduate program,"Total, credential type","Total, field of study",360882,66267,411
17049,2018,Ontario,Ontario (total),Undergraduate program,"Total, credential type","Total, field of study",355503,52650,429
23138,2019,Ontario,Ontario (total),Undergraduate program,"Total, credential type","Total, field of study",355137,60855,372
11014,2017,Ontario,Ontario (total),Undergraduate program,"Total, credential type","Total, field of study",354087,47133,426
4843,2016,Ontario,Ontario (total),Undergraduate program,"Total, credential type","Total, field of study",352596,41745,630
11843,2017,Quebec,Quebec (total),Undergraduate program,"Total, credential type","Total, field of study",139698,19071,0
5669,2016,Quebec,Quebec (total),Undergraduate program,"Total, credential type","Total, field of study",138996,17988,0
17882,2018,Quebec,Quebec (total),Undergraduate program,"Total, credential type","Total, field of study",136515,20289,0


No surprises along domestic lines: Ontario, followed by Quebec, followed by Alberta. The share of enrolment is a bit out of sync with the general population, though...

In [251]:
undergrad_df = key_progs_prov[key_progs_prov['Program type'] == 'Undergraduate program']

# compare 2016/17 to 2022/23
undergrad_df_16_22 = key_progs_prov[
    (key_progs_prov['Program type'] == 'Undergraduate program') & 
    (key_progs_prov['FY Start'].isin([2016, 2022])) 
    ]

undergrad_provs = key_progs_prov[key_progs_prov['Program type'] == 'Undergraduate program']['Province/Territory'].unique()

In [252]:
colors = px.colors.qualitative.Plotly  # same color palette for consistency
color_map = {prov: colors[i % len(colors)] for i, prov in enumerate(undergrad_provs)}

# Create a 2×2 domain (pie) subplot
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{'type': 'domain'}, {'type': 'domain'}],
           [{'type': 'domain'}, {'type': 'domain'}]],
    subplot_titles=[
        "2016 Domestic", "2022 Domestic",
        "2016 International", "2022 International"
    ]
)

# Helper function to add a donut to the figure
def add_pie(row, col, df, label_col, value_col, subtitle):
    labels = df[label_col]
    values = df[value_col]
    # Map each province to a consistent color
    slice_colors = [color_map[lbl] for lbl in labels]

    fig.add_trace(go.Pie(
        labels=labels,
        values=values,
        marker=dict(colors=slice_colors),
        hole=0.4,
        # Show percentage on slice, plus raw data on hover
        textinfo='percent',
        hovertemplate="%{label}<br>Enrolment: %{value}<extra></extra>"
    ), row=row, col=col)
    # Optional: rename the subplot title
    fig.layout.annotations[(row-1)*2 + (col-1)].text = subtitle

add_pie(
    row=1, col=1,
    df=undergrad_df_16_22[undergrad_df_16_22['FY Start'] == 2016],
    label_col='Province/Territory',
    value_col='Domestic Enrolment',
    subtitle="2016 Domestic"
)

add_pie(
    row=1, col=2,
    df=undergrad_df_16_22[undergrad_df_16_22['FY Start'] == 2022],
    label_col='Province/Territory',
    value_col='Domestic Enrolment',
    subtitle="2022 Domestic"
)

add_pie(
    row=2, col=1,
    df=undergrad_df_16_22[undergrad_df_16_22['FY Start'] == 2016],
    label_col='Province/Territory',
    value_col='International Enrolment',
    subtitle="2016 International"
)

add_pie(
    row=2, col=2,
    df=undergrad_df_16_22[undergrad_df_16_22['FY Start'] == 2022],
    label_col='Province/Territory',
    value_col='International Enrolment',
    subtitle="2022 International"
)

# Update overall layout
fig.update_layout(
    width=1000,
    height=800,
    title="Undergraduate Enrolment – Domestic vs. International (16/17 and 22/23)",
    legend_title="Province/Territory"  # slices are provinces, not program types
)

fig.show()

The domestic shape has changed very little, with AB and ON adding slightly whilst QC has declined somewhat.

In [253]:
# Create a stacked bar chart (2 bars for 2016 & 2022, stacked by province)
fig_dom = px.bar(
    undergrad_df_16_22,
    x='Domestic Enrolment',
    y='FY Start',
    color='Province/Territory',
    orientation='h',
    barmode='stack',
    title='Domestic Enrolment by Province/Territory – 2016 vs 2022'
)

fig_dom.update_layout(
    width=900,
    height=600,
    xaxis_title='Enrolment',
    yaxis_title=None,
    yaxis=dict(
        tickmode='array',
        tickvals=[2016, 2022],      # Only show 2016 & 2022
        ticktext=['2016', '2022']   # Labels for each tick
    )
)

fig_dom.show()

# ----- INTERNATIONAL CHART -----

# Create a stacked bar chart (again 2016 & 2022, stacked by province)
fig_int = px.bar(
    undergrad_df_16_22,
    x='International Enrolment',
    y='FY Start',
    color='Province/Territory',
    orientation='h',
    barmode='stack',
    title='International Enrolment by Province/Territory – 2016 vs 2022'
)

fig_int.update_layout(
    width=900,
    height=600,
    xaxis_title='Enrolment',
    yaxis_title=None,
    yaxis=dict(
        tickmode='array',
        tickvals=[2016, 2022],      # Only show 2016 & 2022
        ticktext=['2016', '2022']   # Labels for each tick
    )
)

fig_int.show()

The stacked bar charts are a nice way to show the growth in each segment alongside absolute growth. Contrast the stable domestic scene with the significant growth in the international base, though that growth is fairly well distributed among the provinces.
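
To put numbers on that contrast, here is a small sketch using only the Ontario undergraduate totals visible in the `In [250]` output above. The frame name `ontario_ug` is introduced here for illustration.

```python
import pandas as pd

# Ontario undergraduate totals as shown in the In [250] output.
ontario_ug = pd.DataFrame(
    {
        "Domestic Enrolment": [352596, 361329],
        "International Enrolment": [41745, 69291],
    },
    index=pd.Index([2016, 2022], name="FY Start"),
)

# Absolute and percentage change between the two fiscal years.
absolute_change = ontario_ug.loc[2022] - ontario_ug.loc[2016]
percent_change = (absolute_change / ontario_ug.loc[2016] * 100).round(1)

print(absolute_change)
print(percent_change)
```

For Ontario this gives roughly 2.5% domestic growth against roughly 66% international growth over the six years.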

#### 8. Master's (**Remarkably** stable distribution for both dom/int, all provinces)

In [254]:
key_progs_prov[key_progs_prov['Program type'] == 'Graduate program (second cycle)'].sort_values(by='Domestic Enrolment', ascending=False).head(20)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
34874,2021,Ontario,Ontario (total),Graduate program (second cycle),"Total, credential type","Total, field of study",33921,15654,0
28700,2020,Ontario,Ontario (total),Graduate program (second cycle),"Total, credential type","Total, field of study",33702,13323,0
22575,2019,Ontario,Ontario (total),Graduate program (second cycle),"Total, credential type","Total, field of study",32592,14271,0
16488,2018,Ontario,Ontario (total),Graduate program (second cycle),"Total, credential type","Total, field of study",32025,12795,0
41105,2022,Ontario,Ontario (total),Graduate program (second cycle),"Total, credential type","Total, field of study",31830,17973,0
10419,2017,Ontario,Ontario (total),Graduate program (second cycle),"Total, credential type","Total, field of study",31266,11517,0
4248,2016,Ontario,Ontario (total),Graduate program (second cycle),"Total, credential type","Total, field of study",30198,9996,0
29682,2020,Quebec,Quebec (total),Graduate program (second cycle),"Total, credential type","Total, field of study",23334,11961,12
35879,2021,Quebec,Quebec (total),Graduate program (second cycle),"Total, credential type","Total, field of study",22896,13497,0
42094,2022,Quebec,Quebec (total),Graduate program (second cycle),"Total, credential type","Total, field of study",22065,15810,0


In [255]:
masters = key_progs_prov[key_progs_prov['Program type'] == 'Graduate program (second cycle)']

# compare 2016/17 to 2022/23
masters_df_16_22 = key_progs_prov[
    (key_progs_prov['Program type'] == 'Graduate program (second cycle)') & 
    (key_progs_prov['FY Start'].isin([2016, 2022])) 
    ]

masters_provs = key_progs_prov[key_progs_prov['Program type'] == 'Graduate program (second cycle)']['Province/Territory'].unique()

In [256]:
colors = px.colors.qualitative.Plotly 
color_map = {prov: colors[i % len(colors)] for i, prov in enumerate(masters_provs)}

# Create a 2×2 domain (pie) subplot
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{'type': 'domain'}, {'type': 'domain'}],
           [{'type': 'domain'}, {'type': 'domain'}]],
    subplot_titles=[
        "2016 Domestic", "2022 Domestic",
        "2016 International", "2022 International"
    ]
)

# Helper function to add a donut to the figure
def add_pie(row, col, df, label_col, value_col, subtitle):
    labels = df[label_col]
    values = df[value_col]
    # Map each province to a consistent color
    slice_colors = [color_map[lbl] for lbl in labels]

    fig.add_trace(go.Pie(
        labels=labels,
        values=values,
        marker=dict(colors=slice_colors),
        hole=0.4,
        # Show percentage on slice, plus raw data on hover
        textinfo='percent',
        hovertemplate="%{label}<br>Enrolment: %{value}<extra></extra>"
    ), row=row, col=col)
    # Optional: rename the subplot title
    fig.layout.annotations[(row-1)*2 + (col-1)].text = subtitle

add_pie(
    row=1, col=1,
    df=masters_df_16_22[masters_df_16_22['FY Start'] == 2016],
    label_col='Province/Territory',
    value_col='Domestic Enrolment',
    subtitle="2016 Domestic"
)

add_pie(
    row=1, col=2,
    df=masters_df_16_22[masters_df_16_22['FY Start'] == 2022],
    label_col='Province/Territory',
    value_col='Domestic Enrolment',
    subtitle="2022 Domestic"
)

add_pie(
    row=2, col=1,
    df=masters_df_16_22[masters_df_16_22['FY Start'] == 2016],
    label_col='Province/Territory',
    value_col='International Enrolment',
    subtitle="2016 International"
)

add_pie(
    row=2, col=2,
    df=masters_df_16_22[masters_df_16_22['FY Start'] == 2022],
    label_col='Province/Territory',
    value_col='International Enrolment',
    subtitle="2022 International"
)

# Update overall layout
fig.update_layout(
    width=1000,
    height=800,
    title="Master's Degrees – Domestic vs. International (16/17 and 22/23)",
    legend_title="Province/Territory"  # slices are provinces, not program types
)

fig.show()

The distributions are incredibly stable, down to fractions of a percentage point in some cases. This is true of both domestic and international students, despite most of the provinces roughly doubling their international enrolment in that period. Because the provinces grew almost in tandem, the international picture has stayed the same. Overall, the growth in international Master's students is in the tens of thousands, and IRCC described it as stable and sustainable growth when it introduced the study permit cap in January 2024.

In [257]:
# Create a stacked bar chart (2 bars for 2016 & 2022, stacked by province)
fig_dom = px.bar(
    masters_df_16_22,
    x='Domestic Enrolment',
    y='FY Start',
    color='Province/Territory',
    orientation='h',
    barmode='stack',
    title='Domestic Enrolment by Province/Territory – 2016 vs 2022'
)

fig_dom.update_layout(
    width=900,
    height=600,
    xaxis_title='Enrolment',
    yaxis_title=None,
    yaxis=dict(
        tickmode='array',
        tickvals=[2016, 2022],      # Only show 2016 & 2022
        ticktext=['2016', '2022']   # Labels for each tick
    )
)

fig_dom.show()

# ----- INTERNATIONAL CHART -----

# Create a stacked bar chart (again 2016 & 2022, stacked by province)
fig_int = px.bar(
    masters_df_16_22,
    x='International Enrolment',
    y='FY Start',
    color='Province/Territory',
    orientation='h',
    barmode='stack',
    title='International Enrolment by Province/Territory – 2016 vs 2022'
)

fig_int.update_layout(
    width=900,
    height=600,
    xaxis_title='Enrolment',
    yaxis_title=None,
    yaxis=dict(
        tickmode='array',
        tickvals=[2016, 2022],      # Only show 2016 & 2022
        ticktext=['2016', '2022']   # Labels for each tick
    )
)

fig_int.show()

#### 9. PhD Degrees (Normal domestic but more international QC PhD than ON)

In [258]:
key_progs_prov[key_progs_prov['Program type'] == 'Graduate program (third cycle)'].sort_values(by='Domestic Enrolment', ascending=False).head(20)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
34937,2021,Ontario,Ontario (total),Graduate program (third cycle),"Total, credential type","Total, field of study",16539,7380,0
41165,2022,Ontario,Ontario (total),Graduate program (third cycle),"Total, credential type","Total, field of study",16497,7749,0
28762,2020,Ontario,Ontario (total),Graduate program (third cycle),"Total, credential type","Total, field of study",16209,6516,0
22647,2019,Ontario,Ontario (total),Graduate program (third cycle),"Total, credential type","Total, field of study",15627,6018,0
16548,2018,Ontario,Ontario (total),Graduate program (third cycle),"Total, credential type","Total, field of study",15318,5259,0
10499,2017,Ontario,Ontario (total),Graduate program (third cycle),"Total, credential type","Total, field of study",14958,5016,0
4308,2016,Ontario,Ontario (total),Graduate program (third cycle),"Total, credential type","Total, field of study",14757,4806,0
42159,2022,Quebec,Quebec (total),Graduate program (third cycle),"Total, credential type","Total, field of study",9663,9102,0
35944,2021,Quebec,Quebec (total),Graduate program (third cycle),"Total, credential type","Total, field of study",9609,8763,0
29748,2020,Quebec,Quebec (total),Graduate program (third cycle),"Total, credential type","Total, field of study",9459,8160,0


In [259]:
phd_df = key_progs_prov[key_progs_prov['Program type'] == 'Graduate program (third cycle)']

# compare 2016/17 to 2022/23
phd_df_16_22 = key_progs_prov[
    (key_progs_prov['Program type'] == 'Graduate program (third cycle)') & 
    (key_progs_prov['FY Start'].isin([2016, 2022])) 
    ]

phd_provs = key_progs_prov[key_progs_prov['Program type'] == 'Graduate program (third cycle)']['Province/Territory'].unique()

In [260]:
phd_provs

array(['Alberta', 'British Columbia', 'Manitoba', 'New Brunswick',
       'Newfoundland and Labrador', 'Nova Scotia', 'Ontario',
       'Prince Edward Island', 'Quebec', 'Saskatchewan'], dtype=object)

In [261]:
colors = px.colors.qualitative.Plotly  # same color palette for consistency
color_map = {prov: colors[i % len(colors)] for i, prov in enumerate(phd_provs)}

# Create a 2×2 domain (pie) subplot
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{'type': 'domain'}, {'type': 'domain'}],
           [{'type': 'domain'}, {'type': 'domain'}]],
    subplot_titles=[
        "2016 Domestic", "2022 Domestic",
        "2016 International", "2022 International"
    ]
)

# Helper function to add a donut to the figure
def add_pie(row, col, df, label_col, value_col, subtitle):
    labels = df[label_col]
    values = df[value_col]
    # Map each province to a consistent color
    slice_colors = [color_map[lbl] for lbl in labels]

    fig.add_trace(go.Pie(
        labels=labels,
        values=values,
        marker=dict(colors=slice_colors),
        hole=0.4,
        # Show percentage on slice, plus raw data on hover
        textinfo='percent',
        hovertemplate="%{label}<br>Enrolment: %{value}<extra></extra>"
    ), row=row, col=col)
    # Optional: rename the subplot title
    fig.layout.annotations[(row-1)*2 + (col-1)].text = subtitle

add_pie(
    row=1, col=1,
    df=phd_df_16_22[phd_df_16_22['FY Start'] == 2016],
    label_col='Province/Territory',
    value_col='Domestic Enrolment',
    subtitle="2016 Domestic"
)

add_pie(
    row=1, col=2,
    df=phd_df_16_22[phd_df_16_22['FY Start'] == 2022],
    label_col='Province/Territory',
    value_col='Domestic Enrolment',
    subtitle="2022 Domestic"
)

add_pie(
    row=2, col=1,
    df=phd_df_16_22[phd_df_16_22['FY Start'] == 2016],
    label_col='Province/Territory',
    value_col='International Enrolment',
    subtitle="2016 International"
)

add_pie(
    row=2, col=2,
    df=phd_df_16_22[phd_df_16_22['FY Start'] == 2022],
    label_col='Province/Territory',
    value_col='International Enrolment',
    subtitle="2022 International"
)

# Update overall layout
fig.update_layout(
    width=1000,
    height=800,
    title="PhD Degree enrolment – Domestic vs. International (16/17 and 22/23)",
    legend_title="Province/Territory"  # slices are provinces, not program types
)

fig.show()

In [262]:
# Create a stacked bar chart (2 bars for 2016 & 2022, stacked by province)
fig_dom = px.bar(
    phd_df_16_22,
    x='Domestic Enrolment',
    y='FY Start',
    color='Province/Territory',
    orientation='h',
    barmode='stack',
    title='Domestic Enrolment by Province/Territory – 2016 vs 2022'
)

fig_dom.update_layout(
    width=900,
    height=600,
    xaxis_title='Enrolment',
    yaxis_title=None,
    yaxis=dict(
        tickmode='array',
        tickvals=[2016, 2022],      # Only show 2016 & 2022
        ticktext=['2016', '2022']   # Labels for each tick
    )
)

fig_dom.show()

# ----- INTERNATIONAL CHART -----

# Create a stacked bar chart (again 2016 & 2022, stacked by province)
fig_int = px.bar(
    phd_df_16_22,
    x='International Enrolment',
    y='FY Start',
    color='Province/Territory',
    orientation='h',
    barmode='stack',
    title='International Enrolment by Province/Territory – 2016 vs 2022'
)

fig_int.update_layout(
    width=900,
    height=600,
    xaxis_title='Enrolment',
    yaxis_title=None,
    yaxis=dict(
        tickmode='array',
        tickvals=[2016, 2022],      # Only show 2016 & 2022
        ticktext=['2016', '2022']   # Labels for each tick
    )
)

fig_int.show()

Another strange picture emerges. Domestically, this closely resembles the provinces' population distribution, but for international enrolment there are more PhD students in Quebec than in Ontario, by some way. ON added roughly 2,900 international PhD students over the six-year period (4,806 to 7,749), yet QC still increased its share, reaching 9,102 by 2022.
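
A quick check of how international-heavy each province's PhD cohort is, using only the 2022 rows visible in the `In [258]` output above. The dictionary names are introduced here for illustration.

```python
# 2022 third-cycle (PhD) totals as shown in the In [258] output.
phd_2022 = {
    "Ontario": {"domestic": 16497, "international": 7749},
    "Quebec": {"domestic": 9663, "international": 9102},
}

# International students as a percentage of each province's PhD enrolment.
intl_share = {}
for province, counts in phd_2022.items():
    total = counts["domestic"] + counts["international"]
    intl_share[province] = round(counts["international"] / total * 100, 1)
    print(f"{province}: {intl_share[province]}% international")
```

On these figures, nearly half of Quebec's PhD enrolment is international, compared with about a third in Ontario.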

### Conclusion of Program type analysis

The analysis of the different program types can be framed with two questions:

1. Which programs' enrolment distributions resemble [Canada's actual population distribution](https://en.wikipedia.org/wiki/Population_of_Canada_by_province_and_territory)?
2. Which programs' enrolment distributions change most dramatically over the six-year period?
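
The second question can be made measurable with a single number per program type. The sketch below is one possible metric, not the method used in this notebook: half the sum of absolute changes in provincial share, in percentage points. The function names and the province counts are hypothetical.

```python
def shares(counts):
    """Convert raw counts per province into fractional shares."""
    total = sum(counts.values())
    return {province: n / total for province, n in counts.items()}


def share_shift(before, after):
    """Half the sum of absolute share changes, in percentage points."""
    s0, s1 = shares(before), shares(after)
    provinces = set(s0) | set(s1)
    shift = 50 * sum(abs(s1.get(p, 0.0) - s0.get(p, 0.0)) for p in provinces)
    return round(shift, 1)


# HYPOTHETICAL counts for three provinces, for illustration only.
before = {"A": 600, "B": 300, "C": 100}
doubled = {"A": 1200, "B": 600, "C": 200}   # every province doubles
skewed = {"A": 1500, "B": 300, "C": 200}    # growth concentrated in A

print(share_shift(before, doubled))
print(share_shift(before, skewed))
```

When every province grows by the same factor the metric is zero, which is the Master's pattern: large absolute growth with an unchanged distribution. Growth concentrated in one province produces a large value, which is the pattern seen in the career and technical programs.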

For the first question above, the medical residency programs were remarkably close to the provincial population shares, which are roughly 39% in ON, 23% in QC, 13.5% in BC, 11.5% in AB and below 4% for everybody else.

Master's and PhD degrees were also fairly close to the population distribution. Master's programs stood out more for being incredibly stable over the six-year period, despite adding about 30,000 students across the country, with growth in every major province.

In PhD programs, steady growth was seen in both domestic and international students, but surprisingly ON trails QC in the number of international PhD candidates.

Undergraduates, the biggest cohort by far, lean towards Ontario slightly more than population would suggest. This is true of both domestic and international students, but the international market is a little more volatile and has been captured more by Ontario: almost 50% of international UG students are there.

The most dramatic changes and imbalances are in the Career and Post-career Technical/Professional training programs, and ON simply blows everybody away here. There are some unusual quirks, such as post-baccalaureate programs in BC, which show similar but less aggressive changes than the technical programs.

**Career/Technical/Professional training programs:**

- These seem to contain the trades (confirm this in Field of Study later) but may also contain many diploma and certificate programs across other provinces. 

- Domestically, enrolment distribution by province does resemble the Canadian population at large, with ON at around 40% of enrolment and the rest below that. Ontario does lose around 35,000 domestic students in this category between 2016 and 2022 (down 30%), while other provinces hold steady or decline only slightly.

- Internationally, ON is absolutely dominant with nearly two-thirds of international students in Canada taking these programs in Ontario in 2016. This rises to nearly three-quarters by 2022 by adding a further 60,000 international students.

**Post Career/Technical/Professional training programs:**
- This is very much an Ontario thing, largely the Ontario Graduate Certificate. Domestically Ontario has around 80-85% of students in this category.
- Internationally, this went from 6,000 students in ON in 2016 to 40,000 in 2022, dwarfing the declining domestic enrolment in this program type.


It is interesting that whilst growth has occurred in the degree programs (UG, Masters and PhD) and these have largely reflected populations across the provinces, the career and technical/professional programs have been the Wild West in enrolment distributions and raw numerical growth.


## Credential type enrolment changes over the years

Based on the Program type analysis above, some program types will be dropped from the credential type analysis because their enrolment was found to be quite niche.

These are:
- Health-related residency program: Very stable and tied to population distribution
- Qualifying program for career, technical or pre-university: 10-12,000 students in QC but tiny elsewhere
- Pre-university program: Quebec only, mostly domestic students around 80k students

For the other program types, we should check the number of unique credential types so that unnecessary variables can be removed.

In [263]:
# keep just the key programs minus health residency, qualifying for career/technical and pre-university as above
progs_credentials = [
    'Career, technical or professional training program',
    'Graduate program (second cycle)',
    'Graduate program (third cycle)',
    'Post career, technical or professional training program',
    'Post-baccalaureate non-graduate program',
    'Undergraduate program'
 ]

In [264]:
# avoid double counting: drop the national 'Canada' rows and the credential-type totals, keep only the field-of-study total.
# The result is then checked against the national totals in the audit below.
creds_df = cleaned_progs[~(cleaned_progs['Province/Territory'] == 'Canada') &
                            (cleaned_progs['Program type'].isin(progs_credentials)) & 
                            ~(cleaned_progs['Credential type'] == 'Total, credential type') & 
                            (cleaned_progs['Field of study'] == 'Total, field of study')
                            ]

### Quick data audit

`creds_df` holds every program/credential combination by province; `ca_creds_df` holds the aggregated Canadian national figures. Their totals should match, since the former simply sums to the latter (minus the territories).

Spot check the year 2022 for equal enrolment.

In [265]:
# data audit - take the inverse with only the Canadian national totals
ca_creds_df = cleaned_progs[(cleaned_progs['Province/Territory'] == 'Canada') &
                            (cleaned_progs['Program type'].isin(progs_credentials)) & 
                            ~(cleaned_progs['Credential type'] == 'Total, credential type') & 
                            (cleaned_progs['Field of study'] == 'Total, field of study')
                            ]

Now sum the total domestic and international enrolments for the year 2022

In [266]:
creds_df[creds_df['FY Start'] == 2022]['Domestic Enrolment'].sum() + creds_df[creds_df['FY Start'] == 2022]['International Enrolment'].sum()

1567110

In [267]:
ca_creds_df[ca_creds_df['FY Start'] == 2022]['Domestic Enrolment'].sum() + ca_creds_df[ca_creds_df['FY Start'] == 2022]['International Enrolment'].sum()

1568505

The two dataframes give 1.567m and 1.568m for all enrolment combined. The CA national totals are higher by about 1,400 (1,395 exactly), which is a plausible figure for the territories' total enrolment.
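
The 2022 spot check could be extended to every year at once. The sketch below is one way to do that; `audit_by_year` and the toy frames are hypothetical and only mirror the column names used in this notebook.

```python
import pandas as pd

ENROL_COLS = ["Domestic Enrolment", "International Enrolment"]


def audit_by_year(provincial, national):
    """Per-year gap between national totals and the sum of provincial rows."""
    prov = provincial.groupby("FY Start")[ENROL_COLS].sum().sum(axis=1)
    nat = national.groupby("FY Start")[ENROL_COLS].sum().sum(axis=1)
    out = pd.DataFrame({"provinces": prov, "national": nat})
    out["gap"] = out["national"] - out["provinces"]
    return out


# HYPOTHETICAL toy frames, for illustration only.
toy_prov = pd.DataFrame({
    "FY Start": [2021, 2021, 2022, 2022],
    "Domestic Enrolment": [100, 50, 110, 55],
    "International Enrolment": [20, 10, 30, 15],
})
toy_nat = pd.DataFrame({
    "FY Start": [2021, 2022],
    "Domestic Enrolment": [152, 167],
    "International Enrolment": [30, 46],
})

report = audit_by_year(toy_prov, toy_nat)
print(report)
```

Called as `audit_by_year(creds_df, ca_creds_df)`, the 2022 row should reproduce the gap of 1,395 found above, and a consistently small positive gap in the other years would support the territories explanation.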

Next step of the audit: how many of these credential/program combinations are relevant, and how many are tiny anomalies with, e.g., 10-15 enrolments?

In [268]:
# get total enrolment for each credential type for each program type
prog_cred_enrolment_22 = creds_df[creds_df['FY Start'] == 2022].groupby(['Program type', 'Credential type'])[['Domestic Enrolment', 'International Enrolment']].sum()

# make a new column for total enrolment
prog_cred_enrolment_22['Int + Dom Enrolment'] = prog_cred_enrolment_22['Domestic Enrolment'] + prog_cred_enrolment_22['International Enrolment']

# make a column for the % that this segment's enrolment represents of the entire total
prog_cred_enrolment_22['% of All Enrolment'] = (prog_cred_enrolment_22['Int + Dom Enrolment'] / prog_cred_enrolment_22['Int + Dom Enrolment'].sum() * 100).round(2)

prog_cred_enrolment_22

Program type,Credential type,Domestic Enrolment,International Enrolment,Int + Dom Enrolment,% of All Enrolment
"Career, technical or professional training program",Certificate,40566,19335,59901,3.82
"Career, technical or professional training program",Diploma,232755,108639,341394,21.78
"Career, technical or professional training program","Not applicable, credential type",198,27,225,0.01
"Career, technical or professional training program",Other type of credential associated with a program,1560,540,2100,0.13
Graduate program (second cycle),Certificate,651,132,783,0.05
Graduate program (second cycle),Degree (includes applied degree),72711,47769,120480,7.69
Graduate program (second cycle),Diploma,2370,1236,3606,0.23
Graduate program (second cycle),"Not applicable, credential type",0,0,0,0.0
Graduate program (second cycle),Other type of credential associated with a program,48,9,57,0.0
Graduate program (third cycle),Certificate,6,9,15,0.0


General findings:

We can really see how dominant undergraduate degrees are in the overall PSI landscape.

Clear categories we can just drop because they are rounding errors (**< 0.1%**):

- Career/Technical:
    - Not applicable

- Graduate Program (second cycle):
    - N/A
    - Other
    - Certificate

- Graduate Program (third cycle) which should just be PhD:
    - Certificate
    - Diploma

- Post Career technical/professional training:
    - N/A
    - Other

- Post-bacc non-graduate program:
    - Certificate
    - Other

- Undergraduate program:
    - Other
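
The cut-off can be applied mechanically rather than by eye. The sketch below uses only the ten rows visible in the `In [268]` output above, with their `% of All Enrolment` values copied as shown; the names `THRESHOLD`, `kept` and `dropped` are introduced here for illustration.

```python
import pandas as pd

CAREER = "Career, technical or professional training program"
SECOND = "Graduate program (second cycle)"
THIRD = "Graduate program (third cycle)"
NA = "Not applicable, credential type"
OTHER = "Other type of credential associated with a program"

# (program type, credential type, % of all enrolment) from the In [268] output.
rows = [
    (CAREER, "Certificate", 3.82),
    (CAREER, "Diploma", 21.78),
    (CAREER, NA, 0.01),
    (CAREER, OTHER, 0.13),
    (SECOND, "Certificate", 0.05),
    (SECOND, "Degree (includes applied degree)", 7.69),
    (SECOND, "Diploma", 0.23),
    (SECOND, NA, 0.0),
    (SECOND, OTHER, 0.0),
    (THIRD, "Certificate", 0.0),
]
pct = pd.DataFrame(rows, columns=["Program type", "Credential type", "% of All Enrolment"])

THRESHOLD = 0.1  # percent of all enrolment; combinations below this are dropped

kept = pct[pct["% of All Enrolment"] >= THRESHOLD]
dropped = pct[pct["% of All Enrolment"] < THRESHOLD]

print(dropped[["Program type", "Credential type"]])
```

On the full table the equivalent filter would be `prog_cred_enrolment_22[prog_cred_enrolment_22['% of All Enrolment'] >= 0.1]`.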

### Examine different credentials within program types

The first thing to establish is which credential types are offered within each of the program types we have kept.

#### Career, technical or professional training program

In [269]:
creds_df[creds_df['Program type'] == 'Career, technical or professional training program']['Credential type'].unique()

array(['Certificate', 'Diploma',
       'Other type of credential associated with a program',
       'Not applicable, credential type'], dtype=object)

In [270]:
creds_df[creds_df['Program type'] == 'Career, technical or professional training program']['Credential type'].value_counts()

Credential type
Diploma                                               70
Certificate                                           63
Other type of credential associated with a program    30
Not applicable, credential type                       25
Name: count, dtype: int64

Quite a mix between Diplomas, certificates and other/NA, but which have the most enrolment?
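
One way to answer that directly is to total enrolment per credential type. The sketch below uses the 2022 career/technical figures from the `In [268]` output above (all provinces summed); the frame name `career_2022` is introduced here for illustration.

```python
import pandas as pd

# 2022 career, technical or professional training totals from the In [268] output.
career_2022 = pd.DataFrame({
    "Credential type": [
        "Certificate",
        "Diploma",
        "Not applicable, credential type",
        "Other type of credential associated with a program",
    ],
    "Domestic Enrolment": [40566, 232755, 198, 1560],
    "International Enrolment": [19335, 108639, 27, 540],
})

# Domestic plus international enrolment per credential type, largest first.
totals = (
    career_2022.set_index("Credential type")
    .sum(axis=1)
    .sort_values(ascending=False)
)
print(totals)
```

Diplomas dominate, with certificates a distant second and the other two credential types negligible.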

In [271]:
creds_df[creds_df['Program type'] == 'Career, technical or professional training program'].sort_values(by='Domestic Enrolment', ascending=False).head(10)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
10295,2017,Ontario,Ontario (total),"Career, technical or professional training pro...",Diploma,"Total, field of study",133008,38352,246
4126,2016,Ontario,Ontario (total),"Career, technical or professional training pro...",Diploma,"Total, field of study",130290,27486,474
16364,2018,Ontario,Ontario (total),"Career, technical or professional training pro...",Diploma,"Total, field of study",128379,50256,114
22447,2019,Ontario,Ontario (total),"Career, technical or professional training pro...",Diploma,"Total, field of study",124326,58572,306
28570,2020,Ontario,Ontario (total),"Career, technical or professional training pro...",Diploma,"Total, field of study",114930,56196,1041
34742,2021,Ontario,Ontario (total),"Career, technical or professional training pro...",Diploma,"Total, field of study",106086,58980,1008
40972,2022,Ontario,Ontario (total),"Career, technical or professional training pro...",Diploma,"Total, field of study",97485,78930,996
5142,2016,Quebec,Quebec (total),"Career, technical or professional training pro...",Diploma,"Total, field of study",81549,3366,0
11313,2017,Quebec,Quebec (total),"Career, technical or professional training pro...",Diploma,"Total, field of study",80706,3873,0
29540,2020,Quebec,Quebec (total),"Career, technical or professional training pro...",Diploma,"Total, field of study",79260,4656,0


#### Undergraduate Programs

#### Graduate (Second Cycle)

## Field of Study enrolment changes over the years

In [272]:
# keep only the national 'Canada' rows at the program-type and credential-type totals, and drop the field-of-study total, leaving one row per field of study per year
fields_df = cleaned_progs[(cleaned_progs['Institution Name'].str.contains('Canada')) &
                                  (cleaned_progs['Program type'] == 'Total, program type') & 
                                  (cleaned_progs['Credential type'] == 'Total, credential type') & 
                                  ~(cleaned_progs['Field of study'] == 'Total, field of study')
                                ]

In [273]:
fields_df

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
2094,2016,Canada,Canada (total),"Total, program type","Total, credential type",Agricultural and veterinary sciences/services/...,13350,2196,3
2095,2016,Canada,Canada (total),"Total, program type","Total, credential type",Architecture and related services [04.],13263,1560,0
2096,2016,Canada,Canada (total),"Total, program type","Total, credential type","Area, ethnic, cultural, gender, and group stud...",5886,798,0
2097,2016,Canada,Canada (total),"Total, program type","Total, credential type",Basic skills and general exam preparation (not...,4818,1680,66
2098,2016,Canada,Canada (total),"Total, program type","Total, credential type",Bible/Biblical studies [39.02],69,6,0
...,...,...,...,...,...,...,...,...,...
38906,2022,Canada,Canada (total),"Total, program type","Total, credential type",Security and protective services [43.],22605,1455,24
38907,2022,Canada,Canada (total),"Total, program type","Total, credential type",Social sciences [45.],64269,16899,6
38909,2022,Canada,Canada (total),"Total, program type","Total, credential type",Transportation and materials moving [49.],2727,399,6
38910,2022,Canada,Canada (total),"Total, program type","Total, credential type","Unclassified, field of study",4251,4626,24


First, check which Field of Study values appear and how often each one does.

In [274]:
fields_df['Field of study'].nunique()

49

In [275]:
fields_df['Field of study'].value_counts(ascending=True)

Field of study
Military technologies and applied sciences [29.]                                                   1
Liberal arts and sciences, general studies and humanities [24.]                                    7
Library science [25.]                                                                              7
Mathematics and statistics [27.]                                                                   7
Mechanic and repair technologies/technicians [47.]                                                 7
Medical residency/fellowship programs [61.]                                                        7
Military science, leadership and operational art [28.]                                             7
Multidisciplinary/interdisciplinary studies [30.]                                                  7
Natural resources and conservation [03.]                                                           7
Parks, recreation, leisure, fitness, and kinesiology [31.]                  

In [276]:
fields_df[fields_df['Field of study'] == "Military technologies and applied sciences [29.]"]

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
38894,2022,Canada,Canada (total),"Total, program type","Total, credential type",Military technologies and applied sciences [29.],9,0,0


Aside from this niche subject, all the CIP codes in the data appear consistently every year.
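The consistency claim above can also be checked directly rather than by reading the `value_counts` output. The sketch below uses a small toy frame (not the real `fields_df`) with the same `FY Start` and `Field of study` columns, and lists any field reported in fewer years than the data covers.

```python
import pandas as pd

# Toy stand-in for fields_df: three fiscal years, one field reported only once.
toy_fields = pd.DataFrame({
    "FY Start": [2020, 2021, 2022, 2020, 2021, 2022, 2022],
    "Field of study": [
        "Engineering [14.]", "Engineering [14.]", "Engineering [14.]",
        "Psychology [42.]", "Psychology [42.]", "Psychology [42.]",
        "Military technologies and applied sciences [29.]",
    ],
})

# Number of distinct years in the data, and distinct years per field
n_years = toy_fields["FY Start"].nunique()
years_per_field = toy_fields.groupby("Field of study")["FY Start"].nunique()

# Fields that are missing from at least one year
incomplete_fields = years_per_field[years_per_field < n_years]
print(incomplete_fields)
```

Run against the real `fields_df`, an empty result apart from the military technologies row would confirm the statement.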

We'll move on to enrolment figures in each of these subjects.

In [277]:
fields_df

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
2094,2016,Canada,Canada (total),"Total, program type","Total, credential type",Agricultural and veterinary sciences/services/...,13350,2196,3
2095,2016,Canada,Canada (total),"Total, program type","Total, credential type",Architecture and related services [04.],13263,1560,0
2096,2016,Canada,Canada (total),"Total, program type","Total, credential type","Area, ethnic, cultural, gender, and group stud...",5886,798,0
2097,2016,Canada,Canada (total),"Total, program type","Total, credential type",Basic skills and general exam preparation (not...,4818,1680,66
2098,2016,Canada,Canada (total),"Total, program type","Total, credential type",Bible/Biblical studies [39.02],69,6,0
...,...,...,...,...,...,...,...,...,...
38906,2022,Canada,Canada (total),"Total, program type","Total, credential type",Security and protective services [43.],22605,1455,24
38907,2022,Canada,Canada (total),"Total, program type","Total, credential type",Social sciences [45.],64269,16899,6
38909,2022,Canada,Canada (total),"Total, program type","Total, credential type",Transportation and materials moving [49.],2727,399,6
38910,2022,Canada,Canada (total),"Total, program type","Total, credential type","Unclassified, field of study",4251,4626,24


In [278]:
#snapshots of fields of study in 2016 and 2022
fields_2016 = fields_df[fields_df['FY Start'] == 2016]

fields_2022 = fields_df[fields_df['FY Start'] == 2022]

# sort fields_2016 by domestic enrolment, largest first
fields_2016.sort_values(by='Domestic Enrolment', ascending=False).head(20)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
2100,2016,Canada,Canada (total),"Total, program type","Total, credential type","Business, management, marketing and related su...",192753,60405,138
2112,2016,Canada,Canada (total),"Total, program type","Total, credential type",Health professions and related programs [51.],156207,6885,324
2120,2016,Canada,Canada (total),"Total, program type","Total, credential type","Liberal arts and sciences, general studies and...",123432,13374,69
2107,2016,Canada,Canada (total),"Total, program type","Total, credential type",Engineering [14.],79911,27261,0
2126,2016,Canada,Canada (total),"Total, program type","Total, credential type",Multidisciplinary/interdisciplinary studies [30.],71625,8871,24
2138,2016,Canada,Canada (total),"Total, program type","Total, credential type",Social sciences [45.],69543,12819,3
2099,2016,Canada,Canada (total),"Total, program type","Total, credential type",Biological and biomedical sciences [26.],60051,6831,0
2106,2016,Canada,Canada (total),"Total, program type","Total, credential type",Education [13.],57429,2508,3
2142,2016,Canada,Canada (total),"Total, program type","Total, credential type",Visual and performing arts [50.],56697,6096,159
2134,2016,Canada,Canada (total),"Total, program type","Total, credential type",Psychology [42.],43524,2754,3


Let's keep domestic and international apart for now as the raw numbers in 2016 are so far apart.
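To put a number on how far apart the two groups are, one option is the international share of total enrolment per field. The sketch below hard-codes five rows copied from the 2016 output above (only fields whose names are not truncated in the display); the column name `International share (%)` is my own label.

```python
import pandas as pd

# Five rows copied from the 2016 snapshot shown above
top_2016 = pd.DataFrame({
    "Field of study": [
        "Health professions and related programs [51.]",
        "Engineering [14.]",
        "Social sciences [45.]",
        "Education [13.]",
        "Psychology [42.]",
    ],
    "Domestic Enrolment": [156207, 79911, 69543, 57429, 43524],
    "International Enrolment": [6885, 27261, 12819, 2508, 2754],
}).set_index("Field of study")

# International students as a percentage of domestic + international
total = top_2016["Domestic Enrolment"] + top_2016["International Enrolment"]
top_2016["International share (%)"] = (
    top_2016["International Enrolment"] / total * 100
).round(1)

print(top_2016.sort_values("International share (%)", ascending=False))
```

Among these five fields, Engineering has by far the highest international share in 2016, while Health and Education sit near 4%.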

### Enrolment by field of study

In [279]:
# bar chart, horizontal, field of study on y axis, domestic enrolment on x axis, descending order
fig = px.bar(
    fields_2016.sort_values(by='Domestic Enrolment', ascending=True), # rather than pre-sorting, use the sort_values method within the plot
    x='Domestic Enrolment',
    y='Field of study',
    orientation='h',
    title='Fields of Study by Domestic Enrolment – 2016'
)

fig.update_layout(
    width=1500,
    height=1000, # allow more labels to fit
    xaxis_title='Enrolment',
    yaxis_title=None
)

fig.show()

In [280]:
# bar chart, horizontal, field of study on y axis, international enrolment on x axis, descending order
fig = px.bar(
    fields_2016.sort_values(by='International Enrolment', ascending=True), # rather than pre-sorting, use the sort_values method within the plot
    x='International Enrolment',
    y='Field of study',
    orientation='h',
    title='Fields of Study by International Enrolment – 2016'
)

fig.update_layout(
    width=1500,
    height=1000, # allow more labels to fit
    xaxis_title='Enrolment',
    yaxis_title=None
)

fig.show()

In [281]:
# bar chart, horizontal, field of study on y axis, domestic enrolment on x axis, descending order
fig = px.bar(
    fields_2022.sort_values(by='Domestic Enrolment', ascending=True), # rather than pre-sorting, use the sort_values method within the plot
    x='Domestic Enrolment',
    y='Field of study',
    orientation='h',
    title='Fields of Study by Domestic Enrolment – 2022'
)

fig.update_layout(
    width=1500,
    height=1000, # allow more labels to fit
    xaxis_title='Enrolment',
    yaxis_title=None
)

fig.show()

In [282]:
# bar chart, horizontal, field of study on y axis, international enrolment on x axis, descending order
fig = px.bar(
    fields_2022.sort_values(by='International Enrolment', ascending=True), # rather than pre-sorting, use the sort_values method within the plot
    x='International Enrolment',
    y='Field of study',
    orientation='h',
    title='Fields of Study by International Enrolment – 2022'
)

fig.update_layout(
    width=1500,
    height=1000, # allow more labels to fit
    xaxis_title='Enrolment',
    yaxis_title=None
)

fig.show()

### Visualising changes in Field of Study enrolment, 2016-2022

Plan:
- Look at 2016 and 2022 records for each Field of Study
- Plot the fields of study one after another, showing 2016 enrolment and 2022 side by side (see the delta in six years)
- plot domestic and international enrolment on separate figures
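Alongside the plots, the six-year delta in the plan can be computed as a table. This is a minimal sketch on toy data: the Social sciences figures are taken from the `fields_df` output above, the second row is a hypothetical field added only to show the shape, and the `change` / `pct_change` column names are my own.

```python
import pandas as pd

toy = pd.DataFrame({
    "FY Start": [2016, 2022, 2016, 2022],
    "Field of study": [
        "Social sciences [45.]", "Social sciences [45.]",
        "Example field (hypothetical)", "Example field (hypothetical)",
    ],
    "International Enrolment": [12819, 16899, 1000, 2500],
})

# One row per field, one column per year
wide = toy.pivot(index="Field of study", columns="FY Start",
                 values="International Enrolment")

# Absolute and percentage change between the two snapshots
wide["change"] = wide[2022] - wide[2016]
wide["pct_change"] = (wide["change"] / wide[2016] * 100).round(1)

print(wide.sort_values("pct_change", ascending=False))
```

Sorting by `pct_change` rather than raw enrolment highlights small fields that grew quickly, which the side-by-side bars can hide.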


In [283]:
fields_16_22 = fields_df[fields_df['FY Start'].isin([2016, 2022])]

In [284]:
fields_16_22.head(5)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
2094,2016,Canada,Canada (total),"Total, program type","Total, credential type",Agricultural and veterinary sciences/services/...,13350,2196,3
2095,2016,Canada,Canada (total),"Total, program type","Total, credential type",Architecture and related services [04.],13263,1560,0
2096,2016,Canada,Canada (total),"Total, program type","Total, credential type","Area, ethnic, cultural, gender, and group stud...",5886,798,0
2097,2016,Canada,Canada (total),"Total, program type","Total, credential type",Basic skills and general exam preparation (not...,4818,1680,66
2098,2016,Canada,Canada (total),"Total, program type","Total, credential type",Bible/Biblical studies [39.02],69,6,0


Pivot to a wide dataframe (one column per year) to plot

In [340]:
# Pivot to wide format => columns "2016", "2022"
df_pivot_dom = fields_16_22.pivot(index='Field of study', columns='FY Start', values='Domestic Enrolment').fillna(0)
df_pivot_dom.rename(columns={2016: '2016', 2022: '2022'}, inplace=True)

# Sort descending by 2016 to show largest bar at the top
df_pivot_dom.sort_values(by='2016', ascending=False, inplace=True)

# Plot wide-form data with side-by-side bars
fig = px.bar(
    df_pivot_dom,
    x=['2016', '2022'],    # two distinct bars
    y=df_pivot_dom.index,  # Field of study on the y-axis
    orientation='h',
    barmode='group',       # side-by-side (not stacked)
    title='Fields of Study by Domestic Enrolment – 2016 vs 2022',
    color_discrete_map={'2016': 'blue', '2022': 'red'}  # optional custom colors
)

# Reverse the y-axis to show the largest bar at the top
fig.update_yaxes(autorange='reversed')  
fig.update_layout(
    width=1500,
    height=1000,
    bargap=0.3,
    xaxis_title='Enrolment',
    yaxis_title=None
)

fig.show()

# International dataframe 

df_pivot_intl = fields_16_22.pivot(index='Field of study', columns='FY Start', values='International Enrolment').fillna(0)
df_pivot_intl.rename(columns={2016: '2016', 2022: '2022'}, inplace=True)

# Sort descending by 2016 to show largest bar at the top
df_pivot_intl.sort_values(by='2016', ascending=False, inplace=True)

fig = px.bar(
    df_pivot_intl,
    x=['2016', '2022'],        # two distinct bars
    y=df_pivot_intl.index,     # Field of study on the y-axis
    orientation='h',
    barmode='group',           # side-by-side (not stacked)
    title='Fields of Study by International Enrolment – 2016 vs 2022',
    color_discrete_map={'2016': 'blue', '2022': 'red'}  # optional custom colors
)

# Reverse the y-axis to show the largest bar at the top
fig.update_yaxes(autorange='reversed')  
fig.update_layout(
    width=1500,
    height=1000,
    bargap=0.3,
    xaxis_title='Enrolment',
    yaxis_title=None
)

fig.show()

Analysis:

# Below is pre Feb 26

### Data audit

Per the earlier analysis, confirm that this data covers only full-time students and otherwise uses the same filters as the enrolment data, so the two sets align.

# To do Feb 19 - explore programs data here, compare intl/dom enrolment, compare overall enrolment in categories
- EDA of the cleaned programs dataframe - which are the most popular programs, use the enrolment to determine which are the program categories of greatest relevance to revenue for schools.
- Use existing knowledge e.g. graduate certificate, graduate diploma programs growing to see which institutions invested in these.
- Any Program/Credential type categories which are esoteric or unknown, check their enrolment, if they are very small across the board we could remove them

### Audit enrolment numbers against `combined_df` and Canada-wide totals before analysis of program/credential level enrolment.

#### Individual Schools spot check

In [None]:
cleaned_program_df[(cleaned_program_df['Institution Name'].isin(['U of BC', 'York U', 'U of Alberta'])) 
                   & (cleaned_program_df['Program type'] == 'Total, program type') 
                   & (cleaned_program_df['Credential type'] == 'Total, credential type')][['FY Start', 'Institution Name', 'Domestic Enrolment', 'International Enrolment']]

Canadian Status,FY Start,Institution Name,Domestic Enrolment,International Enrolment
269,2022,U of Alberta,34653,8361
647,2022,U of BC,36735,14472
1781,2022,York U,36144,8697


In [None]:
combined_df[(combined_df['Institution Name'].isin(['U of BC', 'York U', 'U of Alberta']))
            & (combined_df['FY Start'] == 2022)][['FY Start', 'Institution Name', 'Domestic Enrolment', 'International Enrolment']]

Unnamed: 0,FY Start,Institution Name,Domestic Enrolment,International Enrolment
2859,2022,U of Alberta,34653,8361
2873,2022,U of BC,36735,14472
3181,2022,York U,36144,8697


All three institutions match `combined_df`, so the institution-level totals look trustworthy. The check matters because, as the analysis goes on, we may remove low-enrolment programs/credentials that contribute little to tuition fee revenue overall.

#### National/Provincial totals audited vs Canadian nationwide totals (avoid double counting total rows)

In the earlier sections analysing Canada-wide statistics before breaking into province/PSI level:

- Canada-wide total domestic enrolment in 2022/23 (FY start 2022) was **1,320,684 students**
- Canada-wide total international enrolment in 2022/23 (FY start 2022) was **421,008 students**

`cleaned_program_df` double-counts enrolment in several places: each provincial total has rows for total program type and total credential type on top of the rows for individual program and credential types (see cell below).

In [None]:
# Note the total, program type, total credential type rows which are double counting enrolment
cleaned_program_df.sort_values(by='Domestic Enrolment', ascending=False).head(5)

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
1551,2022,Ontario,Ontario (total),"Total, program type","Total, credential type",546672,236769
1547,2022,Ontario,Ontario (total),"Total, program type",Degree (includes applied degree),414537,93018
1557,2022,Ontario,Ontario (total),Undergraduate program,"Total, credential type",361329,69291
1553,2022,Ontario,Ontario (total),Undergraduate program,Degree (includes applied degree),359139,67233
2385,2022,Quebec,Quebec (total),"Total, program type","Total, credential type",344982,59814


This is just an extract showing the double counting. We want to show this data agrees with the previous set.

Let's reconcile by grabbing the provincial totals in one dataframe and all the individual institutions and programs in another. They are just two different methods of counting enrolment with varying degrees of granularity.

Below, `programs_prov_totals` will consist only of records of the provincial totals across all programs and credentials for each province.
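One detail worth noting when filtering on `(total)`: `Series.str.contains` treats its pattern as a regular expression by default, so the parentheses form a capture group instead of matching literal brackets. The sketch below, on a toy series with one hypothetical institution name, compares literal matching (`regex=False`), an escaped pattern, and the looser match that the unescaped pattern amounts to.

```python
import re
import pandas as pd

names = pd.Series([
    "Ontario (total)",
    "York U",
    "BC (total)",
    "Sub-total College (hypothetical)",  # contains "total" but not "(total)"
])

# Literal substring match: parentheses are ordinary characters
literal = names.str.contains("(total)", regex=False)

# Equivalent regex match with the parentheses escaped
escaped = names.str.contains(re.escape("(total)"))

# What the unescaped pattern "(total)" effectively matches: any "total"
loose = names.str.contains("total")

print(pd.DataFrame({"name": names, "literal": literal,
                    "escaped": escaped, "loose": loose}))
```

With the real institution names the loose and literal matches may well agree, but the literal form states the intent and avoids the match-group warning.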

In [None]:
# regex=False matches the literal text "(total)" instead of treating the parentheses as a regex group
programs_prov_totals = cleaned_program_df[(cleaned_program_df['Institution Name'].str.contains('(total)', regex=False)) & 
                                          (cleaned_program_df['Program type'] == 'Total, program type') & 
                                          (cleaned_program_df['Credential type'] == 'Total, credential type')
                                          ]



In [None]:
programs_prov_totals

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
20,2022,Alberta,Alberta (total),"Total, program type","Total, credential type",153537,29139
330,2022,British Columbia,BC (total),"Total, program type","Total, credential type",123987,53583
799,2022,Manitoba,Manitoba (total),"Total, program type","Total, credential type",38181,10563
925,2022,New Brunswick,New Brunswick (total),"Total, program type","Total, credential type",18150,5439
1045,2022,Newfoundland and Labrador,Newfoundland and Labrador (total),"Total, program type","Total, credential type",15288,4725
1147,2022,Nova Scotia,Nova Scotia (total),"Total, program type","Total, credential type",38049,11805
1551,2022,Ontario,Ontario (total),"Total, program type","Total, credential type",546672,236769
1815,2022,Prince Edward Island,Prince Edward Island (total),"Total, program type","Total, credential type",4992,2121
2385,2022,Quebec,Quebec (total),"Total, program type","Total, credential type",344982,59814
2594,2022,Saskatchewan,Saskatchewan (total),"Total, program type","Total, credential type",35247,7032


In [None]:
print(f"Total Domestic enrolment across all provinces combined is {programs_prov_totals['Domestic Enrolment'].sum()}, compared to earlier dataset of {canada_dom[canada_dom['FY Start'] == 2022]['Enrolment'].sum()}")
print(f"Total International enrolment across all provinces combined is {programs_prov_totals['International Enrolment'].sum()}, compared to earlier dataset of {canada_intl[canada_intl['FY Start'] == 2022]['Enrolment'].sum()}")

Total Domestic enrolment across all provinces combined is 1319085, compared to earlier dataset of 1320684
Total International enrolment across all provinces combined is 420990, compared to earlier dataset of 421008


**To audit properly, we will use the above comparison statement to get the new dataset as close as possible to the Canadian nationwide numbers we pulled at the very beginning of the project. Namely 1.320m Domestic Canadian students in 2022/23 and 421,008 international students.** 

At the provincial total level, it's a very close match! The small gap is likely just the territories, which we removed from the second dataset but not the first. These data are reconciled for all practical intents and purposes.
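For reference, the size of the gap can be worked out from the two printed lines above. The figures below are copied from that output; the dictionary names are my own.

```python
# Canada-wide totals from the earlier dataset (FY start 2022)
national = {"Domestic": 1_320_684, "International": 421_008}

# Sum of the ten provincial total rows in the programs dataset
provincial_sum = {"Domestic": 1_319_085, "International": 420_990}

gaps = {}
for status in national:
    gap = national[status] - provincial_sum[status]
    pct = gap / national[status] * 100
    gaps[status] = (gap, pct)
    print(f"{status}: missing {gap:,} students ({pct:.3f}% of the national total)")
```

Both gaps are well under 0.2% of the national totals, which is consistent with a few small jurisdictions being excluded.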


#### Schools Total

As for the individual credential/program dataframe, we need to be a bit more precise. 

- We should exclude the provincial totals which are double counting the enrolment at the individual schools.
- The individual institutions will also be aggregating the total program/credential type on top of the individual programs and credentials.

Doing this step by step, we have province level totals reconciled, now reconcile institution totals across program/credential:

In [None]:
# regex=False matches the literal text "(total)" instead of treating the parentheses as a regex group
programs_no_totals = cleaned_program_df[~(cleaned_program_df['Institution Name'].str.contains('(total)', regex=False)) & 
                                        (cleaned_program_df['Program type'] == 'Total, program type') & 
                                        (cleaned_program_df['Credential type'] == 'Total, credential type')
                                        ]



In [None]:
programs_no_totals['Domestic Enrolment'].sum()

1319103

In [None]:
programs_no_totals.sort_values(by='Domestic Enrolment', ascending=False)

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
1715,2022,Ontario,U of Toronto,"Total, program type","Total, credential type",62826,26856
2434,2022,Quebec,U de Montréal,"Total, program type","Total, credential type",38646,11211
647,2022,British Columbia,U of BC,"Total, program type","Total, credential type",36735,14472
1781,2022,Ontario,York U,"Total, program type","Total, credential type",36144,8697
1756,2022,Ontario,Western U,"Total, program type","Total, credential type",34998,5829
...,...,...,...,...,...,...,...
2570,2022,Saskatchewan,Parkland Regional College,"Total, program type","Total, credential type",21,0
1069,2022,Nova Scotia,Atlantic School of Theology,"Total, program type","Total, credential type",18,6
1674,2022,Ontario,U de l'Ontario français,"Total, program type","Total, credential type",15,81
2566,2022,Saskatchewan,Great Plains College,"Total, program type","Total, credential type",12,0


# to-do Feb 24 - try and reconcile the 72k domestic enrolment and 20k international enrolment missing in this data

# use the Province (total) records to reconcile 1.32m as above, and then split them off from the main dataframe!!

# then you can compare province (total) dataframe for total int'l students and the individual records combined

1.25 million is off by 72,000 from 1.32m. May be due to dropping the territories and some other records.

For international enrolment (target is 421,008):

In [None]:
programs_no_totals['International Enrolment'].sum()

398694

In [None]:
programs_no_totals.sort_values(by='Domestic Enrolment', ascending=False).head(50)

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
1717,2022,Ontario,U of Toronto,Undergraduate program,Degree (includes applied degree),44616,19845
1783,2022,Ontario,York U,Undergraduate program,Degree (includes applied degree),32862,7017
270,2022,Alberta,U of Alberta,Undergraduate program,Degree (includes applied degree),28686,4560
1730,2022,Ontario,U of Waterloo,Undergraduate program,Degree (includes applied degree),27897,5973
1650,2022,Ontario,Toronto Metropolitan U,Undergraduate program,Degree (includes applied degree),27450,2985
1757,2022,Ontario,Western U,Undergraduate program,Degree (includes applied degree),27042,3702
648,2022,British Columbia,U of BC,Undergraduate program,Degree (includes applied degree),26601,9915
1702,2022,Ontario,U of Ottawa-U d'Ottawa,Undergraduate program,Degree (includes applied degree),26307,6327
1463,2022,Ontario,McMaster U,Undergraduate program,Degree (includes applied degree),25518,4530
2436,2022,Quebec,U de Montréal,Undergraduate program,Degree (includes applied degree),24579,4086


### EDA - Key Questions to answer and important program grey areas

[Notes from StatCan on classification of programs](https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=1252482)

1. Where is the most significant enrolment for international students going? Undergrad? Graduate 2nd Cycle --> degree or certificate/diploma?

2. Any program/credential types e.g. pre-university that are unanimously low enrolment so they can be removed? Growth in particular program areas?

3. Any province-specific or program-specific quirks to be mindful of?
    - **Code 5 in Ontario, Graduate certificates** are in Post Career, technical or professional training program. Is this true of other provinces?
    - **Code 6 from notes Pre-university programs** include university-stream programs at Colleges and CEGEPs in Quebec - are other provinces using this as university-transfer stream programs at college?
    - **Code 9 Post-baccalaureate non-graduate program** includes programs that require a bachelor's degree for admission but can also capture programs at undergrad level which complete at a level beyond bachelor's degree, due to depth of learning (e.g. LLB or MD Degree). **Some flexibility** in how provinces can choose to report these professional degree programs. B.Ed are either here or undergraduate degrees depending on whether they're considered post-degree in outcome.




In [None]:
cleaned_program_df['Program type'].value_counts()

Program type
Total, program type                                        974
Career, technical or professional training program         420
Undergraduate program                                      414
Graduate program (second cycle)                            241
Pre-university program                                     170
Post-baccalaureate non-graduate program                    160
Graduate program (third cycle)                             139
Post career, technical or professional training program    136
Name: count, dtype: int64

In [None]:
cleaned_program_df['Credential type'].value_counts()

Credential type
Total, credential type                                891
Diploma                                               558
Degree (includes applied degree)                      463
Certificate                                           377
Not applicable, credential type                       267
Other type of credential associated with a program     58
Associate degree                                       40
Name: count, dtype: int64

The discrepancy between the Total rows (for both Credential type and Program type) and the individual program/credential types probably arises because the totals include categories that aren't present in the uploaded data. For example, there were numerous program types, such as *Basic Education & Skills* (high school) and Health-related residency program, that I did not import, along with several non-program categories. These were generally quite small in number and focused on domestic students, so they wouldn't affect revenue in the same way as the programs listed above.
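If the explanation above holds, the enrolment in the categories that were not imported should show up as a residual between each reported total and the sum of the imported program types. The sketch below illustrates the check on toy data for a single hypothetical institution; the numbers are invented and only the column names follow `cleaned_program_df`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Institution Name": ["Example College (hypothetical)"] * 4,
    "Program type": [
        "Total, program type",
        "Undergraduate program",
        "Career, technical or professional training program",
        "Pre-university program",
    ],
    "Credential type": ["Total, credential type"] * 4,
    "Domestic Enrolment": [1000, 600, 300, 60],  # invented figures
})

is_total = toy["Program type"] == "Total, program type"

# Reported total per institution vs the sum of its imported program types
reported_total = toy[is_total].set_index("Institution Name")["Domestic Enrolment"]
component_sum = toy[~is_total].groupby("Institution Name")["Domestic Enrolment"].sum()

# Enrolment sitting in categories that were not imported
residual = reported_total - component_sum
print(residual)
```

A small positive residual supports the explanation; a large or negative one would point to a filtering problem instead.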

## Exploring credential types and their enrolment

Before looking directly at enrolment this section's goal is to explore the confluence of Program type / Credential type across the provinces, as the StatCan source shows there are different interpretations of the categories by province. We should try not to make an apples to oranges comparison.

We'll start by doing a pie chart to see the breakdown of enrolment in different program types - what programs are international and domestic students joining?

In [None]:
# Filter out "total, program type" and "total, credential type" as they'll be double counting enrolment otherwise
pie_df_programs = cleaned_program_df[
    (cleaned_program_df['Program type'] != 'Total, program type') &
    (cleaned_program_df['Credential type'] != 'Total, credential type')
]

# Aggregate Domestic and International enrolment by Program type
domestic_agg = pie_df_programs.groupby('Program type', as_index=False)['Domestic Enrolment'].sum()
international_agg = pie_df_programs.groupby('Program type', as_index=False)['International Enrolment'].sum()

In [None]:
pie_df_programs['Domestic Enrolment'].sum()

2502237

In [None]:
domestic_agg

Unnamed: 0,Program type,Domestic Enrolment
0,"Career, technical or professional training pro...",550206
1,Graduate program (second cycle),151569
2,Graduate program (third cycle),70815
3,"Post career, technical or professional trainin...",16449
4,Post-baccalaureate non-graduate program,27921
5,Pre-university program,162819
6,Undergraduate program,1522458


In [None]:
# pie chart of the domestic enrolment and international enrolment from the domestic_agg numbers and international_agg numbers
fig_dom = px.pie(
    domestic_agg,
    names='Program type',
    values='Domestic Enrolment',
    title='Domestic Enrolment by Program Type (2022-23)',
    hover_data=['Domestic Enrolment'],
    template='plotly'
)
fig_dom.show()

fig_intl = px.pie(
    international_agg,
    names='Program type',
    values='International Enrolment',
    title='International Enrolment by Program Type (2022-23)',
    hover_data=['International Enrolment'],
    template='plotly'
)
fig_intl.show()

Important to remember with credential reporting:
- Diplomas are reported under different program types in different provinces, e.g. in Ontario, Graduate program (second cycle) is almost entirely degrees (presumably master's), but Quebec reports about 2,000 domestic diploma enrolments (2,160) under Graduate program (second cycle).
- If we are exploring any hypothesis of certificates and diplomas being the driver of international enrolment, we need to know exactly how they're reported in each different province.

See the below for ON, QC and BC totals - 

In [None]:
cleaned_program_df[(cleaned_program_df['Institution Name'].isin(['Ontario (total)', 'Quebec (total)', 'BC (total)'])) & 
                   (cleaned_program_df['Program type'] == 'Graduate program (second cycle)')
                   ].sort_values(by='Domestic Enrolment', ascending=False)

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
1534,2022,Ontario,Ontario (total),Graduate program (second cycle),"Total, credential type",31830,17973
1531,2022,Ontario,Ontario (total),Graduate program (second cycle),Degree (includes applied degree),31716,17946
2374,2022,Quebec,Quebec (total),Graduate program (second cycle),"Total, credential type",22065,15810
2371,2022,Quebec,Quebec (total),Graduate program (second cycle),Degree (includes applied degree),19737,14478
313,2022,British Columbia,BC (total),Graduate program (second cycle),"Total, credential type",7638,5268
311,2022,British Columbia,BC (total),Graduate program (second cycle),Degree (includes applied degree),7530,4980
2372,2022,Quebec,Quebec (total),Graduate program (second cycle),Diploma,2160,915
310,2022,British Columbia,BC (total),Graduate program (second cycle),Certificate,78,102
1532,2022,Ontario,Ontario (total),Graduate program (second cycle),Diploma,66,18
1533,2022,Ontario,Ontario (total),Graduate program (second cycle),Other type of credential associated with a pro...,48,9


As we can see above this program type is dominated by degrees in all three provinces but a diploma at this level also exists. It may be analogous to the Pre-University diploma offered in Quebec, but for master's programs. 
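To quantify "dominated by degrees", the degree rows can be divided by the credential totals. The sketch below hard-codes the domestic figures from the table above rather than querying `cleaned_program_df`; `degree_share` is my own name.

```python
import pandas as pd

# Domestic enrolment, Graduate program (second cycle), copied from the output above
grad = pd.DataFrame({
    "Institution Name": [
        "Ontario (total)", "Ontario (total)",
        "Quebec (total)", "Quebec (total)",
        "BC (total)", "BC (total)",
    ],
    "Credential type": [
        "Total, credential type", "Degree (includes applied degree)",
    ] * 3,
    "Domestic Enrolment": [31830, 31716, 22065, 19737, 7638, 7530],
})

# One row per province, one column per credential type
wide = grad.pivot(index="Institution Name", columns="Credential type",
                  values="Domestic Enrolment")

# Degrees as a percentage of all second-cycle domestic enrolment
degree_share = (
    wide["Degree (includes applied degree)"] / wide["Total, credential type"] * 100
).round(1)
print(degree_share)
```

Ontario and BC come out at roughly 99% degrees, while Quebec is closer to 89%, with the difference largely explained by the diploma row.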

Checking diplomas below:

In [None]:
cleaned_program_df[(cleaned_program_df['Institution Name'].isin(['Ontario (total)', 'Quebec (total)', 'BC (total)', 'Alberta (total)'])) &
                   (cleaned_program_df['Credential type'] == 'Diploma') &
                    (cleaned_program_df['Domestic Enrolment'] > 50) # a few niche programs with very small enrolment hidden
                   ].sort_values(by='Domestic Enrolment', ascending=False)

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
2383,2022,Quebec,Quebec (total),"Total, program type",Diploma,162006,8619
1548,2022,Ontario,Ontario (total),"Total, program type",Diploma,98679,79035
1527,2022,Ontario,Ontario (total),"Career, technical or professional training pro...",Diploma,97485,78930
2378,2022,Quebec,Quebec (total),Pre-university program,Diploma,79833,1176
2368,2022,Quebec,Quebec (total),"Career, technical or professional training pro...",Diploma,76053,5757
18,2022,Alberta,Alberta (total),"Total, program type",Diploma,25488,9324
1,2022,Alberta,Alberta (total),"Career, technical or professional training pro...",Diploma,25389,9315
327,2022,British Columbia,BC (total),"Total, program type",Diploma,16857,16776
307,2022,British Columbia,BC (total),"Career, technical or professional training pro...",Diploma,16212,9765
2372,2022,Quebec,Quebec (total),Graduate program (second cycle),Diploma,2160,915


Pre-University programs are [unique programs in the Quebec schooling system](https://www.cegepsquebec.ca/en/cegeps/presentation/systeme-scolaire-quebecois/) that prepare students for Undergraduate University programs. 

As the numbers show, they are highly focused on domestic students coming from the Quebec secondary school system. They account for nearly half of all Quebec diplomas, with almost the entire other half coming from Career, technical or professional training programs, as in Ontario.

This career, technical or professional category seems the most appropriate one for comparing diplomas across provinces.
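As a first comparison within that category, the sketch below computes the international share of diploma enrolment per province. The (domestic, international) pairs are copied from the Career, technical or professional training program diploma rows in the table above; the variable names are my own.

```python
# (domestic, international) diploma enrolment, FY start 2022,
# Career, technical or professional training program rows from the table above
career_diplomas = {
    "Ontario": (97485, 78930),
    "Quebec": (76053, 5757),
    "Alberta": (25389, 9315),
    "British Columbia": (16212, 9765),
}

intl_share = {}
for province, (domestic, international) in career_diplomas.items():
    # International students as a percentage of domestic + international
    intl_share[province] = international / (domestic + international) * 100

# Print provinces from highest to lowest international share
for province, share in sorted(intl_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{province}: {share:.1f}% international")
```

On these figures Ontario's career diplomas are about 45% international, against roughly 7% in Quebec, which is the kind of provincial contrast this section is trying to surface.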

In [None]:
# box plot of program types and their domestic enrolment, excluding Total, program type

fig = px.box(
    cleaned_program_df[
        cleaned_program_df['Program type'] != 'Total, program type'
    ],
    x='Program type',
    y='Domestic Enrolment',
    color='Program type',
    hover_data=['Province/Territory', 'Institution Name'],
    points='outliers',
    title='Domestic Enrolment by Program Type (Excluding Total)',
    template='plotly'
)

fig.show()

**Associate Degree** is the smallest category - is it because only a few provinces use it?


In [None]:
# find Associate degree and list the provinces it is used in
cleaned_program_df[cleaned_program_df['Credential type'] == 'Associate degree']['Province/Territory'].unique()

# cleaned_program_df[(cleaned_program_df['Credential type'] == 'Associate degree') ]

array(['British Columbia', 'Manitoba'], dtype=object)

Only BC and Manitoba seem to offer Associate degrees. This is a good opportunity to do some more data auditing.

In [None]:
cleaned_program_df[((cleaned_program_df['Credential type'] == 'Associate degree') & (cleaned_program_df['Program type'] == 'Total, program type'))]

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
324,2022,British Columbia,BC (total),"Total, program type",Associate degree,4725,3603
362,2022,British Columbia,Camosun College,"Total, program type",Associate degree,351,102
378,2022,British Columbia,Capilano U,"Total, program type",Associate degree,264,534
391,2022,British Columbia,Coast Mountain College,"Total, program type",Associate degree,15,42
408,2022,British Columbia,College of New Caledonia,"Total, program type",Associate degree,75,207
423,2022,British Columbia,College of the Rockies,"Total, program type",Associate degree,48,15
441,2022,British Columbia,Douglas College,"Total, program type",Associate degree,1296,438
480,2022,British Columbia,Kwantlen Polytechnic U,"Total, program type",Associate degree,18,72
495,2022,British Columbia,Langara College,"Total, program type",Associate degree,1575,1503
506,2022,British Columbia,Nicola Valley Institute of Technology,"Total, program type",Associate degree,54,0


### Data audit of enrolment records against credential/programs

In the last cell, we can see 12 people enrolled in Associate degree programs, but the individual school entries for Associate degrees sum to only 9.

We can audit this programmatically with BC and the other credential types.

In [None]:
cleaned_program_df[(cleaned_program_df['Institution Name'] == 'BC (total)') & 
                   (cleaned_program_df['Program type'] == 'Total, program type') & 
                   (cleaned_program_df['Credential type'] == 'Total, credential type')]

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
330,2022,British Columbia,BC (total),"Total, program type","Total, credential type",123987,53583


In [None]:
# Compare the BC total record with the sum of its individual credential records (domestic enrolment, 2022)

bc_all_programs = ((cleaned_program_df['Province/Territory'] == 'British Columbia') &
                   (cleaned_program_df['Institution Name'] == 'BC (total)') &
                   (cleaned_program_df['Program type'] == 'Total, program type'))
is_total_credential = cleaned_program_df['Credential type'] == 'Total, credential type'

# Compute the sums first; multi-line expressions inside f-strings need Python 3.12+
total_record = cleaned_program_df.loc[bc_all_programs & is_total_credential, 'Domestic Enrolment'].sum()
credential_sum = cleaned_program_df.loc[bc_all_programs & ~is_total_credential, 'Domestic Enrolment'].sum()

print(f"Total domestic enrolment across BC (All Programs & All Credentials record row): {total_record}\n")
print(f"Total domestic enrolment across BC (sum of individual credentials records): {credential_sum}")

Total domestic enrolment across BC (All Programs & All Credentials record row): 123987

Total domestic enrolment across BC (sum of individual credentials records): 123822


The cell above sums domestic enrolment over the individual credential-type records (every record in the extract in the cell below except the first) and compares it with the 'Total, program type' / 'Total, credential type' record, which is the top record of that extract.

There is a discrepancy of 165 domestic students. Against roughly 124,000 students, this is about 0.13%.

I'll now do the same with the international students:

In [None]:
cleaned_program_df[(cleaned_program_df['Province/Territory'] == 'British Columbia') & 
                    (cleaned_program_df['Institution Name'] == 'BC (total)') &
                    (cleaned_program_df['Program type'] == 'Total, program type')].sort_values(by='Domestic Enrolment', ascending=False)

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
330,2022,British Columbia,BC (total),"Total, program type","Total, credential type",123987,53583
326,2022,British Columbia,BC (total),"Total, program type",Degree (includes applied degree),84054,26790
327,2022,British Columbia,BC (total),"Total, program type",Diploma,16857,16776
325,2022,British Columbia,BC (total),"Total, program type",Certificate,8589,1107
328,2022,British Columbia,BC (total),"Total, program type","Not applicable, credential type",6753,5055
324,2022,British Columbia,BC (total),"Total, program type",Associate degree,4725,3603
329,2022,British Columbia,BC (total),"Total, program type",Other type of credential associated with a pro...,2844,249


In [None]:
# Same comparison for BC international enrolment
bc_all_programs = ((cleaned_program_df['Province/Territory'] == 'British Columbia') &
                   (cleaned_program_df['Institution Name'] == 'BC (total)') &
                   (cleaned_program_df['Program type'] == 'Total, program type'))
is_total_credential = cleaned_program_df['Credential type'] == 'Total, credential type'

# Compute the sums first; multi-line expressions inside f-strings need Python 3.12+
total_record = cleaned_program_df.loc[bc_all_programs & is_total_credential, 'International Enrolment'].sum()
credential_sum = cleaned_program_df.loc[bc_all_programs & ~is_total_credential, 'International Enrolment'].sum()

print(f"Total international enrolment across BC (All Programs & All Credentials record row): {total_record}")
print(f"Total international enrolment across BC (sum of individual credentials records): {credential_sum}")

Total international enrolment across BC (All Programs & All Credentials record row): 53583
Total international enrolment across BC (sum of individual credentials records): 53580


This discrepancy is 3 out of roughly 53,600 students, which is not even 0.01%.

Let's try another province:

In [None]:
# Same comparison for Alberta international enrolment
ab_all_programs = ((cleaned_program_df['Province/Territory'] == 'Alberta') &
                   (cleaned_program_df['Institution Name'] == 'Alberta (total)') &
                   (cleaned_program_df['Program type'] == 'Total, program type'))
is_total_credential = cleaned_program_df['Credential type'] == 'Total, credential type'

# Compute the sums first; multi-line expressions inside f-strings need Python 3.12+
total_record = cleaned_program_df.loc[ab_all_programs & is_total_credential, 'International Enrolment'].sum()
credential_sum = cleaned_program_df.loc[ab_all_programs & ~is_total_credential, 'International Enrolment'].sum()

print(f"Total international enrolment across AB (All Programs & All Credentials record row): {total_record}")
print(f"Total international enrolment across AB (sum of individual credentials records): {credential_sum}")

Total international enrolment across AB (All Programs & All Credentials record row): 29139
Total international enrolment across AB (sum of individual credentials records): 29136


In [None]:
# Same comparison for Alberta domestic enrolment
ab_all_programs = ((cleaned_program_df['Province/Territory'] == 'Alberta') &
                   (cleaned_program_df['Institution Name'] == 'Alberta (total)') &
                   (cleaned_program_df['Program type'] == 'Total, program type'))
is_total_credential = cleaned_program_df['Credential type'] == 'Total, credential type'

# Compute the sums first; multi-line expressions inside f-strings need Python 3.12+
total_record = cleaned_program_df.loc[ab_all_programs & is_total_credential, 'Domestic Enrolment'].sum()
credential_sum = cleaned_program_df.loc[ab_all_programs & ~is_total_credential, 'Domestic Enrolment'].sum()

print(f"Total domestic enrolment across AB (All Programs & All Credentials record row): {total_record}")
print(f"Total domestic enrolment across AB (sum of individual credentials records): {credential_sum}")

Total domestic enrolment across AB (All Programs & All Credentials record row): 153537
Total domestic enrolment across AB (sum of individual credentials records): 153543


Both Alberta totals are within single digits of the summed records. Against roughly 29k and 153k students, these discrepancies are well under 0.1%, which is acceptable for this analysis.
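
The same comparison could be wrapped in a small helper so that any province total can be audited in one call. This is a sketch only: `audit_credential_totals` is a hypothetical name, not an existing function in this notebook, and the sample frame below holds the BC rows copied by hand from the extract above (truncated labels kept as displayed) so the example runs on its own.

```python
import pandas as pd

def audit_credential_totals(df, institution,
                            columns=("Domestic Enrolment", "International Enrolment")):
    """Compare the 'Total, credential type' row with the sum of the other credential rows."""
    rows = df[
        (df["Institution Name"] == institution)
        & (df["Program type"] == "Total, program type")
    ]
    is_total = rows["Credential type"] == "Total, credential type"
    reported = rows.loc[is_total, list(columns)].sum()
    summed = rows.loc[~is_total, list(columns)].sum()
    result = pd.DataFrame({"reported": reported, "summed": summed})
    result["difference"] = result["reported"] - result["summed"]
    result["difference %"] = (result["difference"] / result["reported"] * 100).round(3)
    return result

# BC (total) rows for 2022, copied from the extract above
bc = pd.DataFrame({
    "Institution Name": ["BC (total)"] * 7,
    "Program type": ["Total, program type"] * 7,
    "Credential type": [
        "Total, credential type",
        "Degree (includes applied degree)",
        "Diploma",
        "Certificate",
        "Not applicable, credential type",
        "Associate degree",
        "Other type of credential associated with a pro...",
    ],
    "Domestic Enrolment": [123987, 84054, 16857, 8589, 6753, 4725, 2844],
    "International Enrolment": [53583, 26790, 16776, 1107, 5055, 3603, 249],
})

audit = audit_credential_totals(bc, "BC (total)")
print(audit)
```

In the notebook itself, the call would presumably be `audit_credential_totals(cleaned_program_df, 'Alberta (total)')` and so on for each province total, which avoids repeating the filter expressions cell by cell.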

### Calculate revenue

Things covered in this section:
- Combining the tuition fees from `tuition` dataframe with enrolment figures in `enrolment`
- Using tuition fees and 22/23 enrolment to calculate revenue from enrolment/tuition fees
- Projecting 23/24 enrolment based on international % growth rates and domestic enrolment changes, to provide some estimate of 23/24 enrolment all else being equal.
- Using hypothetical scenarios (e.g. a 10%, 30% or 50% drop in international student enrolment) to forecast and estimate revenue losses by school.
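
As a rough illustration of the steps listed above, the sketch below merges enrolment with tuition, computes revenue, and applies the three drop scenarios. Everything in it is an assumption: the school names, the figures, and the tuition column names are placeholders, not values from the StatCan tables or from the actual `tuition` and `enrolment` dataframes.

```python
import pandas as pd

# Hypothetical placeholder figures, NOT taken from the StatCan tables
enrolment = pd.DataFrame({
    "Institution Name": ["School A", "School B"],
    "Domestic Enrolment": [10_000, 4_000],
    "International Enrolment": [2_000, 3_000],
})
tuition = pd.DataFrame({
    "Institution Name": ["School A", "School B"],
    "Domestic Tuition": [6_000, 5_000],          # assumed average annual fee
    "International Tuition": [30_000, 20_000],   # assumed average annual fee
})

# One tuition row per school; validate raises if the key is duplicated on either side
revenue = enrolment.merge(tuition, on="Institution Name", how="left", validate="one_to_one")
revenue["Domestic Revenue"] = revenue["Domestic Enrolment"] * revenue["Domestic Tuition"]
revenue["International Revenue"] = (
    revenue["International Enrolment"] * revenue["International Tuition"]
)
revenue["Total Revenue"] = revenue["Domestic Revenue"] + revenue["International Revenue"]

# Revenue lost if international enrolment falls by 10%, 30% or 50%, all else equal
for drop in (0.10, 0.30, 0.50):
    label = f"Loss at {drop:.0%} drop"
    revenue[label] = revenue["International Revenue"] * drop
    revenue[f"{label} (% of total)"] = (revenue[label] / revenue["Total Revenue"] * 100).round(1)

print(revenue.set_index("Institution Name").filter(like="Loss"))
```

Expressing each loss as a share of total tuition revenue makes schools of different sizes comparable, which is the point of the scenario exercise.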

Notes:
- As we go forward, we should spot-check enrolment in specific program and credential types against the international enrolment in `combined_df`. There will probably be at least some inconsistencies in how a 'diploma' or 'certificate' is categorized, and because we are working with average tuition fee values, these will be informed estimates at best. But no specific program or credential should show more enrolment than the institution's entire 'international' or 'domestic' enrolment in `combined_df`.
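
One way to run that spot check is to join program-level rows to institution totals and flag any row that exceeds its total. This is a sketch under assumptions: the column names in `combined_df` are guessed, and both frames below hold made-up placeholder figures.

```python
import pandas as pd

# Hypothetical placeholders: column names and figures are assumptions
combined_df = pd.DataFrame({
    "Institution Name": ["School A", "School B"],
    "International Enrolment": [2_000, 3_000],
})
program_df = pd.DataFrame({
    "Institution Name": ["School A", "School A", "School B"],
    "Credential type": ["Diploma", "Certificate", "Diploma"],
    "International Enrolment": [1_500, 300, 3_200],
})

# Attach the institution-wide total to every program/credential row
check = program_df.merge(
    combined_df,
    on="Institution Name",
    how="left",
    suffixes=("", " (institution total)"),
)

# Rows where a single credential reports more students than the whole institution
exceeds_total = check[
    check["International Enrolment"] > check["International Enrolment (institution total)"]
]
print(exceeds_total)
```

Any row returned here would point to a categorization or data-entry problem worth investigating before it feeds into the revenue estimates.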

## High-level conclusions:

Any relationship to population growth?

## Next Steps