# Feb 2025 version - Enrolment by Program, Credential type and Field of Study

## Introduction

**Goal of this workbook:**

Following on from the look at domestic and international enrolment, this notebook examines program- and credential-specific enrolment, including field of study. The aim is a more precise understanding of which programs, and how much of their enrolment, contribute to tuition fee revenue that may be at risk.

Source:
[StatCan: Postsecondary enrolments, by detailed field of study, institution, and program and student characteristics](https://www150.statcan.gc.ca/t1/tbl1/en/cv.action?pid=3710027701)

The new federal IRCC rules of 2024 restrict the number of international study permits issued and the availability of Post-Graduation Work Permits (PGWP), according to the program/credential type, length of the program and field of study. The field-of-study requirement further complicates the picture of enrolment and revenue changes for specific schools and adds another dimension to the question: which programs and fields of study have proliferated, and are they the ones the government believes we need most?

For clarity and ease of reference, I lay out the current guides and updates, with their sources, immediately below.

### IRCC Sources, Notices, definitions and targets (close when not needed)

1. [PGWP: Who can apply, eligible Program types and Field of study requirements, current](https://www.canada.ca/en/immigration-refugees-citizenship/services/study-canada/work/after-graduation/eligibility.html)
    - **Bachelor's, Master's or Doctoral degrees from a university have no field of study requirement**
    - Graduates of all other university programs and of college programs must have graduated in an eligible field of study. Fields of study are classified by a six-digit [Classification of Instructional Programs (CIP) code](https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=1420413), made up of a leading two-digit series code followed by a four-digit specifier, e.g. 14 is Engineering and 14.1901 is Mechanical Engineering.
    
    - The PGWP eligible fields of study ([link to CIP codes](https://www.canada.ca/en/immigration-refugees-citizenship/services/study-canada/work/after-graduation/eligibility.html#field-of-study)) are:
        - Agriculture and agri-food
        - Education
        - Healthcare
        - Science, technology, engineering and mathematics (STEM)
        - Trades (electrician, HVAC etc)
        - Transport

2. [2025 allocations under the international student cap; Jan 24, 2025](https://www.canada.ca/en/immigration-refugees-citizenship/news/notices/2025-provincial-territorial-allocations-under-international-student-cap.html)
    - 437,000 total study permits for 2025, a 10% decrease from the 2024 cap. 
    - Master's & Doctoral students are now required to submit a PAL as well, but extra space has been set aside specifically for those programs.

    - "**Considering growth in the graduate international student sector has been sustainable**...2025 graduate student sub-allocation is based on the number of study permits that [provinces] respectively issued to graduate students in 2023". This is an indirect admission that there is specific program-type mismanagement to be identified here.
    - 2025 National Targets for study permits:
        - **73.2k graduate degree students (PAL Required)**
        - **243k permits for remaining PAL/TAL required programs for a total of 316k from PAL required groups**
        - Exempt categories: 72.2k issued to K-12 and 48.5k to other PAL-exempt applicants, for a grand total of 437k

        - **Provincial/Territorial Allocations** below are the maximum number of study permit applications that will be processed, assuming an average approval rate from each province/territory (looks to be around 66%). The anticipated number of study permits issued by IRCC is lower than these maxima and is expected to sum to 316k. Both the allocation and approved estimates will be added as data here.
            - (Graduate, All Other, *Total*)
            - AB: 5256, 42082, *47338*
            - BC: 28333, 47754, *76087*
            - MB: 1980, 16611, *18591*
            - NB: 3112, 11673, *14785*
            - NL: 2648, 6534, *9182*
            - NT: 0, 705, *705*
            - NS: 4191, 14411, *18602*
            - NU: 0, 0, *0*
            - ON: 32579, 149011, *181590*
            - PE: 391, 2044, *2435*
            - QC: 38786, 123956, *162742*
            - SK: 2791, 14850, *17641*
            - YT: 1, 463, *464*

3. [Additional information about the International Student Program reforms; February 5, 2024](https://www.canada.ca/en/immigration-refugees-citizenship/news/notices/international-student-program-reform-more-information.html)
    - Most study permit applications to the federal govt must include a Provincial Attestation Letter (PAL) provided by the province (on behalf of their school/PSI, PALs distributed to PSIs by province)
        - The PAL is the accounting metric that ensures new int'l student numbers are accurate
    - PAL Required: most non-degree graduate programs (certificates & diplomas); most post-secondary study permit apps
    - PAL NOT Required: primary & secondary students, master's or doctoral degree students; in-Canada study permit holders
    - Master's degree graduates are now eligible for a 3-year PGWP
    - New students at public-private partnership college programs now ineligible for PGWP
        - To do: using the [CBC data](https://www.cbc.ca/news/canada/toronto/international-student-study-permits-data-1.7125827#Full%20data), check enrolment at schools that have a public-private partnership. Is it disproportionately lower than at schools without one, all else being equal or similar?

4. [Original IRCC statement; January 24, 2024](https://www.canada.ca/en/immigration-refugees-citizenship/news/2024/01/canada-to-stabilize-growth-and-decrease-number-of-new-international-student-permits-issued-to-approximately-360000-for-2024.html).
    - **360,000 approved study permits** for 2024, a decrease of 35% from 2023
    - *"Those pursuing master’s and doctoral degrees, and elementary and secondary education are not included in the cap."*
    - Portion of the cap allocated to each province/territory
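
The 2025 provincial and territorial allocations listed under item 2 can be held as a small table, so that the stated totals and the aggregate approval rate they imply can be checked. This is a minimal sketch using the figures as transcribed above; the column names and the `EXPECTED_PERMITS` constant (the 316k national target quoted in item 2) are my own labels, not IRCC's.

```python
import pandas as pd

# (Graduate, All Other) allocations by province/territory, transcribed from item 2 above
allocations = pd.DataFrame(
    {
        "Graduate": [5256, 28333, 1980, 3112, 2648, 0, 4191, 0, 32579, 391, 38786, 2791, 1],
        "All Other": [42082, 47754, 16611, 11673, 6534, 705, 14411, 0, 149011, 2044, 123956, 14850, 463],
    },
    index=["AB", "BC", "MB", "NB", "NL", "NT", "NS", "NU", "ON", "PE", "QC", "SK", "YT"],
)
allocations["Total"] = allocations["Graduate"] + allocations["All Other"]

# National target for permits expected to be issued to PAL/TAL-required groups (316k, item 2)
EXPECTED_PERMITS = 316_000

# Aggregate approval rate implied by the allocations; individual provinces may differ
implied_rate = EXPECTED_PERMITS / allocations["Total"].sum()

print(allocations)
print(f"Implied aggregate approval rate: {implied_rate:.1%}")
```

The printed rate can be compared with the rough per-province approval rate estimated above. The two need not match, since the allocations assume a separate rate for each province/territory.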

#### Older releases:
5. [Distance Learning in COVID-19; May 14, 2020](https://www.canada.ca/en/immigration-refugees-citizenship/news/notices/pgwpp-rules-covid19.html)
    - PGWP eligibility is not affected for students whose fall 2020 courses are online due to COVID-19. Students may begin their classes while outside Canada and complete up to 50% of their program via distance learning if they cannot travel to Canada sooner. No time is deducted from the length of a future PGWP for studies completed outside Canada.
    - This is good evidence that any variation in international enrolment in 2020-21 is not work-permit related, since the long-term incentive of a path to permanent residence was not removed at the federal level.

### **Important notes**

1. StatCan's classification of program types [is here](https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=1252482&CVD=1252483&CLV=0&MLV=2&D=1).
    - **Graduate (second cycle) means Master's programs, or programs that otherwise require a Bachelor's degree**
    - **Graduate (third cycle) is PhD**
    - Certificates and Diplomas are inconsistent and have different criteria in different provinces, see pt. 4

2. I'm using **up to 2022/23 enrolment data, and 2023/2024 tuition fee figures**, the latest available. From there we can update with live data on enrolment as it becomes official, and project scenarios with hypothetical declines in student enrolment to estimate revenue changes into the future.

3. For tuition fee purposes, the easiest distinction is at Program Type, between undergraduate and graduate degrees. However, you need to look at Credential Type for certificates and diplomas (popular at the colleges).

4. There is inconsistency in where graduate diplomas/certificates sit in 'program type'.
    - For example, there are 509,000 students with Credential type: Diploma across all of Canada in 22/23, and 386k of them are in the 'Career, Technical or Professional Training Program' Program Type. 85k of these diploma students sit under 'Pre-University Program' (of 87k total in the Pre-uni category), which makes me think some PSIs are classifying a high school diploma here, which wouldn't impact tuition fees. The remaining 55k are scattered across various other program types.
    - Certificate credentials are clearer - of 190k in Canada, almost all are captured in the 'Career, Technical or Professional Training Program', 'Post Career, Technical or Professional Training Program' or Undergraduate.
    - *Post career, technical or professional training program* specifically includes **Ontario graduate certificate programs**

5. I only imported student enrolment from full programs. Around 100,000 enrolments (out of 2.2m total enrolments in all programs) were in 'non-program': some non-credit, some undergraduate, some graduate. I assume these are students taking individual classes to complete programs at a later date, rather than an end-to-end program enrolment on a schedule.
6. The above analysis was done on full-time and part-time students. As with my earlier analysis, I am only taking full-time PSI student data (a total of 1.7m in Canada).


All this is to say the calculations here will be estimates at best, with the heavy lifting being done by the difference between domestic and international tuition fees mostly at the undergraduate and graduate degree level, as these are the most numerous and require somewhat less granularity than fees for certificates/diplomas.
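
The scenario projection mentioned in note 2 comes down to enrolment multiplied by the fee per student, recomputed under a hypothetical decline. The sketch below shows that arithmetic only. The function names and every input value are placeholders for illustration, not StatCan enrolment or tuition figures.

```python
def tuition_revenue(dom_enrol, intl_enrol, dom_fee, intl_fee):
    """Estimated tuition revenue: enrolment multiplied by the average fee per student."""
    return dom_enrol * dom_fee + intl_enrol * intl_fee


def revenue_change(dom_enrol, intl_enrol, dom_fee, intl_fee, intl_decline):
    """Change in revenue if international enrolment falls by `intl_decline` (0.0 to 1.0)."""
    baseline = tuition_revenue(dom_enrol, intl_enrol, dom_fee, intl_fee)
    scenario = tuition_revenue(dom_enrol, intl_enrol * (1 - intl_decline), dom_fee, intl_fee)
    return scenario - baseline


# Hypothetical inputs, for illustration only (not StatCan figures)
change = revenue_change(
    dom_enrol=10_000, intl_enrol=2_000,
    dom_fee=7_000, intl_fee=35_000,
    intl_decline=0.10,
)
print(f"Estimated revenue change: {change:,.0f}")
```

Because the international fee is several times the domestic fee in this toy case, a modest drop in international enrolment accounts for the whole change, which is the effect the paragraph above describes.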

## Imports

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

# for the preprocessing pipeline and variables to verify enrolment numbers align
import import_ipynb
import domestic_intl

## Data for Program type enrolment

Adding enrolment by program types and international/domestic students - [enrolment by program type and status of student in Canada](https://www150.statcan.gc.ca/t1/tbl1/en/cv.action?pid=3710027701). 


In [4]:
canada_dom = domestic_intl.canada_dom
canada_intl = domestic_intl.canada_intl

# Note 6 above cites a total of 1.7m full-time students in 2022/23. Check this against the
# earlier domestic/international data, which also covers full-time students only.
dom_2022 = canada_dom.loc[canada_dom['FY Start'] == 2022, 'Enrolment'].sum()
intl_2022 = canada_intl.loc[canada_intl['FY Start'] == 2022, 'Enrolment'].sum()
dom_2022 + intl_2022

1741692

The earlier enrolment data gives 1.74m full-time students for 2022/23, which is consistent with the 1.7m figure from the program data here.

## StatCan data on postsecondary enrolments by field of study, institution, and program and student characteristics

[Source](https://www150.statcan.gc.ca/t1/tbl1/en/cv.action?pid=3710027701)

From the source, the following data were removed / not specified in order to not exceed the 2 million data point download limit from StatCan:

- Geography and institutions:
    - Province level enrolment was collected, but not individual institutions.

- Field of Study:
    - Enrolment below the two-digit CIP level was not collected, e.g. CIP 01 Agriculture enrolment was collected, but Plant Sciences [01.11] was not, and appears only within the aggregate CIP 01 enrolment data

- Program Type
    - Basic Education and Skills Program enrolment was not specifically collected
    - non-programs were not collected

- Credential Type
    - GED / High School Diploma specific enrolment was not collected
    - Attestation or other short program credentials were not collected

- Registration Status
    - As before, only Full-time students are collected, not Part-time

- Status of Student in Canada
    - Canadian students, International Students and Not reported were collected as individual categories, but not as a total aggregated number because I'd like to examine differences in program/credential enrolment and growth of such along international/domestic lines

- Reference period
    - Years 2016/17 to years 2022/23 were collected

Once sufficient areas have been eliminated we can drill down more on institutions or specific fields if warranted.
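
Since only the two-digit CIP level was downloaded, it can help to pull that code out of the 'Field of study' label into its own column for grouping and matching against the PGWP-eligible fields. This is a minimal sketch on toy labels, assuming every non-aggregate label ends with a bracketed code in the form `[NN.]`, as in `History [54.]`. That format has not been checked against every row of the download.

```python
import pandas as pd

# Toy labels in the format shown in the StatCan 'Field of study' column
fields = pd.Series([
    "History [54.]",
    "Education [13.]",
    "Visual and performing arts [50.]",
    "Total, field of study",  # aggregate row, carries no CIP code
])

# Capture the two digits inside the trailing "[NN.]"; labels without a code become NaN
cip_series = fields.str.extract(r"\[(\d{2})\.\]\s*$", expand=False)
print(cip_series.tolist())
```

Keeping the code as a string preserves leading zeros, e.g. `01` for Agriculture.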

In [5]:
# import the new csv file
progs = pd.read_csv('/Users/thomasdoherty/Desktop/canadian-psi-project/psi_data/statcan_data/progs_credentials_fields.csv')

In [6]:
progs.sample(6)

Unnamed: 0,REF_DATE,GEO,DGUID,Field of study,Program type,Credential type,Institution type,Registration status,Status of student in Canada,Gender,...,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
46459,2019/2020,Ontario,2021A000235,Engineering/engineering-related technologies/t...,"Qualifying program for career, technical or pr...","Total, credential type","Total, institution type",Full-time student,Canadian students,"Total, gender",...,223,units,0,v1611404508,125.131.3.1.1.2.2.1,3,,,,0
66428,2022/2023,Saskatchewan,2021A000247,History [54.],Graduate program (second cycle),"Total, credential type","Total, institution type",Full-time student,Canadian students,"Total, gender",...,223,units,0,v1614384161,188.462.13.1.1.2.2.1,18,,,,0
62177,2022/2023,Saskatchewan,2021A000247,Education [13.],Graduate program (second cycle),Certificate,"Total, institution type",Full-time student,Canadian students,"Total, gender",...,223,units,0,v1614264804,188.73.13.3.1.2.2.1,3,,,,0
80952,2020/2021,British Columbia,2021A000259,Visual and performing arts [50.],"Post career, technical or professional trainin...","Total, credential type","Total, institution type",Full-time student,International students,"Total, gender",...,223,units,0,v1615484919,226.392.5.1.1.2.3.1,15,,,,0
72339,2020/2021,Alberta,2021A000248,Visual and performing arts [50.],"Total, program type","Not applicable, credential type","Total, institution type",Full-time student,International students,"Total, gender",...,223,units,0,v1614732374,199.392.1.9.1.2.3.1,0,,,,0
40919,2021/2022,Quebec,2021A000224,Public administration and social service profe...,"Career, technical or professional training pro...",Diploma,"Total, institution type",Full-time student,International students,"Total, gender",...,223,units,0,v1609887274,31.341.4.4.1.2.3.1,66,,,,0


In [7]:
# columns where every record is missing
print(progs.columns[progs.isnull().all()])
# columns where at least one record is missing
print(progs.columns[progs.isnull().any()])

Index(['STATUS', 'SYMBOL', 'TERMINATED'], dtype='object')
Index(['STATUS', 'SYMBOL', 'TERMINATED'], dtype='object')
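
The two checks return the same columns, so the only columns with any missing values are entirely empty and can be dropped without losing information. The sketch below shows the same comparison on a toy frame. The variable names are my own, and in this notebook the drop itself is handled by the `DropColumns` pipeline step.

```python
import numpy as np
import pandas as pd

# Toy frame: 'STATUS' is empty for every record, 'VALUE' is fully populated
toy = pd.DataFrame({
    "VALUE": [3, 18, 66],
    "STATUS": [np.nan, np.nan, np.nan],
})

# Columns where every record is missing carry no information
all_null_cols = toy.columns[toy.isnull().all()].tolist()

# Columns with some, but not all, records missing would need a closer look
partial_null_cols = [c for c in toy.columns[toy.isnull().any()] if c not in all_null_cols]

trimmed = toy.drop(columns=all_null_cols)
print(all_null_cols, partial_null_cols, list(trimmed.columns))
```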


In [8]:
# confirm this is all full-time students
progs['Registration status'].value_counts()

Registration status
Full-time student    84506
Name: count, dtype: int64

In [9]:
progs['Status of student in Canada'].value_counts()

Status of student in Canada
Canadian students                            43528
International students                       35897
Not reported, status of student in Canada     5081
Name: count, dtype: int64

## Preprocessing pipeline

Custom Transformers (CTs) defined in utils package

### Mar 1 test utils package - commented out below

In [None]:
# # Custom transformer (CT) for dropping unnecessary columns from the dataframe

# DropColumns = domestic_intl.DropColumns

# RenameColumns = domestic_intl.RenameColumns

# FormatDate = domestic_intl.FormatDate

# AddInstitutionAndProvince = domestic_intl.AddInstitutionAndProvince

# AbbreviateInstitutionNames = domestic_intl.AbbreviateInstitutionNames

# # Removing territories from the data

# class RemoveTerritories(BaseEstimator, TransformerMixin):
#     def __init__(self, column, territories):
#         """
#         :param column: The column where we look for territory names
#         :param territories: A list of territory names (e.g. ["Yukon", "Northwest Territories", "Nunavut"])
#         """
#         self.column = column
#         self.territories = territories # territories defined above

#     def fit(self, X, y=None):
#         return self

#     def transform(self, X):
#         """
#         Removes rows where the specified column contains any of the territory names.
#         """
#         X = X.copy()
#         # Build a regex pattern that matches any of the territory strings
#         pattern = '|'.join(self.territories)

#         # Keep rows that do NOT contain any of the territory names (case-insensitive).
#         # If the column is sometimes NaN, na=False ensures we don't error out.
#         return X[~X[self.column].str.contains(pattern, case=False, na=False)]
    

# # reorder the columns for ease of reading

# class ReorderColumns(BaseEstimator, TransformerMixin):
#     def __init__(self, desired_order):
#         """
#         parameter - desired_order: A list specifying the columns in the order desired.
#         e.g. ["FY Start", "Province/Territory", "Institution Name", ...]
#         """
#         self.desired_order = desired_order

#     def fit(self, X, y=None):
#         return self

#     def transform(self, X):
#         """
#         Reorders the columns to the specified order. Any columns not in 'desired_order'
#         are appended at the end in their existing order.
#         """
#         X = X.copy()
        
#         # Columns that are explicitly ordered
#         ordered_cols = [c for c in self.desired_order if c in X.columns]
        
#         # Any remaining columns not in 'desired_order'
#         leftover_cols = [c for c in X.columns if c not in ordered_cols]
        
#         # Final order is the desired columns first, then leftover
#         final_order = ordered_cols + leftover_cols
        
#         return X[final_order]
    
# # pivot Canadian status (domestic/international/unreported) into three columns of the same record, all else being equal

# class PivotCanadianStatus(BaseEstimator, TransformerMixin):
#     def __init__(
#         self, 
#         index_cols=[
#             "FY Start", 
#             "Province/Territory", 
#             "Institution Name", 
#             "Program type", 
#             "Credential type", 
#             "Field of study"
#             ],
#         pivot_col="Canadian Status",
#         values_col="Enrolment"
#     ):
#         """
#         Parameter - index_cols: The columns to keep as index in the pivot (remain in rows).
#         Parameter - pivot_col: The column whose unique values become new columns (e.g., 'Canadian Status').
#         Parameter - values_col: The numeric column to place in new columns (e.g., 'Enrolment').
#         """
#         self.index_cols = index_cols
#         self.pivot_col = pivot_col
#         self.values_col = values_col

#     def fit(self, X, y=None):
#         return self

#     def transform(self, X):
#         """
#         Pivots the dataframe so that 'Canadian Status' becomes columns:
#           -> 'Domestic Enrolment', 'International Enrolment'.

#         After pivoting, renames columns accordingly and returns the wide table.
#         """
#         X = X.copy()

#         # 1. Pivot
#         pivoted = X.pivot_table(
#             index=self.index_cols,
#             columns=self.pivot_col,
#             values=self.values_col,
#             aggfunc='sum'  # If duplicates exist, sum them
#         ).reset_index()

#         # 2. Rename columns from 'Canadian students' -> 'Domestic Enrolment' etc.
#         col_rename_map = {
#             'Canadian students': 'Domestic Enrolment',
#             'International students': 'International Enrolment',
#             'Not reported, status of student in Canada': 'CA Status Unreported Enrolment'
#         }
#         pivoted = pivoted.rename(columns=col_rename_map)

#         # 3. convert new Domestic and International Enrolment columns to integers
#         pivoted['Domestic Enrolment'] = pivoted['Domestic Enrolment'].fillna(0).astype(int)
#         pivoted['International Enrolment'] = pivoted['International Enrolment'].fillna(0).astype(int)
#         pivoted['CA Status Unreported Enrolment'] = pivoted['CA Status Unreported Enrolment'].fillna(0).astype(int)

#         # 4. Reorder columns if desired
#         # Ensure Domestic and International appear last in an expected order
#         final_cols = [c for c in pivoted.columns if c not in col_rename_map.values()]
#         final_cols += ['Domestic Enrolment', 'International Enrolment', 'CA Status Unreported Enrolment']
#         final_cols = [c for c in final_cols if c in pivoted.columns]  # Only keep existing columns

#         return pivoted[final_cols]

### Defined variables (provinces, territories, abbreviations, Francophone institutions etc)

In [10]:
from utils.constants import PROVINCES_TERRITORIES_CA, PROVINCES_CA, TERRITORIES, PROVINCE_CODES, ABBREVIATIONS
from utils.pipeline_transformers import DropColumns, RenameColumns, FormatDate, AddInstitutionAndProvince, AbbreviateInstitutionNames, RemoveTerritories, ReorderColumns, PivotCanadianStatus

In [11]:
# Define the pipeline
from sklearn.pipeline import Pipeline

program_pipeline = Pipeline(steps=[
    # drop columns
    ('drop_columns', DropColumns(columns=[
        'DGUID', 'Registration status', 'Institution type', 'Gender', 'UOM', 'UOM_ID',
        'SCALAR_FACTOR', 'SCALAR_ID', 'VECTOR', 'COORDINATE', 'STATUS', 'SYMBOL', 'TERMINATED', 'DECIMALS'
    ])),
    # rename columns
    ('rename_columns', RenameColumns(column_names={
        "GEO": "Province/Territory",
        "REF_DATE": "FY Start",
        "VALUE": "Enrolment",
        "Status of student in Canada": "Canadian Status"
    })),
    # drop record step -  Yukon / NWT / Nunavut
    ('remove_territories', RemoveTerritories(
        column="Province/Territory",
        territories=TERRITORIES
    )),
    # formatting data type steps
    ('format_fy', FormatDate(column="FY Start")),
    # ('format_enrolment', FormatValue(column="Enrolment")),
    
    # Add Institution Name and Province columns
    ('add_institution_and_province', AddInstitutionAndProvince(
        column='Province/Territory',
        institution_col='Institution Name', 
        province_col='Province/Territory', 
        provinces=PROVINCES_TERRITORIES_CA
    )),
    
    # Abbreviate institution names
    ('abbreviate_institution_names', AbbreviateInstitutionNames(
        column='Institution Name', 
        replacements=ABBREVIATIONS
    )),

    # Reorder the columns
    ('reorder_columns', ReorderColumns(desired_order=[
        "FY Start", 
        "Province/Territory", 
        "Institution Name", 
        "Program type",
        "Credential type",
        "Field of study",
        "Canadian Status",
        "Enrolment"
    ])),

    # pivot the Canadian status column
    ('pivot_canadian_status', PivotCanadianStatus())
])

In [12]:
cleaned_progs = program_pipeline.fit_transform(progs)

In [13]:
cleaned_progs.sample(6)

Unnamed: 0,FY Start,Province/Territory,Institution Name,Program type,Credential type,Field of study,Domestic Enrolment,International Enrolment,CA Status Unreported Enrolment
20917,2019,Manitoba,Manitoba (total),"Post career, technical or professional trainin...","Total, credential type",Social sciences [45.],6,15,0
4296,2016,Ontario,Ontario (total),Graduate program (third cycle),"Total, credential type",Legal professions and studies [22.],168,48,0
5062,2016,Prince Edward Island,Prince Edward Island (total),Undergraduate program,Degree (includes applied degree),Agricultural and veterinary sciences/services/...,228,90,0
5863,2016,Saskatchewan,Saskatchewan (total),Post-baccalaureate non-graduate program,Degree (includes applied degree),"Business, management, marketing and related su...",3,27,0
5696,2016,Saskatchewan,Saskatchewan (total),"Career, technical or professional training pro...",Diploma,"Culinary, entertainment, and personal services...",60,0,0
25700,2020,Canada,Canada (total),"Career, technical or professional training pro...","Total, credential type",Agricultural and veterinary sciences/services/...,6327,567,12


In [63]:
cleaned_progs.columns

Index(['FY Start', 'Province/Territory', 'Institution Name', 'Program type',
       'Credential type', 'Field of study', 'Domestic Enrolment',
       'International Enrolment', 'CA Status Unreported Enrolment'],
      dtype='object', name='Canadian Status')
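
The `name='Canadian Status'` in the output above is the label pandas keeps on the columns axis after a pivot, which is why it also shows up in the header row of the pivoted table. It is harmless, but it can be cleared for tidier display and export. A minimal sketch on toy data:

```python
import pandas as pd

long_df = pd.DataFrame({
    "FY Start": [2022, 2022],
    "Canadian Status": ["Canadian students", "International students"],
    "Enrolment": [60, 15],
})

wide = long_df.pivot_table(
    index="FY Start", columns="Canadian Status", values="Enrolment", aggfunc="sum"
).reset_index()

# The pivot column's name is kept as the label of the columns axis
name_before = wide.columns.name

# Clear the axis label so it no longer appears in the header
wide = wide.rename_axis(None, axis=1)
name_after = wide.columns.name

print(name_before, name_after)
```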

#### Pivoting table to create one record of international and domestic enrolment (commented out until needed)

In [18]:
# # create an international student percentage column
# combined_df['% International'] = round((combined_df['International Enrolment'] / (combined_df['Domestic Enrolment'] + combined_df['International Enrolment'])) * 100, 2)
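
When this column is switched back on, records with no students at all (0 domestic and 0 international) need a defined result. The sketch below makes that case explicit by turning a zero total into NaN before dividing. It uses a toy frame rather than `combined_df`, which is not defined in this notebook.

```python
import numpy as np
import pandas as pd

# Toy records, including one with no enrolment at all
toy = pd.DataFrame({
    "Domestic Enrolment": [696, 3, 0],
    "International Enrolment": [387, 201, 0],
})

total = toy["Domestic Enrolment"] + toy["International Enrolment"]

# Replace a zero total with NaN so empty records yield NaN explicitly
toy["% International"] = (
    toy["International Enrolment"] / total.replace(0, np.nan) * 100
).round(2)

print(toy)
```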

## Cleaning up N/As and Not Reported records

Several categories such as 'Not reported, status of student' and 'Not applicable, credential type' need examining, to see whether they can be discarded to make room for more valuable records.

In [19]:
progs['Status of student in Canada'].value_counts()

Status of student in Canada
Canadian students                            43528
International students                       35897
Not reported, status of student in Canada     5081
Name: count, dtype: int64
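
Note that `value_counts()` counts records, not students, so the 5,081 'Not reported' rows say nothing about how much enrolment they hold. To decide whether the category can be discarded, the enrolment share is more informative. This is a minimal sketch on toy values. On the real table, aggregate rows (e.g. 'Total, ...' categories and the Canada-level geography) would need filtering out first to avoid double counting.

```python
import pandas as pd

# Toy stand-in for the raw table: one row per record, VALUE is enrolment
toy = pd.DataFrame({
    "Status of student in Canada": [
        "Canadian students", "Canadian students",
        "International students",
        "Not reported, status of student in Canada",
    ],
    "VALUE": [168, 228, 48, 12],
})

# Sum enrolment per status instead of counting records
enrolment_by_status = toy.groupby("Status of student in Canada")["VALUE"].sum()

# Share of total enrolment held by each status, in percent
share = (enrolment_by_status / enrolment_by_status.sum() * 100).round(1)
print(share)
```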

### Pre-Feb 26 work, to be incorporated with the new data pulled on Feb 26

In [None]:
# import the csv file
programs = pd.read_csv('/Users/thomasdoherty/Desktop/canadian-psi-project/psi_data/statcan_data/statcan-program-int-can-enrolment.csv')

In [None]:
programs.sample(4)

Unnamed: 0,REF_DATE,GEO,DGUID,Field of study,Program type,Credential type,Institution type,Registration status,Status of student in Canada,Gender,...,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
2437,2022/2023,"Trent University, Ontario",,"Total, field of study","Total, program type",Degree (includes applied degree),"Total, institution type",Full-time student,Canadian students,"Total, gender",...,223,units,0,v1612357617,138.1.1.5.1.2.2.1,8958,,,,0
4256,2022/2023,"University of British Columbia, British Columbia",,"Total, field of study",Undergraduate program,"Total, credential type","Total, institution type",Full-time student,International students,"Total, gender",...,223,units,0,v1615559531,227.1.8.1.1.2.3.1,9915,,,,0
2791,2022/2023,Confederation College of Applied Arts and Tech...,,"Total, field of study","Career, technical or professional training pro...",Other type of credential associated with a pro...,"Total, institution type",Full-time student,International students,"Total, gender",...,223,units,0,v1612972684,153.1.4.8.1.2.3.1,357,,,,0
122,2022/2023,Prince Edward Island,2021A000211,"Total, field of study","Career, technical or professional training pro...",Certificate,"Total, institution type",Full-time student,International students,"Total, gender",...,223,units,0,v1608870141,5.1.4.3.1.2.3.1,48,,,,0


### Data audit

Per the earlier analysis, confirm that this data covers only full-time students and that its other features align with the enrolment data.

In [None]:
# check registration status is all Full-time student - no other status
programs['Registration status'].unique()

array(['Full-time student'], dtype=object)

In [None]:
# is field of study providing any useful information? - No, just total. Remove in next step
programs['Field of study'].unique()

array(['Total, field of study'], dtype=object)

In [None]:
programs['Credential type'].unique()

array(['Total, credential type', 'Certificate', 'Diploma',
       'Degree (includes applied degree)',
       'Other type of credential associated with a program',
       'Not applicable, credential type', 'Associate degree'],
      dtype=object)

In [None]:
programs['Status of student in Canada'].unique()

array(['Canadian students', 'International students'], dtype=object)
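
The manual checks above could also be written as assertions, so that a fresh download fails loudly if it contains something unexpected. This is a sketch only: the `audit` helper is hypothetical (it is not part of the `utils` package) and it runs here on a toy stand-in for `programs`.

```python
import pandas as pd

# Toy stand-in for the `programs` frame loaded above
toy = pd.DataFrame({
    "Registration status": ["Full-time student"] * 3,
    "Field of study": ["Total, field of study"] * 3,
    "Status of student in Canada": [
        "Canadian students", "International students", "Canadian students",
    ],
})


def audit(df):
    """Raise AssertionError if the extract departs from what the analysis expects."""
    assert set(df["Registration status"].unique()) == {"Full-time student"}, \
        "non full-time records present"
    assert df["Field of study"].nunique() == 1, \
        "field of study is not a single total"
    expected_status = {"Canadian students", "International students"}
    assert set(df["Status of student in Canada"].unique()) <= expected_status, \
        "unexpected student status"
    return True


print(audit(toy))
```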

### Cleaning of program-type enrolment - using pipeline

The same steps can be reused with the custom transformers already defined; we only change the column names to be dropped or reformatted.

We will also add three more custom transformers:
1. One to remove the territories from the data.
2. One to reorder the columns so that Institution Name comes immediately after the province.
3. One to pivot the Canadian students / International students records into separate domestic and international enrolment columns, which should cut the number of records roughly in half.
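
The effect of the third transformer can be seen on a toy example: two long-format records per program (one domestic, one international) collapse into a single wide record. This sketch uses `pivot_table` directly with made-up values. It illustrates the idea and is not the `PivotCanadianStatus` implementation itself.

```python
import pandas as pd

# Long format: one record per program and student status
long_df = pd.DataFrame({
    "FY Start": [2022, 2022, 2022, 2022],
    "Program type": ["Undergraduate program"] * 2 + ["Graduate program (second cycle)"] * 2,
    "Canadian Status": ["Canadian students", "International students"] * 2,
    "Enrolment": [228, 90, 18, 27],
})

# Wide format: one record per program, with a column per student status
wide_df = (
    long_df.pivot_table(
        index=["FY Start", "Program type"],
        columns="Canadian Status",
        values="Enrolment",
        aggfunc="sum",
    )
    .rename(columns={
        "Canadian students": "Domestic Enrolment",
        "International students": "International Enrolment",
    })
    .fillna(0)
    .astype(int)
    .reset_index()
)

print(len(long_df), "->", len(wide_df))
```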

In [None]:
program_pipeline = Pipeline(steps=[
    # drop columns
    ('drop_columns', DropColumns(columns=[
        'DGUID', 'Registration status', 'Field of study', 'Institution type', 'Gender', 'UOM', 'UOM_ID',
        'SCALAR_FACTOR', 'SCALAR_ID', 'VECTOR', 'COORDINATE', 'STATUS', 'SYMBOL', 'TERMINATED', 'DECIMALS'
    ])),
    # rename columns
    ('rename_columns', RenameColumns(rename_map={
        "GEO": "Province/Territory",
        "REF_DATE": "FY Start",
        "VALUE": "Enrolment",
        "Status of student in Canada": "Canadian Status",
    })),
    # drop record step -  Yukon / NWT / Nunavut
    ('remove_territories', RemoveTerritories(
        column="Province/Territory",
        territories=["Yukon", "Northwest Territories", "Nunavut"]
    )),
    # formatting data type steps
    ('format_fy', FormatFYStart(column="FY Start")),
    ('format_enrolment', FormatValue(column="Enrolment")),
    
    # Add Institution Name and Province columns
    ('add_institution_and_province', AddInstitutionAndProvince(
        column='Province/Territory',
        institution_col='Institution Name', 
        province_col='Province/Territory', 
        provinces=provinces_territories_ca
    )),
    
    # Abbreviate institution names
    ('abbreviate_institution_names', AbbreviateInstitutionNames(
        column='Institution Name', 
        replacements=abbreviations
    )),

    # Reorder the columns
    ('reorder_columns', ReorderColumns(desired_order=[
        "FY Start", 
        "Province/Territory", 
        "Institution Name", 
        "Program type",
        "Credential type",
        "Canadian Status",
        "Enrolment"
    ])),

    # pivot the Canadian status column
    ('pivot_canadian_status', PivotCanadianStatus())
])

In [None]:
cleaned_program_df = program_pipeline.fit_transform(programs)

In [None]:
cleaned_program_df.sample(5)

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
403,2022,British Columbia,College of New Caledonia,"Career, technical or professional training pro...","Total, credential type",696,387
1699,2022,Ontario,U of Ottawa-U d'Ottawa,"Total, program type","Not applicable, credential type",1095,522
404,2022,British Columbia,College of New Caledonia,"Post career, technical or professional trainin...",Diploma,3,201
685,2022,British Columbia,U of the Fraser Valley,"Post career, technical or professional trainin...","Total, credential type",0,0
1038,2022,Newfoundland and Labrador,Newfoundland and Labrador (total),Post-baccalaureate non-graduate program,Degree (includes applied degree),315,0


Quick Audit of a few items - no Nunavut/Yukon, check program types

In [None]:
cleaned_program_df['Province/Territory'].unique()

array(['Alberta', 'British Columbia', 'Manitoba', 'New Brunswick',
       'Newfoundland and Labrador', 'Nova Scotia', 'Ontario',
       'Prince Edward Island', 'Quebec', 'Saskatchewan'], dtype=object)

In [None]:
cleaned_program_df['Program type'].unique()

array(['Career, technical or professional training program',
       'Graduate program (second cycle)',
       'Graduate program (third cycle)',
       'Post career, technical or professional training program',
       'Post-baccalaureate non-graduate program', 'Total, program type',
       'Undergraduate program', 'Pre-university program'], dtype=object)

In [None]:
# search for York University
cleaned_program_df[cleaned_program_df['Institution Name'].str.contains('York')]

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
1772,2022,Ontario,York U,Graduate program (second cycle),Degree (includes applied degree),1704,1008
1773,2022,Ontario,York U,Graduate program (second cycle),Diploma,9,9
1774,2022,Ontario,York U,Graduate program (second cycle),"Total, credential type",1716,1014
1775,2022,Ontario,York U,Graduate program (third cycle),Degree (includes applied degree),1485,357
1776,2022,Ontario,York U,Graduate program (third cycle),"Total, credential type",1485,357
1777,2022,Ontario,York U,"Total, program type",Certificate,21,6
1778,2022,Ontario,York U,"Total, program type",Degree (includes applied degree),36054,8379
1779,2022,Ontario,York U,"Total, program type",Diploma,9,9
1780,2022,Ontario,York U,"Total, program type","Not applicable, credential type",57,300
1781,2022,Ontario,York U,"Total, program type","Total, credential type",36144,8697


# To do Feb 19 - explore programs data here, compare intl/dom enrolment, compare overall enrolment in categories
- EDA of the cleaned programs dataframe - which are the most popular programs, use the enrolment to determine which are the program categories of greatest relevance to revenue for schools.
- Use existing knowledge e.g. graduate certificate, graduate diploma programs growing to see which institutions invested in these.
- Any Program/Credential type categories which are esoteric or unknown, check their enrolment, if they are very small across the board we could remove them

### Audit enrolment numbers against `combined_df` and Canada-wide totals before analysis of program/credential level enrolment.

#### Individual Schools spot check

In [None]:
cleaned_program_df[(cleaned_program_df['Institution Name'].isin(['U of BC', 'York U', 'U of Alberta'])) 
                   & (cleaned_program_df['Program type'] == 'Total, program type') 
                   & (cleaned_program_df['Credential type'] == 'Total, credential type')][['FY Start', 'Institution Name', 'Domestic Enrolment', 'International Enrolment']]

Canadian Status,FY Start,Institution Name,Domestic Enrolment,International Enrolment
269,2022,U of Alberta,34653,8361
647,2022,U of BC,36735,14472
1781,2022,York U,36144,8697


In [None]:
combined_df[(combined_df['Institution Name'].isin(['U of BC', 'York U', 'U of Alberta']))
            & (combined_df['FY Start'] == 2022)][['FY Start', 'Institution Name', 'Domestic Enrolment', 'International Enrolment']]

Unnamed: 0,FY Start,Institution Name,Domestic Enrolment,International Enrolment
2859,2022,U of Alberta,34653,8361
2873,2022,U of BC,36735,14472
3181,2022,York U,36144,8697


The figures for all three institutions match `combined_df` exactly, so the numbers look trustworthy at the institution level. The check matters here because, as the analysis goes on, we may remove low-enrolment programs/credentials that are not hugely relevant to tuition fee revenue on the whole.
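
The same spot check could be done programmatically rather than by eye. The sketch below uses the three institutions' figures shown above in two small stand-in frames (`program_totals` and `combined_extract` are hypothetical names), merges them on the shared keys and lists any rows that disagree.

```python
import pandas as pd

# Stand-in for the 'Total, program type' / 'Total, credential type' rows.
program_totals = pd.DataFrame({
    "FY Start": [2022, 2022, 2022],
    "Institution Name": ["U of Alberta", "U of BC", "York U"],
    "Domestic Enrolment": [34653, 36735, 36144],
    "International Enrolment": [8361, 14472, 8697],
})
# Stand-in for the matching combined_df rows (same figures in the outputs above).
combined_extract = program_totals.copy()

check = program_totals.merge(
    combined_extract,
    on=["FY Start", "Institution Name"],
    suffixes=("_program", "_combined"),
    validate="one_to_one",  # fail loudly if a school appears twice
)
for col in ["Domestic Enrolment", "International Enrolment"]:
    check[f"{col} diff"] = check[f"{col}_program"] - check[f"{col}_combined"]

mismatches = check[
    (check["Domestic Enrolment diff"] != 0)
    | (check["International Enrolment diff"] != 0)
]
print(mismatches.empty)  # True when every school agrees
```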

#### National/Provincial totals audited vs Canadian nationwide totals (avoid double counting total rows)

The earlier sections, which analysed Canada-wide statistics before breaking down to the province/PSI level, gave these totals:

- Canada-wide total domestic enrolment in 2022/23 (FY start 2022) was **1,320,684 students**
- Canada-wide total international enrolment in 2022/23 (FY start 2022) was **421,008 students**

In [None]:
# canada_dom is the domestic student data originally imported - 1.32 million students in 2022/23
canada_dom[canada_dom['FY Start'] == 2022]

Unnamed: 0,FY Start,Canadian Status,Enrolment,Institution Name,Province/Territory
13,2022,Canadian students,1320684,Canada (total),Canada


In [None]:
canada_intl[canada_intl['FY Start'] == 2022]

Unnamed: 0,FY Start,Canadian Status,Enrolment,Institution Name,Province/Territory
13,2022,International students,421008,Canada (total),Canada


The `cleaned_program_df` double-counts enrolment in several places: provincial total rows sit alongside institution rows, and 'Total, program type' and 'Total, credential type' rows sit alongside the individual types (see cell below).

In [None]:
# Note the total, program type, total credential type rows which are double counting enrolment
cleaned_program_df.sort_values(by='Domestic Enrolment', ascending=False).head(5)

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
1551,2022,Ontario,Ontario (total),"Total, program type","Total, credential type",546672,236769
1547,2022,Ontario,Ontario (total),"Total, program type",Degree (includes applied degree),414537,93018
1557,2022,Ontario,Ontario (total),Undergraduate program,"Total, credential type",361329,69291
1553,2022,Ontario,Ontario (total),Undergraduate program,Degree (includes applied degree),359139,67233
2385,2022,Quebec,Quebec (total),"Total, program type","Total, credential type",344982,59814


This is just an extract showing the double counting. We want to show this data agrees with the previous set.

Let's reconcile by grabbing the provincial totals in one dataframe and all the individual institutions and programs in another. They are just two different methods of counting enrolment with varying degrees of granularity.

Below, `programs_prov_totals` will consist only of records of the provincial totals across all programs and credentials for each province.
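
One detail worth noting when matching the literal text `(total)`: `Series.str.contains` treats its pattern as a regular expression by default, so the parentheses are read as a capture group rather than as characters. A small sketch of two ways to match the text literally, on a toy series of names:

```python
import re
import pandas as pd

names = pd.Series(["Ontario (total)", "York U", "Quebec (total)"])

# Option 1: switch regex interpretation off entirely.
literal_mask = names.str.contains("(total)", regex=False)

# Option 2: keep regex matching but escape the special characters.
escaped_mask = names.str.contains(re.escape("(total)"))

print(literal_mask.tolist())  # [True, False, True]
```

Both masks select the same rows; `regex=False` is arguably the simpler choice when no pattern matching is needed.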

In [None]:
# regex=False matches the literal text "(total)" instead of treating the parentheses as a regex group
programs_prov_totals = cleaned_program_df[(cleaned_program_df['Institution Name'].str.contains('(total)', regex=False)) & 
                                          (cleaned_program_df['Program type'] == 'Total, program type') & 
                                          (cleaned_program_df['Credential type'] == 'Total, credential type')
                                          ]



In [None]:
programs_prov_totals

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
20,2022,Alberta,Alberta (total),"Total, program type","Total, credential type",153537,29139
330,2022,British Columbia,BC (total),"Total, program type","Total, credential type",123987,53583
799,2022,Manitoba,Manitoba (total),"Total, program type","Total, credential type",38181,10563
925,2022,New Brunswick,New Brunswick (total),"Total, program type","Total, credential type",18150,5439
1045,2022,Newfoundland and Labrador,Newfoundland and Labrador (total),"Total, program type","Total, credential type",15288,4725
1147,2022,Nova Scotia,Nova Scotia (total),"Total, program type","Total, credential type",38049,11805
1551,2022,Ontario,Ontario (total),"Total, program type","Total, credential type",546672,236769
1815,2022,Prince Edward Island,Prince Edward Island (total),"Total, program type","Total, credential type",4992,2121
2385,2022,Quebec,Quebec (total),"Total, program type","Total, credential type",344982,59814
2594,2022,Saskatchewan,Saskatchewan (total),"Total, program type","Total, credential type",35247,7032


In [None]:
print(f"Total Domestic enrolment across all provinces combined is {programs_prov_totals['Domestic Enrolment'].sum()}, compared to earlier dataset of {canada_dom[canada_dom['FY Start'] == 2022]['Enrolment'].sum()}")
print(f"Total International enrolment across all provinces combined is {programs_prov_totals['International Enrolment'].sum()}, compared to earlier dataset of {canada_intl[canada_intl['FY Start'] == 2022]['Enrolment'].sum()}")

Total Domestic enrolment across all provinces combined is 1319085, compared to earlier dataset of 1320684
Total International enrolment across all provinces combined is 420990, compared to earlier dataset of 421008


**To audit properly, we will use the above comparison statement to get the new dataset as close as possible to the Canadian nationwide numbers we pulled at the very beginning of the project. Namely 1.320m Domestic Canadian students in 2022/23 and 421,008 international students.** 

At the provincial total level, it's a very close match! The small gap is likely just the territories, which we removed in the second dataset but not the first. These data are reconciled for all practical intents and purposes.
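
To put a number on "very close", the gaps can be worked out directly from the four figures printed above. This is only arithmetic on those outputs; the dictionary names are illustrative.

```python
# Canada-wide totals from the earlier dataset (FY start 2022).
canada_wide = {"Domestic": 1_320_684, "International": 421_008}
# Sum of the ten provincial '(total)' rows in the program dataset.
provincial_sum = {"Domestic": 1_319_085, "International": 420_990}

gaps = {}
for status, target in canada_wide.items():
    gap = target - provincial_sum[status]
    pct = 100 * gap / target
    gaps[status] = (gap, pct)
    print(f"{status}: gap of {gap:,} students ({pct:.3f}% of the Canada-wide total)")
```

The domestic gap is 1,599 students and the international gap is 18, both well under 0.2% of the respective totals.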


#### Schools Total

As for the individual credential/program dataframe, we need to be a bit more precise. 

- We should exclude the provincial totals which are double counting the enrolment at the individual schools.
- The individual institutions will also be aggregating the total program/credential type on top of the individual programs and credentials.

Doing this step by step, we have province level totals reconciled, now reconcile institution totals across program/credential:

In [None]:
# regex=False matches the literal text "(total)" instead of treating the parentheses as a regex group
programs_no_totals = cleaned_program_df[~(cleaned_program_df['Institution Name'].str.contains('(total)', regex=False)) & 
                                        (cleaned_program_df['Program type'] == 'Total, program type') & 
                                        (cleaned_program_df['Credential type'] == 'Total, credential type')
                                        ]



In [None]:
programs_no_totals['Domestic Enrolment'].sum()

1319103

In [None]:
programs_no_totals.sort_values(by='Domestic Enrolment', ascending=False)

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
1715,2022,Ontario,U of Toronto,"Total, program type","Total, credential type",62826,26856
2434,2022,Quebec,U de Montréal,"Total, program type","Total, credential type",38646,11211
647,2022,British Columbia,U of BC,"Total, program type","Total, credential type",36735,14472
1781,2022,Ontario,York U,"Total, program type","Total, credential type",36144,8697
1756,2022,Ontario,Western U,"Total, program type","Total, credential type",34998,5829
...,...,...,...,...,...,...,...
2570,2022,Saskatchewan,Parkland Regional College,"Total, program type","Total, credential type",21,0
1069,2022,Nova Scotia,Atlantic School of Theology,"Total, program type","Total, credential type",18,6
1674,2022,Ontario,U de l'Ontario français,"Total, program type","Total, credential type",15,81
2566,2022,Saskatchewan,Great Plains College,"Total, program type","Total, credential type",12,0


# To do Feb 24 - try and reconcile the 72k domestic enrolment and 20k international enrolment missing in this data
- Use the Province (total) records to reconcile 1.32m as above, and then split them off from the main dataframe.
- Then compare the Province (total) dataframe for total international students against the individual records combined.

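An institution-level sum of 1.25 million would be off by 72,000 from 1.32m, which may be due to dropping the territories and some other records. Note, however, that the output above shows 1,319,103, which is within 18 of the provincial-total sum (1,319,085), so this gap needs rechecking.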

For international enrolment (target is 421,008):

In [None]:
programs_no_totals['International Enrolment'].sum()

398694

In [None]:
programs_no_totals.sort_values(by='Domestic Enrolment', ascending=False).head(50)

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
1717,2022,Ontario,U of Toronto,Undergraduate program,Degree (includes applied degree),44616,19845
1783,2022,Ontario,York U,Undergraduate program,Degree (includes applied degree),32862,7017
270,2022,Alberta,U of Alberta,Undergraduate program,Degree (includes applied degree),28686,4560
1730,2022,Ontario,U of Waterloo,Undergraduate program,Degree (includes applied degree),27897,5973
1650,2022,Ontario,Toronto Metropolitan U,Undergraduate program,Degree (includes applied degree),27450,2985
1757,2022,Ontario,Western U,Undergraduate program,Degree (includes applied degree),27042,3702
648,2022,British Columbia,U of BC,Undergraduate program,Degree (includes applied degree),26601,9915
1702,2022,Ontario,U of Ottawa-U d'Ottawa,Undergraduate program,Degree (includes applied degree),26307,6327
1463,2022,Ontario,McMaster U,Undergraduate program,Degree (includes applied degree),25518,4530
2436,2022,Quebec,U de Montréal,Undergraduate program,Degree (includes applied degree),24579,4086


### EDA - Key Questions to answer and important program grey areas

[Notes from StatCan on classification of programs](https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=1252482)

1. Where is the most significant enrolment for international students going? Undergrad? Graduate 2nd Cycle --> degree or certificate/diploma?

2. Any program/credential types e.g. pre-university that are unanimously low enrolment so they can be removed? Growth in particular program areas?

3. Any province-specific or program-specific reporting quirks to be mindful of?
    - **Code 5 in Ontario, Graduate certificates** are in Post Career, technical or professional training program. Is this true of other provinces?
    - **Code 6 from notes Pre-university programs** include university-stream programs at Colleges and CEGEPs in Quebec - are other provinces using this as university-transfer stream programs at college?
    - **Code 9 Post-baccalaureate non-graduate program** includes programs that require a bachelor's degree for admission but can also capture programs at undergrad level which complete at a level beyond bachelor's degree, due to depth of learning (e.g. LLB or MD Degree). **Some flexibility** in how provinces can choose to report these professional degree programs. B.Ed are either here or undergraduate degrees depending on whether they're considered post-degree in outcome.
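
For question 2, one possible way to flag categories that are small everywhere is to take the largest enrolment any single institution reports for each program type and compare it with a cut-off. The sketch below uses invented data with placeholder type labels, and the threshold of 50 is an arbitrary assumption, not a value from this analysis.

```python
import pandas as pd

# Invented institution-level rows; 'Type X/Y/Z' are placeholder labels.
toy = pd.DataFrame({
    "Institution Name": ["College A", "College B", "U of C", "U of C", "College A"],
    "Program type": ["Type Z", "Type Z", "Type X", "Type Y", "Type X"],
    "Domestic Enrolment": [12, 9, 5000, 800, 300],
    "International Enrolment": [0, 3, 900, 400, 20],
})
toy["Total Enrolment"] = toy["Domestic Enrolment"] + toy["International Enrolment"]

THRESHOLD = 50  # hypothetical cut-off for "low everywhere"

# Largest enrolment reported by any one institution, per program type.
max_by_type = toy.groupby("Program type")["Total Enrolment"].max()
removable = max_by_type[max_by_type < THRESHOLD].index.tolist()
print(removable)  # ['Type Z']
```

Using the maximum rather than the mean avoids dropping a category that is tiny at most schools but large at a few.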




In [None]:
cleaned_program_df['Program type'].value_counts()

Program type
Total, program type                                        974
Career, technical or professional training program         420
Undergraduate program                                      414
Graduate program (second cycle)                            241
Pre-university program                                     170
Post-baccalaureate non-graduate program                    160
Graduate program (third cycle)                             139
Post career, technical or professional training program    136
Name: count, dtype: int64

In [None]:
cleaned_program_df['Credential type'].value_counts()

Credential type
Total, credential type                                891
Diploma                                               558
Degree (includes applied degree)                      463
Certificate                                           377
Not applicable, credential type                       267
Other type of credential associated with a program     58
Associate degree                                       40
Name: count, dtype: int64

The discrepancy between the Total rows (for both Credential type and Program type) and the individual program/credential types probably arises because the totals include categories that aren't present in the data uploaded. For example, there were numerous program types, such as *Basic Education & Skills* (high school) and Health-related residency program, that I did not import, along with several non-program categories. These were generally quite small, are focused on domestic students, and thus wouldn't affect revenue in the same way as the programs listed above.

## Exploring credential types and their enrolment

Before looking directly at enrolment this section's goal is to explore the confluence of Program type / Credential type across the provinces, as the StatCan source shows there are different interpretations of the categories by province. We should try not to make an apples to oranges comparison.

We'll start by doing a pie chart to see the breakdown of enrolment in different program types - what programs are international and domestic students joining?

In [None]:
# Filter out "total, program type" and "total, credential type" as they'll be double counting enrolment otherwise
pie_df_programs = cleaned_program_df[
    (cleaned_program_df['Program type'] != 'Total, program type') &
    (cleaned_program_df['Credential type'] != 'Total, credential type')
]

# Aggregate Domestic and International enrolment by Program type
domestic_agg = pie_df_programs.groupby('Program type', as_index=False)['Domestic Enrolment'].sum()
international_agg = pie_df_programs.groupby('Program type', as_index=False)['International Enrolment'].sum()

In [None]:
pie_df_programs['Domestic Enrolment'].sum()

2502237
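
The 2,502,237 figure is close to double the roughly 1.32 million domestic students counted Canada-wide. That would be consistent with the provincial `(total)` rows still being present in `pie_df_programs` alongside the institution rows, since the filter above only excludes the "Total" program and credential labels. A sketch of the effect on invented numbers (the frame and school names are hypothetical):

```python
import pandas as pd

# One provincial-total institution and two schools, each with a total row
# and a single program/credential row. All figures are invented.
toy = pd.DataFrame({
    "Institution Name": ["Ontario (total)", "Ontario (total)",
                         "School A", "School A", "School B", "School B"],
    "Program type": ["Total, program type", "Undergraduate program"] * 3,
    "Credential type": ["Total, credential type", "Degree (includes applied degree)"] * 3,
    "Domestic Enrolment": [300, 300, 100, 100, 200, 200],
})

# Same filter as the cell above: drop only the 'Total' labels.
no_total_labels = toy[
    (toy["Program type"] != "Total, program type")
    & (toy["Credential type"] != "Total, credential type")
]
print(no_total_labels["Domestic Enrolment"].sum())  # 600: province row counted with the schools

# Additionally dropping the provincial '(total)' institutions.
schools_only = no_total_labels[
    ~no_total_labels["Institution Name"].str.contains("(total)", regex=False)
]
print(schools_only["Domestic Enrolment"].sum())  # 300
```

Because provincial and institutional rows are inflated together, the pie chart proportions should be only mildly affected, but absolute sums taken from this frame would be overstated.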

In [None]:
domestic_agg

Unnamed: 0,Program type,Domestic Enrolment
0,"Career, technical or professional training pro...",550206
1,Graduate program (second cycle),151569
2,Graduate program (third cycle),70815
3,"Post career, technical or professional trainin...",16449
4,Post-baccalaureate non-graduate program,27921
5,Pre-university program,162819
6,Undergraduate program,1522458


In [None]:
# pie chart of the domestic enrolment and international enrolment from the domestic_agg numbers and international_agg numbers
fig_dom = px.pie(
    domestic_agg,
    names='Program type',
    values='Domestic Enrolment',
    title='Domestic Enrolment by Program Type (2022-23)',
    hover_data=['Domestic Enrolment'],
    template='plotly'
)
fig_dom.show()

fig_intl = px.pie(
    international_agg,
    names='Program type',
    values='International Enrolment',
    title='International Enrolment by Program Type (2022-23)',
    hover_data=['International Enrolment'],
    template='plotly'
)
fig_intl.show()

Important to remember with credential reporting:
- Diplomas are reported under different program types in different provinces, e.g. in Ontario, Graduate program (second cycle) is almost entirely degrees (presumably master's), but Quebec reports over 2,000 domestic diploma enrolments (2,160, plus 915 international) under the same program type.
- If we are exploring any hypothesis of certificates and diplomas being the driver of international enrolment, we need to know exactly how they're reported in each different province.
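
One way to see these reporting differences side by side is a province by credential table of shares. The sketch below uses domestic figures copied from the provincial-total rows of `cleaned_program_df` for Graduate program (second cycle); the frame name `second_cycle` is illustrative, and the shares are taken over the listed credentials only.

```python
import pandas as pd

second_cycle = pd.DataFrame({
    "Institution Name": ["Ontario (total)", "Ontario (total)",
                         "Quebec (total)", "Quebec (total)",
                         "BC (total)", "BC (total)"],
    "Credential type": ["Degree (includes applied degree)", "Diploma",
                        "Degree (includes applied degree)", "Diploma",
                        "Degree (includes applied degree)", "Certificate"],
    "Domestic Enrolment": [31716, 66, 19737, 2160, 7530, 78],
})

by_province = second_cycle.pivot_table(
    index="Institution Name",
    columns="Credential type",
    values="Domestic Enrolment",
    aggfunc="sum",
    fill_value=0,
)
# Each credential's share of the row total, in percent.
share = by_province.div(by_province.sum(axis=1), axis=0).mul(100).round(1)
print(share)
```

On these figures diplomas are about 9.9% of Quebec's listed second-cycle domestic enrolment and about 0.2% of Ontario's.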

See the ON, QC and BC totals below:

In [None]:
cleaned_program_df[(cleaned_program_df['Institution Name'].isin(['Ontario (total)', 'Quebec (total)', 'BC (total)'])) & 
                   (cleaned_program_df['Program type'] == 'Graduate program (second cycle)')
                   ].sort_values(by='Domestic Enrolment', ascending=False)

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
1534,2022,Ontario,Ontario (total),Graduate program (second cycle),"Total, credential type",31830,17973
1531,2022,Ontario,Ontario (total),Graduate program (second cycle),Degree (includes applied degree),31716,17946
2374,2022,Quebec,Quebec (total),Graduate program (second cycle),"Total, credential type",22065,15810
2371,2022,Quebec,Quebec (total),Graduate program (second cycle),Degree (includes applied degree),19737,14478
313,2022,British Columbia,BC (total),Graduate program (second cycle),"Total, credential type",7638,5268
311,2022,British Columbia,BC (total),Graduate program (second cycle),Degree (includes applied degree),7530,4980
2372,2022,Quebec,Quebec (total),Graduate program (second cycle),Diploma,2160,915
310,2022,British Columbia,BC (total),Graduate program (second cycle),Certificate,78,102
1532,2022,Ontario,Ontario (total),Graduate program (second cycle),Diploma,66,18
1533,2022,Ontario,Ontario (total),Graduate program (second cycle),Other type of credential associated with a pro...,48,9


As we can see above, this program type is dominated by degrees in all three provinces, but a diploma at this level also exists. It may be analogous to the Pre-University diploma offered in Quebec, but for master's programs. 

Checking diplomas below:

In [None]:
cleaned_program_df[(cleaned_program_df['Institution Name'].isin(['Ontario (total)', 'Quebec (total)', 'BC (total)', 'Alberta (total)'])) &
                   (cleaned_program_df['Credential type'] == 'Diploma') &
                    (cleaned_program_df['Domestic Enrolment'] > 50) # a few niche programs with very small enrolment hidden
                   ].sort_values(by='Domestic Enrolment', ascending=False)

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
2383,2022,Quebec,Quebec (total),"Total, program type",Diploma,162006,8619
1548,2022,Ontario,Ontario (total),"Total, program type",Diploma,98679,79035
1527,2022,Ontario,Ontario (total),"Career, technical or professional training pro...",Diploma,97485,78930
2378,2022,Quebec,Quebec (total),Pre-university program,Diploma,79833,1176
2368,2022,Quebec,Quebec (total),"Career, technical or professional training pro...",Diploma,76053,5757
18,2022,Alberta,Alberta (total),"Total, program type",Diploma,25488,9324
1,2022,Alberta,Alberta (total),"Career, technical or professional training pro...",Diploma,25389,9315
327,2022,British Columbia,BC (total),"Total, program type",Diploma,16857,16776
307,2022,British Columbia,BC (total),"Career, technical or professional training pro...",Diploma,16212,9765
2372,2022,Quebec,Quebec (total),Graduate program (second cycle),Diploma,2160,915


Pre-University programs are [unique programs in the Quebec schooling system](https://www.cegepsquebec.ca/en/cegeps/presentation/systeme-scolaire-quebecois/) that prepare students for Undergraduate University programs. 

As the numbers show, they are heavily weighted towards domestic students coming from the Quebec secondary school system. They account for nearly half of all Quebec diplomas, with almost all of the other half coming from Career, technical or professional training programs, as in Ontario. 

The Career, technical or professional training category therefore seems the most appropriate one for comparing diplomas across provinces.
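
The "nearly half" claim can be checked from the domestic Quebec diploma rows in the table above. A short calculation, using only those three figures:

```python
# Domestic diploma enrolment, Quebec (total), FY start 2022.
quebec_diplomas = {
    "Total, program type": 162_006,
    "Pre-university program": 79_833,
    "Career, technical or professional training program": 76_053,
}

total = quebec_diplomas["Total, program type"]
shares = {
    program: round(100 * count / total, 1)
    for program, count in quebec_diplomas.items()
    if program != "Total, program type"
}
print(shares)
```

Pre-university programs come to about 49.3% of domestic Quebec diplomas and Career, technical or professional training programs to about 46.9%.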

In [None]:
# box plot of program types and their domestic enrolment, excluding Total, program type

fig = px.box(
    cleaned_program_df[
        cleaned_program_df['Program type'] != 'Total, program type'
    ],
    x='Program type',
    y='Domestic Enrolment',
    color='Program type',
    hover_data=['Province/Territory', 'Institution Name'],
    points='outliers',
    title='Domestic Enrolment by Program Type (Excluding Total)',
    template='plotly'
)

fig.show()

**Associate degree** is the smallest credential type by record count - is it because only a few provinces use it?


In [None]:
# find Associate degree and list the provinces it is used in
cleaned_program_df[cleaned_program_df['Credential type'] == 'Associate degree']['Province/Territory'].unique()

# cleaned_program_df[(cleaned_program_df['Credential type'] == 'Associate degree') ]

array(['British Columbia', 'Manitoba'], dtype=object)

Only BC and Manitoba seem to offer Associate degrees. This is a good opportunity to do some more data auditing.

In [None]:
cleaned_program_df[((cleaned_program_df['Credential type'] == 'Associate degree') & (cleaned_program_df['Program type'] == 'Total, program type'))]

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
324,2022,British Columbia,BC (total),"Total, program type",Associate degree,4725,3603
362,2022,British Columbia,Camosun College,"Total, program type",Associate degree,351,102
378,2022,British Columbia,Capilano U,"Total, program type",Associate degree,264,534
391,2022,British Columbia,Coast Mountain College,"Total, program type",Associate degree,15,42
408,2022,British Columbia,College of New Caledonia,"Total, program type",Associate degree,75,207
423,2022,British Columbia,College of the Rockies,"Total, program type",Associate degree,48,15
441,2022,British Columbia,Douglas College,"Total, program type",Associate degree,1296,438
480,2022,British Columbia,Kwantlen Polytechnic U,"Total, program type",Associate degree,18,72
495,2022,British Columbia,Langara College,"Total, program type",Associate degree,1575,1503
506,2022,British Columbia,Nicola Valley Institute of Technology,"Total, program type",Associate degree,54,0


### Data audit of enrolment records against credential/programs

In the last cell, we can see 12 people enrolled in the Associate degree programs, but the individual school entries with Associate degree sum to only 9.

We can audit programmatically with BC and other credential types.

In [None]:
cleaned_program_df[(cleaned_program_df['Institution Name'] == 'BC (total)') & 
                   (cleaned_program_df['Program type'] == 'Total, program type') & 
                   (cleaned_program_df['Credential type'] == 'Total, credential type')]

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
330,2022,British Columbia,BC (total),"Total, program type","Total, credential type",123987,53583


In [None]:
# Sum of domestic enrolment for BC across all programs and credentials in 2022

# All 'BC (total)' rows for 'Total, program type': one row per credential type plus the grand total
bc_rows = cleaned_program_df[(cleaned_program_df['Institution Name'] == 'BC (total)') &
                             (cleaned_program_df['Program type'] == 'Total, program type')]
bc_grand_total = bc_rows['Credential type'] == 'Total, credential type'

print(f"Total domestic enrolment across BC (All Programs & All Credentials record row): {bc_rows.loc[bc_grand_total, 'Domestic Enrolment'].sum()}\n")

print(f"Total domestic enrolment across BC (sum of individual credentials records): {bc_rows.loc[~bc_grand_total, 'Domestic Enrolment'].sum()}")

Total domestic enrolment across BC (All Programs & All Credentials record row): 123987

Total domestic enrolment across BC (sum of individual credentials records): 123822


The above cell sums domestic enrolment over the individual credential type records (every record in the dataframe extract in the cell below except the first) and compares it with the Total all programs / Total all credentials record, which is the top record of that extract. 

We can see there is a discrepancy of 165 domestic students; out of 124,000 students this is a 0.13% discrepancy.

I'll now do the same with the international students:

In [None]:
cleaned_program_df[(cleaned_program_df['Province/Territory'] == 'British Columbia') & 
                    (cleaned_program_df['Institution Name'] == 'BC (total)') &
                    (cleaned_program_df['Program type'] == 'Total, program type')].sort_values(by='Domestic Enrolment', ascending=False)

Canadian Status,FY Start,Province/Territory,Institution Name,Program type,Credential type,Domestic Enrolment,International Enrolment
330,2022,British Columbia,BC (total),"Total, program type","Total, credential type",123987,53583
326,2022,British Columbia,BC (total),"Total, program type",Degree (includes applied degree),84054,26790
327,2022,British Columbia,BC (total),"Total, program type",Diploma,16857,16776
325,2022,British Columbia,BC (total),"Total, program type",Certificate,8589,1107
328,2022,British Columbia,BC (total),"Total, program type","Not applicable, credential type",6753,5055
324,2022,British Columbia,BC (total),"Total, program type",Associate degree,4725,3603
329,2022,British Columbia,BC (total),"Total, program type",Other type of credential associated with a pro...,2844,249


In [None]:
# All 'BC (total)' rows for 'Total, program type': one row per credential type plus the grand total
bc_rows = cleaned_program_df[(cleaned_program_df['Institution Name'] == 'BC (total)') &
                             (cleaned_program_df['Program type'] == 'Total, program type')]
bc_grand_total = bc_rows['Credential type'] == 'Total, credential type'

print(f"Total international enrolment across BC (All Programs & All Credentials record row): {bc_rows.loc[bc_grand_total, 'International Enrolment'].sum()}")

print(f"Total international enrolment across BC (sum of individual credentials records): {bc_rows.loc[~bc_grand_total, 'International Enrolment'].sum()}")

Total international enrolment across BC (All Programs & All Credentials record row): 53583
Total international enrolment across BC (sum of individual credentials records): 53580


This discrepancy is 3 out of 53.5k total, which is under 0.01%.

Let's try another province:

In [None]:
# All 'Alberta (total)' rows for 'Total, program type': one row per credential type plus the grand total
ab_rows = cleaned_program_df[(cleaned_program_df['Institution Name'] == 'Alberta (total)') &
                             (cleaned_program_df['Program type'] == 'Total, program type')]
ab_grand_total = ab_rows['Credential type'] == 'Total, credential type'

print(f"Total international enrolment across AB (All Programs & All Credentials record row): {ab_rows.loc[ab_grand_total, 'International Enrolment'].sum()}")

print(f"Total international enrolment across AB (sum of individual credentials records): {ab_rows.loc[~ab_grand_total, 'International Enrolment'].sum()}")

Total international enrolment across AB (All Programs & All Credentials record row): 29139
Total international enrolment across AB (sum of individual credentials records): 29136


In [None]:
# All 'Alberta (total)' rows for 'Total, program type': one row per credential type plus the grand total
ab_rows = cleaned_program_df[(cleaned_program_df['Institution Name'] == 'Alberta (total)') &
                             (cleaned_program_df['Program type'] == 'Total, program type')]
ab_grand_total = ab_rows['Credential type'] == 'Total, credential type'

print(f"Total domestic enrolment across AB (All Programs & All Credentials record row): {ab_rows.loc[ab_grand_total, 'Domestic Enrolment'].sum()}")

print(f"Total domestic enrolment across AB (sum of individual credentials records): {ab_rows.loc[~ab_grand_total, 'Domestic Enrolment'].sum()}")

Total domestic enrolment across AB (All Programs & All Credentials record row): 153537
Total domestic enrolment across AB (sum of individual credentials records): 153543


Both Alberta checks are within single digits of the reported totals; at 29k and 153k students these are discrepancies of well under 0.1%, which we are okay with. 
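
The four checks above follow the same pattern, so they could be folded into one helper. The function below is a hypothetical sketch (`audit_credential_totals` is not defined elsewhere in this notebook); it is demonstrated on the BC rows shown earlier rather than on the full dataframe.

```python
import pandas as pd

def audit_credential_totals(df, institution, column):
    """Compare the grand-total row with the sum of the per-credential rows."""
    rows = df[(df["Institution Name"] == institution)
              & (df["Program type"] == "Total, program type")]
    is_grand = rows["Credential type"] == "Total, credential type"
    reported = int(rows.loc[is_grand, column].sum())
    summed = int(rows.loc[~is_grand, column].sum())
    pct = round(100 * abs(reported - summed) / reported, 3)
    return reported, summed, pct

# BC (total) rows for 'Total, program type', copied from the extract above.
bc_rows = pd.DataFrame({
    "Institution Name": ["BC (total)"] * 7,
    "Program type": ["Total, program type"] * 7,
    "Credential type": [
        "Total, credential type",
        "Degree (includes applied degree)",
        "Diploma",
        "Certificate",
        "Not applicable, credential type",
        "Associate degree",
        "Other type of credential associated with a program",
    ],
    "Domestic Enrolment": [123987, 84054, 16857, 8589, 6753, 4725, 2844],
    "International Enrolment": [53583, 26790, 16776, 1107, 5055, 3603, 249],
})

dom_audit = audit_credential_totals(bc_rows, "BC (total)", "Domestic Enrolment")
intl_audit = audit_credential_totals(bc_rows, "BC (total)", "International Enrolment")
print(dom_audit)   # (123987, 123822, 0.133)
print(intl_audit)  # (53583, 53580, 0.006)
```

Looping this over every `(total)` institution would extend the audit to all provinces in one pass.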

### Calculate revenue

Things covered in this section:
- Combining the tuition fees from `tuition` dataframe with enrolment figures in `enrolment`
- Using tuition fees and 22/23 enrolment to calculate revenue from enrolment/tuition fees
- Projecting 23/24 enrolment based on international % growth rates and domestic enrolment changes, to provide some estimate of 23/24 enrolment all else being equal.
- Using hypothetical scenarios e.g. 10% drop in enrolment, 30% drop, 50% drop from international student enrolment to forecast & estimate revenue losses by school.
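
A minimal sketch of the scenario step, assuming revenue is simply enrolment multiplied by an average fee. The enrolment figure is York U's international total from the tables above; the fee is a made-up placeholder, not a value from the `tuition` dataframe, and `tuition_revenue` is a hypothetical helper.

```python
def tuition_revenue(enrolment: int, avg_fee: float) -> float:
    """Revenue under the simplifying assumption of one average fee per student."""
    return enrolment * avg_fee

intl_enrolment = 8_697        # York U international enrolment, FY start 2022
PLACEHOLDER_FEE = 30_000.0    # hypothetical average fee, NOT a sourced figure

baseline = tuition_revenue(intl_enrolment, PLACEHOLDER_FEE)

losses = {}
for drop in (0.10, 0.30, 0.50):
    remaining = round(intl_enrolment * (1 - drop))
    losses[drop] = baseline - tuition_revenue(remaining, PLACEHOLDER_FEE)
    print(f"{drop:.0%} drop: {remaining:,} students remain, revenue loss {losses[drop]:,.0f}")
```

With real fees the same loop could be applied per institution and per program type; the placeholder fee only serves to show the mechanics.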

Notes:
- As we go forward, we should spot check enrolment in certain program types and credential types against the international enrolment from `combined_df`. There will probably be at least some inconsistencies in how a 'diploma' or 'certificate' is categorized, and because we're working with average tuition fee values these will be informed estimates at best. But there should be no instance where a specific program or credential has more enrolment than the entire 'international' or 'domestic' enrolment from `combined_df`.

## High-level conclusions:

Any relationship to population growth?

## Next Steps