## Impact of IRCC changes on tuition fee revenue for schools

Goals of this notebook:
1. Use the international enrolment data found and analysed above to identify and quantify financial risk of tuition fee revenue decline for institutions, based on the January 2024 IRCC changes.

Process & Methodology:
1. Find the tuition fees charged by PSIs, starting with the provincial and program-type averages available from StatCan.
2. Find program- and credential-level enrolment at PSIs and estimate the 'income' at risk from these programs (tuition multiplied by international enrolment).
3. Project changes under hypothetical scenarios ranging from best case to worst case.

Important:
- This section relies on projections and estimates of best to worst case scenarios. As 2025 unfolds, the data will become more concrete.


### Imports, load dataset, create pipeline

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

# for the preprocessing pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

From Statistics Canada: 
- [Canadian and international student tuition fees by level of study (current dollars)](https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3710004501)

This section will import the tuition fees table for 2023/24, as well as the number of graduate and undergraduate enrolments by international/domestic status. 

The IRCC changes were announced in January 2024, during the 2023/2024 FY, so this year is the most accurate baseline from which to project losses for postsecondary institutions, beginning with enrolment declines from September 2024 onwards. Further updates have followed throughout the 2024 calendar year.

In a later version we may be able to use the number of permits actually issued (when the data are available), but for now we will estimate based on hypothetical scenarios, e.g. a blanket 25% decline for all institutions.
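As a minimal sketch of the scenario approach described above, the cell below multiplies tuition by international enrolment to get a baseline 'income', then applies a percentage decline per scenario. The two tuition figures are the Canada-wide 2023/2024 averages from the StatCan table used in this notebook; the enrolment counts and the best/worst case percentages are hypothetical placeholders, not real data (only the blanket 25% comes from the text).

```python
import pandas as pd

# Canada-wide average international tuition, 2023/2024 (StatCan table used in this notebook)
tuition_by_level = {
    "International undergraduate": 38251,
    "International graduate": 22114,
}

# HYPOTHETICAL enrolment for a made-up institution (illustration only, not real data)
enrolment_by_level = {
    "International undergraduate": 1000,
    "International graduate": 200,
}

# HYPOTHETICAL scenarios: share of international enrolment lost
scenarios = {"best case": 0.10, "blanket 25%": 0.25, "worst case": 0.50}

# Baseline 'income' = tuition multiplied by international enrolment, summed over levels
baseline_revenue = sum(
    tuition_by_level[level] * enrolment_by_level[level] for level in tuition_by_level
)

# Revenue at risk under each scenario = baseline income * assumed decline
rows = [
    {"scenario": name, "decline": decline, "revenue_at_risk": baseline_revenue * decline}
    for name, decline in scenarios.items()
]
projection = pd.DataFrame(rows).set_index("scenario")
print(projection)
```

This assumes the decline applies uniformly across levels of study; a later version could apply a separate decline per level or per institution.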

In [None]:
# open the csv file
tuition = pd.read_csv('/Users/thomasdoherty/Desktop/canadian-psi-project/psi_data/statcan_data/23-24_fees.csv')

In [None]:
tuition.sample(5)

Unnamed: 0,REF_DATE,GEO,DGUID,Level of study,UOM,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
35,2023/2024,Saskatchewan,2016A000247,International graduate,Current dollars,75,units,0,v96320887,9.4,9280.0,,,,0
9,2023/2024,Prince Edward Island,2016A000211,Canadian graduate,Current dollars,75,units,0,v96320861,3.2,5750.0,,,,0
10,2023/2024,Prince Edward Island,2016A000211,International undergraduate,Current dollars,75,units,0,v96320862,3.3,19128.0,,,,0
2,2023/2024,Canada,2016A000011124,International undergraduate,Current dollars,75,units,0,v96320854,1.3,38251.0,,,,0
46,2023/2024,Yukon,2016A000260,International undergraduate,Current dollars,75,units,0,v1073542480,12.3,,..,,,0


We'll build a preprocessing pipeline with custom transformers to clean this up, as we did for the enrolment data. The pipeline will also be useful for the program enrolment data later, since the key steps are similar:
- Removing the many columns that are of no interest.
- Renaming some key columns and formatting the values.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

# 1. DropColumns transformer
class DropColumns(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(columns=self.columns, errors='ignore')

# 2. RenameColumns transformer
class RenameColumns(BaseEstimator, TransformerMixin):
    def __init__(self, rename_map):
        """
        parameter - rename_map: Dict of columns to rename, e.g.
        {"GEO": "province/territory", "REF_DATE": "FY Start", "VALUE": "Annual Tuition (CAD)"}
        """
        self.rename_map = rename_map

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.rename(columns=self.rename_map)

# 3. FormatFYStart transformer
class FormatFYStart(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        """
        parameter - column: The column containing strings like "2023/2024"
        """
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        """
        Converts something like "2023/2024" → "2023" by splitting on '/'
        and keeping the first piece (the year in which the FY starts).
        """
        def to_short_fy(fy_str):
            # E.g. "2023/2024" -> split -> ["2023", "2024"] -> "2023"
            parts = fy_str.split('/')
            if len(parts) == 2:
                return parts[0]
            # If format is unexpected, just return the original string
            return fy_str

        X = X.copy()
        X[self.column] = X[self.column].astype(str).apply(to_short_fy)
        return X


# 4. FormatValue transformer
class FormatValue(BaseEstimator, TransformerMixin):
    def __init__(self, column, fill_value=None):
        """
        parameter - column: The column containing numeric tuition or other values (possibly NaN or trailing .0)
        parameter - fill_value: Can be none or 0. If None, keeps NaN (pandas integer compatible)
        """
        self.column = column
        self.fill_value = fill_value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        """
        Converts tuition values to integers (removes trailing .0).
        If 'fill_value' is specified, NaN values are replaced with that value.
        Otherwise, they remain NaN.
        """
        X = X.copy()
        if self.fill_value is not None:
            # Optionally fill NaNs with a default value
            X[self.column] = X[self.column].fillna(self.fill_value)

        # Convert non-NaN values to int
        def to_int_or_nan(val):
            if pd.isna(val):
                return pd.NA  # or leave as np.nan
            return int(float(val))

        X[self.column] = X[self.column].apply(to_int_or_nan)
        return X

In [None]:
from sklearn.pipeline import Pipeline

# Example usage:
tuition_pipeline = Pipeline(steps=[
    ('drop_columns', DropColumns(columns=[
        'DGUID', 'UOM', 'UOM_ID', 'SCALAR_FACTOR', 'SCALAR_ID', 'VECTOR', 'COORDINATE', 'STATUS', 'SYMBOL', 'TERMINATED', 'DECIMALS'
    ])),
    ('rename_columns', RenameColumns(rename_map={
        "GEO": "Province/Territory",
        "REF_DATE": "FY Start",
        "VALUE": "Annual Tuition (CAD)"
    })),
    ('format_fy', FormatFYStart(column="FY Start")),
    ('format_tuition', FormatValue(column="Annual Tuition (CAD)"))
])

In [None]:
# Apply to the tuition DataFrame
cleaned_tuition_df = tuition_pipeline.fit_transform(tuition)

In [None]:
cleaned_tuition_df.sample(5)

Unnamed: 0,FY Start,Province/Territory,Level of study,Annual Tuition (CAD)
42,2023,British Columbia,International undergraduate,35469.0
32,2023,Saskatchewan,Canadian undergraduate,9240.0
46,2023,Yukon,International undergraduate,
4,2023,Newfoundland and Labrador,Canadian undergraduate,3593.0
35,2023,Saskatchewan,International graduate,9280.0


In [None]:
# any nulls in the tuition fee column of the dataframe
cleaned_tuition_df[cleaned_tuition_df['Annual Tuition (CAD)'].isnull()]

Unnamed: 0,FY Start,Province/Territory,Level of study,Annual Tuition (CAD)
45,2023,Yukon,Canadian graduate,
46,2023,Yukon,International undergraduate,
47,2023,Yukon,International graduate,


In [None]:
# drop Yukon, Northwest Territories, Nunavut - small size and not in the PSI data
territories = ['Yukon', 'Northwest Territories', 'Nunavut']
cleaned_tuition_df = cleaned_tuition_df[~cleaned_tuition_df['Province/Territory'].isin(territories)]

### EDA of Tuition fees

Let's do some short EDA work on these tuition fees to see which are the most and least expensive.
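One possible way to answer the "most and least expensive" question directly is a `groupby` with `idxmin`/`idxmax`. The sketch below runs on a small hand-built frame (named `sample` here purely for illustration) containing only a few rows that appear in the outputs of this notebook, so the result describes that subset only, not the full table. The 'Canada' row is left out because it is an aggregate rather than a province.

```python
import pandas as pd

# A few rows copied from the notebook outputs; NOT the full StatCan table
sample = pd.DataFrame({
    "Province/Territory": [
        "Newfoundland and Labrador", "Quebec", "Saskatchewan",
        "Prince Edward Island", "British Columbia",
    ],
    "Level of study": [
        "Canadian undergraduate", "Canadian undergraduate", "Canadian undergraduate",
        "International undergraduate", "International undergraduate",
    ],
    "Annual Tuition (CAD)": [3593, 3489, 9240, 19128, 35469],
})

# Group tuition by level of study, then look up the row label of the min and max fee
by_level = sample.groupby("Level of study")["Annual Tuition (CAD)"]
cheapest_rows = by_level.idxmin()
priciest_rows = by_level.idxmax()

# Map those row labels back to province names, one row per level of study
extremes = pd.DataFrame(
    {
        "cheapest": sample.loc[cheapest_rows, "Province/Territory"].to_numpy(),
        "most_expensive": sample.loc[priciest_rows, "Province/Territory"].to_numpy(),
    },
    index=cheapest_rows.index,
)
print(extremes)
```

On the real data the same pattern should work on `cleaned_tuition_df` after filtering out the 'Canada' row.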

In [None]:
cleaned_tuition_df[cleaned_tuition_df['Province/Territory'] == 'Canada']

Unnamed: 0,FY Start,Province/Territory,Level of study,Annual Tuition (CAD)
0,2023,Canada,Canadian undergraduate,7152
1,2023,Canada,Canadian graduate,7542
2,2023,Canada,International undergraduate,38251
3,2023,Canada,International graduate,22114


In [None]:
cleaned_tuition_df.sample(3)

Unnamed: 0,FY Start,Province/Territory,Level of study,Annual Tuition (CAD)
5,2023,Newfoundland and Labrador,Canadian graduate,3435
20,2023,Quebec,Canadian undergraduate,3489
31,2023,Manitoba,International graduate,12730


In [None]:
fig = px.strip(
    cleaned_tuition_df,
    x="Level of study",                # Categories on the x-axis (e.g., "Canadian undergraduate")
    y="Annual Tuition (CAD)",
    color="Province/Territory",        # Color by province/territory
    color_discrete_map={"Canada": "black"},  # Specify Canada to appear in black
    hover_data=["Province/Territory"], # Hover shows which province a point belongs to
    template='plotly'
)

fig.update_layout(
    title="Tuition Fees by Level of Study (Canada avg in Black)",
    xaxis_title="Level of Study",
    yaxis_title="Annual Tuition (CAD)",
    legend_title="Province/Territory"
)

fig.show()

In [None]:
fig = px.box(
    cleaned_tuition_df,
    x="Level of study",
    y="Annual Tuition (CAD)",
    color="Level of study",
    hover_data=["Province/Territory"],
    points="outliers"
)
fig.show()

Canada is deliberately marked in black in the strip plot above so that the national average stands out as a reference point.

There are two ways the 'Canada' figure may have been calculated:
1. By treating Canada as if it were a single province: multiplying the program fee in each category by the enrolment, summing the totals, and dividing by the total number of students (assuming that is how the provincial figures were calculated).
2. By taking a weighted average of the provincial figures: multiplying each provincial figure by that province's share of Canada's population and summing the results.

I'm assuming the first: Canada is calculated with the same logic as the individual provinces, just across every data point in every province. Given that Ontario has the most students, it makes sense that the Canada data point sits closest to Ontario. Ontario also has the highest fees of any province in the International undergraduate category, which pulls the Canada average towards the higher end of the fee range.
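To make the difference between the two calculations concrete, here is a small sketch with two made-up provinces. All fees, enrolments and population shares below are hypothetical and chosen only so the arithmetic is easy to follow; it does not claim to reproduce how StatCan actually computes the figure.

```python
import numpy as np

# HYPOTHETICAL inputs for two made-up provinces, A and B
fees = np.array([40000.0, 20000.0])       # average fee in each province
enrolment = np.array([3000, 1000])        # students in each province
population_share = np.array([0.5, 0.5])   # each province's share of the national population

# Method 1: pool every student, i.e. total fees paid divided by total students
pooled_mean = (fees * enrolment).sum() / enrolment.sum()

# Method 2: weight each provincial figure by its population share, then sum
population_weighted_mean = (fees * population_share).sum()

print(pooled_mean)               # pulled towards province A, which has more students
print(population_weighted_mean)  # ignores where the students actually are
```

With these inputs the pooled figure is 35,000 and the population-weighted figure is 30,000, so the first method sits closer to the province with the most students, which matches the pattern described above for Ontario.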