# StatCan Data on Credential type across institutions from 2011-2021

As domestic enrolment at Canadian post-secondary institutions (PSIs) has declined and been replaced by international students, has the mix of credentials they offer changed as a result?

In [227]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [228]:
# Reading in the CSV
df = pd.read_csv("/Users/thomasdoherty/Desktop/canadian-psi-project/psi_data/statcan_data/statcan-credentials.csv", encoding='utf-8')

In [229]:
df.sample(10)

Unnamed: 0,REF_DATE,GEO,DGUID,Field of study,Program type,Credential type,Institution type,Registration status,Status of student in Canada,Gender,...,UOM_ID,SCALAR_FACTOR,SCALAR_ID,VECTOR,COORDINATE,VALUE,STATUS,SYMBOL,TERMINATED,DECIMALS
13543,2015/2016,"Camosun College, British Columbia",,"Total, field of study","Total, program type",Degree (includes applied degree),"Total, institution type",Full-time student,Canadian students,"Total, gender",...,223,units,0,v1445551323,239.1.1.5.1.2.2.1,1080.0,,,,0
25,2014/2015,Canada,2021A000011124,"Total, field of study","Total, program type",Diploma,"Total, institution type",Full-time student,Canadian students,"Total, gender",...,223,units,0,v1438513443,1.1.1.4.1.2.2.1,381246.0,,,,0
14669,2011/2012,"Kwantlen Polytechnic University, British Columbia",,"Total, field of study","Total, program type",Diploma,"Total, institution type",Full-time student,International students,"Total, gender",...,223,units,0,v1445825702,250.1.1.4.1.2.3.1,297.0,,,,0
5979,2019/2020,"Collège TAV, Quebec",,"Total, field of study","Total, program type","Not applicable, credential type","Total, institution type",Full-time student,Canadian students,"Total, gender",...,223,units,0,v1441414362,123.1.1.9.1.2.2.1,36.0,,,,0
8229,2012/2013,Loyalist College of Applied Arts and Technolog...,,"Total, field of study","Total, program type",Degree (includes applied degree),"Total, institution type",Full-time student,Canadian students,"Total, gender",...,223,units,0,v1443109342,158.1.1.5.1.2.2.1,114.0,,,,0
12458,2020/2021,"University of British Columbia, British Columbia",,"Total, field of study","Total, program type",Degree (includes applied degree),"Total, institution type",Full-time student,Canadian students,"Total, gender",...,223,units,0,v1445154520,227.1.1.5.1.2.2.1,34146.0,,,,0
13697,2011/2012,"North Island College, British Columbia",,"Total, field of study","Total, program type",Other type of credential associated with a pro...,"Total, institution type",Full-time student,Canadian students,"Total, gender",...,223,units,0,v1445578871,240.1.1.8.1.2.2.1,27.0,,,,0
8240,2020/2021,Loyalist College of Applied Arts and Technolog...,,"Total, field of study","Total, program type",Degree (includes applied degree),"Total, institution type",Full-time student,International students,"Total, gender",...,223,units,0,v1443109345,158.1.1.5.1.2.3.1,6.0,,,,0
4595,2013/2014,"CÉGEP de Saint-Jérôme, Quebec",,"Total, field of study","Total, program type",Diploma,"Total, institution type",Full-time student,International students,"Total, gender",...,223,units,0,v1441178129,80.1.1.4.1.2.3.1,15.0,,,,0
3799,2013/2014,"CÉGEP de Baie-Comeau, Quebec",,"Total, field of study","Total, program type",Diploma,"Total, institution type",Full-time student,International students,"Total, gender",...,223,units,0,v1440968003,59.1.1.4.1.2.3.1,24.0,,,,0


In [230]:
print(df.columns)

Index(['REF_DATE', 'GEO', 'DGUID', 'Field of study', 'Program type',
       'Credential type', 'Institution type', 'Registration status',
       'Status of student in Canada', 'Gender', 'UOM', 'UOM_ID',
       'SCALAR_FACTOR', 'SCALAR_ID', 'VECTOR', 'COORDINATE', 'VALUE', 'STATUS',
       'SYMBOL', 'TERMINATED', 'DECIMALS'],
      dtype='object')


Drop unnecessary columns

In [231]:
df.drop(['DGUID', 'Field of study', 'Program type', 'Institution type', 'Registration status', 'Gender', 'UOM', 'UOM_ID', 'SCALAR_FACTOR', 'SCALAR_ID', 'VECTOR', 'COORDINATE', 'STATUS', 'SYMBOL', 'TERMINATED', 'DECIMALS'], axis=1, inplace=True)

In [232]:
df.sample(5)

Unnamed: 0,REF_DATE,GEO,Credential type,Status of student in Canada,VALUE
11168,2011/2012,"Mount Royal University, Alberta","Not applicable, credential type",International students,315.0
13597,2014/2015,"Camosun College, British Columbia",Other type of credential associated with a pro...,International students,525.0
4074,2020/2021,"CÉGEP de Limoilou, Quebec","Not applicable, credential type",Canadian students,387.0
3443,2021/2022,"CÉGEP André-Laurendeau, Quebec",Diploma,Canadian students,2466.0
9090,2020/2021,Manitoba,Certificate,International students,615.0


Rename columns so they are clearer to read, using the same processing as the international/domestic split

In [233]:
# rename columns
df.rename(columns={"REF_DATE": "FY Start","GEO": "School/Locality", "Status of student in Canada": "Status", "VALUE": "Enrolment"}, inplace=True)

In [234]:
df["FY Start"] = df["FY Start"].apply(lambda x: int(x[:4]))

Split the provincial data from the individual schools

In [235]:
# List of provinces and territories in Canada
canadian_provinces_territories = [
    "Alberta", "British Columbia", "Manitoba", "New Brunswick", "Newfoundland and Labrador",
    "Nova Scotia", "Ontario", "Prince Edward Island", "Quebec", "Saskatchewan",
    "Northwest Territories", "Nunavut", "Yukon", "Canada"
]

# Convert the list to a set for fast exact matching
province_set = set(canadian_provinces_territories)

# Create a mask for exact matches with Canada or any province/territory
exact_match_mask = df['School/Locality'].isin(province_set)

# Create the Canada & Provinces DataFrame (exact matches)
# .copy() gives an independent DataFrame, so later edits do not raise SettingWithCopyWarning
canada_df = df[exact_match_mask].copy()

# Rows that do not match exactly are individual colleges/universities
schools_df = df[~exact_match_mask].copy()

# Split 'School/Locality' on the last ", " into school name and province/territory
schools_df[['Institution Name', 'Province/Territory']] = schools_df['School/Locality'].str.rsplit(", ", n=1, expand=True)

# Drop the original 'School/Locality' column, which is no longer needed
schools_df = schools_df.drop(columns=['School/Locality'])

# Now, `canada_df` contains rows where School/Locality is exactly Canada or a province/territory, and
# `schools_df` contains rows for individual colleges/universities

# Display the results
print(f"Number of rows in canada_df: {len(canada_df)}")
print(f"Number of rows in schools_df: {len(schools_df)}")

Number of rows in canada_df: 1313
Number of rows in schools_df: 13903
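
As a quick check of the split logic, here is a minimal sketch on toy data. The `toy` frame and the "Hypothetical College" row are made up for illustration; the point is that splitting from the right with `n=1` keeps any extra commas inside the institution name and only peels off the province.

```python
import pandas as pd

# Toy values in the same "Institution, Province" shape as the StatCan GEO column
toy = pd.DataFrame({
    "School/Locality": [
        "Canada",
        "Ontario",
        "Camosun College, British Columbia",
        "CÉGEP régional de Lanaudière à Joliette, Quebec",
        "Hypothetical College, Main Campus, Ontario",  # made-up name with an extra comma
    ]
})

# Exact matches are national/provincial aggregates, everything else is a school
localities = {"Canada", "Ontario", "British Columbia", "Quebec"}
is_locality = toy["School/Locality"].isin(localities)

# .copy() so that adding columns does not trigger SettingWithCopyWarning
toy_schools = toy[~is_locality].copy()

# Split once, from the right, so only the trailing province is separated
toy_schools[["Institution Name", "Province/Territory"]] = (
    toy_schools["School/Locality"].str.rsplit(", ", n=1, expand=True)
)

print(toy_schools[["Institution Name", "Province/Territory"]])
```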



In [236]:
# move Institution Name and Province/Territory to the front of the DataFrame
cols = schools_df.columns.tolist()
cols = cols[-2:] + cols[:-2] # put the last two columns first, then the rest in their original order
schools_df = schools_df[cols]

# light cleaning - remove "of Applied Arts and Technology" from all school names
schools_df['Institution Name'] = schools_df['Institution Name'].str.replace(
    " of Applied Arts and Technology", "", regex=False
)

schools_df['Institution Name'] = schools_df['Institution Name'].str.replace(
    " Institute of Technology and Advanced Learning", "", regex=False
)

In [237]:
schools_df.sample(10)

Unnamed: 0,Institution Name,Province/Territory,FY Start,Credential type,Status,Enrolment
7715,Centennial College,Ontario,2012,Diploma,Canadian students,7263.0
4486,CÉGEP de Saint-Félicien,Quebec,2020,Diploma,International students,183.0
13138,Douglas College,British Columbia,2015,Associate degree,Canadian students,1350.0
3895,CÉGEP régional de Lanaudière à Joliette,Quebec,2015,"Not applicable, credential type",International students,0.0
1614,University of New Brunswick,New Brunswick,2013,Certificate,Canadian students,54.0
13736,Coast Mountain College,British Columbia,2021,Certificate,Canadian students,135.0
3182,Université du Québec à Rimouski,Quebec,2020,Degree (includes applied degree),International students,399.0
2301,Université de Montréal,Quebec,2013,Degree (includes applied degree),International students,5175.0
2280,Université de Montréal,Quebec,2014,Diploma,International students,525.0
11164,Mount Royal University,Alberta,2018,"Not applicable, credential type",Canadian students,1812.0


In [238]:
# rename the School/Locality column to Province/Territory in canada_df
# (assigning the result avoids the SettingWithCopyWarning that inplace=True raises on a slice)
canada_df = canada_df.rename(columns={"School/Locality": "Province/Territory"})

In [239]:
canada_df.sample(5)

Unnamed: 0,FY Start,Province/Territory,Credential type,Status,Enrolment
189,2013,Newfoundland and Labrador,Degree (includes applied degree),International students,1707.0
10418,2011,Alberta,Diploma,Canadian students,25158.0
9985,2020,Saskatchewan,Diploma,International students,900.0
9943,2011,Saskatchewan,Certificate,Canadian students,1845.0
10467,2016,Alberta,"Not applicable, credential type",Canadian students,20073.0


Add Francophone tag

In [240]:
# List of strings that indicate the school is Francophone
francophone_schools = [
    "Université Sainte-Anne", "Collège Boréal", "Collège d'Alfred", 
    "Collège dominicain", "La Cité collégiale", "Université de Hearst", 
    "Université de l'Ontario français", "Université de Moncton", 
    "Collège Communautaire du Nouveau-Brunswick", "Collège de l'Île", "L'École Technique et Professionnelle"
]

# Create the "Francophone" column with 0 as the default value
schools_df['Francophone'] = 0

# Update the "Francophone" column based on the School Name
schools_df['Francophone'] = schools_df.apply(
    lambda row: 1 if (
        any(francophone in row['Institution Name'] for francophone in francophone_schools) or 
        (row['Province/Territory'] == "Quebec" and "McGill University" not in row['Institution Name'])
    ) else 0,
    axis=1
)

# Display the updated DataFrame
print(schools_df[['Institution Name', 'Province/Territory', 'Francophone']].head())


                        Institution Name         Province/Territory  \
238  Memorial University of Newfoundland  Newfoundland and Labrador   
239  Memorial University of Newfoundland  Newfoundland and Labrador   
240  Memorial University of Newfoundland  Newfoundland and Labrador   
241  Memorial University of Newfoundland  Newfoundland and Labrador   
242  Memorial University of Newfoundland  Newfoundland and Labrador   

     Francophone  
238            0  
239            0  
240            0  
241            0  
242            0  
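
The row-wise `apply` above can also be written with vectorized string methods. The following is a sketch on toy data, assuming a shortened marker list (`francophone_markers` is a hypothetical name) and the same Quebec rule as the cell above. Note that the rule, as written, tags every Quebec institution other than McGill, which may also catch other English-language institutions.

```python
import re
import pandas as pd

# Toy rows; names are taken from the notebook's own data and lists
toy = pd.DataFrame({
    "Institution Name": [
        "Université de Moncton", "McGill University", "Collège Bart", "Camosun College",
    ],
    "Province/Territory": ["New Brunswick", "Quebec", "Quebec", "British Columbia"],
})

# Shortened marker list for illustration only
francophone_markers = ["Université de Moncton", "Collège Boréal"]

# Escape each name so characters like "'" or "." are matched literally
pattern = "|".join(re.escape(name) for name in francophone_markers)
name_match = toy["Institution Name"].str.contains(pattern, regex=True)

# Same rule as the apply(): any Quebec institution except McGill
quebec_non_mcgill = (toy["Province/Territory"] == "Quebec") & ~toy[
    "Institution Name"
].str.contains("McGill University", regex=False)

toy["Francophone"] = (name_match | quebec_non_mcgill).astype(int)
print(toy)
```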


In [241]:
schools_df.sample(5)

Unnamed: 0,Institution Name,Province/Territory,FY Start,Credential type,Status,Enrolment,Francophone
13509,Camosun College,British Columbia,2014,Certificate,International students,24.0,0
11570,Grande Prairie Regional College,Alberta,2020,Degree (includes applied degree),Canadian students,,0
5256,Collège Bart,Quebec,2015,Diploma,Canadian students,177.0,1
11771,Medicine Hat College,Alberta,2019,"Not applicable, credential type",International students,90.0,0
8149,Georgian College,Ontario,2016,Diploma,Canadian students,5802.0,0


## National & Provincial level changes in credentials offered 2011-2021

In [242]:
canada_df

Unnamed: 0,FY Start,Province/Territory,Credential type,Status,Enrolment
0,2011,Canada,Certificate,Canadian students,68367.0
1,2012,Canada,Certificate,Canadian students,69231.0
2,2013,Canada,Certificate,Canadian students,74811.0
3,2014,Canada,Certificate,Canadian students,74370.0
4,2015,Canada,Certificate,Canadian students,71352.0
...,...,...,...,...,...
15166,2017,Nunavut,"Not applicable, credential type",Canadian students,99.0
15167,2018,Nunavut,"Not applicable, credential type",Canadian students,117.0
15168,2019,Nunavut,"Not applicable, credential type",Canadian students,129.0
15169,2020,Nunavut,"Not applicable, credential type",Canadian students,147.0


Let's pivot the table so that every row is one province/territory in a given year for either Canadian or international students, and each credential type (certificate, diploma, degree, and so on) becomes its own column.

In [243]:
# List of territories to exclude - numbers are too small
territories = ["Yukon", "Northwest Territories", "Nunavut"]

canada_df = canada_df[
    ~canada_df['Province/Territory'].isin(territories)                     
]

# pivot the table for separate columns for each unique credential type
canada_df = canada_df.pivot_table(
    index=['FY Start', 'Province/Territory', 'Status'], # specify what is staying the same
    columns='Credential type', # specify what is being pivoted
    values='Enrolment', # specifying the values to fill the new pivoted columns
    aggfunc='sum', 
    fill_value=0 # replace NaN with 0
).reset_index()

# clear the leftover 'Credential type' label from the columns axis
canada_df.columns.name = None

In [244]:
canada_df

Unnamed: 0,FY Start,Province/Territory,Status,Associate degree,Certificate,Degree (includes applied degree),Diploma,"Not applicable, credential type",Other type of credential associated with a program
0,2011,Alberta,Canadian students,0.0,8232.0,85542.0,25158.0,19920.0,0.0
1,2011,Alberta,International students,0.0,432.0,8079.0,2220.0,1173.0,0.0
2,2011,British Columbia,Canadian students,7575.0,10593.0,72414.0,24174.0,2712.0,5265.0
3,2011,British Columbia,International students,393.0,411.0,13788.0,3540.0,777.0,1023.0
4,2011,Canada,Canadian students,7575.0,68367.0,799896.0,370542.0,58755.0,33405.0
...,...,...,...,...,...,...,...,...,...
237,2021,Prince Edward Island,International students,0.0,48.0,1452.0,426.0,3.0,0.0
238,2021,Quebec,Canadian students,0.0,8418.0,151677.0,164817.0,12987.0,0.0
239,2021,Quebec,International students,0.0,948.0,40308.0,7896.0,1560.0,0.0
240,2021,Saskatchewan,Canadian students,0.0,1938.0,27387.0,2370.0,3543.0,33.0
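
To make the long-to-wide reshaping concrete, here is a minimal sketch on four rows using the 2011 Alberta certificate and diploma figures from the output above. The `long_df` and `wide_df` names are illustrative only.

```python
import pandas as pd

# Long format: one row per (year, province, status, credential type)
long_df = pd.DataFrame({
    "FY Start": [2011, 2011, 2011, 2011],
    "Province/Territory": ["Alberta"] * 4,
    "Status": [
        "Canadian students", "Canadian students",
        "International students", "International students",
    ],
    "Credential type": ["Certificate", "Diploma", "Certificate", "Diploma"],
    "Enrolment": [8232.0, 25158.0, 432.0, 2220.0],
})

# Wide format: one row per (year, province, status), one column per credential type
wide_df = long_df.pivot_table(
    index=["FY Start", "Province/Territory", "Status"],
    columns="Credential type",
    values="Enrolment",
    aggfunc="sum",
    fill_value=0,
).reset_index()

# Clear the 'Credential type' label left on the columns axis
wide_df.columns.name = None
print(wide_df)
```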


In [245]:
# cast all the credential type columns to integers
for col in canada_df.columns[3:]:
    canada_df[col] = canada_df[col].astype(int)

# rename the credential columns to be more readable
canada_df.rename(columns={
    'Associate degree': 'Associate Enr',
    'Certificate': 'Certificate Enr',
    'Diploma': 'Diploma Enr',
    'Degree (includes applied degree)': 'Degree Enr',
    'Not applicable, credential type': 'N/a',
    'Other type of credential associated with a program': 'Other Enr'
}, inplace=True)

In [246]:
# create a total column for all credentials for each record
canada_df['Total Enr'] = canada_df[
    ['Associate Enr', 'Certificate Enr', 'Diploma Enr', 'Degree Enr', 'N/a', 'Other Enr']
].sum(axis=1)

In [247]:
canada_df

Unnamed: 0,FY Start,Province/Territory,Status,Associate Enr,Certificate Enr,Degree Enr,Diploma Enr,N/a,Other Enr,Total Enr
0,2011,Alberta,Canadian students,0,8232,85542,25158,19920,0,138852
1,2011,Alberta,International students,0,432,8079,2220,1173,0,11904
2,2011,British Columbia,Canadian students,7575,10593,72414,24174,2712,5265,122733
3,2011,British Columbia,International students,393,411,13788,3540,777,1023,19932
4,2011,Canada,Canadian students,7575,68367,799896,370542,58755,33405,1338540
...,...,...,...,...,...,...,...,...,...,...
237,2021,Prince Edward Island,International students,0,48,1452,426,3,0,1929
238,2021,Quebec,Canadian students,0,8418,151677,164817,12987,0,337899
239,2021,Quebec,International students,0,948,40308,7896,1560,0,50712
240,2021,Saskatchewan,Canadian students,0,1938,27387,2370,3543,33,35271


Need to audit: are there any records where N/a makes up a large share (over 15%) of total enrolment?

In [255]:
# check any records where N/a is more than 15% of Total Enr
na_mask = canada_df['N/a'] > 0.15 * canada_df['Total Enr']
print(na_mask.value_counts())

False    242
Name: count, dtype: int64


In [256]:
# check any records where Other Enr is more than 7% of Total Enr
other_mask = canada_df['Other Enr'] > 0.07 * canada_df['Total Enr']
print(other_mask.value_counts())

False    242
Name: count, dtype: int64


Above: there are no records where N/a is greater than 15% or Other is greater than 7% of total enrolment, so the data quality should be good.
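
Rather than only counting records over a threshold, it can help to look at the shares themselves. This is a sketch on two rows copied from the 2011 table above (Alberta and British Columbia, Canadian students); the `audit` frame and the share column names are illustrative assumptions.

```python
import pandas as pd

# Two rows taken from the pivoted table above (2011, Canadian students)
audit = pd.DataFrame({
    "Province/Territory": ["Alberta", "British Columbia"],
    "N/a": [19920, 2712],
    "Other Enr": [0, 5265],
    "Total Enr": [138852, 122733],
})

# Share of each record's total enrolment that is N/a or Other
audit["N/a share"] = audit["N/a"] / audit["Total Enr"]
audit["Other share"] = audit["Other Enr"] / audit["Total Enr"]

# Same thresholds as the checks above: 15% for N/a, 7% for Other
flagged = audit[(audit["N/a share"] > 0.15) | (audit["Other share"] > 0.07)]

print(audit[["Province/Territory", "N/a share", "Other share"]].round(3))
print(f"Flagged records: {len(flagged)}")
```

On these two rows nothing is flagged, although the Alberta row sits at roughly 14% N/a, just under the 15% cut-off, so sorting by share is a useful way to see which records are borderline.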

In [259]:
# spot check which provinces have any associate degree enrolment at all
canada_df[canada_df['Associate Enr'] > 0]

Unnamed: 0,FY Start,Province/Territory,Status,Associate Enr,Certificate Enr,Degree Enr,Diploma Enr,N/a,Other Enr,Total Enr
2,2011,British Columbia,Canadian students,7575,10593,72414,24174,2712,5265,122733
3,2011,British Columbia,International students,393,411,13788,3540,777,1023,19932
4,2011,Canada,Canadian students,7575,68367,799896,370542,58755,33405,1338540
5,2011,Canada,International students,393,6777,86757,20328,9693,2079,126027
24,2012,British Columbia,Canadian students,9510,10422,72180,22452,2742,4371,121677
25,2012,British Columbia,International students,705,501,14397,3723,801,1059,21186
26,2012,Canada,Canadian students,9510,69231,810798,371742,58299,31812,1351392
27,2012,Canada,International students,705,7485,96162,22959,10761,2172,140244
46,2013,British Columbia,Canadian students,8805,11277,80460,21039,3972,4041,129594
47,2013,British Columbia,International students,828,483,17508,4272,1203,1038,25332


I want to see whether the composition of credentials offered across the country and the provinces has changed over these ten years

In [260]:
import plotly.express as px

# List of provinces/territories to loop through
provinces = canada_df['Province/Territory'].unique()
statuses = canada_df['Status'].unique()

# Loop through each province and status to create separate plots
for province in provinces:
    for status in statuses:
        # Filter data for the current province and status
        province_status_data = canada_df[
            (canada_df['Province/Territory'] == province) & 
            (canada_df['Status'] == status)
        ]

        # Create a line plot for the enrolment data
        fig = px.line(
            province_status_data,
            x='FY Start',
            y=['Associate Enr', 'Certificate Enr', 'Degree Enr', 'Diploma Enr', 'Other Enr', 'Total Enr'],
            title=f'Enrolment Trends in {province} ({status})',
            labels={'value': 'Enrolment', 'variable': 'Credential Type'},
            markers=True,
        )

        # Customize markers to appear as "X"s
        fig.update_traces(marker=dict(symbol='x'))

        # Show the interactive plot
        fig.show()



### Thoughts on Provincial Data

Degrees remain the most popular credential, but most of the growth among international students is coming from diplomas and, to some extent, certificates.

Just from eyeballing the plots, there is a noticeable spike in Total Enr in 2019 for most provinces, and it is mirrored in Diploma Enr, which is clearly contributing to it. In the graphs, when the light blue line (Total) climbs faster, it is because the purple line (Diploma) is pushing it up. Other credentials such as degrees are also growing, but diploma enrolment is growing faster.

Quebec is a unique case: domestically, there are more diploma students than degree students! This is not the case for international students, though.

In Manitoba, almost all of the international growth has gone to degrees; diplomas climbed initially but then receded in the COVID years.

Ontario shows substantial growth in international certificate and diploma enrolment, while degree enrolment steadily ticks up.
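
One way to move beyond eyeballing is to compute year-over-year growth and the diploma share of total enrolment directly. The sketch below uses only the 2011 and 2012 Canada-wide international rows shown earlier; `canada_intl` is an illustrative name, and the same calls would apply to the full `canada_df` after grouping by province and status.

```python
import pandas as pd

# Canada, international students, 2011 and 2012 (figures from the table above)
canada_intl = pd.DataFrame({
    "FY Start": [2011, 2012],
    "Degree Enr": [86757, 96162],
    "Diploma Enr": [20328, 22959],
    "Total Enr": [126027, 140244],
}).set_index("FY Start")

# Year-over-year growth rate for each column (first year has no prior value)
growth = canada_intl.pct_change().dropna()

# Diploma enrolment as a share of total enrolment in each year
diploma_share = canada_intl["Diploma Enr"] / canada_intl["Total Enr"]

print(growth.round(4))
print(diploma_share.round(4))
```

Even in this two-year slice, diploma enrolment grows faster than degree enrolment (about 12.9% versus 10.8%), which is consistent with the pattern described above.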

## Visualising school changes in credentials offered over time

Similar to the international/domestic notebook, we probably want to ignore the territories for now due to low numbers here.

We will also now pivot the table so that every school and every province shows enrolment by unique credential type as a column.