# Preparation & Cleaning of the ON Colleges Enrolment by Credentials worksheet

In [25]:
# imports for reading the Excel workbook, data wrangling and plotting

import openpyxl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [26]:
# load and read the Credentials sheet from the workbook
creds_df = pd.read_excel('/Users/thomasdoherty/Desktop/canadian-psi-project/psi_data/cleaning_copy_excel/on_college_2012-/2012-2022 college_enrolment_headcount.xlsx', sheet_name='Credentials')

In [27]:
creds_df

Unnamed: 0,College Name,Fiscal Year,Credential Type Description,Headcount Full-Time Fall
0,Algonquin College,2012-2013,Certificate,2577
1,Algonquin College,2012-2013,Degree (includes applied degree),381
2,Algonquin College,2012-2013,Diploma,13110
3,Algonquin College,2012-2013,General Equivalency Diploma / High School Diploma,0
4,Algonquin College,2012-2013,Not Available or Not Applicable,0
...,...,...,...,...
1159,St. Lawrence College,2022-2023,Certificate,2262
1160,St. Lawrence College,2022-2023,Degree (includes applied degree),755
1161,St. Lawrence College,2022-2023,Diploma,8227
1162,St. Lawrence College,2022-2023,Not Applicable,0


In [28]:
creds_df.shape

(1164, 4)

**Remember: in the original MCU data, any Headcount between 0 and 9 is suppressed and shown as `*`. We will replace each `*` with 5 as an estimate.**

In [46]:
# replace instances of * in Headcount with 5
creds_df['Headcount Full-Time Fall'] = creds_df['Headcount Full-Time Fall'].replace('*', 5)

In [47]:
creds_df.shape

(1164, 4)
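A possible hardening of the step above, sketched on a made-up toy frame rather than the real `creds_df`: if the column was read in as text because of the `*` markers, it may be worth coercing it to a numeric dtype and keeping a flag for the estimated rows. The `Headcount Suppressed` column and the `SUPPRESSED_ESTIMATE` name are my own additions for illustration, not part of the MCU data.

```python
import pandas as pd

# Toy data standing in for creds_df; the values are illustrative only.
toy = pd.DataFrame({"Headcount Full-Time Fall": [2577, "*", "381", 0]})

SUPPRESSED_ESTIMATE = 5  # same estimate used above for suppressed 0-9 counts

col = toy["Headcount Full-Time Fall"]

# Flag the suppressed rows before overwriting them (hypothetical helper column).
suppressed = col.astype(str).str.strip() == "*"

# Coerce everything to numbers; '*' becomes NaN, then gets the estimate.
numeric = pd.to_numeric(col, errors="coerce").mask(suppressed, SUPPRESSED_ESTIMATE)

# Nullable integer dtype keeps any genuinely missing values as <NA>.
toy["Headcount Full-Time Fall"] = numeric.astype("Int64")
toy["Headcount Suppressed"] = suppressed
```

Keeping the flag makes it possible to exclude or re-estimate the suppressed rows later without going back to the workbook.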

### Add a TOTAL Full-Time Enrolment figure for each school in each year, so we can build a % share of the credential in the student body

In [62]:
# group the data by each college, every year, and find the total head count
creds_df.groupby(['College Name', 'Fiscal Year'])['Headcount Full-Time Fall'].sum()

College Name          Fiscal Year
Algonquin College     2012-2013      16068
                      2013-2014      16844
                      2014-2015      17025
                      2015-2016      17435
                      2016-2017      17385
                                     ...  
St. Lawrence College  2018-2019       8795
                      2019-2020       9014
                      2020-2021       9228
                      2021-2022       9037
                      2022-2023      11244
Name: Headcount Full-Time Fall, Length: 264, dtype: int64

In [70]:
# add a column that displays the total headcount in that year for each college
creds_df["School's Total Headcount This Year"] = creds_df.groupby(['College Name', 'Fiscal Year'])['Headcount Full-Time Fall'].transform('sum')
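A small sketch of why `transform('sum')` is used here rather than the plain `.sum()` from the previous cell. The toy frame and its college names are made up for illustration; only the column names match `creds_df`.

```python
import pandas as pd

# Toy data: two colleges, one fiscal year, two credential rows each.
toy = pd.DataFrame({
    "College Name": ["A", "A", "B", "B"],
    "Fiscal Year": ["2012-2013"] * 4,
    "Headcount Full-Time Fall": [10, 30, 5, 15],
})
keys = ["College Name", "Fiscal Year"]

# .sum() collapses each group to a single row (one per college and year).
totals = toy.groupby(keys)["Headcount Full-Time Fall"].sum()

# .transform('sum') returns one value per original row, aligned to the
# original index, so it can be assigned straight back as a new column.
per_row = toy.groupby(keys)["Headcount Full-Time Fall"].transform("sum")
toy["School's Total Headcount This Year"] = per_row
```

Here `totals` has two rows while `per_row` has four, which is what lets the group total sit next to every credential row.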

In [74]:
creds_df

Unnamed: 0,College Name,Fiscal Year,Credential Type Description,Headcount Full-Time Fall,School's Total Headcount This Year
0,Algonquin College,2012-2013,Certificate,2577,16068
1,Algonquin College,2012-2013,Degree (includes applied degree),381,16068
2,Algonquin College,2012-2013,Diploma,13110,16068
3,Algonquin College,2012-2013,General Equivalency Diploma / High School Diploma,0,16068
4,Algonquin College,2012-2013,Not Applicable,0,16068
...,...,...,...,...,...
1159,St. Lawrence College,2022-2023,Certificate,2262,11244
1160,St. Lawrence College,2022-2023,Degree (includes applied degree),755,11244
1161,St. Lawrence College,2022-2023,Diploma,8227,11244
1162,St. Lawrence College,2022-2023,Not Applicable,0,11244


Above: I've now added a column that lets us calculate the share of students on a given Credential type against the entire student body.

I want to clean up the lengthy and inconsistent Credential Type Descriptions. Some records are 'Not Applicable' and some are 'Not Available or Not Applicable', so below I will combine them under the single label 'Not Applicable'.

In [67]:
creds_df.loc[
    (creds_df['Credential Type Description'] == 'Not Applicable') | 
    (creds_df['Credential Type Description'] == 'Not Available or Not Applicable'), 
    'Credential Type Description'] = 'Not Applicable'
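An alternative way to do the same relabelling, sketched on a toy frame: a mapping passed to `Series.replace` may scale better if more labels need tidying later. The `label_map` name is my own, and it only contains the one mapping described above.

```python
import pandas as pd

# Toy data standing in for the Credential Type Description column.
toy = pd.DataFrame({
    "Credential Type Description": [
        "Not Applicable",
        "Not Available or Not Applicable",
        "Diploma",
    ]
})

# Old label -> new label; labels not listed are left untouched.
label_map = {"Not Available or Not Applicable": "Not Applicable"}

toy["Credential Type Description"] = (
    toy["Credential Type Description"].replace(label_map)
)
```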

I will now add the **share of the student headcount** that each credential type accounts for. Which credentials dominate the school landscape?

In [79]:
# new column: Headcount Full-Time Fall as a percentage of School's Total Headcount This Year, rounded to 2 decimals
creds_df['Credential Share of Headcount'] = round((creds_df['Headcount Full-Time Fall'] / creds_df["School's Total Headcount This Year"]) * 100, 2)

In [80]:
creds_df

Unnamed: 0,College Name,Fiscal Year,Credential Type Description,Headcount Full-Time Fall,School's Total Headcount This Year,Credential Share of Headcount
0,Algonquin College,2012-2013,Certificate,2577,16068,16.04
1,Algonquin College,2012-2013,Degree (includes applied degree),381,16068,2.37
2,Algonquin College,2012-2013,Diploma,13110,16068,81.59
3,Algonquin College,2012-2013,General Equivalency Diploma / High School Diploma,0,16068,0.00
4,Algonquin College,2012-2013,Not Applicable,0,16068,0.00
...,...,...,...,...,...,...
1159,St. Lawrence College,2022-2023,Certificate,2262,11244,20.12
1160,St. Lawrence College,2022-2023,Degree (includes applied degree),755,11244,6.71
1161,St. Lawrence College,2022-2023,Diploma,8227,11244,73.17
1162,St. Lawrence College,2022-2023,Not Applicable,0,11244,0.00
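A quick sanity check that could follow the share calculation, sketched on toy data: the shares for each college and year should add up to roughly 100. The first three rows reuse the Algonquin 2012-2013 figures shown above; "College B" is a made-up group with a zero total to show that 0 / 0 yields NaN rather than an error. The 0.05 tolerance is my own assumption to allow for rounding.

```python
import pandas as pd

hc = "Headcount Full-Time Fall"
keys = ["College Name", "Fiscal Year"]

toy = pd.DataFrame({
    "College Name": ["Algonquin College"] * 3 + ["College B"] * 2,
    "Fiscal Year": ["2012-2013"] * 5,
    hc: [2577, 381, 13110, 0, 0],
})

total = toy.groupby(keys)[hc].transform("sum")
toy["Credential Share of Headcount"] = (toy[hc] / total * 100).round(2)

# Sum the shares per group and compare against 100, ignoring empty groups.
share_sums = toy.groupby(keys)["Credential Share of Headcount"].sum()
group_totals = toy.groupby(keys)[hc].sum()
checked = share_sums[group_totals > 0]
off = checked[(checked - 100).abs() > 0.05]  # groups that fail the check
```

An empty `off` means every non-empty group sums to 100 within the tolerance.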


## BOUNDARY: Earlier work below which may now be redundant

In [68]:
# unique entries for Credential Type Description
creds_df['Credential Type Description'].value_counts()

Credential Type Description
Not Applicable                                          295
Certificate                                             264
Diploma                                                 264
Degree (includes applied degree)                        210
Other type of credential associated with a program       79
General Equivalency Diploma / High School Diploma        27
Attestation and other credentials for short programs     25
Name: count, dtype: int64

I want to note whether these institutions **offer** certain qualifications like degrees, as well as the change in enrolment to these programs.
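One way this could be approached, as a rough sketch on made-up data: pivot the credential types into columns and treat a positive headcount as "offered". That definition is an assumption on my part; a college might offer a credential and still report a headcount of 0.

```python
import pandas as pd

# Toy data: college A has degree students, college B reports none.
toy = pd.DataFrame({
    "College Name": ["A", "A", "B", "B"],
    "Fiscal Year": ["2012-2013"] * 4,
    "Credential Type Description": [
        "Diploma", "Degree (includes applied degree)",
        "Diploma", "Degree (includes applied degree)",
    ],
    "Headcount Full-Time Fall": [100, 20, 50, 0],
})

# One row per college and year, one column per credential type.
wide = toy.pivot_table(
    index=["College Name", "Fiscal Year"],
    columns="Credential Type Description",
    values="Headcount Full-Time Fall",
    aggfunc="sum",
    fill_value=0,
)

# Assumption: "offers" means at least one full-time student enrolled.
offers = wide > 0
```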

In [49]:
# filter to the rows where the credential type is Certificate
creds_df[creds_df['Credential Type Description'] == 'Certificate']

Unnamed: 0,College Name,Fiscal Year,Credential Type Description,Headcount Full-Time Fall
0,Algonquin College,2012-2013,Certificate,2577
5,Algonquin College,2013-2014,Certificate,2856
10,Algonquin College,2014-2015,Certificate,2937
15,Algonquin College,2015-2016,Certificate,3094
20,Algonquin College,2016-2017,Certificate,3075
...,...,...,...,...
1135,St. Lawrence College,2018-2019,Certificate,1067
1141,St. Lawrence College,2019-2020,Certificate,1150
1147,St. Lawrence College,2020-2021,Certificate,1022
1153,St. Lawrence College,2021-2022,Certificate,1431


Above: That is every school in every fiscal year that has a record of a certificate credential and the headcount.

Below: Same with Degrees, Diplomas, Not Applicable & Other

In [50]:
certificates = creds_df[creds_df['Credential Type Description'] == 'Certificate']

In [51]:
diplomas = creds_df[creds_df['Credential Type Description'] == 'Diploma']

In [52]:
degrees = creds_df[creds_df['Credential Type Description'] == 'Degree (includes applied degree)']

In [53]:
cred_na = creds_df[(creds_df['Credential Type Description'] == 'Not Applicable') | (creds_df['Credential Type Description'] == 'Not Available or Not Applicable')]

In [56]:
# check
creds_df['Credential Type Description'].value_counts()

Credential Type Description
Not Applicable                                          295
Certificate                                             264
Diploma                                                 264
Degree (includes applied degree)                        210
Other type of credential associated with a program       79
General Equivalency Diploma / High School Diploma        27
Attestation and other credentials for short programs     25
Name: count, dtype: int64

In [57]:
cred_other = creds_df[creds_df['Credential Type Description'] == 'Other type of credential associated with a program']

I don't think the GED / High School Diploma or the Attestation for short programs categories are particularly relevant, so I will ignore them for now.

**Next step: Look at the student numbers in each of those Credential types and track their changes Year-on-Year**

In [69]:
# What is the headcount for degree programs for each school in each year?
degrees.groupby(['College Name', 'Fiscal Year'])['Headcount Full-Time Fall'].sum()

College Name          Fiscal Year
Algonquin College     2012-2013      381
                      2013-2014      437
                      2014-2015      507
                      2015-2016      548
                      2016-2017      612
                                    ... 
St. Lawrence College  2018-2019      855
                      2019-2020      829
                      2020-2021      818
                      2021-2022      825
                      2022-2023      755
Name: Headcount Full-Time Fall, Length: 210, dtype: int64
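A minimal sketch of how the year-on-year change could be derived from a grouped series like the one above. The toy frame reuses the first three Algonquin degree headcounts from that output; the `yoy_change` and `yoy_pct` names are my own.

```python
import pandas as pd

toy = pd.DataFrame({
    "College Name": ["Algonquin College"] * 3,
    "Fiscal Year": ["2012-2013", "2013-2014", "2014-2015"],
    "Headcount Full-Time Fall": [381, 437, 507],
})

yearly = (
    toy.groupby(["College Name", "Fiscal Year"])["Headcount Full-Time Fall"]
    .sum()
    .sort_index()  # fiscal-year labels sort chronologically as strings
)

# Group by college so the first year of one college is never compared
# with the last year of the previous one.
by_college = yearly.groupby(level="College Name")
yoy_change = by_college.diff()                      # absolute change
yoy_pct = (by_college.pct_change() * 100).round(2)  # percentage change
```

The first year for each college has no previous year to compare against, so both series hold NaN there.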