# VISUALISATION OF 2015-2024 STATSCAN JOB PROJECTIONS

This is my attempt at learning data analysis, visualization and Python (with the Anaconda distribution). As a frequent user of the new Job Bank website provided by the Government of Canada, I found it frustrating to have to search for individual job descriptions one at a time to find job projections.

Working with data provided by Statistics Canada, I tasked myself with developing a bigger-picture view of the projected job market as envisioned by the government statisticians.

## Data Preparation

I was able to obtain several files from StatsCan but opted to use only the summary CSV file, which contained all the data of interest. The steps taken were as follows:
1. Imported all the necessary libraries and modules
2. Set up Plotly for offline use in interactive plotting
3. Selected and renamed columns, since the source headers mix English and French (Canada's two official languages)
4. Split the DataFrame to separate the summary rows that aggregate job groups from the range of individual occupations used for the actual analysis


In [18]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd

# plotly could be installed easily with `pip install plotly`
# Use plotly in offline mode - no user account necessary
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
init_notebook_mode()


#Turn off chained indexing warning
pd.options.mode.chained_assignment = None  # default='warn'

data=pd.read_csv("StatsCan_projections/Summary_sommaire_2015_2024.csv",encoding='latin-1')
#imports the summary csv file with job projection data       
    

#The following picks out English columns and avoids French ones which are duplicates
df=data[['Code', 'Occupation_Name',
       'Employment_emploi_2014', 'Employment_Growth_croissance_emploi',
       'Retirements_retraites', 'Other_Replacement_autre_remplacement',
       "Total_Job_Openings_Perspective_d'emploi",
       'School_Leavers_Sortants_scolaires', 'Immigration',
       'Other_Seekers_autres_chercheurs', 'Job_Seekers_Chercheurs_emploi',
       'Recent_Labour_Market_Conditions',
       'Future_Labour_Market_Conditions']]

#Renamed columns for easier understanding and interpretation
df.columns=['code', 'occupation_name',
       'employment_2014', 'employment_growth',
       'retirements', 'other_replacement',
       "total_job_openings",
       'school_leavers', 'immigration',
       'other_seekers', 'total_job_seekers',
       'recent_labour_market_conditions',
       'future_labour_market_conditions']

#Slice out dataset for skill levels, 0,A-D
#Removed last two columns due to being unnecessary (N/A)
skill_level_summary=df[:6]
skill_level_summary=skill_level_summary[['code', 'occupation_name', 'employment_2014', 'employment_growth',
       'retirements', 'other_replacement', 'total_job_openings',
       'school_leavers', 'immigration', 'other_seekers', 'total_job_seekers']]


#Slice out dataset for skill type, 1-9
#Removed last two columns due to being unnecessary (N/A)
skill_type_summary=df[6:16]
skill_type_summary=skill_type_summary[['code', 'occupation_name', 'employment_2014', 'employment_growth',
       'retirements', 'other_replacement', 'total_job_openings',
       'school_leavers', 'immigration', 'other_seekers', 'total_job_seekers']]

#Slice out all the NOC Codes available
all_skill_levels=df[16:]
#Create new column for the skill level as determined by the first 2 digits of the NOC code
#(str[1:3] skips the character at index 0 of the code string)
all_skill_levels.loc[:,'skill_level']=all_skill_levels.code.str[1:3]
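
The cell above silences the chained-assignment warning globally. An alternative, sketched below on a made-up DataFrame, is to take an explicit `.copy()` of the slice before adding columns. The codes, the one-character `X` prefix and the occupation names are placeholders, not values from the StatsCan file.

```python
import pandas as pd

# Toy stand-in for the renamed DataFrame; every value here is made up.
df = pd.DataFrame({
    "code": ["X0011", "X1111", "X7511"],
    "occupation_name": ["Occupation A", "Occupation B", "Occupation C"],
})

# .copy() returns an independent DataFrame, so adding a column to it
# neither raises the chained-assignment warning nor modifies `df`.
subset = df.iloc[1:].copy()
subset["skill_level"] = subset["code"].str[1:3]

print(subset["skill_level"].tolist())
```

With this approach the `pd.options.mode.chained_assignment = None` line would likely no longer be needed, although I have not verified that against the full dataset.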

There are 5 major skill levels: 0, A, B, C and D. To make these easier to identify, I wrote a function that determines the skill level from the NOC code of the occupation. It was used to add a new column recording the skill level of each occupation for future reference.

In [19]:
def getSkillLevel(code):
    '''Takes the two-character skill_level string (the first two digits of the
    NOC code) and returns the matching skill level label.
    code must be entered as a string'''
    A=['0','1']
    B=['2','3']
    C=['4','5']
    D=['6','7']
    
    if code[0]=='0':
        return "NOC 0"
    elif code[1] in A:
        return "NOC A"
    elif code[1] in B:
        return "NOC B"
    elif code[1] in C:
        return "NOC C"
    elif code[1] in D:
        return "NOC D"
    else:
        return "Error"
    
#Create new column to store the NOC skill level label (NOC 0, A, B, C or D)
all_skill_levels['NOC'] = list(map(getSkillLevel, all_skill_levels['skill_level']))

#New column to store the net job openings
difference=all_skill_levels['total_job_openings']-all_skill_levels['total_job_seekers']
all_skill_levels['net_jobs']=difference

#Sort the jobs based on number of net job openings (Highest at the top)
all_skill_levels_sorted=all_skill_levels.sort_values('net_jobs', ascending=False, inplace=False, na_position='last')
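
As a quick check of the rules in `getSkillLevel`, here is a minimal sketch that expresses the same mapping as a dictionary lookup and runs it on a few sample inputs. The name `skill_level_label` and the sample strings are hypothetical and only for illustration.

```python
# Second character of the two-digit string -> skill level label,
# using the same digit groups as getSkillLevel.
LEVEL_BY_DIGIT = {
    "0": "NOC A", "1": "NOC A",
    "2": "NOC B", "3": "NOC B",
    "4": "NOC C", "5": "NOC C",
    "6": "NOC D", "7": "NOC D",
}

def skill_level_label(two_digits):
    """Return the skill level label for a two-character string."""
    if two_digits[0] == "0":  # a leading 0 is checked first, as in getSkillLevel
        return "NOC 0"
    return LEVEL_BY_DIGIT.get(two_digits[1], "Error")

for sample in ["00", "21", "12", "64", "86", "98"]:
    print(sample, "->", skill_level_label(sample))
```

The last sample shows the fallback: a second digit outside 0-7 is reported as `"Error"` rather than being assigned a level.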

The function **_goodGraphs()_** was developed to cater to repeated calls for plots. Rather than write new plotting code for each plot, I simplified the process so that it only requires the DataFrame of interest and the plot title. This is based on the assumption that the plot format will be similar (at least at this early phase of the project).

I chose to use Plotly due to its interactive nature and ease of use. My previous attempts required manually splitting the data to produce readable plots. Interactivity solves this by letting the viewer zoom into a region of interest and read actual values with good resolution.

In [20]:
def goodGraphs(df,plot_title):
    #Function that takes input of dataFrame and title and develops a Plotly graph
    
    # `data` is a list of traces (one per plotted dataset)
    # `go.Bar` is one of the plotly trace types
    # `x`, `y` and `text` are columns of the DataFrame
    data = [go.Bar(
        x=df.code,
        y=df.net_jobs,
        text=df.occupation_name)]

    layout = go.Layout(
        title=plot_title,    
        xaxis=dict(
            title='NOC Codes',
            titlefont=dict(
                family='Courier New, monospace',
                size=16,
                color='#7f7f7f'
            )
        ),
        yaxis=dict(
            title='Net Jobs',
            titlefont=dict(
                family='Courier New, monospace',
                size=16,
                color='#7f7f7f'
            )
        )
    )

    # Render new figure with data
    iplot(go.Figure(data=data, layout=layout))
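
Plotly figures can also be described as plain nested dictionaries, which makes the structure of the layout easier to see. The sketch below builds such a description without importing Plotly. The function name `build_bar_spec` and the sample values are hypothetical. It uses the nested `title=dict(text=..., font=...)` form that newer Plotly releases prefer over `titlefont`.

```python
def build_bar_spec(codes, net_jobs, names, plot_title):
    """Return a plain-dict description of a bar chart (data + layout)."""
    axis_font = dict(family="Courier New, monospace", size=16, color="#7f7f7f")
    trace = dict(type="bar", x=list(codes), y=list(net_jobs), text=list(names))
    layout = dict(
        title=dict(text=plot_title),
        xaxis=dict(title=dict(text="NOC Codes", font=axis_font)),
        yaxis=dict(title=dict(text="Net Jobs", font=axis_font)),
    )
    return dict(data=[trace], layout=layout)

# Made-up values, only to show the shape of the result.
spec = build_bar_spec(["A", "B"], [1200, -300],
                      ["Occupation A", "Occupation B"], "Demo")
print(spec["layout"]["xaxis"]["title"]["text"])
```

Assuming a recent Plotly version, a dictionary like `spec` should be accepted by `go.Figure(spec)`, so the same description could still be rendered with `iplot`.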

## Data Visualization

The data was sorted by net job openings so that the viewer can easily identify the occupations with significant mismatches between labour demand and supply.
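
To illustrate what the sorting does, here is a small sketch on made-up numbers. It computes `net_jobs` as openings minus seekers and orders the rows from the largest shortage to the largest surplus. The occupation names and figures are placeholders, not StatsCan values.

```python
import pandas as pd

# Made-up figures for three placeholder occupations.
toy = pd.DataFrame({
    "occupation_name": ["Occupation A", "Occupation B", "Occupation C"],
    "total_job_openings": [5000, 2000, 8000],
    "total_job_seekers": [4000, 3500, 8000],
})

# Positive net_jobs = labour shortage, negative = labour surplus.
toy["net_jobs"] = toy["total_job_openings"] - toy["total_job_seekers"]
ranked = toy.sort_values("net_jobs", ascending=False)

print(ranked[["occupation_name", "net_jobs"]])
```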

### NOC 0 Occupations
These are management occupations, regardless of industry and education requirements.
Most of the occupations in this category have a positive outlook (more job openings than job seekers).

This category has a fairly small number of occupations, with only 29 groups provided.

In [21]:
noc_0=all_skill_levels_sorted[all_skill_levels_sorted.NOC == "NOC 0"]
goodGraphs(noc_0,'NOC 0 Projections')

### NOC A Occupations
These are occupations that usually require university education.
Most of the occupations in this category have a positive outlook (more job openings than job seekers).
Of particular interest is the projected shortage of medical professionals, with nurses expected to face the biggest shortage, followed by physicians.

On the other end of the spectrum, computer programmers and interactive media developers are expected to face a labour surplus.

In [22]:
noc_A=all_skill_levels_sorted[all_skill_levels_sorted.NOC == "NOC A"]
goodGraphs(noc_A,'NOC A Projections')

### NOC B Occupations
These are occupations that usually require college education or apprenticeship training.
Most of the occupations in this category have a positive outlook, although the ratio isn't as pronounced as in the previous two categories.

It is of particular interest that the occupation facing the bleakest future (in terms of jobs, not necessarily prosperity) is athletics. Athletes, coaches, referees and related occupations are expected to face a labour surplus of 17,500.

In [23]:
noc_B=all_skill_levels_sorted[all_skill_levels_sorted.NOC == "NOC B"]
goodGraphs(noc_B,'NOC B Projections')

### NOC C Occupations
These are occupations that usually require secondary school and/or occupation-specific training.

Transport truck drivers are identified as the main candidates for a future labour shortage, with numbers almost twice those of the next occupation, retail salespersons. Other than these two occupations, the rest of the occupations in this category have fairly small deviations in net jobs, ranging from a labour surplus of 4,500 to a labour shortage of 3,100.

In [24]:
noc_C=all_skill_levels_sorted[all_skill_levels_sorted.NOC == "NOC C"]
goodGraphs(noc_C,'NOC C Projections')

### NOC D Occupations
These are occupations where on-the-job training is usually provided.

Few occupation groups are included in this category. There are no major outliers. This category has more labour surplus projections than shortage projections, and the surpluses are larger than the shortages.

In [25]:
noc_D=all_skill_levels_sorted[all_skill_levels_sorted.NOC == "NOC D"]
goodGraphs(noc_D,'NOC D Projections')

### Occupations with positive job projections
These are occupations where the job projections indicated a labour shortage or balance.

The total number of unfilled positions across these occupations is expected to be 301,000.

In [26]:
noc_pos=all_skill_levels_sorted[all_skill_levels_sorted.net_jobs >= 0]
goodGraphs(noc_pos,'+VE Job Projections')
print('The total number of projected unfilled positions (with +ve job projections) is '+ str(noc_pos.net_jobs.sum()))

The total number of projected unfilled positions (with +ve job projections) is 301000


### Occupations with negative job projections
These are occupations where the job projections indicated a labour surplus.

These occupations are expected to have a combined labour surplus of 171,400, which the code reports as -171,400 unfilled positions.

In [27]:
noc_neg=all_skill_levels_sorted[all_skill_levels_sorted.net_jobs < 0]
goodGraphs(noc_neg,'-VE Job Projections')
print('The total number of projected unfilled positions (with -ve job projections) is '+ str(noc_neg.net_jobs.sum()))

The total number of projected unfilled positions (with -ve job projections) is -171400
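
Combining the two totals printed above gives the net position across all occupations. The short sketch below only repeats that arithmetic; the variable names are hypothetical.

```python
shortage_total = 301000   # sum of net_jobs where net_jobs >= 0 (printed above)
surplus_total = -171400   # sum of net_jobs where net_jobs < 0 (printed above)

# A positive result means projected openings exceed projected seekers overall.
overall_net = shortage_total + surplus_total
print(overall_net)  # 129600
```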


In [38]:
trace0 = go.Scatter(
    x=noc_0.total_job_seekers,
    y=noc_0.total_job_openings,
    mode='markers',
    marker=dict(size=12,
                line=dict(width=1)
               ),
    name='NOC 0',
    text=noc_0.occupation_name,
    )

trace1 = go.Scatter(
    x=noc_A.total_job_seekers,
    y=noc_A.total_job_openings,
    mode='markers',
    marker=dict(size=12,
                line=dict(width=1)
               ),
    name='NOC A',
    text=noc_A.occupation_name,
        )

data = [trace0, trace1]
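layout = go.Layout(
    title='Job seekers vs Job openings',
    hovermode='closest',  # show details for the point nearest the cursor
    xaxis=dict(
        title='Job seekers',  # x values are total_job_seekers
        ticklen=5,
        zeroline=False,
        gridwidth=2,
    ),
    yaxis=dict(
        title='Job openings',  # y values are total_job_openings
        ticklen=5,
        gridwidth=2,
    ),
)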

iplot(go.Figure(data=data, layout=layout))
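
The cell above defines the NOC 0 and NOC A traces by hand. If the other skill levels were added, a small helper could avoid repeating the same block. The sketch below shows one possible shape using plain dictionaries and made-up numbers. The helper name `marker_trace` and the `groups` dictionary are hypothetical.

```python
def marker_trace(name, seekers, openings, labels):
    """Describe one scatter trace: job seekers on x, job openings on y."""
    return dict(
        type="scatter",
        mode="markers",
        name=name,
        x=list(seekers),
        y=list(openings),
        text=list(labels),
        marker=dict(size=12, line=dict(width=1)),
    )

# Made-up values: (job seekers, job openings, occupation names) per group.
groups = {
    "NOC 0": ([1000, 2000], [1500, 1800], ["Occupation A", "Occupation B"]),
    "NOC A": ([3000, 500], [4200, 400], ["Occupation C", "Occupation D"]),
}

traces = [marker_trace(name, *columns) for name, columns in groups.items()]
print([trace["name"] for trace in traces])
```

In the notebook, the tuples in `groups` would come from the `noc_0`, `noc_A` and similar DataFrames instead of literals.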

