# Insight Data Science Consulting Project: 80,000 hours - Chapter 4

Note: this is part of a consulting project with [80,000 hours](https://80000hours.org/).

## Stage 1: Ask a question

My objective is to rank skills (and possibly knowledge, tools & tech) by how valuable they are. The skills are listed by the US Department of Labor [here](https://www.onetonline.org/find/descriptor/browse/Skills/2.B.1/).

There is no performance measure for this ranking yet, since it is subjective. In the future, one could run a poll in which respondents rate skills pairwise.

## Stage 2: Set the environment up and get data

First, set up a directory for the data and link it to this workspace. Download the data into a directory of your choice.

In [1]:
#Set up the environment
import pandas as pd                        #Pandas
import numpy as np                         #Numpy
import pycurl                              #For saving file from url
import os                                  #For checking if a file exists
from pandas.errors import ParserError      #For checking if a file contains a set of values (replaces pandas.parser.CParserError)
import matplotlib.pyplot as plt            #For plotting
import matplotlib
%matplotlib inline

#Some machine learning tools
from sklearn.linear_model import LassoCV, LassoLarsCV, LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV

#For radar graph plot (numpy, matplotlib and pyplot are already imported above)
import matplotlib.path as path
import matplotlib.patches as patches

# Set up data directory
DataDir = "C:/Users/Admin/Desktop/Insight/data/"
OutputDir = "C:/Users/Admin/Desktop/Insight/output/"


## Stage 3+4+5: Feature exploration, scores, and results

See previous chapter.

## Stage 6: Compute score for occupations

In [2]:
interest = 'Skill'
#interest = 'Knowledge'

In [3]:
#First read the list of occupations
filename = "All_Career_Clusters.csv"
d0 = pd.read_csv(DataDir+filename)

In [4]:
d0.rename(columns={'Code':'SOC code'}, inplace=True)

In [5]:
filename = "d"+ interest +".csv"
d1 = pd.read_csv(DataDir + filename)
d1 = d1.drop('Unnamed: 0', axis=1)   #Drop the index column that was saved to the CSV
d1 = d1.set_index('SOC code')

In [6]:
filename = "score_" + interest + ".csv" 
d2 = pd.read_csv(DataDir+ filename)
d2 = d2.drop('Unnamed: 0', axis=1)   #Drop the index column that was saved to the CSV
d2.rename(columns={'index':interest}, inplace=True)
d2 = d2.set_index(interest)

In [7]:
#One row per occupation, one float column per outcome, all starting at 0.0
dOccupationScore = pd.DataFrame(0.0, index=d1.index,
                                columns=['Income','Satisfaction','Learnability', 'Security'])

In [8]:
d1.shape, d2.shape

((953, 35), (35, 8))

In [9]:
for i in ['Income','Satisfaction','Learnability', 'Security']:
    for j in d1.index:
        for k in d1.columns:
            dOccupationScore.loc[j,i] = dOccupationScore.loc[j,i] + d1.loc[j,k]*d2.loc[k,i]
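
The triple loop above sums, for every occupation and outcome, the product of the occupation's skill level and the skill's score, which is a matrix product of `d1` and `d2`. The following is a minimal sketch of that equivalence, assuming the columns of `d1` and the index of `d2` carry the same labels. The names `d1_toy`, `d2_toy`, `loop_scores` and `fast_scores`, and all values, are made up for illustration and are not part of the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for d1 (occupations x skills) and d2 (skills x outcomes).
skills = ["skill_a", "skill_b", "skill_c"]
outcomes = ["Income", "Satisfaction", "Learnability", "Security"]

d1_toy = pd.DataFrame(
    [[4.0, 1.0, 3.5], [3.0, 4.5, 1.0]],
    index=pd.Index(["occ-1", "occ-2"], name="SOC code"),
    columns=skills,
)
d2_toy = pd.DataFrame(
    [[10.0, 2.0, -8.0, 7.0], [12.0, 1.5, -9.0, 6.0], [9.0, 2.5, -7.0, 8.0]],
    index=skills,
    columns=outcomes,
)

# Loop version, mirroring the cell above.
loop_scores = pd.DataFrame(0.0, index=d1_toy.index, columns=outcomes)
for i in outcomes:
    for j in d1_toy.index:
        for k in d1_toy.columns:
            loop_scores.loc[j, i] += d1_toy.loc[j, k] * d2_toy.loc[k, i]

# Vectorised version: a single matrix product, with d2 reordered to match d1's columns.
fast_scores = d1_toy.dot(d2_toy.loc[d1_toy.columns, outcomes])

print(np.allclose(loop_scores.values, fast_scores.values))
```

On the real tables (953 occupations by 35 skills), the same idea would read `d1.dot(d2.loc[d1.columns, outcomes])`, which avoids the cell-by-cell `.loc` assignments.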

In [10]:
exec("dOccupationScore_" + interest + "= dOccupationScore")

In [11]:
exec("dOccupationScore = dOccupationScore_" + interest)

In [12]:
dOccupationScore[:5]

SOC code,Income,Satisfaction,Learnability,Security
11-1011.00,3997.545801,681.907307,-3257.938285,3160.033157
11-1011.03,3328.889581,573.755212,-2744.503501,2660.366948
11-1021.00,3274.591492,568.868551,-2665.072519,2628.753581
11-2011.00,3153.180203,542.763968,-2608.551079,2534.761642
11-2021.00,3355.438992,574.722918,-2766.801352,2681.08812


In [13]:
#dOccupationScore['score'] = (dOccupationScore['Satisfaction'] + dOccupationScore['Income'] \
#                              + dOccupationScore['Security'] + dOccupationScore['Learnability'])/4
dOccupationScore['score'] = (dOccupationScore['Satisfaction'] + dOccupationScore['Income'] + dOccupationScore['Security'])/3
dOccupationScore['color'] = (dOccupationScore['score'] - np.min(dOccupationScore['score']))/ \
                        (np.max(dOccupationScore['score']) - np.min(dOccupationScore['score']))
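
As a worked check of the two formulas above, the sketch below recomputes `score` and `color` for the first three occupations printed earlier. The frame `rows` is a hypothetical excerpt built by hand from that output. The min-max scaling here only sees three rows, so the resulting colours are illustrative and will differ from those computed on the full table.

```python
import pandas as pd

# First three rows of the occupation scores printed above (Learnability omitted).
rows = pd.DataFrame(
    {
        "Income": [3997.545801, 3328.889581, 3274.591492],
        "Satisfaction": [681.907307, 573.755212, 568.868551],
        "Security": [3160.033157, 2660.366948, 2628.753581],
    },
    index=["11-1011.00", "11-1011.03", "11-1021.00"],
)

# Unweighted mean of the three outcomes, as in the cell above.
rows["score"] = rows[["Satisfaction", "Income", "Security"]].mean(axis=1)

# Min-max scaling: the lowest score maps to 0, the highest to 1.
span = rows["score"].max() - rows["score"].min()
rows["color"] = (rows["score"] - rows["score"].min()) / span

print(rows[["score", "color"]])
```

For `11-1011.00` the mean is (681.907307 + 3997.545801 + 3160.033157) / 3 ≈ 2613.162088, which agrees with the score listed for Chief Executives in the results table.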

In [14]:
vmin = min(np.min(dOccupationScore['Income']),np.min(dOccupationScore['Satisfaction']), np.min(dOccupationScore['Security']))

vmax = max(np.max(dOccupationScore['Income']),np.max(dOccupationScore['Satisfaction']), np.max(dOccupationScore['Security']))

vmin, vmax

(193.62658662057549, 3997.5458008744663)

In [15]:
dOccupationScore = dOccupationScore.reset_index().merge(d0, left_on='SOC code',  right_on='SOC code', how ='left')
dOccupationScore = dOccupationScore.sort_values(by='score',ascending=False)
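
Note that the career-cluster list can map one SOC code to several career pathways, so the left merge repeats an occupation once per pathway (Chief Executives appears five times in the results table). The sketch below shows this effect on hypothetical miniature tables, and one possible way to keep a single row per occupation when only the ranking matters. The frames `scores` and `clusters` and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the score table and the career-cluster list.
scores = pd.DataFrame({"SOC code": ["A", "B"], "score": [2.0, 1.0]})
clusters = pd.DataFrame(
    {
        "SOC code": ["A", "A", "B"],
        "Career Pathway": ["Pathway 1", "Pathway 2", "Pathway 3"],
        "Occupation": ["Occupation A", "Occupation A", "Occupation B"],
    }
)

# Left merge: occupation A now appears once per pathway.
merged = scores.merge(clusters, on="SOC code", how="left")

# Keep only the first row for each SOC code (this discards the other pathways).
unique_occupations = merged.drop_duplicates(subset="SOC code")

print(len(scores), len(merged), len(unique_occupations))
```

Whether to de-duplicate depends on the use: the radar charts below are titled by pathway and occupation, so the repeated rows may be intended there.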

In [16]:
# #For Skills and Knowledges combo. only
# dOccupationScore_combine = pd.DataFrame(dOccupationScore_Skill\
#                                        [['SOC code', 'Income_scale','Satisfaction_scale','Learnability_scale','Security_scale']])

# dOccupationScore_combine = dOccupationScore_combine.merge(dOccupationScore_Knowledge
#                                                           , left_on='SOC code',  right_on='SOC code', how ='left')
# dOccupationScore_combine = dOccupationScore_combine.drop(['Income','Satisfaction','Learnability','Security'],axis = 1)

# dOccupationScore_combine['Income'] = 0.5*(dOccupationScore_combine['Income_scale_x'] + dOccupationScore_combine['Income_scale_y'])
# dOccupationScore_combine['Satisfaction'] = 0.5*(dOccupationScore_combine['Satisfaction_scale_x'] + dOccupationScore_combine['Satisfaction_scale_y'])
# dOccupationScore_combine['Learnability'] = 0.5*(dOccupationScore_combine['Learnability_scale_x'] + dOccupationScore_combine['Learnability_scale_y'])
# dOccupationScore_combine['Security'] = 0.5*(dOccupationScore_combine['Security_scale_x'] + dOccupationScore_combine['Security_scale_y'])

# dOccupationScore_combine['score'] = (dOccupationScore_combine['Satisfaction'] + dOccupationScore_combine['Income'] \
#                                      + dOccupationScore_combine['Security'] + dOccupationScore_combine['Learnability'])/4
# dOccupationScore_combine['color'] = (dOccupationScore_combine['score'] - np.min(dOccupationScore_combine['score']))/ \
#                         (np.max(dOccupationScore_combine['score']) - np.min(dOccupationScore_combine['score']))
# dOccupationScore = dOccupationScore_combine


In [17]:
dOccupationScore[['Career Pathway','Occupation','Income','Satisfaction','Learnability','Security','score']]

Unnamed: 0,Career Pathway,Occupation,Income,Satisfaction,Learnability,Security,score
0,Business Financial Management and Accounting,Chief Executives,3997.545801,681.907307,-3257.938285,3160.033157,2613.162088
2,Governance,Chief Executives,3997.545801,681.907307,-3257.938285,3160.033157,2613.162088
3,Family and Community Services,Chief Executives,3997.545801,681.907307,-3257.938285,3160.033157,2613.162088
4,Logistics Planning and Management Services,Chief Executives,3997.545801,681.907307,-3257.938285,3160.033157,2613.162088
1,Management,Chief Executives,3997.545801,681.907307,-3257.938285,3160.033157,2613.162088
68,Administration and Administrative Support,"Education Administrators, Elementary and Secon...",3716.029174,644.437600,-3070.154627,2981.575684,2447.347486
90,Therapeutic Services,Medical and Health Services Managers,3679.255981,632.875310,-3031.874700,2933.934677,2415.355322
87,Management,Medical and Health Services Managers,3679.255981,632.875310,-3031.874700,2933.934677,2415.355322
88,Diagnostic Services,Medical and Health Services Managers,3679.255981,632.875310,-3031.874700,2933.934677,2415.355322
89,Health Informatics,Medical and Health Services Managers,3679.255981,632.875310,-3031.874700,2933.934677,2415.355322


In [459]:
#set color map
cmap = plt.get_cmap('RdYlGn')   #matplotlib.cm.get_cmap is no longer available in recent matplotlib releases

In [460]:
dSummary = pd.concat([dOccupationScore[:20], dOccupationScore[-20:]])
Dir = OutputDir + interest + '_occupation/'
os.makedirs(Dir, exist_ok=True)   #Make sure the output folder exists before saving figures
#dSummary = dOccupationScore[-20:]
#Dir = OutputDir + 'worst_occupation/'

In [461]:
def radarplot(df, lower, upper, space):
    df = df.sort_values(by='score')
    for count in range(0,len(df)):

        #Adapted from Copyright (C) 2011  Nicolas P. Rougier

        # Data to be represented
        # ----------
        properties = ['Income','Satisfaction', 'Security','Learnability']
        values = df.loc[df.index[count],:][['Income','Satisfaction', 'Security','Learnability']]
        # ----------

        # Choose some nice colors
        matplotlib.rc('axes', facecolor = 'white')

        # Make figure background the same colors as axes 
        fig = plt.figure(figsize=(8,6), facecolor='white')


        # Use a polar axes
        axes = plt.subplot(111, polar=True)

        # Set ticks to the number of properties (in radians)
        #t = np.arange(0,2*np.pi,2*np.pi/len(properties))
        t = np.arange(np.pi/4,2*np.pi,2*np.pi/len(properties))
        plt.xticks(t, [])

        # Set yticks from lower to upper in steps of space
        #plt.yticks(np.linspace(0,10,11))
        #plt.yticks(np.linspace(0,4,9))
        # np.linspace needs an integer number of points
        plt.yticks(np.linspace(lower, upper, int(round((upper-lower)/space))+1))

        # Draw polygon representing values
        points = [(x,y) for x,y in zip(t,values)]
        points.append(points[0])
        points = np.array(points)
        codes = [path.Path.MOVETO,] + \
                [path.Path.LINETO,]*(len(values) -1) + \
                [ path.Path.CLOSEPOLY ]
        _path = path.Path(points, codes)
        _patch = patches.PathPatch(_path, fill=True, color=cmap(df.loc[df.index[count],'color']), linewidth=0, alpha=.7)
        axes.add_patch(_patch)
        _patch = patches.PathPatch(_path, fill=False, linewidth = 2)
        axes.add_patch(_patch)

        # Draw circles at value points
        plt.scatter(points[:,0],points[:,1], linewidth=2,
                    s=50, color='white', edgecolor='black', zorder=10)

        # Set axes limits
        #plt.ylim(0,10)
        #plt.ylim(0,4)
        plt.ylim(lower,upper)

        #add title
        #plt.title(df.index[count])
        plt.title(df.loc[df.index[count],'Career Pathway']+'/ \n ' + df.loc[df.index[count],'Occupation'])

        # Draw ytick labels to make sure they fit properly
        for i in range(len(properties)):
            angle_rad = i/float(len(properties))*2*np.pi + np.pi/4
            angle_deg = i/float(len(properties))*360 + 45
            ha = "right"
            if angle_rad < np.pi/2 or angle_rad > 3*np.pi/2: ha = "left"
            #plt.text(angle_rad, 10.75, properties[i], size=14,
            #plt.text(angle_rad, 4.75, properties[i], size=14,
            plt.text(angle_rad, upper + 0.25, properties[i], size=14,
                     horizontalalignment=ha, verticalalignment="center")

            # A variant on label orientation
            #    plt.text(angle_rad, 11, properties[i], size=14,
            #             rotation=angle_deg-90,
            #             horizontalalignment='center', verticalalignment="center")

        # Done: save the chart, then close the figure so open figures do not pile up
        plt.savefig(Dir + str(count+1).zfill(2) +'-radar-chart.png', facecolor='white')
        #plt.show()
        plt.close(fig)

In [464]:
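#Use the axis range that matches the current interest; both calls save to the same
#file names, so running them back to back overwrites the first set of charts
if interest == 'Skill':
    radarplot(dSummary, 0, 400, 100)        #for skill
else:
    radarplot(dSummary, -1200, 1200, 600)   #for knowledge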

In [463]:
d = dOccupationScore
d = d.reset_index()
d.to_csv(DataDir + 'occupationscore_'+interest+'.csv')
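
The `'Unnamed: 0'` columns dropped when reading the CSV files earlier come from saving a frame together with its row index, as `to_csv` does by default. The sketch below illustrates this round trip on a hypothetical frame `toy`, using an in-memory buffer instead of a file. Passing `index=False` is one option if the index is not needed downstream, but that is a suggestion rather than what the cell above does.

```python
import io
import pandas as pd

# Hypothetical frame standing in for the occupation scores.
toy = pd.DataFrame({"SOC code": ["A", "B"], "score": [2.0, 1.0]})

# Default: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```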