# Insight Data Science Consulting Project: 80,000 hours - Chapter 4

Note: This is part of a consulting project with [80,000 Hours](https://80000hours.org/).

## Stage 1: Ask a question

My objective is to rank skills (and possibly knowledge, tools & tech) based on how valuable they are. The skills are listed by the US Department of Labor [here](https://www.onetonline.org/find/descriptor/browse/Skills/2.B.1/).

There is no performance measure for this ranking yet because it is subjective. In the future, one could run a poll that asks respondents to rate skills pairwise.

## Stage 2: Set the environment up and get data

First, set up a directory for the data and point this notebook at it. Download the data into a directory of your choice.

In [1]:
#Set up the environment
import pandas as pd                        #Pandas
import numpy as np                         #Numpy
import pycurl                              #For saving file from url
import os                                  #For checking if a file exists
from pandas.errors import ParserError      #For checking if a file contains a set of values (replaces pandas.parser.CParserError)
import matplotlib.pyplot as plt            #For plotting
import matplotlib
%matplotlib inline

#Some machine learning tools
from sklearn.linear_model import LassoCV, LassoLarsCV, LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV

#For radar graph plot (numpy, matplotlib and pyplot are already imported above)
import matplotlib.path as path
import matplotlib.patches as patches

# Set up data directory
DataDir = "C:/Users/Admin/Desktop/Insight/80000hrs/"


## Stage 3+4+5: Feature exploration, scores, and results

See previous chapter.

## Stage 6: Compute score for occupations

In [2]:
interest = 'Skill'
#interest = 'Knowledge'

In [3]:
#First read the list of occupations
filename = "00/All_Career_Clusters.csv"
d0 = pd.read_csv(DataDir+filename)

In [4]:
d0.rename(columns={'Code':'SOC code'}, inplace=True)

In [5]:
filename = "01/d"+ interest +".csv"
d1 = pd.read_csv(DataDir + filename)
d1 = d1.drop('Unnamed: 0', axis=1)
d1 = d1.set_index('SOC code')

In [6]:
filename = "03/score_" + interest + ".csv" 
d2 = pd.read_csv(DataDir+ filename)
d2 = d2.drop('Unnamed: 0', axis=1)
d2.rename(columns={'index':interest}, inplace=True)
d2 = d2.set_index(interest)

In [7]:
dOccupationScore = pd.DataFrame(index=d1.index)
dOccupationScore['Income'] = 0
dOccupationScore['Satisfaction'] = 0
dOccupationScore['Learnability'] = 0
dOccupationScore['Security'] = 0
dOccupationScore = dOccupationScore[['Income','Satisfaction','Learnability', 'Security']]

In [8]:
d1.shape, d2.shape

((953, 35), (35, 8))

In [9]:
for i in ['Income','Satisfaction','Learnability', 'Security']:
    for j in d1.index:
        for k in d1.columns:
            dOccupationScore.loc[j,i] = dOccupationScore.loc[j,i] + d1.loc[j,k]*d2.loc[k,i]
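
The triple loop above computes, for each occupation and each criterion, a sum of products over skills, which is a matrix product. Below is a minimal sketch of the same computation with `DataFrame.dot`, checked against the loop on toy data. The frames `toy_d1` and `toy_d2` and their values are made up for illustration; they only mimic the shapes of `d1` (occupations by skills) and `d2` (skills by criteria).

```python
import numpy as np
import pandas as pd

# Toy stand-ins for d1 and d2; labels and values are illustrative only.
skills = ['Reading', 'Writing', 'Speaking']
toy_d1 = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                      index=['11-1011.00', '11-1021.00'], columns=skills)
toy_d2 = pd.DataFrame({'Income': [0.5, 1.0, 1.5],
                       'Security': [2.0, 0.0, -1.0]}, index=skills)

# Loop version, same structure as the notebook cell.
loop_scores = pd.DataFrame(0.0, index=toy_d1.index, columns=toy_d2.columns)
for i in toy_d2.columns:
    for j in toy_d1.index:
        for k in toy_d1.columns:
            loop_scores.loc[j, i] += toy_d1.loc[j, k] * toy_d2.loc[k, i]

# Vectorized version: dot() aligns toy_d1's columns with toy_d2's index.
dot_scores = toy_d1.dot(toy_d2)

print(np.allclose(loop_scores.values, dot_scores.values))
```

In this notebook the corresponding call would presumably be `d1.dot(d2[['Income','Satisfaction','Learnability','Security']])`, assuming the column labels of `d1` match the index labels of `d2` exactly; `dot` raises an error if they do not.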

In [10]:
exec("dOccupationScore_" + interest + "= dOccupationScore")

In [11]:
exec("dOccupationScore = dOccupationScore_" + interest)
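
The two `exec` cells above save the result under a name that depends on `interest` and then read it back. A dictionary keyed by `interest` is one possible alternative that avoids building code strings. This is only a sketch: the name `occupation_scores` and the toy frame are hypothetical and do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical container, keyed by the value of `interest`.
occupation_scores = {}

interest = 'Skill'
toy_scores = pd.DataFrame({'Income': [1.0, 2.0]},
                          index=['11-1011.00', '11-1021.00'])

# Store the result under its interest name ...
occupation_scores[interest] = toy_scores

# ... and fetch it back later, e.g. after rerunning with 'Knowledge'.
restored = occupation_scores[interest]
print(restored.equals(toy_scores))
```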

In [12]:
dOccupationScore[:5]

SOC code,Income,Satisfaction,Learnability,Security
11-1011.00,3667.956301,667.859927,-3602.493153,3909.715881
11-1011.03,3043.480313,565.367855,-3020.104331,3272.620516
11-1021.00,2987.396459,559.044264,-2942.772408,3238.490995
11-2011.00,2883.36015,534.457386,-2879.72357,3130.339988
11-2021.00,3078.054736,565.236662,-3063.840979,3312.413232


In [13]:
dOccupationScore['score'] = (dOccupationScore['Satisfaction'] + dOccupationScore['Income'] \
                              + dOccupationScore['Security'] + dOccupationScore['Learnability'])/4
dOccupationScore['color'] = (dOccupationScore['score'] - np.min(dOccupationScore['score']))/ \
                        (np.max(dOccupationScore['score']) - np.min(dOccupationScore['score']))
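
As a check on the two formulas above, the sketch below recomputes `score` (the plain average of the four criteria) and `color` (a min-max rescaling of `score` to the range 0 to 1) for the first two rows of the table printed earlier. The frame name `toy` is an assumption for illustration; the numbers are copied from that output.

```python
import pandas as pd

# First two rows of the dOccupationScore table shown above.
toy = pd.DataFrame(
    {'Income': [3667.956301, 3043.480313],
     'Satisfaction': [667.859927, 565.367855],
     'Learnability': [-3602.493153, -3020.104331],
     'Security': [3909.715881, 3272.620516]},
    index=['11-1011.00', '11-1011.03'])

cols = ['Income', 'Satisfaction', 'Learnability', 'Security']

# Unweighted average of the four criteria.
toy['score'] = toy[cols].mean(axis=1)

# Min-max rescaling; this divides by zero if every score is identical.
span = toy['score'].max() - toy['score'].min()
toy['color'] = (toy['score'] - toy['score'].min()) / span

print(toy[['score', 'color']])
```

Because Learnability is strongly negative here, it offsets most of the Income term in the average, so `score` is driven largely by Satisfaction and Security.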

In [14]:
vmin = min(np.min(dOccupationScore['Income']),np.min(dOccupationScore['Satisfaction']), \
        np.min(dOccupationScore['Security']), np.min(dOccupationScore['Learnability']))

vmax = max(np.max(dOccupationScore['Income']),np.max(dOccupationScore['Satisfaction']), \
        np.max(dOccupationScore['Security']), np.max(dOccupationScore['Learnability']))
vmin, vmax

(-3602.4931532187243, 3909.7158808714426)

In [15]:
dOccupationScore = dOccupationScore.reset_index().merge(d0, left_on='SOC code',  right_on='SOC code', how ='left')
dOccupationScore = dOccupationScore.sort_values(by='score',ascending=False)
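
An occupation can be listed under several career pathways in `d0`, so the left merge may return more rows than the score table had, with the same scores repeated once per pathway. The sketch below shows this on toy data and one possible way to keep a single row per occupation. The frames `scores` and `clusters` are hypothetical stand-ins and their rows are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the score table and the career cluster list.
scores = pd.DataFrame({'SOC code': ['11-1011.00', '11-1021.00'],
                       'score': [1160.76, 1000.00]})
clusters = pd.DataFrame({
    'SOC code': ['11-1011.00', '11-1011.00', '11-1021.00'],
    'Career Pathway': ['Management', 'Governance', 'Management'],
    'Occupation': ['Chief Executives', 'Chief Executives',
                   'General and Operations Managers']})

# One-to-many left merge: the first occupation is duplicated per pathway.
merged = scores.merge(clusters, on='SOC code', how='left')
merged = merged.sort_values(by='score', ascending=False)

# Keep only the first pathway listed for each occupation.
unique_occupations = merged.drop_duplicates(subset='SOC code')

print(len(scores), len(merged), len(unique_occupations))
```

Whether to deduplicate depends on the goal: keeping the duplicates preserves the pathway information, while dropping them gives a ranking with one entry per occupation.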

In [16]:
dOccupationScore[['Career Pathway','Occupation','Income','Satisfaction','Learnability','Security','score']]

,Career Pathway,Occupation,Income,Satisfaction,Learnability,Security,score
0,Business Financial Management and Accounting,Chief Executives,3667.956301,667.859927,-3602.493153,3909.715881,1160.759739
3,Family and Community Services,Chief Executives,3667.956301,667.859927,-3602.493153,3909.715881,1160.759739
4,Logistics Planning and Management Services,Chief Executives,3667.956301,667.859927,-3602.493153,3909.715881,1160.759739
1,Management,Chief Executives,3667.956301,667.859927,-3602.493153,3909.715881,1160.759739
2,Governance,Chief Executives,3667.956301,667.859927,-3602.493153,3909.715881,1160.759739
68,Administration and Administrative Support,"Education Administrators, Elementary and Secon...",3397.379746,633.641521,-3386.187523,3672.699166,1079.383228
90,Therapeutic Services,Medical and Health Services Managers,3378.202146,621.910153,-3359.095007,3630.101896,1067.779797
87,Management,Medical and Health Services Managers,3378.202146,621.910153,-3359.095007,3630.101896,1067.779797
88,Diagnostic Services,Medical and Health Services Managers,3378.202146,621.910153,-3359.095007,3630.101896,1067.779797
89,Health Informatics,Medical and Health Services Managers,3378.202146,621.910153,-3359.095007,3630.101896,1067.779797


In [17]:
#set color map
cmap = plt.get_cmap('RdYlGn')

In [18]:
#best 20 and worst 20
dSummary = pd.concat([dOccupationScore[:20], dOccupationScore[-20:]])
Dir = DataDir + '04/Occupation/'

In [19]:
def radarplot(df, lower, upper, space):
    df = df.sort_values(by='score')
    for count in range(0,len(df)):

        #Adapted from Copyright (C) 2011  Nicolas P. Rougier

        # Data to be represented
        # ----------
        properties = ['Income','Satisfaction', 'Security','Learnability']
        values = df.loc[df.index[count],:][['Income','Satisfaction', 'Security','Learnability']]
        # ----------

        # Choose some nice colors
        matplotlib.rc('axes', facecolor = 'white')

        # Make figure background the same colors as axes 
        fig = plt.figure(figsize=(8,6), facecolor='white')


        # Use a polar axes
        axes = plt.subplot(111, polar=True)

        # Set ticks to the number of properties (in radians)
        #t = np.arange(0,2*np.pi,2*np.pi/len(properties))
        t = np.arange(np.pi/4,2*np.pi,2*np.pi/len(properties))
        plt.xticks(t, [])

        # Set yticks from 0 to 10
        #plt.yticks(np.linspace(0,10,11))
        #plt.yticks(np.linspace(0,4,9))
        # np.linspace needs an integer number of ticks
        plt.yticks(np.linspace(lower,upper,int((upper-lower)/space)+1))

        # Draw polygon representing values
        points = [(x,y) for x,y in zip(t,values)]
        points.append(points[0])
        points = np.array(points)
        codes = [path.Path.MOVETO,] + \
                [path.Path.LINETO,]*(len(values) -1) + \
                [ path.Path.CLOSEPOLY ]
        _path = path.Path(points, codes)
        _patch = patches.PathPatch(_path, fill=True, color=cmap(df.loc[df.index[count],'color']), linewidth=0, alpha=.7)
        axes.add_patch(_patch)
        _patch = patches.PathPatch(_path, fill=False, linewidth = 2)
        axes.add_patch(_patch)

        # Draw circles at value points
        plt.scatter(points[:,0],points[:,1], linewidth=2,
                    s=50, color='white', edgecolor='black', zorder=10)

        # Set axes limits
        #plt.ylim(0,10)
        #plt.ylim(0,4)
        plt.ylim(lower,upper)

        #add title
        #plt.title(df.index[count])
        plt.title(df.loc[df.index[count],'Career Pathway']+'/ \n ' + df.loc[df.index[count],'Occupation'])

        # Draw ytick labels to make sure they fit properly
        for i in range(len(properties)):
            angle_rad = i/float(len(properties))*2*np.pi + np.pi/4
            angle_deg = i/float(len(properties))*360 + 45
            ha = "right"
            if angle_rad < np.pi/2 or angle_rad > 3*np.pi/2: ha = "left"
            #plt.text(angle_rad, 10.75, properties[i], size=14,
            #plt.text(angle_rad, 4.75, properties[i], size=14,
            plt.text(angle_rad, upper + 0.25, properties[i], size=14,
                     horizontalalignment=ha, verticalalignment="center")

            # A variant on label orientation
            #    plt.text(angle_rad, 11, properties[i], size=14,
            #             rotation=angle_deg-90,
            #             horizontalalignment='center', verticalalignment="center")

        # Done: save the chart, then close the figure so it does not stay in memory
        plt.savefig(Dir + str(count+1).zfill(2) +'-radar-chart.png', facecolor='white')
        #plt.show()
        plt.close(fig)
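
The geometry of the chart comes down to two arrays: the angles of the four axes and the radial tick positions. The sketch below computes both for the arguments used in this notebook. The helper names `axis_angles` and `radial_ticks` are hypothetical and only restate the expressions inside `radarplot`; the tick count is converted to an integer because `np.linspace` expects an integer number of samples.

```python
import numpy as np

def axis_angles(n_properties):
    """Angles (radians) of the radar axes, starting at 45 degrees."""
    return np.arange(np.pi / 4, 2 * np.pi, 2 * np.pi / n_properties)

def radial_ticks(lower, upper, space):
    """Evenly spaced radial ticks from lower to upper, `space` apart."""
    n_ticks = int(round((upper - lower) / space)) + 1
    return np.linspace(lower, upper, n_ticks)

angles = axis_angles(4)
skill_ticks = radial_ticks(-4000, 4000, 1000)
knowledge_ticks = radial_ticks(-1200, 1200, 600)

print(np.degrees(angles))
print(skill_ticks)
print(knowledge_ticks)
```

With four properties the axes sit at 45, 135, 225 and 315 degrees, which is why the polygon appears as a rotated square when all four values are equal.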

In [21]:
radarplot(dSummary, -4000, 4000, 1000) #for skill
#radarplot(dSummary, -1200, 1200, 600); # for knowledge

In [20]:
d = dOccupationScore
d = d.reset_index()
d.to_csv(DataDir + '04/score_Occupation.csv')