Focusing on the data available for all ages, **what does the distribution of unemployment rates look like among the different major categories?**  Come up with a _graphical display_ that allows a reader to easily make sense of the information.


In addition to the comprehensive all-ages dataset, the GitHub repository _also contains data on just **recent college graduates (ages < 28)**_. Comparing this subset with the all-ages dataset it is drawn from can tell us something about recent trends. **Which majors appear to have experienced a relative boom** among recent graduates, and **which majors are dropping off** in popularity? Again, explore visual ways of describing the answer as well as numerical ones.

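One way to make "boom" and "dropping off" concrete is to compare each major's share of graduates in the two files. The sketch below is only one possible reading of the question: it assumes both files share the `Major_code` and `Total` columns and that share of total graduates is a fair proxy for popularity. The helper name `popularity_shift` and the toy numbers are hypothetical and not taken from the dataset.

```python
import pandas as pd

def popularity_shift(all_ages: pd.DataFrame, recent_grads: pd.DataFrame) -> pd.DataFrame:
    """Compare each major's share of graduates among recent grads vs. all ages."""
    merged = all_ages[["Major_code", "Major", "Total"]].merge(
        recent_grads[["Major_code", "Total"]],
        on="Major_code",
        suffixes=("_all", "_recent"),
    )
    # Shares are computed over the majors present in both files.
    merged["share_all"] = merged["Total_all"] / merged["Total_all"].sum()
    merged["share_recent"] = merged["Total_recent"] / merged["Total_recent"].sum()
    # Ratio > 1: over-represented among recent graduates (relative boom);
    # ratio < 1: under-represented (dropping off).
    merged["share_ratio"] = merged["share_recent"] / merged["share_all"]
    return merged.sort_values("share_ratio", ascending=False).reset_index(drop=True)

# Toy numbers for illustration only (NOT from the dataset).
toy_all = pd.DataFrame(
    {"Major_code": [1, 2, 3], "Major": ["A", "B", "C"], "Total": [100, 100, 200]}
)
toy_recent = pd.DataFrame({"Major_code": [1, 2, 3], "Total": [30, 10, 40]})

shift = popularity_shift(toy_all, toy_recent)
print(shift[["Major", "share_all", "share_recent", "share_ratio"]])
```

Sorting by `share_ratio` gives the numerical answer; the same column drawn as a horizontal bar chart (ideally on a log scale, so that doubling and halving look symmetric) would be one visual answer.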

In [213]:
import collections
import json
import pprint
import warnings

import missingno as msno
import numpy as np
import pandas as pd
import pylab
import scipy.stats as stats
import statsmodels.api as sm

# Show every row and column when displaying DataFrames.
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

pp = pprint.PrettyPrinter()
warnings.filterwarnings("ignore")  # silence library warnings in notebook output

In [6]:
all_ages = pd.read_csv("data-college-majors/all-ages.csv")

In [50]:
grad_students = pd.read_csv("data-college-majors/grad-students.csv")

In [12]:
recent_grads = pd.read_csv("data-college-majors/recent-grads.csv")

In [13]:
majors_list = pd.read_csv("data-college-majors/majors-list.csv")

In [41]:
# Data via https://www.bls.gov/emp/tables/emp-by-major-occupational-group.htm (2019-2029 projections).
# sheet_name=1 reads the second sheet; [:-4] drops the last four rows.
projected_occupation = pd.read_excel("data-college-majors/occupation.xlsx", sheet_name=1)[:-4]

In [214]:
all_ages.groupby(["Major_category","Major","Employed","Unemployed","Unemployment_rate"]).mean()

Major_category,Major,Employed,Unemployed,Unemployment_rate,Major_code,Total,Employed_full_time_year_round,Median,P25th,P75th
Agriculture & Natural Resources,AGRICULTURAL ECONOMICS,26321,821,0.030248,1102,33955,22810,63000,40000,98000.0
Agriculture & Natural Resources,AGRICULTURE PRODUCTION AND MANAGEMENT,76865,2266,0.028636,1101,95326,64240,54000,36000,80000.0
Agriculture & Natural Resources,ANIMAL SCIENCES,81177,3619,0.042679,1103,103549,64937,46000,30000,72000.0
Agriculture & Natural Resources,FOOD SCIENCE,17281,894,0.049188,1104,24280,12722,62000,38500,90000.0
Agriculture & Natural Resources,FORESTRY,48228,2144,0.042563,1302,69447,39613,58000,40500,80000.0
Agriculture & Natural Resources,GENERAL AGRICULTURE,90245,2423,0.026147,1100,128148,74078,50000,34000,80000.0
Agriculture & Natural Resources,MISCELLANEOUS AGRICULTURE,6392,261,0.03923,1199,8549,5074,52000,35000,75000.0
Agriculture & Natural Resources,NATURAL RESOURCES MANAGEMENT,65937,3789,0.054341,1303,83188,50595,52000,37100,75000.0
Agriculture & Natural Resources,PLANT SCIENCE AND AGRONOMY,63043,2070,0.031791,1105,79409,51077,50000,35000,75000.0
Agriculture & Natural Resources,SOIL SCIENCE,4926,264,0.050867,1106,6586,4042,63000,39400,88000.0

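For a category-level rate, pooling the counts is not the same as averaging the per-major rates, because large majors then carry more weight. The sketch below uses the ten Agriculture & Natural Resources rows shown above. It assumes `Unemployment_rate` equals `Unemployed / (Employed + Unemployed)`, which agrees with the listed values (for example 821 / (26321 + 821) ≈ 0.030248). The helper name `category_rates` is hypothetical.

```python
import pandas as pd

# Employed / Unemployed counts copied from the ten rows listed above.
agri = pd.DataFrame({
    "Major_category": ["Agriculture & Natural Resources"] * 10,
    "Employed": [26321, 76865, 81177, 17281, 48228, 90245, 6392, 65937, 63043, 4926],
    "Unemployed": [821, 2266, 3619, 894, 2144, 2423, 261, 3789, 2070, 264],
})
# Assumed definition of the rate (consistent with the listed values).
agri["Unemployment_rate"] = agri["Unemployed"] / (agri["Employed"] + agri["Unemployed"])

def category_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Per-category pooled rate next to the plain mean of the per-major rates."""
    counts = df.groupby("Major_category")[["Employed", "Unemployed"]].sum()
    pooled = counts["Unemployed"] / (counts["Employed"] + counts["Unemployed"])
    mean_of_rates = df.groupby("Major_category")["Unemployment_rate"].mean()
    return pd.DataFrame({"pooled_rate": pooled, "mean_of_major_rates": mean_of_rates})

rates = category_rates(agri)
print(rates.round(6))
```

In the notebook, `category_rates(all_ages)` would give the same two columns for every category.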

In [10]:
all_ages.dtypes

Major_code                         int64
Major                             object
Major_category                    object
Total                              int64
Employed                           int64
Employed_full_time_year_round      int64
Unemployed                         int64
Unemployment_rate                float64
Median                             int64
P25th                              int64
P75th                            float64
dtype: object

In [64]:
all_ages.groupby("Major_category").describe()["Unemployment_rate"].sort_values("max")

Major_category,count,mean,std,min,25%,50%,75%,max
Agriculture & Natural Resources,10.0,0.039569,0.010023,0.026147,0.030634,0.040897,0.047561,0.054341
Health,12.0,0.047209,0.015766,0.026292,0.033607,0.05002,0.058557,0.07001
Social Science,9.0,0.065686,0.005278,0.054399,0.064519,0.065804,0.069374,0.071057
Business,13.0,0.054496,0.007606,0.043268,0.051378,0.053415,0.058865,0.071354
Biology & Life Science,14.0,0.049936,0.013896,0.016111,0.047777,0.049899,0.057298,0.071598
Interdisciplinary,1.0,0.077269,,0.077269,0.077269,0.077269,0.077269,0.077269
Law & Public Policy,5.0,0.067854,0.00907,0.054036,0.066513,0.069655,0.069848,0.079217
Humanities & Liberal Arts,15.0,0.069429,0.009543,0.042505,0.066715,0.072374,0.074675,0.081348
Communications & Journalism,4.0,0.069125,0.009504,0.061917,0.063749,0.065788,0.071163,0.083005
Engineering,29.0,0.05063,0.015761,0.0,0.043844,0.049846,0.058821,0.085991

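The quartiles in the summary table translate directly into a box plot, which is one candidate for the graphical display asked for at the top. This is a minimal sketch rather than a finished figure: it assumes a frame with the `Major_category` and `Unemployment_rate` columns, and both helper names are hypothetical. matplotlib is imported inside the plotting function so that the ordering helper works without it.

```python
import pandas as pd

def order_categories_by_median(df: pd.DataFrame) -> list:
    """Return category names sorted by median unemployment rate (lowest first)."""
    medians = df.groupby("Major_category")["Unemployment_rate"].median()
    return medians.sort_values().index.tolist()

def plot_unemployment_by_category(df: pd.DataFrame):
    """Draw one box per major category, ordered by median unemployment rate."""
    import matplotlib.pyplot as plt  # local import: only needed when plotting

    order = order_categories_by_median(df)
    data = [df.loc[df["Major_category"] == c, "Unemployment_rate"] for c in order]

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.boxplot(data)
    # Boxes sit at positions 1..n; label them with the category names.
    ax.set_xticks(range(1, len(order) + 1))
    ax.set_xticklabels(order, rotation=90)
    ax.set_ylabel("Unemployment rate")
    ax.set_title("Unemployment rate by major category (all ages)")
    fig.tight_layout()
    return ax

# Usage in this notebook (assumes `all_ages` is loaded as above):
# plot_unemployment_by_category(all_ages)
```

Ordering the boxes by median makes the ranking readable at a glance. Categories with a single major, such as Interdisciplinary, collapse to a single line.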

In [48]:
recent_grads.sample(5)

Unnamed: 0,Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,Full_time,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
148,149,6006,ART HISTORY AND CRITICISM,21030.0,3240.0,17790.0,Humanities & Liberal Arts,0.845934,204,17579,13262,6140,9965,1128,0.060298,31000,23000,40000,5139,9738,3426
54,55,4006,COGNITIVE SCIENCE AND BIOPSYCHOLOGY,3831.0,1667.0,2164.0,Biology & Life Science,0.564866,25,2741,2470,711,1584,223,0.075236,41000,20000,60000,1369,921,135
102,103,5503,CRIMINOLOGY,19879.0,10031.0,9848.0,Social Science,0.495397,214,16181,13616,4543,10548,1743,0.097244,35000,25000,45000,3373,10605,1895
127,128,6211,HOSPITALITY MANAGEMENT,43647.0,15204.0,28443.0,Business,0.65166,546,36728,32160,7494,23106,2393,0.061169,33000,25000,42000,2325,23341,9063
118,119,6110,COMMUNITY AND PUBLIC HEALTH,19735.0,4103.0,15632.0,Health,0.792095,130,14512,10099,6377,7460,1833,0.112144,34000,21000,45000,5225,7385,1854


In [217]:
all_ages.sort_values("Major_category")

Unnamed: 0,Major_code,Major,Major_category,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
0,1100,GENERAL AGRICULTURE,Agriculture & Natural Resources,128148,90245,74078,2423,0.026147,50000,34000,80000.0
1,1101,AGRICULTURE PRODUCTION AND MANAGEMENT,Agriculture & Natural Resources,95326,76865,64240,2266,0.028636,54000,36000,80000.0
2,1102,AGRICULTURAL ECONOMICS,Agriculture & Natural Resources,33955,26321,22810,821,0.030248,63000,40000,98000.0
3,1103,ANIMAL SCIENCES,Agriculture & Natural Resources,103549,81177,64937,3619,0.042679,46000,30000,72000.0
4,1104,FOOD SCIENCE,Agriculture & Natural Resources,24280,17281,12722,894,0.049188,62000,38500,90000.0
5,1105,PLANT SCIENCE AND AGRONOMY,Agriculture & Natural Resources,79409,63043,51077,2070,0.031791,50000,35000,75000.0
6,1106,SOIL SCIENCE,Agriculture & Natural Resources,6586,4926,4042,264,0.050867,63000,39400,88000.0
7,1199,MISCELLANEOUS AGRICULTURE,Agriculture & Natural Resources,8549,6392,5074,261,0.03923,52000,35000,75000.0
9,1302,FORESTRY,Agriculture & Natural Resources,69447,48228,39613,2144,0.042563,58000,40500,80000.0
10,1303,NATURAL RESOURCES MANAGEMENT,Agriculture & Natural Resources,83188,65937,50595,3789,0.054341,52000,37100,75000.0


In [51]:
grad_students.sample(5)

Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,Grad_P25,Grad_P75,Nongrad_total,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium
28,2100,COMPUTER AND INFORMATION SYSTEMS,Computers & Mathematics,71527,1425,60858,53807,2539,0.040049,80000.0,55000,104000.0,242194,209994,184959,10439,0.047357,65000.0,45000,90000.0,0.227996,0.230769
146,2305,MATHEMATICS TEACHER EDUCATION,Education,80826,1194,51750,34672,748,0.014248,60000.0,47500,80000.0,63346,42354,27419,1610,0.036621,45000.0,35000,62000.0,0.560622,0.333333
1,6004,COMMERCIAL ART AND GRAPHIC DESIGN,Arts,53864,882,40492,29553,2482,0.057756,60000.0,40000,89000.0,461977,347166,250596,25484,0.068386,48000.0,34000,71000.0,0.10442,0.25
112,5502,ANTHROPOLOGY AND ARCHEOLOGY,Humanities & Liberal Arts,107888,1971,83632,59545,4374,0.049701,65000.0,45100,100000.0,126116,90622,62339,6369,0.065666,45000.0,30000,70000.0,0.461052,0.444444
131,3700,MATHEMATICS,Computers & Mathematics,418056,6906,287467,217363,11245,0.037645,89000.0,60000,127000.0,407046,262174,202078,14142,0.051181,68000.0,43000,100000.0,0.506672,0.308824



## Re-format for D3 

In [203]:
from collections import defaultdict

Reference for turning a DataFrame with a multi-level index into a nested dict / JSON: https://stackoverflow.com/questions/50929768/pandas-multiindex-more-than-2-levels-dataframe-to-nested-dict-json

In [196]:
all_ages_h = all_ages.set_index(['Major_category',"Major"])

In [206]:
all_ages_h.to_csv("all_ages_h.csv")

In [204]:
all_ages_h.head()

Major_category,Major,Major_code,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
Agriculture & Natural Resources,GENERAL AGRICULTURE,1100,128148,90245,74078,2423,0.026147,50000,34000,80000.0
Agriculture & Natural Resources,AGRICULTURE PRODUCTION AND MANAGEMENT,1101,95326,76865,64240,2266,0.028636,54000,36000,80000.0
Agriculture & Natural Resources,AGRICULTURAL ECONOMICS,1102,33955,26321,22810,821,0.030248,63000,40000,98000.0
Agriculture & Natural Resources,ANIMAL SCIENCES,1103,103549,81177,64937,3619,0.042679,46000,30000,72000.0
Agriculture & Natural Resources,FOOD SCIENCE,1104,24280,17281,12722,894,0.049188,62000,38500,90000.0


In [191]:
def nest(d):
    result = {}
    for key, value in d.items():
        target = result
        for k in key[:-1]:  # traverse all keys but the last
            target = target.setdefault(k, {})
        target[key[-1]] = value
    return result

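To see what `nest` does, here is a small usage sketch. The function is repeated so the block runs on its own, and the two entries are copied from the `all_ages_h` rows shown above. In the notebook, a tuple-keyed dict like `flat` could come from `all_ages_h["Unemployment_rate"].to_dict()`.

```python
def nest(d):
    """Turn a dict keyed by tuples into nested dicts, one level per tuple element."""
    result = {}
    for key, value in d.items():
        target = result
        for k in key[:-1]:  # walk (and create) every level except the last
            target = target.setdefault(k, {})
        target[key[-1]] = value
    return result

# (Major_category, Major) -> Unemployment_rate, copied from the table above.
flat = {
    ("Agriculture & Natural Resources", "GENERAL AGRICULTURE"): 0.026147,
    ("Agriculture & Natural Resources", "FOOD SCIENCE"): 0.049188,
}

nested = nest(flat)
print(nested)
```

The result has one dict per category with majors as keys. That is not yet the `name` / `children` layout of a flare-style file, so a further conversion step is still needed for D3.
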
In [205]:
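tree = lambda: defaultdict(tree)  # a recursive defaultdict
d = tree()

# all_ages_h has nine columns, so unpacking a full row into four names raises
# "ValueError: too many values to unpack (expected 4)". Select the four columns first.
cols = ["Total", "Employed", "Unemployed", "Unemployment_rate"]
for (category, major), (total, employed, unemployed, unemployment_rate) in all_ages_h[cols].iterrows():
    print(category, major)
# for _, (region, type, name, value) in all_ages_h.iterrows():
    #d['children'][region]['name'] = region

#json.dumps(d)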

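A flare-style hierarchy for D3 is a root with `name` and `children`, where each category is itself a `name` / `children` node. A minimal sketch of building that shape with `groupby` follows. The leaf key `"value"`, the choice of `Total` as the size, and the helper name `to_flare` are assumptions, since the leaf fields of `flare.json` are not visible in this notebook.

```python
import json

import pandas as pd

def to_flare(df: pd.DataFrame, value_col: str = "Total", root_name: str = "flare") -> dict:
    """Build a {"name", "children"} tree: root -> Major_category -> Major."""
    children = []
    for category, group in df.groupby("Major_category", sort=True):
        leaves = [
            # int() converts numpy integers, which json.dumps cannot serialize.
            {"name": major, "value": int(value)}
            for major, value in zip(group["Major"], group[value_col])
        ]
        children.append({"name": category, "children": leaves})
    return {"name": root_name, "children": children}

# Three rows copied from the all_ages output above.
sample = pd.DataFrame({
    "Major_category": ["Agriculture & Natural Resources"] * 3,
    "Major": [
        "GENERAL AGRICULTURE",
        "AGRICULTURE PRODUCTION AND MANAGEMENT",
        "AGRICULTURAL ECONOMICS",
    ],
    "Total": [128148, 95326, 33955],
})

flare = to_flare(sample)
print(json.dumps(flare, indent=2))
```

Called as `to_flare(all_ages)`, this would produce one child node per major category. Passing another column name as `value_col` changes what the leaf sizes encode.
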
In [212]:
pd.read_json("flare.json")

Unnamed: 0,name,children
0,flare,"{'name': 'Agriculture & Natural Resources', 'c..."
1,flare,"{'name': 'Arts', 'children': [{'name': 'DRAMA ..."
2,flare,"{'name': 'Biology & Life Science', 'children':..."
3,flare,"{'name': 'Business', 'children': [{'name': 'AC..."
4,flare,"{'name': 'Communications & Journalism', 'child..."
5,flare,"{'name': 'Computers & Mathematics', 'children'..."
6,flare,"{'name': 'Education', 'children': [{'name': 'E..."
7,flare,"{'name': 'Engineering', 'children': [{'name': ..."
8,flare,"{'name': 'Health', 'children': [{'name': 'COMM..."
9,flare,"{'name': 'Humanities & Liberal Arts', 'childre..."

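`pd.read_json` keeps each category as an opaque dict in the `children` column, as the output above shows. To get one row per major back, `pd.json_normalize` can flatten the nested `children` lists. The sketch below runs on a tiny stand-in for `flare.json`; the `"value"` leaf field is an assumption about the file's contents.

```python
import json

import pandas as pd

# Tiny stand-in for flare.json; the "value" leaf field is an assumption.
text = """
{"name": "flare", "children": [
  {"name": "Agriculture & Natural Resources", "children": [
    {"name": "GENERAL AGRICULTURE", "value": 128148},
    {"name": "FOOD SCIENCE", "value": 24280}
  ]}
]}
"""
tree = json.loads(text)  # for the real file: json.load(open("flare.json"))

leaves = pd.json_normalize(
    tree["children"],          # list of category nodes
    record_path="children",    # one output row per major
    meta=["name"],             # carry the category name along
    meta_prefix="category_",   # avoids a clash with the leaf "name" column
)
print(leaves)
```

Without `meta_prefix`, the category `name` would collide with the leaf `name` column and `json_normalize` would raise an error.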