# How it started 
Inspired by this piece by Lazaro Gamio in the [NYT](https://www.nytimes.com/interactive/2020/03/15/business/economy/coronavirus-worker-risk.html), I wanted to figure out whether I could do a similar analysis for Germany. It is possible, but I had to connect the US data on workers' physical proximity with German labour market data, because no analysis of working contexts at this level of granularity exists for Germany (I asked three experts and googled for half a day to confirm this). So I had to "translate" US codes into German codes. Here is how I did it.

In [40]:
#import a bunch of go-to-libraries
import pandas as pd
import numpy as np
import altair as alt
import scipy.stats as stats
import scipy

Then we need the US proximity data from O*NET, which can be downloaded [from this page](https://www.onetonline.org/find/descriptor/result/4.C.2.a.3/Physical_Proximity.csv?fmt=csv). Then we load it here.

In [14]:
contactwith_others=pd.read_csv("../RawData/Physical_Proximity.csv")
contactwith_others

Unnamed: 0,Context,Code,Occupation
0,100,27-2032.00,Choreographers
1,100,29-2021.00,Dental Hygienists
2,100,29-1123.00,Physical Therapists
3,100,29-1069.11,Sports Medicine Physicians
4,99,31-9091.00,Dental Assistants
...,...,...,...
962,17,45-3021.00,Hunters and Trappers
963,14,45-4022.00,Logging Equipment Operators
964,14,27-3043.05,"Poets, Lyricists and Creative Writers"
965,9,27-1013.00,"Fine Artists, Including Painters, Sculptors, a..."


We want to reduce the complexity before we translate codes. So we use the first five digits of each code (six characters including the hyphen) to group jobs, [as the first five digits reflect the "broad group" classification](https://www.bls.gov/soc/2018/soc_structure_2018.pdf).

In [15]:
contactwith_others["main_groups_code"]=contactwith_others["Code"].str[:6]
# write the list of broad_groups/main_groups to a list
listofmaingroups=list(contactwith_others["main_groups_code"].unique())
print(len(listofmaingroups))

438
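
To make the slicing step concrete, here is a minimal sketch in plain Python. The helper name `broad_group_prefix` is made up for illustration; the first four codes come from the table above, while the code for Dancers is an assumption added only to show two jobs collapsing into one group.

```python
# Minimal sketch: truncate O*NET-SOC codes to their six-character prefix.
def broad_group_prefix(onet_code):
    """Return the first six characters, e.g. '27-2032.00' -> '27-203'."""
    return onet_code[:6]

codes = [
    "27-2032.00",  # Choreographers
    "29-2021.00",  # Dental Hygienists
    "29-1123.00",  # Physical Therapists
    "29-1069.11",  # Sports Medicine Physicians
    "27-2031.00",  # Dancers (assumed code, for illustration only)
]

# Five detailed codes collapse into four groups
prefixes = sorted({broad_group_prefix(c) for c in codes})
print(prefixes)
```

Slicing with `[:6]` keeps the hyphen, which is why six characters are needed to cover five digits.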


In [16]:

groudict={}
index=0
#now iterate through this list of broadgroups

for i in listofmaingroups:
    #for each maingroup we create a list for a row 
    entrieslist=[]
    #we create a list with the jobs listed under this code and the code itself
    joblist=[(list(contactwith_others[contactwith_others["main_groups_code"]==i]["Occupation"])),i]
    #and we create a list with the mean, std etc. metrics for this group of jobs
    metriclist=((contactwith_others[contactwith_others["main_groups_code"]==i]["Context"].describe().tolist()))
    #then we push both these list into our rowlist
    entrieslist.extend(joblist)
    entrieslist.extend(metriclist)
    print(entrieslist)
    #and save it under the indexnumber in the dictionary
    groudict[index]=entrieslist
    #and increase the index by 1
    index+=1
groudict

[['Choreographers', 'Dancers'], '27-203', 2.0, 99.0, 1.4142135623730951, 98.0, 98.5, 99.0, 99.5, 100.0]
[['Dental Hygienists'], '29-202', 1.0, 100.0, nan, 100.0, 100.0, 100.0, 100.0, 100.0]
[['Physical Therapists', 'Radiation Therapists', 'Occupational Therapists', 'Exercise Physiologists', 'Respiratory Therapists', 'Music Therapists', 'Low Vision Therapists, Orientation and Mobility Specialists, and Vision Rehabilitation Therapists', 'Recreational Therapists', 'Speech-Language Pathologists', 'Art Therapists'], '29-112', 10.0, 88.9, 8.5693251387337, 73.0, 84.5, 90.5, 94.5, 100.0]
[['Sports Medicine Physicians', 'Urologists', 'Dermatologists', 'Obstetricians and Gynecologists', 'Surgeons', 'Anesthesiologists', 'Hospitalists', 'Ophthalmologists', 'Family and General Practitioners', 'Physical Medicine and Rehabilitation Physicians', 'Internists, General', 'Pediatricians, General', 'Allergists and Immunologists', 'Neurologists', 'Nuclear Medicine Physicians', 'Psychiatrists', 'Radiologists

[['Taxi Drivers and Chauffeurs'], '53-304', 1.0, 79.0, nan, 79.0, 79.0, 79.0, 79.0, 79.0]
[['Wind Turbine Service Technicians'], '49-908', 1.0, 79.0, nan, 79.0, 79.0, 79.0, 79.0, 79.0]
[['Baggage Porters and Bellhops', 'Concierges'], '39-601', 2.0, 71.5, 9.192388155425117, 65.0, 68.25, 71.5, 74.75, 78.0]
[['Food Service Managers'], '11-905', 1.0, 78.0, nan, 78.0, 78.0, 78.0, 78.0, 78.0]
[['Gaming Surveillance Officers and Gaming Investigators', 'Security Guards'], '33-903', 2.0, 73.0, 7.0710678118654755, 68.0, 70.5, 73.0, 75.5, 78.0]
[['Solar Energy Installation Managers', 'First-Line Supervisors of Construction Trades and Extraction Workers'], '47-101', 2.0, 74.5, 4.949747468305833, 71.0, 72.75, 74.5, 76.25, 78.0]
[['Waiters and Waitresses'], '35-303', 1.0, 78.0, nan, 78.0, 78.0, 78.0, 78.0, 78.0]
[['Career/Technical Education Teachers, Secondary School', 'Secondary School Teachers, Except Special and Career/Technical Education'], '25-203', 2.0, 73.5, 4.949747468305833, 70.0, 71.75, 7

[['Pharmacists'], '29-105', 1.0, 72.0, nan, 72.0, 72.0, 72.0, 72.0, 72.0]
[['Ship Engineers'], '53-503', 1.0, 72.0, nan, 72.0, 72.0, 72.0, 72.0, 72.0]
[['Statement Clerks', 'Billing, Cost, and Rate Clerks'], '43-302', 2.0, 59.0, 18.384776310850235, 46.0, 52.5, 59.0, 65.5, 72.0]
[['Tour Guides and Escorts', 'Travel Guides'], '39-701', 2.0, 62.5, 13.435028842544403, 53.0, 57.75, 62.5, 67.25, 72.0]
[['Medical Appliance Technicians', 'Ophthalmic Laboratory Technicians', 'Dental Laboratory Technicians'], '51-908', 3.0, 62.333333333333336, 9.018499505645789, 53.0, 58.0, 63.0, 67.0, 71.0]
[['Medical Secretaries', 'Secretaries and Administrative Assistants, Except Legal, Medical, and Executive', 'Legal Secretaries', 'Executive Secretaries and Executive Administrative Assistants'], '43-601', 4.0, 48.5, 15.088627063675034, 39.0, 40.5, 42.0, 50.0, 71.0]
[['Model Makers, Metal and Plastic', 'Patternmakers, Metal and Plastic'], '51-406', 2.0, 67.5, 4.949747468305833, 64.0, 65.75, 67.5, 69.25, 71.0]

[['First-Line Supervisors of Aquacultural Workers', 'First-Line Supervisors of Agricultural Crop and Horticultural Workers', 'First-Line Supervisors of Animal Husbandry and Animal Care Workers', 'First-Line Supervisors of Logging Workers'], '45-101', 4.0, 45.5, 14.479871085982314, 32.0, 38.75, 42.0, 48.75, 66.0]
[['First-Line Supervisors of Retail Sales Workers', 'First-Line Supervisors of Non-Retail Sales Workers'], '41-101', 2.0, 54.5, 16.263455967290593, 43.0, 48.75, 54.5, 60.25, 66.0]
[['First-Line Supervisors of Transportation and Material-Moving Machine and Vehicle Operators'], '53-103', 1.0, 66.0, nan, 66.0, 66.0, 66.0, 66.0, 66.0]
[['Gas Compressor and Gas Pumping Station Operators', 'Pump Operators, Except Wellhead Pumpers', 'Wellhead Pumpers'], '53-707', 3.0, 45.666666666666664, 24.131583730317686, 19.0, 35.5, 52.0, 59.0, 66.0]
[['Heating and Air Conditioning Mechanics and Installers', 'Refrigeration Mechanics and Installers'], '49-902', 2.0, 61.0, 7.0710678118654755, 56.0, 5

[['Continuous Mining Machine Operators', 'Mine Cutting and Channeling Machine Operators'], '47-504', 2.0, 56.0, 4.242640687119285, 53.0, 54.5, 56.0, 57.5, 59.0]
[['Electrical and Electronic Equipment Assemblers', 'Electromechanical Equipment Assemblers', 'Coil Winders, Tapers, and Finishers'], '51-202', 3.0, 58.666666666666664, 0.5773502691896258, 58.0, 58.5, 59.0, 59.0, 59.0]
[['Energy Auditors', 'Customs Brokers', 'Security Management Specialists', 'Business Continuity Planners', 'Online Merchants', 'Sustainability Specialists'], '13-119', 6.0, 51.5, 4.370354676682432, 47.0, 49.0, 50.0, 53.25, 59.0]
[['Insulation Workers, Mechanical', 'Insulation Workers, Floor, Ceiling, and Wall'], '47-213', 2.0, 58.5, 0.7071067811865476, 58.0, 58.25, 58.5, 58.75, 59.0]
[['Loading Machine Operators, Underground Mining', 'Excavating and Loading Machine and Dragline Operators', 'Dredge Operators'], '53-703', 3.0, 42.0, 16.09347693943108, 27.0, 33.5, 40.0, 49.5, 59.0]
[['New Accounts Clerks'], '43-414'

[['Proofreaders and Copy Markers'], '43-908', 1.0, 53.0, nan, 53.0, 53.0, 53.0, 53.0, 53.0]
[['Emergency Management Directors'], '11-916', 1.0, 52.0, nan, 52.0, 52.0, 52.0, 52.0, 52.0]
[['Energy Brokers'], '41-309', 1.0, 52.0, nan, 52.0, 52.0, 52.0, 52.0, 52.0]
[['Food Scientists and Technologists', 'Animal Scientists', 'Soil and Plant Scientists'], '19-101', 3.0, 46.333333333333336, 5.507570547286102, 41.0, 43.5, 46.0, 49.0, 52.0]
[['Fraud Examiners, Investigators and Analysts', 'Financial Quantitative Analysts', 'Risk Management Specialists'], '13-209', 3.0, 47.0, 4.58257569495584, 43.0, 44.5, 46.0, 49.0, 52.0]
[['Insurance Policy Processing Clerks', 'Insurance Claims Clerks'], '43-904', 2.0, 49.0, 4.242640687119285, 46.0, 47.5, 49.0, 50.5, 52.0]
[['Insurance Underwriters', 'Financial Analysts', 'Personal Financial Advisors'], '13-205', 3.0, 45.333333333333336, 7.637626158259733, 37.0, 42.0, 47.0, 49.5, 52.0]
[['Librarians'], '25-402', 1.0, 52.0, nan, 52.0, 52.0, 52.0, 52.0, 52.0]
[[

[['Mathematicians'], '15-202', 1.0, 40.0, nan, 40.0, 40.0, 40.0, 40.0, 40.0]
[['Remote Sensing Scientists and Technologists'], '19-209', 1.0, 40.0, nan, 40.0, 40.0, 40.0, 40.0, 40.0]
[['Sociologists'], '19-304', 1.0, 40.0, nan, 40.0, 40.0, 40.0, 40.0, 40.0]
[['Survey Researchers'], '19-302', 1.0, 40.0, nan, 40.0, 40.0, 40.0, 40.0, 40.0]
[['Fundraisers'], '13-113', 1.0, 39.0, nan, 39.0, 39.0, 39.0, 39.0, 39.0]
[['Paralegals and Legal Assistants'], '23-201', 1.0, 39.0, nan, 39.0, 39.0, 39.0, 39.0, 39.0]
[['Public Relations and Fundraising Managers'], '11-203', 1.0, 38.0, nan, 38.0, 38.0, 38.0, 38.0, 38.0]
[['Conveyor Operators and Tenders'], '53-701', 1.0, 37.0, nan, 37.0, 37.0, 37.0, 37.0, 37.0]
[['Meter Readers, Utilities'], '43-504', 1.0, 37.0, nan, 37.0, 37.0, 37.0, 37.0, 37.0]
[['Motion Picture Projectionists'], '39-302', 1.0, 37.0, nan, 37.0, 37.0, 37.0, 37.0, 37.0]
[['Parking Enforcement Workers'], '33-304', 1.0, 37.0, nan, 37.0, 37.0, 37.0, 37.0, 37.0]
[['Payroll and Timekeeping 

{0: [['Choreographers', 'Dancers'],
  '27-203',
  2.0,
  99.0,
  1.4142135623730951,
  98.0,
  98.5,
  99.0,
  99.5,
  100.0],
 1: [['Dental Hygienists'],
  '29-202',
  1.0,
  100.0,
  nan,
  100.0,
  100.0,
  100.0,
  100.0,
  100.0],
 2: [['Physical Therapists',
   'Radiation Therapists',
   'Occupational Therapists',
   'Exercise Physiologists',
   'Respiratory Therapists',
   'Music Therapists',
   'Low Vision Therapists, Orientation and Mobility Specialists, and Vision Rehabilitation Therapists',
   'Recreational Therapists',
   'Speech-Language Pathologists',
   'Art Therapists'],
  '29-112',
  10.0,
  88.9,
  8.5693251387337,
  73.0,
  84.5,
  90.5,
  94.5,
  100.0],
 3: [['Sports Medicine Physicians',
   'Urologists',
   'Dermatologists',
   'Obstetricians and Gynecologists',
   'Surgeons',
   'Anesthesiologists',
   'Hospitalists',
   'Ophthalmologists',
   'Family and General Practitioners',
   'Physical Medicine and Rehabilitation Physicians',
   'Internists, General',
   'P

In [22]:
#then we build a dataframe from this dictionary
columnlist=["jobs","code"]
columnlist.extend((contactwith_others["Context"].describe()).index.tolist())
dataframe=pd.DataFrame().from_dict(groudict).transpose()
dataframe.columns=columnlist
dataframe=dataframe.fillna(0)
#and add the relative std as a column
dataframe["rel_std"]=dataframe["std"]/dataframe["mean"]
dataframe=dataframe.sort_values(by="rel_std")
#turns out that, on average, the physical proximity scores within a group only differ by about 7 per cent (single-job groups count as 0 here)
dataframe["rel_std"].mean()

0.07262895229881285
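
The per-group summary built by the loop above could also be sketched with a `groupby`. This is an alternative formulation, not the code that produced the results here. The toy rows reuse scores that can be read off the output above; the code for Dancers is an assumption.

```python
import pandas as pd

# Toy rows in the shape of the O*NET table
toy = pd.DataFrame({
    "Context": [100, 98, 100],
    "Code": ["27-2032.00", "27-2031.00", "29-2021.00"],  # second code is assumed
    "Occupation": ["Choreographers", "Dancers", "Dental Hygienists"],
})
toy["main_groups_code"] = toy["Code"].str[:6]

# One describe() call per group instead of an explicit loop
grouped = toy.groupby("main_groups_code", sort=False)
summary = grouped["Context"].describe()
summary.insert(0, "jobs", grouped["Occupation"].agg(list))

# Single-job groups have no standard deviation; treat it as 0, as the notebook does
summary["rel_std"] = summary["std"].fillna(0) / summary["mean"]
print(summary[["jobs", "count", "mean", "std", "rel_std"]])
```

Because groups with a single job enter the average with a relative standard deviation of 0, the overall mean is pulled down by every such group.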

So now we have grouped jobs of the same broader job category and calculated the mean physical proximity for each group. We write the result out to a file so that it can be classified by German job codes.

In [24]:
#and we write this list out 
dataframe.to_csv("sum_up_of_us_codes.csv",index=False)

# Coding US to German job codes

Next up was the tedious part. I had to classify 438 job groups by hand using [the official job encoding list of the German labour ministry](https://www.klassifikationsserver.de/klassService/jsp/common/url.jsf?variant=kldb2010). For each US job group I added the three-digit German job classification that I deemed fitting (saved under "jobs_with_context_classified.csv").

To check this classification, I also built a simple scraper to do the same job that I did. I took the aforementioned csv and ran it through deepl to get a German translation of all jobs. With this translation, I sent a query for each job group [to the Berufenet database](https://berufenet.arbeitsagentur.de/berufenet/faces/index?path=null) of the German labour ministry to check which job code such a query would return. The scraper is saved under "Puppeteer_Arbeitsagentur". I ended up with three csvs, one for each form of query: sending the pure translation to the website, splitting at the hyphen and sending only the first word, and splitting at the space and sending only the last word. Then I merged these three csvs by hand into one and cleaned it up as follows.
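
As a side note before the cleanup, the three query forms described above could be derived from a translated job title roughly like this. The function name and the example title are hypothetical; the actual scraper lives in "Puppeteer_Arbeitsagentur" and may differ in its details.

```python
# Hypothetical helper: build the three query strings for one translated job title.
def query_variants(translation):
    full = translation.strip()                    # 1) the pure translation
    first_of_hyphen = full.split("-")[0].strip()  # 2) only the part before the first hyphen
    last_word = full.split(" ")[-1].strip()       # 3) only the last space-separated word
    return [full, first_of_hyphen, last_word]

# Made-up example title, not taken from the data
print(query_variants("Kfz-Mechaniker und Techniker"))
```

Each variant trades precision for recall: the full title is the most specific query, while the shortened forms are more likely to return some result at all.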

In [5]:
raw_data=pd.read_csv("../RawData/halukaschreibt_zusammen.csv")
dropped=raw_data

In [8]:
# we create one master column for the job code number and fill it with NaNs first
dropped['number_zusammen'] = np.nan
#if there was a result retrieved from the scraper, we put the number from this query in the master column, if not we leave it as it is
dropped['number_zusammen'] = np.where(dropped['Resultatsberuf'].isna(), np.nan,dropped['Nummer'])
#and we do this for all the three types of columns we had
dropped['number_zusammen'] = np.where(dropped['Resultatsberuf2'].isna(), dropped['number_zusammen'],dropped['Nummer2'])
dropped['number_zusammen'] = np.where(dropped['Resultatsberuf_3'].isna(), dropped['number_zusammen'],dropped['Nummer_3'])
#and also with a master column for the job name
dropped['Resultatsberuf_zusammen'] = np.nan
dropped['Resultatsberuf_zusammen'] = np.where(dropped['Resultatsberuf'].isna(), np.nan,dropped['Resultatsberuf'])
dropped['Resultatsberuf_zusammen'] = np.where(dropped['Resultatsberuf2'].isna(), dropped['Resultatsberuf_zusammen'],dropped['Resultatsberuf2'])
dropped['Resultatsberuf_zusammen'] = np.where(dropped['Resultatsberuf_3'].isna(), dropped['Resultatsberuf_zusammen'],dropped['Resultatsberuf_3'])

#and so get a reduced dataframe
summed_scraper=dropped[["Berufsübersetzung","number_zusammen","Resultatsberuf_zusammen"]]
summed_scraper.to_csv("scraper_results.csv",index=False)
summed_scraper

Unnamed: 0,Berufsübersetzung,number_zusammen,Resultatsberuf_zusammen
0,Dentalhygieni,81113-10,Dentalhygieniker/in
1,Choreograp,94224-10,Ballettmeister/in
2,Kieferorthopä,81113-10,Kieferorthopädische/r Fachhelfer/in
3,Krankenschwes,82182-10,Ambulante/r Pfleger/in
4,Medizinische Notfalltechniker und Rettungssanitä,81342-10,Rettungssanitäter/in
...,...,...,...
434,Erdölingenie,21114-11,Ingenieur/in - Rohstoffgewinnung und -aufberei...
435,Ausschreibungen für Brücken und Schleus,,
436,Log Grader und Scal,,
437,Jäger und Fallenstel,,
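
The chain of `np.where` calls above means that a later query overrides an earlier one whenever it returned a result. A compact sketch of that rule follows; the helper name `combine_later_wins` and all job numbers in the toy frame are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column layout as the scraper output (values are made up)
toy = pd.DataFrame({
    "Resultatsberuf":   ["A", np.nan, np.nan, "D"],
    "Nummer":           ["11111-10", np.nan, np.nan, "44444-10"],
    "Resultatsberuf2":  [np.nan, "B", np.nan, np.nan],
    "Nummer2":          [np.nan, "22222-10", np.nan, np.nan],
    "Resultatsberuf_3": [np.nan, np.nan, np.nan, "D3"],
    "Nummer_3":         [np.nan, np.nan, np.nan, "66666-10"],
})

def combine_later_wins(df, pairs):
    """For each (result column, value column) pair, take the value if a result exists."""
    out = pd.Series(np.nan, index=df.index, dtype=object)
    for name_col, value_col in pairs:
        # keep the new value where the query returned something, else the previous one
        out = df[value_col].where(df[name_col].notna(), out)
    return out

pairs = [("Resultatsberuf", "Nummer"),
         ("Resultatsberuf2", "Nummer2"),
         ("Resultatsberuf_3", "Nummer_3")]
combined = combine_later_wins(toy, pairs)
print(combined)
```

In the last toy row both the first and the third query returned something, and the third one wins, which mirrors the order of the assignments in the cell above.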


Next, I merged the hand-classified data with the data from the scraper and threw out all the jobs for which neither the scraper nor I had found a German job code.

In [None]:
hand_classified=pd.read_csv("jobs_with_context_classified.csv",dtype={"germancode":str})
hand_scraper=pd.concat([hand_classified,summed_scraper],axis=1)
hand_scraper["three_letter_scraper"]=hand_scraper["number_zusammen"].str[0:3]
hand_scraper=hand_scraper.dropna(subset=["germancode","three_letter_scraper"],how="all")
hand_scraper[hand_scraper["germancode"]==hand_scraper["three_letter_scraper"]]

I also marked whether the scraper's result and my classification matched. The code below compares only the first two digits of the two codes: if they agree, the checked column gets a "y", otherwise it stays empty.

In [12]:
hand_scraper["checked"]=""
hand_scraper['checked'] = np.where(hand_scraper["germancode"].str[:2]==hand_scraper["three_letter_scraper"].str[:2], "y",hand_scraper["checked"])
hand_scraper.to_csv("hand_scraper.csv", index=False)
hand_scraper

Unnamed: 0,germancode,jobs,code,count,mean,std,min,25%,50%,75%,max,rel_std,Berufsübersetzung,number_zusammen,Resultatsberuf_zusammen,three_letter_scraper,checked
0,814,['Dental Hygienists'],29-202,1.0,100.0,0.000000,100.0,100.0,100.0,100.0,100.0,0.000000,Dentalhygieni,81113-10,Dentalhygieniker/in,811,y
1,942,"['Choreographers', 'Dancers']",27-203,2.0,99.0,1.414214,98.0,98.5,99.0,99.5,100.0,0.014285,Choreograp,94224-10,Ballettmeister/in,942,y
2,814,"['Orthodontists', 'Oral and Maxillofacial Surg...",29-102,4.0,97.0,3.366502,92.0,96.5,98.5,99.0,99.0,0.034706,Kieferorthopä,81113-10,Kieferorthopädische/r Fachhelfer/in,811,y
3,813,['Nurse Midwives'],29-116,1.0,97.0,0.000000,97.0,97.0,97.0,97.0,97.0,0.000000,Krankenschwes,82182-10,Ambulante/r Pfleger/in,821,
4,813,['Emergency Medical Technicians and Paramedics'],29-204,1.0,97.0,0.000000,97.0,97.0,97.0,97.0,97.0,0.000000,Medizinische Notfalltechniker und Rettungssanitä,81342-10,Rettungssanitäter/in,813,y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
431,722,"['Compensation, Benefits, and Job Analysis Spe...",13-114,1.0,32.0,0.000000,32.0,32.0,32.0,32.0,32.0,0.000000,Spezialisten für Entlohn,,,,
432,914,"['Environmental Economists', 'Economists']",19-301,2.0,31.0,2.828427,29.0,30.0,31.0,32.0,33.0,0.091240,Umweltökono,42314-11,Betriebswirt/in (Hochschule) - Umweltökonomie,423,
434,412,['Petroleum Engineers'],17-217,1.0,30.0,0.000000,30.0,30.0,30.0,30.0,30.0,0.000000,Erdölingenie,21114-11,Ingenieur/in - Rohstoffgewinnung und -aufberei...,211,
436,117,"['Log Graders and Scalers', 'Logging Equipment...",45-402,3.0,23.0,21.931712,7.0,10.5,14.0,31.0,48.0,0.953553,Log Grader und Scal,,,,
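
To see what the two-digit comparison does, here is a small sketch on three rows taken from the output above. The extra `exact` column is not part of the notebook; it is added only to contrast the two-digit rule with a strict three-digit match.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "germancode": ["814", "813", "722"],
    "three_letter_scraper": ["811", "821", np.nan],
})

# Rule used in the notebook: the first two digits must agree
same_two = toy["germancode"].str[:2] == toy["three_letter_scraper"].str[:2]
toy["checked"] = np.where(same_two, "y", "")

# Stricter alternative, for comparison only: all three digits must agree
same_three = toy["germancode"] == toy["three_letter_scraper"]
toy["exact"] = np.where(same_three, "y", "")
print(toy)
```

The first row (814 versus 811) is marked "y" under the two-digit rule although the three-digit codes differ, and a missing scraper result never counts as a match.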


I then went through this file again. For each of the roughly 300 job groups where the two methods disagreed, I checked the Berufenet webpage once more to find the fitting classification. If this second check gave the same three-digit number as the first run, I also inserted a "y"; otherwise I inserted the new number.

# Evaluation of the classification

So how good is this sped-up version of an intracoder reliability check for the classification? In just over 60 percent of the cases, the scraper or I reclassified the job group with exactly the same code as before.

In [30]:
data=pd.read_csv("jobcodes_evaluated_part_2.csv",dtype={"germancode":str,"three_letter_scraper":str})
data

Unnamed: 0,germancode,checked,Berufsübersetzung,jobs,code,count,mean,std,min,25%,50%,75%,max,rel_std,Berufsübersetzung.1,number_zusammen,Resultatsberuf_zusammen,three_letter_scraper,matched
0,813,y,Krankenschwes,['Nurse Midwives'],29-116,1.0,97.000000,0.000000,97.0,97.00,97.0,97.00,97.0,0.000000,Krankenschwes,82182-10,Ambulante/r Pfleger/in,821,False
1,817,811,Fußspezialis,['Podiatrists'],29-108,1.0,95.000000,0.000000,95.0,95.00,95.0,95.00,95.0,0.000000,Fußspezialis,,,,False
2,514,y,Transportbeglei,"['Transportation Attendants, Except Flight Att...",53-606,1.0,93.000000,0.000000,93.0,93.00,93.0,93.00,93.0,0.000000,Transportbeglei,,,,False
3,825,814,Orthopäden und Protheti,"['Orthotists and Prosthetists', 'Neurodiagnost...",29-209,6.0,91.333333,6.801961,79.0,89.50,94.0,95.50,97.0,0.074474,Orthopäden und Protheti,,,,False
4,813,832,Haushaltshil,"['Home Health Aides', 'Nursing Assistants', 'O...",31-101,4.0,90.250000,5.377422,84.0,87.75,90.0,92.50,97.0,0.059584,Haushaltshil,,,,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
416,924,y,Technische Redakte,"['Technical Writers', 'Copy Writers', 'Poets, ...",27-304,4.0,39.750000,17.858238,14.0,35.00,45.5,50.25,54.0,0.449264,Technische Redakte,92413-10,Redakteur/in,924,True
417,414,y,Astrono,"['Astronomers', 'Physicists']",19-201,2.0,39.000000,11.313709,31.0,35.00,39.0,43.00,47.0,0.290095,Astrono,41484-11,"Astrophysiker/in, Astronom/in",414,True
418,922,y,Manager für Öffentlichkeitsarbeit und Fundrais,['Public Relations and Fundraising Managers'],11-203,1.0,38.000000,0.000000,38.0,38.00,38.0,38.00,38.0,0.000000,Manager für Öffentlichkeitsarbeit und Fundrais,92203-10,Fundraiser/in,922,True
419,611,y,Vertriebsingenie,['Sales Engineers'],41-903,1.0,36.000000,0.000000,36.0,36.00,36.0,36.00,36.0,0.000000,Vertriebsingenie,61124-10,Vertriebsingenieur/in,611,True


In [17]:
len(data[data["checked"]=="y"])/len(data)

0.6128266033254157

If we only ask whether at least the first digit, which stands for the working sector, matched, we get an additional 19 percent. So roughly 80 percent of the job groups ended up at least in the same sector in both runs. This is not great, but it is something we can work with.

In [18]:
len(data[data["germancode"].str[:1]==data["checked"].str[:1]])/len(data)

0.19002375296912113
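
Both shares can be computed side by side, as in this sketch. The first three rows follow the table above; the last row is a made-up disagreement so that not every row matches.

```python
import pandas as pd

toy = pd.DataFrame({
    "germancode": ["813", "817", "514", "412"],
    "checked":    ["y",   "811", "y",   "211"],  # last row is hypothetical
})

# Share of groups where the second run confirmed the exact code
exact_share = (toy["checked"] == "y").mean()

# Share where the codes differ but the first digit (sector) still agrees;
# rows marked "y" cannot match here because "y" is not a digit
sector_share = (toy["germancode"].str[:1] == toy["checked"].str[:1]).mean()

print(exact_share, sector_share, exact_share + sector_share)
```

The two conditions are mutually exclusive, which is why the two shares can simply be added up.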

# Actual analysis
We run the same analysis in three ways to make sure that a faulty classification cannot be the reason why an analysis shows a signal or not. But first we load the German labour data on median wages.

In [25]:
german_jobs=pd.read_csv("../RawData/jobs_d.csv", skiprows=6)
german_jobs=german_jobs[["Unnamed: 0","Median"]].copy()
german_jobs["codes"]=german_jobs["Unnamed: 0"].str[:3]
german_jobs=german_jobs.rename(columns={"Unnamed: 0":"Bezeichnung"})
#strip the parentheses around some values (literal match, not a regex)
german_jobs["Median"]=german_jobs["Median"].str.replace("(","",regex=False)
german_jobs["Median"]=german_jobs["Median"].str.replace(")","",regex=False)
german_jobs=german_jobs.dropna()
german_jobs

Unnamed: 0,Bezeichnung,Median,codes
1,111 Landwirtschaft,9.99,111
2,112 Tierwirtschaft,9.59,112
3,113 Pferdewirtschaft,8.84,113
4,114 Fischwirtschaft,10.00,114
5,115 Tierpflege,10.28,115
...,...,...,...
140,947 Museumstechnik und -management,19.21,947
141,011 Offiziere,25.23,011
142,012 Unteroffiziere mit Portepee,17.90,012
143,013 Unteroffiziere ohne Portepee,14.06,013
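
The wage column arrives as text, so it helps to see the cleaning and the later numeric conversion in one place. In this sketch the bracketed value and the "x" placeholder are assumptions about what the raw file can contain, not values taken from it.

```python
import pandas as pd

# Toy wage strings: a plain value, an assumed bracketed value, an assumed placeholder, a gap
raw = pd.Series(["9.99", "(8.84)", "x", None], name="Median")

# Remove the parentheses as literal characters
cleaned = raw.str.replace("(", "", regex=False).str.replace(")", "", regex=False)

# Anything that still is not a number becomes NaN instead of raising an error
numeric = pd.to_numeric(cleaned, errors="coerce")
print(numeric)
```

Passing `regex=False` avoids having to escape the parentheses, and `errors="coerce"` is the same option the merge cells further down rely on.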


## Clear Matches
This dataset keeps only the job groups where the first and the second run of classification led to exactly the same result. We merge it with the wage data from the Ministry of Labour.

In [31]:
#work on a copy so that the other variants do not share the same number_to_merge column
clear_matches=data.copy()

clear_matches['number_to_merge'] =np.nan
clear_matches['number_to_merge'] = np.where(clear_matches['checked']=="y",clear_matches['germancode'] ,clear_matches['number_to_merge'] )
clear_matches

Unnamed: 0,germancode,checked,Berufsübersetzung,jobs,code,count,mean,std,min,25%,50%,75%,max,rel_std,Berufsübersetzung.1,number_zusammen,Resultatsberuf_zusammen,three_letter_scraper,matched,number_to_merge
0,813,y,Krankenschwes,['Nurse Midwives'],29-116,1.0,97.000000,0.000000,97.0,97.00,97.0,97.00,97.0,0.000000,Krankenschwes,82182-10,Ambulante/r Pfleger/in,821,False,813
1,817,811,Fußspezialis,['Podiatrists'],29-108,1.0,95.000000,0.000000,95.0,95.00,95.0,95.00,95.0,0.000000,Fußspezialis,,,,False,
2,514,y,Transportbeglei,"['Transportation Attendants, Except Flight Att...",53-606,1.0,93.000000,0.000000,93.0,93.00,93.0,93.00,93.0,0.000000,Transportbeglei,,,,False,514
3,825,814,Orthopäden und Protheti,"['Orthotists and Prosthetists', 'Neurodiagnost...",29-209,6.0,91.333333,6.801961,79.0,89.50,94.0,95.50,97.0,0.074474,Orthopäden und Protheti,,,,False,
4,813,832,Haushaltshil,"['Home Health Aides', 'Nursing Assistants', 'O...",31-101,4.0,90.250000,5.377422,84.0,87.75,90.0,92.50,97.0,0.059584,Haushaltshil,,,,False,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
416,924,y,Technische Redakte,"['Technical Writers', 'Copy Writers', 'Poets, ...",27-304,4.0,39.750000,17.858238,14.0,35.00,45.5,50.25,54.0,0.449264,Technische Redakte,92413-10,Redakteur/in,924,True,924
417,414,y,Astrono,"['Astronomers', 'Physicists']",19-201,2.0,39.000000,11.313709,31.0,35.00,39.0,43.00,47.0,0.290095,Astrono,41484-11,"Astrophysiker/in, Astronom/in",414,True,414
418,922,y,Manager für Öffentlichkeitsarbeit und Fundrais,['Public Relations and Fundraising Managers'],11-203,1.0,38.000000,0.000000,38.0,38.00,38.0,38.00,38.0,0.000000,Manager für Öffentlichkeitsarbeit und Fundrais,92203-10,Fundraiser/in,922,True,922
419,611,y,Vertriebsingenie,['Sales Engineers'],41-903,1.0,36.000000,0.000000,36.0,36.00,36.0,36.00,36.0,0.000000,Vertriebsingenie,61124-10,Vertriebsingenieur/in,611,True,611


In [32]:
merged_clear_matches=pd.merge(german_jobs,clear_matches,left_on="codes",right_on="number_to_merge")
merged_clear_matches.to_csv("german_to_us_jobs_clear_matches.csv", index=False)
merged_clear_matches=merged_clear_matches[["Bezeichnung","Median","mean","number_to_merge"]]
merged_clear_matches["Median"]=pd.to_numeric(merged_clear_matches["Median"],errors="coerce")
merged_clear_matches["mean"]=pd.to_numeric(merged_clear_matches["mean"])
merged_clear_matches=merged_clear_matches.dropna()
merged_clear_matches=merged_clear_matches.rename(columns={"Median":"Lohn","mean":"Nähe"})
merged_clear_matches["Sektor"]=merged_clear_matches["number_to_merge"].str[:1]
merged_clear_matches

Unnamed: 0,Bezeichnung,Lohn,Nähe,number_to_merge,Sektor
0,111 Landwirtschaft,9.99,84.000000,111,1
1,111 Landwirtschaft,9.99,43.000000,111,1
2,111 Landwirtschaft,9.99,41.500000,111,1
3,111 Landwirtschaft,9.99,68.000000,111,1
4,111 Landwirtschaft,9.99,55.666667,111,1
...,...,...,...,...,...
252,935 Kunsthandwerkliche Metallgestaltung,13.42,54.333333,935,9
253,"941 Musik-, Gesang-, Dirigententätigkeiten",24.93,67.250000,941,9
255,943 Moderation und Unterhaltung,19.02,86.500000,943,9
256,"945 Veranstaltungs-, Kamera-, Tontechnik",17.61,58.500000,945,9


## 1st run classification
This dataset uses the job groups where the first and the second run of classification led to exactly the same result *and*, where both runs at least identified the same working sector (first digit), the classification from the first run.

In [33]:
classified_1strun=data.copy()
classified_1strun['number_to_merge'] =np.nan
classified_1strun['number_to_merge'] = np.where(classified_1strun['checked']=="y",classified_1strun['germancode'] ,classified_1strun['number_to_merge'] )
classified_1strun['number_to_merge'] = np.where(classified_1strun["germancode"].str[:1]==classified_1strun["checked"].str[:1],classified_1strun['germancode'] ,classified_1strun['number_to_merge'] )
classified_1strun

Unnamed: 0,germancode,checked,Berufsübersetzung,jobs,code,count,mean,std,min,25%,50%,75%,max,rel_std,Berufsübersetzung.1,number_zusammen,Resultatsberuf_zusammen,three_letter_scraper,matched,number_to_merge
0,813,y,Krankenschwes,['Nurse Midwives'],29-116,1.0,97.000000,0.000000,97.0,97.00,97.0,97.00,97.0,0.000000,Krankenschwes,82182-10,Ambulante/r Pfleger/in,821,False,813
1,817,811,Fußspezialis,['Podiatrists'],29-108,1.0,95.000000,0.000000,95.0,95.00,95.0,95.00,95.0,0.000000,Fußspezialis,,,,False,817
2,514,y,Transportbeglei,"['Transportation Attendants, Except Flight Att...",53-606,1.0,93.000000,0.000000,93.0,93.00,93.0,93.00,93.0,0.000000,Transportbeglei,,,,False,514
3,825,814,Orthopäden und Protheti,"['Orthotists and Prosthetists', 'Neurodiagnost...",29-209,6.0,91.333333,6.801961,79.0,89.50,94.0,95.50,97.0,0.074474,Orthopäden und Protheti,,,,False,825
4,813,832,Haushaltshil,"['Home Health Aides', 'Nursing Assistants', 'O...",31-101,4.0,90.250000,5.377422,84.0,87.75,90.0,92.50,97.0,0.059584,Haushaltshil,,,,False,813
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
416,924,y,Technische Redakte,"['Technical Writers', 'Copy Writers', 'Poets, ...",27-304,4.0,39.750000,17.858238,14.0,35.00,45.5,50.25,54.0,0.449264,Technische Redakte,92413-10,Redakteur/in,924,True,924
417,414,y,Astrono,"['Astronomers', 'Physicists']",19-201,2.0,39.000000,11.313709,31.0,35.00,39.0,43.00,47.0,0.290095,Astrono,41484-11,"Astrophysiker/in, Astronom/in",414,True,414
418,922,y,Manager für Öffentlichkeitsarbeit und Fundrais,['Public Relations and Fundraising Managers'],11-203,1.0,38.000000,0.000000,38.0,38.00,38.0,38.00,38.0,0.000000,Manager für Öffentlichkeitsarbeit und Fundrais,92203-10,Fundraiser/in,922,True,922
419,611,y,Vertriebsingenie,['Sales Engineers'],41-903,1.0,36.000000,0.000000,36.0,36.00,36.0,36.00,36.0,0.000000,Vertriebsingenie,61124-10,Vertriebsingenieur/in,611,True,611


In [35]:
merged_classified_1strun=pd.merge(german_jobs,classified_1strun,left_on="codes",right_on="number_to_merge")
merged_classified_1strun.to_csv("german_to_us_jobs_1strun.csv", index=False)
merged_classified_1strun=merged_classified_1strun[["Bezeichnung","Median","mean","number_to_merge"]]
merged_classified_1strun["Median"]=pd.to_numeric(merged_classified_1strun["Median"],errors="coerce")
merged_classified_1strun["mean"]=pd.to_numeric(merged_classified_1strun["mean"])
merged_classified_1strun=merged_classified_1strun.dropna()
merged_classified_1strun=merged_classified_1strun.rename(columns={"Median":"Lohn","mean":"Nähe"})
merged_classified_1strun["Sektor"]=merged_classified_1strun["number_to_merge"].str[:1]
merged_classified_1strun

Unnamed: 0,Bezeichnung,Lohn,Nähe,number_to_merge,Sektor
0,111 Landwirtschaft,9.99,84.00,111,1
1,111 Landwirtschaft,9.99,45.50,111,1
2,111 Landwirtschaft,9.99,43.00,111,1
3,111 Landwirtschaft,9.99,41.50,111,1
4,111 Landwirtschaft,9.99,68.00,111,1
...,...,...,...,...,...
332,"941 Musik-, Gesang-, Dirigententätigkeiten",24.93,67.25,941,9
334,943 Moderation und Unterhaltung,19.02,86.50,943,9
335,"944 Theater-, Film- und Fernsehproduktion",18.95,63.00,944,9
336,"945 Veranstaltungs-, Kamera-, Tontechnik",17.61,58.50,945,9


## 2nd run classification
This dataset uses the job groups where the first and the second run of classification led to exactly the same result *and*, where both runs at least identified the same working sector (first digit), the classification from the second run.

In [52]:
classified_2ndrun=data.copy()
classified_2ndrun['number_to_merge'] =np.nan
classified_2ndrun['number_to_merge'] = np.where(classified_2ndrun['checked']=="y",classified_2ndrun['germancode'] ,classified_2ndrun['number_to_merge'] )
classified_2ndrun['number_to_merge'] = np.where(classified_2ndrun["germancode"].str[:1]==classified_2ndrun["checked"].str[:1],classified_2ndrun['checked'] ,classified_2ndrun['number_to_merge'] )
classified_2ndrun

Unnamed: 0,germancode,checked,Berufsübersetzung,jobs,code,count,mean,std,min,25%,50%,75%,max,rel_std,Berufsübersetzung.1,number_zusammen,Resultatsberuf_zusammen,three_letter_scraper,matched,number_to_merge
0,813,y,Krankenschwes,['Nurse Midwives'],29-116,1.0,97.000000,0.000000,97.0,97.00,97.0,97.00,97.0,0.000000,Krankenschwes,82182-10,Ambulante/r Pfleger/in,821,False,813
1,817,811,Fußspezialis,['Podiatrists'],29-108,1.0,95.000000,0.000000,95.0,95.00,95.0,95.00,95.0,0.000000,Fußspezialis,,,,False,811
2,514,y,Transportbeglei,"['Transportation Attendants, Except Flight Att...",53-606,1.0,93.000000,0.000000,93.0,93.00,93.0,93.00,93.0,0.000000,Transportbeglei,,,,False,514
3,825,814,Orthopäden und Protheti,"['Orthotists and Prosthetists', 'Neurodiagnost...",29-209,6.0,91.333333,6.801961,79.0,89.50,94.0,95.50,97.0,0.074474,Orthopäden und Protheti,,,,False,814
4,813,832,Haushaltshil,"['Home Health Aides', 'Nursing Assistants', 'O...",31-101,4.0,90.250000,5.377422,84.0,87.75,90.0,92.50,97.0,0.059584,Haushaltshil,,,,False,832
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
416,924,y,Technische Redakte,"['Technical Writers', 'Copy Writers', 'Poets, ...",27-304,4.0,39.750000,17.858238,14.0,35.00,45.5,50.25,54.0,0.449264,Technische Redakte,92413-10,Redakteur/in,924,True,924
417,414,y,Astrono,"['Astronomers', 'Physicists']",19-201,2.0,39.000000,11.313709,31.0,35.00,39.0,43.00,47.0,0.290095,Astrono,41484-11,"Astrophysiker/in, Astronom/in",414,True,414
418,922,y,Manager für Öffentlichkeitsarbeit und Fundrais,['Public Relations and Fundraising Managers'],11-203,1.0,38.000000,0.000000,38.0,38.00,38.0,38.00,38.0,0.000000,Manager für Öffentlichkeitsarbeit und Fundrais,92203-10,Fundraiser/in,922,True,922
419,611,y,Vertriebsingenie,['Sales Engineers'],41-903,1.0,36.000000,0.000000,36.0,36.00,36.0,36.00,36.0,0.000000,Vertriebsingenie,61124-10,Vertriebsingenieur/in,611,True,611


In [53]:
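# join the German wage data with the second-run classification on the 3-digit German job code
merged_classified_2ndrun=pd.merge(german_jobs,classified_2ndrun,left_on="codes",right_on="number_to_merge")
merged_classified_2ndrun.to_csv("german_to_us_jobs_2ndrun.csv", index=False)
# keep only the columns needed for the regression; .copy() avoids a SettingWithCopyWarning below
merged_classified_2ndrun=merged_classified_2ndrun[["Bezeichnung","Median","mean","number_to_merge"]].copy()
# non-numeric wage entries become NaN and are dropped two lines further down
merged_classified_2ndrun["Median"]=pd.to_numeric(merged_classified_2ndrun["Median"],errors="coerce")
merged_classified_2ndrun["mean"]=pd.to_numeric(merged_classified_2ndrun["mean"])
merged_classified_2ndrun=merged_classified_2ndrun.dropna()
# Lohn = median wage, Nähe = mean physical proximity score
merged_classified_2ndrun=merged_classified_2ndrun.rename(columns={"Median":"Lohn","mean":"Nähe"})
# the first digit of the German job code is used as the sector
merged_classified_2ndrun["Sektor"]=merged_classified_2ndrun["number_to_merge"].str[:1]
merged_classified_2ndrun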

Unnamed: 0,Bezeichnung,Lohn,Nähe,number_to_merge,Sektor
0,111 Landwirtschaft,9.99,84.000000,111,1
1,111 Landwirtschaft,9.99,43.000000,111,1
2,111 Landwirtschaft,9.99,41.500000,111,1
3,111 Landwirtschaft,9.99,68.000000,111,1
4,111 Landwirtschaft,9.99,55.666667,111,1
...,...,...,...,...,...
333,943 Moderation und Unterhaltung,19.02,86.500000,943,9
334,943 Moderation und Unterhaltung,19.02,62.500000,943,9
335,"944 Theater-, Film- und Fernsehproduktion",18.95,43.000000,944,9
336,"945 Veranstaltungs-, Kamera-, Tontechnik",17.61,58.500000,945,9
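The cleaning steps in the cell above can be tried in isolation. This is a minimal sketch on a made-up three-row table: the row "999 Beispielberuf" and the non-numeric placeholder "x" are invented for illustration and are not taken from the real wage data.

```python
import pandas as pd

# Toy stand-in for the merged table; the third row and the "x" marker are invented
toy = pd.DataFrame({
    "Bezeichnung": ["111 Landwirtschaft", "943 Moderation und Unterhaltung", "999 Beispielberuf"],
    "Median": ["9.99", "19.02", "x"],
    "number_to_merge": ["111", "943", "999"],
})

# errors="coerce" turns anything that is not a number into NaN ...
toy["Median"] = pd.to_numeric(toy["Median"], errors="coerce")
# ... so that dropna() can remove those rows
toy = toy.dropna().copy()

# the code has to be a string for .str slicing; its first character is the sector
toy["Sektor"] = toy["number_to_merge"].str[:1]
print(toy)
```

Note that `.str[:1]` only works if `number_to_merge` is stored as text. If it were read in as an integer column, it would need an `.astype(str)` first.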


Next, we check R^2 and the p-value of a linear regression of proximity (Nähe) on wage (Lohn), separately for each work sector and for each of the classifications. First, we define the specs of the visualization.

In [54]:
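# scatter plot of wage (Lohn) against proximity (Nähe); the data is attached later, per sector
charts=alt.Chart().mark_point().encode(
    x='Lohn',
    y='Nähe',
    tooltip="Bezeichnung"
)
# layer a linear regression line on top of the points
bakedchart=charts + charts.transform_regression('Lohn', 'Nähe').mark_line()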
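Before running this per sector, here is a minimal sketch of where R^2 and the p-value come from in `scipy.stats.linregress`. The wage and proximity numbers are made up for illustration. Only the call itself mirrors what the cells further down do.

```python
import numpy as np
from scipy import stats

# Made-up wage (Lohn) and proximity (Nähe) values, for illustration only
lohn = np.array([10.0, 12.0, 15.0, 18.0, 22.0, 30.0])
naehe = np.array([85.0, 80.0, 70.0, 66.0, 60.0, 45.0])

# x is the wage, y is the proximity score
result = stats.linregress(lohn, naehe)

r_squared = result.rvalue ** 2  # share of the variance in Nähe explained by Lohn
p_value = result.pvalue         # test of the null hypothesis that the slope is zero

print(f"slope={result.slope:.2f}, R^2={r_squared:.3f}, p={p_value:.4f}")
```

A small p-value together with a low R^2 means the slope is unlikely to be zero, but wage still explains little of the spread in proximity.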


And now we run it only with the clear matches. For each sector, the output shows the sector digit, then the p-value, then R^2.

In [59]:
dataset_to_be_used=merged_clear_matches
sectorlist=sorted(dataset_to_be_used["Sektor"].unique().tolist())
for i in sectorlist:
    print(i)
    subset=dataset_to_be_used[dataset_to_be_used["Sektor"]==i]
    # linear regression of proximity (Nähe) on wage (Lohn) within this sector
    result=scipy.stats.linregress(subset["Lohn"], subset["Nähe"])
    print(result.pvalue)
    print(result.rvalue**2)  # R^2

    # scatter plot plus regression line for this sector
    combined = alt.layer(charts + charts.transform_regression('Lohn', 'Nähe').mark_line(), data=subset)
    combined.display()

1
0.018063472104251646
0.38416987062246083


2
0.38070903965501923
0.015402609664760014


3
0.10151899374922574
0.14987794560374476


4
0.24820776587146756
0.08236814529584333


5
0.13204693740189694
0.08200974437177695


6
9.029734798527522e-05
0.5826433706240856


7
0.08400313648515664
0.07460769733501918


8
0.08239727466319294
0.06281404565863401


9
0.9703611199377895
0.00011034765028729514


Next, the same check on the first classification run:

In [60]:
dataset_to_be_used=merged_classified_1strun
sectorlist=sorted(dataset_to_be_used["Sektor"].unique().tolist())
for i in sectorlist:
    print(i)
    subset=dataset_to_be_used[dataset_to_be_used["Sektor"]==i]
    # linear regression of proximity (Nähe) on wage (Lohn) within this sector
    result=scipy.stats.linregress(subset["Lohn"], subset["Nähe"])
    print(result.pvalue)
    print(result.rvalue**2)  # R^2

    # scatter plot plus regression line for this sector
    combined = alt.layer(charts + charts.transform_regression('Lohn', 'Nähe').mark_line(), data=subset)
    combined.display()

1
0.26621752598380083
0.08165237235384798


2
0.9201937036034954
0.00013655867352307605


3
0.03952494401113892
0.19520667960921845


4
0.37998543714215227
0.04078786527569032


5
0.7660282601181172
0.0022971665193865765


6
3.815839221760901e-05
0.5288549863247357


7
0.581250483554385
0.006252527997971995


8
0.0031734888994816106
0.12809002048767235


9
0.7639083216601454
0.005800107007528869


And finally on the second classification run:

In [61]:
dataset_to_be_used=merged_classified_2ndrun
sectorlist=sorted(dataset_to_be_used["Sektor"].unique().tolist())
for i in sectorlist:
    print(i)
    subset=dataset_to_be_used[dataset_to_be_used["Sektor"]==i]
    # linear regression of proximity (Nähe) on wage (Lohn) within this sector
    result=scipy.stats.linregress(subset["Lohn"], subset["Nähe"])
    print(result.pvalue)
    print(result.rvalue**2)  # R^2

    # scatter plot plus regression line for this sector
    combined = alt.layer(charts + charts.transform_regression('Lohn', 'Nähe').mark_line(), data=subset)
    combined.display()

1
0.14546139408089603
0.13584650867581421


2
0.9869487995802184
3.640502659299374e-06


3
0.040915615778079034
0.19278527426866285


4
0.24783975406548694
0.06960674515378394


5
0.6744683060484218
0.004571440383032828


6
4.18016436728691e-05
0.5252275833320984


7
0.031482829938105396
0.09097040636853806


8
0.0263004434951997
0.07478430957546904


9
0.9588496349298119
0.0001834793024556954
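
The three loops above print the same two numbers per sector, which makes the runs hard to compare by eye. One way to line them up is to collect the results in a table. This is a sketch only: `sector_regressions` is a hypothetical helper that is not part of the notebook, and the toy data below is made up.

```python
import pandas as pd
from scipy import stats

def sector_regressions(df, x="Lohn", y="Nähe", group="Sektor"):
    """Hypothetical helper: one linregress per sector, returned as a table."""
    rows = []
    for sector, subset in df.groupby(group):
        # linregress needs at least two distinct x values; skip tiny groups
        if len(subset) < 3 or subset[x].nunique() < 2:
            continue
        res = stats.linregress(subset[x], subset[y])
        rows.append({
            "Sektor": sector,
            "n": len(subset),
            "slope": res.slope,
            "p_value": res.pvalue,
            "r_squared": res.rvalue ** 2,
        })
    return pd.DataFrame(rows)

# Made-up data: sector "3" has only two rows and is skipped
toy = pd.DataFrame({
    "Sektor": ["1"] * 4 + ["2"] * 4 + ["3"] * 2,
    "Lohn": [10, 12, 14, 16, 20, 25, 30, 35, 15, 18],
    "Nähe": [80, 70, 65, 50, 40, 42, 39, 41, 60, 55],
})
summary = sector_regressions(toy)
print(summary)
```

Assuming the three merged tables from above are still in memory, calling the helper on `merged_clear_matches`, `merged_classified_1strun` and `merged_classified_2ndrun` would give three tables that can be compared sector by sector.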
