In [1]:
import pandas as pd
import numpy as np

## First Table, Salaries by Primary Work Activity

In [2]:
#first clean the salaries by Work Activity
salariesDF = pd.read_csv(r'..\data\inProcess\salaries.csv')  # raw string so the backslashes are not treated as escapes
salariesDF.set_index('Field of study', inplace = True)
salariesDF.head()

Field of study,All full-time employed,Computer applications,"Management, sales, or administration",Professional services,Any R&D,Teaching,Other
All fields,110000,129000,136000,115000,115000,79000,106000
Science,104000,126000,130000,110000,110000,76000,99000
"Biological, agricultural, and environmental life sciences",100000,105000,125000,152000,100000,75000,99000
Agricultural and food sciences,106000,100000,135000,120000,101000,84000,88000
Biochemistry and biophysics,109000,116000,135000,159000,105000,69000,110000


In [3]:
salariesDF.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32 entries, All fields to Health
Data columns (total 7 columns):
 #   Column                                Non-Null Count  Dtype 
---  ------                                --------------  ----- 
 0   All full-time employed                32 non-null     int64 
 1   Computer applications                 32 non-null     object
 2   Management, sales, or administration  32 non-null     int64 
 3   Professional services                 32 non-null     object
 4   Any R&D                               32 non-null     int64 
 5   Teaching                              32 non-null     int64 
 6   Other                                 32 non-null     object
dtypes: int64(4), object(3)
memory usage: 2.0+ KB


Some of the columns have dtype "object", indicating that they contain non-numeric values.
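Rather than checking one column at a time, the non-numeric entries can also be collected for every column in a single pass. The following is a minimal sketch on a small made-up frame; `find_non_numeric` is a hypothetical helper and the toy values are illustrative, not taken from the salary file.

```python
import pandas as pd

def find_non_numeric(df):
    """Map each non-numeric column to the entries that cannot be parsed as numbers."""
    result = {}
    for col in df.columns:
        # Columns that are already numeric cannot hold markers such as "D"
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
        # Coercion turns unparseable entries into NaN; ignore cells that were already missing
        mask = pd.to_numeric(df[col], errors='coerce').isna() & df[col].notna()
        if mask.any():
            result[col] = df.loc[mask, col]
    return result

# Toy data standing in for the salary table (illustrative values only)
demo = pd.DataFrame(
    {'A': [100, 200, 300], 'B': ['110', 'D', '130'], 'C': ['D', '90', 'S']},
    index=['x', 'y', 'z'],
)
bad = find_non_numeric(demo)
for col, values in bad.items():
    print(col, values.to_dict())
```

Calling the helper on `salariesDF` should list the same "D" cells that the per-column cells below find individually.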

In [4]:
#Find the non numeric rows in column "computer applications"
notNumbers = pd.to_numeric(salariesDF['Computer applications'], errors='coerce').isna()
print(notNumbers.sum())
print(salariesDF['Computer applications'][notNumbers])

2
Field of study
Zoology                             D
Political science and government    D
Name: Computer applications, dtype: object


In [5]:
#Find the non numeric rows in column "Professional services"
notNumbers = pd.to_numeric(salariesDF['Professional services'], errors='coerce').isna()
print(notNumbers.sum())
print(salariesDF['Professional services'][notNumbers])

2
Field of study
Astronomy and astrophysics                                D
Aerospace, aeronautical, and astronautical engineering    D
Name: Professional services, dtype: object


In [6]:
#Find the non numeric rows in column "Other"
notNumbers = pd.to_numeric(salariesDF['Other'], errors='coerce').isna()
print(notNumbers.sum())
print(salariesDF['Other'][notNumbers])

1
Field of study
Aerospace, aeronautical, and astronautical engineering    D
Name: Other, dtype: object


According to footnotes, "D" stands for "suppressed to avoid disclosure of confidential information."

Looking at the data on the web site it was downloaded from, some rows are summaries of the rows that follow. E.g., "Biological, agricultural, and environmental life sciences" is broken down in the next rows, starting with "Agricultural and food sciences", for a total of 7 categories, each of which is broken into more categories. These indents are visible on the web, but are lost when the data are downloaded as Excel files.

Because the "D" fields are probably small samples, we can replace each "D" with the value from its parent category.
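The idea of falling back to the parent category can be sketched as a lookup from each subcategory to the row that summarizes it. This is a minimal sketch, not the notebook's own code: `fill_from_parent` and `parent_of` are hypothetical names, and the toy salary figures are illustrative only.

```python
import pandas as pd

def fill_from_parent(df, parent_of, marker='D'):
    """Replace `marker` cells in each child row with the parent row's value."""
    out = df.copy()
    for child, parent in parent_of.items():
        for col in out.columns:
            # str() lets the comparison work for both numeric and text cells
            if str(out.loc[child, col]) == marker:
                out.loc[child, col] = out.loc[parent, col]
    # Once the markers are gone, every column can be converted to numbers
    return out.apply(pd.to_numeric)

aero = 'Aerospace, aeronautical, and astronautical engineering'
parent_of = {aero: 'Engineering'}  # assumed child -> parent relationship

# Toy frame: suppressed cells are stored as the string "D"
demo = pd.DataFrame(
    {
        'Professional services': ['120000', 'D'],
        'Other': ['110000', 'D'],
        'Teaching': [90000, 85000],
    },
    index=['Engineering', aero],
)
filled = fill_from_parent(demo, parent_of)
print(filled)
```

Keeping the relationships in one dictionary makes it easy to review which parent each suppressed row borrows from.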

In [7]:
salariesDF.index

Index(['All fields', 'Science',
       'Biological, agricultural, and environmental life sciences',
       'Agricultural and food sciences', 'Biochemistry and biophysics',
       'Cell, cellular biology, and molecular biology',
       'Microbiological sciences and immunology',
       'Natural resources and conservation', 'Zoology',
       'Other biological sciences', 'Computer and information sciences',
       'Mathematics and statistics',
       'Physical sciences, geosciences, atmospheric, and ocean sciences',
       'Astronomy and astrophysics', 'Chemistry, except biochemistry',
       'Geosciences, atmospheric, and ocean sciences', 'Physics', 'Psychology',
       'Social sciences', 'Economics', 'Political science and government',
       'Sociology, demography, and population studies',
       'Other social sciences', 'Engineering',
       'Aerospace, aeronautical, and astronautical engineering',
       'Chemical engineering', 'Civil engineering',
       'Electrical and computer engi

Looking at the web site, "Zoology" is a subcategory of "Biological, agricultural, and environmental life sciences"

"Political science and government" is a subcategory of "Social sciences"

"Astronomy and astrophysics" is a subcategory of "Physical sciences, geosciences, atmospheric, and ocean sciences"

"Aerospace, aeronautical, and astronautical engineering" is a subcategory of "Engineering"


In [8]:
#Replace the non numeric values with the value for the higher order classification for each row
salariesDF.loc['Zoology','Computer applications'] = salariesDF.loc['Biological, agricultural, and environmental life sciences','Computer applications']
salariesDF.loc['Political science and government','Computer applications'] = salariesDF.loc['Social sciences','Computer applications']
salariesDF.loc['Astronomy and astrophysics','Professional services'] = salariesDF.loc['Physical sciences, geosciences, atmospheric, and ocean sciences','Professional services']
salariesDF.loc['Aerospace, aeronautical, and astronautical engineering','Professional services'] = salariesDF.loc['Engineering','Professional services']
salariesDF.loc['Aerospace, aeronautical, and astronautical engineering','Other'] = salariesDF.loc['Engineering','Other']
salariesDF['Computer applications'] = pd.to_numeric(salariesDF['Computer applications'], errors='raise')
salariesDF['Professional services'] = pd.to_numeric(salariesDF['Professional services'], errors='raise')
salariesDF['Other'] = pd.to_numeric(salariesDF['Other'], errors='raise')
salariesDF.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32 entries, All fields to Health
Data columns (total 7 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   All full-time employed                32 non-null     int64
 1   Computer applications                 32 non-null     int64
 2   Management, sales, or administration  32 non-null     int64
 3   Professional services                 32 non-null     int64
 4   Any R&D                               32 non-null     int64
 5   Teaching                              32 non-null     int64
 6   Other                                 32 non-null     int64
dtypes: int64(7)
memory usage: 3.0+ KB


## Second Table, Salaries by Location


In [9]:
#next clean the salaries by location data (broad field only)
locationDF = pd.read_csv(r'..\data\inProcess\location.csv')  # raw string so the backslashes are not treated as escapes
locationDF.set_index('Field of study', inplace = True)
locationDF.head()

Field of study,All locations,New England,Connecticut,Maine,Massachusetts,New Hampshire,Rhode Island,Vermont,Middle Atlantic,New Jersey,...,Utah,Wyoming,Pacific,Alaska,California,Hawaii,Oregon,Washington,Puerto Rico,U.S. territories and other areas
All fields,110000,115000,115000,99000,120000,96000,109000,102000,115000,130000,...,98000,75000,125000,93000,130000,89000,110000,119000,80000,99000
Total_science,104000,110000,106000,96000,113000,94000,103000,100000,110000,129000,...,94000,72000,118000,90000,120000,89000,97000,110000,75000,89000
"Biological, agricultural and environmental life sciences",100000,105000,108000,99000,109000,85000,99000,88000,106000,120000,...,85000,75000,110000,87000,115000,94000,96000,109000,83000,81000
Computer and information sciences,139000,125000,125000,D,126000,D,D,D,150000,148000,...,117000,D,157000,D,157000,D,140000,158000,S,S
Mathematics and statistics,108000,118000,105000,D,129000,78000,100000,S,125000,134000,...,109000,D,125000,D,134000,103000,88000,124000,D,70000


Again, cells marked "D" or "S" are suppressed. Because states are subdivisions of regions, we can replace these values with the value for the state's region.

In [10]:
#Replace all suppression indicators with NaN
#(np.nan is used because the np.NaN alias was removed in NumPy 2.0)
locationDF.replace('D', np.nan, inplace = True)
locationDF.replace('S', np.nan, inplace = True)
locationDF.replace('*', np.nan, inplace = True)
print('Number of Null values is: ',locationDF.isna().sum().sum())

Number of Null values is:  52
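As an alternative to replacing the markers after loading, they can be declared as missing values at read time through the `na_values` parameter of `pd.read_csv`. The sketch below is not what this notebook does; it reads a three-row excerpt of the location table from an in-memory string (the `csv_text` variable is only for illustration) so that it runs without the data file.

```python
import io
import pandas as pd

# Excerpt of the location table shown above, kept in memory for the example
csv_text = """Field of study,All locations,Maine,Vermont
All fields,110000,99000,102000
Computer and information sciences,139000,D,D
Mathematics and statistics,108000,D,S
"""

# na_values tells pandas to treat the suppression markers as NaN while parsing
demo = pd.read_csv(
    io.StringIO(csv_text),
    index_col='Field of study',
    na_values=['D', 'S', '*'],
)
print('Number of Null values is: ', demo.isna().sum().sum())
print(demo.dtypes)
```

With this approach the affected columns arrive as floats rather than "object", so no later `pd.to_numeric` conversion is needed.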


In [11]:
locationDF.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, All fields to Health
Data columns (total 63 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   All locations                     10 non-null     int64 
 1   New England                       10 non-null     int64 
 2   Connecticut                       10 non-null     int64 
 3   Maine                             7 non-null      object
 4   Massachusetts                     10 non-null     int64 
 5   New Hampshire                     9 non-null      object
 6   Rhode Island                      8 non-null      object
 7   Vermont                           7 non-null      object
 8   Middle Atlantic                   10 non-null     int64 
 9   New Jersey                        10 non-null     int64 
 10  New York                          10 non-null     int64 
 11  Pennsylvania                      10 non-null     int64 
 12  East North Centr

In [12]:
#Replace the NaNs in each state column with the values from its region.
#A simple positional loop won't work because the regions don't contain the
#same number of states, so each column is pointed to its region explicitly.
#Assigning the result back avoids calling fillna(inplace=True) on a column
#selection, which may not update locationDF in recent pandas versions.

locationDF['Maine'] = locationDF['Maine'].fillna(locationDF['New England'])
locationDF['New Hampshire'] = locationDF['New Hampshire'].fillna(locationDF['New England'])
locationDF['Rhode Island'] = locationDF['Rhode Island'].fillna(locationDF['New England'])
locationDF['Vermont'] = locationDF['Vermont'].fillna(locationDF['New England'])
locationDF['Iowa'] = locationDF['Iowa'].fillna(locationDF['West North Central'])
locationDF['Kansas'] = locationDF['Kansas'].fillna(locationDF['West North Central'])
locationDF['Nebraska'] = locationDF['Nebraska'].fillna(locationDF['West North Central'])
locationDF['North Dakota'] = locationDF['North Dakota'].fillna(locationDF['West North Central'])
locationDF['South Dakota'] = locationDF['South Dakota'].fillna(locationDF['West North Central'])
locationDF['Delaware'] = locationDF['Delaware'].fillna(locationDF['South Atlantic'])
locationDF['District of Columbia'] = locationDF['District of Columbia'].fillna(locationDF['South Atlantic'])
locationDF['West Virginia'] = locationDF['West Virginia'].fillna(locationDF['South Atlantic'])
locationDF['Kentucky'] = locationDF['Kentucky'].fillna(locationDF['East South Central'])
locationDF['Mississippi'] = locationDF['Mississippi'].fillna(locationDF['East South Central'])
locationDF['Arkansas'] = locationDF['Arkansas'].fillna(locationDF['West South Central'])
locationDF['Oklahoma'] = locationDF['Oklahoma'].fillna(locationDF['West South Central'])
locationDF['Arizona'] = locationDF['Arizona'].fillna(locationDF['Mountain'])
locationDF['Idaho'] = locationDF['Idaho'].fillna(locationDF['Mountain'])
locationDF['Montana'] = locationDF['Montana'].fillna(locationDF['Mountain'])
locationDF['Nevada'] = locationDF['Nevada'].fillna(locationDF['Mountain'])
locationDF['Wyoming'] = locationDF['Wyoming'].fillna(locationDF['Mountain'])
locationDF['Alaska'] = locationDF['Alaska'].fillna(locationDF['Pacific'])
locationDF['Hawaii'] = locationDF['Hawaii'].fillna(locationDF['Pacific'])
locationDF = locationDF.apply(pd.to_numeric)
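The state-to-region pointing can also be driven by a dictionary that lists, for each region, the states to fill. This is a minimal sketch under the assumption that such a mapping is maintained by hand; `region_states` is a hypothetical name, and the toy frame holds only a few cells copied from the table excerpt above.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping: region column -> state columns that may need filling
region_states = {
    'New England': ['Maine', 'Vermont'],
    'Pacific': ['Alaska'],
}

# Two rows from the location table, with suppressed cells already set to NaN
demo = pd.DataFrame(
    {
        'New England': [115000, 125000],
        'Maine': [99000, np.nan],
        'Vermont': [102000, np.nan],
        'Pacific': [125000, 157000],
        'Alaska': [93000, np.nan],
    },
    index=['All fields', 'Computer and information sciences'],
)

for region, states in region_states.items():
    for state in states:
        # Missing state values fall back to the region value in the same row
        demo[state] = demo[state].fillna(demo[region])

print(demo)
```

Because each region carries its own list, the differing number of states per region is not an obstacle to looping.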

In [13]:
#US territories and Puerto Rico are both missing values. Just using "All locations"
#would be artificially high. Instead, calculate the average ratio of the Territories
#median salary to the "All locations" median salary (over the fields where the
#Territories value is present) and use it to scale "All locations" for the missing fields
territories_percentage = (locationDF['U.S. territories and other areas'] / locationDF['All locations']).mean()
locationDF['U.S. territories and other areas'] = locationDF['U.S. territories and other areas'].fillna(locationDF['All locations']*territories_percentage)

#now set Puerto Rico to the territories value
locationDF['Puerto Rico'] = locationDF['Puerto Rico'].fillna(locationDF['U.S. territories and other areas'])
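To make the scaling step concrete, here is a small worked example using three rows from the table excerpt shown earlier. It is only a sketch of the same ratio idea on a reduced frame, so the resulting numbers differ from what the full table produces.

```python
import numpy as np
import pandas as pd

terr = 'U.S. territories and other areas'
cs = 'Computer and information sciences'

# Three rows from the location table; the suppressed "S" cell is NaN here
demo = pd.DataFrame(
    {
        'All locations': [110000, 104000, 139000],
        terr: [99000, 89000, np.nan],
    },
    index=['All fields', 'Total_science', cs],
)

# Row-wise ratio of territories salary to the overall salary;
# mean() skips the NaN ratio of the suppressed row by default
ratio = (demo[terr] / demo['All locations']).mean()

# Scale the overall salary by the average ratio wherever the value is missing
demo[terr] = demo[terr].fillna(demo['All locations'] * ratio)

print(round(ratio, 4))
print(demo)
```

Here the ratio is the average of 99000/110000 and 89000/104000, so the missing cell is filled with a value somewhat below the "All locations" figure for that field.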


In [15]:
print('Number of Null values is: ',locationDF.isna().sum().sum())

Number of Null values is:  0


## Third Table, Median Salaries by Sector of Employment

In [17]:
#next clean the salaries by sector data
sectorDF = pd.read_csv(r'..\data\inProcess\sector.csv')  # raw string so the backslashes are not treated as escapes
sectorDF.set_index('Field of study', inplace = True)
sectorDF.head()

Field of study,All full-time employed,4-year educational institution,Other educational institution,"Private, for profit","Private, nonprofit",Federal government,State or local government,Self-employed,Other
All fields,110000,90000,71000,137000,114000,120000,90000,99000,124000
Science,104000,88000,71000,135000,109000,118000,86000,100000,124000
"Biological, agricultural, and environmental life sciences",100000,84000,69000,130000,105000,110000,79000,99000,89000
Agricultural and food sciences,106000,94000,88000,125000,100000,115000,79000,89000,65000
Biochemistry and biophysics,109000,82000,57000,129000,131000,114000,108000,68000,D


In [18]:
#Replace all suppression indicators with NaN
#(np.nan is used because the np.NaN alias was removed in NumPy 2.0)
sectorDF.replace('D', np.nan, inplace = True)
sectorDF.replace('S', np.nan, inplace = True)
sectorDF.replace('*', np.nan, inplace = True)
print('Number of Null values is: ',sectorDF.isna().sum().sum())

Number of Null values is:  28


In [19]:
sectorDF.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32 entries, All fields to Health
Data columns (total 9 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   All full-time employed          32 non-null     int64 
 1   4-year educational institution  32 non-null     int64 
 2   Other educational institution   29 non-null     object
 3   Private, for profit             32 non-null     int64 
 4   Private, nonprofit              32 non-null     int64 
 5   Federal government              32 non-null     int64 
 6   State or local government       27 non-null     object
 7   Self-employed                   19 non-null     object
 8   Other                           25 non-null     object
dtypes: int64(5), object(4)
memory usage: 2.5+ KB
