# Assess Data Quality Programmatically

In this exercise, you will perform a programmatic assessment of the U.S. Bureau of Labor Statistics' "Occupational Employment and Wage Statistics (OEWS) Research Estimates by State and Industry" dataset. This dataset contains data on managerial occupations and their hourly wages.

You will be assessing this data for **completeness** and **consistency**.

You are also provided with a supplementary dataset, the 2021 1-year ACS PUMS dataset, against which to validate the data quality issues.

In [1]:
#DO NOT MODIFY - imports
import pandas as pd
import numpy as np

## Datasets context

### OEWS data (uncleaned)

The OEWS dataset was downloaded manually as an Excel file from the U.S. Bureau of Labor Statistics' website. The data was narrowed down to focus specifically on the managerial domain.

The dataset has many variables; four of them are of significance to us:

- AREA_TITLE: Area/location name, e.g. Alabama
- OCC_CODE: The Standard Occupational Classification (SOC) code, e.g. 11-0000
- OCC_TITLE: The Standard Occupational Classification (SOC) title, e.g. Management Occupations
- H_MEAN: The mean hourly wage for the occupation, e.g. 70.9

### PUMS data (cleaned)

The PUMS dataset was downloaded via the Census Data API from the United States Census Bureau, and narrowed down to the Kern County - Bakersfield MSA, California area.

Dataset variables:

- WRK: Whether the individual worked last week
    - 0: N/A (not reported)
    - 1: Worked
    - 2: Did not work
- SEX: Sex (Male / Female) of the individual
    - 1: Male
    - 2: Female 
- SOCP: Standard Occupational Classification (SOC) codes for 2018 and later, based on the 2018 SOC codes
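
The coded PUMS variables can be turned into readable labels before inspection. The following is a minimal sketch, assuming a small hand-made frame (`pums_example`, with illustrative rows rather than real records) and hypothetical names `wrk_labels` and `sex_labels` for the mappings listed above:

```python
import pandas as pd

# Toy rows shaped like the PUMS extract (illustrative values, not real records)
pums_example = pd.DataFrame({
    "WRK": [1, 2, 0, 1],
    "SEX": [2, 1, 1, 2],
    "SOCP": ["119151", "119111", "113121", "1110XX"],
})

# Code-to-label mappings taken from the variable list above
wrk_labels = {0: "N/A (not reported)", 1: "Worked", 2: "Did not work"}
sex_labels = {1: "Male", 2: "Female"}

# Add label columns next to the raw codes instead of overwriting them
decoded = pums_example.assign(
    WRK_LABEL=pums_example["WRK"].map(wrk_labels),
    SEX_LABEL=pums_example["SEX"].map(sex_labels),
)
print(decoded)
```

Keeping the raw codes alongside the labels makes it easy to spot any code that is missing from the mapping, since `.map()` returns NaN for unmapped values.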

In [2]:
#DO NOT MODIFY
#Read in the uncleaned excel file (note: will take a few minutes to load)
oews_data = pd.read_excel('oes_research_2021_sec_55-56.xlsx')
#Show the first few rows
oews_data.head()

Unnamed: 0,AREA,AREA_TITLE,NAICS,NAICS_TITLE,I_GROUP,OCC_CODE,OCC_TITLE,O_GROUP,TOT_EMP,EMP_PRSE,...,H_MEDIAN,H_PCT75,H_PCT90,A_PCT10,A_PCT25,A_MEDIAN,A_PCT75,A_PCT90,ANNUAL,HOURLY
0,1,Alabama,55,Management of Companies and Enterprises,sector,00-0000,All Occupations,total,21920,0.0,...,35.6,56.94,79.49,35470,47040,74050,118440,165330,,
1,1,Alabama,55,Management of Companies and Enterprises,sector,11-0000,Management Occupations,major,4820,4.1,...,61.13,92.03,#,61600,94020,127140,191420,#,,
2,1,Alabama,55,Management of Companies and Enterprises,sector,11-1021,General and Operations Managers,detailed,1600,7.0,...,60.5,#,#,60010,78520,125850,#,#,,
3,1,Alabama,55,Management of Companies and Enterprises,sector,11-2021,Marketing Managers,detailed,140,13.6,...,61.13,99.23,#,65240,98680,127140,206410,#,,
4,1,Alabama,55,Management of Companies and Enterprises,sector,11-2022,Sales Managers,detailed,140,14.7,...,49.56,77.94,#,59390,79010,103080,162110,#,,


In [3]:
#DO NOT MODIFY
#Read the cleaned .csv file
cleaned_pums = pd.read_csv('cleaned_pums_2021.csv')
#Show the first few rows
cleaned_pums.head()

Unnamed: 0,WRK,SEX,SOCP
0,1,2,119151
1,2,1,119111
2,1,2,113121
3,1,1,1110XX
4,1,1,113051


## 1. Inspect the completeness

In the first step, take a look at the completeness of the OEWS dataset, and identify any missing or incomplete values.

### 1.1 Create a subset of the dataset 
Create a subset of the dataset to only include the required variables: `AREA_TITLE`, `OCC_CODE`, `OCC_TITLE`, `H_MEAN`. **Use this subset for all the following steps in this exercise.**

Check if there are any NA values in the data programmatically using `isnull()`.

In [4]:
#FILL IN - create a subset of the dataset
#oews_data_subset = 
oews_data_subset = oews_data[['AREA_TITLE', 'OCC_CODE', 'OCC_TITLE', 'H_MEAN']]

In [5]:
#FILL IN - check programmatically if there are NA values using isnull()
print(oews_data_subset.isnull().sum().sum())

0
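
A grand total of 0 only says that no value is encoded as `NaN` or `None`. As a hedged illustration on a toy frame (the frame `toy` and its values are made up for this sketch), counting per column shows where missing values sit, and also that placeholder strings are not counted by `isnull()`:

```python
import numpy as np
import pandas as pd

# Toy frame: one real None, one real NaN, and one "*" placeholder string
toy = pd.DataFrame({
    "AREA_TITLE": ["Alabama", "Alaska", None],
    "H_MEAN": [70.9, np.nan, "*"],
})

# Missing values per column, then the grand total
na_per_column = toy.isnull().sum()
total_na = int(na_per_column.sum())

print(na_per_column)
print(total_na)  # the "*" string is not counted as missing
```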


### 1.2 Check the summary statistics
Use the `.describe()` and `.info()` methods to check the summary statistics for the OEWS dataset, specifically the `H_MEAN` variable.

In [6]:
#FILL IN - run the .describe() function
oews_data_subset.describe()

Unnamed: 0,AREA_TITLE,OCC_CODE,OCC_TITLE,H_MEAN
count,71508,71508,71508,71508
unique,54,596,596,6818
top,California,00-0000,All Occupations,*
freq,3223,1161,1161,536
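
When `.describe()` reports `unique`, `top` and `freq` for a column that should be numeric, the column likely holds non-numeric tokens. One possible way to list them is sketched below on a toy Series. The presence of `#` in `H_MEAN` is an assumption made for illustration (it appears in other wage columns in the `.head()` output above), and the variable names are hypothetical:

```python
import pandas as pd

# Toy column mixing numbers, numeric strings, and placeholder symbols
h_mean = pd.Series(["42.88", 70.9, "*", "#", "*", 62.97], name="H_MEAN")

# Anything that cannot be parsed as a number becomes NaN
as_number = pd.to_numeric(h_mean, errors="coerce")

# Keep the original values that failed to parse but were not already missing
placeholders = h_mean[as_number.isna() & h_mean.notna()]
print(placeholders.value_counts())
```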


In [7]:
#FILL IN - run the .info() function
oews_data_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71508 entries, 0 to 71507
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   AREA_TITLE  71508 non-null  object
 1   OCC_CODE    71508 non-null  object
 2   OCC_TITLE   71508 non-null  object
 3   H_MEAN      71508 non-null  object
dtypes: object(4)
memory usage: 2.2+ MB


### 1.3 Look into the dtype of the dataset.
There are a couple of things to notice.
1. The `H_MEAN` variable should be a numerical dtype (i.e., `float64`), but is instead an object.
2. In the `.describe()` output, the most frequent `H_MEAN` value is the `*` sign, which indicates that a wage estimate is **not available**. It should therefore be a NaN value, even though it is not encoded as one.

To solve this issue, replace the `*` sign in the `H_MEAN` variable with `np.nan` using `.replace()`.

In [8]:
#FILL IN - Print the dtypes
print(oews_data_subset.dtypes)

AREA_TITLE    object
OCC_CODE      object
OCC_TITLE     object
H_MEAN        object
dtype: object


In [12]:
#DO NOT MODIFY
#Disable chained assignments
#Objective: Silences warnings when operating on slices of dataframes
#for the purposes of this exercise
pd.options.mode.chained_assignment = None 

In [13]:
#FILL IN
#Replace the * sign with np.nan
oews_data_subset['H_MEAN'] = oews_data_subset['H_MEAN'].replace({'*': np.nan})
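
As an alternative to silencing the chained-assignment warning, the subset can be taken as an explicit copy, and the column can be converted to a numeric dtype once the placeholder is gone. This is only a sketch on toy data (the frame `raw` and its values are illustrative); whether `.replace()` alone changes the dtype may depend on the pandas version, so the explicit `pd.to_numeric` call makes the result unambiguous:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw OEWS frame (illustrative values)
raw = pd.DataFrame({
    "AREA_TITLE": ["Alabama", "Alabama", "Alaska"],
    "OCC_CODE": ["00-0000", "11-0000", "11-1021"],
    "H_MEAN": [42.88, "*", 72.76],
    "TOT_EMP": [21920, 4820, 1600],
})

# .copy() yields an independent frame, so later assignments do not
# trigger chained-assignment warnings
subset = raw[["AREA_TITLE", "OCC_CODE", "H_MEAN"]].copy()

# Replace the placeholder, then convert the column to a numeric dtype
subset["H_MEAN"] = subset["H_MEAN"].replace({"*": np.nan})
subset["H_MEAN"] = pd.to_numeric(subset["H_MEAN"])

print(subset.dtypes)
```

Without `errors="coerce"`, `pd.to_numeric` raises an error if any other non-numeric token remains, which acts as a check that every placeholder has been handled.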

### 1.4 Check the number of NA values again
Now, check the NA values in the OEWS dataset again.

In [14]:
#Check number of NA values in OEWS data
print(oews_data_subset.isnull().sum().sum())

536
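
The 536 missing values match the frequency of `*` reported by `.describe()`, and they amount to roughly 0.75% of the 71,508 rows. To see how missing values are distributed rather than just how many there are, a breakdown along the following lines may help. It is a sketch on a toy frame with made-up values and hypothetical variable names:

```python
import numpy as np
import pandas as pd

# Toy frame with missing wages concentrated in one area (illustrative values)
toy = pd.DataFrame({
    "AREA_TITLE": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "OCC_CODE": ["11-0000", "11-1021", "11-0000", "11-1021"],
    "H_MEAN": [70.9, np.nan, np.nan, np.nan],
})

# Missing values per column
missing_by_column = toy.isnull().sum()

# Share of rows with a missing H_MEAN
missing_share = toy["H_MEAN"].isnull().mean()

# Missing H_MEAN values per area
missing_by_area = toy["H_MEAN"].isnull().groupby(toy["AREA_TITLE"]).sum()

print(missing_by_column)
print(missing_share)
print(missing_by_area)
```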


## 2. Inspect the consistency

Check for consistency between the OEWS and PUMS data for the `AREA_TITLE` and `OCC_CODE`/`SOCP` variables, and answer the following questions.

### 2.1 Is the Area consistent between the two datasets? 
**Note**: Recall that the PUMS dataset **only** contains data for the Kern County - Bakersfield MSA, California area.

Is the Area consistent between the two datasets? Use the `.head()` function, and optionally `.describe()` and `.info()`.

In [15]:
#FILL IN - inspect the head of the OEWS dataframe
oews_data_subset.head()

Unnamed: 0,AREA_TITLE,OCC_CODE,OCC_TITLE,H_MEAN
0,Alabama,00-0000,All Occupations,42.88
1,Alabama,11-0000,Management Occupations,70.9
2,Alabama,11-1021,General and Operations Managers,72.76
3,Alabama,11-2021,Marketing Managers,69.97
4,Alabama,11-2022,Sales Managers,62.97


In [16]:
#FILL IN - inspect the head of the PUMS dataframe
cleaned_pums.head()

Unnamed: 0,WRK,SEX,SOCP
0,1,2,119151
1,2,1,119111
2,1,2,113121
3,1,1,1110XX
4,1,1,113051


*Answer*: 

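The locations are not consistent: the OEWS data covers multiple areas across the U.S. (54 unique `AREA_TITLE` values), whereas the PUMS data covers only the Kern County - Bakersfield MSA in California.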
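
The area coverage can also be checked programmatically. The sketch below uses a toy frame with illustrative values (`oews_example` is hypothetical) and narrows it to California. Note that this only brings the two datasets closer: a state-level filter is still broader than a single MSA.

```python
import pandas as pd

# Toy stand-in for the OEWS subset (wage values are illustrative)
oews_example = pd.DataFrame({
    "AREA_TITLE": ["Alabama", "California", "California", "Texas"],
    "OCC_CODE": ["11-0000", "11-0000", "11-1021", "11-0000"],
    "H_MEAN": [70.9, 80.0, 85.0, 75.0],
})

# How many distinct areas does the dataset cover?
n_areas = oews_example["AREA_TITLE"].nunique()

# Keep only the state that contains the PUMS area
california_only = oews_example[oews_example["AREA_TITLE"] == "California"]

print(n_areas)
print(california_only)
```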

### 2.2. Are the occupation codes consistent?

Are the occupation codes consistent between the two datasets (`OCC_CODE` and `SOCP`)? Use the `.sample()` function to pull a few random samples from the dataset. What is the difference, if any?

In [24]:
#FILL IN
#Pull a random sample from the OCC_CODE column of the OEWS dataframe
oews_data_subset['OCC_CODE'].sample(4)

69058    49-0000
70199    49-0000
28155    39-7010
56753    41-0000
Name: OCC_CODE, dtype: object

In [23]:
#FILL IN
#Pull a random sample from the SOCP column of the cleaned_pums dataframe
cleaned_pums['SOCP'].sample(4)

11058    1191XX
2009     119111
18657    119021
2833     119021
Name: SOCP, dtype: object

*Answer*: 

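We can see an inconsistency between the `SOCP` and `OCC_CODE` variables, specifically in the format: the PUMS `SOCP` codes have no hyphen (e.g. `119111`), whereas the OEWS `OCC_CODE` values do (e.g. `11-1021`). Some `SOCP` codes also end in `XX` (e.g. `1191XX`).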
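
The format difference can be confirmed, and partly reconciled, in code. The following is a minimal sketch on toy Series with illustrative codes, not real samples. It assumes that removing the hyphen is enough to align the two formats; codes ending in `XX` would still have no exact counterpart.

```python
import pandas as pd

# Toy codes in the two formats (illustrative values)
occ_code = pd.Series(["11-1021", "11-9111", "11-9021", "49-0000"], name="OCC_CODE")
socp = pd.Series(["119111", "119021", "1191XX", "113121"], name="SOCP")

# Check the formats: OEWS codes follow NN-NNNN, PUMS codes contain no hyphen
oews_has_hyphen_format = occ_code.str.fullmatch(r"\d{2}-\d{4}").all()
pums_has_hyphen = socp.str.contains("-", regex=False).any()

# Normalize OEWS codes by dropping the hyphen, then compare the two sets
occ_normalized = occ_code.str.replace("-", "", regex=False)
shared = sorted(set(occ_normalized) & set(socp))

# PUMS codes ending in XX cannot be matched exactly
unmatched_xx = socp[socp.str.endswith("XX")].tolist()

print(shared)
print(unmatched_xx)
```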