# LCA Exploratory Analysis 3 - Who's being rejected and why?

For this last section, I want to take a look at the applicants who have their LCAs rejected. Note that these aren't full H1B applications - the LCA is only one step in the process. The overwhelming majority of these are approved, so I'm curious to see if there's a pattern to those applications which are rejected.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

delta_df = pd.read_csv("Delta_LCA.csv", index_col=0)

In [2]:
# I'm only interested in CERTIFIED and DENIED applications here, so let's filter for those.

certified_delta_df = delta_df[delta_df.STATUS.str.contains("CERTIFIED")]
denied_delta_df = delta_df[delta_df.STATUS == "DENIED"]
cd_delta_df = delta_df[delta_df.STATUS != "WITHDRAWN"]  # I'm not counting WITHDRAWN apps.
cd_delta_df = cd_delta_df.replace("CERTIFIED-WITHDRAWN", "CERTIFIED")  # Apps withdrawn after certification still count as certified.
total_applications = float(len(cd_delta_df))

print "Certified Applications:", len(certified_delta_df)
print "Denied Applications:", len(denied_delta_df)
print "Percent Denied Applications:", round(100 * len(denied_delta_df)/total_applications, 3)

Certified Applications: 472803
Denied Applications: 10982
Percent Denied Applications: 2.27


Let's look at large companies and see if any particular group of them tend to get denied.

In [3]:
grouped_by_company = cd_delta_df.groupby(["LCA_CASE_EMPLOYER_NAME"])
company_sizes = grouped_by_company.size()
valid_companies = company_sizes[company_sizes > 250].keys()  # Only companies with more than 250 applications.

# Build each mask from the frame being filtered so the boolean index lines up with its rows.
denied_company_df = denied_delta_df[denied_delta_df["LCA_CASE_EMPLOYER_NAME"].isin(valid_companies)]
certified_company_df = certified_delta_df[certified_delta_df["LCA_CASE_EMPLOYER_NAME"].isin(valid_companies)]
d_grouped_by_company = denied_company_df.groupby("LCA_CASE_EMPLOYER_NAME").size()
c_grouped_by_company = certified_company_df.groupby("LCA_CASE_EMPLOYER_NAME").size()

# I'm going to perform a join on these two series to calculate the percent denied for companies.
# There has got to be a better way to do this - I'd love to know how.

company_df = pd.concat([c_grouped_by_company, d_grouped_by_company], axis=1)

# Column 0 is the certified count and column 1 the denied count, so this is denied / certified.
company_df["Denied_prop"] = company_df[1]/company_df[0]

print company_df["Denied_prop"].dropna().sort_values(ascending=False).head(10)

BETA SOFT SYSTEMS INC                         0.202830
ERNST & YOUNG US LLP                          0.050492
CHARTER GLOBAL INC                            0.045614
PHOTON INFOTECH INC                           0.044379
MOTOROLA MOBILITY LLC                         0.042493
HARVARD UNIVERSITY                            0.041509
IBM INDIA PRIVATE LTD                         0.038636
ECLINICALWORKS LLC                            0.038596
CAPGEMINI US LLC                              0.035539
MASTECH INC A MASTECH HOLDINGS INC COMPANY    0.033226
Name: Denied_prop, dtype: float64
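
On the "better way" question in the cell above: one option is to skip the join and take the mean of a boolean "is denied" flag within each group. The sketch below is an assumption about how that could look, not something run against the LCA data; `denial_rate_by`, `toy`, and the `EMPLOYER` column are hypothetical names used only for illustration. Note that it returns denied / (certified + denied), a true share, so its numbers would come out slightly lower than the denied / certified ratios printed above.

```python
import pandas as pd

def denial_rate_by(df, column, min_count=0, status_col="STATUS"):
    """Share of DENIED rows for each value of `column` (hypothetical helper)."""
    # 1.0 for denied rows, 0.0 otherwise, so the group mean is the denied share.
    denied = (df[status_col] == "DENIED").astype(float)
    stats = denied.groupby(df[column]).agg(["mean", "count"])
    stats.columns = ["denied_share", "n_applications"]
    # Keep only groups with more than `min_count` applications.
    stats = stats[stats["n_applications"] > min_count]
    return stats.sort_values("denied_share", ascending=False)

# Toy data, not taken from the LCA file.
toy = pd.DataFrame({
    "EMPLOYER": ["A", "A", "A", "A", "B", "B"],
    "STATUS": ["CERTIFIED", "DENIED", "CERTIFIED", "CERTIFIED", "DENIED", "DENIED"],
})
rates = denial_rate_by(toy, "EMPLOYER")
print(rates)
```

With the notebook's data this would presumably be called as `denial_rate_by(cd_delta_df, "LCA_CASE_EMPLOYER_NAME", min_count=250)`, assuming `STATUS` holds only CERTIFIED and DENIED values at that point.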


Well, there's one huge outlier: "Beta Soft Systems Inc" has a denied-to-certified ratio of about 0.20, roughly four times that of the next company. After looking them up, I found a lot of negative reviews, which makes me very suspicious of this company. Other than that, there isn't anything particularly interesting here. Perhaps grouping by state will show something?

In [4]:
d_grouped_by_state = denied_delta_df.groupby("STATE").size()
c_grouped_by_state = certified_delta_df.groupby("STATE").size()

state_df = pd.concat([c_grouped_by_state, d_grouped_by_state], axis=1)

state_df["Denied_prop"] = state_df[1]/state_df[0]

print state_df["Denied_prop"].dropna().sort_values(ascending=False)

STATE
PR    0.153374
WY    0.142857
GU    0.122222
HI    0.070850
MS    0.068441
VI    0.066667
NM    0.064935
LA    0.049834
NV    0.047750
MT    0.041667
AL    0.038613
DC    0.038473
FL    0.037617
ID    0.035556
NY    0.034625
ND    0.034091
OK    0.032602
WV    0.031746
NE    0.030060
AK    0.029851
SD    0.028708
UT    0.028137
KY    0.027778
CO    0.027518
VT    0.026936
KS    0.026740
MA    0.025701
MD    0.025632
SC    0.024103
OH    0.022504
CA    0.022419
VA    0.022205
TX    0.021772
MI    0.021329
MO    0.020059
AR    0.020010
PA    0.019983
OR    0.019776
NC    0.019659
MN    0.019319
IL    0.018739
AZ    0.017998
GA    0.017992
NJ    0.017514
TN    0.016699
WI    0.016657
CT    0.016462
WA    0.016457
IN    0.015867
ME    0.014634
IA    0.014373
NH    0.014276
RI    0.013276
DE    0.011584
Name: Denied_prop, dtype: float64


Well, Puerto Rico, Wyoming, and Guam are standouts (each above 0.12, while everywhere else is at about 0.07 or below), but otherwise there's nothing interesting here. Let's keep going and look at industries.
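
One caveat before moving on: a ratio built on a small number of filings can swing a lot, and the state table doesn't show how many applications sit behind each row. The sketch below uses made-up counts, not numbers from the LCA file; `wilson_interval` is a hypothetical helper that computes an approximate 95% Wilson score interval for a denied share (denied / total, not the denied / certified ratio used above).

```python
import math

def wilson_interval(denied, total, z=1.96):
    """Approximate 95% Wilson score interval for denied / total."""
    if total == 0:
        return (0.0, 1.0)
    p = denied / float(total)
    denom = 1.0 + z ** 2 / total
    centre = (p + z ** 2 / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z ** 2 / (4.0 * total ** 2)) / denom
    return (centre - half, centre + half)

# Two hypothetical states with the same observed share (12.5%) but very different volumes.
small_state = wilson_interval(2, 16)
large_state = wilson_interval(200, 1600)
print("Small state interval: %.3f to %.3f" % small_state)
print("Large state interval: %.3f to %.3f" % large_state)
```

If the standout states turn out to have only a handful of applications, a wide interval like the first one would suggest they may not be meaningfully different from the rest.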

In [5]:
grouped_by_ind = cd_delta_df.groupby(["LCA_CASE_NAICS_CODE"])
ind_sizes = grouped_by_ind.size()
valid_industries = ind_sizes[ind_sizes > 100].keys()  # Only industries with more than 100 applications.

# Build each mask from the frame being filtered so the boolean index lines up with its rows.
denied_ind_df = denied_delta_df[denied_delta_df["LCA_CASE_NAICS_CODE"].isin(valid_industries)]
certified_ind_df = certified_delta_df[certified_delta_df["LCA_CASE_NAICS_CODE"].isin(valid_industries)]
d_grouped_by_ind = denied_ind_df.groupby("LCA_CASE_NAICS_CODE").size()
c_grouped_by_ind = certified_ind_df.groupby("LCA_CASE_NAICS_CODE").size()

ind_df = pd.concat([c_grouped_by_ind, d_grouped_by_ind], axis=1)

ind_df["Denied_prop"] = ind_df[1]/ind_df[0]  # denied / certified

print ind_df["Denied_prop"].dropna().sort_values(ascending=False).head()
print ind_df["Denied_prop"].dropna().sort_values(ascending=False).tail()

LCA_CASE_NAICS_CODE
54111    0.208589
5411     0.200000
53131    0.156716
5412     0.146789
6111     0.145833
Name: Denied_prop, dtype: float64
LCA_CASE_NAICS_CODE
334400    0.004292
51121     0.004176
452112    0.003831
3336      0.002660
334290    0.002392
Name: Denied_prop, dtype: float64


Hmmm, "Offices of Lawyers" (54111) and "Legal Services" (5411) seem to have the worst luck getting certified, while manufacturing industries such as "Engine, Turbine, and Power Transmission Equipment Manufacturing" (3336) and "Other Communications Equipment Manufacturing" (334290) are very safe. Perhaps there are specific jobs related to these industries that can tell me more.
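
A side note on the codes themselves: NAICS codes are hierarchical, so a longer code is a subcategory of its shorter prefix (54111 sits inside 5411). The table above mixes 4-, 5-, and 6-digit codes, which means some rows overlap. A minimal sketch of rolling everything up to a common 4-digit level follows; `naics_prefix` is a hypothetical helper, and it assumes the codes are stored as integers, as the printed output suggests.

```python
def naics_prefix(code, digits=4):
    """Truncate a NAICS code to its first `digits` digits (hypothetical helper)."""
    return str(int(code))[:digits]

# The ten codes printed in the output above.
codes = [54111, 5411, 53131, 5412, 6111, 334400, 51121, 452112, 3336, 334290]

groups = {}
for code in codes:
    groups.setdefault(naics_prefix(code), []).append(code)

for prefix in sorted(groups):
    print("%s: %s" % (prefix, groups[prefix]))
```

In pandas, the same idea would be a new column along the lines of `cd_delta_df["LCA_CASE_NAICS_CODE"].astype(str).str[:4]`, again assuming integer codes, which could then be used as the groupby key.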

In [6]:
grouped_by_job = cd_delta_df.groupby(["LCA_CASE_SOC_NAME"])
job_sizes = grouped_by_job.size()
valid_jobs = job_sizes[job_sizes > 100].keys()  # Only occupations with more than 100 applications.

# Build each mask from the frame being filtered so the boolean index lines up with its rows.
denied_job_df = denied_delta_df[denied_delta_df["LCA_CASE_SOC_NAME"].isin(valid_jobs)]
certified_job_df = certified_delta_df[certified_delta_df["LCA_CASE_SOC_NAME"].isin(valid_jobs)]
d_grouped_by_job = denied_job_df.groupby("LCA_CASE_SOC_NAME").size()
c_grouped_by_job = certified_job_df.groupby("LCA_CASE_SOC_NAME").size()

job_df = pd.concat([c_grouped_by_job, d_grouped_by_job], axis=1)

job_df["Denied_prop"] = job_df[1]/job_df[0]  # denied / certified

print job_df["Denied_prop"].dropna().sort_values(ascending=False).head()
print job_df["Denied_prop"].dropna().sort_values(ascending=False).tail()

Media and Communication Workers, All Other    0.291391
Film and Video Editors                        0.273810
Chefs and Head Cooks                          0.231579
Designers, All Other                          0.206667
Administrative Services Managers              0.183206
Name: Denied_prop, dtype: float64
Marine Engineers and Naval Architects    0.012712
Computer Systems Analysts                0.011984
Computer Hardware Engineers              0.011331
Computer Occupations, All Other          0.010892
Postsecondary Teachers, All Other        0.006173
Name: Denied_prop, dtype: float64


Those in media, film editing, cooking, design, and administration are most likely to be rejected, while applicants in tech and engineering fields, along with postsecondary teachers (read: university employees), don't have much of a problem. I'm not surprised that the engineering and university workers don't have much trouble, though the jobs that are least likely to be certified don't match the industry with the most trouble (law).

Let's look at part time vs. full time next.

In [7]:
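# Note: these are the job-filtered frames from the previous cell, so only occupations
# with more than 100 applications are counted here.
d_grouped_by_time = denied_job_df.groupby("FULL_TIME_POS").size()
c_grouped_by_time = certified_job_df.groupby("FULL_TIME_POS").size()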

time_df = pd.concat([c_grouped_by_time, d_grouped_by_time], axis=1)

time_df["Denied_prop"] = time_df[1]/time_df[0]

print time_df["Denied_prop"]

FULL_TIME_POS
N    0.049389
Y    0.021189
Name: Denied_prop, dtype: float64


Those who aren't working full time are more than twice as likely to be denied (about 0.049 versus 0.021). Let's keep digging and examine wages and units of pay.
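
As a quick check on the "more than twice" figure: the numbers printed above are denied / certified ratios, so it seems worth confirming that the conclusion also holds for the denied share, denied / (certified + denied). The sketch below only reuses the two values from the output; `ratio_to_share` is a hypothetical helper name.

```python
# Denied / certified ratios copied from the FULL_TIME_POS output above.
ratio_part_time = 0.049389
ratio_full_time = 0.021189

def ratio_to_share(ratio):
    """Convert denied / certified into denied / (certified + denied)."""
    return ratio / (1.0 + ratio)

relative_ratio = ratio_part_time / ratio_full_time
relative_share = ratio_to_share(ratio_part_time) / ratio_to_share(ratio_full_time)

print("Part time vs. full time, denied/certified: %.2f" % relative_ratio)
print("Part time vs. full time, denied share:     %.2f" % relative_share)
```

Either way the factor stays above two, so the choice of denominator doesn't change the conclusion here.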

In [8]:
c_year_df = certified_delta_df[certified_delta_df["PW_UNIT_1"] == "Year"]
d_year_df = denied_delta_df[denied_delta_df["PW_UNIT_1"] == "Year"]

c_hour_df = certified_delta_df[certified_delta_df["PW_UNIT_1"] == "Hour"]
d_hour_df = denied_delta_df[denied_delta_df["PW_UNIT_1"] == "Hour"]

print "Average Annual Certified Wage:", round(c_year_df["PW_1"].mean(), 2)
print "Average Annual Denied Wage:", round(d_year_df["PW_1"].mean(), 2)

print "Average Hourly Certified Wage:", round(c_hour_df["PW_1"].mean(), 2)
print "Average Hourly Denied Wage:", round(d_hour_df["PW_1"].mean(), 2)

print "Prop Salaried Workers Denied:", round(len(d_year_df)/float(len(d_year_df)+len(c_year_df)), 3)
print "Prop Hourly Workers Denied:", round(len(d_hour_df)/float(len(d_hour_df)+len(c_hour_df)), 3)

Average Annual Certified Wage: 70013.45
Average Annual Denied Wage: 67187.47
Average Hourly Certified Wage: 30.78
Average Hourly Denied Wage: 27.08
Prop Salaried Workers Denied: 0.02
Prop Hourly Workers Denied: 0.052


This makes sense. Not only do denied applicants make less on average (about 4% less for annual wages and 12% less for hourly wages), but those who work hourly are also much more likely to be denied (0.052 versus 0.02, roughly two and a half times as often). This goes hand in hand with the previous observation that applicants who aren't working full time are more likely to be denied. The prevailing pattern seems to be that working full time with an annual salary provides the best odds of being certified.