# LCA Exploratory Analysis 2 - Hiring: Who, where, and what?

I've looked at the wages and identified some patterns, but I'm also curious about which companies are sending the most applications and which states are receiving the most potential immigrants.

In [1]:
import pandas as pd

# read_csv accepts a path directly and handles opening and closing the file.
delta_df = pd.read_csv("Delta_LCA.csv", index_col=0)

In [2]:
grouped_by_company = delta_df.groupby("LCA_CASE_EMPLOYER_NAME")

# Count applications per employer, largest first.
company_applications = grouped_by_company.size().sort_values(ascending=False)

top_10_app_companies = company_applications.head(10)

top_10_app_company_keys = top_10_app_companies.keys()  # I'll be using this later.

print top_10_app_companies

LCA_CASE_EMPLOYER_NAME
INFOSYS LIMITED                      23759
TATA CONSULTANCY SERVICES LIMITED    14080
WIPRO LIMITED                         8358
DELOITTE CONSULTING LLP               6976
ACCENTURE LLP                         5502
IBM INDIA PRIVATE LIMITED             4987
HCL AMERICA INC                       4741
ERNST & YOUNG US LLP                  3954
MICROSOFT CORPORATION                 3650
IGATE TECHNOLOGIES INC                3124
dtype: int64


IT consultancies take the cake here by a huge margin. Infosys, Tata Consultancy Services, Wipro, HCL America, and IGATE Technologies all primarily provide IT consulting services. Rounding out the top 10 are more general consultancies such as Deloitte and Accenture, an auditing firm (Ernst & Young), and more traditional technology companies (IBM India and Microsoft). I'd like to know what portion of our applicant pool comes from these top 10 companies.

Another interesting fact is that five of the top ten companies are Indian firms or India-linked subsidiaries (Infosys, Tata, Wipro, IBM India, and HCL America). How much of our total applicant pool do these five companies account for? While we're at it, what about just Infosys?

In [3]:
top_10_applicants = delta_df[delta_df["LCA_CASE_EMPLOYER_NAME"].isin(top_10_app_company_keys)]

indian_company_list = ["INFOSYS LIMITED", "TATA CONSULTANCY SERVICES LIMITED", "WIPRO LIMITED", 
                       "IBM INDIA PRIVATE LIMITED", "HCL AMERICA INC"]

indian_applicants = delta_df[delta_df["LCA_CASE_EMPLOYER_NAME"].isin(indian_company_list)]
infosys_applicants = delta_df[delta_df["LCA_CASE_EMPLOYER_NAME"] == "INFOSYS LIMITED"]

print "Percent of Pool from Top 10 Companies: ", 100* len(top_10_applicants)/float(len(delta_df))
print "Percent of Pool from Top 5 Indian Companies: ", 100 * len(indian_applicants)/float(len(delta_df))
print "Percent of Pool from Infosys Limited: ", 100 * len(infosys_applicants)/float(len(delta_df))

Percent of Pool from Top 10 Companies:  15.8773887012
Percent of Pool from Top 5 Indian Companies:  11.2211770749
Percent of Pool from Infosys Limited:  4.767169354


That's really interesting. I'd expect a few companies to be responsible for a large share of applications, but I'm surprised at how dominant the IT industry and India are. Infosys alone accounts for nearly 5% of applicants.
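To see how concentrated the pool is beyond the top 10, one option is to look at the cumulative share of applications as more employers are included. This is only a minimal sketch on a toy frame: the employer names "A" through "D" and their counts are made up, and `toy_df` stands in for `delta_df`.

```python
import pandas as pd

# Toy stand-in for delta_df; employer names and row counts are invented.
toy_df = pd.DataFrame({
    "LCA_CASE_EMPLOYER_NAME": ["A"] * 5 + ["B"] * 3 + ["C"] * 1 + ["D"] * 1
})

# value_counts() returns counts per employer, sorted largest first.
counts = toy_df["LCA_CASE_EMPLOYER_NAME"].value_counts()

# Convert counts to shares of the whole pool, then accumulate them.
shares = counts / float(counts.sum())
cumulative_share = shares.cumsum()

# Share of the pool covered by the two largest employers.
top_2_share = cumulative_share.iloc[1]
print("Top 2 share: %.2f" % top_2_share)
```

On the real data, `cumulative_share.iloc[9]` would correspond to the top-10 percentage printed above, expressed as a fraction.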

As a quick aside, how do some of our Silicon Valley favorites stack up?

In [4]:
# Look up each company's count directly by its index label.
google_apps = company_applications["GOOGLE INC"]
apple_apps = company_applications["APPLE INC"]
facebook_apps = company_applications["FACEBOOK INC"]
linkedin_apps = company_applications["LINKEDIN CORPORATION"]

print "Google:", google_apps
print "Apple:", apple_apps
print "Facebook:", facebook_apps
print "LinkedIn:", linkedin_apps

Google: 3032
Apple: 1426
Facebook: 728
LinkedIn: 374


Google just barely misses being part of the top 10. These are some interesting numbers, given the size of these companies. Let's reevaluate after taking each company's headcount into account (I'll just source the numbers from Wikipedia).

In [5]:
google_size = 55419.
apple_size = 98000.
facebook_size = 10082.
linkedin_size = 6800.

print "Google Apps per Employee:", round(google_apps/google_size, 3)
print "Apple Apps per Employee:", round(apple_apps/apple_size, 3)
print "Facebook Apps per Employee:", round(facebook_apps/facebook_size, 3)
print "LinkedIn Apps per Employee:", round(linkedin_apps/linkedin_size, 3)

Google Apps per Employee: 0.055
Apple Apps per Employee: 0.015
Facebook Apps per Employee: 0.072
LinkedIn Apps per Employee: 0.055


It looks like Facebook files the most applications relative to its size, which suggests a greater tendency to hire foreign workers. Remember that they were also one of the best companies in terms of wage delta (i.e. they paid immigrants well in relation to domestic workers in the same state). Google and LinkedIn file at the same rate (about 0.055 applications per employee), but Apple sends out considerably fewer applications relative to the size of the company.
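The same per-employee comparison can be kept in one small table, which makes it easier to sort and extend. This is a sketch rather than part of the original analysis: the counts are copied from the In [4] output and the headcounts from In [5], and the names `tech` and `ranked` are my own.

```python
import pandas as pd

# Application counts from the In [4] output; headcounts are the
# Wikipedia figures used in In [5].
tech = pd.DataFrame(
    {
        "applications": [3032, 1426, 728, 374],
        "employees": [55419.0, 98000.0, 10082.0, 6800.0],
    },
    index=["Google", "Apple", "Facebook", "LinkedIn"],
)

# Applications filed per employee, rounded as in the cell above.
tech["apps_per_employee"] = (tech["applications"] / tech["employees"]).round(3)

# Highest rate first.
ranked = tech.sort_values("apps_per_employee", ascending=False)
print(ranked)
```

Google and LinkedIn tie after rounding, so their relative order in `ranked` is not meaningful.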

While we're at it, let's just take a quick look at our favorite schools.

In [6]:
print "Stanford:", company_applications[company_applications.keys() == 
                                           "THE BOARD OF TRUSTEES OF THE LELAND STANFORD JR"].values
print "UC Berkeley:", company_applications[company_applications.keys() == 
                                           "UNIVERSITY OF CALIFORNIA BERKELEY"].values

# Berkeley's Lawrence Lab accounts for another 150 or so applicants, 
# but I decided not to count it as part of the university proper.

Stanford: [279]
UC Berkeley: [128]
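Since one institution can appear under several employer names (as with Berkeley's Lawrence Lab), a substring search over the index is a quick way to list candidate variants before deciding what to count. A minimal sketch, assuming made-up employer strings and counts; the exact spellings in the LCA data may differ.

```python
import pandas as pd

# Hypothetical employer strings and counts, for illustration only.
toy_counts = pd.Series({
    "UNIVERSITY OF CALIFORNIA BERKELEY": 10,
    "LAWRENCE BERKELEY NATIONAL LABORATORY": 4,
    "STANFORD UNIVERSITY": 7,
})

# Boolean mask over the index: True where the name contains the substring.
# regex=False treats the pattern as plain text.
is_berkeley = toy_counts.index.str.contains("BERKELEY", regex=False)

# Every employer name that mentions Berkeley, with its count.
berkeley_variants = toy_counts[is_berkeley]
print(berkeley_variants)
```

Whether to merge the matches is still a judgment call; the search only surfaces the candidates.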


How about applications by state? Let's see where most of these applicants would be working (if their LCA is approved).

In [7]:
grouped_by_state = delta_df.groupby("STATE")

# Count applications per state, largest first.
state_applications = grouped_by_state.size().sort_values(ascending=False)

print state_applications

STATE
CA    95013
TX    50286
NY    46535
NJ    35656
IL    26560
MA    19105
PA    18570
GA    17247
WA    17077
FL    17017
VA    14907
NC    13158
MI    13117
OH    13044
MD     9109
MN     8672
CT     8495
AZ     7046
MO     6024
CO     5543
WI     5306
IN     5175
OR     4448
TN     4322
DC     3460
DE     3030
IA     2900
KS     2325
UT     2303
AR     2197
RI     2177
KY     2030
SC     1978
OK     1716
AL     1650
LA     1617
NH     1609
NE     1585
NV     1168
NM      848
ME      635
MS      581
HI      549
ID      477
ND      375
WV      339
VT      320
SD      219
AK      212
GU      206
PR      190
WY      108
MT      103
VI       49
dtype: int64


California certainly seems popular. Let's take a closer look.

In [8]:
# Index by label to get a scalar count rather than a one-element Series.
print "CA Proportion of Applicants:", round(state_applications["CA"]/float(len(delta_df)), 3)

CA Proportion of Applicants: 0.191


Although California accounts for 12% of the US population, it accounts for 19% of applicants, nearly a fifth. Perhaps the tech industry has something to do with this? Let's look at applications by industry next.
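One way to put a single number on that gap is the ratio of applicant share to population share, where 1.0 would mean a state receives applications in proportion to its population. The helper name `representation_ratio` is my own, and the inputs are just the two rounded figures quoted above.

```python
def representation_ratio(applicant_share, population_share):
    """Ratio of a state's share of applicants to its share of population."""
    return applicant_share / population_share


ca_applicant_share = 0.191   # from the In [8] output
ca_population_share = 0.12   # the rough figure quoted in the text

ca_ratio = representation_ratio(ca_applicant_share, ca_population_share)
print("CA representation ratio: %.2f" % ca_ratio)
```

With those inputs California comes out at roughly 1.6 times its population share.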

In [9]:
grouped_by_industry = delta_df.groupby("LCA_CASE_NAICS_CODE")

# Count applications per NAICS code, largest first.
industry_applications = grouped_by_industry.size().sort_values(ascending=False)

print industry_applications.head(10)

LCA_CASE_NAICS_CODE
541511    194028
541512     28120
611310     22583
541519     11825
541510     11140
541330     10456
5416        8309
54161       7442
523110      6189
54151       6125
dtype: int64


Oh yes it does. The top industry is "Custom Computer Programming Services". Many of the other top ten are computer-related as well. Let's group these into broader categories and take another look.

In [10]:
computer_industries = [541511, 541512, 541519, 541510, 54151]
eng_sci_industries = [541330, 5416]
universities = [611310]
consulting = [54161]
investment_banking = [523110]

total_applicants = float(len(delta_df))
computer_ind_apps = len(delta_df[delta_df["LCA_CASE_NAICS_CODE"].isin(computer_industries)])
eng_sci_ind_apps = len(delta_df[delta_df["LCA_CASE_NAICS_CODE"].isin(eng_sci_industries)])
university_apps = len(delta_df[delta_df["LCA_CASE_NAICS_CODE"].isin(universities)])
consulting_apps = len(delta_df[delta_df["LCA_CASE_NAICS_CODE"].isin(consulting)])
banking_apps = len(delta_df[delta_df["LCA_CASE_NAICS_CODE"].isin(investment_banking)])
top_10_ind_apps = industry_applications.head(10).sum()

print "Computer Apps:", computer_ind_apps, "- Proportion:", round(computer_ind_apps/total_applicants, 3)
print "University Apps:", university_apps, "- Proportion:", round(university_apps/total_applicants, 3)
print "Engineering Apps:", eng_sci_ind_apps, "- Proportion:", round(eng_sci_ind_apps/total_applicants, 3)
print "Consulting Apps:", consulting_apps, "- Proportion:", round(consulting_apps/total_applicants, 3)
print "Banking Apps:", banking_apps, "- Proportion:", round(banking_apps/total_applicants, 3)
print "Top 10 Industry Apps:", top_10_ind_apps, "- Proportion:", round(top_10_ind_apps/total_applicants, 3)

Computer Apps: 251238 - Proportion: 0.504
University Apps: 22583 - Proportion: 0.045
Engineering Apps: 18765 - Proportion: 0.038
Consulting Apps: 7442 - Proportion: 0.015
Banking Apps: 6189 - Proportion: 0.012
Top 10 Industry Apps: 306217 - Proportion: 0.614
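The grouping in In [10] could also be written as a single code-to-category mapping, so that every row gets a label and anything unmapped falls into "other". This is a sketch, not the method used above: the category labels mirror the lists in In [10], while `toy_df`, its rows, and the code 999999 are made up.

```python
import pandas as pd

# Category assignments mirror the lists in In [10].
naics_to_category = {
    541511: "computer", 541512: "computer", 541519: "computer",
    541510: "computer", 54151: "computer",
    541330: "engineering", 5416: "engineering",
    611310: "university",
    54161: "consulting",
    523110: "investment_banking",
}

# Toy stand-in for delta_df; 999999 is an invented code with no mapping.
toy_df = pd.DataFrame(
    {"LCA_CASE_NAICS_CODE": [541511, 541511, 541512, 611310, 54161, 999999]}
)

# Label each row; codes missing from the mapping become "other".
categories = toy_df["LCA_CASE_NAICS_CODE"].map(naics_to_category).fillna("other")

# Applications per category, largest first.
category_counts = categories.value_counts()
print(category_counts)
```

A mapping like this keeps the category definitions in one place, which helps if more codes are added later.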


Yeah, tech seems to have something to do with it. Let's take a look at applications by job type - I've got a pretty good guess as to what I'll see.

In [11]:
grouped_by_job = delta_df.groupby("LCA_CASE_SOC_NAME")

# Count applications per occupation, largest first.
job_applications = grouped_by_job.size().sort_values(ascending=False)

print job_applications.head(10)

LCA_CASE_SOC_NAME
Computer Systems Analysts                      92552
Software Developers, Applications              77815
Computer Programmers                           72063
Computer Occupations, All Other                38736
Software Developers, Systems Software          15705
Management Analysts                            11470
Accountants and Auditors                        9387
Financial Analysts                              8903
Network and Computer Systems Administrators     8121
Mechanical Engineers                            7265
dtype: int64


Exactly as I expected. For a future project, it would be really interesting to run this same analysis on data from previous years and measure the differences in each category.
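A rough sketch of what that year-over-year comparison might look like, assuming the yearly files were stacked into one frame with a year column. The `YEAR` column and every row below are invented; the real data would need the same cleaning as `delta_df` first.

```python
import pandas as pd

# Hypothetical multi-year data; the "YEAR" column and all rows are made up.
toy_df = pd.DataFrame({
    "YEAR": [2013, 2013, 2013, 2014, 2014, 2014, 2014],
    "STATE": ["CA", "CA", "TX", "CA", "TX", "TX", "NY"],
})

# Rows are states, columns are years, cells are application counts.
# Combinations that never occur are filled with 0.
by_year = pd.crosstab(toy_df["STATE"], toy_df["YEAR"])

# Change in application count per state between the two years.
change = by_year[2014] - by_year[2013]
print(change)
```

The same pattern would work for employer, industry, or job type by swapping the row variable.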